ncbi-datadownload

This is a Python script that downloads genome data from NCBI using the NCBI Datasets Python API.

git clone https://github.com/WEHI-ResearchComputing/ncbi-datadownload.git
cd ncbi-datadownload
module load anaconda3
conda init
conda create --name ncbi --file requirements.txt

Install the NCBI Datasets library and check that the environment is set up correctly

conda activate ncbi
pip3 install --user ncbi-datasets-pylib~=11.0
python -c 'import ncbi.datasets.openapi; print(ncbi.datasets.openapi.__version__)'

The result should be similar to

11.32.1
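If you would rather check the printed version in code, the comparison against the ~=11.0 pin used in the pip3 command can be sketched as below. This is only a minimal sketch: matches_pin is a hypothetical helper, not part of this repository or of the NCBI library, and it assumes a plain numeric version string such as the one shown above.

```python
def matches_pin(version: str, major: int = 11, min_minor: int = 0) -> bool:
    """Return True if `version` has the pinned major and a minor >= min_minor.

    Hypothetical helper mirroring the ~=11.0 pin; it does not handle
    pre-release or local version suffixes.
    """
    parts = version.strip().split(".")
    try:
        found_major, found_minor = int(parts[0]), int(parts[1])
    except (IndexError, ValueError):
        # Not a dotted numeric version, so treat it as not matching.
        return False
    return found_major == major and found_minor >= min_minor


if __name__ == "__main__":
    print(matches_pin("11.32.1"))
```

Any 11.x version passes this check, which is why the result only needs to be similar to the one shown.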

Configure Download/Output directories

Open config.json in any text editor, or through the Open OnDemand file menu

nano config.json
{"taxname": "Pseudomonas aeruginosa", 
"assembly_level": ["complete_genome"], 
"ret_content": "ASSM_ACC", 
"other_species": ["Pseudomonas putida", "Pseudomonas fluorescens", "Pseudomonas stutzeri", "Pseudomonas syringae", "Pseudomonas viridiflava", "Pseudomonas chlororaphis"], 
"download_dir": "/vast/scratch/users/iskander.j/download", 
"output_dir": "/vast/scratch/users/iskander.j/ncbi_output"}

Change the values of download_dir and output_dir to your own directories on vast or HPCScratch. You can also change the inclusion genome name (taxname) or add and remove items from the exclusion group (other_species).
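Before submitting a job, it can help to check the edited config.json for missing keys or typos. The sketch below is an illustration only: load_config and ensure_directories are hypothetical helpers that are not part of run.py, the list of required keys is taken from the sample file above, and whether the script itself needs the directories to exist beforehand is an assumption.

```python
import json
from pathlib import Path

# Keys that appear in the sample config.json shown above.
REQUIRED_KEYS = {
    "taxname", "assembly_level", "ret_content",
    "other_species", "download_dir", "output_dir",
}


def load_config(path="config.json"):
    """Load the config file and fail early if something looks wrong."""
    with open(path, encoding="utf-8") as handle:
        config = json.load(handle)
    missing = REQUIRED_KEYS - set(config)
    if missing:
        raise KeyError(f"config is missing keys: {sorted(missing)}")
    # The inclusion genome should not also be listed in the exclusion group.
    if config["taxname"] in config["other_species"]:
        raise ValueError("taxname must not also appear in other_species")
    return config


def ensure_directories(config):
    """Create download_dir and output_dir if they do not exist yet (assumed safe)."""
    for key in ("download_dir", "output_dir"):
        Path(config[key]).mkdir(parents=True, exist_ok=True)
```

Calling load_config() followed by ensure_directories(config) in the repository folder would catch a malformed file before the job spends time in the queue.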

Modify slurm job submission script

Open job.slurm and add your email address after --mail-user=

nano job.slurm
#!/bin/bash

#SBATCH --time=8:00:00
#SBATCH --job-name=ncbi_dl
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<your-email-address>
#SBATCH --output %x_%j.out
#SBATCH --cpus-per-task=10
#SBATCH --mem=500MB

source /stornext/System/data/apps/anaconda3/anaconda3-4.3.1/etc/profile.d/conda.sh
conda activate ncbi

python run.py

Running

sbatch job.slurm

Output Folder structure

  • Master
  • Nontarget
  • Pool
  • Results

Pool will contain all inclusion genomes. Nontarget will contain all exclusion group genomes.
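To see how many files ended up in each of these folders, a small count can be sketched as follows. This is a hedged illustration, not part of the repository: count_files is a hypothetical helper, and it assumes the four folders sit directly under the output_dir set in config.json.

```python
from pathlib import Path

# Folder names taken from the list above.
OUTPUT_SUBFOLDERS = ("Master", "Nontarget", "Pool", "Results")


def count_files(output_dir):
    """Return {subfolder: number of regular files}; None marks a missing subfolder."""
    counts = {}
    for name in OUTPUT_SUBFOLDERS:
        folder = Path(output_dir) / name
        if folder.is_dir():
            # Count files at any depth below the subfolder.
            counts[name] = sum(1 for p in folder.rglob("*") if p.is_file())
        else:
            counts[name] = None
    return counts
```

For example, count_files("/path/to/ncbi_output") returns a dictionary whose Pool and Nontarget entries give the number of inclusion and exclusion files currently on disk.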

squeue -u <userid> lists your jobs in the queue; R means running and PD means pending. The output of the running processes is redirected to a text file named ncbi_dl_<jobid>.out, created in the folder you submitted the job from.

When the job ends, you will receive an email. To confirm that all files have been downloaded, check the last line of ncbi_dl_<jobid>.out

Found 654,398 and Moved 654,398

The number of files found for the inclusion group (654) and the exclusion group (398) should equal the number of files moved for the inclusion group (654) and the exclusion group (398).
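The comparison described above can also be automated. The sketch below assumes the summary line has exactly the format of the sample ("Found A,B and Moved C,D"); last_line and check_summary are hypothetical helpers that are not part of this repository.

```python
import re

# Matches a line such as "Found 654,398 and Moved 654,398".
SUMMARY_PATTERN = re.compile(
    r"Found\s+(\d+),\s*(\d+)\s+and\s+Moved\s+(\d+),\s*(\d+)"
)


def last_line(path):
    """Return the last non-empty line of the job output file, or '' if none."""
    with open(path, encoding="utf-8") as handle:
        lines = [line.strip() for line in handle if line.strip()]
    return lines[-1] if lines else ""


def check_summary(line):
    """Return True when found and moved counts agree for both groups."""
    match = SUMMARY_PATTERN.search(line)
    if match is None:
        raise ValueError(f"no summary found in line: {line!r}")
    found_incl, found_excl, moved_incl, moved_excl = map(int, match.groups())
    return found_incl == moved_incl and found_excl == moved_excl
```

Calling check_summary(last_line(path)) on the job's .out file returns True when every file that was found was also moved, and raises an error when the job stopped before writing the summary.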
