This is a Python script for downloading data from NCBI using the NCBI Python APIs.
Setting up the environment in a vc7-shared/interactive Slurm session
git clone https://github.com/WEHI-ResearchComputing/ncbi-datadownload.git
cd ncbi-datadownload
module load anaconda3
conda init
conda create --name ncbi --file requirements.txt
conda activate ncbi
pip3 install --user ncbi-datasets-pylib~=11.0
python -c 'import ncbi.datasets.openapi; print(ncbi.datasets.openapi.__version__)'
The result should be similar to
11.32.1
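The version check above can also be done programmatically. The following is a minimal sketch, assuming that the `~=11.0` specifier used in the pip command only needs the major version to be 11; the helper name `is_compatible_release` is hypothetical and not part of the repository.

```python
def is_compatible_release(version, major=11):
    """Return True when `version` falls inside the 11.x series.

    Assumption: `~=11.0` means ">= 11.0 and < 12", so comparing the
    major component is enough for this check.
    """
    first = version.strip().split(".")[0]
    return first.isdigit() and int(first) == major
```

Inside the ncbi environment, the string to pass in would be `ncbi.datasets.openapi.__version__`, the same value printed by the command above.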
You can open config.json with any text editor, or through the Open OnDemand file menu.
nano config.json
{"taxname": "Pseudomonas aeruginosa",
"assembly_level": ["complete_genome"],
"ret_content": "ASSM_ACC",
"other_species": ["Pseudomonas putida", "Pseudomonas fluorescens", "Pseudomonas stutzeri", "Pseudomonas syringae", "Pseudomonas viridiflava", "Pseudomonas chlororaphis"],
"download_dir": "/vast/scratch/users/iskander.j/download",
"output_dir": "/vast/scratch/users/iskander.j/ncbi_output"}
Change the values of download_dir and output_dir to your own directories on vast or HPCScratch.
You can also change the inclusion genome name (taxname)
or add/remove items from the exclusion group (other_species).
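If you prefer not to edit the file by hand, the two path values can be set from Python. This is a minimal sketch: the function name `update_config` is hypothetical, the list of expected keys is copied from the example config above, and whether run.py actually requires every one of those keys is an assumption.

```python
import json
from pathlib import Path

# Keys copied from the example config.json; treating all of them as
# required is an assumption, not something the repository states.
EXPECTED_KEYS = {
    "taxname", "assembly_level", "ret_content",
    "other_species", "download_dir", "output_dir",
}


def update_config(config_path, download_dir, output_dir):
    """Set download_dir and output_dir, leaving the other keys untouched."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    missing = EXPECTED_KEYS - config.keys()
    if missing:
        raise KeyError(f"config is missing keys: {sorted(missing)}")
    config["download_dir"] = str(download_dir)
    config["output_dir"] = str(output_dir)
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

Call it as `update_config("config.json", <download path>, <output path>)` with your own directories on vast or HPCScratch.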
Open job.slurm and add your email address after --mail-user.
nano job.slurm
#!/bin/bash
#SBATCH --time=8:00:00
#SBATCH --job-name=ncbi_dl
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<your_email>
#SBATCH --output %x_%j.out
#SBATCH --cpus-per-task=10
#SBATCH --mem=500MB
source /stornext/System/data/apps/anaconda3/anaconda3-4.3.1/etc/profile.d/conda.sh
conda activate ncbi
python run.py
sbatch job.slurm
- Master
- Nontarget
- Pool
- Results
Pool will contain all inclusion genomes. Nontarget will contain all exclusion group genomes.
squeue -u <userid>
will show a list of your jobs in the queue. R means running and PD means pending.
A text file called ncbi_dl_<jobid>.out will be created in the folder the job was submitted from, and the output of the running processes will be redirected to it.
When the job ends, you will get an email. To check that all files have been downloaded, look at the last line of ncbi_dl_<jobid>.out:
Found 654,398 and Moved 654,398
The number of files found for the inclusion group (654) and the exclusion group (398) should equal the number of files moved for each group (654 and 398).
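The comparison can be automated with a short script. This is a rough sketch that assumes the last line always has the form shown in the example ("Found A,B and Moved C,D"); the function names `last_nonempty_line` and `counts_match` are hypothetical and not part of the repository.

```python
import re

# Pattern based on the single example line in this README; the real
# output format may differ, so treat this as an assumption.
SUMMARY = re.compile(
    r"Found\s+(\d+)\s*,\s*(\d+)\s+and\s+Moved\s+(\d+)\s*,\s*(\d+)"
)


def last_nonempty_line(path):
    """Return the last non-blank line of a text file, or '' if none."""
    with open(path) as handle:
        lines = [line.strip() for line in handle if line.strip()]
    return lines[-1] if lines else ""


def counts_match(line):
    """True when found and moved counts agree for both groups."""
    match = SUMMARY.search(line)
    if match is None:
        raise ValueError(f"unrecognised summary line: {line!r}")
    found_incl, found_excl, moved_incl, moved_excl = map(int, match.groups())
    return found_incl == moved_incl and found_excl == moved_excl
```

To use it, pass the result of `last_nonempty_line(...)` on your job's .out file to `counts_match`. A False result means some files were found but not moved.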