diff --git a/README.md b/README.md
index 2c9edc4..df2ef53 100644
--- a/README.md
+++ b/README.md
@@ -4,11 +4,16 @@
 
 [![DOI](https://zenodo.org/badge/792101561.svg)](https://zenodo.org/doi/10.5281/zenodo.11165725)
 
-tl;dr - download and sketch NCBI Assembly Datasets by accession
+tl;dr - download and sketch data directly
 
 ## About
 
-This plugin is an attempt to improve database generation by downloading assemblies, checking md5sum, and sketching to a sourmash zipfile. FASTA files can also be saved if desired. It's quite fast, but still very much at alpha level. Here be dragons.
+Commands:
+
+- `gbsketch` - download and sketch NCBI Assembly Datasets by accession
+- `urlsketch` - download and sketch directly from a url
+
+This plugin is an attempt to improve sourmash database generation by downloading files, checking md5sum if provided or accessible, and sketching to a sourmash zipfile. FASTA files can also be saved if desired. It's quite fast, but still very much at alpha level. Here be dragons.
 
 ## Installation
 
@@ -16,7 +21,8 @@ This plugin is an attempt to improve database generation by downloading assembli
 pip install sourmash_plugin_directsketch
 ```
 
-## Usage
+## `gbsketch`
+download and sketch NCBI Assembly Datasets by accession
 
 ### Create an input file
 
@@ -43,15 +49,13 @@ For reference:
 
 To run the test accession file at `tests/test-data/acc.csv`, run:
 ```
-sourmash scripts gbsketch tests/test-data/acc.csv -o test.zip -f out_fastas -k --failed test.failed.csv -p dna,k=21,k=31,scaled=1000,abund -p protein,k=10,scaled=100,abund -r 1
+sourmash scripts gbsketch tests/test-data/acc.csv -o test-gbsketch.zip -f out_fastas -k --failed test.failed.csv -p dna,k=21,k=31,scaled=1000,abund -p protein,k=10,scaled=100,abund -r 1
 ```
 
 Full Usage:
 
 ```
-usage: gbsketch [-h] [-q] [-d] -o OUTPUT [-f FASTAS] [-k] [--download-only] [--failed FAILED] [-p PARAM_STRING] [-c CORES]
-                [-r RETRY_TIMES] [-g | -m]
-                input_csv
+usage: gbsketch [-h] [-q] [-d] [-o OUTPUT] [-f FASTAS] [-k] [--download-only] [--failed FAILED] [-p PARAM_STRING] [-c CORES] [-r RETRY_TIMES] [-g | -m] input_csv
 
 download and sketch GenBank assembly datasets
 
@@ -66,7 +70,7 @@ options:
                         output zip file for the signatures
   -f FASTAS, --fastas FASTAS
                         Write fastas here
-  -k, --keep-fastas     write FASTA files in addition to sketching. Default: do not write FASTA files
+  -k, --keep-fasta      write FASTA files in addition to sketching. Default: do not write FASTA files
   --download-only       just download genomes; do not sketch
   --failed FAILED       csv of failed accessions and download links (should be mostly protein).
   -p PARAM_STRING, --param-string PARAM_STRING
@@ -76,7 +80,61 @@ options:
   -r RETRY_TIMES, --retry-times RETRY_TIMES
                         number of times to retry failed downloads
   -g, --genomes-only    just download and sketch genome (DNA) files
-  -m, --proteomes-only  just download and sketch proteome (protein) files
+```
+
+## `urlsketch`
+download and sketch directly from a url
+### Create an input file
+
+First, create a file, e.g. `acc-url.csv` with identifiers, sketch names, and other required info.
+```
+accession,name,moltype,md5sum,download_filename,url
+GCA_000961135.2,GCA_000961135.2 Candidatus Aramenus sulfurataquae isolate AZ1-454,dna,47b9fb20c51f0552b87db5d44d5d4566,GCA_000961135.2_genomic.urlsketch.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/961/135/GCA_000961135.2_ASM96113v2/GCA_000961135.2_ASM96113v2_genomic.fna.gz
+GCA_000961135.2,GCA_000961135.2 Candidatus Aramenus sulfurataquae isolate AZ1-454,protein,fb7920fb8f3cf5d6ab9b6b754a5976a4,GCA_000961135.2_protein.urlsketch.faa.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/961/135/GCA_000961135.2_ASM96113v2/GCA_000961135.2_ASM96113v2_protein.faa.gz
+GCA_000175535.1,GCA_000175535.1 Chlamydia muridarum MopnTet14 (agent of mouse pneumonitis) strain=MopnTet14,dna,a1a8f1c6dc56999c73fe298871c963d1,GCA_000175535.1_genomic.urlsketch.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/175/535/GCA_000175535.1_ASM17553v1/GCA_000175535.1_ASM17553v1_genomic.fna.gz
+```
+> Six columns must be present:
+> - `accession` - an accession or unique identifier. Ideally no spaces.
+> - `name` - full name for the sketch.
+> - `moltype` - is the file 'dna' or 'protein'?
+> - `md5sum` - expected md5sum (optional, will be checked after download if provided)
+> - `download_filename` - filename for FASTA download. Required if `--keep-fastas`, but useful for signatures, too (saved in sig data).
+> - `url` - direct link for the file
+
+### Run:
+
+To run the test accession file at `tests/test-data/acc-url.csv`, run:
+```
+sourmash scripts urlsketch tests/test-data/acc-url.csv -o test-urlsketch.zip -f out_fastas -k --failed test.failed.csv -p dna,k=21,k=31,scaled=1000,abund -p protein,k=10,scaled=100,abund -r 1
+```
+
+Full Usage:
+```
+usage: urlsketch [-h] [-q] [-d] [-o OUTPUT] [-f FASTAS] [-k] [--download-only] [--failed FAILED] [-p PARAM_STRING] [-c CORES] [-r RETRY_TIMES] input_csv
+
+download and sketch GenBank assembly datasets
+
+positional arguments:
+  input_csv             a txt file or csv file containing accessions in the first column
+
+options:
+  -h, --help            show this help message and exit
+  -q, --quiet           suppress non-error output
+  -d, --debug           provide debugging output
+  -o OUTPUT, --output OUTPUT
+                        output zip file for the signatures
+  -f FASTAS, --fastas FASTAS
+                        Write fastas here
+  -k, --keep-fasta, --keep-fastq
+                        write FASTA/Q files in addition to sketching. Default: do not write FASTA files
+  --download-only       just download genomes; do not sketch
+  --failed FAILED       csv of failed accessions and download links (should be mostly protein).
+  -p PARAM_STRING, --param-string PARAM_STRING
+                        parameter string for sketching (default: k=31,scaled=1000)
+  -c CORES, --cores CORES
+                        number of cores to use (default is all available)
+  -r RETRY_TIMES, --retry-times RETRY_TIMES
+                        number of times to retry failed downloads
 ```
 
 ## Code of Conduct