MRG: modify n simultaneous downloads; update buildutils (#154)
This PR integrates the changes to `BuildUtils` (`MultiSelection`
details, minor changes to `BuildCollection` filtering + writing) that
arose from integration into branchwater. It also makes the number of
simultaneous downloads tunable, since I was having trouble when using
the 3 default permits with large eukaryotic genomes.

It also handles the zipfile-handling changes from
sourmash-bio/sourmash#3431 that come with updating
sourmash core to 0.18.0.

ref #134
bluegenes authored Dec 27, 2024
1 parent a080c45 commit 38d99b6
Showing 10 changed files with 330 additions and 448 deletions.
43 changes: 21 additions & 22 deletions Cargo.lock


4 changes: 2 additions & 2 deletions Cargo.toml
@@ -9,10 +9,10 @@ name = "sourmash_plugin_directsketch"
crate-type = ["cdylib"]

[dependencies]
pyo3 = { version = "0.23.3", features = ["extension-module", "anyhow"] }
pyo3 = { version = "0.23.3", features = ["extension-module","anyhow"]}
rayon = "1.10.0"
serde = { version = "1.0.204", features = ["derive"] }
sourmash = { version = "0.17.2"}
sourmash = { version = "0.18.0"}
serde_json = "1.0.134"
niffler = "2.4.0"
needletail = "0.5.1"
19 changes: 16 additions & 3 deletions README.md
@@ -47,8 +47,15 @@ pip install sourmash_plugin_directsketch

## Usage Considerations

If you're building large databases (over 20k files), we highly recommend you use batched zipfiles (v0.4+) to facilitate restart. If you encounter unexpected failures and are using a single zipfile output (default), `gbsketch`/`urlsketch` will have to re-download and re-sketch all files. If you instead set a batch size using `--batch-size`, e.g. 10000, then `gbsketch`/`urlsketch` can load any batched zips that finished writing, and avoid re-generating those signatures. For `gbsketch`, the batch size represents the number of accessions included in each zip, with all signatures associated with an accession grouped within a single `zip`. For `urlsketch`, the batch size represents the number of total signatures included in each zip. Note that batches will use the `--output` file to build batched filenames, so if you provided `output.zip`, your batches will be `output.1.zip`, `output.2.zip`, etc.
### Allowing restart with batching

If you're building large databases, we highly recommend you use batched zipfiles (v0.4+) to facilitate restart. If you encounter unexpected failures and are using a single zipfile output (default), `gbsketch`/`urlsketch` will have to re-download and re-sketch all files. If you instead set a batch size using `--batch-size`, then `gbsketch`/`urlsketch` can load any batched zips that finished writing, and avoid re-generating those signatures. For `gbsketch`, the batch size represents the number of accessions included in each zip, with all signatures associated with an accession grouped within a single `zip`. For `urlsketch`, the batch size represents the number of urls included in each zip, with all signatures associated with a url grouped within a single `zip`. Note that batches will use the `--output` file to build batched filenames, so if you provided `output.zip`, your batches will be `output.1.zip`, `output.2.zip`, etc. For small genomes (e.g. microbes), you can keep batch sizes quite large, e.g. thousands to tens of thousands. For large eukaryotic genomes where download takes much longer, you may want to use smaller batch sizes.
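
As a rough illustration of the batch naming described above, the Python sketch below derives `output.1.zip`, `output.2.zip`, ... from an `--output` value. The helper `batched_name` is hypothetical and written only for this example; it is not the plugin's own code, which lives in the Rust crate.

```python
from pathlib import Path


def batched_name(output: str, batch_index: int) -> str:
    """Hypothetical helper: insert the batch index before the final suffix."""
    path = Path(output)
    # "output.zip" has stem "output" and suffix ".zip".
    return str(path.with_name(f"{path.stem}.{batch_index}{path.suffix}"))


# Names for the first three batches of an assumed `--output output.zip`.
names = [batched_name("output.zip", i) for i in range(1, 4)]
print(names)
```

Knowing the pattern makes it easy to check which batches finished writing before a restart.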

To build a single database after batched sketching, you can use `sig cat` to build a single zipfile (`sourmash sig cat *.zip -o OUTPUT.zip`) or `sig collect` to collect all the zips into a standalone manifest that can be used with sourmash and branchwater commands.

### Memory Requirements

Directsketch downloads the full file, optionally checking the `md5sum`, then performs the sketch. As a result, you will need enough memory to hold up to 3 genomes at once. For microbial genomes, this is trivial. For large eukaryotic genomes (e.g. plants!), be sure to provide sufficient memory. You can tune the number of simultaneous downloads (and thus, the number of genomes that will be in memory simultaneously) with `--n-simultaneous-downloads`.
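
To see why the number of simultaneous downloads bounds memory use, here is a minimal Python sketch of the general "permit" idea: a semaphore caps how many downloads are in flight at once. This is an illustration under assumptions, not the plugin's implementation (which is in Rust); `fetch_all`, the `acc0`-style accession names, and the sleep standing in for download plus sketch are all made up for the example.

```python
import asyncio


async def fetch_all(accessions, n_simultaneous=3):
    """Run placeholder downloads with at most n_simultaneous in flight."""
    permits = asyncio.Semaphore(n_simultaneous)
    in_flight = 0
    peak = 0

    async def fetch(accession):
        nonlocal in_flight, peak
        async with permits:  # wait here until a permit is free
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for download + sketch
            in_flight -= 1
        return accession

    done = await asyncio.gather(*(fetch(a) for a in accessions))
    return done, peak


# Ten hypothetical accessions, two permits: never more than two at once.
done, peak = asyncio.run(
    fetch_all([f"acc{i}" for i in range(10)], n_simultaneous=2)
)
print(len(done), peak)
```

With fewer permits, fewer genomes are held in memory at once, at the cost of less download overlap.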

## Running the commands

@@ -125,7 +132,8 @@ options:
-f FASTAS, --fastas FASTAS
Write fastas here
--batch-size BATCH_SIZE
Write smaller zipfiles, each containing sigs associated with this number of accessions. This allows gbsketch to recover after unexpected failures, rather than needing to
Write smaller zipfiles, each containing sigs associated with this number of accessions.
This allows gbsketch to recover after unexpected failures, rather than needing to
restart sketching from scratch. Default: write all sigs to single zipfile.
-k, --keep-fasta write FASTA files in addition to sketching. Default: do not write FASTA files
--download-only just download genomes; do not sketch
@@ -138,6 +146,8 @@ options:
number of cores to use (default is all available)
-r RETRY_TIMES, --retry-times RETRY_TIMES
number of times to retry failed downloads
-n {1,2,3}, --n-simultaneous-downloads {1,2,3}
number of accessions to download simultaneously (default=1)
-g, --genomes-only just download and sketch genome (DNA) files
-m, --proteomes-only just download and sketch proteome (protein) files
```
@@ -186,7 +196,8 @@ options:
-o OUTPUT, --output OUTPUT
output zip file for the signatures
--batch-size BATCH_SIZE
Write smaller zipfiles, each containing sigs associated with this number of accessions. This allows urlsketch to recover after unexpected failures, rather than needing to
Write smaller zipfiles, each containing sigs associated with this number of urls.
This allows urlsketch to recover after unexpected failures, rather than needing to
restart sketching from scratch. Default: write all sigs to single zipfile.
-f FASTAS, --fastas FASTAS
Write fastas here
@@ -202,6 +213,8 @@ options:
number of cores to use (default is all available)
-r RETRY_TIMES, --retry-times RETRY_TIMES
number of times to retry failed downloads
-n {1,2,3}, --n-simultaneous-downloads {1,2,3}
number of simultaneous downloads (default=3)
```

## Code of Conduct