MRG: improve restart by optionally writing batched zipfiles #102

Merged 48 commits on Oct 4, 2024
Changes from 42 commits
Commits
10078e4
init changes for tmpdir
bluegenes Jul 24, 2024
786619f
Merge branch 'main' into improve-restart
bluegenes Sep 12, 2024
5ffa71f
functions compiling, at least :)
bluegenes Sep 13, 2024
45fef54
better error propagation
bluegenes Sep 13, 2024
5420976
add tmpdir to urlsketch
bluegenes Sep 13, 2024
09677a0
init TemplateCollection + associated changes
bluegenes Sep 27, 2024
a5e79de
clean up a little
bluegenes Sep 27, 2024
578e9bb
more cleanup
bluegenes Sep 27, 2024
db0f065
Template --> Build
bluegenes Sep 27, 2024
3b7fa22
method to update record, sig after building
bluegenes Sep 27, 2024
622c38b
add build_sigs impls
bluegenes Sep 27, 2024
54256a0
add manifest new
bluegenes Sep 27, 2024
588fa8e
add singleton method
bluegenes Sep 27, 2024
50f9daa
use coll better
bluegenes Sep 27, 2024
bbdb9cb
async write functions in BuildCollection, BuildManifest
bluegenes Sep 29, 2024
c8d7dff
update internal_location when writing sigs
bluegenes Sep 29, 2024
e9e2811
for proteins, ksize needs to be k, not k*3, in manifest
bluegenes Sep 29, 2024
72b858b
aha! adjusted_ksize again. Use for sig, just not for manifest
bluegenes Sep 29, 2024
7684e30
clean up now unused build_siginfo
bluegenes Sep 29, 2024
97582b7
read existing + attempt to filter
bluegenes Sep 29, 2024
cdbc4bc
add batch reading, comparison logic
bluegenes Sep 29, 2024
3df61df
rm zip batching logic
bluegenes Sep 30, 2024
64fbc29
rm unused fn
bluegenes Sep 30, 2024
e1e9506
use MultiBuildCollection instead of allowing extend of BuildCollection
bluegenes Sep 30, 2024
b67ed58
merge in MultiBuildCollection changes
bluegenes Sep 30, 2024
0a4606c
rm unnecessary is_compat in BuildCollection
bluegenes Sep 30, 2024
2eafeb6
add back batching logic that was lost in merge to bring in MultiBuild…
bluegenes Sep 30, 2024
9777797
Merge branch 'main' into batched-zip
bluegenes Sep 30, 2024
70ed5f4
fix batching, add gbsketch batch test
bluegenes Sep 30, 2024
fef279d
test param hashing; turns out it was fine - test sigs were abund!
bluegenes Oct 1, 2024
367b38d
add batch logic to urlsketch
bluegenes Oct 1, 2024
7f9e418
create zips when needed to avoid initializing if no sigs to write
bluegenes Oct 1, 2024
dcb6b4b
clippy fixes
bluegenes Oct 1, 2024
71168f9
Merge branch 'main' into batched-zip
bluegenes Oct 1, 2024
e5db06e
no guarantee on sig order in zip batches; mod test to ignore order
bluegenes Oct 1, 2024
b257cb8
Merge branch 'main' into batched-zip
bluegenes Oct 1, 2024
980b573
add docs
bluegenes Oct 1, 2024
ca1405a
make sure batch_size is not negative
bluegenes Oct 1, 2024
820cc8c
more doc
bluegenes Oct 1, 2024
b4e1673
handle (+ overwrite) invalid/incomplete batched zipfiles
bluegenes Oct 1, 2024
baa8afb
clean up test
bluegenes Oct 1, 2024
7dd6638
Merge branch 'main' into batched-zip
bluegenes Oct 2, 2024
b583fb9
unify usage considerations
bluegenes Oct 4, 2024
9a7f965
rename Params --> BuildParams
bluegenes Oct 4, 2024
5a3814f
one more rename
bluegenes Oct 4, 2024
545fe9b
consistent test naming
bluegenes Oct 4, 2024
54b6b97
clarify batch size differences between gbsketch, urlsketch
bluegenes Oct 4, 2024
4298ff4
add version for zip batching
bluegenes Oct 4, 2024
39 changes: 35 additions & 4 deletions README.md
@@ -76,7 +76,7 @@ For reference:
To test `gbsketch`, you can download a csv file and run:
```
curl -JLO https://raw.githubusercontent.com/sourmash-bio/sourmash_plugin_directsketch/main/tests/test-data/acc.csv
-sourmash scripts gbsketch acc.csv -o test-gbsketch.zip -f out_fastas -k --failed test.failed.csv -p dna,k=21,k=31,scaled=1000,abund -p protein,k=10,scaled=100,abund -r 1
+sourmash scripts gbsketch acc.csv -o test-gbsketch.zip -f out_fastas -k --failed test.failed.csv --checksum-fail test.checksum-failed.csv -p dna,k=21,k=31,scaled=1000,abund -p protein,k=10,scaled=100,abund -r 1
```
To check that the `zip` was created properly, you can run:
```
@@ -99,10 +99,21 @@ summary of sketches:
1 sketches with protein, k=10, scaled=100, abund 5108 total hashes
```

+### Usage Considerations
+
+If you're building large databases (over 20k files), we highly recommend you use batched zipfiles to facilitate restart.
+If you encounter unexpected failures and are using a single zipfile output (default), `gbsketch` will have to re-download and
+re-sketch all files. If you instead set a number of accessions using `--batch-size`, e.g. 10000, then `gbsketch` can load any
+batched zips that finished writing, and avoid re-generating those signatures. Note that batches will use the `--output` file
+to build batched filenames, so if you provided `output.zip`, your batches will be `output.1.zip`, `output.2.zip`, etc.
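The naming rule described in this README text can be illustrated with a minimal Python sketch. The helper names `batch_filename` and `batch_filenames` are hypothetical and not part of the plugin (which is implemented in Rust); the sketch only assumes what the text states, namely that the batch index is inserted before the final extension of the `--output` name and that numbering starts at 1.

```python
from pathlib import Path


def batch_filename(output: str, index: int) -> str:
    """Insert the batch index before the final extension.

    Hypothetical helper: output.zip -> output.1.zip, output.2.zip, ...
    """
    path = Path(output)
    return str(path.with_name(f"{path.stem}.{index}{path.suffix}"))


def batch_filenames(output: str, count: int) -> list[str]:
    """Names of the first `count` batches, numbered from 1 (assumed)."""
    return [batch_filename(output, i) for i in range(1, count + 1)]


if __name__ == "__main__":
    for name in batch_filenames("output.zip", 3):
        print(name)
```

How many batch files a real run produces depends on `--batch-size` and on which accessions actually yield signatures, so the count here is an input rather than something the sketch tries to predict.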


Full Usage:

```
-usage: gbsketch [-h] [-q] [-d] [-o OUTPUT] [-f FASTAS] [-k] [--download-only] [--failed FAILED] [-p PARAM_STRING] [-c CORES] [-r RETRY_TIMES] [-g | -m] input_csv
+usage: gbsketch [-h] [-q] [-d] [-o OUTPUT] [-f FASTAS] [--batch-size BATCH_SIZE] [-k] [--download-only] --failed FAILED --checksum-fail CHECKSUM_FAIL [-p PARAM_STRING] [-c CORES]
+[-r RETRY_TIMES] [-g | -m]
+input_csv

download and sketch GenBank assembly datasets

@@ -117,9 +128,14 @@ options:
output zip file for the signatures
-f FASTAS, --fastas FASTAS
Write fastas here
+--batch-size BATCH_SIZE
+Write smaller zipfiles, each containing sigs associated with this number of accessions. This allows gbsketch to recover after unexpected failures, rather than needing to
+restart sketching from scratch. Default: write all sigs to single zipfile.
-k, --keep-fasta write FASTA files in addition to sketching. Default: do not write FASTA files
--download-only just download genomes; do not sketch
--failed FAILED csv of failed accessions and download links (should be mostly protein).
+--checksum-fail CHECKSUM_FAIL
+csv of accessions where the md5sum check failed or the md5sum file was improperly formatted or could not be downloaded
-p PARAM_STRING, --param-string PARAM_STRING
parameter string for sketching (default: k=31,scaled=1000)
-c CORES, --cores CORES
@@ -156,9 +172,19 @@ To run the test accession file at `tests/test-data/acc-url.csv`, run:
sourmash scripts urlsketch tests/test-data/acc-url.csv -o test-urlsketch.zip -f out_fastas -k --failed test.failed.csv -p dna,k=21,k=31,scaled=1000,abund -p protein,k=10,scaled=100,abund -r 1
```

+### Usage Considerations

+If you're building large databases (over 20k files), we highly recommend you use batched zipfiles to facilitate restart.
+If you encounter unexpected failures and are using a single zipfile output (default), `urlsketch` will have to re-download and
+re-sketch all files. If you instead set a number of accessions using `--batch-size`, e.g. 10000, then `urlsketch` can load any
+batched zips that finished writing, and avoid re-generating those signatures. Note that batches will use the `--output` file
+to build batched filenames, so if you provided `output.zip`, your batches will be `output.1.zip`, `output.2.zip`, etc.
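Before rerunning after a failure, it can be useful to see which batch zips on disk are intact. The following Python sketch is one way to check that from outside the plugin; it is not the plugin's own restart logic. The function name `complete_batches` is hypothetical, and the sketch assumes batch files sit next to the `--output` path and follow the `output.N.zip` pattern described in this README text.

```python
import re
import zipfile
from pathlib import Path


def complete_batches(output: str) -> list[Path]:
    """Return batch zips next to `output` that open cleanly and pass a CRC check.

    Hypothetical helper for inspecting a directory; results are ordered by batch index.
    """
    out = Path(output)
    # Match <stem>.<N><suffix>, e.g. output.1.zip for output.zip.
    pattern = re.compile(rf"^{re.escape(out.stem)}\.(\d+){re.escape(out.suffix)}$")
    found = []
    for candidate in out.parent.iterdir():
        match = pattern.match(candidate.name)
        if match is None:
            continue
        try:
            with zipfile.ZipFile(candidate) as zf:
                # testzip() returns the first corrupt member name, or None if all are fine.
                if zf.testzip() is None:
                    found.append((int(match.group(1)), candidate))
        except zipfile.BadZipFile:
            # Truncated or partially written archive: treat it as incomplete.
            continue
    return [path for _, path in sorted(found)]


if __name__ == "__main__":
    for path in complete_batches("output.zip"):
        print(path)
```

Sorting by the numeric index rather than by name keeps `output.10.zip` after `output.2.zip`. An archive that fails this check is only flagged here; what to do with it is left to the tool or the user.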

Full Usage:
```
-usage: urlsketch [-h] [-q] [-d] [-o OUTPUT] [-f FASTAS] [-k] [--download-only] [--failed FAILED] [-p PARAM_STRING] [-c CORES] [-r RETRY_TIMES] input_csv
+usage: urlsketch [-h] [-q] [-d] [-o OUTPUT] [--batch-size BATCH_SIZE] [-f FASTAS] [-k] [--download-only] --failed FAILED [--checksum-fail CHECKSUM_FAIL] [-p PARAM_STRING] [-c CORES]
+[-r RETRY_TIMES]
+input_csv

download and sketch GenBank assembly datasets

@@ -171,12 +197,17 @@ options:
-d, --debug provide debugging output
-o OUTPUT, --output OUTPUT
output zip file for the signatures
+--batch-size BATCH_SIZE
+Write smaller zipfiles, each containing sigs associated with this number of accessions. This allows urlsketch to recover after unexpected failures, rather than needing to
+restart sketching from scratch. Default: write all sigs to single zipfile.
-f FASTAS, --fastas FASTAS
Write fastas here
-k, --keep-fasta, --keep-fastq
write FASTA/Q files in addition to sketching. Default: do not write FASTA files
--download-only just download genomes; do not sketch
---failed FAILED csv of failed accessions and download links (should be mostly protein).
+--failed FAILED csv of failed accessions and download links.
+--checksum-fail CHECKSUM_FAIL
+csv of accessions where the md5sum check failed. If not provided, md5sum failures will be written to the download failures file (no additional md5sum information).
-p PARAM_STRING, --param-string PARAM_STRING
parameter string for sketching (default: k=31,scaled=1000)
-c CORES, --cores CORES