
how does resuming a batch output job work? #172

Open
ctb opened this issue Jan 12, 2025 · 3 comments

@ctb (Contributor) commented Jan 12, 2025:

something died with OOM and then resuming didn't work - it just restarted from scratch. Any tips or tricks?

@bluegenes (Collaborator) commented:

Were you using --batch-size? Can you give me the output from the resumed run?

In short, if you are using batches you get (unknown).n.zip zipfiles, where n is the batch number and (unknown).zip is the specified output. If we find any (unknown).n.zip files on a subsequent run, we read all that we can, ignoring incomplete batches, and continue forward with batch n+1.

If you are not using batches, we do not resume, because as far as I know Rust zip utilities can't append to existing zips, and incomplete zipfiles are not readable. With the current strategy, if we were to read (unknown).zip, we would count those sketches as 'done', but we would then overwrite (unknown).zip with the new sketches (meaning we would lose the old ones). An alternative would be to read that file and copy all the old sketches into memory before writing them out again, together with the new ones, into the same output.
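The batch-detection step described above can be sketched roughly as follows. This is a hypothetical illustration, not the plugin's actual Rust code: the function name, the directory-scan approach, and the omission of per-zip validity checking are all assumptions made for the example.

```python
import re
from pathlib import Path

def next_batch_index(output_zip: str, dirpath: str = ".") -> int:
    """Scan dirpath for existing batch zipfiles named like <stem>.<n>.zip
    and return the batch index to resume from (highest batch found + 1).

    Sketch only: a real implementation would also open each zip and skip
    incomplete/unreadable batches before counting them as done.
    """
    stem = Path(output_zip).stem  # "out" for "out.zip"
    pattern = re.compile(rf"^{re.escape(stem)}\.(\d+)\.zip$")
    found = []
    for p in Path(dirpath).iterdir():
        m = pattern.match(p.name)
        if m:
            found.append(int(m.group(1)))
    return max(found) + 1 if found else 1
```

So with `out.1.zip` and `out.2.zip` on disk, a resumed run would pick up at batch 3.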

Happy to modify if I'm missing something about rust zip writing or you have other strategy suggestions.

@ctb (Contributor, Author) commented Jan 13, 2025:

I was using batches, but resuming didn't pick them up. Maybe I got something wrong; I'll give it another try!

For the bigger databases, I'm also thinking of doing a manual split of the input CSV to get to a small chunk size and then using snakemake on that. Animal genomes are all really big!
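The manual split mentioned above could look something like this. It's a generic sketch, not part of any existing tool: the function name, chunk naming scheme, and default chunk size are all assumptions; each chunk repeats the header so it remains a valid standalone input CSV for a per-chunk snakemake job.

```python
import csv
from pathlib import Path

def split_csv(in_csv: str, out_dir: str, chunk_size: int = 100) -> list:
    """Split in_csv into files of at most chunk_size data rows each,
    repeating the header row in every chunk. Returns the chunk paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(in_csv, newline="") as fh:
        reader = csv.reader(fh)
        header = next(reader)
        rows = list(reader)
    paths = []
    for i in range(0, len(rows), chunk_size):
        path = out / f"chunk_{i // chunk_size}.csv"
        with open(path, "w", newline="") as ofh:
            writer = csv.writer(ofh)
            writer.writerow(header)
            writer.writerows(rows[i:i + chunk_size])
        paths.append(str(path))
    return paths
```

A snakemake workflow could then treat each `chunk_*.csv` as an independent input, so an OOM kill only loses one chunk's worth of work.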

@bluegenes (Collaborator) commented Jan 13, 2025:

I also think using the NCBI REST API links instead might help, especially since we could increase the number of simultaneous downloads by providing an API key. I'll make an issue for that.

It is much faster with simultaneous downloads, especially since genome sizes vary and the biggest ones take a lot of time.
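Bounded simultaneous downloads of the kind discussed above can be sketched with a worker pool. This is a hypothetical illustration, not the plugin's code: `fetch_one` is a placeholder for whatever performs a single genome download, and the worker bound is an assumption that would in practice be set from NCBI's rate limits (which, per the discussion, can be raised with an API key).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_all(accessions, fetch_one, max_workers=3):
    """Run fetch_one(accession) concurrently for every accession, with at
    most max_workers downloads in flight. Returns {accession: result},
    recording exceptions instead of raising so one failure doesn't
    abort the whole run (failed items can be retried on resume)."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_one, acc): acc for acc in accessions}
        for fut in as_completed(futures):
            acc = futures[fut]
            try:
                results[acc] = fut.result()
            except Exception as exc:
                results[acc] = exc
    return results
```

Because genome sizes vary widely, a pool like this keeps the small downloads flowing while the biggest ones are still in progress.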
