improvements for large genome downloads #134

Open
bluegenes opened this issue Nov 1, 2024 · 0 comments

@bluegenes
Collaborator

I'm getting unexpected failures when trying to download some large plant genomes. My guess is that these downloads take much longer, so I need to adjust parameters to avoid being blocked by NCBI.

To fix:

  • Make the number of semaphore permits a configurable parameter (currently 3, but maybe use 1 for large genomes).
  • Can we use an API key to get a higher download limit?
  • Double-check that we're not attempting protein downloads when we're neither keeping FASTA files nor generating protein signatures. Despite the tests and outputs, I do see some "protein" filenames in the failures file, so let's check again. Attempting these downloads might be why we're hitting NCBI download limits.
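
To illustrate the first item, here is a minimal sketch of a tunable permit count, written in Python with `asyncio` rather than in the plugin's own code. The names `download_all`, `make_fake_fetch`, and `n_permits` are hypothetical and only stand in for the real download logic; the fake fetch sleeps instead of contacting NCBI.

```python
import asyncio


async def download_all(accessions, fetch, n_permits=3):
    """Run fetch(accession) for every accession, at most n_permits at once."""
    # The semaphore caps how many downloads are in flight simultaneously.
    semaphore = asyncio.Semaphore(n_permits)

    async def guarded(accession):
        async with semaphore:
            return await fetch(accession)

    # gather() returns results in the same order as the input accessions.
    return await asyncio.gather(*(guarded(acc) for acc in accessions))


def make_fake_fetch(delay=0.01):
    """Hypothetical stand-in for a real download; records peak concurrency."""
    stats = {"active": 0, "peak": 0}

    async def fetch(accession):
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(delay)  # simulate a slow transfer
        stats["active"] -= 1
        return accession

    return fetch, stats


if __name__ == "__main__":
    fetch, stats = make_fake_fetch()
    # One permit serializes the downloads, as suggested for large genomes.
    results = asyncio.run(
        download_all(["acc1", "acc2", "acc3", "acc4"], fetch, n_permits=1)
    )
    print(results, "peak concurrency:", stats["peak"])
```

Exposing the permit count as a parameter keeps the default of 3 for typical runs while letting large-genome runs drop to 1 without code changes.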
bluegenes added a commit that referenced this issue Dec 27, 2024
This PR integrates the changes to `BuildUtils` (`MultiSelection`
details, minor changes to `BuildCollection` filtering + writing) that
arose from integration into branchwater. It also makes the number of
simultaneous downloads tunable, since I was having trouble when using
the 3 default permits with large eukaryotic genomes.

It also adapts to the zipfile handling changes from
sourmash-bio/sourmash#3431 that came with updating
sourmash core to 0.18.0.

ref #134