a lighter database? #95

GaioTransposon · 2022-02-20T13:25:32Z

Hi there and thank you for the tool,

Is there an option to download only part of the database?
The full archive (https://zenodo.org/record/5961398/files/db.tar.gz) is nearly 30 GB and takes about 12 hours to download (I am using bakta_db download --output . with bakta installed via conda).

What if one wants to use only one of the DBs (e.g. UniProtKB/Swiss-Prot: 2021_04)?

Kind Regards
Dany


oschwengers · 2022-02-21T08:56:48Z

Hi Dany,
thanks for reaching out. Yes, the DB size is an issue for some users. Because we decided on a taxonomically untargeted approach and database, it has become fairly large.

The two largest parts of the DB are the PSC Diamond db (UniRef90 cluster representative sequences) and the SQLite db storing the ~200 million IPS sequence hashes (UniRef100) along with all pre-compiled annotations. Therefore, keeping just one annotation DB and excluding the others wouldn't significantly reduce the DB size.

One option to reduce the database size (that I have already thought about) is to compile sub-databases for certain phyla. Of course, that would require developing, implementing and testing a couple of things, so it would take some time on a mid-term schedule. If this is of interest to more users, we'd happily address it.

Another option would be to host the database on more servers distributed around the globe, which might provide more bandwidth and better download times. Might that help in your case? Do you know of any free hosting services that would be eligible?

Best regards,
Oliver

oschwengers · 2023-02-16T09:31:56Z

Another idea (inspired by @tseemann) is to use a ranked set of broader protein clusters. This could be done by leaving the IPS and PSC out of the normal database and using only a size-filtered subset of the PSCC.

A quick check on UniProt/UniRef50 revealed 2,660,356 UniRef50 proteins. I'd estimate this would reduce the size of the entire database to roughly 3-4 GB.
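
To make the "ranked, size-filtered subset" idea concrete, here is a minimal sketch. It is not Bakta's actual implementation: the `Cluster` record, the `select_clusters` helper, the reading of "size" as the number of member sequences, and the toy values are all assumptions made for illustration.

```python
from typing import NamedTuple, Optional


class Cluster(NamedTuple):
    cluster_id: str    # hypothetical identifier, e.g. a UniRef50 accession
    member_count: int  # assumed meaning of "size": sequences in the cluster


def select_clusters(clusters, min_members: int, max_clusters: Optional[int] = None):
    """Keep clusters with at least `min_members` sequences, largest first."""
    kept = [c for c in clusters if c.member_count >= min_members]
    # Rank by size (descending); break ties by id so the order is stable.
    kept.sort(key=lambda c: (-c.member_count, c.cluster_id))
    return kept if max_clusters is None else kept[:max_clusters]


# Toy data, invented for the example only.
toy = [Cluster("A", 120), Cluster("B", 3), Cluster("C", 45), Cluster("D", 1)]
for cluster in select_clusters(toy, min_members=3, max_clusters=2):
    print(cluster.cluster_id, cluster.member_count)
```

Raising `min_members` or lowering `max_clusters` trades annotation coverage for a smaller database, which is the trade-off discussed above.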

oschwengers · 2023-02-24T15:22:32Z

Hi @GaioTransposon,
fyi: you might be interested in v1.7.0 which introduces a light database version as described in #196

This lightweight version is only 1.2 GB zipped and 3 GB unzipped.
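
For a rough sense of what this means for download time, the numbers from this thread can be plugged into a back-of-the-envelope calculation. This is only a sketch: it reads the sizes as decimal gigabytes and assumes the light archive would download at the same average rate as the ~30 GB in ~12 hours reported in the original post, which may not hold in practice.

```python
def throughput_mb_per_s(size_gb: float, hours: float) -> float:
    """Average rate in MB/s for downloading `size_gb` gigabytes in `hours`."""
    return size_gb * 1000 / (hours * 3600)


def download_minutes(size_gb: float, rate_mb_per_s: float) -> float:
    """Minutes needed to download `size_gb` gigabytes at `rate_mb_per_s`."""
    return size_gb * 1000 / rate_mb_per_s / 60


# Figures quoted in this thread: ~30 GB in ~12 h for the full database.
rate = throughput_mb_per_s(30, 12)
# Light database archive: 1.2 GB zipped.
light = download_minutes(1.2, rate)
print(f"observed rate: {rate:.2f} MB/s, light DB estimate: {light:.1f} min")
```

Under these assumptions the light archive would take about half an hour instead of half a day.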

jfy133 · 2023-03-02T11:15:41Z

EDIT: it was a faulty conda installation (I think scales was missing), it's working now :), and using the latest biocontainer build also now works :)

~~I just tried this with 1.7.0 but I get the following error (both via the bioconda conda install and the corresponding singularity biocontainer)~~

$ bakta_db download --type light
Bakta software version: 1.7.0
Required database schema version: 5

fetch DB versions...
	... compatible DB versions: 1
download database: v5.0, type=light, 2023-02-20, DOI: 10.5281/zenodo.7669534, URL: https://zenodo.org/record/7669534/files/db-light.tar.gz...
Traceback (most recent call last):
  File "/home/jfellows/.conda/envs/bakta/lib/python3.10/site-packages/alive_progress/core/configuration.py", line 91, in validator
    result = CONFIG_VARS[key](value)
KeyError: 'scale'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jfellows/.conda/envs/bakta/bin/bakta_db", line 10, in <module>
    sys.exit(main())
  File "/home/jfellows/.conda/envs/bakta/lib/python3.10/site-packages/bakta/db.py", line 203, in main
    download(db_url, tarball_path)
  File "/home/jfellows/.conda/envs/bakta/lib/python3.10/site-packages/bakta/db.py", line 119, in download
    with alive_bar(total=total_length, scale='SI') as bar:
  File "/home/jfellows/.conda/envs/bakta/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/home/jfellows/.conda/envs/bakta/lib/python3.10/site-packages/alive_progress/core/progress.py", line 95, in alive_bar
    config = config_handler(**options)
  File "/home/jfellows/.conda/envs/bakta/lib/python3.10/site-packages/alive_progress/core/configuration.py", line 82, in create_context
    local_config.update(_parse(theme, options))
  File "/home/jfellows/.conda/envs/bakta/lib/python3.10/site-packages/alive_progress/core/configuration.py", line 106, in _parse
    return {k: validator(k, v) for k, v in options.items()}
  File "/home/jfellows/.conda/envs/bakta/lib/python3.10/site-packages/alive_progress/core/configuration.py", line 106, in <dictcomp>
    return {k: validator(k, v) for k, v in options.items()}
  File "/home/jfellows/.conda/envs/bakta/lib/python3.10/site-packages/alive_progress/core/configuration.py", line 96, in validator
    raise ValueError('invalid config name: {}'.format(key))
ValueError: invalid config name: scale

~~Did I miss something in my command, for example?~~

~~Conda environment creation: conda create -n bakta -c bioconda bakta~~

oschwengers · 2023-03-02T13:06:53Z

Yes, the third-party dependencies needed an update. It should work now.
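
For anyone hitting the same traceback: the ValueError above is raised because the installed alive_progress does not recognise the `scale` option that bakta_db passes to `alive_bar`. A quick way to check the environment is sketched below. It assumes that `scale` arrived with alive-progress 3.0 (this is my recollection of that library's changelog, not something stated in this thread), and `supports_scale` and `check_installed` are hypothetical helper names.

```python
import re
from importlib import metadata
from typing import Optional


def supports_scale(version: str) -> bool:
    """True if `version` is at least 3.x (assumed first release with `scale`)."""
    match = re.match(r"\d+", version)
    if match is None:
        raise ValueError(f"cannot parse version: {version!r}")
    return int(match.group()) >= 3


def check_installed() -> Optional[bool]:
    """Check the alive-progress installed in the current environment.

    Returns None if the package is not installed at all.
    """
    try:
        installed = metadata.version("alive-progress")
    except metadata.PackageNotFoundError:
        return None
    return supports_scale(installed)


print("alive-progress supports scale:", check_installed())
```

If this prints False, updating the conda environment (or recreating it) should pull in a newer alive-progress.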

oschwengers added the enhancement and help wanted labels Feb 21, 2022

oschwengers pinned this issue Mar 10, 2022

oschwengers self-assigned this Feb 22, 2023

oschwengers added a commit that referenced this issue Feb 22, 2023

introduce light db type #95

a96e545

oschwengers added a commit that referenced this issue Feb 22, 2023

bump required db major to 5 #95

2d31868

oschwengers added this to the v1.7.0 milestone Feb 22, 2023

oschwengers mentioned this issue Feb 23, 2023

A lightweight Db version #196

Merged

oschwengers added a commit that referenced this issue Mar 10, 2023

add DB type info to outputs #95

586b7af

oschwengers closed this as completed Mar 10, 2023
