Improved Memory Efficiency for ppanggolin fasta Command #283

Merged 14 commits on Sep 27, 2024

@JeanMainguy JeanMainguy commented Sep 17, 2024

The ppanggolin fasta command could use less memory. Right now, it loads a lot of data to build the pangenome object, which makes filtering sequences easy but comes with a big memory and time cost, especially for large pangenomes with thousands of genomes.

This PR changes the command to read sequences directly from the HDF5 tables and write them on the fly, instead of building the full pangenome object first. Some intermediate tables are loaded only when needed.
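The read-and-write-on-the-fly idea can be sketched as below. This is a minimal illustration, not the code from this PR: the function names `read_in_chunks` and `write_selected_fasta` are hypothetical, and a plain Python iterable of `(gene_id, sequence)` rows stands in for the HDF5 table reader. Only one chunk of rows is held in memory at a time, and each selected sequence is written as soon as it is seen.

```python
import io
from itertools import islice


def read_in_chunks(rows, chunk_size=2):
    """Yield lists of at most chunk_size rows (stand-in for chunked table reads)."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def write_selected_fasta(rows, wanted_ids, out, chunk_size=2):
    """Write rows whose ID is in wanted_ids to out in FASTA format.

    Returns the number of sequences written. Nothing is accumulated:
    each matching record goes straight to the output handle.
    """
    written = 0
    for chunk in read_in_chunks(rows, chunk_size):
        for gene_id, sequence in chunk:
            if gene_id in wanted_ids:
                out.write(f">{gene_id}\n{sequence}\n")
                written += 1
    return written


# Hypothetical toy data standing in for a gene sequence table.
table = [("gene_1", "ATGAAA"), ("gene_2", "ATGCCC"), ("gene_3", "ATGGGG")]
buf = io.StringIO()
n_written = write_selected_fasta(iter(table), {"gene_1", "gene_3"}, buf)
print(buf.getvalue(), end="")
```

With this shape, peak memory depends on the chunk size and the set of wanted IDs rather than on the total number of sequences in the file.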

The arguments --genes, --proteins, --prot_families, and --gene_families have been optimized. Only --regions works the same as before, as it needs more work to optimize.

@JeanMainguy JeanMainguy marked this pull request as draft September 19, 2024 10:14

JeanMainguy commented Sep 20, 2024

Benchmark on an E. coli pangenome made of 5k genomes, resulting in a 2.4 GB HDF5 file.

| Command | Peak memory, improve_fasta_cmd (GB) | Peak memory, v2.1.2 (GB) | Time, improve_fasta_cmd (min) | Time, v2.1.2 (min) |
|---|---|---|---|---|
| --genes softcore | 5.2 | 37.4 | 9.0 | 26.2 |
| --genes core | 4.8 | 36.3 | 3.4 | 24.1 |
| --gene_families softcore | 4.8 | 36.3 | 3.3 | 24.1 |
| --genes all | 3.0 | 33.6 | 9.5 | 25.4 |
| --prot_families core | 4.8 | 32.9 | 1.4 | 20.6 |
| --genes persistent | 3.8 | 32.5 | 7.5 | 25.3 |
| --genes rgp | 2.4 | 32.1 | 4.2 | 25.1 |
| --genes shell | 2.6 | 31.9 | 5.1 | 24.4 |
| --genes cloud | 1.2 | 31.4 | 2.3 | 23.3 |
| --gene_families cloud | 1.1 | 31.3 | 2.0 | 23.1 |
| --gene_families all | 1.1 | 31.3 | 1.6 | 23.4 |
| --gene_families shell | 1.1 | 31.3 | 1.5 | 23.2 |
| --genes module_0 | 1.1 | 31.3 | 2.1 | 22.5 |
| --genes module_1 | 1.1 | 31.3 | 1.6 | 23.1 |
| --gene_families module_1 | 1.1 | 31.3 | 1.4 | 23.1 |
| --gene_families module_0 | 1.1 | 31.3 | 1.4 | 23.1 |
| --prot_families rgp | 1.2 | 28.4 | 0.4 | 21.5 |
| --prot_families persistent | 0.5 | 11.1 | 0.1 | 5.2 |
| --prot_families shell | 0.5 | 11.1 | 0.1 | 5.2 |
| --prot_families all | 0.5 | 11.1 | 0.1 | 5.3 |

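This PR significantly reduces memory usage and speeds up execution: across these runs, peak memory drops from 11.1-37.4 GB to 0.5-5.2 GB, and run time from 5.2-26.2 min to 0.1-9.5 min.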
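To put the benchmark in relative terms, the fold reduction for a few rows can be computed as below. The numbers are copied from the benchmark above; the helper name `fold_reduction` and the choice of rows are only for illustration.

```python
# (peak memory new, peak memory v2.1.2, time new, time v2.1.2) in GB and minutes,
# copied from three rows of the benchmark.
benchmark = {
    "--genes softcore": (5.2, 37.4, 9.0, 26.2),
    "--gene_families all": (1.1, 31.3, 1.6, 23.4),
    "--prot_families all": (0.5, 11.1, 0.1, 5.3),
}


def fold_reduction(new, old):
    """Return how many times smaller the new value is than the old one."""
    return old / new


summary = {
    cmd: (
        round(fold_reduction(mem_new, mem_old), 1),
        round(fold_reduction(time_new, time_old), 1),
    )
    for cmd, (mem_new, mem_old, time_new, time_old) in benchmark.items()
}

for cmd, (mem_fold, time_fold) in summary.items():
    print(f"{cmd}: {mem_fold}x less peak memory, {time_fold}x faster")
```

For these three rows, peak memory is roughly 7 to 28 times lower and run time roughly 3 to 53 times shorter than with v2.1.2.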

@JeanMainguy JeanMainguy marked this pull request as ready for review September 20, 2024 15:35
@axbazin axbazin merged commit 2923c56 into dev Sep 27, 2024
4 checks passed
@axbazin axbazin deleted the improve_fasta_cmd branch September 27, 2024 12:39