Add zstd support #116

Open
hjarnek opened this issue Oct 14, 2024 · 2 comments

hjarnek commented Oct 14, 2024

Hi,

It would be great to have support for zstd compression and decompression, especially for FASTQ files, which can get very big with modern sequencing technologies. zstd looks more and more like the natural successor to gzip. Probably (hopefully) the field of bioinformatics will move away from gzip in the near future, and zstd is an increasingly popular candidate. It is generally much faster, achieves a better compression ratio, supports multithreading natively, and comes in a well-maintained C library. Are there any plans to implement this?
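To make the request a bit more concrete, here is a minimal sketch (in Python, purely for illustration, not this project's code) of how a reader could tell gzip and zstd input apart by their leading magic bytes and decompress accordingly. The helper names `sniff_compression` and `open_fastq_text` are hypothetical, and the zstd branch assumes the third-party `zstandard` package is installed.

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"          # first two bytes of every gzip member
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"  # standard zstd frame magic (0xFD2FB528, little-endian)


def sniff_compression(head: bytes) -> str:
    """Return 'gzip', 'zstd', or 'plain' based on the leading bytes."""
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(ZSTD_MAGIC):
        return "zstd"
    return "plain"


def open_fastq_text(raw: bytes) -> io.TextIOWrapper:
    """Hypothetical helper: wrap in-memory FASTQ bytes in a text reader."""
    kind = sniff_compression(raw[:4])
    if kind == "gzip":
        stream = gzip.GzipFile(fileobj=io.BytesIO(raw))
    elif kind == "zstd":
        # Assumption: the third-party `zstandard` package is available.
        import zstandard
        stream = zstandard.ZstdDecompressor().stream_reader(io.BytesIO(raw))
    else:
        stream = io.BytesIO(raw)
    return io.TextIOWrapper(stream, encoding="ascii")
```

Something like `open_fastq_text(data).readline()` would then return the first header line whether `data` is plain, gzip-compressed, or zstd-compressed.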

Contributor

vjcitn commented Oct 14, 2024

I just tried it out with hg19.fa. I won't bother with statistics, but for compressing the large single sequence, zstd with default parameters seems very performant relative to gzip. I then checked whether it is part of the samtools/htslib stack and found samtools/htslib#1770, so the outlook there does not seem very favorable at the moment. It does pop up in a UKBB workflow: https://dnanexus.gitbook.io/uk-biobank-rap/science-corner/whole-exome-sequencing-oqfe-protocol/protocol-for-processing-ukb-whole-exome-sequencing-data-sets. @hjarnek, please supply some links with information on uptake in bioinformatics so that we can assess the priority of such a move.

Author

hjarnek commented Oct 14, 2024

I don't have any specific sources; it's just an observation that zstd is being used a lot in other contexts. Seeing as gzip is getting old compared to more modern compression algorithms, I thought zstd could be a good successor. Who knows what the field will eventually settle on. I'm a biologist, not a computer scientist, but I think it's clear that data compression is becoming increasingly valued as the amounts of data grow, in bioinformatics as elsewhere, so I find it logical that people will try to move away from gzip in the near future. There are of course other fast, high-compression algorithms besides zstd, and maybe another one is better suited. I see the discussion was going strong for a while in the GH issue related to the PR you linked, and according to a graph there, zstd seems to come out on top with bioinformatic data as well. But I'm not the right person to discuss technical details with.
