Add zstd support #116
Comments
I just tried it out with hg19.fa and won't bother with statistics. For compressing the large single sequence, zstd with default parameters seems to perform very well relative to gzip. I then checked whether it is part of the samtools/htslib stack and found samtools/htslib#1770, so adoption there does not look very favorable at the moment. It does appear in a UKBB workflow: https://dnanexus.gitbook.io/uk-biobank-rap/science-corner/whole-exome-sequencing-oqfe-protocol/protocol-for-processing-ukb-whole-exome-sequencing-data-sets. @hjarnek, please supply some links with information on uptake in bioinformatics so that we can assess the priority of such a move.
I don't have any specific sources; it's just an observation that zstd is being used a lot in other contexts, and since gzip is getting old compared to more modern compression algorithms, I thought zstd could be a good successor. Who knows what the field will eventually settle on. I'm a biologist, not a computer scientist, but I think it's clear that data compression is becoming increasingly valued as the amounts of data grow, in bioinformatics too, so I find it logical that people will try to move away from gzip in the near future. There are of course other fast, high-compression algorithms besides zstd, and maybe another one is better suited. I see the discussion was going strong for a while in the GH issue related to the PR you linked, and according to a pretty graph there, zstd seems to come out on top with bioinformatic data as well. But I'm not the right person to discuss technical details with.
Hi,
It would be great to have support for zstd compression and decompression, especially of FASTQ files, as they can get very big with modern sequencing technologies, and zstd looks more and more like the natural successor to gzip. Probably (hopefully) the field of bioinformatics will move away from gzip in the near future, and zstd is an increasingly popular candidate. It's much faster, has a better compression ratio, supports multithreading natively, and comes in a well-maintained C library. Are there any plans to implement this?