-
Notifications
You must be signed in to change notification settings - Fork 86
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CRAM as input format? #54
Comments
The current release doesn't support CRAM as we're still using the older samtools java library to read BAM files and that predates CRAM. I think it would be easy enough to add support for this if we updated to use the HTSJDK library which replaced the old samtools one. It's not a drop in replacement, but the changes are pretty minimal (I had to do this for seqmonk) and there is a speedup as a nice side effect. The other issue with CRAM is going to be supporting the required reference genomes. When we added support in seqmonk this wasn't an issue as we only needed the mapped positions, but in FastQC we need the sequences so we'll need the full reference genome. This would either then require a way to manually specify the genome - which could be different for multiple CRAM files in the same run, or we could auto-download them, but that could be a pretty unfriendly experience if you end up downloading several gigs of reference before your processing can start. I can look into this if there's interest, but some of the practicalities might turn out to be a bit ugly. If you have a nice (preferbly small, and from a small genome) example CRAM file I can use to test that would help get things moving. |
Awesome, thanks 😃 GitHub does not support attachment of .cram files. Here is a small one from EBI, approximately 80MB: https://www.ebi.ac.uk/ena/browser/view/ERS240640 The download didn't work when clicking on the name directly, but by selecting the file and pressing 'Download selected files' |
This uses htsjdk 2.23.0 but currently hardcodes the reference path. It fails for the test case because of versioning issues. Refers to #54
I had a go at implementing this and there is a basic implementation in the cram branch which should work for the test file which was specified. At the moment I've hardcoded the reference file path but it would be easy enough to add that as an option. Unfortuantely this fails on the test file with the following error:
So it looks like htsjdk don't support CRAM version 2.0, which is what that file is. Looking into their source at: https://github.com/samtools/htsjdk/blob/e803eea9b7edb56edb773f881a0b039399e061cc/src/main/java/htsjdk/samtools/cram/common/CramVersions.java it looks like they only support versions 2.1 and 3.0.
So although I can add the remainder of the infrastructure needed to support this, it may be that it won't work for a significant number of input files. I've no idea what the break down between different CRAM versions is in real world data. |
Got a reply from the htslib people:
So maybe the implementation is actually good enough for what we need. I'll try to find another test file, and then build a snapshot for people to test. If it works out I can put this into a new release. |
I was wondering whether
cram
is also an accepted input format. I can't get FastQc to run with it, but I was not sure whether the file is corrupted or whether FastQC just doesn't accept crams in general.The text was updated successfully, but these errors were encountered: