Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CRAM as input format? #54

Open
FriederikeHanssen opened this issue Aug 18, 2020 · 4 comments
Open

CRAM as input format? #54

FriederikeHanssen opened this issue Aug 18, 2020 · 4 comments
Assignees

Comments

@FriederikeHanssen
Copy link

I was wondering whether cram is also an accepted input format. I can't get FastQc to run with it, but I was not sure whether the file is corrupted or whether FastQC just doesn't accept crams in general.

@s-andrews
Copy link
Owner

The current release doesn't support CRAM as we're still using the older samtools java library to read BAM files and that predates CRAM.

I think it would be easy enough to add support for this if we updated to use the HTSJDK library which replaced the old samtools one. It's not a drop in replacement, but the changes are pretty minimal (I had to do this for seqmonk) and there is a speedup as a nice side effect.

The other issue with CRAM is going to be supporting the required reference genomes. When we added support in seqmonk this wasn't an issue as we only needed the mapped positions, but in FastQC we need the sequences so we'll need the full reference genome. This would either then require a way to manually specify the genome - which could be different for multiple CRAM files in the same run, or we could auto-download them, but that could be a pretty unfriendly experience if you end up downloading several gigs of reference before your processing can start.

I can look into this if there's interest, but some of the practicalities might turn out to be a bit ugly. If you have a nice (preferbly small, and from a small genome) example CRAM file I can use to test that would help get things moving.

@s-andrews s-andrews self-assigned this Aug 18, 2020
@FriederikeHanssen
Copy link
Author

FriederikeHanssen commented Aug 18, 2020

Awesome, thanks 😃

GitHub does not support attachment of .cram files. Here is a small one from EBI, approximately 80MB: https://www.ebi.ac.uk/ena/browser/view/ERS240640

The download didn't work when clicking on the name directly, but by selecting the file and pressing 'Download selected files'

s-andrews pushed a commit that referenced this issue Aug 18, 2020
This uses htsjdk 2.23.0 but currently hardcodes the reference path.

It fails for the test case because of versioning issues.

Refers to #54
@s-andrews
Copy link
Owner

I had a go at implementing this and there is a basic implementation in the cram branch which should work for the test file which was specified. At the moment I've hardcoded the reference file path but it would be easy enough to add that as an option.

Unfortuantely this fails on the test file with the following error:

Exception in thread "AWT-EventQueue-0" java.lang.RuntimeException: CRAM version 2.0 is not supported
	at htsjdk.samtools.cram.build.CramIO.readCramHeader(CramIO.java:132)
	at htsjdk.samtools.cram.build.CramContainerIterator.<init>(CramContainerIterator.java:25)
	at htsjdk.samtools.CRAMIterator.<init>(CRAMIterator.java:63)
	at htsjdk.samtools.CRAMFileReader.initWithStreams(CRAMFileReader.java:200)
	at htsjdk.samtools.CRAMFileReader.<init>(CRAMFileReader.java:151)
	at htsjdk.samtools.CRAMFileReader.<init>(CRAMFileReader.java:170)
	at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:415)
	at uk.ac.babraham.FastQC.Sequence.BAMFile.<init>(BAMFile.java:66)

So it looks like htsjdk don't support CRAM version 2.0, which is what that file is. Looking into their source at: https://github.com/samtools/htsjdk/blob/e803eea9b7edb56edb773f881a0b039399e061cc/src/main/java/htsjdk/samtools/cram/common/CramVersions.java it looks like they only support versions 2.1 and 3.0.

    final static Set<CRAMVersion> supportedCRAMVersions = new HashSet<CRAMVersion>() {{
        add(CRAM_v2_1);
        add(CRAM_v3);
    }};

So although I can add the remainder of the infrastructure needed to support this, it may be that it won't work for a significant number of input files. I've no idea what the break down between different CRAM versions is in real world data.

s-andrews pushed a commit that referenced this issue Aug 18, 2020
@s-andrews
Copy link
Owner

Got a reply from the htslib people:

Your inquiry is only the second time in several years that I can recall someone asking about 2.0 (though the other time was two weeks ago…).

I would think that the vast majority of cram in the wild is 3.0.

So maybe the implementation is actually good enough for what we need. I'll try to find another test file, and then build a snapshot for people to test. If it works out I can put this into a new release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants