Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

graphmap spliced alpha: no spliced alignments reported #76

Open
aechchiki opened this issue Aug 23, 2017 · 2 comments
Open

graphmap spliced alpha: no spliced alignments reported #76

aechchiki opened this issue Aug 23, 2017 · 2 comments

Comments

@aechchiki
Copy link

Hi Ivan,

we tested Graphmap to align RNA-seq reads to the reference genome. For benchmark purposes, we tried to align the reference transcripts in fasta format on the reference genome, both from Ensembl.

I tried to install and run Graphmap as described in the docs for spliced alignments. I compiled the rna-alpha branch, as suggested, and run this command: /home/aechchik/graphmap/bin/graphmap-not_release align -x rnaseq -t 16 -r reference-genome.fa -d reference-transcripts.fasta -o graphmap_ref.sam

The main problems is that the local alignments for the transcripts sub-features are not reconciled: if a transcript has two exons, then both are reported but their local alignments are not merged to a spliced alignment, reported separately as multiple alignments instead. For example - transcript FBtr0344900 has two annotated exons:

cat Drosophila_melanogaster.BDGP6.84.gtf | grep FBtr0344900 | grep exon
4    FlyBase    exon    42774    43157    .    +    .    gene_id "FBgn0266617"; transcript_id "FBtr0344900"; exon_number "1"; gene_name "CR45124"; gene_source "FlyBase"; gene_biotype "lincRNA"; transcript_name "CR45124-RA"; transcript_source "FlyBase"; transcript_biotype "lincRNA"; exon_id "FBtr0344900-E1";
4    FlyBase    exon    43231    43374    .    +    .    gene_id "FBgn0266617"; transcript_id "FBtr0344900"; exon_number "2"; gene_name "CR45124"; gene_source "FlyBase"; gene_biotype "lincRNA"; transcript_name "CR45124-RA"; transcript_source "FlyBase"; transcript_biotype "lincRNA"; exon_id "FBtr0344900-E2";

In the alignment file by graphmap, it figures twice as local alignment - once as mapping on the forward strand, once as secondary mapping on the forward strand ($2):

cat graphmap_ref.sam | grep FBtr0344900
FBtr0344900	0	4	42774	40	384M144S	*	0	0TGCGACATTGTTCTACGATGACTACAAAAAATGACCAATAACTTCTATAAACCAATACGATATGTCAGGAGTTTCGGTCCCATACGAAGTCGCCGACTTAAGTATTTTATTTTTATTTTGATATGTGTTTGCTATTTTACCTTGTCGAATGCTTCCACACGCTATGAGAATACCATCGTGAGCGTAGCTTACTACTAGAATTTTGTTGAAGTTATTGACAAGCGATGTCTCAATATCTTCCGGACAGCCTCCAGCGTGACATTGCGGGGAATCATGTAACGGCCCAGTAACAGCCTCGGCCAGCACTCGAAGGTTTTCGTTAAGTTTAAGTATTTTATTTGTAGCACCCGCAAACAAAACATTGTGCATAAAGTCGAAGCTCATCTGGAAGCTGTTGATTGAACTGGTATTGATGGCAAGTTAAACTGGGCGACTATGTCATTTAAGGGAGATAACGCCTGAGCCGGCAGTTCTTCAATGCAGTTAACGCAATAATGCTGAGAACCGAGTATGATAATAATACACAGT	*	MD:Z:384	NM:i:0	AS:i:1920	H0:i:1	ZE:f:4.54542e-127	ZF:f:0.299417	ZQ:i:528ZR:i:1348131
FBtr0344900	256	4	43231	40	384S143M1S	*	0	0TGCGACATTGTTCTACGATGACTACAAAAAATGACCAATAACTTCTATAAACCAATACGATATGTCAGGAGTTTCGGTCCCATACGAAGTCGCCGACTTAAGTATTTTATTTTTATTTTGATATGTGTTTGCTATTTTACCTTGTCGAATGCTTCCACACGCTATGAGAATACCATCGTGAGCGTAGCTTACTACTAGAATTTTGTTGAAGTTATTGACAAGCGATGTCTCAATATCTTCCGGACAGCCTCCAGCGTGACATTGCGGGGAATCATGTAACGGCCCAGTAACAGCCTCGGCCAGCACTCGAAGGTTTTCGTTAAGTTTAAGTATTTTATTTGTAGCACCCGCAAACAAAACATTGTGCATAAAGTCGAAGCTCATCTGGAAGCTGTTGATTGAACTGGTATTGATGGCAAGTTAAACTGGGCGACTATGTCATTTAAGGGAGATAACGCCTGAGCCGGCAGTTCTTCAATGCAGTTAACGCAATAATGCTGAGAACCGAGTATGATAATAATACACAGT	*	MD:Z:143	NM:i:0	AS:i:715	H0:i:1	ZE:f:3.39258e-42	ZF:f:0.299417	ZQ:i:528ZR:i:1348131

What we would rather see instead would be something like the output of STAR:

cat star_ref.sam | grep FBtr0344900
FBtr0344900    0    4    42774    255    384M73N144M    *    0    0TGCGACATTGTTCTACGATGACTACAAAAAATGACCAATAACTTCTATAAACCAATACGATATGTCAGGAGTTTCGGTCCCATACGAAGTCGCCGACTTAAGTATTTTATTTTTATTTTGATATGTGTTTGCTATTTTACCTTGTCGAATGCTTCCACACGCTATGAGAATACCATCGTGAGCGTAGCTTACTACTAGAATTTTGTTGAAGTTATTGACAAGCGATGTCTCAATATCTTCCGGACAGCCTCCAGCGTGACATTGCGGGGAATCATGTAACGGCCCAGTAACAGCCTCGGCCAGCACTCGAAGGTTTTCGTTAAGTTTAAGTATTTTATTTGTAGCACCCGCAAACAAAACATTGTGCATAAAGTCGAAGCTCATCTGGAAGCTGTTGATTGAACTGGTATTGATGGCAAGTTAAACTGGGCGACTATGTCATTTAAGGGAGATAACGCCTGAGCCGGCAGTTCTTCAATGCAGTTAACGCAATAATGCTGAGAACCGAGTATGATAATAATACACAGT    *    NH:i:1    HI:i:1    NM:i:0    MD:Z:528

Am I doing something wrong or simply this feature is not implemented yet?

Another strange behaviour as seen in the output alignment file is that some of the reads ID appear once in the alignment file, but there is no other alignment of the same read ID reporting the primary alignment flag. For example: cat graphmap_ref.sam | grep -v '^@' | awk '$2==256 {print $0}' | head -1 | cut -f1 outputs: FBtr0300689. If we grep this ID out of the mapping file, then we are returned one hit only:

cat graphmap_ref.sam | grep -v '^@' | grep FBtr0300689
FBtr0300689	256	2L	8191	40	586S1293M1S	*	0	0CTACTCGCATGTAGAGATTTCCACTTATGTTTTCTCTACTTTCAGCAACCGAGAAGAGAACCCACGTTTGAACAAGTATCGGCGTGTGGACAACAGCTATCCCCGCTTCATAACGAATGAGGCTGCCGAGGACCTGATTTACAAGAAGTCCATGGGCGAGCGGGATCAGCCACAGAGCTCAGAGCGGATCTCAATATTTAATCCGCCAGTATACACGCAGCACCAGGTGCGCAATGAAGCCCCCTACATACCCACCACATTTGACCTCCTCTCAGACGATGAGGAGTCGTCACAGAGAGTTGCCAACGCCGGGCCATCTTTCAGGCCCTTGACTTACTCGGATGCTGTGCGTCTAAGCCAGAATGGCTTCGCCAACTCCCGCGTAAGTGGGCACTCCAGCTATACGGTGCGCAGACCACCGGCACTAGTTGACAGAAGCATTCTATCCCAGGAAATGGAGCGCATGGACCAAGAGCAGTATATCTACCTTATCCGTACCGCAGCCCAAAGTAATTCCGTGGGCAGTCACTACGCCGAACCGGTTACTGATAACTCGGAGGTCAAGAAAGTCAGTGAAACCAACAAAAGTGACGCACCACAGCCGTTAACCCCTCAACCTACCAGACTCACCAGAACAGAATCCTTGCACCGTCGTTTTGCCAGCTGCGTCAACTTAAATGATGACTTCGCCGAGCAATTTAAAGCAAGAGCGGCGGACTGTGAAGAGAAATCCAAACATCGTCTTAGATTAGCTGAAGAGCAGAGGCTTTTTTCGAATTTCAGTGCTATAAAGAACATAGATGAACTCCGTGCCTATGAACGAAAAGTAGTGGAAAACATATTCCAGTCTTGTATCGCCCACAAGCCCATTTTTGTACTCGGGCCCTTGGACAAGCCAAATGTGAAGAAAGTGACCAAGCTCATTCCGTTAACAGAGGAGCACCACGATCGCTTTAACGAAATTACACAGGATGATAAATCGACGGTATGGCAACGAATATATTGATGTCTTTCGTACCCATTGAAAACGTTGTGGTGCTTGCGCTTTAAAATCTTATATTAGGAAATTATTTTTAAATTTAACCTACACATAACTACCGAAGACATATGCACGTTTATTAATGGGAAATGGCTTAACGACGAGGTCATTAACTTTTACATGTCCTTGCTGACAGAACGGTCGGAGAAGAGATCTGGCGTACTTCCCGCCACTTACGCCATAAACACATTCTTCGTGCCCCGCCTCCTGCAAGCTGGGCATGCAGGCATTAAGCGCTGGACTCGCAAAGTGGACTTGTTCAGCAAGGACATAATCCCGGTACCAGTGCACTGCAACGGCGTCCACTGGTGCATGGCCATCATACACTTGCGGAACAAGACAATCCGGTATTATGACTCAAAGGGAAAGCCAAACCGACCAGTGCTGGACGCTCTAGAGAAATATCTACGCGAAGAGTCAATATTCAAGCCCAAAAAGCAGTTTGATACCAGCGATTTTGTTATTGAGAGCGTGCAGAATATACCACGACAGTTAGATGGCAGCGATTGCGGTATCTTCAGCTGCATGTTCGCCGAGTATATAACGTGTGATGTGCCAATTACCTTTACCCAGTCGGAAATGTTGTACTTCCGCAAGAAGATGGCTCTAGAAATCGTCGACGGAGAGTTGTGACAGTAGAATCACACAGCTACGCAAGAATGTGGAGAATCCAGTTTAGTTATTTTTACAAATCTTACGTAAACACTCCAAGCATGAATTCGCAACAAGTGCTTAGCTATTTAATTGAATTGAGCTGGCCGAGAGATGTGCTGGTGCAATAACTTGTTCTCATATCTGATTGTAACAGAGAATCTAGTTTTTCAATAAAATTTCCCCAAGT	*	MD:Z:1293	NM:i:0	AS:i:6465	H0:i:1	ZE:f:0	ZF:f:0.276046	ZQ:i:1880	ZR:i:23513712

In my opinion, this read ID should have alignment flag equal to 0 instead.

I compiled the branch alpha on CentOS 6.9:

aechchik@frt:~/software/graphmap$ git branch
  master
* rna-alpha

And graphmap-not_release is the only executable in the execution folder:

aechchik@frt:~/software/graphmap/bin$ ls
graphmap-not_release

Cheers,
Amina

@isovic
Copy link
Owner

isovic commented Aug 24, 2017

Dear Amina,

Thank you for testing and reporting back!
You are right about exons being reported as separate SAM alignments. This was an intentional approach for the proof of concept. However, we are working on implementing a solution which would produce alignments similar to STAR, but it might take some time to finalize this.

Regarding the missing primary alignments - that should definitely not happen. Could you by any chance share a sample of your data which reproduces this issue, so I can take a look?

Best regards,
Ivan.

@aechchiki
Copy link
Author

aechchiki commented Aug 24, 2017

Hi Ivan,

I ran the following command: /home/aechchik/graphmap/bin/graphmap-not_release align -x rnaseq -t 16 -r reference-genome.fa -d reference-transcripts.fasta -o graphmap_ref.sam where:

  • reference-genome.fa is a concatenation of chromosomal genome files at: ftp://ftp.ensembl.org/pub/release-89/fasta/drosophila_melanogaster/dna/ : cat Drosophila_melanogaster.BDGP6.dna.chromosome.*.fa.gz
  • reference-transcripts.fasta is an extraction of the fasta sequences from the reference annotation at: ftp://ftp.ensembl.org/pub/release-89/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.89.gtf.gz using the gffread utility, using the command: gffread -w reference_transcripts.fasta -g reference_genome.fa reference_transcripts.gtf
    You should be able to reproduce exactly what I got.

(I edited the comment, for some reason the links to the Ensembl ftp were not displayed )

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants