Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mismatch between TElocal counts and BAM file reads #54

Open
ptranvan opened this issue Dec 9, 2024 · 13 comments
Open

Mismatch between TElocal counts and BAM file reads #54

ptranvan opened this issue Dec 9, 2024 · 13 comments
Labels

Comments

@ptranvan
Copy link

ptranvan commented Dec 9, 2024

Hi,

I used TElocal to generate counts, and I’m particularly interested in one specific feature. For this feature, I obtained a count of 24. However, when I went back to the BAM file to extract the reads that map to this feature using bedtools intersect, I found only 12 reads (8 uniques) that come from:

6 paired-end reads (representing 4 unique pairs).

How is it possible for the count to be 24 in this case?

Thanks for your help

@olivertam
Copy link
Member

Hi,

Thank you for your interest in the software.
Would you be able to the send an excerpt of the BAM file for that particular genomic region (and the corresponding TE annotation)?
For example, if your TE is located at chr1:1-1000:

$ samtools view [BAM] chr1:1-1000 > TE_overlapping_reads.sam

Thanks.

@olivertam olivertam added the invalid This doesn't seem right label Dec 9, 2024
@ptranvan
Copy link
Author

ptranvan commented Dec 9, 2024

Thanks @olivertam for your prompt answer, here is the file
TE_overlapping_reads.sam.zip

@olivertam
Copy link
Member

Hi,

Thank you for that file.
Could you also provide the TElocal command line, as well as the BAM headers?

$ samtools view -H [BAM] > BAM_header.txt

Thanks.

@talithaforcier
Copy link
Contributor

Hi! Which gene and TE annotation files were you using with this?
Thanks!

@ptranvan
Copy link
Author

ptranvan commented Dec 9, 2024

BAM_header.txt
TE_overlapping_reads.sam.zip

The TE_overlapping_reads.sam and BAM_header.txt are from the initial mapping, then I sorted by read name:

samtools view -bS bam1.bam | samtools sort -n - > bam2.bam

Then TElocal

TElocal -b bam2.bam --GTF Homo_sapiens-GCA_009914755.4-2022_07-genes_chr.gtf --TE T2T-CHM13v2_rmsk_TE.gtf.locInd --stranded reverse --mode multi

Gene annotation is from

https://rapid.ensembl.org/Homo_sapiens_GCA_009914755.4/Info/Index
with custom modifcation adding "chr" before scaffold number

and TE annotation:

https://labshare.cshl.edu/shares/mhammelllab/www-data/TEtranscripts/TE_GTF/T2T-CHM13v2_rmsk_TE.gtf.gz

The feature that has 24 counts is ALR/Alpha_dup1372

@olivertam
Copy link
Member

Thank you for providing the information.

We'll take a closer look.

Thanks.

@olivertam
Copy link
Member

Hi,

Sorry for bothering you again, but could you rerun the samtools command with the following change:

$ samtools view -P [BAM] chrY:10562600-10883787 > TE_overlapping_reads.sam

Thanks.

@ptranvan
Copy link
Author

ptranvan commented Dec 9, 2024

No problem, it looks a bit more logical now indeed.

TE_overlapping_reads2.sam.zip

should I still run with with --sortByPos ??

@olivertam
Copy link
Member

olivertam commented Dec 9, 2024

Hi,

I don't think you need to rerun with --sortByPos.

Based on the new SAM, it does appear that there are 25 distinct read pairs that contributes to the region (23 pairs that align uniquely, and 2 pairs with 2 alignments). The way that TElocal works on paired end reads is that it checks both ends of the fragment to determine if it overlaps strongly with the annotation. I suspect that out of all these, all but one of them have at least one end overlapping significantly with the ALR (the last one might have less than 50% overlap on one of the ends), and since there are no other gene or TE annotations that the fragment could be annotated to, the ALR annotation would be given to those.

I think that might give rise to the count of 24 for ALR/Alpha_dup1372.

The previous samtools command would ignore read pairs where one or more portion lie outside the region of interest. I am surprised that these reads did not come out from bedtools intersect, though that might be due to some parameters in that program.

Sorry for the inconvenience. Let us know if you have further questions.

Thanks.

@olivertam olivertam added question and removed invalid This doesn't seem right labels Dec 9, 2024
@ptranvan
Copy link
Author

Dear @olivertam

I come back to you with an other similar issue. This is my command:

samtools view -P x.bam chr3:95726522-95727149 > TE_overlapping_reads2.sam

It seems to have one paired end mapped.

However in my TElocal analysis, I don't see any count for that element (L1PA15_dup4168:L1PA15:L1:LINE)

TE_overlapping_reads2.txt

@olivertam
Copy link
Member

Hi,

This is a case where you would need to see where the other 5 alignments for the read is located (as indicated by the NH:i:5 tag in the alignment output. It is possible (highly likely) that the other alignment position for this read are also overlapping TE instances (e.g. another L1PA15), and based on the EM, this read was redistributed to another copy.

Please let me know if that does not address your question.

@ptranvan
Copy link
Author

Oh, I see. How do you decide which copy gets the count assignment?
I thought the counts were split between each mapped copy.

@olivertam
Copy link
Member

olivertam commented Dec 17, 2024

The counts are initially split between each mapped copy, but then through the EM loop, the distribution of the counts are recalculated. This is done iteratively until the overall difference between subsequent EM iterations falls below a threshold (i.e. converged). To answer your question, the copy that got this particular read assignment is most likely the copy with the highest count where one of the 5 alignment overlapped.

For more details on the methodology of the EM, section 3.3 of the paper might be helpful.

Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants