Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRACH sites at the beginning or end of sequences #119

Open
tjprins opened this issue Jun 13, 2023 · 1 comment
Open

DRACH sites at the beginning or end of sequences #119

tjprins opened this issue Jun 13, 2023 · 1 comment

Comments

@tjprins
Copy link

tjprins commented Jun 13, 2023

Hello,

In my data, I have two DRACH sites that are at the start and end of an RNA sequence (i.e., two bases from the start of the sequence and right before the poly A tail), which don't show up on the m6ANet output. I tried running this data through EpiNano for comparison and they pop up, however, EpiNano appears to be less accurate overall than m6ANet. If I look at these strands in IGV, there are reads that sequenced all the way to the beginning/end, so there are reads within these regions for m6ANet to pull data from.

On a similar note, the Workman et al. 2019 paper (see here: https://www.nature.com/articles/s41592-019-0617-2) studying RNA sequencing via nanopore notes that some sequences in their runs appear "truncated", which may be a limitation of Nanopore sequencing. Given this, I'm curious how m6ANet handles DRACH sites in regions that have less coverage than other sites. If a single read does not map all the way to a DRACH site near the end of the transcript, or there is a "deletion" in the middle of the transcript where the DRACH site lies, does m6ANet disregard those reads during its probability calculation (i.e., only using data from reads that mapped all the way through these sites)?

Thanks!

@chrishendra93
Copy link
Collaborator

chrishendra93 commented Jul 23, 2023

hi @tjprins, apology for the delay in my reply,

Currently, m6Anet excludes sites with less than 20 reads and that might be the reason as to why they don't show up on the m6Anet output. I have not really tested the accuracy of m6Anet for sites with less than 20 reads but technically, as long as they can be mapped, m6Anet can at the very least output single molecular probability for these reads although the current m6Anet version does not support this yet. If you are really interested to retrieve those reads, you can supply --min_segment_count argument during dataprep (the default is 20 but you can set it lower to 1 for example) so that data.info and data.json will contain those reads with lower coverage.

Afterward, during inference, you need to change some hard-coded values in m6anet/scripts/inference.py (DEFAULT_MIN_READS is set to 20, you need to change this value so that the dataloader will not exclude site will fewer than 20 reads), and also within run_inference function in m6anet/utils/inference_utils.py, you will need to change the hard-coded number in calculate_site_proba from 20 to something else. We will do some benchmarking in the near future and hopefully, we can provide recommendations on the number of reads and allow users to specify this in the future.

Regarding your question about deletion in the middle of the transcript, this will really depend on the ability of nanopolish eventalign to segment the transcript. Based on my understanding, nanopolish eventalign might still segment the "deleted" DRACH segments if it thinks that the deletion is purely due to basecalling error, but I am not too sure if this is accurate. Regardless, if such reads are segmented by nanopolish eventalign, then m6Anet will likely be able to process those reads

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants