DRACH sites at the beginning or end of sequences #119

tjprins · 2023-06-13T19:19:23Z

Hello,

In my data, I have two DRACH sites that are at the start and end of an RNA sequence (i.e., two bases from the start of the sequence and right before the poly A tail), which don't show up on the m6ANet output. I tried running this data through EpiNano for comparison and they pop up, however, EpiNano appears to be less accurate overall than m6ANet. If I look at these strands in IGV, there are reads that sequenced all the way to the beginning/end, so there are reads within these regions for m6ANet to pull data from.

On a similar note, the Workman et al. 2019 paper (see here: https://www.nature.com/articles/s41592-019-0617-2) studying RNA sequencing via nanopore notes that some sequences in their runs appear "truncated", which may be a limitation of Nanopore sequencing. Given this, I'm curious how m6ANet handles DRACH sites in regions that have less coverage than other sites. If a single read does not map all the way to a DRACH site near the end of the transcript, or there is a "deletion" in the middle of the transcript where the DRACH site lies, does m6ANet disregard those reads during its probability calculation (i.e., only using data from reads that mapped all the way through these sites)?

Thanks!

chrishendra93 · 2023-07-23T08:41:44Z

hi @tjprins, apology for the delay in my reply,

Currently, m6Anet excludes sites with less than 20 reads and that might be the reason as to why they don't show up on the m6Anet output. I have not really tested the accuracy of m6Anet for sites with less than 20 reads but technically, as long as they can be mapped, m6Anet can at the very least output single molecular probability for these reads although the current m6Anet version does not support this yet. If you are really interested to retrieve those reads, you can supply --min_segment_count argument during dataprep (the default is 20 but you can set it lower to 1 for example) so that data.info and data.json will contain those reads with lower coverage.

Afterward, during inference, you need to change some hard-coded values in m6anet/scripts/inference.py (DEFAULT_MIN_READS is set to 20, you need to change this value so that the dataloader will not exclude site will fewer than 20 reads), and also within run_inference function in m6anet/utils/inference_utils.py, you will need to change the hard-coded number in calculate_site_proba from 20 to something else. We will do some benchmarking in the near future and hopefully, we can provide recommendations on the number of reads and allow users to specify this in the future.

Regarding your question about deletion in the middle of the transcript, this will really depend on the ability of nanopolish eventalign to segment the transcript. Based on my understanding, nanopolish eventalign might still segment the "deleted" DRACH segments if it thinks that the deletion is purely due to basecalling error, but I am not too sure if this is accurate. Regardless, if such reads are segmented by nanopolish eventalign, then m6Anet will likely be able to process those reads

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DRACH sites at the beginning or end of sequences #119

DRACH sites at the beginning or end of sequences #119

tjprins commented Jun 13, 2023 •

edited

Loading

chrishendra93 commented Jul 23, 2023 •

edited

Loading

DRACH sites at the beginning or end of sequences #119

DRACH sites at the beginning or end of sequences #119

Comments

tjprins commented Jun 13, 2023 • edited Loading

chrishendra93 commented Jul 23, 2023 • edited Loading

tjprins commented Jun 13, 2023 •

edited

Loading

chrishendra93 commented Jul 23, 2023 •

edited

Loading