Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: canonical transcript mapped read extraction #77

Merged
merged 48 commits into from
Jan 30, 2024

Conversation

dlaehnemann
Copy link
Collaborator

The main aim of this PR is to get the extraction of reads mapped to the canonical transcripts in the rule get_mapped_canonical_transcripts down from about 44GB of memory usage regardless of the input BAM file size and hours of grepping, down to seconds / minutes of extraction time with almost no memory footprint. This happens here:
https://github.com/snakemake-workflows/rna-seq-kallisto-sleuth/compare/fix-canonical-transcript-mapped-read-extraction?expand=1#diff-6562a38fb77f8839a8731b8a882bf0d0b683d6268cdffb433dcbe1f360ccedc4R103

This required using a BED file, and thus I switched to generating a valid BED file in the rule get_canonical_ids and switched to using that BED file for keeping track of transcript strand information instead of hacking this into the contig names of the reference fasta file. This lead to some cleanup in the workflow.

Other things that happened along the way are:

  • removed an unnecessary r-dplyr dependency, as this is pulled in by r-tidyverse anyways
  • cleaned up file names to more clearly reflect what they contain
  • cleaned up variable / column names in python script to use only lowercase letters
  • removed some redundancy in wrapper calling by re-useing the samtools index rule -- thus, we only have to update the wrapper version in one place and all instances should always stay in sync

For now, this is not yet tested. So I'll mark this as a draft to start with. But I wanted it up here to be able to test it on different setups by checking out the branch.

@dlaehnemann dlaehnemann marked this pull request as draft August 17, 2023 11:55
@dlaehnemann
Copy link
Collaborator Author

Just to document what still needs to be done, here:

Currently, the rule get_canonical_transcripts skips most transcripts, as the poly-A tails have been removed and the lengths given in the BED input file thus don't match with the actual lengths of the FASTA entries any more. Also, we have to find a way to not have the coordinates of start and end of a transcript appended to the FASTA entry names, because this will otherwise break downstream transcript name matching.

Lähnemann and others added 18 commits August 28, 2023 17:20
… that we will need all of the transcripts for proper maximum 3-prime position filtering
… the samtools view rule in the future, once we have a proper pysam rule for the read filtering in one step
@dlaehnemann dlaehnemann marked this pull request as ready for review September 29, 2023 13:36
@johanneskoester
Copy link
Contributor

LGTM, but there are conflicts with the master branch that need to be fixed before merging.

@dlaehnemann
Copy link
Collaborator Author

Thanks for looking through this. Will merge and create a new release as soon as it passes with the conflicts resolution...

Addimator and others added 10 commits January 15, 2024 11:13
…esting data to get QuantSeq tests to pass (#86)

The underlying problem that we identified with the work on this
`debug-vroom` branch was a malformatted `custom` file for specifying
canonical transcript to use in `sleuth-diffexp.R`. The take-away message
here were:

1. The `sleuth-diffexp.R` script with its large `write_results()`
function was hard to debug, and to ease the burden a bit in the future,
we should probably keep the extra logging statements we included.
2. The `datavzrd/diffexp-template.yaml` does not seem to play nice with
`custom` canonical transcript files. The `canonical` column does not
make sense in the `genes_aggregated` results table (as this should only
contain gene names / identifiers, and no transcript identifiers, and
only the latter could be canonical or not), so we remove it there.
However, how to have a `canonical` column in the `genes_representative`
case with a `custom` canonical transcript file still needs to be solved
before merging this PR.
@johanneskoester johanneskoester merged commit 52b56b0 into main Jan 30, 2024
6 checks passed
@johanneskoester johanneskoester deleted the fix-canonical-transcript-mapped-read-extraction branch January 30, 2024 12:14
dlaehnemann added a commit that referenced this pull request Jan 30, 2024
🤖 I have created a release *beep* *boop*
---


##
[2.5.3](v2.5.2...v2.5.3)
(2024-01-30)


### Bug Fixes

* canonical transcript mapped read extraction
([#77](#77))
([52b56b0](52b56b0))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: David Laehnemann <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants