fix: canonical transcript mapped read extraction #77

dlaehnemann · 2023-08-17T11:54:52Z

The main aim of this PR is to speed up the extraction of reads mapped to the canonical transcripts in the rule get_mapped_canonical_transcripts. Previously, this took hours of grepping and about 44GB of memory, regardless of the input BAM file size. Now it takes seconds to minutes, with almost no memory footprint. This happens here:
https://github.com/snakemake-workflows/rna-seq-kallisto-sleuth/compare/fix-canonical-transcript-mapped-read-extraction?expand=1#diff-6562a38fb77f8839a8731b8a882bf0d0b683d6268cdffb433dcbe1f360ccedc4R103

This required using a BED file. I thus switched to generating a valid BED file in the rule get_canonical_ids, and to using that BED file for keeping track of transcript strand information instead of hacking this into the contig names of the reference FASTA file. This led to some cleanup in the workflow.
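To illustrate what keeping strand information in a BED file can look like, here is a minimal sketch. It is not the code from this PR: the function names, the example transcript ID, and the assumption that each transcript is its own contig (so that a record spans `0` to the transcript length) are all hypothetical. It also assumes that the strand arrives as an Ensembl-style integer (`1` or `-1`).

```python
def strand_symbol(ensembl_strand: int) -> str:
    """Translate an Ensembl-style integer strand (assumed input) to a BED strand."""
    if ensembl_strand == 1:
        return "+"
    if ensembl_strand == -1:
        return "-"
    raise ValueError(f"unexpected strand value: {ensembl_strand!r}")


def bed6_record(transcript_id: str, length: int, ensembl_strand: int) -> str:
    """Build one tab-separated BED6 line covering a whole transcript.

    BED6 columns: chrom, chromStart (0-based), chromEnd (exclusive),
    name, score, strand. Using the transcript ID as `chrom` is an
    assumption made for this illustration.
    """
    fields = [
        transcript_id,             # chrom: the transcript acts as the contig
        "0",                       # chromStart: 0-based start
        str(length),               # chromEnd: exclusive end, equals the length
        transcript_id,             # name
        "0",                       # score: unused placeholder
        strand_symbol(ensembl_strand),
    ]
    return "\t".join(fields)


# Hypothetical example transcript on the reverse strand
example = bed6_record("ENST00000000001", 1500, -1)
```

With the strand in its own column, downstream steps can look it up by transcript name instead of parsing it out of the contig names.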

Other things that happened along the way are:

* removed an unnecessary r-dplyr dependency, as this is pulled in by r-tidyverse anyway
* cleaned up file names to more clearly reflect what they contain
* cleaned up variable / column names in the Python script to use only lowercase letters
* removed some redundancy in wrapper calling by reusing the samtools index rule. Thus, we only have to update the wrapper version in one place and all instances should always stay in sync

For now, this is not yet tested, so I'll mark it as a draft to start with. But I wanted it up here so that it can be tested on different setups by checking out the branch.

dlaehnemann · 2023-08-17T15:35:25Z

Just to document what still needs to be done, here:

Currently, the rule get_canonical_transcripts skips most transcripts: the poly-A tails have been removed, so the lengths given in the BED input file no longer match the actual lengths of the FASTA entries. Also, we have to find a way to avoid appending the start and end coordinates of a transcript to the FASTA entry names, because this would otherwise break downstream transcript name matching.
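The two problems described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the script used in the workflow: the helper names are hypothetical, the poly-A tail is assumed to be a plain trailing run of `A`/`a`, and the FASTA headers are assumed to have the form `<name>::<contig>:<start>-<end>`.

```python
def strip_coordinates(header: str) -> str:
    """Drop an assumed '::<contig>:<start>-<end>' suffix from a FASTA entry name."""
    return header.split("::", 1)[0]


def trim_poly_a(sequence: str) -> str:
    """Remove a trailing run of A's (upper or lower case).

    Note that this also trims a sequence that legitimately ends in A,
    so it is a deliberately naive stand-in for real tail detection.
    """
    return sequence.rstrip("Aa")


def clean_record(header: str, sequence: str) -> tuple[str, str, int]:
    """Return the cleaned name, the trimmed sequence, and its new length.

    The new length is what a matching BED end coordinate would have to be
    after tail removal, assuming the record starts at 0.
    """
    trimmed = trim_poly_a(sequence)
    return strip_coordinates(header), trimmed, len(trimmed)


# Hypothetical entry: 12 bases, of which the last 4 are a poly-A tail
name, seq, new_length = clean_record("ENST01::ENST01:0-12", "ACGTACGTAAAA")
```

Recomputing the length after trimming is what keeps the BED coordinates and the FASTA entries consistent, and the cleaned name is what downstream transcript name matching would compare against.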

johanneskoester · 2023-10-20T14:59:08Z

LGTM, but there are conflicts with the master branch that need to be fixed before merging.

dlaehnemann · 2023-10-20T16:10:26Z

Thanks for looking through this. Will merge and create a new release as soon as it passes with the conflicts resolved.

…esting data to get QuantSeq tests to pass (#86)

The underlying problem that we identified with the work on this `debug-vroom` branch was a malformed `custom` file for specifying the canonical transcripts to use in `sleuth-diffexp.R`. The take-away messages here were:

1. The `sleuth-diffexp.R` script with its large `write_results()` function was hard to debug. To ease the burden a bit in the future, we should probably keep the extra logging statements we included.
2. The `datavzrd/diffexp-template.yaml` does not seem to play nicely with `custom` canonical transcript files. The `canonical` column does not make sense in the `genes_aggregated` results table (this should only contain gene names / identifiers and no transcript identifiers, and only the latter can be canonical or not), so we remove it there. However, how to have a `canonical` column in the `genes_representative` case with a `custom` canonical transcript file still needs to be solved before merging this PR.

🤖 I have created a release *beep* *boop*

---

## [2.5.3](v2.5.2...v2.5.3) (2024-01-30)

### Bug Fixes

* canonical transcript mapped read extraction ([#77](#77)) ([52b56b0](52b56b0))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: David Laehnemann <[email protected]>

dlaehnemann added 5 commits August 17, 2023 10:32

remove unnecessary dplyr dependency, as this is included in the tidyv…

2454f00

…erse meta-package

put strand info in bed file instead of parsing it into contig names

9f1c60b

do sorting of mapped bams in bwa_mem rule

1302eee

fix: consistently use BED file, rename for clearer file names, re-use…

c71f286

… the same samtools_index rule to avoid multiple wrapper version being loaded in the future

fix: use BED file in QC plot data generation, clean up variable names…

abe45d3

… in QC plot scripts

dlaehnemann marked this pull request as draft August 17, 2023 11:55

dlaehnemann added 5 commits August 17, 2023 15:21

fix forgotten filename change

ede168a

fix -fullHeader command line option to bedtools getfasta

3814915

fix fasta header writing after poly-A tail removal

ebd5fdf

remove unnecessary imports and dependencies

969a768

clean up scripts

060b095

Lähnemann and others added 18 commits August 28, 2023 17:20

snakefmt

679afda

put all log: files into logs/ folder (instead of some ending up i…

9ac85db

…n `results/logs/`

fix fasta header lines after poly-A tail removal

eed4959

more programmatic poly-A tail removal

745c804

use python script to do canonical transcript extraction from fasta file

8ea2fc9

create correct SeqRecord after poly-A tail removal

92ffa8b

remove unnecessary if statement (conditional input functions downstre…

121d7fc

…am should handle this)

fix strand .loc[] statement to match on integers instead of strings

34de1b2

use correct samtools view flag for extracting reads by read name

94c40da

add required numpy back into environment QC.yaml and load it in script

847273d

more readable rule ordering

fa1d27b

clean up transcript retrieval and mane_select filtering, anticipating…

d3ffcb0

… that we will need all of the transcripts for proper maximum 3-prime position filtering

move rule get_aligned_pos to the qc_3prime.smk where it belongs

2c0fa8e

change mane_select_transcripts from BED to TSV file, as we won't need…

37ce2b1

… the samtools view rule in the future, once we have a proper pysam rule for the read filtering in one step

check in fixes before removal of script

1e76329

update biomart environment to latest biomart and tidyverse 2.0

97c2437

move to getting transcript annotations only once, and clean up the re…

8fe2557

…spective script

change mane select fasta generation to work with TSV file containing …

8c0cffd

…all transcripts

dlaehnemann added 2 commits September 29, 2023 15:27

update pysam to 0.21 and also include pandas 2.0 in env

705b7fd

cleanly do all the 3prime read extraction in one pysam script

9901fd0

dlaehnemann marked this pull request as ready for review September 29, 2023 13:36

only recode canonical column if it exists

ae55ef2

johanneskoester approved these changes Oct 20, 2023

Merge branch 'main' of https://github.com/snakemake-workflows/rna-seq…

83241b9

…-kallisto-sleuth into fix-canonical-transcript-mapped-read-extraction

dlaehnemann added 6 commits October 20, 2023 18:24

snakefmt

c576b56

remove (and thus unpin) bioconductor-rhdf5 dependency, handled via r-…

57ca94e

…sleuth bioconda recipe implicitly

try including some error messaging when all mirrors fail

a1a58f9

try coercing error object to str with message() function

7c0fa58

dirty fix, until newest bioconductor release becomes available on bio…

5bf0c52

…conda

update sleuth.yaml to get to latest bioconductor-rhdf5lib (requires r…

00247ce

…-base=4.3)

dlaehnemann mentioned this pull request Dec 18, 2023

Libcrypto.so.1.1 not available? #85

Closed

Addimator and others added 10 commits January 15, 2024 11:13

fix: use canonical transcript as fallback for quantseq data

e54ecdc

fix: canonical recoding

9aedf76

Change datavzrd wrapper version

ebb4932

fix: update datavzrd to working verion

29f6fbb

feat: make most_three_prime_main extraction faster by using samtools …

7d06a7b

…+ awk instead of pysam

bring back needed fasta script and delete correct python script instead

30189b9

minor fixes

385b97c

fix: use annotation header in awk command extracting three_prime main…

ada6c20

… transcript reads

snakefmt

8cc48e9

johanneskoester approved these changes Jan 30, 2024

johanneskoester merged commit 52b56b0 into main Jan 30, 2024
6 checks passed

johanneskoester deleted the fix-canonical-transcript-mapped-read-extraction branch January 30, 2024 12:14

github-actions bot mentioned this pull request Jan 30, 2024

chore(main): release 2.5.3 #87

Merged
