Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: canonical transcript mapped read extraction #77

Merged
merged 48 commits into from
Jan 30, 2024
Merged
Show file tree
Hide file tree
Changes from 31 commits
Commits
Show all changes
48 commits
Select commit Hold shift + click to select a range
2454f00
remove unnecessary dplyr dependency, as this is included in the tidyv…
dlaehnemann Aug 17, 2023
9f1c60b
put strand info in bed file instead of parsing it into contig names
dlaehnemann Aug 17, 2023
1302eee
do sorting of mapped bams in bwa_mem rule
dlaehnemann Aug 17, 2023
c71f286
fix: consistently use BED file, rename for clearer file names, re-use…
dlaehnemann Aug 17, 2023
abe45d3
fix: use BED file in QC plot data generation, clean up variable names…
dlaehnemann Aug 17, 2023
ede168a
fix forgotten filename change
dlaehnemann Aug 17, 2023
3814915
fix -fullHeader command line option to bedtools getfasta
dlaehnemann Aug 17, 2023
ebd5fdf
fix fasta header writing after poly-A tail removal
dlaehnemann Aug 17, 2023
969a768
remove unnecessary imports and dependencies
dlaehnemann Aug 17, 2023
060b095
clean up scripts
dlaehnemann Aug 17, 2023
679afda
snakefmt
Aug 28, 2023
9ac85db
put all `log:` files into `logs/` folder (instead of some ending up i…
dlaehnemann Aug 29, 2023
eed4959
fix fasta header lines after poly-A tail removal
dlaehnemann Aug 29, 2023
745c804
more programmatic poly-A tail removal
dlaehnemann Aug 29, 2023
8ea2fc9
use python script to do canonical transcript extraction from fasta file
dlaehnemann Aug 29, 2023
92ffa8b
create correct SeqRecord after poly-A tail removal
dlaehnemann Aug 30, 2023
121d7fc
remove unnecessary if statement (conditional input functions downstre…
dlaehnemann Aug 30, 2023
34de1b2
fix strand .loc[] statement to match on integers instead of strings
dlaehnemann Sep 1, 2023
94c40da
use correct samtools view flag for extracting reads by read name
dlaehnemann Sep 1, 2023
847273d
add required numpy back into environment QC.yaml and load it in script
dlaehnemann Sep 1, 2023
fa1d27b
more readable rule ordering
dlaehnemann Sep 14, 2023
d3ffcb0
clean up transcript retrieval and mane_select filtering, anticipating…
dlaehnemann Sep 15, 2023
2c0fa8e
move rule get_aligned_pos to the qc_3prime.smk where it belongs
dlaehnemann Sep 16, 2023
37ce2b1
change mane_select_transcripts from BED to TSV file, as we won't need…
dlaehnemann Sep 16, 2023
1e76329
check in fixes before removal of script
dlaehnemann Sep 19, 2023
97c2437
update biomart environment to latest biomart and tidyverse 2.0
dlaehnemann Sep 19, 2023
8fe2557
move to getting transcript annotations only once, and clean up the re…
dlaehnemann Sep 19, 2023
8c0cffd
change mane select fasta generation to work with TSV file containing …
dlaehnemann Sep 29, 2023
705b7fd
update pysam to 0.21 and also include pandas 2.0 in env
dlaehnemann Sep 29, 2023
9901fd0
cleanly do all the 3prime read extraction in one pysam script
dlaehnemann Sep 29, 2023
ae55ef2
only recode canonical column if it exists
dlaehnemann Oct 19, 2023
83241b9
Merge branch 'main' of https://github.com/snakemake-workflows/rna-seq…
dlaehnemann Oct 20, 2023
c576b56
snakefmt
dlaehnemann Oct 20, 2023
57ca94e
remove (and thus unpin) bioconductor-rhdf5 dependency, handled via r-…
dlaehnemann Oct 23, 2023
a1a58f9
try including some error messaging when all mirrors fail
dlaehnemann Oct 30, 2023
7c0fa58
try coercing error object to str with message() function
dlaehnemann Oct 31, 2023
5bf0c52
dirty fix, until newest bioconductor release becomes available on bio…
dlaehnemann Oct 31, 2023
00247ce
update sleuth.yaml to get to latest bioconductor-rhdf5lib (requires r…
dlaehnemann Nov 1, 2023
e54ecdc
fix: use canonical transcript as fallback for quantseq data
Addimator Jan 15, 2024
9aedf76
fix: canonical recoding
Addimator Jan 15, 2024
ebb4932
Change datavzrd wrapper version
Addimator Jan 23, 2024
29f6fbb
fix: update datavzrd to working verion
Addimator Jan 24, 2024
17d272a
fix: add more logging statemens to `sleuth-diffexp.R`, add QuantSeq t…
dlaehnemann Jan 29, 2024
7d06a7b
feat: make most_three_prime_main extraction faster by using samtools …
dlaehnemann Jan 29, 2024
30189b9
bring back needed fasta script and delete correct python script instead
dlaehnemann Jan 29, 2024
385b97c
minor fixes
dlaehnemann Jan 29, 2024
ada6c20
fix: use annotation header in awk command extracting three_prime main…
dlaehnemann Jan 29, 2024
8cc48e9
snakefmt
dlaehnemann Jan 29, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 2 additions & 5 deletions workflow/envs/QC.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -5,9 +5,6 @@ channels:
dependencies:
- altair-transform =0.2.0
- altair =4.2.0
- pysam =0.19.1
- numpy =1.22.0
- altair_saver =0.5.0
- scipy =1.7.3
- matplotlib =3.5.2
- pandas =1
- pandas =1
- numpy =1.22.0
5 changes: 2 additions & 3 deletions workflow/envs/biomart.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,5 @@ channels:
- conda-forge
- bioconda
dependencies:
- bioconductor-biomart =2.46
- r-tidyverse =1.3
- r-dplyr =1.0.9
- bioconductor-biomart =2.56
- r-tidyverse =2.0
3 changes: 2 additions & 1 deletion workflow/envs/biopython.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -3,4 +3,5 @@ channels:
- bioconda
- nodefaults
dependencies:
- biopython =1.79
- biopython =1.79
- pandas >=1.3,<2
9 changes: 0 additions & 9 deletions workflow/envs/get_canonical_ids.yaml

This file was deleted.

3 changes: 2 additions & 1 deletion workflow/envs/pysam.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -3,4 +3,5 @@ channels:
- bioconda
- nodefaults
dependencies:
- pysam =0.19
- pysam =0.21
- pandas =2.0
2 changes: 1 addition & 1 deletion workflow/rules/common.smk
Original file line number Diff line number Diff line change
Expand Up @@ -169,7 +169,7 @@ enrichment_env = render_enrichment_env()

def kallisto_quant_input(wildcards):
if is_3prime_experiment:
return "results/canonical_reads/{sample}-{unit}.fastq"
return "results/mane_3prime_reads/{sample}-{unit}.fastq",
elif not is_single_end(wildcards.sample, wildcards.unit):
return expand(
"results/trimmed/{{sample}}-{{unit}}.{group}.fastq.gz", group=[1, 2]
Expand Down
2 changes: 1 addition & 1 deletion workflow/rules/diffexp.smk
Original file line number Diff line number Diff line change
Expand Up @@ -30,7 +30,7 @@ rule sleuth_init:
input:
kallisto=kallisto_output,
samples="results/sleuth/{model}.samples.tsv",
transcript_info="resources/transcript-info.rds",
transcript_info="resources/transcripts_annotation.results.rds",
output:
sleuth_object="results/sleuth/{model,[^.]+}.rds",
designmatrix="results/sleuth/{model}.designmatrix.rds",
Expand Down
22 changes: 17 additions & 5 deletions workflow/rules/qc_3prime.smk
Original file line number Diff line number Diff line change
@@ -1,8 +1,20 @@
rule get_aligned_pos:
input:
bam_file="results/kallisto_cdna/{sample}-{unit}",
output:
aligned_files=temp("results/QC/{sample}-{unit}.aligned.txt"),
log:
"logs/QC/{sample}-{unit}.aligned.log",
conda:
"../envs/samtools.yaml"
shell:
"samtools view {input.bam_file}/pseudoalignments.bam | cut -f1,3,4,10,11 > {output} 2> {log}"


rule get_selected_transcripts_aligned_read_bins:
input:
aligned_file="results/QC/{sample}-{unit}.aligned.txt",
samtools_sort="results/kallisto-bam-sorted/{sample}-{unit}-pseudoalignments.sorted.bam",
samtools_index="results/kallisto-bam-sorted/{sample}-{unit}-pseudoalignments.sorted.bam.bai",
transcripts_annotation="resources/transcripts_annotation.mane_strand_length.tsv",
read_length="results/stats/max-read-length.json",
output:
fwrd_allsamp_hist_fil=temp(
Expand All @@ -15,7 +27,7 @@ rule get_selected_transcripts_aligned_read_bins:
each_transcript="{ind_transcripts}",
samples="{sample}-{unit}",
log:
"results/logs/QC/{sample}-{unit}.{ind_transcripts}.aligned-read-bins.log",
"logs/QC/{sample}-{unit}.{ind_transcripts}.aligned-read-bins.log",
conda:
"../envs/QC.yaml"
script:
Expand Down Expand Up @@ -67,7 +79,7 @@ if is_3prime_experiment and config["experiment"]["3-prime-rna-seq"]["plot-qc"] !
params:
each_transcript="{ind_transcripts}",
log:
"results/logs/QC/3prime-QC-plot.{ind_transcripts}.log",
"logs/QC/3prime-QC-plot.{ind_transcripts}.log",
conda:
"../envs/QC.yaml"
script:
Expand Down Expand Up @@ -108,7 +120,7 @@ else:
params:
each_transcript="{ind_transcripts}",
log:
"results/logs/QC/3prime-QC-plot.{ind_transcripts}.log",
"logs/QC/3prime-QC-plot.{ind_transcripts}.log",
conda:
"../envs/QC.yaml"
script:
Expand Down
6 changes: 3 additions & 3 deletions workflow/rules/quant.smk
Original file line number Diff line number Diff line change
@@ -1,12 +1,12 @@
rule kallisto_index:
input:
fasta="resources/transcriptome_clean.cdna.fasta"
fasta="resources/transcriptome.cdna.without_poly_a.fasta"
if is_3prime_experiment
else "resources/transcriptome.cdna.fasta",
output:
index="results/kallisto_cdna/transcripts.cdna.idx",
log:
"results/logs/kallisto_cdna/index.cdna.log",
"logs/kallisto_cdna/index.cdna.log",
threads: 1
wrapper:
"v1.23.1/bio/kallisto/index"
Expand All @@ -19,7 +19,7 @@ rule kallisto_quant:
output:
kallisto_folder=directory("results/kallisto_cdna/{sample}-{unit}"),
log:
"results/logs/kallisto_cdna/quant/{sample}-{unit}.log",
"logs/kallisto_cdna/quant/{sample}-{unit}.log",
params:
extra=kallisto_params,
threads: 5
Expand Down
160 changes: 34 additions & 126 deletions workflow/rules/quant_3prime.smk
Original file line number Diff line number Diff line change
@@ -1,47 +1,6 @@
rule get_aligned_pos:
input:
bam_file="results/kallisto_cdna/{sample}-{unit}",
output:
aligned_files=temp("results/QC/{sample}-{unit}.aligned.txt"),
log:
"results/logs/QC/{sample}-{unit}.aligned.log",
conda:
"../envs/samtools.yaml"
shell:
"samtools view {input.bam_file}/pseudoalignments.bam | cut -f1,3,4,10,11 > {output} 2> {log}"


if is_3prime_experiment:

rule kallisto_3prime_index:
input:
fasta="resources/transcriptome_clean.3prime.fasta",
output:
index="results/kallisto_3prime/transcripts.3prime.idx",
log:
"results/logs/kallisto_3prime/index.3prime.log",
threads: 1
wrapper:
"v1.23.1/bio/kallisto/index"

rule kallisto_3prime_quant:
input:
fastq=kallisto_quant_input,
index="results/kallisto_3prime/transcripts.3prime.idx",
output:
kallisto_folder=directory("results/kallisto_3prime/{sample}-{unit}"),
log:
"results/logs/kallisto_3prime/quant/{sample}-{unit}.log",
params:
extra=kallisto_params,
threads: 5
wrapper:
"v1.23.1/bio/kallisto/quant"


rule bwa_index:
input:
"resources/transcriptome_clean.cdna.fasta",
"resources/transcriptome.cdna.without_poly_a.fasta",
output:
idx=multiext("resources/transcriptome", ".amb", ".ann", ".bwt", ".pac", ".sa"),
log:
Expand All @@ -57,124 +16,73 @@ rule bwa_mem:
reads=get_trimmed,
idx=multiext("resources/transcriptome", ".amb", ".ann", ".bwt", ".pac", ".sa"),
output:
"results/mapped_mem/{sample}-{unit}.bam",
"results/mapped_mem/{sample}-{unit}.namesorted.bam",
log:
"logs/bwa_mem/{sample}-{unit}.log",
params:
extra=r"-R '@RG\tID:{sample}\tSM:{sample}'",
sorting="none", # Can be 'none', 'samtools' or 'picard'.
sorting="samtools", # Can be 'none', 'samtools' or 'picard'.
sort_order="queryname", # Can be 'queryname' or 'coordinate'.
sort_extra="", # Extra args for samtools/picard.
threads: 8
wrapper:
"v1.17.2/bio/bwa/mem"


rule get_mapped_canonical_transcripts:
rule get_only_mane_select_reads_closest_to_3_prime:
input:
mapped_bam="results/mapped_mem/{sample}-{unit}.bam",
canonical_ids="resources/canonical_ids.csv",
bam="results/mapped_mem/{sample}-{unit}.namesorted.bam",
annotation="resources/transcripts_annotation.mane_strand_length.tsv",
output:
canonical_mapped_bam=temp(
"results/canonical_mapped_bam/{sample}-{unit}.canonical.mapped.bam"
mane_select_reads_closest_to_3_prime=temp(
"results/mapped_3prime_mane/{sample}-{unit}.mane_select_closest_to_3_prime.bam"
),
log:
"results/logs/canonical_mapped_bam/{sample}-{unit}.canonical-mapped-bam.log",
"logs/mapped_3prime_bam/{sample}-{unit}.mapped.pos.log",
conda:
"../envs/samtools.yaml"
shell:
"samtools view -h -F 4 {input.mapped_bam} | cut -f1-12 | grep -f {input.canonical_ids} | samtools view -o {output.canonical_mapped_bam} 2> {log}"
"../envs/pysam.yaml"
script:
"../scripts/get-only-mane-select-reads-closest-to-3-prime.py"


rule get_mapped_canonical_positions:
rule get_mane_fastq:
input:
canonical_mapped_bam="results/canonical_mapped_bam/{sample}-{unit}.canonical.mapped.bam",
bam="results/mapped_3prime_mane/{sample}-{unit}.mane_select_closest_to_3_prime.bam"
output:
canonical_mapped_pos=temp(
"results/canonical_mapped_bam/{sample}-{unit}.canonical.mapped.position.txt"
),
fastq="results/mane_3prime_reads/{sample}-{unit}.fastq",
log:
"results/logs/canonical_mapped_bam/{sample}-{unit}.canonical-mapped-pos.log",
"logs/mane_3prime_reads/{sample}-{unit}.log",
conda:
"../envs/samtools.yaml"
shell:
"samtools view {input.canonical_mapped_bam} | cut -f1,3,4,10,11 > {output} 2> {log}"
"samtools bam2fq {input.bam} > {output.fastq} 2> {log}"


rule bwa_samtools_sort:
rule kallisto_3prime_index:
input:
"results/canonical_mapped_bam/{sample}-{unit}.canonical.mapped.bam",
fasta="resources/transcriptome.cdna.without_poly_a.mane.fasta",
output:
temp("results/canonical_mapped_bam/{sample}-{unit}.canonical.mapped.sorted.bam"),
index="results/kallisto_3prime/transcripts.3prime.idx",
log:
"results/logs/QC/{sample}-{unit}.sorted.log",
params:
extra="-m 4G",
threads: 8
"logs/kallisto_3prime/index.3prime.log",
threads: 1
wrapper:
"v1.18.3/bio/samtools/sort"
"v1.23.1/bio/kallisto/index"


rule bwa_samtools_index:
rule kallisto_3prime_quant:
input:
"results/canonical_mapped_bam/{sample}-{unit}.canonical.mapped.sorted.bam",
fastq=kallisto_quant_input,
index="results/kallisto_3prime/transcripts.3prime.idx",
output:
temp(
"results/canonical_mapped_bam/{sample}-{unit}.canonical.mapped.sorted.bam.bai"
),
kallisto_folder=directory("results/kallisto_3prime/{sample}-{unit}"),
log:
"results/logs/QC/{sample}-{unit}.sorted.index.log",
"logs/kallisto_3prime/quant/{sample}-{unit}.log",
params:
extra="", # optional params string
threads: 4 # This value - 1 will be sent to -@
extra=kallisto_params,
threads: 5
wrapper:
"v1.18.3/bio/samtools/index"


rule get_closest_3prime_aligned_pos:
input:
canonical_mapped_bam="results/canonical_mapped_bam/{sample}-{unit}.canonical.mapped.sorted.bam",
canonical_mapped_bam_index="results/canonical_mapped_bam/{sample}-{unit}.canonical.mapped.sorted.bam.bai",
canonical_mapped_pos="results/canonical_mapped_bam/{sample}-{unit}.canonical.mapped.position.txt",
output:
canonical_mapped_3prime_pos=temp(
"results/mapped_3prime_bam/{sample}-{unit}.canonical.mapped.3prime_pos.txt"
),
log:
"results/logs/mapped_3prime_bam/{sample}-{unit}.mapped.pos.log",
conda:
"../envs/QC.yaml"
script:
"../scripts/get-3prime-max-positions.py"


rule get_closest_3prime_aligned_pos_bam:
input:
canonical_mapped_bam="results/canonical_mapped_bam/{sample}-{unit}.canonical.mapped.bam",
canonical_mapped_3prime_pos="results/mapped_3prime_bam/{sample}-{unit}.canonical.mapped.3prime_pos.txt",
output:
canonical_mapped_3prime_bam=temp(
"results/canonical_3prime_mapped_bam/{sample}-{unit}.canonical.3prime_mapped.bam"
),
log:
"results/logs/canonical_3prime_mapped_bam/{sample}-{unit}.canonical.3prime_mapped.log",
conda:
"../envs/samtools.yaml"
shell:
"samtools view -R {input.canonical_mapped_3prime_pos} {input.canonical_mapped_bam} -o {output.canonical_mapped_3prime_bam} 2> {log}"


rule get_canonical_fastq:
input:
canonical_3prime_mapped_bam="results/canonical_3prime_mapped_bam/{sample}-{unit}.canonical.3prime_mapped.bam",
output:
canonical_fastq="results/canonical_reads/{sample}-{unit}.fastq",
log:
"results/logs/canonical_3prime_mapped_bam/{sample}-{unit}.canonical_3prime_mapped.fastq.log",
conda:
"../envs/samtools.yaml"
shell:
"samtools bam2fq {input.canonical_3prime_mapped_bam} > {output.canonical_fastq} 2> {log}"
"v1.23.1/bio/kallisto/quant"


rule kallisto_samtools_sort:
Expand All @@ -183,7 +91,7 @@ rule kallisto_samtools_sort:
output:
temp("results/kallisto-bam-sorted/{sample}-{unit}-pseudoalignments.sorted.bam"),
log:
"results/logs/QC/{sample}-{unit}.sorted.log",
"logs/QC/{sample}-{unit}.sorted.log",
conda:
"../envs/samtools.yaml"
shell:
Expand All @@ -198,9 +106,9 @@ rule kallisto_samtools_index:
"results/kallisto-bam-sorted/{sample}-{unit}-pseudoalignments.sorted.bam.bai"
),
log:
"results/logs/QC/{sample}-{unit}.sorted.index.log",
"logs/QC/{sample}-{unit}.sorted.index.log",
params:
extra="", # optional params string
threads: 4 # This value - 1 will be sent to -@
wrapper:
"v1.18.3/bio/samtools/index"
"v1.18.3/bio/samtools/index"
2 changes: 1 addition & 1 deletion workflow/rules/ref.smk
Original file line number Diff line number Diff line change
Expand Up @@ -32,7 +32,7 @@ rule get_annotation:

rule get_transcript_info:
output:
"resources/transcript-info.rds",
multiext("resources/transcripts_annotation", ".results.rds", ".mane_strand_length.tsv"),
params:
species=get_bioc_species_name(),
version=config["resources"]["ref"]["release"],
Expand Down
Loading
Loading