- Cocciardi et al. "Clonal evolution patterns in acute myeloid leukemia with NPM1 mutation". In: Nature Communications 10.1 (2019)
- Scheffold et al. "IGF1R as druggable target mediating PI3K-δ inhibitor resistance in a murine model of chronic lymphocytic leukemia". In: Blood (2019)
- Rücker et al. "Chromothripsis is linked to TP53 alteration, cell cycle impairment, and dismal outcome in acute myeloid leukemia with complex karyotype". In: haematologica 103.1 (2018)
- L'Abbate et al. "MYC-containing amplicons in acute myeloid leukemia: genomic structures, evolution, and transcriptional consequences". In: Leukemia (2018)
- Hirsch et al. "Circular RNAs of the nucleophosmin (NPM1) gene in acute myeloid leukemia". In: haematologica (2017)
- Thiel et al. "Heterodimerization of AML1/ETO with CBFβ is required for leukemogenesis but not for myeloproliferation". In: Leukemia (2017)
T.J. Blätte
Actual pipeline scripts are prefixed by bpipe_
and implemented using
the same. They include three main types of files:
*stages
files, which define the individual pipeline steps*config
files, which define some default parameters*pipeline
files, which piece together configs and stages to define a specific pipeline
Bpipe and all other third-party tools must be installed and correctly assigned to the
respective variables defined within the *stages
file.
Correctly formatted / pre-processed references must be provided for alignment; these are defined within the *config
files. Depending on the pipeline to be used, additional database files are required for filtering and annotation.
Required inputs to all pipelines are paired FASTQ files of one or more samples. Paired FASTQ files are recognized based on the substring _R?
- this should therefore be the only capital R
contained in the FASTQ name and while additional characters may follow, different samples should be distinguished by whatever comes before.
To run a given pipeline, provide these to the pipeline file:
bpipe run your-choice.pipeline \
your-input_R1.fastq \
your-input_R2.fastq
Several parameters can be controlled via optional command line arguments. To use these, follow the bpipe run
command with -p
and the argument assignment.
Common to all pipelines is for example nkern
for the number of cores to use for parallelization. Note that these are the cores used per sample and must therefore be adjusted when multiple samples are to be processed in parallel:
bpipe run -p nkern=42 your-choice.pipeline \
your-input_R1.fastq \
your-input_R2.fastq
Four pipelines are available to call mutations from Illumina sequencing data:
bpipe_amplicon.pipeline
for naive ampliconsbpipe_haloplexHS.pipeline
for molecularly barcoded ampliconsbpipe_exome.pipeline
for targeted enrichment protocolsbpipe_rna_callVariants.pipeline
for RNA sequencing data
Primary outputs of interest for all of these are the CSV files containing the filtered lists of mutations. Intermediate alignments and statistics are saved as well, at nearly every step of the analysis.
The amplicon pipeline performs the following steps: Quality assessment with the NGS QC Toolkit, adapter trimming via cutadapt, alignment to the reference via BWA-MEM, optical duplicate removal via Picard, local realignment via GATK, coverage calculation with BEDTools, variant calling using VarScan2, annotation via ANNOVAR and filtering using our in-house developed filter_*
scripts.
Expected inputs are paired FASTQ files and a BED file defining the target intervals of the reference covered by the design. Variant calling and on-target coverage calculation, but not alignment, will be restricted to these intervals. FASTQ files must be passed as positional arguments on the command line; the BED file may be defined as a default in the respective config or passed as an optional argument:
bpipe run -p target_region=your-targets.BED bpipe_amplicon.pipeline \
your-input_R1.fastq \
your-input_R2.fastq
To simultaneously change the default number of cores to use for the analysis, run:
bpipe run -p target_region=your-targets.BED -p nkern=42 bpipe_amplicon.pipeline \
your-input_R1.fastq \
your-input_R2.fastq
To analyze multiple samples in parallel using the default BED defined in the config, do:
bpipe run bpipe_amplicon.pipeline \
your-input_R1.fastq \
your-input_R2.fastq \
your-other-input_R1.fastq \
your-other-input_R2.fastq \
your-yet-another-input_R1.fastq \
your-yet-another-input_R2.fastq
The haloplexHS pipeline processes molecularly barcoded amplicons and performs the following steps: Quality assessment with the NGS QC Toolkit, alignment to the reference via BWA-MEM, optical and barcode duplicate removal via Picard, local realignment via GATK, coverage calculation with BEDTools, variant calling using VarScan2, annotation via ANNOVAR and filtering using our in-house developed filter_*
scripts.
Expected inputs are paired FASTQ files containing the sequencing reads, a third FASTQ named *_R3*.fastq
containing the molecular barcode of each read pair and a BED file defining the target amplicons of the design. Variant calling and on-target coverage calculation, but not alignment, will be restricted to these intervals. FASTQ files must be passed as positional arguments on the command line; the BED file may be defined as a default in the respective config or passed as an optional argument:
bpipe run -p target_region=your-targets.BED bpipe_haloplexHS.pipeline \
your-input_R1.fastq \
your-input_R2.fastq \
your-inputs-molecular-barcodes_R3.fastq
The exome / targeted enrichment pipeline processes Illumina NGS data from targeted enrichment based protocols, including whole-exome sequencing (WES) but also designs to enrich for other target regions or subsets of genes of interest. It performs the following steps: Quality assessment with the NGS QC Toolkit, alignment to the reference via BWA-MEM, optical and sequence duplicate removal via Picard, local realignment via GATK, coverage calculation with BEDTools, variant calling using VarScan2, annotation via ANNOVAR and filtering using our in-house developed filter_*
scripts.
Expected inputs are paired FASTQ files of two matched samples, the second being a healthy / control sample used to distinguish germline, somatic and LOH mutations, and a BED file defining the regions targeted by the design. Variant calling and on-target coverage calculation, but not alignment, will be restricted to these intervals. FASTQ files must be passed as positional arguments on the command line; the BED file may be defined as a default in the respective config or passed as an optional argument:
bpipe run bpipe_exome.pipeline \
your-tumor_R1.fastq \
your-tumor_R2.fastq \
your-tumors-healthy-control_R1.fastq \
your-tumors-healthy-control_R2.fastq
The RNA mutation calling pipeline processes Illumina RNA sequencing but is currently still somewhat experimental, as we have so far only used it to check for aberrations in specific candidate genes or in combination with matched DNA sequencing data. It mimics the exome / targeted enrichment pipeline and performs quality assessment with the NGS QC Toolkit, alignment to the reference via STAR, optical and sequence duplicate removal via Picard, local realignment via GATK, coverage calculation with BEDTools, variant calling using VarScan2, annotation via ANNOVAR and filtering using our in-house developed filter_*
scripts:
bpipe run bpipe_rna_callVariants.pipeline \
your-input_R1.fastq \
your-input_R2.fastq
There is one bpipe pipeline available for gene expression analysis, which performs quality assessment with the NGS QC Toolkit, reference alignment and gene-wise quantification via STAR, optical duplicate removal via Picard and requantification with HTSeq. With the help of different downstream analyses, this can be used to study both linear and circular transcripts.
Expected inputs are the paired FASTQ files of a single sample:
bpipe run bpipe_rna.pipeline \
your-input_R1.fastq \
your-input_R2.fastq
Primary outputs of interest are the STAR and HTSeq read count files. Note that STAR writes hidden files by default (filename starts with .
).
HTSeq counts can be directly input into DESeq2 for downstream analysis of differential expression.
Counts output by STAR have to be reformatted first.
The main difference is that while HTSeq generates a different counts file for each of the possible stranded / unstranded library preparation protocols, STAR writes all variants to one file.
To extract the desired counts into a 2-column file that can be processed further, use star_counts_to_htseq_format.sh
.
The script takes as input the counts file to be converted and the column to extract. When no column is given, it defaults to column 2
, for unstranded designs. Columns 3 and 4 contain counts for the respective stranded protocols:
star_counts_to_htseq_format.sh \
your-star-counts.tab \
2
Heavily based on the DESeq2 manual, the deseq2*
scripts test for differentially expressed genes.
Inputs to the wrapper script, which then itself calls the respective R script, are:
- the design table
- the design formula to test for
- the reference level
- alpha
- the minimum log fold change to test for
- a database file for ensembl gene ID to gene symbol conversion
The design table is a TSV file specifying at least the columns fileName
, sampleName
and the condition to test on for differential expression. Each line describes one sample. Filenames must refer to the respective read count files, which are expected to reside within the intermediate_files
subfolder, in the directory from which the script is run. These count files are those output by HTSeq or STAR, following the respective formatting in the latter case. All of the information provided in this table will be used to annotate the respective output plots.
The design formula is something like ~ condition
for an unpaired analysis or ~ patient + condition
for a paired analysis. Provided factors must be defined as columns in the design table.
The reference level defines relative to what fold changes are calculated. Typically, this is something like untreated
, control
or similar. The term must be specified in the tested condition column of the design table.
Alpha corresponds to the false discovery rate (FDR) cutoff that defines differentially expressed genes. Default is 0.1
.
The minimum log fold change to test for between groups defaults to 0.6
(1.5 on a linear / non-log scale).
Set this to 0
(1 on a linear / non-log scale) to test for any difference between groups.
The database file defaults to /NGS/known_sites/hg19/gencode.v19.chr_patch_hapl_scaff.annotation_UCSCcontigs.gtf
.
In addition, if a file named candidates.txt
is present in the same folder as the input design table, containing gene IDs, one per line, present in the analyzed count files, all files generated for the differentially expressed genes will additionally be generated for the set of genes in this file.
This includes the PC and clustering plots as well as the count and test statistic tables.
If a file named replace_sizeFactors.txt
is present, containing the size factors calculated by DESeq2 in a previous analysis to normalize for sequencing depth, these are used in the current analysis instead of whatever would be recalculated by DESeq2.
To compare treated versus untreated samples of the same patients using all of the provided defaults, run:
deseq2_wrapper.sh \
your-design-table.tsv \
"~ patient + treatment" \
"untreated"
The provided TSV file must then contain a patient
column, containing patient IDs of each sample, and a treatment
column containing entries treated
and untreated
.
To change the provided defaults, the values with which to override the defaults must be added to command.
Because arguments are assigned based on their position, all arguments up to the last default that shall be changed must be provided. For example, to change the minimum log fold change to 1
(2 on the linear scale), run:
deseq2_wrapper.sh \
your-design-table.tsv \
"~ patient + treatment" \
"untreated" \
0.1 \
1
Primary outputs of interest created by the script are:
- tables containing the normalized and log-transformed read counts of all samples
- a table containing the test statistics, fold changes, p- and FDR- estimates of each gene
- PDF plots of principal components 1-6 from PCA and heatmaps of different hierarchical clusterings, for differentially expressed genes and increasing subsets of genes with a high coefficient of variation (CV), as a sort of unsupervised exploratory comparison
This script takes DESeq2 results, ranks genes based on their log2 fold change estimates and calls on Broad's GSEA for gene set enrichment analysis. It requires a database file for ensembl ID to HUGO gene symbol conversion, defined within the script, the GSEA JAR file and one or more gene set collections to test. The latter can be downloaded from MSigDB or custom-built.
To process the paired DESeq2 analysis from above, run:
gsea.sh \
your-design-table_DESeq2~patient+treatment_all_woutNA.txt \
your-favorite-gene-set-collection.gmt \
your-second-favorite-gene-set-collection.gmt
A dated output folder is created with a subfolder for each individual analysis which in turn contains all of the generated output files.
Several scripts are available to detect and quantify circular RNAs from our gene expression analysis pipelines:
get_circs.sh
count_circs.sh
count_linear-alternatives.sh
The get_circs.sh
script identifies circular RNAs supported by backsplice junctions, where downstream splice donors are joined to upstream splice acceptors to form circular RNAs.
It takes STAR output files containing the discovered chimeric junctions and chimeric reads, the BAM file containing all of the alignments, and the genome file of the reference that was used for this alignment:
get_circs.sh \
your-input_Chimeric.out.junction \
your-input_Chimeric.out.sam \
your-input_alignSTAR.bam \
stars_chrNameLength.txt
Two output files are of importance:
your-input_Chimeric.out.junction_circsAnnotatedFinal.txt
contains a table listing the coordinates, genes and supporting reads of each backsplice junction that was discovered.
your-input_Chimeric.out.junction_circsAnnotatedFinal_linearJunctionsCounts.tsv
contains in addition the number of reads supporting the backsplice junctions' linear alternatives, that is linear junctions involving the backsplice donor or acceptor. The supporting reads of these alternative linear junction can then be used in a downstream analysis to compare circular to linear transcript expression.
The count_circs.sh
script takes the *linearJunctionCounts.tsv
files created by get_circs.sh
as input and generates read count files, based on HTSeq's output format, which can then be further processed by DESeq2.
Two of these count files are created per sample:
*_circs.tsv
contains the supporting reads of each circular RNA / backsplice junction; *_circs-per-gene.counts
contains the supporting reads of the circular RNAs / backsplice junctions of each gene.
Thus, for the latter, counts are summed when multiple circular RNAs / backsplice junctions are discovered within the same gene.
Specifically, the script takes as input a single output file prefix and all of the *_linearJunctionCounts.tsv
files generated by get_circs.sh
, which are to be processed together by DESeq2.
It is important that all of these samples are preprocessed together for count generation to ensure that the same features are counted for all samples, even when distinct sets of circRNAs were discovered.
count_circs.sh \
your-output-prefix \
your-input_*linearJunctionCounts.tsv \
your-other-input_*linearJunctionCounts.tsv \
your-yet-another-input_*linearJunctionCounts.tsv
The generated count files are all saved to the intermediate_files
subfolder, which is also created by the script, and can then be input into DESeq2 to test for differential expression, as described above.
A dummy DESeq2 design table listing the respective files is also created: *_circs.tsv
lists per-junction read count files and *_circs-per-gene.tsv
lists the per-gene files.
Results from DESeq2 can then be further studied using GSEA, just like conventional counts from linear transcripts.
A comparison of backsplice vs all non-backsplice reads is biased and does not properly reflect circular to linear transcript expression.
Instead, we compare backsplice-supporting read counts to those supporting alternative linear junctions involving either the backsplice donor or acceptor.
To quantify support (= expression) of these linear junctions, count_linear-alternatives.sh
was implemented.
It works analogously to count_circs.sh
and also takes as input an output file prefix and the *_linearJunctionCounts.tsv
files which are to be processed further:
count_linear-alternatives.sh \
your-output-prefix \
your-input_*linearJunctionCounts.tsv \
your-other-input_*linearJunctionCounts.tsv \
your-yet-another-input_*linearJunctionCounts.tsv
Count files *_linear-alternatives.counts
and *_linear-alternatives-per-gene.counts
are again saved to the intermediate_files
subfolder and dummy design tables for DESeq2 are prepared as *_linear-alternatives.tsv
and linear-alternatives-per-gene.tsv
.
To generate the various genome index files required for the DNA-based pipelines, use make_genome_files.sh
:
make_genome_files.sh \
your-reference-genome.fasta
This script, which must be run once for each reference genome that is to be used, will generate the general index file (.fai), the dictionary file required by GATK (.dict), the genome file required by BEDTools (.genomeFile) and the many index files required by BWA. Note that this script itself require samtools, picard and bwa and will only generate the index files required for our DNA pipelines. STAR, which is used as part of the RNA pipelines, must be used explicitely to create its own index.
To extract a random set of n
reads, use the sample_fastqs.sh
script.
It takes as input the two paired FASTQ files to subsample and the number of reads to extract.
Thus, the command to extract 100 reads would be:
sample_fastqs.sh \
your-input_R1.fastq \
your-input_R2.fastq \
100
Two output files will be written, one for each input file, containing the respectively extracted reads. Their filename prefix will be your-input
, the suffix will contain the number of reads extracted together with a timestamp, dedicating the day, hour, minute and second that the script was run at. This serves to distinguish individual subsamples created from the same pair of original FASTQ files.
All BED files must be sorted before they can be passed to our pipelines by chromosome contig and coordinate.
This can be done using the sort_bed.sh
script, which takes as input the BED file to be sorted and the reference genome index file (.fai), which defines the proper order of reference contigs:
sort_bed.sh \
your-input.bed \
your-reference-genome-index.fai
In case the index file does not exist yet, it can be created with the make_genome_files.sh
script as described above.
In addition to a sorted BED file, sort_bed.sh
also generates a padded BED file, which is also sorted, whose entries are extended by 5 bp each in both the 3' and 5' direction.
These padded BED files are required by targeted-enrichment pipeline to limit variant calling to the actual targets but include directly adjacent flanking regions as well, whose coverage is often also sufficent for analysis.
The get_shared_degs.sh
scripts takes as input an FDR cutoff and one or more DESeq2 result tables.
When one table is given, it returns all those genes which are differentially expressed with an FDR at or below the given cutoff.
When multiple files are given, the script returns only those DEGs shared by all of the provided files at or below the given FDR cutoff:
get_shared_degs.sh \
your-fdr-cutoff \
your-input_DESeq2*woutNA.txt \
your-other-input_DESeq2*woutNA.txt \
your-yet-another-input_DESeq2*woutNA.txt
Two helper scripts were implemented:
get_shared_gsea_sets.sh
get_shared_gsea_sets_for_all_combinations.sh
The first of these, get_shared_gsea_sets.sh
takes as input an FDR cutoff and one or more GSEA result tables.
It works analogously to get_shared_degs.sh
:
get_shared_gsea_sets.sh \
your-fdr-cutoff \
your-gsea-result-tables-to-extract-shared-significant-gene-sets-from
The second script, get_shared_gsea_sets_for_all_combinations.sh
, is essentially a wrapper for the first: Provided with an FDR cutoff, a ranking metric and one or more GSEA output file prefixes, shared significant gene sets per combination of ranking metric, gene set collection and direction of enrichment (enrichment vs depletion) are returned.
The current version of the gsea.sh
script generates only analyses for gene lists ranked by log fold change, for which lfc
has to be passed as the ranking metric.
The prefix has to uniquely identify the respective GSEA analyses:
get_shared_gsea_sets_for_all_combinations.sh \
your-fdr-cutoff \
lfc \
your-design-table_DESeq2~patient+treatment_all_woutNA.txt \
your-other-design-table_DESeq2~patient+treatment_all_woutNA.txt \
your-yet-another-design-table_DESeq2~patient+treatment_all_woutNA.txt
The test_association.R
script applies fisher's exact tests to test
for the independence of two variables.
These may be any attributes provided in the DESeq2 design table and / or
the expression level of a given gene. For genes, samples are split based on
the median expression and high vs low expression is the attribute that goes
into testing.
Required inputs are the DESeq2 design table and the name of the two attributes to test:
Rscript test_association.R \
your-design-table \
one-column-in-your-design-table \
another-column-in-your-design-table
When gene symbols are given, the normalized counts table from DESeq2 is required as a fourth argument:
Rscript test_association.R \
your-design-table \
one-column-in-your-design-table \
your-gene-of-interest \
your-design-table_DESeq2~patient+treatment_all_countsNormalized.txt
(Except when the gene symbol is actually also a column in the design table. In this case, the design table will be used instead.)
Three scripts exist to merge the per-sample stat files created by bpipe into a single file describing all the samples of a given project:
summarize_coverage.sh
summarize_duplicates.sh
summarize_filtering.sh
All of these scripts work analogously: They take as input a folder from which they collect the per-sample stat files to merge and then create a single output table with data of all of these. When no input folder is given, it defaults to the current working directory from which the script is run.
summarize_coverage.sh
collects the .summary
files from the given folder and merges them to a single file, coverage_summary.tsv
.
The original .summary
files contain the coverage statistics written by bpipe for a single given sample.
summarize_coverage.sh \
folder-with-individual-coverage-stat-files
summarize_duplicates.sh
collects Picard's .metrics.txt
files from the given folder and merges them to a single one, bamStat_summary.tsv
.
These files contain the number and proportion of sequence and optical duplicates detected by Picard in the respective samples.
summarize_duplicates.sh \
folder-with-individual-duplicate-stat-files
summarize_filtering.sh
collects the filter_statistic.txt
files from the given folder and merges them into one file, filter_summary.tsv
.
The original .summary
files contain the coverage statistics written by bpipe for a single given sample.
summarize_coverage.sh \
folder-with-individual-coverage-stat-files
The get_coverage_per_gene.sh
script calculates the average per-gene coverage for one or more samples processed by one of the mutation calling pipelines.
It takes as input the BED file which was also provided to the respective bpipe pipeline followed by all of the *coverBED.txt
files generated by bpipe, for which the per-gene coverage is to be calculated. Column 4 of this BED file must contain the gene IDs / symbols of interest.
Output is a single file, average_coverage_summary.tsv
, analogously to the summary scripts above.
get_coverage_per_gene.sh \
your-targets.BED \
your-input_*coverBED.txt \
your-other-input_*coverBED.txt \
your-yet-another-input_*coverBED.txt
The filter_specific_mutations.sh
script filters all of the provided input files for a specified list of mutations. It was originally developed to filter recurrent mutations not of interest in a certain project. For each input file, mutations matching and not matching the provided list are written to distinct output files.
Note that the search is based on exact string matching. Therefore, the list of mutations to filter for may contain any string or substring present (or not) in the CSV files. For example, exact mutation coordinates, chromosomes or genes may be supplied, but all must be exactly formatted as expected in the CSV, with one search key per line.
filter_specific_mutations.sh \
your-list-of-muts-to-filter-for.txt \
your-inputs-mutations.csv \
your-other-inputs-mutations.csv \
your-yet-another-inputs-mutations.csv
The get_flanking_sequence*.sh
scripts annotate mutations output by our pipelines with flanking reference sequences of a given length. Originally, get_flanking_sequence.sh
was used, which is slow but works offline:
get_flanking_sequence.sh \
your-inputs-mutations.csv \
the-reference-genome.fasta \
number-of-flanking-bases-to-annotate
Currently, our pipelines use get_flanking_sequence_online.sh
for annotation of all of the detected mutations.
It requires the input CSV file and the number of flanking bases to annotate as well as a running internet connection:
get_flanking_sequence_online.sh \
your-inputs-mutations.csv \
number-of-flanking-bases-to-annotate
For one of our protocols, molecular barcodes were sequenced not as separate index reads but as part of the main sequence reads.
To extract these into separate files, get_barcodes.py
was implemented.
It takes as input the two paired FASTQ files from which the barcodes are to be extracted, an output file prefix for the extracted barcodes and the forward and reverse reads' primer sequences that signal the start / end of the respective barcode - these primers are then matched with the reads' sequences and preceding / succeeding sequence is extracted:
python3 get_barcodes.py \
your-input_R1.fastq \
your-input_R2.fastq \
your-output-prefix \
your-primer-sequence-in-your-input_R1 \
your-primer-sequence-in-your-input_R2
Five output files are generated:
your-output-prefix_indexed_R1.fastq
: your-input_R1.fastq minus the extracted barcodesyour-output-prefix_indexed_R2.fastq
: your-input_R2.fastq minus the extracted barcodesyour-output-prefix_indexed_I1.fastq
: Barcodes extracted from your-input_R1.fastqyour-output-prefix_indexed_I2.fastq
: Barcodes extracted from your-input_R2.fastqyour-output-prefix_indexed_I.fastq
: Concatenation of the extracted barcodes into one per read pair
The file containing the concatenated indices can then, for example, be renamed / symlinked to _R3.fastq
and passed as the index file of our HaloplexHS pipeline.
The poretools_wrapper.sh
takes as input a folder containing Oxford Nanopore FAST5 files and converts these to a single FASTQ file.
Several statistics on run and data quality are also generated.
All output files are prefixed with the input folder's basename.
poretools_wrapper.sh \
your-input-folder
The run_jar.sh
script is a simple wrapper script that calls on a single JAR file, residing in the same folder as the wrapper script, via java -jar
and passes all arguments on to the JAR file. I implemented this to make JAR files callable from $PATH
.