Merge remote-tracking branch 'origin/main' into add_test_ci

viash-hub · Jan 30, 2024 · bf7bfa0
2 parents 8b7a4e8 + b3573a2
commit bf7bfa0

Showing 19 changed files with 1,585 additions and 0 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,17 @@
+*.DS_Store
+*__pycache__
+
+# IDE ignores
+.idea/
+
+# R specific ignores
+.Rhistory
+.Rproj.user
+*.Rproj
+
+# viash specific ignores
+target/
+
+# nextflow specific ignores
+.nextflow*
+work
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -4,6 +4,10 @@
 
 ## NEW FEATURES
 
+* `arriba`: Detect gene fusions from RNA-seq data (PR #1).
+
+* `fastp`: An ultra-fast all-in-one FASTQ preprocessor (PR #3).
+
 ## MAJOR CHANGES
 
 ## MINOR CHANGES

diff --git a/src/arriba/config.vsh.yaml b/src/arriba/config.vsh.yaml
diff --git a/src/arriba/help.txt b/src/arriba/help.txt
@@ -0,0 +1,198 @@
+```bash
+arriba -h
+```
+
+Arriba gene fusion detector
+---------------------------
+Version: 2.4.0
+
+Arriba is a fast tool to search for aberrant transcripts such as gene fusions. 
+It is based on chimeric alignments found by the STAR RNA-Seq aligner.
+
+Usage: arriba [-c Chimeric.out.sam] -x Aligned.out.bam \
+              -g annotation.gtf -a assembly.fa [-b blacklists.tsv] [-k known_fusions.tsv] \
+              [-t tags.tsv] [-p protein_domains.gff3] [-d structural_variants_from_WGS.tsv] \
+              -o fusions.tsv [-O fusions.discarded.tsv] \
+              [OPTIONS]
+
+ -c FILE  File in SAM/BAM/CRAM format with chimeric alignments as generated by STAR 
+          (Chimeric.out.sam). This parameter is only required, if STAR was run with the 
+          parameter '--chimOutType SeparateSAMold'. When STAR was run with the parameter 
+          '--chimOutType WithinBAM', it suffices to pass the parameter -x to Arriba and -c 
+          can be omitted. 
+
+ -x FILE  File in SAM/BAM/CRAM format with main alignments as generated by STAR 
+          (Aligned.out.sam). Arriba extracts candidate reads from this file. 
+
+ -g FILE  GTF file with gene annotation. The file may be gzip-compressed. 
+
+ -G GTF_FEATURES  Comma-/space-separated list of names of GTF features. 
+                  Default: gene_name=gene_name|gene_id gene_id=gene_id 
+                  transcript_id=transcript_id feature_exon=exon feature_CDS=CDS 
+
+ -a FILE  FastA file with genome sequence (assembly). The file may be gzip-compressed. An 
+          index with the file extension .fai must exist only if CRAM files are processed. 
+
+ -b FILE  File containing blacklisted events (recurrent artifacts and transcripts 
+          observed in healthy tissue). 
+
+ -k FILE  File containing known/recurrent fusions. Some cancer entities are often 
+          characterized by fusions between the same pair of genes. In order to boost 
+          sensitivity, a list of known fusions can be supplied using this parameter. The list 
+          must contain two columns with the names of the fused genes, separated by tabs. 
+
+ -o FILE  Output file with fusions that have passed all filters. 
+
+ -O FILE  Output file with fusions that were discarded due to filtering. 
+
+ -t FILE  Tab-separated file containing fusions to annotate with tags in the 'tags' column. 
+          The first two columns specify the genes; the third column specifies the tag. The 
+          file may be gzip-compressed. 
+
+ -p FILE  File in GFF3 format containing coordinates of the protein domains of genes. The 
+          protein domains retained in a fusion are listed in the column 
+          'retained_protein_domains'. The file may be gzip-compressed. 
+
+ -d FILE  Tab-separated file with coordinates of structural variants found using 
+          whole-genome sequencing data. These coordinates serve to increase sensitivity 
+          towards weakly expressed fusions and to eliminate fusions with low evidence. 
+
+ -D MAX_GENOMIC_BREAKPOINT_DISTANCE  When a file with genomic breakpoints obtained via 
+                                     whole-genome sequencing is supplied via the -d 
+                                     parameter, this parameter determines how far a 
+                                     genomic breakpoint may be away from a 
+                                     transcriptomic breakpoint to consider it as a 
+                                     related event. For events inside genes, the 
+                                     distance is added to the end of the gene; for 
+                                     intergenic events, the distance threshold is 
+                                     applied as is. Default: 100000 
+
+ -s STRANDEDNESS  Whether a strand-specific protocol was used for library preparation, 
+                  and if so, the type of strandedness (auto/yes/no/reverse). When 
+                  unstranded data is processed, the strand can sometimes be inferred from 
+                  splice-patterns. But in unclear situations, stranded data helps 
+                  resolve ambiguities. Default: auto 
+
+ -i CONTIGS  Comma-/space-separated list of interesting contigs. Fusions between genes 
+             on other contigs are ignored. Contigs can be specified with or without the 
+             prefix "chr". Asterisks (*) are treated as wild-cards. 
+             Default: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y AC_* NC_* 
+
+ -v CONTIGS  Comma-/space-separated list of viral contigs. Asterisks (*) are treated as 
+             wild-cards. 
+             Default: AC_* NC_* 
+
+ -f FILTERS  Comma-/space-separated list of filters to disable. By default all filters are 
+             enabled. Valid values: homologs, low_entropy, isoforms, 
+             top_expressed_viral_contigs, viral_contigs, uninteresting_contigs, 
+             non_coding_neighbors, mismatches, duplicates, no_genomic_support, 
+             genomic_support, intronic, end_to_end, relative_support, 
+             low_coverage_viral_contigs, merge_adjacent, mismappers, multimappers, 
+             same_gene, long_gap, internal_tandem_duplication, small_insert_size, 
+             read_through, inconsistently_clipped, intragenic_exonic, 
+             marginal_read_through, spliced, hairpin, blacklist, min_support, 
+             select_best, in_vitro, short_anchor, known_fusions, no_coverage, 
+             homopolymer, many_spliced 
+
+ -E MAX_E-VALUE  Arriba estimates the number of fusions with a given number of supporting 
+                 reads which one would expect to see by random chance. If the expected number 
+                 of fusions (e-value) is higher than this threshold, the fusion is 
+                 discarded by the 'relative_support' filter. Note: Increasing this 
+                 threshold can dramatically increase the number of false positives and may 
+                 increase the runtime of resource-intensive steps. Fractional values are 
+                 possible. Default: 0.300000 
+
+ -S MIN_SUPPORTING_READS  The 'min_support' filter discards all fusions with fewer than 
+                          this many supporting reads (split reads and discordant mates 
+                          combined). Default: 2 
+
+ -m MAX_MISMAPPERS  When more than this fraction of supporting reads turns out to be 
+                    mismappers, the 'mismappers' filter discards the fusion. Default: 
+                    0.800000 
+
+ -L MAX_HOMOLOG_IDENTITY  Genes with more than the given fraction of sequence identity are 
+                          considered homologs and removed by the 'homologs' filter. 
+                          Default: 0.300000 
+
+ -H HOMOPOLYMER_LENGTH  The 'homopolymer' filter removes breakpoints adjacent to 
+                        homopolymers of the given length or more. Default: 6 
+
+ -R READ_THROUGH_DISTANCE  The 'read_through' filter removes read-through fusions 
+                           where the breakpoints are less than the given distance away 
+                           from each other. Default: 10000 
+
+ -A MIN_ANCHOR_LENGTH  Alignment artifacts are often characterized by split reads coming 
+                       from only one gene and no discordant mates. Moreover, the split 
+                       reads only align to a short stretch in one of the genes. The 
+                       'short_anchor' filter removes these fusions. This parameter sets 
+                       the threshold in bp for what the filter considers short. Default: 23 
+
+ -M MANY_SPLICED_EVENTS  The 'many_spliced' filter recovers fusions between genes that 
+                         have at least this many spliced breakpoints. Default: 4 
+
+ -K MAX_KMER_CONTENT  The 'low_entropy' filter removes reads with repetitive 3-mers. If 
+                      the 3-mers make up more than the given fraction of the sequence, then 
+                      the read is discarded. Default: 0.600000 
+
+ -V MAX_MISMATCH_PVALUE  The 'mismatches' filter uses a binomial model to calculate a 
+                         p-value for observing a given number of mismatches in a read. If 
+                         the number of mismatches is too high, the read is discarded. 
+                         Default: 0.010000 
+
+ -F FRAGMENT_LENGTH  When paired-end data is given, the fragment length is estimated 
+                     automatically and this parameter has no effect. But when single-end 
+                     data is given, the mean fragment length should be specified to 
+                     effectively filter fusions that arise from hairpin structures. 
+                     Default: 200 
+
+ -U MAX_READS  Subsample fusions with more than the given number of supporting reads. This 
+               improves performance without compromising sensitivity, as long as the 
+               threshold is high. Counting of supporting reads beyond the threshold is 
+               inaccurate, obviously. Default: 300 
+
+ -Q QUANTILE  Highly expressed genes are prone to produce artifacts during library 
+              preparation. Genes with an expression above the given quantile are eligible 
+              for filtering by the 'in_vitro' filter. Default: 0.998000 
+
+ -e EXONIC_FRACTION  The breakpoints of false-positive predictions of intragenic events 
+                     are often both in exons. True predictions are more likely to have at 
+                     least one breakpoint in an intron, because introns are larger. If the 
+                     fraction of exonic sequence between two breakpoints is smaller than 
+                     the given fraction, the 'intragenic_exonic' filter discards the 
+                     event. Default: 0.330000 
+
+ -T TOP_N  Only report viral integration sites of the top N most highly expressed viral 
+           contigs. Default: 5 
+
+ -C COVERED_FRACTION  Ignore virally associated events if the virus is not fully 
+                      expressed, i.e., less than the given fraction of the viral contig is 
+                      transcribed. Default: 0.050000 
+
+ -l MAX_ITD_LENGTH  Maximum length of internal tandem duplications. Note: Increasing 
+                    this value beyond the default can impair performance and lead to many 
+                    false positives. Default: 100 
+
+ -z MIN_ITD_ALLELE_FRACTION  Required fraction of supporting reads to report an internal 
+                             tandem duplication. Default: 0.070000 
+
+ -Z MIN_ITD_SUPPORTING_READS  Required absolute number of supporting reads to report an 
+                              internal tandem duplication. Default: 10 
+
+ -u  Instead of performing duplicate marking itself, Arriba relies on duplicate marking by a 
+     preceding program using the BAM_FDUP flag. This makes sense when unique molecular 
+     identifiers (UMI) are used. 
+
+ -X  To reduce the runtime and file size, by default, the columns 'fusion_transcript', 
+     'peptide_sequence', and 'read_identifiers' are left empty in the file containing 
+     discarded fusion candidates (see parameter -O). When this flag is set, this extra 
+     information is reported in the discarded fusions file. 
+
+ -I  If assembly of the fusion transcript sequence from the supporting reads is incomplete 
+     (denoted as '...'), fill the gaps using the assembly sequence wherever possible. 
+
+ -h  Print help and exit. 
+
+         Code repository: https://github.com/suhrig/arriba
+    Get help/report bugs: https://github.com/suhrig/arriba/issues
+             User manual: https://arriba.readthedocs.io/
+             Please cite: https://doi.org/10.1101/gr.257246.119
diff --git a/src/arriba/script.sh b/src/arriba/script.sh
@@ -0,0 +1,47 @@
+#!/bin/bash
+
+## VIASH START
+## VIASH END
+
+[[ "$par_skip_duplicate_marking" == "false" ]] && unset par_skip_duplicate_marking
+[[ "$par_extra_information" == "false" ]] && unset par_extra_information
+[[ "$par_fill_gaps" == "false" ]] && unset par_fill_gaps
+
+arriba \
+  -x "$par_bam" \
+  -a "$par_genome" \
+  -g "$par_gene_annotation" \
+  -o "$par_fusions" \
+  ${par_known_fusions:+-k "${par_known_fusions}"} \
+  ${par_blacklist:+-b "${par_blacklist}"} \
+  ${par_structural_variants:+-d "${par_structural_variants}"} \
+  ${par_tags:+-t "${par_tags}"} \
+  ${par_protein_domains:+-p "${par_protein_domains}"} \
+  ${par_fusions_discarded:+-O "${par_fusions_discarded}"} \
+  ${par_max_genomic_breakpoint_distance:+-D "${par_max_genomic_breakpoint_distance}"} \
+  ${par_strandedness:+-s "${par_strandedness}"} \
+  ${par_interesting_contigs:+-i "${par_interesting_contigs}"} \
+  ${par_viral_contigs:+-v "${par_viral_contigs}"} \
+  ${par_disable_filters:+-f "${par_disable_filters}"} \
+  ${par_max_e_value:+-E "${par_max_e_value}"} \
+  ${par_min_supporting_reads:+-S "${par_min_supporting_reads}"} \
+  ${par_max_mismappers:+-m "${par_max_mismappers}"} \
+  ${par_max_homolog_identity:+-L "${par_max_homolog_identity}"} \
+  ${par_homopolymer_length:+-H "${par_homopolymer_length}"} \
+  ${par_read_through_distance:+-R "${par_read_through_distance}"} \
+  ${par_min_anchor_length:+-A "${par_min_anchor_length}"} \
+  ${par_many_spliced_events:+-M "${par_many_spliced_events}"} \
+  ${par_max_kmer_content:+-K "${par_max_kmer_content}"} \
+  ${par_max_mismatch_pvalue:+-V "${par_max_mismatch_pvalue}"} \
+  ${par_fragment_length:+-F "${par_fragment_length}"} \
+  ${par_max_reads:+-U "${par_max_reads}"} \
+  ${par_quantile:+-Q "${par_quantile}"} \
+  ${par_exonic_fraction:+-e "${par_exonic_fraction}"} \
+  ${par_top_n:+-T "${par_top_n}"} \
+  ${par_covered_fraction:+-C "${par_covered_fraction}"} \
+  ${par_max_itd_length:+-l "${par_max_itd_length}"} \
+  ${par_min_itd_allele_fraction:+-z "${par_min_itd_allele_fraction}"} \
+  ${par_min_itd_supporting_reads:+-Z "${par_min_itd_supporting_reads}"} \
+  ${par_skip_duplicate_marking:+-u} \
+  ${par_extra_information:+-X} \
+  ${par_fill_gaps:+-I}
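Note on `src/arriba/script.sh`: the wrapper relies on bash's `${var:+word}` expansion so that an option reaches `arriba` only when its parameter is set and non-empty. The sketch below isolates that idiom. The `show_args` helper and the example values are assumptions for illustration and are not part of the commit; `show_args` stands in for the real `arriba` binary.

```bash
#!/bin/bash
# Hypothetical stand-in for arriba: prints every argument it receives on its own line.
show_args() { printf '<%s>\n' "$@"; }

# Assumed example values; in the component these variables are provided by viash.
par_blacklist="black list.tsv"   # set, and contains a space
par_tags=""                      # empty, so -t should be omitted
par_fill_gaps="false"            # boolean flag arriving as the string "false"

# As in script.sh: unset "false" booleans so that :+ treats them as absent.
[[ "$par_fill_gaps" == "false" ]] && unset par_fill_gaps

# Each expansion yields either nothing or the flag plus its quoted value.
build_output=$(show_args \
  ${par_blacklist:+-b "${par_blacklist}"} \
  ${par_tags:+-t "${par_tags}"} \
  ${par_fill_gaps:+-I})
echo "$build_output"
```

Only `-b` and its value are passed on, and the value stays a single argument even though it contains a space, because the inner quotes are honoured inside the expansion.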
diff --git a/src/arriba/test.sh b/src/arriba/test.sh
@@ -0,0 +1,45 @@
+#!/bin/bash
+
+set -e
+
+dir_in="$meta_resources_dir/test_data"
+
+echo "> Run arriba with blacklist"
+"$meta_executable" \
+  --bam "$dir_in/A.bam" \
+  --genome "$dir_in/genome.fasta" \
+  --gene_annotation "$dir_in/annotation.gtf" \
+  --blacklist "$dir_in/blacklist.tsv" \
+  --fusions "fusions.tsv" \
+  --fusions_discarded "fusions_discarded.tsv" \
+  --interesting_contigs "1,2"
+
+echo ">> Checking output"
+[ ! -f "fusions.tsv" ] && echo "Output file fusions.tsv does not exist" && exit 1
+[ ! -f "fusions_discarded.tsv" ] && echo "Output file fusions_discarded.tsv does not exist" && exit 1
+
+echo ">> Check if output is empty"
+[ ! -s "fusions.tsv" ] && echo "Output file fusions.tsv is empty" && exit 1
+[ ! -s "fusions_discarded.tsv" ] && echo "Output file fusions_discarded.tsv is empty" && exit 1
+
+rm fusions.tsv fusions_discarded.tsv
+
+echo "> Run arriba without blacklist"
+"$meta_executable" \
+  --bam "$dir_in/A.bam" \
+  --genome "$dir_in/genome.fasta" \
+  --gene_annotation "$dir_in/annotation.gtf" \
+  --fusions "fusions.tsv" \
+  --fusions_discarded "fusions_discarded.tsv" \
+  --interesting_contigs "1,2" \
+  --disable_filters blacklist
+
+echo ">> Checking output"
+[ ! -f "fusions.tsv" ] && echo "Output file fusions.tsv does not exist" && exit 1
+[ ! -f "fusions_discarded.tsv" ] && echo "Output file fusions_discarded.tsv does not exist" && exit 1
+
+echo ">> Check if output is empty"
+[ ! -s "fusions.tsv" ] && echo "Output file fusions.tsv is empty" && exit 1
+[ ! -s "fusions_discarded.tsv" ] && echo "Output file fusions_discarded.tsv is empty" && exit 1
+
+echo "> Test successful"
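Note on `src/arriba/test.sh`: the existence and non-empty checks are repeated for both runs. One possible way to factor them out is sketched below. The `check_output` function and the scratch files are hypothetical and are not part of the commit.

```bash
#!/bin/bash
# Hypothetical helper: succeeds only if the file exists and is non-empty.
check_output() {
  local file="$1"
  if [ ! -f "$file" ]; then
    echo "Output file $file does not exist" >&2
    return 1
  fi
  if [ ! -s "$file" ]; then
    echo "Output file $file is empty" >&2
    return 1
  fi
}

# Demonstration in a scratch directory with made-up files.
tmp_dir=$(mktemp -d)
printf 'gene1\tgene2\n' > "$tmp_dir/fusions.tsv"   # non-empty file
: > "$tmp_dir/empty.tsv"                           # empty file

check_output "$tmp_dir/fusions.tsv" && result_ok="pass" || result_ok="fail"
check_output "$tmp_dir/empty.tsv" 2>/dev/null && result_empty="pass" || result_empty="fail"
check_output "$tmp_dir/missing.tsv" 2>/dev/null && result_missing="pass" || result_missing="fail"
echo "$result_ok $result_empty $result_missing"
rm -r "$tmp_dir"
```

In a test script running under `set -e`, a plain call such as `check_output "fusions.tsv"` would be enough to abort the test on failure.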
diff --git a/src/arriba/test_data/A.bam b/src/arriba/test_data/A.bam
diff --git a/src/arriba/test_data/annotation.gtf b/src/arriba/test_data/annotation.gtf
@@ -0,0 +1,6 @@
+1	havana	gene	1	80	.	+	.	gene_id "ENSG00000000000"; gene_version "5"; gene_name "A"; gene_source "havana"; gene_biotype "gene";
+1	havana	transcript	1	80	.	+	.	gene_id "ENSG00000000000"; gene_version "5"; transcript_id "ENST00000000000"; transcript_version "2"; gene_name "A"; gene_source "havana"; gene_biotype "gene"; transcript_name "A-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; tag "basic"; transcript_support_level "1";
+1	havana	exon	1	80	.	+	.	gene_id "ENSG00000000000"; gene_version "5"; transcript_id "ENST00000000000"; transcript_version "2"; exon_number "1"; gene_name "A"; gene_source "havana"; gene_biotype "gene"; transcript_name "A-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00000000000"; exon_version "1"; tag "basic"; transcript_support_level "1";
+2	havana	gene	1	80	.	+	.	gene_id "ENSG00000000001"; gene_version "5"; gene_name "B"; gene_source "havana"; gene_biotype "gene";
+2	havana	transcript	1	80	.	+	.	gene_id "ENSG00000000001"; gene_version "5"; transcript_id "ENST00000000001"; transcript_version "2"; gene_name "B"; gene_source "havana"; gene_biotype "gene"; transcript_name "B-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; tag "basic"; transcript_support_level "1";
+2	havana	exon	1	80	.	+	.	gene_id "ENSG00000000001"; gene_version "5"; transcript_id "ENST00000000001"; transcript_version "2"; exon_number "1"; gene_name "B"; gene_source "havana"; gene_biotype "gene"; transcript_name "B-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00000000001"; exon_version "1"; tag "basic"; transcript_support_level "1";
diff --git a/src/arriba/test_data/blacklist.tsv b/src/arriba/test_data/blacklist.tsv
diff --git a/src/arriba/test_data/genome.fasta b/src/arriba/test_data/genome.fasta
@@ -0,0 +1,4 @@
+>1
+GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
+>2
+AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
diff --git a/src/arriba/test_data/script.sh b/src/arriba/test_data/script.sh
@@ -0,0 +1,10 @@
+# arriba test data
+
+# Test data was obtained from https://github.com/snakemake/snakemake-wrappers/tree/master/bio/arriba/test
+
+if [ ! -d /tmp/snakemake-wrappers ]; then
+  git clone --depth 1 --single-branch --branch master https://github.com/snakemake/snakemake-wrappers /tmp/snakemake-wrappers
+fi
+
+cp -r /tmp/snakemake-wrappers/bio/arriba/test/* src/arriba/test_data
+