Commit
* add arriba component
* changes based on #2
* update changelog
* quote arguments
* add command to requirements
Showing 10 changed files with 696 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,198 @@ | ||
```bash | ||
arriba -h | ||
``` | ||
|
||
Arriba gene fusion detector | ||
--------------------------- | ||
Version: 2.4.0 | ||
|
||
Arriba is a fast tool to search for aberrant transcripts such as gene fusions. | ||
It is based on chimeric alignments found by the STAR RNA-Seq aligner. | ||
|
||
Usage: arriba [-c Chimeric.out.sam] -x Aligned.out.bam \ | ||
-g annotation.gtf -a assembly.fa [-b blacklists.tsv] [-k known_fusions.tsv] \ | ||
[-t tags.tsv] [-p protein_domains.gff3] [-d structural_variants_from_WGS.tsv] \ | ||
-o fusions.tsv [-O fusions.discarded.tsv] \ | ||
[OPTIONS] | ||
|
||
-c FILE File in SAM/BAM/CRAM format with chimeric alignments as generated by STAR | ||
(Chimeric.out.sam). This parameter is only required, if STAR was run with the | ||
parameter '--chimOutType SeparateSAMold'. When STAR was run with the parameter | ||
'--chimOutType WithinBAM', it suffices to pass the parameter -x to Arriba and -c | ||
can be omitted. | ||
|
||
-x FILE File in SAM/BAM/CRAM format with main alignments as generated by STAR | ||
(Aligned.out.sam). Arriba extracts candidate reads from this file. | ||
|
||
-g FILE GTF file with gene annotation. The file may be gzip-compressed. | ||
|
||
-G GTF_FEATURES Comma-/space-separated list of names of GTF features. | ||
Default: gene_name=gene_name|gene_id gene_id=gene_id | ||
transcript_id=transcript_id feature_exon=exon feature_CDS=CDS | ||
|
||
-a FILE FastA file with genome sequence (assembly). The file may be gzip-compressed. An | ||
index with the file extension .fai must exist only if CRAM files are processed. | ||
|
||
-b FILE File containing blacklisted events (recurrent artifacts and transcripts | ||
observed in healthy tissue). | ||
|
||
-k FILE File containing known/recurrent fusions. Some cancer entities are often | ||
characterized by fusions between the same pair of genes. In order to boost | ||
sensitivity, a list of known fusions can be supplied using this parameter. The list | ||
must contain two columns with the names of the fused genes, separated by tabs. | ||
|
||
-o FILE Output file with fusions that have passed all filters. | ||
|
||
-O FILE Output file with fusions that were discarded due to filtering. | ||
|
||
-t FILE Tab-separated file containing fusions to annotate with tags in the 'tags' column. | ||
The first two columns specify the genes; the third column specifies the tag. The | ||
file may be gzip-compressed. | ||
|
||
-p FILE File in GFF3 format containing coordinates of the protein domains of genes. The | ||
protein domains retained in a fusion are listed in the column | ||
'retained_protein_domains'. The file may be gzip-compressed. | ||
|
||
-d FILE Tab-separated file with coordinates of structural variants found using | ||
whole-genome sequencing data. These coordinates serve to increase sensitivity | ||
towards weakly expressed fusions and to eliminate fusions with low evidence. | ||
|
||
-D MAX_GENOMIC_BREAKPOINT_DISTANCE When a file with genomic breakpoints obtained via | ||
whole-genome sequencing is supplied via the -d | ||
parameter, this parameter determines how far a | ||
genomic breakpoint may be away from a | ||
transcriptomic breakpoint to consider it as a | ||
related event. For events inside genes, the | ||
distance is added to the end of the gene; for | ||
intergenic events, the distance threshold is | ||
applied as is. Default: 100000 | ||
|
||
-s STRANDEDNESS Whether a strand-specific protocol was used for library preparation, | ||
and if so, the type of strandedness (auto/yes/no/reverse). When | ||
unstranded data is processed, the strand can sometimes be inferred from | ||
splice-patterns. But in unclear situations, stranded data helps | ||
resolve ambiguities. Default: auto | ||
|
||
-i CONTIGS Comma-/space-separated list of interesting contigs. Fusions between genes | ||
on other contigs are ignored. Contigs can be specified with or without the | ||
prefix "chr". Asterisks (*) are treated as wild-cards. | ||
Default: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y AC_* NC_* | ||
|
||
-v CONTIGS Comma-/space-separated list of viral contigs. Asterisks (*) are treated as | ||
wild-cards. | ||
Default: AC_* NC_* | ||
|
||
-f FILTERS Comma-/space-separated list of filters to disable. By default all filters are | ||
enabled. Valid values: homologs, low_entropy, isoforms, | ||
top_expressed_viral_contigs, viral_contigs, uninteresting_contigs, | ||
non_coding_neighbors, mismatches, duplicates, no_genomic_support, | ||
genomic_support, intronic, end_to_end, relative_support, | ||
low_coverage_viral_contigs, merge_adjacent, mismappers, multimappers, | ||
same_gene, long_gap, internal_tandem_duplication, small_insert_size, | ||
read_through, inconsistently_clipped, intragenic_exonic, | ||
marginal_read_through, spliced, hairpin, blacklist, min_support, | ||
select_best, in_vitro, short_anchor, known_fusions, no_coverage, | ||
homopolymer, many_spliced | ||
|
||
-E MAX_E-VALUE Arriba estimates the number of fusions with a given number of supporting | ||
reads which one would expect to see by random chance. If the expected number | ||
of fusions (e-value) is higher than this threshold, the fusion is | ||
discarded by the 'relative_support' filter. Note: Increasing this | ||
threshold can dramatically increase the number of false positives and may | ||
increase the runtime of resource-intensive steps. Fractional values are | ||
possible. Default: 0.300000 | ||
|
||
-S MIN_SUPPORTING_READS The 'min_support' filter discards all fusions with fewer than | ||
this many supporting reads (split reads and discordant mates | ||
combined). Default: 2 | ||
|
||
-m MAX_MISMAPPERS When more than this fraction of supporting reads turns out to be | ||
mismappers, the 'mismappers' filter discards the fusion. Default: | ||
0.800000 | ||
|
||
-L MAX_HOMOLOG_IDENTITY Genes with more than the given fraction of sequence identity are | ||
considered homologs and removed by the 'homologs' filter. | ||
Default: 0.300000 | ||
|
||
-H HOMOPOLYMER_LENGTH The 'homopolymer' filter removes breakpoints adjacent to | ||
homopolymers of the given length or more. Default: 6 | ||
|
||
-R READ_THROUGH_DISTANCE The 'read_through' filter removes read-through fusions | ||
where the breakpoints are less than the given distance away | ||
from each other. Default: 10000 | ||
|
||
-A MIN_ANCHOR_LENGTH Alignment artifacts are often characterized by split reads coming | ||
from only one gene and no discordant mates. Moreover, the split | ||
reads only align to a short stretch in one of the genes. The | ||
'short_anchor' filter removes these fusions. This parameter sets | ||
the threshold in bp for what the filter considers short. Default: 23 | ||
|
||
-M MANY_SPLICED_EVENTS The 'many_spliced' filter recovers fusions between genes that | ||
have at least this many spliced breakpoints. Default: 4 | ||
|
||
-K MAX_KMER_CONTENT The 'low_entropy' filter removes reads with repetitive 3-mers. If | ||
the 3-mers make up more than the given fraction of the sequence, then | ||
the read is discarded. Default: 0.600000 | ||
|
||
-V MAX_MISMATCH_PVALUE The 'mismatches' filter uses a binomial model to calculate a | ||
p-value for observing a given number of mismatches in a read. If | ||
the number of mismatches is too high, the read is discarded. | ||
Default: 0.010000 | ||
|
||
-F FRAGMENT_LENGTH When paired-end data is given, the fragment length is estimated | ||
automatically and this parameter has no effect. But when single-end | ||
data is given, the mean fragment length should be specified to | ||
effectively filter fusions that arise from hairpin structures. | ||
Default: 200 | ||
|
||
-U MAX_READS Subsample fusions with more than the given number of supporting reads. This | ||
improves performance without compromising sensitivity, as long as the | ||
threshold is high. Counting of supporting reads beyond the threshold is | ||
inaccurate, obviously. Default: 300 | ||
|
||
-Q QUANTILE Highly expressed genes are prone to produce artifacts during library | ||
preparation. Genes with an expression above the given quantile are eligible | ||
for filtering by the 'in_vitro' filter. Default: 0.998000 | ||
|
||
-e EXONIC_FRACTION The breakpoints of false-positive predictions of intragenic events | ||
are often both in exons. True predictions are more likely to have at | ||
least one breakpoint in an intron, because introns are larger. If the | ||
fraction of exonic sequence between two breakpoints is smaller than | ||
the given fraction, the 'intragenic_exonic' filter discards the | ||
event. Default: 0.330000 | ||
|
||
-T TOP_N Only report viral integration sites of the top N most highly expressed viral | ||
contigs. Default: 5 | ||
|
||
-C COVERED_FRACTION Ignore virally associated events if the virus is not fully | ||
expressed, i.e., less than the given fraction of the viral contig is | ||
transcribed. Default: 0.050000 | ||
|
||
-l MAX_ITD_LENGTH Maximum length of internal tandem duplications. Note: Increasing | ||
this value beyond the default can impair performance and lead to many | ||
false positives. Default: 100 | ||
|
||
-z MIN_ITD_ALLELE_FRACTION Required fraction of supporting reads to report an internal | ||
tandem duplication. Default: 0.070000 | ||
|
||
-Z MIN_ITD_SUPPORTING_READS Required absolute number of supporting reads to report an | ||
internal tandem duplication. Default: 10 | ||
|
||
-u Instead of performing duplicate marking itself, Arriba relies on duplicate marking by a | ||
preceding program using the BAM_FDUP flag. This makes sense when unique molecular | ||
identifiers (UMI) are used. | ||
|
||
-X To reduce the runtime and file size, by default, the columns 'fusion_transcript', | ||
'peptide_sequence', and 'read_identifiers' are left empty in the file containing | ||
discarded fusion candidates (see parameter -O). When this flag is set, this extra | ||
information is reported in the discarded fusions file. | ||
|
||
-I If assembly of the fusion transcript sequence from the supporting reads is incomplete | ||
(denoted as '...'), fill the gaps using the assembly sequence wherever possible. | ||
|
||
-h Print help and exit. | ||
|
||
Code repository: https://github.com/suhrig/arriba | ||
Get help/report bugs: https://github.com/suhrig/arriba/issues | ||
User manual: https://arriba.readthedocs.io/ | ||
Please cite: https://doi.org/10.1101/gr.257246.119 |
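
Review note: the help text above states that the `-k` list must contain two tab-separated columns with the names of the fused genes. Below is a minimal sketch of writing such a file and sanity-checking it before passing it to arriba. The gene pairs, the temporary file, and the `check_two_columns` helper are illustrative assumptions and are not part of this commit; the check accepts lines with at least two non-empty columns.

```bash
#!/bin/bash
# Hypothetical helper: succeed only if every non-empty line has at least
# two tab-separated fields and the first two fields are non-empty.
check_two_columns() {
  awk -F'\t' 'NF > 0 && (NF < 2 || $1 == "" || $2 == "") { bad = 1 }
              END { exit bad + 0 }' "$1"
}

known_fusions="$(mktemp)"
# Illustrative gene pairs only; replace with pairs relevant to the data set.
printf '%s\t%s\n' "BCR" "ABL1" "EML4" "ALK" > "$known_fusions"

if check_two_columns "$known_fusions"; then
  echo "known fusions file looks well-formed"
fi
```

A file that passes this check could then be supplied through the component's `par_known_fusions` parameter, which the script forwards as `-k`.
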
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,47 @@ | ||
#!/bin/bash | ||
|
||
## VIASH START | ||
## VIASH END | ||
|
||
[[ "$par_skip_duplicate_marking" == "false" ]] && unset par_skip_duplicate_marking | ||
[[ "$par_extra_information" == "false" ]] && unset par_extra_information | ||
[[ "$par_fill_gaps" == "false" ]] && unset par_fill_gaps | ||
|
||
arriba \ | ||
-x "$par_bam" \ | ||
-a "$par_genome" \ | ||
-g "$par_gene_annotation" \ | ||
-o "$par_fusions" \ | ||
${par_known_fusions:+-k "${par_known_fusions}"} \ | ||
${par_blacklist:+-b "${par_blacklist}"} \ | ||
${par_structural_variants:+-d "${par_structural_variants}"} \ | ||
${par_tags:+-t "${par_tags}"} \ | ||
${par_protein_domains:+-p "${par_protein_domains}"} \ | ||
${par_fusions_discarded:+-O "${par_fusions_discarded}"} \ | ||
${par_max_genomic_breakpoint_distance:+-D "${par_max_genomic_breakpoint_distance}"} \ | ||
${par_strandedness:+-s "${par_strandedness}"} \ | ||
${par_interesting_contigs:+-i "${par_interesting_contigs}"} \ | ||
${par_viral_contigs:+-v "${par_viral_contigs}"} \ | ||
${par_disable_filters:+-f "${par_disable_filters}"} \ | ||
${par_max_e_value:+-E "${par_max_e_value}"} \ | ||
${par_min_supporting_reads:+-S "${par_min_supporting_reads}"} \ | ||
${par_max_mismappers:+-m "${par_max_mismappers}"} \ | ||
${par_max_homolog_identity:+-L "${par_max_homolog_identity}"} \ | ||
${par_homopolymer_length:+-H "${par_homopolymer_length}"} \ | ||
${par_read_through_distance:+-R "${par_read_through_distance}"} \ | ||
${par_min_anchor_length:+-A "${par_min_anchor_length}"} \ | ||
${par_many_spliced_events:+-M "${par_many_spliced_events}"} \ | ||
${par_max_kmer_content:+-K "${par_max_kmer_content}"} \ | ||
${par_max_mismatch_pvalue:+-V "${par_max_mismatch_pvalue}"} \ | ||
${par_fragment_length:+-F "${par_fragment_length}"} \ | ||
${par_max_reads:+-U "${par_max_reads}"} \ | ||
${par_quantile:+-Q "${par_quantile}"} \ | ||
${par_exonic_fraction:+-e "${par_exonic_fraction}"} \ | ||
${par_top_n:+-T "${par_top_n}"} \ | ||
${par_covered_fraction:+-C "${par_covered_fraction}"} \ | ||
${par_max_itd_length:+-l "${par_max_itd_length}"} \ | ||
${par_min_itd_allele_fraction:+-z "${par_min_itd_allele_fraction}"} \ | ||
${par_min_itd_supporting_reads:+-Z "${par_min_itd_supporting_reads}"} \ | ||
${par_skip_duplicate_marking:+-u} \ | ||
${par_extra_information:+-X} \ | ||
${par_fill_gaps:+-I} |
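
Review note: the optional arguments in the script above rely on the `${var:+word}` parameter expansion, which expands to nothing when the variable is unset or empty, and on unsetting boolean parameters equal to `"false"` so that flags such as `-u` are dropped. The sketch below demonstrates that behavior in isolation. The `count_args` helper and the `par_demo_*` variables are hypothetical and exist only for illustration.

```bash
#!/bin/bash
# Hypothetical helper: print how many arguments it received.
count_args() { echo "$#"; }

par_demo_file="my file.tsv"   # set, and contains a space
par_demo_flag="false"         # boolean passed as the string "false"
unset par_demo_missing        # optional parameter that was not provided

# Same pattern as in the script: drop booleans that are "false".
[[ "$par_demo_flag" == "false" ]] && unset par_demo_flag

# The outer expansion is left unquoted so the flag and the value become two
# words; the inner quotes keep a value containing spaces as a single word.
count_args ${par_demo_file:+-k "${par_demo_file}"}        # prints 2
count_args ${par_demo_missing:+-b "${par_demo_missing}"}  # prints 0
count_args ${par_demo_flag:+-u}                           # prints 0
```

The inner quoting is what the "quote arguments" item in the commit message appears to refer to: without it, a path with spaces would be split into several arguments.
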
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,45 @@ | ||
#!/bin/bash | ||
|
||
set -e | ||
|
||
dir_in="$meta_resources_dir/test_data" | ||
|
||
echo "> Run arriba with blacklist" | ||
"$meta_executable" \ | ||
--bam "$dir_in/A.bam" \ | ||
--genome "$dir_in/genome.fasta" \ | ||
--gene_annotation "$dir_in/annotation.gtf" \ | ||
--blacklist "$dir_in/blacklist.tsv" \ | ||
--fusions "fusions.tsv" \ | ||
--fusions_discarded "fusions_discarded.tsv" \ | ||
--interesting_contigs "1,2" | ||
|
||
echo ">> Checking output" | ||
[ ! -f "fusions.tsv" ] && echo "Output file fusions.tsv does not exist" && exit 1 | ||
[ ! -f "fusions_discarded.tsv" ] && echo "Output file fusions_discarded.tsv does not exist" && exit 1 | ||
|
||
echo ">> Check if output is empty" | ||
[ ! -s "fusions.tsv" ] && echo "Output file fusions.tsv is empty" && exit 1 | ||
[ ! -s "fusions_discarded.tsv" ] && echo "Output file fusions_discarded.tsv is empty" && exit 1 | ||
|
||
rm fusions.tsv fusions_discarded.tsv | ||
|
||
echo "> Run arriba without blacklist" | ||
"$meta_executable" \ | ||
--bam "$dir_in/A.bam" \ | ||
--genome "$dir_in/genome.fasta" \ | ||
--gene_annotation "$dir_in/annotation.gtf" \ | ||
--fusions "fusions.tsv" \ | ||
--fusions_discarded "fusions_discarded.tsv" \ | ||
--interesting_contigs "1,2" \ | ||
--disable_filters blacklist | ||
|
||
echo ">> Checking output" | ||
[ ! -f "fusions.tsv" ] && echo "Output file fusions.tsv does not exist" && exit 1 | ||
[ ! -f "fusions_discarded.tsv" ] && echo "Output file fusions_discarded.tsv does not exist" && exit 1 | ||
|
||
echo ">> Check if output is empty" | ||
[ ! -s "fusions.tsv" ] && echo "Output file fusions.tsv is empty" && exit 1 | ||
[ ! -s "fusions_discarded.tsv" ] && echo "Output file fusions_discarded.tsv is empty" && exit 1 | ||
|
||
echo "> Test successful" |
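
Review note: the existence and non-empty checks in the test above are repeated for both runs. One possible way to factor them out is sketched here, assuming a helper named `assert_non_empty_file` (hypothetical, not part of this commit) would be acceptable in the test script. The demo uses a temporary file as a stand-in for `fusions.tsv`.

```bash
#!/bin/bash
# Hypothetical helper: return 1 with a message if the file is missing or empty.
assert_non_empty_file() {
  if [ ! -f "$1" ]; then
    echo "Output file $1 does not exist" >&2
    return 1
  fi
  if [ ! -s "$1" ]; then
    echo "Output file $1 is empty" >&2
    return 1
  fi
}

# Demo on a temporary file standing in for fusions.tsv.
demo_file="$(mktemp)"
echo "placeholder" > "$demo_file"
assert_non_empty_file "$demo_file" && echo "ok: $demo_file"
```

With `set -e` active, as in the test script, a plain call such as `assert_non_empty_file "fusions.tsv"` would stop the script as soon as the helper returns a non-zero status.
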
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
# arriba test data | ||
|
||
Test data was obtained from https://github.com/snakemake/snakemake-wrappers/tree/master/bio/arriba/test. | ||
|
||
__author__ = "Jan Forster" | ||
__copyright__ = "Copyright 2019, Jan Forster" | ||
__email__ = "[email protected]" | ||
__license__ = "MIT" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
1 havana gene 1 80 . + . gene_id "ENSG00000000000"; gene_version "5"; gene_name "A"; gene_source "havana"; gene_biotype "gene"; | ||
1 havana transcript 1 80 . + . gene_id "ENSG00000000000"; gene_version "5"; transcript_id "ENST00000000000"; transcript_version "2"; gene_name "A"; gene_source "havana"; gene_biotype "gene"; transcript_name "A-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; tag "basic"; transcript_support_level "1"; | ||
1 havana exon 1 80 . + . gene_id "ENSG00000000000"; gene_version "5"; transcript_id "ENST00000000000"; transcript_version "2"; exon_number "1"; gene_name "A"; gene_source "havana"; gene_biotype "gene"; transcript_name "A-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00000000000"; exon_version "1"; tag "basic"; transcript_support_level "1"; | ||
2 havana gene 1 80 . + . gene_id "ENSG00000000001"; gene_version "5"; gene_name "B"; gene_source "havana"; gene_biotype "gene"; | ||
2 havana transcript 1 80 . + . gene_id "ENSG00000000001"; gene_version "5"; transcript_id "ENST00000000001"; transcript_version "2"; gene_name "B"; gene_source "havana"; gene_biotype "gene"; transcript_name "B-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; tag "basic"; transcript_support_level "1"; | ||
2 havana exon 1 80 . + . gene_id "ENSG00000000001"; gene_version "5"; transcript_id "ENST00000000001"; transcript_version "2"; exon_number "1"; gene_name "B"; gene_source "havana"; gene_biotype "gene"; transcript_name "B-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00000000001"; exon_version "1"; tag "basic"; transcript_support_level "1"; |
Empty file.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
>1 | ||
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG | ||
>2 | ||
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA |
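
Review note: the test passes `--interesting_contigs "1,2"`, which matches the contig names used in both the GTF and the FASTA fixtures. The sketch below shows one way such consistency could be checked. The abbreviated toy file contents, the temporary files, and the variable names are assumptions for illustration and do not reproduce the fixtures exactly.

```bash
#!/bin/bash
gtf="$(mktemp)"
fasta="$(mktemp)"
# Toy stand-ins for annotation.gtf and genome.fasta (contents abbreviated).
printf '1\thavana\tgene\t1\t80\t.\t+\t.\tgene_id "ENSG00000000000";\n' >  "$gtf"
printf '2\thavana\tgene\t1\t80\t.\t+\t.\tgene_id "ENSG00000000001";\n' >> "$gtf"
printf '>1\nGGGG\n>2\nAAAA\n' > "$fasta"

# Unique contig names from the first GTF column and from the FASTA headers
# (anything after the first whitespace in a header is ignored).
gtf_contigs="$(cut -f1 "$gtf" | sort -u)"
fasta_contigs="$(grep '^>' "$fasta" | sed 's/^>//; s/[[:space:]].*//' | sort -u)"

# Collect GTF contigs that have no matching FASTA header.
missing=""
for contig in $gtf_contigs; do
  if ! printf '%s\n' "$fasta_contigs" | grep -Fxq -- "$contig"; then
    missing="$missing $contig"
  fi
done

if [ -z "$missing" ]; then
  echo "all GTF contigs are present in the FASTA"
else
  echo "contigs missing from FASTA:$missing"
fi
```
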