diff --git a/CHANGELOG.md b/CHANGELOG.md index 9ff0bf83..a9ccf26b 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -29,17 +29,32 @@ * `bedtools`: - `bedtools/bedtools_intersect`: Allows one to screen for overlaps between two sets of genomic features (PR #94). - `bedtools/bedtools_sort`: Sorts a feature file (bed/gff/vcf) by chromosome and other criteria (PR #98). + - `bedtools/bedtools_genomecov`: Compute the coverage of a feature file (bed/gff/vcf/bam) among a genome (PR #128). + - `bedtools/bedtools_groupby`: Summarizes a dataset column based upon common column groupings. Akin to the SQL "group by" command (PR #123). + - `bedtools/bedtools_merge`: Merges overlapping BED/GFF/VCF entries into a single interval (PR #118). - `bedtools/bedtools_bamtofastq`: Convert BAM alignments to FASTQ files (PR #101). - `bedtools/bedtools_bedtobam`: Converts genomic feature records (bed/gff/vcf) to BAM format (PR #111). + - `bedtools/bedtools_bed12tobed6`: Converts BED12 files to BED6 files (PR #140). + - `bedtools/bedtools_links`: Creates an HTML file with links to an instance of the UCSC Genome Browser for all features / intervals in a (bed/gff/vcf) file (PR #137). * `qualimap/qualimap_rnaseq`: RNA-seq QC analysis using qualimap (PR #74). * `rsem/rsem_prepare_reference`: Prepare transcript references for RSEM (PR #89). +* `bcftools`: + - `bcftools/bcftools_concat`: Concatenate or combine VCF/BCF files (PR #145). + - `bcftools/bcftools_norm`: Left-align and normalize indels, check if REF alleles match the reference, split multiallelic sites into multiple rows; recover multiallelics from multiple rows (PR #144). + - `bcftools/bcftools_annotate`: Add or remove annotations from a VCF/BCF file (PR #143). + - `bcftools/bcftools_stats`: Parses VCF or BCF and produces a txt stats file which can be plotted using plot-vcfstats (PR #142). + - `bcftools/bcftools_sort`: Sorts BCF/VCF files by position and other criteria (PR #141). + +* `fastqc`: High throughput sequence quality control analysis tool (PR #92). + * `kallisto`: - `kallisto/kallisto_quant`: Quantifying abundances of transcripts from RNA-Seq data, or more generally of target sequences using high-throughput sequencing reads (PR #152). + ## MINOR CHANGES * `busco` components: update BUSCO to `5.7.1` (PR #72). @@ -127,7 +142,8 @@ - `samtools/samtools_fastq`: Converts a SAM/BAM/CRAM file to FASTA (PR #53). * `umi_tools`: - -`umi_tools/umi_tools_extract`: Flexible removal of UMI sequences from fastq reads (PR #71). + - `umi_tools/umi_tools_extract`: Flexible removal of UMI sequences from fastq reads (PR #71). + - `umi_tools/umi_tools_prepareforrsem`: Fix paired-end reads in name sorted BAM file to prepare for RSEM (PR #148). * `falco`: A C++ drop-in replacement of FastQC to assess the quality of sequence read data (PR #43). @@ -135,6 +151,15 @@ - `bedtools_getfasta`: extract sequences from a FASTA file for each of the intervals defined in a BED/GFF/VCF file (PR #59). +* `sortmerna`: Local sequence alignment tool for mapping, clustering, and filtering rRNA from metatranscriptomic + data. (PR #146) + +* `fq_subsample`: Sample a subset of records from single or paired FASTQ files (PR #147). + +* `kallisto`: + - `kallisto_index`: Create a kallisto index (PR #149). + + ## MINOR CHANGES * Uniformize component metadata (PR #23). diff --git a/src/bcftools/bcftools_annotate/config.vsh.yaml b/src/bcftools/bcftools_annotate/config.vsh.yaml new file mode 100644 index 00000000..67e8f46e --- /dev/null +++ b/src/bcftools/bcftools_annotate/config.vsh.yaml @@ -0,0 +1,250 @@ +name: bcftools_annotate +namespace: bcftools +description: | + Add or remove annotations from a VCF/BCF file. +keywords: [Annotate, VCF, BCF] +links: + homepage: https://samtools.github.io/bcftools/ + documentation: https://samtools.github.io/bcftools/bcftools.html#annotate + repository: https://github.com/samtools/bcftools + issue_tracker: https://github.com/samtools/bcftools/issues +references: + doi: https://doi.org/10.1093/gigascience/giab008 +license: MIT/Expat, GNU +requirements: + commands: [bcftools] +authors: + - __merge__: /src/_authors/theodoro_gasperin.yaml + roles: [author] + +argument_groups: + - name: Inputs + arguments: + - name: --input + alternatives: -i + type: file + multiple: true + description: Input VCF/BCF file. + required: true + + - name: Outputs + arguments: + - name: --output + alternatives: -o + direction: output + type: file + description: Output annotated file. + required: true + + - name: Options + description: | + For examples on how to use use bcftools annotate see http://samtools.github.io/bcftools/howtos/annotate.html. + For more details on the options see https://samtools.github.io/bcftools/bcftools.html#annotate. + arguments: + + - name: --annotations + alternatives: --a + type: file + description: | + VCF file or tabix-indexed FILE with annotations: CHR\tPOS[\tVALUE]+ . + + - name: --columns + alternatives: --c + type: string + description: | + List of columns in the annotation file, e.g. CHROM,POS,REF,ALT,-,INFO/TAG. + See man page for details. + + - name: --columns_file + alternatives: --C + type: file + description: | + Read -c columns from FILE, one name per row, with optional --merge_logic TYPE: NAME[ TYPE]. + + - name: --exclude + alternatives: --e + type: string + description: | + Exclude sites for which the expression is true. + See https://samtools.github.io/bcftools/bcftools.html#expressions for details. + example: 'QUAL >= 30 && DP >= 10' + + - name: --force + type: boolean_true + description: | + continue even when parsing errors, such as undefined tags, are encountered. + Note this can be an unsafe operation and can result in corrupted BCF files. + If this option is used, make sure to sanity check the result thoroughly. + + - name: --header_line + alternatives: --H + type: string + description: | + Header line which should be appended to the VCF header, can be given multiple times. + + - name: --header_lines + alternatives: --h + type: file + description: | + File with header lines to append to the VCF header. + For example: + ##INFO= + ##INFO= + + - name: --set_id + alternatives: --I + type: string + description: | + Set ID column using a `bcftools query`-like expression, see man page for details. + + - name: --include + type: string + description: | + Select sites for which the expression is true. + See https://samtools.github.io/bcftools/bcftools.html#expressions for details. + example: 'QUAL >= 30 && DP >= 10' + + - name: --keep_sites + alternatives: --k + type: boolean_true + description: | + Leave --include/--exclude sites unchanged instead of discarding them. + + - name: --merge_logic + alternatives: --l + type: string + choices: + description: | + When multiple regions overlap a single record, this option defines how to treat multiple annotation values. + See man page for more details. + + - name: --mark_sites + alternatives: --m + type: string + description: | + Annotate sites which are present ("+") or absent ("-") in the -a file with a new INFO/TAG flag. + + - name: --min_overlap + type: string + description: | + Minimum overlap required as a fraction of the variant in the annotation -a file (ANN), + in the target VCF file (:VCF), or both for reciprocal overlap (ANN:VCF). + By default overlaps of arbitrary length are sufficient. + The option can be used only with the tab-delimited annotation -a file and with BEG and END columns present. + + - name: --no_version + type: boolean_true + description: | + Do not append version and command line information to the output VCF header. + + - name: --output_type + alternatives: --O + type: string + choices: ['u', 'z', 'b', 'v'] + description: | + Output type: + u: uncompressed BCF + z: compressed VCF + b: compressed BCF + v: uncompressed VCF + + - name: --pair_logic + type: string + choices: ['snps', 'indels', 'both', 'all', 'some', 'exact'] + description: | + Controls how to match records from the annotation file to the target VCF. + Effective only when -a is a VCF or BCF file. + The option replaces the former uninuitive --collapse. + See Common Options for more. + + - name: --regions + alternatives: --r + type: string + description: | + Restrict to comma-separated list of regions. + Following formats are supported: chr|chr:pos|chr:beg-end|chr:beg-[,…​]. + example: '20:1000000-2000000' + + - name: --regions_file + alternatives: --R + type: file + description: | + Restrict to regions listed in a file. + Regions can be specified either on a VCF, BED, or tab-delimited file (the default). + For more information check manual. + + - name: --regions_overlap + type: string + choices: ['pos', 'record', 'variant', '0', '1', '2'] + description: | + This option controls how overlapping records are determined: + set to 'pos' or '0' if the VCF record has to have POS inside a region (this corresponds to the default behavior of -t/-T); + set to 'record' or '1' if also overlapping records with POS outside a region should be included (this is the default behavior of -r/-R, + and includes indels with POS at the end of a region, which are technically outside the region); + or set to 'variant' or '2' to include only true overlapping variation (compare the full VCF representation "TA>T-" vs the true sequence variation "A>-"). + + - name: --rename_annotations + type: file + description: | + Rename annotations: TYPE/old\tnew, where TYPE is one of FILTER,INFO,FORMAT. + + - name: --rename_chromosomes + type: file + description: | + Rename chromosomes according to the map in file, with "old_name new_name\n" pairs + separated by whitespaces, each on a separate line. + + - name: --samples + type: string + description: | + Subset of samples to annotate. + See also https://samtools.github.io/bcftools/bcftools.html#common_options. + + - name: --samples_file + type: file + description: | + Subset of samples to annotate in file format. + See also https://samtools.github.io/bcftools/bcftools.html#common_options. + + - name: --single_overlaps + type: boolean_true + description: | + Use this option to keep memory requirements low with very large annotation files. + Note, however, that this comes at a cost, only single overlapping intervals are considered in this mode. + This was the default mode until the commit af6f0c9 (Feb 24 2019). + + - name: --remove + alternatives: --x + type: string + description: | + List of annotations to remove. + Use "FILTER" to remove all filters or "FILTER/SomeFilter" to remove a specific filter. + Similarly, "INFO" can be used to remove all INFO tags and "FORMAT" to remove all FORMAT tags except GT. + To remove all INFO tags except "FOO" and "BAR", use "^INFO/FOO,INFO/BAR" (and similarly for FORMAT and FILTER). + "INFO" can be abbreviated to "INF" and "FORMAT" to "FMT". + +resources: + - type: bash_script + path: script.sh + +test_resources: + - type: bash_script + path: test.sh + +engines: + - type: docker + image: debian:stable-slim + setup: + - type: apt + packages: [bcftools, procps] + - type: docker + run: | + echo "bcftools: \"$(bcftools --version | grep 'bcftools' | sed -n 's/^bcftools //p')\"" > /var/software_versions.txt + test_setup: + - type: apt + packages: [tabix] + +runners: + - type: executable + - type: nextflow + diff --git a/src/bcftools/bcftools_annotate/help.txt b/src/bcftools/bcftools_annotate/help.txt new file mode 100644 index 00000000..2d1c7807 --- /dev/null +++ b/src/bcftools/bcftools_annotate/help.txt @@ -0,0 +1,41 @@ +``` +bcftools annotate -h +``` + +annotate: option requires an argument -- 'h' + +About: Annotate and edit VCF/BCF files. +Usage: bcftools annotate [options] VCF + +Options: + -a, --annotations FILE VCF file or tabix-indexed FILE with annotations: CHR\tPOS[\tVALUE]+ + -c, --columns LIST List of columns in the annotation file, e.g. CHROM,POS,REF,ALT,-,INFO/TAG. See man page for details + -C, --columns-file FILE Read -c columns from FILE, one name per row, with optional --merge-logic TYPE: NAME[ TYPE] + -e, --exclude EXPR Exclude sites for which the expression is true (see man page for details) + --force Continue despite parsing error (at your own risk!) + -H, --header-line STR Header line which should be appended to the VCF header, can be given multiple times + -h, --header-lines FILE Lines which should be appended to the VCF header + -I, --set-id [+]FORMAT Set ID column using a `bcftools query`-like expression, see man page for details + -i, --include EXPR Select sites for which the expression is true (see man page for details) + -k, --keep-sites Leave -i/-e sites unchanged instead of discarding them + -l, --merge-logic TAG:TYPE Merge logic for multiple overlapping regions (see man page for details), EXPERIMENTAL + -m, --mark-sites [+-]TAG Add INFO/TAG flag to sites which are ("+") or are not ("-") listed in the -a file + --min-overlap ANN:VCF Required overlap as a fraction of variant in the -a file (ANN), the VCF (:VCF), or reciprocal (ANN:VCF) + --no-version Do not append version and command line to the header + -o, --output FILE Write output to a file [standard output] + -O, --output-type u|b|v|z[0-9] u/b: un/compressed BCF, v/z: un/compressed VCF, 0-9: compression level [v] + --pair-logic STR Matching records by , see man page for details [some] + -r, --regions REGION Restrict to comma-separated list of regions + -R, --regions-file FILE Restrict to regions listed in FILE + --regions-overlap 0|1|2 Include if POS in the region (0), record overlaps (1), variant overlaps (2) [1] + --rename-annots FILE Rename annotations: TYPE/old\tnew, where TYPE is one of FILTER,INFO,FORMAT + --rename-chrs FILE Rename sequences according to the mapping: old\tnew + -s, --samples [^]LIST Comma separated list of samples to annotate (or exclude with "^" prefix) + -S, --samples-file [^]FILE File of samples to annotate (or exclude with "^" prefix) + --single-overlaps Keep memory low by avoiding complexities arising from handling multiple overlapping intervals + -x, --remove LIST List of annotations (e.g. ID,INFO/DP,FORMAT/DP,FILTER) to remove (or keep with "^" prefix). See man page for details + --threads INT Number of extra output compression threads [0] + +Examples: + http://samtools.github.io/bcftools/howtos/annotate.html + diff --git a/src/bcftools/bcftools_annotate/script.sh b/src/bcftools/bcftools_annotate/script.sh new file mode 100644 index 00000000..18137bbf --- /dev/null +++ b/src/bcftools/bcftools_annotate/script.sh @@ -0,0 +1,54 @@ +#!/bin/bash + +## VIASH START +## VIASH END + +# Exit on error +set -eo pipefail + +# Unset parameters +unset_if_false=( + par_force + par_keep_sites + par_no_version + par_single_overlaps +) + +for par in ${unset_if_false[@]}; do + test_val="${!par}" + [[ "$test_val" == "false" ]] && unset $par +done + +# Execute bcftools annotate with the provided arguments +bcftools annotate \ + ${par_annotations:+-a "$par_annotations"} \ + ${par_columns:+-c "$par_columns"} \ + ${par_columns_file:+-C "$par_columns_file"} \ + ${par_exclude:+-e "$par_exclude"} \ + ${par_force:+--force} \ + ${par_header_line:+-H "$par_header_line"} \ + ${par_header_lines:+-h "$par_header_lines"} \ + ${par_set_id:+-I "$par_set_id"} \ + ${par_include:+-i "$par_include"} \ + ${par_keep_sites:+-k} \ + ${par_merge_logic:+-l "$par_merge_logic"} \ + ${par_mark_sites:+-m "$par_mark_sites"} \ + ${par_min_overlap:+--min-overlap "$par_min_overlap"} \ + ${par_no_version:+--no-version} \ + ${par_samples_file:+-S "$par_samples_file"} \ + ${par_output_type:+-O "$par_output_type"} \ + ${par_pair_logic:+--pair-logic "$par_pair_logic"} \ + ${par_regions:+-r "$par_regions"} \ + ${par_regions_file:+-R "$par_regions_file"} \ + ${par_regions_overlap:+--regions-overlap "$par_regions_overlap"} \ + ${par_rename_annotations:+--rename-annots "$par_rename_annotations"} \ + ${par_rename_chromosomes:+--rename-chrs "$par_rename_chromosomes"} \ + ${par_samples:+-s "$par_samples"} \ + ${par_single_overlaps:+--single-overlaps} \ + ${par_threads:+--threads "$par_threads"} \ + ${par_remove:+-x "$par_remove"} \ + -o $par_output \ + $par_input + + + \ No newline at end of file diff --git a/src/bcftools/bcftools_annotate/test.sh b/src/bcftools/bcftools_annotate/test.sh new file mode 100644 index 00000000..39835c82 --- /dev/null +++ b/src/bcftools/bcftools_annotate/test.sh @@ -0,0 +1,305 @@ +#!/bin/bash + +## VIASH START +## VIASH END + +# Exit on error +set -eo pipefail + +#test_data="$meta_resources_dir/test_data" + +############################################# +# helper functions +assert_file_exists() { + [ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; } +} +assert_file_not_empty() { + [ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; } +} +assert_file_contains() { + grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; } +} +assert_identical_content() { + diff -a "$2" "$1" \ + || (echo "Files are not identical!" && exit 1) +} +############################################# + +# Create directories for tests +echo "Creating Test Data..." +TMPDIR=$(mktemp -d "$meta_temp_dir/XXXXXX") +function clean_up { + [[ -d "$TMPDIR" ]] && rm -r "$TMPDIR" +} +trap clean_up EXIT + +# Create test data +cat < "$TMPDIR/example.vcf" +##fileformat=VCFv4.1 +##contig= +#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1 +1 752567 llama A C . . . . . +1 752722 . G A . . . . . +EOF + +bgzip -c $TMPDIR/example.vcf > $TMPDIR/example.vcf.gz +tabix -p vcf $TMPDIR/example.vcf.gz + +cat < "$TMPDIR/annots.tsv" +1 752567 752567 FooValue1 12345 +1 752722 752722 FooValue2 67890 +EOF + +cat < "$TMPDIR/rename.tsv" +INFO/. Luigi +EOF + +bgzip $TMPDIR/annots.tsv +tabix -s1 -b2 -e3 $TMPDIR/annots.tsv.gz + +cat < "$TMPDIR/header.hdr" +##FORMAT= +##INFO= +EOF + +cat < "$TMPDIR/rename_chrm.tsv" +1 chr1 +2 chr2 +EOF + +# Test 1: Remove ID annotations +mkdir "$TMPDIR/test1" && pushd "$TMPDIR/test1" > /dev/null + +echo "> Run bcftools_annotate remove annotations" +"$meta_executable" \ + --input "../example.vcf" \ + --output "annotated.vcf" \ + --remove "ID" \ + +# checks +assert_file_exists "annotated.vcf" +assert_file_not_empty "annotated.vcf" +assert_file_contains "annotated.vcf" "1 752567 . A C" +echo "- test1 succeeded -" + +popd > /dev/null + +# Test 2: Annotate with -a, -c and -h options +mkdir "$TMPDIR/test2" && pushd "$TMPDIR/test2" > /dev/null + +echo "> Run bcftools_annotate with -a, -c and -h options" +"$meta_executable" \ + --input "../example.vcf" \ + --output "annotated.vcf" \ + --annotations "../annots.tsv.gz" \ + --header_lines "../header.hdr" \ + --columns "CHROM,FROM,TO,FMT/FOO,BAR" \ + --mark_sites "BAR" \ + +# checks +assert_file_exists "annotated.vcf" +assert_file_not_empty "annotated.vcf" +assert_file_contains "annotated.vcf" $(echo -e "1\t752567\tllama\tA\tC\t.\t.\tBAR=12345\tFOO\tFooValue1") +echo "- test2 succeeded -" + +popd > /dev/null + +# Test 3: +mkdir "$TMPDIR/test3" && pushd "$TMPDIR/test3" > /dev/null + +echo "> Run bcftools_annotate with --set_id option" +"$meta_executable" \ + --input "../example.vcf" \ + --output "annotated.vcf" \ + --set_id "+'%CHROM\_%POS\_%REF\_%FIRST_ALT'" \ + +# checks +assert_file_exists "annotated.vcf" +assert_file_not_empty "annotated.vcf" +assert_file_contains "annotated.vcf" "'1_752722_G_A'" +echo "- test3 succeeded -" + +popd > /dev/null + +# Test 4: +mkdir "$TMPDIR/test4" && pushd "$TMPDIR/test4" > /dev/null + +echo "> Run bcftools_annotate with --rename-annotations option" +"$meta_executable" \ + --input "../example.vcf" \ + --output "annotated.vcf" \ + --rename_annotations "../rename.tsv" + +# checks +assert_file_exists "annotated.vcf" +assert_file_not_empty "annotated.vcf" +assert_file_contains "annotated.vcf" "##bcftools_annotateCommand=annotate --rename-annots ../rename.tsv -o annotated.vcf" +echo "- test4 succeeded -" + +popd > /dev/null + +# Test 5: Rename chromosomes +mkdir "$TMPDIR/test5" && pushd "$TMPDIR/test5" > /dev/null + +echo "> Run bcftools_annotate with --rename-chromosomes option" +"$meta_executable" \ + --input "../example.vcf" \ + --output "annotated.vcf" \ + --rename_chromosomes "../rename_chrm.tsv" + +# checks +assert_file_exists "annotated.vcf" +assert_file_not_empty "annotated.vcf" +assert_file_contains "annotated.vcf" "chr1" +echo "- test5 succeeded -" + +popd > /dev/null + +# Test 6: Sample option +mkdir "$TMPDIR/test6" && pushd "$TMPDIR/test6" > /dev/null + +echo "> Run bcftools_annotate with -s option" +"$meta_executable" \ + --input "../example.vcf" \ + --output "annotated.vcf" \ + --samples "SAMPLE1" + +# checks +assert_file_exists "annotated.vcf" +assert_file_not_empty "annotated.vcf" +assert_file_contains "annotated.vcf" "##bcftools_annotateCommand=annotate -s SAMPLE1 -o annotated.vcf ../example.vcf" +echo "- test6 succeeded -" + +popd > /dev/null + +# Test 7: Single overlaps +mkdir "$TMPDIR/test7" && pushd "$TMPDIR/test7" > /dev/null + +echo "> Run bcftools_annotate with --single-overlaps option" +"$meta_executable" \ + --input "../example.vcf" \ + --output "annotated.vcf" \ + --single_overlaps \ + --keep_sites \ + +# checks +assert_file_exists "annotated.vcf" +assert_file_not_empty "annotated.vcf" +assert_file_contains "annotated.vcf" "annotate -k --single-overlaps -o annotated.vcf ../example.vcf" +echo "- test7 succeeded -" + +popd > /dev/null + +# Test 8: Min overlap +mkdir "$TMPDIR/test8" && pushd "$TMPDIR/test8" > /dev/null + +echo "> Run bcftools_annotate with --min-overlap option" +"$meta_executable" \ + --input "../example.vcf" \ + --output "annotated.vcf" \ + --annotations "../annots.tsv.gz" \ + --columns "CHROM,FROM,TO,FMT/FOO,BAR" \ + --header_lines "../header.hdr" \ + --min_overlap "1" + +# checks +assert_file_exists "annotated.vcf" +assert_file_not_empty "annotated.vcf" +assert_file_contains "annotated.vcf" "annotate -a ../annots.tsv.gz -c CHROM,FROM,TO,FMT/FOO,BAR -h ../header.hdr --min-overlap 1 -o annotated.vcf ../example.vcf" +echo "- test8 succeeded -" + +popd > /dev/null + +# Test 9: Regions +mkdir "$TMPDIR/test9" && pushd "$TMPDIR/test9" > /dev/null + +echo "> Run bcftools_annotate with -r option" +"$meta_executable" \ + --input "../example.vcf.gz" \ + --output "annotated.vcf" \ + --regions "1:752567-752722" + +# checks +assert_file_exists "annotated.vcf" +assert_file_not_empty "annotated.vcf" +assert_file_contains "annotated.vcf" "annotate -r 1:752567-752722 -o annotated.vcf ../example.vcf.gz" +echo "- test9 succeeded -" + +popd > /dev/null + +# Test 10: pair-logic +mkdir "$TMPDIR/test10" && pushd "$TMPDIR/test10" > /dev/null + +echo "> Run bcftools_annotate with --pair-logic option" +"$meta_executable" \ + --input "../example.vcf" \ + --output "annotated.vcf" \ + --pair_logic "all" + +# checks +assert_file_exists "annotated.vcf" +assert_file_not_empty "annotated.vcf" +assert_file_contains "annotated.vcf" "annotate --pair-logic all -o annotated.vcf ../example.vcf" +echo "- test10 succeeded -" + +popd > /dev/null + +# Test 11: regions-overlap +mkdir "$TMPDIR/test11" && pushd "$TMPDIR/test11" > /dev/null + +echo "> Run bcftools_annotate with --regions-overlap option" +"$meta_executable" \ + --input "../example.vcf" \ + --output "annotated.vcf" \ + --regions_overlap "1" + +# checks +assert_file_exists "annotated.vcf" +assert_file_not_empty "annotated.vcf" +assert_file_contains "annotated.vcf" "annotate --regions-overlap 1 -o annotated.vcf ../example.vcf" +echo "- test11 succeeded -" + +popd > /dev/null + +# Test 12: include +mkdir "$TMPDIR/test12" && pushd "$TMPDIR/test12" > /dev/null + +echo "> Run bcftools_annotate with -i option" +"$meta_executable" \ + --input "../example.vcf" \ + --output "annotated.vcf" \ + --include "FILTER='PASS'" \ + +# checks +assert_file_exists "annotated.vcf" +assert_file_not_empty "annotated.vcf" +assert_file_contains "annotated.vcf" "annotate -i FILTER='PASS' -o annotated.vcf ../example.vcf" +echo "- test12 succeeded -" + +popd > /dev/null + +# Test 13: exclude +mkdir "$TMPDIR/test13" && pushd "$TMPDIR/test13" > /dev/null + +echo "> Run bcftools_annotate with -e option" +"$meta_executable" \ + --annotations "../annots.tsv.gz" \ + --input "../example.vcf" \ + --output "annotated.vcf" \ + --exclude "FILTER='PASS'" \ + --header_lines "../header.hdr" \ + --columns "CHROM,FROM,TO,FMT/FOO,BAR" \ + --merge_logic "FOO:first" \ + +# checks +assert_file_exists "annotated.vcf" +assert_file_not_empty "annotated.vcf" +assert_file_contains "annotated.vcf" "annotate -a ../annots.tsv.gz -c CHROM,FROM,TO,FMT/FOO,BAR -e FILTER='PASS' -h ../header.hdr -l FOO:first -o annotated.vcf ../example.vcf" +echo "- test13 succeeded -" + +popd > /dev/null + + +echo "---- All tests succeeded! ----" +exit 0 + diff --git a/src/bcftools/bcftools_concat/config.vsh.yaml b/src/bcftools/bcftools_concat/config.vsh.yaml new file mode 100644 index 00000000..2bb32f1c --- /dev/null +++ b/src/bcftools/bcftools_concat/config.vsh.yaml @@ -0,0 +1,172 @@ +name: bcftools_concat +namespace: bcftools +description: | + Concatenate or combine VCF/BCF files. All source files must have the same sample + columns appearing in the same order. The program can be used, for example, to + concatenate chromosome VCFs into one VCF, or combine a SNP VCF and an indel + VCF into one. The input files must be sorted by chr and position. The files + must be given in the correct order to produce sorted VCF on output unless + the -a, --allow-overlaps option is specified. With the --naive option, the files + are concatenated without being recompressed, which is very fast. +keywords: [Concatenate, VCF, BCF] +links: + homepage: https://samtools.github.io/bcftools/ + documentation: https://samtools.github.io/bcftools/bcftools.html#concat + repository: https://github.com/samtools/bcftools + issue_tracker: https://github.com/samtools/bcftools/issues +references: + doi: https://doi.org/10.1093/gigascience/giab008 +license: MIT/Expat, GNU +requirements: + commands: [bcftools] +authors: + - __merge__: /src/_authors/theodoro_gasperin.yaml + roles: [author] + +argument_groups: + - name: Inputs + arguments: + - name: --input + alternatives: -i + type: file + multiple: true + description: Input VCF/BCF files to concatenate. + + - name: --file_list + alternatives: -f + type: file + description: Read the list of VCF/BCF files from a file, one file name per line. + + - name: Outputs + arguments: + - name: --output + alternatives: -o + direction: output + type: file + description: Output concatenated VCF/BCF file. + required: true + + - name: Options + arguments: + + - name: --allow_overlaps + alternatives: -a + type: boolean_true + description: | + First coordinate of the next file can precede last record of the current file. + + - name: --compact_PS + alternatives: -c + type: boolean_true + description: | + Do not output PS tag at each site, only at the start of a new phase set block. + + - name: --remove_duplicates + alternatives: -d + type: string + choices: ['snps', 'indels', 'both', 'all', 'exact', 'none'] + description: | + Output duplicate records present in multiple files only once: . + + - name: --ligate + alternatives: -l + type: boolean_true + description: Ligate phased VCFs by matching phase at overlapping haplotypes. + + - name: --ligate_force + type: boolean_true + description: Ligate even non-overlapping chunks, keep all sites. + + - name: --ligate_warn + type: boolean_true + description: Drop sites in imperfect overlaps. + + - name: --no_version + type: boolean_true + description: Do not append version and command line information to the header. + + - name: --naive + alternatives: -n + type: boolean_true + description: Concatenate files without recompression, a header check compatibility is performed. + + - name: --naive_force + type: boolean_true + description: | + Same as --naive, but header compatibility is not checked. + Dangerous, use with caution. + + - name: --output_type + alternatives: -O + type: string + choices: ['u', 'z', 'b', 'v'] + description: | + Output type: + u: uncompressed BCF + z: compressed VCF + b: compressed BCF + v: uncompressed VCF + + - name: --min_PQ + alternatives: -q + type: integer + description: Break phase set if phasing quality is lower than . + example: 30 + + - name: --regions + alternatives: -r + type: string + description: | + Restrict to comma-separated list of regions. + Following formats are supported: chr|chr:pos|chr:beg-end|chr:beg-[,…​]. + example: '20:1000000-2000000' + + - name: --regions_file + alternatives: -R + type: file + description: | + Restrict to regions listed in a file. + Regions can be specified either on a VCF, BED, or tab-delimited file (the default). + For more information check manual. + + - name: --regions_overlap + type: string + choices: ['pos', 'record', 'variant', '0', '1', '2'] + description: | + This option controls how overlapping records are determined: + set to 'pos' or '0' if the VCF record has to have POS inside a region (this corresponds to the default behavior of -t/-T); + set to 'record' or '1' if also overlapping records with POS outside a region should be included (this is the default behavior of -r/-R, + and includes indels with POS at the end of a region, which are technically outside the region); + or set to 'variant' or '2' to include only true overlapping variation (compare the full VCF representation "TA>T-" vs the true sequence variation "A>-"). + + #PS: Verbose seems to be broken in this version of bcftools + # - name: --verbose + # alternatives: -v + # type: integer + # choices: [0, 1] + # description: Set verbosity level. + +resources: + - type: bash_script + path: script.sh + +test_resources: + - type: bash_script + path: test.sh + +engines: + - type: docker + image: debian:stable-slim + setup: + - type: apt + packages: [bcftools, procps] + - type: docker + run: | + echo "bcftools: \"$(bcftools --version | grep 'bcftools' | sed -n 's/^bcftools //p')\"" > /var/software_versions.txt + test_setup: + - type: apt + packages: [tabix] + +runners: + - type: executable + - type: nextflow \ No newline at end of file diff --git a/src/bcftools/bcftools_concat/help.txt b/src/bcftools/bcftools_concat/help.txt new file mode 100644 index 00000000..fc0f1914 --- /dev/null +++ b/src/bcftools/bcftools_concat/help.txt @@ -0,0 +1,36 @@ +``` +bcftools concat -h +``` + +concat: option requires an argument -- 'h' + +About: Concatenate or combine VCF/BCF files. All source files must have the same sample + columns appearing in the same order. The program can be used, for example, to + concatenate chromosome VCFs into one VCF, or combine a SNP VCF and an indel + VCF into one. The input files must be sorted by chr and position. The files + must be given in the correct order to produce sorted VCF on output unless + the -a, --allow-overlaps option is specified. With the --naive option, the files + are concatenated without being recompressed, which is very fast. +Usage: bcftools concat [options] [ [...]] + +Options: + -a, --allow-overlaps First coordinate of the next file can precede last record of the current file. + -c, --compact-PS Do not output PS tag at each site, only at the start of a new phase set block. + -d, --rm-dups STRING Output duplicate records present in multiple files only once: + -D, --remove-duplicates Alias for -d exact + -f, --file-list FILE Read the list of files from a file. + -l, --ligate Ligate phased VCFs by matching phase at overlapping haplotypes + --ligate-force Ligate even non-overlapping chunks, keep all sites + --ligate-warn Drop sites in imperfect overlaps + --no-version Do not append version and command line to the header + -n, --naive Concatenate files without recompression, a header check compatibility is performed + --naive-force Same as --naive, but header compatibility is not checked. Dangerous, use with caution. + -o, --output FILE Write output to a file [standard output] + -O, --output-type u|b|v|z[0-9] u/b: un/compressed BCF, v/z: un/compressed VCF, 0-9: compression level [v] + -q, --min-PQ INT Break phase set if phasing quality is lower than [30] + -r, --regions REGION Restrict to comma-separated list of regions + -R, --regions-file FILE Restrict to regions listed in a file + --regions-overlap 0|1|2 Include if POS in the region (0), record overlaps (1), variant overlaps (2) [1] + --threads INT Use multithreading with worker threads [0] + -v, --verbose 0|1 Set verbosity level [1] + diff --git a/src/bcftools/bcftools_concat/script.sh b/src/bcftools/bcftools_concat/script.sh new file mode 100644 index 00000000..5614cd1b --- /dev/null +++ b/src/bcftools/bcftools_concat/script.sh @@ -0,0 +1,54 @@ +#!/bin/bash + +## VIASH START +## VIASH END + +# Exit on error +set -eo pipefail + +# Unset parameters +unset_if_false=( + par_allow_overlaps + par_compact_PS + par_ligate + par_ligate_force + par_ligate_warn + par_no_version + par_naive + par_naive_force +) + +for par in ${unset_if_false[@]}; do + test_val="${!par}" + [[ "$test_val" == "false" ]] && unset $par +done + +# Check to see whether the par_input or the par_file_list is set +if [[ -z "${par_input}" && -z "${par_file_list}" ]]; then + echo "Error: One of the parameters '--input' or '--file_list' must be used." + exit 1 +fi + +# Create input array +IFS=";" read -ra input <<< $par_input + +# Execute bcftools concat with the provided arguments +bcftools concat \ + ${par_allow_overlaps:+-a} \ + ${par_compact_PS:+-c} \ + ${par_remove_duplicates:+-d "$par_remove_duplicates"} \ + ${par_ligate:+-l} \ + ${par_ligate_force:+--ligate-force} \ + ${par_ligate_warn:+--ligate-warn} \ + ${par_no_version:+--no-version} \ + ${par_naive:+-n} \ + ${par_naive_force:+--naive-force} \ + ${par_output_type:+--O "$par_output_type"} \ + ${par_min_PQ:+-q "$par_min_PQ"} \ + ${par_regions:+-r "$par_regions"} \ + ${par_regions_file:+-R "$par_regions_file"} \ + ${par_regions_overlap:+--regions-overlap "$par_regions_overlap"} \ + ${meta_cpus:+--threads "$meta_cpus"} \ + -o $par_output \ + ${par_file_list:+-f "$par_file_list"} \ + ${input[@]} \ \ No newline at end of file diff --git a/src/bcftools/bcftools_concat/test.sh b/src/bcftools/bcftools_concat/test.sh new file mode 100644 index 00000000..3c1c7bb6 --- /dev/null +++ b/src/bcftools/bcftools_concat/test.sh @@ -0,0 +1,227 @@ +#!/bin/bash + +## VIASH START +## VIASH END + +# Exit on error +set -eo pipefail + +#test_data="$meta_resources_dir/test_data" + +############################################# +# helper functions +assert_file_exists() { + [ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; } +} +assert_file_not_empty() { + [ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; } +} +assert_file_contains() { + grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; } +} +assert_identical_content() { + diff -a "$2" "$1" \ + || (echo "Files are not identical!" && exit 1) +} +############################################# + +# Create directories for tests +echo "Creating Test Data..." +TMPDIR=$(mktemp -d "$meta_temp_dir/XXXXXX") +function clean_up { + [[ -d "$TMPDIR" ]] && rm -r "$TMPDIR" +} +trap clean_up EXIT + +# Create test data +cat < "$TMPDIR/example.vcf" +##fileformat=VCFv4.1 +##contig= +#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1 +1 752567 llama G C,A 15 . . . 1/2 +1 752752 . G A,AAA 20 . . . ./. +EOF + +bgzip -c $TMPDIR/example.vcf > $TMPDIR/example.vcf.gz +tabix -p vcf $TMPDIR/example.vcf.gz + +cat < "$TMPDIR/example_2.vcf" +##fileformat=VCFv4.1 +##contig= +#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1 +1 752569 cat G C,A 15 . . . 1/2 +1 752739 . G A,AAA 20 . . . ./. +EOF + +bgzip -c $TMPDIR/example_2.vcf > $TMPDIR/example_2.vcf.gz +tabix -p vcf $TMPDIR/example_2.vcf.gz + +cat < "$TMPDIR/file_list.txt" +$TMPDIR/example.vcf.gz +$TMPDIR/example_2.vcf.gz +EOF + +# Test 1: Default test +mkdir "$TMPDIR/test1" && pushd "$TMPDIR/test1" > /dev/null + +echo "> Run bcftools_concat default test" +"$meta_executable" \ + --input "../example.vcf" \ + --input "../example_2.vcf" \ + --output "concatenated.vcf" \ + &> /dev/null + +# checks +assert_file_exists "concatenated.vcf" +assert_file_not_empty "concatenated.vcf" +assert_file_contains "concatenated.vcf" "concat -o concatenated.vcf ../example.vcf ../example_2.vcf" +echo "- test1 succeeded -" + +popd > /dev/null + +# Test 2: Allow overlaps, compact PS and remove duplicates +mkdir "$TMPDIR/test2" && pushd "$TMPDIR/test2" > /dev/null + +echo "> Run bcftools_concat test with allow overlaps, and remove duplicates" +"$meta_executable" \ + --input "../example.vcf.gz" \ + --input "../example_2.vcf.gz" \ + --output "concatenated.vcf" \ + --allow_overlaps \ + --remove_duplicates 'none' \ + &> /dev/null + +# checks +assert_file_exists "concatenated.vcf" +assert_file_not_empty "concatenated.vcf" +assert_file_contains "concatenated.vcf" "concat -a -d none -o concatenated.vcf ../example.vcf.gz ../example_2.vcf.gz" +echo "- test2 succeeded -" + +popd > /dev/null + + +# Test 3: Ligate, ligate force and ligate warn +mkdir "$TMPDIR/test3" && pushd "$TMPDIR/test3" > /dev/null + +echo "> Run bcftools_concat test with ligate, ligate force and ligate warn" +"$meta_executable" \ + --input "../example.vcf.gz" \ + --input "../example_2.vcf.gz" \ + --output "concatenated.vcf" \ + --ligate \ + --compact_PS \ + &> /dev/null + + +# checks +assert_file_exists "concatenated.vcf" +assert_file_not_empty "concatenated.vcf" +assert_file_contains "concatenated.vcf" "concat -c -l -o concatenated.vcf ../example.vcf.gz ../example_2.vcf.gz" +echo "- test3 succeeded -" + +popd > /dev/null + +# Test 4: file list with ligate force and ligate warn +mkdir "$TMPDIR/test4" && pushd "$TMPDIR/test4" > /dev/null + +echo "> Run bcftools_concat test with file list, ligate force and ligate warn" +"$meta_executable" \ + --file_list "../file_list.txt" \ + --output "concatenated.vcf" \ + --ligate_force \ + &> /dev/null + +# checks +assert_file_exists "concatenated.vcf" +assert_file_not_empty "concatenated.vcf" +assert_file_contains "concatenated.vcf" "concat --ligate-force -o concatenated.vcf -f ../file_list.txt" +echo "- test4 succeeded -" + +popd > /dev/null + +# Test 5: ligate warn and naive +mkdir "$TMPDIR/test5" && pushd "$TMPDIR/test5" > /dev/null + +echo "> Run bcftools_concat test with ligate warn and naive" +"$meta_executable" \ + --input "../example.vcf.gz" \ + --input "../example_2.vcf.gz" \ + --output "concatenated.vcf.gz" \ + --ligate_warn \ + --naive \ + &> /dev/null + +bgzip -d concatenated.vcf.gz + +# checks +assert_file_exists "concatenated.vcf" +assert_file_not_empty "concatenated.vcf" +assert_file_contains "concatenated.vcf" "##fileformat=VCFv4.1" +echo "- test5 succeeded -" + +popd > /dev/null + +# Test 6: minimal PQ +mkdir "$TMPDIR/test6" && pushd "$TMPDIR/test6" > /dev/null + +echo "> Run bcftools_concat test with minimal PQ" +"$meta_executable" \ + --input "../example.vcf.gz" \ + --input "../example_2.vcf.gz" \ + --output "concatenated.vcf" \ + --min_PQ 20 \ + &> /dev/null + +# checks +assert_file_exists "concatenated.vcf" +assert_file_not_empty "concatenated.vcf" +assert_file_contains "concatenated.vcf" "concat -q 20 -o concatenated.vcf ../example.vcf.gz ../example_2.vcf.gz" +echo "- test6 succeeded -" + +popd > /dev/null + +# Test 7: regions +mkdir "$TMPDIR/test7" && pushd "$TMPDIR/test7" > /dev/null + +echo "> Run bcftools_concat test with regions" +"$meta_executable" \ + --input "../example.vcf.gz" \ + --input "../example_2.vcf.gz" \ + --output "concatenated.vcf" \ + --allow_overlaps \ + --regions "1:752569-752739" \ + &> /dev/null + +# checks +assert_file_exists "concatenated.vcf" +assert_file_not_empty "concatenated.vcf" +assert_file_contains "concatenated.vcf" "concat -a -r 1:752569-752739 -o concatenated.vcf ../example.vcf.gz ../example_2.vcf.gz" +echo "- test7 succeeded -" + +popd > /dev/null + +# Test 8: regions overlap +mkdir "$TMPDIR/test8" && pushd "$TMPDIR/test8" > /dev/null + +echo "> Run bcftools_concat test with regions overlap" +"$meta_executable" \ + --input "../example.vcf.gz" \ + --input "../example_2.vcf.gz" \ + --output "concatenated.vcf" \ + --allow_overlaps \ + --regions_overlap 'pos' \ + &> /dev/null + +# checks +assert_file_exists "concatenated.vcf" +assert_file_not_empty "concatenated.vcf" +assert_file_contains "concatenated.vcf" "concat -a --regions-overlap pos -o concatenated.vcf ../example.vcf.gz ../example_2.vcf.gz" +echo "- test8 succeeded -" + +popd > /dev/null + +echo "---- All tests succeeded! ----" +exit 0 + + + diff --git a/src/bcftools/bcftools_norm/config.vsh.yaml b/src/bcftools/bcftools_norm/config.vsh.yaml new file mode 100644 index 00000000..5c525d3a --- /dev/null +++ b/src/bcftools/bcftools_norm/config.vsh.yaml @@ -0,0 +1,194 @@ +name: bcftools_norm +namespace: bcftools +description: | + Left-align and normalize indels, check if REF alleles match the reference, split multiallelic sites into multiple rows; + recover multiallelics from multiple rows. +keywords: [Normalize, VCF, BCF] +links: + homepage: https://samtools.github.io/bcftools/ + documentation: https://samtools.github.io/bcftools/bcftools.html#norm + repository: https://github.com/samtools/bcftools + issue_tracker: https://github.com/samtools/bcftools/issues +references: + doi: https://doi.org/10.1093/gigascience/giab008 +license: MIT/Expat, GNU +requirements: + commands: [bcftools] +authors: + - __merge__: /src/_authors/theodoro_gasperin.yaml + roles: [author] + +argument_groups: + - name: Inputs + arguments: + - name: --input + alternatives: -i + type: file + description: Input VCF/BCF file. + required: true + + - name: Outputs + arguments: + - name: --output + alternatives: -o + direction: output + type: file + description: Output normalized VCF/BCF file. + required: true + + - name: Options + arguments: + + - name: --atomize + alternatives: -a + type: boolean_true + description: | + Decompose complex variants (e.g., MNVs become consecutive SNVs). + + - name: --atom_overlaps + type: string + choices: [".", "*"] + description: | + Use the star allele (*) for overlapping alleles or set to missing (.). + + - name: --check_ref + alternatives: -c + type: string + choices: ['e', 'w', 'x', 's'] + description: | + Check REF alleles and exit (e), warn (w), exclude (x), or set (s) bad sites. + + - name: --remove_duplicates + alternatives: -d + type: string + choices: ['snps', 'indels', 'both', 'all', 'exact', 'none'] + description: Remove duplicate snps, indels, both, all, exact matches, or none (old -D option). + + - name: --fasta_ref + alternatives: -f + type: file + description: Reference fasta sequence file. + + - name: --force + type: boolean_true + description: | + Try to proceed even if malformed tags are encountered. + Experimental, use at your own risk. + + - name: --keep_sum + type: string + description: | + Keep vector sum constant when splitting multiallelics (see github issue #360). + + - name: --multiallelics + alternatives: -m + type: string + choices: ['+snps', '+indels', '+both', '+any', '-snps', '-indels', '-both', '-any'] + description: | + Split multiallelics (-) or join biallelics (+), type: snps, indels, both, any [default: both]. + + - name: --no_version + type: boolean_true + description: Do not append version and command line information to the header. + + - name: --do_not_normalize + alternatives: -N + type: boolean_true + description: Do not normalize indels (with -m or -c s). + + - name: --output_type + alternatives: --O + type: string + choices: ['u', 'z', 'b', 'v'] + description: | + Output type: + u: uncompressed BCF + z: compressed VCF + b: compressed BCF + v: uncompressed VCF + + - name: --old_rec_tag + type: string + description: Annotate modified records with INFO/STR indicating the original variant. + + - name: --regions + alternatives: --r + type: string + description: | + Restrict to comma-separated list of regions. + Following formats are supported: chr|chr:pos|chr:beg-end|chr:beg-[,…​]. + example: '20:1000000-2000000' + + - name: --regions_file + alternatives: --R + type: file + description: | + Restrict to regions listed in a file. + Regions can be specified either on a VCF, BED, or tab-delimited file (the default). + For more information check manual. + + - name: --regions_overlap + type: string + choices: ['pos', 'record', 'variant', '0', '1', '2'] + description: | + This option controls how overlapping records are determined: + set to 'pos' or '0' if the VCF record has to have POS inside a region (this corresponds to the default behavior of -t/-T); + set to 'record' or '1' if also overlapping records with POS outside a region should be included (this is the default behavior of -r/-R, + and includes indels with POS at the end of a region, which are technically outside the region); + or set to 'variant' or '2' to include only true overlapping variation (compare the full VCF representation "TA>T-" vs the true sequence variation "A>-"). + + - name: --site_win + alternatives: -w + type: integer + description: | + Buffer for sorting lines that changed position during realignment. + + - name: --strict_filter + alternatives: -s + type: boolean_true + description: When merging (-m+), merged site is PASS only if all sites being merged PASS. + + - name: --targets + alternatives: -t + type: string + description: Similar to --regions but streams rather than index-jumps. + example: '20:1000000-2000000' + + - name: --targets_file + alternatives: -T + type: file + description: Similar to --regions_file but streams rather than index-jumps. + + - name: --targets_overlap + type: string + choices: ['pos', 'record', 'variant', '0', '1', '2'] + description: | + Include if POS in the region (0), record overlaps (1), variant overlaps (2). + Similar to --regions_overlap. + +resources: + - type: bash_script + path: script.sh + +test_resources: + - type: bash_script + path: test.sh + +engines: + - type: docker + image: debian:stable-slim + setup: + - type: apt + packages: [bcftools, procps] + - type: docker + run: | + echo "bcftools: \"$(bcftools --version | grep 'bcftools' | sed -n 's/^bcftools //p')\"" > /var/software_versions.txt + test_setup: + - type: apt + packages: [tabix] + +runners: + - type: executable + - type: nextflow + + diff --git a/src/bcftools/bcftools_norm/help.txt b/src/bcftools/bcftools_norm/help.txt new file mode 100644 index 00000000..02e9761a --- /dev/null +++ b/src/bcftools/bcftools_norm/help.txt @@ -0,0 +1,41 @@ +``` +bcftools norm -h +``` + +About: Left-align and normalize indels; check if REF alleles match the reference; + split multiallelic sites into multiple rows; recover multiallelics from + multiple rows. +Usage: bcftools norm [options] + +Options: + -a, --atomize Decompose complex variants (e.g. MNVs become consecutive SNVs) + --atom-overlaps '*'|. Use the star allele (*) for overlapping alleles or set to missing (.) [*] + -c, --check-ref e|w|x|s Check REF alleles and exit (e), warn (w), exclude (x), or set (s) bad sites [e] + -D, --remove-duplicates Remove duplicate lines of the same type. + -d, --rm-dup TYPE Remove duplicate snps|indels|both|all|exact + -f, --fasta-ref FILE Reference sequence + --force Try to proceed even if malformed tags are encountered. Experimental, use at your own risk + --keep-sum TAG,.. Keep vector sum constant when splitting multiallelics (see github issue #360) + -m, --multiallelics -|+TYPE Split multiallelics (-) or join biallelics (+), type: snps|indels|both|any [both] + --no-version Do not append version and command line to the header + -N, --do-not-normalize Do not normalize indels (with -m or -c s) + --old-rec-tag STR Annotate modified records with INFO/STR indicating the original variant + -o, --output FILE Write output to a file [standard output] + -O, --output-type u|b|v|z[0-9] u/b: un/compressed BCF, v/z: un/compressed VCF, 0-9: compression level [v] + -r, --regions REGION Restrict to comma-separated list of regions + -R, --regions-file FILE Restrict to regions listed in a file + --regions-overlap 0|1|2 Include if POS in the region (0), record overlaps (1), variant overlaps (2) [1] + -s, --strict-filter When merging (-m+), merged site is PASS only if all sites being merged PASS + -t, --targets REGION Similar to -r but streams rather than index-jumps + -T, --targets-file FILE Similar to -R but streams rather than index-jumps + --targets-overlap 0|1|2 Include if POS in the region (0), record overlaps (1), variant overlaps (2) [0] + --threads INT Use multithreading with worker threads [0] + -w, --site-win INT Buffer for sorting lines which changed position during realignment [1000] + +Examples: + # normalize and left-align indels + bcftools norm -f ref.fa in.vcf + + # split multi-allelic sites + bcftools norm -m- in.vcf + diff --git a/src/bcftools/bcftools_norm/script.sh b/src/bcftools/bcftools_norm/script.sh new file mode 100644 index 00000000..0f43e593 --- /dev/null +++ b/src/bcftools/bcftools_norm/script.sh @@ -0,0 +1,49 @@ +#!/bin/bash + +## VIASH START +## VIASH END + +# Exit on error +set -eo pipefail + +# Unset parameters +unset_if_false=( + par_atomize + par_remove_duplicates + par_force + par_no_version + par_do_not_normalize + par_strict_filter +) + +for par in ${unset_if_false[@]}; do + test_val="${!par}" + [[ "$test_val" == "false" ]] && unset $par +done + +# Execute bcftools norm with the provided arguments +bcftools norm \ + ${par_atomize:+--atomize} \ + ${par_atom_overlaps:+--atom-overlaps "$par_atom_overlaps"} \ + ${par_check_ref:+-c "$par_check_ref"} \ + ${par_remove_duplicates:+-d "$par_remove_duplicates"} \ + ${par_fasta_ref:+-f "$par_fasta_ref"} \ + ${par_force:+--force} \ + ${par_keep_sum:+--keep-sum "$par_keep_sum"} \ + ${par_multiallelics:+-m "$par_multiallelics"} \ + ${par_no_version:+--no-version} \ + ${par_do_not_normalize:+-N} \ + ${par_old_rec_tag:+--old-rec-tag "$par_old_rec_tag"} \ + ${par_regions:+-r "$par_regions"} \ + ${par_regions_file:+-R "$par_regions_file"} \ + ${par_regions_overlap:+--regions-overlap "$par_regions_overlap"} \ + ${par_site_win:+-w "$par_site_win"} \ + ${par_strict_filter:+-s} \ + ${par_targets:+-t "$par_targets"} \ + ${par_targets_file:+-T "$par_targets_file"} \ + ${par_targets_overlap:+--targets-overlap "$par_targets_overlap"} \ + ${meta_cpus:+--threads "$meta_cpus"} \ + ${par_output_type:+-O "$par_output_type"} \ + -o $par_output \ + $par_input + diff --git a/src/bcftools/bcftools_norm/test.sh b/src/bcftools/bcftools_norm/test.sh new file mode 100644 index 00000000..254c7176 --- /dev/null +++ b/src/bcftools/bcftools_norm/test.sh @@ -0,0 +1,231 @@ +#!/bin/bash + +## VIASH START +## VIASH END + +# Exit on error +set -eo pipefail + +#test_data="$meta_resources_dir/test_data" + +############################################# +# helper functions +assert_file_exists() { + [ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; } +} +assert_file_not_empty() { + [ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; } +} +assert_file_contains() { + grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; } +} +assert_identical_content() { + diff -a "$2" "$1" \ + || (echo "Files are not identical!" && exit 1) +} +############################################# + +# Create directories for tests +echo "Creating Test Data..." +TMPDIR=$(mktemp -d "$meta_temp_dir/XXXXXX") +function clean_up { + [[ -d "$TMPDIR" ]] && rm -r "$TMPDIR" +} +trap clean_up EXIT + +# Create test data +cat < "$TMPDIR/example.vcf" +##fileformat=VCFv4.1 +##contig= +#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1 +1 752567 llama G C,A . . . . 1/2 +1 752722 . G A,AAA . . . . ./. +EOF + +bgzip -c $TMPDIR/example.vcf > $TMPDIR/example.vcf.gz +tabix -p vcf $TMPDIR/example.vcf.gz + +cat < "$TMPDIR/reference.fa" +>1 +ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG +>2 +CGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGAT +EOF + +# Test 1: Remove ID annotations +mkdir "$TMPDIR/test1" && pushd "$TMPDIR/test1" > /dev/null + +echo "> Run bcftools_norm" +"$meta_executable" \ + --input "../example.vcf" \ + --output "normalized.vcf" \ + --atomize \ + --atom_overlaps "." \ + &> /dev/null + +# checks +assert_file_exists "normalized.vcf" +assert_file_not_empty "normalized.vcf" +assert_file_contains "normalized.vcf" "bcftools_normCommand=norm --atomize --atom-overlaps . -o normalized.vcf ../example.vcf" +echo "- test1 succeeded -" + +popd > /dev/null + +# Test 2: Check reference +mkdir "$TMPDIR/test2" && pushd "$TMPDIR/test2" > /dev/null + +echo "> Run bcftools_norm with remove duplicates" +"$meta_executable" \ + --input "../example.vcf" \ + --output "normalized.vcf" \ + --atomize \ + --remove_duplicates 'all' \ + &> /dev/null + +# checks +assert_file_exists "normalized.vcf" +assert_file_not_empty "normalized.vcf" +assert_file_contains "normalized.vcf" "norm --atomize -d all -o normalized.vcf ../example.vcf" +echo "- test2 succeeded -" + +popd > /dev/null + +# Test 3: Check reference and fasta reference +mkdir "$TMPDIR/test3" && pushd "$TMPDIR/test3" > /dev/null + +echo "> Run bcftools_norm with check reference and fasta reference" +"$meta_executable" \ + --input "../example.vcf" \ + --output "normalized.vcf" \ + --atomize \ + --fasta_ref "../reference.fa" \ + --check_ref "e" \ + &> /dev/null + +# checks +assert_file_exists "normalized.vcf" +assert_file_not_empty "normalized.vcf" +assert_file_contains "normalized.vcf" "norm --atomize -c e -f ../reference.fa -o normalized.vcf ../example.vcf" +echo "- test3 succeeded -" + +popd > /dev/null + +# Test 4: Multiallelics +mkdir "$TMPDIR/test4" && pushd "$TMPDIR/test4" > /dev/null + +echo "> Run bcftools_norm with multiallelics" +"$meta_executable" \ + --input "../example.vcf" \ + --output "normalized.vcf" \ + --multiallelics "-any" \ + --old_rec_tag "wazzaaa" \ + &> /dev/null + +# checks +assert_file_exists "normalized.vcf" +assert_file_not_empty "normalized.vcf" +assert_file_contains "normalized.vcf" "norm -m -any --old-rec-tag wazzaaa -o normalized.vcf ../example.vcf" +echo "- test4 succeeded -" + +popd > /dev/null + +# Test 5: Regions +mkdir "$TMPDIR/test5" && pushd "$TMPDIR/test5" > /dev/null + +echo "> Run bcftools_norm with regions" +"$meta_executable" \ + --input "../example.vcf.gz" \ + --output "normalized.vcf" \ + --atomize \ + --regions "1:752567-752722" \ + &> /dev/null + +# checks +assert_file_exists "normalized.vcf" +assert_file_not_empty "normalized.vcf" +assert_file_contains "normalized.vcf" "norm --atomize -r 1:752567-752722 -o normalized.vcf ../example.vcf.gz" +echo "- test5 succeeded -" + +popd > /dev/null + +# Test 6: Targets +mkdir "$TMPDIR/test6" && pushd "$TMPDIR/test6" > /dev/null + +echo "> Run bcftools_norm with targets" +"$meta_executable" \ + --input "../example.vcf" \ + --output "normalized.vcf" \ + --atomize \ + --targets "1:752567-752722" \ + &> /dev/null + +# checks +assert_file_exists "normalized.vcf" +assert_file_not_empty "normalized.vcf" +assert_file_contains "normalized.vcf" "norm --atomize -t 1:752567-752722 -o normalized.vcf ../example.vcf" +echo "- test6 succeeded -" + +popd > /dev/null + +# Test 7: Regions overlap +mkdir "$TMPDIR/test7" && pushd "$TMPDIR/test7" > /dev/null + +echo "> Run bcftools_norm with regions overlap" +"$meta_executable" \ + --input "../example.vcf" \ + --output "normalized.vcf" \ + --atomize \ + --regions_overlap "pos" \ + &> /dev/null + +# checks +assert_file_exists "normalized.vcf" +assert_file_not_empty "normalized.vcf" +assert_file_contains "normalized.vcf" "norm --atomize --regions-overlap pos -o normalized.vcf ../example.vcf" +echo "- test7 succeeded -" + +popd > /dev/null + +# Test 8: Strict filter and targets overlap +mkdir "$TMPDIR/test8" && pushd "$TMPDIR/test8" > /dev/null + +echo "> Run bcftools_norm with strict filter and targets overlap" +"$meta_executable" \ + --input "../example.vcf" \ + --output "normalized.vcf" \ + --atomize \ + --strict_filter \ + --targets_overlap "1" \ + &> /dev/null + +# checks +assert_file_exists "normalized.vcf" +assert_file_not_empty "normalized.vcf" +assert_file_contains "normalized.vcf" "norm --atomize -s --targets-overlap 1 -o normalized.vcf ../example.vcf" +echo "- test8 succeeded -" + +popd > /dev/null + +# Test 9: Do not normalize +mkdir "$TMPDIR/test9" && pushd "$TMPDIR/test9" > /dev/null + +echo "> Run bcftools_norm with do not normalize" +"$meta_executable" \ + --input "../example.vcf" \ + --output "normalized.vcf" \ + --do_not_normalize \ + --atomize \ + &> /dev/null + +# checks +assert_file_exists "normalized.vcf" +assert_file_not_empty "normalized.vcf" +assert_file_contains "normalized.vcf" "norm --atomize -N -o normalized.vcf ../example.vcf" +echo "- test9 succeeded -" + +popd > /dev/null + +echo "---- All tests succeeded! ----" +exit 0 + + diff --git a/src/bcftools/bcftools_sort/config.vsh.yaml b/src/bcftools/bcftools_sort/config.vsh.yaml new file mode 100644 index 00000000..71a15309 --- /dev/null +++ b/src/bcftools/bcftools_sort/config.vsh.yaml @@ -0,0 +1,73 @@ +name: bcftools_sort +namespace: bcftools +description: | + Sorts VCF/BCF files. +keywords: [Sort, VCF, BCF] +links: + homepage: https://samtools.github.io/bcftools/ + documentation: https://samtools.github.io/bcftools/bcftools.html#sort + repository: https://github.com/samtools/bcftools + issue_tracker: https://github.com/samtools/bcftools/issues +references: + doi: https://doi.org/10.1093/gigascience/giab008 +license: MIT/Expat, GNU +requirements: + commands: [bcftools] +authors: + - __merge__: /src/_authors/theodoro_gasperin.yaml + roles: [ author, maintainer ] + +argument_groups: + - name: Inputs + arguments: + - name: --input + alternatives: -i + type: file + description: Input VCF/BCF file. + required: true + + - name: Outputs + arguments: + - name: --output + alternatives: -o + direction: output + type: file + description: Output sorted VCF/BCF file. + required: true + + - name: Options + arguments: + - name: --output_type + alternatives: -O + type: string + choices: [b, u, z, v] + description: | + Compresses or uncompresses the output. + The options are: + b: compressed BCF, + u: uncompressed BCF, + z: compressed VCF, + v: uncompressed VCF. + +resources: + - type: bash_script + path: script.sh + +test_resources: + - type: bash_script + path: test.sh + - path: test_data + +engines: + - type: docker + image: debian:stable-slim + setup: + - type: apt + packages: [bcftools, procps] + - type: docker + run: | + echo "bcftools: \"$(bcftools --version | grep 'bcftools' | sed -n 's/^bcftools //p')\"" > /var/software_versions.txt + +runners: + - type: executable + - type: nextflow diff --git a/src/bcftools/bcftools_sort/help.txt b/src/bcftools/bcftools_sort/help.txt new file mode 100644 index 00000000..3b5fa80b --- /dev/null +++ b/src/bcftools/bcftools_sort/help.txt @@ -0,0 +1,14 @@ +``` +bcftools sort +``` + +About: Sort VCF/BCF file. +Usage: bcftools sort [OPTIONS] + +Options: + -m, --max-mem FLOAT[kMG] maximum memory to use [768M] + -o, --output FILE output file name [stdout] + -O, --output-type b|u|z|v b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v] + -O, --output-type u|b|v|z[0-9] u/b: un/compressed BCF, v/z: un/compressed VCF, 0-9: compression level [v] + -T, --temp-dir DIR temporary files [/tmp/bcftools.XXXXXX] + diff --git a/src/bcftools/bcftools_sort/script.sh b/src/bcftools/bcftools_sort/script.sh new file mode 100644 index 00000000..e9afb223 --- /dev/null +++ b/src/bcftools/bcftools_sort/script.sh @@ -0,0 +1,16 @@ +#!/bin/bash + +## VIASH START +## VIASH END + +# Exit on error +set -eo pipefail + +# Execute bedtools bamtofastq with the provided arguments +bcftools sort \ + -o "$par_output" \ + ${par_output_type:+-O "$par_output_type"} \ + ${meta_memory_mb:+-m "${meta_memory_mb}M"} \ + ${meta_temp_dir:+-T "$meta_temp_dir"} \ + $par_input \ + diff --git a/src/bcftools/bcftools_sort/test.sh b/src/bcftools/bcftools_sort/test.sh new file mode 100644 index 00000000..f406b8e2 --- /dev/null +++ b/src/bcftools/bcftools_sort/test.sh @@ -0,0 +1,185 @@ +#!/bin/bash + +## VIASH START +## VIASH END + +# Exit on error +set -eo pipefail + +test_data="$meta_resources_dir/test_data" + +############################################# +# helper functions +assert_file_exists() { + [ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; } +} +assert_file_not_empty() { + [ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; } +} +assert_file_contains() { + grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; } +} +assert_identical_content() { + diff -a "$2" "$1" \ + || (echo "Files are not identical!" && exit 1) +} +############################################# + +# Create directories for tests +echo "Creating Test Data..." +TMPDIR=$(mktemp -d "$meta_temp_dir/XXXXXX") +function clean_up { + [[ -d "$TMPDIR" ]] && rm -r "$TMPDIR" +} +trap clean_up EXIT + +# Create test data +cat < "$TMPDIR/example.vcf" +##fileformat=VCFv4.0 +##fileDate=20090805 +##source=myImputationProgramV3.1 +##reference=1000GenomesPilot-NCBI36 +##contig= +##contig= +##phasing=partial +##INFO= +##INFO= +##INFO= +##INFO= +##INFO= +##INFO= +##INFO= +##INFO= +##FILTER= +##FILTER= +##FORMAT= +##FORMAT= +##FORMAT= +##FORMAT= +##ALT= +##ALT= +#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003 +19 112 . A G 10 . . GT:HQ 0|0:10,10 0|0:10,10 0/1:3,3 +19 111 . A C 9.6 . . GT:HQ 0|0:10,10 0|0:10,10 0/1:3,3 +20 1235237 . T . . . . GT 0/0 0|0 ./. +20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,. +20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3:.,. +20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4:.,. +20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:.:56,60 0|0:48:4:51,51 0/0:61:2:.,. +20 1234567 microsat1 G GA,GAC 50 PASS NS=3;DP=9;AA=G;AN=6;AC=3,1 GT:GQ:DP 0/1:.:4 0/2:17:2 1/1:40:3 +EOF + +# Create expected output +cat < "$TMPDIR/expected_output.vcf" +##fileformat=VCFv4.0 +##FILTER= +##fileDate=20090805 +##source=myImputationProgramV3.1 +##reference=1000GenomesPilot-NCBI36 +##contig= +##contig= +##phasing=partial +##INFO= +##INFO= +##INFO= +##INFO= +##INFO= +##INFO= +##INFO= +##INFO= +##FILTER= +##FILTER= +##FORMAT= +##FORMAT= +##FORMAT= +##FORMAT= +##ALT= +##ALT= +#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003 +19 111 . A C 9.6 . . GT:HQ 0|0:10,10 0|0:10,10 0/1:3,3 +19 112 . A G 10 . . GT:HQ 0|0:10,10 0|0:10,10 0/1:3,3 +20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,. +20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3:.,. +20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4:.,. +20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:.:56,60 0|0:48:4:51,51 0/0:61:2:.,. +20 1234567 microsat1 G GA,GAC 50 PASS NS=3;DP=9;AA=G;AN=6;AC=3,1 GT:GQ:DP 0/1:.:4 0/2:17:2 1/1:40:3 +20 1235237 . T . . . . GT 0/0 0|0 ./. +EOF + +cat < "$TMPDIR/expected_bcf.vcf" +##fileformat=VCFv4.0 +##FILTER= +##fileDate=20090805 +##source=myImputationProgramV3.1 +##reference=1000GenomesPilot-NCBI36 +##contig= +##contig= +##phasing=partial +##INFO= +##INFO= +##INFO= +##INFO= +##INFO= +##INFO= +##INFO= +##INFO= +##FILTER= +##FILTER= +##FORMAT= +##FORMAT= +##FORMAT= +##FORMAT= +##ALT= +##ALT= +##bcftools_viewVersion=1.16+htslib-1.16 +##bcftools_viewCommand=view -O b -o example.bcf example.vcf.gz; Date=Mon Aug 26 13:00:22 2024 +#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003 +19 111 . A C 9.6 . . GT:HQ 0|0:10,10 0|0:10,10 0/1:3,3 +19 112 . A G 10 . . GT:HQ 0|0:10,10 0|0:10,10 0/1:3,3 +20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,. +20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3:.,. +20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4:.,. +20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:.:56,60 0|0:48:4:51,51 0/0:61:2:.,. +20 1234567 microsat1 G GA,GAC 50 PASS NS=3;DP=9;AA=G;AN=6;AC=3,1 GT:GQ:DP 0/1:.:4 0/2:17:2 1/1:40:3 +20 1235237 . T . . . . GT 0/0 0|0 ./. +EOF + + +# Test 1: Default Use +mkdir "$TMPDIR/test1" && pushd "$TMPDIR/test1" > /dev/null + +echo "> Run bcftools_sort on VCF file" +"$meta_executable" \ + --input "../example.vcf" \ + --output "output.vcf" \ + --output_type "v" \ + &> /dev/null + +# checks +assert_file_exists "output.vcf" +assert_file_not_empty "output.vcf" +assert_identical_content "output.vcf" "../expected_output.vcf" +echo "- test1 succeeded -" + +popd > /dev/null + +# Test 2: BCF file input +mkdir "$TMPDIR/test2" && pushd "$TMPDIR/test2" > /dev/null + +echo "> Run bcftools_sort on BCF file" +"$meta_executable" \ + --input "${test_data}/example.bcf" \ + --output "output.vcf" \ + --output_type "v" \ + &> /dev/null + +# checks +assert_file_exists "output.vcf" +assert_file_not_empty "output.vcf" +assert_identical_content "output.vcf" "../expected_bcf.vcf" +echo "- test2 succeeded -" + +popd > /dev/null + +echo "---- All tests succeeded! ----" +exit 0 diff --git a/src/bcftools/bcftools_sort/test_data/example.bcf b/src/bcftools/bcftools_sort/test_data/example.bcf new file mode 100644 index 00000000..d78ae010 Binary files /dev/null and b/src/bcftools/bcftools_sort/test_data/example.bcf differ diff --git a/src/bcftools/bcftools_stats/config.vsh.yaml b/src/bcftools/bcftools_stats/config.vsh.yaml new file mode 100644 index 00000000..8fb57f7a --- /dev/null +++ b/src/bcftools/bcftools_stats/config.vsh.yaml @@ -0,0 +1,240 @@ +name: bcftools_stats +namespace: bcftools +description: | + Parses VCF or BCF and produces a txt stats file which can be plotted using plot-vcfstats. + When two files are given, the program generates separate stats for intersection + and the complements. By default only sites are compared, -s/-S must given to include + also sample columns. +keywords: [Stats, VCF, BCF] +links: + homepage: https://samtools.github.io/bcftools/ + documentation: https://samtools.github.io/bcftools/bcftools.html#stats + repository: https://github.com/samtools/bcftools + issue_tracker: https://github.com/samtools/bcftools/issues +references: + doi: https://doi.org/10.1093/gigascience/giab008 +license: MIT/Expat, GNU +requirements: + commands: [bcftools] +authors: + - __merge__: /src/_authors/theodoro_gasperin.yaml + roles: [ author ] + +argument_groups: + - name: Inputs + arguments: + - name: --input + alternatives: -i + type: file + multiple: true + description: Input VCF/BCF file. Maximum of two files. + required: true + + - name: Outputs + arguments: + - name: --output + alternatives: -o + direction: output + type: file + description: Output txt statistics file. + required: true + + - name: Options + arguments: + + - name: --allele_frequency_bins + alternatives: --af_bins + type: string + description: | + Allele frequency bins, a list of bin values (0.1,0.5,1). + example: 0.1,0.5,1 + + - name: --allele_frequency_bins_file + alternatives: --af_bins_file + type: file + description: | + Same as allele_frequency_bins, but in a file. + Format of file is one value per line. + e.g. + 0.1 + 0.5 + 1 + + - name: --allele_frequency_tag + alternatives: --af_tag + type: string + description: | + Allele frequency tag to use, by default estimated from AN,AC or GT. + + - name: --first_allele_only + alternatives: --first_only + type: boolean_true + description: | + Include only 1st allele at multiallelic sites. + + - name: --collapse + alternatives: --c + type: string + choices: [ snps, indels, both, all, some, none ] + description: | + Treat as identical records with . + See https://samtools.github.io/bcftools/bcftools.html#common_options for details. + + - name: --depth + alternatives: --d + type: string + description: | + Depth distribution: min,max,bin size. + example: 0,500,1 + + - name: --exclude + alternatives: --e + type: string + description: | + Exclude sites for which the expression is true. + See https://samtools.github.io/bcftools/bcftools.html#expressions for details. + example: 'QUAL < 30 && DP < 10' + + - name: --exons + alternatives: --E + type: file + description: | + tab-delimited file with exons for indel frameshifts statistics. + The columns of the file are CHR, FROM, TO, with 1-based, inclusive, positions. + The file is BGZF-compressed and indexed with tabix (e.g. tabix -s1 -b2 -e3 file.gz). + + - name: --apply_filters + alternatives: --f + type: string + description: | + Require at least one of the listed FILTER strings (e.g. "PASS,."). + + - name: --fasta_reference + alternatives: --F + type: file + description: | + Faidx indexed reference sequence file to determine INDEL context. + + - name: --include + alternatives: --i + type: string + description: | + Select sites for which the expression is true. + See https://samtools.github.io/bcftools/bcftools.html#expressions for details. + example: 'QUAL >= 30 && DP >= 10' + + - name: --split_by_ID + alternatives: --I + type: boolean_true + description: | + Collect stats for sites with ID separately (known vs novel). + + - name: --regions + alternatives: --r + type: string + description: | + Restrict to comma-separated list of regions. + Following formats are supported: chr|chr:pos|chr:beg-end|chr:beg-[,…​]. + example: '20:1000000-2000000' + + - name: --regions_file + alternatives: --R + type: file + description: | + Restrict to regions listed in a file. + Regions can be specified either on a VCF, BED, or tab-delimited file (the default). + For more information check manual. + + - name: --regions_overlap + type: string + choices: ['pos', 'record', 'variant', '0', '1', '2'] + description: | + This option controls how overlapping records are determined: + set to 'pos' or '0' if the VCF record has to have POS inside a region (this corresponds to the default behavior of -t/-T); + set to 'record' or '1' if also overlapping records with POS outside a region should be included (this is the default behavior of -r/-R, + and includes indels with POS at the end of a region, which are technically outside the region); + or set to 'variant' or '2' to include only true overlapping variation (compare the full VCF representation "TA>T-" vs the true sequence variation "A>-"). + + - name: --samples + alternatives: --s + type: string + description: | + List of samples for sample stats, "-" to include all samples. + + - name: --samples_file + alternatives: --S + type: file + description: | + File of samples to include. + e.g. + sample1 1 + sample2 2 + sample3 2 + + - name: --targets + alternatives: --t + type: string + description: | + Similar as -r, --regions, but the next position is accessed by streaming the whole VCF/BCF + rather than using the tbi/csi index. Both -r and -t options can be applied simultaneously: -r uses the + index to jump to a region and -t discards positions which are not in the targets. Unlike -r, targets + can be prefixed with "^" to request logical complement. For example, "^X,Y,MT" indicates that + sequences X, Y and MT should be skipped. Yet another difference between the -t/-T and -r/-R is + that -r/-R checks for proper overlaps and considers both POS and the end position of an indel, + while -t/-T considers the POS coordinate only (by default; see also --regions-overlap and --targets-overlap). + Note that -t cannot be used in combination with -T. + Following formats are supported: chr|chr:pos|chr:beg-end|chr:beg-[,…​]. + example: '20:1000000-2000000' + + - name: --targets_file + alternatives: --T + type: file + description: | + Similar to --regions_file option but streams rather than index-jumps. + + - name: --targets_overlaps + type: string + choices: ['pos', 'record', 'variant', '0', '1', '2'] + description: | + Include if POS in the region (0), record overlaps (1), variant overlaps (2). + + - name: --user_tstv + alternatives: --u + type: string + description: | + Collect Ts/Tv stats for any tag using the given binning [0:1:100]. + Format is . + A subfield can be selected as e.g. 'PV4[0]', here the first value of the PV4 tag. + + + - name: --verbose + alternatives: --v + type: boolean_true + description: | + Produce verbose per-site and per-sample output. + +resources: + - type: bash_script + path: script.sh + +test_resources: + - type: bash_script + path: test.sh + +engines: + - type: docker + image: debian:stable-slim + setup: + - type: apt + packages: [bcftools, procps] + - type: docker + run: | + echo "bcftools: \"$(bcftools --version | grep 'bcftools' | sed -n 's/^bcftools //p')\"" > /var/software_versions.txt + test_setup: + - type: apt + packages: [tabix] + +runners: + - type: executable + - type: nextflow + diff --git a/src/bcftools/bcftools_stats/help.txt b/src/bcftools/bcftools_stats/help.txt new file mode 100644 index 00000000..e702e838 --- /dev/null +++ b/src/bcftools/bcftools_stats/help.txt @@ -0,0 +1,35 @@ +``` +bcftools stats -h +``` + +About: Parses VCF or BCF and produces stats which can be plotted using plot-vcfstats. + When two files are given, the program generates separate stats for intersection + and the complements. By default only sites are compared, -s/-S must given to include + also sample columns. +Usage: bcftools stats [options] [] + +Options: + --af-bins LIST Allele frequency bins, a list (0.1,0.5,1) or a file (0.1\n0.5\n1) + --af-tag STRING Allele frequency tag to use, by default estimated from AN,AC or GT + -1, --1st-allele-only Include only 1st allele at multiallelic sites + -c, --collapse STRING Treat as identical records with , see man page for details [none] + -d, --depth INT,INT,INT Depth distribution: min,max,bin size [0,500,1] + -e, --exclude EXPR Exclude sites for which the expression is true (see man page for details) + -E, --exons FILE.gz Tab-delimited file with exons for indel frameshifts (chr,beg,end; 1-based, inclusive, bgzip compressed) + -f, --apply-filters LIST Require at least one of the listed FILTER strings (e.g. "PASS,.") + -F, --fasta-ref FILE Faidx indexed reference sequence file to determine INDEL context + -i, --include EXPR Select sites for which the expression is true (see man page for details) + -I, --split-by-ID Collect stats for sites with ID separately (known vs novel) + -r, --regions REGION Restrict to comma-separated list of regions + -R, --regions-file FILE Restrict to regions listed in a file + --regions-overlap 0|1|2 Include if POS in the region (0), record overlaps (1), variant overlaps (2) [1] + -s, --samples LIST List of samples for sample stats, "-" to include all samples + -S, --samples-file FILE File of samples to include + -t, --targets REGION Similar to -r but streams rather than index-jumps + -T, --targets-file FILE Similar to -R but streams rather than index-jumps + --targets-overlap 0|1|2 Include if POS in the region (0), record overlaps (1), variant overlaps (2) [0] + -u, --user-tstv TAG[:min:max:n] Collect Ts/Tv stats for any tag using the given binning [0:1:100] + A subfield can be selected as e.g. 'PV4[0]', here the first value of the PV4 tag + --threads INT Use multithreading with worker threads [0] + -v, --verbose Produce verbose per-site and per-sample output + diff --git a/src/bcftools/bcftools_stats/script.sh b/src/bcftools/bcftools_stats/script.sh new file mode 100644 index 00000000..119502fd --- /dev/null +++ b/src/bcftools/bcftools_stats/script.sh @@ -0,0 +1,56 @@ +#!/bin/bash + +## VIASH START +## VIASH END + +# Exit on error +set -eo pipefail + +# Unset parameters +unset_if_false=( + par_first_allele_only + par_split_by_ID + par_verbose +) + +for par in ${unset_if_false[@]}; do + test_val="${!par}" + [[ "$test_val" == "false" ]] && unset $par +done + +# Create input array +IFS=";" read -ra input <<< $par_input + +# Check the size of the input array +if [[ ${#input[@]} -gt 2 ]]; then + echo "Error: --input only takes a max of two files!" + exit 1 +fi + +# Execute bcftools stats with the provided arguments +bcftools stats \ + ${par_first_allele_only:+--1st-allele-only} \ + ${par_split_by_ID:+--split-by-ID} \ + ${par_verbose:+--verbose} \ + ${par_allele_frequency_bins:+--af-bins "${par_allele_frequency_bins}"} \ + ${par_allele_frequency_bins_file:+--af-bins "${par_allele_frequency_bins_file}"} \ + ${par_allele_frequency_tag:+--af-tag "${par_allele_frequency_tag}"} \ + ${par_collapse:+-c "${par_collapse}"} \ + ${par_depth:+-d "${par_depth}"} \ + ${par_exclude:+-e "${par_exclude}"} \ + ${par_exons:+-E "${par_exons}"} \ + ${par_apply_filters:+-f "${par_apply_filters}"} \ + ${par_fasta_reference:+-F "${par_fasta_reference}"} \ + ${par_include:+-i "${par_include}"} \ + ${par_regions:+-r "${par_regions}"} \ + ${par_regions_file:+-R "${par_regions_file}"} \ + ${par_regions_overlap:+--regions-overlap "${par_regions_overlap}"} \ + ${par_samples:+-s "${par_samples}"} \ + ${par_samples_file:+-S "${par_samples_file}"} \ + ${par_targets:+-t "${par_targets}"} \ + ${par_targets_file:+-T "${par_targets_file}"} \ + ${par_targets_overlaps:+--targets-overlap "${par_targets_overlaps}"} \ + ${par_user_tstv:+-u "${par_user_tstv}"} \ + "${input[@]}" \ + > $par_output + diff --git a/src/bcftools/bcftools_stats/test.sh b/src/bcftools/bcftools_stats/test.sh new file mode 100644 index 00000000..18f0256b --- /dev/null +++ b/src/bcftools/bcftools_stats/test.sh @@ -0,0 +1,306 @@ +#!/bin/bash + +## VIASH START +## VIASH END + +# Exit on error +set -eo pipefail + +#test_data="$meta_resources_dir/test_data" + +############################################# +# helper functions +assert_file_exists() { + [ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; } +} +assert_file_not_empty() { + [ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; } +} +assert_file_contains() { + grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; } +} +assert_identical_content() { + diff -a "$2" "$1" \ + || (echo "Files are not identical!" && exit 1) +} +############################################# + +# Create directories for tests +echo "Creating Test Data..." +TMPDIR=$(mktemp -d "$meta_temp_dir/XXXXXX") +function clean_up { + [[ -d "$TMPDIR" ]] && rm -r "$TMPDIR" +} +trap clean_up EXIT + +# Create test data +cat < "$TMPDIR/example.vcf" +##fileformat=VCFv4.0 +##fileDate=20090805 +##source=myImputationProgramV3.1 +##reference=1000GenomesPilot-NCBI36 +##contig= +##contig= +##phasing=partial +##INFO= +##INFO= +##INFO= +##INFO= +##INFO= +##INFO= +##INFO= +##INFO= +##FILTER= +##FILTER= +##FORMAT= +##FORMAT= +##FORMAT= +##FORMAT= +##ALT= +##ALT= +#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003 +19 111 . A C 9.6 . . GT:HQ 0|0:10,10 0|0:10,10 0/1:3,3 +19 112 . A G 10 . . GT:HQ 0|0:10,10 0|0:10,10 0/1:3,3 +20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,. +20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3:.,. +20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4:.,. +20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:.:56,60 0|0:48:4:51,51 0/0:61:2:.,. +20 1234567 microsat1 G GA,GAC 50 PASS NS=3;DP=9;AA=G;AN=6;AC=3,1 GT:GQ:DP 0/1:.:4 0/2:17:2 1/1:40:3 +20 1235237 . T . . . . GT 0/0 0|0 ./. +EOF + +bgzip -c $TMPDIR/example.vcf > $TMPDIR/example.vcf.gz +tabix -p vcf $TMPDIR/example.vcf.gz + +cat < "$TMPDIR/exons.bed" +chr19 12345 12567 +chr20 23456 23789 +EOF + +# Compressing and indexing the exons file +bgzip -c $TMPDIR/exons.bed > $TMPDIR/exons.bed.gz +tabix -s1 -b2 -e3 $TMPDIR/exons.bed.gz + +# Create fai test file +# cat < "$TMPDIR/reference.fasta.fai" +# 19 100 895464957 60 61 +# 20 10000 1083893029 60 61 +# EOF + +# Create allele frequency bins file +cat < "$TMPDIR/allele_frequency_bins.txt" +0.1 +0.2 +0.3 +0.4 +0.5 +0.6 +0.7 +0.8 +0.9 +EOF + +# Test 1: Default Use +mkdir "$TMPDIR/test1" && pushd "$TMPDIR/test1" > /dev/null + +echo "> Run bcftools_stats on VCF file" +"$meta_executable" \ + --input "../example.vcf" \ + --output "stats.txt" \ + +# checks +assert_file_exists "stats.txt" +assert_file_not_empty "stats.txt" +assert_file_contains "stats.txt" "bcftools stats ../example.vcf" +echo "- test1 succeeded -" + +popd > /dev/null + +# Test 2: First allele only +mkdir "$TMPDIR/test2" && pushd "$TMPDIR/test2" > /dev/null + +echo "> Run bcftools_stats on VCF file with first allele only" +"$meta_executable" \ + --input "../example.vcf" \ + --output "stats.txt" \ + --first_allele_only \ + --allele_frequency_bins "0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9" \ + --allele_frequency_tag "AF" \ + +# checks +assert_file_exists "stats.txt" +assert_file_not_empty "stats.txt" +assert_file_contains "stats.txt" "bcftools stats --1st-allele-only --af-bins 0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9 --af-tag AF ../example.vcf" +echo "- test2 succeeded -" + +popd > /dev/null + +# Test 3: Split by ID +mkdir "$TMPDIR/test3" && pushd "$TMPDIR/test3" > /dev/null + +echo "> Run bcftools_stats on VCF file with split by ID" +"$meta_executable" \ + --input "../example.vcf" \ + --output "stats.txt" \ + --split_by_ID \ + +# checks +assert_file_exists "stats.txt" +assert_file_not_empty "stats.txt" +assert_file_contains "stats.txt" "bcftools stats --split-by-ID ../example.vcf" +echo "- test3 succeeded -" + +popd > /dev/null + +# Test 4: Collapse, Depth, Exclude +mkdir "$TMPDIR/test4" && pushd "$TMPDIR/test4" > /dev/null + +echo "> Run bcftools_stats on VCF file with collapse, depth, and exclude" +"$meta_executable" \ + --input "../example.vcf" \ + --output "stats.txt" \ + --depth "0,500,1" \ + --exclude "GT='mis'" \ + --collapse "snps" \ + +# checks +assert_file_exists "stats.txt" +assert_file_not_empty "stats.txt" +assert_file_contains "stats.txt" "bcftools stats -c snps -d 0,500,1 -e GT='mis' ../example.vcf" +echo "- test4 succeeded -" + +popd > /dev/null + +# Test 5: Exons, Apply Filters +mkdir "$TMPDIR/test5" && pushd "$TMPDIR/test5" > /dev/null + +echo "> Run bcftools_stats on VCF file with exons, apply filters, and fasta reference" +"$meta_executable" \ + --input "../example.vcf" \ + --output "stats.txt" \ + --exons "../exons.bed.gz" \ + --apply_filters "PASS" \ +# --fasta_reference "../reference.fasta.fai" \ + +# NOTE: fasta_reference option not included in testing because of error from bcftools stats. + +# checks +assert_file_exists "stats.txt" +assert_file_not_empty "stats.txt" +assert_file_contains "stats.txt" "bcftools stats -E ../exons.bed.gz -f PASS ../example.vcf" +#assert_file_contains "stats.txt" "bcftools stats -E ../exons.bed.gz -f PASS -F ../reference.fasta.fai ../example.vcf" +echo "- test5 succeeded -" + +popd > /dev/null + +# Test 6: Include, Regions +mkdir "$TMPDIR/test6" && pushd "$TMPDIR/test6" > /dev/null + +echo "> Run bcftools_stats on VCF file with include and regions options" +"$meta_executable" \ + --input "../example.vcf.gz" \ + --output "stats.txt" \ + --include "GT='mis'" \ + --regions "20:1000000-2000000" \ + +# checks +assert_file_exists "stats.txt" +assert_file_not_empty "stats.txt" +assert_file_contains "stats.txt" "bcftools stats -i GT='mis' -r 20:1000000-2000000 ../example.vcf.gz" +echo "- test6 succeeded -" + +popd > /dev/null + +# Test 7: Regions Overlap, Samples +mkdir "$TMPDIR/test7" && pushd "$TMPDIR/test7" > /dev/null + +echo "> Run bcftools_stats on VCF file with regions overlap, and samples options" +"$meta_executable" \ + --input "../example.vcf" \ + --output "stats.txt" \ + --regions_overlap "record" \ + --samples "NA00001,NA00002" \ + +# checks +assert_file_exists "stats.txt" +assert_file_not_empty "stats.txt" +assert_file_contains "stats.txt" "bcftools stats --regions-overlap record -s NA00001,NA00002 ../example.vcf" +echo "- test7 succeeded -" + +popd > /dev/null + +# Test 8: Targets, Targets File, Targets Overlaps +mkdir "$TMPDIR/test8" && pushd "$TMPDIR/test8" > /dev/null + +echo "> Run bcftools_stats on VCF file with targets, targets file, and targets overlaps" +"$meta_executable" \ + --input "../example.vcf" \ + --output "stats.txt" \ + --targets "20:1000000-2000000" \ + --targets_overlaps "pos" \ + +# checks +assert_file_exists "stats.txt" +assert_file_not_empty "stats.txt" +assert_file_contains "stats.txt" "bcftools stats -t 20:1000000-2000000 --targets-overlap pos ../example.vcf" +echo "- test8 succeeded -" + +popd > /dev/null + +# Test 9: User TSTV and Verbose +mkdir "$TMPDIR/test9" && pushd "$TMPDIR/test9" > /dev/null + +echo "> Run bcftools_stats on VCF file with user TSTV and verbose" +"$meta_executable" \ + --input "../example.vcf" \ + --output "stats.txt" \ + --user_tstv "DP" \ + --verbose \ + +# checks +assert_file_exists "stats.txt" +assert_file_not_empty "stats.txt" +assert_file_contains "stats.txt" "bcftools stats --verbose -u DP ../example.vcf" +echo "- test9 succeeded -" + +popd > /dev/null + +# Test 10: Two vcf files +mkdir "$TMPDIR/test10" && pushd "$TMPDIR/test10" > /dev/null + +echo "> Run bcftools_stats on two VCF files" +"$meta_executable" \ + --input "../example.vcf.gz" \ + --input "../example.vcf.gz" \ + --output "stats.txt" \ + +# checks +assert_file_exists "stats.txt" +assert_file_not_empty "stats.txt" +assert_file_contains "stats.txt" "bcftools stats ../example.vcf.gz ../example.vcf.gz" +echo "- test10 succeeded -" + +popd > /dev/null + +# Test 11: with allele frequency bins file option +mkdir "$TMPDIR/test11" && pushd "$TMPDIR/test11" > /dev/null + +echo "> Run bcftools_stats on VCF file with allele frequency bins file option" +"$meta_executable" \ + --input "../example.vcf" \ + --output "stats.txt" \ + --allele_frequency_bins "../allele_frequency_bins.txt" \ + +# checks +assert_file_exists "stats.txt" +assert_file_not_empty "stats.txt" +assert_file_contains "stats.txt" "bcftools stats --af-bins ../allele_frequency_bins.txt ../example.vcf" +echo "- test11 succeeded -" + +popd > /dev/null + + +echo "---- All tests succeeded! ----" +exit 0 + + diff --git a/src/bedtools/bedtools_bed12tobed6/config.vsh.yaml b/src/bedtools/bedtools_bed12tobed6/config.vsh.yaml new file mode 100644 index 00000000..8dd6328c --- /dev/null +++ b/src/bedtools/bedtools_bed12tobed6/config.vsh.yaml @@ -0,0 +1,67 @@ +name: bedtools_bed12tobed6 +namespace: bedtools +description: | + Converts BED features in BED12 (a.k.a. “blocked” BED features such as genes) to discrete BED6 features. + For example, in the case of a gene with six exons, bed12ToBed6 would create six separate BED6 features (i.e., one for each exon). +keywords: [Converts, BED12, BED6] +links: + documentation: https://bedtools.readthedocs.io/en/latest/content/tools/bed12tobed6.html + repository: https://github.com/arq5x/bedtools2 + homepage: https://bedtools.readthedocs.io/en/latest/# + issue_tracker: https://github.com/arq5x/bedtools2/issues +references: + doi: 10.1093/bioinformatics/btq033 +license: MIT +requirements: + commands: [bedtools] +authors: + - __merge__: /src/_authors/theodoro_gasperin.yaml + roles: [ author, maintainer ] + +argument_groups: + + - name: Inputs + arguments: + - name: --input + alternatives: -i + type: file + description: Input BED12 file. + required: true + + - name: Outputs + arguments: + - name: --output + alternatives: -o + type: file + direction: output + description: Output BED6 file to be written. + + - name: Options + arguments: + - name: --n_score + alternatives: -n + type: boolean_true + description: | + Force the score to be the (1-based) block number from the BED12. + +resources: + - type: bash_script + path: script.sh + +test_resources: + - type: bash_script + path: test.sh + +engines: + - type: docker + image: debian:stable-slim + setup: + - type: apt + packages: [bedtools, procps] + - type: docker + run: | + echo "bedtools: \"$(bedtools --version | sed -n 's/^bedtools //p')\"" > /var/software_versions.txt + +runners: + - type: executable + - type: nextflow diff --git a/src/bedtools/bedtools_bed12tobed6/help.txt b/src/bedtools/bedtools_bed12tobed6/help.txt new file mode 100644 index 00000000..17af6983 --- /dev/null +++ b/src/bedtools/bedtools_bed12tobed6/help.txt @@ -0,0 +1,13 @@ +``` +bedtools bed12tobed6 -h +``` + +Tool: bedtools bed12tobed6 (aka bed12ToBed6) +Version: v2.30.0 +Summary: Splits BED12 features into discrete BED6 features. + +Usage: bedtools bed12tobed6 [OPTIONS] -i + +Options: + -n Force the score to be the (1-based) block number from the BED12. + diff --git a/src/bedtools/bedtools_bed12tobed6/script.sh b/src/bedtools/bedtools_bed12tobed6/script.sh new file mode 100644 index 00000000..bbfaddc6 --- /dev/null +++ b/src/bedtools/bedtools_bed12tobed6/script.sh @@ -0,0 +1,15 @@ +#!/bin/bash + +## VIASH START +## VIASH END + +set -eo pipefail + +# Unset parameters +[[ "$par_n_score" == "false" ]] && unset par_n_score + +# Execute bedtools bed12tobed6 conversion +bedtools bed12tobed6 \ + ${par_n_score:+-n} \ + -i "$par_input" \ + > "$par_output" diff --git a/src/bedtools/bedtools_bed12tobed6/test.sh b/src/bedtools/bedtools_bed12tobed6/test.sh new file mode 100644 index 00000000..2ef596d9 --- /dev/null +++ b/src/bedtools/bedtools_bed12tobed6/test.sh @@ -0,0 +1,85 @@ +#!/bin/bash + +# exit on error +set -eo pipefail + +############################################# +# helper functions +assert_file_exists() { + [ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; } +} +assert_file_not_empty() { + [ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; } +} +assert_file_contains() { + grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; } +} +assert_identical_content() { + diff -a "$2" "$1" \ + || (echo "Files are not identical!" && exit 1) +} +############################################# + +# Create directories for tests +echo "Creating Test Data..." +TMPDIR=$(mktemp -d "$meta_temp_dir/XXXXXX") +function clean_up { + [[ -d "$TMPDIR" ]] && rm -r "$TMPDIR" +} +trap clean_up EXIT + +# Create example BED12 file +cat < "$TMPDIR/example.bed12" +chr21 10079666 10120808 uc002yiv.1 0 - 10081686 1 0 1 2 0 6 0 8 0 4 528,91,101,215, 0,1930,39750,40927, +chr21 10080031 10081687 uc002yiw.1 0 - 10080031 1 0 0 8 0 0 3 1 0 2 200,91, 0,1565, +chr21 10081660 10120796 uc002yix.2 0 - 10081660 1 0 0 8 1 6 6 0 0 3 27,101,223, 0,37756,38913, +EOF + +# Expected output bed6 file +cat < "$TMPDIR/expected.bed6" +chr21 10079666 10120808 uc002yiv.1 0 - +chr21 10080031 10081687 uc002yiw.1 0 - +chr21 10081660 10120796 uc002yix.2 0 - +EOF +# Expected output bed6 file with -n option +cat < "$TMPDIR/expected_n.bed6" +chr21 10079666 10120808 uc002yiv.1 1 - +chr21 10080031 10081687 uc002yiw.1 1 - +chr21 10081660 10120796 uc002yix.2 1 - +EOF + +# Test 1: Default conversion BED12 to BED6 +mkdir "$TMPDIR/test1" && pushd "$TMPDIR/test1" > /dev/null + +echo "> Run bedtools_bed12tobed6 on BED12 file" +"$meta_executable" \ + --input "../example.bed12" \ + --output "output.bed6" + +# checks +assert_file_exists "output.bed6" +assert_file_not_empty "output.bed6" +assert_identical_content "output.bed6" "../expected.bed6" +echo "- test1 succeeded -" + +popd > /dev/null + +# Test 2: Conversion BED12 to BED6 with -n option +mkdir "$TMPDIR/test2" && pushd "$TMPDIR/test2" > /dev/null + +echo "> Run bedtools_bed12tobed6 on BED12 file with -n option" +"$meta_executable" \ + --input "../example.bed12" \ + --output "output.bed6" \ + --n_score + +# checks +assert_file_exists "output.bed6" +assert_file_not_empty "output.bed6" +assert_identical_content "output.bed6" "../expected_n.bed6" +echo "- test2 succeeded -" + +popd > /dev/null + +echo "---- All tests succeeded! ----" +exit 0 diff --git a/src/bedtools/bedtools_genomecov/config.vsh.yaml b/src/bedtools/bedtools_genomecov/config.vsh.yaml new file mode 100644 index 00000000..775587de --- /dev/null +++ b/src/bedtools/bedtools_genomecov/config.vsh.yaml @@ -0,0 +1,208 @@ +name: bedtools_genomecov +namespace: bedtools +description: | + Compute the coverage of a feature file among a genome. +keywords: [genome coverage, BED, GFF, VCF, BAM] +links: + homepage: https://bedtools.readthedocs.io/en/latest/# + documentation: https://bedtools.readthedocs.io/en/latest/content/tools/genomecov.html + repository: https://github.com/arq5x/bedtools2 + issue_tracker: https://github.com/arq5x/bedtools2/issues +references: + doi: 10.1093/bioinformatics/btq033 +license: MIT +requirements: + commands: [bedtools] +authors: + - __merge__: /src/_authors/theodoro_gasperin.yaml + roles: [ author, maintainer ] + +argument_groups: + - name: Inputs + arguments: + - name: --input + alternatives: -i + type: file + direction: input + description: | + The input file (BED/GFF/VCF) to be used. + example: input.bed + + - name: --input_bam + alternatives: -ibam + type: file + description: | + The input file is in BAM format. + Note: BAM _must_ be sorted by positions. + '--genome' option is ignored if you use '--input_bam' option! + + - name: --genome + alternatives: -g + type: file + direction: input + description: | + The genome file to be used. + example: genome.txt + + - name: Outputs + arguments: + - name: --output + type: file + direction: output + description: | + The output BED file. + required: true + example: output.bed + + - name: Options + arguments: + + - name: --depth + alternatives: -d + type: boolean_true + description: | + Report the depth at each genome position (with one-based coordinates). + Default behavior is to report a histogram. + + - name: --depth_zero + alternatives: -dz + type: boolean_true + description: | + Report the depth at each genome position (with zero-based coordinates). + Reports only non-zero positions. + Default behavior is to report a histogram. + + - name: --bed_graph + alternatives: -bg + type: boolean_true + description: | + Report depth in BedGraph format. For details, see: + genome.ucsc.edu/goldenPath/help/bedgraph.html + + - name: --bed_graph_zero_coverage + alternatives: -bga + type: boolean_true + description: | + Report depth in BedGraph format, as above (-bg). + However with this option, regions with zero + coverage are also reported. This allows one to + quickly extract all regions of a genome with 0 + coverage by applying: "grep -w 0$" to the output. + + - name: --split + type: boolean_true + description: | + Treat "split" BAM or BED12 entries as distinct BED intervals. + when computing coverage. + For BAM files, this uses the CIGAR "N" and "D" operations + to infer the blocks for computing coverage. + For BED12 files, this uses the BlockCount, BlockStarts, and BlockEnds + fields (i.e., columns 10,11,12). + + - name: --ignore_deletion + alternatives: -ignoreD + type: boolean_true + description: | + Ignore local deletions (CIGAR "D" operations) in BAM entries + when computing coverage. + + - name: --strand + type: string + choices: ["+", "-"] + description: | + Calculate coverage of intervals from a specific strand. + With BED files, requires at least 6 columns (strand is column 6). + + - name: --pair_end_coverage + alternatives: -pc + type: boolean_true + description: | + Calculate coverage of pair-end fragments. + Works for BAM files only + + - name: --fragment_size + alternatives: -fs + type: boolean_true + description: | + Force to use provided fragment size instead of read length + Works for BAM files only + + - name: --du + type: boolean_true + description: | + Change strand af the mate read (so both reads from the same strand) useful for strand specific + Works for BAM files only + + - name: --five_prime + alternatives: -5 + type: boolean_true + description: | + Calculate coverage of 5" positions (instead of entire interval). + + - name: --three_prime + alternatives: -3 + type: boolean_true + description: | + Calculate coverage of 3" positions (instead of entire interval). + + - name: --max + type: integer + min: 0 + description: | + Combine all positions with a depth >= max into + a single bin in the histogram. Irrelevant + for -d and -bedGraph + - (INTEGER) + + - name: --scale + type: double + min: 0 + description: | + Scale the coverage by a constant factor. + Each coverage value is multiplied by this factor before being reported. + Useful for normalizing coverage by, e.g., reads per million (RPM). + - Default is 1.0; i.e., unscaled. + - (FLOAT) + + - name: --trackline + type: boolean_true + description: | + Adds a UCSC/Genome-Browser track line definition in the first line of the output. + - See here for more details about track line definition: + http://genome.ucsc.edu/goldenPath/help/bedgraph.html + - NOTE: When adding a trackline definition, the output BedGraph can be easily + uploaded to the Genome Browser as a custom track, + BUT CAN NOT be converted into a BigWig file (w/o removing the first line). + + - name: --trackopts + type: string + description: | + Writes additional track line definition parameters in the first line. + - Example: + -trackopts 'name="My Track" visibility=2 color=255,30,30' + Note the use of single-quotes if you have spaces in your parameters. + - (TEXT) + multiple: true + +resources: + - type: bash_script + path: script.sh + +test_resources: + - type: bash_script + path: test.sh + - path: test_data + +engines: + - type: docker + image: debian:stable-slim + setup: + - type: apt + packages: [bedtools, procps] + - type: docker + run: | + echo "bedtools: \"$(bedtools --version | sed -n 's/^bedtools //p')\"" > /var/software_versions.txt + +runners: + - type: executable + - type: nextflow \ No newline at end of file diff --git a/src/bedtools/bedtools_genomecov/help.txt b/src/bedtools/bedtools_genomecov/help.txt new file mode 100644 index 00000000..f13a71d3 --- /dev/null +++ b/src/bedtools/bedtools_genomecov/help.txt @@ -0,0 +1,101 @@ +```bash +bedtools genomecov +``` + +Tool: bedtools genomecov (aka genomeCoverageBed) +Version: v2.30.0 +Summary: Compute the coverage of a feature file among a genome. + +Usage: bedtools genomecov [OPTIONS] -i -g + +Options: + -ibam The input file is in BAM format. + Note: BAM _must_ be sorted by position + + -d Report the depth at each genome position (with one-based coordinates). + Default behavior is to report a histogram. + + -dz Report the depth at each genome position (with zero-based coordinates). + Reports only non-zero positions. + Default behavior is to report a histogram. + + -bg Report depth in BedGraph format. For details, see: + genome.ucsc.edu/goldenPath/help/bedgraph.html + + -bga Report depth in BedGraph format, as above (-bg). + However with this option, regions with zero + coverage are also reported. This allows one to + quickly extract all regions of a genome with 0 + coverage by applying: "grep -w 0$" to the output. + + -split Treat "split" BAM or BED12 entries as distinct BED intervals. + when computing coverage. + For BAM files, this uses the CIGAR "N" and "D" operations + to infer the blocks for computing coverage. + For BED12 files, this uses the BlockCount, BlockStarts, and BlockEnds + fields (i.e., columns 10,11,12). + + -ignoreD Ignore local deletions (CIGAR "D" operations) in BAM entries + when computing coverage. + + -strand Calculate coverage of intervals from a specific strand. + With BED files, requires at least 6 columns (strand is column 6). + - (STRING): can be + or - + + -pc Calculate coverage of pair-end fragments. + Works for BAM files only + -fs Force to use provided fragment size instead of read length + Works for BAM files only + -du Change strand af the mate read (so both reads from the same strand) useful for strand specific + Works for BAM files only + -5 Calculate coverage of 5" positions (instead of entire interval). + + -3 Calculate coverage of 3" positions (instead of entire interval). + + -max Combine all positions with a depth >= max into + a single bin in the histogram. Irrelevant + for -d and -bedGraph + - (INTEGER) + + -scale Scale the coverage by a constant factor. + Each coverage value is multiplied by this factor before being reported. + Useful for normalizing coverage by, e.g., reads per million (RPM). + - Default is 1.0; i.e., unscaled. + - (FLOAT) + + -trackline Adds a UCSC/Genome-Browser track line definition in the first line of the output. + - See here for more details about track line definition: + http://genome.ucsc.edu/goldenPath/help/bedgraph.html + - NOTE: When adding a trackline definition, the output BedGraph can be easily + uploaded to the Genome Browser as a custom track, + BUT CAN NOT be converted into a BigWig file (w/o removing the first line). + + -trackopts Writes additional track line definition parameters in the first line. + - Example: + -trackopts 'name="My Track" visibility=2 color=255,30,30' + Note the use of single-quotes if you have spaces in your parameters. + - (TEXT) + +Notes: + (1) The genome file should tab delimited and structured as follows: + + + For example, Human (hg19): + chr1 249250621 + chr2 243199373 + ... + chr18_gl000207_random 4262 + + (2) The input BED (-i) file must be grouped by chromosome. + A simple "sort -k 1,1 > .sorted" will suffice. + + (3) The input BAM (-ibam) file must be sorted by position. + A "samtools sort " should suffice. + +Tips: + One can use the UCSC Genome Browser's MySQL database to extract + chromosome sizes. For example, H. sapiens: + + mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e \ + "select chrom, size from hg19.chromInfo" > hg19.genome + diff --git a/src/bedtools/bedtools_genomecov/script.sh b/src/bedtools/bedtools_genomecov/script.sh new file mode 100644 index 00000000..20fbd968 --- /dev/null +++ b/src/bedtools/bedtools_genomecov/script.sh @@ -0,0 +1,55 @@ +#!/bin/bash + +## VIASH START +## VIASH END + +# Exit on error +set -eo pipefail + +# Unset variables +unset_if_false=( + par_input_bam + par_depth + par_depth_zero + par_bed_graph + par_bed_graph_zero_coverage + par_split + par_ignore_deletion + par_pair_end_coverage + par_fragment_size + par_du + par_five_prime + par_three_prime + par_trackline +) + +for par in ${unset_if_false[@]}; do + test_val="${!par}" + [[ "$test_val" == "false" ]] && unset $par +done + +# Create input array +IFS=";" read -ra trackopts <<< $par_trackopts + +bedtools genomecov \ + ${par_depth:+-d} \ + ${par_depth_zero:+-dz} \ + ${par_bed_graph:+-bg} \ + ${par_bed_graph_zero_coverage:+-bga} \ + ${par_split:+-split} \ + ${par_ignore_deletion:+-ignoreD} \ + ${par_du:+-du} \ + ${par_five_prime:+-5} \ + ${par_three_prime:+-3} \ + ${par_trackline:+-trackline} \ + ${par_strand:+-strand "$par_strand"} \ + ${par_max:+-max "$par_max"} \ + ${par_scale:+-scale "$par_scale"} \ + ${par_trackopts:+-trackopts "${trackopts[*]}"} \ + ${par_input_bam:+-ibam "$par_input_bam"} \ + ${par_input:+-i "$par_input"} \ + ${par_genome:+-g "$par_genome"} \ + ${par_pair_end_coverage:+-pc} \ + ${par_fragment_size:+-fs} \ + > "$par_output" + \ No newline at end of file diff --git a/src/bedtools/bedtools_genomecov/test.sh b/src/bedtools/bedtools_genomecov/test.sh new file mode 100644 index 00000000..7e4487da --- /dev/null +++ b/src/bedtools/bedtools_genomecov/test.sh @@ -0,0 +1,333 @@ +#!/bin/bash + +# exit on error +set -eo pipefail + +## VIASH START +meta_executable="target/executable/bedtools/bedtools_intersect/bedtools_intersect" +meta_resources_dir="src/bedtools/bedtools_intersect" +## VIASH END + +# directory of the bam file +test_data="$meta_resources_dir/test_data" + +############################################# +# helper functions +assert_file_exists() { + [ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; } +} +assert_file_not_empty() { + [ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; } +} +assert_file_contains() { + grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; } +} +assert_identical_content() { + diff -a "$2" "$1" \ + || (echo "Files are not identical!" && exit 1) +} +############################################# + +# Create directories for tests +echo "Creating Test Data..." +TMPDIR=$(mktemp -d "$meta_temp_dir/XXXXXX") +function clean_up { + [[ -d "$TMPDIR" ]] && rm -r "$TMPDIR" +} +trap clean_up EXIT + +# Create and populate input files +printf "chr1\t248956422\nchr2\t198295559\nchr3\t242193529\n" > "$TMPDIR/genome.txt" +printf "chr2\t128\t228\tmy_read/1\t37\t+\nchr2\t428\t528\tmy_read/2\t37\t-\n" > "$TMPDIR/example.bed" +printf "chr2\t128\t228\tmy_read/1\t60\t+\t128\t228\t255,0,0\t1\t100\t0\nchr2\t428\t528\tmy_read/2\t60\t-\t428\t528\t255,0,0\t1\t100\t0\n" > "$TMPDIR/example.bed12" +printf "chr2\t100\t103\n" > "$TMPDIR/example_dz.bed" + +# expected outputs +cat > "$TMPDIR/expected_default.bed" < "$TMPDIR/expected_ibam.bed" < "$TMPDIR/expected_ibam_pc.bed" < "$TMPDIR/expected_ibam_fs.bed" < "$TMPDIR/expected_dz.bed" < "$TMPDIR/expected_strand.bed" < "$TMPDIR/expected_5.bed" < "$TMPDIR/expected_bg_scale.bed" < "$TMPDIR/expected_trackopts.bed" < "$TMPDIR/expected_split.bed" < "$TMPDIR/expected_ignoreD_du.bed" < /dev/null + +echo "> Run bedtools_genomecov on BED file" +"$meta_executable" \ + --input "../example.bed" \ + --genome "../genome.txt" \ + --output "output.bed" + +# checks +assert_file_exists "output.bed" +assert_file_not_empty "output.bed" +assert_identical_content "output.bed" "../expected_default.bed" +echo "- test1 succeeded -" + +popd > /dev/null + +# Test 2: ibam option +mkdir "$TMPDIR/test2" && pushd "$TMPDIR/test2" > /dev/null + +echo "> Run bedtools_genomecov on BAM file with -ibam" +"$meta_executable" \ + --input_bam "$test_data/example.bam" \ + --output "output.bed" \ + +# checks +assert_file_exists "output.bed" +assert_file_not_empty "output.bed" +assert_identical_content "output.bed" "../expected_ibam.bed" +echo "- test2 succeeded -" + +popd > /dev/null + +# Test 3: depth option +mkdir "$TMPDIR/test3" && pushd "$TMPDIR/test3" > /dev/null + +echo "> Run bedtools_genomecov on BED file with -dz" +"$meta_executable" \ + --input "../example_dz.bed" \ + --genome "../genome.txt" \ + --output "output.bed" \ + --depth_zero + +# checks +assert_file_exists "output.bed" +assert_file_not_empty "output.bed" +assert_identical_content "output.bed" "../expected_dz.bed" +echo "- test3 succeeded -" + +popd > /dev/null + +# Test 4: strand option +mkdir "$TMPDIR/test4" && pushd "$TMPDIR/test4" > /dev/null + +echo "> Run bedtools_genomecov on BED file with -strand" +"$meta_executable" \ + --input "../example.bed" \ + --genome "../genome.txt" \ + --output "output.bed" \ + --strand "-" \ + +# checks +assert_file_exists "output.bed" +assert_file_not_empty "output.bed" +assert_identical_content "output.bed" "../expected_strand.bed" +echo "- test4 succeeded -" + +popd > /dev/null + +# Test 5: 5' end option +mkdir "$TMPDIR/test5" && pushd "$TMPDIR/test5" > /dev/null + +echo "> Run bedtools_genomecov on BED file with -5" +"$meta_executable" \ + --input "../example.bed" \ + --genome "../genome.txt" \ + --output "output.bed" \ + --five_prime \ + +# checks +assert_file_exists "output.bed" +assert_file_not_empty "output.bed" +assert_identical_content "output.bed" "../expected_5.bed" +echo "- test5 succeeded -" + +popd > /dev/null + +# Test 6: max option +mkdir "$TMPDIR/test6" && pushd "$TMPDIR/test6" > /dev/null + +echo "> Run bedtools_genomecov on BED file with -max" +"$meta_executable" \ + --input "../example.bed" \ + --genome "../genome.txt" \ + --output "output.bed" \ + --max 100 \ + +# checks +assert_file_exists "output.bed" +assert_file_not_empty "output.bed" +assert_identical_content "output.bed" "../expected_default.bed" +echo "- test6 succeeded -" + +popd > /dev/null + +# Test 7: bedgraph and scale option +mkdir "$TMPDIR/test7" && pushd "$TMPDIR/test7" > /dev/null + +echo "> Run bedtools_genomecov on BED file with -bg and -scale" +"$meta_executable" \ + --input "../example.bed" \ + --genome "../genome.txt" \ + --output "output.bed" \ + --bed_graph \ + --scale 100 \ + +# checks +assert_file_exists "output.bed" +assert_file_not_empty "output.bed" +assert_identical_content "output.bed" "../expected_bg_scale.bed" +echo "- test7 succeeded -" + +popd > /dev/null + +# Test 8: trackopts option +mkdir "$TMPDIR/test8" && pushd "$TMPDIR/test8" > /dev/null + +echo "> Run bedtools_genomecov on BED file with -bg and -trackopts" +"$meta_executable" \ + --input "../example.bed" \ + --genome "../genome.txt" \ + --output "output.bed" \ + --bed_graph \ + --trackopts "name=example" \ + --trackopts "llama=Alpaco" \ + +# checks +assert_file_exists "output.bed" +assert_file_not_empty "output.bed" +assert_identical_content "output.bed" "../expected_trackopts.bed" +echo "- test8 succeeded -" + +popd > /dev/null + +# Test 9: ibam pc options +mkdir "$TMPDIR/test9" && pushd "$TMPDIR/test9" > /dev/null + +echo "> Run bedtools_genomecov on BAM file with -ibam, -pc" +"$meta_executable" \ + --input_bam "$test_data/example.bam" \ + --output "output.bed" \ + --fragment_size \ + --pair_end_coverage \ + +# checks +assert_file_exists "output.bed" +assert_file_not_empty "output.bed" +assert_identical_content "output.bed" "../expected_ibam_pc.bed" +echo "- test9 succeeded -" + +popd > /dev/null + +# Test 10: ibam fs options +mkdir "$TMPDIR/test10" && pushd "$TMPDIR/test10" > /dev/null + +echo "> Run bedtools_genomecov on BAM file with -ibam, -fs" +"$meta_executable" \ + --input_bam "$test_data/example.bam" \ + --output "output.bed" \ + --fragment_size \ + +# checks +assert_file_exists "output.bed" +assert_file_not_empty "output.bed" +assert_identical_content "output.bed" "../expected_ibam_fs.bed" +echo "- test10 succeeded -" + +popd > /dev/null + +# Test 11: split +mkdir "$TMPDIR/test11" && pushd "$TMPDIR/test11" > /dev/null + +echo "> Run bedtools_genomecov on BED12 file with -split" +"$meta_executable" \ + --input "../example.bed12" \ + --genome "../genome.txt" \ + --output "output.bed" \ + --split \ + +# checks +assert_file_exists "output.bed" +assert_file_not_empty "output.bed" +assert_identical_content "output.bed" "../expected_split.bed" +echo "- test11 succeeded -" + +popd > /dev/null + +# Test 12: ignore deletion and du +mkdir "$TMPDIR/test12" && pushd "$TMPDIR/test12" > /dev/null + +echo "> Run bedtools_genomecov on BAM file with -ignoreD and -du" +"$meta_executable" \ + --input_bam "$test_data/example.bam" \ + --output "output.bed" \ + --ignore_deletion \ + --du \ + +# checks +assert_file_exists "output.bed" +assert_file_not_empty "output.bed" +assert_identical_content "output.bed" "../expected_ignoreD_du.bed" +echo "- test12 succeeded -" + +popd > /dev/null + +echo "---- All tests succeeded! ----" +exit 0 diff --git a/src/bedtools/bedtools_genomecov/test_data/example.bam b/src/bedtools/bedtools_genomecov/test_data/example.bam new file mode 100644 index 00000000..ffc075ab Binary files /dev/null and b/src/bedtools/bedtools_genomecov/test_data/example.bam differ diff --git a/src/bedtools/bedtools_groupby/config.vsh.yaml b/src/bedtools/bedtools_groupby/config.vsh.yaml new file mode 100644 index 00000000..89c4845b --- /dev/null +++ b/src/bedtools/bedtools_groupby/config.vsh.yaml @@ -0,0 +1,155 @@ +name: bedtools_groupby +namespace: bedtools +description: | + Summarizes a dataset column based upon common column groupings. + Akin to the SQL "group by" command. +keywords: [groupby, BED] +links: + documentation: https://bedtools.readthedocs.io/en/latest/content/tools/groupby.html + repository: https://github.com/arq5x/bedtools2 + homepage: https://bedtools.readthedocs.io/en/latest/# + issue_tracker: https://github.com/arq5x/bedtools2/issues +references: + doi: 10.1093/bioinformatics/btq033 +license: MIT +requirements: + commands: [bedtools] +authors: + - __merge__: /src/_authors/theodoro_gasperin.yaml + roles: [ author, maintainer ] + +argument_groups: + - name: Inputs + arguments: + - name: --input + alternatives: -i + type: file + direction: input + description: | + The input BED file to be used. + required: true + example: input_a.bed + + - name: Outputs + arguments: + - name: --output + type: file + direction: output + description: | + The output groupby BED file. + required: true + example: output.bed + + - name: Options + arguments: + - name: --groupby + alternatives: [-g, -grp] + type: string + description: | + Specify the columns (1-based) for the grouping. + The columns must be comma separated. + - Default: 1,2,3 + required: true + + - name: --column + alternatives: [-c, -opCols] + type: integer + description: | + Specify the column (1-based) that should be summarized. + required: true + + - name: --operation + alternatives: [-o, -ops] + type: string + description: | + Specify the operation that should be applied to opCol. + Valid operations: + sum, count, count_distinct, min, max, + mean, median, mode, antimode, + stdev, sstdev (sample standard dev.), + collapse (i.e., print a comma separated list (duplicates allowed)), + distinct (i.e., print a comma separated list (NO duplicates allowed)), + distinct_sort_num (as distinct, but sorted numerically, ascending), + distinct_sort_num_desc (as distinct, but sorted numerically, descending), + concat (i.e., merge values into a single, non-delimited string), + freqdesc (i.e., print desc. list of values:freq) + freqasc (i.e., print asc. list of values:freq) + first (i.e., print first value) + last (i.e., print last value) + + Default value: sum + + If there is only column, but multiple operations, all operations will be + applied on that column. Likewise, if there is only one operation, but + multiple columns, that operation will be applied to all columns. + Otherwise, the number of columns must match the the number of operations, + and will be applied in respective order. + E.g., "-c 5,4,6 -o sum,mean,count" will give the sum of column 5, + the mean of column 4, and the count of column 6. + The order of output columns will match the ordering given in the command. + + - name: --full + type: boolean_true + description: | + Print all columns from input file. The first line in the group is used. + Default: print only grouped columns. + + - name: --inheader + type: boolean_true + description: | + Input file has a header line - the first line will be ignored. + + - name: --outheader + type: boolean_true + description: | + Print header line in the output, detailing the column names. + If the input file has headers (-inheader), the output file + will use the input's column names. + If the input file has no headers, the output file + will use "col_1", "col_2", etc. as the column names. + + - name: --header + type: boolean_true + description: same as '-inheader -outheader'. + + - name: --ignorecase + type: boolean_true + description: | + Group values regardless of upper/lower case. + + - name: --precision + alternatives: -prec + type: integer + description: | + Sets the decimal precision for output. + default: 5 + + - name: --delimiter + alternatives: -delim + type: string + description: | + Specify a custom delimiter for the collapse operations. + example: "|" + default: "," + +resources: + - type: bash_script + path: script.sh + +test_resources: + - type: bash_script + path: test.sh + +engines: + - type: docker + image: debian:stable-slim + setup: + - type: apt + packages: [bedtools, procps] + - type: docker + run: | + echo "bedtools: \"$(bedtools --version | sed -n 's/^bedtools //p')\"" > /var/software_versions.txt + +runners: + - type: executable + - type: nextflow diff --git a/src/bedtools/bedtools_groupby/help.txt b/src/bedtools/bedtools_groupby/help.txt new file mode 100644 index 00000000..a631b4b1 --- /dev/null +++ b/src/bedtools/bedtools_groupby/help.txt @@ -0,0 +1,93 @@ +```bash +bedtools groupby +``` + +Tool: bedtools groupby +Version: v2.30.0 +Summary: Summarizes a dataset column based upon + common column groupings. Akin to the SQL "group by" command. + +Usage: bedtools groupby -g [group_column(s)] -c [op_column(s)] -o [ops] + cat [FILE] | bedtools groupby -g [group_column(s)] -c [op_column(s)] -o [ops] + +Options: + -i Input file. Assumes "stdin" if omitted. + + -g -grp Specify the columns (1-based) for the grouping. + The columns must be comma separated. + - Default: 1,2,3 + + -c -opCols Specify the column (1-based) that should be summarized. + - Required. + + -o -ops Specify the operation that should be applied to opCol. + Valid operations: + sum, count, count_distinct, min, max, + mean, median, mode, antimode, + stdev, sstdev (sample standard dev.), + collapse (i.e., print a comma separated list (duplicates allowed)), + distinct (i.e., print a comma separated list (NO duplicates allowed)), + distinct_sort_num (as distinct, but sorted numerically, ascending), + distinct_sort_num_desc (as distinct, but sorted numerically, descending), + concat (i.e., merge values into a single, non-delimited string), + freqdesc (i.e., print desc. list of values:freq) + freqasc (i.e., print asc. list of values:freq) + first (i.e., print first value) + last (i.e., print last value) + - Default: sum + + If there is only column, but multiple operations, all operations will be + applied on that column. Likewise, if there is only one operation, but + multiple columns, that operation will be applied to all columns. + Otherwise, the number of columns must match the the number of operations, + and will be applied in respective order. + E.g., "-c 5,4,6 -o sum,mean,count" will give the sum of column 5, + the mean of column 4, and the count of column 6. + The order of output columns will match the ordering given in the command. + + + -full Print all columns from input file. The first line in the group is used. + Default: print only grouped columns. + + -inheader Input file has a header line - the first line will be ignored. + + -outheader Print header line in the output, detailing the column names. + If the input file has headers (-inheader), the output file + will use the input's column names. + If the input file has no headers, the output file + will use "col_1", "col_2", etc. as the column names. + + -header same as '-inheader -outheader' + + -ignorecase Group values regardless of upper/lower case. + + -prec Sets the decimal precision for output (Default: 5) + + -delim Specify a custom delimiter for the collapse operations. + - Example: -delim "|" + - Default: ",". + +Examples: + $ cat ex1.out + chr1 10 20 A chr1 15 25 B.1 1000 ATAT + chr1 10 20 A chr1 25 35 B.2 10000 CGCG + + $ groupBy -i ex1.out -g 1,2,3,4 -c 9 -o sum + chr1 10 20 A 11000 + + $ groupBy -i ex1.out -grp 1,2,3,4 -opCols 9,9 -ops sum,max + chr1 10 20 A 11000 10000 + + $ groupBy -i ex1.out -g 1,2,3,4 -c 8,9 -o collapse,mean + chr1 10 20 A B.1,B.2, 5500 + + $ cat ex1.out | groupBy -g 1,2,3,4 -c 8,9 -o collapse,mean + chr1 10 20 A B.1,B.2, 5500 + + $ cat ex1.out | groupBy -g 1,2,3,4 -c 10 -o concat + chr1 10 20 A ATATCGCG + +Notes: + (1) The input file/stream should be sorted/grouped by the -grp. columns + (2) If -i is unspecified, input is assumed to come from stdin. + diff --git a/src/bedtools/bedtools_groupby/script.sh b/src/bedtools/bedtools_groupby/script.sh new file mode 100644 index 00000000..b8a40cdc --- /dev/null +++ b/src/bedtools/bedtools_groupby/script.sh @@ -0,0 +1,36 @@ +#!/bin/bash + +## VIASH START +## VIASH END + +# Exit on error +set -eo pipefail + +# Unset parameters +unset_if_false=( + par_full + par_inheader + par_outheader + par_header + par_ignorecase +) + +for par in ${unset_if_false[@]}; do + test_val="${!par}" + [[ "$test_val" == "false" ]] && unset $par +done + +bedtools groupby \ + ${par_full:+-full} \ + ${par_inheader:+-inheader} \ + ${par_outheader:+-outheader} \ + ${par_header:+-header} \ + ${par_ignorecase:+-ignorecase} \ + ${par_precision:+-prec "$par_precision"} \ + ${par_delimiter:+-delim "$par_delimiter"} \ + -i "$par_input" \ + -g "$par_groupby" \ + -c "$par_column" \ + ${par_operation:+-o "$par_operation"} \ + > "$par_output" + \ No newline at end of file diff --git a/src/bedtools/bedtools_groupby/test.sh b/src/bedtools/bedtools_groupby/test.sh new file mode 100644 index 00000000..ce99a1ec --- /dev/null +++ b/src/bedtools/bedtools_groupby/test.sh @@ -0,0 +1,198 @@ +#!/bin/bash + +# exit on error +set -eo pipefail + +## VIASH START +meta_executable="target/executable/bedtools/bedtools_groupby/bedtools_groupby" +meta_resources_dir="src/bedtools/bedtools_groupby" +## VIASH END + +############################################# +# helper functions +assert_file_exists() { + [ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; } +} +assert_file_not_empty() { + [ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; } +} +assert_file_contains() { + grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; } +} +assert_identical_content() { + diff -a "$2" "$1" \ + || (echo "Files are not identical!" && exit 1) +} +############################################# + +# Create directories for tests +echo "Creating Test Data..." +TMPDIR=$(mktemp -d "$meta_temp_dir/XXXXXX") +function clean_up { + [[ -d "$TMPDIR" ]] && rm -r "$TMPDIR" +} +trap clean_up EXIT + +# Create and populate example.bed +cat << EOF > $TMPDIR/example.bed +# Header +chr21 9719758 9729320 variant1 chr21 9719768 9721892 ALR/Alpha 1004 + +chr21 9719758 9729320 variant1 chr21 9721905 9725582 ALR/Alpha 1010 + +chr21 9719758 9729320 variant1 chr21 9725582 9725977 L1PA3 3288 + +chr21 9719758 9729320 variant1 chr21 9726021 9729309 ALR/Alpha 1051 + +chr21 9729310 9757478 variant2 chr21 9729320 9729809 L1PA3 3897 - +chr21 9729310 9757478 variant2 chr21 9729809 9730866 L1P1 8367 + +chr21 9729310 9757478 variant2 chr21 9730866 9734026 ALR/Alpha 1036 - +chr21 9729310 9757478 variant2 chr21 9734037 9757471 ALR/Alpha 1182 - +chr21 9795588 9796685 variant3 chr21 9795589 9795713 (GAATG)n 308 + +chr21 9795588 9796685 variant3 chr21 9795736 9795894 (GAATG)n 683 + +chr21 9795588 9796685 variant3 chr21 9795911 9796007 (GAATG)n 345 + +chr21 9795588 9796685 variant3 chr21 9796028 9796187 (GAATG)n 756 + +chr21 9795588 9796685 variant3 chr21 9796202 9796615 (GAATG)n 891 + +chr21 9795588 9796685 variant3 chr21 9796637 9796824 (GAATG)n 621 + +EOF + +# Create and populate expected output files for different tests +cat << EOF > $TMPDIR/expected.bed +chr21 9719758 9729320 6353 +chr21 9729310 9757478 14482 +chr21 9795588 9796685 3604 +EOF +cat << EOF > $TMPDIR/expected_max.bed +chr21 9719758 9729320 variant1 3288 +chr21 9729310 9757478 variant2 8367 +chr21 9795588 9796685 variant3 891 +EOF +cat << EOF > $TMPDIR/expected_full.bed +chr21 9719758 9729320 variant1 chr21 9719768 9721892 ALR/Alpha 1004 + 6353 +chr21 9729310 9757478 variant2 chr21 9729320 9729809 L1PA3 3897 - 14482 +chr21 9795588 9796685 variant3 chr21 9795589 9795713 (GAATG)n 308 + 3604 +EOF +cat << EOF > $TMPDIR/expected_delimited.bed +chr21 9719758 9729320 variant1 1004;1010;3288;1051 +chr21 9729310 9757478 variant2 3897;8367;1036;1182 +chr21 9795588 9796685 variant3 308;683;345;756;891;621 +EOF +cat << EOF > $TMPDIR/expected_precision.bed +chr21 9719758 9729320 variant1 1.6e+03 +chr21 9729310 9757478 variant2 3.6e+03 +chr21 9795588 9796685 variant3 6e+02 +EOF + +# Test 1: without operation option, default operation is sum +mkdir "$TMPDIR/test1" && pushd "$TMPDIR/test1" > /dev/null + +echo "> Run bedtools groupby on BED file" +"$meta_executable" \ + --input "../example.bed" \ + --groupby "1,2,3" \ + --column "9" \ + --output "output.bed" + +# checks +assert_file_exists "output.bed" +assert_file_not_empty "output.bed" +assert_identical_content "output.bed" "../expected.bed" +echo "- test1 succeeded -" + +popd > /dev/null + +# Test 2: with operation max option +mkdir "$TMPDIR/test2" && pushd "$TMPDIR/test2" > /dev/null + +echo "> Run bedtools groupby on BED file with max operation" +"$meta_executable" \ + --input "../example.bed" \ + --groupby "1-4" \ + --column "9" \ + --operation "max" \ + --output "output.bed" + +# checks +assert_file_exists "output.bed" +assert_file_not_empty "output.bed" +assert_identical_content "output.bed" "../expected_max.bed" +echo "- test2 succeeded -" + +popd > /dev/null + +# Test 3: full option +mkdir "$TMPDIR/test3" && pushd "$TMPDIR/test3" > /dev/null + +echo "> Run bedtools groupby on BED file with full option" +"$meta_executable" \ + --input "../example.bed" \ + --groupby "1-4" \ + --column "9" \ + --full \ + --output "output.bed" + +# checks +assert_file_exists "output.bed" +assert_file_not_empty "output.bed" +assert_identical_content "output.bed" "../expected_full.bed" +echo "- test3 succeeded -" + +popd > /dev/null + +# Test 4: header option +mkdir "$TMPDIR/test4" && pushd "$TMPDIR/test4" > /dev/null + +echo "> Run bedtools groupby on BED file with header option" +"$meta_executable" \ + --input "../example.bed" \ + --groupby "1-4" \ + --column "9" \ + --header \ + --output "output.bed" + +# checks +assert_file_exists "output.bed" +assert_file_not_empty "output.bed" +assert_file_contains "output.bed" "# Header" +echo "- test4 succeeded -" + +popd > /dev/null + +# Test 5: Delimiter and collapse +mkdir "$TMPDIR/test5" && pushd "$TMPDIR/test5" > /dev/null + +echo "> Run bedtools groupby on BED file with delimiter and collapse options" +"$meta_executable" \ + --input "../example.bed" \ + --groupby "1-4" \ + --column "9" \ + --operation "collapse" \ + --delimiter ";" \ + --output "output.bed" + +# checks +assert_file_exists "output.bed" +assert_file_not_empty "output.bed" +assert_identical_content "output.bed" "../expected_delimited.bed" +echo "- test5 succeeded -" + +popd > /dev/null + +# Test 6: precision option +mkdir "$TMPDIR/test6" && pushd "$TMPDIR/test6" > /dev/null + +echo "> Run bedtools groupby on BED file with precision option" +"$meta_executable" \ + --input "../example.bed" \ + --groupby "1-4" \ + --column "9" \ + --operation "mean" \ + --precision 2 \ + --output "output.bed" + +# checks +assert_file_exists "output.bed" +assert_file_not_empty "output.bed" +assert_identical_content "output.bed" "../expected_precision.bed" +echo "- test6 succeeded -" + +popd > /dev/null + +echo "---- All tests succeeded! ----" +exit 0 diff --git a/src/bedtools/bedtools_links/config.vsh.yaml b/src/bedtools/bedtools_links/config.vsh.yaml new file mode 100644 index 00000000..b4e43cd3 --- /dev/null +++ b/src/bedtools/bedtools_links/config.vsh.yaml @@ -0,0 +1,91 @@ +name: bedtools_links +namespace: bedtools +description: | + Creates an HTML file with links to an instance of the UCSC Genome Browser for all features / intervals in a file. + This is useful for cases when one wants to manually inspect through a large set of annotations or features. +keywords: [Links, BED, GFF, VCF] +links: + documentation: https://bedtools.readthedocs.io/en/latest/content/tools/links.html + repository: https://github.com/arq5x/bedtools2 + homepage: https://bedtools.readthedocs.io/en/latest/# + issue_tracker: https://github.com/arq5x/bedtools2/issues +references: + doi: 10.1093/bioinformatics/btq033 +license: MIT +requirements: + commands: [bedtools] +authors: + - __merge__: /src/_authors/theodoro_gasperin.yaml + roles: [ author, maintainer ] + +argument_groups: + - name: Inputs + arguments: + - name: --input + alternatives: -i + type: file + description: Input file (bed/gff/vcf). + required: true + + - name: Outputs + arguments: + - name: --output + alternatives: -o + type: file + direction: output + description: Output HTML file to be written. + + - name: Options + description: | + By default, the links created will point to human (hg18) UCSC browser. + If you have a local mirror, you can override this behavior by supplying + the -base, -org, and -db options. + + For example, if the URL of your local mirror for mouse MM9 is called: + http://mymirror.myuniversity.edu, then you would use the following: + --base_url http://mymirror.myuniversity.edu + --organism mouse + --database mm9 + arguments: + - name: --base_url + alternatives: -base + type: string + description: | + The “basename” for the UCSC browser. + default: http://genome.ucsc.edu + + - name: --organism + alternatives: -org + type: string + description: | + The organism (e.g. mouse, human). + default: human + + - name: --database + alternatives: -db + type: string + description: | + The genome build. + default: hg18 + +resources: + - type: bash_script + path: script.sh + +test_resources: + - type: bash_script + path: test.sh + +engines: + - type: docker + image: debian:stable-slim + setup: + - type: apt + packages: [bedtools, procps] + - type: docker + run: | + echo "bedtools: \"$(bedtools --version | sed -n 's/^bedtools //p')\"" > /var/software_versions.txt + +runners: + - type: executable + - type: nextflow diff --git a/src/bedtools/bedtools_links/help.txt b/src/bedtools/bedtools_links/help.txt new file mode 100644 index 00000000..d848d989 --- /dev/null +++ b/src/bedtools/bedtools_links/help.txt @@ -0,0 +1,25 @@ +``` +bedtools links -h +``` + +Tool: bedtools links (aka linksBed) +Version: v2.30.0 +Summary: Creates HTML links to an UCSC Genome Browser from a feature file. + +Usage: bedtools links [OPTIONS] -i > out.html + +Options: + -base The browser basename. Default: http://genome.ucsc.edu + -org The organism. Default: human + -db The build. Default: hg18 + +Example: + By default, the links created will point to human (hg18) UCSC browser. + If you have a local mirror, you can override this behavior by supplying + the -base, -org, and -db options. + + For example, if the URL of your local mirror for mouse MM9 is called: + http://mymirror.myuniversity.edu, then you would use the following: + -base http://mymirror.myuniversity.edu + -org mouse + -db mm9 diff --git a/src/bedtools/bedtools_links/script.sh b/src/bedtools/bedtools_links/script.sh new file mode 100644 index 00000000..b8ee9a56 --- /dev/null +++ b/src/bedtools/bedtools_links/script.sh @@ -0,0 +1,14 @@ +#!/bin/bash + +## VIASH START +## VIASH END + +set -eo pipefail + +# Execute bedtools links +bedtools links \ + ${par_base_url:+-base "$par_base_url"} \ + ${par_organism:+-org "$par_organism"} \ + ${par_database:+-db "$par_database"} \ + -i "$par_input" \ + > "$par_output" diff --git a/src/bedtools/bedtools_links/test.sh b/src/bedtools/bedtools_links/test.sh new file mode 100644 index 00000000..d79cbd6c --- /dev/null +++ b/src/bedtools/bedtools_links/test.sh @@ -0,0 +1,98 @@ +#!/bin/bash + +# exit on error +set -eo pipefail + +############################################# +# helper functions +assert_file_exists() { + [ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; } +} +assert_file_not_empty() { + [ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; } +} +assert_file_contains() { + grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; } +} +assert_identical_content() { + diff -a "$2" "$1" \ + || (echo "Files are not identical!" && exit 1) +} +############################################# + +# Create directories for tests +echo "Creating Test Data..." +TMPDIR=$(mktemp -d "$meta_temp_dir/XXXXXX") +function clean_up { + [[ -d "$TMPDIR" ]] && rm -r "$TMPDIR" +} +trap clean_up EXIT + +# Create test data +cat < "$TMPDIR/genes.bed" +chr21 9928613 10012791 uc002yip.1 0 - +chr21 9928613 10012791 uc002yiq.1 0 - +chr21 9928613 10012791 uc002yir.1 0 - +chr21 9928613 10012791 uc010gkv.1 0 - +chr21 9928613 10061300 uc002yis.1 0 - +chr21 10042683 10120796 uc002yit.1 0 - +chr21 10042683 10120808 uc002yiu.1 0 - +chr21 10079666 10120808 uc002yiv.1 0 - +chr21 10080031 10081687 uc002yiw.1 0 - +chr21 10081660 10120796 uc002yix.2 0 - +EOF + +# Test 1: Default Use +mkdir "$TMPDIR/test1" && pushd "$TMPDIR/test1" > /dev/null + +echo "> Run bedtools_links on BED file" +"$meta_executable" \ + --input "../genes.bed" \ + --output "genes.html" + +# checks +assert_file_exists "genes.html" +assert_file_not_empty "genes.html" +assert_file_contains "genes.html" "uc002yip.1" +echo "- test1 succeeded -" + +popd > /dev/null + +# Test 2: Base URL +mkdir "$TMPDIR/test2" && pushd "$TMPDIR/test2" > /dev/null + +echo "> Run bedtools_links with base option" +"$meta_executable" \ + --input "../genes.bed" \ + --output "genes.html" \ + --base_url "http://genome.ucsc.edu" + +# checks +assert_file_exists "genes.html" +assert_file_not_empty "genes.html" +assert_file_contains "genes.html" "uc002yip.1" +echo "- test2 succeeded -" + +popd > /dev/null + +# Test 3: Organism and Genome Database Build +mkdir "$TMPDIR/test3" && pushd "$TMPDIR/test3" > /dev/null + +echo "> Run bedtools_links with organism option and genome database build" +"$meta_executable" \ + --input "../genes.bed" \ + --output "genes.html" \ + --base_url "http://genome.ucsc.edu" \ + --organism "mouse" \ + --database "mm9" + +# checks +assert_file_exists "genes.html" +assert_file_not_empty "genes.html" +assert_file_contains "genes.html" "uc002yip.1" +echo "- test3 succeeded -" + +popd > /dev/null + +echo "---- All tests succeeded! ----" +exit 0 diff --git a/src/bedtools/bedtools_merge/config.vsh.yaml b/src/bedtools/bedtools_merge/config.vsh.yaml new file mode 100644 index 00000000..45e4a01d --- /dev/null +++ b/src/bedtools/bedtools_merge/config.vsh.yaml @@ -0,0 +1,160 @@ +name: bedtools_merge +namespace: bedtools +description: | + Merges overlapping BED/GFF/VCF entries into a single interval. +links: + documentation: https://bedtools.readthedocs.io/en/latest/content/tools/merge.html + repository: https://github.com/arq5x/bedtools2 + homepage: https://bedtools.readthedocs.io/en/latest/# + issue_tracker: https://github.com/arq5x/bedtools2/issues +references: + doi: 10.1093/bioinformatics/btq033 +license: MIT +requirements: + commands: [bedtools] +authors: + - __merge__: /src/_authors/theodoro_gasperin.yaml + roles: [ author, maintainer ] + +argument_groups: + - name: Inputs + arguments: + - name: --input + alternatives: -i + type: file + description: Input file (BED/GFF/VCF) to be merged. + required: true + + - name: Outputs + arguments: + - name: --output + type: file + direction: output + description: Output merged file BED to be written. + required: true + + - name: Options + arguments: + - name: --strand + alternatives: -s + type: boolean_true + description: | + Force strandedness. That is, only merge features + that are on the same strand. + - By default, merging is done without respect to strand. + + - name: --specific_strand + alternatives: -S + type: string + choices: ["+", "-"] + description: | + Force merge for one specific strand only. + Follow with + or - to force merge from only + the forward or reverse strand, respectively. + - By default, merging is done without respect to strand. + + - name: --distance + alternatives: -d + type: integer + description: | + Maximum distance between features allowed for features + to be merged. + - Def. 0. That is, overlapping & book-ended features are merged. + - (INTEGER) + - Note: negative values enforce the number of b.p. required for overlap. + + - name: --columns + alternatives: -c + type: integer + description: | + Specify columns from the B file to map onto intervals in A. + Default: 5. + Multiple columns can be specified in a comma-delimited list. + + - name: --operation + alternatives: -o + type: string + description: | + Specify the operation that should be applied to -c. + Valid operations: + sum, min, max, absmin, absmax, + mean, median, mode, antimode + stdev, sstdev + collapse (i.e., print a delimited list (duplicates allowed)), + distinct (i.e., print a delimited list (NO duplicates allowed)), + distinct_sort_num (as distinct, sorted numerically, ascending), + distinct_sort_num_desc (as distinct, sorted numerically, desscending), + distinct_only (delimited list of only unique values), + count + count_distinct (i.e., a count of the unique values in the column), + first (i.e., just the first value in the column), + last (i.e., just the last value in the column), + Default: sum + Multiple operations can be specified in a comma-delimited list. + + If there is only column, but multiple operations, all operations will be + applied on that column. Likewise, if there is only one operation, but + multiple columns, that operation will be applied to all columns. + Otherwise, the number of columns must match the the number of operations, + and will be applied in respective order. + E.g., "-c 5,4,6 -o sum,mean,count" will give the sum of column 5, + the mean of column 4, and the count of column 6. + The order of output columns will match the ordering given in the command. + + - name: --delimiter + alternatives: -delim + type: string + description: | + Specify a custom delimiter for the collapse operations. + example: "|" + default: "," + + - name: --precision + alternatives: -prec + type: integer + description: | + Sets the decimal precision for output (Default: 5). + + - name: --bed + type: boolean_true + description: | + If using BAM input, write output as BED. + + - name: --header + type: boolean_true + description: | + Print the header from the A file prior to results. + + - name: --no_buffer + alternatives: -nobuf + type: boolean_true + description: | + Disable buffered output. Using this option will cause each line + of output to be printed as it is generated, rather than saved + in a buffer. This will make printing large output files + noticeably slower, but can be useful in conjunction with + other software tools and scripts that need to process one + line of bedtools output at a time. + +resources: + - type: bash_script + path: script.sh + +test_resources: + - type: bash_script + path: test.sh + - path: test_data + +engines: + - type: docker + image: debian:stable-slim + setup: + - type: apt + packages: [bedtools, procps] + - type: docker + run: | + echo "bedtools: \"$(bedtools --version | sed -n 's/^bedtools //p')\"" > /var/software_versions.txt + +runners: + - type: executable + - type: nextflow \ No newline at end of file diff --git a/src/bedtools/bedtools_merge/help.txt b/src/bedtools/bedtools_merge/help.txt new file mode 100644 index 00000000..bc78fc67 --- /dev/null +++ b/src/bedtools/bedtools_merge/help.txt @@ -0,0 +1,85 @@ +```bash +bedtools merge +``` + +Tool: bedtools merge (aka mergeBed) +Version: v2.30.0 +Summary: Merges overlapping BED/GFF/VCF entries into a single interval. + +Usage: bedtools merge [OPTIONS] -i + +Options: + -s Force strandedness. That is, only merge features + that are on the same strand. + - By default, merging is done without respect to strand. + + -S Force merge for one specific strand only. + Follow with + or - to force merge from only + the forward or reverse strand, respectively. + - By default, merging is done without respect to strand. + + -d Maximum distance between features allowed for features + to be merged. + - Def. 0. That is, overlapping & book-ended features are merged. + - (INTEGER) + - Note: negative values enforce the number of b.p. required for overlap. + + -c Specify columns from the B file to map onto intervals in A. + Default: 5. + Multiple columns can be specified in a comma-delimited list. + + -o Specify the operation that should be applied to -c. + Valid operations: + sum, min, max, absmin, absmax, + mean, median, mode, antimode + stdev, sstdev + collapse (i.e., print a delimited list (duplicates allowed)), + distinct (i.e., print a delimited list (NO duplicates allowed)), + distinct_sort_num (as distinct, sorted numerically, ascending), + distinct_sort_num_desc (as distinct, sorted numerically, desscending), + distinct_only (delimited list of only unique values), + count + count_distinct (i.e., a count of the unique values in the column), + first (i.e., just the first value in the column), + last (i.e., just the last value in the column), + Default: sum + Multiple operations can be specified in a comma-delimited list. + + If there is only column, but multiple operations, all operations will be + applied on that column. Likewise, if there is only one operation, but + multiple columns, that operation will be applied to all columns. + Otherwise, the number of columns must match the the number of operations, + and will be applied in respective order. + E.g., "-c 5,4,6 -o sum,mean,count" will give the sum of column 5, + the mean of column 4, and the count of column 6. + The order of output columns will match the ordering given in the command. + + + -delim Specify a custom delimiter for the collapse operations. + - Example: -delim "|" + - Default: ",". + + -prec Sets the decimal precision for output (Default: 5) + + -bed If using BAM input, write output as BED. + + -header Print the header from the A file prior to results. + + -nobuf Disable buffered output. Using this option will cause each line + of output to be printed as it is generated, rather than saved + in a buffer. This will make printing large output files + noticeably slower, but can be useful in conjunction with + other software tools and scripts that need to process one + line of bedtools output at a time. + + -iobuf Specify amount of memory to use for input buffer. + Takes an integer argument. Optional suffixes K/M/G supported. + Note: currently has no effect with compressed files. + +Notes: + (1) The input file (-i) file must be sorted by chrom, then start. + + + + +***** ERROR: No input file given. Exiting. ***** diff --git a/src/bedtools/bedtools_merge/script.sh b/src/bedtools/bedtools_merge/script.sh new file mode 100644 index 00000000..db50dd83 --- /dev/null +++ b/src/bedtools/bedtools_merge/script.sh @@ -0,0 +1,35 @@ +#!/bin/bash + +## VIASH START +## VIASH END + +# Exit on error +set -eo pipefail + +# Unset parameters +unset_if_false=( + par_strand + par_bed + par_header + par_no_buffer +) + +for par in ${unset_if_false[@]}; do + test_val="${!par}" + [[ "$test_val" == "false" ]] && unset $par +done + +# Execute bedtools merge with the provided arguments +bedtools merge \ + ${par_strand:+-s} \ + ${par_specific_strand:+-S "$par_specific_strand"} \ + ${par_bed:+-bed} \ + ${par_header:+-header} \ + ${par_no_buffer:+-nobuf} \ + ${par_distance:+-d "$par_distance"} \ + ${par_columns:+-c "$par_columns"} \ + ${par_operation:+-o "$par_operation"} \ + ${par_delimiter:+-delim "$par_delimiter"} \ + ${par_precision:+-prec "$par_precision"} \ + -i "$par_input" \ + > "$par_output" diff --git a/src/bedtools/bedtools_merge/test.sh b/src/bedtools/bedtools_merge/test.sh new file mode 100644 index 00000000..e2b46c15 --- /dev/null +++ b/src/bedtools/bedtools_merge/test.sh @@ -0,0 +1,222 @@ +#!/bin/bash + +# exit on error +set -eo pipefail + +## VIASH START +meta_executable="target/executable/bedtools/bedtools_sort/bedtools_merge" +meta_resources_dir="src/bedtools/bedtools_merge" +## VIASH END + +# directory of the bam file +test_data="$meta_resources_dir/test_data" + +############################################# +# helper functions +assert_file_exists() { + [ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; } +} +assert_file_not_empty() { + [ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; } +} +assert_file_contains() { + grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; } +} +assert_identical_content() { + diff -a "$2" "$1" \ + || (echo "Files are not identical!" && exit 1) +} +############################################# + +# Create directories for tests +echo "Creating Test Data..." +TMPDIR=$(mktemp -d "$meta_temp_dir/XXXXXX") +function clean_up { + [[ -d "$TMPDIR" ]] && rm -r "$TMPDIR" +} +trap clean_up EXIT + +# Create and populate example files +printf "chr1\t100\t200\nchr1\t150\t250\nchr1\t300\t400\n" > "$TMPDIR/featureA.bed" +printf "chr1\t100\t200\ta1\t1\t+\nchr1\t180\t250\ta2\t2\t+\nchr1\t250\t500\ta3\t3\t-\nchr1\t501\t1000\ta4\t4\t+\n" > "$TMPDIR/featureB.bed" +printf "chr1\t100\t200\ta1\t1.9\t+\nchr1\t180\t250\ta2\t2.5\t+\nchr1\t250\t500\ta3\t3.3\t-\nchr1\t501\t1000\ta4\t4\t+\n" > "$TMPDIR/feature_precision.bed" + +# Create and populate feature.gff file +printf "##gff-version 3\n" > "$TMPDIR/feature.gff" +printf "chr1\t.\tgene\t1000\t2000\t.\t+\t.\tID=gene1;Name=Gene1\n" >> "$TMPDIR/feature.gff" +printf "chr1\t.\texon\t1000\t1200\t.\t+\t.\tID=exon1;Parent=transcript1\n" >> "$TMPDIR/feature.gff" +printf "chr1\t.\tCDS\t1000\t1200\t.\t+\t0\tID=cds1;Parent=transcript1\n" >> "$TMPDIR/feature.gff" +printf "chr1\t.\tCDS\t1500\t1700\t.\t+\t2\tID=cds2;Parent=transcript1\n" >> "$TMPDIR/feature.gff" +printf "chr2\t.\texon\t1500\t1700\t.\t+\t.\tID=exon2;Parent=transcript1\n" >> "$TMPDIR/feature.gff" +printf "chr3\t.\tmRNA\t1000\t2000\t.\t+\t.\tID=transcript1;Parent=gene1\n" >> "$TMPDIR/feature.gff" + +# Create expected output files +printf "chr1\t100\t250\nchr1\t300\t400\n" > "$TMPDIR/expected.bed" +printf "chr1\t100\t250\nchr1\t250\t500\nchr1\t501\t1000\n" > "$TMPDIR/expected_strand.bed" +printf "chr1\t100\t250\nchr1\t501\t1000\n" > "$TMPDIR/expected_specific_strand.bed" +printf "chr1\t128\t228\nchr1\t428\t528\n" > "$TMPDIR/expected_bam.bed" +printf "chr1\t100\t400\n" > "$TMPDIR/expected_distance.bed" +printf "chr1\t100\t500\t2\t1\t3\nchr1\t501\t1000\t4\t4\t4\n" > "$TMPDIR/expected_operation.bed" +printf "chr1\t100\t500\ta1|a2|a3\nchr1\t501\t1000\ta4\n" > "$TMPDIR/expected_delim.bed" +printf "chr1\t100\t500\t2.567\nchr1\t501\t1000\t4\n" > "$TMPDIR/expected_precision.bed" +printf "##gff-version 3\nchr1\t999\t2000\nchr2\t1499\t1700\nchr3\t999\t2000\n" > "$TMPDIR/expected_header.bed" + +# Test 1: Default sort on BED file +mkdir "$TMPDIR/test1" && pushd "$TMPDIR/test1" > /dev/null + +echo "> Run bedtools_merge on BED file" +"$meta_executable" \ + --input "../featureA.bed" \ + --output "output.bed" + +# # checks +assert_file_exists "output.bed" +assert_file_not_empty "output.bed" +assert_identical_content "output.bed" "../expected.bed" +echo "- test1 succeeded -" + +popd > /dev/null + +# Test 2: strand option +mkdir "$TMPDIR/test2" && pushd "$TMPDIR/test2" > /dev/null + +echo "> Run bedtools_merge on BED file with strand option" +"$meta_executable" \ + --input "../featureB.bed" \ + --output "output.bed" \ + --strand + +# checks +assert_file_exists "output.bed" +assert_file_not_empty "output.bed" +assert_identical_content "output.bed" "../expected_strand.bed" +echo "- test2 succeeded -" + +popd > /dev/null + +# Test 3: specific strand option +mkdir "$TMPDIR/test3" && pushd "$TMPDIR/test3" > /dev/null + +echo "> Run bedtools_merge on BED file with specific strand option" +"$meta_executable" \ + --input "../featureB.bed" \ + --output "output.bed" \ + --specific_strand "+" + +# checks +assert_file_exists "output.bed" +assert_file_not_empty "output.bed" +assert_identical_content "output.bed" "../expected_specific_strand.bed" +echo "- test3 succeeded -" + +popd > /dev/null + +# Test 4: BED option +mkdir "$TMPDIR/test4" && pushd "$TMPDIR/test4" > /dev/null + +echo "> Run bedtools_merge on BAM file with BED option" +"$meta_executable" \ + --input "$test_data/feature.bam" \ + --output "output.bed" \ + --bed + +# checks +assert_file_exists "output.bed" +assert_file_not_empty "output.bed" +assert_identical_content "output.bed" "../expected_bam.bed" +echo "- test4 succeeded -" + +popd > /dev/null + +# Test 5: distance option +mkdir "$TMPDIR/test5" && pushd "$TMPDIR/test5" > /dev/null + +echo "> Run bedtools_merge on BED file with distance option" +"$meta_executable" \ + --input "../featureA.bed" \ + --output "output.bed" \ + --distance -5 + +# checks +assert_file_exists "output.bed" +assert_file_not_empty "output.bed" +assert_identical_content "output.bed" "../expected.bed" +echo "- test5 succeeded -" + +popd > /dev/null + +# Test 6: columns option & operation option +mkdir "$TMPDIR/test6" && pushd "$TMPDIR/test6" > /dev/null + +echo "> Run bedtools_merge on BED file with columns & operation options" +"$meta_executable" \ + --input "../featureB.bed" \ + --output "output.bed" \ + --columns 5 \ + --operation "mean,min,max" + +# checks +assert_file_exists "output.bed" +assert_file_not_empty "output.bed" +assert_identical_content "output.bed" "../expected_operation.bed" +echo "- test6 succeeded -" + +popd > /dev/null + +# Test 7: delimeter option +mkdir "$TMPDIR/test7" && pushd "$TMPDIR/test7" > /dev/null + +echo "> Run bedtools_merge on BED file with delimeter option" +"$meta_executable" \ + --input "../featureB.bed" \ + --output "output.bed" \ + --columns 4 \ + --operation "collapse" \ + --delimiter "|" + +# checks +assert_file_exists "output.bed" +assert_file_not_empty "output.bed" +assert_identical_content "output.bed" "../expected_delim.bed" +echo "- test7 succeeded -" + +popd > /dev/null + +# Test 8: precision option +mkdir "$TMPDIR/test8" && pushd "$TMPDIR/test8" > /dev/null + +echo "> Run bedtools_merge on BED file with precision option" +"$meta_executable" \ + --input "../feature_precision.bed" \ + --output "output.bed" \ + --columns 5 \ + --operation "mean" \ + --precision 4 + +# checks +assert_file_exists "output.bed" +assert_file_not_empty "output.bed" +assert_identical_content "output.bed" "../expected_precision.bed" +echo "- test8 succeeded -" + +popd > /dev/null + +# Test 9: header option +mkdir "$TMPDIR/test9" && pushd "$TMPDIR/test9" > /dev/null + +echo "> Run bedtools_merge on GFF file with header option" +"$meta_executable" \ + --input "../feature.gff" \ + --output "output.gff" \ + --header + +# checks +assert_file_exists "output.gff" +assert_file_not_empty "output.gff" +assert_identical_content "output.gff" "../expected_header.bed" +echo "- test9 succeeded -" + +popd > /dev/null + +echo "---- All tests succeeded! ----" +exit 0 diff --git a/src/bedtools/bedtools_merge/test_data/feature.bam b/src/bedtools/bedtools_merge/test_data/feature.bam new file mode 100644 index 00000000..3d56a631 Binary files /dev/null and b/src/bedtools/bedtools_merge/test_data/feature.bam differ diff --git a/src/fastqc/config.vsh.yaml b/src/fastqc/config.vsh.yaml new file mode 100644 index 00000000..75b16f36 --- /dev/null +++ b/src/fastqc/config.vsh.yaml @@ -0,0 +1,209 @@ +name: fastqc +description: FastQC - A high throughput sequence QC analysis tool. +keywords: [Quality control, BAM, SAM, FASTQ] +links: + homepage: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ + documentation: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/ + repository: https://github.com/s-andrews/FastQC + issue_tracker: https://github.com/s-andrews/FastQC/issues +license: GPL-3.0, Apache-2.0 +authors: + - __merge__: /src/_authors/theodoro_gasperin.yaml + roles: [ author, maintainer ] + +argument_groups: + - name: Inputs + arguments: + - name: --input + type: file + direction: input + multiple: true + description: | + FASTQ file(s) to be analyzed. + required: true + example: input.fq + + - name: Outputs + description: | + At least one of the output options (--html, --zip, --summary, --data) must be used. + arguments: + + - name: --html + type: file + direction: output + multiple: true + description: | + Create the HTML report of the results. + '*' wild card must be provided in the output file name. + Wild card will be replaced by the input file basename. + e.g. + --input "sample_1.fq" + --html "*.html" + would create an output html file named sample_1.html + example: "*.html" + + - name: --zip + type: file + direction: output + multiple: true + description: | + Create the zip file(s) containing: html report, data, images, icons, summary, etc. + '*' wild card must be provided in the output file name. + Wild card will be replaced by the input basename. + e.g. + --input "sample_1.fq" + --html "*.zip" + would create an output zip file named sample_1.zip + example: "*.zip" + + - name: --summary + type: file + direction: output + multiple: true + description: | + Create the summary file(s). + '*' wild card must be provided in the output file name. + Wild card will be replaced by the input basename. + e.g. + --input "sample_1.fq" + --summary "*_summary.txt" + would create an output summary.txt file named sample_1_summary.txt + example: "*_summary.txt" + + - name: --data + type: file + direction: output + multiple: true + description: | + Create the data file(s). + '*' wild card must be provided in the output file name. + Wild card will be replaced by the input basename. + e.g. + --input "sample_1.fq" + --summary "*_data.txt" + would create an output data.txt file named sample_1_data.txt + example: "*_data.txt" + + - name: Options + arguments: + - name: --casava + type: boolean_true + description: | + Files come from raw casava output. Files in the same sample + group (differing only by the group number) will be analysed + as a set rather than individually. Sequences with the filter + flag set in the header will be excluded from the analysis. + Files must have the same names given to them by casava + (including being gzipped and ending with .gz) otherwise they + won't be grouped together correctly. + + - name: --nano + type: boolean_true + description: | + Files come from nanopore sequences and are in fast5 format. In + this mode you can pass in directories to process and the program + will take in all fast5 files within those directories and produce + a single output file from the sequences found in all files. + + - name: --nofilter + type: boolean_true + description: | + If running with --casava then don't remove read flagged by + casava as poor quality when performing the QC analysis. + + - name: --nogroup + type: boolean_true + description: | + Disable grouping of bases for reads >50bp. + All reports will show data for every base in the read. + WARNING: Using this option will cause fastqc to crash + and burn if you use it on really long reads, and your + plots may end up a ridiculous size. You have been warned! + + - name: --min_length + type: integer + description: | + Sets an artificial lower limit on the length of the + sequence to be shown in the report. As long as you + set this to a value greater or equal to your longest + read length then this will be the sequence length used + to create your read groups. This can be useful for making + directly comparable statistics from datasets with somewhat + variable read lengths. + example: 0 + + - name: --format + alternatives: -f + type: string + description: | + Bypasses the normal sequence file format detection and + forces the program to use the specified format. + Valid formats are bam, sam, bam_mapped, sam_mapped, and fastq. + example: bam + + - name: --contaminants + alternatives: -c + type: file + description: | + Specifies a non-default file which contains the list + of contaminants to screen overrepresented sequences against. + The file must contain sets of named contaminants in the form + name[tab]sequence. Lines prefixed with a hash will be ignored. + example: contaminants.txt + + - name: --adapters + alternatives: -a + type: file + description: | + Specifies a non-default file which contains the list of + adapter sequences which will be explicitly searched against + the library. The file must contain sets of named adapters + in the form name[tab]sequence. Lines prefixed with a hash will be ignored. + example: adapters.txt + + - name: --limits + alternatives: -l + type: file + description: | + Specifies a non-default file which contains + a set of criteria which will be used to determine + the warn/error limits for the various modules. + This file can also be used to selectively remove + some modules from the output altogether. The format + needs to mirror the default limits.txt file found in + the Configuration folder. + example: limits.txt + + - name: --kmers + alternatives: -k + type: integer + description: | + Specifies the length of Kmer to look for in the Kmer + content module. Specified Kmer length must be between + 2 and 10. Default length is 7 if not specified. + example: 7 + + - name: --quiet + alternatives: -q + type: boolean_true + description: | + Suppress all progress messages on stdout and only report errors. + +resources: + - type: bash_script + path: script.sh +test_resources: + - type: bash_script + path: test.sh + +engines: + - type: docker + image: biocontainers/fastqc:v0.11.9_cv8 + setup: + - type: docker + run: | + echo "fastqc: $(fastqc --version | sed -n 's/^FastQC //p')" > /var/software_versions.txt + +runners: + - type: executable + - type: nextflow diff --git a/src/fastqc/help.txt b/src/fastqc/help.txt new file mode 100644 index 00000000..502aebc0 --- /dev/null +++ b/src/fastqc/help.txt @@ -0,0 +1,125 @@ +```bash +fastqc --help +``` + + FastQC - A high throughput sequence QC analysis tool + +SYNOPSIS + + fastqc seqfile1 seqfile2 .. seqfileN + + fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam] + [-c contaminant file] seqfile1 .. seqfileN + +DESCRIPTION + + FastQC reads a set of sequence files and produces from each one a quality + control report consisting of a number of different modules, each one of + which will help to identify a different potential type of problem in your + data. + + If no files to process are specified on the command line then the program + will start as an interactive graphical application. If files are provided + on the command line then the program will run with no user interaction + required. In this mode it is suitable for inclusion into a standardised + analysis pipeline. + + The options for the program as as follows: + + -h --help Print this help file and exit + + -v --version Print the version of the program and exit + + -o --outdir Create all output files in the specified output directory. + Please note that this directory must exist as the program + will not create it. If this option is not set then the + output file for each sequence file is created in the same + directory as the sequence file which was processed. + + --casava Files come from raw casava output. Files in the same sample + group (differing only by the group number) will be analysed + as a set rather than individually. Sequences with the filter + flag set in the header will be excluded from the analysis. + Files must have the same names given to them by casava + (including being gzipped and ending with .gz) otherwise they + won't be grouped together correctly. + + --nano Files come from nanopore sequences and are in fast5 format. In + this mode you can pass in directories to process and the program + will take in all fast5 files within those directories and produce + a single output file from the sequences found in all files. + + --nofilter If running with --casava then don't remove read flagged by + casava as poor quality when performing the QC analysis. + + --extract If set then the zipped output file will be uncompressed in + the same directory after it has been created. By default + this option will be set if fastqc is run in non-interactive + mode. + + -j --java Provides the full path to the java binary you want to use to + launch fastqc. If not supplied then java is assumed to be in + your path. + + --noextract Do not uncompress the output file after creating it. You + should set this option if you do not wish to uncompress + the output when running in non-interactive mode. + + --nogroup Disable grouping of bases for reads >50bp. All reports will + show data for every base in the read. WARNING: Using this + option will cause fastqc to crash and burn if you use it on + really long reads, and your plots may end up a ridiculous size. + You have been warned! + + --min_length Sets an artificial lower limit on the length of the sequence + to be shown in the report. As long as you set this to a value + greater or equal to your longest read length then this will be + the sequence length used to create your read groups. This can + be useful for making directly comaparable statistics from + datasets with somewhat variable read lengths. + + -f --format Bypasses the normal sequence file format detection and + forces the program to use the specified format. Valid + formats are bam,sam,bam_mapped,sam_mapped and fastq + + -t --threads Specifies the number of files which can be processed + simultaneously. Each thread will be allocated 250MB of + memory so you shouldn't run more threads than your + available memory will cope with, and not more than + 6 threads on a 32 bit machine + + -c Specifies a non-default file which contains the list of + --contaminants contaminants to screen overrepresented sequences against. + The file must contain sets of named contaminants in the + form name[tab]sequence. Lines prefixed with a hash will + be ignored. + + -a Specifies a non-default file which contains the list of + --adapters adapter sequences which will be explicity searched against + the library. The file must contain sets of named adapters + in the form name[tab]sequence. Lines prefixed with a hash + will be ignored. + + -l Specifies a non-default file which contains a set of criteria + --limits which will be used to determine the warn/error limits for the + various modules. This file can also be used to selectively + remove some modules from the output all together. The format + needs to mirror the default limits.txt file found in the + Configuration folder. + + -k --kmers Specifies the length of Kmer to look for in the Kmer content + module. Specified Kmer length must be between 2 and 10. Default + length is 7 if not specified. + + -q --quiet Supress all progress messages on stdout and only report errors. + + -d --dir Selects a directory to be used for temporary files written when + generating report images. Defaults to system temp directory if + not specified. + +BUGS + + Any bugs in fastqc should be reported either to simon.andrews@babraham.ac.uk + or in www.bioinformatics.babraham.ac.uk/bugzilla/ + + diff --git a/src/fastqc/script.sh b/src/fastqc/script.sh new file mode 100644 index 00000000..5cf55868 --- /dev/null +++ b/src/fastqc/script.sh @@ -0,0 +1,86 @@ +#!/bin/bash + +## VIASH START +## VIASH END + +# exit on error +set -eo pipefail + +# Check if both outputs are empty, at least one must be passed. +if [[ -z "$par_html" ]] && [[ -z "$par_zip" ]] && [[ -z "$par_summary" ]] && [[ -z "$par_data" ]]; then + echo "Error: At least one of the output arguments (--html, --zip, --summary, and --data) must be passed." + exit 1 +fi + +# unset flags +unset_if_false=( + par_casava + par_nano + par_nofilter + par_extract + par_noextract + par_nogroup + par_quiet +) + +for par in ${unset_if_false[@]}; do + test_val="${!par}" + [[ "$test_val" == "false" ]] && unset $par +done + +tmpdir=$(mktemp -d "${meta_temp_dir}/${meta_name}-XXXXXXXX") +function clean_up { + rm -rf "$tmpdir" +} +trap clean_up EXIT + +# Create input array +IFS=";" read -ra input <<< $par_input + +# Run fastqc +fastqc \ + --extract \ + ${par_casava:+--casava} \ + ${par_nano:+--nano} \ + ${par_nofilter:+--nofilter} \ + ${par_nogroup:+--nogroup} \ + ${par_min_length:+--min_length "$par_min_length"} \ + ${par_format:+--format "$par_format"} \ + ${par_contaminants:+--contaminants "$par_contaminants"} \ + ${par_adapters:+--adapters "$par_adapters"} \ + ${par_limits:+--limits "$par_limits"} \ + ${par_kmers:+--kmers "$par_kmers"} \ + ${par_quiet:+--quiet} \ + ${meta_cpus:+--threads "$meta_cpus"} \ + ${meta_temp_dir:+--dir "$meta_temp_dir"} \ + --outdir "${tmpdir}" \ + "${input[@]}" + +# Move output files +for file in "${input[@]}"; do + # Removes everthing after the first dot of the basename + sample_name=$(basename "${file}" | sed 's/\..*$//') + if [[ -n "$par_html" ]]; then + input_html="${tmpdir}/${sample_name}_fastqc.html" + html_file="${par_html//\*/$sample_name}" + mv "$input_html" "$html_file" + fi + if [[ -n "$par_zip" ]]; then + input_zip="${tmpdir}/${sample_name}_fastqc.zip" + zip_file="${par_zip//\*/$sample_name}" + mv "$input_zip" "$zip_file" + fi + if [[ -n "$par_summary" ]]; then + summary_file="${tmpdir}/${sample_name}_fastqc/summary.txt" + new_summary="${par_summary//\*/$sample_name}" + mv "$summary_file" "$new_summary" + fi + if [[ -n "$par_data" ]]; then + data_file="${tmpdir}/${sample_name}_fastqc/fastqc_data.txt" + new_data="${par_data//\*/$sample_name}" + mv "$data_file" "$new_data" + fi + # Remove the extracted directory + rm -r "${tmpdir}/${sample_name}_fastqc" +done + diff --git a/src/fastqc/test.sh b/src/fastqc/test.sh new file mode 100644 index 00000000..8c581ac8 --- /dev/null +++ b/src/fastqc/test.sh @@ -0,0 +1,235 @@ +#!/bin/bash + +# exit on error +set -eo pipefail + +## VIASH START +# meta_executable="target/executable/fastqc" +# meta_resources_dir="src/fastqc" +## VIASH END + +############################################# +# helper functions +assert_file_exists() { + [ -f "$1" ] || { echo "File '$1' does not exist" && exit 1; } +} +assert_file_not_empty() { + [ -s "$1" ] || { echo "File '$1' is empty but shouldn't be" && exit 1; } +} +assert_file_contains() { + grep -q "$2" "$1" || { echo "File '$1' does not contain '$2'" && exit 1; } +} +assert_identical_content() { + diff -a "$2" "$1" \ + || (echo "Files are not identical!" && exit 1) +} +############################################# + +# Create directories for tests +echo "Creating Test Data..." +TMPDIR=$(mktemp -d "$meta_temp_dir/XXXXXX") +function clean_up { + [[ -d "$TMPDIR" ]] && rm -r "$TMPDIR" +} +trap clean_up EXIT + +# Create and populate input.fasta +cat > "$TMPDIR/input_1.fq" < "$TMPDIR/input_2.fq" < "$TMPDIR/contaminants.txt" +printf "contaminant_sequence2\tGATCTTGG\n" >> "$TMPDIR/contaminants.txt" + +# Create and populate SAM file +printf "@HD\tVN:1.0\tSO:unsorted\n" > "$TMPDIR/example.sam" +printf "@SQ\tSN:chr1\tLN:248956422\n" >> "$TMPDIR/example.sam" +printf "@SQ\tSN:chr2\tLN:242193529\n" >> "$TMPDIR/example.sam" +printf "@PG\tID:bowtie2\tPN:bowtie2\tVN:2.3.4.1\tCL:\"/usr/bin/bowtie2-align-s --wrapper basic-0 -x genome -U reads.fq -S output.sam\"\n" >> "$TMPDIR/example.sam" +printf "read1\t0\tchr1\t100\t255\t50M\t*\t0\t0\tACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT\tIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII\tAS:i:-10\tXN:i:0\tXM:i:0\tXO:i:0\tXG:i:0\tNM:i:0\tMD:Z:50\tYT:Z:UU\n" >> "$TMPDIR/example.sam" +printf "read2\t0\tchr2\t150\t255\t50M\t*\t0\t0\tTGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC\tIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII\tAS:i:-8\tXN:i:0\tXM:i:0\tXO:i:0\tXG:i:0\tNM:i:0\tMD:Z:50\tYT:Z:UU\n" >> "$TMPDIR/example.sam" +printf "read3\t16\tchr1\t200\t255\t50M\t*\t0\t0\tGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTA\tIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII\tAS:i:-12\tXN:i:0\tXM:i:0\tXO:i:0\tXG:i:0\tNM:i:0\tMD:Z:50\tYT:Z:UU" >> "$TMPDIR/example.sam" + +cat > "$TMPDIR/expected_summary.txt" < "$TMPDIR/expected_summary2.txt" < "$TMPDIR/expected_summary_sam.txt" < /dev/null + +echo "-> Run Test1: one input" +"$meta_executable" \ + --input "../input_1.fq" \ + --html "*_fastqc.html" \ + --zip "*_fastqc.zip" \ + --summary "*_summary.txt" \ + --data "*_data.txt" \ + --quiet \ + +assert_file_exists "input_1_fastqc.html" +assert_file_exists "input_1_fastqc.zip" +assert_file_exists "input_1_summary.txt" +assert_file_not_empty "input_1_fastqc.html" +assert_file_not_empty "input_1_fastqc.zip" +assert_identical_content "input_1_summary.txt" "../expected_summary.txt" +echo "- test succeeded -" + +popd > /dev/null + + +# Test 2: Run fastqc with multiple inputs +mkdir "$TMPDIR/test2" && pushd "$TMPDIR/test2" > /dev/null + +echo "-> Run Test2: two inputs" +"$meta_executable" \ + --input "../input_1.fq" \ + --input "../input_2.fq" \ + --html "*_fastqc.html" \ + --zip "*_fastqc.zip" \ + --summary "*_summary.txt" \ + --data "*_data.txt" \ + --quiet \ + +# File 1 +assert_file_exists "input_1_fastqc.html" +assert_file_exists "input_1_fastqc.zip" +assert_file_exists "input_1_summary.txt" +assert_file_not_empty "input_1_fastqc.html" +assert_file_not_empty "input_1_fastqc.zip" +assert_identical_content "input_1_summary.txt" "../expected_summary.txt" +# File 2 +assert_file_exists "input_2_fastqc.html" +assert_file_exists "input_2_fastqc.zip" +assert_file_exists "input_2_summary.txt" +assert_file_not_empty "input_2_fastqc.html" +assert_file_not_empty "input_2_fastqc.zip" +assert_identical_content "input_2_summary.txt" "../expected_summary2.txt" +echo "- test succeeded -" + +popd > /dev/null + +# Test 3: Run fastqc with contaminants +mkdir "$TMPDIR/test3" && pushd "$TMPDIR/test3" > /dev/null + +echo "-> Run Test3: contaminants" +"$meta_executable" \ + --input "../input_1.fq" \ + --contaminants "../contaminants.txt" \ + --html "*_fastqc.html" \ + --zip "*_fastqc.zip" \ + --summary "*_summary.txt" \ + --data "*_data.txt" \ + --quiet \ + +assert_file_exists "input_1_fastqc.html" +assert_file_exists "input_1_fastqc.zip" +assert_file_exists "input_1_summary.txt" +assert_file_not_empty "input_1_fastqc.html" +assert_file_not_empty "input_1_fastqc.zip" +assert_identical_content "input_1_summary.txt" "../expected_summary.txt" +assert_file_contains "input_1_data.txt" "contaminant" +echo "- test succeeded -" + +popd > /dev/null + +# Test 4: Run fastqc with sam file +mkdir "$TMPDIR/test4" && pushd "$TMPDIR/test4" > /dev/null + +echo "-> Run Test4: sam file" +"$meta_executable" \ + --input "../example.sam" \ + --format "sam" \ + --html "*_fastqc.html" \ + --zip "*_fastqc.zip" \ + --summary "*_summary.txt" \ + --data "*_data.txt" \ + --quiet \ + +assert_file_exists "example_fastqc.html" +assert_file_exists "example_fastqc.zip" +assert_file_exists "example_summary.txt" +assert_file_not_empty "example_fastqc.html" +assert_file_not_empty "example_fastqc.zip" +assert_identical_content "example_summary.txt" "../expected_summary_sam.txt" +echo "- test succeeded -" + +popd > /dev/null + +# Test 5: Run fastqc with multiple options +mkdir "$TMPDIR/test5" && pushd "$TMPDIR/test5" > /dev/null + +echo "-> Run Test5: multiple options" +"$meta_executable" \ + --input "../input_1.fq" \ + --contaminants "../contaminants.txt" \ + --format "fastq" \ + --nofilter \ + --nogroup \ + --min_length 10 \ + --kmers 5 \ + --html "*_fastqc.html" \ + --zip "*_fastqc.zip" \ + --summary "*_summary.txt" \ + --data "*_data.txt" \ + --quiet \ +# --casava \ + +assert_file_exists "input_1_fastqc.html" +assert_file_exists "input_1_fastqc.zip" +assert_file_exists "input_1_summary.txt" +assert_file_not_empty "input_1_fastqc.html" +assert_file_not_empty "input_1_fastqc.zip" +assert_identical_content "input_1_summary.txt" "../expected_summary.txt" +assert_file_contains "input_1_data.txt" "contaminant" +echo "- test succeeded -" + +popd > /dev/null + +echo "All tests succeeded!" +exit 0 diff --git a/src/fq_subsample/config.vsh.yaml b/src/fq_subsample/config.vsh.yaml new file mode 100644 index 00000000..2455a341 --- /dev/null +++ b/src/fq_subsample/config.vsh.yaml @@ -0,0 +1,68 @@ +name: fq_subsample +description: fq subsample outputs a subset of records from single or paired FASTQ files. +keywords: [fastq, subsample, subset] +links: + homepage: https://github.com/stjude-rust-labs/fq/blob/master/README.md + documentation: https://github.com/stjude-rust-labs/fq/blob/master/README.md + repository: https://github.com/stjude-rust-labs/fq +license: MIT + +argument_groups: +- name: "Input" + arguments: + - name: "--input_1" + type: file + required: true + description: First input fastq file to subsample. Accepts both raw and gzipped FASTQ inputs. + - name: "--input_2" + type: file + description: Second input fastq files to subsample. Accepts both raw and gzipped FASTQ inputs. + +- name: "Output" + arguments: + - name: "--output_1" + type: file + direction: output + description: Sampled read 1 fastq files. Output will be gzipped if ends in `.gz`. + - name: "--output_2" + type: file + direction: output + description: Sampled read 2 fastq files. Output will be gzipped if ends in `.gz`. + +- name: "Options" + arguments: + - name: "--probability" + type: double + description: The probability a record is kept, as a percentage (0.0, 1.0). Cannot be used with `record-count` + - name: "--record_count" + type: integer + description: The exact number of records to keep. Cannot be used with `probability` + - name: "--seed" + type: integer + description: Seed to use for the random number generator + +resources: + - type: bash_script + path: script.sh + +test_resources: + - type: bash_script + path: test.sh + - path: test_data + +engines: + - type: docker + image: rust:1.81-slim + setup: + - type: docker + run: | + apt-get update && apt-get install -y git procps && \ + git clone --depth 1 --branch v0.12.0 https://github.com/stjude-rust-labs/fq.git && \ + cd fq && \ + cargo install --locked --path . && \ + mv target/release/fq /usr/local/bin/ && \ + cd / && rm -rf /fq + +runners: + - type: executable + - type: nextflow diff --git a/src/fq_subsample/help.txt b/src/fq_subsample/help.txt new file mode 100644 index 00000000..6f4a9acf --- /dev/null +++ b/src/fq_subsample/help.txt @@ -0,0 +1,20 @@ +``` +fq subsample -h +``` + +Outputs a subset of records + +Usage: fq subsample [OPTIONS] --r1-dst <--probability |--record-count > [R2_SRC] + +Arguments: + Read 1 source. Accepts both raw and gzipped FASTQ inputs + [R2_SRC] Read 2 source. Accepts both raw and gzipped FASTQ inputs + +Options: + -p, --probability The probability a record is kept, as a percentage (0.0, 1.0). Cannot be used with `record-count` + -n, --record-count The exact number of records to keep. Cannot be used with `probability` + -s, --seed Seed to use for the random number generator + --r1-dst Read 1 destination. Output will be gzipped if ends in `.gz` + --r2-dst Read 2 destination. Output will be gzipped if ends in `.gz` + -h, --help Print help + -V, --version \ No newline at end of file diff --git a/src/fq_subsample/script.sh b/src/fq_subsample/script.sh new file mode 100755 index 00000000..bcc81b40 --- /dev/null +++ b/src/fq_subsample/script.sh @@ -0,0 +1,26 @@ +#!/bin/bash + +## VIASH START +## VIASH END + +set -eo pipefail + + +required_args=("-p" "--probability" "-n" "--record_count") + +# exclusive OR for required arguments $par_probability and $par_record_count +if [[ -n $par_probability && -n $par_record_count ]] || [[ -z $par_probability && -z $par_record_count ]]; then + echo "FQ/SUBSAMPLE requires either --probability or --record_count to be specified" + exit 1 +fi + + +fq subsample \ + ${par_output_1:+--r1-dst "${par_output_1}"} \ + ${par_output_2:+--r2-dst "${par_output_2}"} \ + ${par_probability:+--probability "${par_probability}"} \ + ${par_record_count:+--record-count "${par_record_count}"} \ + ${par_seed:+--seed "${par_seed}"} \ + ${par_input_1} \ + ${par_input_2} + diff --git a/src/fq_subsample/test.sh b/src/fq_subsample/test.sh new file mode 100644 index 00000000..1de48e95 --- /dev/null +++ b/src/fq_subsample/test.sh @@ -0,0 +1,36 @@ +#!/bin/bash + +echo ">>> Testing $meta_executable" + +echo ">>> Testing for paired-end reads" +"$meta_executable" \ + --input_1 $meta_resources_dir/test_data/a.3.fastq.gz \ + --input_2 $meta_resources_dir/test_data/a.4.fastq.gz \ + --record_count 3 \ + --seed 1 \ + --output_1 a.1.subsampled.fastq \ + --output_2 a.2.subsampled.fastq + +echo ">> Checking if the correct files are present" +[ ! -f "a.1.subsampled.fastq" ] && echo "Subsampled FASTQ file for read 1 is missing!" && exit 1 +[ $(wc -l < a.1.subsampled.fastq) -ne 12 ] && echo "Subsampled FASTQ file for read 1 does not contain the expected number of records" && exit 1 +[ ! -f "a.2.subsampled.fastq" ] && echo "Subsampled FASTQ file for read 2 is missing" && exit 1 +[ $(wc -l < a.2.subsampled.fastq) -ne 12 ] && echo "Subsampled FASTQ file for read 2 does not contain the expected number of records" && exit 1 + +rm a.1.subsampled.fastq a.2.subsampled.fastq + +echo ">>> Testing for single-end reads" +"$meta_executable" \ + --input_1 $meta_resources_dir/test_data/a.3.fastq.gz \ + --record_count 3 \ + --seed 1 \ + --output_1 a.1.subsampled.fastq + + +echo ">> Checking if the correct files are present" +[ ! -f "a.1.subsampled.fastq" ] && echo "Subsampled FASTQ file is missing" && exit 1 +[ $(wc -l < a.1.subsampled.fastq) -ne 12 ] && echo "Subsampled FASTQ file does not contain the expected number of records" && exit 1 + +echo ">>> Tests finished successfully" +exit 0 + diff --git a/src/fq_subsample/test_data/a.3.fastq.gz b/src/fq_subsample/test_data/a.3.fastq.gz new file mode 100644 index 00000000..3e38d06d Binary files /dev/null and b/src/fq_subsample/test_data/a.3.fastq.gz differ diff --git a/src/fq_subsample/test_data/a.4.fastq.gz b/src/fq_subsample/test_data/a.4.fastq.gz new file mode 100644 index 00000000..3164c614 Binary files /dev/null and b/src/fq_subsample/test_data/a.4.fastq.gz differ diff --git a/src/kallisto/kallisto_index/Kallisto b/src/kallisto/kallisto_index/Kallisto new file mode 100644 index 00000000..3c7b5b2b Binary files /dev/null and b/src/kallisto/kallisto_index/Kallisto differ diff --git a/src/kallisto/kallisto_index/config.vsh.yaml b/src/kallisto/kallisto_index/config.vsh.yaml new file mode 100644 index 00000000..2c4f65c7 --- /dev/null +++ b/src/kallisto/kallisto_index/config.vsh.yaml @@ -0,0 +1,94 @@ +name: kallisto_index +namespace: kallisto +description: | + Build a Kallisto index for the transcriptome to use Kallisto in the mapping-based mode. +keywords: [kallisto, index] +links: + homepage: https://pachterlab.github.io/kallisto/about + documentation: https://pachterlab.github.io/kallisto/manual + repository: https://github.com/pachterlab/kallisto + issue_tracker: https://github.com/pachterlab/kallisto/issues +references: + doi: https://doi.org/10.1038/nbt.3519 +license: BSD 2-Clause License + +argument_groups: +- name: "Input" + arguments: + - name: "--input" + type: file + description: | + Path to a FASTA-file containing the transcriptome sequences, either in plain text or + compressed (.gz) format. + required: true + - name: "--d_list" + type: file + description: | + Path to a FASTA-file containing sequences to mask from quantification. + +- name: "Output" + arguments: + - name: "--index" + type: file + direction: output + example: Kallisto_index + +- name: "Options" + arguments: + - name: "--kmer_size" + type: integer + description: | + Kmer length passed to indexing step of pseudoaligners (default: '31'). + example: 31 + - name: "--make_unique" + type: boolean_true + description: | + Replace repeated target names with unique names. + - name: "--aa" + type: boolean_true + description: | + Generate index from a FASTA-file containing amino acid sequences. + - name: "--distiguish" + type: boolean_true + description: | + Generate index where sequences are distinguished by the sequence names. + - name: "--min_size" + alternatives: ["-m"] + type: integer + description: | + Length of minimizers (default: automatically chosen). + - name: "--ec_max_size" + alternatives: ["-e"] + type: integer + description: | + Maximum number of targets in an equivalence class (default: no maximum). + - name: "--tmp" + alternatives: ["-T"] + type: string + description: | + Path to a directory for temporary files. + example: "tmp" + +resources: + - type: bash_script + path: script.sh + +test_resources: + - type: bash_script + path: test.sh + - path: test_data + +engines: + - type: docker + image: ubuntu:22.04 + setup: + - type: docker + run: | + apt-get update && \ + apt-get install -y --no-install-recommends wget && \ + wget --no-check-certificate https://github.com/pachterlab/kallisto/releases/download/v0.50.1/kallisto_linux-v0.50.1.tar.gz && \ + tar -xzf kallisto_linux-v0.50.1.tar.gz && \ + mv kallisto/kallisto /usr/local/bin/ +runners: + - type: executable + - type: nextflow diff --git a/src/kallisto/kallisto_index/help.txt b/src/kallisto/kallisto_index/help.txt new file mode 100644 index 00000000..28778ac0 --- /dev/null +++ b/src/kallisto/kallisto_index/help.txt @@ -0,0 +1,21 @@ +``` +kallisto index +``` +kallisto 0.50.1 +Builds a kallisto index + +Usage: kallisto index [arguments] FASTA-files + +Required argument: +-i, --index=STRING Filename for the kallisto index to be constructed + +Optional argument: +-k, --kmer-size=INT k-mer (odd) length (default: 31, max value: 31) +-t, --threads=INT Number of threads to use (default: 1) +-d, --d-list=STRING Path to a FASTA-file containing sequences to mask from quantification + --make-unique Replace repeated target names with unique names + --aa Generate index from a FASTA-file containing amino acid sequences + --distinguish Generate index where sequences are distinguished by the sequence name +-T, --tmp=STRING Temporary directory (default: tmp) +-m, --min-size=INT Length of minimizers (default: automatically chosen) +-e, --ec-max-size=INT Maximum number of targets in an equivalence class (default: no maximum) diff --git a/src/kallisto/kallisto_index/script.sh b/src/kallisto/kallisto_index/script.sh new file mode 100644 index 00000000..59a5d3de --- /dev/null +++ b/src/kallisto/kallisto_index/script.sh @@ -0,0 +1,34 @@ +#!/bin/bash + +## VIASH START +## VIASH END + +set -eo pipefail + +unset_if_false=( par_make_unique par_aa par_distinguish ) + +for var in "${unset_if_false[@]}"; do + temp_var="${!var}" + [[ "$temp_var" == "false" ]] && unset $var +done + +if [ -n "$par_kmer_size" ]; then + if [[ "$par_kmer_size" -lt 1 || "$par_kmer_size" -gt 31 || $(( par_kmer_size % 2 )) -eq 0 ]]; then + echo "Error: Kmer size must be an odd number between 1 and 31." + exit 1 + fi +fi + +kallisto index \ + -i "${par_index}" \ + ${par_kmer_size:+--kmer-size "${par_kmer_size}"} \ + ${par_make_unique:+--make-unique} \ + ${par_aa:+--aa} \ + ${par_distinguish:+--distinguish} \ + ${par_min_size:+--min-size "${par_min_size}"} \ + ${par_ec_max_size:+--ec-max-size "${par_ec_max_size}"} \ + ${par_d_list:+--d-list "${par_d_list}"} \ + ${meta_cpus:+--cpu "${meta_cpus}"} \ + ${par_tmp:+--tmp "${par_tmp}"} \ + "${par_input}" + diff --git a/src/kallisto/kallisto_index/test.sh b/src/kallisto/kallisto_index/test.sh new file mode 100644 index 00000000..2646dcd8 --- /dev/null +++ b/src/kallisto/kallisto_index/test.sh @@ -0,0 +1,35 @@ +#!/bin/bash + +echo ">>>Test 1: Testing $meta_functionality_name with non-default k-mer size" + +"$meta_executable" \ + --input "$meta_resources_dir/test_data/transcriptome.fasta" \ + --index Kallisto \ + --kmer_size 21 + + +echo ">>> Checking whether output exists and is correct" +[ ! -f "Kallisto" ] && echo "Kallisto index does not exist!" && exit 1 +[ ! -s "Kallisto" ] && echo "Kallisto index is empty!" && exit 1 + +kallisto inspect Kallisto 2> test.txt +grep "number of k-mers: 989" test.txt || { echo "The content of the index seems to be incorrect." && exit 1; } + +################################################################################ + +echo ">>>Test 2: Testing $meta_functionality_name with d_list argument" + +"$meta_executable" \ + --input "$meta_resources_dir/test_data/transcriptome.fasta" \ + --index Kallisto \ + --d_list "$meta_resources_dir/test_data/d_list.fasta" + +echo ">>> Checking whether output exists and is correct" +[ ! -f "Kallisto" ] && echo "Kallisto index does not exist!" && exit 1 +[ ! -s "Kallisto" ] && echo "Kallisto index is empty!" && exit 1 + +kallisto inspect Kallisto 2> test.txt +grep "number of k-mers: 959" test.txt || { echo "The content of the index seems to be incorrect." && exit 1; } + +echo "All tests succeeded!" +exit 0 diff --git a/src/kallisto/kallisto_index/test_data/d_list.fasta b/src/kallisto/kallisto_index/test_data/d_list.fasta new file mode 100644 index 00000000..ad5e05bf --- /dev/null +++ b/src/kallisto/kallisto_index/test_data/d_list.fasta @@ -0,0 +1,5 @@ +>YAL067W-A CDS=1-228 +ATGCCAATTATAGGGGTGCCGAGGTGCCTTATAAAACCCTTTTCTGTGCCTGTGACATTTCCTTTTTCGG +TCAAAAAGAATATCCGAATTTTAGATTTGGACCCTCGTACAGAAGCTTATTGTCTAAGCCTGAATTCAGT +CTGCTTTAAACGGCTTCCGCGGAGGAAATATTTCCATCTCTTGAATTCGTACAACATTAAACGTGTGTTG +GGAGTCGTATACTGTTAG diff --git a/src/kallisto/kallisto_index/test_data/transcriptome.fasta b/src/kallisto/kallisto_index/test_data/transcriptome.fasta new file mode 100644 index 00000000..94c06163 --- /dev/null +++ b/src/kallisto/kallisto_index/test_data/transcriptome.fasta @@ -0,0 +1,23 @@ +>YAL069W CDS=1-315 +ATGATCGTAAATAACACACACGTGCTTACCCTACCACTTTATACCACCACCACATGCCATACTCACCCTC +ACTTGTATACTGATTTTACGTACGCACACGGATGCTACAGTATATACCATCTCAAACTTACCCTACTCTC +AGATTCCACTTCACTCCATGGCCCATCTCTCACTGAATCAGTACCAAATGCACTCACATCATTATGCACG +GCACTTGCCTCAGCGGTCTATACCCTGTGCCATTTACCCATAACGCCCATCATTATCCACATTTTGATAT +CTATATCTCATTCGGCGGTCCCAAATATTGTATAA +>YAL068W-A CDS=1-255 +ATGCACGGCACTTGCCTCAGCGGTCTATACCCTGTGCCATTTACCCATAACGCCCATCATTATCCACATT +TTGATATCTATATCTCATTCGGCGGTCCCAAATATTGTATAACTGCCCTTAATACATACGTTATACCACT +TTTGCACCATATACTTACCACTCCATTTATATACACTTATGTCAATATTACAGAAAAATCCCCACAAAAA +TCACCTAAACATAAAAATATTCTACTTTTCAACAATAATACATAA +>YAL068C CDS=1-363 +ATGGTCAAATTAACTTCAATCGCCGCTGGTGTCGCTGCCATCGCTGCTACTGCTTCTGCAACCACCACTC +TAGCTCAATCTGACGAAAGAGTCAACTTGGTGGAATTGGGTGTCTACGTCTCTGATATCAGAGCTCACTT +AGCCCAATACTACATGTTCCAAGCCGCCCACCCAACTGAAACCTACCCAGTCGAAGTTGCTGAAGCCGTT +TTCAACTACGGTGACTTCACCACCATGTTGACCGGTATTGCTCCAGACCAAGTGACCAGAATGATCACCG +GTGTTCCATGGTACTCCAGCAGATTAAAGCCAGCCATCTCCAGTGCTCTATCCAAGGACGGTATCTACAC +TATCGCAAACTAG +>YAL067W-A CDS=1-228 +ATGCCAATTATAGGGGTGCCGAGGTGCCTTATAAAACCCTTTTCTGTGCCTGTGACATTTCCTTTTTCGG +TCAAAAAGAATATCCGAATTTTAGATTTGGACCCTCGTACAGAAGCTTATTGTCTAAGCCTGAATTCAGT +CTGCTTTAAACGGCTTCCGCGGAGGAAATATTTCCATCTCTTGAATTCGTACAACATTAAACGTGTGTTG +GGAGTCGTATACTGTTAG \ No newline at end of file diff --git a/src/salmon/salmon_quant/config.vsh.yaml b/src/salmon/salmon_quant/config.vsh.yaml index 1f96f0c9..5fa3d48f 100644 --- a/src/salmon/salmon_quant/config.vsh.yaml +++ b/src/salmon/salmon_quant/config.vsh.yaml @@ -24,7 +24,7 @@ argument_groups: description: | Format string describing the library. The library type string consists of three parts: - 1. Relative orientation of the reads: This part is only provided if the library is paired-end, THe possible options are + 1. Relative orientation of the reads: This part is only provided if the library is paired-end, The possible options are I = inward O = outward M = matching @@ -118,7 +118,7 @@ argument_groups: direction: output description: | Salmon quantification file. - required: true + required: false example: quant.sf - name: Basic options @@ -327,7 +327,7 @@ argument_groups: If this option is provided, then the selective-alignment results will be written out in SAM-compatible format. By default, output will be directed to stdout, but an alternative file name can be provided instead. - name: --mapping_sam type: file - description: Path to file that should output the selective-alignment results in SAM-compatible format. THis option must be provided while using --write_mappings + description: Path to file that should output the selective-alignment results in SAM-compatible format. This option must be provided while using --write_mappings required: false direction: output example: mappings.sam diff --git a/src/sortmerna/config.vsh.yaml b/src/sortmerna/config.vsh.yaml new file mode 100644 index 00000000..6477660f --- /dev/null +++ b/src/sortmerna/config.vsh.yaml @@ -0,0 +1,290 @@ +name: sortmerna +description: | + Local sequence alignment tool for filtering, mapping and clustering. The main + application of SortMeRNA is filtering rRNA from metatranscriptomic data. +keywords: [sort, mRNA, rRNA, alignment, filtering, mapping, clustering] +links: + homepage: https://sortmerna.readthedocs.io/en/latest/ + documentation: https://sortmerna.readthedocs.io/en/latest/manual4.0.html + repository: https://github.com/sortmerna/sortmerna +references: + doi: 10.1093/bioinformatics/bts611 +license: GPL-3.0 + +argument_groups: +- name: "Input" + arguments: + - name: "--paired" + type: boolean_true + description: | + Reads are paired-end. If a single reads file is provided, use this option + to indicate the file contains interleaved paired reads when neither + 'paired_in' | 'paired_out' | 'out2' | 'sout' are specified. + - name: "--input" + type: file + multiple: true + description: Input fastq + - name: "--ref" + type: file + multiple: true + description: Reference fasta file(s) for rRNA database. + - name: "--ribo_database_manifest" + type: file + description: Text file containing paths to fasta files (one per line) that will be used to create the database for SortMeRNA. + +- name: "Output" + arguments: + - name: "--log" + type: file + direction: output + must_exist: false + example: $id.sortmerna.log + description: Sortmerna log file. + - name: "--output" + alternatives: ["--aligned"] + type: string + description: | + Directory and file prefix for aligned output. The appropriate extension: + (fasta|fastq|blast|sam|etc) is automatically added. + If 'dir' is not specified, the output is created in the WORKDIR/out/. + If 'pfx' is not specified, the prefix 'aligned' is used. + - name: "--other" + type: string + description: Create Non-aligned reads output file with this path/prefix. Must be used with fastx. + +- name: "Options" + arguments: + - name: "--kvdb" + type: string + description: Path to directory of the key-value database file, used for storing the alignment results. + - name: "--idx_dir" + type: string + description: Path to the directory for storing the reference index files. + - name: "--readb" + type: string + description: Path to the directory for storing pre-processed reads. + - name: "--fastx" + type: boolean_true + description: Output aligned reads into FASTA/FASTQ file + - name: "--sam" + type: boolean_true + description: Output SAM alignment for aligned reads. + - name: "--sq" + type: boolean_true + description: Add SQ tags to the SAM file + - name: "--blast" + type: string + description: | + Blast options: + * '0' - pairwise + * '1' - tabular(Blast - m 8 format) + * '1 cigar' - tabular + column for CIGAR + * '1 cigar qcov' - tabular + columns for CIGAR and query coverage + * '1 cigar qcov qstrand' - tabular + columns for CIGAR, query coverage and strand + choices: ['0', '1', '1 cigar', '1 cigar qcov', '1 cigar qcov qstrand'] + - name: "--num_alignments" + type: integer + description: | + Report first INT alignments per read reaching E-value. If Int = 0, all alignments will be output. Default: '0' + example: 0 + - name: "--min_lis" + type: integer + description: | + search all alignments having the first INT longest LIS. LIS stands for Longest Increasing Subsequence, it is + computed using seeds’ positions to expand hits into longer matches prior to Smith-Waterman alignment. Default: '2'. + example: 2 + - name: "--print_all_reads" + type: boolean_true + description: output null alignment strings for non-aligned reads to SAM and/or BLAST tabular files. + - name: "--paired_in" + type: boolean_true + description: | + In the case where a pair of reads is aligned with a score above the threshold, the output of the reads is controlled + by the following options: + * --paired_in and --paired_out are both false: Only one read per pair is output to the aligned fasta file. + * --paired_in is true and --paired_out is false: Both reads of the pair are output to the aligned fasta file. + * --paired_in is false and --paired_out is true: Both reads are output the the other fasta file (if it is specified). + - name: "--paired_out" + type: boolean_true + description: See description of --paired_in. + - name: "--out2" + type: boolean_true + description: | + Output paired reads into separate files. Must be used with '--fastx'. If a single reads file is provided, this options + implies interleaved paired reads. When used with 'sout', four (4) output files for aligned reads will be generated: + 'aligned-paired-fwd, aligned-paired-rev, aligned-singleton-fwd, aligned-singleton-rev'. If 'other' option is also used, + eight (8) output files will be generated. + - name: "--sout" + type: boolean_true + description: | + Separate paired and singleton aligned reads. Must be used with '--fastx'. If a single reads file is provided, + this options implies interleaved paired reads. Cannot be used with '--paired_in' or '--paired_out'. + - name: "--zip_out" + type: string + description: | + Compress the output files. The possible values are: + * '1/true/t/yes/y' + * '0/false/f/no/n' + *'-1' (the same format as input - default) + The values are Not case sensitive. + choices: ['1', 'true', 't', 'yes', 'y', '0', 'false', 'f', 'no', 'n', '-1'] + example: "-1" + - name: "--match" + type: integer + description: | + Smith-Waterman score for a match (positive integer). Default: '2'. + example: 2 + - name: "--mismatch" + type: integer + description: | + Smith-Waterman penalty for a mismatch (negative integer). Default: '-3'. + example: -3 + - name: "--gap_open" + type: integer + description: | + Smith-Waterman penalty for introducing a gap (positive integer). Default: '5'. + example: 5 + - name: "--gap_ext" + type: integer + description: | + Smith-Waterman penalty for extending a gap (positive integer). Default: '2'. + example: 2 + - name: "--N" + type: integer + description: | + Smith-Waterman penalty for ambiguous letters (N’s) scored as --mismatch. Default: '-1'.\ + example: -1 + - name: "--a" + type: integer + description: | + Number of threads to use. Default: '1'. + example: 1 + - name: "--e" + type: double + description: | + E-value threshold. Default: '1'. + example: 1 + - name: "--F" + type: boolean_true + description: Search only the forward strand. + - name: "--R" + type: boolean_true + description: Search only the reverse-complementary strand. + - name: "--num_alignment" + type: integer + description: | + Report first INT alignments per read reaching E-value (--num_alignments 0 signifies all alignments will be output). + Default: '-1' + example: -1 + - name: "--best" + type: integer + description: | + Report INT best alignments per read reaching E-value by searching --min_lis INT candidate alignments (--best 0 + signifies all candidate alignments will be searched) Default: '1'. + example: 1 + - name: "--verbose" + alternatives: ["-v"] + type: boolean_true + description: Verbose output. + +- name: "OTU picking options" + arguments: + - name: "--id" + type: double + description: | + %id similarity threshold (the alignment must still pass the E-value threshold). Default: '0.97'. + example: 0.97 + - name: "--coverage" + type: double + description: | + %query coverage threshold (the alignment must still pass the E-value threshold). Default: '0.97'. + example: 0.97 + - name: "--de_novo" + type: boolean_true + description: | + FASTA/FASTQ file for reads matching database < %id off (set using --id) and < %cov (set using --coverage) + (alignment must still pass the E-value threshold). + - name: "--otu_map" + type: boolean_true + description: | + Output OTU map (input to QIIME’s make_otu_table.py). + +- name: "Advanced options" + arguments: + - name: "--num_seed" + type: integer + description: | + Number of seeds matched before searching for candidate LIS. Default: '2'. + example: 2 + - name: "--passes" + type: integer + multiple: true + description: | + Three intervals at which to place the seed on the read L,L/2,3 (L is the seed length set in ./indexdb_rna). + - name: "--edge" + type: string + description: | + The number (or percentage if followed by %) of nucleotides to add to each edge of the alignment region on the + reference sequence before performing Smith-Waterman alignment. Default: '4'. + example: 4 + - name: "--full_search" + type: boolean_true + description: | + Search for all 0-error and 1-error seed off matches in the index rather than stopping after finding a 0-error match + (<1% gain in sensitivity with up four-fold decrease in speed). + +- name: "Indexing Options" + arguments: + - name: "--index" + type: integer + description: | + Create index files for the reference database. By default when this option is not used, the program checks the + reference index and builds it if not already existing. + This can be changed by using '-index' as follows: + * '-index 0' - skip indexing. If the index does not exist, the program will terminate + and warn to build the index prior performing the alignment + * '-index 1' - only perform the indexing and terminate + * '-index 2' - the default behaviour, the same as when not using this option at all + example: 2 + choices: [0, 1, 2] + - name: "-L" + type: double + description: | + Indexing seed length. Default: '18' + example: 18 + - name: "--interval" + type: integer + description: | + Index every Nth L-mer in the reference database. Default: '1' + example: 1 + - name: "--max_pos" + type: integer + description: | + Maximum number of positions to store for each unique L-mer. Set to 0 to store all positions. Default: '1000' + example: 1000 + + + +resources: + - type: bash_script + path: script.sh + +test_resources: + - type: bash_script + path: test.sh + - path: test_data + +engines: +- type: docker + image: ubuntu:22.04 + setup: + - type: docker + run: | + apt-get update && \ + apt-get install -y --no-install-recommends gzip cmake g++ wget && \ + apt-get clean && \ + wget --no-check-certificate https://github.com/sortmerna/sortmerna/releases/download/v4.3.6/sortmerna-4.3.6-Linux.sh && \ + bash sortmerna-4.3.6-Linux.sh --skip-license +runners: +- type: executable +- type: nextflow \ No newline at end of file diff --git a/src/sortmerna/help.txt b/src/sortmerna/help.txt new file mode 100644 index 00000000..f0842707 --- /dev/null +++ b/src/sortmerna/help.txt @@ -0,0 +1,319 @@ +``` +sortmerna -h +``` + + + Program: SortMeRNA version 4.3.6 + Copyright: 2016-2020 Clarity Genomics BVBA: + Turnhoutseweg 30, 2340 Beerse, Belgium + 2014-2016 Knight Lab: + Department of Pediatrics, UCSD, La Jolla + 2012-2014 Bonsai Bioinformatics Research Group: + LIFL, University Lille 1, CNRS UMR 8022, INRIA Nord-Europe + Disclaimer: SortMeRNA comes with ABSOLUTELY NO WARRANTY; without even the + implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + See the GNU Lesser General Public License for more details. + Contributors: Jenya Kopylova jenya.kopylov@gmail.com + Laurent Noé laurent.noe@lifl.fr + Pierre Pericard pierre.pericard@lifl.fr + Daniel McDonald wasade@gmail.com + Mikaël Salson mikael.salson@lifl.fr + Hélène Touzet helene.touzet@lifl.fr + Rob Knight robknight@ucsd.edu + + Usage: sortmerna -ref FILE [-ref FILE] -reads FWD_READS [-reads REV_READS] [OPTIONS]: + ------------------------------------------------------------------------------------------------------------- + | option type-format description default | + ------------------------------------------------------------------------------------------------------------- + + [REQUIRED] + --ref PATH Required Reference file (FASTA) absolute or relative path. + + Use mutliple times, once per a reference file + + + --reads PATH Required Raw reads file (FASTA/FASTQ/FASTA.GZ/FASTQ.GZ). + + Use twice for files with paired reads. + The file extensions are Not important. The program automatically + recognizes the file format as flat/compressed, fasta/fastq + + + + [COMMON] + --workdir PATH Optional Workspace directory USRDIR/sortmerna/run/ + + Default structure: WORKDIR/ + idx/ (References index) + kvdb/ (Key-value storage for alignments) + out/ (processing output) + readb/ (pre-processed reads/index) + + + --kvdb PATH Optional Directory for Key-value database WORKDIR/kvdb + + KVDB is used for storing the alignment results. + + + --idx-dir PATH Optional Directory for storing Reference index. WORKDIR/idx + + + --readb PATH Optional Storage for pre-processed reads WORKDIR/readb/ + + Directory storing the split reads, or the random access index of compressed reads + + + --fastx BOOL Optional Output aligned reads into FASTA/FASTQ file + --sam BOOL Optional Output SAM alignment for aligned reads. + + + --SQ BOOL Optional Add SQ tags to the SAM file + + + --blast STR Optional output alignments in various Blast-like formats + + Sample values: '0' - pairwise + '1' - tabular (Blast - m 8 format) + '1 cigar' - tabular + column for CIGAR + '1 cigar qcov' - tabular + columns for CIGAR and query coverage + '1 cigar qcov qstrand' - tabular + columns for CIGAR, query coverage, + and strand + + + --aligned STR/BOOL Optional Aligned reads file prefix [dir/][pfx] WORKDIR/out/aligned + + Directory and file prefix for aligned output i.e. each + output file goes into the specified directory with the given prefix. + The appropriate extension: (fasta|fastq|blast|sam|etc) is automatically added. + Both 'dir' and 'pfx' are optional. + The 'dir' can be a relative or an absolute path. + If 'dir' is not specified, the output is created in the WORKDIR/out/ + If 'pfx' is not specified, the prefix 'aligned' is used + Examples: + '-aligned $MYDIR/dir_1/dir_2/1' -> $MYDIR/dir_1/dir_2/1.fasta + '-aligned dir_1/apfx' -> $PWD/dir_1/apfx.fasta + '-aligned dir_1/' -> $PWD/aligned.fasta + '-aligned apfx' -> $PWD/apfx.fasta + '-aligned (no argument)' -> WORKDIR/out/aligned.fasta + + + --other STR/BOOL Optional Non-aligned reads file prefix [dir/][pfx] WORKDIR/out/other + + Directory and file prefix for non-aligned output i.e. each + output file goes into the specified directory with the given prefix. + The appropriate extension: (fasta|fastq|blast|sam|etc) is automatically added. + Must be used with 'fastx'. + Both 'dir' and 'pfx' are optional. + The 'dir' can be a relative or an absolute path. + If 'dir' is not specified, the output is created in the WORKDIR/out/ + If 'pfx' is not specified, the prefix 'other' is used + Examples: + '-other $MYDIR/dir_1/dir_2/1' -> $MYDIR/dir_1/dir_2/1.fasta + '-other dir_1/apfx' -> $PWD/dir_1/apfx.fasta + '-other dir_1/' -> $PWD/dir_1/other.fasta + '-other apfx' -> $PWD/apfx.fasta + '-other (no argument)' -> aligned_out/other.fasta + i.e. the same output directory + as used for aligned output + + + --num_alignments INT Optional Positive integer (INT >=0). + + If used with '-no-best' reports first INT alignments per read reaching + E-value threshold, which allows to lower the CPU time and memory use. + Otherwise outputs INT best alignments. + If INT = 0, all alignments are output + + + --no-best BOOL Optional Disable best alignments search False + + The 'best' alignment is the highest scoring alignment out of All alignments of a read, + and the read can potentially be aligned (reaching E-value threshold) to multiple reference + sequences. + By default the program searches for best alignments i.e. performs an exhaustive search + over all references. Using '-no-best' will make the program to search just + the first N alignments, where N is set using '-num_alignments' i.e. 1 by default. + + + --min_lis INT Optional Search only alignments that have the LIS 2 + of at least N seeds long + + LIS stands for Longest Increasing Subsequence. It is computed using seeds, which + are k-mers common to the read and the reference sequence. Sorted sequences of such seeds + are used to filter the candidate references prior performing the Smith-Waterman alignment. + + + --print_all_reads BOOL Optional Output null alignment strings for non-aligned reads False + to SAM and/or BLAST tabular files + + --paired BOOL Optional Flags paired reads False + + If a single reads file is provided, use this option to indicate + the file contains interleaved paired reads when neither + 'paired_in' | 'paired_out' | 'out2' | 'sout' are specified. + + + --paired_in BOOL Optional Flags the paired-end reads as Aligned, False + when either of them is Aligned. + + With this option both reads are output into Aligned FASTA/Q file + Must be used with 'fastx'. + Mutually exclusive with 'paired_out'. + + + --paired_out BOOL Optional Flags the paired-end reads as Non-aligned, False + when either of them is non-aligned. + + With this option both reads are output into Non-Aligned FASTA/Q file + Must be used with 'fastx'. + Mutually exclusive with 'paired_in'. + + + --out2 BOOL Optional Output paired reads into separate files. False + + Must be used with 'fastx'. + If a single reads file is provided, this options implies interleaved paired reads + When used with 'sout', four (4) output files for aligned reads will be generated: + 'aligned-paired-fwd, aligned-paired-rev, aligned-singleton-fwd, aligned-singleton-rev'. + If 'other' option is also used, eight (8) output files will be generated. + + + --sout BOOL Optional Separate paired and singleton aligned reads. False + + To be used with 'fastx'. + If a single reads file is provided, this options implies interleaved paired reads + Cannot be used with 'paired_in' | 'paired_out' + + + --zip-out STR/BOOL Optional Controls the output compression '-1' + + By default the report files are produced in the same format as the input i.e. + if the reads files are compressed (gz), the output is also compressed. + The default behaviour can be overriden by using '-zip-out'. + The possible values: '1/true/t/yes/y' + '0/false/f/no/n' + '-1' (the same format as input - default) + The values are Not case sensitive i.e. 'Yes, YES, yEs, Y, y' are all OK + Examples: + '-reads freads.gz -zip-out n' : generate flat output when the input is compressed + '-reads freads.flat -zip-out' : compress the output when the input files are flat + + + --match INT Optional SW score (positive integer) for a match. 2 + + --mismatch INT Optional SW penalty (negative integer) for a mismatch. -3 + + --gap_open INT Optional SW penalty (positive integer) for introducing a gap. 5 + + --gap_ext INT Optional SW penalty (positive integer) for extending a gap. 2 + + -e DOUBLE Optional E-value threshold. 1 + + Defines the 'statistical significance' of a local alignment. + Exponentially correllates with the Minimal Alignment score. + Higher E-values (100, 1000, ...) cause More reads to Pass the alignment threshold + + + -F BOOL Optional Search only the forward strand. False + + -N BOOL Optional SW penalty for ambiguous letters (N's) scored + as --mismatch + + -R BOOL Optional Search only the reverse-complementary strand. False + + + [OTU_PICKING] + --id INT Optional %%id similarity threshold (the alignment 0.97 + must still pass the E-value threshold). + + --coverage INT Optional %%query coverage threshold (the alignment must 0.97 + still pass the E-value threshold) + + --de_novo_otu BOOL Optional Output FASTA file with 'de novo' reads False + + Read is 'de novo' if its alignment score passes E-value threshold, but both the identity + '-id', and the '-coverage' are below their corresponding thresholds + i.e. ID < %%id and COV < %%cov + + + --otu_map BOOL Optional Output OTU map (input to QIIME's make_otu_table.py). False + Cannot be used with 'no-best because + the grouping is done around the best alignment' + + + [ADVANCED] + --passes INT,INT,INT Optional Three intervals at which to place the seed on L,L/2,3 + the read (L is the seed length) + + --edges INT Optional Number (or percent if INT followed by %% sign) of 4 + nucleotides to add to each edge of the read + prior to SW local alignment + + --num_seeds BOOL Optional Number of seeds matched before searching 2 + for candidate LIS + + --full_search INT Optional Search for all 0-error and 1-error seed False + matches in the index rather than stopping + after finding a 0-error match (<1%% gain in + sensitivity with up four-fold decrease in speed) + + --pid BOOL Optional Add pid to output file names. False + + -a INT Optional DEPRECATED in favour of '-threads'. Number of numCores + processing threads to use. + Automatically redirects to '-threads' + + --threads INT Optional Number of Processing threads to use 2 + + + [INDEXING] + --index INT Optional Build reference database index 2 + + By default when this option is not used, the program checks the reference index and + builds it if not already existing. + This can be changed by using '-index' as follows: + '-index 0' - skip indexing. If the index does not exist, the program will terminate + and warn to build the index prior performing the alignment + '-index 1' - only perform the indexing and terminate + '-index 2' - the default behaviour, the same as when not using this option at all + + + -L DOUBLE Optional Indexing: seed length. 18 + + -m DOUBLE Optional Indexing: the amount of memory (in Mbytes) for 3072 + building the index. + + -v BOOL Optional Produce verbose output when building the index True + + --interval INT Optional Indexing: Positive integer: index every Nth L-mer in 1 + the reference database e.g. '-interval 2'. + + --max_pos INT Optional Indexing: maximum (integer) number of positions to 1000 + store for each unique L-mer. + If 0 - all positions are stored. + + + [HELP] + -h BOOL Optional Print help information + + --version BOOL Optional Print SortMeRNA version number + + + [DEVELOPER] + --dbg_put_db BOOL Optional + --cmd BOOL Optional Launch an interactive session (command prompt) False + + --task INT Optional Processing Task 4 + + Possible values: 0 - align. Only perform alignment + 1 - post-processing (log writing) + 2 - generate reports + 3 - align and post-process + 4 - all + + + --dbg-level INT Optional Debug level 0 + + Controls verbosity of the execution trace. Default value of 0 corresponds to + the least verbose output. + The highest value currently is 2. diff --git a/src/sortmerna/script.sh b/src/sortmerna/script.sh new file mode 100755 index 00000000..8dda3d60 --- /dev/null +++ b/src/sortmerna/script.sh @@ -0,0 +1,108 @@ +#!/bin/bash + +## VIASH START +## VIASH END + +set -eo pipefail + +unset_if_false=( par_fastx par_sq par_fastx par_print_all_reads par_paired_in par_paired_out + par_F par_R par_verbose par_de_novo par_otu_map par_full_search par_out2 + par_sout par_sam par_paired ) + + +for var in "${unset_if_false[@]}"; do + if [ "${!var}" == "false" ]; then + unset $var + fi +done + +reads=() +IFS=";" read -ra input <<< "$par_input" +if [ "${#input[@]}" -eq 2 ]; then + reads="--reads ${input[0]} --reads ${input[1]}" + # set paired to true in case it's not + par_paired=true +else + reads="--reads ${input[0]}" + par_paired=false +fi + +refs=() + +# check if references are input normally or through a manifest file +if [[ ! -z "$par_ribo_database_manifest" ]]; then + while IFS= read -r path || [[ -n $path ]]; do + refs=$refs" --ref $path" + done < $par_ribo_database_manifest + +elif [[ ! -z "$par_ref" ]]; then + IFS=";" read -ra ref <<< "$par_ref" + # check if length is 2 and par_paired is set to true + if [[ "${#ref[@]}" -eq 2 && "$par_paired" == "true" ]]; then + refs="--ref ${ref[0]} --ref ${ref[1]}" + # check if length is 1 and par_paired is set to false + elif [[ "${#ref[@]}" -eq 1 && "$par_paired" == "false" ]]; then + refs="--ref $par_ref" + else # if one reference provided but paired is set to true: + echo "Two reference fasta files are required for paired-end reads" + exit 1 + fi +else + echo "No reference fasta file(s) provided" + exit 1 +fi + + +sortmerna \ + $refs \ + $reads \ + --workdir . \ + ${par_output:+--aligned "${par_output}"} \ + ${par_fastx:+--fastx} \ + ${par_other:+--other "${par_other}"} \ + ${par_kvdb:+--kvdb "${par_kvdb}"} \ + ${par_idx_dir:+--idx-dir "${par_idx_dir}"} \ + ${par_readb:+--readb "${par_readb}"} \ + ${par_sam:+--sam} \ + ${par_sq:+--sq} \ + ${par_blast:+--blast "${par_blast}"} \ + ${par_num_alignments:+--num_alignments "${par_num_alignments}"} \ + ${par_min_lis:+--min_lis "${par_min_lis}"} \ + ${par_print_all_reads:+--print_all_reads} \ + ${par_paired_in:+--paired_in} \ + ${par_paired_out:+--paired_out} \ + ${par_out2:+--out2} \ + ${par_sout:+--sout} \ + ${par_zip_out:+--zip-out "${par_zip_out}"} \ + ${par_match:+--match "${par_match}"} \ + ${par_mismatch:+--mismatch "${par_mismatch}"} \ + ${par_gap_open:+--gap_open "${par_gap_open}"} \ + ${par_gap_ext:+--gap_ext "${par_gap_ext}"} \ + ${par_N:+-N "${par_N}"} \ + ${par_a:+-a "${par_a}"} \ + ${par_e:+-e "${par_e}"} \ + ${par_F:+-F} \ + ${par_R:+-R} \ + ${par_num_alignment:+--num_alignment "${par_num_alignment}"} \ + ${par_best:+--best "${par_best}"} \ + ${par_verbose:+--verbose} \ + ${par_id:+--id "${par_id}"} \ + ${par_coverage:+--coverage "${par_coverage}"} \ + ${par_de_novo:+--de_novo} \ + ${par_otu_map:+--otu_map} \ + ${par_num_seed:+--num_seed "${par_num_seed}"} \ + ${par_passes:+--passes "${par_passes}"} \ + ${par_edge:+--edge "${par_edge}"} \ + ${par_full_search:+--full_search} \ + ${par_index:+--index "${par_index}"} \ + ${par_L:+-L $par_L} \ + ${par_interval:+--interval "${par_interval}"} \ + ${par_max_pos:+--max_pos "${par_max_pos}"} + + +if [ ! -z $par_log ]; then + mv "${par_output}.log" $par_log +fi + +exit 0 + diff --git a/src/sortmerna/test.sh b/src/sortmerna/test.sh new file mode 100644 index 00000000..390b9307 --- /dev/null +++ b/src/sortmerna/test.sh @@ -0,0 +1,101 @@ +#!/bin/bash + +echo ">>> Testing $meta_functionality_name" + +find $meta_resources_dir/test_data/rRNA -type f > test_data/rrna-db.txt + +echo ">>> Testing for paired-end reads and database manifest" +# out2 separates the read pairs into two files (one fwd and one rev) +# paired_in outputs both reads of a pair +# other is the output file for non-rRNA reads +"$meta_executable" \ + --output "rRNA_reads" \ + --other "non_rRNA_reads" \ + --input "$meta_resources_dir/test_data/reads_1.fq.gz;$meta_resources_dir/test_data/reads_2.fq.gz" \ + --ribo_database_manifest test_data/rrna-db.txt \ + --log test_log.log \ + --paired_in \ + --fastx \ + --out2 + + +echo ">> Checking if the correct files are present" +[[ -f "rRNA_reads_fwd.fq.gz" ]] || [[ -f "rRNA_reads_rev.fq.gz" ]] || { echo "rRNA output fastq file is missing!"; exit 1; } +[[ -s "rRNA_reads_fwd.fq.gz" ]] && [[ -s "rRNA_reads_rev.fq.gz" ]] || { echo "rRNA output fastq file is empty!"; exit 1; } +[[ -f "non_rRNA_reads_fwd.fq.gz" ]] || [[ -f "non_rRNA_reads_rev.fq.gz" ]] || { echo "Non-rRNA output fastq file is missing!"; exit 1;} +gzip -dk non_rRNA_reads_fwd.fq.gz +gzip -dk non_rRNA_reads_rev.fq.gz +[[ ! -s "non_rRNA_reads_fwd.fq" ]] && [[ ! -s "non_rRNA_reads_rev.fq" ]] || { echo "Non-rRNA output fastq file is not empty!"; exit 1;} + +rm -f rRNA_reads_fwd.fq.gz rRNA_reads_rev.fq.gz non_rRNA_reads_fwd.fq.gz non_rRNA_reads_rev.fq.gz test_log.log +rm -rf kvdb/ + +################################################################################ +echo ">>> Testing for paired-end reads and --ref and --paired_out argumens" +"$meta_executable" \ + --output "rRNA_reads" \ + --other "non_rRNA_reads" \ + --input "$meta_resources_dir/test_data/reads_1.fq.gz;$meta_resources_dir/test_data/reads_2.fq.gz" \ + --ref "$meta_resources_dir/test_data/rRNA/database1.fa;$meta_resources_dir/test_data/rRNA/database2.fa" \ + --log test_log.log \ + --paired_out \ + --fastx \ + --out2 + +echo ">> Checking if the correct files are present" +[[ -f "rRNA_reads_fwd.fq.gz" ]] || [[ -f "rRNA_reads_rev.fq.gz" ]] || { echo "rRNA output fastq file is missing!"; exit 1; } +gzip -dkf rRNA_reads_fwd.fq.gz +[[ ! -s "rRNA_reads_fwd.fq" ]] && [[ ! -s "rRNA_reads_rev.fq" ]] || { echo "rRNA output fastq file is not empty!"; exit 1; } +[[ -f "non_rRNA_reads_fwd.fq.gz" ]] || [[ -f "non_rRNA_reads_rev.fq.gz" ]] || { echo "Non-rRNA output fastq file is missing!"; exit 1;} +gzip -dkf non_rRNA_reads_fwd.fq.gz +gzip -dkf non_rRNA_reads_rev.fq.gz +[[ -s "non_rRNA_reads_fwd.fq" ]] && [[ -s "non_rRNA_reads_rev.fq" ]] || { echo "Non-rRNA output fastq file is empty!"; exit 1; } + +rm -f rRNA_reads_fwd.fq.gz rRNA_reads_rev.fq.gz non_rRNA_reads_fwd.fq.gz non_rRNA_reads_rev.fq.gz test_log.log +rm -rf kvdb/ + +################################################################################ + +echo ">>> Testing for single-end reads and --ref argument" +"$meta_executable" \ + --aligned "rRNA_reads" \ + --other "non_rRNA_reads" \ + --input $meta_resources_dir/test_data/reads_1.fq.gz \ + --ref $meta_resources_dir/test_data/rRNA/database1.fa \ + --log test_log.log \ + --fastx + +echo ">> Checking if the correct files are present" +[[ ! -f "rRNA_reads.fq.gz" ]] && echo "rRNA output fastq file is missing!" && exit 1 +gzip -dk rRNA_reads.fq.gz +[[ -s "rRNA_reads.fq" ]] && echo "rRNA output fastq file is not empty!" && exit 1 +[[ ! -f "non_rRNA_reads.fq.gz" ]] && echo "Non-rRNA output fastq file is missing!" && exit 1 +[[ ! -s "non_rRNA_reads.fq.gz" ]] && echo "Non-rRNA output fastq file is empty!" && exit 1 + +rm -f rRNA_reads.fq.gz non_rRNA_reads.fq.gz test_log.log +rm -rf kvdb/ + +################################################################################ + +echo ">>> Testing for single-end reads with singleton output files" +"$meta_executable" \ + --aligned "rRNA_reads" \ + --other "non_rRNA_reads" \ + --input "$meta_resources_dir/test_data/reads_1.fq.gz;$meta_resources_dir/test_data/reads_2.fq.gz" \ + --ribo_database_manifest test_data/rrna-db.txt \ + --log test_log.log \ + --fastx \ + --sout + +echo ">> Checking if the correct files are present" +[[ ! -f "rRNA_reads_paired.fq.gz" ]] && echo "Aligned paired fwd output fastq file is missing!" && exit 1 +[[ ! -f "rRNA_reads_singleton.fq.gz" ]] && echo "Aligned singleton fwd output fastq file is missing!" && exit 1 +[[ ! -f "non_rRNA_reads_fwd.fq" ]] && echo "Non-rRNA fwd output fastq file is missing!" && exit 1 +[[ ! -f "non_rRNA_reads_rev.fq" ]] && echo "Non-rRNA rev output fastq file is missing!" && exit 1 +[[ ! -f "non_rRNA_reads_singleton.fq.gz" ]] && echo "Non-rRNA singleton output fastq file is missing!" && exit 1 +[[ ! -f "non_rRNA_reads_paired.fq.gz" ]] && echo "Non-rRNA paired output fastq file is missing!" && exit 1 + + + +echo ">>> All tests passed" +exit 0 \ No newline at end of file diff --git a/src/sortmerna/test_data/rRNA/database1.fa b/src/sortmerna/test_data/rRNA/database1.fa new file mode 100644 index 00000000..bae23aba --- /dev/null +++ b/src/sortmerna/test_data/rRNA/database1.fa @@ -0,0 +1,24 @@ +>AY846379.1.1791 Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;Chlorophyceae;Sphaeropleales;Monoraphidium;Monoraphidium sp. Itas 9/21 14-6w +CCUGGUUGAUCCUGCCAGUAGUCAUAUGCUUGUCUCAAAGAUUAAGCCAUGCAUGUCUAAGUAUAAACUGCUUAUACUGU +GAAACUGCGAAUGGCUCAUUAAAUCAGUUAUAGUUUAUUUGAUGGUACCUCUACACGGAUAACCGUAGUAAUUCUAGAGC +UAAUACGUGCGUAAAUCCCGACUUCUGGAAGGGACGUAUUUAUUAGAUAAAAGGCCGACCGAGCUUUGCUCGACCCGCGG +UGAAUCAUGAUAACUUCACGAAUCGCAUAGCCUUGUGCUGGCGAUGUUUCAUUCAAAUUUCUGCCCUAUCAACUUUCGAU +GGUAGGAUAGAGGCCUACCAUGGUGGUAACGGGUGACGGAGGAUUAGGGUUCGAUUCCGGAGAGGGAGCCUGAGAAACGG +CUACCACAUCCAAGGAAGGCAGCAGGCGCGCAAAUUACCCAAUCCUGAUACGGGGAGGUAGUGACAAUAAAUAACAAUGC +CGGGCAUUUCAUGUCUGGCAAUUGGAAUGAGUACAAUCUAAAUCCCUUAACGAGGAUCAAUUGGAGGGCAAGUCUGGUGC +CAGCAGCCGCGGUAAUUCCAGCUCCAAUAGCGUAUAUUUAAGUUGUUGCAGUUAAAAAGCUCGUAGUUGGAUUUCGGGUG +GGUUCCAGCGGUCCGCCUAUGGUGAGUACUGCUGUGGCCCUCCUUUUUGUCGGGGACGGGCUCCUGGGCUUCAUUGUCCG +GGACUCGGAGUCGACGAUGAUACUUUGAGUAAAUUAGAGUGUUCAAAGCAAGCCUACGCUCUGAAUACUUUAGCAUGGAA +UAUCGCGAUAGGACUCUGGCCUAUCUCGUUGGUCUGUAGGACCGGAGUAAUGAUUAAGAGGGACAGUCGGGGGCAUUCGU +AUUUCAUUGUCAGAGGUGAAAUUCUUGGAUUUAUGAAAGACGAACUACUGCGAAAGCAUUUGCCAAGGAUGUUUUCAUUA +AUCAAGAACGAAAGUUGGGGGCUCGAAGACGAUUAGAUACCGUCGUAGUCUCAACCAUAAACGAUGCCGACUAGGGAUUG +GAGGAUGUUCUUUUGAUGACUUCUCCAGCACCUUAUGAGAAAUCAAAGUUUUUGGGUUCCGGGGGGAGUAUGGUCGCAAG +GCUGAAACUUAAAGGAAUUGACGGAAGGGCACCACCAGGCGUGGAGCCUGCGGCUUAAUUUGACUCAACACGGGAAAACU +UACCAGGUCCAGACAUAGUGAGGAUUGACAGAUUGAGAGCUCUUUCUUGAUUCUAUGGGUGGUGGUGCAUGGCCGUUCUU +AGUUGGUGGGUUGCCUUGUCAGGUUGAUUCCGGUAACGAACGAGACCUCAGCCUGCUAAAUAUGUCACAUUCGCUUUUUG +CGGAUGGCCGACUUCUUAGAGGGACUAUUGGCGUUUAGUCAAUGGAAGUAUGAGGCAAUAACAGGUCUGUGAUGCCCUUA +GAUGUUCUGGGCCGCACGCGCGCUACACUGACGCAUUCAGCAAGCCUAUCCUUGACCGAGAGGUCUGGGUAAUCUUUGAA +ACUGCGUCGUGAUGGGGAUAGAUUAUUGCAAUUAUUAGUCUUCAACGAGGAAUGCCUAGUAAGCGCAAGUCAUCAGCUUG +CGUUGAUUACGUCCCUGCCCUUUGUACACACCGCCCGUCGCUCCUACCGAUUGGGUGUGCUGGUGAAGUGUUCGGAUUGG +CAGAGCGGGUGGCAACACUUGCUUUUGCCGAGAAGUUCAUUAAACCCUCCCACCUAGAGGAAGGAGAAGUCGUAACAAGG +UUUCCGUAGGUGAACCUGCAGAAG \ No newline at end of file diff --git a/src/sortmerna/test_data/rRNA/database2.fa b/src/sortmerna/test_data/rRNA/database2.fa new file mode 100644 index 00000000..87b5bc99 --- /dev/null +++ b/src/sortmerna/test_data/rRNA/database2.fa @@ -0,0 +1,16 @@ +>AB001445.1.1538 Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas amygdali pv. morsprunorum +AGAGUUUGAUCAUGGCUCAGAUUGAACGCUGGCGGCAGGCCUAACACAUGCAAGUCGAGCGGCAGCACGGGUACUUGUAC +CUGGUGGCGAGCGGCGGACGGGUGAGUAAUGCCUAGGAAUCUGCCUGGUAGUGGGGGAUAACGCUCGGAAACGGACGCUA +AUACCGCAUACGUCCUACGGGAGAAAGCAGGGGACCUUCGGGCCUUGCGCUAUCAGAUGAGCCUAGGUCGGAUUAGCUAG +UUGGUGAGGUAAUGGCUCACCAAGGCGACGAUCCGUAACUGGUCUGAGAGGAUGAUCAGUCACACUGGAACUGAGACACG +GUCCAGACUCCUACGGGAGGCAGCAGUGGGGAAUAUUGGACAAUGGGCGAAAGCCUGAUCCAGCCAUGCCGCGUGUGUGA +AGAAGGUCUUCGGAUUGUAAAGCACUUUAAGUUGGGAGGAAGGGCAGUUACCUAAUACGUAUCUGUUUUGACGUUACCGA +CAGAAUAAGCACCGGCUAACUCUGUGCCAGCAGCCGCGGUAAUACAGAGGGUGCAAGCGUUAAUCGGAAUUACUGGGCGU +AAAGCGCGCGUAGGUGGUUUGUUAAGUUGAAUGUGAAAUCCCCGGGCUCAACCUGGGAACUGCAUCCAAAACUGGCAAGC +UAGAGUAUGGUAGAGGGUGGUGGAAUUUCCUGUGUAGCGGUGAAAUGCGUAGAUAUAGGAAGGAACACCAGUGGCGAAGG +CGACCACCUGGACUGAUACUGACACUGAGGUGCGAAAGCGUGGGGAGCAAACAGGAUUAGAUACCCUGGUAGUCCACGCC +GUAAACGAUGUCAACUAGCCGUUGGGAGCCUUGAGCUCUUAGUGGCGCAGCUAACGCAUUAAGUUGACCGCCUGGGGAGU +ACGGCCGCAAGGUUAAAACUCAAAUGAAUUGACGGGGGCCCGCACAAGCGGUGGAGCAUGUGGUUUAAUUCGAAGCAACG +CGAAGAACCUUACCAGGCCUUGACAUCCAAUGAAUCCUUUAGAGAUAGAGGAGUGCCUUCGGGAGCAUUGAGACAGGUGC +UGCAUGGCUGUCGUCAGCUCGUGUCGUGAGAUGUUGGGUUAAGUCCCGUAACGAGCGCAACCCUUGUCCUUAGUUACCAG +CACGUCAUGGUGGGCACUCUAAGGAGACUGCCGGUGACAAACCGGAGGAAGGUGGGGAUGACGUCAAGUCAUCAUGGCCC diff --git a/src/sortmerna/test_data/reads_1.fq.gz b/src/sortmerna/test_data/reads_1.fq.gz new file mode 100644 index 00000000..41c02a22 Binary files /dev/null and b/src/sortmerna/test_data/reads_1.fq.gz differ diff --git a/src/sortmerna/test_data/reads_2.fq.gz b/src/sortmerna/test_data/reads_2.fq.gz new file mode 100644 index 00000000..9d0f8d3f Binary files /dev/null and b/src/sortmerna/test_data/reads_2.fq.gz differ diff --git a/src/sortmerna/test_data/script.sh b/src/sortmerna/test_data/script.sh new file mode 100755 index 00000000..b2531248 --- /dev/null +++ b/src/sortmerna/test_data/script.sh @@ -0,0 +1,8 @@ +#!/bin/bash + +if [ ! -d /tmp/sortmerna_source ]; then + git clone --depth 2 --single-branch --branch master https://github.com/snakemake/snakemake-wrappers.git /tmp/sortmerna_source +fi + +# copy test data +cp -r /tmp/sortmerna_source/bio/sortmerna/test/* . diff --git a/src/umi_tools/umi_tools_prepareforrsem/config.vsh.yaml b/src/umi_tools/umi_tools_prepareforrsem/config.vsh.yaml new file mode 100644 index 00000000..ceac2052 --- /dev/null +++ b/src/umi_tools/umi_tools_prepareforrsem/config.vsh.yaml @@ -0,0 +1,107 @@ +name: "umi_tools_prepareforrsem" +namespace: "umi_tools" +description: Make the output from umi-tools dedup or group compatible with RSEM +keywords: [umi_tools, rsem, bam, sam] +links: + homepage: https://umi-tools.readthedocs.io/en/latest/ + documentation: https://umi-tools.readthedocs.io/en/latest/reference/extract.html + repository: https://github.com/CGATOxford/UMI-tools +references: + doi: 10.1101/gr.209601.116 +license: MIT + +argument_groups: +- name: "Input" + arguments: + - name: "--input" + alternatives: ["-I", "--stdin"] + type: file + required: true + example: $id.transcriptome.bam + +- name: "Output" + arguments: + - name: "--output" + alternatives: ["-S", "--stdout"] + type: file + direction: output + example: $id.transcriptome_sorted.bam + - name: "--log" + alternatives: ["-L"] + type: file + direction: output + description: File with logging information [default = stdout]. + - name: "--error" + alternatives: ["-E"] + type: file + direction: output + description: File with error information [default = stderr]. + - name: "--log2stderr" + type: boolean_true + description: Send logging information to stderr [default = False]. + - name: "--temp_dir" + type: string + description: | + Directory for temporary files. If not set, the bash environmental variable + TMPDIR is used. + - name: "--compresslevel" + type: integer + description: | + Level of Gzip compression to use. Default (6) matchesGNU gzip rather than python + gzip default (which is 9). + +- name: "Options" + arguments: + - name: "--tags" + type: string + description: | + Comma-seperated list of tags to transfer from read1 to read2 (Default: 'UG,BX') + example: "UG,BX" + - name: "--sam" + type: boolean_true + description: Input and output SAM rather than BAM. + - name: "--timeit" + type: string + description: | + Store timeing information in file [none]. + - name: "--timeit_name" + type: string + description: | + Name in timing file for this class of jobs [all]. + - name: "--timeit_header" + type: boolean_true + description: Add header for timing information [none]. + - name: "--verbose" + alternatives: ["-v"] + type: integer + description: | + Loglevel [1]. The higher, the more output. + - name: "--random_seed" + type: integer + description: | + Random seed to initialize number generator with [none]. + + +resources: + - type: bash_script + path: script.sh + # copied from https://github.com/nf-core/rnaseq/blob/3.12.0/bin/prepare-for-rsem.py + - path: prepare-for-rsem.py +test_resources: + - type: bash_script + path: test.sh + - type: file + path: test_data + +engines: + - type: docker + image: quay.io/biocontainers/umi_tools:1.1.5--py38h0020b31_3 + setup: + - type: docker + run: | + umi_tools -v | sed 's/ version//g' > /var/software_versions.txt + + +runners: +- type: executable +- type: nextflow \ No newline at end of file diff --git a/src/umi_tools/umi_tools_prepareforrsem/help.txt b/src/umi_tools/umi_tools_prepareforrsem/help.txt new file mode 100644 index 00000000..efaf4de6 --- /dev/null +++ b/src/umi_tools/umi_tools_prepareforrsem/help.txt @@ -0,0 +1,54 @@ +``` +umi_tools prepare-for-rsem --help +``` + +prepare_for_rsem - make output from dedup or group compatible with RSEM + +Usage: umi_tools prepare_for_rsem [OPTIONS] [--stdin=IN_BAM] [--stdout=OUT_BAM] + + note: If --stdout is ommited, standard out is output. To + generate a valid BAM file on standard out, please + redirect log with --log=LOGFILE or --log2stderr + +For full UMI-tools documentation, see https://umi-tools.readthedocs.io/en/latest/ + +Options: + --version show program's version number and exit + + RSEM preparation specific options: + --tags=TAGS Comma-seperated list of tags to transfer from read1 to + read2 + --sam input and output SAM rather than BAM + + input/output options: + -I FILE, --stdin=FILE + file to read stdin from [default = stdin]. + -L FILE, --log=FILE + file with logging information [default = stdout]. + -E FILE, --error=FILE + file with error information [default = stderr]. + -S FILE, --stdout=FILE + file where output is to go [default = stdout]. + --temp-dir=FILE Directory for temporary files. If not set, the bash + environmental variable TMPDIR is used[default = None]. + --log2stderr send logging information to stderr [default = False]. + --compresslevel=COMPRESSLEVEL + Level of Gzip compression to use. Default (6) + matchesGNU gzip rather than python gzip default (which + is 9) + + profiling options: + --timeit=TIMEIT_FILE + store timeing information in file [none]. + --timeit-name=TIMEIT_NAME + name in timing file for this class of jobs [all]. + --timeit-header add header for timing information [none]. + + common options: + -v LOGLEVEL, --verbose=LOGLEVEL + loglevel [1]. The higher, the more output. + -h, --help output short help (command line options only). + --help-extended Output full documentation + --random-seed=RANDOM_SEED + random seed to initialize number generator with + [none]. \ No newline at end of file diff --git a/src/umi_tools/umi_tools_prepareforrsem/prepare-for-rsem.py b/src/umi_tools/umi_tools_prepareforrsem/prepare-for-rsem.py new file mode 100644 index 00000000..b53d30ac --- /dev/null +++ b/src/umi_tools/umi_tools_prepareforrsem/prepare-for-rsem.py @@ -0,0 +1,271 @@ +#!/usr/bin/env python3 + +""" +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Credits +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This script is a clone of the "prepare-for-rsem.py" script written by +Ian Sudbury, Tom Smith and other contributors to the UMI-tools package: +https://github.com/CGATOxford/UMI-tools + +It has been included here to address problems encountered with +Salmon quant and RSEM as discussed in the issue below: +https://github.com/CGATOxford/UMI-tools/issues/465 + +When the "umi_tools prepare-for-rsem" command becomes available in an official +UMI-tools release this script will be replaced and deprecated. + +Commit: +https://github.com/CGATOxford/UMI-tools/blob/bf8608d6a172c5ca0dcf33c126b4e23429177a72/umi_tools/prepare-for-rsem.py + +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +prepare_for_rsem - make the output from dedup or group compatible with RSEM +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +The SAM format specification states that the mnext and mpos fields should point +to the primary alignment of a read's mate. However, not all aligners adhere to +this standard. In addition, the RSEM software requires that the mate of a read1 +appears directly after it in its input BAM. This requires that there is exactly +one read1 alignment for every read2 and vice versa. + +In general (except in a few edge cases) UMI tools outputs only the read2 to that +corresponds to the read specified in the mnext and mpos positions of a selected +read1, and only outputs this read once, even if multiple read1s point to it. +This makes UMI-tools outputs incompatible with RSEM. This script takes the output +from dedup or groups and ensures that each read1 has exactly one read2 (and vice +versa), that read2 always appears directly after read1,and that pairs point to +each other (note this is technically not valid SAM format). Copy any specified +tags from read1 to read2 if they are present (by default, UG and BX, the unique +group and correct UMI tags added by _group_) + +Input must to name sorted. + + +https://raw.githubusercontent.com/CGATOxford/UMI-tools/master/LICENSE + +""" + +from umi_tools import Utilities as U +from collections import defaultdict, Counter +import pysam +import sys + + +usage = """ +prepare_for_rsem - make output from dedup or group compatible with RSEM + +Usage: umi_tools prepare_for_rsem [OPTIONS] [--stdin=IN_BAM] [--stdout=OUT_BAM] + + note: If --stdout is omited, standard out is output. To + generate a valid BAM file on standard out, please + redirect log with --log=LOGFILE or --log2stderr """ + + +def chunk_bam(bamfile): + """Take in a iterator of pysam.AlignmentSegment entries and yield + lists of reads that all share the same name""" + + last_query_name = None + output_buffer = list() + + for read in bamfile: + if last_query_name is not None and last_query_name != read.query_name: + yield (output_buffer) + output_buffer = list() + + last_query_name = read.query_name + output_buffer.append(read) + + yield (output_buffer) + + +def copy_tags(tags, read1, read2): + """Given a list of tags, copies the values of these tags from read1 + to read2, if the tag is set""" + + for tag in tags: + try: + read1_tag = read1.get_tag(tag, with_value_type=True) + read2.set_tag(tag, value=read1_tag[0], value_type=read1_tag[1]) + except KeyError: + pass + + return read2 + + +def pick_mate(read, template_dict, mate_key): + """Find the mate of read in the template dict using key. It will retrieve + all reads at that key, and then scan to pick the one that refers to _read_ + as it's mate. If there is no such read, it picks a first one it comes to""" + + mate = None + + # get a list of secondary reads at the correct alignment position + potential_mates = template_dict[not read.is_read1][mate_key] + + # search through one at a time to find a read that points to the current read + # as its mate. + for candidate_mate in potential_mates: + if ( + candidate_mate.next_reference_name == read.reference_name + and candidate_mate.next_reference_start == read.pos + ): + mate = candidate_mate + + # if no such read is found, then pick any old secondary alignment at that position + # note: this happens when UMI-tools outputs the wrong read as something's pair. + if mate is None and len(potential_mates) > 0: + mate = potential_mates[0] + + return mate + + +def main(argv=None): + if argv is None: + argv = sys.argv + + # setup command line parser + parser = U.OptionParser(version="%prog version: $Id$", usage=usage, description=globals()["__doc__"]) + group = U.OptionGroup(parser, "RSEM preparation specific options") + + group.add_option( + "--tags", + dest="tags", + type="string", + default="UG,BX", + help="Comma-separated list of tags to transfer from read1 to read2", + ) + group.add_option( + "--sam", dest="sam", action="store_true", default=False, help="input and output SAM rather than BAM" + ) + + parser.add_option_group(group) + + # add common options (-h/--help, ...) and parse command line + (options, args) = U.Start( + parser, argv=argv, add_group_dedup_options=False, add_umi_grouping_options=False, add_sam_options=False + ) + + skipped_stats = Counter() + + if options.stdin != sys.stdin: + in_name = options.stdin.name + options.stdin.close() + else: + in_name = "-" + + if options.sam: + mode = "" + else: + mode = "b" + + inbam = pysam.AlignmentFile(in_name, "r" + mode) + + if options.stdout != sys.stdout: + out_name = options.stdout.name + options.stdout.close() + else: + out_name = "-" + + outbam = pysam.AlignmentFile(out_name, "w" + mode, template=inbam) + + options.tags = options.tags.split(",") + + for template in chunk_bam(inbam): + assert len(set(r.query_name for r in template)) == 1 + current_template = {True: defaultdict(list), False: defaultdict(list)} + + for read in template: + key = (read.reference_name, read.pos, not read.is_secondary) + current_template[read.is_read1][key].append(read) + + output = set() + + for read in template: + mate = None + + # if this read is a non_primary alignment, we first want to check if it has a mate + # with the non-primary alignment flag set. + + mate_key_primary = True + mate_key_secondary = (read.next_reference_name, read.next_reference_start, False) + + # First look for a read that has the same primary/secondary status + # as read (i.e. secondary mate for secondary read, and primary mate + # for primary read) + mate_key = (read.next_reference_name, read.next_reference_start, read.is_secondary) + mate = pick_mate(read, current_template, mate_key) + + # If none was found then look for the opposite (primary mate of secondary + # read or seconadary mate of primary read) + if mate is None: + mate_key = (read.next_reference_name, read.next_reference_start, not read.is_secondary) + mate = pick_mate(read, current_template, mate_key) + + # If we still don't have a mate, then their can't be one? + if mate is None: + skipped_stats["no_mate"] += 1 + U.warn( + "Alignment {} has no mate -- skipped".format( + "\t".join(map(str, [read.query_name, read.flag, read.reference_name, int(read.pos)])) + ) + ) + continue + + # because we might want to make changes to the read, but not have those changes reflected + # if we need the read again,we copy the read. This is only way I can find to do this. + read = pysam.AlignedSegment().from_dict(read.to_dict(), read.header) + mate = pysam.AlignedSegment().from_dict(mate.to_dict(), read.header) + + # Make it so that if our read is secondary, the mate is also secondary. We don't make the + # mate primary if the read is primary because we would otherwise end up with mulitple + # primary alignments. + if read.is_secondary: + mate.is_secondary = True + + # In a situation where there is already one mate for each read, then we will come across + # each pair twice - once when we scan read1 and once when we scan read2. Thus we need + # to make sure we don't output something already output. + if read.is_read1: + mate = copy_tags(options.tags, read, mate) + output_key = str(read) + str(mate) + + if output_key not in output: + output.add(output_key) + outbam.write(read) + outbam.write(mate) + skipped_stats["pairs_output"] += 1 + + elif read.is_read2: + read = copy_tags(options.tags, mate, read) + output_key = str(mate) + str(read) + + if output_key not in output: + output.add(output_key) + outbam.write(mate) + outbam.write(read) + skipped_stats["pairs_output"] += 1 + + else: + skipped_stats["skipped_not_read_12"] += 1 + U.warn( + "Alignment {} is neither read1 nor read2 -- skipped".format( + "\t".join(map(str, [read.query_name, read.flag, read.reference_name, int(read.pos)])) + ) + ) + continue + + if not out_name == "-": + outbam.close() + + U.info( + "Total pairs output: {}, Pairs skipped - no mates: {}," + " Pairs skipped - not read1 or 2: {}".format( + skipped_stats["pairs_output"], skipped_stats["no_mate"], skipped_stats["skipped_not_read12"] + ) + ) + U.Stop() + + +if __name__ == "__main__": + sys.exit(main(sys.argv)) diff --git a/src/umi_tools/umi_tools_prepareforrsem/script.sh b/src/umi_tools/umi_tools_prepareforrsem/script.sh new file mode 100755 index 00000000..d6b3775f --- /dev/null +++ b/src/umi_tools/umi_tools_prepareforrsem/script.sh @@ -0,0 +1,32 @@ +#!/bin/bash + +set -eo pipefail + +unset_if_false=( + par_sam + par_error + par_log2stderr + par_timeit_header ) + +for var in "${unset_if_false[@]}"; do + test_val="${!var}" + [[ "$test_val" == "false" ]] && unset $var +done + +umi_tools prepare-for-rsem \ + ${par_log:+--log "${par_log}"} \ + ${par_tags:+--tags "${par_tags}"} \ + ${par_sam:+--sam} \ + --stdin="${par_input}" \ + ${par_output:+--stdout "${par_output}"} \ + ${par_error:+--error "${par_error}"} \ + ${par_temp_dir:+--temp-dir "${par_temp_dir}"} \ + ${par_log2stderr:+--log2stderr} \ + ${par_verbose:+--verbose "${par_verbose}"} \ + ${par_random_seed:+--random-seed "${par_random_seed}"} \ + ${par_compresslevel:+--compresslevel "${par_compresslevel}"} + ${par_timeit:+--timeit "${par_timeit}"} \ + ${par_timeit_name:+--timeit-name "${par_timeit_name}"} \ + ${par_timeit_header:+--timeit-header} + + diff --git a/src/umi_tools/umi_tools_prepareforrsem/test.sh b/src/umi_tools/umi_tools_prepareforrsem/test.sh new file mode 100644 index 00000000..c94a202d --- /dev/null +++ b/src/umi_tools/umi_tools_prepareforrsem/test.sh @@ -0,0 +1,55 @@ +#!/bin/bash + +test_dir="$meta_resources_dir/test_data" +apt-get -q update && apt-get -q install -y samtools + +################################################################################ +echo ">>> Test 1: with --sam:" + +"${meta_executable}" \ + --input "$test_dir/test_dedup.sam" \ + --output "$test_dir/test_output.sam" \ + --sam + +echo ">>> Check if output is present" +[[ ! -f "$test_dir/test_output.sam" ]] && echo "Output file not found" && exit 1 +[[ ! -s "$test_dir/test_output.sam" ]] && echo "Output file is empty" && exit 1 + +echo ">>> Check if output is correct" +# use diff but ignoring the header lines (which start with @) as they may differ slightly +diff <(grep -v "^@" "$test_dir/test_output.sam") <(grep -v "^@" "$test_dir/test.sam") && echo "Output is correct" || (echo "Output is incorrect" && exit 1) + +################################################################################ +echo ">>> Test 2: without --sam:" + +"${meta_executable}" \ + --input "$test_dir/test_dedup.bam" \ + --output "$test_dir/test_output.bam" + +echo ">>> Check if output is present" +[[ ! -f "$test_dir/test_output.bam" ]] && echo "Output file not found" && exit 1 +[[ ! -s "$test_dir/test_output.bam" ]] && echo "Output file is empty" && exit 1 + +echo ">>> Check if output is correct" +diff <(samtools view "$test_dir/test_output.bam") <(samtools view "$test_dir/test.bam") || (echo "Output is incorrect" && exit 1) + +################################################################################ +echo ">>> Test 3: with --log:" + +"${meta_executable}" \ + --log "$test_dir/test_log.log" \ + --input "$test_dir/test_dedup.sam" \ + --output "$test_dir/test_output.sam" \ + --sam + +echo ">>> Check if output is present" +[[ ! -f "$test_dir/test_output.sam" ]] && echo "Output file not found" && exit 1 +[[ ! -s "$test_dir/test_output.sam" ]] && echo "Output file is empty" && exit 1 +[[ ! -f "$test_dir/test_log.log" ]] && echo "Log file not found" && exit 1 +[[ ! -s "$test_dir/test_log.log" ]] && echo "Log file is empty" && exit 1 + +echo ">>> Check if log file is correct" +diff <(grep -v '^#' "$test_dir/test_log.log" | sed 's/^[0-9-]* [0-9:]*,[0-9]\{3\} //') <(grep -v '^#' "$test_dir/log.log" | sed 's/^[0-9-]* [0-9:]*,[0-9]\{3\} //') || (echo "Log file is incorrect" && exit 1) + +echo ">>> All test succeeded" +exit 0 \ No newline at end of file diff --git a/src/umi_tools/umi_tools_prepareforrsem/test_data/log.log b/src/umi_tools/umi_tools_prepareforrsem/test_data/log.log new file mode 100644 index 00000000..e4b56e57 --- /dev/null +++ b/src/umi_tools/umi_tools_prepareforrsem/test_data/log.log @@ -0,0 +1,103 @@ +# UMI-tools version: 1.1.5 +# output generated by prepare-for-rsem.py --log test_data/log.log --sam --stdin test_data/test_dedup.sam --stdout jnfgioeurg.sam +# job started at Tue Sep 10 06:43:30 2024 on 4855b4607095 -- 07ae7548-56e8-4772-9b48-7406710fd838 +# pid: 28, system: Linux 6.10.0-linuxkit #1 SMP PREEMPT_DYNAMIC Wed Jul 17 10:54:05 UTC 2024 x86_64 +# compresslevel : 6 +# log2stderr : False +# loglevel : 1 +# random_seed : None +# sam : True +# short_help : None +# stderr : <_io.TextIOWrapper name='' mode='w' encoding='utf-8'> +# stdin : <_io.TextIOWrapper name='test_data/test_dedup.sam' mode='r' encoding='UTF-8'> +# stdlog : <_io.TextIOWrapper name='test_data/log.log' mode='a' encoding='UTF-8'> +# stdout : <_io.TextIOWrapper name='jnfgioeurg.sam' mode='w' encoding='UTF-8'> +# tags : UG,BX +# timeit_file : None +# timeit_header : None +# timeit_name : all +# tmpdir : None +2024-09-10 06:43:30,918 WARNING Alignment ERR5069949.114870 99 MT192765.1 642 has no mate -- skipped +2024-09-10 06:43:30,918 WARNING Alignment ERR5069949.147998 163 MT192765.1 673 has no mate -- skipped +2024-09-10 06:43:30,919 WARNING Alignment ERR5069949.114870 147 MT192765.1 747 has no mate -- skipped +2024-09-10 06:43:30,920 WARNING Alignment ERR5069949.147998 83 MT192765.1 918 has no mate -- skipped +2024-09-10 06:43:30,921 WARNING Alignment ERR5069949.184542 99 MT192765.1 1054 has no mate -- skipped +2024-09-10 06:43:30,921 WARNING Alignment ERR5069949.184542 147 MT192765.1 1254 has no mate -- skipped +2024-09-10 06:43:30,922 WARNING Alignment ERR5069949.376959 99 MT192765.1 4104 has no mate -- skipped +2024-09-10 06:43:30,924 WARNING Alignment ERR5069949.376959 147 MT192765.1 4189 has no mate -- skipped +2024-09-10 06:43:30,925 WARNING Alignment ERR5069949.532979 99 MT192765.1 5567 has no mate -- skipped +2024-09-10 06:43:30,926 WARNING Alignment ERR5069949.540529 163 MT192765.1 5569 has no mate -- skipped +2024-09-10 06:43:30,926 WARNING Alignment ERR5069949.532979 147 MT192765.1 5620 has no mate -- skipped +2024-09-10 06:43:30,927 WARNING Alignment ERR5069949.540529 83 MT192765.1 5658 has no mate -- skipped +2024-09-10 06:43:30,930 WARNING Alignment ERR5069949.856527 99 MT192765.1 10117 has no mate -- skipped +2024-09-10 06:43:30,931 WARNING Alignment ERR5069949.870926 99 MT192765.1 10117 has no mate -- skipped +2024-09-10 06:43:30,931 WARNING Alignment ERR5069949.856527 147 MT192765.1 10198 has no mate -- skipped +2024-09-10 06:43:30,931 WARNING Alignment ERR5069949.885966 99 MT192765.1 10229 has no mate -- skipped +2024-09-10 06:43:30,932 WARNING Alignment ERR5069949.870926 147 MT192765.1 10244 has no mate -- skipped +2024-09-10 06:43:30,932 WARNING Alignment ERR5069949.885966 147 MT192765.1 10276 has no mate -- skipped +2024-09-10 06:43:30,932 WARNING Alignment ERR5069949.937422 99 MT192765.1 10421 has no mate -- skipped +2024-09-10 06:43:30,933 WARNING Alignment ERR5069949.937422 147 MT192765.1 10590 has no mate -- skipped +2024-09-10 06:43:30,934 WARNING Alignment ERR5069949.1066259 99 MT192765.1 11336 has no mate -- skipped +2024-09-10 06:43:30,935 WARNING Alignment ERR5069949.1062611 163 MT192765.1 11426 has no mate -- skipped +2024-09-10 06:43:30,936 WARNING Alignment ERR5069949.1067032 163 MT192765.1 11433 has no mate -- skipped +2024-09-10 06:43:30,936 WARNING Alignment ERR5069949.1062611 83 MT192765.1 11453 has no mate -- skipped +2024-09-10 06:43:30,936 WARNING Alignment ERR5069949.1066259 147 MT192765.1 11479 has no mate -- skipped +2024-09-10 06:43:30,937 WARNING Alignment ERR5069949.1067032 83 MT192765.1 11480 has no mate -- skipped +2024-09-10 06:43:30,938 WARNING Alignment ERR5069949.1258508 163 MT192765.1 12424 has no mate -- skipped +2024-09-10 06:43:30,939 WARNING Alignment ERR5069949.1261808 99 MT192765.1 12592 has no mate -- skipped +2024-09-10 06:43:30,940 WARNING Alignment ERR5069949.1258508 83 MT192765.1 12637 has no mate -- skipped +2024-09-10 06:43:30,940 WARNING Alignment ERR5069949.1261808 147 MT192765.1 12653 has no mate -- skipped +2024-09-10 06:43:30,941 WARNING Alignment ERR5069949.1372331 163 MT192765.1 13010 has no mate -- skipped +2024-09-10 06:43:30,941 WARNING Alignment ERR5069949.1372331 83 MT192765.1 13131 has no mate -- skipped +2024-09-10 06:43:30,942 WARNING Alignment ERR5069949.1552198 99 MT192765.1 13943 has no mate -- skipped +2024-09-10 06:43:30,943 WARNING Alignment ERR5069949.1561137 163 MT192765.1 13990 has no mate -- skipped +2024-09-10 06:43:30,943 WARNING Alignment ERR5069949.1552198 147 MT192765.1 14026 has no mate -- skipped +2024-09-10 06:43:30,944 WARNING Alignment ERR5069949.1561137 83 MT192765.1 14080 has no mate -- skipped +2024-09-10 06:43:30,947 WARNING Alignment ERR5069949.2098070 99 MT192765.1 17114 has no mate -- skipped +2024-09-10 06:43:30,947 WARNING Alignment ERR5069949.2064910 99 MT192765.1 17122 has no mate -- skipped +2024-09-10 06:43:30,947 WARNING Alignment ERR5069949.2125592 99 MT192765.1 17179 has no mate -- skipped +2024-09-10 06:43:30,947 WARNING Alignment ERR5069949.2064910 147 MT192765.1 17179 has no mate -- skipped +2024-09-10 06:43:30,948 WARNING Alignment ERR5069949.2098070 147 MT192765.1 17269 has no mate -- skipped +2024-09-10 06:43:30,948 WARNING Alignment ERR5069949.2125592 147 MT192765.1 17288 has no mate -- skipped +2024-09-10 06:43:30,948 WARNING Alignment ERR5069949.2185111 163 MT192765.1 17405 has no mate -- skipped +2024-09-10 06:43:30,949 WARNING Alignment ERR5069949.2151832 163 MT192765.1 17415 has no mate -- skipped +2024-09-10 06:43:30,949 WARNING Alignment ERR5069949.2176303 99 MT192765.1 17441 has no mate -- skipped +2024-09-10 06:43:30,949 WARNING Alignment ERR5069949.2151832 83 MT192765.1 17452 has no mate -- skipped +2024-09-10 06:43:30,949 WARNING Alignment ERR5069949.2205229 99 MT192765.1 17475 has no mate -- skipped +2024-09-10 06:43:30,950 WARNING Alignment ERR5069949.2216307 163 MT192765.1 17503 has no mate -- skipped +2024-09-10 06:43:30,950 WARNING Alignment ERR5069949.2176303 147 MT192765.1 17518 has no mate -- skipped +2024-09-10 06:43:30,950 WARNING Alignment ERR5069949.2185111 83 MT192765.1 17536 has no mate -- skipped +2024-09-10 06:43:30,951 WARNING Alignment ERR5069949.2205229 147 MT192765.1 17584 has no mate -- skipped +2024-09-10 06:43:30,951 WARNING Alignment ERR5069949.2216307 83 MT192765.1 17600 has no mate -- skipped +2024-09-10 06:43:30,952 WARNING Alignment ERR5069949.2270078 163 MT192765.1 17969 has no mate -- skipped +2024-09-10 06:43:30,953 WARNING Alignment ERR5069949.2270078 83 MT192765.1 18102 has no mate -- skipped +2024-09-10 06:43:30,953 WARNING Alignment ERR5069949.2328704 163 MT192765.1 18285 has no mate -- skipped +2024-09-10 06:43:30,954 WARNING Alignment ERR5069949.2342766 99 MT192765.1 18396 has no mate -- skipped +2024-09-10 06:43:30,954 WARNING Alignment ERR5069949.2328704 83 MT192765.1 18411 has no mate -- skipped +2024-09-10 06:43:30,954 WARNING Alignment ERR5069949.2361683 99 MT192765.1 18425 has no mate -- skipped +2024-09-10 06:43:30,954 WARNING Alignment ERR5069949.2342766 147 MT192765.1 18468 has no mate -- skipped +2024-09-10 06:43:30,955 WARNING Alignment ERR5069949.2361683 147 MT192765.1 18512 has no mate -- skipped +2024-09-10 06:43:30,955 WARNING Alignment ERR5069949.2415814 99 MT192765.1 18597 has no mate -- skipped +2024-09-10 06:43:30,955 WARNING Alignment ERR5069949.2385514 99 MT192765.1 18602 has no mate -- skipped +2024-09-10 06:43:30,956 WARNING Alignment ERR5069949.2417063 99 MT192765.1 18648 has no mate -- skipped +2024-09-10 06:43:30,956 WARNING Alignment ERR5069949.2388984 99 MT192765.1 18653 has no mate -- skipped +2024-09-10 06:43:30,956 WARNING Alignment ERR5069949.2385514 147 MT192765.1 18684 has no mate -- skipped +2024-09-10 06:43:30,956 WARNING Alignment ERR5069949.2388984 147 MT192765.1 18693 has no mate -- skipped +2024-09-10 06:43:30,957 WARNING Alignment ERR5069949.2431709 99 MT192765.1 18748 has no mate -- skipped +2024-09-10 06:43:30,957 WARNING Alignment ERR5069949.2415814 147 MT192765.1 18764 has no mate -- skipped +2024-09-10 06:43:30,957 WARNING Alignment ERR5069949.2417063 147 MT192765.1 18765 has no mate -- skipped +2024-09-10 06:43:30,958 WARNING Alignment ERR5069949.2431709 147 MT192765.1 18776 has no mate -- skipped +2024-09-10 06:43:30,959 WARNING Alignment ERR5069949.2668880 99 MT192765.1 23124 has no mate -- skipped +2024-09-10 06:43:30,960 WARNING Alignment ERR5069949.2674295 163 MT192765.1 23133 has no mate -- skipped +2024-09-10 06:43:30,960 WARNING Alignment ERR5069949.2668880 147 MT192765.1 23145 has no mate -- skipped +2024-09-10 06:43:30,960 WARNING Alignment ERR5069949.2674295 83 MT192765.1 23203 has no mate -- skipped +2024-09-10 06:43:30,963 WARNING Alignment ERR5069949.2953930 99 MT192765.1 25344 has no mate -- skipped +2024-09-10 06:43:30,963 WARNING Alignment ERR5069949.2972968 163 MT192765.1 25425 has no mate -- skipped +2024-09-10 06:43:30,963 WARNING Alignment ERR5069949.2953930 147 MT192765.1 25464 has no mate -- skipped +2024-09-10 06:43:30,964 WARNING Alignment ERR5069949.2972968 83 MT192765.1 25518 has no mate -- skipped +2024-09-10 06:43:30,966 WARNING Alignment ERR5069949.3273002 163 MT192765.1 28442 has no mate -- skipped +2024-09-10 06:43:30,966 WARNING Alignment ERR5069949.3277445 99 MT192765.1 28508 has no mate -- skipped +2024-09-10 06:43:30,966 WARNING Alignment ERR5069949.3273002 83 MT192765.1 28543 has no mate -- skipped +2024-09-10 06:43:30,966 WARNING Alignment ERR5069949.3277445 147 MT192765.1 28573 has no mate -- skipped +2024-09-10 06:43:30,968 INFO Total pairs output: 56, Pairs skipped - no mates: 82, Pairs skipped - not read1 or 2: 0 +# job finished in 0 seconds at Tue Sep 10 06:43:30 2024 -- 4.44 0.25 0.00 0.00 -- 07ae7548-56e8-4772-9b48-7406710fd838 diff --git a/src/umi_tools/umi_tools_prepareforrsem/test_data/test.bam b/src/umi_tools/umi_tools_prepareforrsem/test_data/test.bam new file mode 100644 index 00000000..7793c7e3 Binary files /dev/null and b/src/umi_tools/umi_tools_prepareforrsem/test_data/test.bam differ diff --git a/src/umi_tools/umi_tools_prepareforrsem/test_data/test.sam b/src/umi_tools/umi_tools_prepareforrsem/test_data/test.sam new file mode 100644 index 00000000..6465827d --- /dev/null +++ b/src/umi_tools/umi_tools_prepareforrsem/test_data/test.sam @@ -0,0 +1,119 @@ +@HD VN:1.6 SO:coordinate +@SQ SN:MT192765.1 LN:29829 +@RG ID:1 LB:lib1 PL:ILLUMINA SM:test PU:barcode1 +@PG ID:minimap2 PN:minimap2 VN:2.17-r941 CL:minimap2 -ax sr tests/data/fasta/sarscov2/GCA_011545545.1_ASM1154554v1_genomic.fna tests/data/fastq/dna/sarscov2_1.fastq.gz tests/data/fastq/dna/sarscov2_2.fastq.gz +@PG ID:samtools PN:samtools PP:minimap2 VN:1.11 CL:samtools view -Sb sarscov2_aln.sam +@PG ID:samtools.1 PN:samtools PP:samtools VN:1.11 CL:samtools sort -o sarscov2_paired_aln.sorted.bam sarscov2_paired_aln.bam +@PG ID:samtools.2 PN:samtools PP:samtools.1 VN:1.20 CL:samtools view -h test_data/test_dedup.bam +ERR5069949.29668 83 MT192765.1 267 60 89M = 121 -235 CCTTGTCCCTGGTTACAACTAGAAACCACACGTCCAACTCAGTTTGCCTGTTTTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAG E////6/E/EE/EE/<