diff --git a/CHANGELOG.md b/CHANGELOG.md index 3a036fba..3fc960f4 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -66,9 +66,14 @@ - `samtools/samtools_collate`: Shuffles and groups reads in SAM/BAM/CRAM files together by their names (PR #42). - `samtools/samtools_view`: Views and converts SAM/BAM/CRAM files (PR #48). - `samtools/samtools_fastq`: Converts a SAM/BAM/CRAM file to FASTQ (PR #52). + - `samtools/samtools_fasta`: Converts a SAM/BAM/CRAM file to FASTA (PR #53). + * `falco`: A C++ drop-in replacement of FastQC to assess the quality of sequence read data (PR #43). +* `umi_tools`: + - `umi_tools_dedup`: Deduplicate reads based on the mapping co-ordinate and the UMI attached to the read (PR #54). + * `bedtools`: - `bedtools_getfasta`: extract sequences from a FASTA file for each of the intervals defined in a BED/GFF/VCF file (PR #59). diff --git a/src/samtools/samtools_fasta/config.vsh.yaml b/src/samtools/samtools_fasta/config.vsh.yaml new file mode 100644 index 00000000..23517f6c --- /dev/null +++ b/src/samtools/samtools_fasta/config.vsh.yaml @@ -0,0 +1,191 @@ +name: samtools_fasta +namespace: samtools +description: Converts a SAM, BAM or CRAM to FASTA format. +keywords: [fasta, bam, sam, cram] +links: + homepage: https://www.htslib.org/ + documentation: https://www.htslib.org/doc/samtools-fasta.html + repository: https://github.com/samtools/samtools +references: + doi: [10.1093/bioinformatics/btp352, 10.1093/gigascience/giab008] +license: MIT/Expat + +argument_groups: + - name: Inputs + arguments: + - name: --input + type: file + description: input SAM/BAM/CRAM file + required: true + - name: Outputs + arguments: + - name: --output + type: file + description: output FASTA file + required: true + direction: output + - name: Options + arguments: + - name: --no_suffix + alternatives: -n + type: boolean_true + description: | + By default, either '/1' or '/2' is added to the end of read names where the corresponding + READ1 or READ2 FLAG bit is set. 
Using -n causes read names to be left as they are. + - name: --suffix + alternatives: -N + type: boolean_true + description: | + Always add either '/1' or '/2' to the end of read names even when put into different files. + - name: --use_oq + alternatives: -O + type: boolean_true + description: | + Use quality values from OQ tags in preference to standard quality string if available. + - name: --singleton + alternatives: -s + type: file + description: write singleton reads to FILE. + - name: --copy_tags + alternatives: -t + type: boolean_true + description: | + Copy RG, BC and QT tags to the FASTA header line, if they exist. + - name: --copy_tags_list + alternatives: -T + type: string + description: | + Specify a comma-separated list of tags to copy to the FASTA header line, if they exist. + TAGLIST can be blank or `*` to indicate all tags should be copied to the output. If using `*`, + be careful to quote it to avoid unwanted shell expansion. + - name: --read1 + alternatives: -1 + type: file + description: | + Write reads with the READ1 FLAG set (and READ2 not set) to FILE instead of outputting them. + If the -s option is used, only paired reads will be written to this file. + direction: output + - name: --read2 + alternatives: -2 + type: file + description: | + Write reads with the READ2 FLAG set (and READ1 not set) to FILE instead of outputting them. + If the -s option is used, only paired reads will be written to this file. + direction: output + - name: --output_reads + alternatives: -o + type: file + description: | + Write reads with either READ1 FLAG or READ2 flag set to FILE instead of outputting them to stdout. + This is equivalent to -1 FILE -2 FILE. + direction: output + - name: --output_reads_both + alternatives: -0 + type: file + description: | + Write reads where the READ1 and READ2 FLAG bits set are either both set or both unset to FILE + instead of outputting them. 
+ direction: output + - name: --filter_flags + alternatives: -f + type: integer + description: | + Only output alignments with all bits set in INT present in the FLAG field. INT can be specified + in hex by beginning with '0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with '0' + (i.e. /^0[0-7]+/). Default: `0`. + example: 0 + - name: --excl_flags + alternatives: -F + type: string + description: | + Do not output alignments with any bits set in INT present in the FLAG field. INT can be specified + in hex by beginning with '0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with '0' + (i.e. /^0[0-7]+/). This defaults to 0x900 representing filtering of secondary and + supplementary alignments. Default: `0x900`. + example: "0x900" + - name: --incl_flags + alternatives: --rf + type: string + description: | + Only output alignments with any bits set in INT present in the FLAG field. INT can be specified + in hex by beginning with '0x' (i.e. /^0x[0-9A-F]+/), in octal by beginning with '0' + (i.e. /^0[0-7]+/), as a decimal number not beginning with '0' or as a comma-separated list of + flag names. Default: `0`. + example: 0 + - name: --excl_flags_all + alternatives: -G + type: integer + description: | + Only EXCLUDE reads with all of the bits set in INT present in the FLAG field. INT can be specified + in hex by beginning with '0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with '0' (i.e. /^0[0-7]+/). + Default: `0`. + example: 0 + - name: --aux_tag + alternatives: -d + type: string + description: | + Only output alignments containing an auxiliary tag matching both TAG and VAL. If VAL is omitted + then any value is accepted. The tag types supported are i, f, Z, A and H. "B" arrays are not + supported. This is comparable to the method used in samtools view --tag. The option may be specified + multiple times and is equivalent to using the --aux_tag_file option. 
+ - name: --aux_tag_file + alternatives: -D + type: string + description: | + Only output alignments containing an auxiliary tag matching TAG and having a value listed in FILE. + The format of the file is one line per value. This is equivalent to specifying --aux_tag multiple times. + - name: --casava + alternatives: -i + type: boolean_true + description: add Illumina Casava 1.8 format entry to header (eg 1:N:0:ATCACG) + - name: --compression + alternatives: -c + type: integer + description: set compression level when writing gz or bgzf fasta files. + example: 0 + - name: --index1 + alternatives: --i1 + type: file + description: write first index reads to FILE. + - name: --index2 + alternatives: --i2 + type: file + description: write second index reads to FILE. + - name: --barcode_tag + type: string + description: | + Auxiliary tag to find index reads in. Default: `BC`. + example: "BC" + - name: --quality_tag + type: string + description: | + Auxiliary tag to find index quality in. Default: `QT`. + example: "QT" + - name: --index_format + type: string + description: | + String describing how to parse the barcode and quality tags. For example: + * `i14i8`: the first 14 characters are index 1, the next 8 characters are index 2. + * `n8i14`: ignore the first 8 characters, and use the next 14 characters for index 1. + If the tag contains a separator, then the numeric part can be replaced with `*` to mean + 'read until the separator or end of tag', for example: `n*i*`. 
+ +resources: + - type: bash_script + path: ../samtools_fastq/script.sh +test_resources: + - type: bash_script + path: test.sh + - type: file + path: test_data +engines: + - type: docker + image: quay.io/biocontainers/samtools:1.19.2--h50ea8bc_1 + setup: + - type: docker + run: | + samtools --version 2>&1 | grep -E '^(samtools|Using htslib)' | \ + sed 's#Using ##;s# \([0-9\.]*\)$#: \1#' > /var/software_versions.txt +runners: +- type: executable +- type: nextflow diff --git a/src/samtools/samtools_fasta/help.txt b/src/samtools/samtools_fasta/help.txt new file mode 100644 index 00000000..39ed0d00 --- /dev/null +++ b/src/samtools/samtools_fasta/help.txt @@ -0,0 +1,80 @@ +``` +samtools fastq +``` + +Usage: samtools fastq [options...] + +Description: +Converts a SAM, BAM or CRAM to FASTQ format. + +Options: + -0 FILE write reads designated READ_OTHER to FILE + -1 FILE write reads designated READ1 to FILE + -2 FILE write reads designated READ2 to FILE + -o FILE write reads designated READ1 or READ2 to FILE + note: if a singleton file is specified with -s, only + paired reads will be written to the -1 and -2 files. 
+ -d, --tag TAG[:VAL] + only include reads containing TAG, optionally with value VAL + -f, --require-flags INT + only include reads with all of the FLAGs in INT present [0] + -F, --excl[ude]-flags INT + only include reads with none of the FLAGs in INT present [0x900] + --rf, --incl[ude]-flags INT + only include reads with any of the FLAGs in INT present [0] + -G INT only EXCLUDE reads with all of the FLAGs in INT present [0] + -n don't append /1 and /2 to the read name + -N always append /1 and /2 to the read name + -O output quality in the OQ tag if present + -s FILE write singleton reads designated READ1 or READ2 to FILE + -t copy RG, BC and QT tags to the FASTQ header line + -T TAGLIST copy arbitrary tags to the FASTQ header line, '*' for all + -v INT default quality score if not given in file [1] + -i add Illumina Casava 1.8 format entry to header (eg 1:N:0:ATCACG) + -c INT compression level [0..9] to use when writing bgzf files [1] + --i1 FILE write first index reads to FILE + --i2 FILE write second index reads to FILE + --barcode-tag TAG + Barcode tag [BC] + --quality-tag TAG + Quality tag [QT] + --index-format STR + How to parse barcode and quality tags + + --input-fmt-option OPT[=VAL] + Specify a single input file format option in the form + of OPTION or OPTION=VALUE + --reference FILE + Reference sequence FASTA FILE [null] + -@, --threads INT + Number of additional threads to use [0] + --verbosity INT + Set level of verbosity + +The files will be automatically compressed if the file names have a .gz +or .bgzf extension. The input to this program must be collated by name. +Run 'samtools collate' or 'samtools sort -n' to achieve this. + +Reads are designated READ1 if FLAG READ1 is set and READ2 is not set. +Reads are designated READ2 if FLAG READ1 is not set and READ2 is set. +Otherwise reads are designated READ_OTHER (both flags set or both flags unset). +Run 'samtools flags' for more information on flag codes and meanings. 
+ +The index-format string describes how to parse the barcode and quality tags. +It is made up of 'i' or 'n' followed by a length or '*'. For example: + i14i8 The first 14 characters are index 1, the next 8 are index 2 + n8i14 Ignore the first 8 characters, and use the next 14 for index 1 + +If the tag contains a separator, then the numeric part can be replaced with +'*' to mean 'read until the separator or end of tag', for example: + i*i* Break the tag at the separator into index 1 and index 2 + n*i* Ignore the left part of the tag until the separator, + then use the second part of the tag as index 1 + +Examples: +To get just the paired reads in separate files, use: + samtools fastq -1 pair1.fq -2 pair2.fq -0 /dev/null -s /dev/null -n in.bam + +To get all non-supplementary/secondary reads in a single file, redirect +the output: + samtools fastq in.bam > all_reads.fq \ No newline at end of file diff --git a/src/samtools/samtools_fasta/test.sh b/src/samtools/samtools_fasta/test.sh new file mode 100644 index 00000000..687965ae --- /dev/null +++ b/src/samtools/samtools_fasta/test.sh @@ -0,0 +1,96 @@ +#!/bin/bash + +test_dir="${meta_resources_dir}/test_data" +out_dir="${meta_resources_dir}/out_data" + +############################################################################################ + +echo ">>> Test 1: Convert all reads from a bam file to fasta format" +"$meta_executable" \ + --input "$test_dir/a.bam" \ + --output "$out_dir/a.fa" + +echo ">>> Check if output file exists" +[ ! -f "$out_dir/a.fa" ] && echo "Output file a.fa does not exist" && exit 1 + +echo ">>> Check if output is empty" +[ ! 
-s "$out_dir/a.fa" ] && echo "Output file a.fa is empty" && exit 1 + +echo ">>> Check if output matches expected output" +diff "$out_dir/a.fa" "$test_dir/a.fa" || + { echo "Output file a.fa does not match expected output"; exit 1; } + +rm "$out_dir/a.fa" + +############################################################################################ + +echo ">>> Test 2: Convert all reads from a sam file to fasta format" +"$meta_executable" \ + --input "$test_dir/a.sam" \ + --output "$out_dir/a.fa" + +echo ">>> Check if output file exists" +[ ! -f "$out_dir/a.fa" ] && echo "Output file a.fa does not exist" && exit 1 + +echo ">>> Check if output is empty" +[ ! -s "$out_dir/a.fa" ] && echo "Output file a.fa is empty" && exit 1 + +echo ">>> Check if output matches expected output" +diff "$out_dir/a.fa" "$test_dir/a.fa" || + { echo "Output file a.fa does not match expected output"; exit 1; } + +rm "$out_dir/a.fa" + +############################################################################################ + +echo ">>> Test 3: Output reads from bam file to separate files" + +"$meta_executable" \ + --input "$test_dir/a.bam" \ + --read1 "$out_dir/a.1.fa" \ + --read2 "$out_dir/a.2.fa" \ + --output "$out_dir/a.fa" + +echo ">>> Check if output files exist" +[ ! -f "$out_dir/a.1.fa" ] && echo "Output file a.1.fa does not exist" && exit 1 +[ ! -f "$out_dir/a.2.fa" ] && echo "Output file a.2.fa does not exist" && exit 1 +[ ! -f "$out_dir/a.fa" ] && echo "Output file a.fa does not exist" && exit 1 + +echo ">>> Check if output files are empty" +[ ! -s "$out_dir/a.1.fa" ] && echo "Output file a.1.fa is empty" && exit 1 +[ ! 
-s "$out_dir/a.2.fa" ] && echo "Output file a.2.fa is empty" && exit 1 +# a.fa is expected to be empty since the input has no singleton reads + +echo ">>> Check if output files match expected output" +diff "$out_dir/a.1.fa" "$test_dir/a.1.fa" || + { echo "Output file a.1.fa does not match expected output"; exit 1; } +diff "$out_dir/a.2.fa" "$test_dir/a.2.fa" || + { echo "Output file a.2.fa does not match expected output"; exit 1; } + +rm "$out_dir/a.1.fa" "$out_dir/a.2.fa" "$out_dir/a.fa" + +############################################################################################ + +echo ">>> Test 4: Output only forward reads from sam file to fasta format" + +"$meta_executable" \ + --input "$test_dir/a.sam" \ + --excl_flags "0x80" \ + --output "$out_dir/half.fa" + +echo ">>> Check if output file exists" +[ ! -f "$out_dir/half.fa" ] && echo "Output file half.fa does not exist" && exit 1 + +echo ">>> Check if output is empty" +[ ! -s "$out_dir/half.fa" ] && echo "Output file half.fa is empty" && exit 1 + +echo ">>> Check if output matches expected output" +diff "$out_dir/half.fa" "$test_dir/half.fa" || + { echo "Output file half.fa does not match expected output"; exit 1; } + +rm "$out_dir/half.fa" + +############################################################################################ + +echo "All tests succeeded!" 
+exit 0 \ No newline at end of file diff --git a/src/samtools/samtools_fasta/test_data/a.1.fa b/src/samtools/samtools_fasta/test_data/a.1.fa new file mode 100644 index 00000000..2c9fdbe5 --- /dev/null +++ b/src/samtools/samtools_fasta/test_data/a.1.fa @@ -0,0 +1,6 @@ +>a1 +AAAAAAAAAA +>b1 +AAAAAAAAAA +>c1 +AAAAAAAAAA diff --git a/src/samtools/samtools_fasta/test_data/a.2.fa b/src/samtools/samtools_fasta/test_data/a.2.fa new file mode 100644 index 00000000..2c9fdbe5 --- /dev/null +++ b/src/samtools/samtools_fasta/test_data/a.2.fa @@ -0,0 +1,6 @@ +>a1 +AAAAAAAAAA +>b1 +AAAAAAAAAA +>c1 +AAAAAAAAAA diff --git a/src/samtools/samtools_fasta/test_data/a.bam b/src/samtools/samtools_fasta/test_data/a.bam new file mode 100644 index 00000000..dba1268a Binary files /dev/null and b/src/samtools/samtools_fasta/test_data/a.bam differ diff --git a/src/samtools/samtools_fasta/test_data/a.fa b/src/samtools/samtools_fasta/test_data/a.fa new file mode 100644 index 00000000..693cd395 --- /dev/null +++ b/src/samtools/samtools_fasta/test_data/a.fa @@ -0,0 +1,12 @@ +>a1/1 +AAAAAAAAAA +>b1/1 +AAAAAAAAAA +>c1/1 +AAAAAAAAAA +>a1/2 +AAAAAAAAAA +>b1/2 +AAAAAAAAAA +>c1/2 +AAAAAAAAAA diff --git a/src/samtools/samtools_fasta/test_data/a.sam b/src/samtools/samtools_fasta/test_data/a.sam new file mode 100644 index 00000000..aa8c77b3 --- /dev/null +++ b/src/samtools/samtools_fasta/test_data/a.sam @@ -0,0 +1,7 @@ +@SQ SN:xx LN:20 +a1 99 xx 1 1 10M = 11 20 AAAAAAAAAA ********** +b1 99 xx 1 1 10M = 11 20 AAAAAAAAAA ********** +c1 99 xx 1 1 10M = 11 20 AAAAAAAAAA ********** +a1 147 xx 11 1 10M = 1 -20 TTTTTTTTTT ********** +b1 147 xx 11 1 10M = 1 -20 TTTTTTTTTT ********** +c1 147 xx 11 1 10M = 1 -20 TTTTTTTTTT ********** diff --git a/src/samtools/samtools_fasta/test_data/half.fa b/src/samtools/samtools_fasta/test_data/half.fa new file mode 100644 index 00000000..36cd438c --- /dev/null +++ b/src/samtools/samtools_fasta/test_data/half.fa @@ -0,0 +1,6 @@ +>a1/1 +AAAAAAAAAA +>b1/1 +AAAAAAAAAA +>c1/1 
+AAAAAAAAAA diff --git a/src/samtools/samtools_fasta/test_data/script.sh b/src/samtools/samtools_fasta/test_data/script.sh new file mode 100755 index 00000000..b59bc1bd --- /dev/null +++ b/src/samtools/samtools_fasta/test_data/script.sh @@ -0,0 +1,11 @@ +#!/bin/bash + +# download test data from snakemake wrapper +if [ ! -d /tmp/fastq_source ]; then + git clone --depth 1 --single-branch --branch master https://github.com/snakemake/snakemake-wrappers.git /tmp/fastq_source +fi + +cp -r /tmp/fastq_source/bio/samtools/fastx/test/*.sam src/samtools/samtools_fastq/test_data/ +cp -r /tmp/fastq_source/bio/samtools/fastq/interleaved/test/mapped/*.bam src/samtools/samtools_fastq/test_data/ +cp -r /tmp/fastq_source/bio/samtools/fastq/interleaved/test/reads/*.fq src/samtools/samtools_fastq/test_data/ +cp -r /tmp/fastq_source/bio/samtools/fastq/separate/test/reads/*.fq src/samtools/samtools_fastq/test_data/ \ No newline at end of file diff --git a/src/samtools/samtools_fastq/config.vsh.yaml b/src/samtools/samtools_fastq/config.vsh.yaml index 39e926f0..cac7653b 100644 --- a/src/samtools/samtools_fastq/config.vsh.yaml +++ b/src/samtools/samtools_fastq/config.vsh.yaml @@ -56,7 +56,7 @@ argument_groups: type: string description: | Specify a comma-separated list of tags to copy to the FASTQ header line, if they exist. - TAGLIST can be blank or * to indicate all tags should be copied to the output. If using *, + TAGLIST can be blank or `*` to indicate all tags should be copied to the output. If using `*`, be careful to quote it to avoid unwanted shell expansion. - name: --read1 alternatives: -1 @@ -91,35 +91,35 @@ argument_groups: type: integer description: | Only output alignments with all bits set in INT present in the FLAG field. INT can be specified - in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with `0' - (i.e. /^0[0-7]+/). - default: 0 + in hex by beginning with '0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with '0' + (i.e. /^0[0-7]+/). 
Default: `0`. + example: 0 - name: --excl_flags alternatives: -F type: string description: | Do not output alignments with any bits set in INT present in the FLAG field. INT can be specified - in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with `0' + in hex by beginning with '0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with '0' (i.e. /^0[0-7]+/). This defaults to 0x900 representing filtering of secondary and - supplementary alignments. - default: 0x900 + supplementary alignments. Default: `0x900`. + example: "0x900" - name: --incl_flags alternatives: --rf type: string description: | Only output alignments with any bits set in INT present in the FLAG field. INT can be specified - in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/), in octal by beginning with `0' + in hex by beginning with '0x' (i.e. /^0x[0-9A-F]+/), in octal by beginning with '0' (i.e. /^0[0-7]+/), as a decimal number not beginning with '0' or as a comma-separated list of - flag names. - default: 0 + flag names. Default: `0`. + example: 0 - name: --excl_flags_all alternatives: -G type: integer description: | Only EXCLUDE reads with all of the bits set in INT present in the FLAG field. INT can be specified - in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with `0' - (i.e. /^0[0-7]+/). - default: 0 + in hex by beginning with '0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with '0' (i.e. /^0[0-7]+/). + Default: `0`. + example: 0 - name: --aux_tag alternatives: -d type: string @@ -137,12 +137,13 @@ argument_groups: - name: --casava alternatives: -i type: boolean_true - description: add Illumina Casava 1.8 format entry to header (eg 1:N:0:ATCACG) + description: | + Add Illumina Casava 1.8 format entry to header, for example: `1:N:0:ATCACG`. - name: --compression alternatives: -c type: integer description: set compression level when writing gz or bgzf fastq files. 
- default: 0 + example: 0 - name: --index1 alternatives: --i1 type: file @@ -153,20 +154,22 @@ argument_groups: description: write second index reads to FILE. - name: --barcode_tag type: string - description: Auxiliary tag to find index reads in. - default: BC + description: | + Auxiliary tag to find index reads in. Default: `BC`. + example: "BC" - name: --quality_tag type: string - description: Auxiliary tag to find index quality in. - default: QT + description: | + Auxiliary tag to find index quality in. Default: `QT`. + example: QT - name: --index_format type: string description: | string to describe how to parse the barcode and quality tags. For example: - [i14i8]: the first 14 characters are index 1, the next 8 characters are index 2. - [n8i14]: ignore the first 8 characters, and use the next 14 characters for index 1. + * `i14i8`: the first 14 characters are index 1, the next 8 characters are index 2. + * `n8i14`: ignore the first 8 characters, and use the next 14 characters for index 1. If the tag contains a separator, then the numeric part can be replaced with '*' to mean - 'read until the separator or end of tag', for example: [n*i*]. + 'read until the separator or end of tag', for example: `n*i*`. 
 resources: - type: bash_script diff --git a/src/samtools/samtools_fastq/script.sh b/src/samtools/samtools_fastq/script.sh index 367432f9..0cad9cfe 100644 --- a/src/samtools/samtools_fastq/script.sh +++ b/src/samtools/samtools_fastq/script.sh @@ -11,7 +11,14 @@ set -e [[ "$par_copy_tags" == "false" ]] && unset par_copy_tags [[ "$par_casava" == "false" ]] && unset par_casava -samtools fastq \ +if [[ "$meta_name" == "samtools_fasta" ]]; then + subcommand=fasta +elif [[ "$meta_name" == "samtools_fastq" ]]; then + subcommand=fastq +else + echo "Unrecognized component name" && exit 1 +fi +samtools "$subcommand" \ ${par_no_suffix:+-n} \ ${par_suffix:+-N} \ ${par_use_oq:+-O} \ diff --git a/src/umi_tools/umi_tools_dedup/config.vsh.yaml b/src/umi_tools/umi_tools_dedup/config.vsh.yaml new file mode 100644 index 00000000..a02e70a1 --- /dev/null +++ b/src/umi_tools/umi_tools_dedup/config.vsh.yaml @@ -0,0 +1,303 @@ +name: umi_tools_dedup +namespace: umi_tools +description: | + Deduplicate reads based on the mapping co-ordinate and the UMI attached to the read. +keywords: [umi_tools, deduplication, dedup] +links: + homepage: https://umi-tools.readthedocs.io/en/latest/ + documentation: https://umi-tools.readthedocs.io/en/latest/reference/dedup.html + repository: https://github.com/CGATOxford/UMI-tools +references: + doi: 10.1101/gr.209601.116 +license: MIT + +argument_groups: + - name: Inputs + arguments: + - name: --input + alternatives: --stdin + type: file + description: Input BAM or SAM file. Use --in_sam to specify SAM format. + required: true + - name: --in_sam + type: boolean_true + description: | + By default, inputs are assumed to be in BAM format. Use this option to specify the use of SAM + format for input. + - name: --bai + type: file + description: BAM index + - name: --random_seed + type: integer + description: Random seed to initialize number generator with. 
+ + - name: Outputs + arguments: + - name: --output + alternatives: --stdout + type: file + description: Deduplicated BAM file. + required: true + direction: output + - name: --out_sam + type: boolean_true + description: | + By default, outputs are written in BAM format. Use this option to specify the use of SAM format + for output. + - name: --paired + type: boolean_true + description: | + BAM is paired end - output both read pairs. This will also force the use of the template length + to determine reads with the same mapping coordinates. + - name: --output_stats + type: string + description: | + Generate files containing UMI-based deduplication statistics, with this prefix in the file names. + - name: --extract_umi_method + type: string + choices: [read_id, tag, umis] + description: | + Specify the method by which the barcodes were encoded in the read. + The options are: + * read_id (default) + * tag + * umis + example: "read_id" + - name: --umi_tag + type: string + description: | + The tag containing the UMI sequence. This is only required if the extract_umi_method is set to tag. + - name: --umi_separator + type: string + description: | + The separator used to separate the UMI from the read sequence. This is only required if the + extract_umi_method is set to read_id. Default: `_`. + example: '_' + - name: --umi_tag_split + type: string + description: Separate the UMI in the tag by SPLIT and take the first element. + - name: --umi_tag_delimiter + type: string + description: Separate the UMI in the tag by DELIMITER and concatenate the elements. + - name: --cell_tag + type: string + description: | + The tag containing the cell barcode sequence. This is only required if the extract_umi_method + is set to tag. + - name: --cell_tag_split + type: string + description: Separate the cell barcode in the tag by SPLIT and take the first element. + - name: --cell_tag_delimiter + type: string + description: Separate the cell barcode in the tag by DELIMITER and concatenate the elements. 
+ + - name: Grouping Options + arguments: + - name: --method + type: string + choices: [unique, percentile, cluster, adjacency, directional] + description: | + The method to use for grouping reads. + The options are: + * unique + * percentile + * cluster + * adjacency + * directional (default) + example: "directional" + - name: --edit_distance_threshold + type: integer + description: | + For the adjacency and cluster methods the threshold for the edit distance to connect two + UMIs in the network can be increased. The default value of 1 works best unless the UMI is + very long (>14bp). Default: `1`. + example: 1 + - name: --spliced_is_unique + type: boolean_true + description: | + Causes two reads that start in the same position on the same strand and having the same UMI + to be considered unique if one is spliced and the other is not. (Uses the 'N' cigar operation + to test for splicing). + - name: --soft_clip_threshold + type: integer + description: | + Mappers that soft clip will sometimes do so rather than mapping a spliced read if there is only + a small overhang over the exon junction. By setting this option, you can treat reads with at + least this many bases soft-clipped at the 3' end as spliced. Default: `4`. + example: 4 + - name: --multimapping_detection_method + type: string + description: | + If the sam/bam contains tags to identify multimapping reads, you can specify which tag to use when + selecting the best read at a given locus. Supported tags are `NH`, `X0` and `XT`. If not specified, the read + with the highest mapping quality will be selected. + - name: --read_length + type: boolean_true + description: Use the read length as a criterion when deduplicating, e.g. for sRNA-Seq. + + - name: Single-cell RNA-Seq Options + arguments: + - name: --per_gene + type: boolean_true + description: | + Reads will be grouped together if they have the same gene. This is useful if your library prep + generates PCR duplicates with non-identical alignment positions such as CEL-Seq. 
Note this option + is hardcoded to be on with the count command. I.e. counting is always performed per-gene. Must be + combined with either --gene_tag or --per_contig option. + - name: --gene_tag + type: string + description: | + Deduplicate per gene. The gene information is encoded in the bam read tag specified. + - name: --assigned_status_tag + type: string + description: | + BAM tag which describes whether a read is assigned to a gene. Defaults to the same value as given + for --gene_tag. + - name: --skip_tags_regex + type: string + description: | + Use in conjunction with the --assigned_status_tag option to skip any reads where the tag matches + this regex. Default ("^[__|Unassigned]") matches anything which starts with "__" or "Unassigned". + - name: --per_contig + type: boolean_true + description: | + Deduplicate per contig (field 3 in BAM; RNAME). All reads with the same contig will be considered to + have the same alignment position. This is useful if you have aligned to a reference transcriptome + with one transcript per gene. If you have aligned to a transcriptome with more than one transcript + per gene, you can supply a map between transcripts and genes using the --gene_transcript_map option. + - name: --gene_transcript_map + type: file + description: | + A file containing a mapping between gene names and transcript names. The file should be tab + separated with the gene name in the first column and the transcript name in the second column. + - name: --per_cell + type: boolean_true + description: | + Reads will only be grouped together if they have the same cell barcode. Can be combined with + --per_gene. + + - name: SAM/BAM Options + arguments: + - name: --mapping_quality + type: integer + description: | + Minimum mapping quality (MAPQ) for a read to be retained. Default: `0`. + example: 0 + - name: --unmapped_reads + type: string + description: | + How unmapped reads should be handled. + The options are: + * "discard": Discard all unmapped reads. 
(default) + * "use": If read2 is unmapped, deduplicate using read1 only. Requires --paired. + * "output": Output unmapped reads/read pairs without UMI grouping/deduplication. Only available in umi_tools group. + example: "discard" + - name: --chimeric_pairs + type: string + choices: [discard, use, output] + description: | + How chimeric pairs should be handled. + The options are: + * "discard": Discard all chimeric read pairs. + * "use": Deduplicate using read1 only. (default) + * "output": Output chimeric pairs without UMI grouping/deduplication. Only available in + umi_tools group. + example: "use" + - name: --unpaired_reads + type: string + choices: [discard, use, output] + description: | + How unpaired reads should be handled. + The options are: + * "discard": Discard all unmapped reads. + * "use": If read2 is unmapped, deduplicate using read1 only. Requires --paired. (default) + * "output": Output unmapped reads/read pairs without UMI grouping/deduplication. Only available + in umi_tools group. + example: "use" + - name: --ignore_umi + type: boolean_true + description: Ignore the UMI and group reads using mapping coordinates only. + - name: --subset + type: double + description: | + Only consider a fraction of the reads, chosen at random. This is useful for doing saturation + analyses. + - name: --chrom + type: string + description: Only consider a single chromosome. This is useful for debugging/testing purposes. + + - name: Group/Dedup Options + arguments: + - name: --no_sort_output + type: boolean_true + description: | + By default, output is sorted. This involves the use of a temporary unsorted file (saved in + --temp_dir). Use this option to turn off sorting. + - name: --buffer_whole_contig + type: boolean_true + description: | + Forces dedup to parse an entire contig before yielding any reads for deduplication. 
This is the + only way to absolutely guarantee that all reads with the same start position are grouped together + for deduplication since dedup uses the start position of the read, not the alignment coordinate on + which the reads are sorted. However, by default, dedup reads for another 1000bp before outputting + read groups which will avoid any reads being missed with short read sequencing (<1000bp). + + - name: Common Options + arguments: + - name: --log + alternatives: -L + type: file + description: File with logging information. + - name: --log2stderr + type: boolean_true + description: Send logging information to stderr. + - name: --verbose + alternatives: -v + type: integer + description: | + Log level. The higher, the more output. Default: `0`. + example: 0 + - name: --error + alternatives: -E + type: file + description: File with error information. + - name: --temp_dir + type: string + description: | + Directory for temporary files. If not set, the bash environmental variable TMPDIR is used. + - name: --compresslevel + type: integer + description: | + Level of Gzip compression to use. Default=6 matches GNU gzip rather than python gzip default. + Default: `6`. + example: 6 + - name: --timeit + type: file + description: Store timing information in file. + - name: --timeit_name + type: string + description: | + Name in timing file for this class of jobs. Default: `all`. + example: "all" + - name: --timeit_header + type: string + description: Add header for timing information. 
+ +resources: + - type: bash_script + path: script.sh +test_resources: + - type: bash_script + path: test.sh + - type: file + path: test_data +engines: + - type: docker + image: quay.io/biocontainers/umi_tools:1.1.5--py39hf95cd2a_1 + setup: + - type: docker + run: | + umi_tools -v | sed 's/ version//g' > /var/software_versions.txt +runners: +- type: executable +- type: nextflow \ No newline at end of file diff --git a/src/umi_tools/umi_tools_dedup/help.txt b/src/umi_tools/umi_tools_dedup/help.txt new file mode 100644 index 00000000..87baf322 --- /dev/null +++ b/src/umi_tools/umi_tools_dedup/help.txt @@ -0,0 +1,113 @@ +''' +Generated from the following UMI-tools documentation: + https://umi-tools.readthedocs.io/en/latest/common_options.html#common-options + https://umi-tools.readthedocs.io/en/latest/reference/dedup.html +''' + + +dedup - Deduplicate reads using UMI and mapping coordinates + +Usage: umi_tools dedup [OPTIONS] [--stdin=IN_BAM] [--stdout=OUT_BAM] + + note: If --stdout is omitted, output is written to standard out. To + generate a valid BAM file on standard out, please + redirect log with --log=LOGFILE or --log2stderr + +Common UMI-tools Options: + + -S, --stdout File where output is to go [default = stdout]. + -L, --log File with logging information [default = stdout]. + --log2stderr Send logging information to stderr [default = False]. + -v, --verbose Log level. The higher, the more output [default = 1]. + -E, --error File with error information [default = stderr]. + --temp-dir Directory for temporary files. If not set, the bash environmental variable TMPDIR is used [default = None]. + --compresslevel Level of Gzip compression to use. Default=6 matches GNU gzip rather than python gzip default (which is 9) + + profiling and debugging options: + --timeit Store timing information in file [default=none]. + --timeit-name Name in timing file for this class of jobs [default=all]. + --timeit-header Add header for timing information [default=none]. 
+ --random-seed Random seed to initialize number generator with [default=none]. + +Dedup Options: + --output-stats= One can use the edit distance between UMIs at the same position as a quality control for the + deduplication process by comparing with a null expectation of random sampling. For the random + sampling, the observed frequency of UMIs is used to more reasonably model the null expectation. + Use this option to generate stats outfiles called: + [PREFIX]_stats_edit_distance.tsv + Reports the (binned) average edit distance between the UMIs at each position. + In addition, this option will trigger reporting of further summary statistics for the UMIs which + may be informative for selecting the optimal deduplication method or debugging. + Each unique UMI sequence may be observed [0-many] times at multiple positions in the BAM. The + following files report the distribution for the frequencies of each UMI. + [PREFIX]_stats_per_umi_per_position.tsv + Tabulates the counts for unique combinations of UMI and position. + [PREFIX]_stats_per_umi_per.tsv + The _stats_per_umi_per.tsv table provides UMI-level summary statistics. + --extract-umi-method= How are the barcodes encoded in the read? + Options are: read_id (default), tag, umis + --umi-separator= Separator between read id and UMI. See --extract-umi-method above. Default=_ + --umi-tag= Tag which contains UMI. See --extract-umi-method above + --umi-tag-split= Separate the UMI in tag by SPLIT and take the first element + --umi-tag-delimiter= Separate the UMI in tag by DELIMITER and concatenate the elements + --cell-tag= Tag which contains cell barcode. See --extract-umi-method above + --cell-tag-split= Separate the cell barcode in tag by SPLIT and take the first element + --cell-tag-delimiter= Separate the cell barcode in tag by DELIMITER and concatenate the elements + --method= What method to use to identify groups of reads with the same (or similar) UMI(s)? 
+ All methods start by identifying the reads with the same mapping position. + The simplest methods, unique and percentile, group reads with the exact same UMI. + The network-based methods, cluster, adjacency and directional, build networks where + nodes are UMIs and edges connect UMIs with an edit distance <= threshold (usually 1). + The groups of reads are then defined from the network in a method-specific manner. + For all the network-based methods, each read group is equivalent to one read count for the gene. + --edit-distance-threshold= For the adjacency and cluster methods the threshold for the edit distance to connect + two UMIs in the network can be increased. The default value of 1 works best unless + the UMI is very long (>14bp). + --spliced-is-unique Causes two reads that start in the same position on the same strand and having the + same UMI to be considered unique if one is spliced and the other is not. + (Uses the 'N' cigar operation to test for splicing). + --soft-clip-threshold= Mappers that soft clip will sometimes do so rather than mapping a spliced read if + there is only a small overhang over the exon junction. By setting this option, you + can treat reads with at least this many bases soft-clipped at the 3' end as spliced. + Default=4. + --multimapping-detection-method= If the sam/bam contains tags to identify multimapping reads, you can specify + the tag to use when selecting the best read at a given locus. Supported tags are "NH", + "X0" and "XT". If not specified, the read with the highest mapping quality will be selected. + --read-length Use the read length as a criterion when deduping, e.g. for sRNA-Seq. + --per-gene Reads will be grouped together if they have the same gene. This is useful if your + library prep generates PCR duplicates with non-identical alignment positions such as CEL-Seq. + Note this option is hardcoded to be on with the count command. I.e. counting is always + performed per-gene. 
Must be combined with either --gene-tag or --per-contig option. + --gene-tag= Deduplicate per gene. The gene information is encoded in the bam read tag specified + --assigned-status-tag= BAM tag which describes whether a read is assigned to a gene. Defaults to the same value + as given for --gene-tag + --skip-tags-regex= Use in conjunction with the --assigned-status-tag option to skip any reads where the + tag matches this regex. Default ("^[__|Unassigned]") matches anything which starts with "__" + or "Unassigned". + --per-contig Deduplicate per contig (field 3 in BAM; RNAME). All reads with the same contig will be + considered to have the same alignment position. This is useful if you have aligned to a + reference transcriptome with one transcript per gene. If you have aligned to a transcriptome + with more than one transcript per gene, you can supply a map between transcripts and genes + using the --gene-transcript-map option + --gene-transcript-map= File mapping genes to transcripts (tab separated) + --per-cell Reads will only be grouped together if they have the same cell barcode. Can be combined with --per-gene. + --mapping-quality= Minimum mapping quality (MAPQ) for a read to be retained. Default is 0. + --unmapped-reads=