Skip to content

Latest commit

 

History

History
452 lines (348 loc) · 17.3 KB

2.Atria_trimming_methods_and_usages.md

File metadata and controls

452 lines (348 loc) · 17.3 KB

Atria v4.1.0

An ultra-fast and accurate adapter and quality trimming software designed for paired-end sequencing data.

If you use Atria, please cite

Jiacheng Chuan, Aiguo Zhou, Lawrence Richard Hale, Miao He, Xiang Li, Atria: an ultra-fast and accurate trimmer for adapter and quality trimming, Gigabyte, 1, 2021 https://doi.org/10.46471/gigabyte.31

Github: https://github.com/cihga39871/Atria

Usage

Try atria -h or atria --help for more information.

Input and Output

The input files should be paired-end FastQ(.gz|.bz2) files (in the same order), or single-end fastqs:

  1. Read 1 files: -r XXXX_R1.fastq YYYY_R1.fastq.gz ...

  2. Read 2 files (optional): -R XXXX_R2.fastq YYYY_R2.fastq.gz ...

Output all files to a directory: -o PATH or --output-dir PATH. Default is the current directory.

Atria skips completed analysis by default. Use -f or --force to disable the feature.

Order of Processing

Order of trimming and filtration processing methods. Unlisted process will not be done. See default for process names.

  • --order PROCESS... or -O PROCESS...: default:

    • CheckIdentifier
    • PolyG
    • PolyT
    • PolyA
    • PolyC
    • LengthFilter
    • AdapterTrim
    • HardClipEndR1
    • HardClipEndR2
    • HardClipAfterR1
    • HardClipAfterR2
    • HardClipFrontR1
    • HardClipFrontR2
    • QualityTrim
    • TailNTrim
    • MaxNFilter
    • LengthFilter
    • ComplexityFilter
    • PCRDedup

Poly X Tail Trimming (PolyG / PolyT / PolyA / PolyC)

Remove poly-X tails. Suggest to enable --polyG for Illumina NextSeq/NovaSeq data.

  • Enable: --polyG, --polyT, --polyA, and/or --polyC (default: disabled)

  • Trim poly X tail if length > INT: --poly-length 10

Adapter Trimming (AdapterTrim)

Multiple adapter pairs are allowed from Atria v4.

  • Read 1 adapter(s) (after DNA insert): -a SEQ... or --adapter1 SEQ... (default: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA)

  • Read 2 adapter(s) (after DNA insert): -A SEQ... or --adapter2 SEQ... (default: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT) (if paired-end)

  • Disable: --no-adapter-trim

  • --detect-adapter if you do not know adapter sequences.

    Atria does not trim detected adapters automatically, please check results first.

Paired-end Consensus Calling

The overlapped regions of read pairs are checked and corrected. It is available only when input files are paired-end and Adapter Trimming is on.

  • Disable: --no-consensus

Hard Clip End (HardClipEndR1 / HardClipEndR2)

Remove the last INT bases from 3' end (tail).

  • Number of bases to keep in read 1: -z INT or --clip3-r1 INT (default: disabled)

  • Number of bases to keep in read 2: -Z INT or --clip3-r2 INT (default: disabled)

Hard Clip After N Bases (HardClipAfterR1 / HardClipAfterR2)

Resize reads to a fixed length by discarding extra bases in 3' end (tail).

  • Number of bases to keep in read 1: -b INT or --clip-after-r1 INT (default: disabled)

  • Number of bases to keep in read 2: -B INT or --clip-after-r2 INT (default: disabled)

Hard Clip Front (HardClipFrontR1 / HardClipFrontR2)

Remove the first INT bases from 5' end (front).

  • Number of bases to remove in read 1: -e INT or --clip5-r1 INT (default: disabled)

  • Number of bases to remove in read 2: -E INT or --clip5-r2 INT (default: disabled)

Quality Trimming (QualityTrim)

Trim low-quality tails. Trimming read tails when the average quality of bases in a sliding window is low.

  • Average quality threshold: -q 20 or --quality-score 20 (default: 20)

  • Sliding window length: --quality-kmer 5 (default: 5)

  • FastQ quality format: --quality-format Illumina1.8, or --quality-format 33 (default: 33, ie. Illumina1.8)

  • Disable: --no-quality-trim

Tail N Trimming (TailNTrim)

Trim N tails.

  • Disable: --no-tail-n-trim

Max N Filtration (MaxNFilter)

Discard a read pair if the number of N in one read is greater than a certain amount. N tails are ignored if Tail N Trimming is on.

  • Number of N allowed in each read: -n 15 or --max-n 15 (default: 15)

  • Disable: -n -1 or --max-n -1

Length Filtration (LengthFilter)

Filter read pair length in a range.

  • Read length range: --length-range 30:999999 (default: 30:999999)

  • Disable: --no-length-filtration

Remove PCR duplicates

Only write unique sequences (dedup). Paired reads are only considered identical if both reads are duplicates to both reads in a previous pair.

Dedup uses LARGE memory to store all unique sequences.

  • Enable: --pcr-dedup.

  • Also write a count table of PCR duplicates: --pcr-dedup-count.

Complexity Filtration (ComplexityFilter)

Discard reads with low complexity. Complexity is the percentage of base that is different from its next base.

  • Enable: --enable-complexity-filtration (default: disabled)

  • Complexity threshold: --min-complexity 0.3 (default: 0.3)

Parallel computing

  • Use INT threads: -t 8 or --threads 8 (default: 8)

  • If memory is not sufficient, use --log2-chunk-size INT where INT is from 23 to 25. Memory usage reduces exponentially as it decreases.

Try atria -h or atria --help for more information.

Examples

Print the help page:

atria --help

Trim R1.fastq(.gz) and R2.fastq(.gz) with default settings:

atria --read1 R1.fastq(.gz) --read2 R2.fastq(.gz) --adapter1 SEQ1 --adapter2 SEQ2

Trim R1.fastq(.gz) and R2.fastq(.gz) and save output files to directory:

atria --read1 R1.fastq(.gz) --read2 R2.fastq(.gz) --adapter1 SEQ1 --adapter2 SEQ2 \
	--output-dir directory

Trim multiple paired samples in parallel mode (using 8 threads):

atria --read1 S*R1.fastq.gz --read2 S*R2.fastq.gz --adapter1 SEQ1 --adapter2 SEQ2 \
	--threads 8

Only perform adapter trimming, skip other trimming and filtration methods:

atria --read1 R1.fastq(.gz) --read2 R2.fastq(.gz) --adapter1 SEQ1 --adapter2 SEQ2 \
	--no-tail-n-trim --max-n=-1 --no-quality-trim --no-length-filtration

Skip quality trimming:

atria --read1 R1.fastq(.gz) --read2 R2.fastq(.gz) --adapter1 SEQ1 --adapter2 SEQ2 \
	--no-quality-trim

Dedup only (remove PCR duplicates):

atria --read1 R1.fastq(.gz) --read2 R2.fastq(.gz) \
	-O PCRDedup --pcr-dedup --pcr-dedup-count

Argument details

usage: atria [-t INT] [--log2-chunk-size INDEX] [-f]
             -r R1-FASTQ [R1-FASTQ...] [-R [R2-FASTQ...]] [-o PATH]
             [-g AUTO|NO|GZ|GZIP|BZ2|BZIP2] [--check-identifier]
             [--detect-adapter] [--pcr-dedup] [--pcr-dedup-count]
             [-O PROCESS [PROCESS...]] [--polyG] [--polyT] [--polyA]
             [--polyC] [--poly-length POLY-LENGTH]
             [--poly-mismatch-per-16mer INT] [--no-adapter-trim]
             [-a SEQ [SEQ...]] [-A SEQ [SEQ...]] [-T INT] [-d INT]
             [-D INT] [-s INT] [--trim-score-pe FLOAT]
             [--trim-score-se FLOAT] [-l INT] [--stats]
             [--no-consensus] [--kmer-tolerance-consensus INT]
             [--min-ratio-mismatch FLOAT] [--overlap-score FLOAT]
             [--prob-diff FLOAT] [-z INT] [-Z INT] [-e INT] [-E INT]
             [-b INT] [-B INT] [--no-quality-trim] [-q INT]
             [--quality-kmer INT] [--quality-format FORMAT]
             [--no-tail-n-trim] [-n INT] [--no-length-filtration]
             [--length-range INT:INT] [--enable-complexity-filtration]
             [--min-complexity FLOAT] [-p INT] [-C INT] [-c INT]
             [--version] [-h]

Atria v4.1.0

optional arguments:
  -t, --threads INT     use INT threads to process one sample
                        (multi-threading parallel). (type: Int64,
                        default: 8)
  --log2-chunk-size INDEX
                        read at most 2^INDEX bits each time. Suggest
                        to process 200,000 reads each time. Reduce
                        INDEX to lower the memory usage. (type: Int64,
                        default: 26)
  -f, --force           force to analyze all samples; not skip
                        completed ones
  --version             show version information and exit
  -h, --help            show this help message and exit

input/output: input read 1 and read 2 should be in the same order:
  -r, --read1 R1-FASTQ [R1-FASTQ...]
                        input read 1 fastq file(s), or single-end
                        fastq files
  -R, --read2 [R2-FASTQ...]
                        input read 2 fastq file(s) (paired with
                        R1-FASTQ)
  -o, --output-dir PATH
                        store output files and stats to PATH (default:
                                                "/data/jc/analysis/archive/202304-possible-pw-ngs/0.raw_data")
  -g, --compress AUTO|NO|GZ|GZIP|BZ2|BZIP2
                        compression methods for output files (AUTO:
                        same as input, NO: no compression, GZ|GZIP:
                        gzip with `pigz`, BZ2|BZIP2: bzip2 with
                        `pbzip2`) (default: "AUTO")
  --check-identifier    check whether the identifiers of r1 and r2 are
                        the same
  --detect-adapter      detect possible adapters for each sample only

remove duplicate (large RAM required):
  --pcr-dedup           enable pcr dedup: remove PCR duplicates.
                        Paired reads are only considered identical if
                        both reads are duplicates to both reads in a
                        previous pair.
  --pcr-dedup-count     if --pcr-dedup, write a table of duplicate
                        count

processing order:
  -O, --order PROCESS [PROCESS...]
                        order of trimming and filtration processing
                        methods. Unlisted process will not be done.
                        See epilog for process names (default:
                        ["DefaultOrder"])

poly X tail trimming:
  --polyG               enable trimming poly G tails
  --polyT               enable trimming poly T tails
  --polyA               enable trimming poly A tails
  --polyC               enable trimming poly C tails
  --poly-length POLY-LENGTH
                        the minimum length of poly X (type: Int64,
                        default: 10)
  --poly-mismatch-per-16mer INT
                        the number of mismatch allowed in 16 mer poly
                        X (type: Int64, default: 2)

adapter trimming:
  --no-adapter-trim     disable adapter and pair-end trimming
  -a, --adapter1 SEQ [SEQ...]
                        read 1 adapter(s) appended to insert DNA at 3'
                        end (default:
                        ["AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"])
  -A, --adapter2 SEQ [SEQ...]
                        read 2 adapter(s) appended to insert DNA at 3'
                        end (default:
                        ["AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"])
  -T, --kmer-tolerance INT
                        # of mismatch allowed in 16-mers adapter and
                        pair-end matching (type: Int64, default: 2)
  -d, --pe-adapter-diff INT
                        (FOR PAIRED END) number of bases allowed when
                        disconcordance found between adapter and
                        pair-end search (type: Int64, default: 0)
  -D, --r1-r2-diff INT  (FOR PAIRED END) number of bases allowed when
                        the insert sizes of r1 and r2 are different
                        (type: Int64, default: 0)
  -s, --kmer-n-match INT
                        (FOR PAIRED END) if n base matched [0-16] is
                        less than INT, loosen matches will be made
                        based on the match with the highest n base
                        match (type: Int64, default: 9)
  --trim-score-pe FLOAT
                        (FOR PAIRED END) if final score [0-32] of read
                        pair is greater than FLOAT, the reads will be
                        trimmed. (type: Float64, default: 10.0)
  --trim-score-se FLOAT
                        (FOR SINGLE END) if final score [0-16] of read
                        is greater than FLOAT, the reads will be
                        trimmed. (type: Float64, default: 10.0)
  -l, --tail-length INT
                        (FOR PAIRED END) if the adapter is in the tail
                        region, and insert size of pe match is smaller
                        than this region, do not trim the read. (type:
                        Int64, default: 12)
  --stats               (DEV ONLY) write stats to description lines of
                        r2 reads.

consensus/merging in adapter trimming (FOR PAIRED END):
  --no-consensus        disable generating consensus paired reads. If
                        adapter trimming is disabled, consensus
                        calling is not performed even the flag is not
                        set.
  --kmer-tolerance-consensus INT
                        # of mismatch allowed in 16-mers matching in
                        consensus calling (type: Int64, default: 10)
  --min-ratio-mismatch FLOAT
                        if the ratio of mismatch of the overlapped
                        region is less than FLOAT, skip consensus
                        calling. (type: Float64, default: 0.28)
  --overlap-score FLOAT
                        if no adapter was found, scan the tails of the
                        paired reads. Then, if the maximum score of
                        the overlapped 16-mers are less than FLOAT,
                        skip consensus calling for the read pair. If
                        adapters were found, this step is ignored.
                        (type: Float64, default: 0.0)
  --prob-diff FLOAT     when doing consensus calling, if the bases
                        were not complementary, the base with the
                        higher quality probability is selected unless
                        the quality probability difference are less
                        than FLOAT (type: Float64, default: 0.0)

hard clipping: trim a fixed length:
  -z, --clip3-r1 INT    remove the last INT bases from 3' end of R1.
                        (type: Int64, default: 0)
  -Z, --clip3-r2 INT    remove the last INT bases from 3' end of R2.
                        (type: Int64, default: 0)
  -b, --clip-after-r1 INT
                        hard clip the 3' tails of R1 to contain only
                        INT bases. 0 to disable. (type: Int64,
                        default: 0)
  -B, --clip-after-r2 INT
                        hard clip the 3' tails of R2 to contain only
                        INT bases. 0 to disable. (type: Int64,
                        default: 0)
  -e, --clip5-r1 INT    remove the first INT bases from 5' front of
                        R1. (type: Int64, default: 0)
  -E, --clip5-r2 INT    remove the first INT bases from 5' front of
                        R2. (type: Int64, default: 0)

quality trimming: trim the tail when the average quality of bases in
a sliding window is low:
  --no-quality-trim     skip quality trimming
  -q, --quality-score INT
                        threshold of quality score; 0 means turn off
                        quality trimming (type: Int64, default: 20)
  --quality-kmer INT    trim the tail once found the average quality
                        of bases in a sliding window is low (type:
                        Int64, default: 5)
  --quality-format FORMAT
                        the format of the quality score (Illumina1.3,
                        Illumina1.8, Sanger, Illumina1.5, Solexa); or
                        the ASCII number when quality score == 0
                        (default: "33")

N trimming:
  --no-tail-n-trim      disable removing NNNNN tail.
  -n, --max-n INT       # N allowed in each read; N tails not included
                        if --no-tail-n-trim; INT<0 to disable (type:
                        Int64, default: 15)

length filtration:
  --no-length-filtration
                        disable length filtration
  --length-range INT:INT
                        length range of good reads; format is min:max
                        (default: "30:999999")

complexity filtration:
  --enable-complexity-filtration
                        enable complexity filtration
  --min-complexity FLOAT
                        complexity threshold (type: Float64, default:
                        0.3)

legacy arguments:
  -p, --procs INT       ignored (multi-proc is disabled) (default:
                        "1")
  -C, --clip-after INT  removed (use --clip-after-r1 and
                        --clip-after-r2 instead) (type: Int64,
                        default: 0)
  -c, --clip5 INT       removed (use --clip5-r1 and --clip5-r2
                        instead) (type: Int64, default: 0)

----------

Process names for --order | -O:
    DefaultOrder    = [CheckIdentifier, PolyG, PolyT, PolyA, PolyC, LengthFilter, AdapterTrim, HardClipEndR1, HardClipEndR2, HardClipAfterR1, HardClipAfterR2, HardClipFrontR1, HardClipFrontR2, QualityTrim, TailNTrim, MaxNFilter, LengthFilter, ComplexityFilter, PCRDedup],
    CheckIdentifier ,
    PolyX           = [PolyG, PolyT, PolyA, PolyC],
    PolyG           ,
    PolyT           ,
    PolyA           ,
    PolyC           ,
    AdapterTrim     ,
    HardClipEnd     = [HardClipEndR1, HardClipEndR2],
    HardClipEndR1   ,
    HardClipEndR2   ,
    HardClipAfter   = [HardClipAfterR1, HardClipAfterR2],
    HardClipAfterR1 ,
    HardClipAfterR2 ,
    HardClipFront   = [HardClipFrontR1, HardClipFrontR2],
    HardClipFrontR1 ,
    HardClipFrontR2 ,
    QualityTrim     ,
    TailNTrim       ,
    MaxNFilter      ,
    LengthFilter    ,
    ComplexityFilter,
    PCRDedup        .