An ultra-fast and accurate adapter and quality trimming software designed for paired-end sequencing data.
If you use Atria, please cite
Jiacheng Chuan, Aiguo Zhou, Lawrence Richard Hale, Miao He, Xiang Li, Atria: an ultra-fast and accurate trimmer for adapter and quality trimming, Gigabyte, 1, 2021 https://doi.org/10.46471/gigabyte.31
Github: https://github.com/cihga39871/Atria
Try atria -h
or atria --help
for more information.
The input files should be paired-end FastQ(.gz|.bz2) files (in the same order), or single-end fastqs:
-
Read 1 files:
-r XXXX_R1.fastq YYYY_R1.fastq.gz ...
-
Read 2 files (optional):
-R XXXX_R2.fastq YYYY_R2.fastq.gz ...
Output all files to a directory: -o PATH
or --output-dir PATH
. Default is the current directory.
Atria skips completed analysis by default. Use -f
or --force
to disable the feature.
Order of trimming and filtration processing methods. Unlisted process will not be done. See default for process names.
-
--order PROCESS...
or-O PROCESS...
: default:- CheckIdentifier
- PolyG
- PolyT
- PolyA
- PolyC
- LengthFilter
- AdapterTrim
- HardClipEndR1
- HardClipEndR2
- HardClipAfterR1
- HardClipAfterR2
- HardClipFrontR1
- HardClipFrontR2
- QualityTrim
- TailNTrim
- MaxNFilter
- LengthFilter
- ComplexityFilter
- PCRDedup
Remove poly-X tails. Suggest to enable --polyG
for Illumina NextSeq/NovaSeq data.
-
Enable:
--polyG
,--polyT
,--polyA
, and/or--polyC
(default: disabled) -
Trim poly X tail if length > INT:
--poly-length 10
Multiple adapter pairs are allowed from Atria v4.
-
Read 1 adapter(s) (after DNA insert):
-a SEQ...
or--adapter1 SEQ...
(default: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA) -
Read 2 adapter(s) (after DNA insert):
-A SEQ...
or--adapter2 SEQ...
(default: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT) (if paired-end) -
Disable:
--no-adapter-trim
-
--detect-adapter
if you do not know adapter sequences.Atria does not trim detected adapters automatically, please check results first.
The overlapped regions of read pairs are checked and corrected. It is available only when input files are paired-end and Adapter Trimming is on.
- Disable:
--no-consensus
Remove the last INT bases from 3' end (tail).
-
Number of bases to keep in read 1:
-z INT
or--clip3-r1 INT
(default: disabled) -
Number of bases to keep in read 2:
-Z INT
or--clip3-r2 INT
(default: disabled)
Resize reads to a fixed length by discarding extra bases in 3' end (tail).
-
Number of bases to keep in read 1:
-b INT
or--clip-after-r1 INT
(default: disabled) -
Number of bases to keep in read 2:
-B INT
or--clip-after-r2 INT
(default: disabled)
Remove the first INT bases from 5' end (front).
-
Number of bases to remove in read 1:
-e INT
or--clip5-r1 INT
(default: disabled) -
Number of bases to remove in read 2:
-E INT
or--clip5-r2 INT
(default: disabled)
Trim low-quality tails. Trimming read tails when the average quality of bases in a sliding window is low.
-
Average quality threshold:
-q 20
or--quality-score 20
(default: 20) -
Sliding window length:
--quality-kmer 5
(default: 5) -
FastQ quality format:
--quality-format Illumina1.8
, or--quality-format 33
(default: 33, ie. Illumina1.8) -
Disable:
--no-quality-trim
Trim N tails.
- Disable:
--no-tail-n-trim
Discard a read pair if the number of N in one read is greater than a certain amount. N tails are ignored if Tail N Trimming is on.
-
Number of N allowed in each read:
-n 15
or--max-n 15
(default: 15) -
Disable:
-n -1
or--max-n -1
Filter read pair length in a range.
-
Read length range:
--length-range 30:999999
(default: 30:999999) -
Disable:
--no-length-filtration
Only write unique sequences (dedup). Paired reads are only considered identical if both reads are duplicates to both reads in a previous pair.
Dedup uses LARGE memory to store all unique sequences.
-
Enable:
--pcr-dedup
. -
Also write a count table of PCR duplicates:
--pcr-dedup-count
.
Discard reads with low complexity. Complexity is the percentage of base that is different from its next base.
-
Enable:
--enable-complexity-filtration
(default: disabled) -
Complexity threshold:
--min-complexity 0.3
(default: 0.3)
-
Use INT threads:
-t 8
or--threads 8
(default: 8) -
If memory is not sufficient, use
--log2-chunk-size INT
where INT is from 23 to 25. Memory usage reduces exponentially as it decreases.
Try atria -h
or atria --help
for more information.
Print the help page:
atria --help
Trim R1.fastq(.gz)
and R2.fastq(.gz)
with default settings:
atria --read1 R1.fastq(.gz) --read2 R2.fastq(.gz) --adapter1 SEQ1 --adapter2 SEQ2
Trim R1.fastq(.gz)
and R2.fastq(.gz)
and save output files to directory
:
atria --read1 R1.fastq(.gz) --read2 R2.fastq(.gz) --adapter1 SEQ1 --adapter2 SEQ2 \
--output-dir directory
Trim multiple paired samples in parallel mode (using 8 threads):
atria --read1 S*R1.fastq.gz --read2 S*R2.fastq.gz --adapter1 SEQ1 --adapter2 SEQ2 \
--threads 8
Only perform adapter trimming, skip other trimming and filtration methods:
atria --read1 R1.fastq(.gz) --read2 R2.fastq(.gz) --adapter1 SEQ1 --adapter2 SEQ2 \
--no-tail-n-trim --max-n=-1 --no-quality-trim --no-length-filtration
Skip quality trimming:
atria --read1 R1.fastq(.gz) --read2 R2.fastq(.gz) --adapter1 SEQ1 --adapter2 SEQ2 \
--no-quality-trim
Dedup only (remove PCR duplicates):
atria --read1 R1.fastq(.gz) --read2 R2.fastq(.gz) \
-O PCRDedup --pcr-dedup --pcr-dedup-count
usage: atria [-t INT] [--log2-chunk-size INDEX] [-f]
-r R1-FASTQ [R1-FASTQ...] [-R [R2-FASTQ...]] [-o PATH]
[-g AUTO|NO|GZ|GZIP|BZ2|BZIP2] [--check-identifier]
[--detect-adapter] [--pcr-dedup] [--pcr-dedup-count]
[-O PROCESS [PROCESS...]] [--polyG] [--polyT] [--polyA]
[--polyC] [--poly-length POLY-LENGTH]
[--poly-mismatch-per-16mer INT] [--no-adapter-trim]
[-a SEQ [SEQ...]] [-A SEQ [SEQ...]] [-T INT] [-d INT]
[-D INT] [-s INT] [--trim-score-pe FLOAT]
[--trim-score-se FLOAT] [-l INT] [--stats]
[--no-consensus] [--kmer-tolerance-consensus INT]
[--min-ratio-mismatch FLOAT] [--overlap-score FLOAT]
[--prob-diff FLOAT] [-z INT] [-Z INT] [-e INT] [-E INT]
[-b INT] [-B INT] [--no-quality-trim] [-q INT]
[--quality-kmer INT] [--quality-format FORMAT]
[--no-tail-n-trim] [-n INT] [--no-length-filtration]
[--length-range INT:INT] [--enable-complexity-filtration]
[--min-complexity FLOAT] [-p INT] [-C INT] [-c INT]
[--version] [-h]
Atria v4.1.0
optional arguments:
-t, --threads INT use INT threads to process one sample
(multi-threading parallel). (type: Int64,
default: 8)
--log2-chunk-size INDEX
read at most 2^INDEX bits each time. Suggest
to process 200,000 reads each time. Reduce
INDEX to lower the memory usage. (type: Int64,
default: 26)
-f, --force force to analyze all samples; not skip
completed ones
--version show version information and exit
-h, --help show this help message and exit
input/output: input read 1 and read 2 should be in the same order:
-r, --read1 R1-FASTQ [R1-FASTQ...]
input read 1 fastq file(s), or single-end
fastq files
-R, --read2 [R2-FASTQ...]
input read 2 fastq file(s) (paired with
R1-FASTQ)
-o, --output-dir PATH
store output files and stats to PATH (default:
"/data/jc/analysis/archive/202304-possible-pw-ngs/0.raw_data")
-g, --compress AUTO|NO|GZ|GZIP|BZ2|BZIP2
compression methods for output files (AUTO:
same as input, NO: no compression, GZ|GZIP:
gzip with `pigz`, BZ2|BZIP2: bzip2 with
`pbzip2`) (default: "AUTO")
--check-identifier check whether the identifiers of r1 and r2 are
the same
--detect-adapter detect possible adapters for each sample only
remove duplicate (large RAM required):
--pcr-dedup enable pcr dedup: remove PCR duplicates.
Paired reads are only considered identical if
both reads are duplicates to both reads in a
previous pair.
--pcr-dedup-count if --pcr-dedup, write a table of duplicate
count
processing order:
-O, --order PROCESS [PROCESS...]
order of trimming and filtration processing
methods. Unlisted process will not be done.
See epilog for process names (default:
["DefaultOrder"])
poly X tail trimming:
--polyG enable trimming poly G tails
--polyT enable trimming poly T tails
--polyA enable trimming poly A tails
--polyC enable trimming poly C tails
--poly-length POLY-LENGTH
the minimum length of poly X (type: Int64,
default: 10)
--poly-mismatch-per-16mer INT
the number of mismatch allowed in 16 mer poly
X (type: Int64, default: 2)
adapter trimming:
--no-adapter-trim disable adapter and pair-end trimming
-a, --adapter1 SEQ [SEQ...]
read 1 adapter(s) appended to insert DNA at 3'
end (default:
["AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"])
-A, --adapter2 SEQ [SEQ...]
read 2 adapter(s) appended to insert DNA at 3'
end (default:
["AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"])
-T, --kmer-tolerance INT
# of mismatch allowed in 16-mers adapter and
pair-end matching (type: Int64, default: 2)
-d, --pe-adapter-diff INT
(FOR PAIRED END) number of bases allowed when
disconcordance found between adapter and
pair-end search (type: Int64, default: 0)
-D, --r1-r2-diff INT (FOR PAIRED END) number of bases allowed when
the insert sizes of r1 and r2 are different
(type: Int64, default: 0)
-s, --kmer-n-match INT
(FOR PAIRED END) if n base matched [0-16] is
less than INT, loosen matches will be made
based on the match with the highest n base
match (type: Int64, default: 9)
--trim-score-pe FLOAT
(FOR PAIRED END) if final score [0-32] of read
pair is greater than FLOAT, the reads will be
trimmed. (type: Float64, default: 10.0)
--trim-score-se FLOAT
(FOR SINGLE END) if final score [0-16] of read
is greater than FLOAT, the reads will be
trimmed. (type: Float64, default: 10.0)
-l, --tail-length INT
(FOR PAIRED END) if the adapter is in the tail
region, and insert size of pe match is smaller
than this region, do not trim the read. (type:
Int64, default: 12)
--stats (DEV ONLY) write stats to description lines of
r2 reads.
consensus/merging in adapter trimming (FOR PAIRED END):
--no-consensus disable generating consensus paired reads. If
adapter trimming is disabled, consensus
calling is not performed even the flag is not
set.
--kmer-tolerance-consensus INT
# of mismatch allowed in 16-mers matching in
consensus calling (type: Int64, default: 10)
--min-ratio-mismatch FLOAT
if the ratio of mismatch of the overlapped
region is less than FLOAT, skip consensus
calling. (type: Float64, default: 0.28)
--overlap-score FLOAT
if no adapter was found, scan the tails of the
paired reads. Then, if the maximum score of
the overlapped 16-mers are less than FLOAT,
skip consensus calling for the read pair. If
adapters were found, this step is ignored.
(type: Float64, default: 0.0)
--prob-diff FLOAT when doing consensus calling, if the bases
were not complementary, the base with the
higher quality probability is selected unless
the quality probability difference are less
than FLOAT (type: Float64, default: 0.0)
hard clipping: trim a fixed length:
-z, --clip3-r1 INT remove the last INT bases from 3' end of R1.
(type: Int64, default: 0)
-Z, --clip3-r2 INT remove the last INT bases from 3' end of R2.
(type: Int64, default: 0)
-b, --clip-after-r1 INT
hard clip the 3' tails of R1 to contain only
INT bases. 0 to disable. (type: Int64,
default: 0)
-B, --clip-after-r2 INT
hard clip the 3' tails of R2 to contain only
INT bases. 0 to disable. (type: Int64,
default: 0)
-e, --clip5-r1 INT remove the first INT bases from 5' front of
R1. (type: Int64, default: 0)
-E, --clip5-r2 INT remove the first INT bases from 5' front of
R2. (type: Int64, default: 0)
quality trimming: trim the tail when the average quality of bases in
a sliding window is low:
--no-quality-trim skip quality trimming
-q, --quality-score INT
threshold of quality score; 0 means turn off
quality trimming (type: Int64, default: 20)
--quality-kmer INT trim the tail once found the average quality
of bases in a sliding window is low (type:
Int64, default: 5)
--quality-format FORMAT
the format of the quality score (Illumina1.3,
Illumina1.8, Sanger, Illumina1.5, Solexa); or
the ASCII number when quality score == 0
(default: "33")
N trimming:
--no-tail-n-trim disable removing NNNNN tail.
-n, --max-n INT # N allowed in each read; N tails not included
if --no-tail-n-trim; INT<0 to disable (type:
Int64, default: 15)
length filtration:
--no-length-filtration
disable length filtration
--length-range INT:INT
length range of good reads; format is min:max
(default: "30:999999")
complexity filtration:
--enable-complexity-filtration
enable complexity filtration
--min-complexity FLOAT
complexity threshold (type: Float64, default:
0.3)
legacy arguments:
-p, --procs INT ignored (multi-proc is disabled) (default:
"1")
-C, --clip-after INT removed (use --clip-after-r1 and
--clip-after-r2 instead) (type: Int64,
default: 0)
-c, --clip5 INT removed (use --clip5-r1 and --clip5-r2
instead) (type: Int64, default: 0)
----------
Process names for --order | -O:
DefaultOrder = [CheckIdentifier, PolyG, PolyT, PolyA, PolyC, LengthFilter, AdapterTrim, HardClipEndR1, HardClipEndR2, HardClipAfterR1, HardClipAfterR2, HardClipFrontR1, HardClipFrontR2, QualityTrim, TailNTrim, MaxNFilter, LengthFilter, ComplexityFilter, PCRDedup],
CheckIdentifier ,
PolyX = [PolyG, PolyT, PolyA, PolyC],
PolyG ,
PolyT ,
PolyA ,
PolyC ,
AdapterTrim ,
HardClipEnd = [HardClipEndR1, HardClipEndR2],
HardClipEndR1 ,
HardClipEndR2 ,
HardClipAfter = [HardClipAfterR1, HardClipAfterR2],
HardClipAfterR1 ,
HardClipAfterR2 ,
HardClipFront = [HardClipFrontR1, HardClipFrontR2],
HardClipFrontR1 ,
HardClipFrontR2 ,
QualityTrim ,
TailNTrim ,
MaxNFilter ,
LengthFilter ,
ComplexityFilter,
PCRDedup .