-
Notifications
You must be signed in to change notification settings - Fork 29
Manual
Flexbar conducts the following series of preprocessing steps.
Only step 1 and 10 are active in standard settings.
- Filtering reads with uncalled bases
- Trimming of bases on left or right end
- Quality-based read trimming modes
- Barcode detection, removal and read separation
- Overlap detection for paired reads
- Adapter detection and removal
- Trimming of homopolymers at read ends
- Quality-based trimming only after removal
- Read trimming to certain length from right
- Filtering short sequencing reads
Flexbar usage on command-line:
flexbar -r reads [-b barcodes] [-a adapters] [options]
One file with sequencing reads in fasta or fastq format is required as input. It can be specified with option -r
or --reads
. The file should have a .fasta
, .fa
, .fastq
, or .fq
extension. For read files in fastq format, the endings .txt
and .dat
are also accepted. Files that are compressed with gzip or bzip2 are supported and should end with .gz
or .bz2
. Reads can also be passed via standard input by using a minus sign -
as reads filename.
The prefix of target file names or paths can be set using the --target
option. For example, the default target flexbarOut
leads to files based on this name and .fasta
ending in case of fasta input. The following program call with another target generates a output.fasta
file:
flexbar -r reads.fasta -t output
To separate reads based on barcodes or to remove adapters from sequencing reads, specify barcodes or adapter(s). Create a fasta file containing barcodes or adapter sequences to be detected. Sequence names can be chosen freely but should not be the same:
>adapter1
TCGATTACGT
>adapter2
GGTAGTACGCTA
The wildcard character N
can be used in barcodes and adapters to virtually match all characters. Run Flexbar for removal of adapters, e.g. with 8 threads for parallel computation:
flexbar -r reads.fastq -a adapters.fasta -n 8
Flexbar aligns each read from reads.fastq
to the adapter sequences found in adapters.fasta
one by one based on global alignment. For each read, the highest scoring valid alignment is employed for adapter removal and resulting reads are written to the flexbarOut.fastq
file.
Flexbar processes paired reads and outputs consistent read files after processing. Single reads that are left over, can be written to an extra file. To use paired reads, specify two input files in same format that are in correct order for paired reads, for example adapter removal with pair overlap detection:
flexbar -r reads1.fastq -p reads2.fastq -a adapters.fasta -ap ON
Output files flexbarOut_1.fastq
and flexbarOut_2.fastq
with processed reads are created. It is recommended to turn on the pair overlap detection for adapter removal in case of standard paired-end libraries, see section pair overlap detection below.
Several adapter presets for Illumina libraries are included in Flexbar. For example, select the TruSeq
preset for standard TruSeq adapters and specify two read files for paired reads. If a preset is chosen, a separate file with adapters is not needed for removal. It is recommended to turn on the pair overlap detection for standard paired-end libraries:
flexbar -r reads1.fq -p reads2.fq --adapter-preset TruSeq -ap ON
Presets for Illumina adapter removal:
-
TruSeq
: TruSeq LT and HT-based kits -
SmallRNA
: TruSeq Small RNA -
Methyl
: TruSeq DNA Methylation and ScriptSeq -
Ribo
: TruSeq Ribo Profile -
Nextera
: Nextera, AmpliSeq, and TruSight -
NexteraMP
: Nextera Mate Pair
Adapter sequences for presets are based on recommendations by Illumina, see website.
Oligonucleotide sequences © 2018 Illumina, Inc. All rights reserved.
To detect barcodes within reads, specify a barcodes file when running Flexbar:
flexbar -r barcoded_reads.fastq -b barcodes.fasta [options]
This generates an output file for each barcode being contained in the barcode fasta file. Fasta tags are used in filenames:
flexbarOut_barcodeA.fastq
flexbarOut_barcodeB.fastq
...
Unassigned reads are not part of the output per default. The switch --barcode-unassigned
changes this behaviour and leads to generation of output files for unassigned reads.
Illumina produces two output files for single barcoded and three for paired-end barcoded runs:
reads_1.fastq - sequencing reads (1st read of pairs)
reads_2.fastq - separate barcode reads
reads_3.fastq - 2nd read of pairs (optional)
Run Flexbar with these files:
flexbar -r reads_1.fastq -p reads_3.fastq -b barcodes.fasta -br reads_2.fastq
For each barcode, several output files are generated:
flexbarOut_barcodeA_1.fastq (reads_1.fastq with barcode A)
flexbarOut_barcodeA_2.fastq (reads_3.fastq with barcode A)
flexbarOut_barcodeB_1.fastq (etc.)
...
Parameters for detection and removal of barcodes as well as adapters:
The options --barcode-trim-end
and --adapter-trim-end
are important. They specify the barcode or adapter positioning within the read and which part of the read gets removed.
The following illustrations explain the five different modes. When a barcode or adapter aligns in agreement with the threshold and min-overlap, it could still be rejected due to restriction by trim-end mode. Red indicates a non-valid match and no removal of sequence from the read. Blue indicates removal of corresponding read sequence. Green highlights the remaining sequence of the read after removal. The aligned barcode or adapter is depicted by Xs.
ANY
: longer side of read remains after removal of overlap
RIGHT
: left part remains after removal, align >= read start
LEFT
: right side remains after removal, align <= read end
RTAIL
: consider only last n bases of reads (default: barcode or adapter length)
LTAIL
: use only first n bases of reads (default: barcode or adapter length)
Options --barcode-tail-length
and --adapter-tail-length
can be used to modify the length of the tail mode region to be considered.
The minimum required overlap for barcodes is set to their length per default and to 3 base-pairs for adapters. It can be a drawback to set --adapter-min-overlap
too high, if adapters are mostly located at the end of reads. Therefore a low default value is used. To change this, try:
flexbar -r reads.fasta -a adapters.fasta -ao 5 -l ALL
Assume a simple adapter like AAAAAA
and a min-overlap of 5. The adapter would be recognized in the following read:
TCGAAAAAAGCGTGTTT
||||||
---AAAAAA--------
If the adapter is located at the 3' end with an overlap of only 4 bases, the adapter would not be removed because a min-overlap of 5 is requested. With a lower min-overlap value, adapters can be better detected at read ends. Per default this adapter would be removed.
TCGGCGTGTTTAAAA
||||
-----------AAAAAA
The error rate parameters determine how many mismatches and indels are allowed for an adapter or barcode sequence to be removed. Consider the following alignment:
ACGTAGCCGTACTGT
|||| |
-------CGTATT--
There is 1 mismatches in 6 bases. If an --adapter-error-rate
of 0.1 is selected, only 0.6 errors are allowed for an overlap of 6 bases. The adapter is not removed. By increasing the threshold to 0.2, we allow 1.2 errors per 6 bases and therefore the adapter gets removed:
flexbar -r reads.fasta -a adapters.fasta --adapter-error-rate 0.2
The scoring scheme can be adjusted separately for detection of barcodes and adapters. This includes alignment match, mismatch and gap scores, see program options page. For example, it could make sense to specify a larger score for gaps, when the data's sequencing platform has high indel error rates:
flexbar -r reads.fasta -a adapters.fasta --adapter-gap -4
If the gap score is set to a value of -4, the score of a gap corresponds to 4 mismatches and can be compensated by 4 matches.
It is recommended to activate the pair overlap detection for adapter removal in case of standard paired-end libraries. This increases the sensitivity by removing also very short parts of adapters if a sufficient overlap is recognized for a pair. Note that this method considerably increases the runtime of the program. The standard form to activate this method is for example:
flexbar -r reads1.fastq -p reads2.fastq -a adapters.fa -ap ON
Per default the minimum overlap for read pairs is 40. The option -av
can be used to set it to another value. The first read of each read pair is aligned to the reverse complement of the second read to detect if a read-through into adapters occured:
TCAGGGCAATACACAAGGACACGCAATACACAGGGGACCCATCAA 1st read
AATCAGGGCAATACACAAGGACACGCAATACACAGGGGACCCATC reverse complement of 2nd read
Without pair overlap detection the short adapter sequence overhangs AA
with 2 nucleotides at the read ends would not be removed because the minimum adapter overlap is 3 in default settings. If the pair overlap detection is turned ON
and an overlap between paired reads is observed with an overhang at the ends, the minimum adapter overlap is decreased to 1 for the alignment of adapters. Now the adapter overhangs can be removed. There are also other pair overlap detection modes, SHORT
for example:
flexbar -r reads1.fastq -p reads2.fastq -a adapters.fa -ap SHORT
This option leads to direct trimming if the overhang is shorter than the minimum adapter overlap without requirement of an adapter match. This is useful to be more robust towards errors in these short adapter sequence overhangs. The third pair overlap detection mode is ONLY
. It can be used even without specification of known adapters. This mode leads to direct trimming of the overhangs at read ends if an overlap is detected for a pair without any limitations. This mode is not recommended because long sequences that do not stem from adapters could be trimmed in certain situations.
Flexbar provides several basic read filtering and trimming features.
In the first step, reads that contain more uncalled bases than specified are discarded. These reads are not included in further processing steps and the output. Per default, not a single uncalled base is allowed. To allow e.g. 2 uncalled bases per read:
flexbar -r reads.fastq --max-uncalled 2
Trimming a fixed number of bases at left and right read ends is the next step, which is not performed with standard settings. For example, to trim 5 bases at the left side of reads:
flexbar -r reads.fastq --pre-trim-left 5
Specific homopolymers on the left or right end of reads can be trimmed. The following command trims for example poly(A) or poly(T) tails with a minimum length of 10 and error rate 0.1 on the right side of reads:
flexbar -r reads.fastq --htrim-right AT --htrim-min-length 10 --htrim-error-rate 0.1
Reads can be trimmed to a certain length by cutting the right side. This step comes after barcoding and adapter removal. It is suited for cases where tools in the downstream analysis require a maximal or even an exact length of reads. For example, to cut the rigth side of reads such that reads are not longer than 50 bases, issue the following command:
flexbar -r reads.fastq --post-trim-length 50
There are many applications of sequencing data in which reads cannot be processed if they are too short. Therefore, we support filtering of reads based on their length. For example to discard reads that are shorter than 50 base pairs, set --min-read-length
to this number. To make sure that reads have exactly a certain length, use the --post-trim-length
option in addition:
flexbar -r reads.fastq --post-trim-length 50 --min-read-length 50
Trimming based on phred quality values helps to deal with higher error rates towards the end of reads. Use option --qtrim
to choose one of TAIL
, WIN
, and BWA
as trimming mode. Also the format of quality scores has to be chosen. For example, to trim the 3' end until quality offset value 30 (corresponding to 63 in sanger format) or higher is reached, use the command:
flexbar -r reads.fastq -q TAIL -qf i1.8 -qt 30
Choose the quality score format of fastq files with option --qtrim-format
to indicate it for quality-based read trimming. Supported quality scalings are:
- sanger
- solexa
- i1.3 (illumina 1.3)
- i1.5 (same as i1.3)
- i1.8 (same as sanger)
Quality-based trimming is performed before barcode and adapter processing steps per default. Use option --qtrim-post-removal
in case you prefer to apply it after these steps instead.
The option --fasta-output
forces the non-quality file format fasta for output. This option is suited if the input format is fastq, and fasta output is preferred. Furthermore, the length distribution of read output files can be inspected by setting the option --length-dist
, which leads to the generation of a length distribution file for each read output file.
Ouput files of reads can be directly compressed using gzip or bzip2. Specify GZ
or BZ2
with option --zip-output
to enable this feature. The .gz
and .bz2
file ending is used.
It is possible to send reads to standard output instead of files by setting option --stdout-reads
. If barcode based separation of reads is being conducted, read names get extended by barcode names separated by underscore. Paired reads are written in interleaved format.
If barcode detection is not conducted, the output read files can be specified with options --output-reads
and --output-reads2
instead of using the prefix of the target option. This is useful for workflow system or if you want to set the file path explicitly. The files should have a .fasta
, .fa
, .fastq
, .fq
, or .dat
extension.
While processing paired reads, it is possible to run into the situation that only one of the two reads of a pair is shorter than the specified minimal read length. In this case Flexbar discards also the single read that is not too short in order to keep the two files of paired reads in sync. The --single-reads
option can be used to write these long enough single reads to separate files without loosing consistency for paired read output. Option --single-reads-paired
leads to integration of such single reads in pairs with symbol N
for too short counterparts.
Flexbar serves logging and read tagging features for inspection of alignments, and to facilitate downstream data analysis. For example, unique molecular identifiers which allows to recognize artifacts that stem from library amplification by PCR are supported.
Print the optimal sequence alignment for each read and the barcode or adapter with maximal score if it is valid. Choose either to view all such valid alignmnets ALL
or only those being used for read modification by sequence removal MOD
. A third option TAB
can be selected for tabular output of alignment statistics. Usage example:
flexbar -r reads.fastq -a adapters.fasta --align-log ALL
Unique molecular identifiers (UMIs) are random sequence elements in reads that allow to recognize artifacts and errors, e.g. stemming from amplification by PCR during preparation of the sequencing library. Such UMIs can be captured with Flexbar by specifying the --umi-tags
option and the character N
at barcode or adapter positions for which the read is supposed to contain the random sequence. The characters at these positions are appended to the read name, separated by underscore.
Given the barcode pattern TGAGATNNNN
in the barcodes fasta file to describe a composite barcode and the following read in a fasta file:
>read
TGAGATCGTTCAGTACGGCAATCGTATGCCGTCTTC
A Flexbar command for extracting the UMI is for example:
flexbar -r reads.fasta -b barcodes.fasta --umi-tags
The result contains the read with the variable part of the barcode in the read name:
>read_CGTT
CAGTACGGCAATCGTATGCCGTCTTC
The --number-tags
option triggers replacement of read name tags by an ascending number to save space. The --removal-tags
option can be employed to tag reads for which adapter or barcode removal takes place.