Manual
Flexbar conducts the following series of preprocessing steps.
Only steps 1 and 9 are active with standard settings.
1. Filtering reads with uncalled bases
2. Trimming of bases on left or right end
3. Quality-based read trimming modes
4. Barcode detection, removal and read separation
5. Adapter detection and removal
6. Quality-based trimming only after removal
7. Trimming of homopolymers at read ends
8. Read trimming to certain length from right
9. Filtering short sequencing reads
Flexbar usage on command-line:
flexbar -r reads [-b barcodes] [-a adapters] [options]
One file with sequencing reads in fasta or fastq format is required as input. It can be specified with option -r or --reads. The file should have a .fasta, .fa, .fastq, or .fq extension. Files that are compressed with gzip or bzip2 are supported and should end with .gz or .bz2. Reads can also be passed via standard input by using a minus sign - as the reads filename.
The prefix of target file names or paths can be set using the --target option. For example, the default target flexbarOut leads to files based on this name with a .fasta ending in case of fasta input. The following program call with another target generates an output.fasta file:
flexbar -r reads.fasta -t output
To separate reads based on barcodes or to remove adapters from sequencing reads, specify barcodes or adapter(s). Create a fasta file containing barcodes or adapter sequences to be detected. Sequence names can be chosen freely but should not be the same:
>adapter1
TCGATTACGT
>adapter2
GGTAGTACGCTA
The wildcard character N can be used in barcodes and adapters to match any character. Run Flexbar for removal of adapters, e.g. with 8 threads for parallel computation:
flexbar -r reads.fastq -a adapters.fasta -n 8
Flexbar aligns each read from reads.fastq to the adapter sequences found in adapters.fasta one by one based on global alignment. For each read, the highest scoring valid alignment is employed for adapter removal and resulting reads are written to the flexbarOut.fastq file.
Flexbar processes paired reads and outputs consistent read files after processing. Single reads that are left over can be written to an extra file. To use paired reads, specify two input files in the same format that are in the correct order for paired reads, for example:
flexbar -r reads_1.fastq -p reads_2.fastq [options]
Output files flexbarOut_1.fastq and flexbarOut_2.fastq with processed reads are created.
To detect barcodes within reads, specify a barcodes file when running Flexbar:
flexbar -r barcoded_reads.fastq -b barcodes.fasta [options]
This generates an output file for each barcode contained in the barcodes fasta file. Fasta tags are used in filenames:
flexbarOut_barcodeA.fastq
flexbarOut_barcodeB.fastq
...
Unassigned reads are not part of the output by default. The switch --barcode-unassigned changes this behaviour and leads to the generation of separate output files for unassigned reads.
Illumina produces two output files for single barcoded and three for paired-end barcoded runs:
reads_1.fastq - sequencing reads (1st read of pairs)
reads_2.fastq - separate barcode reads
reads_3.fastq - 2nd read of pairs (optional)
Run Flexbar with these files:
flexbar -r reads_1.fastq -p reads_3.fastq -b barcodes.fasta -br reads_2.fastq
For each barcode, several output files are generated:
flexbarOut_barcodeA_1.fastq (reads_1.fastq with barcode A)
flexbarOut_barcodeA_2.fastq (reads_3.fastq with barcode A)
flexbarOut_barcodeB_1.fastq (etc.)
...
Parameters for detection and removal of barcodes as well as adapters:
The options --barcode-trim-end and --adapter-trim-end are important. They specify the barcode or adapter positioning within the read and which part of the read gets removed.
The following illustrations explain the five different modes. When a barcode or adapter aligns in agreement with the threshold and min-overlap, it could still be rejected due to restriction by trim-end mode. Red indicates a non-valid match and no removal of sequence from the read. Blue indicates removal of corresponding read sequence. Green highlights the remaining sequence of the read after removal. The aligned barcode or adapter sequence is depicted by Xs.
ANY: longer side of read remains after removal of overlap
RIGHT: left part remains after removal, align >= read start
LEFT: right side remains after removal, align <= read end
RTAIL: consider only last n bases of reads (default: barcode or adapter length)
LTAIL: use only first n bases of reads (default: barcode or adapter length)
Options --barcode-tail-length and --adapter-tail-length can be used to modify the length of the tail mode region to be considered.
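As a rough illustration of the ANY, RIGHT and LEFT modes described above, the following Python sketch decides which part of a read remains. It is not Flexbar's implementation. The function name `apply_trim_end` and the coordinate convention (alignment span `start`..`end` in read coordinates, where `start` may be negative or `end` may exceed the read length when the adapter overhangs) are assumptions made for this example, and the tail modes are left out.

```python
def apply_trim_end(read: str, start: int, end: int, mode: str) -> str:
    """Return the part of the read that remains for a given trim-end mode.

    start/end: hypothetical alignment span [start, end) in read coordinates.
    A rejected match returns the read unchanged.
    """
    left = read[:max(start, 0)]           # sequence before the aligned region
    right = read[min(end, len(read)):]    # sequence after the aligned region
    if mode == "RIGHT":
        # left part remains; alignment must not begin before the read start
        return left if start >= 0 else read
    if mode == "LEFT":
        # right part remains; alignment must not extend past the read end
        return right if end <= len(read) else read
    if mode == "ANY":
        # the longer side remains (tie handling here is an arbitrary choice)
        return left if len(left) >= len(right) else right
    raise ValueError(f"unsupported mode in this sketch: {mode}")


read = "TCGAAAAAAGCGTGTTT"
print(apply_trim_end(read, 3, 9, "RIGHT"))
print(apply_trim_end(read, 3, 9, "LEFT"))
print(apply_trim_end(read, 3, 9, "ANY"))
```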
The minimum required overlap for barcodes is set to their length by default and to 3 base-pairs for adapters. It can be a drawback to set --adapter-min-overlap too high, if adapters are mostly located at the end of reads. Therefore a low default value is used. To change this, try:
flexbar -r reads.fasta -a adapters.fasta -ao 5 -l ALL
Assume a simple adapter like AAAAAA and a min-overlap of 5. The adapter would be recognized in the following read:
TCGAAAAAAGCGTGTTT
   ||||||
---AAAAAA--------
If the adapter is located at the 3' end with an overlap of only 4 bases, it is not removed because a min-overlap of 5 is requested. With a lower min-overlap value, adapters at read ends are detected more reliably. With the default min-overlap of 3, this adapter would be removed.
TCGGCGTGTTTAAAA
           ||||
-----------AAAAAA
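The effect of min-overlap at the 3' end can be sketched in Python as follows. This is a simplified illustration that uses exact prefix matching only, not Flexbar's alignment with mismatches and indels. The helper names `end_overlap` and `trim_3prime` are hypothetical.

```python
def end_overlap(read: str, adapter: str) -> int:
    """Length of the longest adapter prefix that exactly matches the read's 3' end."""
    for k in range(min(len(read), len(adapter)), 0, -1):
        if read.endswith(adapter[:k]):
            return k
    return 0


def trim_3prime(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Remove the overlapping adapter prefix if the overlap is long enough."""
    k = end_overlap(read, adapter)
    if k > 0 and k >= min_overlap:
        return read[:-k]
    return read


read, adapter = "TCGGCGTGTTTAAAA", "AAAAAA"
print(end_overlap(read, adapter))                 # overlap of 4 bases
print(trim_3prime(read, adapter, min_overlap=5))  # unchanged, overlap too short
print(trim_3prime(read, adapter, min_overlap=3))  # adapter part removed
```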
The threshold parameter specifies how many mismatches and indels are allowed for an adapter or barcode sequence to be removed. Consider the following alignment:
ACGTAGCCGTACTGT
       |||| |
-------CGTATT--
There is 1 mismatch in 6 bases. If an --adapter-threshold of 1 error per 10 bases is selected, only 0.6 errors are allowed for an overlap of 6 bases, so the adapter is not removed. By increasing the threshold to 2, we allow 1.2 errors per 6 bases and therefore the adapter gets removed. The min-overlap must be 6 or lower here; otherwise the adapter is never removed.
flexbar -r reads.fasta -a adapters.fasta -at 2 -ao 6 -l ALL
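The threshold arithmetic from this example can be written out as a short Python sketch. The function names are hypothetical, and the use of a less-than-or-equal comparison is an assumption; the text only states that 0.6 allowed errors rejects and 1.2 allowed errors accepts a single mismatch.

```python
def allowed_errors(threshold: float, overlap: int) -> float:
    """Errors allowed for an overlap, given a threshold in errors per 10 bases."""
    return threshold * overlap / 10


def is_accepted(errors: int, threshold: float, overlap: int) -> bool:
    """Assumed acceptance rule: observed errors must not exceed the allowance."""
    return errors <= allowed_errors(threshold, overlap)


# One mismatch in an overlap of 6 bases, as in the alignment above.
print(allowed_errors(1, 6), is_accepted(1, threshold=1, overlap=6))
print(allowed_errors(2, 6), is_accepted(1, threshold=2, overlap=6))
```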
The scoring scheme can be adjusted separately for detection of barcodes and adapters. This includes alignment match, mismatch and gap scores, see the program options page. For example, it could make sense to specify a larger score for gaps when the data's sequencing platform has high indel error rates:
flexbar -r reads.fasta -a adapters.fasta --adapter-gap -4
If the gap score is set to a value of -4, the score of a gap corresponds to 4 mismatches and can be compensated by 4 matches.
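The statement above implies a match score of 1 and a mismatch score of -1. Under that assumption, which is inferred from the sentence and not taken from the options page, the relation between gaps, mismatches and matches can be checked with a small Python sketch. The function `alignment_score` is hypothetical.

```python
def alignment_score(matches: int, mismatches: int, gaps: int,
                    match: int = 1, mismatch: int = -1, gap: int = -4) -> int:
    """Sum of per-column scores; default match/mismatch values are assumptions."""
    return matches * match + mismatches * mismatch + gaps * gap


print(alignment_score(0, 0, 1))   # one gap
print(alignment_score(0, 4, 0))   # four mismatches, same penalty as one gap
print(alignment_score(4, 0, 1))   # four matches compensate one gap
```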
Flexbar provides several basic read filtering and trimming features.
In the first step, reads that contain more uncalled bases than specified are discarded. These reads are not included in further processing steps and the output. Per default, not a single uncalled base is allowed. To allow e.g. 2 uncalled bases per read:
flexbar -r reads.fastq --max-uncalled 2
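The filter can be pictured with the following Python sketch. It assumes that uncalled bases are encoded as the character N, and the function name `passes_uncalled_filter` is made up for this illustration.

```python
def passes_uncalled_filter(seq: str, max_uncalled: int = 0) -> bool:
    """Keep a read only if it has at most max_uncalled uncalled bases (N)."""
    return seq.upper().count("N") <= max_uncalled


reads = ["ACGTACGT", "ACNTACGT", "NNNTACGT"]
kept = [r for r in reads if passes_uncalled_filter(r, max_uncalled=2)]
print(kept)
```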
Trimming a fixed number of bases at left and right read ends is the next step, which is not performed with standard settings. For example, to trim 5 bases at the left side of reads:
flexbar -r reads.fastq --pre-trim-left 5
Reads can be trimmed to a certain length by cutting the right side. This step comes after barcoding and adapter removal. It is suited for cases where tools in the downstream analysis require a maximal or even an exact length of reads. For example, to cut the right side of reads such that reads are not longer than 50 bases, issue the following command:
flexbar -r reads.fastq --post-trim-length 50
There are many applications of sequencing data in which reads cannot be processed if they are too short. Therefore, we support filtering of reads based on their length. For example, to discard reads that are shorter than 50 base pairs, set --min-read-length to this number. To make sure that reads have exactly a certain length, use the --post-trim-length option in addition:
flexbar -r reads.fastq --post-trim-length 50 --min-read-length 50
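Why this combination yields reads of exactly one length can be seen in a minimal Python sketch: trimming caps the length from above, filtering bounds it from below. The function `post_trim_and_filter` is a hypothetical stand-in for the two steps, not Flexbar code.

```python
def post_trim_and_filter(reads, post_trim_length: int, min_read_length: int):
    """Cut reads on the right to post_trim_length, then drop reads that are too short."""
    out = []
    for read in reads:
        read = read[:post_trim_length]      # right side is cut
        if len(read) >= min_read_length:    # short reads are discarded
            out.append(read)
    return out


reads = ["A" * 60, "C" * 50, "G" * 40]
result = post_trim_and_filter(reads, post_trim_length=50, min_read_length=50)
print([len(r) for r in result])
```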
Trimming based on phred quality values helps to deal with higher error rates towards the end of reads. Use option --qtrim to choose one of TAIL, WIN, and BWA as trimming mode. Also the format of quality scores has to be chosen. For example, to trim the 3' end until quality offset value 30 (corresponding to 63 in sanger format) or higher is reached, use the command:
flexbar -r reads.fastq -q TAIL -qf i1.8 -qt 30
Choose the quality score format of fastq files with option --qtrim-format to indicate it for quality-based read trimming. Supported quality scalings are:
- sanger
- solexa
- i1.3 (illumina)
- i1.5 (as i1.3)
- i1.8 (as sanger)
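The relation between quality characters and quality values, and a TAIL-style trim from the 3' end, can be sketched in Python as follows. The ASCII offsets 33 (sanger, i1.8) and 64 (i1.3, i1.5) are the commonly used values for these encodings. The solexa scaling is left out because it does not use phred values, and the function `tail_trim` is a simplified assumption of how TAIL mode behaves, not Flexbar's code.

```python
# ASCII offsets of the phred-based encodings listed above.
OFFSETS = {"sanger": 33, "i1.8": 33, "i1.3": 64, "i1.5": 64}


def tail_trim(seq: str, qual: str, threshold: int, fmt: str = "sanger"):
    """Remove bases from the 3' end until a base with quality >= threshold is found."""
    offset = OFFSETS[fmt]
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < threshold:
        end -= 1
    return seq[:end], qual[:end]


# In sanger encoding, quality 30 is stored as ASCII code 63.
print(chr(30 + OFFSETS["sanger"]), 30 + OFFSETS["sanger"])
print(tail_trim("ACGTACGT", "IIIII###", threshold=30, fmt="i1.8"))
```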
Quality-based trimming is performed before barcode and adapter processing steps by default. Use option --qtrim-post-removal in case you prefer to apply it after these steps instead.
The option --fasta-output forces the non-quality file format fasta for output. This option is suited if the input format is fastq, and fasta output is preferred. Furthermore, the length distribution of read output files can be inspected by setting the option --length-dist, which leads to the generation of a length distribution file for each read output file.
Output files of reads can be directly compressed using gzip or bzip2. Specify GZ or BZ2 with option --zip-output to enable this feature. The .gz or .bz2 file ending is used accordingly.
It is possible to send reads to standard output instead of files by setting option --stdout-reads. In this case, the Flexbar output of parameters and statistics that usually uses stdout is written to a file named by target and the ending .log. When barcode based separation of reads is being conducted, read tags get extended by corresponding barcode tags separated by underscore. Paired reads are written in interleaved format.
While processing paired reads, it is possible that only one of the two reads of a pair is shorter than the specified minimal read length. In this case Flexbar also discards the single read that is not too short in order to keep the two files of paired reads in sync. The --single-reads option can be used to write these long enough single reads to separate files without losing consistency for paired read output. Option --single-reads-paired leads to integration of such single reads in pairs with symbol N for too short counterparts.
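The pair handling described above can be illustrated with a short Python sketch. It only models the length filter; the function `filter_pairs` and its return layout (kept pairs plus left-over single reads) are assumptions for this example.

```python
def filter_pairs(pairs, min_len: int):
    """Split read pairs into pairs that stay in sync and left-over single reads."""
    kept, singles = [], []
    for r1, r2 in pairs:
        ok1, ok2 = len(r1) >= min_len, len(r2) >= min_len
        if ok1 and ok2:
            kept.append((r1, r2))   # both reads long enough: pair is kept
        elif ok1:
            singles.append(r1)      # only first read survives
        elif ok2:
            singles.append(r2)      # only second read survives
    return kept, singles


pairs = [("ACGTACGT", "TTGGCCAA"), ("ACGTACGT", "TT"), ("AC", "GG")]
kept, singles = filter_pairs(pairs, min_len=5)
print(kept)
print(singles)
```

Without an option for single reads, only the kept pairs would be written, so the two paired output files keep the same number of reads in the same order.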
Flexbar provides logging and read tagging features for inspection of alignments and to facilitate downstream data analysis. For example, random sequence tags, which make it possible to recognize artifacts that stem from library amplification, are supported.
Print the optimal sequence alignment for each read and the barcode or adapter with maximal score if it is valid. Choose either to view all such valid alignments with ALL or only those being used for read modification by sequence removal with MOD. A third option TAB can be selected for tabular output of alignment statistics. Usage example:
flexbar -r reads.fastq -a adapters.fasta --align-log ALL
Unique molecular identifiers (UMIs) are random sequence elements in reads that make it possible to recognize artifacts and errors, e.g. stemming from amplification by PCR during preparation of the sequencing library. Such UMIs can be captured with Flexbar by specifying the --umi-tags option and the character N at barcode or adapter positions for which the read is supposed to contain the random sequence. The characters at these positions are appended to the read name, separated by underscore.
Given the barcode pattern TGAGATNNNN in the barcodes fasta file to describe a composite barcode and the following read in a fasta file:
>read
TGAGATCGTTCAGTACGGCAATCGTATGCCGTCTTC
A Flexbar command for extracting the UMI is for example:
flexbar -r reads.fasta -b barcodes.fasta --umi-tags
The result contains the read with the variable part of the barcode in the read name:
>read_CGTT
CAGTACGGCAATCGTATGCCGTCTTC
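The same transformation can be reproduced with a minimal Python sketch. It assumes an exact match of the pattern at the read start with N as wildcard, whereas Flexbar detects barcodes by alignment. The function `extract_umi` is hypothetical.

```python
def extract_umi(name: str, seq: str, pattern: str):
    """Match pattern at the read start; collect bases at N positions as the UMI.

    Returns (tagged_name, remaining_sequence), or None if the pattern does not match.
    """
    if len(seq) < len(pattern):
        return None
    umi = []
    for p, base in zip(pattern, seq):
        if p == "N":
            umi.append(base)        # wildcard position: part of the UMI
        elif p != base:
            return None             # fixed position must match exactly here
    return f"{name}_{''.join(umi)}", seq[len(pattern):]


print(extract_umi("read", "TGAGATCGTTCAGTACGGCAATCGTATGCCGTCTTC", "TGAGATNNNN"))
```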
The --number-tags option triggers replacement of read name tags by an ascending number to save space. The --removal-tags option can be employed to tag reads for which adapter or barcode removal takes place.