Skip to content
jtroehr edited this page Jun 22, 2018 · 71 revisions

Flexbar logo

Processing steps of Flexbar 3.4

Flexbar conducts the following series of preprocessing steps.
Only step 1 and 10 are active in standard settings.

  1. Filtering reads with uncalled bases
  2. Trimming of bases on left or right end
  3. Quality-based read trimming modes
  4. Barcode detection, removal and read separation
  5. Overlap detection for paired reads
  6. Adapter detection and removal
  7. Trimming of homopolymers at read ends
  8. Quality-based trimming only after removal
  9. Read trimming to certain length from right
  10. Filtering short sequencing reads

Basic usage

Flexbar usage on command-line:

flexbar -r reads [-b barcodes] [-a adapters] [options]

Input reads file

One file with sequencing reads in fasta or fastq format is required as input. It can be specified with option -r or --reads. The file should have a .fasta, .fa, .fastq, or .fq extension. For read files in fastq format, the endings .txt and .dat are also accepted. Files that are compressed with gzip or bzip2 are supported and should end with .gz or .bz2. Reads can also be passed via standard input by using a minus sign - as reads filename.

Target prefix

The prefix of target file names or paths can be set using the --target option. For example, the default target flexbarOut leads to files based on this name and .fasta ending in case of fasta input. The following program call with another target generates a output.fasta file:

flexbar -r reads.fasta -t output

Barcodes and adapters

To separate reads based on barcodes or to remove adapters from sequencing reads, specify barcodes or adapter(s). Create a fasta file containing barcodes or adapter sequences to be detected. Sequence names can be chosen freely but should not be the same:

>adapter1
TCGATTACGT
>adapter2
GGTAGTACGCTA

The wildcard character N can be used in barcodes and adapters to match all characters. Run Flexbar for removal of adapters, e.g. with 8 threads for parallel computation:

flexbar -r reads.fastq -a adapters.fasta -n 8

Flexbar aligns each read from reads.fastq to the adapter sequences found in adapters.fasta, one by one based on global alignment. For each read, the highest scoring valid alignment is employed for adapter removal and resulting reads are written to the flexbarOut.fastq file.

Paired reads

Flexbar processes paired reads and outputs consistent read files after processing. Single reads that are left over, can be written to an extra file. To use paired reads, specify two input files in same format that are in correct order for paired reads, for example adapter removal with pair overlap detection:

flexbar -r reads1.fastq -p reads2.fastq -a adapters.fasta -ap ON

Output files flexbarOut_1.fastq and flexbarOut_2.fastq with processed reads are created. It is recommended to turn on the pair overlap detection for adapter removal in case of standard paired-end libraries, see section pair overlap detection below.

Adapters in the fasta file that is specified with option -a or --adapters are searched in both read files of paired reads in default settings. Often certain adapters occure only in the first read set and others in the second read set. In this case, one should specify a second fasta file with adapters with the --adapters2 option. Adapters that are specifies with --adapters are then only searched in the first read set and those specified with --adapters2 are only searched in the second reads file.

Adapter removal presets

Several adapter presets for Illumina libraries are included in Flexbar. For example, select the TruSeq preset for standard TruSeq adapters and specify two read files for paired reads. If a preset is chosen, a separate file with adapters is not needed for removal. It is recommended to turn on the pair overlap detection for standard paired-end libraries:

flexbar -r reads1.fq -p reads2.fq --adapter-preset TruSeq -ap ON

Presets for removal of Illumina adapters:

  • TruSeq: TruSeq LT and HT-based kits
  • SmallRNA: TruSeq Small RNA
  • Methyl: TruSeq DNA Methylation and ScriptSeq
  • Ribo: TruSeq Ribo Profile
  • Nextera: Nextera, AmpliSeq, and TruSight
  • NexteraMP: Nextera Mate Pair

Adapter sequences for presets are based on recommendations by Illumina, see website.
Oligonucleotide sequences © 2018 Illumina, Inc. All rights reserved.

Barcoded reads

To detect barcodes within reads, specify a barcodes file when running Flexbar:

flexbar -r barcoded_reads.fastq -b barcodes.fasta [options]

This generates an output file for each barcode being contained in the barcode fasta file. Fasta tags are used in filenames:

flexbarOut_barcodeA.fastq
flexbarOut_barcodeB.fastq
...

Unassigned reads are not part of the output per default. The switch --barcode-unassigned changes this behaviour and leads to generation of output files for unassigned reads.

Separate barcode reads

Illumina produces two output files for single barcoded and three for paired-end barcoded runs:

reads_1.fastq   - sequencing reads (1st read of pairs)
reads_2.fastq   - separate barcode reads
reads_3.fastq   - 2nd read of pairs (optional)

Run Flexbar with these files:

flexbar -r reads_1.fastq -p reads_3.fastq -b barcodes.fasta -br reads_2.fastq

For each barcode, several output files are generated:

flexbarOut_barcodeA_1.fastq   (reads_1.fastq with barcode A)
flexbarOut_barcodeA_2.fastq   (reads_3.fastq with barcode A)
flexbarOut_barcodeB_1.fastq   (etc.)
...

Detection parameters

Parameters for detection and removal of barcodes as well as adapters:

Trim-end modes

The options --barcode-trim-end and --adapter-trim-end are important. They specify the barcode or adapter positioning within the read and which part of the read gets removed.

The following illustrations explain the five different modes. When a barcode or adapter aligns in agreement with the threshold and min-overlap, it could still be rejected due to restriction by trim-end mode. Red indicates a non-valid match and no removal of sequence from the read. Blue indicates removal of corresponding read sequence. Green highlights the remaining sequence of the read after removal. The aligned barcode or adapter is depicted by Xs.

ANY: longer side of read remains after removal of overlap

ANY mode image

RIGHT: left part remains after removal, align >= read start

RIGHT mode image

LEFT: right side remains after removal, align <= read end

LEFT mode image

RTAIL: consider only last n bases of reads (default: barcode or adapter length)

RTAIL mode image

LTAIL: use only first n bases of reads (default: barcode or adapter length)

LTAIL mode image

Options --barcode-tail-length and --adapter-tail-length can be used to modify the length of the tail mode region to be considered.

Minimum overlap

The minimum required overlap for barcodes is set to their length per default and to 3 base-pairs for adapters. It can be a drawback to set --adapter-min-overlap too high, if adapters are mostly located at the end of reads. Therefore a low default value is used. To change this, try:

flexbar -r reads.fasta -a adapters.fasta -ao 5 -l ALL

Assume a simple adapter like AAAAAA and a min-overlap of 5. The adapter would be recognized in the following read:

TCGAAAAAAGCGTGTTT
   ||||||
---AAAAAA--------

If the adapter is located at the 3' end with an overlap of only 4 bases, the adapter would not be removed because a min-overlap of 5 is requested. With a lower min-overlap value, adapters can be better detected at read ends. Per default this adapter would be removed.

TCGGCGTGTTTAAAA
           ||||
-----------AAAAAA

Error rate

The error rate parameters determine how many mismatches and indels are allowed for an adapter or barcode sequence to be removed. Consider the following alignment:

ACGTAGCCGTACTGT
       |||| |
-------CGTATT--

There is 1 mismatches in 6 bases. If an --adapter-error-rate of 0.1 is selected, only 0.6 errors are allowed for an overlap of 6 bases. The adapter is not removed. By increasing the threshold to 0.2, we allow 1.2 errors per 6 bases and therefore the adapter gets removed:

flexbar -r reads.fasta -a adapters.fasta --adapter-error-rate 0.2

Alignment scoring

The scoring scheme can be adjusted separately for detection of barcodes and adapters. This includes alignment match, mismatch and gap scores, see program options page. For example, it could make sense to specify a larger score for gaps, when the data's sequencing platform has high indel error rates:

flexbar -r reads.fasta -a adapters.fasta --adapter-gap -4

If the gap score is set to a value of -4, the score of a gap corresponds to 4 mismatches and can be compensated by 4 matches.

Pair overlap detection

It is recommended to activate the pair overlap detection for adapter removal in case of standard paired-end libraries. This increases the sensitivity by removing also very short parts of adapters if a sufficient overlap is recognized for a pair. Note that this method considerably increases the runtime of the program. The standard form to activate this method is for example:

flexbar -r reads1.fastq -p reads2.fastq -a adapters.fa -ap ON

Per default the minimum overlap for read pairs is 40. The option -av can be used to set it to another value. The first read of each read pair is aligned to the reverse complement of the second read to detect if a read-through into adapters occured:

  TCAGGGCAATACACAAGGACACGCAATACACAGGGGACCCATCAA   1st read
AATCAGGGCAATACACAAGGACACGCAATACACAGGGGACCCATC     reverse complement of 2nd read

Without pair overlap detection the short adapter sequence overhangs AA with 2 nucleotides at the read ends would not be removed because the minimum adapter overlap is 3 in default settings. If the pair overlap detection is turned ON and an overlap between paired reads is observed with an overhang at the ends, the minimum adapter overlap is decreased to 1 for the alignment of adapters. Now the adapter overhangs can be removed. There are also other pair overlap detection modes, SHORT for example:

flexbar -r reads1.fastq -p reads2.fastq -a adapters.fa -ap SHORT

This option leads to direct trimming if the overhang is shorter than the minimum adapter overlap without requirement of an adapter match. This is useful to be more robust towards errors in these short adapter sequence overhangs. The third pair overlap detection mode is ONLY. It can be used even without specification of known adapters. This mode leads to direct trimming of the overhangs at read ends if an overlap is detected for a pair without any limitations. This mode is not recommended because long sequences that do not stem from adapters could be trimmed in certain situations.

Filtering and trimming

Flexbar provides several basic read filtering and trimming features.

Filtering uncalled bases

In the first step, reads that contain more uncalled bases than specified are discarded. These reads are not included in further processing steps and the output. Per default, not a single uncalled base is allowed. To allow e.g. 2 uncalled bases per read:

flexbar -r reads.fastq --max-uncalled 2

Trimming of read ends

Trimming a fixed number of bases at left and right read ends is the next step, which is not performed with standard settings. For example, to trim 5 bases at the left side of reads:

flexbar -r reads.fastq --pre-trim-left 5

Trimming of homopolymers

Specific homopolymers on the left or right end of reads can be trimmed. The following command trims for example poly(A) or poly(T) tails with a minimum length of 10 and error rate 0.1 on the right side of reads:

flexbar -r reads.fastq --htrim-right AT --htrim-min-length 10 --htrim-error-rate 0.1

Trimming to read length

Reads can be trimmed to a certain length by cutting the right side. This step comes after barcoding and adapter removal. It is suited for cases where tools in the downstream analysis require a maximal or even an exact length of reads. For example, to cut the rigth side of reads such that reads are not longer than 50 bases, issue the following command:

flexbar -r reads.fastq --post-trim-length 50

Filtering short reads

There are many applications of sequencing data in which reads cannot be processed if they are too short. Therefore, we support filtering of reads based on their length. For example to discard reads that are shorter than 50 base pairs, set --min-read-length to this number. To make sure that reads have exactly a certain length, use the --post-trim-length option in addition:

flexbar -r reads.fastq --post-trim-length 50 --min-read-length 50

Quality-based trimming

Trimming based on phred quality values helps to deal with higher error rates towards the end of reads. Use option --qtrim to choose one of TAIL, WIN, and BWA as trimming mode. Also the format of quality scores has to be chosen. For example, to trim the 3' end until quality offset value 30 (corresponding to 63 in sanger format) or higher is reached, use the command:

flexbar -r reads.fastq -q TAIL -qf i1.8 -qt 30

Quality format

Choose the quality score format of fastq files with option --qtrim-format to indicate it for quality-based read trimming. Supported quality scalings are:

  • sanger
  • solexa
  • i1.3 (illumina 1.3)
  • i1.5 (same as i1.3)
  • i1.8 (same as sanger)

After removal steps

Quality-based trimming is performed before barcode and adapter processing steps per default. Use option --qtrim-post-removal in case you prefer to apply it after these steps instead.

Output selection

The option --fasta-output forces the non-quality file format fasta for output. This option is suited if the input format is fastq, and fasta output is preferred. Furthermore, the length distribution of read output files can be inspected by setting the option --length-dist, which leads to the generation of a length distribution file for each read output file.

Compressed output

Output files of reads can be directly compressed using gzip or bzip2. Specify GZ or BZ2 with option --zip-output to enable this feature. The .gz and .bz2 file ending is used.

Standard output

It is possible to send reads to standard output instead of files by setting option --stdout-reads. If barcode based separation of reads is being conducted, read names get extended by barcode names separated by underscore. Paired reads are written in interleaved format.

Output read files

If barcode detection is not performed, the output read files can be specified with options --output-reads and --output-reads2 instead of using the prefix of the target option. This is useful for workflow system or if you want to set the file path explicitly. The files should have a .fasta, .fa, .fastq, .fq, or .dat extension.

Single read output

While processing paired reads, it is possible to run into the situation that only one of the two reads of a pair is shorter than the specified minimal read length. In this case Flexbar discards also the single read that is not too short in order to keep the two files of paired reads in sync. The --single-reads option can be used to write these long enough single reads to separate files without loosing consistency for paired read output. Option --single-reads-paired leads to integration of such single reads in pairs with symbol N for too short counterparts.

Logging and tagging

Flexbar serves logging and read tagging features for inspection of alignments, and to facilitate downstream data analysis. For example, unique molecular identifiers which allows to recognize artifacts that stem from library amplification by PCR are supported.

Logging alignments

Print the optimal sequence alignment for each read and the barcode or adapter with maximal score if it is valid. Choose either to view all such valid alignmnets ALL or only those being used for read modification by sequence removal MOD. A third option TAB can be selected for tabular output of alignment statistics. Usage example:

flexbar -r reads.fastq -a adapters.fasta --align-log ALL

Unique molecular identifiers

Unique molecular identifiers (UMIs) are random sequence elements in reads that allow to recognize artifacts and errors, e.g. stemming from amplification by PCR during preparation of the sequencing library. Such UMIs can be captured with Flexbar by specifying the --umi-tags option and the character N at barcode or adapter positions for which the read is supposed to contain the random sequence. The characters at these positions are appended to the read name, separated by underscore.

Given the barcode pattern TGAGATNNNN in the barcodes fasta file to describe a composite barcode and the following read in a fasta file:

>read
TGAGATCGTTCAGTACGGCAATCGTATGCCGTCTTC

A Flexbar command for extracting the UMI is for example:

flexbar -r reads.fasta -b barcodes.fasta --umi-tags

The result contains the read with the variable part of the barcode in the read name:

>read_CGTT
CAGTACGGCAATCGTATGCCGTCTTC

Further tagging

The --number-tags option triggers replacement of read name tags by an ascending number to save space. The --removal-tags option can be employed to tag reads for which adapter or barcode removal takes place.