Manual

Processing steps of Flexbar 3.1

Flexbar conducts the following series of preprocessing steps.
Only steps 1 and 9 are active with standard settings.

1. Filtering reads with uncalled bases
2. Trimming of bases on left or right end
3. Quality-based read trimming modes
4. Barcode detection, removal and read separation
5. Adapter detection and removal
6. Quality-based trimming only after removal
7. Trimming of homopolymers at read ends
8. Read trimming to certain length from right
9. Filtering short sequencing reads

Program usage

Flexbar usage on command-line:

	flexbar -r reads [-b barcodes] [-a adapters] [options]

Input reads file

One file with sequencing reads in fasta or fastq format is required as input. It can be specified with option -r or --reads. The file should have a .fasta, .fa, .fastq, or .fq extension. Files compressed with gzip or bzip2 are supported and should end with .gz or .bz2. Reads can also be passed via standard input by using a minus sign (-) as the reads filename.

Target prefix

The prefix of target file names or paths can be set using the --target option. For example, the default target flexbarOut leads to files based on this name, with the ending .fasta in case of fasta input. The following program call with another target generates an output.fasta file:

	flexbar -r reads.fasta -t output

Barcodes and adapters

To separate reads based on barcodes or to remove adapters from sequencing reads, specify barcodes or adapter(s). Create a fasta file containing barcodes or adapter sequences to be detected. Sequence names can be chosen freely but should not be the same:

	>adapter1
	TCGATTACGT
	>adapter2
	GGTAGTACGCTA

The wildcard character N can be used in barcodes and adapters to match any character. Run Flexbar for removal of adapters, e.g. with 8 threads for parallel computation:

	flexbar -r reads.fastq -a adapters.fasta -n 8

Flexbar aligns each read from reads.fastq to the adapter sequences found in adapters.fasta one by one based on global alignment. For each read, the highest scoring valid alignment is employed for adapter removal and resulting reads are written to the flexbarOut.fastq file.

Paired reads

Flexbar processes paired reads and outputs consistent read files after processing. Single reads that are left over can be written to an extra file. To use paired reads, specify two input files in the same format with reads in matching order, for example:

	flexbar -r reads_1.fastq -p reads_2.fastq [options]

Output files flexbarOut_1.fastq and flexbarOut_2.fastq with processed reads are created.

Barcoded reads

To detect barcodes within reads, specify a barcodes file when running Flexbar:

	flexbar -r barcoded_reads.fastq -b barcodes.fasta [options]

This generates an output file for each barcode contained in the barcodes fasta file. The fasta tags are used in the filenames:

	flexbarOut_barcodeA.fastq
	flexbarOut_barcodeB.fastq
	...

Unassigned reads are not part of the output by default. The switch --barcode-unassigned changes this behaviour and leads to the generation of separate output files for unassigned reads.

Separate barcode reads

Illumina produces two output files for single barcoded and three for paired-end barcoded runs:

	reads_1.fastq   - sequencing reads (1st read of pairs)
	reads_2.fastq   - separate barcode reads
	reads_3.fastq   - 2nd read of pairs (optional)

Run Flexbar with these files:

	flexbar -r reads_1.fastq -p reads_3.fastq -b barcodes.fasta -br reads_2.fastq

For each barcode, several output files are generated:

	flexbarOut_barcodeA_1.fastq   (reads_1.fastq with barcode A)
	flexbarOut_barcodeA_2.fastq   (reads_3.fastq with barcode A)
	flexbarOut_barcodeB_1.fastq   (etc.)
	...

Detection parameters

Parameters for detection and removal of barcodes as well as adapters:

Trim-end modes

The options --barcode-trim-end and --adapter-trim-end are important. They specify the barcode or adapter positioning within the read and which part of the read gets removed.

The following illustrations explain the five different modes. When a barcode or adapter aligns in agreement with the threshold and min-overlap, it could still be rejected due to restriction by trim-end mode. Red indicates a non-valid match and no removal of sequence from the read. Blue indicates removal of corresponding read sequence. Green highlights the remaining sequence of the read after removal. The aligned barcode or adapter sequence is depicted by Xs.

ANY: longer side of read remains after removal of overlap

ANY mode image

RIGHT: left part remains after removal, align >= read start

RIGHT mode image

LEFT: right side remains after removal, align <= read end

LEFT mode image

RTAIL: consider only last n bases of reads (default: barcode or adapter length)

RTAIL mode image

LTAIL: use only first n bases of reads (default: barcode or adapter length)

LTAIL mode image

Options --barcode-tail-length and --adapter-tail-length can be used to modify the length of the tail mode region to be considered.

Min-overlap

By default, the minimum required overlap is set to the barcode length for barcodes and to 3 bases for adapters. Setting --adapter-min-overlap too high can be a drawback if adapters are mostly located at the end of reads, which is why a low default value is used. To change this, try:

	flexbar -r reads.fasta -a adapters.fasta -ao 5 -l ALL

Assume a simple adapter like AAAAAA and a min-overlap of 5. The adapter would be recognized in the following read:

	TCGAAAAAAGCGTGTTT
	   ||||||
	---AAAAAA--------

If the adapter is located at the 3' end with an overlap of only 4 bases, the adapter would not be removed because a min-overlap of 5 is requested. With a lower min-overlap value, adapters can be better detected at read ends. With the default min-overlap of 3, this adapter would be removed.

	TCGGCGTGTTTAAAA
	           ||||
	-----------AAAAAA
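
The effect of min-overlap at the 3' end can be sketched in Python. This is a simplified illustration and not Flexbar's alignment routine: it only considers exact matches between a read suffix and an adapter prefix, and the function names end_overlap and trim_end_adapter are hypothetical.

```python
def end_overlap(read: str, adapter: str) -> int:
    """Length of the longest read suffix that exactly matches an adapter prefix."""
    for length in range(min(len(read), len(adapter)), 0, -1):
        if read.endswith(adapter[:length]):
            return length
    return 0


def trim_end_adapter(read: str, adapter: str, min_overlap: int) -> str:
    """Remove the overlapping suffix only if it reaches min_overlap bases."""
    overlap = end_overlap(read, adapter)
    if overlap >= min_overlap:
        return read[:len(read) - overlap]
    return read


read = "TCGGCGTGTTTAAAA"
adapter = "AAAAAA"

print(end_overlap(read, adapter))          # overlap found at the read end
print(trim_end_adapter(read, adapter, 5))  # min-overlap 5: read stays unchanged
print(trim_end_adapter(read, adapter, 3))  # min-overlap 3: adapter part is removed
```

With the read and adapter from the example, the overlap is 4 bases, so the read is only trimmed when min-overlap is 4 or lower.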

Threshold

The threshold parameter specifies how many mismatches and indels are allowed for an adapter or barcode sequence to be removed. Consider the following alignment:

	ACGTAGCCGTACTGT
	       |||| |
	-------CGTATT--

There is 1 mismatch in 6 bases. If an --adapter-threshold of 1 error per 10 bases is selected, only 0.6 errors are allowed for an overlap of 6 bases, so the adapter is not removed. By increasing the threshold to 2, we allow 1.2 errors per 6 bases and therefore the adapter gets removed. For min-overlap we choose 6 or lower; otherwise the adapter never gets removed.

	flexbar -r reads.fasta -a adapters.fasta -at 2 -ao 6 -l ALL
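
The threshold arithmetic can be checked with a short Python sketch. It assumes an ungapped placement of the adapter at a known offset and interprets the threshold as allowed errors per 10 bases of overlap, as described in the text. The function names are hypothetical and do not correspond to Flexbar internals.

```python
def count_mismatches(read: str, adapter: str, offset: int) -> int:
    """Count mismatching positions of an ungapped adapter placement at offset."""
    window = read[offset:offset + len(adapter)]
    return sum(1 for r, a in zip(window, adapter) if r != a)


def is_removed(errors: int, overlap: int, threshold: float) -> bool:
    """Compare observed errors with the errors allowed for this overlap."""
    allowed = threshold * overlap / 10  # threshold = errors per 10 bases
    return errors <= allowed


read = "ACGTAGCCGTACTGT"
adapter = "CGTATT"
errors = count_mismatches(read, adapter, offset=7)

print(errors)                               # mismatches in the 6-base overlap
print(is_removed(errors, len(adapter), 1))  # 0.6 errors allowed: not removed
print(is_removed(errors, len(adapter), 2))  # 1.2 errors allowed: removed
```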

Alignment scoring

The scoring scheme can be adjusted separately for detection of barcodes and adapters. This includes alignment match, mismatch, and gap scores; see the program options page. For example, it could make sense to specify a larger score for gaps when the sequencing platform of the data has high indel error rates:

	flexbar -r reads.fasta -a adapters.fasta --adapter-gap -4

If the gap score is set to a value of -4, the score of a gap corresponds to 4 mismatches and can be compensated by 4 matches.
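
The trade-off between gaps, mismatches and matches can be illustrated with a small Python sketch. It assumes a match score of 1 and a mismatch score of -1, which is what the statement above implies but is not confirmed here as the program default, and it scores every gap position with the same linear gap score. The function alignment_score is hypothetical.

```python
def alignment_score(matches: int, mismatches: int, gaps: int,
                    match: int = 1, mismatch: int = -1, gap: int = -4) -> int:
    """Sum up a linear alignment score (assumed scores: match 1, mismatch -1)."""
    return matches * match + mismatches * mismatch + gaps * gap


print(alignment_score(10, 0, 0))  # 10 matches, no errors
print(alignment_score(10, 0, 1))  # one gap lowers the score by 4
print(alignment_score(10, 4, 0))  # same score as with four mismatches
print(alignment_score(4, 0, 1))   # four matches compensate one gap
```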

Filtering and trimming

Flexbar provides several basic read filtering and trimming features.

Filtering uncalled bases

In the first step, reads that contain more uncalled bases than specified are discarded. These reads are not included in further processing steps or in the output. By default, not a single uncalled base is allowed. To allow e.g. 2 uncalled bases per read:

	flexbar -r reads.fastq --max-uncalled 2
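
As a rough Python sketch of this filter, assuming that uncalled bases are represented by the character N (the function name passes_uncalled_filter is hypothetical):

```python
def passes_uncalled_filter(seq: str, max_uncalled: int = 0) -> bool:
    """Keep a read only if it has at most max_uncalled N characters."""
    return seq.upper().count("N") <= max_uncalled


reads = ["ACGTACGT", "ACNTACGT", "NCNTNCGT"]

# With a limit of 2, the read with three uncalled bases is discarded.
kept = [seq for seq in reads if passes_uncalled_filter(seq, max_uncalled=2)]
print(kept)
```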

Trimming of read ends

Trimming a fixed number of bases at left and right read ends is the next step, which is not performed with standard settings. For example, to trim 5 bases at the left side of reads:

	flexbar -r reads.fastq --pre-trim-left 5

Trimming of homopolymers

Specific homopolymers on the left or right end of reads can be trimmed. For example, the following command trims poly(A) and poly(T) tails with a minimum length of 10 and error rate 0.1 on the right side of reads:

	flexbar -r reads.fastq --post-trim-right-hps AT --post-trim-hps-length 10

Trimming to read length

Reads can be trimmed to a certain length by cutting the right side. This step comes after barcoding and adapter removal. It is suited for cases where tools in the downstream analysis require a maximal or even an exact length of reads. For example, to cut the right side of reads such that reads are not longer than 50 bases, issue the following command:

	flexbar -r reads.fastq --post-trim-length 50

Filtering short reads

There are many applications of sequencing data in which reads cannot be processed if they are too short. Therefore, we support filtering of reads based on their length. For example, to discard reads that are shorter than 50 base pairs, set --min-read-length to this number. To make sure that reads have exactly a certain length, use the --post-trim-length option in addition:

	flexbar -r reads.fastq --post-trim-length 50 --min-read-length 50
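
Why the combination of both options yields reads of exactly one length can be seen in a minimal Python sketch. The function trim_and_filter is a hypothetical stand-in for the two steps and works on plain sequence strings.

```python
def trim_and_filter(reads, post_trim_length, min_read_length):
    """Cut reads on the right side, then drop reads that are too short."""
    result = []
    for seq in reads:
        trimmed = seq[:post_trim_length]  # no read is longer than this
        if len(trimmed) >= min_read_length:  # no read is shorter than this
            result.append(trimmed)
    return result


reads = ["A" * 72, "C" * 50, "G" * 31]
out = trim_and_filter(reads, post_trim_length=50, min_read_length=50)
print([len(seq) for seq in out])
```

The read of length 72 is cut to 50 bases, the read of length 50 passes unchanged, and the read of length 31 is discarded.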

Quality-based trimming

Trimming based on phred quality values helps to deal with higher error rates towards the end of reads. Use option --qtrim to choose one of TAIL, WIN, and BWA as trimming mode. The format of the quality scores also has to be chosen. For example, to trim the 3' end until a quality value of 30 or higher is reached (encoded as ASCII code 63 in sanger format), use the command:

	flexbar -r reads.fastq -q TAIL -qf i1.8 -qt 30
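
The relation between quality value and ASCII code, and the idea of trimming from the 3' end, can be sketched in Python. The sketch assumes phred+33 encoding, as used by the sanger format, and follows the description of the TAIL example above. The function tail_trim is hypothetical and not Flexbar's implementation.

```python
SANGER_OFFSET = 33  # phred+33: quality 30 is stored as ASCII code 63 ('?')


def tail_trim(seq: str, qual: str, threshold: int, offset: int = SANGER_OFFSET):
    """Remove bases from the 3' end until one reaches the quality threshold."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < threshold:
        end -= 1
    return seq[:end], qual[:end]


print(chr(30 + SANGER_OFFSET))  # character that encodes quality 30

seq = "ACGTACGTAC"
qual = "IIIIII?5+#"  # qualities 40 (x6), 30, 20, 10, 2
print(tail_trim(seq, qual, threshold=30))
```

The last three bases have qualities below 30 and are removed; trimming stops at the base with quality 30.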

Quality format

Use option --qtrim-format to specify the quality score format of the fastq files for quality-based read trimming. Supported quality scalings are:

sanger
solexa
i1.3 (illumina)
i1.5 (as i1.3)
i1.8 (as sanger)

After removal steps

By default, quality-based trimming is performed before the barcode and adapter processing steps. Use option --qtrim-post-removal to apply it after these steps instead.

Output selection

The option --fasta-output forces the non-quality file format fasta for output. This option is suited if the input format is fastq, and fasta output is preferred. Furthermore, the length distribution of read output files can be inspected by setting the option --length-dist, which leads to the generation of a length distribution file for each read output file.

Compressed output

Output files of reads can be directly compressed using gzip or bzip2. Specify GZ or BZ2 with option --zip-output to enable this feature. The compressed files get the ending .gz or .bz2, respectively.

Standard output

It is possible to send reads to standard output instead of files by setting option --stdout-reads. In this case, the Flexbar output of parameters and statistics that usually uses stdout is written to a file named by target and the ending .log. When barcode based separation of reads is being conducted, read tags get extended by corresponding barcode tags separated by underscore. Paired reads are written in interleaved format.

Single read output

While processing paired reads, it is possible that only one of the two reads of a pair is shorter than the specified minimal read length. In this case, Flexbar also discards the read that is not too short in order to keep the two files of paired reads in sync. The --single-reads option can be used to write these sufficiently long single reads to separate files without losing consistency of the paired read output. Option --single-reads-paired leads to integration of such single reads in pairs, with the symbol N for too short counterparts.
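
The bookkeeping behind this can be sketched in Python. The sketch models only the behaviour described above for --single-reads, where complete pairs stay in sync and surviving single reads are collected separately. The function split_pairs and its return layout are hypothetical; file handling is left out.

```python
def split_pairs(pairs, min_read_length):
    """Separate complete pairs from single reads whose mate is too short."""
    kept_pairs, singles_1, singles_2 = [], [], []
    for read_1, read_2 in pairs:
        ok_1 = len(read_1) >= min_read_length
        ok_2 = len(read_2) >= min_read_length
        if ok_1 and ok_2:
            kept_pairs.append((read_1, read_2))  # paired output stays in sync
        elif ok_1:
            singles_1.append(read_1)  # only the first read is long enough
        elif ok_2:
            singles_2.append(read_2)  # only the second read is long enough
        # if both reads are too short, the whole pair is dropped
    return kept_pairs, singles_1, singles_2


pairs = [
    ("ACGTACGT", "TTGCAAGC"),
    ("ACGTACGT", "TT"),
    ("AC", "TTGCAAGC"),
    ("AC", "TT"),
]
print(split_pairs(pairs, min_read_length=5))
```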

Logging and tagging

Flexbar provides logging and read tagging features for inspection of alignments and to facilitate downstream data analysis. For example, random sequence tags, which make it possible to recognize artifacts that stem from library amplification, are supported.

Logging alignments

Print the optimal sequence alignment for each read and the barcode or adapter with maximal score, if it is valid. Choose ALL to view all such valid alignments, or MOD to view only those used for read modification by sequence removal. A third option, TAB, selects tabular output of alignment statistics. Usage example:

	flexbar -r reads.fastq -a adapters.fasta --align-log ALL

Unique molecular identifiers

Unique molecular identifiers (UMIs) are random sequence elements in reads that make it possible to recognize artifacts and errors, e.g. stemming from amplification by PCR during preparation of the sequencing library. Such UMIs can be captured with Flexbar by specifying the --umi-tags option and the character N at barcode or adapter positions for which the read is supposed to contain the random sequence. The characters at these positions are appended to the read name, separated by underscore.

Given the barcode pattern TGAGATNNNN in the barcodes fasta file to describe a composite barcode and the following read in a fasta file:

	>read
	TGAGATCGTTCAGTACGGCAATCGTATGCCGTCTTC

A Flexbar command for extracting the UMI is for example:

	flexbar -r reads.fasta -b barcodes.fasta --umi-tags

The result contains the read with the variable part of the barcode in the read name:

	>read_CGTT
	CAGTACGGCAATCGTATGCCGTCTTC
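
The tagging logic of this example can be reproduced with a short Python sketch. It assumes that the barcode sits at the very start of the read and matches without gaps or mismatches, which is a simplification of the alignment-based detection. The function extract_umi is hypothetical.

```python
def extract_umi(name: str, seq: str, barcode: str):
    """Collect read bases at N positions of the barcode and tag the read name.

    Returns (tagged_name, remaining_sequence), or None if the fixed barcode
    positions do not match the start of the read.
    """
    prefix = seq[:len(barcode)]
    if len(prefix) < len(barcode):
        return None
    umi = []
    for base, pattern in zip(prefix, barcode):
        if pattern == "N":
            umi.append(base)  # wildcard position: part of the UMI
        elif base != pattern:
            return None  # fixed barcode position does not match
    return f"{name}_{''.join(umi)}", seq[len(barcode):]


result = extract_umi("read", "TGAGATCGTTCAGTACGGCAATCGTATGCCGTCTTC", "TGAGATNNNN")
print(result)
```

For the read above, the tagged name is read_CGTT and the remaining sequence is the read without the 10 barcode positions.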

Further tagging

The --number-tags option triggers replacement of read name tags by an ascending number to save space. The --removal-tags option can be employed to tag reads for which adapter or barcode removal takes place.
