The usage of BWA-PSSM is very similar to that of BWA. Due to some of the underlying differences, howerver, the recommended parameters are different. More information can be found at the official web site:
BWA-PSSM requires the gdsl library. It can be downloaded from:
Once gdsl is installed, BWA-PSSM can be compiled by simply running 'make'.
Converting damaged DNA to a PSSM
For use in ancient DNA sequencing applications, the original fastq file may be converted to a pssm and used for alignment. This approach modifies the probabilities of the bases likely to be degraded:
cat damaged.fastq | python scripts/fastq2wm.py
Note that if the base pairs are base 64, the -q option should be used:
cat damaged.fastq | python scripts/fastq2wm.py -q 64
Within both the 33 and 64 base pair scripts, the maximum length of the read needs to be set. Only reads shorter than the maximum read length are modified by the damage matrix. It is presumed that maximum length reads have not been degraded and should not be modified.
cat damaged.fastq | python scripts/fastq2wm.py --max-length 76
Low quality data
In the examples below, pssm-file can be either a pssm file as generated by fastq2wm33.pl or a regular fastq file. If it is a regular fastq file, it will be converted to a pssm internally:
bwa pssm -m 2000 index-file pssm-file | bwa samse index-file - pssm-file > out.sam
Helicos data
bwa pssm -N -i2 -o5 -O 0.015 -E 0.015 -D 0.03 -m 2000 index-file pssm-file | bwa samse index-file - pssm-file > out.sam
Short explanation of the new parameters
-z This options indicates that the pssm threshold will be the maximum possible score - * maximum-mismatch-penalty, where maximum-mismatch-penalty is equal to the difference between the match and mismatch score of the highest quality base found in the read. In other words, we allow the equivalent of high quality mismatches in the alignment.
-y Similar to -z, except it uses bwa's default mismatch criteria for creating the mismatch allowance. Suppose bwa's method determines that a read of length l should be allowed n mismatches. Then using -y for such a read will be equivalent to using the option -z (n - ). This can be useful in files containing reads of variable length.
-O Specify a probability for gap opens.
-E Specify a probability for gap extensions.
-D Specify a probability for deletions.
-m Indicate the size of the heap for storing partial matches. This is equivalent to the number of recursions allowed in the search tree. Small values lead to fast running times while large ones lead to more sensitivity.
Error model:
An error model can be introduced in the following format.
Each line has qual score and base followed by 4 scores for A, C, G and T. Fields separated by blanks Like this:
0 A 0. 0. 0. 0.
0 C 0. 0. 0. 0.
:
3 A 1.3 -2.1 -3.1 -2.0
3 C -1.9 2.0 -2.1 -1.8
:
:
50 T ...
The first column corresponds to the position. The second to the base for which the scores are entered. The final four columns correspond to the score for that row of the pssm. So the 6th column of the example above corresponds to a score of 0. for encountering a T, if there was an A present in the read.
If a certain position does not have additional information from the error model, then its PSSM scores are unchanged from the values calculated from the quality score.