Read Sequence
In Eoulsan, a ReadSequence is an object that contains information about a read. It extends the Sequence class and adds the following information:
- a FASTQ format
- a quality sequence
The Javadoc of the ReadSequence
class is available here.
// Create a read with default values (id = 0, string values are null, the alphabet is
// AMBIGUOUS_DNA_ALPHABET and the FASTQ format is Sanger)
ReadSequence r1 = new ReadSequence();
// Create a read and set the id, name, sequence and quality
ReadSequence r2 = new ReadSequence(1, "read1", "GATCGGAAGAG", "???D1ADD?D;");
// Here we also set the FASTQ format
ReadSequence r3 = new ReadSequence(2, "read1", "GATCGGAAGAG", "???D1ADD?D;", FastqFormat.FASTQ_SANGER);
ReadSequence r1 = new ReadSequence();
r1.getFastqFormat(); // FastqFormat.FASTQ_SANGER
r1.setSequence("ATGC");
r1.setQuality("???D");
r1.getSequence(); // ATGC
r1.getQuality(); // ???D
The validate() method of a ReadSequence object also checks that every character of the quality string is allowed by the FASTQ format and that the quality string has the same length as the sequence.
ReadSequence r1 = new ReadSequence(1, "read1", "GATCGGAAGAG", "???D1ADD?D;");
r1.validate(); // true
ReadSequence r2 = new ReadSequence(2, "read1", "GATCGGAAGAG", "");
r2.validate(); // false
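The two extra checks described above can be sketched in plain Java, without Eoulsan. This is only an illustration: the class name `ReadValidationSketch` is hypothetical, and the allowed character range (ASCII 33 to 126 for the Sanger format) is an assumption, not necessarily the range Eoulsan enforces.

```java
/** Hypothetical stand-in for the checks described above; this is not Eoulsan code. */
final class ReadValidationSketch {

  // Assumed printable range for Sanger (Phred+33) quality characters
  private static final char MIN_QUALITY_CHAR = '!'; // ASCII 33
  private static final char MAX_QUALITY_CHAR = '~'; // ASCII 126

  /** Returns true if the quality string is consistent with the sequence. */
  static boolean isValid(final String sequence, final String quality) {
    if (sequence == null || quality == null) {
      return false;
    }
    // The quality string must cover every base exactly once
    if (sequence.isEmpty() || sequence.length() != quality.length()) {
      return false;
    }
    // Every quality character must be allowed by the (assumed) format range
    for (int i = 0; i < quality.length(); i++) {
      final char c = quality.charAt(i);
      if (c < MIN_QUALITY_CHAR || c > MAX_QUALITY_CHAR) {
        return false;
      }
    }
    return true;
  }

  public static void main(final String[] args) {
    System.out.println(isValid("GATCGGAAGAG", "???D1ADD?D;")); // true
    System.out.println(isValid("GATCGGAAGAG", ""));            // false
  }
}
```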
There are many FASTQ formats used by sequencers. The FastqFormat enum in Eoulsan handles the following ones:
- Sanger/Illumina 1.8 (Phred+33, raw reads typically 0 to 40)
- Solexa (Solexa+64, raw reads typically -5 to 40)
- Illumina 1.3 (Phred+64, raw reads typically 0 to 40)
- Illumina 1.5 (Phred+64, raw reads typically 3 to 40)
For more information about FASTQ format see the Wikipedia page about FASTQ.
ReadSequence r1 = new ReadSequence(1, "read1", "GATCGGAAGAG", "???D1ADD?D;");
// Get the quality scores as an array of integers
int[] qualities = r1.qualityScores();
// Get the probability that the corresponding base call is incorrect
// as an array of doubles
double[] probabilities = r1.errorProbabilities();
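To make the meaning of these two arrays concrete, here is a minimal sketch of the standard Phred arithmetic for the Sanger format: the score is the character code minus 33, and the error probability is 10^(-score/10). The class `PhredSketch` is hypothetical and only mirrors the method names used above; it does not show how Eoulsan computes the values internally.

```java
/** Hypothetical illustration of Phred+33 decoding; this is not Eoulsan code. */
final class PhredSketch {

  // ASCII offset of the Sanger / Illumina 1.8 format
  static final int SANGER_OFFSET = 33;

  /** Decodes each quality character into its Phred score. */
  static int[] qualityScores(final String quality) {
    final int[] scores = new int[quality.length()];
    for (int i = 0; i < scores.length; i++) {
      scores[i] = quality.charAt(i) - SANGER_OFFSET;
    }
    return scores;
  }

  /** Converts each Phred score Q into the error probability 10^(-Q/10). */
  static double[] errorProbabilities(final String quality) {
    final int[] scores = qualityScores(quality);
    final double[] probabilities = new double[scores.length];
    for (int i = 0; i < scores.length; i++) {
      probabilities[i] = Math.pow(10.0, -scores[i] / 10.0);
    }
    return probabilities;
  }

  public static void main(final String[] args) {
    // '?' is ASCII 63, so its Phred score is 30 (error probability about 0.001)
    System.out.println(qualityScores("???D1ADD?D;")[0]);
    System.out.println(errorProbabilities("???D1ADD?D;")[0]);
  }
}
```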
You can easily convert a quality string from one FASTQ format to another:
String solexaQuality = ";<=>?@ABCDEFGHI";
String sangerQuality = FastqFormat.FASTQ_SOLEXA.convertTo(solexaQuality, FastqFormat.FASTQ_SANGER);
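For readers curious about what such a conversion involves, the sketch below applies the usual Solexa-to-Phred formula, Q_phred = 10 * log10(10^(Q_solexa/10) + 1), and re-encodes the result with the Sanger offset. The class `SolexaToSangerSketch` is hypothetical, and rounding to the nearest integer is an assumption; Eoulsan's convertTo() may round differently.

```java
/** Hypothetical illustration of a Solexa+64 to Phred+33 conversion; this is not Eoulsan code. */
final class SolexaToSangerSketch {

  static final int SOLEXA_OFFSET = 64;
  static final int SANGER_OFFSET = 33;

  /** Converts a Solexa-encoded quality string into a Sanger-encoded one. */
  static String convert(final String solexaQuality) {
    final StringBuilder sb = new StringBuilder(solexaQuality.length());
    for (int i = 0; i < solexaQuality.length(); i++) {
      final int solexaScore = solexaQuality.charAt(i) - SOLEXA_OFFSET;
      // Solexa scores are log-odds; convert them to Phred scores
      final double phred =
          10.0 * Math.log10(Math.pow(10.0, solexaScore / 10.0) + 1.0);
      // Assumption: round to the nearest integer before re-encoding
      sb.append((char) (Math.round(phred) + SANGER_OFFSET));
    }
    return sb.toString();
  }

  public static void main(final String[] args) {
    System.out.println(convert(";<=>?@ABCDEFGHI"));
  }
}
```

High Solexa scores are almost identical to Phred scores, so the conversion mostly matters for low-quality bases.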
The subsequence and concat methods of the Sequence object also work with ReadSequence objects. In this case, the operation applies to both the sequence and the quality string.
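The point of this behaviour is that bases and quality characters stay aligned. A minimal sketch of the idea follows; the class `ReadSliceSketch` and its method signatures are hypothetical, and returning new objects rather than modifying the read in place is an assumption, not a statement about Eoulsan's API.

```java
/** Hypothetical read holder that keeps bases and qualities aligned; this is not Eoulsan code. */
final class ReadSliceSketch {

  final String sequence;
  final String quality;

  ReadSliceSketch(final String sequence, final String quality) {
    if (sequence.length() != quality.length()) {
      throw new IllegalArgumentException("sequence and quality lengths differ");
    }
    this.sequence = sequence;
    this.quality = quality;
  }

  /** Extracts [start, end) from both the sequence and the quality string. */
  ReadSliceSketch subSequence(final int start, final int end) {
    return new ReadSliceSketch(
        this.sequence.substring(start, end), this.quality.substring(start, end));
  }

  /** Appends another read, concatenating sequences and qualities together. */
  ReadSliceSketch concat(final ReadSliceSketch other) {
    return new ReadSliceSketch(
        this.sequence + other.sequence, this.quality + other.quality);
  }

  public static void main(final String[] args) {
    final ReadSliceSketch r = new ReadSliceSketch("GATCGGAAGAG", "???D1ADD?D;");
    final ReadSliceSketch head = r.subSequence(0, 4);
    System.out.println(head.sequence + " " + head.quality); // GATC ???D
  }
}
```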
The IlluminaReadId class parses the identifiers of reads generated by an Illumina sequencer.
// Create an IlluminaReadId from a string
IlluminaReadId irid = new IlluminaReadId("@HWUSI-EAS100R:6:73:941:1973#0/1");
ReadSequence r = new ReadSequence(1, "EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG",
"GATTTGGGGTTCAAAGCAGT",
"!''*((((***+))%%%++)");
// An IlluminaReadId object can be reused
// The parse() methods and the constructor accept either a String or a ReadSequence as argument
irid.parse(r);
Once an Illumina identifier has been parsed, all the fields inside the identifier can be accessed:
irid.getInstrumentId(); // EAS139
irid.getSequenceIndex(); // ATCACG
irid.isFiltered(); // true
irid.getPairMember(); // 1
All the fields of the Illumina id can be accessed; see the Javadoc of IlluminaReadId for more information.
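To show where these fields come from, the sketch below splits an identifier in the Casava 1.8 layout (instrument:run:flowcell:lane:tile:x:y read:filtered:control:index), which is the layout of the second identifier used above. The class `IlluminaIdSketch` is hypothetical and handles only this layout; the older `#0/1` style of the first identifier is not covered, and Eoulsan's own parser is certainly more complete.

```java
/** Hypothetical parser for Casava 1.8 style read ids; this is not Eoulsan code. */
final class IlluminaIdSketch {

  final String instrumentId;
  final int lane;
  final int pairMember;
  final boolean filtered;
  final String sequenceIndex;

  IlluminaIdSketch(final String readId) {
    // Drop the leading '@' of a FASTQ header line if present
    final String id = readId.startsWith("@") ? readId.substring(1) : readId;

    // The two halves of the identifier are separated by a single space
    final String[] parts = id.split(" ");
    if (parts.length != 2) {
      throw new IllegalArgumentException("Unsupported id layout: " + readId);
    }
    final String[] left = parts[0].split(":");
    final String[] right = parts[1].split(":");
    if (left.length != 7 || right.length != 4) {
      throw new IllegalArgumentException("Unsupported id layout: " + readId);
    }

    this.instrumentId = left[0];
    this.lane = Integer.parseInt(left[3]);
    this.pairMember = Integer.parseInt(right[0]);
    // 'Y' means the read was flagged by the sequencer's filter
    this.filtered = "Y".equals(right[1]);
    this.sequenceIndex = right[3];
  }

  public static void main(final String[] args) {
    final IlluminaIdSketch id =
        new IlluminaIdSketch("EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG");
    System.out.println(id.instrumentId);  // EAS139
    System.out.println(id.sequenceIndex); // ATCACG
  }
}
```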
Convert a ReadSequence
to FASTQ:
ReadSequence r = new ReadSequence(1, "read1", "GATCGGAAGAG", "???D1ADD?D;");
// Print the ReadSequence in FASTQ format
System.out.println(r.toFastq());
@read1
GATCGGAAGAG
+
???D1ADD?D;
// Print the ReadSequence in FASTQ format with a repeat of the id on the third line
System.out.println(r.toFastq(true));
@read1
GATCGGAAGAG
+read1
???D1ADD?D;
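The four-line layout shown in these outputs can be reproduced with a few lines of plain Java. The sketch below is only an illustration: `FastqWriteSketch` is a hypothetical class, and the absence of a trailing newline is an assumption rather than a documented property of toFastq().

```java
/** Hypothetical FASTQ formatter; this is not Eoulsan code. */
final class FastqWriteSketch {

  /**
   * Builds a four-line FASTQ entry.
   * @param repeatId if true, the read name is repeated after the '+' on the third line
   */
  static String toFastq(final String name, final String sequence,
      final String quality, final boolean repeatId) {
    return "@" + name + "\n"
        + sequence + "\n"
        + "+" + (repeatId ? name : "") + "\n"
        + quality;
  }

  public static void main(final String[] args) {
    System.out.println(toFastq("read1", "GATCGGAAGAG", "???D1ADD?D;", false));
    System.out.println(toFastq("read1", "GATCGGAAGAG", "???D1ADD?D;", true));
  }
}
```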
Convert a ReadSequence to TFQ format. TFQ is a format in which all the information about a read is on a single line, with fields separated by tabs.
ReadSequence r = new ReadSequence(1, "read1", "GATCGGAAGAG", "???D1ADD?D;");
// Print the ReadSequence in TFQ format
System.out.println(r.toTFQ());
read1 GATCGGAAGAG ???D1ADD?D;
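Because a TFQ line is simply three tab-separated fields, it is easy to write and read back. The sketch below illustrates this with a hypothetical `TfqSketch` class; the field order (name, sequence, quality) follows the output above, and the parsing helper is an addition for illustration that does not correspond to a known Eoulsan method.

```java
/** Hypothetical TFQ reader and writer; this is not Eoulsan code. */
final class TfqSketch {

  /** Joins name, sequence and quality with tabs on a single line. */
  static String toTfq(final String name, final String sequence, final String quality) {
    return name + "\t" + sequence + "\t" + quality;
  }

  /** Splits a TFQ line back into {name, sequence, quality}. */
  static String[] parseTfq(final String line) {
    // The -1 limit keeps empty trailing fields instead of dropping them
    final String[] fields = line.split("\t", -1);
    if (fields.length != 3) {
      throw new IllegalArgumentException("Expected 3 tab-separated fields: " + line);
    }
    return fields;
  }

  public static void main(final String[] args) {
    final String line = toTfq("read1", "GATCGGAAGAG", "???D1ADD?D;");
    System.out.println(line);
    System.out.println(parseTfq(line)[1]); // GATCGGAAGAG
  }
}
```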
Parsing a FASTQ entry:
String s = "@read1\nGATCGGAAGAG\n+\n???D1ADD?D;";
ReadSequence r = new ReadSequence();
r.parseFastQ(s);
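As a rough idea of what parsing such an entry involves, the sketch below splits the four lines and checks the '@' and '+' markers. The class `FastqParseSketch` is hypothetical and deliberately strict (exactly four lines, no multi-line sequences); it is not a description of how parseFastQ() is implemented.

```java
/** Hypothetical four-line FASTQ entry parser; this is not Eoulsan code. */
final class FastqParseSketch {

  final String name;
  final String sequence;
  final String quality;

  private FastqParseSketch(final String name, final String sequence, final String quality) {
    this.name = name;
    this.sequence = sequence;
    this.quality = quality;
  }

  /** Parses one entry made of exactly four newline-separated lines. */
  static FastqParseSketch parse(final String entry) {
    final String[] lines = entry.split("\n");
    if (lines.length != 4
        || !lines[0].startsWith("@")
        || !lines[2].startsWith("+")) {
      throw new IllegalArgumentException("Not a four-line FASTQ entry");
    }
    // Remove the leading '@' to get the read name
    return new FastqParseSketch(lines[0].substring(1), lines[1], lines[3]);
  }

  public static void main(final String[] args) {
    final FastqParseSketch r = parse("@read1\nGATCGGAAGAG\n+\n???D1ADD?D;");
    System.out.println(r.name);     // read1
    System.out.println(r.sequence); // GATCGGAAGAG
    System.out.println(r.quality);  // ???D1ADD?D;
  }
}
```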