Laurent Jourdren edited this page Apr 11, 2017 · 3 revisions

Handling reads

WARNING: This documentation is outdated and will soon be updated.

Introduction

In Eoulsan, a ReadSequence is an object that contains information about a read. It extends the Sequence class and adds the following information:

  • a FASTQ format
  • a quality sequence

The Javadoc of the ReadSequence class is available here.

Creating a ReadSequence

// Create a read with default values (id=0, string values are null, the alphabet is AMBIGUOUS_DNA_ALPHABET
// and the FASTQ format is Sanger)
ReadSequence r1 = new ReadSequence();

// Create a read and set the id, name, sequence and quality string
ReadSequence r2 = new ReadSequence(1, "read1", "GATCGGAAGAG", "???D1ADD?D;");

// Here we also set the FASTQ format
ReadSequence r3 = new ReadSequence(2, "read1", "GATCGGAAGAG", "???D1ADD?D;", FastqFormat.FASTQ_SANGER);

Getters and setters


ReadSequence r1 = new ReadSequence();
r1.getFastqFormat();  // FastqFormat.FASTQ_SANGER

r1.setSequence("ATGC");
r1.setQuality("???D");
r1.getSequence(); // ATGC
r1.getQuality(); // ???D

Validation method

The validate() method of a ReadSequence object also checks that every character of the quality string is allowed by the FASTQ format and that the quality string has the same length as the sequence.


ReadSequence r1 = new ReadSequence(1, "read1", "GATCGGAAGAG", "???D1ADD?D;");
r1.validate(); // true

ReadSequence r2 = new ReadSequence(2, "read1", "GATCGGAAGAG", "");
r2.validate(); // false
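
The two quality checks described above can be illustrated with a small standalone sketch. It does not use Eoulsan: the class and method names are hypothetical, and it assumes the Sanger format, where quality characters range from ASCII 33 ('!') to 126 ('~'). The real validate() method may perform additional checks.

```java
// Standalone sketch, not Eoulsan code: class and method names are hypothetical.
final class ReadValidationSketch {

  // Sanger (Phred+33) quality characters span ASCII 33 ('!') to 126 ('~').
  private static final char MIN_QUALITY_CHAR = '!';
  private static final char MAX_QUALITY_CHAR = '~';

  /** Returns true if the quality string has the sequence length and a valid alphabet. */
  static boolean isValid(String sequence, String quality) {
    if (sequence == null || quality == null) {
      return false;
    }
    // Check 1: one quality character per base
    if (sequence.length() != quality.length()) {
      return false;
    }
    // Check 2: every quality character is allowed by the format
    for (int i = 0; i < quality.length(); i++) {
      char c = quality.charAt(i);
      if (c < MIN_QUALITY_CHAR || c > MAX_QUALITY_CHAR) {
        return false;
      }
    }
    return true;
  }
}
```

With this sketch, the first read of the example above passes both checks, while the second fails the length check because its quality string is empty.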

FASTQ formats

There are many FASTQ formats used by sequencers. The FastqFormat enum in Eoulsan handles the following formats:

  • Sanger/Illumina 1.8 (Phred+33, raw reads typically 0 to 40)
  • Solexa (Solexa+64, raw reads typically -5 to 40)
  • Illumina 1.3 (Phred+64, raw reads typically 0 to 40)
  • Illumina 1.5 (Phred+64, raw reads typically 3 to 40)

For more information about FASTQ format see the Wikipedia page about FASTQ.

ReadSequence r1 = new ReadSequence(1, "read1", "GATCGGAAGAG", "???D1ADD?D;");

// Get the quality score as an array of integers
int[] qualities = r1.qualityScores();

// Get the probability that the corresponding base call is incorrect
// as an array of double
double[] probabilities = r1.errorProbabilities();
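
To make the relation between quality characters, scores and error probabilities concrete, here is a minimal standalone sketch. It assumes a Phred+33 (Sanger) encoding and the standard Phred relation p = 10^(-Q/10). The class name is hypothetical and the code is independent of the Eoulsan implementation.

```java
// Standalone sketch, not Eoulsan code: the class name is hypothetical.
final class PhredSketch {

  // ASCII offset of the Sanger (Phred+33) encoding
  private static final int SANGER_OFFSET = 33;

  /** Decodes each quality character into its Phred score. */
  static int[] qualityScores(String quality) {
    int[] scores = new int[quality.length()];
    for (int i = 0; i < scores.length; i++) {
      scores[i] = quality.charAt(i) - SANGER_OFFSET;
    }
    return scores;
  }

  /** Converts each Phred score Q into the error probability 10^(-Q/10). */
  static double[] errorProbabilities(String quality) {
    int[] scores = qualityScores(quality);
    double[] probabilities = new double[scores.length];
    for (int i = 0; i < scores.length; i++) {
      probabilities[i] = Math.pow(10.0, -scores[i] / 10.0);
    }
    return probabilities;
  }
}
```

For instance, the character '?' (ASCII 63) decodes to a score of 30, which corresponds to an error probability of 0.001.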

You can easily convert a quality string from one FASTQ format to another:

String solexaQuality = ";<=>?@ABCDEFGHI";
String sangerQuality = FastqFormat.FASTQ_SOLEXA.convertTo(solexaQuality, FastqFormat.FASTQ_SANGER);
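
As an illustration of what such a conversion involves, the sketch below converts a Solexa+64 quality string to Phred+33 with the usual relation Q_phred = 10 * log10(10^(Q_solexa/10) + 1), rounded to the nearest integer. The class name is hypothetical, and the rounding used by Eoulsan's convertTo() may differ from this assumption.

```java
// Standalone sketch, not Eoulsan code: the class name is hypothetical.
final class SolexaToSangerSketch {

  private static final int SOLEXA_OFFSET = 64;
  private static final int SANGER_OFFSET = 33;

  /** Converts a Solexa+64 quality string to a Sanger (Phred+33) quality string. */
  static String convert(String solexaQuality) {
    StringBuilder sb = new StringBuilder(solexaQuality.length());
    for (int i = 0; i < solexaQuality.length(); i++) {
      int solexaScore = solexaQuality.charAt(i) - SOLEXA_OFFSET;
      // Solexa scores are based on odds p/(1-p), Phred scores on p itself
      double phred = 10.0 * Math.log10(Math.pow(10.0, solexaScore / 10.0) + 1.0);
      sb.append((char) (Math.round(phred) + SANGER_OFFSET));
    }
    return sb.toString();
  }
}
```

The two scales differ mostly for low scores: a Solexa score of -5 maps to a Phred score of 1, while a Solexa score of 40 stays at 40.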

ReadSequence manipulation

The subsequence and concat methods of the Sequence class also work with ReadSequence objects. In this case, both the sequence and the quality string are modified.
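
The idea of keeping the bases and their qualities in sync can be sketched as follows. This is a standalone illustration with a hypothetical class. It assumes half-open, zero-based indexes as in String.substring() and returns new objects, which may not match the conventions of the Eoulsan methods.

```java
// Standalone sketch, not Eoulsan code: the class and its index convention are assumptions.
final class ReadSliceSketch {

  final String sequence;
  final String quality;

  ReadSliceSketch(String sequence, String quality) {
    if (sequence.length() != quality.length()) {
      throw new IllegalArgumentException("sequence and quality lengths differ");
    }
    this.sequence = sequence;
    this.quality = quality;
  }

  /** Extracts bases [start, end) together with the matching quality characters. */
  ReadSliceSketch subSequence(int start, int end) {
    return new ReadSliceSketch(
        sequence.substring(start, end), quality.substring(start, end));
  }

  /** Appends another read: bases and qualities are concatenated in the same order. */
  ReadSliceSketch concat(ReadSliceSketch other) {
    return new ReadSliceSketch(sequence + other.sequence, quality + other.quality);
  }
}
```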

Parsing Illumina ids

The IlluminaReadId class parses the identifiers of reads generated by an Illumina sequencer.


// Create an IlluminaReadId from a string
IlluminaReadId irid = new IlluminaReadId("@HWUSI-EAS100R:6:73:941:1973#0/1");

ReadSequence r = new ReadSequence(1, "EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG",
               "GATTTGGGGTTCAAAGCAGT",
               "!''*((((***+))%%%++)");

// An IlluminaReadId object can be reused
// The parse() method and the constructor accept either a string or a ReadSequence as argument
irid.parse(r);

Once an Illumina identifier has been parsed, all the fields inside the identifier can be accessed:


irid.getInstrumentId(); // EAS139
irid.getSequenceIndex(); // ATCACG
irid.isFiltered(); // true
irid.getPairMember(); // 1

All the fields of the Illumina id can be accessed; see the Javadoc of IlluminaReadId for more information.
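
To show where these fields come from, here is a minimal standalone sketch that splits an identifier such as the one used above. It assumes the layout instrument:run:flowcell:lane:tile:x:y followed by a space and read:filtered:control:index, and it does not handle the older "#0/1" style. The class and field names are hypothetical, and this is not the IlluminaReadId implementation.

```java
// Standalone sketch, not Eoulsan code: class and field names are hypothetical.
final class IlluminaIdSketch {

  final String instrumentId;
  final String flowCellId;
  final int lane;
  final int pairMember;
  final boolean filtered;
  final String sequenceIndex;

  IlluminaIdSketch(String id) {
    // Drop the leading '@' of a FASTQ header line, if present
    String s = id.startsWith("@") ? id.substring(1) : id;

    String[] parts = s.split(" ");
    if (parts.length != 2) {
      throw new IllegalArgumentException("Unsupported identifier: " + id);
    }

    String[] location = parts[0].split(":"); // instrument:run:flowcell:lane:tile:x:y
    String[] flags = parts[1].split(":"); // read:filtered:control:index
    if (location.length != 7 || flags.length != 4) {
      throw new IllegalArgumentException("Unsupported identifier: " + id);
    }

    this.instrumentId = location[0];
    this.flowCellId = location[2];
    this.lane = Integer.parseInt(location[3]);
    this.pairMember = Integer.parseInt(flags[0]);
    this.filtered = "Y".equals(flags[1]);
    this.sequenceIndex = flags[3];
  }
}
```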

FASTQ export and parsing

Convert a ReadSequence to FASTQ:

ReadSequence r = new ReadSequence(1, "read1", "GATCGGAAGAG", "???D1ADD?D;");

// Print the ReadSequence in FASTQ format
System.out.println(r.toFastq());
@read1
GATCGGAAGAG
+
???D1ADD?D;

// Print the ReadSequence in FASTQ format with a repeat of the id on the third line
System.out.println(r.toFastq(true));
@read1
GATCGGAAGAG
+read1
???D1ADD?D;
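
The four-line layout printed above can be reproduced with a few lines of plain Java. The sketch below is independent of Eoulsan: the class and method names are hypothetical, and it assumes that no line terminator follows the quality line.

```java
// Standalone sketch, not Eoulsan code: class and method names are hypothetical.
final class FastqWriterSketch {

  /**
   * Builds a four-line FASTQ entry.
   * When repeatId is true, the read name is repeated after the '+' separator.
   */
  static String toFastq(String name, String sequence, String quality, boolean repeatId) {
    return "@" + name + "\n"
        + sequence + "\n"
        + "+" + (repeatId ? name : "") + "\n"
        + quality;
  }
}
```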

Convert a ReadSequence to TFQ format. TFQ is a format where all the information about a read is on a single line, with fields separated by tab characters.

ReadSequence r = new ReadSequence(1, "read1", "GATCGGAAGAG", "???D1ADD?D;");

// Print the ReadSequence in TFQ format
System.out.println(r.toTFQ());
read1	GATCGGAAGAG	???D1ADD?D;
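
Because a TFQ line holds exactly three tab-separated fields (name, sequence, quality), writing and reading it is straightforward. The following standalone sketch illustrates the layout; the class and method names are hypothetical and it does not rely on Eoulsan.

```java
// Standalone sketch, not Eoulsan code: class and method names are hypothetical.
final class TfqSketch {

  /** Joins name, sequence and quality with tab characters. */
  static String toTfq(String name, String sequence, String quality) {
    return String.join("\t", name, sequence, quality);
  }

  /** Splits a TFQ line back into {name, sequence, quality}. */
  static String[] parseTfq(String line) {
    // The -1 limit keeps trailing empty fields, so malformed lines are detected
    String[] fields = line.split("\t", -1);
    if (fields.length != 3) {
      throw new IllegalArgumentException("Expected 3 tab-separated fields: " + line);
    }
    return fields;
  }
}
```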

Parsing a FASTQ entry:

String s = "@read1\nGATCGGAAGAG\n+\n???D1ADD?D;";
ReadSequence r = new ReadSequence();
r.parseFastQ(s);
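
For readers who want to see what parsing such an entry involves, here is a minimal standalone sketch. It assumes a four-line entry with an unwrapped sequence and "\n" line endings. The class and method names are hypothetical, and the checks performed by parseFastQ() in Eoulsan may differ.

```java
// Standalone sketch, not Eoulsan code: class and method names are hypothetical.
final class FastqParserSketch {

  /** Parses a four-line FASTQ entry into {name, sequence, quality}. */
  static String[] parse(String entry) {
    String[] lines = entry.split("\n");
    if (lines.length != 4) {
      throw new IllegalArgumentException("A FASTQ entry must have 4 lines");
    }
    if (!lines[0].startsWith("@") || !lines[2].startsWith("+")) {
      throw new IllegalArgumentException("Missing '@' header or '+' separator");
    }
    if (lines[1].length() != lines[3].length()) {
      throw new IllegalArgumentException("Sequence and quality lengths differ");
    }
    // Strip the leading '@' from the header to get the read name
    return new String[] {lines[0].substring(1), lines[1], lines[3]};
  }
}
```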