Laurent Jourdren edited this page Apr 11, 2017 · 2 revisions

Handling sequence

Introduction

A Sequence in Eoulsan is an object that contains the following information:

  • A numerical id
  • A name
  • A description
  • An alphabet
  • A sequence

The Javadoc of the Sequence class is available here.

Create a sequence


// Create a sequence object with default values (id=0, string values are null and the alphabet is AMBIGUOUS_DNA_ALPHABET)
Sequence s1 = new Sequence();

// Create a sequence and set the id, name and sequence
Sequence s2 = new Sequence(1, "chr1", "ATGCATGC");

// Here we also add the description
Sequence s3 = new Sequence(1, "chr1", "ATGCATGC", "a test sequence");

Getters and setters

There are getters and setters for all the members of the class (id, name, description, alphabet and sequence). The default setters do not validate the value they are given. String values are always trimmed before being stored.


Sequence s = new Sequence();

s.setId(5);
int id = s.getId(); // 5
s.setId(-14785);
id = s.getId(); // -14785

s.setName("chr2");
String chr = s.getName(); // chr2

There are two special setters that check the value before changing the member of the class. They return true if the value was accepted and false otherwise.


boolean r1 = s.setNameWithValidation(null); // false
boolean r2 = s.setNameWithValidation("     "); // false
boolean r3 = s.setNameWithValidation("chr1"); // true

boolean r4 = s.setSequenceWithValidation("ATAGQ"); // false with AMBIGUOUS_DNA_ALPHABET_LETTERS
boolean r5 = s.setSequenceWithValidation("ATAGA"); // true with AMBIGUOUS_DNA_ALPHABET_LETTERS

Alphabets

The alphabet system used by Sequence objects handles several IUPAC alphabets and a read alphabet with the letters "ATGCN". The default alphabets are available in the Alphabets class:

  • Alphabets.AMBIGUOUS_DNA_ALPHABET
  • Alphabets.UNAMBIGUOUS_DNA_ALPHABET
  • Alphabets.AMBIGUOUS_RNA_ALPHABET
  • Alphabets.UNAMBIGUOUS_RNA_ALPHABET
  • Alphabets.READ_DNA_ALPHABET

s.setAlphabet(Alphabets.AMBIGUOUS_DNA_ALPHABET);

To get the reverse complement of a sequence, use the reverseComplement() method, either on a Sequence object or as a static method applied to a string:


Sequence s = new Sequence();
s.setSequence("ATGC");
String seq1 = s.getSequence(); // ATGC
s.reverseComplement();
String seq2 = s.getSequence(); // GCAT

// Static variant that works directly on a string
String seq3 = "ATGC";
String seq4 = Sequence.reverseComplement(seq3, Alphabets.AMBIGUOUS_DNA_ALPHABET); // GCAT
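
To make explicit what a reverse complement computes, here is a minimal standalone sketch that does not use Eoulsan. The class name ReverseComplementSketch and its complement table are hypothetical, and the table is deliberately limited to the letters A, T, G, C and N; it is not the implementation behind Sequence.reverseComplement().

```java
import java.util.HashMap;
import java.util.Map;

// Standalone illustration only; this is NOT Eoulsan's implementation.
final class ReverseComplementSketch {

  // Assumed, simplified complement table (read alphabet "ATGCN" only).
  private static final Map<Character, Character> COMPLEMENT = new HashMap<>();

  static {
    COMPLEMENT.put('A', 'T');
    COMPLEMENT.put('T', 'A');
    COMPLEMENT.put('G', 'C');
    COMPLEMENT.put('C', 'G');
    COMPLEMENT.put('N', 'N');
  }

  // Walk the sequence from its last letter to its first,
  // appending the complement of each letter.
  static String reverseComplement(String sequence) {
    if (sequence == null) {
      return null;
    }
    StringBuilder sb = new StringBuilder(sequence.length());
    for (int i = sequence.length() - 1; i >= 0; i--) {
      char c = sequence.charAt(i);
      Character comp = COMPLEMENT.get(c);
      if (comp == null) {
        throw new IllegalArgumentException("Unsupported letter: " + c);
      }
      sb.append(comp);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(reverseComplement("ATGC")); // GCAT
    System.out.println(reverseComplement("AACG")); // CGTT
  }
}
```

A full IUPAC alphabet would also need pairs such as R/Y and K/M, which is why the real method takes an alphabet argument.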

Sequence Validation

The Sequence class defines a validate() method that checks that the name of the sequence is valid (not null and length > 0) and that the sequence string is valid (not null, length > 0, and every character allowed by the alphabet of the sequence).

new Sequence(0, "seq1", "ATCG").validate(); // true
new Sequence(0, "seq1", "").validate(); // false
new Sequence(0, "seq1", "ATCQ").validate(); // false
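
The rules listed above can be sketched as a small standalone check. This is an illustration under stated assumptions, not the code of validate(): the class SequenceValidationSketch is hypothetical, the alphabet is passed as a plain string of allowed letters, and the letter set shown is the usual IUPAC ambiguous DNA set in upper case only.

```java
// Standalone illustration only; this is NOT Eoulsan's implementation.
final class SequenceValidationSketch {

  // Assumed letter set for the IUPAC ambiguous DNA alphabet (upper case only).
  static final String AMBIGUOUS_DNA_LETTERS = "ACGTRYSWKMBDHVN";

  // Returns true when the name is non-empty and every letter of the
  // sequence belongs to the allowed letters.
  static boolean isValid(String name, String sequence, String allowedLetters) {
    // Name: not null and not empty (trimmed, as the setters trim strings)
    if (name == null || name.trim().isEmpty()) {
      return false;
    }
    // Sequence: not null and not empty
    if (sequence == null || sequence.isEmpty()) {
      return false;
    }
    // Every character must be allowed by the alphabet
    for (int i = 0; i < sequence.length(); i++) {
      if (allowedLetters.indexOf(sequence.charAt(i)) < 0) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println(isValid("seq1", "ATCG", AMBIGUOUS_DNA_LETTERS)); // true
    System.out.println(isValid("seq1", "", AMBIGUOUS_DNA_LETTERS));     // false
    System.out.println(isValid("seq1", "ATCQ", AMBIGUOUS_DNA_LETTERS)); // false
  }
}
```

Whether the real alphabets accept lower case letters is not covered here.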

Sequence manipulation

You can get the length of a sequence with the length() method:

Sequence s = new Sequence();
s.setSequence("ATGC");
s.length(); // 4

We can also extract a part of the sequence:

Sequence s = new Sequence();
s.setSequence("ATGCATGC");
s.subSequence(2,4); // Note that index starts at 0
s.getSequence(); // GCA

And concatenate sequences:

Sequence s1 = new Sequence();
Sequence s2 = new Sequence();
s1.setSequence("AATT");
s2.setSequence("GGCC");
Sequence s3 = s1.concat(s2);
s3.getSequence(); // AATTGGCC

FASTA export and parsing

Convert a sequence to a FASTA string:


Sequence s = new Sequence(1, "myseq", "ATGCATGCATGC");
System.out.println(s.toFasta());
// >myseq
// ATGCATGCATGC

System.out.println(s.toFasta(4));
// >myseq
// ATGCA
// TGCAT
// GC
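
For readers who want to see how line wrapping in a FASTA entry can be produced, here is a minimal standalone sketch. The class FastaWriterSketch is hypothetical, and its width parameter is taken to be the maximum number of letters per line. That is an assumption of this sketch and may not match how toFasta(int) interprets its argument.

```java
// Standalone illustration only; this is NOT Eoulsan's implementation.
final class FastaWriterSketch {

  // Builds a FASTA entry: a ">name" header line followed by the
  // sequence cut into lines of at most `width` letters.
  static String toFasta(String name, String sequence, int width) {
    if (width <= 0) {
      throw new IllegalArgumentException("width must be > 0");
    }
    StringBuilder sb = new StringBuilder();
    sb.append('>').append(name).append('\n');
    for (int pos = 0; pos < sequence.length(); pos += width) {
      int end = Math.min(pos + width, sequence.length());
      sb.append(sequence, pos, end).append('\n');
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // Prints the header, then ATGCA, TGCAT and GC on separate lines
    System.out.print(toFasta("myseq", "ATGCATGCATGC", 5));
  }
}
```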

Parsing a FASTA sequence:

Sequence s = new Sequence();
s.getName(); // null
s.getSequence(); // null
s.parse(">myseq\nATGCATGCATGC\n");
s.getName(); // myseq
s.getSequence(); // ATGCATGCATGC
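
The parsing step can be pictured with the following standalone sketch for a single FASTA entry. The class FastaParserSketch and its Entry type are hypothetical. The sketch keeps the whole header line (minus the leading ">") as the name, whereas a real parser may split a description off the header.

```java
// Standalone illustration only; this is NOT Eoulsan's implementation.
final class FastaParserSketch {

  // Simple holder for the parsed values (hypothetical type).
  static final class Entry {
    final String name;
    final String sequence;

    Entry(String name, String sequence) {
      this.name = name;
      this.sequence = sequence;
    }
  }

  // Parses one FASTA entry: a ">" header line followed by sequence lines.
  static Entry parse(String fasta) {
    if (fasta == null) {
      throw new IllegalArgumentException("null input");
    }
    String[] lines = fasta.split("\\r?\\n");
    if (lines.length == 0 || !lines[0].startsWith(">")) {
      throw new IllegalArgumentException("Missing FASTA header");
    }
    String name = lines[0].substring(1).trim();
    // Join all following lines, so wrapped sequences are supported
    StringBuilder seq = new StringBuilder();
    for (int i = 1; i < lines.length; i++) {
      seq.append(lines[i].trim());
    }
    return new Entry(name, seq.toString());
  }

  public static void main(String[] args) {
    Entry e = parse(">myseq\nATGCA\nTGCAT\nGC\n");
    System.out.println(e.name);     // myseq
    System.out.println(e.sequence); // ATGCATGCATGC
  }
}
```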

Other functionalities

The Sequence class also contains other utility methods:

  • getTm()
  • getGCPercent()
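
As a rough idea of what such utilities compute, the standalone sketch below derives a GC content and a melting temperature from a DNA string. Everything in it is an assumption: the class SequenceStatsSketch is hypothetical, the GC content is returned on a 0 to 100 scale, and the melting temperature uses the simple Wallace rule (2 °C per A or T, 4 °C per G or C), which is only a rough estimate for short oligonucleotides. The formulas and return scales actually used by getTm() and getGCPercent() may differ.

```java
// Standalone illustration only; this is NOT Eoulsan's implementation.
final class SequenceStatsSketch {

  // Counts the occurrences of the given letters in the sequence.
  private static int count(String sequence, String letters) {
    int n = 0;
    for (int i = 0; i < sequence.length(); i++) {
      if (letters.indexOf(sequence.charAt(i)) >= 0) {
        n++;
      }
    }
    return n;
  }

  // GC content as a percentage (0 to 100) of the sequence length.
  static double gcPercent(String sequence) {
    if (sequence == null || sequence.isEmpty()) {
      return Double.NaN;
    }
    return 100.0 * count(sequence, "GC") / sequence.length();
  }

  // Wallace rule: Tm = 2 * (A + T) + 4 * (G + C), in degrees Celsius.
  static double wallaceTm(String sequence) {
    return 2.0 * count(sequence, "AT") + 4.0 * count(sequence, "GC");
  }

  public static void main(String[] args) {
    System.out.println(gcPercent("ATGC")); // 50.0
    System.out.println(wallaceTm("ATGC")); // 12.0
  }
}
```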