
app guide

tcpan edited this page Feb 16, 2015 · 6 revisions

Application Level API

The BLISS library provides a high-level API for integration into existing or new applications.

The design goals of these APIs are

  1. semantically clear
  2. simple to use
  3. hides (most of) the complexity of the parallel backend
  4. easy to extend

The flexibility and extensibility of BLISS come primarily from C++ templates, which allow an implementation to support not only different data types but also different pluggable function modules.

C++ Template Primer

A very simple example of C++ templates can be seen in the container classes of the Standard Template Library.

std::vector<int> integers;
integers.push_back(1);

std::vector<float> floats;
floats.push_back(1.0);

The example above shows how the same vector container can be used with 2 different data types.

A class or function template can also be specialized for a particular template parameter type, so that a data-type-specific algorithm is used for greater efficiency or accuracy. For example,

log2<unsigned int>(10);
log2<float>(10.0);

The first case can be implemented using bit shifting, while the second case requires floating point computation.
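
As a minimal sketch of such a specialization, the following defines a hypothetical function template log2_of. The name is chosen here to avoid clashing with the standard log2 and is not part of BLISS. The unsigned int specialization counts bit shifts, while the float specialization defers to std::log2.

```cpp
#include <cassert>
#include <cmath>

// Primary template: declared but not defined, so unsupported types fail to link.
template <typename T>
T log2_of(T x);

// Integer case: floor(log2(x)) by repeated right shift. Requires x > 0.
template <>
inline unsigned int log2_of<unsigned int>(unsigned int x) {
    assert(x > 0 && "log2 of zero is undefined");
    unsigned int result = 0;
    while (x >>= 1) ++result;  // one shift per halving
    return result;
}

// Floating-point case: defer to the standard library.
template <>
inline float log2_of<float>(float x) {
    return std::log2(x);
}
```

Note that the integer version returns the floor of the logarithm, so log2_of<unsigned int>(10) yields 3.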

Finally, it is also possible to use a template parameter to replace specific logic directly. For example, the hash function of a hash map (std::unordered_map in the C++ standard library) can be replaced.

std::unordered_map<int, float, std::hash<int>> intHashMap;
std::unordered_map<int, float, my_md5<int>>  md5HashMap;
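
my_md5 above is a placeholder. As a sketch of what a pluggable hash functor might look like, the hypothetical MixHash below applies a simple multiplicative mix (not MD5) and is passed as the third template argument of std::unordered_map. All names in this block are illustrative assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Hypothetical hash functor: any type with a const operator() returning
// std::size_t can be plugged in as the Hash parameter of std::unordered_map.
struct MixHash {
    std::size_t operator()(int key) const {
        // Multiplicative mixing; illustrative only, not cryptographic.
        std::uint64_t h = static_cast<std::uint32_t>(key);
        h *= 0x9E3779B97F4A7C15ULL;
        return static_cast<std::size_t>(h ^ (h >> 32));
    }
};

// Same container and interface as before, different hashing logic.
using MixHashMap = std::unordered_map<int, float, MixHash>;

// Small usage example: insert a value and read it back through the custom hash.
inline float demo_lookup() {
    MixHashMap m;
    m[42] = 1.5f;
    assert(m.count(42) == 1);
    return m.at(42);
}
```

The container code is unchanged; only the hashing policy differs, which is the kind of substitution the template parameter makes possible.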

BLISS heavily utilizes C++ templates to support extensibility and flexibility, as illustrated in the next section.

Kmer Index

API

Three Kmer Index classes are currently provided. All use MPI and OpenMP as their parallel backend.

template<unsigned int Kmer_Size, typename Alphabet> class KmerCountIndex;
template<unsigned int Kmer_Size, typename Alphabet> class KmerPositionIndex;
template<unsigned int Kmer_Size, typename Alphabet> class KmerPositionAndQualityIndex;

Each class provides the following functions:

void build(const std::string &filename, const int &nthreads)
void sendQuery(const KmerType & kmer)
void flushQuery()
void query(const std::vector<KmerType>& kmers)
void finalize()

Internally, KmerType is defined by the Kmer Index classes as bliss::Kmer<Kmer_Size, Alphabet>, with Alphabet being DNA (ACGT), RNA (ACGU), DNA5 (ACGTN), RNA5 (ACGUN), or DNA16 (IUPAC). Users can also define a custom alphabet.
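
To illustrate how an alphabet can act as a template parameter, here is a toy sketch. ToyDNA, ToyKmer, and DnaTrimer are hypothetical names invented for this example and do not reflect how bliss::Kmer or the BLISS alphabets are actually implemented.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical 4-letter alphabet: 2 bits per character.
struct ToyDNA {
    static constexpr unsigned bits_per_char = 2;
    static std::uint64_t encode(char c) {
        switch (c) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default:  assert(false && "character not in alphabet"); return 0;
        }
    }
};

// Hypothetical k-mer that packs K characters of the given alphabet into 64 bits.
template <unsigned int K, typename Alphabet>
struct ToyKmer {
    static_assert(K * Alphabet::bits_per_char <= 64, "k-mer does not fit in 64 bits");
    std::uint64_t bits = 0;

    explicit ToyKmer(const std::string& s) {
        assert(s.size() == K);
        // Shift in one encoded character at a time, most significant first.
        for (char c : s)
            bits = (bits << Alphabet::bits_per_char) | Alphabet::encode(c);
    }
};

using DnaTrimer = ToyKmer<3, ToyDNA>;
```

Under this scheme a 5-letter alphabet would supply its own encode function and a larger bits_per_char, and ToyKmer would need no changes.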

To instantiate a Kmer Index, for example a KmerCountIndex, we need to call its constructor. Each class provides a constructor of the following form, with "___" standing for the specific index in question.

Kmer___Index(MPI_Comm _comm, int _comm_size,
             const std::function<void(std::pair<KmerType, ValueType>*, std::size_t)>& callbackFunction,
             int num_threads = 1)

Note that an MPI communicator must be supplied (MPI_COMM_WORLD suffices), along with its size (obtained via MPI_Comm_size(MPI_COMM_WORLD, &_comm_size)).

Equally important is the callbackFunction argument, which must have the form

void myCallback(std::pair<KmerType, ValueType>*, std::size_t);

ValueType is size_t for KmerCountIndex, bliss::io::FASTQSequenceId for KmerPositionIndex, and std::pair<bliss::io::FASTQSequenceId, float> for KmerPositionAndQualityIndex.

This function is called whenever query results are received, and should be implemented by the application developer.
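
Because the constructor takes a std::function, the callback does not have to be a free function: a lambda that captures application state also fits the signature. The sketch below assumes stand-in types (ToyKmerType, ToyValueType) and a hypothetical helper make_collector; none of these are BLISS names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Stand-ins for the index's KmerType and ValueType (assumptions for this sketch).
using ToyKmerType = std::uint64_t;
using ToyValueType = std::size_t;
using ToyCallback =
    std::function<void(std::pair<ToyKmerType, ToyValueType>*, std::size_t)>;

// Builds a callback that appends every key whose value exceeds `threshold`
// to `out`. The state lives in the lambda capture, so no globals are needed.
// `out` must outlive the returned callback.
inline ToyCallback make_collector(std::size_t threshold,
                                  std::vector<ToyKmerType>& out) {
    return [threshold, &out](std::pair<ToyKmerType, ToyValueType>* answers,
                             std::size_t n) {
        assert(answers != nullptr || n == 0);
        for (std::size_t i = 0; i < n; ++i)
            if (answers[i].second > threshold) out.push_back(answers[i].first);
    };
}
```

An object built this way could then be passed wherever a callback of the form shown above is expected.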

To instantiate a KmerCountIndex, for example,

int comm_size;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &comm_size);
int numThreads = 4;
KmerCountIndex<21, bliss::DNA> index(MPI_COMM_WORLD, comm_size, myCallback, numThreads);

The next step is to build the KmerCountIndex.

index.build("my_fastq_file", numThreads);

Each participating MPI process reads a portion of "my_fastq_file" using numThreads (here 4) threads, hashes the generated Kmers, and redistributes them to the appropriate processes.
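
The hash-and-redistribute step can be pictured with the single-process sketch below. It assumes a simple ownership rule (hash modulo the number of processes) and uses strings as stand-in k-mers. The source does not state which rule BLISS actually applies, and owner_rank and bucket_by_owner are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Assumed ownership rule: hash the k-mer, then take it modulo the process count.
inline int owner_rank(const std::string& kmer, int comm_size) {
    assert(comm_size > 0);
    return static_cast<int>(std::hash<std::string>{}(kmer) %
                            static_cast<std::size_t>(comm_size));
}

// Groups k-mers into one outgoing bucket per destination rank. In a real MPI
// program each bucket would then be sent to its rank; that part is omitted here.
inline std::vector<std::vector<std::string>> bucket_by_owner(
        const std::vector<std::string>& kmers, int comm_size) {
    std::vector<std::vector<std::string>> buckets(
        static_cast<std::size_t>(comm_size));
    for (const auto& k : kmers)
        buckets[static_cast<std::size_t>(owner_rank(k, comm_size))].push_back(k);
    return buckets;
}
```

Since the owner depends only on the k-mer itself, every occurrence of a given k-mer lands on the same process, which is what allows a query for that k-mer to be routed to a single process later.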

After index build is complete, a set of queries can be sent:

index.sendQuery(kmer1);
index.sendQuery(kmer2);
index.sendQuery(kmer3);
...
index.flushQuery();
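
The sendQuery/flushQuery calls above follow a queue-then-flush pattern. As a rough single-process illustration of that pattern, the hypothetical MockCountIndex below buffers queries and delivers answers through a callback. It involves no MPI, runs synchronously, and is not BLISS code; "sending" a batch is just a lookup in a local table.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

class MockCountIndex {
public:
    using Answer = std::pair<std::string, std::size_t>;
    using Callback = std::function<void(Answer*, std::size_t)>;

    MockCountIndex(std::unordered_map<std::string, std::size_t> counts,
                   Callback cb, std::size_t batch_size)
        : counts_(std::move(counts)), cb_(std::move(cb)), batch_size_(batch_size) {
        assert(batch_size_ > 0);
    }

    // Queue a query; a full batch is dispatched automatically.
    void sendQuery(const std::string& kmer) {
        pending_.push_back(kmer);
        if (pending_.size() >= batch_size_) flushQuery();
    }

    // Dispatch whatever is still queued and hand the answers to the callback.
    void flushQuery() {
        if (pending_.empty()) return;
        std::vector<Answer> answers;
        for (const auto& k : pending_) {
            auto it = counts_.find(k);
            // Unknown k-mers are reported with a count of zero in this mock.
            answers.emplace_back(k, it == counts_.end() ? std::size_t{0} : it->second);
        }
        pending_.clear();
        cb_(answers.data(), answers.size());
    }

private:
    std::unordered_map<std::string, std::size_t> counts_;
    Callback cb_;
    std::size_t batch_size_;
    std::vector<std::string> pending_;
};
```

The mock shows why an explicit flush is needed: a partially filled batch would otherwise stay queued indefinitely.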

The queries are queued and sent asynchronously to the MPI process holding the appropriate data. When flushQuery() is called, all pending communications (queries) are sent. The results are streamed back and processed asynchronously by the myCallback function. An example implementation of myCallback is shown below:

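// Assumed application-defined cutoff; a free function has no "this" to read it from.
static const size_t threshold = 60;

void myCallback(std::pair<KmerType, size_t>* answers, std::size_t answer_count)
{
   for (size_t i = 0; i < answer_count; ++i) {
     KmerType key;
     size_t val;
     std::tie(key, val) = answers[i];

     // report only kmers whose count exceeds the threshold
     if (val > threshold)
       printf("high frequency kmer:  %s, count %zu\n", key.toString().c_str(), val);
   }
}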

The example above prints only Kmers with a count greater than some threshold, e.g. 60. While this example shows a sequential callback, a parallel version can be implemented as well.
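
One way a callback body might be parallelized is with an OpenMP loop. The sketch below counts, rather than prints, the answers above a threshold so that the result does not depend on thread interleaving. count_high_frequency is a hypothetical helper, and whether BLISS invokes callbacks in a way that makes this worthwhile is not stated in the source.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Counts answers whose value exceeds `threshold`. The loop is split across
// OpenMP threads when compiled with OpenMP enabled (e.g. -fopenmp); otherwise
// the pragma is ignored and the loop runs sequentially with the same result.
template <typename Key>
std::size_t count_high_frequency(const std::pair<Key, std::size_t>* answers,
                                 std::size_t answer_count,
                                 std::size_t threshold) {
    assert(answers != nullptr || answer_count == 0);
    long long hits = 0;
    // A signed loop index keeps the loop valid for older OpenMP versions.
    const long long n = static_cast<long long>(answer_count);
    #pragma omp parallel for reduction(+ : hits)
    for (long long i = 0; i < n; ++i) {
        if (answers[i].second > threshold) ++hits;
    }
    return static_cast<std::size_t>(hits);
}
```

The reduction clause gives each thread a private counter and sums them at the end, which avoids a data race on hits.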
