Switch to cortex for de Bruijn graph implementation #235

mbhall88 · 2020-10-15T07:39:15Z

The ultimate aim of this PR is to close #204 and make the implementation of #163 / #195 easier.

This is in an extremely draft phase. I have opened this PR to allow for gradual code review as I still feel quite awkward in C/C++ - especially trying to port C code to C++. It's kind of like I'm vision-impaired but am acting as my own guide dog 😅. But I don't want to use that as an excuse so am hoping to learn a lot about both languages in the process of this (while wishing pandora was in Rust the whole time).

So please give me lots of feedback/tips as I go @leoisl and @iqbal-lab

mbhall88 · 2020-10-16T03:35:27Z

Does anyone foresee an issue with me using std::unordered_map to replace the HashTable in cortex?

iqbal-lab · 2020-10-16T05:29:13Z

Depends how fast unordered map is. Cortexs hash is pretty fast.

mbhall88 · 2020-10-16T05:59:12Z

Depends how fast unordered map is. Cortexs hash is pretty fast.

We can provide a different hash function for unordered_map if that's what you're referring to regarding cortex being fast?

I suspect @leoisl will have a better sense than myself as to the speed of an unordered_map...

leoisl · 2020-10-17T18:35:46Z

I usually prefer to use data structures in the standard library for some reasons:

They are usually more efficient (compiler devs are really good and has performance as priority), and more customisable;
It is easier to develop with: there is probably a method, or a function in <algorithm> that fits whatever we need to do with the data structure, i.e. no need for we to code and test anything;
It is easier for other people to understand code: if we use standard data structures, people developing in C++ already know them, don't need to relearn;
They are extremely tested and probably bug free;

In principle, I think that cortex HashTable and std::unordered_map do the same thing, but I have no idea which one is more efficient. std::unordered_map might have a bunch of optimisations (things that we don't even try in our bioinformatics code), but the only way to be sure is through empirical tests I guess.

On the other hand, I would usually prefer to use std::map. It is just a little bit slower, operations are in log n, instead of constant on average case, but we don't risk having a linear time worst case, in case we have a bad number of buckets or a bad hash function. Although constant is the fastest, log n is very fast, and linear is slow. I'd rather have the guarantee that we never go into the slow case. Also, storing the data in a sorted way might give us opportunities to design more efficient algorithms based on the ordering downstream.

I see also that we are directly changing cortex code. I would have a preference to put cortex in our codebase as a git submodule, and make a local C++ wrapper around it. Like this, we would know exactly which commit/version we are working on, and if there are any updates to cortex, we can easily update cortex in our code base. This could happen if pandora needs a specific thing in the future that cortex don't actually implement now, or some kind of optimisation. If we put a cortex copy directly in our codebase, it is hard to know at which commit we started working on, and it will be hard to incorporate new changes. Also, I think the cortex version we are working on is a well-tested and stable version, so it is something we can trust. But changing the code directly means that we will need to add test code to cortex to be sure we did not introduce any bugs. I would vote for a C++ wrapper, as it is more modular, and we would keep this stable cortex code intact. Then we just need to test the wrapper, which is easy by mocking the cortex library.

mbhall88 · 2020-10-19T07:41:42Z

Alright. Fair points.

I will delete all of this work and start again from a git submodule tomorrow.

leoisl · 2020-10-19T07:44:58Z

Oh, if it is needed to restart all from scratch, I am not sure if it is worth it... do you think you can move most of what you've done to a wrapper? If not, if it is indeed needed to restart all from scratch, I'd rather leave this decision to you and @iqbal-lab

mbhall88 · 2020-10-19T07:50:28Z

No, I can't really move what I've done into a wrapper as I was manually copying over the cortex code and "modernising/C++-ising" it at the same time.

Better to restart now, rather than 2 weeks down the track.

I'll include it as a submodule and write a wrapper that calls the relevant C code for various operations.

leoisl · 2020-10-19T07:58:51Z

Ok, just to be sure, my opinions on that long post were just from engineering POV. There are other factors to consider. If updates to cortex are a very rare event, then I think we can simply change it directly, and add test code on our additions (on the hope that we almost never will need to add new commits from the cortex repo to our modified local copy). This engineering choice will also just improve some code modularity/testability, but the end result will probably be the same. I am not sure is worth you to invest your time more on the engineering aspect, or continue the work you've done to explore your research question earlier. I did not enter these questions in my previous post, as I think it this is more of a management aspect, which you and @iqbal-lab should decide

iqbal-lab · 2020-10-19T08:47:24Z

cortex updates essentially never and particularly not that code. fixes are sometimes need to the perl scripts.

mbhall88 · 2020-10-21T05:04:31Z

I'm going to close this and start fresh.

WIP: add binary kmer class

7b908be

mbhall88 requested review from iqbal-lab and leoisl October 15, 2020 07:39

add Nucleotide enum and cortex namespace

5ed90a9

add initial, bare-bones graph and node

b0ba468

remove cortex includes

dcf58e0

mbhall88 closed this Oct 21, 2020

mbhall88 deleted the cortex branch October 21, 2020 05:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Switch to cortex for de Bruijn graph implementation #235

Switch to cortex for de Bruijn graph implementation #235

mbhall88 commented Oct 15, 2020 •

edited

Loading

mbhall88 commented Oct 16, 2020

iqbal-lab commented Oct 16, 2020

mbhall88 commented Oct 16, 2020

leoisl commented Oct 17, 2020

mbhall88 commented Oct 19, 2020

leoisl commented Oct 19, 2020

mbhall88 commented Oct 19, 2020

leoisl commented Oct 19, 2020

iqbal-lab commented Oct 19, 2020

mbhall88 commented Oct 21, 2020

Switch to cortex for de Bruijn graph implementation #235

Switch to cortex for de Bruijn graph implementation #235

Conversation

mbhall88 commented Oct 15, 2020 • edited Loading

mbhall88 commented Oct 16, 2020

iqbal-lab commented Oct 16, 2020

mbhall88 commented Oct 16, 2020

leoisl commented Oct 17, 2020

mbhall88 commented Oct 19, 2020

leoisl commented Oct 19, 2020

mbhall88 commented Oct 19, 2020

leoisl commented Oct 19, 2020

iqbal-lab commented Oct 19, 2020

mbhall88 commented Oct 21, 2020

mbhall88 commented Oct 15, 2020 •

edited

Loading