Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch to cortex for de Bruijn graph implementation #235

Closed
wants to merge 4 commits into from

Conversation

mbhall88
Copy link
Member

@mbhall88 mbhall88 commented Oct 15, 2020

The ultimate aim of this PR is to close #204 and make the implementation of #163 / #195 easier.

This is in an extremely draft phase. I have opened this PR to allow for gradual code review as I still feel quite awkward in C/C++ - especially trying to port C code to C++. It's kind of like I'm vision-impaired but am acting as my own guide dog 😅. But I don't want to use that as an excuse so am hoping to learn a lot about both languages in the process of this (while wishing pandora was in Rust the whole time).

So please give me lots of feedback/tips as I go @leoisl and @iqbal-lab

@mbhall88 mbhall88 requested review from iqbal-lab and leoisl October 15, 2020 07:39
@mbhall88
Copy link
Member Author

Does anyone foresee an issue with me using std::unordered_map to replace the HashTable in cortex?

@iqbal-lab
Copy link
Collaborator

Depends how fast unordered map is. Cortexs hash is pretty fast.

@mbhall88
Copy link
Member Author

Depends how fast unordered map is. Cortexs hash is pretty fast.

We can provide a different hash function for unordered_map if that's what you're referring to regarding cortex being fast?

I suspect @leoisl will have a better sense than myself as to the speed of an unordered_map...

@leoisl
Copy link
Collaborator

leoisl commented Oct 17, 2020

I usually prefer to use data structures in the standard library for some reasons:

  1. They are usually more efficient (compiler devs are really good and has performance as priority), and more customisable;
  2. It is easier to develop with: there is probably a method, or a function in <algorithm> that fits whatever we need to do with the data structure, i.e. no need for we to code and test anything;
  3. It is easier for other people to understand code: if we use standard data structures, people developing in C++ already know them, don't need to relearn;
  4. They are extremely tested and probably bug free;

In principle, I think that cortex HashTable and std::unordered_map do the same thing, but I have no idea which one is more efficient. std::unordered_map might have a bunch of optimisations (things that we don't even try in our bioinformatics code), but the only way to be sure is through empirical tests I guess.

On the other hand, I would usually prefer to use std::map. It is just a little bit slower, operations are in log n, instead of constant on average case, but we don't risk having a linear time worst case, in case we have a bad number of buckets or a bad hash function. Although constant is the fastest, log n is very fast, and linear is slow. I'd rather have the guarantee that we never go into the slow case. Also, storing the data in a sorted way might give us opportunities to design more efficient algorithms based on the ordering downstream.

I see also that we are directly changing cortex code. I would have a preference to put cortex in our codebase as a git submodule, and make a local C++ wrapper around it. Like this, we would know exactly which commit/version we are working on, and if there are any updates to cortex, we can easily update cortex in our code base. This could happen if pandora needs a specific thing in the future that cortex don't actually implement now, or some kind of optimisation. If we put a cortex copy directly in our codebase, it is hard to know at which commit we started working on, and it will be hard to incorporate new changes. Also, I think the cortex version we are working on is a well-tested and stable version, so it is something we can trust. But changing the code directly means that we will need to add test code to cortex to be sure we did not introduce any bugs. I would vote for a C++ wrapper, as it is more modular, and we would keep this stable cortex code intact. Then we just need to test the wrapper, which is easy by mocking the cortex library.

@mbhall88
Copy link
Member Author

Alright. Fair points.

I will delete all of this work and start again from a git submodule tomorrow.

@leoisl
Copy link
Collaborator

leoisl commented Oct 19, 2020

Oh, if it is needed to restart all from scratch, I am not sure if it is worth it... do you think you can move most of what you've done to a wrapper? If not, if it is indeed needed to restart all from scratch, I'd rather leave this decision to you and @iqbal-lab

@mbhall88
Copy link
Member Author

No, I can't really move what I've done into a wrapper as I was manually copying over the cortex code and "modernising/C++-ising" it at the same time.

Better to restart now, rather than 2 weeks down the track.

I'll include it as a submodule and write a wrapper that calls the relevant C code for various operations.

@leoisl
Copy link
Collaborator

leoisl commented Oct 19, 2020

Ok, just to be sure, my opinions on that long post were just from engineering POV. There are other factors to consider. If updates to cortex are a very rare event, then I think we can simply change it directly, and add test code on our additions (on the hope that we almost never will need to add new commits from the cortex repo to our modified local copy). This engineering choice will also just improve some code modularity/testability, but the end result will probably be the same. I am not sure is worth you to invest your time more on the engineering aspect, or continue the work you've done to explore your research question earlier. I did not enter these questions in my previous post, as I think it this is more of a management aspect, which you and @iqbal-lab should decide

@iqbal-lab
Copy link
Collaborator

cortex updates essentially never and particularly not that code. fixes are sometimes need to the perl scripts.

@mbhall88
Copy link
Member Author

I'm going to close this and start fresh.

@mbhall88 mbhall88 closed this Oct 21, 2020
@mbhall88 mbhall88 deleted the cortex branch October 21, 2020 05:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants