Investigating change of de Bruijn graph library used for de novo discovery #204

Closed
mbhall88 opened this issue Nov 26, 2019 · 1 comment

mbhall88 commented Nov 26, 2019

This issue collects discussion and links on whether to move to a different library for the de Bruijn graph (DBG) implementation used by the de novo discovery routine.

Previous discussion relating to the initial choice and integration of GATB for this role can be found at #16.

Reasons for initiating this discussion:

  • Integrating GATB caused compatibility issues with boost. Part of the problem seems to be that GATB expects a system-wide boost, and the GATB repository also contains boost files of its own. Together, these mean that we have to require a system-wide boost rather than building the boost dependencies with pandora.
  • Writing temporary graph .h5 files limits our ability to multi-thread the de novo routine (see #195, "Multiprocess/multithread pandora map --discover issue with GATB graph creation"). @leoisl has found a way to fix this, but it relies on us maintaining our own fork of GATB, which is not ideal.
  • GATB takes a very long time to compile - significantly longer than the rest of pandora.
  • GATB's support for the Clang compiler appears to be limited. This will affect Mac users.

One proposed solution is to move to McCortex. An added benefit here is that @iqbal-lab is very familiar with this code. A list of pros and cons of McCortex vs GATB is probably a good place to start this discussion.

@mbhall88 changed the title from "Investigating change of library used for de Bruijn graph used by de novo discovery" to "Investigating change of de Bruijn graph library used for de novo discovery" on Jan 15, 2020
@mbhall88

UPDATE: I had a go at switching from GATB to bifrost. It was very easy to integrate, but I realised later, while changing the interface class, that bifrost does not store coverage information in the graph. I had a chat with Guillaume and he said it is possible to add this information manually, but the process is fairly convoluted.
You add a string to the graph (bifrost handles breaking it into k-mers). Then you have to look up each k-mer of that string in the graph yourself and increment a custom integer data attribute for coverage. This is repeated for every sequence you add, so we end up doing a tonne of extra work per sequence.
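
To make the double pass concrete, here is a minimal Python sketch of the bookkeeping described above. It does not use bifrost's API: `ToyCoverageGraph`, `add_sequence`, and the per-k-mer `coverage` field are hypothetical names for illustration, and the sketch ignores reverse complements and graph compaction. It only shows that each sequence is walked twice, once to insert it and once to update the counters.

```python
def kmers(sequence, k):
    """Yield every k-mer of `sequence`, left to right."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


class ToyCoverageGraph:
    """Toy stand-in for a DBG whose k-mers carry a custom data attribute.

    Hypothetical class for illustration only; this is NOT bifrost's API.
    """

    def __init__(self, k):
        self.k = k
        self.nodes = {}  # k-mer -> per-k-mer data, e.g. {"coverage": 0}

    def _insert(self, sequence):
        # Pass 1: the part the library would do for us, i.e. break the
        # sequence into k-mers and make sure each one exists in the graph.
        for kmer in kmers(sequence, self.k):
            self.nodes.setdefault(kmer, {"coverage": 0})

    def add_sequence(self, sequence):
        self._insert(sequence)
        # Pass 2: the part we would have to do ourselves, i.e. look up
        # every k-mer of the sequence again and bump its coverage counter.
        for kmer in kmers(sequence, self.k):
            self.nodes[kmer]["coverage"] += 1

    def coverage(self, kmer):
        """Return the coverage of `kmer`, or 0 if it is not in the graph."""
        node = self.nodes.get(kmer)
        return node["coverage"] if node is not None else 0


graph = ToyCoverageGraph(k=3)
for read in ["ACGTA", "CGTAC"]:
    graph.add_sequence(read)
```

With these two toy reads, the k-mers shared by both (`CGT` and `GTA`) end up with a coverage of 2 and the others with 1. The second loop in `add_sequence` is the extra per-sequence work that a library with built-in coverage tracking would not need.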

Seems like cortex is probably going to be the best bet. Yay C 🎉
