-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Acceptable local assembly performance #74
Comments
Curious to know which bit it is getting stuck in specifically. |
@rmcolq can you please link to examples you are talking about? |
Job was killed 😞 |
#14 is related to this and identifies a possible slow spot. |
If by examples, you mean which functions am I referring to as not having full tests? Commented out tests that could be adjusted to new code as extra test cases: Also noticed we have two When Zam says you have been "sitting in the 'get intervals' stage", do you know where exactly? |
remove redundant functions and tests as mentioned in #74
Ok. I have removed the junk. @rmcolq Regarding the tests, I thought you said you could make some tests quite easily for those functions? @rmcolq If I jump to the bottom of the log file at the moment, it shows the following
This code comes from the I am guessing this is being printed out whenever there is some kind of insertion into a map happening? |
I did indeed. I had lost track. I will write some tests for that.
I think it is, but if you point me at the log I will take a look and see.
Sent from my Samsung Galaxy smartphone.
…-------- Original message --------
From: Michael Hall <[email protected]>
Date: 22/10/2018 13:30 (GMT+00:00)
To: rmcolq/pandora <[email protected]>
Cc: Rachel Norris <[email protected]>, Mention <[email protected]>
Subject: Re: [rmcolq/pandora] Acceptable pandora map+de novo performance (#74)
Ok. I have removed the junk.
@rmcolq<https://github.com/rmcolq> Regarding the tests, I thought you said you could make some tests quite easily for those functions?
@rmcolq<https://github.com/rmcolq> If I jump to the bottom of the log file at the moment, it shows the following
continue checking
check if pnodes defined
check if one is defined
again
continue checking
check if pnodes defined
check if one is defined
again
continue checking
check if pnodes defined
check if one is defined
again
continue checking
check if pnodes defined
check if one is defined
again
continue checking
This code comes from the < operator for GeneIntervalInfo https://github.com/rmcolq/pandora/blob/83e9858c467128d8d528121187fb73969cfe0160/src/denovo_discovery/gene_interval_info.cpp#L6
I am guessing this is being printed out whenever there is some kind of insertion into a map happening?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<#74 (comment)>, or mute the thread<https://github.com/notifications/unsubscribe-auth/AJaLO0JpC3CRdyP0t1h2yt92r0G5OMNAks5unbpNgaJpZM4XxSPP>.
|
Profiled CPU usage on a toy example using valgrind --tool=callgrind /Users/mbhall88/Projects/pandora/cmake-build-debug/pandora map --output_covgs --discover -p toy_prg.fa -r toy_5000_mapped.fastq.gz -o . --genome_size 20000 --genotype It segfaulted when it was running local assembly on an interval, but I have results for the code prior to that. It appears that up until it segfaulted, 98.07% of CPU time was in Speaking with @rffrancon it appears this line is not necessary as it seems like it is being used to clear the |
Excellent, if it runs without it, and tests pass, do remove it! |
Gramtools uses SeqRead for parsing fastq, see: https://github.com/iqbal-lab-org/gramtools/blob/ad47bb63a09c4b0b1f92f005e2059ce842223567/libgramtools/src/quasimap/quasimap.cpp#L74 |
At one point I tried FQFeeder - at the time it was slower, but I was doing different things. It may be faster for this. |
Seems very popular:
|
Now added missed tests in commit ba4078c |
Most recent This is obviously only a small portion of the overall call graph, but is where about half the run-time is at the moment on my toy dataset. This is the higher level shot This example runs local assembly on two intervals in total. Here is the callgrind log file |
Looking into this a little, it seems - for this toy example - 21.86% of time is spent on this @rmcolq Is there a better way of achieving this? A lookup table or approximation perhaps? It is called 6 million times. |
I didn't think of this before now as i dont really use asserts: https://codeyarns.com/2015/03/16/how-to-disable-assert-in-gcc/ I would go with the cmake option. |
Correct me if wrong @rffrancon , but I think Robyn is saying you can set things up so you can compile the code in two ways. |
Is this line where I would put the compiler flag? |
Compiling with Here is an overview of the call graph now. In terms of absolute time. Averaging 5 runs on the toy dataset for each:
Which is an 11% speedup on the toy dataset. |
Going to close this as speed is now "acceptable" and #150 is tracking the performance of the map module now. |
I am opening this issue to track an investigation I am currently doing to pinpoint concrete examples of where our slow points in local assembly are. This analysis also feeds directly into #195 Whilst trying to use I selected an example of a sample that took 4 days to (successfully) run I have parsed the log file to get information about how long local assembly took on each candidate region and how long the region is to see whether the length of a candidate region is correlated with its runtime, or if there is something else at play. ConclusionThe length of a candidate region, at least in this example, does not seem to have a noticeable impact on the amount of time it takes to run local assembly. We can also see that, for this sample, there is a single candidate region that takes nearly all of the time. 75% of all candidate regions take 0.26-0.37 seconds to complete, with the remaining 25% taking next to no time (these are likely regions where there are little-to-no reads to build the dBG from and other various GATB errors). Take a sample with 10,000 candidate regions, each running for 0.3 seconds, then we will spend 37.5 minutes enumerating paths through local assembly graphs. Theoretically, if #195 is implemented, each extra thread should equate to a proportional reduction in local assembly runtime. However, this does not address the issue of single regions that take 4 days! When I looked at this candidate region a little closer, I found it was Next stepI will pull out this PRG and the reads that map to it and play around with discover params that can help reduce this runtime. |
Do you keep a count of how many "loops you meet"? Maybe no of times you meet a kmer you've met before? Coukd have a ceiling. In principle this can be partially done at construction time not runtime. We coukd take the reference for a locus and count how many repeated kmers there are |
Sorry I accidentally closed the issue |
Sorry for the long message, it seems I get carried away when brainstorming... We had a talk on late monday UK / early tuesday Aus, and we actually came up with a similar idea, but that was on runtime. We could limit how many times we loop through cycles simply by having a count of how many times we visited each kmer when building a path, and if we visit a kmer up to, for eg 5 times, we don't consider it as a candidate to get in the path again (sort of removing the kmer from the graph). But this was during runtime and during path building... Removing the kmer at construction time has the impact of being faster (i.e. we won't even loop through repeat kmers, as they will be straight out removed), but it might really affect results (e.g. it might be that the true denovo path to be found goes through the repeat, and removing these repeat kmers makes it impossible to find it). I guess we need to wait for more details on this region from Michael. I would like very much to see the graph (I think GATB can export to GFA? Or just export to a tabular format, and we can view on cytoscape). I wonder if we have several loops over loops, etc there. It is fine that we don't perform well on this. These repeat-induced-full-of-cycles graphs are one of the main hardness on assemblying genomes, everyone struggles... This was one of my PhD topics, trying to solve repeats (which induce cycles), but assemblying human alternative splicing events. I did not succeed much, but some strategies that I remember:
There are several other ways to tackle repeats/cycles/complex assembly graphs, we might need to research and pick the one that fits better our data |
i was simply suggesting at construction time counting high-repeat kmers - i wasnt suggesting removing them from the middle of the graph. |
Oh, I see! It is another way of deciding if the graph is complex or not. We could increase the k-mer size and recompute this stat instead of directly abandoning the assembly, do you think is worth it? |
the mccortex read-coherence thing is expensive and uses i/o. right now i'm more thinking in terms of discarding candidate regions if they are too repeaty. They would also be difficult regions for mapping/normal-approaches anyway. |
the cost/benefit is all about how many regions there are like this. |
important to avoid trying too hard to squeeze tiny increases of theoretical recall |
but my idea of looking at the locus might/not spot michael's problem region. and it might spot loads of others that local assembly copes with fine. we need data really |
I do agree with this, especially with my recent experience on pandora paper. Finding the correct denovo path for this single region among several thousand other regions won't really make a difference in the results... I changed my idea from trying to deal with these complicated regions to identifying and skipping them
Yeah, I think we need a measure to identify the complexity of the region, which correlates with the time spent on it. Maybe a high GC-content could flag a complex region, and we don't process it (although I don't think this was the single high-GC-content region), or maybe the nb of repeated kmers, or nb of cycles, etc... IIRC, Michael also suggested a timeout on the finding-denovo-paths algorithm. I previously opposed to this, as things can become non-reproduceable, and there might be some other external factors that can impact runtime (e.g. filesystem latency, cpu load, etc), and we might end up missing a bunch of regions... but a timeout is indeed an option rethinking again, as almost everything takes <5s, and a single one takes 48h (not sure this is the behaviour of other datasets) |
not sure if this link will work so attaching a screenshot. that's a lot of consecutive G/C |
No, we do not currently keep track of how many times we visit a kmer. I could definitely implement this but would need to alter the data structure for out DFS tree. I will come back to this... @leoisl unfortunately GATB doesn't output GFA, only HDF5. There is a hack I can do to get a cortex graph, but for the moment I have been working on other avenues as this is very time consuming. Thanks for the ideas and discussion! I think the node visiting is a great idea. For the long-running candidate region I have been prodding at I've found out some more information. If I change the distance for merging candidate regions to 15 (instead of 22) then we have no runtime issues. This causes the region causing the long runtime to be broken into three separate regions and each of these completes quickly. Digging in a little deeper I suspect it is the region near the beginning of the original longer region that is causing problems. When I zoom in on that region it has a GC content of ~90%!! https://bacteria.ensembl.org/Mycobacterium_tuberculosis_h37rv/Location/View?db=core;g=Rv3508;r=Chromosome:3932920-3933052;t=CCP46330 If I clean the GATB graph (with merge distance at the original 22) discover does complete on that long runtime region in 6 minutes, but abandons that region as there are too many paths. If I use a discover kmer size of 15 (with the original merge distance) it completes quickly also, but can't find a path between any anchor combination for that region. ConclusionI am going to look at the impact of reducing the merging distance to 15, the impact of cleaning the GATB graph, and the impact of |
Nice investigation, this cleared a lot of doubts! Anyway, I think you will be able to tweak some parameters here and there, and will be able to run de novo discovery on these samples spending a short amount of time, but I think this issue might still come up to us in the future or to a user. Should we still think on a way to flag a graph as too complex, and on strategies to deal with it (i.e. try to automatise what Michael is doing)? I think simply skipping the region, and warning the user how many regions were skipped can be already a good first solution. Other solutions include, I think, doing what you will test: reducing the merging distance and check if the subgraphs produced due to this are not complex anymore, or cleaning, or increasing k (it will depend on what worked for you) |
Update from overnight: I reran the pipeline with Re: our ongoing discussion about the PE_PGRS54 locus I think we actually make this complex region even more difficult for ourselves due to the way we select variants when building our Mtb PRG (mbhall88/head_to_head_pipeline#10). We apply some filtering to the VCFs for the sparse and dense PRGs - see https://github.com/mbhall88/head_to_head_pipeline/blob/0323cbb8b088c357e39d62489721682e0fc8b707/data/H37Rv_PRG/Snakefile#L262-L298 The two bits of filtering that are important for the current issue are:
Now, normally, these are totally sane filters. But, we know that pe/ppe genes are commonly inside these masks. (I also include FILTER in this as the SummaryI don't think we should be masking at the PRG construction stage - this should be a downstream process. |
Cleaning the local assembly (GATB) graph seems to hit us pretty hard in terms of recall. The fourth pair of boxes from the left is the run with cleaning However, we do gain precision - but bear in mind this is without any post-filtering. It is beginning to look like the de novo k-mer size isn't as influential as I first thought (at least for Mtb). I'm still expolring parameters though. |
That looks like the classic cleaning DBGs results: - recall, + precision. It is interesting nonetheless, I'd propose to add this to your Mtb paper (at least as supplementary... not sure if it fits the paper though... but I bet there will be use cases where we prefer better precision than recall, and this shows a way of doing it...) |
|
That's COMPASS
The abundance threshold is indicated by abdunx in the names on the x-axis |
I am now confident I have explored the discover parameters thoroughly enough. Based on the following recall plot, I am going to put in a PR to change the default From the point of view of performance, the summary statistics of wall-time for the discover jobs for the newly proposed defaults (with 8 threads) are:
|
@mbhall88 has been running "latest" Pandora with the perf changes you 3 produced on the full graph with a cardio sample. So far running for days, sitting in the "get intervals" stage.
@mbhall88 feel free to edit/update with details.
The text was updated successfully, but these errors were encountered: