Tracking pandora compare RAM usage and runtime #148
Heavy dataset just to measure maximum RAM usage (no massif profiling): Full klebs dataset with 61340 PRGs and 151 samples at 45x coverage (see Git commit: leoisl@888fff3).
Maximum RAM usage: 17.1 GB
Info about time performance: [...]
Please tell me if:
Heavier datasets than the full klebs dataset are very welcome.
For that first one, surely we don't need a uint32 for the number of samples.
In that first one we don't have a uint32 for the number of samples; we have a pair of uint32s representing the forward and reverse coverage of each sample on each node in the kmergraph. We still don't need uint32, though. At some point, I remember, we updated almost all uints in pandora to uint32 on Robyn's advice, to reduce the possibility of uint conversions. This may be the time to rethink that.
I see no reason to track pandora compare RAM on another dataset just yet. This is a 150X improvement in RAM use over what I had before, and means that we can in theory scale to a dataset of 1000 samples with 45X coverage on yoda already (assuming linear growth, which isn't quite accurate?). If there are obvious things that would reduce RAM, like changing the coverage to a uint16 (and making sure we have some catch in case there is some bizarrely common kmer), we should do those. Is runtime still linear? My gut feeling is that maybe that could be improved a bit next?
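For the uint16 idea, the "catch" could simply be a saturating increment, so a bizarrely common kmer pins at 65535 instead of wrapping around. A minimal sketch, assuming a pair-per-sample layout like the one described above (the names are illustrative, not pandora's actual types):

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical per-node coverage store: one (forward, reverse) pair per sample,
// stored as uint16 instead of uint32. Names are illustrative, not pandora's real types.
struct NodeCoverage {
    std::vector<std::pair<uint16_t, uint16_t>> per_sample_covg; // indexed by sample id

    explicit NodeCoverage(std::size_t num_samples) : per_sample_covg(num_samples, {0, 0}) {}

    // Saturating increment: a bizarrely common kmer pins at 65535 instead of wrapping.
    static void saturating_increment(uint16_t &covg) {
        if (covg < std::numeric_limits<uint16_t>::max())
            ++covg;
    }

    void add_forward_covg(std::size_t sample) { saturating_increment(per_sample_covg[sample].first); }
    void add_reverse_covg(std::size_t sample) { saturating_increment(per_sample_covg[sample].second); }
};
```

That would halve the per-sample, per-node footprint from 8 bytes to 4.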
Agreed. This is how quantizing coverage is done in GATB: GATB/gatb-core#19 (comment)... there are probably many ways to do this... then we would not need a [...]. If we want to decrease the RAM usage even further, we could go for [...]. What do you think?
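To make the quantizing idea concrete, here is a rough sketch of the general log-binning trick only (not GATB's exact scheme; the cutoff and base are made-up parameters): counts are exact up to a cutoff and logarithmically binned above it, so each coverage value fits in a single byte.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Sketch of log-binned coverage in one byte. Coverage 0..EXACT_MAX is stored exactly;
// above that we store a logarithmic bin, trading precision at high coverage for a
// 4x saving over uint32. EXACT_MAX and LOG_BASE are assumed, illustrative parameters.
constexpr uint32_t EXACT_MAX = 100;
constexpr double   LOG_BASE  = 1.05; // ~5% relative error above the cutoff

inline uint8_t encode_covg(uint32_t covg) {
    if (covg <= EXACT_MAX)
        return static_cast<uint8_t>(covg);
    const double bin = std::log(static_cast<double>(covg) / EXACT_MAX) / std::log(LOG_BASE);
    return static_cast<uint8_t>(std::min(255.0, EXACT_MAX + 1 + bin));
}

inline uint32_t decode_covg(uint8_t code) {
    if (code <= EXACT_MAX)
        return code;
    return static_cast<uint32_t>(EXACT_MAX * std::pow(LOG_BASE, code - EXACT_MAX - 1));
}
```

The relative error above the cutoff is bounded by the bin base (about 5% here), which is presumably fine for downstream use, though that would need checking.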
That's a neat method! My only hesitation is the idea of using any more GATB...
My guess is that it scales in RAM to thousands of samples. The peak memory usage shown here is when pandora is mapping a single sample (creating the pangraph from the sample). This does not change with the number of samples, since we process each sample one by one. If we look strictly after the reads are all mapped, we have 1.3 GB of RAM usage. The 2nd and 3rd biggest RAM consumers are the [...]. And you are right about runtime - this will be the bottleneck... I'd guess from some days to 1 week or more for 1000 samples, which is not very nice...
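To spell out why the peak stays flat in the number of samples, here is a purely schematic sketch of the loop shape (none of these types are pandora's real classes, just stand-ins):

```cpp
#include <cstdint>
#include <vector>

// All names below are illustrative stand-ins, not pandora's real classes.
struct Sample {};                                        // one sample's reads
struct SamplePanGraph { std::vector<uint32_t> covgs; };  // built while mapping one sample

struct CombinedPanGraph {
    // Only this grows with the number of samples (per-sample coverages on shared nodes).
    std::vector<std::vector<uint32_t>> per_sample_covgs;
    void add_sample(const SamplePanGraph &g) { per_sample_covgs.push_back(g.covgs); }
};

SamplePanGraph map_reads_to_prgs(const Sample &) {
    // Large, transient allocation: this is where the RAM peak happens.
    return SamplePanGraph{std::vector<uint32_t>(1000000, 0)};
}

// Samples are processed one by one; each transient per-sample graph is freed before the
// next sample starts, so the peak does not grow with the number of samples.
void compare_all_samples(const std::vector<Sample> &samples, CombinedPanGraph &combined) {
    for (const auto &sample : samples) {
        SamplePanGraph sample_graph = map_reads_to_prgs(sample);
        combined.add_sample(sample_graph);
    } // sample_graph is destroyed at the end of each iteration
}
```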
That is really great!! Runtime it is then...
GATB is already a dependency due to the de novo module, so I think there is no issue with using GATB stuff then... If you prefer, we could replicate what they do for the coverage encoding: https://github.com/GATB/gatb-core/blob/2332064bf74032e801537d43dce8f87f018cea4a/gatb-core/src/gatb/tools/collections/impl/MapMPHF.hpp#L96-L157
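For what it's worth, the general shape of that approach is a flat array of one encoded coverage byte per kmer, indexed by a minimal perfect hash. The self-contained sketch below is not GATB's actual code: the MPHF is faked with a sorted vector and binary search just to show the idea.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of the general shape only: a flat array holding one encoded coverage byte per
// kmer, indexed by a hash of the kmer. GATB uses a minimal perfect hash; here it is
// stood in by a sorted vector + binary search so the snippet is self-contained.
struct KmerCoverageStore {
    std::vector<uint64_t> sorted_kmers; // stand-in for the MPHF's key set
    std::vector<uint8_t>  covg;         // one encoded byte per kmer

    explicit KmerCoverageStore(std::vector<uint64_t> kmers)
        : sorted_kmers(std::move(kmers)), covg(sorted_kmers.size(), 0) {
        std::sort(sorted_kmers.begin(), sorted_kmers.end());
    }

    std::size_t index_of(uint64_t kmer) const {
        return std::lower_bound(sorted_kmers.begin(), sorted_kmers.end(), kmer) -
               sorted_kmers.begin();
    }

    void increment(uint64_t kmer) {
        const std::size_t i = index_of(kmer);
        if (i == covg.size() || sorted_kmers[i] != kmer)
            return;                     // kmer not indexed
        if (covg[i] < 255) ++covg[i];   // saturating, one byte per kmer
    }
};
```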
yeah... but these are just guesses; I'd prefer to work with data, so we can be sure we are working on the true bottlenecks... Do we have any dataset with thousands of samples? If not, can we simply duplicate the samples in the klebs dataset until we reach thousands? If exact duplication is not realistic, maybe simulate reads or something like that?
@iqbal-lab said he had some with 1000s to process in the near future. Probably doesn't have them right now. Ha, let's not reimplement - ignore what I said and either use GATB, or use [...].
Ooooops, I am totally mistaken here... I computed these stats for the downsampled dataset used for massif profiling, and treated them as if they were for the full dataset... I think linear RAM growth is an upper bound, so this seems runnable on yoda, but the best way to know is to run that dataset itself.
Ok, let's switch to [...].
Yes, we do have thousands now! 3000 klebs and 8000 E. coli. But in terms of priorities, we have bigger holes to fill around how we run Pandora smoothly: integrating the output of de novo back in, make_prg, etc.
Re coverage:
Agreed, will then postpone performance improvement in [...].
We will encode with [...].
Sounds good to me!
Closing, outdated. We have better performance tracking with the plots generated to evaluate pandora performance for the paper. |
This is not really an issue - just a way of documenting pandora compare RAM usage across commits and to know if we need to improve or not.

Dataset for massif profiling: downsampled klebs dataset with 8000 PRGs and 50 samples at 45x coverage (for the PRG, reads, command-line, etc., see /hps/nobackup/iqbal/leandro/compare_test/profiling_test_8000_PRGs_50_samples on yoda).

Massif profiling:
Git commit: leoisl@5b329a7
Massif output: massif.out.DEBUG_5b329a.txt
Peak snapshot screenshot:
Comments on the 3 instructions allocating most of the RAM:
Takes 413MB / 1.5GB (26.4% of the RAM): the current RAM bottleneck. The RAM usage of this data structure increases with the number of PRGs, the number of nodes in the PRGs, and the number of samples. It might be tricky to improve on this (a rough back-of-envelope follows below).
Takes 226MB / 1.5GB (14.45% of RAM) - this increases only with number of PRGs and number of nodes in PRGs - it does not increase with number of samples, so I think we don't need to worry about improving the memory usage here.
Takes 167MB / 1.5GB (10.68% of RAM) - could be improved (see TODOs in https://github.com/leoisl/pandora/blob/5b329a78d89eaa0f85592258a8fb378677f71587/include/minihit.h), but it does not seem to be a bottleneck.
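If that first allocation is the per-node, per-sample forward/reverse coverage pairs discussed in the comments above (an assumption on my part), a rough back-of-envelope is: RAM ≈ total kmer-graph nodes × number of samples × 2 counters × 4 bytes. With 50 samples, 413 MB / (50 × 8 bytes) ≈ 1 million kmer-graph nodes across the 8000 PRGs; switching the counters to uint16 would roughly halve that term.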