
Implementation of Parallelization to analysis.hydrogenbonds.hbond_analysis #4718

Merged (12 commits) on Oct 7, 2024

Conversation

@talagayev (Member) commented Sep 25, 2024

Fixes #4664

Changes made in this Pull Request:

  • Added parallelization backend support to the HydrogenBondAnalysis class in hbond_analysis.py
  • Addition of parallelization tests in test_hydrogenbonds_analysis.py and fixtures in conftest.py
  • Updated Changelog

One of the pytests still raises a failure while the remainder pass; I am currently not sure what the reason for this is.

The failure is:
AttributeError: 'HydrogenBondAnalysis' object has no attribute '_hydrogens'
and it appears only in test_no_hydrogens.

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

Developer Certificate of Origin


📚 Documentation preview 📚: https://mdanalysis--4718.org.readthedocs.build/en/4718/

Added ResultsGroup for hbond and the supported backends
Added the client for HydrogenBondAnalysis
Added client_HydrogenBondAnalysis to the tests
Added to changelog entry about hbond parallelization
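
(For context, these commits follow the AnalysisBase parallelization pattern introduced in MDAnalysis 2.8; a simplified sketch of the hooks involved, not the literal hbond_analysis.py diff:)

```python
# Simplified sketch of the parallelization hooks these commits add;
# not the literal hbond_analysis.py diff.
from MDAnalysis.analysis.base import AnalysisBase
from MDAnalysis.analysis.results import ResultsGroup

class HydrogenBondAnalysis(AnalysisBase):
    # Mark the per-frame algorithm as safe for split-apply-combine.
    _analysis_algorithm_is_parallelizable = True

    @classmethod
    def get_supported_backends(cls):
        return ('serial', 'multiprocessing', 'dask')

    def _get_aggregator(self):
        # Tell the framework how to merge per-worker results: stack each
        # computation group's per-frame hbond array into one array.
        return ResultsGroup(lookup={'hbonds': ResultsGroup.ndarray_vstack})
```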
@pep8speaks commented Sep 25, 2024

Hello @talagayev! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 345:80: E501 line too long (88 > 79 characters)

Line 131:1: W293 blank line contains whitespace
Line 132:1: E302 expected 2 blank lines, found 1
Line 134:25: W292 no newline at end of file

Comment last updated at 2024-10-06 15:23:19 UTC


github-actions bot commented Sep 25, 2024

Linter Bot Results:

Hi @talagayev! Thanks for making this PR. We linted your code and found the following:

Some issues were found with the formatting of your code.

Code Location | Outcome
main package  | ⚠️ Possible failure
testsuite     | ⚠️ Possible failure

Please have a look at the darker-main-code and darker-test-code steps here for more details: https://github.com/MDAnalysis/mdanalysis/actions/runs/11203036680/job/31139825891


Please note: The black linter is purely informational, you can safely ignore these outcomes if there are no flake8 failures!

@orbeckst orbeckst self-assigned this Sep 26, 2024
@orbeckst (Member) commented:

Exciting! Did you happen to benchmark?

@talagayev (Member, Author) commented:

> Exciting! Did you happen to benchmark?

Hey @orbeckst, sadly no, since I wasn't sure whether it was correct because of the one failing pytest, but I can check out the performance of the parallel hbond_analysis :)

@talagayev (Member, Author) commented Sep 27, 2024

I ran some benchmarking on a local system and it performed quite well; I tested with 3, 5 & 10 CPUs.
The system that needed 10 minutes without parallelization showed the following times:

multiprocessing:
3 CPU: 4 minutes 23 seconds
5 CPU: 2 minutes 56 seconds
10 CPU: 2 minutes 6 seconds

dask:
3 CPU: 4 minutes 19 seconds
5 CPU: 2 minutes 49 seconds
10 CPU: 1 minute 55 seconds

The system that needed 20 minutes without parallelization showed the following times:

multiprocessing:
3 CPU: 9 minutes 13 seconds
5 CPU: 6 minutes 26 seconds
10 CPU: 3 minutes 39 seconds

dask:
3 CPU: 8 minutes 45 seconds
5 CPU: 5 minutes 58 seconds
10 CPU: 3 minutes 38 seconds

The times look reasonable, and dask performs a little faster, at least for the cases I tested :)
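
For anyone wanting to reproduce such a comparison, a minimal timing sketch (topology/trajectory paths and selections are placeholders for your own system, not the exact files used above):

```python
# Minimal backend-timing sketch; paths and selections are placeholders.
import time
import MDAnalysis as mda
from MDAnalysis.analysis.hydrogenbonds.hbond_analysis import \
    HydrogenBondAnalysis

u = mda.Universe("topology.psf", "trajectory.dcd")

for backend, n_workers in [("serial", None),
                           ("multiprocessing", 3),
                           ("dask", 3)]:
    hbonds = HydrogenBondAnalysis(u, hydrogens_sel="name H1 H2",
                                  acceptors_sel="name OH2")
    kwargs = ({} if backend == "serial"
              else {"backend": backend, "n_workers": n_workers})
    t0 = time.perf_counter()
    hbonds.run(**kwargs)  # backend/n_workers kwargs are new in 2.8
    print(f"{backend:>15} ({n_workers} workers): "
          f"{time.perf_counter() - t0:.1f} s")
```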

@orbeckst (Member) commented:

These are very encouraging results, it looks like pretty good scaling! Nice. (Now we just have to ensure that the results are also correct...)

What kind of systems were you testing, i.e., number of atoms, number of trajectory frames, type (e.g. water only, protein in water, ...)?

What machine did you test on (CPU?) and where did you store the trajectory (SSD?)?

cc @RMeli @marinegor

@talagayev (Member, Author) commented:

> These are very encouraging results, it looks like pretty good scaling! Nice. (Now we just have to ensure that the results are also correct...)
>
> What kind of systems were you testing, i.e., number of atoms, number of trajectory frames, type (e.g. water only, protein in water, ...)?
>
> What machine did you test on (CPU?) and where did you store the trajectory (SSD?)?
>
> cc @RMeli @marinegor

@orbeckst I mainly tried to concatenate the MDAnalysisTests files to get something reproducible. For the tests I mentioned I ran the PSF and DCD files (only noticing afterwards that no hydrogen bonds were calculated with the settings I used 🙈). I would instead use the waterPSF and waterDCD files with the HydrogenBondAnalysis options shown in the user guide example linked below, so it is easier to reproduce. As for the number of frames, I concatenated 1,000,000 frames so that the analysis takes 5 minutes 48 seconds on my PC on one processor. Correctness could be checked by comparing the results of the non-parallelized and parallelized analyses; I compared the multiprocessing results with

hbonds_non_parallelized.results == hbonds_multiprocessing_3.results

and the arrays were identical.

https://userguide.mdanalysis.org/stable/examples/analysis/hydrogen_bonds/hbonds.html

As for the system that I have:
CPU: AMD Ryzen 5 5600H (12 logical cores)
GPU: Nvidia Geforce RTX 3060

I ran everything through a Jupyter notebook to check that it works, and saved the topology and the trajectory with the 1 million frames on my SSD :)
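
For reference, the equality check described above could be scripted like this (a sketch using the waterPSF/waterDCD test files; the OH2/H1/H2 selections assume the CHARMM TIP3 atom naming in those files):

```python
# Sketch of a serial-vs-parallel correctness check on the water test files.
import numpy as np
import MDAnalysis as mda
from MDAnalysis.analysis.hydrogenbonds.hbond_analysis import \
    HydrogenBondAnalysis
from MDAnalysisTests.datafiles import waterPSF, waterDCD

u = mda.Universe(waterPSF, waterDCD)

def run_hbonds(**run_kwargs):
    # donors_sel is omitted: the PSF topology provides bonds, so donors
    # are found from the donor-hydrogen connectivity.
    hbonds = HydrogenBondAnalysis(u, hydrogens_sel="name H1 H2",
                                  acceptors_sel="name OH2")
    hbonds.run(**run_kwargs)
    return hbonds.results.hbonds

serial = run_hbonds()
parallel = run_hbonds(backend="multiprocessing", n_workers=3)

# Each row is (frame, donor_ix, hydrogen_ix, acceptor_ix, distance, angle);
# serial and parallel runs should be identical.
assert np.array_equal(serial, parallel)
```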

@RMeli (Member) commented Sep 30, 2024

Very nice results @talagayev, thank you for sharing. It's really helpful to have small benchmarks for these PRs, so that we can see that the parallelization indeed speeds things up.

@marinegor (Contributor) commented:

Thanks for sharing indeed @talagayev! Interesting to see that dask performs slightly better (though I'm not sure it would hold up if you ran the tests multiple times). Also, did you reset the filesystem cache when doing the tests? I assume that might influence the results. As a test, you can either change the order of the tested backends (I assume you did dask first, multiprocessing second?), or simply copy the file to a new place every time before running the benchmark on it.

@talagayev (Member, Author) commented:

Happy to help :)

@marinegor I didn't reset the cache, so that could influence the tests, and I ran them only once. I could check how it performs over multiple runs and also reset the filesystem cache after the runs. As for the order, it was first without parallelization, then multiprocessing for the different numbers of workers, and then dask for the different numbers of workers :)

@marinegor @RMeli As for the benchmarks, should I directly use the files that are present in MDAnalysisTests for all of them, or what would be the best approach for adding parallelization benchmarks?

@yuxuanzhuang (Contributor) commented:

The reason it was failing was that _donors and _hydrogens were defined in _prepare, which only runs for each computational group and, therefore, were not stored in the original instance.
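
In other words (a simplified illustration of the failure mode, not the actual source):

```python
# With a split-apply-combine backend, _prepare() runs on per-worker
# copies of the analysis object, so attributes set there never reach
# the original instance that the test inspects after .run().

class BrokenPattern:
    def __init__(self, universe, hydrogens_sel):
        self.u = universe
        self.hydrogens_sel = hydrogens_sel

    def _prepare(self):
        # Set on a worker's copy only; the original object never
        # gains self._hydrogens, hence the AttributeError.
        self._hydrogens = self.u.select_atoms(self.hydrogens_sel)

class FixedPattern:
    def __init__(self, universe, hydrogens_sel):
        self.u = universe
        # Set in __init__ so the original instance keeps the attribute
        # regardless of which backend executes the frames.
        self._hydrogens = universe.select_atoms(hydrogens_sel)
```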

@RMeli (Member) commented Oct 3, 2024

> As for the benchmarks, should I directly use the files that are present in MDAnalysisTests for all of them, or what would be the best approach for adding parallelization benchmarks?

I think it is OK to use the provided files and report the results here, so other people can easily run the benchmark on their systems too, if they wish. However, if you have real systems you are interested in, it would definitely be good to see those results too.

I don't think we have discussed automated benchmarks using Airspeed Velocity (or other frameworks) at this stage; maybe that is something to keep in mind and discuss.
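
If that discussion happens, an ASV benchmark might look roughly like this (a hypothetical sketch; the class name and parameter choices are illustrative, not an existing benchmark file):

```python
# Hypothetical Airspeed Velocity benchmark sketch for the parallel backends.
import MDAnalysis as mda
from MDAnalysis.analysis.hydrogenbonds.hbond_analysis import \
    HydrogenBondAnalysis
from MDAnalysisTests.datafiles import waterPSF, waterDCD

class HydrogenBondAnalysisParallel:
    # ASV runs each time_* method once per combination of these parameters.
    params = (["serial", "multiprocessing", "dask"], [1, 3])
    param_names = ["backend", "n_workers"]

    def setup(self, backend, n_workers):
        self.u = mda.Universe(waterPSF, waterDCD)

    def time_run(self, backend, n_workers):
        hbonds = HydrogenBondAnalysis(self.u, hydrogens_sel="name H1 H2",
                                      acceptors_sel="name OH2")
        if backend == "serial":
            hbonds.run()
        else:
            hbonds.run(backend=backend, n_workers=n_workers)
```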


codecov bot commented Oct 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 93.53%. Comparing base (84ee67b) to head (e36ac7b).
Report is 4 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #4718      +/-   ##
===========================================
- Coverage    93.55%   93.53%   -0.02%     
===========================================
  Files          173      185      +12     
  Lines        21451    22523    +1072     
  Branches      3985     3986       +1     
===========================================
+ Hits         20068    21067     +999     
- Misses         929     1002      +73     
  Partials       454      454              


@marinegor (Contributor) commented:

Hi @talagayev, I don't think it's actually necessary to run the benchmarks on the MDAnalysis test files. Though if you still decide to, here's how I ran my benchmarks earlier: https://gist.github.com/marinegor/17558d1685cd2f24a6de65aa99cf5c9e

It also contains a Zenodo link for the trajectory itself.

@yuxuanzhuang (Contributor) left a review comment:

lgtm!

@orbeckst (Member) left a review comment:

lgtm, just waiting for CI (after having added the versionchanged)

@orbeckst (Member) commented Oct 6, 2024

@yuxuanzhuang @marinegor @talagayev can you briefly explain why the definitions of donors and acceptors needed to be moved from _prepare to __init__? I am particularly asking because this may be something that we need to add to the docs: what can you do in _prepare and what needs to be elsewhere.

@orbeckst orbeckst merged commit 474be5b into MDAnalysis:develop Oct 7, 2024
22 of 23 checks passed
@talagayev (Member, Author) commented:

@RMeli @marinegor Ah, I see, thanks for the information, yes that makes sense. And thanks @marinegor for the link showing how you ran the benchmarks; I can try to apply it in the future :)

@marinegor (Contributor) commented:

@orbeckst

> can you briefly explain why the definitions of donors and acceptors needed to be moved from _prepare to __init__?

I think in general it's OK to define attributes in either of those, unless you'll need them after you run _conclude -- then you certainly want to do it in __init__. In this particular case, the tests are written in a way that checks the "private" fields _donors, _acceptors and _hydrogens (link), hence the failures.

A general rule here might be: if an attribute is small and you'll need it later, put it in __init__; if it's of either size and you won't need it later, put it in _prepare; if it's big and you'll need it later, check again, or be prepared for a memory overhead.

lilyminium pushed a commit to lilyminium/mdanalysis that referenced this pull request Oct 19, 2024
…lysis (MDAnalysis#4718)

- Fixes MDAnalysis#4664 
- Parallelization of the backend support to the class HydrogenBondAnalysis in hbond_analysis.py
- Moved setting up of donors and acceptors from _prepare() to __init__() (needed to make
  parallel processing work)
- Addition of parallelization tests in test_hydrogenbonds_analysis.py and fixtures in conftest.py
- Updated Changelog

---------

Co-authored-by: Yuxuan Zhuang <[email protected]>
Co-authored-by: Oliver Beckstein <[email protected]>
Development

Successfully merging this pull request may close these issues.

MDAnalysis.analysis.hydrogenbonds.hbond_analysis: Implement parallelization or mark as unparallelizable
6 participants