
v1.3.0 - Into the multi-GPU-niverse #616

Merged
merged 429 commits into master
Dec 5, 2024
Conversation


@RandomDefaultUser (Member) commented Nov 28, 2024

New features

  • Multi-GPU inference: Models can now make predictions on an arbitrary number of GPUs

  • Multi-GPU training: Models can now be trained on an arbitrary number of GPUs

  • MALA now works with 2D materials, i.e., any system that is periodic in only two dimensions

  • Bispectrum descriptor calculation now possible in Python

    • This route is significantly slower than LAMMPS, but can be helpful for developers who want to test the entire MALA modeling workflow without installing LAMMPS (see the sketch after this list)
  • Logging for network training has been overhauled and now allows logging multiple metrics

  • (EXPERIMENTAL) Implementation of a mutual-information-based metric to replace/complement the ACSD

  • (EXPERIMENTAL) Implementation of a class for aligning the LDOS to a reference energy value; this can be useful for models spanning multiple mass densities
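
To make the pure-Python bispectrum route above concrete, here is a minimal sketch. Only the descriptor type name `Bispectrum` is taken from these notes; the flag, class, and method names (`use_lammps`, `mala.Descriptor`, `calculate_from_atoms`, the hyperparameter names) are assumptions and may differ from the actual v1.3.0 API:

```python
# Hypothetical sketch: pure-Python bispectrum descriptors (no LAMMPS).
# Attribute/method names below are assumptions, not the verified API.
import mala
from ase.io import read

parameters = mala.Parameters()
parameters.use_lammps = False  # assumed flag: force the Python fallback path
parameters.descriptors.descriptor_type = "Bispectrum"  # replaces deprecated "SNAP"
parameters.descriptors.bispectrum_twojmax = 10   # assumed parameter name
parameters.descriptors.bispectrum_cutoff = 4.67  # assumed parameter name

atoms = read("Be_snapshot.xyz")  # hypothetical input file
descriptor_calculator = mala.Descriptor(parameters)  # assumed factory class
descriptors = descriptor_calculator.calculate_from_atoms(
    atoms, grid_dimensions=[18, 18, 27]  # assumed signature
)
```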

Changes to API/user experience

  • New parallelization parameters available (a configuration sketch follows this list):
    • use_lammps - enable/disable LAMMPS (enabled by default and recommended for optimal performance; automatically disabled if LAMMPS is not found on the machine)
    • use_atomic_density_formula - enable total energy evaluation based on a Gaussian representation (enabled if LAMMPS and GPU are enabled, recommended for optimal performance; details can be found in our paper on size transfer)
    • use_ddp - enable/disable DDP, i.e., PyTorch's distributed training scheme (disabled by default)
  • Multiple LAMMPS/QE calculations can now be run in one directory
    • Prior to this version, doing so would lead to problems due to the file-based nature of these interfaces
    • This allows for multiple simultaneous inferences in the same folder
  • Class SNAP and all associated options are deprecated; use Bispectrum and the associated options instead (see the migration snippet after this list)
  • Default units for reading from .cube files are now set to units commonly used within Quantum ESPRESSO; this should make it easier to avoid inconsistencies in data sampling
  • The ASE calculator MALA now reads models with load_run() instead of load_model(), which is more consistent with the rest of MALA
  • Error reporting with the Tester class has been improved; all errors and energy values reported there are now consistently given in meV/atom
  • MALA calculators (LDOS, density, DOS) now also read energy contributions and forces from Quantum ESPRESSO output files; these can be accessed via properties
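
A sketch of how the three new flags might be set together for multi-GPU training. The flag names come from these release notes; the rest of the workflow (trainer setup, `torchrun` launch) is an assumption, not verified v1.3.0 usage:

```python
# Sketch: enabling the new parallelization flags for multi-GPU training.
# The three use_* flags are named in these release notes; everything else
# (trainer setup, launch command) is assumed, not verified.
import mala

parameters = mala.Parameters()
parameters.use_lammps = True                  # default; auto-disabled if LAMMPS is absent
parameters.use_atomic_density_formula = True  # Gaussian-representation total energy
parameters.use_ddp = True                     # PyTorch DistributedDataParallel training

# ... build data handler, network, and trainer as in a single-GPU run ...

# With DDP enabled, launch one process per GPU, e.g.:
#   torchrun --nproc_per_node=4 train_model.py
```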
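And an illustrative migration snippet for the two deprecations above (SNAP → Bispectrum, load_model → load_run()); the argument to load_run() is a hypothetical run name:

```python
# Migration sketch for the deprecations listed above.
import mala

parameters = mala.Parameters()

# Before v1.3.0 (deprecated):
# parameters.descriptors.descriptor_type = "SNAP"
# calculator = mala.MALA.load_model("my_model")
#
# From v1.3.0 on:
parameters.descriptors.descriptor_type = "Bispectrum"
calculator = mala.MALA.load_run("my_model")  # "my_model" is a hypothetical run name
```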

Fixes

  • Addressed various performance and accessibility issues in the CI/CD pipeline
  • Fixed compatibility with newer Optuna versions
  • Added missing docstrings
  • Fixed the shuffling interface; arbitrary numbers of shuffled snapshots can now be created without loss of information (see the sketch after this list)
  • Fixed inconsistent density dimensions when reading directly from .cube files
  • Fixed an error when using GPU graphs with arbitrary batch sizes
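
A hedged sketch of the fixed shuffling interface. The class name `DataShuffler` and its method signatures are assumptions based on the MALA API surface, not verified against v1.3.0; paths and file names are hypothetical:

```python
# Sketch: shuffling existing snapshots into an arbitrary number of new ones.
# Class and method names are assumed, not verified against v1.3.0.
import mala

parameters = mala.Parameters()
shuffler = mala.DataShuffler(parameters)

# Hypothetical snapshot files (descriptor inputs and LDOS outputs).
shuffler.add_snapshot("Be_in_0.npy", "./data", "Be_out_0.npy", "./data")
shuffler.add_snapshot("Be_in_1.npy", "./data", "Be_out_1.npy", "./data")

# The fix: any number of shuffled snapshots can be requested, and no grid
# points are dropped when the data is redistributed.
shuffler.shuffle_snapshots(save_name="Be_shuffled*", number_of_shuffled_snapshots=3)
```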

pcagas and others added 30 commits June 7, 2024 13:53
It's better to use the DOI, which always points to the latest version of
the test data repo. This avoids updating the CI in several places each
time there is a new version of the test data repo.

Co-authored-by: David Pape <[email protected]>
The top-level directory in the zip file is suffixed with a commit hash
that relates to the downloaded test data repository. Subsequent steps in
the pipeline expect this directory to have the name `test_data`. This snippet
avoids manual renaming of the extracted folder with each newer version
of the test data repository.
Node.js 16 actions are deprecated. Please update the following actions
to use Node.js 20: actions/checkout@v3, actions/cache@v3.
This fixes Node.js 16 deprecation warnings in cpu-tests.yml
Use RODARE API instead of hard-coded URL
Remove caches after pushes to develop/master (+tags)
The diffs of the two Conda environments are now displayed next to each other
to make it easier to spot a discrepancy between the two.
Enhance diff output of Conda environments
This is a temporary fix to make the caching mechanism in the CI work
again. It's currently broken due to a switch to BuildKit as the default
builder for Docker Engine as of version 23.0 (2023-02-01).
Use legacy builder to build Docker image
- Update workflow name to match style of the other workflows
- Fix: Node.js 16 actions are deprecated. Please update the following actions
       to use Node.js 20: actions/checkout@v3.
- Fix indentation issues
build_total_energy_energy_module.sh -> build_total_energy_module.sh

Link to GitHub issue documenting issues when building QE with CMake.
doc: link to GPU usage docs from lammps install section
@acangi (Member) left a comment

Looks good to me. Fantastic job @RandomDefaultUser!

@RandomDefaultUser (Member, Author)

The CI is currently failing because there is an "internal server error" being sent back by RODARE... I don't know why that is, but it is most likely only a temporary problem on RODARE's side. I will resubmit the CI later today and, if the error persists, again on Monday. If it persists thereafter, I will contact RODARE staff.

@srajama1 (Contributor) commented Dec 4, 2024

Looks good to me, thank you @RandomDefaultUser !

@RandomDefaultUser merged commit e83d3c3 into master on Dec 5, 2024
13 checks passed