
v1.3.0 - Into the multi-GPU-niverse #616

Merged
merged 429 commits into master
Dec 5, 2024
Conversation


@RandomDefaultUser (Member) commented Nov 28, 2024

New features

  • Multi-GPU inference: Models can now make predictions on an arbitrary number of GPUs

  • Multi-GPU training: Models can now be trained on an arbitrary number of GPUs

  • MALA now works with 2D materials, i.e., any system that is periodic in only two dimensions

  • Bispectrum descriptor calculation now possible in Python

    • This route is significantly slower than LAMMPS, but can be helpful for developers who want to test the entire MALA modeling workflow without installing LAMMPS (see the sketch after this list)
  • Logging for network training has been overhauled and now allows logging multiple metrics

  • (EXPERIMENTAL) Implementation of a mutual-information-based metric to replace/complement the ACSD

  • (EXPERIMENTAL) Implementation of a class for aligning the LDOS to a reference energy value; this can be useful for models spanning multiple mass densities
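
To make the pure-Python bispectrum route above concrete, here is a minimal sketch. Only the descriptor type name `Bispectrum` is taken from these notes; the flag, class, and method names (`use_lammps`, `mala.Descriptor`, `calculate_from_atoms`, the hyperparameter names) are assumptions and may differ from the actual v1.3.0 API:

```python
# Hypothetical sketch: pure-Python bispectrum descriptors (no LAMMPS).
# Attribute/method names below are assumptions, not the verified API.
import mala
from ase.io import read

parameters = mala.Parameters()
parameters.use_lammps = False  # assumed flag: force the Python fallback path
parameters.descriptors.descriptor_type = "Bispectrum"  # replaces deprecated "SNAP"
parameters.descriptors.bispectrum_twojmax = 10   # assumed parameter name
parameters.descriptors.bispectrum_cutoff = 4.67  # assumed parameter name

atoms = read("Be_snapshot.xyz")  # hypothetical input file
descriptor_calculator = mala.Descriptor(parameters)  # assumed factory class
descriptors = descriptor_calculator.calculate_from_atoms(
    atoms, grid_dimensions=[18, 18, 27]  # assumed signature
)
```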

Changes to API/user experience

  • New parallelization parameters available (a configuration sketch follows this list):
    • use_lammps - enable/disable LAMMPS (enabled by default and recommended for optimal performance; automatically disabled if LAMMPS is not found on the machine)
    • use_atomic_density_formula - enable total energy evaluation based on a Gaussian representation (enabled if LAMMPS and GPU are enabled, recommended for optimal performance; details can be found in our paper on size transfer)
    • use_ddp - enable/disable DDP, i.e., PyTorch's distributed training scheme (disabled by default)
  • Multiple LAMMPS/QE calculations can now be run in one directory
    • Prior to this version, doing so would lead to problems due to the file-based nature of these interfaces
    • This allows for multiple simultaneous inferences in the same folder
  • Class SNAP and all associated options are deprecated; use Bispectrum and the associated options instead (see the migration snippet after this list)
  • Default units for reading from .cube files are now set to units commonly used within Quantum ESPRESSO; this should make it easier to avoid inconsistencies in data sampling
  • The ASE calculator MALA now reads models with load_run() instead of load_model(), which is more consistent with the rest of MALA
  • Error reporting with the Tester class has been improved; all errors and energy values reported there are now consistently given in meV/atom
  • MALA calculators (LDOS, density, DOS) now also read energy contributions and forces from Quantum ESPRESSO output files; these can be accessed via properties
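
A sketch of how the three new flags might be set together for multi-GPU training. The flag names come from these release notes; the rest of the workflow (trainer setup, `torchrun` launch) is an assumption, not verified v1.3.0 usage:

```python
# Sketch: enabling the new parallelization flags for multi-GPU training.
# The three use_* flags are named in these release notes; everything else
# (trainer setup, launch command) is assumed, not verified.
import mala

parameters = mala.Parameters()
parameters.use_lammps = True                  # default; auto-disabled if LAMMPS is absent
parameters.use_atomic_density_formula = True  # Gaussian-representation total energy
parameters.use_ddp = True                     # PyTorch DistributedDataParallel training

# ... build data handler, network, and trainer as in a single-GPU run ...

# With DDP enabled, launch one process per GPU, e.g.:
#   torchrun --nproc_per_node=4 train_model.py
```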
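And an illustrative migration snippet for the two deprecations above (SNAP → Bispectrum, load_model → load_run()); the argument to load_run() is a hypothetical run name:

```python
# Migration sketch for the deprecations listed above.
import mala

parameters = mala.Parameters()

# Before v1.3.0 (deprecated):
# parameters.descriptors.descriptor_type = "SNAP"
# calculator = mala.MALA.load_model("my_model")
#
# From v1.3.0 on:
parameters.descriptors.descriptor_type = "Bispectrum"
calculator = mala.MALA.load_run("my_model")  # "my_model" is a hypothetical run name
```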

Fixes

  • Addressed various performance and accessibility issues in the CI/CD pipeline
  • Fixed compatibility with newer Optuna versions
  • Added missing docstrings
  • Fixed the shuffling interface; arbitrary numbers of shuffled snapshots can now be created without loss of information (see the sketch after this list)
  • Fixed inconsistent density dimensions when reading directly from .cube files
  • Fixed an error when using GPU graphs with arbitrary batch sizes
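
A hedged sketch of the fixed shuffling interface. The class name `DataShuffler` and its method signatures are assumptions based on the MALA API surface, not verified against v1.3.0; paths and file names are hypothetical:

```python
# Sketch: shuffling existing snapshots into an arbitrary number of new ones.
# Class and method names are assumed, not verified against v1.3.0.
import mala

parameters = mala.Parameters()
shuffler = mala.DataShuffler(parameters)

# Hypothetical snapshot files (descriptor inputs and LDOS outputs).
shuffler.add_snapshot("Be_in_0.npy", "./data", "Be_out_0.npy", "./data")
shuffler.add_snapshot("Be_in_1.npy", "./data", "Be_out_1.npy", "./data")

# The fix: any number of shuffled snapshots can be requested, and no grid
# points are dropped when the data is redistributed.
shuffler.shuffle_snapshots(save_name="Be_shuffled*", number_of_shuffled_snapshots=3)
```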

pcagas and others added 30 commits June 7, 2024 13:53
It's better to use the DOI, which always points to the latest version of
the test data repo. This avoids updating the CI in several places each
time there is a new version of the test data repo.

Co-authored-by: David Pape <[email protected]>
The top-level directory in the zip file is suffixed with a commit hash
that relates to the downloaded test data repository. Subsequent steps in
the pipeline expect this directory to have the name `test_data`. This snippet
avoids manual renaming of the extracted folder with each newer version
of the test data repository.
Node.js 16 actions are deprecated. Please update the following actions
to use Node.js 20: actions/checkout@v3, actions/cache@v3.
This fixes Node.js 16 deprecation warnings in cpu-tests.yml
Use RODARE API instead of hard-coded URL
Remove caches after pushes to develop/master (+tags)
The diffs of the two Conda environments are now displayed next to each other
to make it easier to spot a discrepancy between the two.
Enhance diff output of Conda environments
This is a temporary fix to make the caching mechanism in the CI work
again. It's currently broken due to a switch to BuildKit as the default
builder for Docker Engine as of version 23.0 (2023-02-01).
Use legacy builder to build Docker image
- Update workflow name to match style of the other workflows
- Fix: Node.js 16 actions are deprecated. Please update the following actions
       to use Node.js 20: actions/checkout@v3.
- Fix indentation issues
build_total_energy_energy_module.sh -> build_total_energy_module.sh

Link to GitHub issue documenting issues when building QE with CMake.
doc: link to GPU usage docs from lammps install section
@acangi (Member) left a comment

Looks good to me. Fantastic job @RandomDefaultUser!

@RandomDefaultUser (Member, Author)

The CI is currently failing because there is an "internal server error" being sent back by RODARE... I don't know why that is, but it is most likely only a temporary problem on RODARE's side. I will resubmit the CI later today and, if the error persists, again on Monday. If it persists thereafter, I will contact RODARE staff.

@srajama1 (Contributor) commented Dec 4, 2024

Looks good to me, thank you @RandomDefaultUser !

@RandomDefaultUser merged commit e83d3c3 into master on Dec 5, 2024
13 checks passed