Refactoring changes in the training routine #243

Merged: 37 commits, Aug 31, 2024

Conversation

@wiederm (Member) commented on Aug 22, 2024

Description

This PR introduces a few improvements/changes to the training routine and its tests:

  • improve logging messages
  • improve the fixture in conftest to obtain a single batch from different datasets and with different batch sizes without hardcoding any of the values
  • provide a training TOML file that sets the parameters for force training
  • the error classes now return the error per molecule instead of its mean; this makes the error easier to log and to track with torchmetrics
  • track the loss with torchmetrics instead of a custom torch.nn.Module (see the first sketch after this list)
  • allow caching the processed dataset: if the .pt file is present, we assume another process has already processed the dataset and, unless cache regeneration is requested, we use this file and skip the prepare_dataset operation
  • lock the prepare_dataset method so that only a single instance executes it per dataset (see the second sketch after this list)
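
A minimal sketch of the new per-molecule error pattern, assuming a hypothetical `PerMoleculeSquaredError` class (the actual modelforge error classes and tensor shapes may differ): the error class returns one value per molecule, and a `torchmetrics.MeanMetric` handles the aggregation across batches.

```python
import torch
from torchmetrics import MeanMetric

class PerMoleculeSquaredError(torch.nn.Module):
    """Illustrative error class: returns one error value per molecule
    instead of the batch mean, leaving aggregation to torchmetrics."""

    def forward(self, predicted: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        diff = predicted - target
        # collapse all dimensions except the molecule dimension
        return diff.reshape(diff.shape[0], -1).pow(2).mean(dim=1)

error_fn = PerMoleculeSquaredError()
train_metric = MeanMetric()  # running mean handled by torchmetrics

# dummy batch: predicted and reference energies for 4 molecules
predicted, target = torch.randn(4), torch.randn(4)

per_molecule_error = error_fn(predicted, target)  # shape: (4,)
train_metric.update(per_molecule_error)           # accumulate across batches
print(train_metric.compute())                     # mean over all molecules seen
```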

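A sketch of the caching and locking behaviour under stated assumptions: the cache path, the `_process_raw_data` helper, and the third-party `filelock` package are all illustrative, not the actual modelforge implementation.

```python
import os
import torch
from filelock import FileLock  # pip install filelock; the real lock mechanism may differ

def prepare_dataset(cache_path="processed.pt", regenerate_cache=False):
    """Process the raw dataset once; concurrent callers block on the lock."""
    with FileLock(cache_path + ".lock"):  # one instance per dataset runs this
        if os.path.exists(cache_path) and not regenerate_cache:
            # the .pt file is present: another process already processed
            # the dataset, so reuse it and skip the expensive preparation
            return torch.load(cache_path)
        data = _process_raw_data()  # placeholder for the real processing step
        torch.save(data, cache_path)
        return data

def _process_raw_data():
    return {"positions": torch.zeros(1, 3)}  # stand-in payload
```
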
This PR also includes bugfixes:

  • fix a bug in the force loss calculation
  • only retain/create the graph of the force computation when in training mode (not in evaluation or test mode), as sketched below
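
Conceptually, the second fix amounts to the sketch below, assuming forces are computed as the negative gradient of the energy with respect to positions (standard practice; the actual modelforge code may differ). In a LightningModule the flag would typically be `self.training`.

```python
import torch

def compute_forces(energy: torch.Tensor, positions: torch.Tensor, training: bool) -> torch.Tensor:
    # Forces are the negative gradient of the energy w.r.t. positions.
    # The graph of this gradient is only needed when we later backpropagate
    # through the forces (force training); creating/retaining it during
    # evaluation or testing only wastes memory.
    grad = torch.autograd.grad(
        energy.sum(),
        positions,
        create_graph=training,
        retain_graph=training,
    )[0]
    return -grad

# toy usage: quadratic energy for 4 molecules
positions = torch.randn(4, 3, requires_grad=True)
energy = positions.pow(2).sum(dim=1)
forces = compute_forces(energy, positions, training=True)
```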

Status

  • Ready to go

@wiederm wiederm self-assigned this Aug 27, 2024
@wiederm wiederm added the labels bug (Something isn't working) and refactoring (Improve the quality of the code without functional changes) on Aug 27, 2024
@wiederm wiederm changed the title Small refactoring changes in the training routine Refactoring changes in the training routine Aug 27, 2024
@chrisiacovella (Member) commented

The CI was failing due to running out of memory (the CI runner is capped at 16 GB); see #246. The code will skip training SAKE with forces.

Review comment on scripts/config.toml (outdated, resolved).
@wiederm wiederm merged commit 769d6a8 into main Aug 31, 2024
1 of 2 checks passed
@wiederm wiederm deleted the ref-training branch August 31, 2024 08:36
Labels: bug (Something isn't working), refactoring (Improve the quality of the code without functional changes)
2 participants: wiederm, chrisiacovella