Refactoring changes in the training routine #243

Merged: 37 commits, Aug 31, 2024

Conversation

@wiederm (Member) commented on Aug 22, 2024

Description

This PR introduces a few improvements/changes to the training routine and its tests:

  • improve logging messages
  • improve the fixture in conftest to obtain a single batch from different datasets and with different batch sizes without hardcoding any of the values
  • provide a training TOML file that sets the parameters for force training
  • the error classes now return the error per molecule instead of its mean; this makes the error easier to log and to track with torchmetrics
  • track the loss with torchmetrics instead of a custom torch.nn.Module (see the first sketch after this list)
  • allow caching the processed dataset: if the .pt file is present, we assume another process has already processed the dataset and, unless cache regeneration is requested, we use this file and skip the prepare_dataset operation
  • lock the prepare_dataset method so that only a single instance executes it per dataset (see the second sketch after this list)
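
A minimal sketch of the new per-molecule error pattern, assuming a hypothetical `PerMoleculeSquaredError` class (the actual modelforge error classes and tensor shapes may differ): the error class returns one value per molecule, and a `torchmetrics.MeanMetric` handles the aggregation across batches.

```python
import torch
from torchmetrics import MeanMetric

class PerMoleculeSquaredError(torch.nn.Module):
    """Illustrative error class: returns one error value per molecule
    instead of the batch mean, leaving aggregation to torchmetrics."""

    def forward(self, predicted: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        diff = predicted - target
        # collapse all dimensions except the molecule dimension
        return diff.reshape(diff.shape[0], -1).pow(2).mean(dim=1)

error_fn = PerMoleculeSquaredError()
train_metric = MeanMetric()  # running mean handled by torchmetrics

# dummy batch: predicted and reference energies for 4 molecules
predicted, target = torch.randn(4), torch.randn(4)

per_molecule_error = error_fn(predicted, target)  # shape: (4,)
train_metric.update(per_molecule_error)           # accumulate across batches
print(train_metric.compute())                     # mean over all molecules seen
```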

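A sketch of the caching and locking behaviour under stated assumptions: the cache path, the `_process_raw_data` helper, and the third-party `filelock` package are all illustrative, not the actual modelforge implementation.

```python
import os
import torch
from filelock import FileLock  # pip install filelock; the real lock mechanism may differ

def prepare_dataset(cache_path="processed.pt", regenerate_cache=False):
    """Process the raw dataset once; concurrent callers block on the lock."""
    with FileLock(cache_path + ".lock"):  # one instance per dataset runs this
        if os.path.exists(cache_path) and not regenerate_cache:
            # the .pt file is present: another process already processed
            # the dataset, so reuse it and skip the expensive preparation
            return torch.load(cache_path)
        data = _process_raw_data()  # placeholder for the real processing step
        torch.save(data, cache_path)
        return data

def _process_raw_data():
    return {"positions": torch.zeros(1, 3)}  # stand-in payload
```
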
This PR also includes bugfixes:

  • fix a bug in the force loss calculation
  • only retain/create the graph of the force computation when in training mode (not in evaluation or test mode), as sketched below
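
Conceptually, the second fix amounts to the sketch below, assuming forces are computed as the negative gradient of the energy with respect to positions (standard practice; the actual modelforge code may differ). In a LightningModule the flag would typically be `self.training`.

```python
import torch

def compute_forces(energy: torch.Tensor, positions: torch.Tensor, training: bool) -> torch.Tensor:
    # Forces are the negative gradient of the energy w.r.t. positions.
    # The graph of this gradient is only needed when we later backpropagate
    # through the forces (force training); creating/retaining it during
    # evaluation or testing only wastes memory.
    grad = torch.autograd.grad(
        energy.sum(),
        positions,
        create_graph=training,
        retain_graph=training,
    )[0]
    return -grad

# toy usage: quadratic energy for 4 molecules
positions = torch.randn(4, 3, requires_grad=True)
energy = positions.pow(2).sum(dim=1)
forces = compute_forces(energy, positions, training=True)
```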

Status

  • Ready to go

@wiederm wiederm self-assigned this Aug 27, 2024
@wiederm wiederm added the labels bug (Something isn't working) and refactoring (Improve the quality of the code without functional changes) on Aug 27, 2024
@wiederm wiederm changed the title Small refactoring changes in the training routine Refactoring changes in the training routine Aug 27, 2024
@chrisiacovella (Member) commented

The CI was failing due to running out of memory (the CI runner is capped at 16 GB); see #246. The code will skip training SAKE with forces.

Review comment on scripts/config.toml (outdated, resolved).
@wiederm wiederm merged commit 769d6a8 into main Aug 31, 2024
1 of 2 checks passed
@wiederm wiederm deleted the ref-training branch August 31, 2024 08:36
Labels: bug (Something isn't working), refactoring (Improve the quality of the code without functional changes)
2 participants: wiederm, chrisiacovella