
Mat shape mismatch for multihead fine-tune #615

Open
MengnanCui opened this issue Oct 1, 2024 · 11 comments
Labels
bug Something isn't working
@MengnanCui

Describe the bug

Hi, I want to do multihead fine-tuning on a personal pre-trained model (ptbp_model.model, based on version 0.3.7, main branch). After adding these options:

  • --foundation_model='../ptbp_model.model'
  • --multiheads_finetuning=True
  • --pt_train_file='../../../transferability7k/training.xyz'
  • --pt_valid_file='../../../transferability7k/validation.xyz'

I got the error messages below. Do you have any ideas about this problem? Do I have to use the latest code to pre-train a model and then fine-tune with the multihead approach?
For your information, the pre-trained model is based on the dftb_ key, and the fine-tuning is on the dft_ key.

Start Training with MACE:{'seed': 2747, 'training': 'training.xyz', 'validation': '../../fixed_validation.xyz', 'test': '../../fixed_test.xyz', 'config_type_weights': '{"Default":1.0}', 'E0s': {74: -11.022250868182281}, 'model': 'MACE', 'hidden_irreps': '128x0e + 128x1o', 'r_max': 6.0, 'batch_size': 16, 'valid_batch_size': 16, 'max_num_epochs': 1000, 'start_swa': 750, 'energy_key': 'dft_energy', 'forces_key': 'dft_forces', 'default_dtype': 'float64', 'patience': 500, 'device': 'cuda', 'multitask': False, 'distributed': False, 'exc_path': 'mace_run_train', 'foundation_model': False}
2024-10-01 07:30:05.662 INFO: ===========VERIFYING SETTINGS===========
2024-10-01 07:30:05.662 INFO: MACE version: 0.3.7
2024-10-01 07:30:05.723 INFO: CUDA version: 11.8, CUDA device: 0
2024-10-01 07:30:06.318 INFO: Using foundation model ../ptbp_model.model as initial checkpoint.
2024-10-01 07:30:06.319 INFO: ===========LOADING INPUT DATA===========
2024-10-01 07:30:06.319 INFO: Using heads: ['default']
2024-10-01 07:30:06.319 INFO: =============    Processing head default     ===========
2024-10-01 07:30:06.381 INFO: Training set [100 configs, 100 energy, 4761 forces] loaded from 'training.xyz'
2024-10-01 07:30:06.686 INFO: Validation set [1000 configs, 1000 energy, 46593 forces] loaded from '../../fixed_validation.xyz'
2024-10-01 07:30:06.990 INFO: Test set (1000 configs) loaded from '../../fixed_test.xyz':
2024-10-01 07:30:06.991 INFO: Default_Default: 1000 configs, 1000 energy, 46560 forces
2024-10-01 07:30:06.991 INFO: Total number of configurations: train=100, valid=1000, tests=[Default_Default: 1000],
2024-10-01 07:30:06.991 INFO: ==================Using multiheads finetuning mode==================
2024-10-01 07:30:06.991 INFO: Using foundation model for multiheads finetuning with ../../../transferability7k/training.xyz
2024-10-01 07:30:09.239 INFO: Training set [7642 configs, 7642 energy, 380589 forces] loaded from '../../../transferability7k/training.xyz'
2024-10-01 07:30:09.722 INFO: Validation set [1000 configs, 1000 energy, 46593 forces] loaded from '../../../transferability7k/validation.xyz'
2024-10-01 07:30:09.722 INFO: Total number of configurations: train=7642, valid=1000
2024-10-01 07:30:09.755 INFO: Atomic Numbers used: [74]
2024-10-01 07:30:09.756 INFO: Isolated Atomic Energies (E0s) not in training file, using command line argument
2024-10-01 07:30:09.757 INFO: Atomic Energies used (z: eV) for head default: {74: -11.022250868182281}
2024-10-01 07:30:09.760 INFO: Atomic Energies used (z: eV) for head pt_head: {74: -29.330717613489064}
2024-10-01 07:30:17.415 INFO: Average number of neighbors: 57.10205972318726
2024-10-01 07:30:17.416 INFO: During training the following quantities will be reported: energy, forces, virials, stress
2024-10-01 07:30:17.416 INFO: ===========MODEL DETAILS===========
Traceback (most recent call last):
  File "/home/mncui/software/miniconda3/envs/mace_foundation/bin/mace_run_train", line 8, in <module>
    sys.exit(main())
  File "/work/home/mncui/software/mace_main09_2024/mace/cli/run_train.py", line 62, in main
    run(args)
  File "/work/home/mncui/software/mace_main09_2024/mace/cli/run_train.py", line 501, in run
    model, output_args = configure_model(args, train_loader, atomic_energies, model_foundation, heads, z_table)
  File "/work/home/mncui/software/mace_main09_2024/mace/tools/model_script_utils.py", line 37, in configure_model
    args.mean, args.std = modules.scaling_classes[args.scaling](
  File "/work/home/mncui/software/mace_main09_2024/mace/modules/utils.py", line 312, in compute_mean_rms_energy_forces
    node_e0 = atomic_energies_fn(batch.node_attrs)
  File "/home/mncui/software/miniconda3/envs/mace_foundation/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/work/home/mncui/software/mace_main09_2024/mace/modules/blocks.py", line 160, in forward
    return torch.matmul(x, torch.atleast_2d(self.atomic_energies).T)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (229x1 and 2x1)
Training finished!

Here is the log file
MACE_model_run-2747_debug.log
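For readers hitting the same traceback: the failing line multiplies the one-hot node attributes by the transposed E0 table. A minimal NumPy sketch of the mismatch, using the shapes from the error message (229x1 and 2x1); the variable names and values here are illustrative, not MACE's actual internals:

```python
import numpy as np

# node_attrs one-hot encodes the elements per node. With a single element
# (Z=74), it has one column: shape (229, 1) for a 229-node batch.
node_attrs = np.ones((229, 1))

# With two heads (default + pt_head) the E0s end up as two values, and
# np.atleast_2d(...).T turns them into shape (2, 1) instead of (1, 1).
atomic_energies = np.array([-11.022250868182281, -29.330717613489064])
mat2 = np.atleast_2d(atomic_energies).T   # shape (2, 1)

try:
    node_e0 = node_attrs @ mat2           # inner dims 1 != 2
except ValueError as err:
    print("shape mismatch:", err)
```

The inner dimensions (1 element column vs. 2 head energies) do not line up, which is exactly the torch error "mat1 and mat2 shapes cannot be multiplied (229x1 and 2x1)".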

@ilyes319
Contributor

ilyes319 commented Oct 1, 2024

Hello,
Multihead fine-tuning is not yet supported for models other than the MP pretrained models. Hopefully I can fix that soon. For now, please use normal fine-tuning.

@MengnanCui
Author

OK, thanks for your reply; looking forward to updates.

@ilyes319
Contributor

ilyes319 commented Oct 2, 2024

@MengnanCui Can you test again with the latest main? I should have fixed that.

@ilyes319 ilyes319 reopened this Oct 2, 2024
@ilyes319 ilyes319 added the bug Something isn't working label Oct 2, 2024
@MengnanCui
Author

Great, thank you! I will try it!

@MengnanCui
Author

Hi @ilyes319, thanks so much for your efforts.

(1) I tried the latest main branch with the same settings as above; it still outputs this error while fine-tuning:

2024-10-03 08:32:13.385 INFO: ===========VERIFYING SETTINGS===========
2024-10-03 08:32:13.386 INFO: MACE version: 0.3.7
2024-10-03 08:32:13.453 INFO: CUDA version: 11.8, CUDA device: 0
2024-10-03 08:32:14.229 INFO: Using foundation model ../ptbp_model.model as initial checkpoint.
2024-10-03 08:32:14.230 INFO: ===========LOADING INPUT DATA===========
2024-10-03 08:32:14.230 INFO: Using heads: ['default']
2024-10-03 08:32:14.231 INFO: =============    Processing head default     ===========
2024-10-03 08:32:14.300 INFO: Training set [100 configs, 100 energy, 4761 forces] loaded from 'training.xyz'
2024-10-03 08:32:14.628 INFO: Validation set [1000 configs, 1000 energy, 46593 forces] loaded from '../../fixed_validation.xyz'
2024-10-03 08:32:14.946 INFO: Test set (1000 configs) loaded from '../../fixed_test.xyz':
2024-10-03 08:32:14.947 INFO: Default_Default: 1000 configs, 1000 energy, 46560 forces
2024-10-03 08:32:14.947 INFO: Total number of configurations: train=100, valid=1000, tests=[Default_Default: 1000],
2024-10-03 08:32:14.948 INFO: ==================Using multiheads finetuning mode==================
2024-10-03 08:32:14.948 INFO: Using foundation model for multiheads finetuning with ../../../transferability7k/training.xyz
2024-10-03 08:32:17.246 INFO: Training set [7642 configs, 7642 energy, 380589 forces] loaded from '../../../transferability7k/training.xyz'
2024-10-03 08:32:17.776 INFO: Validation set [1000 configs, 1000 energy, 46593 forces] loaded from '../../../transferability7k/validation.xyz'
2024-10-03 08:32:17.776 INFO: Total number of configurations: train=7642, valid=1000
2024-10-03 08:32:17.817 INFO: Atomic Numbers used: [74]
2024-10-03 08:32:17.817 INFO: Isolated Atomic Energies (E0s) not in training file, using command line argument
2024-10-03 08:32:17.823 INFO: Atomic Energies used (z: eV) for head default: {74: -11.022250868182281}
2024-10-03 08:32:17.823 INFO: Atomic Energies used (z: eV) for head pt_head: {74: -29.330717613489064}
2024-10-03 08:32:26.050 INFO: Average number of neighbors: 57.10205972318726
2024-10-03 08:32:26.051 INFO: During training the following quantities will be reported: energy, forces, virials, stress
2024-10-03 08:32:26.051 INFO: ===========MODEL DETAILS===========
Traceback (most recent call last):
  File "/home/mncui/software/miniconda3/envs/mace_foundation/bin/mace_run_train", line 8, in <module>
    sys.exit(main())
  File "/work/home/mncui/software/mace_main10_2024/mace/cli/run_train.py", line 63, in main
    run(args)
  File "/work/home/mncui/software/mace_main10_2024/mace/cli/run_train.py", line 505, in run
    model, output_args = configure_model(args, train_loader, atomic_energies, model_foundation, heads, z_table)
  File "/work/home/mncui/software/mace_main10_2024/mace/tools/model_script_utils.py", line 37, in configure_model
    args.mean, args.std = modules.scaling_classes[args.scaling](
  File "/work/home/mncui/software/mace_main10_2024/mace/modules/utils.py", line 312, in compute_mean_rms_energy_forces
    node_e0 = atomic_energies_fn(batch.node_attrs)
  File "/home/mncui/software/miniconda3/envs/mace_foundation/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/work/home/mncui/software/mace_main10_2024/mace/modules/blocks.py", line 160, in forward
    return torch.matmul(x, torch.atleast_2d(self.atomic_energies).T)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (421x1 and 2x1)

MACE_model_run-2024_debug.log

(2) On the other hand, for your information: to exclude the effect of the MACE version (the ../ptbp_model.model I used above was based on code from at least 3-4 months ago), I did a new training with the latest main branch, obtained ./train_main/MACE_model.model, then ran multihead fine-tuning based on it. This gives a different error message:

2024-10-03 10:05:17.508 INFO: ===========VERIFYING SETTINGS===========
2024-10-03 10:05:17.508 INFO: MACE version: 0.3.7
2024-10-03 10:05:17.570 INFO: CUDA version: 11.8, CUDA device: 0
2024-10-03 10:05:18.256 INFO: Using foundation model ./train_main/MACE_model.model as initial checkpoint.
2024-10-03 10:05:18.257 INFO: ===========LOADING INPUT DATA===========
2024-10-03 10:05:18.257 INFO: Using heads: ['default']
2024-10-03 10:05:18.257 INFO: =============    Processing head default     ===========
2024-10-03 10:05:18.320 INFO: Training set [100 configs, 100 energy, 4761 forces] loaded from 'training.xyz'
2024-10-03 10:05:18.624 INFO: Validation set [1000 configs, 1000 energy, 46593 forces] loaded from '../../fixed_validation.xyz'
2024-10-03 10:05:18.925 INFO: Test set (1000 configs) loaded from '../../fixed_test.xyz':
2024-10-03 10:05:18.926 INFO: Default_Default: 1000 configs, 1000 energy, 46560 forces
2024-10-03 10:05:18.927 INFO: Total number of configurations: train=100, valid=1000, tests=[Default_Default: 1000],
2024-10-03 10:05:18.927 INFO: ==================Using multiheads finetuning mode==================
2024-10-03 10:05:18.928 INFO: Using foundation model for multiheads finetuning with ../../../transferability7k/training.xyz
2024-10-03 10:05:21.214 INFO: Training set [7642 configs, 7642 energy, 380589 forces] loaded from '../../../transferability7k/training.xyz'
2024-10-03 10:05:21.689 INFO: Validation set [1000 configs, 1000 energy, 46593 forces] loaded from '../../../transferability7k/validation.xyz'
2024-10-03 10:05:21.689 INFO: Total number of configurations: train=7642, valid=1000
2024-10-03 10:05:21.719 INFO: Atomic Numbers used: [74]
2024-10-03 10:05:21.720 INFO: Isolated Atomic Energies (E0s) not in training file, using command line argument
Traceback (most recent call last):
  File "/home/mncui/software/miniconda3/envs/mace_foundation/bin/mace_run_train", line 8, in <module>
    sys.exit(main())
  File "/work/home/mncui/software/mace_main10_2024/mace/cli/run_train.py", line 63, in main
    run(args)
  File "/work/home/mncui/software/mace_main10_2024/mace/cli/run_train.py", line 356, in run
    atomic_energies_dict[head_config.head_name] = {
  File "/work/home/mncui/software/mace_main10_2024/mace/cli/run_train.py", line 357, in <dictcomp>
    z: model_foundation.atomic_energies_fn.atomic_energies[
IndexError: invalid index of a 0-dim tensor. Use `tensor.item()` in Python or `tensor.item<T>()` in C++ to convert a 0-dim tensor to a number

MACE_model_newrun-2024_debug.log
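The IndexError in this second run is the classic 0-dim tensor pitfall: with a single element, the foundation model's E0 table is stored as a scalar, and indexing a scalar fails. A minimal NumPy analogue (names and values illustrative, not the MACE code itself):

```python
import numpy as np

# A 0-dim array behaves like torch's 0-dim tensor: it holds one number
# but accepts no indices.
e0 = np.array(-29.330717613489064)   # 0-dim, shape ()

try:
    value = e0[0]                    # IndexError: 0-dim arrays can't be indexed
except IndexError as err:
    print("cannot index a 0-dim array:", err)

value = e0.item()                    # .item() extracts the Python float
print(value)
```

This is why the traceback suggests tensor.item(): the single-element case has to be handled before indexing into the E0 table.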

Thanks again, and I hope this info helps.

@ilyes319
Contributor

ilyes319 commented Oct 3, 2024

Could you send your input script, a small sample of your data, and your model to [email protected] so I can reproduce this myself?
Also, how are you passing your E0s?

@MengnanCui
Author

Hi, I hope the email finds you well. The E0s were set inside the script; there is only one element in the datasets, tungsten.

@MengnanCui
Author

By the way, the E0s for the pretrained model are set the same way in the input script, but come from a DFTB calculation: {74: -29.330717613489064}. As you can see, there are dftb_-tagged energies and forces inside all the data as well.

@gabor1
Collaborator

gabor1 commented Oct 3, 2024

The E0s need to be calculated with the same method as the data you are fitting.
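To make this concrete: the model fits energies relative to the sum of E0s, so a reference computed with a different method shifts every training target by a per-atom constant. A toy calculation with a made-up total energy (only the two E0 values come from this thread):

```python
# Made-up DFT total energy for a small tungsten system (eV).
E_total_dft = -25.0
n_atoms = 2

e0_dft  = -11.022250868182281   # W E0 from DFT (matches the data)
e0_dftb = -29.330717613489064   # W E0 from DFTB (wrong method for DFT data)

# Quantity the model actually fits: total energy minus the sum of E0s.
target_correct = E_total_dft - n_atoms * e0_dft
target_wrong   = E_total_dft - n_atoms * e0_dftb

# The mismatch is a constant n_atoms * (e0_dft - e0_dftb) offset that the
# network would have to absorb, degrading transferability.
print(target_correct, target_wrong)
```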

@MengnanCui
Author

Yes, that's why I have DFTB E0s for pretraining on DFTB data, and DFT E0s for fine-tuning on DFT data.

@ilyes319 ilyes319 self-assigned this Nov 1, 2024
@ilyes319
Contributor

ilyes319 commented Nov 4, 2024

@MengnanCui I should have fixed that in the main branch. Could you try it and tell me whether it is indeed fixed?
