
High GPU memory usage due to large intermediate tensor in calculate_radial_contributions in AimNet2 #315

Closed
Tracked by #316
wiederm opened this issue Nov 9, 2024 · 1 comment


wiederm commented Nov 9, 2024

Description:

We have identified a significant GPU memory issue in the AIMNet2InteractionModule, specifically in the calculate_radial_contributions function. The function creates a large intermediate tensor of shape (number_of_pairs, G, F_atom), which can consume substantial GPU memory for large datasets or complex models.

Steps to Reproduce:

1. Use the AimNet2Core model with a dataset.
2. Monitor GPU memory usage during the forward pass (one way to do this is sketched below).
3. Observe the spike in memory usage when calculate_radial_contributions is called.
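
A minimal sketch for the monitoring in step 2, using PyTorch's CUDA memory statistics; the model and batch lines are placeholders that depend on your setup and are not part of the modelforge API:

import torch

# model = ...   # an AimNet2Core instance (construction omitted)
# batch = ...   # a batch of inputs for the forward pass

torch.cuda.reset_peak_memory_stats()
# output = model(batch)  # forward pass
peak_gb = torch.cuda.max_memory_allocated() / 1e9
print(f"peak GPU memory during forward pass: {peak_gb:.2f} GB")
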
Expected Behavior:

The model should efficiently compute radial contributions without excessive GPU memory consumption, allowing for larger batch sizes and more complex models.

Actual Behavior:

The model consumes a large amount of GPU memory due to the creation of the intermediate tensor avf_s with shape (number_of_pairs, G, F_atom), where:

  • number_of_pairs is the total number of atomic pairs,
  • G is the number of radial basis functions, and
  • F_atom is the number of per-atom features.

This high memory usage limits the scalability of the model and may lead to CUDA out-of-memory errors.

def calculate_radial_contributions(
    self,
    gs: Tensor,            # (number_of_pairs, G) radial basis values per pair
    a_j: Tensor,           # (number_of_pairs, F_atom) features of atom j per pair
    number_of_atoms: int,
    idx_j: Tensor,         # (number_of_pairs,) index of atom j for each pair
) -> Tensor:
    # Compute radial contributions; the broadcasted product materializes
    # a tensor of shape (number_of_pairs, G, F_atom)
    avf_s = gs.unsqueeze(-1) * a_j.unsqueeze(1)
    avf_s = avf_s.sum(dim=1)  # sum over G -> (number_of_pairs, F_atom)

    # Aggregate per atom
    radial_contributions = torch.zeros(
        (number_of_atoms, avf_s.shape[-1]),  # F_atom features per atom
        device=avf_s.device,
        dtype=avf_s.dtype,
    )
    radial_contributions.index_add_(0, idx_j, avf_s)

    return radial_contributions

Analysis:

The operation gs.unsqueeze(-1) * a_j.unsqueeze(1) materializes an intermediate tensor of shape (number_of_pairs, G, F_atom). When number_of_pairs, G, and F_atom are large, this single tensor can dominate GPU memory during the forward pass.
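
A quick back-of-the-envelope estimate (the sizes below are illustrative assumptions, not measured values) shows how quickly this intermediate grows:

# Illustrative sizes, not taken from a specific run
number_of_pairs = 1_000_000  # total atomic pairs in a large batch
G = 64                       # number of radial basis functions
F_atom = 256                 # number of per-atom features
bytes_per_element = 4        # float32

intermediate_bytes = number_of_pairs * G * F_atom * bytes_per_element
print(f"{intermediate_bytes / 1e9:.1f} GB")  # ~65.5 GB for this single tensor

And because the tensor participates in autograd, it is typically kept alive until the backward pass, so peak usage during training is even higher.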

Proposed Solution:

Avoid materializing the (number_of_pairs, G, F_atom) tensor: map gs into the feature dimension of a_j first, then combine the two with a plain element-wise multiplication before the per-atom aggregation. A sketch of this idea is given below.
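
A minimal sketch of that idea (illustrative only, not necessarily the change adopted in PR #316; the projection radial_to_feature is a hypothetical module mapping G to F_atom, and tensor shapes follow the snippet above):

import torch
from torch import Tensor, nn

def calculate_radial_contributions_lowmem(
    gs: Tensor,                    # (number_of_pairs, G)
    a_j: Tensor,                   # (number_of_pairs, F_atom)
    number_of_atoms: int,
    idx_j: Tensor,                 # (number_of_pairs,)
    radial_to_feature: nn.Linear,  # hypothetical projection from G to F_atom
) -> Tensor:
    # Project gs into the feature dimension first, then combine element-wise.
    # This never materializes a (number_of_pairs, G, F_atom) tensor.
    avf_s = radial_to_feature(gs) * a_j  # (number_of_pairs, F_atom)

    # Aggregate per atom, as in the original function
    radial_contributions = torch.zeros(
        (number_of_atoms, a_j.shape[-1]), device=avf_s.device, dtype=avf_s.dtype
    )
    radial_contributions.index_add_(0, idx_j, avf_s)
    return radial_contributions

Note that for the exact code quoted above (no learned weights over G), the reduction factorizes completely: avf_s.sum(dim=1) equals gs.sum(dim=1, keepdim=True) * a_j, which requires no (number_of_pairs, G, F_atom) tensor at all.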

wiederm self-assigned this Nov 9, 2024
wiederm added the bug label Nov 9, 2024

wiederm commented Nov 10, 2024

This has been resolved in PR #316.

wiederm closed this as completed Nov 10, 2024