[release/2.5] ModuleTracker: Add explicit garbage collection #1661

Open
wants to merge 1 commit into
base: release/2.5

pragupta

When running an FSDP model with FlopCounterMode, we are experiencing a memory leak that comes from the ModuleTracker class. Even though ModuleTracker keeps only weak references to the operators, the tensors/operators are not freed after the backward pass. To force these tensors/operators to be freed after the forward pass, I explicitly added garbage collection in the post-forward hook.
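
For readers unfamiliar with why weak references alone may not be enough, here is a minimal, torch-free sketch of the idea. It assumes the leaked objects sit in reference cycles (the description does not state the root cause, so this is an assumption), and every name in it (`FakeTensor`, `TinyTracker`, `run_step`, `live_after`) is hypothetical and not part of PyTorch or of this patch.

```python
import gc
import weakref


class FakeTensor:
    """Hypothetical stand-in for a tensor/operator; not a torch class."""

    def __init__(self, name):
        self.name = name
        self.peer = None


class TinyTracker:
    """Hypothetical tracker that only holds weak references to what it sees."""

    def __init__(self, collect_after_forward):
        self.seen = weakref.WeakSet()
        self.collect_after_forward = collect_after_forward

    def post_forward_hook(self, outputs):
        for out in outputs:
            self.seen.add(out)
        if self.collect_after_forward:
            # Explicit collection: frees unreachable cycles left by earlier steps.
            gc.collect()


def run_step(tracker):
    a, b = FakeTensor("a"), FakeTensor("b")
    # Assumed failure mode: a reference cycle, so reference counting alone
    # never frees these objects once the function returns.
    a.peer, b.peer = b, a
    tracker.post_forward_hook([a, b])


def live_after(steps, collect_after_forward):
    """Run `steps` fake forward passes; return how many tracked objects are alive."""
    was_enabled = gc.isenabled()
    gc.disable()  # keep the demo deterministic: no automatic cyclic collection
    try:
        gc.collect()  # start from a clean slate
        tracker = TinyTracker(collect_after_forward)
        for _ in range(steps):
            run_step(tracker)
        return len(tracker.seen)
    finally:
        if was_enabled:
            gc.enable()
```

On CPython, `live_after(3, False)` reports 6 live objects, while `live_after(3, True)` reports 2: each hook call reclaims the cycles left behind by the previous step, and only the most recent step's objects remain until the next collection. The real hook in ModuleTracker may differ in detail.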

(cherry picked from commit 63dc40d)

Fixes #ISSUE_NUMBER

When running an FSDP model with FlopCounterMode, we are experiencing a memory
leak that comes from the ModuleTracker class. Even though ModuleTracker
keeps only weak references to the operators, the tensors/operators are
not freed after the backward pass. To force these tensors/operators to be
freed after the backward pass, I explicitly added garbage collection in
the post-forward hook.

(cherry picked from commit 63dc40d)
@pruthvistony
Collaborator

We have not yet decided on cherry-picks into 2.5, so I want to hold off on merging this PR.

@rocm-mici

Jenkins build for 328b01c325248ec56b34544b0f7ef3be914a5bb5 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@rocm-mici

Jenkins build for 328b01c325248ec56b34544b0f7ef3be914a5bb5 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

Detected error during Pytorch building:

   93 | __global__ void unrolled_elementwise_kernel(
      |                 ^
/var/lib/jenkins/pytorch/aten/src/ATen/native/hip/HIPLoops.cuh:93:17: warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
180 warnings generated when compiling for gfx942.
[8009/8668] Building HIPCC object caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/transformers/hip/flash_attn/torch_hip_generated_flash_api.hip.o
FAILED: caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/transformers/hip/flash_attn/torch_hip_generated_flash_api.hip.o /var/lib/jenkins/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/transformers/hip/flash_attn/torch_hip_generated_flash_api.hip.o 
cd /var/lib/jenkins/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/transformers/hip/flash_attn && /opt/conda/envs/py_3.10/bin/cmake -E make_directory /var/lib/jenkins/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/transformers/hip/flash_attn/. && /opt/conda/envs/py_3.10/bin/cmake -D verbose:BOOL=OFF -D build_configuration:STRING=RELEASE -D generated_file:STRING=/var/lib/jenkins/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/transformers/hip/flash_attn/./torch_hip_generated_flash_api.hip.o -P /var/lib/jenkins/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/transformers/hip/flash_attn/torch_hip_generated_flash_api.hip.o.cmake
In file included from /var/lib/jenkins/pytorch/aten/src/ATen/native/transformers/hip/flash_attn/flash_api.hip:57:
/var/lib/jenkins/pytorch/aten/src/ATen/native/transformers/hip/aotriton_adapter.h:120:10: error: no matching constructor for initialization of 'aotriton::TensorView<0>'
  120 |   return aotriton::TensorView<0>(reinterpret_cast<intptr_t>(q.data_ptr()),
      |          ^                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@rocm-mici

Jenkins build for 328b01c325248ec56b34544b0f7ef3be914a5bb5 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

Detected error during Pytorch building:

      |                 ^
48 warnings generated when compiling for gfx942.
[8010/8668] Building HIPCC object caffe2/CMakeFiles/torch_hip.dir/__/torch/csrc/distributed/c10d/torch_hip_generated_CUDASymmetricMemoryOps.cu.o
[8011/8668] Linking CXX static library lib/libunbox_lib.a
[8012/8668] Building HIPCC object caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/transformers/hip/flash_attn/torch_hip_generated_flash_api.hip.o
FAILED: caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/transformers/hip/flash_attn/torch_hip_generated_flash_api.hip.o /var/lib/jenkins/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/transformers/hip/flash_attn/torch_hip_generated_flash_api.hip.o 
cd /var/lib/jenkins/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/transformers/hip/flash_attn && /opt/conda/envs/py_3.10/bin/cmake -E make_directory /var/lib/jenkins/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/transformers/hip/flash_attn/. && /opt/conda/envs/py_3.10/bin/cmake -D verbose:BOOL=OFF -D build_configuration:STRING=RELEASE -D generated_file:STRING=/var/lib/jenkins/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/transformers/hip/flash_attn/./torch_hip_generated_flash_api.hip.o -P /var/lib/jenkins/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/transformers/hip/flash_attn/torch_hip_generated_flash_api.hip.o.cmake
In file included from /var/lib/jenkins/pytorch/aten/src/ATen/native/transformers/hip/flash_attn/flash_api.hip:57:
/var/lib/jenkins/pytorch/aten/src/ATen/native/transformers/hip/aotriton_adapter.h:120:10: error: no matching constructor for initialization of 'aotriton::TensorView<0>'
  120 |   return aotriton::TensorView<0>(reinterpret_cast<intptr_t>(q.data_ptr()),
      |          ^                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@rocm-mici

Jenkins build for 328b01c325248ec56b34544b0f7ef3be914a5bb5 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

Detected error during Pytorch building:

  580 | #define DEPRECATED(msg) __attribute__ ((deprecated(msg)))
      |                                         ^
1 warning generated when compiling for gfx908.
[8005/8668] Building HIPCC object caffe2/CMakeFiles/torch_hip.dir/__/torch/csrc/distributed/c10d/torch_hip_generated_CUDASymmetricMemoryOps.cu.o
[8006/8668] Building HIPCC object caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/transformers/hip/flash_attn/torch_hip_generated_flash_api.hip.o
FAILED: caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/transformers/hip/flash_attn/torch_hip_generated_flash_api.hip.o /var/lib/jenkins/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/transformers/hip/flash_attn/torch_hip_generated_flash_api.hip.o 
cd /var/lib/jenkins/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/transformers/hip/flash_attn && /opt/conda/envs/py_3.10/bin/cmake -E make_directory /var/lib/jenkins/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/transformers/hip/flash_attn/. && /opt/conda/envs/py_3.10/bin/cmake -D verbose:BOOL=OFF -D build_configuration:STRING=RELEASE -D generated_file:STRING=/var/lib/jenkins/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/transformers/hip/flash_attn/./torch_hip_generated_flash_api.hip.o -P /var/lib/jenkins/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/transformers/hip/flash_attn/torch_hip_generated_flash_api.hip.o.cmake
In file included from /var/lib/jenkins/pytorch/aten/src/ATen/native/transformers/hip/flash_attn/flash_api.hip:57:
/var/lib/jenkins/pytorch/aten/src/ATen/native/transformers/hip/aotriton_adapter.h:120:10: error: no matching constructor for initialization of 'aotriton::TensorView<0>'
  120 |   return aotriton::TensorView<0>(reinterpret_cast<intptr_t>(q.data_ptr()),
      |          ^                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@rocm-mici

Jenkins build for 328b01c325248ec56b34544b0f7ef3be914a5bb5 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

Detected error during Pytorch building:

[7970/8668] Building HIPCC object caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/hip/torch_hip_generated_modified_bessel_k1.hip.o
[7971/8668] Building HIPCC object caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/hip/torch_hip_generated_shifted_chebyshev_polynomial_u.hip.o
[7972/8668] Building HIPCC object caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/hip/torch_hip_generated_scaled_modified_bessel_k0.hip.o
[7973/8668] Building HIPCC object caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/hip/torch_hip_generated_shifted_chebyshev_polynomial_v.hip.o
[7974/8668] Building HIPCC object caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/transformers/hip/torch_hip_generated_attention_backward.hip.o
FAILED: caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/transformers/hip/torch_hip_generated_attention_backward.hip.o /var/lib/jenkins/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/transformers/hip/torch_hip_generated_attention_backward.hip.o 
cd /var/lib/jenkins/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/transformers/hip && /opt/conda/envs/py_3.10/bin/cmake -E make_directory /var/lib/jenkins/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/transformers/hip/. && /opt/conda/envs/py_3.10/bin/cmake -D verbose:BOOL=OFF -D build_configuration:STRING=RELEASE -D generated_file:STRING=/var/lib/jenkins/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/transformers/hip/./torch_hip_generated_attention_backward.hip.o -P /var/lib/jenkins/pytorch/build/caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/transformers/hip/torch_hip_generated_attention_backward.hip.o.cmake
In file included from /var/lib/jenkins/pytorch/aten/src/ATen/native/transformers/hip/attention_backward.hip:49:
/var/lib/jenkins/pytorch/aten/src/ATen/native/transformers/hip/aotriton_adapter.h:120:10: error: no matching constructor for initialization of 'aotriton::TensorView<0>'
  120 |   return aotriton::TensorView<0>(reinterpret_cast<intptr_t>(q.data_ptr()),
      |          ^                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

3 participants