Commit
Support garbage collection after pt2 compilation (#143364)
Summary:
**Context:** We recently observed a ~10% regression in training GPU memory, caused by inefficient memory recycling at PyTorch 2 (PT2) compilation time. This diff recovers that regression by supporting garbage collection after PT2 compilation.

Detailed debugging notes: https://docs.google.com/document/d/1EPopAyYyXwTnkyVaUJ5Xa_Uw9iWv3zimK7FkagKsKIY/edit?tab=t.0#bookmark=id.e5b26tcdfl5g

**Rollout / rollback plan:** To keep the system reliable, this change has two layers of rollout control:
- A jk controls the global rollout / rollback of this functionality. The jk is on by default.
- An env var controls the rollout for individual jobs. The env var is on by default.

X-link: pytorch/pytorch#143364
Approved by: https://github.com/ezyang
Reviewed By: ezyang
Differential Revision: D67328568
Pulled By: huydhn
fbshipit-source-id: d0c856846bef3bdd3b060df90cf5888d57245ff8
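The two-layer control described in the rollout plan could look roughly like the sketch below. It is not the code from this diff: the env var name `EXAMPLE_GC_AFTER_COMPILE`, the helper names, and the stubbed global switch are all hypothetical, and the compile step is represented by an arbitrary callable. The only real API used is Python's standard `gc.collect()`.

```python
import gc
import os

# Hypothetical name; the commit message does not state the real env var.
ENV_VAR = "EXAMPLE_GC_AFTER_COMPILE"


def global_switch_enabled() -> bool:
    """Stand-in for the global ("jk") switch; on by default, as in the message."""
    return True


def job_switch_enabled(environ=os.environ) -> bool:
    """Per-job switch read from the environment; on unless explicitly set to "0"."""
    return environ.get(ENV_VAR, "1") != "0"


def maybe_collect_after_compile(environ=os.environ, global_switch=global_switch_enabled):
    """Run a full garbage collection only if both layers allow it.

    Returns the number of unreachable objects found, or None if skipped.
    """
    if not global_switch():
        return None  # global rollback: skip for every job
    if not job_switch_enabled(environ):
        return None  # this job opted out
    return gc.collect()


def compile_then_collect(compile_fn, *args, **kwargs):
    """Call a (hypothetical) compile step, then collect garbage it left behind."""
    result = compile_fn(*args, **kwargs)
    maybe_collect_after_compile()
    return result
```

Checking the global switch first means a single flip can disable the behavior everywhere, while the env var lets one job opt out without affecting others.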