Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add model num params display, gpu memory metrics (#56) #544

Closed
wants to merge 1 commit into from

Conversation

H-Huang
Copy link
Member

@H-Huang H-Huang commented Aug 20, 2024

Stack from ghstack (oldest at bottom):

  • (to be filled)

This PR is the start of adding perf related metrics.
1 - This PR adds function for logging the total num of unique model
params, with option for only counting trainable params as well. (for
future peft/qlora type work).
2 - logs it with comma formatted logging and model name ala:
Screenshot 2024-02-12 at 4 12 22 PM

this helps de-mistify for example the size of our debug model as well:
Screenshot 2024-02-12 at 4 10 17 PM

additional updates - added in gpu mem tracking. We want to show the
user peak memory stats, as well as monitor and alert for any
cudacachealloc retries which are a perf hindrance.

Thus, added class GPUMemoryMonitor:
usage:
1 - instantiate
Screenshot 2024-02-13 at 9 32 11 AM

2 - start of training = start_monitoring()
3 - end of training = stop_monitoring()
4 - show results = get_peak_stats_str() and rank0_log it.
Screenshot 2024-02-13 at 9 12 45 AM

This PR is the start of adding perf related metrics.
1 - This PR adds function for logging the total num of unique model
params, with option for only counting trainable params as well. (for
future peft/qlora type work).
2 - logs it with comma formatted logging and model name ala:
<img width="716" alt="Screenshot 2024-02-12 at 4 12 22 PM"
src="https://github.com/pytorch-labs/torchtrain/assets/46302957/8eb48870-ab1e-4b70-9159-92864ff6c0e5">

this helps de-mistify for example the size of our debug model as well:
<img width="716" alt="Screenshot 2024-02-12 at 4 10 17 PM"
src="https://github.com/pytorch-labs/torchtrain/assets/46302957/77475306-54bc-48a6-bf28-9c9a542577fd">

**additional updates** - added in gpu mem tracking. We want to show the
user peak memory stats, as well as monitor and alert for any
cudacachealloc retries which are a perf hindrance.

Thus, added class GPUMemoryMonitor:
usage:
1 - instantiate
<img width="1329" alt="Screenshot 2024-02-13 at 9 32 11 AM"
src="https://github.com/pytorch-labs/torchtrain/assets/46302957/95610386-6fde-47bb-bbdc-bb7c399c5895">

2 - start of training = start_monitoring()
3 - end of training = stop_monitoring()
4 - show results = get_peak_stats_str() and rank0_log it.
<img width="1074" alt="Screenshot 2024-02-13 at 9 12 45 AM"
src="https://github.com/pytorch-labs/torchtrain/assets/46302957/b6c7c854-7d83-436a-bea9-a67109422381">

[ghstack-poisoned]
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Aug 20, 2024
@H-Huang H-Huang closed this Aug 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Meta Open Source bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants