
Pull in main from argonne-lcf/Megatron-DeepSpeed #9

Merged
553 commits merged into saforem2:main, Nov 15, 2024

Conversation

@saforem2 saforem2 commented Oct 12, 2024

Summary by Sourcery

Refactor the training script for improved logging and modularity, introduce new features such as support for additional optimizers and tokenizers, and enhance the data processing pipeline. Add new scripts for setting up and running training jobs on ALCF systems, and update documentation to reflect these changes.

New Features:

  • Introduce a new logging mechanism that uses a module-level log object from the logging module in place of the previous print_rank_0 calls, for more consistent logging (see the sketch after this list).
  • Add support for multiple optimizers including galore_adamw, adafactor, and adam8bit, providing more flexibility in training configurations.
  • Implement a new tokenizer type, Llama2Tokenizer, for Llama-style tokenization.
  • Add a new script pretrain_llama.py for pretraining Llama models, which includes setup for distributed training and logging with WandB.
  • Introduce a new helper script ALCF/helpers.sh for setting up and launching training jobs on ALCF systems, including environment setup and job configuration.
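
As an illustration of the new logging pattern, here is a minimal sketch of replacing print_rank_0-style calls with a module-level `log` object. The helper name `log_rank_0` and the use of the `RANK` environment variable are assumptions for this example, not necessarily what the PR implements.

```python
import logging
import os

# Module-level logger used in place of ad-hoc print_rank_0 calls.
log = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)


def log_rank_0(msg: str) -> None:
    """Hypothetical helper: emit a message only from the rank-0 process.

    Assumes the launcher (e.g. torchrun / mpiexec wrapper) exports RANK.
    """
    if int(os.environ.get("RANK", "0")) == 0:
        log.info(msg)


log_rank_0("building GPT model ...")
```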

Bug Fixes:

  • Fix the get_ltor_masks_and_position_ids function so that attention masks and position IDs are generated correctly (a simplified sketch follows this list).
  • Resolve compatibility issues with the torch and deepspeed versions by updating the setup and installation scripts.
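
For context, the sketch below shows the shape of what get_ltor_masks_and_position_ids produces: a lower-triangular causal mask, a loss mask that zeroes out end-of-document tokens, and sequential position IDs. It is a simplified illustration and omits the eod-reset and padding options the real Megatron function supports.

```python
import torch


def ltor_masks_and_position_ids_sketch(tokens: torch.Tensor, eod_token: int):
    """Simplified illustration of the masks / position IDs; not the PR's code."""
    batch_size, seq_length = tokens.size()
    # Causal (left-to-right) mask: position i may attend only to positions <= i.
    attention_mask = torch.tril(
        torch.ones((batch_size, 1, seq_length, seq_length), device=tokens.device)
    )
    # Megatron-style boolean mask where True marks positions to mask out.
    attention_mask = attention_mask < 0.5
    # Loss mask: do not compute loss on end-of-document tokens.
    loss_mask = torch.ones(tokens.size(), dtype=torch.float, device=tokens.device)
    loss_mask[tokens == eod_token] = 0.0
    # Position IDs are simply 0..seq_length-1 for each sequence in the batch.
    position_ids = (
        torch.arange(seq_length, dtype=torch.long, device=tokens.device)
        .unsqueeze(0)
        .expand_as(tokens)
    )
    return attention_mask, loss_mask, position_ids
```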

Enhancements:

  • Refactor the training script to improve modularity and logging, including the use of decorators for logging and timing functions.
  • Enhance the data loading process by introducing a BuildConcatDataset class for efficient dataset handling and concatenation.
  • Improve the optimizer setup by allowing more granular control over parameter groups and the conditions for weight decay and learning-rate scaling (see the sketch after this list).
  • Optimize the data processing pipeline by adding support for sequence parallelism and curriculum learning adjustments.
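
To make the parameter-group point concrete, here is a minimal sketch of splitting parameters into decay / no-decay groups. The function name get_param_groups and the exact exclusion rules (biases and 1-D parameters) are illustrative assumptions, not the PR's actual logic.

```python
import torch


def get_param_groups(model: torch.nn.Module, weight_decay: float = 0.1):
    """Hypothetical helper: split parameters into decay / no-decay groups."""
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Biases and 1-D (norm/scale) parameters are commonly excluded from decay.
        if param.ndim == 1 or name.endswith(".bias"):
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]


# Usage: optimizer = torch.optim.AdamW(get_param_groups(model), lr=2e-4)
```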

Build:

  • Add a new build script, make_data, to ensure megatron/data/helpers.cpp is compiled before training starts (a rough sketch of this step follows).
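
A rough idea of what such a build step amounts to is sketched below; the function name and the assumption that a Makefile in megatron/data builds the C++ helpers are illustrative, not taken from the PR.

```python
import subprocess
from pathlib import Path


def make_data(repo_root: str = ".") -> None:
    """Hypothetical sketch: compile the C++ dataset helpers before training."""
    data_dir = Path(repo_root) / "megatron" / "data"
    # Invoke the Makefile that sits alongside helpers.cpp; fail loudly on error.
    subprocess.run(["make", "-C", str(data_dir)], check=True)
```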

Deployment:

  • Add ALCF/README.md to guide users through setting up and running training jobs on ALCF systems, with detailed instructions for different environments.

Documentation:

  • Update documentation to include new features and enhancements, particularly around the new tokenizer and optimizer options.

Chores:

  • Organize and clean up the codebase by removing unused imports and redundant code blocks.

saforem2 and others added 30 commits May 20, 2024 09:44
Fix path in `prof.export_chrome_trace()` from `pretrain_gpt_alcf.py`
Merge in `tokenizer-tests` branch into `main`
saforem2 and others added 27 commits October 14, 2024 23:28
[merge]: into `microsoft-main` $\leftarrow$ from `hzheng-data-fix`
Owner Author

@saforem2 saforem2 left a comment


LGTM 👍

@saforem2 saforem2 merged commit 33962ee into saforem2:main Nov 15, 2024
1 check passed
saforem2 added a commit that referenced this pull request Nov 15, 2024
5 participants