
Pull in main from argonne-lcf/Megatron-DeepSpeed #9

Merged
553 commits merged into saforem2:main, Nov 15, 2024

Conversation

@saforem2 saforem2 commented Oct 12, 2024

Summary by Sourcery

Refactor the training script for improved logging and modularity, introduce new features such as support for additional optimizers and tokenizers, and enhance the data processing pipeline. Add new scripts for setting up and running training jobs on ALCF systems, and update documentation to reflect these changes.

New Features:

  • Introduce a new logging mechanism that uses a module-level log object from the logging module in place of the previous print_rank_0 calls, for more consistent logging (see the sketch after this list).
  • Add support for multiple optimizers including galore_adamw, adafactor, and adam8bit, providing more flexibility in training configurations.
  • Implement a new tokenizer type, Llama2Tokenizer, for Llama-style tokenization.
  • Add a new script pretrain_llama.py for pretraining Llama models, which includes setup for distributed training and logging with WandB.
  • Introduce a new helper script ALCF/helpers.sh for setting up and launching training jobs on ALCF systems, including environment setup and job configuration.
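
As an illustration of the new logging pattern, here is a minimal sketch of replacing print_rank_0-style calls with a module-level `log` object. The helper name `log_rank_0` and the use of the `RANK` environment variable are assumptions for this example, not necessarily what the PR implements.

```python
import logging
import os

# Module-level logger used in place of ad-hoc print_rank_0 calls.
log = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)


def log_rank_0(msg: str) -> None:
    """Hypothetical helper: emit a message only from the rank-0 process.

    Assumes the launcher (e.g. torchrun / mpiexec wrapper) exports RANK.
    """
    if int(os.environ.get("RANK", "0")) == 0:
        log.info(msg)


log_rank_0("building GPT model ...")
```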

Bug Fixes:

  • Fix the get_ltor_masks_and_position_ids function so that attention masks and position IDs are generated correctly (a simplified sketch follows this list).
  • Resolve compatibility issues with the torch and deepspeed versions by updating the setup and installation scripts.
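
For context, the sketch below shows the shape of what get_ltor_masks_and_position_ids produces: a lower-triangular causal mask, a loss mask that zeroes out end-of-document tokens, and sequential position IDs. It is a simplified illustration and omits the eod-reset and padding options the real Megatron function supports.

```python
import torch


def ltor_masks_and_position_ids_sketch(tokens: torch.Tensor, eod_token: int):
    """Simplified illustration of the masks / position IDs; not the PR's code."""
    batch_size, seq_length = tokens.size()
    # Causal (left-to-right) mask: position i may attend only to positions <= i.
    attention_mask = torch.tril(
        torch.ones((batch_size, 1, seq_length, seq_length), device=tokens.device)
    )
    # Megatron-style boolean mask where True marks positions to mask out.
    attention_mask = attention_mask < 0.5
    # Loss mask: do not compute loss on end-of-document tokens.
    loss_mask = torch.ones(tokens.size(), dtype=torch.float, device=tokens.device)
    loss_mask[tokens == eod_token] = 0.0
    # Position IDs are simply 0..seq_length-1 for each sequence in the batch.
    position_ids = (
        torch.arange(seq_length, dtype=torch.long, device=tokens.device)
        .unsqueeze(0)
        .expand_as(tokens)
    )
    return attention_mask, loss_mask, position_ids
```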

Enhancements:

  • Refactor the training script to improve modularity and logging, including the use of decorators for logging and timing functions.
  • Enhance the data loading process by introducing a BuildConcatDataset class for efficient dataset handling and concatenation.
  • Improve the optimizer setup by allowing more granular control over parameter groups and the conditions for weight decay and learning-rate scaling (see the sketch after this list).
  • Optimize the data processing pipeline by adding support for sequence parallelism and curriculum learning adjustments.
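
To make the parameter-group point concrete, here is a minimal sketch of splitting parameters into decay / no-decay groups. The function name get_param_groups and the exact exclusion rules (biases and 1-D parameters) are illustrative assumptions, not the PR's actual logic.

```python
import torch


def get_param_groups(model: torch.nn.Module, weight_decay: float = 0.1):
    """Hypothetical helper: split parameters into decay / no-decay groups."""
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Biases and 1-D (norm/scale) parameters are commonly excluded from decay.
        if param.ndim == 1 or name.endswith(".bias"):
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]


# Usage: optimizer = torch.optim.AdamW(get_param_groups(model), lr=2e-4)
```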

Build:

  • Add a new build script, make_data, to ensure megatron/data/helpers.cpp is compiled before training starts (a rough sketch of this step follows).
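
A rough idea of what such a build step amounts to is sketched below; the function name and the assumption that a Makefile in megatron/data builds the C++ helpers are illustrative, not taken from the PR.

```python
import subprocess
from pathlib import Path


def make_data(repo_root: str = ".") -> None:
    """Hypothetical sketch: compile the C++ dataset helpers before training."""
    data_dir = Path(repo_root) / "megatron" / "data"
    # Invoke the Makefile that sits alongside helpers.cpp; fail loudly on error.
    subprocess.run(["make", "-C", str(data_dir)], check=True)
```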

Deployment:

  • Add ALCF/README.md to guide users through setting up and running training jobs on ALCF systems, with detailed instructions for different environments.

Documentation:

  • Update documentation to include new features and enhancements, particularly around the new tokenizer and optimizer options.

Chores:

  • Organize and clean up the codebase by removing unused imports and redundant code blocks.

saforem2 and others added 30 commits May 20, 2024 09:44
Fix path in `prof.export_chrome_trace()` from `pretrain_gpt_alcf.py`
Merge in `tokenizer-tests` branch into `main`
saforem2 and others added 27 commits October 14, 2024 23:28
[merge]: into `microsoft-main` $\leftarrow$ from `hzheng-data-fix`
Owner Author

@saforem2 saforem2 left a comment


LGTM 👍

@saforem2 saforem2 merged commit 33962ee into saforem2:main Nov 15, 2024
1 check passed
saforem2 added a commit that referenced this pull request Nov 15, 2024
5 participants