
merge: Pull Huihuo's data fix into microsoft-main #63

Merged

saforem2 merged 54 commits into microsoft-main from hzheng-data-fix on Nov 5, 2024

Conversation

saforem2 (Member)

Data Fix

Adds a mechanism for correctly shuffling samples across documents from multiple corpora.

The mechanism is implemented inside the BuildConcatDataset class in megatron/data/gpt_dataset.py.

In particular, the shuffle indices are created at

gpt_dataset.py#L118-L119

which are then used for selecting an individual sample in the BuildConcatDataset.__getitem__ method here:

gpt_dataset.py#L130-L132
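
As a rough sketch of that selection logic (the surrounding class, the dataset_index / dataset_sample_index bookkeeping, and the datasets argument are illustrative assumptions; only the shuffle_index construction mirrors the diff):

import numpy as np

class ConcatDatasetSketch:
    """Illustrative stand-in for BuildConcatDataset's shuffled sample selection."""

    def __init__(self, datasets, seed=1234):
        self.datasets = datasets
        sizes = [len(d) for d in datasets]
        self.num_samples = sum(sizes)
        # For every global sample slot, record which corpus owns it and the
        # sample's index within that corpus.
        self.dataset_index = np.concatenate(
            [np.full(n, i, dtype=np.int64) for i, n in enumerate(sizes)]
        )
        self.dataset_sample_index = np.concatenate(
            [np.arange(n, dtype=np.int64) for n in sizes]
        )
        # Global shuffle index across all samples from all corpora (as in the diff).
        np_rng = np.random.RandomState(seed=seed)
        self.shuffle_index = np.arange(self.num_samples)
        np_rng.shuffle(self.shuffle_index)

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        # Map the incoming index through the global shuffle, then dispatch to
        # the owning corpus and its local sample index.
        j = self.shuffle_index[idx]
        return self.datasets[self.dataset_index[j]][self.dataset_sample_index[j]]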

Other changes

Comment on lines 119 to 121
np_rng = np.random.RandomState(seed=dataset_builders[0].seed)
self.shuffle_index = np.arange(self.num_samples)
np_rng.shuffle(self.shuffle_index)
@saforem2 (Member Author) commented on Oct 12, 2024

Create (and shuffle) a global shuffle_index of length num_samples.

Explicitly, BuildConcatDataset.shuffle_index is an array of indices that maps each sample to a particular {dataset_index, dataset_sample_index} pair.

Example

>>> import numpy as np
>>> shuffle_index = np.arange(10)
>>> np_rng = np.random.RandomState(seed=123)
>>> shuffle_index
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> np_rng.shuffle(shuffle_index)
>>> shuffle_index
array([4, 0, 7, 5, 8, 3, 1, 6, 9, 2])
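
Since the generator is seeded (from dataset_builders[0].seed in the diff), the same ordering is reproduced wherever the dataset is rebuilt with that seed. For example, continuing the session above with a fresh generator:

>>> np.random.RandomState(seed=123).permutation(10)
array([4, 0, 7, 5, 8, 3, 1, 6, 9, 2])

(RandomState.permutation(n) shuffles np.arange(n) with the same generator state, so it matches the shuffled array above.)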

@saforem2 (Member Author) commented on Oct 14, 2024

Other quality-of-life improvements

Note

The content below is copied from #64 (comment).

Logging Improvements

  • Replace calls to print_rank_0 with appropriate logging calls in megatron/data/{gpt_dataset.py,blendable_dataset.py,indexed_dataset.py}

    • Additionally, the verbose dataset-building messages in megatron/data/gpt_dataset.py are now emitted at the DEBUG level, so they only appear when LOG_LEVEL=DEBUG is set (a minimal sketch of this pattern appears after the example logs below).

    • Previously, these messages printed a lot of information that, while valuable, cluttered the logs.

    LOG_LEVEL=INFO (DEFAULT)
    [2024-10-13 10:35:15.801675][INFO][training_log.py:661] -  iteration=      21/  105963 | consumed_samples=       10368 | consumed_tokens=    42467328 | elapsed_time_per_iteration_ms=33736.1 | learning_rate=1.27402e-07 | global_batch_size= 4608 | lm loss=11.192327 | grad_norm=6.567 | actual_seqlen= 4096 | number_of_skipped_iterations=  0 | number_of_nan_iterations=  0 | samples_per_second=136.590 | tokens_per_gpu_per_second_tgs=5827.827 | [LM]TFLOPs=48.80 | [DS]TFLOPs=62.72 |
    [2024-10-13 10:35:15.803942][INFO][utils.py:249] - [Rank 0] (after 21 iterations) memory (MB) | allocated: 10272.43798828125 | max allocated: 46425.61083984375 | reserved: 58130.0 | max reserved: 58130.0
    [2024-10-13 10:35:36,499] [INFO] [logging.py:96:log_dist] [Rank 0] step=22, skipped=0, lr=[1.8402508875376674e-07, 1.8402508875376674e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
    [2024-10-13 10:35:36,500] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 4635.16 | bwd_microstep: 15678.70 | bwd_inner_microstep: 14110.46 | bwd_allreduce_microstep: 1568.21 | step_microstep: 326.84
    [2024-10-13 10:35:36,500] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 4635.16 | bwd: 15678.70 | bwd_inner: 14110.46 | bwd_allreduce: 1568.21 | step: 326.84
    [2024-10-13 10:35:36.513118][INFO][training_log.py:661] -  iteration=      22/  105963 | consumed_samples=       14976 | consumed_tokens=    61341696 | elapsed_time_per_iteration_ms=20711.3 | learning_rate=1.84025e-07 | global_batch_size= 4608 | lm loss=11.192505 | grad_norm=6.578 | actual_seqlen= 4096 | number_of_skipped_iterations=  0 | number_of_nan_iterations=  0 | samples_per_second=222.488 | tokens_per_gpu_per_second_tgs=9492.801 | [LM]TFLOPs=79.49 | [DS]TFLOPs=102.17 |
    LOG_LEVEL=DEBUG

    NOTE: The log level will only be set to DEBUG if LOG_LEVEL=DEBUG is set in the running environment, e.g.:

    LOG_LEVEL=DEBUG bash train_aGPT_7B.sh

    [2024-10-13 10:38:52.986261][INFO][training_log.py:661] -  iteration=      21/  105963 | consumed_samples=       10368 | consumed_tokens=    42467328 | elapsed_time_per_iteration_ms=29754.2 | learning_rate=1.27402e-07 | global_batch_size= 4608 | lm loss=11.192327 | grad_norm=6.567 | actual_seqlen= 4096 | number_of_skipped_iterations=  0 | number_of_nan_iterations=  0 | samples_per_second=154.869 | tokens_per_gpu_per_second_tgs=6607.731 | [LM]TFLOPs=55.33 | [DS]TFLOPs=71.12 |
    [2024-10-13 10:38:52.988508][INFO][utils.py:249] - [Rank 0] (after 21 iterations) memory (MB) | allocated: 10272.43798828125 | max allocated: 46425.61083984375 | reserved: 58130.0 | max reserved: 58130.0
    [2024-10-13 10:38:52.995778][DEBUG][gpt_dataset.py:446] -  >> building dataset for /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/stackexchange-0009_text_document
    [2024-10-13 10:38:52.996706][DEBUG][gpt_dataset.py:604] -  > building dataset index ...
    [2024-10-13 10:38:52.999996][DEBUG][indexed_dataset.py:476] -     reading sizes...
    [2024-10-13 10:38:53.000761][DEBUG][indexed_dataset.py:480] -     reading pointers...
    [2024-10-13 10:38:53.001260][DEBUG][indexed_dataset.py:487] -     reading document index...
    [2024-10-13 10:38:53.001737][DEBUG][indexed_dataset.py:541] -     creating numpy buffer of mmap...
    [2024-10-13 10:38:53.002212][DEBUG][indexed_dataset.py:542] - /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/stackexchange-0009_text_document.bin
    [2024-10-13 10:38:53.002807][DEBUG][indexed_dataset.py:546] -     creating memory view of numpy buffer...
    [2024-10-13 10:38:53.003338][DEBUG][gpt_dataset.py:608] -  > finished creating indexed dataset in 0.006100 seconds
    [2024-10-13 10:38:53.003861][DEBUG][gpt_dataset.py:613] -     number of documents: 1137635
    [2024-10-13 10:38:53.004351][DEBUG][gpt_dataset.py:454] -  > dataset split:
    [2024-10-13 10:38:53.004785][DEBUG][gpt_dataset.py:457] -     train:
    [2024-10-13 10:38:53.005214][DEBUG][gpt_dataset.py:458] -      document indices in [0, 1137635) total of 1137635 documents
    [2024-10-13 10:38:53.005793][DEBUG][gpt_dataset.py:457] -     validation:
    [2024-10-13 10:38:53.006226][DEBUG][gpt_dataset.py:458] -      document indices in [1137635, 1137635) total of 0 documents
    [2024-10-13 10:38:53.006762][DEBUG][gpt_dataset.py:457] -     test:
    [2024-10-13 10:38:53.007157][DEBUG][gpt_dataset.py:458] -      document indices in [1137635, 1137635) total of 0 documents
    [2024-10-13 10:38:53.021128][DEBUG][gpt_dataset.py:947] -  > loading doc-idx mapping from checkpoints/ds_stage0_nl6_hs4096_mb12_seq4096_sp1_pp1_tp1_bf16_optadamw_lr0.0003_lwf0.05_flash//.cache/dolma/index-cache/f7d2e4b9462a6e17d2e80c2a4545516d_doc_idx.npy
    [2024-10-13 10:38:53.024649][DEBUG][gpt_dataset.py:950] -  > loading sample-idx mapping from checkpoints/ds_stage0_nl6_hs4096_mb12_seq4096_sp1_pp1_tp1_bf16_optadamw_lr0.0003_lwf0.05_flash//.cache/dolma/index-cache/f7d2e4b9462a6e17d2e80c2a4545516d_sample_idx.npy
    [2024-10-13 10:38:53.028050][DEBUG][gpt_dataset.py:953] -  > loading shuffle-idx mapping from checkpoints/ds_stage0_nl6_hs4096_mb12_seq4096_sp1_pp1_tp1_bf16_optadamw_lr0.0003_lwf0.05_flash//.cache/dolma/index-cache/f7d2e4b9462a6e17d2e80c2a4545516d_shuffle_idx.npy
    [2024-10-13 10:38:53.030382][DEBUG][gpt_dataset.py:956] -     loaded indexed file in 0.009 seconds
    [2024-10-13 10:38:53.031057][DEBUG][gpt_dataset.py:959] -     total number of samples: 584688
    [2024-10-13 10:38:53.031533][DEBUG][gpt_dataset.py:960] -     total number of epochs: 3
    [2024-10-13 10:38:53.032231][DEBUG][gpt_dataset.py:446] -  >> building dataset for /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/cc_en_middle-0323_text_document
    [2024-10-13 10:38:53.032845][DEBUG][gpt_dataset.py:604] -  > building dataset index ...
    [2024-10-13 10:38:53.035264][DEBUG][indexed_dataset.py:476] -     reading sizes...
    [2024-10-13 10:38:53.035931][DEBUG][indexed_dataset.py:480] -     reading pointers...
    [2024-10-13 10:38:53.036379][DEBUG][indexed_dataset.py:487] -     reading document index...
    [2024-10-13 10:38:53.036806][DEBUG][indexed_dataset.py:541] -     creating numpy buffer of mmap...
    [2024-10-13 10:38:53.037186][DEBUG][indexed_dataset.py:542] - /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/cc_en_middle-0323_text_document.bin
    [2024-10-13 10:38:53.037644][DEBUG][indexed_dataset.py:546] -     creating memory view of numpy buffer...
    [2024-10-13 10:38:53.038064][DEBUG][gpt_dataset.py:608] -  > finished creating indexed dataset in 0.004776 seconds
    [2024-10-13 10:38:53.038486][DEBUG][gpt_dataset.py:613] -     number of documents: 1795808
    [2024-10-13 10:38:53.038876][DEBUG][gpt_dataset.py:454] -  > dataset split:
    [2024-10-13 10:38:53.039214][DEBUG][gpt_dataset.py:457] -     train:
    [2024-10-13 10:38:53.039524][DEBUG][gpt_dataset.py:458] -      document indices in [0, 1795808) total of 1795808 documents
    [2024-10-13 10:38:53.039964][DEBUG][gpt_dataset.py:457] -     validation:
    [2024-10-13 10:38:53.040293][DEBUG][gpt_dataset.py:458] -      document indices in [1795808, 1795808) total of 0 documents
    [2024-10-13 10:38:53.040701][DEBUG][gpt_dataset.py:457] -     test:
    [2024-10-13 10:38:53.041008][DEBUG][gpt_dataset.py:458] -      document indices in [1795808, 1795808) total of 0 documents
    [2024-10-13 10:38:53.059401][DEBUG][gpt_dataset.py:947] -  > loading doc-idx mapping from checkpoints/ds_stage0_nl6_hs4096_mb12_seq4096_sp1_pp1_tp1_bf16_optadamw_lr0.0003_lwf0.05_flash//.cache/dolma/index-cache/2572e1fae8e82f973d4d4528bb1073d5_doc_idx.npy
    [2024-10-13 10:38:53.062711][DEBUG][gpt_dataset.py:950] -  > loading sample-idx mapping from checkpoints/ds_stage0_nl6_hs4096_mb12_seq4096_sp1_pp1_tp1_bf16_optadamw_lr0.0003_lwf0.05_flash//.cache/dolma/index-cache/2572e1fae8e82f973d4d4528bb1073d5_sample_idx.npy
    [2024-10-13 10:38:53.065822][DEBUG][gpt_dataset.py:953] -  > loading shuffle-idx mapping from checkpoints/ds_stage0_nl6_hs4096_mb12_seq4096_sp1_pp1_tp1_bf16_optadamw_lr0.0003_lwf0.05_flash//.cache/dolma/index-cache/2572e1fae8e82f973d4d4528bb1073d5_shuffle_idx.npy
    [2024-10-13 10:38:53.067502][DEBUG][gpt_dataset.py:956] -     loaded indexed file in 0.008 seconds
    [2024-10-13 10:38:53.068089][DEBUG][gpt_dataset.py:959] -     total number of samples: 339068
    [2024-10-13 10:38:53.068500][DEBUG][gpt_dataset.py:960] -     total number of epochs: 1
    
    
    # (...clipped...)
    
    [2024-10-13 10:38:53.069112][DEBUG][gpt_dataset.py:446] -  >> building dataset for /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/pes2o-0008_text_document
    [2024-10-13 10:38:53.069627][DEBUG][gpt_dataset.py:604] -  > building dataset index ...
    [2024-10-13 10:38:53.072254][DEBUG][indexed_dataset.py:476] -     reading sizes...
    [2024-10-13 10:38:53.072850][DEBUG][indexed_dataset.py:480] -     reading pointers...
    [2024-10-13 10:38:53.073281][DEBUG][indexed_dataset.py:487] -     reading document index...
    [2024-10-13 10:38:53.073709][DEBUG][indexed_dataset.py:541] -     creating numpy buffer of mmap...
    [2024-10-13 10:38:53.074143][DEBUG][indexed_dataset.py:542] - /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/pes2o-0008_text_document.bin
    [2024-10-13 10:38:53.074674][DEBUG][indexed_dataset.py:546] -     creating memory view of numpy buffer...
    [2024-10-13 10:38:53.075158][DEBUG][gpt_dataset.py:608] -  > finished creating indexed dataset in 0.005068 seconds
    [2024-10-13 10:38:53.075644][DEBUG][gpt_dataset.py:613] -     number of documents: 468277
    [2024-10-13 10:38:53.076103][DEBUG][gpt_dataset.py:454] -  > dataset split:
    [2024-10-13 10:38:53.076501][DEBUG][gpt_dataset.py:457] -     train:
    [2024-10-13 10:38:53.076879][DEBUG][gpt_dataset.py:458] -      document indices in [0, 468277) total of 468277 documents
    [2024-10-13 10:38:53.077385][DEBUG][gpt_dataset.py:457] -     validation:
    [2024-10-13 10:38:53.077797][DEBUG][gpt_dataset.py:458] -      document indices in [468277, 468277) total of 0 documents
    [2024-10-13 10:38:53.078301][DEBUG][gpt_dataset.py:457] -     test:
    [2024-10-13 10:38:53.078682][DEBUG][gpt_dataset.py:458] -      document indices in [468277, 468277) total of 0 documents
    [2024-10-13 10:38:53.085577][DEBUG][gpt_dataset.py:947] -  > loading doc-idx mapping from checkpoints/ds_stage0_nl6_hs4096_mb12_seq4096_sp1_pp1_tp1_bf16_optadamw_lr0.0003_lwf0.05_flash//.cache/dolma/index-cache/44232bbdda53b92a0c2f0a6e918b0b7c_doc_idx.npy
    [2024-10-13 10:38:53.089132][DEBUG][gpt_dataset.py:950] -  > loading sample-idx mapping from checkpoints/ds_stage0_nl6_hs4096_mb12_seq4096_sp1_pp1_tp1_bf16_optadamw_lr0.0003_lwf0.05_flash//.cache/dolma/index-cache/44232bbdda53b92a0c2f0a6e918b0b7c_sample_idx.npy
    [2024-10-13 10:38:53.092393][DEBUG][gpt_dataset.py:953] -  > loading shuffle-idx mapping from checkpoints/ds_stage0_nl6_hs4096_mb12_seq4096_sp1_pp1_tp1_bf16_optadamw_lr0.0003_lwf0.05_flash//.cache/dolma/index-cache/44232bbdda53b92a0c2f0a6e918b0b7c_shuffle_idx.npy
    [2024-10-13 10:38:53.095662][DEBUG][gpt_dataset.py:956] -     loaded indexed file in 0.010 seconds
    [2024-10-13 10:38:53.096285][DEBUG][gpt_dataset.py:959] -     total number of samples: 2441651
    [2024-10-13 10:38:53.096752][DEBUG][gpt_dataset.py:960] -     total number of epochs: 3
    
    [2024-10-13 10:39:13,696] [INFO] [logging.py:96:log_dist] [Rank 0] step=22, skipped=0, lr=[1.8402508875376674e-07, 1.8402508875376674e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
    [2024-10-13 10:39:13,696] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 4631.17 | bwd_microstep: 15589.38 | bwd_inner_microstep: 14105.76 | bwd_allreduce_microstep: 1483.58 | step_microstep: 423.28
    [2024-10-13 10:39:13,696] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 4631.17 | bwd: 15589.38 | bwd_inner: 14105.76 | bwd_allreduce: 1483.59 | step: 423.28
    [2024-10-13 10:39:13.709465][INFO][training_log.py:661] -  iteration=      22/  105963 | consumed_samples=       14976 | consumed_tokens=    61341696 | elapsed_time_per_iteration_ms=20723.0 | learning_rate=1.84025e-07 | global_batch_size= 4608 | lm loss=11.192505 | grad_norm=6.578 | actual_seqlen= 4096 | number_of_skipped_iterations=  0 | number_of_nan_iterations=  0 | samples_per_second=222.362 | tokens_per_gpu_per_second_tgs=9487.424 | [LM]TFLOPs=79.45 | [DS]TFLOPs=102.11 |
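
For reference, a minimal sketch of the logging pattern described above, assuming Python's standard logging module; the logger name, format string, and exact wiring in megatron/data/gpt_dataset.py are assumptions rather than the repository's actual code:

import logging
import os

# Resolve the desired level from the environment, defaulting to INFO.
level_name = os.environ.get("LOG_LEVEL", "INFO").upper()

logging.basicConfig(
    format="[%(asctime)s][%(levelname)s][%(filename)s:%(lineno)d] - %(message)s",
    level=getattr(logging, level_name, logging.INFO),
)
log = logging.getLogger(__name__)

# Verbose dataset-building details are emitted at DEBUG, so they stay hidden
# unless LOG_LEVEL=DEBUG is set in the environment.
log.debug(" > building dataset index ...")
log.info("iteration summary, throughput, loss, ...")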

Changes to ALCF/helpers.sh

  • Use bash arrays instead of haphazardly concatenating strings in ALCF/helpers.sh
  • Add the ability to manually specify ${CKPT_DIR}, selecting which checkpoint directory to try to load from

Loading a checkpoint from a custom CKPT_DIR

  • We add new functions to ALCF/helpers.sh for determining the checkpoint directory when starting a new run.

    If specified at runtime, e.g.:

    CKPT_DIR=checkpoints/custom_checkpoint bash train_aGPT_7B.sh

    it will then:

    1. try to load a checkpoint from ${CKPT_DIR}, and
    2. save future checkpoints to ${CKPT_DIR}.

    If not specified, checkpoints will be saved to checkpoints/$(get_output_prefix), where $(get_output_prefix) produces a string that uniquely identifies each run, e.g.

    checkpoints/ws768_ds_stage1_nl32_hs4096_mb4_seq4096_gb3072_sp1_pp1_tp1_bf16_optadamw_lr0.00020_lwf0.05/

saforem2 merged commit 40db8c2 into microsoft-main on Nov 5, 2024
4 checks passed
saforem2 deleted the hzheng-data-fix branch on November 7, 2024