
merge: Pull Huihuo's data fix into microsoft-main #63

Merged

saforem2 merged 54 commits into microsoft-main from hzheng-data-fix on Nov 5, 2024

Conversation

saforem2 (Member)

Data Fix

Adds a mechanism for correctly shuffling samples across documents from multiple corpora.

The mechanism is implemented inside the BuildConcatDataset class in megatron/data/gpt_dataset.py.

In particular, the shuffle indices are created at

gpt_dataset.py#L118-L119

which are then used for selecting an individual sample in the BuildConcatDataset.__getitem__ method here:

gpt_dataset.py#L130-L132
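
As a rough sketch of that selection logic (the surrounding class, the dataset_index / dataset_sample_index bookkeeping, and the datasets argument are illustrative assumptions; only the shuffle_index construction mirrors the diff):

import numpy as np

class ConcatDatasetSketch:
    """Illustrative stand-in for BuildConcatDataset's shuffled sample selection."""

    def __init__(self, datasets, seed=1234):
        self.datasets = datasets
        sizes = [len(d) for d in datasets]
        self.num_samples = sum(sizes)
        # For every global sample slot, record which corpus owns it and the
        # sample's index within that corpus.
        self.dataset_index = np.concatenate(
            [np.full(n, i, dtype=np.int64) for i, n in enumerate(sizes)]
        )
        self.dataset_sample_index = np.concatenate(
            [np.arange(n, dtype=np.int64) for n in sizes]
        )
        # Global shuffle index across all samples from all corpora (as in the diff).
        np_rng = np.random.RandomState(seed=seed)
        self.shuffle_index = np.arange(self.num_samples)
        np_rng.shuffle(self.shuffle_index)

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        # Map the incoming index through the global shuffle, then dispatch to
        # the owning corpus and its local sample index.
        j = self.shuffle_index[idx]
        return self.datasets[self.dataset_index[j]][self.dataset_sample_index[j]]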

Other changes

Comment on lines 119 to 121
np_rng = np.random.RandomState(seed=dataset_builders[0].seed)
self.shuffle_index = np.arange(self.num_samples)
np_rng.shuffle(self.shuffle_index)
@saforem2 (Member Author) commented on Oct 12, 2024

Create (and shuffle) a global shuffle_index of length num_samples.

Explicitly, BuildConcatDataset.shuffle_index is an array of indices that maps each sample to a particular {dataset_index, dataset_sample_index} pair.

Example

>>> import numpy as np
>>> shuffle_index = np.arange(10)
>>> np_rng = np.random.RandomState(seed=123)
>>> shuffle_index
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> np_rng.shuffle(shuffle_index)
>>> shuffle_index
array([4, 0, 7, 5, 8, 3, 1, 6, 9, 2])
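
Since the generator is seeded (from dataset_builders[0].seed in the diff), the same ordering is reproduced wherever the dataset is rebuilt with that seed. For example, continuing the session above with a fresh generator:

>>> np.random.RandomState(seed=123).permutation(10)
array([4, 0, 7, 5, 8, 3, 1, 6, 9, 2])

(RandomState.permutation(n) shuffles np.arange(n) with the same generator state, so it matches the shuffled array above.)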

@saforem2 (Member Author) commented on Oct 14, 2024

Other quality-of-life improvements

Note

The content below is copied from #64 (comment).

Logging Improvements

  • Replace calls to print_rank_0 with appropriate logging calls in megatron/data/{gpt_dataset.py,blendable_dataset.py,indexed_dataset.py}

    • Additionally, the verbose dataset-building messages in megatron/data/gpt_dataset.py are now emitted at the DEBUG level, so they only appear when LOG_LEVEL=DEBUG is set (a minimal sketch of this pattern appears after the example logs below).

    • Previously, these messages printed a lot of information that, while valuable, cluttered the logs.

    LOG_LEVEL=INFO (DEFAULT)
    [2024-10-13 10:35:15.801675][INFO][training_log.py:661] -  iteration=      21/  105963 | consumed_samples=       10368 | consumed_tokens=    42467328 | elapsed_time_per_iteration_ms=33736.1 | learning_rate=1.27402e-07 | global_batch_size= 4608 | lm loss=11.192327 | grad_norm=6.567 | actual_seqlen= 4096 | number_of_skipped_iterations=  0 | number_of_nan_iterations=  0 | samples_per_second=136.590 | tokens_per_gpu_per_second_tgs=5827.827 | [LM]TFLOPs=48.80 | [DS]TFLOPs=62.72 |
    [2024-10-13 10:35:15.803942][INFO][utils.py:249] - [Rank 0] (after 21 iterations) memory (MB) | allocated: 10272.43798828125 | max allocated: 46425.61083984375 | reserved: 58130.0 | max reserved: 58130.0
    [2024-10-13 10:35:36,499] [INFO] [logging.py:96:log_dist] [Rank 0] step=22, skipped=0, lr=[1.8402508875376674e-07, 1.8402508875376674e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
    [2024-10-13 10:35:36,500] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 4635.16 | bwd_microstep: 15678.70 | bwd_inner_microstep: 14110.46 | bwd_allreduce_microstep: 1568.21 | step_microstep: 326.84
    [2024-10-13 10:35:36,500] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 4635.16 | bwd: 15678.70 | bwd_inner: 14110.46 | bwd_allreduce: 1568.21 | step: 326.84
    [2024-10-13 10:35:36.513118][INFO][training_log.py:661] -  iteration=      22/  105963 | consumed_samples=       14976 | consumed_tokens=    61341696 | elapsed_time_per_iteration_ms=20711.3 | learning_rate=1.84025e-07 | global_batch_size= 4608 | lm loss=11.192505 | grad_norm=6.578 | actual_seqlen= 4096 | number_of_skipped_iterations=  0 | number_of_nan_iterations=  0 | samples_per_second=222.488 | tokens_per_gpu_per_second_tgs=9492.801 | [LM]TFLOPs=79.49 | [DS]TFLOPs=102.17 |
    LOG_LEVEL=DEBUG

    NOTE: The log level will only be set to DEBUG if LOG_LEVEL=DEBUG is set in the running environment, e.g.:

    LOG_LEVEL=DEBUG bash train_aGPT_7B.sh

    [2024-10-13 10:38:52.986261][INFO][training_log.py:661] -  iteration=      21/  105963 | consumed_samples=       10368 | consumed_tokens=    42467328 | elapsed_time_per_iteration_ms=29754.2 | learning_rate=1.27402e-07 | global_batch_size= 4608 | lm loss=11.192327 | grad_norm=6.567 | actual_seqlen= 4096 | number_of_skipped_iterations=  0 | number_of_nan_iterations=  0 | samples_per_second=154.869 | tokens_per_gpu_per_second_tgs=6607.731 | [LM]TFLOPs=55.33 | [DS]TFLOPs=71.12 |
    [2024-10-13 10:38:52.988508][INFO][utils.py:249] - [Rank 0] (after 21 iterations) memory (MB) | allocated: 10272.43798828125 | max allocated: 46425.61083984375 | reserved: 58130.0 | max reserved: 58130.0
    [2024-10-13 10:38:52.995778][DEBUG][gpt_dataset.py:446] -  >> building dataset for /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/stackexchange-0009_text_document
    [2024-10-13 10:38:52.996706][DEBUG][gpt_dataset.py:604] -  > building dataset index ...
    [2024-10-13 10:38:52.999996][DEBUG][indexed_dataset.py:476] -     reading sizes...
    [2024-10-13 10:38:53.000761][DEBUG][indexed_dataset.py:480] -     reading pointers...
    [2024-10-13 10:38:53.001260][DEBUG][indexed_dataset.py:487] -     reading document index...
    [2024-10-13 10:38:53.001737][DEBUG][indexed_dataset.py:541] -     creating numpy buffer of mmap...
    [2024-10-13 10:38:53.002212][DEBUG][indexed_dataset.py:542] - /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/stackexchange-0009_text_document.bin
    [2024-10-13 10:38:53.002807][DEBUG][indexed_dataset.py:546] -     creating memory view of numpy buffer...
    [2024-10-13 10:38:53.003338][DEBUG][gpt_dataset.py:608] -  > finished creating indexed dataset in 0.006100 seconds
    [2024-10-13 10:38:53.003861][DEBUG][gpt_dataset.py:613] -     number of documents: 1137635
    [2024-10-13 10:38:53.004351][DEBUG][gpt_dataset.py:454] -  > dataset split:
    [2024-10-13 10:38:53.004785][DEBUG][gpt_dataset.py:457] -     train:
    [2024-10-13 10:38:53.005214][DEBUG][gpt_dataset.py:458] -      document indices in [0, 1137635) total of 1137635 documents
    [2024-10-13 10:38:53.005793][DEBUG][gpt_dataset.py:457] -     validation:
    [2024-10-13 10:38:53.006226][DEBUG][gpt_dataset.py:458] -      document indices in [1137635, 1137635) total of 0 documents
    [2024-10-13 10:38:53.006762][DEBUG][gpt_dataset.py:457] -     test:
    [2024-10-13 10:38:53.007157][DEBUG][gpt_dataset.py:458] -      document indices in [1137635, 1137635) total of 0 documents
    [2024-10-13 10:38:53.021128][DEBUG][gpt_dataset.py:947] -  > loading doc-idx mapping from checkpoints/ds_stage0_nl6_hs4096_mb12_seq4096_sp1_pp1_tp1_bf16_optadamw_lr0.0003_lwf0.05_flash//.cache/dolma/index-cache/f7d2e4b9462a6e17d2e80c2a4545516d_doc_idx.npy
    [2024-10-13 10:38:53.024649][DEBUG][gpt_dataset.py:950] -  > loading sample-idx mapping from checkpoints/ds_stage0_nl6_hs4096_mb12_seq4096_sp1_pp1_tp1_bf16_optadamw_lr0.0003_lwf0.05_flash//.cache/dolma/index-cache/f7d2e4b9462a6e17d2e80c2a4545516d_sample_idx.npy
    [2024-10-13 10:38:53.028050][DEBUG][gpt_dataset.py:953] -  > loading shuffle-idx mapping from checkpoints/ds_stage0_nl6_hs4096_mb12_seq4096_sp1_pp1_tp1_bf16_optadamw_lr0.0003_lwf0.05_flash//.cache/dolma/index-cache/f7d2e4b9462a6e17d2e80c2a4545516d_shuffle_idx.npy
    [2024-10-13 10:38:53.030382][DEBUG][gpt_dataset.py:956] -     loaded indexed file in 0.009 seconds
    [2024-10-13 10:38:53.031057][DEBUG][gpt_dataset.py:959] -     total number of samples: 584688
    [2024-10-13 10:38:53.031533][DEBUG][gpt_dataset.py:960] -     total number of epochs: 3
    [2024-10-13 10:38:53.032231][DEBUG][gpt_dataset.py:446] -  >> building dataset for /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/cc_en_middle-0323_text_document
    [2024-10-13 10:38:53.032845][DEBUG][gpt_dataset.py:604] -  > building dataset index ...
    [2024-10-13 10:38:53.035264][DEBUG][indexed_dataset.py:476] -     reading sizes...
    [2024-10-13 10:38:53.035931][DEBUG][indexed_dataset.py:480] -     reading pointers...
    [2024-10-13 10:38:53.036379][DEBUG][indexed_dataset.py:487] -     reading document index...
    [2024-10-13 10:38:53.036806][DEBUG][indexed_dataset.py:541] -     creating numpy buffer of mmap...
    [2024-10-13 10:38:53.037186][DEBUG][indexed_dataset.py:542] - /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/cc_en_middle-0323_text_document.bin
    [2024-10-13 10:38:53.037644][DEBUG][indexed_dataset.py:546] -     creating memory view of numpy buffer...
    [2024-10-13 10:38:53.038064][DEBUG][gpt_dataset.py:608] -  > finished creating indexed dataset in 0.004776 seconds
    [2024-10-13 10:38:53.038486][DEBUG][gpt_dataset.py:613] -     number of documents: 1795808
    [2024-10-13 10:38:53.038876][DEBUG][gpt_dataset.py:454] -  > dataset split:
    [2024-10-13 10:38:53.039214][DEBUG][gpt_dataset.py:457] -     train:
    [2024-10-13 10:38:53.039524][DEBUG][gpt_dataset.py:458] -      document indices in [0, 1795808) total of 1795808 documents
    [2024-10-13 10:38:53.039964][DEBUG][gpt_dataset.py:457] -     validation:
    [2024-10-13 10:38:53.040293][DEBUG][gpt_dataset.py:458] -      document indices in [1795808, 1795808) total of 0 documents
    [2024-10-13 10:38:53.040701][DEBUG][gpt_dataset.py:457] -     test:
    [2024-10-13 10:38:53.041008][DEBUG][gpt_dataset.py:458] -      document indices in [1795808, 1795808) total of 0 documents
    [2024-10-13 10:38:53.059401][DEBUG][gpt_dataset.py:947] -  > loading doc-idx mapping from checkpoints/ds_stage0_nl6_hs4096_mb12_seq4096_sp1_pp1_tp1_bf16_optadamw_lr0.0003_lwf0.05_flash//.cache/dolma/index-cache/2572e1fae8e82f973d4d4528bb1073d5_doc_idx.npy
    [2024-10-13 10:38:53.062711][DEBUG][gpt_dataset.py:950] -  > loading sample-idx mapping from checkpoints/ds_stage0_nl6_hs4096_mb12_seq4096_sp1_pp1_tp1_bf16_optadamw_lr0.0003_lwf0.05_flash//.cache/dolma/index-cache/2572e1fae8e82f973d4d4528bb1073d5_sample_idx.npy
    [2024-10-13 10:38:53.065822][DEBUG][gpt_dataset.py:953] -  > loading shuffle-idx mapping from checkpoints/ds_stage0_nl6_hs4096_mb12_seq4096_sp1_pp1_tp1_bf16_optadamw_lr0.0003_lwf0.05_flash//.cache/dolma/index-cache/2572e1fae8e82f973d4d4528bb1073d5_shuffle_idx.npy
    [2024-10-13 10:38:53.067502][DEBUG][gpt_dataset.py:956] -     loaded indexed file in 0.008 seconds
    [2024-10-13 10:38:53.068089][DEBUG][gpt_dataset.py:959] -     total number of samples: 339068
    [2024-10-13 10:38:53.068500][DEBUG][gpt_dataset.py:960] -     total number of epochs: 1
    
    
    # (...clipped...)
    
    [2024-10-13 10:38:53.069112][DEBUG][gpt_dataset.py:446] -  >> building dataset for /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/pes2o-0008_text_document
    [2024-10-13 10:38:53.069627][DEBUG][gpt_dataset.py:604] -  > building dataset index ...
    [2024-10-13 10:38:53.072254][DEBUG][indexed_dataset.py:476] -     reading sizes...
    [2024-10-13 10:38:53.072850][DEBUG][indexed_dataset.py:480] -     reading pointers...
    [2024-10-13 10:38:53.073281][DEBUG][indexed_dataset.py:487] -     reading document index...
    [2024-10-13 10:38:53.073709][DEBUG][indexed_dataset.py:541] -     creating numpy buffer of mmap...
    [2024-10-13 10:38:53.074143][DEBUG][indexed_dataset.py:542] - /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/pes2o-0008_text_document.bin
    [2024-10-13 10:38:53.074674][DEBUG][indexed_dataset.py:546] -     creating memory view of numpy buffer...
    [2024-10-13 10:38:53.075158][DEBUG][gpt_dataset.py:608] -  > finished creating indexed dataset in 0.005068 seconds
    [2024-10-13 10:38:53.075644][DEBUG][gpt_dataset.py:613] -     number of documents: 468277
    [2024-10-13 10:38:53.076103][DEBUG][gpt_dataset.py:454] -  > dataset split:
    [2024-10-13 10:38:53.076501][DEBUG][gpt_dataset.py:457] -     train:
    [2024-10-13 10:38:53.076879][DEBUG][gpt_dataset.py:458] -      document indices in [0, 468277) total of 468277 documents
    [2024-10-13 10:38:53.077385][DEBUG][gpt_dataset.py:457] -     validation:
    [2024-10-13 10:38:53.077797][DEBUG][gpt_dataset.py:458] -      document indices in [468277, 468277) total of 0 documents
    [2024-10-13 10:38:53.078301][DEBUG][gpt_dataset.py:457] -     test:
    [2024-10-13 10:38:53.078682][DEBUG][gpt_dataset.py:458] -      document indices in [468277, 468277) total of 0 documents
    [2024-10-13 10:38:53.085577][DEBUG][gpt_dataset.py:947] -  > loading doc-idx mapping from checkpoints/ds_stage0_nl6_hs4096_mb12_seq4096_sp1_pp1_tp1_bf16_optadamw_lr0.0003_lwf0.05_flash//.cache/dolma/index-cache/44232bbdda53b92a0c2f0a6e918b0b7c_doc_idx.npy
    [2024-10-13 10:38:53.089132][DEBUG][gpt_dataset.py:950] -  > loading sample-idx mapping from checkpoints/ds_stage0_nl6_hs4096_mb12_seq4096_sp1_pp1_tp1_bf16_optadamw_lr0.0003_lwf0.05_flash//.cache/dolma/index-cache/44232bbdda53b92a0c2f0a6e918b0b7c_sample_idx.npy
    [2024-10-13 10:38:53.092393][DEBUG][gpt_dataset.py:953] -  > loading shuffle-idx mapping from checkpoints/ds_stage0_nl6_hs4096_mb12_seq4096_sp1_pp1_tp1_bf16_optadamw_lr0.0003_lwf0.05_flash//.cache/dolma/index-cache/44232bbdda53b92a0c2f0a6e918b0b7c_shuffle_idx.npy
    [2024-10-13 10:38:53.095662][DEBUG][gpt_dataset.py:956] -     loaded indexed file in 0.010 seconds
    [2024-10-13 10:38:53.096285][DEBUG][gpt_dataset.py:959] -     total number of samples: 2441651
    [2024-10-13 10:38:53.096752][DEBUG][gpt_dataset.py:960] -     total number of epochs: 3
    
    [2024-10-13 10:39:13,696] [INFO] [logging.py:96:log_dist] [Rank 0] step=22, skipped=0, lr=[1.8402508875376674e-07, 1.8402508875376674e-07], mom=[(0.9, 0.95), (0.9, 0.95)]
    [2024-10-13 10:39:13,696] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 4631.17 | bwd_microstep: 15589.38 | bwd_inner_microstep: 14105.76 | bwd_allreduce_microstep: 1483.58 | step_microstep: 423.28
    [2024-10-13 10:39:13,696] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 4631.17 | bwd: 15589.38 | bwd_inner: 14105.76 | bwd_allreduce: 1483.59 | step: 423.28
    [2024-10-13 10:39:13.709465][INFO][training_log.py:661] -  iteration=      22/  105963 | consumed_samples=       14976 | consumed_tokens=    61341696 | elapsed_time_per_iteration_ms=20723.0 | learning_rate=1.84025e-07 | global_batch_size= 4608 | lm loss=11.192505 | grad_norm=6.578 | actual_seqlen= 4096 | number_of_skipped_iterations=  0 | number_of_nan_iterations=  0 | samples_per_second=222.362 | tokens_per_gpu_per_second_tgs=9487.424 | [LM]TFLOPs=79.45 | [DS]TFLOPs=102.11 |
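
For reference, a minimal sketch of the logging pattern described above, assuming Python's standard logging module; the logger name, format string, and exact wiring in megatron/data/gpt_dataset.py are assumptions rather than the repository's actual code:

import logging
import os

# Resolve the desired level from the environment, defaulting to INFO.
level_name = os.environ.get("LOG_LEVEL", "INFO").upper()

logging.basicConfig(
    format="[%(asctime)s][%(levelname)s][%(filename)s:%(lineno)d] - %(message)s",
    level=getattr(logging, level_name, logging.INFO),
)
log = logging.getLogger(__name__)

# Verbose dataset-building details are emitted at DEBUG, so they stay hidden
# unless LOG_LEVEL=DEBUG is set in the environment.
log.debug(" > building dataset index ...")
log.info("iteration summary, throughput, loss, ...")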

Changes to ALCF/helpers.sh

  • Use bash arrays instead of haphazardly concatenating strings in ALCF/helpers.sh
  • Add the ability to manually specify ${CKPT_DIR}, selecting which checkpoint directory to try to load from

Loading a checkpoint from a custom CKPT_DIR

  • We add new functions to ALCF/helpers.sh for determining the checkpoint directory when starting a new run.

    If specified at runtime, e.g.:

    CKPT_DIR=checkpoints/custom_checkpoint bash train_aGPT_7B.sh

    it will then:

    1. try to load a checkpoint from ${CKPT_DIR}, and
    2. save future checkpoints to ${CKPT_DIR}.

    If not specified, checkpoints will be saved to checkpoints/$(get_output_prefix), where $(get_output_prefix) produces a string that uniquely identifies each run, e.g.

    checkpoints/ws768_ds_stage1_nl32_hs4096_mb4_seq4096_gb3072_sp1_pp1_tp1_bf16_optadamw_lr0.00020_lwf0.05/

saforem2 merged commit 40db8c2 into microsoft-main on Nov 5, 2024
4 checks passed
saforem2 deleted the hzheng-data-fix branch on November 7, 2024