merge: Pull Huihuo's data fix into microsoft-main
#63
megatron/data/gpt_dataset.py
Outdated
np_rng = np.random.RandomState(seed=dataset_builders[0].seed)
self.shuffle_index = np.arange(self.num_samples)
np_rng.shuffle(self.shuffle_index)
Create (and shuffle) a global `shuffle_index` of length `num_samples`.
Explicitly, `BuildConcatDataset.shuffle_index` is a list of indices that maps each sample to a particular {`dataset_index`, `dataset_sample_index`} pair.
Example
>>> import numpy as np
>>> shuffle_index = np.arange(10)
>>> np_rng = np.random.RandomState(seed=123)
>>> shuffle_index
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> np_rng.shuffle(shuffle_index)
>>> shuffle_index
array([4, 0, 7, 5, 8, 3, 1, 6, 9, 2])
…pSpeed into feature/profile
Other quality of life improvements
Note: Below copied from Logging Improvements
Changes to
…into hzheng-data-fix
…tron-DeepSpeed into hzheng-data-fix
Pull in changes from [6acc370](6acc370) to [`megatron/utils.py`](https://github.com/argonne-lcf/Megatron-DeepSpeed)
Data Fix
Adds mechanism for correctly shuffling samples across documents from multiple corpora.
The mechanism is implemented inside the `BuildConcatDataset` object from `megatron/data/gpt_dataset.py`.
In particular, the shuffle indices are created at gpt_dataset.py#L118-119 and are then used to select an individual sample in the `BuildConcatDataset.__getitem__` method here: gpt_dataset.py#L130-L132
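For readers who want to see the idea end to end, here is a minimal sketch of a concatenated dataset that looks samples up through a global shuffle index. It is not the code from this PR: the class name `ConcatShuffleSketch`, the `locate` helper, the `cumulative_sizes` attribute, and the toy corpora are all hypothetical, and the real `BuildConcatDataset` may resolve indices differently. Only the three shuffle lines mirror the reviewed snippet.

```python
import bisect

import numpy as np


class ConcatShuffleSketch:
    """Hypothetical stand-in for BuildConcatDataset (illustration only)."""

    def __init__(self, datasets, seed):
        # Assumes a non-empty list of indexable, sized datasets.
        self.datasets = datasets
        # Running totals of dataset sizes, e.g. sizes [3, 2, 4] -> [3, 5, 9].
        self.cumulative_sizes = np.cumsum([len(d) for d in datasets]).tolist()
        self.num_samples = self.cumulative_sizes[-1]
        # Same pattern as the reviewed lines: a seeded, global permutation.
        np_rng = np.random.RandomState(seed=seed)
        self.shuffle_index = np.arange(self.num_samples)
        np_rng.shuffle(self.shuffle_index)

    def __len__(self):
        return self.num_samples

    def locate(self, idx):
        """Map a position to (dataset_index, dataset_sample_index)."""
        flat = int(self.shuffle_index[idx])
        # First dataset whose running total exceeds the flat index.
        dataset_index = bisect.bisect_right(self.cumulative_sizes, flat)
        start = 0 if dataset_index == 0 else self.cumulative_sizes[dataset_index - 1]
        return dataset_index, flat - start

    def __getitem__(self, idx):
        dataset_index, sample_index = self.locate(idx)
        return self.datasets[dataset_index][sample_index]


# Toy corpora standing in for real tokenized datasets (assumption).
corpora = [["a0", "a1", "a2"], ["b0", "b1"], ["c0", "c1", "c2", "c3"]]
ds = ConcatShuffleSketch(corpora, seed=123)
samples = [ds[i] for i in range(len(ds))]
```

Because the permutation covers every flat index exactly once, iterating over the sketch visits each sample from each corpus once, interleaved across corpora, and the same seed reproduces the same order.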
Other changes
ALCF/requirements/requirements.txt
ALCF/test_blendable_dataset.py
megatron/core/pipeline_parallel/p2p_communication.py
megatron/data/gpt_dataset.py
megatron/utils.py