
Pull in from upstream #10

Closed

Conversation

@saforem2 (Owner) commented Oct 15, 2024

PR @ argonne-lcf/Megatron-DeepSpeed/pull/63

Summary by Sourcery

Refactor dataset handling and logging mechanisms to improve code readability and maintainability. Enhance shell scripts for better configuration management and introduce sample shuffling in datasets. Update DeepSpeed configuration handling and add new test functionalities for dataset sample distribution.

Enhancements:

  • Replace print statements with logging for better control over log levels and output.
  • Refactor dataset building functions to improve readability and maintainability by using consistent formatting and adding comments.
  • Introduce a new logging utility to standardize logging across the codebase.
  • Add support for shuffling samples within dataset files to enhance data randomness during training.
  • Refactor shell script to use arrays for command construction, improving readability and flexibility.
  • Enhance DeepSpeed configuration generation with additional options and conditions for optimizer and activation checkpointing.
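
The conditional DeepSpeed configuration generation mentioned above can be sketched in Python. The config keys (`zero_optimization`, `offload_optimizer`, `activation_checkpointing`) follow the public DeepSpeed config schema, but the function name and its parameters are illustrative assumptions, not the PR's actual shell implementation:

```python
def build_ds_config(zero_stage, use_cpu_optimizer=False, checkpoint_activations=False):
    """Sketch of conditional DeepSpeed config generation.

    The keys follow the public DeepSpeed config schema; the parameter
    names here are hypothetical, not the PR's actual flags.
    """
    config = {
        "train_micro_batch_size_per_gpu": 1,
        "zero_optimization": {"stage": zero_stage},
    }
    if use_cpu_optimizer:
        # Offload optimizer state to host memory (ZeRO-Offload)
        config["zero_optimization"]["offload_optimizer"] = {"device": "cpu"}
    if checkpoint_activations:
        config["activation_checkpointing"] = {
            "partition_activations": True,
            "contiguous_memory_optimization": True,
        }
    return config
```

In the PR itself this logic lives in shell scripts (ALCF/helpers.sh); the sketch only shows the conditional structure being described.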

Build:

  • Modify shell scripts to dynamically set and export environment variables for training configurations.
  • Update DeepSpeed configuration setup in shell scripts to support new parameters and conditions.

Tests:

  • Add functionality to test the distribution of samples across batches in the blendable dataset.
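
A test along these lines might draw samples from the blend and compare each dataset's share against the requested weights. This is a hedged sketch: `check_blend_distribution` and its tolerance are hypothetical, not the PR's actual test code:

```python
from collections import Counter
import numpy as np

def check_blend_distribution(dataset_index, weights, tol=0.05):
    """Check that samples drawn from a blended dataset roughly match
    the requested blend weights.

    dataset_index: array mapping each drawn sample to its source dataset.
    weights: requested blend proportions (should sum to 1).
    """
    counts = Counter(dataset_index.tolist())
    total = len(dataset_index)
    for ds, weight in enumerate(weights):
        observed = counts.get(ds, 0) / total
        assert abs(observed - weight) <= tol, (
            f"dataset {ds}: observed {observed:.3f}, expected {weight:.3f}"
        )

# Example: a 70/30 blend drawn with a seeded RNG
rng = np.random.RandomState(42)
index = rng.choice(2, size=10_000, p=[0.7, 0.3])
check_blend_distribution(index, [0.7, 0.3])
```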

sourcery-ai bot commented Oct 15, 2024

Reviewer's Guide by Sourcery

This pull request implements several changes to improve code quality, performance, and functionality across multiple files in the Megatron-DeepSpeed project. The changes include code refactoring, bug fixes, and the addition of new features.

Updated class diagram for GPTDataset and related classes

```mermaid
classDiagram
    class GPTDataset {
        +__init__(name, data_prefix, documents, indexed_dataset, splits_string, num_samples, seq_length, seed, return_doc_ids, data_cache_path)
        +__len__() int
        +__getitem__(idx) dict
    }
    class IndexedDataset {
        +__init__(path)
        +read_index(path)
        +read_data(path)
        +check_index(i)
        +__getitem__(idx)
        +size(index) int
        +exists(path) bool
    }
    class BlendableDataset {
        +__init__(datasets, weights, size, data_cache_path)
        +_build_indices()
    }
    class DatasetBuilder {
        +__init__(prefix, corpus, data_impl, splits_string, num_samples, seq_length, seed, skip_warmup, return_doc_ids, data_cache_path, name)
        +Build()
    }
    GPTDataset --> IndexedDataset
    BlendableDataset --> DatasetBuilder
```

Updated class diagram for MMapIndexedDataset and related classes

```mermaid
classDiagram
    class MMapIndexedDataset {
        +__init__(path, skip_warmup)
        +__getitem__(idx)
        +get(idx, offset, length)
        +exists(path) bool
    }
    class MMapIndexedDatasetBuilder {
        +__init__(out_file, dtype)
        +add_item(tensor)
        +add_doc(tensor, sizes)
        +merge_file_(another_file)
        +finalize(index_file)
    }
    MMapIndexedDatasetBuilder --> MMapIndexedDataset
    MMapIndexedDataset --> IndexedDataset
    class IndexedDataset {
        +__init__(path)
        +read_index(path)
        +read_data(path)
        +check_index(i)
        +__getitem__(idx)
        +size(index) int
        +exists(path) bool
    }
```

Updated class diagram for logging utilities

```mermaid
classDiagram
    class Logger {
        +get_logger(name, level, rank_zero_only) Logger
    }
    class Profile {
        +log(func)
        +log_init(func)
        +iter(func, iter_name)
        +__enter__()
        +__exit__(type, value, traceback)
        +update(epoch, step, image_idx, image_size, args)
        +flush()
        +reset()
        +log_static(func)
    }
    class dftracer {
        +initialize_log(logfile, data_dir, process_id)
        +get_time()
        +enter_event()
        +exit_event()
        +log_event(name, cat, start_time, duration, string_args)
        +finalize()
    }
    Logger --> Profile
    Profile --> dftracer
```

File-Level Changes

Refactored and improved the dataset handling and processing (megatron/data/gpt_dataset.py)
  • Updated the build_train_valid_test_datasets function to use logging instead of print statements
  • Implemented a new shuffle_sample argument for dataset processing
  • Refactored the BuildConcatDataset class to include shuffling functionality
  • Updated the _build_train_valid_test_datasets_single function to use logging

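The per-file sample shuffling described above might look roughly like the following sketch. The function name and the use of a seeded `numpy.random.RandomState` are assumptions for illustration, not the PR's exact code:

```python
import numpy as np

def build_sample_index(num_samples, seed, shuffle_sample=False):
    """Return the order in which samples are drawn from a dataset file.

    When shuffle_sample is True, samples are permuted with a seeded RNG
    so the order is random but reproducible across ranks.
    """
    sample_idx = np.arange(num_samples, dtype=np.int64)
    if shuffle_sample:
        rng = np.random.RandomState(seed)
        rng.shuffle(sample_idx)  # in-place, deterministic for a given seed
    return sample_idx
```

Seeding per file keeps every data-parallel rank agreeing on the same permutation, which is the usual requirement in Megatron-style data pipelines.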
Enhanced logging and configuration management in helper scripts (ALCF/helpers.sh)
  • Replaced print statements with logging calls
  • Refactored the setup_run_cmd function to use arrays for command arguments
  • Added new configuration options for tokenizer and data handling
  • Implemented a reset_env function to clear custom environment variables

Improved logging and error handling in indexed dataset operations (megatron/data/indexed_dataset.py)
  • Replaced print_rank_0 calls with logging
  • Updated error messages and assertions for better clarity
  • Refactored the MMapIndexedDataset class for improved readability

Refactored utility functions and logging setup (megatron/utils.py)
  • Implemented a new get_logger function for consistent logging setup
  • Updated the Profile and PerfTrace classes for better performance tracking
  • Added support for dftracer if available

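A `get_logger(name, level, rank_zero_only)` utility, as shown in the class diagram above, typically configures a handler once and silences non-zero ranks. This is a minimal sketch under the assumption that the launcher exports a `RANK` environment variable; it is not the PR's exact implementation:

```python
import logging
import os

def get_logger(name, level="INFO", rank_zero_only=True):
    """Return a configured logger; optionally mute non-zero ranks.

    Assumes RANK is set by the launcher (e.g. torchrun / mpiexec);
    defaults to rank 0 when unset.
    """
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("[%(asctime)s][%(name)s][%(levelname)s] %(message)s")
        )
        logger.addHandler(handler)
    rank = int(os.environ.get("RANK", 0))
    if rank_zero_only and rank != 0:
        logger.setLevel(logging.CRITICAL)  # effectively mute this rank
    else:
        logger.setLevel(level)
    return logger
```

Guarding the handler setup on `logger.handlers` is what makes repeated `get_logger` calls across modules safe, which is the point of standardizing logging in one utility.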
Updated blendable dataset implementation (megatron/data/blendable_dataset.py)
  • Replaced print_rank_0 calls with logging
  • Improved error handling and messaging in the BlendableDataset class

Added performance profiling to pipeline parallel operations (megatron/core/pipeline_parallel/p2p_communication.py)
  • Added @dlp.log decorators to various functions in the p2p_communication module

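An `@dlp.log`-style decorator wraps a function so each call is timed and traced. The sketch below is an illustrative stand-in: the real dftracer/dlio_profiler decorators emit structured trace events rather than storing a duration, and `send_forward` here is only a placeholder:

```python
import functools
import time

def log(func):
    """Illustrative stand-in for a @dlp.log-style profiling decorator:
    records the wall-clock duration of each call on the wrapper."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # record the duration of the most recent call
            wrapper.last_duration = time.perf_counter() - start
    wrapper.last_duration = None
    return wrapper

@log
def send_forward(tensor):
    """Hypothetical placeholder for a p2p_communication function."""
    return tensor
```

Because the decorator preserves the wrapped function's signature and return value, it can be applied to communication calls without changing their callers, which is why the PR can add profiling with decorators alone.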
Updated training script and configuration (train_aGPT_7B.sh, megatron/arguments.py, pretrain_gpt_alcf.py)
  • Modified the run command execution in train_aGPT_7B.sh
  • Added a new shuffle-sample argument in megatron/arguments.py
  • Updated the model_provider function in pretrain_gpt_alcf.py to use args.deepspeed_config instead of args.deepspeed_config_dict

@sourcery-ai bot left a comment

Hey @saforem2 - I've reviewed your changes - here's some feedback:

Overall Comments:

  • The logging improvements and configuration changes look good. Consider adding more explanatory comments for complex changes, especially in data loading and indexing code.
  • We recommend conducting performance benchmarks to validate the impact of these changes, particularly for large-scale training scenarios.
Here's what I looked at during the review:
  • 🟢 General issues: all looks good
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟡 Complexity: 1 issue found
  • 🟢 Documentation: all looks good


Diff context quoted with the comment (truncated hunk):

```python
except Exception:
    _DLIO_PROFILER_EXIST = False

if _DLIO_PROFILER_EXIST:

if _DFTRACER_EXIST:
```

issue (complexity): Consider implementing a factory pattern and dependency injection for profiling functionality.

The current implementation introduces unnecessary complexity through multiple conditional imports and complex class definitions for Profile and PerfTrace. To simplify and improve maintainability while preserving functionality, consider the following changes:

1. Implement a factory pattern to create Profile and PerfTrace objects:

```python
def create_profiler():
    if _DFTRACER_EXIST:
        from dftracer.logger import dftracer as PerfTrace, dft_fn as Profile, DFTRACER_ENABLE
        return Profile, PerfTrace, DFTRACER_ENABLE
    elif _DLIO_PROFILER_EXIST:
        from dlio_profiler.logger import fn_interceptor as Profile, dlio_logger as PerfTrace
        return Profile, PerfTrace, True
    else:
        return DummyProfile, DummyPerfTrace, False

class DummyProfile:
    """No-op stand-in used when no profiling library is available."""
    def __init__(self, *args, **kwargs):
        pass
    def __enter__(self):
        return self
    def __exit__(self, *args, **kwargs):
        pass
    def log(self, func):
        return func

class DummyPerfTrace:
    def initialize_log(self, *args, **kwargs):
        pass

Profile, PerfTrace, PROFILER_ENABLED = create_profiler()
```

2. Simplify the dummy implementations by only including methods that are actually used in the codebase. This reduces unnecessary code while still providing a fallback when profiling libraries are unavailable.

3. Consider using dependency injection to further decouple the profiling functionality:

```python
class CodeModule:
    def __init__(self, profiler):
        # Wrap methods at construction time; a decorator applied at
        # class-definition time cannot see the injected profiler instance.
        self.profiler = profiler
        self.some_method = self.profiler.log(self._some_method)

    def _some_method(self):
        ...  # method implementation

# Usage
profiler_cls, _, _ = create_profiler()
module = CodeModule(profiler_cls())
```

These changes will make the code more modular, easier to maintain, and simpler to extend with new profiling libraries in the future.

@saforem2 saforem2 marked this pull request as draft October 15, 2024 04:52
@saforem2 saforem2 closed this Nov 7, 2024
@saforem2 saforem2 deleted the hzheng-data-fix branch November 7, 2024 03:58
saforem2 added a commit that referenced this pull request Nov 15, 2024
Merge `alcf-tests` into `main`