
Pull in from upstream #10

Closed

Conversation

@saforem2 (Owner) commented Oct 15, 2024

PR @ argonne-lcf/Megatron-DeepSpeed/pull/63

Summary by Sourcery

Refactor dataset handling and logging mechanisms to improve code readability and maintainability. Enhance shell scripts for better configuration management and introduce sample shuffling in datasets. Update DeepSpeed configuration handling and add new test functionalities for dataset sample distribution.

Enhancements:

  • Replace print statements with logging for better control over log levels and output.
  • Refactor dataset building functions to improve readability and maintainability by using consistent formatting and adding comments.
  • Introduce a new logging utility to standardize logging across the codebase.
  • Add support for shuffling samples within dataset files to enhance data randomness during training.
  • Refactor shell script to use arrays for command construction, improving readability and flexibility.
  • Enhance DeepSpeed configuration generation with additional options and conditions for optimizer and activation checkpointing.
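
The conditional DeepSpeed configuration generation mentioned above can be sketched in Python. The config keys (`zero_optimization`, `offload_optimizer`, `activation_checkpointing`) follow the public DeepSpeed config schema, but the function name and its parameters are illustrative assumptions, not the PR's actual shell implementation:

```python
def build_ds_config(zero_stage, use_cpu_optimizer=False, checkpoint_activations=False):
    """Sketch of conditional DeepSpeed config generation.

    The keys follow the public DeepSpeed config schema; the parameter
    names here are hypothetical, not the PR's actual flags.
    """
    config = {
        "train_micro_batch_size_per_gpu": 1,
        "zero_optimization": {"stage": zero_stage},
    }
    if use_cpu_optimizer:
        # Offload optimizer state to host memory (ZeRO-Offload)
        config["zero_optimization"]["offload_optimizer"] = {"device": "cpu"}
    if checkpoint_activations:
        config["activation_checkpointing"] = {
            "partition_activations": True,
            "contiguous_memory_optimization": True,
        }
    return config
```

In the PR itself this logic lives in shell scripts (ALCF/helpers.sh); the sketch only shows the conditional structure being described.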

Build:

  • Modify shell scripts to dynamically set and export environment variables for training configurations.
  • Update DeepSpeed configuration setup in shell scripts to support new parameters and conditions.

Tests:

  • Add functionality to test the distribution of samples across batches in the blendable dataset.
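
A test along these lines might draw samples from the blend and compare each dataset's share against the requested weights. This is a hedged sketch: `check_blend_distribution` and its tolerance are hypothetical, not the PR's actual test code:

```python
from collections import Counter
import numpy as np

def check_blend_distribution(dataset_index, weights, tol=0.05):
    """Check that samples drawn from a blended dataset roughly match
    the requested blend weights.

    dataset_index: array mapping each drawn sample to its source dataset.
    weights: requested blend proportions (should sum to 1).
    """
    counts = Counter(dataset_index.tolist())
    total = len(dataset_index)
    for ds, weight in enumerate(weights):
        observed = counts.get(ds, 0) / total
        assert abs(observed - weight) <= tol, (
            f"dataset {ds}: observed {observed:.3f}, expected {weight:.3f}"
        )

# Example: a 70/30 blend drawn with a seeded RNG
rng = np.random.RandomState(42)
index = rng.choice(2, size=10_000, p=[0.7, 0.3])
check_blend_distribution(index, [0.7, 0.3])
```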

sourcery-ai bot commented Oct 15, 2024

Reviewer's Guide by Sourcery

This pull request implements several changes to improve code quality, performance, and functionality across multiple files in the Megatron-DeepSpeed project. The changes include code refactoring, bug fixes, and the addition of new features.

Updated class diagram for GPTDataset and related classes

```mermaid
classDiagram
    class GPTDataset {
        +__init__(name, data_prefix, documents, indexed_dataset, splits_string, num_samples, seq_length, seed, return_doc_ids, data_cache_path)
        +__len__() int
        +__getitem__(idx) dict
    }
    class IndexedDataset {
        +__init__(path)
        +read_index(path)
        +read_data(path)
        +check_index(i)
        +__getitem__(idx)
        +size(index) int
        +exists(path) bool
    }
    class BlendableDataset {
        +__init__(datasets, weights, size, data_cache_path)
        +_build_indices()
    }
    class DatasetBuilder {
        +__init__(prefix, corpus, data_impl, splits_string, num_samples, seq_length, seed, skip_warmup, return_doc_ids, data_cache_path, name)
        +Build()
    }
    GPTDataset --> IndexedDataset
    BlendableDataset --> DatasetBuilder
```

Updated class diagram for MMapIndexedDataset and related classes

```mermaid
classDiagram
    class MMapIndexedDataset {
        +__init__(path, skip_warmup)
        +__getitem__(idx)
        +get(idx, offset, length)
        +exists(path) bool
    }
    class MMapIndexedDatasetBuilder {
        +__init__(out_file, dtype)
        +add_item(tensor)
        +add_doc(tensor, sizes)
        +merge_file_(another_file)
        +finalize(index_file)
    }
    MMapIndexedDatasetBuilder --> MMapIndexedDataset
    MMapIndexedDataset --> IndexedDataset
    class IndexedDataset {
        +__init__(path)
        +read_index(path)
        +read_data(path)
        +check_index(i)
        +__getitem__(idx)
        +size(index) int
        +exists(path) bool
    }
```

Updated class diagram for logging utilities

```mermaid
classDiagram
    class Logger {
        +get_logger(name, level, rank_zero_only) Logger
    }
    class Profile {
        +log(func)
        +log_init(func)
        +iter(func, iter_name)
        +__enter__()
        +__exit__(type, value, traceback)
        +update(epoch, step, image_idx, image_size, args)
        +flush()
        +reset()
        +log_static(func)
    }
    class dftracer {
        +initialize_log(logfile, data_dir, process_id)
        +get_time()
        +enter_event()
        +exit_event()
        +log_event(name, cat, start_time, duration, string_args)
        +finalize()
    }
    Logger --> Profile
    Profile --> dftracer
```

File-Level Changes

Refactored and improved the dataset handling and processing (megatron/data/gpt_dataset.py)
  • Updated the build_train_valid_test_datasets function to use logging instead of print statements
  • Implemented a new shuffle_sample argument for dataset processing
  • Refactored the BuildConcatDataset class to include shuffling functionality
  • Updated the _build_train_valid_test_datasets_single function to use logging

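The per-file sample shuffling described above might look roughly like the following sketch. The function name and the use of a seeded `numpy.random.RandomState` are assumptions for illustration, not the PR's exact code:

```python
import numpy as np

def build_sample_index(num_samples, seed, shuffle_sample=False):
    """Return the order in which samples are drawn from a dataset file.

    When shuffle_sample is True, samples are permuted with a seeded RNG
    so the order is random but reproducible across ranks.
    """
    sample_idx = np.arange(num_samples, dtype=np.int64)
    if shuffle_sample:
        rng = np.random.RandomState(seed)
        rng.shuffle(sample_idx)  # in-place, deterministic for a given seed
    return sample_idx
```

Seeding per file keeps every data-parallel rank agreeing on the same permutation, which is the usual requirement in Megatron-style data pipelines.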
Enhanced logging and configuration management in helper scripts (ALCF/helpers.sh)
  • Replaced print statements with logging calls
  • Refactored the setup_run_cmd function to use arrays for command arguments
  • Added new configuration options for tokenizer and data handling
  • Implemented a reset_env function to clear custom environment variables

Improved logging and error handling in indexed dataset operations (megatron/data/indexed_dataset.py)
  • Replaced print_rank_0 calls with logging
  • Updated error messages and assertions for better clarity
  • Refactored the MMapIndexedDataset class for improved readability

Refactored utility functions and logging setup (megatron/utils.py)
  • Implemented a new get_logger function for consistent logging setup
  • Updated the Profile and PerfTrace classes for better performance tracking
  • Added support for dftracer if available

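A `get_logger(name, level, rank_zero_only)` utility, as shown in the class diagram above, typically configures a handler once and silences non-zero ranks. This is a minimal sketch under the assumption that the launcher exports a `RANK` environment variable; it is not the PR's exact implementation:

```python
import logging
import os

def get_logger(name, level="INFO", rank_zero_only=True):
    """Return a configured logger; optionally mute non-zero ranks.

    Assumes RANK is set by the launcher (e.g. torchrun / mpiexec);
    defaults to rank 0 when unset.
    """
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("[%(asctime)s][%(name)s][%(levelname)s] %(message)s")
        )
        logger.addHandler(handler)
    rank = int(os.environ.get("RANK", 0))
    if rank_zero_only and rank != 0:
        logger.setLevel(logging.CRITICAL)  # effectively mute this rank
    else:
        logger.setLevel(level)
    return logger
```

Guarding the handler setup on `logger.handlers` is what makes repeated `get_logger` calls across modules safe, which is the point of standardizing logging in one utility.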
Updated blendable dataset implementation (megatron/data/blendable_dataset.py)
  • Replaced print_rank_0 calls with logging
  • Improved error handling and messaging in the BlendableDataset class

Added performance profiling to pipeline parallel operations (megatron/core/pipeline_parallel/p2p_communication.py)
  • Added @dlp.log decorators to various functions in the p2p_communication module

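An `@dlp.log`-style decorator wraps a function so each call is timed and traced. The sketch below is an illustrative stand-in: the real dftracer/dlio_profiler decorators emit structured trace events rather than storing a duration, and `send_forward` here is only a placeholder:

```python
import functools
import time

def log(func):
    """Illustrative stand-in for a @dlp.log-style profiling decorator:
    records the wall-clock duration of each call on the wrapper."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # record the duration of the most recent call
            wrapper.last_duration = time.perf_counter() - start
    wrapper.last_duration = None
    return wrapper

@log
def send_forward(tensor):
    """Hypothetical placeholder for a p2p_communication function."""
    return tensor
```

Because the decorator preserves the wrapped function's signature and return value, it can be applied to communication calls without changing their callers, which is why the PR can add profiling with decorators alone.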
Updated training script and configuration (train_aGPT_7B.sh, megatron/arguments.py, pretrain_gpt_alcf.py)
  • Modified the run command execution in train_aGPT_7B.sh
  • Added a new shuffle-sample argument in megatron/arguments.py
  • Updated the model_provider function in pretrain_gpt_alcf.py to use args.deepspeed_config instead of args.deepspeed_config_dict

@sourcery-ai bot left a comment

Hey @saforem2 - I've reviewed your changes - here's some feedback:

Overall Comments:

  • The logging improvements and configuration changes look good. Consider adding more explanatory comments for complex changes, especially in data loading and indexing code.
  • We recommend conducting performance benchmarks to validate the impact of these changes, particularly for large-scale training scenarios.
Here's what I looked at during the review:
  • 🟢 General issues: all looks good
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟡 Complexity: 1 issue found
  • 🟢 Documentation: all looks good


Diff context quoted with the comment (truncated hunk):

```python
except Exception:
    _DLIO_PROFILER_EXIST = False

if _DLIO_PROFILER_EXIST:

if _DFTRACER_EXIST:
```

issue (complexity): Consider implementing a factory pattern and dependency injection for profiling functionality.

The current implementation introduces unnecessary complexity through multiple conditional imports and complex class definitions for Profile and PerfTrace. To simplify and improve maintainability while preserving functionality, consider the following changes:

1. Implement a factory pattern to create Profile and PerfTrace objects:

```python
def create_profiler():
    if _DFTRACER_EXIST:
        from dftracer.logger import dftracer as PerfTrace, dft_fn as Profile, DFTRACER_ENABLE
        return Profile, PerfTrace, DFTRACER_ENABLE
    elif _DLIO_PROFILER_EXIST:
        from dlio_profiler.logger import fn_interceptor as Profile, dlio_logger as PerfTrace
        return Profile, PerfTrace, True
    else:
        return DummyProfile, DummyPerfTrace, False

class DummyProfile:
    """No-op stand-in used when no profiling library is available."""
    def __init__(self, *args, **kwargs):
        pass
    def __enter__(self):
        return self
    def __exit__(self, *args, **kwargs):
        pass
    def log(self, func):
        return func

class DummyPerfTrace:
    def initialize_log(self, *args, **kwargs):
        pass

Profile, PerfTrace, PROFILER_ENABLED = create_profiler()
```

2. Simplify the dummy implementations by only including methods that are actually used in the codebase. This reduces unnecessary code while still providing a fallback when profiling libraries are unavailable.

3. Consider using dependency injection to further decouple the profiling functionality:

```python
class CodeModule:
    def __init__(self, profiler):
        # Wrap methods at construction time; a decorator applied at
        # class-definition time cannot see the injected profiler instance.
        self.profiler = profiler
        self.some_method = self.profiler.log(self._some_method)

    def _some_method(self):
        ...  # method implementation

# Usage
profiler_cls, _, _ = create_profiler()
module = CodeModule(profiler_cls())
```

These changes will make the code more modular, easier to maintain, and simpler to extend with new profiling libraries in the future.

@saforem2 saforem2 marked this pull request as draft October 15, 2024 04:52
@saforem2 saforem2 closed this Nov 7, 2024
@saforem2 saforem2 deleted the hzheng-data-fix branch November 7, 2024 03:58
saforem2 added a commit that referenced this pull request Nov 15, 2024
Merge `alcf-tests` into `main`