feat: DataProcessor v1 #381

Open
wants to merge 18 commits into
base: main

Commits on Nov 29, 2024

  1. 63e1472
  2. Add initial implementation of dataloader v1

    Signed-off-by: Dushyant Behl <[email protected]>
    dushyantbehl committed Nov 29, 2024
    5245166
  3. tests: reformat mock.patch to inside unit tests

    Signed-off-by: Will Johnson <[email protected]>
    
    fmt
    
    Signed-off-by: Will Johnson <[email protected]>
    willmj authored and dushyantbehl committed Nov 29, 2024
    50dd7fe
  4. Add data config argument to data preprocessor

    Signed-off-by: Dushyant Behl <[email protected]>
    dushyantbehl committed Nov 29, 2024
    ac17ebb
  5. fix: Changes to support current implementation

    Signed-off-by: Abhishek <[email protected]>
    Abhishek-TAMU authored and dushyantbehl committed Nov 29, 2024
    fe25b48
  6. Ensure data handling is done within process dataargs

    Removes unused dead code after adding the new framework and refactors
    some test cases and files.
    
    Signed-off-by: Dushyant Behl <[email protected]>
    dushyantbehl committed Nov 29, 2024
    dcd3f97
  7. Remove accelerator in favor of torch distributed check for multi node data preprocessing
    
    Signed-off-by: Dushyant Behl <[email protected]>
    dushyantbehl committed Nov 29, 2024
    7adfeb0
  8. Refactor data util tests as data handler tests.

    Signed-off-by: Dushyant Behl <[email protected]>
    dushyantbehl committed Nov 29, 2024
    2546733
  9. fix: add __init__.py to add tuning.data to python package

    Signed-off-by: Will Johnson <[email protected]>
    willmj authored and dushyantbehl committed Nov 29, 2024
    2a0f3f0
  10. fix: multi GPU prepare training dataset

    Signed-off-by: Will Johnson <[email protected]>
    willmj authored and dushyantbehl committed Nov 29, 2024
    0338634
  11. fix: lint

    Signed-off-by: Will Johnson <[email protected]>
    willmj authored and dushyantbehl committed Nov 29, 2024
    5e994ba
  12. fix: Add TODO

    Signed-off-by: Will Johnson <[email protected]>
    willmj authored and dushyantbehl committed Nov 29, 2024
    507f08e
  13. test: add test for process_dataset_configs in HFBasedDataPreProcessor

    Signed-off-by: Will Johnson <[email protected]>
    willmj authored and dushyantbehl committed Nov 29, 2024
    0aa253b
  14. add: test cases for framework

    Signed-off-by: Abhishek <[email protected]>
    Abhishek-TAMU authored and dushyantbehl committed Nov 29, 2024
    9456b73
  15. fix: update function name get_dataprocessor->get_datapreprocessor

    Signed-off-by: Will Johnson <[email protected]>
    willmj authored and dushyantbehl committed Nov 29, 2024
    668653e
  16. Rename loader to processor

    Signed-off-by: Dushyant Behl <[email protected]>
    dushyantbehl committed Nov 29, 2024
    4f882f3
  17. data folders should be together

    Signed-off-by: Dushyant Behl <[email protected]>
    dushyantbehl committed Nov 29, 2024
    7621173

Commits on Dec 2, 2024

  1. Add code comments and make code path clearer.

    Remove packing check as packing support for pretokenised data is merged
    to trl. See huggingface/trl#2011
    
    Signed-off-by: Dushyant Behl <[email protected]>
    dushyantbehl committed Dec 2, 2024
    e045ca7