
feat: DataProcessor v1 #381

Open · dushyantbehl wants to merge 18 commits into main from dataloader-v2-impl

Conversation

@dushyantbehl (Contributor) commented Oct 27, 2024

Description of the change

This PR is to be merged after #398, as it builds on those changes. Once #398 is merged, this PR will be rebased on top of it.

This PR provides a data preprocessor framework that makes it easy to add more data preprocessing features in the future. It covers the following:

  • Change the base framework to the new configurable framework
    • The preprocessing utils function changed from format_dataset to process_dataargs
  • Handle currently supported features using the new base framework
  • Clean up redundant code and port unit tests
  • Unit tests for the data preprocessor
  • Unit tests for the data preprocessor setup

This PR does not explicitly enable any new features in fms-hf-tuning; its purpose is to make it easier to add new features in the future by changing the backend. A rough sketch of the handler-based design follows below.

Co-authored by @willmj and @Abhishek-TAMU
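
For orientation, here is a minimal sketch of what such a configurable handler framework can look like. This is illustrative only; the names HANDLERS, register_handler, and apply_handlers are hypothetical, not this PR's actual API:

from typing import Callable, Dict, List

from datasets import Dataset

# Hypothetical registry mapping config-addressable names to handler functions.
HANDLERS: Dict[str, Callable] = {}

def register_handler(name: str) -> Callable:
    """Register a data handler under a name that a data config can reference."""
    def decorator(fn: Callable) -> Callable:
        HANDLERS[name] = fn
        return fn
    return decorator

def apply_handlers(dataset: Dataset, steps: List[dict]) -> Dataset:
    """Apply a configured sequence of named handlers via datasets.map."""
    for step in steps:
        handler = HANDLERS[step["name"]]
        dataset = dataset.map(handler, **step.get("kwargs", {}))
    return dataset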

How to verify the PR

  • New/refactored unit tests located in testing/data
  • Unit tests of train() function
  • Run e2e tuning + vLLM (see example build)

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass


Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

Review thread on tuning/sft_trainer.py (outdated, resolved)
@Abhishek-TAMU (Collaborator) left a comment


@dushyantbehl Thanks for the PR letting us know about the WIP. Comments on the points below would be appreciated.

Comment on lines 56 to 65
def apply_dataset_formatting(
    element: Dict[str, str], tokenizer: AutoTokenizer, dataset_text_field: str, **kwargs
):
    # Append the tokenizer's EOS token to the configured text field.
    return {
        f"{dataset_text_field}": element[f"{dataset_text_field}"] + tokenizer.eos_token
    }

@Abhishek-TAMU (Collaborator) commented Oct 29, 2024


When raw_datasets = raw_datasets.map(handler, **kwargs) is called with kwargs["batched"] = True, element[dataset_text_field] here would be a list, right? So does the condition below make sense?

if isinstance(element[dataset_text_field], list):  # batched = True
    return {
        f"{dataset_text_field}": [
            text + tokenizer.eos_token for text in element[f"{dataset_text_field}"]
        ]
    }
return {
    f"{dataset_text_field}": element[f"{dataset_text_field}"] + tokenizer.eos_token
}

And a similar case should be added to the tokenize_and_apply_instruction_masking logic for when kwargs["batched"] = True.
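
For illustration, the same batched/non-batched split applied to a tokenize-and-mask handler could look roughly like the following. This is a hypothetical sketch with placeholder field names, not the PR's actual tokenize_and_apply_instruction_masking:

from typing import Dict

from transformers import AutoTokenizer

def tokenize_and_mask_sketch(
    element: Dict, tokenizer: AutoTokenizer, input_field: str, output_field: str, **kwargs
):
    def _process_one(prompt: str, response: str):
        # Tokenize prompt and response separately so prompt tokens can be
        # masked out of the labels with -100 (ignored by the loss).
        prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
        response_ids = tokenizer(
            response + tokenizer.eos_token, add_special_tokens=False
        )["input_ids"]
        return prompt_ids + response_ids, [-100] * len(prompt_ids) + response_ids

    if isinstance(element[input_field], list):  # batched = True
        pairs = [
            _process_one(p, r)
            for p, r in zip(element[input_field], element[output_field])
        ]
        return {
            "input_ids": [ids for ids, _ in pairs],
            "labels": [labels for _, labels in pairs],
        }

    input_ids, labels = _process_one(element[input_field], element[output_field])
    return {"input_ids": input_ids, "labels": labels}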

@dushyantbehl (Contributor, Author) replied:


Thanks for the catch, @Abhishek-TAMU. Yes, I need to update the code below.

@dushyantbehl (Contributor, Author) replied:


@Abhishek-TAMU I consciously reverted this change and made our default handlers run in batched=False mode for now; we can note in the documentation that these handlers should not be used in batched mode. I made this change because this handler and the others we have defined so far are complex operations that require deconstructing each example from a batch before processing it.

That said, I had already implemented a patch to make all handlers accept both batched and non-batched input; it lives in a different branch, so we can even cherry-pick it onto the current branch if needed (a rough sketch of the idea follows below).

cc @Ssukriti @willmj @ashokponkumar
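
For illustration, the idea behind such a patch can be sketched as a wrapper that lets a per-example handler accept batched input as well. This is a hypothetical sketch, not the actual code from that branch; it assumes batching is detected on one known column:

from functools import wraps
from typing import Callable, Dict

def supports_batched(batch_key: str) -> Callable:
    """Wrap a per-example handler so it also accepts batched input.

    If element[batch_key] is a list, element is treated as a columnar batch.
    """
    def decorator(handler: Callable) -> Callable:
        @wraps(handler)
        def wrapped(element: Dict, **kwargs) -> Dict:
            if not isinstance(element[batch_key], list):
                return handler(element, **kwargs)
            # Deconstruct the columnar batch into individual examples...
            columns = list(element.keys())
            rows = [
                dict(zip(columns, values))
                for values in zip(*(element[c] for c in columns))
            ]
            outputs = [handler(row, **kwargs) for row in rows]
            # ...then re-batch the per-example outputs column-wise.
            return {key: [out[key] for out in outputs] for key in outputs[0]}
        return wrapped
    return decorator

A handler wrapped this way could then be passed to datasets.map with either batched=True or batched=False.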

A collaborator replied:

> so we can even cherry pick this to the current branch if needed.

This sounds fine to me.

Review thread on tuning/utils/utils.py (outdated, resolved)
@dushyantbehl changed the title from "feat: [WIP] Dataloader v2 impl" to "feat: [WIP] DataProcessor v2 impl" on Oct 30, 2024
@willmj (Collaborator) left a comment


Thanks for the WIP PR @dushyantbehl, it's looking really nice so far!
I would definitely like to see some unit tests for these new data types and functions using the example configs you provided, since unit tests will only get harder to add as we continue to build on this. Additionally, we need to make sure our current unit tests pass to retain existing behavior. Overall, though, these changes are a great starting point. Thanks for your hard work on this!

Review thread on architecture_records/004-dataloader-v2.md (outdated, resolved)
Review thread on tuning/data/data_processors.py (outdated, resolved)
@dushyantbehl force-pushed the dataloader-v2-impl branch 8 times, most recently from b83c6e4 to ac148eb on November 8, 2024
@dushyantbehl changed the title from "feat: [WIP] DataProcessor v2 impl" to "feat: [WIP] DataProcessor v2" on Nov 18, 2024
@dushyantbehl marked this pull request as ready for review on November 19, 2024
@dushyantbehl changed the title from "feat: [WIP] DataProcessor v2" to "feat: DataProcessor v2" on Nov 20, 2024
@Abhishek-TAMU (Collaborator) left a comment


Thanks, Dushyant, for the cleanup of the format_dataset function and the related test cases.
Thanks, Will, for the multi-GPU fix. The cleanup and other changes look good to me, as they do not affect the current implementation.

Signed-off-by: Will Johnson <[email protected]>
@kmehant (Collaborator) commented Nov 26, 2024

@dushyantbehl
Support for packing pre-tokenized datasets is now a first-class feature in huggingface/trl. It would be good to allow it in fms-hf-tuning as well.

The check currently present in fms-hf-tuning can be relaxed, and that change could be absorbed into this PR (sketched below):

raise ValueError("packing will not be used when datasets are pretokenized")

PR for reference: huggingface/trl#2011

cc: @ashokponkumar
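
For illustration, the relaxation amounts to dropping that guard. In this sketch, packing and is_pretokenized are placeholder names, not fms-hf-tuning's actual variables:

# Placeholder flags for illustration only.
packing, is_pretokenized = True, True

# Old guard (what this change removes):
# if packing and is_pretokenized:
#     raise ValueError("packing will not be used when datasets are pretokenized")

# New behavior: the combination is allowed, and packing of pretokenized
# datasets is delegated to trl, which supports it natively since
# huggingface/trl#2011.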

@willmj requested a review from Ssukriti on November 26, 2024
@fabianlim (Collaborator) commented:
I feel the PR lacks documentation. We should have some things written out for the common use cases:

  • adding a new data preprocessor
  • adding a new handler, etc.
  • the fallback for use cases that do not require the data preprocessor routines
  • how to write a data config, etc.
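
For illustration of that last point, a data config for such a framework might look roughly like the following. This is a hypothetical shape expressed as a Python dict; the actual fms-hf-tuning schema is not defined in this thread and may differ:

# Hypothetical data config shape, for illustration only.
data_config = {
    "datasets": [
        {
            "name": "my_dataset",
            "data_paths": ["/path/to/train.jsonl"],
            "data_handlers": [
                {
                    "name": "apply_dataset_formatting",
                    "arguments": {
                        "batched": False,
                        "fn_kwargs": {"dataset_text_field": "output"},
                    },
                },
            ],
        },
    ],
}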

@dushyantbehl changed the title from "feat: DataProcessor v2" to "feat: DataProcessor v1" on Nov 28, 2024
@dushyantbehl force-pushed the dataloader-v2-impl branch 2 times, most recently from e04cf3c to 70252af on November 28, 2024
@ashokponkumar (Collaborator) replied:

> I feel the PR lacks documentation. We should have some things written out for the common use cases:
>
>   • adding a new data preprocessor
>   • adding a new handler, etc.
>   • the fallback for use cases that do not require the data preprocessor routines
>   • how to write a data config, etc.

We definitely need it, but we will add it in the next PR, where we will be exposing these features to external users. Currently these are internal implementation details used by the existing exposed interface.

So I would recommend going ahead with this PR and following it up with the documentation in the next PR.

Remove packing check as packing support for pretokenised data is merged
to trl. See huggingface/trl#2011

Signed-off-by: Dushyant Behl <[email protected]>
@fabianlim (Collaborator) left a comment


My comments have been addressed, so I am OK with this. cc: @ashokponkumar @Ssukriti
