
feat: DataProcessor v1 #381

Open · dushyantbehl wants to merge 18 commits into main from dataloader-v2-impl

Conversation

@dushyantbehl (Contributor) commented Oct 27, 2024

Description of the change

This PR is to be merged after #398, as it builds on those changes. Once #398 is merged, this PR will be rebased on top of it.

This PR provides a data preprocessor framework that makes it easy to add more data preprocessing features in the future. It covers the following:

  • Change the base framework to the new configurable framework
    • The preprocessing utils function changed from format_dataset to process_dataargs
  • Handle currently supported features using the new base framework
  • Clean up redundant code and port unit tests
  • Unit tests for the data preprocessor
  • Unit tests for the data preprocessor setup

This PR does not explicitly enable any new features in fms-hf-tuning; its purpose is to make it easier to add new features in the future by changing the backend. A rough sketch of the handler-based design follows below.

Co-authored by @willmj and @Abhishek-TAMU
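
For orientation, here is a minimal sketch of what such a configurable handler framework can look like. This is illustrative only; the names HANDLERS, register_handler, and apply_handlers are hypothetical, not this PR's actual API:

from typing import Callable, Dict, List

from datasets import Dataset

# Hypothetical registry mapping config-addressable names to handler functions.
HANDLERS: Dict[str, Callable] = {}

def register_handler(name: str) -> Callable:
    """Register a data handler under a name that a data config can reference."""
    def decorator(fn: Callable) -> Callable:
        HANDLERS[name] = fn
        return fn
    return decorator

def apply_handlers(dataset: Dataset, steps: List[dict]) -> Dataset:
    """Apply a configured sequence of named handlers via datasets.map."""
    for step in steps:
        handler = HANDLERS[step["name"]]
        dataset = dataset.map(handler, **step.get("kwargs", {}))
    return dataset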

How to verify the PR

  • New/refactored unit tests located in testing/data
  • Unit tests of train() function
  • Run e2e tuning + vLLM (see example build)

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass


Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

Review thread on tuning/sft_trainer.py (outdated, resolved)
@Abhishek-TAMU (Collaborator) left a comment


@dushyantbehl Thanks for the PR letting us know about the WIP. Comments on the points below would be appreciated.

Comment on lines 56 to 65
def apply_dataset_formatting(
    element: Dict[str, str], tokenizer: AutoTokenizer, dataset_text_field: str, **kwargs
):
    # Append the tokenizer's EOS token to the configured text field.
    return {
        f"{dataset_text_field}": element[f"{dataset_text_field}"] + tokenizer.eos_token
    }

@Abhishek-TAMU (Collaborator) commented Oct 29, 2024


When raw_datasets = raw_datasets.map(handler, **kwargs) is called with kwargs["batched"] = True, element[dataset_text_field] here would be a list, right? So does the condition below make sense?

if isinstance(element[dataset_text_field], list):  # batched = True
    return {
        f"{dataset_text_field}": [
            text + tokenizer.eos_token for text in element[f"{dataset_text_field}"]
        ]
    }
return {
    f"{dataset_text_field}": element[f"{dataset_text_field}"] + tokenizer.eos_token
}

And a similar case should be added to the tokenize_and_apply_instruction_masking logic for when kwargs["batched"] = True.
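
For illustration, the same batched/non-batched split applied to a tokenize-and-mask handler could look roughly like the following. This is a hypothetical sketch with placeholder field names, not the PR's actual tokenize_and_apply_instruction_masking:

from typing import Dict

from transformers import AutoTokenizer

def tokenize_and_mask_sketch(
    element: Dict, tokenizer: AutoTokenizer, input_field: str, output_field: str, **kwargs
):
    def _process_one(prompt: str, response: str):
        # Tokenize prompt and response separately so prompt tokens can be
        # masked out of the labels with -100 (ignored by the loss).
        prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
        response_ids = tokenizer(
            response + tokenizer.eos_token, add_special_tokens=False
        )["input_ids"]
        return prompt_ids + response_ids, [-100] * len(prompt_ids) + response_ids

    if isinstance(element[input_field], list):  # batched = True
        pairs = [
            _process_one(p, r)
            for p, r in zip(element[input_field], element[output_field])
        ]
        return {
            "input_ids": [ids for ids, _ in pairs],
            "labels": [labels for _, labels in pairs],
        }

    input_ids, labels = _process_one(element[input_field], element[output_field])
    return {"input_ids": input_ids, "labels": labels}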

@dushyantbehl (Contributor, Author) replied:


Thanks for the catch, @Abhishek-TAMU. Yes, I need to update the code below.

@dushyantbehl (Contributor, Author) replied:


@Abhishek-TAMU I consciously reverted this change and made our default handlers run in batched=False mode for now; we can note in the documentation that these handlers should not be used in batched mode. I made this change because this handler and the others we have defined so far are complex operations that require deconstructing each example from a batch before processing it.

That said, I had already implemented a patch to make all handlers accept both batched and non-batched input; it lives in a different branch, so we can even cherry-pick it onto the current branch if needed (a rough sketch of the idea follows below).

cc @Ssukriti @willmj @ashokponkumar
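
For illustration, the idea behind such a patch can be sketched as a wrapper that lets a per-example handler accept batched input as well. This is a hypothetical sketch, not the actual code from that branch; it assumes batching is detected on one known column:

from functools import wraps
from typing import Callable, Dict

def supports_batched(batch_key: str) -> Callable:
    """Wrap a per-example handler so it also accepts batched input.

    If element[batch_key] is a list, element is treated as a columnar batch.
    """
    def decorator(handler: Callable) -> Callable:
        @wraps(handler)
        def wrapped(element: Dict, **kwargs) -> Dict:
            if not isinstance(element[batch_key], list):
                return handler(element, **kwargs)
            # Deconstruct the columnar batch into individual examples...
            columns = list(element.keys())
            rows = [
                dict(zip(columns, values))
                for values in zip(*(element[c] for c in columns))
            ]
            outputs = [handler(row, **kwargs) for row in rows]
            # ...then re-batch the per-example outputs column-wise.
            return {key: [out[key] for out in outputs] for key in outputs[0]}
        return wrapped
    return decorator

A handler wrapped this way could then be passed to datasets.map with either batched=True or batched=False.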

A collaborator replied:

> so we can even cherry pick this to the current branch if needed.

This sounds fine to me.

Review thread on tuning/utils/utils.py (outdated, resolved)
@dushyantbehl changed the title from "feat: [WIP] Dataloader v2 impl" to "feat: [WIP] DataProcessor v2 impl" on Oct 30, 2024
@willmj (Collaborator) left a comment


Thanks for the WIP PR @dushyantbehl, it's looking really nice so far!
I would definitely like to see some unit tests for these new data types and functions using the example configs you provided, since unit tests will only get harder to add as we continue to build on this. Additionally, we need to make sure our current unit tests pass to retain existing behavior. Overall, though, these changes are a great starting point. Thanks for your hard work on this!

Review thread on architecture_records/004-dataloader-v2.md (outdated, resolved)
Review thread on tuning/data/data_processors.py (outdated, resolved)
@dushyantbehl force-pushed the dataloader-v2-impl branch 8 times, most recently from b83c6e4 to ac148eb on November 8, 2024
@dushyantbehl changed the title from "feat: [WIP] DataProcessor v2 impl" to "feat: [WIP] DataProcessor v2" on Nov 18, 2024
@dushyantbehl marked this pull request as ready for review on November 19, 2024
@dushyantbehl changed the title from "feat: [WIP] DataProcessor v2" to "feat: DataProcessor v2" on Nov 20, 2024
@Abhishek-TAMU (Collaborator) left a comment


Thanks, Dushyant, for the cleanup of the format_dataset function and the related test cases.
Thanks, Will, for the multi-GPU fix. The cleanup and other changes look good to me, as they do not affect the current implementation.

Signed-off-by: Will Johnson <[email protected]>
@kmehant (Collaborator) commented Nov 26, 2024

@dushyantbehl
Support for packing pre-tokenized datasets is now a first-class feature in huggingface/trl. It would be good to allow it in fms-hf-tuning as well.

The check currently present in fms-hf-tuning can be relaxed, and that change could be absorbed into this PR (sketched below):

raise ValueError("packing will not be used when datasets are pretokenized")

PR for reference: huggingface/trl#2011

cc: @ashokponkumar
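
For illustration, the relaxation amounts to dropping that guard. In this sketch, packing and is_pretokenized are placeholder names, not fms-hf-tuning's actual variables:

# Placeholder flags for illustration only.
packing, is_pretokenized = True, True

# Old guard (what this change removes):
# if packing and is_pretokenized:
#     raise ValueError("packing will not be used when datasets are pretokenized")

# New behavior: the combination is allowed, and packing of pretokenized
# datasets is delegated to trl, which supports it natively since
# huggingface/trl#2011.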

@willmj requested a review from Ssukriti on November 26, 2024
@fabianlim (Collaborator) commented:
I feel the PR lacks documentation. We should have some things written out for the common use cases:

  • adding a new data preprocessor
  • adding a new handler, etc.
  • the fallback for use cases that do not require the data preprocessor routines
  • how to write a data config, etc.
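
For illustration of that last point, a data config for such a framework might look roughly like the following. This is a hypothetical shape expressed as a Python dict; the actual fms-hf-tuning schema is not defined in this thread and may differ:

# Hypothetical data config shape, for illustration only.
data_config = {
    "datasets": [
        {
            "name": "my_dataset",
            "data_paths": ["/path/to/train.jsonl"],
            "data_handlers": [
                {
                    "name": "apply_dataset_formatting",
                    "arguments": {
                        "batched": False,
                        "fn_kwargs": {"dataset_text_field": "output"},
                    },
                },
            ],
        },
    ],
}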

@dushyantbehl changed the title from "feat: DataProcessor v2" to "feat: DataProcessor v1" on Nov 28, 2024
@dushyantbehl force-pushed the dataloader-v2-impl branch 2 times, most recently from e04cf3c to 70252af on November 28, 2024
@ashokponkumar (Collaborator) replied:

> I feel the PR lacks documentation. We should have some things written out for the common use cases:
>
>   • adding a new data preprocessor
>   • adding a new handler, etc.
>   • the fallback for use cases that do not require the data preprocessor routines
>   • how to write a data config, etc.

We definitely need it, but we will add it in the next PR, where we will be exposing these features to external users. Currently these are internal implementation details used by the existing exposed interface.

So I would recommend going ahead with this PR and following it up with the documentation in the next PR.

Remove packing check as packing support for pretokenised data is merged
to trl. See huggingface/trl#2011

Signed-off-by: Dushyant Behl <[email protected]>
@fabianlim (Collaborator) left a comment


My comments have been addressed, so I am OK with this. cc: @ashokponkumar @Ssukriti
