feat: DataProcessor v1 #381
Conversation
Thanks for making a pull request! 😃
@dushyantbehl Thanks for the PR letting us know about the WIP. A few comments below; responses would be appreciated.
tuning/data/data_handlers.py (outdated)
def apply_dataset_formatting(
    element: Dict[str, str], tokenizer: AutoTokenizer, dataset_text_field: str, **kwargs
):
    return {
        f"{dataset_text_field}": element[f"{dataset_text_field}"] + tokenizer.eos_token
    }
When raw_datasets = raw_datasets.map(handler, **kwargs) is called with kwargs["batched"] = True, element[f"{dataset_text_field}"] here would be a list, right? So would the condition below make sense?
if isinstance(element[dataset_text_field], list):  # batched = True
    return {
        f"{dataset_text_field}": [
            text + tokenizer.eos_token for text in element[f"{dataset_text_field}"]
        ]
    }
return {
    f"{dataset_text_field}": element[f"{dataset_text_field}"] + tokenizer.eos_token
}
A similar case would also need to be added to the tokenize_and_apply_instruction_masking logic when kwargs["batched"] = True.
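For illustration, a minimal sketch of the batched=True vs. batched=False behaviour of map, using the Hugging Face datasets library; append_eos and the literal "</s>" eos token are stand-ins for this discussion, not code from this PR.

from datasets import Dataset

def append_eos(element, dataset_text_field, eos_token):
    # With batched=True, element[dataset_text_field] is a list of strings;
    # with batched=False it is a single string, so dispatch on the type.
    value = element[dataset_text_field]
    if isinstance(value, list):  # batched = True
        return {dataset_text_field: [text + eos_token for text in value]}
    return {dataset_text_field: value + eos_token}

ds = Dataset.from_dict({"output": ["hello", "world"]})
fn_kwargs = {"dataset_text_field": "output", "eos_token": "</s>"}
batched = ds.map(append_eos, batched=True, fn_kwargs=fn_kwargs)    # handler sees a list per column
unbatched = ds.map(append_eos, batched=False, fn_kwargs=fn_kwargs) # handler sees one string per example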
Thanks for the catch @Abhishek-TAMU, yes, I need to update the code below.
@Abhishek-TAMU I consciously reverted this change and made our default handlers run in batched=False mode for now; we can mention in the documentation that people should not use these handlers in batched mode. This change was made because this and the other handlers we have defined right now are complex operations and would require us to deconstruct each example from a batch before processing it.
That being said, this was after I had already implemented a patch to make all handlers take both batched and non-batched input; it lives in a different branch, so we can even cherry-pick it into the current branch if needed.
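As a purely hypothetical sketch (not the actual patch from that branch), a per-example handler could be made to accept both batched and non-batched input with a wrapper that deconstructs the batch, which is the kind of extra work mentioned above:

def support_batched(handler):
    # Wraps a handler written for single examples so it also works when map is
    # called with batched=True (i.e. element is a dict of column name -> list).
    def wrapped(element, **kwargs):
        first_value = next(iter(element.values()))
        if isinstance(first_value, list):  # batched = True
            keys = list(element.keys())
            rows = [dict(zip(keys, values)) for values in zip(*(element[k] for k in keys))]
            outputs = [handler(row, **kwargs) for row in rows]
            # Re-assemble the per-example outputs into column-wise lists.
            return {key: [out[key] for out in outputs] for key in outputs[0]}
        return handler(element, **kwargs)  # batched = False
    return wrapped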
"so we can even cherry pick this to the current branch if needed."
This sounds fine to me.
Thanks for the WIP PR @dushyantbehl, it's looking really nice so far!
We would definitely like to see some unit tests for these new data types and functions using the example configs you provided, as unit tests will only get harder to add as we continue to build on this. Additionally, we need to make sure our current unit tests pass to retain the existing behavior. Overall, though, these changes are a great starting point - thanks for your hard work on this!
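To make the ask concrete, here is a hedged sketch of such a unit test; the import path follows tuning/data/data_handlers.py from this PR, while the "gpt2" tokenizer and the test layout are assumptions.

from transformers import AutoTokenizer
from tuning.data.data_handlers import apply_dataset_formatting

def test_apply_dataset_formatting_appends_eos():
    # Any tokenizer that defines an eos_token would do; "gpt2" is only an example.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    element = {"output": "hello world"}
    result = apply_dataset_formatting(element, tokenizer=tokenizer, dataset_text_field="output")
    assert result["output"] == "hello world" + tokenizer.eos_token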
Thanks Dushyant for the cleanup of the format_dataset function and the related test cases.
Thanks Will for the multi-GPU fix. The cleanup and other changes look good to me, as they do not affect the current implementation.
@dushyantbehl There is a previously present fms-hf-tuning check which can now be relaxed and absorbed into this PR.
PR for reference: huggingface/trl#2011 cc: @ashokponkumar
I feel the PR lacks documentation. We should have some things written out for the common use cases.
We definitely need it, but we will add it in the next PR, where we will be exposing these features to external users. Currently this is an internal implementation used by the existing exposed interface, so I would recommend going ahead with this PR and following it up with the documentation in the next PR.
Remove packing check as packing support for pretokenised data is merged to trl. See huggingface/trl#2011 Signed-off-by: Dushyant Behl <[email protected]>
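As a hypothetical illustration only (not the actual fms-hf-tuning code), this is the kind of up-front guard that can be dropped once TRL handles packing for pretokenized data:

def check_packing_args(packing: bool, is_pretokenized: bool) -> None:
    # Before huggingface/trl#2011 this combination had to be rejected up front;
    # with that PR merged, TRL supports it and a guard like this can be removed.
    if packing and is_pretokenized:
        raise ValueError("packing is not supported for pretokenized datasets")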
I am OK with this; my comments are addressed. cc: @ashokponkumar @Ssukriti
Description of the change
This PR is to be merged after #398, as it builds on those changes. Once #398 is merged, this PR will be rebased on top of it.
This PR provides a data preprocessor framework that will make it easy to add more data preprocessing features in the future. This PR covers the following:
- format_dataset to process_dataargs
This PR does not explicitly enable any new features in fms-hf-tuning; its purpose is to make it easier to add new features in the future by changing the backend.
Co-Authored by @willmj @Abhishek-TAMU
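To sketch the idea of a pluggable data preprocessor (a hypothetical illustration, not the actual fms-hf-tuning interface; HANDLERS, register_handler, and process_data are made-up names for this example):

from typing import Callable, Dict, List
from datasets import Dataset

HANDLERS: Dict[str, Callable] = {}

def register_handler(name: str):
    # Handlers register themselves by name so new preprocessing steps can be
    # added without touching the driver code.
    def decorator(fn):
        HANDLERS[name] = fn
        return fn
    return decorator

@register_handler("add_suffix")
def add_suffix(element, dataset_text_field: str, suffix: str, **kwargs):
    return {dataset_text_field: element[dataset_text_field] + suffix}

def process_data(dataset: Dataset, steps: List[dict]) -> Dataset:
    # Each step names a registered handler plus the kwargs passed through map.
    for step in steps:
        dataset = dataset.map(HANDLERS[step["name"]], fn_kwargs=step.get("kwargs", {}))
    return dataset

ds = Dataset.from_dict({"output": ["hello", "world"]})
ds = process_data(ds, [{"name": "add_suffix",
                        "kwargs": {"dataset_text_field": "output", "suffix": "</s>"}}])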
How to verify the PR
- testing/data
- train() function
Was the PR tested
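One hedged way to check this is to run the existing unit tests programmatically; the tests/ directory name is an assumption about the repository layout.

import pytest

if __name__ == "__main__":
    # Equivalent to running `pytest tests/ -q` from the repository root.
    raise SystemExit(pytest.main(["tests/", "-q"]))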