-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
📦 Support for packing tokenized datasets for SFT #2011
Conversation
d7a4ca9
to
f6dedb5
Compare
Anyone having bandwidth, requesting review thank you - @qgallouedec @lewtun @kashif @lvwerra or others from community. Discussion can be seen here - #1848 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution @kmehant ! Overall it looks good to me - would you mind adding an integration test for this scenario?
f6dedb5
to
004f128
Compare
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
@lewtun Thank you for your review. I have addressed the review comments and as well added the test cases. Thank you. |
0d0f34c
to
c689f24
Compare
95113ee
to
25b997d
Compare
@lewtun @qgallouedec apologies for the failing tests before. Tests passing now for me locally! Thank you. |
@lewtun @qgallouedec rebased looking for escalation, thanks. |
b13e324
to
3e9110b
Compare
@lewtun @qgallouedec pulse pinging for review and merge, thank you. |
Signed-off-by: Mehant Kammakomati <[email protected]>
Signed-off-by: Mehant Kammakomati <[email protected]>
@lewtun @qgallouedec it will be of great help pushing this forward before it gets into stale state for being open for a long time after having a positive discussion over the issue #1848 with @qgallouedec. If you would like to tag anyone else from the community who have bandwidth basing on our discussion over the issue (#1848) would be nice! |
tests/test_sft_trainer.py
Outdated
constant_len_dataset = ConstantLengthDataset( | ||
self.tokenizer, | ||
self.dummy_tokenized_dataset, | ||
dataset_text_field=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dataset_text_field=None, |
tests/test_sft_trainer.py
Outdated
"content": [ | ||
{ | ||
"type": "text", | ||
"text": "Oh ye, you are right, what is 1+1", | ||
} | ||
], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"content": [ | |
{ | |
"type": "text", | |
"text": "Oh ye, you are right, what is 1+1", | |
} | |
], | |
"content": [{"type": "text", "text": "Oh ye, you are right, what is 1+1"}], |
Sorry for the delay. I'm trying to keep track of all these issues but it's not easy. Lgtm overall, thanks. Just making a few comments |
Signed-off-by: Mehant Kammakomati <[email protected]>
@qgallouedec Thank you for circling back, I have addressed your comments. |
What does this PR do?
Fixes #1848
Before submitting
Pull Request section?
to it if that's the case. Support packing for pretokenized datasets #1848
documentation guidelines.
Who can review?
@qgallouedec
Anyone from the community!