Make Reward Model Preprocessing Modifications #3431
Unit Test Results
6 files ±0, 6 suites ±0, 42m 54s ⏱️ -36m 44s
For more details on these failures, see this check.
Results for commit 440cbec. ± Comparison against base commit 9112470.
This pull request removes 33 and adds 2780 tests. Note that renamed tests count towards both.
This pull request removes 4 skipped tests and adds 9 skipped tests. Note that renamed tests count towards both.
♻️ This comment has been updated with latest results.
ludwig/data/preprocessing.py
@@ -1205,6 +1205,50 @@ def build_dataset(
logger.debug(f"sample {sample_ratio} of data")
dataset_df = dataset_df.sample(frac=sample_ratio, random_state=random_seed)

# If training a reward model, perform grouping and joining on dataset
Nice! This looks like the right set of transformations, but I think we'll likely want to do this at the end of preprocessing, using the processed text input feature, rather than here at the beginning. Specifically, I would consider doing this here:
https://github.com/ludwig-ai/ludwig/blob/master/ludwig/data/preprocessing.py#L1364
I would also suggest adding a test in test_preprocessing.py to verify it works end-to-end on a synthetic dataset.
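To make the suggestion concrete, here is a rough sketch of the kind of grouping-and-joining check I have in mind, run on a tiny synthetic dataset. It is not the code in this PR: the helper `pair_chosen_rejected` and the column names `prompt_id`, `response`, and `chosen` are hypothetical, and it works on a plain pandas DataFrame rather than on Ludwig's processed features.

```python
import pandas as pd


def pair_chosen_rejected(df, id_col="prompt_id", text_col="response", label_col="chosen"):
    """Join each chosen response with the rejected response sharing its id.

    Hypothetical helper for illustration only; column names are assumptions.
    """
    # Split the rows by label and rename the text column on each side.
    chosen = df[df[label_col] == 1][[id_col, text_col]].rename(
        columns={text_col: "chosen_text"}
    )
    rejected = df[df[label_col] == 0][[id_col, text_col]].rename(
        columns={text_col: "rejected_text"}
    )
    # An inner join on the id keeps only ids that have both a chosen and a rejected row.
    return chosen.merge(rejected, on=id_col, how="inner").reset_index(drop=True)


# Synthetic dataset: two prompts, each with one chosen and one rejected response.
synthetic = pd.DataFrame(
    {
        "prompt_id": [0, 0, 1, 1],
        "response": ["good a", "bad a", "bad b", "good b"],
        "chosen": [1, 0, 0, 1],
    }
)
pairs = pair_chosen_rejected(synthetic)
```

A real test in test_preprocessing.py would presumably go through the full preprocessing path instead of calling a standalone helper, but the assertions would have the same shape: one output row per id, with the chosen and rejected text in separate columns.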
Sounds good! I have made the edits and refactored the code a bit to be simpler.
Code Pull Requests
Please provide the following:
Documentation Pull Requests
Note that the documentation HTML files are in docs/ while the Markdown sources are in mkdocs/docs. If you are proposing a modification to the documentation you should change only the Markdown files.
api.md is automatically generated from the docstrings in the code, so if you want to change something in that file, first modify the ludwig/api.py docstring, then run mkdocs/code_docs_autogen.py, which will create mkdocs/docs/api.md.