Make Reward Model Preprocessing Modifications #3431

Closed
asdataminer wants to merge 10 commits

asdataminer

Code Pull Requests

Please provide the following:

  • a clear explanation of what your code does
  • if applicable, a reference to an issue
  • a reproducible test for your PR (code, config and data sample)

Documentation Pull Requests

Note that the documentation HTML files are in docs/ while the Markdown sources are in mkdocs/docs.

If you are proposing a modification to the documentation you should change only the Markdown files.

api.md is automatically generated from the docstrings in the code, so if you want to change something in that file, first modify the docstring in ludwig/api.py, then run mkdocs/code_docs_autogen.py, which will regenerate mkdocs/docs/api.md.

github-actions bot commented Jun 5, 2023

Unit Test Results

6 files (±0), 6 suites (±0), duration 42m 54s ⏱️ (-36m 44s)
2 780 tests (+2 747): 2 733 passed ✔️ (+2 704), 9 skipped 💤 (+5), 38 failed (+38)
8 346 runs (+8 247): 8 199 passed ✔️ (+8 112), 33 skipped 💤 (+21), 114 failed (+114)

For more details on these failures, see this check.

Results for commit 440cbec. ± Comparison against base commit 9112470.

This pull request removes 33 and adds 2780 tests. Note that renamed tests count towards both.

Removed tests:
tests.integration_tests.test_cli ‑ test_reproducible_cli_runs[horovod-experiment-1919-0]
tests.integration_tests.test_cli ‑ test_reproducible_cli_runs[horovod-experiment-1919-1]
tests.integration_tests.test_cli ‑ test_reproducible_cli_runs[horovod-experiment-31-0]
tests.integration_tests.test_cli ‑ test_reproducible_cli_runs[horovod-experiment-31-1]
tests.integration_tests.test_cli ‑ test_reproducible_cli_runs[horovod-train-1919-0]
tests.integration_tests.test_cli ‑ test_reproducible_cli_runs[horovod-train-1919-1]
tests.integration_tests.test_cli ‑ test_reproducible_cli_runs[horovod-train-31-0]
tests.integration_tests.test_cli ‑ test_reproducible_cli_runs[horovod-train-31-1]
tests.integration_tests.test_cli ‑ test_train_cli_horovod
tests.integration_tests.test_experiment ‑ test_experiment_model_resume_distributed[horovod]
…

Added tests:
tests.ludwig.augmentation.test_augmentation_pipeline ‑ test_image_augmentation[augmentation_pipeline_ops0]
tests.ludwig.augmentation.test_augmentation_pipeline ‑ test_image_augmentation[augmentation_pipeline_ops1]
tests.ludwig.augmentation.test_augmentation_pipeline ‑ test_image_augmentation[augmentation_pipeline_ops2]
tests.ludwig.augmentation.test_augmentation_pipeline ‑ test_invalid_augmentation_parameters[None]
tests.ludwig.augmentation.test_augmentation_pipeline ‑ test_invalid_augmentation_parameters[augmentation_pipeline_ops1]
tests.ludwig.augmentation.test_augmentation_pipeline ‑ test_invalid_augmentation_parameters[augmentation_pipeline_ops2]
tests.ludwig.augmentation.test_augmentation_pipeline ‑ test_invalid_augmentation_parameters[augmentation_pipeline_ops4]
tests.ludwig.augmentation.test_augmentation_pipeline ‑ test_invalid_augmentation_parameters[random_horizontal_flip]
tests.ludwig.augmentation.test_augmentation_pipeline ‑ test_load_model_with_augmentation_pipeline
tests.ludwig.augmentation.test_augmentation_pipeline ‑ test_local_model_training_with_augmentation_pipeline[preprocessing0-encoder0-False]
…

This pull request removes 4 skipped tests and adds 9 skipped tests. Note that renamed tests count towards both.

Removed skipped tests:
tests.integration_tests.test_horovod ‑ test_horovod_gpu_memory_limit
tests.regression_tests.benchmark.test_model_performance ‑ test_performance[ames_housing.ecd.yaml]
tests.regression_tests.benchmark.test_model_performance ‑ test_performance[mercedes_benz_greener.ecd.yaml]
tests.regression_tests.benchmark.test_model_performance ‑ test_performance[sarcos.ecd.yaml]

Added skipped tests:
tests.ludwig.automl.test_base_config
tests.ludwig.automl.test_utils
tests.ludwig.backend.test_ray
tests.ludwig.benchmarking.test_profiler
tests.ludwig.data.test_ray_data
tests.ludwig.models.test_training_determinism ‑ test_training_determinism_ray_backend
tests.ludwig.utils.test_fs_utils ‑ test_get_fs_and_path_invalid_windows
tests.ludwig.utils.test_hyperopt_ray_utils ‑ test_grid_strategy[test_1]
tests.ludwig.utils.test_hyperopt_ray_utils ‑ test_grid_strategy[test_2]

♻️ This comment has been updated with latest results.

@@ -1205,6 +1205,50 @@ def build_dataset(
logger.debug(f"sample {sample_ratio} of data")
dataset_df = dataset_df.sample(frac=sample_ratio, random_state=random_seed)

# If training a reward model, perform grouping and joining on dataset
Collaborator

Nice! This looks like the right set of transformations, but I think we'll likely want to do this at the end of preprocessing, using the processed text input feature, rather than here at the beginning. Specifically, I would consider doing this here:

https://github.com/ludwig-ai/ludwig/blob/master/ludwig/data/preprocessing.py#L1364

I would also suggest adding a test in test_preprocessing.py to verify it works end-to-end on a synthetic dataset.
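
For illustration only, here is a minimal sketch of what such an end-to-end check on a synthetic dataset could look like. The PR's actual transformation is not shown in this thread, so everything below is an assumption: the helper `pair_chosen_rejected`, the column names `example_id`, `text` and `label`, and the label values `chosen` / `rejected` are hypothetical and are not taken from Ludwig's code. The sketch assumes that reward-model preprocessing pairs each chosen response with the rejected response that shares its id.

```python
import pandas as pd


def pair_chosen_rejected(df, id_col="example_id", text_col="text", label_col="label"):
    """Pair chosen and rejected rows that share an id (hypothetical helper).

    Column names and label values are assumptions for this sketch,
    not Ludwig's actual reward-model schema.
    """
    chosen = df[df[label_col] == "chosen"][[id_col, text_col]]
    rejected = df[df[label_col] == "rejected"][[id_col, text_col]]
    # The default inner join keeps only ids that have both a chosen
    # and a rejected row.
    paired = chosen.merge(rejected, on=id_col, suffixes=("_chosen", "_rejected"))
    return paired.reset_index(drop=True)


def make_synthetic_df():
    """Tiny synthetic dataset; id 2 has no rejected row and should be dropped."""
    return pd.DataFrame(
        {
            "example_id": [0, 0, 1, 1, 2],
            "text": ["good a", "bad a", "bad b", "good b", "orphan"],
            "label": ["chosen", "rejected", "rejected", "chosen", "chosen"],
        }
    )


def test_pairing_on_synthetic_data():
    paired = pair_chosen_rejected(make_synthetic_df())
    assert list(paired.columns) == ["example_id", "text_chosen", "text_rejected"]
    assert paired["example_id"].tolist() == [0, 1]
    assert paired["text_chosen"].tolist() == ["good a", "good b"]
    assert paired["text_rejected"].tolist() == ["bad a", "bad b"]
```

A real test in test_preprocessing.py would presumably run the full preprocessing path on the processed text feature rather than call a standalone helper; the sketch only shows the shape of the assertion on the paired output.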

Author

Sounds good! I have made the edits and refactored the code a bit to be simpler.

mhabedank closed this on Oct 21, 2024