Introduce evaluate_experiment method #630

Merged: 11 commits merged into main from jacques/experiment-improvements on Nov 15, 2024

Conversation

@jverre (Collaborator) commented Nov 13, 2024

Details

Introduces a new evaluate_experiment function that can be used to update the scores of an existing experiment. This is useful when iterating on scoring metrics, since you don't have to keep re-running the full evaluation.
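For illustration, a minimal usage sketch of the new function; the scoring_metrics parameter name and the Equals metric used here are assumptions based on this PR's description, not a verified signature, so check the updated reference docs for the exact API:

from opik.evaluation import evaluate_experiment
from opik.evaluation.metrics import Equals

# Re-score an existing experiment without re-running the LLM task.
# "surprised_herb_1466" is the experiment name quoted in the review below;
# scoring_metrics is an assumed parameter name.
evaluate_experiment(
    experiment_name="surprised_herb_1466",
    scoring_metrics=[Equals()],
)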

Testing

This was tested by running the evaluate_existing_experiment.py script; extensive testing for edge cases was not performed.

Documentation

Documentation was updated (both reference and guides)

@jverre requested review from a team as code owners on November 13, 2024 at 23:27
@jverre changed the title from "Jacques/experiment improvements" to "Introduce evaluate_experiment method" on Nov 13, 2024
@JehandadK left a comment

💯 🥇
Much needed feature!



evaluate_experiment(
experiment_name="surprised_herb_1466",


[nits] Hi @jacques-comet, I have been using the Comet dashboard and it seems like experiment names can be the same even for different runs. Is the intention of this function to re-evaluate scores for all matching names?
[Screenshot attached: 2024-11-14 at 15:27]

@jverre (Collaborator, Author) replied

@JehandadK Yes, this function would re-evaluate all experiments with the same name. I'll do a follow-up PR to support experiment IDs (and a small FE update to show these IDs) so that you can re-score only one of these experiments.

@alexkuzmik (Collaborator) left a comment

I left a few comments regarding the code organization, but from the functional side it looks good 👍
Two points to sum up:

  1. In general, we should try not to go with client._rest_client when it's possible. Whenever we need it somewhere, it likely means that we can add a new convenient method to our API objects: Opik, Experiment, Dataset, Prompt, ... (if it is not there already). This will help us enrich our public API and keep the code base cleaner (see the sketch after this list). REST-related logic spread across the modules (especially utils-like ones) will become very messy and hard to maintain very quickly.
  2. An e2e test for this use case would be nice, either in this PR or in another one (if not in this one, let's open a ticket/issue so we don't forget).
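
For point 1, a rough sketch of the direction meant here; get_experiment_by_name and get_items are hypothetical convenience methods (the latter is proposed in the thread below), not methods guaranteed to exist on the SDK today:

from opik import Opik

client = Opik()

# Discouraged: reaching into the private REST layer from utils-style modules.
# items = client._rest_client.experiments.find_experiment_items(...)

# Preferred: convenience methods on the public API objects, so REST details
# stay encapsulated (both method names below are hypothetical).
experiment = client.get_experiment_by_name("surprised_herb_1466")
items = experiment.get_items()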

test_cases = []
page = 1

while True:
@alexkuzmik (Collaborator) commented

We should move the logic for getting experiment items to opik.api_objects.experiment.Experiment.get_items() similarly to the logic we have for dataset items.

Regarding the endpoint used, I think stream_experiment_items would be better than the paginated one designed for frontend.
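
As a rough sketch of the shape such a method could take (the ExperimentItem fields and the stream_experiment_items call used here are assumptions drawn from this thread, not the merged implementation):

from dataclasses import dataclass
from typing import Any, Dict, List

@dataclass
class ExperimentItem:
    # Illustrative subset of fields only.
    id: str
    dataset_item_id: str
    trace_id: str

class Experiment:
    def __init__(self, id: str, rest_client: Any) -> None:
        self.id = id
        self._rest_client = rest_client

    def get_items(self) -> List[ExperimentItem]:
        # Hide REST details behind the API object, mirroring Dataset.get_items();
        # the rest-client call name is an assumption taken from the comment above.
        rows: List[Dict[str, Any]] = list(
            self._rest_client.experiments.stream_experiment_items(experiment_id=self.id)
        )
        return [
            ExperimentItem(
                id=row["id"],
                dataset_item_id=row["dataset_item_id"],
                trace_id=row["trace_id"],
            )
            for row in rows
        ]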

@jverre (Collaborator, Author) replied

@alexkuzmik I tried to do this but I couldn't get it to work: the stream endpoint doesn't return the input or output, just the trace_id. I then tried to move the logic I have that uses the FE endpoint, but that requires a dataset_id rather than a dataset_name, which made the whole logic a lot more complicated.

So for now I recommend we keep it as is and come back later to clean up the code.
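
For context, a minimal sketch of the paginated approach being kept for now, building on the snippet quoted above; the function name, rest-client call, parameters, and response fields are all illustrative placeholders, not the exact code in this PR:

def get_experiment_test_cases_via_fe_endpoint(rest_client, experiment_id, dataset_id):
    # Illustrative only: walk the paginated FE endpoint until an empty page.
    test_cases = []
    page = 1

    while True:
        response = rest_client.experiments.find_experiment_items(
            experiment_id=experiment_id,
            dataset_id=dataset_id,
            page=page,
            size=100,
        )
        if not response.content:
            break
        test_cases.extend(response.content)
        page += 1

    return test_cases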

sdks/python/src/opik/evaluation/utils.py (outdated review threads, resolved)
return project_metadata.name


def get_experiment_test_cases(
@alexkuzmik (Collaborator) commented

After we extract the get_items logic, the function should become:

def experiment_items_to_test_cases(experiment_items: List[ExperimentItem]) -> List[test_case.TestCase]

The items will be retrieved in evaluator.py before this function is called.
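
A rough sketch of what that function's body might look like; ExperimentItem and test_case.TestCase are the types named in the comment above, and the constructor arguments used here are assumed field names, not the exact attributes in the SDK:

from typing import List

def experiment_items_to_test_cases(
    experiment_items: List[ExperimentItem],
) -> List[test_case.TestCase]:
    # Pure mapping step: items are fetched elsewhere (in evaluator.py) and
    # converted here, so this module stays free of REST concerns.
    return [
        test_case.TestCase(
            trace_id=item.trace_id,
            dataset_item_id=item.dataset_item_id,
            task_output=item.output,  # illustrative field mapping
        )
        for item in experiment_items
    ]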

sdks/python/src/opik/evaluation/evaluator.py (outdated review thread, resolved)
verbose: an integer value that controls evaluation output logs such as summary and tqdm progress bar.
"""
start_time = time.time()
# Get the experiment object
@alexkuzmik (Collaborator) commented Nov 14, 2024

Please remove such comments from the code.
We should leave comments only when there is non-trivial logic that is hard to understand without context (comments are usually for "Why?" questions, not "What?").

@jverre (Collaborator, Author) replied

Done

@jverre merged commit f5cb060 into main on Nov 15, 2024
25 checks passed
@jverre deleted the jacques/experiment-improvements branch on November 15, 2024 at 12:46