Introduce evaluate_experiment method #630
Conversation
…-name-handle' into jacques/experiment-improvements
evaluate_experiment method
💯 🥇
Much needed feature!
evaluate_experiment(
    experiment_name="surprised_herb_1466",
[nits] Hi @jacques-comet, I have been using the Comet dashboard and it seems like experiment names can be the same even for different runs. Is the intention of this function to re-evaluate scores for all matching names?
@JehandadK Yes, this function would re-evaluate all experiments with the same name. I'll do a follow-up PR to support experiment IDs (and a small FE update to show these IDs) so that you can re-score only one of these experiments.
I left a few comments regarding the code organization, but from the functional side it looks good 👍
Two points to sum up:
- In general, we should try not to go with client._rest_client when it's possible. Whenever we need it somewhere, it likely means that we can add a new convenience method to our API objects: Opik, Experiment, Dataset, Prompt, ... (if it is not there already). This will help us enrich our public API and keep the code base cleaner. A lot of REST-related logic spread across the modules (especially in places like utils) will become very messy and hard to maintain very quickly. See the sketch after this list.
- An e2e test for this use case would be nice, either in this PR or in another one (if not in this one, let's open a ticket/issue so we don't forget).
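To illustrate the first point, here is a minimal sketch of the suggested pattern, assuming a simplified Experiment API object; the class, method, and endpoint names are illustrative stand-ins, not the real SDK surface:

```python
from typing import Any, Dict, List


class Experiment:
    """Illustrative API object; the real class carries more responsibilities."""

    def __init__(self, id: str, rest_client: Any) -> None:
        self._id = id
        self._rest_client = rest_client  # stays private to the API object

    def get_items(self, max_results: int = 100) -> List[Dict[str, Any]]:
        # Hypothetical endpoint call: the point is that REST details live here,
        # so call sites use experiment.get_items() instead of spreading
        # client._rest_client... calls across utils modules.
        return self._rest_client.experiments.get_experiment_items(
            experiment_id=self._id, limit=max_results
        )
```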
test_cases = []
page = 1

while True:
We should move the logic for getting experiment items to opik.api_objects.experiment.Experiment.get_items(), similar to the logic we have for dataset items.
Regarding the endpoint used, I think stream_experiment_items would be better than the paginated one designed for the frontend.
@alexkuzmik I tried to do this but I couldn't get it to work: the stream endpoint doesn't return the input or output, just the trace_id. I then tried to move the logic I have that uses the FE endpoint, but that requires a dataset_id rather than a dataset_name, which made the whole logic a lot more complicated.
So for now I recommend we keep it as is and come back later to clean up the code.
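For reference, a minimal sketch of the page-based retrieval being kept for now; the rest-client method and response fields below are assumptions based on the diff in this PR, not the actual SDK signatures:

```python
from typing import Any, List


def get_experiment_test_cases(rest_client: Any, experiment_name: str) -> List[Any]:
    """Page through experiment items until an empty page is returned."""
    test_cases: List[Any] = []
    page = 1

    while True:
        # Hypothetical paginated call mirroring the FE endpoint discussed above.
        current_page = rest_client.experiments.find_experiment_items(
            experiment_name=experiment_name, page=page, size=100
        )
        if not current_page.content:
            break  # reached the last page
        test_cases.extend(current_page.content)
        page += 1

    return test_cases
```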
    return project_metadata.name


def get_experiment_test_cases(
After we extract the get_items logic, the function should become:
def experiment_items_to_test_cases(experiment_items: List[ExperimentItem]) -> List[test_case.TestCase]
The items will be retrieved in evaluator.py before this function is called.
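A rough sketch of what that pure conversion could look like, with stand-in dataclasses so the example is self-contained; the real ExperimentItem and test_case.TestCase types and their fields differ:

```python
from dataclasses import dataclass
from typing import Any, Dict, List


# Stand-in models for illustration only; the SDK's ExperimentItem and
# test_case.TestCase have their own (different) fields.
@dataclass
class ExperimentItem:
    trace_id: str
    dataset_item_data: Dict[str, Any]
    output: Dict[str, Any]


@dataclass
class TestCase:
    trace_id: str
    dataset_item_data: Dict[str, Any]
    task_output: Dict[str, Any]


def experiment_items_to_test_cases(
    experiment_items: List[ExperimentItem],
) -> List[TestCase]:
    # Pure conversion, no REST calls: the items are retrieved by the caller
    # (evaluator.py) and passed in, as suggested above.
    return [
        TestCase(
            trace_id=item.trace_id,
            dataset_item_data=item.dataset_item_data,
            task_output=item.output,
        )
        for item in experiment_items
    ]
```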
        verbose: an integer value that controls evaluation output logs such as summary and tqdm progress bar.
    """
    start_time = time.time()
    # Get the experiment object
Please remove such comments from the code.
We should leave comments only when there is non-trivial logic that is hard to understand without context (comments are usually for "Why?" questions, not "What?").
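For example, a "what" comment that merely restates the code versus a "why" comment that records context; the function and client call below are hypothetical:

```python
from typing import Any


def _example(client: Any, experiment_name: str) -> Any:
    # "What" comment (remove): it just repeats the next line.
    # Get the experiment object
    experiment = client.get_experiment(experiment_name)  # hypothetical call

    # "Why" comment (keep): explains a decision the code alone can't.
    # The stream endpoint currently returns only trace_id, so we page through
    # the FE endpoint instead until streaming exposes input/output.
    return experiment
```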
Done
Details
Introduces a new evaluate_experiment function that can be used to update the scores of an existing experiment. This can be useful when iterating on scoring metrics so you don't have to keep rerunning an evaluation.
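A rough usage sketch based on the snippet discussed above; the import path and parameters are assumptions rather than the final public API:

```python
# Re-score an existing experiment with an updated set of metrics,
# without re-running the original evaluation task.
from opik.evaluation import evaluate_experiment
from opik.evaluation.metrics import Hallucination  # any scoring metric

evaluate_experiment(
    experiment_name="surprised_herb_1466",  # name shown in the diff above
    scoring_metrics=[Hallucination()],
)
```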
Testing
This was tested by running the evaluate_existing_experiment.py script; extensive testing for edge cases was not performed.
Documentation
Documentation was updated (both reference and guides)