Benchmarking #72
Conversation
@HashemAlsaket I have built a utility to load datasets from Hugging Face; I hope it could be useful.
Can we use the Hugging Face datasets library? Please review #75.
Awesome work, @LuvvAggarwal. I think we can include #75 as a good util function. Can you put in the PR to merge into this branch instead of main?
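The utility in #75 isn't shown here, so below is only a minimal sketch of what such a loader might look like, assuming the Hugging Face `datasets` package; the function name `load_benchmark_dataset` and its parameters are hypothetical, not the actual code from #75:

```python
# Hypothetical sketch of a Hugging Face loader util (not the actual #75 code).
from typing import Optional

from datasets import load_dataset


def load_benchmark_dataset(path: str, split: str = "validation", sample_size: Optional[int] = None):
    """Load a dataset from the Hugging Face Hub, optionally truncating it."""
    dataset = load_dataset(path, split=split)
    if sample_size is not None:
        dataset = dataset.select(range(min(sample_size, len(dataset))))
    return dataset


# Example: a small slice of HellaSwag to keep the repo light.
hellaswag = load_benchmark_dataset("hellaswag", split="validation", sample_size=100)
```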
Hi Hashem,
Thanks for opening this PR. Overall, it looks good! A few notes:
- The files are definitely larger than we'd like to have in the repo. We should consider cutting the dataset down, as you mentioned, to maybe something under 1MB to start. We can add instructions for downloading the full dataset at the top of the notebook.
- Notebook:
  - Let's move it to the existing `examples` folder; we can have a subfolder `benchmark` there.
  - Let's divide it into sections with section descriptions to make it easier for users to follow along.
  - We might be able to do a `for` loop in the notebook to run across different models (a sketch follows below).
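For the loop across models, something like this could work in the notebook; `run_benchmark` here is a placeholder for whatever helper the notebook ends up defining, not an existing function:

```python
# Hypothetical notebook cell: run the same benchmark across several models.
def run_benchmark(model: str) -> float:
    """Placeholder: run the notebook's prompts through `model` and return a score."""
    return 0.0  # replace with the real experiment + eval logic


models = ["gpt-3.5-turbo", "gpt-4"]
results = {model: run_benchmark(model) for model in models}

for model, score in results.items():
    print(f"{model}: {score:.3f}")
```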
prompttools/benchmarks/benchmark.py
Outdated
    eval_methods (list(eval methods)): list of evaluation methods to measure response similarity
    prompts (list(str)): list of queries, questions, prompts for LLMs to respond to
    response_options (list(str)): possible responses to measure against
    correct_response_index (list(int)): list of index of correct response in response_options
Suggested change:
- correct_response_index (list(int)): list of index of correct response in response_options
+ correct_response_indices (list(int)): list of indices of the correct responses in response_options
prompttools/benchmarks/benchmark.py
Outdated
    Args:
    ----
    experiment (experiment type): experiment to use
    eval_methods (list(eval methods)): list of evaluation methods to measure response similarity
Suggested change:
- eval_methods (list(eval methods)): list of evaluation methods to measure response similarity
+ eval_method (Callable): evaluation method to measure response similarity
Based on the class argument definition in `__init__`, I believe we want a single `Callable`?
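For reference, a sketch of the `__init__` this suggestion implies; the parameter list is inferred from the docstring above, not copied from the file:

```python
# Assumed shape of Benchmark.__init__, inferred from the docstring (not actual code).
from typing import Callable, List


class Benchmark:
    def __init__(
        self,
        experiment,                           # experiment to use
        eval_method: Callable,                # single eval method, per the suggestion
        prompts: List[str],                   # queries/questions for the LLMs
        response_options: List[str],          # possible responses to measure against
        correct_response_indices: List[int],  # index of the correct option per prompt
    ):
        self.experiment = experiment
        self.eval_method = eval_method
        self.prompts = prompts
        self.response_options = response_options
        self.correct_response_indices = correct_response_indices
```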
import warnings


class Benchmark:
I am curious about the future plan for this class. Will it focus on multiple-choice benchmarks, or will we use it across additional test sets as well? If it is the former, we can name it `MultipleChoiceBenchmark` for now. What do you think?
The intention was something like this:

    class Benchmark:
        def multiple_choice_benchmark(self):
            ...

        def standard_benchmark(self):
            ...

`multiple_choice_benchmark()` would process a data structure similar to HellaSwag's test set, while `standard_benchmark()` would process standard prompt-response pairs. Each method would use different quality measures depending on user preference.
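A slightly fuller sketch of that split, assuming a HellaSwag-style row carries a context, a list of candidate endings, and a label index; all field names and the eval call here are illustrative, not the PR's actual code:

```python
# Illustrative split of the two benchmark modes (field names assumed, not actual).
from typing import Callable, Dict, List


class Benchmark:
    def __init__(self, eval_method: Callable[[str, str], float]):
        self.eval_method = eval_method  # scores similarity of (response, reference)

    def multiple_choice_benchmark(self, rows: List[Dict]) -> float:
        """HellaSwag-style rows: count how often the top-scored ending is the label."""
        correct = 0
        for row in rows:
            scores = [self.eval_method(row["ctx"], ending) for ending in row["endings"]]
            if scores.index(max(scores)) == int(row["label"]):
                correct += 1
        return correct / len(rows)

    def standard_benchmark(self, pairs: List[Dict]) -> float:
        """Plain prompt-response pairs: average similarity to the references."""
        total = sum(self.eval_method(p["response"], p["reference"]) for p in pairs)
        return total / len(pairs)
```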
prompttools/benchmarks/benchmark.py
Outdated
    for i, choice in enumerate(self.benchmark_df["response_options"].values):
        model_choice.append(self.response_options[i].index(choice))
    self.benchmark_df["model_choice"] = model_choice
    self.benchmark_df["labels"] = self.correct_response_indices
Question: Do we plan to return `benchmark_df` somewhere?
Ah, thanks for pointing that out. My thought process must have veered. Removing `self` for readability.
@NivekT all issues tended to 👍
@HashemAlsaket I am unable to merge into this repository
Thanks @HashemAlsaket!! 🚀
Kick-starting benchmarking [draft]. Keeping the HellaSwag dataset in the branch for convenience until merge (if the sample size is too big, feel free to cut it down [currently 60MB], but I think it's okay for now).
High-level thought process:
experiments, data, evals submitted to benchmark class -> evals run on responses against data -> log
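A minimal sketch of that flow; every name here (`exact_match`, `run_benchmark`, and the `get_response` callable standing in for an experiment) is a placeholder for illustration, not the repo's actual API:

```python
# Sketch of the proposed flow: experiments + data + evals -> run evals -> log.
import logging
from typing import Callable, List

logging.basicConfig(level=logging.INFO)


def exact_match(response: str, reference: str) -> float:
    """Toy eval method: 1.0 if the response matches the reference exactly."""
    return float(response.strip() == reference.strip())


def run_benchmark(
    get_response: Callable[[str], str],  # stands in for an experiment's model call
    prompts: List[str],
    references: List[str],
    eval_method: Callable[[str, str], float],
) -> float:
    scores = []
    for prompt, reference in zip(prompts, references):
        response = get_response(prompt)
        score = eval_method(response, reference)
        logging.info("prompt=%r score=%.2f", prompt, score)  # the "log" step
        scores.append(score)
    return sum(scores) / len(scores)


# Usage with a dummy "experiment":
accuracy = run_benchmark(lambda p: "4", ["What is 2+2?"], ["4"], exact_match)
print(accuracy)  # 1.0
```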