
Benchmarking #72

Merged · 7 commits merged into hegelai:main on Aug 15, 2023

Conversation

HashemAlsaket
Contributor

Kick-starting benchmarking [draft]. Keeping the HellaSwag dataset in the branch for convenience until merge (if the sample size is too big, feel free to cut it down [currently 60 MB], but I think it's okay for now).

High-level thought process:
experiments, data, and evals submitted to the benchmark class -> evals run on responses against the data -> results logged
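
A minimal sketch of that flow, assuming the constructor arguments discussed later in this review; the import path, `experiment`, `semantic_similarity`, and the method name are illustrative assumptions, not the merged API.

# Sketch only: experiment + data + eval submitted to the Benchmark class,
# evals run on responses against the data, results returned/logged.
from prompttools.benchmarks import Benchmark  # assumed import path for this PR's class

benchmark = Benchmark(
    experiment=experiment,                      # an existing prompttools experiment (assumed)
    eval_method=semantic_similarity,            # a Callable that scores a response (assumed)
    prompts=["The capital of France is"],
    response_options=[["Paris", "Rome", "Berlin"]],
    correct_response_indices=[0],
)
results_df = benchmark.multiple_choice_benchmark()  # method name taken from the discussion below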

@HashemAlsaket HashemAlsaket marked this pull request as draft August 12, 2023 17:24
@LuvvAggarwal

@HashemAlsaket I have built a utility to load datasets from Hugging Face; I hope it could be useful.

@LuvvAggarwal

Can we use the Hugging Face datasets library? Please review #75.

@HashemAlsaket
Contributor Author

Awesome work, @LuvvAggarwal. I think we can include #75 as a good util function. Can you update the PR to merge into this branch instead of main?

@NivekT NivekT marked this pull request as ready for review August 14, 2023 23:56
Collaborator

@NivekT left a comment

Hi Hashem,

Thanks for opening this PR. Overall, it looks good! A few notes:

  1. The files are definitely larger than we'd like to have in our repo. We should consider cutting them down, as you mentioned, to maybe something less than 1MB to start. We can add instructions at the top of the notebook for downloading the full dataset.
  2. Notebook:
    • Let's move it to the existing examples folder; we can have a benchmark subfolder there.
    • Let's divide it into sections with section descriptions to make it easier for users to follow along.
    • We might be able to do a for loop in the notebook to run across different models (see the sketch below).
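
One way that loop could look in the notebook; it reuses the `prompts`, `response_options`, and `correct_response_indices` lists from the setup above, and `make_experiment` and `semantic_similarity` are hypothetical helpers, so treat this as a sketch of the draft signature rather than a final API.

# Run the same multiple-choice benchmark for several models and collect the scores.
results = {}
for model in ["gpt-3.5-turbo", "gpt-4"]:
    experiment = make_experiment(model)  # hypothetical: builds a prompttools experiment for `model`
    benchmark = Benchmark(
        experiment=experiment,
        eval_method=semantic_similarity,
        prompts=prompts,
        response_options=response_options,
        correct_response_indices=correct_response_indices,
    )
    results[model] = benchmark.multiple_choice_benchmark()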

eval_methods (list(eval methods)): list of evaluation methods to measure response similarity
prompts (list(str)): list of queries, questions, prompts for LLMs to respond to
response_options (list(str)): possible responses to measure against
correct_response_index (list(int)): list of index of correct response in response_options
Collaborator

Suggested change
correct_response_index (list(int)): list of index of correct response in response_options
correct_response_indices (list(int)): list of index of correct response in response_options

Args:
----
experiment (experiment type): experiment to use
eval_methods (list(eval methods)): list of evaluation methods to measure response similarity
Collaborator

Suggested change
eval_methods (list(eval methods)): list of evaluation methods to measure response similarity
eval_method (Callable): evaluation method to measure response similarity

Based on the class argument definition in __init__, I believe we want a single Callable?
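
For context, a minimal sketch of an __init__ that matches this suggestion, typed for a single Callable; it is illustrative and not a copy of the diff.

from typing import Callable, List


class Benchmark:
    def __init__(
        self,
        experiment,
        eval_method: Callable,                     # a single evaluation callable, per the suggestion
        prompts: List[str],
        response_options: List[List[str]],
        correct_response_indices: List[int],
    ):
        self.experiment = experiment
        self.eval_method = eval_method
        self.prompts = prompts
        self.response_options = response_options
        self.correct_response_indices = correct_response_indices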

import warnings


class Benchmark:
Collaborator

I am curious about the future plan for this class. Will it focus on multiple-choice benchmarks, or will we use it across additional test sets as well?

If it is the former, we can name it MultipleChoiceBenchmark for now. What do you think?

Contributor Author

The intention was something like this:

class Benchmark:

    def multiple_choice_benchmark(self):
        ...  # score responses against a fixed set of options (HellaSwag-style)

    def standard_benchmark(self):
        ...  # score free-form responses against reference answers

multiple_choice_benchmark() would process a data structure similar to HellaSwag's test set
standard_benchmark() would process standard prompt-response pairs

Each function would use different quality measures depending on user preference
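
Illustrative input shapes for the two cases (made-up examples, not taken from the PR or from HellaSwag):

# Multiple-choice: each prompt has candidate completions plus the index of the correct one.
multiple_choice_example = {
    "prompt": "The sky is",
    "response_options": ["blue.", "a kind of cheese."],
    "correct_response_index": 0,
}

# Standard: plain prompt/reference pairs, scored by whichever eval_method the user supplies.
standard_example = {
    "prompt": "What is 2 + 2?",
    "reference_response": "4",
}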

for i, choice in enumerate(self.benchmark_df["response_options"].values):
    model_choice.append(self.response_options[i].index(choice))
self.benchmark_df["model_choice"] = model_choice
self.benchmark_df["labels"] = self.correct_response_indices
Collaborator

Question: Do we plan to return `benchmark_df` somewhere?

Contributor Author

Ah, thanks for pointing that out. My thought process must have veered. Removing `self` for readability.
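
A hedged sketch of what that change could look like; `_build_response_frame` is a hypothetical helper standing in for however the frame is constructed earlier in the method, and this is not code from the PR.

def multiple_choice_benchmark(self):
    # Keep the frame local to the method and return it instead of storing it on self.
    benchmark_df = self._build_response_frame()  # hypothetical: collects the model's responses
    benchmark_df["model_choice"] = [
        self.response_options[i].index(choice)
        for i, choice in enumerate(benchmark_df["response_options"].values)
    ]
    benchmark_df["labels"] = self.correct_response_indices
    return benchmark_df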

@HashemAlsaket HashemAlsaket changed the title Draft: benchmark boiler plate Benchmarking Aug 15, 2023
@HashemAlsaket
Contributor Author

@NivekT all issues tended 👍

@LuvvAggarwal

@HashemAlsaket I am unable to merge into this repository

@NivekT
Collaborator

NivekT commented Aug 15, 2023

Thanks @HashemAlsaket!! 🚀

@NivekT NivekT merged commit 9a7902a into hegelai:main Aug 15, 2023
1 check passed