Benchmarking #72
Conversation
@HashemAlsaket I have built a utility to load datasets from Hugging Face; I hope it could be useful.
Can we use the Hugging Face datasets library? Please review #75.
Awesome work, @LuvvAggarwal. I think we can include #75 as a good util function. Can you put in the PR to merge into this branch instead of main?
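The utility in #75 isn't shown here, so below is only a minimal sketch of what such a loader might look like, assuming the Hugging Face `datasets` package; the function name `load_benchmark_dataset` and its parameters are hypothetical, not the actual code from #75:

```python
# Hypothetical sketch of a Hugging Face loader util (not the actual #75 code).
from typing import Optional

from datasets import load_dataset


def load_benchmark_dataset(path: str, split: str = "validation", sample_size: Optional[int] = None):
    """Load a dataset from the Hugging Face Hub, optionally truncating it."""
    dataset = load_dataset(path, split=split)
    if sample_size is not None:
        dataset = dataset.select(range(min(sample_size, len(dataset))))
    return dataset


# Example: a small slice of HellaSwag to keep the repo light.
hellaswag = load_benchmark_dataset("hellaswag", split="validation", sample_size=100)
```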
Hi Hashem,
Thanks for opening this PR. Overall, it looks good! A few notes:
- The files are definitely larger than we'd like to have in the repo. We should consider cutting the dataset down, as you mentioned, to maybe something under 1MB to start. We can add instructions for downloading the full dataset at the top of the notebook.
- Notebook:
  - Let's move it to the existing `examples` folder; we can have a subfolder `benchmark` there.
  - Let's divide it into sections with section descriptions to make it easier for users to follow along.
  - We might be able to do a `for` loop in the notebook to run across different models (a sketch follows below).
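For the loop across models, something like this could work in the notebook; `run_benchmark` here is a placeholder for whatever helper the notebook ends up defining, not an existing function:

```python
# Hypothetical notebook cell: run the same benchmark across several models.
def run_benchmark(model: str) -> float:
    """Placeholder: run the notebook's prompts through `model` and return a score."""
    return 0.0  # replace with the real experiment + eval logic


models = ["gpt-3.5-turbo", "gpt-4"]
results = {model: run_benchmark(model) for model in models}

for model, score in results.items():
    print(f"{model}: {score:.3f}")
```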
prompttools/benchmarks/benchmark.py
Outdated
    eval_methods (list(eval methods)): list of evaluation methods to measure response similarity
    prompts (list(str)): list of queries, questions, prompts for LLMs to respond to
    response_options (list(str)): possible responses to measure against
    correct_response_index (list(int)): list of index of correct response in response_options
Suggested change:
- correct_response_index (list(int)): list of index of correct response in response_options
+ correct_response_indices (list(int)): list of indices of the correct responses in response_options
prompttools/benchmarks/benchmark.py
Outdated
    Args:
    ----
    experiment (experiment type): experiment to use
    eval_methods (list(eval methods)): list of evaluation methods to measure response similarity
Suggested change:
- eval_methods (list(eval methods)): list of evaluation methods to measure response similarity
+ eval_method (Callable): evaluation method to measure response similarity
Based on the class argument definition in `__init__`, I believe we want a single `Callable`?
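For reference, a sketch of the `__init__` this suggestion implies; the parameter list is inferred from the docstring above, not copied from the file:

```python
# Assumed shape of Benchmark.__init__, inferred from the docstring (not actual code).
from typing import Callable, List


class Benchmark:
    def __init__(
        self,
        experiment,                           # experiment to use
        eval_method: Callable,                # single eval method, per the suggestion
        prompts: List[str],                   # queries/questions for the LLMs
        response_options: List[str],          # possible responses to measure against
        correct_response_indices: List[int],  # index of the correct option per prompt
    ):
        self.experiment = experiment
        self.eval_method = eval_method
        self.prompts = prompts
        self.response_options = response_options
        self.correct_response_indices = correct_response_indices
```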
import warnings


class Benchmark:
I am curious about the future plan for this class. Will it focus on multiple-choice benchmarks, or will we use it across additional test sets as well? If it is the former, we can name it `MultipleChoiceBenchmark` for now. What do you think?
The intention was something like this:

    class Benchmark:
        def multiple_choice_benchmark(self):
            ...

        def standard_benchmark(self):
            ...

`multiple_choice_benchmark()` would process a data structure similar to HellaSwag's test set, while `standard_benchmark()` would process standard prompt-response pairs. Each method would use different quality measures depending on user preference.
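A slightly fuller sketch of that split, assuming a HellaSwag-style row carries a context, a list of candidate endings, and a label index; all field names and the eval call here are illustrative, not the PR's actual code:

```python
# Illustrative split of the two benchmark modes (field names assumed, not actual).
from typing import Callable, Dict, List


class Benchmark:
    def __init__(self, eval_method: Callable[[str, str], float]):
        self.eval_method = eval_method  # scores similarity of (response, reference)

    def multiple_choice_benchmark(self, rows: List[Dict]) -> float:
        """HellaSwag-style rows: count how often the top-scored ending is the label."""
        correct = 0
        for row in rows:
            scores = [self.eval_method(row["ctx"], ending) for ending in row["endings"]]
            if scores.index(max(scores)) == int(row["label"]):
                correct += 1
        return correct / len(rows)

    def standard_benchmark(self, pairs: List[Dict]) -> float:
        """Plain prompt-response pairs: average similarity to the references."""
        total = sum(self.eval_method(p["response"], p["reference"]) for p in pairs)
        return total / len(pairs)
```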
prompttools/benchmarks/benchmark.py
Outdated
    for i, choice in enumerate(self.benchmark_df["response_options"].values):
        model_choice.append(self.response_options[i].index(choice))
    self.benchmark_df["model_choice"] = model_choice
    self.benchmark_df["labels"] = self.correct_response_indices
Question: Do we plan to return `benchmark_df` somewhere?
Ah, thanks for pointing that out. My thought process must have veered. Removing `self` for readability.
@NivekT all issues tended to 👍
@HashemAlsaket I am unable to merge into this repository
Thanks @HashemAlsaket!! 🚀
Kick-starting benchmarking [draft]. Keeping the HellaSwag dataset in the branch for convenience until merge (if the sample size is too big, feel free to cut it down [currently 60MB], but I think it's okay for now).
High-level thought process:
experiments, data, evals submitted to benchmark class -> evals run on responses against data -> log
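A minimal sketch of that flow; every name here (`exact_match`, `run_benchmark`, and the `get_response` callable standing in for an experiment) is a placeholder for illustration, not the repo's actual API:

```python
# Sketch of the proposed flow: experiments + data + evals -> run evals -> log.
import logging
from typing import Callable, List

logging.basicConfig(level=logging.INFO)


def exact_match(response: str, reference: str) -> float:
    """Toy eval method: 1.0 if the response matches the reference exactly."""
    return float(response.strip() == reference.strip())


def run_benchmark(
    get_response: Callable[[str], str],  # stands in for an experiment's model call
    prompts: List[str],
    references: List[str],
    eval_method: Callable[[str, str], float],
) -> float:
    scores = []
    for prompt, reference in zip(prompts, references):
        response = get_response(prompt)
        score = eval_method(response, reference)
        logging.info("prompt=%r score=%.2f", prompt, score)  # the "log" step
        scores.append(score)
    return sum(scores) / len(scores)


# Usage with a dummy "experiment":
accuracy = run_benchmark(lambda p: "4", ["What is 2+2?"], ["4"], exact_match)
print(accuracy)  # 1.0
```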