After releasing GPT-4, OpenAI was met with a significant challenge: there weren't many benchmarks for LLMs focused on emergent capabilities like translation, reasoning, pattern identification, etc. So they created Evals, a crowdsourced, open-source set of benchmarks for LLMs. While somewhat OpenAI-centric, since the submission rules prohibit adding tests that GPT-4 can already consistently pass, it still remains a valuable tool for objective model evaluation.
If different open-access LLM projects can switch to a well-designed common benchmark, we may finally get to compare model quality objectively, which I find essential for the future of local LLMs. For example, we could compare against WizardLM, raw Vicuna, or GPT-3.5.
For reference on testing non-OpenAI models with Evals, see the OpenAssistant model evals.
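As a rough illustration of what hooking a local model into Evals might look like, here is a minimal sketch of a custom completion function following the `CompletionFn`/`CompletionResult` interfaces from the openai/evals repo. The class names and the `run_local_model` helper are hypothetical placeholders, not anything from the Evals codebase; the actual registration details should be checked against the repo's docs on completion functions.

```python
# Sketch of a completion function wrapping a local model for openai/evals.
# Assumes the CompletionFn / CompletionResult interfaces from evals.api;
# LocalCompletionFn, LocalCompletionResult, and run_local_model are
# hypothetical names for illustration only.
from evals.api import CompletionFn, CompletionResult


class LocalCompletionResult(CompletionResult):
    def __init__(self, text: str):
        self.text = text

    def get_completions(self) -> list[str]:
        # Evals expects a list of completion strings per prompt.
        return [self.text]


class LocalCompletionFn(CompletionFn):
    def __call__(self, prompt, **kwargs) -> LocalCompletionResult:
        # `prompt` may be a plain string or a chat-style list of messages,
        # depending on the eval; flatten chat messages into one string here.
        if isinstance(prompt, list):
            prompt = "\n".join(m.get("content", "") for m in prompt)
        text = run_local_model(prompt)  # hypothetical call into your local backend
        return LocalCompletionResult(text)


def run_local_model(prompt: str) -> str:
    # Placeholder: replace with a call to your local inference stack
    # (llama.cpp bindings, HF transformers, an HTTP endpoint, etc.).
    raise NotImplementedError
```

If I read the repo correctly, once such a completion function is registered under a name in the Evals registry, individual evals can then be run against it with the `oaieval` CLI, the same way the README shows for OpenAI models (e.g. `oaieval gpt-3.5-turbo test-match`).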