After releasing GPT-4, OpenAI was met with a significant challenge: there weren't many benchmarks for LLMs focused on emergent capabilities like translation, reasoning, pattern identification, etc. So they created Evals, a crowdsourced, open-source set of benchmarks for LLMs. While somewhat OpenAI-centric, since the submission rules prohibit adding tests that GPT-4 can already consistently pass, it still remains a valuable tool for objective model evaluation.
If different open-access LLM projects can switch to a well-designed common benchmark, we may finally get to compare model quality objectively, which I find essential for the future of local LLMs. For example, we could compare against WizardLM, raw Vicuna, or GPT-3.5.
For reference on testing non-OpenAI models with Evals, see the OpenAssistant model evals.
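As a rough illustration of what hooking a local model into Evals might look like, here is a minimal sketch of a custom completion function following the `CompletionFn`/`CompletionResult` interfaces from the openai/evals repo. The class names and the `run_local_model` helper are hypothetical placeholders, not anything from the Evals codebase; the actual registration details should be checked against the repo's docs on completion functions.

```python
# Sketch of a completion function wrapping a local model for openai/evals.
# Assumes the CompletionFn / CompletionResult interfaces from evals.api;
# LocalCompletionFn, LocalCompletionResult, and run_local_model are
# hypothetical names for illustration only.
from evals.api import CompletionFn, CompletionResult


class LocalCompletionResult(CompletionResult):
    def __init__(self, text: str):
        self.text = text

    def get_completions(self) -> list[str]:
        # Evals expects a list of completion strings per prompt.
        return [self.text]


class LocalCompletionFn(CompletionFn):
    def __call__(self, prompt, **kwargs) -> LocalCompletionResult:
        # `prompt` may be a plain string or a chat-style list of messages,
        # depending on the eval; flatten chat messages into one string here.
        if isinstance(prompt, list):
            prompt = "\n".join(m.get("content", "") for m in prompt)
        text = run_local_model(prompt)  # hypothetical call into your local backend
        return LocalCompletionResult(text)


def run_local_model(prompt: str) -> str:
    # Placeholder: replace with a call to your local inference stack
    # (llama.cpp bindings, HF transformers, an HTTP endpoint, etc.).
    raise NotImplementedError
```

If I read the repo correctly, once such a completion function is registered under a name in the Evals registry, individual evals can then be run against it with the `oaieval` CLI, the same way the README shows for OpenAI models (e.g. `oaieval gpt-3.5-turbo test-match`).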