Replication Study: GPT-3's Performance on Historical Fields in the MMLU Benchmarks (May 2022)

By Daniel Hutchinson

In January 2021, researchers in the field of machine learning introduced a set of benchmarks for measuring the accuracy of Large Language Models (LLMs) on questions in three historical fields (Hendrycks et al., “Measuring Massive Multitask Language Understanding”).[1] This replication study reexamines GPT-3’s performance on these benchmarks following the release of the GPT-3 Instruct model in January 2022.[2] The resulting data records significant gains on these benchmarks, approaching expert-level accuracy (80%) in two of the three historical fields. These results approach the performance of DeepMind’s Chinchilla, the LLM currently achieving the highest accuracy on these benchmarks.[3]

| LLM | A.P. U.S. History | A.P. European History | A.P. World History |
| --- | --- | --- | --- |
| GPT-3 (January 2021) | 52.9% | 53.9% | 56.1% |
| GPT-3 Instruct (May 2022) | 74.8% | 60.9% | 75.5% |
| Chinchilla (March 2022) | 83.3% | 78.8% | 85.2% |

Replication Method

Using the Python script `benchmarks.py`, the benchmark questions released by Hendrycks were retested with a zero-shot method on the DaVinci model of GPT-3 Instruct via OpenAI’s API.[4] GPT-3’s responses were then compared against the published answers contained in the Hendrycks benchmark sets. The results of these replications are contained in this repository. Users can rerun GPT-3’s performance on these benchmarks at Can AIs Accurately Interpret History? A Digital History Experiment.
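
For reference, the sketch below illustrates how such a zero-shot run can be scripted against the OpenAI API of that period. It is a minimal sketch, not the repository’s `benchmarks.py`: the model name `text-davinci-002`, the CSV file name, and the first-letter scoring rule are assumptions; only the zero-shot format and the API settings (temperature 0, 50 max tokens) are taken from this study.

```python
# Minimal sketch of a zero-shot MMLU run against GPT-3 Instruct.
# Assumptions (not taken from this repository): the CSV follows the
# Hendrycks MMLU layout (question, choices A-D, answer letter), and the
# DaVinci Instruct model is addressed as "text-davinci-002".
import csv

import openai

openai.api_key = "YOUR_API_KEY"  # placeholder


def ask_gpt3(question, choices):
    """Pose one multiple-choice question zero-shot and return the raw reply."""
    prompt = (
        f"{question}\n"
        f"A. {choices[0]}\nB. {choices[1]}\nC. {choices[2]}\nD. {choices[3]}\n"
        "Answer:"
    )
    response = openai.Completion.create(
        engine="text-davinci-002",  # assumed name for the DaVinci Instruct model
        prompt=prompt,
        temperature=0,  # settings reported in footnote 4
        max_tokens=50,
    )
    return response["choices"][0]["text"].strip()


correct = total = 0
with open("high_school_us_history_test.csv") as f:  # hypothetical file name
    for question, a, b, c, d, answer in csv.reader(f):
        reply = ask_gpt3(question, [a, b, c, d])
        # Score by comparing the first letter of the reply to the answer key.
        correct += reply[:1].upper() == answer.upper()
        total += 1

print(f"Accuracy: {correct / total:.1%}")
```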

Credits

Many thanks to Dan Hendrycks for sharing the discipline-specific results for the historical fields contained in the MMLU Benchmarks.

Footnotes

  1. Dan Hendrycks et al., “Measuring Massive Multitask Language Understanding,” arXiv:2009.03300v3 (January 2021), 2, 11.

  2. Long Ouyang et al., “Training language models to follow instructions with human feedback,” arXiv:2203.02155v1 (March 2022).

  3. Jordan Hoffmann et al., “Training Compute-Optimal Large Language Models,” arXiv:2203.15556v1 (March 2022), 31, table A6.

  4. The specific settings used in the API requests were a temperature of 0 and a maximum of 50 tokens.
