Examining the Performance and Reasoning of Competing LLMs in Language-Based, Competitive Environments

Kevin Li, Connor Rabinowitz, Zhuoxuan Zhang, Louis Zheng

This repository was forked from https://github.com/jinhaoduan/GTBench, the original repository by the authors of GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations. We modified existing files and added new ones to extend the framework to two new games not covered in the original suite.

The new files include:

  1. NRA_test.py - a testing framework that compares the NRA values and completion rates of each Model × Reasoning Type combination against a GPT-3.5 Turbo × Prompt baseline.
  2. /gamingbench/games/crazy_eights.py and /gamingbench/games/dots_and_boxes.py - code that translates between the OpenSpiel implementation of each game and LLM token responses, in both directions (see the sketch after this list).
  3. /gamingbench/prompts/observation_prompts/crazy_eights.py and /gamingbench/prompts/observation_prompts/dots_and_boxes.py - code that produces the head prompt and observation prompt directed at the LLM.
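
The translation layer described in item 2 can be pictured with the following minimal sketch. It is not the repository's actual code: the helper names (legal_moves, apply_llm_move) and the naive substring match are illustrative assumptions. It relies only on OpenSpiel's standard Python API (pyspiel), where both crazy_eights and dots_and_boxes are registered games.

    # Hypothetical sketch: mapping OpenSpiel actions to LLM-readable move
    # strings and matching an LLM's token response back to a legal action.
    import random
    import pyspiel

    def legal_moves(state):
        """Map each legal OpenSpiel action ID to its human-readable string."""
        player = state.current_player()
        return {state.action_to_string(player, a): a for a in state.legal_actions()}

    def apply_llm_move(state, llm_response: str) -> str:
        """Find a legal move string inside the LLM's response and apply it."""
        for move_str, action_id in legal_moves(state).items():
            if move_str in llm_response:  # naive containment check, for illustration only
                state.apply_action(action_id)
                return move_str
        raise ValueError(f"no legal move found in LLM response: {llm_response!r}")

    game = pyspiel.load_game("crazy_eights")
    state = game.new_initial_state()
    while state.is_chance_node():  # resolve the card dealing before the first LLM turn
        state.apply_action(random.choice(state.legal_actions()))
    print(list(legal_moves(state)))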

We have also made other minor additions, mostly to config files.

To run NRA_test.py and produce results, follow these steps:

  1. Clone this repository.

  2. Obtain your OpenAI API key and DeepInfra API key.

  3. Update NRA_test.py and /gamingbench/chat/chat.py with your copied API keys. Search (Ctrl-F) for "api" to see where to enter them.

  4. Run NRA_test.py with the following command line arguments:

    python3 NRA_test.py {game} {opponent_llm_model} {llm_reasoning_type} {num_matches}

    example: python3 NRA_test.py crazy_eights gpt-4-turbo prompt_agent 50

  5. Access the produced information under the newly created /experiments_{TIME_AT_RUN} folder. Each match has its own subdirectory containing a JSON log file with overall game information and .log files with a historical record of the gameplay and of each LLM interaction and response. After all matches have ended, a run_log.txt (updated throughout the run) is finalized with the number of matches, the winning players, the players' final scores, the final NRA value, and the completion rates; a sketch of how these two metrics can be computed follows this list.
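
As a reference point, here is a minimal sketch of how the two summary metrics reported in run_log.txt could be computed from per-match results. The authoritative definitions live in NRA_test.py; the normalized-score-difference form of NRA below follows our reading of the GTBench paper, the completion-rate definition is a plausible guess, and the function names are hypothetical.

    # Hypothetical sketch of the two summary metrics reported in run_log.txt.
    def normalized_relative_advantage(scores_ours, scores_theirs):
        """Normalized Relative Advantage over a series of matches.

        +1 means our agent always wins, -1 means it always loses, and 0 means
        the agents are evenly matched (normalized score difference, per our
        reading of the GTBench paper).
        """
        numerator = sum(scores_ours) - sum(scores_theirs)
        denominator = sum(abs(a) + abs(b) for a, b in zip(scores_ours, scores_theirs))
        return numerator / denominator if denominator else 0.0

    def completion_rate(num_completed_matches, num_total_matches):
        """One plausible definition: the fraction of matches that ran to
        completion without the LLM failing to produce a legal move (an
        assumption; see NRA_test.py for the definition actually used)."""
        return num_completed_matches / num_total_matches if num_total_matches else 0.0

    # e.g. three matches scored +1 (win), -1 (loss), +1 (win):
    print(normalized_relative_advantage([1, -1, 1], [-1, 1, -1]))  # 0.333...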

Note that our codebase supports GPT-4-Turbo, GPT-3.5-Turbo, CodeLlama-34b-Instruct, and Llama-2-7b-chat. The Prompt, Chain of Thought (CoT), and Self-Consistency Chain of Thought (SC-CoT) reasoning types have been tested and confirmed to work with our implementation.
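
Since the GPT models are served by OpenAI and the CodeLlama/Llama-2 models by DeepInfra, the chat layer must route requests to two providers. The sketch below shows one way to do this through DeepInfra's OpenAI-compatible endpoint; it is an illustration with placeholder keys, not the routing logic actually used in /gamingbench/chat/chat.py.

    # Hypothetical sketch: routing chat requests to OpenAI or DeepInfra
    # depending on the model name. Placeholder keys; set your own (step 3 above).
    from openai import OpenAI

    OPENAI_API_KEY = "sk-..."    # placeholder
    DEEPINFRA_API_KEY = "..."    # placeholder

    def make_client(model: str) -> OpenAI:
        if model.startswith("gpt-"):
            return OpenAI(api_key=OPENAI_API_KEY)
        # DeepInfra exposes an OpenAI-compatible endpoint for open-weight models.
        return OpenAI(api_key=DEEPINFRA_API_KEY,
                      base_url="https://api.deepinfra.com/v1/openai")

    client = make_client("gpt-3.5-turbo")
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "It is your turn. Choose a legal move."}],
    )
    print(reply.choices[0].message.content)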
