Repository for the evaluation of Large Language Models on logical and abstract reasoning tasks
To install the repository, use the following command:

```bash
git clone https://github.com/Strong-AI-Lab/Logical-and-abstract-reasoning.git
```
To install the dependencies in a virtual environment, use the following:

```bash
cd Logical-and-abstract-reasoning
python -m venv env/
source env/bin/activate
pip install -r requirements.txt
```
You may need to install `transformers` directly from its GitHub repository:

```bash
pip install git+https://github.com/huggingface/transformers
```
To evaluate a model in the repository, use the following command:

```bash
python run_evaluation.py config/model/<model_config.yaml> config/data/<data_config.yaml> --<kwarg_name> <kwarg>
```
You can choose the model to evaluate by changing the `<model_config.yaml>` file, and the dataset to evaluate the model on by changing the `<data_config.yaml>` file. You can add any additional arguments as `<kwargs>` (e.g. a private API key for GPT models).
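For example, a run might look like the command below. The config file names and the API-key kwarg are illustrative assumptions, not taken from the repository; use the files actually present under `config/model/` and `config/data/` and the kwarg names they expect.

```bash
# Hypothetical example: evaluate a GPT-4 config on a ReClor config.
# The exact file names and the kwarg used to pass the API key depend on the
# configs shipped in config/ -- check the folder contents before running.
python run_evaluation.py config/model/gpt-4.yaml config/data/reclor.yaml --api_key <your_openai_api_key>
```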
By default, all the results are saved in a CSV file in the `logs/` folder. You can re-compute the metrics of an evaluation run from this file by running the following:

```bash
python src/evaluate/evaluator.py logs/<results_file.csv>
```
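For instance, assuming a previous run produced a results file (the file name below is made up for illustration):

```bash
# Recompute metrics from an existing results CSV; replace the file name with
# the one generated by your own evaluation run in logs/.
python src/evaluate/evaluator.py logs/gpt-4_reclor_results.csv
```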
To fine-tune a model on a given dataset, run the following:

```bash
python run_finetuning.py config/model/<model_config.yaml> config/data/<data_config.yaml> config/trainer/<trainer_config.yaml>
```
The configuration files work similarly to those used for evaluation. The `<model_config.yaml>` file contains additional configuration for training. The logs are saved in `fine-tuning-output/` and the model weights are saved in `fine-tuning-saves/`.
Currently, only HuggingFace models can be fine-tuned.
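As an illustration, a fine-tuning run could look like the command below. The config file names are hypothetical; pick the ones that actually exist under `config/`.

```bash
# Hypothetical example: fine-tune a HuggingFace LLaMA model on LogiQA.
# Replace the three config names with files present in config/model/,
# config/data/ and config/trainer/.
python run_finetuning.py config/model/llama-7b.yaml config/data/logiqa.yaml config/trainer/default.yaml
```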
We use the LLaMA-based model fine-tuning from the Stanford Alpaca training script. If you want to perform instruction fine-tuning on a LLaMA-based model, you can do so by following this link.
| Inference Type | Model | Size | Task | Link | Remark |
|---|---|---|---|---|---|
| Logical Reasoning on Reading Comprehension | MERIt | - | Reading Comprehension | paper, project | #3 on the ReClor leaderboard |
| | LReasoner | - | Reading Comprehension | paper, project | #6 on the ReClor leaderboard |
| | AMR-LE | - | Reading Comprehension | project | #2 and #5 on the ReClor leaderboard |
| | LLaMA | - | Reading Comprehension | paper, code | Open-source very large language model |
| | LLaMA2 | - | Reading Comprehension | paper, code | Open-source very large language model |
| | TinyLLaMA | - | Reading Comprehension | paper, code | Open-source very large language model |
| | Alpaca | - | Reading Comprehension | code | Fine-tuned LLaMA |
| | Vicuna | - | Reading Comprehension | project, code | Fine-tuned LLaMA |
| | ChatGPT | - | Reading Comprehension | paper, project | Use API to do prompt tuning |
| | GPT-4 | - | Reading Comprehension | paper, project | Use API to do prompt tuning |
| | Zephyr-7b-beta | - | Reading Comprehension | code | Fine-tuned Mistral-7b |
| Inference Type | Dataset | Size | Task | Link | Remark |
|---|---|---|---|---|---|
| Logical Reasoning on Reading Comprehension | ReClor | - | Reading Comprehension | paper, project | Logical reasoning reading comprehension |
| | LogiQA | - | Reading Comprehension | paper, project | Logical reasoning reading comprehension |
| | LogiQA V2 | - | Reading Comprehension | project | Logical reasoning reading comprehension |
| | LogiQA Logical Reasoning Plus | - | Reading Comprehension | project | Logical reasoning reading comprehension for out-of-distribution evaluation |
| Abstract Reasoning | ARC | - | Abstract Reasoning | paper, code | Text version of a visual abstract reasoning task |
| | ACRE | - | Abstract Reasoning | paper, code | Text version of a visual abstract reasoning task |
| | PVR | - | Abstract Reasoning | paper | Abstract reasoning task |
| | RAVEN | - | Abstract Reasoning | paper, project | Text version of a visual abstract reasoning task |
| | Diagrammatic Logic | - | Abstract Reasoning | code | Extracted from OpenAI Evals |
| | Logic | - | Abstract Reasoning | code | Extracted from OpenAI Evals |
| | Logic Statements | - | Abstract Reasoning | code | Extracted from OpenAI Evals |
| | Pattern Identification | - | Abstract Reasoning | code | Extracted from OpenAI Evals |
| | String Patterns | - | Abstract Reasoning | code | Extracted from OpenAI Evals |
| | List Functions | - | Abstract Reasoning | code | Extracted from Google BIG-bench |
Our proposed new dataset `logiqa-logical-reasoning-plus` has been merged into OpenAI/Evals.