An evaluation pipeline for autoregressive language models using direct probability measurement for minimal pairs.
This pipeline evaluates language models by reading out the conditional log probabilities of minimal pairs of sentences. In each pair, one sentence is considered correct, while the other contains a minimal violation. The model is expected to assign a lower probability to the incorrect sentence.
Given a sufficient number of test items targeting specific linguistic phenomena, the accuracy of the model's probability assignments indicates its linguistic capabilities and its grasp of these phenomena. Assessing models at different training checkpoints makes it possible to analyze the learning dynamics of selected phenomena.
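As an illustration of this readout, the sketch below scores both members of a pair by summing conditional token log probabilities with Hugging Face `transformers`. This is a minimal sketch, not this repository's implementation: the helper name `sentence_logprob` is illustrative, and the model is the smallest Pythia checkpoint.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-14m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def sentence_logprob(sentence: str) -> float:
    """Sum of conditional token log probabilities under the model."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Logits at position t predict token t+1, so shift by one.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp.sum().item()

good = sentence_logprob("The keys to the cabinet are on the table.")
bad = sentence_logprob("The keys to the cabinet is on the table.")
print(good > bad)  # a model that has learned agreement should print True
```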
| AI2-OLMo | EleutherAI-Pythia |
|---|---|
| Huggingface Suite | Huggingface Suite |
| Github | Github |
| Technical Report | Technical Report |
| Website | Website |
Both models were released in different parameter sizes at different intermediate training checkpoints (revisions). This makes it possible to test for emerging capabilities across parameter scale and training time.
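For example, an intermediate checkpoint can be loaded directly by passing its branch name as the revision (here `step3000`, one of the checkpoint branches published in the Pythia suite):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pythia-14m weights as they were after 3,000 training steps
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-14m", revision="step3000")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-14m", revision="step3000")
```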
- tested on Python `3.12.x`, `3.11.x`, `3.10.x`
- requires a GPU with `CUDA >= 12.1` support (smaller models can run on CPU, but this is not recommended)
- recommended: use the uv package manager for a fast setup
```bash
uv venv
# macOS / Linux
source .venv/bin/activate
# Windows
.venv\Scripts\activate
uv pip install -r requirements.txt
```
Alternatively, with conda:

```bash
conda env create -f environment.yml
conda activate pipe
```
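After either setup, you can confirm that PyTorch sees a CUDA device (see the GPU requirement above):

```python
import torch

print(torch.cuda.is_available())  # True if a usable CUDA device is present
print(torch.version.cuda)         # CUDA version the installed PyTorch was built for
```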
An example dataset for testing can be found in the `data` folder. Additional datasets can easily be integrated and tested; please refer to the corresponding README.md in that folder for more details.
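The exact file layout is documented there; conceptually, each test item pairs a correct sentence with a minimally differing incorrect one. The field names below are illustrative only, not the repository's actual schema:

```python
# Illustrative minimal-pair item; the real column/field names are
# defined by the dataset files in the data folder.
item = {
    "good": "The author writes books.",
    "bad": "The author write books.",
    "phenomenon": "subject-verb agreement",
}
```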
Run the Python script and specify the dataset, the model, and optionally a revision (defaults to `main`, the final checkpoint for all models).
To access different intermediate training checkpoints (revisions), browse either the Pythia or OLMo suite on Huggingface, open a model's "Files and versions" tab, and choose the corresponding branch.
```bash
# Template
python run_eval.py {dataset} {model} {optional: revision}
```

Example:

```bash
python run_eval.py dtfit EleutherAI/pythia-14m
```
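To evaluate an intermediate checkpoint, pass the branch name found on Huggingface as the revision argument, e.g. Pythia's `step3000`:

```bash
python run_eval.py dtfit EleutherAI/pythia-14m step3000
```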
- Performance
  - fix batch support
- Optional
  - add support for commercial APIs as upper bound
  - extract & analyze contextual word embeddings
  - test other open models with checkpoints?
    - togethercomputer/RedPajama-INCITE-7B-Base
    - TinyLlama/TinyLlama-1.1B
    - Zyphra/Zamba-7b
  - Ablation Models?
    - checkpoints available for different common datasets for pretraining
- Maximilian Krupop