LinguisticStructureLM: Transformer-based Language Modeling with Symbolic Linguistic Structure Representations
Published at NAACL-HLT 2022 as "Linguistic Frameworks Go Toe-to-Toe at Neuro-Symbolic Language Modeling" by Jakob Prange, Nathan Schneider, and Lingpeng Kong.
Please cite as:

```bibtex
@inproceedings{prange-etal-2022-linguistic,
    title = "Linguistic Frameworks Go Toe-to-Toe at Neuro-Symbolic Language Modeling",
    author = "Prange, Jakob and
      Schneider, Nathan and
      Kong, Lingpeng",
    booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.naacl-main.325",
    pages = "4375--4391",
}
```
- To install dependencies, run: `pip install -r requirements.txt`
- Download the trained models into this directory.
- Obtain annotated data and store all training and evaluation files as `FORMALISM.training.mrp` and `FORMALISM.validation.mrp` (where `FORMALISM` is one of `{dm, psd, eds, ptg, ud, ptb-phrase, ptb-func, empty}`) in a directory called `mrp/`, which is a subdirectory of this one. Note: We used the annotated and MRP-formatted WSJ data, so we cannot publicly release it here. Please contact me or open an issue! (You'll probably need an LDC license to get the data.)
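The expected data layout can be checked with a short sketch like the one below, derived from the naming scheme above (the `found`/`missing` labels are just for illustration):

```shell
# List the data files this README expects under mrp/ and report which
# ones are present. Paths are relative to the repository root.
for formalism in dm psd eds ptg ud ptb-phrase ptb-func empty; do
  for split in training validation; do
    file="mrp/${formalism}.${split}.mrp"
    if [ -e "$file" ]; then
      echo "found:   $file"
    else
      echo "missing: $file"
    fi
  done
done
```

With eight formalisms and two splits, this enumerates 16 files in total.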
To reproduce the main results (table 2 in the paper), complete the following steps:

- Edit `lm_eval.sh` to match your local environment.
- Run: `sh eval_all_lm.sh`
- The results will be written to stdout by `eval.py` and collected by `lm_eval.sh` in a file called `eval-dm,dm,psd,eds,ptg,ud,ptb-phrase,ptb-func-10-0001-0.0_0.0-0-14combined.out`. Run: `cat eval-dm,dm,psd,eds,ptg,ud,ptb-phrase,ptb-func-10-0001-0.0_0.0-0-14combined.out | grep ";all;" | grep gold`, which will give you a bunch of semicolon-separated lines you can paste into your favorite spreadsheet. Voila!
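If your spreadsheet prefers tab-separated input, the semicolon-separated lines from the `grep` above can be converted with `tr`. The sample line here is made up for illustration; the real output fields depend on `eval.py`:

```shell
# Turn a semicolon-separated result line into a tab-separated one so it
# pastes cleanly into spreadsheet columns. The input line is hypothetical.
echo 'dm;all;gold;0.85' | tr ';' '\t'
```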
To get more info on command-line arguments, run: `python3 train.py` or `python3 eval.py`
To evaluate a trained model more generally (might require an additional input file; contact me!), edit `lm_eval.sh` to match your environment and directory structure, uncomment the lines you want in `eval_all_lm.sh`, and run: `sh eval_all_lm.sh SEED`, where `SEED` is the last number before `.pt` in the model name (currently only seed=14 models are available for download).
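Since the seed is the last number before `.pt` in the model name, it can be extracted with plain parameter expansion. The filename below is hypothetical; only the "trailing number before `.pt`" convention comes from this README:

```shell
# Extract the seed from a model filename (filename is a made-up example).
name="model-dm-10-0001-14.pt"
seed="${name%.pt}"       # strip the .pt extension
seed="${seed##*[!0-9]}"  # keep only the trailing run of digits
echo "$seed"             # the seed to pass to eval_all_lm.sh
```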
To train a new model (requires access to .mrp-formatted and preprocessed data, which you can find here and/or contact me about), edit `lm.sh` to match your environment and directory structure, uncomment the lines you want in `run_all_lm.sh`, and run: `sh run_all_lm.sh SEED`, where `SEED` is a custom random seed you can set.
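To train several models with different seeds, a simple loop over `run_all_lm.sh` works; the seed values below are arbitrary examples, and the `echo` is a dry-run guard:

```shell
# Launch one training run per seed (remove the echo to actually train).
for seed in 14 42 1234; do
  echo "sh run_all_lm.sh $seed"
done
```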