Johnny - DEPendency Parser

This code was used for our EMNLP 2018 paper: What do character-level models learn about morphology? The case of dependency parsing

The code originally comes from here (the v0.1.1 release) and now supports Chainer v4.3.0.

Additional features implemented:

  • Allow input based on morphological analysis (oracle)
  • Attention over morphological features of the headword
  • Option to extract hidden states of the encoder
  • Option to extract attention vector

What is Johnny?

Johnny is a graph-based, arc-factored neural dependency parser implemented using Chainer. Three encoders can be used with this parser:

  • Word-BILSTM, a Bidirectional LSTM encoder that encodes words.
  • Char-BILSTM, a Bidirectional LSTM encoder that encodes words on the character level.
  • Char-CNN, a Convolutional Neural network encoder that encodes words on the character level.

The implementation is based on the papers that can be found in the References section.
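
As a rough illustration of what "arc factored" means, the sketch below scores a parse as the sum of its individual arc scores and lets every word pick its best-scoring head. It is not taken from the Johnny code base: the function names and the toy score matrix are made up for this example, and the greedy choice shown here does not by itself guarantee a well-formed tree.

```python
# Hypothetical sketch, not Johnny's actual decoder.

def predict_heads(arc_scores):
    """Pick the best-scoring head for every word.

    arc_scores[d][h] is the score of an arc from head h to dependent d.
    Index 0 is an artificial ROOT token, so row 0 is ignored.
    """
    heads = []
    for dep in range(1, len(arc_scores)):
        # A word cannot be its own head.
        candidates = [h for h in range(len(arc_scores)) if h != dep]
        heads.append(max(candidates, key=lambda h: arc_scores[dep][h]))
    return heads


def tree_score(arc_scores, heads):
    """Arc-factored: the score of a parse is the sum of its arc scores."""
    return sum(arc_scores[dep][head] for dep, head in enumerate(heads, start=1))


# Toy scores for "ROOT She sleeps" (made-up numbers).
scores = [
    [0.0, 0.0, 0.0],  # ROOT has no head
    [0.5, 0.0, 2.5],  # "She": best head is "sleeps" (index 2)
    [1.5, 0.25, 0.0],  # "sleeps": best head is ROOT (index 0)
]
heads = predict_heads(scores)
print(heads, tree_score(scores, heads))
```

In a neural parser the score matrix would come from the encoder outputs rather than being written by hand.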

Installation

git clone https://github.com/andreasgrv/johnny
cd johnny
# virtualenv .env && source .env/bin/activate # optional but recommended
pip install -r requirements.txt
pip install .

Training

The basic library was tested on Debian with Python 2.7, 3.4 and 3.5, but the train and test utility scripts only work on Python >= 3.3.

Models for dependency parsing can be trained on the Universal Dependencies v2.0 dataset using the train.py script.

Download the dataset and extract its contents to a folder of your choosing (we will refer to this as UD_FOLDER, the path to the folder containing the languages). The folder will probably be named something like "ud-treebanks-v2.0".

To train models you can use the default blueprints I used in my dissertation. Alternatively, if you are in for a thrill, you can override the settings to see what happens. The blueprints can be found under the blueprints folder.

As an example, to train a parser using the Char-CNN encoder on the Universal Dependencies v2.0 dataset, you can follow this snippet:

mkdir models # you can use a different folder if you like
python train.py -i UD_FOLDER -o models --verbose --name mytest \
				--load_blueprint blueprints/dissertation/cnn-char-level.yaml \
				--dataset.lang Russian # Unsurprisingly, English is the default

This will write 3 files to a directory under the models folder. The directory depends on the name of the dataset used. The 3 files should be:

  1. mytest.bp (a blueprint file) # mytest is whatever you passed to --name
  2. mytest.vocab (a vocabulary file)
  3. mytest.model (the numpy matrices of the chainer model)

You can override the defaults specified in the blueprint on the fly from the command line using dot notation (for example, --dataset.lang Russian). See mlconf for details on how this works.
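
To give an idea of how dotted keys can address nested settings, here is a minimal sketch that applies overrides such as dataset.lang to a nested dictionary. This is not mlconf's API: apply_overrides is a hypothetical helper, and the blueprint contents (including the patience value) are invented for illustration.

```python
import copy


def apply_overrides(config, overrides):
    """Return a copy of a nested dict with dotted-key overrides applied."""
    result = copy.deepcopy(config)
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = result
        for key in parents:
            # Walk down the nesting, creating levels that do not exist yet.
            node = node.setdefault(key, {})
        node[leaf] = value
    return result


# Made-up blueprint contents, for illustration only.
blueprint = {"dataset": {"lang": "English"}, "checkpoint": {"patience": 10}}
updated = apply_overrides(
    blueprint, {"dataset.lang": "Russian", "checkpoint.patience": 3}
)
print(updated)
```

The original dictionary is left untouched, which mirrors the idea that the blueprint file holds the defaults and the command line only changes them for one run.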

Note that training may take quite a few hours on CPU. To train on a GPU, use --gpu_id 0 (assuming the GPU is device 0); see here for more advice on using GPUs with Chainer. If you want to train for less time, you can also specify --max_epochs or make the --checkpoint.patience parameter smaller.
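
The sketch below shows one common way a patience setting and an epoch cap shorten training: stop once the development score has not improved for a given number of epochs, or once the cap is reached. The exact semantics of --checkpoint.patience and --max_epochs in Johnny are an assumption here, and train_with_patience and the score curve are hypothetical.

```python
def train_with_patience(dev_scores, patience, max_epochs=None):
    """Simulate training on a sequence of per-epoch dev scores.

    Returns (epochs_run, best_epoch). Training stops when `patience`
    epochs have passed without a new best score, or after max_epochs.
    """
    best_score = float("-inf")
    best_epoch = 0
    epochs_run = 0
    for epoch, score in enumerate(dev_scores, start=1):
        if max_epochs is not None and epoch > max_epochs:
            break
        epochs_run = epoch
        if score > best_score:
            # A real trainer would save a checkpoint here.
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break
    return epochs_run, best_epoch


# Made-up dev scores, one per epoch.
curve = [0.60, 0.70, 0.72, 0.71, 0.72, 0.70, 0.69, 0.75]
print(train_with_patience(curve, patience=2))                # (5, 3)
print(train_with_patience(curve, patience=5))                # (8, 8)
print(train_with_patience(curve, patience=5, max_epochs=4))  # (4, 3)
```

With the smaller patience, training ends three epochs earlier but misses the late improvement at epoch 8, which is the trade-off to keep in mind when lowering the value.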

Testing

In order to test a model, you need to provide the test.py script with the blueprint written during training (mytest.bp if you followed the previous step).

You can test the model on the development set by providing the path to the dev .conllu file after the --test_file option. If you want to evaluate on the test set you need to first download it from the Universal Dependencies website. Make sure you provide the test file for the right language :)

python test.py --blueprint models/conll2017_v2_0/russian/mytest.bp --test_file PATH_TO_CONLLU
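
If you want to check that a .conllu file looks right before passing it to --test_file, a small reader like the one below may help. It is a sketch that relies only on the standard CoNLL-U layout (ten tab-separated columns, with FORM, HEAD and DEPREL in columns 2, 7 and 8), it assumes the file is fully annotated, and read_conllu is a hypothetical helper rather than part of Johnny.

```python
def read_conllu(lines):
    """Yield sentences as lists of (form, head, deprel) tuples."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        token_id = cols[0]
        if "-" in token_id or "." in token_id:
            continue  # skip multiword token ranges and empty nodes
        sentence.append((cols[1], int(cols[6]), cols[7]))
    if sentence:
        yield sentence


sample = [
    "# text = She sleeps",
    "1\tShe\tshe\tPRON\t_\t_\t2\tnsubj\t_\t_",
    "2\tsleeps\tsleep\tVERB\t_\t_\t0\troot\t_\t_",
    "",
]
sentences = list(read_conllu(sample))
print(sentences)
```

To read a real file, pass an open file handle instead of the sample list, for example `read_conllu(open(path, encoding="utf-8"))`.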

Terminal Visualisation

Below is a hacky terminal visualisation of the parser predictions during training on the Universal Dependencies dataset.

The white box in the Cur index row shows which word in the sentence we are currently looking at. The number to the side is that word's index, which runs up to the sentence length. The number between |absolute value| bars is the distance of the real head from the current index.

For each word in the sentence the parser chooses one word to be its governor (head), to which it draws an arc. This is represented by the white box in the row labelled Pred head, whose height roughly corresponds to the predicted probability (a unicode box can only represent a few levels :) ).

The parser then labels the predicted arc with the relationship the two words are predicted to have, as can be seen in the row labelled Pred label.

The Real head and Real label rows show which word is the correct head and label for the training data - namely what the parser should have predicted.
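
For the curious, a row like Pred head can be drawn by turning head scores into probabilities and mapping each probability to one of a handful of unicode block characters. The sketch below shows the idea under that assumption; it is not Johnny's visualisation code, and BLOCKS, softmax and render_row are names chosen for this example.

```python
import math

BLOCKS = " ▁▂▃▄▅▆▇█"  # 9 levels, from empty to full height


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def render_row(probs):
    """Map each probability in [0, 1] to a block character."""
    top = len(BLOCKS) - 1
    return "".join(BLOCKS[min(top, int(round(p * top)))] for p in probs)


# Made-up head scores for one word over four candidate heads.
head_probs = softmax([0.0, 3.0, 0.5, -1.0])
print(render_row(head_probs))
```

With only nine levels available, nearby probabilities collapse onto the same character, which is why the box height is only a rough guide.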

[Image: a visualisation of the parser running in the terminal]

To enable the above visualisation during training you can specify the --visualise option.

License

3-clause BSD; see the LICENSE.txt file.

References

Code

https://github.com/elikip/bist-parser

https://github.com/XingxingZhang/dense_parser

Papers

Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations

Dependency Parsing as Head Selection

Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation

Character-Aware Neural Language Models

From Characters to Words to in Between: Do We Capture Morphology?