Add some initial documentation of UDPipe 2 model training.
foxik committed Jul 28, 2022
1 parent ec193b1 commit cc218a4
too many models are loaded). If you would like to run BERT on a GPU and the
remaining computation on a CPU, you could use GPU-enabled wembeddings service
plus a CPU-only UDPipe 2 service.


## Training New Models

You can train UDPipe 2 models, but we provide no support for model training,
and you will probably need to read the source code to find out what the
various training options do.

To train a new UDPipe 2 model, you need to perform three steps, assuming
your data is in CoNLL-U format:
1. First, you need to compute the contextualized embeddings for your data,
using the `wembedding_service/compute_wembeddings.py` script.

The official UDPipe models use the
[scripts/compute_embeddings.sh](https://github.com/ufal/udpipe/blob/udpipe-2/scripts/compute_embeddings.sh)
script, which shows how to name the outputs and which BERT-like models
are used for which treebanks.
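As a rough sketch, computing embeddings for the usual three data splits might look like the following (the treebank paths, the `.npz` output extension, and the loop structure are illustrative assumptions; consult the script's `--help` and `scripts/compute_embeddings.sh` for the actual interface and naming convention):

```shell
# Compute contextualized embeddings for each data split.
# Paths and the output naming below are illustrative only -- see
# scripts/compute_embeddings.sh for the convention the official
# models use.
for split in train dev test; do
  python3 wembedding_service/compute_wembeddings.py \
    "data/en_ewt-ud-$split.conllu" \
    "data/en_ewt-ud-$split.conllu.npz"
done
```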

2. Then, you need to train the UDPipe 2 models themselves, using the
[udpipe2.py](https://github.com/ufal/udpipe/blob/udpipe-2/udpipe2.py) script.
You can generally use the default hyperparameters; you only need to specify
the path of the trained model (as the first argument) and the paths to your
data using the `--train`, `--dev`, and `--test` options.

The official UDPipe 2 models are trained using the
[scripts/train.sh](https://github.com/ufal/udpipe/blob/udpipe-2/scripts/train.sh)
script; you can see they tweak `--batch_size` and `--rnn_cell_dim` a bit
depending on the trained treebank size.
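Concretely, a training invocation following the description above might look like this (the model and data paths are illustrative, and the `--batch_size` and `--rnn_cell_dim` values are example settings, not the official per-treebank ones):

```shell
# Train a UDPipe 2 model; the first argument is the output model path.
# Hyperparameter values below are examples -- see scripts/train.sh for
# how the official models adjust them by treebank size.
python3 udpipe2.py models/en_ewt \
  --train data/en_ewt-ud-train.conllu \
  --dev data/en_ewt-ud-dev.conllu \
  --test data/en_ewt-ud-test.conllu \
  --batch_size 32 \
  --rnn_cell_dim 512
```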

3. Because UDPipe 2 does not include tokenization functionality, you need to
[train a UDPipe 1 tokenizer](https://ufal.mff.cuni.cz/udpipe/1/users-manual#model_training_tokenizer).
The trained tokenizer should then be placed in the UDPipe 2 model
directory, named after the variant you specify on the command line of the
[udpipe2_server.py](https://github.com/ufal/udpipe/blob/udpipe-2/udpipe2_server.py)
script.
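A tokenizer-only training run might be sketched as follows (the tokenizer options, file names, and the target directory are illustrative assumptions; the UDPipe 1 user manual linked above documents the real options):

```shell
# Train a tokenizer-only UDPipe 1 model; "none" disables the tagger
# and parser components. Tokenizer options here are illustrative.
udpipe --train en_ewt.tokenizer \
  --tokenizer='dimension=24;epochs=100' \
  --tagger=none --parser=none \
  data/en_ewt-ud-train.conllu

# Place the tokenizer in the UDPipe 2 model directory under the name
# (hypothetical here) that udpipe2_server.py will be given.
cp en_ewt.tokenizer models/en_ewt/
```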
