Releases: ufal/udpipe
UDPipe 2.1.0
Compared to UDPipe 2.0.0:
- Add support for using a morphological dictionary via `ufal.morphodita` during prediction – if the dictionary returns some analyses for a given form, we return the one most probable according to the predicted logits (see the sketch below).
- Add support for `--no_single_root` in the evaluation script.
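The dictionary constraint amounts to restricting the tagger's argmax to the analyses licensed by the dictionary. A minimal sketch of the idea; the names below (`choose_tag`, the `analyses` list) are illustrative, not the UDPipe or MorphoDiTa API:

```python
# Hypothetical sketch of dictionary-constrained tag prediction.
def choose_tag(logits, tag_index, analyses):
    """Return the most probable tag, restricted to dictionary analyses.

    logits:    per-tag scores predicted by the tagger (list of floats)
    tag_index: mapping from tag string to its position in logits
    analyses:  tags the morphological dictionary licenses for the form
    """
    candidates = [t for t in analyses if t in tag_index]
    if not candidates:
        # The dictionary knows nothing about the form: plain argmax.
        candidates = list(tag_index)
    return max(candidates, key=lambda t: logits[tag_index[t]])

# Toy usage: the unconstrained argmax would pick VERB, but the dictionary
# only licenses NOUN and ADJ for this form, so NOUN wins.
tag_index = {"NOUN": 0, "VERB": 1, "ADJ": 2}
print(choose_tag([1.5, 2.0, 0.5], tag_index, ["NOUN", "ADJ"]))  # -> NOUN
```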
UDPipe 1.3.1
Maintenance release of UDPipe 1.
Changes since UDPipe 1.3.0:
- Update MorphoDiTa to 1.11.2.
UDPipe 1.3.0
Maintenance release of UDPipe 1.
Changes since UDPipe 1.2.0:
- Get rid of `UndefinedBehaviourSanitizer` and `AddressSanitizer` findings.
- Add `segment_size` and `learning_rate_final` parameters to tokenizer training.
- Add several options to `udpipe_server`.
- Fix a bug in returning the trained model as a string; use bytes instead.
- Fix a bug where newlines after URLs/emails were considered just spaces.
- Fix a silent error on aarch64 caused by assuming `char` is signed.
- On Windows, file paths are now UTF-8 encoded instead of ANSI. This change affects the API, binary arguments, and program outputs.
- The Windows binaries are now compiled with VS 2019; systems older than Windows 7 are no longer supported.
- Add ARM64 macOS build.
- Python wheels are provided for Python 3.6–3.11.
UDPipe 2.0.0
Compared to UDPipe 1:
- UDPipe 2 is Python-only and tested only on Linux,
- UDPipe 2 is meant as a research tool, not as a user-friendly UDPipe 1 replacement,
- UDPipe 2 achieves much better accuracy, but requires a GPU to run at reasonable speed,
- UDPipe 2 does not perform tokenization by itself – it uses UDPipe 1 for that (a tokenization sketch follows this list).
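For reference, tokenizing with UDPipe 1 from Python looks roughly like this, assuming the `ufal.udpipe` binding is installed and a UDPipe 1 model file has been downloaded (the model path below is a placeholder):

```python
import ufal.udpipe

# Load a UDPipe 1 model (placeholder path) and build a tokenize-only
# pipeline: "tokenize" reads raw text, NONE skips tagging and parsing.
model = ufal.udpipe.Model.load("czech-pdt-ud-2.5.udpipe")
pipeline = ufal.udpipe.Pipeline(
    model, "tokenize",
    ufal.udpipe.Pipeline.NONE, ufal.udpipe.Pipeline.NONE,
    "conllu")
error = ufal.udpipe.ProcessingError()
conllu = pipeline.process("Ahoj světe!", error)
if error.occurred():
    raise RuntimeError(error.message)
print(conllu)  # tokenized input in CoNLL-U format
```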
UDPipe 2 is available as a REST service running at https://lindat.mff.cuni.cz/services/udpipe. If you like, you can use the udpipe2_client.py script to interact with it.
However, if you prefer to run UDPipe 2 locally, you can use this release.
Running Inference with Existing Models
To run UDPipe 2, you need to first download a model from the list of UDPipe 2 models. Then you can run UDPipe 2 as a local REST server, and use the udpipe2_client.py script to interact with it (in the same way as with the official service).
To run the server, use the udpipe2_server.py script.
- Install the packages from requirements.txt. While only TF 1 is supported for model training (ancient, I know), you can also use TF 2 for inference.
- The script has the following required options (an example launch follows this list):
  - `port`: the port to listen on. We use `SO_REUSEPORT` to allow multiple processes to run concurrently, supporting seamless upgrades;
  - `default_model`: model name to use when no model is specified in the request;
  - `models`: each model is then a quadruple of the following parameters (each published model contains a file `MODEL.txt` with these parameters):
    - `model names`: any number of model names separated by `:`; furthermore, any hyphen-separated prefix of any model name can also be used as a name (e.g., `czech-pdt-ud-2.10-220711:cs_pdt-ud-2.10-220711:cs:ces:cze`);
    - `model path`: path to the model directory;
    - `treebank name`: because multiple treebanks can be handled by a single model, we need to specify a treebank name to use (this also specifies which tokenizer to use from the model directory);
    - `acknowledgements`: a URL to the model's acknowledgements.
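A hypothetical launch with a single Czech model, assuming the required options are passed positionally in the order listed above; the model path, treebank name, and acknowledgements URL are placeholders, and only the model-names string is taken from the example above:

```python
import subprocess

# Hypothetical launch sketch; arguments follow the required-options list.
# Everything except the model-names string is a placeholder.
subprocess.run([
    "python3", "udpipe2_server.py",
    "8001",                                              # port
    "czech",                                             # default_model
    # one model quadruple: names, path, treebank name, acknowledgements
    "czech-pdt-ud-2.10-220711:cs_pdt-ud-2.10-220711:cs:ces:cze",
    "models/czech-pdt-ud-2.10-220711",                   # model path
    "czech-pdt-ud-2.10-220711",                          # treebank name
    "https://example.org/acknowledgements",              # acknowledgements URL
], check=True)
```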
- The script has the following optional parameters:
  - `--batch_size`: batch size to use (default 32);
  - `--logfile`: if specified, log to this file instead of standard error;
  - `--max_request_size`: maximum request size, in bytes (default 4MB);
  - `--preload_models`: list of models to preload (or `all`) immediately after start (default none);
  - `--threads`: number of threads to use (default is to use all physical cores);
  - `--wembedding_server`: for deployment purposes, it might be useful to compute the contextualized embeddings (mBERT, RobeCzech) not in the UDPipe 2 service but in a specialized service – see https://github.com/ufal/wembedding_service for documentation of the wembeddings service (default is to compute the embeddings directly in the UDPipe 2 service).
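Once the server is running, udpipe2_client.py can talk to it, but so can any HTTP client. A minimal sketch using Python's requests, assuming the local server exposes the same /process endpoint as the official REST service (the port matches the hypothetical launch above):

```python
import requests

# Minimal client sketch; assumes a local server on port 8001 exposing the
# LINDAT-style /process endpoint (udpipe2_client.py does the same job).
response = requests.post(
    "http://localhost:8001/process",
    data={
        "data": "Ahoj světe!",
        "model": "czech",   # any configured model name, or omit for default
        "tokenizer": "",    # presence of the parameter enables the step
        "tagger": "",
        "parser": "",
    },
)
response.raise_for_status()
print(response.json()["result"])   # the CoNLL-U output
```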
The service can be stopped by a `SIGINT` (Ctrl+C) or `SIGUSR1` signal. Once such a signal is received, the service stops accepting new requests, but waits until all existing connections are handled and closed.
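The seamless upgrades mentioned above follow from combining `SO_REUSEPORT` with this graceful shutdown: a new server process binds the same port while the old one is still serving, and the old one is then sent `SIGUSR1` to drain. An illustrative sketch of the socket mechanism (not UDPipe code; Linux-only):

```python
import socket

def make_listener(port):
    # SO_REUSEPORT lets several processes listen on the same port at once;
    # the kernel distributes incoming connections among them.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("", port))
    s.listen()
    return s

old_server = make_listener(8001)  # the instance currently serving
new_server = make_listener(8001)  # the upgraded instance binds without error
# At this point the old instance would receive SIGUSR1 and drain its requests.
```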
The models are loaded on demand, but they are never freed. If a GPU is available, all computation is performed on it (and an OOM might occur if too many models are loaded). If you would like to run BERT on a GPU and the remaining computation on a CPU, you could use a GPU-enabled wembeddings service plus a CPU-only UDPipe 2 service.
UDPipe 1.2.0
Changes since UDPipe 1.1.0:
- On-demand loading of models in REST server, with a pool of least recently used models.
- Make GRU tokenizer dimension configurable (16, 24, 64 supported).
- Track paragraph boundaries even under `normalized_spaces`.
- Support experimental sentence segmentation using both the tokenizer and the parser jointly.
- Add EPE output format.
- Make default model in REST server explicit.
- Support pre-filling according to URL params in the webapp.
UDPipe 1.1.0
Changes since UDPipe 1.0.0:
- `morphodita_parsito` models (now version 3) require at least UDPipe version 1.1.0.
- CoNLL-U v2 format is supported. Notably, spaces in forms and lemmas are now allowed, as are empty nodes.
- Support options for `input_format` and `output_format` instances.
- Preserve all spacing when tokenizing.
- Optionally generate document-level token ranges in the original text.
- Optionally respect given segmentation during tokenization.
- Tokenizer can be trained to allow spaces in tokens (default if there are forms with spaces in the training data).
- Parser can be trained to always return one root per sentence (default).
- Improve `input_format` API to allow inter-block state (for correct tracking of inter-sentence spaces and document-level offsets).
- Improve `output_format` API to support begin/end document marks and to allow state in the `output_format` instance (to allow numbering output sentences, for example).
UDPipe 1.0.0
- Initial public release.