A primitive but functional implementation of Byte-Pair encoding (BPE) in Python. Note that the training algorithm is rather slow, although it speeds up as pieces merge.
To train a Byte-Pair encoding on your corpus, provide the corpus as a single plain text file.
Example:
python3 bpe.py file.txt 100
Arguments:
- The path to your input file (plain text).
- The desired vocabulary size (an integer, 100 in the example above).
No third-party libraries need to be installed to run the script; the Python standard library provides everything it needs.
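
For readers unfamiliar with the technique, here is a minimal sketch of how merge-based BPE training can be written with the standard library alone. It is not the code in bpe.py: the function name `train_bpe`, the whitespace-based word splitting, and the choice to start from characters rather than raw bytes are all assumptions made for illustration.

```python
from collections import Counter


def train_bpe(text, vocab_size):
    """Learn BPE merges until the vocabulary reaches vocab_size.

    Hypothetical helper for illustration; not the interface of bpe.py.
    """
    # Assumption: split on whitespace and start from single characters.
    # Each word is stored as a tuple of symbols with its frequency.
    words = Counter(tuple(word) for word in text.split())
    vocab = {symbol for word in words for symbol in word}
    merges = []

    while len(vocab) < vocab_size:
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is a single symbol; nothing left to merge

        # Pick the most frequent pair and record it as a new piece.
        best = max(pairs, key=pairs.get)
        merged = best[0] + best[1]
        merges.append(best)
        vocab.add(merged)

        # Rewrite every word, replacing occurrences of the best pair.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words

    return merges, vocab


if __name__ == "__main__":
    merges, vocab = train_bpe("aa aa aa ab", 3)
    print(merges)
```

In this sketch every merge shortens the symbol sequences, so later iterations have fewer pairs to count, which is one plausible reason training gets faster as pieces merge.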