Python notebook and models for the Machine Translation Lab @ ALPS 2024
This repository is a modified version of another tutorial on NMT by Kyunghyun Cho et al.
Note that this repo implements a very basic MT library, which will not scale very well to large datasets or large models. For a more advanced and versatile MT framework, check out Pasero!
- Go to https://colab.research.google.com
- Under the "GitHub" tab, type the URL of this repo (https://github.com/naverlabseurope/ALPS2024-MT-LAB), then click on "NMT.ipynb"
- In the Colab menu, go to "Runtime / Change runtime type", then select "GPU" in the "Hardware accelerator" drop-down list
- Open this link and connect to your Google Drive account
- Then go to "Shared with me" in your Google Drive, right-click the "ALPS2024-NMT" folder and select "Add shortcut to Drive"
- Start playing with the notebook. Note that the models you train in the notebook won't be saved (they will be lost when you close the notebook). However, you can manually download them to your computer or copy them to your Google Drive if you wish.
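
For example, a minimal Colab cell along the following lines can copy a checkpoint you trained to your Drive (the checkpoint path is only an assumption about where you saved your model; adjust it to your case):

```python
# Sketch (assumed paths): copy a trained checkpoint from the Colab session to Google Drive.
import shutil
from google.colab import drive

drive.mount('/content/drive')               # asks for authorization on first use
shutil.copy('models/en-fr/transformer.pt',  # wherever you saved your model (assumption)
            '/content/drive/MyDrive/transformer.en-fr.pt')
```
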
Note: if you don't have a Google account, you can still run the notebook in Colab. You just need to set `colab = False`, and the `download-data.sh` script will then be used to download the data and pre-trained models.
Alternatively, to run the lab locally on your own machine:

    git clone https://github.com/naverlabseurope/ALPS2024-MT-LAB.git
    cd ALPS2024-MT-LAB
    scripts/setup.sh          # creates a Python environment, installs the dependencies, and downloads the data and models
    scripts/run-notebook.sh   # starts a Jupyter notebook where the lab will take place
You also need to set `colab = False` in the notebook. If you don't have a GPU, also set `cpu = True`: models will be very slow to train, but you can still do inference in a reasonable time.
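
For reference, a minimal sketch of what such a configuration cell might look like (only the `colab` and `cpu` names come from the instructions above; the comments are illustrative):

```python
# Notebook configuration flags referenced above
colab = False   # not running in Google Colab; data is fetched by download-data.sh instead
cpu = True      # no GPU available: training will be very slow, inference remains reasonable
```
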
You can also run the lab on a remote GPU server over SSH. In the following, replace `HOSTNAME` with the name of your server.
- SSH to the server, install the repo, and run the notebook:

      ssh HOSTNAME
      git clone https://github.com/naverlabseurope/ALPS2024-MT-LAB.git
      cd ALPS2024-MT-LAB
      scripts/setup.sh
      scripts/run-notebook.sh   # modify this script to change the port if 8888 is already used

- Create an SSH tunnel from your machine:

      ssh -L 8888:localhost:8888 HOSTNAME

- Open the URL printed by the `scripts/run-notebook.sh` command (it looks like http://127.0.0.1:8888/?token=XXX) in your favorite browser
- Enjoy!
The `train.py` script can be used to train models directly from the command line (locally or via SSH on a remote machine), without using the notebook. This is convenient for training multiple models (see the sketch at the end of this section). How to use it: first check which GPU is free with `nvidia-smi`, then launch training on that GPU:

    nvidia-smi

    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA Tesla T4     Off  | 00000000:03:00.0 Off |                    0 |
    | N/A   47C    P0    26W /  70W |  10734MiB / 15109MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   1  NVIDIA Tesla T4     Off  | 00000000:41:00.0 Off |                    0 |
    | N/A   30C    P8     9W /  70W |      3MiB / 15109MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+

    # GPU 1 is free
    CUDA_VISIBLE_DEVICES=1 ./train.py models/en-fr/transformer.pt -s en -t fr \
        --model-type transformer --encoder-layers 2 --decoder-layers 1 --heads 4 --embed-dim 512 --ffn-dim 512 \
        --epochs 10 --lr 0.0005 --batch-size 512 --dropout 0.1 -v

This will reproduce the training of the EN-FR Transformer model we shared. Run `./train.py -h` to see more options.
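
Because each training run is just a shell command, it is easy to script several runs in a row, e.g., to compare model sizes. A minimal sketch, reusing the flags shown above (the layer counts and output paths below are arbitrary illustrative choices, not settings we shared):

```python
#!/usr/bin/env python3
# Sketch: launch several EN-FR Transformer trainings with different depths by
# calling train.py in a loop. Hyperparameter values and paths are illustrative.
import subprocess

for layers in (1, 2, 4):
    ckpt = f'models/en-fr/transformer-{layers}L.pt'
    subprocess.run([
        './train.py', ckpt, '-s', 'en', '-t', 'fr',
        '--model-type', 'transformer',
        '--encoder-layers', str(layers), '--decoder-layers', str(layers),
        '--heads', '4', '--embed-dim', '512', '--ffn-dim', '512',
        '--epochs', '10', '--lr', '0.0005', '--batch-size', '512',
        '--dropout', '0.1', '-v',
    ], check=True)
```
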