hmByT5 Pretraining

This section documents the pretraining of a ByT5 model using the original T5 and ByT5 codebases.

Preparing VM

In the next step, a VM needs to be created to coordinate the pretraining process. We use an n1-standard-2 instance with a custom boot disk size of 50GB. Note that the default boot disk size is 10GB, which is not enough for all Python dependencies. We use TensorFlow 2.8 in our experiments:

$ gcloud compute instances create hmbyt5 --zone=europe-west4-a \
  --machine-type=n1-standard-2 --image-project=ml-images \
  --image-family=tf-2-8-0 --scopes=cloud-platform \
  --boot-disk-size 50GB

The VM should be in the same zone as the GCS bucket and the TPU.
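
For example, a bucket in the matching region can be created with gsutil. This is a minimal sketch: the bucket name gs://hmbyt5 is the one used later in this README, and the region is assumed to be europe-west4 to match the VM and TPU zone:

$ gsutil mb -l europe-west4 gs://hmbyt5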

After VM creation, we can SSH into it:

$ gcloud compute ssh hmbyt5 --zone europe-west4-a

Now we immediately start a tmux session and run all commands in this session. If the connection to the VM is lost, you can resume the session with tmux attach after the next login.
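
For example, a named session (the name is arbitrary) can be started and later re-attached with:

$ tmux new -s pretraining
$ tmux attach -t pretraining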

Installing Dependencies

We just need to clone the T5 repository:

$ git clone https://github.com/google-research/text-to-text-transfer-transformer.git
$ cd text-to-text-transfer-transformer
$ git checkout c3be7cf
$ pip3 install -e .
$ export PATH=$PATH:$HOME/.local/bin
$ cd

Note: we need to use a specific commit because of recent code changes that are not compatible with Python 3.7. The following dependencies need a downgrade:

$ pip3 install --upgrade pyglove==0.2.1
$ pip3 install seqio==0.0.13
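
To verify that the pinned versions are installed, run:

$ pip3 show pyglove seqio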

This will install all necessary dependencies. To make sure that everything is working, just run:

$ t5_mesh_transformer --helpfull

In the next step, the ByT5 repo is cloned:

$ git clone https://github.com/google-research/byt5.git
$ cd byt5

Clone this repository into the ByT5 repo so that our custom tasks can be used:

$ git clone https://github.com/stefan-it/hmByT5.git

Custom SeqIO Tasks

To pretrain a ByT5 model on our own corpus, we need to slightly extend the ByT5 library by adding our datasets to its internal task registry. This is done in the hmbyt5/tasks.py file. Here's an example for the English dataset:

seqio.TaskRegistry.add(
    "en_corpus",
    # Load the dataset from TFDS (built and stored in the GCS bucket)
    source=seqio.TfdsDataSource(tfds_name="en_dataset:1.0.0"),
    preprocessors=[
        # Map the raw "text" field to "targets"; "inputs" are
        # generated later by the span-corruption preprocessor
        functools.partial(
            t5.data.preprocessors.rekey,
            key_map={
                "inputs": None,
                "targets": "text",
            }),
        seqio.preprocessors.tokenize,
        seqio.CacheDatasetPlaceholder(),
        # Standard T5 span-corruption pretraining objective;
        # MEAN_NOISE_SPAN_LENGTH is defined at the top of tasks.py
        functools.partial(t5.data.preprocessors.span_corruption,
                          mean_noise_span_length=MEAN_NOISE_SPAN_LENGTH),
        seqio.preprocessors.append_eos_after_trim,
    ],
    # Byte-level output features (no subword vocabulary), defined in tasks.py
    output_features=DEFAULT_BYTE_OUTPUT_FEATURES,
    metric_fns=[],
)
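
To quickly check that the task was registered, the registry can be queried from within the byt5 folder. This is a minimal sketch, assuming tasks.py sits in the current directory (it is copied there in the pretraining steps below):

$ python3 -c "import tasks, seqio; print(seqio.TaskRegistry.get('en_corpus'))"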

Model Configurations

Our current strategy is to use the original ByT5 (Small) model as the initial checkpoint. We then continue pretraining language after language, with 100k steps for each language. Thus, six different GIN configuration files are located in the configs folder:

  • ./configs/0_english_operative_config.gin
  • ./configs/1_german_operative_config.gin
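
The original ByT5 (Small) checkpoint is hosted in the public t5-data bucket. Assuming that location, it can be copied into our own bucket as init checkpoint (the target path here is illustrative and must match the checkpoint path referenced in the GIN configs):

$ gsutil -m cp -r gs://t5-data/pretrained_models/byt5/small gs://hmbyt5/models/byt5-small-init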

TPU Creation

A v3-32 TPU pod can be created via:

$ gcloud compute tpus create hmbyt5 --zone=europe-west4-a \
  --accelerator-type=v3-32 --network=default \
  --range=192.168.2.0/29 --version=2.8.0
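
The state of the freshly created TPU can then be checked with:

$ gcloud compute tpus describe hmbyt5 --zone=europe-west4-a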

Starting Pretraining

English

The first model, on the English corpus, can be started with:

$ cp ./hmByT5/hmbyt5/tasks.py .
$ cp ./hmByT5/hmbyt5/configs/0_english_operative_config.gin .

$ python3 -m t5.models.mesh_transformer_main \
  --tpu="hmbyt5" \
  --tpu_zone="europe-west4-a" \
  --model_dir="gs://hmbyt5/models/byt5-small-english" \
  --gin_file="./0_english_operative_config.gin" \
  --t5_tfds_data_dir="gs://hmbyt5/datasets" \
  --module_import="tasks"
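
Training progress can be monitored with TensorBoard, which reads the event files directly from the bucket (assuming TensorBoard is available on the VM, e.g. via the TensorFlow installation):

$ tensorboard --logdir gs://hmbyt5/models/byt5-small-english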

German

The next model, on the German corpus, can be started with:

$ cp ./hmByT5/hmbyt5/tasks.py .
$ cp ./hmByT5/hmbyt5/configs/1_german_operative_config.gin .

$ python3 -m t5.models.mesh_transformer_main \
  --tpu="hmbyt5" \
  --tpu_zone="europe-west4-a" \
  --model_dir="gs://hmbyt5/models/byt5-small-german" \
  --gin_file="./1_german_operative_config.gin" \
  --t5_tfds_data_dir="gs://hmbyt5/datasets" \
  --module_import="tasks"
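
During or after a run, the written checkpoints can be listed directly from the bucket:

$ gsutil ls gs://hmbyt5/models/byt5-small-german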