[BERT] update README (pytorch#391)
sgpyc authored Jun 19, 2020
1 parent 80fef7a commit 67c2ebe
Showing 1 changed file with 25 additions and 13 deletions: language_model/tensorflow/bert/README.md
The following command will run the clean up steps, and put the results in ./results.
After running the process_wiki.sh script on the 20200101 wiki dump, there will be 500 files named part-00xxx-of-00500 in the ./results directory.

Exact steps (starting in the bert path)

```shell
cd cleanup_scripts
mkdir -p wiki
cd wiki
git clone https://github.com/attardi/wikiextractor.git
python3 wikiextractor/WikiExtractor.py wiki/enwiki-20200101-pages-articles-multistream.xml # Results are placed in bert/cleanup_scripts/text
./process_wiki.sh '<text/*/wiki_??'
python3 extract_test_set_articles.py
```
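
As a quick check (not part of the original scripts), the number of cleaned shards can be verified; this assumes the commands above were run from the cleanup_scripts directory:

```shell
# Expect 500 shards named part-00xxx-of-00500 in ./results
ls ./results/part-00*-of-00500 | wc -l
```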

MD5sums:
7f59165e21b7d566db610ff6756c926b - bert_config.json
python3 create_pretraining_data.py \
```

The generated tfrecord has 500 parts, totalling ~365GB.
The dataset was generated using Python 3.7.6 and tensorflow-gpu 1.15.2.
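
A minimal sanity check on the inputs and outputs (a sketch: ./tfrecord_dir below is an illustrative path; point it at the directory used for --output_file above):

```shell
# bert_config.json should match the MD5 sum listed above
md5sum bert_config.json   # expect 7f59165e21b7d566db610ff6756c926b

# The generated pretraining data should have 500 parts totalling ~365GB
ls ./tfrecord_dir | wc -l
du -sh ./tfrecord_dir
```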

# Stopping criteria
The training should occur over a minimum of 3,000,000 samples. A valid submission will evaluate a masked lm accuracy >= 0.712.

The evaluation will be on the first 10,000 consecutive samples of the training set. The evaluation frequency is every 500,000 samples, starting from 3,000,000 samples. The evaluation can be either offline or online for v0.7. For more details, please refer to the training policy.

The generation of the evaluation set shard should follow the exact command shown above, using create_pretraining_data.py. In particular the seed (12345) must be set to ensure everyone evaluates on the same data.
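
A sketch of what generating that evaluation shard could look like, assuming the standard BERT create_pretraining_data.py flags; the file paths are illustrative, and the fixed seed is the requirement stated above:

```shell
# Illustrative paths; only the seed (12345) is mandated above
python3 create_pretraining_data.py \
  --input_file=./results/eval_text.txt \
  --output_file=./tfrecord_dir/eval_10k \
  --vocab_file=./vocab.txt \
  --do_lower_case=True \
  --max_seq_length=512 \
  --max_predictions_per_seq=76 \
  --random_seed=12345
```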

# Running the model
To run this model, use the following command.

```shell

TF_XLA_FLAGS='--tf_xla_auto_jit=2' \
python run_pretraining.py \
--bert_config_file=./bert_config.json \
--output_dir=/tmp/output/ \
python run_pretraining.py \
--iterations_per_loop=1000 \
--max_predictions_per_seq=76 \
--max_seq_length=512 \
--num_train_steps=682666666 \
--num_warmup_steps=1562 \
--optimizer=lamb \
--save_checkpoints_steps=20833 \
--start_warmup_step=0 \
--num_gpus=8 \
--train_batch_size=24

```
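
As a rough sanity check on the checkpoint cadence implied by these flags (a sketch; it assumes --train_batch_size is the global batch size):

```shell
# 20833 steps per checkpoint x 24 samples per step ~= 500,000 samples,
# which lines up with the evaluation frequency in the stopping criteria
echo $((20833 * 24))   # prints 499992
```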

The above parameters are for a machine with 8 V100 GPUs with 16GB memory each; the hyperparameters (learning rate, warmup steps, etc.) are for testing only. The training script won't print out the masked_lm_accuracy; to get masked_lm_accuracy, run_pretraining.py must be invoked separately, using the following command, on a V100 GPU with 16GB of memory:

```shell

TF_XLA_FLAGS='--tf_xla_auto_jit=2' \
python3 run_pretraining.py \
--bert_config_file=./bert_config.json \
--output_dir=/tmp/output/ \
python3 run_pretraining.py \
--max_predictions_per_seq=76 \
--max_seq_length=512 \
--num_gpus=1 \
--num_train_steps=682666666 \
--num_warmup_steps=1562 \
--optimizer=lamb \
--save_checkpoints_steps=20833 \
--start_warmup_step=0 \
--train_batch_size=24 \
--nouse_tpu

```
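
The masked_lm_accuracy is then read from the evaluation output. Assuming the script keeps the upstream BERT behaviour of writing metrics to eval_results.txt under --output_dir (an assumption, not confirmed here), something like the following would show it:

```shell
# Hypothetical location, based on upstream run_pretraining.py conventions
grep masked_lm_accuracy /tmp/output/eval_results.txt
```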

The model has been tested using the following stack:
- Debian GNU/Linux 9.12 GNU/Linux 4.9.0-12-amd64 x86_64
- NVIDIA Driver 440.64.00
- NVIDIA Docker 2.2.2-1 + Docker 19.03.8
- docker image tensorflow/tensorflow:2.2.0rc0-gpu-py3
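
For reference, a minimal sketch of starting that image with GPU access (mount paths are illustrative; assumes the --gpus flag available in Docker 19.03):

```shell
docker run --gpus all -it \
  -v "$(pwd)":/workspace -w /workspace \
  tensorflow/tensorflow:2.2.0rc0-gpu-py3 bash
```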
