UDA Example #343

Open · wants to merge 91 commits into master

Commits (91)
1ca7f31
draft of back translation
haoyuLucas Oct 3, 2020
f166160
Add a backtranslation augmenter
haoyuLucas Oct 5, 2020
babda7a
after merge
haoyuLucas Oct 16, 2020
4e1e866
rebase on the new base classes
haoyuLucas Oct 16, 2020
31b326e
Update data_augment_processor.py
haoyuLucas Oct 16, 2020
681f15c
Update data_augment_processor.py
haoyuLucas Oct 16, 2020
2142587
Update data_augment_processor.py
haoyuLucas Oct 16, 2020
d6f3a4f
Delete base_augmenter.py
haoyuLucas Oct 16, 2020
653a00e
Delete dictionary_replacement_augmenter.py
haoyuLucas Oct 16, 2020
b85190f
Delete text_generation_augment_processor.py
haoyuLucas Oct 16, 2020
610e99a
Delete dictionary_replacement_augmenter_test.py
haoyuLucas Oct 16, 2020
47d0805
Delete text_generation_augment_processor_test.py
haoyuLucas Oct 16, 2020
d7b9a5b
change the configs to a Texar Config
haoyuLucas Oct 16, 2020
917090c
Merge branch 'bt' of github.com:haoyuLucas/forte into bt
haoyuLucas Oct 16, 2020
a6b6169
abstract a machine translator class
haoyuLucas Oct 20, 2020
5d17c13
Update machine_translator.py
haoyuLucas Oct 20, 2020
45f0100
add the transformer to requirements
haoyuLucas Oct 21, 2020
fcd33d0
add an extra space
haoyuLucas Oct 21, 2020
5445fbe
add the transformers to travis yml
haoyuLucas Oct 23, 2020
07d7b27
add travis yml
haoyuLucas Oct 26, 2020
05947ab
Merge branch 'master' into bt
haoyuLucas Oct 26, 2020
a229a58
add text classifier
jrxk Oct 28, 2020
ae0ffa5
add list
jrxk Oct 29, 2020
04e92fa
fix main
jrxk Oct 29, 2020
2dc228c
fix
jrxk Oct 29, 2020
1d1a577
delete some files
jrxk Oct 29, 2020
3c46e9b
Merge branch 'master' into imdb_classifier
jrxk Oct 29, 2020
8c6e000
first commit of uda
haoyuLucas Oct 30, 2020
a386c62
Merge branch 'master' of https://github.com/asyml/forte into imdb_cla…
jrxk Nov 4, 2020
f8c0995
Merge branch 'master' into imdb_classifier
hunterhector Nov 7, 2020
26de46b
add the bool return value
haoyuLucas Nov 8, 2020
ceff3f5
add some comments
haoyuLucas Nov 8, 2020
c14024a
switch to texar-pytorch
jrxk Nov 9, 2020
2082208
Merge branch 'imdb_classifier' of https://github.com/jrxk/forte into …
jrxk Nov 9, 2020
8758d09
fix travis
jrxk Nov 9, 2020
eeaf010
Merge branch 'master' into bt
hunterhector Nov 11, 2020
7669a58
modify the setup
haoyuLucas Nov 11, 2020
7b98a5d
Merge branch 'bt' of github.com:haoyuLucas/forte into bt
haoyuLucas Nov 11, 2020
e9416e7
modify travis config
haoyuLucas Nov 11, 2020
7422994
initial version of UDA iterator
haoyuLucas Nov 15, 2020
e9b4379
Merge branch 'master' into UDA
haoyuLucas Nov 15, 2020
04e7dfc
fix mypy error
haoyuLucas Nov 15, 2020
7a2216c
Merge branch 'UDA' of github.com:haoyuLucas/forte into UDA
haoyuLucas Nov 15, 2020
fed4d5e
rerun travis
haoyuLucas Nov 15, 2020
0f6b1c4
Merge branch 'UDA' of https://github.com/haoyuLucas/forte into uda_ex…
jrxk Nov 17, 2020
608977f
Merge branch 'bt' of https://github.com/haoyuLucas/forte into uda_exp…
jrxk Nov 17, 2020
a0f1470
add bt pipeline
jrxk Nov 17, 2020
26e7207
Merge branch 'auto_align_replace' of https://github.com/haoyuLucas/fo…
jrxk Nov 17, 2020
957cf08
Add toy data
jrxk Nov 19, 2020
6873fb4
Merge branch 'master' into UDA
haoyuLucas Nov 23, 2020
b452aa5
changed data for uda
jrxk Nov 24, 2020
81a484a
modify train loop for UDA
jrxk Nov 24, 2020
0f8d7b0
Add doc for data augmentation
haoyuLucas Nov 26, 2020
de45a27
add TSA, minor changes
jrxk Nov 30, 2020
fe18f75
Merge branch 'UDA' of https://github.com/haoyuLucas/forte into uda_ex…
jrxk Nov 30, 2020
67308ae
create imdb classifier
jrxk Dec 1, 2020
0daf443
fix travis
jrxk Dec 1, 2020
2349f0c
update UDA
jrxk Dec 1, 2020
69eb738
update config, minor fixes
jrxk Dec 2, 2020
6ec645c
Merge branch 'imdb_classifier_2' of https://github.com/jrxk/forte int…
jrxk Dec 2, 2020
8b4fdab
add README, remove files
jrxk Dec 2, 2020
de59db0
add file link
jrxk Dec 2, 2020
9340277
remove data files
jrxk Dec 2, 2020
8aca85b
remove files
jrxk Dec 2, 2020
8ffaf72
use UDA's back trans data
jrxk Dec 9, 2020
f7313dd
update README
jrxk Dec 10, 2020
69009fe
Merge branch 'master' of https://github.com/asyml/forte into uda_expe…
jrxk Dec 18, 2020
c4d6991
some refactor
jrxk Dec 19, 2020
f5740a8
remove imdb model
jrxk Dec 19, 2020
e565bb8
Merge branch 'master' of https://github.com/asyml/forte into uda_example
jrxk Dec 19, 2020
9a79456
refactor
jrxk Dec 19, 2020
f60961a
update README
jrxk Dec 19, 2020
a95023c
fix init
jrxk Dec 19, 2020
2e16182
Merge branch 'master' into uda_example
hunterhector Dec 21, 2020
b00293f
Merge branch 'master' into uda_example
hunterhector Dec 21, 2020
3984038
Merge branch 'master' of https://github.com/asyml/forte into uda_example
jrxk Dec 23, 2020
efcae45
removed wget, changed imdb_format to forte pipeline
jrxk Dec 23, 2020
2800218
Update README with tutorial to UDA
jrxk Dec 24, 2020
f6a4be1
fix travis
jrxk Dec 24, 2020
c0aa83f
Merge branch 'master' into uda_example
hunterhector Dec 24, 2020
f17a238
move to da folder, remove classes
jrxk Dec 24, 2020
8d8016f
Merge branch 'uda_example' of https://github.com/jrxk/forte into uda_…
jrxk Dec 24, 2020
a7fb89d
clean some code, update reader
jrxk Dec 24, 2020
46ccb67
more clean, adding bt
jrxk Dec 25, 2020
0fec67c
update scripts, add requirements for t2t
jrxk Dec 27, 2020
34727eb
add merge sentences code
jrxk Dec 30, 2020
554f892
Added instructions for back translation
Dec 30, 2020
259664f
fix docstring, travis
jrxk Dec 30, 2020
f935d74
fix travis
jrxk Dec 30, 2020
e319fe9
update test
jrxk Dec 30, 2020
4c1b6fb
Merge branch 'master' into uda_example
jrxk Dec 30, 2020
147 changes: 147 additions & 0 deletions examples/data_augmentation/uda/README.md
@@ -0,0 +1,147 @@
## Unsupervised Data Augmentation for Text Classification

Unsupervised Data Augmentation (UDA) is a semi-supervised learning method that achieves state-of-the-art results on a wide variety of language and vision tasks. For details, please refer to the [paper](https://arxiv.org/abs/1904.12848) and the [official repository](https://github.com/google-research/uda).

In this example, we demonstrate Forte's implementation of UDA using a simple BERT-based text classifier.

## Quick Start

### Install the dependencies

You need to install [texar-pytorch](https://github.com/asyml/texar-pytorch) first.

### Get the IMDB data

We use the IMDB Text Classification dataset for this example. Use the following script to download the supervised and unsupervised training data.

```bash
python download_imdb.py
```

### Preprocess and generate augmented data

You can use the following script to process the data into CSV format.

```bash
python imdb_format.py
```

The next step is to generate augmented training data (using your favorite back-translation model) and write it to a TXT file. Each line in the file should correspond to the same line in `train.csv` (without headers).
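If you want to generate your own augmented data, here is a minimal back-translation sketch. It assumes the Hugging Face `transformers` package and the Helsinki-NLP MarianMT checkpoints (not necessarily the model used to produce the provided files); it round-trips each review through French and writes one augmented line per input line:

```python
from transformers import MarianMTModel, MarianTokenizer

def load(name):
    return MarianTokenizer.from_pretrained(name), MarianMTModel.from_pretrained(name)

en_fr_tok, en_fr = load("Helsinki-NLP/opus-mt-en-fr")
fr_en_tok, fr_en = load("Helsinki-NLP/opus-mt-fr-en")

def translate(texts, tok, model):
    batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
    return tok.batch_decode(model.generate(**batch), skip_special_tokens=True)

reviews = ["This movie was surprisingly good."]  # one entry per line of train.csv
augmented = translate(translate(reviews, en_fr_tok, en_fr), fr_en_tok, fr_en)
with open("data/IMDB/bt.txt", "w") as f:  # hypothetical output path
    f.write("\n".join(augmented) + "\n")
```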

For demonstration purposes, we provide the processed and augmented [data files](https://drive.google.com/file/d/1OKrbS76mbGCIz3FcFQ8-qPpMTQkQy8bP/view?usp=sharing). Place the CSV and TXT files in the `data/IMDB` directory.

### Train

To train the baseline model without UDA:

```bash
python main.py --do-train --do-eval --do-test
```

To train with UDA:

```bash
python main.py --do-train --do-eval --do-test --use-uda
```

To change the hyperparameters, please see `config_data.py`. You can also change the number of labeled examples used for training (`num_train_data`).

#### GPU Memory

According to the authors' [guideline for setting hyperparameters](https://github.com/google-research/uda#general-guidelines-for-setting-hyperparameters), a longer sequence length and a larger batch size lead to better performance, but both are limited by GPU memory. By default, we use `max_seq_length=128` and `batch_size=24`, which runs on a GTX 1080 Ti with 11GB of memory.

## Results

With the provided data, you should be able to achieve performance similar to the following:

| Number of Labeled Examples | BERT Accuracy | BERT+UDA Accuracy |
| -------------------------- | ------------- | ----------------- |
| 24                         | 61.54         | 84.92             |
| 25000                      | 89.68         | 90.19             |

When training with 24 labeled examples, we use the Training Signal Annealing (TSA) technique, which can be turned on by setting `tsa=True` in `config_data.py`.

You can further improve performance by tuning the hyperparameters, generating better back-translation data, using a larger BERT model, increasing `max_seq_length`, etc.

## Using the UDAIterator

Here is a brief tutorial on using Forte's `UDAIterator`. You can also refer to the `run_uda` function in `main.py`.

### Initialization

First, we initialize the `UDAIterator` with the supervised and unsupervised data:

```python
import texar.torch as tx  # texar-pytorch
# UDAIterator is provided by Forte; see main.py for the exact import path.
# train_dataset, eval_dataset, and unsup_dataset are texar datasets built
# from the hparams in config_data.py.

iterator = tx.data.DataIterator(
    {"train": train_dataset, "eval": eval_dataset}
)

unsup_iterator = tx.data.DataIterator(
    {"unsup": unsup_dataset}
)

uda_iterator = UDAIterator(
    iterator,
    unsup_iterator,
    softmax_temperature=1.0,
    confidence_threshold=-1,
    reduction="mean")
```
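Here, following the UDA paper's usage, `softmax_temperature` sharpens the target distribution computed from the original input, `confidence_threshold` drops unsupervised examples whose predicted probability falls below the threshold (`-1` disables the filter), and `reduction` controls how per-example losses are aggregated; see the sketch of the loss computation below.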

The next step is to tell the iterator which dataset to use, and initialize the internal iterators:

```python
uda_iterator.switch_to_dataset_unsup("unsup")
uda_iterator.switch_to_dataset("train", use_unsup=True)
uda_iterator = iter(uda_iterator)  # call iter() to initialize the internal iterators
```
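With `use_unsup=True`, each step of the iterator then yields a supervised batch paired with an unsupervised batch (a sketch, inferred from the tuple form used in the evaluation loop below):

```python
for batch, unsup_batch in uda_iterator:
    ...  # batch feeds the supervised loss; unsup_batch feeds the UDA loss
```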

### Training with UDA

The UDA loss is the KL divergence between the output probabilities of the original input and the augmented input. Here, we define `unsup_forward_fn` to calculate the probabilities:

```python
def unsup_forward_fn(batch):
    input_ids = batch["input_ids"]
    segment_ids = batch["segment_ids"]
    input_length = (1 - (input_ids == 0).int()).sum(dim=1)

    aug_input_ids = batch["aug_input_ids"]
    aug_segment_ids = batch["aug_segment_ids"]
    aug_input_length = (1 - (aug_input_ids == 0).int()).sum(dim=1)

    logits, _ = model(input_ids, input_length, segment_ids)
    logits = logits.detach()  # gradient does not propagate back to the original input
    aug_logits, _ = model(aug_input_ids, aug_input_length, aug_segment_ids)
    return logits, aug_logits
```

Then, `UDAIterator.calculate_uda_loss` computes the UDA loss for us. Inside the training loop, we compute the supervised loss as usual (or with a TSA schedule), and add the unsupervised loss to produce the final loss:

```python
# ...
# inside the training loop:
# supervised loss (with the TSA schedule)
logits, _ = model(input_ids, input_length, segment_ids)
loss = _compute_loss_tsa(logits, labels, scheduler.last_epoch,
                         num_train_steps)
# unsupervised (UDA) loss
unsup_logits, unsup_aug_logits = unsup_forward_fn(unsup_batch)
unsup_loss = uda_iterator.calculate_uda_loss(unsup_logits, unsup_aug_logits)

loss = loss + unsup_loss  # unsup coefficient = 1
loss.backward()
# ...
```
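For intuition, here is a minimal sketch of what `calculate_uda_loss` computes, based on the constructor arguments above (softmax temperature, confidence threshold, mean reduction); treat it as an illustration rather than Forte's exact implementation:

```python
import torch.nn.functional as F

def uda_loss_sketch(logits, aug_logits, temperature=1.0,
                    confidence_threshold=-1, reduction="mean"):
    # Sharpened target distribution from the original (detached) logits.
    probs = F.softmax(logits / temperature, dim=-1)
    aug_log_probs = F.log_softmax(aug_logits, dim=-1)
    # Per-example KL(target || augmented prediction).
    per_example = F.kl_div(aug_log_probs, probs, reduction="none").sum(dim=-1)
    if confidence_threshold > 0:
        # Keep only examples the model is already confident about.
        mask = (probs.max(dim=-1).values > confidence_threshold).float()
        per_example = per_example * mask
    return per_example.mean() if reduction == "mean" else per_example.sum()
```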

You can read more about the TSA schedule in the UDA paper.
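As a quick reference, the paper's TSA schedules release training signal gradually: at step t of T, supervised examples whose correct-class probability already exceeds a threshold eta(t) = alpha(t) * (1 - 1/K) + 1/K (for K classes) are masked out of the loss. Below is a sketch of the threshold, assuming the schedule names from `config_data.py` (`_compute_loss_tsa` in `main.py` is the actual implementation):

```python
import math

def tsa_threshold(schedule, step, total_steps, num_classes):
    # alpha(t) grows from ~0 to 1 over training.
    t = step / total_steps
    if schedule == "linear_schedule":
        alpha = t
    elif schedule == "exp_schedule":
        alpha = math.exp((t - 1) * 5)
    elif schedule == "log_schedule":
        alpha = 1 - math.exp(-t * 5)
    else:
        raise ValueError(f"unknown schedule: {schedule}")
    # Threshold grows from 1/K to 1; predictions above it are masked out.
    return alpha * (1 - 1 / num_classes) + 1 / num_classes
```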

### Evaluation

For evaluation, we simply switch to the eval dataset. In the `for` loop we only need the supervised batch:

```python
uda_iterator.switch_to_dataset("eval", use_unsup=False)
for batch, _ in uda_iterator:
    ...  # do evaluation
```
13 changes: 13 additions & 0 deletions examples/data_augmentation/uda/__init__.py
@@ -0,0 +1,13 @@
# Copyright 2020 The Forte Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
11 changes: 11 additions & 0 deletions examples/data_augmentation/uda/config_classifier.py
@@ -0,0 +1,11 @@
name = "bert_classifier"
hidden_size = 768
clas_strategy = "cls_time"
dropout = 0.1
num_classes = 2

# These hyperparameters are used in the bert_with_hypertuning_main.py example
hyperparams = {
    "optimizer.warmup_steps": {"start": 10000, "end": 20000, "dtype": int},
    "optimizer.static_lr": {"start": 1e-3, "end": 1e-2, "dtype": float}
}
77 changes: 77 additions & 0 deletions examples/data_augmentation/uda/config_data.py
@@ -0,0 +1,77 @@
pickle_data_dir = "data/IMDB"
unsup_bt_file = "data/IMDB/para_0.txt"
max_seq_length = 128
num_classes = 2
num_train_data = 24  # number of supervised training examples to use; max 25000

train_batch_size = 24
max_train_epoch = 3000
display_steps = 50 # Print training loss every display_steps; -1 to disable

eval_steps = 100  # Eval every eval_steps; -1 to eval once per epoch
# Proportion of training to perform linear learning rate warmup for.
# E.g., 0.1 = 10% of training.
warmup_proportion = 0.1
eval_batch_size = 8
test_batch_size = 8

feature_types = {
    # Features read from the pickled data files.
    # E.g., "input_ids" is read as dtype `int64`; "stacked_tensor" means the
    # feature is stacked into a fixed-length tensor, with sequence length
    # limited by `max_seq_length`.
    "input_ids": ["int64", "stacked_tensor", max_seq_length],
    "input_mask": ["int64", "stacked_tensor", max_seq_length],
    "segment_ids": ["int64", "stacked_tensor", max_seq_length],
    "label_ids": ["int64", "stacked_tensor"]
}

train_hparam = {
    "allow_smaller_final_batch": False,
    "batch_size": train_batch_size,
    "dataset": {
        "data_name": "data",
        "feature_types": feature_types,
        "files": "{}/train.pkl".format(pickle_data_dir)
    },
    "shuffle": True,
    "shuffle_buffer_size": None
}

eval_hparam = {
    "allow_smaller_final_batch": True,
    "batch_size": eval_batch_size,
    "dataset": {
        "data_name": "data",
        "feature_types": feature_types,
        "files": "{}/eval.pkl".format(pickle_data_dir)
    },
    "shuffle": False
}

# UDA config
tsa = True
tsa_schedule = "linear_schedule" # linear_schedule, exp_schedule, log_schedule

unsup_feature_types = {
    "input_ids": ["int64", "stacked_tensor", max_seq_length],
    "input_mask": ["int64", "stacked_tensor", max_seq_length],
    "segment_ids": ["int64", "stacked_tensor", max_seq_length],
    "label_ids": ["int64", "stacked_tensor"],
    "aug_input_ids": ["int64", "stacked_tensor", max_seq_length],
    "aug_input_mask": ["int64", "stacked_tensor", max_seq_length],
    "aug_segment_ids": ["int64", "stacked_tensor", max_seq_length],
    "aug_label_ids": ["int64", "stacked_tensor"]
}

unsup_hparam = {
    "allow_smaller_final_batch": True,
    "batch_size": train_batch_size,
    "dataset": {
        "data_name": "data",
        "feature_types": unsup_feature_types,
        "files": "{}/unsup.pkl".format(pickle_data_dir)
    },
    "shuffle": True,
    "shuffle_buffer_size": None,
}
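A note on how these dictionaries are consumed: they follow texar-pytorch's `RecordData` hparams format, so loading them looks roughly like the sketch below (`main.py` contains the actual loading code):

```python
import torch
import texar.torch as tx

import config_data

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
train_dataset = tx.data.RecordData(hparams=config_data.train_hparam, device=device)
eval_dataset = tx.data.RecordData(hparams=config_data.eval_hparam, device=device)
unsup_dataset = tx.data.RecordData(hparams=config_data.unsup_hparam, device=device)
```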
31 changes: 31 additions & 0 deletions examples/data_augmentation/uda/download_imdb.py
@@ -0,0 +1,31 @@
# Copyright 2020 The Forte Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


"""
Download IMDB dataset.
"""
from forte.data.data_utils import maybe_download


def main():
    download_path = "data/IMDB_raw"
    maybe_download(
        urls=["https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"],
        path=download_path,
        extract=True)


if __name__ == '__main__':
    main()