NYCU-MLLab/Augmentation-Strategy-Optimization-for-Language-Understanding

Augmentation Strategy Optimization for Language Understanding 🐗

Generating adversarial examples by stacking multiple augmentation methods automatically.

About · Setup · Main Usage · Other Usage · Design

About

Augmentation Strategy Optimization for Language Understanding is a Python framework for adversarial attacks, data augmentation, and model training in NLP. Stacked data augmentation (SDA) is a Python framework for stacking different augmentation methods automatically with reinforcement learning.

Setup

Installation for Code

You should be running Python 3.6.13 to use this package. A CUDA-compatible GPU is optional but will greatly improve code speed.

SDA can be installed directly from GitHub. For a basic install, run:

git clone https://github.com/BigPigKing/Adversarial_Data_Boost.git
cd Adversarial_Data_Boost
pip3 install -r requirements.txt
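After installing the requirements, a quick sanity check of the interpreter can save debugging later. A minimal sketch that only verifies the Python version, not the installed packages:

```shell
# Confirm the interpreter version before training (SDA targets Python 3.6.13,
# so anything below 3.6 will not work).
python3 - <<'EOF'
import sys
assert sys.version_info >= (3, 6), "Python 3.6+ is required"
print("python version OK:", sys.version.split()[0])
EOF
```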

Installation for Preliminary Directories

rsync -avh -e 'ssh -p 12030' [email protected]:/home/god/lab/Adversarial_Data_Boost/data .  # Installation for dataset
rsync -avh -e 'ssh -p 12030' [email protected]:/home/god/lab/Intergration/model_record .  # Installation for model recording dataset

Main Usage

The procedure for training with SDA from scratch can be divided into six steps.

1. Selection of Target Dataset

In SDA, six datasets are available for testing: SST-2, SST-5, MPQA, TREC-6, CR, and SUBJ.

To start with a specific dataset, load the corresponding model config JSON file provided in the model_configs directory.

rm model_config.json
cp model_configs/sst2_model_config.json model_config.json
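The same two commands generalize to any of the six datasets. A hedged sketch of a switch helper, assuming every config follows the `<name>_model_config.json` naming of the SST-2 example; the `mkdir`/`echo` lines only build a mock layout so the snippet runs standalone:

```shell
# Switch the active dataset config in one step (mock layout for illustration).
set -eu
dataset="trec6"                  # one of: sst2, sst5, mpqa, trec6, cr, subj
mkdir -p model_configs
echo "{\"dataset\": \"$dataset\"}" > "model_configs/${dataset}_model_config.json"
rm -f model_config.json
cp "model_configs/${dataset}_model_config.json" model_config.json
echo "active config: $dataset"
```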

2. Training the Baseline Text-model from Scratch

To conduct adversarial training, we first need to train a text model from scratch on clean data.

python3 sst_complete.py

Wait until the training finishes. If the process proceeds to the REINFORCE training stage, press Ctrl-C to cancel it; otherwise the generator and the discriminator will interact for one iteration.

3. Record the Baseline Model for Different Hyperparameter Settings

It is essential to retain the parameters of the original clean model so that comparisons can be conducted easily.

cd model_record
cp -r text_model_weights test_bed  # Retain the original clean model
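To keep one baseline per hyperparameter setting, the copy can be given a distinguishing name. A sketch assuming the `text_model_weights` layout above; the `mkdir` only builds a mock directory so the snippet runs standalone:

```shell
# Retain each baseline under a dated snapshot name (mock directory for illustration).
set -eu
mkdir -p text_model_weights
snapshot="test_bed_$(date +%Y%m%d)"
cp -r text_model_weights "$snapshot"
echo "baseline retained in $snapshot"
```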

4. Edit model_config.json and Change the Selected Model to 1

vim model_config.json

After the edit, the selected model field in model_config.json should be set to 1.
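The same change can be scripted instead of edited by hand. A sketch assuming the flag is a top-level JSON key; the key name `select_model` and the mock config below are assumptions, so check the real model_config.json:

```shell
# Flip the selected-model flag in place ("select_model" is an assumed key name;
# the echo line only creates a mock config so the snippet runs standalone).
echo '{"select_model": 0}' > model_config.json
python3 - <<'EOF'
import json
with open("model_config.json") as f:
    cfg = json.load(f)
cfg["select_model"] = 1      # 1 selects the recorded baseline model
with open("model_config.json", "w") as f:
    json.dump(cfg, f, indent=4)
EOF
```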

5. Run run.sh to Attack and Defend Multiple Times

./run.sh

The number of rounds can be chosen by changing the bound passed to seq in run.sh (30 in the figure).
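The loop inside run.sh presumably has the following shape, with the bound given to seq setting how many attack/defense rounds execute; the per-round training call is commented out and illustrative:

```shell
# Illustrative shape of the run.sh loop.
for i in $(seq 3); do            # change 3 to 30 for the full experiment
  echo "round $i: attack + defense"
  # python3 sst_complete.py      # the actual training call per round
done
```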

6. Clean the Log Files and Restore the Original Model

Once the experiments are finished, run ./clean.sh to restore the original text model, which was trained only on the clean dataset without adversarial training.

./clean.sh

Other Usage

SDA also supports several other functions, including Visualization, TextAttack, TextAugment, and ModelLoading.

Visualization

One can monitor the adversarial training process with TensorBoard:

tensorboard --logdir=runs --samples_per_plugin=text=100

Or check log.txt for more detailed information:

cat log.txt

TextAttack

SDA also provides training with other adversarial training (AT) methods by leveraging the TextAttack module:

pip3 install textattack  # should already be installed from the previous steps
./attack.sh

Running attack.sh automatically runs three different AT methods: DeepWordBug (DWB), PWWS, and TextBugger.

If you want to attack a different target model, change the model card passed to --target-model.

Details of each AT method are provided below:

Attacks on classification tasks, like sentiment classification and entailment:

| Attack Recipe Name | Goal Function | Constraints Enforced | Transformation | Search Method | Main Idea |
|---|---|---|---|---|---|
| `bae` | Untargeted Classification | USE sentence encoding cosine similarity | BERT Masked Token Prediction | Greedy-WIR | BERT masked language model transformation attack (["BAE: BERT-based Adversarial Examples for Text Classification" (Garg & Ramakrishnan, 2019)](https://arxiv.org/abs/2004.01970)) |
| `deepwordbug` | {Untargeted, Targeted} Classification | Levenshtein edit distance | {Character Insertion, Character Deletion, Neighboring Character Swap, Character Substitution} | Greedy-WIR | Greedy replace-1 scoring and multi-transformation character-swap attack (["Black-box Generation of Adversarial Text Sequences to Evade Deep Learning Classifiers" (Gao et al., 2018)](https://arxiv.org/abs/1801.04354)) |
| `fast-alzantot` | Untargeted {Classification, Entailment} | Percentage of words perturbed, language model perplexity, word embedding distance | Counter-fitted word embedding swap | Genetic Algorithm | Modified, faster version of the Alzantot et al. genetic algorithm (["Certified Robustness to Adversarial Word Substitutions" (Jia et al., 2019)](https://arxiv.org/abs/1909.00986)) |
| `hotflip` (word swap) | Untargeted Classification | Word embedding cosine similarity, part-of-speech match, number of words perturbed | Gradient-Based Word Swap | Beam search | (["HotFlip: White-Box Adversarial Examples for Text Classification" (Ebrahimi et al., 2017)](https://arxiv.org/abs/1712.06751)) |
| `iga` | Untargeted {Classification, Entailment} | Percentage of words perturbed, word embedding distance | Counter-fitted word embedding swap | Genetic Algorithm | Improved genetic-algorithm-based word substitution (["Natural Language Adversarial Attacks and Defenses in Word Level" (Wang et al., 2019)](https://arxiv.org/abs/1909.06723)) |
| `input-reduction` | Input Reduction | | Word deletion | Greedy-WIR | Greedy attack with word importance ranking; reduces the input while maintaining the prediction (["Pathologies of Neural Models Make Interpretation Difficult" (Feng et al., 2018)](https://arxiv.org/pdf/1804.07781.pdf)) |
| `kuleshov` | Untargeted Classification | Thought vector encoding cosine similarity, language model similarity probability | Counter-fitted word embedding swap | Greedy word swap | (["Adversarial Examples for Natural Language Classification Problems" (Kuleshov et al., 2018)](https://openreview.net/pdf?id=r1QZ3zbAZ)) |
| `pso` | Untargeted Classification | | HowNet Word Swap | Particle Swarm Optimization | (["Word-level Textual Adversarial Attacking as Combinatorial Optimization" (Zang et al., 2020)](https://www.aclweb.org/anthology/2020.acl-main.540/)) |
| `pwws` | Untargeted Classification | | WordNet-based synonym swap | Greedy-WIR (saliency) | Greedy attack with word importance ranking based on word saliency and synonym swap scores (["Generating Natural Language Adversarial Examples through Probability Weighted Word Saliency" (Ren et al., 2019)](https://www.aclweb.org/anthology/P19-1103/)) |
| `textbugger` (black-box) | Untargeted Classification | USE sentence encoding cosine similarity | {Character Insertion, Character Deletion, Neighboring Character Swap, Character Substitution} | Greedy-WIR | (["TextBugger: Generating Adversarial Text Against Real-world Applications" (Li et al., 2018)](https://arxiv.org/abs/1812.05271)) |
| `textfooler` | Untargeted {Classification, Entailment} | Word embedding distance, part-of-speech match, USE sentence encoding cosine similarity | Counter-fitted word embedding swap | Greedy-WIR | Greedy attack with word importance ranking (["Is BERT Really Robust?" (Jin et al., 2019)](https://arxiv.org/abs/1907.11932)) |
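attack.sh presumably loops one TextAttack invocation per recipe. The sketch below only echoes the commands so nothing heavy runs; the model card and example count are placeholders, not values taken from attack.sh:

```shell
# Echo the per-recipe commands that a script like attack.sh would run.
for recipe in deepwordbug pwws textbugger; do
  echo "textattack attack --recipe $recipe --target-model <model-card> --num-examples 100"
done
```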

TextAugment

SDA also provides a function to augment the target dataset automatically using ten different augmentation methods: SEDA, EDA, Word Embedding, CLARE, CheckList, CharSwap, BackTranslation (De, Zh, Ru), and Spelling.

./make_noisy_to_all.sh
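make_noisy_to_all.sh presumably applies make_noisy.sh once per augmentation method. A sketch of that pattern; the method identifiers are illustrative spellings of the list above, not the script's real arguments, and the call itself is commented out:

```shell
# Iterate one augmentation pass per method (identifiers are illustrative).
for method in seda eda embedding clare checklist charswap \
              backtrans_de backtrans_zh backtrans_ru spelling; do
  echo "augmenting with: $method"
  # ./make_noisy.sh "$method"
done
```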

If you want to change specific hyperparameters for the different augmentation methods, edit make_noisy.sh:

vim make_noisy.sh

and change the hyperparameters to your preferred values.

ModelLoading

SDA stores all model weights in model_record.

The weights of the different AT methods are stored in outputs.

All details can be accessed from those two directories.
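A quick way to inventory every stored checkpoint across both locations; the `.pt` pattern and the touched files are assumptions standing in for real weights so the snippet runs standalone:

```shell
# List all checkpoints under model_record and outputs (mock files for illustration).
mkdir -p model_record outputs
touch model_record/baseline.pt outputs/pwws.pt
find model_record outputs -name '*.pt' | sort
```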

Citing SDA

If you use Augmentation Strategy Optimization for Language Understanding in your research, please cite this repository.
