Here we demonstrate how to use the HEC on a text classification task by reproducing results from the paper GoEmotions: A Dataset of Fine-Grained Emotions. The main results we reproduce are those from table 4, where the task is: given a text (a Reddit comment), predict the emotion(s) expressed within it. This is a multi-label classification task, so more than one emotion can be assigned to a comment. We use the same model as the paper, a fine-tuned BERT-base model.
This dataset has 27 categories (see appendix A in the paper for details on those 27 labels). There is also a `neutral` category, which is used when no emotion is expressed within the text; however, from inspecting the dataset, the `neutral` category can co-occur with other emotions for a given text. The authors state in the paper that Reddit has a bias towards young male users and towards toxic and offensive language, so they perform various data filtering techniques, described in section 3.1 of the paper. They also masked people's names and religion terms with `[NAME]` and `[RELIGION]` tokens respectively.
To get the main dataset that was used for training, developing, and testing the model, run the following script, which will create the directory `./data` and add `train.tsv`, `dev.tsv`, and `test.tsv` to that directory:
python download_data.py ./data
Each TSV contains the following columns:
- text
- comma-separated list of emotion ids (the ids are indexed based on the order of emotions in `./data/emotions.txt`, e.g. id 27 = `neutral` and id 0 = `admiration`)
- id of the comment
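As an illustration of the format, the following sketch (not one of the repository scripts) shows one way the TSV files could be loaded with pandas, assuming the column layout above:

```python
import pandas as pd

# The TSV files have no header row; the columns are: text, comma-separated
# emotion ids, and comment id. quoting=3 (csv.QUOTE_NONE) is used because the
# comments themselves may contain quote characters.
train = pd.read_csv("./data/train.tsv", sep="\t", header=None, quoting=3,
                    names=["text", "emotion_ids", "comment_id"])

# Map each id to its emotion name using the ordering in ./data/emotions.txt,
# e.g. id 27 maps to neutral and id 0 to admiration.
with open("./data/emotions.txt", "r") as emotion_file:
    emotions = [line.strip() for line in emotion_file]

train["emotion_names"] = train["emotion_ids"].astype(str).apply(
    lambda ids: [emotions[int(i)] for i in ids.split(",")]
)
print(train.head())
```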
Running the script below prints the following per-label statistics for each split (train, development, and test), separated by commas:
- label name
- label raw count
- percentage of texts that contain this label
- percentage of times it occurs with another label
python data_analysis.py
The number of examples per split matches those stated in the original paper and in their GitHub repository:
- Train - 43,410 samples
- Development - 5,426 samples
- Test - 5,427 samples
Furthermore, what I find interesting is that across all three splits, at least 9.8% of the samples assigned the neutral label are also assigned at least one other label. However, relative to how often each label occurs, the neutral label is the most likely of all labels to be the only label assigned to a sample: some labels occur very frequently, but when normalised by their label counts they are less likely to be the sole label of a sample.
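For reference, these per-label statistics could be computed along the following lines; this is a rough sketch of the idea rather than the actual `data_analysis.py` script:

```python
from collections import Counter

import pandas as pd

def label_statistics(tsv_path: str, emotions_path: str) -> pd.DataFrame:
    """Per-label raw count, percentage of texts containing the label, and
    percentage of times the label co-occurs with at least one other label."""
    with open(emotions_path, "r") as emotion_file:
        emotions = [line.strip() for line in emotion_file]

    data = pd.read_csv(tsv_path, sep="\t", header=None, quoting=3,
                       names=["text", "emotion_ids", "comment_id"])
    label_lists = data["emotion_ids"].astype(str).apply(
        lambda ids: [int(i) for i in ids.split(",")]
    )

    counts = Counter()
    co_occurrence = Counter()
    for labels in label_lists:
        for label in labels:
            counts[label] += 1
            if len(labels) > 1:
                co_occurrence[label] += 1

    rows = []
    for label_id, name in enumerate(emotions):
        count = counts[label_id]
        rows.append({
            "label": name,
            "count": count,
            "% of texts": 100 * count / len(data),
            "% with another label": (100 * co_occurrence[label_id] / count) if count else 0.0,
        })
    return pd.DataFrame(rows)

print(label_statistics("./data/train.tsv", "./data/emotions.txt"))
```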
This section explains the model that was used in the GoEmotions: A Dataset of Fine-Grained Emotions paper.
The BERT-base model is a 12-layer transformer model with 12 attention heads per layer and a 768-dimensional hidden layer. As we are fine-tuning it for multi-label classification, the final layer uses the classification token (`[CLS]` in figure 1) representation as input to 28 separate linear layers (one for each of the classes, including the neutral class). Each of the 28 linear layer outputs is passed through a sigmoid activation function to predict whether the text contains the associated emotion label. This model setup is the same as in the GoEmotions paper, and we use the same loss function, the sigmoid cross entropy loss (`BCEWithLogitsLoss` in PyTorch).
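A minimal sketch of this setup using PyTorch and the HuggingFace `transformers` library is shown below; the class name and structure are illustrative rather than the repository's exact code, and the 28 single-output linear layers are mathematically equivalent to one `nn.Linear(hidden_size, 28)`:

```python
import torch
from torch import nn
from transformers import AutoModel

class EmotionClassifier(nn.Module):
    """BERT encoder with one single-output linear head per emotion label."""

    def __init__(self, model_name: str = "bert-base-uncased", num_labels: int = 28):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden_size = self.encoder.config.hidden_size  # 768 for BERT-base
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_size, 1) for _ in range(num_labels)]
        )

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Representation of the [CLS] token (the first token of the sequence).
        cls_representation = outputs.last_hidden_state[:, 0, :]
        # One logit per label; the sigmoid is applied inside BCEWithLogitsLoss.
        return torch.cat([head(cls_representation) for head in self.heads], dim=-1)

loss_function = nn.BCEWithLogitsLoss()  # sigmoid cross entropy over the 28 labels
```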
In the original paper they used the following hyper-parameters for the BERT model:
- batch size = 16
- learning rate = 5e-5
- training for 4 epochs.
- download_transformer_model.py -- Downloads a transformer model from the HuggingFace library. It is best to run this before any scripts that require a transformer model, as those scripts would otherwise have to download the model themselves, which wastes time on a GPU node, whereas the download can be done on a small CPU node. By default it downloads the `bert-base-uncased` transformer model, but it can also download other transformer models, such as RoBERTa (see the HuggingFace documentation for a full list of pre-trained transformer names):
python download_transformer_model.py --transformer-model roberta-base
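Conceptually, all such a script needs to do is call `from_pretrained` once so the model and tokenizer end up in the local HuggingFace cache; a minimal sketch (not the repository script itself) would be:

```python
import argparse

from transformers import AutoModel, AutoTokenizer

# Downloading the model and tokenizer once on a CPU node populates the local
# HuggingFace cache, so later GPU jobs load them from disk instead of the network.
parser = argparse.ArgumentParser(description="Cache a pre-trained transformer model.")
parser.add_argument("--transformer-model", default="bert-base-uncased")
args = parser.parse_args()

AutoTokenizer.from_pretrained(args.transformer_model)
AutoModel.from_pretrained(args.transformer_model)
```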
Furthermore, when running this script on the HEC, the transformer models are by default saved to your `Home` directory, which on the HEC can only store 10GB. You can change this by setting a different file path in the `TRANSFORMERS_CACHE` environment variable, e.g. setting `TRANSFORMERS_CACHE` to `$global_storage/huggingface_transformer_models_cache`:
TRANSFORMERS_CACHE=$global_storage/huggingface_transformer_models_cache
export TRANSFORMERS_CACHE
For more details on HuggingFace model caching, see the caching models section of the HuggingFace transformers documentation.
- download_data.py -- Downloads the data to a given directory, e.g. in this case it will download the train, dev, test, and emotion labels to the directory `./data`, whereby they will be called `./data/train.tsv`, `./data/dev.tsv`, `./data/test.tsv`, and `./data/emotions.txt` respectively:
python download_data.py ./data
- bert_model.py -- Trains a transformer model for one epoch. It takes the following positional arguments:
- Path to training data
- Path to development data
- Path to test data
- Path to the emotion labels
- Path to save the trained model
It also takes the following optional arguments:
- `--cuda` -- whether to use CUDA or not, e.g. pass `--cuda` to use the GPU.
- `--learning-rate` -- specify a learning rate other than the default `5e-5`.
- `--batch-size` -- specify a batch size other than the default `16`.
- `--transformer-model` -- use a transformer model other than the default English `bert-base-uncased` (see the HuggingFace documentation for a full list of pre-trained transformer names that can be used).
Example of how to use this script:
python ./bert_model.py ./data/train.tsv ./data/dev.tsv ./data/test.tsv ./data/emotions.txt ./model/saved_model.pt --cuda --batch-size 16
This specifies that we want to use the GPU (`--cuda`) and a batch size of 16.
NOTE: `bert_model.py` is a work in progress in that it only trains the model for one epoch on the given data and prints out how long it took to train for that epoch, how long it took to save the trained model, the loss every 100 batches, and the average loss across the whole epoch. On my home machine, using the English `bert-base-uncased` transformer model, it took 20 minutes to run on a 6GB NVIDIA 1060 GPU with a batch size of 4.
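For a rough idea of what a one-epoch training loop with this kind of logging looks like, here is a sketch; it assumes a model like the `EmotionClassifier` sketch above and a `DataLoader` yielding batches with `input_ids`, `attention_mask`, and multi-hot `labels` keys, which may differ from how `bert_model.py` is actually implemented:

```python
import time

import torch
from torch import nn

def train_one_epoch(model, data_loader, optimizer, device, log_every: int = 100):
    """Train for a single epoch, printing the loss every `log_every` batches
    and the average loss over the whole epoch."""
    loss_function = nn.BCEWithLogitsLoss()
    model.to(device)
    model.train()
    start_time = time.time()
    total_loss = 0.0

    for batch_index, batch in enumerate(data_loader, start=1):
        optimizer.zero_grad()
        logits = model(batch["input_ids"].to(device),
                       batch["attention_mask"].to(device))
        loss = loss_function(logits, batch["labels"].float().to(device))
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        if batch_index % log_every == 0:
            print(f"Batch {batch_index}: loss {loss.item():.4f}")

    print(f"Average loss over the epoch: {total_loss / batch_index:.4f}")
    print(f"Epoch took {time.time() - start_time:.1f} seconds")
```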
The following steps explain how to train a transformer model on the HEC using this emotions dataset; in this case we are going to fine-tune the English `roberta-base` transformer model. In all of the steps we make use of the scripts explained in the Explaining the scripts section above.
- Transfer this directory to your home directory on the HEC:
scp -r ../multi_label_classification_gpu/ [email protected]:./
- Login to the HEC, `ssh [email protected]`, and go to the text classification directory: `cd multi_label_classification_gpu`
- Create the Conda environment with the relevant Python dependencies. This can be done by submitting the ./install.com job, e.g. `qsub install.com`. This will create the Conda environment at `$global_storage/conda_environments/py3.7-text-classification`
- Download the transformer model, `roberta-base`, and the data we are going to train and test on. This is done now, before training the model, so we do not waste GPU node time downloading the model and data; instead we use a CPU node. `qsub download_model_data.com` will download the data to `$global_storage/data/go-emotions` and the model to `$global_storage/huggingface_transformer_models_cache`.
- Now we are going to train the model, so we need to switch to the GPU cluster: `switch-gpu`
- To train the model for one epoch: `qsub run_transformer_model.com`. This will save the fine-tuned `roberta-base` model to `$global_storage/models/go-emotions/roberta_base.pt`. This ran on the HEC in 157 seconds, which is a lot quicker than my home machine, which took 20 minutes.
- To find out how much GPU memory you used, you can cat the output file, e.g. `cat run-transformer-model.o3177`; in my case it used 16 GB (17728274432 bytes). You can find how much RAM was used through `qacct -j 3177` and then looking at the field `maxvmem`; in my case it used 4 GB.
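The GPU memory figure printed in the output file can be obtained from inside a PyTorch script with calls along these lines (one possible approach, not necessarily how `bert_model.py` reports it):

```python
import torch

if torch.cuda.is_available():
    # Peak memory allocated to tensors and peak memory reserved by the caching
    # allocator since the start of the program, in bytes.
    print(f"Max GPU memory allocated: {torch.cuda.max_memory_allocated()} bytes")
    print(f"Max GPU memory reserved:  {torch.cuda.max_memory_reserved()} bytes")
```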