done
sofieditmer committed Nov 22, 2021
1 parent d1c5181 commit 5ef48e1
Showing 14 changed files with 528 additions and 86 deletions.
91 changes: 62 additions & 29 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,49 +1,82 @@
# Assignment 3: Named Entity Recognition (NER)
## Using LSTMs and Word Embeddings to detect entities in unstructured texts

*see assignment_description.md for a description of the assignment*
<div align="center"><img src="https://github.com/Orlz/CDS_Visual_Analytics/blob/main/Portfolio/company.png" width="75" height="75"/></div>
# Summary
## Table of Contents
<!--
This should include a short description of which models you have tried and conclusions from comparing these models. This should be no longer than an abstract. This section can also include questions regarding the assignment.
-->
- [Description](#Project)
- [Results](#Results)
- [Reproducing the Scripts](#Operating)
- [Project Organization](#ProjectOrganization)
# Performance
<!--
This should include a table of performance metrics of different models. The performance metrics should at least include accuracy and F1-score.
-->
## Assignment Description <br>
In this assignment, a recurrent neural network in the form of a Long Short-Term Memory (LSTM) model was trained to identify named entities in the 'CoNLLPP' dataset using GloVe word embeddings loaded through gensim. The LSTM model was trained with early stopping: if the model has not improved within a specified number of epochs, the training is stopped and the model is saved. The CoNLLPP dataset is an improvement of the popular CoNLL-2003 dataset, which contains sentences collected from the Reuters Corpus. This corpus consists of Reuters news stories between August 1996 and August 1997. For the training and development set, ten days' worth of data were taken from the files representing the end of August 1996. For the test set, the texts were from December 1996. The preprocessed raw data covers the month of September 1996.
To evaluate the model, F1-score and accuracy score were computed. Moreover, three experiments were conducted:
1. Comparing the effect of the word embedding size
2. Comparing the effect of the size of the hidden layer in the LSTM model
3. Comparing the use of a bi-directional LSTM with a unidirectional LSTM
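
The early-stopping rule described above can be sketched as follows. This is a minimal illustration, not the repository's training loop: the class name `EarlyStopper` and the validation losses are hypothetical, and only the idea of a `patience` threshold is taken from the description.

```python
class EarlyStopper:
    """Signal a stop once the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int):
        self.patience = patience
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best_loss:
            # New best: remember it and reset the counter.
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


# Hypothetical validation losses, one per epoch.
val_losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.68, 0.64]
stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.should_stop(loss):
        stopped_at = epoch
        break

print(f"stopped at epoch {stopped_at}, best loss {stopper.best_loss}")
```

In a real training loop the model weights would typically be saved whenever a new best loss is recorded, so that the checkpoint kept is the best one rather than the last one.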

**What are LSTM Models?** <br>
LSTM models are a special type of recurrent neural network capable of learning order dependence in sequence prediction problems. This gives them the ability to learn long-term dependencies; in a bidirectional LSTM, information both earlier and later in the sentence can be used to inform a word's nature or importance. They are well suited to named entity tasks, where a deeper understanding of the contextual information is needed.

## Learning goals of the assignment
1. To work with recurrent layers using PyTorch
2. To understand the nature of named entity recognition tasks
3. To be able to implement early stopping and meaningful experiments which influence the performance of the model

## Results
The results of the LSTM model with default parameters (n_epochs = 30, lr = 0.01, hidden_dim = 30, patience = 10, optimizer = Adam, bidirectional = False, word_embedding_dim = 100-dim GloVe) and the experiments which include changing the dimensions of the word embeddings (exp. 1), changing the size of the hidden layer (exp. 2) and changing the LSTM to bidirectional (exp. 3) are reported below:

| | Default | Exp. 1: word embeddings = 300 dim | Exp. 2: Hidden layer dim = 100 | Exp. 3: Bidirectional = True |
|---------------|---------|-----------------------------------|--------------------------------|------------------------------|
| Accuracy | 0.82 | 0.82 | 0.82 | 0.87 |
| Macro avg. F1 | 0.12 | 0.13 | 0.13 | 0.32 |

Changing the word embedding dimensions or the hidden layer size leaves the results essentially unchanged. However, when we make the LSTM model bidirectional rather than unidirectional, we see notable improvements in both the accuracy and the F1-score. The macro-averaged F1-score remains low in every setting, indicating that the model is overfitting to a single class (i.e. label 0).
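
The gap between accuracy and macro-averaged F1 follows from how the two are computed: accuracy is dominated by the frequent label 0, while the macro average weights all nine labels equally. The sketch below recomputes the averages from the per-label F1-scores and supports listed in `mdl_results/211122_bidirectional_report.txt` in this commit; the variable names are illustrative.

```python
# Per-label F1-scores and supports (labels 0 to 8) from the bidirectional report.
f1_scores = [0.94, 0.59, 0.60, 0.41, 0.01, 0.30, 0.00, 0.00, 0.00]
supports = [38177, 1618, 1161, 1715, 882, 1646, 259, 723, 254]

# Macro average: every label counts equally, regardless of frequency.
macro_f1 = sum(f1_scores) / len(f1_scores)

# Weighted average: each label counts in proportion to its support.
weighted_f1 = sum(f * s for f, s in zip(f1_scores, supports)) / sum(supports)

# Fraction of test tokens carrying label 0.
majority_share = supports[0] / sum(supports)

print(f"macro F1:         {macro_f1:.2f}")
print(f"weighted F1:      {weighted_f1:.2f}")
print(f"share of label 0: {majority_share:.2f}")
```

Because label 0 covers roughly 82% of the test tokens, a model that always predicted label 0 would already reach about that accuracy, which is why the macro average is the more informative number here.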

## Reproducing the Scripts
1. If the user wishes to engage with the code and reproduce the obtained results, this section includes the necessary instructions to do so. First, the user will have to create their own version of the repository by cloning it from GitHub. This is done by executing the following from the command line:

```
$ git clone https://github.com/auNLP/a3-orla-johan-sofie-jan.git named-entity-recognition
```

2. Once the user has cloned the repository, a virtual environment must be set up in which the relevant dependencies can be installed. A bash script is provided for this; when executed, it creates the virtual environment and installs the dependencies listed in the ```requirements.txt``` file. To run it, the user must execute the following from the command line.

```
$ cd named-entity-recognition
$ bash create-venv.sh
```

3. Once the virtual environment has been set up and the dependencies listed in ```requirements.txt``` have been installed within it, the user can run the ```main.py``` script from the command line. To do so, the user must first activate the virtual environment, as follows.


```
$ source named-entity-venv/bin/activate
```

4. Once the virtual environment has been activated, the user can run the ```main.py``` script.

```
(named-entity-venv) $ python main.py --epochs 10 --gensim_embedding glove-wiki-gigaword-100
```

## Project Organization
The organization of the project is as follows:

<!--
Correct this to reflect changes
-->

```
├── LICENSE <- the license of this code
├── README.md <- The top-level README for this project.
├── .github
│ └── workflows <- workflows to automatically run when code is pushed
│ │ └── pytest.yml <- A workflow which runs pytests upon push
├── classification <- The main folder for scripts
├── mdl_results <- Model results
├── ner <- The main folder for scripts
│ ├── tests <- The pytest test suite
│ │ └── ...
│ └── ...
├── .gitignore <- A list of files not uploaded to git
├── requirement.txt <- A requirements file of the required packages.
├── requirements.txt <- A requirements file of the required packages.
└── assignment_description.md <- the assignment description
```



## Running the code
You can reproduce all the experiments by cloning the GitHub repository and running the following:

<!--
Update the code below such that it runs all the experiments in the performance section and print the performances.
-->

```
pip install -r requirements.txt
python ner/main.py --epochs 10 --gensim_embedding glove-wiki-gigaword-50
```
17 changes: 17 additions & 0 deletions create-venv.sh
@@ -0,0 +1,17 @@
#!/usr/bin/env bash

VENVNAME=named-entity-venv

python3 -m venv $VENVNAME
source $VENVNAME/bin/activate
pip install --upgrade pip

pip install ipython
pip install jupyter

python -m ipykernel install --user --name=$VENVNAME

test -f requirements.txt && pip install -r requirements.txt

deactivate
echo "build $VENVNAME"
39 changes: 39 additions & 0 deletions driver.sh
@@ -0,0 +1,39 @@
python3 ner/main.py \
-f "baseline" \
--batchsize 1024 \
--nepochs 30 \
--learningrate 0.1 \
--embeddings "glove-wiki-gigaword-100" \
--hiddenlayer 30 \
--patience 10 \
--bidirectional False

python3 ner/main.py \
-f "large_dim_embedding" \
--batchsize 1024 \
--nepochs 30 \
--learningrate 0.1 \
--embeddings "glove-wiki-gigaword-300" \
--hiddenlayer 30 \
--patience 10 \
--bidirectional False

python3 ner/main.py \
-f "large_hidden_layer" \
--batchsize 1024 \
--nepochs 30 \
--learningrate 0.1 \
--embeddings "glove-wiki-gigaword-100" \
--hiddenlayer 100 \
--patience 10 \
--bidirectional False

python3 ner/main.py \
-f "bidirectional" \
--batchsize 1024 \
--nepochs 30 \
--learningrate 0.1 \
--embeddings "glove-wiki-gigaword-100" \
--hiddenlayer 30 \
--patience 10 \
--bidirectional True
15 changes: 15 additions & 0 deletions mdl_results/211122_baseline_report.txt
@@ -0,0 +1,15 @@
precision recall f1-score support

0 0.89 0.99 0.94 38177
1 0.53 0.67 0.59 1618
2 0.77 0.49 0.60 1161
3 0.73 0.28 0.41 1715
4 1.00 0.00 0.01 882
5 0.79 0.19 0.30 1646
6 0.00 0.00 0.00 259
7 0.00 0.00 0.00 723
8 0.00 0.00 0.00 254

accuracy 0.87 46435
macro avg 0.52 0.29 0.32 46435
weighted avg 0.84 0.87 0.83 46435
15 changes: 15 additions & 0 deletions mdl_results/211122_bidirectional_report.txt
@@ -0,0 +1,15 @@
precision recall f1-score support

0 0.89 0.99 0.94 38177
1 0.53 0.67 0.59 1618
2 0.77 0.49 0.60 1161
3 0.73 0.28 0.41 1715
4 1.00 0.00 0.01 882
5 0.79 0.19 0.30 1646
6 0.00 0.00 0.00 259
7 0.00 0.00 0.00 723
8 0.00 0.00 0.00 254

accuracy 0.87 46435
macro avg 0.52 0.29 0.32 46435
weighted avg 0.84 0.87 0.83 46435
15 changes: 15 additions & 0 deletions mdl_results/211122_large_dim_embedding_report.txt
@@ -0,0 +1,15 @@
precision recall f1-score support

0 0.89 0.99 0.94 38177
1 0.61 0.63 0.62 1618
2 0.73 0.50 0.59 1161
3 0.65 0.38 0.48 1715
4 0.00 0.00 0.00 882
5 0.56 0.09 0.16 1646
6 0.00 0.00 0.00 259
7 0.00 0.00 0.00 723
8 0.00 0.00 0.00 254

accuracy 0.87 46435
macro avg 0.38 0.29 0.31 46435
weighted avg 0.81 0.87 0.83 46435
15 changes: 15 additions & 0 deletions mdl_results/211122_large_hidden_layer_report.txt
@@ -0,0 +1,15 @@
precision recall f1-score support

0 0.88 0.99 0.93 38177
1 0.53 0.46 0.49 1618
2 0.63 0.41 0.50 1161
3 0.65 0.38 0.48 1715
4 0.00 0.00 0.00 882
5 0.57 0.04 0.07 1646
6 0.00 0.00 0.00 259
7 0.67 0.00 0.01 723
8 0.00 0.00 0.00 254

accuracy 0.86 46435
macro avg 0.44 0.25 0.28 46435
weighted avg 0.81 0.86 0.82 46435
4 changes: 2 additions & 2 deletions ner/LSTM.py
@@ -8,7 +8,7 @@ class TokenLSTM(nn.Module):
An LSTM layer that takes in a sequence of tokens and returns a sequence of tags.
"""
def __init__(
self, output_dim: int, embedding_layer: nn.Embedding, hidden_dim_size: int
self, output_dim: int, embedding_layer: nn.Embedding, hidden_dim_size: int, bidirectional: bool
):
super().__init__()

@@ -18,7 +18,7 @@ def __init__(
self.embedding_size = embedding_layer.weight.shape[1]

# the LSTM takes an embedded sentence
self.lstm = nn.LSTM(self.embedding_size, hidden_dim_size, batch_first=True)
self.lstm = nn.LSTM(self.embedding_size, hidden_dim_size, batch_first=True, bidirectional=bidirectional)

# fc (fully connected) layer transforms the LSTM-output to give the final output layer
self.fc = nn.Linear(hidden_dim_size, output_dim)
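
One thing worth checking when passing `bidirectional=True` to `nn.LSTM`: PyTorch concatenates the forward and backward hidden states, so the last dimension of the LSTM output is `2 * hidden_dim_size`, and a following `nn.Linear` needs a matching input size. The sketch below illustrates this with a hypothetical class `TinyTagger` (not the repository's `TokenLSTM`); the dimensions and names are assumptions chosen for the example.

```python
import torch
from torch import nn


class TinyTagger(nn.Module):
    """Hypothetical tagger showing how `bidirectional` changes the LSTM output size."""

    def __init__(self, embedding_size: int, hidden_dim_size: int,
                 output_dim: int, bidirectional: bool):
        super().__init__()
        self.lstm = nn.LSTM(embedding_size, hidden_dim_size,
                            batch_first=True, bidirectional=bidirectional)
        # A bidirectional LSTM emits forward and backward states side by side.
        num_directions = 2 if bidirectional else 1
        self.fc = nn.Linear(hidden_dim_size * num_directions, output_dim)

    def forward(self, embedded: torch.Tensor) -> torch.Tensor:
        lstm_out, _ = self.lstm(embedded)  # (batch, seq_len, hidden * directions)
        return self.fc(lstm_out)           # (batch, seq_len, output_dim)


embedded = torch.randn(4, 12, 100)  # 4 sentences, 12 tokens, 100-dim embeddings
uni = TinyTagger(100, 30, 9, bidirectional=False)
bi = TinyTagger(100, 30, 9, bidirectional=True)
uni_shape = tuple(uni(embedded).shape)
bi_shape = tuple(bi(embedded).shape)
print(uni_shape, bi_shape)
```

Both variants produce one score vector per token; only the input size of the linear layer differs between them.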
46 changes: 36 additions & 10 deletions ner/data.py
@@ -1,19 +1,18 @@
"""
Contains function for loading, batching and converting data.
"""

import random
from itertools import islice
from typing import Iterable, List, Tuple

import numpy as np
import torch
from torch import nn

import datasets
from datasets.dataset_dict import DatasetDict

from gensim.models.keyedvectors import KeyedVectors
from torch import nn
import random


def load_data() -> DatasetDict:
"""Load the conllpp dataset.
@@ -46,15 +45,13 @@ def load_sst2() -> DatasetDict:
test_idx = [i for i, is_test in enumerate(bool_is_test) if is_test]
train_idx = [i for i, is_test in enumerate(bool_is_test) if not is_test]


# overwrite existing test and train set
dataset["test"] = dataset["train"].select(np.array(test_idx))
dataset["train"] = dataset["train"].select(np.array(train_idx))

return dataset



def batch(dataset: Iterable, batch_size: int) -> Iterable:
"""Creates batches from an iterable.
@@ -109,14 +106,43 @@ def gensim_to_torch_embedding(gensim_wv: KeyedVectors) -> Tuple[nn.Embedding, di

return emb_layer, vocab

def prepare_batch(tokens: List[List[str]], labels: List[List[int]]) -> Tuple[torch.Tensor, torch.Tensor]:

def tokens_to_idx(tokens, vocab):
"""
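Map each token to its vocabulary index, lowercasing it first.
Tokens missing from the vocabulary map to the index of "UNK".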
"""
# toks, batch_size = tokens
return [vocab.get(t.lower(), vocab["UNK"]) for t in tokens]


def data_to_tensor(
tokens: List[List[str]],
labels: List[List[int]],
vocab,
max_sentence_length
) -> Tuple[torch.Tensor, torch.Tensor]:
"""Prepare a batch of data for training.
Args:
tokens (List[List[str]]): A list of lists of tokens.
labels (List[List[int]]): A list of lists of labels.
Returns:
Tuple[torch.Tensor, torch.Tensor]: A tuple of tensors containing the tokens and labels.
"""
pass
Tuple[torch.Tensor, torch.Tensor]: A tuple of tensors containing the token ids and labels.
"""
n_docs = len(tokens)

batch_tok_idx = [tokens_to_idx(sent, vocab=vocab) for sent in tokens]

token_map = vocab["PAD"] * np.ones((n_docs, max_sentence_length))
label_map = -1 * np.ones((n_docs, max_sentence_length))

for i in range(n_docs):
tok_idx = batch_tok_idx[i]
tags = labels[i]
size = len(tok_idx)

token_map[i][:size] = tok_idx
label_map[i][:size] = tags

return torch.LongTensor(token_map), torch.LongTensor(label_map)
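
The padding scheme used in `data_to_tensor` can be tried in isolation. The sketch below uses a toy vocabulary and a hypothetical helper `pad_batch`; unlike the function above, it also truncates sentences longer than `max_sentence_length`, which is an added assumption rather than the commit's behavior.

```python
import numpy as np

# Toy vocabulary; the real one comes from the gensim embedding.
vocab = {"PAD": 0, "UNK": 1, "john": 2, "lives": 3, "in": 4, "london": 5}


def pad_batch(tokens, labels, vocab, max_sentence_length, pad_label=-1):
    """Pad (or truncate) token ids and labels to a fixed sentence length."""
    n_docs = len(tokens)
    token_map = np.full((n_docs, max_sentence_length), vocab["PAD"], dtype=np.int64)
    label_map = np.full((n_docs, max_sentence_length), pad_label, dtype=np.int64)
    for i, (sent, tags) in enumerate(zip(tokens, labels)):
        # Lowercase lookup with an UNK fallback, then cut to the maximum length.
        idx = [vocab.get(t.lower(), vocab["UNK"]) for t in sent][:max_sentence_length]
        token_map[i, :len(idx)] = idx
        label_map[i, :len(idx)] = tags[:len(idx)]
    return token_map, label_map


token_map, label_map = pad_batch(
    tokens=[["John", "lives", "in", "London"], ["Paris"]],
    labels=[[1, 0, 0, 5], [5]],
    vocab=vocab,
    max_sentence_length=5,
)
print(token_map)
print(label_map)
```

Padding the labels with `-1` is only useful if the loss skips those positions, for example via `nn.CrossEntropyLoss(ignore_index=-1)`; whether the repository's training code does so is not shown in this diff.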