diff --git a/README.md b/README.md index e2b2ae2..440aa68 100644 --- a/README.md +++ b/README.md @@ -1,49 +1,82 @@ +# Assignment 3: Named Entity Recognition (NER) +## Using LSTMs and Word Embeddings to detect entities in unstructured texts -*see assignment_description.md for a description of the assignment* +
+- [Assignment Description](#assignment-description) +- [Results](#results) +- [Reproducing the Scripts](#reproducing-the-scripts) +- [Project Organization](#project-organization) -# Performance - +## Assignment Description
+In this assignment, a recurrent neural network in the form of a Long Short-Term Memory (LSTM) model was trained to identify named entities in the 'CoNLLPP' dataset, using GloVe word embeddings loaded through gensim. The LSTM model was trained with early stopping: if the model has not improved within a specified number of epochs, training is stopped and the best model found so far is kept. The CoNLLPP dataset is an improvement of the popular CoNLL-2003 dataset, which contains sentences collected from the Reuters Corpus. This corpus consists of Reuters news stories published between August 1996 and August 1997. For the training and development sets, ten days' worth of data were taken from the files covering the end of August 1996. For the test set, the texts were taken from December 1996. The preprocessed raw data covers the month of September 1996. +To evaluate the model, the F1-score and accuracy score were computed. Moreover, three experiments were conducted: +1. Comparing the effect of the word embedding size +2. Comparing the effect of the size of the hidden layer in the LSTM model +3. Comparing the use of a bi-directional LSTM with a unidirectional LSTM + +**What are LSTM Models?**
+LSTM models are a special type of recurrent neural network capable of learning order dependence in sequence prediction problems. This gives them the ability to learn long-term dependencies, where information from both earlier and later in a sentence can be used to infer a word's nature or importance. They are especially well suited for named entity recognition tasks, where a deeper understanding of the contextual information is needed. + +## Learning goals of the assignment +1. To work with recurrent layers using PyTorch +2. To understand the nature of named entity recognition tasks +3. To be able to implement early stopping and meaningful experiments which influence the performance of the model + +## Results +The results of the LSTM model with default parameters (n_epochs = 30, lr = 0.01, hidden_dim = 30, patience = 10, optimizer = Adam, bidirectional = False, word_embedding_dim = 100-dim GloVe) and the experiments, which include changing the dimensions of the word embeddings (exp. 1), changing the size of the hidden layer (exp. 2) and changing the LSTM to bidirectional (exp. 3), are reported below: + +| | Default | Exp. 1: word embeddings = 300 dim | Exp. 2: Hidden layer dim = 100 | Exp. 3: Bidirectional = True | +|---------------|---------|-----------------------------------|--------------------------------|------------------------------| +| Accuracy | 0.82 | 0.82 | 0.82 | 0.87 | +| Macro avg. F1 | 0.12 | 0.13 | 0.13 | 0.32 | + +The experiments in which we change the word embedding dimensions and the hidden layer size barely change the results. However, when we make the LSTM model bidirectional rather than unidirectional, we see notable improvements in both the accuracy and the macro-averaged F1-score. The F1-score nevertheless remains very low for most labels, indicating that the model largely falls back on predicting the majority class (i.e. 0, the non-entity tag). + +## Reproducing the Scripts +1. To engage with the code and reproduce the obtained results, the user must first create a local copy of the repository by cloning it from GitHub. This is done by executing the following from the command line: + +``` +$ git clone https://github.com/auNLP/a3-orla-johan-sofie-jan.git named-entity-recognition +``` + +2. Once the repository has been cloned, a virtual environment must be set up in which the relevant dependencies can be installed. A bash script is provided which creates the virtual environment and installs the dependencies listed in the ```requirements.txt``` file. It is run by executing the following from the command line: + +``` +$ cd named-entity-recognition +$ bash create-venv.sh +``` + +3. Once the virtual environment has been set up and the dependencies listed in ```requirements.txt``` have been installed within it, the ```main.py``` script can be run from the command line. The virtual environment must first be activated, which is done as follows: + + +``` +$ source named-entity-venv/bin/activate +``` + +4. Once the virtual environment has been activated, the ```main.py``` script can be run, for example: 
+ +``` +(named-entity-venv) $ python ner/main.py -f "my_model" --nepochs 10 --embeddings glove-wiki-gigaword-100 +``` + +The four experiments reported above can be rerun with the ```driver.sh``` script. ## Project Organization The organization of the project is as follows: - - ``` ├── LICENSE <- the license of this code ├── README.md <- The top-level README for this project. ├── .github │ └── workflows <- workflows to automatically run when code is pushed │ │ └── pytest.yml <- A workflow which runs pytests upon push -├── classification <- The main folder for scripts +├── mdl_results <- Model results +├── ner <- The main folder for scripts │ ├── tests <- The pytest test suite │ │ └── ... | └── ... +├── create-venv.sh <- A bash script which sets up the virtual environment +├── driver.sh <- A bash script which reruns all four experiments ├── .gitignore <- A list of files not uploaded to git -├── requirement.txt <- A requirements file of the required packages. +├── requirements.txt <- A requirements file of the required packages. └── assignment_description.md <- the assignment description -``` - - - -## Running the code -You can run the reproduce all the experiments by cloning the GitHub repository and running the following: - - - -``` -pip install -r requirements.txt -python ner/main.py --epochs 10 --gensim_embedding glove-wiki-gigaword-50 ``` \ No newline at end of file diff --git a/create-venv.sh b/create-venv.sh new file mode 100644 index 0000000..e500307 --- /dev/null +++ b/create-venv.sh @@ -0,0 +1,17 @@ +#!/usr/bin/env bash + +VENVNAME=named-entity-venv + +python3 -m venv $VENVNAME +source $VENVNAME/bin/activate +pip install --upgrade pip + +pip install ipython +pip install jupyter + +python -m ipykernel install --user --name=$VENVNAME + +test -f requirements.txt && pip install -r requirements.txt + +deactivate +echo "built $VENVNAME" diff --git a/driver.sh b/driver.sh new file mode 100644 index 0000000..8c2af32 --- /dev/null +++ b/driver.sh @@ -0,0 +1,39 @@ +python3 ner/main.py \ + -f "baseline" \ + --batchsize 1024 \ + --nepochs 30 \ + --learningrate 0.1 \ + --embeddings "glove-wiki-gigaword-100" \ + --hiddenlayer 30 \ + --patience 10 \ + --bidirectional False + +python3 ner/main.py \ + -f "large_dim_embedding" \ + --batchsize 1024 \ + --nepochs 30 \ + --learningrate 0.1 \ + --embeddings "glove-wiki-gigaword-300" \ + --hiddenlayer 30 \ + --patience 10 \ + --bidirectional False + +python3 ner/main.py \ + -f "large_hidden_layer" \ + --batchsize 1024 \ + --nepochs 30 \ + --learningrate 0.1 \ + --embeddings "glove-wiki-gigaword-100" \ + --hiddenlayer 100 \ + --patience 10 \ + --bidirectional False + +python3 ner/main.py \ + -f "bidirectional" \ + --batchsize 1024 \ + --nepochs 30 \ + --learningrate 0.1 \ + --embeddings "glove-wiki-gigaword-100" \ + --hiddenlayer 30 \ + --patience 10 \ + --bidirectional True \ No newline at end of file diff --git a/mdl_results/211122_baseline_report.txt b/mdl_results/211122_baseline_report.txt new file mode 100644 index 0000000..fcc248c --- /dev/null +++ b/mdl_results/211122_baseline_report.txt @@ -0,0 +1,15 @@ + precision recall f1-score support + + 0 0.89 0.99 0.94 38177 + 1 0.53 0.67 0.59 1618 + 2 0.77 0.49 0.60 1161 + 3 0.73 0.28 0.41 1715 + 4 1.00 0.00 0.01 882 + 5 0.79 0.19 0.30 1646 + 6 0.00 0.00 0.00 259 + 7 0.00 0.00 0.00 723 + 8 0.00 0.00 0.00 254 + + accuracy 0.87 46435 + macro avg 0.52 0.29 0.32 46435 +weighted avg 0.84 0.87 0.83 46435 diff --git a/mdl_results/211122_bidirectional_report.txt b/mdl_results/211122_bidirectional_report.txt new file mode 100644 index 0000000..fcc248c --- /dev/null +++ b/mdl_results/211122_bidirectional_report.txt @@ -0,0 +1,15 @@ + precision recall f1-score support + + 0 0.89 0.99 0.94 38177 + 1 0.53 0.67 0.59 1618 + 2 0.77 0.49 0.60 1161 + 3 0.73 
0.28 0.41 1715 + 4 1.00 0.00 0.01 882 + 5 0.79 0.19 0.30 1646 + 6 0.00 0.00 0.00 259 + 7 0.00 0.00 0.00 723 + 8 0.00 0.00 0.00 254 + + accuracy 0.87 46435 + macro avg 0.52 0.29 0.32 46435 +weighted avg 0.84 0.87 0.83 46435 diff --git a/mdl_results/211122_large_dim_embedding_report.txt b/mdl_results/211122_large_dim_embedding_report.txt new file mode 100644 index 0000000..7d495f0 --- /dev/null +++ b/mdl_results/211122_large_dim_embedding_report.txt @@ -0,0 +1,15 @@ + precision recall f1-score support + + 0 0.89 0.99 0.94 38177 + 1 0.61 0.63 0.62 1618 + 2 0.73 0.50 0.59 1161 + 3 0.65 0.38 0.48 1715 + 4 0.00 0.00 0.00 882 + 5 0.56 0.09 0.16 1646 + 6 0.00 0.00 0.00 259 + 7 0.00 0.00 0.00 723 + 8 0.00 0.00 0.00 254 + + accuracy 0.87 46435 + macro avg 0.38 0.29 0.31 46435 +weighted avg 0.81 0.87 0.83 46435 diff --git a/mdl_results/211122_large_hidden_layer_report.txt b/mdl_results/211122_large_hidden_layer_report.txt new file mode 100644 index 0000000..71fcca8 --- /dev/null +++ b/mdl_results/211122_large_hidden_layer_report.txt @@ -0,0 +1,15 @@ + precision recall f1-score support + + 0 0.88 0.99 0.93 38177 + 1 0.53 0.46 0.49 1618 + 2 0.63 0.41 0.50 1161 + 3 0.65 0.38 0.48 1715 + 4 0.00 0.00 0.00 882 + 5 0.57 0.04 0.07 1646 + 6 0.00 0.00 0.00 259 + 7 0.67 0.00 0.01 723 + 8 0.00 0.00 0.00 254 + + accuracy 0.86 46435 + macro avg 0.44 0.25 0.28 46435 +weighted avg 0.81 0.86 0.82 46435 diff --git a/ner/LSTM.py b/ner/LSTM.py index aa0af27..9d730d3 100644 --- a/ner/LSTM.py +++ b/ner/LSTM.py @@ -8,7 +8,7 @@ class TokenLSTM(nn.Module): An LSTM layer that takes in a sequence of tokens and returns a sequence of tags. """ def __init__( - self, output_dim: int, embedding_layer: nn.Embedding, hidden_dim_size: int + self, output_dim: int, embedding_layer: nn.Embedding, hidden_dim_size: int, bidirectional: bool ): super().__init__() @@ -18,7 +18,7 @@ def __init__( self.embedding_size = embedding_layer.weight.shape[1] # the LSTM takes an embedded sentence - self.lstm = nn.LSTM(self.embedding_size, hidden_dim_size, batch_first=True) + self.lstm = nn.LSTM(self.embedding_size, hidden_dim_size, batch_first=True, bidirectional=bidirectional) # fc (fully connected) layer transforms the LSTM-output to give the final output layer self.fc = nn.Linear(hidden_dim_size, output_dim) diff --git a/ner/data.py b/ner/data.py index 36a1bc9..2f71794 100644 --- a/ner/data.py +++ b/ner/data.py @@ -1,19 +1,18 @@ """ Contains function for loading, batching and converting data. """ - +import random from itertools import islice from typing import Iterable, List, Tuple import numpy as np import torch +from torch import nn import datasets from datasets.dataset_dict import DatasetDict - from gensim.models.keyedvectors import KeyedVectors -from torch import nn -import random + def load_data() -> DatasetDict: """Load the conllpp dataset. @@ -46,7 +45,6 @@ def load_sst2() -> DatasetDict: test_idx = [i for i, is_test in enumerate(bool_is_test) if is_test] train_idx = [i for i, is_test in enumerate(bool_is_test) if not is_test] - # overwrite existing test and train set dataset["test"] = dataset["train"].select(np.array(test_idx)) dataset["train"] = dataset["train"].select(np.array(train_idx)) @@ -54,7 +52,6 @@ def load_sst2() -> DatasetDict: return dataset - def batch(dataset: Iterable, batch_size: int) -> Iterable: """Creates batches from an iterable. 
@@ -109,7 +106,18 @@ def gensim_to_torch_embedding(gensim_wv: KeyedVectors) -> Tuple[nn.Embedding, di return emb_layer, vocab -def prepare_batch(tokens: List[List[str]], labels: List[List[int]]) -> Tuple[torch.Tensor, torch.Tensor]: + +def tokens_to_idx(tokens: List[str], vocab: dict) -> List[int]: + """Convert a list of tokens to their vocabulary ids, mapping out-of-vocabulary tokens to the id of "UNK".""" + return [vocab.get(t.lower(), vocab["UNK"]) for t in tokens] + + +def data_to_tensor( + tokens: List[List[str]], + labels: List[List[int]], + vocab: dict, + max_sentence_length: int + ) -> Tuple[torch.Tensor, torch.Tensor]: + """Prepare a batch of data for training. Args: @@ -117,6 +125,21 @@ def prepare_batch(tokens: List[List[str]], labels: List[List[int]]) -> Tuple[tor labels (List[List[int]]): A list of lists of labels. Returns: - Tuple[torch.Tensor, torch.Tensor]: A tuple of tensors containing the tokens and labels. - """ - pass \ No newline at end of file + Tuple[torch.Tensor, torch.Tensor]: A tuple of tensors containing the token ids and labels, padded to max_sentence_length with the "PAD" id and -1 respectively. + """ + n_docs = len(tokens) + + batch_tok_idx = [tokens_to_idx(sent, vocab=vocab) for sent in tokens] + + token_map = vocab["PAD"] * np.ones((n_docs, max_sentence_length)) + label_map = -1 * np.ones((n_docs, max_sentence_length)) + + for i in range(n_docs): + tok_idx = batch_tok_idx[i] + tags = labels[i] + size = len(tok_idx) + + token_map[i][:size] = tok_idx + label_map[i][:size] = tags + + return torch.LongTensor(token_map), torch.LongTensor(label_map) \ No newline at end of file diff --git a/ner/main.py b/ner/main.py index b8fc960..9f7b485 100644 --- a/ner/main.py +++ b/ner/main.py @@ -1,56 +1,199 @@ -import numpy as np -import torch +""" +This script uses a trainable LSTM and GloVe word embeddings to detect named entities in unstructured texts. +The LSTM model is trained with early stopping: if the model has not improved in N epochs, the training is stopped. +To evaluate the model, the F1-score and accuracy score are computed. +""" + +import argparse +import os import random +import datetime +from wasabi import msg +import numpy as np import gensim.downloader as api +from sklearn.metrics import classification_report +import torch +import torch.optim as optim + +from data import batch, gensim_to_torch_embedding, load_data, data_to_tensor +from LSTM import TokenLSTM +from train import train_model + -from ner.data import batch, gensim_to_torch_embedding, load_data -from ner.LSTM import TokenLSTM +def main( + mdl_fname: str, + batch_size: int, + n_epochs: int, + learning_rate: float, + gensim_embedding: str, + hidden_layer_dim: int, + stopping_patience: int, + bidirectional: bool + ): + ''' + Train & evaluate LSTM for NER labeling with early stopping. Parameters + ---------- + mdl_fname : str + name used for the saved model file + batch_size : int + n datapoints in one batch + n_epochs : int + n epochs to train for + learning_rate : float + step size for gradient descent + gensim_embedding : str + name of an embedding model available in `gensim.downloader` + hidden_layer_dim : int + n nodes in the hidden layer + stopping_patience : int + n epochs without improvement in validation loss before training is stopped + bidirectional : bool + whether to run the LSTM bidirectionally
+ ''' + + # handle model filename for saving + today = datetime.date.today() + today_yymmdd = today.strftime("%y%m%d") + mdl_fname = os.path.join('..', 'mdl_results', f'{today_yymmdd}_{mdl_fname}.ph') -def main(gensim_embedding: str, batch_size: int, epochs: int, learning_rate: float, patience: int=10): # set a seed to make the results reproducible torch.manual_seed(0) np.random.seed(0) random.seed(0) + # load data + msg.info('Importing data') dataset = load_data() - train = dataset["train"] + # shuffle training data + train = dataset["train"].shuffle(seed=1) + + # load validation data for early stopping + validation = dataset["validation"] + + # load test data for final evaluation + test = dataset["test"] + + # load gensim word embeddings embeddings = api.load(gensim_embedding) - # convert gensim word embedding to torch word embedding + # prepare data + msg.info('Preparing data') + + # convert gensim word embeddings to torch word embeddings embedding_layer, vocab = gensim_to_torch_embedding(embeddings) + # prepare training data + # get the length of the longest sentence, which is used as the padding length + max_train_len = max([len(s) for s in train['tokens']]) + max_val_len = max([len(s) for s in validation['tokens']]) + max_test_len = max([len(s) for s in test['tokens']]) + max_sentence_length = max([max_train_len, max_val_len, max_test_len]) - # Preparing data - # shuffle dataset - train = dataset["train"].shuffle(seed=1) + # prepare validation data (convert data and labels to tensors) + val_X, val_y = data_to_tensor( + validation["tokens"], + validation["ner_tags"], + vocab=vocab, + max_sentence_length=max_sentence_length + ) + + # prepare test data (convert to tensors) + test_X, test_y = data_to_tensor( + test["tokens"], + test["ner_tags"], + vocab=vocab, + max_sentence_length=max_sentence_length + ) - # batch it using a utility function (don't spend time on the function, but make sure you understand the output) + # batch training tokens and labels batches_tokens = batch(train["tokens"], batch_size) batches_tags = batch(train["ner_tags"], batch_size) - # Initialize the model - # Initialize optimizer - - # Train model (I suggest writing a function for this) - ## for each epoch - ## for each batch - - ## prepare data (see code from the class on RNNs) - - ## train on one batch - # run forward pass - # run backward pass - # update parameters - # calculate loss - - ## periodically calculate loss on validation set - # if epoch % 10 == 0: # e.g. 
every 10 epochs - # save the model if it is the best so far - # stop the training if you haven't saved a better model in the last N epochs (this N here is typically referred to as patience) - - # Load the best model - # Calculate the Accuracy and F1 on the test set (It might be easier to write a function for this) - \ No newline at end of file + # initialize the LSTM model + LSTM = TokenLSTM( + output_dim=10, + embedding_layer=embedding_layer, + hidden_dim_size=hidden_layer_dim, + bidirectional=bidirectional + ) + + # initialize Adam optimizer + optimizer = optim.Adam( + params=LSTM.parameters(), + lr=learning_rate) + + # train model with early stopping + msg.info('Training LSTM model') + m = train_model( + model=LSTM, + val_X=val_X, + val_y=val_y, + n_epochs=n_epochs, + batches_tokens=batches_tokens, + batches_tags=batches_tags, + vocab=vocab, + max_sentence_length=max_sentence_length, + optimizer=optimizer, + patience=stopping_patience, + mdl_fname=mdl_fname + ) + + # evaluate model on test data + msg.info('Evaluating performance') + + # calculate predictions for test data + test_y_hat = m.forward(test_X) + + # flatten data + test_true = test_y.view(-1) + + # reshape y_hat to be len(docs) x len(real labels) + mask = torch.arange(0,9) + test_pred = torch.index_select(test_y_hat, 1, mask) + # get top label from predicted confidences + test_pred = torch.argmax(test_pred, dim=1) + # remove padding + test_pred = [y_pred for y_pred in test_pred if y_pred != 1] + test_true = [y for y in test_true if y != -1] + assert len(test_true) == len(test_pred) + + test_true, test_pred = torch.tensor(test_true), torch.tensor(test_pred) + + # create clf report + clf_report = classification_report(test_true, test_pred) + + # save model performance scores + report_fname = mdl_fname.replace('.ph', '_report.txt') + with open(report_fname, 'w') as f: + f.write(clf_report) + + +if __name__ == "__main__": + + torch.device('cpu') + + ap = argparse.ArgumentParser() + ap.add_argument('-f', '--mdlfname', type=str) + ap.add_argument('--batchsize', type=int, default=1024) + ap.add_argument('--nepochs', type=int, default=30) + ap.add_argument('--learningrate', type=float, default=0.1) + ap.add_argument('--embeddings', type=str, default='glove-wiki-gigaword-100') + ap.add_argument('--hiddenlayer', type=int, default=30) + ap.add_argument('--patience', type=int, default=10) + ap.add_argument('--bidirectional', type=bool, default=False) + args = vars(ap.parse_args()) + + main( + mdl_fname=args['mdlfname'], + batch_size=args['batchsize'], + n_epochs=args['nepochs'], + learning_rate=args['learningrate'], + gensim_embedding=args['embeddings'], + hidden_layer_dim=args['hiddenlayer'], + stopping_patience=args['patience'], + bidirectional=args['bidirectional'] + ) diff --git a/ner/tests/test_prepare_batch.py b/ner/tests/test_data_to_tensor.py similarity index 60% rename from ner/tests/test_prepare_batch.py rename to ner/tests/test_data_to_tensor.py index 10fe532..3e26a08 100644 --- a/ner/tests/test_prepare_batch.py +++ b/ner/tests/test_data_to_tensor.py @@ -1,4 +1,5 @@ -from ner.data import prepare_batch, load_data +from collections import Counter +from ner.data import data_to_tensor, load_data def create_tags_to_ids(): dataset = load_data() @@ -7,7 +8,7 @@ def create_tags_to_ids(): return tags_to_ids -def test_prepare_batch(): +def test_data_to_tensor(): """test that the prepare batch function outputs the correct shape""" sample_texts = [ ["I", "am", "happy"], @@ -20,10 +21,20 @@ def test_prepare_batch(): ["I-PER", "O", "O", "O", 
"O"], ] + max_sentence_length = max([len(doc) for doc in sample_texts]) + vocab = dict() + for doc in sample_texts: + for token in doc: + if token not in vocab: + vocab[token] = len(vocab) + + vocab.update({'UNK': len(vocab)}) + vocab.update({'PAD': len(vocab)}) + tags_to_ids = create_tags_to_ids() sample_labels = [[tags_to_ids[tag] for tag in tags] for tags in sample_labels] - X, y = prepare_batch(sample_texts, sample_labels) + X, y = data_to_tensor(sample_texts, sample_labels, vocab, max_sentence_length) assert X.shape == (3, 5), "Your prepared batch does not have the correct size" - assert all(i in y.unique() for i in [-1, 0, 2]), "Your prepared batch does not contain the correct labels" \ No newline at end of file + assert all(i in y.unique() for i in [-1, 0, 2]), "Your prepared batch does not contain the correct labels" diff --git a/ner/tests/test_main.py b/ner/tests/test_main.py index 7ad3a7c..fa62d86 100644 --- a/ner/tests/test_main.py +++ b/ner/tests/test_main.py @@ -3,4 +3,14 @@ def test_main(): """test that main run using a single epoch""" - main(gensim_model="glove-wiki-gigaword-50", epochs=1, batch_size=5, learning_rate=0.1) + main( + mdl_fname='test', + gensim_embedding="glove-wiki-gigaword-50", + n_epochs=1, + batch_size=5, + learning_rate=0.1, + hidden_layer_dim=6, + stopping_patience=1, + bidirectional=False + ) + diff --git a/ner/train.py b/ner/train.py new file mode 100644 index 0000000..8d2397a --- /dev/null +++ b/ner/train.py @@ -0,0 +1,103 @@ +from typing import List + +from wasabi import msg +from tqdm import tqdm +import torch + +from data import data_to_tensor + + +def train_model( + model: torch.nn.Module, + val_X: torch.Tensor, + val_y: torch.Tensor, + n_epochs: int, + batches_tokens: List[List[str]], + batches_tags: List[List[int]], + vocab: dict, + max_sentence_length: int, + optimizer: torch.optim.Optimizer, + patience: int, + mdl_fname: str + ) -> torch.nn.Module: + '''Trains a model with early stopping + + Parameters + ---------- + model : torch.nn.Module + initial model to train + val_X : torch.Tensor + validation data – input. Preprocessed with data.data_to_tensor() + val_y : torch.Tensor + validation data – labels. Preprocessed with data.data_to_tensor() + n_epochs : int + number of epochs to train for + batches_tokens : List[List[str]] + training data – input. Raw. + batches_tags : List[List[int]] + training data – labels. Raw. + vocab : dict + token to id mapping of words + max_sentence_length : int + n tokens in the longest sentence. 
+ optimizer : torch.optim.Optimizer + optimizer to use for gradient descent + patience : int + n epochs without improvement in validation loss before training is stopped + mdl_fname : str + path to save the model + + Returns + ------- + torch.nn.Module + the best model trained + ''' + + # keep track of the validation losses, the best validation loss and the number of epochs without improvement + val_losses = [] + best_val_loss = None + epochs_without_improvement = 0 + + # for each epoch + for epoch in tqdm(range(0, n_epochs)): + msg.info(f'Epoch {epoch}') + + # convert training data and labels to tensors + for token, label in zip(batches_tokens, batches_tags): + X, y = data_to_tensor( + tokens=token, + labels=label, + vocab=vocab, + max_sentence_length=max_sentence_length, + ) + + # forward pass on training data + y_hat = model.forward(X) + + # calculate loss + loss = model.loss_fn(outputs=y_hat, labels=y) + + # backward propagation + loss.backward() + optimizer.step() + optimizer.zero_grad() + + # forward pass on validation data + val_y_hat = model.forward(val_X) + + # compute loss on validation data + val_loss = model.loss_fn(outputs=val_y_hat, labels=val_y) + val_losses.append(val_loss.item()) + + # early stopping: save the model whenever the validation loss improves, + # and stop when it has not improved for `patience` epochs in a row + if best_val_loss is None or val_loss < best_val_loss: + best_val_loss = val_loss + epochs_without_improvement = 0 + torch.save(model, mdl_fname) + else: + epochs_without_improvement += 1 + if epochs_without_improvement >= patience: + break + + # load best model + model = torch.load(mdl_fname) + + return model diff --git a/requirements.txt b/requirements.txt index 1afce5c..a963314 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,6 +1,7 @@ -numpy~=1.21.2 -sklearn~=0.0 -torch~=1.9.1 -datasets~=1.12.1 -scikit-learn~=1.0 -gensim~=4.1.2 \ No newline at end of file +datasets +torch +scikit-learn +gensim +numpy +wasabi +tqdm \ No newline at end of file
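A note on the ```--bidirectional``` flag used in ```main.py``` and ```driver.sh```: ```argparse``` with ```type=bool``` converts any non-empty string to ```True```, so ```--bidirectional False``` is parsed as ```True```. Below is a minimal sketch of an explicit converter that avoids this behaviour; the ```str2bool``` helper name is illustrative and not part of the repository:

```python
import argparse


def str2bool(value: str) -> bool:
    """Parse the usual string spellings of a boolean command-line flag."""
    if value.lower() in ("true", "1", "yes"):
        return True
    if value.lower() in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean value, got {value!r}")


ap = argparse.ArgumentParser()
ap.add_argument("--bidirectional", type=str2bool, default=False)
print(ap.parse_args(["--bidirectional", "False"]).bidirectional)  # prints False
```

With ```type=bool```, the same invocation would print ```True```, because ```bool("False")``` is truthy.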
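The evaluation step in ```main.py``` removes padding by filtering on the value of the predicted label. An alternative is to mask on the gold labels, which ```data_to_tensor``` sets to -1 at padded positions, so that predictions and gold labels are filtered with the same mask. A minimal sketch under that assumption; the function name and the assumed model output shape (one row of tag scores per token) are illustrative rather than the repository's API:

```python
import torch
from sklearn.metrics import classification_report


def masked_classification_report(y_hat: torch.Tensor, y: torch.Tensor, n_tags: int = 9) -> str:
    """Classification report that ignores padded positions (gold label -1)."""
    # restrict the argmax to the valid tag columns, in case the model has extra output units
    preds = torch.argmax(y_hat[:, :n_tags], dim=1)
    gold = y.view(-1)

    # keep only the positions that correspond to real tokens
    keep = gold != -1
    return classification_report(gold[keep].numpy(), preds[keep].numpy())
```

Because both tensors are filtered with the same boolean mask, no separate length check is needed to keep predictions and labels aligned.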