
Attention Mechanisms and Transformers

One of the most important problems in the NLP domain is machine translation, an essential task that underlies tools such as Google Translate. In this section, we will focus on machine translation, or, more generally, on any sequence-to-sequence task (also called sequence transduction).

With RNNs, sequence-to-sequence is implemented by two recurrent networks, where one network, the encoder, collapses an input sequence into a hidden state, while another network, the decoder, unrolls this hidden state into a translated result. There are a couple of problems with this approach:

  • The final state of the encoder network struggles to retain information from the beginning of a sentence, which degrades model quality on long sentences.
  • All words in the input sequence have the same impact on the result. In reality, however, specific words in the input sequence often have more impact on particular outputs than others.

Attention Mechanisms provide a means of weighting the contextual impact of each input vector on each output prediction of the RNN. They are implemented by creating shortcuts between the intermediate states of the input RNN and the output RNN. In this manner, when generating the output symbol y<sub>t</sub>, we take into account all input hidden states h<sub>i</sub>, with different weight coefficients α<sub>t,i</sub>.

Image showing an encoder/decoder model with an additive attention layer

The encoder-decoder model with additive attention mechanism in Bahdanau et al., 2015, cited from this blog post

The attention matrix {α<sub>i,j</sub>} represents the degree to which certain input words contribute to the generation of a given word in the output sequence. Below is an example of such a matrix:

Image showing a sample alignment found by RNNsearch-50, taken from Bahdanau - arxiv.org

Figure from Bahdanau et al., 2015 (Fig.3)
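To make the weighting concrete, here is a minimal sketch of additive (Bahdanau-style) attention in PyTorch for a single decoder step; the layer names (W_h, W_s, v) and sizes are illustrative assumptions, not code from the lesson notebooks:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative additive attention: score each encoder state h_i against the
# current decoder state s_t, softmax the scores into weights alpha_{t,i},
# and take the weighted sum of encoder states as the context vector.
hidden = 64
W_h = nn.Linear(hidden, hidden, bias=False)  # projects encoder states
W_s = nn.Linear(hidden, hidden, bias=False)  # projects the decoder state
v = nn.Linear(hidden, 1, bias=False)         # maps to a scalar score

h = torch.randn(10, hidden)   # encoder hidden states h_1..h_10
s_t = torch.randn(1, hidden)  # current decoder state s_t

scores = v(torch.tanh(W_h(h) + W_s(s_t))).squeeze(-1)  # one score per input position
alpha = F.softmax(scores, dim=0)                        # attention weights alpha_{t,i}
context = (alpha.unsqueeze(-1) * h).sum(dim=0)          # weighted sum of encoder states
print(alpha.shape, context.shape)                       # torch.Size([10]) torch.Size([64])
```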

Attention mechanisms are responsible for much of the current or near-current state of the art in NLP. Adding attention, however, greatly increases the number of model parameters, which led to scaling issues with RNNs. A key constraint on scaling RNNs is that their recurrent nature makes it challenging to batch and parallelize training: in an RNN, each element of a sequence needs to be processed in sequential order, which means it cannot be easily parallelized.

Encoder Decoder with Attention

Figure from Google's Blog

The adoption of attention mechanisms, combined with this constraint, led to the creation of the now state-of-the-art transformer models that we know and use today, from BERT to OpenAI GPT-3.

Transformer models

One of the main ideas behind transformers is to avoid the sequential nature of RNNs and to create a model that is parallelizable during training. This is achieved by implementing two ideas:

  • positional encoding
  • using a self-attention mechanism to capture patterns instead of RNNs (or CNNs); this is why the paper that introduced transformers is called Attention Is All You Need

Positional Encoding/Embedding

The idea of positional encoding is the following.

  1. When using RNNs, the relative position of the tokens is represented by the number of steps, and thus does not need to be explicitly represented.
  2. However, once we switch to attention, we need to know the relative positions of tokens within a sequence.
  3. To get positional encoding, we augment our sequence of tokens with a sequence of token positions in the sequence (i.e., a sequence of numbers 0,1, ...).
  4. We then mix the token position with a token embedding vector. To transform the position (integer) into a vector, we can use different approaches:
  • Trainable embedding, similar to token embedding. This is the approach we consider here. We apply embedding layers on top of both tokens and their positions, resulting in embedding vectors of the same dimensions, which we then add together.
  • Fixed position encoding function, as proposed in the original paper.

Image by the author

The result that we get with positional embedding embeds both the original token and its position within a sequence.
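As a rough illustration of the trainable variant described above, the following PyTorch sketch adds a learned position embedding to a learned token embedding; the vocabulary size, maximum length and embedding dimension are arbitrary placeholder values, not values from the lesson:

```python
import torch
import torch.nn as nn

class PositionalEmbedding(nn.Module):
    """Token embedding plus a trainable positional embedding (a minimal sketch)."""
    def __init__(self, vocab_size, max_len, embed_dim):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, embed_dim)
        self.pos_emb = nn.Embedding(max_len, embed_dim)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) tensor of token indices
        seq_len = token_ids.size(1)
        positions = torch.arange(seq_len, device=token_ids.device)  # 0, 1, ..., seq_len-1
        # Position embeddings broadcast over the batch dimension
        return self.token_emb(token_ids) + self.pos_emb(positions)

# Example: embed a batch of two sequences of length 5
emb = PositionalEmbedding(vocab_size=10_000, max_len=512, embed_dim=64)
x = torch.randint(0, 10_000, (2, 5))
print(emb(x).shape)  # torch.Size([2, 5, 64])
```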

Multi-Head Self-Attention

Next, we need to capture some patterns within our sequence. To do this, transformers use a self-attention mechanism, which is essentially attention applied within the same sequence, so that the sequence attends to itself. Applying self-attention allows us to take into account context within the sentence and to see which words are inter-related. For example, it allows us to see which words are referred to by coreferences, such as it, while also taking the context into account:

Image from the Google Blog

In transformers, we use Multi-Head Attention in order to give the network the power to capture several different types of dependencies, e.g. long-term vs. short-term word relations, co-reference vs. something else, etc.
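The sketch below shows self-attention in its simplest form, using PyTorch's built-in nn.MultiheadAttention with the same tensor supplied as query, key and value; the embedding dimension and number of heads are illustrative only:

```python
import torch
import torch.nn as nn

# Minimal self-attention sketch: the same sequence serves as query, key and value.
embed_dim, num_heads = 64, 4
self_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(2, 10, embed_dim)  # (batch, seq_len, embed_dim)
out, weights = self_attn(x, x, x)  # query = key = value = x
print(out.shape)       # torch.Size([2, 10, 64])
print(weights.shape)   # torch.Size([2, 10, 10]) - weights averaged over the 4 heads
```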

The TensorFlow Notebook contains more details on the implementation of transformer layers.

Encoder-Decoder Attention

In transformers, attention is used in two places:

  • To capture patterns within the input text using self-attention
  • To perform sequence translation; this is the attention layer between the encoder and the decoder.

Encoder-decoder attention is very similar to the attention mechanism used in RNNs, as described at the beginning of this section. This animated diagram explains the role of encoder-decoder attention.

Animated GIF showing how the evaluations are performed in transformer models.

Since each input position is mapped to each output position independently, transformers can parallelize better than RNNs, which enables much larger and more expressive language models. Each attention head can learn different relationships between words, which improves downstream Natural Language Processing tasks.
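For comparison with the self-attention sketch above, here is a hypothetical encoder-decoder (cross-) attention call, again using nn.MultiheadAttention: the decoder states act as queries, while the encoder outputs supply the keys and values. All sizes are made up for illustration:

```python
import torch
import torch.nn as nn

# Cross-attention sketch: decoder states attend over the encoder outputs.
embed_dim, num_heads = 64, 4
cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

encoder_out = torch.randn(2, 12, embed_dim)    # 12 source tokens per sequence
decoder_state = torch.randn(2, 7, embed_dim)   # 7 target tokens generated so far
out, weights = cross_attn(decoder_state, encoder_out, encoder_out)
print(out.shape)      # torch.Size([2, 7, 64])
print(weights.shape)  # torch.Size([2, 7, 12]) - one row of source weights per target token
```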

BERT

BERT (Bidirectional Encoder Representations from Transformers) is a very large multi-layer transformer network with 12 layers for BERT-base and 24 for BERT-large. The model is first pre-trained on a large corpus of text data (Wikipedia + books) using unsupervised training (predicting masked words in a sentence). During pre-training, the model absorbs a significant level of language understanding, which can then be leveraged with other datasets using fine-tuning. This process is called transfer learning.

Picture from http://jalammar.github.io/illustrated-bert/
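As a quick taste of the masked-word objective described above, the following sketch uses the HuggingFace transformers package (mentioned in the conclusion) to fill in a masked token with a pre-trained BERT model; the checkpoint name bert-base-uncased is the commonly used public model, an assumption rather than something specified in this lesson:

```python
# Minimal sketch of BERT's masked-word prediction via the HuggingFace fill-mask pipeline.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```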

✍️ Exercises: Transformers

Continue your learning in the following notebooks:

Conclusion

In this lesson you learned about Transformers and Attention Mechanisms, all essential tools in the NLP toolbox. There are many variations of Transformer architectures, including BERT, DistilBERT, BigBird, OpenAI GPT-3 and more, that can be fine-tuned. The HuggingFace package provides a repository for training many of these architectures with both PyTorch and TensorFlow.

🚀 Challenge

Review & Self Study