
Transformers, Attention and NMT — Part 2

This is the second part of a multi-part series on Transformers and Attention. In this article we will delve into the architecture of the Transformer and the design choices behind it.

Attention and Only Attention

In the previous part of this series, we saw how ConvS2S and ByteNet used convolutions instead of RNNs to process long sequences in parallel and cut training time. We also saw that combining attention with convolutions (in ConvS2S) achieved better results.

Scaled Dot-Product Attention:

For NLP tasks, a sentence is first tokenized and every token (i.e. a word or sub-word) is converted into a vector of real numbers in a high-dimensional space, i.e. an embedding vector (e.g. GloVe and Word2Vec). These vectors are then used directly as inputs for further processing. With transformers, however, we first create a “weighted representation” of every token in the sentence with respect to every other token. One very specific case of such representations is contextualization, i.e. capturing the sense of a specific word/token based on the other words in the sentence.

“Bank” here refers to “river bank” and not a financial institution.
Figure-2: Using embedding vectors and attention to derive a contextualized embedding
Figure-3: Scaled dot-product attention schematic with fully connected (FC) layers
Scaled dot-product attention. Source: Vaswani et al. 2017
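To make the computation concrete, here is a minimal PyTorch sketch of scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V; the tensor shapes and the self-attention call at the end are purely illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V (Vaswani et al. 2017)."""
    d_k = q.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    if mask is not None:
        # Positions where mask is False are excluded from attention
        scores = scores.masked_fill(~mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)  # attention weights sum to 1 over the keys
    return weights @ v, weights

# Toy example: 1 sentence, 5 tokens, embedding size 64 (hypothetical shapes)
x = torch.randn(1, 5, 64)
out, attn = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V = x
print(out.shape, attn.shape)  # torch.Size([1, 5, 64]) torch.Size([1, 5, 5])
```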

Multi-Head Attention

What we saw above was a single attention “head” that outputs one weighted representation. A natural extension is to use multiple such heads, each with its own Query, Key and Value matrices, which enables the model to learn several useful representations. The heads compute their attention weights independently and in parallel, and their outputs are concatenated. The original Transformer paper uses 8 heads with a size of 64 each, but the number of heads is a hyperparameter and can be tuned as required.

Multi-Head Attention. Source: Vaswani et al. 2017
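The head-splitting described above can be sketched in a few lines of PyTorch. The class below is a simplified illustration under the paper's base settings (d_model = 512, 8 heads of size 64); dropout is omitted for brevity.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Minimal sketch: d_model is split across h heads (8 heads of size 64 in the paper)."""
    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        # One large projection per role; equivalent to h separate Q/K/V matrices
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)  # final projection after concatenation

    def forward(self, q, k, v, mask=None):
        b, n, _ = q.shape
        # Project, then reshape to (batch, heads, seq_len, d_k) so all heads run in parallel
        def split(x, w):
            return w(x).view(b, -1, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(q, self.w_q), split(k, self.w_k), split(v, self.w_v)
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5
        if mask is not None:
            scores = scores.masked_fill(~mask, float("-inf"))
        out = scores.softmax(dim=-1) @ v
        # Concatenate heads back to (batch, seq_len, d_model) and project
        return self.w_o(out.transpose(1, 2).reshape(b, n, self.h * self.d_k))

mha = MultiHeadAttention()
x = torch.randn(2, 10, 512)
print(mha(x, x, x).shape)  # torch.Size([2, 10, 512])
```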

Position-Wise Feedforward Networks (FFNs):

The concatenated output from the multiple attention heads above is next passed through position-wise FFNs. This layer applies two transformations: first, a fully connected layer that increases the dimension, followed by a ReLU activation; second, another fully connected layer that projects back to the original input dimension. The purpose of this network is to add representational capacity to the overall model.
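A minimal PyTorch sketch of this sub-layer, using the paper's base dimensions (d_model = 512 expanded to d_ff = 2048 and projected back):

```python
import torch.nn as nn

# Position-wise feed-forward sub-layer: expand, apply ReLU, project back.
# Applied identically to every token position in the sequence.
feed_forward = nn.Sequential(
    nn.Linear(512, 2048),  # expand the model dimension
    nn.ReLU(),             # non-linearity
    nn.Linear(2048, 512),  # project back to the original dimension
)
```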

Positional Encoding:

Transformer models have no recurrent layers that would naturally keep track of sequential order. So, to give the model a sense of token ordering, positional information is added to each token’s embedding vector. In the Transformer, positional encoding is computed using sine and cosine functions of the token’s position.

Positional encoding functions: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), where pos is the position and i is the dimension. Source: Vaswani et al. 2017
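This encoding can be computed once and simply added to the embeddings; below is a small PyTorch sketch of the sinusoidal formulas above (max_len and d_model are illustrative values).

```python
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(same)."""
    pos = torch.arange(max_len).unsqueeze(1).float()      # (max_len, 1)
    i = torch.arange(0, d_model, 2).float()               # even dimension indices
    angle = pos / 10000 ** (i / d_model)                  # (max_len, d_model / 2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)   # even dimensions get sine
    pe[:, 1::2] = torch.cos(angle)   # odd dimensions get cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=100, d_model=512)
# The encoding is added to the token embeddings: x = embedding + pe[:seq_len]
```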

Attention Block:

Finally, a single attention block can be visualized as below:

Attention block with multiple heads and position-wise FFNs
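Combining the pieces so far, one encoder-side attention block can be sketched as below. This sketch also adds the residual connections and layer normalization (“Add & Norm”) used in the paper, and leans on PyTorch's built-in nn.MultiheadAttention rather than the manual implementation above; it is an illustration, not the original implementation.

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One encoder block: multi-head self-attention + position-wise FFN,
    each wrapped in a residual connection and layer normalization."""
    def __init__(self, d_model=512, h=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, h, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        # Self-attention sub-layer with residual connection and normalization
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + self.drop(attn_out))
        # Position-wise feed-forward sub-layer with residual connection and normalization
        return self.norm2(x + self.drop(self.ffn(x)))
```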

Self-Attention, Masked Multi-Head Attention and Final Layers:

Three attention variants within the Transformer model. Source: Vaswani et al. 2017 (NIPS)
Masked multi-head attention
Transformer with 2 layers in Encoder and Decoder. Source: Jay Alammar
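In the decoder, masked multi-head attention prevents a position from attending to later positions, so the prediction for position i can depend only on the known outputs before i. This is done with a look-ahead (causal) mask; a minimal sketch, compatible with the mask convention used in the attention function earlier:

```python
import torch

def causal_mask(seq_len):
    # True where attention is allowed: each position may attend to itself
    # and to earlier positions, never to future tokens.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

print(causal_mask(4))
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
```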

Putting it all together:

Below is the overall architecture of the Transformer model from the original “Attention is all you need” paper (Vaswani et al. 2017).

Transformer model architecture. Nx represents multiple stacked blocks. Source: Vaswani et al. 2017
Model parameters. Source: Vaswani et al. 2017
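For orientation, the base model's hyperparameters (d_model = 512, 8 heads, 6 encoder and 6 decoder blocks, d_ff = 2048, dropout 0.1) map directly onto PyTorch's built-in nn.Transformer. The snippet below is a rough sketch of that mapping with dummy, already-embedded inputs; it is not the paper's original Tensor2Tensor implementation.

```python
import torch
import torch.nn as nn

# Base-model hyperparameters from Vaswani et al. 2017, expressed with PyTorch's built-in module
model = nn.Transformer(
    d_model=512,            # embedding / model dimension
    nhead=8,                # attention heads per layer
    num_encoder_layers=6,   # "Nx" encoder blocks
    num_decoder_layers=6,   # "Nx" decoder blocks
    dim_feedforward=2048,   # inner dimension of the position-wise FFN
    dropout=0.1,
    batch_first=True,
)

src = torch.randn(1, 12, 512)  # already-embedded source sequence (illustrative)
tgt = torch.randn(1, 9, 512)   # already-embedded target sequence (illustrative)
out = model(src, tgt, tgt_mask=model.generate_square_subsequent_mask(9))
print(out.shape)  # torch.Size([1, 9, 512])
```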

Comparison and Results:

Transformers achieved a significant breakthrough in both computational speed and performance on NMT tasks compared with prior RNN-based (GNMT) and CNN-based (ConvS2S) models. Below is a comparison of some of the important attributes:

Comparison of complexity and other parameters. Source: Vaswani et al. 2017
Results on NMT tasks on WMT ’14. Source: Vaswani et al. 2017
Attention visualizations. Sources: Lena Voita, Tensor2Tensor

Conclusion and Final Notes:

Transformers undoubtedly achieved a major breakthrough in machine translation tasks. Self-attention allows for easier and richer unsupervised learning. Attention blocks, whether in the encoder or the decoder, can be stacked or widened to take advantage of parallel execution on multiple GPUs/TPUs. Coupled with the success of transfer learning in NLP, as demonstrated by ULMFiT, transformers proved to be a ubiquitous and scalable alternative for NLP.

GPT architecture: pre-trained model + task-specific fine-tuning. Source: Radford et al. 2018

References and Credits

  1. Awesome course content delivered by Rohan Shravan, Zoheb from TSAI.
  2. “The Illustrated Transformer” and “The Annotated Transformer”
  3. This superb video on Transformer visualization.
  4. NIPS blogs and lectures
  5. Original papers referenced: ConvS2S, “Attention Is All You Need”
  6. Lena Voita’s article and the original paper
  7. Tensor2Tensor GitHub repo
  8. BertViz GitHub Repo
Attention Visualization across layers and heads. Generated with: bertviz

