This is the second part of a multi-part series on Transformers and Attention. In this article, we delve into the architecture of Transformers and the design choices behind it.
“I can talk English, I can walk English, I can laugh English, because English is a funny language. Bhairon becomes barren and barren becomes Bhairon because their minds are very narrow. In the year 1929 when India was playing Australia at the Melbourne stadium Vijay Hazare and Vijay Merchant were at the crease. Vijay Merchant told Vijay Hazare, look Vijay Hazare, this is a very prestigious match and we must consider it very prestigiously. We must take this into consideration, the consideration that this is an important match and ultimately this consideration must end in a run.”
— Amitabh Bachchanji in Namak Halaal Movie
Who doesn’t remember this legendary performance by the one and only Amitabh Bachchanji and the perplexed faces of Ranjeet and Bhairon (Ram Sethi)!
Now let’s imagine that Amitabhji, pre-2017, quizzes a well-trained humanoid robot on his KBC show and decides to show it this paragraph with a question at the end:
“What does the last occurrence of the word “consideration” in the above paragraph refer to?”
A. The match being prestigious B. The match being important
C. A random consideration D. NP-Hard Problem
Quite a challenge for our robot, as it puzzles over the same word used multiple times in different contexts, and even to indicate different “considerations” 😀. Let’s see how Transformers tackle this challenge.
Attention and Only Attention
In the previous part of this series, we saw how ConvS2S and ByteNet used convolutions instead of RNNs to process long sequences in parallel and save training time. We also saw that attention combined with convolutions (in ConvS2S) achieved better results.
So what if we now replace the convolution layers themselves with just an attention mechanism? Convolution layers extract features to model the source representation, while the attention mechanism enables the network to focus on specific tokens or sequences. The new attention mechanism, then, must be capable of performing both these tasks and must be “parallelizable”. The Transformer architecture addresses all these concerns. To understand it better, we start from the ground up with the building blocks of transformer models, i.e. “scaled dot-product” attention, “multi-head” attention, “position-wise FFNs”, “self-attention” and “masked multi-head” attention.
Scaled dot product Attention:
For NLP tasks, a sentence is first tokenized and every token (i.e. word/sub-word) is converted into a vector of real numbers in a high-dimensional space, i.e. the embedding vector (e.g. GloVe and Word2Vec). Usually, these vectors are used directly as inputs for further processing. With transformers, however, we first create a “weighted representation” of every token in the sentence w.r.t. every other token. One very specific case of such representations is contextualization, i.e. capturing the sense of a specific word/token based on the other words in the sentence.
To create this weighted representation, we can simply multiply the matrix of token embeddings by its own transpose, i.e. take the dot product of every token's embedding with every other token's embedding. For a sentence of n tokens, this gives an n × n matrix that roughly represents how strongly the words in the input sentence relate to one another. Next, we apply a softmax to each row to generate the final “attention weights”. These attention weights are in turn multiplied with the original embedding vectors to give the final contextualized vectors. We can visualize the operations as below:
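The steps above (similarity matrix, softmax, weighted sum) can be sketched in a few lines of dependency-free Python. This is only an illustration: the function name `naive_self_attention` and the toy embedding values are made up for this example, and a real model would use a tensor library instead of nested lists.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def naive_self_attention(X):
    """X is a list of n token embeddings, each a list of d floats."""
    # n x n similarity matrix: dot product of every token with every other token
    scores = [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]
    # Row-wise softmax turns similarities into attention weights that sum to 1
    weights = [softmax(row) for row in scores]
    # Each output vector is a weighted average of all the embeddings
    d = len(X[0])
    context = [
        [sum(w * x[k] for w, x in zip(w_row, X)) for k in range(d)]
        for w_row in weights
    ]
    return weights, context

# Toy embeddings (made-up numbers) for a 3-token sentence
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = naive_self_attention(X)
```

Note that nothing here is learned: the weights depend only on the fixed embeddings, which is the limitation the Query, Key and Value projections address.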
Instead of performing multiplications on the embedding vectors directly, we can first project them to lower dimensions using 3 fully connected layers. The outputs of these projections give us the Query, Key and Value matrices. We use these matrices to calculate the final attention weights and contextualized vectors. Adding these 3 FC layers has two important benefits. First, since the weights of these layers are learnt during training, the model can create a variety of representations, ranging from syntactic structure and location information to a focus on rare words. Second, since these operations are mainly matrix multiplications, the model can benefit from the high degree of parallelism on GPUs/TPUs. The diagram below shows these operations:
Further, to prevent large values resulting from the multiplications from saturating the softmax, the product of the Q and K matrices is scaled down by the square root of the query/key dimension before the softmax is applied. Mathematically, the entire operation can be represented as below:
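For reference, the scaled dot-product attention formula from the original paper is

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

where d_k is the dimension of the query/key vectors. A minimal sketch of this in plain Python follows. The projection matrices `W_q`, `W_k`, `W_v` are random here purely for illustration (in a real model they are learned), biases are omitted, and the toy sizes are assumptions, not values from the paper.

```python
import math
import random

def matmul(A, B):
    """Multiply an (n x m) matrix by an (m x p) matrix, both nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    # Raw scores, scaled by sqrt(d_k) BEFORE the softmax
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_T)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, V), weights

def random_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
n, d_model, d_k = 4, 8, 4           # toy sizes, chosen for illustration
X = random_matrix(n, d_model)       # stand-in for token embeddings
W_q, W_k, W_v = (random_matrix(d_model, d_k) for _ in range(3))

# The three "FC layers" are matrix multiplications (biases omitted here)
Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
output, weights = scaled_dot_product_attention(Q, K, V)
```

Here `weights` is an n × n matrix whose rows each sum to 1, and `output` holds one d_k-dimensional contextualized vector per token.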
What we saw above was a single attention “head” that outputs one weighted representation. A natural extension is to use multiple such heads, which enables the model to learn multiple useful representations. These heads are created by using multiple sets of Query, Key and Value matrices that generate independent attention weights, all in parallel. The original Transformer paper uses 8 heads, each of dimension 64 (for a model dimension of 512), but the number of heads is a hyperparameter and can be tuned based on requirements.
But why not use a single large block of equivalent dimension? At its core, attention can be viewed as an averaging function, and when averaged over a large dimension, dominant features tend to get over-represented and mask some of the more subtle features. By providing multiple “heads”, the model has the ability to explore these subtle relations too.
Finally, all the attention heads are concatenated and passed through a linear layer. The whole operation is shown below:
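As a rough sketch of the multi-head operation (per-head attention, concatenation, final linear layer), consider the following plain-Python example. The helper names, the random weights and the toy sizes (2 heads of dimension 4 instead of the paper's 8 heads of dimension 64) are assumptions made for illustration only.

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_T)]
    return matmul([softmax(row) for row in scores], V)

def multi_head_attention(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) tuples, one tuple per head."""
    head_outputs = [
        attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
        for W_q, W_k, W_v in heads
    ]
    # Concatenate the head outputs along the feature axis, token by token
    concat = [[v for head in head_outputs for v in head[i]] for i in range(len(X))]
    # The final linear layer mixes information across heads
    return matmul(concat, W_o)

def random_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
n, d_model, num_heads = 3, 8, 2
d_k = d_model // num_heads
X = random_matrix(n, d_model)
heads = [tuple(random_matrix(d_model, d_k) for _ in range(3)) for _ in range(num_heads)]
W_o = random_matrix(num_heads * d_k, d_model)
output = multi_head_attention(X, heads, W_o)
```

Because each head works on a d_model / num_heads slice, the total cost stays close to that of a single head of full dimension, while each head is free to learn a different attention pattern.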
Position-Wise Feedforward Networks(FFNs):
The output of the multi-head attention layer above is passed next through position-wise FFNs. This layer applies two transformations to every position independently: first, a fully connected layer that increases the dimension, followed by a ReLU activation, and second, another fully connected layer that converts the dimension back to the original input dimension. The purpose of this network is to increase the representational capacity of the overall model.
Transformer models don’t have recurrent layers that could naturally keep track of sequential data. So in order to understand the ordering of tokens, positional information is added to each token’s embedding vector. In transformer models, positional encoding is achieved by using sine and cosine functions over the position of the token.
This can also be achieved by using a simple learned embedding, i.e. each token’s position is converted to an embedding vector, similar to ConvS2S models.
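The two transformations can be written as FFN(x) = max(0, x·W1 + b1)·W2 + b2. Below is a small sketch in plain Python. The weights are random and the sizes (4 expanded to 16) are toy assumptions; the original paper uses a model dimension of 512 and an inner dimension of 2048.

```python
import random

def linear(x, W, b):
    """x: vector of size d_in, W: d_in x d_out matrix, b: vector of size d_out."""
    return [sum(xi * wi for xi, wi in zip(x, col)) + bj for col, bj in zip(zip(*W), b)]

def relu(x):
    return [max(0.0, v) for v in x]

def position_wise_ffn(X, W1, b1, W2, b2):
    # The same two layers are applied to every position independently
    return [linear(relu(linear(x, W1, b1)), W2, b2) for x in X]

random.seed(0)
d_model, d_ff = 4, 16   # toy sizes for illustration
W1 = [[random.gauss(0, 0.1) for _ in range(d_ff)] for _ in range(d_model)]
b1 = [0.0] * d_ff
W2 = [[random.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_ff)]
b2 = [0.0] * d_model

# Positions 0 and 2 hold the same vector, so they get the same output
X = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 0.0, 2.0], [1.0, 2.0, 3.0, 4.0]]
out = position_wise_ffn(X, W1, b1, W2, b2)
```

"Position-wise" means there is no interaction between tokens in this layer; all mixing across positions happens in the attention layer.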
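The sinusoidal scheme from the original paper uses PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal sketch, with the function name and toy sizes chosen only for illustration:

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        # i walks over the even dimensions; each (i, i+1) pair shares a frequency
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(max_len=10, d_model=8)

# The encoding is simply added to the token embedding at each position
embedding = [0.1] * 8   # made-up embedding for the token at position 3
model_input = [e + p for e, p in zip(embedding, pe[3])]
```

Low dimensions oscillate quickly with position and high dimensions slowly, so every position gets a distinct pattern without any learned parameters.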
Finally, a single attention block can be visualized as below:
A residual connection is applied around each sub-layer (i.e. multi-head attention and position-wise FFNs), followed by layer normalization. This allows for better gradient flow and makes it possible to stack multiple such blocks.
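In other words, each sub-layer's output is LayerNorm(x + Sublayer(x)). A simplified sketch for a single token vector follows; it omits the learnable gain and bias that layer normalization normally carries, and the "sub-layer" is a stand-in function invented for this example.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer):
    # Residual connection: add the sub-layer's output back onto its input
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

# Stand-in sub-layer (an assumption for illustration): doubles every feature
x = [1.0, 2.0, 3.0, 4.0]
out = add_and_norm(x, lambda v: [2.0 * t for t in v])
```

The residual path gives gradients a direct route through the stack, while the normalization keeps activations on a comparable scale from block to block.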
Self-Attention, Masked Multi-head Attention and Final Layers:
Self-Attention: We saw, in the previous section, that embedding vectors (word + positional) are transformed using 3 different FC layers to produce the Query (Q), Key (K) and Value (V) matrices. These matrices are then used to derive the attention weights and subsequently the final weighted representation. When Q, K and V are all derived from the same embedding vectors, the final output of an attention block is a weighted representation over that single input sentence. This is called “Self-Attention”. In NMT, this mechanism is applied in both the Encoder and the Decoder.
Masked Multi-head attention: On the decoder side, we must maintain “causality”, i.e. the model must not see the tokens it hasn’t yet predicted (in other words, we cheat-proof the inputs). We achieve this by masking the scaled product of Q and K before the softmax. This “masked” matrix is then used to calculate the final output. This process is called “Masked Multi-head attention”. The mask is an n × n lower-triangular matrix, matching the shape of the Q·Kᵀ score matrix: positions on and below the main diagonal are kept (set to 1), while scores at positions above the main diagonal are replaced with a very large negative value (effectively −∞), so that the softmax assigns them a weight of approximately zero.
Encoder-Decoder Attention: Further, to predict the output on NMT tasks, we need a mapping from tokens in the target sentence to the encoder’s output, i.e. Encoder-Decoder Attention. To derive this, we generate the K and V matrices from the encoder’s output, while the Query (Q) matrix is derived from the output of the decoder’s masked self-attention.
The diagram below helps to visualize the flow of inputs and outputs through the encoders and decoders. The original transformer model stacks multiple encoder and decoder layers, similar to any deep neural network architecture. The output of the final decoder layer is passed through a fully connected layer and a softmax layer to produce the output probabilities of tokens.
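A small sketch of causal masking in plain Python may help here. The score values are made up, and `-1e9` is used as a stand-in for negative infinity, which is a common convention rather than something this article prescribes.

```python
import math

def causal_mask(n):
    """n x n lower-triangular mask: 1 where attention is allowed, 0 elsewhere."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask, neg=-1e9):
    weights = []
    for row, mask_row in zip(scores, mask):
        # Replace disallowed (future) positions with a huge negative number
        masked = [s if m else neg for s, m in zip(row, mask_row)]
        mx = max(masked)
        exps = [math.exp(v - mx) for v in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Made-up scaled Q.K^T scores for a 3-token target sentence
scores = [[0.5, 1.2, -0.3],
          [0.1, 0.4, 2.0],
          [1.0, 0.0, 0.7]]
mask = causal_mask(3)
weights = masked_softmax(scores, mask)
```

After masking, token 0 can only attend to itself, token 1 to tokens 0 and 1, and so on, so no position ever receives information from a later position.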
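To make the source of each matrix concrete, here is a rough sketch of encoder-decoder attention in plain Python. The names `decoder_states` and `encoder_output`, the random weights and the toy sizes are all assumptions for illustration; the only point is which input feeds Q and which feeds K and V.

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_states, encoder_output, W_q, W_k, W_v):
    Q = matmul(decoder_states, W_q)   # queries come from the decoder
    K = matmul(encoder_output, W_k)   # keys come from the encoder
    V = matmul(encoder_output, W_v)   # values come from the encoder
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights

def random_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
src_len, tgt_len, d_model, d_k = 5, 3, 8, 4
encoder_output = random_matrix(src_len, d_model)
decoder_states = random_matrix(tgt_len, d_model)
W_q, W_k, W_v = (random_matrix(d_model, d_k) for _ in range(3))
context, weights = cross_attention(decoder_states, encoder_output, W_q, W_k, W_v)
```

The weight matrix has one row per target token and one column per source token, which is why it can be read as a soft alignment between the two sentences.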
Putting it all together:
Below is the overall architecture of the transformer model from the original “Attention Is All You Need” paper (Vaswani et al., 2017).
Comparison and Results:
Transformers achieved a significant breakthrough in terms of both computational speed and performance on NMT tasks, compared to prior RNN-based (GNMT) and CNN-based (ConvS2S) models. Below is a comparison of some of the important attributes:
In the above table, d represents the size of the hidden dimension, n represents the length of the input sequence and k denotes the kernel size for convolution layers. When we compare per-layer complexity, RNN and CNN models are quadratic in the hidden dimension, i.e. O(n·d²) for recurrent layers and O(k·n·d²) for convolutional layers. Self-attention layers, however, are quadratic only in the input sequence length, i.e. O(n²·d). In NMT tasks and language modelling, n usually tends to be much smaller than d (512 to 1024), so self-attention requires significantly less computation per layer.
Another important comparison, mainly between CNNs and the attention mechanism, is the length of the path a signal must travel between any two tokens. To connect the farthest tokens, CNNs need to stack multiple layers, so the path length is O(n/k) with contiguous kernels and O(log_k(n)) with dilated convolutions. Attention blocks operate on all tokens simultaneously, which reduces the path length to O(1).
From the results below, we can see a significant improvement in terms of both training cost (lower FLOPs) and translation quality (higher BLEU scores).
Further, the self-attention mechanism allows for better interpretability. Below are some samples that show attention heads focusing on content (location), context and the semantic structure of input sequences:
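To get a feel for these orders of magnitude, the per-layer terms can be plugged into a few lines of Python. The sequence length of 50 tokens and kernel size of 3 are assumed values for illustration, and the counts ignore constant factors, so they only indicate relative cost.

```python
def per_layer_ops(n, d, k=3):
    """Leading-order operation counts per layer (constant factors ignored)."""
    return {
        "self_attention": n * n * d,     # O(n^2 * d)
        "recurrent": n * d * d,          # O(n * d^2)
        "convolutional": k * n * d * d,  # O(k * n * d^2)
    }

# A typical sentence-length input: n is much smaller than d
short = per_layer_ops(n=50, d=512)
ratio = short["recurrent"] / short["self_attention"]   # equals d / n

# A long input: once n exceeds d, self-attention becomes the expensive one
long_seq = per_layer_ops(n=2048, d=512)
```

With n = 50 and d = 512 the recurrent term is roughly ten times the self-attention term, while at n = 2048 the relationship flips, which is why the advantage holds only when sequences are short relative to the hidden dimension.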
Conclusion and Final Notes:
No doubt, transformers achieved a major breakthrough in machine translation tasks. Self-attention allows for easier and richer unsupervised learning. Attention blocks, in either the encoder or the decoder, can be stacked or expanded to take advantage of parallel execution on multiple GPUs/TPUs. Coupled with the success of transfer learning in NLP tasks, as demonstrated by ULMFiT, transformers proved to be a ubiquitous and scalable alternative for NLP.
OpenAI’s GPT (2018) and Google’s BERT (2019) were some of the early models that leveraged the representation and scaling power of transformers together with transfer learning. They unleashed a race for large pre-trained language models that could be fine-tuned for a multitude of downstream NLP tasks, ranging from commonsense reasoning to question answering, setting SOTA scores on most of the GLUE/SuperGLUE benchmarks.
References and Credits
- Awesome course content delivered by Rohan Shravan, Zoheb from TSAI.
- “The Illustrated Transformers” and “The Annotated Transformers”
- This superb video on Transformer visualization.
- NIPS blogs and lectures
- Original papers referred to: ConvS2S, Attention Is All You Need
- Lena Voita’s article and the original paper
- Tensor2Tensor GitHub repo
- BertViz GitHub Repo
Hope you liked this blog! Please feel free to share your comments or thoughts about it. If any attributions are missing, please drop a note and I will definitely add them.
PPS: Below is a toy visualization of attention heads, as seen by GPT-2, for the question posed to the humanoid robot. 😀