Attention in Neural Networks

In this article, we will understand how attention works in neural networks for tasks such as Neural Machine Translation and Image Captioning, a precursor to the current state-of-the-art and the super-exciting results unveiled by GPT-3.

We start with a quick refresher of basic building blocks.

RNNs (Recurrent Neural Networks) enabled the use of neural networks to model time-series or sequential data, e.g., predicting the next word or character given a previous set of words.

LSTMs (Long Short-Term Memory networks) and GRUs (Gated Recurrent Units) strengthened the concept of “memory”, thereby overcoming the long-range dependency problems of traditional RNNs.

For “transduction problems”, problems that involve generating an output sequence given an input sequence, Sequence-to-Sequence models have been the mainstay.

The classics, Christopher Olah’s blog on LSTMs and Andrej Karpathy’s blog “The Unreasonable Effectiveness of Recurrent Neural Networks”, are a must-read to understand the whole new world kick-started by these architectures.

Alammar, Jay (2018). Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention).

Inside Sequence-to-Sequence Models

Sequence-to-sequence(seq2seq) models consist of:

  1. Encoder Network — an RNN/CNN network whose task is to generate a meaningful context from the input data.
  2. Decoder Network — a separate RNN network that generates the desired output (e.g., musical notes, words, images, syllables, captions). The generation task is based on the context fed directly from the encoder.
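The encoder–decoder split above can be sketched with a toy NumPy RNN. This is a minimal illustration, not a trained model: the tanh cell, the random weights, and all sizes are assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_encoder(inputs, W_x, W_h):
    """Toy vanilla-RNN encoder: the last hidden state is the fixed context."""
    h = np.zeros(W_h.shape[0])
    for x in inputs:                       # one step per embedded input token
        h = np.tanh(W_x @ x + W_h @ h)
    return h                               # fixed-size context vector

def rnn_decoder(context, W_c, W_h, W_y, steps):
    """Toy decoder: unrolls from the context, emitting output scores per step."""
    h, outputs = context, []
    for _ in range(steps):
        h = np.tanh(W_c @ context + W_h @ h)
        outputs.append(W_y @ h)            # unnormalised scores for each step
    return outputs

d = 4                                      # hidden size (arbitrary)
W_x, W_h = rng.normal(size=(d, d)), rng.normal(size=(d, d))
W_c, W_y = rng.normal(size=(d, d)), rng.normal(size=(3, d))

inputs = [rng.normal(size=d) for _ in range(3)]   # e.g. 3 embedded tokens
context = rnn_encoder(inputs, W_x, W_h)
outputs = rnn_decoder(context, W_c, W_h, W_y, steps=2)
```

Note that `context` has the same fixed shape no matter how many input tokens the encoder consumes — which is exactly the property discussed next.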

One of the critical aspects of seq2seq networks is the “context”. The context can be visualized as a simple fixed-size vector of floating-point values. Let’s consider an English-to-French translation task for a hypothetical, well-trained seq2seq model.

The French version ignored the catchy repeated phrases :(

The context vector, as you would have guessed, must be compact yet rich enough to capture the nuances of the English text, whatever its length, in order to output text in French.

Opening sentence from Shashi Tharoor’s Oxford Union speech, 2015

While it may not be quite so bad in practice, the core challenge faced by seq2seq models was well analyzed in this paper, circa 2014:

“This analysis suggests that the current neural translation approach has its weakness in handling long sentences. The most obvious explanatory hypothesis is that the fixed-length vector representation does not have enough capacity to encode a long sentence with complicated structure and meaning. In order to encode a variable-length sequence, a neural network may “sacrifice” some of the important topics in the input sentence in order to remember others.”

Below are BLEU score plots from the paper, which show a rapid decline (lower is worse) with increasing sentence length.

On the Properties of Neural Machine Translation: Encoder–Decoder Approaches

From Memory to Attention

Although attention as a concept was already around in 2014 and was widely applied to image classification tasks, Bahdanau et al. demonstrated the capabilities and results of using “soft attention” for Neural Machine Translation in this paper.

The core concept is to enable the decoder not just to look at the previous words, but also to focus on the specific ones needed for its current prediction. This is achieved by introducing an “Attention Network”, which generates a new context vector for every time step.

So what is this Attention Network? It is a fully-connected (FC) layer followed by a softmax function, and it sits right between the encoder and decoder. Our new architecture looks as below:

Seq2Seq with Attention layer

So how does this additional mechanism help? To understand this, we have to delve a bit deeper into the architecture. For simplicity, we will assume there are 3 time steps as input to the encoder, e.g., “How are you?”. The job of the decoder is to predict the equivalent translation in a different language. We will also use the following notation:

{T1, T2, … Tn} — Time steps
{X1, X2, … Xn} — Inputs to the encoder at each time step Ti
{h1, h2, … hn} — Hidden states of the encoder at each time step Ti
{S0, S1, … Sn} — Hidden states of the decoder at each time step Ti
{c1, c2, … cn} — Context vectors
{a1, a2, … an} — Attention weights calculated in the attention module
{Y1, Y2, … Yn} — The decoder’s final prediction for each time step Ti

The Attention Network performs the following before time step 1 of the decoder:

  1. Use (h1, h2, h3) and S0 (the decoder’s initial hidden state) as input. S0 is initialized to 0.
  2. Perform a forward pass through the FC layer followed by a softmax activation to generate the attention weights (a1, a2, a3).
  3. Sum over the products of {a1, h1}, {a2, h2} and {a3, h3} to generate the final context vector for the time step: c1.

Attention Network Operation
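The three steps above can be sketched in NumPy. The additive (tanh-based) scoring and the weight shapes here are illustrative assumptions in the spirit of Bahdanau-style attention, not the paper’s exact parameterization:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_step(enc_states, s_prev, W, v):
    """One pass of the attention network: score each encoder state against
    the previous decoder state through an FC layer, softmax the scores into
    weights, then take the weighted sum of encoder states as the context."""
    scores = np.array([v @ np.tanh(W @ np.concatenate([h, s_prev]))
                       for h in enc_states])
    a = softmax(scores)                                # weights (a1, a2, a3)
    c = sum(ai * hi for ai, hi in zip(a, enc_states))  # context vector
    return a, c

d = 4
enc_states = [rng.normal(size=d) for _ in range(3)]    # (h1, h2, h3)
s0 = np.zeros(d)                                       # S0, initialized to 0
W, v = rng.normal(size=(d, 2 * d)), rng.normal(size=d)

a, c1 = attention_step(enc_states, s0, W, v)
```

The softmax guarantees the weights sum to 1, so `c1` is a convex combination of the encoder states — the decoder receives a blend that emphasizes whichever inputs scored highest.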

The context vector obtained above is now fed as input to the first stage of the decoder (Decoder Step-1). The decoder in turn produces the output Y1, which will be the first translated word. Given c1 and S0, the previous hidden-state information, this layer can now make a more informed “choice of words” as output.

Decoder Step-1

Below is the operation of Step-2. Now the concatenated pairs [S1, hi] are used as input to derive the weights and, eventually, the context vector c2. Decoder Step-2 then acts on S1 and c2 to decide its output Y2.

Decoder Step-2

The final step similarly uses S2 and (h1, h2, h3) to derive the final context vector c3, which, concatenated with S2, is used to predict Y3.
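Putting the three decoder steps together, the whole loop can be sketched as below. This is a self-contained toy: the tanh decoder cell, the output projection, and all weight shapes are assumptions for illustration, and the random weights are untrained, so the predicted token indices are meaningless.

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(enc_states, s_prev, W, v):
    """Compute attention weights against s_prev and return the context ci."""
    scores = np.array([v @ np.tanh(W @ np.concatenate([h, s_prev]))
                       for h in enc_states])
    a = softmax(scores)
    return sum(ai * hi for ai, hi in zip(a, enc_states))

d = 4
enc_states = [rng.normal(size=d) for _ in range(3)]  # h1..h3 from the encoder
W_att, v = rng.normal(size=(d, 2 * d)), rng.normal(size=d)
W_dec = rng.normal(size=(d, 2 * d))                  # toy decoder cell
W_out = rng.normal(size=(5, d))                      # projects S_i to scores

s = np.zeros(d)                                      # S0
predictions = []
for _ in range(3):                                   # decoder steps 1..3
    c = attend(enc_states, s, W_att, v)              # fresh context each step
    s = np.tanh(W_dec @ np.concatenate([c, s]))      # S_i from [c_i, S_(i-1)]
    predictions.append(int(np.argmax(W_out @ s)))    # Y_i (token index)
```

The key difference from the plain seq2seq loop is inside the `for`: a new context vector is recomputed at every step from the current decoder state, rather than reusing one fixed context throughout.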

Decoder Step-3

Further, it is important to note that the Attention Network is trained along with the encoder and decoder, which allows gradients to flow through its FC layer.

Adding attention does something really fascinating! Let’s look at a few examples:

English to French. Figure from:
Speech to Text Example. Figure from
Figure from: “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”:

The models now not only remember sequences, they also make an informed judgement based on selective aspects of the previous sequence, i.e., they learn to pay attention. This mechanism, albeit in more advanced forms, is a key aspect of complex models such as Transformers.

P.S: This is my first blog on Machine/Deep Learning, an area which I am currently exploring heavily. I would love to hear your feedback and suggestions for improvement. I have tried to attribute credit to the original authors and creators. Please drop me a message in case any attributions are missing.

References and Credits:

  1. Chris Olah and Shan Carter awesome visualizations.
  2. Attention in RNNs
  3. On the Properties of Neural Machine Translation: Encoder–Decoder Approaches
  4. Neural Machine Translation by Jointly Learning to Align and Translate
  5. The Annotated Encoder Decoder
  6. Alammar, Jay (2018). Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention)

P.P.S: You can refer to a sample implementation of Bahdanau Attention on my GitHub Link. It is basically a retry, with some tweaks, of [5].
