
Transformers, Attention and NMT — Part 1 (Convolutions)

Rajesh Y
8 min read · Feb 6, 2021


This is the first part of a multi-part series on how Transformers and the attention mechanism have changed the landscape of NMT. In this article we will delve into some of the research papers and architectures that were published just before Transformers. This gives a better understanding of the design challenges and the direction of research at the time.

From Attention to "Attention Is All You Need"

2017 was truly a remarkable year for AI in general:

  1. AlphaGo Zero defeated AlphaGo (which had defeated Lee Sedol in 2016) after being bootstrapped and trained for just 3 days!
  2. GANs(Generative Adversarial Networks) made significant strides — DiscoGANs, CycleGANs and Progressive GANs (accepted in 2018)
  3. PyTorch gained strong popularity within research community and TensorFlow released v1.0
  4. Andrew Ng started Landing.ai, continuing his pursuit to democratize AI research and access.
  5. The "Attention Is All You Need" paper was published by Vaswani et al., opening up the skies for the Transformer's landing.

There is no doubt that Transformers set a whole new genre of models in motion. But let’s review the state of NLP, especially from a Neural Machine Translation perspective, just before this watershed moment.

Seq2Seq with attention, based on RNNs/LSTMs, had become the mainstay and achieved great results on multiple NLP tasks (Neural Machine Translation, language modeling, etc.). In fact, GNMT, Google's production-grade NMT system based on LSTMs and attention (and significant engineering), replaced traditional statistical and phrase-based models. (Feel free to visit my previous blog for a quick recap of Bahdanau attention.)

However, there were two crucial challenges with RNN models:

  1. Inherently sequential processing.
  2. Challenges with long-range dependencies between tokens.

So how bad could these challenges be? Let’s take a cue from an article on challenges in speech synthesis from Google DeepMind:

Researchers usually avoid modelling raw audio because it ticks so quickly: typically 16,000 samples per second or more, with important structure at many time-scales. Building a completely autoregressive model, in which the prediction for every one of those samples is influenced by all previous ones (in statistics-speak, each predictive distribution is conditioned on all previous observations), is clearly a challenging task.

16,000 samples for just one second! Now imagine an RNN stepping over them one at a time while also needing to “remember” an utterance from just 2 seconds ago (approximately 32,000 steps 😶). Further, as sequences grow, the memory requirement just keeps getting worse.

Overcoming Sequential Processing

An RNN can unroll only one step at a time. Source: NIPS Slide Set, 2017

For almost all sequential data, we can assume there is an autoregressive dependency, i.e. the n-th sample has some relation to the previous n-1 inputs. Typically, an RNN ingests one sample, updates its hidden state and/or generates an output, and then starts over again.

On NMT tasks with a standard encoder-decoder architecture, the encoder’s job is to extract a hidden representation from the underlying data. So, instead of having to loop over one time-step at a time, would it be possible to “batch” time-steps? Such a mechanism would directly reduce the number of iterations needed for one pass over an ultra-long sequence. Also, we know that every token (word/sub-word/character) is converted into an embedding before being fed to a model. So, what if we could add the token's position information into the embedding, giving the model a way to understand the relative ordering of tokens?

Word + Position Embeddings

With position information embedded and batching in the time domain, we could choose convolutional neural networks. The important benefit of CNNs is their ability to perform parallel computation. (Of course, we could use a feedforward network too! Please feel free to leave a comment on why or why not.)
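To make this concrete, here is a minimal PyTorch sketch (the class name and dimensions are illustrative, not taken from any of the papers' codebases) of adding a learned position embedding to the word embedding, which is the approach ConvS2S takes:

```python
import torch
import torch.nn as nn

class WordPositionEmbedding(nn.Module):
    """Adds a learned position embedding to the word embedding (ConvS2S-style)."""
    def __init__(self, vocab_size, max_len, embed_dim):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, embed_dim)
        self.pos_emb = nn.Embedding(max_len, embed_dim)

    def forward(self, tokens):                        # tokens: (batch, seq_len)
        positions = torch.arange(tokens.size(1), device=tokens.device)
        positions = positions.unsqueeze(0).expand_as(tokens)
        # the element-wise sum gives the convolutional layers a sense of token order
        return self.word_emb(tokens) + self.pos_emb(positions)

emb = WordPositionEmbedding(vocab_size=10000, max_len=512, embed_dim=256)
x = emb(torch.randint(0, 10000, (2, 20)))             # -> (2, 20, 256)
```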

On the decoder side, our task is to generate the next token. Here we need to take care of “causality”, i.e. the decoder must not see tokens that it has not yet predicted (else the model would simply cheat during training by constantly predicting just the last input 🤓). We will explore how each model “cheat-proofs” the decoder inputs.

Efficient modelling of long-range dependencies between tokens

With speech synthesis, the problem of long-range dependencies is extremely acute. On regular NMT tasks, plain Seq2Seq models had the same problem even with sequence lengths of around 50 words. The attention network alleviated this by allowing the model to focus on specific inputs. But attention weights are learnt during training and are still restricted by the sequential nature of RNNs.

To model such long-range dependencies efficiently and reduce training time, multiple papers (WaveNet, ByteNet and ConvS2S) adopted stacked convolutional layers instead of RNNs. As noted above, convolutional operations enable parallel computation and act as strong feature extractors. Stacked convolution layers also provide a sub-linear path to reach tokens at long distances. We analyze two models that use convolutions, although with significantly different design criteria: ByteNet and ConvS2S.

ByteNet — Neural Machine Translation in Linear Time, Kalchbrenner et al.

The core requirements that this architecture intends to meet are:

  1. “linear”, possibly constant, complexity w.r.t. the sequence length of the input data
  2. “resolution preserving”: the size of the hidden representation must be proportional to the amount of information in the source input
  3. “shorter path” between tokens, i.e. an efficient path to reach the farthest token if the network needs that token for prediction.

Surprisingly, the ByteNet paper removes attention altogether from its architecture. Further, for NMT, it operates at the character level.

Much like WaveNet, ByteNet uses dilated convolutions. At every layer, the dilation rate is increased by a factor of 2 (i.e. 1, 2, 4, 8, and so on). This helps the feature maps build a wider receptive field much faster than regular convolutions would. Further, by keeping the output size the same as the input size at every layer, the model preserves the input resolution (i.e. the source representation is not compressed). The diagram below helps to visualize this better:

At every layer the dilation is increased by a factor of 2, allowing faster growth of the receptive field. This reduces the path length between distant tokens from O(n) to O(log_k(n)). Source: DeepMind
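A minimal PyTorch sketch of such a stack of dilated 1-D convolutions (channel sizes and depth are illustrative, and this omits the residual blocks of the actual ByteNet), showing how the receptive field grows exponentially with depth while the sequence length is preserved:

```python
import torch
import torch.nn as nn

channels, kernel_size, num_layers = 64, 3, 4

# dilation doubles at every layer: 1, 2, 4, 8
layers = []
for i in range(num_layers):
    dilation = 2 ** i
    layers.append(nn.Conv1d(channels, channels, kernel_size,
                            dilation=dilation,
                            padding=dilation * (kernel_size - 1) // 2))  # output length == input length
stack = nn.Sequential(*layers)

x = torch.randn(1, channels, 100)   # (batch, channels, seq_len)
print(stack(x).shape)               # torch.Size([1, 64, 100]) -- "resolution preserving"

# Receptive field after L layers with kernel k and dilations 1, 2, ..., 2^(L-1):
# 1 + (k - 1) * (2**L - 1) = 1 + 2 * 15 = 31 positions, vs. 1 + (k - 1) * L = 9 without dilation
```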

To enable a deeper architecture, both the encoder and decoder layers use residual blocks, with each block also containing layer normalization, 1x1 convolutions and ReLU activations. The output of the encoder is fed directly into the decoder along with the “cheat-proofed” target inputs. To achieve this “cheat-proofing”, the decoder's convolutions are masked, i.e. the kernel weights that would cover future tokens are zeroed out. The paper also proposes zero-padding at the beginning of the target input sequence. The overall working can be visualized below:

Source: ByteNet (Kalchbrenner, 2016), nal.ai
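A rough sketch of the kernel-masking idea used in the ByteNet decoder (the class below is illustrative, not the paper's implementation): the kernel weights that would cover future time-steps are zeroed out, so the output at position t never depends on positions after t.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConv1d(nn.Conv1d):
    """1-D convolution whose kernel weights covering future time-steps are zeroed out."""
    def __init__(self, channels, kernel_size, dilation=1):
        padding = dilation * (kernel_size - 1) // 2     # keeps the sequence length unchanged
        super().__init__(channels, channels, kernel_size,
                         dilation=dilation, padding=padding)
        mask = torch.ones_like(self.weight)
        mask[:, :, kernel_size // 2 + 1:] = 0           # zero out the "future" kernel positions
        self.register_buffer("mask", mask)

    def forward(self, x):                               # x: (batch, channels, seq_len)
        return F.conv1d(x, self.weight * self.mask, self.bias,
                        self.stride, self.padding, self.dilation)

conv = MaskedConv1d(channels=64, kernel_size=3)
y = conv(torch.randn(1, 64, 10))                        # y[..., t] sees only x[..., :t+1]
```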

The output of the decoder’s final conv layer is, as usual, passed through a softmax layer to obtain output probabilities.

Convolutional Sequence to Sequence Learning, by Gehring et al.

While ByteNet relies on a resolution-preserving mechanism and dilated convolutions, ConvS2S uses the following:

  1. Regular convolution layers over word + position embeddings, with skip connections to allow deeper networks.
  2. Gated Linear Units (GLU) as the non-linearity in both encoder and decoder layers.
  3. Multi-step attention at every decoder layer.

On the encoder side, the position-aware embeddings are first passed through a fully-connected layer to project them to a larger dimension. These higher-dimensional vectors are then passed through stacked convolutional layers with residual connections.

Each convolutional block also has a GLU activation. GLU replicates the gating mechanism of LSTMs to selectively retain information from a feature vector. The convolution produces twice the number of output channels; the output is split into two halves, A and B, a sigmoid gate is computed from B, and an element-wise product with A gives the final result. Mathematically, it can be represented as:

GLU([A; B]) = A ⊗ σ(B)
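A minimal sketch of GLU; PyTorch's torch.nn.functional.glu implements the same split-and-gate operation:

```python
import torch
import torch.nn.functional as F

def glu(x, dim=1):
    """Split x into halves A and B along `dim` and return A * sigmoid(B)."""
    a, b = x.chunk(2, dim=dim)
    return a * torch.sigmoid(b)

x = torch.randn(4, 2 * 64, 30)               # a conv output with 2*64 channels
out = glu(x)                                  # -> (4, 64, 30), half the channels
assert torch.allclose(out, F.glu(x, dim=1))   # matches PyTorch's built-in GLU
```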

The encoder layer and its residual connections can be viewed below:

ConvS2S Encoder Network
Residual Connections across multiple conv layers
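Putting the pieces together, a hedged sketch of a single ConvS2S-style encoder block (sizes are illustrative; the real model additionally uses dropout, weight normalization and padding masks):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvEncoderBlock(nn.Module):
    """Conv -> GLU -> scaled residual, keeping the sequence length unchanged."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        # the convolution outputs 2*channels so that GLU can split and gate
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size,
                              padding=(kernel_size - 1) // 2)

    def forward(self, x):                       # x: (batch, channels, seq_len)
        residual = x
        x = F.glu(self.conv(x), dim=1)          # back to `channels` after the gating
        return (x + residual) * math.sqrt(0.5)  # residual sum scaled by sqrt(0.5), as in the paper

block = ConvEncoderBlock(channels=256)
z = block(torch.randn(2, 256, 25))              # -> (2, 256, 25)
```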

On the decoder side, to “cheat-proof” the inputs, the target sequences are padded at the beginning with zero-value vectors so that a convolution at position t cannot see positions after t. The convolution layers in the decoder are otherwise similar to those of the encoder.

ConvS2S Decoder layer. Padding is applied at the beginning of the sequence
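A small sketch of this left-padding trick (a simplified version of what the ConvS2S decoder does): padding the input with k-1 zero vectors on the left makes an ordinary convolution causal.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

channels, kernel_size = 64, 3
conv = nn.Conv1d(channels, channels, kernel_size)   # no built-in padding

x = torch.randn(2, channels, 10)                    # decoder input embeddings
x_padded = F.pad(x, (kernel_size - 1, 0))           # pad only at the beginning (left)
y = conv(x_padded)                                  # -> (2, 64, 10)

# y[..., t] depends only on x[..., :t+1]; the convolution never sees future tokens
```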

Attention is calculated at every decoder layer using the encoder outputs, the current decoder state and the target embeddings. This multi-hop attention allows better modeling of long-range dependencies as well as attention to word/token-specific details. The visualization below helps to explain how the attention and the output are calculated.

Attention calculation and final output generation. Source: FairSeq
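A hedged sketch of one such attention step (simplified: the real model projects the decoder state, adds the previous target embedding, and re-projects the context before adding it back):

```python
import torch
import torch.nn.functional as F

batch, src_len, tgt_len, dim = 2, 12, 7, 256

encoder_out = torch.randn(batch, src_len, dim)    # z_j: encoder conv outputs
encoder_emb = torch.randn(batch, src_len, dim)    # e_j: source input embeddings
decoder_state = torch.randn(batch, tgt_len, dim)  # d_i: decoder state (+ previous target embedding)

# dot-product scores between every decoder position and every source position
scores = torch.bmm(decoder_state, encoder_out.transpose(1, 2))   # (batch, tgt_len, src_len)
attn = F.softmax(scores, dim=-1)

# the context is a weighted sum over z_j + e_j, as in the paper
context = torch.bmm(attn, encoder_out + encoder_emb)             # (batch, tgt_len, dim)
decoder_state = decoder_state + context           # passed on to the next decoder block
```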

The output of the decoder’s final residual layer is, however, first passed through another fully-connected layer before applying a softmax to obtain the final probabilities.

Results and Comparison

Below are some of the results from original papers:

Results from ConvS2S paper

ConvS2S seems to have in fact surged past the carefully engineered and RL-optimized GNMT, setting a new state of the art (although not for long!).

Comparison of training cost of various models

We can also see a significant reduction in training cost. It is important to note that ByteNet itself achieved a SOTA result on the character-level machine translation task with the WMT 2015 English-German dataset. Another striking point is that ConvS2S achieves better performance despite using multiple attention blocks (the authors claim just a 4% overhead).

Overall, the strategies adopted by these models provided significant advantages in both training cost and overall performance. Although Transformers take a drastically different approach (no recurrence and no convolutions!), we can find common design concepts such as position encodings, multiple attention blocks and an inherently parallel way of approaching sequential data. RNNs have great representational power, but given their challenges, these models proved to be reasonably well-performing substitutes.

Transformer Dominance. Source: Machine Translation on WMT2014 English-German. Paperswithcode

While Transformers continue to set new benchmarks (the SOTA scores above are dotted with Transformer-based models 😀), there is still research underway on using convolutions, on their own or in tandem with the Transformer architecture.

References and Credits:

  1. Awesome course content delivered by Rohan Shravan, Zoheb from TSAI
  2. DeepMind Blogs and articles
  3. NIPS blogs and lectures
  4. Original papers referred to: ConvS2S, ByteNet, WaveNet, Attention Is All You Need, GNMT.

Hope you liked this blog! Please feel free to provide feedback on any aspect of it. If any attributions are missing, please drop a note and I will definitely add them.

P.S.: Very soon, I will also be releasing a PyTorch-based code walk-through for both the ConvS2S and Transformer models to explore these concepts from an implementation perspective. So please watch this space for more!

Update: A code walk-through on Conv Seq2Seq is available now 🤓
