Code-First Convolutional Seq2Seq

Rajesh Y
Feb 8, 2021

In this article, we will build a Convolutional Seq2Seq model for NMT (German to English) in PyTorch. For a better understanding of the model, please refer to my previous blog here. The full code is accessible at the end of the page.

The Model — a quick summary

As the name suggests, Convolutional Seq2Seq uses convolutions instead of regular RNNs. This overcomes the highly sequential processing inherent to RNNs. Further, by stacking multiple convolution blocks, the receptive field grows quickly, allowing the model to capture long-range dependencies between words. The convolution blocks also use residual connections for better gradient flow. For non-linearity, these models employ GLUs (Gated Linear Units), which can be expressed as:

GLU Output = Element-Wise Product{A, sigmoid(B)}, where the input (with twice the number of channels) is split channel-wise into two halves A and B
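As a quick standalone illustration (not part of the model code), PyTorch exposes this gating directly as F.glu, which splits its input in half along a chosen dimension and gates one half with the sigmoid of the other:

import torch
import torch.nn.functional as F

# A conv block in ConvS2S outputs 2 * hid_dim channels; F.glu splits them
# into halves A and B along dim=1 and returns A * sigmoid(B), so the
# output is back to hid_dim channels.
hid_dim = 512
x = torch.randn(8, 2 * hid_dim, 20)   # [batch size, 2 * hid dim, seq len]
out = F.glu(x, dim=1)                 # [batch size, hid dim, seq len]
print(out.shape)                      # torch.Size([8, 512, 20])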

On the decoder side, the inputs are padded with k-1 zero vectors to maintain causality (i.e. “cheat-proofing”). Every decoder layer calculates attention using a key-value memory mechanism. We will see further in the code how this is handled.

ConvS2S Full Model

Pre-requisites:

This code walkthrough is based on Python 3.6+ and PyTorch 1.7 (cu101) on Colab with a GPU. If you are using your own JupyterLab/Notebook environment, you will need to take care of the following requirements:

Python3.6+
torch
torchtext
spacy ### Both EN and DE models
numpy

Once you have spacy available, you can download and install the “de” and “en” models as below:

!python -m spacy download en
!python -m spacy download de

We will use the torchtext module from PyTorch for datasets and pre-processing. Since we are dealing with a translation task, we will use Multi30k, which contains about 30,000 German sentences and their English translations.

Imports and Dataset handling:

Most of the below imports are fairly standard. We also set the random seed to enable reproducible results.

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torchtext.datasets import Multi30k
from torchtext.data import Field, BucketIterator
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import spacy
import numpy as np
import random
import math
SEED = 0xc001daab
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

Tokenizers, datasets and iterators:

Our pipeline is simple and contains the following:

  1. Spacy tokenizer to generate tokens on uncased inputs
  2. Vocabulary building with torchtext utilities (we keep words that have a frequency of at least 2)
  3. Training, test and validation dataset creation

This simplicity is mainly enabled by the awesome torchtext datatype (a class to be precise) Field, which supports everything ranging from pre-processing, tokenizing and padding to converting into tensors and generating batches. For complete documentation of the various fields, please refer to this page.

Below is the code for generating the dataset with sample output:

Creating datasets
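A minimal sketch of this step, using the legacy torchtext Field API (the names SRC and TRG are illustrative and may differ from the gist):

import spacy
from torchtext.data import Field
from torchtext.datasets import Multi30k

spacy_de = spacy.load('de')
spacy_en = spacy.load('en')

def tokenize_de(text):
    # Tokenize German text into a list of token strings
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    # Tokenize English text into a list of token strings
    return [tok.text for tok in spacy_en.tokenizer(text)]

SRC = Field(tokenize=tokenize_de, init_token='<sos>', eos_token='<eos>',
            lower=True, batch_first=True)
TRG = Field(tokenize=tokenize_en, init_token='<sos>', eos_token='<eos>',
            lower=True, batch_first=True)

train_data, valid_data, test_data = Multi30k.splits(exts=('.de', '.en'),
                                                    fields=(SRC, TRG))
print(vars(train_data.examples[0]))  # e.g. {'src': [...], 'trg': [...]}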

We pass batch_first = True to the Field objects because we will eventually feed this data to CNN models that expect tensors of shape [batch size, sequence length]. Next, we build the vocabulary using the build_vocab function, which converts each token into a numerical representation. The vocab object contains stoi and itos mappings that translate a token to its numerical value and vice-versa. It also has a handy predefined counter, freqs, for getting information on overall counts and distribution.

Vocab object generation
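Roughly, the vocabulary step looks like this (the min_freq of 2 comes from the pipeline described above; the rest is a sketch):

# Keep only tokens that appear at least twice; everything else maps to <unk>
SRC.build_vocab(train_data, min_freq=2)
TRG.build_vocab(train_data, min_freq=2)

print(len(SRC.vocab), len(TRG.vocab))     # vocabulary sizes
print(TRG.vocab.stoi['a'])                # token -> index
print(TRG.vocab.itos[10])                 # index -> token
print(TRG.vocab.freqs.most_common(5))     # overall token counts/distribution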

Next, we generate the iterators using the BucketIterator class (which batches sentences of similar length) and map them to the GPU if available. We will use a batch_size of 128; this can be tweaked depending on memory availability. Each iterator object contains two fields [‘src’, ‘trg’] that represent the actual tensors of the respective source and target sentences.

Create Iterators with batch_size of 128.
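A sketch of the iterator creation:

import torch
from torchtext.data import BucketIterator

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
BATCH_SIZE = 128

# BucketIterator groups sentences of similar length to minimise padding
train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size=BATCH_SIZE,
    device=device)

batch = next(iter(train_iterator))
print(batch.src.shape, batch.trg.shape)   # [batch size, sequence length]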

Model Implementation:

We first define the top-level seq2seq class (ConvS2S) to contain an encoder and a decoder object passed as parameters to __init__. With PyTorch, every custom layer or model needs to subclass nn.Module and implement the forward method. The forward method takes the actual tensors over which the computation will be performed, i.e. the [‘src’, ‘trg’] we defined in the section above. Overall, the code looks as follows:
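A condensed sketch of this top-level class (the gist's version may differ in minor details):

import torch.nn as nn

class ConvS2S(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, src, trg):
        # encoder_conved:   output of the last encoder conv block
        # encoder_combined: encoder_conved + source embeddings (element-wise)
        encoder_conved, encoder_combined = self.encoder(src)

        # output:    predictions over the target vocabulary
        # attention: attention weights over the source for each target position
        output, attention = self.decoder(trg, encoder_conved, encoder_combined)
        return output, attention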

We see that the encoder returns two vectors (encoder_conved and encoder_combined) that are passed as input to the decoder. The decoder in turn returns the final probabilities (output) and the attention vectors (attention). During training, the output vector will be used for calculating the loss and gradients. Next, we define the Encoder and Decoder classes.

Encoder Class Definition:

The diagram below provides a quick view of the various layers in the encoder. The dimensions at every layer are listed on the left side.

Encoder Schematic. Only 2 blocks are shown for brevity; in the code the number of blocks is a hyperparameter.

For the encoder we need to take care of the following:

  1. Positional and word embeddings as inputs (Lines #18–20, #49–55 in the gist). We use an embedding size of 256, so every word or token is converted to a 256-dimensional vector.
  2. We use 1-D convolutions (kernel size = 3), and to keep the output dimensions constant we need to pad the inputs, as every convolution would otherwise reduce the output size by kernel_size - 1. Also, since the GLU activation halves the dimension, the conv output channels are doubled to keep the dimensions consistent across layers. Both of these can be managed while defining the conv block (Lines #26–30).
  3. We need residual connections across convolution blocks (Line #75).
  4. The output should consist of two vectors: the result of the convolutions, and the element-wise sum of that result and the source embedding (Lines #85–88).
  5. We also add dropout at the embedding and conv layers (Lines #55, #69).

Note: PyTorch’s dimension ordering differs between CNN layers and RNN/Embedding/FC layers. So, to align the inputs, we use the permute function on the respective tensors whenever we need to pass them to conv layers or perform element-wise operations (e.g. Lines #63, #87).

Encoder Class definition
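For reference, here is a condensed sketch of an encoder along these lines. The numbered comments map back to the checklist above; hyperparameter names and the residual scaling by sqrt(0.5) (which comes from the original paper) are my assumptions, and the gist's line numbers refer to its fuller version:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers, kernel_size,
                 dropout, device, max_length=100):
        super().__init__()
        assert kernel_size % 2 == 1, "kernel size must be odd for symmetric padding"
        self.device = device
        self.scale = torch.sqrt(torch.FloatTensor([0.5])).to(device)

        # 1. word + positional embeddings
        self.tok_embedding = nn.Embedding(input_dim, emb_dim)
        self.pos_embedding = nn.Embedding(max_length, emb_dim)

        self.emb2hid = nn.Linear(emb_dim, hid_dim)
        self.hid2emb = nn.Linear(hid_dim, emb_dim)

        # 2. conv blocks: doubled output channels (for GLU) and padding to
        #    keep the sequence length constant
        self.convs = nn.ModuleList([
            nn.Conv1d(in_channels=hid_dim, out_channels=2 * hid_dim,
                      kernel_size=kernel_size, padding=(kernel_size - 1) // 2)
            for _ in range(n_layers)])
        self.dropout = nn.Dropout(dropout)

    def forward(self, src):
        batch_size, src_len = src.shape
        pos = torch.arange(0, src_len).unsqueeze(0).repeat(batch_size, 1).to(self.device)

        # 5. dropout on the embeddings
        embedded = self.dropout(self.tok_embedding(src) + self.pos_embedding(pos))

        conv_input = self.emb2hid(embedded).permute(0, 2, 1)   # [batch, hid dim, src len]
        for conv in self.convs:
            conved = F.glu(conv(self.dropout(conv_input)), dim=1)
            conved = (conved + conv_input) * self.scale        # 3. residual connection
            conv_input = conved

        # 4. two outputs: the convolved result and its sum with the embeddings
        conved = self.hid2emb(conved.permute(0, 2, 1))          # [batch, src len, emb dim]
        combined = (conved + embedded) * self.scale
        return conved, combined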

The encoder needs to output both the convolved output and the combined output (the element-wise sum of the convolved output and the embedding) because the attention mechanism uses both of these while deriving the attention weights. Both outputs have the embedding dimension.

Decoder Class Definition:

Decoder Schematic

Almost all the considerations from the encoder apply to the decoder as well, with some additional ones:

  1. On the decoder side, we need to “cheat-proof” the target inputs. This is done by adding zero vectors at the beginning of the input (Lines #121–125); a short standalone sketch of this padding follows the gist below.
  2. Attention needs to be calculated at every convolution layer after the GLU (Lines #140–144). We will discuss the attention calculation in the next section.
  3. The output of the final conv layer is passed first through a fully-connected layer to convert from the hidden dimension to the embedding dimension (the reverse of the first hidden layer in the encoder) and then through the softmax/output layer to obtain the final probabilities (Lines #155–159).
Decoder Class definition
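To make the “cheat-proofing” in point 1 concrete, here is a small standalone sketch of the causal padding (just the idea, not the full decoder from the gist):

import torch
import torch.nn as nn
import torch.nn.functional as F

batch_size, hid_dim, trg_len, kernel_size = 4, 512, 7, 3

# No built-in padding here: we pad manually, and only on the left
conv = nn.Conv1d(hid_dim, 2 * hid_dim, kernel_size)
conv_input = torch.randn(batch_size, hid_dim, trg_len)

# Prepend kernel_size - 1 zero vectors so that position t can only
# "see" positions <= t after the convolution
padding = torch.zeros(batch_size, hid_dim, kernel_size - 1)
padded = torch.cat((padding, conv_input), dim=2)

conved = F.glu(conv(padded), dim=1)
print(conved.shape)   # [4, 512, 7]: same length as the input, but causal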

Attention Mechanism

We will use the calculate_attention function defined in the gist above 👆. It does the following:

  1. It takes as input the target embedding vector, the decoder’s convolved output (conved), and the encoder’s convolved and combined vectors.
  2. The decoder’s convolved output has a higher dimension than the embedding dimension, so we first align them through a fully-connected layer and perform an element-wise addition with the target embedding (Lines #45–49).
  3. Next, we perform a matrix multiplication between the above output and the encoder’s convolved vector (encoder_conved). This gives a correlation between target and source positions (Line #53). We apply a softmax to this matrix to obtain the attention weights (Line #57).
  4. As per the original paper, these attention weights act as keys, and the respective values are obtained from the encoder_combined vector through matrix multiplication (Line #61). This allows for capturing both long-range dependencies and word-level focus. The next step is to convert the result back to the hidden dimension size and add it to the decoder’s convolved output (Line #71).
  5. We return the above output along with the attention weights. Below is a visualization of these operations, followed by a code sketch:
Attention Operations
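A function-style sketch of these steps (in the gist this sits alongside the decoder; attn_hid2emb and attn_emb2hid are assumed to be the decoder's linear layers and scale the usual sqrt(0.5) factor):

import torch
import torch.nn.functional as F

def calculate_attention(embedded, conved, encoder_conved, encoder_combined,
                        attn_hid2emb, attn_emb2hid, scale):
    # embedded:         [batch, trg len, emb dim]   target embeddings
    # conved:           [batch, hid dim, trg len]   decoder conv output (after GLU)
    # encoder_conved:   [batch, src len, emb dim]
    # encoder_combined: [batch, src len, emb dim]

    # 2. project the conv output to the embedding dim, add the target embedding
    conved_emb = attn_hid2emb(conved.permute(0, 2, 1))
    combined = (conved_emb + embedded) * scale

    # 3. score against encoder_conved, then softmax -> attention weights
    energy = torch.matmul(combined, encoder_conved.permute(0, 2, 1))  # [batch, trg len, src len]
    attention = F.softmax(energy, dim=2)

    # 4. read values from encoder_combined using the weights, project back to
    #    the hidden dim and add to the decoder conv output
    attended = attn_emb2hid(torch.matmul(attention, encoder_combined))
    attended_combined = (conved + attended.permute(0, 2, 1)) * scale

    # 5. return the attention weights along with the attended output
    return attention, attended_combined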

Training and Evaluation functions:

Our final model is described below (a whopping 37M parameters!). We will use cross-entropy as the loss function and the Adam optimizer with the default learning rate.

Final Model with loss functions and optimizer
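A sketch of the loss and optimizer setup (ignoring the padding index in the loss is my assumption of the usual practice for this setup, not something stated above):

import torch.nn as nn
import torch.optim as optim

# Ignore the <pad> token when computing the cross-entropy loss
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]
criterion = nn.CrossEntropyLoss(ignore_index=TRG_PAD_IDX)

# Adam with the default learning rate
optimizer = optim.Adam(model.parameters())

# Quick parameter count (the ~37M figure quoted above)
print(sum(p.numel() for p in model.parameters() if p.requires_grad))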

The training and evaluation loops are fairly straightforward: we process one batch at a time, calculating the loss and (for training only) adjusting the gradients. For both training and evaluation, we need to remove the <eos> token from the target sequences before passing them to the model. This forces the model to predict the end of a sentence from the source and target sentence structure rather than from length estimates. We also clip the gradients to prevent them from becoming too large and to ensure stability.

Training and evaluation functions
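A sketch of the training loop along the lines described above (the evaluation loop is the same minus the backward pass and clipping):

import torch

def train(model, iterator, optimizer, criterion, clip):
    model.train()
    epoch_loss = 0
    for batch in iterator:
        src, trg = batch.src, batch.trg
        optimizer.zero_grad()

        # Feed the target without its last token (the <eos>/padding position)
        output, _ = model(src, trg[:, :-1])       # [batch, trg len - 1, output dim]

        # Flatten and compare against the target shifted by one (no <sos>)
        output_dim = output.shape[-1]
        loss = criterion(output.contiguous().view(-1, output_dim),
                         trg[:, 1:].contiguous().view(-1))
        loss.backward()

        # Clip gradients for stability
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss += loss.item()
    return epoch_loss / len(iterator)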

Below is a sample output from 10 epochs. This model trains significantly faster than RNN-based models owing to the convolutions. Further, it is important to note that, unlike regular RNN models, we don’t need to loop over input sequences. It can be argued that we still loop over the conv layers, but the layer count is usually much smaller than the number of input tokens.

Sample training output.

Translation samples and Attention Visualization:

We can use the following code for translation and generating attention visualizations:

Inference and Attention Visualization
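For reference, a greedy-decoding sketch of the translation step (the attention plotting from the gist is omitted here, and the names are illustrative):

import torch
import spacy

def translate_sentence(sentence, src_field, trg_field, model, device, max_len=50):
    model.eval()
    if isinstance(sentence, str):
        tokens = [tok.text.lower() for tok in spacy.load('de')(sentence)]
    else:
        tokens = [tok.lower() for tok in sentence]

    # Numericalize the source with <sos>/<eos> and run the encoder once
    tokens = [src_field.init_token] + tokens + [src_field.eos_token]
    src_indexes = [src_field.vocab.stoi[t] for t in tokens]
    src_tensor = torch.LongTensor(src_indexes).unsqueeze(0).to(device)
    with torch.no_grad():
        encoder_conved, encoder_combined = model.encoder(src_tensor)

    # Greedy decoding: feed the tokens generated so far, take the argmax
    trg_indexes = [trg_field.vocab.stoi[trg_field.init_token]]
    for _ in range(max_len):
        trg_tensor = torch.LongTensor(trg_indexes).unsqueeze(0).to(device)
        with torch.no_grad():
            output, attention = model.decoder(trg_tensor, encoder_conved,
                                              encoder_combined)
        pred_token = output.argmax(2)[:, -1].item()
        trg_indexes.append(pred_token)
        if pred_token == trg_field.vocab.stoi[trg_field.eos_token]:
            break

    trg_tokens = [trg_field.vocab.itos[i] for i in trg_indexes]
    return trg_tokens[1:], attention   # drop the leading <sos>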

Sample translations and attention graphs are shown below:

Sample Translations
Attention Graph

Conclusion and Ending Notes

Convolution layers with attention provide a strong alternative to RNN-based approaches. However, as we deal with longer sequences, we need more stacked convolutions, which increases the model parameters and the path length between distant tokens. Transformer-based models, on the other hand, provide a significant advantage by relying purely on attention heads and fully-connected layers.

References and Credits:

  1. Sincere thanks to Rohan Shravan and Zoheb from TSAI for helping me understand these topics in detail.
  2. Most of the code is inspired by Ben Trevett’s GitHub repo.
  3. NIPS blogs and lectures.
  4. Original papers referred to: ConvS2S, ByteNet, WaveNet, Attention Is All You Need, GNMT.

This is one of my first attempts at a code walkthrough on Medium, so it is quite long! Please feel free to suggest any improvements, and drop a note below if you would like clarification on any aspect of this post.
