
This is the second part of a multi-part series on Transformers and attention. In this article we delve into the Transformer architecture and the design choices behind it.

“I can talk English, I can walk English, I can laugh English, because English is a funny language. Bhairon becomes barren and barren becomes Bhairon because their minds are very narrow. In the year 1929 when India was playing Australia at the Melbourne stadium Vijay Hazare and Vijay Merchant were at the crease. Vijay Merchant told Vijay Hazare, look Vijay Hazare, this is a very prestigious match and we must consider…


In this article we will build a Convolutional Seq2Seq model for NMT (German to English) in PyTorch. For a better understanding of the model, please refer to my previous blog here. The full code is accessible at the end of the page.

The Model — a quick summary

As the name suggests, Convolutional Seq2Seq uses convolutions instead of regular RNNs. This overcomes the highly sequential processing inherent in RNNs. Further, by stacking multiple convolutional blocks, the receptive field grows quickly, allowing the model to capture long-range dependencies between distant words. The convolutional blocks also use residual connections for better gradient flow. …
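To make this concrete, here is a minimal PyTorch sketch of one such block, a 1-D convolution with a GLU activation and a scaled residual connection in the style of ConvS2S. The ConvBlock class, dimensions and hyperparameters are illustrative assumptions, not the exact code from the article.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBlock(nn.Module):
    """One convolutional block: conv -> GLU -> residual (illustrative sketch)."""
    def __init__(self, hidden_dim, kernel_size=3, dropout=0.25):
        super().__init__()
        # The conv outputs 2 * hidden_dim channels so that GLU,
        # which halves the channel count, returns hidden_dim channels.
        self.conv = nn.Conv1d(hidden_dim, 2 * hidden_dim, kernel_size,
                              padding=(kernel_size - 1) // 2)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # x: (batch, hidden_dim, seq_len)
        residual = x
        out = self.conv(self.dropout(x))
        out = F.glu(out, dim=1)                 # gated linear unit over channels
        return (out + residual) * (0.5 ** 0.5)  # scaled residual connection

# Stacking blocks grows the receptive field: with kernel size 3,
# n blocks cover roughly 2n + 1 input positions.
encoder = nn.Sequential(*[ConvBlock(hidden_dim=256) for _ in range(5)])
x = torch.randn(8, 256, 30)   # (batch, channels, seq_len)
print(encoder(x).shape)       # torch.Size([8, 256, 30])
```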



This is the first part of a multi-part series on how Transformers and the attention mechanism have changed the landscape of NMT. In this article we delve into some of the research papers and architectures published just before the Transformer. This gives a better sense of the design challenges and the direction of research at the time.

From Attention to “Attention Is All You Need”:

2017 was truly a remarkable year for AI in general:

  1. AlphaGo Zero defeats AlphaGo (which had defeated Lee Sedol in 2016) after being bootstrapped and trained for just 3 days!
  2. GANs (Generative Adversarial Networks) made significant strides: DiscoGAN, CycleGAN and Progressive GANs (accepted in 2018)
  3. PyTorch gained strong popularity…


In this article, we will understand how attention works in neural networks for tasks such as Neural Machine Translation and Image Captioning, a precursor to the current state of the art and the super-exciting capabilities demonstrated by GPT-3.

We start with a quick refresher of basic building blocks.

RNNs (Recurrent Neural Networks) enabled the use of neural networks to model time-series or sequential data, e.g. predicting the next word or character given a previous set of words.

LSTMs (Long Short-Term Memory) and GRUs (Gated Recurrent Units) strengthened the concept of “memory”, thereby overcoming the long-range dependency problems of traditional RNNs.
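As a minimal sketch of the next-token idea described above (the NextWordLSTM module and toy dimensions are assumptions made purely for illustration), here is how next-word prediction looks with an LSTM in PyTorch:

```python
import torch
import torch.nn as nn

class NextWordLSTM(nn.Module):
    """Predict the next token from the previous ones using an LSTM (toy sketch)."""
    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        # tokens: (batch, seq_len) of token ids
        h, _ = self.lstm(self.embed(tokens))   # (batch, seq_len, hidden_dim)
        return self.out(h)                     # logits over the vocabulary at each step

model = NextWordLSTM()
tokens = torch.randint(0, 1000, (4, 12))       # a toy batch of token ids
logits = model(tokens)
print(logits.shape)                            # torch.Size([4, 12, 1000])
```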

For “transduction problems”, problems that involve generating…

Rajesh Y

Developer/Architect at Nokia Networks. Proud Father, CrossFitter and Coffee Lover.
