
Implementing a Transformer

· Draft

Intuition

With the now-famous paper Attention Is All You Need, the Transformer was introduced. It aims to solve a problem that previous neural network architectures persistently found challenging: extracting information from a longer context.

Here is how it generally works step by step:

  • First, the input sequence is tokenized into a sequence of tokens
  • Each token is turned into an embedding
  • A positional encoding is added to each embedding, so you now have a sequence of embeddings that also carries positional information
  • From each embedding, three new embeddings are generated by linear transformation: Q (query), K (key), and V (value)
  • The dot product of every Q with every K is computed, to measure their ‘similarity’ or ‘connection’
  • The resulting connection table, after scaling and a softmax, is applied to the V, giving the ‘attention’
  • The whole attention process happens in parallel across many different attention heads, and their results are concatenated to form the final multi-head attention output
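The steps above can be sketched in a few lines of numpy. This is a minimal illustration, not a full Transformer: tokenization and positional encoding are skipped, the weight matrices are random placeholders rather than learned parameters, and the per-head loop stands in for what a real implementation would do as one batched tensor operation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating, for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, Wq, Wk, Wv):
    # Project each embedding into query, key, and value vectors
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    # Dot product of every Q with every K gives the 'connection' table,
    # scaled by sqrt(d_k) so the logits stay well-conditioned
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    # Apply the table to V to get the attention output
    return weights @ V

def multi_head_attention(x, heads, Wo):
    # Run each head (here: a simple loop), then concatenate the results
    # and mix them back to model width with an output projection Wo
    outs = [attention(x, Wq, Wk, Wv) for Wq, Wk, Wv in heads]
    return np.concatenate(outs, axis=-1) @ Wo

# Toy setup: 5 tokens, model width 16, 4 heads of width 4
rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 16, 4
d_k = d_model // n_heads
x = rng.standard_normal((seq_len, d_model))          # embeddings + positions
heads = [tuple(rng.standard_normal((d_model, d_k)) for _ in range(3))
         for _ in range(n_heads)]                    # (Wq, Wk, Wv) per head
Wo = rng.standard_normal((d_model, d_model))
out = multi_head_attention(x, heads, Wo)
print(out.shape)  # (5, 16): one output embedding per input token
```

Note that the output has the same shape as the input, which is what lets Transformer blocks be stacked: each layer's multi-head attention output feeds the next layer.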