
Implementing a Transformer

· Draft

Intuition

With the now-famous paper Attention Is All You Need, the Transformer was introduced. It aims to solve a problem that previous neural network architectures persistently found challenging: extracting information from a longer context.

Here is how it generally works step by step:

  • First, the input sequence is tokenized into a sequence of tokens
  • Each token is turned into an embedding
  • A positional encoding is added to each embedding, so you now have a sequence of embeddings that also carries positional information
  • From each embedding, three new embeddings are generated by linear transformation: Q (query), K (key), and V (value)
  • The dot product of every Q with every K is computed, to measure their ‘similarity’ or ‘connection’
  • The resulting connection table, after scaling and a softmax, is applied to the V, giving the ‘attention’
  • The whole attention process happens in parallel across many different attention heads, and their results are concatenated to form the final multi-head attention output
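The steps above can be sketched in a few lines of numpy. This is a minimal illustration, not a full Transformer: tokenization and positional encoding are skipped, the weight matrices are random placeholders rather than learned parameters, and the per-head loop stands in for what a real implementation would do as one batched tensor operation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating, for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, Wq, Wk, Wv):
    # Project each embedding into query, key, and value vectors
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    # Dot product of every Q with every K gives the 'connection' table,
    # scaled by sqrt(d_k) so the logits stay well-conditioned
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    # Apply the table to V to get the attention output
    return weights @ V

def multi_head_attention(x, heads, Wo):
    # Run each head (here: a simple loop), then concatenate the results
    # and mix them back to model width with an output projection Wo
    outs = [attention(x, Wq, Wk, Wv) for Wq, Wk, Wv in heads]
    return np.concatenate(outs, axis=-1) @ Wo

# Toy setup: 5 tokens, model width 16, 4 heads of width 4
rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 16, 4
d_k = d_model // n_heads
x = rng.standard_normal((seq_len, d_model))          # embeddings + positions
heads = [tuple(rng.standard_normal((d_model, d_k)) for _ in range(3))
         for _ in range(n_heads)]                    # (Wq, Wk, Wv) per head
Wo = rng.standard_normal((d_model, d_model))
out = multi_head_attention(x, heads, Wo)
print(out.shape)  # (5, 16): one output embedding per input token
```

Note that the output has the same shape as the input, which is what lets Transformer blocks be stacked: each layer's multi-head attention output feeds the next layer.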