Premise:

While I have been enjoying the original Attention Is All You Need paper, I feel Illya's edit of the Annotated Transformer is a better resource, both for a beginner's understanding and for learning, from the ground up, the basic technology used in modern chat systems.

Background of the Transformer:

  • Using CNNs as the basic building block, architectures like ByteNet and ConvS2S compute hidden representations in parallel for all input and output positions.
  • In these models, the number of operations required to relate signals from two arbitrary input or output positions grows with the distance between those positions:
    • Linearly for ConvS2S
    • Logarithmically for ByteNet
    • Result: this makes it more difficult to learn dependencies between distant positions.
  • Transformers reduce this to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, which is counteracted with Multi-Head Attention.

But what exactly is attention, really?

  • Attention: a method that determines the importance of each component in a sequence relative to the other components in that sequence (see the sketch after this list).

    • As in humans, the attention mechanism was developed to address the weaknesses of leveraging information from the hidden layers of recurrent neural networks.
    • An RNN favours more recent information contained in words at the end of a sentence, while information from earlier in the sentence tends to be attenuated.
    • Attention gives a token equal access to any part of a sentence directly, rather than only through the previous state.
    • Ref: Attention
  • Self-attention: also called intra-attention, an attention mechanism relating different positions of a single sequence in order to compute a representation of that sequence.

    • Used in many tasks such as reading comprehension, abstractive summarisation, textual entailment, and learning task-independent sentence representations.
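
To make the idea concrete, here is a minimal NumPy sketch of scaled dot-product self-attention. The helper names, the toy dimensions (4 tokens, 8-dimensional embeddings), and the random projection matrices are illustrative assumptions, not anything prescribed by the paper or the Annotated Transformer.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) matrices; in self-attention all three are
    # projections of the same input sequence.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # how strongly each position relates to every other
    weights = softmax(scores, axis=-1)   # each row sums to 1: importance of the other tokens
    return weights @ V, weights          # weighted average of the value vectors

# Toy example: a sequence of 4 tokens, each embedded in 8 dimensions.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))  # hypothetical learned projections
out, attn = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
print(out.shape, attn.shape)  # (4, 8) (4, 4)
```

Each row of `attn` can be read as how much that token attends to every other token in the sequence, and the output is the corresponding weighted average of the value vectors.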

The Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.

Model Architecture:

  • Most competitive neural sequence transduction models have an encoder-decoder structure.
  • In this:
    • The encoder maps an input sequence of symbol representations (x_1, ..., x_n) to a sequence of continuous representations z = (z_1, ..., z_n).
    • Given z, the decoder then generates an output sequence (y_1, ..., y_m) of symbols one element at a time.
    • At each step the model is auto-regressive, i.e. it consumes the previously generated symbols as additional input when generating the next (see the decoding sketch after this list).
    • The Transformer follows this overall architecture, using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder.
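
A rough Python sketch of that auto-regressive loop, assuming hypothetical `encode` and `decode_step` functions (and `bos_id`/`eos_id` start and end symbols) that stand in for the real encoder and decoder:

```python
def generate(encode, decode_step, src_tokens, bos_id, eos_id, max_len=50):
    """Auto-regressive generation: encode the source once, then call the
    decoder repeatedly, feeding back the symbols generated so far."""
    z = encode(src_tokens)            # continuous representations z = (z_1, ..., z_n)
    ys = [bos_id]                     # output generated so far, starting from a begin symbol
    for _ in range(max_len):
        next_id = decode_step(z, ys)  # predict the next symbol from z and previous outputs
        ys.append(next_id)
        if next_id == eos_id:         # stop once the end-of-sequence symbol appears
            break
    return ys[1:]

# Toy stand-ins so the loop runs; a real Transformer would replace these.
encode = lambda src: [t + 1 for t in src]
decode_step = lambda z, ys: 99 if len(ys) > 3 else ys[-1] + 1  # 99 acts as the end symbol
print(generate(encode, decode_step, src_tokens=[5, 7, 9], bos_id=0, eos_id=99))
# [1, 2, 3, 99]
```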

Encoder:

  • The encoder is composed of a stack of N = 6 identical layers, each with two sub-layers: a multi-head self-attention mechanism and a position-wise, fully connected feed-forward network.
  • We then employ a residual connection around each of the two sub-layers, followed by layer normalization (see the sketch after this list).
  • i.e. the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself.
  • We apply dropout to the output of each sub-layer, before it is added to the sub-layer input and normalized.
  • To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.
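
Below is a minimal PyTorch sketch of that residual-plus-normalization wrapper, assuming d_model = 512 and a dropout rate of 0.1 as in the paper; the class name SublayerConnection is borrowed from the Annotated Transformer, and the Linear layer in the usage example is only a stand-in for a real attention or feed-forward sub-layer.

```python
import torch
import torch.nn as nn

class SublayerConnection(nn.Module):
    """Residual connection followed by layer normalization, with dropout
    applied to the sub-layer output before the addition:
    LayerNorm(x + Dropout(Sublayer(x)))."""
    def __init__(self, d_model=512, dropout=0.1):  # d_model = 512 as in the paper
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # x: (batch, seq_len, d_model); sublayer: e.g. self-attention or feed-forward
        return self.norm(x + self.dropout(sublayer(x)))

# Usage: wrap any sub-layer that preserves the d_model dimension.
wrap = SublayerConnection()
x = torch.randn(2, 5, 512)
out = wrap(x, nn.Linear(512, 512))  # the Linear is only a stand-in sub-layer
print(out.shape)                    # torch.Size([2, 5, 512])
```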