Understanding the Transformer


Tags
Foundation
LLM
Published
May 18, 2025
Author
Yonghai Gong

Introduction

In recent years, Transformers have revolutionized the field of artificial intelligence, particularly in natural language processing (NLP). Introduced in the groundbreaking 2017 paper "Attention Is All You Need" by Vaswani et al., Transformers have become the backbone of models like GPT, BERT, and T5, enabling machines to understand and generate human-like text with remarkable accuracy.
Its input is a sequence of tokens and its output is also a sequence of tokens; it can simply be thought of as a model that transforms sequences into other sequences.

Background

Autoregressive

  • An autoregressive model predicts future values in a sequence based on past observations. The term comes from "auto" (self) and "regression" (predicting continuous values), meaning the model uses its own previous outputs to generate new ones.
    • Mathematical formulation: for a sequence of data points $x_1, x_2, \dots, x_{t-1}$, an autoregressive model of order $p$ predicts the next value as:
      $$x_t = c + \sum_{i=1}^{p} \phi_i x_{t-i} + \varepsilon_t$$
      where $c$ is a constant term, $\phi_1, \dots, \phi_p$ are the model weights, and $\varepsilon_t$ is random noise (the error term).
  • Example with GPT (see the sketch after this list)
    • Input: "Once upon a"
    • Autoregressive output:
      • Step 1: "Once upon a" → "time"
      • Step 2: "Once upon a time" → "there"
      • Step 3: "Once upon a time there" → "was"
      • Final output: "Once upon a time there was a king."
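To make the loop concrete, below is a minimal sketch of greedy autoregressive decoding in NumPy. The `next_token_logits` function is a hypothetical stand-in for a trained language model such as GPT (here it just returns random scores); the point is only the structure of feeding the model's own output back in.

```python
import numpy as np

def next_token_logits(tokens, vocab_size=1000):
    # Hypothetical stand-in for a trained language model: given the tokens
    # generated so far, return a score for every token in the vocabulary.
    rng = np.random.default_rng(len(tokens))
    return rng.standard_normal(vocab_size)

def generate(prompt_tokens, num_steps=5):
    """Greedy autoregressive decoding: each new token is chosen from the
    model's output conditioned on all previously generated tokens."""
    tokens = list(prompt_tokens)
    for _ in range(num_steps):
        logits = next_token_logits(tokens)   # scores for the next position
        next_token = int(np.argmax(logits))  # greedy choice (no sampling)
        tokens.append(next_token)            # feed the model's own output back in
    return tokens

# Usage: toy token ids standing in for "Once upon a"
print(generate([11, 42, 7]))
```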

Layer normalization

  • Layer Normalization normalizes the activations across the features (channels) for each sample independently, rather than across the batch. It was introduced to address some limitations of Batch Normalization, particularly in recurrent networks (RNNs) and Transformers.
  • Layer Normalization vs. Batch Normalization: key differences (a minimal sketch of both follows the table)

| Feature | Layer Normalization (LN) | Batch Normalization (BN) |
| --- | --- | --- |
| Normalization axis | Normalizes per sample (over features) | Normalizes per feature (over batch) |
| Batch size dependence | Works with any batch size (even 1) | Requires large batches |
| Use case | Ideal for RNNs, Transformers, small batches | Best for CNNs, large batches |
| Training vs. inference | Same computation in both modes | Uses running averages at inference |
| Performance impact | More stable for variable-length sequences | Can destabilize RNNs/Transformers |
| Examples | Used in GPT, BERT, LLaMA | Used in ResNet, VGG |
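A minimal NumPy sketch of the two normalization axes, assuming a 2-D (batch, features) input and omitting the learnable gain and bias parameters for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each sample over its feature dimension (last axis)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_norm(x, eps=1e-5):
    """Normalize each feature over the batch dimension (first axis);
    training-mode statistics only, running averages omitted."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.randn(4, 8)              # (batch, features)
print(layer_norm(x).mean(axis=-1))     # ~0 for every sample
print(batch_norm(x).mean(axis=0))      # ~0 for every feature

# Layer norm is well defined even with a batch of one;
# batch norm statistics degenerate in that case.
print(layer_norm(np.random.randn(1, 8)))
```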

Model architecture

Figure 1: The Transformer model architecture.
  • Encoder and decoder stacks
    • The encoder is a stack of $N = 6$ identical layers, each with two sub-layers. The decoder is also a stack of $N = 6$ identical layers; in addition to the two sub-layers of each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Both employ residual connections around each of the sub-layers, followed by layer normalization.
      The decoder also uses masked multi-head attention to prevent the autoregressive model from seeing later outputs in advance, ensuring that the predictions for position $i$ can depend only on the known outputs at positions less than $i$.
  • Attention: An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
    Figure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several attention layers running in parallel.
  • Scaled dot-product attention:
      $$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
    • Some explanations: the output is a weighted sum of the entries of $V$, so the weights must sum to 1 (one reason softmax is used here); each weight is computed from the similarity between an entry of $Q$ and the corresponding entry of $K$. The dot products are divided by $\sqrt{d_k}$ to keep them from growing too large.
  • Multi-head attention:
    • There are no learnable parameters in the scaled dot-product attention function itself, since the output $\mathrm{Attention}(Q, K, V)$ is fully determined by the inputs $Q$, $K$, $V$. But in some scenarios we want several different ways of computing the weights over $V$, so multi-head attention comes into being (a NumPy sketch of these blocks appears at the end of this section):
      $$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
      where $W_i^Q, W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$, and $W^O \in \mathbb{R}^{hd_v \times d_{\text{model}}}$ are learnable projection matrices, and in the original paper $h = 8$ and $d_k = d_v = d_{\text{model}}/h = 64$.
  • Position-wise feed-forward networks (MLP):
    • In addition to the attention sub-layers, each of the layers in the encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between (included in the sketch at the end of this section).
  • Embeddings: learned embeddings convert the input and output tokens to vectors of dimension $d_{\text{model}}$; in the embedding layers, those weights are multiplied by $\sqrt{d_{\text{model}}}$ just to scale them.
  • Positional encoding
    • Since the Transformer has no recurrence and no notion of token order, "positional encodings" are added to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension $d_{\text{model}}$ as the embeddings, so the two can be summed. There are many choices of positional encodings; the paper uses sine and cosine functions of different frequencies:
      $$PE_{(pos, 2i)} = \sin\!\left(pos / 10000^{2i/d_{\text{model}}}\right), \qquad PE_{(pos, 2i+1)} = \cos\!\left(pos / 10000^{2i/d_{\text{model}}}\right)$$
      (here we can see the PE values lie in $[-1, 1]$, which is why the embeddings are multiplied by $\sqrt{d_{\text{model}}}$ before the two are summed; see the sketch at the end of this section)
  • Add & norm:
    • A residual connection is added around each sub-layer, followed by layer normalization, so the output of each sub-layer is $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$.
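To tie the pieces together, here is a minimal NumPy sketch of the sub-layers described above: scaled dot-product attention with optional masking, multi-head attention, and the position-wise feed-forward network. The weights are random toy matrices rather than learned parameters, and the function names are illustrative, not taken from any library.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)          # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)        # masked positions get ~zero weight
    return softmax(scores) @ V

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o, mask=None):
    """Project Q, K, V once per head, attend in parallel, concatenate, project back."""
    heads = [
        scaled_dot_product_attention(Q @ W_q[i], K @ W_k[i], V @ W_v[i], mask)
        for i in range(W_q.shape[0])
    ]
    return np.concatenate(heads, axis=-1) @ W_o

def position_wise_ffn(x, W1, b1, W2, b2):
    """Two linear transformations with a ReLU in between, applied to every position identically."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

# Toy dimensions: d_model = 16, h = 4 heads, d_k = d_v = 4, sequence length 5.
d_model, h, d_k, d_ff, L = 16, 4, 4, 64, 5
rng = np.random.default_rng(0)
x = rng.standard_normal((L, d_model))
W_q, W_k, W_v = (rng.standard_normal((h, d_model, d_k)) for _ in range(3))
W_o = rng.standard_normal((h * d_k, d_model))

causal_mask = np.tril(np.ones((L, L), dtype=bool))   # decoder-style masking
out = multi_head_attention(x, x, x, W_q, W_k, W_v, W_o, mask=causal_mask)
print(out.shape)                                     # (5, 16)

W1, b1 = rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)), np.zeros(d_model)
print(position_wise_ffn(out, W1, b1, W2, b2).shape)  # (5, 16)
```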
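And a sketch of the sinusoidal positional encoding, assuming an even $d_{\text{model}}$; it shows that the values stay within $[-1, 1]$ and are simply added to the scaled token embeddings.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    positions = np.arange(max_len)[:, None]            # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even indices
    pe[:, 1::2] = np.cos(angles)                       # odd indices
    return pe

d_model = 16
pe = sinusoidal_positional_encoding(max_len=50, d_model=d_model)
print(pe.shape, pe.min(), pe.max())                    # values stay within [-1, 1]

# The encodings are simply added to the (scaled) token embeddings.
embeddings = np.random.randn(50, d_model) * np.sqrt(d_model)
x = embeddings + pe
```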

Why self-attention
