Introduction
In recent years, Transformers have revolutionized the field of artificial intelligence, particularly in natural language processing (NLP). Introduced in the groundbreaking 2017 paper "Attention Is All You Need" by Vaswani et al., Transformers have become the backbone of models like GPT, BERT, and T5, enabling machines to understand and generate human-like text with remarkable accuracy.
Its input is a sequence of tokens and its output is another sequence of tokens; in short, a Transformer can be viewed as a model that transforms sequences into other sequences.
Background
Autoregressive
- An autoregressive model predicts future values in a sequence based on past observations. The term comes from "auto" (self) and "regression" (predicting continuous values), meaning the model uses its own previous outputs to generate new ones.
Mathematical formulation: for a sequence of data points $x_1, x_2, \dots, x_{t-1}$, an autoregressive model of order $p$ predicts the next value as:
$$x_t = c + \phi_1 x_{t-1} + \phi_2 x_{t-2} + \dots + \phi_p x_{t-p} + \varepsilon_t$$
where $c$ is a constant term, $\phi_1, \dots, \phi_p$ are model weights, and $\varepsilon_t$ is random noise (the error term).
- Example with GPT
- Input: "Once upon a"
- Autoregressive Output:
- Step 1 → "time"
- Step 2 → "Once upon a time" → "there"
- Step 3 → "Once upon a time there" → "was"
- Final Output: "Once upon a time there was a king."
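A minimal sketch of this autoregressive loop, assuming a hypothetical `next_token` function (a toy lookup table standing in for a real GPT-style model): each new token is generated from the model's own previous outputs and appended to the context.

```python
# Minimal sketch of autoregressive (greedy) decoding.
# `next_token` is a hypothetical stand-in for a trained language model
# that returns the most likely next token given the tokens so far.

def next_token(context: list[str]) -> str:
    # Toy lookup table standing in for a real model's argmax prediction.
    continuations = {
        ("Once", "upon", "a"): "time",
        ("Once", "upon", "a", "time"): "there",
        ("Once", "upon", "a", "time", "there"): "was",
    }
    return continuations.get(tuple(context), "<eos>")

def generate(prompt: str, max_new_tokens: int = 10) -> str:
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        tok = next_token(tokens)   # the model consumes its own previous outputs
        if tok == "<eos>":
            break
        tokens.append(tok)         # feed the new token back in
    return " ".join(tokens)

print(generate("Once upon a"))     # -> "Once upon a time there was"
```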
Layer normalization
- Layer Normalization normalizes the activations across the features (channels) for each sample independently, rather than across the batch. It was introduced to address some limitations of Batch Normalization, particularly in recurrent networks (RNNs) and Transformers.
- Layer Normalization vs. Batch Normalization: Key Differences
| Feature | Layer Normalization (LN) | Batch Normalization (BN) |
| --- | --- | --- |
| Normalization axis | Normalizes per sample (over features) | Normalizes per feature (over batch) |
| Batch size dependence | Works with any batch size (even 1) | Requires large batches |
| Use case | Ideal for RNNs, Transformers, small batches | Best for CNNs, large batches |
| Training vs. inference | Same computation in both modes | Uses running averages at inference |
| Performance impact | More stable for variable-length sequences | Can destabilize RNNs/Transformers |
| Examples | Used in GPT, BERT, LLaMA | Used in ResNet, VGG |
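A small numpy sketch (learnable gain and bias omitted) illustrating the key difference in the table above: layer normalization averages over the feature axis of each sample, while batch normalization averages over the batch axis of each feature.

```python
import numpy as np

# x: a batch of 4 samples, each with 8 features (e.g. one token's activations).
x = np.random.randn(4, 8)

# Layer normalization: statistics are computed per sample, across features
# (axis=-1), so it works even with batch size 1.
ln_mean = x.mean(axis=-1, keepdims=True)
ln_var = x.var(axis=-1, keepdims=True)
layer_norm = (x - ln_mean) / np.sqrt(ln_var + 1e-5)

# Batch normalization: statistics are computed per feature, across the batch
# (axis=0), so it needs a reasonably large batch to estimate them well.
bn_mean = x.mean(axis=0, keepdims=True)
bn_var = x.var(axis=0, keepdims=True)
batch_norm = (x - bn_mean) / np.sqrt(bn_var + 1e-5)

print(layer_norm.mean(axis=-1))  # ~0 for every sample
print(batch_norm.mean(axis=0))   # ~0 for every feature
```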
Model architecture

- Encoder and decoder stacks
The encoder contains a stack of $N = 6$ identical layers, each with two sub-layers: a multi-head self-attention mechanism and a position-wise feed-forward network. The decoder also contains $N = 6$ identical layers; in addition to the two sub-layers of each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Both stacks employ residual connections around each of the sub-layers, followed by layer normalization.
The decoder also uses masked multi-head attention to prevent the autoregressive model from seeing later outputs in advance, ensuring that the predictions for position $i$ can depend only on the known outputs at positions less than $i$ (a sketch of the mask follows below).
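A tiny sketch of the causal (look-ahead) mask used by the decoder's masked attention, assuming the common convention of adding $-\infty$ to the masked attention scores before the softmax:

```python
import numpy as np

# Causal (look-ahead) mask for a sequence of length 5: position i may attend
# only to positions <= i. Masked entries are set to -inf before the softmax,
# so their attention weights become 0.
seq_len = 5
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
print(mask)
# Row i has 0 for columns 0..i and -inf for columns i+1.., so adding this mask
# to the attention scores hides "future" tokens from the decoder.
```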
- Attention: An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

- Scaled dot-product attention:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
Some explanations: the output should be a weighted sum of the rows of $V$, so the weights in each row should sum to 1 (one of the reasons a softmax function is used here), and the weights are calculated from the similarities between each entry of $Q$ and each entry of $K$. Dividing by $\sqrt{d_k}$ avoids the dot products becoming too large, which would push the softmax into regions with very small gradients.
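A minimal numpy sketch of scaled dot-product attention as defined above; `softmax` is written out by hand, and the optional `mask` argument is where the decoder's causal mask would be added:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)          # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)   # similarities between queries and keys
    if mask is not None:
        scores = scores + mask                       # e.g. the causal mask above
    weights = softmax(scores, axis=-1)               # each row sums to 1
    return weights @ V                               # weighted sum of the rows of V

# 3 queries/keys/values of dimension d_k = d_v = 4
Q = np.random.randn(3, 4); K = np.random.randn(3, 4); V = np.random.randn(3, 4)
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 4)
```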
- Multi-head attention:
There are no learnable model parameters in the scaled dot-product attention function itself, since the inputs $(Q, K, V)$ fully determine $\mathrm{Attention}(Q, K, V)$. BUT in some scenarios we want different ways of computing the weights over $V$, so multi-head attention comes into being:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W^{O}, \qquad \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$$
where $W_i^{Q}, W_i^{K} \in \mathbb{R}^{d_{\mathrm{model}} \times d_k}$, $W_i^{V} \in \mathbb{R}^{d_{\mathrm{model}} \times d_v}$, and $W^{O} \in \mathbb{R}^{h d_v \times d_{\mathrm{model}}}$ are parameter projection matrices (which can be learned), and the paper uses $h = 8$ heads with $d_k = d_v = d_{\mathrm{model}}/h = 64$.
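A sketch of multi-head attention under the dimensions above ($h = 8$, $d_k = d_v = 64$), with the per-head projection matrices as the learnable parameters; the `attention` helper is a compact restatement of the scaled dot-product attention from the previous sketch:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention, as in the previous sketch.
    return softmax(Q @ K.swapaxes(-2, -1) / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(Q, K, V, params):
    # Each head projects Q, K, V with its own learned matrices, applies
    # scaled dot-product attention, then the heads are concatenated and
    # projected back to d_model with W_o.
    heads = [
        attention(Q @ W_q, K @ W_k, V @ W_v)
        for W_q, W_k, W_v in zip(params["W_q"], params["W_k"], params["W_v"])
    ]
    return np.concatenate(heads, axis=-1) @ params["W_o"]

d_model, h = 512, 8
d_k = d_v = d_model // h                      # 64, as in the paper
rng = np.random.default_rng(0)
params = {
    "W_q": [rng.normal(size=(d_model, d_k)) for _ in range(h)],
    "W_k": [rng.normal(size=(d_model, d_k)) for _ in range(h)],
    "W_v": [rng.normal(size=(d_model, d_v)) for _ in range(h)],
    "W_o": rng.normal(size=(h * d_v, d_model)),
}
x = rng.normal(size=(10, d_model))            # 10 tokens; self-attention uses Q = K = V = x
print(multi_head_attention(x, x, x, params).shape)   # (10, 512)
```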
- Position-wise feed-forward networks (MLP):
In addition to attention sub-layers, each of the layers in encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between.
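A sketch of the position-wise feed-forward network, $\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$, using the paper's dimensions $d_{\mathrm{model}} = 512$ and $d_{ff} = 2048$ (random weights stand in for learned ones):

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # Two linear transformations with a ReLU in between,
    # applied identically to every position (row of x):
    # FFN(x) = max(0, x W1 + b1) W2 + b2
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 512, 2048                     # dimensions used in the paper
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
x = rng.normal(size=(10, d_model))            # 10 positions
print(position_wise_ffn(x, W1, b1, W2, b2).shape)  # (10, 512)
```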
- Embeddings: the input and output tokens are converted to vectors of dimension $d_{\mathrm{model}}$ by learned embeddings; in the embedding layers, the weights are multiplied by $\sqrt{d_{\mathrm{model}}}$ (just to scale them).
- Positional encoding
Since the Transformer contains no recurrence or convolution, it has no information about the order of the sequence, so "positional encodings" are added to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension $d_{\mathrm{model}}$ as the embeddings, so that the two can be summed. There are many choices of positional encodings; the paper uses:
$$PE_{(pos, 2i)} = \sin\!\left(pos / 10000^{2i/d_{\mathrm{model}}}\right), \qquad PE_{(pos, 2i+1)} = \cos\!\left(pos / 10000^{2i/d_{\mathrm{model}}}\right)$$
(here we can see the PE values lie between $-1$ and $1$, which is why the embeddings are multiplied by $\sqrt{d_{\mathrm{model}}}$, so they are not drowned out by the positional encodings).
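A sketch of the sinusoidal positional encoding defined above; the printed min/max confirms the values lie in $[-1, 1]$:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]              # (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions
    return pe

pe = positional_encoding(max_len=50, d_model=512)
print(pe.min(), pe.max())                              # values lie in [-1, 1]
```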
- Add & norm:
Add a residual connection around each sub-layer, followed by layer normalization; i.e., the output of each sub-layer is $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$.
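A sketch of the Add & Norm wrapper, $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$, with the layer normalization written out by hand (learnable gain and bias omitted) and an arbitrary placeholder sub-layer:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Per-position layer normalization (learnable gain/bias omitted).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    # Residual connection around the sub-layer, followed by layer normalization:
    # LayerNorm(x + Sublayer(x))
    return layer_norm(x + sublayer(x))

x = np.random.randn(10, 512)                           # 10 positions, d_model = 512
out = add_and_norm(x, lambda t: np.maximum(0, t))      # placeholder sub-layer
print(out.shape)                                       # (10, 512)
```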

Why self-attention
