Understanding transformer—Statistics

Tags: Transformer, LLM, Foundation
Published: May 22, 2025
Author: Yonghai Gong

Notations:

Notation   Semantics
$h$        hidden size
$a$        num of heads
$V$        vocab size
$b$        batch size
$s$        len of sequence

1 Memory footprint

1.1 # of model parameters

Each transformer layer consists of two blocks (self-attention and MLP), each followed by a layer normalization.
  • self-attention: 4 weight matrices ($W_Q, W_K, W_V, W_O$, each $h \times h$) and 4 biases (each of size $h$), together $4h^2 + 4h$ parameters.
  • MLP: consists of 2 fully connected linear layers. Generally, it maps $h \to 4h \to h$, so it has two weight matrices ($h \times 4h$ and $4h \times h$) and 2 biases (of sizes $4h$ and $h$). This sums to $8h^2 + 5h$ parameters.
  • 2 layer normalizations: each has a scale and a bias of size $h$, i.e., $4h$ parameters. In summation, one transformer layer has $12h^2 + 13h$ parameters.
Additionally, the word embedding matrix has $Vh$ parameters.
We can ultimately derive the total trainable parameter count of an $l$-layer Transformer model as $l(12h^2 + 13h) + Vh$. When $h$ is large, we can approximate it by $12lh^2$.
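To make the counting concrete, here is a minimal Python sketch of the exact and approximate formulas above (hypothetical helpers; the vocabulary size of 50257 in the demo is only an illustrative assumption):

```python
# A minimal sketch of the parameter-count formulas above.

def transformer_params(l: int, h: int, V: int) -> int:
    """Exact trainable parameters: l * (12h^2 + 13h) + V*h."""
    per_layer = 12 * h * h + 13 * h   # attention (4h^2 + 4h) + MLP (8h^2 + 5h) + 2 LayerNorms (4h)
    return l * per_layer + V * h      # plus the word-embedding matrix

def transformer_params_approx(l: int, h: int) -> int:
    """Approximation 12 * l * h^2, valid when h is large."""
    return 12 * l * h * h

if __name__ == "__main__":
    # GPT-3-like configuration; V = 50257 is an assumed vocabulary size for illustration.
    l, h, V = 96, 12288, 50257
    print(f"exact : {transformer_params(l, h, V) / 1e9:.1f}B")       # ~174.6B
    print(f"approx: {transformer_params_approx(l, h) / 1e9:.1f}B")   # ~173.9B
```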

1.2 GPU memory consumption during training

Generally, the GPU memory cost mainly comes from 4 aspects (model parameters, intermediate activations generated during forward computation, gradients computed during backward propagation, and optimizer states).
Scenario: Using the Adam optimizer with mixed-precision training to accelerate training.

model + gradients + states

For each model parameter, mixed-precision training typically stores the weight and the gradient in half precision (2 bytes each) for the forward and backward passes. It also keeps full-precision (4-byte) copies of the weight and the gradient in 32-bit float format for numerical stability, and the Adam optimizer additionally keeps a first-moment and a second-moment estimate per parameter in float32.
Together this costs $(2+2) + (4+4) + (4+4) = 20$ bytes per parameter, i.e., $20\Phi$ bytes for a model with $\Phi$ trainable parameters.
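As a quick sanity check, a minimal sketch of this 20-bytes-per-parameter bookkeeping (a hypothetical helper, not from any library):

```python
# 20 bytes per parameter = fp16 weight (2) + fp16 gradient (2)
# + fp32 weight copy (4) + fp32 gradient copy (4)
# + fp32 Adam momentum (4) + fp32 Adam variance (4).

def train_state_bytes(num_params: int) -> int:
    """Memory for weights + gradients + optimizer states under Adam + mixed precision."""
    bytes_per_param = (2 + 2) + (4 + 4) + (4 + 4)  # = 20
    return bytes_per_param * num_params

if __name__ == "__main__":
    phi = 175_000_000_000  # a GPT-3-scale parameter count
    print(f"{train_state_bytes(phi) / 1e12:.1f} TB")  # 3.5 TB
```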

activations (the major part)

The activations are the main consumer of GPU memory. During the forward pass, intermediate activations must be saved for gradient computation in the backward pass. Here, "activations" refer to all tensors computed during the forward pass and required for the backward pass. This excludes model parameters and optimizer states but includes mask matrices used in dropout operations.
During large model training, mixed-precision training is typically employed, where intermediate activations are usually stored in float16 or bfloat16 format. When analyzing the memory overhead of intermediate activations, we assume they are stored in float16/bfloat16 format, with each element occupying 2 bytes. The only exception is the mask matrix for dropout operations, where each element occupies only 1 byte. In the following analysis, the unit is bytes rather than the number of elements.
  • activations in layer normalization
When analyzing the memory cost of intermediate activations, we only consider the major contributors and ignore smaller buffers. For example, in layer normalization, computing gradients requires the layer's input $x$ as well as its mean $\mu$ and variance $\sigma^2$. The input has $bsh$ elements, while $\mu$ and $\sigma^2$ together have only $2bs$ elements, which is negligible since $h \gg 2$. For one layer normalization, we therefore approximate the activations as $2bsh$ bytes.
One transformer layer consists of a self-attention block and an MLP block, each followed by a layer normalization operation.
  • activations in self-attention
  1. For $Q = xW_Q$, $K = xW_K$, $V = xW_V$, we need to keep the shared input $x$ (this is the activation), of shape $[b, s, h]$: $2bsh$ bytes.
  2. For $QK^T$, we need to store both $Q$ and $K$, each also of shape $[b, s, h]$: $2 \cdot 2bsh = 4bsh$ bytes.
  3. For $\mathrm{softmax}(QK^T/\sqrt{d_k})$, we need to store $QK^T$ (shape of $Q$: $[b, a, s, h/a]$, $K^T$: $[b, a, h/a, s]$, $QK^T$: $[b, a, s, s]$): $2abs^2$ bytes, where $a$ is the number of heads.
  4. After the softmax we apply dropout, which needs a mask matrix of the same shape as $QK^T$: $abs^2$ bytes (no factor of 2, since each mask element occupies 1 byte).
  5. To compute the attention output over $V$, we need to store the softmax score ($2abs^2$ bytes) and $V$ ($2bsh$ bytes): together $2abs^2 + 2bsh$ bytes.
  6. The output projection is also followed by dropout, which needs its input ($2bsh$ bytes) and a mask ($bsh$ bytes): $3bsh$ bytes.
Summing all of the above, the self-attention block's activations cost $11bsh + 5abs^2$ bytes.
  • activations in MLP
  1. The first linear layer needs to store its input, of shape $[b, s, h]$: $2bsh$ bytes.
  2. The ReLU needs to store its input, of shape $[b, s, 4h]$: $8bsh$ bytes.
  3. The second linear layer needs to store its input, of shape $[b, s, 4h]$: $8bsh$ bytes.
  4. The dropout needs a mask of shape $[b, s, h]$: $bsh$ bytes.
In total, the activations of one MLP block cost $19bsh$ bytes.
Finally, each transformer layer costs $34bsh + 5abs^2$ bytes of activations: $11bsh + 5abs^2$ for self-attention, $19bsh$ for the MLP, and $2 \cdot 2bsh$ for the two layer normalization operations. We ignore the activations in the embeddings. For an $l$-layer transformer, the activation memory overhead is $l(34bsh + 5abs^2)$ bytes.
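The per-block formulas above can be packaged into a few hypothetical helpers; this is only a sketch under the stated assumptions (2-byte activations, 1-byte dropout masks):

```python
# A minimal sketch of the activation-memory formulas above (in bytes), assuming
# fp16/bf16 activations (2 bytes per element) and 1-byte dropout masks.

def attn_activation_bytes(b: int, s: int, h: int, a: int) -> int:
    """Self-attention block: 11*b*s*h + 5*a*b*s^2."""
    return 11 * b * s * h + 5 * a * b * s * s

def mlp_activation_bytes(b: int, s: int, h: int) -> int:
    """MLP block: 19*b*s*h."""
    return 19 * b * s * h

def layernorm_activation_bytes(b: int, s: int, h: int) -> int:
    """Two layer normalizations: 2 * 2*b*s*h."""
    return 2 * 2 * b * s * h

def model_activation_bytes(l: int, b: int, s: int, h: int, a: int) -> int:
    """l-layer transformer: l * (34*b*s*h + 5*a*b*s^2)."""
    return l * (attn_activation_bytes(b, s, h, a)
                + mlp_activation_bytes(b, s, h)
                + layernorm_activation_bytes(b, s, h))
```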

1.3 Examples

Take GPT-3 175B as an example and calculate its storage overhead.

Model   # of para   # of layers   hidden size   # of heads
GPT3    175B        96            12288         96
  • model consumption: weights, gradients, and optimizer states take $20\Phi = 20 \times 175 \times 10^9$ bytes $\approx 3.5$ TB; the fp16 weights alone take $2\Phi \approx 350$ GB.
  • suppose that the sequence length is $s = 2048$; for different batch sizes, the activations cost
    • $b = 1$: $l(34bsh + 5abs^2) \approx 275$ GB, roughly $0.8\times$ the fp16 weights.
    • $b = 64$: $\approx 17.6$ TB, roughly $50\times$ the fp16 weights.
    • $b = 128$: $\approx 35.3$ TB, roughly $101\times$ the fp16 weights.
At moderate batch sizes, training activations cost far more memory than the model itself.
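The numbers above can be reproduced with a short, self-contained script (using the configuration in the table and $s = 2048$):

```python
# A short script reproducing the GPT-3 estimates above
# (l = 96, h = 12288, a = 96, s = 2048; batch sizes 1, 64, 128).

def layer_activation_bytes(b, s, h, a):
    # 34*b*s*h + 5*a*b*s^2 bytes per transformer layer
    return 34 * b * s * h + 5 * a * b * s * s

l, h, a, s = 96, 12288, 96, 2048
phi = 175e9

print(f"weights + grads + optimizer states: {20 * phi / 1e12:.1f} TB")  # 3.5 TB
print(f"fp16 weights only                 : {2 * phi / 1e9:.0f} GB")    # 350 GB
for b in (1, 64, 128):
    act = l * layer_activation_bytes(b, s, h, a)
    print(f"activations at b = {b:3d}             : {act / 1e12:.2f} TB")  # 0.28 / 17.63 / 35.25 TB
```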

2 Computation overhead

For $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$, computing $AB$ needs $mp(2n - 1)$ FLOPs: each output element takes $n$ multiplications and $n - 1$ additions. We approximate this by $2mnp$, since $n$ is usually large. (Another interpretation is that if we also count the bias addition of a linear layer, the cost is exactly $2mnp$.) For ease of exposition, we use the $2mnp$ approximation and omit the bias terms below.
During one iteration, suppose the input shape is $[b, s]$.

2.1 Self-attention

  1. Computing $Q = xW_Q$, $K = xW_K$, $V = xW_V$: three $[b, s, h] \times [h, h]$ matmuls, cost $3 \cdot 2bsh^2 = 6bsh^2$.
  2. $QK^T$: $[b, a, s, h/a] \times [b, a, h/a, s]$, cost $2bs^2h$.
  3. $\mathrm{score} \cdot V$: $[b, a, s, s] \times [b, a, s, h/a]$, cost $2bs^2h$.
  4. The output linear mapping $W_O$: $[b, s, h] \times [h, h]$, cost $2bsh^2$.

2.2 MLP

  1. The first linear layer, $[b, s, h] \times [h, 4h]$, cost $8bsh^2$.
  2. The second linear layer, $[b, s, 4h] \times [4h, h]$, cost $8bsh^2$.

2.3 Vector → vocab

The final linear mapping from the hidden states to the vocabulary logits, $[b, s, h] \times [h, V]$, costs $2bshV$.
 
In summary, for an $l$-layer transformer, one forward pass over an input of shape $[b, s]$ consumes $l(24bsh^2 + 4bs^2h) + 2bshV$ FLOPs.
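A minimal sketch of this forward-pass FLOP count, broken down by block as in Sections 2.1 to 2.3 (the vocabulary size in the demo is an assumed value):

```python
# Forward-pass FLOPs: l * (24*b*s*h^2 + 4*b*s^2*h) + 2*b*s*h*V.

def forward_flops(l: int, b: int, s: int, h: int, V: int) -> int:
    attention = 6 * b * s * h**2 + 2 * b * s**2 * h + 2 * b * s**2 * h + 2 * b * s * h**2
    mlp = 8 * b * s * h**2 + 8 * b * s * h**2
    logits = 2 * b * s * h * V
    return l * (attention + mlp) + logits

if __name__ == "__main__":
    # GPT-3-like setting; V = 50257 is an assumed vocabulary size for illustration.
    print(f"{forward_flops(l=96, b=1, s=2048, h=12288, V=50257):.2e} FLOPs per forward pass")
```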

3 Connection between computation and # of para.

When the hidden size $h \gg s$ (the sequence length), we can ignore the terms that are linear in $h$ ($4bs^2h$ and $2bshV$) and approximate the forward-pass FLOPs by $24bsh^2 \cdot l$. Since the number of model parameters is $\Phi \approx 12lh^2$ and the number of tokens is $bs$, we have $\text{FLOPs}_{\text{forward}} \approx 2\,\Phi \cdot bs$. In other words, during one forward pass, each model parameter requires 2 floating-point operations per token: one multiplication and one addition.
  • one training iteration: consists of a forward pass and a backward pass, where the backward pass requires roughly twice the computation of the forward pass. Thus the [forward + backward] scaling factor is 3. In one training iteration, each model parameter therefore requires 6 floating-point operations per token (2 for the forward pass + 4 for the backward pass), as sketched below.
  • one inference (forward) pass: each model parameter requires 2 floating-point operations per token.
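These rules of thumb are easy to encode; the sketch below additionally assumes the commonly cited figure of roughly 300B training tokens for GPT-3, purely as an illustration:

```python
# Rules of thumb derived above:
# training  ≈ 6 * (# params) * (# tokens)
# inference ≈ 2 * (# params) * (# tokens)

def training_flops(num_params: float, num_tokens: float) -> float:
    return 6 * num_params * num_tokens   # 2 (forward) + 4 (backward) per parameter per token

def inference_flops(num_params: float, num_tokens: float) -> float:
    return 2 * num_params * num_tokens   # forward pass only

if __name__ == "__main__":
    # Illustration only: GPT-3 scale (175B parameters); ~300B training tokens is an
    # assumed, commonly cited figure, not something derived in this note.
    print(f"training : {training_flops(175e9, 300e9):.2e} FLOPs")                    # ~3.2e23
    print(f"inference: {inference_flops(175e9, 2048):.2e} FLOPs per 2048-token input")
```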