Meta Llama family

Tags: LLM, Llama
Published: June 24, 2025
Author: Yonghai Gong
This post introduces the model construction of the Meta Llama family, as a quick reference for the future.

1. Llama 2

| LLaMA | dimension | heads | layers | learning rate | batch size | tokens |
| --- | --- | --- | --- | --- | --- | --- |
| 7B | 4096 | 32 | 32 | 3.0E-04 | 4M | 1T |
| 13B | 5120 | 40 | 40 | 3.0E-04 | 4M | 1T |
| 33B | 6656 | 52 | 60 | 1.5E-04 | 4M | 1.4T |
| 65B | 8192 | 64 | 80 | 1.5E-04 | 4M | 1.4T |

1.1 Model architecture

(Figure: overall Llama model architecture.)
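To make the figure concrete, here is a minimal sketch of how one Llama-style decoder layer is wired (pre-norm residual structure), assuming simplified `attention` and `feed_forward` sub-modules; the individual components are covered in sections 1.2–1.5. It uses `nn.RMSNorm` (available in PyTorch 2.4+) and is an illustration, not Meta's reference code.

```python
import torch
import torch.nn as nn

class LlamaDecoderLayer(nn.Module):
    """Pre-norm residual block: x + Attn(RMSNorm(x)), then x + FFN(RMSNorm(x))."""

    def __init__(self, dim: int, attention: nn.Module, feed_forward: nn.Module):
        super().__init__()
        self.attention_norm = nn.RMSNorm(dim)  # pre-normalization (section 1.2)
        self.ffn_norm = nn.RMSNorm(dim)
        self.attention = attention             # GQA with RoPE (sections 1.3-1.4)
        self.feed_forward = feed_forward       # SwiGLU MLP (section 1.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize the input of each sub-layer (not its output), then add the residual.
        x = x + self.attention(self.attention_norm(x))
        x = x + self.feed_forward(self.ffn_norm(x))
        return x
```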

1.2 RMSNorm (root mean square)

  • LayerNorm: $y = \dfrac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} \cdot \gamma + \beta$
  • RMSNorm is roughly 40% faster than LayerNorm because it drops the mean and variance statistics and rescales only by the root mean square: $\bar{x}_i = \dfrac{x_i}{\mathrm{RMS}(x)} \, g_i$, where $\mathrm{RMS}(x) = \sqrt{\dfrac{1}{d} \sum_{i=1}^{d} x_i^2}$.

Llama applies RMSNorm to the input of each attention and MLP sub-layer (pre-normalization), which is more stable than normalizing the output.
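A minimal PyTorch sketch of RMSNorm matching the formula above (illustrative, not Meta's exact implementation):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Rescale by 1 / RMS(x) with a learnable gain; no mean subtraction, no bias."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # gain g_i

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 1 / sqrt(mean(x^2) + eps), computed over the feature dimension
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * inv_rms * self.weight
```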

1.3 RoPE (Rotary Position Embedding)

For details, see the separate post: Link
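As a quick reminder of the mechanism, here is a hedged sketch of applying rotary embeddings to query/key tensors by treating adjacent dimension pairs as complex numbers and rotating them by position-dependent angles; `rope_frequencies` and `apply_rope` are illustrative names, not an official API.

```python
import torch

def rope_frequencies(head_dim: int, seq_len: int, base: float = 10000.0) -> torch.Tensor:
    """Complex rotation factors e^{i * m * theta_k} for positions m and dimension pairs k."""
    theta = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    m = torch.arange(seq_len).float()
    angles = torch.outer(m, theta)                       # (seq_len, head_dim / 2)
    return torch.polar(torch.ones_like(angles), angles)  # complex64

def apply_rope(x: torch.Tensor, freqs: torch.Tensor) -> torch.Tensor:
    """Rotate q or k of shape (batch, seq_len, n_heads, head_dim)."""
    x_c = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    rotated = x_c * freqs[None, :, None, :]              # broadcast over batch and heads
    return torch.view_as_real(rotated).flatten(-2).type_as(x)
```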

1.4 GQA (Grouped-Query Attention)

GQA (Grouped-Query Attention) is not as extreme as MQA (Multi-Query Attention): it groups the query heads and shares a single key-value (KV) head within each group, achieving quality close to MHA (Multi-Head Attention) while keeping decoding speed comparable to MQA.
Figure 1. Overview of the grouped-query method. Multi-head attention has H query, key, and value heads. Multi-query attention shares single key and value heads across all query heads. Grouped-query attention instead shares single key and value heads for each group of query heads, interpolating between multi-head and multi-query attention.
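A compact sketch of the grouping, assuming `n_heads = n_kv_heads * n_groups`: the KV heads are simply repeated so that each group of query heads attends with one shared KV head (illustrative; uses PyTorch's built-in scaled dot-product attention):

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                            n_groups: int) -> torch.Tensor:
    """q: (B, T, n_heads, d); k, v: (B, T, n_kv_heads, d), n_heads = n_kv_heads * n_groups."""
    # Repeat each KV head so every group of query heads shares one KV head.
    k = k.repeat_interleave(n_groups, dim=2)
    v = v.repeat_interleave(n_groups, dim=2)
    # Standard per-head scaled dot-product attention with a causal mask.
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))     # -> (B, heads, T, d)
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return out.transpose(1, 2)                           # back to (B, T, n_heads, d)
```

With `n_groups = 1` (as many KV heads as query heads) this reduces to MHA, and with a single KV head it becomes MQA.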

1.5 SwiGLU

Llama replaces the ReLU activation in the feed-forward network with SwiGLU, which gates a linear projection with the SiLU (Swish) activation. The formulation is as follows:

$$\mathrm{SiLU}(x) = x \cdot \sigma(x), \qquad \mathrm{FFN}_{\mathrm{SwiGLU}}(x) = W_2 \big( \mathrm{SiLU}(W_1 x) \odot W_3 x \big)$$

where $\odot$ denotes element-wise multiplication and the bias terms are omitted.
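A minimal PyTorch sketch of the corresponding feed-forward block (illustrative; bias-free projections follow the formula above, not necessarily Meta's exact code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Llama-style MLP: down( SiLU(gate(x)) * up(x) ), with no bias terms."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden_dim, bias=False)  # W1
        self.up = nn.Linear(dim, hidden_dim, bias=False)    # W3
        self.down = nn.Linear(hidden_dim, dim, bias=False)  # W2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))
```

In the LLaMA papers the hidden dimension is set to $\tfrac{2}{3} \cdot 4d$ rather than $4d$, which keeps the parameter count comparable to a standard two-matrix FFN despite the extra gate projection.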