This post walks through the model architecture of the Meta Llama family, intended as a quick reference for the future.
1. Llama 2
| LLaMA | dimension | heads | layers | learning rate | batch size | tokens |
|-------|-----------|-------|--------|---------------|------------|--------|
| 7B    | 4096      | 32    | 32     | 3.0E-04       | 4M         | 1T     |
| 13B   | 5120      | 40    | 40     | 3.0E-04       | 4M         | 1T     |
| 33B   | 6656      | 52    | 60     | 1.5E-04       | 4M         | 1.4T   |
| 65B   | 8192      | 64    | 80     | 1.5E-04       | 4M         | 1.4T   |
1.1 Model architecture

1.2 RMSNorm (root mean square)
- LayerNorm: $y = \dfrac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} \cdot \gamma + \beta$
- RMSNorm is roughly 40% faster than LayerNorm because it drops the mean and variance terms: $\bar{a}_i = \dfrac{a_i}{\mathrm{RMS}(\mathbf{a})} \, g_i$, where $\mathrm{RMS}(\mathbf{a}) = \sqrt{\dfrac{1}{n}\sum_{i=1}^{n} a_i^2}$

Llama applies RMSNorm to the input of each attention and MLP sub-layer (pre-normalization), which is more stable than normalizing the output of each sub-layer (post-normalization).
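A minimal PyTorch sketch of the RMSNorm operation above; the class and parameter names are illustrative, not the official Llama implementation:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm: scale by the root mean square only -- no mean or variance."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable gain g

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # RMS over the feature dimension, then rescale by the gain.
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return (x / rms) * self.weight
```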
1.3 RoPE (Rotary Position Embedding)
RoPE is covered in a separate post: Link
1.4 GQA (Grouped-Query Attention)
GQA (Grouped-Query Attention) is not as extreme as MQA (Multi-Query Attention). Instead, it groups the query heads and shares one set of key-value (KV) heads within each group, achieving quality close to MHA (Multi-Head Attention) while keeping inference speed comparable to MQA.
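A minimal PyTorch sketch of grouped-query attention, assuming hypothetical module and projection names (`GroupedQueryAttention`, `wq`, `wk`, `wv`, `wo`) and omitting RoPE and the KV cache for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    """n_heads query heads share n_kv_heads key/value heads.
    n_kv_heads == n_heads recovers MHA; n_kv_heads == 1 recovers MQA."""
    def __init__(self, dim: int, n_heads: int, n_kv_heads: int):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.wq = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bsz, seqlen, _ = x.shape
        q = self.wq(x).view(bsz, seqlen, self.n_heads, self.head_dim)
        k = self.wk(x).view(bsz, seqlen, self.n_kv_heads, self.head_dim)
        v = self.wv(x).view(bsz, seqlen, self.n_kv_heads, self.head_dim)
        # Each group of n_heads // n_kv_heads query heads reuses one KV head.
        group = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(group, dim=2)
        v = v.repeat_interleave(group, dim=2)
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # (bsz, heads, seq, head_dim)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(bsz, seqlen, -1)
        return self.wo(out)
```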

1.5 SwiGLU
Llama uses SwiGLU (a gated linear unit built on SiLU, also known as Swish) instead of ReLU in the feed-forward network. The formulation is as follows:

$\mathrm{FFN}_{\mathrm{SwiGLU}}(x) = W_2\big(\mathrm{SiLU}(x W_1) \otimes (x W_3)\big), \quad \mathrm{SiLU}(x) = x \cdot \sigma(x)$
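A minimal PyTorch sketch of the SwiGLU feed-forward block above; the weight names `w1`, `w2`, `w3` follow the formula rather than any particular implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """SiLU-gated product of two up projections, followed by a down projection."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # up projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # FFN(x) = W2( SiLU(x W1) * (x W3) )
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```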