This post walks through the model architecture of the Meta Llama family, intended as a quick reference for the future.
1. Llama 2
| LLaMA | dimension | heads | layers | learning rate | batch size | tokens |
|-------|-----------|-------|--------|---------------|------------|--------|
| 7B    | 4096      | 32    | 32     | 3.0E-04       | 4M         | 1T     |
| 13B   | 5120      | 40    | 40     | 3.0E-04       | 4M         | 1T     |
| 33B   | 6656      | 52    | 60     | 1.5E-04       | 4M         | 1.4T   |
| 65B   | 8192      | 64    | 80     | 1.5E-04       | 4M         | 1.4T   |
1.1 Model architecture

1.2 RMSNorm (root mean square)
- LayerNorm: $y = \dfrac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} \cdot \gamma + \beta$
- RMSNorm is roughly 40% faster than LayerNorm because it drops the mean and variance terms: $\bar{a}_i = \dfrac{a_i}{\mathrm{RMS}(\mathbf{a})} \, g_i$, where $\mathrm{RMS}(\mathbf{a}) = \sqrt{\dfrac{1}{n}\sum_{i=1}^{n} a_i^2}$

Llama applies RMSNorm to the input of each attention and MLP sub-layer (pre-normalization), which is more stable than normalizing the output of each sub-layer (post-normalization).
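A minimal PyTorch sketch of the RMSNorm operation above; the class and parameter names are illustrative, not the official Llama implementation:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm: scale by the root mean square only -- no mean or variance."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable gain g

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # RMS over the feature dimension, then rescale by the gain.
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return (x / rms) * self.weight
```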
1.3 RoPE (Rotary Position Embedding)
RoPE is covered in a separate post: Link
1.4 GQA (Grouped-Query Attention)
GQA (Grouped-Query Attention) is not as extreme as MQA (Multi-Query Attention). Instead, it groups the query heads and shares one set of key-value (KV) heads within each group, achieving quality close to MHA (Multi-Head Attention) while keeping inference speed comparable to MQA.
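A minimal PyTorch sketch of grouped-query attention, assuming hypothetical module and projection names (`GroupedQueryAttention`, `wq`, `wk`, `wv`, `wo`) and omitting RoPE and the KV cache for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    """n_heads query heads share n_kv_heads key/value heads.
    n_kv_heads == n_heads recovers MHA; n_kv_heads == 1 recovers MQA."""
    def __init__(self, dim: int, n_heads: int, n_kv_heads: int):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.wq = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bsz, seqlen, _ = x.shape
        q = self.wq(x).view(bsz, seqlen, self.n_heads, self.head_dim)
        k = self.wk(x).view(bsz, seqlen, self.n_kv_heads, self.head_dim)
        v = self.wv(x).view(bsz, seqlen, self.n_kv_heads, self.head_dim)
        # Each group of n_heads // n_kv_heads query heads reuses one KV head.
        group = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(group, dim=2)
        v = v.repeat_interleave(group, dim=2)
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # (bsz, heads, seq, head_dim)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(bsz, seqlen, -1)
        return self.wo(out)
```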

1.5 SwiGLU
Llama uses SwiGLU (a gated linear unit built on SiLU, also known as Swish) instead of ReLU in the feed-forward network. The formulation is as follows:

$\mathrm{FFN}_{\mathrm{SwiGLU}}(x) = W_2\big(\mathrm{SiLU}(x W_1) \otimes (x W_3)\big), \quad \mathrm{SiLU}(x) = x \cdot \sigma(x)$
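A minimal PyTorch sketch of the SwiGLU feed-forward block above; the weight names `w1`, `w2`, `w3` follow the formula rather than any particular implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """SiLU-gated product of two up projections, followed by a down projection."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # up projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # FFN(x) = W2( SiLU(x W1) * (x W3) )
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```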