Efficient LLM inference (Episode 1)

Tags: LLM, Inference, Long sequence
Published: June 13, 2025
Author: Yonghai Gong

1. Background

Large Language Models (LLMs) can now support contexts of up to millions of tokens, enabling tasks such as codebase analysis, document summarization, and large-scale retrieval.
However, processing such long sequences requires substantial compute and memory and incurs high latency, because the self-attention mechanism, Attention(Q, K, V) = softmax(QK^T/√d)V, scales quadratically with the sequence length n: the n×n score matrix QK^T alone costs O(n^2) time and memory.
This blog introduces recent research papers that tackle the challenges of processing long sequences.
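
To make the quadratic cost concrete, here is a minimal NumPy reference of standard attention (shapes and names are my own illustration, not from any of the papers below); it materializes the full n×n score matrix, which is exactly the memory traffic the methods in this post try to avoid:

```python
# Minimal NumPy reference of standard attention, to make the quadratic
# cost concrete. Shapes and names are illustrative only.
import numpy as np

def naive_attention(Q, K, V):
    """Q, K, V: (n, d) arrays; materializes the full (n, n) score matrix."""
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)                       # (n, n) scores: O(n^2) memory
    P = np.exp(S - S.max(axis=-1, keepdims=True))  # numerically stable softmax
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V                                   # (n, d) output

n, d = 4096, 64
Q, K, V = (np.random.randn(n, d).astype(np.float32) for _ in range(3))
O = naive_attention(Q, K, V)   # the (n, n) score matrix alone is ~64 MB in fp32
```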

2. FlashAttention

2.1 Key problem

The self-attention computation writes large intermediate matrices (S = QK^T and P = softmax(S)) to high-bandwidth memory (HBM), which has far more capacity but far less bandwidth than on-chip SRAM, causing high latency and memory usage.
Sparse/low-rank approximations reduce FLOPs but not IO, so they offer limited real-world speedup.

2.2 Methodology

[Figure: FlashAttention overview]
FlashAttention focuses on reducing the IO time of the self-attention computation. It relies on three main techniques (a runnable sketch follows the pseudocode below):
  1. Tiling
    1. partition the inputs into blocks that are loaded into SRAM and processed block by block, avoiding materialization of the full attention matrix
    2. perform the softmax block-wise (online softmax) by maintaining the running row-wise maximum m and normalization sum ℓ
  2. Recomputation
    1. during backpropagation, recompute S = QK^T and P = softmax(S) on the fly, saving memory by storing only the output O and the softmax statistics (m, ℓ)
  3. Kernel fusion
    1. fuse matrix multiplication, softmax, masking, etc. into a single CUDA kernel to reduce HBM read/write operations
The pseudocode is shown below:
[Figure: FlashAttention forward-pass pseudocode]
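
For concreteness, here is a minimal NumPy sketch of the same idea: blockwise computation with an online softmax that keeps only the running maximum m and normalizer ℓ. Block size and function names are my own choices; the real kernel runs fused on-chip in SRAM rather than in NumPy.

```python
# Blockwise attention forward pass with online softmax, in the spirit of the
# FlashAttention pseudocode above (illustrative NumPy, not the fused kernel).
import numpy as np

def blockwise_attention(Q, K, V, block=256):
    n, d = Q.shape
    O = np.zeros((n, d))
    m = np.full(n, -np.inf)        # running row-wise max of the scores
    l = np.zeros(n)                # running softmax normalizer (sum of exps)

    for j in range(0, n, block):                       # stream K/V blocks
        Kj, Vj = K[j:j + block], V[j:j + block]
        for i in range(0, n, block):                   # per query block
            Si = Q[i:i + block] @ Kj.T / np.sqrt(d)    # only a (block, block) tile
            m_new = np.maximum(m[i:i + block], Si.max(axis=-1))
            P = np.exp(Si - m_new[:, None])
            scale = np.exp(m[i:i + block] - m_new)     # rescale old statistics
            l[i:i + block] = scale * l[i:i + block] + P.sum(axis=-1)
            O[i:i + block] = scale[:, None] * O[i:i + block] + P @ Vj
            m[i:i + block] = m_new
    return O / l[:, None]
```

Only one (block, block) score tile exists at a time, and the pair (m, ℓ) is exactly what gets stored alongside the output O so that the backward pass can recompute S and P instead of reading them back from HBM.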

3. RingAttention

3.1 Key problem

Memory-efficient attention methods like FlashAttention reduce the memory of the self-attention computation itself, but storing each layer's outputs (including the feed-forward network activations) still grows linearly with sequence length, so a single device remains the bottleneck for very long contexts.

3.2 Core challenges

  1. Memory constraint on a single device: storing the activations for a 100M-token sequence requires over 1000 GB of memory even for a modest model, far beyond the capacity of any modern GPU/TPU (see the back-of-the-envelope sketch after this list)
  2. Poor scalability: traditional methods cannot scale context length linearly with the number of devices (e.g., GPT-4 is capped at 32K tokens)
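
A back-of-the-envelope check (my own illustrative numbers: bf16 activations, hidden size 1024) shows why this is hopeless on one device:

```python
# Activation memory for a single (s, h) layer output at 100M tokens.
# Assumptions (illustrative): bf16 = 2 bytes per value, hidden size h = 1024.
s, h, bytes_per_val = 100_000_000, 1024, 2
one_tensor = s * h * bytes_per_val
print(one_tensor / 1e9, "GB")   # 204.8 GB for just one (s, h) tensor
# Each Transformer layer keeps several such tensors (attention output, FFN
# intermediates, residuals), so the total quickly exceeds 1000 GB.
```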

3.3 Methodology

The paper introduces RingAttention, which enables near-infinite context length (up to millions of tokens) via blockwise computation and communication overlap across devices.
  1. Blockwise computation
    1. the sequence is partitioned across multiple devices; each device processes its local query block while the KV blocks are passed from device to device along a ring topology
  2. Communication-computation overlap
    1. while computing attention against the KV block it currently holds, each device asynchronously sends that KV block to the next neighbor and receives the following one from the previous neighbor
    2. condition for full overlap: the per-block compute time must cover the KV transfer time, which holds when the block size is at least on the order of the device's FLOPS-to-interconnect-bandwidth ratio (comfortably satisfied on, e.g., A100 GPUs connected via NVLink)
  3. Memory optimization
    1. each device stores only 6 blocks (an activation size of 6bch bytes), making memory usage independent of the sequence length s
The architecture of RingAttention and its pseudocode are shown below:
[Figures: RingAttention architecture and pseudocode]
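
As a rough single-process illustration of the ring pattern (plain NumPy, arbitrary sizes, not the actual multi-device implementation), each "device" keeps its query block fixed and folds in one KV block per ring step using the same online-softmax update as before:

```python
# Single-process simulation of RingAttention: KV blocks rotate around a ring
# of "devices" while each device accumulates blockwise attention for its own
# query block. Device count and block sizes are illustrative.
import numpy as np

def ring_attention_sim(Q_blocks, K_blocks, V_blocks):
    p = len(Q_blocks)                   # number of simulated devices
    d = Q_blocks[0].shape[-1]
    outs = []
    for dev in range(p):
        Qi = Q_blocks[dev]
        O = np.zeros_like(Qi, dtype=float)
        m = np.full(Qi.shape[0], -np.inf)
        l = np.zeros(Qi.shape[0])
        for step in range(p):           # after p steps, dev has seen every KV block
            src = (dev + step) % p      # the KV block currently held by dev
            S = Qi @ K_blocks[src].T / np.sqrt(d)
            m_new = np.maximum(m, S.max(axis=-1))
            P = np.exp(S - m_new[:, None])
            scale = np.exp(m - m_new)
            l = scale * l + P.sum(axis=-1)
            O = scale[:, None] * O + P @ V_blocks[src]
            m = m_new
            # In the real system, sending this KV block to the next device and
            # receiving the next one overlaps with the computation above.
        outs.append(O / l[:, None])
    return np.concatenate(outs)
```

Because a device only ever holds its own query block plus the KV block currently in flight, its memory stays constant no matter how long the full sequence is.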

3.4 Comparison of maximum activation size among different architectures

[Table: maximum activation size per layer for different architectures]
where b is the batch size, h the hidden dimension, n the number of attention heads, s the sequence length, and c the block size.
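
Plugging illustrative numbers into the 6bch bound (my own choices: b = 1, c = 4096, h = 4096, bf16) shows how small the per-device footprint stays, independent of s:

```python
# Per-device activation footprint of RingAttention: 6*b*c*h values.
# Illustrative assumptions: b = 1, block size c = 4096, h = 4096, bf16 (2 bytes).
b, c, h, bytes_per_val = 1, 4096, 4096, 2
print(6 * b * c * h * bytes_per_val / 1e6, "MB")   # ~201 MB, independent of s
```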

4. Star Attention

4.1 Key problem

The motivation is the same as before (the memory and latency of very long contexts), plus the substantial communication overhead of passing KV blocks around the ring in RingAttention.

4.2 Key techniques

Star Attention uses a two-phase approach, shown in the figure below:
[Figure: the two phases of Star Attention]
  1. Anchor block mechanism
    1. the context is divided into contiguous blocks c_1, …, c_n; each block except the first is prefixed with the first block c_1 (the anchor block) before local attention is computed
This is motivated by the observation that attention scores concentrate on the earliest tokens of the sequence (attention sinks); prepending the first block preserves this global attention pattern and avoids spurious sinks at block boundaries.
[Figure: attention patterns motivating the anchor block]
  2. Distributed softmax
    1. during query encoding and decoding, the query block attends to every host's local KV cache; each host returns its local attention output together with a log-sum-exp statistic, and the query host merges them into the exact global softmax without moving any KV blocks (a minimal sketch follows below)
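
A minimal sketch of how such a merge can work (my own NumPy illustration, not the paper's code): each host returns its locally normalized output plus the log-sum-exp of its scores, and the query host combines them into the exact global softmax, so no KV blocks ever leave their hosts.

```python
# Merging per-host attention results with log-sum-exp statistics.
# Names and shapes are illustrative, not from the Star Attention codebase.
import numpy as np

def local_attention(q, K, V):
    """One host: attend q (n_q, d) to its local K, V; return output and lse."""
    d = q.shape[-1]
    S = q @ K.T / np.sqrt(d)
    m = S.max(axis=-1, keepdims=True)
    P = np.exp(S - m)
    lse = m.squeeze(-1) + np.log(P.sum(axis=-1))    # log-sum-exp per query row
    O = P @ V / P.sum(axis=-1, keepdims=True)       # locally normalized output
    return O, lse

def merge_hosts(partials):
    """Combine [(O_i, lse_i)] from all hosts into the exact global attention."""
    Os = np.stack([o for o, _ in partials])         # (hosts, n_q, d)
    lses = np.stack([l for _, l in partials])       # (hosts, n_q)
    w = np.exp(lses - lses.max(axis=0))             # per-host softmax weights
    w /= w.sum(axis=0)
    return (w[..., None] * Os).sum(axis=0)          # (n_q, d) global output
```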

4.3 Pseudocode

[Figures: Star Attention pseudocode]
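
For completeness, a sketch of how Phase 1 could prepend the anchor block on each host (my own illustration; `attn_layer` is a hypothetical stand-in for one Transformer layer that returns its output and KV cache): the anchor's KV entries are computed but then dropped, so each host only serves the cache of its own block.

```python
# Phase 1 sketch: local context encoding with an anchor-block prefix.
# `attn_layer` is a hypothetical callable: x -> (output, K_cache, V_cache).
import numpy as np

def encode_block_with_anchor(anchor_block, own_block, attn_layer):
    x = np.concatenate([anchor_block, own_block], axis=0)  # [anchor ; block]
    out, K, V = attn_layer(x)
    a = anchor_block.shape[0]
    # Keep only the host's own block in the KV cache; discard the anchor part.
    return out[a:], K[a:], V[a:]
```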