Efficient LLM inference (Episode 1)

Tags: LLM, Inference, Long sequence
Published: June 13, 2025
Author: Yonghai Gong

1. Background

Large Language Models (LLMs) can now support contexts of up to millions of tokens, enabling tasks such as codebase analysis, document summarization, and large-scale retrieval.
However, processing such long sequences requires substantial compute and memory and incurs high latency, because the self-attention mechanism, Attention(Q, K, V) = softmax(QK^T/√d)V, scales quadratically with the sequence length n: the n×n score matrix QK^T alone costs O(n^2) time and memory.
This blog introduces recent research papers that tackle the challenges of processing long sequences.
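
To make the quadratic cost concrete, here is a minimal NumPy reference of standard attention (shapes and names are my own illustration, not from any of the papers below); it materializes the full n×n score matrix, which is exactly the memory traffic the methods in this post try to avoid:

```python
# Minimal NumPy reference of standard attention, to make the quadratic
# cost concrete. Shapes and names are illustrative only.
import numpy as np

def naive_attention(Q, K, V):
    """Q, K, V: (n, d) arrays; materializes the full (n, n) score matrix."""
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)                       # (n, n) scores: O(n^2) memory
    P = np.exp(S - S.max(axis=-1, keepdims=True))  # numerically stable softmax
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V                                   # (n, d) output

n, d = 4096, 64
Q, K, V = (np.random.randn(n, d).astype(np.float32) for _ in range(3))
O = naive_attention(Q, K, V)   # the (n, n) score matrix alone is ~64 MB in fp32
```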

2. FlashAttention

2.1 Key problem

The self-attention computation writes large intermediate matrices (S = QK^T and P = softmax(S)) to high-bandwidth memory (HBM), which has far more capacity but far less bandwidth than on-chip SRAM, causing high latency and memory usage.
Sparse/low-rank approximations reduce FLOPs but not IO, so they offer limited real-world speedup.

2.2 Methodology

[Figure: FlashAttention overview]
FlashAttention focuses on reducing the IO time of the self-attention computation. It relies on three main techniques (a runnable sketch follows the pseudocode below):
  1. Tiling
    1. partition the inputs into blocks that are loaded into SRAM and processed block by block, avoiding materialization of the full attention matrix
    2. perform the softmax block-wise (online softmax) by maintaining the running row-wise maximum m and normalization sum ℓ
  2. Recomputation
    1. during backpropagation, recompute S = QK^T and P = softmax(S) on the fly, saving memory by storing only the output O and the softmax statistics (m, ℓ)
  3. Kernel fusion
    1. fuse matrix multiplication, softmax, masking, etc. into a single CUDA kernel to reduce HBM read/write operations
The pseudocode is shown below:
[Figure: FlashAttention forward-pass pseudocode]
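
For concreteness, here is a minimal NumPy sketch of the same idea: blockwise computation with an online softmax that keeps only the running maximum m and normalizer ℓ. Block size and function names are my own choices; the real kernel runs fused on-chip in SRAM rather than in NumPy.

```python
# Blockwise attention forward pass with online softmax, in the spirit of the
# FlashAttention pseudocode above (illustrative NumPy, not the fused kernel).
import numpy as np

def blockwise_attention(Q, K, V, block=256):
    n, d = Q.shape
    O = np.zeros((n, d))
    m = np.full(n, -np.inf)        # running row-wise max of the scores
    l = np.zeros(n)                # running softmax normalizer (sum of exps)

    for j in range(0, n, block):                       # stream K/V blocks
        Kj, Vj = K[j:j + block], V[j:j + block]
        for i in range(0, n, block):                   # per query block
            Si = Q[i:i + block] @ Kj.T / np.sqrt(d)    # only a (block, block) tile
            m_new = np.maximum(m[i:i + block], Si.max(axis=-1))
            P = np.exp(Si - m_new[:, None])
            scale = np.exp(m[i:i + block] - m_new)     # rescale old statistics
            l[i:i + block] = scale * l[i:i + block] + P.sum(axis=-1)
            O[i:i + block] = scale[:, None] * O[i:i + block] + P @ Vj
            m[i:i + block] = m_new
    return O / l[:, None]
```

Only one (block, block) score tile exists at a time, and the pair (m, ℓ) is exactly what gets stored alongside the output O so that the backward pass can recompute S and P instead of reading them back from HBM.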

3. RingAttention

3.1 Key problem

Memory-efficient attention methods like FlashAttention reduce the memory of the self-attention computation itself, but storing each layer's outputs (including the feed-forward network activations) still grows linearly with sequence length, so a single device remains the bottleneck for very long contexts.

3.2 Core challenges

  1. Memory constraint on a single device: storing the activations for a 100M-token sequence requires over 1000 GB of memory even for a modest model, far beyond the capacity of any modern GPU/TPU (see the back-of-the-envelope sketch after this list)
  2. Poor scalability: traditional methods cannot scale context length linearly with the number of devices (e.g., GPT-4 is capped at 32K tokens)
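
A back-of-the-envelope check (my own illustrative numbers: bf16 activations, hidden size 1024) shows why this is hopeless on one device:

```python
# Activation memory for a single (s, h) layer output at 100M tokens.
# Assumptions (illustrative): bf16 = 2 bytes per value, hidden size h = 1024.
s, h, bytes_per_val = 100_000_000, 1024, 2
one_tensor = s * h * bytes_per_val
print(one_tensor / 1e9, "GB")   # 204.8 GB for just one (s, h) tensor
# Each Transformer layer keeps several such tensors (attention output, FFN
# intermediates, residuals), so the total quickly exceeds 1000 GB.
```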

3.3 Methodology

The paper introduces RingAttention, which enables near-infinite context length (up to millions of tokens) via blockwise computation and communication overlap across devices.
  1. Blockwise computation
    1. the sequence is partitioned across multiple devices; each device processes its local query block while the KV blocks are passed from device to device along a ring topology
  2. Communication-computation overlap
    1. while computing attention against the KV block it currently holds, each device asynchronously sends that KV block to the next neighbor and receives the following one from the previous neighbor
    2. condition for full overlap: the per-block compute time must cover the KV transfer time, which holds when the block size is at least on the order of the device's FLOPS-to-interconnect-bandwidth ratio (comfortably satisfied on, e.g., A100 GPUs connected via NVLink)
  3. Memory optimization
    1. each device stores only 6 blocks (an activation size of 6bch bytes), making memory usage independent of the sequence length s
The architecture of RingAttention and its pseudocode are shown below:
[Figures: RingAttention architecture and pseudocode]
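
As a rough single-process illustration of the ring pattern (plain NumPy, arbitrary sizes, not the actual multi-device implementation), each "device" keeps its query block fixed and folds in one KV block per ring step using the same online-softmax update as before:

```python
# Single-process simulation of RingAttention: KV blocks rotate around a ring
# of "devices" while each device accumulates blockwise attention for its own
# query block. Device count and block sizes are illustrative.
import numpy as np

def ring_attention_sim(Q_blocks, K_blocks, V_blocks):
    p = len(Q_blocks)                   # number of simulated devices
    d = Q_blocks[0].shape[-1]
    outs = []
    for dev in range(p):
        Qi = Q_blocks[dev]
        O = np.zeros_like(Qi, dtype=float)
        m = np.full(Qi.shape[0], -np.inf)
        l = np.zeros(Qi.shape[0])
        for step in range(p):           # after p steps, dev has seen every KV block
            src = (dev + step) % p      # the KV block currently held by dev
            S = Qi @ K_blocks[src].T / np.sqrt(d)
            m_new = np.maximum(m, S.max(axis=-1))
            P = np.exp(S - m_new[:, None])
            scale = np.exp(m - m_new)
            l = scale * l + P.sum(axis=-1)
            O = scale[:, None] * O + P @ V_blocks[src]
            m = m_new
            # In the real system, sending this KV block to the next device and
            # receiving the next one overlaps with the computation above.
        outs.append(O / l[:, None])
    return np.concatenate(outs)
```

Because a device only ever holds its own query block plus the KV block currently in flight, its memory stays constant no matter how long the full sequence is.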

3.4 Comparison of maximum activation size among different architectures

[Table: maximum activation size per layer for different architectures]
where b is the batch size, h the hidden dimension, n the number of attention heads, s the sequence length, and c the block size.
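
Plugging illustrative numbers into the 6bch bound (my own choices: b = 1, c = 4096, h = 4096, bf16) shows how small the per-device footprint stays, independent of s:

```python
# Per-device activation footprint of RingAttention: 6*b*c*h values.
# Illustrative assumptions: b = 1, block size c = 4096, h = 4096, bf16 (2 bytes).
b, c, h, bytes_per_val = 1, 4096, 4096, 2
print(6 * b * c * h * bytes_per_val / 1e6, "MB")   # ~201 MB, independent of s
```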

4. Star Attention

4.1 Key problem

The motivation is the same as before (the memory and latency of very long contexts), plus the substantial communication overhead of passing KV blocks around the ring in RingAttention.

4.2 Key techniques

Star Attention uses a two-phase approach, shown in the figure below:
[Figure: the two phases of Star Attention]
  1. Anchor block mechanism
    1. the context is divided into contiguous blocks c_1, …, c_n; each block except the first is prefixed with the first block c_1 (the anchor block) before local attention is computed
This is motivated by the observation that attention scores concentrate on the earliest tokens of the sequence (attention sinks); prepending the first block preserves this global attention pattern and avoids spurious sinks at block boundaries.
[Figure: attention patterns motivating the anchor block]
  2. Distributed softmax
    1. during query encoding and decoding, the query block attends to every host's local KV cache; each host returns its local attention output together with a log-sum-exp statistic, and the query host merges them into the exact global softmax without moving any KV blocks (a minimal sketch follows below)
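
A minimal sketch of how such a merge can work (my own NumPy illustration, not the paper's code): each host returns its locally normalized output plus the log-sum-exp of its scores, and the query host combines them into the exact global softmax, so no KV blocks ever leave their hosts.

```python
# Merging per-host attention results with log-sum-exp statistics.
# Names and shapes are illustrative, not from the Star Attention codebase.
import numpy as np

def local_attention(q, K, V):
    """One host: attend q (n_q, d) to its local K, V; return output and lse."""
    d = q.shape[-1]
    S = q @ K.T / np.sqrt(d)
    m = S.max(axis=-1, keepdims=True)
    P = np.exp(S - m)
    lse = m.squeeze(-1) + np.log(P.sum(axis=-1))    # log-sum-exp per query row
    O = P @ V / P.sum(axis=-1, keepdims=True)       # locally normalized output
    return O, lse

def merge_hosts(partials):
    """Combine [(O_i, lse_i)] from all hosts into the exact global attention."""
    Os = np.stack([o for o, _ in partials])         # (hosts, n_q, d)
    lses = np.stack([l for _, l in partials])       # (hosts, n_q)
    w = np.exp(lses - lses.max(axis=0))             # per-host softmax weights
    w /= w.sum(axis=0)
    return (w[..., None] * Os).sum(axis=0)          # (n_q, d) global output
```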

4.3 Pseudocode

[Figures: Star Attention pseudocode]
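
For completeness, a sketch of how Phase 1 could prepend the anchor block on each host (my own illustration; `attn_layer` is a hypothetical stand-in for one Transformer layer that returns its output and KV cache): the anchor's KV entries are computed but then dropped, so each host only serves the cache of its own block.

```python
# Phase 1 sketch: local context encoding with an anchor-block prefix.
# `attn_layer` is a hypothetical callable: x -> (output, K_cache, V_cache).
import numpy as np

def encode_block_with_anchor(anchor_block, own_block, attn_layer):
    x = np.concatenate([anchor_block, own_block], axis=0)  # [anchor ; block]
    out, K, V = attn_layer(x)
    a = anchor_block.shape[0]
    # Keep only the host's own block in the KV cache; discard the anchor part.
    return out[a:], K[a:], V[a:]
```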