1. Background
Large Language Models (LLMs) can now support contexts of up to millions of tokens, enabling tasks such as codebase analysis, long-document summarization, and large-scale retrieval.
However, processing such long sequences requires substantial computational and memory resources and incurs high latency, due to the quadratic complexity of the self-attention mechanism.
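For reference, standard scaled dot-product attention is

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V,$$

where the $QK^\top$ score matrix grows as $O(n^2)$ with the sequence length $n$.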
This blog post introduces recent research papers that tackle the challenges of processing long sequences.
2. FlashAttention
2.1 Key problem
Standard self-attention materializes the intermediate matrices $S = QK^\top$ and $P = \mathrm{softmax}(S)$ and writes them to high-bandwidth memory (HBM), which has much larger capacity but much lower bandwidth than on-chip SRAM, causing high latency and memory usage.
Sparse/low-rank approximations reduce FLOPs but not IO, so they offer limited real-world speedup.
2.2 Methodology

FlashAttention focuses on reducing IO time when computing self-attention. It relies on three main techniques:
- Tiling
  - partition the inputs into blocks that are loaded into SRAM and processed block by block, so the full attention matrix is never materialized in HBM
  - compute the softmax block-wise by maintaining the running row-wise maximum $m$ and normalization sum $\ell$ (see the update rule after this list)
- Recomputation
  - during backpropagation, recompute $S$ and $P$ on the fly, saving memory by storing only the output $O$ and the softmax statistics $(m, \ell)$
- Kernel fusion
  - fuse matrix multiplication, softmax, masking, etc., into a single CUDA kernel to reduce HBM read/write operations
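Concretely, when the $j$-th block of scores $S^{(j)}$ and values $V^{(j)}$ is processed, the running statistics and (unnormalized) output can be updated with a standard online-softmax recurrence:

$$m^{(j)} = \max\!\big(m^{(j-1)},\ \mathrm{rowmax}(S^{(j)})\big),\qquad \ell^{(j)} = e^{m^{(j-1)} - m^{(j)}}\,\ell^{(j-1)} + \mathrm{rowsum}\!\big(e^{S^{(j)} - m^{(j)}}\big),$$

$$O^{(j)} = e^{m^{(j-1)} - m^{(j)}}\,O^{(j-1)} + e^{S^{(j)} - m^{(j)}}\,V^{(j)},$$

with the final output $O^{(J)} / \ell^{(J)}$ taken row-wise after the last block (one common way to write the update; the paper's exact bookkeeping differs slightly).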
The blockwise forward pass can be summarized as follows.
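Below is a minimal NumPy sketch of the tiled forward pass with the online softmax, for a single head without masking or dropout (the block size and variable names are illustrative; this is not the authors' fused CUDA kernel):

```python
import numpy as np

def blockwise_attention(Q, K, V, block_size=64):
    """Tiled attention with an online softmax (FlashAttention-style sketch).

    Q, K, V: (seq_len, d) arrays for a single head. The full
    (seq_len, seq_len) score matrix is never materialized.
    """
    seq_len, d = Q.shape
    scale = 1.0 / np.sqrt(d)

    O = np.zeros_like(Q)              # unnormalized output accumulator
    m = np.full(seq_len, -np.inf)     # running row-wise max of scores
    l = np.zeros(seq_len)             # running softmax normalizer

    for start in range(0, seq_len, block_size):
        Kb = K[start:start + block_size]          # KV block "loaded into SRAM"
        Vb = V[start:start + block_size]

        S = (Q @ Kb.T) * scale                    # scores against this KV block
        m_new = np.maximum(m, S.max(axis=1))      # updated running max
        corr = np.exp(m - m_new)                  # rescale previous partial results
        P = np.exp(S - m_new[:, None])            # block-local exp(scores)

        l = l * corr + P.sum(axis=1)
        O = O * corr[:, None] + P @ Vb
        m = m_new

    return O / l[:, None]                         # final row-wise normalization

if __name__ == "__main__":
    # sanity check against naive attention
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((256, 32)) for _ in range(3))
    S = Q @ K.T / np.sqrt(32)
    P = np.exp(S - S.max(axis=1, keepdims=True))
    assert np.allclose(blockwise_attention(Q, K, V), (P / P.sum(axis=1, keepdims=True)) @ V)
```

In the real kernel the queries are tiled into blocks as well, and everything runs inside one fused kernel so the working set stays in SRAM.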

3. RingAttention
3.1 Key problem
Memory-efficient attention methods such as FlashAttention reduce the memory cost of self-attention itself, but the activations of the remaining layers (notably the feed-forward network, FFN) still grow linearly with sequence length and remain a bottleneck on a single device.
3.2 Core challenges
- Memory constraint on a single device: processing 100M tokens requires activation memory far beyond modern GPU/TPU capacity (see the rough estimate after this list)
- Poor scalability: traditional methods cannot scale context length linearly with the number of devices (e.g., GPT-4 is capped at 32K tokens)
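As a rough back-of-the-envelope illustration (assuming a modest hidden size $h = 1024$ and bf16 activations at 2 bytes per element), a single $s \times h$ layer output for $s = 10^8$ tokens already occupies

$$10^8 \times 1024 \times 2\ \text{bytes} \approx 205\ \text{GB},$$

and a transformer keeps several such tensors per layer, so the total is far beyond the tens of gigabytes available on a single accelerator.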
3.3 Methodology
The paper introduces RingAttention, which enables near-infinite context length (up to millions of tokens) by combining blockwise computation with communication-computation overlap across devices.
- Blockwise computation
  - the input sequence is partitioned across multiple devices; each device processes its local query block, while the key-value (KV) blocks circulate along a ring topology so that every query block eventually attends to every KV block
- Communication-computation overlap
  - while computing attention for the KV block it currently holds, each device asynchronously sends that KV block to its next neighbor and receives a new one from its previous neighbor
  - condition for full overlap: the per-block compute time must exceed the KV transfer time, which holds whenever the block size is at least on the order of the device's compute-to-bandwidth ratio (comfortably satisfied, e.g., on an A100 with NVLink)
- Memory optimization
  - each device stores only six blocks (about $6bch$ bytes), making memory usage independent of the sequence length $s$
The architecture of RingAttention and its pseudocode are illustrated below.
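As an illustration of the ring-style blockwise computation, here is a minimal single-process NumPy simulation (no causal masking; `ring_attention` and its explicit block rotation are a sketch, not the authors' distributed implementation):

```python
import numpy as np

def ring_attention(Q_blocks, K_blocks, V_blocks):
    """Single-process simulation of ring attention.

    "Device" i permanently holds Q_blocks[i]; at every ring step it holds one
    KV block, which it then passes to its neighbor. After num_devices steps,
    every query block has attended to every KV block, yet no device ever held
    more than one KV block at a time.
    """
    num_devices = len(Q_blocks)
    scale = 1.0 / np.sqrt(Q_blocks[0].shape[-1])

    # per-device accumulators: unnormalized output, running max, normalizer
    O = [np.zeros_like(q) for q in Q_blocks]
    m = [np.full(len(q), -np.inf) for q in Q_blocks]
    l = [np.zeros(len(q)) for q in Q_blocks]

    kv = list(zip(K_blocks, V_blocks))            # each device starts with its own KV block

    for _ in range(num_devices):
        for i in range(num_devices):              # "devices" run in parallel in the real system
            Kb, Vb = kv[i]
            S = (Q_blocks[i] @ Kb.T) * scale
            m_new = np.maximum(m[i], S.max(axis=1))
            corr = np.exp(m[i] - m_new)
            P = np.exp(S - m_new[:, None])
            l[i] = l[i] * corr + P.sum(axis=1)
            O[i] = O[i] * corr[:, None] + P @ Vb
            m[i] = m_new
        # "communication" step: rotate KV blocks one hop around the ring;
        # in the real system this send/receive overlaps with the compute above
        kv = kv[-1:] + kv[:-1]

    return np.concatenate([o / li[:, None] for o, li in zip(O, l)])

if __name__ == "__main__":
    # sanity check against full (non-causal) attention
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((4 * 64, 32)) for _ in range(3))
    out = ring_attention(np.split(Q, 4), np.split(K, 4), np.split(V, 4))
    S = Q @ K.T / np.sqrt(32)
    P = np.exp(S - S.max(axis=1, keepdims=True))
    assert np.allclose(out, (P / P.sum(axis=1, keepdims=True)) @ V)
```

In the real implementation the inner loop over devices runs simultaneously on separate accelerators, and the KV rotation is an asynchronous send/receive hidden behind the attention computation.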



3.4 Comparison of maximum activation size among different architectures

where $b$ is the batch size, $h$ is the hidden dimension, $n$ is the number of heads, $s$ is the sequence length, and $c$ is the block size.
4. Star Attention
4.1 Key problem
The same bottlenecks as described above, plus the substantial communication cost of passing KV blocks between devices in RingAttention.
4.2 Key techniques
Star Attention uses a two-phase approach, shown in the figure below: in phase 1 the context is encoded with blockwise-local attention distributed across hosts, and in phase 2 the query and generated tokens attend globally to all cached KV pairs.

- anchor block mechanism:
  - in phase 1, the context is divided into contiguous blocks, and each block (except the first) is prefixed with the first block (the anchor block) before local attention is computed
  - this is motivated by the observation that the first block carries the most important information for the rest of the context, so anchoring it lets block-local attention better approximate global attention

4.3 Pseudocode
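A minimal single-layer, single-head NumPy sketch of the two phases (the function `star_attention`, its projection matrices `Wq`, `Wk`, `Wv`, and the block layout are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def softmax_rows(S):
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return P / P.sum(axis=-1, keepdims=True)

def star_attention(ctx_hidden, query_hidden, Wq, Wk, Wv, block_size):
    """Illustrative single-layer sketch of the two phases of Star Attention."""
    d = Wq.shape[1]
    scale = 1.0 / np.sqrt(d)
    blocks = [ctx_hidden[i:i + block_size] for i in range(0, len(ctx_hidden), block_size)]
    anchor = blocks[0]

    # ---- Phase 1: blockwise-local context encoding with the anchor prefix ----
    kv_cache = []                                   # one (K, V) pair per host
    for idx, blk in enumerate(blocks):
        local = blk if idx == 0 else np.concatenate([anchor, blk])
        K_loc, V_loc = local @ Wk, local @ Wv
        Q_loc = blk @ Wq
        # the local attention output would feed the next layer; here we only
        # keep the block's own K/V (the anchor copy is discarded after use)
        _ = softmax_rows((Q_loc @ K_loc.T) * scale) @ V_loc
        kv_cache.append((K_loc[-len(blk):], V_loc[-len(blk):]))

    # ---- Phase 2: global attention of the query over all cached K/V ----
    q = query_hidden @ Wq
    local_outs, lse = [], []
    for K_blk, V_blk in kv_cache:                   # runs on each host in parallel
        S = (q @ K_blk.T) * scale
        m = S.max(axis=1, keepdims=True)
        P = np.exp(S - m)
        denom = P.sum(axis=1, keepdims=True)
        local_outs.append(P @ V_blk / denom)        # locally normalized attention output
        lse.append(m + np.log(denom))               # local log-sum-exp of the scores

    # the query host reweights each host's output by its share of the global softmax,
    # recovering the exact global attention over all cached tokens
    host_weights = softmax_rows(np.concatenate(lse, axis=1))   # (num_queries, num_hosts)
    return sum(w[:, None] * o for w, o in zip(host_weights.T, local_outs))
```

Because each host only returns its partial output and one log-sum-exp scalar per query, phase 2 exchanges far less data than circulating full KV blocks around a ring.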

