SGLang RadixAttention 原理与实现详解

2026年3月25日 · 阅读需 7 分钟

Zhiyuan Pan

Blog Author

深入解析 SGLang 的 RadixAttention：Radix Tree 数据结构、前缀匹配与节点分裂、LRU 驱逐策略、引用计数机制，以及与 vLLM Prefix Caching 的对比。

1. 为什么需要 RadixAttention

在 LLM 推理场景中，大量请求共享相同的前缀：

System Prompt: 所有用户共享同一条系统提示词
多轮对话: 同一用户的后续请求包含之前的完整对话历史
Few-shot: 多个请求使用相同的示例

传统做法（包括 vLLM）对每个请求都要从头做 prefill，即使前缀完全相同。这造成了大量重复计算和重复 KV Cache 存储。

核心问题: 如何高效地复用已计算的 KV Cache？

vLLM 的方案是基于 hash 的 prefix caching（自动前缀缓存），SGLang 提出了更优雅的方案——Radix Tree（基数树）。

2. Radix Tree 数据结构

2.1 什么是 Radix Tree

Radix Tree（基数树/压缩前缀树）是 Trie 的压缩版本，将只有一个子节点的链路合并为单个节点：

Trie:                          Radix Tree:
    root                           root
    ├─ [10]                        ├─ [10,20,30] ──→ node_A
    │   └─ [20]                    │   ├─ [40,50] ──→ node_B
    │       └─ [30]                │   └─ [60,70] ──→ node_C
    │           ├─ [40]            └─ [80,90] ──→ node_D
    │           │   └─ [50]
    │           └─ [60]
    │               └─ [70]
    └─ [80]
        └─ [90]

每个节点存储: token 序列 (key) + 对应的 KV Cache slot 索引 (value)

2.2 SGLang 中的节点结构

对应源码 sglang/python/sglang/srt/mem_cache/radix_cache.py 的 TreeNode:

class TreeNode:
    key: list[int]              # 该边上的 token 序列
    value: torch.Tensor         # 对应的 KV Cache 物理索引
    children: dict              # 子节点
    parent: TreeNode            # 父节点
    lock_ref: int               # 引用计数（>0 = 正在使用，不可驱逐）
    last_access_time: float     # LRU 驱逐时间戳
    hit_count: int              # 命中次数（用于 SLRU 驱逐）

2.3 关键设计

字段	作用
`lock_ref`	正在使用此前缀的请求数。>0 时不可驱逐，防止正在推理的请求被抢走 KV Cache
`last_access_time`	每次匹配时更新，LRU 驱逐依据
`hit_count`	SLRU 策略中区分热/冷数据

3. 核心操作

3.1 前缀匹配 (match_prefix)

当新请求到来时，沿 Radix Tree 查找最长匹配前缀：

请求 tokens: [10, 20, 30, 40, 50, 61, 62]

树的当前状态:
  root
  └─ [10,20,30,40,50] ──→ slots=[s0,s1,s2,s3,s4]
      └─ [61,62,63]   ──→ slots=[s5,s6,s7]

匹配过程:
  1. 从 root 开始，找到子节点 key=[10,20,30,40,50]
  2. 逐 token 比对: 10✓ 20✓ 30✓ 40✓ 50✓ → 完全匹配
  3. 进入子节点，找到 key=[61,62,63]
  4. 逐 token 比对: 61✓ 62✓ 63✗(请求没有更多token)
  5. 部分匹配 → 分裂节点!

结果: 匹配 7 个 token，复用 slots=[s0,s1,s2,s3,s4,s5,s6]
      只需计算 0 个新 token 的 KV Cache!

对应源码: RadixCache._match_prefix_helper() (radix_cache.py:671)

3.2 节点分裂 (split_node)

部分匹配时，需要将节点一分为二：

Before:
  parent ──[61,62,63]──→ child (slots=[s5,s6,s7])

After (split at position 2):
  parent ──[61,62]──→ new_node (slots=[s5,s6])
                       └──[63]──→ child (slots=[s7])

为什么需要分裂？ 因为新请求 [..., 61, 62, 71] 只共享前两个 token [61, 62]，第三个 token 不同。分裂后可以精确地在分叉点分享前缀。

对应源码: RadixCache._split_node() (radix_cache.py:697)

3.3 插入 (insert)

请求完成后，将其 KV Cache 插入树中供后续请求复用：

请求 tokens=[10,20,30,81,82], slots=[s0,s1,s2,s8,s9]

当前树:
  root
  └─ [10,20,30,40,50] ──→ slots=[s0,s1,s2,s3,s4]

插入过程:
  1. 匹配 [10,20,30] → 匹配了 3 个 token
  2. [40,50] 不匹配 [81,82] → 在 position 3 分裂
  3. 创建新节点 [81,82] → slots=[s8,s9]

结果:
  root
  └─ [10,20,30] ──→ slots=[s0,s1,s2]
      ├─ [40,50] ──→ slots=[s3,s4]
      └─ [81,82] ──→ slots=[s8,s9]

对应源码: RadixCache._insert_helper() (radix_cache.py:727)

3.4 LRU 驱逐 (evict)

GPU 显存有限，需要驱逐不常用的缓存：

# 驱逐策略（evict_policy.py）
收集所有叶子节点中 lock_ref == 0 的节点
按 last_access_time 排序（最小堆）
弹出最久未使用的叶子节点
释放其 KV slots → 从父节点删除
如果父节点变成叶子且 lock_ref == 0 → 也可以继续驱逐

SGLang 支持 7 种驱逐策略:

策略	排序依据	适用场景
LRU	last_access_time (默认)	通用场景
LFU	hit_count → last_access_time	高频前缀保护
SLRU	分段 LRU（热/冷分离）	区分长期热点和偶发访问
FIFO	creation_time	简单先进先出
MRU	-last_access_time	特殊场景
Priority	priority → last_access_time	优先级感知
FILO	-creation_time	后进先出

对应源码: RadixCache.evict() (radix_cache.py:586) + evict_policy.py

4. 引用计数机制

请求 req-1 匹配到 node_A 和 node_B:

inc_lock_ref:
  node_B.lock_ref: 0 → 1  (不可驱逐)
  node_A.lock_ref: 0 → 1  (不可驱逐)

req-1 推理中... 这些节点的 KV Cache 被保护

dec_lock_ref (req-1 完成):
  node_B.lock_ref: 1 → 0  (可驱逐)
  node_A.lock_ref: 1 → 0  (可驱逐)

关键: inc_lock_ref 和 dec_lock_ref 会沿着 parent 链向上传播，确保整条路径都被保护。

对应源码: RadixCache.inc_lock_ref() / dec_lock_ref() (radix_cache.py:615-649)

5. 与调度器的集成

完整的请求生命周期:

1. 新请求到达
   └→ scheduler 调用 tree.match_prefix(token_ids)
       └→ 返回: matched_slots, last_node, prefix_len

2. 调度执行
   └→ tree.inc_lock_ref(last_node)        # 锁定前缀
   └→ kv_pool.alloc(len - prefix_len)     # 只分配未命中部分
   └→ model.forward(token_ids[prefix_len:])  # 只计算未命中部分

3. Decode 阶段
   └→ 正常逐 token 生成

4. 请求完成
   └→ tree.insert(all_tokens, all_slots)  # 缓存供后续复用
   └→ tree.dec_lock_ref(last_node)        # 解锁前缀

对应源码:

schedule_batch.py Req 类中的 prefix_indices, last_node, extend_input_len
common.py 中的 release_kv_cache() → cache_finished_req()

6. 与 vLLM Prefix Caching 的对比

特性	SGLang Radix Tree	vLLM Hash-based
数据结构	Radix Tree (前缀树)	Hash Table
匹配方式	树遍历，自动最长前缀匹配	按 block hash 查找
粒度	token 级（page_size=1）或 page 级	block 级（固定 block_size）
前缀共享	树结构天然共享公共前缀	相同 hash 的 block 共享
部分匹配	支持（通过节点分裂）	不支持（必须完整 block 匹配）
驱逐策略	7 种策略可选	LRU
实现复杂度	较高	较低

SGLang 的优势: Radix Tree 天然适合前缀共享场景。多个请求共享同一 system prompt 时，树中只存一份 KV Cache，所有请求的公共前缀自动合并。

7. 实际效果示例

运行教学代码 code/course_sglang/radix_attention.py 的输出分析:

5 个请求处理过程:

req-1: [sys_prompt, 61, 62, 63]  → 无命中，完整计算 8 tokens
req-2: [sys_prompt, 61, 62, 71]  → 命中前缀 7/8 tokens! 只需计算 1 token
req-3: [sys_prompt, 81, 82, 83]  → 命中前缀 5/8 tokens! 只需计算 3 tokens
req-4: [90, 91, 92, 93]          → 无命中，完整计算 4 tokens
req-5: [sys_prompt, 61, 62, 63]  → 命中前缀 8/8 tokens! 0 计算!

总 prompt: 36 tokens
前缀复用: 20 tokens
计算节省: 55.6%

8. 源码文件索引

文件	内容
`sglang/srt/mem_cache/radix_cache.py`	Radix Tree 核心实现（Python）
`sglang/srt/mem_cache/evict_policy.py`	7 种驱逐策略
`sglang/srt/mem_cache/common.py`	缓存操作与调度器集成
`sglang/srt/mem_cache/memory_pool.py`	KV Cache 内存池
`sglang/srt/mem_cache/cpp_radix_tree/`	C++ 高性能实现
`sglang/srt/layers/radix_attention.py`	Attention 层集成
`sglang/srt/managers/schedule_batch.py`	调度器中的前缀缓存字段

9. 教学代码

配套的极简模拟代码在: vllm_learn/code/course_sglang/radix_attention.py

代码模拟了完整流程: 前缀匹配 → 节点分裂 → 插入 → LRU 驱逐 → 引用计数，可直接运行观察 Radix Tree 的动态变化。

1. 为什么需要 RadixAttention​

2. Radix Tree 数据结构​

2.1 什么是 Radix Tree​

2.2 SGLang 中的节点结构​

2.3 关键设计​

3. 核心操作​

3.1 前缀匹配 (match_prefix)​

3.2 节点分裂 (split_node)​

3.3 插入 (insert)​

3.4 LRU 驱逐 (evict)​

4. 引用计数机制​

5. 与调度器的集成​

6. 与 vLLM Prefix Caching 的对比​

7. 实际效果示例​

8. 源码文件索引​

9. 教学代码​