NeurIPS 2025

R-KV: Redundancy-aware KV Cache Compression for Reasoning Models

Zefan Cai, Wen Xiao, Hanshi Sun, Cheng Luo, Yikai Zhang, Ke Wan, Yucheng Li, Yeyang Zhou, Li-Wen Chang, Jiuxiang Gu, Zhen Dong, Anima Anandkumar, Abedelkadir Asi, Junjie Hu

Advances in Neural Information Processing Systems (NeurIPS 2025)


Abstract

Reasoning models have demonstrated impressive performance in self-reflection and chain-of-thought reasoning. However, they often produce excessively long outputs, leading to prohibitively large key-value (KV) caches during inference. While chain-of-thought inference significantly improves performance on complex reasoning tasks, it can also lead to reasoning failures when deployed with existing KV cache compression approaches. To address this, we propose Redundancy-aware KV Cache Compression for Reasoning models (R-KV), a novel method specifically targeting redundant tokens in reasoning models. Our method preserves nearly 100% of the full KV cache performance using only 10% of the KV cache, substantially outperforming existing KV cache compression baselines, which reach only 60% of the performance. Remarkably, R-KV even achieves 105% of full KV cache performance with 16% of the KV cache. This KV cache reduction also leads to a 90% memory saving and 6.6× the throughput of standard chain-of-thought reasoning inference. Experimental results show that R-KV consistently outperforms existing KV cache compression baselines across two mathematical reasoning datasets.


BibTeX

@inproceedings{cai2025rkv,
  title={R-{KV}: Redundancy-aware {KV} Cache Compression for Reasoning Models},
  author={Zefan Cai and Wen Xiao and Hanshi Sun and Cheng Luo and Yikai Zhang and Ke Wan and Yucheng Li and Yeyang Zhou and Li-Wen Chang and Jiuxiang Gu and Zhen Dong and Anima Anandkumar and Abedelkadir Asi and Junjie Hu},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025},
  url={https://openreview.net/forum?id=2jwAjomEDB}
}

Papers Citing This Work

Below is a manually curated list of publications that cite this paper. This list may be incomplete due to indexing delays or metadata issues across different sources.

Sparse-RL: Breaking the Memory Wall in LLM Reinforcement Learning via Stable Sparse Rollouts
Sijia Luo, Xiaokang Zhang, Yuxuan Hu, Bohan Zhang, Ke Wang, Jinbo Su, Mengshu Sun, Lei Liang, Jing Zhang · arXiv · 2026
SkipKV: Selective Skipping of KV Generation and Storage for Efficient Inference with Large Reasoning Models
Jiayi Tian, Seyedarmin Azizi, Yequan Zhao, Erfan Baghaei Potraghloo, Sean McPherson, Sharath Nittur Sridhar, Zhengyang Wang, Zheng Zhang, Massoud Pedram, Souvik Kundu · arXiv · 2025
G-KV: Decoding-Time KV Cache Eviction with Global Attention
Mengqi Liao, Lu Wang, Chaoyun Zhang, Zekai Shen, Xiaowei Mao, Si Qin, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Huaiyu Wan · arXiv · 2025
ThinKV: Thought-Adaptive KV Cache Compression for Efficient Reasoning Models
Akshat Ramachandran, Marina Neseem, Charbel Sakr, Rangharajan Venkatesan, Brucek Khailany, Tushar Krishna · arXiv · 2025
Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs
Ngoc Bui, Shubham Sharma, Simran Lamba, Saumitra Mishra, Rex Ying · arXiv · 2025
Sentence-Anchored Gist Compression for Long-Context LLMs
Dmitrii Tarasov, Elizaveta Goncharova, Andrey Kuznetsov · arXiv · 2025
DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning
Hossein Entezari Zarch, Lei Gao, Chaoyi Jiang, Murali Annavaram · arXiv · 2025
On the Existence and Behaviour of Secondary Attention Sinks
Jeffrey T.H. Wong, Cheng Zhang, Louis Mahon, Wayne Luk, Anton Isopoussu, Yiren Zhao · arXiv · 2025
SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning
Huanxuan Liao, Yixing Xu, Shizhu He, Guanchen Li, Xuanwu Yin, Dong Li, Emad Barsoum, Jun Zhao, Kang Liu · arXiv · 2025
SeerAttention-R: Sparse Attention Adaptation for Long Reasoning
Yizhao Gao, Shuming Guo, Shijie Cao, Yuqing Xia, Yu Cheng, Lei Wang, Lingxiao Ma, Yutao Sun, Tianzhu Ye, Li Dong, Hayden Kwok-Hay So, Yu Hua, Ting Cao, Fan Yang, Mao Yang · arXiv · 2025
LazyEviction: Lagged KV Eviction with Attention Pattern Observation for Efficient Long Reasoning
Haoyue Zhang, Hualei Zhang, Xiaosong Ma, Jie Zhang, Song Guo · arXiv · 2025