R-KV: Redundancy-aware KV Cache Compression for Reasoning Models

Zefan Cai; Wen Xiao; Hanshi Sun; Cheng Luo; Yikai Zhang; Ke Wan; Yucheng Li; Yeyang Zhou; Li-Wen Chang; Jiuxiang Gu; Zhen Dong; Anima Anandkumar; Abedelkadir Asi; Junjie Hu

Advances in Neural Information Processing Systems (NeurIPS 2025)

Abstract

Reasoning models have demonstrated impressive performance in self-reflection and chain-of-thought reasoning. However, they often produce excessively long outputs, leading to prohibitively large key-value (KV) caches during inference. While chain-of-thought inference significantly improves performance on complex reasoning tasks, it can also lead to reasoning failures when deployed with existing KV cache compression approaches. To address this, we propose Redundancy-aware KV Cache Compression for Reasoning models (R-KV), a novel method specifically targeting redundant tokens in reasoning models. Our method preserves nearly 100% of the full KV cache performance using only 10% of the KV cache, substantially outperforming existing KV cache baselines, which reach only 60% of the performance. Remarkably, R-KV even achieves 105% of full KV cache performance with 16% of the KV cache. This KV-cache reduction also leads to a 90% memory saving and a 6.6X throughput over standard chain-of-thought reasoning inference. Experimental results show that R-KV consistently outperforms existing KV cache compression baselines across two mathematical reasoning datasets.
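To give a sense of scale for the cache sizes the abstract refers to, here is a minimal back-of-the-envelope sketch of the usual KV cache size formula (one key and one value vector per token, per layer, per KV head). The model shape, the sequence length, and the `kv_cache_bytes` helper are hypothetical and chosen only for illustration; none of them come from the paper, and the 10% figure is treated here simply as a retained-token budget.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch_size=1):
    """Bytes needed to hold keys and values for `seq_len` cached tokens."""
    # The factor 2 accounts for storing both a key and a value per token.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_elem * batch_size)


# Hypothetical model shape (illustration only, not taken from the paper).
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)

full_len = 32_768                  # assumed length of a long reasoning trace
budget_len = full_len // 10        # keep roughly 10% of the tokens

full = kv_cache_bytes(seq_len=full_len, **shape)
budget = kv_cache_bytes(seq_len=budget_len, **shape)

print(f"full cache:   {full / 1024**3:.2f} GiB")
print(f"10% budget:   {budget / 1024**3:.2f} GiB")
print(f"memory saved: {1 - budget / full:.1%}")
```

Because the cache grows linearly with the number of cached tokens, keeping about a tenth of them cuts the cache memory by about nine tenths under these assumptions, independent of the model shape.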

BibTeX

@inproceedings{cai2025rkv,
  title={R-{KV}: Redundancy-aware {KV} Cache Compression for Reasoning Models},
  author={Zefan Cai and Wen Xiao and Hanshi Sun and Cheng Luo and Yikai Zhang and Ke Wan and Yucheng Li and Yeyang Zhou and Li-Wen Chang and Jiuxiang Gu and Zhen Dong and Anima Anandkumar and Abedelkadir Asi and Junjie Hu},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025},
  url={https://openreview.net/forum?id=2jwAjomEDB}
}

Papers Citing This Work

Below is a manually curated list of publications that cite this paper. This list may be incomplete due to indexing delays or metadata issues across different sources.

Sparse-RL: Breaking the Memory Wall in LLM Reinforcement Learning via Stable Sparse Rollouts

Sijia Luo, Xiaokang Zhang, Yuxuan Hu, Bohan Zhang, Ke Wang, Jinbo Su, Mengshu Sun, Lei Liang, Jing Zhang · arXiv · 2026

SkipKV: Selective Skipping of KV Generation and Storage for Efficient Inference with Large Reasoning Models

Jiayi Tian, Seyedarmin Azizi, Yequan Zhao, Erfan Baghaei Potraghloo, Sean McPherson, Sharath Nittur Sridhar, Zhengyang Wang, Zheng Zhang, Massoud Pedram, Souvik Kundu · arXiv · 2025

G-KV: Decoding-Time KV Cache Eviction with Global Attention

Mengqi Liao, Lu Wang, Chaoyun Zhang, Zekai Shen, Xiaowei Mao, Si Qin, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Huaiyu Wan · arXiv · 2025

ThinKV: Thought-Adaptive KV Cache Compression for Efficient Reasoning Models

Akshat Ramachandran, Marina Neseem, Charbel Sakr, Rangharajan Venkatesan, Brucek Khailany, Tushar Krishna · arXiv · 2025

Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs

Ngoc Bui, Shubham Sharma, Simran Lamba, Saumitra Mishra, Rex Ying · arXiv · 2025

Sentence-Anchored Gist Compression for Long-Context LLMs

Dmitrii Tarasov, Elizaveta Goncharova, Kuznetsov Andrey · arXiv · 2025

DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning

Hossein Entezari Zarch, Lei Gao, Chaoyi Jiang, Murali Annavaram · arXiv · 2025

On the Existence and Behaviour of Secondary Attention Sinks

Jeffrey T.H. Wong, Cheng Zhang, Louis Mahon, Wayne Luk, Anton Isopoussu, Yiren Zhao · arXiv · 2025

SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning

Huanxuan Liao, Yixing Xu, Shizhu He, Guanchen Li, Xuanwu Yin, Dong Li, Emad Barsoum, Jun Zhao, Kang Liu · arXiv · 2025

SeerAttention-R: Sparse Attention Adaptation for Long Reasoning

Yizhao Gao, Shuming Guo, Shijie Cao, Yuqing Xia, Yu Cheng, Lei Wang, Lingxiao Ma, Yutao Sun, Tianzhu Ye, Li Dong, Hayden Kwok-Hay So, Yu Hua, Ting Cao, Fan Yang, Mao Yang · arXiv · 2025

LazyEviction: Lagged KV Eviction with Attention Pattern Observation for Efficient Long Reasoning

Haoyue Zhang, Hualei Zhang, Xiaosong Ma, Jie Zhang, Song Guo · arXiv · 2025
