Ke Wan

Hi! I’m a Software Engineer II on the HPC/AI team at Microsoft (Azure Core), where I work on large-scale LLM systems in production cloud environments. My research interests are efficient transformer inference and system-aware LLM serving. I focus on memory-efficient inference techniques, particularly KV-cache optimization, that enable scalable, high-throughput deployment without model retraining. This work addresses fundamental bottlenecks in large-scale inference and has been evaluated and adopted in both academic and industrial research. Related open-source implementations have received 1,100+ GitHub stars, demonstrating strong community engagement and practical impact. I apply these research insights to the design of reliable and scalable inference platforms serving high-volume workloads.

Email | CV | Google Scholar | GitHub

Publications

R-KV: Redundancy-aware KV Cache Compression for Reasoning Models

Zefan Cai, Wen Xiao, Hanshi Sun, Cheng Luo, Yikai Zhang, Ke Wan, Yucheng Li, Yeyang Zhou, Li-Wen Chang, Jiuxiang Gu, Zhen Dong, Anima Anandkumar, Abedelkadir Asi, Junjie Hu

NeurIPS 2025 Poster

From Preferences to Prejudice: The Role of Alignment Tuning in Shaping Social Bias in Video Diffusion Models

Zefan Cai, Haoyi Qiu, Haozhe Zhao, Ke Wan, Jiachen Li, Jiuxiang Gu, Wen Xiao, Nanyun Peng, Junjie Hu

arXiv 2025