
A team of University of Washington and NVIDIA researchers developed a system that can help make large language models (LLMs) faster and more adaptable. LLMs are built on transformers, a neural network architecture driven by attention mechanisms that help artificial intelligence focus on relevant and important information. As these LLMs evolve and find new applications in diverse fields, however, optimized low-level implementations, or GPU kernels, become necessary to help prevent errors and ensure low-latency inference.
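At the core of these attention mechanisms is a simple computation: each new token's query is compared against stored keys, and the resulting weights blend the stored values. The short PyTorch sketch below is a plain reference implementation of that computation, included only to illustrate what optimized GPU kernels like FlashInfer's accelerate; it is not the researchers' code.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Reference attention: softmax(Q K^T / sqrt(d)) V.
    Illustrative shapes: q is [n_queries, d]; k and v are [n_kv, d]."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5  # similarity of each query to each key
    weights = F.softmax(scores, dim=-1)          # attention weights over the KV entries
    return weights @ v                           # weighted sum of the stored values

# Toy example: 4 query tokens attending over an 8-token KV cache, head dimension 64
q = torch.randn(4, 64)
k = torch.randn(8, 64)
v = torch.randn(8, 64)
out = scaled_dot_product_attention(q, k, v)  # -> shape [4, 64]
```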
The researchers introduced FlashInfer, a versatile, open-source LLM inference kernel library that is highly optimized and adaptable to new techniques, including key-value, or KV, cache reuse algorithms. They presented their research, titled “FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving,” at the Eighth Annual Conference on Machine Learning and Systems (MLSys 2025) in May and received a Best Paper Award.
“FlashInfer proves what’s possible when academia, industry and the open-source community innovate together — ideas jump from whiteboard to GPU kernels at lightning speed,” said lead author and Allen School Ph.D. student Zihao Ye, who completed part of the research during his internship at NVIDIA. “That shared, rapid feedback loop lets us iterate, refine and ship breakthrough inference speedups that keep pushing the limits of large language models.”
FlashInfer addresses major challenges that LLMs face in memory access and heterogeneous hardware. To optimize KV cache storage, the attention engine uses a unified block-sparse format, in which data is stored and organized in dense blocks that are easier to navigate, along with composable formats. FlashInfer can also adapt to various attention mechanisms through just-in-time compilation, while its dynamic load-balanced scheduling framework effectively and efficiently handles different workloads. Compared with other state-of-the-art LLM serving solutions, the researchers found that FlashInfer significantly boosted kernel performance across diverse inference scenarios. FlashInfer has already been integrated into several leading LLM serving frameworks, including SGLang, vLLM and MLC Engine.
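For a sense of how such a kernel library is used in a serving stack, the sketch below is modeled on FlashInfer's publicly documented Python examples for a single decode step; the exact function name, tensor shapes and defaults are assumptions that may vary across versions.

```python
import torch
import flashinfer  # open-source FlashInfer package (assumed installed with CUDA support)

num_heads, head_dim, kv_len = 32, 128, 4096

# Single-request decode step: one new query token attends over the cached keys/values.
q = torch.randn(num_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(kv_len, num_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(kv_len, num_heads, head_dim, dtype=torch.float16, device="cuda")

# FlashInfer selects and compiles an attention kernel suited to these shapes and dtypes,
# rather than requiring a hand-written kernel per configuration.
out = flashinfer.single_decode_with_kv_cache(q, k, v)  # -> [num_heads, head_dim]
```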
Additional authors include Allen School professors Stephanie Wang, Baris Kasikci, Arvind Krishnamurthy and Luis Ceze, who is also VP of AI Systems Software at NVIDIA; Vinod Grover of NVIDIA and Wuwei Lin, previously at NVIDIA and now at OpenAI; Carnegie Mellon University professor Tianqi Chen (Ph.D., ‘19) and Ph.D. student Ruihang Lai; Lequn Chen (Ph.D., ‘24) at Perplexity; and Yineng Zhang at SGLang.
Read the full paper on FlashInfer.