With large language models dominating the discourse these days, artificial intelligence researchers find themselves increasingly in the limelight. But while LLMs continue to grow in size — and capture a growing share of the public’s imagination — their utility could be limited by their voracious appetite for compute resources and power.
This is where systems researchers have an opportunity to shine. And it so happens that one of the brightest sparks working at the intersection of AI and systems can be found right here at the University of Washington.
Zihao Ye, a fourth-year Ph.D. student in the Allen School, builds serving systems for foundation models and sparse computation to improve the efficiency and programmability of emerging architectures such as graph neural networks and the aforementioned LLMs. To support his efforts, NVIDIA recently selected Ye as one of 10 recipients of the company’s highly competitive Graduate Research Fellowship. NVIDIA Chief Scientist Bill Dally described the honorees as “among the most talented graduate students in the world.”
Ye applies his talents to developing techniques that enable machine learning systems built on large, sparse tensors to run their sizable workloads more efficiently in resource-constrained contexts such as smartphones and web browsers. To that end, he teamed up in the Allen School’s interdisciplinary SAMPL group with professor Luis Ceze and alum Tianqi Chen (Ph.D., ’19), now a faculty member at Carnegie Mellon University and, with Ceze, a co-founder of Allen School spinout OctoAI.
“Zihao is a deep thinker who is diligent about background research and extremely skilled in systems building. That is a powerful combination in a systems researcher,” said Ceze, who holds the Edward D. Lazowska Professorship in Computer Science & Engineering at the Allen School and also serves as CEO of OctoAI. “He also has a good eye for research problems and is a fantastic colleague and teammate.”
Ye’s eye for research problems led him to pursue what Ceze termed a “very elegant idea” for overcoming the so-called hardware lottery when programming neural networks to run on modern GPUs. One of the main obstacles is that many neural network workloads, such as those used in graph analytics, depend on sparse tensor operations, whereas modern hardware is designed primarily for dense tensor operations. To bridge that gap, Ye and his colleagues created SparseTIR, a composable programming abstraction that supports efficient optimization and compilation of sparse models. SparseTIR decomposes a sparse matrix into multiple sub-matrices with homogeneous sparsity patterns to enable more hardware-friendly storage, while offloading the associated computation to different compute units within the GPU to optimize runtime performance. The team layered their approach onto Apache TVM, an open-source framework that supports the deployment of machine learning workloads on any hardware backend.
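To make the decomposition idea concrete, here is a minimal Python sketch in the spirit of SparseTIR. The function names, bucket widths, and ELL-style padded layout are illustrative assumptions for exposition, not SparseTIR’s actual API:

```python
# Illustrative sketch of SparseTIR-style format decomposition (hypothetical
# helpers, not the real API): group a sparse matrix's rows into buckets with
# similar nonzero counts, then pad each bucket into a regular ELL layout so
# the per-bucket computation becomes dense and hardware-friendly.
import numpy as np
import scipy.sparse as sp

def bucket_decompose(A_csr, bucket_widths=(4, 16, 64)):
    """Split rows into ELL buckets by nonzeros per row.

    For brevity, rows wider than the largest bucket are assumed not to occur.
    """
    nnz_per_row = np.diff(A_csr.indptr)
    buckets = []
    for width in bucket_widths:
        rows = np.where((nnz_per_row > 0) & (nnz_per_row <= width))[0]
        nnz_per_row[rows] = 0  # mark these rows as assigned
        cols = np.zeros((len(rows), width), dtype=np.int64)
        vals = np.zeros((len(rows), width), dtype=A_csr.dtype)
        for i, r in enumerate(rows):
            start, end = A_csr.indptr[r], A_csr.indptr[r + 1]
            cols[i, : end - start] = A_csr.indices[start:end]
            vals[i, : end - start] = A_csr.data[start:end]
        buckets.append((rows, cols, vals))  # one homogeneous sub-matrix
    return buckets

def spmv(buckets, x, n_rows):
    """Sum the per-bucket SpMV results. Each bucket is a regular workload
    that a compiler could map to a different GPU compute unit."""
    y = np.zeros(n_rows, dtype=x.dtype)
    for rows, cols, vals in buckets:
        y[rows] += (vals * x[cols]).sum(axis=1)  # padded entries add zero
    return y

A = sp.random(1000, 1000, density=0.01, format="csr")
x = np.random.rand(1000)
assert np.allclose(spmv(bucket_decompose(A), x, A.shape[0]), A @ x)
```

The payoff of this kind of bucketing is that each sub-matrix has a uniform shape, so the wasted work from padding stays bounded while the inner loops become regular enough for dense-oriented hardware to execute efficiently.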
“The number of sparse deep learning workloads is rapidly growing, while at the same time, hardware backends are evolving toward accelerating dense operations,” Ye explained. “SparseTIR is flexible by design, enabling it to be applied to any sparse deep learning workload while leveraging new hardware and systems advances.”
In multiple instances, the team found that SparseTIR outperformed highly optimized sparse libraries on NVIDIA hardware. For their work, Ye and his colleagues earned a Distinguished Artifact Award at ASPLOS ’23, the preeminent conference for interdisciplinary systems research.
Given their scale and the amount of computation they require, LLMs are fast becoming one of the most significant hardware workloads — and a potential stumbling block. One of the critical factors in efficient LLM serving is kernel performance on GPUs. To that end, Ye and his collaborators examined LLM-serving operators to identify performance bottlenecks and developed FlashInfer, an open-source kernel library that applies inference acceleration techniques to speed up LLM serving.
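The article doesn’t detail FlashInfer’s internals, but the flavor of the operators involved can be illustrated. The NumPy sketch below shows single-token decode attention over a paged KV cache, a common pattern in LLM serving; the function name, shapes, and page layout here are assumptions for exposition, not FlashInfer’s real interface:

```python
# Conceptual sketch of paged decode attention, the kind of LLM-serving
# operator that GPU kernel libraries accelerate (plain NumPy stand-in).
import numpy as np

def paged_decode_attention(q, k_pages, v_pages, page_table, seq_len):
    """One new query token attends over a KV cache stored in fixed pages.

    q:                (num_heads, head_dim) query for the new token
    k_pages, v_pages: (num_pages, page_size, num_heads, head_dim) pooled cache
    page_table:       physical page ids for this sequence, in logical order
    seq_len:          number of valid cached tokens for this sequence
    """
    num_heads, head_dim = q.shape
    # Gather this sequence's keys/values from its (possibly scattered) pages.
    k = k_pages[page_table].reshape(-1, num_heads, head_dim)[:seq_len]
    v = v_pages[page_table].reshape(-1, num_heads, head_dim)[:seq_len]
    scores = np.einsum("hd,shd->hs", q, k) / np.sqrt(head_dim)
    probs = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)   # softmax over cached tokens
    return np.einsum("hs,shd->hd", probs, v)    # new attention output
```

A production kernel fuses the gather, softmax, and weighted sum into a single GPU pass over many requests at once; the job of a library like FlashInfer is to make that fused, batched version fast across cache layouts and sequence lengths.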
Ye also contributed to Punica, a project led by Ceze and faculty colleague Arvind Krishnamurthy that enables a single GPU to serve multiple LLMs fine-tuned via low-rank adaptation (LoRA) from a common pretrained model. The team’s approach, which significantly reduces the memory and computation required for such tasks, earned first runner-up in the 2023 Madrona Prize competition.
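The savings come from the structure of low-rank adaptation: each fine-tuned model differs from the base model only by a small low-rank weight update, so the expensive shared computation can be done once per batch. A toy NumPy version of that batching idea (names and shapes are hypothetical; the actual system relies on custom GPU kernels) might look like this:

```python
# Hedged sketch of multi-LoRA batched inference: apply the shared pretrained
# weight W to the whole batch once, then add each request's own low-rank
# delta. Illustrative only, not Punica's real kernels.
import numpy as np

def multi_lora_forward(x, W, A, B, adapter_ids):
    """x: (batch, d_in); W: (d_in, d_out); A: (n_adapters, d_in, r);
    B: (n_adapters, r, d_out); adapter_ids: (batch,) adapter per request."""
    y = x @ W  # shared base-model compute, amortized over the batch
    Ax = np.einsum("bi,bir->br", x, A[adapter_ids])   # project to rank r
    y += np.einsum("br,bro->bo", Ax, B[adapter_ids])  # per-request delta
    return y
```

Because the rank r is small, the per-request extra work is tiny compared with the shared matrix multiply, which is what lets one GPU serve many fine-tuned variants at close to the cost of serving the base model alone.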
“Zihao’s work is already having a direct impact,” Ceze noted. “He is a true ML systems star in the making.”
“I’m honored to receive the Graduate Research Fellowship from NVIDIA, which leads the way in research and development of machine learning acceleration. I’m particularly excited to learn from industry experts and to build good systems together for the greater good,” said Ye.
“I would like to thank Luis, who provided the best guidance and advice, and all my collaborators over the years,” he continued. “UW has a super-collaborative environment where I can team up with people who bring different knowledge and backgrounds, which has greatly expanded my horizons and inspired my research.”
Read more about the 2024 NVIDIA Graduate Research Fellowship recipients on the company’s blog.