Allen School researchers earn EMNLP Best Paper Award for making Internet-scale texts efficiently searchable with infini-gram mini

Hao Xu (center) accepts the EMNLP Best Paper Award alongside the conference chairs.
From left to right: EMNLP program chairs Violet Peng and Christos Christodoulopoulos, lead author Hao Xu, EMNLP general chair Dirk Hovy and program chair Carolyn Rose.

Large language models (LLMs) such as ChatGPT are trained on massive text datasets downsampled from the Internet. As these models become more widespread, it is increasingly important to understand the composition of their training data and how it shapes model behavior. The first step is to make these texts searchable.

Current exact-match search engines are limited by their high storage requirements, which hinders their use on extremely large-scale data. With previous methods, storing an index of Internet-scale text would cost around $500,000 per month. To make search at this scale more efficient and affordable, a team of University of Washington and Allen Institute for Artificial Intelligence (Ai2) researchers developed infini-gram mini, a scalable system that uses the compressed FM-index data structure to index petabyte-scale text corpora.

The team presented their paper “Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index” at the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP) last November in Suzhou, China, and received the Best Paper Award.

“We developed infini-gram mini, an efficient search engine designed to handle exact-match search on arbitrarily long queries across Internet-scale corpora with minimal storage overhead,” said Allen School undergraduate student and lead author Hao Xu. “Infini-gram mini hosts the largest body of searchable text in the open-source community.”

While the FM-index has been widely used in bioinformatics, the team was the first to apply it to natural language data at Internet scale. The infini-gram mini system improves on the best existing FM-index implementation, achieving an 18-fold increase in indexing speed and a 3.2-fold reduction in memory usage. The resulting index needs only 44% as much storage as the raw text, just 7% of what the original infini-gram required.
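At its core, an FM-index answers exact-match queries by "backward search" over the Burrows-Wheeler transform of the text. The toy Python sketch below illustrates that core idea on a tiny string; it is purely illustrative and omits the compression, sharding, and parallel construction that make infini-gram mini practical at petabyte scale.

```python
# Toy FM-index backward search (illustrative sketch only; infini-gram mini's
# engine uses compressed rank structures and large-scale parallel indexing).

def build_fm_index(text):
    """Build a minimal FM-index: BWT via suffix sorting, plus C[] and Occ tables."""
    text += "\0"  # unique end-of-text sentinel that sorts before all other characters
    sa = sorted(range(len(text)), key=lambda i: text[i:])      # suffix array (naive)
    bwt = "".join(text[i - 1] for i in sa)                     # Burrows-Wheeler transform

    # C[c]: number of characters in the text strictly smaller than c
    counts = {}
    for ch in bwt:
        counts[ch] = counts.get(ch, 0) + 1
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]

    # Occ[c][i]: occurrences of c in bwt[:i] (a real index stores this compressed)
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return C, occ, len(bwt)

def count_occurrences(pattern, C, occ, n):
    """Backward search: count exact occurrences of `pattern` in the indexed text."""
    lo, hi = 0, n
    for ch in reversed(pattern):        # process the pattern right to left
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo

C, occ, n = build_fm_index("the cat sat on the mat")
print(count_occurrences("the", C, occ, n))   # -> 2
```

Because the query is matched one character at a time against the transformed text, the search cost grows with the query length rather than the corpus size, which is what makes exact-match lookups over very large corpora feasible.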

“In infini-gram mini, we combined advanced algorithms and data structures and scaled-up engineering to tackle real, pressing challenges in AI. It is a very unique combination,” said co-author and Allen School Ph.D. student Jiacheng Liu. “The most interesting part was that we revitalized a data structure repo that hasn’t been maintained for almost 10 years, armed it with modern parallel computing, and scaled it up to the sky to handle Internet-scale data with low compute needs. We almost built a Google Search without Google-level budget.”

To showcase infini-gram mini’s search capabilities, the researchers used the system to conduct a large-scale benchmark contamination analysis, checking whether the training data of LLMs inadvertently contains their test data. They found that many widely used evaluation benchmarks appear heavily in these corpora, which can lead to an overestimation of a model’s true capabilities, since the model can retrieve memorized answers from training data rather than perform task-specific reasoning. Alongside infini-gram mini, the team also released a benchmark contamination monitoring system, with the goal of encouraging more transparent and reliable evaluation practices in the community, explained Xu.
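As a rough illustration of how a contamination check of this kind can work, the sketch below queries an exact-match count service for each benchmark question and flags items that already appear verbatim in the indexed corpus. The endpoint URL, index name, request fields, and response format here are hypothetical placeholders, not the project's actual API.

```python
# Hypothetical contamination check: flag benchmark items whose exact text already
# appears in an indexed pretraining corpus. The endpoint, index name, and JSON
# fields below are illustrative placeholders only.
import requests

SEARCH_API = "https://example.org/api/count"   # placeholder URL
INDEX_NAME = "web-corpus"                      # placeholder index name

def exact_match_count(text: str) -> int:
    """Ask the search service how many times `text` occurs verbatim in the corpus."""
    resp = requests.post(SEARCH_API, json={"index": INDEX_NAME, "query": text})
    resp.raise_for_status()
    return resp.json().get("count", 0)          # assumed response field

benchmark_items = [
    "What is the capital of France?",
    "Translate 'good morning' into Spanish.",
]

contaminated = [q for q in benchmark_items if exact_match_count(q) > 0]
print(f"{len(contaminated)} of {len(benchmark_items)} items found verbatim in the corpus")
```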

Additional authors include Allen School professors and Ai2 researchers Hannaneh Hajishirzi and Noah A. Smith, along with Allen School affiliate faculty member Yejin Choi, a professor at Stanford University.

Read the full paper on infini-gram mini here.