
If you ask 10 different people to write a metaphor about time, you might get 10 unique responses. However, if you give the same prompt to 10 large language models (LLMs), you may end up receiving similar outputs across the board — almost like they are all part of a hivemind.
These LLMs often struggle to generate distinct and human-like creative content, yet scalable methods for analyzing the diversity of LLM responses have largely been limited to narrow tasks such as random name or number generation. Recently, a team of University of Washington researchers developed Infinity-chat, a benchmark dataset featuring 26,000 real-world, open-ended queries paired with more than 31,000 human preference annotations. That work, which was led by Allen School Ph.D. student Liwei Jiang, allows for a systematic evaluation of the creative generation abilities of artificial intelligence models.
The team presented their paper titled “Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)” at the 39th Annual Conference on Neural Information Processing Systems (NeurIPS 2025) in December, where it received a Best Paper Award.
“This research reveals a critical limitation in large language models: despite their diversity of architectures and training approaches, LLMs produce strikingly homogeneous outputs on open-ended queries, a phenomenon we termed the ‘Artificial Hivemind,’” said co-author Yulia Tsvetkov, who holds the Paul G. Allen Career Development Professorship in the Allen School.
With Infinity-chat, Tsvetkov and her collaborators introduced the first comprehensive taxonomy of open-ended LLM queries. The researchers broke down the different queries that users pose to language models into six high-level categories and 17 fine-grained subcategories such as problem solving or speculative and hypothetical scenarios. Of the high-level categories, creative content generation (58%) and brainstorming and ideation (15.2%) were among the most common, underscoring users’ reliance on LLMs for inspiration and creative thinking.
The team then used Infinity-chat to conduct a large-scale study of mode collapse in LLMs. After evaluating more than 70 LLMs on real-world, open-ended questions, they found an “Artificial Hivemind” effect. This phenomenon is characterized by both intra-model repetition, where the same model fails to produce diverse responses, and inter-model homogeneity, where different models generate similar outputs. These insights can help guide future research into mitigating the long-term AI safety risks associated with the Artificial Hivemind.
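For readers curious how these two notions could be quantified in principle, the minimal sketch below uses a simple lexical (Jaccard) similarity over invented example responses; it is only an illustration of the intra- versus inter-model distinction, not the metrics or evaluation pipeline used in the paper.

```python
# Toy illustration (not the paper's methodology): quantify homogeneity with
# a simple lexical similarity over word sets.
from itertools import combinations, product

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two responses."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def intra_model_repetition(responses: list[str]) -> float:
    """Average pairwise similarity among responses sampled from one model."""
    pairs = list(combinations(responses, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

def inter_model_homogeneity(model_a: list[str], model_b: list[str]) -> float:
    """Average pairwise similarity between responses from two different models."""
    pairs = list(product(model_a, model_b))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical responses to the same open-ended prompt ("a metaphor about time").
model_1 = ["Time is a river carrying us forward.",
           "Time is a river that never stops flowing."]
model_2 = ["Time is a river, always moving in one direction.",
           "Time flows like a river toward the sea."]

print(intra_model_repetition(model_1))            # high -> the model repeats itself
print(inter_model_homogeneity(model_1, model_2))  # high -> hivemind-like convergence
```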
“Testing over 70 models from major AI developers, our study found systematic convergence on similar responses to open-ended queries, raising concerns about groupthink in AI systems that could lead to shared blind spots and correlated errors,” said Tsvetkov. “These findings have direct implications across critical application areas including AI for science, medicine, education, decision support and many others, where robust reasoning across diverse perspectives is essential.”
Additional authors include Allen School Ph.D. students Margaret Li and Mickel Liu; alumni Raymond Fok (Ph.D., ‘25), now at Microsoft, and Maarten Sap (Ph.D., ‘21), now faculty at Carnegie Mellon University; Yuanjun Chai, a student in the UW Department of Electrical & Computer Engineering Daytime Master’s Program (MSEE); Nouha Dziri at the Allen Institute for Artificial Intelligence (Ai2); and Allen School affiliate faculty member Yejin Choi, professor at Stanford University.
Read the award-winning paper here.
