Recent advances in open-ended text generation could enable machines to produce text that approaches, or even mimics, human writing. However, evaluating the quality and accuracy of these large-scale models has remained a significant computational challenge. Recently, researchers at the Allen School and the Allen Institute for AI (AI2) offered a solution in the form of MAUVE, an efficient and scalable tool for comparing the output of modern text generation models against human-written text. The team’s paper describing this new approach, “MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers,” earned an Outstanding Paper Award at the Conference on Neural Information Processing Systems (NeurIPS 2021) in December.
The goal of open-ended text generation is to achieve a level of coherence, creativity, and fluency that mimics human text. Because the task is, as the name suggests, open-ended, there is no single correct answer; this makes evaluating a model’s performance more difficult than for more concrete tasks such as translation or summarization. MAUVE solves this problem by employing information divergence frontiers, a concept previously little used in NLP, to reduce the comparison between model-generated text and human text to a computationally tractable yet effective measurement.
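To make the idea concrete, the sketch below traces a divergence frontier between two toy discrete distributions: each mixture weight yields a pair of softened KL divergences, and the area under the resulting curve gives a single score. This is a simplified illustration of the concept, not the paper’s implementation; the scaling constant, the toy histograms, and the added axis endpoints are assumptions made for the example.

```python
import numpy as np

def kl(p, q):
    # KL divergence (in nats) between discrete distributions p and q,
    # skipping zero-probability entries of p.
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def divergence_frontier_score(p, q, c=5.0, num_points=100):
    # For each mixture weight lam, form R = lam*P + (1-lam)*Q and record the
    # pair (exp(-c*KL(Q||R)), exp(-c*KL(P||R))). Sweeping lam traces a
    # divergence curve; the area under it is the summary score, close to 1.0
    # when P and Q nearly coincide and smaller as they drift apart.
    xs, ys = [1.0], [0.0]   # illustrative extreme point (KL(P||R) unbounded)
    for lam in np.linspace(1e-3, 1 - 1e-3, num_points):
        r = lam * p + (1 - lam) * q
        xs.append(np.exp(-c * kl(q, r)))
        ys.append(np.exp(-c * kl(p, r)))
    xs.append(0.0)          # illustrative extreme point (KL(Q||R) unbounded)
    ys.append(1.0)
    # Trapezoidal area under the curve (xs shrinks as lam grows).
    return sum(0.5 * (ys[i] + ys[i - 1]) * abs(xs[i - 1] - xs[i])
               for i in range(1, len(xs)))

# Toy example: histograms over four "clusters" of a quantized embedding space.
human = np.array([0.4, 0.3, 0.2, 0.1])
model = np.array([0.3, 0.3, 0.2, 0.2])
print(divergence_frontier_score(human, model))  # near 1.0 for similar histograms
```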
“For open-ended text generation to make that next leap forward, we need to be able to evaluate a model’s performance on two key aspects that are prone to error: how much weight it gives to sequences that truly resemble human text, as opposed to gibberish, and whether the generated text exhibits the variety of expression we would expect to see from humans, instead of boring or repetitive text that reads like a template,” explained lead author Krishna Pillutla, a Ph.D. candidate in the Allen School. “The beauty of MAUVE is that it enables us to quantify both, using a simple interface and an approach that is easily scaled to whatever sized model you’re working with.”
MAUVE computes the divergence between the model’s distribution and the target distribution of human text along the above pair of criteria in a quantized embedding space, then summarizes the result as a single scalar that captures the gap between machine-generated and human text. To validate MAUVE’s effectiveness, the team tested the tool on three open-ended text completion tasks involving web text, news articles and stories. The results of these experiments confirmed that MAUVE reliably identifies the known properties of machine-generated text, aligns strongly with human judgments, and scales naturally with model size, all with fewer restrictions than existing distributional evaluation metrics. And whereas other language modeling tools or statistical measures are typically limited to a single statistic or to a single point on the divergence curve, MAUVE offers expanded insight into a model’s performance.
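The team released MAUVE as an open-source Python package (pip install mauve-text). The snippet below shows roughly how its compute_mauve entry point is called on two collections of texts; the sample sentences are invented for illustration, and the exact arguments and defaults should be checked against the package’s own documentation.

```python
# pip install mauve-text
# The package featurizes each text with a pretrained language model, quantizes
# the embeddings into clusters, and returns the divergence-frontier-based score.
import mauve

# In practice you would pass a few thousand samples from each source;
# two sentences apiece are shown here purely for illustration.
human_texts = ["The cat sat on the mat.", "It rained all afternoon in Seattle."]
model_texts = ["The cat sat on a chair.", "Rain fell over the city for hours."]

# p_text holds the human-written references, q_text the machine generations.
out = mauve.compute_mauve(p_text=human_texts, q_text=model_texts,
                          device_id=0,          # GPU index used for featurization
                          max_text_length=256,  # truncate long passages first
                          verbose=False)
print(out.mauve)  # single scalar in (0, 1]; higher means closer to human text
```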
“MAUVE enables us to identify the properties of machine-generated text that a good measure should capture,” noted co-author Swabha Swayamdipta, a postdoctoral investigator at AI2. “This includes distribution-level information that enables us to understand how the quality of output changes based on the size of the model, the length of text we are asking it to generate, and the choice of decoding algorithm.”
While Swayamdipta and her colleagues designed MAUVE with the goal of improving the quality of machine-generated text — where “quality” is defined according to how closely it resembles the human-authored kind — they point out that its capabilities also provide a foundation for future work on how to spot the difference.
“As with every new technology, there are benefits and risks,” said senior author Zaid Harchaoui, a professor in the University of Washington’s Department of Statistics and adjunct professor in the Allen School. “As the gap narrows between machine and human performance, having tools like MAUVE at our disposal will be critical to understanding how these more sophisticated emerging models work. The NLP community can then apply what we learn to the development of future tools for distinguishing between content generated by computers versus that which is produced by people.”
Additional co-authors of the paper introducing MAUVE include Allen School Ph.D. student Rowan Zellers, postdoc Sean Welleck, alumnus John Thickstun (Ph.D., ‘21) — now a postdoc at Stanford University — and Yejin Choi, the Brett Helsel Career Development Professor in the Allen School and a senior research manager at AI2. The team received one of six Outstanding Paper Awards presented at NeurIPS 2021, which recognize papers for their “clarity, insight, creativity, and potential for lasting impact.”
Members of the team also studied the statistical aspects of MAUVE in another paper published simultaneously at NeurIPS 2021. Together with Lang Liu, a Ph.D. candidate in Statistics at the UW, and Allen School professor Sewoong Oh, they established bounds on how many human-written and machine-generated text samples are needed to estimate MAUVE accurately.
Read the research paper here and the NeurIPS award announcement here. Explore the MAUVE tool here.
Congratulations to the entire team!