Would you call your favorite fizzy drink a soda or a pop? Just because you speak the same language does not mean you speak the same dialect; dialects vary in vocabulary, pronunciation and grammar. And whatever the language, most models used in artificial intelligence research are far from an open book, making them difficult to study.
At the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024) in August, Allen School researchers took home multiple awards for their work to address these challenges. Their research ranged from introducing more dialects into language technology benchmarks, to evaluating the reliability and fairness of language models, to increasing the transparency and replicability of large language model training and evaluation across languages.
Best Social Impact Paper: DialectBench
The benchmarks used in natural language processing (NLP) research and evaluation are often limited to standard language varieties, making them less useful in real-world cases. To address this gap, Allen School researchers introduced DialectBench, the first large-scale NLP benchmark for language varieties that covers 40 different language clusters with 281 varieties across 10 NLP tasks.
While DialectBench gives researchers a comprehensive overview of the current state of NLP across language varieties, it also has the potential to bring more of those varieties into NLP models in the future.
“Language variation like African American or Indian English dialects in NLP is often treated as noise; however, in the real world, language variation often reflects regional, social and cultural differences,” said senior author and Allen School professor Yulia Tsvetkov. “We developed a robust framework to evaluate the quality of multilingual models on a wide range of language varieties. We found huge performance disparities between standard languages and their respective varieties, highlighting directions for future NLP research.”
Benchmarking helps researchers track the progress the NLP field has made across various tasks by comparing results against standard points of reference. However, it is difficult to test the robustness of multilingual models without an established NLP evaluation framework that covers many language clusters, or groups of standard languages alongside their closely related varieties. For DialectBench, the researchers constructed several such clusters, including the Hindustani cluster, which covers Hindi and Fiji Hindi. They then selected tasks that test a model’s linguistic and demographic utility.
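The construction is easier to picture with a small sketch. The following is illustrative only, not the authors’ released code: the cluster and variety names mirror the Hindustani example above, while the task names and the `evaluate_model` scorer are hypothetical placeholders.

```python
# Illustrative sketch only, not the released DialectBench code: a benchmark
# entry is keyed by (language cluster, variety, task), mirroring the
# Hindustani example above. evaluate_model() is a hypothetical placeholder.
from collections import defaultdict
import random

benchmark = {
    "hindustani": {                                            # language cluster
        "hindi": ["pos_tagging", "question_answering"],        # standard variety
        "fiji_hindi": ["pos_tagging", "question_answering"],   # related variety
    },
}

def evaluate_model(model_name: str, variety: str, task: str) -> float:
    """Stand-in scorer: a real harness would run `model_name` on the
    task's test split for `variety` and return a score in [0, 1]."""
    return random.random()  # dummy value so the sketch runs end to end

scores = defaultdict(dict)
for cluster, varieties in benchmark.items():
    for variety, tasks in varieties.items():
        for task in tasks:
            scores[(cluster, variety)][task] = evaluate_model(
                "some-multilingual-model", variety, task
            )

# Comparing "hindi" against "fiji_hindi" within the same cluster surfaces the
# standard-versus-variety performance gaps the paper reports.
print(dict(scores))
```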
The researchers used DialectBench to report the disparities between standard and non-standard language varieties. For example, they found that the highest-performing varieties were mostly standard high-resource languages, such as English, along with a few high-resource dialects, including Norwegian dialects. On the other hand, the majority of the lowest-performing varieties were low-resource, non-standard language varieties.
Additional authors of the DialectBench paper include Allen School Ph.D. students Orevaoghene Ahia, co-first author, and Kabir Ahuja; George Mason University Ph.D. student Fahim Faisal, lead author, and professor Antonios Anastasopoulos; and University of Notre Dame Ph.D. student Aarohi Srivastava and professor David Chiang.
This was not the only ACL award-winning paper to come out of Tsvetkov’s research group, the TsvetShop. Another paper focusing on improving the reliability of large language models and preventing hallucinations from knowledge gaps won an Outstanding Paper Award and an Area Chair Award in the QA track.
Best Theme and Best Resource Papers: OLMo and Dolma
Two papers from Allen School professors Hanna Hajishirzi and Noah Smith, co-directors of the Open Language Model effort at the Allen Institute for Artificial Intelligence (AI2), along with their collaborators, earned accolades at ACL 2024 for advancing the state of open language models.
As language models have become more common in commercial products, important details about these models’ training data, architectures and development have increasingly been hidden behind proprietary interfaces. Without this information, it is difficult to scientifically study these models’ strengths and weaknesses, as well as their potential biases and risks.
The researchers built OLMo, a competitive, truly open language model, to help fill this knowledge gap and inspire other scientists’ innovations. Alongside OLMo, the team released its entire framework, from the open training data to the evaluation tools. The researchers earned the Best Theme Paper award at ACL for their work titled “OLMo: Accelerating the Science of Language Models.”
“Language models are a decades-old idea that have recently become the backbone of modern AI. Today the most famous models are built as commercial products by huge tech firms, and many details of their design are closely guarded secrets,” said Smith, the Amazon Professor of Machine Learning in the Allen School. “We launched the OLMo effort as a collaboration between the Allen Institute for AI and the Allen School to create a fully open alternative that scientists could study, because it’s important that we fully understand these artifacts.”
While this paper presents the team’s first release of OLMo, they intend to continue to support and extend the model and its framework, bringing in different model sizes, modalities, datasets and more. Since OLMo’s original release, the researchers have already improved the data and training; for example, the model’s Massive Multitask Language Understanding scores, which measure knowledge acquired during pretraining, went up by 24 points to 52%.
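Because the weights and tooling are openly released, the model can be loaded with standard open-source libraries. Here is a minimal sketch, assuming the Hugging Face `transformers` library and the `allenai/OLMo-7B-hf` checkpoint published on the Hugging Face Hub; it illustrates the open release rather than reproducing anything from the paper.

```python
# Minimal sketch: loading an openly released OLMo checkpoint for text generation.
# Assumes the `allenai/OLMo-7B-hf` repository on the Hugging Face Hub and a
# transformers version that includes built-in OLMo support.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-7B-hf"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Language models are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```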
Hajishirzi and Smith’s co-authors on the OLMo paper include Allen School professor Luke Zettlemoyer, postdocs Abhilasha Ravichander and Yanai Elazar, Ph.D. students Ananya Harsh Jha, Hamish Ivison, Ian Magnusson and Yizhong Wang, and alumni Jacob Morrison (B.S. Computer Science, ‘17/M.S., Computational Linguistics, ‘22), now a researcher at AI2, and Mitchell Wortsman (Ph.D., ‘24), now a member of the technical staff at Anthropic; AI2 researchers Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Shane Arora, David Atkinson, Russell Authur, Khyathi Raghavi Chandu, Arman Cohan, Jennifer Dumas, Yuling Gu, Jack Hessel, Tushar Khot, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Valentina Pyatkin, Dustin Schwenk, Saurabh Shah, Will Smith, Emma Strubell, Nishant Subramani, Pradeep Dasigi, Nathan Lambert, Kyle Richardson, Jesse Dodge, Kyle Lo and Luca Soldaini; and New York University Ph.D. student William Merrill.
The OLMo effort to advance research into language models would not be complete without its counterpart Dolma, an English corpus containing three trillion tokens drawn from sources ranging from web content to scientific papers to public-domain books.
While there has been progress toward making model parameters more accessible, pretraining datasets, which are fundamental to developing capable language models, are not as open or available. The researchers built and released OLMo’s pretraining dataset, described in “Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research,” to help facilitate open research into language models, earning the Best Resource Paper award at ACL in the process.
“Even among open models, there are differences in what researchers can work with. With OLMo, we wanted a competitive, strong model whose data was also fully available for inspection,” said Smith. “Dolma is the dataset used to pretrain OLMo. It is extensively documented, and the paper includes analyses and discussion of lessons learned through data curation. We also released open-source data curation tools to enable reproduction and improvement of our work.”
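Because the corpus and curation tools are public, a few documents can be pulled down and inspected with standard tooling. The sketch below assumes the corpus is mirrored on the Hugging Face Hub under `allenai/dolma` and that each record carries a `text` field; both are assumptions about the release layout rather than details from the paper.

```python
# Minimal sketch: streaming a handful of Dolma documents for inspection.
# Assumes a Hugging Face Hub mirror named "allenai/dolma" whose records
# include a "text" field; adjust identifiers to the actual release.
from itertools import islice

from datasets import load_dataset

dolma = load_dataset("allenai/dolma", split="train", streaming=True)

for doc in islice(dolma, 3):
    text = doc.get("text", "")
    preview = text[:80].replace("\n", " ")
    print(f"{len(text.split()):>8} whitespace tokens | {preview}...")
```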
As with OLMo, this is just the beginning for Dolma. The researchers continue to make advancements through follow-on releases that, for example, yield significant performance improvements on downstream tasks.
Additional authors on the Dolma paper include Zettlemoyer, Ravichander, Jha, Elazar, Magnusson, Morrison, Soldaini, Kinney, Bhagia, Schwenk, Atkinson, Authur, Chandu, Dumas, Lambert, Muennighoff, Naik, Nam, Peters, Richardson, Strubell, Subramani, Tafjord, Walsh, Beltagy, Groeneveld and Dodge along with Ben Bogin, Valentin Hofmann and Xinxi Lyu of AI2; University of California, Berkeley Ph.D. student Li Lucy; and MIT Ph.D. student Zejiang Shen.