
Allen School and AI2 researchers recognized at NeurIPS for outstanding contributions in large-scale embodied AI and next-generation image-text models


Allen School researchers continue to push the boundaries of artificial intelligence (AI), from large-scale embodied AI to innovative image-text models. At the 36th Conference on Neural Information Processing Systems (NeurIPS 2022), several members of the Allen School, along with researchers from the Allen Institute for AI (AI2), earned recognition for their work in advancing their respective fields. 

Allen School undergraduate Matt Deitke, professor Ali Farhadi and affiliate professors Ani Kembhavi, director of computer vision at AI2, and Roozbeh Mottaghi, research scientist manager at Meta AI, along with their collaborators at AI2, won an Outstanding Paper Award for ProcTHOR: Large-Scale Embodied AI Using Procedural Generation, which investigated scaling up the diversity of datasets used to train robotic agents. The paper was among 13 selected for outstanding paper recognition out of the 2,672 papers accepted to the conference, a high bar given that NeurIPS received more than 10,000 total submissions. 

Matt Deitke

“We are delighted to have received the Outstanding Paper Award at NeurIPS 2022 for our work on ProcTHOR,” said Deitke, first author on the paper. “It is great recognition from the AI community, and it motivates us to continue pushing the boundaries of AI research. The award is a reflection of the work of our team and the research communities fostered at the Allen School at UW and AI2.”

In addition to Deitke, Farhadi, Kembhavi and Mottaghi, contributors to ProcTHOR include AI2 technical artist Eli VanderBilt, research engineers Alvaro Herrasti and Jordi Salvador, research scientists Luca Weihs and Kiana Ehsani (Ph.D., ‘21), game designer Winson Han and DexCare director of data science Eric Kolve. Kolve and Mottaghi were at AI2 for the duration of the project. 

The open-source ProcTHOR introduces a new framework for the procedural generation of embodied AI environments. Before its introduction, such spaces either had to be manually designed by artists as simulated 3D houses or built from 3D scans of real structures. Each approach carried drawbacks: high cost, time-intensive workflows and an inability to scale. If the team trained a robot on only 100 houses designed by artists, Deitke said, it would perform well in those environments but would not generalize when placed in a house it had never seen before. 

“In ProcTHOR, we took on a difficult challenge of attempting to generate 3D houses from scratch,” Deitke said. “The generated houses can then be used to massively scale up the amount of training data available in embodied AI.”

With ProcTHOR, the robots demonstrated robust generalization. Instead of 100 houses, the team sampled 10,000 procedurally generated ones, showing the power of ProcTHOR’s data. 
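For readers curious what this looks like in practice, below is a minimal sketch of loading one of the 10,000 procedurally generated houses and placing an embodied agent inside it. The package and dataset names (prior, procthor-10k) follow the project's public releases, but the exact calls should be treated as an illustration rather than official documentation.

```python
# A minimal sketch, assuming the "prior" package and the "procthor-10k" dataset
# released alongside the paper; exact arguments may differ from the current docs.
import prior
from ai2thor.controller import Controller

# Load the 10,000 procedurally generated houses referenced above.
dataset = prior.load_dataset("procthor-10k")
house = dataset["train"][0]  # each entry fully specifies one generated house

# Drop an embodied agent into the generated house and take a step.
controller = Controller(scene=house)
event = controller.step(action="RotateRight")
print(event.metadata["agent"]["position"])
```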

The implications of such findings are manifold. Household robots have shown potential to aid in a number of tasks around the home, providing assistance to many, including individuals with disabilities and the elderly. For example, tasks such as cooking and cleaning could be delegated to the AI. Research has also explored how companion robots can assist with social interaction, motivating their owners to exercise, reminding them about appointments or even chatting about the weather and news.

These are all possibilities the team has taken into consideration as it looks to the future. Already, ProcTHOR has received interest from a number of companies, researchers and architects who see its potential to reimagine the AI landscape. 

“It is truly inspiring that ProcTHOR is able to enable work across many areas of AI,” Deitke said. 

Ludwig Schmidt

Switching gears from built architecture to language-vision architecture, Allen School professor Ludwig Schmidt, Ph.D. student Mitchell Wortsman and their collaborators received an Outstanding Datasets and Benchmarks Papers Award for LAION-5B: An open large-scale dataset for training next generation image-text models. Their work introduced a public dataset that democratizes research into the training and capabilities of language-vision architectures.

“We are glad to see work in developing open source datasets and models recognized by the NeurIPS community,” Schmidt said. “We are particularly happy as the LAION-5B dataset is a grassroots community effort and we believe that this kind of collaboration is essential to driving progress in machine learning.”

Before LAION-5B, which contains 5.85 billion CLIP-filtered image-text pairs, no dataset of this size had been made publicly available, creating a bottleneck in which related research flowed through a small number of industrial research labs. While multimodal machine learning has progressed rapidly in recent years, much of that progress has been driven by large companies and labs such as Google and OpenAI. 
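The "CLIP-filtered" part is central to how the dataset was assembled: a candidate image-caption pair is kept only if a CLIP model judges the image and its accompanying text to be semantically similar. The sketch below illustrates the idea using the OpenCLIP library; the specific checkpoint and the 0.28 similarity threshold are illustrative assumptions, not the exact LAION pipeline.

```python
# Illustrative sketch of CLIP-based filtering: keep an (image, caption) pair only
# if the image and text embeddings are similar enough. The ViT-B-32 checkpoint
# and the 0.28 threshold are assumptions for illustration, not LAION's exact setup.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

def keep_pair(image_path: str, caption: str, threshold: float = 0.28) -> bool:
    """Return True if CLIP judges the caption a plausible description of the image."""
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    text = tokenizer([caption])
    with torch.no_grad():
        img_emb = model.encode_image(image)
        txt_emb = model.encode_text(text)
        img_emb /= img_emb.norm(dim=-1, keepdim=True)
        txt_emb /= txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb @ txt_emb.T).item() >= threshold
```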

LAION-5B addresses exactly this hurdle by introducing the first public billion-scale image-text dataset suitable for training state-of-the-art multimodal models. It also provides a starting point for researchers working to improve image-text datasets. In addition, the team showed that models trained on LAION-5B with the OpenCLIP library, which was also developed by Allen School researchers, could replicate the performance of the original CLIP models. The authors also created a web interface for browsing the dataset, which makes it easy to audit, for example by surfacing toxic content that the automatic filters missed.
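As an example of what such released checkpoints enable, the snippet below loads a CLIP-style model trained on LAION data through OpenCLIP and uses it for zero-shot image classification. The "laion2b_s34b_b79k" pretrained tag and the file name "photo.jpg" are assumptions for illustration; available checkpoints may differ.

```python
# A brief sketch of CLIP-style zero-shot classification with an OpenCLIP model
# trained on LAION data; checkpoint tag and input file are illustrative.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

image = preprocess(Image.open("photo.jpg")).unsqueeze(0)
text = tokenizer(["a photo of a dog", "a photo of a cat", "a photo of a house"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # relative probability that each caption matches the image
```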

“Moreover, we hope that the community collaborates to further improve public training datasets so that we can together mitigate bias and safety concerns in our widely used machine learning datasets,” Wortsman said. “We have already performed content analysis and filtering for LAION-5B, but there is still much work to be done. Open and transparent datasets are an important step towards safer and less biased models.”

Mitchell Wortsman

Other researchers have already built upon LAION-5B with projects such as Stable Diffusion and Imagen, both of which generate images from text descriptions. 

“We believe that future projects will continue along these directions to develop more capable multimodal machine learning models,” Schmidt said. “LAION-5B also presents a unique opportunity to study the influence of the data on the behavior of multimodal models. Finally, we hope that researchers will build upon LAION-5B to develop the next generation of open datasets.”

Besides Schmidt and Wortsman, the study’s co-authors include a number of LAION members: organizational lead and co-founder Christoph Schuhmann, founding members Theo Coombes and Aarush Katta, software engineer Clayton Mullis, researcher Srivatsa Kundurthy and LAION e.V. co-founder and scientific lead Jenia Jitsev.

Stability AI lead cloud architect Richard Vencu and Stability AI artist Katherine Crowson also contributed to the project as co-authors, along with Google machine learning engineer Romain Beaumont, UC Berkeley undergraduate research assistant Cade Gordon, Hugging Face computer vision developer Ross Wightman, Jülich Supercomputing Centre postdoctoral researcher Mehdi Cherti, TU Darmstadt Ph.D. student Patrick Schramowski and Technische Universität München researcher Robert Kaczmarczyk.

Read more about LAION-5B here and about ProcTHOR in AI2’s December newsletter here.