Gray sheep, golden cows, and everything in between: Yejin Choi earns Longuet-Higgins Prize in computer vision for enabling more precise image captions via natural language generation

Sheep standing in glass and metal bus shelter by road
“The gray sheep is by the gray road”

Allen School professor Yejin Choi is among a team of researchers recognized by the Computer Vision Foundation with its 2021 Longuet-Higgins Prize for their paper “Baby talk: Understanding and generating simple image descriptions.” The paper was among the first to explore the new task of generating image captions in natural language by bridging two fields of artificial intelligence: computer vision and natural language processing. Choi, who is also a senior research manager at the Allen Institute for AI (AI2), completed this work while a faculty member at Stony Brook University. She and her co-authors originally presented the paper at the 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Baby talk, characterized in part by grammatically simplified speech, is the process by which adults help infants acquire language and build their understanding of the world. Drawing upon this concept, Choi and her collaborators set out to teach machines to generate simple yet original sentences describing what they “see” in a given image. This was a significant departure from conventional approaches grounded in the retrieval and summarization of pre-existing content. To move past the existing paradigm, the researchers constructed statistical models for visually descriptive language by mining and parsing the large quantities of text available online, then paired those models with the latest recognition algorithms. Their strategy enabled the new system to describe the content of an image by generating sentences specific to that particular image, rather than shoehorning content drawn from a limited document corpus into a suitable description. The resulting captions, the team noted, described the visual content with greater relevance and precision.

Yejin Choi

“At the time we did this work, the question of how to align the semantic correspondences or alignments across different modalities, such as language and vision, was relatively unstudied. Image captioning is an emblematic task to bridge the longstanding gap between NLP research with computer vision,” explained Choi. “By bridging this divide, we were able to generate richer visual descriptions that were more in line with how a person might describe visual content — such as their tendency to include not just information on what objects are pictured, but also where they are in relation to each other.” 

This incorporation of spatial relationships into their language generator was key in producing more natural-sounding descriptions. Up to that point, computer vision researchers who focused on text generation from visual content relied on spatial relationships between labeled regions of an image solely to improve labeling accuracy; they did not consider them outputs in their own right on a par with objects and modifiers. By contrast, Choi and her colleagues considered the relative positioning of individual objects as integral to developing the computer vision aspect of their system, to the point of using these relationships to drive sentence generation in conjunction with the depicted objects and their modifiers.

Human evaluators deemed some of the results “astonishingly good.” In one example presented in the paper, the system accurately described a “gray sheep” as being positioned “by the gray road”; the “gray sky,” it noted, was above said road. For another image, the system correctly determined that the “wooden dining table” was located “against the first window.” The system also accurately described the attributes and relative proximity of rectangular buses, shiny airplanes, furry dogs, and a golden cow, among other examples.
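
To make the idea concrete, here is a toy sketch of how objects, their modifiers, and a spatial relationship can combine into a caption like the sheep example above. It is an illustration only, not the method from the paper: the detections are written by hand rather than produced by recognition algorithms, a fixed sentence template stands in for the statistical language models, and the names `Region`, `Relation`, and `describe` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A detected object plus one modifier (hypothetical structure)."""
    attribute: str
    noun: str


@dataclass(frozen=True)
class Relation:
    """A spatial relationship between two detected regions."""
    subject: Region
    preposition: str
    target: Region


def describe(relation: Relation) -> str:
    """Fill a fixed sentence template from a single relation.

    Assumption: a hand-written template replaces any learned
    language model; it only shows how the pieces fit together.
    """
    s, t = relation.subject, relation.target
    return (
        f"The {s.attribute} {s.noun} is "
        f"{relation.preposition} the {t.attribute} {t.noun}."
    )


# Hand-written "detections" standing in for real recognition output.
sheep = Region("gray", "sheep")
road = Region("gray", "road")

print(describe(Relation(sheep, "by", road)))
# prints: The gray sheep is by the gray road.
```

The point of the sketch is that the preposition is treated as part of the output alongside the objects and modifiers, rather than as a hidden cue used only to improve labeling.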

Cow with curved horns and shaggy, golden-brown hair standing in a field with trees in the background
The golden cow

The Longuet-Higgins Prize is an annual “test of time” award presented during CVPR by the IEEE Pattern Analysis and Machine Intelligence (PAMI) Technical Committee to recognize fundamental contributions that have had a significant impact in the field of computer vision. Choi’s co-authors on this year’s award-winning paper include then-master’s students Girish Kulkarni, Visruth Premraj and Sagnik Dhar; Ph.D. student SiMing Li; and professors Alexander C. Berg and Tamara L. Berg, both now on the faculty at the University of North Carolina at Chapel Hill.

Read the full paper here.

Congratulations to Yejin and the entire team!