NLP for all: Professor and 2022 Sloan Research Fellow Yulia Tsvetkov is on a quest to make natural language tools more equitable, inclusive and socially aware

Less than a year after her arrival at the University of Washington, professor Yulia Tsvetkov is making her mark as the newest member of the Allen School’s Natural Language Processing group. As head of the Tsvetshop — a clever play on words that would likely stymie your typical natural language model — Tsvetkov draws upon elements of linguistics, economics, and the social and political sciences to develop technologies that not only represent the leading edge of artificial intelligence and natural language processing, but also benefit users across populations, cultures and languages. Having recently earned a 2022 Sloan Research Fellowship from the Alfred P. Sloan Foundation, Tsvetkov is looking forward to adding to her record of producing new tools and techniques for making AI and NLP more equitable, inclusive and socially aware.

“One of the goals of my work is to uncover hidden insights into the relationship between language and biases in society and to develop technologies for identifying and mitigating such bias,” said Tsvetkov. “I also aim to build more equitable and robust models that reflect the needs and preferences of diverse users, because many speakers of diverse language varieties are not well-served by existing tools.”

Her focus at the intersection of computation and social sciences has enabled Tsvetkov to make inroads when it comes to protecting the integrity of information beyond “fake news” by identifying more subtle forms of media manipulation. Even with the growing attention being paid to identifying and filtering out misleading content, tactics such as distraction, propaganda and censorship can be challenging for automated tools to detect. To overcome this challenge, Tsvetkov has spearheaded efforts to develop capabilities for discerning “the language of manipulation” automatically and at scale. 

In one project, Tsvetkov and her colleagues devised computational approaches for detecting subtle manipulation strategies in Russian newspaper coverage. By applying agenda-setting and framing, two concepts from political science, they teased out how one outlet’s decisions about what to cover, and how to cover it, were used to distract readers from economic conditions. She also produced a framework for examining the spread of polarizing content on social media based on an analysis of Indian and Pakistani posts following the 2019 terrorist attacks in Kashmir. Given the growth in AI-generated text, Tsvetkov has lately turned her attention to semantic forensics, including the analysis of the types of misinformation and factual inconsistencies produced by large AI models, with a view to developing interpretable deep learning approaches that will control for factuality and other traits of machine-generated content.

“Understanding the deeper meaning of human- or machine-generated text, the writer’s intent, and what emotional reactions the text is likely to evoke in its readers is the next frontier in NLP,” said Tsvetkov. “Language technologies that are capable of doing such fine-grained analysis of pragmatic and social meaning will be critical for combating misinformation and opinion manipulation in cyberspace.”

Tsvetkov’s work has also contributed to researchers’ understanding of the interplay between language and social attitudes by surfacing biases in narrative text targeting vulnerable audiences. NLP researchers, including several of Tsvetkov’s Allen School colleagues, have demonstrated effective techniques for identifying toxic content online, and yet more subtle forms continue to evade moderation. Tsvetkov has been at the forefront of developing new datasets, algorithms and tools grounded in social psychology to detect, at scale and across multiple languages, discrimination based on gender, race and/or sexual orientation that manifests in online text and conversations.

“Although there are tools for detecting hate speech, most harmful web content remains hidden,” Tsvetkov noted. “Such content is hard to detect computationally, so it propagates into downstream NLP tools that then serve to amplify systematic biases.”

One approach that Tsvetkov has employed to great effect is an expansion of contextual affective analysis (CAA), a technique for examining how people are portrayed along dimensions of power, agency and sentiment, to multilingual settings in an effort to understand how narrative text across different languages reflects cultural stereotypes. After applying a multilingual model to English, Spanish and Russian Wikipedia entries about prominent LGBTQ figures in history, Tsvetkov and her team found systematic differences in phrasing that reflected social biases. For example, entries about the late Alan Turing, who was persecuted for his homosexuality, described how he “accepted” chemical castration (English), “chose” it (Spanish), or “preferred” it (Russian) — three verbs with three very different connotations as to Turing’s agency, power and sentiment at the time. Tsvetkov applied similar analyses to uncover gender bias in media coverage of #MeToo and assist the Washington Post in tracking racial discrimination in China, and has since built upon this work to produce the first intersectional analysis of bias in Wikipedia biographies that examines gender disparities beyond cisgender women alongside racial disparities.

The fact that most existing NLP tools are grounded in a specific variant of English has been a driving force in much of Tsvetkov’s research. 

“We researchers often say that a model’s outputs are only as good as its inputs,” Tsvetkov noted. “For the purposes of natural language models, those inputs have mostly been limited to a certain English dialect — but there are multiple English dialects and over 6,000 languages besides English spoken around the world! That’s a significant disconnect between current tools and the billions of people for whom English is not the default. We can’t achieve NLP for all without closing that gap.”

To that end, Tsvetkov has recently turned her attention to developing new capabilities for NLP technologies to adapt to multilingual users’ linguistic proficiencies and preferences. For example, she envisions tools that can match the ability of bilingual and non-native speakers of English and Spanish to switch fluidly between the two languages in conversation, often within the same sentence. Her work has the potential to bridge the human-computer divide where, currently, meaning and context can get lost in translation.

“Yulia is intellectually fearless and has a track record of blending technical creativity with a rigorous understanding of the social realities of language and the communities who use it,” said Magdalena Balazinska, professor and director of the Allen School. “Her commitment to advancing language technologies that adapt to previously ignored users sets her apart from her research peers. By recognizing that AI is not only about data and math, but also about people and societies, Yulia is poised to have an enormous impact on the field of AI and beyond.”

Tsvetkov joined the Allen School last July after spending four years on the faculty of Carnegie Mellon University. She is one of two UW researchers who were honored by the Sloan Foundation in its class of 2022 Fellows, who are chosen based on their research accomplishments and creativity as rising leaders in selected scientific or technical fields. Briana Abrahms, a professor in the UW Department of Biology, joined Tsvetkov among a total of 118 honorees drawn from 51 institutions across the United States and Canada.

Read the Sloan Foundation press release here and a related UW News release here.

Congratulations, Yulia!

Rebekka Coakley contributed to this story.