Data science and decentralization: Bloomberg Ph.D. Fellowship recipient Suchin Gururangan makes large language models more manageable to advance social good

Suchin Gururangan, wearing a blue shirt, smiles in front of a blurred, wooded background.

Starting in high school, Suchin Gururangan felt the pull of research. A summer internship piqued his curiosity. Projects in his university’s neuroscience lab fed it further. 

But “life happens,” as he put it, and graduate school took a backseat for the time being. Following graduation from the University of Chicago, he moved to the Boston area and jumped into industry, first in venture capital, then in cybersecurity. Even after landing in Seattle to join a startup, the former researcher still had questions he wanted to answer through an academic lens.

“Throughout my journey in industry, I had always had some lingering desire to revisit my early days of research,” Gururangan said, “which was always so exciting and fulfilling to me.”

When the startup folded, Gururangan decided to rekindle those interests, pursuing his master’s in the computational linguistics (CLMS) program. The timing couldn’t have been better. 

“It was too late for most application deadlines,” he said. “I applied on a whim with the intention of coming back into industry afterwards. I got into the program and then the rest was history!”

Gururangan eventually joined professor Noah Smith’s ARK group in natural language processing (NLP), where he nurtured his passion for research alongside lasting friendships. While working as a predoctoral young investigator at the Allen Institute for AI, he grew a lot as a scholar, he said, crediting his collaborators and mentors for encouraging his academic pursuits.

Now a third-year Ph.D. student, Gururangan is continuing to channel his curiosity into solving real-world problems. He recently received a 2022-2023 Bloomberg Data Science Ph.D. Fellowship, which provides early-career researchers in data science with financial aid and professional support. Fellows also have the opportunity to take part in a 14-week summer internship in New York, during which they’ll further their research goals while helping Bloomberg crack complex challenges in fields such as machine learning, information extraction and retrieval, and natural language processing, among others.

“I’m grateful to Bloomberg for recognizing the research that my collaborators and I have done,” Gururangan said. “And I’m really thankful to all my collaborators who have helped me bring these research directions to life; I’m really proud of the work we’ve done so far.”

During his internship at Bloomberg next year, Gururangan will work on models that adapt to constantly evolving text streams. He’ll focus on building language models that update rapidly on incoming streams of news, and on developing methods to preserve privacy in expert models specialized for sensitive financial documents.

The work dovetails with his previous research on a new domain expert mixture (DEMix) layer for modular large language models and on embarrassingly parallel training of large language models. Each tackles the issue of centralization: Instead of one centralized model trained on thousands of graphics processing units (GPUs), which is costly and can invite bias, a number of smaller models fill the gap. Researchers can train these models asynchronously, each specializing in a particular domain; for instance, they may train the models, or experts, on social media text or scientific documents. Designed with modularity in mind, the experts can be mixed, added or removed on the fly, a boon for organizations such as Bloomberg, where the news cycle never sleeps.
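The core idea is straightforward to sketch. The toy PyTorch layer below is an illustration under assumptions, not the published implementation: one feedforward expert per domain, routed by the document’s known domain label during training so each expert can be trained independently, with an optional weighted mixture of experts at inference when the domain is unknown. All module names and sizes here are invented for the example.

```python
import torch
import torch.nn as nn

class DEMixStyleLayer(nn.Module):
    """Toy domain-expert-mixture feedforward layer (illustrative only).

    One feedforward expert per training domain. During training, tokens
    are routed to the expert matching their document's domain label, so
    experts never interact and can be trained asynchronously, or added
    and removed later, without touching the others.
    """

    def __init__(self, d_model: int = 512, d_hidden: int = 2048, n_domains: int = 8):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.GELU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(n_domains)
        )

    def forward(self, x: torch.Tensor, domain_id: int) -> torch.Tensor:
        # Training-time routing: the domain is known metadata, so no
        # learned router is needed; only one expert's weights get updated.
        return self.experts[domain_id](x)

    def mix(self, x: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
        # Inference-time option: blend expert outputs under a probability
        # distribution over domains (e.g., estimated from the input text)
        # when the true domain is unknown.
        outs = torch.stack([expert(x) for expert in self.experts])
        return torch.einsum("e,e...->...", weights, outs)

layer = DEMixStyleLayer()
h = torch.randn(4, 16, 512)                      # (batch, seq_len, d_model)
news_out = layer(h, domain_id=2)                 # update only the "news" expert
blended = layer.mix(h, torch.full((8,), 1 / 8))  # uniform blend of all experts
```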

“How do you efficiently update these models or these experts to incorporate that new information that’s coming in?” Gururangan said. “That’s a really hard problem. Especially with these bigger existing models.” 

Centralization can also raise ethical questions surrounding data curation. Gururangan is trying to help the data science community better understand and answer those questions through the lens of artificial intelligence (AI) and NLP.

“The way that data is selected for training these models is really centralized, and only a few people have the power to select that data,” Gururangan said. “And so what happens when only a few people are able to do that? What we found is that people have their own ideologies about what makes good text, what makes for reasonable text for models to see and be exposed to, and that results in important downstream biases.”

Gururangan was also part of the team that created RealToxicityPrompts, a dataset of 100,000 naturally occurring text prompts for evaluating the phenomenon of toxic language degeneration, that is, how pretrained neural language models can reproduce the biases and noxious content that result from having only a few hands at the controls. Even from seemingly harmless inputs, the language models could go rogue, generating toxic content with real consequences.
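For readers who want to examine the data themselves, RealToxicityPrompts is publicly available. Here is a minimal sketch, assuming the Hugging Face `datasets` library and the `allenai/real-toxicity-prompts` dataset on the Hub; the field names below reflect the released dataset but should be treated as assumptions.

```python
from datasets import load_dataset

# Load the 100K naturally occurring prompts from the Hugging Face Hub.
rtp = load_dataset("allenai/real-toxicity-prompts", split="train")

# Each record pairs a prompt (and its original continuation) with
# automatic toxicity scores; "challenging" flags prompts found to
# reliably elicit toxic completions from language models.
record = rtp[0]
print(record["prompt"]["text"])
print(record["prompt"]["toxicity"])
print(record["challenging"])

# A typical evaluation loop feeds each prompt to the model under test
# and scores its generations for toxicity (generation and scoring are
# omitted here).
challenging = rtp.filter(lambda r: r["challenging"])
print(f"{len(challenging)} challenging prompts out of {len(rtp)}")
```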

“We’ve been thinking a lot about how you redesign these models to be more decentralized, so more people have the power to shape the data coverage of these models,” Gururangan said. “That can inform what sorts of technical solutions you’re interested in and what the possibilities are.”

Those possibilities continue to drive Gururangan to seek out solutions using data. Working at the intersection of AI and ethics excites him, he said, feeding not only his curiosity but also his desire to direct his research toward advancing social good.

“Much of my research involves understanding language variation in large datasets,” Gururangan said, “and I strongly believe that the more careful we are about understanding where our training data comes from, the stronger and more reliable our language technologies will be.”