
Allen School researchers showcase speech and audio as the new frontier for human-AI interaction at Interspeech 2024

An overhead view of people sitting at a table, working on a project with laptops, phones, and headphones.
Unsplash/Marvin Meyer

Trying to work or record interviews in busy and loud cafes may soon be easier thanks to new artificial intelligence models.

A team of University of Washington, Microsoft and AssemblyAI researchers led by Allen School professor Shyam Gollakota, who heads the Mobile Intelligence Lab, built two AI-powered tools for cutting through the noise. By analyzing the turn-taking dynamics of people in conversation, the team developed target conversation extraction, an approach that can single out the main speakers from the background audio in a recording. Because such models can be difficult to run in real time on smaller devices like headphones, the researchers also introduced knowledge boosting, a technique in which a larger model remotely helps with inference for a smaller on-device model.

Shyam Gollakota

The team presented its papers describing both innovations at Interspeech 2024 on Kos Island, Greece, earlier this month.

“Speech and audio are now the new frontier for human-AI interaction,” said Gollakota, the Washington Research Foundation/Thomas J. Cable Professor in the Allen School.

What did you say?: Target conversation extraction

One of the problems Gollakota and his colleagues sought to solve was how an AI model can determine who the main speakers are in an audio recording with lots of background chatter. The researchers trained the neural network using conversation datasets in both English and Mandarin to recognize “the unique characteristics of people talking over each other in conversation,” Gollakota said. Across both language datasets, the researchers found the turn-taking dynamic held up for conversations with up to four speakers.

“If there are other people in the recording who are having a parallel conversation amongst themselves, they don’t follow this temporal pattern,” said lead author and Allen School Ph.D. student Tuochao Chen. “What that means is that there is way more overlap between them and my voice, and I can use that information to create an AI which can extract out who is involved in the conversation with me and remove everyone else.”
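As a rough sketch of that turn-taking intuition (not the team’s actual method, which trains a neural network to learn these patterns from audio), one could compare how much each speaker’s speech overlaps the target speaker’s. The interval format, threshold and function names below are illustrative assumptions:

```python
# Toy illustration of the turn-taking intuition: speakers in the same conversation
# take turns and rarely talk over each other, while a parallel conversation
# overlaps the target speaker much more often.

def overlap_seconds(a, b):
    """Total overlap between two lists of (start, end) speech intervals, in seconds."""
    total = 0.0
    for s1, e1 in a:
        for s2, e2 in b:
            total += max(0.0, min(e1, e2) - max(s1, s2))
    return total

def likely_partners(target_segments, other_speakers, max_overlap_ratio=0.15):
    """Return speakers whose speech rarely overlaps the target's (turn-taking behavior)."""
    target_total = sum(e - s for s, e in target_segments)
    partners = []
    for name, segments in other_speakers.items():
        ratio = overlap_seconds(target_segments, segments) / max(target_total, 1e-6)
        if ratio <= max_overlap_ratio:  # low overlap -> plausibly taking turns with the target
            partners.append(name)
    return partners

# Example: speaker "A" barely overlaps the target, while "B" talks over them constantly.
target = [(0.0, 3.0), (6.0, 9.0)]
others = {"A": [(3.2, 5.8), (9.1, 11.0)], "B": [(0.5, 8.5)]}
print(likely_partners(target, others))  # -> ['A']
```

A fixed threshold like this would also throw away backchannels, which the trained model is designed to preserve, as described below.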

While the AI model leverages the turn-taking dynamic, it still preserves any backchannels happening within the conversation. Backchannels are the brief overlaps that occur when listeners signal that they are paying attention, such as laughter or a quick “yeah.” Without these backchannels, the recording would not be an authentic representation of the conversation and would lose some of the vocal cues between speakers, Gollakota explained.

“These cues are extremely important in conversations to understand how the other person is actually reacting,” Gollakota said. “Let’s say I’m having a phone call with you. These backchannel cues where we overlap each other with ‘mhm’ create the cadence of our conversation that we want to preserve.”

The AI model can work on any device that has a microphone and can record audio, including laptops and smartphones, without needing any additional hardware, Gollakota noted.

Additional co-authors on the target conversation extraction paper include Malek Itani, a Ph.D. student in the UW Department of Electrical & Computer Engineering, Allen School undergraduate researchers Qirui Wang and Bohan Wu (B.S., ‘24), Microsoft Principal Researcher Emre Sefik Eskimez and Director of Research at AssemblyAI Takuya Yoshioka.

Turning up the power: Knowledge boosting

Target conversation extraction and other AI-enabled software that works in real time can be difficult to run on smaller devices like headphones due to size and power constraints. To address this, Gollakota and his team introduced knowledge boosting, which can increase the performance of a small model running on headphones, for example, with the help of a remote model running on a smartphone or in the cloud. Knowledge boosting can potentially be applied to noise cancellation features, augmented reality and virtual reality headsets, or other mobile devices that run AI software locally.

However, because the small model has to send information to the larger remote model and wait for its response, the larger model’s help arrives with a slight delay.

“Imagine that while I’m talking, there is a teacher remotely telling me how to improve my performance through delayed feedback or hints,” said lead author and Allen School Ph.D. student Vidya Srinivas. “This is how knowledge boosting can improve small models’ performance despite large models not having the latest information.”

To work around the delay, the larger model attempts to predict what is going to happen milliseconds into the future so it can react to it. The larger model is “always looking at things which are 40–50 milliseconds in the past,” Gollakota said.
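A hypothetical sketch of that data flow might look like the following, where dummy stand-ins for the small and large models are chained with a fixed hint delay measured in frames. The frame size, delay value and model functions are illustrative assumptions, not the paper’s architecture:

```python
# Minimal data-flow sketch of the knowledge-boosting idea (dummy models):
# a small on-device model processes each audio frame immediately, while hints
# from a larger remote model arrive DELAY frames late, standing in for the
# roughly 40-50 millisecond round trip Gollakota describes.

from collections import deque
import numpy as np

FRAME = 128   # samples per frame (illustrative)
DELAY = 6     # hint lag in frames, standing in for the network round trip

def small_model(frame, hint):
    """Placeholder on-device model: light processing, conditioned on the hint."""
    return frame * 0.5 + (hint if hint is not None else 0.0)

def large_model(frame):
    """Placeholder remote model: heavier processing that yields a hint."""
    return np.tanh(frame) * 0.1

def stream(frames):
    hint_queue = deque([None] * DELAY, maxlen=DELAY)  # hints arrive DELAY frames late
    outputs = []
    for frame in frames:
        delayed_hint = hint_queue[0]              # hint computed DELAY frames in the past
        outputs.append(small_model(frame, delayed_hint))
        hint_queue.append(large_model(frame))     # remote model works on what it just received
    return outputs

audio = [np.random.randn(FRAME) for _ in range(20)]
out = stream(audio)
```

In the actual system, the larger model is trained so that its delayed hints still improve the small model’s current output, which is where the prediction Gollakota describes comes in.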

The larger model’s prediction capabilities open the door for further research into AI systems that can anticipate and autocomplete what someone is saying and how they are saying it, Gollakota noted.

In addition to Gollakota and Srinivas, co-authors on the knowledge boosting paper include Itani, Chen, Eskimez and Yoshioka. 

This is the latest work from Gollakota and his colleagues to advance new AI-enabled audio capabilities, including headphones that allow the wearer to focus on a specific voice in a crowd just by looking at them and a system for selecting which sounds to hear and which ones to cancel out.

Read the full papers on target conversation extraction and knowledge boosting here.