Researchers in the Allen School’s Graphics & Imaging Laboratory (GRAIL) have developed a new technique that enables them to generate photorealistic videos from audio clips. The team, which includes recent Ph.D. graduate Supasorn Suwajanakorn and professors Steven Seitz and Ira Kemelmacher-Shlizerman, demonstrated their approach by producing a video of former president Barack Obama lip-syncing audio on a range of topics, complete with natural-looking facial expressions and mouth movements.
To achieve such a lifelike result, the researchers had to overcome the "uncanny valley" problem that typically plagues synthesized human likenesses, lending them a creepiness that most viewers find hard to look past.
“People are particularly sensitive to any areas of your mouth that don’t look realistic,” noted Suwajanakorn, lead author of the paper describing the team’s results. “People can spot it right away and it’s going to look fake…you have to render the mouth region perfectly to get beyond the uncanny valley.”
For its demonstration with Obama, the team trained a neural network to analyze existing videos of the former president and learn to translate sounds into mouth shapes. They then superimposed and blended those shapes onto a reference video — drawing on their previous research in 3-D facial reconstruction and digital modeling — to depict Obama accurately lip-syncing speeches from unrelated audio clips. According to Kemelmacher-Shlizerman, it is the first time researchers have achieved such realistic results in an audio-to-video conversion.
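As a rough illustration of that audio-to-mouth-shape step, the sketch below shows a small recurrent network that maps a sequence of per-frame audio features to a compact mouth-shape vector for each frame. The specific choices here (MFCC-style input features, low-dimensional landmark coefficients as output, and all layer sizes) are illustrative assumptions for the sketch, not the published model's configuration.

```python
# Minimal sketch: map per-frame audio features to mouth-shape vectors.
# Assumptions (not from the paper): 28 MFCC-style features per frame,
# 18 mouth-shape coefficients per frame, and a single LSTM layer.
import torch
import torch.nn as nn

class AudioToMouth(nn.Module):
    def __init__(self, n_audio_features=28, n_mouth_coeffs=18, hidden=64):
        super().__init__()
        # The LSTM consumes one audio feature vector per video frame.
        self.lstm = nn.LSTM(n_audio_features, hidden, batch_first=True)
        # A linear head predicts a mouth-shape vector for each frame.
        self.head = nn.Linear(hidden, n_mouth_coeffs)

    def forward(self, audio_feats):
        # audio_feats: (batch, time, n_audio_features)
        out, _ = self.lstm(audio_feats)
        return self.head(out)  # (batch, time, n_mouth_coeffs)

model = AudioToMouth()
dummy = torch.randn(1, 100, 28)  # 100 frames of audio features
mouth_shapes = model(dummy)      # one mouth-shape vector per frame
print(mouth_shapes.shape)        # torch.Size([1, 100, 18])
```

At inference time, one such mouth-shape vector per video frame would drive the synthesized mouth region that is then blended into the reference footage.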
The team’s approach, which will be presented at the SIGGRAPH 2017 conference in Los Angeles, California, next month, could yield significant advancements in video conferencing and virtual reality applications.
Read the UW News release here and visit the project page here. Also check out coverage of the project by The Atlantic, IEEE Spectrum, Wired, New Atlas, Engadget, GeekWire, PCMag, Variety, and The Verge.