Researchers at Microsoft released a paper this week about VASA-1, a new AI tool that can generate a convincing video of someone speaking, using just a still image and an audio clip. Microsoft doesn’t have immediate plans to release the new tool to the public, but it’s pretty impressive. Well, it’s impressive if you don’t look too closely at the teeth. Just take a gander at those chompers.

The VASA-1 model takes any still photo of a human face (or, in the examples Microsoft published, an AI-generated face of someone who doesn’t actually exist) and, after being fed an audio file, produces a synchronized video complete with facial nuances and natural-looking motion.

Again, it’s all quite impressive, as you can see in one of the videos Microsoft provided below. But the one area where VASA-1 seems to struggle is rendering teeth. If you focus on them, they take on a cartoonish quality, appearing slightly animated in a way that doesn’t quite match the hyper-realism of everything else.

The video’s bizarre teeth become even more apparent when you slow the whole thing down, as Gizmodo did in the GIF below. (It can almost make you feel bad about picking apart someone’s appearance, until you remember the person below literally doesn’t exist.)

Another example video provided by Microsoft, which appears below, shows similar cartoon-like qualities in the teeth, even though the other features appear very realistic, especially when you remember the only source material is a static image and an audio file.

For whatever reason, the odd teeth were slightly less noticeable in videos of men, perhaps because the model didn’t show men opening their mouths quite as wide while speaking. But anyone who looks closely can still get the sense that something isn’t quite right here.

One of the more interesting things the researchers note is that the model can produce relatively high-quality video very quickly, something other AI generators, like OpenAI’s Sora, have reportedly struggled with. In fact, the paper reports a latency of just 0.17 seconds on a desktop PC with a single NVIDIA RTX 4090 GPU.

That kind of speed could deliver near-instant video for a variety of applications, like real-time translation services.

“Our method not only delivers high video quality with realistic facial and head dynamics but also supports the online generation of 512×512 videos at up to 40 FPS with negligible starting latency. It paves the way for real-time engagements with lifelike avatars that emulate human conversational behaviors,” the new paper reads.
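To put those figures in perspective, here’s a quick back-of-the-envelope calculation (a sketch based only on the numbers quoted above, not anything from the paper itself) of what 40 FPS and a 0.17-second starting latency imply per frame:

```python
# Back-of-the-envelope math on the performance figures quoted from the
# VASA-1 paper: 512x512 output at up to 40 FPS, with a 0.17-second
# starting latency on a single RTX 4090.

FPS = 40          # frames per second, as reported
LATENCY_S = 0.17  # starting latency in seconds, as reported

# Time budget per frame needed to sustain 40 FPS
frame_budget_ms = 1000 / FPS  # 25.0 ms per frame

# How much video the startup latency amounts to, measured in frames
latency_in_frames = LATENCY_S * FPS  # roughly 7 frames

print(f"Per-frame budget at {FPS} FPS: {frame_budget_ms:.1f} ms")
print(f"Startup latency is about {latency_in_frames:.1f} frames of video")
```

In other words, the model has to finish each frame in about 25 milliseconds, and the startup delay amounts to well under a quarter-second of video, which is why the researchers describe the latency as “negligible” for conversational use.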

The researchers are clearly aware of the dangers in this kind of tech, which perhaps explains why Microsoft hasn’t announced plans to rush it out to the public just yet. However, the researchers have also identified use cases that they believe will be useful to humanity.

“The benefits—such as enhancing educational equity, improving accessibility for individuals with communication challenges, offering companionship or therapeutic support to those in need, among many others—underscore the importance of our research and other related explorations. We are dedicated to developing AI responsibly, with the goal of advancing human well-being,” the paper reads.

“Given such context, we have no plans to release an online demo, API, product, additional implementation details, or any related offerings until we are certain that the technology will be used responsibly and in accordance with proper regulations.”

That’s probably a good idea, given the number of scams that are possible with this kind of tech. After all, the 2024 presidential election in the U.S. is just seven months away. And the threat of fascism globally isn’t disappearing anytime soon. Humanity really does feel like it’s powerless against AI-generated fakes right now. And large companies like Microsoft should probably do everything in their power to limit the potential harm before virtually everything on the internet becomes fakery.
