Microsoft’s new AI needs just 3 seconds of audio to clone a voice

VALL-E can even mimic a speaker’s emotions and acoustic environment.
Sign up for the Freethink Weekly newsletter!
A collection of our favorite stories straight to your inbox

Microsoft’s new voice-cloning AI can simulate a speaker’s voice with remarkable accuracy — and all it needs to get started is a three-second sample of them talking.

Voice cloning 101: Voice cloning isn’t new. Google the term, and you’ll get a long list of links to websites and apps offering to train an AI to produce audio that sounds just like you. You can then use the clone to hear yourself “read” any text you like.

For a writer, this can be useful for creating an author-narrated audio version of their book without spending days in a recording studio. A voice actor, meanwhile, might clone their voice so that they can rent out the AI for projects they don’t have time to tackle themselves.

Depending on the service, the voice cloning process might start with you reciting 50 predetermined sentences or uploading a clip of you saying anything at all. Some services will ask for hours of audio to train their AI, while others will boast about needing just 5 seconds.

Often, you get out of these voice cloning services what you put into them — a shorter sample typically leads to a clone that sounds like a robot trying to impersonate a person, while longer clips can result in AI-generated audio that sounds just like the original speaker.

Short and sweet: Microsoft’s new voice-cloning AI, VALL-E, bucks this trend, generating audio that sounds remarkably like the original speaker from a voice sample just three seconds long. 

You can’t clone your own voice with VALL-E, but Microsoft has shared a research paper on arXiv and created a GitHub page where you can compare snippets of human voices to speech generated by VALL-E and a “baseline” voice-cloning AI (YourTTS).

On this page, Microsoft also demonstrates how the AI can mimic a speaker’s emotion and the acoustic environment of a sample — if the speaker sounds angry, VALL-E can generate angry-sounding audio, and if the original clip sounds like it was recorded over the phone, the AI can generate audio that matches those acoustics.

How it works: An AI is typically only as good as its training data, and Microsoft opted to use Meta’s LibriLight — an audio library containing 60,000 hours of speech from more than 7,000 English speakers — to train VALL-E.

This means the AI’s training set was “hundreds of times larger” than those used to train existing voice cloning systems, according to the research paper.
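To put those numbers in perspective, here is a quick back-of-the-envelope calculation. The 60,000 hours and 7,000 speakers come from the figures above; the baseline dataset sizes are hypothetical values chosen only to show what “hundreds of times larger” implies, not numbers taken from the paper.

```python
# Figures quoted in the article for the LibriLight training set.
LIBRILIGHT_HOURS = 60_000
LIBRILIGHT_SPEAKERS = 7_000  # "more than 7,000", so treat this as a lower bound


def scale_factor(baseline_hours):
    """How many times larger LibriLight is than a dataset of baseline_hours."""
    return LIBRILIGHT_HOURS / baseline_hours


# Because the speaker count is a lower bound, this average is an upper bound.
avg_hours_per_speaker = LIBRILIGHT_HOURS / LIBRILIGHT_SPEAKERS
print(f"Average audio per speaker: at most about {avg_hours_per_speaker:.1f} hours")

# Hypothetical baseline sizes (assumptions for illustration only).
for baseline_hours in (100, 300, 600):
    print(f"{baseline_hours} hours -> {scale_factor(baseline_hours):.0f}x smaller than LibriLight")
```

In other words, a training set in the low hundreds of hours would be enough to make the “hundreds of times larger” comparison hold.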

When VALL-E is presented with a new voice to clone, it breaks the three-second audio clip into bits Microsoft calls “acoustic tokens.” Using those tokens and its training data, it can then predict what the voice would sound like saying other phrases.
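To get a feel for what turning audio into tokens means, here is a deliberately simplified sketch. It is not Microsoft’s method: VALL-E relies on a learned neural audio codec, while this toy version just measures the loudness of each short frame and maps it to the nearest entry in a tiny hand-made codebook. The sample rate, frame size, codebook, and function names are all assumptions made for illustration.

```python
import math

SAMPLE_RATE = 16_000   # assumed sample rate in Hz for this toy example
FRAME_SIZE = 320       # assumed frame length: 20 ms at 16 kHz
CODEBOOK_SIZE = 8      # assumed number of distinct token values


def frame_energy(frame):
    """Root-mean-square amplitude of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def tokenize(samples, codebook):
    """Map each full frame to the index of the nearest codebook entry."""
    tokens = []
    for start in range(0, len(samples) - FRAME_SIZE + 1, FRAME_SIZE):
        energy = frame_energy(samples[start:start + FRAME_SIZE])
        nearest = min(range(len(codebook)), key=lambda i: abs(codebook[i] - energy))
        tokens.append(nearest)
    return tokens


# A synthetic three-second "clip": a 220 Hz tone that gets steadily louder.
num_samples = 3 * SAMPLE_RATE
clip = [
    (n / num_samples) * math.sin(2 * math.pi * 220 * n / SAMPLE_RATE)
    for n in range(num_samples)
]

# Hand-picked loudness levels; a real codec learns its codebooks from data.
codebook = [i / CODEBOOK_SIZE for i in range(CODEBOOK_SIZE)]

tokens = tokenize(clip, codebook)
print(f"{len(tokens)} tokens from a 3-second clip; first five: {tokens[:5]}")
```

The point is only that a short clip becomes a sequence of small integers. A model trained on enough such sequences can then treat speech generation as predicting the next token, much as a text model predicts the next word.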

The big picture: If you go back to that list of “voice cloning” search results, you’ll likely find links to articles detailing how the AIs are being used for nefarious purposes.

There’s the cybercriminal who cloned a boss’s voice to trick an employee into transferring company cash into the criminal’s bank account, and there are warnings to seniors that bad actors can now clone the voices of their grandchildren to extort money.

The Microsoft team addresses the potential for people to misuse VALL-E in their research paper, noting that such risks could be mitigated by the creation of a “detection model” capable of determining if a clip was generated by the AI. 

Even if bad actors find ways around such tools, though, other people will use the tech for good: creating synthetic voices for ALS patients, helping people connect with deceased loved ones, or doing something so remarkable we can’t yet imagine it.
