Microsoft's AI speech generator achieves human parity but is too dangerous for the public

Shawn Knight

Too Real: Microsoft has developed a new iteration of its neural codec language model, Vall-E, that surpasses previous efforts in terms of naturalness, speech robustness, and speaker similarity. It is the first of its kind to reach human parity in a pair of popular benchmarks, and is apparently so lifelike that Microsoft has no plans to grant access to the public.

Leveraging Vall-E's groundwork, the new AI voice tool integrates two major enhancements that greatly improve performance. Grouped code modeling allows Microsoft to better organize codec codes, resulting in shorter sequence lengths that boost inference speed and help overcome challenges associated with long sequence modeling.
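To make the idea of grouping concrete, here is a minimal sketch of how a flat sequence of codec codes could be split into fixed-size groups so that a model takes one step per group instead of one step per code. The function name `group_codes`, the `group_size` parameter, and the padding token are hypothetical names chosen for illustration; the article does not describe Microsoft's actual implementation.

```python
from typing import List


def group_codes(codes: List[int], group_size: int, pad_token: int = -1) -> List[List[int]]:
    """Split a flat code sequence into consecutive groups of `group_size`.

    The last group is padded with `pad_token` (an assumed placeholder value)
    so that every group has the same length.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    padded = list(codes)
    remainder = len(padded) % group_size
    if remainder:
        padded.extend([pad_token] * (group_size - remainder))
    # Each slice becomes one modeling step, so the sequence the model
    # sees is roughly len(codes) / group_size steps long.
    return [padded[i:i + group_size] for i in range(0, len(padded), group_size)]


if __name__ == "__main__":
    codes = list(range(8))
    groups = group_codes(codes, group_size=2)
    print(len(codes), "codes ->", len(groups), "steps")
```

With a group size of 2, eight codes become four steps, which is the kind of sequence shortening the paragraph above credits for the faster inference.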

Repetition aware sampling, meanwhile, rethinks the original nucleus sampling process to look for token repetition when decoding. Microsoft said this process helps stabilize decoding and prevents the infinite loop issue that was present in the original Vall-E.
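As a rough illustration, the sketch below shows one plausible way to add a repetition check on top of nucleus (top-p) sampling. The window size, the repetition threshold, and the fallback of drawing from the full distribution when a token repeats too often are all assumptions made for this example, as are the function names; the article does not give these details.

```python
import random
from typing import List, Optional, Sequence


def nucleus_sample(probs: Sequence[float], top_p: float, rng: random.Random) -> int:
    """Standard nucleus sampling: draw from the smallest set of most likely
    tokens whose cumulative probability reaches `top_p`."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept: List[int] = []
    total = 0.0
    for i in ranked:
        kept.append(i)
        total += probs[i]
        if total >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def repetition_aware_sample(
    probs: Sequence[float],
    history: List[int],
    top_p: float = 0.8,
    window: int = 10,        # assumed: how many recent tokens to inspect
    max_ratio: float = 0.1,  # assumed: tolerated share of repeats in the window
    rng: Optional[random.Random] = None,
) -> int:
    """Nucleus sampling with a repetition check on the recent history."""
    if window < 1:
        raise ValueError("window must be at least 1")
    rng = rng or random.Random()
    token = nucleus_sample(probs, top_p, rng)
    recent = history[-window:]
    if recent:
        ratio = recent.count(token) / len(recent)
        if ratio > max_ratio:
            # Assumed fallback: resample from the full distribution so the
            # decoder has a chance to escape a repeating token.
            token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token
```

The point of the fallback is that a narrow nucleus can keep returning the same token indefinitely, while the full distribution still gives lower-probability tokens a chance to break the loop.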

Microsoft put Vall-E 2 to the test using the LibriSpeech and VCTK datasets, and it passed them both with flying colors. When Redmond claims the AI tool achieves human parity, it means Vall-E 2 performed better than ground truth samples in robustness, similarity, and naturalness. In other words, the tool can produce natural speech that is virtually identical to the original speaker's.

Microsoft shared dozens of samples from Vall-E 2, which can be found over on the project summary page. Indeed, Vall-E 2 samples are incredibly lifelike and indistinguishable from the human speaker. The AI tool even masters subtleties like putting emphasis on the correct word in a sentence as people subconsciously do when speaking.

Microsoft said Vall-E 2 is purely a research project, adding that it has no plans to incorporate the tech into a consumer product or release the tool to the general public. Redmond further noted that it carries potential risks of misuse, such as impersonating a specific person or spoofing voice identification.

That said, the company believes it could have applications in education, translation, accessibility, journalism, self-authored content, and chatbots, among others.

Image credit: Rootnot Creations

Very impressive! I'm sure certain government agencies already have it, if they haven't already used it in one of their clandestine operations!
How long will it be before this becomes another weapon in the cyber criminal's arsenal?
 
MS self-validated its own AI speech generator and said it is so good. That kind of raised many red flags for me. And by the way, nobody has access to it, nor does anyone know how it works in the background. They only shared a dozen or so samples, and it was concluded that it reached human parity. Wow.
 
Meh, this tactic is old and is just a way to get people hyped up about it for when they do release it someday. They knew the capability that they were after when they developed this, it isn't like it is an accident. Like other models, it remains to be seen if the model's capabilities truly generalize, and how difficult it is to get it out of the uncanny valley.
 