Researchers at Meta AI have made a significant advance in generative AI for speech. We created Voicebox, the first model that can generalize, with state-of-the-art performance, to speech-generation tasks it was not specifically trained for.
Like generative systems for images and text, Voicebox produces outputs in a wide range of styles, and it can both generate from scratch and modify a given sample. Instead of a picture or a passage of text, however, Voicebox produces high-quality audio clips. The model can synthesize speech in six languages, and it can also perform noise removal, content editing, style conversion, and diverse sample generation.
Before Voicebox, generative AI for speech required task-specific training on carefully prepared data. Voicebox uses a new approach that learns from just raw audio and its accompanying transcription. Unlike autoregressive models for audio generation, Voicebox can modify any part of a given sample, not just the end of an audio clip.
Voicebox is built on Flow Matching, a method that has been shown to improve on diffusion models. On zero-shot text-to-speech, Voicebox outperforms VALL-E, the current state-of-the-art English model, on both intelligibility (a word error rate of 1.9% vs. 5.9% for VALL-E) and audio similarity (0.681 vs. 0.580 for VALL-E), while being up to 20 times faster. For cross-lingual style transfer, Voicebox outperforms YourTTS, reducing the average word error rate from 10.9% to 5.2% and improving audio similarity from 0.335 to 0.481.
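For readers unfamiliar with the intelligibility metric quoted above, word error rate (WER) is the word-level edit distance between a reference transcript and a hypothesis transcript, divided by the number of reference words, so lower is better. The following is a minimal sketch of that standard definition, not the evaluation code used for Voicebox. In a speech-generation evaluation the hypothesis would normally come from running a speech recognizer on the generated audio; the function name and the example sentences here are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name); text is lowercased and split
    on whitespace, with no further normalization.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up transcripts: one substituted word out of six reference words.
    wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
    print(f"WER: {wer:.1%}")
```

Under this definition, a WER of 1.9% corresponds to roughly two word errors per hundred reference words.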
Although there are a number of exciting applications for generative speech models, we are not currently making the Voicebox model or code available to the public because of security concerns. To advance the state of the art in AI, we feel it is critical to be open with the AI community and share our findings, but it’s also crucial to strike the right balance between openness and responsibility. With these considerations in mind, we are releasing audio samples and a research paper that describes our approach and the results we achieved. In the paper, we also explain how we built a highly effective classifier that can distinguish between real speech and audio generated by Voicebox.