UPDATED 13:35 EDT / JUNE 16 2023

Meta introduces Voicebox, a generative AI model for speech

Researchers at the artificial intelligence labs of Meta Platforms Inc. today announced what they call a breakthrough: a generative AI model for speech dubbed “Voicebox” that can accomplish a wide variety of tasks, such as synthesizing speech, converting styles and editing content.

According to the researchers, Voicebox can now do for speech what large language models such as OpenAI LP’s ChatGPT and image-generation models such as DALL-E have done for text and images.

“Like generative systems for images and text, Voicebox creates outputs in a vast variety of styles, and it can create outputs from scratch as well as modify a sample it’s given,” the Meta AI researchers said in a blog post. “But instead of creating a picture or a passage of text, Voicebox produces high-quality audio clips.”

Voicebox is a broadly capable model that can synthesize speech across six different languages without specialized training. It can also edit content, including fixing interruptions, convert the style of a sample and generate speech in diverse voices.

All the model needs to learn from is raw audio and its accompanying transcription. According to the researchers, other models cannot generalize across multiple tasks and must be pretrained separately for each one. That sets Voicebox apart: it can perform many different tasks without any task-specific training.

To make Voicebox sound more “human,” the researchers built the model on Flow Matching, a training method that allows the generative AI to learn from varied speech data without the variations needing to be specifically labeled. That allows the AI to perform different tasks and permits training data to be ingested at a much larger scale.
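
As a rough illustration of the general flow-matching technique (a sketch under common assumptions, not Meta’s implementation), a conditional flow-matching training step can be written in a few lines of PyTorch, assuming a straight-line path between Gaussian noise and the clean speech features; the network and conditioning names below are placeholders:

import torch
import torch.nn.functional as F

def flow_matching_loss(vector_field, x1, cond):
    # x1: clean speech features (batch, ...); cond: conditioning such as a transcript embedding.
    x0 = torch.randn_like(x1)                                  # noise sample
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)), device=x1.device)
    xt = (1.0 - t) * x0 + t * x1                               # point on the straight-line path at time t
    target = x1 - x0                                           # velocity of that path
    pred = vector_field(xt, t.reshape(-1), cond)               # network predicts the velocity
    return F.mse_loss(pred, target)

The relevant property is that only the clean sample and its conditioning are needed; no labels describing the stylistic variation are required, which is what lets the training data scale.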

“We trained Voicebox with more than 50,000 hours of recorded speech and transcripts from public domain audiobooks in English, French, Spanish, German, Polish, and Portuguese,” the researchers said. “Voicebox is trained to predict a speech segment when given the surrounding speech and the transcript of the segment.”

According to the research, using Flow Matching, the model achieved better results than Microsoft Corp.’s VALL-E model in terms of intelligibility, with a 1.9% word error rate versus VALL-E’s 5.9%, as well as better audio similarity, while running as much as 20 times faster.
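
For reference, word error rate is the word-level edit distance between a transcription of the generated speech and the reference text, divided by the number of reference words. A minimal Python implementation of the metric:

def word_error_rate(reference: str, hypothesis: str) -> float:
    # WER = (substitutions + insertions + deletions) / number of reference words.
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One substituted word out of ten reference words gives a 10% WER.
print(word_error_rate("the cat sat on the mat near the front door",
                      "the cat sat on a mat near the front door"))  # 0.1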

Voicebox can use as little as two seconds of audio to match a sample’s style and use it for text-to-speech generation. That could enable future applications such as giving a voice to people who cannot speak, powering virtual assistants and voicing characters in video games.

The model is also capable of infilling speech from context: if a clip is interrupted in the middle, it can predict what words may have been spoken and determine how they should sound. As a result, it can seamlessly edit audio clips in which speech is interrupted by a short-duration noise, such as a dog barking.
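
To make the infilling idea concrete, the sketch below masks the frames covered by a noise burst in a mel-spectrogram and asks a model to regenerate them from the surrounding audio and the transcript. The model.infill interface here is purely hypothetical, since Voicebox’s code has not been released:

import torch

def replace_noisy_span(model, mel, transcript, start_frame, end_frame):
    # mel: (n_mels, frames) spectrogram of the clip; the span [start_frame, end_frame)
    # is assumed to be drowned out by, say, a dog barking.
    mask = torch.zeros(mel.shape[-1], dtype=torch.bool)
    mask[start_frame:end_frame] = True
    context = mel.masked_fill(mask, 0.0)                     # hide the corrupted span
    regenerated = model.infill(context, mask, transcript)    # hypothetical infilling call
    return torch.where(mask, regenerated, mel)               # splice the new span back in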

Having been trained on numerous voices, Voicebox can also simulate natural speech that is more representative of how people talk in the real world, across the six languages it currently supports. That means it can be tuned to produce a variety of voices, tones and cadences, and it can even modify audio clips of a voice to match a different style or tone.

Although the researchers noted that this is an exciting breakthrough, they urged caution about its capabilities and its potential for misuse. As a result, the Voicebox model and its code are not being made available for public consumption.

“While we believe it is important to be open with the AI community and to share our research to advance the state of the art in AI, it’s also necessary to strike the right balance” between openness and responsibility, the researchers said.

This concern is not without precedent: voice simulation has existed for years and has been used for nefarious purposes before. Microsoft’s VALL-E model has similarly not been released to the public because its ability to simulate people’s voices creates a potential for misuse.

For now, the information Meta AI is sharing on Voicebox comes in the form of the announcement, audio samples and a research paper detailing the results it has achieved.

Image: Racool_studio/Freepik
