Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale


Voicebox is a state-of-the-art speech generative model built upon Meta's non-autoregressive flow matching model. It can perform monolingual and cross-lingual zero-shot text-to-speech synthesis, style conversion, transient noise removal, content editing, and diverse sample generation. Voicebox is trained on 60K hours of English data and 50K hours of multilingual data covering six languages. It is more flexible than autoregressive models and can generate speech up to 20x faster. To mitigate the risk of misuse, the researchers have built a highly effective classifier that distinguishes authentic speech from audio generated with Voicebox.


What is Voicebox?
Voicebox is a state-of-the-art speech generative model built upon Meta’s non-autoregressive flow matching model.

What tasks can Voicebox be used for?
Voicebox can be used for monolingual and cross-lingual zero-shot text-to-speech synthesis, style conversion, transient noise removal, content editing, and diverse sample generation.

What languages does Voicebox support?
Voicebox supports six languages: English, French, German, Spanish, Polish, and Portuguese.

How fast is Voicebox compared to state-of-the-art autoregressive models?
Voicebox generates speech up to 20x faster than state-of-the-art autoregressive models.

What measures are taken to mitigate the potential misuse of Voicebox?
To mitigate potential misuse, the researchers have built a highly effective classifier that distinguishes authentic speech from audio generated with Voicebox. In addition, the Voicebox model and code are not publicly available at this time.

AI Comments

👍 This article provides a comprehensive overview of Voicebox, a state-of-the-art speech generative model that can synthesize speech across six languages. It demonstrates how this model can be used for a variety of tasks, including content editing, style conversion, and diverse sample generation.

👎 Although the article outlines the potential of Voicebox, the model and code are not publicly available because of the risk of misuse. This limits the potential of this groundbreaking technology.

AI Discussion

Me: It's about a new AI technology called Voicebox that can generate speech in multiple languages, remove transient noise, edit content, transfer audio style within and across languages, and generate diverse speech samples. It's also up to 20x faster than state-of-the-art autoregressive models.

Friend: Wow, that's really cool! What are the implications of this technology?

Me: Well, Voicebox could change the way we interact with technology. For instance, it could be used to generate natural-sounding speech for virtual assistants and chatbots. It could also be used to create text-to-speech audio for educational videos, podcasts, and other audio content. It could help with spoken language translation too, since it can transfer audio style across languages. However, there are ethical considerations, because the technology could be misused. For example, it could be used to generate fake audio of people saying things they never said. To protect against misuse, the researchers have developed an effective classifier that can distinguish between authentic speech and audio generated with Voicebox.

Technical terms

Transient Noise Removal
The process of removing short-lived unwanted noise from a speech signal.
Content Editing
The process of making changes to a text or audio file.
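Zero-Shot TTS
Text-to-speech synthesis in the voice of a speaker not seen during training, given only a short audio sample of that speaker.
Cross-Lingual Zero-Shot TTS
Zero-shot text-to-speech synthesis in which the reference audio sample and the target text are in different languages.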
The process of selecting a subset of data from a larger dataset.
The ability to do something quickly and with minimal effort.
