Voicebox by Meta: Revolutionizing Speech Synthesis with Context-Based AI

By Eleni Murru

June 21, 2023

Meta has recently unveiled Voicebox, its new generative AI model that can create realistic speech clips from text. Meta claims it produces results up to 20 times faster than the latest AI models with comparable performance. Voicebox departs from the traditional text-to-speech (TTS) architecture and adopts an approach closer to that of chatbots like ChatGPT or Bard. One of its key differentiators from similar TTS models is its ability to generate speech through in-context learning, that is, by picking up a voice or speaking style from a short audio example rather than from task-specific training. This tool could be useful for people with visual impairments as well as for content creators who want to add voice to their projects.

Like ChatGPT and other transformer models, Voicebox relies on large-scale training datasets. Previous attempts to use extensive audio data resulted in severely degraded audio quality, so most TTS systems use smaller, highly curated, labeled datasets. Meta addresses this limitation with a novel training scheme that abandons labels and categorization in favor of an architecture that "fills in" missing audio information.
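
To make the "fill in" idea concrete, here is a minimal sketch in Python. It is not Meta's implementation: the article does not describe the actual training objective, so the function names (mask_span, infill_loss), the one-number-per-frame toy audio, and the mean squared error are all assumptions chosen for illustration. The sketch hides a span of frames, lets a stand-in "model" guess the hidden part from the surrounding context, and scores only the hidden positions.

```python
from typing import List, Tuple


def mask_span(frames: List[float], start: int, length: int,
              fill: float = 0.0) -> Tuple[List[float], List[bool]]:
    """Hide frames[start:start+length]; return the masked copy and a boolean mask."""
    if start < 0 or length < 0 or start + length > len(frames):
        raise ValueError("span must lie inside the sequence")
    mask = [start <= i < start + length for i in range(len(frames))]
    masked = [fill if hidden else value for value, hidden in zip(frames, mask)]
    return masked, mask


def infill_loss(predicted: List[float], target: List[float],
                mask: List[bool]) -> float:
    """Mean squared error computed only over the hidden positions (an assumed objective)."""
    errors = [(p - t) ** 2 for p, t, hidden in zip(predicted, target, mask) if hidden]
    return sum(errors) / len(errors) if errors else 0.0


# Toy "audio": one number per frame (a real system would use richer features).
audio = [0.1, 0.4, 0.9, 0.7, 0.2, 0.0]
context, hidden = mask_span(audio, start=2, length=2)

# A stand-in "model" that copies the visible context and guesses 0.5 for the gap.
guess = [0.5 if h else c for c, h in zip(context, hidden)]
loss = infill_loss(guess, audio, hidden)
print(f"loss on hidden frames: {loss:.3f}")
```

Because the loss only looks at the hidden frames, no per-clip label is needed: the surrounding audio itself serves as the training signal, which is the property the article attributes to Voicebox's training scheme.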

Voicebox: What Is the Generative AI Model For?

Voicebox is not yet available to the public, but Meta shared some details and demos of its capabilities in its press release on June 16th. According to Meta, Voicebox is the "first model systematically capable of generating language in tasks for which it hasn't been specifically trained, achieving state-of-the-art performance."

This means that Voicebox can convert text into speech and synthesize replacement speech to eliminate unwanted noise while keeping the original content and quality intact. It also handles six languages (English, French, Spanish, German, Polish, and Portuguese) and can mimic different voices and speaking styles based on a short audio sample. So far, it is safe to say that this is one of the most advanced text-to-speech technologies developed by Meta.

According to the U.S. giant, multipurpose AI models like Voicebox could provide a “natural” voice to virtual assistants or non-player characters in the metaverse. They could also allow visually impaired people to listen to written messages from friends, relatives, and coworkers, read by AI in a voice very similar to the sender's own. For creators, Voicebox could offer new tools to create and edit audio tracks for videos, for example. Voicebox is still in the experimental stage, but it has the potential to revolutionize various aspects of technology, from virtual assistants and gaming to accessibility and content creation.

The Potential Risks of Voicebox

This new tool comes at a time when moderation of online content is a hot topic for social media platforms. Voicebox is not the only tool of its kind, but it seems to be one of the most advanced. To prevent misuse and harm from fake or manipulated audio, Meta says it has built a classifier that can distinguish between real and Voicebox-generated speech.

As Meta states in its announcement: “We recognize that this technology brings the potential for misuse and unintended harm. In our paper, we detail how we built a highly effective classifier that can distinguish between authentic speech and audio generated with Voicebox to mitigate these possible future risks. We believe it is important to be open about our work so the research community can build on it and continue the important conversations we’re having about how to build AI responsibly, which is why we are sharing our approach and results in a research paper.”

Meta believes that artificial intelligence should be used responsibly. That is why it has shared its initial results in the field of generative AI, even though Voicebox is still experimental.