Meta, the parent company of Facebook and Instagram, has announced a new generative artificial intelligence model for speech. Called Voicebox, it is designed to assist creators by performing speech generation tasks such as audio editing, sampling, and styling, even though it was not specifically trained for them.
Meta says the new AI model will benefit many people around the world. The company gives several examples, such as helping visually impaired people hear text messages from friends read aloud in those friends' voices. It could also enable people to speak foreign languages in their own voices.
The AI model can produce high-quality audio clips and edit pre-recorded audio to remove unwanted noises such as car horns. It can also generate speech in six languages while preserving content and style. In the future, the model is expected to give natural-sounding voices to virtual assistants and to non-player characters in games in the metaverse.
Meta compared Voicebox with other speech AI models, specifically citing VALL-E and YourTTS as competitors. According to Meta, Voicebox outperforms both models on word error rate and style similarity.
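For readers unfamiliar with the first of those metrics: word error rate is usually measured by transcribing the generated speech with a speech recognizer, counting the word-level substitutions, deletions, and insertions needed to turn that transcript into the intended text, and dividing by the number of words in the intended text. The Python sketch below illustrates this standard definition only. The function name is illustrative, the lowercasing and whitespace splitting are simplifying assumptions, and it is not Meta's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: simple lowercasing and whitespace tokenization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    intended = "the cat sat on the mat"
    transcribed = "the cat sat on a mat"
    print(f"WER: {word_error_rate(intended, transcribed):.2f}")
```

Under this definition, a transcript with one wrong word out of six scores about 0.17, and lower is better.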
Voicebox is built on Flow Matching, Meta’s newest non-autoregressive generative modeling approach, which can learn a highly non-deterministic mapping between text and speech. So far, Voicebox has been trained on more than 50,000 hours of recorded speech and transcripts from publicly available audiobooks in English, French, Spanish, German, Polish and Portuguese.
Meta will not make the artificial intelligence program available to everyone, nor will it share its source code.