Voice cloning systems have been around for a long time, but accurate, realistic cloning used to require a large amount of audio data and weeks of editing. Artificial intelligence first reduced the whole process to a few hours, and now a few seconds of voice data is enough to clone someone’s voice. Meta’s Voicebox can also remove car horns, barking dogs and similar background noise from audio clips.
To be used for the visually impaired
Of course, Voicebox isn’t designed with harm in mind. According to Meta, it could help visually impaired people hear text messages from friends and family read aloud in the sender’s voice. Meta states that Voicebox is multilingual and can synthesize speech in six languages, so users can speak any of those languages in their own voice. The six languages are English, French, German, Spanish, Polish and Portuguese.
So how does it work? The idea is fairly simple. A user gives Voicebox a sample of their voice, which can be a clip as short as two seconds. From this sample, the model infers the characteristics of the speaker’s voice and generates realistic speech in that style.
On the other hand, Voicebox has already raised some important ethical questions, because people will be able to imitate the voices of loved ones, best friends and even enemies from a sample as short as two seconds. Such technology could have serious unintended consequences. To take one simple example, the voice response systems used by banks could be fooled.
Meta is aware of the potential dangers of such a technology and fortunately keeps Voicebox’s core code secret. “There are many exciting uses for generative speech models, but we are not making the Voicebox model or code publicly available at this time due to potential abuse risks,” the company wrote in its research blog.