Meta has unveiled a new AI tool, dubbed ‘Voicebox’, which it claims represents a breakthrough in AI-powered speech generation. However, the company won’t be unleashing it on the public just yet – because doing so could be disastrous.
Voicebox is currently able to produce audio clips of speech in six languages (all of which are European in origin), and – according to a blog post from Meta – is the first AI model of its kind capable of completing tasks beyond what it was ‘specifically trained to accomplish’. Meta claims that Voicebox handily outperforms competing speech-generation AIs in virtually every area.
So what exactly is it capable of? Well, for starters, it can spew out reasonably accurate text-to-speech replications of a person’s voice using a sample audio file as short as two seconds, a seemingly innocuous ability that holds a huge amount of destructive potential in the wrong hands.
The dubious power of AI
Even setting aside the dodgy stuff that creeps on the internet have been doing with ChatGPT and other AI tools (Voicebox certainly sounds like it could be a boon for anyone making fake revenge porn), this is the sort of technology that could quite literally start a war.
After all, most major public figures, including politicians, have plenty of audio recordings floating around the internet. It wouldn’t be hard to collate some speech clips of an incumbent political leader and use Voicebox to produce a startlingly realistic replication of their voice – something that could then be used for nefarious purposes.
Such tools exist already, of course, but they’re less convincing; you may have seen amusing videos on social media featuring the likes of Joe Biden, Donald Trump, and Barack Obama supposedly playing Fortnite together. They’re good for a laugh, but the audio is hardly realistic. It mimics the mannerisms of each presidential gamer enough that they’re recognizable, but not so well that anyone with a brain would actually believe it’s them.
Meta clearly believes its new tool is good enough to fool at least the majority of people, though – since it’s explicitly not releasing Voicebox to the public, but instead publishing a research paper and detailing a classifier tool that can distinguish Voicebox-generated speech from real human speech. Meta describes the classifier as “highly effective” – though notably not perfectly effective.
Of course, while Meta is keen to stress that it recognizes the “potential for misuse and unintended harm” surrounding tools like Voicebox, it’s important not to lose sight of the potential benefits AI speech generation could have in the future.
Voicebox – befitting its name – could provide far more naturalistic speech to people who are mute or otherwise unable to communicate, removing some of the barriers to interaction caused by the existing text-to-speech ‘robot voice’ made famous by physicist Stephen Hawking. It could also perform real-time translation, bringing us one step closer to the sort of ‘universal translator’ devices that currently exist only in science fiction.
There are other applications too; smaller, but no less useful. Meta explains in its blog post that Voicebox can be used to edit and improve recorded speech. If you’ve recorded some audio but you mispronounced a word or were interrupted by background noise, Voicebox can isolate the offending segment and ‘re-record’ a snippet of speech using your voice. Impressive, and only slightly terrifying.
In any case, it’s good to see Meta taking a serious, considered approach here. Microsoft’s frantic eagerness to shove Bing AI into everything has landed it in hot water more than once, and OpenAI unleashing ChatGPT on the world has led to all sorts of weirdness over the past year. We’re in an AI gold rush, and these tools are making their way into every part of our lives.
A little caution, patience, and respect for the magnitude of this technology is a welcome sight – although I doubt Meta will sit on Voicebox for too long, since the shareholders will surely be wondering how much money it can make them.