OpenAI is improving the technology used to replicate voices, and it says it's trying to do so responsibly as voice deepfakes become increasingly prevalent.
Today, OpenAI is launching a limited preview of Voice Engine, an expansion of its existing text-to-speech API. In development for roughly two years, Voice Engine lets users upload any 15-second voice sample to generate a synthetic copy of that voice. There's no date for public availability yet, giving the company time to respond to how the model is used and abused.
"We want to make sure everybody feels good about how it's being deployed — we understand the landscape of where this tech is dangerous and have mitigations in place for that," Jeff Harris, member of the product staff at OpenAI, told TechCrunch in an interview.
Training the model
The generative AI model powering Voice Engine has been hiding in plain sight for some time, Harris said.
The same model underpins the voices in ChatGPT, OpenAI's AI-powered chatbot, which can understand speech and has a "read aloud" function, as well as the preset voices available through OpenAI's text-to-speech API. And Spotify has been using it since early September to dub podcasts from high-profile hosts like Lex Fridman into various languages.
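Voice Engine itself has no public endpoint yet, but the text-to-speech API it extends is live. A minimal call with one of the documented preset voices, using OpenAI's official Python SDK, looks like this:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Generate speech with one of the preset voices the public API exposes.
response = client.audio.speech.create(
    model="tts-1",  # "tts-1-hd" is the higher-quality variant
    voice="alloy",  # one of the built-in preset voices
    input="Welcome back. Today we're talking about synthetic speech.",
)

# The response body is the audio itself; save it as an MP3.
with open("speech.mp3", "wb") as f:
    f.write(response.content)
```

Per Harris' description, Voice Engine layers a 15-second reference sample on top of this same text-in, audio-out pattern, standing in for a preset voice.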
So I asked Harris where the model's training data originated – still a sore subject. All he would say was that the Voice Engine model was trained on a mix of licensed and publicly available data.
Models such as the one powering Voice Engine are trained on enormous numbers of examples, in this case audio recordings, usually sourced from public sites and data sets around the web. Many generative AI vendors treat training data as a competitive advantage and keep it, and information pertaining to it, close to the chest. But training data details are also a potential source of IP-related lawsuits, another disincentive to reveal much.
OpenAI is being sued over allegations that the company violated IP law by training its AI on copyrighted content, including photos, artwork, code, articles, and e-books, without providing the creators or owners credit or pay.
OpenAI has licensing arrangements with some content providers, including photo service Shutterstock and news publisher Axel Springer, and also permits webmasters to instruct its web crawler not to scrape their sites for training data. Artists can also "opt out" of and remove their work from the data sets that the company uses to train its image-generating models, including its most recent, DALL-E 3.
But other OpenAI products have no opt-out policy. And last year, OpenAI told the U.K.'s House of Lords that it's "unfeasible" to build useful AI models without copyrighted content, arguing that fair use, the legal doctrine that allows the use of copyrighted works to make a secondary creation as long as it's transformative, shields it where model training is concerned.
Synthesizing voice
Interestingly, Voice Engine isn't trained or fine-tuned on user data. That's owed in part to the ephemeral way in which the model, a combination of a diffusion process and a transformer, generates speech.
"We take a small audio sample and text and generate realistic speech that matches the original speaker," said Harris. "The audio that's used is dropped after the request is complete."
As he explained, the model simultaneously analyzes the speech data it pulls from and the text meant to be read aloud, generating a matching voice without building a custom model for each speaker.
It's not new tech. Plenty of startups have sold voice cloning products for years, from ElevenLabs to Replica Studios to Papercup to Deepdub to Respeecher. And so have Big Tech incumbents such as Amazon, Google, and Microsoft, the last of which, incidentally, is a major OpenAI investor.
Harris said OpenAI's approach yields overall higher-quality speech.
We also know it will be priced aggressively. OpenAI removed Voice Engine's pricing from the marketing materials it published today, but in documents viewed by TechCrunch, Voice Engine is listed at $15 per one million characters, or roughly 162,500 words. That would fit Dickens' "Oliver Twist" with a little room to spare. (An "HD" quality option costs twice that, but confusingly, an OpenAI spokesperson told TechCrunch that there's no difference between HD and non-HD voices. Make of that what you will.)
That translates to approximately 18 hours of audio, putting the price somewhere south of $1 per hour. It's actually cheaper than what one popular rival vendor, ElevenLabs, charges: $11 per 100,000 characters per month.
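The back-of-the-envelope math behind those figures, under rough assumptions (about six characters per English word including spaces, and a narration pace of 150 words per minute) that are ours rather than OpenAI's:

```python
# Convert Voice Engine's per-character price into a per-hour price.
# Chars-per-word and words-per-minute are rough assumptions, not
# figures from OpenAI.
PRICE_PER_MILLION_CHARS = 15.00  # USD, from the documents viewed by TechCrunch
CHARS_PER_WORD = 6.15            # avg English word incl. trailing space
WORDS_PER_MINUTE = 150           # typical narration pace

words = 1_000_000 / CHARS_PER_WORD      # ~162,600 words
hours = words / WORDS_PER_MINUTE / 60   # ~18 hours of audio

print(f"{words:,.0f} words ≈ {hours:.1f} hours of audio")
print(f"≈ ${PRICE_PER_MILLION_CHARS / hours:.2f} per hour")
```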
It does come at the expense of some customization, though. Voice Engine offers no controls to adjust the tone, pitch, or cadence of a voice. In fact, it doesn't offer any fine-tuning knobs or dials at the moment, though Harris says any expressiveness in the 15-second voice sample will carry over into subsequent generations. For example, if you speak in an excited tone, the resulting synthetic voice will sound consistently excited. We'll see how the quality of the reading compares with other models once they can be compared head to head.
Voice talent as commodity
According to ZipRecruiter, voice actors earn between $12 and $79 per hour, far more expensive than Voice Engine even on the low end (actors with agents command a much higher price per project). Were it to catch on, OpenAI's tool could commodify voice work. So where does that leave actors?
The talent industry wouldn't be caught unawares, exactly; it's been grappling with the existential threat of generative AI for some time. Voice actors are increasingly being asked to sign away rights to their voices so that clients can use AI to generate synthetic versions that could eventually replace them. Voice work, particularly cheap, entry-level work, is in danger of being eliminated because AI can produce speech just as cheaply.
Now, some AI voice platforms are trying to strike a middle ground.
Replica Studios signed a rather contentious agreement last year with SAG-AFTRA to make and license copies of the media artist union members' voices. The organizations said the arrangement established fair and ethical terms and conditions to ensure performer consent while negotiating terms for uses of synthetic voices in new works, including video games.
ElevenLabs hosts a marketplace for synthetic voices where users can create a voice, verify it, and share it publicly. When others use that voice, the original creators receive compensation: a set dollar amount per 1,000 characters.
OpenAI isn't establishing any such labor union deals or marketplaces, at least not in the near term. It only requires that users obtain "explicit consent" from the people whose voices are cloned, make clear disclosures indicating which voices are AI-generated, and agree not to use the voices of minors, deceased people, or political figures in their generations.
"How this intersects with the voice actor economy is something we're watching really closely and very curious about," Harris said. "I think there's going to be lots of opportunity to kind of scale your reach as a voice actor through this kind of technology. But this is all stuff we are going to learn as people actually deploy and play with the tech a little bit.".
Ethics and deepfakes
Voice cloning applications can be, and have been, abused in ways that go well beyond threatening actors' livelihoods.
The infamous trolling message board 4chan has already used ElevenLabs' platform to spread hateful messages impersonating celebrities like Emma Watson. The Verge's James Vincent was able to use AI tools to maliciously, and quickly, clone voices, generating samples containing everything from violent threats to racist and transphobic remarks. And over at Vice, reporter Joseph Cox documented generating a voice clone convincing enough to fool a bank's authentication system.
The thought of what bad actors could do with voice cloning during an election is enough to send chills down the spine. The fear is well founded, too: this past January, a phone campaign deployed a deepfaked President Biden to discourage New Hampshire citizens from voting, prompting the FCC to move to bar similar campaigns in the future.
So, setting aside banning deepfakes at the policy level, what is OpenAI doing to keep Voice Engine from being misused? Harris mentioned a few things.
First, Voice Engine is only being made available to an exceptionally small group of developers, around 10, to start. OpenAI is prioritizing use cases that are "low risk" and "socially beneficial," Harris says, such as healthcare and accessibility, in addition to experimenting with "responsible" synthetic media.
Among the notable early adopters of Voice Engine: Age of Learning, an edtech company using the tool to generate voice-overs from previously cast actors; HeyGen, a storytelling app leveraging Voice Engine for translation; Livox and Lifespan, which are using Voice Engine to create voices for people with speech impairments and disabilities; and Dimagi, which is building a Voice Engine-based tool to give feedback to health workers in their primary languages.
Second, clones created with Voice Engine are watermarked using a method developed by OpenAI that embeds inaudible identifiers in recordings. Other vendors, including Resemble AI and Microsoft, employ similar watermarks. Harris wouldn't guarantee that there's no way to circumvent the watermark, but described it as "tamper resistant."
"If there is an audio clip that exists, it's pretty simple for us to look at that clip and determine that it was created by our system and the developer that actually did that creation," Harris said. "So far, it isn't open sourced we have it internally for now. We're curious about making it publicly available but obviously, that comes with added risk related to exposure and breaking it."
Third, OpenAI will open Voice Engine to members of its contracted red teaming network, a group of experts that help inform the company's AI model risk assessment and mitigation strategies, to sniff out malicious uses.
Some experts argue that AI red teaming isn't exhaustive and that it's incumbent on vendors to develop tools to defend against the harms their AI might cause. OpenAI isn't going quite that far with Voice Engine, but Harris asserts that the company's "top principle" is releasing the technology safely.
General release
Depending on how the preview goes and the public reception to Voice Engine, OpenAI might open the tool up to its wider developer base, but for now, the company isn't ready to commit to that.
Harris did give a glimpse of Voice Engine's roadmap, though, revealing that OpenAI is testing a security mechanism that has users read randomly generated text as proof that they're present and aware of how their voice is being used. That could give OpenAI the confidence it needs to bring Voice Engine to more people, Harris said, or it might just be the beginning.
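OpenAI hasn't detailed how that check would work. Conceptually it resembles a challenge-response liveness test: issue an unpredictable phrase, have the speaker read it aloud, transcribe the audio, and confirm the transcript matches. A minimal sketch, with an invented word list and the transcript assumed to come from a separate speech-to-text step:

```python
import difflib
import secrets

# Hypothetical word pool; any large, unambiguous vocabulary would do.
WORDS = ["orchid", "granite", "velvet", "harbor", "lantern", "meadow",
         "copper", "drizzle", "quartz", "saffron", "timber", "willow"]

def make_challenge(n_words: int = 6) -> str:
    """Produce an unpredictable phrase the speaker must read aloud,
    so a pre-recorded clip can't be replayed to pass the check."""
    return " ".join(secrets.choice(WORDS) for _ in range(n_words))

def verify_reading(challenge: str, transcript: str, threshold: float = 0.85) -> bool:
    """Compare a speech-to-text transcript of the submitted sample
    against the challenge. A real system would also check that the
    sample's voice matches the voice being cloned."""
    ratio = difflib.SequenceMatcher(
        None, challenge.lower(), transcript.lower().strip()
    ).ratio()
    return ratio >= threshold

challenge = make_challenge()
print("Read aloud:", challenge)
# In practice the transcript comes from transcribing the user's audio.
print(verify_reading(challenge, challenge))  # True for a perfect transcript
```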
"What's going to keep pushing us forward in terms of the actual voice matching technology is really going to depend on what we learn from the pilot, the safety issues that are uncovered and the mitigations that we have in place," he said. "We don't want people to be confused between artificial voices and actual human voices."
And on that last point we can agree.