The day is fast approaching when generative AI won't just write text and create images in a convincingly human-like style, but compose music and sounds that pass for a professional's work, too.
This morning, Meta announced AudioCraft, a framework to generate what it describes as "high-quality," "realistic" audio and music from short text descriptions, or prompts. It isn't the social networking giant's first foray into audio generation (the company open-sourced an AI-powered music generator, MusicGen, just last month), but Meta claims it has made advances that vastly improve the quality of AI-generated sounds, such as dogs barking, cars honking and footsteps on a wooden floor.
Meta says it developed the AudioCraft framework to make generative audio models more accessible and easier to work with than previous efforts in the field, according to a blog post shared with TechCrunch. Released as open source, AudioCraft is a suite of sound and music generators plus compression algorithms that lets users generate and encode songs and audio without bouncing between different codebases.
AudioCraft contains three models: MusicGen, AudioGen and EnCodec.
MusicGen isn't new. But Meta's just released the training code for it, enabling users to train the model on their own dataset of music.
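For readers who want to poke at the released code, generating a clip with a pretrained MusicGen checkpoint looks roughly like the minimal sketch below, based on the open-source audiocraft package; the checkpoint name, prompt and output filename here are illustrative, and the exact API may vary between releases.

    # Minimal sketch using Meta's open-source audiocraft package (API may differ by release).
    from audiocraft.models import MusicGen
    from audiocraft.data.audio import audio_write

    model = MusicGen.get_pretrained("facebook/musicgen-small")      # pretrained checkpoint (illustrative name)
    model.set_generation_params(duration=8)                         # length of the clip, in seconds
    wav = model.generate(["upbeat acoustic folk with hand claps"])  # one waveform per text prompt
    audio_write("musicgen_demo", wav[0].cpu(), model.sample_rate, strategy="loudness")

The newly released training code lives alongside this inference path in the same repository, which is what makes fine-tuning on a custom music dataset possible in the first place.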
That could raise major ethical and legal issues, considering MusicGen "learns" from existing music to produce similar effects, a fact not all artists or generative AI users are comfortable with.
Increasingly, homemade tracks that rely on generative AI to conjure up familiar sounds that can be passed off as authentic, or close enough, have gone viral. Music labels have moved quickly to flag them to their streaming partners, citing intellectual property concerns, and they've generally come out on top. But it's still unclear whether so-called "deepfake" music infringes the copyright of artists, labels and other rights holders.
Meta says that, in the interest of transparency, it trained the pretrained, out-of-the-box version of MusicGen on "Meta-owned and specifically licensed music": 20,000 hours of audio, comprising 400,000 recordings along with text descriptions and metadata, drawn from the company's own Meta Music Initiative Sound Collection, Shutterstock's music library and Pond5, a large stock media library. Meta also removed vocals from the training data so the model doesn't reproduce artists' voices. And while MusicGen's terms of use warn against "out-of-scope" use cases beyond research, Meta hasn't outright forbidden commercial applications.
The other generative model in AudioCraft, AudioGen, focuses on producing environmental sounds and effects rather than music and melodies.
AudioGen isn't a diffusion-based model like most modern image generators (see OpenAI's DALL-E 2, Google's Imagen and Stable Diffusion), which learn to gradually subtract noise from starting data made entirely of noise, nudging it step by step toward the target prompt. Instead, it works more like a language model: a transformer predicts a sequence of compressed, discrete audio tokens conditioned on the text, and those tokens are then decoded back into a waveform.
Given a text description of an acoustic scene, AudioGen can generate environmental sounds with "realistic recording conditions" and "complex scene content." Or so Meta says; we weren't given the chance to try AudioGen or hear its samples ahead of the model's rollout. The white paper published alongside AudioGen this morning describes other capabilities that should interest the model's intended users, including its ability to generate speech, as well as music, from prompts, a product of the mixed makeup of its diverse training data.
In the white paper, Meta acknowledges the risk of AudioGen being misused to deepfake a person's voice. And given its generative music capabilities, the model raises the same sort of questions as MusicGen. But again, as with MusicGen, Meta isn't putting much in the way of restrictions on how AudioGen, or its training code, can be used, for better or worse.
The third model in AudioCraft, EnCodec, is an improved version of an earlier Meta model for generating music with fewer artifacts. Meta claims it models audio sequences more efficiently and captures different levels of information from the training waveforms, which helps in crafting novel audio.
"EnCodec is a lossy neural codec that was trained specifically to compress any kind of audio and reconstruct the original signal with high fidelity," Meta explains in the blog post. "The different streams capture different levels of information of the audio waveform, allowing us to reconstruct the audio with high fidelity from all the streams."
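To make the codec idea a bit more concrete, compressing a clip into those parallel token streams and reconstructing it looks roughly like the minimal sketch below, which uses the standalone encodec package Meta released before AudioCraft; the input filename is a placeholder, and the codec bundled with AudioCraft may expose a different interface.

    # Hedged sketch using Meta's standalone encodec package; "input.wav" is a placeholder
    # and the AudioCraft-bundled version of the codec may differ.
    import torch
    import torchaudio
    from encodec import EncodecModel
    from encodec.utils import convert_audio

    model = EncodecModel.encodec_model_24khz()
    model.set_target_bandwidth(6.0)  # target bitrate in kbps; higher means more code streams

    wav, sr = torchaudio.load("input.wav")
    wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

    with torch.no_grad():
        encoded_frames = model.encode(wav)                          # list of (codes, scale) chunks
        codes = torch.cat([c for c, _ in encoded_frames], dim=-1)   # shape [batch, streams, time]
        reconstruction = model.decode(encoded_frames)               # lossy, high-fidelity waveform

Each row of that codes tensor is one of the "streams" Meta describes, and Meta's generators predict tokens in this kind of compressed space rather than raw audio samples.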
So what does this all mean for AudioCraft? Meta touts the potential benefits, unsurprisingly: providing inspiration for musicians and helping people iterate on their compositions in new ways. But as experience with image and text generators has taught us, there are downsides, and likely lawsuits, lurking in the shadows.
Consequences be damned, Meta says it intends to continue researching better controllability and ways to improve the performance of generative audio models, while working to overcome their limitations and biases. On the subject of biases, Meta notes that MusicGen does poorly with descriptions in languages other than English and with musical styles and cultures that aren't Western, owing to very obvious biases in its training data.
"Rather than keeping the work as an impenetrable black box, being open about how we develop these models and ensuring that they're easy for people to use — whether it's researchers or the music community as a whole — helps people understand what these models can do, understand what they can't do, and be empowered to actually use them," Meta writes in the blog post. In the development of even more sophisticated controls, it would be hoped that such models may eventually be useful to both music amateurs and professionals.