Run Meta's Audio Generation AI model, AudioGen, on macOS with MPS (GPU)

Meta, the company behind Facebook, released AudioCraft – an AI capable of generating music and sound effects from English text. The initial version, v0.0.1, dropped in June 2023, followed by few revisions and the latest (as of now writing this) v1.3.0 in May 2024. The best part? You can run it locally for free!

However, there’s a catch: official support is limited to NVIDIA GPUs or CPUs. macOS users are stuck with CPU-only execution. Frustrating, right?

After much research and experimentation, I discovered a way to speed up the generation process for AudioGen, AudioCraft’s sound effects generator, by leveraging Apple Silicon’s GPU – MPS (Metal Performance Shaders)!

In this article, I’ll share my findings and guide you through the steps to unlock faster audio generation on your Mac.

AudioCraft: https://ai.meta.com/resources/models-and-libraries/audiocraft

GitHub: https://github.com/facebookresearch/audiocraft

Contents

1 Notes
2 Environment Setup
3 Sample Code
4 Usage
5 MPS cannot be used with MusicGen or MAGNeT.

Notes

While AudioCraft’s code is released under the permissive MIT license, it’s important to note that the model weights (the pre-trained files downloaded from Hugging Face) are distributed under the CC-BY-NC 4.0 license, which prohibits commercial use. Therefore, be mindful of this restriction if you plan to publicly share any audio generated using AudioCraft.

AudioCraft also includes MusicGen, a model for generating music, as well as MAGNeT, a newer, faster, and supposedly higher-performing model. Unfortunately,
I wasn’t able to get these models running with MPS.

While development isn’t stagnant, there are a few open issues on GitHub, hinting at possible future official support. However, even though you can run AudioCraft locally for free, unlike platforms like Stable Audio which offer commercial licenses for a fee, it seems unlikely that any external forces besides the passionate efforts of open-source programmers will drive significant progress. So, let’s manage our expectations!

Environment Setup

Confirmed Working Environment

macOS: 14.5
ffmpeg version 7.0.1

Setup Procedure

Install ffmpeg if not installed yet. You need brew installed.

brew install ffmpeg

Create a directory and clone the AudioCraft repository. Choose your preferred directory name.

mkdir AudioCraft_MPS
cd AudioCraft_MPS
git clone https://github.com/facebookresearch/audiocraft.git .

Set up a virtual environment. I prefer pipenv, but feel free to use your favorite. Python 3.9 or above is required.

pipenv --python 3.11
pipenv shell

Install PyToch with a specific version 2.1.0.

pip install torch==2.1.0

Set xformer’s version to 0.0.20 in requirements.txt. MPS doesn’t support xformers, but this was the easiest workaround. The example below uses vim, but feel free to use your preferred text editor.

vi requirements.txt
#xformer<0.0.23
xformers==0.0.20

Install everything, and the environment is set up!

pip install -e .

Edit one file to use MPS for generation.

Modify the following file to use MPS only for encoding:

audiocraft/models/encodec.py

The line numbers might vary depending on the version of the cloned repository, but the target is the decode() method within the class EncodecModel(CompressionModel):. Comment out the first out = self.decoder(emb) in the highlighted section and add the if~else block below it.

    def decode(self, codes: torch.Tensor, scale: tp.Optional[torch.Tensor] = None):
        """Decode the given codes to a reconstructed representation, using the scale to perform
        audio denormalization if needed.

        Args:
            codes (torch.Tensor): Int tensor of shape [B, K, T]
            scale (torch.Tensor, optional): Float tensor containing the scale value.

        Returns:
            out (torch.Tensor): Float tensor of shape [B, C, T], the reconstructed audio.
        """
        emb = self.decode_latent(codes)
        #out = self.decoder(emb)
        # Below if block is added based on https://github.com/facebookresearch/audiocraft/issues/31
        if emb.device.type == 'mps':
            # XXX: Since mps-decoder does not work, cpu-decoder is used instead
            out = self.decoder.to('cpu')(emb.to('cpu')).to('mps')
        else:
            out = self.decoder(emb)

        out = self.postprocess(out, scale)
        # out contains extra padding added by the encoder and decoder
        return out

The code mentioned above was written by EbaraKoji (whose name suggests he might be Japanese?) from the following issue. I tried using his forked repository, but unfortunately, it didn’t work for me.

https://github.com/facebookresearch/audiocraft/issues/31#issuecomment-1705769295

Sample Code

This code below is slightly modified from something found elsewhere. Let’s put it in the demos directory along with other executable demo codes.

from audiocraft.models import AudioGen
from audiocraft.data.audio import audio_write
import argparse
import time

model = AudioGen.get_pretrained('facebook/audiogen-medium', device='mps')
model.set_generation_params(duration=5)  # generate [duration] seconds.

start = time.time()
def generate_audio(descriptions):
  wav = model.generate(descriptions)  # generates samples for all descriptions in array.
  
  for idx, one_wav in enumerate(wav):
      # Will save under {idx}.wav, with loudness normalization at -14 db LUFS.
      audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness", loudness_compressor=True)
      print(f'Generated {idx}.wav.')
      print(f'Elapsed time: {round(time.time()-start, 2)}')

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Generate audio based on descriptions.")
    parser.add_argument("descriptions", nargs='+', help="List of descriptions for audio generation")
    args = parser.parse_args()
    
    generate_audio(args.descriptions)

The key part is device='mps' on line 6. This instructs it to use the GPU for generation. Changing it to 'cpu' will make generation slower but won’t consume as much memory. Also, there is another pre-trained smaller audio model facebook/audiogen-small available, (I haven’t tested this one).

Usage

Note: The first time you run it, the pre-trained audio model will be downloaded, which may take some time.

You can provide the desired sound in English as arguments, and it will generate audio files named 0.wav, 1.wav,…. The generation speed doesn’t increase much whether you provide one or multiple arguments, so I recommend generating several at once.

python demos/audiogen_mps_app.py "text 1" "text 2"

Example:

python demos/audiogen_mps_app.py "heavy rain with a clap of thunder" "knocking on a wooden door" "people whispering in a cave" "racing cars passing by"

/Users/handsome/Documents/Python/AudioCraft_MPS/.venv/lib/python3.11/site-packages/torch/nn/utils/weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
Generated 0.wav.
Elapsed time: 53.02
Generated 1.wav.
Elapsed time: 53.08
Generated 2.wav.
Elapsed time: 53.13
Generated 3.wav.
Elapsed time: 53.2

On an M2 Max with 32GB RAM, starting with low memory pressure, a 5-second file takes around 60 seconds to generate, and a 10-second file takes around 100 seconds.

There’s a warning that appears right after running it, but since it works, I haven’t looked into it further. You can probably ignore it as long as you don’t
upgrade the PyTorch (torch) version.

MPS cannot be used with MusicGen or MAGNeT.

I tried to make MusicGen work with MPS using a similar approach, but it didn’t succeed. It does run on CPU, so you can try the GUI with python demos/musicgen_app.py.

MAGNeT seems to be a more advanced version, but I couldn’t get it running on CPU either. Looking at the following issue and the linked commit, it appears that it might work. However, I was unsuccessful in getting it to run myself.

https://github.com/facebookresearch/audiocraft/issues/396

So, that concludes our exploration for now.

Image by Stable Diffusion (Mochi Diffusion)
This part, which I’ve been writing at the end of each article, will now only be visible to those who open this specific title. It’s not very relevant to the main content.
This time, it generated many good images with a simple prompt. I chose the one that seemed least likely to trigger claustrophobia.

Date:
2024-7-22 1:52:43

Model:
realisticVision-v51VAE_original_768x512_cn

Size:
768 x 512

Include in Image:
future realistic image of audio generative AI

Exclude from Image:

Seed:
751124804

Steps:
20

Guidance Scale:
20.0

Scheduler:
DPM-Solver++

ML Compute Unit:
All

Run Meta’s Audio Generation AI model, AudioGen, on macOS with MPS (GPU)