Make music with ACE-Step 1.5

In this example, we show you how you can run ACE Studio’s ACE-Step 1.5 music generation model on Modal.

ACE-Step 1.5 introduces a multi-model architecture: a DiT (Diffusion Transformer) handler for audio generation and an LM (Language Model) handler for prompt augmentation. The LM automatically enhances prompts, detects language, and generates metadata like BPM and key.

We’ll set up both a serverless music generation service and a web user interface.

Setting up dependencies 

We start by defining the environment our generation runs in. This takes some explaining since, like most cutting-edge ML environments, it is a bit fiddly.

This environment is captured by a container image, which we build step-by-step by calling methods to add dependencies, like apt_install to add system packages and uv_pip_install to add Python packages.

ACE-Step 1.5 uses a local path dependency (nano-vllm) in its package configuration, so we clone the repo first and install from the local directory. This lets uv resolve all dependencies together, including the CUDA-enabled PyTorch build and the local nano-vllm package.

In addition to source code, we’ll also need the model weights.

ACE-Step 1.5 integrates with the Hugging Face ecosystem, so setting up the models is straightforward. The model handlers use Hugging Face to download the weights if not already present.

We use a single checkpoints/ directory for all model downloads (both the DiT and LM models) and persist it with a Modal Volume. For more on storing model weights on Modal, see this guide.

We set the ACESTEP_PROJECT_ROOT environment variable so that the model handlers know where to find the checkpoints directory.

While we’re at it, let’s also define the environment for our UI. We’ll stick with Python and so use FastAPI and Gradio.

This is a totally different environment from the one we run our model in. Say goodbye to Python dependency conflict hell!

Running music generation on Modal 

Now, we write our music generation logic.

  • We make an App to organize our deployment.
  • We load the model at container startup, rather than on each request, with modal.enter, which requires that we structure our code as a Modal Cls.
  • In the app.cls decorator, we specify the Image we built and attach the Volume. We also pick a GPU to run on — here, an NVIDIA L40S.

We can then generate music from anywhere by running code like what we have in the local_entrypoint below.

You can execute it with a command like:
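Assuming the example lives in a file named something like ace_step_music.py (the filename and flag are hypothetical here), that would be:

```shell
modal run ace_step_music.py --prompt "a mellow lo-fi beat"
```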

Pass in --help to see options and how to use them.

Hosting a web UI for the music generator 

With the Gradio library, we can create a simple web UI in Python that calls out to our music generator, then host it on Modal for anyone to try out.

To deploy both the music generator and the UI, run
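Assuming a filename like ace_step_music.py (hypothetical here), that's:

```shell
modal deploy ace_step_music.py
```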