Fine-Tuning with MusicGen

Do you want to create music in a certain style? MusicGen Fine-tune can help you with that. The fine-tuning mechanism was created by Jongmin Jung (aka Sake). You can easily run your fine-tuned model from the web or through the cloud API.

This tool allows you to fine-tune MusicGen on your desired style, be it nostalgic 16-bit video game chiptunes or the serene ambiance of choral compositions.


Fast Track to Mastery

In just 15 minutes, you can fine-tune MusicGen using the powerful 8x A40 (Large) hardware.

The flexibility to run your fine-tuned model from the web, cloud API, or through downloaded weights opens up a world of possibilities.

The Maestro Behind the Scenes

Jongmin Jung’s fine-tuning process is rooted in Meta’s AudioCraft and its built-in trainer, Dora. To simplify the training process, Sake integrated automatic audio chunking, auto-labeling, and vocal-removal features. Now, your trained model can compose music beyond the conventional 30 seconds.

How to fine-tune MusicGen AI?

Building Your Dataset

With just 9-10 tracks, you can fine-tune MusicGen to emulate your chosen musical style. Ensure each track exceeds 30 seconds, and the training script will seamlessly handle the rest, automatically dividing lengthy audio files into 30-second chunks.

Labeling Your Symphony

MusicGen offers three labeling options:

  1. Automatic labeling using Essentia, capturing genre, mood, theme, instrumentation, key, and BPM.
  2. A single description for all tracks, set via the one_same_description training parameter.
  3. Personalized descriptions for each audio file, which require a text file with the same filename as the corresponding track (see the sketch after this list).
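
If you go with option 3, a text file sits next to each track. Here is a minimal sketch, assuming the description file keeps the track’s base name and swaps the audio extension for .txt; the file names and descriptions are placeholders:

```python
from pathlib import Path

# A minimal sketch of per-track labeling (option 3). File names and
# descriptions are placeholders; the .txt naming convention is an assumption
# based on "same filename as the corresponding track".
descriptions = {
    "track01.wav": "lo-fi chiptune, nostalgic, 8-bit arpeggios, 110 bpm",
    "track02.wav": "serene choral piece, slow, reverberant, C major",
}
for audio_name, text in descriptions.items():
    Path(audio_name).with_suffix(".txt").write_text(text)
```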

Silencing the Vocals

MusicGen works best with instrumental tracks. The base models are vocal-free, so removing vocals from your dataset ensures your fine-tuned model doesn’t produce peculiar outputs.

You can disable vocal removal by setting drop_vocals to false in your training parameters if your tracks are vocal-free or if you want to experiment with vocal-inclusive training.

Choosing Your Sonic Palette

Selecting the Model

You have the choice of training the small, medium, or melody models. The small model is the default, while the large model is not available for training.

The melody model, tailored for melodic input, is limited to 30 seconds but brings a unique dimension to your fine-tuned creation.

Tokenizing Your Dreams

Before diving into the fine-tuning process, obtain your Replicate API token from replicate.com/account/api-tokens and store it as an environment variable called REPLICATE_API_TOKEN.
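
A minimal sketch of that step in Python, assuming you export the token in your shell and just want to confirm it is visible to the client before training:

```python
import os

# The Replicate client reads the token from this environment variable.
# Export it in your shell (e.g. in your shell profile); this check only
# verifies that it is set before you start a training run.
assert os.environ.get("REPLICATE_API_TOKEN"), "Set REPLICATE_API_TOKEN first"
```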

Crafting Your Symphony

Creating a Model

Visit replicate.com/create to fashion a model on Replicate, the destination for your refined MusicGen version. Give it a unique name, like my-name/my-model.
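
If you prefer to stay in code, the replicate Python client can create the model too. A hedged sketch; the owner, name, visibility, and hardware values are placeholders for whatever fits your account:

```python
import replicate

# Create the destination model that will hold the fine-tuned weights.
# Owner, name, visibility, and hardware are placeholder choices.
model = replicate.models.create(
    owner="my-name",
    name="my-model",
    visibility="private",
    hardware="gpu-a40-large",
)
print(model.url)
```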

Uploading Your Masterpiece

Zip your tracks and any text files, and either upload them as part of your Replicate CLI command or share the publicly accessible URL if using alternative methods.
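
A minimal sketch of the zipping step, assuming your audio files and any optional .txt descriptions live in one directory (the directory and archive names are placeholders):

```python
import zipfile
from pathlib import Path

# Bundle audio files and any matching .txt descriptions into one archive.
# "my_tracks" and "dataset.zip" are placeholder names.
with zipfile.ZipFile("dataset.zip", "w") as zf:
    for path in sorted(Path("my_tracks").iterdir()):
        if path.is_file():
            zf.write(path, arcname=path.name)
```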

Initiating the Training

Fire up the training process with the Python client or the Replicate CLI, specifying the trainer version, dataset path, and destination model. Fine-tune MusicGen with the default parameters or customize them using the settings listed at the end of this guide.
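
As a sketch with the Python client: the trainer reference and version hash below are placeholders (copy the current one from the fine-tuner’s page on Replicate), and the dataset URL and destination are examples, not fixed values:

```python
import replicate

# Kick off a fine-tune run. The trainer reference, dataset URL, and
# destination are placeholders; defaults are used for everything else.
training = replicate.trainings.create(
    version="<fine-tuner-owner>/<fine-tuner-model>:<version-hash>",
    input={
        "dataset_path": "https://example.com/dataset.zip",
        "model_version": "small",
    },
    destination="my-name/my-model",
)
print(training.id, training.status)
```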

Monitoring the Crescendo

Track the training progress on replicate.com/trainings or programmatically inspect the training. Keep an eye on the status and the last few lines of logs for a comprehensive overview.
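
Programmatic inspection might look like the following sketch, where the training id is a placeholder for the one returned when you started the run:

```python
import replicate

# Fetch a training run by id and print its status plus the last few log lines.
# "<training-id>" is a placeholder for the id returned by trainings.create().
training = replicate.trainings.get("<training-id>")
print(training.status)
if training.logs:
    print("\n".join(training.logs.splitlines()[-5:]))
```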

Running the Model

Once the model completes its training, unleash it on the web or through the API. Use the power of your refined MusicGen by providing a prompt that aligns with your trained descriptions. Experiment with prompts to discover the nuances of your personalized music generator.
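
A hedged sketch of calling the finished model from Python; the model name, version hash, and prompt are placeholders, and the exact output format can vary, so treat the print as illustrative:

```python
import replicate

# Generate audio from the fine-tuned model. The model reference and prompt are
# placeholders; use descriptions close to the ones the model was trained on.
output = replicate.run(
    "my-name/my-model:<version-hash>",
    input={"prompt": "16-bit video game chiptune, upbeat, 120 bpm"},
)
print(output)  # typically a URL to the generated audio
```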

Mastering the Harmony: All Fine-Tune Settings

Here’s a quick overview of the parameters that grant you control over your trained model:

| Setting | Description |
| --- | --- |
| dataset_path | URL pointing to your zip or audio file |
| one_same_description | A description for all audio data (default: none) |
| auto_labeling | Automatic creation of label data like genre, mood, theme, etc. (default: true) |
| drop_vocals | Drops vocal tracks from audio files in the dataset (default: true) |
| model_version | The model version to train – choices are “melody”, “small”, “medium” (default: “small”) |
| lr | Learning rate (default: 1) |
| epochs | Number of epochs to train for (default: 3) |
| updates_per_epoch | Number of iterations for one epoch (default: 100) |
| batch_size | Batch size, must be a multiple of 8 (default: 16) |
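
Put together, a customized training call might look like this sketch; the trainer reference, dataset URL, and destination are placeholders, and the values simply exercise the settings from the table above:

```python
import replicate

# Override the defaults from the settings table. The trainer reference,
# dataset URL, and destination are placeholders; values are examples only.
training = replicate.trainings.create(
    version="<fine-tuner-owner>/<fine-tuner-model>:<version-hash>",
    input={
        "dataset_path": "https://example.com/dataset.zip",
        "model_version": "medium",
        "auto_labeling": True,
        "drop_vocals": False,
        "epochs": 5,
        "updates_per_epoch": 100,
        "batch_size": 16,
    },
    destination="my-name/my-model",
)
```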