Fine-Tuning with MusicGen
Do you want to create music in a particular style? MusicGen fine-tuning can help you do exactly that. The fine-tuning mechanism was created by Jongmin Jung (aka Sake). You can easily run your fine-tuned model from the web or through the cloud API.
This tool allows you to fine-tune MusicGen according to your desired style, be it nostalgic 16-bit video game chiptunes or the serene ambiance of choral compositions.
MusicGen Fine-Tune
Fast Track to Mastery
In just 15 minutes, you can fine-tune MusicGen using the powerful 8x A40 (Large) hardware.
The flexibility to run your fine-tuned model from the web, cloud API, or through downloaded weights opens up a world of possibilities.
The Maestro Behind the Scenes
Jongmin Jung’s fine-tuning process is rooted in Meta’s AudioCraft and its built-in trainer, Dora. To simplify training, Sake integrated automatic audio chunking, auto-labeling, and vocal removal features. Now, your trained model can compose music beyond the conventional 30 seconds.
How to fine-tune MusicGen AI?
Building Your Dataset
With just 9-10 tracks, you can fine-tune MusicGen to emulate your chosen musical style. Ensure each track exceeds 30 seconds, and the training script will seamlessly handle the rest, automatically dividing lengthy audio files into 30-second chunks.
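If you want to sanity-check your dataset before uploading, a quick script can confirm every track clears the 30-second mark. Below is a minimal sketch, assuming a local tracks/ folder and the third-party soundfile library; neither is required by the trainer itself.

```python
# Minimal sketch: verify every track is longer than 30 seconds before uploading.
# Assumes a local "tracks/" folder and the third-party soundfile library;
# the trainer itself does not require this step.
from pathlib import Path
import soundfile as sf

for path in sorted(Path("tracks").glob("*.wav")):
    info = sf.info(str(path))                 # reads the header only
    duration = info.frames / info.samplerate  # length in seconds
    status = "ok" if duration > 30 else "TOO SHORT"
    print(f"{path.name}: {duration:.1f}s ({status})")
```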
Labeling Your Symphony
MusicGen offers three labeling options:
- Automatic labeling using Essentia, capturing genre, mood, theme, instrumentation, key, and BPM.
- A single description for all tracks using the one_same_description training parameter.
- Personalized descriptions for each audio file, which requires a text file with the same filename as the corresponding track (see the sketch after this list).
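If you opt for per-track descriptions, each audio file needs a sidecar text file that shares its name. Here is a minimal sketch that writes those files; the folder name and the descriptions themselves are hypothetical placeholders for your own data.

```python
# Minimal sketch: write a sidecar .txt description for each track.
# Folder name and descriptions are hypothetical placeholders.
from pathlib import Path

descriptions = {
    "track01.wav": "uplifting 16-bit chiptune with arpeggiated square leads",
    "track02.wav": "slow ambient choral pad with long reverb tails",
}

for filename, text in descriptions.items():
    audio_path = Path("tracks") / filename
    audio_path.with_suffix(".txt").write_text(text)  # e.g. tracks/track01.txt
```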
Silencing the Vocals
MusicGen works best with instrumental tracks. The base models are vocal-free, so stripping vocals from your dataset keeps the fine-tuned model from producing peculiar outputs.
You can disable vocal removal by setting drop_vocals to false in your training parameters if your tracks are vocal-free or if you want to experiment with vocal-inclusive training.
Choosing Your Sonic Palette
Selecting the Model
You have the choice of training the small, medium, or melody models. The small model is the default, while the large model is not available for training.
The melody model, tailored for melodic input, is limited to 30 seconds but brings a unique dimension to your fine-tuned creation.
Tokenizing Your Dreams
Before diving into the fine-tuning process, obtain your Replicate API token from replicate.com/account/api-tokens and store it as an environment variable called REPLICATE_API_TOKEN.
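The token is usually exported in your shell first (for example, export REPLICATE_API_TOKEN=&lt;your-token&gt;). The short sketch below simply checks from Python that the variable is visible before you kick off a training run.

```python
# Minimal sketch: confirm the Replicate API token is visible before training.
# The token is usually exported in your shell first, e.g.
#   export REPLICATE_API_TOKEN=<your-token>
import os

token = os.environ.get("REPLICATE_API_TOKEN")
if not token:
    raise SystemExit("REPLICATE_API_TOKEN is not set")
print("Token found, ending in", token[-4:])
```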
Crafting Your Symphony
Creating a Model
Visit replicate.com/create to fashion a model on Replicate, the destination for your refined MusicGen version. Give it a unique name, like my-name/my-model.
Uploading Your Masterpiece
Zip your tracks and any text files, and either upload them as part of your Replicate CLI command or share the publicly accessible URL if using alternative methods.
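Here is a minimal sketch of the zipping step in Python, assuming your audio files and optional .txt descriptions sit in a local tracks/ folder; you can just as easily use your operating system's archiver.

```python
# Minimal sketch: bundle audio tracks and optional .txt descriptions into one zip.
# Assumes a local "tracks/" folder; "dataset.zip" is an arbitrary output name.
import zipfile
from pathlib import Path

with zipfile.ZipFile("dataset.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for path in sorted(Path("tracks").iterdir()):
        if path.suffix.lower() in {".wav", ".mp3", ".flac", ".txt"}:
            zf.write(path, arcname=path.name)  # store files flat in the archive
```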
Initiating the Training
Fire up the training process with a Python command or Replicate CLI, specifying the model version, dataset path, and destination. Fine-tune MusicGen with default parameters or customize them using the provided settings.
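A minimal sketch with the replicate Python client is shown below. The fine-tuner version string is a placeholder; copy the current one from the trainer's page on Replicate, and point destination at the model you created earlier. The input values mirror the settings table at the end of this article.

```python
# Minimal sketch: start a fine-tuning run with the replicate Python client.
# The fine-tuner version string is a placeholder; copy the current one from the
# trainer's page on Replicate. Requires REPLICATE_API_TOKEN in the environment.
import replicate

training = replicate.trainings.create(
    version="<fine-tuner-owner>/<fine-tuner-model>:<version-id>",  # placeholder
    input={
        "dataset_path": "https://example.com/dataset.zip",  # your uploaded zip
        "model_version": "small",
        "epochs": 3,
        "auto_labeling": True,
        "drop_vocals": True,
    },
    destination="my-name/my-model",  # the model created at replicate.com/create
)
print(training.id, training.status)
```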
Monitoring the Crescendo
Track the training progress on replicate.com/trainings or programmatically inspect the training. Keep an eye on the status and the last few lines of logs for a comprehensive overview.
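If you prefer to poll programmatically rather than watch the dashboard, a sketch like the following works with the same client; the training ID comes from the object returned when you started the run.

```python
# Minimal sketch: poll a training run and print its status and recent log lines.
# training_id is the id returned by replicate.trainings.create(...).
import time
import replicate

training_id = "<your-training-id>"  # placeholder
while True:
    training = replicate.trainings.get(training_id)
    print("status:", training.status)
    if training.logs:
        print("\n".join(training.logs.splitlines()[-5:]))  # last few log lines
    if training.status in ("succeeded", "failed", "canceled"):
        break
    time.sleep(30)
```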
Running the Model
Once the model completes its training, unleash it on the web or through the API. Use the power of your refined MusicGen by providing a prompt that aligns with your trained descriptions. Experiment with prompts to discover the nuances of your personalized music generator.
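As a starting point, calling the fine-tuned model through the API can look like the sketch below; the model name, version hash, and prompt are placeholders you would replace with your own values.

```python
# Minimal sketch: generate audio from the fine-tuned model through the API.
# Model name, version hash, and prompt are placeholders for your own values.
import replicate

output = replicate.run(
    "my-name/my-model:<version-id>",
    input={
        "prompt": "nostalgic 16-bit video game chiptune, upbeat, 120 bpm",
        "duration": 30,  # assumed input; check your model's schema on Replicate
    },
)
print(output)  # typically a URL to the generated audio file
```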
Mastering the Harmony: All Fine-Tune Settings
Here’s a quick overview of the parameters that grant you control over your trained model:
| Setting | Description |
| --- | --- |
| dataset_path | URL pointing to your zip or audio file |
| one_same_description | A description for all audio data (default: none) |
| auto_labeling | Automatic creation of label data like genre, mood, theme, etc. (default: true) |
| drop_vocals | Drops vocal tracks from audio files in the dataset (default: true) |
| model_version | The model version to train; choices are “melody”, “small”, “medium” (default: “small”) |
| lr | Learning rate (default: 1) |
| epochs | Number of epochs to train for (default: 3) |
| updates_per_epoch | Number of iterations for one epoch (default: 100) |
| batch_size | Batch size, must be a multiple of 8 (default: 16) |
Demi Franco holds a BTech in AI from CQUniversity and is a passionate writer focused on AI. She crafts insightful articles and blog posts that make complex AI topics accessible and engaging.