MusicGen AI OpenVINO (Quick Guide)

In this tutorial, we’ll explore the process of running the MusicGen model using OpenVINO for controllable music generation. MusicGen is a powerful auto-regressive Transformer model capable of generating high-quality music samples based on text descriptions or audio prompts.

To take advantage of OpenVINO's performance benefits, we'll convert the MusicGen model and its components into the OpenVINO Intermediate Representation (IR) format.

Let’s break down the steps:

How to Use MusicGen with OpenVINO

Step 1: Set Up Variables

Before diving into the conversion process, let’s set up some variables for the file paths where we’ll save the converted models.

from pathlib import Path

models_dir = Path("./models")
models_dir.mkdir(exist_ok=True)  # make sure the target directory exists

t5_ir_path = models_dir / "t5.xml"
musicgen_0_ir_path = models_dir / "mg_0.xml"
musicgen_ir_path = models_dir / "mg.xml"
audio_decoder_onnx_path = models_dir / "encodec.onnx"
audio_decoder_ir_path = models_dir / "encodec.xml"
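
Later steps reference model and inputs, which come from the original Hugging Face pipeline. As a minimal sketch (the checkpoint ID and prompt here are illustrative assumptions, not mandated by the conversion steps), they can be obtained like this:

from transformers import AutoProcessor, MusicgenForConditionalGeneration

# Load the original PyTorch pipeline; `torchscript=True` and `return_dict=False`
# make the submodules traceable during conversion
model = MusicgenForConditionalGeneration.from_pretrained(
    "facebook/musicgen-small", torchscript=True, return_dict=False
)
model.eval()

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
inputs = processor(text=["80s pop track with bassy drums and synth"], return_tensors="pt")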

Step 2: Convert Text Encoder

The text encoder converts input prompts into embeddings that the MusicGen decoder can use. We'll use OpenVINO's model conversion API (convert_model) to convert the PyTorch model to OpenVINO IR.

import gc

from openvino import convert_model, save_model

if not t5_ir_path.exists():
    # Trace the T5 text encoder with the tokenized prompt as the example input
    t5_ov = convert_model(model.text_encoder, example_input={'input_ids': inputs['input_ids']})
    save_model(t5_ov, t5_ir_path)
    del t5_ov
    gc.collect()  # release the traced model's memory
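
To sanity-check the converted encoder, you can compile it with the OpenVINO runtime and run it on the tokenized prompt (the device name is an illustrative choice):

import openvino as ov

core = ov.Core()
compiled_t5 = core.compile_model(t5_ir_path, "CPU")

# A single positional input (input_ids); the first output is the last hidden state
encoder_hidden_states = compiled_t5(inputs['input_ids'])[0]
print(encoder_hidden_states.shape)  # (batch, sequence_length, hidden_size)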

Step 3: Convert MusicGen Language Model (0th Generation)

The MusicGen language model generates audio codes from the embedded text representation. We first convert it for the 0th generation step, before any past key values exist.

import torch

if not musicgen_0_ir_path.exists():
    # Setting torchscript=True in the config makes the decoder traceable
    model.decoder.config.torchscript = True

    # Dummy inputs matching the shapes the decoder sees at step 0
    decoder_input = {
        'input_ids': torch.ones(8, 1, dtype=torch.int64),
        'encoder_hidden_states': torch.ones(2, 12, 1024, dtype=torch.float32),
        'encoder_attention_mask': torch.ones(2, 12, dtype=torch.int64),
    }

    mg_ov_0_step = convert_model(model.decoder, example_input=decoder_input)
    save_model(mg_ov_0_step, musicgen_0_ir_path)
    del mg_ov_0_step
    gc.collect()

Step 4: Convert MusicGen Language Model

For subsequent generation steps, the decoder also consumes past_key_values (the cached attention keys and values), so we extend the example input and convert the model again.

from openvino import PartialShape, Type

if not musicgen_ir_path.exists():
    # decoder_input is reused from Step 3; the cache tensors are elided here
    decoder_input['past_key_values'] = tuple([...])  # fill with the per-layer key/value cache tensors

    mg_ov = convert_model(model.decoder, example_input=decoder_input)

    # The first three inputs are input_ids, encoder_hidden_states, and
    # encoder_attention_mask; the remaining ones are the past_key_values,
    # which get dynamic batch and sequence dimensions
    for model_input in mg_ov.inputs[3:]:
        model_input.get_node().set_partial_shape(PartialShape([-1, 16, -1, 64]))
        model_input.get_node().set_element_type(Type.f32)

    mg_ov.validate_nodes_and_infer_types()

    save_model(mg_ov, musicgen_ir_path)
    del mg_ov
    gc.collect()
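
During generation, the pipeline typically calls the 0th-step model once to produce the initial key/value cache, then loops over the second model, feeding past_key_values back in at each step. A minimal sketch of compiling both (device choice again illustrative):

import openvino as ov

core = ov.Core()
compiled_mg_0 = core.compile_model(musicgen_0_ir_path, "CPU")  # step 0: no cache yet
compiled_mg = core.compile_model(musicgen_ir_path, "CPU")      # later steps: consume past_key_values

The wrapper classes in Step 6 can then choose which of the two compiled models to invoke based on whether a cache exists yet.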

Step 5: Convert Audio Decoder

The audio decoder, part of the EnCodec model, recovers the audio waveform from the predicted audio tokens.

We'll export the audio decoder to ONNX first and then convert the resulting graph to OpenVINO IR.

if not audio_decoder_onnx_path.exists():
    # Wrapper that exposes EnCodec's decode() as a plain forward() for export
    class AudioDecoder(torch.nn.Module):
        def __init__(self, model):
            super().__init__()
            self.model = model

        def forward(self, output_ids):
            return self.model.decode(output_ids, [None])

    # n_tokens is defined earlier in the pipeline (the number of audio tokens to generate)
    audio_decoder_input = {'output_ids': torch.ones(1, 1, 4, n_tokens - 3, dtype=torch.int64)}

    with torch.no_grad():
        torch.onnx.export(
            model=AudioDecoder(model.audio_encoder),
            args=(audio_decoder_input,),  # a dict of named example inputs
            f=str(audio_decoder_onnx_path),
            input_names=['output_ids'],
            output_names=['decoded_audio'],
            dynamic_axes={
                'output_ids': {3: 'sequence_length'},
                'decoded_audio': {2: 'audio_values'}
            }
        )

Now, we can convert the frozen ONNX computation graph to OpenVINO IR.

if not audio_decoder_ir_path.exists():
    audio_decoder_ov = convert_model(str(audio_decoder_onnx_path))
    save_model(audio_decoder_ov, audio_decoder_ir_path)
    del audio_decoder_ov
    gc.collect()

Step 6: Adapt OpenVINO Models to the Original Pipeline

To integrate the OpenVINO models into the original inference pipeline, we need wrapper classes for each model. These wrappers ensure proper parameter handling and result formatting.

  • Make sure to forward parameters correctly and handle data type conversions.
  • Ensure the wrapper class returns results in the expected format for the pipeline.

Refer to the AudioDecoderWrapper to see how we wrap OpenVINO model inference into the decode method.
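
The AudioDecoderWrapper itself is not shown in this guide; as a rough sketch (the class layout, namedtuple return type, and device are assumptions), it might look like:

from collections import namedtuple

import openvino as ov
import torch

# Mimics the attribute access (.audio_values) the original pipeline expects
DecodedAudio = namedtuple("DecodedAudio", ["audio_values"])

class AudioDecoderWrapper:
    """Stand-in for EnCodec's decode(), backed by the converted OpenVINO IR."""

    def __init__(self, ir_path, device="CPU"):
        self.compiled = ov.Core().compile_model(ir_path, device)

    def decode(self, output_ids, audio_scales=None):
        # Run the compiled graph and repackage the raw array so the rest
        # of the pipeline can keep accessing .audio_values
        decoded = self.compiled(output_ids)[0]
        return DecodedAudio(audio_values=torch.from_numpy(decoded))

Swapping an object like this in for the original audio decoder leaves the rest of the generation loop untouched, provided the output shapes and dtypes match.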

By following these steps, you can seamlessly integrate the MusicGen model with OpenVINO for efficient and controllable music generation.

Conclusion:

This tutorial walked through running the MusicGen model with OpenVINO for controllable music generation. By converting the text encoder, the MusicGen language model, and the EnCodec audio decoder to OpenVINO IR and adapting them to the original pipeline, you can harness OpenVINO's efficiency while still generating high-quality music samples.