
The MTG uses the RES supercomputer MareNostrum5 Acc to advance AI music understanding

Our collaboration with the Red Española de Supercomputación (RES) allows us to create a new generation of open text-audio models
11.03.2025


Following the success of our previous Barcelona Supercomputing Center project, "Large-scale Self-supervised Audio Representation Models for Music Understanding" (July-November 2024), we now explore the opportunities of combining large corpora of audio with associated textual descriptions, using state-of-the-art LLMs together with our best SSL audio encoders from the previous project to enhance semantic understanding and broaden the potential applications of the resulting systems.


In our previous project, the team formed by Pablo Alonso, Dmitry Bogdanov, Recep Oğuz Araz, Pedro Ramoneda, and Martin Rocamora developed models based on BEST-RQ, a self-supervised learning paradigm in which the training goal is to predict masked features (or tokens) from the input audio. This promising approach has achieved state-of-the-art results in analysis tasks across the speech, sound, and music domains. In our approach, the model predicts several types of masked features simultaneously, which in our experiments leads to improvements on various music tasks.

The image shows an audio processing pipeline that takes a raw audio signal and transforms it through multiple stages, combining knowledge-based and data-driven representations to analyze music.
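
To make the masked-prediction idea concrete, here is a minimal sketch in PyTorch of BEST-RQ-style pre-training with several simultaneous target types. The `encoder` (e.g., a Transformer) and the linear `heads` are assumed to exist; all names, dimensions, and the noise-based masking are illustrative placeholders, not the MTG's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RandomProjectionQuantizer(nn.Module):
    """Turns feature frames into discrete target tokens using a frozen random
    projection and a frozen random codebook, as in BEST-RQ."""

    def __init__(self, input_dim: int, codebook_size: int = 1024, code_dim: int = 16):
        super().__init__()
        # Neither the projection nor the codebook is ever trained.
        self.register_buffer("projection", torch.randn(input_dim, code_dim))
        self.register_buffer(
            "codebook", F.normalize(torch.randn(codebook_size, code_dim), dim=-1)
        )

    @torch.no_grad()
    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, time, input_dim)
        z = F.normalize(x @ self.projection, dim=-1)
        # The nearest codebook entry (by cosine similarity) is the target token.
        return (z @ self.codebook.T).argmax(dim=-1)  # (batch, time) token ids


def multi_target_masked_loss(features, encoder, quantizers, heads, mask_prob=0.3):
    """Mask random frames, encode the corrupted input, and predict the token
    targets of every quantizer simultaneously (one classification head each)."""
    b, t, d = features.shape
    mask = torch.rand(b, t, device=features.device) < mask_prob
    corrupted = features.clone()
    corrupted[mask] = torch.randn(int(mask.sum()), d, device=features.device) * 0.1
    hidden = encoder(corrupted)  # (batch, time, hidden_dim)
    loss = 0.0
    for quantizer, head in zip(quantizers, heads):
        targets = quantizer(features)  # token ids computed from the clean audio
        logits = head(hidden)          # (batch, time, codebook_size)
        loss = loss + F.cross_entropy(logits[mask], targets[mask])
    return loss / len(quantizers)      # average over the target types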


In the continuation project, we will train text-audio models on a collection of more than 300,000 hours of audio, aiming to push the boundaries of open models for music and sound understanding. Our main goals are:


  • To develop improved audio encoders through generative pre-training
  • To create text-audio models for cross-modal retrieval and zero-shot learning (see the sketch after this list)
  • To employ open audio and metadata from Freesound for sound-specific text-audio models
  • To release the models as open source with full documentation, model cards, and research papers
  • To evaluate the models robustly across multiple music analysis tasks
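
As an illustration of the cross-modal retrieval and zero-shot learning goal, below is a minimal sketch of CLAP-style contrastive training and zero-shot tagging, assuming PyTorch and pre-computed audio and text embeddings from paired clips and captions. The function names and the tag-ranking setup are hypothetical placeholders, not the project's actual models.

```python
import torch
import torch.nn.functional as F


def contrastive_text_audio_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired (audio, caption)
    embeddings: matching pairs lie on the diagonal of the similarity matrix."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.T / temperature  # (batch, batch) similarities
    labels = torch.arange(logits.shape[0], device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2


@torch.no_grad()
def zero_shot_tags(audio_emb, tag_emb, tags, top_k=3):
    """Rank free-text tags for each clip without any task-specific training,
    by cosine similarity in the shared embedding space."""
    sims = F.normalize(audio_emb, dim=-1) @ F.normalize(tag_emb, dim=-1).T
    scores, indices = sims.topk(top_k, dim=-1)
    return [
        [(tags[j], score.item()) for j, score in zip(row_idx.tolist(), row_scores)]
        for row_idx, row_scores in zip(indices, scores)
    ]
```

At inference time, free-text descriptions such as "a slow acoustic guitar ballad" are embedded with the text encoder and ranked against the audio embedding, which is what makes zero-shot classification and retrieval possible without retraining.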

This project is part of a broader effort to develop open-source multimodal models for sound and music understanding and generation.


With the support of: