Pablo Alonso defends his PhD thesis
Thursday, October 3rd 2024 at 11:00h - Room 55.309 (3rd floor), Tanger building (UPF Poblenou)
Title: Deep Audio Representation Learning for Music Using Weak Supervision
Supervisors: Dr. Dmitry Bogdanov and Dr. Xavier Serra
Jury: Dr. Romain Hennequin (Deezer), Dr. Perfecto Herrera Boyer (UPF), Dr. Rachel Bittner (Spotify)
Abstract:
Music audio tagging is the Music Information Retrieval task of assigning one or multiple labels to an audio signal. Music tagging systems are essential for developing applications involving cataloging, retrieval, or recommendation, so enhancing the accuracy, robustness, and efficiency of these models is beneficial for many real-world music applications. Current state-of-the-art music tagging systems rely on deep learning approaches, which offer high performance but also introduce challenges due to their large data requirements and tendency to overfit. In this thesis, we propose addressing music tagging from the perspective of representation learning to alleviate these limitations.
The goal of representation learning is to design pre-training objectives that make the learned representations suitable for several downstream tasks. When the representations are well-suited to the downstream task, it is often possible to achieve good performance using shallow models that require few resources to train and run. Additionally, using a single representation model to feed several shallow models is more efficient than having individual end-to-end models for each task, and enables addressing new related tasks with little additional effort.
Our work starts by investigating the representations learned by competitive music and audio tagging systems and evaluating their performance on out-of-distribution data, finding that pre-trained representations provide generalization benefits. To support the rest of this thesis, we create a large-scale dataset matched to Discogs' open music metadata, which we use to develop novel representation models. Then, we investigate the effectiveness of using editorial and consumption metadata (such as artist names and playlists) as a source of supervision, showing that this information improves downstream performance without the need for explicit annotations, which are typically much harder to obtain.
After this, we look into the transformer architecture, proposing design choices that optimize its performance for music representation learning. In our last contribution, we propose adapting existing audio interpretability strategies to operate with pre-trained representations, thus contributing to more insightful music classification models.
Finally, this work is carried out in the context of Essentia, an open-source library and collection of models for audio and music analysis. The techniques and models developed in this thesis are openly available as part of Essentia and have already been used by both the research community and industry.
Video: https://youtu.be/qho3v5LpV7c