Minz Won defends his PhD thesis

Friday, July 1st, 2022, at 10:00 (CET), online

Title: Representation Learning for Music Classification and Retrieval: Bridging the Gap between Natural Language and Music Semantics

Supervisors: Dr. Xavier Serra and Dr. Horacio Saggion

Jury: Dr. Juhan Nam (Korea Advanced Institute of Science and Technology), Dr. Sebastian Ewert (Spotify), Dr. Yi-Hsuan Yang (Academia Sinica)


The explosion of digital music has dramatically changed our music consumption behavior. Massive digital music libraries are now available through streaming platforms. Since the amount of information available to an individual listener has increased greatly, it is nearly impossible for them to go through the entire catalog exhaustively. As a result, we need robust knowledge management systems more than ever. Recent advances in deep learning have enabled data-driven music representation learning for classification and retrieval. However, there is still a gap between machine-learned representations and the human understanding of music. This dissertation aims to reduce this semantic gap in order to assist listeners' behavior around music information with advanced algorithmic support. To this end, we tackle three main challenges in representation learning: model architecture design, scalability, and multi-modality. Firstly, we carefully review previous deep representation models and propose new architectures that improve the representation both qualitatively and quantitatively. The newly proposed models are more flexible, interpretable, and powerful than previous ones. Secondly, we explore training schemes beyond supervised learning as a way to make the research scalable. Transfer learning, semi-supervised learning, and self-supervised learning approaches are addressed in detail; transfer learning and semi-supervised methods are applied to enhance music representation learning. Finally, metric learning is proposed as a way to bridge music audio representation and natural language semantics, forming a multimodal embedding space. This facilitates music retrieval using arbitrary tags beyond a fixed vocabulary, and makes it possible to match music to text stories based on mood. Although our work focuses on bridging music and natural language semantics, we believe the proposed approaches generalize to other modalities.
All implementations from this thesis are open source for reproducibility. The knowledge gained throughout this thesis has been put into practice and grounded in research internships and collaborations with multiple industry partners.

This thesis defense will take place online. To attend, use this link (meeting ID: 944 2472 9106). Microphones and cameras must be turned off, and online access will close 30 minutes after the start of the defense.



