Miguel Pérez Fernández defends his PhD thesis

Wednesday, October 22nd, 2025 at 11:00h (CET) - Room 55.309 (3rd floor), Tanger building (UPF Poblenou) and online
13.10.2025

Title: Deep Learning approaches to singing voice transcription in pop music

Supervisor: Dr. Xavier Serra

Jury: Dr. Simon Dixon (Queen Mary University of London), Dr. Marius Miron (Earth Species Project), Dr. Jose Javier Valero (Universidad de Alicante)

Abstract:

The human voice is undoubtedly the most accessible musical instrument and is therefore part of many musical practices around the world. This universal accessibility leads to a wide range of popular singing-related activities such as karaoke. In parallel, the immediacy of music streaming media has driven the demand for karaoke song versions for the latest popular songs, as well as song identification services. Consequently, automated systems capable of generating scores directly, or at least providing a solid base that requires minimal adjustment, are highly desirable. This challenge of transcribing sung vocals from popular music poses considerable difficulties. First, vocal melodies are often interwoven with complex instrumental arrangements. Second, achieving accurate transcription requires considering the diverse timbres and flexibility inherent in the human voice.

Deep learning approaches have proven very successful in many scenarios, given enough training data. However, such datasets are time-consuming to annotate manually, and the few that exist are difficult to share due to the copyright issues surrounding commercial music.

We address these challenges through three complementary strategies. The first focuses on creating neural network models that leverage music priors by introducing a novel music-motivated neural network block, the design of which has been successfully patented.

The second is to augment the data, employing generative models not to create fully synthetic data, but to generate various realistic instrumental accompaniments for existing vocal recordings, thus extending training scenarios without relying on potentially biased generated annotations.

The third strategy addresses the automatic creation of highly accurate aligned data; we have developed an iterative event-based algorithm that refines audio-to-score alignments by comparing specific musical events, such as onsets, and can also be leveraged to assess the reliability of these alignments.

Based on various comparative and outcome analyses, we conclude that transcription performance for professional singers is approaching a plateau with the current metrics used for evaluation. However, transcription accuracy for amateur singers remains significantly lower, indicating a clear area for future research and development.

Video: https://youtu.be/CwTBpd73A5U