Benno Weck-Hufnagel defends his PhD thesis

Monday, October 27th 2025 at 11:00h (CET) - Room 55.309 (3rd floor), Tanger building (UPF Poblenou) and online
19.10.2025

Title: Content-based Retrieval in Large-scale Audio Collections with Natural Language as the Interface

Supervisor: Dr. Xavier Serra 

Jury: Dr. Romain Serizel (Université de Lorraine), Dr. Martín Rocamora Martínez (UPF), Dr. Shuo Zhang (Bose Corp/Tufts University)

Abstract: 

Audio collections, ranging from music archives to environmental sound libraries, have been growing quickly. However, these vast resources remain largely underutilised due to sparse metadata and limited search capabilities. This thesis investigates content-based retrieval in large-scale audio collections using natural language as the interface, with the goal of enabling more intuitive and expressive access to audio content. We address three central challenges: system design, data availability, and evaluation. For system design, we explore two primary directions. First, in audio captioning, we compare combinations of pretrained word embedding and machine listening models within a Transformer-based architecture. Second, in language-based retrieval, we investigate fine-tuning strategies for pretrained encoder models in a bi-encoder setup, considering different loss functions and the effects of augmenting training data with noisy audio-text pairs. To address the scarcity of paired text-music data, we introduce two novel datasets: Song Describer, a crowd-sourced collection of music captions, and WikiMuTe, which pairs music audio with encyclopedic textual descriptions. These datasets provide new resources for both evaluating and training multimodal models. In our evaluation work, we identify data leakage issues in an existing benchmark and propose more realistic dataset splits. We also introduce MuChoMusic, a multiple-choice question-answering benchmark designed to assess music understanding in multimodal models. Additionally, a user study explores how system constraints shape natural language query behaviour, revealing a tendency toward short queries despite a willingness to provide more detailed input. Together, these contributions aim to advance the integration of natural language and audio understanding and lay the foundations for richer interaction with audio content.

Streaming: https://www.upf.edu/web/mtg/streaming