Eduardo Fonseca defends his PhD thesis
Wednesday, December 1st, 2021 at 12.00h (CET) - online
Title: Training Sound Event Classifiers Using Different Types of Supervision
Supervisors: Dr. Xavier Serra and Dr. Frederic Font
Jury: Dr. Emmanouil Benetos (Queen Mary University of London); Dr. Marius Miron (UPF); Dr. Annamaria Mesaros (Tampere University)
Abstract:
While speech and music processing have traditionally drawn the attention of the research community, the automatic recognition of sound events has received interest only in recent years, motivated by emerging applications in fields such as healthcare, smart homes, or urban planning. Prior to this thesis, research on sound event classification focused mainly on supervised learning with small, carefully annotated datasets whose small vocabularies were constrained to specific domains (e.g., urban or domestic). However, ideal general-purpose sound event classifiers aim to recognize hundreds of sound events occurring in our everyday environment, such as a kettle whistle, a bird tweet, a fire alarm, or a car passing by. At the same time, large amounts of rich environmental sound data are hosted in web repositories such as Freesound or YouTube, which is convenient for training data-hungry deep learning approaches. To advance the state of the art in sound event classification, this thesis investigates several strands of dataset creation as well as supervised and unsupervised learning in order to train large-vocabulary sound event classifiers, using different types of supervision in novel and alternative ways.
The first part of this thesis focuses on the creation of FSD50K, a large-vocabulary dataset with over 100 hours of audio manually labeled using 200 classes of sound events. We provide a detailed description of the creation process and a comprehensive characterization of the dataset, including a set of classification experiments that provide insight into the data for machine listening tasks. In addition, we explore novel architectural modifications to increase shift invariance in CNNs, improving robustness to time/frequency shifts in input spectrograms and achieving state-of-the-art classification performance.
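As an illustration of this kind of architectural modification, below is a minimal PyTorch sketch of anti-aliased (blurred) pooling in the spirit of Zhang (2019): pooling densely, then low-pass filtering before subsampling, which reduces sensitivity to small time/frequency shifts. The 3x3 binomial kernel, channel count, and layer placement are illustrative assumptions, not necessarily the exact variant evaluated in the thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlurPool2d(nn.Module):
    """Low-pass filter, then subsample (anti-aliased downsampling)."""

    def __init__(self, channels: int, stride: int = 2):
        super().__init__()
        self.stride = stride
        # Fixed 3x3 binomial (approximately Gaussian) low-pass kernel,
        # one copy per channel for depthwise filtering.
        k = torch.tensor([1.0, 2.0, 1.0])
        k2d = torch.outer(k, k)
        self.register_buffer("kernel", (k2d / k2d.sum()).repeat(channels, 1, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Reflect-pad so the blur sees valid context at the borders.
        x = F.pad(x, (1, 1, 1, 1), mode="reflect")
        return F.conv2d(x, self.kernel, stride=self.stride, groups=x.shape[1])

# Dense max-pool followed by blurred subsampling, instead of the usual
# MaxPool2d(kernel_size=2, stride=2):
pool = nn.Sequential(nn.MaxPool2d(kernel_size=2, stride=1), BlurPool2d(channels=64))
feats = pool(torch.randn(8, 64, 96, 101))  # (batch, channels, mel bands, frames)
```

Because the blur kernel is fixed rather than learned, the layer adds no trainable parameters; a small shift in the input spectrogram produces a correspondingly small change in the pooled feature map instead of an abrupt one.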
In the second part, we focus on training sound event classifiers using noisy labels, which can reduce the reliance on costly manual annotation. First, we propose a dataset that supports the investigation of real label noise, including an empirical characterization of the noise. Then, we explore and develop efficient network-agnostic approaches to mitigate the effect of label noise during training, including regularization techniques, noise-robust loss functions, and strategies to reject potentially mislabeled examples. Further, we develop a teacher-student framework to address the problem of missing labels in large sound event datasets, using AudioSet as a use case.
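To make the idea of a noise-robust loss concrete, here is a sketch of one widely used option, the generalized cross-entropy (Lq) loss of Zhang and Sabuncu (2018), which interpolates between cross-entropy and mean absolute error. The hyperparameter q = 0.7 and the 200-class setup are illustrative assumptions; the thesis explores a family of such functions rather than this one specifically.

```python
import torch
import torch.nn.functional as F

def lq_loss(logits: torch.Tensor, targets: torch.Tensor, q: float = 0.7) -> torch.Tensor:
    """Generalized cross-entropy (Lq) loss: L_q = (1 - p_y**q) / q.

    Interpolates between standard cross-entropy (q -> 0) and mean
    absolute error (q = 1). Compared with cross-entropy, it
    down-weights examples whose given label the model finds very
    unlikely, which are often the mislabeled ones.
    """
    probs = F.softmax(logits, dim=-1)
    # Probability the model assigns to the (possibly noisy) label.
    p_y = probs.gather(dim=-1, index=targets.unsqueeze(-1)).squeeze(-1)
    return ((1.0 - p_y.clamp_min(1e-7) ** q) / q).mean()

# Drop-in replacement for F.cross_entropy during training:
logits = torch.randn(16, 200, requires_grad=True)  # e.g., a 200-class vocabulary
targets = torch.randint(0, 200, (16,))             # integer labels, some noisy
lq_loss(logits, targets).backward()
```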
In the third part, we propose multiple strategies to learn audio representations from unlabeled data. We develop novel self-supervised contrastive learning frameworks, where representations are learned by comparing pairs of examples selected by some semantically correlated notion of similarity. Pairs of positive examples are computed via compositions of data augmentation and automatic sound separation methods. We obtain an unsupervised audio representation that rivals state-of-the-art alternatives on the established AudioSet classification benchmark. Finally, we report on the organization of two DCASE Challenge tasks on audio tagging with noisy labels.
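As a sketch of the contrastive objective underlying such frameworks, the following implements a simplified InfoNCE-style loss over a batch of paired embeddings, where z1[i] and z2[i] come from two views of the same clip (for instance, two augmentations, or outputs of a sound separation front-end) and every other clip in the batch serves as a negative. The temperature, embedding size, and cross-view-only negatives are illustrative assumptions, not the thesis's exact formulation.

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Contrastive loss over paired embeddings of two views per clip.

    Minimizing the loss pulls each positive pair (z1[i], z2[i])
    together while pushing apart embeddings of different clips.
    """
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    # Cosine similarity between every view-1 / view-2 embedding pair.
    sim = z1 @ z2.t() / temperature                         # (n, n)
    targets = torch.arange(z1.shape[0], device=z1.device)   # diagonal = positives
    # Symmetric cross-entropy: view 1 -> view 2 and view 2 -> view 1.
    return 0.5 * (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets))

# Embeddings of two views (e.g., augmented or separated versions) of 32 clips:
z1 = torch.randn(32, 128, requires_grad=True)
z2 = torch.randn(32, 128, requires_grad=True)
loss = info_nce(z1, z2)
```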
Overall, this thesis contributes to the advancement of open and reproducible sound event research and to the transition from traditional supervised learning using clean labels to other learning strategies less dependent on annotation effort.
This thesis defense will take place online. To attend, use this link (meeting ID: 811 9604 3116). Microphones and cameras must be turned off, and online access will close 30 minutes after the start of the defense.
Video: https://youtu.be/9F977W4UrFg