Pablo Zinemanas defends his PhD thesis
Friday, October 20th, 2023 at 11:00 AM - room 55.309 (3rd floor), Tanger building (UPF Poblenou)
Title: Interpretable Deep-learning Models for Sound Event Detection and Classification
Supervisors: Dr. Xavier Serra (UPF), Dr. Frederic Font Corbera (UPF)
Jury: Dr. Annamaria Mesaros (Tampere University), Dr. Perfecto Herrera Boyer (UPF), Dr. Roman Serizel (Université de Lorraine)
Deep-learning models have revolutionized state-of-the-art technologies in many research areas, but their black-box structure makes it difficult to understand their inner workings and the rationale behind their predictions. This may lead to unintended effects, such as susceptibility to adversarial attacks or the reinforcement of biases. As a consequence, there has been increasing interest in developing deep-learning models that provide explanations of their decisions, a field known as interpretable deep learning. In parallel, the past few years have seen a surge in technologies for environmental sound recognition, motivated by applications in healthcare, smart homes, and urban planning. However, most of the systems used in these applications are deep-learning-based black boxes and therefore cannot be inspected, so the rationale behind their decisions remains obscure. Despite recent advances, there is still a lack of research on interpretable machine learning in the audio domain. This thesis aims to reduce this gap by proposing several interpretable deep-learning models for automatic sound classification and event detection.
We start by describing an open-source software tool for reproducible research in the sound recognition field, which was used to implement the models and run the experiments presented in this document. We then propose an interpretable front-end based on domain knowledge to tailor the feature-extraction layers of an end-to-end network for sound event detection. Next, we present a novel interpretable deep-learning model for automatic sound classification, which explains its predictions based on the similarity of the input to a set of learned prototypes in a latent space. We leverage domain knowledge by designing a frequency-dependent similarity measure. The proposed model achieves results comparable to state-of-the-art methods. In addition, we present two automatic methods to prune the proposed model that exploit its interpretability. This model is accompanied by a web application for manually editing the model, which allows for a human-in-the-loop debugging approach. Finally, we propose an extension of this model that works in a polyphonic setting, such as the sound event detection task. To provide interpretability, we leverage the prototype network approach and attention mechanisms.
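The prototype-based classification idea can be sketched as follows. This is a minimal illustration, not the thesis implementation: it uses a plain Euclidean similarity in place of the frequency-dependent measure described above, and all names, shapes, and values are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions for illustration only)
n_prototypes, latent_dim, n_classes = 4, 8, 2

# Learned prototypes in the latent space, and a linear layer
# mapping prototype similarities to class scores
prototypes = rng.normal(size=(n_prototypes, latent_dim))
class_weights = rng.normal(size=(n_prototypes, n_classes))

def classify(z):
    """Score a latent embedding z by its similarity to each prototype."""
    # Squared Euclidean distance to every prototype
    d2 = np.sum((prototypes - z) ** 2, axis=1)
    # Turn distances into similarities: closer prototype -> higher score.
    # These similarities are the explanation ("the input resembles prototype k").
    sim = np.exp(-d2)
    # Class scores are a weighted combination of prototype similarities
    logits = sim @ class_weights
    return sim, logits

z = rng.normal(size=latent_dim)  # embedding of one input sound
sim, logits = classify(z)
print("prototype similarities:", np.round(sim, 3))
print("predicted class:", int(np.argmax(logits)))
```

Because each prediction decomposes into per-prototype similarities, the model's decision can be inspected by listening to the sounds the most-similar prototypes represent, which is what enables the pruning and human-in-the-loop editing described above.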
Tools for reproducible research and interpretable deep-learning models, such as those proposed in this thesis, can contribute to the development of more responsible and trustworthy artificial intelligence in the audio domain.