Jordi Pons defends his PhD thesis
Friday November 15th at 11:00h. Room 55.309 (Universitat Pompeu Fabra, Poblenou Campus)
Title: Deep neural networks for music and audio tagging
Supervisor: Dr. Xavier Serra
Jury: Dr. Perfecto Herrera (ESMUC & UPF), Dr. Geoffroy Peeters (Télécom Paris), Dr. Juhan Nam (KAIST)
Abstract: Automatic music and audio tagging can help increase the retrieval and re-use possibilities of many audio databases that remain poorly labeled. In this dissertation, we tackle the task of music and audio tagging from the deep learning perspective and, within that context, we address the following research questions:
(i) Which deep learning architectures are most appropriate for (music) audio signals?
(ii) In which scenarios is waveform-based end-to-end learning feasible?
(iii) How much data is required for carrying out competitive deep learning research?
To answer research question (i), we propose musically motivated convolutional neural networks as an alternative way of designing deep learning models, one based on domain knowledge. We also evaluate several deep learning architectures for audio at a low computational cost, using a novel methodology based on non-trained (randomly weighted) convolutional neural networks. Throughout our work, we find that employing music and audio domain knowledge during the model's design can help improve the efficiency, interpretability, and performance of spectrogram-based deep learning models.
For research questions (ii) and (iii), we perform a study with SampleCNN, a recently proposed end-to-end learning model, to assess its viability for music audio tagging when varying amounts of training data (ranging from 25k to 1.2M songs) are available. We compare SampleCNN against a musically motivated spectrogram-based architecture and conclude that, given enough data, end-to-end learning models can achieve better results.
Finally, as part of answering research question (iii), we also investigate whether a naive regularization of the solution space, prototypical networks, transfer learning, or a combination of these can help deep learning models make better use of a small number of training examples. Results indicate that transfer learning and prototypical networks are powerful strategies in such low-data regimes.
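For readers unfamiliar with prototypical networks, the classification step can be sketched as follows. This is a minimal illustration, not the thesis implementation: it assumes the embeddings have already been produced by some trained network (omitted here), and the function names, labels, and toy 2-D vectors are all hypothetical. Each class is represented by the mean of its few labeled embeddings (its prototype), and a query is assigned to the class with the nearest prototype.

```python
from collections import defaultdict
from math import dist


def class_prototypes(embeddings, labels):
    """Average the embeddings of each class into one prototype vector."""
    grouped = defaultdict(list)
    for vector, label in zip(embeddings, labels):
        grouped[label].append(vector)
    return {
        label: [sum(dim) / len(vectors) for dim in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def predict(query, prototypes):
    """Return the label whose prototype is closest (Euclidean) to the query."""
    return min(prototypes, key=lambda label: dist(query, prototypes[label]))


# Toy 2-D embeddings standing in for the output of a trained network
# (made-up values and labels, for illustration only).
support = [[0.0, 0.0], [0.0, 2.0], [4.0, 4.0], [6.0, 4.0]]
support_labels = ["speech", "speech", "music", "music"]
prototypes = class_prototypes(support, support_labels)

print(predict([1.0, 1.0], prototypes))
print(predict([4.0, 3.0], prototypes))
```

Because each class needs only enough examples to estimate a mean embedding, this kind of classifier is a natural fit for settings with few training examples per class.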