[PhD thesis] Deep Neural Networks for Music and Audio Tagging

Author: Jordi Pons

Supervisor: Xavier Serra

Automatic music and audio tagging can help increase the retrieval and re-use possibilities of many audio databases that remain poorly labeled. In this dissertation, we tackle the task of music and audio tagging from the deep learning perspective and, within that context, we address the following research questions:

  1. Which deep learning architectures are most appropriate for (music) audio signals?
  2. In which scenarios is waveform-based end-to-end learning feasible?
  3. How much data is required for carrying out competitive deep learning research?

To answer research question (i), we propose musically motivated convolutional neural networks as an alternative approach to designing deep learning models, one that is based on domain knowledge. We also evaluate several deep learning architectures for audio at a low computational cost, using a novel methodology based on non-trained (randomly weighted) convolutional neural networks. Throughout our work, we find that employing music and audio domain knowledge during the model’s design can help improve the efficiency, interpretability, and performance of spectrogram-based deep learning models.
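As a rough illustration of the idea of non-trained (randomly weighted) convolutional networks, the following minimal sketch extracts features from a signal with a single convolutional layer whose weights are drawn at random and never updated. It is not the thesis implementation: the function name `random_conv_features`, the filter sizes, the toy signal, and the choice of ReLU plus global max-pooling are all assumptions made for illustration. In such a setup, a separate simple classifier would presumably be trained on these fixed features to compare architectures cheaply.

```python
import random


def random_conv_features(signal, num_filters=4, filter_len=3, seed=0):
    """Features from a non-trained (randomly weighted) 1D conv layer.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    # Random filter weights that are never trained or updated.
    filters = [
        [rng.gauss(0.0, 1.0) for _ in range(filter_len)]
        for _ in range(num_filters)
    ]
    features = []
    for weights in filters:
        # "Valid" cross-correlation over time, followed by a ReLU.
        activations = [
            max(0.0, sum(weights[k] * signal[t + k] for k in range(filter_len)))
            for t in range(len(signal) - filter_len + 1)
        ]
        # Global max-pooling over time: one feature per filter.
        features.append(max(activations))
    return features


# Toy input (an assumption, not real audio).
signal = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
features = random_conv_features(signal)
print(features)
```

Because the weights are fixed by the seed, the same architecture always yields the same features, so only the lightweight classifier on top would need training.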

For research questions (ii) and (iii), we perform a study with SampleCNN, a recently proposed end-to-end learning model, to assess its viability for music audio tagging when variable amounts of training data (ranging from 25k to 1.2M songs) are available. We compare SampleCNN against a musically motivated spectrogram-based architecture and conclude that, given enough data, end-to-end learning models can achieve better results. Finally, in addressing research question (iii), we also investigate whether a naive regularization of the solution space, prototypical networks, transfer learning, or their combination can help deep learning models better leverage a small number of training examples. Results indicate that transfer learning and prototypical networks are powerful strategies in such low-data regimes.
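For readers unfamiliar with prototypical networks, the core classification step can be sketched as follows: each class is represented by the mean (prototype) of the embeddings of its few training examples, and a new example is assigned to the class with the nearest prototype. This is a minimal sketch under stated assumptions, not the thesis code: in practice the embeddings would come from a trained network, whereas here they are hard-coded toy vectors, and the labels and helper names are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_prototypes(support):
    """Map each class label to the mean of its support embeddings."""
    return {label: mean_vector(embs) for label, embs in support.items()}


def classify(query, prototypes):
    """Return the label whose prototype is closest to the query embedding."""
    return min(prototypes, key=lambda label: squared_distance(query, prototypes[label]))


# Toy embeddings (assumed): two training examples per class.
support = {
    "guitar": [[1.0, 0.0], [0.8, 0.2]],
    "drums": [[0.0, 1.0], [0.2, 0.8]],
}
prototypes = build_prototypes(support)
prediction = classify([0.9, 0.1], prototypes)
print(prediction)
```

Since each prototype is just an average, it can be estimated from very few examples per class, which is one intuition for why the approach suits low-data regimes.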

Link to manuscript

Usable outcome on Github: musicnn
Music tagging demo on Medium

Link to Jordi Pons's blog