The fourth edition of the Research Awards of the Sociedad Científica Informática de España (SCIE) – Fundación BBVA has recognized Jordi Pons in the category of Young Researchers in Computer Science for his significant contributions to the discipline of music information retrieval, with special emphasis on the use of deep learning architectures for labeling sound and music signals, work that has obtained a large number of citations. The awards aim to recognize the contributions of a new generation of researchers, rewarding creativity, originality and excellence in the early years of their careers, and to serve as a stimulus for them to continue their research work.
Jordi Pons is currently a research scientist at Dolby Laboratories. He obtained his PhD at the Music Technology Group in November 2019, and he is also an alumnus of the SMC master's program. His PhD thesis, "Deep Neural Networks for Music and Audio Tagging", supervised by Xavier Serra, was developed within the MdM project Machine learning approaches for structuring large sound and music collections.
Automatic music and audio tagging can help increase the retrieval and re-use possibilities of many audio databases that remain poorly labelled. In this dissertation, we tackle the task of music and audio tagging from the deep learning perspective and, within that context, we address the following research questions:
(i) Which deep learning architectures are most appropriate for (music) audio signals?
(ii) In which scenarios is waveform-based end-to-end learning feasible?
(iii) How much data is required for carrying out competitive deep learning research?
In pursuit of answering research question (i), we propose musically motivated convolutional neural networks as a domain-knowledge-based alternative for designing deep learning models, and we evaluate several deep learning architectures for audio at a low computational cost with a novel methodology based on non-trained (randomly weighted) convolutional neural networks. Throughout our work, we find that employing music and audio domain knowledge during the model's design can help improve the efficiency, interpretability, and performance of spectrogram-based deep learning models.
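The idea behind evaluating architectures with non-trained networks can be illustrated with a minimal sketch: a bank of random (untrained) convolution filters acts as a fixed feature extractor over a spectrogram, and how discriminative those random features are gives a cheap signal about the architecture itself. All shapes, filter sizes, and function names below are illustrative assumptions, not the thesis' actual configurations.

```python
import numpy as np

def random_filters(n_filters=8, fh=3, fw=3, seed=0):
    """Draw a fixed bank of random (untrained) 2D convolution filters."""
    return np.random.default_rng(seed).standard_normal((n_filters, fh, fw))

def conv_features(spec, filters):
    """Valid 2D convolution of a spectrogram with each random filter,
    followed by global max pooling, yielding one feature per filter."""
    n_filters, fh, fw = filters.shape
    H, W = spec.shape
    feats = np.empty(n_filters)
    for k in range(n_filters):
        out = np.array([[np.sum(spec[i:i + fh, j:j + fw] * filters[k])
                         for j in range(W - fw + 1)]
                        for i in range(H - fh + 1)])
        feats[k] = out.max()  # global max pooling
    return feats

# Toy usage: two fake 16x16 "spectrograms" mapped to 8-dim feature vectors;
# in the evaluation methodology, a simple classifier would then be fit on
# such features to compare architectures without any training of the CNN.
rng = np.random.default_rng(1)
filters = random_filters()
X = np.stack([conv_features(rng.standard_normal((16, 16)), filters)
              for _ in range(2)])
```

The key design point is that only a lightweight readout on top of the random features is ever fit, which keeps the comparison of candidate front-end architectures computationally cheap.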
For research questions (ii) and (iii), we perform a study with the SampleCNN, a recently proposed end-to-end learning model, to assess its viability for music audio tagging when variable amounts of training data (ranging from 25k to 1.2M songs) are available. We compare the SampleCNN against a spectrogram-based architecture that is musically motivated and conclude that, given enough data, end-to-end learning models can achieve better results.
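The waveform-based, end-to-end flavor of the SampleCNN can be sketched as stacked sample-level 1D convolutions with very small filters (length 3, stride 3), each block shrinking the raw waveform by a factor of three until a compact representation remains. This single-channel sketch is a deliberate simplification (the real model uses many channels and learned weights); all values here are toy assumptions.

```python
import numpy as np

def sample_level_block(x, w, b):
    """One SampleCNN-style block: 1D convolution with a length-3 filter
    and stride 3 (non-overlapping frames), followed by a ReLU."""
    n = (len(x) // 3) * 3
    frames = x[:n].reshape(-1, 3)            # group samples in threes
    return np.maximum(frames @ w + b, 0.0)   # conv + ReLU, length n/3

# Toy usage: a 243-sample "waveform" passes through five stacked blocks,
# shrinking 243 -> 81 -> 27 -> 9 -> 3 -> 1, directly from raw samples
# with no spectrogram front end.
rng = np.random.default_rng(0)
x = rng.standard_normal(3 ** 5)
for _ in range(5):
    w, b = rng.standard_normal(3), rng.standard_normal()
    x = sample_level_block(x, w, b)
```

The point of the small-filter, deep stack is that the model learns its own time-frequency decomposition from raw audio instead of relying on a hand-crafted spectrogram, which is why large training sets matter for this family of models.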
Finally, throughout our quest for answering research question (iii), we also investigate whether a naive regularization of the solution space, prototypical networks, transfer learning, or their combination, can foster deep learning models to better leverage a small number of training examples. Results indicate that transfer learning and prototypical networks are powerful strategies in such low-data regimes.
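The prototypical-network strategy mentioned above reduces, at classification time, to a simple rule: each class is represented by a prototype (the mean embedding of its few labelled examples), and a query is assigned to the class of the nearest prototype. A minimal sketch, assuming toy 2-D embeddings in place of a learned embedding network:

```python
import numpy as np

def prototypes(embeddings, labels):
    """Compute class prototypes: the mean embedding of each class's
    labelled (support) examples."""
    classes = np.unique(labels)
    protos = np.stack([embeddings[labels == c].mean(axis=0)
                       for c in classes])
    return classes, protos

def classify(query, classes, protos):
    """Assign the query to the class with the nearest (Euclidean) prototype."""
    dists = np.linalg.norm(protos - query, axis=1)
    return classes[np.argmin(dists)]

# Toy usage: two classes with two labelled examples each; the query
# embedding lands near class 1's prototype.
emb = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
lab = np.array([0, 0, 1, 1])
classes, protos = prototypes(emb, lab)
pred = classify(np.array([4.5, 5.5]), classes, protos)
```

Because only class means are computed from the support set, this rule needs no per-class gradient training, which is exactly what makes it attractive in the low-data regimes the thesis studies.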