Workshop on music knowledge extraction using machine learning on December 4th

Last Sunday (December 4th, 2016), the day before the NIPS conference, the UPF hosted a workshop on music knowledge extraction using machine learning. The workshop took place at the Ciutadella campus and attracted researchers who were visiting NIPS and work on sound and music analysis.

Attendees were able to get familiar with work from both academia and industry.

The first session was given by academic researchers. A welcome talk by Xavier Serra, head of the Music Technology Group, was followed by two talks on applications of matrix factorisation to audio signal processing.

The first speaker, Cédric Févotte from CNRS/IRIT, presented his work on “Non-negative matrix factorisation and some applications in audio signal processing”. After giving a general overview of NMF and its applications, he focused on two specific divergence measures, the weighted Kullback-Leibler divergence and the Itakura-Saito divergence, for supervised and blind audio source separation. The presentation was followed by a demo that showed the high quality of the proposed method for audio source separation.
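To make the idea of NMF under these two divergences concrete, here is a minimal sketch of the widely used multiplicative-update scheme for the beta-divergence family, where beta = 1 gives the (generalised) Kullback-Leibler divergence and beta = 0 gives the Itakura-Saito divergence. It is an illustration only, not the speaker's implementation: the function names, the random initialisation and the iteration count are assumptions, and the weighted KL variant mentioned in the talk is not reproduced.

```python
import numpy as np

def beta_divergence(V, V_hat, beta):
    """Sum of element-wise divergences between V and its approximation V_hat."""
    if beta == 1:  # generalised Kullback-Leibler divergence
        return np.sum(V * np.log(V / V_hat) - V + V_hat)
    if beta == 0:  # Itakura-Saito divergence
        ratio = V / V_hat
        return np.sum(ratio - np.log(ratio) - 1)
    raise ValueError("this sketch only handles beta = 0 (IS) or beta = 1 (KL)")

def nmf(V, rank, beta=1, n_iter=200, seed=0, eps=1e-12):
    """Factorise a non-negative matrix V (freq x frames) as W @ H.

    Hypothetical helper: multiplicative updates keep W and H non-negative
    because every factor in the update is non-negative.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, rank)) + eps   # spectral templates
    H = rng.random((rank, n_frames)) + eps  # activations over time
    for _ in range(n_iter):
        V_hat = W @ H + eps
        H *= (W.T @ (V * V_hat ** (beta - 2))) / (W.T @ V_hat ** (beta - 1) + eps)
        V_hat = W @ H + eps
        W *= ((V * V_hat ** (beta - 2)) @ H.T) / (V_hat ** (beta - 1) @ H.T + eps)
    return W, H
```

In a source separation setting, V would typically be a magnitude or power spectrogram, the columns of W would act as spectral templates, and the rows of H would indicate when each template is active.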

The second talk was given by Rémi Flamary from CNRS/OCA on the same topic of non-negative matrix factorisation for audio processing. He provided more details on their latest work on music transcription, presented at NIPS. Namely, he presented cost matrices for energy transportation, which is invariant to shifts of energy to harmonically related frequencies, as well as to small and local displacements of energy.

The final talk of the first part of the workshop was given by Katherine M. Kinnaird from Brown University on the topic of music comparison. She introduced aligned hierarchies as a low-dimensional representation of audio data. This representation transforms a complex audio structure into a single object that encodes hierarchical decompositions. To compute aligned hierarchies, the author first extracts all repetitions of all possible lengths from a self-similarity or self-dissimilarity matrix, then filters the extracted segments and composes the structure components. Check out her publication for more details!
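As a rough illustration of the first step only (extracting repetitions from a self-dissimilarity matrix), the sketch below looks for diagonal runs of a given length whose entries all fall under a threshold. It is not the author's algorithm: the function names, the Euclidean distance and the threshold are assumptions, and the later filtering and composition steps are not shown.

```python
import math

def self_dissimilarity(frames):
    """Pairwise Euclidean distances between feature frames (list of tuples)."""
    return [[math.dist(p, q) for q in frames] for p in frames]

def find_repeats(D, length, threshold):
    """Return (i, j) pairs, i < j, where frames i..i+length-1 recur at j..j+length-1.

    A repetition shows up as a diagonal stripe of small values in D,
    so we test every diagonal run of the requested length.
    """
    n = len(D)
    pairs = []
    for i in range(n - length + 1):
        for j in range(i + 1, n - length + 1):
            if all(D[i + k][j + k] <= threshold for k in range(length)):
                pairs.append((i, j))
    return pairs

# Toy sequence with the pattern (0, 1, 2) appearing twice.
frames = [(0.0,), (1.0,), (2.0,), (9.0,), (0.0,), (1.0,), (2.0,)]
D = self_dissimilarity(frames)
repeats_by_length = {L: find_repeats(D, L, threshold=0.0) for L in range(1, len(frames))}
```

Running this for every length, as in the last line, mirrors the "all repetitions of all possible lengths" idea, at the cost of a brute-force search that would need to be made more efficient for real recordings.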

During the coffee break, several researchers currently hosted at the DTIC-UPF presented their ongoing work.

After the break, the workshop focused on applied research.

First, Oriol Nieto presented the highlights of using deep neural networks (DNNs) at Pandora. In particular, they use deep learning to estimate the 400 musical attributes of the Music Genome Project with a multi-label classification model based on CNNs. The training process is supported by 1.5M tracks, each manually annotated by at least 2 experts. Moreover, beyond content-based information retrieval, he described how Pandora uses different ensembles of recommenders (some based on deep learning) for personalised music recommendations.

Next, Sageev Oore introduced a technique for interactive musical improvisation using an LSTM-based recurrent neural network over raw MIDI input. The main idea behind this approach to interactive feedback and improvisation is “transform and repeat”. The authors trained an autoencoder to extract latent semantic musical content, in a way similar to the word2vec model, in order to find similar riffs that embody call-response relationships. Check out the demo!

After that, Aäron Van den Oord from Google DeepMind presented WaveNet, a generative model for raw audio. The model uses so-called causal dilated convolutions, which provide larger receptive fields at a lower cost. Moreover, they embedded speaker-related and linguistic information for better speech generation. They also demonstrated sounds generated by a model without text conditioning, which produces word-like sounds. You can also find some examples of generated audio in the official blog post.
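The following minimal sketch shows what a causal dilated convolution computes and why stacking such layers grows the receptive field quickly. It is a plain-Python illustration, not WaveNet itself: the function names, the kernel size of 2 and the dilation schedule are assumptions chosen for the example, and the real model adds gating, residual connections and learned weights.

```python
def causal_dilated_conv(x, weights, dilation):
    """y[t] = sum_k weights[k] * x[t - k * dilation]; samples before t = 0 count as zero."""
    y = []
    for t in range(len(x)):
        total = 0.0
        for k, w in enumerate(weights):
            past = t - k * dilation  # causal: never reads an index greater than t
            if past >= 0:
                total += w * x[past]
        y.append(total)
    return y

def receptive_field(kernel_size, dilations):
    """Number of input samples that can influence one output of the stacked layers."""
    return 1 + (kernel_size - 1) * sum(dilations)

# Push an impulse through three layers with dilations 1, 2, 4 (illustrative values).
signal = [1.0] + [0.0] * 15
for d in (1, 2, 4):
    signal = causal_dilated_conv(signal, [1.0, 1.0], d)
# The impulse now spreads over receptive_field(2, [1, 2, 4]) == 8 samples.
```

With a kernel of size 2, doubling the dilation at each layer doubles the receptive field while adding only one layer, which is where the "larger receptive fields at a lower cost" comes from.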

The last presentation was given by Colin Raffel. He described the main results of his PhD thesis, namely the Lakh MIDI Dataset, how it was built, and how to use it. The main challenge in creating the dataset was matching and aligning a huge collection of MIDI files with the Million Song Dataset. The obvious method for audio-to-MIDI alignment and matching is dynamic time warping (DTW), but unfortunately it does not scale to large collections. To solve this problem, he proposed a method for fast sequence matching. As a result, the approach achieved a speedup of about 500x compared to the DTW baseline.
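For readers unfamiliar with the baseline, here is a minimal sketch of textbook DTW between two feature sequences. It is only meant to show why the baseline is expensive, and it is not the fast matching method from the talk: the function name, the Euclidean frame distance and the step pattern are assumptions.

```python
import math

def dtw_cost(a, b):
    """Total cost of the cheapest monotonic alignment between sequences a and b.

    a and b are lists of equal-dimension feature frames (tuples of floats).
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # acc[i][j] = best cost of aligning the first i frames of a with the first j of b
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            acc[i][j] = d + min(acc[i - 1][j],      # a advances, b repeats a frame
                                acc[i][j - 1],      # b advances, a repeats a frame
                                acc[i - 1][j - 1])  # both advance
    return acc[n][m]

# A time-stretched copy aligns at zero cost, which is what makes DTW tempo-tolerant.
melody = [(0.0,), (1.0,), (2.0,), (1.0,)]
stretched = [frame for frame in melody for _ in range(2)]
cost = dtw_cost(melody, stretched)
```

The double loop fills an n-by-m table for every pair of sequences, so comparing each MIDI file against each audio track in a large collection quickly becomes impractical.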

We sincerely thank all the speakers for their excellent presentations and all the participants for their attention. We hope to meet you all again!