Seminar on music knowledge extraction using machine learning
December 4th 2016, 15:00h - 19:00h
Taking advantage of the researchers coming to Barcelona for the NIPS conference (https://nips.cc/), we are organizing this small and informal seminar at the UPF. The goal is to have an open discussion on various topics related to machine learning applied to music, with special emphasis on its knowledge extraction aspects.
Program (abstracts below)
15:00h - 16:30h Part 1
- Welcome - Xavier Serra (Music Technology Group, DTIC-UPF)
- Cédric Févotte (CNRS): Nonnegative matrix factorisation and some applications in audio signal processing
- Valentin Emiya (Aix-Marseille Université, CNRS): Optimal spectral transportation with application to music transcription
- Katherine M. Kinnaird (Brown University): Aligned Hierarchies - A Multi-Scale Structure-Based Representation for Music-Based Data Streams
16:30h - 17:15h - Coffee break and poster session. Posters:
- Merlijn Blaauw, Jordi Bonada (Music Technology Group, DTIC-UPF): A singing synthesizer based on PixelCNN
- Georgi Dzhambazov (Music Technology Group, DTIC-UPF): A probabilistic model for lyrics-to-audio alignment based on knowledge of singing voice specific characteristics
- Rong Gong (Music Technology Group, DTIC-UPF): Phonetic segmentation of jingju singing syllables by using HMM and boundary pattern classification
- Tetsuro Kitahara (Nihon University): Four-part Harmonization Using Probabilistic Models
- Marius Miron (Music Technology Group, DTIC-UPF): Supervised source separation for classical music - from non-negative matrix factorization to deep learning
- Sergio Oramas, Francesco Barbieri (Music Technology Group / Natural Language Processing Group, DTIC-UPF): Deep Text-based Music Recommendation
- Jordi Pons (Music Technology Group, DTIC-UPF): Thinking efficient deep learning architectures for modeling music
- Olga Slizovskaia (Music Technology Group & Image Processing Group, DTIC-UPF): Musical instrument recognition in user-generated videos using a multimodal convolutional neural network architecture
17:15h - 19:00h - Part 2
- Oriol Nieto (Pandora): Deep Learning for Large Scale Music Recommendation
- Sageev Oore (Saint Mary's University): Extracting Semantic Content for Musical Call and Response
- Aäron Van den Oord (Google Deep Mind): WaveNet: A Generative Model for Raw Audio
- Colin Raffel (Google Brain): The Lakh MIDI Dataset: How it was Made, and How to Use it
Registration is free but mandatory.
We are also organising the 13th European Workshop on Reinforcement Learning at the same venue on December 3rd - 4th.
Cédric Févotte (CNRS): Nonnegative matrix factorisation and some applications in audio signal processing
Data is often available in matrix form, in which columns are samples, and processing of such data often entails finding an approximate factorisation of the matrix into two factors. The first factor (the “dictionary”) yields recurring patterns characteristic of the data. The second factor (the “activation matrix”) describes in which proportions each data sample is made of these patterns. In the last 15 years, nonnegative matrix factorisation (NMF) has become a popular technique for analysing data with non-negative values, with applications in many areas such as text information retrieval, hyper-spectral imaging and audio signal processing. The presentation will give an overview of NMF and will describe spectral unmixing applications in music engineering (source separation, denoising).
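As a minimal illustration of the factorisation described above, the following sketch (not the speaker's code; plain NumPy with Euclidean-loss multiplicative updates, on made-up toy data) recovers V ≈ WH with nonnegative factors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy nonnegative data: 20 "spectra" (columns) built from 3 hidden patterns.
true_W = rng.random((30, 3))
true_H = rng.random((3, 20))
V = true_W @ true_H

# Multiplicative updates for V ~= W H under Euclidean loss (Lee & Seung style).
k = 3
W = rng.random((30, k)) + 0.1
H = rng.random((k, 20)) + 0.1
eps = 1e-9
for _ in range(1000):
    H *= (W.T @ V) / (W.T @ W @ H + eps)   # update activations
    W *= (V @ H.T) / (W @ H @ H.T + eps)   # update dictionary

# Relative reconstruction error; the factors stay nonnegative by construction.
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

The multiplicative form guarantees nonnegativity: each entry is only ever rescaled by a nonnegative ratio, never subtracted below zero.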
Valentin Emiya (Aix-Marseille Université, CNRS, LIF): Optimal spectral transportation with application to music transcription
Many spectral unmixing methods rely on the non-negative decomposition of spectral data onto a dictionary of spectral templates. In particular, state-of-the-art music transcription systems decompose the spectrogram of the input signal onto a dictionary of representative note spectra. The typical measures of fit used to quantify the adequacy of the decomposition compare the data and template entries frequency-wise. As such, small displacements of energy from one frequency bin to another as well as variations of timbre can disproportionally harm the fit. We address these issues by means of optimal transportation and propose a new measure of fit that treats the frequency distributions of energy holistically as opposed to frequency-wise. Building on the harmonic nature of sound, the new measure is invariant to shifts of energy to harmonically-related frequencies, as well as to small and local displacements of energy. Equipped with this new measure of fit, the dictionary of note templates can be considerably simplified to a set of Dirac vectors located at the target fundamental frequencies (musical pitch values). This in turn gives ground to a very fast and simple decomposition algorithm that achieves state-of-the-art performance on real musical data.
Katherine M. Kinnaird (Brown University): Aligned Hierarchies - A Multi-Scale Structure-Based Representation for Music-Based Data Streams
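The contrast between frequency-wise and transport-based comparison can be seen on a toy example. The sketch below is a deliberate simplification: it uses SciPy's generic 1-D `wasserstein_distance` on made-up spectra, not the authors' harmonic-invariant transport cost:

```python
import numpy as np
from scipy.stats import wasserstein_distance

bins = np.arange(10)
# Two normalized toy "spectra": the same single peak, displaced by one frequency bin.
a = np.zeros(10); a[4] = 1.0
b = np.zeros(10); b[5] = 1.0

# Frequency-wise (Euclidean) comparison: the tiny displacement looks like a
# total mismatch, since the peaks never overlap bin-by-bin.
euclid = np.linalg.norm(a - b)

# Transport-based comparison: moving one unit of mass by one bin is cheap.
ot = wasserstein_distance(bins, bins, a, b)
```

Here `euclid` is sqrt(2), its maximum possible value for unit peaks, while the transport cost is just 1 bin, which is why a transport-based fit tolerates small spectral displacements.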
In this talk, we will present the aligned hierarchies, a novel low-dimensional representation for music-based data streams, such as recordings of songs or digitized representations of scores. This representation encodes all hierarchical decompositions of repeated elements from a high-dimensional and noisy music-based data stream into one object. These aligned hierarchies are constructed by finding, encoding, and synthesizing all repeated structure present in a music-based data stream. Additionally, aligned hierarchies can be embedded into a classification space with a natural notion of distance. For a data set of digitized scores, we conducted fingerprint-task experiments that achieved perfect precision-recall values. These experiments provide an initial proof of concept for aligned hierarchies addressing MIR tasks.
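A minimal sketch of the kind of repetition-finding that underlies such structure-based representations (hypothetical toy features, not the authors' algorithm): exact repeats show up as zero-valued diagonals in a self-dissimilarity matrix.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy feature sequence with a repeat: segment A (frames 0-3) recurs at frames 8-11.
A = rng.random((4, 6))
B = rng.random((4, 6))
feats = np.vstack([A, B, A])          # 12 frames of 6-dim features

# Pairwise frame distances: the self-dissimilarity matrix.
D = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=2)

# The diagonal starting at (0, 8) records the A-to-A repeat (all zeros here,
# since the repeat is exact in this toy example).
repeat_diag = np.array([D[i, i + 8] for i in range(4)])
```

Real systems then encode where such low-distance diagonals start, how long they are, and how they nest, which is the raw material for a hierarchy of repeats.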
Oriol Nieto (Pandora): Deep Learning for Large Scale Music Recommendation
Music streaming services nowadays offer large catalogues that expose new challenges in terms of automatically recommending music to tens of millions of users. At Pandora, around 75 billion "thumbs" and manual analyses for over 1.5 million songs have been gathered over the years. In this talk we will leverage some of these data by applying deep networks to both audio content and metadata with the goal of improving the listener experience. The basics of collaborative filtering and machine listening will be reviewed, framed under the music recommendation umbrella, and enhanced with various deep learning techniques.
Sageev Oore (Saint Mary's University): Extracting Semantic Content for Musical Call and Response
Call and Response is a fundamental element of musical expression, in which one phrase is a “call”, and the next phrase, usually played by one or more musicians other than the caller, is a “response”. We consider modeling this complex and fluid interaction. First, we train a deep autoencoder on a set of musical phrases, learning to extract underlying semantic musical content. We then create a customized data set of calls and responses, and use our previously trained autoencoder to embed these phrases into a latent semantic space. Working in this latent semantic space enables us to learn a relatively simple model of the relationship between call and response from a very small data set, which we use to drive a musical improvising system.
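The pipeline can be caricatured in a few lines of NumPy. Here an (uncentered) PCA stands in for the trained autoencoder and the call-to-response relation is a made-up linear transform; only the overall recipe, embed first and then fit a simple map from very few pairs, mirrors the talk:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy "phrases": 40-dim vectors that really live on a 3-dim latent manifold.
latent = rng.random((200, 3))
mix = rng.random((3, 40))
phrases = latent @ mix

# Stand-in for the trained autoencoder: project onto the top-3 singular directions.
_, _, Vt = np.linalg.svd(phrases, full_matrices=False)
encode = lambda X: X @ Vt[:3].T

# Tiny call/response set; the "response" is a fixed transform of the call's latent code.
scale = np.array([1.0, -1.0, 0.5])
calls = phrases[:20]
responses = (latent[:20] * scale) @ mix

# Learn a simple linear map between latent codes from only 20 pairs.
Zc, Zr = encode(calls), encode(responses)
M, *_ = np.linalg.lstsq(Zc, Zr, rcond=None)

# Generalization to unseen calls.
pred_latent = encode(phrases[20:40]) @ M
true_latent = encode((latent[20:40] * scale) @ mix)
err = np.linalg.norm(pred_latent - true_latent) / np.linalg.norm(true_latent)
```

The point of working in the latent space is visible in the numbers: a 3x3 map is learnable from 20 pairs, whereas a 40x40 map in phrase space would not be.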
Aäron Van den Oord (Google Deep Mind): WaveNet: A Generative Model for Raw Audio
This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Mandarin. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity. When trained to model music, we find that it generates novel and often highly realistic musical fragments. We also show that it can be employed as a discriminative model, returning promising results for phoneme recognition.
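One concrete, easily reproduced ingredient of WaveNet is its mu-law companding of raw audio into 256 discrete levels before autoregressive prediction. A sketch of that quantization step using the standard mu-law formulas (mu = 255):

```python
import numpy as np

def mu_law_encode(x, mu=255):
    # Compand audio in [-1, 1], then quantize to mu + 1 integer levels.
    # The log curve spends most of the levels near zero, where hearing is sensitive.
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.round((y + 1) / 2 * mu).astype(int)

def mu_law_decode(q, mu=255):
    # Inverse companding: map the integer level back to a waveform sample.
    y = 2 * q / mu - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

x = np.linspace(-1, 1, 101)
q = mu_law_encode(x)
x_hat = mu_law_decode(q)
max_err = np.max(np.abs(x - x_hat))
```

Quantizing to 256 levels turns next-sample prediction into a 256-way softmax classification, which is what makes the fully probabilistic autoregressive formulation tractable.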
Colin Raffel (Google Brain): The Lakh MIDI Dataset: How it was Made, and How to Use it
MIDI files which are matched and aligned to corresponding audio recordings provide a bounty of information for music informatics. The lack of reliable metadata in MIDI files necessitates content-based analysis for determining whether a MIDI file matches a given audio recording. We therefore present methods for learning efficient representations of sequences using convolutional networks. Our first approach learns a mapping from sequences of feature vectors to downsampled sequences of binary vectors, providing quadratic speed gains and substantially faster distance calculations. For further speedup, we present an approximate pruning method which involves embedding sequences as fixed-length vectors in a Euclidean space by using a form of attention that integrates over time. These techniques enabled the creation of the Lakh MIDI dataset, the largest collection of MIDI files which have been matched and aligned to corresponding audio recordings.
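The speedup from binary representations comes from replacing expensive feature-wise distances with XOR-and-popcount Hamming distances. A sketch with random stand-in codes (in the actual work the codes are learned by a convolutional network):

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in for learned binary codes: 32-bit codes for a "database" of sequences.
db = rng.integers(0, 2, size=(1000, 32), dtype=np.uint8)
query = db[42].copy()
query[:2] ^= 1          # a near-duplicate of entry 42, with two bits flipped

# Pack the bits into bytes so each comparison is one XOR plus a popcount.
packed_db = np.packbits(db, axis=1)        # shape (1000, 4)
packed_q = np.packbits(query)              # shape (4,)
dists = np.unpackbits(packed_db ^ packed_q, axis=1).sum(axis=1)

best = int(np.argmin(dists))
```

Candidates surviving this cheap Hamming screen can then be passed to the exact (expensive) alignment step, which is the pruning strategy the abstract describes.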
POSTER SESSION: Abstracts
Merlijn Blaauw, Jordi Bonada (Music Technology Group, DTIC-UPF): A singing synthesizer based on PixelCNN
One of the challenges in singing synthesis is that we generally want to be able to synthesize any timbre at any pitch, and obtaining a dataset in which all possible combinations are sufficiently densely sampled can be costly. The most straightforward solution to this issue is to model pitch and timbre independently, using a vocoder to separate the signal into F0 and a spectral envelope mostly free of harmonic interferences. While using a vocoder inevitably introduces some degradation in sound quality, synthesis results using parameters predicted by a model still tend to be noticeably worse than the upper bound a high-quality vocoder can provide (i.e. analysis-synthesis using e.g. STRAIGHT). The recently proposed PixelCNN class of autoregressive generative models has shown very promising results in a number of domains, including modeling natural images, raw audio waveforms, text and video. Besides being relatively straightforward to optimize, a distinctive feature of this type of model is that it tends to produce high-quality samples with very little over-smoothing. Thus, we investigate whether using a conditional Gated PixelCNN to model the spectral envelope of the singing voice can bring us a step closer to higher-quality parametric singing synthesis.
Georgi Dzhambazov (Music Technology Group, DTIC-UPF): A probabilistic model for lyrics-to-audio alignment based on knowledge of singing voice specific characteristics
Traditional approaches to the automatic alignment of lyrics and music audio adopt the phonetic recognizer methodology devised for the related problem of speech-to-text alignment. In this work we propose a probabilistic approach to automatic lyrics-to-audio alignment which extends a phonetic recognizer with singing-specific knowledge. Based on the notion that short-term changes of singing voice timbre are driven by higher-level musical events, we consider syllable durations and automatically extracted vocal note onsets. To this end we formulate rules that guide the transition between consecutive phonemes, based on learnt syllable durations and the presence of note onsets. These rules are incorporated into the transition matrix of a hidden Markov model (HMM), whereas phonetic timbre is modelled by a deep multilayer perceptron (MLP). Evaluation is carried out on both a cappella and polyphonic singing from Turkish art music. Results show that the proposed modifications surpass a baseline unaware of singing voice events.
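The decoding step in such systems is standard Viterbi over an HMM whose transition matrix carries the singing-specific rules. A generic Viterbi sketch on a toy left-to-right two-phoneme model (the duration/onset rules are reduced here to fixed, made-up transition probabilities):

```python
import numpy as np

def viterbi(log_obs, log_trans, log_init):
    # log_obs: (T, S) frame log-likelihoods; log_trans: (S, S); returns best state path.
    T, S = log_obs.shape
    delta = log_init + log_obs[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans      # scores[i, j]: come from i, go to j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_obs[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                # backtrace
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Left-to-right 2-phoneme HMM: the acoustics switch at frame 3, and the transition
# prior (a stand-in for the duration/onset rules) controls where the boundary lands.
log_obs = np.log(np.array([[0.9, 0.1]] * 3 + [[0.1, 0.9]] * 3))
log_trans = np.log(np.array([[0.8, 0.2], [1e-12, 1.0]]))
log_init = np.log(np.array([1.0 - 1e-12, 1e-12]))

path = viterbi(log_obs, log_trans, log_init)
```

Making the transition probabilities time-varying (boosting 0 → 1 when a note onset is detected) is exactly the kind of rule the abstract folds into the transition matrix.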
Rong Gong (Music Technology Group, DTIC-UPF): Phonetic segmentation of jingju singing syllables by using HMM and boundary pattern classification
In jingju singing training and performance, much attention is paid to the clear pronunciation of each syllable part: head (tou), belly (fu) and tail (wei). Each of them is composed of one or more phonemes. A precise phonetic segmentation of the jingju singing syllable is a prerequisite for automatic jingju singing evaluation. In this study, we introduce a new jingju singing dataset annotated at the phonetic level, which includes two role-types: dan and laosheng, and we develop a supervised phonetic segmentation method based on HMMs and boundary pattern classification. The HMM topology and the monophone acoustic models are constructed from the audio and the annotations of the training set. The phonetic boundaries are decoded by a Viterbi algorithm incorporating varying transition probabilities. The candidates are finally verified by the boundary pattern classifiers. The experiment shows that the proposed method outperforms the state-of-the-art method on the dan dataset.
Tetsuro Kitahara (Nihon University): Four-part Harmonization Using Probabilistic Models
We present a method for four-part harmonization using Bayesian networks. Four-part harmonization, which consists of generating alto, tenor, and bass voices for a given soprano melody, is a fundamental problem in harmony, and various techniques have been proposed by different researchers. Here, we report some results of investigating how to design Bayesian networks for this harmonization. (This work was previously published at the Sound and Music Computing Conference 2013 and in the Journal of New Music Research.)
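In spirit, and with entirely made-up probabilities and stock voicings, the task can be caricatured as picking, per melody note, the most probable chord from a conditional table and reading the lower voices off that chord (the actual model reasons jointly over the whole sequence):

```python
# Hypothetical conditional tables: P(chord | soprano note), plus one fixed
# voicing (alto, tenor, bass) per chord. All numbers are illustrative only.
p_chord_given_sop = {
    "C": {"I": 0.7, "IV": 0.2, "V": 0.1},
    "D": {"V": 0.6, "I": 0.2, "IV": 0.2},
    "E": {"I": 0.6, "IV": 0.3, "V": 0.1},
}
voicing = {"I": ("E4", "G3", "C3"), "IV": ("F4", "A3", "F3"), "V": ("D4", "G3", "G3")}

def harmonize(soprano):
    # Per note, take the chord with the highest conditional probability,
    # then look up the stock alto/tenor/bass voicing for that chord.
    chords = [max(p_chord_given_sop[n], key=p_chord_given_sop[n].get) for n in soprano]
    return chords, [voicing[c] for c in chords]

chords, atb = harmonize(["C", "D", "E"])
```

A Bayesian network improves on this note-by-note caricature by also conditioning each chord on its neighbours, so that progressions, not just single chords, are probable.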
Marius Miron (Music Technology Group, DTIC-UPF): Supervised source separation for classical music - from non-negative matrix factorization to deep learning
Deep learning approaches have become increasingly popular for blind source separation in pop music. However, in the more complex case of classical music, performances can differ in terms of tempo, dynamics, micro-timing and instrument timbre. To that end, non-negative matrix factorization (NMF) methods are preferable, because the constraints embedded in the model make it more robust, especially when training data is scarce. We present the similarities between the NMF harmonic model and the convolutional neural network (CNN) in terms of mathematical representation and learning function. Furthermore, we propose several strategies to improve the performance of a CNN for classical music, making it more robust to tempo, dynamics, and micro-timing.
Sergio Oramas, Francesco Barbieri (Music Technology Group / Natural Language Processing Group, DTIC-UPF): Deep Text-based Music Recommendation
Most current Music Recommender Systems rely mainly on usage data. However, when new artists are introduced into the system, no such information exists. To tackle this cold-start problem, hybrid recommendation approaches that combine tags and collaborative information have been proposed, as well as approaches based on learning latent factor representations from audio content. In this work, we propose a new method that learns an artist latent factor representation directly from the artist's biography using a deep learning architecture. Once the latent factors of new artists are learnt, recommendations can be easily computed.
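A minimal stand-in for that pipeline, with made-up biographies and latent factors: bag-of-words text features mapped to collaborative latent factors by ridge regression (the actual work uses a deep architecture, but the cold-start logic is the same):

```python
import numpy as np

# Toy biographies and hypothetical 2-d collaborative latent factors
# (axis 0 ~ "electronic", axis 1 ~ "acoustic"). All values are made up.
bios = [
    "electronic dance producer synths",
    "techno producer electronic club",
    "folk singer acoustic guitar",
    "acoustic songwriter folk guitar",
]
factors = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])

# Bag-of-words features over the corpus vocabulary.
vocab = sorted({w for b in bios for w in b.split()})
X = np.array([[b.split().count(w) for w in vocab] for b in bios], dtype=float)

# Ridge-regularized linear map from text features to latent factors.
lam = 0.1
W = np.linalg.solve(X.T @ X + lam * np.eye(len(vocab)), X.T @ factors)

# A cold-start artist with no usage data, scored from the biography alone.
new_bio = "electronic producer synths"
x_new = np.array([new_bio.split().count(w) for w in vocab], dtype=float)
pred = x_new @ W
```

The predicted factors land in the same space as the collaborative ones, so the new artist can be recommended with the usual dot-product scoring despite having no listening history.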
Jordi Pons (Music Technology Group, DTIC-UPF): Thinking efficient deep learning architectures for modeling music
The deep learning literature in speech and image classification has significantly influenced the music informatics research community. For example, many researchers use small squared filters on music spectrograms, as in computer vision. However, how do we know that the relevant local stationarities in music spectrograms can be modeled by small squared filters? It is still not clear which architectures best fit music audio, and discovering the adequate combination of parameters for a particular task is hard, which leads to architectures that are difficult to interpret. Given this, it might be interesting to rationalize the design process by exploring deep learning architectures specifically devised to fit music audio, which will probably lead to more successful and understandable models. We will discuss our work with musically motivated architectures, designed considering the conclusions of a discussion of CNN filter shapes for music spectrograms.
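The contrast between small squared filters and musically motivated shapes can be seen on a toy spectrogram: a wide temporal filter preserves a sustained tone while attenuating a percussive click, and a tall spectral filter does the opposite (all sizes are illustrative):

```python
import numpy as np
from scipy.signal import convolve2d

# Toy spectrogram (freq x time): a sustained tone (one row) plus a click (one column).
S = np.zeros((20, 40))
S[8, :] = 1.0          # steady harmonic: energy spread across time
S[:, 25] += 1.0        # percussive click: energy spread across frequency

# Musically motivated shapes: a wide 1x9 temporal filter and a tall 9x1
# spectral filter, instead of the small squares borrowed from computer vision.
temporal = np.ones((1, 9)) / 9.0
spectral = np.ones((9, 1)) / 9.0

resp_t = convolve2d(S, temporal, mode="valid")
resp_s = convolve2d(S, spectral, mode="valid")

# The temporal filter keeps the tone row intact but smears the click 9x down;
# the spectral filter keeps the click column and smears the tone instead.
tone_kept = resp_t[8].min()
click_smeared = resp_t[0].max()
click_kept = resp_s[:, 25].min()
tone_smeared = resp_s[:, 0].max()
```

The point of the exercise: the filter shape encodes a hypothesis about which direction of the spectrogram is stationary, which is exactly the design question the abstract raises.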
Olga Slizovskaia (Music Technology Group & Image Processing Group, DTIC-UPF): Musical instrument recognition in user-generated videos using a multimodal convolutional neural network architecture
We present a method for recognizing musical instruments in user-generated videos. Musical instrument recognition is a well-known task in the music information retrieval (MIR) field, where it is mostly based on the analysis of good-quality audio material. Our scenario is much more challenging, as videos may contain multiple instruments sounding simultaneously and background noise, and they vary in recording conditions and quality. Our approach captures multimodal information embedded in the audio and visual domains. We develop a convolutional neural network (CNN) architecture which combines learned representations from both modalities at a late fusion stage.
Our approach is trained and evaluated on two large-scale video datasets, YouTube-8M and FCVID, and additionally evaluated on a novel dataset that we designed specifically for this task. The proposed architecture demonstrates state-of-the-art performance in audio and video object recognition, provides additional robustness to missing modalities, and remains computationally cheap to train.
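A schematic of late fusion with toy dimensions (linear layers stand in for the CNN branches; all sizes are made up): each modality is embedded separately and the concatenated embeddings feed one shared head, which also makes a missing modality easy to simulate:

```python
import numpy as np

rng = np.random.default_rng(4)

def branch_embed(x, W):
    # Stand-in for a CNN branch: one linear layer + ReLU producing a modality embedding.
    return np.maximum(x @ W, 0)

# Toy audio (64-d) and visual (128-d) features for one video clip.
audio = rng.random(64)
video = rng.random(128)
W_audio = rng.standard_normal((64, 32))
W_video = rng.standard_normal((128, 32))
W_out = rng.standard_normal((64, 10))     # shared classifier head over 10 instruments

# Late fusion: embed each modality independently, then concatenate and classify.
fused = np.concatenate([branch_embed(audio, W_audio), branch_embed(video, W_video)])
logits = fused @ W_out

# Robustness to a missing modality: zero the video branch and reuse the same head.
fused_audio_only = np.concatenate([branch_embed(audio, W_audio), np.zeros(32)])
logits_audio_only = fused_audio_only @ W_out
```

Because fusion happens after each branch has produced its own embedding, dropping one modality degrades the input to the head gracefully instead of breaking the network's input shape, which is one common motivation for late over early fusion.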