The latest in audio signal processing and in musical information presented

At the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), to be held from 12 to 17 May in Brighton (UK), with extensive participation by members of the Music Technology Group's Audio Signal Processing Lab and Music Information Research Lab.


The Music Technology Group (MTG) is participating in the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing, which will take place from 12 to 17 May in Brighton (UK), with recent work carried out by the Audio Signal Processing Lab, led by Xavier Serra, and the Music Information Research Lab, led by Emilia Gómez.

Overcoming the limitations of deep learning

Jordi Pons, Xavier Serra (2019), “Randomly weighted CNNs for (music) audio classification”; Jordi Pons, Joan Serrà, Xavier Serra (2019), “Training neural audio classifiers with few data”

Drawing a parallel with neurology, artificial neural networks make it possible to build systems that mimic the way the brain classifies the information it receives, for example by identifying objects from the features they contain. They learn through feedback: over successive rounds they receive a measure of how correct their decisions were and make the necessary corrections. It is a process of trial and error, much like the one we humans use when learning a new task.

The concept of deep learning refers to the fact that neural networks have a structure based on many layers. Deep neural networks have led to a paradigm shift in the field of automatic sound classification. Although they have managed to significantly improve the results of previously proposed models, their main disadvantages are that they need a lot of training data and require a large computational infrastructure.

“In the two studies we are presenting at the conference we attempt to better understand these two disadvantages, and we ask ourselves: can we train deep neural networks with less data? Can we accurately compare different neural network architectures despite limiting our experiments to a small computational infrastructure? Our articles show that it is possible to train competent models with very little data and that it is also possible to compare different neural network architectures despite having few computational resources”, says Xavier Serra.
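The idea suggested by the first paper's title can be illustrated with a toy sketch: a convolutional front end whose weights are drawn at random and never trained acts as a fixed feature extractor, and only a lightweight classifier on top is fitted. Everything below is an assumption made for illustration (the filter sizes, the synthetic sine-wave "audio", the nearest-centroid classifier and all function names); it is not the authors' architecture or evaluation setup.

```python
import math
import random


def random_filters(n_filters, width, seed=0):
    """Draw fixed Gaussian filters; they are never trained."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(n_filters)]


def extract_features(signal, filters):
    """1-D convolution with each filter, ReLU, then global max pooling."""
    feats = []
    for f in filters:
        width = len(f)
        best = 0.0  # ReLU floor: negative activations are clipped to zero
        for start in range(len(signal) - width + 1):
            act = sum(w * x for w, x in zip(f, signal[start:start + width]))
            best = max(best, act)
        feats.append(best)
    return feats


def fit_centroids(features, labels):
    """The only fitted part: one mean feature vector per class."""
    sums, counts = {}, {}
    for feat, lab in zip(features, labels):
        acc = sums.setdefault(lab, [0.0] * len(feat))
        for i, v in enumerate(feat):
            acc[i] += v
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}


def predict(feat, centroids):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(lab):
        return sum((a - b) ** 2 for a, b in zip(feat, centroids[lab]))
    return min(centroids, key=sq_dist)


def tone(freq, n=64):
    """Synthetic stand-in for an audio clip: a sine wave of n samples."""
    return [math.sin(2 * math.pi * freq * t / n) for t in range(n)]


filters = random_filters(n_filters=16, width=8)
train_signals = [tone(2), tone(3), tone(12), tone(13)]
train_labels = ["low", "low", "high", "high"]
centroids = fit_centroids(
    [extract_features(s, filters) for s in train_signals], train_labels
)
prediction = predict(extract_features(tone(4), filters), centroids)
```

Because the filters are fixed, trying a different front end only requires recomputing features and refitting the small classifier, which is one way architectures can be compared without a large training budget.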

Automated methods that mitigate noise in audio labels

Eduardo Fonseca, Manoj Plakal, Daniel P. W. Ellis, Frederic Font, Xavier Favory, Xavier Serra (2019), “Learning Sound Event Classifiers from Web Audio with Noisy Labels”

As sound event classification moves towards larger datasets, issues of label noise become inevitable. “In this work we characterize the noise of the labels and propose automatic methods for mitigating its effects”, explains Fonseca. To support this work, and to promote research into label noise in sound event classification, the authors present FSDnoisy18k, a dataset containing 42.5 hours of audio that covers 20 different sound classes with potentially incorrect labels.
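The article does not detail the mitigation methods. One common family of approaches is noise-robust loss functions, and the sketch below uses the generalized cross-entropy (L_q) loss purely as an illustration; it is an assumption here, not something this text confirms as the authors' choice. The function names and the default value of q are likewise hypothetical.

```python
import math


def cross_entropy(p_true):
    """Standard loss: grows without bound as the predicted probability
    of the (possibly wrong) label approaches zero."""
    return -math.log(p_true)


def lq_loss(p_true, q=0.7):
    """Generalized cross-entropy: (1 - p^q) / q.

    As q approaches 0 it tends to cross-entropy; at q = 1 it equals 1 - p.
    It is bounded by 1 / q, so a confidently rejected label cannot
    dominate training.
    """
    return (1.0 - p_true ** q) / q


# A clip whose label the model finds very unlikely (perhaps a wrong label):
suspicious = 0.01
penalty_ce = cross_entropy(suspicious)
penalty_lq = lq_loss(suspicious)
```

The bounded penalty means that examples the model strongly disagrees with, which are the ones most likely to carry incorrect labels, contribute less to the total loss than they would under plain cross-entropy.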

Artificial intelligence methods to separate sound sources

Olga Slizovskaia, Leo Kim, Gloria Haro, Emilia Gómez (2019), “End-to-End Sound Source Separation Conditioned On Instrument Labels”

In this paper, the authors show how to separate sound sources in music signals with a variable number of sources using a deep learning model. In particular, they extend the Wave-U-Net model to a variable number of sources and also propose feeding the neural network labels that indicate the presence or absence of each type of instrument, which helps improve separation. “This approach can be further extended to other types of conditioning such as audiovisual and score-informed separation”, explains Emilia Gómez.
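As a rough illustration of what conditioning a network on instrument labels can look like, the sketch below scales each internal feature channel by a gain computed from a binary presence vector. The multiplicative form, the instrument list, the weight values and the function name are all assumptions chosen for the example; the article does not say how or where the labels enter the model.

```python
def condition_features(features, labels, weights):
    """Scale each feature channel by a gain derived from the label vector.

    features: list of channels, each a list of values over time
    labels:   binary list, 1 if the instrument is present, else 0
    weights:  one row of per-instrument weights for each channel
              (hypothetical values; in a real model they would be learned)
    """
    gains = [sum(w * l for w, l in zip(row, labels)) for row in weights]
    return [[g * x for x in channel] for g, channel in zip(gains, features)]


INSTRUMENTS = ["violin", "flute", "trumpet"]  # hypothetical label set
features = [[0.5, 1.0, -0.5], [2.0, 0.0, 1.0]]
weights = [[1.0, 0.0, 0.5], [0.0, 2.0, 0.5]]

violin_only = condition_features(features, [1, 0, 0], weights)
nothing = condition_features(features, [0, 0, 0], weights)
```

With only the violin marked as present, the second channel is silenced; with no instrument present, every channel is. A network with a fixed set of outputs can handle a variable number of sources in this way, since absent instruments simply contribute nothing.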

Cloning of singing voice timbres

Merlijn Blaauw, Jordi Bonada, Ryunosuke Daido (2019), “Data Efficient Voice Cloning for Neural Singing Synthesis”

In this paper, the authors investigate the “cloning” of singing voice timbres from a relatively small number of recordings. “Using databases in English, Japanese, Catalan and Spanish, we show that 2-3 minutes of recording can provide the same results as those obtained before with an hour of data”, the authors explain. “We also investigate the specific case of choir singing, in which creating a number of voices efficiently is especially attractive”, they add. See some examples at:

Method of extracting voice from a musical mix

Pritish Chandna, Merlijn Blaauw, Jordi Bonada, Emilia Gómez (2019), “A Vocoder Based Method For Singing Voice Extraction”

This paper presents a new method for extracting a vocal track from a musical mixture. The musical mixture consists of a singing voice and a backing track which may consist of various instruments. The authors estimate the parameters of the singing voice, from which they synthesize the vocal track without any interference from the backing track. “We evaluate our system through objective metrics pertinent to audio quality and interference from background sources, and via a comparative subjective evaluation. We use open-source source separation systems based on Non-negative Matrix Factorization (NMF) and Deep Learning methods as benchmarks for our system and discuss future applications for this particular algorithm”, the authors explain.
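For readers unfamiliar with Non-negative Matrix Factorization, which the authors use only as a benchmark and not as their own method, the sketch below factorizes a small non-negative matrix V into W and H with the classic multiplicative update rules for the squared-error objective. The toy "spectrogram", its two spectral templates and all function names are assumptions for illustration, and a real separation system would involve considerably more than this.

```python
import random


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared reconstruction error between V and W H."""
    approx = matmul(w, h)
    return sum((x - y) ** 2 for rv, ra in zip(v, approx) for x, y in zip(rv, ra))


def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Approximate V (bins x frames) as W (bins x rank) times H (rank x frames)."""
    rng = random.Random(seed)
    n_rows, n_cols = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n_rows)]
    h = [[rng.random() + 0.1 for _ in range(n_cols)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_cols)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n_rows)]
    return w, h


# Toy magnitude "spectrogram": 4 frequency bins x 6 frames, built from two
# hypothetical spectral templates that switch on and off over time.
voice_template, backing_template = [1.0, 0.0, 2.0, 0.0], [0.0, 1.0, 0.0, 1.0]
voice_on, backing_on = [1, 1, 0, 0, 1, 0], [0, 1, 1, 1, 0, 1]
spectrogram = [
    [vt * va + bt * ba for va, ba in zip(voice_on, backing_on)]
    for vt, bt in zip(voice_template, backing_template)
]

w, h = nmf(spectrogram, rank=2)
error = frobenius_error(spectrogram, w, h)
```

The columns of W play the role of spectral templates and the rows of H their activations over time; in NMF-based separation, each source is then reconstructed from the subset of components assigned to it.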

At the Audio Signal Processing Lab, led by Xavier Serra, we work to advance the understanding of sound signals and music and address practical problems. We work on a variety of complementary issues covering the creation of collections of sound and music, the development of signal processing aimed at machine learning tasks and methods, and the use of semantic technologies for structuring concepts of sound and music.

The Music Information Research Lab, led by Emilia Gómez, works on the description of sound and music, music information retrieval, voice synthesis, separation of audio sources, and music and audio processing. Currently, its research focuses mainly on voice synthesis and transformation, source separation and automatic generation of soundscapes.






News published by:

Institutional Communication and Promotion Unit