Slizovskaia O, Gómez E, Haro G. Musical Instrument Recognition in User-generated Videos using a Multimodal Convolutional Neural Network Architecture. ACM International Conference on Multimedia Retrieval (ICMR 2017)

This paper presents a method for recognising musical instruments in user-generated videos. Musical instrument recognition from music signals is a well-known task in the music information retrieval (MIR) field, where current approaches rely on the analysis of good-quality audio material. This work addresses a real-world scenario with several research challenges, namely the analysis of user-generated videos that vary in recording conditions and quality and may contain multiple instruments sounding simultaneously as well as background noise. Our approach does not focus only on audio information but also exploits the multimodal information embedded in the audio and visual domains. To do so, we develop a Convolutional Neural Network (CNN) architecture that combines learned representations from both modalities at a late fusion stage.
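As a rough illustration of what combining representations at a late fusion stage can look like, the following minimal sketch concatenates one audio embedding and one visual embedding and classifies the joint vector. It is not the authors' implementation: the embedding sizes, the number of classes, the random weights, the single dense layer and the softmax output are all assumptions made for the example (a multi-label setup would more likely use per-class sigmoids).

```python
import math
import random


def linear(weights, bias, x):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(audio_embedding, visual_embedding, weights, bias):
    """Concatenate per-modality embeddings, then classify the joint vector."""
    fused = list(audio_embedding) + list(visual_embedding)
    return softmax(linear(weights, bias, fused))


# Hypothetical sizes: 4-dim audio embedding, 3-dim visual embedding, 5 classes.
AUDIO_DIM, VISUAL_DIM, NUM_CLASSES = 4, 3, 5
rng = random.Random(0)

# Random weights stand in for a trained fusion layer (assumption).
weights = [
    [rng.uniform(-1, 1) for _ in range(AUDIO_DIM + VISUAL_DIM)]
    for _ in range(NUM_CLASSES)
]
bias = [0.0] * NUM_CLASSES

# Random vectors stand in for the outputs of the audio and visual CNN branches.
audio_embedding = [rng.random() for _ in range(AUDIO_DIM)]
visual_embedding = [rng.random() for _ in range(VISUAL_DIM)]

probs = late_fusion(audio_embedding, visual_embedding, weights, bias)
print(probs)
```

In this sketch each modality is processed independently up to its embedding, and only the final classifier sees both, which is what distinguishes late fusion from fusing raw inputs early.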

Our approach is trained and evaluated on two large-scale video datasets: YouTube-8M and FCVID. The proposed architectures demonstrate state-of-the-art results in audio and video object recognition, provide additional robustness to missing modalities, and remain computationally cheap to train.
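One simple way a fused model can still be fed an input when a modality is missing is to substitute a zero vector for the absent embedding. The sketch below shows only that idea; the abstract does not say how missing modalities are handled, so zero-filling, the function name `fuse_with_missing` and the dimensions are assumptions for illustration.

```python
def fuse_with_missing(audio_embedding, visual_embedding, audio_dim, visual_dim):
    """Build the fused vector, zero-filling whichever modality is absent (None)."""
    if audio_embedding is None and visual_embedding is None:
        raise ValueError("at least one modality is required")
    audio = list(audio_embedding) if audio_embedding is not None else [0.0] * audio_dim
    visual = list(visual_embedding) if visual_embedding is not None else [0.0] * visual_dim
    return audio + visual


# Hypothetical 2-dim audio and 3-dim visual embeddings.
both = fuse_with_missing([1.0, 2.0], [3.0, 4.0, 5.0], 2, 3)
audio_only = fuse_with_missing([1.0, 2.0], None, 2, 3)
visual_only = fuse_with_missing(None, [3.0, 4.0, 5.0], 2, 3)

print(audio_only)   # [1.0, 2.0, 0.0, 0.0, 0.0]
print(visual_only)  # [0.0, 0.0, 3.0, 4.0, 5.0]
```

The fused vector keeps a fixed length in all three cases, so the same downstream classifier can be applied whether or not both modalities are available.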

Additional material:

Postprint in Zenodo
Code, extracted features, pre-trained models and experimental results (GitHub)
Datasets:
- YouTube-8M
- FCVID

Link: http://doi.org/10.1145/3078971.3079002

DTIC MdM Strategic Program: Artificial and Natural Intelligence for ICT and beyond
