We develop a large number of software tools and hosting infrastructures to support the research carried out at the Department. This section will detail the different tools available. For now, you can browse what is on offer in the UPF Knowledge Portal, the innovations created in the context of EU projects in the Innovation Radar, and the software sections of some of our research groups:

 Artificial Intelligence

 Nonlinear Time Series Analysis

 Web Research 

 

 Music Technology

Interactive Technologies

 Barcelona MedTech

Natural Language Processing

UbicaLab

Wireless Networking

Educational Technologies

GitHub

Zinemanas et al., Visual music transcription of clarinet video recordings trained with audio-based labelled data. CVAVM 2017.

Zinemanas P, Arias P, Haro G, Gómez E. Visual music transcription of clarinet video recordings trained with audio-based labelled data. ICCV 2017 Workshop on Computer Vision for Audio-Visual Media (CVAVM).

Abstract

Automatic transcription is a well-known task in the music information retrieval (MIR) domain and consists of computing a symbolic music representation (e.g. MIDI) from an audio recording. In this work, we address the automatic transcription of video recordings when the audio modality is missing or of insufficient quality, and we therefore analyze the visual information. We focus on the clarinet, which is played by opening and closing a set of holes and keys. We propose a method for automatic visual note estimation that detects the player's fingertips and measures their displacement with respect to the holes and keys of the clarinet. To this end, we track the clarinet and determine its position in every frame. The relative positions of the fingertips are used as features for a machine learning algorithm trained for note pitch classification. For that purpose, a dataset is built in a semi-automatic way by estimating pitch information from the audio signals of an existing collection of 4.5 hours of video recordings of six different songs performed by nine different players. Our results confirm that visual automatic transcription is more difficult than audio-based transcription, mainly due to motion blur and occlusions that cannot be resolved from a single view.
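
To make the classification stage of the pipeline more concrete, below is a minimal sketch in Python with NumPy and scikit-learn: per-frame fingertip displacements relative to the clarinet's holes and keys are fed to a classifier trained on pitch labels. This is an illustration under stated assumptions, not the authors' implementation; the feature dimensions, the number of pitch classes, and the choice of a random forest are all hypothetical, as the paper only specifies "a machine learning algorithm trained for note pitch classification".

# Minimal sketch of the note-classification stage described in the abstract.
# Assumption: fingertips have already been detected and their positions
# expressed relative to the tracked holes/keys of the clarinet.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

N_FRAMES = 1000          # labelled video frames (placeholder count)
N_FINGERTIPS = 8         # tracked fingertips (assumption)
N_PITCH_CLASSES = 12     # distinct clarinet pitches in this toy setup

# Feature vector per frame: (dx, dy) displacement of each fingertip with
# respect to its nearest hole/key, flattened to 2 * N_FINGERTIPS values.
# Random placeholders stand in for the real visual features.
X = rng.normal(size=(N_FRAMES, 2 * N_FINGERTIPS))

# Labels: pitch classes, obtained in the paper in a semi-automatic way
# from audio-based pitch estimation (random placeholders here).
y = rng.integers(0, N_PITCH_CLASSES, size=N_FRAMES)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

print("toy accuracy:", accuracy_score(y_test, clf.predict(X_test)))

With real features, the random placeholders above would be replaced by the displacement measurements extracted from the tracked frames; the classifier choice is interchangeable with any standard multi-class learner.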

Additional material: