Thesis linked to the implementation of the María de Maeztu Strategic Research Program.

Open-access PhD theses carried out at the Department can be found at TDX.

Please visit these pages for information on our PhD, MSc and BSc programs.


[PhD thesis] Machine learning and deep neural networks approach to modelling musical gestures

Author: David Cabrera Dalmazzo

Supervisor: Rafael Ramírez

Gestures can be defined as a form of non-verbal communication associated with an intention or the articulation of an emotional state. They are not only an intrinsic part of human language but also explain specific details of how body knowledge is executed. Gestures are studied not only in language research but also in dance, sports, rehabilitation, and music, where the term is understood as a “learned technique of the body”. In music education, gestures are therefore regarded as automatic motor abilities learned through repetitive practice, which allows performers to self-teach and fine-tune their motor actions. Those gestures are intended to become part of the performer’s technical repertoire for taking fast actions and decisions on the fly. They are assumed to be relevant not only to expressive capabilities in music but also to developing correct ‘energy-consumption’ habits that help avoid injuries.

In this thesis, we applied state-of-the-art machine learning (ML) techniques to model violin bowing gestures in professional players. Concretely, we recorded a database of expert performers and students at different levels, and developed three strategies to classify and recognise those gestures in real time:

a) First, we developed a multimodal synchronisation system to record audio, video and IMU sensor data with a unified time reference. We programmed a custom C++ application to visualise the output of the ML models, and implemented a Hidden Markov Model to detect fingering disposition and bow-stroke gesture performance.

b) Second, we built a system that extracts general temporal features from the gesture samples, using a dataset of audio and motion data from expert performers and a deep neural network. For this we implemented a hybrid CNN-LSTM architecture.

c) Third, we developed a Mel-spectrogram-based analysis that extracts patterns from audio data alone. This opens the option of recognising relevant information from audio recordings, without the need for external sensors, while achieving similar results.

These techniques are complementary and are incorporated into an educational application, a computer assistant that enhances music learners’ practice by providing useful real-time feedback. The application will be tested in a professional education institution.
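To illustrate how a Hidden Markov Model such as the one in strategy a) can turn a noisy sensor stream into a gesture sequence, here is a minimal Viterbi-decoding sketch. It is not the thesis implementation: the state names (`down_bow`, `up_bow`), the discretised observation symbols (`pos`, `neg`) and all probabilities are hypothetical values chosen only for illustration.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for the observations."""
    # Work in log space so long sequences do not underflow.
    # Assumes every probability used is strictly positive.
    scores = [{
        s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
        for s in states
    }]
    back = []  # back[t][s] = best predecessor of state s at step t + 1

    for obs in observations[1:]:
        prev = scores[-1]
        current, pointers = {}, {}
        for s in states:
            # Pick the predecessor that maximises the path score into s.
            best = max(states, key=lambda p: prev[p] + math.log(trans_p[p][s]))
            current[s] = (prev[best] + math.log(trans_p[best][s])
                          + math.log(emit_p[s][obs]))
            pointers[s] = best
        scores.append(current)
        back.append(pointers)

    # Trace the best path backwards from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical two-state model: bow direction inferred from the sign of a
# discretised IMU reading. All numbers below are illustrative assumptions.
states = ["down_bow", "up_bow"]
start_p = {"down_bow": 0.5, "up_bow": 0.5}
trans_p = {
    "down_bow": {"down_bow": 0.8, "up_bow": 0.2},
    "up_bow": {"down_bow": 0.2, "up_bow": 0.8},
}
emit_p = {
    "down_bow": {"pos": 0.7, "neg": 0.1},
    "up_bow": {"pos": 0.1, "neg": 0.7},
}

clean = viterbi(["pos", "pos", "neg", "neg"], states, start_p, trans_p, emit_p)
noisy = viterbi(["pos", "pos", "neg", "pos", "pos"], states, start_p, trans_p, emit_p)
print(clean)
print(noisy)
```

In the second call the single `neg` reading is treated as noise and the decoded path stays in `down_bow`, because switching state twice costs more than one unlikely emission. This smoothing behaviour is one reason HMMs suit gesture recognition from sensor data.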

Link to manuscript: