Entry by Olga Slizovskaia, member of the Music Information Research Lab at the Music Technology Group and of the Image Processing Group
The path of a PhD student is dangerous, full of adventurous discoveries, mistakes and, of course, learning new things.
When I started this path in January 2016, I thought it would be very straightforward, pretty much like:
It turned out to be way more interesting, more like:
Still, it’s interesting, and I’ll try to explain what my mistakes, discoveries and personal changes have been so far.
So I want to share an important “learning from a mistake” experience.
The most significant mistake I made was working on my own and not reaching out for collaboration and help when it was necessary. There is no need to constantly update every single person about your research, of course, but it’s crucial to discuss your topic and ideas with others.
I’m working on multimodal video analysis for music information retrieval (MIR), and for a long time I worked on a clear and objective task: instrument classification. At some point, I started to doubt the importance and value of my research. But instead of worrying about whether your particular task has a direct application right now and who can benefit from solving it, you can explore all the details the task has to offer, and that's way more exciting!
We started with a rough evaluation of classification performance for each modality, then moved to the straightforward idea of using a late-fusion technique to predict from different modalities. It worked, and we got good evidence that the audio and visual modalities are complementary for both humans and ML algorithms, even under the hard conditions of noisy datasets, the presence of many instruments and a substantial variety of training and testing examples.
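To make the late-fusion idea concrete, here is a minimal sketch in Python. It assumes the simplest possible scheme, a weighted average of the per-class probabilities produced by two separately trained models; this is an illustration, not necessarily the exact scheme used in our papers. The function names, class labels and probability values are all hypothetical.

```python
def late_fusion(audio_probs, visual_probs, audio_weight=0.5):
    """Combine per-class probabilities from two modality-specific models.

    Assumption: fusion is a weighted average of the two probability vectors.
    """
    if len(audio_probs) != len(visual_probs):
        raise ValueError("both models must score the same set of classes")
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must be in [0, 1]")
    visual_weight = 1.0 - audio_weight
    return [audio_weight * a + visual_weight * v
            for a, v in zip(audio_probs, visual_probs)]


def predict(labels, fused_probs):
    """Return the label with the highest fused score."""
    best_index = max(range(len(fused_probs)), key=fused_probs.__getitem__)
    return labels[best_index]


# Made-up outputs for one video: the audio model slightly prefers "cello",
# while the visual model prefers "flute".
labels = ["cello", "flute", "guitar"]
audio_probs = [0.45, 0.40, 0.15]
visual_probs = [0.10, 0.50, 0.40]

fused = late_fusion(audio_probs, visual_probs)
print(predict(labels, fused))
```

The point of fusing late is that each model can be trained and evaluated on its own modality, and only the final scores need to be combined. The weight `audio_weight` is a free parameter here; in practice it would be tuned on held-out data.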
What should one do with these results? I should have actively discussed them with people and searched for further directions and ideas… but I didn’t, because I couldn’t find an interesting application.
We continued digging into the techniques, trying to see how the performance could be improved. We tried out several ideas, including aggregation of video information (in the form of optical flow fields) and complex loss functions which, in my opinion, should have helped to transfer knowledge from a visual model that generalised better to an audio model that generalised worse. However, none of them worked.
Meanwhile, I got an opportunity to get feedback, as I attended a few events which were great for knowledge acquisition: the Deep Bayes Summer School, the ESI Workshop on Systematic Approaches To Deep Learning Methods For Audio and the ISMIR 2017 conference.
Indeed, I came away with many ideas, and some of them are still waiting to be implemented.
The lesson learned: I need to talk more, to share my thoughts, to listen more actively and to exchange ideas as much as I can.
Discussing our results with others brought us the idea of looking deeper under the surface of the black-box models we were using. So we started to develop another research direction: the interpretability of audio-visual models. We aimed to check the correspondences in activations and predictions between audio-based recognition and visual frame-based recognition, and this work was presented at the ISMIR conference. We are still working in this direction, with, hopefully, more results to be published soon.
I would like to mention something which is not directly about my research. I love being here and doing my work. People are supportive, and the university's inclusive environment and friendly atmosphere help me to overcome the frustration (which I sometimes encounter) and stay productive.
Thanks to my dearest friends and collaborators, I had a chance to explore the potential of musically-motivated architectures, take part in the DCASE challenge and give a number of presentations within and outside of the university.
Me (on the right), presenting a poster on correspondences in audio and visual deep models for musical instrument detection at the ISMIR conference.
 Olga Slizovskaia, Emilia Gómez, and Gloria Haro. Automatic musical instrument recognition in audiovisual recordings by combining image and audio classification strategies. In 13th Sound and Music Computing Conference (SMC 2016), Hamburg, Germany, 2016
 Olga Slizovskaia, Emilia Gómez, and Gloria Haro. Musical instrument recognition in user-generated videos using a multimodal convolutional neural network architecture. In ACM International Conference on Multimedia Retrieval, Bucharest, Romania, 2017. ACM Digital Library.
 Olga Slizovskaia, Emilia Gómez, and Gloria Haro. Correspondence between audio and visual deep models for musical instrument detection in video recordings. In 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017.
 Jordi Pons, Olga Slizovskaia, Rong Gong, Emilia Gómez, and Xavier Serra. Timbre analysis of music audio signals with convolutional neural networks. In 25th European Signal Processing Conference (EUSIPCO), Kos island, Greece, 2017. IEEE.
 Eduardo Fonseca, Rong Gong, Dmitry Bogdanov, Olga Slizovskaia, Emilia Gómez, and Xavier Serra. Acoustic scene classification by ensembling gradient boosting machine and convolutional neural networks. In Workshop on Detection and Classification of Acoustic Scenes and Events, Munich, Germany, 2017.