(Text by Minz Won, taken from his blog on GitHub; the original entry is here.) See also the UPF news (Spanish / Catalan).
Jay Kim and I won the “WWW 2018 Challenge: Learning to Recognize Musical Genre”. This post describes our approach to the challenge. The full paper is available here.
This year, The Web Conference, also known as WWW, organized a challenge track for the first time. Learning to Recognize Musical Genre was one of the four challenges in the track. The goal was to recognize the musical genre of a piece of music for which only a recording is available. The data was a subset of the FMA dataset.
The challenge consisted of two rounds. In the first round, participants were given a test set of 35,000 clips of 30 seconds each and had to submit predictions for all of them. The primary evaluation metric was the mean log loss. In the second and final round, participants had to wrap their models in a Docker container, which the organizers evaluated against a new, unseen test set.
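As a reminder of what the metric measures, here is a minimal sketch of mean log loss for a multi-class problem. The function name, the clipping constant `eps`, and the toy predictions are assumptions for illustration; the organizers' exact evaluation code is not shown in this post.

```python
import math

def mean_log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log-probability assigned to the true class.

    y_true: list of integer class indices, one per clip.
    y_prob: list of per-clip probability lists (each row sums to 1).
    eps:    assumed clipping constant that avoids log(0).
    """
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        # Clip so a confidently wrong prediction gives a large but finite loss.
        p = min(max(probs[label], eps), 1.0 - eps)
        total += -math.log(p)
    return total / len(y_true)

# Hypothetical predictions for three clips over three genres.
y_true = [0, 2, 1]
y_prob = [
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.3, 0.4, 0.3],
]
loss = mean_log_loss(y_true, y_prob)
print(f"mean log loss: {loss:.4f}")
```

Because the loss is the negative log of the probability given to the correct genre, a model that is confidently wrong is punished much harder than one that hedges, which is why well-calibrated probabilities matter for this metric.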
I teamed up with Jay Kim, a former colleague of mine at MARG who is now a PhD student at TU Delft.
At the beginning of the challenge, we each had our own ideas for the model design, so we decided to work separately and ensemble the features learned by each model. During the experiments, however, we identified some critical factors that could affect the final result.
In round 1, participants could check their results interactively on the leaderboard. We got the best results when we overfit our model to the training set and stopped training at a heuristically chosen point. One of our speculations about this phenomenon was that the round 1 test set shared artists with the training set.
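One common way to guard against this kind of artist overlap when validating locally is to hold out whole artists, so that no artist appears in both the training and validation splits. The sketch below is an illustration only, not a step we describe in this post; the function name, the artist IDs, and the `val_fraction` parameter are hypothetical.

```python
import random

def artist_disjoint_split(track_artists, val_fraction=0.2, seed=0):
    """Split track indices so that no artist appears on both sides.

    track_artists: list where track_artists[i] is the artist ID of track i.
    Returns (train_indices, val_indices).
    """
    artists = sorted(set(track_artists))
    rng = random.Random(seed)
    rng.shuffle(artists)
    # Hold out at least one artist, rounding to the nearest whole artist.
    n_val = max(1, int(round(len(artists) * val_fraction)))
    val_artists = set(artists[:n_val])
    train_idx = [i for i, a in enumerate(track_artists) if a not in val_artists]
    val_idx = [i for i, a in enumerate(track_artists) if a in val_artists]
    return train_idx, val_idx

# Hypothetical artist IDs for eight tracks.
track_artists = ["a1", "a1", "a2", "a3", "a3", "a3", "a4", "a5"]
train_idx, val_idx = artist_disjoint_split(track_artists, val_fraction=0.4)
```

A validation score measured on such a split says more about performance on unseen artists than a random track-level split does.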
Since we didn’t know the artist distribution of round 2 and wanted our model to learn more general representations, we needed more reliable targets to learn from. To this end, Jay proposed using clusters as the targets of our models instead of noisy genre annotations.
Artist Group Factors (AGFs)
For the reasons mentioned above, targeting artist labels might be beneficial for this challenge. Previous research has also used artist labels for representation learning. However, artist labels are sparse: only a few tracks are available per artist. Grouping artist labels into clusters of similar artists can therefore help, and it avoids the learning bottlenecks caused by a large number of classes. To this end, we proposed Artist Group Factors (AGFs).
The main idea of extracting AGFs is to cluster artists based on meaningful feature sets that allow for aggregation at (and beyond) the artist level. Let’s take a look at the pipeline step-by-step.
We generated four different AGFs using Essentia features, MFCC, dMFCC, and subgenre tags.
We trained five networks, one targeting genre tags and one for each of the four AGFs. All of them used the same Convolutional Neural Network (CNN) structure; see the paper for the detailed architecture.
Finally, we transferred all of the learned representations to predict genre tags, which was the goal of this challenge. A simple Multi-Layer Perceptron (MLP) was used for this step.
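The clustering step at the start of the pipeline above can be sketched as follows: average track-level features per artist, cluster the artist vectors, and give every track its artist's cluster as a training target. This is a simplified illustration under stated assumptions. Plain k-means with hard cluster indices is used as a stand-in, the actual clustering method is described in the paper, and all function names and the toy 2-D features are hypothetical.

```python
import random
from collections import defaultdict

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(points, k, n_iter=20, seed=0):
    """Plain k-means (a stand-in clustering); returns one cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = mean_vector(members)
    return labels

def artist_group_targets(track_features, track_artists, n_groups):
    """Return one artist-group index per track."""
    by_artist = defaultdict(list)
    for feat, artist in zip(track_features, track_artists):
        by_artist[artist].append(feat)
    artists = sorted(by_artist)
    # Aggregate track-level features to the artist level.
    artist_vectors = [mean_vector(by_artist[a]) for a in artists]
    groups = kmeans(artist_vectors, n_groups)
    artist_to_group = dict(zip(artists, groups))
    return [artist_to_group[a] for a in track_artists]

# Toy 2-D features standing in for MFCC or Essentia descriptors.
track_features = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1],
                  [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
track_artists = ["a", "a", "b", "c", "c", "d"]
targets = artist_group_targets(track_features, track_artists, n_groups=2)
```

In this toy example, artists "a" and "b" end up in one group and "c" and "d" in the other, so tracks from similar artists share a target even when each artist has only one or two tracks.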
Next destination… Spotify-RecSys Challenge