The Semantic Artist Similarity dataset consists of two datasets of artists entities with their corresponding biography texts, and the list of top-10 most similar artists within the datasets used as ground truth. The dataset is composed by a corpus of 268 artists and a slightly larger one of 2,336 artists, both gathered from Last.fm in March 2015. The former is mapped to the MIREX Audio and Music Similarity evaluation dataset, so that its similarity judgments can be used as ground truth. For the latter corpus we use the similarity between artists as provided by the Last.fm API. For every artist there is a list with the top-10 most related artists. In the MIREX dataset there are 188 artists with at least 10 similar artists, the other 80 artists have less than 10 similar artists. In the Last.fm API dataset all artists have a list of 10 similar artists.
There are 4 files in the dataset.
mirex_gold_top10.txt and lastfmapi_gold_top10.txt have the top-10 lists of artists for every artist of both datasets. Artists are identified by MusicBrainz ID. The format of the file is one line per artist, with the artist mbid separated by a tab with the list of top-10 related artists identified by their mbid separated by spaces.
artist_mbid \t artist_mbid_top10_list_separated_by_spaces \n
mb2uri_mirex and mb2uri_lastfmapi.txt have the list of artists. In each line there are three fields separated by tabs. First field is the MusicBrainz ID, second field is the last.fm name of the artist, and third field is the DBpedia uri.
artist_mbid \t lastfm_name \t dbpedia_uri \n
There are also 2 folders in the dataset with the biography texts of each dataset. Each .txt file in the biography folders is named with the MusicBrainz ID of the biographied artist. Biographies were gathered from the Last.fm wiki page of every artist.