SAS: Semantic Artist Similarity Dataset

The Semantic Artist Similarity dataset consists of two datasets of artist entities with their corresponding biography texts, together with the list of the top-10 most similar artists within each dataset, used as ground truth. It comprises a corpus of 268 artists and a larger one of 2,336 artists, both gathered from Last.fm in March 2015. The former is mapped to the MIREX Audio and Music Similarity evaluation dataset, so that its similarity judgments can be used as ground truth. For the latter corpus we use the similarity between artists as provided by the Last.fm API. For every artist there is a list of its most related artists, with at most 10 entries. In the MIREX dataset, 188 artists have at least 10 similar artists, and the other 80 have fewer than 10. In the Last.fm API dataset, all artists have a list of 10 similar artists.

There are 4 files in the dataset.

mirex_gold_top10.txt and lastfmapi_gold_top10.txt contain the top-10 lists of similar artists for every artist in the respective dataset. Artists are identified by their MusicBrainz ID (mbid). Each file has one line per artist: the artist's mbid, a tab, and then the mbids of its top-10 related artists separated by spaces.

artist_mbid \t artist_mbid_top10_list_separated_by_spaces \n
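
As a minimal sketch of how a file in this format could be read, the following Python function maps each artist mbid to its list of related mbids. The function name parse_top10 and the sample identifiers are hypothetical and used only for illustration; they are not part of the dataset.

```python
from typing import Dict, Iterable, List


def parse_top10(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each artist mbid to the list of its related artist mbids."""
    similar: Dict[str, List[str]] = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        # The first tab separates the artist from its space-separated list.
        mbid, _, related = line.partition("\t")
        similar[mbid.strip()] = related.split()
    return similar


# Placeholder identifiers, not real MusicBrainz IDs.
sample = ["mbid-a\tmbid-b mbid-c\n", "mbid-b\tmbid-a\n"]
top10 = parse_top10(sample)
print(top10["mbid-a"])  # ['mbid-b', 'mbid-c']
```

To read an actual file, an open file handle can be passed instead of the sample list, for example parse_top10(open("mirex_gold_top10.txt", encoding="utf-8")), assuming the file is UTF-8 encoded.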

mb2uri_mirex and mb2uri_lastfmapi.txt contain the list of artists of each dataset. Each line has three fields separated by tabs: the MusicBrainz ID, the Last.fm name of the artist, and the DBpedia URI.

artist_mbid \t lastfm_name \t dbpedia_uri \n
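
The three-field layout above could be parsed along these lines. This is only a sketch: ArtistRecord and parse_artist_map are hypothetical names, the sample row uses placeholder values, and raising an error on lines that do not have exactly three fields is an assumption about how malformed input should be handled.

```python
from typing import Dict, Iterable, NamedTuple


class ArtistRecord(NamedTuple):
    mbid: str
    lastfm_name: str
    dbpedia_uri: str


def parse_artist_map(lines: Iterable[str]) -> Dict[str, ArtistRecord]:
    """Index the artist records of a mapping file by MusicBrainz ID."""
    records: Dict[str, ArtistRecord] = {}
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        fields = [field.strip() for field in line.split("\t")]
        if len(fields) != 3:
            # Assumption: a line without exactly three fields is an error.
            raise ValueError(f"line {number}: expected 3 fields, got {len(fields)}")
        records[fields[0]] = ArtistRecord(*fields)
    return records


# Placeholder row, not taken from the dataset.
sample = ["mbid-a\tSome Artist\thttp://dbpedia.org/resource/Some_Artist\n"]
artists = parse_artist_map(sample)
print(artists["mbid-a"].lastfm_name)  # Some Artist
```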

There are also 2 folders in the dataset, one per dataset, containing the biography texts. Each .txt file in the biography folders is named after the MusicBrainz ID of the artist it describes. Biographies were gathered from the Last.fm wiki page of every artist.
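
Because each biography file is named after its artist's MusicBrainz ID, a folder can be loaded into a dictionary keyed by mbid. The sketch below assumes the files are UTF-8 encoded, which is not stated here; load_biographies is a hypothetical helper, and the folder path is left to the caller because the folder names are not given in this description.

```python
from pathlib import Path
from typing import Dict


def load_biographies(folder: str, encoding: str = "utf-8") -> Dict[str, str]:
    """Return a dict mapping each artist mbid to its biography text."""
    biographies: Dict[str, str] = {}
    for path in sorted(Path(folder).glob("*.txt")):
        # The file name without the .txt extension is the artist mbid.
        biographies[path.stem] = path.read_text(encoding=encoding)
    return biographies
```

The resulting keys can then be matched directly against the mbids in the top-10 and mapping files.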

Scientific References

For more details on how these files were generated, please refer to the following scientific publication. We would highly appreciate it if scientific publications of works partly based on this dataset cite it:

Oramas, S., Sordo M., Espinosa-Anke L., & Serra X. (2015). A Semantic-based Approach for Artist Similarity. 16th International Society for Music Information Retrieval Conference.

Acknowledgements

Supported by

Work supported by the Music Technology Group (MTG) and the Natural Language Processing Research Group (TALN) of the Pompeu Fabra University, and the Center for Computational Science of the University of Miami.

Download

Download the Semantic Artist Similarity dataset here:

Semantic Artist Similarity dataset

Conditions of Use

The Semantic Artist Similarity dataset is offered free of charge for internal non-commercial use only. You may not redistribute, publicly communicate or modify it. Please see the license terms in the README file within the dataset for applicable conditions.

Feedback

Problems, positive feedback, negative feedback... it is all welcome! Please help us improve Semantic Artist Similarity by sending your feedback to:
[email protected] AND [email protected]

If you are reporting a problem, please include as many details as possible.