ELMD: Entity Linking for the Music Domain Dataset

ELMD is a corpus of annotated named entities from the music domain, built from a collection of about 13k Last.fm artist biographies. Entities are linked to DBpedia through a voting system among several state-of-the-art Entity Linking systems (ELVIS), with a precision of at least 0.94. In addition, setting a higher confidence threshold yields a subset of ELMD that favors precision at the expense of recall.
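The trade-off between the voting threshold and precision can be illustrated with a minimal sketch. It assumes that the confidence of a link is the fraction of systems proposing the same URI for the same span; this scoring rule, the function name `vote` and the sample outputs are illustrative assumptions and not necessarily what ELVIS does.

```python
from collections import Counter

def vote(system_outputs, threshold=0.5):
    """Keep links whose agreement score reaches the threshold.

    system_outputs: one dict per system, mapping (start, end) -> URI.
    The score is the fraction of systems that propose the same URI for
    the same span (an assumption made for this sketch).
    """
    n_systems = len(system_outputs)
    votes = Counter()
    for output in system_outputs:
        for span, uri in output.items():
            votes[(span, uri)] += 1

    linked = {}
    for (span, uri), count in votes.items():
        score = count / n_systems
        if score < threshold:
            continue
        # If several URIs pass for one span, keep the highest-scoring one.
        if span not in linked or score > linked[span][1]:
            linked[span] = (uri, score)
    return linked

# Hypothetical outputs of three entity linking systems for one text.
outputs = [
    {(0, 11): "dbr:The_Beatles", (20, 30): "dbr:Abbey_Road"},
    {(0, 11): "dbr:The_Beatles", (20, 30): "dbr:Abbey_Road_Studios"},
    {(0, 11): "dbr:The_Beatles"},
]

loose = vote(outputs, threshold=0.3)   # more links, some of them disputed
strict = vote(outputs, threshold=0.6)  # fewer links, all widely agreed on
print(len(loose), len(strict))
```

Raising the threshold drops the span on which the systems disagree, which is the sense in which a stricter subset trades recall for precision.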

ELMD 2.0

Over the last few months we have reviewed and expanded ELMD as follows:
  • Most of the entities are now also linked to MusicBrainz (mapping retrieved through the Last.fm API)
  • More annotations have been added by propagating existing annotations throughout the document in which they were found, assuming they appear in a one-sense-per-discourse fashion.
  • New output formats have been added: NIF and GATE
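The propagation step mentioned in the list above can be sketched as follows. This is a minimal illustration under the one-sense-per-discourse assumption, not the code used to build ELMD 2.0: the function name `propagate` is hypothetical, the field names are borrowed from the JSON format, and matching is done on exact, case-sensitive surface forms.

```python
import re

def propagate(text, annotations):
    """Copy each annotation to the unannotated repetitions of its surface form.

    annotations: list of dicts with startChar, endChar and uri.
    """
    # Map every annotated surface form to the entity it was linked to.
    surface_to_uri = {}
    for ann in annotations:
        surface = text[ann["startChar"]:ann["endChar"]]
        surface_to_uri.setdefault(surface, ann["uri"])

    covered = [(a["startChar"], a["endChar"]) for a in annotations]
    result = list(annotations)

    # Longest surface forms first, so longer mentions win over their substrings.
    for surface in sorted(surface_to_uri, key=len, reverse=True):
        pattern = r"(?<!\w)" + re.escape(surface) + r"(?!\w)"
        for match in re.finditer(pattern, text):
            start, end = match.span()
            # Skip occurrences that overlap an existing annotation.
            if any(start < e and s < end for s, e in covered):
                continue
            result.append({"startChar": start, "endChar": end,
                           "uri": surface_to_uri[surface]})
            covered.append((start, end))
    return sorted(result, key=lambda a: a["startChar"])

text = "Nirvana formed in 1987. Nirvana released Nevermind in 1991."
seed = [{"startChar": 0, "endChar": 7,
         "uri": "http://dbpedia.org/resource/Nirvana_(band)"}]
propagated = propagate(text, seed)
print([(a["startChar"], a["endChar"]) for a in propagated])
```

Here the second mention of "Nirvana" inherits the link of the first one, since both occur in the same document.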
We provide updated statistics on the new dataset, e.g. number of annotations, unique entities by category, as well as percentage of annotations and unique entities with successful linking to reference KBs. Note that all entities are classified into one of the four categories and linked to Last.fm, and from there, these may be linked to DBpedia and MusicBrainz, to only one of them, or to none.
Category      Annotations   Entities
All               144,593     63,902
Artist            112,524     39,131
Album              18,701     15,064
Track               9,203      7,832
Label               4,165      1,875

Linked to     Annotations   Entities
DBpedia             58.6%      49.1%
MusicBrainz         93.6%      91.1%
Both                57.2%        47%
None                   5%       9.2%

ELMD 2.0 is available in the following formats:

In the JSON version, every biography is stored in a separate document and split into sentences. For every sentence, annotations are stored as a list of entities with the following fields: startChar, endChar, uri (DBpedia URI), mbid (MusicBrainz ID), category (Artist/Album/Track/Label), and lastfm_url (Last.fm URL). Track and Album entities may have an additional mbid_artist field, which provides the artist's MusicBrainz ID.
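As an illustration, the following sketch reads one such document and recovers the annotated mentions. Only the entity field names come from the description above. The keys "text" and "entities", the top-level list of sentences, sentence-relative offsets with an exclusive endChar, and all sample values (including the placeholder MBIDs) are assumptions that should be checked against the actual files.

```python
import json

# Hypothetical document: a list of sentences, each with its text and entities.
sample_doc = json.loads("""
[
  {
    "text": "Radiohead released OK Computer in 1997.",
    "entities": [
      {"startChar": 0, "endChar": 9,
       "uri": "http://dbpedia.org/resource/Radiohead",
       "mbid": "00000000-0000-0000-0000-000000000001",
       "category": "Artist",
       "lastfm_url": "http://www.last.fm/music/Radiohead"},
      {"startChar": 19, "endChar": 30,
       "uri": "http://dbpedia.org/resource/OK_Computer",
       "mbid": "00000000-0000-0000-0000-000000000002",
       "mbid_artist": "00000000-0000-0000-0000-000000000001",
       "category": "Album",
       "lastfm_url": "http://www.last.fm/music/Radiohead/OK+Computer"}
    ]
  }
]
""")

def iter_mentions(sentences):
    """Yield (surface form, category, artist MBID or None) for every entity."""
    for sentence in sentences:
        for entity in sentence["entities"]:
            surface = sentence["text"][entity["startChar"]:entity["endChar"]]
            # mbid_artist is optional, so read it with .get().
            yield surface, entity["category"], entity.get("mbid_artist")

mentions = list(iter_mentions(sample_doc))
for surface, category, artist_mbid in mentions:
    print(surface, category, artist_mbid)
```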

In the XML version, entities are annotated inside text using the category of the entity as the XML tag and with 3 attributes: dbp (DBpedia URI), mb (MusicBrainz ID) and lfm (Last.fm URL).
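A minimal sketch of how inline annotations of this kind could be converted back to plain text plus character offsets is shown below. The wrapping `sentence` element, the capitalized tag names and the sample values (including the placeholder MusicBrainz IDs) are assumptions; only the attribute names dbp, mb and lfm come from the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated sentence; the <sentence> wrapper is an assumption.
sample = (
    '<sentence>'
    '<Artist dbp="http://dbpedia.org/resource/Radiohead" '
    'mb="00000000-0000-0000-0000-000000000001" '
    'lfm="http://www.last.fm/music/Radiohead">Radiohead</Artist>'
    ' released '
    '<Album dbp="http://dbpedia.org/resource/OK_Computer" '
    'mb="00000000-0000-0000-0000-000000000002" '
    'lfm="http://www.last.fm/music/Radiohead/OK+Computer">OK Computer</Album>'
    ' in 1997.'
    '</sentence>'
)

def extract(xml_string):
    """Return the plain text and a list of entities with character offsets."""
    root = ET.fromstring(xml_string)
    plain = root.text or ""
    entities = []
    for element in root:
        mention = "".join(element.itertext())
        entities.append({
            "startChar": len(plain),
            "endChar": len(plain) + len(mention),
            "category": element.tag,         # the tag name is the category
            "uri": element.get("dbp"),       # None if the attribute is missing
            "mbid": element.get("mb"),
            "lastfm_url": element.get("lfm"),
        })
        # The tail is the text between this element and the next one.
        plain += mention + (element.tail or "")
    return plain, entities

plain, entities = extract(sample)
print(plain)
for entity in entities:
    print(entity["category"], entity["startChar"], entity["endChar"])
```

The resulting dictionaries reuse the field names of the JSON version, so both formats can feed the same downstream code.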

The NIF version contains the whole dataset in a single file, following the NIF 2.0 specification.

The original ELMD 1.0 is also available for download here.

ELVIS (Entity Linking Framework Voting and Integration System), the source code used to generate ELMD 1.0 and 2.0, is also available for download here: https://github.com/sergiooramas/elvis

ELMDist: A vector space model with words and MusicBrainz entities

In addition, word vectors have been trained from ELMD 2.0 using word2vec. Vectors can be downloaded here: The code to retrain the vectors is available here.

Scientific References

Please cite the following paper if using ELVIS or any of the datasets (ELMD 1.0 and 2.0).

Please cite the following paper if using ELMDist.

Espinosa-Anke, L., Oramas, S., Saggion, H., & Serra, X. (2017). ELMDist: A vector space model with words and MusicBrainz entities. Workshop on Semantic Deep Learning (SemDeep), co-located with ESWC 2017.