We maintain datasets, repositories, frameworks and tools of relevance for research and technology transfer initiatives related to knowledge extraction. This section provides an overview of a selection of them, with links to downloads or contact details.

The MdM Strategic Research Program has its own community in Zenodo for material available in this repository, as well as at the UPF e-repository. Below is a non-exhaustive list of datasets representative of the research in the Department.

As part of the promotion of the availability of resources, the creation of specific communities in Zenodo has also been promoted, at the level of research communities (for instance, MIR and Educational Data Analytics) or MSc programs (for instance, the Master in Sound and Music Computing).

The Online Conversations Threads Repository. V. Gómez, A. Kaltenbrunner and D. Laniado

http://repositori.upf.edu/handle/10230/26270

 

This repository contains datasets with online conversation threads collected and analyzed by different researchers. Currently, you can find datasets from two news aggregators (Slashdot and Barrapunto) and from the English Wikipedia talk pages.

  • Slashdot conversations (Aug 2005 - Aug 2006): online conversations generated at Slashdot during one year, covering posts and comments published between August 26th, 2005 and August 31st, 2006. For each discussion thread: sub-domains, title, topics and hierarchical relations between comments. For each comment: user, date, score and textual content. This dataset is different from the Slashdot Zoo social network contained in the SNAP repository (it is not a signed network of users) and represents the full version of the dataset used in the CAW 2.0 - Content Analysis for the WEB 2.0 workshop of the WWW 2009 conference, which can be found in several repositories such as Konect.
  • Barrapunto conversations (Jan 2005 - Dec 2008): online conversations generated at Barrapunto (a Spanish clone of Slashdot) during three years. For each discussion thread: sub-domains, title, topics and hierarchical relations between comments. For each comment: user, date, score and textual content.
  • Wikipedia (2001 - Mar 2010): data from the article discussion (talk) pages of the English Wikipedia as of March 2010. It contains comments on about 870,000 articles (i.e. all articles which had a corresponding talk page with at least one comment), in total about 9.4 million comments. The oldest comments date back to as early as 2001.
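Since each dataset encodes the hierarchical relations between comments, the discussion trees can be reconstructed directly. The sketch below computes the size and depth of each thread from a flat (comment, parent) edge list; this input format is an assumption for illustration, not the actual layout of the released files.

```python
from collections import defaultdict

def thread_stats(edges):
    """Size and depth of each discussion thread, given (comment_id,
    parent_id) pairs where a parent_id of None marks a thread root.
    The flat edge-list format is assumed for illustration only."""
    children, roots = defaultdict(list), []
    for comment, parent in edges:
        if parent is None:
            roots.append(comment)
        else:
            children[parent].append(comment)

    def depth(node):
        return 1 + max((depth(k) for k in children[node]), default=0)

    def size(node):
        return 1 + sum(size(k) for k in children[node])

    return {r: {"size": size(r), "depth": depth(r)} for r in roots}

# Toy thread: comment 1 is the root, 2 and 3 reply to it, 4 replies to 2.
print(thread_stats([(1, None), (2, 1), (3, 1), (4, 2)]))
# -> {1: {'size': 4, 'depth': 3}}
```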

 

When using these data, please use the following references.

The datasets were first analyzed in:

[1] V. Gómez, A. Kaltenbrunner, V. López (2008). Statistical analysis of the social network and discussion threads in Slashdot. Proceedings of the 17th International World Wide Web Conference

[2] V. Gómez, H. J. Kappen, N. Litvak, A. Kaltenbrunner (2013). A likelihood-based framework for the analysis of discussion threads. World Wide Web 16(5-6):645-675 (arXiv version here)

[3] D. Laniado, R. Tasso, Y. Volkovich, A. Kaltenbrunner (2011). When the Wikipedians Talk: Network and Tree Structure of Wikipedia Discussion Pages. Fifth International AAAI Conference on Weblogs and Social Media


The Internet has become a fundamental resource for activism, as it facilitates political mobilization on a global scale. Petition platforms are a clear example of how thousands of people around the world can contribute to social change. Avaaz.org, with a presence in over 200 countries, is one of the most popular platforms of this type. However, little research has focused on this platform, probably due to a lack of available data. To overcome this problem, we release an open dataset of petitions from Avaaz.org. It is important to highlight two issues. First, the machine-readable robots.txt file on Avaaz.org does not specify any restrictions. Second, every page fetched by the crawler specified a Creative Commons Attribution 3.0 Unported License in its footer. Therefore, the dataset is released under the same terms.
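The robots.txt check described above can be reproduced with Python's standard library before fetching any page; the sketch below is a generic illustration, and the page path in it is a placeholder rather than an endpoint we verified.

```python
from urllib import robotparser

# Generic pre-crawl check of the kind described above; paths are placeholders.
rp = robotparser.RobotFileParser()
rp.set_url("https://avaaz.org/robots.txt")
rp.read()

page = "https://avaaz.org/en/some_petition/"  # hypothetical page
if rp.can_fetch("*", page):
    print("robots.txt allows fetching", page)
else:
    print("robots.txt disallows", page, "- skip it")
```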

Associated publication and slides


MuMu: Multimodal Music Dataset

MuMu is a Multimodal Music dataset with multi-label genre annotations that combines information from the Amazon Reviews dataset and the Million Song Dataset (MSD). The former contains millions of album customer reviews and album metadata gathered from Amazon.com. The latter is a collection of metadata and precomputed audio features for a million songs. 

To map the information from both datasets we use MusicBrainz. This process yields a final set of 147,295 songs, which belong to 31,471 albums. For the mapped set of albums, there are 447,583 customer reviews from the Amazon dataset. The dataset has been used for multi-label music genre classification experiments in the related publication. In addition to genre annotations, the dataset provides further information about each album, such as the average rating, selling rank, similar products, and cover image URL. For every text review, it also provides the review's helpfulness score, rating, and summary.

The mapping between the three datasets (Amazon, MusicBrainz and MSD), genre annotations, metadata, data splits, text reviews and links to images are available here. Images and audio files cannot be released due to copyright issues.

  • MuMu dataset (mapping, metadata, annotations and text reviews)
  • Data splits and multimodal feature embeddings for ISMIR multi-label classification experiments 

These data can be used together with the Tartarus deep learning library https://github.com/sergiooramas/tartarus.

See all versions under DOI https://doi.org/10.5281/zenodo.831188
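For multi-label genre experiments such as those in the related publication, the album-level genre lists have to be binarized. The sketch below shows one way to do this with scikit-learn; the file name and the "genres" field are assumptions for illustration, not the released schema.

```python
import json
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical MuMu metadata file; name and "genres" field are assumptions.
with open("mumu_albums.json") as f:
    albums = json.load(f)  # assumed: list of dicts, one per album

genre_lists = [album["genres"] for album in albums]  # e.g. ["Pop", "Rock"]
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(genre_lists)  # binary matrix (n_albums, n_genres)
print(Y.shape, mlb.classes_[:5])
```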

[TEXT] MSD-A

The MSD-A is a dataset related to the Million Song Dataset (MSD). It is a collection of artist tags and biographies gathered from Last.fm for all the artists that have songs in the MSD.

In addition, we provide the data splits, feature embeddings, and models to reproduce the experiments from the paper:

Oramas S., Nieto O., Sordo M., & Serra X. (2017) A Deep Multimodal Approach for Cold-start Music Recommendation. https://arxiv.org/abs/1706.09739

Downloads

  • Corpus of artist biographies
  • Corpus of artist tags
  • MSD-Taste triplets for artists and track IDs
  • Data to reproduce experiments
  • Tartarus: library for deep learning experiments, https://github.com/sergiooramas/tartarus


These files contain data from Decidim.Barcelona (Pla Acció Municipal) for the hackathon in the #MetaDecidim workshop (25-26 Nov 2016, Convent dels Àngels i Auditori del MACBA, Barcelona).

More data can be found at: PAM Open Data

Note: The folder brechadigital is just a copy of the data released by Mobile World Capital Barcelona in the report "The digital divide in the city of Barcelona".


 

MARD contains texts and accompanying metadata originally obtained from a much larger dataset of Amazon customer reviews, which have been enriched with music metadata from MusicBrainz and audio descriptors from AcousticBrainz. MARD amounts to a total of 65,566 albums and 263,525 customer reviews. A breakdown of the number of albums per genre is provided below:

 

Genre                Amazon   MusicBrainz   AcousticBrainz
Alternative Rock      2,674         1,696              564
Reggae                  509           260               79
Classical            10,000         2,197              587
R&B                   2,114         2,950              982
Country               2,771         1,032              424
Jazz                  6,890         2,990              863
Metal                 1,785         1,294              500
Pop                  10,000         4,422            1,701
New Age               2,656           638              155
Dance & Electronic    5,106           899              367
Rap & Hip-Hop         1,679           768              207
Latin Music           7,924         3,237              425
Rock                  7,315         4,100            1,482
Gospel                  900           274               33
Blues                 1,158           448              135
Folk                  2,085           848              179
Total                65,566        28,053            8,683

 

A subset of the dataset was created for genre classification experiments. It contains 100 albums per genre from different artists, covering 13 different genres. All the albums have been mapped to MusicBrainz and AcousticBrainz. It contains semantic, acoustic and sentiment features.

We also provide all the necessary files to reproduce the genre classification experiments in the paper referenced below.

For details on the datasets and download please go to http://mtg.upf.edu/download/datasets/mard 

For more details on how these files were generated, we refer to the scientific publication below. We would highly appreciate it if scientific publications of works partly based on the MARD dataset cited this publication:

Oramas, S., Espinosa-Anke L., Lawlor A., Serra X., & Saggion H. (2016).  Exploring Customer Reviews for Music Genre Classification and Evolutionary Studies. 17th International Society for Music Information Retrieval Conference (ISMIR'16). 
 
The MARD dataset will be introduced in the next ISMIR tutorial "Natural Language Processing for MIR" https://wp.nyu.edu/ismir2016/event/tutorials/

[TEXT] KGRec-sound - Sound recommendation dataset

Number of items: 21,552
Number of users: 20,000
Number of items-users interactions: 2,117,698

All the data comes from Freesound.org. Items are sounds, which are described in terms of a textual description and tags created by the sound creator at upload time.

The files and folders in the dataset are described, and the dataset can be downloaded, at http://mtg.upf.edu/download/datasets/knowledge-graph-rec

 


[TEXT] KGRec-music - Music Recommendation Dataset

 

Number of items: 8,640
Number of users: 5,199
Number of items-users interactions: 751,531

All the data comes from the songfacts.com and last.fm websites. Items are songs, which are described in terms of a textual description extracted from songfacts.com and tags from last.fm.

The files and folders in the dataset are described, and the dataset can be downloaded, at http://mtg.upf.edu/download/datasets/knowledge-graph-rec
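Both KGRec datasets are collections of implicit user-item interactions. As a rough sketch of how such interactions can be prepared for recommendation experiments, the snippet below builds a sparse user-item matrix from a file of (user, item) pairs; the whitespace-separated layout is an assumption, so check the released files for the actual format.

```python
import numpy as np
from scipy.sparse import csr_matrix

def load_interactions(path):
    """Sparse user-item matrix from lines of "user_id item_id" pairs.
    The file layout is assumed for illustration; the KGRec
    distribution documents its own format."""
    users, items = [], []
    with open(path) as f:
        for line in f:
            u, i = line.split()[:2]
            users.append(u)
            items.append(i)
    # Stable integer indices for users and items.
    user_index = {u: k for k, u in enumerate(dict.fromkeys(users))}
    item_index = {i: k for k, i in enumerate(dict.fromkeys(items))}
    rows = [user_index[u] for u in users]
    cols = [item_index[i] for i in items]
    data = np.ones(len(rows))
    return csr_matrix((data, (rows, cols)),
                      shape=(len(user_index), len(item_index)))
```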


[TEXT] Human-centred design methods to empower ‘teachers as designers’

The present datasets have been used for the paper entitled "Human-centred design methods to empower ‘teachers as designers’".

Abstract

Teachers are learning designers, often unwittingly. To facilitate this role of ‘teachers as designers’, educators of all sectors need to adopt a design mindset and acquire the skills needed to address the design challenges they encounter in their everyday practice. Human-centred design (HCD) provides professional designers with the methods needed to address complex problems. It emphasises the human perspective throughout the design lifecycle and provides a practice-oriented, context-aware, empathetic and incremental approach which naturally fits educators' realities.

This research reports on a MOOC designed to ‘walk’ educators through the design of an ICT-based learning activity following an HCD process and its techniques. A mixed-methods approach is used to gauge how participants experienced the MOOC and the various design tasks it comprises. Although the perceived difficulty and value of the different methods varied (significant differences were seen between the experiences of novice and expert educators), the participants felt the overall approach constituted a powerful means for them to design technology-enhanced learning activities. The results support the idea of HCD as a valuable framework for educators and inform ongoing international efforts to shape a science and practice of learning design for teaching.

https://doi.org/10.5281/zenodo.1165147

 


 

ELMD is a corpus of annotated named entities from the music domain, drawn from a collection of about 13k Last.fm artist biographies. Entities are linked to DBpedia thanks to a voting system among different state-of-the-art Entity Linking systems (ELVIS), with a precision of at least 0.94. In addition, by setting a higher confidence threshold it is possible to obtain a subset of ELMD that prioritizes precision at the cost of recall.
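The voting idea behind ELVIS can be illustrated in a few lines: each entity linking system proposes a DBpedia URI for a mention, and a link is accepted only if enough systems agree. This is a deliberately simplified sketch, not the actual ELVIS implementation (see the references below for that).

```python
from collections import Counter

def vote(candidates, min_votes=2):
    """Accept an entity link when at least `min_votes` systems agree.
    `candidates` maps each mention to the list of URIs proposed by
    the individual systems. Simplified illustration of the idea only."""
    accepted = {}
    for mention, uris in candidates.items():
        uri, count = Counter(uris).most_common(1)[0]
        if count >= min_votes:
            accepted[mention] = uri
    return accepted

systems_output = {
    "The Beatles": ["dbpedia:The_Beatles", "dbpedia:The_Beatles",
                    "dbpedia:Beatles_(album)"],
    "Abbey Road": ["dbpedia:Abbey_Road", "dbpedia:Abbey_Road_Studios"],
}
print(vote(systems_output))  # only "The Beatles" reaches 2 votes
```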

  • Full details and download available here
  • ELVIS (Entity Linking Framework Voting and Integration System), the source code used to generate ELMD 1.0 and 2.0, is also available for download here
  • Scientific reference (full record here): Oramas S., Espinosa-Anke L., Sordo M., Saggion H., Serra X. ELMD: An Automatically Generated Entity Linking Gold Standard Dataset in the Music Domain. Proceedings of the Language Resources and Evaluation Conference (LREC) 2016

The Dr. Inventor Multi-Layer Scientific Corpus (DRI Corpus) is the result of collaborative annotation efforts carried out in the context of the European Project Dr. Inventor (FP7-ICT-2013.8.1, Grant no. 611383).


The Corpus includes 40 Computer Graphics papers, selected by domain experts. Each paper of the Corpus has been annotated by three annotators with the following layers of annotation, each one characterizing a core aspect of scientific publications:

  • Scientific discourse: each sentence has been associated with a specific scientific discourse category (Background, Approach, Challenge, Future Work, etc.).
  • Subjective statements and novelty: each sentence has been characterized with respect to advantages, disadvantages and novel aspects presented.
  • Citation purpose: each citation has been assigned a purpose specifying the reason why the authors of the paper cited that specific piece of research.
  • Summary relevance of sentences and hand-written summaries: each sentence of the paper has been given an integer score ranging from 1 to 5, indicating the relevance of the sentence for inclusion in the summary of the paper. Sentences rated 5 are the most relevant ones for summarizing a paper. For each paper, three hand-written summaries (max 250 words) are provided.
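The relevance scores make simple extractive-summarization baselines straightforward. Purely as an illustration of how the layer can be used (the input format below is assumed, not the actual serialization of the Corpus), one can greedily select the highest-rated sentences up to the 250-word budget of the hand-written summaries:

```python
def extractive_summary(sentences, max_words=250):
    """Greedy baseline over the relevance layer: pick sentences by
    descending score (5 = most relevant) until the word budget is
    spent, then restore document order. `sentences` is a list of
    (position, score, text) tuples - an assumed format for illustration."""
    picked, words = [], 0
    for pos, score, text in sorted(sentences, key=lambda s: -s[1]):
        n = len(text.split())
        if words + n <= max_words:
            picked.append((pos, text))
            words += n
    return " ".join(text for _, text in sorted(picked))
```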

 

For more details concerning the annotation procedure and schema used, you can go to the Corpus structure section of this page, or refer to the following papers: On the Discoursive Structure of Computer Graphics Research Papers for the scientific discourse layer, and A Multi-Layered Annotated Corpus of Scientific Papers for the other three annotation layers.

 

The instructions to download the Corpus are explained in the Download section.

International Standard Language Resource Number (ISLRN) of the Corpus: 372-096-409-709-2


[TEXT] Dataset of discussion threads from Meneame

Crawling process

We built a crawling process that collects all the stories on the front page of Meneame from 2011 to 2015 (both years included). We then performed a second crawling process to collect every comment from the discussion thread of each story. From both crawling processes, we obtained 72,005 stories and 5,385,324 comments.

It is important to highlight two issues taken into account when the crawler was designed. First, the machine-readable robots.txt file on Meneame does not disallow this process. Second, the footer of Meneame indicates the licenses of the code, graphics and content of the website. The license for the content is Attribution 3.0 Spain (CC BY 3.0 ES), which allows us to release this dataset.

More information http://doi.org/10.5281/zenodo.2536218

Dataset from our ICWSM 2017 paper. When using this resource, please use the following citation:

Aragón P., Gómez V., Kaltenbrunner A. (2017) To Thread or Not to Thread: The Impact of Conversation Threading on Online Discussion, ICWSM-17, 11th International AAAI Conference on Web and Social Media, Montreal, Canada.

More info about this dataset can also be found at:

Aragón P., Gómez V., Kaltenbrunner A. (2017) Detecting Platform Effects in Online Discussions, Policy & Internet, 9, 2017.


[TEXT] Dataset for paper "Sharing emotions at scale: The Vent dataset"

Lykousas, Nikolaos; Patsakis, Constantinos; Kaltenbrunner, Andreas; Gómez, Vicenç

This dataset is linked to the paper "Sharing emotions at scale: The Vent dataset" (arXiv:1901.04856 [cs.SI]) , and contains data collected from the Vent social network (https://www.vent.co/).

It is structured in files, each containing a different entity (i.e. emotion categories, emotions, vent metadata and social links). Entities external to each file are cross-referenced via anonymized universally unique identifiers (UUIDs).
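Because entities are cross-referenced by UUID, the files can be joined like tables in a relational schema. The sketch below shows the idea with pandas; the file and column names are assumptions for illustration, and the Zenodo record documents the actual schema.

```python
import pandas as pd

# File and column names are assumptions; see the Zenodo record for the schema.
vents = pd.read_csv("vents.csv")                     # vent_id, user_id, emotion_id, ...
emotions = pd.read_csv("emotions.csv")               # emotion_id, category_id, name
categories = pd.read_csv("emotion_categories.csv")   # category_id, name

# Resolve each vent's emotion and emotion category via the shared UUIDs.
enriched = (vents
            .merge(emotions, on="emotion_id")
            .merge(categories, on="category_id",
                   suffixes=("_emotion", "_category")))
print(enriched.head())
```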

Supplementary material: https://doi.org/10.5281/zenodo.2537982

The dataset consists of the following files and data – see the information at http://doi.org/10.5281/zenodo.2537838


[TEXT] ColWordNet, an enriched WordNet with collocational information

Details to be added


 

Network Function Virtualization (NFV) proposes to move packet processing from dedicated hardware middle-boxes to software running on commodity servers: virtualized Network Functions (NFs), e.g., firewalls, proxies, intrusion detection systems, etc. We have been developing an experimental platform called Network Function Center (NFC) to study issues related to NFV and NFs, assuming that the NFC will deliver virtualized NFs as a service to clients on a subscription basis. Our studies especially focus on dynamic resource allocation for NFs: we have proposed two new resource allocation algorithms based on Genetic Programming (GP) [1] and are currently working on another algorithm based on Iterative Local Search. For a more realistic evaluation of these algorithms, testing data is a fundamental component, but unfortunately, public traffic data specifically referring to chains of virtualized NFs is not readily available. Therefore, we developed a model to generate the specific data we needed, based on the available general traffic data [2]. This repository contains all the details about how we modelled general data into the specific data we wanted, along with the software we used and the assumptions we made during the data modelling process. Using these data and programs, the evaluation results presented in our publications can easily be reproduced.

[1] W. Rankothge, J. Ma, F. Le, A. Russo, and J. Lobo, “Towards making network function virtualization a cloud computing service,” in IM 2015. http://repositori.upf.edu/handle/10230/26035

[2] W. Rankothge, F. Le, A. Russo, and J. Lobo, “Experimental results on the use of genetic algorithms for scaling virtualized network functions,” in IEEE SDN/NFV 2015. http://repositori.upf.edu/handle/10230/26036

Updates for this work available in GitHub: https://github.com/windyswsw/DataForNFVSDNExperiments
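To give a flavour of the GP/GA-style allocation approach (this is a self-contained toy, not the NFC algorithms from [1]), the sketch below evolves an assignment of NFs to servers that minimizes the load of the busiest server:

```python
import random

def evolve_allocation(nf_loads, n_servers, pop_size=50, generations=200):
    """Toy genetic algorithm: a chromosome assigns each NF to a server,
    and fitness is the load of the busiest server (lower is better).
    Illustration only - not the NFC algorithms described in [1]."""
    n = len(nf_loads)

    def fitness(chrom):
        loads = [0.0] * n_servers
        for nf, server in enumerate(chrom):
            loads[server] += nf_loads[nf]
        return max(loads)

    pop = [[random.randrange(n_servers) for _ in range(n)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        survivors = pop[: pop_size // 2]          # keep the fitter half
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, n)
            child = a[:cut] + b[cut:]             # one-point crossover
            if random.random() < 0.1:             # occasional mutation
                child[random.randrange(n)] = random.randrange(n_servers)
            children.append(child)
        pop = survivors + children
    return min(pop, key=fitness)

print(evolve_allocation([0.5, 0.2, 0.9, 0.4, 0.7], n_servers=2))
```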


[NETWORKS] Data for NFVSDN experiments

Rankothge, Windhya; Le, Franck; Russo, Alessandra; Lobo, Jorge. Data for NFVSDN experiments. 2017

This record corresponds to the same dataset and experimental platform described in the previous entry; see references [1] and [2] above for the associated publications.


A multiscale imaging and modelling dataset of the human inner ear

Understanding the human inner ear anatomy and its internal structures is paramount to advancing hearing implant technology. While the emergence of imaging devices has allowed researchers to improve their understanding of intracochlear structures, the difficulty of collecting appropriate data has resulted in studies conducted with few samples. To assist the cochlear research community, a large collection of human temporal bone images is being made available. This data descriptor, therefore, describes a rich set of image volumes acquired using cone beam computed tomography and micro-CT modalities, accompanied by manual delineations of the cochlea and its sub-compartments, a statistical shape model encoding its anatomical variability, and data for electrode insertion and electrical simulations. These data make an important asset for future studies in need of high-resolution data and related statistical data objects of the cochlea used to leverage scientific hypotheses. They are of relevance to anatomists, audiologists, computer scientists in the domains of image analysis, computer simulations and image formation, and to biomedical engineers designing new strategies for cochlear implantation, electrode design, and more.

See the related publication at Nature Scientific Data https://www.nature.com/articles/sdata2017132


 

This page provides information about the source code, data, and results provided along with the manuscript [1]. If you use any of these resources, please make sure that you cite reference [1]; this will allow other researchers to locate the resources and the corresponding information. Links to the source code, data, and results can be found at the bottom of this page. We suggest referring to the resources as the Bern-Barcelona EEG database.

[1] Andrzejak RG, Schindler K, Rummel C (2012).  Nonrandomness, nonlinear dependence, and nonstationarity of electroencephalographic recordings from epilepsy patients. Phys. Rev. E, 86, 046206

 
Keywords: electroencephalogram, epilepsy, intracranial EEG recordings, nonlinear signal analysis, nonlinear time series analysis, free EEG database, nonlinear prediction error source code, surrogate signals, surrogate source code, EEG download page Bonn, electroencephalographic recordings, open Matlab source codes
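For readers unfamiliar with the nonlinear prediction error mentioned above, the sketch below shows a standard formulation based on time-delay embedding and nearest neighbors. The embedding parameters are illustrative; they are not necessarily those used in [1], whose reference implementation should be preferred.

```python
import numpy as np

def nonlinear_prediction_error(x, dim=3, lag=1, horizon=1):
    """RMS error of locally constant predictions in a time-delay
    embedding (a standard formulation; parameters and neighbor
    handling in [1] may differ)."""
    n = len(x) - (dim - 1) * lag - horizon
    emb = np.array([x[i:i + dim * lag:lag] for i in range(n)])
    future = np.array([x[i + (dim - 1) * lag + horizon] for i in range(n)])
    errors = []
    for i in range(n):
        d = np.linalg.norm(emb - emb[i], axis=1)
        d[i] = np.inf                      # exclude the point itself
        j = int(np.argmin(d))              # nearest neighbor in state space
        errors.append((future[j] - future[i]) ** 2)
    return float(np.sqrt(np.mean(errors)))

rng = np.random.default_rng(0)
print(nonlinear_prediction_error(rng.standard_normal(500)))
```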

[EDUCATIONAL DATA] Understanding collective behavior of learning design communities

http://doi.org/10.5281/zenodo.1207447

The following dataset has been used for the paper entitled "Understanding Collective Behavior of Learning Design Communities".

Michos, K., & Hernández-Leo, D. (2016). Understanding collective behavior of learning design communities. In Proceedings of the 11th European Conference on Technology Enhanced Learning, 614-617. https://doi.org/10.1007/978-3-319-45153-4_75

Abstract

Social computing enables collective actions and social interaction with rich exchange of information. In the context of educators' networks where they create and share learning design artifacts, little is known about their collective behavior. Learning design tooling focuses on supporting educators (learning designers) in making explicit their design ideas and encourages the development of “learning design communities”. Building on social elements, this paper aims to identify the level of engagement and interactions in three communities using an Integrated Learning Design Environment (ILDE). The results show a relationship between the exploration of different artifacts and the creation of content in all three communities, confirming that browsing influences the community's outcomes. Different patterns of interaction suggest a specific impact of language and length of support for users.


[EDUCATIONAL DATA] Teacher-led inquiry in technology-supported school communities

The following dataset have been used for the paper entitled "Teacher-led inquiry in technology-supported school communities".

Abstract

Learning design is a field of inquiry which studies how to best support teachers as designers of Technology-Enhanced Learning (TEL) situations. Although substantial work has been done in the articulation of the learning design process, little is known about how learning designs are experienced by students and teachers, especially in the context of schools. This paper empirically examines whether a teacher inquiry model, as a tool for systematic research by teachers into their own practice, facilitates the connection between the design and data-informed reflection of TEL interventions in two school communities. High school teachers participated in a learning design professional development program supported by a web-based community platform integrating a teacher inquiry tool (TILE). A multiple case study was conducted aimed at understanding: a) current teacher practice and b) teacher involvement in inquiry cycles of design and classroom implementations with technologies. Multiple data sources were used over a one-year period, including field notes, focus group transcripts, teacher interview protocols, digital artifacts, and questionnaires. Sharing teacher-led inquiries together with learning analytics was perceived as useful for connecting teacher expectations with their objective evaluation of learning designs, and this differed from their current practice. Teachers' reflections about their designs focused on the time management of learning activities and their familiarity with the enactment and analytics tools. The results inform how technology can support teacher-led inquiry and collective reflective practice in schools.


[EDUCATIONAL DATA] Supporting awareness in communities of learning design practice

http://doi.org/10.5281/zenodo.1209079

The following dataset has been used for the paper entitled "Supporting awareness in communities of learning design practice".

Abstract

The field of learning design has extensively studied the use of technology for the authoring of learning activities. However, the social dimension of the learning design process is still underexplored. In this paper, we investigate communities of teachers who used a social learning design platform (ILDE). We seek to understand how community awareness facilitates the learning design activity of teachers in different educational contexts. Following a design-based research methodology, we developed a community awareness dashboard (inILDE) based on the Cultural Historical Activity Theory (CHAT) framework. The dashboard displays the activity of teachers in ILDE, such as their interactions with learning designs, other members, and with supporting learning design tools. Evaluations of the inILDE dashboard were carried out in four educational communities – two secondary schools, a master programme for pre-service teachers, and in a Massive Open Online Course (MOOC) for teachers. The dashboard was perceived to be useful in summarizing the activity of the community and in identifying content and members’ roles. Further, the use of the dashboard increased participants’ interactions such as profile views and teachers showed a willingness to build on the contributions of others. As conclusions of the study, we propose five design principles for supporting awareness in learning design communities, namely community context, practice-related insights, visualizations and representations, tasks and community interests.

Additional material:


PyramidApp configurations and participants behaviour dataset

This dataset contains the necessary details to reproduce the experiments of the paper :

Manathunga K, Hernández-Leo D. PyramidApp: scalable method enabling collaboration in the classroom. In: Verbert K, Sharples M, Klobučar T, editors. Adaptive and adaptable learning: 11th European Conference on Technology Enhanced Learning, EC-TEL 2016, Lyon, France, September 13-16, 2016, Proceedings. Heidelberg: Springer, 2016. p. 422-7. (LNCS, no. 9891). DOI: 10.1007/978-3-319-45153-4_37 (postprint)

This dataset represents details of two PyramidApp experiments explained in the article (secondary school and vocational training experiments).

  • The secondary school data folder contains 6 rounds of PyramidApp flows with 3 different student samples, 2 rounds each.
  • The vocational training school data folder has 7 CSV files containing data of 3 PyramidApp rounds with the sample, including the flow design.
  • Each subfolder contains 7 CSV files from the PyramidApp database.
  • Flow.csv has information about the PyramidApp activity authoring configurations.
  • Flow_available_students.csv records which student IDs are available for which flow ID.
  • Flow_student.csv has details about students' initial answers submitted for the given task in each flow.
  • Flow_student_rating.csv has the rating values given by each student in each flow, with timestamps.
  • Pyramid_groups.csv has the groups created for each flow.
  • Pyramid_students.csv has information about when students were added to their respective pyramids.
  • Selected_answers.csv has the highly rated options for each group in each flow, with the rating scores.
  • Some fields are encoded due to the nature of the data, and these CSV files give only the information considered in the context of the above publication. A sketch of how the files can be joined is shown below.
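As a starting point for working with these files, the sketch below joins ratings back to the flow configurations and initial answers with pandas; the shared key columns ("flow_id", "student_id") are assumptions, so inspect the CSV headers of the actual release first.

```python
import pandas as pd

# Key column names are assumptions; inspect the CSV headers first.
flows = pd.read_csv("Flow.csv")
answers = pd.read_csv("Flow_student.csv")
ratings = pd.read_csv("Flow_student_rating.csv")

merged = (ratings
          .merge(flows, on="flow_id")
          .merge(answers, on=["flow_id", "student_id"],
                 suffixes=("_rating", "_answer")))
print(merged.groupby("flow_id").size())  # ratings per flow
```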

Participants' data and interview questions - Identifying design principles for learning design tools: the case of edCrumble

edCrumble: collaborative tool for design and analysis of blended courses https://ilde2.upf.edu/edcrumble/

https://doi.org/10.5281/zenodo.1239740


This repository contains classes for data generation, preprocessing and feature computation, useful for training neural networks with large datasets that do not fit into memory. Additionally, you can find classes to query samples of instrument sounds from the RWC instrument sound dataset.

In the 'examples' folder you can find use cases for the classes above for the case of music source separation. We provide code for feature computation (STFT) and for training convolutional neural networks for music source separation: singing voice separation with the iKala dataset; voice, bass and drums separation with the DSD100 dataset; and bassoon, clarinet, saxophone and violin separation with the Bach10 dataset. The latter is a good example of training a neural network with instrument samples from the RWC instrument sound dataset, when the original score is available.
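The feature computation in the 'examples' folder is based on the STFT; an equivalent magnitude spectrogram can be obtained with scipy, as in the sketch below (window and hop sizes are illustrative, not the repository's exact settings).

```python
import numpy as np
from scipy.signal import stft

fs = 44100
t = np.arange(fs) / fs
audio = np.sin(2 * np.pi * 440.0 * t)   # 1 s of A4 as a stand-in signal

# Magnitude STFT; nperseg/noverlap are illustrative parameters only.
f, frames, Z = stft(audio, fs=fs, nperseg=1024, noverlap=512)
magnitude = np.abs(Z)                    # shape: (freq_bins, n_frames)
print(magnitude.shape)
```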

In the 'evaluation' folder you can find Matlab code to evaluate the quality of the separation, based on BSS Eval.

For training neural networks we use Lasagne and Theano.

We provide code for separation using already trained models for different tasks.


[BIOIMAGE] Vascular structures. Xingce Wang; Yue Liu; Zhongke Wu; Xiao Mou; Mingquan Zhou; Miguel A. Gonzalez Ballester; Chong Zhang

Images of: ground truth bifurcation labeling and results obtained from our automatic labeling method.


Some years ago, researchers at Hospital Clínic de Barcelona and Universitat Pompeu Fabra developed a swine model of left bundle branch block (LBBB) for experimental studies of CRT (Rigol et al., J Cardiovasc Transl Res 2013). Radiofrequency applications were performed to induce LBBB, and half of the animals presented a myocardial infarction located at the septal wall. Imaging data and electro-anatomical maps (EAM) were acquired at baseline, with the induced LBBB, and after implantation of a CRT device. This rich data is well suited for evaluating some features of the different cardiac computational models available nowadays, and will be the basis of the CRT-EPiggy19 challenge. The training data will include two complete infarcted and two non-infarcted datasets (4 cases in total), while the test data is composed of four cases for each of the two categories (infarcted vs non-infarcted; 8 cases in total). The electrical activation patterns of the training datasets have already been described in detail in Soto-Iglesias et al. (IEEE J Transl Eng Health Med 2016). Check the Datasets section for a preview of the training data and the procedure to download it. Unlike the LBBB and CRT activation maps, baseline maps will not initially be released, since they do not necessarily contribute to the prediction of CRT from LBBB.

Some of the main sources of variability in the personalization of cardiac models involve the extraction of anatomical data from medical images and the creation of the geometrical domain where models are run. In order to reduce this variability in the CRT-EPiggy19 challenge, biventricular finite element meshes, built from the segmentation of cine-MRI data, will be provided to each participant. These meshes will include cardiomyocyte orientation (obtained with rule-based models; see Doste et al., Int J Numer Meth Bio 2019 for details), several regional labels (AHA regions, endo- and epi-cardial walls, different ventricles) and the local activation times projected from EAM data. Additionally, the affected AHA segments and their transmurality will be given for infarcted cases. Furthermore, for visualization and analysis purposes, 2D bi-ventricular representations will be given.



[BIOIMAGE] 3D Cryo Soft X-ray Transmission Microscopy data of Intact Thick Cells for Membrane Segmentation and Quantification.

Chong Zhang; Rubén Cárdenes; Oxana Klementieva; Stephan Werner; Peter Guttmann; Christoph Pratsch; Josep Cladera; Bart Bijnens. Coming soon.

Thanks for your interest in our 3D Cryo Soft X-ray Transmission Microscopy data of Intact Thick Cells for Membrane Segmentation and Quantification. Please send an email to [email protected] to request access to the data. 


2D phase contrast HeLa cells images with ground truth annotations

Original images are from http://www.robots.ox.ac.uk/~vgg/software/cell_detection/. This software is associated with the publication "Learning to Detect Cells Using Non-overlapping Extremal Regions", MICCAI 2012. (DOI: 10.1007/978-3-642-33415-3_43)

Here, we provide the ground truth labels of: cell centers and segmentation, which are used in the publication "Cell Detection and Segmentation Using Correlation Clustering", MICCAI 2014. (DOI: 10.1007/978-3-319-10404-1_2)


[BIOIMAGE] 2D bright field yeast cell images with ground truth annotations. Zhang, Chong; Huber, Florian; Michael Knop; Fred Hamprecht.

 

Dataset used to evaluate the method described in "Yeast cell detection and segmentation in bright field microscopy", ISBI 2014 (DOI: 10.1109/ISBI.2014.6868107).


2D bright field Fission yeast cell images with ground truth annotations. Chong Zhang.

Original images are from http://www-bcf.usc.edu/~forsburg/pombeX.html. This software is associated with the publication "PombeX: robust cell segmentation for fission yeast transillumination images", PLoS One 2013. (doi: 10.1371/journal.pone.0081434)

Here, we provide the ground truth labels of: cell centers and segmentation, which are used in the publications:

  • "Learning to Segment: Training Hierarchical Segmentation under a Topological Loss", MICCAI 2015. (DOI: 10.1007/978-3-319-24574-4_32)
  • "Hierarchical Planar Correlation Clustering for Cell Segmentation", EMMCVPR 2015. (DOI: 10.1007/978-3-319-14612-6_36)
  • "Cell Detection and Segmentation Using Correlation Clustering", MICCAI 2014. (DOI: 10.1007/978-3-319-10404-1_2)

The Visual Lip-Reading Feasibility (VLRF) database is designed with the aim of contributing to research in visual-only speech recognition. A key difference of the VLRF database with respect to existing corpora is that it has been designed from a novel point of view: instead of trying to lip-read from people who are speaking naturally (normal speed, normal intonation, ...), we propose to lip-read from people who strive to be understood.

Database available for download at http://fsukno.atspace.eu/Data.htm#VLRF

Reference publication available at http://hdl.handle.net/10230/32726


 

Description

Orchset is intended to be used as a dataset for the development and evaluation of melody extraction algorithms. This collection contains 64 audio excerpts of symphonic music, with their corresponding melody annotations.

Melody is here defined as “the single (monophonic) pitch sequence that a listener might reproduce if asked to whistle or hum a piece of polyphonic music”.

The dataset creation comprised several tasks: excerpt selection, recording sessions of people singing along with the excerpts, analysis of the recordings, and melody annotation. A complete description of the dataset and the creation methodology is presented in this paper:

Bosch, J., Marxer, R., Gomez, E., “Evaluation and Combination of Pitch Estimation Methods for Melody Extraction in Symphonic Classical Music”, Journal of New Music Research

 

For full details please go here


Check the original page for updates here

This dataset includes 118 recordings of sung melodies. The recordings were made as part of the experiments on Query-by-Humming (QBH) reported in the following article:
 
J. Salamon, J. Serrà and E. Gómez, "Tonal Representations for Music Retrieval: From Version Identification to Query-by-Humming", International Journal of Multimedia Information Retrieval, special issue on Hybrid Music Information Retrieval, In Press (accepted Nov. 2012). 
 
 
The recordings were made by 17 different subjects, 9 female and 8 male, whose musical experience ranged from none at all to amateur musicians. Subjects were presented with a list of songs out of which they were asked to select the ones they knew and sing part of the melody. The subjects were aware that the recordings would be used as queries in an experiment on QBH. There was no restriction as to how much of the melody should be sung nor which part of the melody should be sung, and the subjects were allowed to sing the melody with or without lyrics. The subjects did not listen to the original songs before recording the queries, and the recordings were all sung a cappella without any accompaniment or reference tone. To simulate a realistic QBH scenario, all recordings were done using a basic laptop microphone and no post-processing was applied. The duration of the recordings ranges from 11 to 98 seconds, with an average recording length of 26.8 seconds.
 
In addition to the query recordings, three meta-data files are included, one describing the queries and two describing the music collections against which the queries were tested in the experiments described in the aforementioned article. Whilst the query recordings are included in this dataset, audio files for the music collections listed in the meta-data files are NOT included in this dataset, as they are protected by copyright law. If you wish to reproduce the experiments reported in the aforementioned paper, it is up to you to obtain the original audio files of these songs.
 
All subjects have given their explicit approval for this dataset to be made public.   
 

Audio Files Included

118 audio files of sung melodies (see description above) in 16-bit mono WAV format, sampled at 44.1 kHz
 
Meta-data Files Included

Collection_Full.csv

This file contains meta-data information about the full collection of songs against which the queries were evaluated in the experiments described in the aforementioned article. The collection was compiled in the context of:

J. Serrà, "Identification of versions of the same musical composition by processing audio descriptions", PhD Thesis, Universitat Pompeu Fabra, Barcelona, Spain.
 
In total 2125 recordings are described, and for each recording there are six columns:
  • Song ID: a unique and arbitrary identifier given to each recording. NOTE: the ID is unique for every recording, meaning different versions (recordings) of the same song will have different IDs.
  • Title: the title of the song. Note that recordings of the same song may have different titles, for example if they are sung in different languages.
  • Artist: the performing artist for this specific recording.
  • Original Artist: the composer of the song.
  • Canonical: whether this recording is the canonical version of this song. If it is, the field is set to 'YES'. Otherwise (i.e. not the canonical version) the field is left empty. NOTE: this labeling was done by the authors of this dataset and is completely subjective.
  • Class label: a unique label for all recordings of the same musical piece. This is the only field guaranteed to be identical for all recordings which are considered by the authors of the dataset to be versions of the same musical piece. 

Collection_Canonicals.csv

This file contains meta-data information about the collection of canonical songs against which the queries were evaluated in the experiments described in the aforementioned article. Unlike the full collection, this collection includes only one version of every musical piece, the one considered by the authors of the dataset to be the canonical version of the piece. This means this collection is a SUBSET of the full collection above. NOTE: the selection of canonical songs was done by the authors of the dataset and is completely subjective. In total 481 recordings are described, and for each recording there are six columns, as for the full collection. Since this collection is a subset of the full collection, the column values for a recording with certain Song ID in Collection_Canonicals.csv will be identical to the column values for the same recording (i.e. same Song ID) in Collection_Full.csv.

Queries.csv

This file contains meta-data information for the query recordings. In total 118 recordings are described, and for each recording there are six columns:
  • Filename: the name of the audio file.
  • Query ID: the ID of the query.
  • Song ID: the ID of the song to which the query corresponds. NOTE: use this ID to match the query to the song to which it corresponds in Collection_Full.csv and Collection_Canonicals.csv (a sketch of this join is shown after this list).
  • Original Artist: the composer of the song to which the query corresponds.
  • Class label: a unique label for all recordings of the same musical piece. This is the only field guaranteed to be identical for all recordings (whether query or song) which are considered by the authors of the dataset to be versions of the same musical piece.
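Since queries and collection entries share the Song ID and Class label fields, linking queries to their target songs reduces to a join between the files. An illustrative pandas sketch, assuming the CSV headers are spelled exactly as listed above:

```python
import pandas as pd

# Assumes the column headers match the names listed above.
queries = pd.read_csv("Queries.csv")
collection = pd.read_csv("Collection_Full.csv")

# Link each query to its target song via the Song ID.
matched = queries.merge(collection, on="Song ID",
                        suffixes=("_query", "_song"))

# All recordings that are versions of a queried piece share its Class label.
versions = collection[collection["Class label"]
                      .isin(matched["Class label_query"])]
print(len(matched), "matched queries;", len(versions), "candidate versions")
```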

Download

Please go to the Download page.

 
 

Description

This dataset includes musical audio excerpts with annotations of the predominant instrument(s) present. It was used for the evaluation in the following article:

Bosch, J. J., Janer, J., Fuhrmann, F., & Herrera, P. “A Comparison of Sound Segregation Techniques for Predominant Instrument Recognition in Musical Audio Signals”, in Proc. ISMIR (pp. 559-564), 2012

IRMAS is intended to be used for training and testing methods for the automatic recognition of predominant instruments in musical audio. The instruments considered are: cello, clarinet, flute, acoustic guitar, electric guitar, organ, piano, saxophone, trumpet, violin, and human singing voice. This dataset is derived from the one compiled by Ferdinand Fuhrmann in his PhD thesis, with the difference that we provide the audio data in stereo format, the annotations in the testing dataset are limited to specific pitched instruments, and the amount and length of excerpts differ.

For full details and download go here.


This dataset was created in the context of the Pablo project, partially funded by KORG Inc. It contains monophonic recordings of two kinds of exercises: single notes and scales. The dataset was reported in the following article:
 
Romaní Picas O., Parra Rodriguez H., Dabiri D., Tokuda H., Hariya W., Oishi K., & Serra X."A real-time system for measuring sound goodness in instrumental sounds", 138th Audio Engineering Society Convention (2015). 
 
The recordings were made in the Universitat Pompeu Fabra / Phonos recording studio by 15 different professional musicians, all of them holding a music degree and having some expertise in teaching. 12 different instruments were recorded using one to four different microphones (depending on the recording session). For all the instruments, the whole set of playable semitones on the instrument was recorded several times with different tonal characteristics. Each note is recorded in a separate mono .flac audio file at 48 kHz and 32 bits. The tonal characteristics are explained both in the following section and in the related publication.
 

The audio files are organised in one directory per recording session. In addition to the audio files, one SQLite database file is included. The structure of the database is described in the following section.

 

Check this page for a detailed description and download


[AUDIO] Freesound Dataset (FSD)

Freesound Dataset (FSD) is a large-scale, general-purpose audio dataset. It consists of audio samples from Freesound organised in the hierarchy of Google's AudioSet Ontology, which includes 632 classes. We have mapped a total of 268,261 samples to the AudioSet Ontology. The mapping generated 703,359 annotations, which express the potential presence of a sound source in an audio sample. You can find more information in our ISMIR 2017 paper, and you can contact us via email for issues related to FSD.

Read more at https://datasets.freesound.org/fsd/


MSD-I: Million Song Dataset with Images for Multimodal Genre Classification

The Million Song Dataset (https://labrosa.ee.columbia.edu/millionsong/) is a collection of metadata and precomputed audio features for 1 million songs. Along with this dataset, a dataset with annotations of 15 top-level genres, with a single label per song, was released. In our work, we combine the CD2c version of this genre dataset (http://www.tagtraum.com/msd_genre_datasets.html) with a collection of album cover images.

The final dataset contains 30,713 tracks from the MSD and their related album cover images, each annotated with a unique genre label among 15 classes. An initial analysis of the images showed that this set of tracks is associated with 16,753 albums, yielding an average of 1.8 songs per album.

We randomly divide the dataset into three parts: 70% for training, 15% for validation, and 15% for testing, with no artist or album overlap across the sets. This is crucial to avoid overfitting, as the classifier might otherwise learn to predict the artist instead of the genre.
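A split with no artist overlap can be reproduced with scikit-learn's group-aware splitters; the snippet below is a generic recipe with toy arrays, not the exact partitions released with MSD-I.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-ins for track ids, genre labels and artist ids.
track_ids = np.arange(10)
labels = np.random.randint(0, 15, size=10)
artists = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])

# First carve out the training set, then split the rest into val/test,
# always grouping by artist so no artist spans two subsets.
outer = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, rest_idx = next(outer.split(track_ids, labels, groups=artists))
inner = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
val_rel, test_rel = next(inner.split(rest_idx, groups=artists[rest_idx]))
val_idx, test_idx = rest_idx[val_rel], rest_idx[test_rel]
print(sorted(train_idx), sorted(val_idx), sorted(test_idx))
```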

https://doi.org/10.5281/zenodo.1240485

Content:

  • MSD-I dataset (mapping, metadata, annotations and links to images)
  • Data splits and feature vectors for TISMIR single-label classification experiments

These data can be used together with the Tartarus deep learning python module https://github.com/sergiooramas/tartarus.

Scientific References:

Please cite the following paper if you use the MSD-I dataset or the Tartarus software.

Oramas, S., Barbieri, F., Nieto, O., and Serra, X. (2018). Multimodal Deep Learning for Music Genre Classification, Transactions of the International Society for Music Information Retrieval, 1(1).


The AcousticBrainz project aims to crowdsource acoustic information for all music in the world and to make it available to the public. This acoustic information describes the acoustic characteristics of music and includes low-level spectral information and information for genres, moods, keys, scales and much more. The goal of AcousticBrainz is to provide music technology researchers and open source hackers with a massive database of information about music. We hope that this database will spur the development of new music technology research and allow music hackers to create new and interesting recommendation engines.

AcousticBrainz is a joint effort between Music Technology Group at Universitat Pompeu Fabra in Barcelona and the MusicBrainz project. AcousticBrainz was originally envisioned by Xavier Serra, the founder and head of the MTG. At the heart of this project lies the Essentia toolkit from the MTG – this open source toolkit enables the automatic analysis of music. The output from Essentia is collected by the AcousticBrainz project and made available to the public.

AcousticBrainz organizes the data on a recording basis, indexed by the MusicBrainz ID of each recording. If you know the MBID of a recording, you can easily fetch its data from AcousticBrainz. For details on how to do this, visit our API documentation; a minimal sketch is shown below.
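The endpoint path below follows the public AcousticBrainz API documentation at the time of writing, and the MBID shown is a placeholder; consult the API documentation for current paths and rate limits.

```python
import json
from urllib.request import urlopen

mbid = "00000000-0000-0000-0000-000000000000"  # placeholder MBID
url = f"https://acousticbrainz.org/api/v1/{mbid}/low-level"

with urlopen(url) as resp:                     # GET the JSON document
    data = json.load(resp)

# Descriptors are grouped into sections such as "lowlevel", "rhythm", "tonal".
print(sorted(data.keys()))
print(data["lowlevel"].get("average_loudness"))
```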

All of the data contained in AcousticBrainz is licensed under the CC0 license (public domain).

 


[AUDIO AND TEXT] PHENICX-Anechoic: denoised recordings and note annotations for Aalto anechoic orchestral database

This dataset includes audio and annotations useful for tasks such as score-informed source separation, score following, multi-pitch estimation, transcription and instrument detection, in the context of symphonic music.

 

This dataset was presented and used in the evaluation of:

 

M. Miron, J. Carabias-Orti, J. J. Bosch, E. Gómez and J. Janer, "Score-informed source separation for multi-channel orchestral recordings", Journal of Electrical and Computer Engineering, 2016.

 

On this web page we do not provide the original audio files; they can be found at the web page hosted by Aalto University. However, with their permission, we distribute denoised versions of some of the anechoic orchestral recordings:

 

Pätynen, J., Pulkki, V., and Lokki, T., "Anechoic recording system for symphony orchestra," Acta Acustica united with Acustica, vol. 94, nr. 6, pp. 856-865, November/December 2008.

 

For the intellectual rights and the distribution policy of the audio recordings in this dataset, contact Aalto University (Jukka Pätynen and Tapio Lokki). For more information about the original anechoic recordings, we refer to the web page and the associated publication above.

 

CHECK THE WEBPAGE OF THE DATASET FOR MORE INFORMATION, CONDITIONS FOR USE AND DOWNLOAD


Jingju (Beijing opera) Phoneme Annotation

This dataset is a collection of boundary annotations of a cappella singing performed by Beijing Opera (Jingju, 京剧, wiki page) professional and amateur singers.

The boundaries have been annotated in a hierarchical way: line (phrase), syllable and phoneme singing units have been annotated in a jingju (Beijing opera) a cappella singing audio dataset.

The corresponding audio files are the a cappella singing aria recordings, which are stereo or mono, sampled at 44.1 kHz, and stored as WAV files. Due to their large size, we cannot upload the audio files here; please refer to our Zenodo link: http://doi.org/10.5281/zenodo.344932

The WAV files were recorded by two institutes: file names ending with ‘qm’ were recorded by C4DM, Queen Mary University of London; file names ending with ‘upf’ or ‘lon’ were recorded by MTG-UPF. If you use this audio dataset in your work, please cite both of the following publications:

Rong Gong, Rafael Caro Repetto, & Yile Yang. (2017). Jingju a cappella singing dataset [Data set]. Zenodo. http://doi.org/10.5281/zenodo.344932

D. A. A. Black, M. Li, and M. Tian, “Automatic Identification of Emotional Cues in Chinese Opera Singing,” in 13th Int. Conf. on Music Perception and Cognition (ICMPC-2014), 2014, pp. 250–255.

For details: https://github.com/MTG/jingjuPhonemeAnnotation


[AUDIO AND TEXT] FSDnoisy18k

FSDnoisy18k is an audio dataset collected with the aim of fostering the investigation of label noise in sound event classification. It contains 42.5 hours of audio across 20 sound classes, including a small amount of manually-labeled data and a larger quantity of real-world noisy data.

Data curators

Eduardo Fonseca and Mercedes Collado

Contact

You are welcome to contact Eduardo Fonseca should you have any questions at [email protected].

Citation

If you use this dataset or part of it, please cite the following paper:

Eduardo Fonseca, Manoj Plakal, Daniel P. W. Ellis, Frederic Font, Xavier Favory, and Xavier Serra, “Learning Sound Event Classifiers from Web Audio with Noisy Labels”, arXiv preprint arXiv:1901.01189, 2019

You can also consider citing our ISMIR 2017 paper that describes the Freesound Datasets platform, which was used to gather the manual annotations included in FSDnoisy18k:

Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, “Freesound Datasets: A Platform for the Creation of Open Audio Datasets”, In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017

Description at http://doi.org/10.5281/zenodo.2529934

General details are also available at this UPF news item.