Thesis proposals with Wikipedia and Zenodo

In addition to supporting the traditional technology and knowledge transfer mechanisms available at UPF, the María de Maeztu Strategic Research Program aims to contribute to bidirectional transfer models aligned with the Open Science and Open Innovation movements.

In this context, the program runs its own Open Science and Innovation program, which supports the transition from research results of the ongoing MdM projects to sustainable actions aligned with Open Science principles; organises regular talks on internal and external open science initiatives and works to promote closer collaboration with them; and promotes the reproducibility of research results and their use in its educational activities.

For the next year, several bachelor's thesis projects are proposed to UPF students in cooperation with two flagship initiatives of the "open movement": Wikipedia, and the open access repository Zenodo, created by OpenAIRE and CERN. These theses complement the many others based on open actions promoted by the department, such as Freesound, ILDE or RocketApp.

WIKIMEDIA FOUNDATION. Proposals supervised by Diego Saez-Trumper

- Data Science project in Cultural Heritage: Tracking the Flow of Historical Artifacts using Wikidata (Wikipedia): The aim of this project is to extract and construct an interactive map of the provenance of historical artifacts such as artworks and archaeological and paleontological remains. It is a well-known fact that many historical artifacts have been moved away from their country of origin; for example, many artifacts gathered from Bergama, Turkey, are on display in a museum in Berlin, Germany. We aim to use the semantic web structure of Wikidata (www.wikidata.org) to construct a map that shows the flow of historical artifacts; more precisely, we will show the inflow and outflow of historical artifacts for each country (see the query sketch below). The findings of this work are expected to be published at an international conference.
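
As a flavour of the kind of extraction involved, here is a minimal Python sketch that queries the public Wikidata SPARQL endpoint for artifacts together with their place of discovery and current location. The class and property identifiers used (Q220659 "archaeological artifact", P189 "place of discovery", P276 "location") are illustrative assumptions; selecting and refining the right properties would be part of the project itself.

    import requests

    ENDPOINT = "https://query.wikidata.org/sparql"
    QUERY = """
    SELECT ?artifactLabel ?originLabel ?currentLabel WHERE {
      ?artifact wdt:P31/wdt:P279* wd:Q220659 .  # an archaeological artifact (assumed class)
      ?artifact wdt:P189 ?origin .              # place of discovery
      ?artifact wdt:P276 ?current .             # current location (e.g. a museum)
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 100
    """

    response = requests.get(ENDPOINT, params={"query": QUERY, "format": "json"},
                            headers={"User-Agent": "artifact-flow-sketch/0.1 (example)"})
    response.raise_for_status()
    for row in response.json()["results"]["bindings"]:
        print(row["artifactLabel"]["value"],
              "| discovered at:", row["originLabel"]["value"],
              "| now at:", row["currentLabel"]["value"])

Aggregating such rows by pairs of countries would give the inflow and outflow counts needed for the map.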

- Connecting the Semantic Web and Jupyter Notebooks: Building an Open-Source and User-Friendly Python Library for Semantic Queries (SPARQL) in Wikidata (Wikipedia): Wikidata is a free and open knowledge base that can be read and edited by both humans and machines. Wikidata acts as central storage for the structured data of its Wikimedia sister projects, including Wikipedia, Wikivoyage, Wikisource, and others. Currently Wikidata has around 50 million items, and it is one of the most popular semantic databases in the world [1]. In Wikidata every concept is represented as an item that can be described in triples relating it to other items, for example, describing George Orwell [2]:

George_Orwell isAnInstanceOf HumanBeing

George_Orwell hasOccupation Writer

George_Orwell hasAChild Richard_Blair

Therefore, the Wikidata database can be queried using the same triple structure. For example, the (pseudo-)query

    SELECT ?child WHERE { ?child hasFather George_Orwell . }

will return: Richard_Blair

Other, more complex queries could retrieve: all the cities with female mayors; all the players in the Spanish League born in South America; or the most popular eye colour among people recorded in Wikidata (for examples check [3]). Unfortunately, learning how to query such a database is not easy and, moreover, connecting the results of those queries with Jupyter Notebooks [4] (currently the most popular interface for data scientists) is complicated. In this project, we will create an open-source and user-friendly Python library that facilitates the connection between Wikidata and Jupyter Notebooks (a rough sketch of the idea follows the references below). The results of this work will be uploaded to public repositories, with the potential of reaching thousands of users around the world.

[1] www.wikidata.org

[2] www.wikidata.org/wiki/Q3335

[3] query.wikidata.org

[4] jupyter.org
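
To make the goal concrete, below is a minimal sketch of the kind of helper such a library could expose: a single function that sends a SPARQL query to the Wikidata endpoint and returns a pandas DataFrame ready for use in a notebook. The function name and interface are hypothetical; designing the real API is the core of the project.

    import pandas as pd
    import requests

    def query_wikidata(sparql: str) -> pd.DataFrame:
        """Run a SPARQL query against Wikidata and return the results as a DataFrame."""
        response = requests.get(
            "https://query.wikidata.org/sparql",
            params={"query": sparql, "format": "json"},
            headers={"User-Agent": "wikidata-notebook-sketch/0.1 (example)"},
        )
        response.raise_for_status()
        bindings = response.json()["results"]["bindings"]
        # Flatten the nested SPARQL JSON results into plain columns.
        return pd.DataFrame([{k: v["value"] for k, v in row.items()} for row in bindings])

    # Example: the children of George Orwell (item Q3335 [2]), using the real
    # Wikidata identifiers behind the readable pseudo-predicates shown above
    # (P40 is the "child" property).
    df = query_wikidata("""
        SELECT ?childLabel WHERE {
          wd:Q3335 wdt:P40 ?child .
          SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
        }
    """)
    print(df)  # expected to include Richard Blair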

- Your Language, My Language, Our Language: Studying International Auxiliary Languages in Wikipedia with Data Science: An international auxiliary language (IAL) is a language meant for communication between people from different nations (...). An auxiliary language is primarily a foreign language [1]. Some popular examples are Esperanto and Interlingua. One of the aims of these languages is to avoid the supremacy of one language over another, allowing people from different cultures to communicate without privileging just one language, creating a neutral 'lingua franca' [3]. The aim of this Data Science project is to study the development of the Wikipedia editions in IALs, studying contributor (writer) and reader behavior. Taking advantage of the open-data policy of the Wikipedia projects, we will study where the contributors of IAL editions come from (in which other languages they edit), which topics they cover, and also what readers visit in such editions. This will be a Data Science project, analyzing big datasets, both structured (SQL replicas [2]) and unstructured (Wikipedia dumps [4]), using Data Visualization, Natural Language Processing and Machine Learning techniques (a small example of reader-side data access is sketched after the references below). The results of this project are intended to be published at international conferences as well as on open platforms such as Wikimedia Meta [5].

[1] https://en.wikipedia.org/wiki/International_auxiliary_language

[2] https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database

[3] https://en.wikipedia.org/wiki/Lingua_franca

[4] https://dumps.wikimedia.org

[5] https://meta.wikimedia.org/wiki/Main_Page
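
As a taste of the reader-side data, this sketch compares the yearly traffic of two IAL editions through the public Wikimedia Pageviews REST API. The choice of projects, date range and User-Agent string is illustrative.

    import requests

    # Wikipedia editions for two international auxiliary languages.
    PROJECTS = {"Esperanto": "eo.wikipedia", "Interlingua": "ia.wikipedia"}
    URL = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/"
           "{project}/all-access/user/monthly/{start}/{end}")

    for name, project in PROJECTS.items():
        r = requests.get(URL.format(project=project, start="2018010100", end="2018123100"),
                         headers={"User-Agent": "ial-study-sketch/0.1 (example)"})
        r.raise_for_status()
        total = sum(item["views"] for item in r.json()["items"])
        print(f"{name} ({project}): {total:,} user pageviews in 2018")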

- A Machine Learning Project for Multilingualism: Aligning Wikitext Across Languages (Wikipedia): Wikipedia is an online knowledge repository with editions in more than 290 languages. Wikitext [1] is the markup language (like LaTeX or Markdown) used to write Wikipedia articles. While many of the conventions for writing in Wikitext are language independent, every time a specific instruction requires text, that text is language dependent. For example, to include the image "Pompeu.jpg" in a Wikipedia article you will use different markup depending on the language:

* English: [[File:Pompeu.jpg]]
* Catalan: [[Fitxer:Pompeu.jpg]]
* Spanish: [[Archivo:Pompeu.jpg]]
* Basque: [[Fitxategi:Pompeu.jpg]]

(In Arabic or Russian you will find the same pattern in the Arabic or Cyrillic scripts.) A similar situation occurs with many other language-dependent markups. The aim of this project is to use Machine Learning techniques and Big Data processing to learn this mapping across all languages, and to create a library and/or an API that will allow developers and researchers to use language-agnostic approaches when working with Wikipedia data, helping the integration and development of underrepresented languages in Wikipedia. The knowledge and skills developed in this project include a basic understanding of Machine Learning algorithms, working with very large datasets, and the basics of Natural Language Processing. The output of this work will be uploaded to public repositories, with the possibility of being productionized on the Wikimedia Tool Labs [2] servers. A toy sketch of the target normalization behaviour is shown below.

[1] https://www.mediawiki.org/wiki/Wikitext
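
To illustrate what "language agnostic" could mean in practice, here is a toy Python sketch that normalizes localized file-link prefixes to the English form. The prefix table is hand-written from the examples above; the point of the project is precisely to learn such mappings automatically, at scale, for all namespaces and languages.

    import re

    # A hand-written fragment of the mapping the project would learn from data:
    # localized "File:" namespace prefixes for a few languages.
    FILE_PREFIXES = {"en": "File", "ca": "Fitxer", "es": "Archivo", "eu": "Fitxategi"}

    def normalize_file_links(wikitext: str, lang: str) -> str:
        """Rewrite a localized file link such as [[Fitxer:Pompeu.jpg]] into the
        language-agnostic English form [[File:Pompeu.jpg]]."""
        prefix = FILE_PREFIXES[lang]
        return re.sub(r"\[\[" + re.escape(prefix) + r":", "[[File:", wikitext)

    print(normalize_file_links("[[Fitxer:Pompeu.jpg]]", "ca"))  # -> [[File:Pompeu.jpg]]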

ZENODO. Proposals supervised by Horacio Saggion (TALN, UPF), with the support of the Zenodo team

- Mining Zenodo with Dr Inventor: Launched in 2013, Zenodo is a data repository which allows researchers to store and discover different types of research data, including research papers. In its current version, Zenodo offers little support for discovering the valuable resources it contains, limiting search to a few facets such as format (e.g. txt, pdf, zip), type (e.g. image), author-assigned keywords (e.g. biodiversity), and free-text search. In this bachelor thesis we propose the development of an application to extract information from Zenodo's universe of research papers in order to allow semantic indexing and discovery. The work will be based on the application of the Dr Inventor tool to support the extraction of specific information types to be defined during the project (e.g. topic discovery, keyword extraction). The project will be supervised by Dr. Horacio Saggion (TALN). The candidate will have access to a number of available text processing tools to carry out the project. Zenodo's dumps of research articles will be available to carry out the research and development. (A minimal sketch of programmatic access to Zenodo records follows the links below.)

Zenodo: https://zenodo.org/

Dr Inventor Library: http://backingdata.org/dri/library/

PDFdigest: http://scientmin.taln.upf.edu/pdfdigest/pdfparser.php

Example application: http://scipub-taln.upf.edu/sepln/
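
For orientation, the sketch below pulls a page of publication records from Zenodo's public REST API; the search query and page size are illustrative, and the project itself would work from the provided dumps rather than live API calls.

    import requests

    # Search Zenodo for publications matching a free-text query.
    r = requests.get("https://zenodo.org/api/records",
                     params={"q": "biodiversity", "type": "publication", "size": 10})
    r.raise_for_status()
    for hit in r.json()["hits"]["hits"]:
        meta = hit["metadata"]
        # Titles and author-assigned keywords are the kind of shallow metadata
        # that the proposed semantic indexing would go beyond.
        print(meta["title"], "|", meta.get("keywords", []))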

- Semantic Indexing and Discovery of Software Artefacts in Zenodo using Natural Language Processing: Launched in 2013, Zenodo is a data repository which allows researchers to store and discover different types of research data, including software projects. In this bachelor project we propose to take advantage of the textual descriptions associated with software projects in order to extract semantic information from them. This proposal aims to analyze all the textual and source code information in open software projects in order to support software discovery. The work will be based on the use of current NLP methods which take advantage of continuous word vector representations (a toy sketch is given after the link below). The candidate will have access to a number of available text processing tools to carry out the project. Zenodo's dumps of software will be available to carry out the research and development.

Zenodo: https://zenodo.org/
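
As a toy illustration of continuous word vectors on this kind of data, the sketch below trains word2vec (via gensim) on a few made-up software descriptions and asks for the neighbours of a term. The real project would train on the full Zenodo software dumps or start from pretrained vectors.

    from gensim.models import Word2Vec

    # Made-up stand-ins for Zenodo software descriptions.
    descriptions = [
        "library for parsing scientific pdf documents",
        "toolkit for training neural networks on text",
        "parser for extracting tables from pdf files",
    ]
    tokenized = [d.split() for d in descriptions]

    # Train small continuous word vectors on the tokenized descriptions.
    model = Word2Vec(tokenized, vector_size=50, window=3, min_count=1, epochs=50)

    # Words with nearby vectors are candidates for the same semantic index term.
    print(model.wv.most_similar("pdf", topn=3))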