The second María de Maeztu Strategic Research Program (CEX2021-001195-M) of the Department of Information and Communication Technologies (DTIC) runs from 2023 to 2026. The website for this program is under construction. You can find some details in this news item.

The first María de Maeztu Strategic Research Program (MDM-2015-0502) took place between January 2016 and June 2020. It was focused on data-driven knowledge extraction, boosting synergistic research initiatives across our different research areas.


[MSc thesis] Term extraction and document similarity in an Integrated Learning Design Environment

Author: Alberto Martínez Rodríguez

Supervisors: Davinia Hernández Leo, Horacio Saggion

MSc program: Master in Intelligent Interactive Systems

The Integrated Learning Design Environment (ILDE) is a social platform that supports teachers in the computer-assisted design of learning activities. On this platform, teachers and course designers can contextualize, author and share their designs within their community. This social component of the ILDE would benefit from Information Retrieval and Natural Language Processing techniques that help teachers and course designers find shared designs as quickly and efficiently as possible. In this work, we use Natural Language Processing to classify learning designs written in Catalan, retrieve the users' content, parse this content with Freeling and extract education domain-specific terminology from the documents. To extract the terminology, a combination of two methods is used. The first method uses the Multilingual Central Repository ontology to check whether a term belongs to any of four pedagogical fields. The second method computes the tf-idf of all the terms in the documents using a non-domain-specific corpus, the Catalan Wikipedia. This work also discusses the potential of the proposed combination of methods to retrieve simple and complex terms from documents. The resulting combined method weights the contribution of each method in the extraction process to assign a score to each retrieved term. After extracting education domain-specific terminology from different ILDE documents, a Document Similarity Application was created for teachers and course designers. This application allows users to search for documents based on their similarity to another document of the same ILDE community. In addition, given a document, users can visualize the education terminology that belongs to it. Finally, users can also search for documents using a terminology-based query to obtain a set of documents and their similarity with respect to that query.
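As a rough illustration of the pipeline described in the abstract, the sketch below scores terms by combining an ontology-membership signal with tf-idf computed against a reference corpus, and then compares documents by cosine similarity over those scores. It is not the thesis implementation: the function names, the `alpha` weight, the smoothed idf formula, the normalization, and the use of cosine similarity are all assumptions made for this example. The set `domain_terms` stands in for a lookup in the Multilingual Central Repository, and `reference_docs` stands in for tokenized Catalan Wikipedia articles.

```python
import math
from collections import Counter


def tf_idf(doc_tokens, reference_docs):
    """Score each term of a document by tf-idf.

    The idf part comes from a separate reference corpus (assumed here to
    be a list of token lists), not from the document collection itself.
    """
    n_docs = len(reference_docs)
    # Document frequency: number of reference documents containing the term.
    df = Counter()
    for ref in reference_docs:
        df.update(set(ref))
    tf = Counter(doc_tokens)
    total = len(doc_tokens)
    # Smoothed idf (an assumption) so unseen terms do not divide by zero.
    return {
        term: (count / total) * math.log((1 + n_docs) / (1 + df[term]))
        for term, count in tf.items()
    }


def combined_scores(doc_tokens, reference_docs, domain_terms, alpha=0.5):
    """Combine ontology membership and normalized tf-idf into one score.

    alpha is a hypothetical weight: alpha goes to the ontology signal,
    (1 - alpha) to tf-idf scaled into the range [0, 1].
    """
    tfidf = tf_idf(doc_tokens, reference_docs)
    top = max(tfidf.values(), default=0.0) or 1.0
    return {
        term: alpha * (1.0 if term in domain_terms else 0.0)
        + (1 - alpha) * (score / top)
        for term, score in tfidf.items()
    }


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-score dictionaries."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Under these assumptions, a terminology-based query can be handled the same way as a document: build a term-score dictionary for the query and rank the stored documents by `cosine_similarity` against it.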

Additional material:

Open access version at UPF e-repository

Department of Information and Communication Technologies, UPF

Grant CEX2021-001195-M funded by MCIN/AEI/10.13039/501100011033


 




  • Àngel Lozano - Scientific director
  • Aurelio Ruiz - Program management