[MSc thesis] Term extraction and document similarity in an Integrated Learning Design Environment
Author: Alberto Martínez Rodríguez
Supervisors: Davinia Hernández Leo, Horacio Saggion
MSc program: Master in Intelligent Interactive Systems
The Integrated Learning Design Environment (ILDE) is a social platform focused on supporting teachers in the computer-assisted design of learning activities. In this platform, teachers and course designers can contextualize, author and share their designs within their community. This social component of the ILDE would benefit from the application of Information Retrieval and Natural Language Processing techniques to help teachers and course designers find shared designs as quickly and efficiently as possible. In this work, we use Natural Language Processing to classify learning designs written in Catalan, retrieve the content created by users, parse this content with Freeling and extract education domain-specific terminology from the documents. To extract the terminology, a combination of two methods is used. The first method uses the Multilingual Central Repository ontology to check whether a term belongs to any of four pedagogical fields. The second method computes the tf-idf of all the terms in each document using a non-domain-specific corpus, the Catalan Wikipedia. This work also discusses the potential of the proposed combination of methods to retrieve simple and complex terms from documents. The resulting combined method weights the contribution of each method in the extraction process to assign a score to each retrieved term. After extracting education domain-specific terminology from different ILDE documents, a Document Similarity Application addressed to teachers and course designers was created. This application allows users to search for documents based on their similarity to another document of the same ILDE community. In addition, given a document, users can visualize the education terminology that belongs to that document. Finally, users can also search for documents using a terminology-based query to obtain a set of documents and their similarity with respect to that query.
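The combined scoring and the similarity search described in the abstract can be illustrated with a minimal Python sketch. It is not the implementation used in the thesis: the function names, the smoothed idf formula, the linear mixing weight `alpha`, the use of cosine similarity, and the toy data are all assumptions made for illustration. The set `domain_terms` stands in for the ontology lookup, and `background_docs` stands in for the non-domain-specific corpus.

```python
import math
from collections import Counter


def tf_idf(doc_tokens, background_docs):
    """tf-idf per term; document frequency is taken from a background corpus."""
    tf = Counter(doc_tokens)
    n_docs = len(background_docs)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for doc in background_docs if term in doc)
        # Smoothed idf (an assumption; the abstract does not state the variant used).
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        scores[term] = (count / len(doc_tokens)) * idf
    return scores


def combined_scores(doc_tokens, background_docs, domain_terms, alpha=0.5):
    """Mix a binary domain-membership signal with normalized tf-idf.

    alpha is a hypothetical weight: 1.0 trusts only the domain lookup,
    0.0 trusts only tf-idf.
    """
    tfidf = tf_idf(doc_tokens, background_docs)
    top = max(tfidf.values(), default=0.0) or 1.0  # avoid division by zero
    return {
        term: alpha * (1.0 if term in domain_terms else 0.0)
        + (1.0 - alpha) * (score / top)
        for term, score in tfidf.items()
    }


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-score dictionaries."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy data (hypothetical): background documents as token sets, two designs as token lists.
background = [{"el", "gat", "menja"}, {"el", "riu", "baixa"}]
domain = {"avaluació", "aprenentatge"}
design_a = ["el", "aprenentatge", "avaluació", "el"]
design_b = ["el", "aprenentatge", "riu"]

scores_a = combined_scores(design_a, background, domain)
scores_b = combined_scores(design_b, background, domain)
print(sorted(scores_a, key=scores_a.get, reverse=True))
print(cosine_similarity(scores_a, scores_b))
```

Under these assumptions, a terminology-based query can be handled in the same way as a document: the query terms are scored with `combined_scores` and compared against each stored document with `cosine_similarity`.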
Additional material: