Corpus IULA project

The main goal of the Corpus project is the construction and exploitation of a multilingual, specialized text corpus. The languages involved are Catalan, Spanish, English, German and French. The areas of interest are economics, law, computer science, medicine, environmental science and linguistic sciences. The ultimate aim is to infer the norms that determine the behavior of each language in each area.

This corpus is the main resource supporting research and teaching at IULA. The research carried out on the corpus includes the detection of neologisms and specialized terms, studies on linguistic variation, partial syntactic analysis (parsing), text alignment, data extraction for teaching purposes and for the creation of electronic dictionaries, the compilation of thesauri, etc.

Within the framework of the METANET4U project (2011-2013), two changes were made: (1) corpus processing was adapted to the new LAF guidelines (ISO 24612:2012, Language resource management -- Linguistic annotation framework), to XML format and to "stand-off" annotation, and (2) a syntactic annotation level was added to more than 42,000 Spanish sentences in the corpus.
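
As a rough illustration of what "stand-off" annotation means in general, the minimal Python sketch below keeps the primary text untouched and stores annotations separately as character-offset spans. It does not reproduce the corpus's actual XML files or the LAF/GrAF serialization: the class name Annotation, the function covered_text, the layer name "pos", the tag values and the sample sentence are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One stand-off annotation: it points into the text, it does not modify it."""
    start: int  # inclusive character offset into the primary text
    end: int    # exclusive character offset into the primary text
    layer: str  # hypothetical annotation layer name, e.g. "pos"
    label: str  # hypothetical tag value for that layer


def covered_text(text: str, ann: Annotation) -> str:
    """Return the stretch of primary text that an annotation refers to."""
    return text[ann.start:ann.end]


# The primary text is stored once and never edited (illustrative sentence).
primary_text = "El corpus crece"

# Annotations live apart from the text and refer to it only by offsets,
# so further layers can be added later without touching the text itself.
annotations = [
    Annotation(0, 2, "pos", "DET"),
    Annotation(3, 9, "pos", "NOUN"),
    Annotation(10, 15, "pos", "VERB"),
]

for ann in annotations:
    print(ann.layer, ann.label, covered_text(primary_text, ann))
```

Because each layer only references offsets, a new level such as syntactic annotation could in principle be added as another list of spans alongside the existing ones, which is the general motivation for the stand-off approach.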

Technical documentation for the project:

Methodology and work procedures (tags and text classification included)
Tools
Participants and collaborating organizations

The corpus's technical papers were published in the collection Papers de l'IULA. In 2006 a working paper titled 10 anys del Corpus de l'IULA was published; it is available in the e-repositori.

Principal investigator: M. Teresa Cabré Castellví.

Technical coordination: Jorge Vivaldi Palatresi.