Methodology and work stages

Texts are selected and classified according to topics proposed by specialists in each subject area: Law, Economics, Environment, Medicine, Computer Science and Language Sciences.

The texts are then marked up in standard SGML, following the guidelines of the "Corpus Encoding Standard (CES)" of the EAGLES initiative.

Text processing includes the following steps:

structural tagging (tagsets)
text handling (detection of dates, numbers, proper names, ...)
morphological analysis and tagging according to the morphosyntactic tagsets for Spanish and Catalan developed at IULA
statistical and/or linguistic disambiguation
syntactic analysis: the syntactic analysis and annotation of more than 42,000 sentences of the Spanish part of the Corpus produced the IULA Spanish LSP Treebank
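
As an illustration of the text handling step listed above, the following is a minimal Python sketch of detecting dates and numbers with regular expressions. It is not the IULA implementation: the patterns, the labels "date" and "num", and the function name detect_entities are assumptions made for this example, and proper-name detection is left out because it needs more than simple patterns.

```python
import re

# Hypothetical patterns, for illustration only; not the rules used at IULA.
DATE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")   # e.g. 12/03/1998
NUMBER = re.compile(r"\b\d+(?:[.,]\d+)*\b")       # e.g. 1.500 or 3,14


def detect_entities(text):
    """Return sorted (start, end, label, surface) spans for dates and numbers."""
    spans = []
    for m in DATE.finditer(text):
        spans.append((m.start(), m.end(), "date", m.group()))
    date_spans = list(spans)
    for m in NUMBER.finditer(text):
        # Skip digit groups that lie inside an already detected date.
        if any(s <= m.start() and m.end() <= e for s, e, _, _ in date_spans):
            continue
        spans.append((m.start(), m.end(), "num", m.group()))
    return sorted(spans)


if __name__ == "__main__":
    for span in detect_entities("El 12/03/1998 se pagaron 1.500 pesetas."):
        print(span)
```

In a pipeline like the one described, spans of this kind would be detected before morphological analysis so that a date or a figure is handled as a single unit rather than as separate tokens.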