Corpus and related tools
IULA Corpus Tècnic
Access |
Online query interface: Bwananet |
Parallel English-Spanish subcorpus in stand-off format (sentence-aligned): e-repositori |
Description |
The Corpus Tècnic contains written texts from the fields of Law, Economics, Genomics, Medicine and Environment, as well as a contrastive corpus from the press. The texts are in Catalan, Spanish, English, French and German. Subsets of documents from the IULA Corpus Tècnic have been reworked within the framework of the Metanet4U project: the formatting has been updated to follow the most recent international standards and, in some cases, the linguistic information has been expanded. These subsets are available for download. |
IULA Spanish LSP Treebank
Access |
Online query interface: TreebankBrowser |
Download |
Corpus texts in CoNLL format: e-repositori |
Description |
Syntactic annotation of 42,000 sentences selected from the IULA Corpus Tècnic (Spanish), developed within the framework of the Metanet4U project. |
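As a rough illustration of working with the downloaded treebank files, the sketch below reads CoNLL-style text in Python. It assumes the ten tab-separated columns of the CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) and blank lines between sentences; the actual column layout of the released files should be checked first. The sample sentence, its tags and its relation labels are invented for the example and are not taken from the treebank.

```python
from typing import Dict, Iterator, List, Tuple

# Column names of the CoNLL-X layout (an assumption about the released files).
CONLL_COLUMNS = ["id", "form", "lemma", "cpostag", "postag",
                 "feats", "head", "deprel", "phead", "pdeprel"]

# Hypothetical sample sentence; tags and labels are illustrative only.
SAMPLE = (
    "1\tEl\tel\td\tda\t_\t2\tspec\t_\t_\n"
    "2\tpaciente\tpaciente\tn\tnc\t_\t3\tsubj\t_\t_\n"
    "3\tmejora\tmejorar\tv\tvm\t_\t0\tROOT\t_\t_\n"
    "\n"
)


def read_conll(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dicts; blank lines end a sentence."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        sentence.append(dict(zip(CONLL_COLUMNS, fields)))
    if sentence:
        yield sentence


def dependency_arcs(sentence: List[Dict[str, str]]) -> List[Tuple[str, str, str]]:
    """Return (dependent form, relation, head form) triples; head 0 maps to 'ROOT'."""
    forms = {tok["id"]: tok["form"] for tok in sentence}
    return [(tok["form"], tok["deprel"], forms.get(tok["head"], "ROOT"))
            for tok in sentence]


if __name__ == "__main__":
    for sent in read_conll(SAMPLE):
        for dependent, relation, head in dependency_arcs(sent):
            print(f"{dependent} --{relation}--> {head}")
```

To read a downloaded file instead of the sample, pass its contents to `read_conll`, for example `read_conll(open(path, encoding="utf-8").read())`, where `path` is wherever the file was saved.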
Malt parser for Spanish
Access ws | malt_parser web service |
Download | MaltParser Spanish module espmalt-1.0.mco: e-repositori |
Description |
An instance of MaltParser trained for Spanish using the IULA Spanish LSP Treebank corpus. |
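For readers who download the module rather than use the web service, the following sketch shows one way to call MaltParser from Python. It assumes a local MaltParser jar (the path is a placeholder), a Java runtime on the PATH, and that `espmalt-1.0.mco` sits in the working directory; `-c`, `-i`, `-o` and `-m parse` are MaltParser's standard options for configuration name, input file, output file and parsing mode. The helper names are hypothetical.

```python
import subprocess
from typing import List


def build_malt_command(jar_path: str, model_name: str,
                       input_conll: str, output_conll: str) -> List[str]:
    """Build a MaltParser command line for parsing mode.

    model_name is the configuration name without the .mco extension,
    e.g. "espmalt-1.0" for espmalt-1.0.mco (assumed to be in the cwd).
    """
    if model_name.endswith(".mco"):
        model_name = model_name[:-len(".mco")]
    return ["java", "-jar", jar_path,
            "-c", model_name,
            "-i", input_conll,
            "-o", output_conll,
            "-m", "parse"]


def run_malt(jar_path: str, model_name: str,
             input_conll: str, output_conll: str) -> None:
    """Run MaltParser and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_malt_command(jar_path, model_name,
                                      input_conll, output_conll), check=True)


if __name__ == "__main__":
    # Placeholder paths: adjust to the local installation before running.
    print(" ".join(build_malt_command("maltparser.jar", "espmalt-1.0.mco",
                                      "input.conll", "parsed.conll")))
```

The input file is expected to be tokenized and POS-tagged CoNLL text using the same tagset the model was trained on; that requirement is a general property of MaltParser models and should be verified against the module's own documentation.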
PAAU92 Corpus
Access |
Online query interface: Bwananet
Application included in the book El Corpus PAAU 1992: estudios descriptivos, textos y vocabulario: Corpus92 |
Download |
Corpus texts in stand-off format: e-repositori |
Description |
The PAAU92 corpus consists of texts written by students in June 1992 as part of the entrance exams to various Spanish universities. The corpus can be accessed through the online query interface Bwananet or through the application included in the book El Corpus PAAU 1992: estudios descriptivos, textos y vocabulario, which contains the texts analysed in the book as well as the vocabulary lists that make up the corpus. The corpus has also been reworked within the framework of the Metanet4U project and is available for download from the UPF e-repositori. |
Wikipedia corpus
Download |
Corpus texts in stand-off format (Catalan): e-repositori
Corpus texts in stand-off format (Spanish): e-repositori |
Description |
Compilation of Wikipedia articles in Catalan and Spanish. This is an improved version of WikiCorpus developed within the framework of the Metanet4U project. The text has been cleaned, linguistically processed and output in stand-off format. |
Penn treebank IULA
Download | Sentences with dependency annotation in CoNLL format: e-repositori |
Description |
A syntactically annotated subset of 805 sentences (in English and Spanish) from the Penn Treebank corpus. This corpus contains texts from the Wall Street Journal originally compiled by the University of Pennsylvania. The sentences were translated into Spanish by human translators. |
RST Spanish Treebank
Access | online |
Description |
Online interface to query and download a corpus of specialized texts in Spanish tagged with Rhetorical Structure Theory (RST) discourse relations. The RST Spanish Treebank is the result of an international collaborative project among three research groups: Iulaterm (IULA-UPF, Barcelona), Grupo de Ingeniería Lingüística (IINGEN-UNAM, México D.F.) and TALNE (LIA-UAPV, Avignon). |
Tools for Catalan and Spanish corpus processing
Access | online demo |
Description | A package of tools for Catalan and Spanish corpus processing. It includes a text handling module and a probabilistic POS tagger. It also lets users look up the POS tagger's dictionary data. |
PALIC
Access | online demo |
Description | A package of tools for processing the Corpus Tècnic in Catalan and Spanish. It includes a preprocessor, a POS tagger and a linguistic disambiguator. |
Syntactic analyser for Spanish
Access | online demo |
Description | An open-source HPSG grammar for Spanish implemented within the LKB system. |
DiZer 2.0
Access | online demo |
Description |
Online interface to develop and use discourse parsers based on Rhetorical Structure Theory (RST) in several languages. At present it includes a complete parser for Brazilian Portuguese and beta parsers for Spanish and English. DiZer 2.0 is the result of an international collaborative project among three research groups: Núcleo Interinstitucional de Lingüística Computacional (ICMC-USP, São Paulo), Iulaterm (IULA-UPF, Barcelona) and TALNE (LIA-UAPV, Avignon). |
DiSeg
Access | online demo |
Description |
Online interface to download and use a discourse segmenter for Spanish based on Rhetorical Structure Theory (RST). It also includes a gold standard corpus of manually segmented specialized texts. DiSeg is the result of an international collaborative project among three research groups: Iulaterm (IULA-UPF, Barcelona), TALNE (LIA-UAPV, Avignon) and GRIAL (UB, Barcelona). |
Alinea
Access | online demo |
Description | A tool for aligning translated texts, designed especially for specialized corpora and also usable as a translation validator. |