IULA Corpus Tècnic

Access

Online query interface: Bwananet

Download

Parallel English-Spanish subcorpus in stand-off format (sentence-aligned): e-repositori

Description

The Corpus Tècnic contains written texts from the fields of Law, Economy, Genomics, Medicine and Environment as well as a contrastive corpus from the press. The languages of the texts in the corpus are Catalan, Spanish, English, French and German.

Subsets of documents from the IULA Corpus Tècnic have been reworked within the framework of the Metanet4U project. The formatting has been updated to follow the most recent international standards and, in some cases, the linguistic information has been expanded. These subsets are available for download.

IULA Spanish LSP Treebank

Access

Online query interface: TreebankBrowser

Download

Corpus texts in CoNLL format: e-repositori

Description

Syntactic annotation of 42,000 sentences selected from the Spanish part of the IULA Corpus Tècnic, developed within the framework of the Metanet4U project.

[+ information]
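
A minimal sketch of how the downloaded treebank files might be read, assuming they follow the ten-column, tab-separated CoNLL-X dependency layout (the exact CoNLL variant is not stated here and should be checked against the files). The function name `read_conll` and the sample sentences are hypothetical and not taken from the treebank.

```python
from typing import Dict, Iterable, Iterator, List

# Column names of the CoNLL-X dependency format (an assumption; the
# downloaded files may use a different CoNLL variant).
CONLL_X_FIELDS = ["id", "form", "lemma", "cpostag", "postag",
                  "feats", "head", "deprel", "phead", "pdeprel"]


def read_conll(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Skip comment lines, which some CoNLL variants allow.
            continue
        sentence.append(dict(zip(CONLL_X_FIELDS, line.split("\t"))))
    if sentence:
        yield sentence


# Hypothetical two-sentence sample, invented for illustration only.
SAMPLE = (
    "1\tEl\tel\td\tda\t_\t2\tspec\t_\t_\n"
    "2\tpaciente\tpaciente\tn\tnc\t_\t3\tsubj\t_\t_\n"
    "3\tmejora\tmejorar\tv\tvm\t_\t0\tROOT\t_\t_\n"
    "\n"
    "1\tLlueve\tllover\tv\tvm\t_\t0\tROOT\t_\t_\n"
)

sentences = list(read_conll(SAMPLE.splitlines()))
for sent in sentences:
    # Print each token with the index of its head and its dependency label.
    print([(tok["form"], tok["head"], tok["deprel"]) for tok in sent])
```

For a real file, the same function could be called on an open file handle, for example `read_conll(open(path, encoding="utf-8"))`.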

Malt parser for Spanish

Access

Web service: malt_parser

Download

MaltParser Spanish module espmalt-1.0.mco: e-repositori

Description

An instance of MaltParser trained for Spanish using the IULA Spanish LSP Treebank corpus.

[+ information]
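
A minimal sketch of how the downloaded model might be used locally, assuming a standard MaltParser distribution is installed. The options `-c` (model name, given without the `.mco` extension), `-i`, `-o` and `-m parse` are MaltParser's usual command-line flags; the jar file name, the input and output file names, and the helper `malt_parse_command` are hypothetical. The model may require the MaltParser version it was trained with.

```python
from typing import List


def malt_parse_command(jar_path: str, model_name: str,
                       infile: str, outfile: str) -> List[str]:
    """Build the argument list for running MaltParser in parsing mode."""
    return [
        "java", "-jar", jar_path,
        "-c", model_name,   # model (configuration) name, without ".mco"
        "-i", infile,       # input file in CoNLL format
        "-o", outfile,      # output file with predicted dependencies
        "-m", "parse",      # parsing mode, as opposed to "learn"
    ]


# Hypothetical jar and file names; only the model name comes from this page.
cmd = malt_parse_command("maltparser-1.9.2.jar", "espmalt-1.0",
                         "input.conll", "parsed.conll")
print(" ".join(cmd))

# To run the parser (requires Java, the jar, and espmalt-1.0.mco in the
# working directory):
#   import subprocess
#   subprocess.run(cmd, check=True)
```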

PAAU92 Corpus

Access 

Online query interface: Bwananet

Application included in the book El Corpus PAAU 1992: estudios descriptivos, textos y vocabulario: Corpus92

Download

Corpus texts in stand-off format: e-repositori

Description

The PAAU92 corpus consists of texts written by students in June 1992 as part of the entrance exams to various Spanish universities.

This corpus can be accessed through the online query interface Bwananet or through the application included in the book El Corpus PAAU 1992: estudios descriptivos, textos y vocabulario. The application contains the texts analyzed in the book as well as the vocabulary lists that make up the corpus.

The corpus has also been reworked within the framework of the Metanet4U project and is available for download from the UPF e-repositori.

Wikipedia corpus

Download

Corpus texts in stand-off format (Catalan): e-repositori

Corpus texts in stand-off format (Spanish): e-repositori

Description

Compilation of Wikipedia articles in Catalan and Spanish. This is an improved version of WikiCorpus developed within the framework of the Metanet4U project. The text has been cleaned, linguistically processed and converted to stand-off format.

  • Corpus in Catalan: 140,000 articles, 35.6 M words
  • Corpus in Spanish: 250,000 articles, 92 M words

Penn treebank IULA

Download

Sentences with dependency annotation in CoNLL format: e-repositori

Description

Syntactically annotated subset of 805 sentences (in English and Spanish) from the Penn Treebank corpus. This corpus contains texts from the Wall Street Journal originally compiled by the University of Pennsylvania. The sentences were translated into Spanish by human translators.

RST Spanish Treebank

Access

Online

Description

Online interface to query and download a corpus of specialized texts in Spanish tagged with Rhetorical Structure Theory (RST) discourse relations. The RST Spanish Treebank is the result of an international collaborative project among three research groups: Iulaterm (IULA-UPF, Barcelona), Grupo de Ingeniería Lingüística (IINGEN-UNAM, México D.F.) and TALNE (LIA-UAPV, Avignon).

Tools for Catalan and Spanish corpus processing

Access

Online demo

Description

A package of tools for Catalan and Spanish corpus processing. It includes a text handling module and a probabilistic POS tagger. It also allows the POS tagger's dictionary data to be consulted.

PALIC

Access

Online demo

Description

A package of tools for processing the Catalan and Spanish parts of the Corpus Tècnic. It includes a preprocessor, a POS tagger and a linguistic disambiguator.

Syntactic analyser for Spanish

Access

Online demo

Description

An open-source HPSG grammar for Spanish implemented within the LKB system.

DiZer 2.0

Access

Online demo

Description

Online interface to develop and use discourse parsers based on Rhetorical Structure Theory (RST) in several languages. At present it includes a complete parser for Brazilian Portuguese and beta parsers for Spanish and English. DiZer 2.0 is the result of an international collaborative project among three research groups: Núcleo Interinstitucional de Lingüística Computacional (ICMC-USP, São Paulo), Iulaterm (IULA-UPF, Barcelona) and TALNE (LIA-UAPV, Avignon).

DiSeg

Access

Online demo

Description

Online interface to download and use a discourse segmenter for Spanish based on Rhetorical Structure Theory (RST). It also includes a gold standard corpus of manually segmented specialized texts. DiSeg is the result of an international collaborative project among three research groups: Iulaterm (IULA-UPF, Barcelona), TALNE (LIA-UAPV, Avignon) and GRIAL (UB, Barcelona).

Alinea

Access

Online demo

Description

A tool for aligning translated texts, designed especially for specialized corpora. It can also be used as a translation validator.