Project description

The aim of this project is to provide a Web-based interface for accessing a collection of monolingual and parallel corpora for research and teaching purposes. The main feature of BancTrad is that searches can be based not only on words (strings of characters) but also on two other types of criteria:

  • Linguistic: lemma, part-of-speech, and (only in the case of Catalan) syntactic function
  • Extra-linguistic: subject matter, text type, register, and degree of specialisation

The corpora in BancTrad consist of texts in one or more of the following languages: Catalan, Spanish, English, French and German (always from Catalan/Spanish into the other languages or vice versa).

Search possibilities:

1. Corpora

Corpora available

Monolingual

  • British National Corpus (BNC): English, 100 million words, written and spoken language. Annotation: POS
  • Frankfurter Rundschau: German, 34 million words, German newspapers (1992-1993). Annotation: lemma and POS

Multilingual

BancTrad

Description: Parallel corpus aligned at sentence level

Languages: Catalan, English, French, German, Spanish

Size: 2 million words

Type: Translation exercises by UPF students, corrected by teachers, as well as translations from teaching staff, publishing houses and the Internet

Annotation: lemma, POS, and (for certain languages) syntactic functions as well as macrotextual attributes. Queries may thus be filtered by:

  • Subject matter (economics, science, politics, etc.)
  • Text type (normative, descriptive, literary, etc.)
  • Register (colloquial, standard, learned, etc.)
  • Degree of specialisation (low, intermediate, high)
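
As a rough illustration of how filtering by these attributes could work, the sketch below selects documents from a small in-memory list. The record type `DocumentMeta`, its field names and the function `filter_documents` are hypothetical names chosen for this example, and the three sample records are invented; none of this reflects BancTrad's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentMeta:
    """Hypothetical macrotextual attributes of one corpus text."""
    doc_id: str
    subject: str          # e.g. "economics", "science", "politics"
    text_type: str        # e.g. "normative", "descriptive", "literary"
    register: str         # e.g. "colloquial", "standard", "learned"
    specialisation: str   # "low", "intermediate" or "high"


def filter_documents(docs, **criteria):
    """Return the documents whose attributes equal every given criterion."""
    return [
        doc for doc in docs
        if all(getattr(doc, name) == value for name, value in criteria.items())
    ]


# Invented sample records, for illustration only.
docs = [
    DocumentMeta("d1", "economics", "descriptive", "standard", "high"),
    DocumentMeta("d2", "science", "descriptive", "learned", "high"),
    DocumentMeta("d3", "politics", "normative", "standard", "low"),
]

selected = filter_documents(docs, text_type="descriptive", specialisation="high")
print([doc.doc_id for doc in selected])
```

Criteria are combined with a logical AND here; a real interface might also allow several values per attribute.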

 

2. Search modes 

BancTrad offers two different search modes:

a) The basic mode allows the user to search for sequences of specific word forms, either in the source language or in the target language.

b) The advanced mode allows the user to search for sequences of five quadruples (form, lemma, morphosyntactic tag and, for Catalan, syntactic function), including the iteration of identical elements.
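
One way to picture such a query is as a sequence of slots, each constraining some of the four token attributes, where a slot may be marked as repeatable. The following is a minimal sketch under that assumption; the names `Token`, `Slot`, `match_at` and `search`, the `repeat` flag and the Penn-style tags in the sample are all illustrative and are not taken from BancTrad. The sketch accepts any number of slots, and a basic-mode query corresponds to slots that constrain only the form.

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    form: str
    lemma: str
    pos: str
    func: Optional[str] = None   # syntactic function, if annotated


class Slot(NamedTuple):
    """One query position; None means 'any value'."""
    form: Optional[str] = None
    lemma: Optional[str] = None
    pos: Optional[str] = None
    func: Optional[str] = None
    repeat: bool = False         # True: the slot may match one or more tokens


def slot_matches(slot, token):
    """Check the four attribute constraints of a slot against a token."""
    return all(
        wanted is None or wanted == actual
        for wanted, actual in zip(slot[:4], token)
    )


def match_at(tokens, slots, start):
    """Return the end index of a match beginning at start, or None."""
    if not slots:
        return start
    if start >= len(tokens) or not slot_matches(slots[0], tokens[start]):
        return None
    if slots[0].repeat:
        # Greedy: first try to consume another token with the same slot.
        end = match_at(tokens, slots, start + 1)
        if end is not None:
            return end
    return match_at(tokens, slots[1:], start + 1)


def search(tokens, slots):
    """Return (start, end) spans, at most one per start position."""
    spans = []
    for start in range(len(tokens)):
        end = match_at(tokens, slots, start)
        if end is not None:
            spans.append((start, end))
    return spans


# Invented sample sentence with illustrative tags.
tokens = [
    Token("the", "the", "DT"),
    Token("big", "big", "JJ"),
    Token("old", "old", "JJ"),
    Token("houses", "house", "NNS"),
]
query = [Slot(pos="DT"), Slot(pos="JJ", repeat=True), Slot(lemma="house")]
print(search(tokens, query))
```

Spans are half-open, so a span `(start, end)` covers `tokens[start:end]`.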


Interface

 

Extra-linguistic tagging

Extra-linguistic tagging and text alignment are performed semi-automatically with the Java Web Start application BancTrad Manager (http://mutis.upf.es/~textosbt/). With this application the user can enter extra-linguistic information about a document and align the original with its translation.

Linguistic tagging

Linguistic tagging is performed fully automatically. For Catalan, we use our own linguistic tagger, CATCG. For English, French and German, we use TreeTagger, a statistical tagger developed by IMS.
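
To give an idea of what tagger output looks like downstream, the sketch below reads tagged text in the three-column, tab-separated layout (word form, POS tag, lemma) that TreeTagger's wrapper scripts typically produce. This layout is an assumption about the configuration; the function name `parse_tagged_lines` and the sample lines are made up for the example, and the project's own processing pipeline is not described in this text.

```python
def parse_tagged_lines(lines):
    """Parse 'form<TAB>pos<TAB>lemma' lines into a list of token dicts.

    Lines that do not have exactly three fields (for example markup
    passed through by the tagger) are skipped.
    """
    tokens = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue
        form, pos, lemma = fields
        tokens.append({"form": form, "pos": pos, "lemma": lemma})
    return tokens


# Invented sample output, for illustration only.
sample = [
    "<s>\n",
    "The\tDT\tthe\n",
    "houses\tNNS\thouse\n",
    "</s>\n",
]
for token in parse_tagged_lines(sample):
    print(token["form"], token["pos"], token["lemma"])
```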

 

Financing and Working Groups

BancTrad has been financed by UPF's Teaching Innovation Program, as well as by the Spanish Ministry of Education, Culture and Sports, and the Catalan regional government's Department of Universities, Research and the Information Society.

Related publications

  • Badia, T., G. Boleda, J. Brumme, C. Colominas, M. Garmendia, M. Quixal. 2002. BancTrad: un banco de corpus anotados con interficie web. Procesamiento del Lenguaje Natural, n. 29, pp. 293-294. ISSN 1135-5948. Valladolid, September.

  • Badia, T., G. Boleda, C. Colominas, M. Garmendia, A. González, M. Quixal. 2002. BancTrad: a web interface for integrated access to parallel annotated corpora. Proceedings of the Workshop on Language Resources for Translation Work and Research, held during the 3rd LREC Conference, Las Palmas, 29-31 May.