The aim of this project is to provide a Web-based interface for accessing a collection of monolingual and parallel corpora for research and teaching purposes. The main feature of BancTrad is its ability to search not only on words (strings of characters) but also on two other types of criteria:
- Linguistic: lemma, part-of-speech, and (only in the case of Catalan) syntactic function
- Extra-linguistic: subject matter, text type, register, and degree of specialisation
The corpora in BancTrad consist of texts in one or more of the following languages: Catalan, Spanish, English, French and German (always translated from Catalan or Spanish into the other languages, or vice versa).
Corpora available
Monolingual
- British National Corpus (BNC): English, 100 million words. Written and spoken language. Annotation: POS
- Frankfurter Rundschau: German, 34 million words. German newspapers (1992-1993). Annotation: lemma and POS
Description: Parallel corpus aligned at sentence level
Languages: Catalan, English, French, German, Spanish
Size: 2 million words
Type: Translation exercises by UPF students, corrected by teachers, as well as translations from teaching staff, publishing houses and the Internet
Annotation: lemma, POS, and (for certain languages) syntactic functions as well as macrotextual attributes. Queries may thus be filtered by:
- Subject matter (economics, science, politics, etc.)
- Text type (normative, descriptive, literary, etc.)
- Register (colloquial, standard, learned, etc.)
- Degree of specialisation (low, intermediate, high)
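As an illustration of how such filters narrow a query, here is a minimal sketch in Python. It is not BancTrad's implementation: the class, field names, function and sample records are all hypothetical, and the attribute values are taken from the examples listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextMetadata:
    """Macrotextual attributes of one text (hypothetical field names)."""
    title: str
    subject: str          # e.g. "economics", "science", "politics"
    text_type: str        # e.g. "normative", "descriptive", "literary"
    register: str         # e.g. "colloquial", "standard", "learned"
    specialisation: str   # "low", "intermediate" or "high"


def filter_texts(texts, **criteria):
    """Return the texts whose attributes equal every given criterion."""
    return [
        t for t in texts
        if all(getattr(t, name) == value for name, value in criteria.items())
    ]


# Invented sample records, for illustration only.
TEXTS = [
    TextMetadata("text-1", "economics", "descriptive", "standard", "high"),
    TextMetadata("text-2", "politics", "normative", "learned", "intermediate"),
    TextMetadata("text-3", "economics", "descriptive", "colloquial", "low"),
]

# Keep only economics texts written in a standard register.
selected = filter_texts(TEXTS, subject="economics", register="standard")
print([t.title for t in selected])
```

Criteria are combined with a logical AND, so each additional attribute can only shrink the set of texts that the linguistic query is then run against.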
2. Search modes
BancTrad offers two different search modes:
a) The basic mode lets the user search for sequences of specific word forms, either in the source language or in the target language.
b) The advanced mode lets the user search for sequences of five quadruples (form, lemma, morphosyntactic tag, and, for Catalan, syntactic function), including the iteration of identical elements.
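The advanced mode can be pictured as matching a short sequence of partially specified quadruples against the annotated tokens of a sentence. The Python sketch below is an assumption-laden illustration, not BancTrad's query engine: `Token`, `Slot`, `match_at` and `find_matches` are hypothetical names, the `repeat` flag is one possible reading of "iteration of identical elements", and the tags `DET`, `ADJ` and `NOUN` are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    form: str
    lemma: str
    pos: str
    function: Optional[str] = None  # syntactic function (Catalan only)


@dataclass(frozen=True)
class Slot:
    """One query position; fields left as None match any value."""
    form: Optional[str] = None
    lemma: Optional[str] = None
    pos: Optional[str] = None
    function: Optional[str] = None
    repeat: bool = False  # if True, the slot may match one or more tokens in a row

    def matches(self, tok):
        pairs = ((self.form, tok.form), (self.lemma, tok.lemma),
                 (self.pos, tok.pos), (self.function, tok.function))
        return all(want is None or want == got for want, got in pairs)


def match_at(slots, tokens, start):
    """Return the end index of the shortest match starting at `start`, or None."""
    if not slots:
        return start
    if start >= len(tokens) or not slots[0].matches(tokens[start]):
        return None
    # Consume one token, then first try to move on to the next slot.
    end = match_at(slots[1:], tokens, start + 1)
    if end is None and slots[0].repeat:
        # Otherwise stay on the same slot and consume another token.
        end = match_at(slots, tokens, start + 1)
    return end


def find_matches(slots, tokens):
    """Return every token span that matches the slot sequence."""
    hits = []
    for i in range(len(tokens)):
        end = match_at(slots, tokens, i)
        if end is not None:
            hits.append(tokens[i:end])
    return hits


# Invented sample sentence and query: determiner, one or more adjectives,
# then any form of the lemma "house".
SENTENCE = [
    Token("the", "the", "DET"),
    Token("big", "big", "ADJ"),
    Token("red", "red", "ADJ"),
    Token("houses", "house", "NOUN"),
]
QUERY = [Slot(pos="DET"), Slot(pos="ADJ", repeat=True), Slot(lemma="house")]

for hit in find_matches(QUERY, SENTENCE):
    print(" ".join(t.form for t in hit))
```

Because the last slot constrains only the lemma, the query finds "houses" as well as "house", which is the kind of result a search on plain word forms cannot give.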
Extra-linguistic tagging and text alignment are performed semi-automatically through the Java Web Start application BancTrad Manager (http://mutis.upf.es/~textosbt/). With this application the user can enter extra-linguistic information about a document and align the original text with its translation.
Linguistic tagging
Linguistic tagging is performed fully automatically. For Catalan, we use our own linguistic tagger, CATCG. For English, French and German, we use TreeTagger, a statistical tagger developed by IMS.
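To show what automatically tagged text can look like downstream, the following Python sketch reads tagger output in a tab-separated "form, tag, lemma" layout, one token per line. That layout is an assumption about how the tagger is configured, not a description of BancTrad's pipeline, and the function name and sample lines are hypothetical.

```python
def parse_tagged_lines(lines):
    """Turn 'form<TAB>tag<TAB>lemma' lines into (form, tag, lemma) tuples."""
    triples = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip anything that is not a three-column token record
        triples.append(tuple(parts))
    return triples


# Invented sample output, for illustration only.
SAMPLE = "The\tDT\tthe\nhouses\tNNS\thouse\n"

for form, tag, lemma in parse_tagged_lines(SAMPLE.splitlines()):
    print(form, tag, lemma)
```

Each resulting triple supplies the form, morphosyntactic tag and lemma that a query over annotated text needs for one token.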
Financing and Working Groups
BancTrad has been financed by UPF's Teaching Innovation Program, as well as by the Spanish Ministry of Education, Culture and Sports, and the Catalan regional government's Department of Universities, Research and the Information Society.
· Badia, T., G. Boleda, J. Brumme, C. Colominas, M. Garmendia, M. Quixal. 2002. BancTrad: un banco de corpus anotados con interficie web. Procesamiento del Lenguaje Natural, n. 29, pp. 293-294. ISSN 1135-5948. Valladolid, September.
· Badia, T., G. Boleda, C. Colominas, M. Garmendia, A. González, M. Quixal. 2002. BancTrad: a web interface for integrated access to parallel annotated corpora. Proceedings of the Workshop on Language Resources for Translation Work and Research, held during the 3rd LREC Conference, Las Palmas, 29-31 May.