Here is a list of publicly available corpora hosted on the GLiF server.
The following corpora can be accessed by anyone with a guest account, using the web-based corpus query environment CQPweb.
- OLDCA (“Català medieval”): a diachronic corpus of Catalan comprising 222 texts from the 11th to the 17th century, with a total of 5,020,237 words. This corpus was initially developed under the Spanish national project FFI2010-15006, with additional support under project FFI2013-41301-P. Further information can be found here.
- OLDES (“Castellà medieval”): a diachronic corpus of Spanish, developed by Cristina Sánchez-Marco, comprising 674 texts from the 12th to the 20th century, with a total of 44,470,288 words. This corpus was developed under the Spanish national project FFI2010-15006. Further information can be found here.
- CUCWEB (“Cátedra Telefónica corpus of the use of Catalan on the Web”): a 166-million-word corpus of Catalan built by crawling the Web. This article documents the development of the corpus; see here for the linguistic criteria of the corpus annotations (in Catalan).
- Latin corpora (“Latin” and “Latin v2”): two corpora of texts in Latin. “Latin” contains 10,166,111 words and “Latin v2” contains 2,912,738 words.
- Wikipedia corpora (“Wikipèdia en català”, “Wikipèdia en castellà”, “Wikipèdia en anglès” and “Wikipèdia en francès”): four corpora that contain large portions of Wikipedia (based on a 2006 dump) in Catalan, Spanish, English and French, automatically enriched with linguistic information. More information can be found here.
- SdeWaC (Stuttgart Deutsch-Web-As-Corpus): a 0.88-billion-word corpus derived from deWaC, which was constructed from the Web by limiting the crawl to the .de domain and using medium-frequency words from the Süddeutsche Zeitung corpus and basic German vocabulary lists as seeds.
Other resources
- NOCANDO: a multilingual corpus of spontaneous speech for the study of non-canonical constructions, created by recording free picture-based narrations by native speakers of five languages: Catalan, Italian, Spanish, English, and German. It was initially developed under the Spanish national project HUM2004-04463 (2004-2007).
- We also have corpora with limited access and corpora under development in the TEITOK environment. If you are a GLiF member, click here to learn more about these corpora and how to access them.