Resources

Here is a list of publicly available corpora hosted on the GLiF server.

CQPweb

The following corpora can be accessed by anyone with a guest account, using the web-based corpus query environment CQPweb.

OLDCA (“Català medieval”): a diachronic corpus of Catalan comprising 222 texts from the 11th to the 17th century, with a total of 5,020,237 words. This corpus was initially developed under the Spanish national project FFI2010-15006, with additional support from project FFI2013-41301-P. Further information can be found here.
OLDES (“Castellà medieval”): a diachronic corpus of Spanish developed by Cristina Sánchez-Marco. It comprises 674 texts from the 12th to the 20th century, with a total of 44,470,288 words. This corpus was developed under the Spanish national project FFI2010-15006. Further information can be found here.
CUCWEB (“Cátedra Telefónica corpus of the use of Catalan on the Web”): a 166-million-word corpus of Catalan built by crawling the Web. This article documents the development of the corpus; see here for the linguistic criteria of the corpus annotations (in Catalan).
Latin corpora (“Latin” and “Latin v2”): two corpora of texts in Latin. Latin contains 10,166,111 words and Latin v2 contains 2,912,738 words.
Wikipedia corpora (“Wikipèdia en català”, “Wikipèdia en castellà”, “Wikipèdia en anglès” and “Wikipèdia en francès”): four corpora that contain large portions of Wikipedia (based on a 2006 dump) in Catalan, Spanish, English and French, automatically enriched with linguistic information. More information can be found here.
sdewac (Stuttgart Deutsch-Web-As-Corpus): a 0.88-billion-word corpus derived from deWaC, which was built by crawling the Web with the crawl limited to the .de domain, using medium-frequency words from the Süddeutsche Zeitung corpus and basic German vocabulary lists as seeds.

Other resources

NOCANDO: a multilingual corpus of spontaneous speech for the study of non-canonical constructions, created by recording free picture-based narrations by native speakers of five languages: Catalan, Italian, Spanish, English, and German. Initially developed under the Spanish national project HUM2004-04463 (2004-2007).
We also have corpora with limited access and corpora under development in the TEITOK environment. If you are a GLiF member, click here to learn more about these corpora and how to access them.