A new report by Olga Kushch (RECSM member) on Social Sciences & Humanities Open Cloud (SSHOC)
The Multilingual Corpus of Survey Questionnaires (MCSQ): How Social Scientists can benefit from corpus linguistics
Lidun Hareide (Møreforsking Research Institute, Ålesund, Norway), Olga Kushch (Research and Expertise Centre for Survey Methodology, Universitat Pompeu Fabra)
What is a corpus?
A corpus is a searchable database of naturally occurring text, sampled to be representative of a specific population of texts. By naturally occurring, we mean text used in real-life situations. In addition, a corpus may function as a repository, preserving data and keeping it accessible for posterity. The MCSQ, the very first publicly available multilingual corpus of international survey texts, performs both these functions. It enables storing, searching and comparing information from international social surveys in 8 languages (e.g. French) and 29 of their language varieties (e.g. Swiss French).
The MCSQ is an open-access, open-source research and training resource. It is FAIR (Findable, Accessible, Interoperable and Reusable) by design. The current version (named Mileva Marić-Einstein) is compiled from the European Social Survey (ESS), the European Values Study (EVS) and the Survey of Health, Ageing and Retirement in Europe (SHARE), in the British English source language and their translations into eight languages: Catalan, Czech, French, German, Norwegian (Bokmål), Portuguese, Spanish and Russian. The current version comprises 3.5 million words and approximately 650,000 sentences. Nearly 80% of the corpus is sentence-aligned, meaning that the source text sentences in English are linked to their translations in the different languages.