MCSQ presented in the Latin American and Iberian Languages Open Corpora Forum

13.12.2021


RECSM members Danielly Sorato and Diana Zavala-Rojas presented the MCSQ at the Open Corpora Forum on Latin American and Iberian Languages (OpenCor) on 3 December. OpenCor is a forum for discussing linguistic resources for languages spoken in Iberian and Latin American countries.

Abstract

The Multilingual Corpus of Survey Questionnaires (MCSQ) is the first publicly available corpus of survey questionnaires. In its third version (entitled Rosalind Franklin), the MCSQ contains approximately 766,000 sentences and more than 4 million tokens, comprising 306 distinct questionnaires designed in the source language, British English, and their translations into Catalan, Czech, French, German, Norwegian, Portuguese, Spanish, and Russian, for a total of 29 country-language combinations (e.g., Switzerland-French). The MCSQ was designed and implemented following the FAIR (Findable, Accessible, Interoperable, Reusable) principles, and its contents are freely available through a specially tailored user interface.

The MCSQ draws on more than 40 years of survey research from large-scale comparative survey projects that provide cross-national and cross-cultural data to the Social Sciences and Humanities (SSH), namely the European Social Survey (ESS), the European Values Study (EVS), the Survey of Health, Ageing and Retirement in Europe (SHARE), and the WageIndicator Survey (WIS). All questionnaires in the MCSQ are composed of survey items. A survey item is a request for an answer with a set of answer options, and it may include additional textual elements that guide interviewers and clarify the information respondents should understand and provide. Except in the case of the WIS, the translation process followed the TRAPD (Translation, Review, Adjudication, Pretesting, and Documentation) method, a team approach to the translation of survey questionnaires.

Questionnaires included in the MCSQ were obtained from the survey projects' archives in distinct formats, such as spreadsheets, XML, and PDF files. The PDF files first had to be converted to plain text before entering the preprocessing pipeline. The texts were then extracted from the input files, preprocessed, sentence-aligned with respect to the English source, and annotated with part-of-speech (POS) and named-entity recognition (NER) tags.
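
To illustrate what aligning sentences to the English source can look like, the following is a minimal Python sketch. It is not the MCSQ pipeline: the function names, the regex-based sentence splitter, and the simple position-based pairing are all assumptions made for illustration, and the actual alignment strategy used for the corpus is not described in this abstract.

```python
import re
from itertools import zip_longest


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '?' or '!' followed by whitespace."""
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [p for p in parts if p]


def align_to_source(source_text, target_text):
    """Pair source and target sentences by position; None marks a missing counterpart."""
    source = split_sentences(source_text)
    target = split_sentences(target_text)
    return list(zip_longest(source, target, fillvalue=None))


# Hypothetical survey item text in English and Spanish.
pairs = align_to_source(
    "How happy are you? Please use this card.",
    "¿En qué medida es usted feliz? Por favor, use esta tarjeta.",
)
for source_sentence, target_sentence in pairs:
    print(source_sentence, "|", target_sentence)
```

Pairing by position only works when the source and its translation have the same number of sentences in the same order; a real pipeline would need a more robust strategy for segments that are merged, split, or omitted in translation.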

Read the abstract on the forum
