RECSM members launch the first publicly available corpus of survey questionnaires

06.10.2021

UPF researchers have recently released Version 3 of the Multilingual Corpus of Survey Questionnaires (MCSQ), named after the scientist Rosalind Franklin. The MCSQ is authored by Diana Zavala-Rojas and Danielly Sorato of UPF's Research and Expertise Centre for Survey Methodology (RECSM), together with Lidun Hareide (Møreforsking AS) and Knut Hofland (University of Bergen). Version Rosalind Franklin comprises 306 distinct questionnaires, containing approximately 766,000 sentences and more than 4 million tokens. At present, the corpus consists of the following questionnaires and their versions:

European Social Survey (ESS): Round 1, Round 2, Round 3, Round 4, Round 5, Round 6, Round 7, Round 8, Round 9
Survey of Health, Ageing and Retirement in Europe (SHARE): Round 7, Round 8 and COVID-19 questionnaire
European Values Study (EVS): Wave 2, Wave 3, Wave 4, Wave 5
WageIndicator (WIZ): Round 1 and COVID-19 questionnaire

All the surveys are available in English and in translation; which translations are available varies by survey and version. The target languages are Catalan, Czech, French (localized language varieties for France, Switzerland, Belgium and Luxembourg), German (localized for Austria, Germany, Switzerland and Luxembourg), Norwegian (Bokmål), Portuguese (localized for Portugal), Spanish (localized for Spain) and Russian (localized for Belarus, Estonia, Israel, Latvia, Lithuania, Russia and Ukraine).

The MCSQ is annotated with Part-of-Speech (POS) and Named Entity tags. POS tagging aims to automatically predict the part of speech (e.g., noun, verb, pronoun, adverb) of each word. While parts of speech are assigned to individual words, a named entity usually refers to a proper name and is often an entire multiword expression, such as the name "Ada Lovelace", the location "Barcelona", or the organization "Universitat Pompeu Fabra". The Named Entity Recognition (NER) task therefore aims to identify such entities and classify them into a set of categories, such as PERSON, LOCATION and ORGANIZATION.

The NER annotations in the corpus were produced with pre-trained models from different sources: FlairNLP (English, German, French, and Spanish), spaCy (Catalan, Norwegian and Portuguese), and Slavic BERT from DeepPavlov (Czech and Russian).
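To illustrate why a named entity can span several words while a POS tag belongs to a single word, the following minimal sketch groups token-level entity tags into multiword entities. It assumes the common BIO tagging scheme (B- marks the beginning of an entity, I- its continuation, O a token outside any entity); the announcement does not state which scheme the MCSQ uses, and the function name and sample sentence are hypothetical.

```python
from typing import List, Tuple


def group_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge token-level BIO tags into (entity_text, label) pairs.

    Assumption: tags follow the BIO scheme, e.g. "B-PERSON", "I-PERSON", "O".
    An I- tag that does not continue the current entity is treated as "O".
    """
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label = None

    def flush() -> None:
        # Store the entity collected so far, if there is one.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close the previous one first.
            flush()
            current_tokens = [token]
            current_label = tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # Continuation of the current entity.
            current_tokens.append(token)
        else:
            # Outside any entity (or an inconsistent I- tag).
            flush()
            current_tokens = []
            current_label = None
    flush()
    return entities


# Hypothetical example sentence, not taken from the corpus.
tokens = ["Ada", "Lovelace", "visited", "Universitat", "Pompeu", "Fabra",
          "in", "Barcelona"]
tags = ["B-PERSON", "I-PERSON", "O", "B-ORGANIZATION", "I-ORGANIZATION",
        "I-ORGANIZATION", "O", "B-LOCATION"]

for text, label in group_entities(tokens, tags):
    print(f"{label}: {text}")
```

Here eight tokens yield three entities, two of which are multiword expressions.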

The interface of the tool allows users to search for specific words, retrieve and compare word collocations, look up word frequencies, search for Part-of-Speech tag sequences, and compare data using various filter options. It also allows users to download customized subsets of the corpus and to create translation memories.
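As a rough illustration of what word-frequency lookup and POS-tag sequence search mean in practice, here is a minimal sketch over a tiny hand-tagged sample. It does not use or reproduce the MCSQ interface; the sample text, the tag labels and the function names are all hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple

TaggedToken = Tuple[str, str]  # (word, POS tag)

# Hypothetical hand-tagged sample; not taken from the MCSQ.
sample: List[TaggedToken] = [
    ("your", "PRON"), ("present", "ADJ"), ("job", "NOUN"),
    ("and", "CCONJ"),
    ("your", "PRON"), ("main", "ADJ"), ("job", "NOUN"),
]


def word_frequencies(tagged: Sequence[TaggedToken]) -> Counter:
    """Count how often each lower-cased word occurs."""
    return Counter(word.lower() for word, _ in tagged)


def find_pos_sequence(
    tagged: Sequence[TaggedToken], pattern: Sequence[str]
) -> List[Tuple[str, ...]]:
    """Return the words of every window whose tags equal `pattern`."""
    size = len(pattern)
    matches = []
    for start in range(len(tagged) - size + 1):
        window = tagged[start:start + size]
        if [tag for _, tag in window] == list(pattern):
            matches.append(tuple(word for word, _ in window))
    return matches


print(word_frequencies(sample).most_common(2))
print(find_pos_sequence(sample, ["ADJ", "NOUN"]))
```

Searching for the tag sequence ADJ followed by NOUN returns the adjective-noun pairs in the sample, which is the kind of pattern a translator might compare across language versions.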

The MCSQ is a FAIR (Findable, Accessible, Interoperable, Reusable) tool by design: an open-access, open-source research resource that will be useful for linguistic researchers, translators, social scientists and other interested parties.

The MCSQ will make it possible to incorporate systematic analysis of past translations into survey translation lifecycles and to use corpus linguistics methods when translating questionnaires.

The MCSQ was developed as part of the European Social Survey ERIC contribution to the Social Sciences and Humanities Open Cloud (SSHOC), funded by the EU Horizon 2020 Research and Innovation Programme (2014-2020) under Grant Agreement No. 823782.
