UPF researchers design a new method to align the texts of multilingual survey questionnaires

RECSM-UPF members Danielly Sorato and Diana Zavala-Rojas have developed an alignment strategy that leverages sentence-level metadata annotations. They have published the results in a scientific article, within the framework of an international conference on applications of natural language in information systems.

06.07.2022

Danielly Sorato and Diana Zavala-Rojas, researchers at the Research and Expertise Centre for Survey Methodology (RECSM) of the UPF Department of Political and Social Sciences, have recently published a scientific article describing a method designed for aligning the texts of bilingual survey questionnaires, which takes advantage of metadata annotations at sentence level.

In their work, which is part of the Multilingual Corpus of Survey Questions (MCSQ) project developed within RECSM, the researchers analysed the performance of their alignment strategy, building eight gold standards in four languages (Catalan, French, Portuguese and Spanish) for this purpose.

The article can be found in the volume Lecture Notes in Computer Science published within the framework of the Natural Language Processing and Information Systems Conference 2022, an international congress held in Valencia from 15 to 17 June 2022.

Sentence alignment, a crucial task in the process of building parallel corpora

Sentence alignment is a computational task that aims to automatically find the correspondence between a given sentence written in one language and its translation into another language. Given a text of several sentences in one language and the same text translated into another language, the alignment algorithm automatically links each sentence to its counterpart, establishing the correct cross-correspondence between the two languages. “Aligned sentences represent a very useful resource for subsequent computational tasks, such as automatic text translation”, Danielly Sorato points out.

Off-the-shelf sentence alignment tools generally perform well at this task. However, in some cases, depending on factors such as sentence structure and the amount of contextual information, sentence alignment may pose a greater challenge and require additional resources that may be difficult to find, such as domain-specific bilingual dictionaries.

“Although investing in creating additional linguistic resources is frequently the chosen option in these circumstances, leveraging extra-linguistic information, such as sentence-level metadata, can be an easier alternative to narrow the alignment search space”, Danielly Sorato asserts.
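To illustrate the general idea of narrowing the search space with metadata, here is a minimal sketch in Python. It is not the method described in the researchers' article: the metadata tags ("request", "response", "instruction"), the function names and the character-length similarity score are all assumptions chosen for illustration. Each source sentence is compared only with target sentences that carry the same tag, rather than with every sentence in the translated text.

```python
from collections import defaultdict


def length_ratio(a: str, b: str) -> float:
    """Crude similarity score: ratio of the shorter length to the longer one.

    This is only a placeholder; a real aligner would use a stronger signal.
    """
    if not a or not b:
        return 0.0
    return min(len(a), len(b)) / max(len(a), len(b))


def align_by_metadata(source, target):
    """Align (tag, text) sentences, comparing only sentences with the same tag.

    Returns a list of (source_text, target_text) pairs in source order;
    target_text is None when no unused target sentence shares the tag.
    """
    # Group target sentences by their metadata tag (hypothetical tags).
    buckets = defaultdict(list)
    for idx, (tag, text) in enumerate(target):
        buckets[tag].append((idx, text))

    used = set()
    pairs = []
    for tag, src_text in source:
        best_idx, best_score = None, -1.0
        # The search is restricted to candidates with a matching tag.
        for idx, tgt_text in buckets.get(tag, []):
            if idx in used:
                continue
            score = length_ratio(src_text, tgt_text)
            if score > best_score:
                best_idx, best_score = idx, score
        if best_idx is None:
            pairs.append((src_text, None))
        else:
            used.add(best_idx)
            pairs.append((src_text, target[best_idx][1]))
    return pairs


# Invented example sentences, not taken from the MCSQ.
source = [
    ("request", "How satisfied are you with your life?"),
    ("response", "Very satisfied"),
    ("response", "Not satisfied at all"),
]
target = [
    ("response", "Nada satisfecho/a"),
    ("request", "¿En qué medida está satisfecho/a con su vida?"),
    ("response", "Muy satisfecho/a"),
]

pairs = align_by_metadata(source, target)
for src_text, tgt_text in pairs:
    print(src_text, "<->", tgt_text)
```

In this toy example the question can only be matched with the single target sentence tagged "request", so the similarity score is needed only to choose between the two response options.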

One step further within the multilingual corpus of RECSM survey questionnaires

This article is part of RECSM’s broader work on the MCSQ, the first multilingual corpus of publicly available survey questionnaires and an open source, open access artefact. It includes 306 different questionnaires in the source language (English) and their translations into Catalan, Czech, French, German, Norwegian, Portuguese, Spanish and Russian, as well as 29 language variants (e.g., Swiss-French, Austrian-German).

It is part of the project “The Social Sciences and Humanities Open Cloud” (SSHOC), funded by the European Union’s Horizon 2020 framework programme, which brings together some twenty organizations to develop the area of social sciences and humanities of the European Open Science Cloud (EOSC).

Reference work: Danielly Sorato, Diana Zavala-Rojas. “Sentence Alignment of Bilingual Survey Texts Applying a Metadata-Aware Strategy”. Natural Language Processing and Information Systems, pp 469–476

https://link.springer.com/chapter/10.1007/978-3-031-08473-7_43

News published by:

Communication Office