New publication by RECSM Members Danielly Sorato and Diana Zavala-Rojas on Applications of Natural Language to Information Systems

Barcelona, Spain - June 23, 2022. Danielly Sorato and Diana Zavala-Rojas of the Research and Expertise Center for Survey Methodology (RECSM) at Pompeu Fabra University recently published a new article, “Sentence Alignment of Bilingual Survey Texts Applying a Metadata-Aware Strategy,” which appears in Lecture Notes in Computer Science, in the proceedings of the 2022 Natural Language Processing and Information Systems Conference [1].

23.06.2022

 

The paper describes a method for aligning the texts of bilingual survey questionnaires in the Multilingual Corpus of Survey Questions (MCSQ) [2]. The method leverages sentence-level metadata annotations. To analyse the performance of their alignment strategy, the researchers built eight gold standards in four distinct languages (Catalan, French, Portuguese, and Spanish). The abstract of the article reads as follows:

Sentence alignment is a crucial task in the process of building parallel corpora. Off-the-shelf tools for sentence alignment generally perform well to this end. However in certain cases, depending on factors such as the sentence structure and the amount of contextual information, the sentence alignment task can be challenging and require further resources that may be difficult to find, such as domain-specific bilingual dictionaries. Although investing in creating additional linguistic resources is frequently the chosen option in these circumstances, leveraging extra-linguistic information such as sentence-level metadata can be an easier alternative to narrow the alignment search space. This paper presents a method designed for the alignment of bilingual survey questionnaires’ texts, which leverages sentence-level metadata annotations. We build eight gold standards in four distinct languages to measure our sentence aligner performance, namely Catalan, French, Portuguese, and Spanish.
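To give a rough sense of what "narrowing the alignment search space" with metadata can mean, here is a minimal sketch in Python. It is not the authors' implementation. The metadata fields (`item_name`, `item_type`), the length-based score, and the greedy pairing are all assumptions made for illustration: each source segment is compared only against target segments that carry the same metadata, rather than against every segment in the translated questionnaire.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    text: str
    item_name: str  # hypothetical metadata: survey item identifier
    item_type: str  # hypothetical metadata: e.g. "request" or "response"


def length_score(a: str, b: str) -> float:
    """Ratio of shorter to longer character length; 1.0 means equal length."""
    la, lb = max(len(a), 1), max(len(b), 1)
    return min(la, lb) / max(la, lb)


def align(source, target):
    """Pair each source segment with a target segment sharing its metadata.

    Illustrative only: candidates are restricted to the same
    (item_name, item_type) bucket, then chosen greedily by length similarity.
    """
    buckets = defaultdict(list)
    for tgt in target:
        buckets[(tgt.item_name, tgt.item_type)].append(tgt)

    pairs = []
    for src in source:
        candidates = buckets[(src.item_name, src.item_type)]
        if not candidates:
            pairs.append((src, None))  # no target segment with matching metadata
            continue
        best = max(candidates, key=lambda t: length_score(src.text, t.text))
        candidates.remove(best)  # keep the alignment one-to-one
        pairs.append((src, best))
    return pairs


# Invented example data, not taken from the MCSQ.
source = [
    Segment("How often do you watch television?", "Q1", "request"),
    Segment("Never", "Q1", "response"),
    Segment("Every day", "Q1", "response"),
]
target = [
    Segment("Todos los días", "Q1", "response"),
    Segment("¿Con qué frecuencia ve usted la televisión?", "Q1", "request"),
    Segment("Nunca", "Q1", "response"),
]

pairs = align(source, target)
for src, tgt in pairs:
    print(src.text, "->", tgt.text if tgt else None)
```

In this toy example the question text is never compared against the answer options, because they sit in different metadata buckets. A real aligner would replace the length ratio with a stronger similarity measure.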

This article is also part of RECSM’s larger work on the MCSQ, which comprises 306 different questionnaires in the source language (English) and their translations into Catalan, Czech, French, German, Norwegian, Portuguese, Spanish, and Russian, as well as 29 language varieties (e.g., Swiss-French, Austrian-German). It is part of the Social Sciences and Humanities Open Cloud (SSHOC) project.

----

1. https://link.springer.com/chapter/10.1007/978-3-031-08473-7_43

2. https://www.upf.edu/web/mcsq/

 
