[PhD thesis] Incorporating Prosody into Neural Speech Processing Pipelines. Applications on automatic speech transcription and spoken language machine translation

We develop a large number of software tools and hosting infrastructures to support the research developed at the Department. We will be detailing in this section the different tools available. You can take a look for the moment at the offer available within the UPF Knowledge Portal, the innovations created in the context of EU projects in the Innovation Radar and the software sections of some of our research groups:

Artificial Intelligence

Nonlinear Time Series Analysis

Downloads

Web Research

Dyswebxia

Music Technology

Interactive Technologies

Barcelona MedTech

GitHub

Natural Language Processing

GitHub
Resources (datasets, software and other material)

Nonlinear Time Series Analysis

Downloads

UbicaLab

GitHub

Wireless Networking

GitHub

Educational Technologies

GitHub

Back [PhD thesis] Incorporating Prosody into Neural Speech Processing Pipelines. Applications on automatic speech transcription and spoken language machine translation

[PhD thesis] Incorporating Prosody into Neural Speech Processing Pipelines. Applications on automatic speech transcription and spoken language machine translation

Author: Alp Öktem

Supervisor: Mireia Farrús and Antonio Bonafonte

In this dissertation, I study the inclusion of prosody into two applications that involve speech understanding: automatic speech transcription and spoken language translation. In the former case, I propose a method that uses an attention mechanism over parallel sequences of prosodic and morphosyntactic features. Results indicate an F1 score of 70.3% in terms of overall punctuation generation accuracy. In the latter problem I deal with enhancing spoken language translation with prosody. A neural machine translation system trained with movie-domain data is adapted with pause features using a prosodically annotated bilingual dataset. Results show that prosodic punctuation generation as a preliminary step to translation increases translation accuracy by 1% in terms of BLEU scores. Encoding pauses as an extra encoding feature gives an additional 1% increase to this number. The system is further extended to jointly predict pause features in order to be used as an input to a text-to-speech system.

Keywords: prosody, automatic speech transcription, punctuation restoration, spoken language machine translation, bilingual spoken corpus

Link at TDX: http://hdl.handle.net/10803/666222

Author's GitHub account and personal page: http://alpoktem.github.io/

Video of the defence

Link: https://github.com/alpoktem/PhDThesis/blob/master/PhD_Thesis.pdf

DTIC MdM Strategic Program: Artificial and Natural Intelligence for ICT and beyond

[PhD thesis] Incorporating Prosody into Neural Speech Processing Pipelines. Applications on automatic speech transcription and spoken language machine translation

Related Assets