[PhD thesis] Incorporating Prosody into Neural Speech Processing Pipelines. Applications on automatic speech transcription and spoken language machine translation

Author: Alp Öktem

Supervisors: Mireia Farrús and Antonio Bonafonte

In this dissertation, I study the inclusion of prosody in two applications that involve speech understanding: automatic speech transcription and spoken language translation. For the former, I propose a punctuation generation method that uses an attention mechanism over parallel sequences of prosodic and morphosyntactic features. The method reaches an overall F1 score of 70.3% on punctuation generation. For the latter, I enhance spoken language translation with prosody. A neural machine translation system trained on movie-domain data is adapted with pause features using a prosodically annotated bilingual dataset. Results show that prosodic punctuation generation as a preliminary step to translation increases translation accuracy by 1% in terms of BLEU score. Encoding pauses as an additional feature gives a further 1% increase. The system is further extended to jointly predict pause features so that they can be used as input to a text-to-speech system.

Keywords: prosody, automatic speech transcription, punctuation restoration, spoken language machine translation, bilingual spoken corpus

Link at TDX: http://hdl.handle.net/10803/666222

Author's GitHub account and personal page: http://alpoktem.github.io/

Video of the defence