Merlijn Blaauw defends his PhD thesis
Friday July 22nd, 2022 at 10.00h (CET) - online
Title: Modeling Timbre for Neural Singing Synthesis: Methods for Data-Efficient, Reduced Effort Voice Creation, and Fast and Stable Inference
Supervisors: Dr. Emilia Gómez and Dr. Jordi Bonada
Jury: Dr. Axel Roebel (IRCAM), Dr. Mireia Farrús (UB), Dr. Masataka Goto (AIST)
Singing synthesis has seen a notable surge in popularity over the last decade and a half. Music producers use this technology as an instrument, there is an audience for music with synthetic vocals, and an entire range of cultural phenomena surrounding singing synthesis has emerged. At the time this work began, the prevailing approaches to singing synthesis were concatenative synthesis on the one hand and hidden Markov model synthesis on the other. Concatenative synthesis was state of the art in terms of quality, but lacked flexibility because it relied on signal processing, heuristics, and carefully prepared data. By contrast, hidden Markov model synthesis is based on data-driven machine learning, which brings a certain degree of flexibility, but it was never able to match the sound quality of concatenative synthesis. At the same time, the field of text-to-speech began shifting towards powerful new deep learning models that have been shown to combine high-quality results with a high degree of flexibility. In this dissertation, we try to answer whether similar models can live up to this potential for singing synthesis. We also try to answer whether these approaches allow fast and stable synthesis, qualities important for many real-world applications. Finally, we try to answer whether the flexibility these deep learning approaches offer allows creating new voices with smaller amounts of data and less effort (time, expert knowledge), a notable bottleneck in older approaches. To this end, we propose a number of singing synthesis models and evaluate them, principally through listening tests. The first part of this dissertation focuses on modeling timbre, via autoregressive and non-autoregressive models.
The second part focuses on improving data efficiency through voice cloning, reducing the voice creation effort by using a sequence-to-sequence mechanism that requires fewer annotations, and a semi-supervised model that combines supervised pre-training with unsupervised training of a new target voice. Through our experiments, we show that deep learning methods can not only outperform the previous state of the art, but also allow for a significantly reduced voice creation effort. With our work on these elemental problems in singing synthesis, we hope that future research can advance the field further by focusing on topics such as expression, user control, and non-modal voice qualities.
This thesis defense will take place online. To attend, use this link (meeting ID: 921 1742 1855). Microphones and cameras must be turned off, and online access will close 30 minutes after the start of the defense.