2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Audio-Visual Speech Recognition (AVSR) faces the difficult task of exploiting acoustic and visual cues simultaneously. Augmenting speech with the visual channel creates its own challenges: for example, every person has unique mouth movements, which makes visual models very difficult to generalize. This motivates our focus on the generalization of speaker-independent (SI) AVSR systems, especially in noisy environments, by exploiting the visual domain. Specifically, we are the first to explore the visual adaptation of an SI-AVSR system to an unknown and unlabelled speaker. We adapt an AVSR system trained in a source domain to decode samples in a target domain without the need for labels in the target domain. For the domain adaptation to the unknown speaker, we use Coupled Generative Adversarial Networks (CoGANs) to automatically learn a joint distribution of multi-domain images. We evaluate our character-based AVSR system on the TCD-TIMIT dataset and obtain up to a 10% average improvement relative to the equivalent AVSR system.
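As a rough illustration of the weight-sharing idea behind Coupled GANs, the sketch below builds two toy generators that share their first layer and keep separate output layers, so a single latent vector yields a corresponding pair of samples, one per domain. It is not the architecture used in the paper: the class name `CoupledGenerators`, the layer sizes, and the use of flat vectors in place of mouth-region images are all assumptions made for illustration, the weights are random rather than trained, and the discriminators and adversarial training are omitted.

```python
import math
import random


def linear(weights, x):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def random_matrix(rows, cols, rng):
    """Small random weights; a trained model would learn these instead."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class CoupledGenerators:
    """Two toy generators with a shared first layer (hypothetical sizes)."""

    def __init__(self, latent_dim=8, hidden_dim=16, image_dim=32, seed=0):
        rng = random.Random(seed)
        self.latent_dim = latent_dim
        # Shared layer: both domains decode the same hidden representation.
        self.shared = random_matrix(hidden_dim, latent_dim, rng)
        # Domain-specific output layers (source speaker vs. target speaker).
        self.head_source = random_matrix(image_dim, hidden_dim, rng)
        self.head_target = random_matrix(image_dim, hidden_dim, rng)

    def hidden(self, z):
        return [math.tanh(h) for h in linear(self.shared, z)]

    def generate_pair(self, z):
        """Map one latent vector to a (source, target) pair of flat 'images'."""
        h = self.hidden(z)
        source_image = [math.tanh(v) for v in linear(self.head_source, h)]
        target_image = [math.tanh(v) for v in linear(self.head_target, h)]
        return source_image, target_image


rng = random.Random(1)
gen = CoupledGenerators()
z = [rng.gauss(0.0, 1.0) for _ in range(gen.latent_dim)]
src, tgt = gen.generate_pair(z)
print(len(src), len(tgt))
```

Because both outputs are decoded from the same hidden vector, the two samples are tied to the same latent content while differing in domain-specific appearance, which is the property that lets a coupled model relate images across domains without paired training data.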
Fernandez-Lopez A, Karaali A, Harte N, Sukno FM. CoGANs for unsupervised visual speech adaptation to new speakers. In: 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 1 ed. Barcelona: IEEE; 2020. p. 6294-6298.