Authors: Kyle Kastner, Joao Felipe Santos, Yoshua Bengio, Aaron Courville
Abstract: Recent character- and phoneme-based parametric TTS systems using deep learning have shown strong performance in natural speech generation. However, the choice between character or phoneme input can create serious limitations for practical deployment, as direct control of pronunciation is crucial in certain cases. We demonstrate a simple method for combining multiple types of linguistic information into a single encoder, named representation mixing, which enables flexible choice between character, phoneme, or mixed representations during inference. Experiments and user studies on a public audiobook corpus show the efficacy of our approach.
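To make the idea concrete, below is a minimal sketch in PyTorch of the mixed embedding described above. All names and sizes (CHAR_VOCAB, PHONE_VOCAB, EMBED_DIM, mix_embed) are hypothetical illustrations, not the paper's actual implementation: each position carries either a character or a phoneme id, a per-token mask selects which embedding table is read, and a learned embedding of the mask itself is added so the encoder knows which representation it is seeing.

    import torch
    import torch.nn as nn

    CHAR_VOCAB = 64    # assumed character inventory size
    PHONE_VOCAB = 52   # assumed phoneme inventory size
    EMBED_DIM = 256    # assumed embedding width

    char_embed = nn.Embedding(CHAR_VOCAB, EMBED_DIM)
    phone_embed = nn.Embedding(PHONE_VOCAB, EMBED_DIM)
    # learned embedding of the per-token mask, added so the encoder
    # knows which representation each position carries
    mask_embed = nn.Embedding(2, EMBED_DIM)

    def mix_embed(char_ids, phone_ids, mask):
        """Blend character and phoneme embeddings per token.

        mask[j] == 1 -> position j uses its phoneme id,
        mask[j] == 0 -> position j uses its character id.
        """
        m = mask.unsqueeze(-1).float()
        e = m * phone_embed(phone_ids) + (1.0 - m) * char_embed(char_ids)
        return e + mask_embed(mask)

    # During training the mask can be sampled at random per word, so one
    # model learns to read characters, phonemes, or any mixture; at
    # inference the mask is set by the user (e.g. all zeros for the
    # "Chars" condition below).
    char_ids = torch.randint(0, CHAR_VOCAB, (1, 10))
    phone_ids = torch.randint(0, PHONE_VOCAB, (1, 10))
    mask = torch.randint(0, 2, (1, 10))
    print(mix_embed(char_ids, phone_ids, mask).shape)  # torch.Size([1, 10, 256])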
Representation Mixing: training with our method.
Static: training with fixed inputs of the respective type.
PWCB: phoneme-with-character-backoff at inference (sketched in code below).
Chars: using only characters at inference.
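As a rough illustration of PWCB, the following sketch assumes a word-level pronunciation lexicon such as CMUdict; the toy LEXICON and the pwcb helper are hypothetical stand-ins for the actual frontend. Each word is emitted as phonemes when the lexicon covers it, and backs off to its character sequence otherwise.

    # Toy stand-in lexicon; a real frontend would use CMUdict or similar.
    LEXICON = {
        "hello": ["HH", "AH0", "L", "OW1"],
        "world": ["W", "ER1", "L", "D"],
    }

    def pwcb(text):
        """Per word: emit phonemes if the word is in the lexicon,
        otherwise back off to its character sequence."""
        tokens = []
        for word in text.lower().split():
            if word in LEXICON:
                tokens.append(("phone", LEXICON[word]))
            else:
                tokens.append(("char", list(word)))
        return tokens

    print(pwcb("hello unseen world"))
    # [('phone', ['HH', 'AH0', 'L', 'OW1']),
    #  ('char', ['u', 'n', 's', 'e', 'e', 'n']),
    #  ('phone', ['W', 'ER1', 'L', 'D'])]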
The sections below compare the following conditions:
Section 1: A model trained with representation mixing, using PWCB at inference, compared against a Static baseline trained on fixed PWCB inputs.
Section 2: A model trained with representation mixing, using characters at inference, compared against a Static baseline trained on fixed character inputs.
Section 3: A model trained with representation mixing, using characters at inference, compared against a high-quality open source Tacotron 2 implementation trained only on characters.
Special thanks to Ryuichi Yamamoto and Rayhane Mama for their work on these open-source codebases, alongside many other contributors.
Section 4: Comparison of a WaveNet neural decoder, 1000-step L-BFGS plus Griffin-Lim (L-BFGS+GL) inversion, and 100-step L-BFGS+GL inversion for the same model.
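For reference, the Griffin-Lim half of the L-BFGS+GL inversion in Section 4 can be sketched with librosa as below. The L-BFGS step (fitting a linear-frequency magnitude spectrogram to the model's predicted mel spectrogram) is omitted, and the tone input and STFT parameters are stand-in assumptions; n_iter corresponds to the 100 versus 1000 step counts compared above.

    import numpy as np
    import librosa

    sr, n_fft, hop = 22050, 1024, 256

    # Stand-in magnitude spectrogram; in practice this would come from
    # L-BFGS inversion of the model's predicted mel spectrogram.
    y = librosa.tone(440.0, sr=sr, duration=1.0)  # synthetic test signal
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))

    # More Griffin-Lim iterations recover phase more accurately but cost
    # proportionally more time; fewer steps leave audible phase artifacts.
    fast = librosa.griffinlim(S, n_iter=100, hop_length=hop)
    slow = librosa.griffinlim(S, n_iter=1000, hop_length=hop)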