Authors: Kyle Kastner, Joao Felipe Santos, Yoshua Bengio, Aaron Courville
Abstract: Recent character- and phoneme-based parametric TTS systems using deep learning have shown strong performance in natural speech generation. However, the choice between character or phoneme input can create serious limitations for practical deployment, as direct control of pronunciation is crucial in certain cases. We demonstrate a simple method for combining multiple types of linguistic information into a single encoder, named representation mixing, which enables flexible choice between character, phoneme, or mixed representations during inference. Experiments and user studies on a public audiobook corpus show the efficacy of our approach.
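To make the idea concrete, below is a minimal sketch in PyTorch of the mixed embedding described above. All names and sizes (CHAR_VOCAB, PHONE_VOCAB, EMBED_DIM, mix_embed) are hypothetical illustrations, not the paper's actual implementation: each position carries either a character or a phoneme id, a per-token mask selects which embedding table is read, and a learned embedding of the mask itself is added so the encoder knows which representation it is seeing.

    import torch
    import torch.nn as nn

    CHAR_VOCAB = 64    # assumed character inventory size
    PHONE_VOCAB = 52   # assumed phoneme inventory size
    EMBED_DIM = 256    # assumed embedding width

    char_embed = nn.Embedding(CHAR_VOCAB, EMBED_DIM)
    phone_embed = nn.Embedding(PHONE_VOCAB, EMBED_DIM)
    # learned embedding of the per-token mask, added so the encoder
    # knows which representation each position carries
    mask_embed = nn.Embedding(2, EMBED_DIM)

    def mix_embed(char_ids, phone_ids, mask):
        """Blend character and phoneme embeddings per token.

        mask[j] == 1 -> position j uses its phoneme id,
        mask[j] == 0 -> position j uses its character id.
        """
        m = mask.unsqueeze(-1).float()
        e = m * phone_embed(phone_ids) + (1.0 - m) * char_embed(char_ids)
        return e + mask_embed(mask)

    # During training the mask can be sampled at random per word, so one
    # model learns to read characters, phonemes, or any mixture; at
    # inference the mask is set by the user (e.g. all zeros for the
    # "Chars" condition below).
    char_ids = torch.randint(0, CHAR_VOCAB, (1, 10))
    phone_ids = torch.randint(0, PHONE_VOCAB, (1, 10))
    mask = torch.randint(0, 2, (1, 10))
    print(mix_embed(char_ids, phone_ids, mask).shape)  # torch.Size([1, 10, 256])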
Representation Mixing: training with our method.
Static: training with fixed inputs of the respective type.
PWCB: phoneme-with-character-backoff at inference (sketched in code below).
Chars: using only characters at inference.
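As a rough illustration of PWCB, the following sketch assumes a word-level pronunciation lexicon such as CMUdict; the toy LEXICON and the pwcb helper are hypothetical stand-ins for the actual frontend. Each word is emitted as phonemes when the lexicon covers it, and backs off to its character sequence otherwise.

    # Toy stand-in lexicon; a real frontend would use CMUdict or similar.
    LEXICON = {
        "hello": ["HH", "AH0", "L", "OW1"],
        "world": ["W", "ER1", "L", "D"],
    }

    def pwcb(text):
        """Per word: emit phonemes if the word is in the lexicon,
        otherwise back off to its character sequence."""
        tokens = []
        for word in text.lower().split():
            if word in LEXICON:
                tokens.append(("phone", LEXICON[word]))
            else:
                tokens.append(("char", list(word)))
        return tokens

    print(pwcb("hello unseen world"))
    # [('phone', ['HH', 'AH0', 'L', 'OW1']),
    #  ('char', ['u', 'n', 's', 'e', 'e', 'n']),
    #  ('phone', ['W', 'ER1', 'L', 'D'])]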
The sections below compare the following conditions:
Section 1: A model trained with representation mixing, using PWCB at inference, compared against a Static baseline trained on fixed PWCB inputs.
Section 2: A model trained with representation mixing, using characters at inference, compared against a Static baseline trained on fixed character inputs.
Section 3: A model trained with representation mixing, using characters at inference, compared against a high-quality open source Tacotron 2 implementation trained only on characters.
Special thanks to Ryuichi Yamamoto and Rayhane Mama for their work on these open-source codebases, alongside many other contributors.
Section 4: Comparison of a WaveNet neural decoder, 1000-step L-BFGS plus Griffin-Lim (L-BFGS+GL) inversion, and 100-step L-BFGS+GL inversion for the same model.
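For reference, the Griffin-Lim half of the L-BFGS+GL inversion in Section 4 can be sketched with librosa as below. The L-BFGS step (fitting a linear-frequency magnitude spectrogram to the model's predicted mel spectrogram) is omitted, and the tone input and STFT parameters are stand-in assumptions; n_iter corresponds to the 100 versus 1000 step counts compared above.

    import numpy as np
    import librosa

    sr, n_fft, hop = 22050, 1024, 256

    # Stand-in magnitude spectrogram; in practice this would come from
    # L-BFGS inversion of the model's predicted mel spectrogram.
    y = librosa.tone(440.0, sr=sr, duration=1.0)  # synthetic test signal
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))

    # More Griffin-Lim iterations recover phase more accurately but cost
    # proportionally more time; fewer steps leave audible phase artifacts.
    fast = librosa.griffinlim(S, n_iter=100, hop_length=hop)
    slow = librosa.griffinlim(S, n_iter=1000, hop_length=hop)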