🐸 Github
🤗 Demo
🤖 Model card
💬 Discord
TL;DR
Features
Multi-lingual speech generation in 16 languages.
Cross-language voice cloning.
Streaming inference with < 200ms latency. (See Streaming inference)
Fine-tuning support. (See Training)
Updates with v2
Architectural changes for improved voice cloning.
2 new languages: Hungarian and Korean.
Across-the-board quality improvements.
Languages
English (en), Spanish (es), French (fr), German (de), Italian (it), Portuguese (pt), Polish (pl), Turkish (tr), Russian (ru), Dutch (nl), Czech (cs), Arabic (ar), Chinese (zh-cn), Japanese (ja), Hungarian (hu) and Korean (ko).
Technical Details
We recently released XTTSv2 with 🐸TTS v0.20, and here I go over the relevant details of the model.
XTTSv2 uses the same backbone as XTTSv1: a GPT2 model that predicts audio tokens computed by a pre-trained discrete VAE model. The core update is the way we condition the model on speaker information, using a Perceiver model. In our model, the Perceiver takes a mel-spectrogram as input and produces 32 latent vectors representing the speaker, which prefix the GPT decoder input.
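The conditioning scheme above can be sketched at the shape level. This is an illustrative toy, not the real model: the actual Perceiver uses cross-attention, while here a simple chunk-average stands in for it, and all names and dimensions are hypothetical. The point is that a variable-length mel-spectrogram is always compressed to exactly 32 latent vectors that are prepended to the decoder input.

```python
def perceiver_condition(mel_frames, n_latents=32):
    """Toy stand-in for the Perceiver: pool a variable-length mel-spectrogram
    into a fixed set of n_latents speaker vectors. The real model uses
    cross-attention; chunk-averaging here is purely for illustration."""
    chunk = max(1, len(mel_frames) // n_latents)
    latents = []
    for i in range(n_latents):
        window = mel_frames[i * chunk:(i + 1) * chunk] or mel_frames[-1:]
        # Average each mel channel over the window.
        latents.append([sum(col) / len(window) for col in zip(*window)])
    return latents  # always n_latents vectors, regardless of input length

# 100 mel frames with 4 channels each (tiny dims for readability).
mel = [[float(t + d) for d in range(4)] for t in range(100)]
speaker_latents = perceiver_condition(mel)

# The 32 speaker latents prefix the (embedded) text tokens fed to the decoder.
text_embeddings = [[1.0] * 4, [2.0] * 4, [3.0] * 4]
decoder_input = speaker_latents + text_embeddings
```

However long the reference mel is, the decoder always sees the same 32-vector speaker prefix ahead of the text, which is what makes the conditioning length-independent.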
We observed that the Perceiver captures speaker characteristics better than a simple encoder (as in Tortoise) or speech prompting (as in Vall-E). It also produces consistent outputs between runs, alleviating the speaker shifting seen across different model runs.
The Perceiver allows the use of multiple references without any length limits. This way, it is possible to capture different aspects of the target speaker, and even to combine different speakers to create a unique voice.
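A minimal sketch of why multi-reference conditioning falls out for free: since the Perceiver accepts input of arbitrary length, reference clips can simply be concatenated along the time axis before conditioning. The helper name and dimensions below are hypothetical, for illustration only.

```python
def combine_references(reference_mels):
    """Concatenate mel frames from several reference clips along time.
    Mixing clips from different speakers blends their characteristics."""
    combined = []
    for mel in reference_mels:
        combined.extend(mel)
    return combined

ref_a = [[0.1] * 4] * 60   # 60 mel frames from speaker A
ref_b = [[0.9] * 4] * 40   # 40 mel frames from speaker B
frames = combine_references([ref_a, ref_b])
```

The combined sequence is then compressed to the same fixed number of speaker latents as a single reference would be, so downstream generation is unchanged.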
We switched to a HiFi-GAN model to compute the final audio signal from the GPT2 outputs. Compared to standard multi-stage models like Vall-E and SoundStorm, this considerably reduces inference latency.
XTTSv2 can achieve less than 150ms streaming latency with a pure PyTorch implementation on a consumer-grade GPU, significantly faster than known open-source and commercial solutions.
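The streaming setup can be sketched conceptually: audio is emitted as soon as the first chunk of GPT tokens has been vocoded, rather than after the whole utterance, which is what keeps the latency to first audio low. Both functions below are toy stand-ins with hypothetical names, not the real model components.

```python
import time

def gpt_token_stream(n_tokens=200, chunk=20):
    """Toy generator standing in for incremental GPT2 audio-token decoding."""
    for start in range(0, n_tokens, chunk):
        yield list(range(start, start + chunk))

def vocode(tokens):
    """Toy stand-in for the HiFi-GAN step: tokens -> waveform samples."""
    return [t * 0.001 for t in tokens]

t0 = time.perf_counter()
first_audio_at = None
audio = []
for token_chunk in gpt_token_stream():
    audio.extend(vocode(token_chunk))
    if first_audio_at is None:
        # Latency to first playable audio, not to the full utterance.
        first_audio_at = time.perf_counter() - t0
```

Because HiFi-GAN is a single non-autoregressive pass over each chunk, the time to first audio is dominated by decoding the first few GPT tokens, not by the length of the full utterance.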
XTTSv2 comes with additional languages, making a total of 16 languages.
I thank our community, who helped us create new datasets and evaluate the model in their native languages.
XTTSv2 is trained with more data and better-tuned hyper-parameters, achieving better loss curves.
We primarily use publicly available datasets for the training. We intentionally did not crawl the entire web. Some may consider this foolish in a competitive environment. However, we respect everyone's work and want to maintain this respect.
This approach also helps us keep our work and models accessible to enterprise and private users without worrying about future problems due to the training data.
Overall, XTTSv2 is an improvement in every way. It offers better cloning and audio quality, additional Hungarian and Korean languages, and more expressive and natural outputs.
This new release has been well received by our community, and the feedback has been great. Give XTTSv2 a try!
We are actively working on the new version. We plan to expand the model's capabilities and add even more languages. Our great community is helping us with this endeavor. If you want to join us, we are on Discord.
Best 😃
References
XTTSv1: https://erogol.com/2023/09/27/xtts-v1-notes
Vall-E: https://arxiv.org/abs/2301.02111
Tortoise: https://github.com/neonbjb/tortoise-tts
DALL-E: https://arxiv.org/abs/2102.12092
Perceiver: https://arxiv.org/abs/2103.03206
Thanks for sharing! I have a question about training the vocoder with GPT outputs.
How did you produce the GPT outputs used for training the vocoder?
The GPT model takes inputs like <condition> <text token> <mel token>, and the final-layer outputs are used for vocoder training, but how is the condition selected?
In the XTTS v1 technical report, the condition mel was shuffled when training the GPT; how is the condition mel processed when training the vocoder?