🐸 Github
🤗 Demo
🤖 Model card
💬 Discord
TL;DR
Features
Multi-lingual speech generation in 16 languages.
Cross-language voice cloning.
Streaming inference with < 200ms latency. (See Streaming inference)
Fine-tuning support. (See Training)
Updates with v2
Architectural changes for improved voice cloning.
2 new languages: Hungarian and Korean.
Across-the-board quality improvements.
Languages
English (en), Spanish (es), French (fr), German (de), Italian (it), Portuguese (pt), Polish (pl), Turkish (tr), Russian (ru), Dutch (nl), Czech (cs), Arabic (ar), Chinese (zh-cn), Japanese (ja), Hungarian (hu) and Korean (ko).
Technical Details
We recently released XTTSv2 with 🐸TTS v0.20, and here I go over the relevant details of the model.
XTTSv2 uses the same backbone as XTTSv1: a GPT2 model that predicts audio tokens computed by a pre-trained discrete VAE model. The core update is the way we condition the model on speaker information, using a Perceiver model. In our model, the Perceiver takes a mel-spectrogram as input and produces 32 latent vectors representing the speaker, which prefix the GPT decoder input.
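The conditioning scheme above can be sketched at the shape level. This is an illustrative toy, not the real model: the actual Perceiver uses cross-attention, while here a simple chunk-average stands in for it, and all names and dimensions are hypothetical. The point is that a variable-length mel-spectrogram is always compressed to exactly 32 latent vectors that are prepended to the decoder input.

```python
def perceiver_condition(mel_frames, n_latents=32):
    """Toy stand-in for the Perceiver: pool a variable-length mel-spectrogram
    into a fixed set of n_latents speaker vectors. The real model uses
    cross-attention; chunk-averaging here is purely for illustration."""
    chunk = max(1, len(mel_frames) // n_latents)
    latents = []
    for i in range(n_latents):
        window = mel_frames[i * chunk:(i + 1) * chunk] or mel_frames[-1:]
        # Average each mel channel over the window.
        latents.append([sum(col) / len(window) for col in zip(*window)])
    return latents  # always n_latents vectors, regardless of input length

# 100 mel frames with 4 channels each (tiny dims for readability).
mel = [[float(t + d) for d in range(4)] for t in range(100)]
speaker_latents = perceiver_condition(mel)

# The 32 speaker latents prefix the (embedded) text tokens fed to the decoder.
text_embeddings = [[1.0] * 4, [2.0] * 4, [3.0] * 4]
decoder_input = speaker_latents + text_embeddings
```

However long the reference mel is, the decoder always sees the same 32-vector speaker prefix ahead of the text, which is what makes the conditioning length-independent.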
We observed that the Perceiver captures speaker characteristics better than a simple encoder (as in Tortoise) or speech prompting (as in Vall-E). It also produces consistent outputs between runs, alleviating the speaker shifting seen across different model runs.
The Perceiver allows the use of multiple references without any length limits. This way, it is possible to capture different aspects of the target speaker, and even to combine different speakers to create a unique voice.
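A minimal sketch of why multi-reference conditioning falls out for free: since the Perceiver accepts input of arbitrary length, reference clips can simply be concatenated along the time axis before conditioning. The helper name and dimensions below are hypothetical, for illustration only.

```python
def combine_references(reference_mels):
    """Concatenate mel frames from several reference clips along time.
    Mixing clips from different speakers blends their characteristics."""
    combined = []
    for mel in reference_mels:
        combined.extend(mel)
    return combined

ref_a = [[0.1] * 4] * 60   # 60 mel frames from speaker A
ref_b = [[0.9] * 4] * 40   # 40 mel frames from speaker B
frames = combine_references([ref_a, ref_b])
```

The combined sequence is then compressed to the same fixed number of speaker latents as a single reference would be, so downstream generation is unchanged.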
We switched to a HiFi-GAN model to compute the final audio signal from the GPT2 outputs. Compared to standard multi-stage models like Vall-E and SoundStorm, this considerably reduces inference latency.
XTTSv2 can achieve less than 150ms streaming latency with a pure PyTorch implementation on a consumer-grade GPU, significantly faster than known open-source and commercial solutions.
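The streaming setup can be sketched conceptually: audio is emitted as soon as the first chunk of GPT tokens has been vocoded, rather than after the whole utterance, which is what keeps the latency to first audio low. Both functions below are toy stand-ins with hypothetical names, not the real model components.

```python
import time

def gpt_token_stream(n_tokens=200, chunk=20):
    """Toy generator standing in for incremental GPT2 audio-token decoding."""
    for start in range(0, n_tokens, chunk):
        yield list(range(start, start + chunk))

def vocode(tokens):
    """Toy stand-in for the HiFi-GAN step: tokens -> waveform samples."""
    return [t * 0.001 for t in tokens]

t0 = time.perf_counter()
first_audio_at = None
audio = []
for token_chunk in gpt_token_stream():
    audio.extend(vocode(token_chunk))
    if first_audio_at is None:
        # Latency to first playable audio, not to the full utterance.
        first_audio_at = time.perf_counter() - t0
```

Because HiFi-GAN is a single non-autoregressive pass over each chunk, the time to first audio is dominated by decoding the first few GPT tokens, not by the length of the full utterance.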
XTTSv2 comes with additional languages, making a total of 16 languages.
I thank our community, who helped us create new datasets and evaluate the model in their native languages.
XTTSv2 is trained with more data and better-tuned hyper-parameters, achieving better loss curves.
We primarily use publicly available datasets for the training. We intentionally did not crawl the entire web. Some may consider this foolish in a competitive environment. However, we respect everyone's work and want to maintain this respect.
This approach also helps us keep our work and models accessible to enterprise and private users without worrying about future problems due to the training data.
Overall, XTTSv2 is an improvement in every way. It offers better cloning and audio quality, additional Hungarian and Korean languages, and more expressive and natural outputs.
This new release has been well received by our community, and the feedback has been great. Give XTTSv2 a try!
We are actively working on the new version. We plan to expand the model's capabilities and add even more languages. Our great community is helping us with this endeavor. If you want to join us, we are on Discord.
Best 😃
References
XTTSv1: https://erogol.com/2023/09/27/xtts-v1-notes
Vall-E: https://arxiv.org/abs/2301.02111
Tortoise: https://github.com/neonbjb/tortoise-tts
DALL-E: https://arxiv.org/abs/2102.12092
Perceiver: https://arxiv.org/abs/2103.03206
Thanks for sharing! I have a question about training the vocoder with GPT outputs.
How did you produce the GPT outputs used for training the vocoder?
The GPT model takes inputs like <condition> <text token> <mel token>, and the final-layer outputs are used for vocoder training, but how is the condition selected?
In the XTTS v1 technical report, the condition mel was shuffled when training the GPT; how is the condition mel processed when training the vocoder?