Commit fc1bf88

Update Info for MaskGCT and Vevo (#387)
1 parent 04dfe6e commit fc1bf88

File tree

5 files changed: +51 / -52 lines


README.md — 2 additions, 1 deletion

````diff
@@ -34,6 +34,7 @@
 In addition to the specific generation tasks, Amphion includes several **vocoders** and **evaluation metrics**. A vocoder is an important module for producing high-quality audio signals, while evaluation metrics are critical for evaluating generation tasks consistently. Moreover, Amphion is dedicated to advancing audio generation in real-world applications, such as building **large-scale datasets** for speech synthesis.
 
 ## 🚀 News
+- **2025/01/23**: [MaskGCT](https://arxiv.org/abs/2409.00750) and [Vevo](https://openreview.net/pdf?id=anQDiQZhDP) were accepted by ICLR 2025! 🎉
 - **2024/12/22**: We release the reproduction of **Vevo**, a zero-shot voice imitation framework with controllable timbre and style. Vevo can be applied to a range of speech generation tasks, including VC, TTS, AC, and more. The released pre-trained models are trained on the [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) dataset and achieve SOTA zero-shot VC performance. [![arXiv](https://img.shields.io/badge/OpenReview-Paper-COLOR.svg)](https://openreview.net/pdf?id=anQDiQZhDP) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-model-yellow)](https://huggingface.co/amphion/Vevo) [![WebPage](https://img.shields.io/badge/WebPage-Demo-red)](https://versavoice.github.io/) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](models/vc/vevo/README.md)
 - **2024/10/19**: We release **MaskGCT**, a fully non-autoregressive TTS model that eliminates the need for explicit alignment information between text and speech supervision. MaskGCT is trained on the [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) dataset and achieves SOTA zero-shot TTS performance. [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2409.00750) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-model-yellow)](https://huggingface.co/amphion/maskgct) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-demo-pink)](https://huggingface.co/spaces/amphion/maskgct) [![ModelScope](https://img.shields.io/badge/ModelScope-space-purple)](https://modelscope.cn/studios/amphion/maskgct) [![ModelScope](https://img.shields.io/badge/ModelScope-model-cyan)](https://modelscope.cn/models/amphion/MaskGCT) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](models/tts/maskgct/README.md)
 - **2024/09/01**: [Amphion](https://arxiv.org/abs/2312.09911), [Emilia](https://arxiv.org/abs/2407.05361) and [DSFF-SVC](https://arxiv.org/abs/2310.11160) were accepted by IEEE SLT 2024! 🤗
@@ -184,7 +185,7 @@ Amphion is under the [MIT License](LICENSE). It is free for both research and co
 
 ```bibtex
 @inproceedings{amphion,
-  author={Zhang, Xueyao and Xue, Liumeng and Gu, Yicheng and Wang, Yuancheng and Li, Jiaqi and He, Haorui and Wang, Chaoren and Song, Ting and Chen, Xi and Fang, Zihao and Chen, Haopeng and Zhang, Junan and Tang, Tze Ying and Zou, Lexiao and Wang, Mingxuan and Han, Jun and Chen, Kai and Li, Haizhou and Wu, Zhizheng},
+  author={Xueyao Zhang and Liumeng Xue and Yicheng Gu and Yuancheng Wang and Jiaqi Li and Haorui He and Chaoren Wang and Ting Song and Xi Chen and Zihao Fang and Haopeng Chen and Junan Zhang and Tze Ying Tang and Lexiao Zou and Mingxuan Wang and Jun Han and Kai Chen and Haizhou Li and Zhizheng Wu},
   title={Amphion: An Open-Source Audio, Music and Speech Generation Toolkit},
   booktitle={{IEEE} Spoken Language Technology Workshop, {SLT} 2024},
   year={2024}
````

models/tts/debatts/try_inference_small_samples.py — 3 additions, 9 deletions

```diff
@@ -306,12 +306,8 @@ def semantic2acoustic(combine_semantic_code, acoustic_code):
 
 
 device = torch.device("cuda:0")
-cfg_soundstorm_1layer = load_config(
-    "./s2a_egs/s2a_debatts_1layer.json"
-)
-cfg_soundstorm_full = load_config(
-    "./s2a_egs/s2a_debatts_full.json"
-)
+cfg_soundstorm_1layer = load_config("./s2a_egs/s2a_debatts_1layer.json")
+cfg_soundstorm_full = load_config("./s2a_egs/s2a_debatts_full.json")
 
 soundstorm_1layer = build_soundstorm(cfg_soundstorm_1layer, device)
 soundstorm_full = build_soundstorm(cfg_soundstorm_full, device)
@@ -333,9 +329,7 @@ def semantic2acoustic(combine_semantic_code, acoustic_code):
 safetensors.torch.load_model(soundstorm_1layer, soundstorm_1layer_path)
 safetensors.torch.load_model(soundstorm_full, soundstorm_full_path)
 
-t2s_cfg = load_config(
-    "./t2s_egs/t2s_debatts.json"
-)
+t2s_cfg = load_config("./t2s_egs/t2s_debatts.json")
 t2s_model_new = build_t2s_model_new(t2s_cfg, device)
 t2s_model_new_ckpt_path = "./t2s_model/model.safetensors"
 safetensors.torch.load_model(t2s_model_new, t2s_model_new_ckpt_path)
```
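The `load_config` calls consolidated above read JSON experiment configs into attribute-style objects. A minimal stdlib sketch of that pattern (hypothetical; Amphion's real `load_config` helper may support extras such as base-config inheritance that this version does not):

```python
import json
import os
import tempfile
from types import SimpleNamespace


def load_config(path):
    """Read a JSON config file into a nested, attribute-accessible namespace.

    Sketch only: every JSON object becomes a SimpleNamespace, so
    cfg.model.hidden_size works like the dotted access used in the script.
    """
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f, object_hook=lambda d: SimpleNamespace(**d))


# Round-trip a tiny illustrative config (the values below are made up).
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    tmp.write('{"model": {"hidden_size": 1024, "num_layers": 16}, "sample_rate": 24000}')
    cfg_path = tmp.name

cfg = load_config(cfg_path)
print(cfg.model.hidden_size)  # -> 1024
os.unlink(cfg_path)
```

Because `object_hook` runs bottom-up over every JSON object, nested sections of the config become namespaces too, with no schema required.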

models/tts/debatts/utils/g2p_new/cleaners.py — 4 additions, 3 deletions

```diff
@@ -6,10 +6,11 @@
 import re
 from utils.g2p_new.mandarin import chinese_to_ipa
 
+
 def cjekfd_cleaners(text, language, text_tokenizers):
 
-    if language == 'zh':
-        return chinese_to_ipa(text, text_tokenizers['zh'])
+    if language == "zh":
+        return chinese_to_ipa(text, text_tokenizers["zh"])
     else:
-        raise Exception('Unknown or Not supported yet language: %s' % language)
+        raise Exception("Unknown or Not supported yet language: %s" % language)
     return None
```
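`cjekfd_cleaners` dispatches on the language code and raises for anything unsupported. The same pattern extends to more languages with a dispatch table; here is a hypothetical, self-contained sketch (the stub converter and tokenizer below are illustrative stand-ins, not Amphion's real g2p modules):

```python
def make_cleaner(converters):
    """Build a text cleaner that dispatches on a language code.

    `converters` maps a language code to a function(text, tokenizer) -> str.
    Unknown languages raise, mirroring cjekfd_cleaners' behavior.
    """
    def clean(text, language, text_tokenizers):
        try:
            convert = converters[language]
        except KeyError:
            raise ValueError("Unknown or not supported yet language: %s" % language)
        return convert(text, text_tokenizers[language])
    return clean


# Illustrative stand-ins for the real converter and tokenizer.
def fake_chinese_to_ipa(text, tokenizer):
    return "ipa:" + tokenizer(text)

cleaner = make_cleaner({"zh": fake_chinese_to_ipa})
result = cleaner("你好", "zh", {"zh": lambda t: t.lower()})
print(result)  # -> ipa:你好
```

A table keeps each converter's registration in one place, so adding a language is a one-line change rather than another `elif` branch.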

models/tts/maskgct/README.md — 35 additions, 34 deletions

The first hunk (@@ -132,12 +132,12 @@ Running this will automatically download the pretrained model from HuggingFace a) only re-aligns the whitespace of the pretrained-checkpoint table; its content is unchanged:

We provide the following pretrained checkpoints:

| Model Name | Description |
| --- | --- |
| [Semantic Codec](https://huggingface.co/amphion/MaskGCT/tree/main/semantic_codec) | Converting speech to semantic tokens. |
| [Acoustic Codec](https://huggingface.co/amphion/MaskGCT/tree/main/acoustic_codec) | Converting speech to acoustic tokens and reconstructing waveform from acoustic tokens. |
| [MaskGCT-T2S](https://huggingface.co/amphion/MaskGCT/tree/main/t2s_model) | Predicting semantic tokens with text and prompt semantic tokens. |
| [MaskGCT-S2A](https://huggingface.co/amphion/MaskGCT/tree/main/s2a_model) | Predicts acoustic tokens conditioned on semantic tokens. |

You can download all pretrained checkpoints from [HuggingFace](https://huggingface.co/amphion/MaskGCT/tree/main) or use the Hugging Face API.

The second hunk (@@ -165,41 +165,42 @@ We use the [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) data) likewise re-aligns the evaluation-results table without changing its values, and updates the MaskGCT citation from the arXiv preprint to the ICLR 2025 entry.

## Evaluation Results of MaskGCT

| System | SIM-O↑ | WER↓ | FSD↓ | SMOS↑ | CMOS↑ |
| :--- | :---: | :---: | :---: | :---: | :---: |
| | | **LibriSpeech test-clean** | | | |
| Ground Truth | 0.68 | 1.94 | | 4.05±0.12 | 0.00 |
| VALL-E | 0.50 | 5.90 | - | 3.47±0.26 | -0.52±0.22 |
| VoiceBox | 0.64 | 2.03 | 0.762 | 3.80±0.17 | -0.41±0.13 |
| NaturalSpeech 3 | 0.67 | 1.94 | 0.786 | 4.26±0.10 | 0.16±0.14 |
| VoiceCraft | 0.45 | 4.68 | 0.981 | 3.52±0.21 | -0.33±0.16 |
| XTTS-v2 | 0.51 | 4.20 | 0.945 | 3.02±0.22 | -0.98±0.19 |
| MaskGCT | 0.687 (0.723) | 2.634 (1.976) | 0.886 | 4.27±0.14 | 0.10±0.16 |
| MaskGCT (gt length) | 0.697 | 2.012 | 0.746 | 4.33±0.11 | 0.13±0.13 |
| | | **SeedTTS test-en** | | | |
| Ground Truth | 0.730 | 2.143 | | 3.92±0.15 | 0.00 |
| CosyVoice | 0.643 | 4.079 | 0.316 | 3.52±0.17 | -0.41±0.18 |
| XTTS-v2 | 0.463 | 3.248 | 0.484 | 3.15±0.22 | -0.86±0.19 |
| VoiceCraft | 0.470 | 7.556 | 0.226 | 3.18±0.20 | -1.08±0.15 |
| MaskGCT | 0.717 (0.760) | 2.623 (1.283) | 0.188 | 4.24±0.12 | 0.03±0.14 |
| MaskGCT (gt length) | 0.728 | 2.466 | 0.159 | 4.13±0.17 | 0.12±0.15 |
| | | **SeedTTS test-zh** | | | |
| Ground Truth | 0.750 | 1.254 | | 3.86±0.17 | 0.00 |
| CosyVoice | 0.750 | 4.089 | 0.276 | 3.54±0.12 | -0.45±0.15 |
| XTTS-v2 | 0.635 | 2.876 | 0.413 | 2.95±0.18 | -0.81±0.22 |
| MaskGCT | 0.774 (0.805) | 2.273 (0.843) | 0.106 | 4.09±0.12 | 0.05±0.17 |
| MaskGCT (gt length) | 0.777 | 2.183 | 0.101 | 4.11±0.12 | 0.08±0.18 |

````diff
 ## Citations
 
 If you use MaskGCT in your research, please cite the following paper:
 
 ```bibtex
-@article{wang2024maskgct,
-  title={MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer},
+@inproceedings{wang2024maskgct,
   author={Wang, Yuancheng and Zhan, Haoyue and Liu, Liwei and Zeng, Ruihong and Guo, Haotian and Zheng, Jiachen and Zhang, Qiang and Zhang, Xueyao and Zhang, Shunsi and Wu, Zhizheng},
-  journal={arXiv preprint arXiv:2409.00750},
-  year={2024}
+  title={MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer},
+  booktitle = {{ICLR}},
+  publisher = {OpenReview.net},
+  year = {2025}
 }
 
 @inproceedings{amphion,
````

models/vc/vevo/README.md — 7 additions, 5 deletions

````diff
@@ -85,14 +85,16 @@ Running this will automatically download the pretrained model from HuggingFace a
 If you use Vevo in your research, please cite the following papers:
 
 ```bibtex
-@article{vevo,
-  title={Vevo: Controllable Zero-Shot Voice Imitation with Self-Supervised Disentanglement},
-  journal={OpenReview},
-  year={2024}
+@inproceedings{vevo,
+  author = {Xueyao Zhang and Xiaohui Zhang and Kainan Peng and Zhenyu Tang and Vimal Manohar and Yingru Liu and Jeff Hwang and Dangna Li and Yuhao Wang and Julian Chan and Yuan Huang and Zhizheng Wu and Mingbo Ma},
+  title = {Vevo: Controllable Zero-Shot Voice Imitation with Self-Supervised Disentanglement},
+  booktitle = {{ICLR}},
+  publisher = {OpenReview.net},
+  year = {2025}
 }
 
 @inproceedings{amphion,
-  author={Zhang, Xueyao and Xue, Liumeng and Gu, Yicheng and Wang, Yuancheng and Li, Jiaqi and He, Haorui and Wang, Chaoren and Song, Ting and Chen, Xi and Fang, Zihao and Chen, Haopeng and Zhang, Junan and Tang, Tze Ying and Zou, Lexiao and Wang, Mingxuan and Han, Jun and Chen, Kai and Li, Haizhou and Wu, Zhizheng},
+  author={Xueyao Zhang and Liumeng Xue and Yicheng Gu and Yuancheng Wang and Jiaqi Li and Haorui He and Chaoren Wang and Ting Song and Xi Chen and Zihao Fang and Haopeng Chen and Junan Zhang and Tze Ying Tang and Lexiao Zou and Mingxuan Wang and Jun Han and Kai Chen and Haizhou Li and Zhizheng Wu},
   title={Amphion: An Open-Source Audio, Music and Speech Generation Toolkit},
   booktitle={{IEEE} Spoken Language Technology Workshop, {SLT} 2024},
   year={2024}
````
