Commit fc1bf88

Update Info for MaskGCT and Vevo (#387)
1 parent 04dfe6e commit fc1bf88

File tree

5 files changed: +51 / -52 lines


README.md — 2 additions, 1 deletion

````diff
@@ -34,6 +34,7 @@
 In addition to the specific generation tasks, Amphion includes several **vocoders** and **evaluation metrics**. A vocoder is an important module for producing high-quality audio signals, while evaluation metrics are critical for evaluating generation tasks consistently. Moreover, Amphion is dedicated to advancing audio generation in real-world applications, such as building **large-scale datasets** for speech synthesis.
 
 ## 🚀 News
+- **2025/01/23**: [MaskGCT](https://arxiv.org/abs/2409.00750) and [Vevo](https://openreview.net/pdf?id=anQDiQZhDP) were accepted by ICLR 2025! 🎉
 - **2024/12/22**: We release the reproduction of **Vevo**, a zero-shot voice imitation framework with controllable timbre and style. Vevo can be applied to a range of speech generation tasks, including VC, TTS, AC, and more. The released pre-trained models are trained on the [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) dataset and achieve SOTA zero-shot VC performance. [![arXiv](https://img.shields.io/badge/OpenReview-Paper-COLOR.svg)](https://openreview.net/pdf?id=anQDiQZhDP) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-model-yellow)](https://huggingface.co/amphion/Vevo) [![WebPage](https://img.shields.io/badge/WebPage-Demo-red)](https://versavoice.github.io/) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](models/vc/vevo/README.md)
 - **2024/10/19**: We release **MaskGCT**, a fully non-autoregressive TTS model that eliminates the need for explicit alignment information between text and speech supervision. MaskGCT is trained on the [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) dataset and achieves SOTA zero-shot TTS performance. [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2409.00750) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-model-yellow)](https://huggingface.co/amphion/maskgct) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-demo-pink)](https://huggingface.co/spaces/amphion/maskgct) [![ModelScope](https://img.shields.io/badge/ModelScope-space-purple)](https://modelscope.cn/studios/amphion/maskgct) [![ModelScope](https://img.shields.io/badge/ModelScope-model-cyan)](https://modelscope.cn/models/amphion/MaskGCT) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](models/tts/maskgct/README.md)
 - **2024/09/01**: [Amphion](https://arxiv.org/abs/2312.09911), [Emilia](https://arxiv.org/abs/2407.05361) and [DSFF-SVC](https://arxiv.org/abs/2310.11160) were accepted by IEEE SLT 2024! 🤗
@@ -184,7 +185,7 @@ Amphion is under the [MIT License](LICENSE). It is free for both research and co
 
 ```bibtex
 @inproceedings{amphion,
-  author={Zhang, Xueyao and Xue, Liumeng and Gu, Yicheng and Wang, Yuancheng and Li, Jiaqi and He, Haorui and Wang, Chaoren and Song, Ting and Chen, Xi and Fang, Zihao and Chen, Haopeng and Zhang, Junan and Tang, Tze Ying and Zou, Lexiao and Wang, Mingxuan and Han, Jun and Chen, Kai and Li, Haizhou and Wu, Zhizheng},
+  author={Xueyao Zhang and Liumeng Xue and Yicheng Gu and Yuancheng Wang and Jiaqi Li and Haorui He and Chaoren Wang and Ting Song and Xi Chen and Zihao Fang and Haopeng Chen and Junan Zhang and Tze Ying Tang and Lexiao Zou and Mingxuan Wang and Jun Han and Kai Chen and Haizhou Li and Zhizheng Wu},
   title={Amphion: An Open-Source Audio, Music and Speech Generation Toolkit},
   booktitle={{IEEE} Spoken Language Technology Workshop, {SLT} 2024},
   year={2024}
````

models/tts/debatts/try_inference_small_samples.py — 3 additions, 9 deletions

```diff
@@ -306,12 +306,8 @@ def semantic2acoustic(combine_semantic_code, acoustic_code):
 
 
 device = torch.device("cuda:0")
-cfg_soundstorm_1layer = load_config(
-    "./s2a_egs/s2a_debatts_1layer.json"
-)
-cfg_soundstorm_full = load_config(
-    "./s2a_egs/s2a_debatts_full.json"
-)
+cfg_soundstorm_1layer = load_config("./s2a_egs/s2a_debatts_1layer.json")
+cfg_soundstorm_full = load_config("./s2a_egs/s2a_debatts_full.json")
 
 soundstorm_1layer = build_soundstorm(cfg_soundstorm_1layer, device)
 soundstorm_full = build_soundstorm(cfg_soundstorm_full, device)
@@ -333,9 +329,7 @@ def semantic2acoustic(combine_semantic_code, acoustic_code):
 safetensors.torch.load_model(soundstorm_1layer, soundstorm_1layer_path)
 safetensors.torch.load_model(soundstorm_full, soundstorm_full_path)
 
-t2s_cfg = load_config(
-    "./t2s_egs/t2s_debatts.json"
-)
+t2s_cfg = load_config("./t2s_egs/t2s_debatts.json")
 t2s_model_new = build_t2s_model_new(t2s_cfg, device)
 t2s_model_new_ckpt_path = "./t2s_model/model.safetensors"
 safetensors.torch.load_model(t2s_model_new, t2s_model_new_ckpt_path)
```
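The `load_config` calls consolidated above read JSON experiment configs into attribute-style objects. A minimal stdlib sketch of that pattern (hypothetical; Amphion's real `load_config` helper may support extras such as base-config inheritance that this version does not):

```python
import json
import os
import tempfile
from types import SimpleNamespace


def load_config(path):
    """Read a JSON config file into a nested, attribute-accessible namespace.

    Sketch only: every JSON object becomes a SimpleNamespace, so
    cfg.model.hidden_size works like the dotted access used in the script.
    """
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f, object_hook=lambda d: SimpleNamespace(**d))


# Round-trip a tiny illustrative config (the values below are made up).
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    tmp.write('{"model": {"hidden_size": 1024, "num_layers": 16}, "sample_rate": 24000}')
    cfg_path = tmp.name

cfg = load_config(cfg_path)
print(cfg.model.hidden_size)  # -> 1024
os.unlink(cfg_path)
```

Because `object_hook` runs bottom-up over every JSON object, nested sections of the config become namespaces too, with no schema required.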

models/tts/debatts/utils/g2p_new/cleaners.py — 4 additions, 3 deletions

```diff
@@ -6,10 +6,11 @@
 import re
 from utils.g2p_new.mandarin import chinese_to_ipa
 
+
 def cjekfd_cleaners(text, language, text_tokenizers):
 
-    if language == 'zh':
-        return chinese_to_ipa(text, text_tokenizers['zh'])
+    if language == "zh":
+        return chinese_to_ipa(text, text_tokenizers["zh"])
     else:
-        raise Exception('Unknown or Not supported yet language: %s' % language)
+        raise Exception("Unknown or Not supported yet language: %s" % language)
     return None
```
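`cjekfd_cleaners` dispatches on the language code and raises for anything unsupported. The same pattern extends to more languages with a dispatch table; here is a hypothetical, self-contained sketch (the stub converter and tokenizer below are illustrative stand-ins, not Amphion's real g2p modules):

```python
def make_cleaner(converters):
    """Build a text cleaner that dispatches on a language code.

    `converters` maps a language code to a function(text, tokenizer) -> str.
    Unknown languages raise, mirroring cjekfd_cleaners' behavior.
    """
    def clean(text, language, text_tokenizers):
        try:
            convert = converters[language]
        except KeyError:
            raise ValueError("Unknown or not supported yet language: %s" % language)
        return convert(text, text_tokenizers[language])
    return clean


# Illustrative stand-ins for the real converter and tokenizer.
def fake_chinese_to_ipa(text, tokenizer):
    return "ipa:" + tokenizer(text)

cleaner = make_cleaner({"zh": fake_chinese_to_ipa})
result = cleaner("你好", "zh", {"zh": lambda t: t.lower()})
print(result)  # -> ipa:你好
```

A table keeps each converter's registration in one place, so adding a language is a one-line change rather than another `elif` branch.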

models/tts/maskgct/README.md — 35 additions, 34 deletions

The first hunk (@@ -132,12 +132,12 @@ Running this will automatically download the pretrained model from HuggingFace a) only re-aligns the whitespace of the pretrained-checkpoint table; its content is unchanged:

We provide the following pretrained checkpoints:

| Model Name | Description |
| --- | --- |
| [Semantic Codec](https://huggingface.co/amphion/MaskGCT/tree/main/semantic_codec) | Converting speech to semantic tokens. |
| [Acoustic Codec](https://huggingface.co/amphion/MaskGCT/tree/main/acoustic_codec) | Converting speech to acoustic tokens and reconstructing waveform from acoustic tokens. |
| [MaskGCT-T2S](https://huggingface.co/amphion/MaskGCT/tree/main/t2s_model) | Predicting semantic tokens with text and prompt semantic tokens. |
| [MaskGCT-S2A](https://huggingface.co/amphion/MaskGCT/tree/main/s2a_model) | Predicts acoustic tokens conditioned on semantic tokens. |

You can download all pretrained checkpoints from [HuggingFace](https://huggingface.co/amphion/MaskGCT/tree/main) or use the Hugging Face API.

The second hunk (@@ -165,41 +165,42 @@ We use the [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) data) likewise re-aligns the evaluation-results table without changing its values, and updates the MaskGCT citation from the arXiv preprint to the ICLR 2025 entry.

## Evaluation Results of MaskGCT

| System | SIM-O↑ | WER↓ | FSD↓ | SMOS↑ | CMOS↑ |
| :--- | :---: | :---: | :---: | :---: | :---: |
| | | **LibriSpeech test-clean** | | | |
| Ground Truth | 0.68 | 1.94 | | 4.05±0.12 | 0.00 |
| VALL-E | 0.50 | 5.90 | - | 3.47±0.26 | -0.52±0.22 |
| VoiceBox | 0.64 | 2.03 | 0.762 | 3.80±0.17 | -0.41±0.13 |
| NaturalSpeech 3 | 0.67 | 1.94 | 0.786 | 4.26±0.10 | 0.16±0.14 |
| VoiceCraft | 0.45 | 4.68 | 0.981 | 3.52±0.21 | -0.33±0.16 |
| XTTS-v2 | 0.51 | 4.20 | 0.945 | 3.02±0.22 | -0.98±0.19 |
| MaskGCT | 0.687 (0.723) | 2.634 (1.976) | 0.886 | 4.27±0.14 | 0.10±0.16 |
| MaskGCT (gt length) | 0.697 | 2.012 | 0.746 | 4.33±0.11 | 0.13±0.13 |
| | | **SeedTTS test-en** | | | |
| Ground Truth | 0.730 | 2.143 | | 3.92±0.15 | 0.00 |
| CosyVoice | 0.643 | 4.079 | 0.316 | 3.52±0.17 | -0.41±0.18 |
| XTTS-v2 | 0.463 | 3.248 | 0.484 | 3.15±0.22 | -0.86±0.19 |
| VoiceCraft | 0.470 | 7.556 | 0.226 | 3.18±0.20 | -1.08±0.15 |
| MaskGCT | 0.717 (0.760) | 2.623 (1.283) | 0.188 | 4.24±0.12 | 0.03±0.14 |
| MaskGCT (gt length) | 0.728 | 2.466 | 0.159 | 4.13±0.17 | 0.12±0.15 |
| | | **SeedTTS test-zh** | | | |
| Ground Truth | 0.750 | 1.254 | | 3.86±0.17 | 0.00 |
| CosyVoice | 0.750 | 4.089 | 0.276 | 3.54±0.12 | -0.45±0.15 |
| XTTS-v2 | 0.635 | 2.876 | 0.413 | 2.95±0.18 | -0.81±0.22 |
| MaskGCT | 0.774 (0.805) | 2.273 (0.843) | 0.106 | 4.09±0.12 | 0.05±0.17 |
| MaskGCT (gt length) | 0.777 | 2.183 | 0.101 | 4.11±0.12 | 0.08±0.18 |

````diff
 ## Citations
 
 If you use MaskGCT in your research, please cite the following paper:
 
 ```bibtex
-@article{wang2024maskgct,
-  title={MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer},
+@inproceedings{wang2024maskgct,
   author={Wang, Yuancheng and Zhan, Haoyue and Liu, Liwei and Zeng, Ruihong and Guo, Haotian and Zheng, Jiachen and Zhang, Qiang and Zhang, Xueyao and Zhang, Shunsi and Wu, Zhizheng},
-  journal={arXiv preprint arXiv:2409.00750},
-  year={2024}
+  title={MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer},
+  booktitle = {{ICLR}},
+  publisher = {OpenReview.net},
+  year = {2025}
 }
 
 @inproceedings{amphion,
````

models/vc/vevo/README.md — 7 additions, 5 deletions

````diff
@@ -85,14 +85,16 @@ Running this will automatically download the pretrained model from HuggingFace a
 If you use Vevo in your research, please cite the following papers:
 
 ```bibtex
-@article{vevo,
-  title={Vevo: Controllable Zero-Shot Voice Imitation with Self-Supervised Disentanglement},
-  journal={OpenReview},
-  year={2024}
+@inproceedings{vevo,
+  author = {Xueyao Zhang and Xiaohui Zhang and Kainan Peng and Zhenyu Tang and Vimal Manohar and Yingru Liu and Jeff Hwang and Dangna Li and Yuhao Wang and Julian Chan and Yuan Huang and Zhizheng Wu and Mingbo Ma},
+  title = {Vevo: Controllable Zero-Shot Voice Imitation with Self-Supervised Disentanglement},
+  booktitle = {{ICLR}},
+  publisher = {OpenReview.net},
+  year = {2025}
 }
 
 @inproceedings{amphion,
-  author={Zhang, Xueyao and Xue, Liumeng and Gu, Yicheng and Wang, Yuancheng and Li, Jiaqi and He, Haorui and Wang, Chaoren and Song, Ting and Chen, Xi and Fang, Zihao and Chen, Haopeng and Zhang, Junan and Tang, Tze Ying and Zou, Lexiao and Wang, Mingxuan and Han, Jun and Chen, Kai and Li, Haizhou and Wu, Zhizheng},
+  author={Xueyao Zhang and Liumeng Xue and Yicheng Gu and Yuancheng Wang and Jiaqi Li and Haorui He and Chaoren Wang and Ting Song and Xi Chen and Zihao Fang and Haopeng Chen and Junan Zhang and Tze Ying Tang and Lexiao Zou and Mingxuan Wang and Jun Han and Kai Chen and Haizhou Li and Zhizheng Wu},
   title={Amphion: An Open-Source Audio, Music and Speech Generation Toolkit},
   booktitle={{IEEE} Spoken Language Technology Workshop, {SLT} 2024},
   year={2024}
````
