README.md (+2, -1)
@@ -34,6 +34,7 @@
In addition to the specific generation tasks, Amphion includes several **vocoders** and **evaluation metrics**. A vocoder is an important module for producing high-quality audio signals, while evaluation metrics are critical for consistent and comparable assessment of generation quality. Moreover, Amphion is dedicated to advancing audio generation in real-world applications, such as building **large-scale datasets** for speech synthesis.

## 🚀 News
+- **2025/01/23**: [MaskGCT](https://arxiv.org/abs/2409.00750) and [Vevo](https://openreview.net/pdf?id=anQDiQZhDP) were accepted by ICLR 2025! 🎉
- **2024/12/22**: We release the reproduction of **Vevo**, a zero-shot voice imitation framework with controllable timbre and style. Vevo can be applied to a range of speech generation tasks, including VC, TTS, AC, and more. The released pre-trained models are trained on the [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) dataset and achieve SOTA zero-shot VC performance. [Paper](https://openreview.net/pdf?id=anQDiQZhDP) · [Model](https://huggingface.co/amphion/Vevo) · [Demo](https://versavoice.github.io/) · [Readme](models/vc/vevo/README.md)
- **2024/10/19**: We release **MaskGCT**, a fully non-autoregressive TTS model that eliminates the need for explicit alignment information between text and speech. MaskGCT is trained on the [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) dataset and achieves SOTA zero-shot TTS performance. [Paper](https://arxiv.org/abs/2409.00750) · [Model](https://huggingface.co/amphion/maskgct) · [Demo](https://huggingface.co/spaces/amphion/maskgct) · [ModelScope Demo](https://modelscope.cn/studios/amphion/maskgct) · [ModelScope Model](https://modelscope.cn/models/amphion/MaskGCT) · [Readme](models/tts/maskgct/README.md)
- **2024/09/01**: [Amphion](https://arxiv.org/abs/2312.09911), [Emilia](https://arxiv.org/abs/2407.05361) and [DSFF-SVC](https://arxiv.org/abs/2310.11160) were accepted by IEEE SLT 2024! 🤗
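For readers unfamiliar with the masked generative approach mentioned above, here is a toy, self-contained sketch of the iterative mask-predict loop that non-autoregressive models in the MaskGCT family build on. The model below is a random stand-in, not Amphion's actual T2S network; the mask sentinel, vocabulary size, step count, and cosine schedule are all illustrative choices:

```python
import math
import torch

MASK_ID = -1  # illustrative sentinel outside the toy vocabulary

def mask_predict(model, length, steps=8):
    # Start from an all-masked sequence; at each step, commit the most
    # confident predictions and re-mask the rest on a cosine schedule.
    tokens = torch.full((length,), MASK_ID, dtype=torch.long)
    for step in range(1, steps + 1):
        logits = model(tokens)                        # (length, vocab) logits
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        committed = tokens != MASK_ID
        pred[committed] = tokens[committed]           # keep earlier commitments
        conf[committed] = float("inf")                # never re-mask committed slots
        n_mask = int(length * math.cos(math.pi / 2 * step / steps))
        tokens = pred
        if n_mask > 0:                                # re-mask least confident slots
            tokens[conf.topk(n_mask, largest=False).indices] = MASK_ID
    return tokens

# Stand-in "model" returning random logits; a real T2S model would condition
# on text and prompt semantic tokens. Vocabulary size 1024 is arbitrary.
dummy_model = lambda toks: torch.randn(toks.shape[0], 1024)
print(mask_predict(dummy_model, length=16))
```

Because every position is predicted in parallel and refined over a fixed number of steps, no autoregressive left-to-right order, and hence no explicit text-speech alignment, is required.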
@@ -184,7 +185,7 @@ Amphion is under the [MIT License](LICENSE). It is free for both research and commercial use.
```bibtex
@inproceedings{amphion,
-  author={Zhang, Xueyao and Xue, Liumeng and Gu, Yicheng and Wang, Yuancheng and Li, Jiaqi and He, Haorui and Wang, Chaoren and Song, Ting and Chen, Xi and Fang, Zihao and Chen, Haopeng and Zhang, Junan and Tang, Tze Ying and Zou, Lexiao and Wang, Mingxuan and Han, Jun and Chen, Kai and Li, Haizhou and Wu, Zhizheng},
+  author={Xueyao Zhang and Liumeng Xue and Yicheng Gu and Yuancheng Wang and Jiaqi Li and Haorui He and Chaoren Wang and Ting Song and Xi Chen and Zihao Fang and Haopeng Chen and Junan Zhang and Tze Ying Tang and Lexiao Zou and Mingxuan Wang and Jun Han and Kai Chen and Haizhou Li and Zhizheng Wu},
  title={Amphion: An Open-Source Audio, Music and Speech Generation Toolkit},
  booktitle={{IEEE} Spoken Language Technology Workshop, {SLT} 2024},
models/tts/maskgct/README.md (+35, -34)
@@ -132,12 +132,12 @@ Running this will automatically download the pretrained model from HuggingFace
We provide the following pretrained checkpoints:
-| Model Name | Description |
-|-------------------|-------------|
-|[Semantic Codec](https://huggingface.co/amphion/MaskGCT/tree/main/semantic_codec)| Converting speech to semantic tokens. |
-|[Acoustic Codec](https://huggingface.co/amphion/MaskGCT/tree/main/acoustic_codec)| Converting speech to acoustic tokens and reconstructing waveform from acoustic tokens. |
-|[MaskGCT-T2S](https://huggingface.co/amphion/MaskGCT/tree/main/t2s_model)| Predicting semantic tokens with text and prompt semantic tokens. |
-|[MaskGCT-S2A](https://huggingface.co/amphion/MaskGCT/tree/main/s2a_model)| Predicts acoustic tokens conditioned on semantic tokens. |
+| Model Name | Description |
+|------------|-------------|
+| [Semantic Codec](https://huggingface.co/amphion/MaskGCT/tree/main/semantic_codec) | Converts speech to semantic tokens. |
+| [Acoustic Codec](https://huggingface.co/amphion/MaskGCT/tree/main/acoustic_codec) | Converts speech to acoustic tokens and reconstructs the waveform from acoustic tokens. |
+| [MaskGCT-T2S](https://huggingface.co/amphion/MaskGCT/tree/main/t2s_model) | Predicts semantic tokens from text and prompt semantic tokens. |
+| [MaskGCT-S2A](https://huggingface.co/amphion/MaskGCT/tree/main/s2a_model) | Predicts acoustic tokens conditioned on semantic tokens. |
You can download all pretrained checkpoints from [HuggingFace](https://huggingface.co/amphion/MaskGCT/tree/main) or fetch them programmatically with the Hugging Face Hub API, as sketched below.
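A minimal sketch of the API route, assuming the `huggingface_hub` package is installed; the subdirectory names follow the checkpoint table above:

```python
from huggingface_hub import snapshot_download

# Download every pretrained checkpoint in the MaskGCT repo (several GB).
ckpt_dir = snapshot_download(repo_id="amphion/MaskGCT")

# Or fetch a single component, e.g. only the semantic codec weights.
codec_dir = snapshot_download(
    repo_id="amphion/MaskGCT",
    allow_patterns=["semantic_codec/*"],
)
print(ckpt_dir, codec_dir)
```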
@@ -165,41 +165,42 @@ We use the [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) dataset
If you use MaskGCT in your research, please cite the following paper:

```bibtex
-@article{wang2024maskgct,
-  title={MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer},
+@inproceedings{wang2024maskgct,
  author={Wang, Yuancheng and Zhan, Haoyue and Liu, Liwei and Zeng, Ruihong and Guo, Haotian and Zheng, Jiachen and Zhang, Qiang and Zhang, Xueyao and Zhang, Shunsi and Wu, Zhizheng},
-  journal={arXiv preprint arXiv:2409.00750},
-  year={2024}
+  title={MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer},
models/vc/vevo/README.md (+7, -5)
@@ -85,14 +85,16 @@ Running this will automatically download the pretrained model from HuggingFace
If you use Vevo in your research, please cite the following papers:

```bibtex
-@article{vevo,
-  title={Vevo: Controllable Zero-Shot Voice Imitation with Self-Supervised Disentanglement},
-  journal={OpenReview},
-  year={2024}
+@inproceedings{vevo,
+  author={Xueyao Zhang and Xiaohui Zhang and Kainan Peng and Zhenyu Tang and Vimal Manohar and Yingru Liu and Jeff Hwang and Dangna Li and Yuhao Wang and Julian Chan and Yuan Huang and Zhizheng Wu and Mingbo Ma},
+  title={Vevo: Controllable Zero-Shot Voice Imitation with Self-Supervised Disentanglement},
+  booktitle={{ICLR}},
+  publisher={OpenReview.net},
+  year={2025}
}

@inproceedings{amphion,
-  author={Zhang, Xueyao and Xue, Liumeng and Gu, Yicheng and Wang, Yuancheng and Li, Jiaqi and He, Haorui and Wang, Chaoren and Song, Ting and Chen, Xi and Fang, Zihao and Chen, Haopeng and Zhang, Junan and Tang, Tze Ying and Zou, Lexiao and Wang, Mingxuan and Han, Jun and Chen, Kai and Li, Haizhou and Wu, Zhizheng},
+  author={Xueyao Zhang and Liumeng Xue and Yicheng Gu and Yuancheng Wang and Jiaqi Li and Haorui He and Chaoren Wang and Ting Song and Xi Chen and Zihao Fang and Haopeng Chen and Junan Zhang and Tze Ying Tang and Lexiao Zou and Mingxuan Wang and Jun Han and Kai Chen and Haizhou Li and Zhizheng Wu},
  title={Amphion: An Open-Source Audio, Music and Speech Generation Toolkit},
  booktitle={{IEEE} Spoken Language Technology Workshop, {SLT} 2024},