README.md: 7 additions & 2 deletions
@@ -34,6 +34,7 @@
 In addition to the specific generation tasks, Amphion includes several **vocoders** and **evaluation metrics**. A vocoder is an important module for producing high-quality audio signals, while evaluation metrics are critical for assessing generation quality consistently across tasks. Moreover, Amphion is dedicated to advancing audio generation in real-world applications, such as building **large-scale datasets** for speech synthesis.
 
 ## 🚀 News
+- **2025/05/26**: We release [***DualCodec***](models/codec/dualcodec/README.md), a low-frame-rate (12.5 Hz or 25 Hz), semantically enhanced (with SSL features) neural audio codec designed to extract discrete tokens for efficient speech generation. [arXiv](http://arxiv.org/abs/2505.13000) [Colab](https://colab.research.google.com/drive/1VvUhsDffLdY5TdNuaqlLnYzIoXhvI8MK#scrollTo=Lsos3BK4J-4E) [Demo](https://dualcodec.github.io/) [README](models/codec/dualcodec/README.md)
 - **2025/04/12**: We release [***Vevo1.5***](models/svc/vevosing/README.md), which extends Vevo and focuses on unified and controllable generation for both speech and singing voice. Vevo1.5 can be applied to a series of speech and singing voice generation tasks, including VC, TTS, AC, SVS, SVC, speech/singing voice editing, singing style conversion, and more. [Blog](https://veiled-army-9c5.notion.site/Vevo1-5-1d2ce17b49a280b5b444d3fa2300c93a)
 - **2025/02/26**: We release [***Metis***](https://github.com/open-mmlab/Amphion/tree/main/models/tts/metis), a foundation model for unified speech generation. The system supports zero-shot text-to-speech, voice conversion, target speaker extraction, speech enhancement, and lip-to-speech. [arXiv](https://arxiv.org/pdf/2502.03128) [HuggingFace](https://huggingface.co/amphion/metis)
 - **2025/02/26**: *The Emilia-Large dataset, featuring over 200,000 hours of data, is now available!* Emilia-Large combines the original 101k-hour Emilia dataset (licensed under `CC BY-NC 4.0`) with the brand-new 114k-hour **Emilia-YODAS** dataset (licensed under `CC BY 4.0`). Download it from [HuggingFace](https://huggingface.co/datasets/amphion/Emilia-Dataset); details are in the [paper](https://arxiv.org/abs/2501.15907).
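
Why the low frame rate in the DualCodec entry above matters: the number of discrete tokens a codec language model must generate grows linearly with the codec's frame rate and its number of quantizers, so fewer frames per second means proportionally fewer autoregressive steps. A back-of-the-envelope sketch in Python (the 75 Hz, 8-quantizer comparison point is an illustrative assumption about a conventional codec, not a published DualCodec baseline):

```python
# Rough token-budget arithmetic for codec-based speech generation.
# The 12.5 Hz / 25 Hz frame rates come from the DualCodec entry above;
# the 75 Hz, 8-quantizer comparison point is an illustrative assumption.

def tokens_per_utterance(frame_rate_hz: float, num_quantizers: int, seconds: float) -> int:
    """Total discrete tokens a codec emits for an utterance."""
    return int(frame_rate_hz * num_quantizers * seconds)

for name, fps, nq in [
    ("12.5 Hz, 1 quantizer", 12.5, 1),
    ("25 Hz, 1 quantizer", 25.0, 1),
    ("75 Hz, 8 quantizers", 75.0, 8),
]:
    print(f"{name}: {tokens_per_utterance(fps, nq, 10.0)} tokens for 10 s of speech")

# 12.5 Hz -> 125 tokens, 25 Hz -> 250, 75 Hz x 8 -> 6000: shorter token
# sequences mean fewer autoregressive steps, hence faster generation.
```
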
@@ -46,7 +47,6 @@
 - **2024/08/27**: *The Emilia dataset is now publicly available!* Discover the most extensive and diverse speech generation dataset, with 101k hours of in-the-wild speech data, at [HuggingFace](https://huggingface.co/datasets/amphion/Emilia-Dataset) or [OpenDataLab](https://opendatalab.com/Amphion/Emilia)! 👑👑👑
 - **2024/07/01**: Amphion now releases **Emilia**, the first open-source multilingual in-the-wild dataset for speech generation, with over 101k hours of speech data, and **Emilia-Pipe**, the first open-source preprocessing pipeline designed to transform in-the-wild speech data into high-quality, annotated training data for speech generation! [arXiv](https://arxiv.org/abs/2407.05361) [HuggingFace](https://huggingface.co/datasets/amphion/Emilia) [Demo](https://emilia-dataset.github.io/Emilia-Demo-Page/) [README](preprocessors/Emilia/README.md)
-- **2024/06/17**: Amphion has a new release of its **VALL-E** model! It uses Llama as its underlying architecture and offers better model performance, faster training, and more readable code than our first version. [README](egs/tts/VALLE_V2/README.md)
 - **2024/03/12**: Amphion now supports **NaturalSpeech3 FACodec** and releases pretrained checkpoints. [arXiv](https://arxiv.org/abs/2403.03100) [HuggingFace](https://huggingface.co/amphion/naturalspeech3_facodec) [Demo](https://huggingface.co/spaces/amphion/naturalspeech3_facodec) [README](models/codec/ns3_codec/README.md)
 - **2024/02/22**: The first Amphion visualization tool, **SingVisio**, is released. [arXiv](https://arxiv.org/abs/2402.12660) [OpenXLab](https://openxlab.org.cn/apps/detail/Amphion/SingVisio) [Video](https://drive.google.com/file/d/15097SGhQh-SwUNbdWDYNyWEP--YGLba5/view) [README](egs/visualization/SingVisio/README.md)
@@ -59,11 +59,12 @@
 Amphion achieves state-of-the-art performance compared to existing open-source repositories on text-to-speech (TTS) systems. It supports the following models or architectures:
 
 - [FastSpeech2](https://arxiv.org/abs/2006.04558): A non-autoregressive TTS architecture that utilizes feed-forward Transformer blocks. [README](egs/tts/FastSpeech2/README.md)
 - [VITS](https://arxiv.org/abs/2106.06103): An end-to-end TTS architecture that utilizes a conditional variational autoencoder with adversarial learning. [README](egs/tts/VITS/README.md)
-- [VALL-E](https://arxiv.org/abs/2301.02111): A zero-shot TTS architecture that uses a neural codec language model with discrete codes. [README](egs/tts/VALLE_V2/README.md)
+- [VALL-E](https://arxiv.org/abs/2301.02111): A zero-shot TTS architecture that uses a neural codec language model with discrete codes. [README](egs/tts/VALLE/README.md)
 - [NaturalSpeech2](https://arxiv.org/abs/2304.09116): An architecture for TTS that utilizes a latent diffusion model to generate natural-sounding voices. [README](egs/tts/NaturalSpeech2/README.md)
 - [Jets](Jets): An end-to-end TTS model that jointly trains FastSpeech2 and HiFi-GAN with an alignment module. [README](egs/tts/Jets/README.md)
 - [MaskGCT](https://arxiv.org/abs/2409.00750): A fully non-autoregressive TTS architecture that eliminates the need for explicit alignment information between text and speech supervision. [README](models/tts/maskgct/README.md)
 - [Vevo-TTS](https://openreview.net/pdf?id=anQDiQZhDP): A zero-shot TTS architecture with controllable timbre and style. It consists of an autoregressive transformer and a flow-matching transformer. [README](models/vc/vevo/README.md)
+- [DualCodec-VALLE](models/codec/dualcodec/README.md): A VALL-E model trained on 12.5 Hz DualCodec tokens for super-fast generation.
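
The VALL-E and DualCodec-VALLE entries above treat TTS as language modeling over a codec's discrete codes: an autoregressive transformer, conditioned on text tokens, emits acoustic tokens that a codec decoder converts back into a waveform. A minimal decoding-loop sketch; `lm` and `codec_decode` are hypothetical stand-ins for illustration, not Amphion's actual interfaces:

```python
import torch

# Hypothetical interfaces -- illustrative stand-ins, not Amphion's API.
# `lm` maps a prefix of [text tokens ; acoustic tokens] to next-token logits;
# `codec_decode` maps discrete acoustic tokens back to a waveform.

@torch.no_grad()
def generate_speech(lm, codec_decode, text_ids: torch.Tensor,
                    eos_id: int, max_frames: int = 1500,
                    temperature: float = 0.9) -> torch.Tensor:
    """Autoregressively sample acoustic tokens conditioned on text, then decode audio."""
    acoustic = torch.empty(0, dtype=torch.long)
    for _ in range(max_frames):
        logits = lm(torch.cat([text_ids, acoustic]))[-1]  # next-token distribution
        probs = torch.softmax(logits / temperature, dim=-1)
        next_tok = torch.multinomial(probs, 1)
        if next_tok.item() == eos_id:                     # model decided to stop
            break
        acoustic = torch.cat([acoustic, next_tok])
    return codec_decode(acoustic)                         # tokens -> waveform
```

The loop length is exactly the token count discussed earlier, which is why a 12.5 Hz codec makes this style of generation markedly faster.
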
 ### VC: Voice Conversion
@@ -73,6 +74,10 @@
 Amphion supports the following voice conversion models:
 - [FACodec](https://arxiv.org/abs/2403.03100): FACodec decomposes speech into subspaces representing different attributes like content, prosody, and timbre. It can achieve zero-shot voice conversion. [HuggingFace](https://huggingface.co/amphion/naturalspeech3_facodec)
 - [Noro](https://arxiv.org/abs/2411.19770): A **noise-robust** zero-shot voice conversion system. Noro introduces innovative components tailored for VC using noisy reference speech, including a dual-branch reference encoding module and a noise-agnostic contrastive speaker loss. [README](egs/vc/Noro/README.md)
 
+## Neural Audio Codec
+- [DualCodec](models/codec/dualcodec/README.md): A low-frame-rate (12.5 Hz or 25 Hz), semantically enhanced (with SSL features) neural audio codec designed to extract discrete tokens for efficient speech generation. [arXiv](http://arxiv.org/abs/2505.13000) [Colab](https://colab.research.google.com/drive/1VvUhsDffLdY5TdNuaqlLnYzIoXhvI8MK#scrollTo=Lsos3BK4J-4E) [Demo](https://dualcodec.github.io/) [README](models/codec/dualcodec/README.md)
+- [FACodec](https://arxiv.org/abs/2403.03100): FACodec decomposes speech into subspaces representing different attributes like content, prosody, and timbre. [HuggingFace](https://huggingface.co/amphion/naturalspeech3_facodec)
+
 ### AC: Accent Conversion
 
 - Amphion supports AC with [Vevo-Style](models/vc/vevo/README.md). In particular, it can perform accent conversion in a zero-shot manner. [README](models/vc/vevo/README.md)
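
Both FACodec entries above describe decomposing speech into content, prosody, and timbre subspaces; zero-shot voice conversion then reduces to re-synthesizing the source's content and prosody with a reference speaker's timbre. A conceptual sketch, where `encode` and `decode` are hypothetical handles rather than FACodec's real API:

```python
# Conceptual zero-shot VC via attribute decomposition, in the spirit of the
# FACodec description above. `encode` / `decode` are hypothetical handles,
# not the real FACodec interface.

def convert_voice(encode, decode, source_wav, reference_wav):
    """Keep the source's content and prosody; borrow the reference's timbre."""
    src = encode(source_wav)      # -> {"content": ..., "prosody": ..., "timbre": ...}
    ref = encode(reference_wav)
    return decode(content=src["content"],
                  prosody=src["prosody"],
                  timbre=ref["timbre"])  # the swapped attribute drives the conversion
```
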
egs/tts/README.md: 0 additions & 5 deletions
@@ -1,16 +1,11 @@
 # Amphion Text-to-Speech (TTS) Recipe
 
-## Quick Start
-
-We provide a **[beginner recipe](VALLE_V2/)** to demonstrate how to train a cutting-edge TTS model. Specifically, it is Amphion's re-implementation of [VALL-E](https://arxiv.org/abs/2301.02111), a zero-shot TTS architecture that uses a neural codec language model with discrete codes.
-
 ## Supported Model Architectures
 
 So far, Amphion TTS supports the following models or architectures:
 - **[FastSpeech2](FastSpeech2)**: A non-autoregressive TTS architecture that utilizes feed-forward Transformer blocks.
 - **[VITS](VITS)**: An end-to-end TTS architecture that utilizes a conditional variational autoencoder with adversarial learning.
-- **[VALL-E](VALLE_V2)**: A zero-shot TTS architecture that uses a neural codec language model with discrete codes. This model is our updated VALL-E implementation as of June 2024, which uses Llama as its underlying architecture. The previous VALL-E release can be found [here](VALLE).
 - **[NaturalSpeech2](NaturalSpeech2)** (👨💻 developing): An architecture for TTS that utilizes a latent diffusion model to generate natural-sounding voices.
 - **[Jets](Jets)**: An end-to-end TTS model that jointly trains FastSpeech2 and HiFi-GAN with an alignment module.
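
Both lists above include FastSpeech2-style non-autoregressive models, whose core mechanism is a length regulator: each phoneme's hidden state is repeated according to a predicted duration so the decoder can generate all frames in parallel rather than token by token. A minimal sketch of that operation (names and shapes are illustrative):

```python
import torch

def length_regulate(phoneme_hidden: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Expand [num_phonemes, dim] states to [num_frames, dim] by repeating each
    phoneme's vector durations[i] times -- the FastSpeech2-style length regulator."""
    return torch.repeat_interleave(phoneme_hidden, durations, dim=0)

# Example: 3 phonemes, predicted to last 2, 4, and 3 frames respectively.
hidden = torch.randn(3, 256)
frames = length_regulate(hidden, torch.tensor([2, 4, 3]))
print(frames.shape)  # torch.Size([9, 256])
```
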