4 changes: 3 additions & 1 deletion .gitignore
@@ -61,4 +61,6 @@ logs
source_audio
result
conversion_results
get_available_gpu.py
get_available_gpu.py

*.safetensors
3 changes: 2 additions & 1 deletion README.md
@@ -34,7 +34,8 @@
In addition to the specific generation tasks, Amphion includes several **vocoders** and **evaluation metrics**. A vocoder is an important module for producing high-quality audio signals, while evaluation metrics are critical for ensuring consistent assessment of generation tasks. Moreover, Amphion is dedicated to advancing audio generation in real-world applications, such as building **large-scale datasets** for speech synthesis.

## πŸš€Β News
- **2025/02/24**: *The Emilia-Large dataset, featuring over 200,000 hours of data, is now available!!!* Emilia-Large combines the original 101k-hour Emilia dataset (licensed under `CC BY-NC 4.0`) with the brand-new 114k-hour **Emilia-YODAS dataset** (licensed under `CC BY 4.0`). Download at [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Dataset-yellow)](https://huggingface.co/datasets/amphion/Emilia-Dataset). Check details at [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2501.15907).
- **2025/02/26**: We release [***Metis***](https://github.com/open-mmlab/Amphion/tree/main/models/tts/metis), a foundation model for unified speech generation. The system supports zero-shot text-to-speech, voice conversion, target speaker extraction, speech enhancement, and lip-to-speech. [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/pdf/2502.03128) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-model-yellow)](https://huggingface.co/amphion/metis)
- **2025/02/26**: *The Emilia-Large dataset, featuring over 200,000 hours of data, is now available!!!* Emilia-Large combines the original 101k-hour Emilia dataset (licensed under `CC BY-NC 4.0`) with the brand-new 114k-hour **Emilia-YODAS dataset** (licensed under `CC BY 4.0`). Download at [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Dataset-yellow)](https://huggingface.co/datasets/amphion/Emilia-Dataset). Check details at [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2501.15907).
- **2025/01/30**: We release [Amphion v0.2 Technical Report](https://arxiv.org/abs/2501.15442), which provides a comprehensive overview of the Amphion updates in 2024. [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2501.15442)
- **2025/01/23**: [MaskGCT](https://arxiv.org/abs/2409.00750) and [Vevo](https://openreview.net/pdf?id=anQDiQZhDP) got accepted by ICLR 2025! πŸŽ‰
- **2024/12/22**: We release the reproduction of **Vevo**, a zero-shot voice imitation framework with controllable timbre and style. Vevo can be applied to a series of speech generation tasks, including VC, TTS, AC, and more. The released pre-trained models are trained on the [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) dataset and achieve SOTA zero-shot VC performance. [![arXiv](https://img.shields.io/badge/OpenReview-Paper-COLOR.svg)](https://openreview.net/pdf?id=anQDiQZhDP) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-model-yellow)](https://huggingface.co/amphion/Vevo) [![WebPage](https://img.shields.io/badge/WebPage-Demo-red)](https://versavoice.github.io/) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](models/vc/vevo/README.md)
Binary file added imgs/metis/fine-tune.png
Binary file added imgs/metis/pre-train.png
Binary file added imgs/metis/two-stage.png
4 changes: 4 additions & 0 deletions models/tts/maskgct/README.md
@@ -21,6 +21,10 @@ MaskGCT (**Mask**ed **G**enerative **C**odec **T**ransformer) is *a fully non-au

## News

- **2025/02/26**: We release [**Metis**](https://github.com/open-mmlab/Amphion/tree/main/models/tts/metis), an upgraded version of MaskGCT that supports multiple speech generation tasks (text-to-speech, voice conversion, target speaker extraction, speech enhancement, and lip2speech) within a unified framework.

- **2025/01/25**: MaskGCT gets accepted by ICLR 2025.

- **2024/10/19**: We release **MaskGCT**, a fully non-autoregressive TTS model that eliminates the need for explicit alignment information between text and speech supervision. MaskGCT is trained on the [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) dataset and achieves SOTA zero-shot TTS performance.

## Issues
27 changes: 18 additions & 9 deletions models/tts/maskgct/llama_nar.py
@@ -430,10 +430,13 @@ def __init__(
hidden_size=1024,
num_heads=16,
num_layers=16,
use_phone_cond=True,
config=LlamaConfig(0, 256, 1024, 1, 1),
):
super().__init__(config)

self.use_phone_cond = use_phone_cond

self.layers = nn.ModuleList(
[
LlamaNARDecoderLayer(
@@ -458,11 +461,12 @@ def __init__(
nn.Linear(hidden_size * 4, hidden_size),
)

self.cond_mlp = nn.Sequential(
nn.Linear(hidden_size, hidden_size * 4),
nn.SiLU(),
nn.Linear(hidden_size * 4, hidden_size),
)
if self.use_phone_cond:
self.cond_mlp = nn.Sequential(
nn.Linear(hidden_size, hidden_size * 4),
nn.SiLU(),
nn.Linear(hidden_size * 4, hidden_size),
)

for layer in self.layers:
layer.input_layernorm = LlamaAdaptiveRMSNorm(
@@ -535,10 +539,15 @@ def forward(

# retrieve some shape info

phone_embedding = self.cond_mlp(phone_embedding) # (B, T, C)
phone_length = phone_embedding.shape[1]
inputs_embeds = torch.cat([phone_embedding, x], dim=1)
attention_mask = torch.cat([phone_mask, x_mask], dim=1)
if self.use_phone_cond and phone_embedding is not None:
phone_embedding = self.cond_mlp(phone_embedding) # (B, T, C)
phone_length = phone_embedding.shape[1]
inputs_embeds = torch.cat([phone_embedding, x], dim=1)
attention_mask = torch.cat([phone_mask, x_mask], dim=1)
else:
inputs_embeds = x
attention_mask = x_mask
phone_length = 0

# diffusion step embedding
diffusion_step = self.diff_step_embedding(diffusion_step).to(x.device)
12 changes: 11 additions & 1 deletion models/tts/maskgct/maskgct_t2s.py
@@ -41,6 +41,7 @@ def __init__(
cfg_scale=0.2,
cond_codebook_size=8192,
cond_dim=1024,
use_phone_cond=True,
cfg=None,
):
super().__init__()
@@ -73,28 +74,37 @@ def __init__(
cond_dim = (
cfg.cond_dim if cfg is not None and hasattr(cfg, "cond_dim") else cond_dim
)
use_phone_cond = (
cfg.use_phone_cond
if cfg is not None and hasattr(cfg, "use_phone_cond")
else use_phone_cond
)

self.hidden_size = hidden_size
self.num_layers = num_layers
self.num_heads = num_heads
self.cfg_scale = cfg_scale
self.cond_codebook_size = cond_codebook_size
self.cond_dim = cond_dim
self.use_phone_cond = use_phone_cond

self.mask_emb = nn.Embedding(1, self.hidden_size)

self.to_logit = nn.Linear(self.hidden_size, self.cond_codebook_size)

self.cond_emb = nn.Embedding(cond_codebook_size, self.hidden_size)

self.phone_emb = nn.Embedding(1024, hidden_size, padding_idx=1023)
if self.use_phone_cond:
self.phone_emb = nn.Embedding(1024, hidden_size, padding_idx=1023)
torch.nn.init.normal_(self.phone_emb.weight, mean=0.0, std=0.02)

self.reset_parameters()

self.diff_estimator = DiffLlamaPrefix(
hidden_size=hidden_size,
num_heads=num_heads,
num_layers=num_layers,
use_phone_cond=use_phone_cond,
)

def mask_prob(self, t):
240 changes: 240 additions & 0 deletions models/tts/metis/README.md
@@ -0,0 +1,240 @@
# *Metis*: A Foundation Speech Generation Model with Masked Generative Pre-training

[![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/pdf/2502.03128)
[![readme](https://img.shields.io/badge/README-Key%20Features-blue)](../../../models/tts/metis/README.md)
[![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-model-yellow)](https://huggingface.co/amphion/metis)
[![ModelScope](https://img.shields.io/badge/ModelScope-model-cyan)](https://modelscope.cn/models/amphion/metis)

<!-- [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-demo-pink)](https://huggingface.co/spaces/amphion/metis) -->
<!-- [![ModelScope](https://img.shields.io/badge/ModelScope-space-purple)](https://modelscope.cn/studios/amphion/metis) -->

## Overview

We introduce ***Metis***, a foundation model for unified speech generation.
Unlike previous task-specific or multi-task models, Metis follows a pre-training and fine-tuning paradigm. It is pre-trained on large-scale unlabeled speech data using masked generative modeling and then fine-tuned to adapt to diverse speech generation tasks.
Specifically, (1) Metis utilizes two discrete speech representations: SSL tokens derived from speech self-supervised learning (SSL) features, and acoustic tokens directly quantized from waveforms. (2) Metis performs masked generative pre-training on SSL tokens using 300K hours of diverse speech data, without any additional conditions. (3) Through fine-tuning with task-specific conditions, Metis adapts efficiently to various speech generation tasks while supporting multimodal input, even with limited data and trainable parameters.
Experiments demonstrate that Metis can serve as a foundation model for unified speech generation: Metis outperforms state-of-the-art task-specific or multi-task systems
across five speech generation tasks, including zero-shot text-to-speech, voice conversion, target speaker extraction, speech enhancement, and lip-to-speech, even with fewer than 20M trainable parameters or 300 times less training data.
Audio samples are available on the [demo page](https://metis-demo.github.io/).
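
To make the pre-training recipe above concrete, here is a minimal, illustrative sketch of one masked generative training step on SSL tokens. It is a sketch under assumptions, not the released implementation: `model`, `mask_token_id`, and the cosine masking schedule are placeholders inferred from the description above.

```python
import math
import torch
import torch.nn.functional as F

def masked_generative_step(model, ssl_tokens, valid_mask, mask_token_id):
    """One illustrative pre-training step on discrete SSL tokens.

    ssl_tokens: (B, T) LongTensor of SSL token ids
    valid_mask: (B, T) BoolTensor, True for non-padding frames
    model:      any network mapping (tokens, valid_mask) -> (B, T, vocab) logits
    """
    B, T = ssl_tokens.shape
    # Sample a masking ratio per utterance from a cosine schedule
    # (a common choice in masked generative models; assumed here).
    t = torch.rand(B, device=ssl_tokens.device)
    mask_ratio = torch.cos(t * math.pi / 2).clamp(min=1e-3)
    # Randomly mask that fraction of the valid frames.
    mask = (torch.rand(B, T, device=ssl_tokens.device) < mask_ratio[:, None]) & valid_mask
    inputs = ssl_tokens.masked_fill(mask, mask_token_id)
    logits = model(inputs, valid_mask)  # (B, T, vocab)
    # Cross-entropy only on the masked positions: recover the original tokens.
    loss = F.cross_entropy(logits[mask], ssl_tokens[mask])
    return loss
```

At inference time, generation proceeds by iteratively re-predicting masked tokens over a small number of steps (cf. the `n_timesteps` parameter in the example usage below), and task-specific fine-tuning attaches additional conditions to this same objective.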


<div align="center">
<img src="../../../imgs/metis/pre-train.png" width="42%">
<img src="../../../imgs/metis/fine-tune.png" width="48%">
</div>
<div align="center">
<p><i>Pre-training (left) and fine-tuning (right).</i></p>
</div>

## News

- **2025/02/26**: We release ***Metis***, a foundation model for unified speech generation. The system supports zero-shot text-to-speech, voice conversion, target speaker extraction, speech enhancement, and lip-to-speech.


<!-- ## Todo List

- [ ] Add inference code for lip2speech -->


## Model Introduction

Metis is fully compatible with MaskGCT and shares several key model components with it. These shared components are:


| Model Name | Description |
| --------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------- |
| [Semantic Codec](https://huggingface.co/amphion/MaskGCT/tree/main/semantic_codec)  | Converts speech to semantic tokens. |
| [Acoustic Codec](https://huggingface.co/amphion/MaskGCT/tree/main/acoustic_codec)  | Converts speech to acoustic tokens and reconstructs the waveform from acoustic tokens. |
| [Semantic2Acoustic](https://huggingface.co/amphion/MaskGCT/tree/main/s2a_model) | Predicts acoustic tokens conditioned on semantic tokens. |
<!-- | [MaskGCT-T2S](https://huggingface.co/amphion/MaskGCT/tree/main/t2s_model) | Predicting semantic tokens with text and prompt semantic tokens. | -->
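
Taken together, these components form a two-stage generation chain: the source speech (or prompt) is first represented as semantic tokens, Metis generates semantic tokens for the target task, a semantic-to-acoustic model maps them to acoustic tokens, and the acoustic codec decodes the waveform. The sketch below only illustrates this data flow; the object and method names are placeholders, not the actual Amphion API.

```python
def generate_speech(source, semantic_codec, metis_model, s2a_model, acoustic_codec, **task_kwargs):
    """Schematic of the shared MaskGCT/Metis pipeline (placeholder interfaces)."""
    semantic_tokens = semantic_codec.encode(source)                          # speech -> semantic (SSL) tokens
    semantic_tokens = metis_model.generate(semantic_tokens, **task_kwargs)   # task-specific generation
    acoustic_tokens = s2a_model.generate(semantic_tokens)                    # semantic -> acoustic tokens
    return acoustic_codec.decode(acoustic_tokens)                            # acoustic tokens -> waveform
```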

We open-source the pretrained model checkpoint of the first stage of Metis (with masked generative pre-training), as well as the fine-tuned models for speech enhancement (SE), target speaker extraction (TSE), voice conversion (VC), lip-to-speech (L2S), and the unified multi-task (Omni) model.

For zero-shot text-to-speech, you can download the text2semantic model from MaskGCT, which is compatible with the Metis framework.

| Model Name | Description |
| --- | --- |
| [Metis-Base](https://huggingface.co/amphion/metis/tree/main/metis_base) | The base model pre-trained with masked generative pre-training. |
| [Metis-TSE](https://huggingface.co/amphion/metis/tree/main/metis_tse) | Fine-tuned model for target speaker extraction. Available in both full-scale and LoRA ($r = 32$) versions. |
| [Metis-VC](https://huggingface.co/amphion/metis/tree/main/metis_vc) | Fine-tuned model for voice conversion. Available in full-scale version. |
| [Metis-SE](https://huggingface.co/amphion/metis/tree/main/metis_se) | Fine-tuned model for speech enhancement. Available in both full-scale and LoRA ($r = 32$) versions. |
| [Metis-L2S](https://huggingface.co/amphion/metis/tree/main/metis_l2s) | Fine-tuned model for lip-to-speech. Available in full-scale version. |
| [Metis-TTS](https://huggingface.co/amphion/MaskGCT/tree/main/t2s_model) | Zero-shot text-to-speech model (the same as the first stage of MaskGCT). |
| [Metis-Omni](https://huggingface.co/amphion/metis/tree/main/metis_omni) | Unified multi-task model supporting zero-shot TTS, VC, TSE, and SE. |
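
Several of the checkpoints above are released as LoRA ($r = 32$) variants: the base Metis weights stay frozen, and only small low-rank adapter matrices are trained and distributed. The snippet below is a generic illustration of that idea, assuming standard LoRA; it is not the loading code used by the inference scripts shown later.

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    """Generic LoRA wrapper: y = W x + (alpha / r) * B(A(x)), with W frozen."""

    def __init__(self, base: nn.Linear, r: int = 32, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                  # freeze base weights
        self.lora_a = nn.Linear(base.in_features, r, bias=False)     # A: d_in -> r
        self.lora_b = nn.Linear(r, base.out_features, bias=False)    # B: r -> d_out
        nn.init.zeros_(self.lora_b.weight)                           # start as a no-op update
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```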


## Usage

To run this model, you need to follow the steps below:

1. Clone the repository and install the environment.
2. Run the inference script.

### Clone and Environment

#### 1. Clone the repository

```bash
git clone https://github.com/open-mmlab/Amphion.git
cd Amphion
```
#### 2. Install the environment

Before you start installing, make sure you are in the `Amphion` directory. If not, use `cd` to enter it.

Since we use `phonemizer` to convert text to phonemes, you need to install `espeak-ng` first. More details can be found [here](https://bootphon.github.io/phonemizer/install.html). Choose the installation command that matches your operating system:

```bash
# For Debian-like distribution (e.g. Ubuntu, Mint, etc.)
sudo apt-get install espeak-ng
# For RedHat-like distribution (e.g. CentOS, Fedora, etc.)
sudo yum install espeak-ng

# For Windows
# Please visit https://github.com/espeak-ng/espeak-ng/releases to download .msi installer
```

**The environment used for Metis is the same as the one used for MaskGCT.**

Now, install the environment. We recommend using conda:

```bash
conda create -n maskgct python=3.10
conda activate maskgct

pip install -r models/tts/maskgct/requirements.txt
```
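
If you want to verify that `espeak-ng` and `phonemizer` are wired up correctly before running inference, a quick manual sanity check (not part of the repository) is:

```python
# Prints a phoneme string if espeak-ng is correctly installed and visible to phonemizer.
from phonemizer import phonemize

print(phonemize("Amphion generates speech.", language="en-us", backend="espeak"))
```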

### Inference

#### 1. Inference Script

```bash
# Metis TSE
python -m models.tts.metis.metis_infer_tse

# Metis SE
python -m models.tts.metis.metis_infer_se

# Metis VC
python -m models.tts.metis.metis_infer_vc

# Metis Lip2Speech
python -m models.tts.metis.metis_infer_l2s
```

For zero-shot TTS, you can use the MaskGCT inference script in the same way:

```bash
# Metis TTS (MaskGCT)
python -m models.tts.maskgct.maskgct_infer_tts
```

You can also use a single model (Metis-Omni) for TTS, VC, TSE, and SE inference.

```bash
# Metis Omni
python -m models.tts.metis.metis_infer_omni
```

Running any of these scripts will automatically download the pretrained models from HuggingFace and start inference. We provide example audio files for inference; please see the scripts for more details and parameter configurations.


#### 2. Example Usage

Taking Metis-TSE as an example, the inference script first downloads the model checkpoints:

```python
# download base model, lora weights, and adapter weights
base_ckpt_dir = snapshot_download(
"amphion/metis",
repo_type="model",
local_dir="./models/tts/metis/ckpt",
allow_patterns=["metis_base/model.safetensors"],
)
lora_ckpt_dir = snapshot_download(
"amphion/metis",
repo_type="model",
local_dir="./models/tts/metis/ckpt",
allow_patterns=["metis_tse/metis_tse_lora_32.safetensors"],
)
adapter_ckpt_dir = snapshot_download(
"amphion/metis",
repo_type="model",
local_dir="./models/tts/metis/ckpt",
allow_patterns=["metis_tse/metis_tse_lora_32_adapter.safetensors"],
)
```

Then, the script loads the model checkpoints and initializes the fine-tuned Metis model:

```python
base_ckpt_path = os.path.join(base_ckpt_dir, "metis_base/model.safetensors")
lora_ckpt_path = os.path.join(
lora_ckpt_dir, "metis_tse/metis_tse_lora_32.safetensors"
)
adapter_ckpt_path = os.path.join(
adapter_ckpt_dir, "metis_tse/metis_tse_lora_32_adapter.safetensors"
)

metis = Metis(
base_ckpt_path=base_ckpt_path,
lora_ckpt_path=lora_ckpt_path,
adapter_ckpt_path=adapter_ckpt_path,
cfg=metis_cfg,
device=device,
model_type="tse",
)
```

Finally, the script generates the speech and saves it to `models/tts/metis/wav/tse/gen.wav`; you can change this path in the script.

```python
prompt_speech_path = "./models/tts/metis/wav/tse/prompt.wav"
source_speech_path = "./models/tts/metis/wav/tse/mix.wav"

n_timesteps = 10  # number of iterative generation steps
cfg = 0.0  # classifier-free guidance scale

gen_speech = metis(
prompt_speech_path=prompt_speech_path,
source_speech_path=source_speech_path,
cfg=cfg,
n_timesteps=n_timesteps,
model_type="tse",
)

sf.write("./models/tts/metis/wav/tse/gen.wav", gen_speech, 24000)
```

## Citations

If you use Metis in your research, please cite the following papers:

```bibtex
@article{wang2025metis,
title={Metis: A Foundation Speech Generation Model with Masked Generative Pre-training},
author={Wang, Yuancheng and Zheng, Jiachen and Zhang, Junan and Zhang, Xueyao and Liao, Huan and Wu, Zhizheng},
journal={arXiv preprint arXiv:2502.03128},
year={2025}
}
@inproceedings{wang2024maskgct,
author={Wang, Yuancheng and Zhan, Haoyue and Liu, Liwei and Zeng, Ruihong and Guo, Haotian and Zheng, Jiachen and Zhang, Qiang and Zhang, Xueyao and Zhang, Shunsi and Wu, Zhizheng},
title={MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer},
booktitle = {{ICLR}},
publisher = {OpenReview.net},
year = {2025}
}
@article{amphion_v0.2,
title = {Overview of the Amphion Toolkit (v0.2)},
author = {Jiaqi Li and Xueyao Zhang and Yuancheng Wang and Haorui He and Chaoren Wang and Li Wang and Huan Liao and Junyi Ao and Zeyu Xie and Yiqiao Huang and Junan Zhang and Zhizheng Wu},
year = {2025},
journal = {arXiv preprint arXiv:2501.15442},
}
@inproceedings{amphion,
author={Zhang, Xueyao and Xue, Liumeng and Gu, Yicheng and Wang, Yuancheng and Li, Jiaqi and He, Haorui and Wang, Chaoren and Song, Ting and Chen, Xi and Fang, Zihao and Chen, Haopeng and Zhang, Junan and Tang, Tze Ying and Zou, Lexiao and Wang, Mingxuan and Han, Jun and Chen, Kai and Li, Haizhou and Wu, Zhizheng},
title={Amphion: An Open-Source Audio, Music and Speech Generation Toolkit},
booktitle={{IEEE} Spoken Language Technology Workshop, {SLT} 2024},
year={2024}
}
```