Official repository for RESOUND, which reconstructs intelligible, expressive speech from silent talking-face videos via acoustic–semantic decomposed modeling.
If you find this useful, please star 🌟 the repo and cite 📑:
```bibtex
@inproceedings{resound2025,
  title     = {RESOUND: Speech Reconstruction from Silent Videos via Acoustic–Semantic Decomposed Modeling},
  author    = {Pham, Long-Khanh and Tran, Thanh V. T. and Pham, Minh-Tan and Nguyen, Van},
  booktitle = {Interspeech 2025},
  year      = {2025},
  url       = {https://arxiv.org/abs/2505.22024v1}
}
```
RESOUND separates an acoustic path (prosody and timbre, captured from a short speaker prompt) from a semantic path (linguistic content, derived from visual cues), then predicts mel-spectrograms and discrete speech units that a vocoder converts to a waveform. This disentanglement improves both naturalness and intelligibility.
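Conceptually, the decomposition amounts to two parallel encoders whose outputs are fused before decoding. The sketch below only illustrates that idea and is not the repository's actual model code; every module name, input size, and fusion choice here is a hypothetical stand-in for the design described in the paper.

```python
import torch
import torch.nn as nn

class DecomposedSpeechModel(nn.Module):
    """Conceptual sketch of acoustic-semantic decomposed modeling.

    All module names and dimensions are hypothetical illustrations.
    """
    def __init__(self, dim=256, n_mels=80, n_units=200, visual_dim=512):
        super().__init__()
        # Acoustic path: a speaker embedding from a short audio prompt.
        self.acoustic_enc = nn.GRU(n_mels, dim, batch_first=True)
        # Semantic path: linguistic content from per-frame visual features.
        self.semantic_enc = nn.GRU(visual_dim, dim, batch_first=True)
        # Two decoder heads: mel-spectrogram frames and discrete-unit logits.
        self.mel_head = nn.Linear(2 * dim, n_mels)
        self.unit_head = nn.Linear(2 * dim, n_units)

    def forward(self, prompt_mel, visual_feats):
        # Summarize the speaker prompt into a single acoustic embedding.
        _, acoustic = self.acoustic_enc(prompt_mel)        # (1, B, dim)
        semantic, _ = self.semantic_enc(visual_feats)      # (B, T, dim)
        # Broadcast the acoustic embedding across the semantic time axis.
        acoustic = acoustic.transpose(0, 1).expand(-1, semantic.size(1), -1)
        fused = torch.cat([semantic, acoustic], dim=-1)    # (B, T, 2*dim)
        return self.mel_head(fused), self.unit_head(fused)
```

A toy call like `model(torch.randn(2, 50, 80), torch.randn(2, 75, 512))` returns per-frame mel predictions and unit logits over the visual sequence length.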
```bash
conda create -n resound python=3.10 -y
conda activate resound
pip install -r requirements.txt
```
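A quick import check after installation can confirm that the heavy dependencies resolved; this assumes requirements.txt installs torch and fairseq, which the AV-HuBERT-based encoder relies on:

```python
# Sanity check (assumes requirements.txt installs torch and fairseq).
import fairseq
import torch

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("fairseq", fairseq.__version__)
```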
For data preparation, please follow the official pipeline from lip2speech-unit:

https://github.com/choijeongsoo/lip2speech-unit

This repository reuses the same directory structure, manifests, and features, so no additional preparation instructions are provided here.
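Before launching training, it can help to verify that the expected manifests are in place. The snippet below is a minimal sketch: the data root and manifest filenames are placeholders, since the authoritative layout is defined by the lip2speech-unit pipeline linked above.

```python
from pathlib import Path

# Placeholder root; point this at your lip2speech-unit-style data directory.
data_root = Path("datasets/lrs3")

# Illustrative manifest names; the linked pipeline defines the real ones.
for name in ("train.tsv", "valid.tsv", "test.tsv"):
    path = data_root / name
    print(f"{path}: {'found' if path.exists() else 'MISSING'}")
```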
```bash
bash encoder/scripts/lrs3/train_avhubert_lrs3.sh
bash encoder/scripts/lrs3/inference_avhubert_lrs3.sh
bash vocoder/scripts/lrs3/inference.sh
```
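After inference, the synthesized waveforms can be spot-checked with a few lines of Python; the output directory below is a hypothetical example, and the actual location is set by the inference script's configuration.

```python
import soundfile as sf  # pip install soundfile
from pathlib import Path

# Hypothetical output directory; the inference script's config sets the real one.
out_dir = Path("vocoder/outputs/lrs3")

for wav_path in sorted(out_dir.glob("*.wav"))[:5]:
    audio, sr = sf.read(wav_path)
    print(f"{wav_path.name}: {len(audio) / sr:.2f}s at {sr} Hz")
```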
This repository builds on Fairseq, AV-HuBERT, ESPnet, and speech-resynthesis. We thank the authors of these projects for open-sourcing their work.