A powerful Gradio-based interface built on Facebook’s SeamlessM4T-v2 model for multilingual speech ↔ text translation and generation.
- Multilingual speech & text translation
- Upload or record audio
- Convert speech to speech, speech to text, text to speech, and text to text
- Automatic 16 kHz audio preprocessing
- Supports GPU acceleration via PyTorch
- Clean Gradio UI with task-based flow
- Audio output saved as .wav
pip install torch torchvision torchaudio
pip install transformers soundfile librosa gradio numpy
- ASR (Audio → Text):
  - Upload audio
  - Click Run
  - Output displays recognized text
- S2TT (Speech → Translated Text):
  - Upload audio
  - Select target language
  - Produces translated text
- S2ST (Speech → Speech Translation):
  - Upload audio
  - Choose target language
  - Generates translated .wav output
- T2ST (Text → Speech Translation):
  - Provide text
  - Choose source & target language
  - Generates spoken audio output
- T2TT (Text → Text Translation):
  - Enter text
  - Select target language
  - Produces translated text
- English → eng
- Hindi → hin
- Tamil → tam
- Telugu → tel
- Malayalam → mal
- Spanish → spa
- French → fra
- German → deu
- Chinese → zho
- Japanese → jpn
- Arabic → ara
- Russian → rus
You can extend this list anytime.
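In code, the mapping above is just a dictionary from UI display name to SeamlessM4T language code (`LANG_CODES` is an illustrative name); extending it is a one-line change:

```python
# Mapping from UI display name to SeamlessM4T language code (extend as needed).
LANG_CODES = {
    "English": "eng", "Hindi": "hin", "Tamil": "tam", "Telugu": "tel",
    "Malayalam": "mal", "Spanish": "spa", "French": "fra", "German": "deu",
    "Chinese": "zho", "Japanese": "jpn", "Arabic": "ara", "Russian": "rus",
}

print(LANG_CODES["Hindi"])  # → hin
```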
- Audio is always resampled to 16 kHz using librosa.
- Model outputs either:
  - text tokens → decoded via processor.decode
  - audio tensors → saved as .wav
- GPU is automatically used if available (torch.cuda.is_available())
- Slow processing → use a GPU runtime (especially in Colab)
- CUDA OOM → reduce input length or switch to CPU
- Audio not playing → ensure the file is written correctly at 16 kHz
- Wrong translation → verify the tgt_lang code
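For the GPU-related items, the usual PyTorch device fallback looks like this (a sketch; `model` stands in for the loaded SeamlessM4Tv2Model):

```python
import torch

# Prefer CUDA when present; otherwise fall back to CPU (slower, but avoids OOM).
device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

# model = model.to(device)                                 # move the model once
# inputs = {k: v.to(device) for k, v in inputs.items()}    # then its inputs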