# F5 TTS for Swift (WIP)

Implementation of [F5-TTS](https://arxiv.org/abs/2410.06885) in Swift, using the [MLX Swift](https://github.com/ml-explore/mlx-swift) framework.

You can listen to a [sample here](https://s3.amazonaws.com/lucasnewman.datasets/f5tts/sample.wav) that was generated in ~11 seconds on an M3 Max MacBook Pro.

See the [Python repository](https://github.com/lucasnewman/f5-tts-mlx) for additional details on the model architecture.
This repository is based on the original PyTorch implementation available [here](https://github.com/SWivid/F5-TTS).

## Installation

The `F5TTS` Swift package can be built and run from Xcode or with SwiftPM.

A pretrained model is available [on Hugging Face](https://hf.co/lucasnewman/f5-tts-mlx).

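To pull the package into another project as a SwiftPM dependency, a manifest along these lines should work. This is only a sketch: the repository URL, version, and target names below are assumptions (only the `F5TTS` module name comes from this README), so adjust them to match your setup.

```swift
// swift-tools-version:5.9
// Sketch of a consuming Package.swift. The repository URL, version, and
// target/package names are assumptions -- adjust to your actual setup.
import PackageDescription

let package = Package(
    name: "MyTTSApp",
    platforms: [.macOS(.v14)],
    dependencies: [
        .package(url: "https://github.com/lucasnewman/f5-tts-swift", from: "0.1.0")
    ],
    targets: [
        .executableTarget(
            name: "MyTTSApp",
            dependencies: [
                .product(name: "F5TTS", package: "f5-tts-swift")
            ]
        )
    ]
)
```
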
## Usage

```swift
import Vocos
import F5TTS

let f5tts = try await F5TTS.fromPretrained(repoId: "lucasnewman/f5-tts-mlx")
let vocos = try await Vocos.fromPretrained(repoId: "lucasnewman/vocos-mel-24khz-mlx") // if decoding to audio output

let inputAudio = MLXArray(...)

let (outputAudio, _) = f5tts.sample(
    cond: inputAudio,
    text: ["This is the caption for the reference audio and generation text."],
    duration: ...,
    vocoder: vocos.decode) { progress in
        print("Progress: \(Int(progress * 100))%")
    }
```
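
The `inputAudio` placeholder above needs to be a waveform wrapped in an `MLXArray`. One way to get there is with AVFoundation; the helper below is just a sketch (the function name is ours, not part of the F5TTS API) and assumes the reference clip is mono and already at the sample rate the vocoder uses (24 kHz for `vocos-mel-24khz`).

```swift
import AVFoundation
import MLX

// Hypothetical helper (not part of the F5TTS API): load a mono audio file
// into an MLXArray of Float samples. Assumes the file is already at the
// sample rate the model/vocoder expects (24 kHz here).
func loadAudio(from url: URL) throws -> MLXArray {
    let file = try AVAudioFile(forReading: url)
    let frameCount = AVAudioFrameCount(file.length)
    guard let buffer = AVAudioPCMBuffer(pcmFormat: file.processingFormat, frameCapacity: frameCount) else {
        throw NSError(domain: "loadAudio", code: 1)
    }
    try file.read(into: buffer)
    guard let channelData = buffer.floatChannelData else {
        throw NSError(domain: "loadAudio", code: 2)
    }
    // Copy the first channel into a Swift array and wrap it in an MLXArray.
    let samples = Array(UnsafeBufferPointer(start: channelData[0], count: Int(buffer.frameLength)))
    return MLXArray(samples)
}

let inputAudio = try loadAudio(from: URL(fileURLWithPath: "reference.wav"))
```

If the reference audio is at a different sample rate, resample it first (for example with `AVAudioConverter`) before wrapping it in an `MLXArray`.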

## Appreciation

[Yushen Chen](https://github.com/SWivid) for the original PyTorch implementation of F5-TTS and the pretrained model.

[Phil Wang](https://github.com/lucidrains) for the E2 TTS implementation that this model is based on.

## Citations

```bibtex
@article{chen-etal-2024-f5tts,
    title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
    author={Yushen Chen and Zhikang Niu and Ziyang Ma and Keqi Deng and Chunhui Wang and Jian Zhao and Kai Yu and Xie Chen},
    journal={arXiv preprint arXiv:2410.06885},
    year={2024},
}
```

```bibtex
@inproceedings{Eskimez2024E2TE,
    title = {E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS},
    author = {Sefik Emre Eskimez and Xiaofei Wang and Manthan Thakker and Canrun Li and Chung-Hsien Tsai and Zhen Xiao and Hemin Yang and Zirun Zhu and Min Tang and Xu Tan and Yanqing Liu and Sheng Zhao and Naoyuki Kanda},
    year = {2024},
    url = {https://api.semanticscholar.org/CorpusID:270738197}
}
```

## License

The code in this repository is released under the MIT license as found in the
[LICENSE](LICENSE) file.