feat: add support for MMS TTS models via Modal #804
base: master
Conversation
📝 Walkthrough

Added HF_TOKEN configuration in daras_ai_v2/settings.py, sourced from the environment. Introduced MMS TTS support across the codebase: a new TextToSpeechProviders enum member and selector wiring with mms_tts_selector and a cached mms_tts_language_options in daras_ai_v2/text_to_speech_settings_widgets.py. Added MMS_TTS support in recipes/TextToSpeech.py with a new TextToSpeechSettings field mms_tts_language and a Modal-based execution path that validates the language, calls a remote Modal function, and uploads the resulting WAV audio. Added a new daras_ai_v2/mms_tts.py module that defines MMS_TTS_SUPPORTED_LANGUAGES, a Modal app/image/volume, and the public functions load_pipe(language) and run_mms_tts(language, text), which instantiate and run HuggingFace MMS pipelines and return WAV bytes.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
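To illustrate the "return WAV bytes" step the walkthrough describes: the actual module uses scipy.io.wavfile, but the serialization can be sketched with the stdlib alone. The function name and the 16-bit PCM packing below are assumptions, not the repo's code.

```python
import io
import struct
import wave

def waveform_to_wav_bytes(samples, sampling_rate):
    """Pack float samples in [-1.0, 1.0] into mono 16-bit PCM WAV bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit PCM
        wav.setframerate(sampling_rate)
        wav.writeframes(
            b"".join(
                struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
                for s in samples
            )
        )
    return buf.getvalue()
```

The returned bytes start with a standard RIFF header, so they can be uploaded directly with a `audio/wav` content type, as the execution path does.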
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
daras_ai_v2/settings.py (1)
Lines 521-522: Hugging Face token: set standard env var and avoid leaking into templates
- Mirror HF token to HUGGING_FACE_HUB_TOKEN so huggingface_hub & CLIs pick it up.
- Caution: all settings are injected into Jinja globals (Line 112-115), which makes HF_TOKEN visible to templates. Consider whitelisting non‑secret values.
Apply this diff:
```diff
-HF_TOKEN = config("HF_TOKEN", "")
+HF_TOKEN = config("HF_TOKEN", default="")
+if HF_TOKEN and "HUGGING_FACE_HUB_TOKEN" not in os.environ:
+    os.environ["HUGGING_FACE_HUB_TOKEN"] = HF_TOKEN
```

Outside this hunk (for your consideration), restrict what’s exposed to Jinja:

```python
# replace settings=globals() with a safe subset, e.g.:
SAFE_SETTINGS_FOR_TEMPLATES = {
    "APP_BASE_URL": APP_BASE_URL,
    "DOCS_URL": DOCS_URL,
    # ... non-secret fields only
}
templates.env.globals.update(
    dict(humanize=humanize, datetime=datetime, settings=SAFE_SETTINGS_FOR_TEMPLATES)
)
```
🧹 Nitpick comments (3)
recipes/TextToSpeech.py (1)
Lines 68-69: Add field doc for clarity
Document the expected code scheme (ISO-639-3).
Apply this diff:
```diff
-    mms_tts_language: str = "eng"
+    mms_tts_language: str = Field("eng", description="ISO-639-3 code for MMS TTS")
```

daras_ai_v2/text_to_speech_settings_widgets.py (2)
Lines 204-214: Selectbox: pass keys and sort for stable UX
Pass the keys view explicitly and sort by display label for determinism.
Apply this diff:
```diff
-def mms_tts_selector():
-    options = mms_tts_language_options()
+def mms_tts_selector():
+    options = mms_tts_language_options()
+    lang_codes = sorted(options.keys(), key=lambda c: options[c].lower())
     gui.selectbox(
         label="""
         ###### MMS TTS Language
         """,
         key="mms_tts_language",
-        format_func=lambda lang: options[lang],
-        options=options,
+        format_func=options.__getitem__,
+        options=lang_codes,
     )
```
Lines 216-225: Options builder: add fallback if langcodes is missing
Avoid runtime failure if langcodes isn’t installed; degrade to returning codes.
Apply this diff:
```diff
 @redis_cache_decorator(ex=settings.REDIS_MODELS_CACHE_EXPIRY)
 def mms_tts_language_options():
-    import langcodes
-    from daras_ai_v2.mms_tts import MMS_TTS_SUPPORTED_LANGUAGES
-
-    result = {}
-    for lang in MMS_TTS_SUPPORTED_LANGUAGES:
-        result[lang] = langcodes.Language.get(lang).display_name()
-    return result
+    from daras_ai_v2.mms_tts import MMS_TTS_SUPPORTED_LANGUAGES
+    try:
+        import langcodes
+    except Exception:
+        return {lang: lang for lang in MMS_TTS_SUPPORTED_LANGUAGES}
+    return {
+        lang: langcodes.Language.get(lang).display_name()
+        for lang in MMS_TTS_SUPPORTED_LANGUAGES
+    }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
daras_ai_v2/settings.py (1 hunks)
daras_ai_v2/text_to_speech_settings_widgets.py (3 hunks)
recipes/TextToSpeech.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
daras_ai_v2/text_to_speech_settings_widgets.py (1)
- daras_ai_v2/redis_cache.py: redis_cache_decorator (23-51)
recipes/TextToSpeech.py (3)
- daras_ai_v2/text_to_speech_settings_widgets.py: TextToSpeechProviders (69-77)
- daras_ai_v2/exceptions.py: UserError (58-65)
- daras_ai/image_input.py: upload_file_from_bytes (59-66)
🪛 Ruff (0.12.2)
recipes/TextToSpeech.py
422-422: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: test (3.10.12, 1.8.3)
🔇 Additional comments (3)
daras_ai_v2/text_to_speech_settings_widgets.py (2)
Lines 77-78: Enum addition LGTM
Provider wired into the enum with a clear label.
Lines 174-176: Selector dispatch LGTM
Correctly routes MMS TTS to its selector.
recipes/TextToSpeech.py (1)
Lines 414-431: MMS TTS: guard modal.enable_output(); don't add local ImportError handling
- Do not add a local try/except ImportError for modal — modal is imported at module top (recipes/TextToSpeech.py line 7) and used elsewhere (e.g., recipes/Functions.py), so it's a repo-level dependency.
- Change only the enable_output usage: at recipes/TextToSpeech.py (around lines 424–426) wrap modal.enable_output() behind a settings.DEBUG check; in production call run_mms_tts.remote(...) without enable_output(). Keep all other behavior unchanged.
Likely an incorrect or invalid review comment.
```python
import gooey_gui as gui
import requests
import modal
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lazy‑load modal (import-time + optional dependency)
Top-level import of modal increases server import time and breaks environments without modal installed. Import it only when MMS TTS is selected.
Apply this diff:
```diff
-import modal
```
See follow-up change in the MMS_TTS case below.
🤖 Prompt for AI Agents
In recipes/TextToSpeech.py around line 7, the top-level "import modal" makes
modal a required import at module import time and slows or breaks environments
without modal; remove the top-level import and instead lazy-import modal inside
the branch that handles the MMS TTS backend (i.e., import modal only when
MMS_TTS is selected), wrap the import in a try/except to raise a clear error if
modal is missing, and mirror this lazy-load pattern in the follow-up MMS_TTS
case so modal is only imported/used when that backend is actually chosen.
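A generic sketch of the lazy-import-with-clear-error pattern the prompt describes; the helper name, message wording, and usage comment below are illustrative, not the repo's actual code.

```python
import importlib

def lazy_import(module_name: str, install_hint: str):
    """Import a dependency on first use; raise a clear error if it is absent."""
    try:
        return importlib.import_module(module_name)
    except ImportError as e:
        raise RuntimeError(install_hint) from e

# Hypothetical usage inside the MMS_TTS branch:
# modal = lazy_import("modal", "MMS TTS requires the 'modal' package (pip install modal).")
```

This keeps module import cheap and turns a bare ImportError into an actionable message only on the code path that actually needs the dependency.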
Actionable comments posted: 1
♻️ Duplicate comments (2)
recipes/TextToSpeech.py (2)
Line 7: Lazy-load Modal to avoid import-time failures and keep server import fast
Top-level `import modal` makes Modal a hard dependency and slows/breaks environments without it. Import Modal only inside the MMS_TTS branch.
Apply this diff:

```diff
-import modal
```
Lines 414-431: Gate Modal usage and surface friendly errors; add progress message
Handle missing Modal, failed function lookup, and remote call errors gracefully; also mirror other providers’ “Uploading…” yield.
Apply this diff:

```diff
             case TextToSpeechProviders.MMS_TTS:
-                from daras_ai_v2.mms_tts import (
-                    MMS_TTS_SUPPORTED_LANGUAGES,
-                    app as modal_app,
-                )
+                try:
+                    import modal
+                except Exception as e:
+                    raise UserError("MMS TTS requires the 'modal' package.") from e
+                from daras_ai_v2.mms_tts import (
+                    MMS_TTS_SUPPORTED_LANGUAGES,
+                    app as modal_app,
+                )
                 language = state.get("mms_tts_language", "eng")
                 if language not in MMS_TTS_SUPPORTED_LANGUAGES:
                     raise UserError(f"Unsupported language: {language}")
-                run_mms_tts = modal.Function.lookup(modal_app.name, "run_mms_tts")
-                with modal.enable_output():
-                    audio = run_mms_tts.remote(language=language, text=text)
+                try:
+                    run_mms_tts = modal.Function.lookup(modal_app.name, "run_mms_tts")
+                except Exception as e:
+                    raise UserError(
+                        "MMS TTS backend is unavailable. Please deploy daras_ai_v2/mms_tts.py."
+                    ) from e
+                try:
+                    with modal.enable_output():
+                        audio = run_mms_tts.remote(language=language, text=text)
+                except Exception as e:
+                    raise UserError("MMS TTS synthesis failed.") from e
-                state["audio_url"] = upload_file_from_bytes(
+                yield "Uploading Audio file..."
+                state["audio_url"] = upload_file_from_bytes(
                     filename="output.wav", data=audio, content_type="audio/wav"
                 )
```
🧹 Nitpick comments (3)
recipes/TextToSpeech.py (1)
Lines 68-69: Preseed default in sane_defaults for consistency
Add `mms_tts_language: "eng"` to `sane_defaults` so the UI/session state is prefilled without relying on model defaults.

daras_ai_v2/mms_tts.py (2)
Lines 108-119: Reuse the CUDA check and pass a single computed device
Minor tidy-up; avoids double checks and keeps intent clear.
Apply this diff:

```diff
-    has_cuda = torch.cuda.is_available()
-    if has_cuda:
-        print("Using GPU")
-    else:
-        print("GPU not available, using CPU")
+    has_cuda = torch.cuda.is_available()
+    print("Using GPU" if has_cuda else "Using CPU")
+    device = 0 if has_cuda else -1
@@
-        device=0 if torch.cuda.is_available() else -1,
+        device=device,
```
Lines 142-144: Guard against shape variance in pipeline output (optional)
Some TTS pipelines return `audio` with or without a batch dim. Consider squeezing before write.
Proposed change:

```diff
-    scipy.io.wavfile.write(b, rate=output["sampling_rate"], data=output["audio"][0])
+    audio = output["audio"]
+    try:
+        audio = audio[0]  # handle batch dim if present
+    except Exception:
+        pass
+    scipy.io.wavfile.write(b, rate=output["sampling_rate"], data=audio)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
daras_ai_v2/mms_tts.py (1 hunks)
daras_ai_v2/settings.py (1 hunks)
daras_ai_v2/text_to_speech_settings_widgets.py (3 hunks)
recipes/TextToSpeech.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- daras_ai_v2/text_to_speech_settings_widgets.py
- daras_ai_v2/settings.py
🧰 Additional context used
🧬 Code graph analysis (1)
recipes/TextToSpeech.py (4)
daras_ai_v2/text_to_speech_settings_widgets.py (1)
TextToSpeechProviders
(69-77)daras_ai_v2/exceptions.py (1)
UserError
(58-65)daras_ai_v2/mms_tts.py (1)
run_mms_tts
(131-144)daras_ai/image_input.py (1)
upload_file_from_bytes
(59-66)
🪛 Ruff (0.12.2)
recipes/TextToSpeech.py
422-422: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: test (3.10.12, 1.8.3)
```python
image = (
    modal.Image.debian_slim()
    .pip_install(
        "transformers~=4.44",
        "huggingface_hub[hf_transfer]~=0.34.4",
        "torch~=2.5.1",
        "scipy~=1.11",
        "python-decouple~=3.6",
    )
    .env({"HF_HUB_CACHE": cache_dir, "HF_TOKEN": config("HF_TOKEN", cast=str)})
)
```
Don’t crash when HF_TOKEN is unset; only inject when present (+ enable hf_transfer)
`config("HF_TOKEN", cast=str)` raises if the var isn’t set. Build `env_vars` conditionally and enable faster downloads.
Apply this diff:

```diff
-image = (
-    modal.Image.debian_slim()
-    .pip_install(
-        "transformers~=4.44",
-        "huggingface_hub[hf_transfer]~=0.34.4",
-        "torch~=2.5.1",
-        "scipy~=1.11",
-        "python-decouple~=3.6",
-    )
-    .env({"HF_HUB_CACHE": cache_dir, "HF_TOKEN": config("HF_TOKEN", cast=str)})
-)
+env_vars = {
+    "HF_HUB_CACHE": cache_dir,
+    "HF_HUB_ENABLE_HF_TRANSFER": "1",
+}
+hf_token = config("HF_TOKEN", default=None, cast=str)
+if hf_token:
+    env_vars["HF_TOKEN"] = hf_token
+image = (
+    modal.Image.debian_slim()
+    .pip_install(
+        "transformers~=4.44",
+        "huggingface_hub[hf_transfer]~=0.34.4",
+        "torch~=2.5.1",
+        "scipy~=1.11",
+        "python-decouple~=3.6",
+    )
+    .env(env_vars)
+)
```
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In daras_ai_v2/mms_tts.py around lines 91-101, avoid calling config("HF_TOKEN")
directly (which raises if unset); instead build an env_vars dict before
constructing the image, only add "HF_TOKEN" when it's present (e.g., from
os.getenv or config with a default), keep "HF_HUB_CACHE" always, and pass that
env_vars into modal.Image.env(...) so the build won’t crash when HF_TOKEN is
missing; also ensure the pip_install uses huggingface_hub[hf_transfer] (already
present) to enable faster downloads.
Q/A checklist
How to check import time?
You can visualize this using tuna:
To measure import time for a specific library:
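For example (a sketch: it assumes CPython's real `-X importtime` flag, that tuna is installed via `pip install tuna`, and uses `modal` as the library being profiled):

```shell
# Dump per-module import timings to stderr, then visualize the log with tuna
python -X importtime -c "import modal" 2> modal-imports.log
tuna modal-imports.log
```

Each line of the log reports self and cumulative microseconds per imported module, which tuna renders as a flame graph.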
To reduce import times, import libraries that take a long time inside the functions that use them instead of at the top of the file:
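A minimal sketch of that pattern; `decimal` stands in here for a genuinely heavy dependency (e.g. torch or modal), and the function name is hypothetical:

```python
def to_decimals(values):
    # Deferred import: the cost is paid on the first call,
    # not when the module (and thus the server) is imported.
    import decimal
    return [decimal.Decimal(v) for v in values]
```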
Legal Boilerplate
Look, I get it. The entity doing business as “Gooey.AI” and/or “Dara.network” was incorporated in the State of Delaware in 2020 as Dara Network Inc. and is gonna need some rights from me in order to utilize my contributions in this PR. So here's the deal: I retain all rights, title and interest in and to my contributions, and by keeping this boilerplate intact I confirm that Dara Network Inc can use, modify, copy, and redistribute my contributions, under its choice of terms.