fix: sunbird: use tokens for languages instead of langcodes #794
@@ -244,10 +244,11 @@
 } # fmt: skip

 SUNBIRD_SUPPORTED_LANGUAGES = {
-    "eng", "swa", "ach", "lgg", "lug", "nyn",
-    "teo", "xog", "ttj", "kin", "myx",
+    "ach": "<|su|>", "eng": "<|en|>", "kin": "<|as|>", "lgg": "<|jw|>", "lug": "<|ba|>", "myx": "<|mg|>",
+    "nyn": "<|ha|>", "swa": "<|sw|>", "teo": "<|ln|>", "ttj": "<|tt|>", "xog": "<|haw|>"
 } # fmt: skip


 # https://translation.ghananlp.org/api-details#api=ghananlp-translation-webservice-api
 GHANA_NLP_SUPPORTED = {'en': 'English', 'tw': 'Twi', 'gaa': 'Ga', 'ee': 'Ewe', 'fat': 'Fante', 'dag': 'Dagbani',
                        'gur': 'Gurene', 'yo': 'Yoruba', 'ki': 'Kikuyu', 'luo': 'Luo', 'mer': 'Kimeru'} # fmt: skip
@@ -1301,6 +1302,8 @@ def run_asr(
         # don't pass language or task
         kwargs.pop("task", None)
         kwargs["max_length"] = 448
+    elif selected_model == AsrModels.whisper_sunbird_large_v3:
+        kwargs["language"] = SUNBIRD_SUPPORTED_LANGUAGES[language.strip()]
Comment on lines +1305 to +1306
💡 Verification agent
🧩 Analysis chain
Fix crash on missing/invalid language for Sunbird and validate before lookup
-    elif selected_model == AsrModels.whisper_sunbird_large_v3:
-        kwargs["language"] = SUNBIRD_SUPPORTED_LANGUAGES[language.strip()]
+    elif selected_model == AsrModels.whisper_sunbird_large_v3:
+        if language:
+            # normalize to one of the supported keys, then map to token
+            lang_key = normalised_lang_in_collection(language, SUNBIRD_SUPPORTED_LANGUAGES)
+            try:
+                kwargs["language"] = SUNBIRD_SUPPORTED_LANGUAGES[lang_key]
+            except KeyError:
+                raise UserError(f"Unsupported Sunbird language: {language!r}")
+        else:
+            # leave unset to allow model-side auto-detect if supported
+            pass
Follow-up:
🏁 Script executed:
#!/bin/bash
set -e
# Show mapping code around Sunbird large model in daras_ai_v2/asr.py
sed -n '1270,1350p' daras_ai_v2/asr.py
# Search for normalised_lang_in_collection definition or usage
rg -n "normalised_lang_in_collection" -C3 --type py
# Search for supports_auto_detect definition
rg -n "def supports_auto_detect" -C3 --type py
# Search for references to whisper_sunbird_large_v3
rg -n "whisper_sunbird_large_v3" -C5 --type py
# Locate SUNBIRD_SUPPORTED_LANGUAGES definition
rg -n "SUNBIRD_SUPPORTED_LANGUAGES" -C3 --type py
Length of output: 12850
🏁 Script executed:
#!/bin/bash
sed -n '300,340p' daras_ai_v2/asr.py
Length of output: 1022
Validate the language before the lookup (same suggested change as above).
Disable “Auto Detect” in UI for Sunbird if not supported
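The validation suggested in this comment can be sketched in isolation as follows. This is a minimal, self-contained illustration rather than the project's code: `UserError` here is a stand-in class, `sunbird_language_token` is a hypothetical helper name, and a plain `strip().lower()` replaces `normalised_lang_in_collection`, whose exact behaviour is not shown in this thread. The token values are copied from the diff and are still under review.

```python
# Mapping copied from the diff in this PR. The review questions whether the
# token values are right, so treat them as provisional.
SUNBIRD_SUPPORTED_LANGUAGES = {
    "ach": "<|su|>", "eng": "<|en|>", "kin": "<|as|>", "lgg": "<|jw|>",
    "lug": "<|ba|>", "myx": "<|mg|>", "nyn": "<|ha|>", "swa": "<|sw|>",
    "teo": "<|ln|>", "ttj": "<|tt|>", "xog": "<|haw|>",
}


class UserError(Exception):
    """Stand-in for the project's UserError (assumed, not the real class)."""


def sunbird_language_token(language):
    """Return the language token for `language`, or None if it is unset.

    Hypothetical helper: normalisation is a simple strip + lowercase here.
    """
    if not language or not language.strip():
        return None  # caller leaves kwargs["language"] unset
    key = language.strip().lower()
    try:
        return SUNBIRD_SUPPORTED_LANGUAGES[key]
    except KeyError:
        # Surface a readable error instead of a bare KeyError.
        raise UserError(f"Unsupported Sunbird language: {language!r}") from None


kwargs = {}
token = sunbird_language_token(" swa ")
if token is not None:
    kwargs["language"] = token
print(kwargs)
```

Returning `None` for a missing language keeps the decision about auto-detect with the caller, which matches the "leave unset" branch in the suggested change.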
     elif "whisper" in selected_model.name:
         forced_lang = forced_asr_languages.get(selected_model)
         if forced_lang:
💡 Verification agent
🧩 Analysis chain
Verify Sunbird language→token map and document intent
The new dict looks right structurally, but several mappings seem semantically unexpected (e.g., "kin"→"<|as|>", "nyn"→"<|ha|>", "xog"→"<|haw|>"). Please confirm these are exactly the tokens Sunbird “asr-whisper-large-v3-salt” expects, and add a short comment explaining the rationale (approximation vs. 1:1 mapping).
Would you like me to cross-check against upstream docs? If so, I can run these queries:
Also consider:
Renaming the dict to SUNBIRD_LANGUAGE_TOKENS to clarify semantics.
Using SUNBIRD_SUPPORTED_LANGUAGES.keys() explicitly to avoid accidental iteration over values elsewhere.
🌐 Web query:
💡 Result:
The model exposes the following language tokens (token ID = integer). The token keys are ISO 639‑3 language codes.
Sources: Hugging Face model README (Sunbird/asr‑whisper‑large‑v3‑salt) listing SALT_LANGUAGE_TOKENS_WHISPER and IDs; SALT project docs listing the language ISO codes and names. [1][2][3]
Align Sunbird language→token map with official SALT docs
The current mappings in daras_ai_v2/asr.py (e.g. "kin": "<|as|>", "nyn": "<|ha|>", "xog": "<|haw|>") don’t match the tokens defined in the Hugging Face Sunbird/asr-whisper-large-v3-salt README (for instance, kin→50350, nyn→50354, xog→50352). Replace these entries with the exact token identifiers (either the SALT_LANGUAGE_TOKENS_WHISPER numeric IDs or their corresponding <|xx|> codes), include a brief comment with the source, rename the dict to SUNBIRD_LANGUAGE_TOKENS, and update any membership checks to use SUNBIRD_LANGUAGE_TOKENS.keys().
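As a rough illustration of the rename and the membership checks discussed here, the sketch below uses the proposed name SUNBIRD_LANGUAGE_TOKENS with the values from the diff. `check_language_tokens` is a hypothetical helper that only checks the shape of the mapping (three-letter keys, `<|xx|>`-style values, no duplicate tokens). It cannot confirm that a token is the one the model expects, which still has to be checked against the model README.

```python
import re

# Proposed name from the review; values copied from the diff (provisional).
SUNBIRD_LANGUAGE_TOKENS = {
    "ach": "<|su|>", "eng": "<|en|>", "kin": "<|as|>", "lgg": "<|jw|>",
    "lug": "<|ba|>", "myx": "<|mg|>", "nyn": "<|ha|>", "swa": "<|sw|>",
    "teo": "<|ln|>", "ttj": "<|tt|>", "xog": "<|haw|>",
}

CODE_PATTERN = re.compile(r"[a-z]{3}")            # ISO 639-3 style key
TOKEN_PATTERN = re.compile(r"<\|[a-z]{2,3}\|>")   # Whisper-style language token


def check_language_tokens(mapping):
    """Return a list of structural problems found in a code -> token mapping."""
    problems = []
    for code, token in mapping.items():
        if not CODE_PATTERN.fullmatch(code):
            problems.append(f"bad language code: {code!r}")
        if not TOKEN_PATTERN.fullmatch(token):
            problems.append(f"bad token for {code!r}: {token!r}")
    tokens = list(mapping.values())
    if len(set(tokens)) != len(tokens):
        problems.append("two language codes share the same token")
    return problems


print(check_language_tokens(SUNBIRD_LANGUAGE_TOKENS))

# Membership tests and iteration on a dict operate on its keys, so these two
# checks are equivalent; .keys() only makes the intent explicit.
print("lug" in SUNBIRD_LANGUAGE_TOKENS, "lug" in SUNBIRD_LANGUAGE_TOKENS.keys())
print("<|ba|>" in SUNBIRD_LANGUAGE_TOKENS)
```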