-
Notifications
You must be signed in to change notification settings - Fork 3
fix: sunbird: use tokens for languages instead of langcodes #794
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
📝 WalkthroughWalkthroughThe SUNBIRD_SUPPORTED_LANGUAGES constant was changed from a set of codes to a dict mapping codes to transliteration tokens used by Sunbird Whisper v3. run_asr now special-cases whisper_sunbird_large_v3 to pass the mapped token via kwargs["language"]. A new public constant, GHANA_NLP_SUPPORTED, was added to map Ghana-related language codes to display names. No other public signatures were altered. Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Possibly related PRs
Suggested reviewers
Pre-merge checks (3 passed)✅ Passed checks (3 passed)
Tip 👮 Agentic pre-merge checks are now available in preview!Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.
Please see the documentation for more information. Example: reviews:
pre_merge_checks:
custom_checks:
- name: "Undocumented Breaking Changes"
mode: "warning"
instructions: |
Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal). Please share your feedback with us on this Discord post. ✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
daras_ai_v2/asr.py
(2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: test (3.10.12, 1.8.3)
"ach": "<|su|>", "eng": "<|en|>", "kin": "<|as|>", "lgg": "<|jw|>", "lug": "<|ba|>", "myx": "<|mg|>", | ||
"nyn": "<|ha|>", "swa": "<|sw|>", "teo": "<|ln|>", "ttj": "<|tt|>", "xog": "<|haw|>" | ||
} # fmt: skip |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Verification agent
🧩 Analysis chain
Verify Sunbird language→token map and document intent
The new dict looks right structurally, but several mappings seem semantically unexpected (e.g., "kin"→"<|as|>", "nyn"→"<|ha|>", "xog"→"<|haw|>"). Please confirm these are exactly the tokens Sunbird “asr-whisper-large-v3-salt” expects, and add a short comment explaining the rationale (approximation vs. 1:1 mapping).
Would you like me to cross-check against upstream docs? If so, I can run these queries:
Also consider:
- Renaming to
SUNBIRD_LANGUAGE_TOKENS
to clarify semantics. - If only keys are user-selectable, ensure any membership checks use
SUNBIRD_SUPPORTED_LANGUAGES.keys()
explicitly to avoid accidental iteration over values elsewhere.
🌐 Web query:
Sunbird "asr-whisper-large-v3-salt" language token list and mapping to ISO codes
💡 Result:
The model exposes the following language tokens (token ID = integer). The token keys are ISO 639‑3 language codes.
- eng → 50259 (English). [1][2]
- swa → 50318 (Swahili). [1][2]
- ach → 50357 (Acholi). [1][2]
- lgg → 50356 (Lugbara). [1][2]
- lug → 50355 (Luganda). [1][2]
- nyn → 50354 (Runyankole). [1][2]
- teo → 50353 (Ateso). [1][2]
- xog → 50352 (Soga / Lusoga). [1][2]
- ttj → 50351 (Tooro / Rutooro). [1][2]
- kin → 50350 (Kinyarwanda). [1][2]
- myx → 50349 (Lumasaba / Masaaba). [1][2][3]
Sources: Hugging Face model README (Sunbird/asr‑whisper‑large‑v3‑salt) listing SALT_LANGUAGE_TOKENS_WHISPER and IDs; SALT project docs listing the language ISO codes and names. [1][2][3]
Align Sunbird language→token map with official SALT docs
The current mappings in daras_ai_v2/asr.py
(e.g. "kin": "<|as|>"
, "nyn": "<|ha|>"
, "xog": "<|haw|>"
) don’t match the tokens defined in the Hugging Face Sunbird/asr-whisper-large-v3-salt
README (for instance, kin
→50350, nyn
→50354, xog
→50352). Replace these entries with the exact token identifiers (either the SALT_LANGUAGE_TOKENS_WHISPER numeric IDs or their corresponding <|xx|>
codes), include a brief comment with the source, rename the dict to SUNBIRD_LANGUAGE_TOKENS
, and update any membership checks to use SUNBIRD_LANGUAGE_TOKENS.keys()
.
🤖 Prompt for AI Agents
In daras_ai_v2/asr.py around lines 247 to 249, the language→token mapping
contains incorrect Sunbird/SALT tokens (e.g. "kin","nyn","xog") and the dict
name is not aligned with its purpose; replace the current entries with the exact
Sunbird/SALT token identifiers (either the numeric SALT_LANGUAGE_TOKENS_WHISPER
IDs or their canonical "<|xx|>" tokens) matching the Hugging Face
Sunbird/asr-whisper-large-v3-salt README, rename the dict to
SUNBIRD_LANGUAGE_TOKENS, add a brief comment pointing to the HF README as the
source, and update any code that checks membership to use
SUNBIRD_LANGUAGE_TOKENS.keys() instead of the old name.
elif selected_model == AsrModels.whisper_sunbird_large_v3: | ||
kwargs["language"] = SUNBIRD_SUPPORTED_LANGUAGES[language.strip()] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Verification agent
🧩 Analysis chain
Fix crash on missing/invalid language for Sunbird and validate before lookup
language.strip()
will raise when language is None
(auto-detect path is allowed by the UI for this model), and direct dict indexing can raise KeyError
. Normalize and guard before mapping to the token.
- elif selected_model == AsrModels.whisper_sunbird_large_v3:
- kwargs["language"] = SUNBIRD_SUPPORTED_LANGUAGES[language.strip()]
+ elif selected_model == AsrModels.whisper_sunbird_large_v3:
+ if language:
+ # normalize to one of the supported keys, then map to token
+ lang_key = normalised_lang_in_collection(language, SUNBIRD_SUPPORTED_LANGUAGES)
+ try:
+ kwargs["language"] = SUNBIRD_SUPPORTED_LANGUAGES[lang_key]
+ except KeyError:
+ raise UserError(f"Unsupported Sunbird language: {language!r}")
+ else:
+ # leave unset to allow model-side auto-detect if supported
+ pass
Follow-up:
- If Sunbird requires a token (no auto-detect), add
AsrModels.whisper_sunbird_large_v3
to the set returned bysupports_auto_detect()
to disable “Auto Detect” in the UI.
🏁 Script executed:
#!/bin/bash
set -e
# Show mapping code around Sunbird large model in daras_ai_v2/asr.py
sed -n '1270,1350p' daras_ai_v2/asr.py
# Search for normalised_lang_in_collection definition or usage
rg -n "normalised_lang_in_collection" -C3 --type py
# Search for supports_auto_detect definition
rg -n "def supports_auto_detect" -C3 --type py
# Search for references to whisper_sunbird_large_v3
rg -n "whisper_sunbird_large_v3" -C5 --type py
# Locate SUNBIRD_SUPPORTED_LANGUAGES definition
rg -n "SUNBIRD_SUPPORTED_LANGUAGES" -C3 --type py
Length of output: 12850
🏁 Script executed:
#!/bin/bash
sed -n '300,340p' daras_ai_v2/asr.py
Length of output: 1022
Validate language
before mapping for Sunbird; guard KeyError
- elif selected_model == AsrModels.whisper_sunbird_large_v3:
- kwargs["language"] = SUNBIRD_SUPPORTED_LANGUAGES[language.strip()]
+ elif selected_model == AsrModels.whisper_sunbird_large_v3:
+ if language:
+ # normalize to one of the supported keys, then map to token
+ lang_key = normalised_lang_in_collection(language, SUNBIRD_SUPPORTED_LANGUAGES)
+ try:
+ kwargs["language"] = SUNBIRD_SUPPORTED_LANGUAGES[lang_key]
+ except KeyError:
+ raise UserError(f"Unsupported Sunbird language: {language!r}")
+ else:
+ # leave unset to allow model-side auto-detect if supported
+ pass
Disable “Auto Detect” in UI for Sunbird if not supported
Add AsrModels.whisper_sunbird_large_v3
to the set in supports_auto_detect()
so it’s treated like other non–auto-detect models.
Committable suggestion skipped: line range outside the PR's diff.
Q/A checklist
How to check import time?
You can visualize this using tuna:
To measure import time for a specific library:
To reduce import times, import libraries that take a long time inside the functions that use them instead of at the top of the file:
Legal Boilerplate
Look, I get it. The entity doing business as “Gooey.AI” and/or “Dara.network” was incorporated in the State of Delaware in 2020 as Dara Network Inc. and is gonna need some rights from me in order to utilize my contributions in this PR. So here's the deal: I retain all rights, title and interest in and to my contributions, and by keeping this boilerplate intact I confirm that Dara Network Inc can use, modify, copy, and redistribute my contributions, under its choice of terms.