Merged
7 changes: 5 additions & 2 deletions daras_ai_v2/asr.py
@@ -244,10 +244,11 @@
} # fmt: skip

SUNBIRD_SUPPORTED_LANGUAGES = {
-    "eng", "swa", "ach", "lgg", "lug", "nyn",
-    "teo", "xog", "ttj", "kin", "myx",
+    "ach": "<|su|>", "eng": "<|en|>", "kin": "<|as|>", "lgg": "<|jw|>", "lug": "<|ba|>", "myx": "<|mg|>",
+    "nyn": "<|ha|>", "swa": "<|sw|>", "teo": "<|ln|>", "ttj": "<|tt|>", "xog": "<|haw|>"
} # fmt: skip
Comment on lines +247 to 249

💡 Verification agent

🧩 Analysis chain

Verify Sunbird language→token map and document intent

The new dict looks right structurally, but several mappings seem semantically unexpected (e.g., "kin"→"<|as|>", "nyn"→"<|ha|>", "xog"→"<|haw|>"). Please confirm these are exactly the tokens Sunbird “asr-whisper-large-v3-salt” expects, and add a short comment explaining the rationale (approximation vs. 1:1 mapping).

Would you like me to cross-check against upstream docs? If so, I can run these queries:

Also consider:

  • Renaming to SUNBIRD_LANGUAGE_TOKENS to clarify semantics.
  • If only keys are user-selectable, ensure any membership checks use SUNBIRD_SUPPORTED_LANGUAGES.keys() explicitly to avoid accidental iteration over values elsewhere.
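
On the second point, a plain `in` test on a Python dict already checks keys only, so writing `.keys()` is a readability choice rather than a correctness fix. A minimal sketch, using a two-entry excerpt of the mapping from the diff (the helper name `is_supported_language` is hypothetical, not taken from the repository):

```python
# Excerpt of the mapping from the diff; the full dict has eleven entries.
SUNBIRD_SUPPORTED_LANGUAGES = {"eng": "<|en|>", "swa": "<|sw|>"}


def is_supported_language(code: str) -> bool:
    """Return True if `code` is one of the dict's keys (ISO 639-3 codes)."""
    # `in` on a dict tests keys, never values.
    return code in SUNBIRD_SUPPORTED_LANGUAGES


print(is_supported_language("eng"))     # a key
print(is_supported_language("<|en|>"))  # a value, not a key
# Iterating a dict also yields keys, the same as iterating .keys().
print(list(SUNBIRD_SUPPORTED_LANGUAGES) == list(SUNBIRD_SUPPORTED_LANGUAGES.keys()))
```

Code written against the old set (membership tests, iteration for a dropdown) should therefore keep working unchanged after the switch to a dict.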

🌐 Web query:

Sunbird "asr-whisper-large-v3-salt" language token list and mapping to ISO codes

💡 Result:

The model exposes the following language tokens (token ID = integer). The token keys are ISO 639‑3 language codes.

  • eng → 50259 (English). [1][2]
  • swa → 50318 (Swahili). [1][2]
  • ach → 50357 (Acholi). [1][2]
  • lgg → 50356 (Lugbara). [1][2]
  • lug → 50355 (Luganda). [1][2]
  • nyn → 50354 (Runyankole). [1][2]
  • teo → 50353 (Ateso). [1][2]
  • xog → 50352 (Soga / Lusoga). [1][2]
  • ttj → 50351 (Tooro / Rutooro). [1][2]
  • kin → 50350 (Kinyarwanda). [1][2]
  • myx → 50349 (Lumasaba / Masaaba). [1][2][3]

Sources: Hugging Face model README (Sunbird/asr‑whisper‑large‑v3‑salt) listing SALT_LANGUAGE_TOKENS_WHISPER and IDs; SALT project docs listing the language ISO codes and names. [1][2][3]


Align Sunbird language→token map with official SALT docs
The current mappings in daras_ai_v2/asr.py (e.g. "kin": "<|as|>", "nyn": "<|ha|>", "xog": "<|haw|>") don’t match the tokens defined in the Hugging Face Sunbird/asr-whisper-large-v3-salt README (for instance, kin→50350, nyn→50354, xog→50352). Replace these entries with the exact token identifiers (either the SALT_LANGUAGE_TOKENS_WHISPER numeric IDs or their corresponding <|xx|> codes), include a brief comment with the source, rename the dict to SUNBIRD_LANGUAGE_TOKENS, and update any membership checks to use SUNBIRD_LANGUAGE_TOKENS.keys().
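
The query result lists numeric IDs while the dict stores <|xx|> strings, so the two can only be compared after each string is resolved to its ID. A minimal sketch of that check follows. The helper `find_mismatches` is hypothetical, and the `ASSUMED_WHISPER_IDS` table is an assumption based on the stock multilingual Whisper vocabulary; in practice the lookup should come from the model's own tokenizer (for example Hugging Face's `convert_tokens_to_ids`).

```python
from typing import Callable, Dict, Optional, Tuple

# ISO 639-3 code -> token ID, as reported in the query result above.
REPORTED_IDS = {
    "eng": 50259, "swa": 50318, "ach": 50357, "lgg": 50356, "lug": 50355,
    "nyn": 50354, "teo": 50353, "xog": 50352, "ttj": 50351, "kin": 50350,
    "myx": 50349,
}

# ISO 639-3 code -> language token string, copied from the diff.
PR_MAPPING = {
    "ach": "<|su|>", "eng": "<|en|>", "kin": "<|as|>", "lgg": "<|jw|>",
    "lug": "<|ba|>", "myx": "<|mg|>", "nyn": "<|ha|>", "swa": "<|sw|>",
    "teo": "<|ln|>", "ttj": "<|tt|>", "xog": "<|haw|>",
}

# ASSUMPTION: IDs of these tokens in the stock multilingual Whisper
# vocabulary. Verify against the real tokenizer before relying on them.
ASSUMED_WHISPER_IDS = {
    "<|en|>": 50259, "<|sw|>": 50318, "<|mg|>": 50349, "<|as|>": 50350,
    "<|tt|>": 50351, "<|haw|>": 50352, "<|ln|>": 50353, "<|ha|>": 50354,
    "<|ba|>": 50355, "<|jw|>": 50356, "<|su|>": 50357,
}


def find_mismatches(
    mapping: Dict[str, str],
    reported_ids: Dict[str, int],
    token_to_id: Callable[[str], Optional[int]],
) -> Dict[str, Tuple[str, Optional[int], Optional[int]]]:
    """Return {code: (token, resolved_id, reported_id)} for every disagreement."""
    mismatches = {}
    for code, token in mapping.items():
        resolved = token_to_id(token)
        reported = reported_ids.get(code)
        if resolved != reported:
            mismatches[code] = (token, resolved, reported)
    return mismatches


mismatches = find_mismatches(PR_MAPPING, REPORTED_IDS, ASSUMED_WHISPER_IDS.get)
print(mismatches)
```

An empty result means the string tokens and the reported IDs describe the same vocabulary entries under the assumed table. In that case the dict would be reusing existing Whisper language tokens for the SALT languages, which is worth stating in a source comment next to the dict.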

🤖 Prompt for AI Agents
In daras_ai_v2/asr.py around lines 247 to 249, the language→token mapping
contains incorrect Sunbird/SALT tokens (e.g. "kin","nyn","xog") and the dict
name is not aligned with its purpose; replace the current entries with the exact
Sunbird/SALT token identifiers (either the numeric SALT_LANGUAGE_TOKENS_WHISPER
IDs or their canonical "<|xx|>" tokens) matching the Hugging Face
Sunbird/asr-whisper-large-v3-salt README, rename the dict to
SUNBIRD_LANGUAGE_TOKENS, add a brief comment pointing to the HF README as the
source, and update any code that checks membership to use
SUNBIRD_LANGUAGE_TOKENS.keys() instead of the old name.



# https://translation.ghananlp.org/api-details#api=ghananlp-translation-webservice-api
GHANA_NLP_SUPPORTED = {'en': 'English', 'tw': 'Twi', 'gaa': 'Ga', 'ee': 'Ewe', 'fat': 'Fante', 'dag': 'Dagbani',
'gur': 'Gurene', 'yo': 'Yoruba', 'ki': 'Kikuyu', 'luo': 'Luo', 'mer': 'Kimeru'} # fmt: skip
@@ -1301,6 +1302,8 @@ def run_asr(
             # don't pass language or task
             kwargs.pop("task", None)
             kwargs["max_length"] = 448
+        elif selected_model == AsrModels.whisper_sunbird_large_v3:
+            kwargs["language"] = SUNBIRD_SUPPORTED_LANGUAGES[language.strip()]
Comment on lines +1305 to +1306

💡 Verification agent

🧩 Analysis chain

Fix crash on missing/invalid language for Sunbird and validate before lookup

language.strip() raises AttributeError when language is None (the UI allows the auto-detect path for this model), and direct dict indexing raises KeyError for any code that is not in the dict. Normalize and guard before mapping to the token.

-        elif selected_model == AsrModels.whisper_sunbird_large_v3:
-            kwargs["language"] = SUNBIRD_SUPPORTED_LANGUAGES[language.strip()]
+        elif selected_model == AsrModels.whisper_sunbird_large_v3:
+            if language:
+                # normalize to one of the supported keys, then map to token
+                lang_key = normalised_lang_in_collection(language, SUNBIRD_SUPPORTED_LANGUAGES)
+                try:
+                    kwargs["language"] = SUNBIRD_SUPPORTED_LANGUAGES[lang_key]
+                except KeyError:
+                    raise UserError(f"Unsupported Sunbird language: {language!r}")
+            else:
+                # leave unset to allow model-side auto-detect if supported
+                pass
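
The guard in the suggestion can be exercised in isolation. The following is a self-contained sketch, not the repository's code: `UserError` is a stand-in class, `sunbird_language_token` is a hypothetical helper, and the strip/lowercase step only approximates `normalised_lang_in_collection`, whose behavior is not shown here.

```python
from typing import Optional


class UserError(Exception):
    """Stand-in for the project's UserError (hypothetical definition)."""


# Excerpt of the mapping from the diff; the full dict has eleven entries.
SUNBIRD_SUPPORTED_LANGUAGES = {"eng": "<|en|>", "lug": "<|ba|>"}


def sunbird_language_token(language: Optional[str]) -> Optional[str]:
    """Map a user-supplied language code to its token, or None if unset."""
    # None or blank input: leave the language unset instead of crashing
    # on None.strip().
    if not language or not language.strip():
        return None
    # Simplified normalization (assumption): trim whitespace, lowercase.
    key = language.strip().lower()
    try:
        return SUNBIRD_SUPPORTED_LANGUAGES[key]
    except KeyError:
        # Surface a readable error rather than a bare KeyError.
        raise UserError(f"Unsupported Sunbird language: {language!r}") from None


print(sunbird_language_token(" eng "))
print(sunbird_language_token(None))
```

Returning None for missing input keeps the decision about auto-detect with the caller, which can skip setting kwargs["language"] when no token comes back.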

Follow-up:

  • If Sunbird requires a token (no auto-detect), add AsrModels.whisper_sunbird_large_v3 to the set returned by supports_auto_detect() to disable “Auto Detect” in the UI.

🏁 Script executed:

#!/bin/bash
set -e
# Show mapping code around Sunbird large model in daras_ai_v2/asr.py
sed -n '1270,1350p' daras_ai_v2/asr.py
# Search for normalised_lang_in_collection definition or usage
rg -n "normalised_lang_in_collection" -C3 --type py
# Search for supports_auto_detect definition
rg -n "def supports_auto_detect" -C3 --type py
# Search for references to whisper_sunbird_large_v3
rg -n "whisper_sunbird_large_v3" -C5 --type py
# Locate SUNBIRD_SUPPORTED_LANGUAGES definition
rg -n "SUNBIRD_SUPPORTED_LANGUAGES" -C3 --type py

Length of output: 12850


🏁 Script executed:

#!/bin/bash
sed -n '300,340p' daras_ai_v2/asr.py

Length of output: 1022


Validate language before mapping for Sunbird; guard KeyError

-        elif selected_model == AsrModels.whisper_sunbird_large_v3:
-            kwargs["language"] = SUNBIRD_SUPPORTED_LANGUAGES[language.strip()]
+        elif selected_model == AsrModels.whisper_sunbird_large_v3:
+            if language:
+                # normalize to one of the supported keys, then map to token
+                lang_key = normalised_lang_in_collection(language, SUNBIRD_SUPPORTED_LANGUAGES)
+                try:
+                    kwargs["language"] = SUNBIRD_SUPPORTED_LANGUAGES[lang_key]
+                except KeyError:
+                    raise UserError(f"Unsupported Sunbird language: {language!r}")
+            else:
+                # leave unset to allow model-side auto-detect if supported
+                pass

Disable “Auto Detect” in UI for Sunbird if not supported
Add AsrModels.whisper_sunbird_large_v3 to the set in supports_auto_detect() so it’s treated like other non–auto-detect models.
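
As an illustration of that change, here is a minimal sketch that assumes supports_auto_detect() is implemented as an exclusion set on the enum. Only the member name whisper_sunbird_large_v3 comes from the diff; the other members and all enum values are hypothetical placeholders.

```python
from enum import Enum


class AsrModels(Enum):
    # Hypothetical members and values, for illustration only.
    whisper_large_v2 = "Whisper Large v2"
    azure = "Azure Speech"
    # Member name taken from the diff; the value is a placeholder.
    whisper_sunbird_large_v3 = "Sunbird Whisper Large v3"

    def supports_auto_detect(self) -> bool:
        """Return False for models that need an explicit language."""
        return self not in {
            AsrModels.azure,
            AsrModels.whisper_sunbird_large_v3,
        }


for model in AsrModels:
    print(model.name, model.supports_auto_detect())
```

With the Sunbird model in the exclusion set, a UI that consults supports_auto_detect() would stop offering "Auto Detect" for it, so run_asr() would always receive a language for that model.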

Committable suggestion skipped: line range outside the PR's diff.

         elif "whisper" in selected_model.name:
             forced_lang = forced_asr_languages.get(selected_model)
             if forced_lang: