Skip to content

Conversation

nikochiko
Copy link
Member

Q/A checklist

  • I have tested my UI changes on mobile and they look acceptable
  • I have tested changes to the workflows in both the API and the UI
  • I have done a code review of my changes and looked at each line of the diff + the references of each function I have changed
  • My changes have not increased the import time of the server
How to check import time?

time python -c 'import server'

You can visualize this using tuna:

python3 -X importtime -c 'import server' 2> out.log && tuna out.log

To measure import time for a specific library:

$ time python -c 'import pandas'

________________________________________________________
Executed in    1.15 secs    fish           external
   usr time    2.22 secs   86.00 micros    2.22 secs
   sys time    0.72 secs  613.00 micros    0.72 secs

To reduce import times, import libraries that take a long time inside the functions that use them instead of at the top of the file:

def my_function():
    import pandas as pd
    ...

Legal Boilerplate

Look, I get it. The entity doing business as “Gooey.AI” and/or “Dara.network” was incorporated in the State of Delaware in 2020 as Dara Network Inc. and is gonna need some rights from me in order to utilize my contributions in this PR. So here's the deal: I retain all rights, title and interest in and to my contributions, and by keeping this boilerplate intact I confirm that Dara Network Inc can use, modify, copy, and redistribute my contributions, under its choice of terms.

Copy link

coderabbitai bot commented Sep 10, 2025

📝 Walkthrough

Walkthrough

The SUNBIRD_SUPPORTED_LANGUAGES constant was changed from a set of codes to a dict mapping codes to transliteration tokens used by Sunbird Whisper v3. run_asr now special-cases whisper_sunbird_large_v3 to pass the mapped token via kwargs["language"]. A new public constant, GHANA_NLP_SUPPORTED, was added to map Ghana-related language codes to display names. No other public signatures were altered.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

Suggested reviewers

  • devxpy

Pre-merge checks (3 passed)

✅ Passed checks (3 passed)
Check name Status Explanation
Title Check ✅ Passed The title succinctly describes the primary change of switching to tokens for languages instead of language codes in the Sunbird implementation. It accurately reflects the update in the ASR module regarding SUNBIRD_SUPPORTED_LANGUAGES. The phrasing follows the repository’s conventional commit style and is clear and concise.
Description Check ✅ Passed The description includes all required sections from the repository’s PR template, namely the Q/A checklist with the four standard items, the detailed import-time measurement guidance in a collapsible section, and the legal boilerplate. Although the checklist items remain unchecked, the structure and required headings match the template exactly. There are no missing or off-topic sections, so the description fully complies with the expected format.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Upto 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.

✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch sunbird-use-language-tokens

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 161c759 and ff4242f.

📒 Files selected for processing (1)
  • daras_ai_v2/asr.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: test (3.10.12, 1.8.3)

Comment on lines +247 to 249
"ach": "<|su|>", "eng": "<|en|>", "kin": "<|as|>", "lgg": "<|jw|>", "lug": "<|ba|>", "myx": "<|mg|>",
"nyn": "<|ha|>", "swa": "<|sw|>", "teo": "<|ln|>", "ttj": "<|tt|>", "xog": "<|haw|>"
} # fmt: skip
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Verify Sunbird language→token map and document intent

The new dict looks right structurally, but several mappings seem semantically unexpected (e.g., "kin"→"<|as|>", "nyn"→"<|ha|>", "xog"→"<|haw|>"). Please confirm these are exactly the tokens Sunbird “asr-whisper-large-v3-salt” expects, and add a short comment explaining the rationale (approximation vs. 1:1 mapping).

Would you like me to cross-check against upstream docs? If so, I can run these queries:

Also consider:

  • Renaming to SUNBIRD_LANGUAGE_TOKENS to clarify semantics.
  • If only keys are user-selectable, ensure any membership checks use SUNBIRD_SUPPORTED_LANGUAGES.keys() explicitly to avoid accidental iteration over values elsewhere.

🌐 Web query:

Sunbird "asr-whisper-large-v3-salt" language token list and mapping to ISO codes

💡 Result:

The model exposes the following language tokens (token ID = integer). The token keys are ISO 639‑3 language codes.

  • eng → 50259 (English). [1][2]
  • swa → 50318 (Swahili). [1][2]
  • ach → 50357 (Acholi). [1][2]
  • lgg → 50356 (Lugbara). [1][2]
  • lug → 50355 (Luganda). [1][2]
  • nyn → 50354 (Runyankole). [1][2]
  • teo → 50353 (Ateso). [1][2]
  • xog → 50352 (Soga / Lusoga). [1][2]
  • ttj → 50351 (Tooro / Rutooro). [1][2]
  • kin → 50350 (Kinyarwanda). [1][2]
  • myx → 50349 (Lumasaba / Masaaba). [1][2][3]

Sources: Hugging Face model README (Sunbird/asr‑whisper‑large‑v3‑salt) listing SALT_LANGUAGE_TOKENS_WHISPER and IDs; SALT project docs listing the language ISO codes and names. [1][2][3]


Align Sunbird language→token map with official SALT docs
The current mappings in daras_ai_v2/asr.py (e.g. "kin": "<|as|>", "nyn": "<|ha|>", "xog": "<|haw|>") don’t match the tokens defined in the Hugging Face Sunbird/asr-whisper-large-v3-salt README (for instance, kin→50350, nyn→50354, xog→50352). Replace these entries with the exact token identifiers (either the SALT_LANGUAGE_TOKENS_WHISPER numeric IDs or their corresponding <|xx|> codes), include a brief comment with the source, rename the dict to SUNBIRD_LANGUAGE_TOKENS, and update any membership checks to use SUNBIRD_LANGUAGE_TOKENS.keys().

🤖 Prompt for AI Agents
In daras_ai_v2/asr.py around lines 247 to 249, the language→token mapping
contains incorrect Sunbird/SALT tokens (e.g. "kin","nyn","xog") and the dict
name is not aligned with its purpose; replace the current entries with the exact
Sunbird/SALT token identifiers (either the numeric SALT_LANGUAGE_TOKENS_WHISPER
IDs or their canonical "<|xx|>" tokens) matching the Hugging Face
Sunbird/asr-whisper-large-v3-salt README, rename the dict to
SUNBIRD_LANGUAGE_TOKENS, add a brief comment pointing to the HF README as the
source, and update any code that checks membership to use
SUNBIRD_LANGUAGE_TOKENS.keys() instead of the old name.

Comment on lines +1305 to +1306
elif selected_model == AsrModels.whisper_sunbird_large_v3:
kwargs["language"] = SUNBIRD_SUPPORTED_LANGUAGES[language.strip()]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Fix crash on missing/invalid language for Sunbird and validate before lookup

language.strip() will raise when language is None (auto-detect path is allowed by the UI for this model), and direct dict indexing can raise KeyError. Normalize and guard before mapping to the token.

-        elif selected_model == AsrModels.whisper_sunbird_large_v3:
-            kwargs["language"] = SUNBIRD_SUPPORTED_LANGUAGES[language.strip()]
+        elif selected_model == AsrModels.whisper_sunbird_large_v3:
+            if language:
+                # normalize to one of the supported keys, then map to token
+                lang_key = normalised_lang_in_collection(language, SUNBIRD_SUPPORTED_LANGUAGES)
+                try:
+                    kwargs["language"] = SUNBIRD_SUPPORTED_LANGUAGES[lang_key]
+                except KeyError:
+                    raise UserError(f"Unsupported Sunbird language: {language!r}")
+            else:
+                # leave unset to allow model-side auto-detect if supported
+                pass

Follow-up:

  • If Sunbird requires a token (no auto-detect), add AsrModels.whisper_sunbird_large_v3 to the set returned by supports_auto_detect() to disable “Auto Detect” in the UI.

🏁 Script executed:

#!/bin/bash
set -e
# Show mapping code around Sunbird large model in daras_ai_v2/asr.py
sed -n '1270,1350p' daras_ai_v2/asr.py
# Search for normalised_lang_in_collection definition or usage
rg -n "normalised_lang_in_collection" -C3 --type py
# Search for supports_auto_detect definition
rg -n "def supports_auto_detect" -C3 --type py
# Search for references to whisper_sunbird_large_v3
rg -n "whisper_sunbird_large_v3" -C5 --type py
# Locate SUNBIRD_SUPPORTED_LANGUAGES definition
rg -n "SUNBIRD_SUPPORTED_LANGUAGES" -C3 --type py

Length of output: 12850


🏁 Script executed:

#!/bin/bash
sed -n '300,340p' daras_ai_v2/asr.py

Length of output: 1022


Validate language before mapping for Sunbird; guard KeyError

-        elif selected_model == AsrModels.whisper_sunbird_large_v3:
-            kwargs["language"] = SUNBIRD_SUPPORTED_LANGUAGES[language.strip()]
+        elif selected_model == AsrModels.whisper_sunbird_large_v3:
+            if language:
+                # normalize to one of the supported keys, then map to token
+                lang_key = normalised_lang_in_collection(language, SUNBIRD_SUPPORTED_LANGUAGES)
+                try:
+                    kwargs["language"] = SUNBIRD_SUPPORTED_LANGUAGES[lang_key]
+                except KeyError:
+                    raise UserError(f"Unsupported Sunbird language: {language!r}")
+            else:
+                # leave unset to allow model-side auto-detect if supported
+                pass

Disable “Auto Detect” in UI for Sunbird if not supported
Add AsrModels.whisper_sunbird_large_v3 to the set in supports_auto_detect() so it’s treated like other non–auto-detect models.

Committable suggestion skipped: line range outside the PR's diff.

@nikochiko nikochiko merged commit 0b437cc into master Sep 10, 2025
8 checks passed
@nikochiko nikochiko deleted the sunbird-use-language-tokens branch September 10, 2025 12:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants