-
Notifications
You must be signed in to change notification settings - Fork 1k
Add tdt duration to APIs #2514
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add tdt duration to APIs #2514
Conversation
|
Note Other AI code review bot(s) detectedCodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review. Caution Review failedThe pull request is closed. WalkthroughAdds per-token duration support end-to-end (core C++ structs, decoder, conversion, JSON), exposes durations in C API, Python and Go bindings, and adds C/Python example programs plus a CMake target for Nemo Parakeet offline decoding. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor User
participant App as Example (C/Python)
participant Rec as OfflineRecognizer
participant Dec as Nemo TDT Greedy Decoder
participant Res as OfflineRecognitionResult
User->>App: supply model paths + WAV
App->>Rec: create stream, feed waveform
Rec->>Dec: decode frames
Note over Dec: split logits -> token_logits & duration_logits\ny = argmax(token_logits)\nd = argmax(duration_logits) or 1
Dec-->>Rec: emit tokens, timestamps, durations (variable step t += d)
Rec->>Res: build result (text, tokens, timestamps, durations)
App->>Res: read result
Res-->>App: return (durations may be NULL/absent)
App-->>User: print text and per-token timestamps/durations
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
Poem
Tip 🔌 Remote MCP (Model Context Protocol) integration is now available!Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats. 📜 Recent review detailsConfiguration used: CodeRabbit UI 💡 Knowledge Base configuration:
You can enable these sources in your CodeRabbit configuration. 📒 Files selected for processing (4)
✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
sherpa-onnx/c-api/c-api.h (1)
609-654: SherpaOnnxOfflineRecognizerResult Struct Change Is an ABI BreakThe addition of the
durationsfield before the existingcountmember inSherpaOnnxOfflineRecognizerResultshifts all subsequent offsets, breaking ABI compatibility with any client binary compiled against the previous layout.Please address this breaking change by choosing one of the following strategies:
- Introduce a new versioned struct and accessor:
- Define
SherpaOnnxOfflineRecognizerResultV2(withdurationsin the desired position).- Add a corresponding factory/getter function (e.g.,
SherpaOnnxGetOfflineStreamResultV2).- Leave the original
SherpaOnnxOfflineRecognizerResultand its API untouched for backward compatibility.- Or, if you’re willing to break existing clients:
- Bump the library’s SONAME or major version.
- Clearly document the ABI break in the changelog/release notes.
- Long-term: adopt a “size/flags at the top” extensibility pattern so future fields can be added without moving offsets of existing members.
Please confirm the intended compatibility guarantees and implement one of the above paths before merging. I can help draft the V2 API if needed.
sherpa-onnx/csrc/offline-transducer-greedy-search-nemo-decoder.cc (1)
118-119: Initialize skip to 1 and clamp it defensively to avoid t += 0 edge cases.If anything goes wrong before skip is set (unexpected early returns in future edits, exceptions), t += skip with skip==0 risks an infinite loop. Initialize and clamp.
- int32_t skip = 0; + int32_t skip = 1; // avoid t += 0 on the first iteration @@ - skip = 1; - int32_t duration = 1; + skip = 1; + int32_t duration = 1; @@ - skip = duration; - if (skip == 0) skip = 1; + skip = std::max(duration, 1);Also applies to: 146-154
🧹 Nitpick comments (17)
sherpa-onnx/csrc/offline-recognizer-transducer-impl.h (1)
39-39: Duration conversion looks correct; clarify unit contract and guard mismatches.You scale durations with the same frame_shift_s used for timestamps, which is exactly what we want if src.durations are in post-subsampling frames. However, offline-transducer-decoder.h currently says “frames or seconds,” which could lead to accidental double-scaling if a decoder ever outputs seconds. Recommend tightening the contract (frames-only at decoder layer) and optionally asserting size alignment with tokens/timestamps when durations are present.
Proposed doc tightening (see comment in offline-transducer-decoder.h) and optional alignment check:
float frame_shift_s = frame_shift_ms / 1000. * subsampling_factor; for (auto t : src.timestamps) { float time = frame_shift_s * t; r.timestamps.push_back(time); } - // Copy durations (if present) - for (auto d : src.durations) { - r.durations.push_back(d * frame_shift_s); - } + // Copy durations (if present). Expect src.durations to be in output frames. + if (!src.durations.empty()) { + // Optional: sanity check to avoid subtle mismatches + // (keep as debug-only if you prefer). + // assert(src.durations.size() == src.timestamps.size()); + for (auto d : src.durations) { + r.durations.push_back(d * frame_shift_s); // seconds + } + }Also applies to: 70-73
sherpa-onnx/csrc/offline-transducer-decoder.h (1)
23-24: Make durations unit unambiguous (post-subsampling frames).To avoid accidental double conversion, define durations at the decoder layer in frames (same scale as timestamps). Convert to seconds only in higher layers (Convert()).
- /// durations[i] contains the duration (in frames or seconds, as appropriate) for tokens[i] (TDT models only) + /// durations[i] contains the duration for tokens[i] in output frames + /// (post-subsampling). It is converted to seconds by higher layers + /// (e.g., Convert() in offline-recognizer-transducer-impl.h). std::vector<float> durations;sherpa-onnx/csrc/offline-stream.h (1)
41-43: Add size invariant to docs for durations.Durations are optional, but when present they should align 1:1 with tokens (and timestamps). Document this to help downstream consumers.
- /// durations[i] contains the duration (in seconds) for tokens[i] (TDT models only) + /// durations.size() == tokens.size() when non-empty. + /// durations[i] contains the duration (in seconds) for tokens[i] (TDT models only). std::vector<float> durations;sherpa-onnx/csrc/offline-stream.cc (1)
399-410: Consider emitting “durations” only when non-empty for stricter backward compatibility.Always including an empty “durations”: [] is likely fine, but gating on non-empty mirrors how optional fields are usually surfaced and reduces risk for strict consumers comparing schemas.
- os << "\"" - << "durations" - << "\"" - << ": "; - os << "["; - sep = ""; - for (auto d : durations) { - os << sep << std::fixed << std::setprecision(2) << d; - sep = ", "; - } - os << "], "; + if (!durations.empty()) { + os << "\"durations\": "; + os << "["; + sep = ""; + for (auto d : durations) { + os << sep << std::fixed << std::setprecision(2) << d; + sep = ", "; + } + os << "], "; + }sherpa-onnx/c-api/c-api.cc (1)
692-699: Durations propagation mirrors timestamps logic correctly.
- Size guard against r->count is correct and prevents mismatched copies.
- Allocation + std::copy usage is sound, and the destructor covers deallocation.
Minor optional: when count == 0 (Lines 702-706), durations remains nullptr due to the memset earlier; you could also set r->durations = nullptr there for symmetry with timestamps.
sherpa-onnx/c-api/c-api.h (2)
621-623: Doc nit: count comment should cover durations too.The comment currently says “number of entries in timestamps”. It should mention tokens/timestamps/durations are all length = count.
Apply:
- // number of entries in timestamps + // Number of entries in tokens/timestamps/durations
631-644: JSON doc is missing the new “durations” field.The comment block documenting the JSON payload doesn’t list durations. Propose adding it:
* "tokens": [x, x, x], - * "timestamps": [x, x, x], + * "timestamps": [x, x, x], + * "durations": [x, x, x],sherpa-onnx/python/csrc/offline-stream.cc (1)
45-49: Python binding for durations – LGTM.Read-only exposure of durations aligns with timestamps and returns std::vector as a Python list. Optional: add a short property doc (unit = seconds) for timestamps and durations to reduce ambiguity in downstream usage.
python-api-examples/offline-nemo-parakeet-decode-file.py (3)
6-6: Remove duplicate import.
import osis present twice (Lines 4 and 6). Drop one to satisfy linters.-import os import sys
17-19: Validate model file paths too (not just the WAV).If any of encoder/decoder/joiner/tokens is missing, the example will fail later with a less helpful error. Add a quick check.
if not os.path.exists(wav_filename): print(f"File not found: {wav_filename}") sys.exit(1) + +for p in (encoder, decoder, joiner, tokens): + if not os.path.exists(p): + print(f"File not found: {p}") + sys.exit(1)
43-48: Use length checks instead of hasattr to decide whether to print tables.The properties always exist; they may be empty. Using hasattr will always be True. Prefer truthiness/length checks to avoid printing an empty header.
-if hasattr(result, "tokens") and hasattr(result, "timestamps") and hasattr(result, "durations"): - print("Token\tTimestamp\tDuration") - for token, ts, dur in zip(result.tokens, result.timestamps, result.durations): - print(f"{token}\t{ts:.2f}\t{dur:.2f}") +if result.tokens and result.timestamps: + if result.durations: + print("Token\tTimestamp\tDuration") + for token, ts, dur in zip(result.tokens, result.timestamps, result.durations): + print(f"{token}\t{ts:.2f}\t{dur:.2f}") + else: + print("Token\tTimestamp") + for token, ts in zip(result.tokens, result.timestamps): + print(f"{token}\t{ts:.2f}") else: print("Timestamps or durations not available.")c-api-examples/nemo-parakeet-c-api.c (3)
24-27: Preflight-check all model/token files, not just the WAV.You only verify the WAV exists. If any ONNX or tokens path is wrong, the example fails later with a generic message. Add explicit checks to fail fast with actionable errors.
Apply this diff:
const char *provider = "cpu"; - if (!SherpaOnnxFileExists(wav_filename)) { - fprintf(stderr, "File not found: %s\n", wav_filename); - return -1; - } + // Preflight checks for all inputs + if (!SherpaOnnxFileExists(wav_filename)) { + fprintf(stderr, "File not found: %s\n", wav_filename); + return -1; + } + if (!SherpaOnnxFileExists(encoder_filename)) { + fprintf(stderr, "File not found: %s\n", encoder_filename); + return -1; + } + if (!SherpaOnnxFileExists(decoder_filename)) { + fprintf(stderr, "File not found: %s\n", decoder_filename); + return -1; + } + if (!SherpaOnnxFileExists(joiner_filename)) { + fprintf(stderr, "File not found: %s\n", joiner_filename); + return -1; + } + if (!SherpaOnnxFileExists(tokens_filename)) { + fprintf(stderr, "File not found: %s\n", tokens_filename); + return -1; + }Also applies to: 14-21
28-33: Tighten the WAV parse error message or validate format.ReadWave returns floats; the failure isn’t necessarily “not a valid mono 16‑bit.” Either make the message generic or actually validate channels/bit-depth before printing a specific claim.
- fprintf(stderr, "Failed to read or parse %s (not a valid mono 16-bit WAVE file)\n", wav_filename); + fprintf(stderr, "Failed to read or parse %s (unsupported/invalid WAV)\n", wav_filename);
74-81: Print whatever timing info is available instead of all-or-nothing.If tokens exist but only timestamps or durations are present, you currently print “not available” and hide useful data. Print available columns dynamically.
- if (result->tokens_arr && result->timestamps && result->durations) { - printf("Token\tTimestamp\tDuration\n"); - for (int32_t i = 0; i < result->count; ++i) { - printf("%s\t%.2f\t%.2f\n", result->tokens_arr[i], result->timestamps[i], result->durations[i]); - } - } else { - printf("Timestamps or durations not available.\n"); - } + if (result->tokens_arr) { + // Header + printf("Token"); + if (result->timestamps) printf("\tTimestamp"); + if (result->durations) printf("\tDuration"); + printf("\n"); + // Rows + for (int32_t i = 0; i < result->count; ++i) { + printf("%s", result->tokens_arr[i]); + if (result->timestamps) printf("\t%.2f", result->timestamps[i]); + if (result->durations) printf("\t%.2f", result->durations[i]); + printf("\n"); + } + } else { + printf("Tokens not available.\n"); + }sherpa-onnx/csrc/offline-transducer-greedy-search-nemo-decoder.cc (3)
159-160: Make the duration type explicit.durations is a vector; casting avoids implicit narrowing surprises during overload resolution or template deduction.
- ans.durations.push_back(duration); // Use the index as duration, matching Python + ans.durations.push_back(static_cast<float>(duration)); // index-as-duration, matching Python
104-105: Remove unused variable.token_durations is never read. Drop it to keep the decoder tight.
- std::vector<float> token_durations;
142-154: Duration semantics verified consistent across all bindings and JSON
- The greedy‐search decoder in
sherpa-onnx/csrc/offline-transducer-greedy-search-nemo-decoder.ccrecords raw frame‐count indices inans.durations(0..num_durations–1) and uses a separateskipclamp to advance at least one frame.- In
sherpa-onnx/csrc/offline-recognizer-transducer-impl.h, each index is converted to seconds viar.durations.push_back(d * frame_shift_s), producing a durations vector in seconds.- The Python binding (
offline-stream.cc), C API (c-api.cc) and the JSON serializer (offline-stream.cc) all expose the post-conversion durations in seconds, as documented inoffline-stream.handc-api.h.- There is no need for any additional “+1” or offset: model-output classes map directly to frame counts starting at zero, then clamped at runtime for
skip.Optional: consider clarifying the inline comment at line 159 of the greedy‐search decoder to note that these values are raw frame-count indices (converted downstream), to avoid confusion.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (12)
c-api-examples/CMakeLists.txt(1 hunks)c-api-examples/nemo-parakeet-c-api.c(1 hunks)python-api-examples/offline-nemo-parakeet-decode-file.py(1 hunks)scripts/go/sherpa_onnx.go(2 hunks)sherpa-onnx/c-api/c-api.cc(2 hunks)sherpa-onnx/c-api/c-api.h(1 hunks)sherpa-onnx/csrc/offline-recognizer-transducer-impl.h(2 hunks)sherpa-onnx/csrc/offline-stream.cc(1 hunks)sherpa-onnx/csrc/offline-stream.h(1 hunks)sherpa-onnx/csrc/offline-transducer-decoder.h(1 hunks)sherpa-onnx/csrc/offline-transducer-greedy-search-nemo-decoder.cc(3 hunks)sherpa-onnx/python/csrc/offline-stream.cc(1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (4)
python-api-examples/offline-nemo-parakeet-decode-file.py (3)
sherpa-onnx/csrc/offline-recognizer-transducer-impl.h (1)
sherpa_onnx(31-192)sherpa-onnx/csrc/offline-stream.h (1)
sherpa_onnx(17-107)sherpa-onnx/csrc/offline-transducer-decoder.h (1)
sherpa_onnx(13-44)
sherpa-onnx/c-api/c-api.cc (1)
sherpa-onnx/csrc/offline-stream.cc (2)
r(250-250)r(250-250)
c-api-examples/nemo-parakeet-c-api.c (1)
sherpa-onnx/c-api/c-api.cc (22)
SherpaOnnxFileExists(2055-2057)SherpaOnnxFileExists(2055-2055)SherpaOnnxReadWave(1439-1456)SherpaOnnxReadWave(1439-1439)SherpaOnnxCreateOfflineRecognizer(549-565)SherpaOnnxCreateOfflineRecognizer(549-550)SherpaOnnxFreeWave(1480-1485)SherpaOnnxFreeWave(1480-1480)SherpaOnnxCreateOfflineStream(580-585)SherpaOnnxCreateOfflineStream(580-581)SherpaOnnxDestroyOfflineRecognizer(575-578)SherpaOnnxDestroyOfflineRecognizer(575-576)SherpaOnnxAcceptWaveformOffline(598-602)SherpaOnnxAcceptWaveformOffline(598-600)SherpaOnnxDecodeOfflineStream(604-608)SherpaOnnxDecodeOfflineStream(604-606)SherpaOnnxGetOfflineStreamResult(620-709)SherpaOnnxGetOfflineStreamResult(620-621)SherpaOnnxDestroyOfflineRecognizerResult(711-725)SherpaOnnxDestroyOfflineRecognizerResult(711-712)SherpaOnnxDestroyOfflineStream(594-596)SherpaOnnxDestroyOfflineStream(594-594)
sherpa-onnx/csrc/offline-recognizer-transducer-impl.h (1)
sherpa-onnx/csrc/offline-stream.cc (2)
r(250-250)r(250-250)
🪛 Ruff (0.12.2)
python-api-examples/offline-nemo-parakeet-decode-file.py
6-6: Redefinition of unused os from line 4
Remove definition: os
(F811)
🪛 Flake8 (7.2.0)
python-api-examples/offline-nemo-parakeet-decode-file.py
[error] 6-6: redefinition of unused 'os' from line 4
(F811)
🔇 Additional comments (8)
c-api-examples/CMakeLists.txt (1)
62-64: Example doesn’t use cargs; no extra linkage neededThe
nemo-parakeet-c-api.cfile contains no#include <cargs.h>directive nor any calls tocargs_*functions. Therefore, linking against the cargs library is unnecessary and the existing:add_executable(nemo-parakeet-c-api nemo-parakeet-c-api.c) target_link_libraries(nemo-parakeet-c-api sherpa-onnx-c-api)is correct.
sherpa-onnx/c-api/c-api.cc (1)
711-717: Destructor now frees durations – LGTM.
delete[] r->durations;is placed alongside other fields, so no leak.sherpa-onnx/c-api/c-api.h (1)
662-672: Fix offline result doc comment function nameThe documentation for freeing an offline recognizer result incorrectly references the online destroy function. Update the comment in
sherpa-onnx/c-api/c-api.h:
- File: sherpa-onnx/c-api/c-api.h
- Line: 663
- /// SherpaOnnxDestroyOnlineRecognizerResult() to free the returned + /// SherpaOnnxDestroyOfflineRecognizerResult() to free the returnedscripts/go/sherpa_onnx.go (2)
523-530: Go result type gains Durations – LGTM.Public API is extended in a backward-compatible manner for Go users (new field, no breaking changes). Make sure release notes call out the addition.
876-889: Safe conditional population of Timestamps and Durations – LGTM.
- Guards against nil C pointers before slicing avoid panics.
- Loops copy with proper float32 casts.
- Durations population is independent of timestamps, which is correct.
No changes requested.
c-api-examples/nemo-parakeet-c-api.c (2)
83-87: Good cleanup ordering and completeness.Result, stream, recognizer, then wave are destroyed in the right order. No leaks.
66-71: No overflow risk onnum_samplesconversionThe
SherpaOnnxWavestruct definesnum_samplesas anint32_t, which matches theint32_t nparameter inSherpaOnnxAcceptWaveformOffline. There is no wider type being implicitly converted, so the overflow concern does not apply. Feel free to disregard the suggested cast/check.Likely an incorrect or invalid review comment.
sherpa-onnx/csrc/offline-transducer-greedy-search-nemo-decoder.cc (1)
138-146: Clean split between token and duration logits looks correct.Using the first vocab_size as token logits and the tail as duration logits aligns with the TDT output layout. Argmax over the token slice to get y is standard greedy decoding.
sherpa-onnx/csrc/offline-transducer-greedy-search-nemo-decoder.cc
Outdated
Show resolved
Hide resolved
csukuangfj
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Left some minor comments.
Otherwise, it looks great!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds duration support to offline recognition results for TDT (Token-Duration-Time) models, specifically targeting Nemo Parakeet models. The feature provides per-token duration information alongside existing timestamps.
- Extends the recognition result structure to include duration information across all APIs
- Updates the TDT decoder to extract and utilize duration logits from model output
- Adds comprehensive examples demonstrating duration usage in both C and Python
Reviewed Changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| sherpa-onnx/csrc/offline-transducer-greedy-search-nemo-decoder.cc | Core decoder logic to extract duration from TDT model output |
| sherpa-onnx/csrc/offline-transducer-decoder.h | Added durations field to decoder result structure |
| sherpa-onnx/csrc/offline-recognizer-transducer-impl.h | Duration conversion from frames to seconds |
| sherpa-onnx/csrc/offline-stream.h | Added durations field to recognition result |
| sherpa-onnx/csrc/offline-stream.cc | JSON serialization support for durations |
| sherpa-onnx/python/csrc/offline-stream.cc | Python binding for durations property |
| sherpa-onnx/c-api/c-api.h | C API structure extension for durations |
| sherpa-onnx/c-api/c-api.cc | C API implementation for duration handling |
| scripts/go/sherpa_onnx.go | Go API support for durations |
| python-api-examples/offline-nemo-parakeet-decode-file.py | Python example demonstrating duration usage |
| c-api-examples/nemo-parakeet-c-api.c | C example demonstrating duration usage |
| c-api-examples/CMakeLists.txt | Build configuration for new C example |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| token_logits, | ||
| std::max_element(token_logits, token_logits + vocab_size))); | ||
|
|
||
| skip = 1; |
Copilot
AI
Aug 21, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The variable skip is unconditionally set to 1 on line 144, but then conditionally overridden on line 150. This creates confusion as the initial assignment appears unnecessary. Consider moving the skip = 1 assignment inside the else clause or after the duration calculation.
| skip = 1; |
| duration_logits, | ||
| std::max_element(duration_logits, duration_logits + num_durations))); | ||
| skip = duration; | ||
| if (skip == 0) skip = 1; |
Copilot
AI
Aug 21, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default duration value of 1 should be documented or made into a named constant. This magic number affects the behavior when no duration information is available from the model.
| if (skip == 0) skip = 1; | |
| skip = kDefaultDuration; | |
| int32_t duration = kDefaultDuration; | |
| if (num_durations > 0) { | |
| duration = static_cast<int32_t>(std::distance( | |
| duration_logits, | |
| std::max_element(duration_logits, duration_logits + num_durations))); | |
| skip = duration; | |
| if (skip == 0) skip = kDefaultDuration; |
This is a follow up to:
#2495
This should add support to keep and pass durations through to the APIs from the nemo parakeet tdt models.
Examples are included for C and Python.
This code was only tested to the point that I confirmed my test programs ran and gave reasonable output. More in depth reviews and testing would be appreciated.
Summary by CodeRabbit
New Features
Documentation