feat: handle Ack and Nack messages in OTAP Exporter.#1994
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #1994 +/- ##
==========================================
+ Coverage 87.27% 87.28% +0.01%
==========================================
Files 553 553
Lines 181338 181503 +165
==========================================
+ Hits 158259 158424 +165
Misses 22545 22545
Partials 534 534
🚀 New features to boost your workflow:
|
|
Just to confirm, I believe this is not an issue with OTLP exporter - the exporter captures the context from |
|
You're right. This problem doesn't happen in OTLP Exporter, only in OTAP exporter. |
|
This pull request has been marked as stale due to lack of recent activity. It will be closed in 30 days if no further activity occurs. If this PR is still relevant, please comment or push new commits to keep it active. |
jmacd
left a comment
There was a problem hiding this comment.
This looks good, by the way!
|
Thank you @alochaus! FYI In the future we may want to extend this with an option to save the payload in case the retry_processor is subscribed with a RETURN_DATA interest. I can see how we might like a helper on OtapPdata and Context, a sort of clone_or_take() operation that clones otherwise takes that handles the RETURN_DATA case. |
Change Summary
Part of #1325 (see also #1324).
Problem: flaky test due to sleep-based synchronization.
The OTAP Exporter had two related issues:
1. Dropped Context
When the exporter received pipeline data (OtapPdata), it split the message into context and payload, then threw away the context:
otel-arrow/rust/otap-dataflow/crates/otap/src/otap_exporter.rs
Lines 233 to 235 in 82f7150
The context carries routing information that tells the pipeline who to notify when a message succeeds or fails. Without it, the exporter was a black hole — data went in, but no confirmation ever came back.
2. Flaky Test
The test
test_receiver_not_ready_on_starthad no way to know when the exporter finished processing, so it used arbitrary sleeps:otel-arrow/rust/otap-dataflow/crates/otap/src/otap_exporter.rs
Lines 866 to 868 in 82f7150
otel-arrow/rust/otap-dataflow/crates/otap/src/otap_exporter.rs
Lines 870 to 872 in 82f7150
otel-arrow/rust/otap-dataflow/crates/otap/src/otap_exporter.rs
Lines 883 to 885 in 82f7150
On a slow machine or under load, these might not be long enough. On a fast machine, they waste time.
Solution: thread context through for ACK/NACK.
The core idea is to preserve the
OtapPdatacontext by usingtake_payload()instead ofinto_parts().This extracts the payload for gRPC transmission while keeping the context alive in the original OtapPdata. Then pass that context through the entire streaming pipeline so it can be returned with ACK (success) or NACK (failure) notifications.
Key changes
The internal channels changed from carrying just data to carrying (context, data) tuples:
The exporter uses bidirectional gRPC streaming — requests go out on one stream, responses come back on another. A FIFO correlation channel pairs them:
Since both streams are ordered, the first response always corresponds to the first request.
The effect_handler uses the context inside pdata to route ACK/NACK back through the pipeline.
The test is now event-driven: it proceeds as soon as the NACK/ACK arrives, with a 5-second timeout as a safety net.
What issue does this PR close?
How are these changes tested?
Are there any user-facing changes?