Skip to content

Conversation

@Sanchit2662
Copy link

@Sanchit2662 Sanchit2662 commented Jan 16, 2026

Summary

This PR fixes a silent error propagation bug in the ArrayNode worker output-gathering path (flytepropeller/pkg/controller/nodes/array/worker.go).
Due to a hardcoded nil being returned on the response channel, errors encountered while reading sub-node outputs were silently discarded, leading to workflows completing successfully with missing or corrupted outputs.

This is a one-character fix that restores correct error propagation and aligns the behavior with the existing node execution handling logic.


Fix

Return the actual captured error instead of hardcoding nil:

// After
gatherOutputsRequest.responseChannel <- struct {
    literalMap map[string]*idlcore.Literal
    error
}{literalMap, err}

This mirrors the already-correct behavior used for nodeExecutionRequest earlier in the same worker loop and ensures consistent error handling.


Impact

Before

  • Output read failures silently ignored
  • Workflows succeed with incorrect or missing data
  • Downstream task behavior becomes nondeterministic
  • Production issues are extremely difficult to diagnose

After

  • Output gathering errors are properly propagated
  • Workflows fail explicitly and predictably
  • Retry / failure semantics behave as expected
  • No silent data corruption

This affects all Flyte deployments using ArrayNodes, which is a core execution primitive.


How was this patch tested?

  • Manually verified error propagation path by inducing output read failures
  • Confirmed consistency with nodeExecutionRequest error handling logic
  • Ensured workerErrorCollector correctly captures and reports the error after the fix

No new tests were added due to the minimal and localized nature of the change, but the behavior now matches the established and already-tested execution path.


Labels

  • fixed

Checklist

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Signed-off-by: Sanchit2662 <sanchit2662@gmail.com>
@welcome
Copy link

welcome bot commented Jan 16, 2026

Thank you for opening this pull request! 🙌

These tips will help get your PR across the finish line:

  • Most of the repos have a PR template; if not, fill it out to the best of your knowledge.
  • Sign off your commits (Reference: DCO Guide).

@Sanchit2662
Copy link
Author

Hi @machichima ,
This pr fixes a bug where ArrayNode output-gathering errors were silently swallowed, causing workflows to succeed with missing outputs.
Whenever you have time, I’d really appreciate it if you could take a look and review the PR.
Thank you!

@codecov
Copy link

codecov bot commented Jan 23, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 56.95%. Comparing base (575c0af) to head (26ab08e).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6860      +/-   ##
==========================================
- Coverage   56.96%   56.95%   -0.01%     
==========================================
  Files         929      929              
  Lines       58152    58152              
==========================================
- Hits        33125    33120       -5     
- Misses      21985    21990       +5     
  Partials     3042     3042              
Flag Coverage Δ
unittests-datacatalog 53.51% <ø> (ø)
unittests-flyteadmin 53.10% <ø> (-0.04%) ⬇️
unittests-flytecopilot 43.06% <ø> (ø)
unittests-flytectl 64.00% <ø> (ø)
unittests-flyteidl 75.71% <ø> (ø)
unittests-flyteplugins 60.13% <ø> (ø)
unittests-flytepropeller 53.63% <100.00%> (ø)
unittests-flytestdlib 63.29% <ø> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@machichima
Copy link
Member

Hi @Sanchit266,
Thank you for the fix! Do you mind providing result screenshot showing the error is raised correctly?

Copy link
Member

@machichima machichima left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Sanchit2662
Copy link
Author

Hi @machichima,

I verified this locally by simulating the case where a subnode is marked Succeeded
but its output file is missing.

Running:
go clean -testcache && go test -v -run "TestHandleArrayNodePhaseSucceeding/Success$" .

Now results in:
worker error(s) encountered: [0]: failed to read data from
s3://bucket/output/0/0/outputs.pb: file does not exist

This confirms the output read error is correctly propagated instead of being
silently ignored. The test failure is expected in this setup since outputs are missing.

errrr

@machichima
Copy link
Member

machichima commented Jan 24, 2026

Thanks @Sanchit2662!
Could you also add a test here (like what you modified in TestHandleArrayNodePhaseSucceeding) to make sure the error is raised?
Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants