
@fmeum (Collaborator) commented Sep 29, 2025

This allows builds to recover from a polluted remote cache that contains an action result with an exit code of 0 that hasn't produced the required output files.

The check added in a151116 is moved closer to the lookup so that the cached result can be ignored instead of resulting in an error, with forced re-execution in the case of remote execution.
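The core idea of moving the check next to the lookup can be sketched as follows. This is a minimal illustration, not Bazel's actual code: the `CachedResult` record and `acceptIfComplete` method are hypothetical, and treating a cached non-zero exit code as a miss is an assumption of the sketch. An incomplete entry yields an empty `Optional` (a miss) instead of an error.

```java
import java.util.Optional;
import java.util.Set;

public final class CacheLookupCheck {
  // Hypothetical stand-in for a cached action result: the recorded exit code and the
  // paths of the output files it contains. Not Bazel's or the REAPI's actual type.
  record CachedResult(int exitCode, Set<String> outputPaths) {}

  /**
   * Accepts a cached result as a hit only if it reports success and contains every
   * mandatory output. Anything else is reported as a miss (Optional.empty()) so that
   * the caller re-executes the action instead of failing the build.
   */
  static Optional<CachedResult> acceptIfComplete(
      CachedResult cached, Set<String> mandatoryOutputs) {
    if (cached == null || cached.exitCode() != 0) {
      return Optional.empty(); // no entry, or (by assumption here) a failed one
    }
    if (!cached.outputPaths().containsAll(mandatoryOutputs)) {
      return Optional.empty(); // polluted entry: exit code 0 but outputs missing
    }
    return Optional.of(cached);
  }

  public static void main(String[] args) {
    Set<String> mandatory = Set.of("out/lib.a", "out/lib.h");
    CachedResult complete = new CachedResult(0, Set.of("out/lib.a", "out/lib.h"));
    CachedResult polluted = new CachedResult(0, Set.of("out/lib.h"));
    System.out.println(acceptIfComplete(complete, mandatory).isPresent()); // true
    System.out.println(acceptIfComplete(polluted, mandatory).isPresent()); // false
  }
}
```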

@fmeum changed the title from "Treat action results with missing outputs as cache misses" to "Treat action results with missing mandatory outputs as cache misses" Sep 29, 2025
@fmeum force-pushed the dont-accept-results-with-missing-outputs branch from f0eeada to 67f1920 September 29, 2025 13:47
@fmeum marked this pull request as ready for review September 29, 2025 13:47
@fmeum requested a review from a team as a code owner September 29, 2025 13:47
@github-actions bot added the team-Remote-Exec (Issues and PRs for the Execution (Remote) team) and awaiting-review (PR is awaiting review from an assigned reviewer) labels Sep 29, 2025
@fmeum requested a review from tjgq September 29, 2025 13:47
@sluongng (Contributor) commented

Worth noting the background on this change for folks who are out of the loop (@fmeum, please feel free to correct me where I'm wrong):

There is no way in the current Remote Execution API spec for the client to specify which of the expected outputs are "mandatory" and which are "optional". Because of this, remote execution (RBE) workers have to rely solely on a zero exit code to decide whether an action completed successfully and whether to upload its result to the Action Cache (AC). As a result, if a remote action is misconfigured to exit with code zero even though it failed, an Action Cache entry is still produced, potentially without the outputs Bazel expected.
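To make the worker-side problem concrete, here is a hedged sketch of an upload decision that can only look at the exit code. The `ExecutionOutcome` and `ActionCacheEntry` records and the `maybeCache` method are hypothetical simplifications, not any real worker's implementation; the point is that without a mandatory/optional marker, an action that exits 0 is cached with whatever outputs happen to exist.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public final class WorkerUploadSketch {
  // Hypothetical view of what a worker knows after running an action: the exit code,
  // the output paths the client declared, and which of those files actually exist.
  record ExecutionOutcome(int exitCode, List<String> declaredOutputs, Set<String> producedFiles) {}

  // Hypothetical AC entry: only the outputs that were actually produced are recorded.
  record ActionCacheEntry(int exitCode, Set<String> outputs) {}

  /** Returns the entry the worker would write to the AC, or null if it writes nothing. */
  static ActionCacheEntry maybeCache(ExecutionOutcome outcome) {
    // With no mandatory/optional marker on declared outputs, exit code 0 is the only signal.
    if (outcome.exitCode() != 0) {
      return null;
    }
    Set<String> recorded = outcome.declaredOutputs().stream()
        .filter(outcome.producedFiles()::contains)
        .collect(Collectors.toSet());
    return new ActionCacheEntry(0, recorded);
  }

  public static void main(String[] args) {
    // A misconfigured action that exits 0 without writing out/app.bin.
    ExecutionOutcome bad = new ExecutionOutcome(
        0, List.of("out/app.bin", "out/app.log"), Set.of("out/app.log"));
    System.out.println(maybeCache(bad).outputs()); // [out/app.log]
  }
}
```

In this sketch the incomplete entry is cached anyway, which is exactly the kind of "poisoned" result the client later reads back.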

In our experience, this is one of the most common causes of accidental "remote cache poisoning" for newer users. Traditionally, the fix involves either changing the --remote_instance_name value to switch to a new remote cache namespace, or running a clean build with the --noremote_accept_cached flag. Both options effectively discard all existing AC entries and rebuild from scratch.

So, could we instead implement the validation on the worker side, failing the remote action despite a zero exit code when missing outputs are detected?

Today, Bazel considers all outputs declared by Starlark rules to be mandatory, so worker-side validation could be a reasonable fix for those. However, a certain set of "native" actions defined in Bazel's Java code may behave differently:

  • TestRunner: All outputs are optional
  • CppCompile / ObjcCompile: some outputs can be optional when coverage is enabled
  • Javac / JavacTurbine: when reduced classpath is enabled, only the .jdeps output is required for the first pass

Because of this, a remote worker/server implementation may fail some of these special actions for not producing all declared outputs, even though their results are completely valid. Hence, worker-side validation would be hard to maintain in the long term.
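The mismatch can be illustrated with a small sketch. The action kinds and rules below are a hypothetical simplification of the list above (the CppCompile/ObjcCompile coverage case is omitted), not Bazel's real logic. A worker that only sees declared outputs would have to require all of them, while the client knows which ones are actually mandatory for a given action.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public final class MandatoryOutputsSketch {
  // Hypothetical, simplified action kinds based on the list of native actions above.
  enum Kind { STARLARK, TEST_RUNNER, JAVAC_REDUCED_CLASSPATH_FIRST_PASS }

  /** Client-side view: which declared outputs this action must produce (simplified). */
  static Set<String> mandatoryOutputs(Kind kind, List<String> declared) {
    return switch (kind) {
      case STARLARK -> Set.copyOf(declared); // Starlark-declared outputs are all mandatory
      case TEST_RUNNER -> Set.of(); // all outputs are optional
      case JAVAC_REDUCED_CLASSPATH_FIRST_PASS ->
          declared.stream().filter(p -> p.endsWith(".jdeps")).collect(Collectors.toSet());
    };
  }

  /** A worker without a mandatory marker could only require every declared output. */
  static boolean workerWouldAccept(List<String> declared, Set<String> produced) {
    return produced.containsAll(declared);
  }

  /** The client can apply the per-action rule instead. */
  static boolean clientWouldAccept(Kind kind, List<String> declared, Set<String> produced) {
    return produced.containsAll(mandatoryOutputs(kind, declared));
  }

  public static void main(String[] args) {
    List<String> declared = List.of("lib.jar", "lib.jdeps");
    Set<String> produced = Set.of("lib.jdeps"); // a valid first-pass result
    Kind kind = Kind.JAVAC_REDUCED_CLASSPATH_FIRST_PASS;
    System.out.println(workerWouldAccept(declared, produced)); // false: valid result rejected
    System.out.println(clientWouldAccept(kind, declared, produced)); // true
  }
}
```

This is why the check fits naturally on the client: only the client has the information needed to apply per-action rules like these.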

This PR introduces an alternative solution: when Bazel sees an Action Cache result with missing mandatory outputs, it treats it as a remote cache miss and re-runs the action to produce the missing outputs. This means Bazel automatically discards only the "poisoned" Action Cache entry rather than all existing Action Cache entries. The tradeoff is that if a build action is badly constructed, users lose remote caching for it and thus see an increase in remotely executed actions.
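The end-to-end flow, including the "forced re-execution in the case of remote execution" mentioned in the PR description, might look roughly like the sketch below. Everything here is hypothetical: `run`, `Result` and `RemoteExecutor` are illustrative names, and the `skipCacheLookup` flag is only modeled on the REAPI's `ExecuteRequest.skip_cache_lookup` field. How Bazel actually forces re-execution is an assumption. The idea is that without such a flag, the server could answer the execute request with the same polluted cache entry.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public final class RecoverFromPollutedCache {
  // Hypothetical result type: exit code plus the output paths that were produced.
  record Result(int exitCode, Set<String> outputs) {}

  /** Hypothetical executor; skipCacheLookup is modeled on REAPI's skip_cache_lookup. */
  interface RemoteExecutor {
    Result execute(String actionKey, boolean skipCacheLookup);
  }

  static Result run(String actionKey, Set<String> mandatory,
      Map<String, Result> actionCache, RemoteExecutor executor) {
    Result cached = actionCache.get(actionKey);
    if (cached != null && cached.exitCode() == 0 && cached.outputs().containsAll(mandatory)) {
      return cached; // ordinary cache hit
    }
    // Cache miss. If an entry existed but was unusable, ask the server to skip its own
    // cache lookup; otherwise it could return the same polluted result again.
    Result fresh = executor.execute(actionKey, /* skipCacheLookup= */ cached != null);
    if (fresh.exitCode() != 0) {
      return fresh; // a genuine failure is returned as-is and not cached
    }
    if (!fresh.outputs().containsAll(mandatory)) {
      // Assumption: a fresh run that still lacks mandatory outputs is a real error.
      throw new IllegalStateException(actionKey + " did not produce its mandatory outputs");
    }
    actionCache.put(actionKey, fresh); // stands in for the cache write after a successful run
    return fresh;
  }

  public static void main(String[] args) {
    Map<String, Result> cache = new HashMap<>();
    cache.put("compile-foo", new Result(0, Set.of())); // polluted: exit 0, no outputs
    RemoteExecutor executor = (key, skip) -> {
      System.out.println("executing " + key + ", skipCacheLookup=" + skip);
      return new Result(0, Set.of("foo.o"));
    };
    Result r = run("compile-foo", Set.of("foo.o"), cache, executor);
    System.out.println(r.outputs()); // [foo.o]
  }
}
```

In this sketch, the badly constructed action from the tradeoff above would hit the exception on every build, because no usable result is ever cached for it.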

@fmeum requested a review from coeuvre October 15, 2025 20:59
@coeuvre (Member) commented Oct 20, 2025

Importing.

@fmeum (Collaborator, Author) commented Oct 20, 2025

@bazel-io fork 8.5.0

@github-actions bot removed the awaiting-review (PR is awaiting review from an assigned reviewer) label Oct 20, 2025
@fmeum deleted the dont-accept-results-with-missing-outputs branch October 21, 2025 15:13
fmeum added a commit to fmeum/bazel that referenced this pull request Oct 21, 2025
…misses

This allows builds to recover from a polluted remote cache that contains an action result with an exit code of 0 that hasn't produced the required output files.

The check added in unknown commit is moved closer to the lookup so that the cached result can be ignored instead of resulting in an error, with forced re-execution in the case of remote execution.

Closes bazelbuild#27105.

PiperOrigin-RevId: 821651398
Change-Id: I067c668598f9ee83e670b4636854185282112f86
(cherry picked from commit 29d442b)
github-merge-queue bot pushed a commit that referenced this pull request Oct 23, 2025
…misses (#27380)

This allows builds to recover from a polluted remote cache that contains
an action result with an exit code of 0 that hasn't produced the
required output files.

The check added in unknown commit is moved closer to the lookup so that
the cached result can be ignored instead of resulting in an error, with
forced re-execution in the case of remote execution.

Closes #27105.

PiperOrigin-RevId: 821651398
Change-Id: I067c668598f9ee83e670b4636854185282112f86
(cherry picked from commit 29d442b)

Closes #27363
tjgq pushed a commit to tjgq/bazel that referenced this pull request Oct 23, 2025
…misses (bazelbuild#27380)
