|
| 1 | +# Buildbarn Architecture Decision Record #7: Nested invocations |
| 2 | + |
| 3 | +Author: Ed Schouten<br/> |
| 4 | +Date: 2021-12-27 |
| 5 | + |
| 6 | +# Context |
| 7 | + |
| 8 | +On 2020-11-16, bb\_scheduler was extended to take the Bazel invocation |
| 9 | +ID into account ([commit](https://github.com/buildbarn/bb-remote-execution/commit/1a8407a0e62bf559abcd4006e0ecc9d3de9838c7)). |
| 10 | +With this change applied, bb\_scheduler no longer places all incoming |
| 11 | +operations in a single FIFO-like queue (disregarding priorities). |
| 12 | +Instead, every invocation of Bazel gets its own queue. All of these |
| 13 | +queues combined are placed in another queue that determines the order of |
| 14 | +execution. The goal of this change was to improve fairness, by making |
| 15 | +the scheduler give every invocation an equal number of workers. |
| 16 | + |
| 17 | +Under the hood, [`InMemoryBuildQueue` calls into `ActionRouter`](https://github.com/buildbarn/bb-remote-execution/blob/6d5d7a67f5552f9b961a7e1d81c3cd77f2086004/pkg/scheduler/in_memory_build_queue.go#L378), which then [calls into `invocation.KeyExtractor`](https://github.com/buildbarn/bb-remote-execution/blob/6d5d7a67f5552f9b961a7e1d81c3cd77f2086004/pkg/scheduler/routing/simple_action_router.go#L40) |
| 18 | +to obtain an [`invocation.Key`](https://github.com/buildbarn/bb-remote-execution/blob/6d5d7a67f5552f9b961a7e1d81c3cd77f2086004/pkg/scheduler/invocation/key.go). |
| 19 | +The latter acts as [a map key](https://github.com/buildbarn/bb-remote-execution/blob/6d5d7a67f5552f9b961a7e1d81c3cd77f2086004/pkg/scheduler/in_memory_build_queue.go#L1201) |
| 20 | +for all invocations known to the scheduler. |
| 21 | +[The default implementation](https://github.com/buildbarn/bb-remote-execution/blob/6d5d7a67f5552f9b961a7e1d81c3cd77f2086004/pkg/scheduler/invocation/request_metadata_key_extractor.go) |
| 22 | +is one that extracts [the `tool_invocation_id` field from the REv2 `RequestMetadata` message](https://github.com/bazelbuild/remote-apis/blob/636121a32fa7b9114311374e4786597d8e7a69f3/build/bazel/remote/execution/v2/remote_execution.proto#L1836). |
| 23 | + |
| 24 | +What is pretty elegant about this design is that `InMemoryBuildQueue` |
| 25 | +doesn't really care about what kind of logic is used to compute |
| 26 | +invocation keys. One may, for example, provide a custom |
| 27 | +`invocation.KeyExtractor` that causes operations to be grouped by |
| 28 | +username, or by [the `correlated_invocations_id` field](https://github.com/bazelbuild/remote-apis/blob/636121a32fa7b9114311374e4786597d8e7a69f3/build/bazel/remote/execution/v2/remote_execution.proto#L1840). |
| 29 | + |
| 30 | +On 2021-07-16, bb\_scheduler was extended once more to use Bazel |
| 31 | +invocation IDs to provide worker locality ([commit](https://github.com/buildbarn/bb-remote-execution/commit/902aab278baa45c92d48cad4992c445cfc588e32)). |
| 32 | +This change causes the scheduler to keep track of invocation ID of the |
| 33 | +last operation to run on a worker. When scheduling successive tasks, the |
| 34 | +scheduler will try to reuse a worker with a matching invocation ID. This |
| 35 | +increases the probability of input root cache hits, as it is not |
| 36 | +uncommon for clients to run many actions having similar input root |
| 37 | +contents. |
| 38 | + |
| 39 | +At the same time, this change introduced [the notion of 'stickiness'](https://github.com/buildbarn/bb-remote-execution/blob/6d5d7a67f5552f9b961a7e1d81c3cd77f2086004/pkg/proto/configuration/bb_scheduler/bb_scheduler.proto#L177-L196), |
| 40 | +where a cluster operator can effectively quantify the overhead of |
| 41 | +switching between actions belonging to different invocations. Such |
| 42 | +overhead could include the startup time of virtual machines, download |
| 43 | +times of container images specified in REv2 platform properties. In the |
| 44 | +case embedded hardware testing, it could include the time it takes to |
| 45 | +flash an image to EEPROM/NAND and boot from it. By using a custom |
| 46 | +`invocation.KeyExtractor` that extracts these properties from incoming |
| 47 | +execution requests, the scheduler can reorder operations in such a way |
| 48 | +that the number of restarts, reboots, flash erasure cycles, etc. is |
| 49 | +reduced. |
| 50 | + |
| 51 | +One major limitation of how this feature is implemented right now, is |
| 52 | +that invocations are a flat namespace. The scheduler only provides |
| 53 | +fairness by keying execution requests by a single property. For example, |
| 54 | +there is no way to configure the scheduler to provide the following: |
| 55 | + |
| 56 | +- Fairness by username, followed by fairness by Bazel invocation ID for builds |
| 57 | + with an equal username. |
| 58 | +- Fairness by team/project/department, followed by fairness by username. |
| 59 | +- Fairness both by correlated invocations ID and tool invocation ID. |
| 60 | +- Fairness/stickiness by worker container image, followed by fairness by |
| 61 | + Bazel invocation ID. |
| 62 | + |
| 63 | +In this ADR we propose to remove this limitation, with the goal of |
| 64 | +improving fairness of multi-tenant Buildbarn setups. |
| 65 | + |
| 66 | +# Proposed changes |
| 67 | + |
| 68 | +This ADR mainly proposes that operations no longer have a single |
| 69 | +`invocation.Key`, but a list of them. The flat map of invocations |
| 70 | +managed by `InMemoryBuildQueue` will be changed to a |
| 71 | +[trie](https://en.wikipedia.org/wiki/Trie), where every level of the |
| 72 | +trie corresponds to a single `invocation.Key`. More concretely, if the |
| 73 | +scheduler is configured to not apply any fairness whatsoever, all |
| 74 | +operations will simply be queued within the root invocation. If the |
| 75 | +`ActionRouter` is configured to always return two `invocation.Key`s, the |
| 76 | +trie will have height two, having operations queued only at the leaves. |
| 77 | + |
| 78 | +When changing this data structure, we'll also need to rework all of the |
| 79 | +algorithms that are applied against it. For example, we currently use a |
| 80 | +[binary heap](https://en.wikipedia.org/wiki/Binary_heap) to store all |
| 81 | +invocations that have one or more queued operations, which is used for |
| 82 | +selecting the next operation to run. This will be replaced with a heap |
| 83 | +of heaps that mimics the layout of the trie. These algorithms will thus |
| 84 | +all become a bit more complex than before. |
| 85 | + |
| 86 | +Fortunately, there are also some places where the algorithms become |
| 87 | +simpler. For the worker locality, we currently have to deal with the |
| 88 | +case where the last task that ran on a worker was associated with |
| 89 | +multiple invocations (due to in-flight deduplication). Or it may not be |
| 90 | +associated with any invocation if the worker was created recently. In |
| 91 | +the new model we can simply pick the |
| 92 | +[lowest common ancestor](https://en.wikipedia.org/wiki/Lowest_common_ancestor) |
| 93 | +of all invocations in case of in-flight deduplication, and pick the root |
| 94 | +invocation for newly created workers. This ensures that every worker is |
| 95 | +always associated with exactly one invocation. |
| 96 | + |
| 97 | +For `InMemoryBuildQueue` to obtain multiple `invocation.Key`s for a |
| 98 | +given operation, we will need to alter `ActionRouter` to return a list |
| 99 | +instead of a single instance. The underlying `invocation.KeyExtractor` |
| 100 | +interface can remain as is, though we can remove |
| 101 | +[`invocation.EmptyKeyExtractor`](https://github.com/buildbarn/bb-remote-execution/blob/6d5d7a67f5552f9b961a7e1d81c3cd77f2086004/pkg/scheduler/invocation/empty_key_extractor.go). |
| 102 | +The same behaviour can now be achieved by not using any key extractors |
| 103 | +at all. |
| 104 | + |
| 105 | +Finally we have to extend the worker invocation stickiness logic to work |
| 106 | +with this nested model. As every level of the invocation tree now has a |
| 107 | +different cost when switching to a different branch, we now need to let |
| 108 | +the configuration accept a list of durations, each indicating the cost |
| 109 | +at that level. |
| 110 | + |
| 111 | +All of the changes proposed above have been implemented in |
| 112 | +[this commit](https://github.com/buildbarn/bb-remote-execution/commit/744c9a65215179f19f20523acd522937a03aad56). |
| 113 | + |
| 114 | +# Future work |
| 115 | + |
| 116 | +With `invocation.EmptyKeyExtractor` removed, only |
| 117 | +`invocation.RequestMetadataKeyExtractor` remains. This type is capable |
| 118 | +of creating an `invocation.Key` based on the `tool_invocation_id`. We |
| 119 | +should add more types, such as one capable of extracting the |
| 120 | +`correlated_invocations_id` or authentication related properties (e.g., |
| 121 | +username). |
| 122 | + |
| 123 | +Even though the scheduling policy of `InMemoryBuildQueue` is fair, there |
| 124 | +is no support for [preemption](https://en.wikipedia.org/wiki/Preemption_(computing)). |
| 125 | +At no point will the scheduler actively move operations from the |
| 126 | +`EXECUTING` back to the `QUEUED` stage if it prevents other invocations |
| 127 | +from making progress. |
0 commit comments