# Buildbarn Architecture Decision Record #11: Resharding Without Downtime

Author: Benjamin Ingberg<br/>
Date: 2024-12-30

# Context

Resharding a Buildbarn cluster without downtime is currently a multi-step
process:

1. Deploy new storage shards.
2. Deploy a new topology with a read fallback configuration. Configure the old
   topology as the primary and the new topology as the secondary.
3. When the topology from step 2 has propagated, swap the primary and secondary.
4. When your new shards have performed a sufficient amount of replication,
   deploy a topology without a fallback configuration.
5. Once the topology from step 4 has propagated, tear down any unused shards.

This process enables live resharding of a cluster. It works because each step is
backwards compatible with the previous one: accessing the blobstore with the
topology from step N-1 resolves correctly even if some components are already
using the topology from step N.
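
As an illustration, the fallback topology from step 2 might be expressed in the
blobstore configuration roughly as follows. This is a sketch: the field names
follow bb-storage's blobstore configuration schema and should be verified
against your Buildbarn release, and `oldShards`/`newShards` stand in for your
actual shard definitions.

```
// Step 2 topology: serve reads from the old shards first, falling back to
// the new shards. Step 3 swaps primary and secondary.
local oldShards = { /* existing sharding configuration */ };
local newShards = { /* new sharding configuration */ };

{
  contentAddressableStorage: {
    readFallback: {
      primary: oldShards,
      secondary: newShards,
    },
  },
}
```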

The exact timing and method needed to perform these steps depend on how you
orchestrate your Buildbarn cluster and on your retention aims. The process might
span anywhere from an hour to several weeks.

Resharding a cluster is a rare operation, so having multiple steps to achieve it
is not inherently problematic. However, without a significant amount of
automation of the cluster's meta-state, there is a large risk of performing it
incorrectly.

# Issues During Resharding

## Non-Availability of a Secondary Set of Shards

You might not have the ability to spin up a secondary set of storage shards to
perform the switchover. This is a common situation in an on-prem environment,
where running two copies of your production environment may not be feasible.

This is not necessarily a blocker. You can reuse shards from the old topology in
your new topology. However, doing so risks significantly reducing your retention
time, since data must be stored according to the addressing schemas of both the
new and the old topology simultaneously.

While it is possible to reduce the amount of address space that is resharded by
using drained backends, this requires advance planning.

## Topology Changes Require Restarts

Currently, the only way to modify the topology visible to an individual
Buildbarn component is to restart that component. While mounting a Kubernetes
ConfigMap as a volume allows it to reload on changes, Buildbarn programs lack
logic for dynamically reloading their blob access configuration.

A properly configured cluster can still perform rolling updates. However, there
is a trade-off between the speed of the rollout and the potential loss of
ongoing work. Since clients automatically retry when encountering errors, losing
ongoing work may be the preferable trade-off. Nonetheless, for clusters with
very expensive long-running actions, it could result in significant work loss.

The number of components that need to be restarted can be reduced by routing
traffic via internal-facing frontends that are topology aware.

# Improvements

## Better Overlap Between Sharding Topologies

Currently, two different sharding topologies have only a small overlap between
their addressing schemas, even if they share nodes. This can be significantly
improved by using a different sharding algorithm.

For this purpose we replace the implementation of
`ShardingBlobAccessConfiguration` with one that uses [Rendezvous
hashing](https://en.wikipedia.org/wiki/Rendezvous_hashing). Rendezvous hashing
is a lightweight and stateless technique for distributed hash tables. It has low
overhead and causes minimal disruption during resharding.

Sharding with Rendezvous hashing gives us the following properties:
 * Removing a shard is _guaranteed_ to only require resharding for the blobs
   that resolved to the removed shard.
 * Adding a shard will reshard any blob to the new shard with a probability of
   `weight/total_weight`.

This effectively means adding or removing a shard triggers a predictable,
minimal amount of resharding, eliminating the need for drained backends.

```
message ShardingBlobAccessConfiguration {
    message Shard {
        // unchanged
        BlobAccessConfiguration backend = 1;
        // unchanged
        uint32 weight = 2;
    }
    // unchanged
    uint64 hash_initialization = 1;

    // Was 'shards', an array of shards to use; has been replaced with
    // 'shard_map'.
    reserved 2;

    // NEW:
    // Shards identified by a key within the context of this sharding
    // configuration. The key is a freeform string which describes the
    // identity of the shard in the context of the current sharding
    // configuration. Shards are chosen via Rendezvous hashing based on the
    // digest, weight, key and hash_initialization of the configuration.
    //
    // When removing a shard from the map it is guaranteed that only blobs
    // which resolved to the removed shard will get a different shard. When
    // adding shards there is a weight/total_weight probability that any given
    // blob will be resolved to the new shards.
    map<string, Shard> shard_map = 3;
}
```

Other algorithms considered were [Consistent
hashing](https://en.wikipedia.org/wiki/Consistent_hashing) and
[Maglev](https://storage.googleapis.com/gweb-research2023-media/pubtools/2904.pdf).

## Handling Topology Changes Dynamically

To reduce the number of restarts required, Buildbarn needs a construct that
describes how to reload the topology.

### RemotelyDefinedBlobAccessConfiguration

```
message RemotelyDefinedBlobAccessConfiguration {
    // Fetch the blob access configuration from an external service.
    buildbarn.configuration.grpc.ClientConfiguration endpoint = 1;

    // Duration for which to reuse the previous configuration when unable to
    // reach RemoteBlobAccessConfiguration.GetBlobAccessConfiguration, before
    // the component should consider itself partitioned from the cluster and
    // return UNAVAILABLE for any access requests.
    //
    // Recommended value: 10s
    google.protobuf.Duration remote_configuration_timeout = 2;
}
```

This configuration calls a service that implements the following:

```
service RemoteBlobAccessConfiguration {
    // A request to subscribe to the blob access configurations.
    // The service should immediately return the current blob access
    // configuration. After the initial blob access configuration the stream
    // will push out any changes to the blob access configuration.
    rpc GetBlobAccessConfiguration(GetBlobAccessConfigurationRequest) returns (stream BlobAccessConfiguration);
}

message GetBlobAccessConfigurationRequest {
    enum StorageBackend {
        CAS = 0;
        AC = 1;
        ICAS = 2;
        ISCC = 3;
        FSAC = 4;
    }
    // An implementation-defined identifier describing who the client is. The
    // remote service may take this into consideration when returning the
    // BlobAccessConfiguration. This is typically used when clients are in
    // different networks and should route differently.
    google.protobuf.Value identifier = 1;

    // The storage backend for which the service should describe the topology.
    StorageBackend storage_backend = 2;
}
```

A simple implementation of this service could be a sidecar container that
dynamically reads a ConfigMap.

A more complex implementation might:
 * Read Prometheus metrics and roll out updates to the correct mirror.
 * Increase the number of shards when the metrics indicate that the retention
   time has fallen below a desired value.
 * Stagger read fallback configurations which are removed automatically after
   a sufficient amount of time has passed.

### JsonnetDefinedBlobAccessConfiguration

```
message JsonnetDefinedBlobAccessConfiguration {
    // Path to the jsonnet file which describes the blob access configuration
    // to use. This file will be monitored for changes.
    string path = 1;

    // External variables to use when parsing the jsonnet file. These are used
    // in addition to the process environment variables, which are always used.
    map<string, string> external_variables = 2;

    // Methodology to use for subscribing to configuration file changes.
    JsonnetConfigurationSubscriptionMethod subscription_method = 3;

    // An optional path to a lockfile whose presence Buildbarn will check
    // before parsing the jsonnet file. Buildbarn will only parse the jsonnet
    // file if the lockfile does not exist.
    //
    // The lockfile is there to prevent unintentional partial updates to the
    // BlobAccessConfiguration, such as when a jsonnet file imports a
    // libsonnet file and each file gets updated separately.
    string lock_file_path = 4;
}

message JsonnetConfigurationSubscriptionMethod {
    oneof method {
        // Parse the jsonnet file continuously to monitor for any changes.
        google.protobuf.Duration polling_duration = 1;
        // Subscribe to the jsonnet file and any imported libsonnet files via
        // the inotify API.
        google.protobuf.Empty inotify = 2;
    }
}
```