# Buildbarn Architecture Decision Record #11: Resharding Without Downtime

Author: Benjamin Ingberg<br/>
Date: 2024-12-30

# Context

Resharding a Buildbarn cluster without downtime is today a multi-step process
consisting of the following steps:

1. Deploy new storage shards.
2. Deploy a new topology with a read fallback configuration. Configure the old
   topology as the primary and the new topology as the secondary.
3. When the topology from step 2 has propagated, swap the primary and secondary.
4. When your new shards have performed a sufficient amount of replication,
   deploy a topology without fallback configuration.
5. Once the topology from step 4 has propagated, tear down any unused shards.

This process enables live resharding of a cluster. It works because each step is
backwards compatible with the previous step. That is, accessing the blobstore
with the topology from step N-1 will resolve correctly even if some components
are already using the topology from step N.

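To illustrate why the intermediate steps remain compatible, the sketch below
models the read fallback behavior conceptually: a read that misses in the
primary topology is retried against the secondary, so blobs written under
either addressing schema stay reachable while both topologies are deployed.
The names used here (`topology`, `readFallback`, `errNotFound`) are
illustrative stand-ins, not Buildbarn's actual blob access interfaces.

```
package main

import (
	"context"
	"errors"
	"fmt"
)

// errNotFound stands in for a storage-level "blob not found" error.
var errNotFound = errors.New("blob not found")

// topology is a stand-in for a sharded storage topology: it resolves a digest
// to a blob, or reports that the blob is not stored under this topology's
// addressing schema.
type topology map[string][]byte

func (t topology) Get(_ context.Context, digest string) ([]byte, error) {
	if blob, ok := t[digest]; ok {
		return blob, nil
	}
	return nil, errNotFound
}

// readFallback mimics the read fallback configuration used in steps 2-4:
// consult the primary topology first and fall back to the secondary only when
// the blob is not found. While both topologies are deployed, a blob written
// under either addressing schema can still be resolved.
func readFallback(ctx context.Context, primary, secondary topology, digest string) ([]byte, error) {
	blob, err := primary.Get(ctx, digest)
	if errors.Is(err, errNotFound) {
		return secondary.Get(ctx, digest)
	}
	return blob, err
}

func main() {
	oldTopology := topology{"digest-a": []byte("built before resharding")}
	newTopology := topology{"digest-b": []byte("built after resharding")}

	// During the switchover both blobs resolve, regardless of which topology
	// they were written under.
	for _, digest := range []string{"digest-a", "digest-b"} {
		blob, err := readFallback(context.Background(), newTopology, oldTopology, digest)
		fmt.Println(digest, string(blob), err)
	}
}
```
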
The exact timing and method needed to perform these steps depend on how you
orchestrate your Buildbarn cluster and on your retention aims. This process
might span anywhere from an hour to several weeks.

Resharding a cluster is a rare operation, so having multiple steps to achieve it
is not inherently problematic. However, without a significant amount of
automation of the cluster's meta-state, there is a large risk of performing it
incorrectly.

# Issues During Resharding

## Non-Availability of a Secondary Set of Shards

You might not have the ability to spin up a secondary set of storage shards to
perform the switchover. This is a common situation in an on-prem environment,
where running two copies of your production environment may not be feasible.

This is not necessarily a blocker. You can reuse shards from the old topology
in your new topology. However, this risks significantly reducing your retention
time, since data must be stored according to the addressing schemas of both the
new and the old topology simultaneously.

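As a rough illustration: because two topologies currently share only a small
overlap in their addressing schemas (see below), almost every blob resolves to
a different shard under the new topology. A reused shard therefore has to hold
blobs for both its old and its new slice of the address space while both
topologies are active, which on a fixed amount of storage can roughly halve
the effective retention time for the duration of the overlap.
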
While it is possible to reduce the amount of address space that is resharded by
using drained backends, this requires advance planning.

## Topology Changes Require Restarts

Currently, the only way to modify the topology visible to an individual
Buildbarn component is to restart that component. While mounting a Kubernetes
ConfigMap as a volume allows it to reload on changes, Buildbarn programs lack
logic for dynamically reloading their blob access configuration.

A properly configured cluster can still perform rolling updates. However, there
is a trade-off between the speed of the rollout and the potential loss of
ongoing work. Since clients automatically retry when encountering errors,
losing ongoing work may be the more acceptable side of that trade-off.
Nonetheless, for clusters with very expensive long-running actions, this could
result in significant work loss.

# Improvements

## Better Overlap Between Sharding Topologies

Currently, two different sharding topologies, even if they share nodes, will
have only a small overlap between their addressing schemas. This can be
significantly improved by using a different sharding algorithm.

For this purpose we replace the implementation of
`ShardingBlobAccessConfiguration` with one that uses [Rendezvous
hashing](https://en.wikipedia.org/wiki/Rendezvous_hashing). Rendezvous hashing
is a lightweight and stateless technique for distributed hash tables. It has low
overhead and causes minimal disruption during resharding.

Sharding with Rendezvous hashing gives us the following properties:
 * Removing a shard is _guaranteed_ to only require resharding for the blobs
   that resolved to the removed shard.
 * Adding a shard will reshard any blob to the new shard with a probability of
   `weight/total_weight`.

This effectively means that adding or removing a shard triggers a predictable,
minimal amount of resharding, eliminating the need for drained backends. A
sketch of the shard selection algorithm follows the configuration message below.

```
message ShardingBlobAccessConfiguration {
  message Shard {
    // unchanged
    BlobAccessConfiguration backend = 1;
    // unchanged
    uint32 weight = 2;
  }
  // unchanged
  uint64 hash_initialization = 1;

  // Was 'shards', an array of shards to use; it has been replaced with
  // 'shard_map'.
  reserved 2;

  // NEW:
  // Shards identified by a key within the context of this sharding
  // configuration. Shards are chosen via Rendezvous hashing based on the
  // digest, weight, key and hash_initialization of the configuration.
  //
  // When removing a shard from the map it is guaranteed that only blobs
  // which resolved to the removed shard will get a different shard. When
  // adding shards there is a weight/total_weight probability that any given
  // blob will be resolved to the new shards.
  map<string, Shard> shard_map = 3;
}
```

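To make the selection concrete, here is a minimal sketch of weighted Rendezvous
hashing in Go. It is illustrative rather than a description of the actual
implementation: the hash function (FNV-1a) and the exact way
`hash_initialization`, the shard key and the digest are combined are
assumptions made to keep the example self-contained.

```
package main

import (
	"fmt"
	"hash/fnv"
	"math"
)

// shard mirrors the Shard message: a key from shard_map plus a weight. The
// backend configuration is omitted from this sketch.
type shard struct {
	key    string
	weight float64
}

// score computes the weighted rendezvous ("highest random weight") score of
// one shard for one digest. The shard with the highest score wins, which
// gives each shard a weight/total_weight share of the blobs and only remaps
// blobs of a shard when that shard is removed.
func score(hashInitialization uint64, s shard, digest string) float64 {
	// Mix the configuration seed, shard key and digest. FNV-1a is only used
	// to keep the sketch self-contained.
	h := fnv.New64a()
	fmt.Fprintf(h, "%d|%s|%s", hashInitialization, s.key, digest)
	// Map the hash to a uniform value in (0, 1) and turn it into a score
	// proportional to the shard's weight.
	u := (float64(h.Sum64()>>11) + 0.5) / (1 << 53)
	return -s.weight / math.Log(u)
}

// pickShard returns the key of the shard that a digest resolves to.
func pickShard(hashInitialization uint64, shards []shard, digest string) string {
	best, bestScore := "", math.Inf(-1)
	for _, s := range shards {
		if sc := score(hashInitialization, s, digest); sc > bestScore {
			best, bestScore = s.key, sc
		}
	}
	return best
}

func main() {
	shards := []shard{{"a", 1}, {"b", 1}, {"c", 2}}
	for _, digest := range []string{"digest-1", "digest-2", "digest-3"} {
		fmt.Println(digest, "->", pickShard(42, shards, digest))
	}
}
```
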
Other algorithms considered were [Consistent
hashing](https://en.wikipedia.org/wiki/Consistent_hashing) and
[Maglev](https://storage.googleapis.com/gweb-research2023-media/pubtools/2904.pdf).

## Handling Topology Changes Dynamically

To reduce the number of restarts required, a new blob access configuration is
added.

```
message RemotelyDefinedBlobAccessConfiguration {
  // Fetch the blob access configuration from an external service.
  buildbarn.configuration.grpc.ClientConfiguration grpc = 1;

  // Maximum grace time when receiving a new blob access configuration
  // before cancelling ongoing requests.
  //
  // Recommended value: 0s for client facing systems, 60s for internal
  // systems.
  google.protobuf.Duration maximum_grace_time = 2;
}
```

This calls a service that implements the following:

```
service RemoteBlobAccessConfiguration {
  rpc Synchronize(SynchronizeRequest) returns (SynchronizeResponse);
}

message SynchronizeRequest {
  enum StorageBackend {
    CAS = 0;
    AC = 1;
    ICAS = 2;
    ISCC = 3;
    FSAC = 4;
  }

  // An implementation-defined identifier describing the client, which the
  // remote service may take into consideration when returning the
  // BlobAccessConfiguration. This is typically used when clients are in
  // different networks and should route differently.
  string identifier = 1;

  // The storage backend for which the service should describe the topology.
  StorageBackend storage_backend = 2;

  // The current state from the perspective of the client. The client may
  // assume its current topology is correct if the service does not respond
  // with a request to change it.
  BlobAccessConfiguration current = 3;
}

message SynchronizeResponse {
  // The blob access configuration that the component should use.
  BlobAccessConfiguration desired_state = 1;

  // Latest time at which an ongoing request against the previous blob access
  // configuration should be allowed to continue before the caller cancels it
  // and returns an error. The client may cancel it before the grace_time has
  // passed.
  google.protobuf.Timestamp grace_time = 2;

  // Timestamp at which this state is considered expired. The remote blob
  // access configuration service will avoid breaking compatibility until
  // after this timestamp has passed.
  google.protobuf.Timestamp response_expiry = 3;
}
```

A simple implementation of this service could be a sidecar container that
dynamically reads a ConfigMap.

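As a sketch of that sidecar approach, the code below rereads a file mounted
from a ConfigMap on every Synchronize call. The hand-written stand-in types,
the file path and the chosen durations are assumptions for the sake of the
example; a real implementation would use the gRPC stubs generated from the
messages above and return the actual BlobAccessConfiguration.

```
package main

import (
	"context"
	"fmt"
	"os"
	"time"
)

// Stand-ins for the proto messages above; a real sidecar would use the
// generated gRPC types instead.
type SynchronizeRequest struct {
	Identifier     string
	StorageBackend string
	Current        []byte // serialized BlobAccessConfiguration
}

type SynchronizeResponse struct {
	DesiredState   []byte // serialized BlobAccessConfiguration
	GraceTime      time.Time
	ResponseExpiry time.Time
}

// configMapServer serves whatever configuration is currently present in a
// file mounted from a ConfigMap. Kubernetes updates the mounted file in place
// when the ConfigMap changes, so rereading it on every call is enough to pick
// up new topologies without restarting any component.
type configMapServer struct {
	path string
}

func (s *configMapServer) Synchronize(ctx context.Context, req *SynchronizeRequest) (*SynchronizeResponse, error) {
	desired, err := os.ReadFile(s.path)
	if err != nil {
		return nil, fmt.Errorf("failed to read blob access configuration: %w", err)
	}
	now := time.Now()
	return &SynchronizeResponse{
		DesiredState: desired,
		// Give ongoing requests a minute to finish against the old topology
		// and promise not to break compatibility for ten minutes.
		GraceTime:      now.Add(time.Minute),
		ResponseExpiry: now.Add(10 * time.Minute),
	}, nil
}

func main() {
	server := &configMapServer{path: "/config/blob_access_configuration"}
	resp, err := server.Synchronize(context.Background(), &SynchronizeRequest{
		Identifier:     "worker",
		StorageBackend: "CAS",
	})
	fmt.Println(resp, err)
}
```
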
A more complex implementation might:
 * Read Prometheus metrics and roll out updates to the correct mirror.
 * Increase the number of shards when the metrics indicate that the retention
   time has fallen below a desired value.
 * Stagger read fallback configurations which are removed automatically after
   a sufficient amount of time has passed.

Adding grace times and response expiries allows the service to set an upper
bound on when a well-behaved system should have propagated the topology change.
This allows it to make informed decisions about when to break compatibility.
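
As an illustration of how a component could combine its locally configured
maximum_grace_time with the grace_time returned by the service, the sketch
below uses the earlier of the two as a context deadline for requests that are
still running against the previous configuration. The helper `applyGraceTime`
is hypothetical and only shows the intended semantics.

```
package main

import (
	"context"
	"fmt"
	"time"
)

// applyGraceTime determines when ongoing requests against the previous blob
// access configuration must be cancelled: the earlier of the grace_time
// returned by the service and the locally configured maximum_grace_time.
func applyGraceTime(now, graceTime time.Time, maximumGraceTime time.Duration) time.Time {
	deadline := now.Add(maximumGraceTime)
	if graceTime.Before(deadline) {
		return graceTime
	}
	return deadline
}

func main() {
	// Pretend the service granted two minutes, but this is a client facing
	// component configured with a maximum grace time of zero: ongoing
	// requests are cancelled immediately.
	now := time.Now()
	deadline := applyGraceTime(now, now.Add(2*time.Minute), 0)

	// The context guarding requests against the old configuration is
	// cancelled at the computed deadline.
	ctx, cancel := context.WithDeadline(context.Background(), deadline)
	defer cancel()
	<-ctx.Done()
	fmt.Println("old configuration retired:", ctx.Err())
}
```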