# Buildbarn Architecture Decision Record #11: Resharding Without Downtime

Author: Benjamin Ingberg<br/>
Date: 2024-12-30

# Context

Resharding a Buildbarn cluster without downtime is today a multi-step process
consisting of the following steps:

1. Deploy new storage shards.
2. Deploy a new topology with a read fallback configuration. Configure the old
   topology as the primary and the new topology as the secondary.
3. When the topology from step 2 has propagated, swap the primary and secondary.
4. When your new shards have performed a sufficient amount of replication,
   deploy a topology without fallback configuration.
5. Once the topology from step 4 has propagated, tear down any unused shards.

This process enables live resharding of a cluster. It works because each step is
backwards compatible with the previous step. That is, accessing the blobstore
with the topology from step N-1 will resolve correctly even if some components
are already using the topology from step N.

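To illustrate why the intermediate steps remain compatible, the sketch below
models the read fallback behavior conceptually: a read that misses in the
primary topology is retried against the secondary, so blobs written under
either addressing schema stay reachable while both topologies are deployed.
The names used here (`topology`, `readFallback`, `errNotFound`) are
illustrative stand-ins, not Buildbarn's actual blob access interfaces.

```
package main

import (
	"context"
	"errors"
	"fmt"
)

// errNotFound stands in for a storage-level "blob not found" error.
var errNotFound = errors.New("blob not found")

// topology is a stand-in for a sharded storage topology: it resolves a digest
// to a blob, or reports that the blob is not stored under this topology's
// addressing schema.
type topology map[string][]byte

func (t topology) Get(_ context.Context, digest string) ([]byte, error) {
	if blob, ok := t[digest]; ok {
		return blob, nil
	}
	return nil, errNotFound
}

// readFallback mimics the read fallback configuration used in steps 2-4:
// consult the primary topology first and fall back to the secondary only when
// the blob is not found. While both topologies are deployed, a blob written
// under either addressing schema can still be resolved.
func readFallback(ctx context.Context, primary, secondary topology, digest string) ([]byte, error) {
	blob, err := primary.Get(ctx, digest)
	if errors.Is(err, errNotFound) {
		return secondary.Get(ctx, digest)
	}
	return blob, err
}

func main() {
	oldTopology := topology{"digest-a": []byte("built before resharding")}
	newTopology := topology{"digest-b": []byte("built after resharding")}

	// During the switchover both blobs resolve, regardless of which topology
	// they were written under.
	for _, digest := range []string{"digest-a", "digest-b"} {
		blob, err := readFallback(context.Background(), newTopology, oldTopology, digest)
		fmt.Println(digest, string(blob), err)
	}
}
```
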
The exact timing and method needed to perform these steps depend on how you
orchestrate your Buildbarn cluster and on your retention aims. This process
might span anywhere from an hour to several weeks.

Resharding a cluster is a rare operation, so having multiple steps to achieve it
is not inherently problematic. However, without a significant amount of
automation of the cluster's meta-state, there is a large risk of performing it
incorrectly.

# Issues During Resharding

## Non-Availability of a Secondary Set of Shards

You might not have the ability to spin up a secondary set of storage shards to
perform the switchover. This is a common situation in an on-prem environment,
where running two copies of your production environment may not be feasible.

This is not necessarily a blocker. You can reuse shards from the old topology
in your new topology. However, this risks significantly reducing your retention
time, since data must be stored according to the addressing schemas of both the
new and the old topology simultaneously.

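As a rough illustration: because two topologies currently share only a small
overlap in their addressing schemas (see below), almost every blob resolves to
a different shard under the new topology. A reused shard therefore has to hold
blobs for both its old and its new slice of the address space while both
topologies are active, which on a fixed amount of storage can roughly halve
the effective retention time for the duration of the overlap.
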
While it is possible to reduce the amount of address space that is resharded by
using drained backends, this requires advance planning.

## Topology Changes Require Restarts

Currently, the only way to modify the topology visible to an individual
Buildbarn component is to restart that component. While mounting a Kubernetes
ConfigMap as a volume allows it to reload on changes, Buildbarn programs lack
logic for dynamically reloading their blob access configuration.

A properly configured cluster can still perform rolling updates. However, there
is a trade-off between the speed of the rollout and the potential loss of
ongoing work. Since clients automatically retry when encountering errors,
losing ongoing work may be the more acceptable side of that trade-off.
Nonetheless, for clusters with very expensive long-running actions, this could
result in significant work loss.

# Improvements

## Better Overlap Between Sharding Topologies

Currently, two different sharding topologies, even if they share nodes, will
have only a small overlap between their addressing schemas. This can be
significantly improved by using a different sharding algorithm.

For this purpose we replace the implementation of
`ShardingBlobAccessConfiguration` with one that uses [Rendezvous
hashing](https://en.wikipedia.org/wiki/Rendezvous_hashing). Rendezvous hashing
is a lightweight and stateless technique for distributed hash tables. It has low
overhead and causes minimal disruption during resharding.

Sharding with Rendezvous hashing gives us the following properties:
 * Removing a shard is _guaranteed_ to only require resharding for the blobs
   that resolved to the removed shard.
 * Adding a shard will reshard any blob to the new shard with a probability of
   `weight/total_weight`.

This effectively means that adding or removing a shard triggers a predictable,
minimal amount of resharding, eliminating the need for drained backends. A
sketch of the shard selection algorithm follows the configuration message below.

```
message ShardingBlobAccessConfiguration {
  message Shard {
    // unchanged
    BlobAccessConfiguration backend = 1;
    // unchanged
    uint32 weight = 2;
  }
  // unchanged
  uint64 hash_initialization = 1;

  // Was 'shards', an array of shards to use; it has been replaced with
  // 'shard_map'.
  reserved 2;

  // NEW:
  // Shards identified by a key within the context of this sharding
  // configuration. Shards are chosen via Rendezvous hashing based on the
  // digest, weight, key and hash_initialization of the configuration.
  //
  // When removing a shard from the map it is guaranteed that only blobs
  // which resolved to the removed shard will get a different shard. When
  // adding shards there is a weight/total_weight probability that any given
  // blob will be resolved to the new shards.
  map<string, Shard> shard_map = 3;
}
```

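To make the selection concrete, here is a minimal sketch of weighted Rendezvous
hashing in Go. It is illustrative rather than a description of the actual
implementation: the hash function (FNV-1a) and the exact way
`hash_initialization`, the shard key and the digest are combined are
assumptions made to keep the example self-contained.

```
package main

import (
	"fmt"
	"hash/fnv"
	"math"
)

// shard mirrors the Shard message: a key from shard_map plus a weight. The
// backend configuration is omitted from this sketch.
type shard struct {
	key    string
	weight float64
}

// score computes the weighted rendezvous ("highest random weight") score of
// one shard for one digest. The shard with the highest score wins, which
// gives each shard a weight/total_weight share of the blobs and only remaps
// blobs of a shard when that shard is removed.
func score(hashInitialization uint64, s shard, digest string) float64 {
	// Mix the configuration seed, shard key and digest. FNV-1a is only used
	// to keep the sketch self-contained.
	h := fnv.New64a()
	fmt.Fprintf(h, "%d|%s|%s", hashInitialization, s.key, digest)
	// Map the hash to a uniform value in (0, 1) and turn it into a score
	// proportional to the shard's weight.
	u := (float64(h.Sum64()>>11) + 0.5) / (1 << 53)
	return -s.weight / math.Log(u)
}

// pickShard returns the key of the shard that a digest resolves to.
func pickShard(hashInitialization uint64, shards []shard, digest string) string {
	best, bestScore := "", math.Inf(-1)
	for _, s := range shards {
		if sc := score(hashInitialization, s, digest); sc > bestScore {
			best, bestScore = s.key, sc
		}
	}
	return best
}

func main() {
	shards := []shard{{"a", 1}, {"b", 1}, {"c", 2}}
	for _, digest := range []string{"digest-1", "digest-2", "digest-3"} {
		fmt.Println(digest, "->", pickShard(42, shards, digest))
	}
}
```
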
Other algorithms considered were [Consistent
hashing](https://en.wikipedia.org/wiki/Consistent_hashing) and
[Maglev](https://storage.googleapis.com/gweb-research2023-media/pubtools/2904.pdf).

## Handling Topology Changes Dynamically

To reduce the number of restarts required, a new blob access configuration is
added.

```
message RemotelyDefinedBlobAccessConfiguration {
  // Fetch the blob access configuration from an external service.
  buildbarn.configuration.grpc.ClientConfiguration grpc = 1;

  // Maximum grace time when receiving a new blob access configuration
  // before cancelling ongoing requests.
  //
  // Recommended value: 0s for client facing systems, 60s for internal
  // systems.
  google.protobuf.Duration maximum_grace_time = 2;
}
```

This calls a service that implements the following:

```
service RemoteBlobAccessConfiguration {
  rpc Synchronize(SynchronizeRequest) returns (SynchronizeResponse);
}

message SynchronizeRequest {
  enum StorageBackend {
    CAS = 0;
    AC = 1;
    ICAS = 2;
    ISCC = 3;
    FSAC = 4;
  }

  // An implementation-defined identifier describing the client, which the
  // remote service may take into consideration when returning the
  // BlobAccessConfiguration. This is typically used when clients are in
  // different networks and should route differently.
  string identifier = 1;

  // The storage backend for which the service should describe the topology.
  StorageBackend storage_backend = 2;

  // The current state from the perspective of the client. The client may
  // assume its current topology is correct if the service does not respond
  // with a request to change it.
  BlobAccessConfiguration current = 3;
}

message SynchronizeResponse {
  // The blob access configuration that the component should use.
  BlobAccessConfiguration desired_state = 1;

  // Latest time at which an ongoing request against the previous blob access
  // configuration should be allowed to continue before the caller cancels it
  // and returns an error. The client may cancel it before the grace_time has
  // passed.
  google.protobuf.Timestamp grace_time = 2;

  // Timestamp at which this state is considered expired. The remote blob
  // access configuration service will avoid breaking compatibility until
  // after this timestamp has passed.
  google.protobuf.Timestamp response_expiry = 3;
}
```

A simple implementation of this service could be a sidecar container that
dynamically reads a ConfigMap.

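As a sketch of that sidecar approach, the code below rereads a file mounted
from a ConfigMap on every Synchronize call. The hand-written stand-in types,
the file path and the chosen durations are assumptions for the sake of the
example; a real implementation would use the gRPC stubs generated from the
messages above and return the actual BlobAccessConfiguration.

```
package main

import (
	"context"
	"fmt"
	"os"
	"time"
)

// Stand-ins for the proto messages above; a real sidecar would use the
// generated gRPC types instead.
type SynchronizeRequest struct {
	Identifier     string
	StorageBackend string
	Current        []byte // serialized BlobAccessConfiguration
}

type SynchronizeResponse struct {
	DesiredState   []byte // serialized BlobAccessConfiguration
	GraceTime      time.Time
	ResponseExpiry time.Time
}

// configMapServer serves whatever configuration is currently present in a
// file mounted from a ConfigMap. Kubernetes updates the mounted file in place
// when the ConfigMap changes, so rereading it on every call is enough to pick
// up new topologies without restarting any component.
type configMapServer struct {
	path string
}

func (s *configMapServer) Synchronize(ctx context.Context, req *SynchronizeRequest) (*SynchronizeResponse, error) {
	desired, err := os.ReadFile(s.path)
	if err != nil {
		return nil, fmt.Errorf("failed to read blob access configuration: %w", err)
	}
	now := time.Now()
	return &SynchronizeResponse{
		DesiredState: desired,
		// Give ongoing requests a minute to finish against the old topology
		// and promise not to break compatibility for ten minutes.
		GraceTime:      now.Add(time.Minute),
		ResponseExpiry: now.Add(10 * time.Minute),
	}, nil
}

func main() {
	server := &configMapServer{path: "/config/blob_access_configuration"}
	resp, err := server.Synchronize(context.Background(), &SynchronizeRequest{
		Identifier:     "worker",
		StorageBackend: "CAS",
	})
	fmt.Println(resp, err)
}
```
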
A more complex implementation might:
 * Read Prometheus metrics and roll out updates to the correct mirror.
 * Increase the number of shards when the metrics indicate that the retention
   time has fallen below a desired value.
 * Stagger read fallback configurations which are removed automatically after
   a sufficient amount of time has passed.

Adding grace times and response expiries allows the service to set an upper
bound on when a well-behaved system should have propagated the topology change.
This allows it to make informed decisions about when to break compatibility.
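
As an illustration of how a component could combine its locally configured
maximum_grace_time with the grace_time returned by the service, the sketch
below uses the earlier of the two as a context deadline for requests that are
still running against the previous configuration. The helper `applyGraceTime`
is hypothetical and only shows the intended semantics.

```
package main

import (
	"context"
	"fmt"
	"time"
)

// applyGraceTime determines when ongoing requests against the previous blob
// access configuration must be cancelled: the earlier of the grace_time
// returned by the service and the locally configured maximum_grace_time.
func applyGraceTime(now, graceTime time.Time, maximumGraceTime time.Duration) time.Time {
	deadline := now.Add(maximumGraceTime)
	if graceTime.Before(deadline) {
		return graceTime
	}
	return deadline
}

func main() {
	// Pretend the service granted two minutes, but this is a client facing
	// component configured with a maximum grace time of zero: ongoing
	// requests are cancelled immediately.
	now := time.Now()
	deadline := applyGraceTime(now, now.Add(2*time.Minute), 0)

	// The context guarding requests against the old configuration is
	// cancelled at the computed deadline.
	ctx, cancel := context.WithDeadline(context.Background(), deadline)
	defer cancel()
	<-ctx.Done()
	fmt.Println("old configuration retired:", ctx.Err())
}
```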