# Buildbarn Architecture Decision Record #11: Resharding Without Downtime

Author: Benjamin Ingberg<br/>
Date: 2024-12-30

# Context

Resharding a Buildbarn cluster without downtime is today a multi-step process
consisting of the following steps:

1. Deploy new storage shards.
2. Deploy a new topology with a read fallback configuration. Configure the old
   topology as the primary and the new topology as the secondary.
3. When the topology from step 2 has propagated, swap the primary and secondary.
4. When your new shards have performed a sufficient amount of replication,
   deploy a topology without fallback configuration.
5. Once the topology from step 4 has propagated, tear down any unused shards.

This process enables live resharding of a cluster. It works because each step is
backwards compatible with the previous step. That is, accessing the blobstore
with the topology from step N-1 will resolve correctly even if some components
are already using the topology from step N.

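As a rough illustration of why this holds, the semantics of a read fallback
configuration can be sketched in Go. The interface and names below are
stand-ins for this sketch, not Buildbarn's actual API:

```
package blobstore

import (
	"context"
	"errors"
)

// ErrNotFound stands in for the NOT_FOUND status a storage backend
// returns for a missing blob.
var ErrNotFound = errors.New("blob not found")

// BlobAccess is a stand-in for Buildbarn's blob access interface,
// reduced to a single read operation.
type BlobAccess interface {
	Get(ctx context.Context, digest string) ([]byte, error)
}

// readFallback sketches the read fallback configuration from step 2:
// reads try the primary topology first and fall back to the secondary
// on a miss. Components still on the old topology and components
// already on the new one can therefore both resolve every blob while
// the transition is in flight.
type readFallback struct {
	primary, secondary BlobAccess
}

func (ba readFallback) Get(ctx context.Context, digest string) ([]byte, error) {
	data, err := ba.primary.Get(ctx, digest)
	if !errors.Is(err, ErrNotFound) {
		return data, err // Found, or failed with a real error.
	}
	return ba.secondary.Get(ctx, digest)
}
```
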
The exact timing and method needed to perform these steps depend on how you
orchestrate your Buildbarn cluster and on your retention aims. This process
might span anywhere from an hour to several weeks.

Resharding a cluster is a rare operation, so having multiple steps to achieve
it is not inherently problematic. However, without a significant amount of
automation of the cluster's meta-state, there is a large risk of performing it
incorrectly.

# Issues During Resharding

## Non-Availability of a Secondary Set of Shards

You might not have the ability to spin up a secondary set of storage shards to
perform the switchover. This is a common situation in an on-prem environment,
where running two copies of your production environment may not be feasible.

This is not necessarily a blocker. You can reuse shards from the old topology
in your new topology. However, this risks significantly reducing your
retention time, since data must be stored according to the addressing schemas
of both the new and the old topology simultaneously.

While it is possible to reduce the amount of address space that is resharded
by using drained backends, this requires advance planning.

## Topology Changes Require Restarts

Currently, the only way to modify the topology visible to an individual
Buildbarn component is to restart that component. While mounting a Kubernetes
ConfigMap as a volume allows it to reload on changes, Buildbarn programs lack
logic for dynamically reloading their blob access configuration.

A properly configured cluster can still perform rolling updates. However, there
is a trade-off between the speed of the rollout and the potential loss of
ongoing work. Since clients automatically retry when encountering errors,
losing ongoing work may be an acceptable trade-off. Nonetheless, for clusters
with very expensive long-running actions, this could result in significant
work loss.

# Improvements

## Better Overlap Between Sharding Topologies

Currently, two different sharding topologies, even if they share nodes, will
have only a small overlap between their addressing schemas. This can be
significantly improved by using a different sharding algorithm.

For this purpose we replace the implementation of
`ShardingBlobAccessConfiguration` with one that uses [Rendezvous
hashing](https://en.wikipedia.org/wiki/Rendezvous_hashing). Rendezvous hashing
is a lightweight and stateless technique for distributed hash tables. It has
low overhead and causes minimal disruption during resharding.

Sharding with Rendezvous hashing gives us the following properties:
* Removing a shard is _guaranteed_ to only require resharding for the blobs
  that resolved to the removed shard.
* Adding a shard will reshard any given blob to the new shard with a
  probability of `weight/total_weight`. For example, adding a fifth shard of
  equal weight to four existing ones moves roughly one fifth of all blobs to
  the new shard.

This effectively means adding or removing a shard triggers a predictable,
minimal amount of resharding, eliminating the need for drained backends.

```
message ShardingBlobAccessConfiguration {
  message Shard {
    // unchanged
    BlobAccessConfiguration backend = 1;
    // unchanged
    uint32 weight = 2;
  }
  // unchanged
  uint64 hash_initialization = 1;

  // Was 'shards', an array of shards to use; replaced with 'shard_map'.
  reserved 2;

  // NEW:
  // Shards identified by a key within the context of this sharding
  // configuration. Shards are chosen via Rendezvous hashing based on the
  // digest, weight, key and hash_initialization of the configuration.
  //
  // When removing a shard from the map it is guaranteed that only blobs
  // which resolved to the removed shard will get a different shard. When
  // adding shards there is a weight/total_weight probability that any given
  // blob will be resolved to the new shards.
  map<string, Shard> shard_map = 3;
}
```
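
To make the shard selection concrete, a weighted Rendezvous lookup could look
like the following Go sketch. This is a minimal illustration, not the actual
Buildbarn implementation; the function name and the choice of FNV-1a as the
hash are assumptions of this sketch.

```
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
	"math"
)

// Shard mirrors the Shard message above: a map key and its weight. The
// backend configuration is elided.
type Shard struct {
	Key    string
	Weight uint32
}

// pickShard scores every shard independently against the digest and
// returns the key with the highest score. Because each score depends
// only on that shard's key and weight, removing a shard only moves the
// blobs that scored highest on it, and a newly added shard wins any
// given blob with probability weight/total_weight.
func pickShard(shards []Shard, hashInitialization uint64, digest []byte) string {
	var seed [8]byte
	binary.BigEndian.PutUint64(seed[:], hashInitialization)
	bestKey, bestScore := "", math.Inf(-1)
	for _, s := range shards {
		h := fnv.New64a()
		h.Write(seed[:])
		h.Write([]byte(s.Key))
		h.Write(digest)
		// Map the top 53 bits of the hash to a float in (0, 1), then
		// apply the standard weighted Rendezvous score -w/ln(x).
		x := (float64(h.Sum64()>>11) + 0.5) / (1 << 53)
		if score := -float64(s.Weight) / math.Log(x); score > bestScore {
			bestKey, bestScore = s.Key, score
		}
	}
	return bestKey
}

func main() {
	shards := []Shard{{"a", 1}, {"b", 1}, {"c", 2}}
	fmt.Println(pickShard(shards, 42, []byte("f2ca1bb6c7e907d06dafe4687e579fce")))
}
```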

Other algorithms considered were [Consistent
hashing](https://en.wikipedia.org/wiki/Consistent_hashing) and
[Maglev](https://storage.googleapis.com/gweb-research2023-media/pubtools/2904.pdf).

## Handling Topology Changes Dynamically

To reduce the number of restarts required, a new blob access configuration is
added.

```
message RemotelyDefinedBlobAccessConfiguration {
  // Fetch the blob access configuration from an external service.
  buildbarn.configuration.grpc.ClientConfiguration grpc = 1;

  // Maximum grace time when receiving a new blob access configuration
  // before cancelling ongoing requests.
  //
  // Recommended value: 0s for client-facing systems, 60s for internal
  // systems.
  google.protobuf.Duration maximum_grace_time = 2;
}
```

This configuration calls out to a service that implements the following:

```
service RemoteBlobAccessConfiguration {
  rpc Synchronize(SynchronizeRequest) returns (SynchronizeResponse);
}

message SynchronizeRequest {
  enum StorageBackend {
    CAS = 0;
    AC = 1;
    ICAS = 2;
    ISCC = 3;
    FSAC = 4;
  }
  // An implementation-defined identifier describing who the client is,
  // which the remote service may take into consideration when returning
  // the BlobAccessConfiguration. This is typically used when clients are
  // in different networks and should route differently.
  string identifier = 1;

  // Which storage backend the service should describe the topology for.
  StorageBackend storage_backend = 2;

  // A message describing the current state from the perspective of the
  // client. The client may assume its current topology is correct if the
  // service does not respond with a request to change it.
  BlobAccessConfiguration current = 3;
}

message SynchronizeResponse {
  // A message describing the blob access configuration the component
  // should use.
  BlobAccessConfiguration desired_state = 1;

  // Latest time at which an ongoing request to the previous blob access
  // configuration should be allowed to continue before the caller should
  // cancel it and return an error. The client may cancel it before the
  // grace_time has passed.
  google.protobuf.Timestamp grace_time = 2;

  // Timestamp for when this state is considered expired. The remote blob
  // access configuration service will avoid breaking compatibility until
  // after this timestamp has passed.
  google.protobuf.Timestamp response_expiry = 3;
}
```

A simple implementation of this service could be a sidecar container that
dynamically reads a Kubernetes ConfigMap.
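
A minimal sketch of such a sidecar's Synchronize handler follows. The types
are stand-ins for the generated proto messages, and reading JSON from a
mounted file is an assumption of this sketch:

```
package sidecar

import (
	"context"
	"encoding/json"
	"os"
	"time"
)

// Stand-ins for the messages defined above, reduced to what this
// sketch needs. BlobAccessConfiguration accepts any JSON object.
type BlobAccessConfiguration map[string]any

type SynchronizeRequest struct {
	Identifier string
	Current    BlobAccessConfiguration
}

type SynchronizeResponse struct {
	DesiredState   BlobAccessConfiguration
	GraceTime      time.Time
	ResponseExpiry time.Time
}

// configMapServer serves whatever configuration is currently present in
// a file mounted from a ConfigMap. Kubernetes rewrites the mounted file
// when the ConfigMap changes, so rereading it on every call is enough
// to pick up new topologies.
type configMapServer struct {
	path string
}

func (s *configMapServer) Synchronize(ctx context.Context, req *SynchronizeRequest) (*SynchronizeResponse, error) {
	data, err := os.ReadFile(s.path)
	if err != nil {
		return nil, err
	}
	var desired BlobAccessConfiguration
	if err := json.Unmarshal(data, &desired); err != nil {
		return nil, err
	}
	now := time.Now()
	return &SynchronizeResponse{
		DesiredState: desired,
		// Arbitrary choices for this sketch: let ongoing requests
		// drain for 30 seconds and promise not to break compatibility
		// for five minutes.
		GraceTime:      now.Add(30 * time.Second),
		ResponseExpiry: now.Add(5 * time.Minute),
	}, nil
}
```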

A more complex implementation might:
* Read Prometheus metrics and roll out updates to the correct mirror.
* Increase the number of shards when the metrics indicate that the retention
  time has fallen below a desired value.
* Stagger read fallback configurations, removing them automatically after a
  sufficient amount of time has passed.

Adding grace times and response expiries allows the service to set an upper
bound on when a well-behaved system should have propagated a topology change.
This allows it to make informed decisions about when to break compatibility.
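
On the client side, the grace-time handling could reduce to something like the
following sketch; the function name is hypothetical and only the field
semantics come from the messages above:

```
package main

import (
	"context"
	"fmt"
	"time"
)

// applyNewTopology is what a component might do when Synchronize
// returns a different desired state: new requests switch to the new
// configuration immediately, while requests already running against
// the old topology are cancelled at the earlier of the service's
// grace_time and the locally configured maximum_grace_time.
func applyNewTopology(graceTime time.Time, maximumGraceTime time.Duration,
	cancelOldRequests context.CancelFunc) {
	deadline := graceTime
	if localCap := time.Now().Add(maximumGraceTime); localCap.Before(deadline) {
		deadline = localCap
	}
	time.AfterFunc(time.Until(deadline), cancelOldRequests)
}

func main() {
	// All requests against the outgoing topology share this context.
	oldCtx, cancel := context.WithCancel(context.Background())
	applyNewTopology(time.Now().Add(30*time.Second), 60*time.Second, cancel)
	fmt.Println("old topology still draining:", oldCtx.Err() == nil)
}
```

With a `maximum_grace_time` of 0s, as recommended for client-facing systems,
requests on the old topology are cancelled immediately and rely on the
client's own retries.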
