Commit 775c8da: Add a new ADR for resharding improvements
1 parent c1016eb. 1 file changed: 218 additions, 0 deletions.

# Buildbarn Architecture Decision Record #11: Resharding Without Downtime

Author: Benjamin Ingberg<br/>
Date: 2024-12-30

# Context

Resharding a Buildbarn cluster without downtime is today a multi-step process
that can be performed with the following steps:

1. Deploy new storage shards.
2. Deploy a new topology with a read fallback configuration. Configure the old
   topology as the primary and the new topology as the secondary.
3. When the topology from step 2 has propagated, swap the primary and secondary.
4. When your new shards have performed a sufficient amount of replication,
   deploy a topology without a fallback configuration.
5. Once the topology from step 4 has propagated, tear down any unused shards.

This process enables live resharding of a cluster. It works because each step is
backwards compatible with the previous step. That is, accessing the blobstore
with the topology from step N-1 will resolve correctly even if some components
are already using the topology from step N.

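To make this concrete, the sketch below shows the read fallback behaviour the
intermediate steps rely on, written against a deliberately simplified Go
`BlobAccess` interface. The names and the interface are illustrative stand-ins,
not the actual bb-storage implementation.

```
package resharding

import (
	"context"
	"errors"
)

// ErrNotFound is a stand-in for a NOT_FOUND status returned by a backend.
var ErrNotFound = errors.New("blob not found")

// BlobAccess is a reduced stand-in for the blob access abstraction: it can
// only fetch a blob by its digest.
type BlobAccess interface {
	Get(ctx context.Context, digest string) ([]byte, error)
}

// readFallbackBlobAccess serves reads from the primary topology and only
// consults the secondary when the blob is missing. While resharding is in
// progress, reads that miss in one topology are retried against the other,
// so blobs stored under either addressing schema remain readable.
type readFallbackBlobAccess struct {
	primary   BlobAccess
	secondary BlobAccess
}

func (ba *readFallbackBlobAccess) Get(ctx context.Context, digest string) ([]byte, error) {
	data, err := ba.primary.Get(ctx, digest)
	if errors.Is(err, ErrNotFound) {
		return ba.secondary.Get(ctx, digest)
	}
	return data, err
}
```
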
The exact timing and method needed to perform these steps depend on how you
orchestrate your Buildbarn cluster and on your retention aims. This process
might span anywhere from an hour to several weeks.

Resharding a cluster is a rare operation, so having multiple steps to achieve it
is not inherently problematic. However, without a significant amount of
automation of the cluster's meta-state, there is a large risk of performing it
incorrectly.

# Issues During Resharding

## Non-Availability of a Secondary Set of Shards

You might not have the ability to spin up a secondary set of storage shards to
perform the switchover. This is a common situation in an on-prem environment,
where running two copies of your production environment may not be feasible.

This is not necessarily a blocker. You can reuse shards from the old topology
in your new topology. However, this risks significantly reducing your retention
time, since data must be stored according to the addressing schemas of both the
new and the old topology simultaneously.

While it is possible to reduce the amount of address space that is resharded by
using drained backends, this requires advance planning.

## Topology Changes Require Restarts

Currently, the only way to modify the topology visible to an individual
Buildbarn component is to restart that component. While mounting a Kubernetes
ConfigMap as a volume allows it to be reloaded on changes, Buildbarn programs
lack the logic for dynamically reloading their blob access configuration.

A properly configured cluster can still perform rolling updates. However, there
is a trade-off between the speed of the rollout and the potential loss of
ongoing work. Since clients automatically retry when they encounter errors,
losing ongoing work may be the preferable trade-off. Nonetheless, for clusters
with very expensive long-running actions, this could result in significant work
loss.

The number of restarted components can be reduced by routing traffic via
internal-facing frontends that are topology aware.

# Improvements

## Better Overlap Between Sharding Topologies

Currently, two different sharding topologies will only have a small overlap
between their addressing schemas, even if they share nodes. This can be
significantly improved by using a different sharding algorithm.

For this purpose we replace the implementation of
`ShardingBlobAccessConfiguration` with one that uses [Rendezvous
hashing](https://en.wikipedia.org/wiki/Rendezvous_hashing). Rendezvous hashing
is a lightweight and stateless technique for distributed hash tables. It has
low overhead and causes minimal disruption during resharding.

Sharding with Rendezvous hashing gives us the following properties:

* Removing a shard is _guaranteed_ to only require resharding for the blobs
  that resolved to the removed shard.
* Adding a shard will reshard any blob to the new shard with a probability of
  `weight/total_weight`.

This effectively means adding or removing a shard triggers a predictable,
minimal amount of resharding, eliminating the need for drained backends.

```
message ShardingBlobAccessConfiguration {
  message Shard {
    // unchanged
    BlobAccessConfiguration backend = 1;
    // unchanged
    uint32 weight = 2;
  }
  // unchanged
  uint64 hash_initialization = 1;

  // Was 'shards', an array of shards to use. It has been replaced with
  // 'shard_map'.
  reserved 2;

  // NEW:
  // Shards identified by a key within the context of this sharding
  // configuration. The key is a freeform string which describes the identity
  // of the shard in the context of the current sharding configuration.
  // Shards are chosen via Rendezvous hashing based on the digest, weight,
  // key and hash_initialization of the configuration.
  //
  // When removing a shard from the map it is guaranteed that only blobs
  // which resolved to the removed shard will get a different shard. When
  // adding shards there is a weight/total_weight probability that any given
  // blob will be resolved to the new shards.
  map<string, Shard> shard_map = 3;
}
```

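To illustrate the intended selection behaviour, the following Go sketch shows
one way a shard could be picked from `shard_map`. The hash and scoring function
used here (FNV-1a feeding a weighted rendezvous score) are illustrative choices,
not a wire-compatible specification; only the two properties listed above are
required of the real implementation.

```
package resharding

import (
	"encoding/binary"
	"hash/fnv"
	"math"
)

// Shard mirrors an entry of the proposed shard_map: a stable key plus a weight.
type Shard struct {
	Key    string
	Weight uint32
}

// PickShard returns the shard that wins the rendezvous election for a blob.
// Every (shard, digest) pair gets an independent pseudo-random draw, and the
// draw is weighted so that a shard wins any given blob with probability
// weight/total_weight.
func PickShard(shards []Shard, hashInitialization uint64, digest []byte) Shard {
	var best Shard
	bestScore := math.Inf(-1)
	for _, s := range shards {
		h := fnv.New64a()
		var seed [8]byte
		binary.LittleEndian.PutUint64(seed[:], hashInitialization)
		h.Write(seed[:])
		h.Write([]byte(s.Key))
		h.Write(digest)
		// Map the 64-bit hash to a uniform float in (0, 1).
		u := (float64(h.Sum64()>>11) + 0.5) / (1 << 53)
		// Weighted rendezvous score: the highest score wins.
		score := -float64(s.Weight) / math.Log(u)
		if score > bestScore {
			best, bestScore = s, score
		}
	}
	return best
}
```

Because each shard's score for a blob is computed independently of the other
shards, removing an entry from `shard_map` only moves the blobs that entry had
won, and a new entry claims any given blob with probability
`weight/total_weight`.
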
Other algorithms considered were [Consistent
hashing](https://en.wikipedia.org/wiki/Consistent_hashing) and
[Maglev](https://storage.googleapis.com/gweb-research2023-media/pubtools/2904.pdf).

## Handling Topology Changes Dynamically

To reduce the number of restarts required, Buildbarn needs a construct that
describes how to reload the topology.

### RemotelyDefinedBlobAccessConfiguration

```
message RemotelyDefinedBlobAccessConfiguration {
  // Fetch the blob access configuration from an external service.
  buildbarn.configuration.grpc.ClientConfiguration endpoint = 1;

  // Duration for which the previously received configuration may be reused
  // when RemoteBlobAccessConfiguration.GetBlobAccessConfiguration cannot be
  // reached. After this duration the component should consider itself
  // partitioned from the cluster and return UNAVAILABLE for any access
  // requests.
  //
  // Recommended value: 10s
  google.protobuf.Duration remote_configuration_timeout = 2;
}
```

This configuration calls a service that implements the following:

```
service RemoteBlobAccessConfiguration {
  // A request to subscribe to the blob access configurations.
  // The service should immediately return the current blob access
  // configuration. After the initial blob access configuration the stream
  // will push out any changes to the blob access configuration.
  rpc GetBlobAccessConfiguration(GetBlobAccessConfigurationRequest) returns (stream BlobAccessConfiguration);
}

message GetBlobAccessConfigurationRequest {
  enum StorageBackend {
    CAS = 0;
    AC = 1;
    ICAS = 2;
    ISCC = 3;
    FSAC = 4;
  }

  // An implementation defined identifier describing who the client is. The
  // remote service may take this into consideration when returning the
  // BlobAccessConfiguration. This is typically used when clients are in
  // different networks and should route differently.
  google.protobuf.Value identifier = 1;

  // Which storage backend the service should describe the topology for.
  StorageBackend storage_backend = 2;
}
```

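On the consuming side, the remote_configuration_timeout semantics could look
roughly like the Go sketch below. The gRPC stream handling and the generated
message types are omitted; `remoteConfigurationTracker` and its methods are
illustrative names, not part of any existing Buildbarn codebase.

```
package resharding

import (
	"sync"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// BlobAccessConfiguration stands in for the protobuf message of the same name.
type BlobAccessConfiguration struct{}

// remoteConfigurationTracker caches the topology pushed by
// GetBlobAccessConfiguration and implements the remote_configuration_timeout
// behaviour: keep serving with the last received configuration for a while,
// then treat the component as partitioned from the cluster.
type remoteConfigurationTracker struct {
	mu                sync.Mutex
	current           *BlobAccessConfiguration
	disconnectedSince time.Time
	timeout           time.Duration
}

// Apply is called for every configuration received on the stream.
func (t *remoteConfigurationTracker) Apply(cfg *BlobAccessConfiguration) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.current = cfg
	t.disconnectedSince = time.Time{} // the stream is evidently healthy
}

// MarkDisconnected is called when the stream breaks, before reconnecting.
func (t *remoteConfigurationTracker) MarkDisconnected() {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.disconnectedSince.IsZero() {
		t.disconnectedSince = time.Now()
	}
}

// Current returns the configuration to use for an access request, or
// UNAVAILABLE once the cached configuration is considered stale.
func (t *remoteConfigurationTracker) Current() (*BlobAccessConfiguration, error) {
	t.mu.Lock()
	defer t.mu.Unlock()
	stale := !t.disconnectedSince.IsZero() && time.Since(t.disconnectedSince) > t.timeout
	if t.current == nil || stale {
		return nil, status.Error(codes.Unavailable, "partitioned from the blob access configuration service")
	}
	return t.current, nil
}
```
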
A simple implementation of this service could be a sidecar container that
dynamically reads a ConfigMap.

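A rough sketch of such a sidecar is shown below. It assumes that stubs have been
generated from the messages above into a hypothetical `pb` package, and that the
ConfigMap holds the configuration as protobuf JSON; the import path, port, file
path and polling interval are placeholders.

```
package main

import (
	"log"
	"net"
	"os"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/protobuf/encoding/protojson"

	// Hypothetical stubs generated from the service and messages above.
	pb "example.com/remoteblobaccess/proto"
)

type configServer struct {
	pb.UnimplementedRemoteBlobAccessConfigurationServer
	path string // mount point of the ConfigMap key holding the configuration
}

func (s *configServer) GetBlobAccessConfiguration(
	req *pb.GetBlobAccessConfigurationRequest,
	stream pb.RemoteBlobAccessConfiguration_GetBlobAccessConfigurationServer,
) error {
	var lastSent []byte
	for {
		data, err := os.ReadFile(s.path)
		if err != nil {
			return err
		}
		if string(data) != string(lastSent) {
			// Send the configuration immediately on subscription and then
			// again every time the file content changes.
			var cfg pb.BlobAccessConfiguration
			if err := protojson.Unmarshal(data, &cfg); err != nil {
				return err
			}
			if err := stream.Send(&cfg); err != nil {
				return err
			}
			lastSent = data
		}
		select {
		case <-stream.Context().Done():
			return stream.Context().Err()
		case <-time.After(5 * time.Second):
		}
	}
}

func main() {
	lis, err := net.Listen("tcp", ":8981")
	if err != nil {
		log.Fatal(err)
	}
	s := grpc.NewServer()
	pb.RegisterRemoteBlobAccessConfigurationServer(s, &configServer{path: "/config/blobstore.json"})
	log.Fatal(s.Serve(lis))
}
```
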
A more complex implementation might:

* Read Prometheus metrics and roll out updates to the correct mirror.
* Increase the number of shards when the metrics indicate that the retention
  time has fallen below a desired value.
* Stagger read fallback configurations which are removed automatically after a
  sufficient amount of time has passed.

### JsonnetDefinedBlobAccessConfiguration

```
message JsonnetDefinedBlobAccessConfiguration {
  // Path to the jsonnet file which describes the blob access configuration
  // to use. This will be monitored for changes.
  string path = 1;

  // External variables to use when parsing the jsonnet file. These are used
  // in addition to the process environment variables, which are always used.
  map<string, string> external_variables = 2;

  // Methodology to use for subscribing to configuration file changes.
  JsonnetConfigurationSubscriptionMethod subscription_method = 3;

  // An optional path to a lock file whose presence Buildbarn will check
  // before parsing the jsonnet file. Buildbarn will only parse the jsonnet
  // file if the lock file does not exist.
  //
  // The lock file is there to prevent unintentional partial updates to the
  // BlobAccessConfiguration, such as when a jsonnet file imports a libsonnet
  // file and the two files get updated separately.
  string lock_file_path = 4;
}

message JsonnetConfigurationSubscriptionMethod {
  oneof method {
    // Parse the jsonnet file continuously to monitor for any changes.
    google.protobuf.Duration polling_duration = 1;
    // Subscribe to the jsonnet file and any imported libsonnet files via
    // the inotify API.
    google.protobuf.Empty inotify = 2;
  }
}
```
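
As an illustration, the polling variant together with the lock file check could
be implemented along the lines of the Go sketch below. The
`github.com/google/go-jsonnet` library is real; the function name, paths and
callback are placeholders.

```
package main

import (
	"errors"
	"log"
	"os"
	"time"

	"github.com/google/go-jsonnet"
)

// watchJsonnetConfiguration re-evaluates the jsonnet file at a fixed interval,
// skips evaluation while the lock file exists, and hands changed results to a
// callback that would rebuild the blob access topology.
func watchJsonnetConfiguration(
	path, lockFilePath string,
	externalVariables map[string]string,
	pollingDuration time.Duration,
	apply func(configJSON string),
) {
	var lastJSON string
	for {
		time.Sleep(pollingDuration)

		// Respect the lock file: a writer that updates the jsonnet file and
		// its imported libsonnet files creates it first and removes it once
		// all files are consistent again.
		if _, err := os.Stat(lockFilePath); err == nil {
			continue
		} else if !errors.Is(err, os.ErrNotExist) {
			log.Printf("checking lock file: %v", err)
			continue
		}

		vm := jsonnet.MakeVM()
		for key, value := range externalVariables {
			vm.ExtVar(key, value)
		}
		configJSON, err := vm.EvaluateFile(path)
		if err != nil {
			log.Printf("evaluating %s: %v", path, err)
			continue
		}
		if configJSON != lastJSON {
			// Only push the configuration downstream when it actually changed.
			apply(configJSON)
			lastJSON = configJSON
		}
	}
}

func main() {
	watchJsonnetConfiguration(
		"/config/blobstore.jsonnet",
		"/config/blobstore.lock",
		map[string]string{"cluster": "prod"},
		10*time.Second,
		func(configJSON string) { log.Printf("new blob access configuration: %s", configJSON) },
	)
}
```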
