# Buildbarn Architecture Decision Record #11: Resharding Without Downtime

Author: Benjamin Ingberg<br/>
Date: 2024-12-30

# Context

Resharding a Buildbarn cluster without downtime is today a multi-step process:

1. Deploy new storage shards.
2. Deploy a new topology with a read fallback configuration. Configure the old
   topology as the primary and the new topology as the secondary.
3. When the topology from step 2 has propagated, swap the primary and secondary.
4. When your new shards have performed a sufficient amount of replication,
   deploy a topology without a fallback configuration.
5. Once the topology from step 4 has propagated, tear down any unused shards.

This process enables live resharding of a cluster. It works because each step
is backwards compatible with the previous one: accessing the blobstore with the
topology from step N-1 will resolve correctly even if some components are
already using the topology from step N.

The exact timing and method needed to perform these steps depend on how you
orchestrate your Buildbarn cluster and on your retention aims. The process
might span anywhere from an hour to several weeks.

Resharding a cluster is a rare operation, so having multiple steps to achieve
it is not inherently problematic. However, without a significant amount of
automation of the cluster's meta-state, there is a large risk of performing it
incorrectly.

# Issues During Resharding

## Non-Availability of a Secondary Set of Shards

You might not have the ability to spin up a secondary set of storage shards to
perform the switchover. This is a common situation in an on-prem environment,
where running two copies of your production environment may not be feasible.

This is not necessarily a blocker. You can reuse shards from the old topology
in your new topology. However, doing so risks significantly reducing your
retention time, since data must be stored according to the addressing schemas
of both the new and the old topology simultaneously: a shard holding blobs for
two schemas at once can dedicate only part of its space to each.

While it is possible to reduce the amount of address space that is resharded
by using drained backends, this requires advance planning.

## Topology Changes Require Restarts

Currently, the only way to modify the topology visible to an individual
Buildbarn component is to restart that component. While mounting a Kubernetes
ConfigMap as a volume allows it to reload on changes, Buildbarn programs lack
the logic for dynamically reloading their blob access configuration.

A properly configured cluster can still perform rolling updates. However, there
is a trade-off between the speed of the rollout and the potential loss of
ongoing work. Since clients automatically retry when they encounter errors,
losing ongoing work may be an acceptable cost. Nonetheless, for clusters with
very expensive long-running actions, it can result in significant work loss.

The number of components that need to be restarted can be reduced by routing
traffic via internal-facing frontends which are topology aware.

# Improvements

## Better Overlap Between Sharding Topologies

Currently, two different sharding topologies, even if they share nodes, will
have only a small overlap between their addressing schemas. This can be
significantly improved by using a different sharding algorithm.

For this purpose we replace the implementation of
`ShardingBlobAccessConfiguration` with one that uses [Rendezvous
hashing](https://en.wikipedia.org/wiki/Rendezvous_hashing). Rendezvous hashing
is a lightweight and stateless technique for distributed hash tables. It has
low overhead and causes minimal disruption during resharding.

Sharding with Rendezvous hashing gives us the following properties:

* Removing a shard is _guaranteed_ to only require resharding for the blobs
  that resolved to the removed shard.
* Adding a shard will reshard any given blob to the new shard with a
  probability of `weight/total_weight`.

This effectively means adding or removing a shard triggers a predictable,
minimal amount of resharding, eliminating the need for drained backends.

```
message ShardingBlobAccessConfiguration {
  message Shard {
    // unchanged
    BlobAccessConfiguration backend = 1;
    // unchanged
    uint32 weight = 2;
  }
  // unchanged
  uint64 hash_initialization = 1;

  // Was 'shards', an array of shards to use; it has been replaced by
  // 'shard_map'.
  reserved 2;

  // NEW:
  // Shards identified by a key within the context of this sharding
  // configuration. The key is a freeform string which describes the identity
  // of the shard in the context of the current sharding configuration.
  // Shards are chosen via Rendezvous hashing based on the digest, weight,
  // key and hash_initialization of the configuration.
  //
  // When removing a shard from the map it is guaranteed that only blobs
  // which resolved to the removed shard will get a different shard. When
  // adding shards there is a weight/total_weight probability that any given
  // blob will be resolved to the new shards.
  map<string, Shard> shard_map = 3;
}
```

### Other Algorithms

Other algorithms considered were [Consistent
hashing](https://en.wikipedia.org/wiki/Consistent_hashing) and
[Maglev](https://storage.googleapis.com/gweb-research2023-media/pubtools/2904.pdf).

> **Reviewer:** This could probably use some discussion of why those
> algorithms were rejected.
>
> **Author:** Ack.

Rendezvous hashing was chosen for its simplicity of implementation, its
mathematical elegance, and because it does not need to be parameterized in
configuration.

_Of all shards, pick the one with the largest `-weight / log(h)`, where `h` is
a hash of the key/node pair mapped into `]0, 1[`._
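
A minimal sketch of this selection rule in Go follows. The names (`shard`,
`pickShard`), the choice of FNV-1a as the hash function and the example
weights are illustrative assumptions rather than the actual Buildbarn
implementation, and the `hash_initialization` seed is omitted for brevity:

```
package main

import (
	"fmt"
	"hash/fnv"
	"math"
)

// shard mirrors one entry of the proposed 'shard_map': a freeform key plus a
// weight.
type shard struct {
	key    string
	weight uint32
}

// score computes -weight / log(h), where h is a hash of the digest/shard-key
// pair mapped into the open interval ]0, 1[.
func score(digest []byte, s shard) float64 {
	hasher := fnv.New64a()
	hasher.Write(digest)
	hasher.Write([]byte(s.key))
	// Map the top 53 bits of the 64-bit hash into ]0, 1[.
	h := (float64(hasher.Sum64()>>11) + 0.5) / (1 << 53)
	return -float64(s.weight) / math.Log(h)
}

// pickShard returns the shard with the largest score. Removing a shard only
// moves the blobs that resolved to it; a newly added shard steals any given
// blob with probability weight/total_weight.
func pickShard(shards []shard, digest []byte) shard {
	best, bestScore := shards[0], math.Inf(-1)
	for _, s := range shards {
		if sc := score(digest, s); sc > bestScore {
			best, bestScore = s, sc
		}
	}
	return best
}

func main() {
	shards := []shard{{"storage-0", 1}, {"storage-1", 1}, {"storage-2", 2}}
	fmt.Println(pickShard(shards, []byte("some-blob-digest")).key)
}
```

With the example weights above, `storage-2` receives any given blob with
probability 2/4, matching the `weight/total_weight` property.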

There are some minor advantages and disadvantages to the other algorithms, in
particular with regards to how computationally heavy they are. However, in a
typical Buildbarn cluster with dozens of shards, the amount of computation
required to pick a shard is unlikely to be relevant.

#### Consistent Hashing

Consistent hashing can be reduced to a special case of Rendezvous hashing,
with the advantage that it can be precomputed into a sorted list in which we
can binary search for the target in `log(W)` time. This is in contrast to
Rendezvous hashing, which takes `N` time to search.

* `N` is the number of shards.
* `W` is the sum of the weights in their reduced form (i.e. the smallest
  weight is 1 and all weights are integers).

In practice Buildbarn clusters only have a small number of shards, and a
straight linear search with Rendezvous is faster for any reasonable cluster
(say, fewer than 100 shards). Rendezvous is also independent of the weights of
the individual shards, which improves its relative performance for shards that
are not evenly weighted.
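
For comparison, here is a sketch (under the same illustrative assumptions) of
the precomputed sorted ring that consistent hashing allows. Each shard is
placed on the ring once per unit of weight, so a lookup binary searches `W`
ring points in `log(W)` time:

```
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// point is one entry on the precomputed ring.
type point struct {
	position uint64
	shard    string
}

func hash64(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	return h.Sum64()
}

// buildRing places every shard on the ring once per unit of weight and sorts
// the result. A production implementation would add many virtual nodes per
// shard to even out the distribution.
func buildRing(weights map[string]int) []point {
	var ring []point
	for name, w := range weights {
		for i := 0; i < w; i++ {
			ring = append(ring, point{hash64(fmt.Sprintf("%s/%d", name, i)), name})
		}
	}
	sort.Slice(ring, func(i, j int) bool { return ring[i].position < ring[j].position })
	return ring
}

// lookup binary searches for the first ring point at or after the digest's
// hash, wrapping around at the end of the ring.
func lookup(ring []point, digest string) string {
	h := hash64(digest)
	i := sort.Search(len(ring), func(i int) bool { return ring[i].position >= h })
	if i == len(ring) {
		i = 0
	}
	return ring[i].shard
}

func main() {
	ring := buildRing(map[string]int{"storage-0": 1, "storage-1": 1, "storage-2": 2})
	fmt.Println(lookup(ring, "some-blob-digest"))
}
```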

#### Maglev

Maglev in turn uses an iterative method that precomputes an evenly distributed
lookup table of arbitrary size. After the precomputation step, the target
shard is a simple lookup of the key modulo the size of the table.

The disadvantage of Maglev is that it aims for minimal resharding rather than
optimal resharding. When removing a shard from the set of all shards, the
algorithm tolerates a small amount of unnecessary resharding, on the order of
a few percent depending on the number of shards and the size of the lookup
table.

## Handling Topology Changes Dynamically

To reduce the number of restarts required, Buildbarn needs a construct that
describes how to reload the topology.

### RemotelyDefinedBlobAccessConfiguration

```
message RemotelyDefinedBlobAccessConfiguration {
  // Fetch the blob access configuration from an external service.
  buildbarn.configuration.grpc.ClientConfiguration endpoint = 1;

  // Duration for which to keep using the previous configuration when
  // RemoteBlobAccessConfiguration.GetBlobAccessConfiguration cannot be
  // reached, before the component should consider itself partitioned from
  // the cluster and return UNAVAILABLE for any access requests.
  //
  // Recommended value: 10s
  google.protobuf.Duration remote_configuration_timeout = 2;
}
```

> **Reviewer:** Have you considered supporting a time-to-live (TTL) value in
> the fetched configuration, telling how many seconds that particular
> retrieved configuration is valid? Similar to TTL in DNS records. That would
> allow changing the refresh interval without restarting all components.
> Operators could for example freely choose a propagation duration between
> steps 2 and 3 (in the 'Context' chapter) for each resharding occasion, by
> first changing the refresh interval (for scenarios where unnecessary client
> retries are undesired). If configuration is retrieved via HTTP, then a TTL
> expiry value could be specified in an HTTP response header.

This configuration calls a service that implements the following:

```
service RemoteBlobAccessConfiguration {
  // A request to subscribe to the blob access configurations.
  // The service should immediately return the current blob access
  // configuration. After the initial blob access configuration the stream
  // will push out any changes to the blob access configuration.
  rpc GetBlobAccessConfiguration(GetBlobAccessConfigurationRequest) returns (stream BlobAccessConfiguration);
}

message GetBlobAccessConfigurationRequest {
  enum StorageBackend {
    CAS = 0;
    AC = 1;
    ICAS = 2;
    ISCC = 3;
    FSAC = 4;
  }
  // Which storage backend the service should describe the topology for.
  StorageBackend storage_backend = 1;
}
```

A simple implementation of this service could be a sidecar container that
dynamically reads a ConfigMap.
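
For illustration, here is a sketch of such a sidecar in Go. It assumes gRPC
stubs generated from the service definition above (imported as `pb`) and a
JSON-encoded configuration in the mounted file; the import path, file path and
port are all hypothetical:

```
package main

import (
	"log"
	"net"
	"os"

	"github.com/fsnotify/fsnotify"
	"google.golang.org/grpc"
	"google.golang.org/protobuf/encoding/protojson"

	pb "example.com/remoteblobaccess" // hypothetical generated stubs
)

// server streams the contents of a mounted ConfigMap file and pushes a new
// configuration to each subscriber whenever the file changes.
type server struct {
	pb.UnimplementedRemoteBlobAccessConfigurationServer
	path string
}

func (s *server) GetBlobAccessConfiguration(req *pb.GetBlobAccessConfigurationRequest, stream pb.RemoteBlobAccessConfiguration_GetBlobAccessConfigurationServer) error {
	send := func() error {
		data, err := os.ReadFile(s.path)
		if err != nil {
			return err
		}
		var cfg pb.BlobAccessConfiguration
		if err := protojson.Unmarshal(data, &cfg); err != nil {
			return err
		}
		return stream.Send(&cfg)
	}
	// Immediately return the current configuration.
	if err := send(); err != nil {
		return err
	}
	// Then push out any changes for as long as the client stays subscribed.
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		return err
	}
	defer watcher.Close()
	if err := watcher.Add(s.path); err != nil {
		return err
	}
	for {
		select {
		case <-stream.Context().Done():
			return stream.Context().Err()
		case <-watcher.Events:
			if err := send(); err != nil {
				return err
			}
		}
	}
}

func main() {
	lis, err := net.Listen("tcp", ":8980")
	if err != nil {
		log.Fatal(err)
	}
	g := grpc.NewServer()
	pb.RegisterRemoteBlobAccessConfigurationServer(g, &server{path: "/config/blobstore.json"})
	log.Fatal(g.Serve(lis))
}
```

Note that Kubernetes updates mounted ConfigMaps through a symlink swap, so a
robust implementation would watch the mount directory rather than the file
itself.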

> **Reviewer:** This makes me wonder: why have a gRPC service for this? Why
> can't Buildbarn binaries monitor a config map on disk and reload it as
> needed?
>
> **Author:** That is also a possible implementation when running in a
> Kubernetes context. Instead of
>
> **Reviewer:** Have you considered retrieving jsonnet files via HTTP, to
> allow using standard web servers instead of a special gRPC service?
>
> **Reviewer:** I understand the benefit of gRPC if using a gRPC stream to
> push configuration updates, but on the other hand that involves having
> state about clients on the server side, which increases complexity. I find
> the use of jsonnet files via HTTP appealing, especially when combined with
> a TTL expiration header in the HTTP responses.

A more complex implementation might:

* Read Prometheus metrics and roll out updates to the correct mirror.
* Increase the number of shards when the metrics indicate that the retention
  time has fallen below a desired value.
* Stagger read fallback configurations which are removed automatically after
  a sufficient amount of time has passed.

### JsonnetDefinedBlobAccessConfiguration

This refers to a secondary configuration file that is dynamically loaded at
runtime. When the file is modified, Buildbarn reloads it.

```
message JsonnetDefinedBlobAccessConfiguration {
  // Path to the jsonnet file which describes the blob access configuration
  // to use. This file will be monitored for changes.
  string path = 1;

  // External variables to use when parsing the jsonnet file. These are used
  // in addition to the process environment variables, which are always
  // available.
  map<string, string> external_variables = 2;

  // Buildbarn will use the inotify api to watch for changes to the
  // underlying jsonnet file, any libsonnet file it includes, and the
  // optional lock file specified below. Should the inotify api not be able
  // to notice file modifications, Buildbarn can be configured to
  // periodically poll the configuration files for changes.
  //
  // This typically happens because the underlying file system does not
  // support the inotify api and/or because the file is hidden by bind
  // mounts.
  //
  // Setting this value to 0 disables the periodic refresh interval.
  google.protobuf.Duration refresh_interval = 3;

  // An optional path to a lock file whose presence Buildbarn will check
  // before parsing the jsonnet file. Buildbarn will only parse the jsonnet
  // file if the lock file does not exist.
  //
  // The lock file is there to prevent unintentional partial updates to the
  // BlobAccessConfiguration, such as when a jsonnet file imports a
  // libsonnet file and each file is updated separately.
  string lock_file_path = 4;
}
```
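
For illustration, a sketch of the reload loop a component could run for this
configuration, assuming the `fsnotify` library for the inotify side; the
`reload` callback, which stands in for re-evaluating the jsonnet file, and the
paths in `main` are hypothetical:

```
package main

import (
	"log"
	"os"
	"time"

	"github.com/fsnotify/fsnotify"
)

// watchConfiguration invokes reload whenever the jsonnet file changes. A full
// implementation would also watch any included libsonnet files and the
// directory containing the lock file, so that removing the lock file itself
// triggers a reload.
func watchConfiguration(path, lockFilePath string, refreshInterval time.Duration, reload func(string) error) error {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		return err
	}
	defer watcher.Close()
	if err := watcher.Add(path); err != nil {
		return err
	}
	// A refresh interval of 0 disables the periodic polling fallback,
	// matching the semantics of 'refresh_interval' above.
	var tick <-chan time.Time
	if refreshInterval > 0 {
		ticker := time.NewTicker(refreshInterval)
		defer ticker.Stop()
		tick = ticker.C
	}
	for {
		select {
		case <-watcher.Events:
		case <-tick: // nil channel when polling is disabled
		}
		// Skip the reload while the lock file exists, so that multi-file
		// updates (a jsonnet file plus the libsonnet files it imports) are
		// applied atomically once the lock file is removed.
		if lockFilePath != "" {
			if _, err := os.Stat(lockFilePath); err == nil {
				continue
			}
		}
		if err := reload(path); err != nil {
			log.Printf("failed to reload %s: %v", path, err)
		}
	}
}

func main() {
	log.Fatal(watchConfiguration("/config/blobstore.jsonnet", "/config/.lock",
		time.Minute, func(p string) error {
			log.Printf("would re-evaluate %s and swap in the new topology", p)
			return nil
		}))
}
```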

> **Reviewer:** This could probably use some discussion on how actually
> swapping out the BlobAccess under the hood is going to work. LocalBlobAccess
> and ZipWritingBlobAccess have logic for at least flushing data upon
> shutdown, but that's it. This means that every new config push would end up
> leaking gRPC clients and file descriptors.
>
> **Author:** I don't think I'm equipped to answer that question until we've
> had a look at what performing the implementation would entail. Here are some
> unordered thoughts to look into: From a consumer point of view it would be
> neat if Buildbarn handled all of the possible issues that could arise from
> reloading a configuration; from a maintenance point of view this might not
> be ideal or even warranted. The key features we would want to reload are the
> topological pieces of the blob access configuration (e.g. mirrored,
> sharding, readFallback, grpc and probably others). Those would then need
> implementation work to not leak or have any other issues, but several other
> configurations might not even make sense to reload. We would still want
> everything to compose neatly, but there is no point in partially reloading a
> local blob access configuration. Reuse subsets of the graph which we
> determine to be identical? Have an interface for reloading the configuration
> which may error out (for when reloading is requested in a non-compatible
> manner)?
>
> **Reviewer:** My organization has a Buildbarn fork with custom BlobAccess
> implementations for our custom sharding algorithm, with support for
> reloading. The only upstream blob access implementation we reload is for
> grpc, and for that we rely on pkg/grpc/deduplicating_client_factory to reuse
> gRPC clients. I believe we have nothing in place to avoid leaking gRPC
> clients for no-longer-used endpoints, but that is acceptable for our current
> resharding use cases. Our fork introduced a global Mutex that is locked
> while the main routine initializes blob access configurations during
> startup, and also locked when configuration refreshing takes place. I'm not
> sure that Mutex is really needed, but have not dared to remove it because it
> is hard to grasp all the potential data races that could otherwise occur. If
> you add support for reloading configurations, that would be very helpful and
> would probably allow us to simplify our custom patches! And we could someday
> publish them if there is interest in the open-source community.

> **Reviewer:** Worth calling out that it is possible to reduce the number of
> processes that need to be restarted by pointing all other components (e.g.,
> workers, the scheduler, bb-browser) at the frontend processes.
>
> **Author:** I can add it to the text, but I don't think it's a good pattern
> to encourage for workers. Firstly, frontends already require a significant
> amount of performance even without proxying the worker traffic. Secondly,
> when workers have the topology they can restart gracefully, but when they
> proxy via frontends they will instead randomly fail some actions that
> happened to be accessing storage at the time of the restart.
>
> **Reviewer:** One more use case for resharding without restarts: bb-clientd
> instances connected directly to the shards, without a bb-storage proxy
> frontend in between, because restarting client-side bb-clientd instances is
> typically challenging for the operators of the server-side services.