
V3 idea: Drop support for offset reads & writes. Decompose large CAS objects into shallow Merkle trees #178

@EdSchouten


The size of objects stored in the CAS has a pretty large spread. We can have tiny Directory objects that are less than 100 bytes in size, but we can also have output files that are gigabytes in size. Dealing with these large files is annoying:

  • When using the digest as a key for sharding the CAS across multiple backends, you get an almost perfect distribution in terms of request count. Below is an example read count (1h average) of a setup with eight shards:

[Screenshot: read count per shard, 1h average, eight shards]

But if you look at the amount of data read in bytes (again a 1h average), it's a lot less balanced. We sometimes see that the busiest shard receives 50% more traffic than the least busy one, and that's measured across a full hour.

[Screenshot: bytes read per shard, 1h average, eight shards]

  • Many storage systems impose limits on the maximum size of an object. Redis can only store 512 MB objects, Cassandra has a 2 GB limit on column values, and Ceph's RADOS has a maximum object size of 128 MB. (Buildbarn's LocalBlobAccess also has certain limits.) It is possible to decompose objects into chunks on the storage infrastructure side, but that removes the ability to perform data integrity checking at the chunk level, as multiple chunks need to be concatenated to perform checksum validation. When chunks are also spread across shards, an individual shard loses the ability to validate its own data entirely.
  • Connection drops may occur while uploading files. When this happens, there are two options: A) the client re-uploads the file from the beginning, or B) the client somehow requests that the upload continue at the last received position. The latter requires the server to hold on to partial uploads and to negotiate with the client at which offset the upload needs to resume. That isn't fun to implement.
  • The current approach doesn't make it possible to do validated reads of chunks of a file. If workers implement lazy loading of file contents (e.g., using a worker FUSE file system), it may be preferable not to read large files in their entirety, but only the parts that are actually used by build actions. This can't be done in a validated way if the only thing you have is the SHA checksum of the file as a whole.
  • Relatedly, in Add support for transferring compressed blobs via ByteStream #168 we're discussing how support for compressing CAS contents can be added to REv2. As we've noticed, the need for supporting partial uploads & downloads complicates that a lot.

My suggestion is that REv3 drops support for offset reads & writes entirely. Instead, we should model large files as shallow (1-level deep) Merkle trees of chunks with some fixed size. Instead of sending a single ByteStream request to download such a large file, a client must (or may?) first read a manifest object from the CAS containing a list of digests of chunks whose contents need to be concatenated. When uploading a file, FindMissingBlobs() should be called against the digest of the manifest object and each of the digests of the chunks. This allows a client to skip uploading parts that are already in the CAS, which both speeds up resumption of partially uploaded files and adds (a rough version of) deduplication of identical regions across multiple files. Because large objects don't occur very frequently (read: almost all objects tend to be small), this indirection doesn't really impact performance for most workloads. Compression can be added by simply compressing individual chunks.
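To make the proposed flow concrete, here is a minimal client-side sketch of the upload path. The manifest encoding (a plain list of chunk digests), the 2 MiB chunk size, and the helper names are assumptions for illustration; none of this is part of REv2 today.

```go
// Hypothetical sketch of client-side decomposition: the manifest format,
// chunk size, and helper names are assumptions, not part of REv2/REv3.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
)

const chunkSize = 2 * 1024 * 1024 // Assumed fixed decomposition size of 2 MiB.

type digest struct {
	Hash      string
	SizeBytes int64
}

// decompose splits a blob into fixed-size chunks and returns the digest of
// each chunk, plus a manifest blob that is simply the concatenation of the
// chunk digests. A real manifest would more likely be a proto message.
func decompose(data []byte) (chunks []digest, manifest []byte) {
	for off := 0; off < len(data); off += chunkSize {
		end := off + chunkSize
		if end > len(data) {
			end = len(data)
		}
		h := sha256.Sum256(data[off:end])
		d := digest{Hash: hex.EncodeToString(h[:]), SizeBytes: int64(end - off)}
		chunks = append(chunks, d)
		manifest = append(manifest, fmt.Sprintf("%s/%d\n", d.Hash, d.SizeBytes)...)
	}
	return chunks, manifest
}

func main() {
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		panic(err)
	}
	chunks, manifest := decompose(data)
	mh := sha256.Sum256(manifest)
	// The client would now call FindMissingBlobs() with the manifest digest
	// and every chunk digest, and only upload the blobs reported as missing.
	fmt.Printf("manifest: %s/%d\n", hex.EncodeToString(mh[:]), len(manifest))
	for _, c := range chunks {
		fmt.Printf("chunk: %s/%d\n", c.Hash, c.SizeBytes)
	}
}
```

The important property is that FindMissingBlobs() operates on the manifest digest plus the individual chunk digests, so a resumed upload, or a file that shares regions with a file that's already present, only needs to transfer the chunks that are actually missing.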

Earlier this year I experimented with this concept for Buildbarn (ADR#3), where the decomposition size can be any power of 2 that is 2 KiB or larger. It used a slightly simplified version of BLAKE3 to hash objects. Unfortunately, I never fully completed it or merged it into master, and it also didn't take the desire to do compression into account. I think that decomposition and compression should both be considered, and can likely not be discussed separately. My hope is that for REv3, we find the time to solve this properly.
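For the read side (e.g. the lazy-loading FUSE case mentioned above), a manifest also makes validated partial reads straightforward. Here is a minimal sketch under the same assumptions as the upload example; fetchBlob() is a hypothetical stand-in for a ByteStream read of a single chunk.

```go
// Hypothetical sketch of a validated partial read through a chunk manifest:
// the manifest layout, fixed chunk size, and fetchBlob() helper are
// assumptions for illustration, not part of REv2/REv3.
package chunkread

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

const chunkSize = 2 * 1024 * 1024 // Same assumed decomposition size as in the upload sketch.

type digest struct {
	Hash      string
	SizeBytes int64
}

// fetchBlob stands in for a ByteStream read of a single chunk from the CAS.
var fetchBlob func(d digest) ([]byte, error)

// readRange returns file bytes [offset, offset+length) by fetching only the
// chunks that overlap the range and validating each one individually, which
// a whole-file digest cannot provide.
func readRange(manifest []digest, offset, length int64) ([]byte, error) {
	end := offset + length
	var out []byte
	for length > 0 {
		idx := offset / chunkSize
		chunk, err := fetchBlob(manifest[idx])
		if err != nil {
			return nil, err
		}
		// Each chunk is checked on its own, without downloading the rest
		// of the file.
		h := sha256.Sum256(chunk)
		if hex.EncodeToString(h[:]) != manifest[idx].Hash {
			return nil, fmt.Errorf("chunk %d failed validation", idx)
		}
		start := offset - idx*chunkSize
		stop := int64(len(chunk))
		if idx*chunkSize+stop > end {
			stop = end - idx*chunkSize
		}
		out = append(out, chunk[start:stop]...)
		offset += stop - start
		length -= stop - start
	}
	return out, nil
}
```

Because every chunk digest is known up front, a worker only needs to hash the chunks it actually touches, rather than the entire file.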
