
(tree): Incremental summarization of chunked forest #24949


Open · wants to merge 32 commits into base: main

Conversation

@agarwal-navin agarwal-navin (Contributor) commented Jul 1, 2025

Reviewer guidance

  • There are a couple of TODOs that will be fixed before merging the PR. I have kept them in to help with the review.
  • There is some outstanding work that I plan to do in a follow-up. It is called out in the Next steps section below.

Next steps

  • Add a schema API to allow consumers to specify which properties can be incrementally summarized. Currently, they are hardcoded, but that will be changed before making this PR ready for review.
  • Add logic to gate incremental summaries on the format version of FieldBatchCodec. That format version will also be updated as per the cross-client compat policy.

Description

AB#1263 has an overall description and link to a design document.

This PR adds incremental summarization support to the chunked forest. Currently, the chunked forest data is encoded and added to a single blob in the summary tree. This change breaks it down into multiple subtrees, each of which can be incrementally summarized, i.e., if such a subtree doesn't change between summaries, a summary handle is added to the summary instead of the entire subtree.
The schema will support marking properties as incrementally summarizable (TODO - not in this PR). Any chunks in the fields corresponding to such a property form the boundary at which incremental summarization works. Basically, if the property changes, or anything within it changes (for objects and arrays), its chunks will be re-summarized; otherwise, a summary handle will be used.
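The handle-vs-resummarize decision described above can be sketched roughly as follows. This is an illustrative sketch, not the PR's actual code: `SummaryEntry`, `summarizeChunk`, and the `/forest/` path are all hypothetical names.

```typescript
// Hypothetical sketch: an unchanged incremental chunk contributes a summary
// handle pointing at its location in the previous summary, while a changed
// chunk is re-encoded into a full summary subtree.

type SummaryEntry =
	| { type: "tree"; content: string }
	| { type: "handle"; path: string };

function summarizeChunk(
	chunkId: number,
	changedSinceLastSummary: boolean,
	encode: (id: number) => string,
): SummaryEntry {
	return changedSinceLastSummary
		? { type: "tree", content: encode(chunkId) }
		: { type: "handle", path: `/forest/${chunkId}` };
}
```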

To achieve this, this PR makes the following changes:

  • Added a new encoded format, EncodedIncrementalChunkShape, to EncodedChunkShape for chunks whose data will be incrementally summarized.
  • Added a new shape, IncrementalChunkShape, which represents chunks whose data can be incrementally summarized, aka incremental chunks.
  • Each incremental chunk is assigned a ChunkReferenceId which uniquely identifies it under its parent field.
  • Added a new field encoder, incrementalFieldEncoder, which encodes its chunks as IncrementalChunkShape. The ChunkReferenceIds of these chunks are stored as EncodedNestedArrayShape.
  • Added a new decoder, IncrementalChunkDecoder, which can decode an incremental chunk.
  • Added IncrementalEncoder and IncrementalDecoder interfaces whose properties help with encoding and decoding of incremental chunks. These are added to the FieldBatchEncodingContext.
  • Added a ForestIncrementalSummaryBuilder which tracks chunks across summaries and generates summary trees or summary handles for them depending on whether they changed between summaries.
  • The forest summary tree consists of a blob called ForestTree at the root that contains the data for the forest. If there are any incremental fields in the forest, an array of chunk reference ids will be encoded for their chunks in the data.
    • For each incremental chunk, a separate summary tree will be added with its reference id as the key for the tree.
    • The data for the chunk will be added to a blob called contents under the tree. If the chunk has any incremental fields, an array of chunk reference ids will be encoded for their chunks in the data.
    • The same format is repeated until all incremental fields have been summarized.
    • Note that currently, an incremental field will have exactly one chunk. This may change in the future if the data is encoded into multiple chunks.
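The recursive layout described in the last bullet group can be pictured as a nested tree. The sketch below uses the blob names from the PR text ("ForestTree", "contents"); the SummaryObject type and the placeholder blob contents are simplified stand-ins, not the runtime's actual summary format.

```typescript
// Hypothetical summary layout: a forest with one incremental chunk
// (reference id 0) whose own data contains a nested incremental chunk
// (reference id 1).

type SummaryObject =
	| { type: "blob"; content: string }
	| { type: "tree"; tree: Record<string, SummaryObject> };

const forestSummary: SummaryObject = {
	type: "tree",
	tree: {
		// Root blob: forest data, with chunk reference ids [0] encoded in it.
		ForestTree: { type: "blob", content: "<forest data; refIds: [0]>" },
		// One subtree per incremental chunk, keyed by its reference id.
		"0": {
			type: "tree",
			tree: {
				contents: { type: "blob", content: "<chunk 0 data; refIds: [1]>" },
				"1": {
					type: "tree",
					tree: {
						contents: { type: "blob", content: "<chunk 1 data>" },
					},
				},
			},
		},
	},
};
```

If chunk 0 is unchanged between summaries, the whole subtree under key "0" (including the nested "1" subtree) would be replaced by a single summary handle.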

AB#14308 AB#41861 AB#41863 AB#41864

@github-actions github-actions bot added area: dds Issues related to distributed data structures area: dds: tree area: runtime Runtime related issues area: tests Tests to add, test infrastructure improvements, etc dependencies Pull requests that update a dependency file base: main PRs targeted against main branch labels Jul 1, 2025
agarwal-navin added a commit that referenced this pull request Jul 9, 2025
…ce (#24970)

## Description
Added `getSnapshotTree` API to `IChannelStorageService` to let a DDS access its snapshot tree. This will help DDSes examine their summary when the summary consists of dynamic subtrees / blobs. Currently, all DDSes store their data in fixed blobs, so they can download these specific blobs by name.

However, there can be scenarios where a DDS writes its data into dynamic trees / blobs, similar to the container runtime, where the contents of its channels subtree depend on the number of data stores and their ids. In such cases, without access to the snapshot tree, the DDS would need to write the paths of all its blobs into its summary tree (into a blob with a specific name) and use that to fetch them during load.
For example, shared tree is implementing incremental summarization of its data, where the number and paths of blobs change dynamically. Having access to the snapshot tree is important for it to fetch the contents of all the blobs in its summary tree. PR for shared tree's incremental summarization: #24949.


[AB#43434](https://dev.azure.com/fluidframework/235294da-091d-4c29-84fc-cdfc3d90890b/_workitems/edit/43434)
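As a rough illustration of why snapshot-tree access matters here, a DDS with dynamic subtrees could walk the tree to find every blob path. The tree type below is a simplified stand-in for Fluid's snapshot tree shape, and the walk helper is illustrative, not part of the API.

```typescript
// Hypothetical simplified snapshot tree: named blobs plus named subtrees.
interface SnapshotTreeSketch {
	blobs: Record<string, string>; // blob name -> blob id
	trees: Record<string, SnapshotTreeSketch>;
}

// Collect the path of every blob so a DDS with dynamically named subtrees
// can fetch them all without a hardcoded list of blob names.
function collectBlobPaths(tree: SnapshotTreeSketch, prefix = ""): string[] {
	const paths = Object.keys(tree.blobs).map((name) => `${prefix}${name}`);
	for (const [key, subtree] of Object.entries(tree.trees)) {
		paths.push(...collectBlobPaths(subtree, `${prefix}${key}/`));
	}
	return paths;
}
```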
@agarwal-navin agarwal-navin force-pushed the incrementalSummaryTree branch from 5b54069 to a18ca83 Compare July 14, 2025 23:31
@github-actions github-actions bot removed the dependencies Pull requests that update a dependency file label Jul 14, 2025
@agarwal-navin agarwal-navin force-pushed the incrementalSummaryTree branch from eac3c35 to e0f4f71 Compare July 28, 2025 23:50
Comment on lines 243 to 248
const chunkReferenceId = readStream(stream);
assert(
typeof chunkReferenceId === "number",
"Expected incremental field's data to be a chunk reference id",
);

Contributor

You could use:

Suggested change
const chunkReferenceId = readStream(stream);
assert(
typeof chunkReferenceId === "number",
"Expected incremental field's data to be a chunk reference id",
);
const chunkReferenceId = readStreamNumber(stream);

That has the assert built in.
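A minimal sketch of what a `readStreamNumber` helper like the one suggested could look like: `readStream` plus a built-in type assert. The stream type here is a hypothetical simplification of the codec's actual stream type.

```typescript
// Hypothetical stream: an array of values plus a cursor offset.
interface StreamCursor {
	readonly data: readonly unknown[];
	offset: number;
}

function readStream(stream: StreamCursor): unknown {
	return stream.data[stream.offset++];
}

// Reads the next value and asserts it is a number, so callers don't have
// to repeat the typeof check at every call site.
function readStreamNumber(stream: StreamCursor): number {
	const value = readStream(stream);
	if (typeof value !== "number") {
		throw new Error("Expected a number in the stream");
	}
	return value;
}

// Usage mirroring the diff above: reading a chunk reference id.
const stream: StreamCursor = { data: [42], offset: 0 };
const chunkReferenceId = readStreamNumber(stream); // 42
```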

extends ShapeGeneric<EncodedChunkShape>
implements FieldEncoder
{
export class NestedArrayEncoder extends NestedArrayShape implements FieldEncoder {
Contributor

I think it would be cleaner to not extend NestedArrayShape, and instead just store an instance of it.

That will make caching/reuse of the shape more effective (we can cache the NestedArrayShape instances without worrying about whether they are a NestedArrayEncoder).

It will also separate concerns here so the encoder can clearly just be about implementing the encoder interface, which will be cleaner.

I think we would want to make the constructor take just the public readonly shape: NestedArrayShape

Contributor Author

@agarwal-navin agarwal-navin Jul 31, 2025

Added it as a constructor parameter, but made it default to the inner encoder's shape since that is going to be the common case.
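The composition-over-inheritance shape of this thread's resolution could look roughly like the sketch below. All types are simplified stand-ins for the real ones (hence the "Sketch" suffixes), and the encoding logic is invented for illustration.

```typescript
// Simplified stand-in for NestedArrayShape.
class NestedArrayShapeSketch {
	public constructor(public readonly innerShapeName: string) {}
}

interface FieldEncoderSketch {
	readonly shape: NestedArrayShapeSketch;
	encodeField(values: readonly number[]): string;
}

class NestedArrayEncoderSketch implements FieldEncoderSketch {
	// The shape is a constructor parameter with a default (the common case),
	// so NestedArrayShape instances can be cached and reused independently
	// of any particular encoder, and the encoder is only about encoding.
	public constructor(
		private readonly innerName: string,
		public readonly shape: NestedArrayShapeSketch = new NestedArrayShapeSketch(
			innerName,
		),
	) {}

	public encodeField(values: readonly number[]): string {
		return `${this.shape.innerShapeName}:[${values.join(",")}]`;
	}
}
```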

);

let chunkReferenceIds: ChunkReferenceId[] = [];
if (cursor.getFieldLength() !== 0) {
Contributor

IncrementalEncoder.encodeIncrementalField does not say anything about not supporting empty fields so I think we should skip this check, and let it handle the empty case. I don't think we gain anything by adding the special casing of empty here.

Contributor Author

I added this check so that we don't end up adding an empty summary tree for fields that are empty. I could move this check to the ForestIncrementalSummaryBuilder and it can decide to return an empty array in case the field is empty.
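The alternative floated here can be sketched as follows. This is a hypothetical API, not the PR's actual ForestIncrementalSummaryBuilder: the builder owns the empty-field case and returns an empty reference id array, so no empty summary subtree is created and the encoder needs no special casing.

```typescript
type ChunkReferenceId = number;

// Hypothetical builder: assigns reference ids to an incremental field's
// chunks, with empty fields handled internally.
class IncrementalSummaryBuilderSketch {
	private nextId: ChunkReferenceId = 0;

	// Returns one reference id per chunk; an empty field yields an empty
	// array, so no summary subtree is created for it.
	public encodeIncrementalField(chunkCount: number): ChunkReferenceId[] {
		if (chunkCount === 0) {
			return [];
		}
		return Array.from({ length: chunkCount }, () => this.nextId++);
	}
}
```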
