Skip to content

feat(tree): Add vTextExperimental codec with specialized node shape encoding#27515

Open
brrichards wants to merge 4 commits into
microsoft:mainfrom
brrichards:experimental-vtext-clean
Open

feat(tree): Add vTextExperimental codec with specialized node shape encoding#27515
brrichards wants to merge 4 commits into
microsoft:mainfrom
brrichards:experimental-vtext-clean

Conversation

@brrichards

Copy link
Copy Markdown
Contributor

Summary

Wires up an encoder for the specialized node shape ({f: {base, fields, value?}}) introduced in #27200. The bulk of this diff is snapshot files (~4000 lines across 54 snapshots) while the actual encoder changes are ~1100 lines across 8 source/test files.

The encoder uses a two-pass approach per field batch:

  • Pass 1 (counting): Walks every node in the batch in post-order. Children are counted before their parents. For each VTextObjectNodeEncoder node, a cohort key is built from its specializable fields (leaf values and resolved child shapes). Because we walk bottom-up, child shape decisions are already finalized when a parent builds its cohort key, allowing parents to reference their children's specialized shapes by identity.

  • Pass 2 (encoding): For each node, if its cohort key's count from pass 1 meets the minimum threshold (defaultMinOccurrencesForSpecialization = 8), a SpecializedNodeShapeEncoder is used and then encodes only the field overrides relative to the base shape. Otherwise the node falls back to the base NodeShapeBasedEncoder.

Batch state is held on a stack to keep each compressedEncode call's specialization decisions isolated. Recursive calls (e.g. for incremental fields that are encoded separately) push their own state, so sub-batches get independent counting without polluting the parent's totals.

Testing

  • Existing encode/decode round-trip tests extended for vTextExperimental
  • Snapshot tests cover schemaless and schema-compressed Vtext output
  • schemaCompressedEncodeVTextExperimentalForTests allows threshold override for small test inputs

Additional optimizations are intentionally excluded from this PR and will land as separate follow-ups.

Reviewer Guidance

The review process is outlined on this wiki page.

@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Hi! Thank you for opening this PR. Want me to review it?

Based on the diff (5251 lines, 63 files), I've queued these reviewers:

  • Correctness — logic errors, race conditions, lifecycle issues
  • Security — vulnerabilities, secret exposure, injection
  • API Compatibility — breaking changes, release tags, type design
  • Performance — algorithmic regressions, memory leaks
  • Testing — coverage gaps, hollow tests

How this works

  • Adjust the reviewer set by ticking/unticking boxes above. Reviewer toggles alone don't trigger anything.

  • Tick Start review below to dispatch the review fleet.

  • After review finishes, tick Start review again to request another run — it auto-resets after each dispatch.

  • This comment updates as new commits land; your reviewer selections are preserved.

  • Start review

@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

🔭 PR Review Fleet Report

Note

This report is generated by an experimental AI review fleet and is provided as a beta feature. Findings are a starting point for discussion, not a gate. Use your own judgement.

Verdict: ❌ Request Changes

0 Disastrous, 1 Dangerous, 5 Disagreeable

Findings

Sev # Area File What Fix
🐊 Dangerous H1 Correctness packages/dds/tree/src/feature-libraries/chunked-forest/codec/schemaBasedEncode.ts:585 Cohort key collisions due to unsanitized | separator in string leaf values. cohortKey builds its key by joining parts with | (e.g. L:string:v1|L:string:v2), but string values may themselves contain the substring |L:string:. Concrete example: a node with field1="a|L:string:b" and field2="c" produces the exact same cohort key as a node with field1="a" and field2="b|L:string:c" — both yield L:string:a|L:string:b|L:string:c. When the combined occurrence count of these two structurally-different value tuples crosses minOccurrencesForSpecialization, createSpecialized is called once and the resulting SpecializedNodeShapeEncoder bakes the first caller's constant values into the shape. Every subsequent node matched to the same key — but carrying different actual values — is silently encoded with the wrong constants, because encodeValue for array-valued shapes writes nothing to the stream and the decoder unconditionally uses the shape's constant. This is silent data corruption: no assertion fires and the document round-trips to wrong values. Realistic trigger: any schema with two or more Multiplicity.Single string-typed fields where document content happens to contain the literal substring |L:string: (e.g. in prose, code, or technical notation). Replace the raw |-join in cohortKey with collision-resistant encoding. The simplest correct fix is to JSON-stringify each part before joining, or to percent-encode the | character inside each part's value segment: e.g. replace valueKey(cursor.value) with JSON.stringify(cursor.value) so that | within a string value becomes "|" and can never be confused with the separator. Similarly ensure the S:${id} parts, which use only decimal digits, can never match an L: prefix, which is already the case — so only the L: path needs fixing.
🐍 Disagreeable M1 Correctness packages/dds/tree/src/feature-libraries/chunked-forest/codec/compressedEncode.ts:56 compressedEncode calls context.beginBatch(fieldBatch) (which pushes a VTextBatchState onto the VText batch stack) but does not use try/finally to guarantee context.endBatch() (which pops it) is called if an exception is thrown during encoding. If anyFieldEncoder.encodeField or updateShapesAndIdentifiersEncoding throws, the batch stack retains an extra element. For VText incremental sub-chunk encoding, compressedEncode is called recursively with the same EncoderContext instance (see incrementalFieldEncoder which passes context to the recursive call). If any implementation of IncrementalEncoder.encodeIncrementalField catches the exception thrown by its encode callback and then allows the outer compressedEncode to continue, currentBatch() will return the stale inner batch instead of the outer batch. All subsequent cohort lookups and specialization decisions in the remainder of the outer encode would use the wrong VTextBatchState, producing wrong shapes and potentially corrupted output. Wrap the body of compressedEncode in a try/finally block so context.endBatch() is always called: context.beginBatch(fieldBatch); try { ... return result; } finally { context.endBatch(); }.
🐍 Disagreeable M2 Performance packages/dds/tree/src/feature-libraries/chunked-forest/codec/schemaBasedEncode.ts:566 cohortKey allocates a fresh string[] (const parts: string[] = []) on every invocation. It is called once per VTextObjectNodeEncoder node during the encoding pass (O(N)) and once per node per count-pass iteration (up to 10 × O(N)). For a large collaborative document with 1M rich-text nodes this produces ~10M short-lived array allocations per summary encode, generating significant GC pressure. For documents that are encoded frequently (e.g. on every container attach or summary upload under incremental-encode paths) the heap churn accumulates. Replace the parts: string[] + parts.join('|') pattern with direct string concatenation: let key = ''; for (const field of ...) { key += (key ? '|' : '') + ...; } return key;. This eliminates the per-call array allocation without changing observable behavior.
🐍 Disagreeable M3 Performance packages/dds/tree/src/feature-libraries/chunked-forest/codec/schemaBasedEncode.ts:520 During the count pass, countNodeAndDescendants uses a post-order traversal so children are visited before their parent. When the parent node's countNode is then called, cohortKey re-enters each subShape specializable child field — navigating the cursor into fields and nodes that were just visited by the outer traversal. This doubles the cursor navigation work (enter/exit field + enter/exit node) for every level of subShape nesting in the counting pass. The same double-traversal occurs in the encoding pass: encodeNode calls resolveShape → cohortKey (cursor into children to resolve shapes), then the resolved NodeShapeBasedEncoder.encodeNode traverses all fields again to produce output. For a schema with D levels of nested subShape fields and N total nodes the encode pass does O(2 × N × D) cursor navigations instead of O(N × D). For real-world rich-text schemas with 2–3 nesting levels and 1M leaf nodes this is a concrete 2–3× overhead on cursor traversal. Cache the cohort key alongside the node's post-order count result so that the parent's cohortKey call can read the pre-computed child shape ID without re-entering the cursor. One approach: after countNode(child) completes, stash the resolved shape (or its ID) in the CohortState indexed by the cursor's stable node identity, and have the parent's cohortKey read from that cache instead of re-navigating.
🐍 Disagreeable M4 Testing packages/dds/tree/src/feature-libraries/chunked-forest/codec/schemaBasedEncode.ts:723 resolveShape's specializedEncoders cache is never tested with more than one entry. All unit tests for the VText specialization path produce exactly one specialized shape (either only one tuple crosses the threshold, or all instances share the same tuple). The cache map state.specializedEncoders is therefore only exercised for size 0 (miss → create) and size 1 (hit). A bug where two distinct above-threshold tuples receive the same cohort key — causing one tuple's nodes to be encoded with the other's specialized shape — would go undetected: the single-shape tests still pass, and the snapshot tests only verify round-trip fidelity, not that each tuple got a distinct encoder. Add a test with a schema that has two boolean fields where three distinct tuples each appear above the threshold, e.g. 3 × {bold:true, italic:true}, 3 × {bold:false, italic:false}, 3 × {bold:true, italic:false} with minOccurrencesForSpecialization=3. Assert countSpecializedShapes(encoded) === 3, then round-trip and verify the decoded tree equals the original. This exercises specializedEncoders with 3 entries and confirms no cohort-key collision silently merges distinct tuples.
🐍 Disagreeable M5 Testing packages/dds/tree/src/feature-libraries/chunked-forest/codec/schemaBasedEncode.ts:473 The multi-iteration counting loop in buildContextVText (for (let iteration = 0; iteration < countPassMaxIterations; iteration++)) is never exercised beyond its first iteration. A second iteration is required when the schema has nested subShape specializations: in iteration 1, child cohort keys are resolved against empty resolveCounts (all children use the base shape), so parent nodes group by a "base child" key. After commitIteration() populates resolveCounts, iteration 2 re-counts with children potentially resolving to specialized shapes, changing parent cohort keys. No test creates a schema with a parent object whose subShape field points to a child VTextObjectNodeEncoder — so a bug in commitIteration that causes premature loop exit (e.g., failing to detect changed counts) would produce incorrect parent cohort keys and wrong nested specializations, undetected by any existing test. Add a test with a two-level schema: outer Paragraph object has a required single-valued format field of polymorphic type (e.g., BoldFormat | ItalicFormat, each being an object with boolean fields). Provide enough Paragraph instances so that both the child format nodes and the parent Paragraph nodes cross the specialization threshold. Assert that the outer batch produces specialized shapes for both levels (countSpecializedShapes >= 2), then round-trip and verify the decoded tree. This forces the counting loop to run at least two iterations and confirms commitIteration's monotonic merge is correct.

View workflow run

@brrichards

brrichards commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

compressedEncode calls context.beginBatch(fieldBatch) (which pushes a VTextBatchState onto the VText batch stack) but does not use try/finally to guarantee context.endBatch() (which pops it) is called if an exception is thrown during encoding. If anyFieldEncoder.encodeField or updateShapesAndIdentifiersEncoding throws, the batch stack retains an extra element. For VText incremental sub-chunk encoding, compressedEncode is called recursively with the same EncoderContext instance (see incrementalFieldEncoder which passes context to the recursive call). If any implementation of IncrementalEncoder.encodeIncrementalField catches the exception thrown by its encode callback and then allows the outer compressedEncode to continue, currentBatch() will return the stale inner batch instead of the outer batch. All subsequent cohort lookups and specialization decisions in the remainder of the outer encode would use the wrong VTextBatchState, producing wrong shapes and potentially corrupted output.

Maybe I'm thinking about this wrong, but if an error happens during encoding I imagine that is a very serious error that we don't want to just have happen and then continue with a potential continuation of using the tree/data. I also think that an error is always thrown in the case of this happening. I don't think a try/finally is doing anything here other than just catching an error that would already propogate up and then updating our batch stack that we dont want to use anymore?

@brrichards

Copy link
Copy Markdown
Contributor Author

cohortKey allocates a fresh string[] (const parts: string[] = []) on every invocation. It is called once per VTextObjectNodeEncoder node during the encoding pass (O(N)) and once per node per count-pass iteration (up to 10 × O(N)). For a large collaborative document with 1M rich-text nodes this produces ~10M short-lived array allocations per summary encode, generating significant GC pressure. For documents that are encoded frequently (e.g. on every container attach or summary upload under incremental-encode paths) the heap churn accumulates.

1 million rich text-nodes.... The up to 10 is just the cap of iterations, but for formatted text nodes it will converge in around 2-3, I believe v8 handles these short-lived array well enough that this shouldn't be bad on performance. Following this PR will be heavy performance testing/optimizations for this new codec. If this does end up being heavy then it can be optimized in the following PR's

@brrichards brrichards marked this pull request as ready for review June 9, 2026 20:56
@brrichards brrichards requested a review from a team as a code owner June 9, 2026 20:56
Copilot AI review requested due to automatic review settings June 9, 2026 20:56

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 63 out of 63 changed files in this pull request and generated no comments.

* Experimental codec with optimizations for text.
*/
vTextExperimental: "text",
vTextExperimental: 3,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want to hold off on doing this until we have confirmed this new format is accomplishing our goals, and we don't need/want to make any adjustments to it (once it gets a number, we should never tweak the format and should support loading it forever)

* keeping it off the encoder instances is what lets recursive sub-chunk encodes get their
* own state with no snapshot/restore.
*/
private readonly onBeginBatch:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pattern seems odd to me. Can you just run any before and after logic you need inside schemaCompressedEncodeVTextExperimental before and after doing the actual encode? Also, when
schemaCompressedEncode is called, it has
everything thats needed to run this when constructing the context, which is only used for that encode, so I don't get why we need to defer this using a callback.

You could even have it call compressedEncode directly so it could provide a customized or even subclassed EncoderContext.

* {@link schemaCompressedEncodeVTextExperimental}, which hard-codes
* {@link defaultMinOccurrencesForSpecialization}.
*/
export function schemaCompressedEncodeVTextExperimentalForTests(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this only exits for tests, it should be in the test code, not the production code files, unless applications using FF need it for writing their own tests (which I don't think is the case here as its not package exported)

* {@link defaultMinOccurrencesForSpecialization}. See {@link SpecializableField} for
* the two kinds of field specialization supported.
*/
class VTextObjectNodeEncoder implements NodeEncoder {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't figured out the actual algorithm you are using yet, but it sounds like it might be a bit expensive. You should probably have some performance tests to measure how slow it is compared to the existing approach, as well as some opt size tests to compare how much of a win this gives in various cases, and see if any of our existing test cases regress.

) {}

public get shape(): AnyShape {
return AnyShape.instance;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like the encoding is forcing an indirection, putting a reference to the actual shape you are specializing in the data array unconditionally. That seems like a lot of overhead, and something we should avoid.

}

/**
* The per-encoder slice of {@link VTextBatchState}.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm finding this new code hard to read. I can't understand VTextBatchState without knowing what cohorts are, but CohortState is defined in terms of VTextBatchState.

Its the same kind of issue I called out in https://github.com/microsoft/FluidFramework/pull/27467/files/d75042feb21c542606a64903c494098914c757c0#r3345015961

Perhaps we can define a cohort. Then maybe define a cohort state in terms of that, then build up from there with the actual algorithm?

Comment on lines +480 to +485
/** Cohort key occurrence counts for the current count-pass iteration. */
counts: Map<string, number>;
/** Finalized counts from previous iterations, used for threshold decisions. */
resolveCounts: Map<string, number>;
/** Cached specialized shapes, keyed by cohort key. */
specializedEncoders: Map<string, SpecializedNodeShapeEncoder>;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is hard to understand: its hard to tell how these are used because the types are not very specifcic.

Do these fields need to be reassignable? If not, make them readonly so its clear the maps are only modified not reassigned.

Nothing explicitly states what the keys are here: are these three parallel maps with the same keys?
If so, generally thats an anti patten as having a single map with one copy of each key and an inner object with the relevant properties often makes more sense (there are reasons for either approach, but right now all I can really say is more clearity is needed: if these are parakllel maps you need to document/clarify that, and say why, while using a single map is more self documenting)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If those strings are some special keying scheme, it might be good to have a type alias for that so you can document the syntax and semantics in one place which is easily found. Using a stronger branded typw could do the same bug with added type safty and might be a good here.

@CraigMacomber

Copy link
Copy Markdown
Contributor

I don't follow the encoding algorithm, but I don't see how this addresses the fact that for even a simple type with N leaf fields which are Booleans, there, there are 2^N different sets of fields you can specialize, and for each of those 2^N cases, there are a ton of ways that specialization could be made out of other specialized shapes or not.

It sounds like you are counting occurrences of different shapes, but there are ~2^N possible shapes to count of every node with N children, plus the conditional count of the shapes those shapes could be made out of.

Anyway, the description in the PR and the comments I had time to go through in the code mostly left me wondering what a "cohort" is and how "cohort keys" work, and what approximate/heuristic this is intending to implement I think an optimal approach would be computationally intractable.

I could imagine computing a frequency table for every leaf value in the schema, then use some heuristics based on that to come up with shapes. Any field with exactly one value can be trivially inlined into the shape, but going beyond that requires knowing correlations between the fields, which gets messy and likely impractical.

You could decide that specializations for a given type always form a tree, where you specialize fields in the order:

  1. Fields with exactly one value (specialized in root)
  2. Fields sorted by how much data might be saved (for each value of the key, length of encoded value * number of occurrences - 1)

Then lazily populate that shape tree when encoding?

That's just an idea: not saying the approach you have is bad, or different, just that I'd expect something in the top level descript addressing high level algorithm concerns like how this decided which fields we should specialize and in what order, and I could get that from the existing description.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants