feat(tree): Add vTextExperimental codec with specialized node shape encoding by brrichards · Pull Request #27515 · microsoft/FluidFramework

brrichards · 2026-06-08T21:29:07Z

Summary

Wires up an encoder for the specialized node shape ({f: {base, fields, value?}}) introduced in #27200. The bulk of this diff is snapshot files (~4000 lines across 54 snapshots) while the actual encoder changes are ~1100 lines across 8 source/test files.

The encoder uses a two-pass approach per field batch:

Pass 1 (counting): Walks every node in the batch in post-order. Children are counted before their parents. For each VTextObjectNodeEncoder node, a cohort key is built from its specializable fields (leaf values and resolved child shapes). Because we walk bottom-up, child shape decisions are already finalized when a parent builds its cohort key, allowing parents to reference their children's specialized shapes by identity.
Pass 2 (encoding): For each node, if its cohort key's count from pass 1 meets the minimum threshold (defaultMinOccurrencesForSpecialization = 8), a SpecializedNodeShapeEncoder is used and then encodes only the field overrides relative to the base shape. Otherwise the node falls back to the base NodeShapeBasedEncoder.

Batch state is held on a stack to keep each compressedEncode call's specialization decisions isolated. Recursive calls (e.g. for incremental fields that are encoded separately) push their own state, so sub-batches get independent counting without polluting the parent's totals.

Testing

Existing encode/decode round-trip tests extended for vTextExperimental
Snapshot tests cover schemaless and schema-compressed Vtext output
schemaCompressedEncodeVTextExperimentalForTests allows threshold override for small test inputs

Additional optimizations are intentionally excluded from this PR and will land as separate follow-ups.

Reviewer Guidance

The review process is outlined on this wiki page.

github-actions · 2026-06-08T21:30:16Z

Hi! Thank you for opening this PR. Want me to review it?

Based on the diff (5251 lines, 63 files), I've queued these reviewers:

Correctness — logic errors, race conditions, lifecycle issues
Security — vulnerabilities, secret exposure, injection
API Compatibility — breaking changes, release tags, type design
Performance — algorithmic regressions, memory leaks
Testing — coverage gaps, hollow tests

How this works

Adjust the reviewer set by ticking/unticking boxes above. Reviewer toggles alone don't trigger anything.
Tick Start review below to dispatch the review fleet.
After review finishes, tick Start review again to request another run — it auto-resets after each dispatch.
This comment updates as new commits land; your reviewer selections are preserved.
Start review

github-actions · 2026-06-08T21:45:21Z

🔭 PR Review Fleet Report

Note

This report is generated by an experimental AI review fleet and is provided as a beta feature. Findings are a starting point for discussion, not a gate. Use your own judgement.

Verdict: ❌ Request Changes

0 Disastrous, 1 Dangerous, 5 Disagreeable

Findings

Sev	#	Area	File	What	Fix
🐊 Dangerous	H1	Correctness	`packages/dds/tree/src/feature-libraries/chunked-forest/codec/schemaBasedEncode.ts:585`	Cohort key collisions due to unsanitized `\|` separator in string leaf values. `cohortKey` builds its key by joining parts with `\|` (e.g. `L:string:v1\|L:string:v2`), but string values may themselves contain the substring `\|L:string:`. Concrete example: a node with field1="a\|L:string:b" and field2="c" produces the exact same cohort key as a node with field1="a" and field2="b\|L:string:c" — both yield `L:string:a\|L:string:b\|L:string:c`. When the combined occurrence count of these two structurally-different value tuples crosses `minOccurrencesForSpecialization`, `createSpecialized` is called once and the resulting `SpecializedNodeShapeEncoder` bakes the first caller's constant values into the shape. Every subsequent node matched to the same key — but carrying different actual values — is silently encoded with the wrong constants, because `encodeValue` for array-valued shapes writes nothing to the stream and the decoder unconditionally uses the shape's constant. This is silent data corruption: no assertion fires and the document round-trips to wrong values. Realistic trigger: any schema with two or more `Multiplicity.Single` string-typed fields where document content happens to contain the literal substring `\|L:string:` (e.g. in prose, code, or technical notation).	Replace the raw `\|`-join in `cohortKey` with collision-resistant encoding. The simplest correct fix is to JSON-stringify each part before joining, or to percent-encode the `\|` character inside each part's value segment: e.g. replace `valueKey(cursor.value)` with `JSON.stringify(cursor.value)` so that `\|` within a string value becomes `"\|"` and can never be confused with the separator. Similarly ensure the `S:${id}` parts, which use only decimal digits, can never match an `L:` prefix, which is already the case — so only the `L:` path needs fixing.
🐍 Disagreeable	M1	Correctness	`packages/dds/tree/src/feature-libraries/chunked-forest/codec/compressedEncode.ts:56`	`compressedEncode` calls `context.beginBatch(fieldBatch)` (which pushes a `VTextBatchState` onto the VText batch stack) but does not use try/finally to guarantee `context.endBatch()` (which pops it) is called if an exception is thrown during encoding. If `anyFieldEncoder.encodeField` or `updateShapesAndIdentifiersEncoding` throws, the batch stack retains an extra element. For VText incremental sub-chunk encoding, `compressedEncode` is called recursively with the same `EncoderContext` instance (see `incrementalFieldEncoder` which passes `context` to the recursive call). If any implementation of `IncrementalEncoder.encodeIncrementalField` catches the exception thrown by its encode callback and then allows the outer `compressedEncode` to continue, `currentBatch()` will return the stale inner batch instead of the outer batch. All subsequent cohort lookups and specialization decisions in the remainder of the outer encode would use the wrong `VTextBatchState`, producing wrong shapes and potentially corrupted output.	Wrap the body of `compressedEncode` in a try/finally block so `context.endBatch()` is always called: `context.beginBatch(fieldBatch); try { ... return result; } finally { context.endBatch(); }`.
🐍 Disagreeable	M2	Performance	`packages/dds/tree/src/feature-libraries/chunked-forest/codec/schemaBasedEncode.ts:566`	`cohortKey` allocates a fresh `string[]` (`const parts: string[] = []`) on every invocation. It is called once per `VTextObjectNodeEncoder` node during the encoding pass (O(N)) and once per node per count-pass iteration (up to 10 × O(N)). For a large collaborative document with 1M rich-text nodes this produces ~10M short-lived array allocations per summary encode, generating significant GC pressure. For documents that are encoded frequently (e.g. on every container attach or summary upload under incremental-encode paths) the heap churn accumulates.	Replace the `parts: string[]` + `parts.join('\|')` pattern with direct string concatenation: `let key = ''; for (const field of ...) { key += (key ? '\|' : '') + ...; } return key;`. This eliminates the per-call array allocation without changing observable behavior.
🐍 Disagreeable	M3	Performance	`packages/dds/tree/src/feature-libraries/chunked-forest/codec/schemaBasedEncode.ts:520`	During the count pass, `countNodeAndDescendants` uses a post-order traversal so children are visited before their parent. When the parent node's `countNode` is then called, `cohortKey` re-enters each `subShape` specializable child field — navigating the cursor into fields and nodes that were just visited by the outer traversal. This doubles the cursor navigation work (enter/exit field + enter/exit node) for every level of `subShape` nesting in the counting pass. The same double-traversal occurs in the encoding pass: `encodeNode` calls `resolveShape → cohortKey` (cursor into children to resolve shapes), then the resolved `NodeShapeBasedEncoder.encodeNode` traverses all fields again to produce output. For a schema with D levels of nested `subShape` fields and N total nodes the encode pass does O(2 × N × D) cursor navigations instead of O(N × D). For real-world rich-text schemas with 2–3 nesting levels and 1M leaf nodes this is a concrete 2–3× overhead on cursor traversal.	Cache the cohort key alongside the node's post-order count result so that the parent's `cohortKey` call can read the pre-computed child shape ID without re-entering the cursor. One approach: after `countNode(child)` completes, stash the resolved shape (or its ID) in the `CohortState` indexed by the cursor's stable node identity, and have the parent's `cohortKey` read from that cache instead of re-navigating.
🐍 Disagreeable	M4	Testing	`packages/dds/tree/src/feature-libraries/chunked-forest/codec/schemaBasedEncode.ts:723`	`resolveShape`'s `specializedEncoders` cache is never tested with more than one entry. All unit tests for the VText specialization path produce exactly one specialized shape (either only one tuple crosses the threshold, or all instances share the same tuple). The cache map `state.specializedEncoders` is therefore only exercised for size 0 (miss → create) and size 1 (hit). A bug where two distinct above-threshold tuples receive the same cohort key — causing one tuple's nodes to be encoded with the other's specialized shape — would go undetected: the single-shape tests still pass, and the snapshot tests only verify round-trip fidelity, not that each tuple got a distinct encoder.	Add a test with a schema that has two boolean fields where three distinct tuples each appear above the threshold, e.g. 3 × `{bold:true, italic:true}`, 3 × `{bold:false, italic:false}`, 3 × `{bold:true, italic:false}` with `minOccurrencesForSpecialization=3`. Assert `countSpecializedShapes(encoded) === 3`, then round-trip and verify the decoded tree equals the original. This exercises `specializedEncoders` with 3 entries and confirms no cohort-key collision silently merges distinct tuples.
🐍 Disagreeable	M5	Testing	`packages/dds/tree/src/feature-libraries/chunked-forest/codec/schemaBasedEncode.ts:473`	The multi-iteration counting loop in `buildContextVText` (`for (let iteration = 0; iteration < countPassMaxIterations; iteration++)`) is never exercised beyond its first iteration. A second iteration is required when the schema has nested `subShape` specializations: in iteration 1, child cohort keys are resolved against empty `resolveCounts` (all children use the base shape), so parent nodes group by a "base child" key. After `commitIteration()` populates `resolveCounts`, iteration 2 re-counts with children potentially resolving to specialized shapes, changing parent cohort keys. No test creates a schema with a parent object whose `subShape` field points to a child `VTextObjectNodeEncoder` — so a bug in `commitIteration` that causes premature loop exit (e.g., failing to detect changed counts) would produce incorrect parent cohort keys and wrong nested specializations, undetected by any existing test.	Add a test with a two-level schema: outer `Paragraph` object has a required single-valued `format` field of polymorphic type (e.g., `BoldFormat \| ItalicFormat`, each being an object with boolean fields). Provide enough `Paragraph` instances so that both the child `format` nodes and the parent `Paragraph` nodes cross the specialization threshold. Assert that the outer batch produces specialized shapes for both levels (`countSpecializedShapes >= 2`), then round-trip and verify the decoded tree. This forces the counting loop to run at least two iterations and confirms `commitIteration`'s monotonic merge is correct.

View workflow run

brrichards · 2026-06-08T22:35:46Z

compressedEncode calls context.beginBatch(fieldBatch) (which pushes a VTextBatchState onto the VText batch stack) but does not use try/finally to guarantee context.endBatch() (which pops it) is called if an exception is thrown during encoding. If anyFieldEncoder.encodeField or updateShapesAndIdentifiersEncoding throws, the batch stack retains an extra element. For VText incremental sub-chunk encoding, compressedEncode is called recursively with the same EncoderContext instance (see incrementalFieldEncoder which passes context to the recursive call). If any implementation of IncrementalEncoder.encodeIncrementalField catches the exception thrown by its encode callback and then allows the outer compressedEncode to continue, currentBatch() will return the stale inner batch instead of the outer batch. All subsequent cohort lookups and specialization decisions in the remainder of the outer encode would use the wrong VTextBatchState, producing wrong shapes and potentially corrupted output.

Maybe I'm thinking about this wrong, but if an error happens during encoding I imagine that is a very serious error that we don't want to just have happen and then continue with a potential continuation of using the tree/data. I also think that an error is always thrown in the case of this happening. I don't think a try/finally is doing anything here other than just catching an error that would already propogate up and then updating our batch stack that we dont want to use anymore?

brrichards · 2026-06-08T22:44:33Z

cohortKey allocates a fresh string[] (const parts: string[] = []) on every invocation. It is called once per VTextObjectNodeEncoder node during the encoding pass (O(N)) and once per node per count-pass iteration (up to 10 × O(N)). For a large collaborative document with 1M rich-text nodes this produces ~10M short-lived array allocations per summary encode, generating significant GC pressure. For documents that are encoded frequently (e.g. on every container attach or summary upload under incremental-encode paths) the heap churn accumulates.

1 million rich text-nodes.... The up to 10 is just the cap of iterations, but for formatted text nodes it will converge in around 2-3, I believe v8 handles these short-lived array well enough that this shouldn't be bad on performance. Following this PR will be heavy performance testing/optimizations for this new codec. If this does end up being heavy then it can be optimized in the following PR's

Copilot

Copilot reviewed 63 out of 63 changed files in this pull request and generated no comments.

CraigMacomber · 2026-06-12T21:28:23Z

 	 * Experimental codec with optimizations for text.
 	 */
-	vTextExperimental: "text",
+	vTextExperimental: 3,


I think we want to hold off on doing this until we have confirmed this new format is accomplishing our goals, and we don't need/want to make any adjustments to it (once it gets a number, we should never tweak the format and should support loading it forever)

CraigMacomber · 2026-06-12T21:42:14Z

+		 * keeping it off the encoder instances is what lets recursive sub-chunk encodes get their
+		 * own state with no snapshot/restore.
+		 */
+		private readonly onBeginBatch:


This pattern seems odd to me. Can you just run any before and after logic you need inside schemaCompressedEncodeVTextExperimental before and after doing the actual encode? Also, when
schemaCompressedEncode is called, it has
everything thats needed to run this when constructing the context, which is only used for that encode, so I don't get why we need to defer this using a callback.

You could even have it call compressedEncode directly so it could provide a customized or even subclassed EncoderContext.

CraigMacomber · 2026-06-12T21:43:40Z

+ * {@link schemaCompressedEncodeVTextExperimental}, which hard-codes
+ * {@link defaultMinOccurrencesForSpecialization}.
+ */
+export function schemaCompressedEncodeVTextExperimentalForTests(


if this only exits for tests, it should be in the test code, not the production code files, unless applications using FF need it for writing their own tests (which I don't think is the case here as its not package exported)

CraigMacomber · 2026-06-12T21:50:24Z

+ * {@link defaultMinOccurrencesForSpecialization}. See {@link SpecializableField} for
+ * the two kinds of field specialization supported.
+ */
+class VTextObjectNodeEncoder implements NodeEncoder {


I haven't figured out the actual algorithm you are using yet, but it sounds like it might be a bit expensive. You should probably have some performance tests to measure how slow it is compared to the existing approach, as well as some opt size tests to compare how much of a win this gives in various cases, and see if any of our existing test cases regress.

CraigMacomber · 2026-06-12T21:53:09Z

+	) {}
+
+	public get shape(): AnyShape {
+		return AnyShape.instance;


This seems like the encoding is forcing an indirection, putting a reference to the actual shape you are specializing in the data array unconditionally. That seems like a lot of overhead, and something we should avoid.

CraigMacomber · 2026-06-12T22:00:53Z

+}
+
+/**
+ * The per-encoder slice of {@link VTextBatchState}.


I'm finding this new code hard to read. I can't understand VTextBatchState without knowing what cohorts are, but CohortState is defined in terms of VTextBatchState.

Its the same kind of issue I called out in https://github.com/microsoft/FluidFramework/pull/27467/files/d75042feb21c542606a64903c494098914c757c0#r3345015961

Perhaps we can define a cohort. Then maybe define a cohort state in terms of that, then build up from there with the actual algorithm?

CraigMacomber · 2026-06-12T22:04:58Z

+	/** Cohort key occurrence counts for the current count-pass iteration. */
+	counts: Map<string, number>;
+	/** Finalized counts from previous iterations, used for threshold decisions. */
+	resolveCounts: Map<string, number>;
+	/** Cached specialized shapes, keyed by cohort key. */
+	specializedEncoders: Map<string, SpecializedNodeShapeEncoder>;


This is hard to understand: its hard to tell how these are used because the types are not very specifcic.

Do these fields need to be reassignable? If not, make them readonly so its clear the maps are only modified not reassigned.

Nothing explicitly states what the keys are here: are these three parallel maps with the same keys?
If so, generally thats an anti patten as having a single map with one copy of each key and an inner object with the relevant properties often makes more sense (there are reasons for either approach, but right now all I can really say is more clearity is needed: if these are parakllel maps you need to document/clarify that, and say why, while using a single map is more self documenting)

If those strings are some special keying scheme, it might be good to have a type alias for that so you can document the syntax and semantics in one place which is easily found. Using a stronger branded typw could do the same bug with added type safty and might be a good here.

CraigMacomber · 2026-06-12T22:25:35Z

I don't follow the encoding algorithm, but I don't see how this addresses the fact that for even a simple type with N leaf fields which are Booleans, there, there are 2^N different sets of fields you can specialize, and for each of those 2^N cases, there are a ton of ways that specialization could be made out of other specialized shapes or not.

It sounds like you are counting occurrences of different shapes, but there are ~2^N possible shapes to count of every node with N children, plus the conditional count of the shapes those shapes could be made out of.

Anyway, the description in the PR and the comments I had time to go through in the code mostly left me wondering what a "cohort" is and how "cohort keys" work, and what approximate/heuristic this is intending to implement I think an optimal approach would be computationally intractable.

I could imagine computing a frequency table for every leaf value in the schema, then use some heuristics based on that to come up with shapes. Any field with exactly one value can be trivially inlined into the shape, but going beyond that requires knowing correlations between the fields, which gets messy and likely impractical.

You could decide that specializations for a given type always form a tree, where you specialize fields in the order:

Fields with exactly one value (specialized in root)
Fields sorted by how much data might be saved (for each value of the key, length of encoded value * number of occurrences - 1)

Then lazily populate that shape tree when encoding?

That's just an idea: not saying the approach you have is bad, or different, just that I'd expect something in the top level descript addressing high level algorithm concerns like how this decided which fields we should specialize and in what order, and I could get that from the existing description.

vTextExperimental codec

01276e7

brrichards added 2 commits June 8, 2026 16:40

address feedback

21ed2bd

Change experimental codec version to number format and update test suite

050b830

brrichards marked this pull request as ready for review June 9, 2026 20:56

brrichards requested a review from a team as a code owner June 9, 2026 20:56

Copilot AI review requested due to automatic review settings June 9, 2026 20:56

Copilot started reviewing on behalf of brrichards June 9, 2026 20:56 View session

Copilot AI reviewed Jun 9, 2026

View reviewed changes

Merge branch 'main' into experimental-vtext-clean

f5e2cf0

CraigMacomber reviewed Jun 12, 2026

View reviewed changes

Conversation

brrichards commented Jun 8, 2026

Summary

Testing

Reviewer Guidance

Uh oh!

github-actions Bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔭 PR Review Fleet Report

Findings

Uh oh!

brrichards commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

brrichards commented Jun 8, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

CraigMacomber commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

github-actions Bot commented Jun 8, 2026 •

edited

Loading

github-actions Bot commented Jun 8, 2026 •

edited

Loading

brrichards commented Jun 8, 2026 •

edited

Loading