feat(tree): Add vTextExperimental codec with specialized node shape encoding#27515
feat(tree): Add vTextExperimental codec with specialized node shape encoding#27515brrichards wants to merge 4 commits into
Conversation
|
Hi! Thank you for opening this PR. Want me to review it? Based on the diff (5251 lines, 63 files), I've queued these reviewers:
How this works
|
🔭 PR Review Fleet ReportNote This report is generated by an experimental AI review fleet and is provided as a beta feature. Findings are a starting point for discussion, not a gate. Use your own judgement. Verdict: ❌ Request Changes 0 Disastrous, 1 Dangerous, 5 Disagreeable Findings
|
Maybe I'm thinking about this wrong, but if an error happens during encoding I imagine that is a very serious error that we don't want to just have happen and then continue with a potential continuation of using the tree/data. I also think that an error is always thrown in the case of this happening. I don't think a try/finally is doing anything here other than just catching an error that would already propogate up and then updating our batch stack that we dont want to use anymore? |
1 million rich text-nodes.... The up to 10 is just the cap of iterations, but for formatted text nodes it will converge in around 2-3, I believe v8 handles these short-lived array well enough that this shouldn't be bad on performance. Following this PR will be heavy performance testing/optimizations for this new codec. If this does end up being heavy then it can be optimized in the following PR's |
| * Experimental codec with optimizations for text. | ||
| */ | ||
| vTextExperimental: "text", | ||
| vTextExperimental: 3, |
There was a problem hiding this comment.
I think we want to hold off on doing this until we have confirmed this new format is accomplishing our goals, and we don't need/want to make any adjustments to it (once it gets a number, we should never tweak the format and should support loading it forever)
| * keeping it off the encoder instances is what lets recursive sub-chunk encodes get their | ||
| * own state with no snapshot/restore. | ||
| */ | ||
| private readonly onBeginBatch: |
There was a problem hiding this comment.
This pattern seems odd to me. Can you just run any before and after logic you need inside schemaCompressedEncodeVTextExperimental before and after doing the actual encode? Also, when
schemaCompressedEncode is called, it has
everything thats needed to run this when constructing the context, which is only used for that encode, so I don't get why we need to defer this using a callback.
You could even have it call compressedEncode directly so it could provide a customized or even subclassed EncoderContext.
| * {@link schemaCompressedEncodeVTextExperimental}, which hard-codes | ||
| * {@link defaultMinOccurrencesForSpecialization}. | ||
| */ | ||
| export function schemaCompressedEncodeVTextExperimentalForTests( |
There was a problem hiding this comment.
if this only exits for tests, it should be in the test code, not the production code files, unless applications using FF need it for writing their own tests (which I don't think is the case here as its not package exported)
| * {@link defaultMinOccurrencesForSpecialization}. See {@link SpecializableField} for | ||
| * the two kinds of field specialization supported. | ||
| */ | ||
| class VTextObjectNodeEncoder implements NodeEncoder { |
There was a problem hiding this comment.
I haven't figured out the actual algorithm you are using yet, but it sounds like it might be a bit expensive. You should probably have some performance tests to measure how slow it is compared to the existing approach, as well as some opt size tests to compare how much of a win this gives in various cases, and see if any of our existing test cases regress.
| ) {} | ||
|
|
||
| public get shape(): AnyShape { | ||
| return AnyShape.instance; |
There was a problem hiding this comment.
This seems like the encoding is forcing an indirection, putting a reference to the actual shape you are specializing in the data array unconditionally. That seems like a lot of overhead, and something we should avoid.
| } | ||
|
|
||
| /** | ||
| * The per-encoder slice of {@link VTextBatchState}. |
There was a problem hiding this comment.
I'm finding this new code hard to read. I can't understand VTextBatchState without knowing what cohorts are, but CohortState is defined in terms of VTextBatchState.
Its the same kind of issue I called out in https://github.com/microsoft/FluidFramework/pull/27467/files/d75042feb21c542606a64903c494098914c757c0#r3345015961
Perhaps we can define a cohort. Then maybe define a cohort state in terms of that, then build up from there with the actual algorithm?
| /** Cohort key occurrence counts for the current count-pass iteration. */ | ||
| counts: Map<string, number>; | ||
| /** Finalized counts from previous iterations, used for threshold decisions. */ | ||
| resolveCounts: Map<string, number>; | ||
| /** Cached specialized shapes, keyed by cohort key. */ | ||
| specializedEncoders: Map<string, SpecializedNodeShapeEncoder>; |
There was a problem hiding this comment.
This is hard to understand: its hard to tell how these are used because the types are not very specifcic.
Do these fields need to be reassignable? If not, make them readonly so its clear the maps are only modified not reassigned.
Nothing explicitly states what the keys are here: are these three parallel maps with the same keys?
If so, generally thats an anti patten as having a single map with one copy of each key and an inner object with the relevant properties often makes more sense (there are reasons for either approach, but right now all I can really say is more clearity is needed: if these are parakllel maps you need to document/clarify that, and say why, while using a single map is more self documenting)
There was a problem hiding this comment.
If those strings are some special keying scheme, it might be good to have a type alias for that so you can document the syntax and semantics in one place which is easily found. Using a stronger branded typw could do the same bug with added type safty and might be a good here.
|
I don't follow the encoding algorithm, but I don't see how this addresses the fact that for even a simple type with N leaf fields which are Booleans, there, there are 2^N different sets of fields you can specialize, and for each of those 2^N cases, there are a ton of ways that specialization could be made out of other specialized shapes or not. It sounds like you are counting occurrences of different shapes, but there are ~2^N possible shapes to count of every node with N children, plus the conditional count of the shapes those shapes could be made out of. Anyway, the description in the PR and the comments I had time to go through in the code mostly left me wondering what a "cohort" is and how "cohort keys" work, and what approximate/heuristic this is intending to implement I think an optimal approach would be computationally intractable. I could imagine computing a frequency table for every leaf value in the schema, then use some heuristics based on that to come up with shapes. Any field with exactly one value can be trivially inlined into the shape, but going beyond that requires knowing correlations between the fields, which gets messy and likely impractical. You could decide that specializations for a given type always form a tree, where you specialize fields in the order:
Then lazily populate that shape tree when encoding? That's just an idea: not saying the approach you have is bad, or different, just that I'd expect something in the top level descript addressing high level algorithm concerns like how this decided which fields we should specialize and in what order, and I could get that from the existing description. |
Summary
Wires up an encoder for the specialized node shape (
{f: {base, fields, value?}}) introduced in #27200. The bulk of this diff is snapshot files (~4000 lines across 54 snapshots) while the actual encoder changes are ~1100 lines across 8 source/test files.The encoder uses a two-pass approach per field batch:
Pass 1 (counting): Walks every node in the batch in post-order. Children are counted before their parents. For each
VTextObjectNodeEncodernode, a cohort key is built from its specializable fields (leaf values and resolved child shapes). Because we walk bottom-up, child shape decisions are already finalized when a parent builds its cohort key, allowing parents to reference their children's specialized shapes by identity.Pass 2 (encoding): For each node, if its cohort key's count from pass 1 meets the minimum threshold (
defaultMinOccurrencesForSpecialization = 8), aSpecializedNodeShapeEncoderis used and then encodes only the field overrides relative to the base shape. Otherwise the node falls back to the baseNodeShapeBasedEncoder.Batch state is held on a stack to keep each
compressedEncodecall's specialization decisions isolated. Recursive calls (e.g. for incremental fields that are encoded separately) push their own state, so sub-batches get independent counting without polluting the parent's totals.Testing
schemaCompressedEncodeVTextExperimentalForTestsallows threshold override for small test inputsAdditional optimizations are intentionally excluded from this PR and will land as separate follow-ups.
Reviewer Guidance
The review process is outlined on this wiki page.