Skip to content

CBG-5291: CAS-safe Update on dual MetadataStore wrapper#8238

Merged
bbrks merged 4 commits into
mainfrom
CBG-5291
May 11, 2026
Merged

CBG-5291: CAS-safe Update on dual MetadataStore wrapper#8238
bbrks merged 4 commits into
mainfrom
CBG-5291

Conversation

@bbrks
Copy link
Copy Markdown
Member

@bbrks bbrks commented May 6, 2026

CBG-5291

MetadataStore.Update previously delegated the whole read-modify-write to the fallback store, and WriteUpdateWithXattrs went straight to primary without consulting fallback.
Both now share a wrapper-level CAS retry loop that always lands writes in primary, even when the doc only exists in fallback.

  • Check primary for existence; if it holds the doc, directly delegate to the primary data store.
  • Otherwise read from fallback, run the callback, and insert into primary with cas=0 to force insertion.
    • On CAS failure, retry via RetryLoopCas / DefaultRetrySleeper. It's expect that a CAS failure will subsequently find an entry in the primary - since all writes are routed there. Only reads have the fallback.
  • Deletes/tombstones on fallback-held docs apply to fallback, because the primary data store never held the doc, so a primary tombstone would just shadow it, instead of actually removing the original.
  • WriteUpdateWithXattrs short-circuits to primary when previous.Cas != 0, since the wrapper reads only return non-zero CAS for primary hits and so can rely on a regular update.

Integration Tests

Copilot AI review requested due to automatic review settings May 6, 2026 14:58
@bbrks bbrks requested a review from gregns1 May 6, 2026 16:36
@bbrks bbrks assigned bbrks and torcolvin and unassigned bbrks and gregns1 May 6, 2026
Copy link
Copy Markdown
Collaborator

@torcolvin torcolvin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's some definitely odd existing behavior around Update and UpdateWithXattrs sometimes returning a cas value on error instead of 0. I think this code might have divergence in rosmar/CBS but I think it might be worthwhile to match the upstream behavior.

The other major issue is potentially whether we want to handle tombstone documents. We allow dbconfig to be tombstones in app services and then we will resurrect them. I wonder if we want to be more explicit about tombstone handling and removal. I don't have this part of the code fully loaded in my head, but if you do I'm happy, or at least make sure that we have test coverage for the expected behavior.

Comment thread base/bucket_gocb_test.go
calls++
callbackInputs = append(callbackInputs, append([]byte(nil), current...))
if calls == 1 {
// Simulate a concurrent writer beating us to the punch on primary.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test that might seem missing is what happens if you have a retry here but you also have a writer that writes to the secondary location for read and you have to update it?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The design of the MetadataStore dual-collection wrapper means that all^ writes only ever go to primary once it is in use.

^caveats:

  • Deletes can apply to the fallback which are technically a write... but there's handling in this PR to cover that case by simply retrying the update.
  • The timing of all SG nodes actually enabling/using MetadataStore consistently is a problem to be solved with future tickets to close up any gaps where existing SG nodes are writing metadata to the fallback.

By design most Update operations aren't actually 'CAS safe' in the traditional sense, since the caller hasn't actually got a CAS to work with. The read->callback mutation->write steps are the only CAS-safe bit.

Comment thread base/bucket_gocb_test.go
assert.Equal(t, fallbackBody, fallbackRaw, "fallback is never written by Update")
}

// TestMetadataStoreUpdateDeleteFallbackOnly verifies the CBG-5291 delete behaviour for a doc
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anytime you write tombstones it makes me think about resurrections. I think that the behavior is probably the same, but Update can do a resurrection with preserving existing xattrs? I think I'm comfortable the tests as is, but if you aren't, you can write some.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done - resurrections restore into primary and reinstate the original xattrs

Comment thread base/dual_metadata_store.go Outdated
// Delete the fallback copy — primary never held it, so a primary tombstone is
// pointless and would shadow the fallback for nothing. Remove with the CAS we
// just observed; on a concurrent fallback mutation we retry the loop.
_, removeErr := ms.fallback.Remove(k, fallbackCas)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we actually want to Purge this document?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Purge meaning what in this context? Removing any system xattrs too?

Comment thread base/dual_metadata_store.go Outdated
return ms.primary.Update(k, exp, callback)
}
step := func() (uint64, error) {
fallbackValue, fallbackCas, err := ms.fallback.GetRaw(k)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to return false for a tombstone, but we don't return false for a tombstone in the general case of Update, is this the desired behavior?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you re-check this - I'm not quite sure what you mean and some of the improved naming/comments may help answer this.

Comment thread base/dual_metadata_store.go
Comment thread base/dual_metadata_store.go Outdated
if IsCasMismatch(writeErr) {
return 0, errRetryUpdateLoop
}
return 0, writeErr
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The upstream update function returns cas if found and there is an error. I'm not sure if this can ever occur meaningfully but I remember this being a weird quirk of the API. This may be a case where rosmar has divergent behavior from CBS.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only ever want to return a non-zero CAS if these wrappers are successful by design. Added a lot of comments around this.

The intention is the retry loop will pick up new CAS values each time.

Comment thread base/dual_metadata_store.go Outdated

newValue, cbExpiry, isDelete, cbErr := callback(fallbackValue)
if cbErr != nil {
return 0, cbErr
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The upstream update function returns cas even if there is an error. I'm nto sure if this is intentional or not, but it might be a really subtle set of bugs to not match behavior.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only ever want to return a non-zero CAS if these wrappers are successful by design. Added a lot of comments around this.

The intention is the retry loop will pick up new CAS values each time.

Comment thread base/dual_metadata_store.go Outdated
return 0, errRetryUpdateLoop
}
if cbErr != nil {
return 0, cbErr
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unlike Update, this function's parent does return 0 cas.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only ever want to return a non-zero CAS if these wrappers are successful by design. Added a lot of comments around this.

The intention is the retry loop will pick up new CAS values each time.

Comment thread base/dual_metadata_store.go Outdated
Comment thread base/dual_metadata_store.go
@torcolvin torcolvin assigned bbrks and unassigned torcolvin May 7, 2026
bbrks and others added 4 commits May 8, 2026 13:49
Replaces the placeholder MetadataStore.Update (which delegated the entire
read-modify-write to the fallback datastore) with a wrapper-level CAS
retry loop that always lands writes in primary. When the doc is in
primary or the doc is absent from both stores, it delegates to
primary.Update; when the doc lives only in fallback, it feeds the
fallback value to the caller's callback and inserts the result into
primary via WriteCas, retrying on CAS race. A delete returned by the
callback for a fallback-only doc is a no-op in primary - the metadata
migration sweep tombstones the fallback copy.

WriteUpdateWithXattrs gets the same dual-read pattern for symmetry
(callers today are application-data only, but a future metadata caller
would otherwise silently skip fallback data).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address review feedback on the CAS-safe Update wrapper:
- Update / WriteUpdateWithXattrs delete on a fallback-only doc now lands
  in the fallback (Remove / WriteTombstoneWithXattrs with the observed
  CAS, retrying on race) so subsequent reads stop falling through to a
  stale doc instead of relying on the migration sweep.
- Document the previous.Cas != 0 shortcut: the wrapper's read methods
  discard fallback CAS, so a non-zero CAS from the wrapper unambiguously
  identifies a primary revision.
- Note in code that xattrsToDelete is intentionally nil on the
  migration-insert branch (cas=0 inserts reject xattrsToDelete with
  ErrDeleteXattrOnDocumentInsert).

New / rewritten tests cover each behaviour and were verified TDD-style
to catch a regression of the corresponding fix:
- TestMetadataStoreUpdateDeleteFallbackOnly (rewritten for new contract)
- TestMetadataStoreUpdateRetriesOnCASMismatch
- TestMetadataStoreWriteUpdateWithXattrsCASShortcut
- TestMetadataStoreWriteUpdateWithXattrsHonorsXattrsToDelete
- TestMetadataStoreWriteUpdateWithXattrsTombstoneFallbackOnly

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Update and WriteUpdateWithXattrs both ran the same outer loop — probe
primary, delegate when primary holds the doc or migration is sealed,
otherwise read fallback / invoke the caller's callback / write to
primary (or, on a fallback-only delete or tombstone, write to
fallback). Lift the shared scaffolding into runFallbackUpdateLoop.

The helper drives a worker through RetryLoopCas with
DefaultRetrySleeper, matching the CAS-retry bound (10 attempts,
doubling sleep from 5ms) used elsewhere in this package; the loop
returns NewRetryTimeoutError if exceeded. Per-method specifics
— fallback-read shape, callback signature, delete vs. insert
classification, xattr-specific NotFound retry — stay in each
method's step closure, with the small CAS/NotFound/terminal switch
inlined at each branch.

WriteUpdateWithXattrs threads its caller-supplied ctx through;
Update has no ctx parameter (DataStore.Update doesn't carry one)
and passes context.TODO(), matching the existing readFromFallback
idiom in the same file.

No behaviour change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e loop

Address Tor's PR review and tighten the wrapper:
- Use errors.Is for ErrCasFailureShouldRetry (was ==).
- Add fallback-tombstone resurrection coverage for Update and WriteUpdateWithXattrs,
  plus a concurrent-fallback-writer test to document the primary-shadows-fallback contract.
- Inline runFallbackUpdateLoop into Update and WriteUpdateWithXattrs; drop the
  errRetryUpdateLoop sentinel.
- Document the deliberate divergence from upstream Update's return-cas-on-error
  behaviour, and explain why no previousLoopCas defensive check is needed here.
- Simplify godoc and rename closure-locals (primaryDelegate -> primaryUpdateFn,
  fallbackCas -> fallbackCasForDelete/Tombstone) for clarity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bbrks bbrks merged commit c049b7c into main May 11, 2026
50 checks passed
@bbrks bbrks deleted the CBG-5291 branch May 11, 2026 09:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants