CBG-5291: CAS-safe Update on dual MetadataStore wrapper by bbrks · Pull Request #8238 · couchbase/sync_gateway

bbrks · 2026-05-06T14:58:28Z

MetadataStore.Update previously delegated the whole read-modify-write to the fallback store, and WriteUpdateWithXattrs went straight to primary without consulting fallback.
Both now share a wrapper-level CAS retry loop that always lands writes in primary, even when the doc only exists in fallback.

Check primary for existence; if it holds the doc, directly delegate to the primary data store.
Otherwise read from fallback, run the callback, and insert into primary with cas=0 to force insertion.
- On CAS failure, retry via RetryLoopCas / DefaultRetrySleeper. It's expect that a CAS failure will subsequently find an entry in the primary - since all writes are routed there. Only reads have the fallback.
Deletes/tombstones on fallback-held docs apply to fallback, because the primary data store never held the doc, so a primary tombstone would just shadow it, instead of actually removing the original.
WriteUpdateWithXattrs short-circuits to primary when previous.Cas != 0, since the wrapper reads only return non-zero CAS for primary hits and so can rely on a regular update.

Integration Tests

https://jenkins.sgwdev.com/job/SyncGatewayIntegration/600/

torcolvin

There's some definitely odd existing behavior around Update and UpdateWithXattrs sometimes returning a cas value on error instead of 0. I think this code might have divergence in rosmar/CBS but I think it might be worthwhile to match the upstream behavior.

The other major issue is potentially whether we want to handle tombstone documents. We allow dbconfig to be tombstones in app services and then we will resurrect them. I wonder if we want to be more explicit about tombstone handling and removal. I don't have this part of the code fully loaded in my head, but if you do I'm happy, or at least make sure that we have test coverage for the expected behavior.

torcolvin · 2026-05-07T15:46:06Z

+		calls++
+		callbackInputs = append(callbackInputs, append([]byte(nil), current...))
+		if calls == 1 {
+			// Simulate a concurrent writer beating us to the punch on primary.


The test that might seem missing is what happens if you have a retry here but you also have a writer that writes to the secondary location for read and you have to update it?

The design of the MetadataStore dual-collection wrapper means that all^ writes only ever go to primary once it is in use.

^caveats:

Deletes can apply to the fallback which are technically a write... but there's handling in this PR to cover that case by simply retrying the update.

The timing of all SG nodes actually enabling/using MetadataStore consistently is a problem to be solved with future tickets to close up any gaps where existing SG nodes are writing metadata to the fallback.

By design most Update operations aren't actually 'CAS safe' in the traditional sense, since the caller hasn't actually got a CAS to work with. The read->callback mutation->write steps are the only CAS-safe bit.

torcolvin · 2026-05-07T15:47:33Z

+	assert.Equal(t, fallbackBody, fallbackRaw, "fallback is never written by Update")
+}
+
+// TestMetadataStoreUpdateDeleteFallbackOnly verifies the CBG-5291 delete behaviour for a doc


Anytime you write tombstones it makes me think about resurrections. I think that the behavior is probably the same, but Update can do a resurrection with preserving existing xattrs? I think I'm comfortable the tests as is, but if you aren't, you can write some.

Done - resurrections restore into primary and reinstate the original xattrs

torcolvin · 2026-05-07T15:49:01Z

+			// Delete the fallback copy — primary never held it, so a primary tombstone is
+			// pointless and would shadow the fallback for nothing. Remove with the CAS we
+			// just observed; on a concurrent fallback mutation we retry the loop.
+			_, removeErr := ms.fallback.Remove(k, fallbackCas)


Do we actually want to Purge this document?

Purge meaning what in this context? Removing any system xattrs too?

torcolvin · 2026-05-07T15:49:37Z

+		return ms.primary.Update(k, exp, callback)
+	}
+	step := func() (uint64, error) {
+		fallbackValue, fallbackCas, err := ms.fallback.GetRaw(k)


This is going to return false for a tombstone, but we don't return false for a tombstone in the general case of Update, is this the desired behavior?

Can you re-check this - I'm not quite sure what you mean and some of the improved naming/comments may help answer this.

torcolvin · 2026-05-07T15:52:17Z

+		if IsCasMismatch(writeErr) {
+			return 0, errRetryUpdateLoop
+		}
+		return 0, writeErr


The upstream update function returns cas if found and there is an error. I'm not sure if this can ever occur meaningfully but I remember this being a weird quirk of the API. This may be a case where rosmar has divergent behavior from CBS.

Only ever want to return a non-zero CAS if these wrappers are successful by design. Added a lot of comments around this.

The intention is the retry loop will pick up new CAS values each time.

torcolvin · 2026-05-07T15:53:08Z

+
+		newValue, cbExpiry, isDelete, cbErr := callback(fallbackValue)
+		if cbErr != nil {
+			return 0, cbErr


The upstream update function returns cas even if there is an error. I'm nto sure if this is intentional or not, but it might be a really subtle set of bugs to not match behavior.

Only ever want to return a non-zero CAS if these wrappers are successful by design. Added a lot of comments around this.

The intention is the retry loop will pick up new CAS values each time.

torcolvin · 2026-05-07T15:54:32Z

+			return 0, errRetryUpdateLoop
+		}
+		if cbErr != nil {
+			return 0, cbErr


Unlike Update, this function's parent does return 0 cas.

Only ever want to return a non-zero CAS if these wrappers are successful by design. Added a lot of comments around this.

The intention is the retry loop will pick up new CAS values each time.

Replaces the placeholder MetadataStore.Update (which delegated the entire read-modify-write to the fallback datastore) with a wrapper-level CAS retry loop that always lands writes in primary. When the doc is in primary or the doc is absent from both stores, it delegates to primary.Update; when the doc lives only in fallback, it feeds the fallback value to the caller's callback and inserts the result into primary via WriteCas, retrying on CAS race. A delete returned by the callback for a fallback-only doc is a no-op in primary - the metadata migration sweep tombstones the fallback copy. WriteUpdateWithXattrs gets the same dual-read pattern for symmetry (callers today are application-data only, but a future metadata caller would otherwise silently skip fallback data). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Address review feedback on the CAS-safe Update wrapper: - Update / WriteUpdateWithXattrs delete on a fallback-only doc now lands in the fallback (Remove / WriteTombstoneWithXattrs with the observed CAS, retrying on race) so subsequent reads stop falling through to a stale doc instead of relying on the migration sweep. - Document the previous.Cas != 0 shortcut: the wrapper's read methods discard fallback CAS, so a non-zero CAS from the wrapper unambiguously identifies a primary revision. - Note in code that xattrsToDelete is intentionally nil on the migration-insert branch (cas=0 inserts reject xattrsToDelete with ErrDeleteXattrOnDocumentInsert). New / rewritten tests cover each behaviour and were verified TDD-style to catch a regression of the corresponding fix: - TestMetadataStoreUpdateDeleteFallbackOnly (rewritten for new contract) - TestMetadataStoreUpdateRetriesOnCASMismatch - TestMetadataStoreWriteUpdateWithXattrsCASShortcut - TestMetadataStoreWriteUpdateWithXattrsHonorsXattrsToDelete - TestMetadataStoreWriteUpdateWithXattrsTombstoneFallbackOnly Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Update and WriteUpdateWithXattrs both ran the same outer loop — probe primary, delegate when primary holds the doc or migration is sealed, otherwise read fallback / invoke the caller's callback / write to primary (or, on a fallback-only delete or tombstone, write to fallback). Lift the shared scaffolding into runFallbackUpdateLoop. The helper drives a worker through RetryLoopCas with DefaultRetrySleeper, matching the CAS-retry bound (10 attempts, doubling sleep from 5ms) used elsewhere in this package; the loop returns NewRetryTimeoutError if exceeded. Per-method specifics — fallback-read shape, callback signature, delete vs. insert classification, xattr-specific NotFound retry — stay in each method's step closure, with the small CAS/NotFound/terminal switch inlined at each branch. WriteUpdateWithXattrs threads its caller-supplied ctx through; Update has no ctx parameter (DataStore.Update doesn't carry one) and passes context.TODO(), matching the existing readFromFallback idiom in the same file. No behaviour change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…e loop Address Tor's PR review and tighten the wrapper: - Use errors.Is for ErrCasFailureShouldRetry (was ==). - Add fallback-tombstone resurrection coverage for Update and WriteUpdateWithXattrs, plus a concurrent-fallback-writer test to document the primary-shadows-fallback contract. - Inline runFallbackUpdateLoop into Update and WriteUpdateWithXattrs; drop the errRetryUpdateLoop sentinel. - Document the deliberate divergence from upstream Update's return-cas-on-error behaviour, and explain why no previousLoopCas defensive check is needed here. - Simplify godoc and rename closure-locals (primaryDelegate -> primaryUpdateFn, fallbackCas -> fallbackCasForDelete/Tombstone) for clarity. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings May 6, 2026 14:58

Copilot started reviewing on behalf of bbrks May 6, 2026 14:59 View session

bbrks assigned bbrks and gregns1 May 6, 2026

bbrks requested a review from gregns1 May 6, 2026 16:36

bbrks assigned bbrks and torcolvin and unassigned bbrks and gregns1 May 6, 2026

torcolvin reviewed May 7, 2026

View reviewed changes

torcolvin assigned bbrks and unassigned torcolvin May 7, 2026

bbrks and others added 4 commits May 8, 2026 13:49

bbrks force-pushed the CBG-5291 branch from 144d7e9 to df9ab22 Compare May 8, 2026 19:01

torcolvin approved these changes May 8, 2026

View reviewed changes

bbrks merged commit c049b7c into main May 11, 2026
50 checks passed

bbrks deleted the CBG-5291 branch May 11, 2026 09:01

Conversation

bbrks commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

torcolvin left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

bbrks commented May 6, 2026 •

edited

Loading