Conversation
torcolvin
left a comment
There's definitely some odd existing behavior around Update and UpdateWithXattrs sometimes returning a cas value on error instead of 0. I think this code might diverge between rosmar and CBS, but it might be worthwhile to match the upstream behavior.
The other major issue is whether we want to handle tombstone documents. We allow dbconfigs to be tombstones in App Services, and then we resurrect them. I wonder if we want to be more explicit about tombstone handling and removal. I don't have this part of the code fully loaded into my head, but if you do, I'm happy; at minimum, make sure that we have test coverage for the expected behavior.
calls++
callbackInputs = append(callbackInputs, append([]byte(nil), current...))
if calls == 1 {
	// Simulate a concurrent writer beating us to the punch on primary.
The test that might seem to be missing: what happens if you hit a retry here while another writer is writing to the secondary (read) location, and you then have to update it?
The design of the MetadataStore dual-collection wrapper means that all^ writes only ever go to primary once it is in use.
^caveats:
- Deletes can apply to the fallback, which is technically a write... but there's handling in this PR to cover that case by simply retrying the update.
- The timing of all SG nodes actually enabling/using MetadataStore consistently is a problem to be solved with future tickets to close up any gaps where existing SG nodes are writing metadata to the fallback.
By design most Update operations aren't actually 'CAS safe' in the traditional sense, since the caller hasn't actually got a CAS to work with. The read->callback mutation->write steps are the only CAS-safe bit.
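The read → callback → write step described above can be sketched with a toy CAS store. This is a minimal, hypothetical model — the `store` type and its `get`/`writeCas` names are invented for illustration, not the real sgbucket API — showing that only that step is CAS-safe, with the loop re-reading a fresh value and CAS after every race, and returning cas 0 on every error path:

```go
package main

import (
	"errors"
	"fmt"
)

var errCasMismatch = errors.New("cas mismatch")

// store is a toy stand-in for the primary DataStore.
type store struct {
	val []byte
	cas uint64
}

func (s *store) get() ([]byte, uint64) {
	return append([]byte(nil), s.val...), s.cas
}

func (s *store) writeCas(v []byte, cas uint64) (uint64, error) {
	if cas != s.cas {
		return 0, errCasMismatch
	}
	s.val = append([]byte(nil), v...)
	s.cas++
	return s.cas, nil
}

// update is the only CAS-safe portion of the wrapper: read -> callback ->
// write, retrying on a CAS race so each attempt observes a fresh value and
// CAS. On any error it returns cas 0, never a stale CAS.
func update(s *store, cb func([]byte) []byte, maxAttempts int) (uint64, error) {
	for i := 0; i < maxAttempts; i++ {
		cur, cas := s.get()
		newCas, err := s.writeCas(cb(cur), cas)
		if errors.Is(err, errCasMismatch) {
			continue // another writer won; re-read and retry
		}
		return newCas, err
	}
	return 0, errors.New("retry limit exceeded")
}

func main() {
	s := &store{val: []byte("a"), cas: 1}
	calls := 0
	cas, err := update(s, func(cur []byte) []byte {
		calls++
		if calls == 1 {
			// Simulate a concurrent writer beating us to the punch.
			s.writeCas([]byte("x"), s.cas)
		}
		return append(cur, 'b')
	}, 10)
	fmt.Printf("val=%s cas=%d err=%v calls=%d\n", s.val, cas, err, calls)
}
```

The first attempt loses the race, so the callback runs twice and the second attempt commits against the intruder's value — which is exactly the contract the tests in this PR exercise.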
assert.Equal(t, fallbackBody, fallbackRaw, "fallback is never written by Update")
}

// TestMetadataStoreUpdateDeleteFallbackOnly verifies the CBG-5291 delete behaviour for a doc
Anytime you write tombstones it makes me think about resurrections. I think the behavior is probably the same, but can Update do a resurrection while preserving existing xattrs? I think I'm comfortable with the tests as-is, but if you aren't, you can write some.
Done - resurrections restore into primary and reinstate the original xattrs
// Delete the fallback copy — primary never held it, so a primary tombstone is
// pointless and would shadow the fallback for nothing. Remove with the CAS we
// just observed; on a concurrent fallback mutation we retry the loop.
_, removeErr := ms.fallback.Remove(k, fallbackCas)
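The delete-with-observed-CAS retry in the snippet above can be sketched with a toy fallback store. Everything here is hypothetical naming for illustration (`getRaw`/`remove`, and the `between` hook that injects an interleaved writer), not the real sgbucket API:

```go
package main

import (
	"errors"
	"fmt"
)

var errCasMismatch = errors.New("cas mismatch")

// fallbackStore is a toy stand-in for the fallback collection.
type fallbackStore struct {
	val     []byte
	cas     uint64
	deleted bool
}

func (f *fallbackStore) getRaw() ([]byte, uint64, error) {
	if f.deleted {
		return nil, 0, errors.New("not found")
	}
	return f.val, f.cas, nil
}

func (f *fallbackStore) remove(cas uint64) error {
	if cas != f.cas {
		return errCasMismatch
	}
	f.deleted = true
	return nil
}

// deleteFallbackOnly removes the fallback copy with the CAS observed at read
// time. A concurrent fallback mutation surfaces as a CAS mismatch on remove,
// and the whole read+remove step is retried. The between hook lets a test
// interleave a writer between the read and the remove.
func deleteFallbackOnly(f *fallbackStore, between func(), maxAttempts int) error {
	for i := 0; i < maxAttempts; i++ {
		_, cas, err := f.getRaw()
		if err != nil {
			return err
		}
		if between != nil {
			between()
			between = nil // interleave only once in this sketch
		}
		rmErr := f.remove(cas)
		if errors.Is(rmErr, errCasMismatch) {
			continue // concurrent fallback mutation; re-read and retry
		}
		return rmErr
	}
	return errors.New("retry limit exceeded")
}

func main() {
	f := &fallbackStore{val: []byte("doc"), cas: 7}
	// Another writer touches the fallback between our read and our remove.
	err := deleteFallbackOnly(f, func() { f.val = []byte("doc2"); f.cas++ }, 10)
	fmt.Println(err, f.deleted)
}
```

The first remove fails on the bumped CAS, the loop re-reads, and the second remove lands — no primary tombstone is ever written.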
Do we actually want to Purge this document?
Purge meaning what in this context? Removing any system xattrs too?
	return ms.primary.Update(k, exp, callback)
}
step := func() (uint64, error) {
	fallbackValue, fallbackCas, err := ms.fallback.GetRaw(k)
This is going to return false for a tombstone, but we don't return false for a tombstone in the general case of Update, is this the desired behavior?
Can you re-check this - I'm not quite sure what you mean and some of the improved naming/comments may help answer this.
if IsCasMismatch(writeErr) {
	return 0, errRetryUpdateLoop
}
return 0, writeErr
The upstream update function returns cas if found and there is an error. I'm not sure if this can ever occur meaningfully but I remember this being a weird quirk of the API. This may be a case where rosmar has divergent behavior from CBS.
Only ever want to return a non-zero CAS if these wrappers are successful by design. Added a lot of comments around this.
The intention is the retry loop will pick up new CAS values each time.
newValue, cbExpiry, isDelete, cbErr := callback(fallbackValue)
if cbErr != nil {
	return 0, cbErr
The upstream update function returns cas even if there is an error. I'm not sure if this is intentional or not, but it might be a really subtle set of bugs to not match behavior.
Only ever want to return a non-zero CAS if these wrappers are successful by design. Added a lot of comments around this.
The intention is the retry loop will pick up new CAS values each time.
return 0, errRetryUpdateLoop
}
if cbErr != nil {
	return 0, cbErr
Unlike Update, this function's parent does return 0 cas.
Only ever want to return a non-zero CAS if these wrappers are successful by design. Added a lot of comments around this.
The intention is the retry loop will pick up new CAS values each time.
Replaces the placeholder MetadataStore.Update (which delegated the entire read-modify-write to the fallback datastore) with a wrapper-level CAS retry loop that always lands writes in primary.

When the doc is in primary or the doc is absent from both stores, it delegates to primary.Update; when the doc lives only in fallback, it feeds the fallback value to the caller's callback and inserts the result into primary via WriteCas, retrying on CAS race. A delete returned by the callback for a fallback-only doc is a no-op in primary - the metadata migration sweep tombstones the fallback copy.

WriteUpdateWithXattrs gets the same dual-read pattern for symmetry (callers today are application-data only, but a future metadata caller would otherwise silently skip fallback data).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address review feedback on the CAS-safe Update wrapper:

- Update / WriteUpdateWithXattrs delete on a fallback-only doc now lands in the fallback (Remove / WriteTombstoneWithXattrs with the observed CAS, retrying on race) so subsequent reads stop falling through to a stale doc instead of relying on the migration sweep.
- Document the previous.Cas != 0 shortcut: the wrapper's read methods discard fallback CAS, so a non-zero CAS from the wrapper unambiguously identifies a primary revision.
- Note in code that xattrsToDelete is intentionally nil on the migration-insert branch (cas=0 inserts reject xattrsToDelete with ErrDeleteXattrOnDocumentInsert).

New / rewritten tests cover each behaviour and were verified TDD-style to catch a regression of the corresponding fix:

- TestMetadataStoreUpdateDeleteFallbackOnly (rewritten for new contract)
- TestMetadataStoreUpdateRetriesOnCASMismatch
- TestMetadataStoreWriteUpdateWithXattrsCASShortcut
- TestMetadataStoreWriteUpdateWithXattrsHonorsXattrsToDelete
- TestMetadataStoreWriteUpdateWithXattrsTombstoneFallbackOnly

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Update and WriteUpdateWithXattrs both ran the same outer loop — probe primary, delegate when primary holds the doc or migration is sealed, otherwise read fallback / invoke the caller's callback / write to primary (or, on a fallback-only delete or tombstone, write to fallback). Lift the shared scaffolding into runFallbackUpdateLoop.

The helper drives a worker through RetryLoopCas with DefaultRetrySleeper, matching the CAS-retry bound (10 attempts, doubling sleep from 5ms) used elsewhere in this package; the loop returns NewRetryTimeoutError if exceeded. Per-method specifics — fallback-read shape, callback signature, delete vs. insert classification, xattr-specific NotFound retry — stay in each method's step closure, with the small CAS/NotFound/terminal switch inlined at each branch.

WriteUpdateWithXattrs threads its caller-supplied ctx through; Update has no ctx parameter (DataStore.Update doesn't carry one) and passes context.TODO(), matching the existing readFromFallback idiom in the same file.

No behaviour change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e loop

Address Tor's PR review and tighten the wrapper:

- Use errors.Is for ErrCasFailureShouldRetry (was ==).
- Add fallback-tombstone resurrection coverage for Update and WriteUpdateWithXattrs, plus a concurrent-fallback-writer test to document the primary-shadows-fallback contract.
- Inline runFallbackUpdateLoop into Update and WriteUpdateWithXattrs; drop the errRetryUpdateLoop sentinel.
- Document the deliberate divergence from upstream Update's return-cas-on-error behaviour, and explain why no previousLoopCas defensive check is needed here.
- Simplify godoc and rename closure-locals (primaryDelegate -> primaryUpdateFn, fallbackCas -> fallbackCasForDelete/Tombstone) for clarity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CBG-5291
MetadataStore.Update previously delegated the whole read-modify-write to the fallback store, and WriteUpdateWithXattrs went straight to primary without consulting fallback. Both now share a wrapper-level CAS retry loop that always lands writes in primary, even when the doc only exists in fallback.
Retries go through RetryLoopCas/DefaultRetrySleeper. It's expected that a CAS failure will subsequently find an entry in the primary, since all writes are routed there; only reads consult the fallback. WriteUpdateWithXattrs short-circuits to primary when previous.Cas != 0, since the wrapper reads only return non-zero CAS for primary hits, and so it can rely on a regular update.

Integration Tests
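The previous.Cas != 0 shortcut can be sketched as a tiny routing function — purely illustrative, with an invented name (`routeUpdate`): because the wrapper's read methods discard fallback CAS, any non-zero CAS a caller hands back is proof of a primary revision, so the write can skip the dual-read probe entirely:

```go
package main

import "fmt"

// routeUpdate sketches the previous.Cas != 0 shortcut. The wrapper's read
// methods discard fallback CAS, so a non-zero CAS seen by a caller can only
// have come from primary, and the update may go straight there.
func routeUpdate(previousCas uint64) string {
	if previousCas != 0 {
		return "primary" // unambiguously a primary revision
	}
	return "dual-read" // probe primary, fall back to the fallback copy
}

func main() {
	fmt.Println(routeUpdate(42), routeUpdate(0))
}
```

This is why the shortcut is safe: a fallback hit always reaches the caller with cas 0, so the cas 0 path is the only one that needs the dual-read loop.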