fix(source-shopify): prevent metafield_customers from skipping customers on bulk checkpoint#77005
fix(source-shopify): prevent metafield_customers from skipping customers on bulk checkpoint#77005sophiecuiy wants to merge 2 commits intomasterfrom
Conversation
…ers on bulk checkpoint Bulk streams with a parent_stream_class filter/slice on the parent's cursor (e.g. customers.updated_at) while emitted records carry the child's cursor (e.g. metafield.updated_at). When a bulk job checkpoints mid-output, stream_slices advanced the next slice using the max child cursor seen, which can be arbitrarily later than the slice's upper bound. Every parent between slice_end and that child timestamp was silently never queried. Source the checkpoint cursor from the bulk record producer's tracked parent cursor, which is bounded by the slice's upper bound. Fixes airbytehq/oncall#12004. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Note 📝 PR Converted to Draft More info...Thank you for creating this PR. As a policy to protect our engineers' time, Airbyte requires all PRs to be created first in draft status. Your PR has been automatically converted to draft status in respect for this policy. As soon as your PR is ready for formal review, you can proceed to convert the PR to "ready for review" status by clicking the "Ready for review" button at the bottom of the PR page. To skip draft status in future PRs, please include |
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
👋 Greetings, Airbyte Team Member!Here are some helpful tips and reminders for your convenience. 💡 Show Tips and TricksPR Slash CommandsAirbyte Maintainers (that's you!) can execute the following slash commands on your PR:
📚 Show Repo GuidanceHelpful Resources
|
|
|
Deploy preview for airbyte-docs ready!
Deployed with vercel-action |
What
Fixes silent data loss in
source-shopifybulk metafield streams when a bulk job checkpoints mid-output. On one reporter's production sync (oncall#12004), a single checkpoint event on themetafield_customersstream caused the next slice to jump forward 2 years and 8 months, skipping every customer in that window and their metafields. This reproduces the 858 / 180 / 23 vs 431 / 46 / 3 gap reporters observed between Shopify Admin and their destination.Closes airbytehq/airbyte#76874 (the nested-connection-drop theory is a red herring; this is the actual bug).
Root cause
Bulk streams with a
parent_stream_classhave a cursor / filter-field mismatch:customers(query: "updated_at:>='X' AND updated_at:<='Y'")).metafield.updated_at), which can be arbitrarily outside the slice window — a customer whosecustomer.updated_atfalls in June 2023 can own a metafield whoseupdated_atis yesterday.stream_slicesadvances the next slice's start viaget_adjusted_job_end(..., self._checkpoint_cursor, ...), it's passing the max child cursor as the next slice's parent-cursor lower bound.Production log excerpt (oncall#12004)
2.67 years of customers skipped in one slice advance. The off-midnight
01:31:14timestamp is the smoking gun — no slicer would produce that alignment organically.How
ShopifyBulkRecordalready tracks the parent's cursor value separately via_parent_stream_cursor_value(max across all parent records composed), exposed viaget_parent_stream_state(). This value is currently used for state reporting inIncrementalShopifyGraphQlBulkStream.get_updated_statebut wasn't used for slice advancement after a checkpoint.Added
_effective_checkpoint_cursor()that sources from the record producer's parent cursor whenparent_stream_classis set (falls back to the existing_checkpoint_cursorotherwise for streams without a parent).stream_slicesnow uses this to advance the next slice's start.Bounded by the slice's upper bound (the bulk query's
updated_at:<='Y'filter constrains any customer returned), so the next slice can't skip past the current slice's end.Blast radius
Same fix applies to every bulk stream with
parent_stream_classwhere emitted records' cursor differs from the parent cursor:MetafieldCustomers(parent:Customers) — reproduced in oncall#12004MetafieldProducts(parent:Products)MetafieldProductImages(parent:Products)ProductImages(parent:Products)CustomerAddress(parent:Customers)CustomerJourneySummary(parent:Orders)InventoryLevels(parent:Locations)Streams without
parent_stream_class(MetafieldOrders,MetafieldDraftOrders,MetafieldCollections,MetafieldProductVariants, etc.) hit the fall-through branch and keep current behavior — they may still be vulnerable to the same class of bug at the query level, but this PR is scoped to theparent_stream_class-backed path where the fix is straightforward.Test coverage
Added two unit tests in
unit_tests/graphql_bulk/test_job.py:test_effective_checkpoint_cursor_uses_parent_cursor_for_parent_backed_stream— simulates the oncall#12004 pathological state (child cursor 2026-02-26 vs parent cursor 2023-06-28) and asserts the effective checkpoint cursor returns the parent's 2023-06-28, not the child's 2026-02-26.test_effective_checkpoint_cursor_falls_back_when_no_parent_state— streams without a parent, or with no parent cursor yet tracked, retain the current_checkpoint_cursorbehavior.Full unit suite: 211 passed (
poetry run pytest unit_tests/).Breaking change evaluation
Not a breaking change:
PATCH bump
3.3.2 → 3.3.3.Immediate workaround (while this PR is in flight)
For customers hitting this on large stores, setting
job_checkpoint_intervalhigh enough that checkpointing never fires on the affected stream sidesteps the bug. Combined with 3.3.2's external sort (no longer memory-bound), large slices no longer have to checkpoint to stay healthy.Can this PR be safely reverted and rolled back?