
fix(source-shopify): prevent metafield_customers from skipping customers on bulk checkpoint #77005

Draft
sophiecuiy wants to merge 2 commits into master from sophie/fix-metafield-customers-checkpoint-skip

Conversation

@sophiecuiy
Contributor

What

Fixes silent data loss in source-shopify bulk metafield streams when a bulk job checkpoints mid-output. On one reporter's production sync (oncall#12004), a single checkpoint event on the metafield_customers stream caused the next slice to jump forward 2 years and 8 months, skipping every customer in that window and their metafields. This reproduces the 858 / 180 / 23 vs 431 / 46 / 3 gap reporters observed between Shopify Admin and their destination.

Closes airbytehq/airbyte#76874 (the nested-connection-drop theory is a red herring; this is the actual bug).

Root cause

Bulk streams with a parent_stream_class have a cursor / filter-field mismatch:

  • The bulk query filters and slices on the parent's cursor (customers(query: "updated_at:>='X' AND updated_at:<='Y'")).
  • Emitted records carry the child's cursor (metafield.updated_at), which can be arbitrarily outside the slice window — a customer whose customer.updated_at falls in June 2023 can own a metafield whose updated_at is yesterday.
  • When the bulk job checkpoints at the configured line threshold (default 100k) and stream_slices advances the next slice's start via get_adjusted_job_end(..., self._checkpoint_cursor, ...), it's passing the max child cursor as the next slice's parent-cursor lower bound.
  • Every parent (and all its metafields) between the current slice's end and that child timestamp is silently never queried.
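The faulty advance can be reduced to a minimal sketch. The function names below are illustrative, not the connector's actual code; they only model the two advancement policies:

```python
from datetime import datetime


def next_slice_start_buggy(slice_end: datetime, checkpoint_cursor: datetime) -> datetime:
    # Pre-fix behavior: the checkpoint cursor is the max *child* cursor seen
    # (metafield.updated_at), which can lie arbitrarily far beyond the slice
    # window, so the next slice's lower bound jumps forward.
    return checkpoint_cursor


def next_slice_start_fixed(slice_end: datetime, parent_cursor: datetime) -> datetime:
    # Post-fix behavior: advance from the *parent* cursor, which the bulk
    # query's upper-bound filter keeps inside the slice window.
    return min(parent_cursor, slice_end)


slice_end = datetime(2023, 6, 28)
child_cursor = datetime(2026, 2, 26)   # metafield.updated_at, far in the future
parent_cursor = datetime(2023, 6, 27)  # customers.updated_at, inside the window

buggy_start = next_slice_start_buggy(slice_end, child_cursor)    # jumps ~2.7 years
fixed_start = next_slice_start_fixed(slice_end, parent_cursor)   # stays in-window
```

With the buggy policy, every customer whose `updated_at` falls between `slice_end` and `child_cursor` is never queried; the fixed policy can never advance past `slice_end`.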

Production log excerpt (oncall#12004)

slice 310: customers.updated_at ∈ [2023-06-18, 2023-06-28]
  BULK Job 7247832088750 created
  checkpointing after >= 100000 rows collected
  Rows collected: 181724 --> records: 149793
  continue from checkpoint: 2026-02-26T01:31:14+00:00
slice 311: customers.updated_at ∈ [2026-02-26T01:31:14, 2026-03-08T01:31:14]

2.67 years of customers skipped in one slice advance. The off-midnight 01:31:14 timestamp is the smoking gun — no slicer would produce that alignment organically.

How

ShopifyBulkRecord already tracks the parent's cursor value separately via _parent_stream_cursor_value (the max across all parent records composed), exposed via get_parent_stream_state(). This value is already used for state reporting in IncrementalShopifyGraphQlBulkStream.get_updated_state, but was not used for slice advancement after a checkpoint.

Added _effective_checkpoint_cursor(), which sources from the record producer's parent cursor when parent_stream_class is set (falling back to the existing _checkpoint_cursor for streams without a parent). stream_slices now uses this to advance the next slice's start.

The parent cursor is bounded by the slice's upper bound (the bulk query's updated_at:<='Y' filter constrains every customer returned), so the next slice can't skip past the current slice's end.

Blast radius

Same fix applies to every bulk stream with parent_stream_class where emitted records' cursor differs from the parent cursor:

  • MetafieldCustomers (parent: Customers) — reproduced in oncall#12004
  • MetafieldProducts (parent: Products)
  • MetafieldProductImages (parent: Products)
  • ProductImages (parent: Products)
  • CustomerAddress (parent: Customers)
  • CustomerJourneySummary (parent: Orders)
  • InventoryLevels (parent: Locations)

Streams without parent_stream_class (MetafieldOrders, MetafieldDraftOrders, MetafieldCollections, MetafieldProductVariants, etc.) hit the fall-through branch and keep current behavior — they may still be vulnerable to the same class of bug at the query level, but this PR is scoped to the parent_stream_class-backed path where the fix is straightforward.

Test coverage

Added two unit tests in unit_tests/graphql_bulk/test_job.py:

  1. test_effective_checkpoint_cursor_uses_parent_cursor_for_parent_backed_stream — simulates the oncall#12004 pathological state (child cursor 2026-02-26 vs parent cursor 2023-06-28) and asserts the effective checkpoint cursor returns the parent's 2023-06-28, not the child's 2026-02-26.
  2. test_effective_checkpoint_cursor_falls_back_when_no_parent_state — streams without a parent, or with no parent cursor yet tracked, retain the current _checkpoint_cursor behavior.
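The two cases can be approximated with a self-contained sketch. The real tests exercise the connector classes in test_job.py; the free function and fixture values below are illustrative only (the cursor timestamps match the oncall#12004 state described above):

```python
def effective_checkpoint_cursor(parent_stream_class, parent_state, checkpoint_cursor):
    """Free-function approximation of the logic under test."""
    if parent_stream_class is not None and parent_state:
        return parent_state
    return checkpoint_cursor


def test_effective_checkpoint_cursor_uses_parent_cursor_for_parent_backed_stream():
    # Pathological state: child cursor years ahead of the parent cursor.
    assert effective_checkpoint_cursor(
        parent_stream_class=object,
        parent_state="2023-06-28T00:00:00+00:00",
        checkpoint_cursor="2026-02-26T01:31:14+00:00",
    ) == "2023-06-28T00:00:00+00:00"


def test_effective_checkpoint_cursor_falls_back_when_no_parent_state():
    # No parent stream, or no parent cursor tracked yet: keep current behavior.
    assert effective_checkpoint_cursor(
        parent_stream_class=None,
        parent_state=None,
        checkpoint_cursor="2026-02-26T01:31:14+00:00",
    ) == "2026-02-26T01:31:14+00:00"
```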

Full unit suite: 211 passed (poetry run pytest unit_tests/).

Breaking change evaluation

Not a breaking change:

  • Record schema: unchanged.
  • Spec/config: unchanged.
  • Primary key / cursor field: unchanged.
  • State format: unchanged.
  • Behavior: strictly additive — next-slice advancement on checkpoint now respects slice bounds. Previously-skipped records will now be included. No existing records can be dropped by this change.

PATCH bump 3.3.2 → 3.3.3.

Immediate workaround (while this PR is in flight)

For customers hitting this on large stores, setting job_checkpoint_interval high enough that checkpointing never fires on the affected stream sidesteps the bug. Combined with 3.3.2's external sort (no longer memory-bound), large slices no longer have to checkpoint to stay healthy.
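As a sketch, the workaround is a single config override. The job_checkpoint_interval key comes from the text above; the rest of this config fragment and the exact value are illustrative — the value just needs to exceed the store's largest per-slice row count so checkpointing never fires:

```json
{
  "shop": "your-shop-name",
  "start_date": "2020-01-01",
  "job_checkpoint_interval": 200000000
}
```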

Can this PR be safely reverted and rolled back?

  • YES 💚

…ers on bulk checkpoint

Bulk streams with a parent_stream_class filter/slice on the parent's
cursor (e.g. customers.updated_at) while emitted records carry the
child's cursor (e.g. metafield.updated_at). When a bulk job checkpoints
mid-output, stream_slices advanced the next slice using the max child
cursor seen, which can be arbitrarily later than the slice's upper
bound. Every parent between slice_end and that child timestamp was
silently never queried.

Source the checkpoint cursor from the bulk record producer's tracked
parent cursor, which is bounded by the slice's upper bound.

Fixes airbytehq/oncall#12004.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
octavia-bot marked this pull request as draft April 24, 2026 23:05
@octavia-bot
Contributor

octavia-bot Bot commented Apr 24, 2026

Note

📝 PR Converted to Draft


Thank you for creating this PR. As a policy to protect our engineers' time, Airbyte requires all PRs to be created first in draft status. Your PR has been automatically converted to draft status in respect for this policy.

As soon as your PR is ready for formal review, you can proceed to convert the PR to "ready for review" status by clicking the "Ready for review" button at the bottom of the PR page.

To skip draft status in future PRs, please include [ready] in your PR title or add the skip-draft-status label when creating your PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

💡 Show Tips and Tricks

PR Slash Commands

Airbyte Maintainers (that's you!) can execute the following slash commands on your PR:

  • 🛠️ Quick Fixes
    • /format-fix - Fixes most formatting issues.
    • /bump-version - Bumps connector versions, scraping changelog description from the PR title.
      • Bump types: patch (default), minor, major, major_rc, rc, promote.
      • The rc type is a smart default: applies minor_rc if stable, or bumps the RC number if already RC.
      • The promote type strips the RC suffix to finalize a release.
      • Example: /bump-version type=rc or /bump-version type=minor
    • /bump-progressive-rollout-version - Alias for /bump-version type=rc. Bumps with an RC suffix and enables progressive rollout.
  • ❇️ AI Testing and Review (internal link: AI-SDLC Docs):
    • /ai-prove-fix - Runs prerelease readiness checks, including testing against customer connections.
    • /ai-canary-prerelease - Rolls out prerelease to 5-10 connections for canary testing.
    • /ai-review - AI-powered PR review for connector safety and quality gates.
  • 🚀 Connector Releases:
    • /publish-connectors-prerelease - Publishes pre-release connector builds (tagged as {version}-preview.{git-sha}) for all modified connectors in the PR.
  • ☕️ JVM connectors:
    • /update-connector-cdk-version connector=<CONNECTOR_NAME> - Updates the specified connector to the latest CDK version.
      Example: /update-connector-cdk-version connector=destination-bigquery
  • 🐍 Python connectors:
    • /poe connector source-example lock - Run the Poe lock task on the source-example connector, committing the results back to the branch.
    • /poe source example lock - Alias for /poe connector source-example lock.
    • /poe source example use-cdk-branch my/branch - Pin the source-example CDK reference to the branch name specified.
    • /poe source example use-cdk-latest - Update the source-example CDK dependency to the latest available version.
  • ⚙️ Admin commands:
    • /force-merge reason="<REASON>" - Force merges the PR using admin privileges, bypassing CI checks. Requires a reason.
      Example: /force-merge reason="CI is flaky, tests pass locally"

@github-actions
Contributor

github-actions Bot commented Apr 24, 2026

source-shopify Connector Test Results

226 tests in 2 suites (2 files): 216 ✅ passed, 10 💤 skipped, 0 ❌ failed, in 2m 2s ⏱️

Results for commit 4920bff.

♻️ This comment has been updated with latest results.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 24, 2026

Deploy preview for airbyte-docs ready!

Project: airbyte-docs
Status: ✅ Deploy successful!
Preview URL: https://airbyte-docs-6fqzxrte7-airbyte-growth.vercel.app
Latest Commit: 4920bff

Deployed with vercel-action
