fix: InMemoryDocumentStore.load_from_disk corrupts blob and sparse_embedding fields #11634

Closed
adhavan18 wants to merge 2 commits into deepset-ai:main from adhavan18:fix/load-from-disk-corrupts-blob-sparse-embedding

adhavan18 commented Jun 15, 2026

Fixes #11593.

Problem

load_from_disk uses Document(**doc) to reconstruct documents. But save_to_disk serialises via Document.to_dict(flatten=False), which converts nested dataclass fields to plain dicts (blob -> ByteStream.to_dict(), sparse_embedding -> SparseEmbedding.to_dict()). The plain constructor doesn't reverse this, so those fields come back as raw dicts. Any access to document.blob.data or document.sparse_embedding.indices then raises an AttributeError, e.g. 'dict' object has no attribute 'data'.

Fix

Replace Document(**doc) with Document.from_dict(doc), which is the documented inverse of to_dict and correctly restores nested fields.
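
The failure mode can be reproduced without Haystack. The sketch below is illustrative only: `Blob` and `Doc` are hypothetical stand-ins for ByteStream and Document, not the library's classes, and their `to_dict`/`from_dict` bodies are assumptions made to keep the example self-contained. It shows that a dataclass constructor accepts a plain dict for a nested field without complaint, while a `from_dict` classmethod rebuilds the nested instance.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Blob:
    """Hypothetical stand-in for a nested dataclass field such as ByteStream."""

    data: bytes
    mime_type: Optional[str] = None

    def to_dict(self) -> dict[str, Any]:
        # bytes are not JSON-serialisable, so store them as a list of ints
        return {"data": list(self.data), "mime_type": self.mime_type}

    @classmethod
    def from_dict(cls, d: dict[str, Any]) -> "Blob":
        return cls(data=bytes(d["data"]), mime_type=d.get("mime_type"))


@dataclass
class Doc:
    """Hypothetical stand-in for Document."""

    content: str
    blob: Optional[Blob] = None

    def to_dict(self) -> dict[str, Any]:
        blob = self.blob.to_dict() if self.blob else None
        return {"content": self.content, "blob": blob}

    @classmethod
    def from_dict(cls, d: dict[str, Any]) -> "Doc":
        blob = d.get("blob")
        return cls(content=d["content"], blob=Blob.from_dict(blob) if blob else None)


original = Doc(content="hello", blob=Blob(data=b"\x01\x02", mime_type="image/png"))
serialised = original.to_dict()

# Dataclasses do not validate field types, so the nested dict is stored as-is.
broken = Doc(**serialised)
# from_dict reverses to_dict and rebuilds the nested Blob.
restored = Doc.from_dict(serialised)

try:
    broken.blob.data
    error = None
except AttributeError as exc:
    error = str(exc)
```

Here `broken.blob` is still a dict, so the attribute access fails, whereas `restored` compares equal to `original`.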

Test

Added round-trip test: save documents with blob and sparse_embedding, reload, verify fields are proper dataclass instances.
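
A rough sketch of the shape such a round-trip check might take is shown below. It is not the test added in this PR: `Sparse`, `Doc`, and `ToyStore` are hypothetical stand-ins for SparseEmbedding, Document, and InMemoryDocumentStore, and the JSON file layout is an assumption chosen for brevity.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Optional


@dataclass
class Sparse:
    """Hypothetical stand-in for SparseEmbedding."""

    indices: list[int]
    values: list[float]

    def to_dict(self) -> dict[str, Any]:
        return {"indices": self.indices, "values": self.values}

    @classmethod
    def from_dict(cls, d: dict[str, Any]) -> "Sparse":
        return cls(indices=d["indices"], values=d["values"])


@dataclass
class Doc:
    """Hypothetical stand-in for Document."""

    id: str
    sparse_embedding: Optional[Sparse] = None

    def to_dict(self) -> dict[str, Any]:
        se = self.sparse_embedding
        return {"id": self.id, "sparse_embedding": se.to_dict() if se else None}

    @classmethod
    def from_dict(cls, d: dict[str, Any]) -> "Doc":
        se = d.get("sparse_embedding")
        return cls(id=d["id"], sparse_embedding=Sparse.from_dict(se) if se else None)


class ToyStore:
    """Hypothetical stand-in for a document store with disk persistence."""

    def __init__(self, docs: Optional[list[Doc]] = None) -> None:
        self.docs = list(docs or [])

    def save_to_disk(self, path: Path) -> None:
        payload = json.dumps([d.to_dict() for d in self.docs])
        Path(path).write_text(payload, encoding="utf-8")

    @classmethod
    def load_from_disk(cls, path: Path) -> "ToyStore":
        raw = json.loads(Path(path).read_text(encoding="utf-8"))
        # Use from_dict, not Doc(**d), so nested fields are rebuilt.
        return cls([Doc.from_dict(d) for d in raw])


def round_trip(store: ToyStore) -> ToyStore:
    """Save, reload, then save again to confirm the loaded store is usable."""
    with tempfile.TemporaryDirectory() as tmp:
        first = Path(tmp) / "store.json"
        store.save_to_disk(first)
        loaded = ToyStore.load_from_disk(first)
        loaded.save_to_disk(Path(tmp) / "store_again.json")
        return loaded


source_docs = [Doc("a", Sparse([0, 3], [0.5, 1.5])), Doc("b")]
loaded = round_trip(ToyStore(source_docs))
```

The second save inside `round_trip` is the part that would fail if the loader left nested fields as raw dicts, since `to_dict` would then be called on a dict.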

… instead of Document constructor

save_to_disk serialises documents with Document.to_dict(flatten=False),
which converts nested dataclass fields to plain dicts:
  blob          -> ByteStream.to_dict()  (a plain dict)
  sparse_embedding -> SparseEmbedding.to_dict() (a plain dict)

load_from_disk previously reconstructed documents with Document(**doc),
which passes those plain dicts directly to the constructor without
reversing the serialisation.  The fields were loaded as raw dicts
instead of the proper ByteStream / SparseEmbedding instances.

Downstream effects:
  - AttributeError: 'dict' object has no attribute 'data' on any
    access to document.blob.data (e.g. DocumentToImageContent).
  - doc.to_dict() / doc == other both fail with the same error.
  - A save -> load -> save round-trip is impossible.

Fix: replace Document(**doc) with Document.from_dict(doc), which
is the documented inverse of to_dict and correctly restores
ByteStream, SparseEmbedding, and any other nested dataclass fields.

Adds a regression test that exercises the full save/load round-trip
with both a blob and a sparse_embedding, asserts the correct types
are restored, and verifies that save_to_disk on the loaded store
works without error.

Fixes deepset-ai#11593
adhavan18 requested a review from a team as a code owner June 15, 2026 09:16
adhavan18 requested review from julian-risch and removed request for a team June 15, 2026 09:16
vercel Bot commented Jun 15, 2026

@adhavan18 is attempting to deploy a commit to the deepset Team on Vercel.

A member of the Team first needs to authorize it.

CLAassistant commented Jun 15, 2026

CLA assistant check
All committers have signed the CLA.

julian-risch (Member) commented

@adhavan18 Thank you for opening this PR. Another PR addressing the same issue is already under review: #11594.


Development

Successfully merging this pull request may close these issues.

fix: InMemoryDocumentStore.load_from_disk corrupts documents with blob or sparse_embedding (loaded as raw dicts)