fix: use Document.from_dict in InMemoryDocumentStore.load_from_disk #11594

Merged
davidsbatista merged 4 commits into deepset-ai:main from Ayushhgit:fix-load-from-disk-document-from-dict on Jun 19, 2026

Conversation

@Ayushhgit Ayushhgit commented Jun 12, 2026

Contributor

Related Issues

Proposed Changes:

InMemoryDocumentStore.load_from_disk rebuilt documents with the plain Document(**doc) constructor, which performs no conversion of nested fields. Since save_to_disk serializes with Document.to_dict(flatten=False) (converting blob to ByteStream.to_dict() and sparse_embedding to SparseEmbedding.to_dict()), any document saved with those fields came back with raw dicts in their place. The corrupted documents crashed repr(), to_dict(), equality comparison, a second save_to_disk, and any component accessing document.blob.data (e.g. image pipelines).

One-line fix: reconstruct with Document.from_dict(doc), the documented inverse of to_dict, which restores ByteStream and SparseEmbedding instances.
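
To illustrate the failure mode described above, here is a minimal sketch. `Blob` and `Doc` are hypothetical, simplified stand-ins written for this example; they are not Haystack's `ByteStream` and `Document`, and only mimic the pattern of a nested `to_dict`/`from_dict` pair.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Blob:
    """Hypothetical stand-in for a nested value type (not Haystack's ByteStream)."""

    data: bytes
    mime_type: Optional[str] = None

    def to_dict(self) -> dict[str, Any]:
        # Bytes are stored as a list of ints so the dict is JSON-friendly.
        return {"data": list(self.data), "mime_type": self.mime_type}

    @classmethod
    def from_dict(cls, d: dict[str, Any]) -> "Blob":
        return cls(data=bytes(d["data"]), mime_type=d.get("mime_type"))


@dataclass
class Doc:
    """Hypothetical stand-in for a document type (not Haystack's Document)."""

    content: str
    blob: Optional[Blob] = None

    def to_dict(self) -> dict[str, Any]:
        return {
            "content": self.content,
            "blob": self.blob.to_dict() if self.blob is not None else None,
        }

    @classmethod
    def from_dict(cls, d: dict[str, Any]) -> "Doc":
        d = dict(d)  # copy so the caller's dict is not mutated
        if d.get("blob") is not None:
            d["blob"] = Blob.from_dict(d["blob"])
        return cls(**d)


original = Doc(content="hello", blob=Blob(data=b"\x00\x01", mime_type="image/png"))
serialized = original.to_dict()

# Plain constructor: the nested field stays a raw dict.
broken = Doc(**serialized)

# from_dict: the nested field is converted back to a Blob.
restored = Doc.from_dict(serialized)
```

In this sketch `broken.blob` is a `dict`, so `broken.blob.data` raises `AttributeError`, while `restored` compares equal to `original`.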

How did you test it?

  • New regression test test_save_to_disk_and_load_from_disk_with_blob_and_sparse_embedding: saves a document with both a blob and a sparse_embedding, reloads, asserts proper types, equality with the original, and that the reloaded store can be saved again. Fails on main, passes with this fix.
  • hatch run test:unit test/document_stores/test_in_memory.py — 148 passed, 4 skipped.
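
The shape of the regression test can be sketched roughly as follows. This is not the test added in the PR: `Sparse`, `Doc`, and `TinyStore` are hypothetical stand-ins, the on-disk format is assumed to be a JSON file with a `documents` list, and only a sparse embedding is covered (the real test also includes a blob).

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Optional


@dataclass
class Sparse:
    """Hypothetical stand-in for a sparse embedding type."""

    indices: list
    values: list

    def to_dict(self) -> dict[str, Any]:
        return {"indices": self.indices, "values": self.values}

    @classmethod
    def from_dict(cls, d: dict[str, Any]) -> "Sparse":
        return cls(indices=d["indices"], values=d["values"])


@dataclass
class Doc:
    """Hypothetical stand-in for a document type."""

    id: str
    content: str
    sparse_embedding: Optional[Sparse] = None

    def to_dict(self) -> dict[str, Any]:
        sparse = self.sparse_embedding
        return {
            "id": self.id,
            "content": self.content,
            "sparse_embedding": sparse.to_dict() if sparse is not None else None,
        }

    @classmethod
    def from_dict(cls, d: dict[str, Any]) -> "Doc":
        d = dict(d)
        if d.get("sparse_embedding") is not None:
            d["sparse_embedding"] = Sparse.from_dict(d["sparse_embedding"])
        return cls(**d)


class TinyStore:
    """Hypothetical in-memory store with JSON persistence."""

    def __init__(self) -> None:
        self.docs: dict[str, Doc] = {}

    def write(self, doc: Doc) -> None:
        self.docs[doc.id] = doc

    def save_to_disk(self, path: Path) -> None:
        payload = {"documents": [d.to_dict() for d in self.docs.values()]}
        Path(path).write_text(json.dumps(payload))

    @classmethod
    def load_from_disk(cls, path: Path) -> "TinyStore":
        store = cls()
        for raw in json.loads(Path(path).read_text())["documents"]:
            store.write(Doc.from_dict(raw))  # from_dict, not Doc(**raw)
        return store


def check_round_trip() -> Doc:
    original = Doc(id="1", content="hi", sparse_embedding=Sparse([0, 3], [0.5, 0.25]))
    store = TinyStore()
    store.write(original)
    with tempfile.TemporaryDirectory() as tmp:
        first = Path(tmp) / "store.json"
        store.save_to_disk(first)
        loaded = TinyStore.load_from_disk(first)
        reloaded = loaded.docs["1"]
        assert isinstance(reloaded.sparse_embedding, Sparse)  # proper type
        assert reloaded == original  # equality with the original
        loaded.save_to_disk(Path(tmp) / "again.json")  # second save must not raise
    return reloaded
```

Swapping `Doc.from_dict(raw)` for `Doc(**raw)` in `load_from_disk` makes all three checks in `check_round_trip` fail in this sketch, which mirrors the "fails on main, passes with this fix" behavior described above.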

Notes for the reviewer

  • Document.from_dict also handles the nested meta dict produced by to_dict(flatten=False), so documents without blob/sparse fields round-trip exactly as before (covered by the existing test_save_to_disk_and_load_from_disk).
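
As a rough illustration of the nested-versus-flattened `meta` point, the sketch below uses a hypothetical `Doc` stand-in. Its `flatten` parameter and merge behavior are assumptions made for the example and may differ from Haystack's `Document` in details such as defaults and error handling.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Doc:
    """Hypothetical stand-in for a document with free-form metadata."""

    content: str
    meta: dict = field(default_factory=dict)

    def to_dict(self, flatten: bool = True) -> dict[str, Any]:
        if flatten:
            # Metadata keys are lifted to the top level.
            return {"content": self.content, **self.meta}
        # Metadata stays nested under a single "meta" key.
        return {"content": self.content, "meta": dict(self.meta)}

    @classmethod
    def from_dict(cls, d: dict[str, Any]) -> "Doc":
        d = dict(d)  # copy so the caller's dict is not mutated
        content = d.pop("content")
        nested = d.pop("meta", None) or {}
        # Any remaining top-level keys are treated as flattened metadata.
        return cls(content=content, meta={**d, **nested})


doc = Doc(content="hi", meta={"page": 3, "source": "a.pdf"})
from_nested = Doc.from_dict(doc.to_dict(flatten=False))
from_flat = Doc.from_dict(doc.to_dict(flatten=True))
```

Both `from_nested` and `from_flat` compare equal to `doc` here, which is the property that lets documents without blob or sparse fields round-trip unchanged.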

Checklist

  • I have read the contributors guidelines and the code of conduct.
  • I have updated the related issue with new insights and changes.
  • I have added unit tests and updated the docstrings.
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I have documented my code.
  • I have added a release note file, following the contributors guidelines.
  • I have run pre-commit hooks and fixed any issue.

load_from_disk rebuilt documents with the plain Document constructor,
which does not convert nested fields. Documents saved with a blob
(ByteStream) or sparse_embedding (SparseEmbedding) came back with those
fields as raw dicts, crashing repr(), to_dict(), equality comparison,
save_to_disk of the reloaded store, and any component accessing
document.blob.data.

save_to_disk serializes with Document.to_dict(flatten=False);
Document.from_dict is its inverse and restores the proper types.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@Ayushhgit Ayushhgit requested a review from a team as a code owner June 12, 2026 08:08
@Ayushhgit Ayushhgit requested review from davidsbatista and removed request for a team June 12, 2026 08:08

vercel Bot commented Jun 12, 2026


@Ayushhgit is attempting to deploy a commit to the deepset Team on Vercel.

A member of the Team first needs to authorize it.

@davidsbatista

Contributor

@Ayushhgit you currently have 3 open PRs and keep opening more. Please focus on one PR at a time.

@Ayushhgit

Contributor Author

Hey @davidsbatista, these were my last. I'll wait until all of my current PRs are closed before starting a new one. Sorry if I caused any inconvenience.

@github-actions github-actions Bot added the type:documentation Improvements on the docs label Jun 19, 2026

vercel Bot commented Jun 19, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project: haystack-docs · Deployment: Ignored · Actions: Preview · Updated (UTC): Jun 19, 2026 7:41am

github-actions Bot commented Jun 19, 2026

Contributor

Coverage report

Coverage table (values not captured). Columns: File, Statements, Missing, Coverage, Coverage (new stmts), Lines missing. Rows: haystack/document_stores/in_memory/document_store.py and Project Total.

This report was generated by python-coverage-comment-action

@davidsbatista davidsbatista left a comment

Contributor

I made some final adjustments. Thanks for the contribution 👍🏽

@davidsbatista davidsbatista merged commit 9c9fbd7 into deepset-ai:main Jun 19, 2026
24 checks passed

Labels

topic:tests type:documentation Improvements on the docs

Development

Successfully merging this pull request may close these issues.

fix: InMemoryDocumentStore.load_from_disk corrupts documents with blob or sparse_embedding (loaded as raw dicts)

2 participants