fix: InMemoryDocumentStore.load_from_disk corrupts blob and sparse_embedding fields#11634
Closed
adhavan18 wants to merge 2 commits into
Closed
Conversation
… instead of Document constructor
save_to_disk serialises documents with Document.to_dict(flatten=False),
which converts nested dataclass fields to plain dicts:
blob -> ByteStream.to_dict() (a plain dict)
sparse_embedding -> SparseEmbedding.to_dict() (a plain dict)
load_from_disk previously reconstructed documents with Document(**doc),
which passes those plain dicts directly to the constructor without
reversing the serialisation. The fields were loaded as raw dicts
instead of the proper ByteStream / SparseEmbedding instances.
Downstream effects:
- AttributeError: 'dict' object has no attribute 'data' on any
access to document.blob.data (e.g. DocumentToImageContent).
- doc.to_dict() / doc == other both fail with the same error.
- A save -> load -> save round-trip is impossible.
Fix: replace Document(**doc) with Document.from_dict(doc), which
is the documented inverse of to_dict and correctly restores
ByteStream, SparseEmbedding, and any other nested dataclass fields.
Adds a regression test that exercises the full save/load round-trip
with both a blob and a sparse_embedding, asserts the correct types
are restored, and verifies that save_to_disk on the loaded store
works without error.
Fixes deepset-ai#11593
|
@adhavan18 is attempting to deploy a commit to the deepset Team on Vercel. A member of the Team first needs to authorize it. |
Member
|
@adhavan18 Thank you for opening this PR. Another PR addressing the same issue is already under review #11594 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #11593.
Problem
load_from_diskusesDocument(**doc)to reconstruct documents. Butsave_to_diskserialises viaDocument.to_dict(flatten=False), which converts nested dataclass fields to plain dicts (blob→ByteStream.to_dict(),sparse_embedding→SparseEmbedding.to_dict()). The plain constructor doesn't reverse this, so those fields come back as raw dicts. Any access todocument.blob.dataordocument.sparse_embedding.indicesraisesAttributeError: 'dict' object has no attribute 'data'.Fix
Replace
Document(**doc)withDocument.from_dict(doc), which is the documented inverse ofto_dictand correctly restores nested fields.Test
Added round-trip test: save documents with blob and sparse_embedding, reload, verify fields are proper dataclass instances.