Skip to content

Commit 4cda910

Browse files
mdrxynpentrel
andauthored
deepagents: notes on multimodal (#4189)
Co-authored-by: Naomi Pentrel <5212232+npentrel@users.noreply.github.com>
1 parent 6077a0e commit 4cda910

2 files changed

Lines changed: 22 additions & 2 deletions

File tree

src/oss/deepagents/context-engineering.mdx

Lines changed: 18 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -381,6 +381,22 @@ Content offloading happens when tool call inputs or results exceed a token thres
381381

382382
![An example of offloading showing a large tool response that is replaced with a message about the location of the offloaded results and the first 10 lines of the result](/oss/images/deepagents/offloading-results.png)
383383

384+
### Multimodal inputs
385+
386+
Deep Agents supports multimodal inputs, such as images returned by `read_file` or provided in messages, but the built-in context management mechanisms are primarily text and message-history oriented. They do not resize images, lower image resolution, or generate reusable visual embeddings.
387+
388+
For multimodal workloads, keep large media out of active message history when possible:
389+
390+
- Store images, screenshots, and charts in a filesystem backend or external object store, then pass file paths or URLs through messages.
391+
- Prefer references over base64-encoded image blocks in long-running conversations.
392+
- If a tool produces an image, have the tool save the image and return a concise text description plus a path or URL.
393+
- Use subagents for image-heavy inspection work so the main agent receives a compact text result instead of every multimodal intermediate step.
394+
- Tune summarization thresholds or provide a custom token counter when your model provider charges many tokens for images.
395+
396+
Offloading large tool inputs and results only measures text content. Non-text blocks, including images, are preserved in the replacement message rather than compressed. A message that contains only an image will not be offloaded because of image size alone.
397+
398+
Summarization replaces older messages with a text summary once those messages fall outside the preserved recent context. Any images in the summarized partition are no longer sent as active image blocks after summarization. The conversation history file written to the backend is a textual record, not a media artifact store, so store important images separately if the agent needs to inspect them again later.
399+
384400
### Summarization
385401

386402
:::js
@@ -394,9 +410,9 @@ When the context size crosses the model's context window limit (for example 85%
394410
This process has two components:
395411

396412
- **In-context summary**: An LLM generates a structured summary of the conversation including session intent, artifacts created, and next steps—which replaces the full conversation history in the agent's working memory.
397-
- **Filesystem preservation**: The complete, original conversation messages are written to the filesystem as a canonical record.
413+
- **Filesystem preservation**: A text rendering of the original conversation messages is written to the filesystem as a canonical record.
398414

399-
This dual approach ensures the agent maintains awareness of its goals and progress (via the summary) while preserving the ability to recover specific details when needed (via filesystem search).
415+
This dual approach ensures the agent maintains awareness of its goals and progress (via the summary) while preserving the ability to recover text details when needed (via filesystem search).
400416

401417
![An example of summarization showing an agent's conversation history, where several steps get compacted](/oss/images/deepagents/summarization.png)
402418

src/oss/langchain/middleware/built-in.mdx

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -60,6 +60,10 @@ Automatically summarize conversation history when approaching token limits, pres
6060
- Multi-turn dialogues with extensive history.
6161
- Applications where preserving full conversation context matters.
6262

63+
<Note>
64+
Summarization is text-oriented context compression. It does not resize, downsample, or otherwise compress image/audio/video payloads. Recent messages retained by `keep` still include their original multimodal blocks, while older multimodal messages that are summarized are represented only by the generated text summary. For image-heavy applications, store media in a filesystem or object store and pass URLs or file references through message history.
65+
</Note>
66+
6367
:::python
6468
**API reference:** @[`SummarizationMiddleware`]
6569

0 commit comments

Comments
 (0)