[Bug] read_document truncates long PDFs before later pages are reachable #575

@Y1fe1Zh0u

Description

Pre-checks

Deployment Method

Source (setup.sh)

Steps to Reproduce

  1. Provide an agent with a multi-page PDF whose first page alone yields roughly 8,000 or more extracted characters.
  2. Ask the agent to use the read_document tool to inspect the document.
  3. Observe that the returned content stops after the early character budget is reached, so later PDF pages are not reachable through the tool.

Relevant code path:

  • backend/app/services/agent_tools.py: the read_document tool schema exposes only path, so the model cannot request a larger character budget or page range.
  • backend/app/services/agent_tools.py: execution computes max_chars = min(int(arguments.get("max_chars", 8000)), 20000), so a call without max_chars falls back to 8000.
  • backend/app/services/agent_tools.py: the pdfplumber path iterates over up to 50 pages, but breaks out of the loop as soon as the accumulated extracted text reaches max_chars.
  • backend/app/services/agent_tools.py: the PyMuPDF fallback has the same accumulated-length early-stop behavior.
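
The early-stop behavior described above can be reproduced without any PDF library. The following is a minimal simulation, not the code from agent_tools.py: the function names, the list-of-strings stand-in for extracted pages, and the final slice to max_chars are assumptions made for illustration. Only the budget expression and the 50-page limit are taken from this report.

```python
def resolve_max_chars(arguments):
    # Budget expression as quoted in this report: default 8000, hard cap 20000.
    return min(int(arguments.get("max_chars", 8000)), 20000)


def read_pages(pages, arguments, max_pages=50):
    """Hypothetical stand-in for the reported loop.

    `pages` is a list of already-extracted page texts (an assumption;
    the real code extracts text with pdfplumber or PyMuPDF).
    """
    max_chars = resolve_max_chars(arguments)
    parts = []
    total = 0
    pages_read = 0
    for text in pages[:max_pages]:
        parts.append(text)
        total += len(text)
        pages_read += 1
        if total >= max_chars:
            break  # later pages are never visited
    return "".join(parts)[:max_chars], pages_read


# 36 simulated pages; page 1 alone fills the default budget.
pages = ["A" * 8000] + ["B" * 500] * 35

# The tool schema exposes only `path`, so the model cannot pass max_chars.
content, pages_read = read_pages(pages, {"path": "report.pdf"})
print(pages_read, len(content), "B" in content)  # 1 8000 False
```

With these inputs the loop stops after page 1, and nothing in the returned text indicates that 35 more pages exist.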

Expected vs Actual Behavior

Expected: read_document should let an agent reliably inspect long documents beyond the first returned character window. For PDFs, the tool should support a way to page through content or request a specific page/range/offset, and the exposed tool schema should include the relevant controls.
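
As a rough sketch of what "the relevant controls" could look like, the dictionary below adds max_chars, page_start, and page_end next to the existing path parameter. The overall layout (name / description / parameters in JSON Schema style) and the new parameter names are assumptions for illustration; the project's actual schema format may differ.

```python
# Hypothetical tool schema; only `path`, the 8000 default, and the 20000 cap
# come from this report. Everything else is an illustrative assumption.
READ_DOCUMENT_SCHEMA = {
    "name": "read_document",
    "description": "Read text from a document, optionally limited to a page range.",
    "parameters": {
        "type": "object",
        "properties": {
            "path": {"type": "string", "description": "Path to the document."},
            "max_chars": {
                "type": "integer",
                "minimum": 1,
                "maximum": 20000,
                "default": 8000,
                "description": "Maximum number of characters to return.",
            },
            "page_start": {
                "type": "integer",
                "minimum": 1,
                "description": "First page to read (1-based).",
            },
            "page_end": {
                "type": "integer",
                "minimum": 1,
                "description": "Last page to read (1-based, inclusive).",
            },
        },
        "required": ["path"],
    },
}
```

Keeping path as the only required field would leave existing calls valid while letting the model opt in to the new controls.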

Actual: read_document reads from the beginning of the document until the character cap is hit. The default cap is 8,000 characters and the hard cap is 20,000 characters, but the tool schema does not expose max_chars to the model. If page 1 is large enough to hit the default cap, a 36-page PDF can appear as if only page 1 was read, even though the parser itself is capable of reading later pages.

This is not a PDF parser limitation. It is a tool contract / pagination issue caused by the return-limit logic.

Logs / Screenshots

Suggested fix direction:

  • Expose max_chars in the read_document tool schema if it remains supported internally.
  • Add pagination controls such as page, page_start, page_end, offset, or cursor so agents can continue reading later pages.
  • Return truncation metadata, for example total pages, pages read, whether output was truncated, and a continuation hint.
  • Avoid making long-PDF behavior look like successful full-document reading when only the first character window was returned.
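
One way the pagination and truncation-metadata suggestions could fit together is sketched below. It is an illustration under stated assumptions, not a proposed patch: read_window, the offset parameter, and the result keys (pages_read, total_pages, truncated, next) are hypothetical names, and pages are again modeled as a list of extracted strings rather than read through pdfplumber or PyMuPDF.

```python
def read_window(pages, page_start=1, offset=0, max_chars=8000):
    """Return up to max_chars of text starting at (page_start, offset).

    page_start is 1-based; offset is a character offset within that page.
    """
    max_chars = max(1, min(int(max_chars), 20000))  # keep the reported hard cap
    total_pages = len(pages)
    parts = []
    pages_read = []
    remaining = max_chars
    next_cursor = None
    page_no = page_start
    while page_no <= total_pages:
        text = pages[page_no - 1][offset:]
        if len(text) > remaining:
            # Budget exhausted: return what fits and say where to resume.
            if remaining > 0:
                parts.append(text[:remaining])
                pages_read.append(page_no)
            next_cursor = {"page_start": page_no, "offset": offset + remaining}
            break
        parts.append(text)
        pages_read.append(page_no)
        remaining -= len(text)
        page_no += 1
        offset = 0
    return {
        "content": "".join(parts),
        "pages_read": pages_read,
        "total_pages": total_pages,
        "truncated": next_cursor is not None,
        "next": next_cursor,  # continuation hint, or None when done
    }


# Same simulated 36-page document: page 1 fills the default budget.
pages = ["A" * 8000] + ["B" * 500] * 35

first = read_window(pages)
print(first["pages_read"], first["truncated"], first["next"])

# An agent can follow the continuation hint until the document is exhausted.
chunks = [first]
while chunks[-1]["next"] is not None:
    chunks.append(read_window(pages, **chunks[-1]["next"]))
print(len(chunks), sum(len(c["content"]) for c in chunks))
```

Because each response carries total_pages, truncated, and a resumable cursor, a response that covers only page 1 can no longer be mistaken for a full-document read.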
