[Bug] read_document truncates long PDFs before later pages are reachable

## Pre-checks

- [x] I have searched [existing issues](https://github.com/dataelement/Clawith/issues) and this is not a duplicate.

## Deployment Method

Source (setup.sh)

## Steps to Reproduce

1. Provide an agent with a multi-page PDF whose first page alone yields roughly 8,000 or more extracted characters.
2. Ask the agent to use the `read_document` tool to inspect the document.
3. Observe that the returned content stops after the early character budget is reached, so later PDF pages are not reachable through the tool.

Relevant code path:

- `backend/app/services/agent_tools.py`: the `read_document` tool schema exposes only `path`, so the model cannot request a larger character budget or page range.
- `backend/app/services/agent_tools.py`: execution computes `max_chars = min(int(arguments.get("max_chars", 8000)), 20000)`, which gives a default of 8,000 characters and a hard cap of 20,000.
- `backend/app/services/agent_tools.py`: the pdfplumber path iterates up to 50 pages, but breaks when accumulated extracted text reaches `max_chars`.
- `backend/app/services/agent_tools.py`: the PyMuPDF fallback has the same accumulated-length early-stop behavior.
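The early-stop behavior described in the list above can be modeled without a PDF library. The following is a simplified sketch, not the actual code in `agent_tools.py`: the function names, the plain list of page strings standing in for parser output, and the final slice to `max_chars` are all assumptions made for illustration. Only the `max_chars` expression and the 50-page limit come from the code path above.

```python
# Simplified model of the accumulate-then-break loop (illustrative, not the real code).
DEFAULT_MAX_CHARS = 8000
HARD_CAP = 20000
MAX_PAGES = 50


def resolve_max_chars(arguments: dict) -> int:
    # Same expression as quoted from agent_tools.py.
    return min(int(arguments.get("max_chars", DEFAULT_MAX_CHARS)), HARD_CAP)


def read_pages_with_cap(page_texts: list, arguments: dict) -> tuple:
    """Return (text, pages_consumed); page_texts stands in for per-page extraction."""
    max_chars = resolve_max_chars(arguments)
    parts = []
    total = 0
    pages_consumed = 0
    for text in page_texts[:MAX_PAGES]:
        parts.append(text)
        total += len(text)
        pages_consumed += 1
        if total >= max_chars:
            break  # later pages are never visited
    # Assumption: the result is cut to the budget before being returned.
    return "\n".join(parts)[:max_chars], pages_consumed


# A 36-page document whose first page alone fills the default budget.
pages = ["A" * 8000] + [f"page {n} body" for n in range(2, 37)]
# The schema exposes only `path`, so the model cannot supply `max_chars`.
text, consumed = read_pages_with_cap(pages, {})
print(consumed, len(text))  # prints: 1 8000
```

In this model, pages 2 through 36 are never reached, and nothing in the returned text tells the caller that they exist.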

## Expected vs Actual Behavior

Expected: `read_document` should let an agent reliably inspect long documents beyond the first returned character window. For PDFs, the tool should support a way to page through content or request a specific page/range/offset, and the exposed tool schema should include the relevant controls.

Actual: `read_document` reads from the beginning of the document until the character cap is hit. The default cap is 8,000 characters and the hard cap is 20,000 characters, but the tool schema does not expose `max_chars` to the model. If page 1 is large enough to hit the default cap, a 36-page PDF can appear as if only page 1 was read, even though the parser itself is capable of reading later pages.

This is not a PDF parser limitation. It is a tool contract / pagination issue caused by the return-limit logic.

## Logs / Screenshots

Suggested fix direction:

- Expose `max_chars` in the `read_document` tool schema if it remains supported internally.
- Add pagination controls such as `page`, `page_start`, `page_end`, `offset`, or `cursor` so agents can continue reading later pages.
- Return truncation metadata, for example total pages, pages read, whether output was truncated, and a continuation hint.
- Avoid making long-PDF behavior look like successful full-document reading when only the first character window was returned.
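One possible shape for these suggestions is sketched below. Everything here is hypothetical: the schema fields `page`, `offset`, and `max_chars`, the `read_window` helper, and the returned metadata keys are illustrative names rather than an existing Clawith API, and the list of page strings again stands in for whatever the PDF parser extracts.

```python
from typing import Any, Optional

# Hypothetical tool schema that exposes the continuation controls to the model.
READ_DOCUMENT_SCHEMA: dict = {
    "name": "read_document",
    "parameters": {
        "type": "object",
        "properties": {
            "path": {"type": "string"},
            "page": {"type": "integer", "minimum": 1},
            "offset": {"type": "integer", "minimum": 0},
            "max_chars": {"type": "integer", "minimum": 1, "maximum": 20000},
        },
        "required": ["path"],
    },
}


def read_window(page_texts: list, page: int = 1, offset: int = 0,
                max_chars: int = 8000) -> dict:
    """Read up to max_chars starting at (page, offset) and report where to resume."""
    total_pages = len(page_texts)
    budget = max(1, min(max_chars, 20000))
    current, pos = max(1, page), max(0, offset)
    parts = []
    pages_read = []
    while current <= total_pages and budget > 0:
        page_text = page_texts[current - 1]
        chunk = page_text[pos:pos + budget]
        parts.append(chunk)
        pages_read.append(current)
        budget -= len(chunk)
        pos += len(chunk)
        if pos >= len(page_text):
            # Finished this page; continue at the start of the next one.
            current, pos = current + 1, 0
    truncated = current <= total_pages
    next_cursor: Optional[dict] = (
        {"page": current, "offset": pos} if truncated else None
    )
    return {
        "content": "".join(parts),
        "total_pages": total_pages,
        "pages_read": pages_read,
        "truncated": truncated,
        "next": next_cursor,
    }


pages = ["A" * 8000, "B" * 10, "C" * 5]
first = read_window(pages)
second = read_window(pages, **first["next"])
print(first["truncated"], first["next"], second["pages_read"])
```

Because each response carries `total_pages`, `truncated`, and a `next` cursor, an agent can tell a partial read from a complete one and request the remainder explicitly. A cursor of page plus offset is used here so that a single page larger than the budget can still be read in full across several calls.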
