RFC + MVP: generic LLM-OCR HTTP API + paperless-ngx 3.0 parser plugin path by icereed · Pull Request #964 · icereed/paperless-gpt

icereed · 2026-05-05T14:44:36Z

RFC + MVP: a generic LLM-OCR HTTP API for paperless-gpt

💬 This PR exists to invite discussion. It ships a working MVP, a Python shim skeleton for the paperless-ngx 3.0 parser plugin slot, and a curl quickstart you can try right now.
Please comment with feedback, alternative designs, or use cases I missed.

TL;DR

Add a small, stable, consumer-agnostic HTTP API on paperless-gpt:

POST /api/v1/parse          # send a document, get text (+ later: searchable PDF, thumbnail)
GET  /api/v1/capabilities   # what MIME types / providers are available
GET  /api/v1/healthz        # liveness

The headline use case is the paperless-ngx 3.0 parser plugin framework (discussion #12023). Because paperless-gpt is Go and that framework loads in-process Python entrypoints, this PR also ships a thin Python shim (paperless-gpt-parser/) that implements ParserProtocol and forwards documents to the sidecar over HTTP.

But the same endpoint is equally useful from:

🔄 n8n / Make / Zapier workflows
🤖 Local coding agents (Claude Code, Continue, aider, MCP) that struggle with PDFs in user prompts
💻 CLI tools (pgpt parse foo.pdf)
🧬 RAG / vector ingestion pipelines
📥 Any custom app that today wraps Tesseract or a vendor OCR

Designing the API to be generic up front gives us the paperless-ngx integration and a much larger pool of potential consumers at no extra design cost.

What this PR ships

Go side (this repo)

File	What
`parser_api.go`	Three handlers, optional bearer-token middleware, MIME allow-list
`parser_api_test.go`	7 tests: capabilities, healthz, image parse, unsupported MIME, missing file, bearer enforcement, invalid JSON `context` — all pass
`main.go`	`app.registerParserAPI(router)` wired into the existing router

Python shim skeleton

File	What
`paperless-gpt-parser/pyproject.toml`	Entry-point registration in the `paperless_ngx.parsers` group: `paperless_gpt = paperless_gpt_parser.parser:GptParser`
`paperless-gpt-parser/src/paperless_gpt_parser/parser.py`	`GptParser` implementing `ParserProtocol`
`paperless-gpt-parser/src/paperless_gpt_parser/client.py`	Tiny httpx wrapper around the v1 API
`paperless-gpt-parser/src/paperless_gpt_parser/config.py`	`PAPERLESS_GPT_URL`, token, score, timeout
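
As a rough illustration of what `config.py` covers, here is a minimal sketch of environment-driven configuration. It is not the shim's actual code: the `PAPERLESS_GPT_PARSER_SCORE` and `PAPERLESS_GPT_PARSER_TIMEOUT` names, the fallback URL, and the 120-second timeout are assumptions made for the example. Only `PAPERLESS_GPT_URL`, `PAPERLESS_GPT_API_TOKEN`, and the default score of 50 appear in this PR.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Config:
    """Settings the shim needs to reach the sidecar (illustrative only)."""

    url: str
    token: Optional[str]
    score: int
    timeout: float

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "Config":
        return cls(
            # Strip a trailing slash so paths can be appended safely.
            url=env.get("PAPERLESS_GPT_URL", "http://localhost:8080").rstrip("/"),
            # An empty token means "no auth configured".
            token=env.get("PAPERLESS_GPT_API_TOKEN") or None,
            # Hypothetical variable names; the real PAPERLESS_GPT_PARSER_* suffixes may differ.
            score=int(env.get("PAPERLESS_GPT_PARSER_SCORE", "50")),
            timeout=float(env.get("PAPERLESS_GPT_PARSER_TIMEOUT", "120")),
        )
```

Passing the environment as a mapping keeps the class easy to test without mutating `os.environ`.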

The shim will move to its own repo (icereed/paperless-gpt-parser) once the API stabilises. It lives here for now so reviewers can see the full picture in one PR.

Docs

File	What
`docs/parser_plugin_rfc.md`	Full design + open questions
`docs/examples/parser-plugin/README.md`	Copy-pasteable `curl` quickstart for every persona
`docs/examples/parser-plugin/docker-compose.yml`	paperless-ngx + sidecar proof of concept

Try it (60 seconds)

Build and run the sidecar:

docker build -t paperless-gpt:pr-964 .
docker run --rm -p 8080:8080 \
  -e LLM_PROVIDER=openai -e LLM_MODEL=gpt-4o-mini \
  -e VISION_LLM_PROVIDER=openai -e VISION_LLM_MODEL=gpt-4o-mini \
  -e OCR_PROVIDER=llm -e OPENAI_API_KEY=$OPENAI_API_KEY \
  -e PAPERLESS_BASE_URL=http://example.invalid \
  -e PAPERLESS_API_TOKEN=anything \
  paperless-gpt:pr-964

Discover capabilities:

curl -s http://localhost:8080/api/v1/capabilities | jq

{
  "name": "paperless-gpt",
  "version": "devVersion",
  "supported_mime_types": {
    "application/pdf": ".pdf", "image/png": ".png",
    "image/jpeg": ".jpg", "image/tiff": ".tiff", "image/webp": ".webp"
  },
  "providers": [{"id":"llm","display_name":"llm","can_produce_archive":false}],
  "default_provider": "llm",
  "default_score": 50,
  "notes": ["MVP: text extraction only…"]
}
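
A client will probably want to check this document before uploading anything. The sketch below parses the sample response shown above into a small dataclass and answers "is this MIME type supported?". The class name `CapabilitiesSketch` and its method names are made up for illustration and are not the shim's `Capabilities` type.

```python
import json
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CapabilitiesSketch:
    supported_mime_types: Dict[str, str]  # MIME type -> file extension
    providers: List[str]                  # provider ids only
    default_provider: str
    default_score: int

    @classmethod
    def from_json(cls, raw: str) -> "CapabilitiesSketch":
        data = json.loads(raw)
        return cls(
            supported_mime_types=dict(data.get("supported_mime_types", {})),
            providers=[p["id"] for p in data.get("providers", [])],
            default_provider=data.get("default_provider", ""),
            default_score=int(data.get("default_score", 0)),
        )

    def supports(self, mime_type: str) -> bool:
        """True when the sidecar advertises this MIME type."""
        return mime_type.lower() in self.supported_mime_types


# Sample body copied from the capabilities response above (notes omitted).
SAMPLE = """
{
  "name": "paperless-gpt",
  "version": "devVersion",
  "supported_mime_types": {
    "application/pdf": ".pdf", "image/png": ".png",
    "image/jpeg": ".jpg", "image/tiff": ".tiff", "image/webp": ".webp"
  },
  "providers": [{"id": "llm", "display_name": "llm", "can_produce_archive": false}],
  "default_provider": "llm",
  "default_score": 50
}
"""

caps = CapabilitiesSketch.from_json(SAMPLE)
print(caps.supports("application/pdf"), caps.supports("text/plain"))
```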

Parse a PDF:

curl -s -X POST http://localhost:8080/api/v1/parse \
  -F file=@./scan.pdf \
  -F mime_type=application/pdf | jq

{ "text": "Invoice no. 12345 …", "page_count": 3, "provider": "llm" }

Use from any agent, n8n node, or shell:

curl -s -X POST "$PAPERLESS_GPT_URL/api/v1/parse" \
  -F file=@"$1" \
  -F mime_type="$(file --mime-type -b "$1")" \
  | jq -r '.text'
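
For callers that would rather not shell out to curl, the same request can be approximated with the Python standard library. This is a sketch under assumptions: the helper name `build_parse_request` is invented here, the MIME type is guessed from the file name instead of via `file --mime-type`, and the token is sent as a standard `Authorization: Bearer` header. It only builds the request, so it can be inspected without a running sidecar.

```python
import mimetypes
import os
import urllib.request
import uuid
from typing import Optional


def build_parse_request(base_url: str, filename: str, content: bytes,
                        token: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a multipart POST for /api/v1/parse."""
    name = os.path.basename(filename)
    mime_type = mimetypes.guess_type(name)[0] or "application/octet-stream"
    boundary = uuid.uuid4().hex

    # Two form fields, mirroring the curl call: mime_type and file.
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="mime_type"\r\n\r\n'
        f"{mime_type}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{name}"\r\n'
        f"Content-Type: {mime_type}\r\n\r\n"
    ).encode()
    body = head + content + f"\r\n--{boundary}--\r\n".encode()

    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/v1/parse",
        data=body, headers=headers, method="POST",
    )
```

Sending it is then a matter of `json.load(urllib.request.urlopen(req))["text"]`, assuming the response shape shown in the PDF example.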

Full quickstart with more snippets and the auth flow lives in docs/examples/parser-plugin/README.md.

Why a shim and not a native plugin?

paperless-gpt is written in Go, whereas the new framework loads in-process Python plugins. Bundling a Go shared library in a Python wheel across architectures is a packaging nightmare. HTTP keeps the deployment story identical to today (one container) and lets every tool, not just paperless-ngx, call the same endpoint.

Migration story (no breakage)

paperless-ngx version	paperless-gpt deployment
≤ 2.x	Existing tag-based polling — unchanged
3.0+	Either keep tag-based polling, or `pip install paperless-gpt-parser` and run paperless-gpt as a sidecar
Future	Plugin path becomes recommended; tag polling deprecated but not removed for one major cycle

We never break existing users.

Out of scope for this PR (intentionally — keeps it reviewable)

🚧 Searchable archive PDF in the /parse response (planned via gardar/ocrchestra/pkg/pdfocr — the polling flow already does this for the local-file output)
🚧 Real WebP thumbnail in the response (the Python shim renders a placeholder for now so paperless-ngx never sees a missing path)
🚧 Per-request provider override and language_hint honoring
🚧 Unifying the new stateless path with the existing ProcessDocumentOCR polling flow
🚧 Moving paperless-gpt-parser/ to its own repo + PyPI publish

These are tracked in the roadmap section of the RFC and will be follow-up PRs once the shape is confirmed.

What I'd love feedback on

The RFC ends with Open questions for discussion. Specifically:

Endpoint shape — base64 in JSON vs. multipart/mixed for binary fields once we add the archive PDF?
Per-request provider override — should consumers pick the LLM/OCR provider per call?
Auth model — open by default + optional bearer token enough? Need scoped tokens?
PDF/A — does anyone actually need PDF/A conformance for the archive output, or is regular PDF + hOCR layer fine?
Shim location — sibling repo paperless-gpt-parser (cleaner for pip install) vs. monorepo (easier to keep in sync)?
Scope — keep tag/title/correspondent suggestions out of the parser plugin and continue via the existing tag-based path? (Current proposal: yes.)
Naming — /api/v1/parse good? Or /api/v1/extract, /api/v1/ocr, /api/v1/documents:parse?
Use cases I missed — would you wire this into something? Tell me how, so the API doesn't accidentally close that door.
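
To make the auth-model question concrete, the "open by default + optional bearer token" behaviour can be sketched as a single predicate. This is an illustration in Python, not the Go middleware from `parser_api.go`; the function name is hypothetical, and the constant-time comparison is a choice made for this sketch that the actual implementation may or may not share.

```python
import hmac
from typing import Mapping, Optional


def is_authorized(headers: Mapping[str, str], expected_token: Optional[str]) -> bool:
    """Return True when the request may proceed."""
    if not expected_token:
        # No token configured (e.g. PAPERLESS_GPT_API_TOKEN unset): API stays open.
        return True
    scheme, _, presented = headers.get("Authorization", "").partition(" ")
    if scheme.lower() != "bearer":
        return False
    # compare_digest avoids leaking the token through timing differences.
    return hmac.compare_digest(presented.strip().encode(), expected_token.encode())
```

Scoped tokens would replace the single `expected_token` with a lookup from token to allowed operations, which is where the design would start to grow.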

Try it, break it, comment

Pull the branch, run the curl commands, tell me what's wrong with the design. The point of this PR is the discussion — happy to iterate substantially on the API shape before anything is locked in.

cc @icereed — and anyone in the community with opinions on the paperless-ngx integration story or the broader "PDF-as-a-service" use case 🙏

Summary by CodeRabbit

New Features
- Added parser plugin HTTP API (/api/v1/parse, /capabilities, /healthz) with optional bearer-token auth
- Added Python integration package to forward documents to a parser sidecar for AI-powered OCR
Documentation
- Quickstart, Docker Compose example, RFC and README documenting API endpoints, usage, auth and workflow integrations
Tests
- Added automated tests validating parser API behavior and auth handling

Adds a draft RFC and stub HTTP endpoints (currently 501 Not Implemented) for a consumer-agnostic /api/v1/parse + /capabilities + /healthz surface.

The headline use case is the paperless-ngx 3.0 parser plugin framework (paperless-ngx/paperless-ngx#12294, discussion #12023): a thin Python shim 'paperless-gpt-parser' would implement ParserProtocol and forward to this endpoint.

Because the API is intentionally generic, the same endpoint also serves n8n / Make / Zapier workflows, local coding agents that struggle with PDFs, CLI tools, and RAG ingestion pipelines.

This commit only adds:
- docs/parser_plugin_rfc.md (full design + open questions)
- parser_api.go (stubs returning 501)
- main.go (router registration)

No production behaviour changes. Opening as a PR to invite discussion before implementation lands.

coderabbitai · 2026-05-05T14:44:45Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

📥 Commits

Reviewing files that changed from the base of the PR and between 4a01ac4 and 3aed1a1.

📒 Files selected for processing (1)

Dockerfile

✅ Files skipped from review due to trivial changes (1)

Dockerfile

📝 Walkthrough

Adds a versioned Parser API (HTTP /api/v1) and a reference Python parser shim plus docs, tests, and Docker Compose example; implements Go handlers (capabilities, healthz, parse), client/shim code, and registers routes in main.go.

Changes

Parser Plugin API + Shim

Layer / File(s)	Summary
API Spec & Quickstart `docs/parser_plugin_rfc.md`, `docs/examples/parser-plugin/README.md`	RFC and Quickstart define `/api/v1` endpoints (`POST /parse`, `GET /capabilities`, `GET /healthz`), request/response shapes, error semantics, optional bearer-token auth, and curl/integration examples.
Compose Example `docs/examples/parser-plugin/docker-compose.yml`	Docker Compose example wiring `paperless-ngx`, `paperless-gpt` sidecar, and `redis`, showing env and mounts for the parser shim.
Go API Types & Handlers `parser_api.go`	Adds DTOs (`ParserCapabilities`, `ParserProviderInfo`, `ParseResponse`, `ParserMetadataEntry`) and Gin handlers: `/capabilities`, `/healthz`, and `/parse` (multipart validation, mime detection, provider routing, PDF rendering via go-fitz, page-wise OCR, and base64-encoded optional artifacts).
Go: PDF parse helper `parser_api.go` (parsePDFForAPI)	Implements PDF loading, bounded DPI rendering, per-page OCR calls, cancellation via context, concatenation of text, page-count and OCR limit tracking.
Bearer-token middleware & routing `parser_api.go`, `main.go`	Adds optional `PAPERLESS_GPT_API_TOKEN` enforcement and registers `/api/v1` routes via `app.registerParserAPI(router)` in `main()`.
Go Tests `parser_api_test.go`	Adds tests and helpers: `stubOCR`, `makeTinyPNG`, `multipartBody`, and tests covering capabilities, healthz, parse image, unsupported mime, missing file, bearer-token enforcement, and invalid context JSON.
Python shim config `paperless-gpt-parser/src/paperless_gpt_parser/config.py`	Adds `Config` dataclass and `_truthy` helper reading env vars (`PAPERLESS_GPT_URL`, `PAPERLESS_GPT_API_TOKEN`, `PAPERLESS_GPT_PARSER_*`) with defaults and parsing.
Python HTTP client `paperless-gpt-parser/src/paperless_gpt_parser/client.py`	Adds `GptClient`, `Capabilities`, and `ParseResult`; implements `capabilities()` (GET) and `parse()` (multipart POST), base64 decoding helpers, proper file handle cleanup, and context support.
Python Parser plugin implementation `paperless-gpt-parser/src/paperless_gpt_parser/parser.py`, `.../__init__.py`	Adds `GptParser` implementing ParserProtocol: lazy capability caching, `supported_mime_types()` / `score()`, tempdir lifecycle, `parse()` forwarding, archive writing, thumbnail handling (uses returned WebP or generates placeholder via PIL), page-count accessors, and metadata no-op. Exports `GptParser` and sets `__version__`.
Python packaging & docs `paperless-gpt-parser/pyproject.toml`, `paperless-gpt-parser/README.md`	Adds project metadata, dependencies (`httpx`, `Pillow`), test deps, and entry point `paperless_gpt = "paperless_gpt_parser.parser:GptParser"`. README documents usage and layout.
Dev image tweak `Dockerfile`	Updates pinned Alpine `musl-dev` version in build stage from `1.2.5-r9` to `1.2.5-r11`.
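
The "lazy capability caching" mentioned for `GptParser` can be pictured roughly as follows. This is a sketch of the idea, not the shim's code: the class name and the injected `fetch` callable are hypothetical, and treating any `OSError` as "sidecar not up yet" is an assumption.

```python
import threading
from typing import Callable, Dict, Optional


class LazyCapabilities:
    """Fetch capabilities on first use and cache only successful results."""

    def __init__(self, fetch: Callable[[], Dict[str, str]]) -> None:
        self._fetch = fetch
        self._lock = threading.Lock()
        self._cached: Optional[Dict[str, str]] = None

    def supported_mime_types(self) -> Dict[str, str]:
        with self._lock:
            if self._cached is None:
                try:
                    self._cached = self._fetch()
                except OSError:
                    # Sidecar not reachable yet: report nothing now, retry on the next call.
                    return {}
            return self._cached
```

Because failures are never cached, paperless-ngx can start before the sidecar and the parser begins advertising MIME types as soon as the first fetch succeeds.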

Sequence Diagram

sequenceDiagram
    participant PNgx as paperless-ngx
    participant Shim as GptParser (Python)
    participant HTTP as HTTP Client
    participant Server as Parser API (Go)
    participant OCR as OCR Provider

    PNgx->>Shim: parse(document_path, mime_type)
    Shim->>HTTP: GET /api/v1/capabilities
    HTTP->>Server: GET /api/v1/capabilities
    Server->>OCR: check provider status
    Server-->>HTTP: ParserCapabilities JSON
    HTTP-->>Shim: Capabilities

    Shim->>HTTP: POST /api/v1/parse (multipart: file + context)
    HTTP->>Server: POST /api/v1/parse
    alt PDF
        Server->>Server: render pages to JPEG (bounded DPI)
        Server->>OCR: ProcessImage(page_jpeg)
        OCR-->>Server: text per page
        Server->>Server: concat text, count pages
    else Image
        Server->>OCR: ProcessImage(file_bytes)
        OCR-->>Server: text
    end
    Server-->>HTTP: ParseResponse {text, page_count, archive?, thumbnail?}
    HTTP-->>Shim: ParseResponse
    Shim->>Shim: write archive/thumb to temp dir or generate placeholder
    Shim-->>PNgx: {text, archive_path, page_count, thumbnail_path}
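
The last shim step in the diagram (write returned artifacts to a temp dir) might look like the sketch below. The JSON field names `archive_pdf` and `thumbnail` are placeholders chosen for this example, since the MVP response only carries text and page count. Placeholder thumbnail generation is left out.

```python
import base64
import tempfile
from pathlib import Path
from typing import Any, Dict, Optional


def write_artifact(response: Dict[str, Any], field: str,
                   out_dir: Path, name: str) -> Optional[Path]:
    """Decode an optional base64 field to a file; return None when it is absent."""
    encoded = response.get(field)
    if not encoded:
        return None
    target = out_dir / name
    # validate=True rejects stray non-alphabet characters instead of ignoring them.
    target.write_bytes(base64.b64decode(encoded, validate=True))
    return target


# Usage with a fake response; "archive_pdf" is a hypothetical field name.
with tempfile.TemporaryDirectory() as tmp:
    fake = {"text": "hi", "page_count": 1,
            "archive_pdf": base64.b64encode(b"%PDF-1.4").decode()}
    archive = write_artifact(fake, "archive_pdf", Path(tmp), "archive.pdf")
    thumb = write_artifact(fake, "thumbnail", Path(tmp), "thumb.webp")
    archive_bytes = archive.read_bytes() if archive else None
```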

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

icereed/paperless-gpt#208: Modifies OCR provider integration and provider initialization that this PR also depends on.

Suggested labels

safe-to-test

Poem

🐰 A nibble, a hop, the parser's begun,

Files sent and scanned beneath the spring sun,
Go routes hum, Python shim passes the cheer,
Text hops back home — tidy, crisp, and clear,
Hooray! A carrot for each parsed run.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 5.41% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main change: introducing an RFC and MVP for a generic LLM-OCR HTTP API with a paperless-ngx 3.0 parser plugin implementation.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


…ckstart

Replaces the 501-stub commit on this branch with a working prototype that the community can actually try.

Go side (paperless-gpt):
- POST /api/v1/parse: accepts multipart (file + mime_type + optional filename/produce_*/provider/language_hint/context) and returns JSON with text + page_count for images and PDFs.
- GET /api/v1/capabilities: enumerates supported MIME types, providers, default score; consumed by the Python shim and any other client.
- GET /api/v1/healthz: liveness probe.
- Optional bearer-token auth via PAPERLESS_GPT_API_TOKEN.
- Tests: parser_api_test.go covers capabilities, healthz, image parse, unsupported MIME, missing file, bearer-token enforcement, invalid context JSON. All pass.

Python shim (paperless-gpt-parser/):
- Implements paperless-ngx 3.0 ParserProtocol (entrypoint group paperless_ngx.parsers).
- Lazy capabilities cache so paperless-ngx start order doesn't matter.
- Forwards to /api/v1/parse, exposes get_text/get_page_count today, archive_pdf/thumbnail/date once the sidecar implements them.

Docs:
- examples/parser-plugin/README.md: copy-pasteable curl snippets and use-case examples (paperless-ngx, n8n, coding agents, CLI).
- examples/parser-plugin/docker-compose.yml: paperless-ngx + sidecar proof of concept.
- parser_plugin_rfc.md: roadmap reflects MVP shipped, follow-ups spelled out (archive PDF, thumbnail, provider override).

Out of scope for this PR (intentionally, to keep it reviewable):
- Searchable archive PDF in the response (planned via gardar/ocrchestra/pkg/pdfocr; the polling flow already does this).
- Real WebP thumbnail (placeholder rendered by the shim for now).
- Per-request provider override and language_hint honoring.
- Unifying the new stateless path with the existing ProcessDocumentOCR polling flow.
- Moving paperless-gpt-parser/ to its own repo + PyPI.

The Alpine 3.21 main repository has rotated musl-dev from r9 to r11. The older pinned version is no longer fetchable, breaking 'apk add' in CI:

ERROR: unable to select packages: musl-dev-1.2.5-r11: breaks: world[musl-dev=1.2.5-r9]

Other pinned packages (gcc, mupdf, mupdf-dev, sed) are still current. Verified against alpine:3.21 with 'apk policy'.

Pre-existing failure on main (renovate has an open PR for the same fix); applying it here so the parser-plugin PR pipeline can go green.

icereed changed the title ~~RFC: generic LLM-OCR HTTP API + paperless-ngx 3.0 parser plugin path~~ RFC + MVP: generic LLM-OCR HTTP API + paperless-ngx 3.0 parser plugin path May 5, 2026

RFC + MVP: generic LLM-OCR HTTP API + paperless-ngx 3.0 parser plugin path #964
icereed wants to merge 3 commits into main from feature/paperless-ngx-parser-plugin
