RFC + MVP: generic LLM-OCR HTTP API + paperless-ngx 3.0 parser plugin path#964

Open
icereed wants to merge 3 commits into main from feature/paperless-ngx-parser-plugin

Conversation

@icereed icereed commented May 5, 2026

Owner

RFC + MVP: a generic LLM-OCR HTTP API for paperless-gpt

💬 This PR exists to invite discussion. It ships a working MVP, a Python shim skeleton for the paperless-ngx 3.0 parser plugin slot, and a curl quickstart you can try right now.
Please comment with feedback, alternative designs, or use cases I missed.

TL;DR

Add a small, stable, consumer-agnostic HTTP API on paperless-gpt:

POST /api/v1/parse          # send a document, get text (+ later: searchable PDF, thumbnail)
GET  /api/v1/capabilities   # what MIME types / providers are available
GET  /api/v1/healthz        # liveness

The headline use case is the paperless-ngx 3.0 parser plugin framework (discussion #12023). Because paperless-gpt is Go and that framework loads in-process Python entrypoints, this PR also ships a thin Python shim (paperless-gpt-parser/) that implements ParserProtocol and forwards documents to the sidecar over HTTP.

But the same endpoint is equally useful from:

  • 🔄 n8n / Make / Zapier workflows
  • 🤖 Local coding agents (Claude Code, Continue, aider, MCP) that struggle with PDFs in user prompts
  • 💻 CLI tools (pgpt parse foo.pdf)
  • 🧬 RAG / vector ingestion pipelines
  • 📥 Any custom app that today wraps Tesseract or a vendor OCR

By designing it generic up front, we get the paperless-ngx integration and a much larger pool of potential consumers at no extra design cost.

What this PR ships

Go side (this repo)

File | What
parser_api.go | Three handlers, optional bearer-token middleware, MIME allow-list
parser_api_test.go | 7 tests: capabilities, healthz, image parse, unsupported MIME, missing file, bearer enforcement, invalid context JSON; all pass
main.go | app.registerParserAPI(router) wired into the existing router

Python shim skeleton

File | What
paperless-gpt-parser/pyproject.toml | Entrypoint registration: group paperless_ngx.parsers, entry paperless_gpt = paperless_gpt_parser.parser:GptParser
paperless-gpt-parser/src/paperless_gpt_parser/parser.py | GptParser implementing ParserProtocol
paperless-gpt-parser/src/paperless_gpt_parser/client.py | Tiny httpx wrapper around the v1 API
paperless-gpt-parser/src/paperless_gpt_parser/config.py | PAPERLESS_GPT_URL, token, score, timeout
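
A minimal sketch of how config.py might read these settings, assuming a frozen dataclass populated from the environment. PAPERLESS_GPT_URL and PAPERLESS_GPT_API_TOKEN are named in this PR; the PAPERLESS_GPT_PARSER_SCORE and PAPERLESS_GPT_PARSER_TIMEOUT names, the field names, and all defaults are hypothetical placeholders, not the shipped ones.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Config:
    """Shim settings. Field names are illustrative, not the shipped ones."""

    url: str
    api_token: Optional[str]
    score: int
    timeout_seconds: float

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "Config":
        # PAPERLESS_GPT_URL and PAPERLESS_GPT_API_TOKEN appear in the PR.
        # The two PAPERLESS_GPT_PARSER_* names and every default are assumptions.
        return cls(
            url=env.get("PAPERLESS_GPT_URL", "http://localhost:8080").rstrip("/"),
            api_token=env.get("PAPERLESS_GPT_API_TOKEN") or None,
            score=int(env.get("PAPERLESS_GPT_PARSER_SCORE", "50")),
            timeout_seconds=float(env.get("PAPERLESS_GPT_PARSER_TIMEOUT", "120")),
        )


# Passing a plain dict instead of os.environ keeps the example deterministic.
cfg = Config.from_env({"PAPERLESS_GPT_URL": "http://paperless-gpt:8080/"})
print(cfg.url, cfg.api_token, cfg.score)
```

Taking the environment as a parameter makes the loader easy to unit-test without patching os.environ.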

The shim will move to its own repo (icereed/paperless-gpt-parser) once the API stabilises. It lives here for now so reviewers can see the full picture in one PR.

Docs

File | What
docs/parser_plugin_rfc.md | Full design + open questions
docs/examples/parser-plugin/README.md | Copy-pasteable curl quickstart for every persona
docs/examples/parser-plugin/docker-compose.yml | paperless-ngx + sidecar proof of concept

Try it (60 seconds)

Build and run the sidecar:

docker build -t paperless-gpt:pr-964 .
docker run --rm -p 8080:8080 \
  -e LLM_PROVIDER=openai -e LLM_MODEL=gpt-4o-mini \
  -e VISION_LLM_PROVIDER=openai -e VISION_LLM_MODEL=gpt-4o-mini \
  -e OCR_PROVIDER=llm -e OPENAI_API_KEY=$OPENAI_API_KEY \
  -e PAPERLESS_BASE_URL=http://example.invalid \
  -e PAPERLESS_API_TOKEN=anything \
  paperless-gpt:pr-964

Discover capabilities:

curl -s http://localhost:8080/api/v1/capabilities | jq
{
  "name": "paperless-gpt",
  "version": "devVersion",
  "supported_mime_types": {
    "application/pdf": ".pdf", "image/png": ".png",
    "image/jpeg": ".jpg", "image/tiff": ".tiff", "image/webp": ".webp"
  },
  "providers": [{"id":"llm","display_name":"llm","can_produce_archive":false}],
  "default_provider": "llm",
  "default_score": 50,
  "notes": ["MVP: text extraction only…"]
}
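
A client could use this response to decide whether a document is worth sending at all. The sketch below assumes only the fields shown in the sample output above; the Capabilities class and its supports() helper are hypothetical and not taken from the shim's client.py.

```python
import json
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Capabilities:
    supported_mime_types: Dict[str, str]  # MIME type -> default file extension
    default_provider: str
    default_score: int

    @classmethod
    def from_json(cls, raw: str) -> "Capabilities":
        data = json.loads(raw)
        return cls(
            supported_mime_types=dict(data["supported_mime_types"]),
            default_provider=data["default_provider"],
            default_score=int(data["default_score"]),
        )

    def supports(self, mime_type: str) -> bool:
        # Normalise case and drop parameters such as "; charset=binary".
        return mime_type.split(";")[0].strip().lower() in self.supported_mime_types


# The sample response from the quickstart, unchanged.
SAMPLE = """{
  "name": "paperless-gpt",
  "version": "devVersion",
  "supported_mime_types": {
    "application/pdf": ".pdf", "image/png": ".png",
    "image/jpeg": ".jpg", "image/tiff": ".tiff", "image/webp": ".webp"
  },
  "providers": [{"id":"llm","display_name":"llm","can_produce_archive":false}],
  "default_provider": "llm",
  "default_score": 50,
  "notes": ["MVP: text extraction only…"]
}"""

caps = Capabilities.from_json(SAMPLE)
print(caps.supports("application/pdf"))
print(caps.supports("text/plain"))
```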

Parse a PDF:

curl -s -X POST http://localhost:8080/api/v1/parse \
  -F file=@./scan.pdf \
  -F mime_type=application/pdf | jq
{ "text": "Invoice no. 12345 …", "page_count": 3, "provider": "llm" }

Use from any agent, n8n node, or shell:

curl -s -X POST "$PAPERLESS_GPT_URL/api/v1/parse" \
  -F file=@"$1" \
  -F mime_type="$(file --mime-type -b "$1")" \
  | jq -r '.text'
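
For callers that prefer Python over curl, the same request can be sketched with the standard library alone. This is not the shim's client (which wraps httpx); encode_multipart, build_parse_request and parse_document are hypothetical helper names, and only the file and mime_type form fields from the curl example are sent.

```python
import json
import tempfile
import urllib.request
import uuid
from pathlib import Path


def encode_multipart(fields, filename, payload, mime_type):
    """Encode text fields plus one part named "file" as multipart/form-data.

    Assumes field values and the filename contain no double quotes or newlines.
    """
    boundary = uuid.uuid4().hex
    chunks = []
    for name, value in fields.items():
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        )
        chunks.append(header.encode("utf-8"))
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: {mime_type}\r\n\r\n"
    )
    chunks.append(file_header.encode("utf-8") + payload + b"\r\n")
    chunks.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(chunks), f"multipart/form-data; boundary={boundary}"


def build_parse_request(base_url, path, mime_type):
    """Build (but do not send) the POST /api/v1/parse request."""
    body, content_type = encode_multipart(
        {"mime_type": mime_type}, path.name, path.read_bytes(), mime_type
    )
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/v1/parse",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )


def parse_document(base_url, path, mime_type, timeout=120.0):
    """Send the document to a running sidecar and return the decoded JSON."""
    request = build_parse_request(base_url, path, mime_type)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Build a request against a throwaway file; nothing is sent over the network.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "scan.png"
    sample.write_bytes(b"\x89PNG fake bytes")
    request = build_parse_request("http://localhost:8080/", sample, "image/png")

print(request.get_method(), request.full_url)
```

With a sidecar running, parse_document("http://localhost:8080", Path("scan.pdf"), "application/pdf")["text"] would correspond to the jq -r '.text' step.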

Full quickstart with more snippets and the auth flow lives in docs/examples/parser-plugin/README.md.
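
The optional bearer-token check itself is small. The sketch below assumes the conventional Authorization: Bearer <token> header and that leaving PAPERLESS_GPT_API_TOKEN unset keeps the API open; the real middleware is Go code in parser_api.go, so is_authorized is a hypothetical stand-in for its logic, and the constant-time comparison is a suggestion rather than a description of the shipped code.

```python
import hmac
from typing import Optional


def is_authorized(authorization_header: Optional[str], configured_token: Optional[str]) -> bool:
    """Return True when the request may proceed."""
    if not configured_token:
        # No token configured: the API stays open.
        return True
    if not authorization_header:
        return False
    scheme, _, presented = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # Constant-time comparison avoids leaking token contents through timing.
    return hmac.compare_digest(
        presented.strip().encode("utf-8"), configured_token.encode("utf-8")
    )


print(is_authorized(None, None))
print(is_authorized("Bearer s3cret", "s3cret"))
print(is_authorized("Bearer wrong", "s3cret"))
```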

Why a shim and not a native plugin?

paperless-gpt is written in Go, while the new framework loads in-process Python plugins. Bundling a Go shared library in a Python wheel across architectures is a packaging nightmare. HTTP keeps the deployment story identical to today (one container) and lets every tool, not just paperless-ngx, call the same endpoint.

Migration story (no breakage)

paperless-ngx version | paperless-gpt deployment
≤ 2.x | Existing tag-based polling (unchanged)
3.0+ | Either keep tag-based polling, or pip install paperless-gpt-parser and run paperless-gpt as a sidecar
Future | Plugin path becomes recommended; tag polling deprecated but not removed for one major cycle

We never break existing users.

Out of scope for this PR (intentionally — keeps it reviewable)

  • 🚧 Searchable archive PDF in the /parse response (planned via gardar/ocrchestra/pkg/pdfocr — the polling flow already does this for the local-file output)
  • 🚧 Real WebP thumbnail in the response (the Python shim renders a placeholder for now so paperless-ngx never sees a missing path)
  • 🚧 Per-request provider override and language_hint honoring
  • 🚧 Unifying the new stateless path with the existing ProcessDocumentOCR polling flow
  • 🚧 Moving paperless-gpt-parser/ to its own repo + PyPI publish

These are tracked in the roadmap section of the RFC and will be follow-up PRs once the shape is confirmed.

What I'd love feedback on

The RFC ends with Open questions for discussion. Specifically:

  1. Endpoint shape — base64 in JSON vs. multipart/mixed for binary fields once we add the archive PDF?
  2. Per-request provider override — should consumers pick the LLM/OCR provider per call?
  3. Auth model — open by default + optional bearer token enough? Need scoped tokens?
  4. PDF/A — does anyone actually need PDF/A conformance for the archive output, or is regular PDF + hOCR layer fine?
  5. Shim location — sibling repo paperless-gpt-parser (cleaner for pip install) vs. monorepo (easier to keep in sync)?
  6. Scope — keep tag/title/correspondent suggestions out of the parser plugin and continue via the existing tag-based path? (Current proposal: yes.)
  7. Naming: is /api/v1/parse good? Or /api/v1/extract, /api/v1/ocr, /api/v1/documents:parse?
  8. Use cases I missed — would you wire this into something? Tell me how, so the API doesn't accidentally close that door.
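
To put a number on question 1: base64 turns every 3 bytes into 4 characters, so a binary field embedded in JSON grows by roughly a third before any JSON escaping. The sketch below measures that for a stand-in archive; the archive_pdf_base64 field name is hypothetical, since the archive response shape is not settled yet.

```python
import base64
import json


def json_envelope(text: str, archive: bytes) -> bytes:
    """Build a JSON response body with the archive embedded as base64."""
    # "archive_pdf_base64" is a hypothetical field name used for illustration.
    return json.dumps(
        {"text": text, "archive_pdf_base64": base64.b64encode(archive).decode("ascii")}
    ).encode("utf-8")


archive = bytes(3_000_000)  # stand-in for a 3 MB searchable PDF
encoded_len = len(base64.b64encode(archive))
print(encoded_len)  # 4000000

body = json_envelope("Invoice no. 12345", archive)
decoded = base64.b64decode(json.loads(body)["archive_pdf_base64"])
print(decoded == archive)  # True
```

A multipart response would avoid that overhead and the extra decode step, at the cost of clients needing a multipart parser instead of a single JSON decode.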

Checklist

  • RFC document
  • Working MVP (/parse, /capabilities, /healthz)
  • Tests (parser_api_test.go, all pass)
  • Optional bearer-token auth
  • Python shim skeleton
  • curl quickstart
  • docker-compose example
  • No new Go dependencies
  • No production behaviour changed for existing flows
  • Searchable archive PDF (follow-up)
  • Real thumbnail (follow-up)

Try it, break it, comment

Pull the branch, run the curl commands, tell me what's wrong with the design. The point of this PR is the discussion — happy to iterate substantially on the API shape before anything is locked in.

cc @icereed — and anyone in the community with opinions on the paperless-ngx integration story or the broader "PDF-as-a-service" use case 🙏

Summary by CodeRabbit

  • New Features

    • Added parser plugin HTTP API (/api/v1/parse, /capabilities, /healthz) with optional bearer-token auth
    • Added Python integration package to forward documents to a parser sidecar for AI-powered OCR
  • Documentation

    • Quickstart, Docker Compose example, RFC and README documenting API endpoints, usage, auth and workflow integrations
  • Tests

    • Added automated tests validating parser API behavior and auth handling

Adds a draft RFC and stub HTTP endpoints (currently 501 Not Implemented)
for a consumer-agnostic /api/v1/parse + /capabilities + /healthz surface.

The headline use case is the paperless-ngx 3.0 parser plugin framework
(paperless-ngx/paperless-ngx#12294, discussion #12023): a thin Python
shim 'paperless-gpt-parser' would implement ParserProtocol and forward
to this endpoint. Because the API is intentionally generic, the same
endpoint also serves n8n / Make / Zapier workflows, local coding agents
that struggle with PDFs, CLI tools, and RAG ingestion pipelines.

This commit only adds:
  - docs/parser_plugin_rfc.md  (full design + open questions)
  - parser_api.go              (stubs returning 501)
  - main.go                    (router registration)

No production behaviour changes. Opening as a PR to invite discussion
before implementation lands.
@coderabbitai

coderabbitai Bot commented May 5, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9743c39a-68e5-41c3-b9e6-f1a908f9eb7f

📥 Commits

Reviewing files that changed from the base of the PR and between 4a01ac4 and 3aed1a1.

📒 Files selected for processing (1)
  • Dockerfile
✅ Files skipped from review due to trivial changes (1)
  • Dockerfile

📝 Walkthrough

Adds a versioned Parser API (HTTP /api/v1) and a reference Python parser shim plus docs, tests, and Docker Compose example; implements Go handlers (capabilities, healthz, parse), client/shim code, and registers routes in main.go.

Changes

Parser Plugin API + Shim

Layer / File(s) Summary
API Spec & Quickstart
docs/parser_plugin_rfc.md, docs/examples/parser-plugin/README.md
RFC and Quickstart define /api/v1 endpoints (POST /parse, GET /capabilities, GET /healthz), request/response shapes, error semantics, optional bearer-token auth, and curl/integration examples.
Compose Example
docs/examples/parser-plugin/docker-compose.yml
Docker Compose example wiring paperless-ngx, paperless-gpt sidecar, and redis, showing env and mounts for the parser shim.
Go API Types & Handlers
parser_api.go
Adds DTOs (ParserCapabilities, ParserProviderInfo, ParseResponse, ParserMetadataEntry) and Gin handlers: /capabilities, /healthz, and /parse (multipart validation, mime detection, provider routing, PDF rendering via go-fitz, page-wise OCR, and base64-encoded optional artifacts).
Go: PDF parse helper
parser_api.go (parsePDFForAPI)
Implements PDF loading, bounded DPI rendering, per-page OCR calls, cancellation via context, concatenation of text, page-count and OCR limit tracking.
Bearer-token middleware & routing
parser_api.go, main.go
Adds optional PAPERLESS_GPT_API_TOKEN enforcement and registers /api/v1 routes via app.registerParserAPI(router) in main().
Go Tests
parser_api_test.go
Adds tests and helpers: stubOCR, makeTinyPNG, multipartBody, and tests covering capabilities, healthz, parse image, unsupported mime, missing file, bearer-token enforcement, and invalid context JSON.
Python shim config
paperless-gpt-parser/src/paperless_gpt_parser/config.py
Adds Config dataclass and _truthy helper reading env vars (PAPERLESS_GPT_URL, PAPERLESS_GPT_API_TOKEN, PAPERLESS_GPT_PARSER_*) with defaults and parsing.
Python HTTP client
paperless-gpt-parser/src/paperless_gpt_parser/client.py
Adds GptClient, Capabilities, and ParseResult; implements capabilities() (GET) and parse() (multipart POST), base64 decoding helpers, proper file handle cleanup, and context support.
Python Parser plugin implementation
paperless-gpt-parser/src/paperless_gpt_parser/parser.py, .../__init__.py
Adds GptParser implementing ParserProtocol: lazy capability caching, supported_mime_types() / score(), tempdir lifecycle, parse() forwarding, archive writing, thumbnail handling (uses returned WebP or generates placeholder via PIL), page-count accessors, and metadata no-op. Exports GptParser and sets __version__.
Python packaging & docs
paperless-gpt-parser/pyproject.toml, paperless-gpt-parser/README.md
Adds project metadata, dependencies (httpx, Pillow), test deps, and entry point paperless_gpt = "paperless_gpt_parser.parser:GptParser". README documents usage and layout.
Dev image tweak
Dockerfile
Updates pinned Alpine musl-dev version in build stage from 1.2.5-r9 to 1.2.5-r11.

Sequence Diagram

sequenceDiagram
    participant PNgx as paperless-ngx
    participant Shim as GptParser (Python)
    participant HTTP as HTTP Client
    participant Server as Parser API (Go)
    participant OCR as OCR Provider

    PNgx->>Shim: parse(document_path, mime_type)
    Shim->>HTTP: GET /api/v1/capabilities
    HTTP->>Server: GET /api/v1/capabilities
    Server->>OCR: check provider status
    Server-->>HTTP: ParserCapabilities JSON
    HTTP-->>Shim: Capabilities

    Shim->>HTTP: POST /api/v1/parse (multipart: file + context)
    HTTP->>Server: POST /api/v1/parse
    alt PDF
        Server->>Server: render pages to JPEG (bounded DPI)
        Server->>OCR: ProcessImage(page_jpeg)
        OCR-->>Server: text per page
        Server->>Server: concat text, count pages
    else Image
        Server->>OCR: ProcessImage(file_bytes)
        OCR-->>Server: text
    end
    Server-->>HTTP: ParseResponse {text, page_count, archive?, thumbnail?}
    HTTP-->>Shim: ParseResponse
    Shim->>Shim: write archive/thumb to temp dir or generate placeholder
    Shim-->>PNgx: {text, archive_path, page_count, thumbnail_path}
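
The shim's half of this sequence can be sketched as follows. It is an illustration only: the actual ParserProtocol signatures from paperless-ngx are not reproduced in this PR, so the class shape is an assumption built around the names mentioned here (supported_mime_types, score, parse, get_text, get_page_count), and FakeClient is a hypothetical stand-in for the HTTP client so the example runs without a sidecar.

```python
from typing import Dict, Optional


class FakeClient:
    """Stand-in for the HTTP client; counts capability lookups."""

    def __init__(self) -> None:
        self.capability_calls = 0

    def capabilities(self) -> dict:
        self.capability_calls += 1
        return {"supported_mime_types": {"application/pdf": ".pdf"}, "default_score": 50}

    def parse(self, path: str, mime_type: str) -> dict:
        return {"text": f"text of {path}", "page_count": 3, "provider": "llm"}


class GptParserSketch:
    def __init__(self, client) -> None:
        self._client = client
        self._caps: Optional[dict] = None
        self._result: Optional[dict] = None

    def _capabilities(self) -> dict:
        # Fetched on first use rather than at import time, so the plugin can
        # load even if the sidecar starts after paperless-ngx.
        if self._caps is None:
            self._caps = self._client.capabilities()
        return self._caps

    def supported_mime_types(self) -> Dict[str, str]:
        return dict(self._capabilities()["supported_mime_types"])

    def score(self, mime_type: str) -> Optional[int]:
        # None means "this parser does not handle that MIME type".
        if mime_type not in self._capabilities()["supported_mime_types"]:
            return None
        return int(self._capabilities()["default_score"])

    def parse(self, document_path: str, mime_type: str) -> None:
        self._result = self._client.parse(document_path, mime_type)

    def get_text(self) -> Optional[str]:
        return self._result["text"] if self._result else None

    def get_page_count(self) -> Optional[int]:
        return self._result["page_count"] if self._result else None


client = FakeClient()
parser = GptParserSketch(client)
parser.score("application/pdf")
parser.supported_mime_types()
parser.parse("scan.pdf", "application/pdf")
print(parser.get_text(), parser.get_page_count(), client.capability_calls)
```

The capability counter stays at 1 across several lookups, which is the point of the lazy cache.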

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

safe-to-test

Poem

🐰 A nibble, a hop, the parser's begun,

Files sent and scanned beneath the spring sun,
Go routes hum, Python shim passes the cheer,
Text hops back home — tidy, crisp, and clear,
Hooray! A carrot for each parsed run.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 5.41% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: introducing an RFC and MVP for a generic LLM-OCR HTTP API with a paperless-ngx 3.0 parser plugin implementation.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

…ckstart

Replaces the 501-stub commit on this branch with a working prototype
that the community can actually try.

Go side (paperless-gpt):
- POST /api/v1/parse: accepts multipart (file + mime_type + optional
  filename/produce_*/provider/language_hint/context) and returns JSON
  with text + page_count for images and PDFs.
- GET /api/v1/capabilities: enumerates supported MIME types, providers,
  default score; consumed by the Python shim and any other client.
- GET /api/v1/healthz: liveness probe.
- Optional bearer-token auth via PAPERLESS_GPT_API_TOKEN.
- Tests: parser_api_test.go covers capabilities, healthz, image parse,
  unsupported MIME, missing file, bearer-token enforcement, invalid
  context JSON. All pass.

Python shim (paperless-gpt-parser/):
- Implements paperless-ngx 3.0 ParserProtocol (entrypoint group
  paperless_ngx.parsers).
- Lazy capabilities cache so paperless-ngx start order doesn't matter.
- Forwards to /api/v1/parse, exposes get_text/get_page_count today,
  archive_pdf/thumbnail/date once the sidecar implements them.

Docs:
- examples/parser-plugin/README.md: copy-pasteable curl snippets and
  use-case examples (paperless-ngx, n8n, coding agents, CLI).
- examples/parser-plugin/docker-compose.yml: paperless-ngx + sidecar
  proof of concept.
- parser_plugin_rfc.md: roadmap reflects MVP shipped, follow-ups
  spelled out (archive PDF, thumbnail, provider override).

Out of scope for this PR (intentionally — keeps it reviewable):
- Searchable archive PDF in the response (planned via
  gardar/ocrchestra/pkg/pdfocr; the polling flow already does this).
- Real WebP thumbnail (placeholder rendered by the shim for now).
- Per-request provider override and language_hint honoring.
- Unifying the new stateless path with the existing
  ProcessDocumentOCR polling flow.
- Moving paperless-gpt-parser/ to its own repo + PyPI.
@icereed icereed changed the title RFC: generic LLM-OCR HTTP API + paperless-ngx 3.0 parser plugin path RFC + MVP: generic LLM-OCR HTTP API + paperless-ngx 3.0 parser plugin path May 5, 2026
@icereed icereed marked this pull request as ready for review May 5, 2026 14:54
Alpine 3.21 main repository has rotated musl-dev from r9 to r11. The
older pinned version is no longer fetchable, breaking 'apk add' in CI:

  ERROR: unable to select packages:
    musl-dev-1.2.5-r11:
      breaks: world[musl-dev=1.2.5-r9]

Other pinned packages (gcc, mupdf, mupdf-dev, sed) are still current.
Verified against alpine:3.21 with 'apk policy'.

Pre-existing failure on main (renovate has an open PR for the same fix);
applying it here so the parser-plugin PR pipeline can go green.
coderabbitai[bot]

This comment was marked as off-topic.

Repository owner deleted a comment from coderabbitai Bot May 6, 2026