RFC + MVP: generic LLM-OCR HTTP API + paperless-ngx 3.0 parser plugin path #964
icereed wants to merge 3 commits
Adds a draft RFC and stub HTTP endpoints (currently 501 Not Implemented) for a consumer-agnostic /api/v1/parse + /capabilities + /healthz surface.

The headline use case is the paperless-ngx 3.0 parser plugin framework (paperless-ngx/paperless-ngx#12294, discussion #12023): a thin Python shim 'paperless-gpt-parser' would implement ParserProtocol and forward to this endpoint. Because the API is intentionally generic, the same endpoint also serves n8n / Make / Zapier workflows, local coding agents that struggle with PDFs, CLI tools, and RAG ingestion pipelines.

This commit only adds:
- docs/parser_plugin_rfc.md (full design + open questions)
- parser_api.go (stubs returning 501)
- main.go (router registration)

No production behaviour changes. Opening as a PR to invite discussion before implementation lands.
Walkthrough
Adds a versioned Parser API (HTTP /api/v1) and a reference Python parser shim plus docs, tests, and Docker Compose example; implements Go handlers (capabilities, healthz, parse), client/shim code, and registers routes in main.go.
Changes: Parser Plugin API + Shim
Sequence Diagram
sequenceDiagram
participant PNgx as paperless-ngx
participant Shim as GptParser (Python)
participant HTTP as HTTP Client
participant Server as Parser API (Go)
participant OCR as OCR Provider
PNgx->>Shim: parse(document_path, mime_type)
Shim->>HTTP: GET /api/v1/capabilities
HTTP->>Server: GET /api/v1/capabilities
Server->>OCR: check provider status
Server-->>HTTP: ParserCapabilities JSON
HTTP-->>Shim: Capabilities
Shim->>HTTP: POST /api/v1/parse (multipart: file + context)
HTTP->>Server: POST /api/v1/parse
alt PDF
Server->>Server: render pages to JPEG (bounded DPI)
Server->>OCR: ProcessImage(page_jpeg)
OCR-->>Server: text per page
Server->>Server: concat text, count pages
else Image
Server->>OCR: ProcessImage(file_bytes)
OCR-->>Server: text
end
Server-->>HTTP: ParseResponse {text, page_count, archive?, thumbnail?}
HTTP-->>Shim: ParseResponse
Shim->>Shim: write archive/thumb to temp dir or generate placeholder
Shim-->>PNgx: {text, archive_path, page_count, thumbnail_path}
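The alt/else branch in the diagram can be sketched in a few lines. This is only an illustration in Python, not the Go handler from this PR: parse_document, render_pages and ocr_image are hypothetical names, the blank-line page separator is an assumption, and so is reporting page_count 1 for a single image.

```python
from typing import Callable, Iterable


def parse_document(
    data: bytes,
    mime_type: str,
    render_pages: Callable[[bytes], Iterable[bytes]],
    ocr_image: Callable[[bytes], str],
) -> dict:
    """Dispatch on MIME type the way the diagram's alt/else branch does.

    render_pages and ocr_image are injected stand-ins (hypothetical) for the
    PDF-to-JPEG renderer and the OCR provider's ProcessImage call.
    """
    if mime_type == "application/pdf":
        # PDF branch: OCR every rendered page, then concatenate and count.
        texts = [ocr_image(page) for page in render_pages(data)]
        # Assumption: pages are joined with a blank line.
        return {"text": "\n\n".join(texts), "page_count": len(texts)}
    if mime_type.startswith("image/"):
        # Image branch: the raw bytes go straight to the OCR provider.
        # Assumption: a single image counts as one page.
        return {"text": ocr_image(data), "page_count": 1}
    raise ValueError(f"unsupported MIME type: {mime_type}")
```

Injecting the renderer and the OCR call keeps the control flow testable without a real provider.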
…ckstart

Replaces the 501-stub commit on this branch with a working prototype that the community can actually try.

Go side (paperless-gpt):
- POST /api/v1/parse: accepts multipart (file + mime_type + optional filename/produce_*/provider/language_hint/context) and returns JSON with text + page_count for images and PDFs.
- GET /api/v1/capabilities: enumerates supported MIME types, providers, default score; consumed by the Python shim and any other client.
- GET /api/v1/healthz: liveness probe.
- Optional bearer-token auth via PAPERLESS_GPT_API_TOKEN.
- Tests: parser_api_test.go covers capabilities, healthz, image parse, unsupported MIME, missing file, bearer-token enforcement, invalid context JSON. All pass.

Python shim (paperless-gpt-parser/):
- Implements paperless-ngx 3.0 ParserProtocol (entrypoint group paperless_ngx.parsers).
- Lazy capabilities cache so paperless-ngx start order doesn't matter.
- Forwards to /api/v1/parse, exposes get_text/get_page_count today, archive_pdf/thumbnail/date once the sidecar implements them.

Docs:
- examples/parser-plugin/README.md: copy-pasteable curl snippets and use-case examples (paperless-ngx, n8n, coding agents, CLI).
- examples/parser-plugin/docker-compose.yml: paperless-ngx + sidecar proof of concept.
- parser_plugin_rfc.md: roadmap reflects MVP shipped, follow-ups spelled out (archive PDF, thumbnail, provider override).

Out of scope for this PR (intentionally, to keep it reviewable):
- Searchable archive PDF in the response (planned via gardar/ocrchestra/pkg/pdfocr; the polling flow already does this).
- Real WebP thumbnail (placeholder rendered by the shim for now).
- Per-request provider override and language_hint honoring.
- Unifying the new stateless path with the existing ProcessDocumentOCR polling flow.
- Moving paperless-gpt-parser/ to its own repo + PyPI.
Alpine 3.21 main repository has rotated musl-dev from r9 to r11. The
older pinned version is no longer fetchable, breaking 'apk add' in CI:
ERROR: unable to select packages:
musl-dev-1.2.5-r11:
breaks: world[musl-dev=1.2.5-r9]
Other pinned packages (gcc, mupdf, mupdf-dev, sed) are still current.
Verified against alpine:3.21 with 'apk policy'.
Pre-existing failure on main (renovate has an open PR for the same fix);
applying it here so the parser-plugin PR pipeline can go green.
RFC + MVP: a generic LLM-OCR HTTP API for paperless-gpt
TL;DR
Add a small, stable, consumer-agnostic HTTP API on paperless-gpt:

- POST /api/v1/parse
- GET /api/v1/capabilities
- GET /api/v1/healthz

The headline use case is the paperless-ngx 3.0 parser plugin framework (discussion #12023). Because paperless-gpt is Go and that framework loads in-process Python entrypoints, this PR also ships a thin Python shim (paperless-gpt-parser/) that implements ParserProtocol and forwards documents to the sidecar over HTTP.

But the same endpoint is equally useful from:

- n8n / Make / Zapier workflows
- local coding agents that struggle with PDFs
- CLI tools (pgpt parse foo.pdf)
- RAG ingestion pipelines

By designing it generic up front we get the paperless-ngx integration and a much larger TAM at no extra design cost.
What this PR ships
Go side (this repo)
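- parser_api.go: handlers for /parse, /capabilities and /healthz
- parser_api_test.go: covers capabilities, healthz, image parse, unsupported MIME, missing file, bearer-token enforcement, invalid context JSON; all pass
- main.go: app.registerParserAPI(router) wired into the existing router

Python shim skeleton
- paperless-gpt-parser/pyproject.toml: entry point group paperless_ngx.parsers, with paperless_gpt = paperless_gpt_parser.parser:GptParser
- paperless-gpt-parser/src/paperless_gpt_parser/parser.py: GptParser implementing ParserProtocol
- paperless-gpt-parser/src/paperless_gpt_parser/client.py: HTTP client for the sidecar API
- paperless-gpt-parser/src/paperless_gpt_parser/config.py: PAPERLESS_GPT_URL, token, score, timeout

Docs
- docs/parser_plugin_rfc.md: full design, roadmap and open questions
- docs/examples/parser-plugin/README.md: curl quickstart for every persona
- docs/examples/parser-plugin/docker-compose.yml: paperless-ngx + sidecar proof of concept

Try it (60 seconds)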
Build and run the sidecar:
Discover capabilities:
    curl -s http://localhost:8080/api/v1/capabilities | jq

    {
      "name": "paperless-gpt",
      "version": "devVersion",
      "supported_mime_types": {
        "application/pdf": ".pdf",
        "image/png": ".png",
        "image/jpeg": ".jpg",
        "image/tiff": ".tiff",
        "image/webp": ".webp"
      },
      "providers": [{"id": "llm", "display_name": "llm", "can_produce_archive": false}],
      "default_provider": "llm",
      "default_score": 50,
      "notes": ["MVP: text extraction only…"]
    }

Parse a PDF:
    curl -s -X POST http://localhost:8080/api/v1/parse \
      -F file=@./scan.pdf \
      -F mime_type=application/pdf | jq

    {
      "text": "Invoice no. 12345 …",
      "page_count": 3,
      "provider": "llm"
    }

Use from any agent, n8n node, or shell:
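For callers that prefer Python over curl, the same multipart request can be sketched with the standard library alone. This is a rough illustration, not the shim's client.py: build_parse_request and parse_file are hypothetical helpers, and sending the token as an Authorization: Bearer header is an assumption about how the bearer-token auth is read.

```python
import json
import os
import urllib.request
import uuid


def build_parse_request(base_url, file_bytes, filename, mime_type, token=None):
    """Build a multipart/form-data POST for /api/v1/parse (hypothetical helper)."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        # Plain form field: mime_type
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="mime_type"\r\n\r\n'
        f"{mime_type}\r\n".encode(),
        # File part: the document itself
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: {mime_type}\r\n\r\n".encode() + file_bytes + b"\r\n",
        # Closing boundary
        f"--{boundary}--\r\n".encode(),
    ])
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    if token:
        # Assumption: the optional token travels as a standard Bearer header.
        headers["Authorization"] = f"Bearer {token}"
    url = f"{base_url.rstrip('/')}/api/v1/parse"
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def parse_file(base_url, path, mime_type, token=None, timeout=120):
    """Send a local file to the sidecar and return the decoded JSON response."""
    with open(path, "rb") as fh:
        req = build_parse_request(
            base_url, fh.read(), os.path.basename(path), mime_type, token
        )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

With the sidecar running, parse_file("http://localhost:8080", "scan.pdf", "application/pdf") would return the same JSON object that the curl call prints.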
Full quickstart with more snippets and the auth flow lives in docs/examples/parser-plugin/README.md.

Why a shim and not a native plugin?
paperless-gpt is written in Go, while the new framework loads in-process Python plugins. Bundling a Go shared library in a Python wheel across architectures is a packaging nightmare. HTTP keeps the deployment story identical to today (one container) and lets every tool, not just paperless-ngx, call the same endpoint.
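One consequence of going over HTTP is that the shim may be loaded before the sidecar is reachable, which is why the commit message mentions a lazy capabilities cache. A minimal sketch of that idea follows; LazyCapabilities and its fetch callable are hypothetical and not taken from the shim's actual code.

```python
import threading
from typing import Callable, Optional


class LazyCapabilities:
    """Fetch /api/v1/capabilities on first use and cache the result.

    fetch is an injected callable (hypothetical) that performs the HTTP GET
    and returns the decoded JSON. A failed fetch is not cached, so the next
    call simply tries again once the sidecar is up.
    """

    def __init__(self, fetch: Callable[[], dict]):
        self._fetch = fetch
        self._lock = threading.Lock()
        self._cached: Optional[dict] = None

    def get(self) -> Optional[dict]:
        with self._lock:
            if self._cached is None:
                try:
                    self._cached = self._fetch()
                except OSError:
                    # Connection errors (urllib's URLError included) subclass
                    # OSError; treat them as "sidecar not ready yet".
                    return None
            return self._cached

    def supports(self, mime_type: str) -> bool:
        caps = self.get()
        return bool(caps) and mime_type in caps.get("supported_mime_types", {})
```

Because nothing is fetched at import time, the start order of paperless-ngx and the sidecar stops mattering; the first document simply triggers the first capabilities request.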
Migration story (no breakage)
pip install paperless-gpt-parser and run paperless-gpt as a sidecar

We never break existing users.
Out of scope for this PR (intentionally — keeps it reviewable)
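- Searchable archive PDF in the /parse response (planned via gardar/ocrchestra/pkg/pdfocr; the polling flow already does this for the local-file output)
- Real WebP thumbnail (placeholder rendered by the shim for now)
- Per-request provider override and language_hint honoring
- Unifying the new stateless path with the existing ProcessDocumentOCR polling flow
- Moving paperless-gpt-parser/ to its own repo + PyPI publish

These are tracked in the roadmap section of the RFC and will be follow-up PRs once the shape is confirmed.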
What I'd love feedback on
The RFC ends with Open questions for discussion. Specifically:
- Separate repo for paperless-gpt-parser (cleaner for pip install) vs. monorepo (easier to keep in sync)?
- Is /api/v1/parse good? Or /api/v1/extract, /api/v1/ocr, /api/v1/documents:parse?

Checklist
- Endpoints (/parse, /capabilities, /healthz)
- Tests (parser_api_test.go, all pass)

Try it, break it, comment
Pull the branch, run the curl commands, tell me what's wrong with the design. The point of this PR is the discussion — happy to iterate substantially on the API shape before anything is locked in.
cc @icereed — and anyone in the community with opinions on the paperless-ngx integration story or the broader "PDF-as-a-service" use case 🙏