Machine-readable renders of printer/device specification PDFs, intended as a data source for AI tools (full-text search and an MCP server).
The pipeline is being rebuilt from scratch; DESIGN.md is the source of truth for the architecture and decisions.
Each source PDF lives under `pdf/<vendor>/<doc>.pdf`. The pipeline turns it into
aligned text and image artifacts under `data-extraction/`, keyed by the
vendor-rooted stem `<vendor>/<doc>` (the source path with the `pdf/` prefix and `.pdf` extension stripped):
| Path | Contents |
|---|---|
| `data-extraction/markdown/<vendor>/<doc>/document.md` | Full-document Markdown (one file) |
| `data-extraction/markdown/<vendor>/<doc>/page-NN.md` | Per-page Markdown slices |
| `data-extraction/jpeg/<vendor>/<doc>/small/page-NN.jpg` | ~1024px page previews |
| `data-extraction/jpeg/<vendor>/<doc>/big/page-NN.jpg` | Full-resolution page renders |
| `data-extraction/pagemap/<vendor>/<doc>.json` | Authoritative page<->artifact map |
| `data-extraction/quality/<vendor>/<doc>.json` | Per-page extraction QA metrics |
| `data-extraction/index/<type>/` | Pre-generated search index(es) |
`page-NN` is zero-padded to the width of the document's page count, so
`markdown/.../page-07.md` corresponds exactly to `jpeg/.../small/page-07.jpg`.
That 1:1 text-to-image mapping is what lets the search/MCP layer cite a page and
show its render. The mapping is recorded in `pagemap/<vendor>/<doc>.json`
(authoritative), so downstream code never relies on filename math.
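The naming convention above can be illustrated with a minimal Python sketch. The helper names (`stem_for`, `page_name`, `artifacts`) and the `acme/manual` document are hypothetical, and the sketch applies the padding rule literally (whether the pipeline enforces a minimum width is not stated here). The pagemap remains the authoritative source; this only shows what the names look like.

```python
from pathlib import PurePosixPath


def stem_for(pdf_path: str) -> str:
    """Map pdf/<vendor>/<doc>.pdf to the vendor-rooted stem <vendor>/<doc>."""
    p = PurePosixPath(pdf_path)
    if len(p.parts) < 3 or p.parts[0] != "pdf" or p.suffix != ".pdf":
        raise ValueError(f"not a source PDF path: {pdf_path}")
    # Drop the leading pdf/ component and the .pdf extension.
    return str(PurePosixPath(*p.parts[1:]).with_suffix(""))


def page_name(page: int, page_count: int) -> str:
    """Zero-pad the page number to the width of the document's page count."""
    width = len(str(page_count))
    return f"page-{page:0{width}d}"


def artifacts(stem: str, page: int, page_count: int) -> dict[str, str]:
    """Per-page artifact paths that share one page name (illustration only)."""
    name = page_name(page, page_count)
    root = "data-extraction"
    return {
        "markdown": f"{root}/markdown/{stem}/{name}.md",
        "small": f"{root}/jpeg/{stem}/small/{name}.jpg",
        "big": f"{root}/jpeg/{stem}/big/{name}.jpg",
    }


if __name__ == "__main__":
    stem = stem_for("pdf/acme/manual.pdf")  # hypothetical document
    for kind, path in artifacts(stem, 7, 42).items():
        print(kind, path)
```

Consumers such as the MCP server would still read `pagemap/<vendor>/<doc>.json` rather than recompute these names.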
- `pdf/`: source PDFs (`pdf/<vendor>/<doc>.pdf`)
- `data-extraction/`: static outputs only (binaries via Git LFS)
- `data-extraction-docker/`: build code (`extraction/` and `indexing/` Docker images)
- `mcp-server/`: Python 3.13 MCP server (serves the corpus + search)

Code and static outputs are kept strictly separate: build code never lives in
`data-extraction/`, and outputs never live in `data-extraction-docker/`.
- PDF pushed under `pdf/`
- Extraction (discrete phases, see below)
- Indexing: full-text search index built from the Markdown
- MCP server image built (`:<ver>` lean and `:<ver>-stuffed`) and deployed
Indexing and the image build are downstream of extraction and must wait for it.
This is wired in `.github/workflows/`: `extract.yml` (manual, heavy) ->
`index.yml` (auto after extract, with an eval gate) -> `build-mcp-server.yml`
(auto after index; builds the lean + stuffed images, pushes to GHCR, optional
Render deploy). Stages are chained with `workflow_run` and all check out Git LFS.
See `DESIGN.md`.
Extraction is split into single-purpose phases rather than one opaque
pass, so each can be re-run, swapped, or extended independently, and intermediate
data is kept for troubleshooting (e.g. analysing where quality dropped). All
phases share one image (`data-extraction-docker/extraction/`).
| Phase | Input | Output |
|---|---|---|
| `render` | | `data-extraction/jpeg/<stem>/{small,big}/page-NN.jpg` |
| `text` | | `data-extraction/text/<stem>/page-NN.txt` |
| `markdown` | | `data-extraction/markdown/<stem>/page-NN.md` |
| `quality` | markdown+text | `data-extraction/quality/<stem>.json` |
| `assemble` | all the above | `data-extraction/pagemap/<stem>.json` + `document.md` + reports |
Principles:
- All-or-none per (phase, document): a phase regenerates its whole artifact
  kind for a document; there is no per-page resume state. Restart a phase to redo
  it, or insert a new phase (e.g. a VLM `describe` step) between existing ones.
- Data + meta: every phase writes `data-extraction/meta/<stem>/<phase>.json`
  with tool, version, params, start/end, total + per-page timing, and status.
  `assemble` folds a summary into the pagemap's `phases` field, so timing and
  quality lineage live with the results.
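A minimal Python sketch of how the two principles could combine in a phase runner. Everything here is an assumption made for illustration: the `run_phase` function, its parameters, and the JSON field names (`page_seconds`, `total_seconds`, and so on) are hypothetical, since the actual meta schema is defined by the pipeline and `DESIGN.md`.

```python
import json
import shutil
import time
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable


def run_phase(root: Path, stem: str, phase: str, out_dir: Path, pages: int,
              process_page: Callable[[int, Path], None],
              tool: str, version: str, params: dict) -> dict:
    """Regenerate one phase's artifacts for one document and write its meta file."""
    # All-or-none: discard whatever a previous run left behind, no resume state.
    if out_dir.exists():
        shutil.rmtree(out_dir)
    out_dir.mkdir(parents=True)

    started = datetime.now(timezone.utc)
    t0 = time.perf_counter()
    page_seconds: list[float] = []
    status, error = "ok", None
    try:
        for page in range(1, pages + 1):
            p0 = time.perf_counter()
            process_page(page, out_dir)
            page_seconds.append(time.perf_counter() - p0)
    except Exception as exc:  # a sketch: record the failure, keep no partial output
        status, error = "failed", repr(exc)
        shutil.rmtree(out_dir, ignore_errors=True)

    meta = {  # hypothetical field names
        "phase": phase,
        "tool": tool,
        "version": version,
        "params": params,
        "start": started.isoformat(),
        "end": datetime.now(timezone.utc).isoformat(),
        "total_seconds": time.perf_counter() - t0,
        "page_seconds": page_seconds,
        "status": status,
        "error": error,
    }
    meta_path = root / "meta" / stem / f"{phase}.json"
    meta_path.parent.mkdir(parents=True, exist_ok=True)
    meta_path.write_text(json.dumps(meta, indent=2))
    return meta
```

Because the meta file is written even when the phase fails, a later step can report a failed phase without having to infer it from missing artifacts.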
`data-extraction/jpeg/**/*.jpg` and `data-extraction/index/**/*.db` are tracked
via Git LFS (see `.gitattributes`); Markdown and JSON stay as normal git text.
Run `git lfs install` once before committing binary artifacts.
- Text extraction: Docling (layout-aware Markdown) with OCR fallback and a gated VLM tier for pages that fail quality checks. Quality is verified per page and flagged pages get a side-by-side review report.
- Search index: SQLite FTS5 with a tokenizer tuned to keep symbol-heavy
  command tokens findable (`ESC @`, `GS ( L`, hex `1B 40`). Full-text/keyword
  now; the `index/<type>/` layout leaves room to add and compare a vector index later.
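
To illustrate why the tokenizer matters, here is a small Python sketch using the standard `sqlite3` module (it assumes the local SQLite build includes FTS5). The `pages` table, the sample rows, and the specific `tokenchars` setting are assumptions for the demo, not the project's actual index schema or tokenizer configuration.

```python
import sqlite3


def build_index(tokenize: str) -> sqlite3.Connection:
    """Create an in-memory FTS5 table with the given tokenizer and sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        f'CREATE VIRTUAL TABLE pages USING fts5(body, tokenize = "{tokenize}")'
    )
    conn.executemany(
        "INSERT INTO pages(rowid, body) VALUES (?, ?)",
        [
            (1, "ESC @ initializes the printer"),  # illustrative sample text
            (2, "ESC d feeds n lines"),
        ],
    )
    return conn


def search(conn: sqlite3.Connection, phrase: str) -> list[int]:
    """Run a phrase query; the phrase is quoted so symbols are not parsed as syntax."""
    quoted = '"' + phrase.replace('"', '""') + '"'
    rows = conn.execute(
        "SELECT rowid FROM pages WHERE pages MATCH ? ORDER BY rowid", (quoted,)
    )
    return [row[0] for row in rows]


# Default unicode61 treats '@' as a separator, so the phrase degrades to just "esc".
plain = build_index("unicode61")
# Adding '@' and '(' as token characters keeps them searchable.
tuned = build_index("unicode61 tokenchars '@('")

if __name__ == "__main__":
    print("default:", search(plain, "ESC @"))
    print("tuned:  ", search(tuned, "ESC @"))
```

With the default tokenizer the symbol is dropped at both index and query time, so a search for `ESC @` matches every page that mentions `ESC`; with the extra token characters it narrows to pages containing that exact command.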