OCR: skip failed pages instead of aborting entire document #932

@lurkus

I searched existing issues and didn't find this specific request.

Problem

When a single page fails OCR, the entire document is aborted and the auto-OCR tag is never removed. This causes an infinite retry loop: the document is reprocessed on every polling cycle and always fails on the same page.

In my case, a 10-page document fails every time on page 4, which is covered in very dense small text. Pages 1-3 OCR just fine, but the LLM chokes on page 4 and the whole document is abandoned. Because the auto-OCR tag stays on the document, it is picked up and retried on every cycle.

Root cause

In ocr.go, the per-page processing loops return immediately on any page error:

// ocr.go:222-224 (pdf mode) and ocr.go:292-294 (image mode)
result, err := app.ocrProvider.ProcessImage(ctx, pdfContent, i+1)
if err != nil {
    return nil, fmt.Errorf("error performing OCR for document %d, page %d: %w", documentID, i+1, err)
}

In background.go:232-236, when ProcessDocumentOCR returns an error, the auto-OCR tag is not removed (removal only happens on the success path at line 247). So the document stays in the processing queue indefinitely.

Proposed solution

  1. ocr.go: On per-page OCR failure, log the error and continue to the next page instead of returning. Track which pages were skipped (e.g., add SkippedPages []int to ProcessedDocument).

  2. background.go: On partial success (some pages skipped), still:

    • Update document content with the pages that succeeded
    • Remove the auto-OCR tag to stop the retry loop
    • Add a configurable warning tag (e.g., OCR_PARTIAL_ERROR_TAG env var, defaulting to something like paperless-gpt-ocr-incomplete) so the user knows pages were skipped and can review

This way, documents with problematic pages still get useful OCR text, don't get stuck in an infinite retry loop, and are flagged for user review.

Environment

  • paperless-gpt: latest (self-hosted Docker)
  • OCR provider: local GGML vision model via Ollama
  • Process mode: image
