Content Sanitization for LLM API Calls by BieggerM · Pull Request #917 · icereed/paperless-gpt

BieggerM · 2026-03-04T22:48:01Z

Content Sanitization

Removes sensitive data from document content before sending to LLM APIs.

Configuration

# Remove literal strings (comma-separated)
REMOVE_FROM_CONTENT=CONFIDENTIAL,John Doe,SECRET

# Remove regex patterns (semicolon-separated)
REMOVE_FROM_CONTENT_REGEX=DE\d{20};[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,};\b\d{4}-\d{4}-\d{4}-\d{4}\b

Examples:

Names: REMOVE_FROM_CONTENT=John Doe,Jane Smith
IBANs: REMOVE_FROM_CONTENT_REGEX=DE\d{20}
Emails: REMOVE_FROM_CONTENT_REGEX=[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
Credit cards: REMOVE_FROM_CONTENT_REGEX=\b\d{4}-\d{4}-\d{4}-\d{4}\b
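
For reviewers who want to see what the two variables do to a string, here is a minimal sketch. It is not the code from `sanitize/sanitize.go`: the helper names `splitNonEmpty` and `sanitizeWith` are hypothetical, and trimming whitespace around entries is an assumption. Regex entries are split on semicolons, which keeps quantifiers such as `{2,}` intact even though they contain a comma.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// splitNonEmpty splits s on sep, trims whitespace, and drops empty entries.
// (Hypothetical helper; trimming is an assumption, not confirmed by the PR.)
func splitNonEmpty(s, sep string) []string {
	var out []string
	for _, part := range strings.Split(s, sep) {
		if p := strings.TrimSpace(part); p != "" {
			out = append(out, p)
		}
	}
	return out
}

// sanitizeWith removes every literal (comma-separated) and every regex
// match (semicolon-separated) from content.
func sanitizeWith(content, literals, regexes string) (string, error) {
	for _, lit := range splitNonEmpty(literals, ",") {
		content = strings.ReplaceAll(content, lit, "")
	}
	for _, pat := range splitNonEmpty(regexes, ";") {
		re, err := regexp.Compile(pat)
		if err != nil {
			return "", fmt.Errorf("invalid pattern %q: %w", pat, err)
		}
		content = re.ReplaceAllString(content, "")
	}
	return content, nil
}

func main() {
	out, err := sanitizeWith(
		"Invoice for John Doe, IBAN DE12345678901234567890, mail j.doe@example.com",
		"John Doe",
		`DE\d{20};[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}`,
	)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", out) // prints "Invoice for , IBAN , mail "
}
```

Matches are removed rather than replaced with a placeholder, so surrounding punctuation and whitespace stay behind.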

Coverage

✅ Document suggestions (title, tags, correspondent, document type, created date)
✅ Custom field suggestions
✅ Ad-hoc analysis
✅ OCR prompts

Implementation

Patterns compiled once at startup (main.go)
Thread-safe (sync.Once)
Zero overhead if not configured
Package: sanitize/ with 11 test cases

Docker Compose

services:
  app:
    image: paperless-gpt
    environment:
      - REMOVE_FROM_CONTENT=CONFIDENTIAL,SECRET
      - REMOVE_FROM_CONTENT_REGEX=DE\d{20};[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}

Troubleshooting

# Verify env vars
docker exec <container> env | grep REMOVE

# Check logs
docker logs <container> 2>&1 | grep -i sanitiz

An invalid regex pattern causes startup to fail with an error message.
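
To illustrate the startup failure, the sketch below compiles a semicolon-separated list and stops at the first pattern Go's `regexp` package rejects. The function name `compilePatterns` and the error wording are assumptions, not taken from the PR. Note that Go's `regexp` uses RE2 syntax, so constructs such as lookaheads and backreferences are rejected even though they work in other regex engines.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// compilePatterns compiles each semicolon-separated pattern and returns an
// error naming the first one that fails to compile. (Hypothetical helper.)
func compilePatterns(raw string) ([]*regexp.Regexp, error) {
	var compiled []*regexp.Regexp
	for _, pat := range strings.Split(raw, ";") {
		if pat == "" {
			continue // ignore empty entries such as a trailing semicolon
		}
		re, err := regexp.Compile(pat)
		if err != nil {
			return nil, fmt.Errorf("REMOVE_FROM_CONTENT_REGEX: invalid pattern %q: %w", pat, err)
		}
		compiled = append(compiled, re)
	}
	return compiled, nil
}

func main() {
	// `[a-z` is missing its closing bracket, so compilation fails.
	if _, err := compilePatterns(`DE\d{20};[a-z`); err != nil {
		fmt.Println("startup would fail:", err)
	}
}
```

Testing a pattern this way before putting it into the environment can save a container restart loop.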

Testing

go test ./sanitize/...

Summary by CodeRabbit

New Features
- Content sanitization removes configured sensitive strings and regex patterns from documents and prompts before AI processing; sanitization is initialized at startup and applied across analysis and suggestion workflows.
Documentation
- Added docs describing configuration via environment variables and usage.
Tests
- Comprehensive tests added to validate literal and regex removal, parsing, and initialization behavior.

Introduce sanitize package to strip configured patterns from content before sending to LLMs. Supports literal string removal via REMOVE_FROM_CONTENT env var and regex pattern removal via REMOVE_FROM_CONTENT_REGEX env var. Apply sanitization to document content processing and OCR prompts to prevent sensitive information from being sent to external LLM APIs.

Content Sanitization Feature

coderabbitai · 2026-03-04T22:48:33Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ff79db12-c00e-4924-8268-a9435eea1b12

📥 Commits

Reviewing files that changed from the base of the PR and between fccd7e9 and c688344.

📒 Files selected for processing (1)

sanitize/sanitize.go

🚧 Files skipped from review as they are similar to previous changes (1)

sanitize/sanitize.go

📝 Walkthrough

Walkthrough

This PR adds a new sanitize package (Init, Sanitize) with env-configured literal and regex removals, wires Sanitize into main initialization and multiple code paths (HTTP handlers, LLM functions, OCR provider) to clean content/prompts before use, and updates the Dockerfile to include the sanitize directory in the build context.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| New Sanitize Package: `sanitize/doc.go`, `sanitize/sanitize.go`, `sanitize/sanitize_test.go` | Adds a sanitize package with `Init()` and `Sanitize(string)`, env-driven literal and regex removals, one-time init with error reporting, parsing helpers, and comprehensive tests. |
| Application Integration: `main.go`, `app_http_handlers.go`, `app_llm.go`, `ocr/llm_provider.go` | Imports `sanitize`; calls `sanitize.Init()` at startup; applies `sanitize.Sanitize(...)` to document content and LLM prompts before truncation/templating and before sending to LLMs. |
| Build Configuration: `Dockerfile` | Adds `COPY sanitize ./sanitize` to the builder stage so the sanitize package is included in the image build context. |
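
The integration summary says sanitization runs before truncation. One plausible reason for that order, sketched below under assumptions, is that truncating first can cut a sensitive value in half so the pattern no longer matches the remaining fragment. The helper names and the byte-based `truncate` are hypothetical and not taken from the PR.

```go
package main

import (
	"fmt"
	"regexp"
)

// iban mirrors the DE\d{20} example pattern from the PR description.
var iban = regexp.MustCompile(`DE\d{20}`)

// truncate keeps at most n bytes of s (byte-based, for simplicity).
func truncate(s string, n int) string {
	if len(s) <= n {
		return s
	}
	return s[:n]
}

// sanitizeThenTruncate removes matches first, then shortens the result.
func sanitizeThenTruncate(s string, n int) string {
	return truncate(iban.ReplaceAllString(s, ""), n)
}

// truncateThenSanitize shortens first; a partial IBAN can survive.
func truncateThenSanitize(s string, n int) string {
	return iban.ReplaceAllString(truncate(s, n), "")
}

func main() {
	content := "IBAN DE12345678901234567890 due"
	fmt.Printf("%q\n", sanitizeThenTruncate(content, 20)) // prints "IBAN  due"
	fmt.Printf("%q\n", truncateThenSanitize(content, 20)) // prints "IBAN DE1234567890123"
}
```

In the second call most of the IBAN still reaches the output, which is the leak the sanitize-first order avoids.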

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client
    participant HTTP as HTTP Handler
    participant Sanitizer as Sanitizer (package)
    participant LLM as LLM Provider
    participant OCR as OCR Provider

    Client->>HTTP: submit document / request
    HTTP->>Sanitizer: Sanitize(doc.Content)
    Sanitizer-->>HTTP: sanitizedContent
    HTTP->>LLM: send sanitizedContent in prompt
    LLM->>Sanitizer: Sanitize(promptText)
    Sanitizer-->>LLM: sanitizedPrompt
    LLM->>HTTP: response
    HTTP-->>Client: deliver result

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

Feature Request: Content Sanitization for LLM API Calls #916: Adds the requested content-sanitization feature and integrates it into document and LLM workflows, matching the issue objective.

Poem

🐰 I hop through text with whiskered care,

I nibble secrets hidden there,
I scrub the prompts and trim each line,
So LLMs see only what's benign,
A tidy world — from rabbit, fine. 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Content Sanitization for LLM API Calls' clearly and concisely captures the main change: adding content sanitization before LLM API calls, which is the primary feature across all modified files. |


coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@sanitize/sanitize_test.go`:
- Around line 87-95: The subtests in sanitize_test.go can inherit process-wide
env state; ensure each subtest explicitly isolates REMOVE_FROM_CONTENT and
REMOVE_FROM_CONTENT_REGEX by first unsetting both vars at the start of the
subtest, then setting each with tt.literals and tt.regexes only if non-empty,
and always deferring an Unsetenv for both variables (i.e., call
os.Unsetenv("REMOVE_FROM_CONTENT") and os.Unsetenv("REMOVE_FROM_CONTENT_REGEX")
at the start, then conditionally os.Setenv for tt.literals and tt.regexes, and
defer Unsetenv for each) so the test case values (tt.literals/tt.regexes) do not
leak between tests.

In `@sanitize/sanitize.go`:
- Around line 11-43: Init() currently uses a function-local initErr so if the
first call to initOnce.Do sets an error, subsequent Init() calls return nil
because the local variable isn't preserved; change initErr to a package-level
error variable (e.g., var initErr error) and assign to that variable inside the
initOnce.Do closure, then have Init() return the package-level initErr so any
initialization failure is preserved across calls to Init(); keep existing
symbols literalPatterns, regexPatterns and initOnce unchanged.
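
The pattern described in this finding can be sketched as follows. `setup` and `initRuns` are hypothetical stand-ins added for the demo; only `Init`, `initOnce` and `initErr` correspond to names mentioned in the review.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var (
	initOnce sync.Once
	initErr  error // package-level, so it outlives the first Init call
	initRuns int   // counts how often setup ran, for the demo only
)

// setup is a hypothetical stand-in for parsing and compiling the patterns.
func setup() error {
	initRuns++
	return errors.New("invalid regex pattern")
}

// Init runs setup once; every call returns the error from that single run.
func Init() error {
	initOnce.Do(func() {
		initErr = setup()
	})
	return initErr
}

func main() {
	first := Init()
	second := Init()
	fmt.Println(first != nil, second != nil, initRuns) // prints "true true 1"
}
```

With a function-local error variable the second call would skip the closure and return `nil`, which is the bug the finding describes.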

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1336c72f-7475-4a9f-a780-365c82b3f59a

📥 Commits

Reviewing files that changed from the base of the PR and between d445bfa and 1dae1cd.

📒 Files selected for processing (8)

Dockerfile
app_http_handlers.go
app_llm.go
main.go
ocr/llm_provider.go
sanitize/doc.go
sanitize/sanitize.go
sanitize/sanitize_test.go

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@sanitize/sanitize.go`:
- Around line 72-90: Sanitize currently relies on global literalPatterns and
regexPatterns being initialized by Init(), which can lead to a silent fail-open
if Init() was not called; modify the sanitization module to perform a defensive
initialization by adding a package-level sync.Once (e.g., var initOnce
sync.Once) and invoking initOnce.Do(Init) at the start of Sanitize (or create a
private ensureInitialized helper that calls initOnce.Do(Init)), so Init() runs
at least once before using literalPatterns/regexPatterns and Sanitize fails
closed even if callers forget to call Init().
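
One way to read this finding is sketched below. Two details are worth noting: `sync.Once.Do` takes a `func()` with no return value, so an `Init` that returns an error cannot be passed to it directly, and calling `Do` on the same `Once` from inside its own function deadlocks. The sketch therefore routes both `Init` and `Sanitize` through a shared helper. `rawPatterns` is a hypothetical stand-in for the parsed environment value, and returning an empty string on failure is an assumed interpretation of "fail closed", not confirmed by the PR.

```go
package main

import (
	"fmt"
	"regexp"
	"sync"
)

var (
	initOnce sync.Once
	initErr  error
	patterns []*regexp.Regexp
)

// rawPatterns is a hypothetical stand-in for the parsed environment value.
var rawPatterns = []string{`DE\d{20}`}

// ensureInit compiles the patterns exactly once and records any failure.
func ensureInit() {
	initOnce.Do(func() {
		for _, p := range rawPatterns {
			re, err := regexp.Compile(p)
			if err != nil {
				initErr = err
				patterns = nil
				return
			}
			patterns = append(patterns, re)
		}
	})
}

// Init reports the outcome of the one-time setup.
func Init() error {
	ensureInit()
	return initErr
}

// Sanitize initializes on demand. If setup failed it returns an empty
// string instead of the raw content (assumed fail-closed behavior).
func Sanitize(content string) string {
	ensureInit()
	if initErr != nil {
		return ""
	}
	for _, re := range patterns {
		content = re.ReplaceAllString(content, "")
	}
	return content
}

func main() {
	// No explicit Init call: Sanitize still applies the patterns.
	fmt.Printf("%q\n", Sanitize("IBAN DE12345678901234567890 ok")) // prints "IBAN  ok"
}
```

Returning an error from `Sanitize` would be another way to fail closed, at the cost of changing every call site.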

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7f8ae823-9271-4455-8b22-6dcaaf292591

📥 Commits

Reviewing files that changed from the base of the PR and between 1dae1cd and fccd7e9.

📒 Files selected for processing (2)

sanitize/sanitize.go
sanitize/sanitize_test.go

BieggerM and others added 5 commits March 4, 2026 21:51

Fix Docker build: copy sanitize package to build context

770eac1

Fix: sanitize document content in custom fields processing

97f3db8

Fix: sanitize document content in ad-hoc analysis

8d1d95a

Merge pull request #1 from BieggerM/sanitize

1dae1cd

Content Sanitization Feature

coderabbitai Bot reviewed Mar 4, 2026


Comment thread sanitize/sanitize_test.go Outdated

Comment thread sanitize/sanitize.go

fix(sanitize): reset initErr in tests to prevent state leakage

fccd7e9

coderabbitai Bot reviewed Mar 5, 2026


Comment thread sanitize/sanitize.go

fix(sanitize): add defensive init check to fail closed on error

c688344

ivanzud added a commit to ivanzud/paperless-gpt that referenced this pull request Mar 8, 2026

Merge upstream PR icereed#917: sanitize LLM API payloads

6dfdb54

This was referenced Mar 8, 2026

Resolve unresolved upstream March OCR/Ollama items ivanzud/paperless-gpt#131

Merged

[UPSTREAM #916] Feature Request: Content Sanitization for LLM API Calls ivanzud/paperless-gpt#132

Closed

Content Sanitization for LLM API Calls #917

BieggerM wants to merge 7 commits into icereed:main from BieggerM:main
