Content Sanitization for LLM API Calls #917

Open
BieggerM wants to merge 7 commits into icereed:main from BieggerM:main
Conversation

BieggerM commented Mar 4, 2026

Content Sanitization

Removes configured sensitive strings and regex patterns from document content before it is sent to LLM APIs.

Configuration

# Remove literal strings (comma-separated)
REMOVE_FROM_CONTENT=CONFIDENTIAL,John Doe,SECRET

# Remove regex patterns (semicolon-separated)
REMOVE_FROM_CONTENT_REGEX=DE\d{20};[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,};\b\d{4}-\d{4}-\d{4}-\d{4}\b

Examples:

  • Names: REMOVE_FROM_CONTENT=John Doe,Jane Smith
  • IBANs: REMOVE_FROM_CONTENT_REGEX=DE\d{20}
  • Emails: REMOVE_FROM_CONTENT_REGEX=[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
  • Credit cards: REMOVE_FROM_CONTENT_REGEX=\b\d{4}-\d{4}-\d{4}-\d{4}\b
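
The two separators above can be illustrated with a minimal Go sketch. It is not the code from this PR: `splitNonEmpty` and `removeConfigured` are hypothetical names, and the choice to drop empty entries without trimming whitespace is an assumption. It only shows comma-separated literals being removed first and semicolon-separated regexes second.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// splitNonEmpty splits raw on sep and drops empty entries.
// Hypothetical helper; entries are not trimmed, so "John Doe" stays intact.
func splitNonEmpty(raw, sep string) []string {
	var out []string
	for _, part := range strings.Split(raw, sep) {
		if part != "" {
			out = append(out, part)
		}
	}
	return out
}

// removeConfigured removes every literal (comma-separated) and every regex
// match (semicolon-separated) from content. Hypothetical name, not the PR's API.
func removeConfigured(content, literals, regexes string) (string, error) {
	for _, lit := range splitNonEmpty(literals, ",") {
		content = strings.ReplaceAll(content, lit, "")
	}
	for _, pat := range splitNonEmpty(regexes, ";") {
		re, err := regexp.Compile(pat)
		if err != nil {
			return "", fmt.Errorf("invalid pattern %q: %w", pat, err)
		}
		content = re.ReplaceAllString(content, "")
	}
	return content, nil
}

func main() {
	out, err := removeConfigured(
		"CONFIDENTIAL: pay John Doe via DE12345678901234567890",
		"CONFIDENTIAL,John Doe",
		`DE\d{20}`,
	)
	fmt.Printf("%q %v\n", out, err) // prints ": pay  via " <nil>
}
```

One consequence of this format: a literal that itself contains a comma, or a regex that contains a semicolon, cannot be expressed without some escaping scheme.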

Coverage

✅ Document suggestions (title, tags, correspondent, document type, created date)
✅ Custom field suggestions
✅ Ad-hoc analysis
✅ OCR prompts

Implementation

  • Patterns compiled once at startup (main.go)
  • Thread-safe (sync.Once)
  • Zero overhead if not configured
  • Package: sanitize/ with 11 test cases

Docker Compose

services:
  app:
    image: paperless-gpt
    environment:
      - REMOVE_FROM_CONTENT=CONFIDENTIAL,SECRET
      - REMOVE_FROM_CONTENT_REGEX=DE\d{20};[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}

Troubleshooting

# Verify env vars
docker exec <container> env | grep REMOVE

# Check logs
docker logs <container> 2>&1 | grep -i sanitiz

An invalid regex pattern causes startup to fail with an error message.

Testing

go test ./sanitize/...

Summary by CodeRabbit

  • New Features

    • Content sanitization removes configured sensitive strings and regex patterns from documents and prompts before AI processing; sanitization is initialized at startup and applied across analysis and suggestion workflows.
  • Documentation

    • Added docs describing configuration via environment variables and usage.
  • Tests

    • Comprehensive tests added to validate literal and regex removal, parsing, and initialization behavior.

BieggerM and others added 5 commits March 4, 2026 21:51
Introduce sanitize package to strip configured patterns from content
before sending to LLMs. Supports literal string removal via
REMOVE_FROM_CONTENT env var and regex pattern removal via
REMOVE_FROM_CONTENT_REGEX env var.

Apply sanitization to document content processing and OCR prompts
to prevent sensitive information from being sent to external LLM APIs.
Content Sanitization Feature
coderabbitai Bot commented Mar 4, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ff79db12-c00e-4924-8268-a9435eea1b12

📥 Commits

Reviewing files that changed from the base of the PR and between fccd7e9 and c688344.

📒 Files selected for processing (1)
  • sanitize/sanitize.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • sanitize/sanitize.go

📝 Walkthrough

This PR adds a new sanitize package (Init, Sanitize) with env-configured literal and regex removals, wires Sanitize into main initialization and multiple code paths (HTTP handlers, LLM functions, OCR provider) to clean content/prompts before use, and updates the Dockerfile to include the sanitize directory in the build context.

Changes

| Cohort / File(s) | Summary |
|---|---|
| New Sanitize Package: sanitize/doc.go, sanitize/sanitize.go, sanitize/sanitize_test.go | Adds a sanitize package with Init() and Sanitize(string), env-driven literal and regex removals, one-time init with error reporting, parsing helpers, and comprehensive tests. |
| Application Integration: main.go, app_http_handlers.go, app_llm.go, ocr/llm_provider.go | Imports sanitize; calls sanitize.Init() at startup; applies sanitize.Sanitize(...) to document content and LLM prompts before truncation/templating and before sending to LLMs. |
| Build Configuration: Dockerfile | Adds COPY sanitize ./sanitize to the builder stage so the sanitize package is included in the image build context. |

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client
    participant HTTP as HTTP Handler
    participant Sanitizer as Sanitizer (package)
    participant LLM as LLM Provider
    participant OCR as OCR Provider

    Client->>HTTP: submit document / request
    HTTP->>Sanitizer: Sanitize(doc.Content)
    Sanitizer-->>HTTP: sanitizedContent
    HTTP->>LLM: send sanitizedContent in prompt
    LLM->>Sanitizer: Sanitize(promptText)
    Sanitizer-->>LLM: sanitizedPrompt
    LLM->>HTTP: response
    HTTP-->>Client: deliver result

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

Poem

🐰 I hop through text with whiskered care,

I nibble secrets hidden there,
I scrub the prompts and trim each line,
So LLMs see only what's benign,
A tidy world — from rabbit, fine. 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Content Sanitization for LLM API Calls' clearly and concisely captures the main change: adding content sanitization before LLM API calls, which is the primary feature across all modified files. |

coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@sanitize/sanitize_test.go`:
- Around line 87-95: The subtests in sanitize_test.go can inherit process-wide
env state; ensure each subtest explicitly isolates REMOVE_FROM_CONTENT and
REMOVE_FROM_CONTENT_REGEX by first unsetting both vars at the start of the
subtest, then setting each with tt.literals and tt.regexes only if non-empty,
and always deferring an Unsetenv for both variables (i.e., call
os.Unsetenv("REMOVE_FROM_CONTENT") and os.Unsetenv("REMOVE_FROM_CONTENT_REGEX")
at the start, then conditionally os.Setenv for tt.literals and tt.regexes, and
defer Unsetenv for each) so the test case values (tt.literals/tt.regexes) do not
leak between tests.

In `@sanitize/sanitize.go`:
- Around line 11-43: Init() currently uses a function-local initErr so if the
first call to initOnce.Do sets an error, subsequent Init() calls return nil
because the local variable isn't preserved; change initErr to a package-level
error variable (e.g., var initErr error) and assign to that variable inside the
initOnce.Do closure, then have Init() return the package-level initErr so any
initialization failure is preserved across calls to Init(); keep existing
symbols literalPatterns, regexPatterns and initOnce unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1336c72f-7475-4a9f-a780-365c82b3f59a

📥 Commits

Reviewing files that changed from the base of the PR and between d445bfa and 1dae1cd.

📒 Files selected for processing (8)
  • Dockerfile
  • app_http_handlers.go
  • app_llm.go
  • main.go
  • ocr/llm_provider.go
  • sanitize/doc.go
  • sanitize/sanitize.go
  • sanitize/sanitize_test.go

Comment thread sanitize/sanitize_test.go Outdated
Comment thread sanitize/sanitize.go

coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@sanitize/sanitize.go`:
- Around line 72-90: Sanitize currently relies on global literalPatterns and
regexPatterns being initialized by Init(), which can lead to a silent fail-open
if Init() was not called; modify the sanitization module to perform a defensive
initialization by adding a package-level sync.Once (e.g., var initOnce
sync.Once) and invoking initOnce.Do(Init) at the start of Sanitize (or create a
private ensureInitialized helper that calls initOnce.Do(Init)), so Init() runs
at least once before using literalPatterns/regexPatterns and Sanitize fails
closed even if callers forget to call Init().

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7f8ae823-9271-4455-8b22-6dcaaf292591

📥 Commits

Reviewing files that changed from the base of the PR and between 1dae1cd and fccd7e9.

📒 Files selected for processing (2)
  • sanitize/sanitize.go
  • sanitize/sanitize_test.go

Comment thread sanitize/sanitize.go