Content Sanitization for LLM API Calls #917
Introduce sanitize package to strip configured patterns from content before sending to LLMs. Supports literal string removal via REMOVE_FROM_CONTENT env var and regex pattern removal via REMOVE_FROM_CONTENT_REGEX env var. Apply sanitization to document content processing and OCR prompts to prevent sensitive information from being sent to external LLM APIs.
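As a rough illustration of the two removal modes described above, the following is a minimal sketch and not the PR's actual code. The `Sanitizer` type and `NewSanitizer` constructor are hypothetical names. Splitting the literal list on commas is an assumption based on the `REMOVE_FROM_CONTENT=John Doe,Jane Smith` example, and the regex patterns are passed in as a slice because the PR text does not say how multiple patterns are delimited.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Sanitizer is a hypothetical holder for the patterns to strip.
type Sanitizer struct {
	literals []string
	regexes  []*regexp.Regexp
}

// NewSanitizer splits the comma-separated literal list (an assumed format)
// and compiles every regex, returning an error for the first invalid one.
func NewSanitizer(literalCSV string, regexPatterns []string) (*Sanitizer, error) {
	s := &Sanitizer{}
	for _, lit := range strings.Split(literalCSV, ",") {
		if lit = strings.TrimSpace(lit); lit != "" {
			s.literals = append(s.literals, lit)
		}
	}
	for _, p := range regexPatterns {
		re, err := regexp.Compile(p)
		if err != nil {
			return nil, fmt.Errorf("invalid regex %q: %w", p, err)
		}
		s.regexes = append(s.regexes, re)
	}
	return s, nil
}

// Sanitize removes every literal first, then every regex match.
func (s *Sanitizer) Sanitize(content string) string {
	for _, lit := range s.literals {
		content = strings.ReplaceAll(content, lit, "")
	}
	for _, re := range s.regexes {
		content = re.ReplaceAllString(content, "")
	}
	return content
}

func main() {
	s, err := NewSanitizer("John Doe,Jane Smith", []string{`DE\d{20}`})
	if err != nil {
		panic(err)
	}
	// Output: Payer: , IBAN .
	fmt.Println(s.Sanitize("Payer: John Doe, IBAN DE12345678901234567890."))
}
```

Matches are replaced with the empty string here, so surrounding punctuation and whitespace are left in place.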
Content Sanitization Feature
No actionable comments were generated in the recent review.
📝 Walkthrough
This PR adds a new sanitize package (Init, Sanitize) with env-configured literal and regex removals, wires Sanitize into main initialization and multiple code paths (HTTP handlers, LLM functions, OCR provider) to clean content/prompts before use, and updates the Dockerfile to include the sanitize directory in the build context.
Sequence Diagram(s)
sequenceDiagram
participant Client as Client
participant HTTP as HTTP Handler
participant Sanitizer as Sanitizer (package)
participant LLM as LLM Provider
participant OCR as OCR Provider
Client->>HTTP: submit document / request
HTTP->>Sanitizer: Sanitize(doc.Content)
Sanitizer-->>HTTP: sanitizedContent
HTTP->>LLM: send sanitizedContent in prompt
LLM->>Sanitizer: Sanitize(promptText)
Sanitizer-->>LLM: sanitizedPrompt
LLM->>HTTP: response
HTTP-->>Client: deliver result
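The flow in the diagram can be sketched roughly as follows. This is only an illustration under stated assumptions: `LLMClient`, `recordingLLM`, and `suggestTitle` are hypothetical names and do not come from the PR, and the sanitizer is passed in as a plain function so the example stays self-contained.

```go
package main

import (
	"fmt"
	"strings"
)

// LLMClient is a hypothetical stand-in for a real LLM provider.
type LLMClient interface {
	Complete(prompt string) (string, error)
}

// recordingLLM remembers the last prompt so we can inspect what would
// have left the process.
type recordingLLM struct{ lastPrompt string }

func (r *recordingLLM) Complete(prompt string) (string, error) {
	r.lastPrompt = prompt
	return "ok", nil
}

// suggestTitle sanitizes the document content before embedding it in the
// prompt, mirroring the HTTP -> Sanitizer -> LLM steps in the diagram.
func suggestTitle(llm LLMClient, sanitize func(string) string, content string) (string, error) {
	prompt := "Suggest a title for this document:\n" + sanitize(content)
	return llm.Complete(prompt)
}

func main() {
	stripName := func(s string) string { return strings.ReplaceAll(s, "John Doe", "") }
	llm := &recordingLLM{}
	if _, err := suggestTitle(llm, stripName, "Invoice for John Doe"); err != nil {
		panic(err)
	}
	// The recorded prompt no longer contains the configured name.
	fmt.Printf("%q\n", llm.lastPrompt)
}
```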
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@sanitize/sanitize_test.go`:
- Around line 87-95: The subtests in sanitize_test.go can inherit process-wide
env state. Isolate REMOVE_FROM_CONTENT and REMOVE_FROM_CONTENT_REGEX in each
subtest: call os.Unsetenv for both variables at the start, call os.Setenv with
tt.literals and tt.regexes only when they are non-empty, and defer os.Unsetenv
for both, so that the test case values (tt.literals/tt.regexes) do not leak
between tests.
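One way to get the isolation described in this comment is a small helper that clears both variables, sets only the non-empty ones, and hands back a restore function. This is a sketch, not code from the PR: `isolateEnv` and `describe` are hypothetical helpers, and restoring the values found on entry (rather than always unsetting) is a design choice made here.

```go
package main

import (
	"fmt"
	"os"
)

const (
	literalsVar = "REMOVE_FROM_CONTENT"
	regexesVar  = "REMOVE_FROM_CONTENT_REGEX"
)

// isolateEnv clears both variables, sets each one only when its value is
// non-empty, and returns a function that restores the state found on entry.
func isolateEnv(literals, regexes string) (restore func()) {
	type saved struct {
		value string
		set   bool
	}
	prev := map[string]saved{}
	for _, key := range []string{literalsVar, regexesVar} {
		v, ok := os.LookupEnv(key)
		prev[key] = saved{value: v, set: ok}
		os.Unsetenv(key)
	}
	if literals != "" {
		os.Setenv(literalsVar, literals)
	}
	if regexes != "" {
		os.Setenv(regexesVar, regexes)
	}
	return func() {
		for key, s := range prev {
			if s.set {
				os.Setenv(key, s.value)
			} else {
				os.Unsetenv(key)
			}
		}
	}
}

// describe reports a variable's value, or "<unset>" when it is absent.
func describe(key string) string {
	if v, ok := os.LookupEnv(key); ok {
		return v
	}
	return "<unset>"
}

func main() {
	// Simulate an earlier test case that left a regex behind.
	isolateEnv("", "stale-pattern")

	restore := isolateEnv("John Doe", "")
	fmt.Println(describe(literalsVar), describe(regexesVar))
	restore()
	fmt.Println(describe(literalsVar), describe(regexesVar))
}
```

Inside a subtest this would be used as `defer isolateEnv(tt.literals, tt.regexes)()`. In Go 1.17 and later, `t.Setenv` also sets a variable and restores the previous value when the test ends, although it cannot be used in parallel tests.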
In `@sanitize/sanitize.go`:
- Around line 11-43: Init() currently stores its error in a function-local
initErr. If the first call fails inside initOnce.Do, later Init() calls skip the
closure and return nil, so the failure is lost. Change initErr to a
package-level error variable (e.g., var initErr error), assign to it inside the
initOnce.Do closure, and have Init() return it so that any initialization
failure is preserved across calls to Init(). Keep the existing symbols
literalPatterns, regexPatterns and initOnce unchanged.
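The fix requested here can be sketched as below. This is a simplified illustration, not the PR's code: it assumes the environment variable holds a single regex, and the `getenv` variable is a hypothetical indirection added so the lookup can be replaced in tests.

```go
package main

import (
	"fmt"
	"os"
	"regexp"
	"sync"
)

var (
	initOnce      sync.Once
	initErr       error // package-level, so every Init call sees the first result
	regexPatterns []*regexp.Regexp

	// getenv is a hypothetical seam that lets tests substitute the lookup.
	getenv = os.Getenv
)

// Init compiles the configured regex at most once. Because initErr lives at
// package level, a failure from the first call is returned by later calls too.
func Init() error {
	initOnce.Do(func() {
		raw := getenv("REMOVE_FROM_CONTENT_REGEX")
		if raw == "" {
			return
		}
		re, err := regexp.Compile(raw)
		if err != nil {
			initErr = fmt.Errorf("invalid REMOVE_FROM_CONTENT_REGEX %q: %w", raw, err)
			return
		}
		regexPatterns = append(regexPatterns, re)
	})
	return initErr
}

func main() {
	os.Setenv("REMOVE_FROM_CONTENT_REGEX", "([a-z") // deliberately invalid
	fmt.Println("first call: ", Init())
	fmt.Println("second call:", Init()) // same error, not nil
}
```

With a function-local error variable, the second call would skip the closure and return the zero value, which is the bug the comment describes.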
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1336c72f-7475-4a9f-a780-365c82b3f59a
📒 Files selected for processing (8)
Dockerfile
app_http_handlers.go
app_llm.go
main.go
ocr/llm_provider.go
sanitize/doc.go
sanitize/sanitize.go
sanitize/sanitize_test.go
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@sanitize/sanitize.go`:
- Around line 72-90: Sanitize currently relies on the globals literalPatterns
and regexPatterns being initialized by Init(), which can lead to a silent
fail-open if Init() was never called. Add defensive initialization: use a
package-level sync.Once (e.g., var initOnce sync.Once) and invoke
initOnce.Do(Init) at the start of Sanitize, or create a private
ensureInitialized helper that does so. Init() then runs at least once before
literalPatterns/regexPatterns are used, and Sanitize fails closed even if
callers forget to call Init().
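A minimal sketch of defensive initialization follows, under some assumptions that are not confirmed by the PR. `sync.Once.Do` accepts a `func()` and must not be re-entered from inside its own callback, so if Init() already guards itself with initOnce and returns an error, the simplest option is for Sanitize to call Init() directly. `loadConfig` is a hypothetical hook standing in for reading the environment, and returning an empty string when initialization failed is one possible reading of "fails closed".

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"sync"
)

var (
	initOnce        sync.Once
	initErr         error
	literalPatterns []string
	regexPatterns   []*regexp.Regexp

	// loadConfig is a hypothetical hook standing in for the env var lookup.
	loadConfig = func() (literals, regexes []string) {
		return []string{"John Doe"}, []string{`\b\d{4}-\d{4}-\d{4}-\d{4}\b`}
	}
)

// Init loads and compiles the patterns at most once.
func Init() error {
	initOnce.Do(func() {
		literals, regexes := loadConfig()
		compiled := make([]*regexp.Regexp, 0, len(regexes))
		for _, p := range regexes {
			re, err := regexp.Compile(p)
			if err != nil {
				initErr = fmt.Errorf("invalid regex %q: %w", p, err)
				return
			}
			compiled = append(compiled, re)
		}
		literalPatterns, regexPatterns = literals, compiled
	})
	return initErr
}

// Sanitize initializes on first use. If initialization failed it returns an
// empty string rather than passing unsanitized content through (assumed policy).
func Sanitize(content string) string {
	if err := Init(); err != nil {
		return ""
	}
	for _, lit := range literalPatterns {
		content = strings.ReplaceAll(content, lit, "")
	}
	for _, re := range regexPatterns {
		content = re.ReplaceAllString(content, "")
	}
	return content
}

func main() {
	// Init is never called explicitly; Sanitize triggers it.
	// Output: "Card  belongs to ."
	fmt.Printf("%q\n", Sanitize("Card 1234-5678-9012-3456 belongs to John Doe."))
}
```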
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7f8ae823-9271-4455-8b22-6dcaaf292591
📒 Files selected for processing (2)
sanitize/sanitize.go
sanitize/sanitize_test.go
Content Sanitization
Removes sensitive data from document content before sending to LLM APIs.
Configuration
Examples:
REMOVE_FROM_CONTENT=John Doe,Jane Smith
REMOVE_FROM_CONTENT_REGEX=DE\d{20}
REMOVE_FROM_CONTENT_REGEX=[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
REMOVE_FROM_CONTENT_REGEX=\b\d{4}-\d{4}-\d{4}-\d{4}\b
Coverage
✅ Document suggestions (title, tags, correspondent, document type, created date)
✅ Custom field suggestions
✅ Ad-hoc analysis
✅ OCR prompts
Implementation
main.go)
sync.Once)
sanitize/ with 11 test cases
Docker Compose
Troubleshooting
An invalid regex pattern causes startup to fail with an error message.
Testing
go test ./sanitize/...
Summary by CodeRabbit
New Features
Documentation
Tests