Add the free-text dataset-redaction runner over CSV/JSONL/Parquet rows

## Summary
A CSV/JSONL/Parquet dataset-redaction runner is a v1.6 headline (section 8.2) and OpenMed's answer to pyDeid-style batch de-id (section 2.1, processing/batch.py). The separate v2.0 epic OM-044 covers the k-anonymity/generalization engine for structured identifiers. This task is the simpler, earlier free-text-cell runner: read a tabular or line-delimited file, route free-text columns through deidentify(), and write a redacted dataset with an aggregate audit summary. Without it, there is no batch de-id path before v2.0.

## Scope
- [ ] Add a redact_dataset(path, text_columns=[...], policy=...) runner (openmed/processing/batch.py) that reads CSV/JSONL/Parquet, routes each free-text column value through deidentify() with the selected policy, and writes a redacted output dataset preserving non-text columns.
- [ ] Produce an aggregate per-file audit summary (total spans, per-label counts, residual-leakage estimate) without writing raw PHI.
- [ ] Expose it via an 'openmed redact-dataset' CLI subcommand (reachable through the console entry point).
- [ ] Stream/iterate rows so large files are never loaded wholly into memory; document the supported formats and column selection.
- [ ] Tests over a small fixture CSV/JSONL with PHI in designated columns: redacted output has no PHI in those columns, non-text columns are untouched, and the audit summary counts spans.
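
A minimal sketch of the row-streaming runner described in this list, covering CSV and JSONL only. Everything here is an assumption made for illustration: the `out_path` argument, the summary keys (`rows`, `total_spans`, `per_label`), and especially `fake_deidentify`, a regex stand-in for the real deidentify(), whose signature and return type this issue does not specify. Parquet and the residual-leakage estimate are left out.

```python
from __future__ import annotations

import csv
import json
import re
from collections import Counter
from pathlib import Path
from typing import Callable, Iterator, Sequence

# Hypothetical stand-in for OpenMed's deidentify(): returns the redacted text
# and the labels of the spans it replaced. The real API may differ.
_PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.]+@[\w.]+\.\w+\b"),
}


def fake_deidentify(text: str, policy: str = "default") -> tuple[str, list[str]]:
    labels: list[str] = []
    for label, pattern in _PATTERNS.items():
        text, count = pattern.subn(f"[{label}]", text)
        labels.extend([label] * count)
    return text, labels


def _read_rows(path: Path) -> Iterator[dict]:
    """Yield one row at a time so the input is never fully loaded."""
    if path.suffix == ".csv":
        with path.open(newline="", encoding="utf-8") as fh:
            yield from csv.DictReader(fh)
    elif path.suffix == ".jsonl":
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    yield json.loads(line)
    else:
        raise ValueError(f"unsupported format: {path.suffix}")


def redact_dataset(
    path: str | Path,
    out_path: str | Path,
    text_columns: Sequence[str],
    policy: str = "default",
    deidentify: Callable[[str, str], tuple[str, list[str]]] = fake_deidentify,
) -> dict:
    """Redact the given columns row by row; return a counts-only summary."""
    path, out_path = Path(path), Path(out_path)
    per_label: Counter[str] = Counter()
    rows_seen = 0
    with out_path.open("w", newline="", encoding="utf-8") as out:
        writer = None
        for row in _read_rows(path):
            rows_seen += 1
            for col in text_columns:
                value = row.get(col)
                if isinstance(value, str):
                    row[col], labels = deidentify(value, policy)
                    per_label.update(labels)
            if path.suffix == ".csv":
                if writer is None:  # header is taken from the first row
                    writer = csv.DictWriter(out, fieldnames=list(row))
                    writer.writeheader()
                writer.writerow(row)
            else:
                out.write(json.dumps(row) + "\n")
    # Only counts go into the summary, never the matched text.
    return {
        "rows": rows_seen,
        "total_spans": sum(per_label.values()),
        "per_label": dict(per_label),
    }
```

In this sketch the output format follows the input suffix, and columns not named in `text_columns` are written back exactly as they were read.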

## Acceptance criteria
- [ ] redact_dataset over a fixture CSV and a fixture JSONL routes the designated free-text columns through deidentify() and writes a redacted dataset; non-text columns are unchanged.
- [ ] 'openmed redact-dataset' runs the path from the CLI and emits an aggregate audit summary containing no raw PHI.
- [ ] A test asserts PHI in the designated columns is removed/replaced and the summary reports per-label span counts.
- [ ] Test suite is green: .venv/bin/python -m pytest tests/ -q
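
The checks in these criteria could be expressed as a pair of test helpers along the following lines. This is a sketch under stated assumptions: the fixture rows are synthetic, the column names `patient_id` and `note` are invented for illustration, and the summary keys `total_spans` and `per_label` are hypothetical, since the issue does not fix the summary schema.

```python
import csv
from pathlib import Path

# Synthetic fixture: PHI appears only in the designated "note" column.
FIXTURE_ROWS = [
    {"patient_id": "1", "note": "Seen by Dr. Alice Smith, call 555-0101."},
    {"patient_id": "2", "note": "No identifiers in this note."},
]
PHI_STRINGS = ["Alice Smith", "555-0101"]


def write_fixture(path: Path) -> None:
    """Write the synthetic fixture CSV used by the assertions below."""
    with path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["patient_id", "note"])
        writer.writeheader()
        writer.writerows(FIXTURE_ROWS)


def assert_redacted(out_path: Path, summary: dict) -> None:
    """Check a redacted CSV and its audit summary against the criteria."""
    with out_path.open(newline="", encoding="utf-8") as fh:
        rows = list(csv.DictReader(fh))
    # Non-text columns are unchanged.
    assert [r["patient_id"] for r in rows] == ["1", "2"]
    # No known PHI string survives in the designated column.
    for row in rows:
        assert not any(phi in row["note"] for phi in PHI_STRINGS)
    # The summary reports per-label counts that add up to the total.
    assert summary["total_spans"] > 0
    assert sum(summary["per_label"].values()) == summary["total_spans"]
    # The summary itself carries no raw PHI.
    assert not any(phi in repr(summary) for phi in PHI_STRINGS)
```

A real test would call `write_fixture` on a temporary path, run the runner over it, and pass the output path and returned summary to `assert_redacted`.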

## Out of scope
- k-anonymity / l-diversity / t-closeness / DP transforms on structured identifier columns (OM-044).
- Column-role classification via DataProfiler (OM-044).
- Warehouse/streaming connectors beyond local files.

## Files
- openmed/processing/batch.py
- openmed/cli/redact_dataset.py
- tests/unit/processing/test_redact_dataset.py
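
As a rough illustration of what the CLI module listed above might contain, here is an argparse sketch of a `redact-dataset` subcommand. The flag names (`--output`, `--text-column`, `--policy`), the `"default"` policy value, and the `runner` callable signature are all assumptions; the issue only fixes the subcommand name and that it prints an aggregate summary.

```python
from __future__ import annotations

import argparse
import json
import sys
from typing import Callable, Sequence


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="openmed")
    sub = parser.add_subparsers(dest="command", required=True)
    rd = sub.add_parser(
        "redact-dataset",
        help="Redact free-text columns of a CSV/JSONL/Parquet file",
    )
    rd.add_argument("path", help="input dataset")
    # Hypothetical flags; names are illustrative only.
    rd.add_argument("--output", required=True, help="redacted output path")
    rd.add_argument(
        "--text-column",
        dest="text_columns",
        action="append",
        required=True,
        help="free-text column to redact (repeatable)",
    )
    rd.add_argument("--policy", default="default")
    return parser


def main(argv: Sequence[str], runner: Callable[..., dict]) -> int:
    """Parse arguments, call the runner, and print its counts-only summary."""
    args = build_parser().parse_args(argv)
    summary = runner(args.path, args.output, args.text_columns, args.policy)
    json.dump(summary, sys.stdout, indent=2)
    sys.stdout.write("\n")
    return 0
```

Passing the runner in as a parameter keeps the sketch independent of how redact_dataset() ends up being imported, and makes the command easy to exercise in a unit test with a stub.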

---
Task: OM-055  ·  Milestone: v1.6  ·  Priority: P1  ·  Size: M
Depends on: OM-002, OM-031a  ·  Blocks: —
Roadmap: section 8.2 (v1.6 headline), section 2.1 (pyDeid row)
Spec: PLANS/V2/EXECUTION/tasks/OM-055.md

