Add the free-text dataset-redaction runner over CSV/JSONL/Parquet rows #177

@maziyarpanahi

Summary

A CSV/JSONL/Parquet dataset-redaction runner is a v1.6 headline feature (section 8.2) and OpenMed's answer to pyDeid-style batch de-identification (section 2.1, processing/batch.py). The separate v2.0 epic, OM-044, covers the k-anonymity/generalization engine for structured identifiers. This task is the simpler, earlier-shipping free-text-cell runner: read a tabular or line-delimited file, route free-text columns through deidentify(), and write a redacted dataset with an aggregate audit summary. Without it, there is no batch de-id path before v2.0.

Scope

  • Add a redact_dataset(path, text_columns=[...], policy=...) runner (openmed/processing/batch.py) that reads CSV/JSONL/Parquet, routes each free-text column value through deidentify() with the selected policy, and writes a redacted output dataset preserving non-text columns.
  • Produce an aggregate per-file audit summary (total spans, per-label counts, residual-leakage estimate) without writing raw PHI.
  • Expose it via an 'openmed redact-dataset' CLI subcommand (reachable through the console entry point).
  • Stream or iterate over rows so large files are not loaded wholly into memory; document the supported formats and column selection.
  • Tests over a small fixture CSV/JSONL with PHI in designated columns: redacted output has no PHI in those columns, non-text columns are untouched, and the audit summary counts spans.
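
The scope items above can be sketched roughly as follows. This is a minimal illustration, not the planned implementation: `stub_deidentify` is a hypothetical regex stand-in for deidentify() (whose real signature and return type are not given in this issue), and the `out_path` and `deidentify` parameters are assumptions added so the sketch is self-contained. Parquet and the residual-leakage estimate are omitted.

```python
import csv
import json
import re
from collections import Counter
from pathlib import Path

# Hypothetical stand-in for deidentify(); NOT the OpenMed API.
_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def stub_deidentify(text, policy="mask"):
    """Return (redacted_text, labels). Labels never carry the raw text."""
    labels = []
    for label, pattern in _PATTERNS.items():
        text, n = pattern.subn(f"[{label}]", text)
        labels.extend([label] * n)
    return text, labels


def _redact_row(row, text_columns, policy, deidentify, counts):
    """Redact only the designated columns; leave every other value as-is."""
    for col in text_columns:
        value = row.get(col)
        if isinstance(value, str) and value:
            row[col], labels = deidentify(value, policy)
            counts.update(labels)
    return row


def redact_dataset(path, out_path, text_columns, policy="mask",
                   deidentify=stub_deidentify):
    """Stream rows from a CSV or JSONL file and write a redacted copy."""
    path, out_path = Path(path), Path(out_path)
    counts, rows = Counter(), 0
    with path.open(newline="", encoding="utf-8") as src, \
            out_path.open("w", newline="", encoding="utf-8") as dst:
        if path.suffix == ".csv":
            reader = csv.DictReader(src)
            writer = csv.DictWriter(dst, fieldnames=reader.fieldnames or [])
            writer.writeheader()
            for row in reader:  # one row in memory at a time
                writer.writerow(
                    _redact_row(row, text_columns, policy, deidentify, counts))
                rows += 1
        elif path.suffix == ".jsonl":
            for line in src:
                if not line.strip():
                    continue
                row = _redact_row(json.loads(line), text_columns, policy,
                                  deidentify, counts)
                dst.write(json.dumps(row) + "\n")
                rows += 1
        else:
            raise ValueError(f"unsupported format: {path.suffix}")
    # The summary holds counts only, so no raw PHI can reach it.
    return {"rows": rows, "total_spans": sum(counts.values()),
            "per_label": dict(counts)}
```

Passing the de-identifier in as a parameter keeps the row loop testable with a stub; the real runner would presumably call deidentify() with the selected policy directly.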

Acceptance criteria

  • redact_dataset over a fixture CSV and a fixture JSONL routes the designated free-text columns through deidentify() and writes a redacted dataset; non-text columns are unchanged.
  • 'openmed redact-dataset' runs the path from the CLI and emits an aggregate audit summary containing no raw PHI.
  • A test asserts PHI in the designated columns is removed/replaced and the summary reports per-label span counts.
  • Test suite is green: .venv/bin/python -m pytest tests/ -q
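
As a rough illustration of the CLI criterion, the sketch below wires a 'redact-dataset' subcommand with argparse and renders a counts-only summary. Only the subcommand name comes from this issue; the `--output`, `--text-column`, and `--policy` flags, the `mask` default, and both function names are hypothetical, and the actual OpenMed CLI framework may differ.

```python
import argparse
import json


def build_parser():
    """Build a parser with a 'redact-dataset' subcommand (flags are assumed)."""
    parser = argparse.ArgumentParser(prog="openmed")
    sub = parser.add_subparsers(dest="command", required=True)
    rd = sub.add_parser("redact-dataset",
                        help="redact free-text columns of a local dataset")
    rd.add_argument("path", help="input CSV/JSONL/Parquet file")
    rd.add_argument("--output", required=True, help="redacted output file")
    # Repeatable flag: one --text-column per free-text column to redact.
    rd.add_argument("--text-column", dest="text_columns", action="append",
                    required=True)
    rd.add_argument("--policy", default="mask")
    return parser


def render_summary(summary):
    """Serialize the audit summary, accepting only label -> int counts."""
    per_label = summary["per_label"]
    if not all(isinstance(v, int) for v in per_label.values()):
        raise TypeError("audit summary must contain counts only")
    return json.dumps(
        {"total_spans": sum(per_label.values()), "per_label": per_label},
        sort_keys=True,
    )
```

Restricting the rendered summary to integer counts keyed by label is one simple way to guarantee that no span text is emitted alongside the counts.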

Out of scope

  • k-anonymity / l-diversity / t-closeness / DP transforms on structured identifier columns (OM-044).
  • Column-role classification via DataProfiler (OM-044).
  • Warehouse/streaming connectors beyond local files.

Files

  • openmed/processing/batch.py
  • openmed/cli/redact_dataset.py
  • tests/unit/processing/test_redact_dataset.py

Task: OM-055 · Milestone: v1.6 · Priority: P1 · Size: M
Depends on: OM-002, OM-031a · Blocks: —
Roadmap: section 8.2 (v1.6 headline), section 2.1 (pyDeid row)
Spec: PLANS/V2/EXECUTION/tasks/OM-055.md

Labels: P1 (High) · feature (New capability) · roadmap-v2 (OpenMed V2 roadmap backlog)