reidentify() cannot round-trip when the same entity type appears more than once #204

@Technolity

Summary

deidentify(..., keep_mapping=True) returns a mapping of redacted -> original,
and reidentify() reverses it. When two entities of the same type appear (e.g.
two NAME entities), both redact to the same placeholder ([NAME]) under
method="mask", so:

  • the mapping dict keeps only the last original; the first is silently
    overwritten (openmed/core/pii.py, mapping[redacted] = original)
  • reidentify() does str.replace(redacted, original), which replaces every
    [NAME] with the same value anyway

The documented round-trip (reidentify(deid, mapping) == original) does not hold
for repeated types.
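
The two failure points above can be reproduced without the library. This is a minimal stand-in, not the openmed implementation: the `entities` list is a hypothetical detector output, and the loop only mirrors the `mapping[redacted] = original` and `str.replace` behaviour described in this issue.

```python
# Stand-in for the behaviour described above; NOT the openmed code.
text = "Patient John Smith and nurse Jane Doe"
entities = [("NAME", "John Smith"), ("NAME", "Jane Doe")]  # hypothetical detector output

deid = text
mapping = {}
for label, original in entities:
    redacted = f"[{label}]"
    deid = deid.replace(original, redacted, 1)
    mapping[redacted] = original  # the second NAME overwrites the first

restored = deid
for redacted, original in mapping.items():
    restored = restored.replace(redacted, original)  # hits every [NAME]

print(deid)      # Patient [NAME] and nurse [NAME]
print(mapping)   # {'[NAME]': 'Jane Doe'}
print(restored)  # Patient Jane Doe and nurse Jane Doe
```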

Repro

from openmed.core.pii import deidentify, reidentify  # import path assumed from openmed/core/pii.py

text = "Patient John Smith and nurse Jane Doe"
r = deidentify(text, method="mask", keep_mapping=True)
reidentify(r.deidentified_text, r.mapping)  # != text

Why it is not a one-line fix

With method="mask" the placeholders are identical by design, so a faithful
round-trip is impossible without changing one of:

  1. the placeholder format (e.g. numbered [NAME_1], [NAME_2]), which changes the
    mask output the README advertises as [NAME], [DATE]
  2. the mapping type (dict[str,str] -> order/position aware), which breaks the
    documented mapping shape and the REST /pii/deidentify mapping contract

Both are API decisions, so I am raising this rather than picking one.
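
To make the second item in the list above concrete, here is one possible shape for a position-aware mapping. It is a sketch only: `mask_with_spans`, `reidentify_by_position`, and the `(start, end, original)` span tuple are hypothetical names, and the entity offsets are assumed to be sorted and non-overlapping.

```python
from typing import List, Tuple

Entity = Tuple[int, int, str]  # (start, end, label) in the original text
Span = Tuple[int, int, str]    # (start, end, original) in the deidentified text


def mask_with_spans(text: str, entities: List[Entity]) -> Tuple[str, List[Span]]:
    """Mask entities and record where each placeholder lands in the output."""
    out, spans = [], []
    cursor = 0  # read position in the original text
    pos = 0     # write position in the deidentified text
    for start, end, label in entities:
        out.append(text[cursor:start])
        pos += start - cursor
        placeholder = f"[{label}]"
        out.append(placeholder)
        spans.append((pos, pos + len(placeholder), text[start:end]))
        pos += len(placeholder)
        cursor = end
    out.append(text[cursor:])
    return "".join(out), spans


def reidentify_by_position(deid: str, spans: List[Span]) -> str:
    """Restore originals right to left so earlier offsets stay valid."""
    for start, end, original in sorted(spans, reverse=True):
        deid = deid[:start] + original + deid[end:]
    return deid


text = "Patient John Smith and nurse Jane Doe"
names = ["John Smith", "Jane Doe"]
entities = [(text.index(n), text.index(n) + len(n), "NAME") for n in names]

deid, spans = mask_with_spans(text, entities)
print(deid)   # Patient [NAME] and nurse [NAME]
print(spans)  # [(8, 14, 'John Smith'), (25, 31, 'Jane Doe')]
assert reidentify_by_position(deid, spans) == text
```

The mask output stays identical to today's, but the spans are only valid for the exact deidentified string they were produced with; any edit to that string between the two calls would invalidate the offsets.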

Options

A. Numbered placeholders only when a type repeats (single occurrence stays
[NAME], so existing output and docs are unchanged for the common case).
B. Position-based reidentify with an ordered structure alongside the dict.
C. Document that round-trip is only guaranteed when each redacted token is
unique.
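
For comparison, option A could look roughly like the following. This is a sketch under stated assumptions, not a proposed patch: `mask_numbered` and `reidentify_numbered` are hypothetical helpers, and entities are assumed to arrive as sorted, non-overlapping `(start, end, label)` tuples.

```python
from collections import Counter
from typing import Dict, List, Tuple

Entity = Tuple[int, int, str]  # (start, end, label) in the original text


def mask_numbered(text: str, entities: List[Entity]) -> Tuple[str, Dict[str, str]]:
    """Number placeholders only for labels that occur more than once."""
    totals = Counter(label for _, _, label in entities)
    seen: Counter = Counter()
    out, mapping, cursor = [], {}, 0
    for start, end, label in entities:
        seen[label] += 1
        if totals[label] > 1:
            placeholder = f"[{label}_{seen[label]}]"
        else:
            placeholder = f"[{label}]"  # unchanged for the common case
        out.append(text[cursor:start])
        out.append(placeholder)
        mapping[placeholder] = text[start:end]
        cursor = end
    out.append(text[cursor:])
    return "".join(out), mapping


def reidentify_numbered(deid: str, mapping: Dict[str, str]) -> str:
    """Plain replace is enough here because every placeholder is unique."""
    for placeholder, original in mapping.items():
        deid = deid.replace(placeholder, original)
    return deid


text = "Patient John Smith and nurse Jane Doe, seen 2024-01-05"
values = [("John Smith", "NAME"), ("Jane Doe", "NAME"), ("2024-01-05", "DATE")]
entities = [(text.index(v), text.index(v) + len(v), label) for v, label in values]

deid, mapping = mask_numbered(text, entities)
print(deid)  # Patient [NAME_1] and nurse [NAME_2], seen [DATE]
assert reidentify_numbered(deid, mapping) == text
```

The mapping stays a dict[str, str], so the documented shape is preserved. The round-trip would still fail if the source text or an original value already contained a literal placeholder such as [NAME_1].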

Happy to send a PR once you pick a direction.
