Summary
deidentify(..., keep_mapping=True) returns a mapping of redacted -> original,
and reidentify() reverses it. When two entities of the same type appear (e.g.
two NAME) they both redact to the same placeholder ([NAME]) under method="mask",
so:
- the mapping dict keeps only the last original; the first is silently
overwritten (openmed/core/pii.py, mapping[redacted] = original)
- reidentify() does str.replace(redacted, original), which replaces every
[NAME] with the same value, so even a complete mapping could not restore two
different names
The documented round-trip (reidentify(deid, mapping) == original) does not hold
for repeated types.
Repro
text = "Patient John Smith and nurse Jane Doe"
r = deidentify(text, method="mask", keep_mapping=True)
reidentify(r.deidentified_text, r.mapping)  # != text: both [NAME] get the same value
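The collision can also be shown without the library or a model. This is a toy model, not the code in openmed/core/pii.py: toy_mask and toy_reidentify are hypothetical stand-ins, and the entity list is hand-written. They assume only the two behaviours described above (dict assignment keyed by the placeholder, and str.replace on restore).

```python
def toy_mask(text, entities):
    """Mask each (original, entity_type) pair as [TYPE]; record redacted -> original."""
    mapping = {}
    for original, entity_type in entities:
        redacted = f"[{entity_type}]"
        text = text.replace(original, redacted, 1)
        # Same key for every NAME, so a later entity overwrites an earlier one.
        mapping[redacted] = original
    return text, mapping


def toy_reidentify(text, mapping):
    """Restore by replacing every occurrence of each placeholder."""
    for redacted, original in mapping.items():
        text = text.replace(redacted, original)
    return text


text = "Patient John Smith and nurse Jane Doe"
entities = [("John Smith", "NAME"), ("Jane Doe", "NAME")]  # hand-written, no model

masked, mapping = toy_mask(text, entities)
restored = toy_reidentify(masked, mapping)

print(masked)    # Patient [NAME] and nurse [NAME]
print(mapping)   # {'[NAME]': 'Jane Doe'}
print(restored)  # Patient Jane Doe and nurse Jane Doe
```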
Why it is not a one-line fix
With method="mask" the placeholders are identical by design, so a faithful
round-trip is impossible without changing one of:
- the placeholder format (e.g. numbered [NAME_1], [NAME_2]), which changes the
mask output the README advertises as [NAME], [DATE]
- the mapping type (dict[str, str] -> order/position aware), which breaks the
documented mapping shape and the REST /pii/deidentify mapping contract
Both are API decisions, so I am raising this rather than picking one.
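To make the second bullet concrete, here is a minimal sketch of what an order-aware mapping could look like. The name reidentify_ordered and the list-of-pairs shape are hypothetical, not an existing or proposed OpenMed API. It assumes the pairs are stored in the order the placeholders appear in the deidentified text.

```python
def reidentify_ordered(text, ordered_mapping):
    """Restore placeholders left to right from (redacted, original) pairs.

    Assumes ordered_mapping lists the pairs in document order.
    """
    parts = []
    cursor = 0
    for redacted, original in ordered_mapping:
        idx = text.find(redacted, cursor)
        if idx == -1:
            raise ValueError(f"placeholder {redacted!r} not found after offset {cursor}")
        parts.append(text[cursor:idx])  # untouched text before the placeholder
        parts.append(original)          # restored value
        cursor = idx + len(redacted)
    parts.append(text[cursor:])         # trailing text after the last placeholder
    return "".join(parts)


masked = "Patient [NAME] and nurse [NAME]"
ordered = [("[NAME]", "John Smith"), ("[NAME]", "Jane Doe")]
print(reidentify_ordered(masked, ordered))
# Patient John Smith and nurse Jane Doe
```

Because the scan only moves forward, restored text is never searched again, so an original value that happens to look like a placeholder is not replaced a second time.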
Options
A. Numbered placeholders only when a type repeats (single occurrence stays
[NAME], so existing output and docs are unchanged for the common case).
B. Position-based reidentify with an ordered structure alongside the dict.
C. Document that round-trip is only guaranteed when each redacted token is
unique.
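For discussion, a rough sketch of Option A. Everything here is hypothetical: mask_numbered, the (start, end, type) span format, and the [TYPE_n] suffix scheme are illustrations, not existing OpenMed behaviour. The sketch assumes the spans are sorted and non-overlapping, and numbers a placeholder only when its type occurs more than once.

```python
from collections import Counter


def mask_numbered(text, spans):
    """Mask (start, end, entity_type) spans; number a type only if it repeats."""
    totals = Counter(entity_type for _, _, entity_type in spans)
    seen = Counter()
    parts, cursor, mapping = [], 0, {}
    for start, end, entity_type in spans:
        seen[entity_type] += 1
        if totals[entity_type] > 1:
            placeholder = f"[{entity_type}_{seen[entity_type]}]"
        else:
            placeholder = f"[{entity_type}]"  # single occurrence keeps today's format
        mapping[placeholder] = text[start:end]
        parts.append(text[cursor:start])
        parts.append(placeholder)
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts), mapping


def restore(text, mapping):
    """Plain str.replace restore; safe here because every placeholder is unique."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text


text = "Patient John Smith and nurse Jane Doe on 2024-01-05"
spans = []
for value, entity_type in [("John Smith", "NAME"), ("Jane Doe", "NAME"), ("2024-01-05", "DATE")]:
    start = text.index(value)  # hand-built spans for the example, no model involved
    spans.append((start, start + len(value), entity_type))

masked, mapping = mask_numbered(text, spans)
print(masked)  # Patient [NAME_1] and nurse [NAME_2] on [DATE]
print(restore(masked, mapping) == text)  # True
```

With unique keys the existing dict[str, str] shape can stay as it is; the trade-off is that mask output differs from the README examples whenever a type repeats.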
Happy to send a PR once you pick a direction.