
Commit 86d9bd7

wehos (Hongzhi Wen) and claude authored
perf(memory): cache entry token_count to skip re-tokenization on every render (#939)
* perf(memory): cache entry token_count to skip re-tokenization on every render

  Persona / reflection entries now carry two derived fields:
  - token_count: int | None
  - token_count_text_sha256: str | None

  On render, PersonaManager._ascore_trim_entries computes sha256(text) and
  reuses the cached count on a fingerprint match. Otherwise it calls
  acount_tokens once and writes both fields back in-memory. The cache rides
  along on normal persona/reflection saves (no new event type). A fresh boot
  re-tokenizes on first render.

  amerge_into explicitly invalidates the cache when rewriting text, to avoid
  the tiny window where a concurrent reader might see new text + a stale
  count; the fingerprint check would catch it otherwise.

  Follow-up optimization to PR #936 (render budget): ~200 ms saved per
  render for 100 entries once the cache is warm.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(memory/tokenize): key token_count cache by tokenizer identity, not just text

  Round-1 review on PR #939 (codex, P2) flagged a correctness gap: the cache
  keyed only on sha256(text), but `utils.tokenize.count_tokens` silently
  falls back from tiktoken to a heuristic when the encoding data file is
  unavailable (packaging products without o200k_base.tiktoken). A cache
  warmed under tiktoken then served to a heuristic-fallback render would
  return a count off by ~1.5-2×, making the render-budget trim arbitrary.

  Fix: add a third cache field `token_count_tokenizer`, populated from a new
  `utils.tokenize.tokenizer_identity()` helper that returns
  `tiktoken:<encoding>` or `heuristic:<version>` based on the current
  encoder cache state. A cache hit requires BOTH the text sha256 AND the
  tokenizer identity to match; a mismatch on either triggers recompute and
  re-stamp. `_invalidate_token_count_cache` now clears all three fields.
  `_normalize_entry` / `_normalize_reflection` default the new field to None
  so legacy on-disk data reads as a cache miss on first render.

  Regression tests cover:
  - sync identity mismatch → recompute
  - async identity mismatch → recompute (render hot path)
  - tiktoken → heuristic transition (real failure mode, counts differ)
  - tokenizer_identity() output shape is stable

* fix(memory/tokencache): invalidate on card-sync rewrite + scope cache to persona-only

  PR #939 round-2 review (CodeRabbit):

  1. Minor — `_apply_character_card_sync` (persona.py:512) rewrites
     `entry['text']` in place when a character-card field changes, but never
     called `_invalidate_token_count_cache(entry)` the way `amerge_into`
     does. The fingerprint check would still catch the drift on the next
     render, so results were correct — but at the cost of one extra sha256
     per stale entry, and the intent was hidden. Added the invalidator call,
     mirroring the `amerge_into` pattern.

  2. Major — the reflection token-count cache was architecturally
     ineffective. Verified in reflection.py: `aload_reflections` and
     `_aload_reflections_full` both read fresh from disk on every call —
     there is no `self._reflections` process-resident view mirroring
     `PersonaManager._personas`. Any cache writeback on a reflection entry
     lived on a transient list that was garbage-collected on render exit,
     so the "cache" was doing the work without ever serving a hit.

     Chose Option A (be honest): drop the reflection cache entirely rather
     than grow an in-memory view (a bigger refactor, beyond this PR's
     perf-optimization scope) or persist derived fields on render (which
     would bypass the event-log contract). Persona-side caching still
     captures the bulk of render time in typical workloads.

  Implementation:
  - `_normalize_reflection` no longer defaults the three `token_count*`
    fields. Legacy reflections that already carry the fields are preserved
    byte-for-byte (no data loss on existing on-disk state).
  - `_get_cached_token_count` / `_aget_cached_token_count` gained a
    `writeback=True` kwarg. When False, we still short-circuit on a
    pre-existing cache hit (free) but never mutate the entry.
  - `_score_trim_entries` / `_ascore_trim_entries` gained a
    `cache_writeback` kwarg, forwarded to the token-count helper.
  - Both reflection call sites in `arender_persona_markdown` +
    `_compose_persona_markdown` now pass `cache_writeback=False`, so the
    render math still works without polluting reflection.json.

  Tests:
  - Added `test_character_card_sync_invalidates_cache_on_text_rewrite`
    (regression guard for fix #1).
  - Replaced `test_cache_survives_reflection_save_reload_roundtrip` with
    `test_reflection_render_does_not_pollute_cache_fields` (locks in the
    new contract: a render + save round-trip leaves cache fields absent).
  - Replaced `test_normalize_reflection_defaults_cache_fields_to_none` with
    `test_normalize_reflection_does_not_default_cache_fields` (also
    verifies legacy on-disk entries that happen to carry the fields aren't
    silently dropped).

  All 64 targeted tests green. Broader tests/unit/ suite: 950 passed; the
  2 failures + 2 errors are pre-existing on the branch tip before this
  commit (verified via `git stash` + rerun).

* fix(memory/tokencache): defensive int validation on cache hit to survive poisoned disk

  The hit path previously did `int(entry['token_count'])` unconditionally,
  trusting whatever came off disk. A hand-edited or corrupted persona.json
  could plant a non-numeric string ("??"), null, boolean, or negative value
  while the sha256 + tokenizer fingerprints still match — in which case the
  bare int() either raised (bombing the render) or returned meaningless
  garbage (blowing the budget math).

  New `_coerce_cached_count` helper validates the cached value: it returns
  the non-negative int on success, and None on None / non-numeric / boolean
  / negative input so the hit branch falls through to recompute. Booleans
  are rejected explicitly (bool is an int subclass, so `int(True) == 1`
  would otherwise silently succeed). Applies to both sync and async twins.

  Regression tests cover non-numeric, negative, null, and boolean poisoned
  values on both render paths, plus direct unit coverage of the coercion
  helper's shape.

* fix(memory/tokencache): reject non-integer floats and inf/nan in cache coerce

  _coerce_cached_count previously used a bare int(raw), which silently
  truncated 1.9 → 1 (a poisoned cache value matching the fingerprint would
  make the render budget slightly off) and let float('inf') escape as an
  uncaught OverflowError mid-render. Reject both: non-integer floats (via
  .is_integer()) and any float overflow / TypeError / ValueError /
  OverflowError → cache miss + recompute. Reported by coderabbitai on
  PR #939.

* docs(memory/tokencache): fix reflection contract misstatement in cache comment

  The module-level comment block claimed both _normalize_entry and
  _normalize_reflection default the three token_count* fields to None. In
  practice _normalize_reflection deliberately does NOT — reflections skip
  the defaults because their renders use writeback=False (there is no
  in-memory cache to persist into). Rewrite the comment to state each
  side's contract accurately.

---------

Co-authored-by: Hongzhi Wen <cartabio.coder1@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
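The `utils.tokenize.tokenizer_identity()` helper the message describes is not part of the diff shown below. A minimal hypothetical sketch of its contract — the `_encoder` cache name and the `HEURISTIC_VERSION` tag are assumptions for illustration, not the repo's actual implementation:

```python
# Hypothetical sketch of tokenizer_identity() as described in the commit
# message. The module-level `_encoder` cache and HEURISTIC_VERSION are
# assumed names; only the output shape follows the commit.
HEURISTIC_VERSION = "v1"  # assumed version tag for the fallback counter

_encoder = None  # encoder cache: None until tiktoken loads successfully


def tokenizer_identity() -> str:
    """Fingerprint of the counter currently in use:
    'tiktoken:<encoding>' when the encoder cache is populated,
    'heuristic:<version>' when counting falls back to the heuristic."""
    if _encoder is not None:
        # tiktoken Encoding objects expose their name (e.g. o200k_base)
        return f"tiktoken:{getattr(_encoder, 'name', 'o200k_base')}"
    return f"heuristic:{HEURISTIC_VERSION}"
```

Stamping this string into each cache entry is what lets a hit require the same counter that produced the count, not just the same text.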
1 parent 360e3aa commit 86d9bd7

4 files changed

Lines changed: 1185 additions & 7 deletions

File tree

memory/persona.py

Lines changed: 219 additions & 7 deletions
@@ -41,7 +41,7 @@
     robust_json_loads,
 )
 from utils.logger_config import get_module_logger
-from utils.tokenize import acount_tokens, count_tokens
+from utils.tokenize import acount_tokens, count_tokens, tokenizer_identity
 
 logger = get_module_logger(__name__, "Memory")
 
@@ -510,6 +510,11 @@ def _sync_entity(entity: str, card_data: dict) -> bool:
                 # Text changed → update
                 old_text = entry.get('text', '')
                 entry['text'] = text
+                # The token_count cache is derived from `text`; rewriting
+                # text in place must invalidate the cache in the same step,
+                # or the render path waits for a fingerprint mismatch and
+                # wastes an extra sha256 (dual of amerge_into's _sync_mutate_entry).
+                self._invalidate_token_count_cache(entry)
                 modified = True
                 logger.info(f"[Persona] {name}: card sync updated [{entity}] \"{old_text[:30]}\" → \"{text[:30]}\"")
             new_card_entries.append(entry)
@@ -662,6 +667,21 @@ def _normalize_entry(entry) -> dict:
        - rein_last_signal_at / disp_last_signal_at: independent decay clocks
        - sub_zero_days + sub_zero_last_increment_date: archive countdown
        - merged_from_ids: reflection ids absorbed by LLM merge_into decisions
+
+        Token-count cache fields (derived, cache-only — not event-sourced):
+        - token_count: int | None — cached acount_tokens(text)
+        - token_count_text_sha256: str | None — fingerprint of the text that
+          was tokenized; a mismatch triggers recompute on the next render.
+        - token_count_tokenizer: str | None — fingerprint of the counter
+          used when `token_count` was written (e.g. `tiktoken:o200k_base`
+          or `heuristic:v1`). A mismatch with the current tokenizer
+          identity also triggers recompute, so a cache warmed under
+          tiktoken doesn't get served to a heuristic-fallback render.
+
+        Zero-migration schema addition: existing on-disk entries without
+        these fields naturally read as None via `.get()`, which counts as
+        a cache miss and triggers a clean recompute on first render. No
+        explicit migration event is needed.
        """
        defaults = {
            'id': '',  # unique identifier
@@ -684,6 +704,16 @@ def _normalize_entry(entry) -> dict:
             'user_fact_reinforce_count': 0,
             # Provenance: reflection ids absorbed by merge_into
             'merged_from_ids': [],
+            # Derived token-count cache — populated by the render path
+            # (`_get_cached_token_count` / `_aget_cached_token_count`)
+            # on first render; it rides along with normal persona saves.
+            # Both text-sha and tokenizer-identity must match for a hit,
+            # so a cache warmed under tiktoken can't be served to a
+            # heuristic-fallback render (e.g. packaging without the
+            # encoding data file).
+            'token_count': None,
+            'token_count_text_sha256': None,
+            'token_count_tokenizer': None,
         }
         if isinstance(entry, str):
             d = dict(defaults)
@@ -1122,6 +1152,13 @@ def _sync_mutate_entry(_view):
             target_entry['disp_last_signal_at'] = now_iso
             target_entry['sub_zero_days'] = 0
             target_entry['merged_from_ids'] = new_merged_from
+            # The token-count cache is derived from `text`; rewriting text
+            # must drop the cache so the next render recomputes. The
+            # fingerprint check would catch the drift anyway, but
+            # explicit invalidation avoids the tiny window where a
+            # concurrent reader might see new text + a stale count, and
+            # saves one sha256 compute on the next render.
+            self._invalidate_token_count_cache(target_entry)
 
         # _sync_save: cloudsave gate + write + cache-evict-on-failure
         # (CodeRabbit PR #936 round-5 Major #1). See
@@ -1742,9 +1779,166 @@ def _collect_all_entries(persona: dict) -> list[dict]:
     # thread on the async path so the FastAPI event loop doesn't stall on
     # batches of ~100 entries.
 
+    # ── token-count cache helpers ────────────────────────────────────
+    #
+    # Each persona entry dict may carry three derived fields populated
+    # on first render: `token_count` (int), `token_count_text_sha256`
+    # (str) and `token_count_tokenizer` (str). `_normalize_entry`
+    # defaults all three to None.
+    #
+    # Reflection entries deliberately do NOT default these fields —
+    # `_normalize_reflection` leaves them absent because reflections
+    # have no process-resident cache (each render re-reads from disk),
+    # so any writeback would be garbage-collected with the transient
+    # list. Reflection renders therefore call the same helpers with
+    # `writeback=False` and never persist cache fields. See the
+    # commentary on `_get_cached_token_count` for the contract.
+    #
+    # Read path (persona): compute sha256(text); if it matches the
+    # stored fingerprint AND the count is populated, use the cached
+    # value. Otherwise compute (sync `count_tokens` / async
+    # `acount_tokens`) and write all three fields back to the in-memory
+    # entry. The cache is never written to disk directly — it rides
+    # along whenever the persona is otherwise saved (add_fact,
+    # amerge_into, asave_persona, etc.). A fresh process boot
+    # re-tokenizes on first render, which is an acceptable warm-up cost.
+    #
+    # Red line compliance: the cache is purely derived from `text`, so
+    # event-sourcing it would duplicate the source of truth (see
+    # RFC §3.6.8 + the "derived values shouldn't produce events"
+    # principle). The in-memory update + ride-along-on-save approach
+    # also avoids "view mutations outside an event" — we only ever
+    # invoke a disk write through existing event-sourced or save-
+    # permitted paths.
+
+    @staticmethod
+    def _text_fingerprint(text: str) -> str:
+        """sha256 hex digest of `text`, used as the cache key. Same
+        encoding as the `rewrite_text_sha256` payload in amerge_into so
+        the two stay consistent if we ever cross-check."""
+        return hashlib.sha256((text or '').encode('utf-8')).hexdigest()
+
+    @classmethod
+    def _get_cached_token_count(cls, entry: dict, *, writeback: bool = True) -> int:
+        """Sync cache-aware token count. Writes `token_count`,
+        `token_count_text_sha256` and `token_count_tokenizer` back to
+        `entry` on miss when `writeback=True` (the default, for persona
+        entries that live in the `_personas` in-memory view and
+        therefore benefit from across-render cache reuse).
+
+        Callers should pass `writeback=False` for entries that do not
+        have a process-resident view (currently: reflection entries,
+        which are always loaded fresh from disk via `aload_reflections`).
+        In that mode we still short-circuit on a pre-existing cache hit
+        — that's free — but we never pollute the entry dict with fields
+        that wouldn't survive the next render anyway.
+
+        Cache hit requires BOTH fingerprints to match:
+        - text sha256 (catches text mutation)
+        - tokenizer identity (catches the tiktoken↔heuristic
+          transition; see the `utils.tokenize.tokenizer_identity`
+          docstring for the motivating scenario — packaging without
+          the encoding data file).
+
+        Additionally, `token_count` must coerce cleanly to a
+        non-negative int. A hand-edited or corrupted `persona.json`
+        could plant a non-numeric or negative value with fingerprints
+        that still happen to match (or match after someone also
+        hand-rewrote the sha256 field) — in which case `int(...)` on
+        the cached value would either raise or return garbage and bomb
+        the render. On coercion failure we treat it as a cache miss
+        and recompute.
+        """
+        text = entry.get('text', '') or ''
+        if not text:
+            return 0
+        fp = cls._text_fingerprint(text)
+        tid = tokenizer_identity()
+        cached_count = cls._coerce_cached_count(entry.get('token_count'))
+        if (
+            cached_count is not None
+            and entry.get('token_count_text_sha256') == fp
+            and entry.get('token_count_tokenizer') == tid
+        ):
+            return cached_count
+        n = count_tokens(text)
+        if writeback:
+            entry['token_count'] = int(n)
+            entry['token_count_text_sha256'] = fp
+            entry['token_count_tokenizer'] = tid
+        return int(n)
+
+    @classmethod
+    async def _aget_cached_token_count(cls, entry: dict, *, writeback: bool = True) -> int:
+        """Async twin — uses `acount_tokens` (worker-thread tiktoken).
+        Write-back semantics match the sync helper (both fingerprints).
+        See `_get_cached_token_count` for the `writeback=False` contract
+        (used by the reflection render path, which has no in-memory
+        view), and for the defensive coercion of poisoned `token_count`
+        values from a hand-edited or corrupted `persona.json`."""
+        text = entry.get('text', '') or ''
+        if not text:
+            return 0
+        fp = cls._text_fingerprint(text)
+        tid = tokenizer_identity()
+        cached_count = cls._coerce_cached_count(entry.get('token_count'))
+        if (
+            cached_count is not None
+            and entry.get('token_count_text_sha256') == fp
+            and entry.get('token_count_tokenizer') == tid
+        ):
+            return cached_count
+        n = await acount_tokens(text)
+        if writeback:
+            entry['token_count'] = int(n)
+            entry['token_count_text_sha256'] = fp
+            entry['token_count_tokenizer'] = tid
+        return int(n)
+
     @staticmethod
+    def _coerce_cached_count(raw) -> int | None:
+        """Validate a `token_count` value loaded from an entry dict.
+
+        Returns the non-negative int when `raw` is coercible and sane;
+        returns None (→ force a cache miss) when `raw` is missing,
+        non-numeric, a bool, a non-integer float (1.9 would silently
+        truncate to 1), `inf` / `nan` (`int(inf)` raises
+        `OverflowError`), or negative.
+
+        `bool` is a subclass of `int` in Python, so the explicit
+        `isinstance(raw, bool)` reject keeps us from accepting `True` /
+        `False` as legitimate cached counts if persona.json was
+        hand-edited with boolean-looking garbage."""
+        if raw is None or isinstance(raw, bool):
+            return None
+        if isinstance(raw, float):
+            if not raw.is_integer():
+                return None
+            if raw < 0:
+                return None
+            return int(raw)
+        try:
+            value = int(raw)
+        except (TypeError, ValueError, OverflowError):
+            return None
+        if value < 0:
+            return None
+        return value
+
+    @staticmethod
+    def _invalidate_token_count_cache(entry: dict) -> None:
+        """Explicitly drop the cached count. Called by code paths that
+        rewrite `entry['text']` (e.g. `amerge_into`) to avoid the tiny
+        window where a concurrent reader sees new text + a stale count.
+        The fingerprint check would catch it anyway, but explicit
+        invalidation is clearer and saves one sha256 compute on the
+        next render."""
+        entry['token_count'] = None
+        entry['token_count_text_sha256'] = None
+        entry['token_count_tokenizer'] = None
+
+    @classmethod
     def _score_trim_entries(
-            entries: list, budget: int, now: datetime,
+            cls, entries: list, budget: int, now: datetime,
+            *, cache_writeback: bool = True,
     ) -> list:
         """Sync score-trim: sort by (evidence_score, importance) DESC, keep
         entries whose accumulated `count_tokens(text)` ≤ `budget`. Stops at
@@ -1753,6 +1947,13 @@ def _score_trim_entries(
 
         `entries` is a list of dicts (no entity tagging — caller sorts/keys
         as needed). Returns the kept subset preserving the score-DESC order.
+
+        `cache_writeback`: default True writes the `token_count` fields back
+        onto each entry for across-render reuse (persona path — entries
+        live in `_personas`). Pass False for reflection entries, which are
+        loaded fresh from disk every render and would have no persistent
+        view to cache against; writing cache fields there would be
+        misleading and pollute reflection.json on the next save.
         """
         sorted_entries = sorted(
             entries,
@@ -1765,19 +1966,21 @@ def _score_trim_entries(
         kept = []
         total = 0
         for e in sorted_entries:
-            t = count_tokens(e.get('text', '') or '')
+            t = cls._get_cached_token_count(e, writeback=cache_writeback)
             if total + t > budget:
                 break
             kept.append(e)
             total += t
         return kept
 
-    @staticmethod
+    @classmethod
     async def _ascore_trim_entries(
-            entries: list, budget: int, now: datetime,
+            cls, entries: list, budget: int, now: datetime,
+            *, cache_writeback: bool = True,
     ) -> list:
         """Async twin of `_score_trim_entries`. Identical math; the only
-        difference is `acount_tokens` (worker-thread tiktoken)."""
+        difference is `acount_tokens` (worker-thread tiktoken). See the
+        sync twin for the `cache_writeback` contract."""
         sorted_entries = sorted(
             entries,
             key=lambda e: (
@@ -1789,7 +1992,7 @@ async def _ascore_trim_entries(
         kept = []
         total = 0
         for e in sorted_entries:
-            t = await acount_tokens(e.get('text', '') or '')
+            t = await cls._aget_cached_token_count(e, writeback=cache_writeback)
             if total + t > budget:
                 break
             kept.append(e)
@@ -1993,6 +2196,11 @@ def _compose_persona_markdown(
                 persona, suppressed_text_set,
             ),
             REFLECTION_RENDER_TOKEN_BUDGET, now,
+            # Reflections have no `_personas`-style in-memory view — they're
+            # always loaded fresh from disk. Writing cache fields onto the
+            # transient dicts would be garbage-collected on render exit and
+            # could only pollute reflection.json on the next save.
+            cache_writeback=False,
         )
         # Preserve the score-DESC order produced by _score_trim_entries.
         # The previous implementation filtered the ORIGINAL source lists by
@@ -2089,6 +2297,10 @@ async def arender_persona_markdown(
                 persona, suppressed_text_set,
             ),
             REFLECTION_RENDER_TOKEN_BUDGET, now,
+            # See sync twin: reflections have no `_personas`-style
+            # in-memory view, so we compute fresh every render without
+            # writing cache fields back onto the transient dicts.
+            cache_writeback=False,
         )
         # Preserve score-DESC order from _ascore_trim_entries — mirror of
         # the sync path fix in _compose_persona_markdown (CodeRabbit PR
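The persona-side read path in the diff above boils down to a two-fingerprint check plus optional writeback. A self-contained distillation with a stub tokenizer (the standalone `cached_token_count` function is illustrative only; the field names follow the diff, everything else is assumed):

```python
import hashlib


def _fingerprint(text: str) -> str:
    # Same sha256-hex keying as PersonaManager._text_fingerprint
    return hashlib.sha256((text or "").encode("utf-8")).hexdigest()


def cached_token_count(entry, count_tokens, tokenizer_id, *, writeback=True):
    # A hit requires BOTH fingerprints plus a sane cached int.
    text = entry.get("text", "") or ""
    if not text:
        return 0
    fp = _fingerprint(text)
    cached = entry.get("token_count")
    if (
        isinstance(cached, int)
        and not isinstance(cached, bool)  # bool is an int subclass
        and cached >= 0
        and entry.get("token_count_text_sha256") == fp
        and entry.get("token_count_tokenizer") == tokenizer_id
    ):
        return cached  # warm hit: no re-tokenization
    n = count_tokens(text)
    if writeback:  # persona path; reflections pass writeback=False
        entry["token_count"] = n
        entry["token_count_text_sha256"] = fp
        entry["token_count_tokenizer"] = tokenizer_id
    return n


calls = []


def fake_count(text):  # stand-in tokenizer so the sketch is self-contained
    calls.append(text)
    return len(text.split())


entry = {"text": "hello brave new world"}
first = cached_token_count(entry, fake_count, "tiktoken:o200k_base")   # miss → tokenize
second = cached_token_count(entry, fake_count, "tiktoken:o200k_base")  # warm hit
third = cached_token_count(entry, fake_count, "heuristic:v1")          # identity mismatch → recompute
```

The third call shows why the tokenizer identity is part of the key: the text is unchanged, but a different counter must not be served the old count.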

memory/reflection.py

Lines changed: 14 additions & 0 deletions
@@ -191,6 +191,20 @@ def _normalize_reflection(entry: dict) -> dict:
    Also adds the `recent_mentions` / `suppress` mention-suppression fields
    so confirmed reflections can share persona's 5h-window rate-limit
    machinery (AI self-restraint: don't repeat the same item within 5h).
+
+    Note on token-count cache: persona entries carry a
+    `token_count` / `token_count_text_sha256` / `token_count_tokenizer`
+    cache that survives across renders because `PersonaManager._personas`
+    is a process-resident view — render-time cache writes land on the
+    same dict that the next render reads. Reflections do NOT have an
+    equivalent in-memory view; `aload_reflections` /
+    `_aload_reflections_full` always read fresh from disk, so any cache
+    writeback on a reflection entry would be garbage-collected right
+    after the render. We deliberately do NOT add the cache fields to
+    this schema — populating them would be misleading (they'd look like
+    a working cache but every render still tokenizes from scratch).
+    Reflection tokenization stays live; in typical workloads the
+    persona-side cache captures the bulk of render time anyway.
    """
    defaults = {
        # Evidence counters
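The defensive coercion introduced in persona.py above can be exercised in isolation. This sketch mirrors the `_coerce_cached_count` contract for illustration (it is not the repo's code): non-negative int on success, None on anything poisoned so the hit branch falls through to recompute.

```python
def coerce_cached_count(raw):
    # Mirror of the commit's _coerce_cached_count contract:
    # return a non-negative int, or None to force a cache miss.
    if raw is None or isinstance(raw, bool):
        return None  # bool is an int subclass; reject explicitly
    if isinstance(raw, float):
        # inf/nan fail is_integer(); 1.9 must not truncate to 1
        if not raw.is_integer() or raw < 0:
            return None
        return int(raw)
    try:
        value = int(raw)
    except (TypeError, ValueError, OverflowError):
        return None  # "??", lists, etc. → cache miss
    return value if value >= 0 else None
```

Every rejected shape corresponds to a value a hand-edited persona.json could plant while the fingerprints still match.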
