checkpoint_latest:* pointer keys ignore checkpoint_prefix, colliding across savers on a shared Redis Stack #187

Description

@JerryChaox

Summary

AsyncRedisSaver (and the sync RedisSaver) accept checkpoint_prefix / checkpoint_write_prefix so multiple deployments can share a Redis Stack instance without key collisions. JSON document keys ({prefix}:lg:checkpoint:..., {prefix}:lg:checkpoint_write:...) honor those prefixes correctly. But the "latest pointer" keys used by aget_tuple (when no checkpoint_id is supplied) are built as bare strings that ignore the configured prefix:

latest_pointer_key = f"checkpoint_latest:{storage_safe_thread_id}:{storage_safe_checkpoint_ns}"

This makes prefixing effectively useless for the "fetch latest checkpoint" path that the LangGraph runtime uses on every Pregel start.

Affected sites (v0.4.1, also present on main)

langgraph/checkpoint/redis/aio.py:

  • L410 — read in aget_tuple
  • L1010 — write in aput
  • L1996 — write in aput's cancelled-fallback
  • L2179 — delete in aprune
  • L2046–2048 / L2186–2189 — pipeline.delete(latest_pointer_key) in adelete_thread / aprune

langgraph/checkpoint/redis/__init__.py (sync variants): L580 / L758 / L1602 / L1788.

The SET site in aput is unmistakable:

latest_pointer_key = f"checkpoint_latest:{storage_safe_thread_id}:{storage_safe_checkpoint_ns}"
await self._redis.set(latest_pointer_key, checkpoint_key)

— no prefix folded in.
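
To make the consequence of that f-string concrete, here is a minimal in-memory sketch. It assumes a simplified key layout of {prefix}:{thread_id}:{ns}:{checkpoint_id} for documents; the dict `store`, the helpers `doc_key`, `bare_pointer_key`, and `put`, and the ids `cp-1` / `cp-2` are hypothetical stand-ins, not library code.

```python
# Hypothetical model of the key layout; the library's real builders may differ.
def doc_key(prefix: str, thread_id: str, ns: str, checkpoint_id: str) -> str:
    # Document keys include the configured prefix.
    return f"{prefix}:{thread_id}:{ns}:{checkpoint_id}"


def bare_pointer_key(thread_id: str, ns: str) -> str:
    # Mirrors the f-string quoted above: the prefix plays no part.
    return f"checkpoint_latest:{thread_id}:{ns}"


store: dict[str, str] = {}  # stands in for the shared Redis keyspace


def put(prefix: str, thread_id: str, ns: str, checkpoint_id: str) -> str:
    key = doc_key(prefix, thread_id, ns, checkpoint_id)
    store[key] = f"doc written under {prefix}"
    store[bare_pointer_key(thread_id, ns)] = key  # last writer wins
    return key


key_a = put("env-a:lg:checkpoint", "t1", "__empty__", "cp-1")
key_b = put("env-b:lg:checkpoint", "t1", "__empty__", "cp-2")
pointer = store[bare_pointer_key("t1", "__empty__")]
print(pointer)
```

Both documents survive under their own prefixes, but only one pointer entry exists, and it holds whichever key was written last.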

Impact

On a Redis Stack hosting multiple deployments (e.g. staging + prod, or multi-tenant), every saver instance reads / writes the same global checkpoint_latest:{thread}:{ns} keyspace.

  1. Cross-deployment overwrite. If two deployments use overlapping thread_id shapes, the last writer's pointer wins. The LangGraph runtime leaves thread_id entirely to the application, so UUIDs reduce the risk but do not eliminate it.
  2. Silent decode failure. Once a pointer resolves to a doc key under the other env's prefix, the saver's pipeline JSON.GET checkpoint_key returns None (the doc exists, just under a different prefix). aget_tuple falls through to return None — the saver reports "no checkpoints exist" and the conversation silently forgets its history.
  3. No log signal. The failure looks identical to a fresh thread, so the issue is hard to attribute. We initially mistook this for a Pregel bug.

We hit (2) in production after migrating an environment onto a Redis Stack that the other env was already using.

Reproduction

Local Redis Stack (docker run --rm -d -p 6379:6379 redis/redis-stack:latest), langgraph-checkpoint-redis==0.4.1:

import asyncio
from redis.asyncio import Redis
from langgraph.checkpoint.redis.aio import AsyncRedisSaver
from langgraph.checkpoint.base import empty_checkpoint


async def main() -> None:
    admin = Redis.from_url("redis://localhost:6379", decode_responses=False)
    await admin.flushdb()

    # Two deployments sharing one Redis, each with its own prefix.
    saver_a = AsyncRedisSaver(
        redis_client=Redis.from_url("redis://localhost:6379", decode_responses=False),
        checkpoint_prefix="env-a:lg:checkpoint",
        checkpoint_write_prefix="env-a:lg:checkpoint_write",
    )
    saver_b = AsyncRedisSaver(
        redis_client=Redis.from_url("redis://localhost:6379", decode_responses=False),
        checkpoint_prefix="env-b:lg:checkpoint",
        checkpoint_write_prefix="env-b:lg:checkpoint_write",
    )
    await saver_a.asetup()
    await saver_b.asetup()

    cfg = {"configurable": {"thread_id": "t1", "checkpoint_ns": ""}}

    cp_a = empty_checkpoint()
    cp_a["channel_values"] = {"owner": "A"}
    cp_b = empty_checkpoint()
    cp_b["channel_values"] = {"owner": "B"}
    await saver_a.aput(cfg, cp_a, {}, {})
    await saver_b.aput(cfg, cp_b, {}, {})  # overwrites A's bare pointer

    # Async list comprehensions: sorted() cannot consume an async generator directly.
    print("A keys:", sorted([k async for k in admin.scan_iter(match="env-a:*")]))
    print("B keys:", sorted([k async for k in admin.scan_iter(match="env-b:*")]))
    print("bare pointer:", await admin.get(b"checkpoint_latest:t1:__empty__"))

    tup_a = await saver_a.aget_tuple(cfg)
    print("A read:", tup_a)  # None: A's checkpoint is silently dropped


asyncio.run(main())

Output:

A keys: [b'env-a:lg:checkpoint:t1:__empty__:...']   # A's doc still exists under env-a prefix
B keys: [b'env-b:lg:checkpoint:t1:__empty__:...']   # B's doc under env-b prefix
bare pointer: b'env-b:lg:checkpoint:t1:__empty__:...'  # global pointer now points at B's doc
A read: None                                        # A's aget_tuple can't decode env-b's doc
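
Because the failure looks identical to a fresh thread, one way to spot it is to compare each pointer's value against the deployment's own prefix. The sketch below is a pure function over a snapshot of pointer keys and values; on a live instance such a snapshot could be collected with scan_iter(match="checkpoint_latest:*") and get, as in the reproduction. The function name `find_foreign_pointers` and the sample ids are hypothetical.

```python
def find_foreign_pointers(
    pointers: dict[bytes, bytes], own_prefix: bytes
) -> list[bytes]:
    """Return pointer keys whose target doc key is not under own_prefix."""
    marker = own_prefix + b":"
    return sorted(
        key for key, target in pointers.items() if not target.startswith(marker)
    )


# Hypothetical snapshot: t1 was overwritten by env-b, t2 still belongs to env-a.
snapshot = {
    b"checkpoint_latest:t1:__empty__": b"env-b:lg:checkpoint:t1:__empty__:cp-2",
    b"checkpoint_latest:t2:__empty__": b"env-a:lg:checkpoint:t2:__empty__:cp-7",
}
foreign = find_foreign_pointers(snapshot, b"env-a:lg:checkpoint")
print(foreign)
```

Any key this returns for env-a is a thread whose "latest" lookup would resolve outside env-a's prefix.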

Workaround

We shipped a wrapper that proxies the Redis client (and its pipeline()) passed into the saver. It intercepts the four operations the library performs on checkpoint_latest:* keys (GET / SET / EXPIRE / DELETE) and rewrites those keys to live under our deployment's master prefix. A read-time fallback honours the legacy bare key only when the doc it points at lives under our env, so threads that were active before the deploy keep working without leaking cross-env data.

This works but it's ~120 lines of proxy code + 19 unit tests + a real-Redis smoke suite + a one-shot migration script we'd rather not maintain against future versions of this library. It also broke once because our initial proxy missed __aenter__ / __aexit__ on the pipeline wrapper, which redisvl/index/storage.py:awrite requires — so this workaround is fragile in ways checkpoint_prefix users probably don't expect to need to know.
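
The core of the key-rewriting idea can be sketched in a few lines. This is a simplified synchronous illustration, not our production proxy: `FakeRedis` is an in-memory stand-in, `PointerPrefixProxy` and the `master_prefix` argument are hypothetical names, and a real wrapper would also need async methods, a pipeline() wrapper with __aenter__ / __aexit__, and the legacy-key fallback described above.

```python
class FakeRedis:
    """In-memory stand-in for a Redis client (illustration only)."""

    def __init__(self) -> None:
        self.data: dict[str, str] = {}

    def get(self, key: str):
        return self.data.get(key)

    def set(self, key: str, value: str) -> bool:
        self.data[key] = value
        return True

    def expire(self, key: str, seconds: int) -> bool:
        return key in self.data  # TTLs are not modelled here

    def delete(self, *keys: str) -> int:
        return sum(1 for k in keys if self.data.pop(k, None) is not None)


class PointerPrefixProxy:
    """Rewrites checkpoint_latest:* keys so they live under a master prefix."""

    POINTER_PREFIX = "checkpoint_latest:"

    def __init__(self, client, master_prefix: str) -> None:
        self._client = client
        self._master_prefix = master_prefix

    def _rewrite(self, key: str) -> str:
        if key.startswith(self.POINTER_PREFIX):
            return f"{self._master_prefix}:{key}"
        return key  # every other key passes through untouched

    def get(self, key: str):
        return self._client.get(self._rewrite(key))

    def set(self, key: str, value: str) -> bool:
        return self._client.set(self._rewrite(key), value)

    def expire(self, key: str, seconds: int) -> bool:
        return self._client.expire(self._rewrite(key), seconds)

    def delete(self, *keys: str) -> int:
        return self._client.delete(*(self._rewrite(k) for k in keys))

    def __getattr__(self, name: str):
        # Anything not intercepted is delegated to the wrapped client.
        return getattr(self._client, name)


shared = FakeRedis()
env_a = PointerPrefixProxy(shared, "env-a")
env_b = PointerPrefixProxy(shared, "env-b")
env_a.set("checkpoint_latest:t1:__empty__", "env-a:lg:checkpoint:t1:__empty__:cp-1")
env_b.set("checkpoint_latest:t1:__empty__", "env-b:lg:checkpoint:t1:__empty__:cp-2")
print(env_a.get("checkpoint_latest:t1:__empty__"))
```

With the rewrite in place, each deployment reads back its own pointer even though both issue the same bare key.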

Suggested fix

Either:

  1. Apply checkpoint_prefix to the latest-pointer keys too. The minimal change is to extract a small helper

    def _make_latest_pointer_key(self, thread_id: str, ns: str) -> str:
        return f"{self.checkpoint_prefix}:latest:{thread_id}:{ns}"
        # or any shape that's documented and namespace-scoped

    and call it from the five sites that currently inline the f-string. Existing deployments will need a one-shot migration of bare → prefixed pointers — easy to include as a migrate_latest_pointers() utility on the saver that scans checkpoint_latest:* once and rewrites each.

  2. Or, document that checkpoint_prefix is not sufficient to isolate multiple savers on a shared Redis Stack and recommend separate Redis databases (db=N) or separate instances per deployment. This is the cheaper docs-only fix but it's a footgun for anyone discovering checkpoint_prefix and assuming it does what it appears to.
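
For option 1, the one-shot migration could look roughly like the sketch below. It operates on a plain dict standing in for the keyspace, uses the {checkpoint_prefix}:latest:{thread_id}:{ns} shape from the helper above, and only moves pointers whose target already lives under the caller's prefix. The function body is an assumption about how such a utility might behave, not an existing API; a real version would scan and rewrite through the Redis client.

```python
def migrate_latest_pointers(store: dict[str, str], checkpoint_prefix: str) -> int:
    """Move bare checkpoint_latest:* pointers under checkpoint_prefix.

    Returns the number of pointers migrated. Pointers whose target belongs
    to another deployment are left untouched.
    """
    bare = "checkpoint_latest:"
    own = checkpoint_prefix + ":"
    moved = 0
    for key in [k for k in store if k.startswith(bare)]:
        target = store[key]
        if not target.startswith(own):
            continue  # points at another deployment's doc; not ours to move
        thread_and_ns = key[len(bare):]
        store[f"{checkpoint_prefix}:latest:{thread_and_ns}"] = target
        del store[key]
        moved += 1
    return moved


# Hypothetical keyspace: one pointer owned by env-a, one owned by env-b.
keyspace = {
    "checkpoint_latest:t1:__empty__": "env-b:lg:checkpoint:t1:__empty__:cp-2",
    "checkpoint_latest:t2:__empty__": "env-a:lg:checkpoint:t2:__empty__:cp-7",
}
migrated = migrate_latest_pointers(keyspace, "env-a:lg:checkpoint")
print(migrated, sorted(keyspace))
```

A pointer that another deployment has already overwritten cannot be recovered this way; the affected thread's latest checkpoint would have to be rediscovered from the documents under its own prefix.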

Option 1 is what we'd prefer (the kwarg's existence implies isolation). Happy to send a PR if there's interest — let us know if you'd want it as a single bump (with migration helper) or split.

Environment

  • langgraph-checkpoint-redis==0.4.1 (latest tag as of 2026-05-19), verified on main
  • langgraph-checkpoint 2.x
  • redis-py 5.x, redisvl 0.x
  • Redis Stack 7.x
  • Python 3.11 (prod container) and 3.14 (dev)
