State lost on WebSocket reconnect: link_token_to_sid emits new_token when old sid hasn't disconnected yet #6345

  ## Description

  `LocalTokenManager.link_token_to_sid` (introduced in #4607, Reflex 0.8.5) treats any reconnection where the same `client_token` arrives
  with a different `sid` as a "duplicate tab" and generates a new token via `emit("new_token", ...)`.

  Because state managers (`StateManagerDisk`, `StateManagerRedis`) index state by `client_token`, regenerating it **silently discards the
  user's session** — authenticated state, variables, everything. The user sees the UI reset and (if auth is in place) is effectively
  logged out.

  The duplicate-tab heuristic fires in at least two non-duplicate-tab scenarios:

  1. **WebSocket reconnect** after a transient close. The browser opens a new socket because the previous one dropped, but the server
  hasn't processed the old socket's `disconnect` yet. In production behind Caddy we measured the window between the new `connect` and the
  old `disconnect`: **~800 ms**. During that window the check `token_to_socket[token].sid != sid` triggers "duplicate".
  2. **Backend crash/restart** with persisted state (already reported by @hr-alebel in the #4607 comments, Oct 31 2025).
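The ordering problem in scenario 1 can be reproduced with a toy model. This is a minimal sketch, not Reflex's actual implementation: `ToyTokenManager`, `link`, `disconnect` and `reconnect` are hypothetical names, and the only thing carried over from the real code is the "same token, different sid" comparison described above.

```python
import uuid


class ToyTokenManager:
    """Simplified stand-in for the duplicate-tab check (not Reflex's code)."""

    def __init__(self):
        self.token_to_sid = {}
        self.sid_to_token = {}

    def link(self, token, sid):
        """Link token to sid; return a new token if the token looks duplicated."""
        old_sid = self.token_to_sid.get(token)
        if old_sid is not None and old_sid != sid:
            # Same token, different sid: treated as a duplicate tab.
            token = uuid.uuid4().hex
            self.token_to_sid[token] = sid
            self.sid_to_token[sid] = token
            return token
        self.token_to_sid[token] = sid
        self.sid_to_token[sid] = token
        return None

    def disconnect(self, sid):
        """Unlink sid and, if it still owns its token, the token as well."""
        token = self.sid_to_token.pop(sid, None)
        if token is not None and self.token_to_sid.get(token) == sid:
            del self.token_to_sid[token]


def reconnect(disconnect_first):
    """Return the rotated token handed out on reconnect, or None if kept."""
    mgr = ToyTokenManager()
    mgr.link("tok", "sid-old")
    if disconnect_first:
        mgr.disconnect("sid-old")
    rotated = mgr.link("tok", "sid-new")
    if not disconnect_first:
        mgr.disconnect("sid-old")  # arrives after the new connect
    return rotated


print("token kept when disconnect is processed first:", reconnect(True) is None)
print("token kept when connect is processed first:", reconnect(False) is None)
```

The two calls differ only in the order of the two events, which is the ordering the ~800 ms window inverts in production.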

  This is not a new observation: @masenf raised exactly this concern in a review comment on #4607 on Jan 21 2025:

  > *"Revisiting this PR, is there ever a case where we wouldn't get `on_disconnect` to unlink the sid/token, resulting in a
  refresh/reconnect leading to loss of token and state for a client?"*

  @benedikt-bartscher suggested opening a formal issue on Nov 1 2025. This is that issue.

  ## To Reproduce

  Any Reflex 0.8.5+ app with authenticated state and a reverse proxy in front. We tested with Caddy 2, but any proxy that closes idle TCP
  connections, or any transient network blip, should trigger the same behavior:

  1. Open the app, log in (OIDC or any auth that sets state on connect).
  2. Leave the tab idle ~60-120 s (time depends on the proxy's idle/keepalive settings).
  3. Interact again — navigate or trigger an event.
  4. The browser opens a new WebSocket (expected — transient close).
  5. Reflex emits `new_token`, the client's `client_token` rotates, the auth state is gone.

  Reproducible 100% of the time on our end. @hr-alebel reports the same for the backend-restart variant, also 100% reproducible.

  ## Expected behavior

  A transient WebSocket reconnect by the same browser should preserve `client_token` and the associated state. The "duplicate tab"
  heuristic should either:

  - Only fire when the old sid is *confirmed alive* (not merely still in the map), or
  - Reuse the token for the new sid and let the stale sid mapping be cleaned up on its eventual `on_disconnect`.
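As a rough illustration of the first option, the sketch below rotates the token only when the previous sid passes a liveness check. Everything here is hypothetical: `link_with_liveness` and the `is_alive` callback are illustrative names, and how liveness would actually be established is left open. In the race reported here the server still considers the old sid alive, so a passive lookup would not be enough and some active probe would be required.

```python
import uuid
from typing import Callable, Dict, Optional


def link_with_liveness(
    token_to_sid: Dict[str, str],
    token: str,
    sid: str,
    is_alive: Callable[[str], bool],
) -> Optional[str]:
    """Return a replacement token only if another *live* sid holds `token`."""
    old_sid = token_to_sid.get(token)
    if old_sid is not None and old_sid != sid and is_alive(old_sid):
        # Genuine duplicate tab: the other socket is still responding.
        new_token = uuid.uuid4().hex
        token_to_sid[new_token] = sid
        return new_token
    # No holder, same sid, or a stale holder: the new sid takes over the token.
    token_to_sid[token] = sid
    return None


# Stale holder: the token is preserved and re-pointed at the new sid.
mapping = {"tok": "sid-old"}
kept = link_with_liveness(mapping, "tok", "sid-new", lambda s: False)

# Live holder: a second tab gets its own token, the first tab keeps "tok".
mapping_dup = {"tok": "sid-tab1"}
rotated = link_with_liveness(mapping_dup, "tok", "sid-tab2", lambda s: True)
```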

  ## Actual behavior — evidence from production logs

  Sequence captured with the `socketio.server` and `engineio.server` loggers at INFO and a custom wrapper around
  `EventNamespace.on_connect` / `on_disconnect`. Portal behind Caddy 2, Reflex 0.8.28.post1, `StateManagerDisk`:

  ```
  12:00:12 INFO engineio.server | iA82ql0ePeBzV8pyAAAE: Sending packet PING
  12:00:12 INFO engineio.server | iA82ql0ePeBzV8pyAAAE: Received packet PONG
  12:00:37 INFO engineio.server | iA82ql0ePeBzV8pyAAAE: Sending packet PING
  12:00:56 INFO engineio.server | 4rg_LN_9Z6veURYDAAAG: Upgrade to websocket successful
  12:00:56 INFO portal_web_gic.ws | WS connect sid=PicQCyE7PIbmpLuvAAAH
  12:00:56 INFO socketio.server | emitting event "new_token" to PicQCyE7PIbmpLuvAAAH [/_event]
  12:00:56.800 WARNING portal_web_gic.ws | WS disconnect sid=GfxpxXQDQQHhiihfAAAB reason=transport close
  ```

  The key detail: `new_token` is emitted to the new sid **before** the old sid's `transport close` is processed (~800 ms gap).
  `pingTimeout` is 120 s so this isn't ping-timeout-related; the server still considers the old sid alive when the new one registers.

  ## Environment

  - Reflex: 0.8.28.post1
  - Python: 3.13.12
  - State manager: `StateManagerDisk`
  - Deployment: Docker (backend + frontend + internal nginx), behind Caddy 2 as edge reverse proxy, Keycloak OIDC for auth
  - Browser: Chrome 147 on Windows 10

  Same behavior reported by @hr-alebel on any Reflex version ≥ 0.8.5 under backend restarts with Redis-persisted state.

  ## Workaround (production-tested)

  We applied a monkey-patch that replaces `LocalTokenManager.link_token_to_sid` with a version that reuses the token instead of generating
  a new one on duplicate detection, and removes the old sid→token mapping so the eventual stale `disconnect` doesn't wipe the shared
  token. The accepted trade-off is that **intentional duplicate tabs share state** — acceptable for our internal corporate portal with
  per-user auth.

  Patch (≈20 lines, applied via `LocalTokenManager.link_token_to_sid = link_token_to_sid_reuse` before `rx.App()`):

  ```python
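  # Import path assumed; adjust if your Reflex version places these elsewhere.
  from reflex.utils.token_manager import LocalTokenManager, SocketRecord


  async def link_token_to_sid_reuse(self, token: str, sid: str) -> None:
      """Link token to sid, reusing the token even if another sid still holds it."""
      existing = self.token_to_socket.get(token)
      if existing is not None and existing.sid != sid:
          # Drop the stale sid→token mapping so the old sid's eventual
          # on_disconnect doesn't wipe the token the new sid is using.
          self.sid_to_token.pop(existing.sid, None)
      self.token_to_socket[token] = SocketRecord(
          instance_id=self.instance_id, sid=sid,
      )
      self.sid_to_token[sid] = token
      return None  # never emit new_token


  # Apply before constructing rx.App().
  LocalTokenManager.link_token_to_sid = link_token_to_sid_reuse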
  ```

  Since the patch modifies the base class, it covers `RedisTokenManager` too.
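To show why the patch drops the stale sid→token entry, here is a small self-contained model using plain dicts. It is a sketch under assumptions: `link_reuse`, `disconnect` and `token_survives` are hypothetical helpers, and the `disconnect` cleanup (remove whatever token the sid was linked to) is an assumed simplification of what the real `on_disconnect` path does.

```python
def link_reuse(token_to_sid, sid_to_token, token, sid, drop_stale=True):
    """Point `token` at `sid`, optionally forgetting the previous sid."""
    old_sid = token_to_sid.get(token)
    if drop_stale and old_sid is not None and old_sid != sid:
        sid_to_token.pop(old_sid, None)
    token_to_sid[token] = sid
    sid_to_token[sid] = token


def disconnect(token_to_sid, sid_to_token, sid):
    """Assumed cleanup: remove whatever token this sid was linked to."""
    token = sid_to_token.pop(sid, None)
    if token is not None:
        token_to_sid.pop(token, None)


def token_survives(drop_stale):
    """Replay connect(old), connect(new), late disconnect(old)."""
    token_to_sid, sid_to_token = {}, {}
    link_reuse(token_to_sid, sid_to_token, "tok", "sid-old", drop_stale)
    link_reuse(token_to_sid, sid_to_token, "tok", "sid-new", drop_stale)
    disconnect(token_to_sid, sid_to_token, "sid-old")  # late disconnect
    return token_to_sid.get("tok") == "sid-new"


print("with stale mapping dropped:", token_survives(True))
print("without dropping it:", token_survives(False))
```

Under this model, reusing the token without removing the old sid's entry would let the late `disconnect` unlink the token from the new socket.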

  ## Related

  - PR #4607 — introduced the TokenManager and the duplicate-tab heuristic
  - [Comment by @masenf, Jan 21 2025](https://github.com/reflex-dev/reflex/pull/4607) raising this exact concern
  - [Comment by @hr-alebel, Oct 31 2025](https://github.com/reflex-dev/reflex/pull/4607) confirming the backend-restart variant
  - Issue #5099 — adjacent reconnection-quality issue, different symptom
  - PR #4953 — "Disconnect old websockets" (related, doesn't cover this race)
  - Issue #5669 / PR #6126 — exposes TokenManager APIs but doesn't fix the underlying heuristic

  ## Would appreciate

  Guidance on whether the preferred upstream fix is (a) the reuse-token approach above, (b) a liveness check on the stale sid before
  declaring duplicate, or (c) something else. Happy to open a PR if there's alignment on the direction.
