State lost on WebSocket reconnect: link_token_to_sid emits new_token when old sid hasn't disconnected yet #6345

@M1L0J05

Description

LocalTokenManager.link_token_to_sid (introduced in #4607, Reflex 0.8.5) treats any reconnection where the same client_token arrives
with a different sid as a "duplicate tab" and generates a new token via emit("new_token", ...).

Because state managers (StateManagerDisk, StateManagerRedis) index state by client_token, regenerating it silently discards the
user's session: authenticated state, variables, everything. The user sees the UI reset and (if auth is in place) is effectively
logged out.

The duplicate-tab heuristic fires in at least two non-duplicate-tab scenarios:

  1. WebSocket reconnect after a transient close. The browser opens a new socket because the previous one dropped, but the server
    hasn't processed the old socket's disconnect yet. In production behind Caddy we measured the window between the new connect and the
    old disconnect: ~800 ms. During that window the check token_to_socket[token].sid != sid triggers "duplicate".
  2. Backend crash/restart with persisted state (already reported by @hr-alebel in the comments on #4607, "fix duplicate tab issue", Oct 31 2025).
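
The race in scenario 1 can be sketched with a toy model. This is a minimal stand-in, not Reflex's actual LocalTokenManager: the class name ToyTokenManager, its two dicts, and the disconnect cleanup are assumptions made for illustration. Only the check "same token, different sid" comes from the description above.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ToyTokenManager:
    """Hypothetical stand-in for the token bookkeeping; not Reflex code."""

    token_to_sid: Dict[str, str] = field(default_factory=dict)
    sid_to_token: Dict[str, str] = field(default_factory=dict)

    def link(self, token: str, sid: str) -> Optional[str]:
        """Return a replacement token if `token` looks like a duplicate tab."""
        known_sid = self.token_to_sid.get(token)
        if known_sid is not None and known_sid != sid:
            # Heuristic described in the issue: same token, different sid.
            token = uuid.uuid4().hex
            self.token_to_sid[token] = sid
            self.sid_to_token[sid] = token
            return token
        self.token_to_sid[token] = sid
        self.sid_to_token[sid] = token
        return None

    def disconnect(self, sid: str) -> None:
        """Assumed cleanup: drop both directions of the mapping."""
        token = self.sid_to_token.pop(sid, None)
        if token is not None:
            self.token_to_sid.pop(token, None)


# Orderly reconnect: the old disconnect is processed first, token is kept.
orderly = ToyTokenManager()
orderly.link("tok-A", "sid-old")
orderly.disconnect("sid-old")
orderly_result = orderly.link("tok-A", "sid-new")

# Racy reconnect: the new connect arrives before the old disconnect.
racy = ToyTokenManager()
racy.link("tok-A", "sid-old")
racy_result = racy.link("tok-A", "sid-new")
racy.disconnect("sid-old")
```

In the orderly case link returns None and the token survives. In the racy case a fresh token is issued, and once the late disconnect runs nothing points at "tok-A" any more, which matches the state loss described here.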

This is not a new observation: @masenf raised exactly this concern in a review comment on #4607 on Jan 21 2025:

"Revisiting this PR, is there ever a case where we wouldn't get on_disconnect to unlink the sid/token, resulting in a
refresh/reconnect leading to loss of token and state for a client?"

@benedikt-bartscher suggested opening a formal issue on Nov 1 2025. This is that issue.

To Reproduce

Any Reflex 0.8.5+ app with authenticated state and a reverse proxy in front (tested with Caddy 2, but any proxy that closes idle TCP
connections, or any transient network blip, reproduces it):

  1. Open the app, log in (OIDC or any auth that sets state on connect).
  2. Leave the tab idle ~60-120 s (time depends on the proxy's idle/keepalive settings).
  3. Interact again — navigate or trigger an event.
  4. The browser opens a new WebSocket (expected — transient close).
  5. Reflex emits new_token, the client's client_token rotates, the auth state is gone.

Reproducible 100% of the time on our end. @hr-alebel reports the same for the backend-restart variant, also 100% reproducible.

Expected behavior

A transient WebSocket reconnect by the same browser should preserve client_token and the associated state. The "duplicate tab"
heuristic should either:

  • Only fire when the old sid is confirmed alive (not merely still in the map), or
  • Reuse the token for the new sid and let the stale sid mapping be cleaned up on its eventual on_disconnect.
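
As a rough illustration of the first option, the duplicate check could be gated on a liveness probe. Everything named here is hypothetical: link_with_liveness_check, the flat token_to_sid dict, and the is_sid_alive callback are not part of Reflex. How liveness would actually be established is left open; since the server still considers the old sid alive during the reconnect window, the callback would presumably need an active probe rather than a lookup in the server's own bookkeeping.

```python
import uuid
from typing import Callable, Dict, Optional


def link_with_liveness_check(
    token_to_sid: Dict[str, str],
    token: str,
    sid: str,
    is_sid_alive: Callable[[str], bool],  # hypothetical probe, e.g. a ping
) -> Optional[str]:
    """Return a new token only when the previous sid is confirmed alive."""
    previous_sid = token_to_sid.get(token)
    if (
        previous_sid is not None
        and previous_sid != sid
        and is_sid_alive(previous_sid)
    ):
        # A second live socket holds this token: treat as a duplicate tab.
        new_token = uuid.uuid4().hex
        token_to_sid[new_token] = sid
        return new_token
    # Previous sid is unknown, identical, or dead: keep the token.
    token_to_sid[token] = sid
    return None


mapping = {"tok-A": "sid-old"}
# Reconnect while the old socket is dead: token is preserved.
kept = link_with_liveness_check(mapping, "tok-A", "sid-new", lambda s: False)
# Genuine second tab while the first socket is alive: token is rotated.
dup = link_with_liveness_check(mapping, "tok-A", "sid-tab2", lambda s: True)
```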

Actual behavior — evidence from production logs

Sequence captured with socketio.server and engineio.server loggers at INFO and a custom wrapper around EventNamespace.on_connect / on_disconnect. Portal behind Caddy 2, Reflex 0.8.28.post1, StateManagerDisk:

12:00:12 INFO engineio.server | iA82ql0ePeBzV8pyAAAE: Sending packet PING
12:00:12 INFO engineio.server | iA82ql0ePeBzV8pyAAAE: Received packet PONG
12:00:37 INFO engineio.server | iA82ql0ePeBzV8pyAAAE: Sending packet PING
12:00:56 INFO engineio.server | 4rg_LN_9Z6veURYDAAAG: Upgrade to websocket successful
12:00:56 INFO portal_web_gic.ws | WS connect sid=PicQCyE7PIbmpLuvAAAH
12:00:56 INFO socketio.server | emitting event "new_token" to PicQCyE7PIbmpLuvAAAH [/_event]
12:00:56.800 WARNING portal_web_gic.ws | WS disconnect sid=GfxpxXQDQQHhiihfAAAB reason=transport close

The key detail: new_token is emitted to the new sid before the old sid's transport close is processed (~800 ms gap).
pingTimeout is 120 s so this isn't ping-timeout-related; the server still considers the old sid alive when the new one registers.
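
For anyone wanting to capture a similar trace, the logging setup could look roughly like the sketch below. The logger names socketio.server and engineio.server are the ones mentioned above; the ws_trace logger, the traced decorator, and DemoNamespace are hypothetical stand-ins, and the handler signatures of the real EventNamespace may differ.

```python
import asyncio
import functools
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s | %(message)s",
)
for name in ("socketio.server", "engineio.server"):
    logging.getLogger(name).setLevel(logging.INFO)

ws_log = logging.getLogger("ws_trace")  # hypothetical logger name


def traced(label, level=logging.INFO):
    """Wrap an async handler so each call logs its sid before running."""

    def decorator(handler):
        @functools.wraps(handler)
        async def wrapper(self, sid, *args, **kwargs):
            ws_log.log(level, "WS %s sid=%s", label, sid)
            return await handler(self, sid, *args, **kwargs)

        return wrapper

    return decorator


class DemoNamespace:
    """Stand-in namespace for the demo; not Reflex's EventNamespace."""

    async def on_connect(self, sid, environ):
        return "connected"

    async def on_disconnect(self, sid):
        return "disconnected"


# Patch the handlers in place, as one might do on the real namespace class.
DemoNamespace.on_connect = traced("connect")(DemoNamespace.on_connect)
DemoNamespace.on_disconnect = traced("disconnect", logging.WARNING)(
    DemoNamespace.on_disconnect
)

result = asyncio.run(DemoNamespace().on_connect("sid-1", {}))
```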

Environment

  • Reflex: 0.8.28.post1
  • Python: 3.13.12
  • State manager: StateManagerDisk
  • Deployment: Docker (backend + frontend + internal nginx), behind Caddy 2 as edge reverse proxy, Keycloak OIDC for auth
  • Browser: Chrome 147 on Windows 10

Same behavior reported by @hr-alebel on any Reflex version ≥ 0.8.5 under backend restarts with Redis-persisted state.

Workaround (production-tested)

We applied a monkey-patch that replaces LocalTokenManager.link_token_to_sid with a version that reuses the token instead of
generating a new one on duplicate detection, and removes the old sid→token mapping so the eventual stale disconnect doesn't wipe
the shared token. The accepted trade-off is that intentional duplicate tabs share state, which is acceptable for our internal
corporate portal with per-user auth.

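Patch (≈20 lines, applied via LocalTokenManager.link_token_to_sid = link_token_to_sid_reuse before rx.App()):

# Import path assumed from the Reflex source tree; adjust for your version.
from reflex.utils.token_manager import LocalTokenManager, SocketRecord


async def link_token_to_sid_reuse(self, token: str, sid: str) -> None:
    """Link token to sid, reusing the token instead of rotating it."""
    existing = self.token_to_socket.get(token)
    if existing is not None and existing.sid != sid:
        # Drop the stale sid→token mapping so the old sid's eventual
        # on_disconnect doesn't wipe the token the new sid is using.
        self.sid_to_token.pop(existing.sid, None)
    self.token_to_socket[token] = SocketRecord(
        instance_id=self.instance_id, sid=sid,
    )
    self.sid_to_token[sid] = token
    return None  # never emit new_token


# Apply before constructing rx.App().
LocalTokenManager.link_token_to_sid = link_token_to_sid_reuse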

Since the patch modifies the base class, it covers RedisTokenManager too.
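
The ordering that matters for the workaround (new connect first, stale disconnect second) can be checked without Reflex. In this sketch Record and StandInManager are hypothetical stand-ins that mirror only the attributes the patch touches, and the disconnect cleanup is an assumption about how the stale sid is eventually unlinked.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in for the socket record used by the patch."""

    instance_id: str
    sid: str


@dataclass
class StandInManager:
    """Mirrors only the attributes the patch reads and writes."""

    instance_id: str = "instance-1"
    token_to_socket: Dict[str, Record] = field(default_factory=dict)
    sid_to_token: Dict[str, str] = field(default_factory=dict)

    async def link_token_to_sid(self, token: str, sid: str) -> None:
        # Same logic as the reuse patch above.
        existing = self.token_to_socket.get(token)
        if existing is not None and existing.sid != sid:
            self.sid_to_token.pop(existing.sid, None)
        self.token_to_socket[token] = Record(
            instance_id=self.instance_id, sid=sid
        )
        self.sid_to_token[sid] = token
        return None

    async def disconnect(self, sid: str) -> None:
        # Assumed cleanup: unlink whatever token this sid still maps to.
        token = self.sid_to_token.pop(sid, None)
        if token is not None:
            self.token_to_socket.pop(token, None)


async def scenario():
    manager = StandInManager()
    await manager.link_token_to_sid("tok-A", "sid-old")
    rotated = await manager.link_token_to_sid("tok-A", "sid-new")
    await manager.disconnect("sid-old")  # stale disconnect arrives late
    return manager, rotated


manager, rotated = asyncio.run(scenario())
```

Because the stale sid→token entry is removed at link time, the late disconnect finds nothing to unlink and "tok-A" stays bound to the new sid.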

Related

Would appreciate

Guidance on whether the preferred upstream fix is (a) the reuse-token approach above, (b) a liveness check on the stale sid before
declaring duplicate, or (c) something else. Happy to open a PR if there's alignment on the direction.
