fix(libp2p): gate connection-monitor abort on transport silence #3504

Draft
tabcat wants to merge 2 commits into libp2p:main from tabcat:fix/connection-monitor-silence-gate


tabcat commented May 13, 2026

Description

Connection-monitor currently aborts on the first failed ping. With the AdaptiveTimeout 5s floor, transient loss or backpressure on a healthy connection routinely trips the abort even though TCP would have recovered (see #3463 for trans-Pacific evidence).

The default abortConnectionOnPingFailure: true is preserved but the abort is now gated on actual transport silence:

  • @libp2p/interface: optional lastReadAt?: number added to MessageStreamTimeline; new ConnectionStaleError export.
  • @libp2p/utils: AbstractMessageStream.onData() sets timeline.lastReadAt = Date.now() whenever bytes arrive — covers every transport in one line since they all route through this method.
  • libp2p: connection-monitor adds connectionStaleTimeout (default 60s). On ping failure it aborts only if Date.now() - (conn.timeline.lastReadAt ?? conn.timeline.open) > connectionStaleTimeout. Aborts emit ConnectionStaleError so consumers can distinguish gate-driven aborts from transport-level failures.
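The gate described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `TimelineSketch`, `ConnectionSketch`, `recordRead`, `isStale` and `onPingFailure` are hypothetical names, the local `ConnectionStaleError` class is a stand-in for the new export, and only the `lastReadAt ?? open` comparison and the 60s default are taken from the description.

```typescript
// Hypothetical, simplified shapes. The real types live in @libp2p/interface.
interface TimelineSketch {
  open: number
  lastReadAt?: number
}

interface ConnectionSketch {
  timeline: TimelineSketch
  abort: (err: Error) => void
}

// Stand-in for the new ConnectionStaleError export (assumed shape).
class ConnectionStaleError extends Error {
  constructor (message = 'Connection stale') {
    super(message)
    this.name = 'ConnectionStaleError'
  }
}

// Default taken from the PR description (60s).
const DEFAULT_CONNECTION_STALE_TIMEOUT = 60_000

// Mirrors what onData() is described as doing: stamp the time of the last read.
function recordRead (timeline: TimelineSketch, now: number = Date.now()): void {
  timeline.lastReadAt = now
}

// True only when nothing has been read for longer than the timeout.
// Falls back to the open time when no bytes have been read yet.
function isStale (
  timeline: TimelineSketch,
  now: number,
  connectionStaleTimeout: number = DEFAULT_CONNECTION_STALE_TIMEOUT
): boolean {
  const lastActivity = timeline.lastReadAt ?? timeline.open
  return now - lastActivity > connectionStaleTimeout
}

// Called after a ping has failed. Returns true if the connection was aborted.
function onPingFailure (
  conn: ConnectionSketch,
  now: number = Date.now(),
  connectionStaleTimeout: number = DEFAULT_CONNECTION_STALE_TIMEOUT
): boolean {
  if (!isStale(conn.timeline, now, connectionStaleTimeout)) {
    // Bytes arrived recently, so treat the failed ping as transient.
    return false
  }
  conn.abort(new ConnectionStaleError())
  return true
}
```

In this sketch a failed ping on a connection that read data 50s ago is ignored, while the same failure after more than 60s of silence aborts with `ConnectionStaleError`.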

Net effect: ping failures from heavy traffic on other streams, GC pauses, TCP retransmit bursts, or path-level jitter no longer kill healthy connections — only sustained silence does.

Notes & open questions

Some structural cleanups (drop AdaptiveTimeout from this file, simplify pingTimeout, drop abortConnectionOnPingFailure) become natural once the abort decision lives elsewhere. These are breaking changes, so they are noted for v4.

Change checklist

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation if necessary (this includes comments as well)
  • I have added tests that prove my fix is effective or that my feature works

tabcat added 2 commits May 13, 2026 21:50
A single failed ping no longer aborts the connection. The heartbeat now
records `lastReadAt` on every MaConn read and only aborts when the ping
fails AND no data has been received from the peer for
`connectionStaleTimeout` ms (default 60s). Aborts use the new
`ConnectionStaleError` so consumers can distinguish staleness from
transport-level failures.
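
On the consumer side, telling a staleness abort apart from a transport-level failure might look like the sketch below. It assumes the error's `name` property is `'ConnectionStaleError'`, which this page does not confirm. The local class is a stand-in for the new export, and `classifyAbort` and `AbortCause` are hypothetical names.

```typescript
// Stand-in for the ConnectionStaleError added in this PR (assumed shape).
class ConnectionStaleError extends Error {
  constructor (message = 'Connection stale') {
    super(message)
    this.name = 'ConnectionStaleError'
  }
}

type AbortCause = 'stale' | 'transport'

// Classify an abort error by name rather than instanceof, so the check
// still works if more than one copy of the error class is loaded.
function classifyAbort (err: Error): AbortCause {
  return err.name === 'ConnectionStaleError' ? 'stale' : 'transport'
}
```

A consumer could use a check like this to redial quietly after a stale abort while still logging other failures.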