fix(libp2p): gate connection-monitor abort on transport silence#3504
Draft
tabcat wants to merge 2 commits into
A single failed ping no longer aborts the connection. The heartbeat now records `lastReadAt` on every MaConn read and only aborts when the ping fails AND no data has been received from the peer for `connectionStaleTimeout` ms (default 60s). Aborts use the new `ConnectionStaleError` so consumers can distinguish staleness from transport-level failures.
Description
Connection-monitor currently aborts on the first failed ping. With the `AdaptiveTimeout` 5s floor, transient loss or backpressure on a healthy connection routinely trips the abort even though TCP would have recovered (see #3463 for trans-Pacific evidence).

The default `abortConnectionOnPingFailure: true` is preserved, but the abort is now gated on actual transport silence:

- `@libp2p/interface`: optional `lastReadAt?: number` added to `MessageStreamTimeline`; new `ConnectionStaleError` export.
- `@libp2p/utils`: `AbstractMessageStream.onData()` sets `timeline.lastReadAt = Date.now()` whenever bytes arrive. This covers every transport in one line, since they all route through this method.
- `libp2p`: connection-monitor adds `connectionStaleTimeout` (default 60s). On ping failure it aborts only if `Date.now() - (conn.timeline.lastReadAt ?? conn.timeline.open) > connectionStaleTimeout`. Aborts emit `ConnectionStaleError` so consumers can distinguish gate-driven aborts from transport-level failures.

Net effect: ping failures from heavy traffic on other streams, GC pauses, TCP retransmit bursts, or path-level jitter no longer kill healthy connections; only sustained silence does.
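For reviewers who want the gate in isolation, here is a minimal sketch of the staleness check described above. It is not the code in this PR: `TimelineSketch`, `ConnectionStaleErrorSketch`, `isStale`, and `onPingFailure` are hypothetical stand-ins, and the real `MessageStreamTimeline` and `ConnectionStaleError` in `@libp2p/interface` may be shaped differently. Only the comparison itself and the 60s default are taken from the description.

```typescript
// Hypothetical stand-ins for the types named in this PR; the real
// definitions live in @libp2p/interface and may differ.
interface TimelineSketch {
  open: number // ms timestamp at which the connection opened
  lastReadAt?: number // ms timestamp of the most recent read, if any
}

class ConnectionStaleErrorSketch extends Error {
  constructor (message = 'connection is stale') {
    super(message)
    this.name = 'ConnectionStaleError'
  }
}

// Default taken from the PR description (60s).
const DEFAULT_CONNECTION_STALE_TIMEOUT = 60000

// True only when nothing has been read for longer than `timeout` ms.
// Falls back to the open time when no read has been recorded yet.
function isStale (
  timeline: TimelineSketch,
  now: number,
  timeout: number = DEFAULT_CONNECTION_STALE_TIMEOUT
): boolean {
  const lastActivity = timeline.lastReadAt ?? timeline.open
  return now - lastActivity > timeout
}

// Called after a ping has already failed. Aborts only if the transport
// has also been silent; returns whether the abort happened.
function onPingFailure (
  timeline: TimelineSketch,
  abort: (err: Error) => void,
  now: number = Date.now(),
  timeout: number = DEFAULT_CONNECTION_STALE_TIMEOUT
): boolean {
  if (!isStale(timeline, now, timeout)) {
    return false // recent reads: treat the ping failure as transient
  }
  abort(new ConnectionStaleErrorSketch())
  return true
}
```

Passing `now` explicitly keeps the check deterministic in tests. Note the strict `>`: a connection whose last read is exactly `connectionStaleTimeout` ms old is still considered live.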
Notes & open questions
Some structural cleanups (drop `AdaptiveTimeout` from this file, simplify `pingTimeout`, drop `abortConnectionOnPingFailure`) become natural once the abort decision lives elsewhere. They're breaking and noted for v4.

Change checklist