Persistent routing lock stall/deadlock under peer churn #2581

@BOURBONCASK

Describe the bug

Under peer-to-peer routing with peer churn, a long-lived publisher can stall permanently in put().wait(). The publisher thread blocks in route_data, waiting on the routing Tables RwLock, and is still blocked there after the process is resumed in the debugger for about 30 seconds and interrupted again.

This looks like a persistent routing lock stall/deadlock rather than transient contention.

To reproduce

Reproducer branch:

The reproducer adds examples/examples/z_p2p_declare_final_stress.rs and registers it as z_p2p_declare_final_stress.

The test launches one router process and multiple peer worker processes.

Publisher process

The publisher process creates stable peer sessions and declares a publisher on:

repro/deadlock

Each publisher session repeatedly calls:

publisher.put(payload.clone()).wait()

This represents the long-lived publisher that eventually stalls.

Churn peer process

Each churn process repeatedly creates short-lived peer sessions. Each session:

  1. Opens a peer session.
  2. Declares a subscriber on repro/deadlock.
  3. Sleeps for --churn-hold-ms.
  4. Drops the subscriber.
  5. Drops the session.
  6. Sleeps for --churn-idle-ms.
  7. Repeats.

This creates repeated declare/finalize/session-close churn while the publisher keeps sending data.
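The churn cycle above can be sketched with standard-library stand-ins. This is only a model of the lifecycle order, not the reproducer's code: MockSession and MockSubscriber are hypothetical types (not the zenoh API), and the log strings are illustrative.

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Shared event log so the order of lifecycle steps can be inspected.
type Log = Arc<Mutex<Vec<String>>>;

/// Hypothetical stand-in for a peer session (not the zenoh type).
struct MockSession {
    log: Log,
}

/// Hypothetical stand-in for a subscriber declared on a session.
struct MockSubscriber {
    log: Log,
    key: &'static str,
}

impl MockSession {
    fn open(log: &Log) -> Self {
        log.lock().unwrap().push("open session".to_string());
        MockSession { log: Arc::clone(log) }
    }

    fn declare_subscriber(&self, key: &'static str) -> MockSubscriber {
        self.log.lock().unwrap().push(format!("declare subscriber {key}"));
        MockSubscriber { log: Arc::clone(&self.log), key }
    }
}

impl Drop for MockSubscriber {
    fn drop(&mut self) {
        self.log.lock().unwrap().push(format!("undeclare subscriber {}", self.key));
    }
}

impl Drop for MockSession {
    fn drop(&mut self) {
        self.log.lock().unwrap().push("close session".to_string());
    }
}

/// One churn cycle: open, declare, hold, drop subscriber, drop session, idle.
fn churn_cycle(log: &Log, hold: Duration, idle: Duration) {
    let session = MockSession::open(log);
    let subscriber = session.declare_subscriber("repro/deadlock");
    thread::sleep(hold); // models --churn-hold-ms
    drop(subscriber);
    drop(session);
    thread::sleep(idle); // models --churn-idle-ms
}

/// Runs `cycles` churn cycles and returns the recorded events in order.
fn run_churn(cycles: usize, hold: Duration, idle: Duration) -> Vec<String> {
    let log: Log = Arc::new(Mutex::new(Vec::new()));
    for _ in 0..cycles {
        churn_cycle(&log, hold, idle);
    }
    let events = log.lock().unwrap().clone();
    events
}

fn main() {
    for event in run_churn(2, Duration::from_millis(20), Duration::from_millis(20)) {
        println!("{event}");
    }
}
```

Each cycle produces one declare and one undeclare followed by a session close, which is the sequence the real churn peers drive through the routing layer while the publisher keeps calling put().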

Build:

git fetch https://github.com/BOURBONCASK/zenoh.git repro/p2p-declare-final-deadlock
git checkout FETCH_HEAD
cargo build -p zenoh-examples --example z_p2p_declare_final_stress

Run:

target/debug/examples/z_p2p_declare_final_stress \
  --router-endpoint tcp/127.0.0.1:17447 \
  --publisher-processes 1 \
  --publisher-sessions-per-process 1 \
  --churn-processes 4 \
  --churn-sessions-per-process 2 \
  --put-period-ms 10 \
  --churn-hold-ms 20 \
  --churn-idle-ms 20 \
  --stall-after-secs 5

The supervisor prints each child role and PID, for example:

[role=supervisor pid=... ts_ms=...] spawned child role=publisher index=0 pid=60572
[role=supervisor pid=... ts_ms=...] spawned child role=churn index=0 pid=60574

When the publisher process sees no completed put() for the stall timeout (--stall-after-secs), it parks and prints:

[role=publisher index=0 pid=60572 ts_ms=...] STALL no-completed-put session=0 stall_after=5s; process is parked for debugger attach. Run `lldb -p 60572`, then `thread backtrace all`.
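A stall check of this kind can be sketched as a progress counter plus a polling watchdog. The code below is a minimal standard-library illustration, not the reproducer's implementation: Progress, watch_for_stall, and spawn_worker are hypothetical names, and the worker only simulates a publisher that stops completing puts.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

/// Counts completed puts; the worker bumps it after every completed call.
#[derive(Default)]
struct Progress {
    completed: AtomicU64,
}

/// Polls `progress` and returns `Some(count)` once no new put has completed
/// for `stall_after`, or `None` if `deadline` passes without such a stall.
fn watch_for_stall(
    progress: &Progress,
    stall_after: Duration,
    poll: Duration,
    deadline: Duration,
) -> Option<u64> {
    let start = Instant::now();
    let mut last_count = progress.completed.load(Ordering::Acquire);
    let mut last_change = Instant::now();
    while start.elapsed() < deadline {
        thread::sleep(poll);
        let count = progress.completed.load(Ordering::Acquire);
        if count != last_count {
            // Progress was made: remember the new count and restart the timer.
            last_count = count;
            last_change = Instant::now();
        } else if last_change.elapsed() >= stall_after {
            return Some(count);
        }
    }
    None
}

/// Simulated publisher: completes `puts` operations, then stops making
/// progress (a stand-in for a put().wait() that never returns).
fn spawn_worker(progress: Arc<Progress>, puts: u64, period: Duration) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        for _ in 0..puts {
            thread::sleep(period);
            progress.completed.fetch_add(1, Ordering::Release);
        }
    })
}

fn main() {
    let progress = Arc::new(Progress::default());
    let worker = spawn_worker(Arc::clone(&progress), 5, Duration::from_millis(10));
    let verdict = watch_for_stall(
        &progress,
        Duration::from_millis(200),
        Duration::from_millis(20),
        Duration::from_secs(5),
    );
    match verdict {
        Some(count) => println!("STALL no-completed-put after {count} puts"),
        None => println!("no stall detected"),
    }
    worker.join().unwrap();
}
```

Keeping the watchdog on a separate thread from the put() loop matters here: the blocked thread cannot report its own stall, which matches the observed stacks where the main thread is parked in the watchdog while thread #3 waits on the lock.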

Debugging Procedure

Do not use log enable lldb all; that records LLDB internal logs and makes the useful backtrace hard to read.

Attach to the publisher PID printed by the repro:

lldb -p 60572
(lldb) thread backtrace all
(lldb) process continue

Wait about 30 seconds, press Ctrl-C, then capture another backtrace:

(lldb) thread backtrace all

Repeat the same procedure for one churn peer PID, for example:

lldb -p 60574
(lldb) thread backtrace all
(lldb) process continue

Wait about 30 seconds, press Ctrl-C, then:

(lldb) thread backtrace all

Observed Publisher Stack

Publisher PID: 60572

After continuing the process and interrupting it again, the publisher is still blocked in the same stack:

thread #3
std::sys::sync::rwlock::queue::RwLock::lock_contended
std::sync::poison::rwlock::RwLock<T>::read
zenoh::net::routing::dispatcher::pubsub::route_data
<zenoh::net::routing::dispatcher::face::Face as zenoh::net::primitives::Primitives>::send_push_consume
<zenoh::net::routing::namespace::Namespace as zenoh::net::primitives::Primitives>::send_push_consume
zenoh::api::session::Session::resolve_put
<zenoh::api::builders::publisher::PublicationBuilder<&zenoh::api::publisher::Publisher, PublicationBuilderPut> as zenoh_core::Wait>::wait
z_p2p_declare_final_stress::spawn_publisher::{closure}

The main thread is parked in the repro watchdog:

z_p2p_declare_final_stress::report_publisher_stall
z_p2p_declare_final_stress::monitor_publishers
z_p2p_declare_final_stress::run_publisher_process

Observed Churn Peer Stacks

Churn peer PID: 60574

After continuing the process and interrupting it again, the churn process remains blocked in declare/finalize/transport/gossip paths. Representative stacks:

thread #2
<zenoh::api::builders::session::CloseBuilder as zenoh_core::Wait>::wait
<zenoh::api::session::Session as Drop>::drop
z_p2p_declare_final_stress::spawn_churn_peer::{closure}

thread #3
std::sync::poison::mutex::Mutex<T>::lock
<zenoh::net::routing::dispatcher::face::Face as zenoh::net::primitives::Primitives>::send_declare
<zenoh::net::routing::namespace::Namespace as zenoh::net::primitives::Primitives>::send_declare
zenoh::api::session::Session::declare_prefix
<zenoh::api::builders::subscriber::SubscriberBuilder as zenoh_core::Wait>::wait
z_p2p_declare_final_stress::spawn_churn_peer::{closure}

thread #5 / thread #10
std::sync::poison::mutex::Mutex<T>::lock
zenoh::net::routing::gateway::Gateway::new_transport_unicast
<zenoh::net::runtime::RuntimeTransportEventHandler as zenoh_transport::TransportEventHandler>::new_unicast
zenoh_transport::unicast::manager::TransportManager::notify_new_transport_unicast
zenoh_transport::unicast::manager::TransportManager::init_new_transport_unicast
zenoh::net::runtime::orchestrator::Runtime::connect_peer
zenoh::net::protocol::gossip::Gossip::link_states

Why This Looks Persistent

The publisher and churn peer were both resumed for about 30 seconds and interrupted again. The relevant threads were still blocked in the same lock paths. The publisher did not recover and did not complete more put() calls.

The currently visible victim stack is the publisher waiting for a Tables read lock in route_data. The churn process shows concurrent declare/finalize/session-close and transport/gossip paths blocked on mutexes. This suggests a lock-ordering deadlock or persistent routing lock stall triggered by peer churn.
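For readers unfamiliar with the failure class, the sketch below shows a generic lock-order inversion between an RwLock and a Mutex. It is an illustration under assumptions, not a diagnosis: the backtraces do not establish which locks are actually held in which order, and Shared, tables, and face_state are hypothetical stand-ins rather than zenoh types. The try_ variants are used so the example reports the would-block condition instead of hanging.

```rust
use std::sync::{Arc, Barrier, Mutex, RwLock};
use std::thread;

/// Hypothetical stand-ins: `tables` models a routing-table RwLock and
/// `face_state` models some other Mutex. Neither is a zenoh type.
struct Shared {
    tables: RwLock<u32>,
    face_state: Mutex<u32>,
}

/// Runs two threads that take the two locks in opposite order and reports
/// whether each one would have blocked on its second lock.
fn opposite_order_would_block() -> (bool, bool) {
    let shared = Arc::new(Shared {
        tables: RwLock::new(0),
        face_state: Mutex::new(0),
    });
    // First barrier: both threads hold their first lock.
    // Second barrier: both threads have probed their second lock.
    let hold_first = Arc::new(Barrier::new(2));
    let probed = Arc::new(Barrier::new(2));

    let writer = {
        let shared = Arc::clone(&shared);
        let hold_first = Arc::clone(&hold_first);
        let probed = Arc::clone(&probed);
        thread::spawn(move || {
            // Order A: tables (write) first, then face_state.
            let _tables = shared.tables.write().unwrap();
            hold_first.wait();
            let blocked = shared.face_state.try_lock().is_err();
            probed.wait();
            blocked
        })
    };

    let reader = {
        let shared = Arc::clone(&shared);
        let hold_first = Arc::clone(&hold_first);
        let probed = Arc::clone(&probed);
        thread::spawn(move || {
            // Order B: face_state first, then tables (read).
            let _face = shared.face_state.lock().unwrap();
            hold_first.wait();
            let blocked = shared.tables.try_read().is_err();
            probed.wait();
            blocked
        })
    };

    (writer.join().unwrap(), reader.join().unwrap())
}

fn main() {
    let (writer_blocked, reader_blocked) = opposite_order_would_block();
    println!("writer would block on mutex: {writer_blocked}");
    println!("reader would block on tables: {reader_blocked}");
}
```

With blocking lock() and read() calls in place of the try_ variants, both threads would wait forever on each other, which is the kind of state that survives a 30-second resume. Confirming whether that is what happens here would require identifying the current holder of the Tables lock in the publisher process.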

Environment

OS: macOS arm64
Binary: target/debug/examples/z_p2p_declare_final_stress
Zenoh mode: peer_to_peer
