Describe the bug
Under peer-to-peer routing with peer churn, a long-lived publisher can stall permanently in `put().wait()`. The publisher remains blocked in `route_data`, waiting for the routing `Tables` `RwLock`, even after the process is resumed for about 30 seconds and interrupted again.
This looks like a persistent routing lock stall or deadlock rather than transient contention.
To reproduce
Reproducer branch: `repro/p2p-declare-final-deadlock` at https://github.com/BOURBONCASK/zenoh.git (commit 2045a6e27f080e820a37a72a757dc57cacf73cd1, based on eclipse-zenoh/zenoh `main` at bfcab2644f7624b4f7bb7fa0db7781b858cd6888).
The reproducer adds `examples/examples/z_p2p_declare_final_stress.rs` and registers it as `z_p2p_declare_final_stress`.
The test launches one router process and multiple peer worker processes.
Publisher process
The publisher process creates stable peer sessions and declares a publisher on `repro/deadlock`, the key expression the churn subscribers use. Each publisher session repeatedly calls:

```rust
publisher.put(payload.clone()).wait()
```
This represents the long-lived publisher that eventually stalls.
Churn peer process
Each churn process repeatedly creates short-lived peer sessions. Each session:
- Opens a peer session.
- Declares a subscriber on `repro/deadlock`.
- Sleeps for `--churn-hold-ms`.
- Drops the subscriber.
- Drops the session.
- Sleeps for `--churn-idle-ms`.
- Repeats.
This creates repeated declare/finalize/session-close churn while the publisher keeps sending data.
Build:

```shell
git fetch https://github.com/BOURBONCASK/zenoh.git repro/p2p-declare-final-deadlock
git checkout FETCH_HEAD
cargo build -p zenoh-examples --example z_p2p_declare_final_stress
```
Run:

```shell
target/debug/examples/z_p2p_declare_final_stress \
  --router-endpoint tcp/127.0.0.1:17447 \
  --publisher-processes 1 \
  --publisher-sessions-per-process 1 \
  --churn-processes 4 \
  --churn-sessions-per-process 2 \
  --put-period-ms 10 \
  --churn-hold-ms 20 \
  --churn-idle-ms 20 \
  --stall-after-secs 5
```
The supervisor prints each child role and PID, for example:
```
[role=supervisor pid=... ts_ms=...] spawned child role=publisher index=0 pid=60572
[role=supervisor pid=... ts_ms=...] spawned child role=churn index=0 pid=60574
```
When the publisher detects no completed `put()` within the stall timeout, it parks and prints:

```
[role=publisher index=0 pid=60572 ts_ms=...] STALL no-completed-put session=0 stall_after=5s; process is parked for debugger attach. Run `lldb -p 60572`, then `thread backtrace all`.
```
Debugging Procedure
Do not use `log enable lldb all`; that records LLDB internal logs and makes the useful backtrace hard to read.
Attach to the publisher PID printed by the repro:

```
lldb -p 60572
(lldb) thread backtrace all
(lldb) process continue
```

Wait about 30 seconds, press Ctrl-C, then capture another backtrace:

```
(lldb) thread backtrace all
```
Repeat the same procedure for one churn peer PID, for example:

```
lldb -p 60574
(lldb) thread backtrace all
(lldb) process continue
```

Wait about 30 seconds, press Ctrl-C, then:

```
(lldb) thread backtrace all
```
Observed Publisher Stack
Publisher PID: `60572`
After continuing the process and interrupting it again, the publisher is still blocked in the same stack:

```
thread #3
  std::sys::sync::rwlock::queue::RwLock::lock_contended
  std::sync::poison::rwlock::RwLock<T>::read
  zenoh::net::routing::dispatcher::pubsub::route_data
  <zenoh::net::routing::dispatcher::face::Face as zenoh::net::primitives::Primitives>::send_push_consume
  <zenoh::net::routing::namespace::Namespace as zenoh::net::primitives::Primitives>::send_push_consume
  zenoh::api::session::Session::resolve_put
  <zenoh::api::builders::publisher::PublicationBuilder<&zenoh::api::publisher::Publisher, PublicationBuilderPut> as zenoh_core::Wait>::wait
  z_p2p_declare_final_stress::spawn_publisher::{closure}
```
The main thread is parked in the repro watchdog:

```
z_p2p_declare_final_stress::report_publisher_stall
z_p2p_declare_final_stress::monitor_publishers
z_p2p_declare_final_stress::run_publisher_process
```
Observed Churn Peer Stacks
Churn peer PID: `60574`
After continuing the process and interrupting it again, the churn process remains blocked in declare/finalize/transport/gossip paths. Representative stacks:

```
thread #2
  <zenoh::api::builders::session::CloseBuilder as zenoh_core::Wait>::wait
  <zenoh::api::session::Session as Drop>::drop
  z_p2p_declare_final_stress::spawn_churn_peer::{closure}

thread #3
  std::sync::poison::mutex::Mutex<T>::lock
  <zenoh::net::routing::dispatcher::face::Face as zenoh::net::primitives::Primitives>::send_declare
  <zenoh::net::routing::namespace::Namespace as zenoh::net::primitives::Primitives>::send_declare
  zenoh::api::session::Session::declare_prefix
  <zenoh::api::builders::subscriber::SubscriberBuilder as zenoh_core::Wait>::wait
  z_p2p_declare_final_stress::spawn_churn_peer::{closure}

thread #5 / thread #10
  std::sync::poison::mutex::Mutex<T>::lock
  zenoh::net::routing::gateway::Gateway::new_transport_unicast
  <zenoh::net::runtime::RuntimeTransportEventHandler as zenoh_transport::TransportEventHandler>::new_unicast
  zenoh_transport::unicast::manager::TransportManager::notify_new_transport_unicast
  zenoh_transport::unicast::manager::TransportManager::init_new_transport_unicast
  zenoh::net::runtime::orchestrator::Runtime::connect_peer
  zenoh::net::protocol::gossip::Gossip::link_states
```
Why This Looks Persistent
The publisher and the churn peer were each resumed for about 30 seconds and interrupted again; the relevant threads were still blocked in the same lock paths. The publisher did not recover and completed no further `put()` calls.
The currently visible victim is the publisher waiting for a `Tables` read lock in `route_data`. The churn process shows concurrent declare/finalize/session-close and transport/gossip paths blocked on mutexes. Together this suggests a lock-ordering deadlock, or a persistent routing lock stall, triggered by peer churn.
Environment
- OS: macOS arm64
- Binary: `target/debug/examples/z_p2p_declare_final_stress`
- Zenoh mode: `peer_to_peer`