Describe the bug
Under peer-to-peer routing with peer churn, a long-lived publisher can stall permanently in `put().wait()`. The publisher remains blocked in `route_data`, waiting for the routing `Tables` `RwLock`, even after the process is resumed for about 30 seconds and interrupted again.
This looks like a persistent routing lock stall or deadlock rather than transient contention.
To reproduce
Reproducer branch: `repro/p2p-declare-final-deadlock` at https://github.com/BOURBONCASK/zenoh.git (commit 2045a6e27f080e820a37a72a757dc57cacf73cd1, based on eclipse-zenoh/zenoh `main` at bfcab2644f7624b4f7bb7fa0db7781b858cd6888).
The reproducer adds `examples/examples/z_p2p_declare_final_stress.rs` and registers it as `z_p2p_declare_final_stress`.
The test launches one router process and multiple peer worker processes.
Publisher process
The publisher process creates stable peer sessions and declares a publisher on `repro/deadlock`, the key expression the churn subscribers use. Each publisher session repeatedly calls:

```rust
publisher.put(payload.clone()).wait()
```
This represents the long-lived publisher that eventually stalls.
Churn peer process
Each churn process repeatedly creates short-lived peer sessions. Each session:
- Opens a peer session.
- Declares a subscriber on `repro/deadlock`.
- Sleeps for `--churn-hold-ms`.
- Drops the subscriber.
- Drops the session.
- Sleeps for `--churn-idle-ms`.
- Repeats.
This creates repeated declare/finalize/session-close churn while the publisher keeps sending data.
Build:

```shell
git fetch https://github.com/BOURBONCASK/zenoh.git repro/p2p-declare-final-deadlock
git checkout FETCH_HEAD
cargo build -p zenoh-examples --example z_p2p_declare_final_stress
```
Run:

```shell
target/debug/examples/z_p2p_declare_final_stress \
  --router-endpoint tcp/127.0.0.1:17447 \
  --publisher-processes 1 \
  --publisher-sessions-per-process 1 \
  --churn-processes 4 \
  --churn-sessions-per-process 2 \
  --put-period-ms 10 \
  --churn-hold-ms 20 \
  --churn-idle-ms 20 \
  --stall-after-secs 5
```
The supervisor prints each child role and PID, for example:
```
[role=supervisor pid=... ts_ms=...] spawned child role=publisher index=0 pid=60572
[role=supervisor pid=... ts_ms=...] spawned child role=churn index=0 pid=60574
```
When the publisher detects no completed `put()` within the stall timeout, it parks and prints:

```
[role=publisher index=0 pid=60572 ts_ms=...] STALL no-completed-put session=0 stall_after=5s; process is parked for debugger attach. Run `lldb -p 60572`, then `thread backtrace all`.
```
Debugging Procedure
Do not use `log enable lldb all`; that records LLDB internal logs and makes the useful backtrace hard to read.
Attach to the publisher PID printed by the repro:

```
lldb -p 60572
(lldb) thread backtrace all
(lldb) process continue
```

Wait about 30 seconds, press Ctrl-C, then capture another backtrace:

```
(lldb) thread backtrace all
```
Repeat the same procedure for one churn peer PID, for example:

```
lldb -p 60574
(lldb) thread backtrace all
(lldb) process continue
```

Wait about 30 seconds, press Ctrl-C, then:

```
(lldb) thread backtrace all
```
Observed Publisher Stack
Publisher PID: `60572`
After continuing the process and interrupting it again, the publisher is still blocked in the same stack:

```
thread #3
  std::sys::sync::rwlock::queue::RwLock::lock_contended
  std::sync::poison::rwlock::RwLock<T>::read
  zenoh::net::routing::dispatcher::pubsub::route_data
  <zenoh::net::routing::dispatcher::face::Face as zenoh::net::primitives::Primitives>::send_push_consume
  <zenoh::net::routing::namespace::Namespace as zenoh::net::primitives::Primitives>::send_push_consume
  zenoh::api::session::Session::resolve_put
  <zenoh::api::builders::publisher::PublicationBuilder<&zenoh::api::publisher::Publisher, PublicationBuilderPut> as zenoh_core::Wait>::wait
  z_p2p_declare_final_stress::spawn_publisher::{closure}
```
The main thread is parked in the repro watchdog:

```
z_p2p_declare_final_stress::report_publisher_stall
z_p2p_declare_final_stress::monitor_publishers
z_p2p_declare_final_stress::run_publisher_process
```
Observed Churn Peer Stacks
Churn peer PID: `60574`
After continuing the process and interrupting it again, the churn process remains blocked in declare/finalize/transport/gossip paths. Representative stacks:

```
thread #2
  <zenoh::api::builders::session::CloseBuilder as zenoh_core::Wait>::wait
  <zenoh::api::session::Session as Drop>::drop
  z_p2p_declare_final_stress::spawn_churn_peer::{closure}

thread #3
  std::sync::poison::mutex::Mutex<T>::lock
  <zenoh::net::routing::dispatcher::face::Face as zenoh::net::primitives::Primitives>::send_declare
  <zenoh::net::routing::namespace::Namespace as zenoh::net::primitives::Primitives>::send_declare
  zenoh::api::session::Session::declare_prefix
  <zenoh::api::builders::subscriber::SubscriberBuilder as zenoh_core::Wait>::wait
  z_p2p_declare_final_stress::spawn_churn_peer::{closure}

thread #5 / thread #10
  std::sync::poison::mutex::Mutex<T>::lock
  zenoh::net::routing::gateway::Gateway::new_transport_unicast
  <zenoh::net::runtime::RuntimeTransportEventHandler as zenoh_transport::TransportEventHandler>::new_unicast
  zenoh_transport::unicast::manager::TransportManager::notify_new_transport_unicast
  zenoh_transport::unicast::manager::TransportManager::init_new_transport_unicast
  zenoh::net::runtime::orchestrator::Runtime::connect_peer
  zenoh::net::protocol::gossip::Gossip::link_states
```
Why This Looks Persistent
The publisher and the churn peer were each resumed for about 30 seconds and interrupted again; the relevant threads were still blocked in the same lock paths. The publisher did not recover and completed no further `put()` calls.
The currently visible victim is the publisher waiting for a `Tables` read lock in `route_data`. The churn process shows concurrent declare/finalize/session-close and transport/gossip paths blocked on mutexes. Together this suggests a lock-ordering deadlock, or a persistent routing lock stall, triggered by peer churn.
Environment
- OS: macOS arm64
- Binary: `target/debug/examples/z_p2p_declare_final_stress`
- Zenoh mode: `peer_to_peer`