[Enhancement] Consider cloud-aware neighbor cache defaults to handle NLB source IP preservation and neigh_confirm() interaction in EKS

## Summary

We observed intermittent pod connectivity failures on EKS with 
Bottlerocket nodes that trace to a specific interaction between 
Linux NUD (Neighbor Unreachability Detection), NLB source IP 
preservation, and Bottlerocket's current neighbor cache sysctl 
defaults. We are raising this as an enhancement request: should
Bottlerocket, as an OS purpose-built for AWS, apply more
cloud-aware defaults that handle this interaction gracefully?

---

## What We Observed

On EKS with Bottlerocket nodes and VPC CNI prefix delegation:

**1. A wrong MAC address gets cached in the neighbor table for a pod IP**

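When a pod IP is reused after a pod deletion, the neighbor entry 
on other nodes may still hold the old MAC. This is expected 
behavior, and normally Linux self-corrects via NUD: the next packet 
sent through a STALE entry moves it to DELAY and then PROBE, and an 
unanswered probe invalidates the entry so the address is resolved again.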

**2. NLB source IP preservation prevents self-correction**

In our cluster, backend pods make outbound HTTPS calls to a 
collector endpoint via NLB. The NLB forwards these to the Istio 
ingress NodePort with source IP preserved as the pod IP.

When these packets arrive at the ingress node, the Linux kernel 
calls `neigh_confirm()` for that pod IP — resetting the REACHABLE 
timer on whatever MAC is currently cached. The kernel has no way 
to know the packet arrived via NLB rather than directly from the pod.
```
Backend pod makes OTel trace call to NLB
  ↓
NLB preserves source IP = pod IP, forwards to NodePort
  ↓
Ingress node receives packet
  ↓
neigh_confirm() fires for pod IP
  ↓
Wrong/stale MAC locked as REACHABLE indefinitely
  ↓
New direct connections from ingress to pod use wrong MAC
  ↓
Packets silently dropped, 503 timeout
```
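
One way to watch this happen is to read the confirmation age that `ip -s neigh show` reports. The helper below is a minimal sketch, not part of our original diagnostics: `confirmed_age` is a hypothetical name, and it assumes the iproute2 statistics layout `used <used>/<confirmed>/<updated>` (seconds).

```bash
# Print seconds since the last reachability confirmation for one IP.
# Reads `ip -s neigh show` output on stdin.
# Assumption: the stats field is "used U/C/Up" = used/confirmed/updated.
# $1 = IP address to look up
confirmed_age() {
    awk -v ip="$1" '$1 == ip {
        for (i = 2; i < NF; i++)
            if ($i == "used") { split($(i + 1), t, "/"); print t[2] }
    }'
}
```

For example, `ip -s neigh show dev eth0 | confirmed_age <POD-IP>` run in a loop should keep returning small values while NLB-forwarded traffic from that pod IP is arriving, even if nothing on the node talks to the pod directly.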

**3. GC never runs, so wrong entries accumulate**

We checked the neighbor GC thresholds on our Bottlerocket nodes:
```
net.ipv4.neigh.default.gc_thresh1 = 8096
net.ipv4.neigh.default.gc_thresh2 = 12288
net.ipv4.neigh.default.gc_thresh3 = 16384

Current neighbor table size: ~1100 entries
```

Because the table stays below `gc_thresh1`, the periodic garbage 
collector never evicts entries. We observed **1009 out of 1102 
entries (91%) in STALE state**, sitting indefinitely, each a 
potential wrong-MAC incident on the next IP reuse.
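
The counts above can be gathered with something like the following sketch. The function names are hypothetical, and it assumes the NUD state is the last field of each `ip neigh show` line, which holds for the output format we saw.

```bash
# Summarize neighbor entries by NUD state.
# Reads `ip neigh show` output on stdin; assumes the state is the last field.
count_nud_states() {
    awk 'NF { count[$NF]++ } END { for (s in count) print s, count[s] }' | sort
}

# Succeed when the table size is below gc_thresh1, i.e. the level under
# which the periodic GC leaves entries alone.
# $1 = number of entries, $2 = gc_thresh1
below_gc_thresh1() {
    [ "$1" -lt "$2" ]
}
```

Typical use on a node would be `ip -4 neigh show | count_nud_states`, followed by `below_gc_thresh1 "$(ip -4 neigh show | wc -l)" "$(sysctl -n net.ipv4.neigh.default.gc_thresh1)"`.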

We noticed that `bottlerocket-core-kit/packages/release/release-sysctl.conf` 
explicitly sets `gc_thresh2` and `gc_thresh3` with the comment 
*"Avoid neighbor table contention in large subnets"*. This 
addresses the upper bound problem (table getting too large) but 
does not address the lower bound problem (GC never running when 
table is small, allowing stale entries to accumulate indefinitely).

**4. Reproduced in a lab 100% of the time**

We reproduced this with complete reliability:
```bash
# Placeholder: set to the pod IP under test
POD_IP="<POD-IP>"

# Inject wrong MAC as STALE
ip neigh del "$POD_IP" dev eth0
ip neigh add "$POD_IP" dev eth0 lladdr 0a:00:00:00:00:01 nud stale

# Without any direct connection -- just OTel trace arriving via NLB
# Entry spontaneously moved from STALE to REACHABLE with fake MAC

ip neigh show "$POD_IP"
# <POD-IP> dev eth0 lladdr 0a:00:00:00:00:01 REACHABLE
# confirmed = 1 second ago

# Direct connection now fails -- pod unreachable
curl -v "$POD_IP:8080"
# hangs indefinitely
```
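
To check for the poisoned entry without reading the table by eye, a comparison like the one below may help. It is a sketch only: `cached_mac` and `mac_mismatch` are hypothetical helpers, and the expected MAC is assumed to be known from elsewhere (for example, from the ENI that currently owns the pod IP).

```bash
# Print the link-layer address cached for an IP.
# Reads `ip neigh show` output on stdin. $1 = IP address to look up
cached_mac() {
    awk -v ip="$1" '$1 == ip {
        for (i = 2; i < NF; i++)
            if ($i == "lladdr") print $(i + 1)
    }'
}

# Succeed when an entry exists and its MAC differs from the expected one.
# $1 = IP, $2 = expected MAC (lowercase, as printed by ip); table on stdin
mac_mismatch() {
    actual=$(cached_mac "$1")
    [ -n "$actual" ] && [ "$actual" != "$2" ]
}
```

For instance, `ip neigh show dev eth0 | mac_mismatch "$POD_IP" "$EXPECTED_MAC" && echo "wrong MAC cached"`, where `EXPECTED_MAC` is a variable you set yourself.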

---

## Why This Is a Bottlerocket Question

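We are not reporting this as a Linux kernel bug. `neigh_confirm()` 
firing on inbound TCP data is consistent with RFC 4861, which allows 
upper-layer forward progress to serve as reachability confirmation, 
and Linux applies the same NUD logic to IPv4 ARP entries. The kernel 
is behaving as designed.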

We are raising this at Bottlerocket because:

1. Bottlerocket explicitly targets AWS EKS as its primary use case
2. Bottlerocket already makes deliberate neighbor cache decisions 
   in `release-sysctl.conf` for cloud subnet scale
3. Bottlerocket knows it runs in AWS VPC with NLB source IP 
   preservation as a common pattern
4. Bottlerocket is in a unique position to set defaults that 
   handle this cloud-specific interaction more gracefully than 
   vanilla Linux defaults

---

## Specific Questions / Requests

**1. Should `arp_notify = 1` be set by default on ENI interfaces?**
```
net.ipv4.conf.eth0.arp_notify = 1
```

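Per the kernel's `ip-sysctl` documentation, this causes the kernel 
to generate gratuitous ARP requests when a device is brought up or 
its hardware address changes. Whether that fires when VPC CNI 
assigns a pod IP to an ENI, and whether the gratuitous ARP reaches 
other nodes in a VPC, would need confirming. If it does, it would 
notify other nodes of the new mapping and invalidate stale neighbor 
entries before they can be poisoned. As we understand it, other CNIs 
(Antrea, Calico, Flannel) send gratuitous ARP explicitly for this 
purpose. Would Bottlerocket consider enabling this by default for 
EKS variants?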

**2. Should `gc_thresh1` be tuned relative to `gc_thresh2`?**

The current defaults set a very high upper range 
(`gc_thresh2=12288`, `gc_thresh3=16384` as reported by `sysctl` on 
our nodes) to avoid table contention in large clusters. But 
`gc_thresh1` (8096 on our nodes) is high enough that GC never 
evicts entries on nodes with typical table sizes 
(~1000-2000 entries in our clusters).

Would it make sense for `gc_thresh1` to be set to a value that 
allows GC to run periodically and clean up stale/failed entries 
even on moderate-sized clusters — while keeping the high upper 
bounds to protect large clusters?

**3. Should `base_reachable_time_ms` be reduced for EKS variants?**

A shorter REACHABLE window means entries decay to STALE faster, 
reducing the window in which a wrong MAC can be kept alive by 
`neigh_confirm()`. The default of 30000 ms, which the kernel 
randomizes to between 15 s and 45 s, is designed for stable 
physical networks. In cloud environments with frequent pod churn 
and IP reuse, a shorter value may be more appropriate.
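
To make the effect of a candidate value concrete, the randomized window can be computed directly. This is a sketch with a hypothetical helper name; it relies on the kernel drawing the effective reachable time from the range between half of and one and a half times `base_reachable_time_ms`.

```bash
# Print the lower and upper bounds (ms) of the randomized REACHABLE window
# for a given base_reachable_time_ms: [base/2, base/2 + base).
# $1 = base_reachable_time_ms
reachable_window_ms() {
    base=$1
    echo "$((base / 2)) $((base / 2 + base))"
}
```

Running `reachable_window_ms "$(sysctl -n net.ipv4.neigh.default.base_reachable_time_ms)"` shows the window for the node's current setting. Any specific replacement value would be a tuning choice for Bottlerocket maintainers, not something this sketch recommends.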

---

## Environment

- Bottlerocket EKS variant (aws-k8s-*)
- VPC CNI prefix delegation (/28 per ENI)
- NLB with source IP preservation enabled
- Istio ingress gateway with NodePort service
- Java OpenTelemetry auto-instrumentation sending traces via NLB

---

## References

- `bottlerocket-core-kit/packages/release/release-sysctl.conf` 
  — existing neighbor cache tuning
- RFC 4861 Section 7.3.1 — upper-layer reachability confirmation
- Linux kernel `net/core/neighbour.c` — `neigh_confirm()` 
  implementation
- AWS Support case raised in parallel for VPC/NLB behavior 
  confirmation
- amazon-vpc-cni-k8s issue raised in parallel for CNI-layer 
  GARP question

We are happy to provide additional diagnostics, packet captures, 
or test results if useful.
