Summary
We observed intermittent pod connectivity failures on EKS with
Bottlerocket nodes that trace to a specific interaction between
Linux NUD (Neighbor Unreachability Detection), NLB source IP
preservation, and Bottlerocket's current neighbor cache sysctl
defaults. We are raising this as an enhancement request asking
whether Bottlerocket — as an AWS-purpose-built OS — should apply
more cloud-aware defaults to handle this more gracefully.
What We Observed
On EKS with Bottlerocket nodes and VPC CNI prefix delegation:
1. A wrong MAC address gets cached in the neighbor table for a pod IP
When a pod IP is reused after a pod deletion, the neighbor entry
on other nodes may still hold the old MAC. This is expected
behavior, and Linux would normally self-correct via NUD.
2. NLB source IP preservation prevents self-correction
In our cluster, backend pods make outbound HTTPS calls to a
collector endpoint via NLB. The NLB forwards these to the Istio
ingress NodePort with source IP preserved as the pod IP.
When these packets arrive at the ingress node, the Linux kernel
calls neigh_confirm() for that pod IP — resetting the REACHABLE
timer on whatever MAC is currently cached. The kernel has no way
to know the packet arrived via NLB rather than directly from the pod.
Backend pod makes OTel trace call to NLB
↓
NLB preserves source IP = pod IP, forwards to NodePort
↓
Ingress node receives packet
↓
neigh_confirm() fires for pod IP
↓
Wrong/stale MAC locked as REACHABLE indefinitely
↓
New direct connections from ingress to pod use wrong MAC
↓
Packets silently dropped, 503 timeout
3. GC never runs, so wrong entries accumulate
We checked the neighbor GC thresholds on our Bottlerocket nodes:
net.ipv4.neigh.default.gc_thresh1 = 8096
net.ipv4.neigh.default.gc_thresh2 = 12288
net.ipv4.neigh.default.gc_thresh3 = 16384
Current neighbor table size: ~1100 entries
Because the table never exceeds gc_thresh1, GC never runs.
We observed 1009 out of 1102 entries (91%) in STALE state —
sitting indefinitely, each a potential wrong-MAC incident on
the next IP reuse.
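The state breakdown quoted above can be tallied with a small helper. This is a minimal sketch, assuming the table is dumped with ip neigh show and that the NUD state is the last field on each line; the function name neigh_state_counts is ours, for illustration only.

```shell
# Tally neighbor entries by NUD state.
# Reads `ip neigh show` output on stdin; assumes the state is the last field.
neigh_state_counts() {
    awk 'NF { count[$NF]++ } END { for (s in count) print s, count[s] }' | sort
}
```

On a node it would be fed with something like: ip -4 neigh show | neigh_state_counts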
We noticed that bottlerocket-core-kit/packages/release/release-sysctl.conf
explicitly sets gc_thresh2 and gc_thresh3 with the comment
"Avoid neighbor table contention in large subnets". This
addresses the upper bound problem (table getting too large) but
does not address the lower bound problem (GC never running when
table is small, allowing stale entries to accumulate indefinitely).
4. Reproduced reliably in the lab
We reproduced this with complete reliability:
# Inject wrong MAC as STALE
ip neigh del <POD-IP> dev eth0
ip neigh add <POD-IP> dev eth0 lladdr 0a:00:00:00:00:01 nud stale
# Without any direct connection -- just OTel trace arriving via NLB
# Entry spontaneously moved from STALE to REACHABLE with fake MAC
# -s adds the used/confirmed/updated timers to the output
ip -s neigh show <POD-IP>
# <POD-IP> dev eth0 lladdr 0a:00:00:00:00:01 REACHABLE
# confirmed = 1 second ago
# Direct connection now fails -- pod unreachable
curl -v <POD-IP>:8080
# hangs indefinitely
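To spot a poisoned entry without waiting for a connection to hang, the cached MAC can be compared against the MAC it should have. The following is a rough sketch: the helper names neigh_mac_for and neigh_mac_matches are hypothetical, and the expected MAC is assumed to be known from another source (for example, the ENI details of the node currently hosting the pod).

```shell
# Print the cached link-layer address for one IP (first matching entry only).
# Reads `ip neigh show` output on stdin; $1 is the IP to look up.
neigh_mac_for() {
    awk -v ip="$1" '$1 == ip && !found {
        for (i = 2; i < NF; i++)
            if ($i == "lladdr") { print $(i + 1); found = 1; break }
    }'
}

# Succeed (exit 0) only when the cached MAC for $1 equals the expected MAC $2.
# Also reads `ip neigh show` output on stdin.
neigh_mac_matches() {
    [ "$(neigh_mac_for "$1")" = "$2" ]
}
```

Example: ip neigh show | neigh_mac_matches <POD-IP> <EXPECTED-MAC> || echo "stale or wrong MAC cached"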
Why This Is a Bottlerocket Question
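We are not reporting this as a Linux kernel bug. neigh_confirm()
firing on inbound TCP data is consistent with the upper-layer
reachability confirmation described in RFC 4861 Section 7.3.1,
which Linux's shared neighbor subsystem also applies to IPv4 ARP
entries. The kernel is behaving as designed.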
We are raising this at Bottlerocket because:
- Bottlerocket explicitly targets AWS EKS as its primary use case
- Bottlerocket already makes deliberate neighbor cache decisions
in release-sysctl.conf for cloud subnet scale
- Bottlerocket knows it runs in AWS VPC with NLB source IP
preservation as a common pattern
- Bottlerocket is in a unique position to set defaults that
handle this cloud-specific interaction more gracefully than
vanilla Linux defaults
Specific Questions / Requests
1. Should arp_notify = 1 be set by default on ENI interfaces?
net.ipv4.conf.eth0.arp_notify = 1
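Our understanding is that this makes the kernel announce its
addresses with a gratuitous ARP (the kernel documentation
describes the trigger as a device being brought up or its
hardware address changing). In EKS with VPC CNI, the intent
would be to notify other nodes when a pod IP is assigned to a
new ENI, invalidating stale neighbor entries before they can be
poisoned.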
This is the mechanism that other CNIs (Antrea, Calico, Flannel)
use explicitly. Would Bottlerocket consider enabling this by
default for EKS variants?
2. Should gc_thresh1 be tuned relative to gc_thresh2?
The current defaults set a very high upper range
(gc_thresh2=15360, gc_thresh3=16384) to avoid table
contention in large clusters. But gc_thresh1 appears to be
set such that GC never runs on nodes with typical table sizes
(~1000-2000 entries in our clusters).
Would it make sense for gc_thresh1 to be set to a value that
allows GC to run periodically and clean up stale/failed entries
even on moderate-sized clusters — while keeping the high upper
bounds to protect large clusters?
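The kernel documentation describes gc_thresh1 as the minimum number of entries to keep: the garbage collector does not purge entries while the table holds fewer than this. The check below is a small sketch of that comparison; the function name neigh_gc_status is ours and not part of any tool.

```shell
# Report whether periodic neighbor GC is eligible to run.
# $1 = current neighbor entry count, $2 = gc_thresh1.
neigh_gc_status() {
    if [ "$1" -lt "$2" ]; then
        echo "entries=$1 below gc_thresh1=$2: periodic GC skipped"
    else
        echo "entries=$1 at or above gc_thresh1=$2: periodic GC eligible"
    fi
}
```

With the numbers from our nodes, neigh_gc_status 1102 8096 reports that periodic GC is skipped. On a live node the arguments could come from ip -4 neigh show | wc -l and sysctl -n net.ipv4.neigh.default.gc_thresh1.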
3. Should base_reachable_time_ms be reduced for EKS variants?
A shorter REACHABLE window means entries decay to STALE faster,
reducing the window in which a wrong MAC can be kept alive by
neigh_confirm(). The default ~30s is designed for stable
physical networks. In cloud environments with frequent pod churn
and IP reuse, a shorter value may be more appropriate.
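For sizing a candidate value: the kernel documentation describes the effective reachable time as a random value between base_reachable_time/2 and 3*base_reachable_time/2. The arithmetic can be sketched as follows; reachable_window_ms is an illustrative name.

```shell
# Print the minimum and maximum randomized reachable time in milliseconds.
# $1 = base_reachable_time_ms.
reachable_window_ms() {
    echo "$(( $1 / 2 )) $(( $1 * 3 / 2 ))"
}
```

For the default base of 30000 ms this gives a window of 15000 to 45000 ms. Note that each neigh_confirm() restarts that window, so a shorter base narrows the gap between confirmations that keeps a wrong MAC alive but does not remove it.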
Environment
- Bottlerocket EKS variant (aws-k8s-*)
- VPC CNI prefix delegation (/28 per ENI)
- NLB with source IP preservation enabled
- Istio ingress gateway with NodePort service
- Java OpenTelemetry auto-instrumentation sending traces via NLB
References
- bottlerocket-core-kit/packages/release/release-sysctl.conf:
existing neighbor cache tuning
- RFC 4861 Section 7.3.1: upper-layer reachability confirmation
- Linux kernel net/core/neighbour.c: neigh_confirm() implementation
- AWS Support case raised in parallel for VPC/NLB behavior
confirmation
- amazon-vpc-cni-k8s issue raised in parallel for CNI-layer
GARP question
We are happy to provide additional diagnostics, packet captures,
or test results if useful.