[Enhancement] Consider cloud-aware neighbor cache defaults to handle NLB source IP preservation and neigh_confirm() interaction in EKS #878

@slice-mohijeet

Summary

We observed intermittent pod connectivity failures on EKS with
Bottlerocket nodes that trace to a specific interaction between
Linux NUD (Neighbor Unreachability Detection), NLB source IP
preservation, and Bottlerocket's current neighbor cache sysctl
defaults. We are raising this as an enhancement request asking
whether Bottlerocket — as an AWS-purpose-built OS — should apply
more cloud-aware defaults to handle this more gracefully.


What We Observed

On EKS with Bottlerocket nodes and VPC CNI prefix delegation:

1. A wrong MAC address gets cached in the neighbor table for a pod IP

When a pod IP is reused after a pod deletion, the neighbor entry
on other nodes may still hold the old MAC. This is expected
behavior and normally Linux would self-correct via NUD.

2. NLB source IP preservation prevents self-correction

In our cluster, backend pods make outbound HTTPS calls to a
collector endpoint via NLB. The NLB forwards these to the Istio
ingress NodePort with source IP preserved as the pod IP.

When these packets arrive at the ingress node, the Linux kernel
calls neigh_confirm() for that pod IP — resetting the REACHABLE
timer on whatever MAC is currently cached. The kernel has no way
to know the packet arrived via NLB rather than directly from the pod.

Backend pod makes OTel trace call to NLB
  ↓
NLB preserves source IP = pod IP, forwards to NodePort
  ↓
Ingress node receives packet
  ↓
neigh_confirm() fires for pod IP
  ↓
Wrong/stale MAC locked as REACHABLE indefinitely
  ↓
New direct connections from ingress to pod use wrong MAC
  ↓
Packets silently dropped, 503 timeout

3. GC never runs, so wrong entries accumulate

We checked the neighbor GC thresholds on our Bottlerocket nodes:

net.ipv4.neigh.default.gc_thresh1 = 8096
net.ipv4.neigh.default.gc_thresh2 = 12288
net.ipv4.neigh.default.gc_thresh3 = 16384

Current neighbor table size: ~1100 entries

Because the table never exceeds gc_thresh1, GC never runs.
We observed 1009 out of 1102 entries (about 92%) in STALE state
sitting indefinitely, each a potential wrong-MAC incident on
the next IP reuse.
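
For anyone who wants to check the same ratio on their own nodes, here is a minimal sketch of how the per-state counts can be derived. It assumes the NUD state is the last field of each `ip neigh show` line; the function name neigh_state_counts is our own and not part of iproute2.

```shell
# Summarize neighbor entries by NUD state.
# Reads `ip neigh show` output on stdin; assumes the state
# (REACHABLE, STALE, FAILED, ...) is the last field of each line.
neigh_state_counts() {
    awk 'NF { count[$NF]++ } END { for (s in count) print s, count[s] }' | sort
}
```

On a node this would be fed as `ip -4 neigh show | neigh_state_counts`, and the total can be compared against `sysctl -n net.ipv4.neigh.default.gc_thresh1`.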

We noticed that bottlerocket-core-kit/packages/release/release-sysctl.conf
explicitly sets gc_thresh2 and gc_thresh3 with the comment
"Avoid neighbor table contention in large subnets". This
addresses the upper bound problem (table getting too large) but
does not address the lower bound problem (GC never running when
table is small, allowing stale entries to accumulate indefinitely).

4. Lab reproduced 100%

We reproduced this with complete reliability:

# Inject wrong MAC as STALE
ip neigh del <POD-IP> dev eth0
ip neigh add <POD-IP> dev eth0 lladdr 0a:00:00:00:00:01 nud stale

# Without any direct connection -- just OTel trace arriving via NLB
# Entry spontaneously moved from STALE to REACHABLE with fake MAC

ip neigh show <POD-IP>
# <POD-IP> dev eth0 lladdr 0a:00:00:00:00:01 REACHABLE
# confirmed = 1 second ago

# Direct connection now fails -- pod unreachable
curl -v <POD-IP>:8080
# hangs indefinitely
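
The check at the heart of this reproduction can be sketched as a pair of helpers that compare the cached MAC with the one the pod's ENI is expected to have. The names cached_mac and mac_matches are hypothetical, and the expected MAC has to come from somewhere else (for example, the node that actually hosts the pod).

```shell
# Print the link-layer address cached for an IP.
# Reads `ip neigh show` output on stdin; $1 is the IP to look up.
cached_mac() {
    awk -v ip="$1" '$1 == ip {
        for (i = 2; i < NF; i++)
            if ($i == "lladdr") { print $(i + 1); exit }
    }'
}

# Compare an expected MAC ($1) with a cached MAC ($2), case-insensitively.
# Returns 0 on match, 1 on mismatch or when no MAC is cached.
mac_matches() {
    expected=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    cached=$(printf '%s' "$2" | tr 'A-F' 'a-f')
    [ -n "$cached" ] && [ "$cached" = "$expected" ]
}
```

Usage would look like `mac_matches "$EXPECTED_MAC" "$(ip neigh show | cached_mac "$POD_IP")"`, where EXPECTED_MAC and POD_IP are placeholders supplied by the operator.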

Why This Is a Bottlerocket Question

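We are not reporting this as a Linux kernel bug. neigh_confirm()
firing on inbound TCP data is consistent with the upper-layer
reachability confirmation described in RFC 4861 Section 7.3.1,
which Linux's shared neighbor subsystem applies to IPv4 ARP
entries as well as IPv6 ones. The kernel is behaving as designed.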

We are raising this at Bottlerocket because:

  1. Bottlerocket explicitly targets AWS EKS as its primary use case
  2. Bottlerocket already makes deliberate neighbor cache decisions
    in release-sysctl.conf for cloud subnet scale
  3. Bottlerocket knows it runs in AWS VPC with NLB source IP
    preservation as a common pattern
  4. Bottlerocket is in a unique position to set defaults that
    handle this cloud-specific interaction more gracefully than
    vanilla Linux defaults

Specific Questions / Requests

1. Should arp_notify = 1 be set by default on ENI interfaces?

net.ipv4.conf.eth0.arp_notify = 1

Per the kernel's ip-sysctl documentation, this makes the kernel
send gratuitous ARP requests when a device is brought up or its
hardware address changes. In EKS with VPC CNI, the intent would
be to notify other nodes when a pod IP becomes active on a new
ENI, invalidating stale neighbor entries before they can be
poisoned. This is the mechanism that other CNIs (Antrea, Calico,
Flannel) use explicitly. Would Bottlerocket consider enabling
this by default for EKS variants?

2. Should gc_thresh1 be tuned relative to gc_thresh2?

The current defaults set a very high upper range
(gc_thresh2=15360, gc_thresh3=16384) to avoid table
contention in large clusters. But gc_thresh1 appears to be
set such that GC never runs on nodes with typical table sizes
(~1000-2000 entries in our clusters).

Would it make sense for gc_thresh1 to be set to a value that
allows GC to run periodically and clean up stale/failed entries
even on moderate-sized clusters — while keeping the high upper
bounds to protect large clusters?
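
To make the question concrete, the sketch below classifies a table size against the three thresholds, following the descriptions in the arp(7) man page. The function name gc_regime and the wording of its output are ours and purely illustrative.

```shell
# Classify a neighbor table size against the GC thresholds (per arp(7)).
# Arguments: entries gc_thresh1 gc_thresh2 gc_thresh3
gc_regime() {
    entries=$1 t1=$2 t2=$3 t3=$4
    if [ "$entries" -lt "$t1" ]; then
        echo "below gc_thresh1: garbage collector does not run"
    elif [ "$entries" -le "$t2" ]; then
        echo "between gc_thresh1 and gc_thresh2: periodic collection"
    elif [ "$entries" -lt "$t3" ]; then
        echo "above gc_thresh2: collection forced after about 5 seconds"
    else
        echo "at gc_thresh3: hard maximum reached"
    fi
}
```

With the values we observed, `gc_regime 1102 8096 12288 16384` lands in the first branch, which is the regime this question is about.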

3. Should base_reachable_time_ms be reduced for EKS variants?

A shorter REACHABLE window means entries decay to STALE faster,
reducing the window in which a wrong MAC can be kept alive by
neigh_confirm(). The default ~30s is designed for stable
physical networks. In cloud environments with frequent pod churn
and IP reuse, a shorter value may be more appropriate.
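
As a rough illustration of what a given setting means in practice: arp(7) describes the actual REACHABLE lifetime as a random value between base_reachable_time/2 and 3*base_reachable_time/2. The helper below (reachable_window_ms, a name of our own) just computes that range for a candidate value.

```shell
# Print the min and max REACHABLE lifetime in milliseconds for a given
# base_reachable_time_ms, using the base/2 .. 3*base/2 range from arp(7).
reachable_window_ms() {
    base=$1
    echo "$((base / 2)) $((base * 3 / 2))"
}
```

For the default, `reachable_window_ms 30000` gives a 15000 to 45000 ms window. The current per-interface value can be read with `sysctl -n net.ipv4.neigh.eth0.base_reachable_time_ms`.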


Environment

  • Bottlerocket EKS variant (aws-k8s-*)
  • VPC CNI prefix delegation (/28 per ENI)
  • NLB with source IP preservation enabled
  • Istio ingress gateway with NodePort service
  • Java OpenTelemetry auto-instrumentation sending traces via NLB

References

  • bottlerocket-core-kit/packages/release/release-sysctl.conf
    — existing neighbor cache tuning
  • RFC 4861 Section 7.3.1 — upper-layer reachability confirmation
  • Linux kernel net/core/neighbour.c: neigh_confirm()
    implementation
  • AWS Support case raised in parallel for VPC/NLB behavior
    confirmation
  • amazon-vpc-cni-k8s issue raised in parallel for CNI-layer
    GARP question

We are happy to provide additional diagnostics, packet captures,
or test results if useful.
