
EtcdManualResolver.Build() sends two resolver updates causing unnecessary gRPC balancer switch and wasted connections #21660

@zyriljamez

Description


What happened?

EtcdManualResolver.Build() in client/v3/internal/resolver/resolver.go sends two resolver updates to gRPC in rapid
succession with different ServiceConfig values. This causes gRPC to switch balancers mid-connection, tearing down an
in-flight SubConn and producing warnings:

[core] [Channel #2 SubChannel #5] grpc: addrConn.createTransport failed to connect to
{Addr: "127.0.0.1:2379", ...}. Err: connection error: desc = "transport: Error while dialing:
dial tcp 127.0.0.1:2379: operation was canceled"

All etcd operations succeed, but each occurrence wastes resources — a throwaway TCP dial, TLS handshake, and SubConn
teardown per connection. In applications that create etcd clients frequently, this adds up to unnecessary CPU and
network overhead, unbounded channelz ID growth, and persistent warning log noise that obscures real issues.

What did you expect to happen?

A single resolver update with the complete state (endpoints + round_robin ServiceConfig), producing no spurious
warnings.

How can we reproduce it (as minimally and precisely as possible)?

Any kube-apiserver connecting to etcd will produce these warnings. The frequency depends on how often new etcd clients
are created. Every new grpc.ClientConn created through the etcd client triggers the double resolver update in Build(). This includes:

  • newClient() — initial client creation
  • client.Dial() — used by maintenance.Status(), maintenance.HashKV(), and other per-endpoint operations

Any application that creates etcd clients frequently or creates multiple clients concurrently will see these warnings.
The race is timing-dependent: it reproduces reliably under load (many concurrent client initializations) but may not
appear in a small standalone binary where TCP dials complete in under 1 ms.

Anything else we need to know?

Root cause

Build() currently does:

```go
func (r *EtcdManualResolver) Build(target resolver.Target, cc resolver.ClientConn,
	opts resolver.BuildOptions) (resolver.Resolver, error) {
	r.serviceConfig = cc.ParseServiceConfig(`{"loadBalancingPolicy": "round_robin"}`)
	if r.serviceConfig.Err != nil {
		return nil, r.serviceConfig.Err
	}
	res, err := r.Resolver.Build(target, cc, opts) // ← sends update #1 (no ServiceConfig)
	if err != nil {
		return nil, err
	}
	r.updateState() // ← sends update #2 (endpoints + round_robin ServiceConfig)
	return res, nil
}
```

Proposed fix

Move r.updateState() ahead of r.Resolver.Build(), so the complete state (endpoints plus the round_robin ServiceConfig) is recorded before the embedded manual resolver sends its initial update. gRPC then receives a single, complete resolver update and never needs to switch balancers mid-connection.
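The effect of the reordering can be illustrated with a self-contained sketch. The types below are simplified stand-ins, not the real grpc-go API (the real manual.Resolver has more bookkeeping, and may require pre-Build state to be seeded via InitialState rather than UpdateState); they only mirror the replay-last-state-on-Build behavior that makes the reordering collapse two updates into one:

```go
package main

import "fmt"

// State is a simplified stand-in for resolver.State.
type State struct {
	Addresses     []string
	ServiceConfig string
}

// clientConn records every state update it receives, standing in for
// gRPC's resolver.ClientConn.
type clientConn struct {
	updates []State
}

func (cc *clientConn) UpdateState(s State) { cc.updates = append(cc.updates, s) }

// manualResolver mimics the relevant behavior of grpc's manual.Resolver:
// it remembers the last seen state and replays it to the ClientConn
// when Build attaches one.
type manualResolver struct {
	cc        *clientConn
	lastState *State
}

func (r *manualResolver) UpdateState(s State) {
	r.lastState = &s
	if r.cc != nil {
		r.cc.UpdateState(s) // already built: forward immediately
	}
}

func (r *manualResolver) Build(cc *clientConn) {
	r.cc = cc
	if r.lastState != nil {
		cc.UpdateState(*r.lastState) // replay last seen state on Build
	}
}

// buggyBuild mirrors the current ordering: Build replays the endpoints
// recorded at client creation (update #1, no ServiceConfig), then
// updateState sends the complete state (update #2).
func buggyBuild(r *manualResolver, cc *clientConn, eps []string) {
	r.lastState = &State{Addresses: eps}
	r.Build(cc)
	r.UpdateState(State{Addresses: eps, ServiceConfig: "round_robin"})
}

// fixedBuild mirrors the proposed ordering: record the complete state
// first, so Build replays it exactly once.
func fixedBuild(r *manualResolver, cc *clientConn, eps []string) {
	r.UpdateState(State{Addresses: eps, ServiceConfig: "round_robin"})
	r.Build(cc)
}

func main() {
	eps := []string{"127.0.0.1:2379"}

	cc1 := &clientConn{}
	buggyBuild(&manualResolver{}, cc1, eps)
	fmt.Println("buggy ordering, updates:", len(cc1.updates)) // 2

	cc2 := &clientConn{}
	fixedBuild(&manualResolver{}, cc2, eps)
	fmt.Println("fixed ordering, updates:", len(cc2.updates)) // 1
}
```

With the buggy ordering the ClientConn sees two updates and the balancer flips from the default to round_robin mid-connection; with the fix it sees one update that already names round_robin.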

Etcd version (please run commands below)


N/A — the bug is in the etcd client library (go.etcd.io/etcd/client/v3), not the server.

Confirmed affected:

  • etcd client: v3.6.5
  • etcd server: v3.6.7 / v3.6.10
  • grpc-go: v1.72.2

The buggy code in client/v3/internal/resolver/resolver.go has not changed
across v3.6.x releases — all versions using EtcdManualResolver are affected.

Etcd configuration (command line flags or environment variables)

paste your configuration here

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output
