Is there an existing issue for this?
Kong version ($ kong version)
Kong 3.9.0.1, KIC 3.5.6
Current Behavior
We recently migrated from Kong API Gateway to Kong Ingress, primarily because both the microservice count and the number of teams grew, and this makes it easier for each team to manage its services as it sees fit.
Kong version: 3.9.0.1
KIC version: 3.5.6
The setup serves 100+ upstream services and handles ~1M API calls per minute at peak, running on AWS EKS.
We’re experiencing recurring outages and intermittent upstream failures in a Kong + KIC setup after enabling proper connectivity to the Admin API.
Initially, KIC couldn’t reach Kong’s Admin API due to service mesh (Linkerd) interception, so the database was populated only with migration-generated entities using random UUIDs. Once connectivity was restored, KIC performed a full reconciliation using deterministic UUIDs, which triggered large-scale DELETE + CREATE operations across routes, services, and upstreams—causing a full outage.
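For context on why the reconciliation deleted and recreated everything: KIC derives entity IDs deterministically from entity names (name-based UUIDv5), so entities the migration job created with random UUIDv4s can never match, and KIC treats every one of them as unknown. A minimal sketch of the idea, using a hypothetical namespace UUID for illustration (the namespace KIC actually uses is an implementation detail and may differ):

```python
import uuid

# Hypothetical namespace for illustration only; KIC's real namespace constant
# is internal and may differ.
KIC_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "kong-ingress-controller")

def deterministic_id(entity_type: str, name: str) -> uuid.UUID:
    """Derive a stable UUIDv5 from the entity's type and name."""
    return uuid.uuid5(KIC_NAMESPACE, f"{entity_type}/{name}")

# The same input always yields the same ID, across restarts and replicas...
assert deterministic_id("route", "team-a.api") == deterministic_id("route", "team-a.api")

# ...while a migration job calling uuid4() produces IDs that can never match,
# so a full reconciliation replaces (DELETE + CREATE) every migrated entity.
print(deterministic_id("route", "team-a.api"))
print(uuid.uuid4())
```

Note that the route ID in the DELETE log line below (`7836a675-2f04-57b8-...`) has version digit 5, consistent with a name-derived ID.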
After that, we observed a repeating ~2-hour outage cycle caused by controller-runtime backoff retries, and an additional conflict where a migration job keeps recreating deleted entities during ArgoCD sync.
Currently, the system is partially stable, but we still see:
- Periodic (~10–15 min) config syncs triggering Kong worker reloads across all data plane pods
- Transient proxy:-1 errors due to upstream target mismatches
- Persistent UUID mismatches for upstream targets, leading to PUT -> 404 -> POST cycles
- An orphan route being continuously deleted and recreated
What is the root cause of these repeated disruptions, and what is the correct way to fully stabilize KIC reconciliation (especially around UUID mismatches, migrations, and sync behavior)?
Here are some Kong Admin API logs we see. These entries are recorded on every ArgoCD sync, and roughly every 2 hours when KIC's config expires:
```
::1 - - [14/Apr/2026:16:02:52 +0000] "PUT /upstreams/176de4b2-44d5-4df1-83da-e73a6eb10fd9/targets/54aec173-7162-4dd4-9f20-b524290aa89f HTTP/1.1" 404 23 "-" "kong-ingress-controller/3.5.6"
::1 - - [14/Apr/2026:16:02:52 +0000] "POST /upstreams/176de4b2-44d5-4df1-83da-e73a6eb10fd9/targets HTTP/1.1" 201 241 "-" "kong-ingress-controller/3.5.6"
::1 - - [14/Apr/2026:07:26:00 +0000] "DELETE /routes/7836a675-2f04-57b8-b6bc-9ff5d2d215e2 HTTP/1.1" 204 0 "-" "kong-ingress-controller/3.5.6"
```
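The PUT -> 404 -> POST pattern above can be detected automatically from the Admin API access log, which helps confirm how many upstreams are affected per sync. A rough sketch, assuming the log lines follow the format shown above:

```python
import re

# Matches Admin API access-log lines like the ones pasted above.
LINE_RE = re.compile(r'"(?P<method>PUT|POST) (?P<path>\S+) HTTP/1\.1" (?P<status>\d{3})')

def find_recreate_cycles(lines):
    """Return upstream IDs where a target PUT got a 404 followed by a POST 201."""
    pending = set()   # upstream IDs with a failed target PUT
    cycles = []
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        parts = m.group("path").strip("/").split("/")
        if parts[0] != "upstreams" or len(parts) < 3 or parts[2] != "targets":
            continue
        upstream_id = parts[1]
        if m.group("method") == "PUT" and m.group("status") == "404":
            pending.add(upstream_id)
        elif m.group("method") == "POST" and m.group("status") == "201" and upstream_id in pending:
            pending.discard(upstream_id)
            cycles.append(upstream_id)
    return cycles

logs = [
    '::1 - - [14/Apr/2026:16:02:52 +0000] "PUT /upstreams/176de4b2-44d5-4df1-83da-e73a6eb10fd9/targets/54aec173-7162-4dd4-9f20-b524290aa89f HTTP/1.1" 404 23 "-" "kong-ingress-controller/3.5.6"',
    '::1 - - [14/Apr/2026:16:02:52 +0000] "POST /upstreams/176de4b2-44d5-4df1-83da-e73a6eb10fd9/targets HTTP/1.1" 201 241 "-" "kong-ingress-controller/3.5.6"',
]
print(find_recreate_cycles(logs))  # ['176de4b2-44d5-4df1-83da-e73a6eb10fd9']
```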
The upstream is intermittently not found, and requests fail with a 404:
```
{"source":"kong","client_ip":"10.10.49.13","started_at":1776103416608,"latencies":{"kong":0,"receive":0,"request":1,"proxy":-1},"upstream_uri":"","x_forwarded_for":"99.79.87.237","upstream_status":"","host":"api.xyz.com","workspace":"xxxx-916f-2d5892806f08"}
```
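A `proxy` latency of -1 in Kong's log serializer typically means the request was never proxied to any upstream target (the balancer found no target to send it to), which matches the target-mismatch theory. These entries can be filtered out of the JSON logs for correlation with sync timestamps; a small sketch:

```python
import json

def failed_balancer_entries(raw_lines):
    """Yield JSON log entries whose proxy latency is -1 (no upstream was reached)."""
    for raw in raw_lines:
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines
        if entry.get("latencies", {}).get("proxy") == -1:
            yield entry

sample = ('{"source":"kong","client_ip":"10.10.49.13",'
          '"latencies":{"kong":0,"receive":0,"request":1,"proxy":-1},'
          '"upstream_status":"","host":"api.xyz.com"}')
for e in failed_balancer_entries([sample]):
    print(e["host"], e["latencies"]["proxy"])  # api.xyz.com -1
```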
P.S.: The Kong DP and CP pods never crashed or showed any form of resource exhaustion; the RDS PostgreSQL connection pool, CPU utilization, etc. stayed below 50% throughout.
Expected Behavior
Upstreams should always be matched, with no intermittent 404s or target churn.
Steps To Reproduce
No response
Anything else?
No response