Failover plugins malfunction for applications running in Open Liberty #1846

@hlond

Describe the bug

We are experiencing an AWS Advanced JDBC Wrapper malfunction for failover in applications running in Open Liberty.

In the case of a failover, the failover plugins (failover and failover2) put the application into a non-functional state from which it never recovers while under load.

We note that without the failover plugins, the application returns to a healthy state after the expected sub-minute blocking period associated with the DNS TTL settings.

Therefore, using the wrapper with the failover plugins enabled breaks applications running in Open Liberty whenever a failover occurs.

To reproduce this issue, we built a minimal project: https://github.com/hlond/open-liberty-rds-failover-test

Expected Behavior

We expect behavior like that of version 2.6.0, the last AWS Advanced JDBC Wrapper version that works with the failover plugins enabled when run in Open Liberty. There, the cluster failover is detected quickly and the application recovers appropriately:


         /\      Grafana   /‾‾/
    /\  /  \     |\  __   /  /
   /  \/    \    | |/ /  /   ‾‾\
  /          \   |   (  |  (‾)  |
 / __________ \  |_|\_\  \_____/


     execution: local
        script: tests/endpoint-wrapper-test.js
        output: -

     scenarios: (100.00%) 1 scenario, 1 max VUs, 1m30s max duration (incl. graceful stop):
              * default: 1 looping VUs for 1m0s (gracefulStop: 30s)

...

running (0m18.0s), 1/1 VUs, 16 complete and 0 interrupted iterations
default   [  30% ] 1 VUs  0m18.0s/1m0s
time="2026-04-10T07:54:56Z" level=info msg="200 OK { duration: 6.086638, blocked: 0.00541, looking_up: 0, connecting: 0, tls_handshaking: 0, sending: 0.018116, waiting: 6.000049, receiving: 0.068473 }" source=console
time="2026-04-10T07:54:56Z" level=info msg="\"2026-04-10T07:54:56.450Z\"" source=console

running (0m19.0s), 1/1 VUs, 17 complete and 0 interrupted iterations
default   [  32% ] 1 VUs  0m19.0s/1m0s

running (0m20.0s), 1/1 VUs, 18 complete and 0 interrupted iterations
default   [  33% ] 1 VUs  0m20.0s/1m0s

running (0m21.0s), 1/1 VUs, 18 complete and 0 interrupted iterations
default   [  35% ] 1 VUs  0m21.0s/1m0s
time="2026-04-10T07:54:59Z" level=info msg="500 Internal Server Error { duration: 2214.257015, blocked: 0.005508, looking_up: 0, connecting: 0, tls_handshaking: 0, sending: 0.019793, waiting: 2213.784239, receiving: 0.452983 }" source=console
time="2026-04-10T07:54:59Z" level=info msg="\"2026-04-10T07:54:59.665Z\"" source=console

running (0m22.0s), 1/1 VUs, 18 complete and 0 interrupted iterations
default   [  37% ] 1 VUs  0m22.0s/1m0s
time="2026-04-10T07:55:00Z" level=info msg="200 OK { duration: 6.198184, blocked: 0.439266, looking_up: 0, connecting: 0.376779, tls_handshaking: 0, sending: 0.05514, waiting: 5.932385, receiving: 0.210659 }" source=console
time="2026-04-10T07:55:00Z" level=info msg="\"2026-04-10T07:55:00.673Z\"" source=console

...


  █ TOTAL RESULTS

    checks_total.......: 57     0.941045/s
    checks_succeeded...: 98.24% 56 out of 57
    checks_failed......: 1.75%  1 out of 57

    ✗ status is 200
      ↳  98% — ✓ 56 / ✗ 1

...

What plugins are used? What other connection properties were set?

wrapperPlugins: auroraConnectionTracker,efm2,failover2,iam; wrapperDialect: aurora-pg; failureDetectionCount: 1; failureDetectionInterval: 500; failoverTimeoutMs: 30000; failoverWriterReconnectIntervalMs: 1000; failoverReaderConnectTimeoutMs: 1000; failoverClusterTopologyRefreshRateMs: 1000; stringType: unspecified
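
For readers who want to try the same settings outside Open Liberty, the following is a minimal sketch of how the properties above could be passed to the wrapper through plain JDBC. The cluster endpoint and database name are hypothetical placeholders. How the test project itself supplies these values is not shown here, so this is an assumption about usage and not a copy of its configuration.

```java
import java.util.Properties;

/** Sketch: the connection properties from this report as java.util.Properties. */
final class WrapperConnectionProperties {

    // Hypothetical endpoint and database name; replace with the real cluster endpoint.
    static final String EXAMPLE_URL =
        "jdbc:aws-wrapper:postgresql://example-cluster.cluster-xyz.eu-central-1.rds.amazonaws.com:5432/exampledb";

    /** Builds the property set listed in the issue, with the values unchanged. */
    static Properties build() {
        Properties props = new Properties();
        props.setProperty("wrapperPlugins", "auroraConnectionTracker,efm2,failover2,iam");
        props.setProperty("wrapperDialect", "aurora-pg");
        props.setProperty("failureDetectionCount", "1");
        props.setProperty("failureDetectionInterval", "500");
        props.setProperty("failoverTimeoutMs", "30000");
        props.setProperty("failoverWriterReconnectIntervalMs", "1000");
        props.setProperty("failoverReaderConnectTimeoutMs", "1000");
        props.setProperty("failoverClusterTopologyRefreshRateMs", "1000");
        // Passed through to the underlying PostgreSQL driver.
        props.setProperty("stringType", "unspecified");
        return props;
    }

    public static void main(String[] args) {
        // Prints the properties only; opening a connection would additionally
        // need the wrapper and PostgreSQL driver on the classpath, e.g.
        // DriverManager.getConnection(EXAMPLE_URL, build()).
        build().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Removing `failover2` from `wrapperPlugins` in this sketch corresponds to the "without failover plugin" run described in this report.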

Current Behavior

With versions 2.6.1 and later, the application responds only with 500 status codes after a failover:

         /\      Grafana   /‾‾/
    /\  /  \     |\  __   /  /
   /  \/    \    | |/ /  /   ‾‾\
  /          \   |   (  |  (‾)  |
 / __________ \  |_|\_\  \_____/


     execution: local
        script: tests/endpoint-wrapper-test.js
        output: -

     scenarios: (100.00%) 1 scenario, 1 max VUs, 1m30s max duration (incl. graceful stop):
              * default: 1 looping VUs for 1m0s (gracefulStop: 30s)

...

running (0m24.0s), 1/1 VUs, 22 complete and 0 interrupted iterations
default   [  40% ] 1 VUs  0m24.0s/1m0s
time="2026-04-10T07:59:57Z" level=info msg="200 OK { duration: 11.780068, blocked: 0.005679, looking_up: 0, connecting: 0, tls_handshaking: 0, sending: 0.018721, waiting: 11.685849, receiving: 0.075498 }" source=console
time="2026-04-10T07:59:57Z" level=info msg="\"2026-04-10T07:59:57.665Z\"" source=console

running (0m25.0s), 1/1 VUs, 23 complete and 0 interrupted iterations
default   [  42% ] 1 VUs  0m25.0s/1m0s
time="2026-04-10T07:59:58Z" level=info msg="500 Internal Server Error { duration: 258.530431, blocked: 0.004885, looking_up: 0, connecting: 0, tls_handshaking: 0, sending: 0.016515, waiting: 258.421381, receiving: 0.092535 }" source=console
time="2026-04-10T07:59:58Z" level=info msg="\"2026-04-10T07:59:58.925Z\"" source=console

running (0m26.0s), 1/1 VUs, 24 complete and 0 interrupted iterations
default   [  43% ] 1 VUs  0m26.0s/1m0s
time="2026-04-10T07:59:59Z" level=info msg="500 Internal Server Error { duration: 15.942459, blocked: 0.437148, looking_up: 0, connecting: 0.374119, tls_handshaking: 0, sending: 0.053102, waiting: 15.760584, receiving: 0.128773 }" source=console
time="2026-04-10T07:59:59Z" level=info msg="\"2026-04-10T07:59:59.942Z\"" source=console

...

  █ TOTAL RESULTS

    checks_total.......: 59     0.970975/s
    checks_succeeded...: 40.67% 24 out of 59
    checks_failed......: 59.32% 35 out of 59

    ✗ status is 200
      ↳  40% — ✓ 24 / ✗ 35

...

We emphasize that the application never recovers, as can be observed over an extended test duration:

running (0h50m08.0s), 1/1 VUs, 297 complete and 0 interrupted iterations
default   [  84% ] 1 VUs  0h50m08.0s/1h0m0s

running (0h50m09.0s), 1/1 VUs, 297 complete and 0 interrupted iterations
default   [  84% ] 1 VUs  0h50m09.0s/1h0m0s

running (0h50m10.0s), 1/1 VUs, 297 complete and 0 interrupted iterations
default   [  84% ] 1 VUs  0h50m10.0s/1h0m0s

running (0h50m11.0s), 1/1 VUs, 297 complete and 0 interrupted iterations
default   [  84% ] 1 VUs  0h50m11.0s/1h0m0s
time="2026-04-10T07:31:23Z" level=info msg="500 Internal Server Error { duration: 10010.194834, blocked: 0.404858, looking_up: 0, connecting: 0.348769, tls_handshaking: 0, sending: 0.033519, waiting: 10010.06085, receiving: 0.100465 }" source=console
time="2026-04-10T07:31:23Z" level=info msg="\"2026-04-10T07:31:23.999Z\"" source=console

However, after removing the failover plugin from the connection options, the application recovers after around 5 seconds, as expected. This matches the TTL of the DNS resolution to the IP address of the alternative RDS instance:

         /\      Grafana   /‾‾/
    /\  /  \     |\  __   /  /
   /  \/    \    | |/ /  /   ‾‾\
  /          \   |   (  |  (‾)  |
 / __________ \  |_|\_\  \_____/


     execution: local
        script: tests/endpoint-wrapper-test.js
        output: -

     scenarios: (100.00%) 1 scenario, 1 max VUs, 1m30s max duration (incl. graceful stop):
              * default: 1 looping VUs for 1m0s (gracefulStop: 30s)

...

running (0m24.0s), 1/1 VUs, 22 complete and 0 interrupted iterations
default   [  40% ] 1 VUs  0m24.0s/1m0s
time="2026-04-10T08:19:06Z" level=info msg="200 OK { duration: 4.970019, blocked: 0.005365, looking_up: 0, connecting: 0, tls_handshaking: 0, sending: 0.01781, waiting: 4.82412, receiving: 0.128089 }" source=console
time="2026-04-10T08:19:06Z" level=info msg="\"2026-04-10T08:19:06.095Z\"" source=console

running (0m25.0s), 1/1 VUs, 23 complete and 0 interrupted iterations
default   [  42% ] 1 VUs  0m25.0s/1m0s

running (0m26.0s), 1/1 VUs, 24 complete and 0 interrupted iterations
default   [  43% ] 1 VUs  0m26.0s/1m0s

running (0m27.0s), 1/1 VUs, 24 complete and 0 interrupted iterations
default   [  45% ] 1 VUs  0m27.0s/1m0s

running (0m28.0s), 1/1 VUs, 24 complete and 0 interrupted iterations
default   [  47% ] 1 VUs  0m28.0s/1m0s

running (0m29.0s), 1/1 VUs, 24 complete and 0 interrupted iterations
default   [  48% ] 1 VUs  0m29.0s/1m0s

running (0m30.0s), 1/1 VUs, 24 complete and 0 interrupted iterations
default   [  50% ] 1 VUs  0m30.0s/1m0s
time="2026-04-10T08:19:12Z" level=info msg="200 OK { duration: 5273.119536, blocked: 0.005668, looking_up: 0, connecting: 0, tls_handshaking: 0, sending: 0.017774, waiting: 5273.029983, receiving: 0.071779 }" source=console
time="2026-04-10T08:19:12Z" level=info msg="\"2026-04-10T08:19:12.369Z\"" source=console

running (0m31.0s), 1/1 VUs, 24 complete and 0 interrupted iterations
default   [  52% ] 1 VUs  0m31.0s/1m0s
time="2026-04-10T08:19:13Z" level=info msg="200 OK { duration: 4.968991, blocked: 0.005643, looking_up: 0, connecting: 0, tls_handshaking: 0, sending: 0.015134, waiting: 4.819513, receiving: 0.134344 }" source=console
time="2026-04-10T08:19:13Z" level=info msg="\"2026-04-10T08:19:13.375Z\"" source=console

...

  █ TOTAL RESULTS

    checks_total.......: 54      0.890355/s
    checks_succeeded...: 100.00% 54 out of 54
    checks_failed......: 0.00%   0 out of 54

    ✓ status is 200

...

We therefore conclude that the failover capabilities are broken for applications running on Open Liberty.

Reproduction Steps

We built a minimal application running on Open Liberty that makes it possible to simulate failover scenarios for different versions of the wrapper.

It can be found here: https://github.com/hlond/open-liberty-rds-failover-test

Specifically, it provides a simulated failover scenario based on k6 and the AWS CLI. For one minute, it generates load on a test endpoint that performs a trivial SQL query against the RDS cluster. After 10 seconds, a failover is triggered via the AWS CLI. This demonstrates how failover behavior changes with the chosen wrapper version.

The README.md of the test project describes how to run it. After setting up the environment, build the application, parameterized by the wrapper version.

Build the application based on the last working version of the wrapper

docker compose build --build-arg AWS_ADVANCED_JDBC_WRAPPER_VERSION=2.6.0

Build the application based on the first broken version of the wrapper

docker compose build --build-arg AWS_ADVANCED_JDBC_WRAPPER_VERSION=2.6.1

Build the application based on a more recent version of the wrapper

docker compose build --build-arg AWS_ADVANCED_JDBC_WRAPPER_VERSION=3.3.0

Each build can be run via

docker compose up

The failover scenario can be run via

./failover_test.sh

Possible Solution

We have determined that the wrapper went from functional to non-functional between patch releases 2.6.0 and 2.6.1.

We further narrowed the regression down to pull request #1444 by cherry-picking commit 5a7eec7 onto the 2.6.0 tag:

git checkout tags/2.6.0
git cherry-pick 5a7eec7

In our analysis, we built two images, one from tags/2.6.0 and one from tags/2.6.0 with 5a7eec7 cherry-picked, each via

./gradlew :aws-advanced-jdbc-wrapper:build -x test

The latter shows the symptoms we have been experiencing since 2.6.1. We emphasize that the malfunction also persists in more recent versions, such as 3.3.0.

Additional Information/Context

Test Setup

In the test project we use:

  • Aurora PostgreSQL cluster version 17
  • A recent version of Open Liberty
  • Jakarta EE 10
  • MicroProfile 7.1
  • Persistence API 3.1
  • A recent version of the PostgreSQL driver
  • networkaddress.cache.ttl set to 5 seconds (in line with AWS guidance)
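
As an illustration of the last item, the following sketch sets the JVM DNS cache TTL programmatically. The test project may set it differently (for example in a `java.security` file), so treat the mechanism here as an assumption. The class name `DnsCacheTtl` is hypothetical.

```java
import java.security.Security;

/** Sketch: set the JVM-wide positive DNS cache TTL to a small value. */
final class DnsCacheTtl {

    /**
     * Sets networkaddress.cache.ttl (in seconds). This should run early during
     * startup, before the JVM resolves any host name, because the JVM typically
     * reads the value once when its address cache is initialized.
     */
    static void apply(int seconds) {
        Security.setProperty("networkaddress.cache.ttl", Integer.toString(seconds));
    }

    public static void main(String[] args) {
        apply(5); // 5 seconds, as used in the test setup of this report
        System.out.println(Security.getProperty("networkaddress.cache.ttl"));
    }
}
```

A short TTL matters here because, without the failover plugins, recovery depends on the cluster endpoint being re-resolved to the new writer instance.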

We suspect that the problem lies in the interaction between the wrapper and the Connection Manager implementation in Open Liberty.

In our example application, we deliberately chose not to include exception handling, as we would expect the Connection Manager to be able to recognize the changed state following the failover and return to a functional state.
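
For context, the kind of application-level handling that was deliberately left out could look like the sketch below for plain JDBC code. It assumes, based on the wrapper's failover documentation, that a successful failover surfaces as an `SQLException` with SQLState `08S02`. The names `FailoverRetry`, `SqlWork` and `withRetry` are hypothetical, and the sketch does not address how such handling would interact with the Open Liberty Connection Manager or JPA.

```java
import java.sql.SQLException;

/** Sketch: re-run a unit of JDBC work when the wrapper reports a completed failover. */
final class FailoverRetry {

    /** A unit of database work that may throw SQLException. */
    @FunctionalInterface
    interface SqlWork<T> {
        T run() throws SQLException;
    }

    // Assumption: SQLState reported after a successful failover, meaning the
    // connection now points to a new instance and the statement must be re-run.
    static final String FAILOVER_SUCCESS_SQLSTATE = "08S02";

    /** Runs the work, retrying only on the failover-success SQLState. */
    static <T> T withRetry(SqlWork<T> work, int maxAttempts) throws SQLException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        SQLException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return work.run();
            } catch (SQLException e) {
                if (!FAILOVER_SUCCESS_SQLSTATE.equals(e.getSQLState())) {
                    throw e; // unrelated error: do not retry
                }
                last = e; // failover completed: try again on the new connection
            }
        }
        throw last; // every attempt hit a failover
    }

    public static void main(String[] args) throws SQLException {
        int[] calls = {0};
        // Simulated work: the first call fails as if a failover had just completed.
        String result = withRetry(() -> {
            if (calls[0]++ == 0) {
                throw new SQLException("simulated failover", FAILOVER_SUCCESS_SQLSTATE);
            }
            return "ok";
        }, 3);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

Such handling would at most be a workaround. The report's point is that with 2.6.0 the application recovers without it.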

This was the case up to version 2.6.0. From version 2.6.1 onwards, the application no longer recovers.

The AWS Advanced JDBC Wrapper version used

2.6.0 (working); 2.6.1 (first non-working version); 3.3.0 (still not working)

JDK version used

IBM Semeru OpenJ9 17 (from Open Liberty image)

Operating System and version

Docker image open-liberty:kernel-slim-java17-openj9 (see test project Dockerfile)

Labels: bug (Something isn't working), pending release (Resolution implemented, pending official release)
