
fix ci flakiness leaderelection test #4388

Merged
epugh merged 5 commits into apache:main from epugh:copilot/fix-ci-flakiness-leaderelection-test
May 8, 2026


@epugh epugh commented May 2, 2026

I found this while running some tests on Crave:

./gradlew :solr:core:test --tests "org.apache.solr.cloud.LeaderElectionIntegrationTest.testSimpleSliceLeaderElection" "-Ptests.jvmargs=-XX:TieredStopAtLevel=1 -XX:+UseParallelGC -XX:ActiveProcessorCount=1 -XX:ReservedCodeCacheSize=120m" -Ptests.seed=F08428E6CA23E0C6 -Ptests.timeoutSuite=600000! -Ptests.useSecurityManager=true -Ptests.file.encoding=UTF-8

I asked Copilot for a tip on what was failing, and this is what it gave me. I believe I've seen this same timing fix proposed for other flaky tests.

Copilot AI and others added 2 commits May 2, 2026 14:08
…erElection

- Increase cluster.waitForActiveCollection timeout from 10s to 60s
- Replace ad-hoc polling loop after expireZkSession with waitForState
  (waits until leader moves away from the expired-session node)
- Replace Thread.sleep with waitForState for node rejoining live nodes
- Replace final polling loop + assertEquals with waitForState
  (waits until original node becomes leader again)

Agent-Logs-Url: https://github.com/epugh/solr/sessions/1eab1dea-7bf6-4911-93ff-03f3c6614cfd

Co-authored-by: epugh <22395+epugh@users.noreply.github.com>
@epugh epugh marked this pull request as draft May 2, 2026 14:15
  createCollection(collection);

- cluster.waitForActiveCollection(collection, 10, TimeUnit.SECONDS, 2, 6);
+ cluster.waitForActiveCollection(collection, 60, TimeUnit.SECONDS, 2, 6);

I believe this sets an overall upper limit on how long the test waits before failing!


// make sure we have waited long enough for the first leader to have come back
Thread.sleep(ZkTestServer.TICK_TIME * 2 + 100);
// Wait until leadership has moved away from the expired-session node

I like using waitForState instead of Thread.sleep.
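
For readers less familiar with the pattern, here is a minimal, self-contained sketch of the idea behind swapping a fixed sleep for a state wait. It does not use Solr's actual `waitForState`; the class `WaitForCondition` and the helper `waitFor` are hypothetical names made up for illustration, and the 50 ms poll interval is an arbitrary assumption. The point it tries to show is that the timeout acts only as an upper bound: the wait returns as soon as the condition holds, and fails loudly if it never does.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

public class WaitForCondition {

  /**
   * Hypothetical helper (not Solr API): polls the condition until it is true,
   * or throws TimeoutException once the upper bound has elapsed.
   */
  public static void waitFor(long timeout, TimeUnit unit, BooleanSupplier condition)
      throws InterruptedException, TimeoutException {
    long deadline = System.nanoTime() + unit.toNanos(timeout);
    while (!condition.getAsBoolean()) {
      if (System.nanoTime() - deadline >= 0) {
        throw new TimeoutException("condition not met within " + timeout + " " + unit);
      }
      Thread.sleep(50); // assumed poll interval, chosen arbitrarily for this sketch
    }
  }

  public static void main(String[] args) throws Exception {
    long start = System.nanoTime();
    // Simulated state change that becomes visible roughly 200 ms from now.
    long readyAt = start + TimeUnit.MILLISECONDS.toNanos(200);

    // The 60 s limit is only a ceiling; the call returns once the condition is true.
    waitFor(60, TimeUnit.SECONDS, () -> System.nanoTime() - readyAt >= 0);

    long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    System.out.println("condition met after ~" + elapsedMs + " ms");
  }
}
```

On a fast machine this finishes in a fraction of a second even with the 60 s limit, while on a slow CI runner it simply waits longer instead of asserting too early, which is presumably why raising the limit from 10 s to 60 s costs little in the passing case.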

Copilot AI and others added 3 commits May 2, 2026 14:28
…nd ZkShardTermsRecoveryTest

Agent-Logs-Url: https://github.com/epugh/solr/sessions/e666fe4d-9dd3-4036-90d5-2d08cbdec281

Co-authored-by: epugh <22395+epugh@users.noreply.github.com>
…in LeaderElectionIntegrationTest

Agent-Logs-Url: https://github.com/epugh/solr/sessions/e666fe4d-9dd3-4036-90d5-2d08cbdec281

Co-authored-by: epugh <22395+epugh@users.noreply.github.com>
@epugh epugh marked this pull request as ready for review May 5, 2026 17:16
@epugh epugh merged commit 89c5413 into apache:main May 8, 2026
5 of 6 checks passed
epugh added a commit that referenced this pull request May 8, 2026
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: epugh <22395+epugh@users.noreply.github.com>
(cherry picked from commit 89c5413)