ci: Improve che-happy-path test reliability with retry logic and health checks#1581
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: akurinnoy
…hecks

This commit enhances the `.ci/oci-devworkspace-happy-path.sh` script to significantly improve test reliability in CI environments by adding:
- Health checks for DWO and Che deployments using kubectl wait
- Retry logic with exponential backoff (2 retries, 60s base delay)
- Comprehensive artifact collection on failures
- Graceful error handling and cleanup between retries
- Clear error messages with stage identification

The improvements address flakiness in the v14-che-happy-path Prow test by handling transient failures (image pull timeouts, API server issues, operator reconciliation delays) and providing detailed diagnostics for genuine failures.

Key features:
- DWO verification: Waits for deployment condition=available
- Che verification: Waits for CheCluster condition=Available
- Retry strategy: 2 attempts with exponential backoff + jitter
- Artifact collection: Operator logs, CheCluster CR, pod info, events
- Cleanup: Deletes failed deployments before retry
- Realistic timeouts: 24 hours (86400s) for pod wait/ready

Expected impact: Reduce CI flakiness from ~50% to >90% success rate for infrastructure-related failures, with significantly better diagnostics.

Assisted-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-Authored-By: Oleksii Kurinnyi <okurinny@redhat.com>
Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>
528e591 to 299de3a
…ealth checks
Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>
/retest
…c and health checks
Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>
/retest
@akurinnoy: The following test failed, say
What does this PR do?
Improves reliability of the v14-che-happy-path CI test by adding retry logic, health checks, and comprehensive error diagnostics to the test script.
Why is this needed?
The current happy-path test fails intermittently due to transient infrastructure issues (image pull timeouts, operator reconciliation delays, API server unavailability) that are unrelated to code changes. These false-positive failures reduce CI reliability and slow down development.
Key Improvements
Retry Logic
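As an illustration of the retry strategy named in the commit message (exponential backoff plus jitter), here is a minimal bash sketch. The function name `retry_with_backoff`, its argument order, and the `MAX_JITTER` variable are hypothetical and are not taken from the PR's script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: the function name, arguments, and MAX_JITTER are
# illustrative assumptions, not taken from the PR's script.

# Usage: retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or MAX_ATTEMPTS is exhausted. The wait
# doubles after every failed attempt, plus 0..MAX_JITTER seconds of jitter.
retry_with_backoff() {
  local max_attempts="$1" base_delay="$2"
  shift 2
  local max_jitter="${MAX_JITTER:-10}"
  local attempt=1 delay jitter

  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "ERROR: '$*' failed after ${attempt} attempt(s)" >&2
      return 1
    fi
    # Exponential backoff: base, 2*base, 4*base, ...
    delay=$(( base_delay * (1 << (attempt - 1)) ))
    # RANDOM is a bash built-in; the modulo keeps jitter within 0..max_jitter.
    jitter=$(( RANDOM % (max_jitter + 1) ))
    echo "Attempt ${attempt} failed; retrying in $(( delay + jitter ))s" >&2
    sleep "$(( delay + jitter ))"
    attempt=$(( attempt + 1 ))
  done
}
```

With the 60s base delay from the commit message, a call might look like `retry_with_backoff 2 60 run_happy_path`, where `run_happy_path` stands in for whatever function wraps the deployment and test steps. The jitter spreads out retries so that concurrent CI jobs do not all hit the cluster again at the same moment.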
Health Checks
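The two waits described in this section could be expressed with `kubectl wait` roughly as follows. This is a sketch under stated assumptions: the namespaces, the deployment name, the CheCluster name, and the DWO timeout are placeholder defaults, not values confirmed by the PR. Only the 10-minute CheCluster timeout and the two condition names come from the description.

```shell
#!/usr/bin/env bash
# All names and defaults below are assumptions for illustration; adjust them
# to the cluster under test.
DWO_NAMESPACE="${DWO_NAMESPACE:-devworkspace-controller}"
DWO_DEPLOYMENT="${DWO_DEPLOYMENT:-devworkspace-controller-manager}"
DWO_TIMEOUT="${DWO_TIMEOUT:-300s}"        # assumed; not stated in the PR
CHE_NAMESPACE="${CHE_NAMESPACE:-eclipse-che}"
CHE_CLUSTER_NAME="${CHE_CLUSTER_NAME:-eclipse-che}"
CHE_TIMEOUT="${CHE_TIMEOUT:-600s}"        # 10 minutes, per the PR description

# Block until the DWO deployment reports condition=available.
wait_for_dwo() {
  kubectl wait "deployment/${DWO_DEPLOYMENT}" -n "${DWO_NAMESPACE}" \
    --for=condition=available --timeout="${DWO_TIMEOUT}"
}

# Block until the CheCluster custom resource reports condition=Available.
wait_for_che() {
  kubectl wait "checluster/${CHE_CLUSTER_NAME}" -n "${CHE_NAMESPACE}" \
    --for=condition=Available --timeout="${CHE_TIMEOUT}"
}

# Run both checks and name the failing stage in the error message.
verify_deployments() {
  wait_for_dwo || { echo "ERROR [stage: DWO]: deployment not available" >&2; return 1; }
  wait_for_che || { echo "ERROR [stage: Che]: CheCluster not available" >&2; return 1; }
}
```

Keeping each wait in its own function lets the error message identify which stage failed, which matches the "clear error messages with stage identification" goal in the commit message.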
deployment condition=available before proceeding
CheCluster condition=Available with 10-minute timeout
Diagnostics
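The commit message lists the artifacts gathered on failure: operator logs, the CheCluster CR, pod info, and events. A minimal sketch of such a collection step might look like the following. The function name `collect_artifacts`, the `ARTIFACT_DIR` fallback path, the namespaces, and the operator deployment name are all assumptions for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper; ARTIFACT_DIR, the namespaces, and the deployment name
# are assumptions, not values taken from the PR's script.
collect_artifacts() {
  local dir="${ARTIFACT_DIR:-/tmp/che-happy-path-artifacts}"
  local dwo_ns="${DWO_NAMESPACE:-devworkspace-controller}"
  local che_ns="${CHE_NAMESPACE:-eclipse-che}"
  local dwo_deploy="${DWO_DEPLOYMENT:-devworkspace-controller-manager}"
  mkdir -p "$dir"

  # Each command tolerates failure so that one missing resource does not
  # prevent the remaining artifacts from being collected.
  kubectl logs "deployment/${dwo_deploy}" -n "$dwo_ns" --all-containers \
    > "$dir/dwo-operator.log" 2>&1 || true
  kubectl get checluster -n "$che_ns" -o yaml \
    > "$dir/checluster.yaml" 2>&1 || true
  kubectl get pods -n "$che_ns" -o wide \
    > "$dir/pods.txt" 2>&1 || true
  kubectl get events -n "$che_ns" --sort-by=.lastTimestamp \
    > "$dir/events.txt" 2>&1 || true
}
```

Calling this from the failure path before any cleanup preserves the state of the failed deployment, since cleanup between retries would otherwise delete the resources being inspected.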
Configuration
Expected Impact
Testing
✅ Validated on local CRC cluster with PR #1578
Documentation
Complete script documentation:
.ci/README-CHE-HAPPY-PATH.md