ci: Improve che-happy-path test reliability with retry logic and health checks#1581

Open
akurinnoy wants to merge 3 commits into main from improve-che-happy-path-reliability

@akurinnoy
Collaborator

What does this PR do?

Improves reliability of the v14-che-happy-path CI test by adding retry logic, health checks, and comprehensive error diagnostics to the test script.

Why is this needed?

The current happy-path test fails intermittently because of transient infrastructure issues (image pull timeouts, operator reconciliation delays, API server unavailability) that are unrelated to code changes. These spurious failures erode trust in CI results and slow down development.

Key Improvements

Retry Logic

  • 2 retry attempts with exponential backoff (60s base + 0-15s jitter)
  • Cleanup between retries to ensure fresh state
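
The retry strategy above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the variable names (`MAX_RETRIES`, `RETRY_BASE_DELAY`, `RETRY_MAX_JITTER`), the function names, and the optional `cleanup_failed_deployment` hook are all hypothetical, and it assumes "2 retry attempts" means two retries after the initial attempt.

```shell
#!/usr/bin/env bash
# Hypothetical knobs; the real script's variable names may differ.
MAX_RETRIES="${MAX_RETRIES:-2}"            # retries after the first attempt (assumption)
RETRY_BASE_DELAY="${RETRY_BASE_DELAY:-60}" # seconds
RETRY_MAX_JITTER="${RETRY_MAX_JITTER:-15}" # seconds

# Delay before retry N (1-based): base * 2^(N-1) plus 0..RETRY_MAX_JITTER seconds.
backoff_delay() {
  local attempt="$1"
  local jitter=$(( RANDOM % (RETRY_MAX_JITTER + 1) ))
  echo $(( RETRY_BASE_DELAY * (1 << (attempt - 1)) + jitter ))
}

# Run a command until it succeeds or the retry budget is exhausted.
# If a function named cleanup_failed_deployment exists (hypothetical hook),
# it is called before each retry so the next attempt starts from fresh state.
retry_with_backoff() {
  local attempt=0
  local delay
  until "$@"; do
    attempt=$(( attempt + 1 ))
    if [ "$attempt" -gt "$MAX_RETRIES" ]; then
      echo "ERROR: '$*' failed after ${MAX_RETRIES} retries" >&2
      return 1
    fi
    if command -v cleanup_failed_deployment >/dev/null 2>&1; then
      cleanup_failed_deployment || true
    fi
    delay="$(backoff_delay "$attempt")"
    echo "Attempt ${attempt} failed; retrying in ${delay}s" >&2
    sleep "$delay"
  done
}
```

With the defaults, the first retry waits between 60s and 75s, which is consistent with a single observed backoff of 71s. A caller would wrap the flaky stage, for example `retry_with_backoff deploy_che`, where `deploy_che` is a placeholder name.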

Health Checks

  • DWO: Waits for deployment condition=available before proceeding
  • Che: Waits for CheCluster condition=Available with 10-minute timeout
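
A rough sketch of what these two health checks could look like with `kubectl wait`. The namespaces, the deployment name, the CheCluster name, and the 5-minute DWO timeout are assumptions made for illustration; only the waited-on conditions and the 10-minute Che timeout come from the description above.

```shell
#!/usr/bin/env bash
# Assumed resource names and namespaces; adjust to the actual cluster layout.
DWO_NAMESPACE="${DWO_NAMESPACE:-openshift-operators}"
DWO_DEPLOYMENT="${DWO_DEPLOYMENT:-devworkspace-controller-manager}"
CHE_NAMESPACE="${CHE_NAMESPACE:-eclipse-che}"
CHE_CLUSTER_NAME="${CHE_CLUSTER_NAME:-eclipse-che}"

# Block until the DWO deployment reports the Available condition.
# The 5m timeout is an assumption; the PR text does not state one for DWO.
wait_for_dwo() {
  kubectl wait "deployment/${DWO_DEPLOYMENT}" \
    --namespace "${DWO_NAMESPACE}" \
    --for=condition=Available \
    --timeout=5m || {
      echo "ERROR: DWO stage failed: deployment did not become Available" >&2
      return 1
    }
}

# Block until the CheCluster custom resource reports Available (10-minute limit).
wait_for_che() {
  kubectl wait "checluster/${CHE_CLUSTER_NAME}" \
    --namespace "${CHE_NAMESPACE}" \
    --for=condition=Available \
    --timeout=10m || {
      echo "ERROR: Che stage failed: CheCluster did not become Available" >&2
      return 1
    }
}
```

Printing the stage name in each error message keeps a failed run easy to attribute to either DWO or Che.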

Diagnostics

  • Collects operator logs, CheCluster CR, pod info, and events on each failure
  • Clear error messages identifying which stage failed
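
As an illustration of this kind of artifact collection, the sketch below gathers the four items listed above into a per-stage directory. The function name, file names, the `che-operator` deployment name, and the `eclipse-che` namespace are assumptions; `ARTIFACT_DIR` is used on the assumption that the CI environment provides it, with a local fallback.

```shell
#!/usr/bin/env bash
ARTIFACT_DIR="${ARTIFACT_DIR:-/tmp/artifacts}"
CHE_NAMESPACE="${CHE_NAMESPACE:-eclipse-che}"   # assumed namespace

# Collect diagnostics for a failed stage. Every command is best effort:
# a failure to collect one artifact must not mask the original error.
collect_artifacts() {
  local stage="$1"
  local out="${ARTIFACT_DIR}/${stage}"
  mkdir -p "$out"
  echo "ERROR: stage '${stage}' failed; collecting diagnostics into ${out}" >&2

  # Operator logs (deployment name is an assumption).
  kubectl logs deployment/che-operator -n "$CHE_NAMESPACE" --tail=1000 \
    > "${out}/che-operator.log" 2>&1 || true
  # CheCluster custom resource, including its status.
  kubectl get checluster -n "$CHE_NAMESPACE" -o yaml \
    > "${out}/checluster.yaml" 2>&1 || true
  # Pod placement and state.
  kubectl get pods -n "$CHE_NAMESPACE" -o wide \
    > "${out}/pods.txt" 2>&1 || true
  # Events in chronological order.
  kubectl get events -n "$CHE_NAMESPACE" --sort-by=.lastTimestamp \
    > "${out}/events.txt" 2>&1 || true
}
```

Calling `collect_artifacts che-deploy` after a failed attempt would leave one directory per stage, so artifacts from successive retries of different stages do not overwrite each other.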

Configuration

  • Realistic timeouts: pod wait/ready timeouts set to 24 hours (86,400s) instead of the previous defaults
  • Configurable retry count and delays via environment variables
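
One common way to expose such settings is the `${VAR:-default}` pattern, sketched below. All variable names here are hypothetical and may not match the ones the script actually reads; the default values are the ones quoted in this description.

```shell
#!/usr/bin/env bash
# Hypothetical variable names; each can be overridden from the environment.
export MAX_RETRIES="${MAX_RETRIES:-2}"
export RETRY_BASE_DELAY="${RETRY_BASE_DELAY:-60}"       # seconds
export RETRY_MAX_JITTER="${RETRY_MAX_JITTER:-15}"       # seconds
export POD_WAIT_TIMEOUT="${POD_WAIT_TIMEOUT:-86400}"    # seconds (24 hours)
export POD_READY_TIMEOUT="${POD_READY_TIMEOUT:-86400}"  # seconds (24 hours)

# Reject values that are not non-negative integers, so a typo in the
# environment fails fast instead of breaking arithmetic later on.
require_non_negative_int() {
  local name="$1"
  local value="$2"
  case "$value" in
    ''|*[!0-9]*)
      echo "ERROR: ${name} must be a non-negative integer, got '${value}'" >&2
      return 1
      ;;
  esac
}
```

A local run could then shorten the backoff with, for example, `RETRY_BASE_DELAY=5 MAX_RETRIES=1` in front of the script invocation.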

Expected Impact

  • Reliability: Raise the test success rate from ~50% to >90%
  • Debugging: Comprehensive artifacts for faster issue resolution
  • Compatibility: Drop-in replacement, no Prow config changes needed

Testing

✅ Validated on local CRC cluster with PR #1578

  • DWO deployment verified with health checks
  • Retry logic executed successfully (2 attempts with 71s backoff)
  • Comprehensive artifacts collected on failures
  • Error reporting clear and actionable

Documentation

Complete script documentation: .ci/README-CHE-HAPPY-PATH.md

  • Configuration options and environment variables
  • Usage examples (Prow CI + local testing)
  • Common failure scenarios and troubleshooting
  • Artifact locations and structure

@openshift-ci

openshift-ci bot commented Feb 3, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: akurinnoy
Once this PR has been reviewed and has the lgtm label, please assign dkwon17 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

export CHE_REPO_BRANCH="${CHE_REPO_BRANCH:-main}"

# Download and run the remote test script
if ! bash <(curl -s "https://raw.githubusercontent.com/eclipse/che/${CHE_REPO_BRANCH}/tests/devworkspace-happy-path/remote-launch.sh"); then

Check warning

Code scanning / Scorecard

Pinned-Dependencies Medium

score is 2: downloadThenRun not pinned by hash
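
One general way to avoid running an unverified download is to fetch the script to a file, compare its SHA-256 digest against a pinned value, and only then execute it. The sketch below illustrates that pattern; the function names are hypothetical, the pinned digest would have to be maintained by hand, and whether the Scorecard check recognizes this pattern is not confirmed here.

```shell
#!/usr/bin/env bash
# Return 0 only if the file's SHA-256 digest equals the expected hex string.
verify_sha256() {
  local file="$1"
  local expected="$2"
  local actual
  actual="$(sha256sum "$file" | awk '{print $1}')"
  [ "$actual" = "$expected" ]
}

# Download a script, verify it against a pinned digest, then run it.
run_pinned_script() {
  local url="$1"
  local expected="$2"
  local tmp
  local rc=0
  tmp="$(mktemp)"
  # -f: fail on HTTP errors, -sS: quiet but report errors, -L: follow redirects
  if ! curl -fsSL "$url" -o "$tmp"; then
    echo "ERROR: download failed: ${url}" >&2
    rm -f "$tmp"
    return 1
  fi
  if ! verify_sha256 "$tmp" "$expected"; then
    echo "ERROR: checksum mismatch for ${url}; refusing to run" >&2
    rm -f "$tmp"
    return 1
  fi
  bash "$tmp" || rc=$?
  rm -f "$tmp"
  return "$rc"
}
```

A related option is to reference an immutable commit SHA in the raw URL instead of a branch name, which fixes the content at the source rather than checking it after download.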
…hecks

This commit enhances the `.ci/oci-devworkspace-happy-path.sh` script to
significantly improve test reliability in CI environments by adding:

- Health checks for DWO and Che deployments using kubectl wait
- Retry logic with exponential backoff (2 retries, 60s base delay)
- Comprehensive artifact collection on failures
- Graceful error handling and cleanup between retries
- Clear error messages with stage identification

The improvements address flakiness in the v14-che-happy-path Prow test
by handling transient failures (image pull timeouts, API server issues,
operator reconciliation delays) and providing detailed diagnostics for
genuine failures.

Key features:
- DWO verification: Waits for deployment condition=available
- Che verification: Waits for CheCluster condition=Available
- Retry strategy: 2 attempts with exponential backoff + jitter
- Artifact collection: Operator logs, CheCluster CR, pod info, events
- Cleanup: Deletes failed deployments before retry
- Realistic timeouts: 24 hours (86400s) for pod wait/ready

Expected impact: Raise the CI success rate from ~50% to >90% by absorbing
infrastructure-related failures, with significantly better diagnostics.

Assisted-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-Authored-By: Oleksii Kurinnyi <okurinny@redhat.com>
Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>
@akurinnoy akurinnoy force-pushed the improve-che-happy-path-reliability branch from 528e591 to 299de3a Compare February 3, 2026 13:28
…ealth checks

Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>
@dkwon17
Collaborator

dkwon17 commented Feb 4, 2026

/retest

…c and health checks

Signed-off-by: Oleksii Kurinnyi <okurinny@redhat.com>
@dkwon17
Collaborator

dkwon17 commented Feb 6, 2026

/retest

@openshift-ci

openshift-ci bot commented Feb 6, 2026

@akurinnoy: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                   Commit   Details  Required  Rerun command
ci/prow/v14-che-happy-path  751a1a8  link     true      /test v14-che-happy-path

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
