-
Notifications
You must be signed in to change notification settings - Fork 68
refactor(ci): migrate to updated slo action #748
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR refactors the SLO (Service Level Objective) testing infrastructure to migrate from Prometheus Pushgateway to OpenTelemetry OTLP/HTTP metrics export. The changes modernize the metrics collection approach and update the CI workflows to run baseline comparison tests.
Changes:
- Migrated from Prometheus Pushgateway +
prometheus-clientto OpenTelemetry OTLP/HTTP metrics export - Replaced
ratelimiterdependency with a customSyncRateLimiterimplementation - Updated GitHub Actions workflows to run current vs. baseline workload comparison tests
- Removed local playground configurations (compose files) in favor of external infrastructure
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/slo/src/core/metrics.py | Complete rewrite of metrics implementation using OpenTelemetry SDK instead of prometheus-client |
| tests/slo/src/jobs/base.py | Added custom SyncRateLimiter class to replace ratelimiter dependency |
| tests/slo/src/jobs/topic_jobs.py | Updated to use SyncRateLimiter and otlp_endpoint parameter |
| tests/slo/src/jobs/table_jobs.py | Updated to use SyncRateLimiter and otlp_endpoint parameter |
| tests/slo/src/jobs/async_topic_jobs.py | Updated for OTLP metrics, improved type hints and error handling |
| tests/slo/src/runners/base.py | Added type hints for driver property |
| tests/slo/src/runners/topic_runner.py | Added driver assertions and updated to use otlp_endpoint |
| tests/slo/src/runners/table_runner.py | Added driver assertions and updated to use otlp_endpoint |
| tests/slo/src/options.py | Renamed --prom-pgw to --otlp-endpoint with updated help text |
| tests/slo/requirements.txt | Replaced prometheus-client and ratelimiter with OpenTelemetry packages |
| tests/slo/Dockerfile | New Dockerfile for packaging the SLO workload as a container |
| tests/slo/slo_runner.sh | Complete rewrite as a bash script to run workloads in Docker containers |
| tests/slo/README.md | Updated documentation for new OTLP-based approach |
| tests/slo/src/summary.py | Deleted custom Summary implementation (no longer needed) |
| tests/slo/playground/* | Removed local playground configs (moved to external repo) |
| .github/workflows/slo.yml | Completely rewritten to run current vs. baseline comparisons in parallel |
| .github/workflows/slo-report.yml | Updated action version and added SLO label removal job |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| name: Run YDB SLO Tests | ||
| runs-on: ubuntu-latest | ||
| runs-on: "large-runner" |
Copilot
AI
Jan 16, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The runs-on value 'ubuntu-latest' conflicts with the comment on line 37 in the diff which shows 'large-runner'. The actual runner type in the code doesn't match what appears to be intended based on the diff context. If large-runner is required for SLO tests, this should be 'large-runner', not 'ubuntu-latest'.
| runs-on: "large-runner" | |
| runs-on: ubuntu-latest |
| async with self.aio_driver.topic_client.writer( | ||
| self.args.path, | ||
| codec=ydb.TopicCodec.GZIP, | ||
| codec=getattr(ydb, "PublicCodec", ydb.TopicCodec).GZIP, |
Copilot
AI
Jan 16, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use of getattr with a fallback to access codecs is fragile and unclear. Consider documenting why this fallback exists or using a more explicit version check if this is for backward compatibility. The attribute name 'PublicCodec' suggests a newer API, but without documentation, maintainability suffers.
| pass | ||
|
|
||
| @abstractmethod | ||
| def stop(self, labels, start_time: float, attempts: int = 1, error: Exception = None) -> None: |
Copilot
AI
Jan 16, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error parameter should be typed as Optional[Exception] rather than Exception = None. Using None as a default with a non-optional type annotation is inconsistent and can confuse type checkers.
|
|
||
| def stop(self, labels, start_time, attempts=1, error=None): | ||
| duration = time.time() - start_time | ||
| def stop(self, labels, start_time: float, attempts: int = 1, error: Exception = None) -> None: |
Copilot
AI
Jan 16, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error parameter should be typed as Optional[Exception] rather than Exception = None. Using None as a default with a non-optional type annotation is inconsistent and can confuse type checkers.
| self._pending.add(1, attributes={"operation_type": labels_t[0]}) | ||
| return time.time() | ||
|
|
||
| def stop(self, labels, start_time: float, attempts: int = 1, error: Exception = None) -> None: |
Copilot
AI
Jan 16, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error parameter should be typed as Optional[Exception] rather than Exception = None. Using None as a default with a non-optional type annotation is inconsistent and can confuse type checkers.
| def stop(self, labels, start_time: float, attempts: int = 1, error: Exception = None) -> None: | |
| def stop(self, labels, start_time: float, attempts: int = 1, error: Optional[Exception] = None) -> None: |
| - name: Run SLO Tests | ||
| - name: Run SLO Tests (current + baseline in parallel) | ||
| timeout-minutes: 15 |
Copilot
AI
Jan 16, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The timeout of 15 minutes is insufficient for SLO tests with a default duration of 600 seconds (10 minutes). The timeout should account for workload startup, execution (10 minutes), and shutdown grace period. Consider increasing to at least 20 minutes to prevent false failures.
| timeout-minutes: 15 | |
| timeout-minutes: 20 |
| quantile-estimator==0.1.2 | ||
|
|
||
| # OpenTelemetry (OTLP/HTTP exporter) | ||
| # NOTE: OpenTelemetry 1.39.1 requires Python >= 3.9. |
Copilot
AI
Jan 16, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment states OpenTelemetry 1.39.1 requires Python >= 3.9, but the Dockerfile uses Python 3.11. Consider documenting the minimum Python version requirement more prominently, or adding a version check to ensure compatibility.
| # NOTE: OpenTelemetry 1.39.1 requires Python >= 3.9. | |
| # NOTE: These SLO tests require Python >= 3.9 (OpenTelemetry 1.39.1) and are tested with Python 3.11. |
| def __init__(self, min_interval_s: float) -> None: | ||
| self._min_interval_s = max(0.0, float(min_interval_s)) | ||
| self._lock = threading.Lock() | ||
| self._next_allowed_ts = 0.0 # monotonic timestamp |
Copilot
AI
Jan 16, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment says 'monotonic timestamp' but the initialization uses time.time() semantics (wall clock time). While time.monotonic() is used later in enter, the initial value of 0.0 could be confusing. Consider clarifying this is initialized to 0 to allow the first call immediately.
| self._next_allowed_ts = 0.0 # monotonic timestamp | |
| # Next allowed time in the time.monotonic() time base; initialized to 0.0 | |
| # so that the first acquisition is allowed immediately. | |
| self._next_allowed_ts = 0.0 |
| @@ -1,8 +1,8 @@ | |||
| name: SLO Report | |||
| name: slo-report | |||
Copilot
AI
Jan 16, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The workflow name 'slo-report' uses lowercase with hyphen, which is inconsistent with the original 'SLO Report' that used title case. While not a functional issue, maintaining consistent naming conventions across workflows improves readability.
| name: slo-report | |
| name: SLO Report |
Pull request type
Please check the type of change your PR introduces:
What is the current behavior?
Issue Number: N/A
What is the new behavior?
Other information