Skip to content

Conversation

@polRk
Copy link
Member

@polRk polRk commented Jan 16, 2026

Pull request type

Please check the type of change your PR introduces:

  • Bugfix
  • Feature
  • Code style update (formatting, renaming)
  • Refactoring (no functional changes, no api changes)
  • Build related changes
  • Documentation content changes
  • Other (please describe):

What is the current behavior?

Issue Number: N/A

What is the new behavior?

Other information

Copilot AI review requested due to automatic review settings January 16, 2026 13:24
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the SLO (Service Level Objective) testing infrastructure to migrate from Prometheus Pushgateway to OpenTelemetry OTLP/HTTP metrics export. The changes modernize the metrics collection approach and update the CI workflows to run baseline comparison tests.

Changes:

  • Migrated from Prometheus Pushgateway + prometheus-client to OpenTelemetry OTLP/HTTP metrics export
  • Replaced ratelimiter dependency with a custom SyncRateLimiter implementation
  • Updated GitHub Actions workflows to run current vs. baseline workload comparison tests
  • Removed local playground configurations (compose files) in favor of external infrastructure

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
tests/slo/src/core/metrics.py Complete rewrite of metrics implementation using OpenTelemetry SDK instead of prometheus-client
tests/slo/src/jobs/base.py Added custom SyncRateLimiter class to replace ratelimiter dependency
tests/slo/src/jobs/topic_jobs.py Updated to use SyncRateLimiter and otlp_endpoint parameter
tests/slo/src/jobs/table_jobs.py Updated to use SyncRateLimiter and otlp_endpoint parameter
tests/slo/src/jobs/async_topic_jobs.py Updated for OTLP metrics, improved type hints and error handling
tests/slo/src/runners/base.py Added type hints for driver property
tests/slo/src/runners/topic_runner.py Added driver assertions and updated to use otlp_endpoint
tests/slo/src/runners/table_runner.py Added driver assertions and updated to use otlp_endpoint
tests/slo/src/options.py Renamed --prom-pgw to --otlp-endpoint with updated help text
tests/slo/requirements.txt Replaced prometheus-client and ratelimiter with OpenTelemetry packages
tests/slo/Dockerfile New Dockerfile for packaging the SLO workload as a container
tests/slo/slo_runner.sh Complete rewrite as a bash script to run workloads in Docker containers
tests/slo/README.md Updated documentation for new OTLP-based approach
tests/slo/src/summary.py Deleted custom Summary implementation (no longer needed)
tests/slo/playground/* Removed local playground configs (moved to external repo)
.github/workflows/slo.yml Completely rewritten to run current vs. baseline comparisons in parallel
.github/workflows/slo-report.yml Updated action version and added SLO label removal job

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.


name: Run YDB SLO Tests
runs-on: ubuntu-latest
runs-on: "large-runner"
Copy link

Copilot AI Jan 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The runs-on value 'ubuntu-latest' conflicts with the comment on line 37 in the diff which shows 'large-runner'. The actual runner type in the code doesn't match what appears to be intended based on the diff context. If large-runner is required for SLO tests, this should be 'large-runner', not 'ubuntu-latest'.

Suggested change
runs-on: "large-runner"
runs-on: ubuntu-latest

Copilot uses AI. Check for mistakes.
async with self.aio_driver.topic_client.writer(
self.args.path,
codec=ydb.TopicCodec.GZIP,
codec=getattr(ydb, "PublicCodec", ydb.TopicCodec).GZIP,
Copy link

Copilot AI Jan 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of getattr with a fallback to access codecs is fragile and unclear. Consider documenting why this fallback exists or using a more explicit version check if this is for backward compatibility. The attribute name 'PublicCodec' suggests a newer API, but without documentation, maintainability suffers.

Copilot uses AI. Check for mistakes.
pass

@abstractmethod
def stop(self, labels, start_time: float, attempts: int = 1, error: Exception = None) -> None:
Copy link

Copilot AI Jan 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error parameter should be typed as Optional[Exception] rather than Exception = None. Using None as a default with a non-optional type annotation is inconsistent and can confuse type checkers.

Copilot uses AI. Check for mistakes.

def stop(self, labels, start_time, attempts=1, error=None):
duration = time.time() - start_time
def stop(self, labels, start_time: float, attempts: int = 1, error: Exception = None) -> None:
Copy link

Copilot AI Jan 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error parameter should be typed as Optional[Exception] rather than Exception = None. Using None as a default with a non-optional type annotation is inconsistent and can confuse type checkers.

Copilot uses AI. Check for mistakes.
self._pending.add(1, attributes={"operation_type": labels_t[0]})
return time.time()

def stop(self, labels, start_time: float, attempts: int = 1, error: Exception = None) -> None:
Copy link

Copilot AI Jan 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error parameter should be typed as Optional[Exception] rather than Exception = None. Using None as a default with a non-optional type annotation is inconsistent and can confuse type checkers.

Suggested change
def stop(self, labels, start_time: float, attempts: int = 1, error: Exception = None) -> None:
def stop(self, labels, start_time: float, attempts: int = 1, error: Optional[Exception] = None) -> None:

Copilot uses AI. Check for mistakes.
- name: Run SLO Tests
- name: Run SLO Tests (current + baseline in parallel)
timeout-minutes: 15
Copy link

Copilot AI Jan 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The timeout of 15 minutes is insufficient for SLO tests with a default duration of 600 seconds (10 minutes). The timeout should account for workload startup, execution (10 minutes), and shutdown grace period. Consider increasing to at least 20 minutes to prevent false failures.

Suggested change
timeout-minutes: 15
timeout-minutes: 20

Copilot uses AI. Check for mistakes.
quantile-estimator==0.1.2

# OpenTelemetry (OTLP/HTTP exporter)
# NOTE: OpenTelemetry 1.39.1 requires Python >= 3.9.
Copy link

Copilot AI Jan 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment states OpenTelemetry 1.39.1 requires Python >= 3.9, but the Dockerfile uses Python 3.11. Consider documenting the minimum Python version requirement more prominently, or adding a version check to ensure compatibility.

Suggested change
# NOTE: OpenTelemetry 1.39.1 requires Python >= 3.9.
# NOTE: These SLO tests require Python >= 3.9 (OpenTelemetry 1.39.1) and are tested with Python 3.11.

Copilot uses AI. Check for mistakes.
def __init__(self, min_interval_s: float) -> None:
self._min_interval_s = max(0.0, float(min_interval_s))
self._lock = threading.Lock()
self._next_allowed_ts = 0.0 # monotonic timestamp
Copy link

Copilot AI Jan 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says 'monotonic timestamp' but the initialization uses time.time() semantics (wall clock time). While time.monotonic() is used later in enter, the initial value of 0.0 could be confusing. Consider clarifying this is initialized to 0 to allow the first call immediately.

Suggested change
self._next_allowed_ts = 0.0 # monotonic timestamp
# Next allowed time in the time.monotonic() time base; initialized to 0.0
# so that the first acquisition is allowed immediately.
self._next_allowed_ts = 0.0

Copilot uses AI. Check for mistakes.
@@ -1,8 +1,8 @@
name: SLO Report
name: slo-report
Copy link

Copilot AI Jan 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The workflow name 'slo-report' uses lowercase with hyphen, which is inconsistent with the original 'SLO Report' that used title case. While not a functional issue, maintaining consistent naming conventions across workflows improves readability.

Suggested change
name: slo-report
name: SLO Report

Copilot uses AI. Check for mistakes.
@vgvoleg vgvoleg merged commit 7ea41c1 into main Jan 16, 2026
23 of 25 checks passed
@vgvoleg vgvoleg deleted the update-slo-action branch January 16, 2026 14:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants