
Conversation

@SumanMaharana (Contributor) commented Dec 23, 2025

Describe your changes:

Chore Add Pipeline Obs averageRunTime
In the pipelines/name/fqn API we currently only send the latest taskStatus.

Type of change:

  • Bug fix
  • Improvement
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

Summary by Gitar

  • Schema extension:
    • Added averageRunTime field to pipelineObservability.json for tracking average pipeline execution duration in milliseconds
  • New calculation logic:
    • calculateAverageRuntime in PipelineRepository.java computes 30-day average using task-level timing with pipeline-level fallback
  • SQL query optimization:
    • Refactored execution trend queries with CTEs for both MySQL and PostgreSQL to support task-based runtime calculation

This will update automatically on new commits.


@github-actions (Contributor)

TypeScript types have been updated based on the JSON schema changes in this PR.

@github-actions github-actions bot requested a review from a team as a code owner December 23, 2025 15:28
@gitar-bot commented Dec 23, 2025

🔍 CI failure analysis for 709a7dd: All 4 CI failures (Python test flakiness, timeouts, Playwright E2E timeouts, and MySQL connection leak in test teardown) are infrastructure/test issues unrelated to this PR's pipeline observability changes.

Issue

All CI failures across multiple jobs are unrelated to this PR's changes and are caused by infrastructure/environment issues, test flakiness, and a MySQL connection leak test issue.

Failure 1: py-run-tests (3.10) - Job 58805164311

Root Cause: Flaky SQL lineage test

  • Test: test_complex_postgres_view
  • Failure: AssertionError: [SqlFluff] ❌ Column lineage assertion failed
  • Result: 3156 tests passed, only 1 failed

Failure 2: py-run-tests (3.10) - Job 58805321295 (Retry)

Root Cause: CI infrastructure timeout

  • 60-minute timeout during reindexing step
  • curl command hung and never completed

Failure 3: playwright-ci-postgresql (2, 6) - Job 58805343633

Root Cause: E2E test timeouts

  • 447 passed, 1 failed, 2 flaky, 2 skipped (1.1 hours)
  • Failures in Glossary and Bulk Import features with page/locator timeout errors

Failure 4: py-run-tests (3.11) - Job 58805331601 (NEW)

Root Cause: MySQL connection leak during test teardown

Test Results:

  • 490 passed, 21 skipped, 1 xfailed, 1397 warnings, 1 error (1:07:20)

Error Details:

ERROR at teardown of test_column_test_cases[value_between_timestamp]
AssertionError: Found 10 open connections to MySQL
assert 10 <= 9

Location: ingestion/tests/integration/mysql/conftest.py:72

What happened:

  • Test fixture checks for dangling MySQL connections after tests complete
  • Expected maximum: 9 connections
  • Actual: 10 connections
  • The test itself (value_between_timestamp) passed successfully
  • Failure occurred only during teardown when checking for connection leaks

Why unrelated to this PR:

  1. PR changes: This PR only modifies:

    • Backend: Pipeline observability schema (pipelineObservability.json)
    • Java code: Runtime calculation in PipelineRepository.java
    • SQL queries: Pipeline metrics in CollectionDAO.java
    • Frontend: Auto-generated TypeScript types
    • No changes to: MySQL test infrastructure, connection management, data quality tests, or integration test fixtures
  2. Test specifics: The failing test is test_data_quality.py::test_column_test_cases[value_between_timestamp] - a MySQL data quality integration test that validates timestamp range checks, completely unrelated to pipeline observability features

  3. Connection leak pattern: This is a known test infrastructure issue (referenced in code comment: "Needs to be handled for Test Cases Ensure Sampler only creates 1 Engine #21187")

  4. High pass rate: 490 of 491 integration tests passed; only the teardown assertion failed

Common Pattern Across All Failures

All four CI failures exhibit infrastructure/environment/test issues:

  • Python 3.10: Test flakiness + reindexing timeout
  • Playwright: Network/response timeouts in UI operations
  • Python 3.11: MySQL connection leak in test teardown (not in actual test code)
  • None related to pipeline observability code changes

Recommendation

The recommendation is to acknowledge these as pre-existing test infrastructure issues unrelated to this PR's code changes:

  1. The MySQL connection leak is documented in GitHub issue "Ensure Sampler only creates 1 Engine" (#21187) and should be addressed separately
  2. The flaky and timed-out jobs can be retried when CI infrastructure is more stable
  3. The PR code itself is sound and should not be blocked by these unrelated failures
Code Review 👍 Approved with suggestions

Well-structured enhancement to pipeline runtime calculation with one notable performance concern around N+1 queries.

⚠️ Performance: N+1 query problem in pipeline observability loop

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/PipelineRepository.java:1174-1183

The calculateAverageRuntime method is called inside a loop that processes each pipeline observability record. This method fetches 30 days of pipeline status data for each individual pipeline using getResultsFromAndToTimestamps. If there are N pipelines, this results in N additional database queries (N+1 pattern).

Impact: For large numbers of pipelines, this will significantly degrade performance and increase database load.

Suggested fix: Consider one of these approaches:

  1. Batch the pipeline FQNs and perform a single query to get all runtime data
  2. Pre-compute and cache average runtime values
  3. Add the average runtime calculation to the existing SQL query that fetches the observability data
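Approach 1 can be sketched as a single batch fetch followed by in-memory grouping. This is a minimal, hypothetical illustration of the shape of the fix; names such as `StatusRow` and `averageRuntimeByFqn` are illustrative and not code from the PR:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: fetch the 30-day status rows for ALL pipelines in one
// query, then group and average in memory, replacing N per-pipeline queries.
public class BatchRuntimeSketch {

  // Minimal stand-in for one batch-fetched status row: the pipeline FQN plus
  // the runtime of that run in milliseconds.
  record StatusRow(String pipelineFqn, long runtimeMillis) {}

  // One pass over the batch: group rows by FQN and average each group.
  static Map<String, Double> averageRuntimeByFqn(List<StatusRow> rows) {
    Map<String, long[]> acc = new HashMap<>(); // fqn -> {sum, count}
    for (StatusRow row : rows) {
      long[] a = acc.computeIfAbsent(row.pipelineFqn(), k -> new long[2]);
      a[0] += row.runtimeMillis();
      a[1]++;
    }
    Map<String, Double> averages = new HashMap<>();
    acc.forEach((fqn, a) -> averages.put(fqn, (double) a[0] / a[1]));
    return averages;
  }

  public static void main(String[] args) {
    List<StatusRow> rows =
        List.of(
            new StatusRow("svc.p1", 100),
            new StatusRow("svc.p1", 300),
            new StatusRow("svc.p2", 50));
    System.out.println(averageRuntimeByFqn(rows).get("svc.p1")); // 200.0
  }
}
```

The loop over observability records would then do a constant-time map lookup per pipeline instead of a database round trip.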
💡 2 suggestions
💡 Edge Case: Inconsistent task filtering between SQL and Java logic

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/PipelineRepository.java:1228-1244

The calculateAverageRuntime Java method requires both startTime AND endTime to be non-null on each task before considering it for min/max calculation (line 1231: if (task.getStartTime() != null && task.getEndTime() != null)).

However, the SQL queries in CollectionDAO.java calculate MIN(startTime) and MAX(endTime) separately, allowing tasks with only startTime or only endTime to contribute to each respective aggregate.

Example scenario: If task A has startTime=100, endTime=null and task B has startTime=null, endTime=200:

  • SQL will compute: runtime = 200 - 100 = 100
  • Java will compute: runtime = null (neither task has both values)

Suggested fix: Align the logic - either update the Java code to match SQL behavior (separate loops for min/max) or update the SQL to only consider tasks where both values exist.
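If the SQL behavior is the intended one, the Java side could mirror it with independent min/max aggregations over nullable timestamps. A hypothetical sketch, assuming a simplified task record (not the PR's actual types):

```java
import java.util.List;
import java.util.OptionalLong;

// Hypothetical sketch mirroring the SQL behavior: MIN(startTime) and
// MAX(endTime) are taken independently, so a task with only one of the two
// timestamps still contributes to its respective aggregate.
public class TaskWindowSketch {

  record Task(Long startTime, Long endTime) {} // both nullable, as in taskStatus

  static Long runtime(List<Task> tasks) {
    OptionalLong minStart =
        tasks.stream().filter(t -> t.startTime() != null).mapToLong(Task::startTime).min();
    OptionalLong maxEnd =
        tasks.stream().filter(t -> t.endTime() != null).mapToLong(Task::endTime).max();
    if (minStart.isEmpty() || maxEnd.isEmpty()) return null;
    return maxEnd.getAsLong() - minStart.getAsLong();
  }

  public static void main(String[] args) {
    // The example scenario above: task A has only a start, task B only an end.
    Long r = runtime(List.of(new Task(100L, null), new Task(null, 200L)));
    System.out.println(r); // 100, matching the SQL result
  }
}
```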

💡 Bug: Potential negative runtime with overlapping parallel tasks

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/CollectionDAO.java:35-45

The runtime calculation MAX(endTime) - MIN(startTime) assumes tasks run sequentially. For pipelines with parallel tasks that may have overlapping execution windows, this calculation can produce results that exceed actual wall-clock time.

More critically, if there's data inconsistency where a task's startTime is greater than another task's endTime (e.g., clock skew, timezone issues, or corrupted data), the calculation MAX(endTime) - MIN(startTime) could yield unexpected negative values.

Impact: The runtime metrics may not accurately represent actual pipeline execution time for parallel task pipelines, and could produce negative values with inconsistent data.

Suggested fix: Add a guard in the SQL or application layer to ensure runtime values are non-negative (e.g., GREATEST(runtime, 0) to clamp at zero, or CASE WHEN runtime < 0 THEN NULL ELSE runtime END to treat a negative value as missing).
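The application-layer variant of this guard can be sketched as follows (hypothetical helper, not code from the PR):

```java
// Hypothetical guard: a negative window (endTime before startTime, e.g. from
// clock skew or corrupted data) is treated as missing rather than reported.
public class RuntimeGuardSketch {

  static Long safeRuntime(long minStartTime, long maxEndTime) {
    long runtime = maxEndTime - minStartTime;
    return runtime < 0 ? null : runtime; // SQL equivalent: CASE WHEN ... THEN NULL
  }

  public static void main(String[] args) {
    System.out.println(safeRuntime(100, 200)); // 100
    System.out.println(safeRuntime(200, 100)); // null: inconsistent timestamps
  }
}
```

Dropping the value (null) rather than clamping to 0 keeps bad rows from dragging the 30-day average toward zero; clamping with GREATEST(runtime, 0) is the simpler choice if a value is always required.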

What Works Well

  • Good use of CTEs for cleaner runtime calculation logic in both MySQL and PostgreSQL queries
  • Proper null-safety handling throughout the Java code with appropriate null checks
  • Consistent implementation across both database backends with database-specific JSON functions

Recommendations

  • Address the N+1 query pattern in calculateAverageRuntime to avoid performance degradation with many pipelines
  • Align the task filtering logic between SQL and Java implementations for consistent behavior
  • Consider adding guards against negative runtime values for edge cases with data inconsistencies



Labels

  • Ingestion
  • safe to test