
Conversation

@SumanMaharana (Contributor) commented Dec 23, 2025

Describe your changes:

Chore Add Pipeline Obs averageRunTime
In the pipelines/name/fqn API we currently only send the latest taskStatus.

Type of change:

  • Bug fix
  • Improvement
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

Summary by Gitar

  • Schema extension:
    • Added averageRunTime field to pipelineObservability.json for tracking average pipeline execution duration in milliseconds
  • New calculation logic:
    • calculateAverageRuntime in PipelineRepository.java computes 30-day average using task-level timing with pipeline-level fallback
  • SQL query optimization:
    • Refactored execution trend queries with CTEs for both MySQL and PostgreSQL to support task-based runtime calculation

This will update automatically on new commits.


@github-actions (Contributor)

TypeScript types have been updated based on the JSON schema changes in this PR.

@github-actions github-actions bot requested a review from a team as a code owner December 23, 2025 15:28
@gitar-bot commented Dec 23, 2025

🔍 CI failure analysis for 709a7dd: All 4 CI failures (Python test flakiness, timeouts, Playwright E2E timeouts, and MySQL connection leak in test teardown) are infrastructure/test issues unrelated to this PR's pipeline observability changes.

Issue

All CI failures across multiple jobs are unrelated to this PR's changes and are caused by infrastructure/environment issues, test flakiness, and a MySQL connection leak test issue.

Failure 1: py-run-tests (3.10) - Job 58805164311

Root Cause: Flaky SQL lineage test

  • Test: test_complex_postgres_view
  • Failure: AssertionError: [SqlFluff] ❌ Column lineage assertion failed
  • Result: 3156 tests passed, only 1 failed

Failure 2: py-run-tests (3.10) - Job 58805321295 (Retry)

Root Cause: CI infrastructure timeout

  • 60-minute timeout during reindexing step
  • curl command hung and never completed

Failure 3: playwright-ci-postgresql (2, 6) - Job 58805343633

Root Cause: E2E test timeouts

  • 447 passed, 1 failed, 2 flaky, 2 skipped (1.1 hours)
  • Failures in Glossary and Bulk Import features with page/locator timeout errors

Failure 4: py-run-tests (3.11) - Job 58805331601 (NEW)

Root Cause: MySQL connection leak during test teardown

Test Results:

  • 490 passed, 21 skipped, 1 xfailed, 1397 warnings, 1 error (1:07:20)

Error Details:

ERROR at teardown of test_column_test_cases[value_between_timestamp]
AssertionError: Found 10 open connections to MySQL
assert 10 <= 9

Location: ingestion/tests/integration/mysql/conftest.py:72

What happened:

  • Test fixture checks for dangling MySQL connections after tests complete
  • Expected maximum: 9 connections
  • Actual: 10 connections
  • The test itself (value_between_timestamp) passed successfully
  • Failure occurred only during teardown when checking for connection leaks

Why unrelated to this PR:

  1. PR changes: This PR only modifies:

    • Backend: Pipeline observability schema (pipelineObservability.json)
    • Java code: Runtime calculation in PipelineRepository.java
    • SQL queries: Pipeline metrics in CollectionDAO.java
    • Frontend: Auto-generated TypeScript types
    • No changes to: MySQL test infrastructure, connection management, data quality tests, or integration test fixtures
  2. Test specifics: The failing test is test_data_quality.py::test_column_test_cases[value_between_timestamp] - a MySQL data quality integration test that validates timestamp range checks, completely unrelated to pipeline observability features

  3. Connection leak pattern: This is a known test infrastructure issue (referenced in code comment: "Needs to be handled for Test Cases Ensure Sampler only creates 1 Engine #21187")

  4. High pass rate: 490 of 491 integration tests passed; only the teardown assertion failed

Common Pattern Across All Failures

All four CI failures exhibit infrastructure/environment/test issues:

  • Python 3.10: Test flakiness + reindexing timeout
  • Playwright: Network/response timeouts in UI operations
  • Python 3.11: MySQL connection leak in test teardown (not in actual test code)
  • None related to pipeline observability code changes

Recommendation

The recommendation is to acknowledge these as pre-existing test infrastructure issues unrelated to this PR's code changes:

  1. The MySQL connection leak is documented in GitHub issue "Ensure Sampler only creates 1 Engine" (#21187) and should be addressed separately
  2. The flaky and timed-out jobs can be retried when CI infrastructure is more stable
  3. The PR code itself is sound and should not be blocked by these unrelated failures
Code Review 👍 Approved with suggestions

Well-structured enhancement to pipeline runtime calculation with one notable performance concern around N+1 queries.

⚠️ Performance: N+1 query problem in pipeline observability loop

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/PipelineRepository.java:1174-1183

The calculateAverageRuntime method is called inside a loop that processes each pipeline observability record. This method fetches 30 days of pipeline status data for each individual pipeline using getResultsFromAndToTimestamps. If there are N pipelines, this results in N additional database queries (N+1 pattern).

Impact: For large numbers of pipelines, this will significantly degrade performance and increase database load.

Suggested fix: Consider one of these approaches:

  1. Batch the pipeline FQNs and perform a single query to get all runtime data
  2. Pre-compute and cache average runtime values
  3. Add the average runtime calculation to the existing SQL query that fetches the observability data
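Approach 1 can be sketched as a single batch fetch followed by in-memory grouping. This is a minimal, hypothetical illustration of the shape of the fix; names such as `StatusRow` and `averageRuntimeByFqn` are illustrative and not code from the PR:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: fetch the 30-day status rows for ALL pipelines in one
// query, then group and average in memory, replacing N per-pipeline queries.
public class BatchRuntimeSketch {

  // Minimal stand-in for one batch-fetched status row: the pipeline FQN plus
  // the runtime of that run in milliseconds.
  record StatusRow(String pipelineFqn, long runtimeMillis) {}

  // One pass over the batch: group rows by FQN and average each group.
  static Map<String, Double> averageRuntimeByFqn(List<StatusRow> rows) {
    Map<String, long[]> acc = new HashMap<>(); // fqn -> {sum, count}
    for (StatusRow row : rows) {
      long[] a = acc.computeIfAbsent(row.pipelineFqn(), k -> new long[2]);
      a[0] += row.runtimeMillis();
      a[1]++;
    }
    Map<String, Double> averages = new HashMap<>();
    acc.forEach((fqn, a) -> averages.put(fqn, (double) a[0] / a[1]));
    return averages;
  }

  public static void main(String[] args) {
    List<StatusRow> rows =
        List.of(
            new StatusRow("svc.p1", 100),
            new StatusRow("svc.p1", 300),
            new StatusRow("svc.p2", 50));
    System.out.println(averageRuntimeByFqn(rows).get("svc.p1")); // 200.0
  }
}
```

The loop over observability records would then do a constant-time map lookup per pipeline instead of a database round trip.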
💡 2 suggestions
💡 Edge Case: Inconsistent task filtering between SQL and Java logic

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/PipelineRepository.java:1228-1244

The calculateAverageRuntime Java method requires both startTime AND endTime to be non-null on each task before considering it for min/max calculation (line 1231: if (task.getStartTime() != null && task.getEndTime() != null)).

However, the SQL queries in CollectionDAO.java calculate MIN(startTime) and MAX(endTime) separately, allowing tasks with only startTime or only endTime to contribute to each respective aggregate.

Example scenario: If task A has startTime=100, endTime=null and task B has startTime=null, endTime=200:

  • SQL will compute: runtime = 200 - 100 = 100
  • Java will compute: runtime = null (neither task has both values)

Suggested fix: Align the logic - either update the Java code to match SQL behavior (separate loops for min/max) or update the SQL to only consider tasks where both values exist.
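If the SQL behavior is the intended one, the Java side could mirror it with independent min/max aggregations over nullable timestamps. A hypothetical sketch, assuming a simplified task record (not the PR's actual types):

```java
import java.util.List;
import java.util.OptionalLong;

// Hypothetical sketch mirroring the SQL behavior: MIN(startTime) and
// MAX(endTime) are taken independently, so a task with only one of the two
// timestamps still contributes to its respective aggregate.
public class TaskWindowSketch {

  record Task(Long startTime, Long endTime) {} // both nullable, as in taskStatus

  static Long runtime(List<Task> tasks) {
    OptionalLong minStart =
        tasks.stream().filter(t -> t.startTime() != null).mapToLong(Task::startTime).min();
    OptionalLong maxEnd =
        tasks.stream().filter(t -> t.endTime() != null).mapToLong(Task::endTime).max();
    if (minStart.isEmpty() || maxEnd.isEmpty()) return null;
    return maxEnd.getAsLong() - minStart.getAsLong();
  }

  public static void main(String[] args) {
    // The example scenario above: task A has only a start, task B only an end.
    Long r = runtime(List.of(new Task(100L, null), new Task(null, 200L)));
    System.out.println(r); // 100, matching the SQL result
  }
}
```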

💡 Bug: Potential negative runtime with overlapping parallel tasks

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/CollectionDAO.java:35-45

The runtime calculation MAX(endTime) - MIN(startTime) assumes tasks run sequentially. For pipelines with parallel tasks that may have overlapping execution windows, this calculation can produce results that exceed actual wall-clock time.

More critically, if there's data inconsistency where a task's startTime is greater than another task's endTime (e.g., clock skew, timezone issues, or corrupted data), the calculation MAX(endTime) - MIN(startTime) could yield unexpected negative values.

Impact: The runtime metrics may not accurately represent actual pipeline execution time for parallel task pipelines, and could produce negative values with inconsistent data.

Suggested fix: Add a guard in the SQL or application layer to ensure runtime values are non-negative (e.g., GREATEST(runtime, 0) to clamp at zero, or CASE WHEN runtime < 0 THEN NULL ELSE runtime END to treat a negative value as missing).
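The application-layer variant of this guard can be sketched as follows (hypothetical helper, not code from the PR):

```java
// Hypothetical guard: a negative window (endTime before startTime, e.g. from
// clock skew or corrupted data) is treated as missing rather than reported.
public class RuntimeGuardSketch {

  static Long safeRuntime(long minStartTime, long maxEndTime) {
    long runtime = maxEndTime - minStartTime;
    return runtime < 0 ? null : runtime; // SQL equivalent: CASE WHEN ... THEN NULL
  }

  public static void main(String[] args) {
    System.out.println(safeRuntime(100, 200)); // 100
    System.out.println(safeRuntime(200, 100)); // null: inconsistent timestamps
  }
}
```

Dropping the value (null) rather than clamping to 0 keeps bad rows from dragging the 30-day average toward zero; clamping with GREATEST(runtime, 0) is the simpler choice if a value is always required.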

What Works Well

  • Good use of CTEs for cleaner runtime calculation logic in both MySQL and PostgreSQL queries
  • Proper null-safety handling throughout the Java code with appropriate null checks
  • Consistent implementation across both database backends with database-specific JSON functions

Recommendations

  • Address the N+1 query pattern in calculateAverageRuntime to avoid performance degradation with many pipelines
  • Align the task filtering logic between SQL and Java implementations for consistent behavior
  • Consider adding guards against negative runtime values for edge cases with data inconsistencies



Labels

  • Ingestion
  • safe to test