Enhanced worker crash handling with integrated crash telemetry #5412

raujaiswal · 2025-11-28T10:04:07Z

Context

Fixed worker crash reporting logic.
Added Kusto telemetry to capture and report worker‑crash events.
Added try/catch handling in a crash‑prone scenario to prevent unhandled worker failures.
changes are behind Feature flag

Description

Problem (in Brief):

There were instances where the worker crashed, yet the pipeline continued running until it eventually hit the job‑level timeout. While the worker ideally should never crash, any unhandled exception causing a crash results in unnecessary resource consumption and prolonged pipeline execution—raising concerns about product reliability and efficiency.

How this problem gets noticed?

We identified this issue through an IcM where a regression was detected. Sharing the retrospective below for reference:
https://portal.microsofticm.com/imp/v5/retrospectives/Internal/1250792

Root Cause (Worker Crash – Brief Summary)

In the current Plan v8+ flow, the worker is responsible for reporting job status (Succeeded/Failed/Completed) back to the server. However, due to an unhandled exception, the worker process was crashing before sending the completion event. As a result, the server never received a signal to stop the job, and the pipeline continued running until the job-level timeout, leading to unnecessary resource consumption.

To address this, we implemented improvements focusing on two areas:

Preventing Worker Crashes
The worker ideally should never crash. Wherever crashes were caused by unhandled exceptions, we identified those scenarios and added the required try/catch handling to ensure the worker exits gracefully.
Fallback Handling via Listener
In cases where the worker still crashes unexpectedly, the listener now takes responsibility for notifying the server to stop the job. We also added telemetry signals to track and diagnose worker crashes more effectively.

Repro Steps:

Added exception handling inside the CancelAndKillProcessTree function, which is invoked during step‑level timeouts.

Pipeline Script

trigger: none
pool:
  name: Default
  demands:
    - Agent.Name -equals selfhosted2

stages:
- stage: ReleaseStage
  displayName: ModelD Service Deployment in Public Cloud
  jobs:
  - job: LockboxApprovalRequest
    displayName: Lockbox Approval Request
    dependsOn: []
    timeoutInMinutes: 7
    steps:
    - task: CmdLine@2
      timeoutInMinutes: 1
      continueOnError: true
      inputs:
        script: |
          echo Write your commands here
 
          echo Hello world
    - task: PowerShell@2
      displayName: 'Loop with Wait Script'
      timeoutInMinutes: 3
      inputs:
        targetType: 'inline'
        script: |
          Write-Host "Starting loop script..."
          for ($i = 1; $i -le 5000; $i++) {
              Write-Host "Iteration $i of 10"
              Start-Sleep -Seconds 10
              Write-Host "Completed iteration $i"
          }
          Write-Host "Loop script completed."
    - task: CmdLine@2
      inputs:
        script: |-
          echo Write your commands here
 
          echo Hello world

Job keep on running even worker get crashed: Pipeline

worker crashed logs in terminal:

Risk Assessment (Low / Medium / High)

Assess the risk level and justify your assessment. For example: code path sensitivity, usage scope, or backward compatibility concerns.

Unit Tests Added or Updated (Yes / No)

Indicate whether unit tests were added or modified to reflect the changes.

Additional Testing Performed

List manual or automated tests performed beyond unit tests (e.g., integration, scenario, regression).

Verified worker‑crash reporting via Listener (FF enabled locally): Pipeline

Verified crash‑prevention logic after adding try/catch in the identified scenario: Pipeline

Also tested on-prem 2022 ADO version

Change Behind Feature Flag (Yes / No)

Can this change be behine feature flag, if not why?

Tech Design / Approach

Design has been written and reviewed.
Any architectural decisions, trade-offs, and alternatives are captured.

Design Doc (deep dive in Plan v7, plan v8+): https://microsoftapc.sharepoint.com/:w:/r/teams/ADOTasksandAgents/_layouts/15/Doc.aspx?sourcedoc=%7BDFED2317-D226-4EC9-A6AB-1E3E53D91934%7D&file=worker%20crash%20handling.docx&action=default&mobileredirect=true

Documentation Changes Required (Yes/No)

Indicate whether related documentation needs to be updated.

User guides, API specs, system diagrams, or runbooks are updated.

Logging Added/Updated (Yes/No)

Appropriate log statements are added with meaningful messages.
Logging does not expose sensitive data.
Log levels are used correctly (e.g., info, warn, error).

Telemetry Added/Updated (Yes/No)

Custom telemetry (e.g., counters, timers, error tracking) is added as needed.
Events are tagged with proper metadata for filtering and analysis.
Telemetry is validated in staging or test environments.

Kusto Query:

CustomerIntelligence
| where Area == "AzurePipelinesAgent" 
| where Feature == "WorkerCrash"
| extend data = parse_json(Properties)
| extend tracepoint = tostring(data.properties.TracePoint)
| extend exitCode= tostring(data.properties.ExitCode)
| extend HostId  = tostring(data.hostId)
| extend JobId = tostring(data.properties.JobId)
| project TIMESTAMP, tracepoint, exitCode, HostId, JobId, data

Rollback Scenario and Process (Yes/No)

Rollback plan is documented.

Dependency Impact Assessed and Regression Tested (Yes/No)

All impacted internal modules, APIs, services, and third-party libraries are analyzed.
Results are reviewed and confirmed to not break existing functionality.

- Added enhanced crash handling logic with feature flag control - Implemented dual-mode operation for Plan v7 vs Plan v8+ scenarios - Added forced completion for Plan v8+ worker crashes - Enhanced logging for crash detection and completion analysis - Added notifyServerOfWorkerCrash variable for clear intent - Maintained backward compatibility with original logic - Added comprehensive trace logging for debugging

…crosoft/azure-pipelines-agent into users/raujaiswal/worker-crash

raujaiswal · 2025-12-05T11:01:15Z

/azp run

azure-pipelines · 2025-12-05T11:01:30Z

Azure Pipelines successfully started running 1 pipeline(s).

raujaiswal · 2025-12-08T04:03:21Z

/azp run

azure-pipelines · 2025-12-08T04:03:35Z

Azure Pipelines successfully started running 1 pipeline(s).

src/Agent.Listener/JobDispatcher.cs

src/Agent.Sdk/Knob/AgentKnobs.cs

sanjuyadav24 · 2025-12-08T10:12:46Z

src/Agent.Listener/JobDispatcher.cs

+                                    if (enhancedworkercrashhandlingenabled)
+                                    {
+                                        bool isPlanV8Plus = PlanUtil.GetFeatures(message.Plan).HasFlag(PlanFeatures.JobCompletedPlanEvent);
+                                        bool isWorkerCrash = !TaskResultUtil.IsValidReturnCode(returnCode);


should we rename this to worker failed to send status to server?
as in that case this can be extended to any events in future

src/Agent.Listener/JobDispatcher.cs

sanjuyadav24 · 2025-12-08T10:22:16Z

src/Agent.Listener/JobDispatcher.cs

+                Uri jobServerUrl = systemConnection.Url;
+
+                // Make sure SystemConnection Url match Config Url base for OnPremises server
+                if (!message.Variables.ContainsKey(Constants.Variables.System.ServerType) ||


this logic is similar to the logic in method LogWorkerProcessUnhandledException, could we please check if we can refactor this

Thank you for suggestion, refactored the part.

sanjuyadav24 · 2025-12-08T10:22:53Z

src/Agent.Sdk/Knob/AgentKnobs.cs

+        public static readonly Knob EnhancedWorkerCrashHandling = new Knob(
+            nameof(EnhancedWorkerCrashHandling),
+            "If true, enables enhanced worker crash handling with forced completion for Plan v8+ scenarios where worker crashes cannot send completion events",
+            new EnvironmentKnobSource("ENHANCED_WORKER_CRASH_HANDLING"),


please convert this to runtime knob

sanjuyadav24 · 2025-12-08T10:34:14Z

src/Agent.Listener/JobDispatcher.cs

                                        detailInfo = string.Join(Environment.NewLine, workerOutput);
                                        Trace.Info($"Return code {returnCode} indicate worker encounter an unhandled exception or app crash, attach worker stdout/stderr to JobRequest result.");
                                        await LogWorkerProcessUnhandledException(message, detailInfo, agentCertManager.SkipServerCertificateValidation);
+


shouldn't we have sendEvent to server method here as well, as here only we are logging if worker is terminated due to unhandled exception

Copilot · 2025-12-09T08:04:50Z

@raujaiswal I've opened a new pull request, #5423, to work on those changes. Once the pull request is ready, I'll request review from you.

…crosoft/azure-pipelines-agent into users/raujaiswal/worker-crash

raujaiswal · 2025-12-10T08:48:04Z

/azp run

azure-pipelines · 2025-12-10T08:48:17Z

Azure Pipelines successfully started running 1 pipeline(s).

raujaiswal · 2025-12-10T10:21:31Z

/azp run

azure-pipelines · 2025-12-10T10:21:46Z

Azure Pipelines successfully started running 1 pipeline(s).

sanjuyadav24 · 2025-12-10T10:29:43Z

src/Agent.Listener/JobDispatcher.cs

-                            //server URI (which was set at config time)
-                            jobServerUrl = result;
-                        }
+                        await jobServer.RaisePlanEventAsync(message.Plan.ScopeIdentifier, message.Plan.PlanType, message.Plan.PlanId, jobCompletedEvent, CancellationToken.None);


should we add a retry here? I do see we have a retry in Jobrunner for RaisePlanEventAsync

sanjuyadav24

Please add design doc, other documentation link in work item with anything related to this issue

Also please check, I think we should also add a catch condition the actual block of worker which threw unhandled exception and process crashed

…crosoft/azure-pipelines-agent into users/raujaiswal/worker-crash

raujaiswal · 2025-12-16T13:53:07Z

/azp run

azure-pipelines · 2025-12-16T13:53:19Z

Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines-bot and others added 3 commits November 28, 2025 15:33

agentknob changes for worker crash

7a5bb52

Merge branch 'master' into users/raujaiswal/worker-crash

c05c057

raujaiswal marked this pull request as ready for review December 2, 2025 06:47

raujaiswal requested review from a team as code owners December 2, 2025 06:47

azure-pipelines-bot and others added 3 commits December 3, 2025 14:45

enhanced worker crash reporting for listner and added telemetry

f2f4138

Merge branch 'users/raujaiswal/worker-crash' of https://github.com/mi…

95aedec

…crosoft/azure-pipelines-agent into users/raujaiswal/worker-crash

Merge branch 'master' into users/raujaiswal/worker-crash

f0a3af4

raujaiswal marked this pull request as draft December 3, 2025 09:52

azure-pipelines-bot and others added 4 commits December 4, 2025 12:05

removed feature flag telemetry

e18c663

Merge branch 'users/raujaiswal/worker-crash' of https://github.com/mi…

425e29b

…crosoft/azure-pipelines-agent into users/raujaiswal/worker-crash

cleanup the code

af10a40

Merge branch 'master' into users/raujaiswal/worker-crash

1d3c339

raujaiswal marked this pull request as ready for review December 5, 2025 09:39

raujaiswal changed the title ~~Implement enhanced worker crash handling in JobDispatcher~~ Enhanced worker crash handling in JobDispatcher with added crash telemetry Dec 5, 2025

raujaiswal changed the title ~~Enhanced worker crash handling in JobDispatcher with added crash telemetry~~ Enhanced worker crash handling with added crash telemetry Dec 5, 2025

raujaiswal changed the title ~~Enhanced worker crash handling with added crash telemetry~~ Enhanced worker crash handling with integrated crash telemetry Dec 5, 2025

raujaiswal added the internal label Dec 5, 2025

raujaiswal added the misc Miscellaneous Changes label Dec 8, 2025

dassayantan24 reviewed Dec 8, 2025

View reviewed changes

src/Agent.Listener/JobDispatcher.cs Outdated Show resolved Hide resolved

dassayantan24 reviewed Dec 8, 2025

View reviewed changes

src/Agent.Listener/JobDispatcher.cs Outdated Show resolved Hide resolved

dassayantan24 reviewed Dec 8, 2025

View reviewed changes

src/Agent.Sdk/Knob/AgentKnobs.cs Show resolved Hide resolved

sanjuyadav24 reviewed Dec 8, 2025

View reviewed changes

Copilot AI mentioned this pull request Dec 9, 2025

Clarify Plan v8+ worker crash completion behavior in response to review feedback #5423

Closed

azure-pipelines-bot added 4 commits December 10, 2025 14:02

refactored the chnages

1944d29

Merge branch 'users/raujaiswal/worker-crash' of https://github.com/mi…

eedd2e4

…crosoft/azure-pipelines-agent into users/raujaiswal/worker-crash

refactored the changes

ea914c3

refactored the changes

50fe7d7

azure-pipelines-bot added 3 commits December 10, 2025 14:38

refactored the changes

8e617cf

refactored the changes

f2699ce

refactored the changes

e67f638

sanjuyadav24 reviewed Dec 10, 2025

View reviewed changes

azure-pipelines-bot and others added 2 commits December 11, 2025 11:47

added retry mechanism

0448035

Merge branch 'master' into users/raujaiswal/worker-crash

b56bfc8

sanjuyadav24 approved these changes Dec 12, 2025

View reviewed changes

sanjuyadav24 requested changes Dec 12, 2025

View reviewed changes

raujaiswal and others added 5 commits December 15, 2025 09:29

Merge branch 'master' into users/raujaiswal/worker-crash

3a093df

added try-catch

7342cee

Merge branch 'master' into users/raujaiswal/worker-crash

be0f03a

Merge branch 'users/raujaiswal/worker-crash' of https://github.com/mi…

3c5cd62

…crosoft/azure-pipelines-agent into users/raujaiswal/worker-crash

added tracepoint in telemetry

641d8a0

sanjuyadav24 approved these changes Dec 16, 2025

View reviewed changes

Enhanced worker crash handling with integrated crash telemetry #5412

Are you sure you want to change the base?

Enhanced worker crash handling with integrated crash telemetry #5412

Conversation

raujaiswal commented Nov 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Context

Description

Problem (in Brief):

How this problem gets noticed?

Root Cause (Worker Crash – Brief Summary)

Repro Steps:

Risk Assessment (Low / Medium / High)

Unit Tests Added or Updated (Yes / No)

Additional Testing Performed

Change Behind Feature Flag (Yes / No)

Tech Design / Approach

Documentation Changes Required (Yes/No)

Logging Added/Updated (Yes/No)

Telemetry Added/Updated (Yes/No)

Rollback Scenario and Process (Yes/No)

Dependency Impact Assessed and Regression Tested (Yes/No)

Uh oh!

raujaiswal commented Dec 5, 2025

Uh oh!

azure-pipelines bot commented Dec 5, 2025

Uh oh!

raujaiswal commented Dec 8, 2025

Uh oh!

azure-pipelines bot commented Dec 8, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sanjuyadav24 Dec 8, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

sanjuyadav24 Dec 8, 2025

Choose a reason for hiding this comment

Uh oh!

raujaiswal Dec 10, 2025

Choose a reason for hiding this comment

Uh oh!

sanjuyadav24 Dec 8, 2025

Choose a reason for hiding this comment

Uh oh!

sanjuyadav24 Dec 8, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI commented Dec 9, 2025

Uh oh!

raujaiswal commented Dec 10, 2025

Uh oh!

azure-pipelines bot commented Dec 10, 2025

Uh oh!

raujaiswal commented Dec 10, 2025

Uh oh!

azure-pipelines bot commented Dec 10, 2025

Uh oh!

sanjuyadav24 Dec 10, 2025

Choose a reason for hiding this comment

Uh oh!

raujaiswal Dec 15, 2025

Choose a reason for hiding this comment

Uh oh!

sanjuyadav24 left a comment

Choose a reason for hiding this comment

Uh oh!

raujaiswal commented Dec 16, 2025

Uh oh!

azure-pipelines bot commented Dec 16, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

raujaiswal commented Nov 28, 2025 •

edited

Loading