🤖 bench: enable PTC + hard-restart by default, fix mux-run.sh fatal-on-exit #2239

ammar-agent · 2026-02-07T02:50:22Z

Summary

Enable the Programmatic Tool Calling (PTC) and exec-subagent-hard-restart experiments by default for all users. Fix the loss of token extraction in Terminal-Bench runs when the agent exits non-zero.

Background

The Terminal-Bench baseline (Opus 4.6 / xhigh) scores 57.3% (51/89 pass). Analysis of the failure modes shows:

11 tasks time out — the agent exhausts its 30-min budget doing sequential tool calls that PTC could batch
Sub-agent delegation is used in only 10% of tasks — the hard-restart experiment prevents sub-agents from permanently failing on context overflow
Token tracking is lost for all timed-out tasks because mux-run.sh calls fatal() before the token-extraction Python block runs

Implementation

1. PTC enabled by default (experiments.ts)

programmatic-tool-calling: enabledByDefault: true
The code_execution tool is now available by default in both the desktop app and CLI, letting the model batch 2+ tool calls in a single turn

2. Exec sub-agent hard restart enabled by default (experiments.ts)

exec-subagent-hard-restart: enabledByDefault: true
Sub-agents that hit context overflow will hard-restart instead of failing permanently

3. CLI respects enabledByDefault (run.ts)

buildExperimentsObject() now auto-enables experiments with enabledByDefault: true when no explicit --experiment flags are passed
If the user passes any --experiment flags, those override defaults entirely (no mixing)

4. Fix mux-run.sh fatal-on-exit (mux-run.sh)

Replace fatal "mux agent session failed" with a warning + continue
Token extraction now runs unconditionally (even for timed-out / crashed agent runs)
Original exit code is propagated at the end of the script

Validation

make typecheck — pass
make lint — pass
make fmt-check — pass
make test — 3446 pass, 0 fail

Generated with mux • Model: anthropic:claude-opus-4-6 • Thinking: xhigh • Cost: $92.04

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0b9aad504b

chatgpt-codex-connector · 2026-02-07T02:51:59Z

benchmarks/terminal_bench/mux-run.sh

 if ! printf '%s' "${instruction}" | "${cmd[@]}" | tee "${MUX_OUTPUT_FILE}"; then
-  fatal "mux agent session failed"
+  mux_exit_code=$?
+  log "WARNING: mux agent session exited with code ${mux_exit_code}"


Preserve the failing exit code before ! negates it

Because the pipeline is wrapped in if ! …; then, the ! operator negates the pipeline status, so inside the block $? is always 0 when the command failed. With set -o pipefail, this means mux_exit_code becomes 0 on failures/timeouts, and the script exits 0 even when the agent crashed, so the harness can’t detect failures. Capture the pipeline status before applying ! (or use if ...; then ... else mux_exit_code=$?) to preserve the real exit code.
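The behavior described in this comment can be reproduced in isolation. The following is a minimal sketch, not the actual mux-run.sh code: `fake_agent` is a hypothetical stand-in for the real agent command, and the variable names `buggy_code` and `fixed_code` are chosen only for illustration. It contrasts reading `$?` inside an `if ! …; then` branch with reading it in the `else` branch of a non-negated `if`.

```shell
#!/usr/bin/env bash
set -o pipefail

# Hypothetical stand-in for the agent command: drains stdin, then fails with 3.
fake_agent() { cat >/dev/null; return 3; }

# Buggy pattern: `!` inverts the pipeline status, so the branch is entered
# with $? equal to 0 even though the pipeline failed.
buggy_code=unset
if ! printf '%s' "task" | fake_agent | cat; then
  buggy_code=$?
fi

# Fixed pattern: no negation; the else branch sees the real pipeline status.
fixed_code=0
if printf '%s' "task" | fake_agent | cat; then
  :
else
  fixed_code=$?
fi

echo "buggy=${buggy_code} fixed=${fixed_code}"   # prints: buggy=0 fixed=3
```

With the fixed pattern, a script can log the warning, run its token-extraction step, and finish with `exit "${fixed_code}"` so the harness still sees the failure.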

…exit

- Set enabledByDefault=true for programmatic-tool-calling and exec-subagent-hard-restart experiments
- CLI buildExperimentsObject now auto-enables default experiments when no explicit --experiment flags are passed
- mux-run.sh: replace fatal() on agent non-zero exit with a warning, allowing token extraction to run unconditionally
- Propagate agent exit code at end of script

chatgpt-codex-connector bot reviewed Feb 7, 2026

ammar-agent force-pushed the tbench-enable-ptc-defaults branch from 0b9aad5 to 607e7a2 · February 7, 2026 16:22

bench: add --explore-model CLI flag for sub-agent model override

55d3827

Plumbs MUX_EXPLORE_MODEL through mux_agent.py → mux-run.sh → CLI. Sets config.agentAiDefaults.explore so explore sub-agents use a fast/cheap model instead of inheriting the expensive parent model.
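As a rough illustration of the mux-run.sh leg of that chain, the sketch below forwards the environment variable as a CLI flag only when it is set. The variable name MUX_EXPLORE_MODEL and the flag --explore-model come from this commit; everything else is an assumption, including the `build_cmd` helper, the `mux_cli_placeholder` base command, and the model name `example-fast-model`. The real script's invocation is not shown on this page.

```shell
#!/usr/bin/env bash

# Build the CLI argument array, appending --explore-model only when
# MUX_EXPLORE_MODEL is set and non-empty.
build_cmd() {
  cmd=(mux_cli_placeholder)   # hypothetical base command, not the real invocation
  if [[ -n "${MUX_EXPLORE_MODEL:-}" ]]; then
    cmd+=(--explore-model "${MUX_EXPLORE_MODEL}")
  fi
}

# Without the variable, the command is left unchanged.
unset MUX_EXPLORE_MODEL
build_cmd
echo "without override: ${cmd[*]}"

# With the variable, the flag and its value are appended as separate arguments.
MUX_EXPLORE_MODEL="example-fast-model"   # hypothetical model name
build_cmd
echo "with override: ${cmd[*]}"
```

Keeping the arguments in an array avoids word-splitting problems if a model identifier ever contains unusual characters, and leaving the flag off entirely when the variable is empty lets the CLI fall back to its own default.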

ammar-agent force-pushed the tbench-enable-ptc-defaults branch from 607e7a2 to 55d3827 · February 7, 2026 16:23
