@ammar-agent
Collaborator

Summary

Enable the Programmatic Tool Calling (PTC) and exec-subagent-hard-restart experiments by default for all users. Fix the loss of token extraction in Terminal-Bench runs when the agent exits non-zero.

Background

The Terminal-Bench baseline (Opus 4.6 / xhigh) scores 57.3% (51/89 pass). Analysis of the failure modes shows:

  • 11 tasks time out — the agent exhausts its 30-min budget doing sequential tool calls that PTC could batch
  • Sub-agent delegation is used in only 10% of tasks — the hard-restart experiment prevents sub-agents from permanently failing on context overflow
  • Token tracking is lost for all timed-out tasks because mux-run.sh calls fatal() before the token-extraction Python block runs

Implementation

1. PTC enabled by default (experiments.ts)

  • programmatic-tool-calling: enabledByDefault: true
  • The code_execution tool is now available by default in both the desktop app and CLI, letting the model batch 2+ tool calls in a single turn

2. Exec sub-agent hard restart enabled by default (experiments.ts)

  • exec-subagent-hard-restart: enabledByDefault: true
  • Sub-agents that hit context overflow will hard-restart instead of failing permanently

3. CLI respects enabledByDefault (run.ts)

  • buildExperimentsObject() now auto-enables experiments with enabledByDefault: true when no explicit --experiment flags are passed
  • If the user passes any --experiment flags, those override defaults entirely (no mixing)

4. Fix mux-run.sh fatal-on-exit (mux-run.sh)

  • Replace fatal "mux agent session failed" with a warning + continue
  • Token extraction now runs unconditionally (even for timed-out / crashed agent runs)
  • Original exit code is propagated at the end of the script
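
The warn-and-continue flow in item 4 can be sketched roughly as follows. This is a minimal illustration assuming bash with pipefail, not the actual mux-run.sh: fake_agent, extract_tokens, run_session, and the tokens=... output format are all hypothetical stand-ins.

```shell
#!/usr/bin/env bash
set -o pipefail

# All names below are illustrative assumptions, not the real mux-run.sh.
log() { printf '%s\n' "$*" >&2; }

# Hypothetical agent: emits some output, then fails as if it had timed out.
fake_agent() { echo "tokens=1234"; return 124; }

# Hypothetical extraction step: pulls the number out of "tokens=<n>".
extract_tokens() { grep -o 'tokens=[0-9]*' "$1" | cut -d= -f2; }

run_session() {
  local output_file=$1 agent_code
  # No `!` here, so the else branch still sees the pipeline's real status.
  if fake_agent | tee "${output_file}" >/dev/null; then
    agent_code=0
  else
    agent_code=$?
    log "WARNING: agent exited with code ${agent_code}"
  fi
  # Extraction runs whether or not the agent succeeded.
  TOKENS=$(extract_tokens "${output_file}")
  return "${agent_code}"
}

output_file=$(mktemp)
session_code=0
run_session "${output_file}" || session_code=$?
rm -f "${output_file}"
echo "tokens=${TOKENS} exit=${session_code}"
# A real script would end with: exit "${session_code}"
```

Deferring the exit to the very end keeps the extraction step reachable while still letting the harness see the agent's failure.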

Validation

  • make typecheck — pass
  • make lint — pass
  • make fmt-check — pass
  • make test — 3446 pass, 0 fail

Generated with mux • Model: anthropic:claude-opus-4-6 • Thinking: xhigh • Cost: $92.04

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0b9aad504b

Comment on lines 98 to +106
  if ! printf '%s' "${instruction}" | "${cmd[@]}" | tee "${MUX_OUTPUT_FILE}"; then
-   fatal "mux agent session failed"
+   mux_exit_code=$?
+   log "WARNING: mux agent session exited with code ${mux_exit_code}"

[P1] Preserve the failing exit code before ! negates it

Because the pipeline is wrapped in if ! …; then, the ! operator negates the pipeline status, so inside the block $? is always 0 when the command failed. With set -o pipefail, this means mux_exit_code becomes 0 on failures/timeouts, and the script exits 0 even when the agent crashed, so the harness can’t detect failures. Capture the pipeline status before applying ! (or use if ...; then ... else mux_exit_code=$?) to preserve the real exit code.
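
A small bash demonstration of the behavior described here, as a sketch only: fake_agent is a hypothetical stand-in for the real command, and `cat >/dev/null` stands in for the tee stage.

```shell
#!/usr/bin/env bash
set -o pipefail

# Hypothetical stand-in for the agent command: prints a line, then fails
# with 124 (the code `timeout` conventionally reports).
fake_agent() {
  echo "partial output"
  return 124
}

# Buggy pattern: `!` inverts the pipeline status, so $? is 0 inside the branch.
if ! fake_agent | cat >/dev/null; then
  buggy_code=$?
fi

# Fixed pattern: without `!`, the else branch sees the pipeline's own status.
if fake_agent | cat >/dev/null; then
  fixed_code=0
else
  fixed_code=$?
fi

echo "buggy=${buggy_code} fixed=${fixed_code}"   # prints "buggy=0 fixed=124"
```

The else-branch form is one of the two options suggested above. Saving the status immediately after an unguarded pipeline would also work, provided errexit is not active at that point.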

…exit

- Set enabledByDefault=true for programmatic-tool-calling and
  exec-subagent-hard-restart experiments
- CLI buildExperimentsObject now auto-enables default experiments
  when no explicit --experiment flags are passed
- mux-run.sh: replace fatal() on agent non-zero exit with a warning,
  allowing token extraction to run unconditionally
- Propagate agent exit code at end of script

@ammar-agent force-pushed the tbench-enable-ptc-defaults branch from 0b9aad5 to 607e7a2 on February 7, 2026 16:22

Plumbs MUX_EXPLORE_MODEL through mux_agent.py → mux-run.sh → CLI.
Sets config.agentAiDefaults.explore so explore sub-agents use a
fast/cheap model instead of inheriting the expensive parent model.

@ammar-agent force-pushed the tbench-enable-ptc-defaults branch from 607e7a2 to 55d3827 on February 7, 2026 16:23