fix(crab-pf): replace broken verify with smoke+session test, add session handling by MrFlounder · Pull Request #32 · promptfoo/crabcode

MrFlounder · 2026-02-22T00:13:12Z

Summary

Fix false-positive verification: The old verify tool ran promptfoo eval against a config with redteam section but no tests array — zero tests ran, zero failures, reported success. Now replaced with a 3-step process: direct provider smoke test → session test → eval with 2 real test cases.
Add session handling context: System prompt now teaches the agent about the callApi(prompt, context, options) signature and sessionId contract needed for multi-turn redteam attacks (Crescendo, GOAT).
Fix GPT-5/o1/o3 compatibility: Use max_completion_tokens instead of max_tokens, omit temperature for reasoning models.
Fix Node.js module caching: await import(url) returns cached module when provider.js is rewritten. Added ?t=timestamp cache buster.

Changed files

File	What changed
`generator/config.ts`	Replace `redteam` section with `prompts` + `tests` + `defaultTest.assert`
`agent/loop.ts`	Rewrite `verify` tool: smoke test + session test + eval with proper parsing
`agent/tools.ts`	Update verify description, remove unused `numTests` param
`agent/system-prompt.ts`	Add session handling section, update `callApi` signature in example
`agent/providers.ts`	GPT-5/o1/o3 compat (`max_completion_tokens`, no `temperature`)

Test plan

Run crab pf against a simple HTTP target — verify eval shows 2 passed, 0 failed
Run against a session-based target — verify provider gets correct callApi signature and returns sessionId
Run with --provider openai:gpt-5 — verify no API errors
Break provider.js intentionally — verify smoke test catches it (not false-positive)

🤖 Generated with Claude Code

…ion handling The verify tool was false-positive: configs had a `redteam` section but no `tests` array, so `promptfoo eval` ran zero tests and reported success. Changes: - Replace redteam config with 2 simple test cases + defaultTest assertion - Rewrite verify: direct callApi smoke test + session test, then promptfoo eval - Add cache-busting for dynamic imports (Node caches rewritten provider.js) - Parse eval output properly (require >0 passed, 0 failed) - Add session handling to system prompt (callApi signature, sessionId contract) - Update tool schema (remove unused numTests param) - Fix GPT-5/o1/o3 compat (max_completion_tokens, omit temperature) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

plugins/promptfoo/src/agent/loop.ts

+          const packageJsonPath = path.join(outputDir, 'package.json');
+          if (fs.existsSync(packageJsonPath)) {
+            try {
+              execSync(`cd "${outputDir}" && npm install --silent 2>&1`, {


In general, to fix this kind of issue you should avoid invoking a shell with a dynamically constructed command string containing untrusted data. Instead, use the child_process APIs that accept a command and an argument array (execFile, spawn, or execSync with the cmd and args form) and pass the untrusted value as an argument or configuration option (like cwd) that is not interpreted by the shell. If you truly must use a shell, you need to robustly escape the untrusted value with a library such as shell-quote, but the safer approach here is to eliminate the shell usage.

The single best way to fix this specific case without changing the intended behavior is to stop constructing and executing cd "${outputDir}" && npm install --silent 2>&1 as a shell command string. Instead, we can run npm directly and let Node handle the working directory via the cwd option. For synchronous execution, execSync already supports an argument-array form: execSync(command, args?, options?). We can replace the existing call with execSync('npm', ['install', '--silent'], { cwd: outputDir, timeout: 60000, encoding: 'utf-8' });. This preserves the same effect—running npm install --silent in outputDir with the same timeout and encoding settings—while ensuring that outputDir is only used as a directory path, not as part of a shell command. No additional imports are necessary, as execSync is already imported from node:child_process. The change is localized to the executeTool function around line 290 in plugins/promptfoo/src/agent/loop.ts.

Concretely:

In plugins/promptfoo/src/agent/loop.ts, locate the execSync call on lines 290–293.

Replace the shell-string invocation that embeds outputDir with the safer argument-array form and pass outputDir through the cwd option.

Keep the try/catch structure and behavior (ignore install errors) unchanged.

Two issues where the LLM was flying blind after failures: 1. Tool exceptions: outer catch returned result=null, error=message but only result was serialized into the tool message — LLM saw "null" with no explanation. Now includes both error and result in content. 2. Verify smoke test: "empty output" told the LLM nothing about why. Now includes the full provider response (output, sessionId, error) so the LLM can diagnose the actual problem (e.g. missing polling loop, wrong response field, auth failure). Follows opencode's pattern where tool errors go into content for the LLM to reason about, rather than being silently dropped. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-advanced-security bot found potential problems Feb 22, 2026

View reviewed changes

MrFlounder merged commit 3f5ee99 into main Feb 22, 2026
3 checks passed

MrFlounder deleted the feat/crab-pf-verify-session branch February 22, 2026 01:31

github-actions bot mentioned this pull request Feb 22, 2026

chore(main): release 0.11.1 #30

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

fix(crab-pf): replace broken verify with smoke+session test, add session handling#32

fix(crab-pf): replace broken verify with smoke+session test, add session handling#32
MrFlounder merged 2 commits intomainfrom
feat/crab-pf-verify-session

MrFlounder commented Feb 22, 2026

Uh oh!

Check warning

Copilot Autofix

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

@@ -287,7 +287,8 @@
                       const packageJsonPath = path.join(outputDir, 'package.json');
                       if (fs.existsSync(packageJsonPath)) {
                         try {
-                          execSync(`cd "${outputDir}" && npm install --silent 2>&1`, {
+                          execSync('npm', ['install', '--silent'], {
+                            cwd: outputDir,
                             timeout: 60000,
                             encoding: 'utf-8',
                           });

Comments

Conversation

MrFlounder commented Feb 22, 2026

Summary

Changed files

Test plan

Uh oh!

Check warning

Uh oh!

Uh oh!

Copilot Autofix

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant