Skip to content

Comments

fix(crab-pf): replace broken verify with smoke+session test, add session handling#32

Merged
MrFlounder merged 2 commits intomainfrom
feat/crab-pf-verify-session
Feb 22, 2026
Merged

fix(crab-pf): replace broken verify with smoke+session test, add session handling#32
MrFlounder merged 2 commits intomainfrom
feat/crab-pf-verify-session

Conversation

@MrFlounder
Copy link
Contributor

Summary

  • Fix false-positive verification: The old verify tool ran promptfoo eval against a config with redteam section but no tests array — zero tests ran, zero failures, reported success. Now replaced with a 3-step process: direct provider smoke test → session test → eval with 2 real test cases.
  • Add session handling context: System prompt now teaches the agent about the callApi(prompt, context, options) signature and sessionId contract needed for multi-turn redteam attacks (Crescendo, GOAT).
  • Fix GPT-5/o1/o3 compatibility: Use max_completion_tokens instead of max_tokens, omit temperature for reasoning models.
  • Fix Node.js module caching: await import(url) returns cached module when provider.js is rewritten. Added ?t=timestamp cache buster.

Changed files

File What changed
generator/config.ts Replace redteam section with prompts + tests + defaultTest.assert
agent/loop.ts Rewrite verify tool: smoke test + session test + eval with proper parsing
agent/tools.ts Update verify description, remove unused numTests param
agent/system-prompt.ts Add session handling section, update callApi signature in example
agent/providers.ts GPT-5/o1/o3 compat (max_completion_tokens, no temperature)

Test plan

  • Run crab pf against a simple HTTP target — verify eval shows 2 passed, 0 failed
  • Run against a session-based target — verify provider gets correct callApi signature and returns sessionId
  • Run with --provider openai:gpt-5 — verify no API errors
  • Break provider.js intentionally — verify smoke test catches it (not false-positive)

🤖 Generated with Claude Code

…ion handling

The verify tool was false-positive: configs had a `redteam` section but no
`tests` array, so `promptfoo eval` ran zero tests and reported success.

Changes:
- Replace redteam config with 2 simple test cases + defaultTest assertion
- Rewrite verify: direct callApi smoke test + session test, then promptfoo eval
- Add cache-busting for dynamic imports (Node caches rewritten provider.js)
- Parse eval output properly (require >0 passed, 0 failed)
- Add session handling to system prompt (callApi signature, sessionId contract)
- Update tool schema (remove unused numTests param)
- Fix GPT-5/o1/o3 compat (max_completion_tokens, omit temperature)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
const packageJsonPath = path.join(outputDir, 'package.json');
if (fs.existsSync(packageJsonPath)) {
try {
execSync(`cd "${outputDir}" && npm install --silent 2>&1`, {

Check warning

Code scanning / CodeQL

Unsafe shell command constructed from library input Medium

This string concatenation which depends on
library input
is later used in a
shell command
.

Copilot Autofix

AI about 2 hours ago

In general, to fix this kind of issue you should avoid invoking a shell with a dynamically constructed command string containing untrusted data. Instead, use the child_process APIs that accept a command and an argument array (execFile, spawn, or execSync with the cmd and args form) and pass the untrusted value as an argument or configuration option (like cwd) that is not interpreted by the shell. If you truly must use a shell, you need to robustly escape the untrusted value with a library such as shell-quote, but the safer approach here is to eliminate the shell usage.

The single best way to fix this specific case without changing the intended behavior is to stop constructing and executing cd "${outputDir}" && npm install --silent 2>&1 as a shell command string. Instead, we can run npm directly and let Node handle the working directory via the cwd option. For synchronous execution, execSync already supports an argument-array form: execSync(command, args?, options?). We can replace the existing call with execSync('npm', ['install', '--silent'], { cwd: outputDir, timeout: 60000, encoding: 'utf-8' });. This preserves the same effect—running npm install --silent in outputDir with the same timeout and encoding settings—while ensuring that outputDir is only used as a directory path, not as part of a shell command. No additional imports are necessary, as execSync is already imported from node:child_process. The change is localized to the executeTool function around line 290 in plugins/promptfoo/src/agent/loop.ts.

Concretely:

  • In plugins/promptfoo/src/agent/loop.ts, locate the execSync call on lines 290–293.
  • Replace the shell-string invocation that embeds outputDir with the safer argument-array form and pass outputDir through the cwd option.
  • Keep the try/catch structure and behavior (ignore install errors) unchanged.
Suggested changeset 1
plugins/promptfoo/src/agent/loop.ts

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/plugins/promptfoo/src/agent/loop.ts b/plugins/promptfoo/src/agent/loop.ts
--- a/plugins/promptfoo/src/agent/loop.ts
+++ b/plugins/promptfoo/src/agent/loop.ts
@@ -287,7 +287,8 @@
           const packageJsonPath = path.join(outputDir, 'package.json');
           if (fs.existsSync(packageJsonPath)) {
             try {
-              execSync(`cd "${outputDir}" && npm install --silent 2>&1`, {
+              execSync('npm', ['install', '--silent'], {
+                cwd: outputDir,
                 timeout: 60000,
                 encoding: 'utf-8',
               });
EOF
@@ -287,7 +287,8 @@
const packageJsonPath = path.join(outputDir, 'package.json');
if (fs.existsSync(packageJsonPath)) {
try {
execSync(`cd "${outputDir}" && npm install --silent 2>&1`, {
execSync('npm', ['install', '--silent'], {
cwd: outputDir,
timeout: 60000,
encoding: 'utf-8',
});
Copilot is powered by AI and may make mistakes. Always verify output.
Two issues where the LLM was flying blind after failures:

1. Tool exceptions: outer catch returned result=null, error=message but
   only result was serialized into the tool message — LLM saw "null"
   with no explanation. Now includes both error and result in content.

2. Verify smoke test: "empty output" told the LLM nothing about why.
   Now includes the full provider response (output, sessionId, error)
   so the LLM can diagnose the actual problem (e.g. missing polling
   loop, wrong response field, auth failure).

Follows opencode's pattern where tool errors go into content for the
LLM to reason about, rather than being silently dropped.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@MrFlounder MrFlounder merged commit 3f5ee99 into main Feb 22, 2026
3 checks passed
@MrFlounder MrFlounder deleted the feat/crab-pf-verify-session branch February 22, 2026 01:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant