- When deploying language models with tool-calling capabilities in production environments, it's essential to ensure their effectiveness and reliability. This evaluation process goes beyond traditional testing and focuses on two key aspects:
+ Tool evaluations ensure AI models use your tools correctly in production. Unlike traditional testing, evaluations measure two key aspects:
+
+ 1. **Tool selection**: Does the model choose the right tools for the task?
+ 2. **Parameter accuracy**: Does the model provide correct arguments?
- 1. **Tool Utilization**: Assessing how efficiently the language model uses the available tools.
- 2. **Intent Understanding**: Evaluating the language model's ability to comprehend user intents and select the appropriate tools to fulfill those intents.
+ Arcade's evaluation framework helps you validate tool-calling capabilities before deployment, ensuring reliability in real-world applications. You can evaluate tools from MCP servers, Arcade Gateways, or custom implementations.
- Arcade's Evaluation Framework provides a comprehensive approach to assess and validate the tool-calling capabilities of language models, ensuring they meet the high standards required for real-world applications.
-## Why Evaluate Tool Calling by Task?
-
-Language models augmented with tool-use capabilities can perform complex tasks by invoking external tools or APIs. However, without proper evaluation, these models might:
-
-- **Misinterpret user intents**, leading to incorrect tool selection.
-- **Provide incorrect arguments** to tools, causing failures or undesired outcomes.
-- **Fail to execute the necessary sequence of tool calls**, especially in tasks requiring multiple steps.
-
-Evaluating tool calling by task ensures that the language model can handle specific scenarios reliably, providing confidence in its performance in production settings.
-
-## Evaluation Scoring
-
-Scoring in the evaluation framework is based on comparing the model's actual tool calls with the expected ones for each evaluation case. The total score for a case depends on:
-
-1. **Tool Selection**: Whether the model selected the correct tools for the task.
-2. **Tool Call Arguments**: The correctness of the arguments provided to the tools, evaluated by critics.
-3. **Evaluation Rubric**: Each aspect of the evaluation is weighted according to the rubric, affecting its impact on the final score.
-
-The evaluation result includes:
-
-- **Score**: A normalized value between 0.0 and 1.0.
-- **Result**:
- - _Passed_: Score is above the fail threshold.
- - _Failed_: Score is below the fail threshold.
- - _Warned_: Score is between the warning and fail thresholds.
+## What can go wrong?
-## Critics: Types and Usage
+Without proper evaluation, AI models might:
-Critics are essential for evaluating the correctness of tool call arguments. Different types of critics serve various evaluation needs:
+- **Misinterpret user intents**, selecting the wrong tools
+- **Provide incorrect arguments**, causing failures or unexpected behavior
+- **Skip necessary tool calls**, missing steps in multi-step tasks
+- **Make incorrect assumptions** about parameter defaults or formats
-### BinaryCritic
+## How evaluation works
-`BinaryCritic`s check for exact matches between expected and actual values after casting.
+Evaluations compare the model's actual tool calls with expected tool calls for each test case.
-- **Use Case**: When exact values are required (e.g., specific numeric parameters).
-- **Example**: Ensuring the model provides the exact user ID in a function call.
+### Scoring components
-### NumericCritic
+1. **Tool selection**: Did the model choose the correct tool?
+2. **Parameter evaluation**: Are the arguments correct? (evaluated by critics)
+3. **Weighted scoring**: Each aspect has a weight that affects the final score
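+
+As a rough illustration of how these pieces fit together, here is a sketch of the critics for a case where the model is expected to call `send_dm_to_user(user_name=..., message=...)`. The critic classes and their `critic_field` and `weight` parameters come from Arcade's evaluation framework; the import path and the weight values are assumptions, so see [Create an evaluation suite](/home/evaluate-tools/create-an-evaluation-suite) for the exact API.
+
+```python
+# Sketch only: which critic checks which argument, and how much each one
+# counts toward the case score. The import path is an assumption -- check
+# the "Create an evaluation suite" page for the exact one.
+from arcade_evals import BinaryCritic, SimilarityCritic  # assumed import path
+
+critics = [
+    # The recipient must match the expected value exactly (after casting).
+    BinaryCritic(critic_field="user_name", weight=0.5),
+    # The message only needs to be similar to the expected text.
+    SimilarityCritic(critic_field="message", weight=0.5),
+]
+```
+
+Tool selection gets its own weight in the rubric, and all weighted scores are combined into a single, proportionally normalized result.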
-`NumericCritic` evaluates numeric values within a specified range, allowing for acceptable deviations.
+### Evaluation results
-- **Use Case**: When values can be approximate but should be within a certain threshold.
-- **Example**: Accepting approximate results in mathematical computations due to floating-point precision.
+Each test case receives:
-### SimilarityCritic
+- **Score**: A value between 0.0 and 1.0, calculated from the weighted critic scores and normalized proportionally (weights can be any positive value)
+- **Status**:
+  - **Passed**: Score meets or exceeds the warn threshold (default: 0.9)
+  - **Failed**: Score falls below the fail threshold (default: 0.8)
+  - **Warned**: Score falls between the fail and warn thresholds
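+
+These thresholds come from the rubric attached to your evaluation suite. A minimal configuration using the default values above might look like the sketch below (the import path is an assumption; see [Create an evaluation suite](/home/evaluate-tools/create-an-evaluation-suite) for the full set of options):
+
+```python
+from arcade_evals import EvalRubric  # assumed import path
+
+rubric = EvalRubric(
+    fail_threshold=0.8,           # below this, the case fails
+    warn_threshold=0.9,           # between fail and warn, the case is warned
+    fail_on_tool_selection=True,  # fail the case if the wrong tool is selected
+    tool_selection_weight=1.0,    # how much tool selection counts toward the score
+)
+```
+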
-`SimilarityCritic` measures the similarity between expected and actual string values using metrics like cosine similarity.
+Example output:
-- **Use Case**: When the exact wording isn't critical, but the content should be similar.
-- **Example**: Evaluating if the message content in a communication tool is similar to the expected message.
-
-### DatetimeCritic
-
-`DatetimeCritic` evaluates the closeness of datetime values within a specified tolerance.
-
-- **Use Case**: When datetime values should be within a certain range of the expected time.
-- **Example**: Verifying if a scheduled event time is close enough to the intended time.
-
-### Choosing the Right Critic
-
-- **Exact Matches Needed**: Use **BinaryCritic** for strict equality.
-- **Numeric Ranges**: Use **NumericCritic** when a tolerance is acceptable.
-- **Textual Similarity**: Use **SimilarityCritic** for comparing messages or descriptions.
-- **Datetime Tolerance**: Use **DatetimeCritic** when a tolerance is acceptable for datetime comparisons.
-
-Critics are defined with fields such as `critic_field`, `weight`, and parameters specific to their types (e.g., `similarity_threshold` for `SimilarityCritic`).
+```
+PASSED Get weather for city -- Score: 1.00
+WARNED Send message with typo -- Score: 0.85
+FAILED Wrong tool selected -- Score: 0.50
+```
-## Rubrics and Setting Thresholds
+## Next steps
-An **EvalRubric** defines the evaluation criteria and thresholds for determining pass/fail outcomes. Key components include:
+- [Create an evaluation suite](/home/evaluate-tools/create-an-evaluation-suite) to start testing your tools
+- [Run evaluations](/home/evaluate-tools/run-evaluations) with multiple providers
+- Explore [capture mode](/home/evaluate-tools/capture-mode) to bootstrap test expectations
+- Compare tool sources with [comparative evaluations](/home/evaluate-tools/comparative-evaluations)
-- **Fail Threshold**: The minimum score required to pass the evaluation.
-- **Warn Threshold**: The score threshold for issuing a warning.
-- **Weights**: Assigns importance to different aspects of the evaluation (e.g., tool selection, argument correctness).
+## Advanced features
-### Setting Up a Rubric
+Once you're comfortable with basic evaluations, explore these advanced capabilities:
-- **Define Fail and Warn Thresholds**: Choose values between 0.0 and 1.0 to represent acceptable performance levels.
-- **Assign Weights**: Allocate weights to tool selection and critics to reflect their importance in the overall evaluation.
-- **Configure Failure Conditions**: Set flags like `fail_on_tool_selection` to enforce strict criteria.
+### Capture mode
-### Example Rubric Configuration:
+Record tool calls without scoring to see which tools and arguments models actually produce. Useful for bootstrapping test expectations and debugging model behavior. [Learn more →](/home/evaluate-tools/capture-mode)
-A rubric that requires a score of at least 0.85 to pass and issues a warning if the score is between 0.85 and 0.95:
+### Comparative evaluations
-- Fail Threshold: 0.85
-- Warn Threshold: 0.95
-- Fail on Tool Selection: True
-- Tool Selection Weight: 1.0
+Test the same cases against different tool sources (tracks) with isolated registries. Compare how models perform with different tool implementations. [Learn more →](/home/evaluate-tools/comparative-evaluations)
-```python
-rubric = EvalRubric(
- fail_threshold=0.85,
- warn_threshold=0.95,
- fail_on_tool_selection=True,
- tool_selection_weight=1.0,
-)
-```
+### Output formats
-## Building an Evaluation Suite
-
-An **EvalSuite** orchestrates the running of multiple evaluation cases. Here's how to build one:
-
-1. **Initialize EvalSuite**: Provide a name, system message, tool catalog, and rubric.
-2. **Add Evaluation Cases**: Use `add_case` or `extend_case` to include various scenarios.
-3. **Specify Expected Tool Calls**: Define the tools and arguments expected for each case.
-4. **Assign Critics**: Attach critics relevant to each case to evaluate specific arguments.
-5. **Run the Suite**: Execute the suite using the Arcade CLI to collect results.
-
-### Example: Math Tools Evaluation Suite
-
-An evaluation suite for math tools might include cases such as:
-
-- **Adding Two Large Numbers**:
- - **User Message**: "Add 12345 and 987654321"
- - **Expected Tool Call**: `add(a=12345, b=987654321)`
- - **Critics**:
- - `BinaryCritic` for arguments `a` and `b`
-- **Calculating Square Roots**:
- - **User Message**: "What is the square root of 3224990521?"
- - **Expected Tool Call**: `sqrt(a=3224990521)`
- - **Critics**:
- - `BinaryCritic` for argument `a`
-
-### Example: Slack Messaging Tools Evaluation Suite
-
-An evaluation suite for Slack messaging tools might include cases such as:
-
-- **Sending a Direct Message**:
- - **User Message**: "Send a direct message to johndoe saying 'Hello, can we meet at 3 PM?'"
- - **Expected Tool Call**: `send_dm_to_user(user_name='johndoe', message='Hello, can we meet at 3 PM?')`
- - **Critics**:
- - `BinaryCritic` for `user_name`
- - `SimilarityCritic` for `message`
-- **Posting a Message to a Channel**:
- - **User Message**: "Post 'The new feature is now live!' in the #announcements channel"
- - **Expected Tool Call**: `send_message_to_channel(channel_name='announcements', message='The new feature is now live!')`
- - **Critics**:
- - `BinaryCritic` for `channel_name`
- - `SimilarityCritic` for `message`
+Save results in multiple formats (txt, md, html, json) for reporting and analysis. Mix formats with `--format md,html,json` or use `--format all`. [Learn more →](/home/evaluate-tools/run-evaluations#output-formats)
diff --git a/public/llms.txt b/public/llms.txt
index 286da23b9..689592313 100644
--- a/public/llms.txt
+++ b/public/llms.txt
@@ -1,4 +1,4 @@
-
+
# Arcade
@@ -126,6 +126,8 @@ Arcade delivers three core capabilities: Deploy agents even your security team w
## Evaluate Tools
+- [Capture mode](https://docs.arcade.dev/en/home/evaluate-tools/capture-mode.md): The "Capture mode" documentation page guides users on how to record tool calls without scoring, enabling them to bootstrap test expectations, debug model behavior, and explore new tools. It outlines typical workflows, basic usage steps, and best practices for capturing and converting
+- [Comparative evaluations](https://docs.arcade.dev/en/home/evaluate-tools/comparative-evaluations.md): This documentation page provides guidance on conducting comparative evaluations by running the same test cases against different tool implementations, allowing users to compare tool sources side-by-side. It explains the concept of tracks, outlines the steps for setting up and executing comparative evaluations, and offers
- [Evaluate tools](https://docs.arcade.dev/en/home/evaluate-tools/create-an-evaluation-suite.md): This documentation page provides a comprehensive guide on how to create and run an evaluation suite for assessing tools using the Arcade framework. Users will learn to define evaluation cases, utilize various critics to measure performance, and execute evaluations to ensure their tools are effectively integrated with
- [Run evaluations with the Arcade CLI](https://docs.arcade.dev/en/home/evaluate-tools/run-evaluations.md): This documentation page provides guidance on using the Arcade CLI to run evaluations of tool-enabled language models. It outlines the steps to execute evaluation suites, customize the evaluation process with various command options, and analyze the results efficiently. Users will learn how to utilize the
- [Why evaluate tools?](https://docs.arcade.dev/en/home/evaluate-tools/why-evaluate-tools.md): This documentation page explains the importance of evaluating tools used in language models with tool-calling capabilities, focusing on their effectiveness and reliability in production environments. It outlines the evaluation framework, which assesses tool utilization and intent understanding, and details the scoring system based on