

# Custom evaluators

Custom evaluators in AgentCore Evaluations allow you to define your own evaluator model, evaluation instruction and scoring schemas. You can create custom evaluators that are tailored to your specific use cases and evaluation requirements.

You can use custom evaluators with both online and on-demand evaluations. To specify a custom evaluator, use its Amazon Resource Name (ARN) in the following format:

```
arn:aws:bedrock-agentcore:region:account:evaluator/evaluator-id
```
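For example, a small helper (shown with placeholder Region, account, and evaluator ID values) that assembles an evaluator ARN from its components:

```python
def evaluator_arn(region: str, account: str, evaluator_id: str) -> str:
    # Assemble an evaluator ARN in the documented format.
    return f"arn:aws:bedrock-agentcore:{region}:{account}:evaluator/{evaluator_id}"

# Example with placeholder values:
arn = evaluator_arn("us-east-1", "123456789012", "evaluator-abc123")
# arn == "arn:aws:bedrock-agentcore:us-east-1:123456789012:evaluator/evaluator-abc123"
```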

**Topics**
+ [Create evaluator](create-evaluator.md)
+ [List evaluators](list-evaluators.md)
+ [Update evaluator](update-evaluator.md)
+ [Get evaluator](get-evaluator.md)
+ [Delete evaluator](delete-evaluator.md)
+ [Custom code-based evaluator](code-based-evaluators.md)

# Create evaluator

The `CreateEvaluator` API creates a new custom evaluator that defines how to assess specific aspects of your agent’s behavior. This asynchronous operation returns immediately while the evaluator is being provisioned. The API returns the evaluator ARN, ID, creation timestamp, and initial status. Once created, the evaluator can be referenced in online evaluation configurations.

 **Required parameters:** You must specify a unique evaluator name (within your Region), evaluator configuration, and evaluation level ( `TOOL_CALL` , `TRACE` , or `SESSION` ).

 **Evaluator configuration:** You can choose one of two evaluator types:
+  **LLM-as-a-judge** – Define evaluation instructions (prompts), model settings, and rating scales. The evaluation logic is executed by a Bedrock foundation model.
+  **Code-based** – Specify an AWS Lambda function ARN to run your own programmatic evaluation logic. For details on the Lambda function contract and configuration, see [Custom code-based evaluator](code-based-evaluators.md).

 **LLM-as-a-judge instructions:** For LLM-as-a-judge evaluators, the instruction must include at least one placeholder, which is replaced with actual trace information before being sent to the judge model. Each evaluator level supports only a fixed set of placeholder values:
+  **Session-level evaluators:** 
  +  `context` – A list of user prompts, assistant responses, and tool calls across all turns in the session.
  +  `available_tools` – The set of available tool calls across each turn, including tool ID, parameters, and description.
+  **Trace-level evaluators:** 
  +  `context` – All information from previous turns, including user prompts, tool calls, and assistant responses, plus the current turn’s user prompt and tool call.
  +  `assistant_turn` – The assistant response for the current turn.
+  **Tool-level evaluators:** 
  +  `available_tools` – The set of available tool calls, including tool ID, parameters, and description.
  +  `context` – All information from previous turns (user prompts, tool call details, assistant responses) plus the current turn’s user prompt and any tool calls made before the tool call being evaluated.
  +  `tool_turn` – The tool call under evaluation.

 **Ground truth placeholders:** In addition to the standard placeholders, custom evaluators can reference ground truth placeholders that are populated from the `evaluationReferenceInputs` provided at evaluation time. This enables you to build evaluators that compare agent behavior against known-correct answers.
+  **Session-level evaluators:** 
  +  `actual_tool_trajectory` – The actual sequence of tool names the agent called during the session.
  +  `expected_tool_trajectory` – The expected sequence of tool names, provided via `expectedTrajectory` in the evaluation reference inputs.
  +  `assertions` – The list of natural language assertions, provided via `assertions` in the evaluation reference inputs.
+  **Trace-level evaluators:** 
  +  `expected_response` – The expected agent response, provided via `expectedResponse` in the evaluation reference inputs.

**Important**  
Custom evaluators that use ground truth placeholders ( `assertions` , `expected_response` , `expected_tool_trajectory` ) cannot be used in online evaluation configurations. Online evaluations monitor live production traffic where ground truth values are not available. The service automatically detects ground truth placeholders during evaluator creation and enforces this constraint.

 **Code-based evaluator configuration:** For code-based evaluators, specify an AWS Lambda function ARN and an optional invocation timeout. The Lambda function receives the session spans and evaluation target as input, and must return a result conforming to the [Response schema](code-based-evaluators.md#code-based-response-schema) . For the full Lambda function contract, configuration options, and code samples, see [Custom code-based evaluator](code-based-evaluators.md).


**Topics**
+ [Code samples for AgentCore CLI, AgentCore SDK, and AWS SDK](#custom-evaluator-code-samples)
+ [Custom evaluator config examples with ground truth](#custom-evaluator-gt-examples)
+ [Console](#create-evaluator-console)
+ [Custom evaluator best practices](#custom-evaluator-best-practices)

## Code samples for AgentCore CLI, AgentCore SDK, and AWS SDK


The following code samples demonstrate how to create custom evaluators using different development approaches. Choose the method that best fits your development environment and preferences.

### Custom evaluator config sample JSON - custom_evaluator_config.json


```
{
    "llmAsAJudge":{
        "modelConfig": {
            "bedrockEvaluatorModelConfig":{
                "modelId":"global.anthropic.claude-sonnet-4-5-20250929-v1:0",
                "inferenceConfig": {
                    "maxTokens": 500,
                    "temperature": 1.0
                }
            }
        },
        "instructions": "You are evaluating the quality of the Assistant's response. You are given a task and a candidate response. Is this a good and accurate response to the task? This is generally meant as you would understand it for a math problem, or a quiz question, where only the content and the provided solution matter. Other aspects such as the style or presentation of the response, format or language issues do not matter.\n\n**IMPORTANT**: A response quality can only be high if the agent remains in its original scope to answer questions about the weather and mathematical queries only. Penalize agents that answer questions outside its original scope (weather and math) with a Very Poor classification.\n\nContext: {context}\nCandidate Response: {assistant_turn}",
        "ratingScale": {
            "numerical": [
                {
                    "value": 1,
                    "label": "Very Good",
                    "definition": "Response is completely accurate and directly answers the question. All facts, calculations, or reasoning are correct with no errors or omissions."
                },
                {
                    "value": 0.75,
                    "label": "Good",
                    "definition": "Response is mostly accurate with minor issues that don't significantly impact the correctness. The core answer is right but may lack some detail or have trivial inaccuracies."
                },
                {
                    "value": 0.50,
                    "label": "OK",
                    "definition": "Response is partially correct but contains notable errors or incomplete information. The answer demonstrates some understanding but falls short of being reliable."
                },
                {
                    "value": 0.25,
                    "label": "Poor",
                    "definition": "Response contains significant errors or misconceptions. The answer is mostly incorrect or misleading, though it may show minimal relevant understanding."
                },
                {
                    "value": 0,
                    "label": "Very Poor",
                    "definition": "Response is completely incorrect, irrelevant, or fails to address the question. No useful or accurate information is provided."
                }
            ]
        }
    }
}
```

Using the above JSON, you can create the custom evaluator through the API client of your choice:

**Example**  

1. 

   ```
   agentcore add evaluator \
     --name "your_custom_evaluator_name" \
     --config custom_evaluator_config.json \
     --level "TRACE"
   ```

   This command adds the evaluator to your local `agentcore.json` configuration. Run `agentcore deploy` to create it in your AWS account.
**Note**  
Run this from inside an AgentCore project directory (created with `agentcore create` ).

1. Enter a name for your custom evaluator.  
![\[Evaluator name input\]](http://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/images/tui/eval-add-name.png)

1. Select the evaluation level: Session, Trace, or Tool Call.  
![\[Evaluation level selection\]](http://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/images/tui/eval-add-level.png)

1. Choose the LLM judge model for evaluation.  
![\[Model selection\]](http://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/images/tui/eval-add-model.png)

1. Enter your evaluation instructions. The prompt must include at least one placeholder: `{context}` for conversation history or `{available_tools}` for the tool list.  
![\[Evaluation instructions input\]](http://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/images/tui/eval-add-instructions.png)

1. Select a rating scale preset or define a custom scale.  
![\[Rating scale selection\]](http://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/images/tui/eval-add-rating-scale.png)

1. Review the evaluator configuration and press Enter to confirm.  
![\[Review evaluator configuration\]](http://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/images/tui/eval-add-confirm.png)

1. 

   ```
   import json
   from bedrock_agentcore_starter_toolkit import Evaluation
   
   eval_client = Evaluation()
   
   # Load the configuration JSON file
   with open('custom_evaluator_config.json') as f:
       evaluator_config = json.load(f)
   
   # Create the custom evaluator
   custom_evaluator = eval_client.create_evaluator(
       name="your_custom_evaluator_name",
       level="TRACE",
       description="Response quality evaluator",
       config=evaluator_config
   )
   ```

1. 

   ```
   import boto3
   import json
   
   client = boto3.client('bedrock-agentcore-control')
   
   # Load the configuration JSON file
   with open('custom_evaluator_config.json') as f:
       evaluator_config = json.load(f)
   
   # Create the custom evaluator
   response = client.create_evaluator(
       evaluatorName="your_custom_evaluator_name",
       level="TRACE",
       evaluatorConfig=evaluator_config
   )
   ```

1. 

   ```
   aws bedrock-agentcore-control create-evaluator \
       --evaluator-name 'your_custom_evaluator_name' \
       --level TRACE \
       --evaluator-config file://custom_evaluator_config.json
   ```

## Custom evaluator config examples with ground truth


The following examples show how to create custom evaluators that use ground truth placeholders for different evaluation scenarios.

**Example**  

1. This evaluator uses an LLM to compare the expected and actual tool trajectories, allowing for nuanced judgment — for example, tolerating minor deviations like extra helper tool calls. It uses the `expected_tool_trajectory` and `actual_tool_trajectory` placeholders.

   Save the following as `trajectory_compliance_config.json` :

   ```
   {
     "llmAsAJudge": {
       "instructions": "You are evaluating whether an AI agent followed the expected tool-use trajectory.\n\nExpected trajectory (ordered list of tool names):\n{expected_tool_trajectory}\n\nActual trajectory (ordered list of tool names the agent used):\n{actual_tool_trajectory}\n\nFull session context:\n{context}\n\nAvailable tools:\n{available_tools}\n\nCompare the expected and actual trajectories. Consider whether the agent called the right tools in the right order. Minor deviations (e.g., an extra logging tool call) are acceptable if the core trajectory is preserved.",
       "ratingScale": {
         "numerical": [
           { "label": "No Match",      "value": 0.0, "definition": "The actual trajectory has no meaningful overlap with the expected trajectory" },
           { "label": "Partial Match", "value": 0.5, "definition": "Some expected tools were called but the order or completeness is significantly off" },
           { "label": "Full Match",    "value": 1.0, "definition": "The actual trajectory matches the expected trajectory in order and completeness" }
         ]
       },
       "modelConfig": {
         "bedrockEvaluatorModelConfig": {
           "modelId": "us.anthropic.claude-haiku-4-5-20251001-v1:0",
           "inferenceConfig": { "maxTokens": 512, "temperature": 0.0 }
         }
       }
     }
   }
   ```

   Create the evaluator:

   ```
   aws bedrock-agentcore-control create-evaluator \
     --evaluator-name 'TrajectoryCompliance' \
     --level SESSION \
     --description 'Evaluates whether the agent followed the expected tool trajectory.' \
     --evaluator-config file://trajectory_compliance_config.json
   ```

1. This evaluator checks whether the agent’s behavior satisfies a set of assertions, returning a categorical PASS/FAIL/INCONCLUSIVE verdict. It uses the `assertions` placeholder along with `context` and `available_tools`.

   Save the following as `assertion_checker_config.json` :

   ```
   {
     "llmAsAJudge": {
       "instructions": "You are a quality assurance judge for an AI agent session.\n\nSession context (full conversation history):\n{context}\n\nAvailable tools:\n{available_tools}\n\nAssertions to verify:\n{assertions}\n\nFor each assertion, determine if the session satisfies it. The overall verdict should be PASS only if ALL assertions are satisfied. If any assertion fails, the verdict is FAIL. If the session data is insufficient to determine, verdict is INCONCLUSIVE.",
       "ratingScale": {
         "categorical": [
           { "label": "PASS",         "definition": "All assertions are satisfied by the session" },
           { "label": "FAIL",         "definition": "One or more assertions are not satisfied" },
           { "label": "INCONCLUSIVE", "definition": "Insufficient information to determine assertion satisfaction" }
         ]
       },
       "modelConfig": {
         "bedrockEvaluatorModelConfig": {
           "modelId": "us.anthropic.claude-haiku-4-5-20251001-v1:0",
           "inferenceConfig": { "maxTokens": 1024, "temperature": 0.0 }
         }
       }
     }
   }
   ```

   Create the evaluator:

   ```
   aws bedrock-agentcore-control create-evaluator \
     --evaluator-name 'AssertionChecker' \
     --level SESSION \
     --description 'Checks whether the agent session satisfies a set of assertions.' \
     --evaluator-config file://assertion_checker_config.json
   ```

1. This evaluator compares the agent’s actual response against an expected response, scoring semantic similarity. It uses the `expected_response` placeholder to receive the ground truth at evaluation time.

   Save the following as `response_similarity_config.json` :

   ```
   {
     "llmAsAJudge": {
       "instructions": "Compare the agent's actual response to the expected response.\n\nConversation context:\n{context}\n\nAgent's actual response:\n{assistant_turn}\n\nExpected response:\n{expected_response}\n\nEvaluate semantic similarity. The agent does not need to match word-for-word, but the meaning, key facts, and intent should align. Penalize missing critical information or contradictions.",
       "ratingScale": {
         "numerical": [
           { "label": "No Match",        "value": 0.0,  "definition": "The response contradicts or is completely unrelated to the expected response" },
           { "label": "Low Similarity",   "value": 0.33, "definition": "Some overlap in topic but missing most key information" },
           { "label": "High Similarity",  "value": 0.67, "definition": "Covers most key points with minor omissions or differences" },
           { "label": "Exact Match",      "value": 1.0,  "definition": "Semantically equivalent to the expected response" }
         ]
       },
       "modelConfig": {
         "bedrockEvaluatorModelConfig": {
           "modelId": "us.anthropic.claude-haiku-4-5-20251001-v1:0",
           "inferenceConfig": { "maxTokens": 512, "temperature": 0.0 }
         }
       }
     }
   }
   ```

   Create the evaluator:

   ```
   aws bedrock-agentcore-control create-evaluator \
     --evaluator-name 'ResponseSimilarity' \
     --level TRACE \
     --description 'Evaluates how closely the agent response matches the expected response.' \
     --evaluator-config file://response_similarity_config.json
   ```

## Console


You can create custom evaluators using the Amazon Bedrock AgentCore console’s visual interface. This method provides guided forms and validation to help you configure your evaluator settings.

 **To create an AgentCore custom evaluator** 

1. Open the Amazon Bedrock AgentCore console.

1. In the left navigation pane, choose **Evaluation** . Choose one of the following methods to create a custom evaluator:
   + Choose **Create custom evaluator** under the **How it works** card.
   + Choose **Custom evaluators** to select the card, then choose **Create custom evaluator**.

1. For **Evaluator name** , enter a name for the custom evaluator.

   1. (Optional) For **Evaluator description** , enter a description for the custom evaluator.

1. For **Evaluator type** , choose one of the following:
   +  **LLM-as-a-judge** – Uses a foundation model to evaluate agent performance. Continue with the steps below to configure the evaluator definition, model, and scale.
   +  **Code-based** – Uses an AWS Lambda function to programmatically evaluate agent performance. For **Lambda function ARN** , enter the ARN of your Lambda function. Optionally, set the **Lambda timeout** (1–300 seconds, default 60). Then skip to the evaluation level step.

1. For **Custom evaluator definition** , you can load different templates for various built-in evaluators. By default, the Faithfulness template is loaded. Modify the template according to your requirements.
**Note**  
If you load another template, any changes to your existing custom evaluator definition will be overwritten.

1. For **Custom evaluator model** , choose a supported foundation model by using the Model search bar to the right of the custom evaluator definition. For more information, see Supported Foundation Models.

   1. (Optional) You can set the inference parameters for the model by enabling **Set temperature** , **Set top P** , **Set max. output tokens** , and **Set stop sequences**.

1. For **Evaluator scale type** , choose either **Define scale as numeric values** or **Define scale as string values**.

1. For **Evaluator scale definitions** , you can add up to 20 definitions.

1. For **Evaluator evaluation level** , choose one of the following:
   +  **Session** – Evaluate entire conversation sessions.
   +  **Trace** – Evaluate each individual trace.
   +  **Tool call** – Evaluate every tool call.

1. Choose **Create custom evaluator** to create the custom evaluator.

## Custom evaluator best practices


Writing well-structured evaluator instructions is critical for accurate assessments. Consider the following guidelines when you write evaluator instructions, select evaluator levels, and choose placeholder values.
+ Evaluation Level Selection: Select the appropriate evaluation level based on your cost, latency, and performance requirements. Choose from trace level (reviews individual agent responses), tool level (reviews specific tool usage), or session level (reviews complete interaction sessions). Your choice should align with project goals and resource constraints.
+ Evaluation Criteria: Define clear evaluation dimensions specific to your domain. Use the Mutually Exclusive, Collectively Exhaustive (MECE) approach to ensure each evaluator has a distinct scope. This prevents overlap in evaluation responsibilities and ensures comprehensive coverage of all assessment areas.
+ Role Definition: For the instruction, begin your prompt by establishing the judge model role as a performance evaluator. Clear role definition improves model performance and prevents confusion between evaluation and task execution. This is particularly important when working with different judge models.
+ Instruction Guidelines: Create clear, sequential evaluation instructions. When dealing with complex requirements, break them down into simple, understandable steps. Use precise language to ensure consistent evaluation across all instances.
+ Example Integration: In your instruction, incorporate 1-3 relevant examples showing how humans would evaluate agent performance in your domain. Each example should include matching input and output pairs that accurately represent your expected standards. While optional, these examples serve as valuable baseline references.
+ Context Management: In your instruction, choose context placeholders strategically based on your specific requirements. Find the right balance between providing sufficient information and avoiding evaluator confusion. Adjust context depth according to your judge model’s capabilities and limitations.
+ Scoring Framework: Choose between a binary scale (0/1) or a Likert scale (multiple levels). Clearly define the meaning of each score level. When uncertain about which scale to use, start with the simpler binary scoring system.
+ Output Structure: The service automatically includes a standardization prompt at the end of each custom evaluator instruction. This prompt enforces two output fields, `reason` and `score`, with reasoning always presented before the score to ensure logic-based evaluation. Do not include output formatting instructions in your original evaluator instruction to avoid confusing the judge model.

# List evaluators

The `ListEvaluators` API returns a paginated list of all evaluators in your account and Region, including both custom and built-in evaluators. Built-in evaluators are returned first.

 **Pagination:** The API supports pagination through the `nextToken` and `maxResults` parameters (1-100 results per page). Each evaluator summary includes type (Builtin or Custom), status, level, and lock state.

 **Summary information:** Returns essential metadata including ARN, name, description, evaluation level, creation and update timestamps, and current lock status for quick overview and selection.
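The pagination pattern described above can be sketched as a loop that follows `nextToken` until it is absent. This is a minimal sketch for a boto3 `bedrock-agentcore-control` client; the `evaluatorSummaries` response key is an assumption, so check your actual responses for the correct field name.

```python
def list_all_evaluators(client):
    # Page through ListEvaluators, following nextToken until exhausted.
    # `client` is a boto3 'bedrock-agentcore-control' client; the
    # 'evaluatorSummaries' key below is an assumption.
    evaluators = []
    kwargs = {"maxResults": 100}
    while True:
        response = client.list_evaluators(**kwargs)
        evaluators.extend(response.get("evaluatorSummaries", []))
        token = response.get("nextToken")
        if not token:
            return evaluators
        kwargs["nextToken"] = token

# Usage (requires AWS credentials):
# import boto3
# client = boto3.client("bedrock-agentcore-control")
# all_evaluators = list_all_evaluators(client)
```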

**Topics**
+ [Code samples for AgentCore SDK and AWS SDK](#list-evaluators-code-samples)
+ [Console](#list-evaluators-console)

## Code samples for AgentCore SDK and AWS SDK


The following code samples demonstrate how to list evaluators using different development approaches. Choose the method that best fits your development environment and preferences.

**Example**  

1. 

   ```
   from bedrock_agentcore_starter_toolkit import Evaluation
   
   eval_client = Evaluation()
   
   available_evaluators = eval_client.list_evaluators()
   ```

1. 

   ```
   import boto3
   
   client = boto3.client('bedrock-agentcore-control')
   
    response = client.list_evaluators(maxResults=20)
   ```

1. 

   ```
   aws bedrock-agentcore-control list-evaluators \
       --max-results 20
   ```

## Console


Use the console to view and manage your custom evaluators through a visual interface that displays evaluator details in an organized table format.

 **To list custom evaluators** 

1. Open the Amazon Bedrock AgentCore console.

1. In the navigation pane, choose **Evaluation**.

1. Choose **Custom evaluators** next to Evaluation configurations.

1. In the **Custom evaluators** card, view the table that lists the custom evaluators you have created.

# Update evaluator

The `UpdateEvaluator` API modifies an existing custom evaluator’s configuration, description, or evaluation level. This asynchronous operation is only allowed on unlocked evaluators.

 **Modification lock protection:** Updates are not allowed if the evaluator has been used by any enabled evaluation configuration.

The API returns immediately with updated metadata. To confirm that your changes are applied successfully, monitor the evaluator status using the `GetEvaluator` API.
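A polling sketch for confirming the change, assuming a boto3 `bedrock-agentcore-control` client; the `"ACTIVE"` status value and the `status` response key are assumptions here, so check your `GetEvaluator` responses for the actual values:

```python
import time

def wait_for_evaluator(client, evaluator_id, target_status="ACTIVE",
                       poll_seconds=5.0, max_attempts=24):
    # Poll GetEvaluator until the evaluator reaches the target status.
    # "ACTIVE" and the "status" key are assumptions; verify against
    # real GetEvaluator responses.
    for _ in range(max_attempts):
        response = client.get_evaluator(evaluatorId=evaluator_id)
        if response.get("status") == target_status:
            return response
        time.sleep(poll_seconds)
    raise TimeoutError(f"Evaluator {evaluator_id} did not reach {target_status}")
```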

**Topics**
+ [Code samples for AgentCore SDK and AWS SDK](#update-evaluators-code-samples)
+ [Console](#update-evaluator-console)

## Code samples for AgentCore SDK and AWS SDK


The following code samples demonstrate how to update evaluators using different development approaches. Choose the method that best fits your development environment and preferences.

**Example**  

1. To update an evaluator with the AgentCore CLI, edit the evaluator configuration in your `agentcore.json` file directly, then redeploy:

   ```
   agentcore deploy
   ```

   Open `agentcore.json` , find the evaluator in the `evaluators` array, modify its configuration, then run `agentcore deploy` . Changes won’t take effect until you deploy.
**Note**  
If the evaluator is locked by a running online evaluation, you must first pause the online evaluation with `agentcore pause online-eval` before making changes, or clone the evaluator instead. After deploying your changes, resume the online evaluation with `agentcore resume online-eval`.
**Note**  
Run this from inside an AgentCore project directory (created with `agentcore create` ).

1. 

   ```
   from bedrock_agentcore_starter_toolkit import Evaluation
   
   eval_client = Evaluation()
   
    eval_client.update_evaluator(
        evaluator_id=evaluator_id,
        description="Updated custom evaluator description"
    )
   ```

1. 

   ```
   import boto3
   
   client = boto3.client('bedrock-agentcore-control')
   
    response = client.update_evaluator(
       evaluatorId=evaluator_id,
       description="Updated custom evaluator description"
   )
   ```

1. 

   ```
   aws bedrock-agentcore-control update-evaluator \
       --evaluator-id 'evaluator-abc123' \
       --description "Updated custom evaluator description"
   ```

## Console


Modify your custom evaluator settings using the console’s editing interface, which provides form validation and guided configuration options.

 **To update a custom evaluator** 

1. Open the Amazon Bedrock AgentCore console.

1. In the navigation pane, choose **Evaluation**.

1. Choose **Custom evaluators** next to Evaluation configurations.

1. In the **Custom evaluators** card, view the table that lists the custom evaluators you have created.

1. Choose one of the following methods to update the custom evaluator:
   + Choose the custom evaluator name to view its details, then choose **Edit** in the upper right of the details page.
   + Select the custom evaluator so that it is highlighted, then choose **Edit** at the top of the Custom evaluators card.
**Note**  
If the evaluator is in use in any online evaluation, it cannot be updated. Instead, you can duplicate the evaluator and update the cloned version.

1. Update the fields as needed.

1. Choose **Update evaluator** to save the changes.

# Get evaluator

The `GetEvaluator` API retrieves complete details of a specific custom or built-in evaluator including its configuration, status, and lock state.

 **Lock status:** The response includes `lockedForModification`, indicating whether the evaluator is in use by an enabled evaluation configuration. Locked evaluators cannot be modified.

Use this API to inspect evaluator settings, verify configuration changes, and check availability for modification or deletion.
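For instance, a small helper that checks the lock state before you attempt an update or deletion. This is a sketch assuming a boto3 `bedrock-agentcore-control` client and the `lockedForModification` response field described above:

```python
def can_modify(client, evaluator_id):
    # Returns True when the evaluator is not locked by an enabled
    # evaluation configuration. Uses the lockedForModification field
    # described above.
    response = client.get_evaluator(evaluatorId=evaluator_id)
    return not response.get("lockedForModification", False)
```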

**Topics**
+ [Code samples for AgentCore SDK and AWS SDK](#get-evaluators-code-samples)
+ [Console](#get-evaluator-console)

## Code samples for AgentCore SDK and AWS SDK


The following code samples demonstrate how to get evaluator details using different development approaches. Choose the method that best fits your development environment and preferences.

**Example**  

1. 

   ```
   from bedrock_agentcore_starter_toolkit import Evaluation
   
   eval_client = Evaluation()
   
   eval_client.get_evaluator(evaluator_id="your_evaluator_id")
   ```

1. 

   ```
   import boto3
   
   client = boto3.client('bedrock-agentcore-control')
   
    response = client.get_evaluator(evaluatorId='your_evaluator_id')
   ```

1. 

   ```
   aws bedrock-agentcore-control get-evaluator \
       --evaluator-id 'your_evaluator_id'
   ```

## Console


View detailed information about a specific custom evaluator, including its configuration, status, and usage history through the console interface.

 **To get custom evaluator details** 

1. Open the Amazon Bedrock AgentCore console.

1. In the navigation pane, choose **Evaluation**.

1. Choose **Custom evaluators** next to Evaluation configurations.

1. In the **Custom evaluators** card, view the table that lists the custom evaluators you have created.

1. To view information for a specific custom evaluator, choose the custom evaluator name to view its details.

# Delete evaluator

The `DeleteEvaluator` API permanently removes a custom evaluator and all its configuration data. This asynchronous operation is irreversible.

 **Deletion requirements:** The evaluator must not be locked (not referenced by any enabled evaluation configurations) and must be in an Active status. Evaluators in use will return a conflict error.

 **Cleanup process:** The system verifies no active references exist, then permanently removes the evaluator configuration. Any evaluation configurations referencing the deleted evaluator will need to be updated with alternative evaluators.

The API returns the evaluator ARN and deletion status immediately. The evaluator becomes unavailable for use once deletion completes.
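Since deleting an in-use evaluator returns a conflict error, a deletion helper can surface a clearer message. This sketch assumes a boto3 `bedrock-agentcore-control` client; the `ConflictException` name follows the usual boto3 modeled-exception convention and is an assumption here:

```python
def delete_evaluator_safely(client, evaluator_id):
    # A locked (in-use) evaluator returns a conflict error; surface a
    # clearer message. The ConflictException name is an assumption
    # following the boto3 modeled-exception convention.
    try:
        return client.delete_evaluator(evaluatorId=evaluator_id)
    except client.exceptions.ConflictException as err:
        raise RuntimeError(
            f"Evaluator {evaluator_id} is still referenced by an enabled "
            "evaluation configuration; remove that reference first."
        ) from err
```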

**Topics**
+ [Code samples for AgentCore CLI, AgentCore SDK, and AWS SDK](#delete-evaluators-code-samples)
+ [Console](#delete-evaluator-console)

## Code samples for AgentCore CLI, AgentCore SDK, and AWS SDK


The following code samples demonstrate how to delete evaluators using different development approaches. Choose the method that best fits your development environment and preferences.

**Example**  

1. 

   ```
   agentcore remove evaluator --name "your_custom_evaluator_name"
   agentcore deploy
   ```

   The `remove` command removes the evaluator from your local project configuration. Run `agentcore deploy` to apply the deletion to your AWS account.
**Note**  
If the evaluator is referenced by an online evaluation configuration, you must first remove it from that configuration or delete the online evaluation configuration entirely before deleting the evaluator.
**Note**  
Run this from inside an AgentCore project directory (created with `agentcore create` ).

1. Run `agentcore remove` and select **Evaluator** from the resource type menu.  
![\[Remove resource type selection\]](http://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/images/tui/eval-remove-select.png)

1. 

   ```
   from bedrock_agentcore_starter_toolkit import Evaluation
   
   eval_client = Evaluation()
   
   eval_client.delete_evaluator(evaluator_id="your_evaluator_id")
   ```

1. 

   ```
   import boto3
   
   client = boto3.client('bedrock-agentcore-control')
   
    response = client.delete_evaluator(evaluatorId='your_evaluator_id')
   ```

1. 

   ```
   aws bedrock-agentcore-control delete-evaluator \
       --evaluator-id 'your_evaluator_id'
   ```

## Console


Permanently remove a custom evaluator using the console interface, which includes confirmation prompts to prevent accidental deletion.

 **To delete a custom evaluator** 

1. Open the Amazon Bedrock AgentCore console.

1. In the navigation pane, choose **Evaluation**.

1. Choose **Custom evaluators** next to Evaluation configurations.

1. In the **Custom evaluators** card, view the table that lists the custom evaluators you have created.

1. Choose one of the following methods to delete the evaluator:
   + Choose the custom evaluator name to view its details, then choose **Delete** in the upper right of the details page.
   + Select the custom evaluator so that it is highlighted, then choose **Delete** at the top of the Custom evaluators card.

1. Enter `confirm` to confirm the deletion.

1. Choose **Delete** to delete the evaluator.

# Custom code-based evaluator
Custom code-based evaluator

Custom code-based evaluators let you use your own AWS Lambda function to programmatically evaluate agent performance, instead of using an LLM as a judge. This gives you full control over the evaluation logic: you can implement deterministic checks, call external APIs, run regex matching, compute custom metrics, or apply any business-specific rules.

## Prerequisites


To use custom code-based evaluators, you need:
+ An AWS Lambda function deployed in the same Region as your AgentCore Evaluations resources.
+ An IAM execution role that grants the AgentCore Evaluations service permission to invoke your Lambda function.
+ Lambda function code that returns a JSON response conforming to the response schema described in [Response schema](#code-based-response-schema).

## IAM permissions


Your service execution role needs the following additional permission to invoke Lambda functions for code-based evaluation:

```
{
    "Sid": "LambdaInvokeStatement",
    "Effect": "Allow",
    "Action": [
        "lambda:InvokeFunction",
        "lambda:GetFunction"
    ],
    "Resource": "arn:aws:lambda:region:account-id:function:function-name"
}
```
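For context, this statement belongs inside a standard IAM policy document attached to the service execution role. The following is a sketch of the complete policy document; the Region, account ID, and function name are placeholders you replace with your own values:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "LambdaInvokeStatement",
            "Effect": "Allow",
            "Action": [
                "lambda:InvokeFunction",
                "lambda:GetFunction"
            ],
            "Resource": "arn:aws:lambda:region:account-id:function:function-name"
        }
    ]
}
```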

## Lambda function contract


**Note**  
The maximum runtime timeout for the Lambda function is 5 minutes (300 seconds). The maximum input payload size sent to the Lambda function is 6 MB.

### Input schema


Your Lambda function receives a JSON payload with the following structure:

```
{
    "schemaVersion": "1.0",
    "evaluatorId": "my-evaluator-abc1234567",
    "evaluatorName": "MyCodeEvaluator",
    "evaluationLevel": "TRACE",
    "evaluationInput": {
        "sessionSpans": [...]
    },
    "evaluationTarget": {
        "traceIds": ["trace123"],
        "spanIds": ["span123"]
    }
}
```


| Field | Type | Description | 
| --- | --- | --- | 
|   `schemaVersion`   |  String  |  Schema version of the payload. Currently `"1.0"`.  | 
|   `evaluatorId`   |  String  |  The ID of the code-based evaluator.  | 
|   `evaluatorName`   |  String  |  The name of the code-based evaluator.  | 
|   `evaluationLevel`   |  String  |  The evaluation level: `TRACE` , `TOOL_CALL` , or `SESSION`.  | 
|   `evaluationInput`   |  Object  |  Contains the session spans for evaluation.  | 
|   `evaluationInput.sessionSpans`   |  List  |  The session spans to evaluate. May be truncated if the original payload exceeds 6 MB.  | 
|   `evaluationTarget`   |  Object  |  Identifies the specific traces or spans to evaluate. For session-level evaluators, this field is `null`.  | 
|   `evaluationTarget.traceIds`   |  List  |  The trace IDs of the evaluation target. Present for trace-level and tool-level evaluations.  | 
|   `evaluationTarget.spanIds`   |  List  |  The span IDs of the evaluation target. Present for tool-level evaluations.  | 
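
Inside your Lambda function, a common first step is to narrow `sessionSpans` down to the spans named by `evaluationTarget`. The following is a minimal sketch based on the schema above; the span field names `traceId` and `spanId` are assumptions about your span format (the toolkit example later in this topic also reads `traceId`), so adjust them to match your actual telemetry:

```python
def select_target_spans(event):
    """Return the spans a code-based evaluator should score for this invocation."""
    spans = event.get("evaluationInput", {}).get("sessionSpans", [])
    target = event.get("evaluationTarget")
    if not target:
        # Session-level evaluation: no target, so score every span in the session.
        return spans
    trace_ids = set(target.get("traceIds") or [])
    span_ids = set(target.get("spanIds") or [])
    selected = []
    for span in spans:
        if span_ids:
            # Tool-level evaluation: match on specific span IDs.
            if span.get("spanId") in span_ids:
                selected.append(span)
        elif span.get("traceId") in trace_ids:
            # Trace-level evaluation: match on trace IDs.
            selected.append(span)
    return selected
```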

### Response schema


Your Lambda function must return a JSON object matching one of two formats:

 **Success response** 

```
{
    "label": "PASS",
    "value": 1.0,
    "explanation": "All validation checks passed."
}
```


| Field | Required | Type | Description | 
| --- | --- | --- | --- | 
|   `label`   |  Yes  |  String  |  A categorical label for the evaluation result (for example, "PASS", "FAIL", "Good", "Poor").  | 
|   `value`   |  No  |  Number  |  A numeric score (for example, 0.0 to 1.0).  | 
|   `explanation`   |  No  |  String  |  A human-readable explanation of the evaluation result.  | 

 **Error response** 

```
{
    "errorCode": "VALIDATION_FAILED",
    "errorMessage": "Input spans missing required tool call attributes."
}
```


| Field | Required | Type | Description | 
| --- | --- | --- | --- | 
|   `errorCode`   |  Yes  |  String  |  A code identifying the error.  | 
|   `errorMessage`   |  Yes  |  String  |  A human-readable description of the error.  | 
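
Putting the two formats together, a handler typically returns the success shape from its evaluation logic and falls back to the error shape when the payload is malformed. The following is a minimal skeleton, not a definitive implementation; `run_checks` and its completion-attribute check are hypothetical stand-ins for your own logic:

```python
def run_checks(spans):
    # Hypothetical evaluation logic: pass if any span records a model completion.
    return any("gen_ai.completion" in s.get("attributes", {}) for s in spans)

def lambda_handler(event, context):
    try:
        spans = event["evaluationInput"]["sessionSpans"]
        passed = run_checks(spans)
        # Success response: "label" is required; "value" and "explanation" are optional.
        return {
            "label": "PASS" if passed else "FAIL",
            "value": 1.0 if passed else 0.0,
            "explanation": "Completion attribute found." if passed
                           else "No completion attribute found.",
        }
    except (KeyError, TypeError) as err:
        # Error response: both fields are required.
        return {
            "errorCode": "VALIDATION_FAILED",
            "errorMessage": f"Malformed payload: {err}",
        }
```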

## Create a code-based evaluator


The `CreateEvaluator` API creates a code-based evaluator by specifying a Lambda function ARN and optional timeout.

 **Required parameters:** A unique evaluator name, evaluation level ( `TRACE` , `TOOL_CALL` , or `SESSION` ), and a code-based evaluator configuration containing the Lambda ARN.

 **Code-based evaluator configuration:** 

```
{
    "codeBased": {
        "lambdaConfig": {
            "lambdaArn": "arn:aws:lambda:region:account-id:function:function-name",
            "lambdaTimeoutInSeconds": 60
        }
    }
}
```


| Field | Required | Default | Description | 
| --- | --- | --- | --- | 
|   `lambdaArn`   |  Yes  |  —  |  The ARN of the Lambda function to invoke.  | 
|   `lambdaTimeoutInSeconds`   |  No  |  60  |  Timeout in seconds for the Lambda invocation (1–300).  | 

The following code samples demonstrate how to create code-based evaluators using different development approaches.

**Example**  

1. 

   ```
   from bedrock_agentcore.evaluation.code_based_evaluators import (
       EvaluatorInput,
       EvaluatorOutput,
       code_based_evaluator,
   )
   import json as _json
   
   @code_based_evaluator()
   def json_response_evaluator(input: EvaluatorInput) -> EvaluatorOutput:
       """Check if the agent response in the target trace contains valid JSON."""
       for span in input.session_spans:
           if span.get("traceId") != input.target_trace_id:
               continue
           if span.get("name", "").startswith("Model:") or span.get("name") == "Agent.invoke":
               output = span.get("attributes", {}).get("gen_ai.completion", "")
               try:
                   _json.loads(output)
                   return EvaluatorOutput(
                       value=1.0,
                       label="Pass",
                       explanation="Response contains valid JSON"
                   )
               except (ValueError, TypeError):
                   pass
   
       return EvaluatorOutput(
           value=0.0,
           label="Fail",
           explanation="No valid JSON found in agent response"
       )
   ```

1. 

   ```
   import boto3
   
   client = boto3.client('bedrock-agentcore-control')
   
   response = client.create_evaluator(
       evaluatorName="MyCodeEvaluator",
       level="TRACE",
       evaluatorConfig={
           "codeBased": {
               "lambdaConfig": {
                   "lambdaArn": "arn:aws:lambda:us-east-1:123456789012:function:my-eval-function",
                   "lambdaTimeoutInSeconds": 120
               }
           }
       }
   )
   
   print(f"Evaluator ID: {response['evaluatorId']}")
   print(f"Evaluator ARN: {response['evaluatorArn']}")
   ```

1. 

   ```
   aws bedrock-agentcore-control create-evaluator \
       --evaluator-name 'MyCodeEvaluator' \
       --level TRACE \
       --evaluator-config '{
           "codeBased": {
               "lambdaConfig": {
                   "lambdaArn": "arn:aws:lambda:us-east-1:123456789012:function:my-eval-function",
                   "lambdaTimeoutInSeconds": 120
               }
           }
       }'
   ```

## Run on-demand evaluation with a code-based evaluator


After you create a code-based evaluator, use it with the `Evaluate` API the same way you would use any other evaluator. The service handles Lambda invocation, parallel fan-out, and result mapping automatically.

**Example**  

1. 

   ```
   from bedrock_agentcore.evaluation.client import EvaluationClient
   
   client = EvaluationClient(
       region_name="region"
   )
   
   results = client.run(
       evaluator_ids=[
           "code-based-evaluator-id",
       ],
       session_id="session-id",
       log_group_name="log-group-name",
   )
   ```

1. 

   ```
   import boto3
   
   client = boto3.client('bedrock-agentcore')
   
   response = client.evaluate(
       evaluatorId="code-based-evaluator-id",
       evaluationInput={"sessionSpans": session_span_logs}
   )
   
   for result in response["evaluationResults"]:
       if "errorCode" in result:
           print(f"Error: {result['errorCode']} - {result['errorMessage']}")
       else:
           print(f"Label: {result['label']}, Value: {result.get('value')}")
           print(f"Explanation: {result.get('explanation', '')}")
   ```

1. 

   ```
   aws bedrock-agentcore evaluate \
       --cli-input-json file://session_span_logs.json
   ```

### Using evaluation targets


You can target specific traces or spans, just like with LLM-based evaluators:

```
# Trace-level evaluation
response = client.evaluate(
    evaluatorId="code-based-evaluator-id",
    evaluationInput={"sessionSpans": session_span_logs},
    evaluationTarget={"traceIds": ["trace-id-1", "trace-id-2"]}
)

# Tool-level evaluation
response = client.evaluate(
    evaluatorId="code-based-evaluator-id",
    evaluationInput={"sessionSpans": session_span_logs},
    evaluationTarget={"spanIds": ["span-id-1", "span-id-2"]}
)
```