LlmAsAJudgeOptions
- class aws_cdk.aws_bedrock_agentcore_alpha.LlmAsAJudgeOptions(*, instructions, model_id, rating_scale, additional_model_request_fields=None, inference_config=None)
Bases: object

(experimental) Options for configuring an LLM-as-a-Judge custom evaluator.
Uses a foundation model to assess agent performance based on custom instructions and a rating scale.
- Parameters:
  - instructions (str) – (experimental) The evaluation instructions that guide the language model in assessing agent performance. These instructions define the evaluation criteria, context, and expected behavior. Instructions must contain placeholders appropriate for the evaluation level (e.g., {context} and {available_tools} for SESSION level). Note: Evaluators using reference-input placeholders (e.g., {expected_tool_trajectory}, {assertions}, {expected_response}) are only compatible with on-demand evaluation, not online evaluation.
  - model_id (str) – (experimental) The identifier of the Amazon Bedrock model to use for evaluation. Accepts standard model IDs (e.g., 'anthropic.claude-sonnet-4-6') and cross-region inference profile IDs with region prefixes (e.g., 'us.anthropic.claude-sonnet-4-6', 'eu.anthropic.claude-sonnet-4-6').
  - rating_scale (EvaluatorRatingScale) – (experimental) The rating scale that defines how the evaluator should score agent performance.
  - additional_model_request_fields (Optional[Mapping[str, Any]]) – (experimental) Additional model-specific request fields. Default: - No additional fields
  - inference_config (Union[EvaluatorInferenceConfig, Dict[str, Any], None]) – (experimental) Optional inference configuration parameters that control model behavior during evaluation. When not specified, the foundation model uses its own default values for maxTokens, temperature, and topP. Default: - The foundation model's default inference parameters are used
- Stability:
experimental
Example:
# Create a custom LLM-as-a-Judge evaluator
evaluator = agentcore.Evaluator(self, "MyEvaluator",
    evaluator_name="my_custom_evaluator",
    level=agentcore.EvaluationLevel.SESSION,
    evaluator_config=agentcore.EvaluatorConfig.llm_as_a_judge(
        instructions="Evaluate whether the agent response is helpful and accurate.",
        model_id="us.anthropic.claude-sonnet-4-6",
        rating_scale=agentcore.EvaluatorRatingScale.categorical([
            {"label": "Good", "definition": "The response is helpful and accurate."},
            {"label": "Bad", "definition": "The response is not helpful or contains errors."}
        ])
    )
)

# Use the custom evaluator in an online evaluation configuration
agentcore.OnlineEvaluationConfig(self, "MyEvaluation",
    online_evaluation_config_name="my_evaluation",
    evaluators=[
        agentcore.EvaluatorReference.builtin(agentcore.BuiltinEvaluator.HELPFULNESS),
        agentcore.EvaluatorReference.custom(evaluator)
    ],
    data_source=agentcore.DataSourceConfig.from_cloud_watch_logs(
        log_group_names=["/aws/bedrock-agentcore/my-agent"],
        service_names=["my-agent.default"]
    )
)
Attributes
- additional_model_request_fields
(experimental) Additional model-specific request fields.
- Default:
No additional fields
- Stability:
experimental
- inference_config
(experimental) Optional inference configuration parameters that control model behavior during evaluation.
When not specified, the foundation model uses its own default values for maxTokens, temperature, and topP.
- Default:
The foundation model’s default inference parameters are used
- Stability:
experimental
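Because inference_config also accepts a plain mapping, the parameters named above can be supplied as a dict. A minimal sketch; the snake_case key names are an assumption based on the usual CDK Python mapping of the maxTokens, temperature, and topP parameters:

```python
# Sketch only: a mapping passed as the inference_config value when
# constructing LlmAsAJudgeOptions. Key names are assumed from CDK
# Python's snake_case convention for maxTokens/temperature/topP.
inference_config = {
    "max_tokens": 1024,   # cap on the judge model's response length
    "temperature": 0.0,   # deterministic scoring for reproducible ratings
    "top_p": 1.0,         # nucleus sampling parameter
}
```

Omitting the mapping entirely leaves all three values at the foundation model's own defaults, per the Default noted above.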
- instructions
(experimental) The evaluation instructions that guide the language model in assessing agent performance.
These instructions define the evaluation criteria, context, and expected behavior. Instructions must contain placeholders appropriate for the evaluation level (e.g., {context} and {available_tools} for SESSION level).
Note: Evaluators using reference-input placeholders (e.g., {expected_tool_trajectory}, {assertions}, {expected_response}) are only compatible with on-demand evaluation, not online evaluation.
- Stability:
experimental
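To illustrate the placeholder requirement, a hypothetical SESSION-level instructions string containing both placeholders the service substitutes at evaluation time (the surrounding wording is an invented example, not an API-mandated template):

```python
# Hypothetical SESSION-level instructions: {context} and
# {available_tools} are the placeholders the documentation requires
# for SESSION-level evaluators; the rest of the text is illustrative.
instructions = (
    "Given the conversation context: {context}\n"
    "and the tools available to the agent: {available_tools}\n"
    "rate whether the agent's responses were helpful and accurate."
)

# No reference-input placeholders ({expected_tool_trajectory},
# {assertions}, {expected_response}) appear, so this string stays
# compatible with online evaluation.
has_session_placeholders = (
    "{context}" in instructions and "{available_tools}" in instructions
)
```

A string using reference-input placeholders would instead be restricted to on-demand evaluation, as noted above.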
- model_id
(experimental) The identifier of the Amazon Bedrock model to use for evaluation.
Accepts standard model IDs (e.g., 'anthropic.claude-sonnet-4-6') and cross-region inference profile IDs with region prefixes (e.g., 'us.anthropic.claude-sonnet-4-6', 'eu.anthropic.claude-sonnet-4-6').
- Stability:
experimental
- rating_scale
(experimental) The rating scale that defines how the evaluator should score agent performance.
- Stability:
experimental