EvaluatorInferenceConfig

class aws_cdk.aws_bedrock_agentcore_alpha.EvaluatorInferenceConfig(*, max_tokens=None, temperature=None, top_p=None)

Bases: object

(experimental) Inference configuration for a custom LLM-as-a-Judge evaluator.

Controls how the foundation model generates evaluation responses.

Parameters:
  • max_tokens (Union[int, float, None]) – (experimental) The maximum number of tokens to generate in the model response. Default: - The foundation model’s default maximum token limit is used

  • temperature (Union[int, float, None]) – (experimental) The temperature value that controls randomness in the model’s responses. Higher values produce more diverse outputs. Range: 0.0 to 1.0. Default: - The foundation model’s default temperature is used

  • top_p (Union[int, float, None]) – (experimental) The top-p sampling parameter that controls the diversity of the model’s responses. Range: 0.0 to 1.0. Default: - The foundation model’s default top-p value is used
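The constraints above can be sketched in plain Python. The `validate_inference_config` helper below is hypothetical (it is not part of the CDK API) and only illustrates the documented behavior: `None` defers to the foundation model's default, and `temperature` / `top_p` are expected in the 0.0 to 1.0 range.

```python
def validate_inference_config(max_tokens=None, temperature=None, top_p=None):
    """Hypothetical sketch of EvaluatorInferenceConfig's documented constraints.

    Returns the config as a plain dict; None means "use the foundation
    model's default" for that field.
    """
    if max_tokens is not None and max_tokens < 1:
        raise ValueError("max_tokens must be a positive number")
    # Both sampling parameters share the documented 0.0-1.0 range.
    for name, value in (("temperature", temperature), ("top_p", top_p)):
        if value is not None and not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0")
    return {"max_tokens": max_tokens, "temperature": temperature, "top_p": top_p}
```

Passing only a subset of fields, as in the example below (`max_tokens=1024, temperature=0.1`), leaves the remaining parameters at the model's defaults.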

Stability:

experimental

ExampleMetadata:

fixture=default infused

Example:

# LLM-as-a-Judge with categorical rating scale
categorical_evaluator = agentcore.Evaluator(self, "CategoricalEvaluator",
    evaluator_name="domain_accuracy_evaluator",
    level=agentcore.EvaluationLevel.SESSION,
    description="Evaluates domain-specific accuracy of agent responses",
    evaluator_config=agentcore.EvaluatorConfig.llm_as_a_judge(
        instructions="Evaluate whether the agent response is accurate within the healthcare domain.",
        model_id="us.anthropic.claude-sonnet-4-6",
        rating_scale=agentcore.EvaluatorRatingScale.categorical([
            {"label": "Accurate", "definition": "The response contains factually correct healthcare information."},
            {"label": "Inaccurate", "definition": "The response contains incorrect or misleading healthcare information."}
        ])
    )
)

# LLM-as-a-Judge with numerical rating scale and inference config
numerical_evaluator = agentcore.Evaluator(self, "NumericalEvaluator",
    evaluator_name="response_quality_evaluator",
    level=agentcore.EvaluationLevel.TRACE,
    evaluator_config=agentcore.EvaluatorConfig.llm_as_a_judge(
        instructions="Rate the overall quality of the agent response on a scale of 1 to 5.",
        model_id="us.anthropic.claude-sonnet-4-6",
        rating_scale=agentcore.EvaluatorRatingScale.numerical([
            {"label": "Poor", "definition": "Inadequate response.", "value": 1},
            {"label": "Below Average", "definition": "Partially addresses the query.", "value": 2},
            {"label": "Average", "definition": "Adequately addresses the query.", "value": 3},
            {"label": "Good", "definition": "Well-structured and accurate response.", "value": 4},
            {"label": "Excellent", "definition": "Outstanding response exceeding expectations.", "value": 5}
        ]),
        inference_config=agentcore.EvaluatorInferenceConfig(
            max_tokens=1024,
            temperature=0.1
        )
    )
)

Attributes

max_tokens

(experimental) The maximum number of tokens to generate in the model response.

Default:
  • The foundation model’s default maximum token limit is used

Stability:

experimental

temperature

(experimental) The temperature value that controls randomness in the model’s responses.

Higher values produce more diverse outputs. Range: 0.0 to 1.0.

Default:
  • The foundation model’s default temperature is used

Stability:

experimental

top_p

(experimental) The top-p sampling parameter that controls the diversity of the model’s responses.

Range: 0.0 to 1.0.

Default:
  • The foundation model’s default top-p value is used

Stability:

experimental