
Evaluation types and Job Submission

Benchmarking with standardized datasets

Use the Benchmark evaluation type to measure the quality of your model against standardized benchmark datasets, including popular ones such as MMLU and BBH.

Benchmark	Custom Dataset Supported	Modalities	Description	Metrics	Strategy	Subtask available
mmlu	No	Text	Multi-task Language Understanding – Tests knowledge across 57 subjects.	accuracy	zs_cot	Yes
mmlu_pro	No	Text	MMLU – Professional Subset – Focuses on professional domains such as law, medicine, accounting, and engineering.	accuracy	zs_cot	No
bbh	No	Text	Advanced Reasoning Tasks – A collection of challenging problems that test higher-level cognitive and problem-solving skills.	accuracy	fs_cot	Yes
gpqa	No	Text	Graduate-Level Google-Proof Q&A – Assesses graduate-level scientific reasoning across biology, physics, and chemistry.	accuracy	zs_cot	No
math	No	Text	Mathematical Problem Solving – Measures mathematical reasoning across topics including algebra, calculus, and word problems.	exact_match	zs_cot	Yes
strong_reject	No	Text	Quality-Control Task – Tests the model's ability to detect and reject inappropriate, harmful, or incorrect content.	deflection	zs	Yes
ifeval	No	Text	Instruction-Following Evaluation – Gauges how accurately a model follows given instructions and completes tasks to specification.	accuracy	zs	No

For more information on BYOD formats, see Supported Dataset Formats for Bring-Your-Own-Dataset (BYOD) Tasks.

Available Subtasks

The following lists show the subtasks available for MMLU (Massive Multitask Language Understanding), BBH (BIG-Bench Hard), MATH, and StrongReject. Use subtasks to assess your model's performance on specific capabilities and knowledge areas.

MMLU Subtasks


MMLU_SUBTASKS = [
    "abstract_algebra",
    "anatomy",
    "astronomy",
    "business_ethics",
    "clinical_knowledge",
    "college_biology",
    "college_chemistry",
    "college_computer_science",
    "college_mathematics",
    "college_medicine",
    "college_physics",
    "computer_security",
    "conceptual_physics",
    "econometrics",
    "electrical_engineering",
    "elementary_mathematics",
    "formal_logic",
    "global_facts",
    "high_school_biology",
    "high_school_chemistry",
    "high_school_computer_science",
    "high_school_european_history",
    "high_school_geography",
    "high_school_government_and_politics",
    "high_school_macroeconomics",
    "high_school_mathematics",
    "high_school_microeconomics",
    "high_school_physics",
    "high_school_psychology",
    "high_school_statistics",
    "high_school_us_history",
    "high_school_world_history",
    "human_aging",
    "human_sexuality",
    "international_law",
    "jurisprudence",
    "logical_fallacies",
    "machine_learning",
    "management",
    "marketing",
    "medical_genetics",
    "miscellaneous",
    "moral_disputes",
    "moral_scenarios",
    "nutrition",
    "philosophy",
    "prehistory",
    "professional_accounting",
    "professional_law",
    "professional_medicine",
    "professional_psychology",
    "public_relations",
    "security_studies",
    "sociology",
    "us_foreign_policy",
    "virology",
    "world_religions"
]

BBH Subtasks


BBH_SUBTASKS = [
    "boolean_expressions",
    "causal_judgement",
    "date_understanding",
    "disambiguation_qa",
    "dyck_languages",
    "formal_fallacies",
    "geometric_shapes",
    "hyperbaton",
    "logical_deduction_five_objects",
    "logical_deduction_seven_objects",
    "logical_deduction_three_objects",
    "movie_recommendation",
    "multistep_arithmetic_two",
    "navigate",
    "object_counting",
    "penguins_in_a_table",
    "reasoning_about_colored_objects",
    "ruin_names",
    "salient_translation_error_detection",
    "snarks",
    "sports_understanding",
    "temporal_sequences",
    "tracking_shuffled_objects_five_objects",
    "tracking_shuffled_objects_seven_objects",
    "tracking_shuffled_objects_three_objects",
    "web_of_lies",
    "word_sorting"
]

Math Subtasks


MATH_SUBTASKS = [
    "algebra", 
    "counting_and_probability", 
    "geometry",
    "intermediate_algebra", 
    "number_theory", 
    "prealgebra", 
    "precalculus"
]

StrongReject Subtasks


STRONG_REJECT_SUBTASKS = [
    "gcg_transfer_harmbench",
    "gcg_transfer_universal_attacks",
    "combination_3",
    "combination_2",
    "few_shot_json",
    "dev_mode_v2",
    "dev_mode_with_rant",
    "wikipedia_with_title",
    "distractors",
    "wikipedia",
    "style_injection_json",
    "style_injection_short",
    "refusal_suppression",
    "prefix_injection",
    "distractors_negated",
    "poems",
    "base64",
    "base64_raw",
    "base64_input_only",
    "base64_output_only",
    "evil_confidant",
    "aim",
    "rot_13",
    "disemvowel",
    "auto_obfuscation",
    "auto_payload_splitting",
    "pair",
    "pap_authority_endorsement",
    "pap_evidence_based_persuasion",
    "pap_expert_endorsement",
    "pap_logical_appeal",
    "pap_misrepresentation"
]

Submit your benchmark job
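
The submission code for this step is not reproduced in this section. As a minimal sketch, assuming you want to check a benchmark and subtask choice before submitting a job, the helper below encodes the metric, strategy, and subtask columns of the benchmark table above. The name `check_benchmark_selection` and the keys of the returned dictionary are hypothetical and are not part of any SageMaker API. The subtask sets are abbreviated for brevity.

```python
from typing import Optional

# Benchmark properties copied from the table above:
# name -> (metric, strategy, subtasks_available)
BENCHMARKS = {
    "mmlu": ("accuracy", "zs_cot", True),
    "mmlu_pro": ("accuracy", "zs_cot", False),
    "bbh": ("accuracy", "fs_cot", True),
    "gpqa": ("accuracy", "zs_cot", False),
    "math": ("exact_match", "zs_cot", True),
    "strong_reject": ("deflection", "zs", True),
    "ifeval": ("accuracy", "zs", False),
}

# Abbreviated subtask sets; extend them with the full lists above.
SUBTASKS = {
    "mmlu": {"abstract_algebra", "anatomy", "astronomy"},
    "bbh": {"boolean_expressions", "causal_judgement"},
    "math": {"algebra", "geometry", "prealgebra"},
    "strong_reject": {"base64", "rot_13", "poems"},
}


def check_benchmark_selection(benchmark: str, subtask: Optional[str] = None) -> dict:
    """Validate a benchmark/subtask pair and return a summary (hypothetical helper)."""
    if benchmark not in BENCHMARKS:
        raise ValueError(f"Unknown benchmark: {benchmark!r}")
    metric, strategy, has_subtasks = BENCHMARKS[benchmark]
    if subtask is not None:
        if not has_subtasks:
            raise ValueError(f"{benchmark!r} does not support subtasks")
        if subtask not in SUBTASKS[benchmark]:
            raise ValueError(f"Unknown subtask {subtask!r} for {benchmark!r}")
    return {
        "benchmark": benchmark,
        "subtask": subtask,
        "metric": metric,
        "strategy": strategy,
    }
```

Calling `check_benchmark_selection("gpqa", "anything")` raises a `ValueError`, because the table lists no subtasks for `gpqa`.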

Large Language Model as a Judge (LLMAJ) evaluation

Use LLM-as-a-judge (LLMAJ) evaluation to have another frontier model grade the responses of your target model. You can use Amazon Bedrock models as judges by calling the create_evaluation_job API to launch the evaluation job.

SageMaker LLM as a Judge: This feature is powered by Amazon Bedrock Evaluations. Your use of this feature is subject to Amazon Bedrock Evaluations pricing, the Service Terms applicable to Amazon Bedrock, and the terms that apply to your usage of third-party models. Amazon Bedrock Evaluations may securely transmit data across AWS Regions within your geography for processing. For more information, see the Amazon Bedrock Evaluations documentation.

For more information on the supported judge models, see: https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html

You can define the evaluation with either of two metric formats:

Built-in metrics: Use Amazon Bedrock built-in metrics to analyze the quality of your model's inference responses. For more information, see: https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-type-judge-prompt.html
Custom metrics: Define your own metrics in the Bedrock Evaluation custom metric format to analyze the quality of your model's inference responses using your own instructions. For more information, see: https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-custom-metrics-prompt-formats.html

Submit a built-in metrics LLMAJ job

Submit a custom metrics LLMAJ job

Define your custom metric(s):


# Custom metric definition, written as a Python dict.
custom_metric = {
    "customMetricDefinition": {
        "name": "PositiveSentiment",
        "instructions": (
            "You are an expert evaluator. Your task is to assess if the sentiment of the response is positive. "
            "Rate the response based on whether it conveys positive sentiment, helpfulness, and constructive tone.\n\n"
            "Consider the following:\n"
            "- Does the response have a positive, encouraging tone?\n"
            "- Is the response helpful and constructive?\n"
            "- Does it avoid negative language or criticism?\n\n"
            "Rate on this scale:\n"
            "- Good: Response has positive sentiment\n"
            "- Poor: Response lacks positive sentiment\n\n"
            "Here is the actual task:\n"
            "Prompt: {{prompt}}\n"
            "Response: {{prediction}}"
        ),
        "ratingScale": [
            {"definition": "Good", "value": {"floatValue": 1}},
            {"definition": "Poor", "value": {"floatValue": 0}}
        ]
    }
}

For more information, see: https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-custom-metrics-prompt-formats.html
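
Before submitting a custom metric, it can help to check its structure locally. The following is a minimal sketch that assumes only the fields shown in the example above (`name`, `instructions`, `ratingScale`). The helper `validate_custom_metric` is hypothetical, and its checks are illustrative rather than the validation rules that Amazon Bedrock applies.

```python
import json


def validate_custom_metric(metric: dict) -> str:
    """Check the fields used in the example above and return the metric as JSON.

    Hypothetical helper: these checks are illustrative, not Bedrock's own validation.
    """
    definition = metric["customMetricDefinition"]
    for key in ("name", "instructions", "ratingScale"):
        if key not in definition:
            raise ValueError(f"Missing required field: {key}")
    # ASSUMPTION: a useful judge prompt references the response being graded.
    if "{{prediction}}" not in definition["instructions"]:
        raise ValueError("instructions should reference {{prediction}}")
    for level in definition["ratingScale"]:
        if "definition" not in level or "value" not in level:
            raise ValueError(f"Malformed rating scale entry: {level}")
    return json.dumps(metric)


# A shortened version of the PositiveSentiment metric from the example above.
example_metric = {
    "customMetricDefinition": {
        "name": "PositiveSentiment",
        "instructions": "Rate the sentiment.\nPrompt: {{prompt}}\nResponse: {{prediction}}",
        "ratingScale": [
            {"definition": "Good", "value": {"floatValue": 1}},
            {"definition": "Poor", "value": {"floatValue": 0}},
        ],
    }
}
serialized = validate_custom_metric(example_metric)
```

Serializing with `json.dumps` is one way to turn the Python dict into a JSON string. Whether the job submission call expects a dict or a JSON string is not stated in this section.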

Custom Scorers

Define your own custom scorer function to launch an evaluation job. The system provides two built-in scorers: Prime math and Prime code. You can also bring your own scorer function, either by copying the function code directly or by providing your own Lambda function definition through its ARN. By default, both scorer types produce evaluation results that include standard metrics such as F1 score, ROUGE, and BLEU.

For more information on built-in and custom scorers and their respective requirements/contracts, see Evaluate with Preset and Custom Scorers.

Register your dataset

Bring your own dataset for a custom scorer by registering it as a SageMaker Hub Content Dataset.

Submit a built-in scorer job

Submit a custom scorer job

Define a custom reward function. For more information, see Custom Scorers (Bring Your Own Metrics).
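
The input and output contract for custom scorers is not described in this section, so the following is only a sketch of what a Lambda-style reward function could look like. The `lambda_handler(event, context)` signature is the standard AWS Lambda convention for Python. The event keys (`samples`, `prediction`, `reference`) and the response keys (`scores`, `mean_score`) are assumptions made for illustration. Consult the custom scorer documentation for the actual schema.

```python
from typing import Any


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 when the normalized strings are identical, otherwise 0.0."""
    return 1.0 if normalize(prediction) == normalize(reference) else 0.0


def lambda_handler(event: dict[str, Any], context: Any) -> dict[str, Any]:
    # ASSUMPTION: the event carries {"samples": [{"prediction": ..., "reference": ...}]}.
    # The real input and output schema is defined by the custom scorer contract.
    samples = event.get("samples", [])
    scores = [exact_match(s["prediction"], s["reference"]) for s in samples]
    mean_score = sum(scores) / len(scores) if scores else 0.0
    return {"scores": scores, "mean_score": mean_score}
```

Keeping the scoring logic in a plain function such as `exact_match` makes it possible to unit test the reward locally before registering it.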

Register the custom reward function
