Evaluation types and Job Submission
Benchmarking with standardized datasets
Use the Benchmark evaluation type to evaluate the quality of your model across standardized benchmark datasets, including popular ones such as MMLU and BBH.
| Benchmark | Custom Dataset Supported | Modalities | Description | Metrics | Strategy | Subtask available |
|---|---|---|---|---|---|---|
| mmlu | No | Text | Multi-task Language Understanding – Tests knowledge across 57 subjects. | accuracy | zs_cot | Yes |
| mmlu_pro | No | Text | MMLU – Professional Subset – Focuses on professional domains such as law, medicine, accounting, and engineering. | accuracy | zs_cot | No |
| bbh | No | Text | Advanced Reasoning Tasks – A collection of challenging problems that test higher-level cognitive and problem-solving skills. | accuracy | fs_cot | Yes |
| gpqa | No | Text | Graduate-Level Google-Proof Q&A – Assesses graduate-level science questions in biology, physics, and chemistry that are difficult to answer by search alone. | accuracy | zs_cot | No |
| math | No | Text | Mathematical Problem Solving – Measures mathematical reasoning across topics including algebra, calculus, and word problems. | exact_match | zs_cot | Yes |
| strong_reject | No | Text | Quality-Control Task – Tests the model's ability to detect and reject inappropriate, harmful, or incorrect content. | deflection | zs | Yes |
| ifeval | No | Text | Instruction-Following Evaluation – Gauges how accurately a model follows given instructions and completes tasks to specification. | accuracy | zs | No |
For more information on BYOD formats, see Supported Dataset Formats for Bring-Your-Own-Dataset (BYOD) Tasks.
Available Subtasks
The following lists show the subtasks available for the MMLU (Massive Multitask Language Understanding), BBH (BIG-Bench Hard), MATH, and StrongReject benchmarks. Use these subtasks to assess your model's performance on specific capabilities and knowledge areas.
MMLU Subtasks
MMLU_SUBTASKS = [ "abstract_algebra", "anatomy", "astronomy", "business_ethics", "clinical_knowledge", "college_biology", "college_chemistry", "college_computer_science", "college_mathematics", "college_medicine", "college_physics", "computer_security", "conceptual_physics", "econometrics", "electrical_engineering", "elementary_mathematics", "formal_logic", "global_facts", "high_school_biology", "high_school_chemistry", "high_school_computer_science", "high_school_european_history", "high_school_geography", "high_school_government_and_politics", "high_school_macroeconomics", "high_school_mathematics", "high_school_microeconomics", "high_school_physics", "high_school_psychology", "high_school_statistics", "high_school_us_history", "high_school_world_history", "human_aging", "human_sexuality", "international_law", "jurisprudence", "logical_fallacies", "machine_learning", "management", "marketing", "medical_genetics", "miscellaneous", "moral_disputes", "moral_scenarios", "nutrition", "philosophy", "prehistory", "professional_accounting", "professional_law", "professional_medicine", "professional_psychology", "public_relations", "security_studies", "sociology", "us_foreign_policy", "virology", "world_religions" ]
BBH Subtasks
BBH_SUBTASKS = [ "boolean_expressions", "causal_judgement", "date_understanding", "disambiguation_qa", "dyck_languages", "formal_fallacies", "geometric_shapes", "hyperbaton", "logical_deduction_five_objects", "logical_deduction_seven_objects", "logical_deduction_three_objects", "movie_recommendation", "multistep_arithmetic_two", "navigate", "object_counting", "penguins_in_a_table", "reasoning_about_colored_objects", "ruin_names", "salient_translation_error_detection", "snarks", "sports_understanding", "temporal_sequences", "tracking_shuffled_objects_five_objects", "tracking_shuffled_objects_seven_objects", "tracking_shuffled_objects_three_objects", "web_of_lies", "word_sorting" ]
MATH Subtasks
MATH_SUBTASKS = [ "algebra", "counting_and_probability", "geometry", "intermediate_algebra", "number_theory", "prealgebra", "precalculus" ]
StrongReject Subtasks
STRONG_REJECT_SUBTASKS = [ "gcg_transfer_harmbench", "gcg_transfer_universal_attacks", "combination_3", "combination_2", "few_shot_json", "dev_mode_v2", "dev_mode_with_rant", "wikipedia_with_title", "distractors", "wikipedia", "style_injection_json", "style_injection_short", "refusal_suppression", "prefix_injection", "distractors_negated", "poems", "base64", "base64_raw", "base64_input_only", "base64_output_only", "evil_confidant", "aim", "rot_13", "disemvowel", "auto_obfuscation", "auto_payload_splitting", "pair", "pap_authority_endorsement", "pap_evidence_based_persuasion", "pap_expert_endorsement", "pap_logical_appeal", "pap_misrepresentation" ]
Submit your benchmark job
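Before submitting, it can help to check that the chosen benchmark and subtask are a valid combination according to the table and lists above. The following is a minimal sketch of such a check, not the SDK's submission call: `BENCHMARKS`, `KNOWN_SUBTASKS`, and `build_benchmark_config` are hypothetical names introduced only for illustration, and the returned dict is not a real request format.

```python
# Illustrative pre-submission check. Nothing here calls a real SDK;
# BENCHMARKS mirrors the benchmark table and all names are hypothetical.
BENCHMARKS = {
    # benchmark: (metric, strategy, subtasks_available)
    "mmlu": ("accuracy", "zs_cot", True),
    "mmlu_pro": ("accuracy", "zs_cot", False),
    "bbh": ("accuracy", "fs_cot", True),
    "gpqa": ("accuracy", "zs_cot", False),
    "math": ("exact_match", "zs_cot", True),
    "strong_reject": ("deflection", "zs", True),
    "ifeval": ("accuracy", "zs", False),
}

# Only the MATH list is repeated here to keep the sketch short.
KNOWN_SUBTASKS = {
    "math": [
        "algebra", "counting_and_probability", "geometry",
        "intermediate_algebra", "number_theory", "prealgebra", "precalculus",
    ],
}


def build_benchmark_config(benchmark, subtask=None):
    """Return a plain dict describing the job, or raise ValueError."""
    if benchmark not in BENCHMARKS:
        raise ValueError(f"unknown benchmark: {benchmark!r}")
    metric, strategy, has_subtasks = BENCHMARKS[benchmark]
    if subtask is not None:
        if not has_subtasks:
            raise ValueError(f"{benchmark} does not support subtasks")
        allowed = KNOWN_SUBTASKS.get(benchmark)
        # Lists not copied into this sketch are skipped rather than guessed.
        if allowed is not None and subtask not in allowed:
            raise ValueError(f"unknown {benchmark} subtask: {subtask!r}")
    return {"benchmark": benchmark, "subtask": subtask,
            "metric": metric, "strategy": strategy}
```

For example, `build_benchmark_config("math", "algebra")` succeeds, while `build_benchmark_config("mmlu_pro", "law")` raises `ValueError` because the table lists no subtasks for `mmlu_pro`.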
Large Language Model as a Judge (LLMAJ) evaluation
Use LLM-as-a-judge (LLMAJ) evaluation to have another frontier model grade your target model's responses. You can use Amazon Bedrock models as judges by calling the create_evaluation_job API to launch the evaluation job.
For more information on the supported judge models see: https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html
You can define the evaluation using two metric formats:
- Built-in metrics: Use Amazon Bedrock built-in metrics to analyze the quality of your model's inference responses. For more information, see: https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-type-judge-prompt.html
- Custom metrics: Define your own metrics in the Bedrock Evaluation custom metric format to analyze the quality of your model's inference responses using your own instructions. For more information, see: https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-custom-metrics-prompt-formats.html
Submit a built-in metrics LLMAJ job
Submit a custom metrics LLMAJ job
Define your custom metric(s):
{ "customMetricDefinition": { "name": "PositiveSentiment", "instructions": ( "You are an expert evaluator. Your task is to assess if the sentiment of the response is positive. " "Rate the response based on whether it conveys positive sentiment, helpfulness, and constructive tone.\n\n" "Consider the following:\n" "- Does the response have a positive, encouraging tone?\n" "- Is the response helpful and constructive?\n" "- Does it avoid negative language or criticism?\n\n" "Rate on this scale:\n" "- Good: Response has positive sentiment\n" "- Poor: Response lacks positive sentiment\n\n" "Here is the actual task:\n" "Prompt: {{prompt}}\n" "Response: {{prediction}}" ), "ratingScale": [ {"definition": "Good", "value": {"floatValue": 1}}, {"definition": "Poor", "value": {"floatValue": 0}} ] } }
For more information, see: https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-custom-metrics-prompt-formats.html
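To see how the `{{prompt}}` and `{{prediction}}` template variables and the rating scale fit together, here is a minimal local sketch. It only illustrates the idea: the evaluation service performs the real substitution and judging, the instructions are shortened, and `render_instructions` and `rating_to_score` are hypothetical helper names.

```python
# Local illustration only: the evaluation service performs the real
# substitution and judging. Function names below are hypothetical.
METRIC = {
    "name": "PositiveSentiment",
    "instructions": (
        "Assess whether the sentiment of the response is positive.\n"
        "Prompt: {{prompt}}\n"
        "Response: {{prediction}}"
    ),
    "ratingScale": [
        {"definition": "Good", "value": {"floatValue": 1}},
        {"definition": "Poor", "value": {"floatValue": 0}},
    ],
}


def render_instructions(metric, prompt, prediction):
    """Fill the {{prompt}} and {{prediction}} placeholders."""
    text = metric["instructions"]
    return text.replace("{{prompt}}", prompt).replace("{{prediction}}", prediction)


def rating_to_score(metric, label):
    """Map a judge label such as "Good" to its floatValue."""
    for entry in metric["ratingScale"]:
        if entry["definition"] == label:
            return float(entry["value"]["floatValue"])
    raise ValueError(f"label not in rating scale: {label!r}")
```

Rendering the instructions for one sample shows exactly what the judge model would be asked, and `rating_to_score(METRIC, "Good")` returns the numeric value that the rating scale assigns to that label.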
Custom Scorers
Define your own custom scorer function to launch an evaluation job. The system provides two built-in scorers: Prime math and Prime code. You can also bring your own scorer function, either by supplying the function code directly or by referencing your own Lambda function by its ARN. By default, both scorer types produce evaluation results that include standard metrics such as F1 score, ROUGE, and BLEU.
For more information on built-in and custom scorers and their respective requirements/contracts, see Evaluate with Preset and Custom Scorers.
Register your dataset
Bring your own dataset for a custom scorer by registering it as a SageMaker Hub Content Dataset.
Submit a built-in scorer job
Submit a custom scorer job
Define a custom reward function. For more information, see Custom Scorers (Bring Your Own Metrics).
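As a rough idea of what a reward function could look like, the sketch below scores each sample with a token-level F1 inside a Lambda-style handler. The `lambda_handler(event, context)` signature is the standard AWS Lambda entry point for Python, but the event and response shapes shown here are assumptions made for illustration; the actual contract is described in the scorer documentation referenced above.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-level F1 between two whitespace-tokenized strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def lambda_handler(event, context):
    # Assumed event shape: {"samples": [{"id", "prediction", "reference"}]}.
    # The real input/output contract is defined by the scorer documentation.
    return [
        {"id": s["id"], "score": token_f1(s["prediction"], s["reference"])}
        for s in event["samples"]
    ]
```

Keeping the scoring logic in a plain function such as `token_f1` makes it easy to unit test locally before the handler is registered.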
Register the custom reward function