Evaluating your SageMaker AI-trained model
The purpose of the evaluation process is to assess the performance of a trained model against standard benchmarks or your own custom datasets. The process typically involves creating an evaluation recipe that points to the trained model, specifying the evaluation datasets and metrics, submitting a separate job for the evaluation, and evaluating against standard benchmarks or custom data. The evaluation job outputs performance metrics stored in your Amazon S3 bucket.
Note
The evaluation process described in this topic is an offline process. The model is tested against fixed benchmarks with predefined answers, rather than being assessed in real-time or through live user interactions. For real-time evaluation, you can test the model after it has been deployed to Amazon Bedrock by calling Amazon Bedrock Runtime APIs.
Prerequisites
Before you start an evaluation training job, note the following prerequisites.
- A SageMaker AI-trained Amazon Nova model whose performance you want to evaluate.
- A base Amazon Nova recipe for evaluation. For more information, see Getting Amazon Nova recipes.
Available benchmark tasks
A sample code package is available that demonstrates how to calculate benchmark metrics using the SageMaker model evaluation feature for Amazon Nova. To access the code packages, see sample-Nova-lighteval-custom-task.

The following industry-standard benchmarks are supported. You can specify them in the `eval_task` parameter.
Available benchmarks for model evaluation
| Benchmark | Modality | Description | Metrics | Strategy | Subtask available |
| --- | --- | --- | --- | --- | --- |
| mmlu | Text | Multi-task Language Understanding – Tests knowledge across 57 subjects. | accuracy | zs_cot | Yes |
| mmlu_pro | Text | MMLU – Professional Subset – Focuses on professional domains such as law, medicine, accounting, and engineering. | accuracy | zs_cot | No |
| bbh | Text | Advanced Reasoning Tasks – A collection of challenging problems that test higher-level cognitive and problem-solving skills. | accuracy | fs_cot | Yes |
| gpqa | Text | General Physics Question Answering – Assesses comprehension of physics concepts and related problem-solving abilities. | accuracy | zs_cot | No |
| math | Text | Mathematical Problem Solving – Measures mathematical reasoning across topics including algebra, calculus, and word problems. | exact_match | zs_cot | Yes |
| strong_reject | Text | Quality-Control Task – Tests the model's ability to detect and reject inappropriate, harmful, or incorrect content. | deflection | zs | Yes |
| ifeval | Text | Instruction-Following Evaluation – Gauges how accurately a model follows given instructions and completes tasks to specification. | accuracy | zs | No |
| gen_qa | Text | Custom Dataset Evaluation – Lets you supply your own dataset for benchmarking, comparing model outputs to reference answers with metrics such as ROUGE and BLEU. | all | gen_qa | No |
| mmmu | Multi-Modal | Massive Multidiscipline Multimodal Understanding (MMMU) – College-level benchmark comprising multiple-choice and open-ended questions from 30 disciplines. | accuracy | zs_cot | Yes |
| llm_judge | Text | LLM-as-a-Judge Preference Comparison – Uses a Nova Judge model to determine preference between paired responses (B compared with A) for your prompts, calculating the probability of B being preferred over A. | all | judge | No |
Evaluation-specific configurations
Below is a breakdown of the key components in the recipe and guidance on how to modify them for your use cases.
Understanding and modifying your recipes
General run configuration
```yaml
run:
  name: eval_job_name
  model_type: amazon.nova-micro-v1:0:128k
  model_name_or_path: nova-micro/prod
  replicas: 1
  data_s3_path: ""
```
- `name`: A descriptive name for your evaluation job.
- `model_type`: Specifies the Amazon Nova model variant to use. Do not manually modify this field. Options include:
  - `amazon.nova-micro-v1:0:128k`
  - `amazon.nova-lite-v1:0:300k`
  - `amazon.nova-pro-v1:0:300k`
- `model_name_or_path`: The path to the base model, or the S3 path of a post-trained checkpoint. Options include:
  - `nova-micro/prod`
  - `nova-lite/prod`
  - `nova-pro/prod`
  - The S3 path of the post-trained checkpoint (`s3://customer-escrow-111122223333-smtj-<unique_id>/<training_run_name>`)

  Note - Evaluate a post-trained model

  To evaluate a post-trained model after a Nova SFT training job, follow these steps after running a successful training job. At the end of the training logs, you will see the log message "Training is complete". You will also find a `manifest.json` file in your output bucket containing the location of your checkpoint. This file is located within an `output.tar.gz` file at your output S3 location. To proceed with evaluation, set this checkpoint as the value for `run.model_name_or_path` in your recipe configuration.

- `replicas`: The number of compute instances to use for distributed training. Set this to 1, because multi-node evaluation is not supported.
- `data_s3_path`: The input dataset Amazon S3 path. This field is required but should always be left empty.
Evaluation configuration
```yaml
evaluation:
  task: mmlu
  strategy: zs_cot
  subtask: abstract_algebra
  metric: accuracy
```
- `task`: Specifies the evaluation benchmark or task to use. Supported tasks include:
  - `mmlu`
  - `mmlu_pro`
  - `bbh`
  - `gpqa`
  - `math`
  - `strong_reject`
  - `gen_qa`
  - `ifeval`
  - `mmmu`
  - `llm_judge`
- `strategy`: Defines the evaluation approach.
  - `zs_cot`: Zero-shot Chain of Thought, a prompting approach that encourages large language models to reason step by step without requiring explicit examples.
  - `fs_cot`: Few-shot Chain of Thought, an approach that provides a few examples of step-by-step reasoning before asking the model to solve a new problem.
  - `zs`: Zero-shot, an approach that solves a problem without any prior training examples.
  - `gen_qa`: Strategy specific to the bring-your-own-dataset benchmark.
  - `judge`: Strategy specific to the Nova LLM-as-a-Judge benchmark.
- `subtask`: Optional. A specific component of the evaluation task. For a complete list of available subtasks, see Available subtasks.
  - Check which benchmarks support subtasks in the Available benchmarks table.
  - Remove this field if the benchmark has no subtasks.
- `metric`: The evaluation metric to use.
  - `accuracy`: Percentage of correct answers.
  - `exact_match`: For the math benchmark, returns the rate at which the predicted strings exactly match their references.
  - `deflection`: For the strong_reject benchmark, returns the relative deflection to the base model and difference significance metrics.
  - `all`: For `gen_qa`, the bring-your-own-dataset benchmark, returns the following metrics:
    - `rouge1`: Measures overlap of unigrams (single words) between generated and reference text.
    - `rouge2`: Measures overlap of bigrams (two consecutive words) between generated and reference text.
    - `rougeL`: Measures the longest common subsequence between texts, allowing for gaps in the matching.
    - `exact_match`: Binary score (0 or 1) indicating if the generated text matches the reference text exactly, character by character.
    - `quasi_exact_match`: Similar to exact match but more lenient, typically ignoring case, punctuation, and white space differences.
    - `f1_score`: Harmonic mean of precision and recall, measuring word overlap between predicted and reference answers.
    - `f1_score_quasi`: Similar to `f1_score` but with more lenient matching, using normalized text comparison that ignores minor differences.
    - `bleu`: Measures precision of n-gram matches between generated and reference text, commonly used in translation evaluation.

    For `llm_judge`, the bring-your-own-dataset judge benchmark, returns the following metrics:
    - `a_scores`: Number of wins for `response_A` across forward and backward evaluation passes.
    - `a_scores_stderr`: Standard error of `response_A` scores across pairwise judgements.
    - `b_scores`: Number of wins for `response_B` across forward and backward evaluation passes.
    - `b_scores_stderr`: Standard error of `response_B` scores across pairwise judgements.
    - `ties`: Number of judgements where `response_A` and `response_B` are evaluated as equal.
    - `ties_stderr`: Standard error of `ties` across pairwise judgements.
    - `inference_error`: Count of judgements that could not be properly evaluated.
    - `inference_error_stderr`: Standard error of the inference error count across pairwise judgements.
    - `score`: Aggregate score based on wins from both forward and backward passes for `response_B`.
    - `score_stderr`: Standard error of the aggregate score across pairwise judgements.
    - `winrate`: The probability that `response_B` will be preferred over `response_A`, calculated using the Bradley-Terry probability.
    - `lower_rate`: Lower bound (2.5th percentile) of the estimated win rate from bootstrap sampling.
    - `upper_rate`: Upper bound (97.5th percentile) of the estimated win rate from bootstrap sampling.

An illustrative sketch of how the simpler text-overlap metrics are computed follows this list.
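To build intuition for what the `gen_qa` text-overlap metrics measure, here is a minimal, illustrative sketch of word-level exact match and F1 scoring. It is an approximation for intuition only; the exact tokenization and normalization used by the evaluation container may differ, and the `normalize` helper below is an assumption.

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    # Assumed normalization: lowercase and split on whitespace.
    # The evaluation container's actual normalization may differ.
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> float:
    # 1.0 only if the strings match character for character.
    return float(prediction == reference)


def f1_score(prediction: str, reference: str) -> float:
    # Harmonic mean of word-level precision and recall.
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("32", "32"))             # 1.0
print(f1_score("The answer is 32", "32"))  # 0.4, partial credit for word overlap
```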
Inference configuration (optional)
```yaml
inference:
  max_new_tokens: 2048
  top_k: -1
  top_p: 1.0
  temperature: 0
```
- `max_new_tokens`: Maximum number of tokens to generate. Must be an integer. (Not available for the LLM judge benchmark.)
- `top_k`: Number of highest-probability tokens to consider. Must be an integer.
- `top_p`: Cumulative probability threshold for token sampling. Must be a float between 0.0 and 1.0.
- `temperature`: Randomness in token selection (higher is more random). Must be a float with a minimum value of 0; keep it at 0 to make results deterministic.

A sketch of a complete recipe that combines the run, evaluation, and inference sections follows.
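For orientation, here is a minimal sketch of a full evaluation recipe assembled from the sections above. It assumes the Nova Micro base model and the mmlu benchmark; the job name and several values are illustrative, so adjust them to your setup and treat the published recipes in the SageMaker HyperPod recipes repository as the authoritative structure.

```yaml
run:
  name: my-mmlu-eval                       # illustrative job name
  model_type: amazon.nova-micro-v1:0:128k
  model_name_or_path: nova-micro/prod      # or an S3 checkpoint path
  replicas: 1
  data_s3_path: ""                         # required, but leave empty

evaluation:
  task: mmlu
  strategy: zs_cot
  subtask: abstract_algebra                # optional; remove if not needed
  metric: accuracy

inference:
  max_new_tokens: 2048
  top_k: -1
  top_p: 1.0
  temperature: 0
```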
Evaluation recipe examples
Amazon Nova provides four different types of evaluation recipes.
All recipes are available in the Amazon SageMaker HyperPod recipes GitHub repository.
Evaluation recipes
These recipes enable you to evaluate the fundamental capabilities of Amazon Nova models across a comprehensive suite of text-only benchmarks.
Recipe format: `xxx_general_text_benchmark_eval.yaml`.
These recipes enable you to evaluate the fundamental capabilities of Amazon Nova models across a comprehensive suite of multi-modality benchmarks.
Recipe format: `xxx_general_multi_modal_benchmark_eval.yaml`.
Multi-modal benchmark requirements
- Model support – Only the nova-lite and nova-pro base models and their post-trained variants are supported.
These recipes enable you to bring your own dataset for benchmarking and compare model outputs to reference answers using different types of metrics.
Recipe format: `xxx_bring_your_own_dataset_eval.yaml`.
Bring your own dataset requirements
File format:
- A single `gen_qa.jsonl` file containing the evaluation examples. The file name must be exactly `gen_qa.jsonl`.
- You must upload your dataset to an S3 location that SageMaker training jobs can access.
- The file must follow the required schema format for a general Q&A dataset.
Schema format - Each line in the `.jsonl` file must be a JSON object with the following fields.

- Required fields.
  - `query`: String containing the question or instruction that needs an answer.
  - `response`: String containing the expected model output.
- Optional fields.
  - `system`: String containing the system prompt that sets the behavior, role, or personality of the AI model before it processes the query.
Example entry
{ "system":"You are an English major with top marks in class who likes to give minimal word responses: ", "query":"What is the symbol that ends the sentence as a question", "response":"?" }{ "system":"You are a pattern analysis specialist who provides succinct answers: ", "query":"What is the next number in this series? 1, 2, 4, 8, 16, ?", "response":"32" }{ "system":"You have great attention to detail and follow instructions accurately: ", "query":"Repeat only the last two words of the following: I ate a hamburger today and it was kind of dry", "response":"of dry" }
To use your custom dataset, modify the evaluation section of your recipe with the following required fields; don't change any other content:
```yaml
evaluation:
  task: gen_qa
  strategy: gen_qa
  metric: all
```
Limitations

- Only one `.jsonl` file is allowed per evaluation.
- The file must strictly follow the defined schema.

A sketch that checks a dataset against this schema before upload follows.
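Before submitting an evaluation job, it can help to verify your file locally and upload it yourself. The following is a minimal sketch, assuming a Python environment with boto3 configured; the bucket name and prefix are placeholders, and the checks mirror only the schema rules documented above.

```python
import json

import boto3

REQUIRED_FIELDS = {"query", "response"}
OPTIONAL_FIELDS = {"system"}


def validate_gen_qa(path: str) -> None:
    """Check that every line is a JSON object with the documented fields."""
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            record = json.loads(line)  # raises if the line is not valid JSON
            missing = REQUIRED_FIELDS - record.keys()
            unexpected = record.keys() - REQUIRED_FIELDS - OPTIONAL_FIELDS
            if missing:
                raise ValueError(f"line {line_number}: missing fields {missing}")
            if unexpected:
                raise ValueError(f"line {line_number}: unexpected fields {unexpected}")


validate_gen_qa("gen_qa.jsonl")

# Upload to an S3 prefix that the SageMaker training job role can read.
# The bucket and prefix below are placeholders.
boto3.client("s3").upload_file("gen_qa.jsonl", "amzn-s3-demo-bucket", "eval-input/gen_qa.jsonl")
```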
Nova LLM Judge is a model evaluation feature that enables you to compare the quality of responses from one model against a baseline model's responses using a custom dataset. It accepts a dataset containing prompts, baseline responses, and challenger responses, then uses a Nova Judge model to provide a win rate metric based on the Bradley-Terry probability.

Recipe format: `xxx_llm_judge_eval.yaml`.
Nova LLM Judge dataset requirements

File format:

- A single `llm_judge.jsonl` file containing the evaluation examples. The file name must be exactly `llm_judge.jsonl`.
- You must upload your dataset to an S3 location that SageMaker training jobs can access.
- The file must follow the required schema format for the `llm_judge` dataset.
- All records in the input dataset must be under a 12k context length.
Schema format - Each line in the `.jsonl` file must be a JSON object with the following fields.

- Required fields.
  - `prompt`: String containing the prompt for the generated response.
  - `response_A`: String containing the baseline response.
  - `response_B`: String containing the alternative response to be compared with the baseline response.
Example entry
{ "prompt": "What is the most effective way to combat climate change?", "response_A": "The most effective way to combat climate change is through a combination of transitioning to renewable energy sources and implementing strict carbon pricing policies. This creates economic incentives for businesses to reduce emissions while promoting clean energy adoption.", "response_B": "We should focus on renewable energy. Solar and wind power are good. People should drive electric cars. Companies need to pollute less." } { "prompt": "Explain how a computer's CPU works", "response_A": "CPU is like brain of computer. It does math and makes computer work fast. Has lots of tiny parts inside.", "response_B": "A CPU (Central Processing Unit) functions through a fetch-execute cycle, where instructions are retrieved from memory, decoded, and executed through its arithmetic logic unit (ALU). It coordinates with cache memory and registers to process data efficiently using binary operations." } { "prompt": "How does photosynthesis work?", "response_A": "Plants do photosynthesis to make food. They use sunlight and water. It happens in leaves.", "response_B": "Photosynthesis is a complex biochemical process where plants convert light energy into chemical energy. They utilize chlorophyll to absorb sunlight, combining CO2 and water to produce glucose and oxygen through a series of chemical reactions in chloroplasts." }
To use your custom dataset, modify the evaluation section of your recipe with the following required fields; don't change any other content:
```yaml
evaluation:
  task: llm_judge
  strategy: judge
  metric: all
```
Limitations

- Only one `.jsonl` file is allowed per evaluation.
- The file must strictly follow the defined schema.
- Nova Judge models are the same across the micro, lite, and pro specifications.
- Custom judge models are not currently supported.

An illustrative sketch of how a win rate can be derived from judge win counts follows.
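To build intuition for the `winrate`, `lower_rate`, and `upper_rate` metrics, here is a minimal sketch that estimates the probability of `response_B` being preferred from per-pair win and tie counts, with a bootstrap interval. It is illustrative only: the exact Bradley-Terry fitting, tie handling, and bootstrap procedure used by the evaluation container are not documented here, so the tie-splitting choice below is an assumption.

```python
import random


def win_rate(outcomes: list[str]) -> float:
    """Point estimate of P(B preferred over A); ties counted as half a win each."""
    wins_b = sum(o == "B" for o in outcomes)
    ties = sum(o == "tie" for o in outcomes)
    return (wins_b + 0.5 * ties) / len(outcomes)


def bootstrap_interval(outcomes: list[str], n_samples: int = 2000, seed: int = 0):
    """2.5th and 97.5th percentiles of the win rate under bootstrap resampling."""
    rng = random.Random(seed)
    rates = sorted(
        win_rate([rng.choice(outcomes) for _ in outcomes]) for _ in range(n_samples)
    )
    return rates[int(0.025 * n_samples)], rates[int(0.975 * n_samples)]


# One outcome per pairwise judgement (forward and backward passes combined).
outcomes = ["B"] * 58 + ["A"] * 32 + ["tie"] * 10
print(win_rate(outcomes))            # 0.63
print(bootstrap_interval(outcomes))  # approximate lower_rate / upper_rate
```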
Running evaluation training jobs
Start a training job using the following sample Jupyter notebook. For more information, see Use a SageMaker AI estimator to run a training job.
Reference tables
Before running the notebook, refer to the following reference tables to select the image URI and instance configuration.
Selecting image URI
| Recipe | Image URI |
| --- | --- |
| Evaluation image URI | 708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-evaluation-repo:SM-TJ-Eval-latest |
Selecting instance type and count
| Model | Job type | Instance type | Recommended instance count | Allowed instance count |
| --- | --- | --- | --- | --- |
| Amazon Nova Micro | Evaluation (SFT/DPO) | g5.12xlarge | 1 | 1 |
| Amazon Nova Lite | Evaluation (SFT/DPO) | g5.12xlarge | 1 | 1 |
| Amazon Nova Pro | Evaluation (SFT/DPO) | p5.48xlarge | 1 | 1 |
Sample notebook
The following sample notebook demonstrates how to run an evaluation training job.
```python
# Install the SageMaker Python SDK
!pip install sagemaker

import os

import boto3
import sagemaker
from sagemaker.inputs import TrainingInput
from sagemaker.pytorch import PyTorch

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Download a recipe from
# https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection/recipes/evaluation/nova
# to local storage. This example assumes the file is named `recipe.yaml`.

# Populate parameters
# (Optional) Only used for a multi-modal dataset or a bring-your-own-dataset S3 location
# input_s3_uri = "s3://<path>/input/"

# Output S3 location. A zip containing the metrics JSON and TensorBoard metrics files
# will be stored at this location.
output_s3_uri = "s3://<path>/output/"

instance_type = "instance_type"  # ml.g5.16xlarge, for example
job_name = "your job name"
recipe_path = "recipe path"      # ./recipe.yaml, for example
image_uri = "708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-evaluation-repo:SM-TJ-Eval-latest"  # Do not change

# (Optional) To bring your own dataset or use LLM judge for evaluation
# evalInput = TrainingInput(
#     s3_data=input_s3_uri,
#     distribution="FullyReplicated",
#     s3_data_type="S3Prefix",
# )

estimator = PyTorch(
    output_path=output_s3_uri,
    base_job_name=job_name,
    role=role,
    instance_type=instance_type,
    training_recipe=recipe_path,
    sagemaker_session=sagemaker_session,
    image_uri=image_uri,
)

estimator.fit()
# If an input dataset exists, pass it in:
# estimator.fit(inputs={"train": evalInput})
```
Assessing and analyzing evaluation results
After your evaluation job completes successfully, you can assess and analyze the results using the following steps.
1. Understand the output location structure. Results are stored in your specified Amazon S3 output location as a compressed file:

   ```
   s3://your-bucket/output/benchmark-name/
   └── job_name/
       └── output/
           └── output.tar.gz
   ```

2. Download the `output.tar.gz` file from your bucket. Extract the contents to reveal:

   ```
   run_name/
   ├── eval_results/
   │   ├── results_[timestamp].json
   │   ├── inference_output.jsonl (only present for gen_qa)
   │   └── details/
   │       └── model/
   │           └── <execution-date-time>/
   │               └── details_<task_name>_#_<datetime>.parquet
   └── tensorboard_results/
       └── eval/
           └── events.out.tfevents.[timestamp]
   ```

   - `results_[timestamp].json` - Output metrics JSON file
   - `details_<task_name>_#_<datetime>.parquet` - Inference output file (except for `strong_reject`)
   - `events.out.tfevents.[timestamp]` - TensorBoard output file
   - `inference_output.jsonl` - Cleaned inference output file (only for `gen_qa` tasks)
3. View results in TensorBoard. To visualize your evaluation metrics:

   1. Upload the extracted folder to an S3 bucket.
   2. Navigate to SageMaker TensorBoard.
   3. Select your "S3 folders".
   4. Add the S3 folder path.
   5. Wait for synchronization to complete.
4. Analyze inference outputs. All evaluation tasks, except `llm_judge` and `strong_reject`, have the following fields for analysis in the inference output (see the sketch after these steps for one way to load them).

   - `full_prompt` - The full user prompt sent to the model for the evaluation task.
   - `gold` - The field that contains the correct answer(s) as specified by the dataset.
   - `metrics` - The field that contains the metrics evaluated on the individual inference. Values that require aggregation do not have a value in the individual inference outputs.
   - `predictions` - The field that contains a list of the model's outputs for the given prompt.

   By looking at these fields, you can determine the cause of metric differences and understand the behavior of the customized models.

   For `llm_judge`, the inference output file contains the following fields under the `metrics` field for each pair of evaluations.

   - `forward_output` - The judge's raw preferences when evaluating in order (response_A, response_B).
   - `backward_output` - The judge's raw preferences when evaluating in reverse order (response_B, response_A).
   - Pairwise metrics - Metrics that are calculated per pair of forward and backward evaluations, including `a_scores`, `b_scores`, `ties`, `inference-score`, and `score`.

   Note - Aggregate metrics like `winrate` are only available in the summary results files, not per individual judgement.

   For `gen_qa`, the `inference_output.jsonl` file contains the following fields for each JSON object:

   - `prompt` - The final prompt submitted to the model.
   - `inference` - The raw inference output from the model.
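For a programmatic look at these files, here is a minimal sketch, assuming boto3 and pandas (with a parquet engine such as pyarrow) are installed; the bucket, key, and paths are placeholders that you would replace with your job's actual output locations.

```python
import glob
import json
import tarfile

import boto3
import pandas as pd

bucket = "amzn-s3-demo-bucket"                                # placeholder
key = "output/benchmark-name/job_name/output/output.tar.gz"   # placeholder

# Download and extract the evaluation archive.
boto3.client("s3").download_file(bucket, key, "output.tar.gz")
with tarfile.open("output.tar.gz") as tar:
    tar.extractall("eval_output")

# Load the aggregate metrics JSON.
results_path = glob.glob("eval_output/*/eval_results/results_*.json")[0]
with open(results_path, encoding="utf-8") as f:
    print(json.dumps(json.load(f), indent=2))

# Inspect per-example inference details (not produced for strong_reject).
details = glob.glob("eval_output/*/eval_results/details/**/*.parquet", recursive=True)
if details:
    df = pd.read_parquet(details[0])
    print(df[["full_prompt", "gold", "predictions"]].head())
```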
Evaluation best practices and troubleshooting
Best practices
The following lists some best practices for the evaluation process.
- Keep your output paths organized by model and benchmark type.
- Maintain consistent naming conventions for easy tracking.
- Save extracted results in a secure location.
- Monitor TensorBoard sync status for successful data loading.
Troubleshooting
You can use the CloudWatch log group `/aws/sagemaker/TrainingJobs` for training job error logs.
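As one way to pull recent log events for a specific evaluation job, the following is a minimal sketch using boto3; the job name is a placeholder, and it assumes the usual SageMaker convention that log stream names start with the training job name.

```python
import boto3

logs = boto3.client("logs")
log_group = "/aws/sagemaker/TrainingJobs"
job_name = "your-evaluation-job-name"  # placeholder

# Training jobs write log streams whose names start with the job name.
streams = logs.describe_log_streams(
    logGroupName=log_group,
    logStreamNamePrefix=job_name,
)["logStreams"]

for stream in streams:
    events = logs.get_log_events(
        logGroupName=log_group,
        logStreamName=stream["logStreamName"],
        limit=50,
        startFromHead=False,  # most recent events first
    )["events"]
    for event in events:
        print(event["message"])
```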
CUDA Out of Memory Error
Issue:

When running model evaluation, you receive the following error:

`torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate X MiB. GPU 0 has a total capacity of Y GiB of which Z MiB is free.`

Cause:

This error occurs when you attempt to load a model that requires more GPU memory than what's available on your current instance type.

Solution:

Choose an instance type with more GPU memory. For example, if you use g5.12xlarge (96 GiB of GPU memory), upgrade to g5.48xlarge (192 GiB of GPU memory).
Prevention:

Before running model evaluation, do the following (see the sketch after this list for a rough memory estimate).

- Estimate your model's memory requirements.
- Ensure your selected instance type has sufficient GPU memory.
- Consider the memory overhead needed for model loading and inference.
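As a rough back-of-the-envelope estimate only (actual requirements depend on serving precision, KV cache size, batch size, and framework overhead, none of which are specified here), the following sketch assumes 16-bit weights plus an assumed fixed overhead factor.

```python
def rough_gpu_memory_gib(num_params_billions: float,
                         bytes_per_param: int = 2,
                         overhead_factor: float = 1.3) -> float:
    """Very rough estimate: weights at the given precision plus ~30% overhead
    for activations, KV cache, and framework buffers (assumed, not measured)."""
    weights_gib = num_params_billions * 1e9 * bytes_per_param / (1024 ** 3)
    return weights_gib * overhead_factor


# Example: a hypothetical 30B-parameter model in 16-bit precision needs on the order
# of ~73 GiB, which would not fit on a single 24 GiB GPU but could fit within a
# g5.12xlarge's 4 x 24 GiB = 96 GiB of total GPU memory if the weights are sharded.
print(f"{rough_gpu_memory_gib(30):.0f} GiB")
```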
Available subtasks
The following lists available subtasks for model evaluation across multiple domains including MMLU (Massive Multitask Language Understanding), BBH (Big Bench Hard), mathematics, and MMMU (Massive Multi-discipline Multimodal Understanding). These subtasks allow you to assess your model's performance on specific capabilities and knowledge areas.
MMLU
```python
MMLU_SUBTASKS = [
    "abstract_algebra", "anatomy", "astronomy", "business_ethics", "clinical_knowledge",
    "college_biology", "college_chemistry", "college_computer_science", "college_mathematics",
    "college_medicine", "college_physics", "computer_security", "conceptual_physics",
    "econometrics", "electrical_engineering", "elementary_mathematics", "formal_logic",
    "global_facts", "high_school_biology", "high_school_chemistry", "high_school_computer_science",
    "high_school_european_history", "high_school_geography", "high_school_government_and_politics",
    "high_school_macroeconomics", "high_school_mathematics", "high_school_microeconomics",
    "high_school_physics", "high_school_psychology", "high_school_statistics",
    "high_school_us_history", "high_school_world_history", "human_aging", "human_sexuality",
    "international_law", "jurisprudence", "logical_fallacies", "machine_learning", "management",
    "marketing", "medical_genetics", "miscellaneous", "moral_disputes", "moral_scenarios",
    "nutrition", "philosophy", "prehistory", "professional_accounting", "professional_law",
    "professional_medicine", "professional_psychology", "public_relations", "security_studies",
    "sociology", "us_foreign_policy", "virology", "world_religions",
]
```
BBH
```python
BBH_SUBTASKS = [
    "boolean_expressions", "causal_judgement", "date_understanding", "disambiguation_qa",
    "dyck_languages", "formal_fallacies", "geometric_shapes", "hyperbaton",
    "logical_deduction_five_objects", "logical_deduction_seven_objects",
    "logical_deduction_three_objects", "movie_recommendation", "multistep_arithmetic_two",
    "navigate", "object_counting", "penguins_in_a_table", "reasoning_about_colored_objects",
    "ruin_names", "salient_translation_error_detection", "snarks", "sports_understanding",
    "temporal_sequences", "tracking_shuffled_objects_five_objects",
    "tracking_shuffled_objects_seven_objects", "tracking_shuffled_objects_three_objects",
    "web_of_lies", "word_sorting",
]
```
Math
```python
MATH_SUBTASKS = [
    "algebra",
    "counting_and_probability",
    "geometry",
    "intermediate_algebra",
    "number_theory",
    "prealgebra",
    "precalculus",
]
```
MMMU
```python
MMMU_SUBTASKS = [
    "Accounting", "Agriculture", "Architecture_and_Engineering", "Art", "Art_Theory",
    "Basic_Medical_Science", "Biology", "Chemistry", "Clinical_Medicine", "Computer_Science",
    "Design", "Diagnostics_and_Laboratory_Medicine", "Economics", "Electronics",
    "Energy_and_Power", "Finance", "Geography", "History", "Literature", "Manage",
    "Marketing", "Materials", "Math", "Mechanical_Engineering", "Music", "Pharmacy",
    "Physics", "Psychology", "Public_Health", "Sociology",
]
```