Accessing and analyzing evaluation results - Amazon Nova


After your evaluation job completes successfully, you can access and analyze the results using the information in this section. Given the output_s3_path defined in the recipe (for example, s3://output_path/), the output structure is as follows:

job_name/
├── eval-result/
│   ├── results_[timestamp].json
│   ├── inference_output.jsonl (only present for gen_qa)
│   └── details/
│       └── model/
│           └── execution-date-time/
│               └── details_task_name_#_datetime.parquet
└── tensorboard-results/
    └── eval/
        └── events.out.tfevents.[timestamp]

Metrics results are stored in the specified S3 output location, s3://output_path/job_name/eval-result/results_[timestamp].json.

TensorBoard results are stored in the S3 path s3://output_path/job_name/tensorboard-results/eval/events.out.tfevents.[timestamp].

All inference outputs, except those for llm_judge and strong_reject, are stored in the S3 path s3://output_path/job_name/eval-result/details/model/execution-date-time/details_task_name_#_datetime.parquet.
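As a convenience, the layout above can be captured in a small helper. This is an illustrative sketch, not part of the evaluation tooling; the function name eval_result_paths is hypothetical, and the actual file names include timestamps that the job fills in.

```python
def eval_result_paths(output_s3_path, job_name):
    """Build the S3 prefixes for an evaluation job's outputs.

    Sketch based on the directory layout documented above; timestamped
    file names (results_[timestamp].json, the per-task parquet files)
    are created by the job and must be listed, not predicted.
    """
    base = output_s3_path.rstrip("/") + "/" + job_name
    return {
        "metrics": base + "/eval-result/",
        "details": base + "/eval-result/details/model/",
        "tensorboard": base + "/tensorboard-results/eval/",
    }
```

For example, eval_result_paths("s3://output_path/", "job_name")["metrics"] yields s3://output_path/job_name/eval-result/, which you can then list to find the timestamped results file.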

For gen_qa, the inference_output.jsonl file contains the following fields for each JSON object:

  • prompt - The final prompt submitted to the model

  • inference - The raw inference output from the model

  • gold - The target response from the input dataset

  • metadata - The metadata string from the input dataset if provided
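The fields above can be read with the standard json module, one JSON object per line. A minimal sketch, assuming the file has been downloaded locally; the helper name load_inference_outputs is hypothetical.

```python
import json

def load_inference_outputs(jsonl_text):
    """Parse gen_qa inference_output.jsonl content into records.

    Each non-empty line is one JSON object with prompt, inference,
    gold, and (optionally) metadata fields, as documented above.
    """
    records = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        obj = json.loads(line)
        records.append({
            "prompt": obj["prompt"],
            "inference": obj["inference"],
            "gold": obj["gold"],
            # metadata is only present if it was provided in the input dataset
            "metadata": obj.get("metadata"),
        })
    return records
```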

To visualize your evaluation metrics in TensorBoard, complete the following steps:

  1. Navigate to SageMaker AI TensorBoard.

  2. Select S3 folders.

  3. Add your S3 folder path, for example s3://output_path/job_name/tensorboard-results/eval.

  4. Wait for synchronization to complete.

Time series, scalars, and text visualizations are then available.

We recommend the following best practices:

  • Keep your output paths organized by model and benchmark type.

  • Maintain consistent naming conventions for easy tracking.

  • Save extracted results in a secure location.

  • Monitor TensorBoard sync status for successful data loading.

You can find SageMaker HyperPod job error logs in the CloudWatch log group /aws/sagemaker/Clusters/cluster-id.
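Those logs can also be queried programmatically. A minimal sketch using boto3's CloudWatch Logs client, assuming AWS credentials and a region are configured in your environment; the helper names and the "ERROR" filter pattern are illustrative, while filter_log_events is the standard CloudWatch Logs API call.

```python
def hyperpod_log_group(cluster_id):
    """Return the CloudWatch log group name for a HyperPod cluster,
    following the /aws/sagemaker/Clusters/cluster-id convention above."""
    return f"/aws/sagemaker/Clusters/{cluster_id}"

def fetch_error_events(cluster_id, pattern="ERROR", limit=50):
    """Fetch recent log events matching a filter pattern.

    Assumes boto3 is installed and credentials/region are configured.
    """
    import boto3  # imported here so the helper above works without boto3

    logs = boto3.client("logs")
    resp = logs.filter_log_events(
        logGroupName=hyperpod_log_group(cluster_id),
        filterPattern=pattern,
        limit=limit,
    )
    return [event["message"] for event in resp.get("events", [])]
```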

Log Probability Output Format

When top_logprobs is configured in your inference settings, the evaluation output includes token-level log probabilities in the parquet files. Each token position contains a dictionary of the top candidate tokens with their log probabilities in the following structure:

{
  "Ġint": {"logprob_value": -17.8125, "decoded_value": " int"},
  "Ġthe": {"logprob_value": -2.345, "decoded_value": " the"}
}

Each token entry contains:

  • logprob_value: The log probability value for the token

  • decoded_value: The human-readable decoded string representation of the token

The raw tokenizer token is used as the dictionary key to ensure uniqueness, while decoded_value provides a readable interpretation.
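Given that structure, a dictionary of candidates for one token position can be reduced to its most likely entry. A minimal sketch; the function name top_candidate is hypothetical.

```python
import math

def top_candidate(logprobs):
    """Return the decoded string and probability of the most likely
    candidate at one token position.

    logprobs maps raw tokenizer tokens (the unique dictionary keys) to
    {"logprob_value": ..., "decoded_value": ...} entries, as documented
    above. The probability is recovered as exp(logprob_value).
    """
    _token, entry = max(logprobs.items(),
                        key=lambda kv: kv[1]["logprob_value"])
    return entry["decoded_value"], math.exp(entry["logprob_value"])
```

Applied to the example above, " the" (logprob -2.345) wins over " int" (logprob -17.8125).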