
Evaluation Metrics Formats

You can evaluate the quality of your model across the following metric formats:

Model Evaluation Summary
MLflow logging
TensorBoard

Model Evaluation Summary

When you submit your evaluation job, you specify an Amazon S3 output location. SageMaker automatically uploads the evaluation summary .json file to that location. The benchmark summary is stored at the following S3 path:


s3://<your-provided-s3-location>/<training-job-name>/output/output/<evaluation-job-name>/eval_results/

Pass the Amazon S3 location
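The path layout above can be assembled programmatically. The following is a minimal sketch: the helper name `eval_results_uri` and the example bucket and job names are hypothetical, and only the path layout itself comes from this page.

```python
def eval_results_uri(output_location: str,
                     training_job_name: str,
                     evaluation_job_name: str) -> str:
    """Build the S3 prefix where the evaluation summary is uploaded.

    Follows the layout documented on this page:
    <output-location>/<training-job-name>/output/output/<evaluation-job-name>/eval_results/
    """
    base = output_location.rstrip("/")
    if not base.startswith("s3://"):
        raise ValueError("output_location must start with s3://")
    return (
        f"{base}/{training_job_name}/output/output/"
        f"{evaluation_job_name}/eval_results/"
    )


# Hypothetical values, for illustration only.
uri = eval_results_uri("s3://my-bucket/eval/", "my-training-job", "my-eval-job")
print(uri)
```

Stripping the trailing slash from the output location first avoids a doubled `//` in the resulting prefix.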

You can read the summary directly as a .json file from the Amazon S3 location, or view it visualized automatically in the UI:


{
  "results": {
    "custom|gen_qa_gen_qa|0": {
      "rouge1": 0.9152812653966208,
      "rouge1_stderr": 0.003536439199232507,
      "rouge2": 0.774569918517409,
      "rouge2_stderr": 0.006368825746765958,
      "rougeL": 0.9111255645823356,
      "rougeL_stderr": 0.003603841524881021,
      "em": 0.6562150055991042,
      "em_stderr": 0.007948251702846893,
      "qem": 0.7522396416573348,
      "qem_stderr": 0.007224355240883467,
      "f1": 0.8428757602152095,
      "f1_stderr": 0.005186300690881584,
      "f1_score_quasi": 0.9156170336744968,
      "f1_score_quasi_stderr": 0.003667700152375464,
      "bleu": 100.00000000000004,
      "bleu_stderr": 1.464411857851008
    },
    "all": {
      "rouge1": 0.9152812653966208,
      "rouge1_stderr": 0.003536439199232507,
      "rouge2": 0.774569918517409,
      "rouge2_stderr": 0.006368825746765958,
      "rougeL": 0.9111255645823356,
      "rougeL_stderr": 0.003603841524881021,
      "em": 0.6562150055991042,
      "em_stderr": 0.007948251702846893,
      "qem": 0.7522396416573348,
      "qem_stderr": 0.007224355240883467,
      "f1": 0.8428757602152095,
      "f1_stderr": 0.005186300690881584,
      "f1_score_quasi": 0.9156170336744968,
      "f1_score_quasi_stderr": 0.003667700152375464,
      "bleu": 100.00000000000004,
      "bleu_stderr": 1.464411857851008
    }
  }
}

Sample performance metrics for custom gen-qa benchmark visualized in SageMaker Studio.
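Each metric in the summary is accompanied by a `<metric>_stderr` entry. As a rough sketch of how the file might be consumed, the following pairs each metric with its standard error and derives an approximate 95% interval. The interval assumes a normal approximation (multiplier 1.96), which this page does not state. The names `summarize` and `load_summary_from_s3` are hypothetical, and the object key of the .json file must be supplied by the caller because its file name is not given here.

```python
import json

Z_95 = 1.96  # assumption: normal approximation for a ~95% interval


def load_summary_from_s3(bucket: str, key: str) -> dict:
    """Read the summary .json from S3 (key supplied by the caller)."""
    import boto3  # imported lazily so the rest runs without AWS access

    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
    return json.loads(body)


def summarize(results: dict, task: str = "all") -> dict:
    """Pair each metric with its *_stderr entry and an approximate interval."""
    metrics = results["results"][task]
    summary = {}
    for name, value in metrics.items():
        if name.endswith("_stderr"):
            continue  # standard errors are attached to their metric below
        stderr = metrics.get(f"{name}_stderr")
        if stderr is None:
            summary[name] = {"value": value}
            continue
        summary[name] = {
            "value": value,
            "stderr": stderr,
            "low": value - Z_95 * stderr,
            "high": value + Z_95 * stderr,
        }
    return summary


# Abbreviated copy of the sample summary shown above.
sample = json.loads("""
{
  "results": {
    "all": {
      "rouge1": 0.9152812653966208,
      "rouge1_stderr": 0.003536439199232507,
      "em": 0.6562150055991042,
      "em_stderr": 0.007948251702846893
    }
  }
}
""")

for name, row in summarize(sample).items():
    print(f"{name}: {row['value']:.4f} +/- {Z_95 * row['stderr']:.4f}")
```

Passing a task key such as `"custom|gen_qa_gen_qa|0"` instead of the default `"all"` reads the per-task block.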

MLflow logging

Provide your SageMaker MLflow resource ARN

SageMaker Studio uses the default MLflow app that is provisioned on each Studio domain the first time you use the model customization capability. When you submit an evaluation job from Studio, it uses the ARN associated with that default MLflow app.

You can also explicitly provide an MLflow resource ARN when you submit your evaluation job. Metrics then stream to the associated tracking server or app for real-time analysis.

SageMaker Python SDK


evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MMLU,
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    # Amazon S3 location where SageMaker uploads the evaluation results
    s3_output_path="s3://<bucket-name>/<prefix>/eval/",
    # MLflow resource that receives the streamed metrics
    mlflow_resource_arn="arn:aws:sagemaker:<region>:<account-id>:mlflow-tracking-server/<tracking-server-name>",
    evaluate_base_model=False
)

# Submit the evaluation job
execution = evaluator.evaluate()
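Before submitting a job, it can be useful to check that the MLflow resource ARN is well formed. The sketch below is a hypothetical helper (`parse_mlflow_arn` is not part of any SDK). It checks the general ARN shape used in the example above and assumes the resource type starts with `mlflow-`; this page only shows `mlflow-tracking-server`, so other resource types are an assumption.

```python
import re

# Shape taken from the example above:
# arn:<partition>:sagemaker:<region>:<account-id>:<resource-type>/<name>
# Assumption: MLflow resource types start with "mlflow-".
_MLFLOW_ARN_RE = re.compile(
    r"^arn:(?P<partition>aws[a-z-]*):sagemaker:(?P<region>[a-z0-9-]+):"
    r"(?P<account_id>\d{12}):(?P<resource_type>mlflow-[a-z-]+)/(?P<name>.+)$"
)


def parse_mlflow_arn(arn: str) -> dict:
    """Split an MLflow resource ARN into its parts, or raise ValueError."""
    match = _MLFLOW_ARN_RE.match(arn)
    if match is None:
        raise ValueError(f"not a SageMaker MLflow resource ARN: {arn!r}")
    return match.groupdict()


# Hypothetical ARN, for illustration only.
info = parse_mlflow_arn(
    "arn:aws:sagemaker:us-east-1:123456789012:mlflow-tracking-server/my-server"
)
print(info["resource_type"], info["name"])
```

A check like this only catches malformed strings, such as an unreplaced `<region>` placeholder. It does not confirm that the resource exists or that the job role can access it.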

Model-level and system-level metric visualization:

Sample model-level error and accuracy for MMLU benchmarking task.

Sample built-in metrics for LLM-as-a-judge (LLMAJ) benchmarking task.

Sample system-level metrics for MMLU benchmarking task.

TensorBoard

When you submit your evaluation job with an Amazon S3 output location, SageMaker automatically uploads a TensorBoard file to the following location:


s3://<your-provided-s3-location>/<training-job-name>/output/output/<evaluation-job-name>/tensorboard_results/eval/

Pass the Amazon S3 location as follows

Sample model-level metrics

SageMaker TensorBoard displaying results of a benchmarking job.
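To inspect the same files outside SageMaker TensorBoard, one option is to copy them from Amazon S3 to a local directory. The sketch below assumes `boto3` is installed and credentials are configured; the helper names (`split_s3_uri`, `local_path_for_key`, `download_prefix`) and the example URI are hypothetical.

```python
import os
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split s3://bucket/prefix into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def local_path_for_key(key: str, prefix: str, dest_dir: str) -> str:
    """Map an object key under `prefix` to a path under `dest_dir`."""
    relative = key[len(prefix):].lstrip("/")
    return os.path.join(dest_dir, *relative.split("/"))


def download_prefix(uri: str, dest_dir: str) -> list[str]:
    """Download every object under the S3 prefix into dest_dir."""
    import boto3  # imported lazily so the helpers above run without AWS access

    bucket, prefix = split_s3_uri(uri)
    s3 = boto3.client("s3")
    downloaded = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith("/"):
                continue  # skip folder placeholder objects
            target = local_path_for_key(obj["Key"], prefix, dest_dir)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            s3.download_file(bucket, obj["Key"], target)
            downloaded.append(target)
    return downloaded


# Hypothetical URI following the layout above, for illustration only.
bucket, prefix = split_s3_uri(
    "s3://my-bucket/eval/my-training-job/output/output/my-eval-job/tensorboard_results/eval/"
)
print(bucket, prefix)
```

After calling `download_prefix(uri, "tb_logs")`, running `tensorboard --logdir tb_logs` serves the downloaded files locally.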
