GENPERF01-BP02 Collect performance metrics from generative AI workloads

Foundation model performance on specific tasks is measured and quantified in different ways depending on the desired outcome. When selecting foundation models for generative AI workloads, identify the relevant performance metrics and evaluate model performance against them over time. This applies not just to model inference, but to model training and customization workloads as well.

Desired outcome: When implemented, your organization improves its ability to evaluate model performance against the identified performance metric.

Benefits of establishing this best practice: Experiment more often - Testing model performance using quantifiable evaluation metrics assists in the selection of foundation models for generative AI workloads.

Level of risk exposed if this best practice is not established: Medium

Implementation guidance

Traditional performance monitoring and optimization focus on the efficiency of compute, network, memory and storage resources. Generative AI workloads add new dimensions to the performance considerations, particularly concerning response quality. Inaccurate model responses or models responding in an overly casual, dismissive, or even toxic manner may be considered under-performing. Consult your organization's AI policy for more details on what constitutes an under-performing language model with respect to your use case.

Different use cases may have several relevant metrics for use in evaluating model performance. Performance metrics for inference workloads may capture model response latency or throughput. Performance metrics for model customization or training workloads are likely focused on model training times. Ultimately, a model should respond accurately, robustly, and reasonably predictably. Capturing model performance against these metrics and evaluating it against your organization's AI performance requirements helps to provide consistently high-performing generative AI workloads.
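As an illustration of the inference metrics mentioned above, the following minimal sketch summarizes latency percentiles and output-token throughput from a set of recorded model calls. The Invocation record shape, the nearest-rank percentile method, and the sample values are assumptions made for this example, and the throughput figure assumes the calls ran sequentially.

```python
import math
from dataclasses import dataclass


@dataclass
class Invocation:
    """One recorded model call (hypothetical record shape)."""
    latency_ms: float   # wall-clock time for the call
    output_tokens: int  # tokens generated by the model


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(invocations):
    """Return latency percentiles and aggregate output-token throughput."""
    latencies = [i.latency_ms for i in invocations]
    total_seconds = sum(latencies) / 1000  # assumes sequential calls
    total_tokens = sum(i.output_tokens for i in invocations)
    return {
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "tokens_per_second": total_tokens / total_seconds if total_seconds else 0.0,
    }


# Illustrative sample data, not real measurements.
samples = [
    Invocation(200, 50),
    Invocation(400, 100),
    Invocation(300, 75),
    Invocation(1100, 175),
]
summary = summarize(samples)
print(summary)
```

Reporting a high percentile alongside the median helps surface slow outliers that an average would hide.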

Generative AI tasks should report metrics, telemetry, and logs to a centralized logging and monitoring solution such as Amazon CloudWatch. By configuring Amazon CloudWatch or a similar solution, you can collect performance metrics from model endpoints hosted in Amazon SageMaker AI or from generative AI services like Amazon Q or Amazon Bedrock. Use these metrics to identify which models perform well against a given metric and which need additional performance improvements.
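For application-side measurements, one possible approach is to publish custom metrics through the CloudWatch PutMetricData API. The sketch below builds the MetricData payload and sends it through a client that is passed in, which in a real workload would be boto3.client("cloudwatch"). The namespace, metric names, and dimension name are hypothetical choices for this example, and a recording stand-in client is used so that the sketch runs without AWS credentials.

```python
from datetime import datetime, timezone


def build_metric_data(model_id, latency_ms, output_tokens, timestamp=None):
    """Build a MetricData list in the shape PutMetricData accepts."""
    timestamp = timestamp or datetime.now(timezone.utc)
    # Hypothetical dimension so metrics can be compared per model.
    dimensions = [{"Name": "ModelId", "Value": model_id}]
    return [
        {
            "MetricName": "InvocationLatency",  # hypothetical custom metric name
            "Dimensions": dimensions,
            "Timestamp": timestamp,
            "Value": float(latency_ms),
            "Unit": "Milliseconds",
        },
        {
            "MetricName": "OutputTokens",  # hypothetical custom metric name
            "Dimensions": dimensions,
            "Timestamp": timestamp,
            "Value": float(output_tokens),
            "Unit": "Count",
        },
    ]


def publish(cloudwatch_client, namespace, metric_data):
    """Send data points; cloudwatch_client is expected to be a boto3 CloudWatch client."""
    return cloudwatch_client.put_metric_data(Namespace=namespace, MetricData=metric_data)


class RecordingClient:
    """Stand-in for the real client that only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def put_metric_data(self, **kwargs):
        self.calls.append(kwargs)
        return {}


client = RecordingClient()
publish(client, "GenAI/Workload", build_metric_data("example-model", 412.5, 128))
print(client.calls[0]["Namespace"], [m["MetricName"] for m in client.calls[0]["MetricData"]])
```

Keeping the payload construction separate from the API call makes the metric definitions easy to unit test.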

Performance metrics may also be collected by applications and services that interact with models. Collect metrics and application traces pertaining to the flow of information rather than a specific piece of the workflow. Work to determine how your entire application performs when interacting with generative AI solutions. This can help you triage performance concerns faster and improve resolution times.
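To make the idea of measuring the whole flow concrete, the following sketch times each stage of a single request under one trace ID, so that the end-to-end duration and the slowest stage can be read together. FlowTrace is a hypothetical hand-rolled helper written for this example and is not the API of any tracing framework. The stage names and sleep calls stand in for real application work.

```python
import time
import uuid
from contextlib import contextmanager


class FlowTrace:
    """Collects timed spans for one end-to-end request (hypothetical helper)."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex  # shared ID that ties the spans together
        self.spans = []                   # list of (name, duration_ms)

    @contextmanager
    def span(self, name):
        """Time the enclosed block and record it, even if the block raises."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.spans.append((name, (time.perf_counter() - start) * 1000))

    def total_ms(self):
        """Sum of all recorded span durations in milliseconds."""
        return sum(duration for _, duration in self.spans)

    def slowest(self):
        """Name of the span with the longest duration."""
        return max(self.spans, key=lambda s: s[1])[0]


trace = FlowTrace()
with trace.span("retrieve_context"):
    time.sleep(0.01)   # placeholder for a retrieval step
with trace.span("model_inference"):
    time.sleep(0.05)   # placeholder for the model call
with trace.span("postprocess"):
    time.sleep(0.001)  # placeholder for response handling

print(trace.trace_id, trace.slowest(), round(trace.total_ms(), 1))
```

In practice, a tracing framework would typically propagate the trace ID across services rather than keeping the spans in memory.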

Use internal golden datasets or external benchmarking datasets to evaluate model performance on specific tasks. Consult model cards to identify model strengths and weaknesses, evaluating on selected datasets where appropriate. Benchmark custom models on a suite of tests using internal and external data to develop a well-rounded understanding of your model's performance.
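A golden dataset evaluation can be sketched as a loop that sends each prompt to the model and scores the answer against the expected response. The example below uses exact match and a token-overlap F1 score, which are only two of many possible quality metrics. The dataset records, the stub_model function, and the whitespace tokenization are assumptions made for this example. A real evaluation would call your model and would likely use an evaluation framework.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-overlap F1 between two lowercased, whitespace-tokenized strings."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def evaluate(model_fn, golden_dataset):
    """Score model_fn on every golden item and return averaged metrics."""
    if not golden_dataset:
        raise ValueError("golden dataset is empty")
    f1_scores, exact_matches = [], 0
    for item in golden_dataset:
        answer = model_fn(item["prompt"])
        f1_scores.append(token_f1(answer, item["expected"]))
        exact_matches += answer.strip().lower() == item["expected"].strip().lower()
    n = len(golden_dataset)
    return {"exact_match": exact_matches / n, "mean_f1": sum(f1_scores) / n}


# Hypothetical golden dataset of prompts and expected responses.
golden = [
    {"prompt": "Capital of France?", "expected": "Paris"},
    {"prompt": "Largest planet?", "expected": "Jupiter"},
]


def stub_model(prompt):
    """Placeholder for a real model call; returns canned answers."""
    canned = {"Capital of France?": "Paris", "Largest planet?": "the planet Jupiter"}
    return canned[prompt]


report = evaluate(stub_model, golden)
print(report)
```

Tracking both scores is useful because a correct but verbose answer fails exact match while still earning partial credit on token overlap.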

Note that a model may not excel at all tasks. Be judicious when selecting a performance metric for your model, and consult your organization's AI policy to identify which performance metric to prioritize for your use case.

Implementation steps

1. Identify the performance metrics to prioritize for your generative AI use case.

2. Develop a mechanism to capture the performance metrics.

   - Implement a trace framework like OpenLLMetry to capture additional metrics.
   - Capture metrics using Amazon CloudWatch or a similar centralized logging and monitoring solution.
   - Use a benchmarking dataset within an evaluation framework such as fmeval.

3. Establish reasonable performance thresholds and alert accordingly.

   - Use Amazon CloudWatch alarms for production alerting on latency, throughput, or other traditional performance metrics.
   - Incorporate regular benchmarking using internal golden datasets, and update the datasets as your customers' usage changes.
   - Consult model cards for new models, and perform custom benchmarking of new models where appropriate.

4. Identify, capture, and log remediation actions in your organization's AI policy.

   - For example, increased latency on self-hosted models may call for horizontal scaling to remediate the issue. Your organization's AI policy should define acceptable latency thresholds.
   - For example, a model response that is identified as a hallucination may call for updates to a system prompt. Such an update should be tested against internal golden datasets to verify that the system prompt changes do not adversely affect related prompt workflows.

5. Implement a centralized experiment tracking solution such as Amazon SageMaker AI with MLflow.
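The alerting step above can be sketched as a CloudWatch alarm on a latency percentile, created through the PutMetricAlarm API. The example below builds the alarm parameters and passes them to a client supplied by the caller, which in a real workload would be boto3.client("cloudwatch"). The namespace, metric name, dimension, alarm name, period, and threshold are hypothetical values chosen for this example. Your organization's AI policy should supply the actual threshold.

```python
def latency_alarm_params(model_id, threshold_ms, namespace="GenAI/Workload"):
    """Build PutMetricAlarm parameters for a p95 latency alarm."""
    return {
        "AlarmName": f"{model_id}-p95-latency-high",
        "Namespace": namespace,             # hypothetical custom namespace
        "MetricName": "InvocationLatency",  # hypothetical custom metric name
        "Dimensions": [{"Name": "ModelId", "Value": model_id}],
        "ExtendedStatistic": "p95",         # percentile statistic
        "Period": 300,                      # seconds per evaluated datapoint
        "EvaluationPeriods": 3,             # consecutive periods that must breach
        "Threshold": float(threshold_ms),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # quiet periods do not trigger the alarm
    }


def create_alarm(cloudwatch_client, params):
    """Create or update the alarm; cloudwatch_client is expected to be a boto3 CloudWatch client."""
    return cloudwatch_client.put_metric_alarm(**params)


class RecordingClient:
    """Stand-in for the real client that only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def put_metric_alarm(self, **kwargs):
        self.calls.append(kwargs)
        return {}


client = RecordingClient()
create_alarm(client, latency_alarm_params("example-model", threshold_ms=2000))
print(client.calls[0]["AlarmName"], client.calls[0]["Threshold"])
```

Requiring several consecutive breaching periods reduces noise from a single slow request, at the cost of a slower alert.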

Resources

Related best practices:

Related documents:

Related examples:

Related tools:
