MLPERF06-BP05 Establish an automated re-training framework - Machine Learning Lens


Monitor data and model predictions to identify errors due to data and concept drift. By implementing automated model re-training at scheduled intervals or when performance metrics reach defined thresholds, you can maintain model accuracy and effectiveness over time. This approach keeps your machine learning models relevant as data patterns evolve.

Desired outcome: You can detect when your deployed ML models experience data drift or performance degradation, and automatically run retraining processes. You establish mechanisms to monitor data statistics and ML inferences in production, allowing you to maintain high-quality predictions without manual intervention. Your models are consistently updated with new data, and model versions are properly tracked to maintain traceability and reproducibility.

Common anti-patterns:

  • Waiting for model performance to fail catastrophically before initiating retraining.

  • Manually monitoring model performance without automated alerts or triggers.

  • Retraining on a fixed schedule regardless of model performance or data patterns.

  • Lacking proper version control for retrained models.

  • Not maintaining consistent evaluation metrics across model versions.

Benefits of establishing this best practice:

  • Maintains model accuracy and relevance as data patterns evolve.

  • Reduces manual intervention required to keep models performing optimally.

  • Enables quick response to data drift and concept drift.

  • Creates a documented, repeatable process for model updates.

  • Provides consistent model quality through automated evaluation.

  • Maximizes return on investment for machine learning solutions.

Level of risk exposed if this best practice is not established: High

Implementation guidance

Establishing an automated retraining framework is crucial for maintaining ML model performance over time. As new data becomes available or as the underlying patterns in your data change, your models can drift and become less accurate. By implementing a systematic approach to model monitoring and retraining, you can verify that your ML solutions continue to deliver business value.

Avoid waiting for model performance to fail catastrophically before initiating retraining. Equally, avoid manual-only monitoring without automated alerts, retraining on a fixed schedule regardless of model performance or data patterns, skipping version control for retrained models, and using inconsistent evaluation metrics across model versions.

Start by defining clear performance metrics for your models that align with your business objectives. These metrics should be continuously monitored in production to detect performance degradation. Additionally, monitor your input data for statistical changes that may indicate drift from the training distribution. When changes are detected, your automated framework should run retraining workflows.
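One common way to detect drift from the training distribution is the Population Stability Index (PSI) over each input feature. The sketch below is a minimal, self-contained illustration; the function names are ours, and the 0.25 threshold is a common rule of thumb, not a fixed standard.

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample and a production sample."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def frac(values, i):
        # Fraction of values falling in bin i; the last bin is closed on the right.
        count = sum(1 for v in values if lo + i * width <= v < lo + (i + 1) * width)
        if i == bins - 1:
            count += sum(1 for v in values if v == hi)
        return max(count / len(values), 1e-6)  # clamp to avoid log(0)

    return sum(
        (frac(actual, i) - frac(expected, i)) * math.log(frac(actual, i) / frac(expected, i))
        for i in range(bins)
    )

def drift_detected(expected, actual, threshold=0.25):
    """Flag drift when PSI crosses the threshold, to start a retraining workflow."""
    return psi(expected, actual) >= threshold
```

In a production setting a check like this would run per feature against the training baseline, with the result published as a custom metric that the automated framework watches.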

The process should include data preparation, model training with both existing and new data, thorough evaluation, and controlled deployment. Each retrained model should be versioned appropriately to maintain traceability and allow for rollback if needed.

Implementation steps

  1. Define model performance metrics. Establish clear metrics that measure how well your model is performing relative to business objectives. These could include accuracy, precision, recall, F1 score, or custom domain-specific metrics. Verify that these metrics can be calculated automatically and regularly in your production environment.
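As a concrete illustration of step 1, the sketch below computes the standard classification metrics from logged predictions and ground truth so they can be emitted on a schedule (for example, as custom CloudWatch metrics). The function name and return shape are ours.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for a binary classifier."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = len(y_true) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(y_true),
            "precision": precision, "recall": recall, "f1": f1}
```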

  2. Configure monitoring systems. Use Amazon SageMaker AI Model Monitor to continuously monitor the quality of your ML models in production. Set up data quality monitoring to detect drift in input features, model quality monitoring to track prediction quality, bias drift monitoring to detect changes in fairness metrics, and feature attribution drift to identify changes in feature importance.
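For step 2, a monitoring schedule is typically created in two API calls: a job definition (for example, `create_data_quality_job_definition`) followed by `create_monitoring_schedule`. The sketch below only builds the shape of the second request; names are placeholders, and in practice you would pass the dict to `boto3.client("sagemaker").create_monitoring_schedule(**request)`.

```python
def monitoring_schedule_request(schedule_name, job_definition_name,
                                monitoring_type="DataQuality"):
    """Request body for the SageMaker CreateMonitoringSchedule API.

    Assumes the job definition was created first (e.g. with
    create_data_quality_job_definition). MonitoringType may also be
    ModelQuality, ModelBias, or ModelExplainability.
    """
    return {
        "MonitoringScheduleName": schedule_name,
        "MonitoringScheduleConfig": {
            "ScheduleConfig": {"ScheduleExpression": "cron(0 * ? * * *)"},  # hourly
            "MonitoringJobDefinitionName": job_definition_name,
            "MonitoringType": monitoring_type,
        },
    }
```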

  3. Establish retraining triggers. Define the conditions that will initiate model retraining. These can include scheduled intervals based on business requirements, performance degradation beyond defined thresholds, detection of data drift above acceptable limits, and availability of new training data. Set up Amazon CloudWatch alarms to send notifications or automatically start retraining workflows.
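A threshold trigger of this kind can be expressed as a CloudWatch alarm on a custom model-quality metric. The sketch below builds the parameters for `boto3.client("cloudwatch").put_metric_alarm(**params)`; the namespace, metric name, and SNS topic are placeholders we chose for illustration.

```python
def retraining_alarm_params(alarm_name, metric_name, threshold, sns_topic_arn):
    """Alarm that fires when a quality metric stays below threshold."""
    return {
        "AlarmName": alarm_name,
        "Namespace": "MLModel/Quality",             # custom namespace (assumption)
        "MetricName": metric_name,
        "Statistic": "Average",
        "Period": 3600,                             # evaluate hourly
        "EvaluationPeriods": 3,                     # require 3 consecutive breaches
        "Threshold": threshold,
        "ComparisonOperator": "LessThanThreshold",  # fire when quality degrades
        "AlarmActions": [sns_topic_arn],            # notify or start retraining
    }
```

Pointing `AlarmActions` at an SNS topic lets a subscriber (for example, a Lambda function) start the retraining workflow automatically.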

  4. Design retraining pipelines. Create automated pipelines using Amazon SageMaker AI Pipelines that handle the entire retraining workflow, including data preparation, feature engineering, model training, evaluation, and deployment. For large-scale foundation model training or distributed workloads, use Amazon SageMaker AI HyperPod, which provides managed, resilient high-performance clusters with automatic health checks and PyTorch auto-resume capabilities for long-running training jobs. In your pipeline, include steps for validation against holdout data before deployment.
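Conceptually, the pipeline in step 4 chains preparation, training, evaluation, and a quality gate before deployment. The plain-Python skeleton below is only a sketch of that control flow; in SageMaker AI Pipelines each stage would map to a step such as ProcessingStep, TrainingStep, ConditionStep, or RegisterModel, and all function names here are placeholders.

```python
def prepare(raw_data):
    """Placeholder for data preparation / feature engineering."""
    return raw_data

def run_retraining_pipeline(raw_data, train_fn, evaluate_fn, deploy_fn,
                            quality_gate=0.8):
    """Train, evaluate on holdout data, and deploy only if the gate passes."""
    features = prepare(raw_data)
    model = train_fn(features)
    score = evaluate_fn(model, features)   # holdout evaluation before deployment
    if score >= quality_gate:              # condition step: promote good models only
        deploy_fn(model)
        return {"deployed": True, "score": score}
    return {"deployed": False, "score": score}
```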

  5. Implement model versioning. Use Amazon SageMaker AI Model Registry to track and manage different versions of your models. As a result, you can recreate a model version if needed and provide traceability for your deployed models. Associate metadata with each version to document training data, hyperparameters, and performance metrics.
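For step 5, each retrained model is registered as a new version in a model package group. The sketch below builds the shape of a `boto3.client("sagemaker").create_model_package(**request)` call; the image URI, model data path, and metrics location are placeholders.

```python
def model_package_request(group_name, image_uri, model_data_url, metrics_s3_uri):
    """Request body for the SageMaker CreateModelPackage API (Model Registry)."""
    return {
        "ModelPackageGroupName": group_name,
        "InferenceSpecification": {
            "Containers": [{"Image": image_uri, "ModelDataUrl": model_data_url}],
            "SupportedContentTypes": ["text/csv"],
            "SupportedResponseMIMETypes": ["text/csv"],
        },
        "ModelApprovalStatus": "PendingManualApproval",  # gate before deployment
        "ModelMetrics": {
            "ModelQuality": {
                "Statistics": {"ContentType": "application/json",
                               "S3Uri": metrics_s3_uri},
            },
        },
    }
```

Attaching `ModelMetrics` to each version is what keeps evaluation results comparable across retrained models.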

  6. Automate data processing for new training data. Set up automated data processing workflows that prepare new data for training. Configure Amazon S3 event notifications to run Lambda functions or AWS Step Functions workflows when new data becomes available. Use Amazon SageMaker AI Feature Store to manage features consistently across training and inference.
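The new-data trigger in step 6 is commonly a Lambda function subscribed to S3 `ObjectCreated` events. The handler below is a local-only sketch: the Step Functions call is commented out, the `.csv` filter is an assumption, and the state machine ARN would come from configuration.

```python
import json  # used by the commented-out start_execution call

def handler(event, context=None):
    """React to S3 ObjectCreated events and start a retraining workflow."""
    started = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        if key.endswith(".csv"):  # only react to new training files (assumption)
            payload = {"bucket": bucket, "key": key}
            # boto3.client("stepfunctions").start_execution(
            #     stateMachineArn=STATE_MACHINE_ARN, input=json.dumps(payload))
            started.append(payload)
    return {"started": started}
```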

  7. Set up orchestration. Use the AWS Step Functions Data Science SDK for SageMaker AI to orchestrate complex ML workflows. Define each step in the workflow and configure triggers to initiate the process. For detecting new training data, combine AWS CloudTrail with Amazon EventBridge (formerly CloudWatch Events) to automatically start Step Functions workflows.
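The orchestrated workflow itself can be written directly in Amazon States Language. The sketch below builds a minimal definition with a quality-gate Choice state; the resource ARNs and the 0.8 threshold are placeholders, and a real definition would typically use service integrations such as `arn:aws:states:::sagemaker:createTrainingJob.sync`.

```python
import json

def retraining_state_machine(train_arn, evaluate_arn, deploy_arn):
    """Minimal Amazon States Language definition for train -> evaluate -> gate -> deploy."""
    return {
        "StartAt": "Train",
        "States": {
            "Train": {"Type": "Task", "Resource": train_arn, "Next": "Evaluate"},
            "Evaluate": {"Type": "Task", "Resource": evaluate_arn, "Next": "QualityGate"},
            "QualityGate": {
                "Type": "Choice",
                "Choices": [{"Variable": "$.score",
                             "NumericGreaterThanEquals": 0.8,  # example threshold
                             "Next": "Deploy"}],
                "Default": "Stop",
            },
            "Deploy": {"Type": "Task", "Resource": deploy_arn, "End": True},
            "Stop": {"Type": "Fail", "Error": "QualityGateFailed",
                     "Cause": "Candidate model below threshold"},
        },
    }
```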

  8. Implement deployment safeguards. Use deployment techniques like blue-green deployment or canary releases to safely transition to new model versions. Monitor the performance of new models closely during initial deployment and configure automatic rollback if performance degrades.
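In SageMaker, the blue/green and rollback behavior in step 8 is expressed through the `DeploymentConfig` passed to `update_endpoint`. The sketch below builds that shape for a canary rollout; the alarm names, canary percentage, and bake time are placeholders.

```python
def canary_deployment_config(rollback_alarm_names, canary_percent=10,
                             bake_minutes=15):
    """DeploymentConfig for a canary blue/green update with automatic rollback."""
    return {
        "BlueGreenUpdatePolicy": {
            "TrafficRoutingConfiguration": {
                "Type": "CANARY",
                "CanarySize": {"Type": "CAPACITY_PERCENT", "Value": canary_percent},
                "WaitIntervalInSeconds": bake_minutes * 60,  # bake before full shift
            },
            "TerminationWaitInSeconds": 600,  # keep old fleet for quick rollback
        },
        "AutoRollbackConfiguration": {
            "Alarms": [{"AlarmName": n} for n in rollback_alarm_names],
        },
    }
```

If any listed alarm fires during the canary period, traffic shifts back to the old fleet automatically.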

  9. Create feedback loops. Establish mechanisms to collect ground truth data from production to continually evaluate and improve your models. This might involve user feedback, delayed outcomes, or manual labeling processes for a subset of predictions.
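The core of such a feedback loop is joining delayed ground-truth labels back to logged predictions by record id to produce an ongoing quality estimate. The sketch below is illustrative; the function name and data shapes are ours.

```python
def join_ground_truth(predictions, labels):
    """Join delayed labels to logged predictions and return observed accuracy.

    predictions: {record_id: predicted_label} captured at inference time
    labels:      {record_id: actual_label} arriving later (feedback, outcomes)
    Returns None when no labels match any logged prediction.
    """
    matched = [(predictions[rid], actual) for rid, actual in labels.items()
               if rid in predictions]
    if not matched:
        return None
    correct = sum(1 for pred, actual in matched if pred == actual)
    return correct / len(matched)
```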

  10. Document the retraining process. Create comprehensive documentation for your retraining framework, including triggers, pipelines, evaluation criteria, and deployment strategies. This documentation supports knowledge transfer and consistent application of the process.
