MLPER-15: Monitor, detect, and handle model performance degradation
Model performance could degrade over time for reasons such as data quality, model quality, model bias, and model explainability. Continuously monitor the quality of the ML model in real time. Identify the right time and frequency to retrain and update the model. Configure alerts to notify and initiate actions if any drift in model performance is observed.
Implementation plan
- 
              Monitor model performance - Amazon SageMaker AI Model Monitor continually monitors the quality of Amazon SageMaker AI machine learning models in production. Establish a baseline during training before model is in production. Collect data while in production and compare changes in model inferences. Observations of drifts in the data statistics will indicate that the model may need to be retrained. The timing of drifts will establish a schedule for retraining. Use SageMaker AI Clarify to identify model bias. Configure alerting systems with Amazon CloudWatch to send notifications for unexpected bias or changes in data quality. 
- 
              Perform automatic scaling - Amazon SageMaker AI includes automatic scaling capabilities for your hosted model to dynamically adjust underlying compute supporting an endpoint based on demand. This capability ensures that your endpoint can dynamically support demand while reducing operational overhead. 
- 
              Monitor endpoint metrics - Amazon SageMaker AI also outputs endpoint metrics for monitoring the usage and health of the endpoint. Amazon SageMaker AI Model Monitor provides the capability to monitor your ML models in production and provides alerts when data quality issues appear. Create a mechanism to aggregate and analyze model prediction endpoint metrics using services, such as Amazon OpenSearch Service (OpenSearch Service). OpenSearch Service supports Kibana for dashboards and visualization. The traceability of hosting metrics back to versioned inputs allows for analysis of changes that could be impacting current operational performance.