

# Well-Architected machine learning

The six phases of the ML lifecycle referenced in this lens are illustrated in sequence in Figure 2.

![Diagram of the phases of the machine learning lifecycle](http://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/images/ml-lifecycle-phases.png)


 The following sections describe Well-Architected machine learning for each of the lifecycle phases. 

**Topics**
+ [Business goal identification](business-goal-identification.md)
+ [ML problem framing](ml-problem-framing.md)
+ [ML lifecycle architecture diagram](architecture-diagram.md)
+ [Data processing](data-processing.md)
+ [Model development](model-development.md)
+ [Deployment](deployment.md)
+ [Monitoring](monitoring.md)

# Business goal identification

 Business goal identification is the most important phase of the ML lifecycle. An organization considering ML should have a clear idea of the problem to be solved, and the business value to be gained. You must be able to measure business value against specific business objectives and success criteria. While this holds true for technical solutions, this step is particularly challenging when considering ML solutions because ML is a constantly evolving technology.  

 After you determine your criteria for success, evaluate your organization's ability to move toward that target. The target should be achievable and provide a clear path to production. Involve relevant stakeholders from the beginning to align them to this target and new business processes that result from this initiative. 

 Start the review by determining if ML is the appropriate approach for delivering your business goal. Evaluate the options that you have available for achieving the goal. Determine how accurate the resulting outcomes would be, while considering the cost and scalability of each approach. 

 For an ML-based approach to be successful, verify that enough relevant, high-quality training data is available to the algorithm. Carefully evaluate the data to make sure that the correct data sources are available and accessible. 

 **Steps in this phase:** 

Follow these steps to establish your business goals.
+  Business considerations: 
  +  Understand business requirements. 
  +  Align affected stakeholders with this initiative. 
  +  Form a business question. 
  +  Identify critical, must-have features. 
  +  Consider new business processes that might come out of this implementation. 
  +  Consider how business value can be measured using business metrics that the ML model can help improve. 
+  Frame the ML problem: 
  +  Define the machine learning task based on the business question. 
  +  Review proven or published works in similar domains, if available. 
  +  Design small, focused POCs to validate those aspects of the approach where inadequate confidence exists. 
+  Determine the optimization objective: 
  +  Determine key business performance metrics for the ML use case, such as uplift in new business acquisition, fraud detection rate, anomaly detection rate, or increase in customer satisfaction (CSAT), according to business needs. 
+  Review data requirements: 
  +  Review the project's ML feasibility and data requirements. 
+  Cost and performance optimization: 
  +  Evaluate the cost of data acquisition, training, inference, and wrong predictions. 
  +  Evaluate whether bringing in external data sources might improve model performance. 
+  Production considerations: 
  +  Review how to handle ML-generated errors. 
  +  Establish pathways to production. 

# ML problem framing

 In this phase, the business problem is framed as a machine learning problem: what is observed and what should be predicted (known as a label or target variable). Determining what to predict and how performance must be optimized is a key step in ML. For example, consider a scenario where a manufacturing company wants to maximize profits. There are several possible approaches including forecasting sales demand for existing product lines to optimize output, forecasting the required input materials and components required to reduce capital locked up in stock, and predicting sales for new products to prioritize new product development. 

 It's necessary to work through framing the ML problems in line with the business challenge. 

**Steps in this phase:**
+  Define criteria for a successful outcome of the project. 
+  Establish an observable and quantifiable performance metric for the project, such as accuracy. 
+  Define the relationship between the technical metric (for example, accuracy) and the business outcome (for example, sales). 
+  Verify that business stakeholders understand and agree with the defined performance metrics. 
+  Formulate the ML question in terms of inputs, desired outputs, and the performance metric to be optimized. 
+  Evaluate whether ML is the right approach. Some business problems don't need ML, because simple business rules can do the job well. For other business problems, there might not be sufficient data to apply ML as a solution. 
+  Create a strategy to achieve the data sourcing and data annotation objective. 
+  Start with a model that is simple to interpret and makes debugging more manageable. 
+  Map the technical outcome to a business outcome. 
+  Iterate on the model by gathering more data, optimizing the parameters, or increasing the complexity as needed to achieve the business outcome. 

# ML lifecycle architecture diagram

 Figure 3 shows the ML lifecycle phases with the *data processing phase* (for example, Process Data) expanded into a *data collection sub-phase* (Collect Data) and a *data preparation sub-phase* (Pre-process Data and Engineer Features). These sub-phases are discussed in more detail in this section. 

![Lifecycle data with data processing sub-phases included.](http://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/images/ml-lifecycle-phases.png)


 Figure 4 illustrates the details of the ML lifecycle phases that occur following the problem framing phase and shows how the data-processing sub-phases interact with the subsequent phases, that is, the *model development phase*, the *model deployment phase*, and the *model monitoring phase*. 

The model development phase includes training, tuning, and evaluation. The model deployment phase includes a staging environment where the model is validated for security and robustness. Monitoring is key to the timely detection and mitigation of drift. Feedback loops across the ML lifecycle phases are key enablers for monitoring. Feature stores (both online and offline) provide consistent and reusable features across the model development and deployment phases. The model registry enables version control and lineage tracking for model and data components. This figure also emphasizes lineage tracking and its components, which are discussed in more detail in this section.

The cloud-agnostic architecture diagrams in this paper provide high-level best practices with the following assumptions:
+  All concepts presented here are cloud and technology agnostic. 
+  Solid black lines are indicative of process flow. 
+  Dashed color lines are indicative of input and output flow. 
+  Architecture diagram components are color-coded for ease of communication across this document. 

![ML lifecycle with detailed phases and extended components](http://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/images/ml-lifecycle-phases-and-expanded-components.png)


 The components of the sub-phases of the ML lifecycle shown in Figure 4 are as follows: 
+  **Online/offline feature store:** Reduces duplication and the need to rerun feature engineering code across teams and projects. An online store with low-latency retrieval capabilities is ideal for real-time inference. On the other hand, an offline store is designed for maintaining a history of feature values and is suited for training and batch scoring. 
+  **Model registry:** A repository for storing ML model artifacts, including the trained model and related metadata (such as data, code, and model versions). It enables tracking the lineage of ML models because it can act as a version control system. 
+  **Performance feedback loop:** Informs the iterative data preparation phase based on the evaluation of the model during the model development phase. 
+  **Model drift feedback loop:** Informs the iterative data preparation phase based on the evaluation of the model during the production deployment phase. 
+  **Alarm manager:** Receives alerts from the model monitoring system. It then publishes notifications to the services that can deliver alerts to target applications. The model update re-training pipeline is one such target application. 
+  **Scheduler:** Initiates a model re-training at business-defined intervals. 
+  **Lineage tracker:** Enables reproducible machine learning experiments. It enables the re-creation of the ML environment at a specific point in time, reflecting the versions of resources and environments at that time. 
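As a concrete illustration, a model registry with basic lineage tracking can be sketched as a minimal in-memory store (the class and field names here are illustrative for this sketch, not any specific product's API):

```python
import hashlib
from datetime import datetime, timezone

class ModelRegistry:
    """Minimal in-memory model registry: stores versioned model artifacts with
    lineage metadata so a release can be traced back to its data and code."""

    def __init__(self):
        self._versions = []  # append-only list of version records

    def register(self, artifact: bytes, metadata: dict) -> int:
        """Store an artifact with its lineage metadata; return the new version number."""
        record = {
            "version": len(self._versions) + 1,
            "artifact_sha256": hashlib.sha256(artifact).hexdigest(),
            "metadata": metadata,  # e.g. data snapshot id, code commit, feature list
            "registered_at": datetime.now(timezone.utc).isoformat(),
        }
        self._versions.append(record)
        return record["version"]

    def get(self, version: int) -> dict:
        """Retrieve the lineage record for a specific point-in-time version."""
        return self._versions[version - 1]

registry = ModelRegistry()
v1 = registry.register(b"model-bytes-v1",
                       {"code_commit": "abc123", "features": ["age", "income"]})
v2 = registry.register(b"model-bytes-v2",
                       {"code_commit": "def456", "features": ["age", "income", "tenure"]})
```

Because records are append-only and each one hashes the artifact and captures the code commit and feature list, any past version can be traced back to the exact resources that produced it.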

 The ML lineage tracker collects references to traceable data, model, and infrastructure resource changes. It consists of the following components: 
+  System architecture (infrastructure as code to address environment drift) 
+  Data (metadata, values, and features) 
+  Model (algorithm, features, parameters, and hyperparameters) 
+  Code (implementation, modeling, and pipeline) 

The lineage tracker collects references to changes across successive iterations of the ML lifecycle phases. Alternative algorithms and feature lists are evaluated as experiments before final production deployment.

 Figure 5 includes machine learning components and their information that the lineage tracker collects across different releases. The collected information enables going back to a specific point-in-time release and re-creating it. 

![Lineage tracker diagram](http://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/images/lineage-tracker.png)


 Lineage tracker components include: 
+  **Infrastructure as code (IaC):** Modeling, provisioning, and managing cloud computing resources (compute, storage, network, and application services) can be automated using infrastructure as code. Cloud computing takes advantage of virtualization to enable the on-demand provisioning of resources. IaC avoids configuration drift through automation, while increasing the speed and agility of infrastructure deployments. IaC code changes are committed to a version-controlled repository. 
+  **Data:** Store data schemas and metadata in version control systems. Store the data itself on a storage medium, such as a data lake. The location of, or link to, the data can be kept in a configuration file stored in the code version control system. 
+  **Implementation code:** Changes to implementation code can be captured at specific points in time using a version control system. 
+  **Model feature list:** A *feature store*, discussed earlier in this section (Figure 4), maintains the details of the features as well as their previous versions for any point-in-time changes. 
+  **Model algorithm code:** Changes to model algorithm code at specific points in time can be stored in a version control system. 
+  **Model container image:** Versions of model container images for specific point-in-time changes can be stored in container repositories managed by a container registry. 

# Data processing

 In ML workloads, the data (inputs and corresponding desired output) serves important functions including: 
+  Defining the goal of the system: the output representation and the relationship of each output to each input, by means of the input and output pairs. 
+  Training the algorithm that associates inputs to outputs. 
+  Measuring the performance of the model against changes in data distribution or data drift. 
+  Building a baseline dataset to capture data drift. 

 As shown in Figure 6, data processing consists of data collection and data preparation. Data preparation includes data preprocessing and feature engineering. It mainly uses data wrangling for interactive data analysis and data visualization for exploratory data analysis (EDA). EDA focuses on understanding data, sanity checks, and validation of data quality.  

 It is important to note that the same sequence of data processing steps that is applied to the training data needs to also be applied to the inference requests. 

![Figure showing data processing components](http://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/images/data-processing-components.png)


**Topics**
+ [Data collection](data-collection.md)
+ [Data preparation](data-preparation.md)

# Data collection

 Important steps in the ML lifecycle are to identify the data needed, followed by the evaluation of the various means available for collecting that data to train your model. 

![Figure showing the main components of data collection.](http://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/images/data-collection-main-components.png)

+  **Label:** *Labeled data* is a group of samples that have been tagged with one or more labels. If labels are missing, then some labeling effort (either manual or automated) is required. 
+  **Ingest and aggregate:** Data collection includes ingesting and aggregating data from multiple data sources. 

![Figure showing how data sources lead to data ingestion means and then into data technologies.](http://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/images/data-sources-ingestion-technologies.png)


 The sub-components of the *ingest and aggregate* component (shown in Figure 8) are as follows: 
+  **Data sources:** Data sources include time-series, events, sensors, IoT devices, and social networks, depending on the nature of the use case. You can enrich your data sources by using the geospatial capability of Amazon SageMaker AI to access a range of geospatial data sources from AWS (for example, Amazon Location Service), open-source datasets (for example, [Open Data on AWS](https://aws.amazon.com/opendata/)), or your own proprietary data including from third-party providers (such as Planet Labs). To learn more about the geospatial capability in Amazon SageMaker AI, visit [Geospatial ML with Amazon SageMaker AI](https://aws.amazon.com/sagemaker/geospatial/). 
+  **Data ingestion:** Data ingestion processes and technologies capture and store data on storage media. Data ingestion can occur in real-time using streaming technologies or historical mode using batch technologies. 
+  **Data technologies:** Data storage technologies vary from transactional (SQL) databases to data lakes and data warehouses, which can be combined to form a lake house with marketplace governance across teams and partners. Extract, transform, and load (ETL) pipeline technology automates and orchestrates data movement and transformations across cloud services and resources. A lake house enables storing and analyzing structured and unstructured data. 

# Data preparation

 ML models are only as good as the data that is used to train them. Verify that suitable training data is available and is optimized for learning and generalization. Data preparation includes data preprocessing and feature engineering. 

A key aspect of understanding data is to identify patterns. These patterns are often not evident in tabular data. Exploratory data analysis (EDA) with visualization tools can assist in quickly gaining a deeper understanding of data. Prepare data using data wrangler tools for interactive data analysis and model building. Employ no-code/low-code, automation, and visual capabilities to improve productivity and reduce the cost of interactive analysis. Generative AI code tools can further accelerate these tasks.

**Topics**
+ [Data preprocessing](data-preprocessing.md)
+ [Feature engineering](feature-engineering.md)

# Data preprocessing

 Data preprocessing puts data into the right shape and quality for training. There are many data preprocessing strategies including: data cleaning, balancing, replacing, imputing, partitioning, scaling, augmenting, and unbiasing. 

![Chart showing the data preprocessing strategies.](http://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/images/data-processing-main-components.png)


 The data preprocessing strategies listed in Figure 9 can be expanded as the following: 
+  **Clean (replace, impute, remove outliers and duplicates):** Remove outliers and duplicates, replace inaccurate or irrelevant data, and correct missing data using imputation techniques that will minimize bias as part of data cleaning. 
+  **Partition:** To prevent ML models from overfitting, and to evaluate a trained model accurately, randomly split data into train, validate, and test sets. Data leakage happens when information from the hold-out test dataset leaks into the training data. One way to avoid data leakage is to remove duplicates before splitting the data. 
+  **Scale (normalize, standardize):** Normalization is a scaling technique applied during data preparation to bring the values of numeric columns in the dataset onto a common scale. This helps ensure that features with different ranges contribute comparably to the model. Normalized numeric features have values in the range [0,1]. Standardized numeric features have a mean of 0 and a standard deviation of 1. Standardization assists in handling outliers. 
+  **Unbias, balance (detection and mitigation):** Detecting and mitigating bias helps avoid inaccurate model results. Biases are imbalances in the accuracy of predictions across different groups, such as age or income bracket. Biases can come from the data or the algorithm used to train your model. 
+  **Augment:** Data augmentation artificially increases the amount of data by synthesizing new data from existing data. It can help regularize the model and reduce overfitting. 
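The partition and scale strategies above can be sketched in a few lines of Python (a minimal illustration; the helper names are for this sketch only):

```python
import random
import statistics

def partition(rows, train_frac=0.7, val_frac=0.15, seed=42):
    """Deduplicate, then randomly split rows into train/validate/test sets.
    Removing duplicates BEFORE splitting avoids leaking test rows into training."""
    unique = list(dict.fromkeys(rows))   # order-preserving de-duplication
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_train = int(len(unique) * train_frac)
    n_val = int(len(unique) * val_frac)
    return (unique[:n_train],
            unique[n_train:n_train + n_val],
            unique[n_train + n_val:])

def normalize(values):
    """Min-max scale numeric values into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]
```

For example, `partition(list(range(100)) + [0, 1, 2])` first drops the duplicate rows and then splits the remaining 100 rows roughly 70/15/15.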

# Feature engineering

 Every unique attribute of the data is considered a *feature* (also known as an *attribute*). For example, when designing a solution for predicting customer churn, the data used typically includes features such as customer location, age, income level, and recent purchases. 

![Chart showing the main components of feature engineering.](http://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/images/feature-engineering-main-components.png)


 Feature engineering is a process to select and transform variables when creating a predictive model using machine learning or statistical modeling. Feature engineering typically includes feature creation, feature transformation, feature extraction, and feature selection as listed in Figure 10. With deep learning, feature engineering is automated as part of the algorithm learning. 
+  *Feature creation* refers to the creation of new features from existing data to assist with better predictions. Examples of feature creation include one-hot-encoding, binning, splitting, and calculated features. 
+  *Feature transformation and imputation* include steps for replacing missing features or features that are not valid. Some techniques include forming Cartesian products of features, non-linear transformations (such as binning numeric variables into categories), and creating domain-specific features. 
+  *Feature extraction* involves reducing the amount of data to be processed using dimensionality reduction techniques. These techniques include Principal Components Analysis (PCA), Independent Component Analysis (ICA), and Linear Discriminant Analysis (LDA). This reduces the amount of memory and computing power required, while still accurately maintaining original data characteristics. 
+  *Feature selection* is the process of selecting a subset of extracted features. This is the subset that is relevant and contributes to minimizing the error rate of a trained model. Feature importance score and correlation matrix can be factors in selecting the most relevant features for model training. 
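A minimal sketch of feature selection using a correlation score, one of the factors named above (the helper functions and toy data are illustrative only):

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def select_features(features, target, k=2):
    """Keep the k features most correlated (in absolute value) with the target."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy churn-style dataset: "noise" carries no signal about the target.
features = {
    "age":    [25, 32, 47, 51, 62],
    "income": [30, 42, 50, 58, 70],
    "noise":  [5, 1, 4, 2, 3],
}
target = [0, 0, 1, 1, 1]
```

Here `select_features(features, target, k=2)` keeps `age` and `income` and drops `noise`, whose correlation with the target is near zero.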

# Model development

 Model development consists of model building, training, tuning, and evaluation. 

**Topics**
+ [Model training and tuning](model-training-and-tuning.md)
+ [Model evaluation](model-evaluation.md)

# Model training and tuning

 In this phase, you select a machine learning algorithm that is appropriate for your problem and then train the ML model. You provide the algorithm with the training data, set an objective metric for the ML model to optimize on, and set the hyperparameters to optimize the training process. 

![Chart displaying ML model training and tuning main components](http://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/images/training-tuning-components.png)


 Model training, tuning, and evaluation require prepared data and engineered features. The following are the main activities in this stage, as listed in Figure 11: 
+  **Features:** Features are selected as part of data processing, after data quality is assured. 
+  **Building code:** Model development includes building the algorithm and its supporting code. The code-building process should support version control and continuous build, test, and integration through a pipeline. 
+  **Algorithm selection:** Selecting the right algorithm involves running many experiments with parameter tuning across available options. Factors to consider when evaluating each option can include success metrics, model explainability, and compute requirements (training/prediction time and memory requirements). 
+  **Model training (data parallel, model parallel):** The process of training an ML model involves providing the algorithm with training data to learn from. Distributed training enables splitting large models and training datasets across compute instances to reduce training time significantly. Model parallelism and data parallelism are techniques to achieve distributed training. 
  +  **Model parallelism** is the process of splitting a model up between multiple instances or nodes. 
  +  **Data parallelism** is the process of splitting the training set into mini-batches evenly distributed across nodes, so that each node trains the model on only a fraction of the total dataset. 
+  **Debugging or profiling:** A machine learning training job can have problems including: system bottlenecks, overfitting, saturated activation functions, and vanishing gradients. These problems can compromise model performance. A debugger provides visibility into the ML training process through monitoring, recording, and analyzing data. It captures the state of a training job at periodic intervals. 
+  **Validation metrics:** Typically, a training algorithm computes several metrics such as loss and prediction accuracy. These metrics determine if the model is learning and generalizing well for making predictions on unseen data. Metrics reported by the algorithm depend on the business problem and the ML technique used. For example, a *confusion matrix* is one of the metrics used for classification models, and Root Mean Squared Error (RMSE) is one of the metrics for regression models. 
+  **Hyperparameter tuning:** Settings that can be tuned to control the behavior of the ML algorithm are referred to as *hyperparameters*. The number and type of hyperparameters in ML algorithms are specific to each model. Examples of commonly used hyperparameters include: learning rate, number of epochs, hidden layers, hidden units and activation functions. Hyperparameter tuning, or optimization, is the process of choosing the optimal hyperparameters for an algorithm. 
+  **Training code container:** Create container images with your training code and its entire dependency stack. This enables training and deploying models quickly and reliably at scale. 
+  **Model artifacts:** Model artifacts are the outputs that result from training a model. They typically consist of trained parameters, a model definition that describes how to compute inferences, and other metadata. 
+  **Visualization:** Enables exploring and understanding data during metrics validation, debugging, profiling, and hyperparameter tuning. 
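The validation metrics named above can be computed directly; a minimal sketch of RMSE for a regression model and a confusion matrix for a classification model (toy values only):

```python
import math
from collections import Counter

def rmse(actual, predicted):
    """Root Mean Squared Error: typical validation metric for regression models."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def confusion_matrix(actual, predicted):
    """Counts of (actual, predicted) label pairs for a classification model."""
    return Counter(zip(actual, predicted))

# Regression example: small errors on two of three predictions.
error = rmse([3.0, 5.0, 2.5], [2.5, 5.0, 3.0])

# Classification example: one "cat" was misclassified as "dog".
cm = confusion_matrix(["cat", "cat", "dog", "dog"],
                      ["cat", "dog", "dog", "dog"])
```

The confusion matrix here is a sparse mapping; `cm[("cat", "dog")]` counts actual cats predicted as dogs, and missing pairs default to zero.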

![ML lifecycle with pre-production pipelines](http://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/images/ml-lifecycle-preproduction-pipelines.png)


Figure 12 shows the pre-production pipelines. The *data prepare* pipeline automates data preparation tasks. The feature pipeline automates storing, fetching, and copying features into and from the online/offline store. The CI/CD/CT pipeline automates the build, train, and release to staging and production environments.

# Model evaluation

 After the model has been trained, evaluate it for its performance and success metrics. You might want to generate multiple models using different methods and evaluate the effectiveness of each model. You also might evaluate whether your model must be [more sensitive than specific, or more specific than sensitive](https://en.wikipedia.org/wiki/Sensitivity_and_specificity). For multiclass models, determine error rates for each class separately. 
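Sensitivity and specificity follow directly from confusion-matrix counts; a minimal sketch (the counts below are a hypothetical fraud-detection example, not real data):

```python
def sensitivity_specificity(tp, fp, fn, tn):
    """Sensitivity (true positive rate) and specificity (true negative rate)
    from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # fraction of actual positives correctly flagged
    specificity = tn / (tn + fp)  # fraction of actual negatives correctly cleared
    return sensitivity, specificity

# Hypothetical: the model flags 80 of 100 frauds and clears 950 of 1000 legitimate cases.
sens, spec = sensitivity_specificity(tp=80, fp=50, fn=20, tn=950)
```

Whether to tune for higher sensitivity or higher specificity depends on the relative cost of missed positives versus false alarms in the business problem.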

 You can evaluate your model using historical data (offline evaluation) or live data (online evaluation). In offline evaluation, the trained model is evaluated with a portion of the dataset that has been set aside as a *holdout set*. This holdout data is never used for model training or validation, but rather to evaluate errors in the final model. The holdout data annotations must have high assigned label correctness for the evaluation to make sense. Allocate additional resources to verify the correctness of the holdout data. 

 Based on the evaluation results, you might fine-tune the data, the algorithm, or both. When you fine-tune the data, you apply the concepts of data cleansing, preparation and feature engineering. 

![Chart showing the machine learning lifecycle with the performance evaluation pipeline added in purple.](http://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/images/ml-lifecycle-preformance-evaluation-pipeline-added.png)


 Figure 13 includes the model performance evaluation, the *data prepare* and CI/CD/CT pipelines that fine-tune data and algorithms, re-training, and evaluation of model results. 

# Deployment

After you have trained, tuned, and evaluated your model, you can deploy it into production and make predictions against the deployed model. Amazon SageMaker AI Studio can convert notebook code to production-ready jobs without the need to manage the underlying infrastructure. Be sure to use a governance process. Controlling deployments through automation, combined with manual or automated quality gates, helps ensure that changes are validated against dependent systems before they reach production.

![Deployment architecture diagram](http://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/images/deployment-architecture-diagram.png)


 Figure 14 illustrates the deployment phase of the ML lifecycle in production. An application sends request payloads to a production endpoint to make inference against the model. Model artifacts are fetched from the model registry, features are retrieved from the feature store, and the inference code container is obtained from the container repository.  

![Deployment main components](http://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/images/deployment-main-components.png)


 Figure 15 lists key components of production deployment including: 
+  **Blue/green, canary, A/B, shadow deployment/testing:** Deployment and testing strategies that reduce downtime and risks when releasing a new or updated version. 
  +  The *blue/green* deployment technique provides two identical production environments (initially, *blue* is the existing infrastructure and *green* is an identical infrastructure for testing). Once testing is done in the *green* environment, live application traffic is directed to it from the *blue* environment. Then the roles of the blue and green environments are switched. 
  +  With a *canary* deployment, a new release is deployed to a small group of users while other users continue to use the previous version. Once you're satisfied with the new release, you can gradually roll it out to the remaining users. 
  +  With an *A/B* testing strategy, a new model is deployed alongside the old one: a defined portion of traffic is directed to the new model and the remaining traffic to the old model. A/B testing is similar to canary testing, but uses larger user groups and a longer time scale, typically days or even weeks. 
  +  With a *shadow* deployment strategy, the new version runs alongside the old version. The input data is run through both versions: the older version continues to serve the production application, while the new one is used for testing and analysis. 
+  **Inference pipeline:** Figure 16 shows the inference pipeline that automates capturing of the prepared data, performing predictions and post-processing for real-time or batch inferences. 
+  **Scheduler pipeline:** A deployed model should remain representative of the latest data patterns. When configured as shown in Figure 16, re-training at intervals can minimize the risk of data and concept drift. A scheduler can initiate re-training at business-defined intervals. The data preparation, CI/CD/CT, and feature pipelines will also be active during this process. 

![ML lifecycle with scheduler retrain and batch or real-time inference pipelines](http://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/images/ml-lifecycle-scheduler-inference-pipelines.png)
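The canary strategy described above can be sketched as a deterministic traffic router (a minimal illustration, not a managed-service API; the bucketing scheme is one common choice):

```python
import zlib

def route(request_id: str, canary_fraction: float) -> str:
    """Deterministically send a fixed fraction of traffic to the canary ("new")
    model. Hashing the request/user id pins each caller to one version, so a
    user does not flip between models across requests."""
    bucket = zlib.crc32(request_id.encode()) % 100
    return "new" if bucket < canary_fraction * 100 else "old"

# Gradual rollout: raise canary_fraction from 0.1 toward 1.0 as confidence grows.
share = sum(route(f"user-{i}", 0.1) == "new" for i in range(10_000)) / 10_000
```

With a 10% canary fraction, roughly one in ten users is served by the new model; raising the fraction step by step completes the rollout, and setting it back to zero is an immediate rollback.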


# Monitoring

 The model monitoring system must capture data, compare that data to the training set, define rules to detect issues, and send alerts. This process repeats on a defined schedule, when initiated by an event, or when initiated by human intervention. The issues detected in the monitoring phase include: data quality, model quality, bias drift, and feature attribution drift.  

![Chart displaying model monitoring main components](http://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/images/key-components-monitor-phase.png)


 Figure 17 lists key components of monitoring, including: 
+  **Model explainability:** The monitoring system uses *explainability* to evaluate the soundness of the model and whether its predictions can be trusted. 
+  **Detect drift:** The monitoring system detects data and concept drift, initiates an alert, and sends it to the alarm manager system. Data drift is a significant change in the data distribution compared to the data used for training. Concept drift occurs when the properties of the target variables change. Data drift can result in model performance degradation. 
+  **Model update pipeline:** If the alarm manager identifies violations, it launches the model update pipeline to re-train the model. This can be seen in Figure 18. The *Data prepare*, *CI/CD/CT*, and *Feature* pipelines will also be active during this process. 

![ML lifecycle with model update, retrain, and batch or real-time inference pipelines](http://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/images/ml-lifecycle-model-update-inference-pipelines.png)
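A minimal sketch of one possible drift check, flagging when a live feature's mean shifts too far from the training-time baseline (production monitors typically use richer statistics, such as KS tests or a population stability index; the threshold here is illustrative):

```python
import statistics

def detect_drift(baseline, live, threshold=0.5):
    """Flag data drift when the live feature mean moves more than `threshold`
    baseline standard deviations away from the baseline mean."""
    base_mean = statistics.fmean(baseline)
    base_std = statistics.pstdev(baseline)
    shift = abs(statistics.fmean(live) - base_mean) / base_std
    return shift > threshold

baseline = [10.0, 11.0, 9.0, 10.5, 9.5]    # feature values seen at training time
stable   = [10.2, 9.8, 10.1, 10.0, 9.9]    # live traffic, similar distribution
drifted  = [14.0, 15.0, 13.5, 14.5, 15.5]  # live traffic, distribution has moved
```

When `detect_drift` returns `True`, the monitoring system would publish an alert to the alarm manager, which in turn can launch the model update pipeline.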
