

# Data processing

 In ML workloads, the data (inputs and their corresponding desired outputs) serves several important functions, including: 
+  Defining the goal of the system: through the input and output pairs, the data defines the output representation and the relationship of each output to each input. 
+  Training the algorithm that maps inputs to outputs. 
+  Measuring the performance of the model against changes in data distribution, or data drift. 
+  Building a baseline dataset for detecting data drift. 

 As shown in Figure 6, data processing consists of data collection and data preparation. Data preparation includes data preprocessing and feature engineering, and mainly uses data wrangling for interactive data analysis and data visualization for exploratory data analysis (EDA). EDA focuses on understanding the data, performing sanity checks, and validating data quality.  

 It is important to note that the same sequence of data processing steps that is applied to the training data needs to also be applied to the inference requests. 
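One way to guarantee that training and inference share the same processing steps is to encapsulate them in a single pipeline object. The following is a minimal sketch using scikit-learn; the tiny dataset, the scaler, and the model choice are illustrative assumptions, not a prescribed recipe:

```python
# Sketch: wrap preprocessing and the model in one pipeline so the exact
# same steps run at training time and on inference requests.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic training data (illustrative only).
X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 100.0], [4.0, 400.0]])
y_train = np.array([0, 0, 1, 1])

pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X_train, y_train)  # scaling statistics are learned from training data

# At inference time, the stored pipeline reapplies the identical scaling
# before the model sees the request -- there is no separate code path.
prediction = pipeline.predict(np.array([[3.5, 150.0]]))
```

Persisting and deploying the whole pipeline, rather than the model alone, prevents training/serving skew caused by divergent preprocessing code.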

![Figure showing data processing components](http://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/images/data-processing-components.png)


**Topics**
+ [Data collection](data-collection.md)
+ [Data preparation](data-preparation.md)

# Data collection

 Important steps in the ML lifecycle are identifying the data needed, followed by evaluating the various means available for collecting that data to train your model. 

![Figure showing the main components of data collection.](http://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/images/data-collection-main-components.png)

+  **Label:** *Labeled data* is a group of samples that have been tagged with one or more labels. If labels are missing, some effort (manual or automated) is required to label the data. 
+  **Ingest and aggregate:** Data collection includes ingesting and aggregating data from multiple data sources. 

![Figure showing how data sources lead to data ingestion means and then into data technologies.](http://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/images/data-sources-ingestion-technologies.png)


 The sub-components of the *ingest and aggregate* component (shown in Figure 8) are as follows: 
+  **Data sources:** Data sources include time-series, events, sensors, IoT devices, and social networks, depending on the nature of the use case. You can enrich your data sources by using the geospatial capability of Amazon SageMaker AI to access a range of geospatial data sources from AWS (for example, Amazon Location Service), open-source datasets (for example, [Open Data on AWS](https://aws.amazon.com/opendata/)), or your own proprietary data, including data from third-party providers (such as Planet Labs). To learn more about the geospatial capability in Amazon SageMaker AI, visit [Geospatial ML with Amazon SageMaker AI](https://aws.amazon.com/sagemaker/geospatial/). 
+  **Data ingestion:** Data ingestion processes and technologies capture and store data on storage media. Data ingestion can occur in real-time using streaming technologies or historical mode using batch technologies. 
+  **Data technologies:** Data storage technologies range from transactional (SQL) databases to data lakes and data warehouses, which can be combined to form a lake house with governance across teams and partners. Extract, transform, and load (ETL) pipeline technology automates and orchestrates data movement and transformation across cloud services and resources. A lake house enables storing and analyzing both structured and unstructured data. 
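As a minimal illustration of the *ingest and aggregate* step, the sketch below combines records from two sources into a single table. Pandas stands in for a batch ETL step here; the source names and schema are assumptions for illustration only:

```python
# Sketch: batch-mode ingestion that aggregates records from multiple
# sources into one table before storage.
import io
import pandas as pd

# Two hypothetical sources sharing a schema: a sensor feed and an events log.
sensor_csv = io.StringIO("device_id,temp\nd1,21.5\nd2,19.0\n")
events_csv = io.StringIO("device_id,temp\nd3,22.1\n")

frames = [pd.read_csv(src) for src in (sensor_csv, events_csv)]
aggregated = pd.concat(frames, ignore_index=True)  # single table, ready to store
```

In a real pipeline, the in-memory strings would be replaced by durable sources (object storage, streams, databases), and the concatenation by a managed ETL service, but the ingest-then-aggregate shape is the same.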

# Data preparation

 ML models are only as good as the data that is used to train them. Verify that suitable training data is available and is optimized for learning and generalization. Data preparation includes data preprocessing and feature engineering. 

 A key aspect of understanding data is identifying patterns. These patterns are often not evident from data in tables. Exploratory data analysis (EDA) with visualization tools can help you quickly gain a deeper understanding of the data. Prepare data using data wrangling tools for interactive data analysis and model building. Employ no-code/low-code, automation, and visual capabilities to improve productivity and reduce the cost of interactive analysis, and use generative AI coding tools where they help. 
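A first EDA pass often amounts to summary statistics plus data-quality sanity checks. The sketch below shows a minimal version with pandas; the column names and values are assumptions for illustration:

```python
# Sketch: a minimal exploratory data analysis (EDA) pass -- distribution
# overview per column, plus a sanity check for missing values.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [34, 45, np.nan, 29, 51],
    "income": [52000, 64000, 48000, 39000, 120000],
})

summary = df.describe()    # count, mean, std, quartiles per numeric column
missing = df.isna().sum()  # sanity check: missing values per column
```

Inspecting `summary` quickly surfaces issues such as the income outlier above, and `missing` flags columns that need imputation before training.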

**Topics**
+ [Data preprocessing](data-preprocessing.md)
+ [Feature engineering](feature-engineering.md)

# Data preprocessing

 Data preprocessing puts data into the right shape and quality for training. There are many data preprocessing strategies including: data cleaning, balancing, replacing, imputing, partitioning, scaling, augmenting, and unbiasing. 

![Chart showing the data preprocessing strategies.](http://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/images/data-processing-main-components.png)


 The data preprocessing strategies listed in Figure 9 can be expanded as the following: 
+  **Clean (replace, impute, remove outliers and duplicates):** As part of data cleaning, remove outliers and duplicates, replace inaccurate or irrelevant data, and correct missing data using imputation techniques that minimize bias. 
+  **Partition:** To prevent ML models from overfitting and to evaluate a trained model accurately, randomly split data into train, validation, and test sets. Data leakage can happen when information from the hold-out test dataset leaks into the training data. One way to avoid data leakage is to remove duplicates before splitting the data. 
+  **Scale (normalize, standardize):** Normalization is a scaling technique applied during data preparation to change the values of numeric columns in the dataset to a common scale. It helps ensure that features with different ranges contribute with equal importance to the model. Normalized numeric features have values in the range [0, 1]. Standardized numeric features have a mean of 0 and a standard deviation of 1; standardization helps in handling outliers. 
+  **Unbias, balance (detection and mitigation):** Detecting and mitigating bias helps avoid inaccurate model results. Biases are imbalances in the accuracy of predictions across different groups, such as age or income bracket. Biases can come from the data or from the algorithm used to train your model. 
+  **Augment:** Data augmentation artificially increases the amount of data by synthesizing new samples from existing data. It can help regularize the model and reduce overfitting. 
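Three of the strategies above (cleaning duplicates, partitioning, and scaling) can be sketched together as follows. This is an illustrative scikit-learn example on synthetic data, not a prescribed recipe; note that deduplication happens before the split, and the scaler is fit only on the training split:

```python
# Sketch: de-duplicate, partition, then standardize -- with scaling
# statistics learned from the training split only, to avoid leakage.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [2.0], [3.0], [4.0], [5.0]])  # note the duplicate row
X = np.unique(X, axis=0)  # clean: remove duplicates before splitting

X_train, X_test = train_test_split(X, test_size=0.4, random_state=0)

scaler = StandardScaler().fit(X_train)   # fit on training data only
X_train_std = scaler.transform(X_train)  # mean 0, std 1 on the training split
X_test_std = scaler.transform(X_test)    # same transform, never refit on test data
```

Fitting the scaler on the full dataset would let test-set statistics influence training, which is a subtle form of data leakage.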

# Feature engineering

 Every unique attribute of the data is considered a *feature* (also known as an *attribute*). For example, when designing a solution for predicting customer churn, the data used typically includes features such as customer location, age, income level, and recent purchases. 

![Chart showing the main components of feature engineering.](http://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/images/feature-engineering-main-components.png)


 Feature engineering is a process to select and transform variables when creating a predictive model using machine learning or statistical modeling. Feature engineering typically includes feature creation, feature transformation, feature extraction, and feature selection as listed in Figure 10. With deep learning, feature engineering is automated as part of the algorithm learning. 
+  *Feature creation* refers to creating new features from existing data to assist with better predictions. Examples of feature creation include one-hot encoding, binning, splitting, and calculated features. 
+  *Feature transformation and imputation* include steps for replacing missing features or features that are not valid. Techniques include forming Cartesian products of features, non-linear transformations (such as binning numeric variables into categories), and creating domain-specific features. 
+  *Feature extraction* involves reducing the amount of data to be processed using dimensionality reduction techniques, such as Principal Component Analysis (PCA), Independent Component Analysis (ICA), and Linear Discriminant Analysis (LDA). This reduces the amount of memory and computing power required, while still maintaining the important characteristics of the original data. 
+  *Feature selection* is the process of selecting a subset of extracted features: the subset that is relevant and contributes to minimizing the error rate of a trained model. Feature importance scores and a correlation matrix can be factors in selecting the most relevant features for model training. 
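Three of the steps above can be sketched concretely: one-hot encoding (feature creation), PCA (feature extraction), and a univariate score for feature selection. The example below uses scikit-learn on a tiny synthetic dataset, purely for illustration:

```python
# Sketch: feature creation, extraction, and selection on toy data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import OneHotEncoder

# Feature creation: encode a categorical column as one-hot vectors.
regions = np.array([["us"], ["eu"], ["eu"], ["apac"]])
onehot = OneHotEncoder().fit_transform(regions).toarray()  # one column per category

# Feature extraction: project 3 correlated numeric features onto 2 components.
X = np.array([[1.0, 2.0, 3.0], [2.0, 4.1, 6.0], [3.0, 6.0, 9.2], [4.0, 8.1, 12.0]])
X_reduced = PCA(n_components=2).fit_transform(X)

# Feature selection: keep the single feature most predictive of the label,
# scored by a univariate ANOVA F-test.
y = np.array([0, 0, 1, 1])
X_selected = SelectKBest(f_classif, k=1).fit_transform(X, y)
```

In practice, feature importance scores from a trained model or a correlation matrix, as noted above, would guide which features to keep.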