

# Types of Algorithms


Machine learning can help you accomplish empirical tasks that require some form of inductive inference. These tasks involve induction because they use data to train algorithms to make generalizable inferences. This means that the algorithms can make statistically reliable predictions or decisions, or complete other tasks, when applied to new data that was not used to train them.

To help you select the best algorithm for your task, we classify these tasks at various levels of abstraction. At the highest level of abstraction, machine learning attempts to find patterns or relationships between features or less structured items, such as text in a data set. Pattern recognition techniques can be classified into distinct machine learning paradigms, each of which addresses specific problem types. There are currently three basic paradigms for machine learning used to address various problem types: 
+ [Supervised learning](#algorithms-choose-supervised-learning)
+ [Unsupervised learning](#algorithms-choose-unsupervised-learning)
+ [Reinforcement learning](#algorithms-choose-reinforcement-learning)

The types of problems that each learning paradigm can address are identified by considering the inferences (or predictions, decisions, or other tasks) you want to make from the type of data that you have or could collect. Machine learning paradigms use algorithmic methods to address their various problem types. The algorithms provide recipes for solving these problems. 

However, many algorithms, such as neural networks, can be deployed with different learning paradigms and on different types of problems. Multiple algorithms can also address a specific problem type. Some algorithms are more generally applicable and others are quite specific for certain kinds of objectives and data. So the mapping between machine learning algorithms and problem types is many-to-many. Also, there are various implementation options available for algorithms. 

The following sections provide guidance concerning implementation options, machine learning paradigms, and algorithms appropriate for different problem types.

**Topics**
+ [Choose an algorithm implementation](#algorithms-choose-implementation)
+ [Problem types for the basic machine learning paradigms](#basic-machine-learning-paradigms)
+ [Built-in algorithms and pretrained models in Amazon SageMaker](algos.md)
+ [Use Reinforcement Learning with Amazon SageMaker AI](reinforcement-learning.md)

## Choose an algorithm implementation


After choosing an algorithm, you must decide which implementation of it you want to use. Amazon SageMaker AI supports three implementation options that require increasing levels of effort. 
+ **Pre-trained models** require the least effort and are models ready to deploy or to fine-tune and deploy using SageMaker JumpStart.
+ **Built-in algorithms** require more effort, and they scale well if the data set is large and significant resources are needed to train and deploy the model.
+ If there is no built-in solution that works, try to develop one that uses **pre-made images for machine learning and deep learning frameworks**, such as scikit-learn, TensorFlow, PyTorch, MXNet, or Chainer.
+ If you need to run custom packages or use code that isn't part of a supported framework or available via PyPi, then you need to build **your own custom Docker image** that is configured to install the necessary packages or software. The custom image must also be pushed to an online repository, such as the Amazon Elastic Container Registry.

**Topics**
+ [Use a built-in algorithm](#built-in-algorithms-benefits)
+ [Use script mode in a supported framework](#supported-frameworks-benefits)
+ [Use a custom Docker image](#custom-image-use-case)

Table: Algorithm implementation guidance


| Implementation | Requires code | Pre-coded algorithms | Support for third party packages | Support for custom code | Level of effort | 
| --- | --- | --- | --- | --- | --- | 
| Built-in | No | Yes | No | No | Low | 
| Scikit-learn | Yes | Yes | PyPi only | Yes | Medium | 
| Spark ML | Yes | Yes | PyPi only | Yes | Medium | 
| XGBoost (open source) | Yes | Yes | PyPi only | Yes | Medium | 
| TensorFlow | Yes | No | PyPi only | Yes | Medium-high | 
| PyTorch | Yes | No | PyPi only | Yes | Medium-high | 
| MXNet | Yes | No | PyPi only | Yes | Medium-high | 
| Chainer | Yes | No | PyPi only | Yes | Medium-high | 
| Custom image | Yes | No | Yes, from any source | Yes | High | 

### Use a built-in algorithm


When choosing an algorithm for your type of problem and data, the easiest option is to use one of Amazon SageMaker AI's built-in algorithms. These built-in algorithms come with two major benefits.
+ The built-in algorithms require no coding to start running experiments. The only inputs you need to provide are the data, hyperparameters, and compute resources. This allows you to run experiments more quickly, with less overhead for tracking results and code changes.
+ The built-in algorithms come with parallelization across multiple compute instances and GPU support right out of the box for all applicable algorithms (some algorithms may not be included due to inherent limitations). If you have a lot of data with which to train your model, most built-in algorithms can easily scale to meet the demand. Even if you already have a pre-trained model, it may still be easier to use its counterpart in SageMaker AI and input the hyperparameters you already know than to port the model over, using script mode on a supported framework.

For more information on the built-in algorithms provided by SageMaker AI, see [Built-in algorithms and pretrained models in Amazon SageMaker](algos.md).

For important information about Docker registry paths, data formats, recommended EC2 instance types, and CloudWatch logs common to all of the built-in algorithms provided by SageMaker AI, see [Parameters for Built-in Algorithms](common-info-all-im-models.md).
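As a rough sketch of what the low-effort, built-in path can look like with the SageMaker Python SDK: the role ARN, bucket paths, and hyperparameter values below are placeholders for illustration, not values from this guide, and running this requires an AWS account with the appropriate permissions.

```python
# Sketch: training with the built-in XGBoost algorithm. No algorithm code
# is written; you supply only data, hyperparameters, and compute resources.
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()

# Look up the managed Docker image for the built-in algorithm in your region.
image_uri = sagemaker.image_uris.retrieve(
    framework="xgboost", region=session.boto_region_name, version="1.2-1"
)

estimator = Estimator(
    image_uri=image_uri,
    role="arn:aws:iam::111122223333:role/MySageMakerRole",  # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://amzn-s3-demo-bucket/output",  # placeholder
    sagemaker_session=session,
)
estimator.set_hyperparameters(objective="reg:squarederror", num_round=100)

# The training channel points at headerless CSV data in S3 (target first).
estimator.fit({"train": TrainingInput(
    "s3://amzn-s3-demo-bucket/train/", content_type="text/csv")})  # placeholder
```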

### Use script mode in a supported framework


If the algorithm you want to use for your model is not supported by a built-in choice and you are comfortable coding your own solution, then you should consider using an Amazon SageMaker AI supported framework. This is referred to as "script mode" because you write your custom code (script) in a text file with a `.py` extension. As the preceding table indicates, SageMaker AI supports most of the popular machine learning frameworks. These framework images come preloaded with the corresponding framework and some additional Python packages, such as Pandas and NumPy, so you can write your own code for training an algorithm. These frameworks also allow you to install any Python package hosted on PyPi by including a `requirements.txt` file with your training code, or to include your own code directories. R is also supported natively in SageMaker notebook kernels. Some frameworks, such as scikit-learn and Spark ML, have pre-coded algorithms you can use easily, while other frameworks, such as TensorFlow and PyTorch, may require you to implement the algorithm yourself. The only limitation when using a supported framework image is that you cannot import any software packages that are not hosted on PyPi or that are not already included with the framework's image.
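A minimal, hypothetical entry point (`train.py`) for script mode might look like the following. The hyperparameter names and file layout are made up for illustration; the only SageMaker-specific parts are that hyperparameters arrive as command-line arguments and that data and model locations arrive in `SM_*` environment variables, with local defaults so the script also runs outside SageMaker.

```python
# Minimal script-mode training entry point (train.py).
import argparse
import json
import os

def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hyperparameters are passed to the script as command-line arguments.
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--learning-rate", type=float, default=0.1)
    # Data and model locations come from SageMaker environment variables.
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN", "./data"))
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "./model"))
    return parser.parse_args(argv)

def train(args):
    # ... load data from args.train and fit a real model here ...
    model = {"epochs": args.epochs, "lr": args.learning_rate}
    os.makedirs(args.model_dir, exist_ok=True)
    # Artifacts written to model_dir are uploaded to S3 when the job ends.
    with open(os.path.join(args.model_dir, "model.json"), "w") as f:
        json.dump(model, f)
    return model

if __name__ == "__main__":
    train(parse_args())
```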

For more information on the frameworks supported by SageMaker AI, see [Machine Learning Frameworks and Languages](frameworks.md).

### Use a custom Docker image


Amazon SageMaker AI's built-in algorithms and supported frameworks should cover most use cases, but there are times when you may need to use an algorithm from a package not included in any of the supported frameworks. You might also have a pre-trained model pickled or persisted somewhere that you need to deploy. SageMaker AI uses Docker images to host the training and serving of all models, so you can supply your own custom Docker image if the package or software you need is not included in a supported framework. This may be your own Python package or an algorithm coded in a language like Stan or Julia. For these images, you must also configure the training of the algorithm and the serving of the model properly in your Dockerfile. This requires intermediate knowledge of Docker and is not recommended unless you are comfortable writing your own machine learning algorithm. Your Docker image must be uploaded to an online repository, such as the Amazon Elastic Container Registry (Amazon ECR), before you can train and serve your model.

For more information on custom Docker images in SageMaker AI, see [Docker containers for training and deploying models](docker-containers.md).

## Problem types for the basic machine learning paradigms


The following three sections describe the main problem types addressed by the three basic paradigms for machine learning. For a list of the built-in algorithms that SageMaker AI provides to address these problem types, see [Built-in algorithms and pretrained models in Amazon SageMaker](algos.md).

**Topics**
+ [Supervised learning](#algorithms-choose-supervised-learning)
+ [Unsupervised learning](#algorithms-choose-unsupervised-learning)
+ [Reinforcement learning](#algorithms-choose-reinforcement-learning)

### Supervised learning


If your data set consists of features or attributes (inputs) together with target values (outputs), then you have a supervised learning problem. If your target values are categorical (mathematically discrete), then you have a **classification problem**. It is standard practice to distinguish binary from multiclass classification. 
+ **Binary classification** is a type of supervised learning that assigns an individual to one of two predefined and mutually exclusive classes based on the individual's attributes. It is supervised because the models are trained using examples in which the attributes are provided with correctly labeled objects. A medical diagnosis for whether an individual has a disease or not based on the results of diagnostic tests is an example of binary classification.
+ **Multiclass classification** is a type of supervised learning that assigns an individual to one of several classes based on the individual's attributes. It is supervised because the models are trained using examples in which the attributes are provided with correctly labeled objects. An example is the prediction of the topic most relevant to a text document. A document may be classified as being about religion, politics, or finance, or as about one of several other predefined topic classes.
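To make the binary case concrete, here is a toy nearest-centroid classifier in plain Python. The feature values and labels are made-up numbers, and nearest-centroid is just one of many possible classification rules; the point is only that the model is trained from correctly labeled examples and then assigns one of two classes to a new input.

```python
# Toy binary classification with a nearest-centroid rule.
# Features are (test_result_1, test_result_2); labels are 0 (healthy)
# or 1 (disease). All numbers are invented for illustration.
def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def train(examples):
    # examples: list of (features, label) pairs with labels 0 and 1.
    by_label = {0: [], 1: []}
    for x, y in examples:
        by_label[y].append(x)
    return {y: centroid(xs) for y, xs in by_label.items()}

def predict(model, x):
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    # Assign the class whose centroid is closest to x.
    return min(model, key=lambda y: dist2(x, model[y]))

labeled = [((1.0, 0.9), 0), ((1.2, 1.1), 0), ((3.0, 3.2), 1), ((2.8, 3.1), 1)]
model = train(labeled)
print(predict(model, (1.1, 1.0)))  # → 0 (nearest the label-0 centroid)
```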

If the target values you are trying to predict are mathematically continuous, then you have a **regression** problem. Regression estimates the values of a dependent target variable based on one or more other variables or attributes that are correlated with it. An example is the prediction of house prices using features like the number of bathrooms and bedrooms and the square footage of the house and garden. Regression analysis can create a model that takes one or more of these features as an input and predicts the price of a house.
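The house-price example above can be sketched in a few lines of plain Python with one feature. The sizes and prices are fabricated (chosen to lie exactly on a line so the result is easy to verify); a real regression would use more features and noisy data.

```python
# Simple linear regression (one feature) fit by ordinary least squares.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope = covariance(x, y) / variance(x); intercept follows from the means.
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx  # (slope, intercept)

sizes = [1000, 1500, 2000, 2500]           # square feet (made-up data)
prices = [200000, 300000, 400000, 500000]  # exactly 200 per square foot here

slope, intercept = fit_line(sizes, prices)
print(slope, intercept)                 # 200.0 0.0
predicted = slope * 1800 + intercept    # price estimate for 1800 sq ft
print(predicted)                        # 360000.0
```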

For more information on the built-in supervised learning algorithms provided by SageMaker AI, see [Supervised learning](algos.md#algorithms-built-in-supervised-learning).

### Unsupervised learning


If your data set consists of features or attributes (inputs) that do not contain labels or target values (outputs), then you have an unsupervised learning problem. In this type of problem, the output must be predicted based on the pattern discovered in the input data. The goal in unsupervised learning problems is to discover patterns such as groupings within the data. There are a large variety of tasks or problem types to which unsupervised learning can be applied. Principal component and cluster analyses are two of the main methods commonly deployed for preprocessing data. Here is a short list of problem types that can be addressed by unsupervised learning:
+ **Dimension reduction** is typically part of a data exploration step used to determine the most relevant features to use for model construction. The idea is to transform data from a high-dimensional, sparsely populated space into a low-dimensional space that retains the most significant properties of the original data. This mitigates the curse of dimensionality that can arise with sparsely populated, high-dimensional data, on which statistical analysis becomes problematic. Dimension reduction can also aid in understanding data by reducing high-dimensional data to a lower dimensionality that can be visualized.
+ **Cluster analysis** is a class of techniques that are used to classify objects or cases into groups called clusters. It attempts to find discrete groupings within data, where members of a group are as similar as possible to one another and as different as possible from members of other groups. You define the features or attributes that you want the algorithm to use to determine similarity, select a distance function to measure similarity, and specify the number of clusters to use in the analysis.
+ **Anomaly detection** is the identification of rare items, events, or observations in a data set which raise suspicions because they differ significantly from the rest of the data. The identification of anomalous items can be used, for example, to detect bank fraud or medical errors. Anomalies are also referred to as outliers, novelties, noise, deviations, and exceptions.
+ **Density estimation** is the construction of estimates of unobservable underlying probability density functions based on observed data. A natural use of density estimates is for data exploration. Density estimates can discover features such as skewness and multimodality in the data. The most basic form of density estimation is a rescaled histogram.
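The cluster-analysis loop described above (assign each point to its nearest center, then move each center to the mean of its cluster) can be sketched in one dimension with plain Python. The data points and initial centers are made up, and the fixed initial centers are chosen only to keep the example deterministic.

```python
# Minimal k-means in one dimension.
def kmeans_1d(points, centers, iters=10):
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            clusters[i].append(p)
        # Update step: each center moves to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])
print(centers)  # [1.5, 10.5]
```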

SageMaker AI provides several built-in machine learning algorithms that you can use for these unsupervised learning tasks. For more information on the built-in unsupervised algorithms provided by SageMaker AI, see [Unsupervised learning](algos.md#algorithms-built-in-unsupervised-learning).

### Reinforcement learning


Reinforcement learning is a type of learning that is based on interaction with the environment. This type of learning is used by an agent that must learn behavior through trial-and-error interactions with a dynamic environment in which the goal is to maximize the long-term rewards that the agent receives as a result of its actions. Rewards are maximized by trading off exploring actions that have uncertain rewards with exploiting actions that have known rewards. 
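The explore/exploit trade-off can be illustrated with an epsilon-greedy agent on a two-armed bandit, a deliberately tiny stand-in for a full reinforcement learning environment. The reward probabilities are invented, and the run is seeded so the sketch is reproducible.

```python
# Epsilon-greedy action selection on a two-armed Bernoulli bandit.
import random

def run_bandit(reward_probs, steps=2000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    counts = [0] * len(reward_probs)    # pulls per arm
    values = [0.0] * len(reward_probs)  # running mean reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:                  # explore: random arm
            arm = rng.randrange(len(reward_probs))
        else:                                       # exploit: best known arm
            arm = max(range(len(values)), key=values.__getitem__)
        reward = 1.0 if rng.random() < reward_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the arm's estimated value.
        values[arm] += (reward - values[arm]) / counts[arm]
    return values, counts

values, counts = run_bandit([0.2, 0.8])
# After 2000 steps the agent mostly pulls the better arm (index 1).
```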

For more information on SageMaker AI's frameworks, toolkits, and environments for reinforcement learning, see [Use Reinforcement Learning with Amazon SageMaker AI](reinforcement-learning.md).

# Built-in algorithms and pretrained models in Amazon SageMaker

Amazon SageMaker provides a suite of built-in algorithms, pre-trained models, and pre-built solution templates to help data scientists and machine learning practitioners get started on training and deploying machine learning models quickly. For someone who is new to SageMaker, choosing the right algorithm for your particular use case can be a challenging task. The following table provides a quick cheat sheet that shows how you can start with an example problem or use case and find an appropriate built-in algorithm offered by SageMaker that is valid for that problem type. Additional guidance organized by learning paradigms (supervised and unsupervised) and important data domains (text and images) is provided in the sections following the table.

Table: Mapping use cases to built-in algorithms

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/algos.html)

For important information about the following items common to all of the built-in algorithms provided by SageMaker AI, see [Parameters for Built-in Algorithms](common-info-all-im-models.md).
+ Docker registry paths
+ data formats
+ recommended Amazon EC2 instance types
+ CloudWatch logs

The following sections provide additional guidance for the Amazon SageMaker AI built-in algorithms grouped by the supervised and unsupervised learning paradigms to which they belong. For descriptions of these learning paradigms and their associated problem types, see [Types of Algorithms](algorithms-choose.md). Sections are also provided for the SageMaker AI built-in algorithms available to address two important machine learning domains: textual analysis and image processing.
+ [Pre-trained models and solution templates](#algorithms-built-in-jumpstart)
+ [Supervised learning](#algorithms-built-in-supervised-learning)
+ [Unsupervised learning](#algorithms-built-in-unsupervised-learning)
+ [Textual analysis](#algorithms-built-in-text-analysis)
+ [Image processing](#algorithms-built-in-image-processing)

## Pre-trained models and solution templates

Amazon SageMaker JumpStart provides a wide range of pre-trained models, pre-built solution templates, and examples for popular problem types. These use the SageMaker SDK as well as Studio Classic. For more information about these models, solutions, and the example notebooks provided by Amazon SageMaker JumpStart, see [SageMaker JumpStart pretrained models](studio-jumpstart.md).

## Supervised learning

Amazon SageMaker AI provides several built-in general purpose algorithms that can be used for either classification or regression problems.
+ [AutoGluon-Tabular](autogluon-tabular.md)—an open-source AutoML framework that succeeds by ensembling models and stacking them in multiple layers. 
+ [CatBoost](catboost.md)—an implementation of the gradient-boosted trees algorithm that introduces ordered boosting and an innovative algorithm for processing categorical features.
+ [Factorization Machines Algorithm](fact-machines.md)—an extension of a linear model that is designed to economically capture interactions between features within high-dimensional sparse datasets.
+ [K-Nearest Neighbors (k-NN) Algorithm](k-nearest-neighbors.md)—a non-parametric method that uses the k nearest labeled points to assign a value to a new data point. For classification, the assigned value is the most common label among the k nearest points; for regression, it is a target value predicted from the average of the k nearest points.
+ [LightGBM](lightgbm.md)—an implementation of the gradient-boosted trees algorithm that adds two novel techniques for improved efficiency and scalability. These two novel techniques are Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB).
+ [Linear Learner Algorithm](linear-learner.md)—learns a linear function for regression or a linear threshold function for classification.
+ [TabTransformer](tabtransformer.md)—a novel deep tabular data modeling architecture built on self-attention-based Transformers. 
+ [XGBoost algorithm with Amazon SageMaker AI](xgboost.md)—an implementation of the gradient-boosted trees algorithm that combines an ensemble of estimates from a set of simpler and weaker models.

Amazon SageMaker AI also provides several built-in supervised learning algorithms used for more specialized tasks during feature engineering and forecasting from time series data.
+ [Object2Vec Algorithm](object2vec.md)—a new highly customizable multi-purpose algorithm used for feature engineering. It can learn low-dimensional dense embeddings of high-dimensional objects to produce features that improve training efficiencies for downstream models. While this is a supervised algorithm, there are many scenarios in which the relationship labels can be obtained purely from natural clusterings in data. Even though it requires labeled data for training, this can occur without any explicit human annotation.
+ [Use the SageMaker AI DeepAR forecasting algorithm](deepar.md)—a supervised learning algorithm for forecasting scalar (one-dimensional) time series using recurrent neural networks (RNN).

## Unsupervised learning


Amazon SageMaker AI provides several built-in algorithms that can be used for a variety of unsupervised learning tasks, such as clustering, dimension reduction, pattern recognition, and anomaly detection.
+ [Principal Component Analysis (PCA) Algorithm](pca.md)—reduces the dimensionality (number of features) within a dataset by projecting data points onto the first few principal components. The objective is to retain as much information or variation as possible. For mathematicians, principal components are eigenvectors of the data's covariance matrix.
+ [K-Means Algorithm](k-means.md)—finds discrete groupings within data. This occurs where members of a group are as similar as possible to one another and as different as possible from members of other groups.
+ [IP Insights](ip-insights.md)—learns the usage patterns for IPv4 addresses. It is designed to capture associations between IPv4 addresses and various entities, such as user IDs or account numbers.
+ [Random Cut Forest (RCF) Algorithm](randomcutforest.md)—detects anomalous data points within a data set that diverge from otherwise well-structured or patterned data.

## Textual analysis


SageMaker AI provides algorithms that are tailored to the analysis of textual documents. This includes text used in natural language processing, document classification or summarization, topic modeling or classification, and language transcription or translation.
+ [BlazingText algorithm](blazingtext.md)—a highly optimized implementation of the Word2vec and text classification algorithms that scale to large datasets easily. It is useful for many downstream natural language processing (NLP) tasks.
+ [Sequence-to-Sequence Algorithm](seq-2-seq.md)—a supervised algorithm commonly used for neural machine translation. 
+ [Latent Dirichlet Allocation (LDA) Algorithm](lda.md)—an algorithm suitable for determining topics in a set of documents. It is an *unsupervised algorithm*, which means that it doesn't use example data with answers during training.
+ [Neural Topic Model (NTM) Algorithm](ntm.md)—another unsupervised technique for determining topics in a set of documents, using a neural network approach.
+ [Text Classification - TensorFlow](text-classification-tensorflow.md)—a supervised algorithm that supports transfer learning with available pretrained models for text classification.

## Image processing


SageMaker AI also provides image processing algorithms that are used for image classification, object detection, and computer vision.
+ [Image Classification - MXNet](image-classification.md)—uses example data with answers (referred to as a *supervised algorithm*). Use this algorithm to classify images.
+ [Image Classification - TensorFlow](image-classification-tensorflow.md)—uses pretrained TensorFlow Hub models to fine-tune for specific tasks (referred to as a *supervised algorithm*). Use this algorithm to classify images.
+ [Semantic Segmentation Algorithm](semantic-segmentation.md)—provides a fine-grained, pixel-level approach to developing computer vision applications.
+ [Object Detection - MXNet](object-detection.md)—detects and classifies objects in images using a single deep neural network. It is a supervised learning algorithm that takes images as input and identifies all instances of objects within the image scene.
+ [Object Detection - TensorFlow](object-detection-tensorflow.md)—detects bounding boxes and object labels in an image. It is a supervised learning algorithm that supports transfer learning with available pretrained TensorFlow models.

**Topics**
+ [Pre-trained models and solution templates](#algorithms-built-in-jumpstart)
+ [Supervised learning](#algorithms-built-in-supervised-learning)
+ [Unsupervised learning](#algorithms-built-in-unsupervised-learning)
+ [Textual analysis](#algorithms-built-in-text-analysis)
+ [Image processing](#algorithms-built-in-image-processing)
+ [Parameters for Built-in Algorithms](common-info-all-im-models.md)
+ [Built-in SageMaker AI Algorithms for Tabular Data](algorithms-tabular.md)
+ [Built-in SageMaker AI Algorithms for Text Data](algorithms-text.md)
+ [Built-in SageMaker AI Algorithms for Time-Series Data](algorithms-time-series.md)
+ [Unsupervised Built-in SageMaker AI Algorithms](algorithms-unsupervised.md)
+ [Built-in SageMaker AI Algorithms for Computer Vision](algorithms-vision.md)

# Parameters for Built-in Algorithms

The following table lists parameters for each of the algorithms provided by Amazon SageMaker AI.


| Algorithm name | Channel name | Training input mode | File type | Instance class | Parallelizable | 
| --- | --- | --- | --- | --- | --- | 
| AutoGluon-Tabular | training and (optionally) validation | File | CSV | CPU or GPU (single instance only) | No | 
| BlazingText | train | File or Pipe | Text file (one sentence per line with space-separated tokens)  | CPU or GPU (single instance only)  | No | 
| CatBoost | training and (optionally) validation | File | CSV | CPU (single instance only) | No | 
| DeepAR Forecasting | train and (optionally) test | File | JSON Lines or Parquet | CPU or GPU | Yes | 
| Factorization Machines | train and (optionally) test | File or Pipe | recordIO-protobuf | CPU (GPU for dense data) | Yes | 
| Image Classification - MXNet | train and validation, (optionally) train_lst, validation_lst, and model | File or Pipe | recordIO or image files (.jpg or .png)  | GPU | Yes | 
| Image Classification - TensorFlow | training and validation | File | image files (.jpg, .jpeg, or .png)  | CPU or GPU | Yes (only across multiple GPUs on a single instance) | 
| IP Insights | train and (optionally) validation | File | CSV | CPU or GPU | Yes | 
| K-Means | train and (optionally) test | File or Pipe | recordIO-protobuf or CSV | CPU or GPU (single GPU device on one or more instances) | No | 
| K-Nearest-Neighbors (k-NN) | train and (optionally) test | File or Pipe | recordIO-protobuf or CSV | CPU or GPU (single GPU device on one or more instances) | Yes | 
| LDA | train and (optionally) test | File or Pipe | recordIO-protobuf or CSV | CPU (single instance only) | No | 
| LightGBM | train/training and (optionally) validation | File | CSV | CPU | Yes | 
| Linear Learner | train and (optionally) validation, test, or both | File or Pipe | recordIO-protobuf or CSV | CPU or GPU | Yes | 
| Neural Topic Model | train and (optionally) validation, test, or both | File or Pipe | recordIO-protobuf or CSV | CPU or GPU | Yes | 
| Object2Vec | train and (optionally) validation, test, or both | File | JSON Lines  | CPU or GPU (single instance only) | No | 
| Object Detection - MXNet | train and validation, (optionally) train_annotation, validation_annotation, and model | File or Pipe | recordIO or image files (.jpg or .png)  | GPU | Yes | 
| Object Detection - TensorFlow | training and validation | File | image files (.jpg, .jpeg, or .png)  | GPU | Yes (only across multiple GPUs on a single instance) | 
| PCA | train and (optionally) test | File or Pipe | recordIO-protobuf or CSV | CPU or GPU | Yes | 
| Random Cut Forest | train and (optionally) test | File or Pipe | recordIO-protobuf or CSV | CPU | Yes | 
| Semantic Segmentation | train and validation, train_annotation, validation_annotation, and (optionally) label_map and model | File or Pipe | Image files | GPU (single instance only) | No | 
| Seq2Seq Modeling | train, validation, and vocab | File | recordIO-protobuf | GPU (single instance only) | No | 
| TabTransformer | training and (optionally) validation | File | CSV | CPU or GPU (single instance only) | No | 
| Text Classification - TensorFlow | training and validation | File | CSV | CPU or GPU | Yes (only across multiple GPUs on a single instance) | 
| XGBoost (0.90-1, 0.90-2, 1.0-1, 1.2-1, 1.2-2) | train and (optionally) validation | File or Pipe | CSV, LibSVM, or Parquet | CPU (or GPU for 1.2-1) | Yes | 

Algorithms that are *parallelizable* can be deployed on multiple compute instances for distributed training.

The following topics provide information about data formats, recommended Amazon EC2 instance types, and CloudWatch logs common to all of the built-in algorithms provided by Amazon SageMaker AI.

**Note**  
To look up the Docker image URIs of the built-in algorithms managed by SageMaker AI, see [Docker Registry Paths and Example Code](https://docs.aws.amazon.com/sagemaker/latest/dg-ecr-paths/sagemaker-algo-docker-registry-paths).

**Topics**
+ [Common Data Formats for Training](cdf-training.md)
+ [Common data formats for inference](cdf-inference.md)
+ [Instance Types for Built-in Algorithms](cmn-info-instance-types.md)
+ [Logs for Built-in Algorithms](common-info-all-sagemaker-models-logs.md)

# Common Data Formats for Training


To prepare for training, you can preprocess your data using a variety of AWS services, including AWS Glue, Amazon EMR, Amazon Redshift, Amazon Relational Database Service, and Amazon Athena. After preprocessing, publish the data to an Amazon S3 bucket. For training, the data must go through a series of conversions and transformations, including: 
+ Training data serialization (handled by you) 
+ Training data deserialization (handled by the algorithm) 
+ Training model serialization (handled by the algorithm) 
+ Trained model deserialization (optional, handled by you) 

When using Amazon SageMaker AI for training, make sure that all of your training data is uploaded to the Amazon S3 location before the training job starts. If more data is added to that location later, you must make a new training call to construct a brand new model.

**Topics**
+ [Content Types Supported by Built-In Algorithms](#cdf-common-content-types)
+ [Using Pipe Mode](#cdf-pipe-mode)
+ [Using CSV Format](#cdf-csv-format)
+ [Using RecordIO Format](#cdf-recordio-format)
+ [Trained Model Deserialization](#td-deserialization)

## Content Types Supported by Built-In Algorithms


The following table lists some of the commonly supported [ContentType](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_Channel.html#SageMaker-Type-Channel-ContentType) values and the algorithms that use them:

Table: ContentTypes for Built-in Algorithms


| ContentType | Algorithm | 
| --- | --- | 
| application/x-image | Object Detection Algorithm, Semantic Segmentation | 
| application/x-recordio |  Object Detection Algorithm  | 
| application/x-recordio-protobuf |  Factorization Machines, K-Means, k-NN, Latent Dirichlet Allocation, Linear Learner, NTM, PCA, RCF, Sequence-to-Sequence  | 
| application/jsonlines |  BlazingText, DeepAR  | 
| image/jpeg |  Object Detection Algorithm, Semantic Segmentation  | 
| image/png |  Object Detection Algorithm, Semantic Segmentation  | 
| text/csv |  IP Insights, K-Means, k-NN, Latent Dirichlet Allocation, Linear Learner, NTM, PCA, RCF, XGBoost  | 
| text/libsvm |  XGBoost  | 

For a summary of the parameters used by each algorithm, see the documentation for the individual algorithms or this [table](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html).

## Using Pipe Mode


In *Pipe mode*, your training job streams data directly from Amazon Simple Storage Service (Amazon S3). Streaming can provide faster start times for training jobs and better throughput. This is in contrast to *File mode*, in which your data from Amazon S3 is stored on the training instance volumes. File mode uses disk space to store both your final model artifacts and your full training dataset. By streaming in your data directly from Amazon S3 in Pipe mode, you reduce the size of the Amazon Elastic Block Store volumes of your training instances. Pipe mode needs only enough disk space to store your final model artifacts. See [AlgorithmSpecification](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_AlgorithmSpecification.html) for additional details on the training input mode.

## Using CSV Format


Many Amazon SageMaker AI algorithms support training with data in CSV format. To use data in CSV format for training, in the input data channel specification, specify **text/csv** as the [`ContentType`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_Channel.html#SageMaker-Type-Channel-ContentType). Amazon SageMaker AI requires that a CSV file not have a header record and that the target variable be in the first column. To run unsupervised learning algorithms that don't have a target, specify the number of label columns in the content type. For example, in this case specify **'content_type=text/csv;label_size=0'**. For more information, see [Now use Pipe mode with CSV datasets for faster training on Amazon SageMaker AI built-in algorithms](https://aws.amazon.com/blogs/machine-learning/now-use-pipe-mode-with-csv-datasets-for-faster-training-on-amazon-sagemaker-built-in-algorithms/).
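
For example, a minimal sketch that writes a training CSV in the required layout, with no header record and the target in the first column (the file name and values are arbitrary):

```python
import csv

# Rows as (target, feature_1, feature_2, ...). SageMaker AI expects no header
# record and the target variable in the first column.
rows = [
    (1, 1.5, 16.0, 14.0),
    (0, -2.0, 100.2, 15.2),
]

with open("train.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

with open("train.csv") as f:
    first_line = f.readline().strip()
print(first_line)  # the first record, not a header: 1,1.5,16.0,14.0
```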

## Using RecordIO Format


In the protobuf recordIO format, SageMaker AI converts each observation in the dataset into a binary representation as a set of 4-byte floats, then loads it in the protobuf values field. If you are using Python for your data preparation, we strongly recommend that you use these existing transformations (for example, `write_numpy_to_dense_tensor` in the `sagemaker.amazon.common` module of the SageMaker Python SDK). However, if you are using another language, the protobuf definition file below provides the schema that you can use to convert your data into the SageMaker AI protobuf format.

**Note**  
For an example that shows how to convert the commonly used NumPy array into the protobuf recordIO format, see *[An Introduction to Factorization Machines with MNIST](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/factorization_machines_mnist/factorization_machines_mnist.html)*.

```
syntax = "proto2";

 package aialgs.data;

 option java_package = "com.amazonaws.aialgorithms.proto";
 option java_outer_classname = "RecordProtos";

 // A sparse or dense rank-R tensor that stores data as 32-bit floats (float32).
 message Float32Tensor   {
     // Each value in the vector. If keys is empty, this is treated as a
     // dense vector.
     repeated float values = 1 [packed = true];

     // If key is not empty, the vector is treated as sparse, with
     // each key specifying the location of the value in the sparse vector.
     repeated uint64 keys = 2 [packed = true];

     // An optional shape that allows the vector to represent a matrix.
     // For example, if shape = [ 10, 20 ], floor(keys[i] / 20) gives the row,
     // and keys[i] % 20 gives the column.
     // This also supports n-dimensional tensors.
     // Note: If the tensor is sparse, you must specify this value.
     repeated uint64 shape = 3 [packed = true];
 }

 // A sparse or dense rank-R tensor that stores data as doubles (float64).
 message Float64Tensor {
     // Each value in the vector. If keys is empty, this is treated as a
     // dense vector.
     repeated double values = 1 [packed = true];

     // If this is not empty, the vector is treated as sparse, with
     // each key specifying the location of the value in the sparse vector.
     repeated uint64 keys = 2 [packed = true];

     // An optional shape that allows the vector to represent a matrix.
     // For example, if shape = [ 10, 20 ], floor(keys[i] / 20) gives the row,
     // and keys[i] % 20 gives the column.
     // This also supports n-dimensional tensors.
     // Note: If the tensor is sparse, you must specify this value.
     repeated uint64 shape = 3 [packed = true];
 }

 // A sparse or dense rank-R tensor that stores data as 32-bit ints (int32).
 message Int32Tensor {
     // Each value in the vector. If keys is empty, this is treated as a
     // dense vector.
     repeated int32 values = 1 [packed = true];

     // If this is not empty, the vector is treated as sparse with
     // each key specifying the location of the value in the sparse vector.
     repeated uint64 keys = 2 [packed = true];

     // An optional shape that allows the vector to represent a matrix.
     // For example, if shape = [ 10, 20 ], floor(keys[i] / 20) gives the row,
     // and keys[i] % 20 gives the column.
     // This also supports n-dimensional tensors.
     // Note: If the tensor is sparse, you must specify this value.
     repeated uint64 shape = 3 [packed = true];
 }

 // Support for storing binary data for parsing in other ways (such as JPEG/etc).
 // This is an example of another type of value and may not immediately be supported.
 message Bytes {
     repeated bytes value = 1;

     // If the content type of the data is known, stores it.
     // This allows for the possibility of using decoders for common formats
     // in the future.
     optional string content_type = 2;
 }

 message Value {
     oneof value {
         // The numbering assumes the possible use of:
         // - float16, float128
         // - int8, int16, int32
         Float32Tensor float32_tensor = 2;
         Float64Tensor float64_tensor = 3;
         Int32Tensor int32_tensor = 7;
         Bytes bytes = 9;
     }
 }

 message Record {
     // Map from the name of the feature to the value.
     //
     // For vectors and libsvm-like datasets,
     // a single feature with the name `values`
     // should be specified.
     map<string, Value> features = 1;

     // An optional set of labels for this record.
     // Similar to the features field above, the key used for
     // generic scalar / vector labels should be 'values'.
     map<string, Value> label = 2;

     // A unique identifier for this record in the dataset.
     //
     // Whilst not necessary, this allows better
     // debugging where there are data issues.
     //
     // This is not used by the algorithm directly.
     optional string uid = 3;

     // Textual metadata describing the record.
     //
     // This may include JSON-serialized information
     // about the source of the record.
     //
     // This is not used by the algorithm directly.
     optional string metadata = 4;

     // An optional serialized JSON object that allows per-record
     // hyper-parameters/configuration/other information to be set.
     //
     // The meaning/interpretation of this field is defined by
     // the algorithm author and may not be supported.
     //
     // This is used to pass additional inference configuration
     // when batch inference is used (e.g. types of scores to return).
     optional string configuration = 5;
 }
```

After creating the protocol buffer, store it in an Amazon S3 location that Amazon SageMaker AI can access, and pass that location as part of `InputDataConfig` in `create_training_job`. 

**Note**  
For all Amazon SageMaker AI algorithms, the `ChannelName` in `InputDataConfig` must be set to `train`. Some algorithms also support validation or test input channels. These are typically used to evaluate the model's performance with a hold-out dataset. Hold-out datasets are not used in the initial training but can be used to further tune the model.
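
A minimal sketch of an `InputDataConfig` that declares the required `train` channel plus an optional `validation` channel (the bucket and prefixes are placeholders):

```python
# Placeholders: replace the bucket and prefixes with your own S3 locations.
input_data_config = [
    {
        "ChannelName": "train",  # required name for all built-in algorithms
        "ContentType": "application/x-recordio-protobuf",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://amzn-s3-demo-bucket/train/",
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    },
    {
        "ChannelName": "validation",  # optional hold-out channel
        "ContentType": "application/x-recordio-protobuf",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://amzn-s3-demo-bucket/validation/",
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    },
]

channel_names = [c["ChannelName"] for c in input_data_config]
print(channel_names)
```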

## Trained Model Deserialization


Amazon SageMaker AI models are stored as `model.tar.gz` in the S3 bucket specified in the `OutputDataConfig` `S3OutputPath` parameter of the `create_training_job` call. The S3 bucket must be in the same AWS Region as the notebook instance. You can specify most of these model artifacts when creating a hosting model. You can also open and review them in your notebook instance. When `model.tar.gz` is untarred, it contains `model_algo-1`, which is a serialized Apache MXNet object. For example, use the following to load the k-means model into memory and view it: 

```
import mxnet as mx
print(mx.ndarray.load('model_algo-1'))
```

# Common data formats for inference


Amazon SageMaker AI algorithms accept and produce several different MIME types for the HTTP payloads used in retrieving online and mini-batch predictions. You can use multiple AWS services to transform or preprocess records before running inference. At a minimum, you need to convert the data for the following:
+ Inference request serialization (handled by you) 
+ Inference request deserialization (handled by the algorithm) 
+ Inference response serialization (handled by the algorithm) 
+ Inference response deserialization (handled by you) 

**Topics**
+ [Convert data for inference request serialization](#ir-serialization)
+ [Convert data for inference response deserialization](#ir-deserialization)
+ [Common request formats for all algorithms](#common-in-formats)
+ [Use batch transform with built-in algorithms](#cm-batch)

## Convert data for inference request serialization


Content type options for Amazon SageMaker AI algorithm inference requests include: `text/csv`, `application/json`, and `application/x-recordio-protobuf`. Algorithms that don't support all of these types can support other types. XGBoost, for example, only supports `text/csv` from this list, but also supports `text/libsvm`.

For `text/csv`, the value for the Body argument to `invoke_endpoint` should be a string with commas separating the values for each feature. For example, a record for a model with four features might look like `1.5,16.0,14,23.0`. Any transformations performed on the training data should also be performed on the data before obtaining inference. The order of the features matters and must remain unchanged. 
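
A minimal sketch of serializing one record to a `text/csv` request body (the endpoint name in the comment is a placeholder):

```python
features = [1.5, 16.0, 14.0, 23.0]

# Join the feature values with commas; a single record needs no header
# and no trailing newline.
body = ",".join(str(v) for v in features)
print(body)  # 1.5,16.0,14.0,23.0

# The string would then be passed as the Body argument, e.g. (not executed here):
# runtime = boto3.client("sagemaker-runtime")
# runtime.invoke_endpoint(EndpointName="my-endpoint", ContentType="text/csv", Body=body)
```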

`application/json` is more flexible and provides multiple possible formats for developers to use in their applications. At a high level, in JavaScript, the payload might look like the following: 

```
let request = {
  // Instances might contain multiple rows that predictions are sought for.
  "instances": [
    {
      // Request and algorithm specific inference parameters.
      "configuration": {},
      // Data in the specific format required by the algorithm.
      "data": {
         "<field name>": dataElement
       }
    }
  ]
}
```

You have the following options for specifying the `dataElement`: 

**Protocol buffers equivalent**

```
// Has the same format as the protocol buffers implementation described for training.
let dataElement = {
  "keys": [],
  "values": [],
  "shape": []
}
```

**Simple numeric vector**

```
// An array containing numeric values is treated as an instance containing a
// single dense vector.
let dataElement = [1.5, 16.0, 14.0, 23.0]

// It will be converted to the following representation by the SDK.
let converted = {
  "features": {
    "values": dataElement
  }
}
```

**For multiple records**

```
let request = {
  "instances": [
    // First instance.
    {
      "features": [ 1.5, 16.0, 14.0, 23.0 ]
    },
    // Second instance.
    {
      "features": [ -2.0, 100.2, 15.2, 9.2 ]
    }
  ]
}
```
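
The JSON request shapes above map directly onto Python dictionaries. A minimal sketch that builds and serializes the two-record request (nothing is sent anywhere; this only constructs the body):

```python
import json

request = {
    "instances": [
        {"features": [1.5, 16.0, 14.0, 23.0]},
        {"features": [-2.0, 100.2, 15.2, 9.2]},
    ]
}

# The serialized string is what goes into the request body with
# ContentType application/json.
body = json.dumps(request)
print(len(request["instances"]))  # 2
```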

## Convert data for inference response deserialization


Amazon SageMaker AI algorithms return JSON in several layouts. At a high level, the structure is:

```
let response = {
  "predictions": [{
    // Fields in the response object are defined on a per algorithm-basis.
  }]
}
```

The fields that are included in predictions differ across algorithms. The following are examples of output for the k-means algorithm.

**Single-record inference** 

```
let response = {
  "predictions": [{
    "closest_cluster": 5,
    "distance_to_cluster": 36.5
  }]
}
```

**Multi-record inference**

```
let response = {
  "predictions": [
    // First instance prediction.
    {
      "closest_cluster": 5,
      "distance_to_cluster": 36.5
    },
    // Second instance prediction.
    {
      "closest_cluster": 2,
      "distance_to_cluster": 90.3
    }
  ]
}
```

**Multi-record inference with protobuf input**

```
{
  "features": [],
  "label": {
    "closest_cluster": {
      "values": [ 5.0 ] // e.g. the closest centroid/cluster was cluster 5
    },
    "distance_to_cluster": {
      "values": [ 36.5 ]
    }
  },
  "uid": "abc123",
  "metadata": "{ \"created_at\": \"2017-06-03\" }"
}
```

SageMaker AI algorithms also support the JSONLINES format, where the per-record response content is the same as that in JSON format. The multi-record structure is a collection of per-record response objects separated by newline characters. For example, the response content for the built-in K-Means algorithm for two input data points is:

```
{"distance_to_cluster": 23.40593910217285, "closest_cluster": 0.0}
{"distance_to_cluster": 27.250282287597656, "closest_cluster": 0.0}
```
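
Because each line is an independent JSON object, deserializing a JSONLINES response is a `json.loads` per line; a sketch:

```python
import json

raw = (
    '{"distance_to_cluster": 23.40593910217285, "closest_cluster": 0.0}\n'
    '{"distance_to_cluster": 27.250282287597656, "closest_cluster": 0.0}'
)

# One JSON object per line; skip any blank lines defensively.
records = [json.loads(line) for line in raw.splitlines() if line.strip()]
print(len(records))  # 2
```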

While running batch transform, we recommend using the `jsonlines` response type by setting the `Accept` field in the `CreateTransformJobRequest` to `application/jsonlines`.

## Common request formats for all algorithms


Most algorithms use many of the following inference request formats.

### JSON request format


**Content type:** application/json

**Dense format**

```
let request =   {
    "instances":    [
        {
            "features": [1.5, 16.0, 14.0, 23.0]
        }
    ]
}


let request =   {
    "instances":    [
        {
            "data": {
                "features": {
                    "values": [ 1.5, 16.0, 14.0, 23.0]
                }
            }
        }
    ]
}
```

**Sparse format**

```
{
	"instances": [
		{"data": {"features": {
					"keys": [26, 182, 232, 243, 431],
					"shape": [2000],
					"values": [1, 1, 1, 4, 1]
				}
			}
		},
		{"data": {"features": {
					"keys": [0, 182, 232, 243, 431],
					"shape": [2000],
					"values": [13, 1, 1, 4, 1]
				}
			}
		}
	]
}
```
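
The sparse layout stores only the nonzero entries. A sketch of a helper (hypothetical, not part of any SDK) that converts a dense vector into the `keys`/`shape`/`values` layout shown above:

```python
def to_sparse(dense):
    """Convert a dense vector to the sparse keys/shape/values layout."""
    keys = [i for i, v in enumerate(dense) if v != 0]
    values = [dense[i] for i in keys]
    return {"keys": keys, "shape": [len(dense)], "values": values}

sparse = to_sparse([0, 0, 1, 0, 4, 0])
print(sparse)  # {'keys': [2, 4], 'shape': [6], 'values': [1, 4]}
```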

### JSONLINES request format


**Content type:** application/jsonlines

**Dense format**

A single record in dense format can be represented as either:

```
{ "features": [1.5, 16.0, 14.0, 23.0] }
```

or:

```
{ "data": { "features": { "values": [ 1.5, 16.0, 14.0, 23.0] } } }
```

**Sparse Format**

A single record in sparse format is represented as:

```
{"data": {"features": { "keys": [26, 182, 232, 243, 431], "shape": [2000], "values": [1, 1, 1, 4, 1] } } }
```

Multiple records are represented as a collection of single-record representations, separated by newline characters:

```
{"data": {"features": { "keys": [0, 1, 3], "shape": [4], "values": [1, 4, 1] } } }
{ "data": { "features": { "values": [ 1.5, 16.0, 14.0, 23.0] } } }
{ "features": [1.5, 16.0, 14.0, 23.0] }
```

### CSV request format


**Content type:** text/csv;label_size=0

**Note**  
CSV support is not available for factorization machines.

### RECORDIO request format


**Content type:** application/x-recordio-protobuf

## Use batch transform with built-in algorithms


While running batch transform, we recommend using the JSONLINES response type instead of JSON, if supported by the algorithm. To do this, set the `Accept` field in the `CreateTransformJobRequest` to `application/jsonlines`.

When you create a transform job, the `SplitType` must be set based on the `ContentType` of the input data. Similarly, depending on the `Accept` field in the `CreateTransformJobRequest`, `AssembleWith` must be set accordingly. Use the following table to set these fields:


| ContentType | Recommended SplitType | 
| --- | --- | 
| application/x-recordio-protobuf | RecordIO | 
| text/csv | Line | 
| application/jsonlines | Line | 
| application/json | None | 
| application/x-image | None | 
| image/* | None | 


| Accept | Recommended AssembleWith | 
| --- | --- | 
| application/x-recordio-protobuf | None | 
| application/json | None | 
| application/jsonlines | Line | 
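
The two tables above can be expressed as a small lookup helper; a sketch (the mappings are copied from the tables, but the helper itself is hypothetical):

```python
# Recommended SplitType for each input ContentType (from the first table).
SPLIT_TYPE = {
    "application/x-recordio-protobuf": "RecordIO",
    "text/csv": "Line",
    "application/jsonlines": "Line",
    "application/json": "None",
    "application/x-image": "None",
}

# Recommended AssembleWith for each Accept type (from the second table).
ASSEMBLE_WITH = {
    "application/x-recordio-protobuf": "None",
    "application/json": "None",
    "application/jsonlines": "Line",
}

def transform_job_settings(content_type, accept):
    """Return the recommended SplitType/AssembleWith for a batch transform job."""
    # Any image/* content type maps to SplitType None, per the table.
    split = "None" if content_type.startswith("image/") else SPLIT_TYPE[content_type]
    return {"SplitType": split, "AssembleWith": ASSEMBLE_WITH[accept]}

print(transform_job_settings("text/csv", "application/jsonlines"))
```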

For more information on response formats for specific algorithms, see the following:
+ [DeepAR Inference Formats](deepar-in-formats.md)
+ [Factorization Machines Response Formats](fm-in-formats.md)
+ [IP Insights Inference Data Formats](ip-insights-inference-data-formats.md)
+ [K-Means Response Formats](km-in-formats.md)
+ [k-NN Request and Response Formats](kNN-inference-formats.md)
+ [Linear learner response formats](LL-in-formats.md)
+ [NTM Response Formats](ntm-in-formats.md)
+ [Data Formats for Object2Vec Inference](object2vec-inference-formats.md)
+ [Encoder Embeddings for Object2Vec](object2vec-encoder-embeddings.md)
+ [PCA Response Formats](PCA-in-formats.md)
+ [RCF Response Formats](rcf-in-formats.md)

# Instance Types for Built-in Algorithms

Most Amazon SageMaker AI algorithms have been engineered to take advantage of GPU computing for training. Despite higher per-instance costs, GPUs train more quickly, making them more cost effective. Exceptions are noted in this guide.

To learn about the supported EC2 instances, see [Instance details](https://aws.amazon.com/sagemaker-ai/pricing/#Instance_details).

The size and type of data can have a great effect on which hardware configuration is most effective. When the same model is trained on a recurring basis, initial testing across a spectrum of instance types can discover configurations that are more cost-effective in the long run. Additionally, algorithms that train most efficiently on GPUs might not require GPUs for efficient inference. Experiment to determine the most cost-effective solution. To get an automatic instance recommendation or conduct custom load tests, use [Amazon SageMaker Inference Recommender](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-recommender.html).

For more information on SageMaker AI hardware specifications, see [Amazon SageMaker AI pricing](https://aws.amazon.com/sagemaker/ai/pricing/).

**UltraServers**

UltraServers connect multiple Amazon EC2 instances using a low-latency, high-bandwidth accelerator interconnect. They are built to handle large-scale AI/ML workloads that require significant processing power. For more information, see [Amazon EC2 UltraServers](https://aws.amazon.com/ec2/ultraservers/). To get started with UltraServers, see [Reserve training plans for your training jobs or HyperPod clusters](https://docs.aws.amazon.com/sagemaker/latest/dg/reserve-capacity-with-training-plans.html).

To get started with UltraServers on Amazon SageMaker AI, [create a training plan](https://docs.aws.amazon.com/sagemaker/latest/dg/reserve-capacity-with-training-plans.html). Once your UltraServer is available in the training plan, create a training job with the AWS Management Console, the Amazon SageMaker AI API, or the AWS CLI. Remember to specify the UltraServer instance type that you purchased in the training plan.

An UltraServer can run one or more jobs at a time. UltraServers group instances together, which gives you some flexibility in how you allocate UltraServer capacity in your organization. As you configure your jobs, also keep your organization's data security guidelines in mind, because instances in one UltraServer can access data for another job running on another instance in the same UltraServer.

If you run into hardware failures in the UltraServer, SageMaker AI automatically tries to resolve the issue. As SageMaker AI investigates and resolves the issue, you might receive notifications and actions through AWS Health Events or AWS Support.

Once your training job finishes, SageMaker AI stops the instances, but they remain available in your training plan if the plan is still active. To keep an instance in an UltraServer running after a job finishes, you can use [managed warm pools](https://docs.aws.amazon.com/sagemaker/latest/dg/train-warm-pools.html).

If your training plan has enough capacity, you can even run training jobs across multiple UltraServers. By default, each UltraServer comes with 18 instances: 17 active instances and 1 spare. If you need more instances, you must buy more UltraServers. When creating a training job, you can configure how jobs are placed across UltraServers using the `InstancePlacementConfig` parameter.

If you don't configure job placement, SageMaker AI automatically allocates jobs to instances within your UltraServer. This default strategy is best effort and prioritizes filling all of the instances in a single UltraServer before using a different UltraServer. For example, if you request 14 instances and have 2 UltraServers in your training plan, SageMaker AI uses all of the instances in the first UltraServer. If you request 20 instances and have 2 UltraServers in your training plan, SageMaker AI uses all 17 instances in the first UltraServer and then 3 from the second UltraServer. Instances within an UltraServer communicate over NVLink, but separate UltraServers communicate over Elastic Fabric Adapter (EFA), which might affect model training performance.

# Logs for Built-in Algorithms

Amazon SageMaker AI algorithms produce Amazon CloudWatch logs, which provide detailed information on the training process. To see the logs, in the AWS Management Console, choose **CloudWatch**, choose **Logs**, and then choose the **/aws/sagemaker/TrainingJobs** log group. Each training job has one log stream per node on which it was trained. The log stream's name begins with the value specified in the `TrainingJobName` parameter when the job was created.

**Note**  
If a job fails and logs do not appear in CloudWatch, it's likely that an error occurred before the start of training. Reasons include specifying the wrong training image or S3 location.

The contents of logs vary by algorithms. However, you can typically find the following information:
+ Confirmation of arguments provided at the beginning of the log
+ Errors that occurred during training
+ Measurement of an algorithm's accuracy or numerical performance
+ Timings for the algorithm and any major stages within the algorithm

## Common Errors

If a training job fails, some details about the failure are provided by the `FailureReason` return value in the training job description, as follows:

```
sage = boto3.client('sagemaker')
sage.describe_training_job(TrainingJobName=job_name)['FailureReason']
```

Others are reported only in the CloudWatch logs. Common errors include the following:

1. Misspecifying a hyperparameter or specifying a hyperparameter that is invalid for the algorithm.

   **From the CloudWatch Log**

   ```
   [10/16/2017 23:45:17 ERROR 139623806805824 train.py:48]
   Additional properties are not allowed (u'mini_batch_siz' was
   unexpected)
   ```

1. Specifying an invalid value for a hyperparameter.

   **FailureReason**

   ```
   AlgorithmError: u'abc' is not valid under any of the given
   schemas\n\nFailed validating u'oneOf' in
   schema[u'properties'][u'feature_dim']:\n    {u'oneOf':
   [{u'pattern': u'^([1-9][0-9]*)$', u'type': u'string'},\n
   {u'minimum': 1, u'type': u'integer'}]}\
   ```

   **From the CloudWatch Log**

   ```
   [10/16/2017 23:57:17 ERROR 140373086025536 train.py:48] u'abc'
   is not valid under any of the given schemas
   ```

1. Inaccurate protobuf file format.

   **From the CloudWatch log**

   ```
   [10/17/2017 18:01:04 ERROR 140234860816192 train.py:48] cannot
                      copy sequence with size 785 to array axis with dimension 784
   ```

# Built-in SageMaker AI Algorithms for Tabular Data

Amazon SageMaker AI provides built-in algorithms that are tailored to the analysis of tabular data. Tabular data refers to any datasets that are organized in tables consisting of rows (observations) and columns (features). The built-in SageMaker AI algorithms for tabular data can be used for either classification or regression problems.
+ [AutoGluon-Tabular](autogluon-tabular.md)—an open-source AutoML framework that succeeds by ensembling models and stacking them in multiple layers. 
+ [CatBoost](catboost.md)—an implementation of the gradient-boosted trees algorithm that introduces ordered boosting and an innovative algorithm for processing categorical features.
+ [Factorization Machines Algorithm](fact-machines.md)—an extension of a linear model that is designed to economically capture interactions between features within high-dimensional sparse datasets.
+ [K-Nearest Neighbors (k-NN) Algorithm](k-nearest-neighbors.md)—a non-parametric method that uses the k nearest labeled points to assign a label to a new data point for classification or a predicted target value from the average of the k nearest points for regression.
+ [LightGBM](lightgbm.md)—an implementation of the gradient-boosted trees algorithm that adds two novel techniques for improved efficiency and scalability: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB).
+ [Linear Learner Algorithm](linear-learner.md)—learns a linear function for regression or a linear threshold function for classification.
+ [TabTransformer](tabtransformer.md)—a novel deep tabular data modeling architecture built on self-attention-based Transformers. 
+ [XGBoost algorithm with Amazon SageMaker AI](xgboost.md)—an implementation of the gradient-boosted trees algorithm that combines an ensemble of estimates from a set of simpler and weaker models.


| Algorithm name | Channel name | Training input mode | File type | Instance class | Parallelizable | 
| --- | --- | --- | --- | --- | --- | 
| AutoGluon-Tabular | training and (optionally) validation | File | CSV | CPU or GPU (single instance only) | No | 
| CatBoost | training and (optionally) validation | File | CSV | CPU (single instance only) | No | 
| Factorization Machines | train and (optionally) test | File or Pipe | recordIO-protobuf | CPU (GPU for dense data) | Yes | 
| K-Nearest-Neighbors (k-NN) | train and (optionally) test | File or Pipe | recordIO-protobuf or CSV | CPU or GPU (single GPU device on one or more instances) | Yes | 
| LightGBM | training and (optionally) validation | File | CSV | CPU (single instance only) | No | 
| Linear Learner | train and (optionally) validation, test, or both | File or Pipe | recordIO-protobuf or CSV | CPU or GPU | Yes | 
| TabTransformer | training and (optionally) validation | File | CSV | CPU or GPU (single instance only) | No | 
| XGBoost (0.90-1, 0.90-2, 1.0-1, 1.2-1, 1.2-2) | train and (optionally) validation | File or Pipe | CSV, LibSVM, or Parquet | CPU (or GPU for 1.2-1) | Yes | 

# AutoGluon-Tabular

[AutoGluon-Tabular](https://auto.gluon.ai/stable/index.html) is a popular open-source AutoML framework that trains highly accurate machine learning models on an unprocessed tabular dataset. Unlike existing AutoML frameworks that primarily focus on model and hyperparameter selection, AutoGluon-Tabular succeeds by ensembling multiple models and stacking them in multiple layers. This page includes information about Amazon EC2 instance recommendations and sample notebooks for AutoGluon-Tabular.

# How to use SageMaker AI AutoGluon-Tabular

You can use AutoGluon-Tabular as an Amazon SageMaker AI built-in algorithm. The following section describes how to use AutoGluon-Tabular with the SageMaker Python SDK. For information on how to use AutoGluon-Tabular from the Amazon SageMaker Studio Classic UI, see [SageMaker JumpStart pretrained models](studio-jumpstart.md).
+ **Use AutoGluon-Tabular as a built-in algorithm**

  Use the AutoGluon-Tabular built-in algorithm to build an AutoGluon-Tabular training container as shown in the following code example. You can automatically retrieve the AutoGluon-Tabular built-in algorithm image URI using the SageMaker AI `image_uris.retrieve` API (or the `get_image_uri` API if you use version 1 of the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable)). 

  After specifying the AutoGluon-Tabular image URI, you can use the AutoGluon-Tabular container to construct an estimator using the SageMaker AI Estimator API and initiate a training job. The AutoGluon-Tabular built-in algorithm runs in script mode, but the training script is provided for you and there is no need to replace it. If you have extensive experience using script mode to create a SageMaker training job, then you can incorporate your own AutoGluon-Tabular training scripts.

  ```
  from sagemaker import image_uris, model_uris, script_uris
  
  train_model_id, train_model_version, train_scope = "autogluon-classification-ensemble", "*", "training"
  training_instance_type = "ml.p3.2xlarge"
  
  # Retrieve the docker image
  train_image_uri = image_uris.retrieve(
      region=None,
      framework=None,
      model_id=train_model_id,
      model_version=train_model_version,
      image_scope=train_scope,
      instance_type=training_instance_type
  )
  
  # Retrieve the training script
  train_source_uri = script_uris.retrieve(
      model_id=train_model_id, model_version=train_model_version, script_scope=train_scope
  )
  
  train_model_uri = model_uris.retrieve(
      model_id=train_model_id, model_version=train_model_version, model_scope=train_scope
  )
  
  # Sample training data is available in this bucket
  training_data_bucket = f"jumpstart-cache-prod-{aws_region}"
  training_data_prefix = "training-datasets/tabular_binary/"
  
  training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}train"
  validation_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}validation"
  
  output_bucket = sess.default_bucket()
  output_prefix = "jumpstart-example-tabular-training"
  
  s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"
  
  from sagemaker import hyperparameters
  
  # Retrieve the default hyperparameters for training the model
  hyperparameters = hyperparameters.retrieve_default(
      model_id=train_model_id, model_version=train_model_version
  )
  
  # [Optional] Override default hyperparameters with custom values
  hyperparameters["auto_stack"] = "True"
  print(hyperparameters)
  
  from sagemaker.estimator import Estimator
  from sagemaker.utils import name_from_base
  
  training_job_name = name_from_base(f"built-in-algo-{train_model_id}-training")
  
  # Create SageMaker Estimator instance
  tabular_estimator = Estimator(
      role=aws_role,
      image_uri=train_image_uri,
      source_dir=train_source_uri,
      model_uri=train_model_uri,
      entry_point="transfer_learning.py",
      instance_count=1,
      instance_type=training_instance_type,
      max_run=360000,
      hyperparameters=hyperparameters,
      output_path=s3_output_location
  )
  
  # Launch a SageMaker Training job by passing the S3 path of the training data
  tabular_estimator.fit(
      {
          "training": training_dataset_s3_path,
          "validation": validation_dataset_s3_path,
      }, logs=True, job_name=training_job_name
  )
  ```

  For more information about how to set up the AutoGluon-Tabular as a built-in algorithm, see the following notebook examples. Any S3 bucket used in these examples must be in the same AWS Region as the notebook instance used to run them.
  + [Tabular classification with Amazon SageMaker AI AutoGluon-Tabular algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/autogluon_tabular/Amazon_Tabular_Classification_AutoGluon.ipynb)
  + [Tabular regression with Amazon SageMaker AI AutoGluon-Tabular algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/autogluon_tabular/Amazon_Tabular_Regression_AutoGluon.ipynb)

# Input and Output interface for the AutoGluon-Tabular algorithm


AutoGluon-Tabular operates on tabular data, with the rows representing observations, one column representing the target variable or label, and the remaining columns representing features. 

The SageMaker AI implementation of AutoGluon-Tabular supports CSV for training and inference:
+ For **Training ContentType**, valid inputs must be *text/csv*.
+ For **Inference ContentType**, valid inputs must be *text/csv*.

**Note**  
For CSV training, the algorithm assumes that the target variable is in the first column and that the CSV does not have a header record.   
For CSV inference, the algorithm assumes that CSV input does not have the label column. 

**Input format for training data, validation data, and categorical features**

Be mindful of how to format your training data for input to the AutoGluon-Tabular model. You must provide the path to an Amazon S3 bucket that contains your training and validation data. You can also include a list of categorical features. Use both the `training` and `validation` channels to provide your input data. Alternatively, you can use only the `training` channel.

**Use both the `training` and `validation` channels**

You can provide your input data by way of two S3 paths, one for the `training` channel and one for the `validation` channel. Each S3 path can either be an S3 prefix or a full S3 path pointing to one specific CSV file. The target variables should be in the first column of your CSV file. The predictor variables (features) should be in the remaining columns. The validation data is used to compute a validation score at the end of each boosting iteration. Early stopping is applied when the validation score stops improving.

If your predictors include categorical features, you can provide a JSON file named `categorical_index.json` in the same location as your training data file. If you provide a JSON file for categorical features, your `training` channel must point to an S3 prefix and not a specific CSV file. This file should contain a Python dictionary where the key is the string `"cat_index_list"` and the value is a list of unique integers. Each integer in the value list should indicate the column index of the corresponding categorical feature in your training data CSV file. Each value should be a positive integer (greater than zero, because zero represents the target column), less than the `Int32.MaxValue` (2147483647), and less than the total number of columns. There should only be one categorical index JSON file.
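
As an illustration, suppose the features in columns 2 and 5 of your training CSV are categorical (these positions are hypothetical). The following sketch generates the corresponding `categorical_index.json`:

```python
import json

# Columns 2 and 5 are categorical in this example (column 0 is the target).
categorical_index = {"cat_index_list": [2, 5]}

with open("categorical_index.json", "w") as f:
    json.dump(categorical_index, f)
```

Upload the resulting file to the same S3 prefix as your training data.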

**Use only the `training` channel**:

You can alternatively provide your input data by way of a single S3 path for the `training` channel. This S3 path should point to a directory with a subdirectory named `training/` that contains a CSV file. You can optionally include another subdirectory in the same location called `validation/` that also has a CSV file. If the validation data is not provided, then 20% of your training data is randomly sampled to serve as the validation data. If your predictors include categorical features, you can provide a JSON file named `categorical_index.json` in the same location as your data subdirectories.
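
A minimal sketch of preparing this layout locally before uploading to S3 (directory names are fixed by the convention above; the dataset name and values are placeholders):

```python
from pathlib import Path

base = Path("my-dataset")                   # upload this directory to S3 as the training channel
(base / "training").mkdir(parents=True, exist_ok=True)
(base / "validation").mkdir(exist_ok=True)  # optional; omit to have 20% of training data sampled

(base / "training" / "data.csv").write_text("1,5.1,3.5\n0,4.9,3.0\n")
(base / "validation" / "data.csv").write_text("1,6.2,2.9\n")
```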

**Note**  
For CSV training input mode, the total memory available to the algorithm (instance count multiplied by the memory available in the `InstanceType`) must be able to hold the training dataset.

SageMaker AI AutoGluon-Tabular uses the `autogluon.tabular.TabularPredictor` module to serialize or deserialize the model, which can be used for saving or loading the model.

**To use a model trained with SageMaker AI AutoGluon-Tabular with the AutoGluon framework**
+ Use the following Python code:

  ```
  import tarfile
  from autogluon.tabular import TabularPredictor
  
  t = tarfile.open('model.tar.gz', 'r:gz')
  t.extractall()
  
  # model_file_path is the directory that the model archive was extracted to
  model = TabularPredictor.load(model_file_path)
  
  # prediction with test data
  # dtest should be a pandas DataFrame with column names feature_0, feature_1, ..., feature_d
  pred = model.predict(dtest)
  ```

## Amazon EC2 instance recommendation for the AutoGluon-Tabular algorithm


SageMaker AI AutoGluon-Tabular supports single-instance CPU and single-instance GPU training. Despite higher per-instance costs, GPUs train more quickly, making them more cost effective. To take advantage of GPU training, specify the instance type as one of the GPU instances (for example, P3). SageMaker AI AutoGluon-Tabular currently does not support multi-GPU training.

## AutoGluon-Tabular sample notebooks

The following table outlines a variety of sample notebooks that address different use cases of the Amazon SageMaker AI AutoGluon-Tabular algorithm.



| **Notebook Title** | **Description** | 
| --- | --- | 
|  [Tabular classification with Amazon SageMaker AI AutoGluon-Tabular algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/autogluon_tabular/Amazon_Tabular_Classification_AutoGluon.ipynb)  |  This notebook demonstrates the use of the Amazon SageMaker AI AutoGluon-Tabular algorithm to train and host a tabular classification model.  | 
|  [Tabular regression with Amazon SageMaker AI AutoGluon-Tabular algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/autogluon_tabular/Amazon_Tabular_Regression_AutoGluon.ipynb)  |  This notebook demonstrates the use of the Amazon SageMaker AI AutoGluon-Tabular algorithm to train and host a tabular regression model.  | 

For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, choose the **SageMaker AI Examples** tab to see a list of all of the SageMaker AI samples. To open a notebook, choose its **Use** tab and choose **Create copy**.

# How AutoGluon-Tabular works

AutoGluon-Tabular performs advanced data processing, deep learning, and multi-layer model ensemble methods. It automatically recognizes the data type in each column for robust data preprocessing, including special handling of text fields. 

AutoGluon fits various models ranging from off-the-shelf boosted trees to customized neural networks. These models are ensembled in a novel way: models are stacked in multiple layers and trained in a layer-wise manner that guarantees raw data can be translated into high-quality predictions within a given time constraint. This process mitigates overfitting by splitting the data in various ways with careful tracking of out-of-fold examples.
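
The out-of-fold idea can be sketched in a few lines (illustrative only, not AutoGluon's implementation): each example's stacker input is produced by a base model that was fit without that example, so labels never leak into the inputs of the next layer.

```python
# Minimal sketch of out-of-fold (OOF) predictions for stacking.
def out_of_fold_predictions(X, y, fit, predict, k=5):
    n = len(X)
    oof = [None] * n
    folds = [list(range(i, n, k)) for i in range(k)]  # simple round-robin folds
    for fold in folds:
        train_idx = [i for i in range(n) if i not in fold]
        # The base model is fit without the held-out fold...
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        # ...so its predictions for that fold never saw those labels.
        for i in fold:
            oof[i] = predict(model, X[i])
    return oof
```

With a trivial base model that predicts the training-label mean, `out_of_fold_predictions([0]*4, [1, 2, 3, 4], fit=lambda X, y: sum(y)/len(y), predict=lambda m, x: m, k=2)` returns `[3.0, 2.0, 3.0, 2.0]`: each prediction comes from a model fit on the folds that excluded that example.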

The AutoGluon-Tabular algorithm performs well in machine learning competitions because of its robust handling of a variety of data types, relationships, and distributions. You can use AutoGluon-Tabular for regression, classification (binary and multiclass), and ranking problems.

Refer to the following diagram illustrating how the multi-layer stacking strategy works.

![\[AutoGluon's multi-layer stacking strategy shown with two stacking layers.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/autogluon_tabular_illustration.png)


For more information, see *[AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data](https://arxiv.org/pdf/2003.06505.pdf)*.

# AutoGluon-Tabular hyperparameters

The following table contains the subset of hyperparameters that are required or most commonly used for the Amazon SageMaker AI AutoGluon-Tabular algorithm. Users set these parameters to facilitate the estimation of model parameters from data. The SageMaker AI AutoGluon-Tabular algorithm is an implementation of the open-source [AutoGluon-Tabular](https://github.com/awslabs/autogluon) package.

**Note**  
The default hyperparameters are based on example datasets in the [AutoGluon-Tabular sample notebooks](autogluon-tabular.md#autogluon-tabular-sample-notebooks).

By default, the SageMaker AI AutoGluon-Tabular algorithm automatically chooses an evaluation metric based on the type of classification problem. The algorithm detects the type of classification problem based on the number of labels in your data. For regression problems, the evaluation metric is root mean squared error. For binary classification problems, the evaluation metric is area under the receiver operating characteristic curve (AUC). For multiclass classification problems, the evaluation metric is accuracy. You can use the `eval_metric` hyperparameter to change the default evaluation metric. Refer to the following table for more information on AutoGluon-Tabular hyperparameters, including descriptions, valid values, and default values.
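
A rough sketch of this kind of label-based detection (the non-integer check for regression is an assumption for illustration; the actual rules are internal to the algorithm):

```python
# Hypothetical sketch of label-based problem-type detection; the real rules
# are internal to the SageMaker AI implementation.
def infer_problem_type(labels):
    unique = set(labels)
    if any(float(v) != int(float(v)) for v in unique):
        return "regression"   # non-integer labels -> regression (assumption); default metric: RMSE
    if len(unique) == 2:
        return "binary"       # default metric: AUC
    return "multiclass"       # default metric: accuracy
```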


| Parameter Name | Description | 
| --- | --- | 
| `eval_metric` |  The evaluation metric for validation data. If `eval_metric` is set to the default `"auto"` value, then the algorithm automatically chooses an evaluation metric based on the type of classification problem: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/autogluon-tabular-hyperparameters.html) Valid values: string, refer to the [AutoGluon documentation](https://auto.gluon.ai/stable/api/autogluon.tabular.TabularPredictor.html) for valid values. Default value: `"auto"`.  | 
| `presets` |  List of preset configurations for various arguments in `fit()`.  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/autogluon-tabular-hyperparameters.html) For more details, see [AutoGluon Predictors](https://auto.gluon.ai/stable/api/autogluon.tabular.TabularPredictor.html). Valid values: string, any of the following: (`"best_quality"`, `"high_quality"`, `"good_quality"`, `"medium_quality"`, `"optimize_for_deployment"`, or `"interpretable"`). Default value: `"medium_quality"`.  | 
| `auto_stack` |  Whether AutoGluon should automatically utilize bagging and multi-layer stack ensembling to boost predictive accuracy. Set `auto_stack` to `"True"` if you are willing to tolerate longer training times in order to maximize predictive accuracy. This automatically sets the `num_bag_folds` and `num_stack_levels` arguments based on dataset properties.  Valid values: string, `"True"` or `"False"`. Default value: `"False"`.  | 
| `num_bag_folds` |  Number of folds used for bagging of models. When `num_bag_folds` is equal to `k`, training time is roughly increased by a factor of `k`. Set `num_bag_folds` to 0 to deactivate bagging. This is deactivated by default, but we recommend using values between 5 and 10 to maximize predictive performance. A value of 1 is invalid and raises a `ValueError`. Values greater than 10 may produce diminishing returns and can even harm overall results due to overfitting. To further improve predictions, avoid increasing `num_bag_folds` and instead increase `num_bag_sets`. Valid values: string, any integer between (and including) `"0"` and `"10"`. Default value: `"0"`.  | 
| `num_bag_sets` |  Number of repeats of k-fold bagging to perform (values must be greater than or equal to 1). The total number of models trained during bagging is equal to `num_bag_folds` x `num_bag_sets`. This parameter defaults to 1 if `time_limit` is not specified, and is ignored if `num_bag_folds` is not specified. Values greater than 1 result in superior predictive performance, especially on smaller problems and with stacking enabled.  Valid values: integer, range: [`1`, `20`]. Default value: `1`.  | 
| `num_stack_levels` |  Number of stacking levels to use in the stack ensemble. Roughly increases model training time by a factor of `num_stack_levels` + 1. Set this parameter to 0 to deactivate stack ensembling. This parameter is deactivated by default, but we recommend using values between 1 and 3 to maximize predictive performance. To prevent overfitting and a `ValueError`, `num_bag_folds` must be greater than or equal to 2. Valid values: float, range: [`0`, `3`]. Default value: `0`.  | 
| `refit_full` |  Whether or not to retrain all models on all of the data (training and validation) after the normal training procedure. For more details, see [AutoGluon Predictors](https://auto.gluon.ai/stable/api/autogluon.tabular.TabularPredictor.html). Valid values: string, `"True"` or `"False"`. Default value: `"False"`.  | 
| `set_best_to_refit_full` |  Whether or not to change the default model that the predictor uses for prediction. If `set_best_to_refit_full` is set to `"True"`, the default model changes to the model that exhibited the highest validation score as a result of refitting (activated by `refit_full`). Only valid if `refit_full` is set. Valid values: string, `"True"` or `"False"`. Default value: `"False"`.  | 
| `save_space` |  Whether or not to reduce the memory and disk size of the predictor by deleting auxiliary model files that aren't needed for prediction on new data. This has no impact on inference accuracy. We recommend setting `save_space` to `"True"` if your only goal is to use the trained model for prediction. Certain advanced functionality may no longer be available if `save_space` is set to `"True"`. Refer to the [`predictor.save_space()`](https://auto.gluon.ai/stable/api/autogluon.tabular.TabularPredictor.save_space.html) documentation for more details. Valid values: string, `"True"` or `"False"`. Default value: `"False"`.  | 
| `verbosity` |  The verbosity of print messages. `verbosity` levels range from `0` to `4`, with higher levels corresponding to more detailed print statements. A `verbosity` of `0` suppresses warnings.  Valid values: integer, any of the following: (`0`, `1`, `2`, `3`, or `4`). Default value: `2`.  | 

# Tuning an AutoGluon-Tabular model

Although AutoGluon-Tabular can be used with model tuning, its design delivers good performance out of the box through stacking and ensemble methods, so hyperparameter optimization is typically not necessary. Rather than focusing on model tuning, AutoGluon-Tabular succeeds by stacking models in multiple layers and training in a layer-wise manner. 

For more information about AutoGluon-Tabular hyperparameters, see [AutoGluon-Tabular hyperparameters](autogluon-tabular-hyperparameters.md).

# CatBoost

[CatBoost](https://catboost.ai/) is a popular and high-performance open-source implementation of the Gradient Boosting Decision Tree (GBDT) algorithm. GBDT is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler and weaker models.

CatBoost introduces two critical algorithmic advances to GBDT:

1. The implementation of ordered boosting, a permutation-driven alternative to the classic algorithm

1. An innovative algorithm for processing categorical features

Both techniques were created to fight a prediction shift caused by a special kind of target leakage present in all currently existing implementations of gradient boosting algorithms. This page includes information about Amazon EC2 instance recommendations and sample notebooks for CatBoost.
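
The categorical-feature technique, ordered target statistics, can be sketched as follows (a simplified illustration, not CatBoost's exact implementation): each example's category is encoded using only the targets of earlier examples, which avoids the target leakage that causes prediction shift.

```python
# Simplified sketch of ordered target statistics for categorical encoding.
def ordered_target_stats(categories, targets, prior=0.5):
    sums, counts, encoded = {}, {}, []
    for cat, y in zip(categories, targets):
        s, c = sums.get(cat, 0.0), counts.get(cat, 0)
        # Encode using only *previous* examples of this category plus a prior,
        # so an example's own target never influences its encoding.
        encoded.append((s + prior) / (c + 1))
        sums[cat] = s + y
        counts[cat] = c + 1
    return encoded
```

In practice CatBoost repeats this over several random permutations of the data; this sketch uses the given row order only.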

# How to use SageMaker AI CatBoost

You can use CatBoost as an Amazon SageMaker AI built-in algorithm. The following section describes how to use CatBoost with the SageMaker Python SDK. For information on how to use CatBoost from the Amazon SageMaker Studio Classic UI, see [SageMaker JumpStart pretrained models](studio-jumpstart.md).
+ **Use CatBoost as a built-in algorithm**

  Use the CatBoost built-in algorithm to build a CatBoost training container as shown in the following code example. You can automatically retrieve the CatBoost built-in algorithm image URI using the SageMaker AI `image_uris.retrieve` API (or the `get_image_uri` API if using [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) version 1). 

  After specifying the CatBoost image URI, you can use the CatBoost container to construct an estimator using the SageMaker AI Estimator API and initiate a training job. The CatBoost built-in algorithm runs in script mode, but the training script is provided for you and there is no need to replace it. If you have extensive experience using script mode to create a SageMaker training job, then you can incorporate your own CatBoost training scripts.

  ```
  import boto3
  import sagemaker
  from sagemaker import image_uris, model_uris, script_uris
  
  # Set up the role, Region, and session used later in this example
  aws_role = sagemaker.get_execution_role()
  aws_region = boto3.Session().region_name
  sess = sagemaker.Session()
  
  train_model_id, train_model_version, train_scope = "catboost-classification-model", "*", "training"
  training_instance_type = "ml.m5.xlarge"
  
  # Retrieve the docker image
  train_image_uri = image_uris.retrieve(
      region=None,
      framework=None,
      model_id=train_model_id,
      model_version=train_model_version,
      image_scope=train_scope,
      instance_type=training_instance_type
  )
  
  # Retrieve the training script
  train_source_uri = script_uris.retrieve(
      model_id=train_model_id, model_version=train_model_version, script_scope=train_scope
  )
  
  train_model_uri = model_uris.retrieve(
      model_id=train_model_id, model_version=train_model_version, model_scope=train_scope
  )
  
  # Sample training data is available in this bucket
  training_data_bucket = f"jumpstart-cache-prod-{aws_region}"
  training_data_prefix = "training-datasets/tabular_multiclass/"
  
  training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}/train"
  validation_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}/validation"
  
  output_bucket = sess.default_bucket()
  output_prefix = "jumpstart-example-tabular-training"
  
  s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"
  
  from sagemaker import hyperparameters
  
  # Retrieve the default hyperparameters for training the model
  hyperparameters = hyperparameters.retrieve_default(
      model_id=train_model_id, model_version=train_model_version
  )
  
  # [Optional] Override default hyperparameters with custom values
  hyperparameters[
      "iterations"
  ] = "500"
  print(hyperparameters)
  
  from sagemaker.estimator import Estimator
  from sagemaker.utils import name_from_base
  
  training_job_name = name_from_base(f"built-in-algo-{train_model_id}-training")
  
  # Create SageMaker Estimator instance
  tabular_estimator = Estimator(
      role=aws_role,
      image_uri=train_image_uri,
      source_dir=train_source_uri,
      model_uri=train_model_uri,
      entry_point="transfer_learning.py",
      instance_count=1,
      instance_type=training_instance_type,
      max_run=360000,
      hyperparameters=hyperparameters,
      output_path=s3_output_location
  )
  
  # Launch a SageMaker Training job by passing the S3 path of the training data
  tabular_estimator.fit(
      {
          "training": training_dataset_s3_path,
          "validation": validation_dataset_s3_path,
      }, logs=True, job_name=training_job_name
  )
  ```

  For more information about how to set up CatBoost as a built-in algorithm, see the following notebook examples.
  + [Tabular classification with Amazon SageMaker AI LightGBM and CatBoost algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/lightgbm_catboost_tabular/Amazon_Tabular_Classification_LightGBM_CatBoost.ipynb)
  + [Tabular regression with Amazon SageMaker AI LightGBM and CatBoost algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/lightgbm_catboost_tabular/Amazon_Tabular_Regression_LightGBM_CatBoost.ipynb)

# Input and Output interface for the CatBoost algorithm


Gradient boosting operates on tabular data, with the rows representing observations, one column representing the target variable or label, and the remaining columns representing features. 

The SageMaker AI implementation of CatBoost supports CSV for training and inference:
+ For **Training ContentType**, valid inputs must be *text/csv*.
+ For **Inference ContentType**, valid inputs must be *text/csv*.

**Note**  
For CSV training, the algorithm assumes that the target variable is in the first column and that the CSV does not have a header record.   
For CSV inference, the algorithm assumes that CSV input does not have the label column. 

**Input format for training data, validation data, and categorical features**

Be mindful of how to format your training data for input to the CatBoost model. You must provide the path to an Amazon S3 bucket that contains your training and validation data. You can also include a list of categorical features. Use both the `training` and `validation` channels to provide your input data. Alternatively, you can use only the `training` channel.

**Use both the `training` and `validation` channels**

You can provide your input data by way of two S3 paths, one for the `training` channel and one for the `validation` channel. Each S3 path can either be an S3 prefix that points to one or more CSV files or a full S3 path pointing to one specific CSV file. The target variables should be in the first column of your CSV file. The predictor variables (features) should be in the remaining columns. If multiple CSV files are provided for the `training` or `validation` channels, the CatBoost algorithm concatenates the files. The validation data is used to compute a validation score at the end of each boosting iteration. Early stopping is applied when the validation score stops improving.

If your predictors include categorical features, you can provide a JSON file named `categorical_index.json` in the same location as your training data file or files. If you provide a JSON file for categorical features, your `training` channel must point to an S3 prefix and not a specific CSV file. This file should contain a Python dictionary where the key is the string `"cat_index_list"` and the value is a list of unique integers. Each integer in the value list should indicate the column index of the corresponding categorical feature in your training data CSV file. Each value should be a positive integer (greater than zero, because zero represents the target column), less than the `Int32.MaxValue` (2147483647), and less than the total number of columns. There should only be one categorical index JSON file.

**Use only the `training` channel**:

You can alternatively provide your input data by way of a single S3 path for the `training` channel. This S3 path should point to a directory with a subdirectory named `training/` that contains one or more CSV files. You can optionally include another subdirectory in the same location called `validation/` that also has one or more CSV files. If the validation data is not provided, then 20% of your training data is randomly sampled to serve as the validation data. If your predictors include categorical features, you can provide a JSON file named `categorical_index.json` in the same location as your data subdirectories.

**Note**  
For CSV training input mode, the total memory available to the algorithm (instance count multiplied by the memory available in the `InstanceType`) must be able to hold the training dataset.

SageMaker AI CatBoost uses the `catboost.CatBoostClassifier` and `catboost.CatBoostRegressor` modules to serialize or deserialize the model, which can be used for saving or loading the model.

**To use a model trained with SageMaker AI CatBoost with `catboost`**
+ Use the following Python code:

  ```
  import os
  import tarfile
  
  from catboost import CatBoostClassifier
  
  t = tarfile.open('model.tar.gz', 'r:gz')
  t.extractall()
  
  # model_file_path is the directory that the model archive was extracted to
  file_path = os.path.join(model_file_path, "model")
  model = CatBoostClassifier()
  model.load_model(file_path)
  
  # prediction with test data
  # dtest should be a pandas DataFrame with column names feature_0, feature_1, ..., feature_d
  pred = model.predict(dtest)
  ```

## Amazon EC2 instance recommendation for the CatBoost algorithm


SageMaker AI CatBoost currently only trains using CPUs. CatBoost is a memory-bound (as opposed to compute-bound) algorithm. So, a general-purpose compute instance (for example, M5) is a better choice than a compute-optimized instance (for example, C5). Further, we recommend that you have enough total memory in selected instances to hold the training data. 

## CatBoost sample notebooks

The following table outlines a variety of sample notebooks that address different use cases of the Amazon SageMaker AI CatBoost algorithm.



| **Notebook Title** | **Description** | 
| --- | --- | 
|  [Tabular classification with Amazon SageMaker AI LightGBM and CatBoost algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/lightgbm_catboost_tabular/Amazon_Tabular_Classification_LightGBM_CatBoost.ipynb)  |  This notebook demonstrates the use of the Amazon SageMaker AI CatBoost algorithm to train and host a tabular classification model.   | 
|  [Tabular regression with Amazon SageMaker AI LightGBM and CatBoost algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/lightgbm_catboost_tabular/Amazon_Tabular_Regression_LightGBM_CatBoost.ipynb)  |  This notebook demonstrates the use of the Amazon SageMaker AI CatBoost algorithm to train and host a tabular regression model.   | 

For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, choose the **SageMaker AI Examples** tab to see a list of all of the SageMaker AI samples. To open a notebook, choose its **Use** tab and choose **Create copy**.

# How CatBoost Works

CatBoost implements a conventional Gradient Boosting Decision Tree (GBDT) algorithm with the addition of two critical algorithmic advances:

1. The implementation of ordered boosting, a permutation-driven alternative to the classic algorithm

1. An innovative algorithm for processing categorical features

Both techniques were created to fight a prediction shift caused by a special kind of target leakage present in all currently existing implementations of gradient boosting algorithms.

The CatBoost algorithm performs well in machine learning competitions because of its robust handling of a variety of data types, relationships, distributions, and the diversity of hyperparameters that you can fine-tune. You can use CatBoost for regression, classification (binary and multiclass), and ranking problems.

For more information on gradient boosting, see [How the SageMaker AI XGBoost algorithm works](xgboost-HowItWorks.md). For in-depth details about the ordered boosting and categorical feature processing techniques used in the CatBoost method, see *[CatBoost: unbiased boosting with categorical features](https://arxiv.org/pdf/1706.09516.pdf)*.

# CatBoost hyperparameters

The following table contains the subset of hyperparameters that are required or most commonly used for the Amazon SageMaker AI CatBoost algorithm. Users set these parameters to facilitate the estimation of model parameters from data. The SageMaker AI CatBoost algorithm is an implementation of the open-source [CatBoost](https://github.com/catboost/catboost) package.

**Note**  
The default hyperparameters are based on example datasets in the [CatBoost sample notebooks](catboost.md#catboost-sample-notebooks).

By default, the SageMaker AI CatBoost algorithm automatically chooses an evaluation metric and loss function based on the type of classification problem. The CatBoost algorithm detects the type of classification problem based on the number of labels in your data. For regression problems, the evaluation metric and loss function are both root mean squared error. For binary classification problems, the evaluation metric is Area Under the Curve (AUC) and the loss function is log loss. For multiclass classification problems, the evaluation metric and loss function are both multiclass cross entropy. You can use the `eval_metric` hyperparameter to change the default evaluation metric. Refer to the following table for more information on CatBoost hyperparameters, including descriptions, valid values, and default values.
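
For example, after retrieving the default hyperparameters (as in the earlier training example), you can override the metric before passing the dictionary to the estimator. The dictionary below is a simplified stand-in for what `hyperparameters.retrieve_default` returns:

```python
# Simplified stand-in for the dictionary returned by hyperparameters.retrieve_default
hyperparameters = {
    "iterations": "500",
    "eval_metric": "auto",
    "early_stopping_rounds": "5",
}

# Override the automatic metric selection, for example to optimize F1 on a binary problem
hyperparameters["eval_metric"] = "F1"
```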


| Parameter Name | Description | 
| --- | --- | 
| `iterations` |  The maximum number of trees that can be built. Valid values: integer, range: Positive integer. Default value: `500`.  | 
| `early_stopping_rounds` |  Training stops if one metric on the validation data does not improve in the last `early_stopping_rounds` rounds. If `early_stopping_rounds` is less than or equal to zero, this hyperparameter is ignored. Valid values: integer. Default value: `5`.  | 
| `eval_metric` |  The evaluation metric for validation data. If `eval_metric` is set to the default `"auto"` value, then the algorithm automatically chooses an evaluation metric based on the type of classification problem: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/catboost-hyperparameters.html) Valid values: string, refer to the [CatBoost documentation](https://catboost.ai/en/docs/references/eval-metric__supported-metrics) for valid values. Default value: `"auto"`.  | 
| `learning_rate` |  The rate at which the model weights are updated after working through each batch of training examples. Valid values: float, range: (`0.0`, `1.0`). Default value: `0.009`.  | 
| `depth` |  Depth of the tree. Valid values: integer, range: (`1`, `16`). Default value: `6`.  | 
| `l2_leaf_reg` |  Coefficient for the L2 regularization term of the cost function. Valid values: integer, range: Positive integer. Default value: `3`.  | 
| `random_strength` |  The amount of randomness to use for scoring splits when the tree structure is selected. Use this parameter to avoid overfitting the model. Valid values: float, range: Positive floating point number. Default value: `1.0`.  | 
| `max_leaves` |  The maximum number of leaves in the resulting tree. Can only be used with the `"Lossguide"` growing policy. Valid values: integer, range: [`2`, `64`]. Default value: `31`.  | 
| `rsm` |  Random subspace method. The percentage of features to use at each split selection, when features are selected over again at random. Valid values: float, range: (`0.0`, `1.0`]. Default value: `1.0`.  | 
| `sampling_frequency` |  Frequency to sample weights and objects when building trees. Valid values: string, either: (`"PerTreeLevel"` or `"PerTree"`). Default value: `"PerTreeLevel"`.  | 
| `min_data_in_leaf` |  The minimum number of training samples in a leaf. CatBoost does not search for new splits in leaves with a sample count less than the specified value. Can only be used with the `"Lossguide"` and `"Depthwise"` growing policies. Valid values: integer, range: [`1`, `∞`). Default value: `1`.  | 
| `bagging_temperature` |  Defines the settings of the Bayesian bootstrap. Use the Bayesian bootstrap to assign random weights to objects. If `bagging_temperature` is set to `1.0`, then the weights are sampled from an exponential distribution. If `bagging_temperature` is set to `0.0`, then all weights are 1.0. Valid values: float, range: Non-negative float. Default value: `1.0`.  | 
| `boosting_type` |  The boosting scheme. `"Auto"` means that `boosting_type` is selected based on the processing unit type, the number of objects in the training dataset, and the selected learning mode. Valid values: string, any of the following: (`"Auto"`, `"Ordered"`, or `"Plain"`). Default value: `"Auto"`.  | 
| `scale_pos_weight` |  The weight for the positive class in binary classification. The value is used as a multiplier for the weights of objects from the positive class. Valid values: float, range: Positive float. Default value: `1.0`.  | 
| `max_bin` |  The number of splits for numerical features. `"Auto"` means that `max_bin` is selected based on the processing unit type and other parameters. For details, see the CatBoost documentation. Valid values: string, either: (`"Auto"` or a string of an integer from `"1"` to `"65535"` inclusive). Default value: `"Auto"`.  | 
| `grow_policy` |  The tree growing policy. Defines how to perform greedy tree construction. Valid values: string, any of the following: (`"SymmetricTree"`, `"Depthwise"`, or `"Lossguide"`). Default value: `"SymmetricTree"`.  | 
| `random_seed` |  The random seed used for training. Valid values: integer, range: Non-negative integer. Default value: `1`. | 
| `thread_count` |  The number of threads to use during training. If `thread_count` is `-1`, then the number of threads is equal to the number of processor cores. `thread_count` cannot be `0`. Valid values: integer, either: (`-1` or positive integer). Default value: `-1`.  | 
| `verbose` |  The verbosity of print messages, with higher levels corresponding to more detailed print statements. Valid values: integer, range: Positive integer. Default value: `1`.  | 

# Tune a CatBoost model

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your training and validation datasets. Model tuning focuses on the following hyperparameters:
+ A learning loss function to optimize during model training
+ An evaluation metric that is used to evaluate model performance during validation
+ A set of hyperparameters and a range of values for each to use when tuning the model automatically

**Note**  
The learning loss function is automatically assigned based on the type of classification task, which is determined by the number of unique integers in the label column. For more information, see [CatBoost hyperparameters](catboost-hyperparameters.md).

Automatic model tuning searches your chosen hyperparameters to find the combination of values that results in a model that optimizes the chosen evaluation metric.

**Note**  
Automatic model tuning for CatBoost is only available from the Amazon SageMaker SDKs, not from the SageMaker AI console.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Evaluation metrics computed by the CatBoost algorithm

The SageMaker AI CatBoost algorithm computes the following metrics to use for model validation. The evaluation metric is automatically assigned based on the type of classification task, which is determined by the number of unique integers in the label column.


| Metric Name | Description | Optimization Direction | Regex Pattern | 
| --- | --- | --- | --- | 
| RMSE | root mean square error | minimize | "bestTest = ([0-9\\.]+)" | 
| MAE | mean absolute error | minimize | "bestTest = ([0-9\\.]+)" | 
| MedianAbsoluteError | median absolute error | minimize | "bestTest = ([0-9\\.]+)" | 
| R2 | r2 score | maximize | "bestTest = ([0-9\\.]+)" | 
| Logloss | binary cross entropy | minimize | "bestTest = ([0-9\\.]+)" | 
| Precision | precision | maximize | "bestTest = ([0-9\\.]+)" | 
| Recall | recall | maximize | "bestTest = ([0-9\\.]+)" | 
| F1 | f1 score | maximize | "bestTest = ([0-9\\.]+)" | 
| AUC | auc score | maximize | "bestTest = ([0-9\\.]+)" | 
| MultiClass | multiclass cross entropy | minimize | "bestTest = ([0-9\\.]+)" | 
| Accuracy | accuracy | maximize | "bestTest = ([0-9\\.]+)" | 
| BalancedAccuracy | balanced accuracy | maximize | "bestTest = ([0-9\\.]+)" | 

## Tunable CatBoost hyperparameters

Tune the CatBoost model with the following hyperparameters. The hyperparameters that have the greatest effect on optimizing the CatBoost evaluation metrics are: `learning_rate`, `depth`, `l2_leaf_reg`, and `random_strength`. For a list of all the CatBoost hyperparameters, see [CatBoost hyperparameters](catboost-hyperparameters.md).


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| learning\_rate | ContinuousParameterRanges | MinValue: 0.001, MaxValue: 0.01 | 
| depth | IntegerParameterRanges | MinValue: 4, MaxValue: 10 | 
| l2\_leaf\_reg | IntegerParameterRanges | MinValue: 2, MaxValue: 10 | 
| random\_strength | ContinuousParameterRanges | MinValue: 0, MaxValue: 10 | 

# Factorization Machines Algorithm

The Factorization Machines algorithm is a general-purpose supervised learning algorithm that you can use for both classification and regression tasks. It is an extension of a linear model that is designed to capture interactions between features within high dimensional sparse datasets economically. For example, in a click prediction system, the Factorization Machines model can capture click rate patterns observed when ads from a certain ad-category are placed on pages from a certain page-category. Factorization machines are a good choice for tasks dealing with high dimensional sparse datasets, such as click prediction and item recommendation.

**Note**  
The Amazon SageMaker AI implementation of the Factorization Machines algorithm considers only pair-wise (2nd order) interactions between features.

**Topics**
+ [Input/Output Interface for the Factorization Machines Algorithm](#fm-inputoutput)
+ [EC2 Instance Recommendation for the Factorization Machines Algorithm](#fm-instances)
+ [Factorization Machines Sample Notebooks](#fm-sample-notebooks)
+ [How Factorization Machines Work](fact-machines-howitworks.md)
+ [Factorization Machines Hyperparameters](fact-machines-hyperparameters.md)
+ [Tune a Factorization Machines Model](fm-tuning.md)
+ [Factorization Machines Response Formats](fm-in-formats.md)

## Input/Output Interface for the Factorization Machines Algorithm


The Factorization Machines algorithm can be run in either binary classification mode or regression mode. In each mode, a dataset can be provided to the **test** channel along with the train channel dataset. The scoring depends on the mode used. In regression mode, the testing dataset is scored using Root Mean Square Error (RMSE). In binary classification mode, the test dataset is scored using Binary Cross Entropy (Log Loss), Accuracy (at threshold = 0.5), and F1 Score (at threshold = 0.5).

For **training**, the Factorization Machines algorithm currently supports only the `recordIO-protobuf` format with `Float32` tensors. Because their use case is predominantly on sparse data, `CSV` is not a good candidate. Both File and Pipe mode training are supported for recordIO-wrapped protobuf.

For **inference**, the Factorization Machines algorithm supports the `application/json` and `x-recordio-protobuf` formats. 
+ For the **binary classification** problem, the algorithm predicts a score and a label. The label is a number and can be either `0` or `1`. The score is a number that indicates how strongly the algorithm believes that the label should be `1`. The algorithm computes score first and then derives the label from the score value. If the score is greater than or equal to 0.5, the label is `1`.
+ For the **regression** problem, just a score is returned and it is the predicted value. For example, if Factorization Machines is used to predict a movie rating, score is the predicted rating value.
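As a quick illustration of the threshold rule above, here is a minimal hypothetical helper (not part of the SageMaker AI API) that derives the label from the score:

```python
# Hypothetical sketch of the scoring rule described above:
# the label is 1 when the score is at least 0.5, otherwise 0.
def label_from_score(score: float) -> int:
    return 1 if score >= 0.5 else 0
```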

See [Factorization Machines Sample Notebooks](#fm-sample-notebooks) for more details on training and inference file formats.

## EC2 Instance Recommendation for the Factorization Machines Algorithm


The Amazon SageMaker AI Factorization Machines algorithm is highly scalable and can train across distributed instances. We recommend training and inference with CPU instances for both sparse and dense datasets. In some circumstances, training with one or more GPUs on dense data might provide some benefit. Training with GPUs is available only on dense data. Use CPU instances for sparse data. The Factorization Machines algorithm supports P2, P3, G4dn, and G5 instances for training and inference.

## Factorization Machines Sample Notebooks

For a sample notebook that uses the SageMaker AI Factorization Machines algorithm to analyze the images of handwritten digits from zero to nine in the MNIST dataset, see [An Introduction to Factorization Machines with MNIST](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/factorization_machines_mnist/factorization_machines_mnist.html). For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, select the **SageMaker AI Examples** tab to see a list of all the SageMaker AI samples. Example notebooks that use the Factorization Machines algorithm are located in the **Introduction to Amazon algorithms** section. To open a notebook, click on its **Use** tab and select **Create copy**.

# How Factorization Machines Work

The prediction task for a Factorization Machines model is to estimate a function ŷ from a feature set xi to a target domain. This domain is real-valued for regression and binary for classification. The Factorization Machines model is supervised and so has a training dataset (xi, yi) available. The advantages this model presents lie in the way it uses a factorized parametrization to capture the pairwise feature interactions. It can be represented mathematically as follows: 

![\[An image containing the equation for the Factorization Machines model.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/FM1.jpg)


 The three terms in this equation correspond respectively to the three components of the model: 
+ The w0 term represents the global bias.
+ The wi linear terms model the strength of the ith variable.
+ The <vi,vj> factorization terms model the pairwise interaction between the ith and jth variable.

The global bias and linear terms are the same as in a linear model. The pairwise feature interactions are modeled in the third term as the inner product of the corresponding factors learned for each feature. Learned factors can also be considered as embedding vectors for each feature. For example, in a classification task, if a pair of features tends to co-occur more often in positive labeled samples, then the inner product of their factors would be large. In other words, their embedding vectors would be close to each other in cosine similarity. For more information about the Factorization Machines model, see [Factorization Machines](https://www.ismll.uni-hildesheim.de/pub/pdfs/Rendle2010FM.pdf).
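The three-term model above can be sketched in plain Python. This is an illustrative toy, not the SageMaker AI implementation; all names (`fm_predict`, `w0`, `w`, `V`) are hypothetical:

```python
# Illustrative toy of the Factorization Machines prediction, matching the
# three terms described above (global bias, linear terms, pairwise factors).
def fm_predict(w0, w, V, x):
    """w0: global bias; w: linear weights, one per feature;
    V: one factor vector per feature; x: feature values."""
    linear = sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dot = sum(a * b for a, b in zip(V[i], V[j]))  # <v_i, v_j>
            pairwise += dot * x[i] * x[j]
    return w0 + linear + pairwise
```

With two features, `w0 = 1.0`, `w = [1.0, 2.0]`, and factors `[[1.0, 0.0], [0.5, 0.0]]`, a unit input `x = [1.0, 1.0]` gives 1 + 3 + 0.5 = 4.5.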

For regression tasks, the model is trained by minimizing the squared error between the model prediction ŷn and the target value yn. This is known as the square loss:

![\[An image containing the equation for square loss.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/FM2.jpg)


For a classification task, the model is trained by minimizing the cross entropy loss, also known as the log loss: 

![\[An image containing the equation for log loss.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/FM3.jpg)


where: 

![\[An image containing the logistic function of the predicted values.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/FM4.jpg)


For more information about loss functions for classification, see [Loss functions for classification](https://en.wikipedia.org/wiki/Loss_functions_for_classification).
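The two training losses described above can be sketched as follows. This is an illustrative toy, not the SageMaker AI implementation:

```python
import math

# Square loss for regression and log loss (cross entropy) for
# classification, as in the equations above.
def square_loss(y_hat, y):
    return (y_hat - y) ** 2

def log_loss(y_hat, y):
    """y in {0, 1}; y_hat is the raw model output, squashed by the
    logistic function as in the last equation above."""
    p = 1.0 / (1.0 + math.exp(-y_hat))  # logistic function of y-hat
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))
```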

# Factorization Machines Hyperparameters

The following table contains the hyperparameters for the Factorization Machines algorithm. These are parameters that are set by users to facilitate the estimation of model parameters from data. The required hyperparameters that must be set are listed first, in alphabetical order. The optional hyperparameters that can be set are listed next, also in alphabetical order.


| Parameter Name | Description | 
| --- | --- | 
| feature\_dim | The dimension of the input feature space. This could be very high with sparse input. **Required** Valid values: Positive integer. Suggested value range: [10000,10000000]  | 
| num\_factors | The dimensionality of factorization. **Required** Valid values: Positive integer. Suggested value range: [2,1000], 64 typically generates good outcomes and is a good starting point.  | 
| predictor\_type | The type of predictor. [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/fact-machines-hyperparameters.html) **Required** Valid values: String: `binary_classifier` or `regressor`  | 
| bias\_init\_method | The initialization method for the bias term: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/fact-machines-hyperparameters.html) **Optional** Valid values: `uniform`, `normal`, or `constant` Default value: `normal`  | 
| bias\_init\_scale | Range for initialization of the bias term. Takes effect if `bias_init_method` is set to `uniform`.  **Optional** Valid values: Non-negative float. Suggested value range: [1e-8, 512]. Default value: None  | 
| bias\_init\_sigma | The standard deviation for initialization of the bias term. Takes effect if `bias_init_method` is set to `normal`.  **Optional** Valid values: Non-negative float. Suggested value range: [1e-8, 512]. Default value: 0.01  | 
| bias\_init\_value | The initial value of the bias term. Takes effect if `bias_init_method` is set to `constant`.  **Optional** Valid values: Float. Suggested value range: [1e-8, 512]. Default value: None  | 
| bias\_lr | The learning rate for the bias term.  **Optional** Valid values: Non-negative float. Suggested value range: [1e-8, 512]. Default value: 0.1  | 
| bias\_wd | The weight decay for the bias term.  **Optional** Valid values: Non-negative float. Suggested value range: [1e-8, 512]. Default value: 0.01  | 
| clip\_gradient | Gradient clipping optimizer parameter. Clips the gradient by projecting onto the interval [-`clip_gradient`, +`clip_gradient`].  **Optional** Valid values: Float Default value: None  | 
| epochs | The number of training epochs to run.  **Optional** Valid values: Positive integer Default value: 1  | 
| eps | Epsilon parameter to avoid division by 0. **Optional** Valid values: Float. Suggested value: small. Default value: None  | 
| factors\_init\_method | The initialization method for factorization terms: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/fact-machines-hyperparameters.html) **Optional** Valid values: `uniform`, `normal`, or `constant`. Default value: `normal`  | 
| factors\_init\_scale  | The range for initialization of factorization terms. Takes effect if `factors_init_method` is set to `uniform`.  **Optional** Valid values: Non-negative float. Suggested value range: [1e-8, 512]. Default value: None  | 
| factors\_init\_sigma | The standard deviation for initialization of factorization terms. Takes effect if `factors_init_method` is set to `normal`.  **Optional** Valid values: Non-negative float. Suggested value range: [1e-8, 512]. Default value: 0.001  | 
| factors\_init\_value | The initial value of factorization terms. Takes effect if `factors_init_method` is set to `constant`.  **Optional** Valid values: Float. Suggested value range: [1e-8, 512]. Default value: None  | 
| factors\_lr | The learning rate for factorization terms.  **Optional** Valid values: Non-negative float. Suggested value range: [1e-8, 512]. Default value: 0.0001  | 
| factors\_wd | The weight decay for factorization terms.  **Optional** Valid values: Non-negative float. Suggested value range: [1e-8, 512]. Default value: 0.00001  | 
| linear\_lr | The learning rate for linear terms.  **Optional** Valid values: Non-negative float. Suggested value range: [1e-8, 512]. Default value: 0.001  | 
| linear\_init\_method | The initialization method for linear terms: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/fact-machines-hyperparameters.html) **Optional** Valid values: `uniform`, `normal`, or `constant`. Default value: `normal`  | 
| linear\_init\_scale | Range for initialization of linear terms. Takes effect if `linear_init_method` is set to `uniform`.  **Optional** Valid values: Non-negative float. Suggested value range: [1e-8, 512]. Default value: None  | 
| linear\_init\_sigma | The standard deviation for initialization of linear terms. Takes effect if `linear_init_method` is set to `normal`.  **Optional** Valid values: Non-negative float. Suggested value range: [1e-8, 512]. Default value: 0.01  | 
| linear\_init\_value | The initial value of linear terms. Takes effect if `linear_init_method` is set to `constant`.  **Optional** Valid values: Float. Suggested value range: [1e-8, 512]. Default value: None  | 
| linear\_wd | The weight decay for linear terms. **Optional** Valid values: Non-negative float. Suggested value range: [1e-8, 512]. Default value: 0.001  | 
| mini\_batch\_size | The size of the mini-batch used for training.  **Optional** Valid values: Positive integer Default value: 1000  | 
| rescale\_grad |  Gradient rescaling optimizer parameter. If set, multiplies the gradient with `rescale_grad` before updating. Often chosen to be 1.0/`batch_size`.  **Optional** Valid values: Float Default value: None  | 

# Tune a Factorization Machines Model

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the objective metric.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics Computed by the Factorization Machines Algorithm

The Factorization Machines algorithm has both binary classification and regression predictor types. The predictor type determines which metric you can use for automatic model tuning. The algorithm reports a `test:rmse` regressor metric, which is computed during training. When tuning the model for regression tasks, choose this metric as the objective.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| test:rmse | Root Mean Square Error | Minimize | 

The Factorization Machines algorithm reports three binary classification metrics, which are computed during training. When tuning the model for binary classification tasks, choose one of these as the objective.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| test:binary\_classification\_accuracy | Accuracy | Maximize | 
| test:binary\_classification\_cross\_entropy | Cross Entropy | Minimize | 
| test:binary\_f\_beta | F-beta score | Maximize | 

## Tunable Factorization Machines Hyperparameters

You can tune the following hyperparameters for the Factorization Machines algorithm. The initialization parameters that contain the terms bias, linear, and factors depend on their respective initialization methods. There are three initialization methods: `uniform`, `normal`, and `constant`. These initialization methods are not themselves tunable; which parameters are tunable depends on the choice of initialization method. For example, if `bias_init_method` is `uniform`, then `bias_init_scale` is tunable; similarly, `linear_init_scale` and `factors_init_scale` are tunable when `linear_init_method` and `factors_init_method` are set to `uniform`. Likewise, if an initialization method is `normal`, only the corresponding `sigma` parameter is tunable, and if it is `constant`, only the corresponding `value` parameter is tunable. These dependencies are listed in the following table. 


| Parameter Name | Parameter Type | Recommended Ranges | Dependency | 
| --- | --- | --- | --- | 
| bias\_init\_scale | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | bias\_init\_method==uniform | 
| bias\_init\_sigma | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | bias\_init\_method==normal | 
| bias\_init\_value | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | bias\_init\_method==constant | 
| bias\_lr | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | None | 
| bias\_wd | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | None | 
| epochs | IntegerParameterRange | MinValue: 1, MaxValue: 1000 | None | 
| factors\_init\_scale | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | factors\_init\_method==uniform | 
| factors\_init\_sigma | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | factors\_init\_method==normal | 
| factors\_init\_value | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | factors\_init\_method==constant | 
| factors\_lr | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | None | 
| factors\_wd | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | None | 
| linear\_init\_scale | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | linear\_init\_method==uniform | 
| linear\_init\_sigma | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | linear\_init\_method==normal | 
| linear\_init\_value | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | linear\_init\_method==constant | 
| linear\_lr | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | None | 
| linear\_wd | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | None | 
| mini\_batch\_size | IntegerParameterRange | MinValue: 100, MaxValue: 10000 | None | 
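The dependency rule can be summarized with a small hypothetical helper. The mapping below restates the rule described above; the function and variable names are illustrative, not part of any SageMaker AI API:

```python
# For each term (bias, linear, factors), the tunable initialization
# parameter is determined by that term's initialization method.
_SUFFIX = {"uniform": "init_scale", "normal": "init_sigma", "constant": "init_value"}

def tunable_init_param(term, init_method):
    """E.g. tunable_init_param('bias', 'uniform') -> 'bias_init_scale'."""
    return f"{term}_{_SUFFIX[init_method]}"
```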

# Factorization Machines Response Formats

Amazon SageMaker AI provides several response formats for getting inference from the Factorization Machines model, such as JSON, JSONLINES, and RECORDIO, with specific structures for binary classification and regression tasks.

## JSON Response Format


Binary classification

```
let response =   {
    "predictions":    [
        {
            "score": 0.4,
            "predicted_label": 0
        } 
    ]
}
```

Regression

```
let response =   {
    "predictions":    [
        {
            "score": 0.4
        } 
    ]
}
```

## JSONLINES Response Format


Binary classification

```
{"score": 0.4, "predicted_label": 0}
```

Regression

```
{"score": 0.4}
```

## RECORDIO Response Format


Binary classification

```
[
    Record = {
        features = {},
        label = {
            'score': {
                keys: [],
                values: [0.4]  # float32
            },
            'predicted_label': {
                keys: [],
                values: [0.0]  # float32
            }
        }
    }
]
```

Regression

```
[
    Record = {
        features = {},
        label = {
            'score': {
                keys: [],
                values: [0.4]  # float32
            }   
        }
    }
]
```

# K-Nearest Neighbors (k-NN) Algorithm


Amazon SageMaker AI k-nearest neighbors (k-NN) algorithm is an index-based algorithm. It uses a non-parametric method for classification or regression. For classification problems, the algorithm queries the *k* points that are closest to the sample point and returns the most frequent label among them as the predicted label. For regression problems, the algorithm queries the *k* closest points to the sample point and returns the average of their label values as the predicted value. 

Training with the k-NN algorithm has three steps: sampling, dimension reduction, and index building. Sampling reduces the size of the initial dataset so that it fits into memory. For dimension reduction, the algorithm decreases the feature dimension of the data to reduce the memory footprint of the k-NN model and inference latency. Two methods of dimension reduction are provided: random projection and the fast Johnson-Lindenstrauss transform. Typically, you use dimension reduction for high-dimensional (d > 1000) datasets to avoid the “curse of dimensionality” that troubles the statistical analysis of data that becomes sparse as dimensionality increases. The main objective of k-NN's training is to construct the index. The index enables efficient lookups of distances between a query point, whose value or class label has not yet been determined, and the k nearest points to use for inference.

**Topics**
+ [Input/Output Interface for the k-NN Algorithm](#kNN-input_output)
+ [k-NN Sample Notebooks](#kNN-sample-notebooks)
+ [How the k-NN Algorithm Works](kNN_how-it-works.md)
+ [EC2 Instance Recommendation for the k-NN Algorithm](#kNN-instances)
+ [k-NN Hyperparameters](kNN_hyperparameters.md)
+ [Tune a k-NN Model](kNN-tuning.md)
+ [Data Formats for k-NN Training Input](kNN-in-formats.md)
+ [k-NN Request and Response Formats](kNN-inference-formats.md)

## Input/Output Interface for the k-NN Algorithm


SageMaker AI k-NN supports train and test data channels.
+ Use a *train channel* for data that you want to sample and construct into the k-NN index.
+ Use a *test channel* to emit scores in log files. Scores are listed as one line per mini-batch: accuracy for `classifier`, mean squared error (MSE) for `regressor`.

For training inputs, k-NN supports `text/csv` and `application/x-recordio-protobuf` data formats. For input type `text/csv`, the first `label_size` columns are interpreted as the label vector for that row. You can use either File mode or Pipe mode to train models on data that is formatted as `recordIO-wrapped-protobuf` or as `CSV`.

For inference inputs, k-NN supports the `application/json`, `application/x-recordio-protobuf`, and `text/csv` data formats. The `text/csv` format accepts `label_size` and encoding parameters; by default, it assumes a `label_size` of 0 and UTF-8 encoding.

For inference outputs, k-NN supports the `application/json` and `application/x-recordio-protobuf` data formats. These two data formats also support a verbose output mode. In verbose output mode, the API provides the search results with the distances vector sorted from smallest to largest, and corresponding elements in the labels vector.
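The verbose-mode shape can be sketched as follows. This is a toy illustration of the sorting behavior described above, not the actual response structure, which is documented in [k-NN Request and Response Formats](kNN-inference-formats.md):

```python
# Toy sketch: distances sorted from smallest to largest, with the labels
# vector kept in corresponding order.
def verbose_result(distances, labels):
    pairs = sorted(zip(distances, labels))
    return {"distances": [d for d, _ in pairs],
            "labels": [l for _, l in pairs]}
```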

For batch transform, k-NN supports the `application/jsonlines` data format for both input and output. An example input is as follows:

```
content-type: application/jsonlines

{"features": [1.5, 16.0, 14.0, 23.0]}
{"data": {"features": {"values": [1.5, 16.0, 14.0, 23.0]}}}
```

An example output is as follows:

```
accept: application/jsonlines

{"predicted_label": 0.0}
{"predicted_label": 2.0}
```

For more information on input and output file formats, see [Data Formats for k-NN Training Input](kNN-in-formats.md) for training, [k-NN Request and Response Formats](kNN-inference-formats.md) for inference, and the [k-NN Sample Notebooks](#kNN-sample-notebooks).

## k-NN Sample Notebooks

For a sample notebook that uses the SageMaker AI k-nearest neighbor algorithm to predict wilderness cover types from geological and forest service data, see the [K-Nearest Neighbor Covertype](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/k_nearest_neighbors_covtype/k_nearest_neighbors_covtype.html) notebook. 

Use a Jupyter notebook instance to run the example in SageMaker AI. To learn how to create and open a Jupyter notebook instance in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). Once you have created a notebook instance and opened it, select the **SageMaker AI Examples** tab to see a list of all the SageMaker AI example notebooks. Find K-Nearest Neighbor notebooks in the **Introduction to Amazon algorithms** section. To open a notebook, click on its **Use** tab and select **Create copy**.

# How the k-NN Algorithm Works

The Amazon SageMaker AI k-nearest neighbors (k-NN) algorithm follows a multi-step training process which includes sampling the input data, performing dimension reduction, and building an index. The indexed data is then used during inference to efficiently find the k-nearest neighbors for a given data point and make predictions based on the neighboring labels or values.

## Step 1: Sample


To specify the total number of data points to be sampled from the training dataset, use the `sample_size` parameter. For example, if the initial dataset has 1,000 data points, `sample_size` is set to 100, and the total number of instances is 2, then each worker samples 50 points, for a collected total of 100 data points. Sampling runs in linear time with respect to the number of data points. 
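The per-worker arithmetic in the example above can be sketched with an illustrative helper (assuming `sample_size` divides evenly across instances):

```python
# Each worker samples an equal share of the requested sample_size.
def points_per_worker(sample_size, num_instances):
    assert sample_size % num_instances == 0, "assumes an even split"
    return sample_size // num_instances
```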

## Step 2: Perform Dimension Reduction


The current implementation of the k-NN algorithm has two methods of dimension reduction. You specify the method in the `dimension_reduction_type` hyperparameter. The `sign` method specifies a random projection, a linear projection that uses a matrix of random signs, and the `fjlt` method specifies a fast Johnson-Lindenstrauss transform, a method based on the Fourier transform. Both methods preserve the L2 and inner product distances. The `fjlt` method should be used when the target dimension is large, and it has better performance with CPU inference. The methods differ in their computational complexity. The `sign` method requires O(ndk) time to reduce the dimension of a batch of n points of dimension d into a target dimension k. The `fjlt` method requires O(nd log(d)) time, but the constants involved are larger. Using dimension reduction introduces noise into the data, and this noise can reduce prediction accuracy.
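A toy version of the `sign` method might look like the following: a matrix of random ±1 entries with a 1/√k scale factor to keep expected norms comparable. This sketch is illustrative only, not the SageMaker AI implementation:

```python
import random

# Build a d-by-k matrix of random signs, then project a d-dimensional
# point into k dimensions by a plain matrix-vector product.
def sign_projection_matrix(d, k, seed=0):
    rng = random.Random(seed)
    return [[rng.choice((-1.0, 1.0)) for _ in range(k)] for _ in range(d)]

def project(point, matrix):
    d, k = len(matrix), len(matrix[0])
    scale = 1.0 / (k ** 0.5)  # keeps expected L2 norms comparable
    return [scale * sum(point[i] * matrix[i][j] for i in range(d))
            for j in range(k)]
```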

## Step 3: Build an Index


During inference, the algorithm queries the index for the k-nearest-neighbors of a sample point. Based on the references to the points, the algorithm makes the classification or regression prediction. It makes the prediction based on the class labels or values provided. k-NN provides three different types of indexes: a flat index, an inverted index, and an inverted index with product quantization. You specify the type with the `index_type` parameter.
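A flat index amounts to brute-force search over all indexed points. The following toy sketch (not the SageMaker AI or FAISS implementation) shows the classification case, using L2 distance and a majority vote over the k nearest labels:

```python
from collections import Counter

# Brute-force "flat index" query: rank every indexed point by squared L2
# distance to the query, then vote among the k nearest labels.
def knn_predict(index_points, labels, query, k):
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    nearest = sorted(range(len(index_points)),
                     key=lambda i: sq_dist(index_points[i], query))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]
```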

## Serialize the Model


When the k-NN algorithm finishes training, it serializes three files to prepare for inference. 
+ model\_algo-1: Contains the serialized index for computing the nearest neighbors.
+ model\_algo-1.labels: Contains serialized labels (np.float32 binary format) for computing the predicted label based on the query result from the index.
+ model\_algo-1.json: Contains the JSON-formatted model metadata, which stores the `k` and `predictor_type` hyperparameters from training for inference along with other relevant state.

With the current implementation of k-NN, you can modify the metadata file to change the way predictions are computed. For example, you can change `k` to 10 or change `predictor_type` to *regressor*.

```
{
  "k": 5,
  "predictor_type": "classifier",
  "dimension_reduction": {"type": "sign", "seed": 3, "target_dim": 10, "input_dim": 20},
  "normalize": false,
  "version": "1.0"
}
```
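For example, editing the metadata programmatically might look like the following sketch. The helper name is hypothetical; only the `k` and `predictor_type` keys come from the description above:

```python
import json

# Change `k` and `predictor_type` in the metadata text, then return the
# updated JSON (write it back to model_algo-1.json to change inference).
def update_metadata(text, k=10, predictor_type="regressor"):
    meta = json.loads(text)
    meta["k"] = k
    meta["predictor_type"] = predictor_type
    return json.dumps(meta)
```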

## EC2 Instance Recommendation for the k-NN Algorithm


We recommend training on a CPU instance (such as ml.m5.2xlarge) or on a GPU instance. The k-NN algorithm supports P2, P3, G4dn, and G5 GPU instance families for training and inference.

Inference requests from CPUs generally have a lower average latency than requests from GPUs because there is a tax on CPU-to-GPU communication when you use GPU hardware. However, GPUs generally have higher throughput for larger batches.

# k-NN Hyperparameters

The following table lists the hyperparameters that you can set for the Amazon SageMaker AI k-nearest neighbors (k-NN) algorithm.


| Parameter Name | Description | 
| --- | --- | 
| feature\_dim |  The number of features in the input data. **Required** Valid values: positive integer.  | 
| k |  The number of nearest neighbors. **Required** Valid values: positive integer  | 
| predictor\_type |  The type of inference to use on the data labels. **Required** Valid values: *classifier* for classification or *regressor* for regression.  | 
| sample\_size |  The number of data points to be sampled from the training data set.  **Required** Valid values: positive integer  | 
| dimension\_reduction\_target |  The target dimension to reduce to. **Required** when you specify the `dimension_reduction_type` parameter. Valid values: positive integer less than `feature_dim`.  | 
| dimension\_reduction\_type |  The type of dimension reduction method.  **Optional** Valid values: *sign* for random projection or *fjlt* for the fast Johnson-Lindenstrauss transform. Default value: No dimension reduction  | 
| faiss\_index\_ivf\_nlists |  The number of centroids to construct in the index when `index_type` is *faiss.IVFFlat* or *faiss.IVFPQ*. **Optional** Valid values: positive integer Default value: *auto*, which resolves to `sqrt(sample_size)`.  | 
| faiss\_index\_pq\_m |  The number of vector sub-components to construct in the index when `index_type` is set to *faiss.IVFPQ*.  The Facebook AI Similarity Search (FAISS) library requires that the value of `faiss_index_pq_m` is a divisor of the data dimension. If `faiss_index_pq_m` is not a divisor of the data dimension, the algorithm increases the data dimension to the smallest integer divisible by `faiss_index_pq_m`. If no dimension reduction is applied, the algorithm adds a padding of zeros. If dimension reduction is applied, the algorithm increases the value of the `dimension_reduction_target` hyperparameter. **Optional** Valid values: One of the following positive integers: 1, 2, 3, 4, 8, 12, 16, 20, 24, 28, 32, 40, 48, 56, 64, 96  | 
| index\_metric |  The metric to measure the distance between points when finding nearest neighbors. When training with `index_type` set to `faiss.IVFPQ`, the `INNER_PRODUCT` distance and `COSINE` similarity are not supported. **Optional**  Valid values: *L2* for Euclidean distance, *INNER\_PRODUCT* for inner-product distance, *COSINE* for cosine similarity. Default value: *L2*  | 
| index\_type |  The type of index. **Optional** Valid values: *faiss.Flat*, *faiss.IVFFlat*, *faiss.IVFPQ*. Default value: *faiss.Flat*  | 
| mini\_batch\_size |  The number of observations per mini-batch for the data iterator.  **Optional** Valid values: positive integer Default value: 5000  | 
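To make the roles of `k` and `predictor_type` concrete, here is a minimal brute-force sketch of k-NN prediction in plain Python. This illustrates the idea only (roughly the `faiss.Flat` case with the default `L2` metric), not the SageMaker AI implementation; the function and data names are hypothetical.

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k, predictor_type):
    """Brute-force k-NN prediction using Euclidean (L2) distance."""
    dists = sorted((math.dist(x, query), y) for x, y in zip(train_X, train_y))
    neighbor_labels = [y for _, y in dists[:k]]
    if predictor_type == "classifier":
        # Classification: majority vote over the k nearest labels.
        return Counter(neighbor_labels).most_common(1)[0][0]
    # Regression: average of the k nearest labels.
    return sum(neighbor_labels) / k

X = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
y = [0.0, 0.0, 1.0, 1.0]
label = knn_predict(X, y, (0.05, 0.1), k=3, predictor_type="classifier")
value = knn_predict(X, y, (5.1, 5.0), k=2, predictor_type="regressor")
```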

# Tune a k-NN Model

The Amazon SageMaker AI k-nearest neighbors algorithm is a supervised algorithm. It consumes a test data set and emits a metric: accuracy for a classification task, or mean squared error for a regression task. These metrics compare the model's predictions to the ground truth provided by the test data. To find the best model, which reports the highest accuracy or lowest error on the test dataset, run a hyperparameter tuning job for k-NN. 

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric appropriate for the prediction task of the algorithm. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the objective metric. The hyperparameters are used only to help estimate model parameters and are not used by the trained model to make predictions.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics Computed by the k-NN Algorithm

The k-nearest neighbors algorithm computes one of two metrics in the following table during training depending on the type of task specified by the `predictor_type` hyper-parameter. 
+ *classifier* specifies a classification task and computes `test:accuracy` 
+ *regressor* specifies a regression task and computes `test:mse`.

Choose the `predictor_type` value appropriate for the type of task undertaken to calculate the relevant objective metric when tuning a model.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| test:accuracy |  When `predictor_type` is set to *classifier*, k-NN compares the predicted label, based on the average of the k-nearest neighbors' labels, to the ground truth label provided in the test channel data. The accuracy reported ranges from 0.0 (0%) to 1.0 (100%).  |  Maximize  | 
| test:mse |  When `predictor_type` is set to *regressor*, k-NN compares the predicted label, based on the average of the k-nearest neighbors' labels, to the ground truth label provided in the test channel data. The mean squared error is computed by comparing the two labels.  |  Minimize  | 
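Both objective metrics reduce to a few lines of arithmetic; a minimal sketch (the function names here are illustrative, not part of the SageMaker AI API):

```python
def accuracy(preds, truth):
    # Fraction of predictions that exactly match the ground-truth labels.
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)

def mse(preds, truth):
    # Mean of the squared differences between predictions and ground truth.
    return sum((p - t) ** 2 for p, t in zip(preds, truth)) / len(truth)

acc = accuracy([0.0, 2.0, 1.0, 1.0], [0.0, 2.0, 1.0, 0.0])  # 3 of 4 correct
err = mse([1.5, 2.0], [1.0, 2.0])
```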



## Tunable k-NN Hyperparameters

Tune the Amazon SageMaker AI k-nearest neighbor model with the following hyperparameters.


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| k |  IntegerParameterRanges  |  MinValue: 1, MaxValue: 1024  | 
| sample\_size |  IntegerParameterRanges  |  MinValue: 256, MaxValue: 20000000  | 

# Data Formats for k-NN Training Input

All Amazon SageMaker AI built-in algorithms adhere to the common input training formats described in [Common Data Formats - Training](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-training.html). This topic contains a list of the available input formats for the SageMaker AI k-nearest-neighbor algorithm.

## CSV Data Format


content-type: text/csv; label\_size=1

```
4,1.2,1.3,9.6,20.3
```

The first `label_size` columns are interpreted as the label vector for that row.
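A sketch of how a consumer might split such a row under the `label_size=1` convention above (the parser here is illustrative, not the algorithm's actual input pipeline):

```python
def parse_csv_row(row, label_size=1):
    # The first label_size columns are the label; the rest are features.
    values = [float(v) for v in row.split(",")]
    return values[:label_size], values[label_size:]

label, features = parse_csv_row("4,1.2,1.3,9.6,20.3")
```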

## RECORDIO Data Format


content-type: application/x-recordio-protobuf

```
[
    Record = {
        features = {
            'values': {
                values: [1.2, 1.3, 9.6, 20.3]  # float32
            }
        },
        label = {
            'values': {
                values: [4]  # float32
            }
        }
    }
]
```

# k-NN Request and Response Formats

All Amazon SageMaker AI built-in algorithms adhere to the common input inference format described in [Common Data Formats - Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html). This topic contains a list of the available output formats for the SageMaker AI k-nearest-neighbor algorithm.

## INPUT: CSV Request Format


content-type: text/csv

```
1.2,1.3,9.6,20.3
```

This content type optionally accepts a `label_size` or `encoding` parameter. By default, it assumes a `label_size` of 0 and UTF-8 encoding.

## INPUT: JSON Request Format


content-type: application/json

```
{
  "instances": [
    {"data": {"features": {"values": [-3, -1, -4, 2]}}},
    {"features": [3.0, 0.1, 0.04, 0.002]}]
}
```

## INPUT: JSONLINES Request Format


content-type: application/jsonlines

```
{"features": [1.5, 16.0, 14.0, 23.0]}
{"data": {"features": {"values": [1.5, 16.0, 14.0, 23.0]}}
```

## INPUT: RECORDIO Request Format


content-type: application/x-recordio-protobuf

```
[
    Record = {
        features = {
            'values': {
                values: [-3, -1, -4, 2]  # float32
            }
        },
        label = {}
    },
    Record = {
        features = {
            'values': {
                values: [3.0, 0.1, 0.04, 0.002]  # float32
            }
        },
        label = {}
    },
]
```

## OUTPUT: JSON Response Format


accept: application/json

```
{
  "predictions": [
    {"predicted_label": 0.0},
    {"predicted_label": 2.0}
  ]
}
```

## OUTPUT: JSONLINES Response Format


accept: application/jsonlines

```
{"predicted_label": 0.0}
{"predicted_label": 2.0}
```

## OUTPUT: VERBOSE JSON Response Format


In verbose mode, the API provides the search results with the distances vector sorted from smallest to largest, with corresponding elements in the labels vector. In this example, k is set to 3.

accept: application/json; verbose=true

```
{
  "predictions": [
    {
        "predicted_label": 0.0,
        "distances": [3.11792408, 3.89746071, 6.32548437],
        "labels": [0.0, 1.0, 0.0]
    },
    {
        "predicted_label": 2.0,
        "distances": [1.08470316, 3.04917915, 5.25393973],
        "labels": [2.0, 2.0, 0.0]
    }
  ]
}
```
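For a classification task, a client can sanity-check a verbose response like the one above: the distances arrive sorted smallest to largest, and the predicted label agrees with a majority vote over the returned neighbor labels. A minimal sketch:

```python
from collections import Counter

# A verbose JSON response body like the example above, already parsed.
response = {
    "predictions": [
        {"predicted_label": 0.0,
         "distances": [3.11792408, 3.89746071, 6.32548437],
         "labels": [0.0, 1.0, 0.0]},
        {"predicted_label": 2.0,
         "distances": [1.08470316, 3.04917915, 5.25393973],
         "labels": [2.0, 2.0, 0.0]},
    ]
}

for p in response["predictions"]:
    # Distances are sorted from smallest to largest ...
    assert p["distances"] == sorted(p["distances"])
    # ... and the predicted label matches the neighbors' majority vote.
    majority = Counter(p["labels"]).most_common(1)[0][0]
    assert majority == p["predicted_label"]
```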

## OUTPUT: RECORDIO-PROTOBUF Response Format


content-type: application/x-recordio-protobuf

```
[
    Record = {
        features = {},
        label = {
            'predicted_label': {
                values: [0.0]  # float32
            }
        }
    },
    Record = {
        features = {},
        label = {
            'predicted_label': {
                values: [2.0]  # float32
            }
        }
    }
]
```

## OUTPUT: VERBOSE RECORDIO-PROTOBUF Response Format


In verbose mode, the API provides the search results with the distances vector sorted from smallest to largest, with corresponding elements in the labels vector. In this example, k is set to 3.

accept: application/x-recordio-protobuf; verbose=true

```
[
    Record = {
        features = {},
        label = {
            'predicted_label': {
                values: [0.0]  # float32
            },
            'distances': {
                values: [3.11792408, 3.89746071, 6.32548437]  # float32
            },
            'labels': {
                values: [0.0, 1.0, 0.0]  # float32
            }
        }
    },
    Record = {
        features = {},
        label = {
            'predicted_label': {
                values: [2.0]  # float32
            },
            'distances': {
                values: [1.08470316, 3.04917915, 5.25393973]  # float32
            },
            'labels': {
                values: [2.0, 2.0, 0.0]  # float32
            }
        }
    }
]
```

## SAMPLE OUTPUT for the k-NN Algorithm


For regressor tasks:

```
[06/08/2018 20:15:33 INFO 140026520049408] #test_score (algo-1) : ('mse', 0.013333333333333334)
```

For classifier tasks:

```
[06/08/2018 20:15:46 INFO 140285487171328] #test_score (algo-1) : ('accuracy', 0.98666666666666669)
```

# LightGBM

[LightGBM](https://lightgbm.readthedocs.io/en/latest/) is a popular and efficient open-source implementation of the Gradient Boosting Decision Tree (GBDT) algorithm. GBDT is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler and weaker models. LightGBM uses additional techniques to significantly improve the efficiency and scalability of conventional GBDT. This page includes information about Amazon EC2 instance recommendations and sample notebooks for LightGBM.

# How to use SageMaker AI LightGBM

You can use LightGBM as an Amazon SageMaker AI built-in algorithm. The following section describes how to use LightGBM with the SageMaker Python SDK. For information on how to use LightGBM from the Amazon SageMaker Studio Classic UI, see [SageMaker JumpStart pretrained models](studio-jumpstart.md).
+ **Use LightGBM as a built-in algorithm**

  Use the LightGBM built-in algorithm to build a LightGBM training container as shown in the following code example. You can retrieve the LightGBM built-in algorithm image URI using the SageMaker AI `image_uris.retrieve` API (or the `get_image_uri` API if you use [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) version 1). 

  After specifying the LightGBM image URI, you can use the LightGBM container to construct an estimator using the SageMaker AI Estimator API and initiate a training job. The LightGBM built-in algorithm runs in script mode, but the training script is provided for you and there is no need to replace it. If you have extensive experience using script mode to create a SageMaker training job, then you can incorporate your own LightGBM training scripts.

  ```
  import sagemaker
  from sagemaker import image_uris, model_uris, script_uris
  
  # Session, Region, and execution role used later in this example
  sess = sagemaker.Session()
  aws_region = sess.boto_region_name
  aws_role = sagemaker.get_execution_role()
  
  train_model_id, train_model_version, train_scope = "lightgbm-classification-model", "*", "training"
  training_instance_type = "ml.m5.xlarge"
  
  # Retrieve the docker image
  train_image_uri = image_uris.retrieve(
      region=None,
      framework=None,
      model_id=train_model_id,
      model_version=train_model_version,
      image_scope=train_scope,
      instance_type=training_instance_type
  )
  
  # Retrieve the training script
  train_source_uri = script_uris.retrieve(
      model_id=train_model_id, model_version=train_model_version, script_scope=train_scope
  )
  
  train_model_uri = model_uris.retrieve(
      model_id=train_model_id, model_version=train_model_version, model_scope=train_scope
  )
  
  # Sample training data is available in this bucket
  training_data_bucket = f"jumpstart-cache-prod-{aws_region}"
  training_data_prefix = "training-datasets/tabular_multiclass/"
  
  training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}/train" 
  validation_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}/validation" 
  
  output_bucket = sess.default_bucket()
  output_prefix = "jumpstart-example-tabular-training"
  
  s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"
  
  from sagemaker import hyperparameters
  
  # Retrieve the default hyperparameters for training the model
  hyperparameters = hyperparameters.retrieve_default(
      model_id=train_model_id, model_version=train_model_version
  )
  
  # [Optional] Override default hyperparameters with custom values
  hyperparameters[
      "num_boost_round"
  ] = "500"
  print(hyperparameters)
  
  from sagemaker.estimator import Estimator
  from sagemaker.utils import name_from_base
  
  training_job_name = name_from_base(f"built-in-algo-{train_model_id}-training")
  
  # Create SageMaker Estimator instance
  tabular_estimator = Estimator(
      role=aws_role,
      image_uri=train_image_uri,
      source_dir=train_source_uri,
      model_uri=train_model_uri,
      entry_point="transfer_learning.py",
      instance_count=1, # for distributed training, specify an instance_count greater than 1
      instance_type=training_instance_type,
      max_run=360000,
      hyperparameters=hyperparameters,
      output_path=s3_output_location
  )
  
  # Launch a SageMaker Training job by passing the S3 path of the training data
  tabular_estimator.fit(
      {
          "train": training_dataset_s3_path,
          "validation": validation_dataset_s3_path,
      }, logs=True, job_name=training_job_name
  )
  ```

  For more information about how to set up the LightGBM as a built-in algorithm, see the following notebook examples.
  + [Tabular classification with Amazon SageMaker AI LightGBM and CatBoost algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/lightgbm_catboost_tabular/Amazon_Tabular_Classification_LightGBM_CatBoost.ipynb)
  + [Tabular regression with Amazon SageMaker AI LightGBM and CatBoost algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/lightgbm_catboost_tabular/Amazon_Tabular_Regression_LightGBM_CatBoost.ipynb)

# Input and Output interface for the LightGBM algorithm


Gradient boosting operates on tabular data, with the rows representing observations, one column representing the target variable or label, and the remaining columns representing features. 

The SageMaker AI implementation of LightGBM supports CSV for training and inference:
+ For **Training ContentType**, valid inputs must be *text/csv*.
+ For **Inference ContentType**, valid inputs must be *text/csv*.

**Note**  
For CSV training, the algorithm assumes that the target variable is in the first column and that the CSV does not have a header record.   
For CSV inference, the algorithm assumes that CSV input does not have the label column. 

**Input format for training data, validation data, and categorical features**

Be mindful of how to format your training data for input to the LightGBM model. You must provide the path to an Amazon S3 bucket that contains your training and validation data. You can also include a list of categorical features. Use both the `train` and `validation` channels to provide your input data. Alternatively, you can use only the `train` channel.

**Note**  
Both `train` and `training` are valid channel names for LightGBM training.

**Use both the `train` and `validation` channels**

You can provide your input data by way of two S3 paths, one for the `train` channel and one for the `validation` channel. Each S3 path can either be an S3 prefix that points to one or more CSV files or a full S3 path pointing to one specific CSV file. The target variables should be in the first column of your CSV file. The predictor variables (features) should be in the remaining columns. If multiple CSV files are provided for the `train` or `validation` channels, the LightGBM algorithm concatenates the files. The validation data is used to compute a validation score at the end of each boosting iteration. Early stopping is applied when the validation score stops improving.

If your predictors include categorical features, you can provide a JSON file named `categorical_index.json` in the same location as your training data file or files. If you provide a JSON file for categorical features, your `train` channel must point to an S3 prefix and not a specific CSV file. This file should contain a Python dictionary where the key is the string `"cat_index_list"` and the value is a list of unique integers. Each integer in the list should indicate the column index of a categorical feature in your training data CSV file. Each value should be a positive integer (greater than zero, because column zero holds the target), less than `Int32.MaxValue` (2147483647), and less than the total number of columns. There should only be one categorical index JSON file.
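A sketch of generating and validating such a file with the standard `json` module (the column indexes and dimensions here are hypothetical):

```python
import json

feature_dim = 10              # the CSV has 1 target column plus 10 feature columns
categorical_columns = [2, 5]  # hypothetical indexes of categorical feature columns

# Indexes must be positive (column 0 holds the target) and inside the table.
assert all(0 < i <= feature_dim for i in categorical_columns)

doc = json.dumps({"cat_index_list": categorical_columns})
# Write `doc` to categorical_index.json next to the training CSV files.
```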

**Use only the `train` channel**:

You can alternatively provide your input data by way of a single S3 path for the `train` channel. This S3 path should point to a directory with a subdirectory named `train/` that contains one or more CSV files. You can optionally include another subdirectory in the same location called `validation/` that also has one or more CSV files. If the validation data is not provided, then 20% of your training data is randomly sampled to serve as the validation data. If your predictors include categorical features, you can provide a JSON file named `categorical_index.json` in the same location as your data subdirectories.

**Note**  
For CSV training input mode, the total memory available to the algorithm (instance count multiplied by the memory available in the `InstanceType`) must be able to hold the training dataset.

SageMaker AI LightGBM uses the Python joblib module to serialize and deserialize the model, so you can use joblib to save and load the model.

**To use a model trained with SageMaker AI LightGBM with the joblib module**
+ Use the following Python code:

  ```
  import joblib 
  import tarfile
  
  t = tarfile.open('model.tar.gz', 'r:gz')
  t.extractall()
  
  # model_file_path is a placeholder: point it at the model file extracted
  # from model.tar.gz (the file name depends on the training job)
  model = joblib.load(model_file_path)
  
  # prediction with test data
  # dtest should be a pandas DataFrame with column names feature_0, feature_1, ..., feature_d
  pred = model.predict(dtest)
  ```

## Amazon EC2 instance recommendation for the LightGBM algorithm


SageMaker AI LightGBM currently supports single-instance and multi-instance CPU training. For multi-instance CPU training (distributed training), specify an `instance_count` greater than 1 when you define your Estimator. For more information on distributed training with LightGBM, see [Amazon SageMaker AI LightGBM Distributed training using Dask](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_applying_machine_learning/sagemaker_lightgbm_distributed_training_dask/sagemaker-lightgbm-distributed-training-dask.html).

LightGBM is a memory-bound (as opposed to compute-bound) algorithm. So, a general-purpose compute instance (for example, M5) is a better choice than a compute-optimized instance (for example, C5). Further, we recommend that you have enough total memory in selected instances to hold the training data. 
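As a rough sizing aid, the in-memory footprint of a dense tabular dataset can be estimated as rows × columns × bytes per value, with a safety factor for parsing and tree construction. The factor of 2 below is an assumption for illustration, not a documented figure.

```python
def estimated_memory_gib(n_rows, n_cols, bytes_per_value=8, overhead=2.0):
    # Rough lower bound on total cluster memory for CSV training.
    return n_rows * n_cols * bytes_per_value * overhead / 2**30

# 10 million rows with 50 features: compare the result against
# instance_count * memory-per-instance for your chosen instance type.
needed = estimated_memory_gib(10_000_000, 50)
```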

## LightGBM sample notebooks

The following table outlines a variety of sample notebooks that address different use cases of the Amazon SageMaker AI LightGBM algorithm.



| **Notebook Title** | **Description** | 
| --- | --- | 
|  [Tabular classification with Amazon SageMaker AI LightGBM and CatBoost algorithm](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/lightgbm_catboost_tabular/Amazon_Tabular_Classification_LightGBM_CatBoost.html)  |  This notebook demonstrates the use of the Amazon SageMaker AI LightGBM algorithm to train and host a tabular classification model.   | 
|  [Tabular regression with Amazon SageMaker AI LightGBM and CatBoost algorithm](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/lightgbm_catboost_tabular/Amazon_Tabular_Regression_LightGBM_CatBoost.html)  |  This notebook demonstrates the use of the Amazon SageMaker AI LightGBM algorithm to train and host a tabular regression model.   | 
|  [Amazon SageMaker AI LightGBM Distributed training using Dask](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_applying_machine_learning/sagemaker_lightgbm_distributed_training_dask/sagemaker-lightgbm-distributed-training-dask.html)  |  This notebook demonstrates distributed training with the Amazon SageMaker AI LightGBM algorithm using the Dask framework.  | 

For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, choose the **SageMaker AI Examples** tab to see a list of all of the SageMaker AI samples. To open a notebook, choose its **Use** tab and choose **Create copy**.

# How LightGBM works

LightGBM implements a conventional Gradient Boosting Decision Tree (GBDT) algorithm with the addition of two novel techniques: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). These techniques are designed to significantly improve the efficiency and scalability of GBDT.

The LightGBM algorithm performs well in machine learning competitions because of its robust handling of a variety of data types, relationships, distributions, and the diversity of hyperparameters that you can fine-tune. You can use LightGBM for regression, classification (binary and multiclass), and ranking problems.

For more information on gradient boosting, see [How the SageMaker AI XGBoost algorithm works](xgboost-HowItWorks.md). For in-depth details about the additional GOSS and EFB techniques used in the LightGBM method, see *[LightGBM: A Highly Efficient Gradient Boosting Decision Tree](https://proceedings.neurips.cc/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf)*.

# LightGBM hyperparameters

The following table contains the subset of hyperparameters that are required or most commonly used for the Amazon SageMaker AI LightGBM algorithm. Users set these parameters to facilitate the estimation of model parameters from data. The SageMaker AI LightGBM algorithm is an implementation of the open-source [LightGBM](https://github.com/microsoft/LightGBM) package. 

**Note**  
The default hyperparameters are based on example datasets in the [LightGBM sample notebooks](lightgbm.md#lightgbm-sample-notebooks).

By default, the SageMaker AI LightGBM algorithm automatically chooses an evaluation metric and objective function based on the type of classification problem. The LightGBM algorithm detects the type of classification problem based on the number of labels in your data. For regression problems, the evaluation metric is root mean squared error and the objective function is L2 loss. For binary classification problems, the evaluation metric and objective function are both binary cross entropy. For multiclass classification problems, the evaluation metric is multiclass cross entropy and the objective function is softmax. You can use the `metric` hyperparameter to change the default evaluation metric. Refer to the following table for more information on LightGBM hyperparameters, including descriptions, valid values, and default values.


| Parameter Name | Description | 
| --- | --- | 
| num\_boost\_round |  The maximum number of boosting iterations. **Note:** Internally, LightGBM constructs `num_class * num_boost_round` trees for multiclass classification problems. Valid values: integer, range: Positive integer. Default value: `100`.  | 
| early\_stopping\_rounds |  Training stops if one metric of one validation data point does not improve in the last `early_stopping_rounds` rounds. If `early_stopping_rounds` is less than or equal to zero, this hyperparameter is ignored. Valid values: integer. Default value: `10`.  | 
| metric |  The evaluation metric for validation data. If `metric` is set to the default `"auto"` value, then the algorithm automatically chooses an evaluation metric based on the type of classification problem: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/lightgbm-hyperparameters.html) Valid values: string, any of the following: (`"auto"`, `"rmse"`, `"l1"`, `"l2"`, `"huber"`, `"fair"`, `"binary_logloss"`, `"binary_error"`, `"auc"`, `"average_precision"`, `"multi_logloss"`, `"multi_error"`, `"auc_mu"`, or `"cross_entropy"`). Default value: `"auto"`.  | 
| learning\_rate |  The rate at which the model weights are updated after working through each batch of training examples. Valid values: float, range: (`0.0`, `1.0`). Default value: `0.1`.  | 
| num\_leaves |  The maximum number of leaves in one tree. Valid values: integer, range: (`1`, `131072`). Default value: `64`.  | 
| feature\_fraction |  The fraction of features to select on each iteration (tree). Must be less than 1.0. Valid values: float, range: (`0.0`, `1.0`). Default value: `0.9`.  | 
| bagging\_fraction |  Similar to `feature_fraction`, but `bagging_fraction` specifies the fraction of the data (rows) to randomly select on each iteration, without resampling. Valid values: float, range: (`0.0`, `1.0`]. Default value: `0.9`.  | 
| bagging\_freq |  The frequency to perform bagging. At every `bagging_freq` iteration, LightGBM randomly selects a percentage of the data to use for the next `bagging_freq` iterations. This percentage is determined by the `bagging_fraction` hyperparameter. If `bagging_freq` is zero, then bagging is deactivated. Valid values: integer, range: Non-negative integer. Default value: `1`.  | 
| max\_depth |  The maximum depth for a tree model. This is used to deal with overfitting when the amount of data is small. If `max_depth` is less than or equal to zero, there is no limit on maximum depth. Valid values: integer. Default value: `6`.  | 
| min\_data\_in\_leaf |  The minimum amount of data in one leaf. Can be used to deal with overfitting. Valid values: integer, range: Non-negative integer. Default value: `3`.  | 
| max\_delta\_step |  Used to limit the maximum output of tree leaves. If `max_delta_step` is less than or equal to 0, then there is no constraint. The final maximum output of leaves is `learning_rate * max_delta_step`. Valid values: float. Default value: `0.0`.  | 
| lambda\_l1 |  L1 regularization. Valid values: float, range: Non-negative float. Default value: `0.0`.  | 
| lambda\_l2 |  L2 regularization. Valid values: float, range: Non-negative float. Default value: `0.0`.  | 
| boosting |  The boosting type. Valid values: string, any of the following: (`"gbdt"`, `"rf"`, `"dart"`, or `"goss"`). Default value: `"gbdt"`.  | 
| min\_gain\_to\_split |  The minimum gain to perform a split. Can be used to speed up training. Valid values: float, range: Non-negative float. Default value: `0.0`.  | 
| scale\_pos\_weight |  The weight of the labels with positive class. Used only for binary classification tasks. `scale_pos_weight` cannot be used if `is_unbalance` is set to `"True"`.  Valid values: float, range: Positive float. Default value: `1.0`.  | 
| tree\_learner |  The tree learner type. Valid values: string, any of the following: (`"serial"`, `"feature"`, `"data"`, or `"voting"`). Default value: `"serial"`.  | 
| feature\_fraction\_bynode |  Selects a fraction of random features on each tree node. For example, if `feature_fraction_bynode` is `0.8`, then 80% of features are selected. Can be used to deal with overfitting. Valid values: float, range: (`0.0`, `1.0`]. Default value: `1.0`.  | 
| is\_unbalance |  Set to `"True"` if training data is unbalanced. Used only for binary classification tasks. `is_unbalance` cannot be used with `scale_pos_weight`. Valid values: string, either: (`"True"` or `"False"`). Default value: `"False"`.  | 
| max\_bin |  The maximum number of bins used to bucket feature values. A small number of bins may reduce training accuracy, but may improve generalization. Can be used to deal with overfitting. Valid values: integer, range: (1, ∞). Default value: `255`.  | 
| tweedie\_variance\_power |  Controls the variance of the Tweedie distribution. Set this closer to `2.0` to shift toward a gamma distribution. Set this closer to `1.0` to shift toward a Poisson distribution. Used only for regression tasks. Valid values: float, range: [`1.0`, `2.0`). Default value: `1.5`.  | 
| num\_threads |  The number of parallel threads used to run LightGBM. A value of 0 uses the default number of threads in OpenMP. Valid values: integer, range: Non-negative integer. Default value: `0`.  | 
| verbosity |  The verbosity of print messages. If `verbosity` is less than `0`, then print messages only show fatal errors. If `verbosity` is set to `0`, then print messages include errors and warnings. If `verbosity` is `1`, then print messages show more information. A `verbosity` greater than `1` shows the most information and can be used for debugging. Valid values: integer. Default value: `1`.  | 
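The automatic metric and objective selection described before the table can be sketched in a few lines. Note that detecting regression from non-integer labels is an assumption for illustration here, not the documented implementation, and the function name is hypothetical.

```python
def auto_metric_and_objective(labels):
    # Mimic the default metric/objective selection (a sketch, not the
    # actual SageMaker AI implementation).
    distinct = set(labels)
    if any(not float(v).is_integer() for v in distinct):
        return "rmse", "l2"                        # regression
    if len(distinct) == 2:
        return "binary_logloss", "binary_logloss"  # binary classification
    return "multi_logloss", "softmax"              # multiclass classification

metric, objective = auto_metric_and_objective([0, 1, 2, 1, 0])
```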

# Tune a LightGBM model

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your training and validation datasets. Model tuning focuses on the following: 
+ A learning objective function to optimize during model training
+ An evaluation metric that is used to evaluate model performance during validation
+ A set of hyperparameters and a range of values for each to use when tuning the model automatically

**Note**  
The learning objective function is automatically assigned based on the type of classification task, which is determined by the number of unique integers in the label column. For more information, see [LightGBM hyperparameters](lightgbm-hyperparameters.md).

Automatic model tuning searches your specified hyperparameters to find the combination of values that results in a model that optimizes the chosen evaluation metric.

**Note**  
Automatic model tuning for LightGBM is only available from the Amazon SageMaker SDKs, not from the SageMaker AI console.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Evaluation metrics computed by the LightGBM algorithm

The SageMaker AI LightGBM algorithm computes the following metrics to use for model validation. The evaluation metric is automatically assigned based on the type of classification task, which is determined by the number of unique integers in the label column.


| Metric Name | Description | Optimization Direction | Regex Pattern | 
| --- | --- | --- | --- | 
| rmse | root mean square error | minimize | "rmse: ([0-9\\.]+)" | 
| l1 | mean absolute error | minimize | "l1: ([0-9\\.]+)" | 
| l2 | mean squared error | minimize | "l2: ([0-9\\.]+)" | 
| huber | Huber loss | minimize | "huber: ([0-9\\.]+)" | 
| fair | fair loss | minimize | "fair: ([0-9\\.]+)" | 
| binary\_logloss | binary cross entropy | minimize | "binary\_logloss: ([0-9\\.]+)" | 
| binary\_error | binary error | minimize | "binary\_error: ([0-9\\.]+)" | 
| auc | AUC | maximize | "auc: ([0-9\\.]+)" | 
| average\_precision | average precision score | maximize | "average\_precision: ([0-9\\.]+)" | 
| multi\_logloss | multiclass cross entropy | minimize | "multi\_logloss: ([0-9\\.]+)" | 
| multi\_error | multiclass error score | minimize | "multi\_error: ([0-9\\.]+)" | 
| auc\_mu | AUC-mu | maximize | "auc\_mu: ([0-9\\.]+)" | 
| cross\_entropy | cross entropy | minimize | "cross\_entropy: ([0-9\\.]+)" | 
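
The regex patterns in the table are applied to the training log to extract metric values. The snippet below parses a hypothetical log line with the `rmse` pattern; the log line itself is illustrative, not an exact SageMaker log format.

```python
import re

# Hypothetical training-log line containing an rmse value.
log_line = "[10]\tvalid_0's rmse: 0.24871"

# The rmse pattern from the table: capture a run of digits and dots.
pattern = r"rmse: ([0-9\.]+)"
match = re.search(pattern, log_line)
rmse = float(match.group(1))
```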

## Tunable LightGBM hyperparameters

Tune the LightGBM model with the following hyperparameters. The hyperparameters that have the greatest effect on optimizing the LightGBM evaluation metrics are: `learning_rate`, `num_leaves`, `feature_fraction`, `bagging_fraction`, `bagging_freq`, `max_depth` and `min_data_in_leaf`. For a list of all the LightGBM hyperparameters, see [LightGBM hyperparameters](lightgbm-hyperparameters.md).


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| learning\_rate | ContinuousParameterRanges | MinValue: 0.001, MaxValue: 0.01 | 
| num\_leaves | IntegerParameterRanges | MinValue: 10, MaxValue: 100 | 
| feature\_fraction | ContinuousParameterRanges | MinValue: 0.1, MaxValue: 1.0 | 
| bagging\_fraction | ContinuousParameterRanges | MinValue: 0.1, MaxValue: 1.0 | 
| bagging\_freq | IntegerParameterRanges | MinValue: 0, MaxValue: 10 | 
| max\_depth | IntegerParameterRanges | MinValue: 15, MaxValue: 100 | 
| min\_data\_in\_leaf | IntegerParameterRanges | MinValue: 10, MaxValue: 200 | 

# Linear Learner Algorithm


*Linear models* are supervised learning algorithms used for solving either classification or regression problems. For input, you give the model labeled examples (*x*, *y*). *x* is a high-dimensional vector and *y* is a numeric label. For binary classification problems, the label must be either 0 or 1. For multiclass classification problems, the labels must be from 0 to `num_classes` - 1. For regression problems, *y* is a real number. The algorithm learns a linear function, or, for classification problems, a linear threshold function, and maps a vector *x* to an approximation of the label *y*. 
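
The learned mapping can be sketched in a few lines of plain Python: a linear function scores an input vector, and for binary classification a threshold on that score produces the label. This is a minimal illustration of the model form, not the SageMaker implementation.

```python
def linear_score(weights, features, bias):
    """The linear function w · x + b that the model learns."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def predict_binary(weights, features, bias, threshold=0.0):
    """A linear threshold function: map the score to a 0/1 label."""
    return 1 if linear_score(weights, features, bias) > threshold else 0
```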

The Amazon SageMaker AI linear learner algorithm provides a solution for both classification and regression problems. With the SageMaker AI algorithm, you can simultaneously explore different training objectives and choose the best solution from a validation set. You can also explore a large number of models and choose the best. The best model optimizes either of the following:
+ Continuous objectives, such as mean square error, cross entropy loss, or absolute error.
+ Discrete objectives suited for classification, such as F1 measure, precision, recall, or accuracy. 

Compared with methods that provide a solution for only continuous objectives, the SageMaker AI linear learner algorithm provides a significant increase in speed over naive hyperparameter optimization techniques. It is also more convenient. 

The linear learner algorithm requires a data matrix, with rows representing the observations, and columns representing the dimensions of the features. It also requires an additional column that contains the labels that match the data points. At a minimum, Amazon SageMaker AI linear learner requires you to specify input and output data locations, and objective type (classification or regression) as arguments. The feature dimension is also required. For more information, see [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html). You can specify additional parameters in the `HyperParameters` string map of the request body. These parameters control the optimization procedure, or specifics of the objective function that you train on. For example, the number of epochs, regularization, and loss type. 

If you're using [Managed Spot Training](https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html), the linear learner algorithm supports using [checkpoints to take a snapshot of the state of the model](https://docs.aws.amazon.com/sagemaker/latest/dg/model-checkpoints.html).

**Topics**
+ [Input/Output interface for the linear learner algorithm](#ll-input_output)
+ [EC2 instance recommendation for the linear learner algorithm](#ll-instances)
+ [Linear learner sample notebooks](#ll-sample-notebooks)
+ [How linear learner works](ll_how-it-works.md)
+ [Linear learner hyperparameters](ll_hyperparameters.md)
+ [Tune a linear learner model](linear-learner-tuning.md)
+ [Linear learner response formats](LL-in-formats.md)

## Input/Output interface for the linear learner algorithm


The Amazon SageMaker AI linear learner algorithm supports three data channels: train, validation (optional), and test (optional). If you provide validation data, the `S3DataDistributionType` should be `FullyReplicated`. The algorithm logs validation loss at every epoch, and uses a sample of the validation data to calibrate and select the best model. If you don't provide validation data, the algorithm uses a sample of the training data to calibrate and select the model. If you provide test data, the algorithm logs include the test score for the final model.

**For training**, the linear learner algorithm supports both `recordIO-wrapped protobuf` and `CSV` formats. For the `application/x-recordio-protobuf` input type, only Float32 tensors are supported. For the `text/csv` input type, the first column is assumed to be the label, which is the target variable for prediction. You can use either File mode or Pipe mode to train linear learner models on data that is formatted as `recordIO-wrapped-protobuf` or as `CSV`.
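
A `text/csv` training payload with the label in the first column can be assembled with the standard library. The feature values below are made up for illustration; the file has no header row.

```python
import csv
import io

# Each row: label first, then the feature columns. No header row.
rows = [
    (1, 0.5, 2.3, 1.1),  # example with label 1
    (0, 1.7, 0.2, 3.4),  # example with label 0
]

buf = io.StringIO()
csv.writer(buf).writerows(rows)
payload = buf.getvalue()
```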

**For inference**, the linear learner algorithm supports the `application/json`, `application/x-recordio-protobuf`, and `text/csv` formats. When you make predictions on new data, the format of the response depends on the type of model. **For regression** (`predictor_type='regressor'`), the `score` is the prediction produced by the model. **For classification** (`predictor_type='binary_classifier'` or `predictor_type='multiclass_classifier'`), the model returns a `score` and also a `predicted_label`. The `predicted_label` is the class predicted by the model and the `score` measures the strength of that prediction. 
+ **For binary classification**, `predicted_label` is `0` or `1`, and `score` is a single floating-point number that indicates how strongly the algorithm believes that the label should be 1.
+ **For multiclass classification**, `predicted_label` will be an integer from `0` to `num_classes - 1`, and `score` will be a list of one floating-point number per class. 
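
Client code typically parses these fields out of the response body. The snippet below handles an illustrative `application/json` binary-classification response; treat the exact shape as an assumption here and see [Linear learner response formats](LL-in-formats.md) for the authoritative schema.

```python
import json

# Hypothetical application/json response body for a binary classifier.
body = '{"predictions": [{"score": 0.83, "predicted_label": 1}]}'

prediction = json.loads(body)["predictions"][0]
score = prediction["score"]            # strength of the prediction
label = prediction["predicted_label"]  # the predicted class
```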

To interpret the `score` in classification problems, you have to consider the loss function used. If the `loss` hyperparameter value is `logistic` for binary classification or `softmax_loss` for multiclass classification, then the `score` can be interpreted as the probability of the corresponding class. These are the loss values that the linear learner uses when the `loss` hyperparameter is set to its default value, `auto`. But if the loss is set to `hinge_loss`, then the score cannot be interpreted as a probability. This is because hinge loss corresponds to a support vector classifier, which does not produce probability estimates.

For more information on input and output file formats, see [Linear learner response formats](LL-in-formats.md) and the [Linear learner sample notebooks](#ll-sample-notebooks).

## EC2 instance recommendation for the linear learner algorithm


The linear learner algorithm supports both CPU and GPU instances for training and inference. For GPU, the linear learner algorithm supports P2, P3, G4dn, and G5 GPU families.

During testing, we have not found substantial evidence that multi-GPU instances are faster than single-GPU instances. Results can vary, depending on your specific use case.

## Linear learner sample notebooks


The following table outlines a variety of sample notebooks that address different use cases of the Amazon SageMaker AI linear learner algorithm.


| **Notebook Title** | **Description** | 
| --- | --- | 
|  [An Introduction with the MNIST dataset](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/linear_learner_mnist/linear_learner_mnist.html)  |   Using the MNIST dataset, we train a binary classifier to predict a single digit.  | 
|  [How to Build a Multiclass Classifier?](https://sagemaker-examples.readthedocs.io/en/latest/scientific_details_of_algorithms/linear_learner_multiclass_classification/linear_learner_multiclass_classification.html)  |   Using UCI's Covertype dataset, we demonstrate how to train a multiclass classifier.   | 
|  [How to Build a Machine Learning (ML) Pipeline for Inference? ](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-python-sdk/scikit_learn_inference_pipeline/Inference%20Pipeline%20with%20Scikit-learn%20and%20Linear%20Learner.html)  |   Using a Scikit-learn container, we demonstrate how to build an end-to-end ML pipeline.   | 

For instructions on how to create and access Jupyter notebook instances that you can use to run the examples in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, choose the **SageMaker AI Examples** tab to see a list of all of the SageMaker AI samples. The example notebooks that use the linear learner algorithm are located in the **Introduction to Amazon algorithms** section. To open a notebook, choose its **Use** tab and choose **Create copy**. 

# How linear learner works

There are three steps involved in the implementation of the linear learner algorithm: preprocess, train, and validate. 

## Step 1: Preprocess


Normalization, or feature scaling, is an important preprocessing step for certain loss functions that ensures the model being trained on a dataset does not become dominated by the weight of a single feature. The Amazon SageMaker AI Linear Learner algorithm has a normalization option to assist with this preprocessing step. If normalization is turned on, the algorithm first goes over a small sample of the data to learn the mean value and standard deviation for each feature and for the label. Each of the features in the full dataset is then shifted to have mean of zero and scaled to have a unit standard deviation.
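
Per feature, the transformation is the familiar z-score. The sketch below applies it to a full column rather than the small sample the algorithm uses, so it is illustrative rather than a reproduction of the algorithm's behavior.

```python
import statistics

def zscore_normalize(column):
    """Shift a feature column to zero mean and scale it to unit standard
    deviation, as the normalization option does per feature."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(v - mean) / std for v in column]
```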

**Note**  
For best results, ensure your data is shuffled before training. Training with unshuffled data may cause training to fail. 

You can configure whether the linear learner algorithm normalizes the feature data and the labels using the `normalize_data` and `normalize_label` hyperparameters, respectively. For regression, normalization is enabled by default for both features and labels. For binary classification, only the features can be normalized, and this is the default behavior. 

## Step 2: Train


With the linear learner algorithm, you train with a distributed implementation of stochastic gradient descent (SGD). You can control the optimization process by choosing the optimization algorithm. For example, you can choose to use Adam, AdaGrad, stochastic gradient descent, or other optimization algorithms. You also specify their hyperparameters, such as momentum, learning rate, and the learning rate schedule. If you aren't sure which algorithm or hyperparameter value to use, choose a default that works for the majority of datasets. 

During training, you simultaneously optimize multiple models, each with slightly different objectives. For example, you vary L1 or L2 regularization and try out different optimizer settings. 
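
The core SGD update can be sketched for a single-feature squared-loss model. This toy loop illustrates the kind of per-example gradient step the distributed implementation performs; it omits momentum, learning-rate scheduling, regularization, and parallel model variants.

```python
import random

def sgd_linear_regression(data, lr=0.05, epochs=200, seed=0):
    """Minimal single-feature SGD for squared loss on (x, y) pairs."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)            # shuffled data matters for SGD
        for x, y in data:
            err = (w * x + b) - y    # prediction error
            w -= lr * err * x        # gradient step on the weight
            b -= lr * err            # gradient step on the bias
    return w, b
```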

## Step 3: Validate and set the threshold


When multiple models are trained in parallel, they are evaluated against a validation set, and the best model is selected once training is complete. For regression, the best model is the one that achieves the lowest loss on the validation set. For classification, a sample of the validation set is used to calibrate the classification threshold. The selected model is the one that achieves the best binary classification selection criteria on the validation set. Examples of such criteria include the F1 measure, accuracy, and cross-entropy loss. 
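
A simplified version of the threshold-calibration step is a grid search over candidate thresholds, scoring each one on the validation sample. This sketch uses F1 as the criterion; the algorithm also supports other selection criteria.

```python
def calibrate_threshold(scores, labels, candidates):
    """Pick the classification threshold that maximizes F1 on a
    validation sample (simplified illustration)."""
    def f1(threshold):
        preds = [1 if s > threshold else 0 for s in scores]
        tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
        fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
        fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
        if tp == 0:
            return 0.0
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return 2 * precision * recall / (precision + recall)
    return max(candidates, key=f1)
```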

**Note**  
If the algorithm is not provided a validation set, then evaluating and selecting the best model is not possible. To take advantage of parallel training and model selection, ensure that you provide a validation set to the algorithm. 

# Linear learner hyperparameters

The following table contains the hyperparameters for the linear learner algorithm. These are parameters that are set by users to facilitate the estimation of model parameters from data. The required hyperparameters that must be set are listed first, in alphabetical order. The optional hyperparameters that can be set are listed next, also in alphabetical order. When a hyperparameter is set to `auto`, Amazon SageMaker AI will automatically calculate and set the value of that hyperparameter. 


| Parameter Name | Description | 
| --- | --- | 
| num\_classes |  The number of classes for the response variable. The algorithm assumes that classes are labeled `0`, ..., `num_classes - 1`. **Required** when `predictor_type` is `multiclass_classifier`. Otherwise, the algorithm ignores it. Valid values: Integers from 3 to 1,000,000  | 
| predictor\_type |  Specifies the type of target variable as a binary classification, multiclass classification, or regression. **Required** Valid values: `binary_classifier`, `multiclass_classifier`, or `regressor`  | 
| accuracy\_top\_k |  When computing the top-k accuracy metric for multiclass classification, the value of *k*. If the model assigns one of the top-k scores to the true label, an example is scored as correct. **Optional** Valid values: Positive integers Default value: 3   | 
| balance\_multiclass\_weights |  Specifies whether to use class weights, which give each class equal importance in the loss function. Used only when the `predictor_type` is `multiclass_classifier`. **Optional** Valid values: `true`, `false` Default value: `false`  | 
| beta\_1 |  The exponential decay rate for first-moment estimates. Applies only when the `optimizer` value is `adam`. **Optional** Valid values: `auto` or floating-point number between 0 and 1.0 Default value: `auto`  | 
| beta\_2 |  The exponential decay rate for second-moment estimates. Applies only when the `optimizer` value is `adam`. **Optional** Valid values: `auto` or floating-point number between 0 and 1.0  Default value: `auto`  | 
| bias\_lr\_mult |  Allows a different learning rate for the bias term. The actual learning rate for the bias is `learning_rate` \* `bias_lr_mult`. **Optional** Valid values: `auto` or positive floating-point number Default value: `auto`  | 
| bias\_wd\_mult |  Allows different regularization for the bias term. The actual L2 regularization weight for the bias is `wd` \* `bias_wd_mult`. By default, there is no regularization on the bias term. **Optional** Valid values: `auto` or non-negative floating-point number Default value: `auto`  | 
| binary\_classifier\_model\_selection\_criteria |  When `predictor_type` is set to `binary_classifier`, the model evaluation criteria for the validation dataset (or for the training dataset if you don't provide a validation dataset). Criteria include: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/ll_hyperparameters.html) **Optional** Valid values: `accuracy`, `f_beta`, `precision_at_target_recall`, `recall_at_target_precision`, or `loss_function` Default value: `accuracy`  | 
| early\_stopping\_patience | If no improvement is made in the relevant metric, the number of epochs to wait before ending training. If you have provided a value for `binary_classifier_model_selection_criteria`, the metric is that value. Otherwise, the metric is the same as the value specified for the `loss` hyperparameter. The metric is evaluated on the validation data. If you haven't provided validation data, the metric is always the same as the value specified for the `loss` hyperparameter and is evaluated on the training data. To disable early stopping, set `early_stopping_patience` to a value greater than the value specified for `epochs`. **Optional** Valid values: Positive integer Default value: 3 | 
| early\_stopping\_tolerance |  The relative tolerance to measure an improvement in loss. If the ratio of the improvement in loss divided by the previous best loss is smaller than this value, early stopping considers the improvement to be zero. **Optional** Valid values: Positive floating-point number Default value: 0.001  | 
| epochs |  The maximum number of passes over the training data. **Optional** Valid values: Positive integer Default value: 15  | 
| f\_beta |  The value of beta to use when calculating F score metrics for binary or multiclass classification. Also used if the value specified for `binary_classifier_model_selection_criteria` is `f_beta`. **Optional** Valid values: Positive floating-point numbers Default value: 1.0   | 
| feature\_dim |  The number of features in the input data.  **Optional** Valid values: `auto` or positive integer Default value: `auto`  | 
| huber\_delta |  The parameter for Huber loss. During training and metric evaluation, compute L2 loss for errors smaller than delta and L1 loss for errors larger than delta. **Optional** Valid values: Positive floating-point number Default value: 1.0   | 
| init\_bias |  Initial weight for the bias term. **Optional** Valid values: Floating-point number Default value: 0  | 
| init\_method |  Sets the initial distribution function used for model weights. Functions include: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/ll_hyperparameters.html) **Optional** Valid values: `uniform` or `normal` Default value: `uniform`  | 
| init\_scale |  Scales an initial uniform distribution for model weights. Applies only when the `init_method` hyperparameter is set to `uniform`. **Optional** Valid values: Positive floating-point number Default value: 0.07  | 
| init\_sigma |  The initial standard deviation for the normal distribution. Applies only when the `init_method` hyperparameter is set to `normal`. **Optional** Valid values: Positive floating-point number Default value: 0.01  | 
| l1 |  The L1 regularization parameter. If you don't want to use L1 regularization, set the value to 0. **Optional** Valid values: `auto` or non-negative float Default value: `auto`  | 
| learning\_rate |  The step size used by the optimizer for parameter updates. **Optional** Valid values: `auto` or positive floating-point number Default value: `auto`, whose value depends on the optimizer chosen.  | 
| loss |  Specifies the loss function.  The available loss functions and their default values depend on the value of `predictor_type`: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/ll_hyperparameters.html) Valid values: `auto`, `logistic`, `squared_loss`, `absolute_loss`, `hinge_loss`, `eps_insensitive_squared_loss`, `eps_insensitive_absolute_loss`, `quantile_loss`, or `huber_loss`  **Optional** Default value: `auto`  | 
| loss\_insensitivity |  The parameter for the epsilon-insensitive loss type. During training and metric evaluation, any error smaller than this value is considered to be zero. **Optional** Valid values: Positive floating-point number Default value: 0.01   | 
| lr\_scheduler\_factor |  For every `lr_scheduler_step` hyperparameter, the learning rate decreases by this quantity. Applies only when the `use_lr_scheduler` hyperparameter is set to `true`. **Optional** Valid values: `auto` or positive floating-point number between 0 and 1 Default value: `auto`  | 
| lr\_scheduler\_minimum\_lr |  The learning rate never decreases to a value lower than the value set for `lr_scheduler_minimum_lr`. Applies only when the `use_lr_scheduler` hyperparameter is set to `true`. **Optional** Valid values: `auto` or positive floating-point number Default value: `auto`  | 
| lr\_scheduler\_step |  The number of steps between decreases of the learning rate. Applies only when the `use_lr_scheduler` hyperparameter is set to `true`. **Optional** Valid values: `auto` or positive integer Default value: `auto`  | 
| margin |  The margin for the `hinge_loss` function. **Optional** Valid values: Positive floating-point number Default value: 1.0  | 
| mini\_batch\_size |  The number of observations per mini-batch for the data iterator. **Optional** Valid values: Positive integer Default value: 1000  | 
| momentum |  The momentum of the `sgd` optimizer. **Optional** Valid values: `auto` or a floating-point number between 0 and 1.0 Default value: `auto`  | 
| normalize\_data |  Normalizes the feature data before training. Data normalization shifts the data for each feature to have a mean of zero and scales it to have unit standard deviation. **Optional** Valid values: `auto`, `true`, or `false` Default value: `true`  | 
| normalize\_label |  Normalizes the label. Label normalization shifts the label to have a mean of zero and scales it to have unit standard deviation. The `auto` default value normalizes the label for regression problems but does not for classification problems. If you set the `normalize_label` hyperparameter to `true` for classification problems, the algorithm ignores it. **Optional** Valid values: `auto`, `true`, or `false` Default value: `auto`  | 
| num\_calibration\_samples |  The number of observations from the validation dataset to use for model calibration (when finding the best threshold). **Optional** Valid values: `auto` or positive integer Default value: `auto`  | 
| num\_models |  The number of models to train in parallel. For the default, `auto`, the algorithm decides the number of parallel models to train. One model is trained according to the given training parameter (regularization, optimizer, loss), and the rest by close parameters. **Optional** Valid values: `auto` or positive integer Default value: `auto`  | 
| num\_point\_for\_scaler |  The number of data points to use for calculating normalization or unbiasing of terms. **Optional** Valid values: Positive integer Default value: 10,000  | 
| optimizer |  The optimization algorithm to use. **Optional** Valid values: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/ll_hyperparameters.html) Default value: `auto`. The default setting for `auto` is `adam`.  | 
| positive\_example\_weight\_mult |  The weight assigned to positive examples when training a binary classifier. The weight of negative examples is fixed at 1. If you want the algorithm to choose a weight so that errors in classifying negative *vs.* positive examples have equal impact on training loss, specify `balanced`. If you want the algorithm to choose the weight that optimizes performance, specify `auto`. **Optional** Valid values: `balanced`, `auto`, or a positive floating-point number Default value: 1.0  | 
| quantile |  The quantile for quantile loss. For quantile q, the model attempts to produce predictions so that the value of `true_label` is greater than the prediction with probability q. **Optional** Valid values: Floating-point number between 0 and 1 Default value: 0.5  | 
| target\_precision |  The target precision. If `binary_classifier_model_selection_criteria` is `recall_at_target_precision`, then precision is held at this value while recall is maximized. **Optional** Valid values: Floating-point number between 0 and 1.0 Default value: 0.8  | 
| target\_recall |  The target recall. If `binary_classifier_model_selection_criteria` is `precision_at_target_recall`, then recall is held at this value while precision is maximized. **Optional** Valid values: Floating-point number between 0 and 1.0 Default value: 0.8  | 
| unbias\_data |  Unbiases the features before training so that the mean is 0. By default, data is unbiased when the `use_bias` hyperparameter is set to `true`. **Optional** Valid values: `auto`, `true`, or `false` Default value: `auto`  | 
| unbias\_label |  Unbiases labels before training so that the mean is 0. Applies to regression only if the `use_bias` hyperparameter is set to `true`. **Optional** Valid values: `auto`, `true`, or `false` Default value: `auto`  | 
| use\_bias |  Specifies whether the model should include a bias term, which is the intercept term in the linear equation. **Optional** Valid values: `true` or `false` Default value: `true`  | 
| use\_lr\_scheduler |  Whether to use a scheduler for the learning rate. If you want to use a scheduler, specify `true`.  **Optional** Valid values: `true` or `false` Default value: `true`  | 
| wd |  The weight decay parameter, also known as the L2 regularization parameter. If you don't want to use L2 regularization, set the value to 0. **Optional** Valid values: `auto` or non-negative floating-point number Default value: `auto`  | 
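
The interaction between `early_stopping_patience` and `early_stopping_tolerance` can be sketched as follows: an epoch counts as an improvement only when the relative gain over the best loss so far exceeds the tolerance, and training stops after `patience` epochs without one. This is an illustration of the documented rule, not the algorithm's internal code; it assumes positive loss values.

```python
def should_stop(loss_history, patience=3, tolerance=0.001):
    """Return True if early stopping would trigger on this loss sequence."""
    best = loss_history[0]
    epochs_without_improvement = 0
    for loss in loss_history[1:]:
        # Relative improvement over the best loss seen so far.
        if (best - loss) / best > tolerance:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return True
    return False
```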

# Tune a linear learner model

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the chosen hyperparameters to find the combination of values that results in the model that optimizes the objective metric. 

The linear learner algorithm also has an internal mechanism for tuning hyperparameters separate from the automatic model tuning feature described here. By default, the linear learner algorithm tunes hyperparameters by training multiple models in parallel. When you use automatic model tuning, the linear learner internal tuning mechanism is turned off automatically. This sets the number of parallel models, `num_models`, to 1. The algorithm ignores any value that you set for `num_models`.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics computed by the linear learner algorithm

The linear learner algorithm reports the metrics in the following table, which are computed during training. Choose one of them as the objective metric. To avoid overfitting, we recommend tuning the model against a validation metric instead of a training metric.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| test:absolute\_loss |  The absolute loss of the final model on the test dataset. This objective metric is only valid for regression.  |  Minimize  | 
| test:binary\_classification\_accuracy |  The accuracy of the final model on the test dataset. This objective metric is only valid for binary classification.  |  Maximize  | 
| test:binary\_f\_beta |  The F-beta score of the final model on the test dataset. By default, it is the F1 score, which is the harmonic mean of precision and recall. This objective metric is only valid for binary classification.  |  Maximize  | 
| test:dcg |  The discounted cumulative gain of the final model on the test dataset. This objective metric is only valid for multiclass classification.  |  Maximize  | 
| test:macro\_f\_beta |  The F-beta score of the final model on the test dataset. This objective metric is only valid for multiclass classification.  |  Maximize  | 
| test:macro\_precision |  The precision score of the final model on the test dataset. This objective metric is only valid for multiclass classification.  |  Maximize  | 
| test:macro\_recall |  The recall score of the final model on the test dataset. This objective metric is only valid for multiclass classification.  |  Maximize  | 
| test:mse |  The mean square error of the final model on the test dataset. This objective metric is only valid for regression.  |  Minimize  | 
| test:multiclass\_accuracy |  The accuracy of the final model on the test dataset. This objective metric is only valid for multiclass classification.  |  Maximize  | 
| test:multiclass\_top\_k\_accuracy |  The accuracy among the top k labels predicted on the test dataset. If you choose this metric as the objective, we recommend setting the value of k using the `accuracy_top_k` hyperparameter. This objective metric is only valid for multiclass classification.  |  Maximize  | 
| test:objective\_loss |  The mean value of the objective loss function on the test dataset after the model is trained. By default, the loss is logistic loss for binary classification and squared loss for regression. To set the loss to other types, use the `loss` hyperparameter.  |  Minimize  | 
| test:precision |  The precision of the final model on the test dataset. If you choose this metric as the objective, we recommend setting a target recall by setting the `binary_classifier_model_selection_criteria` hyperparameter to `precision_at_target_recall` and setting the value for the `target_recall` hyperparameter. This objective metric is only valid for binary classification.  |  Maximize  | 
| test:recall |  The recall of the final model on the test dataset. If you choose this metric as the objective, we recommend setting a target precision by setting the `binary_classifier_model_selection_criteria` hyperparameter to `recall_at_target_precision` and setting the value for the `target_precision` hyperparameter. This objective metric is only valid for binary classification.  |  Maximize  | 
| test:roc\_auc\_score |  The area under the receiver operating characteristic (ROC) curve of the final model on the test dataset. This objective metric is only valid for binary classification.  |  Maximize  | 
| validation:absolute\_loss |  The absolute loss of the final model on the validation dataset. This objective metric is only valid for regression.  |  Minimize  | 
| validation:binary\_classification\_accuracy |  The accuracy of the final model on the validation dataset. This objective metric is only valid for binary classification.  |  Maximize  | 
| validation:binary\_f\_beta |  The F-beta score of the final model on the validation dataset. By default, the F-beta score is the F1 score, which is the harmonic mean of the `validation:precision` and `validation:recall` metrics. This objective metric is only valid for binary classification.  |  Maximize  | 
| validation:dcg |  The discounted cumulative gain of the final model on the validation dataset. This objective metric is only valid for multiclass classification.  |  Maximize  | 
| validation:macro\_f\_beta |  The F-beta score of the final model on the validation dataset. This objective metric is only valid for multiclass classification.  |  Maximize  | 
| validation:macro\_precision |  The precision score of the final model on the validation dataset. This objective metric is only valid for multiclass classification.  |  Maximize  | 
| validation:macro\_recall |  The recall score of the final model on the validation dataset. This objective metric is only valid for multiclass classification.  |  Maximize  | 
| validation:mse |  The mean square error of the final model on the validation dataset. This objective metric is only valid for regression.  |  Minimize  | 
| validation:multiclass\_accuracy |  The accuracy of the final model on the validation dataset. This objective metric is only valid for multiclass classification.  |  Maximize  | 
| validation:multiclass\_top\_k\_accuracy |  The accuracy among the top k labels predicted on the validation dataset. If you choose this metric as the objective, we recommend setting the value of k using the `accuracy_top_k` hyperparameter. This objective metric is only valid for multiclass classification.  |  Maximize  | 
| validation:objective\_loss |  The mean value of the objective loss function on the validation dataset every epoch. By default, the loss is logistic loss for binary classification and squared loss for regression. To set loss to other types, use the `loss` hyperparameter.  |  Minimize  | 
| validation:precision |  The precision of the final model on the validation dataset. If you choose this metric as the objective, we recommend setting a target recall by setting the `binary_classifier_model_selection_criteria` hyperparameter to `precision_at_target_recall` and setting the value for the `target_recall` hyperparameter. This objective metric is only valid for binary classification.  |  Maximize  | 
| validation:recall |  The recall of the final model on the validation dataset. If you choose this metric as the objective, we recommend setting a target precision by setting the `binary_classifier_model_selection_criteria` hyperparameter to `recall_at_target_precision` and setting the value for the `target_precision` hyperparameter. This objective metric is only valid for binary classification.  |  Maximize  | 
| validation:rmse |  The root mean square error of the final model on the validation dataset. This objective metric is only valid for regression.  |  Minimize  | 
| validation:roc\_auc\_score |  The area under the receiver operating characteristic (ROC) curve of the final model on the validation dataset. This objective metric is only valid for binary classification.  |  Maximize  | 

## Tuning linear learner hyperparameters

You can tune a linear learner model with the following hyperparameters.


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| wd |  `ContinuousParameterRanges`  |  `MinValue`: `1e-7`, `MaxValue`: `1`  | 
| l1 |  `ContinuousParameterRanges`  |  `MinValue`: `1e-7`, `MaxValue`: `1`  | 
| learning_rate |  `ContinuousParameterRanges`  |  `MinValue`: `1e-5`, `MaxValue`: `1`  | 
| mini_batch_size |  `IntegerParameterRanges`  |  `MinValue`: `100`, `MaxValue`: `5000`  | 
| use_bias |  `CategoricalParameterRanges`  |  `[True, False]`  | 
| positive_example_weight_mult |  `ContinuousParameterRanges`  |  `MinValue`: `1e-5`, `MaxValue`: `1e5`  | 
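
To make the ranges concrete, the sketch below captures the table above in plain Python and checks candidate values against it. This is only an illustration: with the SageMaker Python SDK you would express these ranges as `ContinuousParameter`, `IntegerParameter`, and `CategoricalParameter` objects passed to a `HyperparameterTuner`; the dict layout and the `in_range` helper here are assumptions made for a self-contained example.

```python
# Recommended tuning ranges for the linear learner, mirrored from the table above.
ranges = {
    "wd": {"type": "continuous", "min": 1e-7, "max": 1.0},
    "l1": {"type": "continuous", "min": 1e-7, "max": 1.0},
    "learning_rate": {"type": "continuous", "min": 1e-5, "max": 1.0},
    "mini_batch_size": {"type": "integer", "min": 100, "max": 5000},
    "use_bias": {"type": "categorical", "values": [True, False]},
    "positive_example_weight_mult": {"type": "continuous", "min": 1e-5, "max": 1e5},
}

# One valid objective from the metrics table, with its optimization direction.
objective = {"metric": "validation:objective_loss", "direction": "Minimize"}

def in_range(name, value):
    """Return True if a candidate value falls inside the recommended range."""
    r = ranges[name]
    if r["type"] == "categorical":
        return value in r["values"]
    return r["min"] <= value <= r["max"]

print(in_range("learning_rate", 0.01))
```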

# Linear learner response formats

## JSON response formats


All Amazon SageMaker AI built-in algorithms adhere to the common input inference format described in [Common Data Formats - Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html). The following are the available output formats for the SageMaker AI linear learner algorithm.

**Binary Classification**

```
let response =   {
    "predictions":    [
        {
            "score": 0.4,
            "predicted_label": 0
        } 
    ]
}
```

**Multiclass Classification**

```
let response =   {
    "predictions":    [
        {
            "score": [0.1, 0.2, 0.4, 0.3],
            "predicted_label": 2
        } 
    ]
}
```

**Regression**

```
let response =   {
    "predictions":    [
        {
            "score": 0.4
        } 
    ]
}
```
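
A response in any of these JSON formats can be decoded with a standard JSON parser. As a minimal sketch, the snippet below parses the binary classification example; the literal `body` string stands in for the bytes an endpoint would actually return.

```python
import json

# Sample response body matching the binary classification format above.
body = '{"predictions": [{"score": 0.4, "predicted_label": 0}]}'

response = json.loads(body)
for prediction in response["predictions"]:
    # Each prediction carries a score and, for classification, a predicted label.
    print(prediction["predicted_label"], prediction["score"])
```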

## JSONLINES response formats


**Binary Classification**

```
{"score": 0.4, "predicted_label": 0}
```

**Multiclass Classification**

```
{"score": [0.1, 0.2, 0.4, 0.3], "predicted_label": 2}
```

**Regression**

```
{"score": 0.4}
```
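
A JSONLINES response carries one JSON object per line, so each line is parsed independently. The sketch below uses a two-record body mirroring the multiclass format above; the literal string stands in for the streamed response an endpoint would return.

```python
import json

# Two JSONLINES records, one per line.
body = (
    '{"score": [0.1, 0.2, 0.4, 0.3], "predicted_label": 2}\n'
    '{"score": [0.7, 0.1, 0.1, 0.1], "predicted_label": 0}'
)

records = [json.loads(line) for line in body.splitlines() if line]
labels = [r["predicted_label"] for r in records]
print(labels)
```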

## RECORDIO response formats


**Binary Classification**

```
[
    Record = {
        features = {},
        label = {
            'score': {
                keys: [],
                values: [0.4]  # float32
            },
            'predicted_label': {
                keys: [],
                values: [0.0]  # float32
            }
        }
    }
]
```

**Multiclass Classification**

```
[
    Record = {
    "features": [],
    "label":    {
            "score":  {
                    "values":   [0.1, 0.2, 0.3, 0.4]   
            },
            "predicted_label":  {
                    "values":   [3]
            }
       },
    "uid":  "abc123",
    "metadata": "{created_at: '2017-06-03'}"
   }
]
```

**Regression**

```
[
    Record = {
        features = {},
        label = {
            'score': {
                keys: [],
                values: [0.4]  # float32
            }   
        }
    }
]
```

# TabTransformer

[TabTransformer](https://arxiv.org/abs/2012.06678) is a novel deep tabular data modeling architecture for supervised learning. The TabTransformer architecture is built on self-attention-based Transformers. The Transformer layers transform the embeddings of categorical features into robust contextual embeddings to achieve higher prediction accuracy. Furthermore, the contextual embeddings learned from TabTransformer are highly robust against both missing and noisy data features, and provide better interpretability. This page includes information about Amazon EC2 instance recommendations and sample notebooks for TabTransformer.

# How to use SageMaker AI TabTransformer

You can use TabTransformer as an Amazon SageMaker AI built-in algorithm. The following section describes how to use TabTransformer with the SageMaker Python SDK. For information on how to use TabTransformer from the Amazon SageMaker Studio Classic UI, see [SageMaker JumpStart pretrained models](studio-jumpstart.md).
+ **Use TabTransformer as a built-in algorithm**

  Use the TabTransformer built-in algorithm to build a TabTransformer training container as shown in the following code example. You can retrieve the TabTransformer built-in algorithm image URI using the SageMaker AI `image_uris.retrieve` API (or the `get_image_uri` API if using [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) version 1). 

  After specifying the TabTransformer image URI, you can use the TabTransformer container to construct an estimator using the SageMaker AI Estimator API and initiate a training job. The TabTransformer built-in algorithm runs in script mode, but the training script is provided for you and there is no need to replace it. If you have extensive experience using script mode to create a SageMaker training job, then you can incorporate your own TabTransformer training scripts.

  ```
  import boto3
  import sagemaker
  from sagemaker import image_uris, model_uris, script_uris
  
  # Set up the session, role, and Region used later in this example
  sess = sagemaker.Session()
  aws_role = sagemaker.get_execution_role()
  aws_region = boto3.Session().region_name
  
  train_model_id, train_model_version, train_scope = "pytorch-tabtransformerclassification-model", "*", "training"
  training_instance_type = "ml.p3.2xlarge"
  
  # Retrieve the docker image
  train_image_uri = image_uris.retrieve(
      region=None,
      framework=None,
      model_id=train_model_id,
      model_version=train_model_version,
      image_scope=train_scope,
      instance_type=training_instance_type
  )
  
  # Retrieve the training script
  train_source_uri = script_uris.retrieve(
      model_id=train_model_id, model_version=train_model_version, script_scope=train_scope
  )
  
  train_model_uri = model_uris.retrieve(
      model_id=train_model_id, model_version=train_model_version, model_scope=train_scope
  )
  
  # Sample training data is available in this bucket
  training_data_bucket = f"jumpstart-cache-prod-{aws_region}"
  training_data_prefix = "training-datasets/tabular_binary"  # no trailing slash; "/" is appended below
  
  training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}/train"
  validation_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}/validation"
  
  output_bucket = sess.default_bucket()
  output_prefix = "jumpstart-example-tabular-training"
  
  s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"
  
  from sagemaker import hyperparameters
  
  # Retrieve the default hyperparameters for training the model
  hyperparameters = hyperparameters.retrieve_default(
      model_id=train_model_id, model_version=train_model_version
  )
  
  # [Optional] Override default hyperparameters with custom values
  hyperparameters[
      "n_epochs"
  ] = "50"
  print(hyperparameters)
  
  from sagemaker.estimator import Estimator
  from sagemaker.utils import name_from_base
  
  training_job_name = name_from_base(f"built-in-algo-{train_model_id}-training")
  
  # Create SageMaker Estimator instance
  tabular_estimator = Estimator(
      role=aws_role,
      image_uri=train_image_uri,
      source_dir=train_source_uri,
      model_uri=train_model_uri,
      entry_point="transfer_learning.py",
      instance_count=1,
      instance_type=training_instance_type,
      max_run=360000,
      hyperparameters=hyperparameters,
      output_path=s3_output_location
  )
  
  # Launch a SageMaker Training job by passing the S3 path of the training data
  tabular_estimator.fit(
      {
          "training": training_dataset_s3_path,
          "validation": validation_dataset_s3_path,
      }, logs=True, job_name=training_job_name
  )
  ```

  For more information about how to set up the TabTransformer as a built-in algorithm, see the following notebook examples.
  + [Tabular classification with Amazon SageMaker AI TabTransformer algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/tabtransformer_tabular/Amazon_Tabular_Classification_TabTransformer.ipynb)
  + [Tabular regression with Amazon SageMaker AI TabTransformer algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/tabtransformer_tabular/Amazon_Tabular_Regression_TabTransformer.ipynb)

# Input and Output interface for the TabTransformer algorithm


TabTransformer operates on tabular data, with the rows representing observations, one column representing the target variable or label, and the remaining columns representing features. 

The SageMaker AI implementation of TabTransformer supports CSV for training and inference:
+ For **Training ContentType**, valid inputs must be *text/csv*.
+ For **Inference ContentType**, valid inputs must be *text/csv*.

**Note**  
For CSV training, the algorithm assumes that the target variable is in the first column and that the CSV does not have a header record.   
For CSV inference, the algorithm assumes that CSV input does not have the label column. 
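
As a minimal sketch, the snippet below writes training rows in the layout the note describes: target variable in the first column, no header record. The column values are made-up placeholders; only the shape of the output matters.

```python
import csv
import io

# Each tuple is (label, feature1, feature2, feature3); values are illustrative.
rows = [
    (1, 5.1, 3.5, 0),
    (0, 4.9, 3.0, 1),
]

buf = io.StringIO()
writer = csv.writer(buf)
for label, *feats in rows:
    # Target goes first, and no header row is ever written.
    writer.writerow([label, *feats])

print(buf.getvalue())
```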

**Input format for training data, validation data, and categorical features**

Be mindful of how to format your training data for input to the TabTransformer model. You must provide the path to an Amazon S3 bucket that contains your training and validation data. You can also include a list of categorical features. Use both the `training` and `validation` channels to provide your input data. Alternatively, you can use only the `training` channel.

**Use both the `training` and `validation` channels**

You can provide your input data by way of two S3 paths, one for the `training` channel and one for the `validation` channel. Each S3 path can either be an S3 prefix that points to one or more CSV files or a full S3 path pointing to one specific CSV file. The target variables should be in the first column of your CSV file. The predictor variables (features) should be in the remaining columns. If multiple CSV files are provided for the `training` or `validation` channels, the TabTransformer algorithm concatenates the files. The validation data is used to compute a validation score at the end of each training epoch. Early stopping is applied when the validation score stops improving.

If your predictors include categorical features, you can provide a JSON file named `categorical_index.json` in the same location as your training data file or files. If you provide a JSON file for categorical features, your `training` channel must point to an S3 prefix and not a specific CSV file. This file should contain a Python dictionary where the key is the string `"cat_index_list"` and the value is a list of unique integers. Each integer in the value list should indicate the column index of the corresponding categorical feature in your training data CSV file. Each value should be a positive integer (greater than zero, because column zero holds the target), less than `Int32.MaxValue` (2147483647), and less than the total number of columns. There should only be one categorical index JSON file.
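
To make the file format concrete, here is a minimal sketch that writes a `categorical_index.json` marking columns 1 and 3 as categorical. The column indices are illustrative; column 0 is never listed because it holds the target.

```python
import json

# Columns 1 and 3 of the training CSV are categorical in this made-up example.
categorical_index = {"cat_index_list": [1, 3]}

with open("categorical_index.json", "w") as f:
    json.dump(categorical_index, f)

print(json.dumps(categorical_index))
```

The file is then uploaded alongside the training CSV files under the same S3 prefix.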

**Use only the `training` channel**:

You can alternatively provide your input data by way of a single S3 path for the `training` channel. This S3 path should point to a directory with a subdirectory named `training/` that contains one or more CSV files. You can optionally include another subdirectory in the same location called `validation/` that also has one or more CSV files. If the validation data is not provided, then 20% of your training data is randomly sampled to serve as the validation data. If your predictors include categorical features, you can provide a JSON file named `categorical_index.json` in the same location as your data subdirectories.

**Note**  
For CSV training input mode, the total memory available to the algorithm (instance count multiplied by the memory available in the `InstanceType`) must be able to hold the training dataset.

## Amazon EC2 instance recommendation for the TabTransformer algorithm


SageMaker AI TabTransformer supports single-instance CPU and single-instance GPU training. Despite higher per-instance costs, GPUs train more quickly, making them more cost effective. To take advantage of GPU training, specify the instance type as one of the GPU instances (for example, P3). SageMaker AI TabTransformer currently does not support multi-GPU training.

## TabTransformer sample notebooks

The following table outlines a variety of sample notebooks that address different use cases of the Amazon SageMaker AI TabTransformer algorithm.



| **Notebook Title** | **Description** | 
| --- | --- | 
|  [Tabular classification with Amazon SageMaker AI TabTransformer algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/tabtransformer_tabular/Amazon_Tabular_Classification_TabTransformer.ipynb)  |  This notebook demonstrates the use of the Amazon SageMaker AI TabTransformer algorithm to train and host a tabular classification model.   | 
|  [Tabular regression with Amazon SageMaker AI TabTransformer algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/tabtransformer_tabular/Amazon_Tabular_Regression_TabTransformer.ipynb)  |  This notebook demonstrates the use of the Amazon SageMaker AI TabTransformer algorithm to train and host a tabular regression model.   | 

For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, choose the **SageMaker AI Examples** tab to see a list of all of the SageMaker AI samples. To open a notebook, choose its **Use** tab and choose **Create copy**.

# How TabTransformer works

TabTransformer is a novel deep tabular data modeling architecture for supervised learning. The TabTransformer is built upon self-attention based Transformers. The Transformer layers transform the embeddings of categorical features into robust contextual embeddings to achieve higher prediction accuracy. Furthermore, the contextual embeddings learned from TabTransformer are highly robust against both missing and noisy data features, and provide better interpretability.

TabTransformer performs well in machine learning competitions because of its robust handling of a variety of data types, relationships, and distributions, and because of the diversity of hyperparameters that you can fine-tune. You can use TabTransformer for regression, classification (binary and multiclass), and ranking problems.

The following diagram illustrates the TabTransformer architecture.

![\[The architecture of TabTransformer.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/tabtransformer_illustration.png)


For more information, see *[TabTransformer: Tabular Data Modeling Using Contextual Embeddings](https://arxiv.org/abs/2012.06678)*.

# TabTransformer hyperparameters

The following table contains the subset of hyperparameters that are required or most commonly used for the Amazon SageMaker AI TabTransformer algorithm. Users set these parameters to facilitate the estimation of model parameters from data. The SageMaker AI TabTransformer algorithm is an implementation of the open-source [TabTransformer](https://github.com/jrzaurin/pytorch-widedeep) package.

**Note**  
The default hyperparameters are based on example datasets in the [TabTransformer sample notebooks](tabtransformer.md#tabtransformer-sample-notebooks).

The SageMaker AI TabTransformer algorithm automatically chooses an evaluation metric and objective function based on the type of problem. The TabTransformer algorithm detects the type of classification problem based on the number of labels in your data. For regression problems, the evaluation metric is R-squared and the objective function is mean squared error. For binary classification problems, the evaluation metric and objective function are both binary cross-entropy. For multiclass classification problems, the evaluation metric and objective function are both multiclass cross-entropy.

**Note**  
The TabTransformer evaluation metric and objective functions are not currently available as hyperparameters. Instead, the SageMaker AI TabTransformer built-in algorithm automatically detects the type of classification task (regression, binary, or multiclass) based on the number of unique integers in the label column and assigns an evaluation metric and objective function.


| Parameter Name | Description | 
| --- | --- | 
| n_epochs |  Number of epochs to train the deep neural network. Valid values: integer, range: Positive integer. Default value: `5`.  | 
| patience |  Training stops if the validation metric does not improve within the last `patience` epochs. Valid values: integer, range: (`2`, `60`). Default value: `10`.  | 
| learning_rate |  The rate at which the model weights are updated after working through each batch of training examples. Valid values: float, range: Positive floating point number. Default value: `0.001`.  | 
| batch_size |  The number of examples propagated through the network.  Valid values: integer, range: (`1`, `2048`). Default value: `256`.  | 
| input_dim |  The dimension of embeddings to encode the categorical and/or continuous columns. Valid values: string, any of the following: `"16"`, `"32"`, `"64"`, `"128"`, `"256"`, or `"512"`. Default value: `"32"`.  | 
| n_blocks |  The number of Transformer encoder blocks. Valid values: integer, range: (`1`, `12`). Default value: `4`.  | 
| attn_dropout |  Dropout rate applied to the Multi-Head Attention layers. Valid values: float, range: (`0`, `1`). Default value: `0.2`.  | 
| mlp_dropout |  Dropout rate applied to the FeedForward network within the encoder layers and the final MLP layers on top of Transformer encoders. Valid values: float, range: (`0`, `1`). Default value: `0.1`.  | 
| frac_shared_embed |  The fraction of embeddings shared by all the different categories for one particular column. Valid values: float, range: (`0`, `1`). Default value: `0.25`.  | 

# Tune a TabTransformer model

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your training and validation datasets. Model tuning focuses on the following hyperparameters: 

**Note**  
The learning objective function and evaluation metric are both automatically assigned based on the type of classification task, which is determined by the number of unique integers in the label column. For more information, see [TabTransformer hyperparameters](tabtransformer-hyperparameters.md).
+ A learning objective function to optimize during model training
+ An evaluation metric that is used to evaluate model performance during validation
+ A set of hyperparameters and a range of values for each to use when tuning the model automatically

Automatic model tuning searches your chosen hyperparameters to find the combination of values that results in a model that optimizes the chosen evaluation metric.

**Note**  
Automatic model tuning for TabTransformer is only available from the Amazon SageMaker SDKs, not from the SageMaker AI console.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Evaluation metrics computed by the TabTransformer algorithm

The SageMaker AI TabTransformer algorithm computes the following metrics to use for model validation. The evaluation metric is automatically assigned based on the type of classification task, which is determined by the number of unique integers in the label column.


| Metric Name | Description | Optimization Direction | Regex Pattern | 
| --- | --- | --- | --- | 
| r2 | R-squared | maximize | "metrics={'r2': (\\S+)}" | 
| f1_score | binary cross-entropy | maximize | "metrics={'f1': (\\S+)}" | 
| accuracy_score | multiclass cross-entropy | maximize | "metrics={'accuracy': (\\S+)}" | 

## Tunable TabTransformer hyperparameters

Tune the TabTransformer model with the following hyperparameters. The hyperparameters that have the greatest effect on optimizing the TabTransformer evaluation metrics are: `learning_rate`, `input_dim`, `n_blocks`, `attn_dropout`, `mlp_dropout`, and `frac_shared_embed`. For a list of all the TabTransformer hyperparameters, see [TabTransformer hyperparameters](tabtransformer-hyperparameters.md).


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| learning_rate | ContinuousParameterRanges | MinValue: 0.001, MaxValue: 0.01 | 
| input_dim | CategoricalParameterRanges | [16, 32, 64, 128, 256, 512] | 
| n_blocks | IntegerParameterRanges | MinValue: 1, MaxValue: 12 | 
| attn_dropout | ContinuousParameterRanges | MinValue: 0.0, MaxValue: 0.8 | 
| mlp_dropout | ContinuousParameterRanges | MinValue: 0.0, MaxValue: 0.8 | 
| frac_shared_embed | ContinuousParameterRanges | MinValue: 0.0, MaxValue: 0.5 | 
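
As a minimal sketch, the table above can be expressed as plain Python structures together with a helper that samples one candidate configuration the way a random-search tuner might. The SDK's `ContinuousParameter`/`IntegerParameter`/`CategoricalParameter` classes are replaced here with tuples and lists so the example stays self-contained; the helper itself is an illustrative assumption, not a SageMaker API.

```python
import random

# Recommended tuning ranges for TabTransformer, mirrored from the table above.
continuous = {
    "learning_rate": (0.001, 0.01),
    "attn_dropout": (0.0, 0.8),
    "mlp_dropout": (0.0, 0.8),
    "frac_shared_embed": (0.0, 0.5),
}
integer = {"n_blocks": (1, 12)}
categorical = {"input_dim": [16, 32, 64, 128, 256, 512]}

def sample_config(rng=random):
    """Draw one hyperparameter configuration from the recommended ranges."""
    config = {k: rng.uniform(lo, hi) for k, (lo, hi) in continuous.items()}
    config.update({k: rng.randint(lo, hi) for k, (lo, hi) in integer.items()})
    config.update({k: rng.choice(v) for k, v in categorical.items()})
    return config

print(sample_config())
```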

# XGBoost algorithm with Amazon SageMaker AI

[XGBoost](https://github.com/dmlc/xgboost) (eXtreme Gradient Boosting) is a popular and efficient open-source implementation of the gradient boosted trees algorithm. Gradient boosting is a supervised learning algorithm that tries to accurately predict a target variable by combining multiple estimates from a set of simpler models. The XGBoost algorithm performs well in machine learning competitions for the following reasons:
+ Its robust handling of a variety of data types, relationships, and distributions.
+ The variety of hyperparameters that you can fine-tune.

You can use XGBoost for regression, classification (binary and multiclass), and ranking problems. 

You can use the new release of the XGBoost algorithm as either:
+ An Amazon SageMaker AI built-in algorithm.
+ A framework to run training scripts in your local environments.

This implementation has a smaller memory footprint, better logging, improved hyperparameter validation, and a larger set of metrics than the original versions. It provides an XGBoost `estimator` that runs a training script in a managed XGBoost environment. The current release of SageMaker AI XGBoost is based on the original XGBoost versions 1.0, 1.2, 1.3, 1.5, 1.7, and 3.0.

For more information about the Amazon SageMaker AI XGBoost algorithm, see the following blog posts:
+ [Introducing the open-source Amazon SageMaker AI XGBoost algorithm container](https://aws.amazon.com/blogs/machine-learning/introducing-the-open-source-amazon-sagemaker-xgboost-algorithm-container/)
+ [Amazon SageMaker AI XGBoost now offers fully distributed GPU training](https://aws.amazon.com/blogs/machine-learning/amazon-sagemaker-xgboost-now-offers-fully-distributed-gpu-training/)

## Supported versions


For more details, see our [support policy](https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-containers-support-policy.html#pre-built-containers-support-policy-ml-framework).
+ Framework (open source) mode: 1.2-1, 1.2-2, 1.3-1, 1.5-1, 1.7-1, 3.0-5
+ Algorithm mode: 1.2-1, 1.2-2, 1.3-1, 1.5-1, 1.7-1, 3.0-5

**Warning**  
Due to required compute capacity, version 3.0-5 of SageMaker AI XGBoost is not compatible with GPU instances from the P3 instance family for training or inference.

**Warning**  
Due to package compatibility issues, version 3.0-5 of SageMaker AI XGBoost does not support SageMaker Debugger.

**Warning**  
Due to required compute capacity, version 1.7-1 of SageMaker AI XGBoost is not compatible with GPU instances from the P2 instance family for training or inference.

**Warning**  
Network Isolation Mode: Do not upgrade pip beyond version 25.2. Newer versions may attempt to fetch setuptools from PyPI during module installation.

**Important**  
When you retrieve the SageMaker AI XGBoost image URI, do not use `:latest` or `:1` for the image URI tag. You must specify one of the [Supported versions](#xgboost-supported-versions) to choose the SageMaker AI-managed XGBoost container with the native XGBoost package version that you want to use. To find the package version migrated into the SageMaker AI XGBoost containers, see [Docker Registry Paths and Example Code](https://docs.aws.amazon.com/sagemaker/latest/dg-ecr-paths/sagemaker-algo-docker-registry-paths.html). Then choose your AWS Region, and navigate to the **XGBoost (algorithm)** section.
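
To make the version requirement concrete, here is a minimal sketch that guards against floating tags before retrieving the image. The commented `image_uris.retrieve` call mirrors the SageMaker Python SDK; the `validate_version` helper and the Region value are illustrative assumptions, not part of the SDK.

```python
# Supported SageMaker AI XGBoost version tags, per the list above.
SUPPORTED_VERSIONS = {"1.2-1", "1.2-2", "1.3-1", "1.5-1", "1.7-1", "3.0-5"}

def validate_version(version):
    """Reject floating tags like 'latest' or '1'; require an explicit supported tag."""
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"Use an explicit supported version tag, got {version!r}")
    return version

version = validate_version("1.7-1")
# from sagemaker import image_uris
# image_uri = image_uris.retrieve(framework="xgboost", region="us-west-2", version=version)
print(version)
```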

**Warning**  
The XGBoost 0.90 versions are deprecated. Support for security updates and bug fixes for XGBoost 0.90 is discontinued. We highly recommend that you upgrade to one of the newer XGBoost versions.

**Note**  
XGBoost v1.1 is not supported on SageMaker AI. XGBoost 1.1 has a broken capability to run prediction when the test input has fewer features than the training data in LIBSVM inputs. This capability has been restored in XGBoost v1.2. Consider using SageMaker AI XGBoost 1.2-2 or later.

**Note**  
You can use XGBoost v1.0-1, but it's not officially supported.

## EC2 instance recommendation for the XGBoost algorithm


SageMaker AI XGBoost supports CPU and GPU training and inference. Instance recommendations depend on training and inference needs, as well as the version of the XGBoost algorithm. Choose one of the following options for more information:
+ [CPU training](#Instance-XGBoost-training-cpu)
+ [GPU training](#Instance-XGBoost-training-gpu)
+ [Distributed CPU training](#Instance-XGBoost-distributed-training-cpu)
+ [Distributed GPU training](#Instance-XGBoost-distributed-training-gpu)
+ [Inference](#Instance-XGBoost-inference)

### Training


The SageMaker AI XGBoost algorithm supports CPU and GPU training.

#### CPU training


SageMaker AI XGBoost 1.0-1 or earlier only trains using CPUs. It is a memory-bound (as opposed to compute-bound) algorithm. So, a general-purpose compute instance (for example, M5) is a better choice than a compute-optimized instance (for example, C4). Further, we recommend that you have enough total memory in selected instances to hold the training data. It supports the use of disk space to handle data that does not fit into main memory. This is a result of the out-of-core feature available with the libsvm input mode. Even so, writing cache files onto disk slows the algorithm processing time. 

#### GPU training


SageMaker AI XGBoost version 1.2-2 or later supports GPU training. Despite higher per-instance costs, GPUs train more quickly, making them more cost effective. 

SageMaker AI XGBoost version 1.2-2 or later supports P2, P3, G4dn, and G5 GPU instance families.

SageMaker AI XGBoost version 1.7-1 or later supports P3, G4dn, and G5 GPU instance families. Note that due to compute capacity requirements, version 1.7-1 or later does not support the P2 instance family.

SageMaker AI XGBoost version 3.0-5 or later supports G4dn and G5 GPU instance families. Note that due to compute capacity requirements, version 3.0-5 or later does not support the P3 instance family.

To take advantage of GPU training:
+ Specify the instance type as one of the GPU instances (for example, G4dn) 
+ Set the `tree_method` hyperparameter to `gpu_hist` in your existing XGBoost script
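
As a minimal sketch, the two steps above amount to a hyperparameter dictionary and a GPU instance-type choice like the following. The non-GPU hyperparameter values are arbitrary placeholders, and the estimator construction that would consume them is omitted.

```python
# Hyperparameters enabling GPU training: tree_method="gpu_hist" routes tree
# construction onto the GPU when a GPU instance type is selected.
hyperparameters = {
    "tree_method": "gpu_hist",
    "objective": "reg:squarederror",
    "num_round": "50",
}

# One of the GPU instance families the text recommends (for example, G4dn).
instance_type = "ml.g4dn.xlarge"

print(hyperparameters["tree_method"], instance_type)
```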

### Distributed training


SageMaker AI XGBoost supports CPU and GPU instances for distributed training.

#### Distributed CPU training


To run CPU training on multiple instances, set the `instance_count` parameter for the estimator to a value greater than one. The input data must be divided between the total number of instances. 

##### Divide input data across instances


Divide the input data using the following steps:

1. Break the input data down into smaller files. The number of files should be at least equal to the number of instances used for distributed training. Using multiple smaller files as opposed to one large file also decreases the data download time for the training job.

1. When creating your [TrainingInput](https://sagemaker.readthedocs.io/en/stable/api/utility/inputs.html), set the distribution parameter to `ShardedByS3Key`. With this, each instance gets approximately *1/n* of the number of files in S3 if there are *n* instances specified in the training job.
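
The effect of `ShardedByS3Key` described in the steps above is that each of *n* instances receives roughly *1/n* of the files. The sketch below shows the `TrainingInput` call as a comment (it requires the SageMaker SDK and an S3 path) and works out the per-instance file counts with plain arithmetic.

```python
# from sagemaker.inputs import TrainingInput
# train_input = TrainingInput(s3_data, distribution="ShardedByS3Key")

def files_per_instance(num_files, num_instances):
    """How many input files each instance receives under ShardedByS3Key."""
    base, extra = divmod(num_files, num_instances)
    return [base + (1 if i < extra else 0) for i in range(num_instances)]

print(files_per_instance(10, 4))  # [3, 3, 2, 2]
```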

#### Distributed GPU training


You can use distributed training with either single-GPU or multi-GPU instances.

**Distributed training with single-GPU instances**

SageMaker AI XGBoost versions 1.2-2 through 1.3-1 only support single-GPU instance training. This means that even if you select a multi-GPU instance, only one GPU is used per instance.

You must divide your input data between the total number of instances if: 
+ You use XGBoost versions 1.2-2 through 1.3-1.
+ You do not need to use multi-GPU instances.

 For more information, see [Divide input data across instances](#Instance-XGBoost-distributed-training-divide-data).

**Note**  
Versions 1.2-2 through 1.3-1 of SageMaker AI XGBoost only use one GPU per instance even if you choose a multi-GPU instance.

**Distributed training with multi-GPU instances**

Starting with version 1.5-1, SageMaker AI XGBoost offers distributed GPU training with [Dask](https://www.dask.org/). With Dask you can utilize all GPUs when using one or more multi-GPU instances. Dask also works when using single-GPU instances. 

Train with Dask using the following steps:

1. Either omit the `distribution` parameter in your [TrainingInput](https://sagemaker.readthedocs.io/en/stable/api/utility/inputs.html) or set it to `FullyReplicated`.

1. When defining your hyperparameters, set `use_dask_gpu_training` to `"true"`.
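
As a minimal sketch, the Dask steps above reduce to a hyperparameter dictionary like the following; note that `use_dask_gpu_training` takes the string `"true"`, and the `TrainingInput` distribution is simply left at its default (`FullyReplicated`). The other hyperparameter values are arbitrary placeholders.

```python
# Hyperparameters enabling distributed GPU training with Dask.
hyperparameters = {
    "use_dask_gpu_training": "true",  # must be the string "true", not a boolean
    "objective": "reg:squarederror",
    "num_round": "50",
}

print(hyperparameters["use_dask_gpu_training"])
```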

**Important**  
Distributed training with Dask only supports CSV and Parquet input formats. If you use other data formats such as LIBSVM or PROTOBUF, the training job fails.   
For Parquet data, ensure that the column names are saved as strings. Columns that have names of other data types will fail to load.

**Important**  
Distributed training with Dask does not support pipe mode. If pipe mode is specified, the training job fails.

There are a few considerations to be aware of when training SageMaker AI XGBoost with Dask. Be sure to split your data into smaller files. Dask reads each Parquet file as a partition. There is a Dask worker for every GPU. As a result, the number of files should be greater than the total number of GPUs (instance count x number of GPUs per instance). Having a very large number of files can also degrade performance. For more information, see [Dask Best Practices](https://docs.dask.org/en/stable/best-practices.html).

#### Variations in output


The specified `tree_method` hyperparameter determines the algorithm that is used for XGBoost training. The tree methods `approx`, `hist` and `gpu_hist` are all approximate methods and use sketching for quantile calculation. For more information, see [Tree Methods](https://xgboost.readthedocs.io/en/stable/treemethod.html) in the XGBoost documentation. Sketching is an approximate algorithm. Therefore, you can expect variations in the model depending on factors such as the number of workers chosen for distributed training. The significance of the variation is data-dependent.

### Inference


SageMaker AI XGBoost supports CPU and GPU instances for inference. For information about the instance types for inference, see [Amazon SageMaker AI ML Instance Types](https://aws.amazon.com/sagemaker/pricing/).

# How to use SageMaker AI XGBoost

With SageMaker AI, you can use XGBoost as a built-in algorithm or framework. When you use XGBoost as a framework, you have more flexibility and access to more advanced scenarios because you can customize your own training scripts. The following sections describe how to use XGBoost with the SageMaker Python SDK, and the input/output interface for the XGBoost algorithm. For information on how to use XGBoost from the Amazon SageMaker Studio Classic UI, see [SageMaker JumpStart pretrained models](studio-jumpstart.md).

**Topics**
+ [Use XGBoost as a framework](#xgboost-how-to-framework)
+ [Use XGBoost as a built-in algorithm](#xgboost-how-to-built-in)
+ [Input/Output interface for the XGBoost algorithm](#InputOutput-XGBoost)

## Use XGBoost as a framework


Use XGBoost as a framework to run your customized training scripts that can incorporate additional data processing into your training jobs. In the following code example, the SageMaker Python SDK provides the XGBoost API as a framework. This functions similarly to how SageMaker AI provides other framework APIs, such as TensorFlow, MXNet, and PyTorch.

```
import boto3
import sagemaker
from sagemaker.xgboost.estimator import XGBoost
from sagemaker.session import Session
from sagemaker.inputs import TrainingInput

# initialize hyperparameters
hyperparameters = {
        "max_depth":"5",
        "eta":"0.2",
        "gamma":"4",
        "min_child_weight":"6",
        "subsample":"0.7",
        "verbosity":"1",
        "objective":"reg:squarederror",
        "num_round":"50"}

# set an output path where the trained model will be saved
bucket = sagemaker.Session().default_bucket()
prefix = 'DEMO-xgboost-as-a-framework'
output_path = 's3://{}/{}/{}/output'.format(bucket, prefix, 'abalone-xgb-framework')

# construct a SageMaker AI XGBoost estimator
# specify the entry_point to your xgboost training script
estimator = XGBoost(entry_point = "your_xgboost_abalone_script.py", 
                    framework_version='1.7-1',
                    hyperparameters=hyperparameters,
                    role=sagemaker.get_execution_role(),
                    instance_count=1,
                    instance_type='ml.m5.2xlarge',
                    output_path=output_path)

# define the data type and paths to the training and validation datasets
content_type = "libsvm"
train_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'train'), content_type=content_type)
validation_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'validation'), content_type=content_type)

# execute the XGBoost training job
estimator.fit({'train': train_input, 'validation': validation_input})
```

For an end-to-end example of using SageMaker AI XGBoost as a framework, see [Regression with Amazon SageMaker AI XGBoost](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/xgboost_abalone/xgboost_abalone_dist_script_mode.html).

## Use XGBoost as a built-in algorithm


Use the XGBoost built-in algorithm to build an XGBoost training container as shown in the following code example. You can automatically retrieve the XGBoost built-in algorithm image URI using the SageMaker AI `image_uris.retrieve` API. If you are using [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) version 1, use the `get_image_uri` API. To make sure that the `image_uris.retrieve` API finds the correct URI, see [Common parameters for built-in algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html). Then look up `xgboost` from the full list of built-in algorithm image URIs and available regions.

After specifying the XGBoost image URI, use the XGBoost container to construct an estimator using the SageMaker AI Estimator API and initiate a training job. This XGBoost built-in algorithm mode does not incorporate your own XGBoost training script and runs directly on the input datasets.

**Important**  
When you retrieve the SageMaker AI XGBoost image URI, do not use `:latest` or `:1` for the image URI tag. You must specify one of the [Supported versions](xgboost.md#xgboost-supported-versions) to choose the SageMaker AI-managed XGBoost container with the native XGBoost package version that you want to use. To find the package version migrated into the SageMaker AI XGBoost containers, see [Docker Registry Paths and Example Code](https://docs.aws.amazon.com/sagemaker/latest/dg-ecr-paths/sagemaker-algo-docker-registry-paths.html). Then choose your AWS Region, and navigate to the **XGBoost (algorithm)** section.

```
import sagemaker
import boto3
from sagemaker import image_uris
from sagemaker.session import Session
from sagemaker.inputs import TrainingInput

# initialize hyperparameters
hyperparameters = {
        "max_depth":"5",
        "eta":"0.2",
        "gamma":"4",
        "min_child_weight":"6",
        "subsample":"0.7",
        "objective":"reg:squarederror",
        "num_round":"50"}

# set an output path where the trained model will be saved
bucket = sagemaker.Session().default_bucket()
prefix = 'DEMO-xgboost-as-a-built-in-algo'
output_path = 's3://{}/{}/{}/output'.format(bucket, prefix, 'abalone-xgb-built-in-algo')

# this line automatically looks up the XGBoost image URI and builds an XGBoost container.
# specify the repo_version depending on your preference.
region = boto3.Session().region_name
xgboost_container = sagemaker.image_uris.retrieve("xgboost", region, "1.7-1")

# construct a SageMaker AI estimator that calls the xgboost-container
estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container, 
                                          hyperparameters=hyperparameters,
                                          role=sagemaker.get_execution_role(),
                                          instance_count=1, 
                                          instance_type='ml.m5.2xlarge', 
                                          volume_size=5, # 5 GB 
                                          output_path=output_path)

# define the data type and paths to the training and validation datasets
content_type = "libsvm"
train_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'train'), content_type=content_type)
validation_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'validation'), content_type=content_type)

# execute the XGBoost training job
estimator.fit({'train': train_input, 'validation': validation_input})
```

For more information about how to set up the XGBoost as a built-in algorithm, see the following notebook examples.
+ [Managed Spot Training for XGBoost](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/xgboost_abalone/xgboost_managed_spot_training.html)
+ [Regression with Amazon SageMaker AI XGBoost (Parquet input)](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/xgboost_abalone/xgboost_parquet_input_training.html)

## Input/Output interface for the XGBoost algorithm


Gradient boosting operates on tabular data, with the rows representing observations, one column representing the target variable or label, and the remaining columns representing features. 

The SageMaker AI implementation of XGBoost supports the following data formats for training and inference:
+  *text/libsvm* (default) 
+  *text/csv*
+  *application/x-parquet*
+  *application/x-recordio-protobuf*

**Note**  
There are a few considerations to be aware of regarding training and inference input:  
+ For increased performance, we recommend using XGBoost with *File mode*, in which your data from Amazon S3 is stored on the training instance volumes.
+ For training with columnar input, the algorithm assumes that the target variable (label) is the first column. For inference, the algorithm assumes that the input has no label column.
+ For CSV data, the input should not have a header record.
+ For LIBSVM training, the algorithm assumes that the columns after the label column contain the zero-based index and value pairs for features. So each row has the format: <label> <index0>:<value0> <index1>:<value1> ...
+ For information on instance types and distributed training, see [EC2 instance recommendation for the XGBoost algorithm](xgboost.md#Instance-XGBoost).
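As a concrete illustration of the LIBSVM row layout described above, the following sketch formats a sparse feature dictionary into a training row. The helper name is a hypothetical example, not part of any SageMaker API:

```python
# Hypothetical formatter for a LIBSVM training row: the label followed by
# zero-based index:value pairs for the nonzero features.
def libsvm_row(label, features):
    # features: dict mapping zero-based feature index -> value
    pairs = " ".join(f"{i}:{v}" for i, v in sorted(features.items()))
    return f"{label} {pairs}"

print(libsvm_row(1, {0: 0.5, 3: 2.0}))  # 1 0:0.5 3:2.0
```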

For CSV training input mode, the total memory available to the algorithm must be able to hold the training dataset. The total memory available is calculated as `Instance Count * the memory available in the InstanceType`. For LIBSVM training input mode, this is not required, but we recommend it.
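The memory calculation above is simple arithmetic, and a back-of-the-envelope check like the following can help with capacity planning before launching a job. This is an illustrative sketch with assumed figures, not an official sizing tool:

```python
# Rough capacity check for CSV training input: the total cluster memory
# (instance count * memory per instance) must hold the whole dataset.
def fits_in_memory(dataset_gb, instance_count, memory_gb_per_instance):
    total_memory_gb = instance_count * memory_gb_per_instance
    return dataset_gb <= total_memory_gb

# Example: a 40 GB CSV dataset on 2 instances with 32 GB each (64 GB total).
print(fits_in_memory(40, 2, 32))  # True
print(fits_in_memory(80, 2, 32))  # False: add instances or choose a larger type
```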

For v1.3-1 and later, SageMaker AI XGBoost saves the model in the XGBoost internal binary format, using `Booster.save_model`. Previous versions use the Python pickle module to serialize/deserialize the model.

**Note**  
Be mindful of versions when using a SageMaker AI XGBoost model in open source XGBoost. Versions 1.3-1 and later use the XGBoost internal binary format while previous versions use the Python pickle module.

**To use a model trained with SageMaker AI XGBoost v1.3-1 or later in open source XGBoost**
+ Use the following Python code:

  ```
  import xgboost as xgb
  
  xgb_model = xgb.Booster()
  xgb_model.load_model(model_file_path)
  xgb_model.predict(dtest)
  ```

**To use a model trained with previous versions of SageMaker AI XGBoost in open source XGBoost**
+ Use the following Python code:

  ```
  import pickle as pkl 
  import tarfile
  
  t = tarfile.open('model.tar.gz', 'r:gz')
  t.extractall()
  
  model = pkl.load(open(model_file_path, 'rb'))
  
  # prediction with test data
  pred = model.predict(dtest)
  ```

**To differentiate the importance of labeled data points, use instance weight supports**
+ SageMaker AI XGBoost allows customers to differentiate the importance of labeled data points by assigning each instance a weight value. For *text/libsvm* input, customers can assign weight values to data instances by attaching them after the labels. For example, `label:weight idx_0:val_0 idx_1:val_1...`. For *text/csv* input, customers need to turn on the `csv_weights` flag in the parameters and attach weight values in the column after labels. For example: `label,weight,val_0,val_1,...`.
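The two weight layouts above can be sketched with a couple of small formatting helpers. The function names are hypothetical illustrations of the row formats, not part of the SageMaker API:

```python
# Hypothetical helpers showing how instance weights are attached in each
# input format: after the label for LIBSVM, and in the second column for
# CSV (which also requires the csv_weights hyperparameter set to 1).
def weighted_libsvm_row(label, weight, features):
    pairs = " ".join(f"{i}:{v}" for i, v in sorted(features.items()))
    return f"{label}:{weight} {pairs}"

def weighted_csv_row(label, weight, values):
    return ",".join(str(x) for x in [label, weight, *values])

print(weighted_libsvm_row(1, 0.5, {0: 1.2, 2: 3.4}))  # 1:0.5 0:1.2 2:3.4
print(weighted_csv_row(1, 0.5, [1.2, 3.4]))           # 1,0.5,1.2,3.4
```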

# XGBoost sample notebooks

The following list contains a variety of sample Jupyter notebooks that address different use cases of the Amazon SageMaker AI XGBoost algorithm.
+ [How to Create a Custom XGBoost container](https://sagemaker-examples.readthedocs.io/en/latest/aws_sagemaker_studio/sagemaker_studio_image_build/xgboost_bring_your_own/Batch_Transform_BYO_XGB.html) – This notebook shows you how to build a custom XGBoost Container with Amazon SageMaker AI Batch Transform.
+ [Regression with XGBoost using Parquet](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/xgboost_abalone/xgboost_parquet_input_training.html) – This notebook shows you how to use the Abalone dataset in Parquet to train a XGBoost model.
+ [How to Train and Host a Multiclass Classification Model](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/xgboost_mnist/xgboost_mnist.html) – This notebook shows how to use the MNIST dataset to train and host a multiclass classification model.
+ [How to train a Model for Customer Churn Prediction](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_applying_machine_learning/xgboost_customer_churn/xgboost_customer_churn.html) – This notebook shows you how to train a model to Predict Mobile Customer Departure in an effort to identify unhappy customers.
+ [An Introduction to Amazon SageMaker AI Managed Spot infrastructure for XGBoost Training](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/xgboost_abalone/xgboost_managed_spot_training.html) – This notebook shows you how to use Spot Instances for training with a XGBoost Container.
+ [How to use Amazon SageMaker Debugger to debug XGBoost Training Jobs](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/xgboost_census_explanations/xgboost-census-debugger-rules.html) – This notebook shows you how to use Amazon SageMaker Debugger to monitor training jobs to detect inconsistencies using built-in debugging rules.

For instructions on how to create and access Jupyter notebook instances that you can use to run the examples in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, choose the **SageMaker AI Examples** tab to see a list of all of the SageMaker AI samples. The XGBoost example notebooks are located in the **Introduction to Amazon algorithms** section. To open a notebook, choose its **Use** tab and choose **Create copy**.

# How the SageMaker AI XGBoost algorithm works

[XGBoost](https://github.com/dmlc/xgboost) is a popular and efficient open-source implementation of the gradient boosted trees algorithm. Gradient boosting is a supervised learning algorithm, which attempts to accurately predict a target variable by combining the estimates of a set of simpler, weaker models.

When using [gradient boosting](https://en.wikipedia.org/wiki/Gradient_boosting) for regression, the weak learners are regression trees, and each regression tree maps an input data point to one of its leaves that contains a continuous score. XGBoost minimizes a regularized (L1 and L2) objective function that combines a convex loss function (based on the difference between the predicted and target outputs) and a penalty term for model complexity (in other words, the regression tree functions). The training proceeds iteratively, adding new trees that predict the residuals or errors of prior trees that are then combined with previous trees to make the final prediction. It's called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models.
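The residual-fitting loop described above can be sketched in plain Python. This is a deliberately minimal illustration of the idea for one-dimensional squared-error regression, not the XGBoost implementation (which adds regularization, second-order gradients, and much deeper trees):

```python
# Minimal sketch of gradient boosting for regression with squared error:
# each weak learner fits the residuals of the ensemble so far, and the
# final prediction is the initial guess plus a shrunken sum of learners.
def fit_stump(xs, residuals):
    # Weak learner: a depth-1 "tree" that splits on the best threshold
    # and predicts the mean residual on each side of the split.
    best = None
    for t in xs:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, rounds=20, eta=0.3):
    base = sum(ys) / len(ys)  # initial prediction: the global mean
    trees, preds = [], [base] * len(xs)
    for _ in range(rounds):
        # For squared error, the negative gradient is just the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + eta * tree(x) for p, x in zip(preds, xs)]
    return lambda x: base + eta * sum(tree(x) for tree in trees)

# Tiny toy dataset: predictions approach the targets as rounds accumulate.
model = boost([1, 2, 3, 4], [1.0, 1.2, 3.0, 3.1])
```

The `eta` shrinkage factor here plays the same role as the `eta` hyperparameter in XGBoost: smaller values make each boosting step more conservative.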

The following diagram briefly illustrates how gradient tree boosting works.

![\[A diagram illustrating gradient tree boosting.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/xgboost_illustration.png)


**For more detail on XGBoost, see:**
+ [XGBoost: A Scalable Tree Boosting System](https://arxiv.org/pdf/1603.02754.pdf)
+ [Gradient Tree Boosting ](https://www.sas.upenn.edu/~fdiebold/NoHesitations/BookAdvanced.pdf#page=380)
+ [Introduction to Boosted Trees](https://xgboost.readthedocs.io/en/latest/tutorials/model.html)

# XGBoost hyperparameters

The following table contains the subset of hyperparameters that are required or most commonly used for the Amazon SageMaker AI XGBoost algorithm. These are parameters that are set by users to facilitate the estimation of model parameters from data. The required hyperparameters that must be set are listed first, in alphabetical order. The optional hyperparameters that can be set are listed next, also in alphabetical order. The SageMaker AI XGBoost algorithm is an implementation of the open-source DMLC XGBoost package. For details about the full set of hyperparameters that can be configured for this version of XGBoost, see [XGBoost Parameters](https://xgboost.readthedocs.io/en/release_1.2.0/).


| Parameter Name | Description | 
| --- | --- | 
| num_class |  The number of classes. **Required** if `objective` is set to *multi:softmax* or *multi:softprob*. Valid values: Integer.  | 
| num_round |  The number of rounds to run the training. **Required** Valid values: Integer.  | 
| alpha |  L1 regularization term on weights. Increasing this value makes models more conservative. **Optional** Valid values: Float. Default value: 0  | 
| base_score |  The initial prediction score of all instances, global bias. **Optional** Valid values: Float. Default value: 0.5  | 
| booster |  Which booster to use. The `gbtree` and `dart` values use a tree-based model, while `gblinear` uses a linear function. **Optional** Valid values: String. One of `"gbtree"`, `"gblinear"`, or `"dart"`. Default value: `"gbtree"`  | 
| colsample_bylevel |  Subsample ratio of columns for each split, in each level. **Optional** Valid values: Float. Range: [0,1]. Default value: 1  | 
| colsample_bynode |  Subsample ratio of columns from each node. **Optional** Valid values: Float. Range: (0,1]. Default value: 1  | 
| colsample_bytree |  Subsample ratio of columns when constructing each tree. **Optional** Valid values: Float. Range: [0,1]. Default value: 1  | 
| csv_weights |  When this flag is enabled, XGBoost differentiates the importance of instances for CSV input by taking the second column (the column after labels) in training data as the instance weights. **Optional** Valid values: 0 or 1 Default value: 0  | 
| deterministic_histogram |  When this flag is enabled, XGBoost builds histograms on the GPU deterministically. Used only if `tree_method` is set to `gpu_hist`. For a full list of valid inputs, see [XGBoost Parameters](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst). **Optional** Valid values: String. Range: `"true"` or `"false"`. Default value: `"true"`  | 
| early_stopping_rounds |  The model trains until the validation score stops improving. Validation error needs to decrease at least every `early_stopping_rounds` to continue training. SageMaker AI hosting uses the best model for inference. **Optional** Valid values: Integer. Default value: -  | 
| eta |  Step size shrinkage used in updates to prevent overfitting. After each boosting step, you can directly get the weights of new features. The `eta` parameter actually shrinks the feature weights to make the boosting process more conservative. **Optional** Valid values: Float. Range: [0,1]. Default value: 0.3  | 
| eval_metric |  Evaluation metrics for validation data. A default metric is assigned according to the objective: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html) For a list of valid inputs, see [XGBoost Learning Task Parameters](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst#learning-task-parameters). **Optional** Valid values: String. Default value: Default according to objective.  | 
| gamma |  Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger the value, the more conservative the algorithm is. **Optional** Valid values: Float. Range: [0,∞). Default value: 0  | 
| grow_policy |  Controls the way that new nodes are added to the tree. Currently supported only if `tree_method` is set to `hist`. **Optional** Valid values: String. Either `"depthwise"` or `"lossguide"`. Default value: `"depthwise"`  | 
| interaction_constraints |  Specify groups of variables that are allowed to interact. **Optional** Valid values: Nested list of integers. Each integer represents a feature, and each nested list contains features that are allowed to interact, for example, [[1,2], [3,4,5]]. Default value: None  | 
| lambda |  L2 regularization term on weights. Increasing this value makes models more conservative. **Optional** Valid values: Float. Default value: 1  | 
| lambda_bias |  L2 regularization term on bias. **Optional** Valid values: Float. Range: [0.0, 1.0]. Default value: 0  | 
| max_bin |  Maximum number of discrete bins to bucket continuous features. Used only if `tree_method` is set to `hist`.  **Optional** Valid values: Integer. Default value: 256  | 
| max_delta_step |  Maximum delta step allowed for each tree's weight estimation. When a positive integer is used, it helps make the update more conservative. The preferred option is to use it in logistic regression. Set it to 1-10 to help control the update.  **Optional** Valid values: Integer. Range: [0,∞). Default value: 0  | 
| max_depth |  Maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit. 0 indicates no limit. A limit is required when `grow_policy` is set to `depthwise`. **Optional** Valid values: Integer. Range: [0,∞) Default value: 6  | 
| max_leaves |  Maximum number of nodes to be added. Relevant only if `grow_policy` is set to `lossguide`. **Optional** Valid values: Integer. Default value: 0  | 
| min_child_weight |  Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than `min_child_weight`, the building process gives up further partitioning. In linear regression models, this simply corresponds to a minimum number of instances needed in each node. The larger the value, the more conservative the algorithm is. **Optional** Valid values: Float. Range: [0,∞). Default value: 1  | 
| monotone_constraints |  Specifies monotonicity constraints on any feature. **Optional** Valid values: Tuple of Integers. Valid integers: -1 (decreasing constraint), 0 (no constraint), 1 (increasing constraint).  For example, (0, 1): No constraint on first predictor, and an increasing constraint on the second. (-1, 1): Decreasing constraint on first predictor, and an increasing constraint on the second. Default value: (0, 0)  | 
| normalize_type |  Type of normalization algorithm. **Optional** Valid values: Either *tree* or *forest*. Default value: *tree*  | 
| nthread |  Number of parallel threads used to run *xgboost*. **Optional** Valid values: Integer. Default value: Maximum number of threads.  | 
| objective |  Specifies the learning task and the corresponding learning objective. Examples: `reg:logistic`, `multi:softmax`, `reg:squarederror`. For a full list of valid inputs, refer to [XGBoost Learning Task Parameters](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst#learning-task-parameters). **Optional** Valid values: String Default value: `"reg:squarederror"`  | 
| one_drop |  When this flag is enabled, at least one tree is always dropped during the dropout. **Optional** Valid values: 0 or 1 Default value: 0  | 
| process_type |  The type of boosting process to run. **Optional** Valid values: String. Either `"default"` or `"update"`. Default value: `"default"`  | 
| rate_drop |  The dropout rate that specifies the fraction of previous trees to drop during the dropout. **Optional** Valid values: Float. Range: [0.0, 1.0]. Default value: 0.0  | 
| refresh_leaf |  This is a parameter of the 'refresh' updater plug-in. When set to `true` (1), tree leaves and tree node stats are updated. When set to `false` (0), only tree node stats are updated. **Optional** Valid values: 0/1 Default value: 1  | 
| sample_type |  Type of sampling algorithm. **Optional** Valid values: Either `uniform` or `weighted`. Default value: `uniform`  | 
| scale_pos_weight |  Controls the balance of positive and negative weights. It's useful for unbalanced classes. A typical value to consider: `sum(negative cases)` / `sum(positive cases)`. **Optional** Valid values: Float Default value: 1  | 
| seed |  Random number seed. **Optional** Valid values: Integer Default value: 0  | 
| single_precision_histogram |  When this flag is enabled, XGBoost uses single precision to build histograms instead of double precision. Used only if `tree_method` is set to `hist` or `gpu_hist`. For a full list of valid inputs, see [XGBoost Parameters](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst). **Optional** Valid values: String. Range: `"true"` or `"false"` Default value: `"false"`  | 
| sketch_eps |  Used only for the approximate greedy algorithm. This translates into O(1 / `sketch_eps`) number of bins. Compared to directly selecting the number of bins, this comes with a theoretical guarantee of sketch accuracy. **Optional** Valid values: Float, Range: [0, 1]. Default value: 0.03  | 
| skip_drop |  Probability of skipping the dropout procedure during a boosting iteration. **Optional** Valid values: Float. Range: [0.0, 1.0]. Default value: 0.0  | 
| subsample |  Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly collects half of the data instances to grow trees. This prevents overfitting. **Optional** Valid values: Float. Range: [0,1]. Default value: 1  | 
| tree_method |  The tree construction algorithm used in XGBoost. **Optional** Valid values: One of `auto`, `exact`, `approx`, `hist`, or `gpu_hist`. Default value: `auto`  | 
| tweedie_variance_power |  Parameter that controls the variance of the Tweedie distribution. **Optional** Valid values: Float. Range: (1, 2). Default value: 1.5  | 
| updater |  A comma-separated string that defines the sequence of tree updaters to run. This provides a modular way to construct and to modify the trees. For a full list of valid inputs, see [XGBoost Parameters](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst). **Optional** Valid values: Comma-separated string. Default value: `grow_colmaker`, `prune`  | 
| use_dask_gpu_training |  Set `use_dask_gpu_training` to `"true"` if you want to run distributed GPU training with Dask. Dask GPU training is only supported for versions 1.5-1 and later. Do not set this value to `"true"` for versions preceding 1.5-1. For more information, see [Distributed GPU training](xgboost.md#Instance-XGBoost-distributed-training-gpu). **Optional** Valid values: String. Range: `"true"` or `"false"` Default value: `"false"`  | 
| verbosity | Verbosity of printing messages. Valid values: 0 (silent), 1 (warning), 2 (info), 3 (debug). **Optional** Default value: 1  | 

# Tune an XGBoost Model

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your training and validation datasets. You choose three types of hyperparameters:
+ a learning `objective` function to optimize during model training
+ an `eval_metric` to use to evaluate model performance during validation
+ a set of hyperparameters and a range of values for each to use when tuning the model automatically

You choose the evaluation metric from the set of evaluation metrics that the algorithm computes. Automatic model tuning searches the chosen hyperparameters to find the combination of values that results in the model that optimizes the evaluation metric. 

**Note**  
Automatic model tuning for XGBoost 0.90 is only available from the Amazon SageMaker SDKs, not from the SageMaker AI console.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Evaluation Metrics Computed by the XGBoost Algorithm

The XGBoost algorithm computes the following metrics to use for model validation. When tuning the model, choose one of these metrics to evaluate the model. For the full list of valid `eval_metric` values, refer to [XGBoost Learning Task Parameters](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst#learning-task-parameters).


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| validation:accuracy |  Classification rate, calculated as #(right)/#(all cases).  |  Maximize  | 
| validation:auc |  Area under the curve.  |  Maximize  | 
| validation:error |  Binary classification error rate, calculated as #(wrong cases)/#(all cases).  |  Minimize  | 
| validation:f1 |  Indicator of classification accuracy, calculated as the harmonic mean of precision and recall.  |  Maximize  | 
| validation:logloss |  Negative log-likelihood.  |  Minimize  | 
| validation:mae |  Mean absolute error.  |  Minimize  | 
| validation:map |  Mean average precision.  |  Maximize  | 
| validation:merror |  Multiclass classification error rate, calculated as #(wrong cases)/#(all cases).  |  Minimize  | 
| validation:mlogloss |  Negative log-likelihood for multiclass classification.  |  Minimize  | 
| validation:mse |  Mean squared error.  |  Minimize  | 
| validation:ndcg |  Normalized Discounted Cumulative Gain.  |  Maximize  | 
| validation:rmse |  Root mean square error.  |  Minimize  | 
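To make the error and accuracy definitions above concrete, the following sketch computes them for a small set of binary predictions. The helper name is illustrative; these values correspond to what XGBoost reports as `validation:error` and `validation:accuracy`:

```python
# validation:error = #(wrong cases) / #(all cases); accuracy is its complement.
def error_rate(y_true, y_pred):
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)

y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
print(error_rate(y_true, y_pred))      # 0.25 (minimize)
print(1 - error_rate(y_true, y_pred))  # 0.75 (accuracy, maximize)
```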

## Tunable XGBoost Hyperparameters

Tune the XGBoost model with the following hyperparameters. The hyperparameters that have the greatest effect on optimizing the XGBoost evaluation metrics are: `alpha`, `min_child_weight`, `subsample`, `eta`, and `num_round`. 


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| alpha |  ContinuousParameterRanges  |  MinValue: 0, MaxValue: 1000  | 
| colsample_bylevel |  ContinuousParameterRanges  |  MinValue: 0.1, MaxValue: 1  | 
| colsample_bynode |  ContinuousParameterRanges  |  MinValue: 0.1, MaxValue: 1  | 
| colsample_bytree |  ContinuousParameterRanges  |  MinValue: 0.5, MaxValue: 1  | 
| eta |  ContinuousParameterRanges  |  MinValue: 0.1, MaxValue: 0.5  | 
| gamma |  ContinuousParameterRanges  |  MinValue: 0, MaxValue: 5  | 
| lambda |  ContinuousParameterRanges  |  MinValue: 0, MaxValue: 1000  | 
| max_delta_step |  IntegerParameterRanges  |  [0, 10]  | 
| max_depth |  IntegerParameterRanges  |  [0, 10]  | 
| min_child_weight |  ContinuousParameterRanges  |  MinValue: 0, MaxValue: 120  | 
| num_round |  IntegerParameterRanges  |  [1, 4000]  | 
| subsample |  ContinuousParameterRanges  |  MinValue: 0.5, MaxValue: 1  | 
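For local experimentation, the highest-impact ranges from the table can be captured as a plain dictionary with a small validation helper. This is a sketch for sanity-checking candidate values; in an actual tuning job you would express these ranges with the SageMaker `HyperparameterTuner` API instead:

```python
# Recommended tuning ranges (from the table above) for the hyperparameters
# with the greatest effect, as (min, max) pairs.
TUNABLE_RANGES = {
    "alpha": (0, 1000),
    "eta": (0.1, 0.5),
    "min_child_weight": (0, 120),
    "num_round": (1, 4000),
    "subsample": (0.5, 1),
}

def in_recommended_range(name, value):
    lo, hi = TUNABLE_RANGES[name]
    return lo <= value <= hi

print(in_recommended_range("eta", 0.2))  # True
print(in_recommended_range("eta", 0.9))  # False: outside the suggested range
```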

# Deprecated Versions of XGBoost and their Upgrades

This topic contains documentation for previous versions of Amazon SageMaker AI XGBoost that are still available but deprecated. It also provides instructions on how to upgrade deprecated versions of XGBoost, when possible, to more current versions.

**Topics**
+ [Upgrade XGBoost Version 0.90 to Version 1.5](xgboost-version-0.90.md)
+ [XGBoost Version 0.72](xgboost-72.md)

# Upgrade XGBoost Version 0.90 to Version 1.5

If you are using the SageMaker Python SDK, to upgrade existing XGBoost 0.90 jobs to version 1.5, you must have version 2.x of the SDK installed and change the XGBoost `version` and `framework_version` parameters to 1.5-1. If you are using Boto3, you need to update the Docker image, and a few hyperparameters and learning objectives.

**Topics**
+ [Upgrade SageMaker AI Python SDK Version 1.x to Version 2.x](#upgrade-xgboost-version-0.90-sagemaker-python-sdk)
+ [Change the image tag to 1.5-1](#upgrade-xgboost-version-0.90-change-image-tag)
+ [Change Docker Image for Boto3](#upgrade-xgboost-version-0.90-boto3)
+ [Update Hyperparameters and Learning Objectives](#upgrade-xgboost-version-0.90-hyperparameters)

## Upgrade SageMaker AI Python SDK Version 1.x to Version 2.x


If you are still using version 1.x of the SageMaker Python SDK, you must upgrade to version 2.x of the SageMaker Python SDK. For information on the latest version of the SageMaker Python SDK, see [Use Version 2.x of the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html). To install the latest version, run:

```
python -m pip install --upgrade sagemaker
```

## Change the image tag to 1.5-1


If you are using the SageMaker Python SDK and using the XGBoost built-in algorithm, change the version parameter in `image_uris.retrieve`.

```
from sagemaker import image_uris
xgboost_container = image_uris.retrieve(framework="xgboost", region="us-west-2", version="1.5-1")

estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container, 
                                          hyperparameters=hyperparameters,
                                          role=sagemaker.get_execution_role(),
                                          instance_count=1, 
                                          instance_type='ml.m5.2xlarge', 
                                          volume_size=5, # 5 GB 
                                          output_path=output_path)
```

If you are using the SageMaker Python SDK and using XGBoost as a framework to run your customized training scripts, change the `framework_version` parameter in the XGBoost API.

```
estimator = XGBoost(entry_point = "your_xgboost_abalone_script.py", 
                    framework_version='1.5-1',
                    hyperparameters=hyperparameters,
                    role=sagemaker.get_execution_role(),
                    instance_count=1,
                    instance_type='ml.m5.2xlarge',
                    output_path=output_path)
```

`sagemaker.session.s3_input` in SageMaker Python SDK version 1.x has been renamed to `sagemaker.inputs.TrainingInput`. You must use `sagemaker.inputs.TrainingInput` as in the following example.

```
content_type = "libsvm"
train_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'train'), content_type=content_type)
validation_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'validation'), content_type=content_type)
```

 For the full list of SageMaker Python SDK version 2.x changes, see [Use Version 2.x of the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html). 

## Change Docker Image for Boto3


If you are using Boto3 to train or deploy your model, change the Docker image tag (`1`, `0.72`, `0.90-1`, or `0.90-2`) to `1.5-1`.

```
{
    "AlgorithmSpecification": {
        "TrainingImage": "746614075791.dkr.ecr.us-west-1.amazonaws.com/sagemaker-xgboost:1.5-1"
    },
    ...
}

If you are using the SageMaker Python SDK to retrieve the registry path, change the `version` parameter in `image_uris.retrieve`.

```
from sagemaker import image_uris
image_uris.retrieve(framework="xgboost", region="us-west-2", version="1.5-1")
```

## Update Hyperparameters and Learning Objectives


The `silent` parameter has been deprecated and is no longer available in XGBoost 1.5 and later versions. Use `verbosity` instead. The `reg:linear` learning objective has also been deprecated. Use `reg:squarederror` instead.

```
hyperparameters = {
    "verbosity": "2",
    "objective": "reg:squarederror",
    "num_round": "50",
    ...
}

estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container, 
                                          hyperparameters=hyperparameters,
                                          ...)
```
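These renames can also be applied mechanically. The following is a minimal, hypothetical helper (not part of any SDK) that sketches migrating a legacy hyperparameter dictionary to the XGBoost 1.5 names described above:

```python
def migrate_hyperparameters(hyperparameters):
    """Return a copy updated for XGBoost 1.5: silent -> verbosity,
    reg:linear -> reg:squarederror."""
    updated = dict(hyperparameters)
    if "silent" in updated:
        # silent=1 suppressed output; verbosity=0 is the closest equivalent
        updated["verbosity"] = "0" if updated.pop("silent") == "1" else "2"
    if updated.get("objective") == "reg:linear":
        updated["objective"] = "reg:squarederror"
    return updated
```

Any other hyperparameters are passed through unchanged, so the result can be handed directly to an estimator as before.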

# XGBoost Version 0.72

**Important**  
XGBoost 0.72 is deprecated by Amazon SageMaker AI. You can still use this old version of XGBoost (as a built-in algorithm) by pulling its image URI as shown in the following code samples. For XGBoost, the image URI ending with `:1` is for the old version.  

```
# With SageMaker Python SDK version 1.x
import boto3
from sagemaker.amazon.amazon_estimator import get_image_uri

xgb_image_uri = get_image_uri(boto3.Session().region_name, "xgboost", repo_version="1")
```

```
# With SageMaker Python SDK version 2.x
import boto3
from sagemaker import image_uris

xgb_image_uri = image_uris.retrieve("xgboost", boto3.Session().region_name, "1")
```
If you want to use newer versions, you have to explicitly specify the image URI tags (see [Supported versions](xgboost.md#xgboost-supported-versions)).

This previous release of the Amazon SageMaker AI XGBoost algorithm is based on the 0.72 release. [XGBoost](https://github.com/dmlc/xgboost) (eXtreme Gradient Boosting) is a popular and efficient open-source implementation of the gradient boosted trees algorithm. Gradient boosting is a supervised learning algorithm that attempts to accurately predict a target variable by combining the estimates of a set of simpler, weaker models. XGBoost has done remarkably well in machine learning competitions because it robustly handles a variety of data types, relationships, and distributions, and because of the large number of hyperparameters that can be tweaked and tuned for improved fits. This flexibility makes XGBoost a solid choice for problems in regression, classification (binary and multiclass), and ranking.

Customers should consider using the new release of [XGBoost algorithm with Amazon SageMaker AI](xgboost.md). They can use it as a SageMaker AI built-in algorithm or as a framework to run scripts in their local environments as they would typically, for example, do with a Tensorflow deep learning framework. The new implementation has a smaller memory footprint, better logging, improved hyperparameter validation, and an expanded set of metrics. The earlier implementation of XGBoost remains available to customers if they need to postpone migrating to the new version. But this previous implementation will remain tied to the 0.72 release of XGBoost.

## Input/Output Interface for the XGBoost Release 0.72


Gradient boosting operates on tabular data, with the rows representing observations, one column representing the target variable or label, and the remaining columns representing features. 

The SageMaker AI implementation of XGBoost supports CSV and libsvm formats for training and inference:
+ For Training ContentType, valid inputs are *text/libsvm* (default) or *text/csv*.
+ For Inference ContentType, valid inputs are *text/libsvm* or (the default) *text/csv*.

**Note**  
For CSV training, the algorithm assumes that the target variable is in the first column and that the CSV does not have a header record. For CSV inference, the algorithm assumes that CSV input does not have the label column.   
For libsvm training, the algorithm assumes that the label is in the first column. Subsequent columns contain the zero-based index value pairs for features. So each row has the format: `<label> <index0>:<value0> <index1>:<value1> ...`. Inference requests for libsvm may or may not have labels in the libsvm format.

This differs from other SageMaker AI algorithms, which use the protobuf training input format to maintain greater consistency with standard XGBoost data formats.
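To make the two layouts concrete, the following illustrative sketch (hypothetical; not part of SageMaker AI) converts a headerless CSV training row into the corresponding libsvm line:

```python
def csv_row_to_libsvm(row):
    """Convert a CSV training row (label first, no header) to a libsvm line.

    Zero-valued features are omitted, and feature indices are zero-based,
    matching the libsvm layout described above.
    """
    label, *features = row
    pairs = " ".join(
        f"{i}:{value}" for i, value in enumerate(features) if float(value) != 0.0
    )
    return f"{label} {pairs}".rstrip()
```

For example, the row `1,0.5,0,2.0` becomes the libsvm line `1 0:0.5 2:2.0`.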

For CSV training input mode, the total memory available to the algorithm (Instance Count × the memory available in the `InstanceType`) must be able to hold the training dataset. For libsvm training input mode, it's not required, but we recommend it.

SageMaker AI XGBoost uses the Python pickle module to serialize/deserialize the model, which can be used for saving/loading the model.

**To use a model trained with SageMaker AI XGBoost in open source XGBoost**
+ Use the following Python code:

  ```
  import pickle as pkl
  import tarfile
  import xgboost

  # extract the trained model from the SageMaker AI output archive
  t = tarfile.open('model.tar.gz', 'r:gz')
  t.extractall()

  # model_file_path is the name of the pickled model file inside the archive
  model = pkl.load(open(model_file_path, 'rb'))

  # prediction with test data (dtest is an xgboost.DMatrix of test features)
  pred = model.predict(dtest)
  ```

**To differentiate the importance of labelled data points, use instance weight support**
+ SageMaker AI XGBoost allows customers to differentiate the importance of labelled data points by assigning each instance a weight value. For *text/libsvm* input, customers can assign weight values to data instances by attaching them after the labels. For example, `label:weight idx_0:val_0 idx_1:val_1...`. For *text/csv* input, customers need to turn on the `csv_weights` flag in the parameters and attach weight values in the column after labels. For example: `label,weight,val_0,val_1,...`.
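The two weighted layouts can be sketched with small illustrative helpers (hypothetical; shown only to clarify the formats above):

```python
def libsvm_weighted_line(label, weight, pairs):
    """libsvm input with an instance weight: label:weight idx_0:val_0 idx_1:val_1 ..."""
    return f"{label}:{weight} " + " ".join(f"{i}:{v}" for i, v in pairs)


def csv_weighted_line(label, weight, values):
    """CSV input with an instance weight (requires the csv_weights flag):
    label,weight,val_0,val_1,..."""
    return ",".join(str(x) for x in [label, weight, *values])
```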

## EC2 Instance Recommendation for the XGBoost Release 0.72


SageMaker AI XGBoost currently only trains using CPUs. It is a memory-bound (as opposed to compute-bound) algorithm. So, a general-purpose compute instance (for example, M4) is a better choice than a compute-optimized instance (for example, C4). Further, we recommend that you have enough total memory in selected instances to hold the training data. Although it supports the use of disk space to handle data that does not fit into main memory (the out-of-core feature available with the libsvm input mode), writing cache files onto disk slows the algorithm processing time.

## XGBoost Release 0.72 Sample Notebooks

For a sample notebook that shows how to use the latest version of SageMaker AI XGBoost as a built-in algorithm to train and host a regression model, see [Regression with Amazon SageMaker AI XGBoost algorithm](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/xgboost_abalone/xgboost_abalone.html). To use the 0.72 version of XGBoost, you need to change the version in the sample code to 0.72. For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). Once you have created a notebook instance and opened it, select the **SageMaker AI Examples** tab to see a list of all the SageMaker AI samples. The example notebooks that use the XGBoost algorithm are located in the **Introduction to Amazon algorithms** section. To open a notebook, click on its **Use** tab and select **Create copy**.

## XGBoost Release 0.72 Hyperparameters

The following table contains the hyperparameters for the XGBoost algorithm. These are parameters that are set by users to facilitate the estimation of model parameters from data. The required hyperparameters that must be set are listed first, in alphabetical order. The optional hyperparameters that can be set are listed next, also in alphabetical order. The SageMaker AI XGBoost algorithm is an implementation of the open-source XGBoost package. Currently SageMaker AI supports version 0.72. For more detail about hyperparameter configuration for this version of XGBoost, see [ XGBoost Parameters](https://xgboost.readthedocs.io/en/release_0.72/parameter.html).


| Parameter Name | Description | 
| --- | --- | 
| num_class | The number of classes. **Required** if `objective` is set to *multi:softmax* or *multi:softprob*. Valid values: integer  | 
| num_round | The number of rounds to run the training. **Required** Valid values: integer  | 
| alpha | L1 regularization term on weights. Increasing this value makes models more conservative. **Optional** Valid values: float Default value: 0  | 
| base_score | The initial prediction score of all instances, global bias. **Optional** Valid values: float Default value: 0.5  | 
| booster | Which booster to use. The `gbtree` and `dart` values use a tree-based model, while `gblinear` uses a linear function. **Optional** Valid values: String. One of `gbtree`, `gblinear`, or `dart`. Default value: `gbtree`  | 
| colsample_bylevel | Subsample ratio of columns for each split, in each level. **Optional** Valid values: Float. Range: [0,1]. Default value: 1  | 
| colsample_bytree | Subsample ratio of columns when constructing each tree. **Optional** Valid values: Float. Range: [0,1]. Default value: 1 | 
| csv_weights | When this flag is enabled, XGBoost differentiates the importance of instances for csv input by taking the second column (the column after labels) in training data as the instance weights. **Optional** Valid values: 0 or 1 Default value: 0  | 
| early_stopping_rounds | The model trains until the validation score stops improving. Validation error needs to decrease at least every `early_stopping_rounds` to continue training. SageMaker AI hosting uses the best model for inference. **Optional** Valid values: integer Default value: -  | 
| eta | Step size shrinkage used in updates to prevent overfitting. After each boosting step, you can directly get the weights of new features. The `eta` parameter actually shrinks the feature weights to make the boosting process more conservative. **Optional** Valid values: Float. Range: [0,1]. Default value: 0.3  | 
| eval_metric | Evaluation metrics for validation data. A default metric is assigned according to the objective: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/xgboost-72.html) For a list of valid inputs, see [XGBoost Parameters](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst#learning-task-parameters). **Optional** Valid values: string Default value: Default according to objective.  | 
| gamma | Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger, the more conservative the algorithm is. **Optional** Valid values: Float. Range: [0,∞). Default value: 0  | 
| grow_policy | Controls the way that new nodes are added to the tree. Currently supported only if `tree_method` is set to `hist`. **Optional** Valid values: String. Either `depthwise` or `lossguide`. Default value: `depthwise`  | 
| lambda | L2 regularization term on weights. Increasing this value makes models more conservative. **Optional** Valid values: float Default value: 1  | 
| lambda_bias | L2 regularization term on bias. **Optional** Valid values: Float. Range: [0.0, 1.0]. Default value: 0  | 
| max_bin | Maximum number of discrete bins to bucket continuous features. Used only if `tree_method` is set to `hist`.  **Optional** Valid values: integer Default value: 256  | 
| max_delta_step | Maximum delta step allowed for each tree's weight estimation. When a positive integer is used, it helps make the update more conservative. The preferred option is to use it in logistic regression. Set it to 1-10 to help control the update.  **Optional** Valid values: Integer. Range: [0,∞). Default value: 0  | 
| max_depth | Maximum depth of a tree. Increasing this value makes the model more complex and likely to be overfit. 0 indicates no limit. A limit is required when `grow_policy`=`depthwise`. **Optional** Valid values: Integer. Range: [0,∞) Default value: 6  | 
| max_leaves | Maximum number of nodes to be added. Relevant only if `grow_policy` is set to `lossguide`. **Optional** Valid values: integer Default value: 0  | 
| min_child_weight | Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than `min_child_weight`, the building process gives up further partitioning. In linear regression models, this simply corresponds to a minimum number of instances needed in each node. The larger the value, the more conservative the algorithm is. **Optional** Valid values: Float. Range: [0,∞). Default value: 1  | 
| normalize_type | Type of normalization algorithm. **Optional** Valid values: Either *tree* or *forest*. Default value: *tree*  | 
| nthread | Number of parallel threads used to run *xgboost*. **Optional** Valid values: integer Default value: Maximum number of threads.  | 
| objective | Specifies the learning task and the corresponding learning objective. Examples: `reg:logistic`, `multi:softmax`, `rank:pairwise`. For a full list of valid inputs, refer to [XGBoost Parameters](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst#learning-task-parameters). **Optional** Valid values: string Default value: `reg:linear`  | 
| one_drop | When this flag is enabled, at least one tree is always dropped during the dropout. **Optional** Valid values: 0 or 1 Default value: 0  | 
| process_type | The type of boosting process to run. **Optional** Valid values: String. Either `default` or `update`. Default value: `default`  | 
| rate_drop | The dropout rate that specifies the fraction of previous trees to drop during the dropout. **Optional** Valid values: Float. Range: [0.0, 1.0]. Default value: 0.0  | 
| refresh_leaf | This is a parameter of the 'refresh' updater plug-in. When set to `true` (1), tree leaves and tree node stats are updated. When set to `false` (0), only tree node stats are updated. **Optional** Valid values: 0/1 Default value: 1  | 
| sample_type | Type of sampling algorithm. **Optional** Valid values: Either `uniform` or `weighted`. Default value: `uniform`  | 
| scale_pos_weight | Controls the balance of positive and negative weights. It's useful for unbalanced classes. A typical value to consider: `sum(negative cases)` / `sum(positive cases)`. **Optional** Valid values: float Default value: 1  | 
| seed | Random number seed. **Optional** Valid values: integer Default value: 0  | 
| silent | 0 means print running messages, 1 means silent mode. **Optional** Valid values: 0 or 1 Default value: 0  | 
| sketch_eps | Used only for approximate greedy algorithm. This translates into O(1 / `sketch_eps`) number of bins. Compared to directly selecting the number of bins, this comes with a theoretical guarantee of sketch accuracy. **Optional** Valid values: Float. Range: [0, 1]. Default value: 0.03  | 
| skip_drop | Probability of skipping the dropout procedure during a boosting iteration. **Optional** Valid values: Float. Range: [0.0, 1.0]. Default value: 0.0  | 
| subsample | Subsample ratio of the training instance. Setting it to 0.5 means that XGBoost randomly collects half of the data instances to grow trees. This prevents overfitting. **Optional** Valid values: Float. Range: [0,1]. Default value: 1  | 
| tree_method | The tree construction algorithm used in XGBoost. **Optional** Valid values: One of `auto`, `exact`, `approx`, or `hist`. Default value: `auto`  | 
| tweedie_variance_power | Parameter that controls the variance of the Tweedie distribution. **Optional** Valid values: Float. Range: (1, 2). Default value: 1.5  | 
| updater | A comma-separated string that defines the sequence of tree updaters to run. This provides a modular way to construct and to modify the trees. For a full list of valid inputs, see [XGBoost Parameters](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst). **Optional** Valid values: comma-separated string. Default value: `grow_colmaker`, `prune`  | 

## Tune an XGBoost Release 0.72 Model

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your training and validation datasets. You choose three types of hyperparameters:
+ a learning `objective` function to optimize during model training
+ an `eval_metric` to use to evaluate model performance during validation
+ a set of hyperparameters and a range of values for each to use when tuning the model automatically

You choose the evaluation metric from the set of evaluation metrics that the algorithm computes. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the evaluation metric. 

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

### Metrics Computed by the XGBoost Release 0.72 Algorithm

The XGBoost algorithm based on version 0.72 computes the following nine metrics to use for model validation. When tuning the model, choose one of these metrics to evaluate the model. For the full list of valid `eval_metric` values, refer to [XGBoost Learning Task Parameters](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst#learning-task-parameters).


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| validation:auc |  Area under the curve.  |  Maximize  | 
| validation:error |  Binary classification error rate, calculated as #(wrong cases)/#(all cases).  |  Minimize  | 
| validation:logloss |  Negative log-likelihood.  |  Minimize  | 
| validation:mae |  Mean absolute error.  |  Minimize  | 
| validation:map |  Mean average precision.  |  Maximize  | 
| validation:merror |  Multiclass classification error rate, calculated as #(wrong cases)/#(all cases).  |  Minimize  | 
| validation:mlogloss |  Negative log-likelihood for multiclass classification.  |  Minimize  | 
| validation:ndcg |  Normalized Discounted Cumulative Gain.  |  Maximize  | 
| validation:rmse |  Root mean square error.  |  Minimize  | 

### Tunable XGBoost Release 0.72 Hyperparameters

Tune the XGBoost model with the following hyperparameters. The hyperparameters that have the greatest effect on optimizing the XGBoost evaluation metrics are: `alpha`, `min_child_weight`, `subsample`, `eta`, and `num_round`. 


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| alpha |  ContinuousParameterRanges  |  MinValue: 0, MaxValue: 1000  | 
| colsample_bylevel |  ContinuousParameterRanges  |  MinValue: 0.1, MaxValue: 1  | 
| colsample_bytree |  ContinuousParameterRanges  |  MinValue: 0.5, MaxValue: 1  | 
| eta |  ContinuousParameterRanges  |  MinValue: 0.1, MaxValue: 0.5  | 
| gamma |  ContinuousParameterRanges  |  MinValue: 0, MaxValue: 5  | 
| lambda |  ContinuousParameterRanges  |  MinValue: 0, MaxValue: 1000  | 
| max_delta_step |  IntegerParameterRanges  |  [0, 10]  | 
| max_depth |  IntegerParameterRanges  |  [0, 10]  | 
| min_child_weight |  ContinuousParameterRanges  |  MinValue: 0, MaxValue: 120  | 
| num_round |  IntegerParameterRanges  |  [1, 4000]  | 
| subsample |  ContinuousParameterRanges  |  MinValue: 0.5, MaxValue: 1  | 
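These ranges can be wired into automatic model tuning with the SageMaker Python SDK. The following is a sketch, assuming an `estimator` and the `train_input`/`validation_input` channels defined in earlier examples; the objective metric shown is one of the metrics listed above and depends on your learning objective:

```python
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

# Search ranges taken from the table above; tuning fewer parameters runs faster.
hyperparameter_ranges = {
    "alpha": ContinuousParameter(0, 1000),
    "eta": ContinuousParameter(0.1, 0.5),
    "min_child_weight": ContinuousParameter(0, 120),
    "subsample": ContinuousParameter(0.5, 1),
    "num_round": IntegerParameter(1, 4000),
}

tuner = HyperparameterTuner(
    estimator,                                # the XGBoost estimator defined earlier
    objective_metric_name="validation:rmse",  # one of the metrics listed above
    objective_type="Minimize",
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=20,
    max_parallel_jobs=3,
)
tuner.fit({"train": train_input, "validation": validation_input})
```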

# Built-in SageMaker AI Algorithms for Text Data

SageMaker AI provides algorithms that are tailored to the analysis of textual documents used in natural language processing, document classification or summarization, topic modeling or classification, and language transcription or translation.
+ [BlazingText algorithm](blazingtext.md)—a highly optimized implementation of the Word2vec and text classification algorithms that scale to large datasets easily. It is useful for many downstream natural language processing (NLP) tasks.
+ [Latent Dirichlet Allocation (LDA) Algorithm](lda.md)—an algorithm suitable for determining topics in a set of documents. It is an *unsupervised algorithm*, which means that it doesn't use example data with answers during training.
+ [Neural Topic Model (NTM) Algorithm](ntm.md)—another unsupervised technique for determining topics in a set of documents, using a neural network approach.
+ [Object2Vec Algorithm](object2vec.md)—a general-purpose neural embedding algorithm that can be used for recommendation systems, document classification, and sentence embeddings.
+ [Sequence-to-Sequence Algorithm](seq-2-seq.md)—a supervised algorithm commonly used for neural machine translation. 
+ [Text Classification - TensorFlow](text-classification-tensorflow.md)—a supervised algorithm that supports transfer learning with available pretrained models for text classification. 


| Algorithm name | Channel name | Training input mode | File type | Instance class | Parallelizable | 
| --- | --- | --- | --- | --- | --- | 
| BlazingText | train | File or Pipe | Text file (one sentence per line with space-separated tokens)  | GPU (single instance only) or CPU | No | 
| LDA | train and (optionally) test | File or Pipe | recordIO-protobuf or CSV | CPU (single instance only) | No | 
| Neural Topic Model | train and (optionally) validation, test, or both | File or Pipe | recordIO-protobuf or CSV | GPU or CPU | Yes | 
| Object2Vec | train and (optionally) validation, test, or both | File | JSON Lines  | GPU or CPU (single instance only) | No | 
| Seq2Seq Modeling | train, validation, and vocab | File | recordIO-protobuf | GPU (single instance only) | No | 
| Text Classification - TensorFlow | training and validation | File | CSV | CPU or GPU | Yes (only across multiple GPUs on a single instance) | 

# BlazingText algorithm

The Amazon SageMaker AI BlazingText algorithm provides highly optimized implementations of the Word2vec and text classification algorithms. The Word2vec algorithm is useful for many downstream natural language processing (NLP) tasks, such as sentiment analysis, named entity recognition, machine translation, etc. Text classification is an important task for applications that perform web searches, information retrieval, ranking, and document classification.

The Word2vec algorithm maps words to high-quality distributed vectors. The resulting vector representation of a word is called a *word embedding*. Words that are semantically similar correspond to vectors that are close together. That way, word embeddings capture the semantic relationships between words. 

Many natural language processing (NLP) applications learn word embeddings by training on large collections of documents. These pretrained vector representations provide information about semantics and word distributions that typically improves the generalizability of other models that are later trained on a more limited amount of data. Most implementations of the Word2vec algorithm are not optimized for multi-core CPU architectures. This makes it difficult to scale to large datasets. 

With the BlazingText algorithm, you can scale to large datasets easily. Similar to Word2vec, it provides the Skip-gram and continuous bag-of-words (CBOW) training architectures. BlazingText's implementation of the supervised multi-class, multi-label text classification algorithm extends the fastText text classifier to use GPU acceleration with custom [CUDA ](https://docs.nvidia.com/cuda/index.html) kernels. You can train a model on more than a billion words in a couple of minutes using a multi-core CPU or a GPU. And, you achieve performance on par with the state-of-the-art deep learning text classification algorithms.

The BlazingText algorithm is not parallelizable. For more information on parameters related to training, see [ Docker Registry Paths for SageMaker Built-in Algorithms](https://docs.aws.amazon.com/en_us/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html).

 The SageMaker AI BlazingText algorithm provides the following features:
+ Accelerated training of the fastText text classifier on multi-core CPUs or a GPU and Word2Vec on GPUs using highly optimized CUDA kernels. For more information, see [BlazingText: Scaling and Accelerating Word2Vec using Multiple GPUs](https://dl.acm.org/citation.cfm?doid=3146347.3146354).
+ [Enriched Word Vectors with Subword Information](https://arxiv.org/abs/1607.04606) by learning vector representations for character n-grams. This approach enables BlazingText to generate meaningful vectors for out-of-vocabulary (OOV) words by representing their vectors as the sum of the character n-gram (subword) vectors.
+ A `batch_skipgram` `mode` for the Word2Vec algorithm that allows faster training and distributed computation across multiple CPU nodes. The `batch_skipgram` `mode` does mini-batching using the Negative Sample Sharing strategy to convert level-1 BLAS operations into level-3 BLAS operations. This efficiently leverages the multiply-add instructions of modern architectures. For more information, see [Parallelizing Word2Vec in Shared and Distributed Memory](https://arxiv.org/pdf/1604.04661.pdf).

To summarize, BlazingText supports the following modes on different instance types:


| Modes |  Word2Vec (Unsupervised Learning)  |  Text Classification (Supervised Learning)  | 
| --- | --- | --- | 
|  Single CPU instance  |  `cbow` `Skip-gram` `Batch Skip-gram`  |  `supervised`  | 
|  Single GPU instance (with 1 or more GPUs)  |  `cbow` `Skip-gram`  |  `supervised` with one GPU  | 
|  Multiple CPU instances  | Batch Skip-gram  | None | 

For more information about the mathematics behind BlazingText, see [BlazingText: Scaling and Accelerating Word2Vec using Multiple GPUs](https://dl.acm.org/citation.cfm?doid=3146347.3146354).

**Topics**
+ [Input/Output Interface for the BlazingText Algorithm](#bt-inputoutput)
+ [EC2 Instance Recommendation for the BlazingText Algorithm](#blazingtext-instances)
+ [BlazingText Sample Notebooks](#blazingtext-sample-notebooks)
+ [BlazingText Hyperparameters](blazingtext_hyperparameters.md)
+ [Tune a BlazingText Model](blazingtext-tuning.md)

## Input/Output Interface for the BlazingText Algorithm


The BlazingText algorithm expects a single preprocessed text file with space-separated tokens. Each line in the file should contain a single sentence. If you need to train on multiple text files, concatenate them into one file and upload the file in the respective channel.

### Training and Validation Data Format


#### Training and Validation Data Format for the Word2Vec Algorithm


For Word2Vec training, upload the file under the *train* channel. No other channels are supported. The file should contain a training sentence per line.

#### Training and Validation Data Format for the Text Classification Algorithm


For supervised mode, you can train with file mode or with the augmented manifest text format.

##### Train with File Mode

For `supervised` mode, the training/validation file should contain a training sentence per line along with the labels. Labels are words that are prefixed by the string `__label__`. Here is an example of a training/validation file:

```
__label__4  linux ready for prime time , intel says , despite all the linux hype , the open-source movement has yet to make a huge splash in the desktop market . that may be about to change , thanks to chipmaking giant intel corp .

__label__2  bowled by the slower one again , kolkata , november 14 the past caught up with sourav ganguly as the indian skippers return to international cricket was short lived .
```

**Note**  
The order of labels within the sentence doesn't matter. 

Upload the training file under the train channel, and optionally upload the validation file under the validation channel.
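As a sketch, a training line in this format could be assembled as follows (a hypothetical helper, shown only to illustrate the layout; it is not part of SageMaker AI):

```python
def to_supervised_line(labels, sentence):
    """Prefix each label with __label__ and prepend the labels to the
    space-tokenized sentence, producing one BlazingText training line."""
    prefix = " ".join(f"__label__{label}" for label in labels)
    return f"{prefix}  {sentence}"
```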

##### Train with Augmented Manifest Text Format

Supervised mode for CPU instances also supports the augmented manifest format, which enables you to do training in pipe mode without needing to create RecordIO files. To use this format, generate an S3 manifest file that contains the list of sentences and their corresponding labels. The manifest file should be in [JSON Lines](http://jsonlines.org/) format, in which each line represents one sample. The sentences are specified using the `source` tag and the label is specified using the `label` tag. Both the `source` and `label` tags should be provided under the `AttributeNames` parameter value as specified in the request.

```
{"source":"linux ready for prime time , intel says , despite all the linux hype", "label":1}
{"source":"bowled by the slower one again , kolkata , november 14 the past caught up with sourav ganguly", "label":2}
```

Multi-label training is also supported by specifying a JSON array of labels.

```
{"source":"linux ready for prime time , intel says , despite all the linux hype", "label": [1, 3]}
{"source":"bowled by the slower one again , kolkata , november 14 the past caught up with sourav ganguly", "label": [2, 4, 5]}
```
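A manifest in either form can be generated with a few lines of standard-library Python; the helper below is illustrative only (not part of SageMaker AI):

```python
import json


def build_manifest_lines(samples):
    """Build JSON Lines manifest entries from (sentence, label) pairs.

    The label may be a single value or a list of values for multi-label
    training, matching the examples above.
    """
    return [json.dumps({"source": source, "label": label}) for source, label in samples]
```

Writing the returned lines to a file, one per line, yields the augmented manifest to upload to S3.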

For more information on augmented manifest files, see [Augmented Manifest Files for Training Jobs](augmented-manifest.md).

### Model Artifacts and Inference


#### Model Artifacts for the Word2Vec Algorithm


For Word2Vec training, the model artifacts consist of *vectors.txt*, which contains words-to-vectors mapping, and *vectors.bin*, a binary used by BlazingText for hosting, inference, or both. *vectors.txt* stores the vectors in a format that is compatible with other tools like Gensim and Spacy. For example, a Gensim user can run the following commands to load the vectors.txt file:

```
from gensim.models import KeyedVectors
word_vectors = KeyedVectors.load_word2vec_format('vectors.txt', binary=False)
word_vectors.most_similar(positive=['woman', 'king'], negative=['man'])
word_vectors.doesnt_match("breakfast cereal dinner lunch".split())
```

If the evaluation parameter is set to `True`, an additional file, *eval.json*, is created. This file contains the similarity evaluation results (computed using Spearman's rank correlation coefficients) on the WS-353 dataset. The number of words from the WS-353 dataset that aren't in the training corpus is also reported.

For inference requests, the model accepts a JSON file containing a list of strings and returns a list of vectors. If a word is not found in the vocabulary, inference returns a vector of zeros. If `subwords` is set to `True` during training, the model is able to generate vectors for out-of-vocabulary (OOV) words.

##### Sample JSON Request


Mime-type: `application/json`

```
{
"instances": ["word1", "word2", "word3"]
}
```

#### Model Artifacts for the Text Classification Algorithm


Training with supervised outputs creates a *model.bin* file that can be consumed by BlazingText hosting. For inference, the BlazingText model accepts a JSON file containing a list of sentences and returns a list of corresponding predicted labels and probability scores. Each sentence is expected to be a string with space-separated tokens, words, or both.

##### Sample JSON Request


Mime-type: `application/json`

```
{
 "instances": ["the movie was excellent", "i did not like the plot ."]
}
```

By default, the server returns only one prediction, the one with the highest probability. To retrieve the top *k* predictions, you can set *k* in the configuration, as follows:

```
{
 "instances": ["the movie was excellent", "i did not like the plot ."],
 "configuration": {"k": 2}
}
```

For BlazingText, the `content-type` and `accept` parameters must be equal. For batch transform, they both need to be `application/jsonlines`. If they differ, the `Accept` field is ignored. The format for input follows:

```
content-type: application/jsonlines

{"source": "source_0"}
{"source": "source_1"}
```

If you need to pass the value of *k* for top-*k* predictions, you can do it in the following way:

```
{"source": "source_0", "k": 2}
{"source": "source_1", "k": 3}
```

The format for output follows:

```
accept: application/jsonlines

{"prob": [prob_1], "label": ["__label__1"]}
{"prob": [prob_1], "label": ["__label__1"]}
```

If you pass a value of *k* greater than 1, the response is in this format:

```
{"prob": [prob_1, prob_2], "label": ["__label__1", "__label__2"]}
{"prob": [prob_1, prob_2], "label": ["__label__1", "__label__2"]}
```
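
As a sketch, the `application/jsonlines` output can be parsed line by line to pair each label with its probability. The response lines below are invented examples of a *k*=2 batch transform response:

```python
import json

# Hypothetical batch-transform output with k=2 predictions per line.
response_lines = [
    '{"prob": [0.91, 0.07], "label": ["__label__1", "__label__3"]}',
    '{"prob": [0.66, 0.30], "label": ["__label__2", "__label__1"]}',
]

def parse_predictions(lines):
    """Return, for each response line, a list of (label, probability) pairs."""
    results = []
    for line in lines:
        record = json.loads(line)
        results.append(list(zip(record["label"], record["prob"])))
    return results

for preds in parse_predictions(response_lines):
    print(preds)
```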

For both supervised (text classification) and unsupervised (Word2Vec) modes, the binaries (*\*.bin*) produced by BlazingText and by fastText are interchangeable. You can use binaries produced by BlazingText with fastText. Likewise, you can host model binaries created with fastText using BlazingText.

Here is an example of how to use a model generated with BlazingText with fastText:

```
#Download the model artifact from S3
aws s3 cp s3://<YOUR_S3_BUCKET>/<PREFIX>/model.tar.gz model.tar.gz

#Unzip the model archive
tar -xzf model.tar.gz

#Use the model archive with fastText
fasttext predict ./model.bin test.txt
```

However, the binaries are only supported when training on CPU or a single GPU; training on multiple GPUs does not produce binaries.

## EC2 Instance Recommendation for the BlazingText Algorithm


For `cbow` and `skipgram` modes, BlazingText supports single CPU and single GPU instances. Both of these modes support learning of `subwords` embeddings. To achieve the highest speed without compromising accuracy, we recommend that you use an ml.p3.2xlarge instance. 

For `batch_skipgram` mode, BlazingText supports single or multiple CPU instances. When training on multiple instances, set the value of the `S3DataDistributionType` field of the [S3DataSource](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_S3DataSource.html) object that you pass to [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) to `FullyReplicated`. BlazingText takes care of distributing data across machines.

For the supervised text classification mode, a C5 instance is recommended if the training dataset is less than 2 GB. For larger datasets, use an instance with a single GPU. BlazingText supports P2, P3, G4dn, and G5 instances for training and inference.

## BlazingText Sample Notebooks

For a sample notebook that trains and deploys the SageMaker AI BlazingText algorithm to generate word vectors, see [Learning Word2Vec Word Representations using BlazingText](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/blazingtext_word2vec_text8/blazingtext_word2vec_text8.html). For instructions on creating and accessing Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After creating and opening a notebook instance, choose the **SageMaker AI Examples** tab to see a list of all the SageMaker AI examples. The example notebooks that use the BlazingText algorithm are located in the **Introduction to Amazon algorithms** section. To open a notebook, choose its **Use** tab, then choose **Create copy**.

# BlazingText Hyperparameters

When you start a training job with a `CreateTrainingJob` request, you specify a training algorithm. You can also specify algorithm-specific hyperparameters as string-to-string maps. The hyperparameters for the BlazingText algorithm depend on which mode you use: Word2Vec (unsupervised) or Text Classification (supervised).

## Word2Vec Hyperparameters

The following table lists the hyperparameters for the BlazingText Word2Vec training algorithm provided by Amazon SageMaker AI.


| Parameter Name | Description | 
| --- | --- | 
| mode |  The Word2vec architecture used for training. **Required** Valid values: `batch_skipgram`, `skipgram`, or `cbow`  | 
| batch\_size |  The size of each batch when `mode` is set to `batch_skipgram`. Set to a number between 10 and 20. **Optional** Valid values: Positive integer Default value: 11  | 
| buckets |  The number of hash buckets to use for subwords. **Optional** Valid values: Positive integer Default value: 2000000  | 
| epochs |  The number of complete passes through the training data. **Optional** Valid values: Positive integer Default value: 5  | 
| evaluation |  Whether the trained model is evaluated using the [WordSimilarity-353 Test](http://www.gabrilovich.com/resources/data/wordsim353/wordsim353.html). **Optional** Valid values: (Boolean) `True` or `False` Default value: `True`  | 
| learning\_rate |  The step size used for parameter updates. **Optional** Valid values: Positive float Default value: 0.05  | 
| min\_char |  The minimum number of characters to use for subwords/character n-grams. **Optional** Valid values: Positive integer Default value: 3  | 
| min\_count |  Words that appear less than `min_count` times are discarded. **Optional** Valid values: Non-negative integer Default value: 5  | 
| max\_char |  The maximum number of characters to use for subwords/character n-grams. **Optional** Valid values: Positive integer Default value: 6  | 
| negative\_samples |  The number of negative samples for the negative sample sharing strategy. **Optional** Valid values: Positive integer Default value: 5  | 
| sampling\_threshold |  The threshold for the occurrence of words. Words that appear with higher frequency in the training data are randomly down-sampled. **Optional** Valid values: Positive fraction. The recommended range is (0, 1e-3] Default value: 0.0001  | 
| subwords |  Whether to learn subword embeddings or not. **Optional** Valid values: (Boolean) `True` or `False` Default value: `False`  | 
| vector\_dim |  The dimension of the word vectors that the algorithm learns. **Optional** Valid values: Positive integer Default value: 100  | 
| window\_size |  The size of the context window. The context window is the number of words surrounding the target word used for training. **Optional** Valid values: Positive integer Default value: 5  | 

## Text Classification Hyperparameters

The following table lists the hyperparameters for the Text Classification training algorithm provided by Amazon SageMaker AI.

**Note**  
Although some of the parameters are common between the Text Classification and Word2Vec modes, they might have different meanings depending on the context.


| Parameter Name | Description | 
| --- | --- | 
| mode |  The training mode. **Required** Valid values: `supervised`  | 
| buckets |  The number of hash buckets to use for word n-grams. **Optional** Valid values: Positive integer Default value: 2000000  | 
| early\_stopping |  Whether to stop training if validation accuracy doesn't improve after a `patience` number of epochs. Note that a validation channel is required if early stopping is used. **Optional** Valid values: (Boolean) `True` or `False` Default value: `False`  | 
| epochs |  The maximum number of complete passes through the training data. **Optional** Valid values: Positive integer Default value: 5  | 
| learning\_rate |  The step size used for parameter updates. **Optional** Valid values: Positive float Default value: 0.05  | 
| min\_count |  Words that appear less than `min_count` times are discarded. **Optional** Valid values: Non-negative integer Default value: 5  | 
| min\_epochs |  The minimum number of epochs to train before early stopping logic is invoked. **Optional** Valid values: Positive integer Default value: 5  | 
| patience |  The number of epochs to wait before applying early stopping when no progress is made on the validation set. Used only when `early_stopping` is `True`. **Optional** Valid values: Positive integer Default value: 4  | 
| vector\_dim |  The dimension of the embedding layer. **Optional** Valid values: Positive integer Default value: 100  | 
| word\_ngrams |  The number of word n-gram features to use. **Optional** Valid values: Positive integer Default value: 2  | 

# Tune a BlazingText Model

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the objective metric.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics Computed by the BlazingText Algorithm

The BlazingText Word2Vec algorithm (`skipgram`, `cbow`, and `batch_skipgram` modes) reports on a single metric during training: `train:mean_rho`. This metric is computed on [WS-353 word similarity datasets](https://aclweb.org/aclwiki/WordSimilarity-353_Test_Collection_(State_of_the_art)). When tuning the hyperparameter values for the Word2Vec algorithm, use this metric as the objective.

The BlazingText Text Classification algorithm (`supervised` mode) also reports a single metric during training: `validation:accuracy`. When tuning the hyperparameter values for the text classification algorithm, use this metric as the objective.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| train:mean\_rho |  The mean rho (Spearman's rank correlation coefficient) on [WS-353 word similarity datasets](http://alfonseca.org/pubs/ws353simrel.tar.gz)  |  Maximize  | 
| validation:accuracy |  The classification accuracy on the user-specified validation dataset  |  Maximize  | 

## Tunable BlazingText Hyperparameters

### Tunable Hyperparameters for the Word2Vec Algorithm

Tune an Amazon SageMaker AI BlazingText Word2Vec model with the following hyperparameters. The hyperparameters that have the greatest impact on Word2Vec objective metrics are: `mode`, `learning_rate`, `window_size`, `vector_dim`, and `negative_samples`.


| Parameter Name | Parameter Type | Recommended Ranges or Values | 
| --- | --- | --- | 
| batch\_size |  `IntegerParameterRange`  |  [8-32]  | 
| epochs |  `IntegerParameterRange`  |  [5-15]  | 
| learning\_rate |  `ContinuousParameterRange`  |  MinValue: 0.005, MaxValue: 0.01  | 
| min\_count |  `IntegerParameterRange`  |  [0-100]  | 
| mode |  `CategoricalParameterRange`  |  [`'batch_skipgram'`, `'skipgram'`, `'cbow'`]  | 
| negative\_samples |  `IntegerParameterRange`  |  [5-25]  | 
| sampling\_threshold |  `ContinuousParameterRange`  |  MinValue: 0.0001, MaxValue: 0.001  | 
| vector\_dim |  `IntegerParameterRange`  |  [32-300]  | 
| window\_size |  `IntegerParameterRange`  |  [1-10]  | 

### Tunable Hyperparameters for the Text Classification Algorithm

Tune an Amazon SageMaker AI BlazingText text classification model with the following hyperparameters.


| Parameter Name | Parameter Type | Recommended Ranges or Values | 
| --- | --- | --- | 
| buckets |  `IntegerParameterRange`  |  [1000000-10000000]  | 
| epochs |  `IntegerParameterRange`  |  [5-15]  | 
| learning\_rate |  `ContinuousParameterRange`  |  MinValue: 0.005, MaxValue: 0.01  | 
| min\_count |  `IntegerParameterRange`  |  [0-100]  | 
| vector\_dim |  `IntegerParameterRange`  |  [32-300]  | 
| word\_ngrams |  `IntegerParameterRange`  |  [1-3]  | 

# Latent Dirichlet Allocation (LDA) Algorithm

The Amazon SageMaker AI Latent Dirichlet Allocation (LDA) algorithm is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of distinct categories. LDA is most commonly used to discover a user-specified number of topics shared by documents within a text corpus. Here each observation is a document, the features are the presence (or occurrence count) of each word, and the categories are the topics. Since the method is unsupervised, the topics are not specified up front, and are not guaranteed to align with how a human may naturally categorize documents. The topics are learned as a probability distribution over the words that occur in each document. Each document, in turn, is described as a mixture of topics.

The exact content of two documents with similar topic mixtures will not be the same. But overall, you would expect these documents to more frequently use a shared subset of words, than when compared with a document from a different topic mixture. This allows LDA to discover these word groups and use them to form topics. As an extremely simple example, given a set of documents where the only words that occur within them are: *eat*, *sleep*, *play*, *meow*, and *bark*, LDA might produce topics like the following:


| **Topic** | *eat* | *sleep*  | *play* | *meow* | *bark* | 
| --- | --- | --- | --- | --- | --- | 
| Topic 1  | 0.1  | 0.3  | 0.2  | 0.4  | 0.0  | 
| Topic 2  | 0.2  | 0.1 | 0.4  | 0.0  | 0.3  | 

You can infer that documents that are more likely to fall into Topic 1 are about cats (who are more likely to *meow* and *sleep*), and documents that fall into Topic 2 are about dogs (who prefer to *play* and *bark*). These topics can be found even though the words dog and cat never appear in any of the texts. 
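
Continuing the toy example above, the probability that a document emits a given word is its topic mixture θ weighted by each topic's word distribution. The following sketch uses the table's topic-word distributions with an invented topic mixture:

```python
# Topic-word distributions from the table above.
vocab = ["eat", "sleep", "play", "meow", "bark"]
beta = {
    "topic1": [0.1, 0.3, 0.2, 0.4, 0.0],  # cat-like topic
    "topic2": [0.2, 0.1, 0.4, 0.0, 0.3],  # dog-like topic
}

def word_prob(word, theta):
    """P(word | document) = sum over topics of theta_t * beta_t[word]."""
    i = vocab.index(word)
    return sum(theta[t] * beta[t][i] for t in beta)

# A hypothetical mostly-cat document: 80% Topic 1, 20% Topic 2.
theta = {"topic1": 0.8, "topic2": 0.2}
print(round(word_prob("meow", theta), 3))  # 0.8*0.4 + 0.2*0.0 = 0.32
print(round(word_prob("bark", theta), 3))  # 0.8*0.0 + 0.2*0.3 = 0.06
```

A document weighted toward Topic 1 is far more likely to contain *meow* than *bark*, matching the cat interpretation above.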

**Topics**
+ [Choosing between Latent Dirichlet Allocation (LDA) and Neural Topic Model (NTM)](#lda-or-ntm)
+ [Input/Output Interface for the LDA Algorithm](#lda-inputoutput)
+ [EC2 Instance Recommendation for the LDA Algorithm](#lda-instances)
+ [LDA Sample Notebooks](#LDA-sample-notebooks)
+ [How LDA Works](lda-how-it-works.md)
+ [LDA Hyperparameters](lda_hyperparameters.md)
+ [Tune an LDA Model](lda-tuning.md)

## Choosing between Latent Dirichlet Allocation (LDA) and Neural Topic Model (NTM)


Topic models are commonly used to produce topics from corpuses that (1) coherently encapsulate semantic meaning and (2) describe documents well. As such, topic models aim to minimize perplexity and maximize topic coherence. 

Perplexity is an intrinsic language modeling evaluation metric that measures the inverse of the geometric mean per-word likelihood in your test data. A lower perplexity score indicates better generalization performance. However, research has shown that the likelihood computed per word often does not align with human judgment and can be entirely uncorrelated with it, so topic coherence was introduced. Each topic inferred by your model consists of words, and topic coherence is computed over the top N words of that particular topic. It is often defined as the average or median of the pairwise word-similarity scores of the words in that topic, for example, Pointwise Mutual Information (PMI). A promising model generates coherent topics, that is, topics with high topic coherence scores.
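
As an illustration of pairwise PMI coherence, the sketch below computes the average PMI of a topic's top words from document (co-)occurrence counts. All counts and words are invented for the example:

```python
import math
from itertools import combinations

# Hypothetical document-frequency counts over a 100-document corpus.
doc_count = {"cat": 30, "meow": 25, "purr": 20}
cooccur = {("cat", "meow"): 18, ("cat", "purr"): 12, ("meow", "purr"): 10}
n_docs = 100

def pmi(w1, w2):
    """Pointwise mutual information from document (co-)occurrence frequencies."""
    p1 = doc_count[w1] / n_docs
    p2 = doc_count[w2] / n_docs
    p12 = cooccur[tuple(sorted((w1, w2)))] / n_docs
    return math.log(p12 / (p1 * p2))

def topic_coherence(top_words):
    """Average pairwise PMI of the topic's top words."""
    pairs = list(combinations(top_words, 2))
    return sum(pmi(a, b) for a, b in pairs) / len(pairs)

print(topic_coherence(["cat", "meow", "purr"]))
```

Words that co-occur more often than chance yield positive PMI, so a coherent topic scores well above zero.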

While the objective is to train a topic model that minimizes perplexity and maximizes topic coherence, there is often a tradeoff between the two with both LDA and NTM. Recent research by Amazon (Ding et al., 2018) has shown that NTM is promising for achieving high topic coherence, but LDA trained with collapsed Gibbs sampling achieves better perplexity. From a practical standpoint regarding hardware and compute power, SageMaker NTM is more flexible than LDA and can scale better, because NTM can run on CPU and GPU and can be parallelized across multiple GPU instances, whereas LDA supports only single-instance CPU training.


## Input/Output Interface for the LDA Algorithm


LDA expects data to be provided on the train channel, and optionally supports a test channel, which is scored by the final model. LDA supports both `recordIO-wrapped-protobuf` (dense and sparse) and `CSV` file formats. For `CSV`, the data must be dense and have dimension equal to (number of records) × (vocabulary size). LDA can be trained in File or Pipe mode when using `recordIO-wrapped-protobuf`, but only in File mode for the `CSV` format.

For inference, `text/csv`, `application/json`, and `application/x-recordio-protobuf` content types are supported. Sparse data can also be passed for `application/json` and `application/x-recordio-protobuf`. LDA inference returns `application/json` or `application/x-recordio-protobuf` *predictions*, which include the `topic_mixture` vector for each observation.

Please see the [LDA Sample Notebooks](#LDA-sample-notebooks) for more detail on training and inference formats.

## EC2 Instance Recommendation for the LDA Algorithm


LDA currently only supports single-instance CPU training. CPU instances are recommended for hosting/inference.

## LDA Sample Notebooks

For a sample notebook that shows how to train the SageMaker AI Latent Dirichlet Allocation algorithm on a dataset and then deploy the trained model to perform inferences about the topic mixtures in input documents, see [An Introduction to SageMaker AI LDA](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/lda_topic_modeling/LDA-Introduction.html). For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, select the **SageMaker AI Examples** tab to see a list of all the SageMaker AI samples. The topic modeling example notebooks that use the LDA algorithm are located in the **Introduction to Amazon algorithms** section. To open a notebook, click its **Use** tab and select **Create copy**.

# How LDA Works

Amazon SageMaker AI LDA is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of different categories. These categories are themselves a probability distribution over the features. LDA is a generative probability model, which means it attempts to provide a model for the distribution of outputs and inputs based on latent variables. This is opposed to discriminative models, which attempt to learn how inputs map to outputs.

You can use LDA for a variety of tasks, from clustering customers based on product purchases to automatic harmonic analysis in music. However, it is most commonly associated with topic modeling in text corpuses. Observations are referred to as documents. The feature set is referred to as vocabulary. A feature is referred to as a word. And the resulting categories are referred to as topics.

**Note**  
Lemmatization significantly increases algorithm performance and accuracy. Consider pre-processing any input text data. For more information, see [Stemming and lemmatization](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html).

An LDA model is defined by two parameters:
+ α—A prior estimate on topic probability (in other words, the average frequency that each topic within a given document occurs). 
+ β—A collection of *k* topics, where each topic is given a probability distribution over the vocabulary used in the document corpus. This is also called a "topic-word distribution."

LDA is a "bag-of-words" model, which means that the order of words does not matter. LDA is a generative model where each document is generated word-by-word by choosing a topic mixture θ ∼ Dirichlet(α). 

For each word in the document:
+ Choose a topic z ∼ Multinomial(θ).
+ Choose the corresponding topic-word distribution β\_z.
+ Draw a word w ∼ Multinomial(β\_z).

When training the model, the goal is to find parameters α and β, which maximize the probability that the text corpus is generated by the model.
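
The generative process above can be sketched with Python's standard library. The α, β, and vocabulary below are invented for illustration; Dirichlet samples are drawn by normalizing independent Gamma variates:

```python
import random

vocab = ["eat", "sleep", "play", "meow", "bark"]
alpha = [0.5, 0.5]                      # Dirichlet prior over 2 topics
beta = [[0.1, 0.3, 0.2, 0.4, 0.0],      # topic-word distribution, topic 0
        [0.2, 0.1, 0.4, 0.0, 0.3]]      # topic-word distribution, topic 1

def sample_dirichlet(alpha):
    """Draw theta ~ Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words):
    """Generate a bag-of-words document word by word, as described above."""
    theta = sample_dirichlet(alpha)                             # topic mixture
    doc = []
    for _ in range(n_words):
        z = random.choices(range(len(alpha)), weights=theta)[0]  # choose topic
        w = random.choices(vocab, weights=beta[z])[0]            # draw word
        doc.append(w)
    return doc

random.seed(0)
print(generate_document(8))
```

Training inverts this process: given only the generated documents, it searches for the α and β most likely to have produced them.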

The most popular methods for estimating the LDA model use Gibbs sampling or Expectation Maximization (EM) techniques. The Amazon SageMaker AI LDA uses tensor spectral decomposition. This provides several advantages:
+  **Theoretical guarantees on results**. The standard EM-method is guaranteed to converge only to local optima, which are often of poor quality. 
+  **Embarrassingly parallelizable**. The work can be trivially divided over input documents in both training and inference. The EM-method and Gibbs Sampling approaches can be parallelized, but not as easily. 
+  **Fast**. Although the EM-method has low iteration cost it is prone to slow convergence rates. Gibbs Sampling is also subject to slow convergence rates and also requires a large number of samples. 

At a high-level, the tensor decomposition algorithm follows this process:

1.  The goal is to calculate the spectral decomposition of a **V** x **V** x **V** tensor, which summarizes the moments of the documents in our corpus. **V** is vocabulary size (in other words, the number of distinct words in all of the documents). The spectral components of this tensor are the LDA parameters α and β, which maximize the overall likelihood of the document corpus. However, because vocabulary size tends to be large, this **V** x **V** x **V** tensor is prohibitively large to store in memory. 

1.  Instead, it uses a **V** x **V** moment matrix, which is the two-dimensional analog of the tensor from step 1, to find a whitening matrix of dimension **V** x **k**. This matrix can be used to convert the **V** x **V** moment matrix into a **k** x **k** identity matrix. **k** is the number of topics in the model. 

1.  This same whitening matrix can then be used to find a smaller **k** x **k** x **k** tensor. When spectrally decomposed, this tensor has components that have a simple relationship with the components of the **V** x **V** x **V** tensor. 

1.  *Alternating Least Squares* is used to decompose the smaller **k** x **k** x **k** tensor. This provides a substantial improvement in memory consumption and speed. The parameters α and β can be found by "unwhitening" these outputs in the spectral decomposition. 

After the LDA model’s parameters have been found, you can find the topic mixtures for each document. You use stochastic gradient descent to maximize the likelihood function of observing a given topic mixture corresponding to these data.

Topic quality can be improved by increasing the number of topics to look for in training and then filtering out poor quality ones. This is in fact done automatically in SageMaker AI LDA: 25% more topics are computed and only the ones with largest associated Dirichlet priors are returned. To perform further topic filtering and analysis, you can increase the topic count and modify the resulting LDA model as follows:

```
> import mxnet as mx
> alpha, beta = mx.ndarray.load('model.tar.gz')
> # modify alpha and beta
> mx.nd.save('new_model.tar.gz', [new_alpha, new_beta])
> # upload to S3 and create new SageMaker model using the console
```

For more information about algorithms for LDA and the SageMaker AI implementation, see the following:
+ Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M Kakade, and Matus Telgarsky. *Tensor Decompositions for Learning Latent Variable Models*, Journal of Machine Learning Research, 15:2773–2832, 2014.
+  David M Blei, Andrew Y Ng, and Michael I Jordan. *Latent Dirichlet Allocation*. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.
+  Thomas L Griffiths and Mark Steyvers. *Finding Scientific Topics*. Proceedings of the National Academy of Sciences, 101(suppl 1):5228–5235, 2004. 
+  Tamara G Kolda and Brett W Bader. *Tensor Decompositions and Applications*. SIAM Review, 51(3):455–500, 2009. 

# LDA Hyperparameters

In the `CreateTrainingJob` request, you specify the training algorithm. You can also specify algorithm-specific hyperparameters as string-to-string maps. The following table lists the hyperparameters for the LDA training algorithm provided by Amazon SageMaker AI. For more information, see [How LDA Works](lda-how-it-works.md).


| Parameter Name | Description | 
| --- | --- | 
| num\_topics |  The number of topics for LDA to find within the data. **Required** Valid values: positive integer  | 
| feature\_dim |  The size of the vocabulary of the input document corpus. **Required** Valid values: positive integer  | 
| mini\_batch\_size |  The total number of documents in the input document corpus. **Required** Valid values: positive integer  | 
| alpha0 |  Initial guess for the concentration parameter: the sum of the elements of the Dirichlet prior. Small values are more likely to generate sparse topic mixtures and large values (greater than 1.0) produce more uniform mixtures.  **Optional** Valid values: Positive float Default value: 1.0  | 
| max\_restarts |  The number of restarts to perform during the Alternating Least Squares (ALS) spectral decomposition phase of the algorithm. Can be used to find better quality local minima at the expense of additional computation, but typically should not be adjusted.  **Optional** Valid values: Positive integer Default value: 10  | 
| max\_iterations |  The maximum number of iterations to perform during the ALS phase of the algorithm. Can be used to find better quality minima at the expense of additional computation, but typically should not be adjusted.  **Optional** Valid values: Positive integer Default value: 1000  | 
| tol |  Target error tolerance for the ALS phase of the algorithm. Can be used to find better quality minima at the expense of additional computation, but typically should not be adjusted.  **Optional** Valid values: Positive float Default value: 1e-8  | 

# Tune an LDA Model

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the objective metric.

LDA is an unsupervised topic modeling algorithm that attempts to describe a set of observations (documents) as a mixture of different categories (topics). The “per-word log-likelihood” (PWLL) metric measures the likelihood that a learned set of topics (an LDA model) accurately describes a test document dataset. Larger values of PWLL indicate that the test data is more likely to be described by the LDA model.
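
As a toy illustration of this metric (all numbers are invented), per-word log-likelihood divides the total log-likelihood of the test documents by their total word count:

```python
import math

def per_word_log_likelihood(doc_likelihoods, doc_lengths):
    """PWLL = sum of per-document log-likelihoods / total number of words."""
    total_ll = sum(math.log(p) for p in doc_likelihoods)
    return total_ll / sum(doc_lengths)

# Two hypothetical test documents and the likelihoods a model assigns them.
likelihoods = [1e-30, 1e-45]   # probability of each document under the model
lengths = [20, 30]             # words per document
print(per_word_log_likelihood(likelihoods, lengths))
```

PWLL is always negative for probabilities below 1, so values closer to zero (larger) indicate a better fit to the test data.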

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics Computed by the LDA Algorithm

The LDA algorithm reports on a single metric during training: `test:pwll`. When tuning a model, choose this metric as the objective metric.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| test:pwll | Per-word log-likelihood on the test dataset. The likelihood that the test dataset is accurately described by the learned LDA model. | Maximize | 

## Tunable LDA Hyperparameters

You can tune the following hyperparameters for the LDA algorithm. Both hyperparameters, `alpha0` and `num_topics`, can affect the LDA objective metric (`test:pwll`). If you don't already know the optimal values for these hyperparameters, which maximize per-word log-likelihood and produce an accurate LDA model, automatic model tuning can help find them.


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| alpha0 | ContinuousParameterRanges | MinValue: 0.1, MaxValue: 10 | 
| num\_topics | IntegerParameterRanges | MinValue: 1, MaxValue: 150 | 

# Neural Topic Model (NTM) Algorithm


Amazon SageMaker AI NTM is an unsupervised learning algorithm that is used to organize a corpus of documents into *topics* that contain word groupings based on their statistical distribution. Documents that contain frequent occurrences of words such as "bike", "car", "train", "mileage", and "speed" are likely to share a topic on "transportation" for example. Topic modeling can be used to classify or summarize documents based on the topics detected or to retrieve information or recommend content based on topic similarities. The topics from documents that NTM learns are characterized as a *latent representation* because the topics are inferred from the observed word distributions in the corpus. The semantics of topics are usually inferred by examining the top ranking words they contain. Because the method is unsupervised, only the number of topics, not the topics themselves, are prespecified. In addition, the topics are not guaranteed to align with how a human might naturally categorize documents.

Topic modeling provides a way to visualize the contents of a large document corpus in terms of the learned topics. Documents relevant to each topic might be indexed or searched for based on their soft topic labels. The latent representations of documents might also be used to find similar documents in the topic space. You can also use the latent representations of documents that the topic model learns for input to another supervised algorithm such as a document classifier. Because the latent representations of documents are expected to capture the semantics of the underlying documents, algorithms based in part on these representations are expected to perform better than those based on lexical features alone.

Although you can use both the Amazon SageMaker AI NTM and LDA algorithms for topic modeling, they are distinct algorithms and can be expected to produce different results on the same input data.

For more information on the mathematics behind NTM, see [Neural Variational Inference for Text Processing](https://arxiv.org/pdf/1511.06038.pdf).

**Topics**
+ [Input/Output Interface for the NTM Algorithm](#NTM-inputoutput)
+ [EC2 Instance Recommendation for the NTM Algorithm](#NTM-instances)
+ [NTM Sample Notebooks](#NTM-sample-notebooks)
+ [NTM Hyperparameters](ntm_hyperparameters.md)
+ [Tune an NTM Model](ntm-tuning.md)
+ [NTM Response Formats](ntm-in-formats.md)

## Input/Output Interface for the NTM Algorithm


Amazon SageMaker AI Neural Topic Model supports four data channels: train, validation, test, and auxiliary. The validation, test, and auxiliary data channels are optional. If you specify any of these optional channels, set the value of the `S3DataDistributionType` parameter for them to `FullyReplicated`. If you provide validation data, the loss on this data is logged at every epoch, and the model stops training as soon as it detects that the validation loss is not improving. If you don't provide validation data, the algorithm stops early based on the training data, but this can be less efficient. If you provide test data, the algorithm reports the test loss from the final model. 

The train, validation, and test data channels for NTM support both `recordIO-wrapped-protobuf` (dense and sparse) and `CSV` file formats. For `CSV` format, each row must be represented densely with zero counts for words not present in the corresponding document, so the data has dimension (number of records) * (vocabulary size). You can use either File mode or Pipe mode to train models on data that is formatted as `recordIO-wrapped-protobuf` or as `CSV`. The auxiliary channel is used to supply a text file that contains the vocabulary. By supplying the vocabulary file, you can see the top words for each of the topics printed in the log instead of their integer IDs. Having the vocabulary file also allows NTM to compute the Word Embedding Topic Coherence (WETC) scores, a metric displayed in the log that captures the similarity among the top words in each topic. The `ContentType` for the auxiliary channel is `text/plain`, with each line containing a single word, in the order corresponding to the integer IDs provided in the data. The vocabulary file must be named `vocab.txt`, and currently only UTF-8 encoding is supported.
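For example, a tokenized document can be converted to a dense CSV row as follows. This is an illustrative sketch; the five-word vocabulary and helper function are hypothetical, and real preprocessing must produce IDs consistent with your `vocab.txt`.

```
from collections import Counter

# Hypothetical integer-ID vocabulary; vocab.txt would list these words
# one per line, in ID order.
vocab = ["bike", "car", "train", "mileage", "speed"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def to_dense_csv_row(tokens):
    """Dense bag-of-words row: one count per vocabulary ID, zeros included."""
    counts = Counter(word_to_id[t] for t in tokens if t in word_to_id)
    return ",".join(str(counts.get(i, 0)) for i in range(len(vocab)))

row = to_dense_csv_row(["car", "car", "speed", "mileage"])
# row == "0,2,0,1,1" -- dimension equals the vocabulary size
```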

For inference, `text/csv`, `application/json`, `application/jsonlines`, and `application/x-recordio-protobuf` content types are supported. Sparse data can also be passed for `application/json` and `application/x-recordio-protobuf`. NTM inference returns `application/json` or `application/x-recordio-protobuf` *predictions*, which include the `topic_weights` vector for each observation.
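As a sketch, the JSON predictions can be parsed as follows. The response body shown is illustrative; its shape follows the JSON response format documented for NTM.

```
import json

# A response body in the documented JSON format (values are illustrative).
body = ('{"predictions": [{"topic_weights": [0.02, 0.1, 0.88]},'
        ' {"topic_weights": [0.25, 0.067, 0.683]}]}')

response = json.loads(body)
weights = [p["topic_weights"] for p in response["predictions"]]

# Dominant topic index for each document.
dominant = [w.index(max(w)) for w in weights]
```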

See the [blog post](https://aws.amazon.com/blogs/machine-learning/amazon-sagemaker-neural-topic-model-now-supports-auxiliary-vocabulary-channel-new-topic-evaluation-metrics-and-training-subsampling/) for more details on using the auxiliary channel and the WETC scores. For more information on how to compute the WETC score, see [Coherence-Aware Neural Topic Modeling](https://arxiv.org/pdf/1809.02687.pdf). We used the pairwise WETC described in this paper for the Amazon SageMaker AI Neural Topic Model.

For more information on input and output file formats, see [NTM Response Formats](ntm-in-formats.md) for inference and the [NTM Sample Notebooks](#NTM-sample-notebooks).

## EC2 Instance Recommendation for the NTM Algorithm


NTM training supports both GPU and CPU instance types. We recommend GPU instances, but for certain workloads, CPU instances might result in lower training costs. CPU instances should be sufficient for inference. NTM supports the P2, P3, G4dn, and G5 GPU instance families for training and inference.

## NTM Sample Notebooks

For a sample notebook that uses the SageMaker AI NTM algorithm to uncover topics in documents from a synthetic data source where the topic distributions are known, see the [Introduction to Basic Functionality of NTM](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/ntm_synthetic/ntm_synthetic.html). For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, select the **SageMaker AI Examples** tab to see a list of all the SageMaker AI samples. The topic modeling example notebooks that use the NTM algorithm are located in the **Introduction to Amazon algorithms** section. To open a notebook, click its **Use** tab and select **Create copy**.

# NTM Hyperparameters

The following table lists the hyperparameters that you can set for the Amazon SageMaker AI Neural Topic Model (NTM) algorithm.


| Parameter Name | Description | 
| --- | --- | 
|  `feature_dim`  |  The vocabulary size of the dataset. **Required** Valid values: Positive integer (min: 1, max: 1,000,000)  | 
|  `num_topics`  |  The number of required topics. **Required** Valid values: Positive integer (min: 2, max: 1000)  | 
|  `batch_norm`  |  Whether to use batch normalization during training. **Optional** Valid values: *true* or *false* Default value: *false*  | 
|  `clip_gradient`  |  The maximum magnitude for each gradient component. **Optional** Valid values: Float (min: 1e-3) Default value: Infinity  | 
|  `encoder_layers`  |  The number of layers in the encoder and the output size of each layer. When set to *auto*, the algorithm uses two layers of sizes 3 x `num_topics` and 2 x `num_topics` respectively.  **Optional** Valid values: Comma-separated list of positive integers or *auto* Default value: *auto*  | 
|  `encoder_layers_activation`  |  The activation function to use in the encoder layers. **Optional** Valid values:  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/ntm_hyperparameters.html) Default value: `sigmoid`  | 
|  `epochs`  |  The maximum number of passes over the training data. **Optional** Valid values: Positive integer (min: 1) Default value: 50  | 
|  `learning_rate`  |  The learning rate for the optimizer. **Optional** Valid values: Float (min: 1e-6, max: 1.0) Default value: 0.001  | 
|  `mini_batch_size`  |  The number of examples in each mini batch. **Optional** Valid values: Positive integer (min: 1, max: 10000) Default value: 256  | 
|  `num_patience_epochs`  |  The number of successive epochs over which the early stopping criterion is evaluated. Early stopping is triggered when the change in the loss function drops below the specified `tolerance` within the last `num_patience_epochs` epochs. To disable early stopping, set `num_patience_epochs` to a value larger than `epochs`. **Optional** Valid values: Positive integer (min: 1) Default value: 3  | 
|  `optimizer`  |  The optimizer to use for training. **Optional** Valid values: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/ntm_hyperparameters.html) Default value: `adadelta`  | 
|  `rescale_gradient`  |  The rescale factor for the gradient. **Optional** Valid values: Float (min: 1e-3, max: 1.0) Default value: 1.0  | 
|  `sub_sample`  |  The fraction of the training data to sample for training per epoch. **Optional** Valid values: Float (min: 0.0, max: 1.0) Default value: 1.0  | 
|  `tolerance`  |  The maximum relative change in the loss function. Early stopping is triggered when the change in the loss function drops below this value within the last `num_patience_epochs` epochs. **Optional** Valid values: Float (min: 1e-6, max: 0.1) Default value: 0.001  | 
|  `weight_decay`  |  The weight decay coefficient. Adds L2 regularization. **Optional** Valid values: Float (min: 0.0, max: 1.0) Default value: 0.0  | 
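To make the interaction between `tolerance` and `num_patience_epochs` concrete, the following is an illustrative re-implementation of the documented early-stopping rule. It is a sketch of the described behavior, not SageMaker AI's internal code.

```
def should_stop(losses, tolerance=0.001, num_patience_epochs=3):
    """Illustrative early-stopping check: stop when the relative improvement
    in the loss over the last num_patience_epochs epochs stays below
    tolerance. Sketch of the documented semantics, not SageMaker's code."""
    if len(losses) <= num_patience_epochs:
        return False
    previous = losses[-(num_patience_epochs + 1)]
    best_recent = min(losses[-num_patience_epochs:])
    relative_change = (previous - best_recent) / abs(previous)
    return relative_change < tolerance

assert not should_stop([10.0, 8.0, 6.0, 5.0])    # still improving
assert should_stop([10.0, 5.0, 5.0, 5.0, 5.0])   # plateaued for 3 epochs
```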

# Tune an NTM Model

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the objective metric.

Amazon SageMaker AI NTM is an unsupervised learning algorithm that learns latent representations of large collections of discrete data, such as a corpus of documents. Latent representations use inferred variables that are not directly measured to model the observations in a dataset. Automatic model tuning on NTM helps you find the model that minimizes loss over the training or validation data. *Training loss* measures how well the model fits the training data. *Validation loss* measures how well the model can generalize to data that it is not trained on. Low training loss indicates that a model is a good fit to the training data. Low validation loss indicates that a model has not overfit the training data and so should be able to successfully model documents on which it has not been trained. Usually, it's preferable for both losses to be small. However, minimizing training loss too much might result in overfitting and increase validation loss, which would reduce the generality of the model. 

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics Computed by the NTM Algorithm

The NTM algorithm reports a single metric that is computed during training: `validation:total_loss`. The total loss is the sum of the reconstruction loss and Kullback-Leibler divergence. When tuning hyperparameter values, choose this metric as the objective.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| validation:total_loss |  Total loss on the validation set  |  Minimize  | 

## Tunable NTM Hyperparameters

You can tune the following hyperparameters for the NTM algorithm. Usually setting low `mini_batch_size` and small `learning_rate` values results in lower validation losses, although it might take longer to train. Low validation losses don't necessarily produce more coherent topics as interpreted by humans. The effect of other hyperparameters on training and validation loss can vary from dataset to dataset. To see which values are compatible, see [NTM Hyperparameters](ntm_hyperparameters.md).


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| `encoder_layers_activation` |  CategoricalParameterRanges  |  ['sigmoid', 'tanh', 'relu']  | 
| `learning_rate` |  ContinuousParameterRange  |  MinValue: 1e-4, MaxValue: 0.1  | 
| `mini_batch_size` |  IntegerParameterRanges  |  MinValue: 16, MaxValue: 2048  | 
| `optimizer` |  CategoricalParameterRanges  |  ['sgd', 'adam', 'adadelta']  | 
| `rescale_gradient` |  ContinuousParameterRange  |  MinValue: 0.1, MaxValue: 1.0  | 
| `weight_decay` |  ContinuousParameterRange  |  MinValue: 0.0, MaxValue: 1.0  | 
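If you tune with the `CreateHyperParameterTuningJob` API, these recommended ranges map onto its `ParameterRanges` structure roughly as follows. This is a sketch; consult the API reference for the full request shape.

```
# Recommended NTM tuning ranges expressed in the ParameterRanges shape used
# by the CreateHyperParameterTuningJob API (illustrative sketch only).
parameter_ranges = {
    "CategoricalParameterRanges": [
        {"Name": "encoder_layers_activation", "Values": ["sigmoid", "tanh", "relu"]},
        {"Name": "optimizer", "Values": ["sgd", "adam", "adadelta"]},
    ],
    "ContinuousParameterRanges": [
        {"Name": "learning_rate", "MinValue": "1e-4", "MaxValue": "0.1"},
        {"Name": "rescale_gradient", "MinValue": "0.1", "MaxValue": "1.0"},
        {"Name": "weight_decay", "MinValue": "0.0", "MaxValue": "1.0"},
    ],
    "IntegerParameterRanges": [
        {"Name": "mini_batch_size", "MinValue": "16", "MaxValue": "2048"},
    ],
}
```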

# NTM Response Formats

All Amazon SageMaker AI built-in algorithms adhere to the common input inference format described in [Common Data Formats - Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html). This topic contains a list of the available output formats for the SageMaker AI NTM algorithm.

## JSON Response Format


```
{
    "predictions":    [
        {"topic_weights": [0.02, 0.1, 0,...]},
        {"topic_weights": [0.25, 0.067, 0,...]}
    ]
}
```

## JSONLINES Response Format


```
{"topic_weights": [0.02, 0.1, 0,...]}
{"topic_weights": [0.25, 0.067, 0,...]}
```

## RECORDIO Response Format


```
[
    Record = {
        features = {},
        label = {
            'topic_weights': {
                keys: [],
                values: [0.25, 0.067, 0, ...]  # float32
            }
        }
    },
    Record = {
        features = {},
        label = {
            'topic_weights': {
                keys: [],
                values: [0.25, 0.067, 0, ...]  # float32
            }
        }
    }  
]
```

# Object2Vec Algorithm

The Amazon SageMaker AI Object2Vec algorithm is a general-purpose neural embedding algorithm that is highly customizable. It can learn low-dimensional dense embeddings of high-dimensional objects. The embeddings are learned so that the embedding space preserves the semantics of the relationship between pairs of objects in the original space. You can use the learned embeddings, for example, to efficiently compute nearest neighbors of objects and to visualize natural clusters of related objects in low-dimensional space. You can also use the embeddings as features of the corresponding objects in downstream supervised tasks, such as classification or regression. 

Object2Vec generalizes the well-known Word2Vec embedding technique for words that is optimized in the SageMaker AI [BlazingText algorithm](blazingtext.md). For a blog post that discusses how to apply Object2Vec to some practical use cases, see [Introduction to Amazon SageMaker AI Object2Vec](https://aws.amazon.com/blogs/machine-learning/introduction-to-amazon-sagemaker-object2vec/). 

**Topics**
+ [I/O Interface for the Object2Vec Algorithm](#object2vec-inputoutput)
+ [EC2 Instance Recommendation for the Object2Vec Algorithm](#object2vec--instances)
+ [Object2Vec Sample Notebooks](#object2vec-sample-notebooks)
+ [How Object2Vec Works](object2vec-howitworks.md)
+ [Object2Vec Hyperparameters](object2vec-hyperparameters.md)
+ [Tune an Object2Vec Model](object2vec-tuning.md)
+ [Data Formats for Object2Vec Training](object2vec-training-formats.md)
+ [Data Formats for Object2Vec Inference](object2vec-inference-formats.md)
+ [Encoder Embeddings for Object2Vec](object2vec-encoder-embeddings.md)

## I/O Interface for the Object2Vec Algorithm


You can use Object2Vec on many input data types, including the following examples.


| Input Data Type | Example | 
| --- | --- | 
|  Sentence-sentence pairs  | "A soccer game with multiple males playing." and "Some men are playing a sport." | 
|  Labels-sequence pairs  | The genre tags of the movie "Titanic", such as "Romance" and "Drama", and its short description: "James Cameron's Titanic is an epic, action-packed romance set against the ill-fated maiden voyage of the R.M.S. Titanic. She was the most luxurious liner of her era, a ship of dreams, which ultimately carried over 1,500 people to their death in the ice cold waters of the North Atlantic in the early hours of April 15, 1912." | 
|  Customer-customer pairs  |  The customer ID of Jane and customer ID of Jackie.  | 
|  Product-product pairs  |  The product ID of football and product ID of basketball.  | 
|  Item review user-item pairs  |  A user's ID and the items she has bought, such as apple, pear, and orange.  | 

To transform the input data into the supported formats, you must preprocess it. Currently, Object2Vec natively supports two types of input: 
+ A discrete token, which is represented as a list of a single `integer-id`. For example, `[10]`.
+ A sequence of discrete tokens, which is represented as a list of `integer-ids`. For example, `[0,12,10,13]`.

The object in each pair can be asymmetric. For example, the pairs can be (token, sequence) or (token, token) or (sequence, sequence). For token inputs, the algorithm supports simple embeddings as compatible encoders. For sequences of token vectors, the algorithm supports the following as encoders:
+  Average-pooled embeddings
+  Hierarchical convolutional neural networks (CNNs)
+  Multi-layer bidirectional long short-term memory networks (BiLSTMs)

The input label for each pair can be one of the following:
+ A categorical label that expresses the relationship between the objects in the pair 
+ A score that expresses the strength of the similarity between the two objects 

For categorical labels used in classification, the algorithm supports the cross-entropy loss function. For ratings/score-based labels used in regression, the algorithm supports the mean squared error (MSE) loss function. Specify these loss functions with the `output_layer` hyperparameter when you create the model training job.
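Put together, a preprocessed training set is a JSON Lines file of labeled pairs. The records below are illustrative, assuming the `label`/`in0`/`in1` field names described in [Data Formats for Object2Vec Training](object2vec-training-formats.md); the token IDs are made up.

```
import json

# Illustrative JSON Lines training records for Object2Vec, assuming the
# "label"/"in0"/"in1" field names from the training data format page.
records = [
    {"label": 1, "in0": [10], "in1": [0, 12, 10, 13]},  # (token, sequence) pair
    {"label": 0, "in0": [7, 3, 22], "in1": [5, 9]},     # (sequence, sequence) pair
]
jsonlines = "\n".join(json.dumps(r) for r in records)
```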

## EC2 Instance Recommendation for the Object2Vec Algorithm


The type of Amazon Elastic Compute Cloud (Amazon EC2) instance that you use depends on whether you are training or running inference. 

When training a model using the Object2Vec algorithm on a CPU, start with an ml.m5.2xlarge instance. For training on a GPU, start with an ml.p2.xlarge instance. If the training takes too long on this instance, you can use a larger instance. Currently, the Object2Vec algorithm can train only on a single machine. However, it does offer support for multiple GPUs. Object2Vec supports P2, P3, G4dn, and G5 GPU instance families for training and inference.

For inference with a trained Object2Vec model that has a deep neural network, we recommend using an ml.p3.2xlarge GPU instance. Due to GPU memory scarcity, you can specify the `INFERENCE_PREFERRED_MODE` environment variable to control whether the [GPU optimization: Classification or Regression](object2vec-inference-formats.md#object2vec-inference-gpu-optimize-classification) or the [GPU optimization: Encoder Embeddings](object2vec-encoder-embeddings.md#object2vec-inference-gpu-optimize-encoder-embeddings) inference network is loaded into the GPU.

## Object2Vec Sample Notebooks
+ [Using Object2Vec to Encode Sentences into Fixed Length Embeddings](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/object2vec_sentence_similarity/object2vec_sentence_similarity.html)

# How Object2Vec Works

When using the Amazon SageMaker AI Object2Vec algorithm, you follow the standard workflow: process the data, train the model, and produce inferences. 

**Topics**
+ [Step 1: Process Data](#object2vec-step-1-data-preprocessing)
+ [Step 2: Train a Model](#object2vec-step-2-training-model)
+ [Step 3: Produce Inferences](#object2vec-step-3-inference)

## Step 1: Process Data

During preprocessing, convert the data to the [JSON Lines](http://jsonlines.org/) text file format specified in [Data Formats for Object2Vec Training](object2vec-training-formats.md). To get the highest accuracy during training, also randomly shuffle the data before feeding it into the model. How you generate random permutations depends on the language. For Python, you could use `np.random.shuffle`; for Unix, `shuf`.
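For example, a minimal shuffle might look like the following sketch. It works on in-memory lines for illustration; in practice you would read and rewrite the actual JSON Lines file.

```
import random

# Shuffle the lines of a JSON Lines dataset before training (Python
# equivalent of `shuf` on Unix). The records here are illustrative.
lines = ['{"label": 1, "in0": [10], "in1": [0, 12]}',
         '{"label": 0, "in0": [7], "in1": [5, 9]}',
         '{"label": 1, "in0": [3], "in1": [8]}']

random.seed(0)  # fixed seed only so this example is reproducible
shuffled = lines[:]
random.shuffle(shuffled)

# Same records, new order.
assert sorted(shuffled) == sorted(lines)
```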

## Step 2: Train a Model

The SageMaker AI Object2Vec algorithm has the following main components:
+ **Two input channels** – The input channels take a pair of objects of the same or different types as inputs, and pass them to independent and customizable encoders.
+ **Two encoders** – The two encoders, enc0 and enc1, convert each object into a fixed-length embedding vector. The encoded embeddings of the objects in the pair are then passed into a comparator.
+ **A comparator** – The comparator compares the embeddings in different ways and outputs scores that indicate the strength of the relationship between the paired objects. For example, in the output score for a sentence pair, 1 indicates a strong relationship and 0 represents a weak one. 

During training, the algorithm accepts pairs of objects and their relationship labels or scores as inputs. The objects in each pair can be of different types, as described earlier. If the inputs to both encoders are composed of the same token-level units, you can use a shared token embedding layer by setting the `tied_token_embedding_weight` hyperparameter to `True` when you create the training job. This is possible, for example, when comparing sentences that both have word token-level units. To generate negative samples at a specified rate, set the `negative_sampling_rate` hyperparameter to the desired ratio of negative to positive samples. This hyperparameter expedites learning how to discriminate between the positive samples observed in the training data and the negative samples that are not likely to be observed. 

Pairs of objects are passed through independent, customizable encoders that are compatible with the input types of the corresponding objects. The encoders convert each object in a pair into a fixed-length embedding vector of equal length. The pair of vectors is passed to a comparator operator, which assembles the vectors into a single vector using the value specified in the `comparator_list` hyperparameter. The assembled vector then passes through a multilayer perceptron (MLP) layer, which produces an output that the loss function compares with the labels that you provided. This comparison evaluates the strength of the relationship between the objects in the pair as predicted by the model. The following figure shows this workflow.
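The comparator's assembly step can be sketched in NumPy as follows. This illustrates the documented `hadamard`, `concat`, and `abs_diff` operators and is not SageMaker AI's implementation.

```
import numpy as np

def comparator(enc0, enc1, comparator_list=("hadamard", "concat", "abs_diff")):
    """Illustrative comparator: assemble one vector from the two encoder
    outputs according to comparator_list (sketch, not SageMaker's code)."""
    parts = {
        "hadamard": enc0 * enc1,                  # elementwise product
        "concat": np.concatenate([enc0, enc1]),   # plain concatenation
        "abs_diff": np.abs(enc0 - enc1),          # elementwise absolute difference
    }
    return np.concatenate([parts[name] for name in comparator_list])

enc0 = np.array([1.0, 2.0])
enc1 = np.array([3.0, -1.0])
v = comparator(enc0, enc1)
# For two 2-dimensional encodings, v has length 2 + 4 + 2 = 8.
assert v.shape == (8,)
```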

![\[Architecture of the Object2Vec Algorithm from Data Inputs to Scores\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/object2vec-training-image.png)


## Step 3: Produce Inferences

After the model is trained, you can use the trained encoder to preprocess input objects or to perform two types of inference:
+ To convert singleton input objects into fixed-length embeddings using the corresponding encoder
+ To predict the relationship label or score between a pair of input objects

The inference server automatically determines which type of inference is requested based on the input data. To get the embeddings as output, provide only one input. To predict the relationship label or score, provide both inputs in the pair.
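As an illustrative sketch, the two request types differ only in whether the second input is present. The `instances`/`in0`/`in1` field names follow [Data Formats for Object2Vec Inference](object2vec-inference-formats.md); the token IDs are made up.

```
import json

# One input -> the server returns the embedding of that object.
embedding_request = {"instances": [{"in0": [6, 17, 606, 19]}]}

# Both inputs -> the server returns the relationship label or score.
relationship_request = {"instances": [{"in0": [6, 17], "in1": [16, 21]}]}

body = json.dumps(relationship_request)
```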

# Object2Vec Hyperparameters

In the `CreateTrainingJob` request, you specify the training algorithm. You can also specify algorithm-specific hyperparameters as string-to-string maps. The following table lists the hyperparameters for the Object2Vec training algorithm.


| Parameter Name | Description | 
| --- | --- | 
| `enc0_max_seq_len` |  The maximum sequence length for the enc0 encoder. **Required** Valid values: 1 ≤ integer ≤ 5000  | 
| `enc0_vocab_size` |  The vocabulary size of enc0 tokens. **Required** Valid values: 2 ≤ integer ≤ 3000000  | 
| `bucket_width` |  The allowed difference between data sequence length when bucketing is enabled. To enable bucketing, specify a non-zero value for this parameter. **Optional** Valid values: 0 ≤ integer ≤ 100 Default value: 0 (no bucketing)  | 
| `comparator_list` |  A list used to customize the way in which two embeddings are compared. The Object2Vec comparator operator layer takes the encodings from both encoders as inputs and outputs a single vector. This vector is a concatenation of subvectors. The string values passed to the `comparator_list` and the order in which they are passed determine how these subvectors are assembled. For example, if `comparator_list="hadamard, concat"`, then the comparator operator constructs the vector by concatenating the Hadamard product of the two encodings and the concatenation of the two encodings. If, on the other hand, `comparator_list="hadamard"`, then the comparator operator constructs the vector as the Hadamard product of only the two encodings.  **Optional** Valid values: A string that contains any combination of the names of the three binary operators: `hadamard`, `concat`, or `abs_diff`. The Object2Vec algorithm currently requires that the two vector encodings have the same dimension. These operators produce the subvectors as follows: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/object2vec-hyperparameters.html) Default value: `"hadamard, concat, abs_diff"`  | 
| `dropout` |  The dropout probability for network layers. *Dropout* is a form of regularization used in neural networks that reduces overfitting by trimming codependent neurons. **Optional** Valid values: 0.0 ≤ float ≤ 1.0 Default value: 0.0  | 
| `early_stopping_patience` |  The number of consecutive epochs without improvement allowed before early stopping is applied. Improvement is defined with the `early_stopping_tolerance` hyperparameter. **Optional** Valid values: 1 ≤ integer ≤ 5 Default value: 3  | 
| `early_stopping_tolerance` |  The reduction in the loss function that an algorithm must achieve between consecutive epochs to avoid early stopping after the number of consecutive epochs specified in the `early_stopping_patience` hyperparameter concludes. **Optional** Valid values: 0.000001 ≤ float ≤ 0.1 Default value: 0.01  | 
| `enc_dim` |  The dimension of the output of the embedding layer. **Optional** Valid values: 4 ≤ integer ≤ 10000 Default value: 4096  | 
| `enc0_network` |  The network model for the enc0 encoder. **Optional** Valid values: `hcnn`, `bilstm`, or `pooled_embedding` [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/object2vec-hyperparameters.html) Default value: `hcnn`  | 
| `enc0_cnn_filter_width` |  The filter width of the convolutional neural network (CNN) enc0 encoder. **Conditional** Valid values: 1 ≤ integer ≤ 9 Default value: 3  | 
| `enc0_freeze_pretrained_embedding` |  Whether to freeze enc0 pretrained embedding weights. **Conditional** Valid values: `True` or `False` Default value: `True`  | 
| `enc0_layers` |  The number of layers in the enc0 encoder. **Conditional** Valid values: `auto` or 1 ≤ integer ≤ 4 [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/object2vec-hyperparameters.html) Default value: `auto`  | 
| `enc0_pretrained_embedding_file` |  The filename of the pretrained enc0 token embedding file in the auxiliary data channel. **Conditional** Valid values: String with alphanumeric characters, underscore, or period. [A-Za-z0-9_.]  Default value: "" (empty string)  | 
| `enc0_token_embedding_dim` |  The output dimension of the enc0 token embedding layer. **Conditional** Valid values: 2 ≤ integer ≤ 1000 Default value: 300  | 
| `enc0_vocab_file` |  The vocabulary file for mapping pretrained enc0 token embedding vectors to numerical vocabulary IDs. **Conditional** Valid values: String with alphanumeric characters, underscore, or period. [A-Za-z0-9_.]  Default value: "" (empty string)  | 
| `enc1_network` |  The network model for the enc1 encoder. If you want the enc1 encoder to use the same network model as enc0, including the hyperparameter values, set the value to `enc0`.   Even when the enc0 and enc1 encoder networks have symmetric architectures, you can't share parameter values for these networks.  **Optional** Valid values: `enc0`, `hcnn`, `bilstm`, or `pooled_embedding` [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/object2vec-hyperparameters.html) Default value: `enc0`  | 
| `enc1_cnn_filter_width` |  The filter width of the CNN enc1 encoder. **Conditional** Valid values: 1 ≤ integer ≤ 9 Default value: 3  | 
| `enc1_freeze_pretrained_embedding` |  Whether to freeze enc1 pretrained embedding weights. **Conditional** Valid values: `True` or `False` Default value: `True`  | 
| `enc1_layers` |  The number of layers in the enc1 encoder. **Conditional** Valid values: `auto` or 1 ≤ integer ≤ 4 [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/object2vec-hyperparameters.html) Default value: `auto`  | 
| `enc1_max_seq_len` |  The maximum sequence length for the enc1 encoder. **Conditional** Valid values: 1 ≤ integer ≤ 5000  | 
| `enc1_pretrained_embedding_file` |  The name of the enc1 pretrained token embedding file in the auxiliary data channel. **Conditional** Valid values: String with alphanumeric characters, underscore, or period. [A-Za-z0-9_.]  Default value: "" (empty string)  | 
| `enc1_token_embedding_dim` |  The output dimension of the enc1 token embedding layer. **Conditional** Valid values: 2 ≤ integer ≤ 1000 Default value: 300  | 
| `enc1_vocab_file` |  The vocabulary file for mapping pretrained enc1 token embeddings to vocabulary IDs. **Conditional** Valid values: String with alphanumeric characters, underscore, or period. [A-Za-z0-9_.]  Default value: "" (empty string)  | 
| `enc1_vocab_size` |  The vocabulary size of enc1 tokens. **Conditional** Valid values: 2 ≤ integer ≤ 3000000  | 
| `epochs` |  The number of epochs to run for training.  **Optional** Valid values: 1 ≤ integer ≤ 100 Default value: 30  | 
| `learning_rate` |  The learning rate for training. **Optional** Valid values: 1.0E-6 ≤ float ≤ 1.0 Default value: 0.0004  | 
| `mini_batch_size` |  The batch size that the dataset is split into for an `optimizer` during training. **Optional** Valid values: 1 ≤ integer ≤ 10000 Default value: 32  | 
| `mlp_activation` |  The type of activation function for the multilayer perceptron (MLP) layer. **Optional** Valid values: `tanh`, `relu`, or `linear` [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/object2vec-hyperparameters.html) Default value: `linear`  | 
| `mlp_dim` |  The dimension of the output from MLP layers. **Optional** Valid values: 2 ≤ integer ≤ 10000 Default value: 512  | 
| `mlp_layers` |  The number of MLP layers in the network. **Optional** Valid values: 0 ≤ integer ≤ 10 Default value: 2  | 
| `negative_sampling_rate` |  The ratio of negative samples, generated to assist in training the algorithm, to positive samples that are provided by users. Negative samples represent data that is unlikely to occur in reality and are labeled negatively for training. They facilitate training a model to discriminate between the positive samples observed and the negative samples that are not. To specify the ratio of negative samples to positive samples used for training, set the value to a positive integer. For example, if you train the algorithm on input data in which all of the samples are positive and set `negative_sampling_rate` to 2, the Object2Vec algorithm internally generates two negative samples per positive sample. If you don't want to generate or use negative samples during training, set the value to 0.  **Optional** Valid values: 0 ≤ integer Default value: 0 (off)  | 
| `num_classes` |  The number of classes for classification training. Amazon SageMaker AI ignores this hyperparameter for regression problems. **Optional** Valid values: 2 ≤ integer ≤ 30 Default value: 2  | 
| `optimizer` |  The optimizer type. **Optional** Valid values: `adadelta`, `adagrad`, `adam`, `sgd`, or `rmsprop`. [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/object2vec-hyperparameters.html) Default value: `adam`  | 
| `output_layer` |  The type of output layer where you specify that the task is regression or classification. **Optional** Valid values: `softmax` or `mean_squared_error` [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/object2vec-hyperparameters.html) Default value: `softmax`  | 
| `tied_token_embedding_weight` |  Whether to use a shared embedding layer for both encoders. If the inputs to both encoders use the same token-level units, use a shared token embedding layer. For example, for a collection of documents, if one encoder encodes sentences and another encodes whole documents, you can use a shared token embedding layer. That's because both sentences and documents are composed of word tokens from the same vocabulary. **Optional** Valid values: `True` or `False` Default value: `False`  | 
| `token_embedding_storage_type` |  The mode of gradient update used during training. When `dense` mode is used, the optimizer calculates the full gradient matrix for the token embedding layer even if most rows of the gradient are zero-valued. When `sparse` mode is used, the optimizer only stores rows of the gradient that are actually being used in the mini-batch. If you want the algorithm to perform lazy gradient updates, which calculate the gradients only in the non-zero rows and which speed up training, specify `row_sparse`. Setting the value to `row_sparse` constrains the values available for other hyperparameters, as follows:  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/object2vec-hyperparameters.html) **Optional** Valid values: `dense` or `row_sparse` Default value: `dense`  | 
| `weight_decay` |  The weight decay parameter used for optimization. **Optional** Valid values: 0 ≤ float ≤ 10000 Default value: 0 (no decay)  | 

# Tune an Object2Vec Model
Model Tuning

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. For the objective metric, you use one of the metrics that the algorithm computes. Automatic model tuning searches the chosen hyperparameters to find the combination of values that results in the model that optimizes the objective metric.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics Computed by the Object2Vec Algorithm
Metrics

The Object2Vec algorithm has both classification and regression metrics. The `output_layer` type determines which metric you can use for automatic model tuning. 

### Regressor Metrics Computed by the Object2Vec Algorithm
Regressor Metrics

The algorithm reports a mean squared error regressor metric, which is computed during testing and validation. When tuning the model for regression tasks, choose this metric as the objective.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| test:mean\_squared\_error | The Mean Squared Error | Minimize | 
| validation:mean\_squared\_error | The Mean Squared Error | Minimize | 

### Classification Metrics Computed by the Object2Vec Algorithm
Classification Metrics

The Object2Vec algorithm reports accuracy and cross-entropy classification metrics, which are computed during test and validation. When tuning the model for classification tasks, choose one of these as the objective.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| test:accuracy | Accuracy | Maximize | 
| test:cross\_entropy | Cross-entropy | Minimize | 
| validation:accuracy | Accuracy | Maximize | 
| validation:cross\_entropy | Cross-entropy | Minimize | 

## Tunable Object2Vec Hyperparameters
Tunable Hyperparameters

You can tune the following hyperparameters for the Object2Vec algorithm.


| Hyperparameter Name | Hyperparameter Type | Recommended Ranges and Values | 
| --- | --- | --- | 
| dropout | ContinuousParameterRange | MinValue: 0.0, MaxValue: 1.0 | 
| early\_stopping\_patience | IntegerParameterRange | MinValue: 1, MaxValue: 5 | 
| early\_stopping\_tolerance | ContinuousParameterRange | MinValue: 0.001, MaxValue: 0.1 | 
| enc\_dim | IntegerParameterRange | MinValue: 4, MaxValue: 4096 | 
| enc0\_cnn\_filter\_width | IntegerParameterRange | MinValue: 1, MaxValue: 5 | 
| enc0\_layers | IntegerParameterRange | MinValue: 1, MaxValue: 4 | 
| enc0\_token\_embedding\_dim | IntegerParameterRange | MinValue: 5, MaxValue: 300 | 
| enc1\_cnn\_filter\_width | IntegerParameterRange | MinValue: 1, MaxValue: 5 | 
| enc1\_layers | IntegerParameterRange | MinValue: 1, MaxValue: 4 | 
| enc1\_token\_embedding\_dim | IntegerParameterRange | MinValue: 5, MaxValue: 300 | 
| epochs | IntegerParameterRange | MinValue: 4, MaxValue: 20 | 
| learning\_rate | ContinuousParameterRange | MinValue: 1e-6, MaxValue: 1.0 | 
| mini\_batch\_size | IntegerParameterRange | MinValue: 1, MaxValue: 8192 | 
| mlp\_activation | CategoricalParameterRanges |  [`tanh`, `relu`, `linear`]  | 
| mlp\_dim | IntegerParameterRange | MinValue: 16, MaxValue: 1024 | 
| mlp\_layers | IntegerParameterRange | MinValue: 1, MaxValue: 4 | 
| optimizer | CategoricalParameterRanges | [`adagrad`, `adam`, `rmsprop`, `sgd`, `adadelta`] | 
| weight\_decay | ContinuousParameterRange | MinValue: 0.0, MaxValue: 1.0 | 
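
The tuner searches over ranges like those in the table above. As a rough, stdlib-only illustration of the idea (this is not the SageMaker tuner API), a single random-search candidate over a few of the Object2Vec ranges could be drawn like this:

```python
import random

random.seed(0)  # reproducible sampling for this sketch

# A few Object2Vec ranges from the table above, expressed as plain tuples.
search_space = {
    "learning_rate": ("continuous", 1e-6, 1.0),
    "mini_batch_size": ("integer", 1, 8192),
    "optimizer": ("categorical", ["adagrad", "adam", "rmsprop", "sgd", "adadelta"]),
}

def sample(space):
    """Draw one hyperparameter configuration uniformly from the space."""
    config = {}
    for name, spec in space.items():
        kind = spec[0]
        if kind == "continuous":
            config[name] = random.uniform(spec[1], spec[2])
        elif kind == "integer":
            config[name] = random.randint(spec[1], spec[2])
        else:  # categorical
            config[name] = random.choice(spec[1])
    return config

candidate = sample(search_space)
print(candidate)
```

In the real service, Bayesian or random search samples such candidates and launches one training job per candidate, keeping the one that best optimizes the objective metric.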

# Data Formats for Object2Vec Training
Training Formats

When training with the Object2Vec algorithm, make sure that the input data in your request is in JSON Lines format, where each line represents a single data point.

## Input: JSON Lines Request Format


Content-type: application/jsonlines

```
{"label": 0, "in0": [6, 17, 606, 19, 53, 67, 52, 12, 5, 10, 15, 10178, 7, 33, 652, 80, 15, 69, 821, 4], "in1": [16, 21, 13, 45, 14, 9, 80, 59, 164, 4]}
{"label": 1, "in0": [22, 1016, 32, 13, 25, 11, 5, 64, 573, 45, 5, 80, 15, 67, 21, 7, 9, 107, 4], "in1": [22, 32, 13, 25, 1016, 573, 3252, 4]}
{"label": 1, "in0": [774, 14, 21, 206], "in1": [21, 366, 125]}
```

The `in0` and `in1` fields are the inputs for encoder0 and encoder1, respectively. The same format is valid for both classification and regression problems. For regression, the `label` field can accept real-valued inputs.
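
As a sketch of producing this format, the following stdlib-only Python serializes a few hypothetical records (the token IDs here are illustrative vocabulary indices, not real data) as JSON Lines:

```python
import json

# Hypothetical (label, in0, in1) records: in0/in1 are token-ID sequences
# for encoder0 and encoder1; label is the class (or a real value for
# regression tasks).
records = [
    {"label": 0, "in0": [6, 17, 606, 19], "in1": [16, 21, 13]},
    {"label": 1, "in0": [774, 14, 21, 206], "in1": [21, 366, 125]},
]

# JSON Lines format: one JSON object per line, newline-separated.
jsonl = "\n".join(json.dumps(r) for r in records)
print(jsonl)
```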

# Data Formats for Object2Vec Inference
Inference Formats: Scoring

The following page describes the input request and output response formats for getting scoring inference from the Amazon SageMaker AI Object2Vec model.

## GPU optimization: Classification or Regression


Because GPU memory is scarce, you can specify the `INFERENCE_PREFERRED_MODE` environment variable to choose whether the classification/regression inference network or the [Output: Encoder Embeddings](object2vec-encoder-embeddings.md#object2vec-out-encoder-embeddings-data) inference network is loaded into the GPU. If the majority of your inference is for classification or regression, specify `INFERENCE_PREFERRED_MODE=classification`. The following is a Batch Transform example that uses 4 instances of p3.2xlarge and optimizes for classification/regression inference:

```
transformer = o2v.transformer(instance_count=4,
                              instance_type="ml.p3.2xlarge",
                              max_concurrent_transforms=2,
                              max_payload=1,  # 1MB
                              strategy='MultiRecord',
                              env={'INFERENCE_PREFERRED_MODE': 'classification'},  # only useful with GPU
                              output_path=output_s3_path)
```

## Input: Classification or Regression Request Format


Content-type: application/json

```
{
  "instances" : [
    {"in0": [6, 17, 606, 19, 53, 67, 52, 12, 5, 10, 15, 10178, 7, 33, 652, 80, 15, 69, 821, 4], "in1": [16, 21, 13, 45, 14, 9, 80, 59, 164, 4]},
    {"in0": [22, 1016, 32, 13, 25, 11, 5, 64, 573, 45, 5, 80, 15, 67, 21, 7, 9, 107, 4], "in1": [22, 32, 13, 25, 1016, 573, 3252, 4]},
    {"in0": [774, 14, 21, 206], "in1": [21, 366, 125]}
  ]
}
```

Content-type: application/jsonlines

```
{"in0": [6, 17, 606, 19, 53, 67, 52, 12, 5, 10, 15, 10178, 7, 33, 652, 80, 15, 69, 821, 4], "in1": [16, 21, 13, 45, 14, 9, 80, 59, 164, 4]}
{"in0": [22, 1016, 32, 13, 25, 11, 5, 64, 573, 45, 5, 80, 15, 67, 21, 7, 9, 107, 4], "in1": [22, 32, 13, 25, 1016, 573, 3252, 4]}
{"in0": [774, 14, 21, 206], "in1": [21, 366, 125]}
```

For classification problems, the length of the scores vector corresponds to `num_classes`. For regression problems, the length is 1.

## Output: Classification or Regression Response Format


Accept: application/json

```
{
    "predictions": [
        {
            "scores": [
                0.6533935070037842,
                0.07582679390907288,
                0.2707797586917877
            ]
        },
        {
            "scores": [
                0.026291321963071823,
                0.6577019095420837,
                0.31600672006607056
            ]
        }
    ]
}
```

Accept: application/jsonlines

```
{"scores":[0.195667684078216,0.395351558923721,0.408980727195739]}
{"scores":[0.251988261938095,0.258233487606048,0.489778339862823]}
{"scores":[0.280087798833847,0.368331134319305,0.351581096649169]}
```

In both the classification and regression formats, the scores apply to individual labels. 
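
To illustrate how such a response is typically consumed, here is a small stdlib-only Python sketch (the scores are taken from the JSON example above) that picks the highest-scoring class index for each instance:

```python
import json

# Parse a JSON classification response; each position in "scores"
# corresponds to one class label.
response_body = """
{"predictions": [
    {"scores": [0.6533935070037842, 0.07582679390907288, 0.2707797586917877]},
    {"scores": [0.026291321963071823, 0.6577019095420837, 0.31600672006607056]}
]}
"""
predictions = json.loads(response_body)["predictions"]

# Take the argmax of each instance's scores to get the predicted class.
best_labels = [p["scores"].index(max(p["scores"])) for p in predictions]
print(best_labels)  # one class index per instance
```

For a regression response the `scores` list has length 1, so the single value is the prediction itself and no argmax is needed.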

# Encoder Embeddings for Object2Vec
Inference Formats: Embeddings

The following page lists the input request and output response formats for getting encoder embedding inference from the Amazon SageMaker AI Object2Vec model.

## GPU optimization: Encoder Embeddings


An embedding is a mapping from discrete objects, such as words, to vectors of real numbers.

Because GPU memory is scarce, you can specify the `INFERENCE_PREFERRED_MODE` environment variable to choose whether the [Data Formats for Object2Vec Inference](object2vec-inference-formats.md) network or the encoder embedding inference network is loaded into the GPU. If the majority of your inference is for encoder embeddings, specify `INFERENCE_PREFERRED_MODE=embedding`. The following is a Batch Transform example that uses 4 instances of p3.2xlarge and optimizes for encoder embedding inference:

```
transformer = o2v.transformer(instance_count=4,
                              instance_type="ml.p3.2xlarge",
                              max_concurrent_transforms=2,
                              max_payload=1,  # 1MB
                              strategy='MultiRecord',
                              env={'INFERENCE_PREFERRED_MODE': 'embedding'},  # only useful with GPU
                              output_path=output_s3_path)
```

## Input: Encoder Embeddings


Content-type: application/json; infer\_max\_seqlens=&lt;FWD-LENGTH&gt;,&lt;BCK-LENGTH&gt;

Where `<FWD-LENGTH>` and `<BCK-LENGTH>` are integers in the range [1,5000] that define the maximum sequence lengths for the forward and backward encoders.

```
{
  "instances" : [
    {"in0": [6, 17, 606, 19, 53, 67, 52, 12, 5, 10, 15, 10178, 7, 33, 652, 80, 15, 69, 821, 4]},
    {"in0": [22, 1016, 32, 13, 25, 11, 5, 64, 573, 45, 5, 80, 15, 67, 21, 7, 9, 107, 4]},
    {"in0": [774, 14, 21, 206]}
  ]
}
```

Content-type: application/jsonlines; infer\_max\_seqlens=&lt;FWD-LENGTH&gt;,&lt;BCK-LENGTH&gt;

Where `<FWD-LENGTH>` and `<BCK-LENGTH>` are integers in the range [1,5000] that define the maximum sequence lengths for the forward and backward encoders.

```
{"in0": [6, 17, 606, 19, 53, 67, 52, 12, 5, 10, 15, 10178, 7, 33, 652, 80, 15, 69, 821, 4]}
{"in0": [22, 1016, 32, 13, 25, 11, 5, 64, 573, 45, 5, 80, 15, 67, 21, 7, 9, 107, 4]}
{"in0": [774, 14, 21, 206]}
```

In both of these formats, you specify only one input type: `in0` or `in1`. The inference service then invokes the corresponding encoder and outputs the embeddings for each of the instances. 

## Output: Encoder Embeddings


Content-type: application/json

```
{
  "predictions": [
    {"embeddings":[0.057368703186511,0.030703511089086,0.099890425801277,0.063688032329082,0.026327300816774,0.003637571120634,0.021305780857801,0.004316598642617,0.0,0.003397724591195,0.0,0.000378780066967,0.0,0.0,0.0,0.007419463712722]},
    {"embeddings":[0.150190666317939,0.05145975202322,0.098204270005226,0.064249359071254,0.056249320507049,0.01513972133398,0.047553978860378,0.0,0.0,0.011533712036907,0.011472506448626,0.010696629062294,0.0,0.0,0.0,0.008508535102009]}
  ]
}
```

Content-type: application/jsonlines

```
{"embeddings":[0.057368703186511,0.030703511089086,0.099890425801277,0.063688032329082,0.026327300816774,0.003637571120634,0.021305780857801,0.004316598642617,0.0,0.003397724591195,0.0,0.000378780066967,0.0,0.0,0.0,0.007419463712722]}
{"embeddings":[0.150190666317939,0.05145975202322,0.098204270005226,0.064249359071254,0.056249320507049,0.01513972133398,0.047553978860378,0.0,0.0,0.011533712036907,0.011472506448626,0.010696629062294,0.0,0.0,0.0,0.008508535102009]}
```

The vector length of the embeddings output by the inference service is equal to the value of one of the following hyperparameters that you specify at training time: `enc0_token_embedding_dim`, `enc1_token_embedding_dim`, or `enc_dim`.
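
As a sketch, the following stdlib-only Python reads one JSON Lines embedding line (taken from the example above) and checks its length against a hypothetical embedding dimension of 16:

```python
import json

enc_dim = 16  # hypothetical value chosen via hyperparameters at training time

# One JSON Lines response line (copied from the example output above).
line = ('{"embeddings":[0.057368703186511,0.030703511089086,'
        '0.099890425801277,0.063688032329082,0.026327300816774,'
        '0.003637571120634,0.021305780857801,0.004316598642617,0.0,'
        '0.003397724591195,0.0,0.000378780066967,0.0,0.0,0.0,'
        '0.007419463712722]}')
embedding = json.loads(line)["embeddings"]

# The service returns one fixed-length vector per input instance.
print(len(embedding))
```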

# Sequence-to-Sequence Algorithm
Sequence to Sequence (seq2seq)

Amazon SageMaker AI Sequence to Sequence is a supervised learning algorithm where the input is a sequence of tokens (for example, text or audio) and the output generated is another sequence of tokens. Example applications include: machine translation (input a sentence from one language and predict what that sentence would be in another language), text summarization (input a longer string of words and predict a shorter string of words that is a summary), and speech-to-text (audio clips converted into output sentences in tokens). Recently, problems in this domain have been successfully modeled with deep neural networks, which show a significant performance boost over previous methodologies. Amazon SageMaker AI seq2seq uses Recurrent Neural Network (RNN) and Convolutional Neural Network (CNN) models with attention as encoder-decoder architectures. 

**Topics**
+ [Input/Output Interface for the Sequence-to-Sequence Algorithm](#s2s-inputoutput)
+ [EC2 Instance Recommendation for the Sequence-to-Sequence Algorithm](#s2s-instances)
+ [Sequence-to-Sequence Sample Notebooks](#seq-2-seq-sample-notebooks)
+ [How Sequence-to-Sequence Works](seq-2-seq-howitworks.md)
+ [Sequence-to-Sequence Hyperparameters](seq-2-seq-hyperparameters.md)
+ [Tune a Sequence-to-Sequence Model](seq-2-seq-tuning.md)

## Input/Output Interface for the Sequence-to-Sequence Algorithm


**Training**

SageMaker AI seq2seq expects data in RecordIO-Protobuf format. However, unlike the usual floating-point format, the tokens are expected as integers.

A script to convert data from tokenized text files to the protobuf format is included in [the seq2seq example notebook](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/seq2seq_translation_en-de/SageMaker-Seq2Seq-Translation-English-German.html). In general, it packs the data into 32-bit integer tensors and generates the necessary vocabulary files, which are needed for metric calculation and inference.

After preprocessing is done, the algorithm can be invoked for training. The algorithm expects three channels:
+ `train`: It should contain the training data (for example, the `train.rec` file generated by the preprocessing script).
+ `validation`: It should contain the validation data (for example, the `val.rec` file generated by the preprocessing script).
+ `vocab`: It should contain two vocabulary files (`vocab.src.json` and `vocab.trg.json`).

If the algorithm doesn't find data in any of these three channels, training results in an error.

**Inference**

For hosted endpoints, inference supports two data formats. To perform inference using space-separated text tokens, use the `application/json` format. Otherwise, use the `recordio-protobuf` format to work with the integer-encoded data. Both modes support batching of input data. The `application/json` format also allows you to visualize the attention matrix.
+ `application/json`: Expects the input in JSON format and returns the output in JSON format. Both content and accept types should be `application/json`. Each sequence is expected to be a string with whitespace-separated tokens. This format is recommended when the number of source sequences in the batch is small. It also supports the following additional configuration option:

  `configuration`: {`attention_matrix`: `true`}: Returns the attention matrix for the particular input sequence.
+ `application/x-recordio-protobuf`: Expects the input in `recordio-protobuf` format and returns the output in `recordio-protobuf` format. Both content and accept types should be `application/x-recordio-protobuf`. For this format, the source sequences must be converted into a list of integers for subsequent protobuf encoding. This format is recommended for bulk inference.

For batch transform, inference supports JSON Lines format. Batch transform expects the input in JSON Lines format and returns the output in JSON Lines format. Both content and accept types should be `application/jsonlines`. The format for input is as follows:

```
content-type: application/jsonlines

{"source": "source_sequence_0"}
{"source": "source_sequence_1"}
```

The format for response is as follows:

```
accept: application/jsonlines

{"target": "predicted_sequence_0"}
{"target": "predicted_sequence_1"}
```

For additional details on how to serialize and deserialize the inputs and outputs to specific formats for inference, see the [Sequence-to-Sequence Sample Notebooks](#seq-2-seq-sample-notebooks) .
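
A minimal stdlib-only sketch of building the batch-transform request and reading the response (the sequence strings below are the placeholders from the examples above, not real model output):

```python
import json

# Build the JSON Lines batch-transform request body: one {"source": ...}
# object per line.
sources = ["source_sequence_0", "source_sequence_1"]
request_body = "\n".join(json.dumps({"source": s}) for s in sources)

# Parse a hypothetical JSON Lines response body back into target strings.
response_body = ('{"target": "predicted_sequence_0"}\n'
                 '{"target": "predicted_sequence_1"}')
targets = [json.loads(line)["target"] for line in response_body.splitlines()]
print(targets)
```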

## EC2 Instance Recommendation for the Sequence-to-Sequence Algorithm


The Amazon SageMaker AI seq2seq algorithm is supported only on GPU instance types and can only train on a single machine. However, you can use instances with multiple GPUs. The seq2seq algorithm supports the P2, P3, G4dn, and G5 GPU instance families.

## Sequence-to-Sequence Sample Notebooks
Sample Notebooks

For a sample notebook that shows how to use the SageMaker AI Sequence to Sequence algorithm to train an English-German translation model, see [Machine Translation English-German Example Using SageMaker AI Seq2Seq](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/seq2seq_translation_en-de/SageMaker-Seq2Seq-Translation-English-German.html). For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, select the **SageMaker AI Examples** tab to see a list of all the SageMaker AI samples. The seq2seq example notebooks are located in the **Introduction to Amazon algorithms** section. To open a notebook, click its **Use** tab and select **Create copy**.

# How Sequence-to-Sequence Works
How It Works

Typically, a neural network for sequence-to-sequence modeling consists of a few layers, including: 
+ An **embedding layer**. In this layer, the input matrix of sparsely encoded input tokens (for example, one-hot encoded) is mapped to a dense feature space. This is required because a high-dimensional feature vector is more capable of encoding information regarding a particular token (a word, for text corpora) than a simple one-hot-encoded vector. It is also standard practice to initialize this embedding layer with pre-trained word vectors like [FastText](https://fasttext.cc/) or [GloVe](https://nlp.stanford.edu/projects/glove/), or to initialize it randomly and learn the parameters during training. 
+ An **encoder layer**. After the input tokens are mapped into a high-dimensional feature space, the sequence is passed through an encoder layer to compress all the information from the input embedding layer (of the entire sequence) into a fixed-length feature vector. Typically, an encoder is made of RNN-type networks like long short-term memory (LSTM) or gated recurrent units (GRU). ([Colah's blog](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) explains LSTM in great detail.) 
+ A **decoder layer**. The decoder layer takes this encoded feature vector and produces the output sequence of tokens. This layer is also usually built with RNN architectures (LSTMs and GRUs). 
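
The embedding layer described above amounts to a table lookup. Here is a toy, stdlib-only sketch with made-up weights (a real layer learns these values during training):

```python
# Each row is the dense vector for one token ID; the dimensions and
# values here are illustrative only.
embedding_table = [
    [0.1, 0.3, -0.2],   # token 0
    [0.0, -0.5, 0.8],   # token 1
    [0.7, 0.2, 0.1],    # token 2
]

token_ids = [2, 0, 1]  # a toy input sequence of vocabulary indices

# Multiplying a one-hot vector by the embedding matrix selects one row,
# so the whole operation reduces to simple indexing.
dense_sequence = [embedding_table[t] for t in token_ids]
print(dense_sequence)
```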

The whole model is trained jointly to maximize the probability of the target sequence given the source sequence. This model was first introduced by [Sutskever et al.](https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf) in 2014. 

**Attention mechanism**. The disadvantage of an encoder-decoder framework is that model performance decreases as the length of the source sequence increases, because of the limit on how much information the fixed-length encoded feature vector can contain. To tackle this problem, in 2015, Bahdanau et al. proposed the [attention mechanism](https://arxiv.org/pdf/1409.0473.pdf). In an attention mechanism, the decoder tries to find the location in the encoder sequence where the most important information could be located and uses that information and previously decoded words to predict the next token in the sequence. 

For more details, see the paper [Effective Approaches to Attention-based Neural Machine Translation](https://arxiv.org/abs/1508.04025) by Luong et al., which explains and simplifies calculations for various attention mechanisms. Additionally, the paper [Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation](https://arxiv.org/abs/1609.08144) by Wu et al. describes Google's architecture for machine translation, which uses skip connections between encoder and decoder layers.

# Sequence-to-Sequence Hyperparameters
Hyperparameters

The following table lists the hyperparameters that you can set when training with the Amazon SageMaker AI Sequence-to-Sequence (seq2seq) algorithm.


| Parameter Name | Description | 
| --- | --- | 
| batch\_size | Mini-batch size for gradient descent. **Optional** Valid values: positive integer Default value: 64 | 
| beam\_size | Length of the beam for beam search. Used during training for computing `bleu` and used during inference. **Optional** Valid values: positive integer Default value: 5 | 
| bleu\_sample\_size | Number of instances to pick from the validation dataset to decode and compute the `bleu` score during training. Set to -1 to use the full validation set (if `bleu` is chosen as `optimized_metric`). **Optional** Valid values: integer Default value: 0 | 
| bucket\_width | Returns (source, target) buckets up to (`max_seq_len_source`, `max_seq_len_target`). The longer side of the data uses steps of `bucket_width`, while the shorter side uses steps scaled down by the average target/source length ratio. If one side reaches its maximum length before the other, the width of the extra buckets on that side is fixed to that side's `max_len`. **Optional** Valid values: positive integer Default value: 10 | 
| bucketing\_enabled | Set to `false` to disable bucketing and unroll to maximum length. **Optional** Valid values: `true` or `false` Default value: `true` | 
| checkpoint\_frequency\_num\_batches | Checkpoint and evaluate every x batches. This checkpointing hyperparameter is passed to SageMaker AI's seq2seq algorithm for early stopping and retrieving the best model. The algorithm's checkpointing runs locally in the algorithm's training container and is not compatible with SageMaker AI checkpointing. The algorithm temporarily saves checkpoints to a local path and stores the best model artifact to the model output path in S3 after the training job has stopped. **Optional** Valid values: positive integer Default value: 1000 | 
| checkpoint\_threshold | Maximum number of checkpoints the model is allowed to not improve in `optimized_metric` on the validation dataset before training is stopped. This checkpointing hyperparameter is passed to SageMaker AI's seq2seq algorithm for early stopping and retrieving the best model. The algorithm's checkpointing runs locally in the algorithm's training container and is not compatible with SageMaker AI checkpointing. The algorithm temporarily saves checkpoints to a local path and stores the best model artifact to the model output path in S3 after the training job has stopped. **Optional** Valid values: positive integer Default value: 3 | 
| clip\_gradient | Clip absolute gradient values greater than this. Set to a negative value to disable. **Optional** Valid values: float Default value: 1 | 
| cnn\_activation\_type | The `cnn` activation type to be used. **Optional** Valid values: String. One of `glu`, `relu`, `softrelu`, `sigmoid`, or `tanh`. Default value: `glu` | 
| cnn\_hidden\_dropout | Dropout probability for dropout between convolutional layers. **Optional** Valid values: Float. Range in [0,1]. Default value: 0 | 
| cnn\_kernel\_width\_decoder | Kernel width for the `cnn` decoder. **Optional** Valid values: positive integer Default value: 5 | 
| cnn\_kernel\_width\_encoder | Kernel width for the `cnn` encoder. **Optional** Valid values: positive integer Default value: 3 | 
| cnn\_num\_hidden | Number of `cnn` hidden units for encoder and decoder. **Optional** Valid values: positive integer Default value: 512 | 
| decoder\_type | Decoder type. **Optional** Valid values: String. Either `rnn` or `cnn`. Default value: `rnn` | 
| embed\_dropout\_source | Dropout probability for source-side embeddings. **Optional** Valid values: Float. Range in [0,1]. Default value: 0 | 
| embed\_dropout\_target | Dropout probability for target-side embeddings. **Optional** Valid values: Float. Range in [0,1]. Default value: 0 | 
| encoder\_type | Encoder type. The `rnn` architecture is based on the attention mechanism by Bahdanau et al. and the `cnn` architecture is based on Gehring et al. **Optional** Valid values: String. Either `rnn` or `cnn`. Default value: `rnn` | 
| fixed\_rate\_lr\_half\_life | Half-life for the learning rate, in terms of number of checkpoints, for `fixed_rate_*` schedulers. **Optional** Valid values: positive integer Default value: 10 | 
| learning\_rate | Initial learning rate. **Optional** Valid values: float Default value: 0.0003 | 
| loss\_type | Loss function for training. **Optional** Valid values: String. `cross-entropy` Default value: `cross-entropy` | 
| lr\_scheduler\_type | Learning rate scheduler type. `plateau_reduce` reduces the learning rate whenever `optimized_metric` on `validation_accuracy` plateaus. `inv_t` is inverse time decay: `learning_rate`/(1 + `decay_rate` \* t). **Optional** Valid values: String. One of `plateau_reduce`, `fixed_rate_inv_t`, or `fixed_rate_inv_sqrt_t`. Default value: `plateau_reduce` | 
| max\_num\_batches | Maximum number of updates/batches to process. Set to -1 for infinite. **Optional** Valid values: integer Default value: -1 | 
| max\_num\_epochs | Maximum number of epochs to pass through the training data before fitting is stopped. If this parameter is passed, training continues for this number of epochs even if validation accuracy is not improving. Ignored if not passed. **Optional** Valid values: positive integer Default value: none | 
| max\_seq\_len\_source | Maximum length for the source sequence. Sequences longer than this length are truncated to this length. **Optional** Valid values: positive integer Default value: 100  | 
| max\_seq\_len\_target | Maximum length for the target sequence. Sequences longer than this length are truncated to this length. **Optional** Valid values: positive integer Default value: 100 | 
| min\_num\_epochs | Minimum number of epochs the training must run before it is stopped via `early_stopping` conditions. **Optional** Valid values: positive integer Default value: 0 | 
| momentum | Momentum constant used for `sgd`. Don't pass this parameter if you are using `adam` or `rmsprop`. **Optional** Valid values: float Default value: none | 
| num\_embed\_source | Embedding size for source tokens. **Optional** Valid values: positive integer Default value: 512 | 
| num\_embed\_target | Embedding size for target tokens. **Optional** Valid values: positive integer Default value: 512 | 
| num\_layers\_decoder | Number of layers for the `rnn` or `cnn` decoder. **Optional** Valid values: positive integer Default value: 1 | 
| num\_layers\_encoder | Number of layers for the `rnn` or `cnn` encoder. **Optional** Valid values: positive integer Default value: 1 | 
| optimized\_metric | Metric to optimize with early stopping. **Optional** Valid values: String. One of `perplexity`, `accuracy`, or `bleu`. Default value: `perplexity` | 
| optimizer\_type | Optimizer to choose from. **Optional** Valid values: String. One of `adam`, `sgd`, or `rmsprop`. Default value: `adam` | 
| plateau\_reduce\_lr\_factor | Factor to multiply the learning rate by (for `plateau_reduce`). **Optional** Valid values: float Default value: 0.5 | 
| plateau\_reduce\_lr\_threshold | For the `plateau_reduce` scheduler, multiply the learning rate by the reduce factor if `optimized_metric` didn't improve for this many checkpoints. **Optional** Valid values: positive integer Default value: 3 | 
| rnn\_attention\_in\_upper\_layers | Pass the attention to the upper layers of the `rnn`, as in the Google NMT paper. Only applicable if more than one layer is used. **Optional** Valid values: boolean (`true` or `false`) Default value: `true` | 
| rnn\_attention\_num\_hidden | Number of hidden units for attention layers. Defaults to `rnn_num_hidden`. **Optional** Valid values: positive integer Default value: `rnn_num_hidden` | 
| rnn\_attention\_type | Attention model for encoders. `mlp` refers to concat and `bilinear` refers to general from the Luong et al. paper. **Optional** Valid values: String. One of `dot`, `fixed`, `mlp`, or `bilinear`. Default value: `mlp` | 
| rnn\_cell\_type | Specific type of `rnn` architecture. **Optional** Valid values: String. Either `lstm` or `gru`. Default value: `lstm` | 
| rnn\_decoder\_state\_init | How to initialize `rnn` decoder states from encoders. **Optional** Valid values: String. One of `last`, `avg`, or `zero`. Default value: `last` | 
| rnn\_first\_residual\_layer | First `rnn` layer to have a residual connection; only applicable if the number of layers in the encoder or decoder is more than 1. **Optional** Valid values: positive integer Default value: 2 | 
| rnn\_num\_hidden | The number of `rnn` hidden units for the encoder and decoder. This must be a multiple of 2 because the algorithm uses bi-directional long short-term memory (LSTM) by default. **Optional** Valid values: positive even integer Default value: 1024 | 
| rnn\_residual\_connections | Add residual connections to the stacked `rnn`. The number of layers should be more than 1. **Optional** Valid values: boolean (`true` or `false`) Default value: `false` | 
| rnn\_decoder\_hidden\_dropout | Dropout probability for the hidden state that combines the context with the `rnn` hidden state in the decoder. **Optional** Valid values: Float. Range in [0,1]. Default value: 0 | 
| training\_metric | Metric to track during training on validation data. **Optional** Valid values: String. Either `perplexity` or `accuracy`. Default value: `perplexity` | 
| weight\_decay | Weight decay constant. **Optional** Valid values: float Default value: 0 | 
| weight\_init\_scale | Weight initialization scale (for `uniform` and `xavier` initialization).  **Optional** Valid values: float Default value: 2.34 | 
| weight\_init\_type | Type of weight initialization.  **Optional** Valid values: String. Either `uniform` or `xavier`. Default value: `xavier` | 
| xavier\_factor\_type | Xavier factor type. **Optional** Valid values: String. One of `in`, `out`, or `avg`. Default value: `in` | 

# Tune a Sequence-to-Sequence Model
Model Tuning

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the chosen hyperparameters to find the combination of values that results in the model that optimizes the objective metric.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics Computed by the Sequence-to-Sequence Algorithm
Metrics

The sequence to sequence algorithm reports three metrics that are computed during training. Choose one of them as an objective to optimize when tuning the hyperparameter values.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| validation:accuracy |  Accuracy computed on the validation dataset.  |  Maximize  | 
| validation:bleu |  [Bleu﻿](https://en.wikipedia.org/wiki/BLEU) score computed on the validation dataset. Because BLEU computation is expensive, you can choose to compute BLEU on a random subsample of the validation dataset to speed up the overall training process. Use the `bleu_sample_size` parameter to specify the subsample.  |  Maximize  | 
| validation:perplexity |  [Perplexity](https://en.wikipedia.org/wiki/Perplexity) is a loss function computed on the validation dataset. Perplexity measures the cross-entropy between an empirical sample and the distribution predicted by a model, and so provides a measure of how well a model predicts the sample values. Models that are good at predicting a sample have a low perplexity.  |  Minimize  | 
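
The relationship between cross-entropy and perplexity can be sketched in a few lines of stdlib Python (the per-token probabilities below are made up for illustration):

```python
import math

# Hypothetical probabilities the model assigned to the observed tokens.
token_probs = [0.5, 0.25, 0.125, 0.125]

# Average cross-entropy (in nats) over the sample...
cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)

# ...and perplexity is its exponential: lower means the model assigned
# higher probability to the observed tokens.
perplexity = math.exp(cross_entropy)
print(perplexity)
```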

## Tunable Sequence-to-Sequence Hyperparameters
Tunable Hyperparameters

You can tune the following hyperparameters for the SageMaker AI Sequence-to-Sequence algorithm. The hyperparameters that have the greatest impact on sequence-to-sequence objective metrics are: `batch_size`, `optimizer_type`, `learning_rate`, `num_layers_encoder`, and `num_layers_decoder`.


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| `num_layers_encoder` |  IntegerParameterRange  |  [1-10]  | 
| `num_layers_decoder` |  IntegerParameterRange  |  [1-10]  | 
| `batch_size` |  CategoricalParameterRange  |  [16,32,64,128,256,512,1024,2048]  | 
| `optimizer_type` |  CategoricalParameterRange  |  ['adam', 'sgd', 'rmsprop']  | 
| `weight_init_type` |  CategoricalParameterRange  |  ['xavier', 'uniform']  | 
| `weight_init_scale` |  ContinuousParameterRange  |  For the xavier type: MinValue: 2.0, MaxValue: 3.0. For the uniform type: MinValue: -1.0, MaxValue: 1.0  | 
| `learning_rate` |  ContinuousParameterRange  |  MinValue: 0.00005, MaxValue: 0.2  | 
| `weight_decay` |  ContinuousParameterRange  |  MinValue: 0.0, MaxValue: 0.1  | 
| `momentum` |  ContinuousParameterRange  |  MinValue: 0.5, MaxValue: 0.9  | 
| `clip_gradient` |  ContinuousParameterRange  |  MinValue: 1.0, MaxValue: 5.0  | 
| `rnn_num_hidden` |  CategoricalParameterRange  |  Applicable only to recurrent neural networks (RNNs). [128,256,512,1024,2048]  | 
| `cnn_num_hidden` |  CategoricalParameterRange  |  Applicable only to convolutional neural networks (CNNs). [128,256,512,1024,2048]  | 
| `num_embed_source` |  IntegerParameterRange  |  [256-512]  | 
| `num_embed_target` |  IntegerParameterRange  |  [256-512]  | 
| `embed_dropout_source` |  ContinuousParameterRange  |  MinValue: 0.0, MaxValue: 0.5  | 
| `embed_dropout_target` |  ContinuousParameterRange  |  MinValue: 0.0, MaxValue: 0.5  | 
| `rnn_decoder_hidden_dropout` |  ContinuousParameterRange  |  MinValue: 0.0, MaxValue: 0.5  | 
| `cnn_hidden_dropout` |  ContinuousParameterRange  |  MinValue: 0.0, MaxValue: 0.5  | 
| `lr_scheduler_type` |  CategoricalParameterRange  |  ['plateau_reduce', 'fixed_rate_inv_t', 'fixed_rate_inv_sqrt_t']  | 
| `plateau_reduce_lr_factor` |  ContinuousParameterRange  |  MinValue: 0.1, MaxValue: 0.5  | 
| `plateau_reduce_lr_threshold` |  IntegerParameterRange  |  [1-5]  | 
| `fixed_rate_lr_half_life` |  IntegerParameterRange  |  [10-30]  | 
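
SageMaker automatic model tuning searches these ranges for you (using random or Bayesian search). Purely to illustrate how the declared ranges constrain each trial, here is a minimal random-sampling sketch over a subset of the parameters above; it is not the service's actual search strategy:

```python
import random

# Recommended ranges from the table above (illustrative subset)
search_space = {
    "num_layers_encoder": ("integer", 1, 10),
    "learning_rate": ("continuous", 0.00005, 0.2),
    "batch_size": ("categorical", [16, 32, 64, 128, 256, 512, 1024, 2048]),
    "optimizer_type": ("categorical", ["adam", "sgd", "rmsprop"]),
}

def sample_configuration(space, rng):
    """Draw one hyperparameter combination from the declared ranges."""
    config = {}
    for name, (kind, *bounds) in space.items():
        if kind == "integer":
            config[name] = rng.randint(bounds[0], bounds[1])
        elif kind == "continuous":
            config[name] = rng.uniform(bounds[0], bounds[1])
        else:  # categorical: bounds[0] is the list of allowed values
            config[name] = rng.choice(bounds[0])
    return config

candidate = sample_configuration(search_space, random.Random(0))
assert 1 <= candidate["num_layers_encoder"] <= 10
assert candidate["optimizer_type"] in {"adam", "sgd", "rmsprop"}
```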

# Text Classification - TensorFlow
Text Classification - TensorFlow

The Amazon SageMaker AI Text Classification - TensorFlow algorithm is a supervised learning algorithm that supports transfer learning with many pretrained models from the [TensorFlow Hub](https://tfhub.dev/). Use transfer learning to fine-tune one of the available pretrained models on your own dataset, even if a large amount of text data is not available. The text classification algorithm takes a text string as input and outputs a probability for each of the class labels. Training datasets must be in CSV format. This page includes information about Amazon EC2 instance recommendations and sample notebooks for Text Classification - TensorFlow.

**Topics**
+ [How to use the SageMaker AI Text Classification - TensorFlow algorithm](text-classification-tensorflow-how-to-use.md)
+ [Input and output interface for the Text Classification - TensorFlow algorithm](text-classification-tensorflow-inputoutput.md)
+ [Amazon EC2 instance recommendation for the Text Classification - TensorFlow algorithm](#text-classification-tensorflow-instances)
+ [Text Classification - TensorFlow sample notebooks](#text-classification-tensorflow-sample-notebooks)
+ [How Text Classification - TensorFlow Works](text-classification-tensorflow-HowItWorks.md)
+ [TensorFlow Hub Models](text-classification-tensorflow-Models.md)
+ [Text Classification - TensorFlow Hyperparameters](text-classification-tensorflow-Hyperparameter.md)
+ [Tune a Text Classification - TensorFlow model](text-classification-tensorflow-tuning.md)

# How to use the SageMaker AI Text Classification - TensorFlow algorithm
How to use Text Classification - TensorFlow

You can use Text Classification - TensorFlow as an Amazon SageMaker AI built-in algorithm. The following section describes how to use Text Classification - TensorFlow with the SageMaker AI Python SDK. For information on how to use Text Classification - TensorFlow from the Amazon SageMaker Studio Classic UI, see [SageMaker JumpStart pretrained models](studio-jumpstart.md).

The Text Classification - TensorFlow algorithm supports transfer learning using any of the compatible pretrained TensorFlow models. For a list of all available pretrained models, see [TensorFlow Hub Models](text-classification-tensorflow-Models.md). Every pretrained model has a unique `model_id`. The following example uses BERT Base Uncased (`model_id`: `tensorflow-tc-bert-en-uncased-L-12-H-768-A-12-2`) to fine-tune on a custom dataset. The pretrained models are all pre-downloaded from the TensorFlow Hub and stored in Amazon S3 buckets so that training jobs can run in network isolation. Use these pre-generated model training artifacts to construct a SageMaker AI Estimator.

First, retrieve the Docker image URI, training script URI, and pretrained model URI. Then, change the hyperparameters as you see fit. You can see a Python dictionary of all available hyperparameters and their default values with `hyperparameters.retrieve_default`. For more information, see [Text Classification - TensorFlow Hyperparameters](text-classification-tensorflow-Hyperparameter.md). Use these values to construct a SageMaker AI Estimator.

**Note**  
Default hyperparameter values are different for different models. For example, for larger models, the default batch size is smaller. 

This example uses the [SST2](https://www.tensorflow.org/datasets/catalog/glue#gluesst2) dataset, which contains positive and negative movie reviews. We pre-downloaded the dataset and made it available in Amazon S3. To fine-tune your model, call `.fit` using the Amazon S3 location of your training dataset. Any S3 bucket used in a notebook must be in the same AWS Region as the notebook instance that accesses it.

```
import sagemaker
from sagemaker import image_uris, model_uris, script_uris, hyperparameters
from sagemaker.estimator import Estimator

# Set up the SageMaker session, execution role, and Region
sess = sagemaker.Session()
aws_role = sagemaker.get_execution_role()
aws_region = sess.boto_region_name

model_id, model_version = "tensorflow-tc-bert-en-uncased-L-12-H-768-A-12-2", "*"
training_instance_type = "ml.p3.2xlarge"

# Retrieve the Docker image
train_image_uri = image_uris.retrieve(
    model_id=model_id,
    model_version=model_version,
    image_scope="training",
    instance_type=training_instance_type,
    region=None,
    framework=None,
)

# Retrieve the training script
train_source_uri = script_uris.retrieve(model_id=model_id, model_version=model_version, script_scope="training")

# Retrieve the pretrained model tarball for transfer learning
train_model_uri = model_uris.retrieve(model_id=model_id, model_version=model_version, model_scope="training")

# Retrieve the default hyperparameters for fine-tuning the model
hyperparameters = hyperparameters.retrieve_default(model_id=model_id, model_version=model_version)

# [Optional] Override default hyperparameters with custom values
hyperparameters["epochs"] = "5"

# Sample training data is available in this bucket
training_data_bucket = f"jumpstart-cache-prod-{aws_region}"
training_data_prefix = "training-datasets/SST2/"

training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}"

output_bucket = sess.default_bucket()
output_prefix = "jumpstart-example-tc-training"
s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"

# Create an Estimator instance
tf_tc_estimator = Estimator(
    role=aws_role,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",
    instance_count=1,
    instance_type=training_instance_type,
    max_run=360000,
    hyperparameters=hyperparameters,
    output_path=s3_output_location,
)

# Launch a training job
tf_tc_estimator.fit({"training": training_dataset_s3_path}, logs=True)
```

For more information about how to use the SageMaker Text Classification - TensorFlow algorithm for transfer learning on a custom dataset, see the [Introduction to JumpStart - Text Classification](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/jumpstart_text_classification/Amazon_JumpStart_Text_Classification.ipynb) notebook.

# Input and output interface for the Text Classification - TensorFlow algorithm


Each of the pretrained models listed in TensorFlow Hub Models can be fine-tuned to any dataset made up of text sentences with any number of classes. The pretrained model attaches a classification layer to the text embedding model and initializes the layer parameters with random values. The output dimension of the classification layer is determined by the number of classes detected in the input data. 

Be mindful of how to format your training data for input to the Text Classification - TensorFlow model.
+ **Training data input format:** A directory containing a `data.csv` file. Each row of the first column should contain an integer class label from `0` to one less than the number of classes. Each row of the second column should contain the corresponding text.

The following is an example of an input CSV file. The file must not have a header row. The file should be hosted in an Amazon S3 bucket with a path similar to the following: `s3://bucket_name/input_directory/`. Note that the trailing `/` is required.

```
0,hide new secretions from the parental units
0,"contains no wit , only labored gags"
1,that loves its characters and communicates something rather beautiful about human nature
...
```
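
For illustration, a headerless two-column file in this layout can be produced with Python's standard `csv` module, which also quotes any text field that itself contains a comma (the rows below are from the example above; nothing here is SageMaker-specific):

```python
import csv
import io

rows = [
    (0, "hide new secretions from the parental units"),
    (0, "contains no wit , only labored gags"),
    (1, "that loves its characters and communicates something rather beautiful about human nature"),
]

# Write to an in-memory buffer; in practice this would be data.csv
buffer = io.StringIO()
writer = csv.writer(buffer)
for label, text in rows:
    writer.writerow([label, text])  # integer label first, text second, no header

# Sanity check: labels must be integers from 0 to (number of classes - 1)
num_classes = 1 + max(label for label, _ in rows)
assert all(0 <= label < num_classes for label, _ in rows)
```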

## Incremental training


You can seed the training of a new model with artifacts from a model that you trained previously with SageMaker AI. Incremental training saves training time when you want to train a new model with the same or similar data.

**Note**  
You can only seed a SageMaker AI Text Classification - TensorFlow model with another Text Classification - TensorFlow model trained in SageMaker AI. 

You can use any dataset for incremental training, as long as the set of classes remains the same. The incremental training step is similar to the fine-tuning step, but instead of starting with a pretrained model, you start with an existing fine-tuned model. 

For more information on using incremental training with the SageMaker AI Text Classification - TensorFlow algorithm, see the [Introduction to JumpStart - Text Classification](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/jumpstart_text_classification/Amazon_JumpStart_Text_Classification.ipynb) sample notebook.

## Inference with the Text Classification - TensorFlow algorithm


You can host the fine-tuned model that results from your TensorFlow Text Classification training for inference. Any raw text formats for inference must be content type `application/x-text`.

Running inference results in probability values, class labels for all classes, and the predicted label corresponding to the class index with the highest probability encoded in JSON format. The Text Classification - TensorFlow model processes a single string per request and outputs only one line. The following is an example of a JSON format response:

```
accept: application/json;verbose

{"probabilities": [prob_0, prob_1, prob_2, ...],
"labels": [label_0, label_1, label_2, ...],
"predicted_label": predicted_label}
```

If `accept` is set to `application/json`, then the model only outputs probabilities. 
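
For example, a response body in the verbose format can be parsed with the standard `json` module. The field names follow the format above; the probability values are illustrative:

```python
import json

# Illustrative application/json;verbose response body
body = '{"probabilities": [0.07, 0.93], "labels": [0, 1], "predicted_label": 1}'

response = json.loads(body)
probabilities = response["probabilities"]

# The predicted label corresponds to the class with the highest probability
best_index = probabilities.index(max(probabilities))
assert response["labels"][best_index] == response["predicted_label"]
print(response["predicted_label"])  # → 1
```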

## Amazon EC2 instance recommendation for the Text Classification - TensorFlow algorithm


The Text Classification - TensorFlow algorithm supports all CPU and GPU instances for training, including:
+ `ml.p2.xlarge`
+ `ml.p2.16xlarge`
+ `ml.p3.2xlarge`
+ `ml.p3.16xlarge`
+ `ml.g4dn.xlarge`
+ `ml.g4dn.16xlarge`
+ `ml.g5.xlarge`
+ `ml.g5.48xlarge`

We recommend GPU instances with more memory for training with large batch sizes. Both CPU (such as M5) and GPU (P2, P3, G4dn, or G5) instances can be used for inference. For a comprehensive list of SageMaker training and inference instances across AWS Regions, see [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/).

## Text Classification - TensorFlow sample notebooks
Sample notebooks

For more information about how to use the SageMaker AI Text Classification - TensorFlow algorithm for transfer learning on a custom dataset, see the [Introduction to JumpStart - Text Classification](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/jumpstart_text_classification/Amazon_JumpStart_Text_Classification.ipynb) notebook.

For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, select the **SageMaker AI Examples** tab to see a list of all the SageMaker AI samples. To open a notebook, choose its **Use** tab and choose **Create copy**.

# How Text Classification - TensorFlow Works
How It Works

The Text Classification - TensorFlow algorithm takes text as input and classifies it into one of the output class labels. Deep learning networks such as [BERT](https://arxiv.org/pdf/1810.04805.pdf) are highly accurate for text classification. There are also deep learning networks that are trained on large text datasets, such as TextNet, which has more than 11 million texts with about 11,000 categories. After a network is trained with TextNet data, you can then fine-tune the network on a dataset with a particular focus to perform more specific text classification tasks. The Amazon SageMaker AI Text Classification - TensorFlow algorithm supports transfer learning on many pretrained models that are available in the TensorFlow Hub.

According to the number of class labels in your training data, a text classification layer is attached to the pretrained TensorFlow model of your choice. The classification layer consists of a dropout layer and a dense, fully connected layer with L2 regularization, and is initialized with random weights. You can change the hyperparameter values for the dropout rate of the dropout layer and the L2 regularization factor of the dense layer.

You can fine-tune either the entire network (including the pretrained model) or only the top classification layer on new training data. With this method of transfer learning, training with smaller datasets is possible.

# TensorFlow Hub Models
TensorFlow Models

The following pretrained models are available to use for transfer learning with the Text Classification - TensorFlow algorithm. 

The following models vary significantly in size, number of model parameters, training time, and inference latency for any given dataset. The best model for your use case depends on the complexity of your fine-tuning dataset and any requirements that you have on training time, inference latency, or model accuracy.


| Model Name | `model_id` | Source | 
| --- | --- | --- | 
|  BERT Base Uncased  | `tensorflow-tc-bert-en-uncased-L-12-H-768-A-12-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3) | 
|  BERT Base Cased  | `tensorflow-tc-bert-en-cased-L-12-H-768-A-12-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/bert_en_cased_L-12_H-768_A-12/3) | 
|  BERT Base Multilingual Cased  | `tensorflow-tc-bert-multi-cased-L-12-H-768-A-12-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/3) | 
|  Small BERT L-2\_H-128\_A-2  | `tensorflow-tc-small-bert-bert-en-uncased-L-2-H-128-A-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-128_A-2/1) | 
|  Small BERT L-2\_H-256\_A-4 | `tensorflow-tc-small-bert-bert-en-uncased-L-2-H-256-A-4` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-256_A-4/1) | 
|  Small BERT L-2\_H-512\_A-8  | `tensorflow-tc-small-bert-bert-en-uncased-L-2-H-512-A-8` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-512_A-8/1) | 
|  Small BERT L-2\_H-768\_A-12  | `tensorflow-tc-small-bert-bert-en-uncased-L-2-H-768-A-12` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-768_A-12/1) | 
|  Small BERT L-4\_H-128\_A-2  | `tensorflow-tc-small-bert-bert-en-uncased-L-4-H-128-A-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-128_A-2/1) | 
|  Small BERT L-4\_H-256\_A-4  | `tensorflow-tc-small-bert-bert-en-uncased-L-4-H-256-A-4` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-256_A-4/1) | 
|  Small BERT L-4\_H-512\_A-8  | `tensorflow-tc-small-bert-bert-en-uncased-L-4-H-512-A-8` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1) | 
|  Small BERT L-4\_H-768\_A-12  | `tensorflow-tc-small-bert-bert-en-uncased-L-4-H-768-A-12` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-768_A-12/1) | 
|  Small BERT L-6\_H-128\_A-2  | `tensorflow-tc-small-bert-bert-en-uncased-L-6-H-128-A-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-128_A-2/1) | 
|  Small BERT L-6\_H-256\_A-4  | `tensorflow-tc-small-bert-bert-en-uncased-L-6-H-256-A-4` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-256_A-4/1) | 
|  Small BERT L-6\_H-512\_A-8  | `tensorflow-tc-small-bert-bert-en-uncased-L-6-H-512-A-8` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-512_A-8/1) | 
|  Small BERT L-6\_H-768\_A-12  | `tensorflow-tc-small-bert-bert-en-uncased-L-6-H-768-A-12` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-768_A-12/1) | 
|  Small BERT L-8\_H-128\_A-2  | `tensorflow-tc-small-bert-bert-en-uncased-L-8-H-128-A-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-128_A-2/1) | 
|  Small BERT L-8\_H-256\_A-4  | `tensorflow-tc-small-bert-bert-en-uncased-L-8-H-256-A-4` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-256_A-4/1) | 
|  Small BERT L-8\_H-512\_A-8  | `tensorflow-tc-small-bert-bert-en-uncased-L-8-H-512-A-8` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-512_A-8/1) | 
|  Small BERT L-8\_H-768\_A-12  | `tensorflow-tc-small-bert-bert-en-uncased-L-8-H-768-A-12` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-768_A-12/1) | 
|  Small BERT L-10\_H-128\_A-2  | `tensorflow-tc-small-bert-bert-en-uncased-L-10-H-128-A-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-128_A-2/1) | 
|  Small BERT L-10\_H-256\_A-4  | `tensorflow-tc-small-bert-bert-en-uncased-L-10-H-256-A-4` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-256_A-4/1) | 
|  Small BERT L-10\_H-512\_A-8  | `tensorflow-tc-small-bert-bert-en-uncased-L-10-H-512-A-8` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-512_A-8/1) | 
|  Small BERT L-10\_H-768\_A-12  | `tensorflow-tc-small-bert-bert-en-uncased-L-10-H-768-A-12` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-768_A-12/1) | 
|  Small BERT L-12\_H-128\_A-2  | `tensorflow-tc-small-bert-bert-en-uncased-L-12-H-128-A-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-128_A-2/1) | 
|  Small BERT L-12\_H-256\_A-4  | `tensorflow-tc-small-bert-bert-en-uncased-L-12-H-256-A-4` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-256_A-4/1) | 
|  Small BERT L-12\_H-512\_A-8  | `tensorflow-tc-small-bert-bert-en-uncased-L-12-H-512-A-8` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-512_A-8/1) | 
|  Small BERT L-12\_H-768\_A-12  | `tensorflow-tc-small-bert-bert-en-uncased-L-12-H-768-A-12` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-768_A-12/1) | 
|  BERT Large Uncased  | `tensorflow-tc-bert-en-uncased-L-24-H-1024-A-16-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/3) | 
|  BERT Large Cased  | `tensorflow-tc-bert-en-cased-L-24-H-1024-A-16-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/bert_en_cased_L-24_H-1024_A-16/3) | 
|  BERT Large Uncased Whole Word Masking  | `tensorflow-tc-bert-en-wwm-uncased-L-24-H-1024-A-16-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/bert_en_wwm_uncased_L-24_H-1024_A-16/3) | 
|  BERT Large Cased Whole Word Masking  | `tensorflow-tc-bert-en-wwm-cased-L-24-H-1024-A-16-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/bert_en_wwm_cased_L-24_H-1024_A-16/3) | 
|  ALBERT Base  | `tensorflow-tc-albert-en-base` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/albert_en_base/2) | 
|  ELECTRA Small++  | `tensorflow-tc-electra-small-1` | [TensorFlow Hub link](https://tfhub.dev/google/electra_small/2) | 
|  ELECTRA Base  | `tensorflow-tc-electra-base-1` | [TensorFlow Hub link](https://tfhub.dev/google/electra_base/2) | 
|  BERT Base Wikipedia and BooksCorpus  | `tensorflow-tc-experts-bert-wiki-books-1` | [TensorFlow Hub link](https://tfhub.dev/google/experts/bert/wiki_books/2) | 
|  BERT Base MEDLINE/PubMed  | `tensorflow-tc-experts-bert-pubmed-1` | [TensorFlow Hub link](https://tfhub.dev/google/experts/bert/pubmed/2) | 
|  Talking Heads Base  | `tensorflow-tc-talking-heads-base` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/talkheads_ggelu_bert_en_base/1) | 
|  Talking Heads Large  | `tensorflow-tc-talking-heads-large` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/talkheads_ggelu_bert_en_large/1) | 

# Text Classification - TensorFlow Hyperparameters
Hyperparameters

Hyperparameters are parameters that are set before a machine learning model begins learning. The following hyperparameters are supported by the Amazon SageMaker AI built-in Text Classification - TensorFlow algorithm. See [Tune a Text Classification - TensorFlow model](text-classification-tensorflow-tuning.md) for information on hyperparameter tuning. 


| Parameter Name | Description | 
| --- | --- | 
| `batch_size` |  The batch size for training. For training on instances with multiple GPUs, this batch size is used across the GPUs.  Valid values: positive integer. Default value: `32`.  | 
| `beta_1` |  The beta1 for the `"adam"` and `"adamw"` optimizers. Represents the exponential decay rate for the first moment estimates. Ignored for other optimizers. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.9`.  | 
| `beta_2` |  The beta2 for the `"adam"` and `"adamw"` optimizers. Represents the exponential decay rate for the second moment estimates. Ignored for other optimizers. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.999`.  | 
| `dropout_rate` |  The dropout rate for the dropout layer in the top classification layer. Used only when `reinitialize_top_layer` is set to `"True"`. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.2`.  | 
| `early_stopping` |  Set to `"True"` to use early stopping logic during training. If `"False"`, early stopping is not used. Valid values: string, either: (`"True"` or `"False"`). Default value: `"False"`.  | 
| `early_stopping_min_delta` |  The minimum change needed to qualify as an improvement. An absolute change less than the value of `early_stopping_min_delta` does not qualify as improvement. Used only when `early_stopping` is set to `"True"`. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.0`.  | 
| `early_stopping_patience` |  The number of epochs to continue training with no improvement. Used only when `early_stopping` is set to `"True"`. Valid values: positive integer. Default value: `5`.  | 
| `epochs` |  The number of training epochs. Valid values: positive integer. Default value: `10`.  | 
| `epsilon` |  The epsilon for `"adam"`, `"rmsprop"`, `"adadelta"`, and `"adagrad"` optimizers. Usually set to a small value to avoid division by 0. Ignored for other optimizers. Valid values: float, range: [`0.0`, `1.0`]. Default value: `1e-7`.  | 
| `initial_accumulator_value` |  The starting value for the accumulators, or the per-parameter momentum values, for the `"adagrad"` optimizer. Ignored for other optimizers. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.0001`.  | 
| `learning_rate` |  The optimizer learning rate. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.001`.  | 
| `momentum` |  The momentum for the `"sgd"` and `"nesterov"` optimizers. Ignored for other optimizers. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.9`.  | 
| `optimizer` |  The optimizer type. For more information, see [Optimizers](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers) in the TensorFlow documentation. Valid values: string, any of the following: (`"adamw"`, `"adam"`, `"sgd"`, `"nesterov"`, `"rmsprop"`, `"adagrad"`, `"adadelta"`). Default value: `"adam"`.  | 
| `regularizers_l2` |  The L2 regularization factor for the dense layer in the classification layer. Used only when `reinitialize_top_layer` is set to `"True"`. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.0001`.  | 
| `reinitialize_top_layer` |  If set to `"Auto"`, the top classification layer parameters are re-initialized during fine-tuning. For incremental training, top classification layer parameters are not re-initialized unless set to `"True"`. Valid values: string, any of the following: (`"Auto"`, `"True"` or `"False"`). Default value: `"Auto"`.  | 
| `rho` |  The discounting factor for the gradient of the `"adadelta"` and `"rmsprop"` optimizers. Ignored for other optimizers.  Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.95`.  | 
| `train_only_on_top_layer` |  If `"True"`, only the top classification layer parameters are fine-tuned. If `"False"`, all model parameters are fine-tuned. Valid values: string, either: (`"True"` or `"False"`). Default value: `"False"`.  | 
| `validation_split_ratio` |  The fraction of training data to randomly split to create validation data. Only used if validation data is not provided through the `validation` channel. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.2`.  | 
| `warmup_steps_fraction` |  The fraction of the total number of gradient update steps, where the learning rate increases from 0 to the initial learning rate as a warm up. Only used with the `adamw` optimizer. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.1`.  | 
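
To make the interaction between `early_stopping_min_delta` and `early_stopping_patience` concrete, the following is an illustrative re-implementation of generic early-stopping logic, not the algorithm's actual code:

```python
def should_stop(val_accuracies, min_delta=0.0, patience=5):
    """Return True once no epoch has improved the best metric by more
    than min_delta for `patience` consecutive epochs."""
    best = float("-inf")
    epochs_without_improvement = 0
    for acc in val_accuracies:
        if acc > best + min_delta:  # counts as an improvement
            best = acc
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            return True
    return False

# Accuracy plateaus after the third epoch; with patience=2, training stops.
print(should_stop([0.70, 0.75, 0.80, 0.80, 0.80], patience=2))  # → True
```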

# Tune a Text Classification - TensorFlow model
Model Tuning

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the objective metric.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics computed by the Text Classification - TensorFlow algorithm
Metrics

Refer to the following chart to find which metrics are computed by the Text Classification - TensorFlow algorithm.


| Metric Name | Description | Optimization Direction | Regex Pattern | 
| --- | --- | --- | --- | 
| validation:accuracy | The ratio of the number of correct predictions to the total number of predictions made. | Maximize | `val_accuracy=([0-9\\.]+)` | 
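
The regex pattern in the table is applied to the algorithm's training log to capture the metric value. A quick sketch against an illustrative log line:

```python
import re

pattern = r"val_accuracy=([0-9\.]+)"  # regex from the table above
log_line = "Epoch 3/10 - loss=0.41 - val_accuracy=0.8732"  # illustrative

match = re.search(pattern, log_line)
assert match is not None
print(float(match.group(1)))  # → 0.8732
```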

## Tunable Text Classification - TensorFlow hyperparameters
Tunable Hyperparameters

Tune a text classification model with the following hyperparameters. The hyperparameters that have the greatest impact on text classification objective metrics are: `batch_size`, `learning_rate`, and `optimizer`. Tune the optimizer-related hyperparameters, such as `momentum`, `regularizers_l2`, `beta_1`, `beta_2`, and `eps` based on the selected `optimizer`. For example, use `beta_1` and `beta_2` only when `adamw` or `adam` is the `optimizer`.

For more information about which hyperparameters are used for each `optimizer`, see [Text Classification - TensorFlow Hyperparameters](text-classification-tensorflow-Hyperparameter.md).


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| `batch_size` | IntegerParameterRanges | MinValue: 4, MaxValue: 128 | 
| `beta_1` | ContinuousParameterRanges | MinValue: 1e-6, MaxValue: 0.999 | 
| `beta_2` | ContinuousParameterRanges | MinValue: 1e-6, MaxValue: 0.999 | 
| `eps` | ContinuousParameterRanges | MinValue: 1e-8, MaxValue: 1.0 | 
| `learning_rate` | ContinuousParameterRanges | MinValue: 1e-6, MaxValue: 0.5 | 
| `momentum` | ContinuousParameterRanges | MinValue: 0.0, MaxValue: 0.999 | 
| `optimizer` | CategoricalParameterRanges | ['adamw', 'adam', 'sgd', 'rmsprop', 'nesterov', 'adagrad', 'adadelta'] | 
| `regularizers_l2` | ContinuousParameterRanges | MinValue: 0.0, MaxValue: 0.999 | 
| `train_only_on_top_layer` | CategoricalParameterRanges | ['True', 'False'] | 

# Built-in SageMaker AI Algorithms for Time-Series Data
Time-Series

SageMaker AI provides algorithms that are tailored to the analysis of time-series data for forecasting product demand, server loads, webpage requests, and more.
+ [Use the SageMaker AI DeepAR forecasting algorithm](deepar.md)—a supervised learning algorithm for forecasting scalar (one-dimensional) time series using recurrent neural networks (RNN).


| Algorithm name | Channel name | Training input mode | File type | Instance class | Parallelizable | 
| --- | --- | --- | --- | --- | --- | 
| DeepAR Forecasting | train and (optionally) test | File | JSON Lines or Parquet | GPU or CPU | Yes | 

# Use the SageMaker AI DeepAR forecasting algorithm
DeepAR Forecasting

The Amazon SageMaker AI DeepAR forecasting algorithm is a supervised learning algorithm for forecasting scalar (one-dimensional) time series using recurrent neural networks (RNN). Classical forecasting methods, such as autoregressive integrated moving average (ARIMA) or exponential smoothing (ETS), fit a single model to each individual time series. They then use that model to extrapolate the time series into the future. 

In many applications, however, you have many similar time series across a set of cross-sectional units. For example, you might have time series groupings for demand for different products, server loads, and requests for webpages. For this type of application, you can benefit from training a single model jointly over all of the time series. DeepAR takes this approach. When your dataset contains hundreds of related time series, DeepAR outperforms the standard ARIMA and ETS methods. You can also use the trained model to generate forecasts for new time series that are similar to the ones it has been trained on.

The training input for the DeepAR algorithm is one or, preferably, more `target` time series that have been generated by the same process or similar processes. Based on this input dataset, the algorithm trains a model that learns an approximation of the process or processes that generated it and uses this model to predict how the target time series evolves. Each target time series can optionally be associated with a vector of static (time-independent) categorical features provided by the `cat` field and a vector of dynamic (time-dependent) time series provided by the `dynamic_feat` field. SageMaker AI trains the DeepAR model by randomly sampling training examples from each target time series in the training dataset. Each training example consists of a pair of adjacent context and prediction windows with fixed predefined lengths. To control how far in the past the network can see, use the `context_length` hyperparameter. To control how far in the future predictions can be made, use the `prediction_length` hyperparameter. For more information, see [How the DeepAR Algorithm Works](deepar_how-it-works.md).
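
The context/prediction windowing can be pictured with a toy sketch. This is illustrative only, not DeepAR's actual sampler: each training example pairs a context window with the prediction window that immediately follows it.

```python
import random

def sample_training_example(target, context_length, prediction_length, rng=random):
    """Pick a random cut point in a target series and return the adjacent
    (context, prediction) windows of the requested fixed lengths."""
    total = context_length + prediction_length
    start = rng.randrange(0, len(target) - total + 1)
    context = target[start : start + context_length]
    prediction = target[start + context_length : start + total]
    return context, prediction

series = list(range(20))  # a toy target time series
ctx, pred = sample_training_example(series, context_length=5, prediction_length=3)
assert len(ctx) == 5 and len(pred) == 3
assert pred[0] == ctx[-1] + 1  # the two windows are adjacent
```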

**Topics**
+ [Input/Output Interface for the DeepAR Algorithm](#deepar-inputoutput)
+ [Best Practices for Using the DeepAR Algorithm](#deepar_best_practices)
+ [EC2 Instance Recommendations for the DeepAR Algorithm](#deepar-instances)
+ [DeepAR Sample Notebooks](#deepar-sample-notebooks)
+ [How the DeepAR Algorithm Works](deepar_how-it-works.md)
+ [DeepAR Hyperparameters](deepar_hyperparameters.md)
+ [Tune a DeepAR Model](deepar-tuning.md)
+ [DeepAR Inference Formats](deepar-in-formats.md)

## Input/Output Interface for the DeepAR Algorithm


DeepAR supports two data channels. The required `train` channel describes the training dataset. The optional `test` channel describes a dataset that the algorithm uses to evaluate model accuracy after training. You can provide training and test datasets in [JSON Lines](http://jsonlines.org/) format. Files can be in gzip or [Parquet](https://parquet.apache.org/) file format.

When specifying the paths for the training and test data, you can specify a single file or a directory that contains multiple files, which can be stored in subdirectories. If you specify a directory, DeepAR uses all files in the directory as inputs for the corresponding channel, except those that start with a period (.) and those named *\_SUCCESS*. This ensures that you can directly use output folders produced by Spark jobs as input channels for your DeepAR training jobs.

By default, the DeepAR model determines the input format from the file extension (`.json`, `.json.gz`, or `.parquet`) in the specified input path. If the path does not end in one of these extensions, you must explicitly specify the format in the SDK for Python. Use the `content_type` parameter of the [s3\_input](https://sagemaker.readthedocs.io/en/stable/session.html#sagemaker.session.s3_input) class.

The records in your input files should contain the following fields:
+ `start`—A string with the format `YYYY-MM-DD HH:MM:SS`. The start timestamp can't contain time zone information.
+ `target`—An array of floating-point values or integers that represent the time series. You can encode missing values as `null` literals, or as `"NaN"` strings in JSON, or as `nan` floating-point values in Parquet.
+ `dynamic_feat` (optional)—An array of arrays of floating-point values or integers that represents the vector of custom feature time series (dynamic features). If you set this field, all records must have the same number of inner arrays (the same number of feature time series). In addition, each inner array must be the same length as the associated `target` value plus `prediction_length`. Missing values are not supported in the features. For example, if the target time series represents the demand of different products, an associated `dynamic_feat` might be a Boolean time series that indicates whether a promotion was applied (1) to the particular product or not (0): 

  ```
  {"start": ..., "target": [1, 5, 10, 2], "dynamic_feat": [[0, 1, 1, 0]]}
  ```
+ `cat` (optional)—An array of categorical features that can be used to encode the groups that the record belongs to. Categorical features must be encoded as a 0-based sequence of integers. For example, the categorical domain \{R, G, B\} can be encoded as \{0, 1, 2\}. All values from each categorical domain must be represented in the training dataset, because the DeepAR algorithm can forecast only for categories that have been observed during training. Each categorical feature is embedded in a low-dimensional space whose dimensionality is controlled by the `embedding_dimension` hyperparameter. For more information, see [DeepAR Hyperparameters](deepar_hyperparameters.md).

If you use a JSON file, it must be in [JSON Lines](http://jsonlines.org/) format. For example:

```
{"start": "2009-11-01 00:00:00", "target": [4.3, "NaN", 5.1, ...], "cat": [0, 1], "dynamic_feat": [[1.1, 1.2, 0.5, ...]]}
{"start": "2012-01-30 00:00:00", "target": [1.0, -5.0, ...], "cat": [2, 3], "dynamic_feat": [[1.1, 2.05, ...]]}
{"start": "1999-01-30 00:00:00", "target": [2.0, 1.0], "cat": [1, 4], "dynamic_feat": [[1.3, 0.4]]}
```

In this example, each time series has two associated categorical features and one custom feature time series.
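For illustration, records in this format can be produced with a few lines of Python (the helper name and output path are hypothetical; the field names follow the format described above):

```python
import gzip
import json

def write_jsonlines_gz(records, path):
    """Write DeepAR training records as gzip-compressed JSON Lines."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

# Example records mirroring the format shown above
records = [
    {"start": "2009-11-01 00:00:00", "target": [4.3, "NaN", 5.1], "cat": [0, 1],
     "dynamic_feat": [[1.1, 1.2, 0.5]]},
    {"start": "1999-01-30 00:00:00", "target": [2.0, 1.0], "cat": [1, 4],
     "dynamic_feat": [[1.3, 0.4]]},
]
write_jsonlines_gz(records, "train.json.gz")
```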

For Parquet, you use the same three fields as columns. In addition, `"start"` can be the `datetime` type. You can compress Parquet files using gzip (`gzip`) or the Snappy compression library (`snappy`).

If the algorithm is trained without the `cat` and `dynamic_feat` fields, it learns a "global" model, that is, a model that is agnostic to the specific identity of the target time series at inference time and is conditioned only on its shape.

If the model is conditioned on the `cat` and `dynamic_feat` feature data provided for each time series, the prediction is influenced by the character of the time series with the corresponding `cat` features. For example, if the `target` time series represents the demand of clothing items, you can associate a two-dimensional `cat` vector that encodes the type of item (for example, 0 = shoes, 1 = dress) in the first component and the color of the item (for example, 0 = red, 1 = blue) in the second component. A sample input looks as follows:

```
{ "start": ..., "target": ..., "cat": [0, 0], ... } # red shoes
{ "start": ..., "target": ..., "cat": [1, 1], ... } # blue dress
```

At inference time, you can request predictions for targets with `cat` values that are combinations of the `cat` values observed in the training data, for example:

```
{ "start": ..., "target": ..., "cat": [0, 1], ... } # blue shoes
{ "start": ..., "target": ..., "cat": [1, 0], ... } # red dress
```

The following guidelines apply to training data:
+ The start time and length of the time series can differ. For example, in marketing, products often enter a retail catalog at different dates, so their start dates naturally differ. But all series must have the same frequency, number of categorical features, and number of dynamic features. 
+ Shuffle the training file with respect to the position of the time series in the file. In other words, the time series should occur in random order in the file.
+ Make sure to set the `start` field correctly. The algorithm uses the `start` timestamp to derive the internal features. 
+ If you use categorical features (`cat`), all time series must have the same number of categorical features. If the dataset contains the `cat` field, the algorithm uses it and extracts the cardinality of the groups from the dataset. By default, `cardinality` is `"auto"`. If the dataset contains the `cat` field, but you don't want to use it, you can disable it by setting `cardinality` to `""`. If a model was trained using a `cat` feature, you must include it for inference.
+ If your dataset contains the `dynamic_feat` field, the algorithm uses it automatically. All time series must have the same number of feature time series. The time points in each of the feature time series correspond one-to-one to the time points in the target. In addition, the entry in the `dynamic_feat` field should have the same length as the `target`. If the dataset contains the `dynamic_feat` field, but you don't want to use it, disable it by setting `num_dynamic_feat` to `""`. If the model was trained with the `dynamic_feat` field, you must provide this field for inference. In addition, each of the features has to have the length of the provided target plus the `prediction_length`. In other words, you must provide the feature values for the future.
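A lightweight sanity check over these guidelines might look as follows (an illustrative helper, not part of the SageMaker tooling; it verifies consistent feature counts and shuffles the series order in the file):

```python
import random

def validate_and_shuffle(records):
    """Sanity-check DeepAR training records against the guidelines above,
    then shuffle so the series occur in random order in the file."""
    n_cat = len(records[0].get("cat", []))
    n_feat = len(records[0].get("dynamic_feat", []))
    for r in records:
        assert "start" in r and "target" in r, "start and target are required"
        # Same number of categorical features and feature time series everywhere
        assert len(r.get("cat", [])) == n_cat, "inconsistent number of cat features"
        feats = r.get("dynamic_feat", [])
        assert len(feats) == n_feat, "inconsistent number of feature time series"
        for f in feats:
            assert len(f) >= len(r["target"]), "each feature must cover the target"
    random.shuffle(records)
    return records
```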

If you specify optional test channel data, the DeepAR algorithm evaluates the trained model with different accuracy metrics. The algorithm calculates the root mean square error (RMSE) over the test data as follows:

![\[RMSE Formula: Sqrt(1/nT(Sum[i,t](y-hat(i,t)-y(i,t))^2))\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/deepar-1.png)


*yi,t* is the true value of time series *i* at time *t*, and *ŷi,t* is the mean prediction. The sum is over all *n* time series in the test set and over the last *T* time points for each time series, where *T* corresponds to the forecast horizon. You specify the length of the forecast horizon by setting the `prediction_length` hyperparameter. For more information, see [DeepAR Hyperparameters](deepar_hyperparameters.md).
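As a sketch, the RMSE computation defined above (which SageMaker AI performs internally on the test channel) could be written as:

```python
import math

def rmse(actuals, predictions):
    """Root mean square error over n time series and T forecast time points.

    actuals[i][t] is y_{i,t}; predictions[i][t] is the mean forecast for it.
    """
    total, count = 0.0, 0
    for y_series, yhat_series in zip(actuals, predictions):
        for y, yhat in zip(y_series, yhat_series):
            total += (yhat - y) ** 2
            count += 1
    return math.sqrt(total / count)
```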

In addition, the algorithm evaluates the accuracy of the forecast distribution using weighted quantile loss. For a quantile in the range [0, 1], the weighted quantile loss is defined as follows:

![\[Weighted quantile loss equation.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/deepar-2.png)


*qi,t*(τ) is the τ-quantile of the distribution that the model predicts. To specify which quantiles to calculate loss for, set the `test_quantiles` hyperparameter. In addition, the average of the prescribed quantile losses is reported as part of the training logs. For more information, see [DeepAR Hyperparameters](deepar_hyperparameters.md). 
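A sketch of this metric, assuming the standard pinball-loss form of the weighted τ-quantile loss (twice the summed quantile losses, normalized by the summed absolute target values):

```python
def weighted_quantile_loss(tau, actuals, quantile_preds):
    """Weighted tau-quantile loss: 2 * sum of pinball losses / sum |y|.

    quantile_preds[i][t] is q_{i,t}(tau), the predicted tau-quantile.
    """
    num, denom = 0.0, 0.0
    for y_series, q_series in zip(actuals, quantile_preds):
        for y, q in zip(y_series, q_series):
            # Pinball loss: under-prediction weighted by tau, over-prediction by 1 - tau
            num += tau * (y - q) if y > q else (1 - tau) * (q - y)
            denom += abs(y)
    return 2 * num / denom
```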

For inference, DeepAR accepts JSON format and the following fields:
+ `"instances"`, which includes one or more time series in JSON Lines format
+ `"configuration"`, which includes parameters for generating the forecast

For more information, see [DeepAR Inference Formats](deepar-in-formats.md).

## Best Practices for Using the DeepAR Algorithm


When preparing your time series data, follow these best practices to achieve the best results:
+ Except for when splitting your dataset for training and testing, always provide the entire time series for training, testing, and when calling the model for inference. Regardless of how you set `context_length`, don't break up the time series or provide only a part of it. The model uses data points further back than the value set in `context_length` for the lagged values feature.
+ When tuning a DeepAR model, you can split the dataset to create a training dataset and a test dataset. In a typical evaluation, you would test the model on the same time series used for training, but on the future `prediction_length` time points that follow immediately after the last time point visible during training. You can create training and test datasets that satisfy this criterion by using the entire dataset (the full length of all time series that are available) as a test set and removing the last `prediction_length` points from each time series for training. During training, the model doesn't see the target values for time points on which it is evaluated during testing. During testing, the algorithm withholds the last `prediction_length` points of each time series in the test set and generates a prediction. Then it compares the forecast with the withheld values. You can create more complex evaluations by repeating time series multiple times in the test set, but cutting them at different endpoints. With this approach, accuracy metrics are averaged over multiple forecasts from different time points. For more information, see [Tune a DeepAR Model](deepar-tuning.md).
+ Avoid using very large values (>400) for the `prediction_length` because it makes the model slow and less accurate. If you want to forecast further into the future, consider aggregating your data at a lower frequency. For example, use `5min` instead of `1min`.
+ Because lags are used, a model can look further back in the time series than the value specified for `context_length`. Therefore, you don't need to set this parameter to a large value. We recommend starting with the value that you used for `prediction_length`.
+ We recommend training a DeepAR model on as many time series as are available. Although a DeepAR model trained on a single time series might work well, standard forecasting algorithms, such as ARIMA or ETS, might provide more accurate results. The DeepAR algorithm starts to outperform the standard methods when your dataset contains hundreds of related time series. Currently, DeepAR requires that the total number of observations available across all training time series is at least 300.
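The recommended split can be sketched as follows (illustrative helper: the full series form the test set, and the training copies drop their last `prediction_length` points):

```python
def make_train_test(records, prediction_length):
    """Use the full series as the test set; training series lose their last
    prediction_length points, on which the model is then evaluated."""
    test = records
    train = []
    for r in records:
        trimmed = dict(r)  # shallow copy; only the target is truncated
        trimmed["target"] = r["target"][:-prediction_length]
        train.append(trimmed)
    return train, test
```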

## EC2 Instance Recommendations for the DeepAR Algorithm


You can train DeepAR on both GPU and CPU instances and in both single and multi-machine settings. We recommend starting with a single CPU instance (for example, ml.c4.2xlarge or ml.c4.4xlarge), and switching to GPU instances and multiple machines only when necessary. Using GPUs and multiple machines improves throughput only for larger models (with many cells per layer and many layers) and for large mini-batch sizes (for example, greater than 512).

For inference, DeepAR supports only CPU instances.

Specifying large values for `context_length`, `prediction_length`, `num_cells`, `num_layers`, or `mini_batch_size` can create models that are too large for small instances. In this case, use a larger instance type or reduce the values for these parameters. This problem also frequently occurs when running hyperparameter tuning jobs. In that case, use an instance type large enough for the model tuning job and consider limiting the upper values of the critical parameters to avoid job failures. 

## DeepAR Sample Notebooks

For a sample notebook that shows how to prepare a time series dataset for training the SageMaker AI DeepAR algorithm and how to deploy the trained model for performing inferences, see [DeepAR demo on electricity dataset](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/deepar_electricity/DeepAR-Electricity.html), which illustrates the advanced features of DeepAR on a real world dataset. For instructions on creating and accessing Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After creating and opening a notebook instance, choose the **SageMaker AI Examples** tab to see a list of all of the SageMaker AI examples. To open a notebook, choose its **Use** tab, and choose **Create copy**.

For more information about the Amazon SageMaker AI DeepAR algorithm, see the following blog posts:
+ [Now available in Amazon SageMaker AI: DeepAR algorithm for more accurate time series forecasting](https://aws.amazon.com/blogs/machine-learning/now-available-in-amazon-sagemaker-deepar-algorithm-for-more-accurate-time-series-forecasting/)
+ [Deep demand forecasting with Amazon SageMaker AI](https://aws.amazon.com/blogs/machine-learning/deep-demand-forecasting-with-amazon-sagemaker/)

# How the DeepAR Algorithm Works

During training, DeepAR accepts a training dataset and an optional test dataset. It uses the test dataset to evaluate the trained model. In general, the datasets don't have to contain the same set of time series. You can use a model trained on a given training set to generate forecasts for the future of the time series in the training set, and for other time series. Both the training and the test datasets consist of one or, preferably, more target time series. Each target time series can optionally be associated with a vector of feature time series and a vector of categorical features. For more information, see [Input/Output Interface for the DeepAR Algorithm](deepar.md#deepar-inputoutput). 

For example, the following is an element of a training set indexed by *i* which consists of a target time series, *Zi,t*, and two associated feature time series, *Xi,1,t* and *Xi,2,t*:

![\[Figure 1: Target time series and associated feature time series\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/ts-full-159.base.png)


The target time series might contain missing values, which are represented by line breaks in the time series. DeepAR supports only feature time series that are known in the future. This allows you to run "what if?" scenarios. What happens, for example, if I change the price of a product in some way? 

Each target time series can also be associated with a number of categorical features. You can use these features to encode which groupings a time series belongs to. Categorical features allow the model to learn typical behavior for groups, which it can use to increase model accuracy. DeepAR implements this by learning an embedding vector for each group that captures the common properties of all time series in the group. 

## How Feature Time Series Work in the DeepAR Algorithm


To facilitate learning time-dependent patterns, such as spikes during weekends, DeepAR automatically creates feature time series based on the frequency of the target time series. It uses these derived feature time series with the custom feature time series that you provide during training and inference. The following figure shows two of these derived time series features: *ui,1,t* represents the hour of the day and *ui,2,t* the day of the week.

![\[Figure 2: Derived time series\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/ts-full-159.derived.png)


The DeepAR algorithm automatically generates these feature time series. The following table lists the derived features for the supported basic time frequencies.


| Frequency of the Time Series | Derived Features | 
| --- | --- | 
| Minute |  `minute-of-hour`, `hour-of-day`, `day-of-week`, `day-of-month`, `day-of-year`  | 
| Hour |  `hour-of-day`, `day-of-week`, `day-of-month`, `day-of-year`  | 
| Day |  `day-of-week`, `day-of-month`, `day-of-year`  | 
| Week |  `day-of-month`, `week-of-year`  | 
| Month |  month-of-year  | 
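For intuition, a few of these derived features can be reproduced with the standard library (illustrative only; DeepAR generates them internally):

```python
from datetime import datetime

def derived_features_hourly(timestamp):
    """Derived features for an hourly series: hour-of-day, day-of-week,
    day-of-month, and day-of-year, computed from a `start`-style timestamp."""
    t = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return {
        "hour-of-day": t.hour,
        "day-of-week": t.weekday(),          # Monday = 0 ... Sunday = 6
        "day-of-month": t.day,
        "day-of-year": t.timetuple().tm_yday,
    }
```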

DeepAR trains a model by randomly sampling several training examples from each of the time series in the training dataset. Each training example consists of a pair of adjacent context and prediction windows with fixed predefined lengths. The `context_length` hyperparameter controls how far in the past the network can see, and the `prediction_length` hyperparameter controls how far in the future predictions can be made. During training, the algorithm ignores training set elements containing time series that are shorter than a specified prediction length. The following figure represents five samples with context lengths of 12 hours and prediction lengths of 6 hours drawn from element *i*. For brevity, we've omitted the feature time series *xi,1,t* and *ui,2,t*.

![\[Figure 3: Sampled time series\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/ts-full-159.sampled.png)


To capture seasonality patterns, DeepAR also automatically feeds lagged values from the target time series. In the example with hourly frequency, for each time index, *t = T*, the model exposes the *zi,t* values, which occurred approximately one, two, and three days in the past.

![\[Figure 4: Lagged time series\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/ts-full-159.lags.png)
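As an illustration, for an hourly series the lags of approximately one, two, and three days correspond to offsets of 24, 48, and 72 time steps (a sketch; DeepAR selects lag values internally based on the frequency):

```python
def lagged_values(series, t, lags=(24, 48, 72)):
    """Return z_{i,t-lag} for each lag (hourly series: ~1, 2, 3 days back).
    Lags that reach before the start of the series yield None."""
    return [series[t - lag] if t - lag >= 0 else None for lag in lags]
```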


For inference, the trained model takes as input target time series, which might or might not have been used during training, and forecasts a probability distribution for the next `prediction_length` values. Because DeepAR is trained on the entire dataset, the forecast takes into account patterns learned from similar time series.

For information on the mathematics behind DeepAR, see [DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks](https://arxiv.org/abs/1704.04110). 

# DeepAR Hyperparameters

The following table lists the hyperparameters that you can set when training with the Amazon SageMaker AI DeepAR forecasting algorithm.


| Parameter Name | Description | 
| --- | --- | 
| context\_length |  The number of time-points that the model gets to see before making the prediction. The value for this parameter should be about the same as the `prediction_length`. The model also receives lagged inputs from the target, so `context_length` can be much smaller than typical seasonalities. For example, a daily time series can have yearly seasonality. The model automatically includes a lag of one year, so the context length can be shorter than a year. The lag values that the model picks depend on the frequency of the time series. For example, lag values for daily frequency are previous week, 2 weeks, 3 weeks, 4 weeks, and year. **Required** Valid values: Positive integer  | 
| epochs |  The maximum number of passes over the training data. The optimal value depends on your data size and learning rate. See also `early_stopping_patience`. Typical values range from 10 to 1000. **Required** Valid values: Positive integer  | 
| prediction\_length |  The number of time-steps that the model is trained to predict, also called the forecast horizon. The trained model always generates forecasts with this length. It can't generate longer forecasts. The `prediction_length` is fixed when a model is trained and it cannot be changed later. **Required** Valid values: Positive integer  | 
| time\_freq |  The granularity of the time series in the dataset. Use `time_freq` to select appropriate date features and lags. The model supports the following basic frequencies. It also supports multiples of these basic frequencies. For example, `5min` specifies a frequency of 5 minutes. [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/deepar_hyperparameters.html) **Required** Valid values: An integer followed by *M*, *W*, *D*, *H*, or *min*. For example, `5min`.  | 
| cardinality |  When using the categorical features (`cat`), `cardinality` is an array specifying the number of categories (groups) per categorical feature. Set this to `auto` to infer the cardinality from the data. The `auto` mode also works when no categorical features are used in the dataset. This is the recommended setting for the parameter. Set `cardinality` to `ignore` to force DeepAR to not use categorical features, even if they are present in the data. To perform additional data validation, it is possible to explicitly set this parameter to the actual value. For example, if two categorical features are provided where the first has 2 and the other has 3 possible values, set this to `[2, 3]`. For more information on how to use categorical features, see the data section on the main documentation page of DeepAR. **Optional** Valid values: `auto`, `ignore`, array of positive integers, or empty string Default value: `auto`  | 
| dropout\_rate |  The dropout rate to use during training. The model uses zoneout regularization. For each iteration, a random subset of hidden neurons is not updated. Typical values are less than 0.2. **Optional** Valid values: float Default value: 0.1  | 
| early\_stopping\_patience |  If this parameter is set, training stops when no progress is made within the specified number of `epochs`. The model that has the lowest loss is returned as the final model. **Optional** Valid values: integer  | 
| embedding\_dimension |  Size of embedding vector learned per categorical feature (same value is used for all categorical features). The DeepAR model can learn group-level time series patterns when a categorical grouping feature is provided. To do this, the model learns an embedding vector of size `embedding_dimension` for each group, capturing the common properties of all time series in the group. A larger `embedding_dimension` allows the model to capture more complex patterns. However, because increasing the `embedding_dimension` increases the number of parameters in the model, more training data is required to accurately learn these parameters. Typical values for this parameter are between 10 and 100.  **Optional** Valid values: positive integer Default value: 10  | 
| learning\_rate |  The learning rate used in training. Typical values range from 1e-4 to 1e-1. **Optional** Valid values: float Default value: 1e-3  | 
| likelihood |  The model generates a probabilistic forecast, and can provide quantiles of the distribution and return samples. Depending on your data, select an appropriate likelihood (noise model) that is used for uncertainty estimates. The following likelihoods can be selected: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/deepar_hyperparameters.html) **Optional** Valid values: One of *gaussian*, *beta*, *negative-binomial*, *student-T*, or *deterministic-L1*. Default value: `student-T`  | 
| mini\_batch\_size |  The size of mini-batches used during training. Typical values range from 32 to 512. **Optional** Valid values: positive integer Default value: 128  | 
| num\_cells |  The number of cells to use in each hidden layer of the RNN. Typical values range from 30 to 100. **Optional** Valid values: positive integer Default value: 40  | 
| num\_dynamic\_feat |  The number of `dynamic_feat` provided in the data. Set this to `auto` to infer the number of dynamic features from the data. The `auto` mode also works when no dynamic features are used in the dataset. This is the recommended setting for the parameter. To force DeepAR to not use dynamic features, even if they are present in the data, set `num_dynamic_feat` to `ignore`.  To perform additional data validation, it is possible to explicitly set this parameter to the actual integer value. For example, if two dynamic features are provided, set this to 2.  **Optional** Valid values: `auto`, `ignore`, positive integer, or empty string Default value: `auto`  | 
| num\_eval\_samples |  The number of samples that are used per time-series when calculating test accuracy metrics. This parameter does not have any influence on the training or the final model. In particular, the model can be queried with a different number of samples. This parameter only affects the reported accuracy scores on the test channel after training. Smaller values result in faster evaluation, but then the evaluation scores are typically worse and more uncertain. When evaluating with higher quantiles, for example 0.95, it may be important to increase the number of evaluation samples. **Optional** Valid values: integer Default value: 100  | 
| num\_layers |  The number of hidden layers in the RNN. Typical values range from 1 to 4. **Optional** Valid values: positive integer Default value: 2  | 
| test\_quantiles |  Quantiles for which to calculate quantile loss on the test channel. **Optional** Valid values: array of floats Default value: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]  | 

# Tune a DeepAR Model

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the objective metric.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics Computed by the DeepAR Algorithm

The DeepAR algorithm reports three metrics, which are computed during training. When tuning a model, choose one of these as the objective. For the objective, use either the forecast accuracy on a provided test channel (recommended) or the training loss. For recommendations for the training/test split for the DeepAR algorithm, see [Best Practices for Using the DeepAR Algorithm](deepar.md#deepar_best_practices). 


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| test:RMSE |  The root mean square error between the forecast and the actual target computed on the test set.  |  Minimize  | 
| test:mean\_wQuantileLoss |  The average overall quantile losses computed on the test set. To control which quantiles are used, set the `test_quantiles` hyperparameter.   |  Minimize  | 
| train:final\_loss |  The training negative log-likelihood loss averaged over the last training epoch for the model.  |  Minimize  | 

## Tunable Hyperparameters for the DeepAR Algorithm

Tune a DeepAR model with the following hyperparameters. The hyperparameters that have the greatest impact, listed in order from the most to least impactful, on DeepAR objective metrics are: `epochs`, `context_length`, `mini_batch_size`, `learning_rate`, and `num_cells`.


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| epochs |  `IntegerParameterRanges`  |  MinValue: 1, MaxValue: 1000  | 
| context\_length |  `IntegerParameterRanges`  |  MinValue: 1, MaxValue: 200  | 
| mini\_batch\_size |  `IntegerParameterRanges`  |  MinValue: 32, MaxValue: 1028  | 
| learning\_rate |  `ContinuousParameterRange`  |  MinValue: 1e-5, MaxValue: 1e-1  | 
| num\_cells |  `IntegerParameterRanges`  |  MinValue: 30, MaxValue: 200  | 
| num\_layers |  `IntegerParameterRanges`  |  MinValue: 1, MaxValue: 8  | 
| dropout\_rate |  `ContinuousParameterRange`  |  MinValue: 0.00, MaxValue: 0.2  | 
| embedding\_dimension |  `IntegerParameterRanges`  |  MinValue: 1, MaxValue: 50  | 

# DeepAR Inference Formats

This page describes the request and response formats for inference with the Amazon SageMaker AI DeepAR model.

## DeepAR JSON Request Formats

Query a trained model by using the model's endpoint. The endpoint takes the following JSON request format. 

In the request, the `instances` field corresponds to the time series that should be forecast by the model. 

If the model was trained with categories, you must provide a `cat` for each instance. If the model was trained without the `cat` field, it should be omitted.

If the model was trained with a custom feature time series (`dynamic_feat`), you must provide the same number of `dynamic_feat` values for each instance. Each of them should have a length given by `length(target) + prediction_length`, where the last `prediction_length` values correspond to the time points in the future that will be predicted. If the model was trained without custom feature time series, the field should not be included in the request.

```
{
    "instances": [
        {
            "start": "2009-11-01 00:00:00",
            "target": [4.0, 10.0, "NaN", 100.0, 113.0],
            "cat": [0, 1],
            "dynamic_feat": [[1.0, 1.1, 2.1, 0.5, 3.1, 4.1, 1.2, 5.0, ...]]
        },
        {
            "start": "2012-01-30",
            "target": [1.0],
            "cat": [2, 1],
            "dynamic_feat": [[2.0, 3.1, 4.5, 1.5, 1.8, 3.2, 0.1, 3.0, ...]]
        },
        {
            "start": "1999-01-30",
            "target": [2.0, 1.0],
            "cat": [1, 3],
            "dynamic_feat": [[1.0, 0.1, -2.5, 0.3, 2.0, -1.2, -0.1, -3.0, ...]]
        }
    ],
    "configuration": {
         "num_samples": 50,
         "output_types": ["mean", "quantiles", "samples"],
         "quantiles": ["0.5", "0.9"]
    }
}
```

The `configuration` field is optional. `configuration.num_samples` sets the number of sample paths that the model generates to estimate the mean and quantiles. `configuration.output_types` describes the information that is returned in the response. Valid values are `"mean"`, `"quantiles"`, and `"samples"`. If you specify `"quantiles"`, each of the quantile values in `configuration.quantiles` is returned as a time series. If you specify `"samples"`, the model also returns the raw samples used to calculate the other outputs.
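A request body in this format could be assembled as follows (hypothetical helper; the field names match the format above):

```python
import json

def build_request(instances, num_samples=50, quantiles=("0.5", "0.9")):
    """Assemble a DeepAR inference request body as a JSON string."""
    return json.dumps({
        "instances": instances,
        "configuration": {
            "num_samples": num_samples,
            "output_types": ["mean", "quantiles"],
            "quantiles": list(quantiles),
        },
    })
```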

## DeepAR JSON Response Formats

The following is the format of a response, where `[...]` are arrays of numbers:

```
{
    "predictions": [
        {
            "quantiles": {
                "0.9": [...],
                "0.5": [...]
            },
            "samples": [...],
            "mean": [...]
        },
        {
            "quantiles": {
                "0.9": [...],
                "0.5": [...]
            },
            "samples": [...],
            "mean": [...]
        },
        {
            "quantiles": {
                "0.9": [...],
                "0.5": [...]
            },
            "samples": [...],
            "mean": [...]
        }
    ]
}
```

DeepAR has a response timeout of 60 seconds. When passing multiple time series in a single request, the forecasts are generated sequentially. Because the forecast for each time series typically takes about 300 to 1000 milliseconds or longer, depending on the model size, passing too many time series in a single request can cause timeouts. It's better to send fewer time series per request and send more requests. Because the DeepAR algorithm uses multiple workers per instance, you can achieve much higher throughput by sending multiple requests in parallel.
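To stay under the timeout, the full list of instances can be split client-side into several small payloads, which can then be sent in parallel. The following sketch builds the payloads only; the batch size of 5 and the configuration values are illustrative choices, not service requirements:

```python
import json

def build_requests(instances, max_per_request=5, num_samples=50):
    """Split a list of DeepAR instances into several small request
    payloads instead of one large request that might time out."""
    payloads = []
    for i in range(0, len(instances), max_per_request):
        payloads.append(json.dumps({
            "instances": instances[i:i + max_per_request],
            "configuration": {
                "num_samples": num_samples,
                "output_types": ["mean", "quantiles"],
                "quantiles": ["0.5", "0.9"],
            },
        }))
    return payloads

# Illustrative input: 12 short time series become 3 separate requests.
instances = [{"start": "2009-11-01 00:00:00", "target": [1.0]} for _ in range(12)]
payloads = build_requests(instances, max_per_request=5)
```

Each payload can then be sent concurrently, for example from a thread pool that calls the SageMaker runtime `InvokeEndpoint` API.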

By default, DeepAR uses one worker per CPU for inference, if there is sufficient memory per CPU. If the model is large and there isn't enough memory to run a model on each CPU, the number of workers is reduced. The number of workers used for inference can be overridden using the `MODEL_SERVER_WORKERS` environment variable (for example, by setting `MODEL_SERVER_WORKERS=1`) when calling the SageMaker AI [`CreateModel`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html) API.

## Batch Transform with the DeepAR Algorithm


DeepAR forecasting supports getting inferences by using batch transform from data using the JSON Lines format. In this format, each record is represented on a single line as a JSON object, and lines are separated by newline characters. The format is identical to the JSON Lines format used for model training. For information, see [Input/Output Interface for the DeepAR Algorithm](deepar.md#deepar-inputoutput). For example:

```
{"start": "2009-11-01 00:00:00", "target": [4.3, "NaN", 5.1, ...], "cat": [0, 1], "dynamic_feat": [[1.1, 1.2, 0.5, ..]]}
{"start": "2012-01-30 00:00:00", "target": [1.0, -5.0, ...], "cat": [2, 3], "dynamic_feat": [[1.1, 2.05, ...]]}
{"start": "1999-01-30 00:00:00", "target": [2.0, 1.0], "cat": [1, 4], "dynamic_feat": [[1.3, 0.4]]}
```
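A file in this format can be produced with any JSON serializer, writing one record per line. A minimal Python sketch (the record values are illustrative):

```python
import json

def to_jsonlines(records):
    """Serialize one record per line, as DeepAR batch transform expects.
    Missing target values are encoded as the string "NaN"."""
    return "\n".join(json.dumps(r) for r in records)

# Illustrative records in the DeepAR schema:
records = [
    {"start": "2009-11-01 00:00:00", "target": [4.3, "NaN", 5.1], "cat": [0, 1]},
    {"start": "1999-01-30 00:00:00", "target": [2.0, 1.0], "cat": [1, 4]},
]
jsonl = to_jsonlines(records)
```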

**Note**  
When creating the transformation job with [`CreateTransformJob`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html), set the `BatchStrategy` value to `SingleRecord` and set the `SplitType` value in the [`TransformInput`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_TransformInput.html) configuration to `Line`, because the default values currently cause runtime failures.

Similar to the hosted endpoint inference request format, the `cat` and the `dynamic_feat` fields for each instance are required if both of the following are true:
+ The model is trained on a dataset that contained both the `cat` and the `dynamic_feat` fields.
+ The corresponding `cardinality` and `num_dynamic_feat` values used in the training job are not set to `""`.

Unlike hosted endpoint inference, the configuration field is set once for the entire batch inference job using an environment variable named `DEEPAR_INFERENCE_CONFIG`. The value of `DEEPAR_INFERENCE_CONFIG` can be passed when the transform job is created by calling the [`CreateTransformJob`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html) API. If `DEEPAR_INFERENCE_CONFIG` is missing in the container environment, the inference container uses the following default:

```
{
    "num_samples": 100,
    "output_types": ["mean", "quantiles"],
    "quantiles": ["0.1", "0.2", "0.3", "0.4", "0.5", "0.6", "0.7", "0.8", "0.9"]
}
```

The output is also in JSON Lines format, with one line per prediction, in an order identical to the instance order in the corresponding input file. Predictions are encoded as objects identical to the ones returned by responses in online inference mode. For example:

```
{ "quantiles": { "0.1": [...], "0.2": [...] }, "samples": [...], "mean": [...] }
```

Note that in the [`TransformOutput`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_TransformOutput.html) configuration of the SageMaker AI [`CreateTransformJob`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html) request, clients must explicitly set the `AssembleWith` value to `Line`, because the default value `None` concatenates all JSON objects on the same line.

For example, here is a SageMaker AI [`CreateTransformJob`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html) request for a DeepAR job with a custom `DEEPAR_INFERENCE_CONFIG`:

```
{
   "BatchStrategy": "SingleRecord",
   "Environment": { 
      "DEEPAR_INFERENCE_CONFIG" : "{ \"num_samples\": 200, \"output_types\": [\"mean\"] }",
      ...
   },
   "TransformInput": {
      "SplitType": "Line",
      ...
   },
   "TransformOutput": { 
      "AssembleWith": "Line",
      ...
   },
   ...
}
```

# Unsupervised Built-in SageMaker AI Algorithms

Amazon SageMaker AI provides several built-in algorithms that can be used for a variety of unsupervised learning tasks such as clustering, dimension reduction, pattern recognition, and anomaly detection.
+ [IP Insights](ip-insights.md)—learns the usage patterns for IPv4 addresses. It is designed to capture associations between IPv4 addresses and various entities, such as user IDs or account numbers.
+ [K-Means Algorithm](k-means.md)—finds discrete groupings within data, where members of a group are as similar as possible to one another and as different as possible from members of other groups.
+ [Principal Component Analysis (PCA) Algorithm](pca.md)—reduces the dimensionality (number of features) within a dataset by projecting data points onto the first few principal components. The objective is to retain as much information or variation as possible. For mathematicians, principal components are eigenvectors of the data's covariance matrix.
+ [Random Cut Forest (RCF) Algorithm](randomcutforest.md)—detects anomalous data points within a data set that diverge from otherwise well-structured or patterned data.


| Algorithm name | Channel name | Training input mode | File type | Instance class | Parallelizable | 
| --- | --- | --- | --- | --- | --- | 
| IP Insights | train and (optionally) validation | File | CSV | CPU or GPU | Yes | 
| K-Means | train and (optionally) test | File or Pipe | recordIO-protobuf or CSV | CPU or GPU (single GPU device on one or more instances) | No | 
| PCA | train and (optionally) test | File or Pipe | recordIO-protobuf or CSV | GPU or CPU | Yes | 
| Random Cut Forest | train and (optionally) test | File or Pipe | recordIO-protobuf or CSV | CPU | Yes | 

# IP Insights

Amazon SageMaker AI IP Insights is an unsupervised learning algorithm that learns the usage patterns for IPv4 addresses. It is designed to capture associations between IPv4 addresses and various entities, such as user IDs or account numbers. You can use it to identify a user attempting to log into a web service from an anomalous IP address, for example. Or you can use it to identify an account that is attempting to create computing resources from an unusual IP address. Trained IP Insight models can be hosted at an endpoint for making real-time predictions or used for processing batch transforms.

SageMaker AI IP Insights ingests historical data as (entity, IPv4 address) pairs and learns the IP usage patterns of each entity. When queried with an (entity, IPv4 address) event, a SageMaker AI IP Insights model returns a score that infers how anomalous the pattern of the event is. For example, when a user attempts to log in from an IP address, if the IP Insights score is high enough, a web login server might decide to trigger a multi-factor authentication system. In more advanced solutions, you can feed the IP Insights score into another machine learning model. For example, you can combine the IP Insights score with other features to rank the findings of another security system, such as those from [Amazon GuardDuty](https://docs.aws.amazon.com/guardduty/latest/ug/what-is-guardduty.html).

The SageMaker AI IP Insights algorithm can also learn vector representations of IP addresses, known as *embeddings*. You can use vector-encoded embeddings as features in downstream machine learning tasks that use the information observed in the IP addresses. For example, you can use them in tasks such as measuring similarities between IP addresses in clustering and visualization tasks.

**Topics**
+ [Input/Output Interface for the IP Insights Algorithm](#ip-insights-inputoutput)
+ [EC2 Instance Recommendation for the IP Insights Algorithm](#ip-insights-instances)
+ [IP Insights Sample Notebooks](#ip-insights-sample-notebooks)
+ [How IP Insights Works](ip-insights-howitworks.md)
+ [IP Insights Hyperparameters](ip-insights-hyperparameters.md)
+ [Tune an IP Insights Model](ip-insights-tuning.md)
+ [IP Insights Data Formats](ip-insights-data-formats.md)

## Input/Output Interface for the IP Insights Algorithm


**Training and Validation**

The SageMaker AI IP Insights algorithm supports training and validation data channels. It uses the optional validation channel to compute an area-under-curve (AUC) score on a predefined negative sampling strategy. The AUC metric validates how well the model discriminates between positive and negative samples. Training and validation data content types need to be in `text/csv` format. The first column of the CSV data is an opaque string that provides a unique identifier for the entity. The second column is an IPv4 address in decimal-dot notation. IP Insights currently supports only File mode. For more information and some examples, see [IP Insights Training Data Formats](ip-insights-training-data-formats.md).

**Inference**

For inference, IP Insights supports `text/csv`, `application/json`, and `application/jsonlines` data content types. For more information about the common data formats for inference provided by SageMaker AI, see [Common data formats for inference](cdf-inference.md). IP Insights inference returns output formatted as either `application/json` or `application/jsonlines`. Each record in the output data contains the corresponding `dot_product` (or compatibility score) for each input data point. For more information and some examples, see [IP Insights Inference Data Formats](ip-insights-inference-data-formats.md).

## EC2 Instance Recommendation for the IP Insights Algorithm


The SageMaker AI IP Insights algorithm can run on both GPU and CPU instances. For training jobs, we recommend using GPU instances. However, for certain workloads with large training datasets, distributed CPU instances might reduce training costs. For inference, we recommend using CPU instances. IP Insights supports P2, P3, G4dn, and G5 GPU families.

### GPU Instances for the IP Insights Algorithm


IP Insights supports all available GPUs. If you need to speed up training, we recommend starting with a single GPU instance, such as ml.p3.2xlarge, and then moving to a multi-GPU environment, such as ml.p3.8xlarge and ml.p3.16xlarge. Multiple GPUs automatically divide the mini-batches of training data among themselves. If you switch from a single GPU to multiple GPUs, the `mini_batch_size` is divided equally among the GPUs used. You may want to increase the value of `mini_batch_size` to compensate for this.

### CPU Instances for the IP Insights Algorithm


The type of CPU instance that we recommend depends largely on the instance's available memory and the model size. The model size is determined by two hyperparameters: `vector_dim` and `num_entity_vectors`. The maximum supported model size is 8 GB. The following table lists typical EC2 instance types to deploy based on these input parameters for various model sizes. The values for `vector_dim` in the first column range from 32 to 2048, and the values for `num_entity_vectors` in the first row range from 10,000 to 50,000,000.


| `vector_dim` \ `num_entity_vectors` | 10,000 | 50,000 | 100,000 | 500,000 | 1,000,000 | 5,000,000 | 10,000,000 | 50,000,000 | 
| --- | --- | --- | --- | --- | --- | --- | --- | --- | 
|  `32`  |  ml.m5.large  | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.xlarge | ml.m5.2xlarge | ml.m5.4xlarge | 
|  `64`  |  ml.m5.large  | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.2xlarge | ml.m5.2xlarge |  | 
|  `128`  |  ml.m5.large  | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.2xlarge | ml.m5.4xlarge |  | 
|  `256`  |  ml.m5.large  | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.xlarge | ml.m5.4xlarge |  |  | 
|  `512`  |  ml.m5.large  | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.2xlarge |  |  |  | 
|  `1024`  |  ml.m5.large  | ml.m5.large | ml.m5.large | ml.m5.xlarge | ml.m5.4xlarge |  |  |  | 
|  `2048`  |  ml.m5.large  | ml.m5.large | ml.m5.xlarge | ml.m5.xlarge |  |  |  |  | 

The values for the `mini_batch_size`, `num_ip_encoder_layers`, `random_negative_sampling_rate`, and `shuffled_negative_sampling_rate` hyperparameters also affect the amount of memory required. If these values are large, you might need to use a larger instance type than normal.
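As a rough check before picking an instance, the dominant memory cost can be approximated from those two hyperparameters. This is a hypothetical back-of-envelope estimate assuming 32-bit floats; the actual container adds overhead for the IP encoder network and the runtime, so treat it as a lower bound:

```python
def approx_model_size_bytes(num_entity_vectors, vector_dim, bytes_per_float=4):
    """Rough size of the entity embedding table, which scales linearly
    with both num_entity_vectors and vector_dim."""
    return num_entity_vectors * vector_dim * bytes_per_float

# e.g. 10,000,000 entity vectors at vector_dim=128:
size_gb = approx_model_size_bytes(10_000_000, 128) / 1e9
print(f"{size_gb:.1f} GB")  # about 5.1 GB, approaching the 8 GB limit
```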

## IP Insights Sample Notebooks

For a sample notebook that shows how to train the SageMaker AI IP Insights algorithm and perform inferences with it, see [An Introduction to the SageMaker AI IP Insights Algorithm](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/ipinsights_login/ipinsights-tutorial.html). For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After creating a notebook instance, choose the **SageMaker AI Examples** tab to see a list of all the SageMaker AI examples. To open a notebook, choose its **Use** tab and choose **Create copy**.

# How IP Insights Works

Amazon SageMaker AI IP Insights is an unsupervised algorithm that consumes observed data in the form of (entity, IPv4 address) pairs that associate entities with IP addresses. IP Insights determines how likely it is that an entity would use a particular IP address by learning latent vector representations for both entities and IP addresses. The distance between these two representations can then serve as a proxy for how likely this association is.

The IP Insights algorithm uses a neural network to learn the latent vector representations for entities and IP addresses. Entities are first hashed to a large but fixed hash space and then encoded by a simple embedding layer. Character strings such as user names or account IDs can be fed directly into IP Insights as they appear in log files. You don't need to preprocess the data for entity identifiers. You can provide entities as an arbitrary string value during both training and inference. The hash size should be configured with a value that is high enough to ensure that the number of *collisions*, which occur when distinct entities are mapped to the same latent vector, remain insignificant. For more information about how to select appropriate hash sizes, see [Feature Hashing for Large Scale Multitask Learning](https://alex.smola.org/papers/2009/Weinbergeretal09.pdf). For representing IP addresses, on the other hand, IP Insights uses a specially designed encoder network to uniquely represent each possible IPv4 address by exploiting the prefix structure of IP addresses.
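The hashing step can be sketched as follows. The algorithm's internal hash function is not documented, so `md5` here is purely illustrative; the point is that any entity string maps deterministically to one of a fixed number of embedding slots, and a larger hash space lowers the collision rate:

```python
import hashlib

def entity_to_index(entity_id: str, hash_space: int) -> int:
    """Map an arbitrary entity string to a slot in a fixed hash space,
    mimicking the hashing step that precedes the embedding lookup."""
    digest = hashlib.md5(entity_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_space

# The raw identifier from a log file needs no preprocessing:
idx = entity_to_index("user_12345", hash_space=20_000)
```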

During training, IP Insights automatically generates negative samples by randomly pairing entities and IP addresses. These negative samples represent data that is less likely to occur in reality. The model is trained to discriminate between positive samples that are observed in the training data and these generated negative samples. More specifically, the model is trained to minimize the *cross entropy*, also known as the *log loss*, defined as follows: 

![\[An image containing the equation for log loss.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/ip-insight-image-cross-entropy.png)


Here, *yₙ* is the label that indicates whether the sample is from the real distribution governing the observed data (*yₙ* = 1) or from the distribution generating negative samples (*yₙ* = 0), and *pₙ* is the probability that the sample is from the real distribution, as predicted by the model.
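The loss above is the standard binary cross entropy and can be computed directly from the labels and predicted probabilities:

```python
import math

def log_loss(labels, probs):
    """Binary cross entropy: -(1/N) * sum(y*log(p) + (1-y)*log(1-p))."""
    n = len(labels)
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(labels, probs)) / n

# A positive sample predicted at p=0.5 contributes log(2) ~= 0.6931:
loss = log_loss([1], [0.5])
```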

Generating negative samples is an important process that is used to achieve an accurate model of the observed data. If negative samples are extremely unlikely, for example, if all of the IP addresses in negative samples are 10.0.0.0, then the model trivially learns to distinguish negative samples and fails to accurately characterize the actual observed dataset. To keep negative samples more realistic, IP Insights generates negative samples both by randomly generating IP addresses and randomly picking IP addresses from training data. You can configure the type of negative sampling and the rates at which negative samples are generated with the `random_negative_sampling_rate` and `shuffled_negative_sampling_rate` hyperparameters.
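The two sampling strategies can be sketched as follows. This is illustrative only; the container's exact sampling logic is internal, and the random-IP and within-batch choices here simply mirror the description above:

```python
import random

def negative_samples(batch, r=1, s=1, rng=random):
    """Generate negatives for a mini batch of (entity, ip) positives:
    r random IPs per account (random_negative_sampling_rate) plus
    s IPs picked from the batch itself (shuffled_negative_sampling_rate)."""
    negatives = []
    ips = [ip for _, ip in batch]
    for entity, _ in batch:
        for _ in range(r):  # fully random IPv4 addresses
            negatives.append((entity, ".".join(str(rng.randrange(256)) for _ in range(4))))
        for _ in range(s):  # more realistic IPs drawn from the training batch
            negatives.append((entity, rng.choice(ips)))
    return negatives

batch = [("acct_1", "10.0.0.1"), ("acct_2", "192.168.1.2")]
negs = negative_samples(batch, r=2, s=1)  # 3 negatives per positive
```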

Given the *n*th (entity, IP address) pair, the IP Insights model outputs a *score*, *Sₙ*, that indicates how compatible the entity is with the IP address. This score corresponds to the log-odds ratio of the pair coming from the real distribution as compared to coming from the negative distribution. It is defined as follows:

![\[An image containing the equation for the score, a log odds ratio.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/ip-insight-image-log-odds.png)


The score is essentially a measure of the similarity between the vector representations of the *n*th entity and IP address. It can be interpreted as how much more likely it would be to observe this event in reality than in a randomly generated dataset. During training, the algorithm uses this score to calculate an estimate of the probability of a sample coming from the real distribution, *pₙ*, to use in the cross-entropy minimization, where:

![\[An image showing the equation for probability that the sample is from a real distribution.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/ip-insight-image-sample-probability.png)
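Because the score is a log-odds ratio, the probability estimate follows the standard logistic (sigmoid) relationship, sketched here under that assumption:

```python
import math

def prob_from_score(score):
    """Convert a log-odds score S_n into a probability
    p_n = 1 / (1 + exp(-S_n))."""
    return 1.0 / (1.0 + math.exp(-score))

# A score of 0 means the pair is equally likely under either distribution:
p = prob_from_score(0.0)
```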


# IP Insights Hyperparameters

In the [`CreateTrainingJob`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request, you specify the training algorithm. You can also specify algorithm-specific hyperparameters as string-to-string maps. The following table lists the hyperparameters for the Amazon SageMaker AI IP Insights algorithm.


| Parameter Name | Description | 
| --- | --- | 
| `num_entity_vectors` | The number of entity vector representations (entity embedding vectors) to train. Each entity in the training set is randomly assigned to one of these vectors using a hash function. Because of hash collisions, multiple entities might be assigned to the same vector, which would cause the same vector to represent multiple entities. This generally has a negligible effect on model performance, as long as the collision rate is not too severe. To keep the collision rate low, set this value as high as possible. However, the model size, and therefore the memory requirement for both training and inference, scales linearly with this hyperparameter. We recommend that you set this value to twice the number of unique entity identifiers. **Required** Valid values: 1 ≤ positive integer ≤ 250,000,000  | 
| `vector_dim` | The size of the embedding vectors that represent entities and IP addresses. The larger the value, the more information that can be encoded using these representations. In practice, model size scales linearly with this parameter, which limits how large the dimension can be. In addition, using vector representations that are too large can cause the model to overfit, especially for small training datasets. Overfitting occurs when a model doesn't learn any pattern in the data but effectively memorizes the training data and, therefore, cannot generalize well and performs poorly during inference. The recommended value is 128. **Required** Valid values: 4 ≤ positive integer ≤ 4096  | 
| `batch_metrics_publish_interval` | The interval (every X batches) at which the Apache MXNet Speedometer function prints the training speed of the network (samples/second).  **Optional** Valid values: positive integer ≥ 1 Default value: 1,000 | 
| `epochs` | The number of passes over the training data. The optimal value depends on your data size and learning rate. Typical values range from 5 to 100. **Optional** Valid values: positive integer ≥ 1 Default value: 10 | 
| `learning_rate` | The learning rate for the optimizer. IP Insights uses a gradient-descent-based Adam optimizer. The learning rate effectively controls the step size used to update model parameters at each iteration. Too large a learning rate can cause the model to diverge because the training is likely to overshoot a minimum. On the other hand, too small a learning rate slows down convergence. Typical values range from 1e-4 to 1e-1. **Optional** Valid values: 1e-6 ≤ float ≤ 10.0 Default value: 0.001 | 
| `mini_batch_size` | The number of examples in each mini batch. The training procedure processes data in mini batches. The optimal value depends on the number of unique account identifiers in the dataset. In general, the larger the `mini_batch_size`, the faster the training and the greater the number of possible shuffled-negative-sample combinations. However, with a large `mini_batch_size`, the training is more likely to converge to a poor local minimum and perform relatively worse for inference.  **Optional** Valid values: 1 ≤ positive integer ≤ 500,000 Default value: 10,000 | 
| `num_ip_encoder_layers` | The number of fully connected layers used to encode the IP address embedding. The larger the number of layers, the greater the model's capacity to capture patterns among IP addresses. However, using a large number of layers increases the chance of overfitting. **Optional** Valid values: 0 ≤ positive integer ≤ 100 Default value: 1 | 
| `random_negative_sampling_rate` | The number of random negative samples, R, to generate per input example. The training procedure relies on negative samples to prevent the vector representations of the model from collapsing to a single point. Random negative sampling generates R random IP addresses for each input account in the mini batch. The sum of the `random_negative_sampling_rate` (R) and the `shuffled_negative_sampling_rate` (S) must be in the interval 1 ≤ R + S ≤ 500. **Optional** Valid values: 0 ≤ positive integer ≤ 500 Default value: 1 | 
| `shuffled_negative_sampling_rate` | The number of shuffled negative samples, S, to generate per input example. In some cases, it helps to use more realistic negative samples that are randomly picked from the training data itself. This kind of negative sampling is achieved by shuffling the data within a mini batch. Shuffled negative sampling generates S negative IP addresses by shuffling the IP address and account pairings within a mini batch. The sum of the `random_negative_sampling_rate` (R) and the `shuffled_negative_sampling_rate` (S) must be in the interval 1 ≤ R + S ≤ 500. **Optional** Valid values: 0 ≤ positive integer ≤ 500 Default value: 1 | 
| `weight_decay` | The weight decay coefficient. This parameter adds an L2 regularization factor to prevent the model from overfitting the training data. **Optional** Valid values: 0.0 ≤ float ≤ 10.0 Default value: 0.00001 | 

# Tune an IP Insights Model

*Automatic model tuning*, also called hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the objective metric.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics Computed by the IP Insights Algorithm

The Amazon SageMaker AI IP Insights algorithm is an unsupervised learning algorithm that learns associations between IP addresses and entities. The algorithm trains a discriminator model, which learns to separate observed data points (*positive samples*) from randomly generated data points (*negative samples*). Automatic model tuning on IP Insights helps you find the model that can most accurately distinguish between unlabeled validation data and automatically generated negative samples. The model accuracy on the validation dataset is measured by the area under the receiver operating characteristic curve. This `validation:discriminator_auc` metric can take values between 0.0 and 1.0, where 1.0 indicates perfect accuracy.

The IP Insights algorithm computes a `validation:discriminator_auc` metric during validation, the value of which is used as the objective function to optimize for hyperparameter tuning.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| `validation:discriminator_auc` |  Area under the receiver operating characteristic curve on the validation dataset. The validation dataset is not labeled. Area Under the Curve (AUC) is a metric that describes the model's ability to discriminate validation data points from randomly generated data points.  |  Maximize  | 

## Tunable IP Insights Hyperparameters

You can tune the following hyperparameters for the SageMaker AI IP Insights algorithm. 


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| `epochs` |  IntegerParameterRange  |  MinValue: 1, MaxValue: 100  | 
| `learning_rate` |  ContinuousParameterRange  |  MinValue: 1e-4, MaxValue: 0.1  | 
| `mini_batch_size` |  IntegerParameterRange  |  MinValue: 100, MaxValue: 50000  | 
| `num_entity_vectors` |  IntegerParameterRange  |  MinValue: 10000, MaxValue: 1000000  | 
| `num_ip_encoder_layers` |  IntegerParameterRange  |  MinValue: 1, MaxValue: 10  | 
| `random_negative_sampling_rate` |  IntegerParameterRange  |  MinValue: 0, MaxValue: 10  | 
| `shuffled_negative_sampling_rate` |  IntegerParameterRange  |  MinValue: 0, MaxValue: 10  | 
| `vector_dim` |  IntegerParameterRange  |  MinValue: 8, MaxValue: 256  | 
| `weight_decay` |  ContinuousParameterRange  |  MinValue: 0.0, MaxValue: 1.0  | 

# IP Insights Data Formats

This section provides examples of the available input and output data formats used by the IP Insights algorithm during training and inference.

**Topics**
+ [IP Insights Training Data Formats](ip-insights-training-data-formats.md)
+ [IP Insights Inference Data Formats](ip-insights-inference-data-formats.md)

# IP Insights Training Data Formats

The following are the available data input formats for the IP Insights algorithm. Amazon SageMaker AI built-in algorithms adhere to the common input training format described in [Common Data Formats for Training](cdf-training.md). However, the SageMaker AI IP Insights algorithm currently supports only the CSV data input format.

## IP Insights Training Data Input Formats

### INPUT: CSV


The CSV file must have two columns. The first column is an opaque string that corresponds to an entity's unique identifier. The second column is the IPv4 address of the entity's access event in decimal-dot notation. 

content-type: text/csv

```
entity_id_1, 192.168.1.2
entity_id_2, 10.10.1.2
```

# IP Insights Inference Data Formats

The following are the available input and output formats for the IP Insights algorithm. Amazon SageMaker AI built-in algorithms adhere to the common input inference format described in [Common data formats for inference](cdf-inference.md). However, the SageMaker AI IP Insights algorithm does not currently support RecordIO format.

## IP Insights Input Request Formats

### INPUT: CSV Format


The CSV file must have two columns. The first column is an opaque string that corresponds to an entity's unique identifier. The second column is the IPv4 address of the entity's access event in decimal-dot notation. 

content-type: text/csv

```
entity_id_1, 192.168.1.2
entity_id_2, 10.10.1.2
```

### INPUT: JSON Format


JSON data can be provided in different formats. IP Insights follows the common SageMaker AI formats. For more information about inference formats, see [Common data formats for inference](cdf-inference.md).

content-type: application/json

```
{
  "instances": [
    {"data": {"features": {"values": ["entity_id_1", "192.168.1.2"]}}},
    {"features": ["entity_id_2", "10.10.1.2"]}
  ]
}
```

### INPUT: JSONLINES Format


The JSON Lines content type is useful for running batch transform jobs. For more information on SageMaker AI inference formats, see [Common data formats for inference](cdf-inference.md). For more information on running batch transform jobs, see [Batch transform for inference with Amazon SageMaker AI](batch-transform.md).

content-type: application/jsonlines

```
{"data": {"features": {"values": ["entity_id_1", "192.168.1.2"]}}}
{"features": ["entity_id_2", "10.10.1.2"]}
```

## IP Insights Output Response Formats

### OUTPUT: JSON Response Format


The default output of the SageMaker AI IP Insights algorithm is the `dot_product` between the input entity and IP address. The `dot_product` signifies how compatible the model considers the entity and IP address. The `dot_product` is unbounded. To make predictions about whether an event is anomalous, you need to set a threshold based on the score distribution you observe. For information about how to use the `dot_product` for anomaly detection, see [An Introduction to the SageMaker AI IP Insights Algorithm](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/ipinsights_login/ipinsights-tutorial.html).

accept: application/json

```
{
  "predictions": [
    {"dot_product": 0.0},
    {"dot_product": 2.0}
  ]
}
```
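One common way to set such a threshold, sketched below, is to take a low percentile of the `dot_product` scores observed on known-good traffic and flag new events that score below it (low compatibility). The percentile value and the flagging direction are workload-dependent choices to tune, not something prescribed by the algorithm:

```python
def compatibility_threshold(scores, percentile=10.0):
    """Pick a cutoff from dot_product scores of known-good traffic;
    new events scoring below it can be flagged for review."""
    ordered = sorted(scores)
    k = min(len(ordered) - 1, int(len(ordered) * percentile / 100.0))
    return ordered[k]

# Illustrative baseline scores from legitimate (entity, IP) events:
baseline = [0.8, 1.2, 0.5, 2.0, 1.1, 0.9, 1.5, 0.7, 1.3, -3.0]
threshold = compatibility_threshold(baseline, percentile=10)
flags = [score < threshold for score in [-2.5, 1.0]]  # [True, False]
```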

Advanced users can access the model's learned entity and IP embeddings by providing the additional content-type parameter `verbose=True` in the Accept header. You can use the `entity_embedding` and `ip_embedding` values for debugging, visualizing, and understanding the model. Additionally, you can use these embeddings in other machine learning techniques, such as classification or clustering.

accept: application/json;verbose=True

```
{
  "predictions": [
    {
        "dot_product": 0.0,
        "entity_embedding": [1.0, 0.0, 0.0],
        "ip_embedding": [0.0, 1.0, 0.0]
    },
    {
        "dot_product": 2.0,
        "entity_embedding": [1.0, 0.0, 1.0],
        "ip_embedding": [1.0, 0.0, 1.0]
    }
  ]
}
```

### OUTPUT: JSONLINES Response Format


accept: application/jsonlines 

```
{"dot_product": 0.0}
{"dot_product": 2.0}
```

accept: application/jsonlines; verbose=True 

```
{"dot_product": 0.0, "entity_embedding": [1.0, 0.0, 0.0], "ip_embedding": [0.0, 1.0, 0.0]}
{"dot_product": 2.0, "entity_embedding": [1.0, 0.0, 1.0], "ip_embedding": [1.0, 0.0, 1.0]}
```

# K-Means Algorithm


K-means is an unsupervised learning algorithm. It attempts to find discrete groupings within data, where members of a group are as similar as possible to one another and as different as possible from members of other groups. You define the attributes that you want the algorithm to use to determine similarity. 

Amazon SageMaker AI uses a modified version of the web-scale k-means clustering algorithm. Compared with the original version of the algorithm, the version used by Amazon SageMaker AI is more accurate. Like the original algorithm, it scales to massive datasets and delivers improvements in training time. To do this, the version used by Amazon SageMaker AI streams mini-batches (small, random subsets) of the training data. For more information about mini-batch k-means, see [Web-scale k-means Clustering](https://dl.acm.org/doi/10.1145/1772690.1772862).

The k-means algorithm expects tabular data, where rows represent the observations that you want to cluster, and the columns represent attributes of the observations. The *n* attributes in each row represent a point in *n*-dimensional space. The Euclidean distance between these points represents the similarity of the corresponding observations. The algorithm groups observations with similar attribute values (the points corresponding to these observations are closer together). For more information about how k-means works in Amazon SageMaker AI, see [How K-Means Clustering Works](algo-kmeans-tech-notes.md).
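The notion of similarity as Euclidean distance can be sketched as follows, assigning each observation (row) to its nearest cluster center. This is a plain NumPy illustration with made-up data, not the SageMaker AI implementation:

```python
import numpy as np

def assign_clusters(X: np.ndarray, centers: np.ndarray) -> np.ndarray:
    """Return the index of the closest center for each row of X,
    using Euclidean distance in n-dimensional attribute space."""
    # Pairwise distances: shape (num_observations, num_centers).
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

# Two obvious groups in 2-D attribute space.
X = np.array([[0.0, 0.1], [0.2, 0.0], [9.9, 10.0], [10.1, 9.8]])
centers = np.array([[0.0, 0.0], [10.0, 10.0]])
print(assign_clusters(X, centers))  # → [0 0 1 1]
```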

**Topics**
+ [Input/Output Interface for the K-Means Algorithm](#km-inputoutput)
+ [EC2 Instance Recommendation for the K-Means Algorithm](#km-instances)
+ [K-Means Sample Notebooks](#kmeans-sample-notebooks)
+ [How K-Means Clustering Works](algo-kmeans-tech-notes.md)
+ [K-Means Hyperparameters](k-means-api-config.md)
+ [Tune a K-Means Model](k-means-tuning.md)
+ [K-Means Response Formats](km-in-formats.md)

## Input/Output Interface for the K-Means Algorithm


For training, the k-means algorithm expects data to be provided in the *train* channel (recommended `S3DataDistributionType=ShardedByS3Key`), with an optional *test* channel (recommended `S3DataDistributionType=FullyReplicated`) to score the data on. Both `recordIO-wrapped-protobuf` and `CSV` formats are supported for training. You can use either File mode or Pipe mode to train models on data that is formatted as `recordIO-wrapped-protobuf` or as `CSV`.

For inference, `text/csv`, `application/json`, and `application/x-recordio-protobuf` are supported. k-means returns a `closest_cluster` label and the `distance_to_cluster` for each observation.

For more information on input and output file formats, see [K-Means Response Formats](km-in-formats.md) for inference and the [K-Means Sample Notebooks](#kmeans-sample-notebooks). The k-means algorithm does not support multiple instance learning, in which the training set consists of labeled “bags”, each of which is a collection of unlabeled instances.

## EC2 Instance Recommendation for the K-Means Algorithm


We recommend training k-means on CPU instances. You can train on GPU instances, but should limit GPU training to single-GPU instances (such as ml.g4dn.xlarge) because only one GPU is used per instance. The k-means algorithm supports P2, P3, G4dn, and G5 instances for training and inference.

## K-Means Sample Notebooks

For a sample notebook that uses the SageMaker AI K-means algorithm to segment the population of counties in the United States by attributes identified using principal component analysis, see [Analyze US census data for population segmentation using Amazon SageMaker AI](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_applying_machine_learning/US-census_population_segmentation_PCA_Kmeans/sagemaker-countycensusclustering.html). For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). Once you have created a notebook instance and opened it, select the **SageMaker AI Examples** tab to see a list of all the SageMaker AI samples. To open a notebook, click on its **Use** tab and select **Create copy**.

# How K-Means Clustering Works

K-means is an algorithm that trains a model that groups similar objects together. The k-means algorithm accomplishes this by mapping each observation in the input dataset to a point in the *n*-dimensional space (where *n* is the number of attributes of the observation). For example, your dataset might contain observations of temperature and humidity in a particular location, which are mapped to points (*t, h*) in 2-dimensional space. 



**Note**  
Clustering algorithms are unsupervised. In unsupervised learning, labels that might be associated with the objects in the training dataset aren't used. For more information, see [Unsupervised learning](algorithms-choose.md#algorithms-choose-unsupervised-learning).

In k-means clustering, each cluster has a center. During model training, the k-means algorithm uses the distance of the point that corresponds to each observation in the dataset to the cluster centers as the basis for clustering. You choose the number of clusters (*k*) to create. 

For example, suppose that you want to create a model to recognize handwritten digits and you choose the MNIST dataset for training. The dataset provides thousands of images of handwritten digits (0 through 9). In this example, you might choose to create 10 clusters, one for each digit (0, 1, …, 9). As part of model training, the k-means algorithm groups the input images into 10 clusters.

Each image in the MNIST dataset is a 28x28-pixel image, with a total of 784 pixels. Each image corresponds to a point in a 784-dimensional space, similar to a point in a 2-dimensional space (x,y). To find a cluster to which a point belongs, the k-means algorithm finds the distance of that point from all of the cluster centers. It then chooses the cluster with the closest center as the cluster to which the image belongs. 

**Note**  
Amazon SageMaker AI uses a customized version of the algorithm where, instead of specifying that the algorithm create *k* clusters, you might choose to improve model accuracy by specifying extra cluster centers (*K* = *k*\**x*). However, the algorithm ultimately reduces these to *k* clusters.

In SageMaker AI, you specify the number of clusters when creating a training job. For more information, see [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html). In the request body, you add the `HyperParameters` string map to specify the `k` and `extra_center_factor` strings.

The following is a summary of how k-means works for model training in SageMaker AI:

1. It determines the initial *K* cluster centers. 
**Note**  
In the following topics, *K* clusters refer to *k* \* *x*, where you specify *k* and *x* when creating a model training job. 

1. It iterates over input training data and recalculates cluster centers.

1. It reduces the resulting clusters to *k* (if the data scientist specified the creation of *k*\**x* clusters in the request). 

The following sections also explain some of the parameters that a data scientist might specify to configure a model training job as part of the `HyperParameters` string map. 

**Topics**
+ [Step 1: Determine the Initial Cluster Centers](#kmeans-step1)
+ [Step 2: Iterate over the Training Dataset and Calculate Cluster Centers](#kmeans-step2)
+ [Step 3: Reduce the Clusters from *K* to *k*](#kmeans-step3)

## Step 1: Determine the Initial Cluster Centers


When using k-means in SageMaker AI, the initial cluster centers are chosen from the observations in a small, randomly sampled batch. Choose one of the following strategies to determine how these initial cluster centers are selected:
+ The random approach—Randomly choose *K* observations in your input dataset as cluster centers. For example, you might choose the points in the 784-dimensional space that correspond to any 10 images in the MNIST training dataset as your cluster centers.
+ The k-means++ approach, which works as follows: 

  1. Start with one cluster and determine its center. You randomly select an observation from your training dataset and use the point corresponding to the observation as the cluster center. For example, in the MNIST dataset, randomly choose a handwritten digit image. Then choose the point in the 784-dimensional space that corresponds to the image as your cluster center. This is cluster center 1.

  1. Determine the center for cluster 2. From the remaining observations in the training dataset, pick an observation at random. Choose one that is different than the one you previously selected. This observation corresponds to a point that is far away from cluster center 1. Using the MNIST dataset as an example, you do the following:
     + For each of the remaining images, find the distance of the corresponding point from cluster center 1. Square the distance and assign a probability that is proportional to the square of the distance. That way, an image that is different from the one that you previously selected has a higher probability of getting selected as cluster center 2. 
     + Choose one of the images randomly, based on probabilities assigned in the previous step. The point that corresponds to the image is cluster center 2.

  1. Repeat Step 2 to find cluster center 3. This time, find the distances of the remaining images from cluster center 2.

  1. Repeat the process until you have the *K* cluster centers.
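The procedure above can be sketched as follows. Note this is a simplified, single-machine illustration of standard k-means++ (which weights by the squared distance to the *nearest* already-chosen center), not the distributed SageMaker AI implementation:

```python
import numpy as np

def kmeans_pp_init(X: np.ndarray, k: int,
                   rng: np.random.Generator) -> np.ndarray:
    """Pick k initial centers: the first uniformly at random, each
    subsequent one with probability proportional to its squared
    distance from the nearest center chosen so far."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance of every point to its nearest chosen center.
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)

rng = np.random.default_rng(0)
# Two tight groups: one near the origin, one near (10, 10).
X = np.vstack([np.zeros((5, 2)), np.full((5, 2), 10.0)])
centers = kmeans_pp_init(X, 2, rng)
# The D^2 weighting guarantees the two centers land in different groups.
```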

To train a model in SageMaker AI, you create a training job. In the request, you provide configuration information by specifying the following `HyperParameters` string maps:
+ To specify the number of clusters to create, add the `k` string.
+ For greater accuracy, add the optional `extra_center_factor` string. 
+ To specify the strategy that you want to use to determine the initial cluster centers, add the `init_method` string and set its value to `random` or `k-means++`.
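In a `CreateTrainingJob` request, these settings are passed as a string-to-string map. A hypothetical configuration for the MNIST example might look like the following (the specific values are illustrative only):

```python
# Hypothetical HyperParameters map for a CreateTrainingJob request.
# All values are strings, per the API's string-to-string map type.
hyperparameters = {
    "feature_dim": "784",        # MNIST images have 784 pixels
    "k": "10",                   # one cluster per digit
    "extra_center_factor": "4",  # train K = k * 4 centers, then reduce to k
    "init_method": "kmeans++",   # or "random"
}
```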

For more information about the SageMaker AI k-means estimator, see [K-means](https://sagemaker.readthedocs.io/en/stable/algorithms/unsupervised/kmeans.html) in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) documentation.

You now have an initial set of cluster centers. 

## Step 2: Iterate over the Training Dataset and Calculate Cluster Centers


The cluster centers that you created in the preceding step are mostly random, with some consideration for the training dataset. In this step, you use the training dataset to move these centers toward the true cluster centers. The algorithm iterates over the training dataset, and recalculates the *K* cluster centers.

1. Read a mini-batch of observations (a small, randomly chosen subset of all records) from the training dataset and do the following. 
**Note**  
When creating a model training job, you specify the batch size in the `mini_batch_size` string in the `HyperParameters` string map. 

   1. Assign all of the observations in the mini-batch to one of the clusters with the closest cluster center.

   1. Calculate the number of observations assigned to each cluster. Then, calculate the proportion of new points assigned per cluster.

      For example, consider the following clusters:

      Cluster c1 = 100 previously assigned points. You added 25 points from the mini-batch in this step.

      Cluster c2 = 150 previously assigned points. You added 40 points from the mini-batch in this step.

      Cluster c3 = 450 previously assigned points. You added 5 points from the mini-batch in this step.

      Calculate the proportion of new points assigned to each of clusters as follows:

      ```
      p1 = proportion of points assigned to c1 = 25/(100+25)
      p2 = proportion of points assigned to c2 = 40/(150+40)
      p3 = proportion of points assigned to c3 = 5/(450+5)
      ```

   1. Compute the center of the new points added to each cluster:

      ```
      d1 = center of the new points added to cluster 1
      d2 = center of the new points added to cluster 2
      d3 = center of the new points added to cluster 3
      ```

   1. Compute the weighted average to find the updated cluster centers as follows:

      ```
      Center of cluster 1 = ((1 - p1) * center of cluster 1) + (p1 * d1)
      Center of cluster 2 = ((1 - p2) * center of cluster 2) + (p2 * d2)
      Center of cluster 3 = ((1 - p3) * center of cluster 3) + (p3 * d3)
      ```

1. Read the next mini-batch, and repeat Step 1 to recalculate the cluster centers. 

For more information about mini-batch *k*-means, see [Web-scale k-means Clustering](https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=b452a856a3e3d4d37b1de837996aa6813bedfdcf).
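Using the example numbers above, the weighted-average update can be sketched as follows (the mini-batch points are hypothetical; only the counts match the example):

```python
import numpy as np

def update_center(center, new_points, prev_count):
    """Move a cluster center toward the mean of the new mini-batch
    points, weighted by the proportion of new points in the cluster."""
    p = len(new_points) / (prev_count + len(new_points))
    d = np.mean(new_points, axis=0)  # center of the new points
    return (1 - p) * center + p * d

# Cluster c1: 100 previously assigned points, 25 new points,
# so p1 = 25 / (100 + 25) = 0.2 as in the example above.
center = np.array([0.0, 0.0])
new_points = np.full((25, 2), 5.0)  # hypothetical mini-batch points
updated = update_center(center, new_points, prev_count=100)
print(updated)  # → [1. 1.]
```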

## Step 3: Reduce the Clusters from *K* to *k*


If the algorithm created *K* clusters (*K* = *k*\**x*, where *x* is greater than 1), then it reduces the *K* clusters to *k* clusters. (For more information, see `extra_center_factor` in the preceding discussion.) It does this by applying Lloyd's method with `kmeans++` initialization to the *K* cluster centers. For more information about Lloyd's method, see [k-means clustering](https://pdfs.semanticscholar.org/0074/4cb7cc9ccbbcdadbd5ff2f2fee6358427271.pdf). 
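Lloyd's method alternates between assigning points to their nearest center and moving each center to the mean of its assigned points. The following is a minimal sketch of reducing *K* = 4 intermediate centers to *k* = 2 by clustering the centers themselves; the real algorithm uses `kmeans++` initialization and weights the intermediate centers, which this simplified version omits:

```python
import numpy as np

def lloyd(points: np.ndarray, centers: np.ndarray,
          iters: int = 10) -> np.ndarray:
    """Plain Lloyd iterations: assign each point to its nearest
    center, then move each center to the mean of its points."""
    for _ in range(iters):
        labels = np.linalg.norm(
            points[:, None] - centers[None, :], axis=2).argmin(axis=1)
        for j in range(len(centers)):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return centers

# Four intermediate centers forming two natural pairs; initialization
# is simplified here to the first k centers.
K_centers = np.array([[0.0, 0.0], [0.2, 0.0], [10.0, 10.0], [9.8, 10.0]])
final = lloyd(K_centers, K_centers[:2].copy())
# final converges to the means of the two pairs: (0.1, 0) and (9.9, 10).
```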

# K-Means Hyperparameters

In the [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request, you specify the training algorithm that you want to use. You can also specify algorithm-specific hyperparameters as string-to-string maps. The following table lists the hyperparameters for the k-means training algorithm provided by Amazon SageMaker AI. For more information about how k-means clustering works, see [How K-Means Clustering Works](algo-kmeans-tech-notes.md).


| Parameter Name | Description | 
| --- | --- | 
| feature\_dim | The number of features in the input data. **Required** Valid values: Positive integer  | 
| k |  The number of required clusters. **Required** Valid values: Positive integer  | 
| epochs | The number of passes done over the training data. **Optional** Valid values: Positive integer Default value: 1  | 
| eval\_metrics | A JSON list of metric types used to report a score for the model. Allowed values are `msd` for Mean Squared Deviation and `ssd` for Sum of Squared Distance. If test data is provided, the score is reported for each of the requested metrics. **Optional** Valid values: Either `[\"msd\"]` or `[\"ssd\"]` or `[\"msd\",\"ssd\"]`. Default value: `[\"msd\"]`  | 
| extra\_center\_factor | The algorithm creates K centers = `num_clusters` \* `extra_center_factor` as it runs and reduces the number of centers from K to `k` when finalizing the model. **Optional** Valid values: Either a positive integer or `auto`. Default value: `auto`  | 
| half\_life\_time\_size | Used to determine the weight given to an observation when computing a cluster mean. This weight decays exponentially as more points are observed. When a point is first observed, it is assigned a weight of 1 when computing the cluster mean. The decay constant for the exponential decay function is chosen so that after observing `half_life_time_size` points, its weight is 1/2. If set to 0, there is no decay. **Optional** Valid values: Non-negative integer Default value: 0  | 
| init\_method | Method by which the algorithm chooses the initial cluster centers. The standard k-means approach chooses them at random. An alternative k-means++ method chooses the first cluster center at random. Then it spreads out the position of the remaining initial centers by weighting the selection of centers with a probability distribution that is proportional to the square of the distance of the remaining data points from existing centers. **Optional** Valid values: Either `random` or `kmeans++`. Default value: `random`  | 
| local\_lloyd\_init\_method | The initialization method for Lloyd's expectation-maximization (EM) procedure used to build the final model containing `k` centers. **Optional** Valid values: Either `random` or `kmeans++`. Default value: `kmeans++`  | 
| local\_lloyd\_max\_iter | The maximum number of iterations for Lloyd's expectation-maximization (EM) procedure used to build the final model containing `k` centers. **Optional** Valid values: Positive integer Default value: 300  | 
| local\_lloyd\_num\_trials | The number of times Lloyd's expectation-maximization (EM) procedure is run, keeping the run with the least loss, when building the final model containing `k` centers. **Optional** Valid values: Either a positive integer or `auto`. Default value: `auto`  | 
| local\_lloyd\_tol | The tolerance for change in loss for early stopping of Lloyd's expectation-maximization (EM) procedure used to build the final model containing `k` centers. **Optional** Valid values: Float. Range in [0, 1]. Default value: 0.0001  | 
| mini\_batch\_size | The number of observations per mini-batch for the data iterator. **Optional** Valid values: Positive integer Default value: 5000  | 

# Tune a K-Means Model

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the objective metric.

The Amazon SageMaker AI k-means algorithm is an unsupervised algorithm that groups data into clusters whose members are as similar to one another as possible. Because it is unsupervised, there is no labeled validation dataset for hyperparameters to optimize against. Instead, the algorithm accepts a test dataset and emits metrics that depend on the squared distance between the data points and the final cluster centroids at the end of each training run. To find the model that produces the tightest clusters on the test dataset, you can use a hyperparameter tuning job.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics Computed by the K-Means Algorithm

The k-means algorithm computes the following metrics during training. When tuning a model, choose one of these metrics as the objective metric. 


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| test:msd | Mean squared distances between each record in the test set and the closest center of the model. | Minimize | 
| test:ssd | Sum of the squared distances between each record in the test set and the closest center of the model. | Minimize | 
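Both metrics derive from the same per-record distances. Given the distance from each test record to its closest center (the values below are hypothetical), they are computed as follows:

```python
import numpy as np

# Hypothetical distances from each test record to its closest center.
distances = np.array([1.0, 2.0, 3.0])

test_msd = np.mean(distances ** 2)  # mean squared distance
test_ssd = np.sum(distances ** 2)   # sum of squared distances

print(test_msd, test_ssd)  # → 4.666666666666667 14.0
```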



## Tunable K-Means Hyperparameters

Tune the Amazon SageMaker AI k-means model with the following hyperparameters. The hyperparameters that have the greatest impact on k-means objective metrics are: `mini_batch_size`, `extra_center_factor`, and `init_method`. Tuning the hyperparameter `epochs` generally results in minor improvements.


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| epochs | IntegerParameterRanges | MinValue: 1, MaxValue: 10 | 
| extra\_center\_factor | IntegerParameterRanges | MinValue: 4, MaxValue: 10 | 
| init\_method | CategoricalParameterRanges | ['kmeans++', 'random'] | 
| mini\_batch\_size | IntegerParameterRanges | MinValue: 3000, MaxValue: 15000 | 
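These recommended ranges correspond to the `ParameterRanges` structure of a hyperparameter tuning job request. A hypothetical configuration mirroring the table above might look like:

```python
# Hypothetical ParameterRanges for a hyperparameter tuning job,
# mirroring the recommended ranges in the table above.
parameter_ranges = {
    "IntegerParameterRanges": [
        {"Name": "epochs", "MinValue": "1", "MaxValue": "10"},
        {"Name": "extra_center_factor", "MinValue": "4", "MaxValue": "10"},
        {"Name": "mini_batch_size", "MinValue": "3000", "MaxValue": "15000"},
    ],
    "CategoricalParameterRanges": [
        {"Name": "init_method", "Values": ["kmeans++", "random"]},
    ],
}
```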

# K-Means Response Formats

All SageMaker AI built-in algorithms adhere to the common input inference format described in [Common Data Formats - Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html). This topic contains a list of the available output formats for the SageMaker AI k-means algorithm.

## JSON Response Format


```
{
    "predictions": [
        {
            "closest_cluster": 1.0,
            "distance_to_cluster": 3.0
        },
        {
            "closest_cluster": 2.0,
            "distance_to_cluster": 5.0
        },

        ....
    ]
}
```

## JSONLINES Response Format


```
{"closest_cluster": 1.0, "distance_to_cluster": 3.0}
{"closest_cluster": 2.0, "distance_to_cluster": 5.0}
```

## RECORDIO Response Format


```
[
    Record = {
        features = {},
        label = {
            'closest_cluster': {
                keys: [],
                values: [1.0, 2.0]  # float32
            },
            'distance_to_cluster': {
                keys: [],
                values: [3.0, 5.0]  # float32
            },
        }
    }
]
```

## CSV Response Format


The first value in each line corresponds to `closest_cluster`.

The second value in each line corresponds to `distance_to_cluster`.

```
1.0,3.0
2.0,5.0
```

# Principal Component Analysis (PCA) Algorithm


PCA is an unsupervised machine learning algorithm that attempts to reduce the dimensionality (number of features) within a dataset while still retaining as much information as possible. This is done by finding a new set of features called *components*, which are composites of the original features that are uncorrelated with one another. They are also constrained so that the first component accounts for the largest possible variability in the data, the second component the second most variability, and so on.

In Amazon SageMaker AI, PCA operates in two modes, depending on the scenario: 
+ **regular**: For datasets with sparse data and a moderate number of observations and features.
+ **randomized**: For datasets with both a large number of observations and features. This mode uses an approximation algorithm. 

PCA uses tabular data. 

The rows represent observations you want to embed in a lower dimensional space. The columns represent features that you want to find a reduced approximation for. The algorithm calculates the covariance matrix (or an approximation thereof in a distributed manner), and then performs the singular value decomposition on this summary to produce the principal components. 
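This pipeline (covariance matrix, then singular value decomposition) can be sketched with NumPy as a single-machine illustration of the idea, not the distributed SageMaker AI implementation; the data below is synthetic with deliberately unequal variance per feature:

```python
import numpy as np

def pca_components(X: np.ndarray, num_components: int) -> np.ndarray:
    """Compute principal components as the top singular vectors of
    the covariance matrix of mean-centered data."""
    Xc = X - X.mean(axis=0)        # center each feature
    cov = Xc.T @ Xc / len(X)       # d x d covariance matrix
    _, _, vt = np.linalg.svd(cov)  # singular value decomposition
    return vt[:num_components]     # rows are the components

rng = np.random.default_rng(0)
# Data with much more variance along the first feature.
X = rng.normal(size=(500, 3)) * np.array([10.0, 1.0, 0.1])
comp = pca_components(X, 1)
# The first component aligns with the highest-variance axis.
print(np.abs(comp[0]).argmax())  # → 0
```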

**Topics**
+ [Input/Output Interface for the PCA Algorithm](#pca-inputoutput)
+ [EC2 Instance Recommendation for the PCA Algorithm](#pca-instances)
+ [PCA Sample Notebooks](#PCA-sample-notebooks)
+ [How PCA Works](how-pca-works.md)
+ [PCA Hyperparameters](PCA-reference.md)
+ [PCA Response Formats](PCA-in-formats.md)

## Input/Output Interface for the PCA Algorithm


For training, PCA expects data provided in the *train* channel, and optionally supports a *test* channel with a dataset that is scored by the final algorithm. Both `recordIO-wrapped-protobuf` and `CSV` formats are supported for training. You can use either File mode or Pipe mode to train models on data that is formatted as `recordIO-wrapped-protobuf` or as `CSV`.

For inference, PCA supports `text/csv`, `application/json`, and `application/x-recordio-protobuf`. Results are returned in either `application/json` or `application/x-recordio-protobuf` format with a vector of "projections."

For more information on input and output file formats, see [PCA Response Formats](PCA-in-formats.md) for inference and the [PCA Sample Notebooks](#PCA-sample-notebooks).

## EC2 Instance Recommendation for the PCA Algorithm


PCA supports CPU and GPU instances for training and inference. Which instance type is most performant depends heavily on the specifics of the input data. For GPU instances, PCA supports P2, P3, G4dn, and G5.

## PCA Sample Notebooks

For a sample notebook that shows how to use the SageMaker AI Principal Component Analysis algorithm to analyze the images of handwritten digits from zero to nine in the MNIST dataset, see [An Introduction to PCA with MNIST](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/pca_mnist/pca_mnist.html). For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). Once you have created a notebook instance and opened it, select the **SageMaker AI Examples** tab to see a list of all the SageMaker AI samples. To open a notebook, click on its **Use** tab and select **Create copy**.

# How PCA Works

Principal Component Analysis (PCA) is a learning algorithm that reduces the dimensionality (number of features) within a dataset while still retaining as much information as possible. 

PCA reduces dimensionality by finding a new set of features called *components*, which are composites of the original features, but are uncorrelated with one another. The first component accounts for the largest possible variability in the data, the second component the second most variability, and so on.

It is an unsupervised dimensionality reduction algorithm. In unsupervised learning, labels that might be associated with the objects in the training dataset aren't used.

Given the input of a matrix with rows ![\[x_1,…,x_n\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-39b.png) each of dimension `1 * d`, the data is partitioned into mini-batches of rows and distributed among the training nodes (workers). Each worker then computes a summary of its data. The summaries of the different workers are then unified into a single solution at the end of the computation. 

**Modes**

The Amazon SageMaker AI PCA algorithm uses either of two modes to calculate these summaries, depending on the situation:
+ **regular**: for datasets with sparse data and a moderate number of observations and features.
+ **randomized**: for datasets with both a large number of observations and features. This mode uses an approximation algorithm. 

As the algorithm's last step, it performs the singular value decomposition on the unified solution, from which the principal components are then derived.

## Mode 1: Regular


The workers jointly compute both ![\[Equation in text-form: \sum x_i^T x_i\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-1b.png) and ![\[Equation in text-form: \sum x_i\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-2b.png) .

**Note**  
Because ![\[Equation in text-form: x_i\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-3b.png) are `1 * d` row vectors, ![\[Equation in text-form: x_i^T x_i\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-4b.png) is a matrix (not a scalar). Using row vectors within the code allows us to obtain efficient caching.

The covariance matrix is computed as ![\[Equation in text-form: \sum x_i^T x_i - (1/n) (\sum x_i)^T \sum x_i\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-32b.png) , and its top `num_components` singular vectors form the model.

**Note**  
If `subtract_mean` is `False`, we avoid computing and subtracting ![\[Equation in text-form: \sum x_i\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-2b.png) .

Use this algorithm when the dimension `d` of the vectors is small enough so that ![\[Equation in text-form: d^2\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-7b.png) can fit in memory.
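In other words, the workers only need the two sums ∑ xᵢᵀxᵢ and ∑ xᵢ, from which the covariance of the mean-centered data is recovered. A quick check of this identity, as a plain NumPy illustration with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))          # rows x_i of dimension d = 4

# The two distributed summaries.
sum_outer = X.T @ X                    # sum of x_i^T x_i  (d x d)
sum_x = X.sum(axis=0, keepdims=True)   # sum of x_i        (1 x d)
n = len(X)

# Unnormalized covariance built from the summaries ...
cov_from_sums = sum_outer - (1.0 / n) * sum_x.T @ sum_x

# ... equals the Gram matrix of the mean-centered data.
Xc = X - X.mean(axis=0)
assert np.allclose(cov_from_sums, Xc.T @ Xc)
```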

## Mode 2: Randomized


When the number of features in the input dataset is large, we use a method to approximate the covariance matrix. For every mini-batch ![\[Equation in text-form: X_t\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-23b.png) of dimension `b * d`, we randomly initialize a `(num_components + extra_components) * b` matrix that we multiply by each mini-batch, to create a `(num_components + extra_components) * d` matrix. The sum of these matrices is computed by the workers, and the servers perform SVD on the final `(num_components + extra_components) * d` matrix. Its top `num_components` right singular vectors are the approximation of the top singular vectors of the input matrix.

Let ![\[Equation in text-form: \ell\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-38b.png) ` = num_components + extra_components`. Given a mini-batch ![\[Equation in text-form: X_t\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-23b.png) of dimension `b * d`, the worker draws a random matrix ![\[Equation in text-form: H_t\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-24b.png) of dimension ![\[Equation in text-form: \ell * b\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-38.png) . Depending on whether the environment uses a GPU or CPU and the dimension size, the matrix is either a random sign matrix where each entry is `+-1` or a *FJLT* (fast Johnson Lindenstrauss transform; for information, see [FJLT Transforms](https://www.cs.princeton.edu/~chazelle/pubs/FJLT-sicomp09.pdf) and the follow-up papers). The worker then computes ![\[Equation in text-form: H_t X_t\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-26b.png) and maintains ![\[Equation in text-form: B = \sum H_t X_t\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-27b.png) . The worker also maintains ![\[Equation in text-form: h^T\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-28b.png) , the sum of columns of ![\[Equation in text-form: H_1,..,H_T\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-29b.png) (`T` being the total number of mini-batches), and `s`, the sum of all input rows. After processing the entire shard of data, the worker sends the server `B`, `h`, `s`, and `n` (the number of input rows).

Denote the different inputs to the server as ![\[Equation in text-form: B^1, h^1, s^1, n^1,…\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-30b.png). The server computes `B`, `h`, `s`, and `n`, the sums of the respective inputs. It then computes ![\[Equation in text-form: C = B – (1/n) h^T s\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-31b.png) , and finds its singular value decomposition. The top right singular vectors and singular values of `C` are used as the approximate solution to the problem.
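A simplified single-batch illustration of this sketching idea, using a random sign matrix. The shapes and data are hypothetical, and the distributed bookkeeping with `h`, `s`, and `n` is omitted; the point is that the small sketch `B = H_t X_t` preserves the row space of the mini-batch:

```python
import numpy as np

rng = np.random.default_rng(1)
b, d, ell = 64, 20, 6                  # batch size, features, sketch size

# A mini-batch whose rows lie (exactly) in a 2-dimensional subspace.
basis = rng.normal(size=(2, d))
X = rng.normal(size=(b, 2)) @ basis

# Random sign matrix H_t of shape (ell, b), entries +-1.
H = rng.choice([-1.0, 1.0], size=(ell, b))

# The sketch B = H_t X_t is only (ell x d) but spans the same row space.
B = H @ X
_, _, vt = np.linalg.svd(B)
V = vt[:2].T                           # top 2 right singular vectors

# Projecting X onto that subspace reconstructs it (up to float error).
assert np.allclose(X, X @ V @ V.T)
```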

# PCA Hyperparameters

In the `CreateTrainingJob` request, you specify the training algorithm. You can also specify algorithm-specific HyperParameters as string-to-string maps. The following table lists the hyperparameters for the PCA training algorithm provided by Amazon SageMaker AI. For more information about how PCA works, see [How PCA Works](how-pca-works.md). 


| Parameter Name | Description | 
| --- | --- | 
| feature\_dim |  Input dimension. **Required** Valid values: positive integer  | 
| mini\_batch\_size |  Number of rows in a mini-batch. **Required** Valid values: positive integer  | 
| num\_components |  The number of principal components to compute. **Required** Valid values: positive integer  | 
| algorithm\_mode |  Mode for computing the principal components.  **Optional** Valid values: *regular* or *randomized* Default value: *regular*  | 
| extra\_components |  As the value increases, the solution becomes more accurate but the runtime and memory consumption increase linearly. The default, -1, means the maximum of 10 and `num_components`. Valid for *randomized* mode only. **Optional** Valid values: Non-negative integer or -1 Default value: -1  | 
| subtract\_mean |  Indicates whether the data should be unbiased both during training and at inference.  **Optional** Valid values: One of *true* or *false* Default value: *true*  | 
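
As a hedged illustration, the hyperparameters above would appear in a `CreateTrainingJob` request as a string-to-string map along the following lines (the values shown are placeholders, not recommendations):

```python
# Hyperparameters are passed as a string-to-string map in the
# CreateTrainingJob request; numeric and boolean values are given as strings.
pca_hyperparameters = {
    "feature_dim": "50",          # required: input dimension
    "mini_batch_size": "200",     # required: rows per mini-batch
    "num_components": "10",       # required: principal components to compute
    "algorithm_mode": "randomized",
    "extra_components": "-1",     # -1 means the maximum of 10 and num_components
    "subtract_mean": "true",
}
```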

# PCA Response Formats

All Amazon SageMaker AI built-in algorithms adhere to the common input inference format described in [Common Data Formats - Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html). This topic contains a list of the available output formats for the SageMaker AI PCA algorithm.

## JSON Response Format


Accept: application/json

```
{
    "projections": [
        {
            "projection": [1.0, 2.0, 3.0, 4.0, 5.0]
        },
        {
            "projection": [6.0, 7.0, 8.0, 9.0, 0.0]
        },
        ....
    ]
}
```
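
For reference, the response above can be parsed with nothing more than the standard `json` module:

```python
import json

# The body string below mirrors the JSON response format shown above.
body = ('{"projections": [{"projection": [1.0, 2.0, 3.0, 4.0, 5.0]}, '
        '{"projection": [6.0, 7.0, 8.0, 9.0, 0.0]}]}')
# Extract each record's projection as a plain list of floats
projections = [p["projection"] for p in json.loads(body)["projections"]]
```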

## JSONLINES Response Format


Accept: application/jsonlines

```
{ "projection": [1.0, 2.0, 3.0, 4.0, 5.0] }
{ "projection": [6.0, 7.0, 8.0, 9.0, 0.0] }
```

## RECORDIO Response Format


Accept: application/x-recordio-protobuf

```
[
    Record = {
        features = {},
        label = {
            'projection': {
                keys: [],
                values: [1.0, 2.0, 3.0, 4.0, 5.0]
            }
        }
    },
    Record = {
        features = {},
        label = {
            'projection': {
                keys: [],
                values: [6.0, 7.0, 8.0, 9.0, 0.0]
            }
        }
    }  
]
```

# Random Cut Forest (RCF) Algorithm


Amazon SageMaker AI Random Cut Forest (RCF) is an unsupervised algorithm for detecting anomalous data points within a data set. These are observations which diverge from otherwise well-structured or patterned data. Anomalies can manifest as unexpected spikes in time series data, breaks in periodicity, or unclassifiable data points. They are easy to describe in that, when viewed in a plot, they are often easily distinguishable from the "regular" data. Including these anomalies in a data set can drastically increase the complexity of a machine learning task since the "regular" data can often be described with a simple model.

With each data point, RCF associates an anomaly score. Low score values indicate that the data point is considered "normal." High values indicate the presence of an anomaly in the data. The definitions of "low" and "high" depend on the application but common practice suggests that scores beyond three standard deviations from the mean score are considered anomalous.

While there are many applications of anomaly detection algorithms to one-dimensional time series data, such as traffic volume analysis or sound volume spike detection, RCF is designed to work with arbitrary-dimensional input. Amazon SageMaker AI RCF scales well with respect to the number of features, the data set size, and the number of instances.

**Topics**
+ [Input/Output Interface for the RCF Algorithm](#rcf-input_output)
+ [Instance Recommendations for the RCF Algorithm](#rcf-instance-recommend)
+ [RCF Sample Notebooks](#rcf-sample-notebooks)
+ [How RCF Works](rcf_how-it-works.md)
+ [RCF Hyperparameters](rcf_hyperparameters.md)
+ [Tune an RCF Model](random-cut-forest-tuning.md)
+ [RCF Response Formats](rcf-in-formats.md)

## Input/Output Interface for the RCF Algorithm


Amazon SageMaker AI Random Cut Forest supports the `train` and `test` data channels. The optional test channel is used to compute accuracy, precision, recall, and F1-score metrics on labeled data. Train and test data content types can be either `application/x-recordio-protobuf` or `text/csv`. For the test data, when using the `text/csv` format, the content type must be specified as `text/csv;label_size=1`, where the first column of each row represents the anomaly label: "1" for an anomalous data point and "0" for a normal data point. You can use either File mode or Pipe mode to train RCF models on data that is formatted as `recordIO-wrapped-protobuf` or as `CSV`.

The train channel only supports `S3DataDistributionType=ShardedByS3Key` and the test channel only supports `S3DataDistributionType=FullyReplicated`. The following example specifies the S3 distribution type for the train channel using the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html).

**Note**  
The `sagemaker.inputs.s3_input` method was renamed to `sagemaker.inputs.TrainingInput` in [SageMaker Python SDK v2](https://sagemaker.readthedocs.io/en/stable/v2.html#s3-input).

```
  import sagemaker
    
  # specify Random Cut Forest training job information and hyperparameters
  rcf = sagemaker.estimator.Estimator(...)
    
  # explicitly specify "ShardedByS3Key" distribution type
  train_data = sagemaker.inputs.TrainingInput(
       s3_data=s3_training_data_location,
       content_type='text/csv;label_size=0',
       distribution='ShardedByS3Key')
    
  # run the training job on input data stored in S3
  rcf.fit({'train': train_data})
```

To avoid common errors around execution roles, ensure that you have the required execution roles, `AmazonSageMakerFullAccess` and `AmazonEC2ContainerRegistryFullAccess`. To avoid errors around your image not existing or its permissions being incorrect, ensure that your image is not larger than the allocated disk space on the training instance; run your training job on an instance that has sufficient disk space. In addition, if your image is in a different AWS account's Amazon Elastic Container Registry (Amazon ECR) repository and you do not set repository permissions to grant access, training fails with an error. For more information on setting a repository policy statement, see [ECR repository permissions](https://docs.aws.amazon.com/AmazonECR/latest/userguide/set-repository-policy.html).

For more information on customizing the S3 data source attributes, see [S3DataSource](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_S3DataSource.html). Finally, to take advantage of multi-instance training, the training data must be partitioned into at least as many files as instances.

For inference, RCF supports the `application/x-recordio-protobuf`, `text/csv`, and `application/json` input data content types. See the [Parameters for Built-in Algorithms](common-info-all-im-models.md) documentation for more information. RCF inference returns `application/x-recordio-protobuf` or `application/json` formatted output. Each record in the output contains the anomaly score for the corresponding input data point. See [Common Data Formats - Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html) for more information.

For more information on input and output file formats, see [RCF Response Formats](rcf-in-formats.md) for inference and the [RCF Sample Notebooks](#rcf-sample-notebooks).

## Instance Recommendations for the RCF Algorithm


For training, we recommend the `ml.m4`, `ml.c4`, and `ml.c5` instance families. For inference, we recommend the `ml.c5.xl` instance type in particular, for maximum performance and minimal cost per hour of usage. Although the algorithm could technically run on GPU instance types, it does not take advantage of GPU hardware.

## RCF Sample Notebooks

For an example of how to train an RCF model and perform inferences with it, see the [An Introduction to SageMaker AI Random Cut Forests](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/random_cut_forest/random_cut_forest.html) notebook. For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, choose the **SageMaker AI Examples** tab to see a list of all the SageMaker AI samples. To open a notebook, choose its **Use** tab and select **Create copy**.

For a blog post on using the RCF algorithm, see [Use the built-in Amazon SageMaker AI Random Cut Forest algorithm for anomaly detection](https://aws.amazon.com/blogs/machine-learning/use-the-built-in-amazon-sagemaker-random-cut-forest-algorithm-for-anomaly-detection/).

# How RCF Works

Amazon SageMaker AI Random Cut Forest (RCF) is an unsupervised algorithm for detecting anomalous data points within a dataset. These are observations which diverge from otherwise well-structured or patterned data. Anomalies can manifest as unexpected spikes in time series data, breaks in periodicity, or unclassifiable data points. They are easy to describe in that, when viewed in a plot, they are often easily distinguishable from the "regular" data. Including these anomalies in a dataset can drastically increase the complexity of a machine learning task since the "regular" data can often be described with a simple model.

The main idea behind the RCF algorithm is to create a forest of trees, where each tree is obtained from a partition of a sample of the training data. A random sample of the input data is first determined. The random sample is then partitioned according to the number of trees in the forest. Each tree is given such a partition and organizes that subset of points into a k-d tree. The anomaly score assigned to a data point by the tree is defined as the expected change in complexity of the tree as a result of adding that point to the tree, which, in approximation, is inversely proportional to the resulting depth of the point in the tree. The random cut forest assigns an anomaly score by computing the average score from each constituent tree and scaling the result with respect to the sample size. The RCF algorithm is based on the one described in reference [1].

## Sample Data Randomly


The first step in the RCF algorithm is to obtain a random sample of the training data. In particular, suppose we want a sample of size `K` from `N` total data points. If the training data is small enough, the entire dataset can be used, and we could randomly draw `K` elements from this set. However, frequently the training data is too large to fit all at once, and this approach isn't feasible. Instead, we use a technique called reservoir sampling.

[Reservoir sampling](https://en.wikipedia.org/wiki/Reservoir_sampling) is an algorithm for efficiently drawing random samples from a dataset `S = {S_1, ..., S_N}` where the elements in the dataset can only be observed one at a time or in batches. In fact, reservoir sampling works even when `N` is not known *a priori*. If only one sample is requested, as when `K = 1`, the algorithm is as follows:

**Algorithm: Reservoir Sampling**
+ Input: dataset or data stream `S = {S_1, ..., S_N}`
+ Initialize the random sample `X = S_1`
+ For each observed sample `S_n, n = 2, ..., N`:
  + Pick a uniform random number `ξ ∈ [0,1]`
  + If `ξ < 1/n`:
    + Set `X = S_n`
+ Return `X`

This algorithm selects a random sample such that `P(X = S_n) = 1/N` for all `n = 1, ..., N`. When `K > 1` the algorithm is more complicated. Additionally, a distinction must be made between random sampling with and without replacement. RCF performs an augmented reservoir sampling without replacement on the training data, based on the algorithms described in [2].
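
The `K = 1` algorithm above is only a few lines of Python. This is a sketch of the basic technique, not the augmented variant RCF actually uses:

```python
import random

def reservoir_sample(stream, seed=None):
    """Reservoir sampling for K = 1: each element S_n replaces the current
    sample with probability 1/n, so every element ends up selected with
    probability 1/N overall."""
    rng = random.Random(seed)
    sample = None
    for n, item in enumerate(stream, start=1):
        if rng.random() < 1.0 / n:
            sample = item
    return sample
```

Because the stream is consumed one element at a time, this works even when the total number of elements is unknown in advance.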

## Train an RCF Model and Produce Inferences


The next step in RCF is to construct a random cut forest using the random sample of data. First, the sample is partitioned into a number of equal-sized partitions equal to the number of trees in the forest. Then, each partition is sent to an individual tree. The tree recursively organizes its partition into a binary tree by partitioning the data domain into bounding boxes.

This procedure is best illustrated with an example. Suppose a tree is given the following two-dimensional dataset. The corresponding tree is initialized to the root node:

![\[A two-dimensional dataset.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/RCF1.jpg)


Figure: A two-dimensional dataset where the majority of data lies in a cluster (blue) except for one anomalous data point (orange). The tree is initialized with a root node.

The RCF algorithm organizes these data in a tree by first computing a bounding box of the data, selecting a random dimension (giving more weight to dimensions with higher "variance"), and then randomly determining the position of a hyperplane "cut" through that dimension. The two resulting subspaces define their own subtrees. In this example, the cut happens to separate a lone point from the remainder of the sample. The first level of the resulting binary tree consists of two nodes, one representing the subtree of points to the left of the initial cut and the other representing the single point on the right.

![\[A random cut partitioning the two-dimensional dataset.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/RCF2.jpg)


Figure: A random cut partitioning the two-dimensional dataset. An anomalous data point is more likely to lie isolated in a bounding box at a smaller tree depth than other points. 

Bounding boxes are then computed for the left and right halves of the data and the process is repeated until every leaf of the tree represents a single data point from the sample. Note that if the lone point is sufficiently far away then it is more likely that a random cut would result in point isolation. This observation provides the intuition that tree depth is, loosely speaking, inversely proportional to the anomaly score.
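
The recursive partitioning described above can be sketched as follows. This is a toy illustration, not the SageMaker implementation; the `random_cut_tree` function and its dictionary layout are invented for this example.

```python
import random

def random_cut_tree(points, rng=None):
    """Toy sketch of recursive random-cut partitioning: pick a dimension with
    probability proportional to its bounding-box side length, then a uniform
    cut position, and recurse until each leaf holds a single point."""
    rng = rng or random.Random(0)
    if len(points) <= 1:
        return {"leaf": list(points)}
    dims = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dims)]
    hi = [max(p[d] for p in points) for d in range(dims)]
    lengths = [hi[d] - lo[d] for d in range(dims)]
    if not any(lengths):               # all points identical: stop splitting
        return {"leaf": list(points)}
    # Choose a dimension weighted by bounding-box side length ("variance" proxy)
    d = rng.choices(range(dims), weights=lengths)[0]
    cut = rng.uniform(lo[d], hi[d])
    left = [p for p in points if p[d] <= cut]
    right = [p for p in points if p[d] > cut]
    if not left or not right:          # degenerate cut; try again
        return random_cut_tree(points, rng)
    return {"cut": (d, cut),
            "left": random_cut_tree(left, rng),
            "right": random_cut_tree(right, rng)}
```

In this sketch, an isolated point far from the cluster tends to be separated by an early cut and therefore ends up at a small depth, which is the intuition behind the anomaly score.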

When performing inference with a trained RCF model, the final anomaly score is reported as the average of the scores reported by each tree. Note that it is often the case that the new data point does not already reside in the tree. To determine the score associated with the new point, the data point is inserted into the given tree and the tree is efficiently (and temporarily) reassembled in a manner equivalent to the training process described above. That is, the resulting tree is as if the input data point had been a member of the sample used to construct the tree in the first place. The reported score is inversely proportional to the depth of the input point within the tree.

## Choose Hyperparameters


The primary hyperparameters used to tune the RCF model are `num_trees` and `num_samples_per_tree`. Increasing `num_trees` has the effect of reducing the noise observed in anomaly scores, since the final score is the average of the scores reported by each tree. While the optimal value is application-dependent, we recommend starting with 100 trees as a balance between score noise and model complexity. Note that inference time is proportional to the number of trees. Although training time is also affected, it is dominated by the reservoir sampling algorithm described above.

The parameter `num_samples_per_tree` is related to the expected density of anomalies in the dataset. In particular, `num_samples_per_tree` should be chosen such that `1/num_samples_per_tree` approximates the ratio of anomalous data to normal data. For example, if 256 samples are used in each tree then we expect our data to contain anomalies 1/256 or approximately 0.4% of the time. Again, an optimal value for this hyperparameter is dependent on the application.
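
That rule of thumb can be written down directly (the helper name is ours, and the result is a starting point rather than an optimal value):

```python
def suggest_samples_per_tree(expected_anomaly_ratio):
    """Pick num_samples_per_tree so that 1/num_samples_per_tree approximates
    the expected fraction of anomalous points in the data."""
    return max(1, round(1.0 / expected_anomaly_ratio))

# For example, if roughly 0.4% of points are expected to be anomalous:
samples = suggest_samples_per_tree(0.004)   # 250, close to the default of 256
```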

## References


1.  Sudipto Guha, Nina Mishra, Gourav Roy, and Okke Schrijvers. "Robust random cut forest based anomaly detection on streams." In *International Conference on Machine Learning*, pp. 2712-2721. 2016.

1.  Byung-Hoon Park, George Ostrouchov, Nagiza F. Samatova, and Al Geist. "Reservoir-based random sampling with replacement from data stream." In *Proceedings of the 2004 SIAM International Conference on Data Mining*, pp. 492-496. Society for Industrial and Applied Mathematics, 2004.

# RCF Hyperparameters

In the [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request, you specify the training algorithm. You can also specify algorithm-specific hyperparameters as string-to-string maps. The following table lists the hyperparameters for the Amazon SageMaker AI RCF algorithm. For more information, including recommendations on how to choose hyperparameters, see [How RCF Works](rcf_how-it-works.md).




| Parameter Name | Description | 
| --- | --- | 
| feature\_dim |  The number of features in the data set. (If you use the [Random Cut Forest](https://sagemaker.readthedocs.io/en/stable/algorithms/unsupervised/randomcutforest.html) estimator, this value is calculated for you and need not be specified.) **Required** Valid values: Positive integer (min: 1, max: 10000)  | 
| eval\_metrics |  A list of metrics used to score a labeled test data set. The following metrics can be selected for output: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/rcf_hyperparameters.html) **Optional** Valid values: a list with possible values taken from `accuracy` or `precision_recall_fscore`.  Default value: Both `accuracy` and `precision_recall_fscore` are calculated.  | 
| num\_samples\_per\_tree |  Number of random samples given to each tree from the training data set. **Optional** Valid values: Positive integer (min: 1, max: 2048) Default value: 256  | 
| num\_trees |  Number of trees in the forest. **Optional** Valid values: Positive integer (min: 50, max: 1000) Default value: 100  | 

# Tune an RCF Model

*Automatic model tuning*, also known as hyperparameter tuning or hyperparameter optimization, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the objective metric.

The Amazon SageMaker AI RCF algorithm is an unsupervised anomaly-detection algorithm that requires a labeled test dataset for hyperparameter optimization. RCF calculates anomaly scores for test data points and then labels the data points as anomalous if their scores are beyond three standard deviations from the mean score. This is known as the three-sigma limit heuristic. The F1-score is based on the difference between calculated labels and actual labels. The hyperparameter tuning job finds the model that maximizes that score. The success of hyperparameter optimization depends on the applicability of the three-sigma limit heuristic to the test dataset.
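
The three-sigma labeling heuristic described above can be sketched as follows (assuming higher scores indicate anomalies, per the description of the algorithm; the function name is ours):

```python
import statistics

def three_sigma_labels(scores):
    """Label a point anomalous (1) when its anomaly score is more than three
    standard deviations above the mean score; otherwise normal (0)."""
    mean = statistics.mean(scores)
    std = statistics.pstdev(scores)
    return [1 if s > mean + 3 * std else 0 for s in scores]
```

The F1-score used as the tuning objective is then computed by comparing these calculated labels with the actual labels in the test dataset.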

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics Computed by the RCF Algorithm

The RCF algorithm computes the following metric during training. When tuning the model, choose this metric as the objective metric.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| test:f1 | F1-score on the test dataset, based on the difference between calculated labels and actual labels. | Maximize | 

## Tunable RCF Hyperparameters

You can tune an RCF model with the following hyperparameters.


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| num\_samples\_per\_tree | IntegerParameterRanges | MinValue: 1, MaxValue: 2048 | 
| num\_trees | IntegerParameterRanges | MinValue: 50, MaxValue: 1000 | 

# RCF Response Formats

All Amazon SageMaker AI built-in algorithms adhere to the common input inference format described in [Common Data Formats - Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html). Note that SageMaker AI Random Cut Forest supports both dense and sparse JSON and RecordIO formats. This topic contains a list of the available output formats for the SageMaker AI RCF algorithm.

## JSON Response Format


Accept: application/json

```
{
    "scores": [
        {"score": 0.02},
        {"score": 0.25}
    ]
}
```

## JSONLINES Response Format


Accept: application/jsonlines

```
{"score": 0.02}
{"score": 0.25}
```

## RECORDIO Response Format


Accept: application/x-recordio-protobuf

```
[
    Record = {
        features = {},
        label = {
            'score': {
                keys: [],
                values: [0.25]  # float32
            }
        }
    },
    Record = {
        features = {},
        label = {
            'score': {
                keys: [],
                values: [0.23]  # float32
            }
        }
    }
]
```

# Built-in SageMaker AI Algorithms for Computer Vision

SageMaker AI provides image processing algorithms that are used for computer vision tasks such as image classification, object detection, and semantic segmentation.
+ [Image Classification - MXNet](image-classification.md)—uses example data with answers (referred to as a *supervised algorithm*). Use this algorithm to classify images.
+ [Image Classification - TensorFlow](image-classification-tensorflow.md)—uses pretrained TensorFlow Hub models to fine-tune for specific tasks (referred to as a *supervised algorithm*). Use this algorithm to classify images.
+ [Object Detection - MXNet](object-detection.md)—detects and classifies objects in images using a single deep neural network. It is a supervised learning algorithm that takes images as input and identifies all instances of objects within the image scene.
+ [Object Detection - TensorFlow](object-detection-tensorflow.md)—detects bounding boxes and object labels in an image. It is a supervised learning algorithm that supports transfer learning with available pretrained TensorFlow models.
+ [Semantic Segmentation Algorithm](semantic-segmentation.md)—provides a fine-grained, pixel-level approach to developing computer vision applications.


| Algorithm name | Channel name | Training input mode | File type | Instance class | Parallelizable | 
| --- | --- | --- | --- | --- | --- | 
| Image Classification - MXNet | train and validation, (optionally) train\_lst, validation\_lst, and model | File or Pipe | recordIO or image files (.jpg or .png)  | GPU | Yes | 
| Image Classification - TensorFlow | training and validation | File | image files (.jpg, .jpeg, or .png)  | CPU or GPU | Yes (only across multiple GPUs on a single instance) | 
| Object Detection - MXNet | train and validation, (optionally) train\_annotation, validation\_annotation, and model | File or Pipe | recordIO or image files (.jpg or .png)  | GPU | Yes | 
| Object Detection - TensorFlow | training and validation | File | image files (.jpg, .jpeg, or .png)  | GPU | Yes (only across multiple GPUs on a single instance) | 
| Semantic Segmentation | train and validation, train\_annotation, validation\_annotation, and (optionally) label\_map and model | File or Pipe | Image files | GPU (single instance only) | No | 

# Image Classification - MXNet

The Amazon SageMaker image classification algorithm is a supervised learning algorithm that supports multi-label classification. It takes an image as input and outputs one or more labels assigned to that image. It uses a convolutional neural network that can be trained from scratch, or trained using transfer learning when a large number of training images is not available.

The recommended input format for the Amazon SageMaker AI image classification algorithms is Apache MXNet [RecordIO](https://mxnet.apache.org/api/faq/recordio). However, you can also use raw images in .jpg or .png format. Refer to [this discussion](https://mxnet.apache.org/api/architecture/note_data_loading) for a broad overview of efficient data preparation and loading for machine learning systems. 

**Note**  
To maintain better interoperability with existing deep learning frameworks, this differs from the protobuf data formats commonly used by other Amazon SageMaker AI algorithms.

For more information on convolutional networks, see: 
+ [Deep residual learning for image recognition](https://arxiv.org/abs/1512.03385) Kaiming He, et al., 2016 IEEE Conference on Computer Vision and Pattern Recognition
+ [ImageNet image database](http://www.image-net.org/)
+ [Image classification with Gluon-CV and MXNet](https://gluon-cv.mxnet.io/build/examples_classification/index.html)

**Topics**
+ [Input/Output Interface for the Image Classification Algorithm](#IC-inputoutput)
+ [EC2 Instance Recommendation for the Image Classification Algorithm](#IC-instances)
+ [Image Classification Sample Notebooks](#IC-sample-notebooks)
+ [How Image Classification Works](IC-HowItWorks.md)
+ [Image Classification Hyperparameters](IC-Hyperparameter.md)
+ [Tune an Image Classification Model](IC-tuning.md)

## Input/Output Interface for the Image Classification Algorithm


The SageMaker AI Image Classification algorithm supports both RecordIO (`application/x-recordio`) and image (`image/png`, `image/jpeg`, and `application/x-image`) content types for training in file mode, and supports the RecordIO (`application/x-recordio`) content type for training in pipe mode. However, you can also train in pipe mode using the image files (`image/png`, `image/jpeg`, and `application/x-image`), without creating RecordIO files, by using the augmented manifest format.

Distributed training is supported for file mode and pipe mode. When using the RecordIO content type in pipe mode, you must set the `S3DataDistributionType` of the `S3DataSource` to `FullyReplicated`. The algorithm supports a fully replicated model where your data is copied onto each machine.

The algorithm supports `image/png`, `image/jpeg`, and `application/x-image` for inference.

### Train with RecordIO Format


If you use the RecordIO format for training, specify both `train` and `validation` channels as values for the `InputDataConfig` parameter of the [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request. Specify one RecordIO (`.rec`) file in the `train` channel and one RecordIO file in the `validation` channel. Set the content type for both channels to `application/x-recordio`. 

### Train with Image Format


If you use the Image format for training, specify `train`, `validation`, `train_lst`, and `validation_lst` channels as values for the `InputDataConfig` parameter of the [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request. Specify the individual image data (`.jpg` or `.png` files) for the `train` and `validation` channels. Specify one `.lst` file in each of the `train_lst` and `validation_lst` channels. Set the content type for all four channels to `application/x-image`. 

**Note**  
SageMaker AI reads the training and validation data separately from different channels, so you must store the training and validation data in different folders.

A `.lst` file is a tab-separated file with three columns that contains a list of image files. The first column specifies the image index, the second column specifies the class label index for the image, and the third column specifies the relative path of the image file. The image index in the first column must be unique across all of the images. The set of class label indices are numbered successively and the numbering should start with 0. For example, 0 for the cat class, 1 for the dog class, and so on for additional classes. 

 The following is an example of a `.lst` file: 

```
5      1   your_image_directory/train_img_dog1.jpg
1000   0   your_image_directory/train_img_cat1.jpg
22     1   your_image_directory/train_img_dog2.jpg
```

For example, if your training images are stored in `s3://<your_bucket>/train/class_dog`, `s3://<your_bucket>/train/class_cat`, and so on, specify the path for your `train` channel as `s3://<your_bucket>/train`, which is the top-level directory for your data. In the `.lst` file, specify the relative path for an individual file named `train_image_dog1.jpg` in the `class_dog` class directory as `class_dog/train_image_dog1.jpg`. You can also store all your image files under one subdirectory inside the `train` directory. In that case, use that subdirectory for the relative path. For example, `s3://<your_bucket>/train/your_image_directory`. 
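A `.lst` file in this layout can be generated with a short script. The following is a minimal sketch, assuming a local copy of the class-per-folder directory structure described above; the `build_lst` helper and the folder names are illustrative, not part of the algorithm:

```python
import os

def build_lst(root_dir, class_to_label):
    """Build .lst rows (unique image index, class label index, relative path)
    from a class-per-subdirectory layout such as train/class_dog/*.jpg."""
    rows = []
    index = 0
    for class_name in sorted(class_to_label):
        class_dir = os.path.join(root_dir, class_name)
        for fname in sorted(os.listdir(class_dir)):
            if fname.lower().endswith((".jpg", ".png")):
                rows.append(f"{index}\t{class_to_label[class_name]}\t{class_name}/{fname}")
                index += 1
    return rows

def write_lst(rows, out_path):
    # The algorithm expects a tab-separated file, one image per line
    with open(out_path, "w") as f:
        f.write("\n".join(rows) + "\n")
```

Upload the resulting file to the S3 prefix used for the `train_lst` or `validation_lst` channel.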

### Train with Augmented Manifest Image Format


The augmented manifest format enables you to train in Pipe mode using image files without creating RecordIO files. Specify both `train` and `validation` channels as values for the `InputDataConfig` parameter of the [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request. With this format, you must generate an S3 manifest file that contains the list of images and their corresponding annotations. The manifest file must be in [JSON Lines](http://jsonlines.org/) format, in which each line represents one sample. The images are specified using the `"source-ref"` tag, which points to the S3 location of the image. The annotations are provided under the `"AttributeNames"` parameter value specified in the `CreateTrainingJob` request. The file can also contain additional metadata under the `metadata` tag, but the algorithm ignores it. In the following example, the `"AttributeNames"` are contained in the list of image and annotation references `["source-ref", "class"]`. The corresponding label value is `"0"` for the first image and `"1"` for the second image:

```
{"source-ref":"s3://image/filename1.jpg", "class":"0"}
{"source-ref":"s3://image/filename2.jpg", "class":"1", "class-metadata": {"class-name": "cat", "type" : "groundtruth/image-classification"}}
```

The order of `"AttributeNames"` in the input files matters when training the Image Classification algorithm. The algorithm accepts piped data in a specific order, with `image` first, followed by `label`. So the `"AttributeNames"` in this example are provided with `"source-ref"` first, followed by `"class"`. When using the Image Classification algorithm with the augmented manifest format, the value of the `RecordWrapperType` parameter must be `"RecordIO"`.

Multi-label training is also supported by specifying a JSON array of values. The `num_classes` hyperparameter must be set to match the total number of classes. There are two valid label formats: multi-hot and class-id. 

In the multi-hot format, each label is a multi-hot encoded vector of all classes, where each class takes the value of 0 or 1. In the following example, there are three classes. The first image is labeled with classes 0 and 2, while the second image is labeled with class 2 only: 

```
{"image-ref": "s3://amzn-s3-demo-bucket/sample01/image1.jpg", "class": "[1, 0, 1]"}
{"image-ref": "s3://amzn-s3-demo-bucket/sample02/image2.jpg", "class": "[0, 0, 1]"}
```

In the class-id format, each label is a list of the class ids, from [0, `num_classes`), which apply to the data point. The previous example would instead look like this:

```
{"image-ref": "s3://amzn-s3-demo-bucket/sample01/image1.jpg", "class": "[0, 2]"}
{"image-ref": "s3://amzn-s3-demo-bucket/sample02/image2.jpg", "class": "[2]"}
```

The multi-hot format is the default, but can be set explicitly in the content type with the `label-format` parameter: `"application/x-recordio; label-format=multi-hot"`. The class-id format, which is the format output by Ground Truth, must be set explicitly: `"application/x-recordio; label-format=class-id"`.
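The two label formats carry the same information and can be converted into each other. As an illustrative sketch (the helper name is not part of the algorithm), the class-id labels above map to multi-hot vectors like this:

```python
def class_id_to_multi_hot(class_ids, num_classes):
    """Convert a class-id label (e.g. [0, 2]) into the equivalent
    multi-hot vector (e.g. [1, 0, 1]) over num_classes classes."""
    vector = [0] * num_classes
    for class_id in class_ids:
        vector[class_id] = 1
    return vector
```

For example, with three classes, `[0, 2]` becomes `[1, 0, 1]` and `[2]` becomes `[0, 0, 1]`, matching the two examples above.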

For more information on augmented manifest files, see [Augmented Manifest Files for Training Jobs](augmented-manifest.md).

### Incremental Training


You can also seed the training of a new model with the artifacts from a model that you trained previously with SageMaker AI. Incremental training saves training time when you want to train a new model with the same or similar data. SageMaker AI image classification models can be seeded only with another built-in image classification model trained in SageMaker AI.

To use a pretrained model, in the [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request, specify the `ChannelName` as "model" in the `InputDataConfig` parameter. Set the `ContentType` for the model channel to `application/x-sagemaker-model`. The input hyperparameters of both the new model and the pretrained model that you upload to the model channel must have the same settings for the `num_layers`, `image_shape`, and `num_classes` input parameters. These parameters define the network architecture. For the pretrained model file, use the compressed model artifacts (in .tar.gz format) output by SageMaker AI. You can use either RecordIO or image formats for input data.
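To sketch how the channels fit together, the following builds `InputDataConfig` entries for an incremental training request in the `CreateTrainingJob` format, with a `model` channel carrying the previous model's artifacts. The S3 URIs and the `make_input_data_config` helper are placeholders for illustration, and this sketch assumes RecordIO input for the data channels:

```python
def make_input_data_config(train_s3, validation_s3, model_s3):
    """Assemble InputDataConfig channels for incremental training."""
    def channel(name, s3_uri, content_type):
        return {
            "ChannelName": name,
            "ContentType": content_type,
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": s3_uri,
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
        }
    return [
        channel("train", train_s3, "application/x-recordio"),
        channel("validation", validation_s3, "application/x-recordio"),
        # The pretrained artifacts (model.tar.gz) go in the "model" channel
        channel("model", model_s3, "application/x-sagemaker-model"),
    ]
```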

### Inference with the Image Classification Algorithm


The generated models can be hosted for inference and support encoded `.jpg` and `.png` image formats with the `image/png`, `image/jpeg`, and `application/x-image` content types. The input image is resized automatically. The output is the probability values for all classes, encoded in JSON format, or in [JSON Lines text format](http://jsonlines.org/) for batch transform. The image classification model processes a single image per request, so it outputs only one line in the JSON or JSON Lines format. The following is an example of a response in JSON Lines format:

```
accept: application/jsonlines

{"prediction": [prob_0, prob_1, prob_2, prob_3, ...]}
```
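A response in this shape is plain JSON Lines, so post-processing takes only a few lines of code. The following sketch (the `parse_prediction` helper is illustrative, not part of any SDK) extracts the most probable class from a response body:

```python
import json

def parse_prediction(jsonlines_body):
    """Return (best_class_index, probability) from a one-line
    JSON Lines response like {"prediction": [0.1, 0.7, 0.2]}."""
    record = json.loads(jsonlines_body.strip().splitlines()[0])
    probabilities = record["prediction"]
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return best, probabilities[best]
```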

For more details on training and inference, see the image classification sample notebook instances referenced in the introduction.

## EC2 Instance Recommendation for the Image Classification Algorithm


For image classification, we support P2, P3, G4dn, and G5 instances. We recommend using GPU instances with more memory for training with large batch sizes. You can also run the algorithm on multi-GPU and multi-machine settings for distributed training. Both CPU (such as C4) and GPU (P2, P3, G4dn, or G5) instances can be used for inference.

## Image Classification Sample Notebooks

For a sample notebook that uses the SageMaker AI image classification algorithm, see [Build and Register an MXNet Image Classification Model via SageMaker Pipelines](https://github.com/aws-samples/amazon-sagemaker-pipelines-mxnet-image-classification/blob/main/image-classification-sagemaker-pipelines.ipynb). For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you create and open a notebook instance, select the **SageMaker AI Examples** tab to see a list of all the SageMaker AI samples. The example image classification notebooks are located in the **Introduction to Amazon algorithms** section. To open a notebook, choose its **Use** tab and select **Create copy**.

# How Image Classification Works

The image classification algorithm takes an image as input and classifies it into one of the output categories. Deep learning has revolutionized the image classification domain. Various deep learning networks, such as [ResNet](https://arxiv.org/abs/1512.03385), [DenseNet](https://arxiv.org/abs/1608.06993), and [Inception](https://arxiv.org/pdf/1409.4842.pdf), have been developed to be highly accurate for image classification. At the same time, there have been efforts to collect the labeled image data that is essential for training these networks. [ImageNet](https://www.image-net.org/) is one such large dataset, with more than 11 million images in about 11,000 categories. Once a network is trained with ImageNet data, it can be generalized to other datasets through fine-tuning. In this transfer learning approach, a network is initialized with weights (in this example, trained on ImageNet), which can later be fine-tuned for an image classification task on a different dataset. 

Image classification in Amazon SageMaker AI can be run in two modes: full training and transfer learning. In full training mode, the network is initialized with random weights and trained on user data from scratch. In transfer learning mode, the network is initialized with pre-trained weights and just the top fully connected layer is initialized with random weights. Then, the whole network is fine-tuned with new data. In this mode, training can be achieved even with a smaller dataset. This is because the network is already trained and therefore can be used in cases without sufficient training data.
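As an illustration of transfer learning mode, the following hyperparameter dictionary enables fine-tuning from a pretrained network. The specific values are hypothetical example choices, except where this guide documents a constraint (`num_layers` must come from the pretrained set, and pretrained models require a fixed 224 x 224 input size):

```python
# Hypothetical example values for transfer learning mode
transfer_learning_hyperparameters = {
    "use_pretrained_model": "1",   # start from pretrained weights
    "num_layers": "18",            # must be in {18, 34, 50, 101, 152, 200}
    "num_classes": "2",
    "num_training_samples": "1200",
    "image_shape": "3,224,224",    # fixed size required by pretrained models
    "epochs": "5",
    "learning_rate": "0.001",
}
```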

# Image Classification Hyperparameters

Hyperparameters are parameters that are set before a machine learning model begins learning. The following hyperparameters are supported by the Amazon SageMaker AI built-in Image Classification algorithm. See [Tune an Image Classification Model](IC-tuning.md) for information on image classification hyperparameter tuning. 


| Parameter Name | Description | 
| --- | --- | 
| num_classes | Number of output classes. This parameter defines the dimensions of the network output and is typically set to the number of classes in the dataset. Besides multi-class classification, multi-label classification is supported too. Refer to [Input/Output Interface for the Image Classification Algorithm](image-classification.md#IC-inputoutput) for details on how to work with multi-label classification with augmented manifest files. **Required** Valid values: positive integer  | 
| num_training_samples | Number of training examples in the input dataset. If there is a mismatch between this value and the number of samples in the training set, then the behavior of the `lr_scheduler_step` parameter is undefined and distributed training accuracy might be affected. **Required** Valid values: positive integer  | 
| augmentation_type |  Data augmentation type. The input images can be augmented in multiple ways as specified below. [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/IC-Hyperparameter.html) **Optional**  Valid values: `crop`, `crop_color`, or `crop_color_transform`. Default value: no default value  | 
| beta_1 | The beta1 for `adam`, that is, the exponential decay rate for the first moment estimates. **Optional**  Valid values: float. Range in [0, 1]. Default value: 0.9 | 
| beta_2 | The beta2 for `adam`, that is, the exponential decay rate for the second moment estimates. **Optional**  Valid values: float. Range in [0, 1]. Default value: 0.999 | 
| checkpoint_frequency | Period to store model parameters (in number of epochs). Note that all checkpoint files are saved as part of the final model file "model.tar.gz" and uploaded to S3 to the specified model location. This increases the size of the model file proportionally to the number of checkpoints saved during training. **Optional** Valid values: positive integer no greater than `epochs`. Default value: no default value (save a checkpoint at the epoch that has the best validation accuracy) | 
| early_stopping | `True` to use early stopping logic during training. `False` not to use it. **Optional** Valid values: `True` or `False` Default value: `False` | 
| early_stopping_min_epochs | The minimum number of epochs that must be run before the early stopping logic can be invoked. It is used only when `early_stopping` = `True`. **Optional** Valid values: positive integer Default value: 10 | 
| early_stopping_patience | The number of epochs to wait before ending training if no improvement is made in the relevant metric. It is used only when `early_stopping` = `True`. **Optional** Valid values: positive integer Default value: 5 | 
| early_stopping_tolerance | Relative tolerance to measure an improvement in the accuracy validation metric. If the ratio of the improvement in accuracy divided by the previous best accuracy is smaller than the `early_stopping_tolerance` value set, early stopping considers there is no improvement. It is used only when `early_stopping` = `True`. **Optional** Valid values: 0 ≤ float ≤ 1 Default value: 0.0 | 
| epochs | Number of training epochs. **Optional** Valid values: positive integer Default value: 30 | 
| eps | The epsilon for `adam` and `rmsprop`. It is usually set to a small value to avoid division by 0. **Optional** Valid values: float. Range in [0, 1]. Default value: 1e-8 | 
| gamma | The gamma for `rmsprop`, the decay factor for the moving average of the squared gradient. **Optional** Valid values: float. Range in [0, 1]. Default value: 0.9 | 
| image_shape | The input image dimensions, which is the same size as the input layer of the network. The format is defined as '`num_channels`, height, width'. The image dimension can take on any value because the network can handle varied dimensions of the input. However, there might be memory constraints if a larger image dimension is used. Pretrained models can use only a fixed 224 x 224 image size. Typical image dimensions for image classification are '3,224,224'. This is similar to the ImageNet dataset.  For training, if any input image is smaller than this parameter in any dimension, training fails. If an image is larger, a portion of the image is cropped, with the cropped area specified by this parameter. If the `augmentation_type` hyperparameter is set, a random crop is taken; otherwise, a central crop is taken.  At inference, input images are resized to the `image_shape` that was used during training. Aspect ratio is not preserved, and images are not cropped. **Optional** Valid values: string Default value: '3,224,224' | 
| kv_store |  Weight update synchronization mode during distributed training. The weight updates can be updated either synchronously or asynchronously across machines. Synchronous updates typically provide better accuracy than asynchronous updates but can be slower. See distributed training in MXNet for more details. This parameter is not applicable to single machine training. [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/IC-Hyperparameter.html) **Optional** Valid values: `dist_sync` or `dist_async` Default value: no default value  | 
| learning_rate | Initial learning rate. **Optional** Valid values: float. Range in [0, 1]. Default value: 0.1 | 
| lr_scheduler_factor | The ratio to reduce the learning rate, used in conjunction with the `lr_scheduler_step` parameter, defined as `lr_new` = `lr_old` \* `lr_scheduler_factor`. **Optional** Valid values: float. Range in [0, 1]. Default value: 0.1 | 
| lr_scheduler_step | The epochs at which to reduce the learning rate. As explained in the `lr_scheduler_factor` parameter, the learning rate is reduced by `lr_scheduler_factor` at these epochs. For example, if the value is set to "10, 20", then the learning rate is reduced by `lr_scheduler_factor` after the 10th epoch and again by `lr_scheduler_factor` after the 20th epoch. The epochs are delimited by ",". **Optional** Valid values: string Default value: no default value | 
| mini_batch_size | The batch size for training. In a single-machine multi-GPU setting, each GPU handles `mini_batch_size`/`num_gpu` training samples. For multi-machine training in `dist_sync` mode, the actual batch size is `mini_batch_size` \* the number of machines. See the MXNet docs for more details. **Optional** Valid values: positive integer Default value: 32 | 
| momentum | The momentum for `sgd` and `nag`, ignored for other optimizers. **Optional** Valid values: float. Range in [0, 1]. Default value: 0.9 | 
| multi_label |  Flag to use for multi-label classification, where each sample can be assigned multiple labels. Average accuracy across all classes is logged. **Optional** Valid values: 0 or 1 Default value: 0  | 
| num_layers | Number of layers for the network. For data with a large image size (for example, 224x224, like ImageNet), we suggest selecting the number of layers from the set [18, 34, 50, 101, 152, 200]. For data with a small image size (for example, 28x28, like CIFAR), we suggest selecting the number of layers from the set [20, 32, 44, 56, 110]. The number of layers in each set is based on the ResNet paper. For transfer learning, the number of layers defines the architecture of the base network and hence can only be selected from the set [18, 34, 50, 101, 152, 200]. **Optional** Valid values: positive integer in [18, 34, 50, 101, 152, 200] or [20, 32, 44, 56, 110] Default value: 152 | 
| optimizer | The optimizer type. For more details of the parameters for the optimizers, refer to MXNet's API. **Optional** Valid values: One of `sgd`, `adam`, `rmsprop`, or `nag`. [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/IC-Hyperparameter.html) Default value: `sgd` | 
| precision_dtype | The precision of the weights used for training. The algorithm can use either single precision (`float32`) or half precision (`float16`) for the weights. Using half precision for weights results in reduced memory consumption. **Optional** Valid values: `float32` or `float16` Default value: `float32` | 
| resize | The number of pixels in the shortest side of an image after resizing it for training. If the parameter is not set, then the training data is used without resizing. The parameter should be larger than both the width and height components of `image_shape` to prevent training failure. **Required** when using image content types **Optional** when using the RecordIO content type Valid values: positive integer Default value: no default value  | 
| top_k | Reports the top-k accuracy during training. This parameter has to be greater than 1, since the top-1 training accuracy is the same as the regular training accuracy that has already been reported. **Optional** Valid values: positive integer larger than 1. Default value: no default value | 
| use_pretrained_model | Flag to use a pretrained model for training. If set to 1, then the pretrained model with the corresponding number of layers is loaded and used for training. Only the top fully connected layer is reinitialized with random weights. Otherwise, the network is trained from scratch. **Optional** Valid values: 0 or 1 Default value: 0 | 
| use_weighted_loss |  Flag to use weighted cross-entropy loss for multi-label classification (used only when `multi_label` = 1), where the weights are calculated based on the distribution of classes. **Optional** Valid values: 0 or 1 Default value: 0  | 
| weight_decay | The coefficient of weight decay for `sgd` and `nag`, ignored for other optimizers. **Optional** Valid values: float. Range in [0, 1]. Default value: 0.0001 | 

# Tune an Image Classification Model

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the objective metric.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics Computed by the Image Classification Algorithm

The image classification algorithm is a supervised algorithm. It reports an accuracy metric that is computed during training. When tuning the model, choose this metric as the objective metric.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| validation:accuracy | The ratio of the number of correct predictions to the total number of predictions made. | Maximize | 

## Tunable Image Classification Hyperparameters

Tune an image classification model with the following hyperparameters. The hyperparameters that have the greatest impact on image classification objective metrics are: `mini_batch_size`, `learning_rate`, and `optimizer`. Tune the optimizer-related hyperparameters, such as `momentum`, `weight_decay`, `beta_1`, `beta_2`, `eps`, and `gamma`, based on the selected `optimizer`. For example, use `beta_1` and `beta_2` only when `adam` is the `optimizer`.

For more information about which hyperparameters are used in each optimizer, see [Image Classification Hyperparameters](IC-Hyperparameter.md).


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| beta_1 | ContinuousParameterRanges | MinValue: 1e-6, MaxValue: 0.999 | 
| beta_2 | ContinuousParameterRanges | MinValue: 1e-6, MaxValue: 0.999 | 
| eps | ContinuousParameterRanges | MinValue: 1e-8, MaxValue: 1.0 | 
| gamma | ContinuousParameterRanges | MinValue: 1e-8, MaxValue: 0.999 | 
| learning_rate | ContinuousParameterRanges | MinValue: 1e-6, MaxValue: 0.5 | 
| mini_batch_size | IntegerParameterRanges | MinValue: 8, MaxValue: 512 | 
| momentum | ContinuousParameterRanges | MinValue: 0.0, MaxValue: 0.999 | 
| optimizer | CategoricalParameterRanges | ['sgd', 'adam', 'rmsprop', 'nag'] | 
| weight_decay | ContinuousParameterRanges | MinValue: 0.0, MaxValue: 0.999 | 
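The recommended ranges above can be expressed in the shape that the `CreateHyperParameterTuningJob` API expects. The following is a sketch covering a subset of the table, together with the objective metric from the previous section:

```python
# Parameter ranges mirroring a subset of the recommended values above
tuning_parameter_ranges = {
    "ContinuousParameterRanges": [
        {"Name": "learning_rate", "MinValue": "1e-6", "MaxValue": "0.5"},
        {"Name": "momentum", "MinValue": "0.0", "MaxValue": "0.999"},
        {"Name": "weight_decay", "MinValue": "0.0", "MaxValue": "0.999"},
    ],
    "IntegerParameterRanges": [
        {"Name": "mini_batch_size", "MinValue": "8", "MaxValue": "512"},
    ],
    "CategoricalParameterRanges": [
        {"Name": "optimizer", "Values": ["sgd", "adam", "rmsprop", "nag"]},
    ],
}

# Objective metric computed by the algorithm, to be maximized
tuning_objective = {"Type": "Maximize", "MetricName": "validation:accuracy"}
```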

# Image Classification - TensorFlow

The Amazon SageMaker Image Classification - TensorFlow algorithm is a supervised learning algorithm that supports transfer learning with many pretrained models from the [TensorFlow Hub](https://tfhub.dev/s?fine-tunable=yes&module-type=image-classification&subtype=module,placeholder&tf-version=tf2). Use transfer learning to fine-tune one of the available pretrained models on your own dataset, even if a large amount of image data is not available. The image classification algorithm takes an image as input and outputs a probability for each provided class label. Training datasets must consist of images in .jpg, .jpeg, or .png format. This page includes information about Amazon EC2 instance recommendations and sample notebooks for Image Classification - TensorFlow.

**Topics**
+ [How to use the SageMaker Image Classification - TensorFlow algorithm](IC-TF-how-to-use.md)
+ [Input and output interface for the Image Classification - TensorFlow algorithm](IC-TF-inputoutput.md)
+ [Amazon EC2 instance recommendation for the Image Classification - TensorFlow algorithm](#IC-TF-instances)
+ [Image Classification - TensorFlow sample notebooks](#IC-TF-sample-notebooks)
+ [How Image Classification - TensorFlow Works](IC-TF-HowItWorks.md)
+ [TensorFlow Hub Models](IC-TF-Models.md)
+ [Image Classification - TensorFlow Hyperparameters](IC-TF-Hyperparameter.md)
+ [Tune an Image Classification - TensorFlow model](IC-TF-tuning.md)

# How to use the SageMaker Image Classification - TensorFlow algorithm

You can use Image Classification - TensorFlow as an Amazon SageMaker AI built-in algorithm. The following section describes how to use Image Classification - TensorFlow with the SageMaker AI Python SDK. For information on how to use Image Classification - TensorFlow from the Amazon SageMaker Studio Classic UI, see [SageMaker JumpStart pretrained models](studio-jumpstart.md).

The Image Classification - TensorFlow algorithm supports transfer learning using any of the compatible pretrained TensorFlow Hub models. For a list of all available pretrained models, see [TensorFlow Hub Models](IC-TF-Models.md). Every pretrained model has a unique `model_id`. The following example uses MobileNet V2 1.00 224 (`model_id`: `tensorflow-ic-imagenet-mobilenet-v2-100-224-classification-4`) to fine-tune on a custom dataset. The pretrained models are all pre-downloaded from the TensorFlow Hub and stored in Amazon S3 buckets so that training jobs can run in network isolation. Use these pre-generated model training artifacts to construct a SageMaker AI Estimator.

First, retrieve the Docker image URI, training script URI, and pretrained model URI. Then, change the hyperparameters as you see fit. You can see a Python dictionary of all available hyperparameters and their default values with `hyperparameters.retrieve_default`. For more information, see [Image Classification - TensorFlow Hyperparameters](IC-TF-Hyperparameter.md). Use these values to construct a SageMaker AI Estimator.

**Note**  
Default hyperparameter values are different for different models. For larger models, the default batch size is smaller and the `train_only_top_layer` hyperparameter is set to `"True"`.

This example uses the [tf_flowers](https://www.tensorflow.org/datasets/catalog/tf_flowers) dataset, which contains five classes of flower images. We pre-downloaded the dataset from TensorFlow under the Apache 2.0 license and made it available with Amazon S3. To fine-tune your model, call `.fit` using the Amazon S3 location of your training dataset.

```
import sagemaker
from sagemaker import image_uris, model_uris, script_uris, hyperparameters
from sagemaker.estimator import Estimator

# Session, role, and region used later in this example
sess = sagemaker.Session()
aws_role = sagemaker.get_execution_role()
aws_region = sess.boto_region_name

model_id, model_version = "tensorflow-ic-imagenet-mobilenet-v2-100-224-classification-4", "*"
training_instance_type = "ml.p3.2xlarge"

# Retrieve the Docker image
train_image_uri = image_uris.retrieve(
    model_id=model_id,
    model_version=model_version,
    image_scope="training",
    instance_type=training_instance_type,
    region=None,
    framework=None,
)

# Retrieve the training script
train_source_uri = script_uris.retrieve(model_id=model_id, model_version=model_version, script_scope="training")

# Retrieve the pretrained model tarball for transfer learning
train_model_uri = model_uris.retrieve(model_id=model_id, model_version=model_version, model_scope="training")

# Retrieve the default hyper-parameters for fine-tuning the model
hyperparameters = hyperparameters.retrieve_default(model_id=model_id, model_version=model_version)

# [Optional] Override default hyperparameters with custom values
hyperparameters["epochs"] = "5"

# The sample training data is available in the following S3 bucket
training_data_bucket = f"jumpstart-cache-prod-{aws_region}"
training_data_prefix = "training-datasets/tf_flowers/"

training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}"

output_bucket = sess.default_bucket()
output_prefix = "jumpstart-example-ic-training"
s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"

# Create SageMaker Estimator instance
tf_ic_estimator = Estimator(
    role=aws_role,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",
    instance_count=1,
    instance_type=training_instance_type,
    max_run=360000,
    hyperparameters=hyperparameters,
    output_path=s3_output_location,
)

# Use S3 path of the training data to launch SageMaker TrainingJob
tf_ic_estimator.fit({"training": training_dataset_s3_path}, logs=True)
```

# Input and output interface for the Image Classification - TensorFlow algorithm


Each of the pretrained models listed in TensorFlow Hub Models can be fine-tuned to any dataset with any number of image classes. Be mindful of how to format your training data for input to the Image Classification - TensorFlow model.
+ **Training data input format:** Your training data should be a directory with as many subdirectories as the number of classes. Each subdirectory should contain images belonging to that class in .jpg, .jpeg, or .png format.

The following is an example of an input directory structure. This example dataset has two classes: `roses` and `dandelion`. The image files in each class folder can have any name. The input directory should be hosted in an Amazon S3 bucket with a path similar to the following: `s3://bucket_name/input_directory/`. Note that the trailing `/` is required.

```
input_directory
    |--roses
        |--abc.jpg
        |--def.jpg
    |--dandelion
        |--ghi.jpg
        |--jkl.jpg
```

Trained models output label mapping files that map class folder names to the indices in the list of output class probabilities. This mapping is in alphabetical order. For example, in the preceding example, the dandelion class is index 0 and the roses class is index 1. 
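
Because the class-to-index mapping is simply the alphabetical order of the class folder names, you can reproduce it locally. The following sketch (directory and function names are illustrative) derives the same mapping from a dataset laid out as in the example above:

```python
import os
import tempfile

def class_index_mapping(input_directory):
    """Map each class folder name to its output index, mirroring the
    algorithm's alphabetical ordering of class subdirectories."""
    class_names = sorted(
        d for d in os.listdir(input_directory)
        if os.path.isdir(os.path.join(input_directory, d))
    )
    return {name: index for index, name in enumerate(class_names)}

# Recreate the example dataset layout with two classes.
root = tempfile.mkdtemp()
for class_name in ["roses", "dandelion"]:
    os.makedirs(os.path.join(root, class_name))

mapping = class_index_mapping(root)
print(mapping)  # {'dandelion': 0, 'roses': 1}
```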

After training, you have a fine-tuned model that you can further train using incremental training or deploy for inference. The Image Classification - TensorFlow algorithm automatically adds a pre-processing and post-processing signature to the fine-tuned model so that it can take in images as input and return class probabilities. The file mapping class indices to class labels is saved along with the models. 

## Incremental training


You can seed the training of a new model with artifacts from a model that you trained previously with SageMaker AI. Incremental training saves training time when you want to train a new model with the same or similar data.

**Note**  
You can only seed a SageMaker Image Classification - TensorFlow model with another Image Classification - TensorFlow model trained in SageMaker AI. 

You can use any dataset for incremental training, as long as the set of classes remains the same. The incremental training step is similar to the fine-tuning step, but instead of starting with a pretrained model, you start with an existing fine-tuned model. For an example of incremental training with the SageMaker AI Image Classification - TensorFlow algorithm, see the [Introduction to SageMaker TensorFlow - Image Classification](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/image_classification_tensorflow/Amazon_TensorFlow_Image_Classification.ipynb) sample notebook.
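
To seed incremental training, point `model_uri` at the `model.tar.gz` produced by the earlier training job instead of the pretrained tarball. With the SageMaker Python SDK, the artifact path typically follows the pattern sketched below (the bucket, prefix, and job name here are illustrative):

```python
# Location of the artifacts from a previous fine-tuning job (illustrative names).
output_bucket = "my-sagemaker-bucket"
output_prefix = "jumpstart-example-ic-training"
previous_job_name = "tf-ic-training-2023-01-01-00-00-00-000"

# SageMaker training jobs typically write their model artifact under
# {output_path}/{job_name}/output/model.tar.gz.
incremental_model_uri = (
    f"s3://{output_bucket}/{output_prefix}/output/"
    f"{previous_job_name}/output/model.tar.gz"
)
print(incremental_model_uri)
```

You can then pass `incremental_model_uri` as the `model_uri` argument when constructing the `Estimator`, in place of the pretrained model tarball.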

## Inference with the Image Classification - TensorFlow algorithm


You can host the fine-tuned model that results from your TensorFlow Image Classification training for inference. Any input image for inference must be in `.jpg`, `.jpeg`, or `.png` format and have the content type `application/x-image`. The Image Classification - TensorFlow algorithm resizes input images automatically. 

Running inference returns, encoded in JSON format, probability values, class labels for all classes, and the predicted label corresponding to the class index with the highest probability. The Image Classification - TensorFlow model processes a single image per request and outputs only one line. The following is an example of a JSON format response:

```
accept: application/json;verbose

 {"probabilities": [prob_0, prob_1, prob_2, ...],
  "labels":        [label_0, label_1, label_2, ...],
  "predicted_label": predicted_label}
```

If `accept` is set to `application/json`, then the model only outputs probabilities. For more information on training and inference with the Image Classification - TensorFlow algorithm, see the [Introduction to SageMaker TensorFlow - Image Classification](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/image_classification_tensorflow/Amazon_TensorFlow_Image_Classification.ipynb) sample notebook.
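
For example, a response body produced with `accept: application/json;verbose` can be decoded and checked with the standard `json` module (the probability values below are made up for illustration):

```python
import json

# Example verbose response body (illustrative values).
response_body = """
{"probabilities": [0.05, 0.95],
 "labels": ["dandelion", "roses"],
 "predicted_label": "roses"}
"""

prediction = json.loads(response_body)

# The predicted label corresponds to the class index with the highest probability.
best_index = max(
    range(len(prediction["probabilities"])),
    key=prediction["probabilities"].__getitem__,
)
assert prediction["labels"][best_index] == prediction["predicted_label"]
print(prediction["predicted_label"])  # roses
```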

## Amazon EC2 instance recommendation for the Image Classification - TensorFlow algorithm


The Image Classification - TensorFlow algorithm supports all CPU and GPU instances for training, including:
+ `ml.p2.xlarge`
+ `ml.p2.16xlarge`
+ `ml.p3.2xlarge`
+ `ml.p3.16xlarge`
+ `ml.g4dn.xlarge`
+ `ml.g4dn.16xlarge`
+ `ml.g5.xlarge`
+ `ml.g5.48xlarge`

We recommend GPU instances with more memory for training with large batch sizes. Both CPU (such as M5) and GPU (P2, P3, G4dn, or G5) instances can be used for inference.

## Image Classification - TensorFlow sample notebooks

For more information about how to use the SageMaker Image Classification - TensorFlow algorithm for transfer learning on a custom dataset, see the [Introduction to SageMaker TensorFlow - Image Classification](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/image_classification_tensorflow/Amazon_TensorFlow_Image_Classification.ipynb) notebook.

For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, select the **SageMaker AI Examples** tab to see a list of all the SageMaker AI samples. To open a notebook, choose its **Use** tab and choose **Create copy**.

# How Image Classification - TensorFlow Works

The Image Classification - TensorFlow algorithm takes an image as input and classifies it into one of the output class labels. Various deep learning networks such as MobileNet, ResNet, Inception, and EfficientNet are highly accurate for image classification. There are also deep learning networks that are trained on large image datasets, such as ImageNet, which has over 11 million images and almost 11,000 classes. After a network is trained with ImageNet data, you can then fine-tune the network on a dataset with a particular focus to perform more specific classification tasks. The Amazon SageMaker Image Classification - TensorFlow algorithm supports transfer learning on many pretrained models that are available in the TensorFlow Hub.

According to the number of class labels in your training data, a classification layer is attached to the pretrained TensorFlow Hub model of your choice. The classification layer consists of a dropout layer, a dense layer, and a fully-connected layer with 2-norm regularizer that is initialized with random weights. The model has hyperparameters for the dropout rate of the dropout layer and the L2 regularization factor for the dense layer. You can then fine-tune either the entire network (including the pretrained model) or only the top classification layer on new training data. With this method of transfer learning, training with smaller datasets is possible.
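
Conceptually, the attached classification head computes the following. This is a plain-Python sketch for illustration only (the actual implementation is a Keras layer stack on top of the TensorFlow Hub model); the function name and values are made up:

```python
import math

def classification_head(features, weights, biases, l2_factor):
    """Sketch of the top classification layer: a dense layer producing one
    logit per class, softmax over the logits, and an L2 penalty on the
    dense weights that is added to the training loss. (The dropout layer,
    active only during training, is omitted here.)"""
    # Dense layer: one logit per class.
    logits = [
        sum(w * f for w, f in zip(class_weights, features)) + b
        for class_weights, b in zip(weights, biases)
    ]
    # Softmax converts logits to class probabilities.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probabilities = [e / total for e in exps]
    # L2 regularization term contributed to the training loss.
    l2_penalty = l2_factor * sum(w * w for row in weights for w in row)
    return probabilities, l2_penalty

probs, penalty = classification_head(
    features=[0.2, -0.1, 0.4],                      # features from the hub model
    weights=[[0.5, 0.1, -0.2], [0.3, -0.4, 0.6]],   # randomly initialized
    biases=[0.0, 0.1],
    l2_factor=0.0001,                               # regularizers_l2
)
```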

# TensorFlow Hub Models

The following pretrained models are available to use for transfer learning with the Image Classification - TensorFlow algorithm. 

These models vary significantly in size, number of model parameters, training time, and inference latency for any given dataset. The best model for your use case depends on the complexity of your fine-tuning dataset and any requirements that you have on training time, inference latency, or model accuracy.


| Model Name | `model_id` | Source | 
| --- | --- | --- | 
| MobileNet V2 1.00 224 | `tensorflow-ic-imagenet-mobilenet-v2-100-224-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/mobilenet_v2_100_224/classification/4) | 
| MobileNet V2 0.75 224 | `tensorflow-ic-imagenet-mobilenet-v2-075-224-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/mobilenet_v2_075_224/classification/4) | 
| MobileNet V2 0.50 224 | `tensorflow-ic-imagenet-mobilenet-v2-050-224-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/mobilenet_v2_050_224/classification/4) | 
| MobileNet V2 0.35 224 | `tensorflow-ic-imagenet-mobilenet-v2-035-224-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/mobilenet_v2_035_224/classification/4) | 
| MobileNet V2 1.40 224 | `tensorflow-ic-imagenet-mobilenet-v2-140-224-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/mobilenet_v2_140_224/classification/4) | 
| MobileNet V2 1.30 224 | `tensorflow-ic-imagenet-mobilenet-v2-130-224-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/mobilenet_v2_130_224/classification/4) | 
| MobileNet V2 | `tensorflow-ic-tf2-preview-mobilenet-v2-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/tf2-preview/mobilenet_v2/classification/4) | 
| Inception V3 | `tensorflow-ic-imagenet-inception-v3-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/inception_v3/classification/4) | 
| Inception V2 | `tensorflow-ic-imagenet-inception-v2-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/inception_v2/classification/4) | 
| Inception V1 | `tensorflow-ic-imagenet-inception-v1-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/inception_v1/classification/4) | 
| Inception V3 Preview | `tensorflow-ic-tf2-preview-inception-v3-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/tf2-preview/inception_v3/classification/4) | 
| Inception ResNet V2 | `tensorflow-ic-imagenet-inception-resnet-v2-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/inception_resnet_v2/classification/4) | 
| ResNet V2 50 | `tensorflow-ic-imagenet-resnet-v2-50-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/resnet_v2_50/classification/4) | 
| ResNet V2 101 | `tensorflow-ic-imagenet-resnet-v2-101-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/resnet_v2_101/classification/4) | 
| ResNet V2 152 | `tensorflow-ic-imagenet-resnet-v2-152-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/resnet_v2_152/classification/4) | 
| ResNet V1 50 | `tensorflow-ic-imagenet-resnet-v1-50-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/resnet_v1_50/classification/4) | 
| ResNet V1 101 | `tensorflow-ic-imagenet-resnet-v1-101-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/resnet_v1_101/classification/4) | 
| ResNet V1 152 | `tensorflow-ic-imagenet-resnet-v1-152-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/resnet_v1_152/classification/4) | 
| ResNet 50 | `tensorflow-ic-imagenet-resnet-50-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/resnet_50/classification/1) | 
| EfficientNet B0 | `tensorflow-ic-efficientnet-b0-classification-1` | [TensorFlow Hub link](https://tfhub.dev/google/efficientnet/b0/classification/1) | 
| EfficientNet B1 | `tensorflow-ic-efficientnet-b1-classification-1` | [TensorFlow Hub link](https://tfhub.dev/google/efficientnet/b1/classification/1) | 
| EfficientNet B2 | `tensorflow-ic-efficientnet-b2-classification-1` | [TensorFlow Hub link](https://tfhub.dev/google/efficientnet/b2/classification/1) | 
| EfficientNet B3 | `tensorflow-ic-efficientnet-b3-classification-1` | [TensorFlow Hub link](https://tfhub.dev/google/efficientnet/b3/classification/1) | 
| EfficientNet B4 | `tensorflow-ic-efficientnet-b4-classification-1` | [TensorFlow Hub link](https://tfhub.dev/google/efficientnet/b4/classification/1) | 
| EfficientNet B5 | `tensorflow-ic-efficientnet-b5-classification-1` | [TensorFlow Hub link](https://tfhub.dev/google/efficientnet/b5/classification/1) | 
| EfficientNet B6 | `tensorflow-ic-efficientnet-b6-classification-1` | [TensorFlow Hub link](https://tfhub.dev/google/efficientnet/b6/classification/1) | 
| EfficientNet B7 | `tensorflow-ic-efficientnet-b7-classification-1` | [TensorFlow Hub link](https://tfhub.dev/google/efficientnet/b7/classification/1) | 
| EfficientNet B0 Lite | `tensorflow-ic-efficientnet-lite0-classification-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/efficientnet/lite0/classification/2) | 
| EfficientNet B1 Lite | `tensorflow-ic-efficientnet-lite1-classification-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/efficientnet/lite1/classification/2) | 
| EfficientNet B2 Lite | `tensorflow-ic-efficientnet-lite2-classification-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/efficientnet/lite2/classification/2) | 
| EfficientNet B3 Lite | `tensorflow-ic-efficientnet-lite3-classification-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/efficientnet/lite3/classification/2) | 
| EfficientNet B4 Lite | `tensorflow-ic-efficientnet-lite4-classification-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/efficientnet/lite4/classification/2) | 
| MobileNet V1 1.00 224 | `tensorflow-ic-imagenet-mobilenet-v1-100-224-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/mobilenet_v1_100_224/classification/4) | 
| MobileNet V1 1.00 192 | `tensorflow-ic-imagenet-mobilenet-v1-100-192-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/mobilenet_v1_100_192/classification/4) | 
| MobileNet V1 1.00 160 | `tensorflow-ic-imagenet-mobilenet-v1-100-160-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/mobilenet_v1_100_160/classification/4) | 
| MobileNet V1 1.00 128 | `tensorflow-ic-imagenet-mobilenet-v1-100-128-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/mobilenet_v1_100_128/classification/4) | 
| MobileNet V1 0.75 224 | `tensorflow-ic-imagenet-mobilenet-v1-075-224-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/mobilenet_v1_075_224/classification/4) | 
| MobileNet V1 0.75 192 | `tensorflow-ic-imagenet-mobilenet-v1-075-192-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/mobilenet_v1_075_192/classification/4) | 
| MobileNet V1 0.75 160 | `tensorflow-ic-imagenet-mobilenet-v1-075-160-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/mobilenet_v1_075_160/classification/4) | 
| MobileNet V1 0.75 128 | `tensorflow-ic-imagenet-mobilenet-v1-075-128-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/mobilenet_v1_075_128/classification/4) | 
| MobileNet V1 0.50 224 | `tensorflow-ic-imagenet-mobilenet-v1-050-224-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/mobilenet_v1_050_224/classification/4) | 
| MobileNet V1 0.50 192 | `tensorflow-ic-imagenet-mobilenet-v1-050-192-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/mobilenet_v1_050_192/classification/4) | 
| MobileNet V1 0.50 160 | `tensorflow-ic-imagenet-mobilenet-v1-050-160-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/mobilenet_v1_050_160/classification/4) | 
| MobileNet V1 0.50 128 | `tensorflow-ic-imagenet-mobilenet-v1-050-128-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/mobilenet_v1_050_128/classification/4) | 
| MobileNet V1 0.25 224 | `tensorflow-ic-imagenet-mobilenet-v1-025-224-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/mobilenet_v1_025_224/classification/4) | 
| MobileNet V1 0.25 192 | `tensorflow-ic-imagenet-mobilenet-v1-025-192-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/mobilenet_v1_025_192/classification/4) | 
| MobileNet V1 0.25 160 | `tensorflow-ic-imagenet-mobilenet-v1-025-160-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/mobilenet_v1_025_160/classification/4) | 
| MobileNet V1 0.25 128 | `tensorflow-ic-imagenet-mobilenet-v1-025-128-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/mobilenet_v1_025_128/classification/4) | 
| BiT-S R50x1 | `tensorflow-ic-bit-s-r50x1-ilsvrc2012-classification-1` | [TensorFlow Hub link](https://tfhub.dev/google/bit/s-r50x1/ilsvrc2012_classification/1) | 
| BiT-S R50x3 | `tensorflow-ic-bit-s-r50x3-ilsvrc2012-classification-1` | [TensorFlow Hub link](https://tfhub.dev/google/bit/s-r50x3/ilsvrc2012_classification/1) | 
| BiT-S R101x1 | `tensorflow-ic-bit-s-r101x1-ilsvrc2012-classification-1` | [TensorFlow Hub link](https://tfhub.dev/google/bit/s-r101x1/ilsvrc2012_classification/1) | 
| BiT-S R101x3 | `tensorflow-ic-bit-s-r101x3-ilsvrc2012-classification-1` | [TensorFlow Hub link](https://tfhub.dev/google/bit/s-r101x3/ilsvrc2012_classification/1) | 
| BiT-M R50x1 | `tensorflow-ic-bit-m-r50x1-ilsvrc2012-classification-1` | [TensorFlow Hub link](https://tfhub.dev/google/bit/m-r50x1/ilsvrc2012_classification/1) | 
| BiT-M R50x3 | `tensorflow-ic-bit-m-r50x3-ilsvrc2012-classification-1` | [TensorFlow Hub link](https://tfhub.dev/google/bit/m-r50x3/ilsvrc2012_classification/1) | 
| BiT-M R101x1 | `tensorflow-ic-bit-m-r101x1-ilsvrc2012-classification-1` | [TensorFlow Hub link](https://tfhub.dev/google/bit/m-r101x1/ilsvrc2012_classification/1) | 
| BiT-M R101x3 | `tensorflow-ic-bit-m-r101x3-ilsvrc2012-classification-1` | [TensorFlow Hub link](https://tfhub.dev/google/bit/m-r101x3/ilsvrc2012_classification/1) | 
| BiT-M R50x1 ImageNet-21k | `tensorflow-ic-bit-m-r50x1-imagenet21k-classification-1` | [TensorFlow Hub link](https://tfhub.dev/google/bit/m-r50x1/imagenet21k_classification/1) | 
| BiT-M R50x3 ImageNet-21k | `tensorflow-ic-bit-m-r50x3-imagenet21k-classification-1` | [TensorFlow Hub link](https://tfhub.dev/google/bit/m-r50x3/imagenet21k_classification/1) | 
| BiT-M R101x1 ImageNet-21k | `tensorflow-ic-bit-m-r101x1-imagenet21k-classification-1` | [TensorFlow Hub link](https://tfhub.dev/google/bit/m-r101x1/imagenet21k_classification/1) | 
| BiT-M R101x3 ImageNet-21k | `tensorflow-ic-bit-m-r101x3-imagenet21k-classification-1` | [TensorFlow Hub link](https://tfhub.dev/google/bit/m-r101x3/imagenet21k_classification/1) | 

# Image Classification - TensorFlow Hyperparameters

Hyperparameters are parameters that are set before a machine learning model begins learning. The following hyperparameters are supported by the Amazon SageMaker AI built-in Image Classification - TensorFlow algorithm. See [Tune an Image Classification - TensorFlow model](IC-TF-tuning.md) for information on hyperparameter tuning. 


| Parameter Name | Description | 
| --- | --- | 
| augmentation |  Set to `"True"` to apply `augmentation_random_flip`, `augmentation_random_rotation`, and `augmentation_random_zoom` to the training data.  Valid values: string, either: (`"True"` or `"False"`). Default value: `"False"`.  | 
| augmentation\_random\_flip |  Indicates which flip mode to use for data augmentation when `augmentation` is set to `"True"`. For more information, see [RandomFlip](https://www.tensorflow.org/api_docs/python/tf/keras/layers/RandomFlip) in the TensorFlow documentation. Valid values: string, any of the following: (`"horizontal_and_vertical"`, `"vertical"`, or `"None"`). Default value: `"horizontal_and_vertical"`.  | 
| augmentation\_random\_rotation |  Indicates how much rotation to use for data augmentation when `augmentation` is set to `"True"`. Values represent a fraction of 2π. Positive values rotate counterclockwise while negative values rotate clockwise. `0` means no rotation. For more information, see [RandomRotation](https://www.tensorflow.org/api_docs/python/tf/keras/layers/RandomRotation) in the TensorFlow documentation. Valid values: float, range: [`-1.0`, `1.0`]. Default value: `0.2`.  | 
| augmentation\_random\_zoom |  Indicates how much vertical zoom to use for data augmentation when `augmentation` is set to `"True"`. Positive values zoom out while negative values zoom in. `0` means no zoom. For more information, see [RandomZoom](https://www.tensorflow.org/api_docs/python/tf/keras/layers/RandomZoom) in the TensorFlow documentation. Valid values: float, range: [`-1.0`, `1.0`]. Default value: `0.1`.  | 
| batch\_size |  The batch size for training. For training on instances with multiple GPUs, this batch size is used across the GPUs.  Valid values: positive integer. Default value: `32`.  | 
| beta\_1 |  The beta1 for the `"adam"` optimizer. Represents the exponential decay rate for the first moment estimates. Ignored for other optimizers. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.9`.  | 
| beta\_2 |  The beta2 for the `"adam"` optimizer. Represents the exponential decay rate for the second moment estimates. Ignored for other optimizers. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.999`.  | 
| binary\_mode |  When `binary_mode` is set to `"True"`, the model returns a single probability number for the positive class and can use additional `eval_metric` options. Use only for binary classification problems. Valid values: string, either: (`"True"` or `"False"`). Default value: `"False"`.  | 
| dropout\_rate |  The dropout rate for the dropout layer in the top classification layer. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.2`.  | 
| early\_stopping |  Set to `"True"` to use early stopping logic during training. If `"False"`, early stopping is not used. Valid values: string, either: (`"True"` or `"False"`). Default value: `"False"`.  | 
| early\_stopping\_min\_delta |  The minimum change needed to qualify as an improvement. An absolute change less than the value of `early_stopping_min_delta` does not qualify as improvement. Used only when `early_stopping` is set to `"True"`. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.0`.  | 
| early\_stopping\_patience |  The number of epochs to continue training with no improvement. Used only when `early_stopping` is set to `"True"`. Valid values: positive integer. Default value: `5`.  | 
| epochs |  The number of training epochs. Valid values: positive integer. Default value: `3`.  | 
| epsilon |  The epsilon for `"adam"`, `"rmsprop"`, `"adadelta"`, and `"adagrad"` optimizers. Usually set to a small value to avoid division by 0. Ignored for other optimizers. Valid values: float, range: [`0.0`, `1.0`]. Default value: `1e-7`.  | 
| eval\_metric |  If `binary_mode` is set to `"False"`, `eval_metric` can only be `"accuracy"`. If `binary_mode` is `"True"`, select any of the valid values. For more information, see [Metrics](https://www.tensorflow.org/api_docs/python/tf/keras/metrics) in the TensorFlow documentation. Valid values: string, any of the following: (`"accuracy"`, `"precision"`, `"recall"`, `"auc"`, or `"prc"`). Default value: `"accuracy"`.  | 
| image\_resize\_interpolation |  Indicates the interpolation method used when resizing images. For more information, see [image.resize](https://www.tensorflow.org/api_docs/python/tf/image/resize) in the TensorFlow documentation. Valid values: string, any of the following: (`"bilinear"`, `"nearest"`, `"bicubic"`, `"area"`, `"lanczos3"`, `"lanczos5"`, `"gaussian"`, or `"mitchellcubic"`). Default value: `"bilinear"`.  | 
| initial\_accumulator\_value |  The starting value for the accumulators, or the per-parameter momentum values, for the `"adagrad"` optimizer. Ignored for other optimizers. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.0001`.  | 
| label\_smoothing |  Indicates how much to relax the confidence on label values. For example, if `label_smoothing` is `0.1`, then non-target labels are `0.1/num_classes` and target labels are `0.9+0.1/num_classes`.  Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.1`.  | 
| learning\_rate |  The optimizer learning rate. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.001`.  | 
| momentum |  The momentum for `"sgd"`, `"nesterov"`, and `"rmsprop"` optimizers. Ignored for other optimizers. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.9`.  | 
| optimizer |  The optimizer type. For more information, see [Optimizers](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers) in the TensorFlow documentation. Valid values: string, any of the following: (`"adam"`, `"sgd"`, `"nesterov"`, `"rmsprop"`, `"adagrad"`, or `"adadelta"`). Default value: `"adam"`.  | 
| regularizers\_l2 |  The L2 regularization factor for the dense layer in the classification layer.  Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.0001`.  | 
| reinitialize\_top\_layer |  If set to `"Auto"`, the top classification layer parameters are re-initialized during fine-tuning. For incremental training, top classification layer parameters are not re-initialized unless set to `"True"`. Valid values: string, any of the following: (`"Auto"`, `"True"`, or `"False"`). Default value: `"Auto"`.  | 
| rho |  The discounting factor for the gradient of the `"adadelta"` and `"rmsprop"` optimizers. Ignored for other optimizers.  Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.95`.  | 
| train\_only\_top\_layer |  If `"True"`, only the top classification layer parameters are fine-tuned. If `"False"`, all model parameters are fine-tuned. Valid values: string, either: (`"True"` or `"False"`). Default value: `"False"`.  | 
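
As a concrete example of the label smoothing formula above, with a smoothing factor of `0.1` and 4 classes, a one-hot target becomes `0.9 + 0.1/4` for the target class and `0.1/4` elsewhere. A quick sketch (the helper name is illustrative):

```python
def smooth_labels(one_hot, label_smoothing):
    """Relax one-hot targets as described for the label_smoothing hyperparameter."""
    num_classes = len(one_hot)
    return [
        value * (1.0 - label_smoothing) + label_smoothing / num_classes
        for value in one_hot
    ]

smoothed = smooth_labels([0.0, 1.0, 0.0, 0.0], label_smoothing=0.1)
print(smoothed)  # ≈ [0.025, 0.925, 0.025, 0.025]
```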

# Tune an Image Classification - TensorFlow model

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the objective metric.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics computed by the Image Classification - TensorFlow algorithm

The image classification algorithm is a supervised algorithm. It reports an accuracy metric that is computed during training. When tuning the model, choose this metric as the objective metric.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| validation:accuracy | The ratio of the number of correct predictions to the total number of predictions made. | Maximize | 

## Tunable Image Classification - TensorFlow hyperparameters

Tune an image classification model with the following hyperparameters. The hyperparameters that have the greatest impact on image classification objective metrics are: `batch_size`, `learning_rate`, and `optimizer`. Tune the optimizer-related hyperparameters, such as `momentum`, `regularizers_l2`, `beta_1`, `beta_2`, and `eps` based on the selected `optimizer`. For example, use `beta_1` and `beta_2` only when `adam` is the `optimizer`.

For more information about which hyperparameters are used for each `optimizer`, see [Image Classification - TensorFlow Hyperparameters](IC-TF-Hyperparameter.md).


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| batch\_size | IntegerParameterRanges | MinValue: 8, MaxValue: 512 | 
| beta\_1 | ContinuousParameterRanges | MinValue: 1e-6, MaxValue: 0.999 | 
| beta\_2 | ContinuousParameterRanges | MinValue: 1e-6, MaxValue: 0.999 | 
| eps | ContinuousParameterRanges | MinValue: 1e-8, MaxValue: 1.0 | 
| learning\_rate | ContinuousParameterRanges | MinValue: 1e-6, MaxValue: 0.5 | 
| momentum | ContinuousParameterRanges | MinValue: 0.0, MaxValue: 0.999 | 
| optimizer | CategoricalParameterRanges | ['sgd', 'adam', 'rmsprop', 'nesterov', 'adagrad', 'adadelta'] | 
| regularizers\_l2 | ContinuousParameterRanges | MinValue: 0.0, MaxValue: 0.999 | 
| train\_only\_top\_layer | CategoricalParameterRanges | ['True', 'False'] | 
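
Expressed in the `ParameterRanges` structure of the SageMaker `CreateHyperParameterTuningJob` API, the recommendations above look roughly like the following. This is a sketch of the request fragment only, not a complete tuning job definition:

```python
# ParameterRanges fragment mirroring the recommended ranges above.
parameter_ranges = {
    "IntegerParameterRanges": [
        {"Name": "batch_size", "MinValue": "8", "MaxValue": "512"},
    ],
    "ContinuousParameterRanges": [
        {"Name": "learning_rate", "MinValue": "1e-6", "MaxValue": "0.5"},
        {"Name": "momentum", "MinValue": "0.0", "MaxValue": "0.999"},
        {"Name": "regularizers_l2", "MinValue": "0.0", "MaxValue": "0.999"},
    ],
    "CategoricalParameterRanges": [
        {"Name": "optimizer",
         "Values": ["sgd", "adam", "rmsprop", "nesterov", "adagrad", "adadelta"]},
    ],
}

# The objective metric reported by the algorithm, to be maximized.
objective = {"Type": "Maximize", "MetricName": "validation:accuracy"}
```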

# Object Detection - MXNet


The Amazon SageMaker AI Object Detection - MXNet algorithm detects and classifies objects in images using a single deep neural network. It is a supervised learning algorithm that takes images as input and identifies all instances of objects within the image scene. Each object is categorized into one of the classes in a specified collection, with a confidence score that it belongs to that class. The object's location and scale in the image are indicated by a rectangular bounding box. The algorithm uses the [Single Shot MultiBox Detector (SSD)](https://arxiv.org/pdf/1512.02325.pdf) framework and supports two base networks: [VGG](https://arxiv.org/pdf/1409.1556.pdf) and [ResNet](https://arxiv.org/pdf/1603.05027.pdf). The network can be trained from scratch, or trained with models that have been pre-trained on the [ImageNet](http://www.image-net.org/) dataset.

**Topics**
+ [Input/Output Interface for the Object Detection Algorithm](#object-detection-inputoutput)
+ [EC2 Instance Recommendation for the Object Detection Algorithm](#object-detection-instances)
+ [Object Detection Sample Notebooks](#object-detection-sample-notebooks)
+ [How Object Detection Works](algo-object-detection-tech-notes.md)
+ [Object Detection Hyperparameters](object-detection-api-config.md)
+ [Tune an Object Detection Model](object-detection-tuning.md)
+ [Object Detection Request and Response Formats](object-detection-in-formats.md)

## Input/Output Interface for the Object Detection Algorithm


The SageMaker AI Object Detection algorithm supports both RecordIO (`application/x-recordio`) and image (`image/png`, `image/jpeg`, and `application/x-image`) content types for training in file mode, and supports RecordIO (`application/x-recordio`) for training in pipe mode. However, you can also train in pipe mode using the image files (`image/png`, `image/jpeg`, and `application/x-image`), without creating RecordIO files, by using the augmented manifest format. The recommended input format for the Amazon SageMaker AI object detection algorithms is [Apache MXNet RecordIO](https://mxnet.apache.org/api/architecture/note_data_loading). However, you can also use raw images in .jpg or .png format. The algorithm supports only `application/x-image` for inference.

**Note**  
To maintain better interoperability with existing deep learning frameworks, this differs from the protobuf data formats commonly used by other Amazon SageMaker AI algorithms.

See the [Object Detection Sample Notebooks](#object-detection-sample-notebooks) for more details on data formats.

### Train with the RecordIO Format


If you use the RecordIO format for training, specify both train and validation channels as values for the `InputDataConfig` parameter of the [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request. Specify one RecordIO (.rec) file in the train channel and one RecordIO file in the validation channel. Set the content type for both channels to `application/x-recordio`. An example of how to generate a RecordIO file can be found in the object detection sample notebook. You can also use tools from [MXNet's GluonCV](https://gluon-cv.mxnet.io/build/examples_datasets/recordio.html) to generate RecordIO files for popular datasets like the [PASCAL Visual Object Classes](http://host.robots.ox.ac.uk/pascal/VOC/) and [Common Objects in Context (COCO)](http://cocodataset.org/#home).

### Train with the Image Format


If you use the image format for training, specify `train`, `validation`, `train_annotation`, and `validation_annotation` channels as values for the `InputDataConfig` parameter of the [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request. Specify the individual image data (.jpg or .png) files for the train and validation channels. For annotation data, you can use the JSON format. Specify the corresponding .json files in the `train_annotation` and `validation_annotation` channels. Set the content type for all four channels to `image/png` or `image/jpeg` based on the image type. You can also use the content type `application/x-image` when your dataset contains both .jpg and .png images. The following is an example of a .json file.

```
{
   "file": "your_image_directory/sample_image1.jpg",
   "image_size": [
      {
         "width": 500,
         "height": 400,
         "depth": 3
      }
   ],
   "annotations": [
      {
         "class_id": 0,
         "left": 111,
         "top": 134,
         "width": 61,
         "height": 128
      },
      {
         "class_id": 0,
         "left": 161,
         "top": 250,
         "width": 79,
         "height": 143
      },
      {
         "class_id": 1,
         "left": 101,
         "top": 185,
         "width": 42,
         "height": 130
      }
   ],
   "categories": [
      {
         "class_id": 0,
         "name": "dog"
      },
      {
         "class_id": 1,
         "name": "cat"
      }
   ]
}
```

Each image needs a .json file for annotation, and the .json file should have the same name as the corresponding image. The name of the above .json file should be "sample\_image1.json". There are four properties in the annotation .json file. The property "file" specifies the relative path of the image file. For example, if your training images and corresponding .json files are stored in s3://*your\_bucket*/train/sample\_image and s3://*your\_bucket*/train\_annotation, specify the path for your train and train\_annotation channels as s3://*your\_bucket*/train and s3://*your\_bucket*/train\_annotation, respectively. 

In the .json file, the relative path for an image named sample\_image1.jpg should be sample\_image/sample\_image1.jpg. The `"image_size"` property specifies the overall image dimensions. The SageMaker AI object detection algorithm currently only supports 3-channel images. The `"annotations"` property specifies the categories and bounding boxes for objects within the image. Each object is annotated by a `"class_id"` index and by four bounding box coordinates (`"left"`, `"top"`, `"width"`, `"height"`). The `"left"` (x-coordinate) and `"top"` (y-coordinate) values represent the upper-left corner of the bounding box. The `"width"` (x-coordinate) and `"height"` (y-coordinate) values represent the dimensions of the bounding box. The origin (0, 0) is the upper-left corner of the entire image. If you have multiple objects within one image, all the annotations should be included in a single .json file. The `"categories"` property stores the mapping between the class index and class name. The class indices should be numbered successively and the numbering should start with 0. The `"categories"` property is optional for the annotation .json file.
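As a concrete sketch of these rules, the following Python snippet writes one image's annotation file in the format described above. The helper name and file paths are illustrative, not part of any SageMaker AI API.

```python
import json

def write_annotation(json_path, image_rel_path, width, height, boxes, class_names):
    """Write one image's annotation .json in the image-format layout.

    boxes: list of (class_id, left, top, box_width, box_height) in pixels,
    where (left, top) is the upper-left corner of the bounding box and the
    origin (0, 0) is the upper-left corner of the image.
    """
    annotation = {
        "file": image_rel_path,
        "image_size": [{"width": width, "height": height, "depth": 3}],
        "annotations": [
            {"class_id": c, "left": l, "top": t, "width": w, "height": h}
            for (c, l, t, w, h) in boxes
        ],
        # "categories" is optional; it maps class indices (starting at 0) to names
        "categories": [
            {"class_id": i, "name": name} for i, name in enumerate(class_names)
        ],
    }
    with open(json_path, "w") as f:
        json.dump(annotation, f, indent=3)

write_annotation(
    "sample_image1.json",
    "sample_image/sample_image1.jpg",
    width=500, height=400,
    boxes=[(0, 111, 134, 61, 128), (1, 101, 185, 42, 130)],
    class_names=["dog", "cat"],
)
```

The resulting "sample_image1.json" matches the example above and sits alongside one such file per image in the annotation channel.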

### Train with Augmented Manifest Image Format


The augmented manifest format enables you to train in pipe mode using image files without needing to create RecordIO files. Specify both train and validation channels as values for the `InputDataConfig` parameter of the [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request. To use this format, generate an S3 manifest file that contains the list of images and their corresponding annotations. The manifest file must be in [JSON Lines](http://jsonlines.org/) format, in which each line represents one sample. The images are specified using the `"source-ref"` tag, which points to the S3 location of the image. The annotations are provided under the `"AttributeNames"` parameter value as specified in the [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request. A line can also contain additional metadata under the `metadata` tag, but this metadata is ignored by the algorithm. In the following example, the `"AttributeNames"` are contained in the list `["source-ref", "bounding-box"]`:

```
{"source-ref": "s3://your_bucket/image1.jpg", "bounding-box":{"image_size":[{ "width": 500, "height": 400, "depth":3}], "annotations":[{"class_id": 0, "left": 111, "top": 134, "width": 61, "height": 128}, {"class_id": 5, "left": 161, "top": 250, "width": 80, "height": 50}]}, "bounding-box-metadata":{"class-map":{"0": "dog", "5": "horse"}, "type": "groundtruth/object-detection"}}
{"source-ref": "s3://your_bucket/image2.jpg", "bounding-box":{"image_size":[{ "width": 400, "height": 300, "depth":3}], "annotations":[{"class_id": 1, "left": 100, "top": 120, "width": 43, "height": 78}]}, "bounding-box-metadata":{"class-map":{"1": "cat"}, "type": "groundtruth/object-detection"}}
```

The order of `"AttributeNames"` in the input files matters when training the Object Detection algorithm. It accepts piped data in a specific order, with `image` first, followed by `annotations`. So the `"AttributeNames"` in this example are provided with `"source-ref"` first, followed by `"bounding-box"`. When using Object Detection with an augmented manifest, the value of the `RecordWrapperType` parameter must be set to `"RecordIO"`.
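A manifest like the example above can be generated with the standard library alone. The following sketch (the helper name and S3 URIs are illustrative) emits one JSON Lines record per image with `"source-ref"` first, as the ordering rule requires:

```python
import json

def manifest_line(s3_uri, width, height, boxes, class_map):
    """Build one JSON Lines record for the augmented manifest format.

    boxes: list of (class_id, left, top, box_width, box_height) in pixels.
    class_map: dict mapping class-id strings to names, e.g. {"0": "dog"}.
    """
    record = {
        # "source-ref" must come first so the algorithm receives the image
        # before its annotations when the data is piped in.
        "source-ref": s3_uri,
        "bounding-box": {
            "image_size": [{"width": width, "height": height, "depth": 3}],
            "annotations": [
                {"class_id": c, "left": l, "top": t, "width": w, "height": h}
                for (c, l, t, w, h) in boxes
            ],
        },
        # Metadata is ignored by the algorithm but useful for tooling.
        "bounding-box-metadata": {
            "class-map": class_map,
            "type": "groundtruth/object-detection",
        },
    }
    return json.dumps(record)

with open("train.manifest", "w") as f:
    f.write(manifest_line("s3://your_bucket/image1.jpg", 500, 400,
                          [(0, 111, 134, 61, 128)], {"0": "dog"}) + "\n")
    f.write(manifest_line("s3://your_bucket/image2.jpg", 400, 300,
                          [(1, 100, 120, 43, 78)], {"1": "cat"}) + "\n")
```

Python dictionaries preserve insertion order, so `"source-ref"` appears first on every line. Upload the resulting file to S3 and point the channel's manifest at it.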

For more information on augmented manifest files, see [Augmented Manifest Files for Training Jobs](augmented-manifest.md).

### Incremental Training


You can also seed the training of a new model with the artifacts from a model that you trained previously with SageMaker AI. Incremental training saves training time when you want to train a new model with the same or similar data. SageMaker AI object detection models can be seeded only with another built-in object detection model trained in SageMaker AI.

To use a pretrained model, in the [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request, specify the `ChannelName` as "model" in the `InputDataConfig` parameter. Set the `ContentType` for the model channel to `application/x-sagemaker-model`. The input hyperparameters of both the new model and the pretrained model that you upload to the model channel must have the same settings for the `base_network` and `num_classes` input parameters. These parameters define the network architecture. For the pretrained model file, use the compressed model artifacts (in .tar.gz format) output by SageMaker AI. You can use either RecordIO or image formats for input data.
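For illustration, the model channel entry in the `InputDataConfig` list might look like the following sketch. The S3 path is a placeholder; point it at the compressed model artifacts from your earlier training job.

```
{
  "ChannelName": "model",
  "ContentType": "application/x-sagemaker-model",
  "DataSource": {
    "S3DataSource": {
      "S3DataType": "S3Prefix",
      "S3Uri": "s3://your_bucket/previous-training-job/output/model.tar.gz",
      "S3DataDistributionType": "FullyReplicated"
    }
  }
}
```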

For more information on incremental training and for instructions on how to use it, see [Use Incremental Training in Amazon SageMaker AI](incremental-training.md). 

## EC2 Instance Recommendation for the Object Detection Algorithm


The object detection algorithm supports P2, P3, G4dn, and G5 GPU instance families. We recommend using GPU instances with more memory for training with large batch sizes. You can run the object detection algorithm in multi-GPU and multi-machine settings for distributed training.

You can use both CPU (such as C5 and M5) and GPU (such as P3 and G4dn) instances for inference.

## Object Detection Sample Notebooks

For a sample notebook that shows how to use the SageMaker AI Object Detection algorithm to train and host a model on the [Caltech Birds (CUB 200 2011)](http://www.vision.caltech.edu/datasets/cub_200_2011/) dataset using the Single Shot MultiBox Detector algorithm, see [Amazon SageMaker AI Object Detection for Bird Species](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/object_detection_birds/object_detection_birds.html). For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, choose the **SageMaker AI Examples** tab to see a list of all of the SageMaker AI samples. The object detection example notebook is located in the **Introduction to Amazon Algorithms** section. To open a notebook, choose its **Use** tab and then choose **Create copy**.

For more information about the Amazon SageMaker AI Object Detection algorithm, see the following blog posts:
+ [Training the Amazon SageMaker AI object detection model and running it on AWS IoT Greengrass – Part 1 of 3: Preparing training data](https://aws.amazon.com/blogs/iot/sagemaker-object-detection-greengrass-part-1-of-3/)
+ [Training the Amazon SageMaker AI object detection model and running it on AWS IoT Greengrass – Part 2 of 3: Training a custom object detection model](https://aws.amazon.com/blogs/iot/sagemaker-object-detection-greengrass-part-2-of-3/)
+ [Training the Amazon SageMaker AI object detection model and running it on AWS IoT Greengrass – Part 3 of 3: Deploying to the edge](https://aws.amazon.com/blogs/iot/sagemaker-object-detection-greengrass-part-3-of-3/)

# How Object Detection Works

The object detection algorithm identifies and locates all instances of objects in an image from a known collection of object categories. The algorithm takes an image as input and outputs the category that the object belongs to, along with a confidence score that it belongs to the category. The algorithm also predicts the object's location and scale with a rectangular bounding box. Amazon SageMaker AI Object Detection uses the [Single Shot MultiBox Detector (SSD)](https://arxiv.org/pdf/1512.02325.pdf) algorithm, which takes a convolutional neural network (CNN) pretrained for a classification task as the base network. SSD uses the output of intermediate layers as features for detection. 

Various CNNs such as [VGG](https://arxiv.org/pdf/1409.1556.pdf) and [ResNet](https://arxiv.org/pdf/1603.05027.pdf) have achieved great performance on the image classification task. Object detection in Amazon SageMaker AI supports both VGG-16 and ResNet-50 as a base network for SSD. The algorithm can be trained in full training mode or in transfer learning mode. In full training mode, the base network is initialized with random weights and then trained on user data. In transfer learning mode, the base network weights are loaded from pretrained models.

The object detection algorithm uses standard data augmentation operations, such as flip, rescale, and jitter, on the fly internally to help avoid overfitting.

# Object Detection Hyperparameters

In the [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request, you specify the training algorithm that you want to use. You can also specify algorithm-specific hyperparameters that are used to help estimate the parameters of the model from a training dataset. The following table lists the hyperparameters provided by Amazon SageMaker AI for training the object detection algorithm. For more information about how object detection training works, see [How Object Detection Works](algo-object-detection-tech-notes.md).


| Parameter Name | Description | 
| --- | --- | 
| num\_classes |  The number of output classes. This parameter defines the dimensions of the network output and is typically set to the number of classes in the dataset. **Required** Valid values: positive integer  | 
| num\_training\_samples |  The number of training examples in the input dataset.  If there is a mismatch between this value and the number of samples in the training set, then the behavior of the `lr_scheduler_step` parameter will be undefined and distributed training accuracy may be affected.  **Required** Valid values: positive integer  | 
| base\_network |  The base network architecture to use. **Optional** Valid values: 'vgg-16' or 'resnet-50' Default value: 'vgg-16'  | 
| early\_stopping |  `True` to use early stopping logic during training. `False` not to use it. **Optional** Valid values: `True` or `False` Default value: `False`  | 
| early\_stopping\_min\_epochs |  The minimum number of epochs that must be run before the early stopping logic can be invoked. It is used only when `early_stopping` = `True`. **Optional** Valid values: positive integer Default value: 10  | 
| early\_stopping\_patience |  The number of epochs to wait before ending training if no improvement, as defined by the `early_stopping_tolerance` hyperparameter, is made in the relevant metric. It is used only when `early_stopping` = `True`. **Optional** Valid values: positive integer Default value: 5  | 
| early\_stopping\_tolerance |  The tolerance value that the relative improvement in `validation:mAP`, the mean average precision (mAP), is required to exceed to avoid early stopping. If the ratio of the change in the mAP divided by the previous best mAP is smaller than the `early_stopping_tolerance` value set, early stopping considers that there is no improvement. It is used only when `early_stopping` = `True`. **Optional** Valid values: 0 ≤ float ≤ 1 Default value: 0.0  | 
| image\_shape |  The image size for input images. We rescale the input image to a square image with this size. We recommend using 300 and 512 for better performance. **Optional** Valid values: positive integer ≥300 Default: 300  | 
| epochs |  The number of training epochs.  **Optional** Valid values: positive integer Default: 30  | 
| freeze\_layer\_pattern |  The regular expression (regex) for freezing layers in the base network. For example, if we set `freeze_layer_pattern` = `"^(conv1_\|conv2_).*"`, then any layers with a name that contains `"conv1_"` or `"conv2_"` are frozen, which means that the weights for these layers are not updated during training. The layer names can be found in the network symbol files [vgg16-symbol.json](http://data.mxnet.io/models/imagenet/vgg/vgg16-symbol.json ) and [resnet-50-symbol.json](http://data.mxnet.io/models/imagenet/resnet/50-layers/resnet-50-symbol.json). Freezing a layer means that its weights cannot be modified further. This can reduce training time significantly in exchange for modest losses in accuracy. This technique is commonly used in transfer learning where the lower layers in the base network do not need to be retrained. **Optional** Valid values: string Default: No layers frozen.  | 
| kv\_store |  The weight update synchronization mode used for distributed training. The weights can be updated either synchronously or asynchronously across machines. Synchronous updates typically provide better accuracy than asynchronous updates but can be slower. See the [Distributed Training](https://mxnet.apache.org/api/faq/distributed_training) MXNet tutorial for details.  This parameter is not applicable to single machine training.  **Optional** Valid values: `'dist_sync'` or `'dist_async'` [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/object-detection-api-config.html) Default: -  | 
| label\_width |  The force padding label width used to sync across training and validation data. For example, if one image in the data contains at most 10 objects, and each object's annotation is specified with 5 numbers, [class\_id, left, top, width, height], then the `label_width` should be no smaller than (10\*5 \+ header information length). The header information length is usually 2. We recommend using a slightly larger `label_width` for the training, such as 60 for this example. **Optional** Valid values: Positive integer large enough to accommodate the largest annotation information length in the data. Default: 350  | 
| learning\_rate |  The initial learning rate. **Optional** Valid values: float in (0, 1] Default: 0.001  | 
| lr\_scheduler\_factor |  The ratio to reduce learning rate. Used in conjunction with the `lr_scheduler_step` parameter, defined as `lr_new` = `lr_old` \* `lr_scheduler_factor`. **Optional** Valid values: float in (0, 1) Default: 0.1  | 
| lr\_scheduler\_step |  The epochs at which to reduce the learning rate. The learning rate is reduced by `lr_scheduler_factor` at epochs listed in a comma-delimited string: "epoch1, epoch2, ...". For example, if the value is set to "10, 20" and the `lr_scheduler_factor` is set to 1/2, then the learning rate is halved after the 10th epoch and then halved again after the 20th epoch. **Optional** Valid values: string Default: empty string  | 
| mini\_batch\_size |  The batch size for training. In a single-machine multi-gpu setting, each GPU handles `mini_batch_size`/`num_gpu` training samples. For the multi-machine training in `dist_sync` mode, the actual batch size is `mini_batch_size` \* the number of machines. A large `mini_batch_size` usually leads to faster training, but it may cause out-of-memory problems. The memory usage is related to `mini_batch_size`, `image_shape`, and `base_network` architecture. For example, on a single p3.2xlarge instance, the largest `mini_batch_size` without an out-of-memory error is 32 with the base\_network set to "resnet-50" and an `image_shape` of 300. With the same instance, you can use 64 as the `mini_batch_size` with the base network `vgg-16` and an `image_shape` of 300. **Optional** Valid values: positive integer Default: 32  | 
| momentum |  The momentum for `sgd`. Ignored for other optimizers. **Optional** Valid values: float in (0, 1] Default: 0.9  | 
| nms\_threshold |  The non-maximum suppression threshold. **Optional** Valid values: float in (0, 1] Default: 0.45  | 
| optimizer |  The optimizer types. For details on optimizer values, see [MXNet's API](https://mxnet.apache.org/api/python/docs/api/). **Optional** Valid values: ['sgd', 'adam', 'rmsprop', 'adadelta'] Default: 'sgd'  | 
| overlap\_threshold |  The evaluation overlap threshold. **Optional** Valid values: float in (0, 1] Default: 0.5  | 
| use\_pretrained\_model |  Indicates whether to use a pre-trained model for training. If set to 1, then the pre-trained model with corresponding architecture is loaded and used for training. Otherwise, the network is trained from scratch. **Optional** Valid values: 0 or 1 Default: 1  | 
| weight\_decay |  The weight decay coefficient for `sgd` and `rmsprop`. Ignored for other optimizers. **Optional** Valid values: float in (0, 1) Default: 0.0005  | 
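To illustrate how `learning_rate`, `lr_scheduler_step`, and `lr_scheduler_factor` interact, the following stdlib-only sketch computes the learning rate in effect at a given epoch (treated as 0-based here; the helper name is illustrative):

```python
def lr_at_epoch(epoch, learning_rate=0.001, lr_scheduler_step="", lr_scheduler_factor=0.1):
    """Return the learning rate in effect at a given epoch.

    The rate is multiplied by lr_scheduler_factor at each epoch listed
    in the comma-delimited lr_scheduler_step string.
    """
    steps = [int(s) for s in lr_scheduler_step.split(",") if s.strip()]
    lr = learning_rate
    for step in steps:
        if epoch >= step:
            lr *= lr_scheduler_factor
    return lr

# With lr_scheduler_step="10, 20" and lr_scheduler_factor=0.5,
# the initial rate of 0.001 is halved at epoch 10 and again at epoch 20.
print(lr_at_epoch(5,  0.001, "10, 20", 0.5))
print(lr_at_epoch(15, 0.001, "10, 20", 0.5))
print(lr_at_epoch(25, 0.001, "10, 20", 0.5))
```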

# Tune an Object Detection Model

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the objective metric.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics Computed by the Object Detection Algorithm

The object detection algorithm reports on a single metric during training: `validation:mAP`. When tuning a model, choose this metric as the objective metric.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| validation:mAP |  Mean Average Precision (mAP) computed on the validation set.  |  Maximize  | 



## Tunable Object Detection Hyperparameters

Tune the Amazon SageMaker AI object detection model with the following hyperparameters. The hyperparameters that have the greatest impact on the object detection objective metric are: `mini_batch_size`, `learning_rate`, and `optimizer`.


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| learning\_rate |  ContinuousParameterRange  |  MinValue: 1e-6, MaxValue: 0.5  | 
| mini\_batch\_size |  IntegerParameterRanges  |  MinValue: 8, MaxValue: 64  | 
| momentum |  ContinuousParameterRange  |  MinValue: 0.0, MaxValue: 0.999  | 
| optimizer |  CategoricalParameterRanges  |  ['sgd', 'adam', 'rmsprop', 'adadelta']  | 
| weight\_decay |  ContinuousParameterRange  |  MinValue: 0.0, MaxValue: 0.999  | 
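As a rough, stdlib-only illustration of the search space these ranges define, the following sketch draws one candidate configuration at random. (In practice, SageMaker AI automatic model tuning performs this search for you; the dictionary keys and helper name here are illustrative.)

```python
import random

# Recommended ranges from the table above
RANGES = {
    "learning_rate": ("continuous", 1e-6, 0.5),
    "mini_batch_size": ("integer", 8, 64),
    "momentum": ("continuous", 0.0, 0.999),
    "optimizer": ("categorical", ["sgd", "adam", "rmsprop", "adadelta"]),
    "weight_decay": ("continuous", 0.0, 0.999),
}

def sample_configuration(rng=random):
    """Draw one hyperparameter configuration from the recommended ranges."""
    config = {}
    for name, spec in RANGES.items():
        if spec[0] == "continuous":
            config[name] = rng.uniform(spec[1], spec[2])
        elif spec[0] == "integer":
            config[name] = rng.randint(spec[1], spec[2])
        else:
            config[name] = rng.choice(spec[1])
    return config

print(sample_configuration())
```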

# Object Detection Request and Response Formats

This page describes the inference request and response formats for the Amazon SageMaker AI Object Detection - MXNet model.

## Request Format


Query a trained model by using the model's endpoint. The endpoint takes .jpg and .png image formats with `image/jpeg` and `image/png` content types.

## Response Formats


The response is the class index with a confidence score and bounding box coordinates for all objects within the image, encoded in JSON format. The following is an example response:

```
{"prediction":[
  [4.0, 0.86419455409049988, 0.3088374733924866, 0.07030484080314636, 0.7110607028007507, 0.9345266819000244],
  [0.0, 0.73376623392105103, 0.5714187026023865, 0.40427327156066895, 0.827075183391571, 0.9712159633636475],
  [4.0, 0.32643985450267792, 0.3677481412887573, 0.034883320331573486, 0.6318609714508057, 0.5967587828636169],
  [8.0, 0.22552496790885925, 0.6152569651603699, 0.5722782611846924, 0.882301390171051, 0.8985623121261597],
  [3.0, 0.42260299175977707, 0.019305512309074402, 0.08386176824569702, 0.39093565940856934, 0.9574796557426453]
]}
```

Each row in this .json file contains an array that represents a detected object. Each of these object arrays consists of a list of six numbers. The first number is the predicted class label. The second number is the associated confidence score for the detection. The last four numbers represent the bounding box coordinates [xmin, ymin, xmax, ymax]. These output bounding box corner coordinates are normalized by the overall image size. Note that this encoding is different from that used by the input .json format. For example, in the first entry of the detection result, 0.3088374733924866 is the left coordinate (x-coordinate of the upper-left corner) of the bounding box as a ratio of the overall image width, 0.07030484080314636 is the top coordinate (y-coordinate of the upper-left corner) of the bounding box as a ratio of the overall image height, 0.7110607028007507 is the right coordinate (x-coordinate of the lower-right corner) of the bounding box as a ratio of the overall image width, and 0.9345266819000244 is the bottom coordinate (y-coordinate of the lower-right corner) of the bounding box as a ratio of the overall image height. 

To avoid unreliable detection results, you might want to filter out the detection results with low confidence scores. In the [object detection sample notebook](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/object_detection_birds/object_detection_birds.ipynb), we provide examples of scripts that use a threshold to remove low confidence detections and to plot bounding boxes on the original images.
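A minimal sketch of this post-processing, assuming the response layout above (the helper name and the 0.5 threshold are illustrative choices, not fixed by the algorithm):

```python
def parse_detections(prediction, image_width, image_height, threshold=0.5):
    """Convert [class_id, score, xmin, ymin, xmax, ymax] rows (normalized
    by image size) into pixel-coordinate boxes, keeping confident ones."""
    detections = []
    for class_id, score, xmin, ymin, xmax, ymax in prediction:
        if score < threshold:
            continue  # drop low-confidence detections
        detections.append({
            "class_id": int(class_id),
            "score": score,
            # Scale normalized corners back to pixel coordinates
            "box": (xmin * image_width, ymin * image_height,
                    xmax * image_width, ymax * image_height),
        })
    return detections

# Abbreviated values from the example response above, for a 500x400 image
response = {"prediction": [
    [4.0, 0.8642, 0.3088, 0.0703, 0.7111, 0.9345],
    [0.0, 0.7338, 0.5714, 0.4043, 0.8271, 0.9712],
    [4.0, 0.3264, 0.3677, 0.0349, 0.6319, 0.5968],
]}
kept = parse_detections(response["prediction"], image_width=500, image_height=400)
print(kept)  # two detections survive the 0.5 threshold
```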

For batch transform, the response is in JSON format, where the format is identical to the JSON format described above. The detection results for each image are represented as a JSON file. For example:

```
{"prediction": [[label_id, confidence_score, xmin, ymin, xmax, ymax], [label_id, confidence_score, xmin, ymin, xmax, ymax]]}
```

For more details on training and inference, see the [Object Detection Sample Notebooks](object-detection.md#object-detection-sample-notebooks).

## OUTPUT: JSON Response Format


```
accept: application/json;annotation=1

{
   "image_size": [
      {
         "width": 500,
         "height": 400,
         "depth": 3
      }
   ],
   "annotations": [
      {
         "class_id": 0,
         "score": 0.943,
         "left": 111,
         "top": 134,
         "width": 61,
         "height": 128
      },
      {
         "class_id": 0,
         "score": 0.0013,
         "left": 161,
         "top": 250,
         "width": 79,
         "height": 143
      },
      {
         "class_id": 1,
         "score": 0.0133,
         "left": 101,
         "top": 185,
         "width": 42,
         "height": 130
      }
   ]
}
```

# Object Detection - TensorFlow

The Amazon SageMaker AI Object Detection - TensorFlow algorithm is a supervised learning algorithm that supports transfer learning with many pretrained models from the [TensorFlow Model Garden](https://github.com/tensorflow/models). Use transfer learning to fine-tune one of the available pretrained models on your own dataset, even if a large amount of image data is not available. The object detection algorithm takes an image as input and outputs a list of bounding boxes. Training datasets must consist of images in `.jpg`, `.jpeg`, or `.png` format. This page includes information about Amazon EC2 instance recommendations and sample notebooks for Object Detection - TensorFlow.

**Topics**
+ [How to use the SageMaker AI Object Detection - TensorFlow algorithm](object-detection-tensorflow-how-to-use.md)
+ [Input and output interface for the Object Detection - TensorFlow algorithm](object-detection-tensorflow-inputoutput.md)
+ [Amazon EC2 instance recommendation for the Object Detection - TensorFlow algorithm](#object-detection-tensorflow-instances)
+ [Object Detection - TensorFlow sample notebooks](#object-detection-tensorflow-sample-notebooks)
+ [How Object Detection - TensorFlow Works](object-detection-tensorflow-HowItWorks.md)
+ [TensorFlow Models](object-detection-tensorflow-Models.md)
+ [Object Detection - TensorFlow Hyperparameters](object-detection-tensorflow-Hyperparameter.md)
+ [Tune an Object Detection - TensorFlow model](object-detection-tensorflow-tuning.md)

# How to use the SageMaker AI Object Detection - TensorFlow algorithm

You can use Object Detection - TensorFlow as an Amazon SageMaker AI built-in algorithm. The following section describes how to use Object Detection - TensorFlow with the SageMaker AI Python SDK. For information on how to use Object Detection - TensorFlow from the Amazon SageMaker Studio Classic UI, see [SageMaker JumpStart pretrained models](studio-jumpstart.md).

The Object Detection - TensorFlow algorithm supports transfer learning using any of the compatible pretrained TensorFlow models. For a list of all available pretrained models, see [TensorFlow Models](object-detection-tensorflow-Models.md). Every pretrained model has a unique `model_id`. The following example uses ResNet50 (`model_id`: `tensorflow-od1-ssd-resnet50-v1-fpn-640x640-coco17-tpu-8`) to fine-tune on a custom dataset. The pretrained models are all pre-downloaded from the TensorFlow Hub and stored in Amazon S3 buckets so that training jobs can run in network isolation. Use these pre-generated model training artifacts to construct a SageMaker AI Estimator.

First, retrieve the Docker image URI, training script URI, and pretrained model URI. Then, change the hyperparameters as you see fit. You can see a Python dictionary of all available hyperparameters and their default values with `hyperparameters.retrieve_default`. For more information, see [Object Detection - TensorFlow Hyperparameters](object-detection-tensorflow-Hyperparameter.md). Use these values to construct a SageMaker AI Estimator.

**Note**  
Default hyperparameter values are different for different models. For example, for larger models, the default number of epochs is smaller. 

This example uses the [Penn-Fudan Pedestrian](https://www.cis.upenn.edu/~jshi/ped_html/#pub1) dataset, which contains images of pedestrians in the street. We pre-downloaded the dataset and made it available in Amazon S3. To fine-tune your model, call `.fit` using the Amazon S3 location of your training dataset.

```
import sagemaker
from sagemaker import image_uris, model_uris, script_uris, hyperparameters
from sagemaker.estimator import Estimator

# A SageMaker AI session, execution role, and AWS Region are assumed to be available
sess = sagemaker.Session()
aws_role = sagemaker.get_execution_role()
aws_region = sess.boto_region_name

model_id, model_version = "tensorflow-od1-ssd-resnet50-v1-fpn-640x640-coco17-tpu-8", "*"
training_instance_type = "ml.p3.2xlarge"

# Retrieve the Docker image
train_image_uri = image_uris.retrieve(
    model_id=model_id,
    model_version=model_version,
    image_scope="training",
    instance_type=training_instance_type,
    region=None,
    framework=None,
)

# Retrieve the training script
train_source_uri = script_uris.retrieve(model_id=model_id, model_version=model_version, script_scope="training")

# Retrieve the pretrained model tarball for transfer learning
train_model_uri = model_uris.retrieve(model_id=model_id, model_version=model_version, model_scope="training")

# Retrieve the default hyperparameters for fine-tuning the model
hyperparameters = hyperparameters.retrieve_default(model_id=model_id, model_version=model_version)

# [Optional] Override default hyperparameters with custom values
hyperparameters["epochs"] = "5"

# Sample training data is available in this bucket
training_data_bucket = f"jumpstart-cache-prod-{aws_region}"
training_data_prefix = "training-datasets/PennFudanPed_COCO_format/"

training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}"

output_bucket = sess.default_bucket()
output_prefix = "jumpstart-example-od-training"
s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"

# Create an Estimator instance
tf_od_estimator = Estimator(
    role=aws_role,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",
    instance_count=1,
    instance_type=training_instance_type,
    max_run=360000,
    hyperparameters=hyperparameters,
    output_path=s3_output_location,
)

# Launch a training job
tf_od_estimator.fit({"training": training_dataset_s3_path}, logs=True)
```

For more information about how to use the SageMaker AI Object Detection - TensorFlow algorithm for transfer learning on a custom dataset, see the [Introduction to SageMaker TensorFlow - Object Detection](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/object_detection_tensorflow/Amazon_Tensorflow_Object_Detection.ipynb) notebook.

# Input and output interface for the Object Detection - TensorFlow algorithm


Each of the pretrained models listed in TensorFlow Models can be fine-tuned to any dataset with any number of image classes. Be mindful of how to format your training data for input to the Object Detection - TensorFlow model.
+ **Training data input format:** Your training data should be a directory with an `images` subdirectory and an `annotations.json` file. 

The following is an example of an input directory structure. The input directory should be hosted in an Amazon S3 bucket with a path similar to the following: `s3://bucket_name/input_directory/`. Note that the trailing `/` is required.

```
input_directory
    |--images
        |--abc.png
        |--def.png
    |--annotations.json
```

The `annotations.json` file should contain information for bounding boxes and their class labels in the form of a dictionary with `"images"` and `"annotations"` keys. The value for the `"images"` key should be a list of dictionaries. There should be one dictionary for each image with the following information: `{"file_name": image_name, "height": height, "width": width, "id": image_id}`. The value for the `"annotations"` key should also be a list of dictionaries. There should be one dictionary for each bounding box with the following information: `{"image_id": image_id, "bbox": [xmin, ymin, xmax, ymax], "category_id": bbox_label}`.
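A sketch of building such an `annotations.json` with the standard library (the helper name, directory, and sample values are illustrative):

```python
import json
import os

def build_annotations(entries, out_dir="input_directory"):
    """Write annotations.json in the Object Detection - TensorFlow input format.

    entries: list of (file_name, width, height, boxes), where boxes is a
    list of ([xmin, ymin, xmax, ymax], category_id) pairs in pixels.
    """
    images, annotations = [], []
    for image_id, (file_name, width, height, boxes) in enumerate(entries):
        images.append({"file_name": file_name, "height": height,
                       "width": width, "id": image_id})
        for bbox, category_id in boxes:
            annotations.append({"image_id": image_id, "bbox": bbox,
                                "category_id": category_id})
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "annotations.json"), "w") as f:
        json.dump({"images": images, "annotations": annotations}, f)

build_annotations([
    ("abc.png", 640, 480, [([10, 20, 200, 220], 0)]),
    ("def.png", 640, 480, [([5, 5, 100, 90], 1), ([50, 60, 300, 200], 0)]),
])
```

Place the referenced images under `input_directory/images/` and upload the whole directory to S3 as described above.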

After training, a label mapping file and trained model are saved to your Amazon S3 bucket.

## Incremental training


You can seed the training of a new model with artifacts from a model that you trained previously with SageMaker AI. Incremental training saves training time when you want to train a new model with the same or similar data.

**Note**  
You can only seed a SageMaker AI Object Detection - TensorFlow model with another Object Detection - TensorFlow model trained in SageMaker AI. 

You can use any dataset for incremental training, as long as the set of classes remains the same. The incremental training step is similar to the fine-tuning step, but instead of starting with a pretrained model, you start with an existing fine-tuned model. For more information about how to use incremental training with the SageMaker AI Object Detection - TensorFlow, see the [Introduction to SageMaker TensorFlow - Object Detection](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/object_detection_tensorflow/Amazon_Tensorflow_Object_Detection.ipynb) notebook.

## Inference with the Object Detection - TensorFlow algorithm


You can host the fine-tuned model that results from your TensorFlow Object Detection training for inference. Any input image for inference must be in `.jpg`, `.jpeg`, or `.png` format and have the content type `application/x-image`. The Object Detection - TensorFlow algorithm resizes input images automatically. 

Running inference results in bounding boxes, predicted classes, and the scores of each prediction encoded in JSON format. The Object Detection - TensorFlow model processes a single image per request and outputs only one line. The following is an example of a JSON format response:

```
accept: application/json;verbose

{"normalized_boxes":[[xmin1, xmax1, ymin1, ymax1],....], 
    "classes":[classidx1, class_idx2,...], 
    "scores":[score_1, score_2,...], 
    "labels": [label1, label2, ...], 
    "tensorflow_model_output":<original output of the model>}
```

If `accept` is set to `application/json`, then the model only outputs normalized boxes, classes, and scores. 
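As a sketch of consuming such a response, the following parses a hand-written example in the documented shape (the boxes, class indices, labels, and scores below are made-up values) and keeps only predictions above a confidence threshold:

```python
import json

# A hand-written response in the documented verbose shape; values are
# illustrative, not real model output.
raw_response = """{"normalized_boxes": [[0.1, 0.9, 0.2, 0.8], [0.3, 0.5, 0.3, 0.6]],
                   "classes": [1, 7],
                   "scores": [0.92, 0.40],
                   "labels": ["cat", "dog"]}"""

response = json.loads(raw_response)

# Keep only predictions at or above a confidence threshold.
threshold = 0.5
detections = [
    {"box": box, "label": label, "score": score}
    for box, label, score in zip(
        response["normalized_boxes"], response["labels"], response["scores"]
    )
    if score >= threshold
]
```
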

## Amazon EC2 instance recommendation for the Object Detection - TensorFlow algorithm


The Object Detection - TensorFlow algorithm supports all GPU instances for training, including:
+ `ml.p2.xlarge`
+ `ml.p2.16xlarge`
+ `ml.p3.2xlarge`
+ `ml.p3.16xlarge`

We recommend GPU instances with more memory for training with large batch sizes. Both CPU (such as M5) and GPU (P2 or P3) instances can be used for inference. For a comprehensive list of SageMaker training and inference instances across AWS Regions, see [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/).

## Object Detection - TensorFlow sample notebooks

For more information about how to use the SageMaker AI Object Detection - TensorFlow algorithm for transfer learning on a custom dataset, see the [Introduction to SageMaker TensorFlow - Object Detection](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/object_detection_tensorflow/Amazon_Tensorflow_Object_Detection.ipynb) notebook.

For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, select the **SageMaker AI Examples** tab to see a list of all the SageMaker AI samples. To open a notebook, choose its **Use** tab and choose **Create copy**.

# How Object Detection - TensorFlow Works

The Object Detection - TensorFlow algorithm takes an image as input and predicts bounding boxes and object labels. Various deep learning networks such as MobileNet, ResNet, Inception, and EfficientNet are highly accurate for object detection. There are also deep learning networks that are trained on large image datasets, such as Common Objects in Context (COCO), which has 328,000 images. After a network is trained with COCO data, you can then fine-tune the network on a dataset with a particular focus to perform more specific object detection tasks. The Amazon SageMaker AI Object Detection - TensorFlow algorithm supports transfer learning on many pretrained models that are available in the TensorFlow Model Garden.

Based on the number of class labels in your training data, an object detection layer is attached to the pretrained TensorFlow model of your choice. You can then fine-tune either the entire network (including the pretrained model) or only the top classification layer on new training data. With this method of transfer learning, training with smaller datasets is possible.

# TensorFlow Models

The following pretrained models are available to use for transfer learning with the Object Detection - TensorFlow algorithm. 

The following models vary significantly in size, number of model parameters, training time, and inference latency for any given dataset. The best model for your use case depends on the complexity of your fine-tuning dataset and any requirements that you have on training time, inference latency, or model accuracy.


| Model Name | `model_id` | Source | 
| --- | --- | --- | 
| ResNet50 V1 FPN 640 | `tensorflow-od1-ssd-resnet50-v1-fpn-640x640-coco17-tpu-8` | [TensorFlow Model Garden link](http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8.tar.gz) | 
| EfficientDet D0 512 | `tensorflow-od1-ssd-efficientdet-d0-512x512-coco17-tpu-8` | [TensorFlow Model Garden link](http://download.tensorflow.org/models/object_detection/tf2/20200711/efficientdet_d0_coco17_tpu-32.tar.gz) | 
| EfficientDet D1 640 | `tensorflow-od1-ssd-efficientdet-d1-640x640-coco17-tpu-8` | [TensorFlow Model Garden link](http://download.tensorflow.org/models/object_detection/tf2/20200711/efficientdet_d1_coco17_tpu-32.tar.gz) | 
| EfficientDet D2 768 | `tensorflow-od1-ssd-efficientdet-d2-768x768-coco17-tpu-8` | [TensorFlow Model Garden link](http://download.tensorflow.org/models/object_detection/tf2/20200711/efficientdet_d2_coco17_tpu-32.tar.gz) | 
| EfficientDet D3 896 | `tensorflow-od1-ssd-efficientdet-d3-896x896-coco17-tpu-32` | [TensorFlow Model Garden link](http://download.tensorflow.org/models/object_detection/tf2/20200711/efficientdet_d3_coco17_tpu-32.tar.gz) | 
| MobileNet V1 FPN 640 | `tensorflow-od1-ssd-mobilenet-v1-fpn-640x640-coco17-tpu-8` | [TensorFlow Model Garden link](http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_mobilenet_v1_fpn_640x640_coco17_tpu-8.tar.gz) | 
| MobileNet V2 FPNLite 320 | `tensorflow-od1-ssd-mobilenet-v2-fpnlite-320x320-coco17-tpu-8` | [TensorFlow Model Garden link](http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_mobilenet_v2_fpnlite_320x320_coco17_tpu-8.tar.gz) | 
| MobileNet V2 FPNLite 640 | `tensorflow-od1-ssd-mobilenet-v2-fpnlite-640x640-coco17-tpu-8` | [TensorFlow Model Garden link](http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8.tar.gz) | 
| ResNet50 V1 FPN 1024 | `tensorflow-od1-ssd-resnet50-v1-fpn-1024x1024-coco17-tpu-8` | [TensorFlow Model Garden link](http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_resnet50_v1_fpn_1024x1024_coco17_tpu-8.tar.gz) | 
| ResNet101 V1 FPN 640 | `tensorflow-od1-ssd-resnet101-v1-fpn-640x640-coco17-tpu-8` | [TensorFlow Model Garden link](http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_resnet101_v1_fpn_640x640_coco17_tpu-8.tar.gz) | 
| ResNet101 V1 FPN 1024 | `tensorflow-od1-ssd-resnet101-v1-fpn-1024x1024-coco17-tpu-8` | [TensorFlow Model Garden link](http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_resnet101_v1_fpn_1024x1024_coco17_tpu-8.tar.gz) | 
| ResNet152 V1 FPN 640 | `tensorflow-od1-ssd-resnet152-v1-fpn-640x640-coco17-tpu-8` | [TensorFlow Model Garden link](http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_resnet152_v1_fpn_640x640_coco17_tpu-8.tar.gz) | 
| ResNet152 V1 FPN 1024 | `tensorflow-od1-ssd-resnet152-v1-fpn-1024x1024-coco17-tpu-8` | [TensorFlow Model Garden link](http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_resnet152_v1_fpn_1024x1024_coco17_tpu-8.tar.gz) | 

# Object Detection - TensorFlow Hyperparameters

Hyperparameters are parameters that are set before a machine learning model begins learning. The following hyperparameters are supported by the Amazon SageMaker AI built-in Object Detection - TensorFlow algorithm. See [Tune an Object Detection - TensorFlow model](object-detection-tensorflow-tuning.md) for information on hyperparameter tuning. 


| Parameter Name | Description | 
| --- | --- | 
| batch\_size |  The batch size for training.  Valid values: positive integer. Default value: `3`.  | 
| beta\_1 |  The beta1 for the `"adam"` optimizer. Represents the exponential decay rate for the first moment estimates. Ignored for other optimizers. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.9`.  | 
| beta\_2 |  The beta2 for the `"adam"` optimizer. Represents the exponential decay rate for the second moment estimates. Ignored for other optimizers. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.999`.  | 
| early\_stopping |  Set to `"True"` to use early stopping logic during training. If `"False"`, early stopping is not used. Valid values: string, either: (`"True"` or `"False"`). Default value: `"False"`.  | 
| early\_stopping\_min\_delta |  The minimum change needed to qualify as an improvement. An absolute change less than the value of `early_stopping_min_delta` does not qualify as improvement. Used only when `early_stopping` is set to `"True"`. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.0`.  | 
| early\_stopping\_patience |  The number of epochs to continue training with no improvement. Used only when `early_stopping` is set to `"True"`. Valid values: positive integer. Default value: `5`.  | 
| epochs |  The number of training epochs. Valid values: positive integer. Default value: `5` for smaller models, `1` for larger models.  | 
| epsilon |  The epsilon for `"adam"`, `"rmsprop"`, `"adadelta"`, and `"adagrad"` optimizers. Usually set to a small value to avoid division by 0. Ignored for other optimizers. Valid values: float, range: [`0.0`, `1.0`]. Default value: `1e-7`.  | 
| initial\_accumulator\_value |  The starting value for the accumulators, or the per-parameter momentum values, for the `"adagrad"` optimizer. Ignored for other optimizers. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.1`.  | 
| learning\_rate |  The optimizer learning rate. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.001`.  | 
| momentum |  The momentum for the `"sgd"` and `"nesterov"` optimizers. Ignored for other optimizers. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.9`.  | 
| optimizer |  The optimizer type. For more information, see [Optimizers](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers) in the TensorFlow documentation. Valid values: string, any of the following: (`"adam"`, `"sgd"`, `"nesterov"`, `"rmsprop"`, `"adagrad"`, `"adadelta"`). Default value: `"adam"`.  | 
| reinitialize\_top\_layer |  If set to `"Auto"`, the top classification layer parameters are re-initialized during fine-tuning. For incremental training, top classification layer parameters are not re-initialized unless set to `"True"`. Valid values: string, any of the following: (`"Auto"`, `"True"`, or `"False"`). Default value: `"Auto"`.  | 
| rho |  The discounting factor for the gradient of the `"adadelta"` and `"rmsprop"` optimizers. Ignored for other optimizers.  Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.95`.  | 
| train\_only\_on\_top\_layer |  If `"True"`, only the top classification layer parameters are fine-tuned. If `"False"`, all model parameters are fine-tuned. Valid values: string, either: (`"True"` or `"False"`). Default value: `"False"`.  | 
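The built-in algorithm expects hyperparameter values as strings, typically supplied through the `hyperparameters` argument of an estimator in the SageMaker Python SDK. The following is a minimal sketch with illustrative values (not recommendations); the `validate` helper is hypothetical and simply re-checks a few of the constraints from the table above before a job is submitted:

```python
# Illustrative hyperparameter values for the Object Detection - TensorFlow
# algorithm; all values are passed as strings.
hyperparameters = {
    "batch_size": "4",
    "epochs": "5",
    "learning_rate": "0.001",
    "optimizer": "adam",
    "beta_1": "0.9",
    "beta_2": "0.999",
    "early_stopping": "True",
    "early_stopping_patience": "5",
}

def validate(hp):
    """Hypothetical pre-flight check against a few documented constraints."""
    assert int(hp["batch_size"]) > 0
    assert 0.0 <= float(hp["learning_rate"]) <= 1.0
    assert hp["optimizer"] in {"adam", "sgd", "nesterov", "rmsprop", "adagrad", "adadelta"}
    # beta_1 and beta_2 apply only to the "adam" optimizer.
    if hp["optimizer"] == "adam":
        assert 0.0 <= float(hp["beta_1"]) <= 1.0
        assert 0.0 <= float(hp["beta_2"]) <= 1.0
    return True
```
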

# Tune an Object Detection - TensorFlow model

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the objective metric.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics computed by the Object Detection - TensorFlow algorithm

Refer to the following chart to find which metrics are computed by the Object Detection - TensorFlow algorithm.


| Metric Name | Description | Optimization Direction | Regex Pattern | 
| --- | --- | --- | --- | 
| validation:localization\_loss | The localization loss for box prediction. | Minimize | `Val_localization=([0-9\\.]+)` | 

## Tunable Object Detection - TensorFlow hyperparameters

Tune an object detection model with the following hyperparameters. The hyperparameters that have the greatest impact on object detection objective metrics are: `batch_size`, `learning_rate`, and `optimizer`. Tune the optimizer-related hyperparameters, such as `momentum`, `regularizers_l2`, `beta_1`, `beta_2`, and `eps` based on the selected `optimizer`. For example, use `beta_1` and `beta_2` only when `adam` is the `optimizer`.

For more information about which hyperparameters are used for each `optimizer`, see [Object Detection - TensorFlow Hyperparameters](object-detection-tensorflow-Hyperparameter.md).


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| batch\_size | IntegerParameterRanges | MinValue: 8, MaxValue: 512 | 
| beta\_1 | ContinuousParameterRanges | MinValue: 1e-6, MaxValue: 0.999 | 
| beta\_2 | ContinuousParameterRanges | MinValue: 1e-6, MaxValue: 0.999 | 
| eps | ContinuousParameterRanges | MinValue: 1e-8, MaxValue: 1.0 | 
| learning\_rate | ContinuousParameterRanges | MinValue: 1e-6, MaxValue: 0.5 | 
| momentum | ContinuousParameterRanges | MinValue: 0.0, MaxValue: 0.999 | 
| optimizer | CategoricalParameterRanges | ['sgd', 'adam', 'rmsprop', 'nesterov', 'adagrad', 'adadelta'] | 
| regularizers\_l2 | ContinuousParameterRanges | MinValue: 0.0, MaxValue: 0.999 | 
| train\_only\_on\_top\_layer | CategoricalParameterRanges | ['True', 'False'] | 
| initial\_accumulator\_value | ContinuousParameterRanges | MinValue: 0.0, MaxValue: 0.999 | 

# Semantic Segmentation Algorithm

The SageMaker AI semantic segmentation algorithm provides a fine-grained, pixel-level approach to developing computer vision applications. It tags every pixel in an image with a class label from a predefined set of classes. Tagging is fundamental for understanding scenes, which is critical to an increasing number of computer vision applications, such as self-driving vehicles, medical imaging diagnostics, and robot sensing. 

For comparison, the SageMaker AI [Image Classification - MXNet](image-classification.md) algorithm is a supervised learning algorithm that analyzes only whole images, classifying them into one of multiple output categories. The [Object Detection - MXNet](object-detection.md) algorithm is a supervised learning algorithm that detects and classifies all instances of an object in an image. It indicates the location and scale of each object in the image with a rectangular bounding box. 

Because the semantic segmentation algorithm classifies every pixel in an image, it also provides information about the shapes of the objects contained in the image. The segmentation output is represented as a grayscale image with the same shape as the input image, called a *segmentation mask*.

The SageMaker AI semantic segmentation algorithm is built using the [MXNet Gluon framework and the Gluon CV toolkit](https://github.com/dmlc/gluon-cv). It provides you with a choice of three built-in algorithms to train a deep neural network. You can use the [Fully-Convolutional Network (FCN) algorithm ](https://arxiv.org/abs/1605.06211), [Pyramid Scene Parsing (PSP) algorithm](https://arxiv.org/abs/1612.01105), or [DeepLabV3](https://arxiv.org/abs/1706.05587). 

Each of the three algorithms has two distinct components: 
+ The *backbone* (or *encoder*)—A network that produces reliable activation maps of features.
+ The *decoder*—A network that constructs the segmentation mask from the encoded activation maps.

You also have a choice of backbones for the FCN, PSP, and DeepLabV3 algorithms: [ResNet50 or ResNet101](https://arxiv.org/abs/1512.03385). These backbones include pretrained artifacts that were originally trained on the [ImageNet](http://www.image-net.org/) classification task. You can fine-tune these backbones for segmentation using your own data. Or, you can initialize and train these networks from scratch using only your own data. The decoders are never pretrained. 

To deploy the trained model for inference, use the SageMaker AI hosting service. During inference, you can request the segmentation mask either as a PNG image or as a set of probabilities for each class for each pixel. You can use these masks as part of a larger pipeline that includes additional downstream image processing or other applications.

**Topics**
+ [Semantic Segmentation Sample Notebooks](#semantic-segmentation-sample-notebooks)
+ [Input/Output Interface for the Semantic Segmentation Algorithm](#semantic-segmentation-inputoutput)
+ [EC2 Instance Recommendation for the Semantic Segmentation Algorithm](#semantic-segmentation-instances)
+ [Semantic Segmentation Hyperparameters](segmentation-hyperparameters.md)
+ [Tuning a Semantic Segmentation Model](semantic-segmentation-tuning.md)

## Semantic Segmentation Sample Notebooks

For a sample Jupyter notebook that uses the SageMaker AI semantic segmentation algorithm to train a model and deploy it to perform inferences, see the [Semantic Segmentation Example](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/semantic_segmentation_pascalvoc/semantic_segmentation_pascalvoc.html). For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). 

To see a list of all of the SageMaker AI samples, create and open a notebook instance, and choose the **SageMaker AI Examples** tab. The example semantic segmentation notebooks are located under **Introduction to Amazon algorithms**. To open a notebook, choose its **Use** tab, and choose **Create copy**.

## Input/Output Interface for the Semantic Segmentation Algorithm


SageMaker AI semantic segmentation expects the customer's training dataset to be on [Amazon Simple Storage Service (Amazon S3)](https://aws.amazon.com/s3/). Once trained, it produces the resulting model artifacts on Amazon S3. The input interface format for the SageMaker AI semantic segmentation algorithm is similar to that of most standardized semantic segmentation benchmarking datasets. The dataset in Amazon S3 is expected to be presented in two channels, one for `train` and one for `validation`, using four directories, two for images and two for annotations. Annotations are expected to be uncompressed PNG images. The dataset might also have a label map that describes how the annotation mappings are established. If not, the algorithm uses a default. It also supports the augmented manifest image format (`application/x-image`) for training in Pipe input mode straight from Amazon S3. For inference, an endpoint accepts images with an `image/jpeg` content type. 

### How Training Works

The training data is split into four directories: `train`, `train_annotation`, `validation`, and `validation_annotation`. There is a channel for each of these directories. The dataset is also expected to have one `label_map.json` file per channel, for `train_annotation` and `validation_annotation` respectively. If you don't provide these JSON files, SageMaker AI provides the default label map.

The dataset specifying these files should look similar to the following example:

```
s3://bucket_name
    |
    |- train
                 |
                 | - 0000.jpg
                 | - coffee.jpg
    |- validation
                 |
                 | - 00a0.jpg
                 | - banana.jpg
    |- train_annotation
                 |
                 | - 0000.png
                 | - coffee.png
    |- validation_annotation
                 |
                 | - 00a0.png
                 | - banana.png
    |- label_map
                 | - train_label_map.json
                 | - validation_label_map.json
```

Every JPG image in the train and validation directories has a corresponding PNG label image with the same name in the `train_annotation` and `validation_annotation` directories. This naming convention helps the algorithm associate a label with its corresponding image during training. The `train`, `train_annotation`, `validation`, and `validation_annotation` channels are mandatory. The annotations are single-channel PNG images. The format works as long as the metadata (modes) in the image helps the algorithm read the annotation images into a single-channel 8-bit unsigned integer. For more information on our support for modes, see the [Python Image Library documentation](https://pillow.readthedocs.io/en/stable/handbook/concepts.html#modes). We recommend using the 8-bit pixel, palette `P` mode. 

When these modes are used, each pixel in the annotation image encodes a simple 8-bit integer. To get from this encoding to a label, the algorithm uses one mapping file per channel, called the *label map*. The label map is used to map the values in the image to actual label indices. In the default label map, which is provided if you don't supply one, the pixel values in an annotation matrix (image) directly index the labels. These images can be grayscale PNG files or 8-bit indexed PNG files. The label map file for the unscaled default case is the following: 

```
{
  "scale": "1"
}
```

To provide some contrast for viewing, some annotation software scales the label images by a constant amount. To support this, the SageMaker AI semantic segmentation algorithm provides a rescaling option to scale down the values to actual label values. When scaling down doesn't convert the value to an appropriate integer, the algorithm defaults to the greatest integer less than or equal to the scaled value. The following code shows how to set the scale value to rescale the label values:

```
{
  "scale": "3"
}
```

The following example shows how this `"scale"` value is used to rescale the `encoded_label` values of the input annotation image when they are mapped to the `mapped_label` values to be used in training. The label values in the input annotation image are 0, 3, 6, with scale 3, so they are mapped to 0, 1, 2 for training:

```
encoded_label = [0, 3, 6]
mapped_label = [0, 1, 2]
```

In some cases, you might need to specify a particular color mapping for each class. Use the `map` option in the label mapping as shown in the following example of a `label_map` file:

```
{
    "map": {
        "0": 5,
        "1": 0,
        "2": 2
    }
}
```

The label mapping for this example is:

```
encoded_label = [0, 5, 2]
mapped_label = [1, 0, 2]
```

With label mappings, you can use different annotation systems and annotation software to obtain data without a lot of preprocessing. You can provide one label map per channel. The files for a label map in the `label_map` channel must follow the naming conventions for the four-directory structure. If you don't provide a label map, the algorithm assumes a scale of 1 (the default).

### Training with the Augmented Manifest Format

The augmented manifest format enables you to do training in Pipe mode using image files without needing to create RecordIO files. The augmented manifest file contains data objects and should be in [JSON Lines](http://jsonlines.org/) format, as described in the [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request. Each line in the manifest is an entry containing the Amazon S3 URI for the image and the URI for the annotation image.

Each JSON object in the manifest file must contain a `source-ref` key. The `source-ref` key should contain the value of the Amazon S3 URI to the image. The labels are provided under the `AttributeNames` parameter value as specified in the [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request. The object can also contain additional metadata under the metadata tag, but this is ignored by the algorithm. In the example below, the `AttributeNames` are contained in the list of image and annotation references `["source-ref", "city-streets-ref"]`. These names must have `-ref` appended to them. When using the Semantic Segmentation algorithm with Augmented Manifest, the value of the `RecordWrapperType` parameter must be `"RecordIO"` and the value of the `ContentType` parameter must be `application/x-recordio`.

```
{"source-ref": "S3 bucket location", "city-streets-ref": "S3 bucket location", "city-streets-metadata": {"job-name": "label-city-streets"}}
```

For more information on augmented manifest files, see [Augmented Manifest Files for Training Jobs](augmented-manifest.md).
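The entries above can be generated programmatically. The following is a minimal sketch that serializes two manifest entries as JSON Lines; the bucket name, prefixes, and file names are hypothetical, and the attribute name `city-streets-ref` follows the example above (note the required `-ref` suffix):

```python
import json

# Hypothetical S3 URIs; substitute your own bucket and key prefixes.
entries = [
    {
        "source-ref": "s3://your-bucket/images/0001.jpg",
        "city-streets-ref": "s3://your-bucket/annotations/0001.png",
        "city-streets-metadata": {"job-name": "label-city-streets"},
    },
    {
        "source-ref": "s3://your-bucket/images/0002.jpg",
        "city-streets-ref": "s3://your-bucket/annotations/0002.png",
        "city-streets-metadata": {"job-name": "label-city-streets"},
    },
]

# JSON Lines: one JSON object per line, with no enclosing array.
manifest = "\n".join(json.dumps(entry) for entry in entries)
```
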

### Incremental Training

You can also seed the training of a new model with a model that you trained previously using SageMaker AI. This incremental training saves training time when you want to train a new model with the same or similar data. Currently, incremental training is supported only for models trained with the built-in SageMaker AI Semantic Segmentation algorithm.

To use your own pre-trained model, specify the `ChannelName` as "model" in the `InputDataConfig` for the [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request. Set the `ContentType` for the model channel to `application/x-sagemaker-model`. The `backbone`, `algorithm`, `crop_size`, and `num_classes` input parameters that define the network architecture must be consistently specified in the input hyperparameters of the new model and the pre-trained model that you upload to the model channel. For the pretrained model file, you can use the compressed (.tar.gz) artifacts from SageMaker AI outputs. You can only use Image formats for input data. For more information on incremental training and for instructions on how to use it, see [Use Incremental Training in Amazon SageMaker AI](incremental-training.md). 

### Produce Inferences

To query a trained model that is deployed to an endpoint, you need to provide an image and an `AcceptType` that denotes the type of output required. The endpoint takes JPEG images with an `image/jpeg` content type. If you request an `AcceptType` of `image/png`, the algorithm outputs a PNG file with a segmentation mask in the same format as the labels themselves. If you request an accept type of `application/x-recordio-protobuf`, the algorithm returns class probabilities encoded in recordio-protobuf format. The latter format outputs a 3D tensor where the third dimension is the same size as the number of classes. This component denotes the probability of each class label for each pixel.
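Once the probability tensor is decoded (for example, to a height x width x num_classes nested list), turning it into a segmentation mask amounts to taking the most probable class per pixel. The following is a minimal sketch; the tiny 2x2, 3-class tensor is made up for illustration:

```python
# Made-up per-pixel class probabilities for a 2x2 image with 3 classes.
probs = [
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
    [[0.2, 0.3, 0.5], [0.9, 0.05, 0.05]],
]

def to_mask(probabilities):
    # For each pixel, pick the index of the highest class probability.
    return [
        [max(range(len(pixel)), key=pixel.__getitem__) for pixel in row]
        for row in probabilities
    ]

mask = to_mask(probs)  # [[0, 1], [2, 0]]
```
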

## EC2 Instance Recommendation for the Semantic Segmentation Algorithm


The SageMaker AI semantic segmentation algorithm only supports GPU instances for training, and we recommend using GPU instances with more memory for training with large batch sizes. The algorithm can be trained using P2, P3, G4dn, or G5 instances in single machine configurations.

For inference, you can use CPU instances (such as C5 and M5), GPU instances (such as P3 and G4dn), or both. For information about the instance types that provide varying combinations of CPU, GPU, memory, and networking capacity for inference, see [Amazon SageMaker AI ML Instance Types](https://aws.amazon.com/sagemaker/pricing/instance-types/).

# Semantic Segmentation Hyperparameters

The following tables list the hyperparameters supported by the Amazon SageMaker AI semantic segmentation algorithm for network architecture, data inputs, and training. You specify Semantic Segmentation for training in the `AlgorithmName` of the [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request.

**Network Architecture Hyperparameters**


| Parameter Name | Description | 
| --- | --- | 
| backbone |  The backbone to use for the algorithm's encoder component. **Optional** Valid values: `resnet-50`, `resnet-101`  Default value: `resnet-50`  | 
| use\_pretrained\_model |  Whether a pretrained model is to be used for the backbone. **Optional** Valid values: `True`, `False` Default value: `True`  | 
| algorithm |  The algorithm to use for semantic segmentation.  **Optional** Valid values: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/segmentation-hyperparameters.html) Default value: `fcn`  | 

**Data Hyperparameters**


| Parameter Name | Description | 
| --- | --- | 
| num\_classes |  The number of classes to segment. **Required** Valid values: 2 ≤ positive integer ≤ 254  | 
| num\_training\_samples |  The number of samples in the training data. The algorithm uses this value to set up the learning rate scheduler. **Required** Valid values: positive integer  | 
| base\_size |  Defines how images are rescaled before cropping. Images are rescaled such that the long side length is set to `base_size` multiplied by a random number from 0.5 to 2.0, and the short side is computed to preserve the aspect ratio. **Optional** Valid values: positive integer > 16 Default value: 520  | 
| crop\_size |  The image size for input during training. We randomly rescale the input image based on `base_size`, and then take a random square crop with side length equal to `crop_size`. The `crop_size` will be automatically rounded up to multiples of 8. **Optional** Valid values: positive integer > 16 Default value: 240  | 

**Training Hyperparameters**


| Parameter Name | Description | 
| --- | --- | 
| early\_stopping |  Whether to use early stopping logic during training. **Optional** Valid values: `True`, `False` Default value: `False`  | 
| early\_stopping\_min\_epochs |  The minimum number of epochs that must be run. **Optional** Valid values: integer Default value: 5  | 
| early\_stopping\_patience |  The number of epochs that meet the tolerance for lower performance before the algorithm enforces an early stop. **Optional** Valid values: integer Default value: 4  | 
| early\_stopping\_tolerance |  If the relative improvement of the score of the training job, the mIOU, is smaller than this value, early stopping considers the epoch as not improved. This is used only when `early_stopping` = `True`. **Optional** Valid values: 0 ≤ float ≤ 1 Default value: 0.0  | 
| epochs |  The number of epochs with which to train. **Optional** Valid values: positive integer Default value: 10  | 
| gamma1 |  The decay factor for the moving average of the squared gradient for `rmsprop`. Used only for `rmsprop`. **Optional** Valid values: 0 ≤ float ≤ 1 Default value: 0.9  | 
| gamma2 |  The momentum factor for `rmsprop`. **Optional** Valid values: 0 ≤ float ≤ 1 Default value: 0.9  | 
| learning\_rate |  The initial learning rate.  **Optional** Valid values: 0 < float ≤ 1 Default value: 0.001  | 
| lr\_scheduler |  The shape of the learning rate schedule that controls its decrease over time. **Optional** Valid values:  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/segmentation-hyperparameters.html) Default value: `poly`  | 
| lr\_scheduler\_factor |  If `lr_scheduler` is set to `step`, the ratio by which to reduce (multiply) the `learning_rate` after each of the epochs specified by the `lr_scheduler_step`. Otherwise, ignored. **Optional** Valid values: 0 ≤ float ≤ 1 Default value: 0.1  | 
| lr\_scheduler\_step |  A comma delimited list of the epochs after which the `learning_rate` is reduced (multiplied) by an `lr_scheduler_factor`. For example, if the value is set to `"10, 20"`, then the `learning_rate` is reduced by `lr_scheduler_factor` after the 10th epoch and again after the 20th epoch. **Conditionally Required** if `lr_scheduler` is set to `step`. Otherwise, ignored. Valid values: string Default value: (No default, as the value is required when used.)  | 
| mini\_batch\_size |  The batch size for training. Using a large `mini_batch_size` usually results in faster training, but it might cause you to run out of memory. Memory usage is affected by the values of the `mini_batch_size` and `image_shape` parameters, and the backbone architecture. **Optional** Valid values: positive integer  Default value: 16  | 
| momentum |  The momentum for the `sgd` optimizer. When you use other optimizers, the semantic segmentation algorithm ignores this parameter. **Optional** Valid values: 0 < float ≤ 1 Default value: 0.9  | 
| optimizer |  The type of optimizer. For more information about an optimizer, choose the appropriate link: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/segmentation-hyperparameters.html) **Optional** Valid values: `adam`, `adagrad`, `nag`, `rmsprop`, `sgd`  Default value: `sgd`  | 
| syncbn |  If set to `True`, the batch normalization mean and variance are computed over all the samples processed across the GPUs. **Optional**  Valid values: `True`, `False`  Default value: `False`  | 
| validation\_mini\_batch\_size |  The batch size for validation. A large batch size usually results in faster processing, but it might cause you to run out of memory. Memory usage is affected by the values of the `validation_mini_batch_size` and `image_shape` parameters, and the backbone architecture.  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/segmentation-hyperparameters.html) **Optional** Valid values: positive integer Default value: 16  | 
| weight\_decay |  The weight decay coefficient for the `sgd` optimizer. When you use other optimizers, the algorithm ignores this parameter.  **Optional** Valid values: 0 < float < 1 Default value: 0.0001  | 
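To make the interaction between `lr_scheduler_step` and `lr_scheduler_factor` concrete, the following sketch reproduces the `step` schedule logic in plain Python. The `step_schedule` helper is hypothetical and not part of any SageMaker API.

```python
# Hypothetical helper illustrating the "step" learning rate schedule:
# the learning rate is multiplied by lr_scheduler_factor after each
# epoch listed in lr_scheduler_step.

def step_schedule(initial_lr, factor, step_epochs, epoch):
    """Return the learning rate in effect at the given epoch (1-based)."""
    lr = initial_lr
    for boundary in sorted(step_epochs):
        if epoch > boundary:  # the rate is reduced *after* the boundary epoch
            lr *= factor
    return lr

# lr_scheduler_step = "10, 20" and lr_scheduler_factor = 0.1
steps = [int(s) for s in "10, 20".split(",")]
print(step_schedule(0.001, 0.1, steps, epoch=5))    # initial rate
print(step_schedule(0.001, 0.1, steps, epoch=15))   # reduced once
print(step_schedule(0.001, 0.1, steps, epoch=25))   # reduced twice
```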

# Tuning a Semantic Segmentation Model

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the chosen hyperparameter ranges to find the combination of values that results in the model that best optimizes the objective metric.

## Metrics Computed by the Semantic Segmentation Algorithm

The semantic segmentation algorithm reports two validation metrics. When tuning hyperparameter values, choose one of these metrics as the objective.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| validation:mIOU |  The area of the intersection of the predicted segmentation and the ground truth divided by the area of union between them for images in the validation set. Also known as the Jaccard Index.  |  Maximize  | 
| validation:pixel\_accuracy | The percentage of pixels that are correctly classified in images from the validation set. |  Maximize  | 
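As a concrete illustration of the two metrics, the following sketch computes pixel accuracy and mean IoU for a single image represented as flat lists of per-pixel class labels. This is a simplified, hypothetical implementation; the algorithm itself averages mIOU over all classes and images in the validation set.

```python
# Toy implementations of the two validation metrics for one image,
# where pred and truth are flat lists of per-pixel class labels.

def pixel_accuracy(pred, truth):
    """Fraction of pixels whose predicted class matches the ground truth."""
    correct = sum(p == t for p, t in zip(pred, truth))
    return correct / len(truth)

def iou(pred, truth, cls):
    """Intersection over union (Jaccard index) for a single class."""
    inter = sum(p == cls and t == cls for p, t in zip(pred, truth))
    union = sum(p == cls or t == cls for p, t in zip(pred, truth))
    return inter / union if union else 0.0

def mean_iou(pred, truth, num_classes):
    """Average the per-class IoU scores (mIOU for one image)."""
    return sum(iou(pred, truth, c) for c in range(num_classes)) / num_classes

truth = [0, 0, 1, 1]
pred  = [0, 1, 1, 1]
print(pixel_accuracy(pred, truth))           # 3 of 4 pixels correct -> 0.75
print(mean_iou(pred, truth, num_classes=2))  # (1/2 + 2/3) / 2
```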

## Tunable Semantic Segmentation Hyperparameters

You can tune the following hyperparameters for the semantic segmentation algorithm.


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| learning\_rate |  ContinuousParameterRange  |  MinValue: 1e-4, MaxValue: 1e-1  | 
| mini\_batch\_size |  IntegerParameterRanges  |  MinValue: 1, MaxValue: 128  | 
| momentum |  ContinuousParameterRange  |  MinValue: 0.9, MaxValue: 0.999  | 
| optimizer |  CategoricalParameterRanges  |  ['sgd', 'adam', 'adadelta']  | 
| weight\_decay |  ContinuousParameterRange  |  MinValue: 1e-5, MaxValue: 1e-3  | 
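The ranges in the table map to parameter types in a tuning job. The following toy sketch samples configurations from those ranges with plain random search. SageMaker AI automatic model tuning uses more sophisticated search strategies, and the `objective` function here is a hypothetical stand-in for `validation:mIOU` from real training jobs.

```python
import math
import random

# Toy random search over the recommended ranges in the table above.
# Only the parameter types and ranges are taken from the table; the
# scoring function is a made-up placeholder.

random.seed(0)

def sample_config():
    return {
        # Continuous ranges spanning orders of magnitude are usually
        # searched on a log scale.
        "learning_rate": 10 ** random.uniform(-4, -1),
        "mini_batch_size": random.randint(1, 128),
        "momentum": random.uniform(0.9, 0.999),
        "optimizer": random.choice(["sgd", "adam", "adadelta"]),
        "weight_decay": 10 ** random.uniform(-5, -3),
    }

def objective(config):
    # Hypothetical stand-in for validation:mIOU; peaks near lr = 1e-2.
    return 1.0 - abs(math.log10(config["learning_rate"]) + 2)

best = max((sample_config() for _ in range(20)), key=objective)
print(best["learning_rate"], best["optimizer"])
```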

# Use Reinforcement Learning with Amazon SageMaker AI

Reinforcement learning (RL) combines fields such as computer science, neuroscience, and psychology to determine how to map situations to actions to maximize a numerical reward signal. This notion of a reward signal in RL stems from neuroscience research into how the human brain makes decisions about which actions maximize reward and minimize punishment. In most situations, humans are not given explicit instructions on which actions to take, but instead must learn both which actions yield the most immediate rewards, and how those actions influence future situations and consequences. 

The problem of RL is formalized using Markov decision processes (MDPs) that originate from dynamical systems theory. MDPs aim to capture high-level details of a real problem that a learning agent encounters over some period of time in attempting to achieve some ultimate goal. The learning agent should be able to determine the current state of its environment and identify possible actions that affect the learning agent’s current state. Furthermore, the learning agent’s goals should correlate strongly to the state of the environment. A solution to a problem formulated in this way is known as a reinforcement learning method. 

## What are the differences between reinforcement, supervised, and unsupervised learning paradigms?


Machine learning can be divided into three distinct learning paradigms: supervised, unsupervised, and reinforcement.

In supervised learning, an external supervisor provides a training set of labeled examples. Each example describes a situation and has a label identifying the category to which the situation belongs. The goal of supervised learning is to generalize so that the model predicts correctly in situations that are not present in the training data. 

In contrast, RL deals with interactive problems, making it infeasible to gather all possible examples of situations with correct labels that an agent might encounter. This type of learning is most promising when an agent is able to accurately learn from its own experience and adjust accordingly. 

In unsupervised learning, an agent learns by uncovering structure within unlabeled data. While an RL agent might benefit from uncovering structure based on its experiences, the sole purpose of RL is to maximize a reward signal. 

**Topics**
+ [What are the differences between reinforcement, supervised, and unsupervised learning paradigms?](#rl-differences)
+ [Why is Reinforcement Learning Important?](#rl-why)
+ [Markov Decision Process (MDP)](#rl-terms)
+ [Key Features of Amazon SageMaker AI RL](#sagemaker-rl)
+ [Reinforcement Learning Sample Notebooks](#sagemaker-rl-notebooks)
+ [Sample RL Workflow Using Amazon SageMaker AI RL](sagemaker-rl-workflow.md)
+ [RL Environments in Amazon SageMaker AI](sagemaker-rl-environments.md)
+ [Distributed Training with Amazon SageMaker AI RL](sagemaker-rl-distributed.md)
+ [Hyperparameter Tuning with Amazon SageMaker AI RL](sagemaker-rl-tuning.md)

## Why is Reinforcement Learning Important?


RL is well-suited for solving large, complex problems, such as supply chain management, HVAC systems, industrial robotics, game artificial intelligence, dialog systems, and autonomous vehicles. Because RL models learn by a continuous process of receiving rewards and punishments for every action taken by the agent, it is possible to train systems to make decisions under uncertainty and in dynamic environments. 

## Markov Decision Process (MDP)


RL is based on models called Markov Decision Processes (MDPs). An MDP consists of a series of time steps. Each time step consists of the following:

Environment  
Defines the space in which the RL model operates. This can be either a real-world environment or a simulator. For example, if you train a physical autonomous vehicle on a physical road, that would be a real-world environment. If you train a computer program that models an autonomous vehicle driving on a road, that would be a simulator.

State  
Specifies all information about the environment and past steps that is relevant to the future. For example, in an RL model in which a robot can move in any direction at any time step, the position of the robot at the current time step is the state, because if we know where the robot is, it isn't necessary to know the steps it took to get there.

Action  
What the agent does. For example, the robot takes a step forward.

Reward  
A number that represents the value of the state that resulted from the last action that the agent took. For example, if the goal is for a robot to find treasure, the reward for finding treasure might be 5, and the reward for not finding treasure might be 0. The RL model attempts to find a strategy that optimizes the cumulative reward over the long term. This strategy is called a *policy*.

Observation  
Information about the state of the environment that is available to the agent at each step. This might be the entire state, or it might be just a part of the state. For example, the agent in a chess-playing model would be able to observe the entire state of the board at any step, but a robot in a maze might only be able to observe a small portion of the maze that it currently occupies.

Typically, training in RL consists of many *episodes*. An episode consists of all of the time steps in an MDP from the initial state until the environment reaches the terminal state.
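The terms above can be tied together in a minimal episode loop. The following toy sketch uses the treasure-hunting robot example: the state is the robot's position on a line, the action moves it left or right, and reaching the treasure yields a reward of 5 and ends the episode. This is illustrative only, not SageMaker code.

```python
import random

# A one-dimensional world of SIZE positions; the treasure (terminal
# state) sits at position TREASURE. State = current position, because
# the path taken to reach it is irrelevant to the future.
random.seed(1)
TREASURE, SIZE = 4, 5

def step(state, action):
    """Apply an action (-1 or +1); return (next_state, reward, done)."""
    next_state = max(0, min(SIZE - 1, state + action))
    done = next_state == TREASURE
    return next_state, (5 if done else 0), done  # reward 5 for treasure, else 0

# One episode under a random policy: from the initial state until the
# environment reaches the terminal state.
state, total_reward, num_steps = 0, 0, 0
done = False
while not done:
    action = random.choice([-1, 1])          # the agent's action
    state, reward, done = step(state, action)
    total_reward += reward
    num_steps += 1
print(num_steps, total_reward)
```

A policy that maximized cumulative reward here would simply always move right; RL algorithms learn such policies from the reward signal instead of being told.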

## Key Features of Amazon SageMaker AI RL


To train RL models in SageMaker AI RL, use the following components: 
+ A deep learning (DL) framework. Currently, SageMaker AI supports RL in TensorFlow and Apache MXNet.
+ An RL toolkit. An RL toolkit manages the interaction between the agent and the environment and provides a wide selection of state-of-the-art RL algorithms. SageMaker AI supports the Intel Coach and Ray RLlib toolkits. For information about Intel Coach, see [https://nervanasystems.github.io/coach/](https://nervanasystems.github.io/coach/). For information about Ray RLlib, see [https://ray.readthedocs.io/en/latest/rllib.html](https://ray.readthedocs.io/en/latest/rllib.html).
+ An RL environment. You can use custom environments, open-source environments, or commercial environments. For information, see [RL Environments in Amazon SageMaker AI](sagemaker-rl-environments.md).

The following diagram shows the RL components that are supported in SageMaker AI RL.

![\[The RL components that are supported in SageMaker AI RL.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/sagemaker-rl-support.png)


## Reinforcement Learning Sample Notebooks


For complete code examples, see the [reinforcement learning sample notebooks](https://github.com/aws/amazon-sagemaker-examples/tree/main/reinforcement_learning) in the SageMaker AI Examples repository.

# Sample RL Workflow Using Amazon SageMaker AI RL


The following example describes the steps for developing RL models using Amazon SageMaker AI RL.

1. **Formulate the RL problem**—First, formulate the business problem into an RL problem. For example, auto scaling enables services to dynamically increase or decrease capacity depending on conditions that you define. Currently, this requires setting up alarms, scaling policies, thresholds, and other manual steps. To solve this with RL, we define the components of the Markov Decision Process:

   1. **Objective**—Scale instance capacity so that it matches the desired load profile.

   1. **Environment**—A custom environment that includes the load profile. It generates a simulated load with daily and weekly variations and occasional spikes. The simulated system has a delay between when new resources are requested and when they become available for serving requests.

   1. **State**—The current load, number of failed jobs, and number of active machines.

   1. **Action**—Remove, add, or keep the same number of instances.

   1. **Reward**—A positive reward for successful transactions and a high penalty for failing transactions beyond a specified threshold.

1. **Define the RL environment**—The RL environment can be the real world where the RL agent interacts or a simulation of the real world. You can connect open source and custom environments developed using Gym interfaces and commercial simulation environments such as MATLAB and Simulink.

1. **Define the presets**—The presets configure the RL training jobs and define the hyperparameters for the RL algorithms.

1. **Write the training code**—Write training code as a Python script and pass the script to a SageMaker AI training job. In your training code, import the environment files and the preset files, and then define the `main()` function.

1. **Train the RL Model**—Use the SageMaker AI `RLEstimator` in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) to start an RL training job. If you are using local mode, the training job runs on the notebook instance. When you use SageMaker AI for training, you can select GPU or CPU instances. Store the output from the training job in a local directory if you train in local mode, or on Amazon S3 if you use SageMaker AI training.

   The `RLEstimator` requires the following information as parameters. 

   1. The source directory where the environment, presets, and training code are uploaded.

   1. The path to the training script.

   1. The RL toolkit and deep learning framework you want to use. This automatically resolves to the Amazon ECR path for the RL container.

   1. The training parameters, such as the instance count, job name, and S3 path for output.

   1. Metric definitions that you want to capture in your logs. These can also be visualized in CloudWatch and in SageMaker AI notebooks.

1. **Visualize training metrics and output**—After a training job that uses an RL model completes, you can view the metrics you defined in the training jobs in CloudWatch. You can also plot the metrics in a notebook by using the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) analytics library. Visualizing metrics helps you understand how the performance of the model, as measured by the reward, improves over time.
**Note**  
If you train in local mode, you can't visualize metrics in CloudWatch.

1. **Evaluate the model**—Checkpointed data from the previously trained models can be passed on for evaluation and inference in the checkpoint channel. In local mode, use the local directory. In SageMaker AI training mode, you need to upload the data to S3 first.

1. **Deploy RL models**—Finally, deploy the trained model on an endpoint hosted on SageMaker AI containers or on an edge device by using AWS IoT Greengrass.

For more information on RL with SageMaker AI, see [Using RL with the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/using_rl.html).
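As an illustration of steps 4 and 5, an `RLEstimator` configuration might look like the following sketch. The role ARN, bucket name, file names, toolkit version, and metric regex are placeholders, and running this requires AWS credentials and the SageMaker Python SDK installed.

```python
from sagemaker.rl import RLEstimator, RLToolkit, RLFramework

# Hypothetical RLEstimator configuration for the workflow above.
# All values below are placeholders, not a working account setup.
estimator = RLEstimator(
    entry_point="train-coach.py",        # the training script (step 4)
    source_dir="src",                    # environment, presets, and training code
    toolkit=RLToolkit.COACH,             # RL toolkit
    toolkit_version="0.11.1",            # placeholder version
    framework=RLFramework.TENSORFLOW,    # deep learning framework
    role="arn:aws:iam::111122223333:role/SageMakerRole",
    instance_type="ml.m4.xlarge",
    instance_count=1,
    output_path="s3://amzn-s3-demo-bucket/rl-output",
    metric_definitions=[
        # Metrics scraped from the training logs; visible in CloudWatch.
        {"Name": "episode_reward_mean",
         "Regex": "Total reward=(.*?),"},
    ],
)
estimator.fit()  # starts the RL training job
```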

# RL Environments in Amazon SageMaker AI


Amazon SageMaker AI RL uses environments to mimic real-world scenarios. Given the current state of the environment and an action taken by the agent or agents, the simulator processes the impact of the action, and returns the next state and a reward. Simulators are useful in cases where it is not safe to train an agent in the real world (for example, flying a drone) or if the RL algorithm takes a long time to converge (for example, when playing chess).

The following diagram shows an example of the interactions with a simulator for a car racing game.

![\[An example of the interactions with a simulator for a car racing game.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/sagemaker-rl-flow.png)


The simulation environment consists of an agent and a simulator. Here, a convolutional neural network (CNN) consumes images from the simulator and generates actions to control the game controller. With multiple simulations, this environment generates training data of the form `state_t`, `action`, `state_t+1`, and `reward_t+1`. Defining the reward is not trivial and strongly affects the quality of the RL model. The sample notebooks provide a few example reward functions, and the reward function is user-configurable. 

**Topics**
+ [Use OpenAI Gym Interface for Environments in SageMaker AI RL](#sagemaker-rl-environments-gym)
+ [Use Open-Source Environments](#sagemaker-rl-environments-open)
+ [Use Commercial Environments](#sagemaker-rl-environments-commercial)

## Use OpenAI Gym Interface for Environments in SageMaker AI RL


To use OpenAI Gym environments in SageMaker AI RL, use the following API elements. For more information about OpenAI Gym, see [Gym Documentation](https://www.gymlibrary.dev/).
+ `env.action_space`—Defines the actions the agent can take, specifies whether each action is continuous or discrete, and specifies the minimum and maximum if the action is continuous.
+ `env.observation_space`—Defines the observations the agent receives from the environment, as well as minimum and maximum for continuous observations.
+ `env.reset()`—Initializes a training episode. The `reset()` function returns the initial state of the environment, and the agent uses the initial state to take its first action. The action is then sent to `step()` repeatedly until the episode reaches a terminal state. When `step()` returns `done = True`, the episode ends. The RL toolkit re-initializes the environment by calling `reset()`.
+ `step()`—Takes the agent action as input and outputs the next state of the environment, the reward, whether the episode has terminated, and an `info` dictionary to communicate debugging information. It is the responsibility of the environment to validate the inputs.
+ `env.render()`—Used for environments that have visualization. The RL toolkit calls this function to capture visualizations of the environment after each call to the `step()` function.
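The following sketch implements this interface for a toy environment without depending on the `gym` package, so the shapes of `reset()` and `step()` are easy to see. In a real Gym environment, `action_space` and `observation_space` would be `gym.spaces` objects; the plain dictionaries here are a simplification.

```python
# A minimal Gym-style environment: the agent moves along a line of five
# positions and earns a reward of 1.0 for reaching the rightmost one.

class LineWorldEnv:
    # Simplified space descriptions (a real Gym env uses gym.spaces).
    action_space = {"type": "discrete", "n": 2}        # 0 = left, 1 = right
    observation_space = {"type": "discrete", "n": 5}   # positions 0..4

    def reset(self):
        """Initialize a training episode and return the initial state."""
        self.pos = 0
        return self.pos

    def step(self, action):
        """Apply an action; return (state, reward, done, info)."""
        assert action in (0, 1), "invalid action"      # validate inputs
        self.pos = max(0, min(4, self.pos + (1 if action == 1 else -1)))
        done = self.pos == 4                           # terminal state
        reward = 1.0 if done else 0.0
        return self.pos, reward, done, {}              # info: debugging data

    def render(self):
        """Visualize the environment (R marks the agent's position)."""
        print("".join("R" if i == self.pos else "." for i in range(5)))

# The toolkit-side loop: reset(), then step() until done = True.
env = LineWorldEnv()
state, done = env.reset(), False
while not done:
    state, reward, done, info = env.step(1)  # always move right
print(state, reward)
```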

## Use Open-Source Environments


You can use open-source environments, such as EnergyPlus and RoboSchool, in SageMaker AI RL by building your own container. For more information about EnergyPlus, see [https://energyplus.net/](https://energyplus.net/). For more information about RoboSchool, see [https://github.com/openai/roboschool](https://github.com/openai/roboschool). The HVAC and RoboSchool examples in the [SageMaker AI examples repository](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/reinforcement_learning) show how to build a custom container to use with SageMaker AI RL.

## Use Commercial Environments


You can use commercial environments, such as MATLAB and Simulink, in SageMaker AI RL by building your own container. You need to manage your own licenses.

# Distributed Training with Amazon SageMaker AI RL


Amazon SageMaker AI RL supports multi-core and multi-instance distributed training. Depending on your use case, training and/or environment rollout can be distributed. For example, SageMaker AI RL works for the following distributed scenarios:
+ Single training instance and multiple rollout instances of the same instance type. For an example, see the Neural Network Compression example in the [SageMaker AI examples repository](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/reinforcement_learning).
+ Single trainer instance and multiple rollout instances, where different instance types are used for training and rollouts. For an example, see the AWS DeepRacer / AWS RoboMaker example in the [SageMaker AI examples repository](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/reinforcement_learning).
+ Single trainer instance that uses multiple cores for rollout. For an example, see the Roboschool example in the [SageMaker AI examples repository](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/reinforcement_learning). This is useful if the simulation environment is light-weight and can run on a single thread. 
+ Multiple instances for training and rollouts. For an example, see the Roboschool example in the [SageMaker AI examples repository](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/reinforcement_learning).

# Hyperparameter Tuning with Amazon SageMaker AI RL


You can run a hyperparameter tuning job to optimize hyperparameters for Amazon SageMaker AI RL. The Roboschool example in the sample notebooks in the [SageMaker AI examples repository](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/reinforcement_learning) shows how you can do this with RL Coach. The launcher script shows how you can abstract parameters from the Coach preset file and optimize them.