

# Docker containers for training and deploying models


Amazon SageMaker AI makes extensive use of *Docker containers* for build and runtime tasks. SageMaker AI provides pre-built Docker images for its built-in algorithms and the supported deep learning frameworks used for training and inference. Using containers, you can train machine learning algorithms and deploy models quickly and reliably at any scale. The topics in this section show how to deploy these containers for your own use cases. For information about how to bring your own containers for use with Amazon SageMaker Studio Classic, see [Custom Images in Amazon SageMaker Studio Classic](studio-byoi.md).

**Topics**
+ [Scenarios for Running Scripts, Training Algorithms, or Deploying Models with SageMaker AI](#container-scenarios)
+ [Docker container basics](docker-basics.md)
+ [Pre-built SageMaker AI Docker images](docker-containers-prebuilt.md)
+ [Custom Docker containers with SageMaker AI](docker-containers-adapt-your-own.md)
+ [Container creation with your own algorithms and models](docker-containers-create.md)
+ [Examples and More Information: Use Your Own Algorithm or Model](docker-containers-notebooks.md)
+ [Troubleshooting your Docker containers and deployments](#docker-containers-troubleshooting)

## Scenarios for Running Scripts, Training Algorithms, or Deploying Models with SageMaker AI

Amazon SageMaker AI always uses Docker containers when running scripts, training algorithms, and deploying models. Your level of engagement with containers depends on your use case. 

The following decision tree illustrates three main scenarios: **Use cases for using pre-built Docker containers with SageMaker AI**; **Use cases for extending a pre-built Docker container**; **Use case for building your own container**.

![\[Decision tree for container use cases.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/your-algorithm-containers-flowchart-diagram.png)


**Topics**
+ [Use cases for using pre-built Docker containers with SageMaker AI](#container-scenarios-use-prebuilt)
+ [Use cases for extending a pre-built Docker container](#container-scenarios-extend-prebuilt)
+ [Use case for building your own container](#container-scenarios-byoc)

### Use cases for using pre-built Docker containers with SageMaker AI


Consider the following use cases when using containers with SageMaker AI:
+ **Pre-built SageMaker AI algorithm** – Use the image that comes with the built-in algorithm. See [Use Amazon SageMaker AI Built-in Algorithms or Pre-trained Models](https://docs.aws.amazon.com//sagemaker/latest/dg/algos.html) for more information.
+ **Custom model with pre-built SageMaker AI container** – If you train or deploy a custom model but use a framework that has a pre-built SageMaker AI container, such as TensorFlow or PyTorch, choose one of the following options:
  + If you don't need a custom package, and the container already includes all required packages: Use the pre-built Docker image associated with your framework. For more information, see [Pre-built SageMaker AI Docker images](docker-containers-prebuilt.md).
  + If you need a custom package installed into one of the pre-built containers: Confirm that the pre-built Docker image allows a requirements.txt file, or extend the pre-built container based on the following use cases.

### Use cases for extending a pre-built Docker container


The following are use cases for extending a pre-built Docker container:
+ **You can't import the dependencies** – Extend the pre-built Docker image associated with your framework. See [Extend a Pre-built Container](prebuilt-containers-extend.md) for more information.
+ **You can't import the dependencies in the pre-built container and the pre-built container supports requirements.txt** – Add all the required dependencies in requirements.txt. The following frameworks support using requirements.txt.
  + [TensorFlow](https://sagemaker.readthedocs.io/en/v2.18.0/frameworks/tensorflow/using_tf.html)
  + [Chainer](https://sagemaker.readthedocs.io/en/v2.18.0/frameworks/chainer/using_chainer.html?highlight=requirements.txt)
  + [Scikit-learn](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/using_sklearn.html?highlight=requirements.txt)
  + [PyTorch](https://sagemaker.readthedocs.io/en/v2.18.0/frameworks/pytorch/using_pytorch.html?highlight=requirements.txt)
  + [Apache MXNet](https://sagemaker.readthedocs.io/en/v2.18.0/frameworks/mxnet/using_mxnet.html?highlight=requirements.txt)
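
As a minimal sketch of the requirements.txt approach, the following Python snippet writes a requirements.txt file next to a training script in a source directory. The directory name `src`, the script name `train.py`, and the package pins are hypothetical placeholders. Check the framework documentation linked in the preceding list for where your container version expects the file.

```python
from pathlib import Path
import tempfile


def write_requirements(source_dir, packages):
    """Write one pinned requirement per line to <source_dir>/requirements.txt."""
    source_path = Path(source_dir)
    source_path.mkdir(parents=True, exist_ok=True)
    requirements_file = source_path / "requirements.txt"
    # Sort for a stable, reviewable file; each entry is "name==version".
    lines = [f"{name}=={version}" for name, version in sorted(packages.items())]
    requirements_file.write_text("\n".join(lines) + "\n")
    return requirements_file


# Hypothetical layout: src/train.py plus src/requirements.txt.
workdir = Path(tempfile.mkdtemp())
source_dir = workdir / "src"
requirements = write_requirements(source_dir, {"nltk": "3.8.1", "gensim": "4.3.2"})
(source_dir / "train.py").write_text("# hypothetical entry point script\n")
print(requirements.read_text())
```

Pinning exact versions is a design choice in this sketch so that repeated training jobs install the same packages. It is not a SageMaker AI requirement.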

### Use case for building your own container


If you build or train a custom model and require a custom framework that does not have a pre-built image, build a custom container.

As an example, consider training and deploying a TensorFlow model. The following guide shows how to determine which option from the previous **Use cases** sections fits this case.

Assume that you have the following requirements for training and deploying a TensorFlow model.
+ A TensorFlow model is a custom model.
+ Because a TensorFlow model is going to be built in the TensorFlow framework, use the TensorFlow pre-built framework container to train and host the model.
+ If you require custom packages in either your [entrypoint script](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/using_tf.html#train-a-model-with-tensorflow) or [inference script](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/deploying_tensorflow_serving.html#how-to-implement-the-pre-and-or-post-processing-handler-s), either extend the pre-built container or use a requirements.txt file to install dependencies at runtime.
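
The scenarios in the preceding sections can be summarized as a small decision function. The following Python code is an illustrative sketch only. The function name, parameter names, and returned strings are assumptions made for this example and are not part of any SageMaker AI API.

```python
def choose_container_option(
    uses_builtin_algorithm,
    framework_has_prebuilt_image,
    needs_extra_packages=False,
    supports_requirements_txt=False,
):
    """Return the container scenario suggested by the use cases above."""
    if uses_builtin_algorithm:
        return "use the pre-built image for the built-in algorithm"
    if not framework_has_prebuilt_image:
        return "build your own container"
    if not needs_extra_packages:
        return "use the pre-built framework image"
    if supports_requirements_txt:
        return "use the pre-built framework image with requirements.txt"
    return "extend the pre-built framework image"


# A custom TensorFlow model whose scripts need one extra package.
print(
    choose_container_option(
        uses_builtin_algorithm=False,
        framework_has_prebuilt_image=True,
        needs_extra_packages=True,
        supports_requirements_txt=True,
    )
)
```

For the TensorFlow example, the function prefers requirements.txt when the container supports it, because that avoids building and hosting a new image. Extending the container remains a valid choice in the same situation.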

After you determine the type of container that you need, the following list provides details about the previously listed options.
+ **Use a built-in SageMaker AI algorithm or framework**. For most use cases, you can use the built-in algorithms and frameworks without worrying about containers. You can train and deploy these algorithms from the SageMaker AI console, the AWS Command Line Interface (AWS CLI), a Python notebook, or the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable). You can do that by specifying the algorithm or framework version when creating your Estimator. The available built-in algorithms are itemized and described in the [Built-in algorithms and pretrained models in Amazon SageMaker](algos.md) topic. For more information about the available frameworks, see [ML Frameworks and Languages](frameworks.md). For an example of how to train and deploy a built-in algorithm using a Jupyter notebook running in a SageMaker notebook instance, see the [Guide to getting set up with Amazon SageMaker AI](gs.md) topic. 
+ **Use pre-built SageMaker AI container images**. Alternatively, you can use the built-in algorithms and frameworks using Docker containers. SageMaker AI provides containers for its built-in algorithms and pre-built Docker images for some of the most common machine learning frameworks, such as Apache MXNet, TensorFlow, PyTorch, and Chainer. It also supports machine learning libraries such as scikit-learn and SparkML. For a full list of the available SageMaker images, see [Available Deep Learning Containers Images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md). If you use the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable), you can deploy the containers by passing the full container URI to the respective SageMaker SDK `Estimator` class. For the full list of deep learning frameworks that are currently supported by SageMaker AI, see [Prebuilt SageMaker AI Docker images for deep learning](pre-built-containers-frameworks-deep-learning.md). For information about the scikit-learn and SparkML pre-built container images, see [Accessing Docker Images for Scikit-learn and Spark ML](pre-built-docker-containers-scikit-learn-spark.md). For more information about using frameworks with the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable), see their respective topics in [Machine Learning Frameworks and Languages](frameworks.md).
+ **Extend a pre-built SageMaker AI container image**. If you would like to extend a pre-built SageMaker AI algorithm or model Docker image, you can modify the SageMaker image to satisfy your needs. For an example, see [Extending our PyTorch containers](https://github.com/aws/amazon-sagemaker-examples-community/blob/215215eb25b40eadaf126d055dbb718a245d7603/bring-your-own-container/pytorch_extending_our_containers/pytorch_extending_our_containers.ipynb). 
+ **Adapt an existing container image**: If you would like to adapt a pre-existing container image to work with SageMaker AI, you must modify the Docker container to enable either the SageMaker Training or Inference toolkit. For an example that shows how to build your own containers to train and host an algorithm, see [Bring Your Own R Algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb).

# Docker container basics


The following page outlines the most significant aspects of using Docker containers with Amazon SageMaker AI.

Docker is a program that performs operating system-level virtualization for installing, distributing, and managing software. It packages applications and their dependencies into virtual containers that provide isolation, portability, and security. With Docker, you can ship code faster, standardize application operations, seamlessly move code, and economize by improving resource utilization. For more general information about Docker, see [Docker overview](https://docs.docker.com/engine/docker-overview/).

**SageMaker AI Functions**

SageMaker AI uses Docker containers in the backend to manage training and inference processes. SageMaker AI abstracts this process away, so it happens automatically when you use an estimator. While you don't need to use Docker containers explicitly with SageMaker AI for most use cases, you can use Docker containers to extend and customize SageMaker AI functionality.

**Containers with Amazon SageMaker Studio Classic**

Studio Classic runs from a Docker container and uses it to manage functionality. As a result, to bring your own image to Studio Classic, you must create your Docker container by following the steps in [Custom Images in Amazon SageMaker Studio Classic](studio-byoi.md).

# Pre-built SageMaker AI Docker images


Amazon SageMaker AI provides containers for its built-in algorithms and pre-built Docker images for some of the most common machine learning frameworks, such as Apache MXNet, TensorFlow, PyTorch, and Chainer. It also supports machine learning libraries such as scikit-learn and SparkML. 

You can use these images from your SageMaker notebook instance or SageMaker Studio. You can also extend the pre-built SageMaker images to include libraries and needed functionality. The following topics give information about the available images and how to use them.

For the Docker registry path and other parameters for each of the Amazon SageMaker AI provided algorithms and Deep Learning Containers (DLC), see [Docker Registry Paths and Example Code](https://docs.aws.amazon.com/sagemaker/latest/dg-ecr-paths/sagemaker-algo-docker-registry-paths).

For information on Docker images for developing reinforcement learning (RL) solutions in SageMaker AI, see [SageMaker AI RL Containers](https://github.com/aws/sagemaker-rl-container).

**Note**  
Pre-built container images are owned by SageMaker AI, and in some cases include proprietary code. Capabilities such as training and processing jobs, batch transform, and real-time inference use service-owned credentials to pull and run images on managed SageMaker AI instances. Because customer credentials aren't used, any AWS IAM policies (including service control policies and resource control policies) that deny Amazon ECR permissions don't prevent the use of pre-built images.

**Topics**
+ [Prebuilt SageMaker image support policy](pre-built-containers-support-policy.md)
+ [Prebuilt SageMaker AI Docker images for deep learning](pre-built-containers-frameworks-deep-learning.md)
+ [Accessing Docker Images for Scikit-learn and Spark ML](pre-built-docker-containers-scikit-learn-spark.md)
+ [Deep Graph Networks](deep-graph-library.md)
+ [Extend a Pre-built Container](prebuilt-containers-extend.md)

# Prebuilt SageMaker image support policy

All [pre-built SageMaker images](https://docs.aws.amazon.com/sagemaker/latest/dg-ecr-paths/sagemaker-algo-docker-registry-paths.html), including framework-specific containers, built-in algorithm containers, algorithms and model packages listed in AWS Marketplace, and [AWS Deep Learning Containers](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/what-is-dlc.html) are regularly scanned for common vulnerabilities listed by the [Common Vulnerabilities and Exposures (CVE) Program](https://www.cve.org/) and the [National Vulnerability Database (NVD)](https://nvd.nist.gov/). For more information about CVEs, see [CVE Frequently Asked Questions (FAQs)](https://www.cve.org/ResourcesSupport/FAQs). Supported pre-built container images receive an updated minor version release following any security patches. 

All supported container images are routinely updated to address any critical CVEs. For high severity scenarios, we recommend customers build and host a patched version of the container in their own [Amazon Elastic Container Registry (Amazon ECR)](https://docs.aws.amazon.com/AmazonECR/latest/userguide/what-is-ecr.html). 

If you are running a container image version that is no longer supported, you may not have the most updated drivers, libraries, and relevant packages. For a more up-to-date version, we recommend that you upgrade to one of the supported frameworks available using the latest image of your choice.

SageMaker AI doesn't release out-of-patch images for containers in new AWS Regions.

**Note**  
As of August 2024, the `forecasting-deepar` container is no longer receiving security patches or updates. While you can continue to use this container, you incur additional risk. Containers are deprecated when an entire framework or algorithm is no longer supported, and the underlying MXNet framework for this container has reached end-of-maintenance.

**Topics**
+ [AWS Deep Learning Containers (DLC) support policy](#pre-built-containers-support-policy-dlc)
+ [SageMaker AI ML Framework Container support policy](#pre-built-containers-support-policy-ml-framework)
+ [SageMaker AI Built-in Algorithm Container support policy](#pre-built-containers-support-policy-built-in)
+ [LLM Hosting Container support policy](#pre-built-containers-support-policy-llm-hosting)
+ [Unsupported containers and deprecation](#pre-built-containers-support-policy-deprecation)

## AWS Deep Learning Containers (DLC) support policy


AWS Deep Learning Containers are a set of Docker images for training and serving deep learning models. To view available images, see [Available Deep Learning Containers Images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md) in the Deep Learning Containers GitHub repository.

DLCs hit their end of patch date 365 days after their GitHub release date. Patch updates for DLCs are not “in-place” updates. You must delete the existing image on your instance and pull the latest container image without terminating your instance. For more information, see [Framework Support Policy](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/support-policy.html) in the *AWS Deep Learning Containers Developer Guide*. 
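
The 365-day patch window can be checked with simple date arithmetic. The following Python code is a minimal sketch. The function names and the release date are hypothetical, and treating the end-of-patch date itself as still patched is an assumption made for this example.

```python
from datetime import date, timedelta

DLC_PATCH_WINDOW_DAYS = 365  # from the policy statement above


def end_of_patch(github_release, window_days=DLC_PATCH_WINDOW_DAYS):
    """Return the date on which patching ends for an image."""
    return github_release + timedelta(days=window_days)


def is_patched(github_release, today, window_days=DLC_PATCH_WINDOW_DAYS):
    """True while `today` is on or before the end-of-patch date (assumed inclusive)."""
    return today <= end_of_patch(github_release, window_days)


release = date(2025, 11, 17)  # hypothetical GitHub release date
print(end_of_patch(release))
print(is_patched(release, today=date(2026, 6, 1)))
```

The `window_days` parameter lets you reuse the same check for container families that document a different patch window.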

Reference the [AWS Deep Learning Containers Framework Support Policy table](https://aws.amazon.com/releasenotes/dlc-support-policy/) to check which frameworks and versions are actively supported for AWS DLCs. You can reference the framework associated with a DLC in the support policy table for any images that are not explicitly listed. For example, you can reference **PyTorch** in the support policy table for DLC images such as `huggingface-pytorch-inference` and `stabilityai-pytorch-inference`.
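
One way to find the framework to look up for an image that isn't explicitly listed is to search its repository name for a known framework token. The following Python code is a hedged sketch of that heuristic. The `KNOWN_FRAMEWORKS` tuple and the function name are assumptions for illustration, and the heuristic only works for repository names that embed the framework name.

```python
# Framework tokens to look for; extend this tuple to match the support policy table.
KNOWN_FRAMEWORKS = ("pytorch", "tensorflow", "mxnet")


def framework_for_image(repository_name):
    """Return the framework token found in a DLC repository name, or None."""
    tokens = repository_name.lower().split("-")
    for framework in KNOWN_FRAMEWORKS:
        if framework in tokens:
            return framework
    return None


for name in ("huggingface-pytorch-inference", "stabilityai-pytorch-inference"):
    print(name, "->", framework_for_image(name))
```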

**Note**  
If a DLC uses the HuggingFace [Transformers](https://huggingface.co/docs/transformers/en/index) SDK, then only the image with the latest Transformers version is supported. For more information, see **HuggingFace** for the Region of your choice in the [Docker Registry Paths and Example Code](https://docs.aws.amazon.com/sagemaker/latest/dg-ecr-paths/sagemaker-algo-docker-registry-paths.html). 

## SageMaker AI ML Framework Container support policy


The SageMaker AI ML Framework Containers are a set of Docker images for training and serving machine learning workloads with environments optimized for common frameworks such as XGBoost and Scikit Learn. To view available SageMaker AI ML Framework Containers, see [Docker Registry Paths and Example Code](https://docs.aws.amazon.com/sagemaker/latest/dg-ecr-paths/sagemaker-algo-docker-registry-paths.html). Navigate to the AWS Region of your choice, and browse images with the **(algorithm)** tag. SageMaker AI ML Framework Containers also adhere to the [AWS Deep Learning Containers framework support policy](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/support-policy.html). 

To retrieve the latest image version for XGBoost 3.0-5 in framework mode, use the following SageMaker Python SDK commands: 

```
from sagemaker import image_uris
# Returns the Amazon ECR URI of the XGBoost framework image for the given Region and version.
image_uris.retrieve(framework='xgboost', region='us-east-1', version='3.0-5')
```


| Framework | Current version | GitHub GA | End of patch | 
| --- | --- | --- | --- | 
| XGBoost | 3.0-5 | 11/17/2025 | 11/17/2026 | 
| XGBoost | 1.7-1 | 03/06/2023 | 03/06/2025 | 
| XGBoost | 1.5-1 | 02/21/2022 | 02/21/2023 | 
| XGBoost | 1.3-1 | 05/21/2021 | 05/21/2022 | 
| XGBoost | 1.2-2 | 09/20/2020 | 09/20/2021 | 
| XGBoost | 1.2-1 | 07/19/2020 | 07/19/2021 | 
| XGBoost |  1.0-1  |  >4 years  | Not supported | 
| Scikit-Learn |  1.4-2  |  10/30/2025  |  10/30/2026  | 
| Scikit-Learn |  1.2-1  |  03/06/2023  |  03/06/2025  | 
| Scikit-Learn |  1.0-1  |  04/07/2022  |  04/07/2023  | 
| Scikit-Learn |  0.23-1  | 3/6/2023 |  06/02/2021  | 
| Scikit-Learn |  0.20-1  |  >4 years  | Not supported | 

## SageMaker AI Built-in Algorithm Container support policy


The SageMaker AI Built-in Algorithm Containers are a set of Docker images for training and serving [SageMaker AI’s built-in machine learning algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html). To view available SageMaker AI Built-in Algorithm Containers, see [Docker Registry Paths and Example Code](https://docs.aws.amazon.com/sagemaker/latest/dg-ecr-paths/sagemaker-algo-docker-registry-paths.html). Navigate to the AWS Region of your choice, and browse images with the **(algorithm)** tag. 

Patch updates for built-in container images are “in-place” updates. To stay up-to-date with the latest security patches, we recommend checking out the latest built-in algorithm image version using the `latest` image tag. 


| Image container | End of patch | 
| --- | --- | 
| `blazingtext:latest` | 05/15/2024 | 
| `factorization-machines:latest` | 05/15/2024 | 
| `forecasting-deepar:latest` | 08/26/2025 | 
| `image-classification:latest` | 05/15/2024 | 
| `instance-segmentation:latest` | 05/15/2024 | 
| `ipembeddings:latest` | 05/15/2024 | 
| `ipinsights:latest` | 05/15/2024 | 
| `kmeans:latest` | 05/15/2024 | 
| `knn:latest` | 05/15/2024 | 
| `linear-learner:inference-cpu-1/training-cpu-1` | 05/15/2024 | 
| `linear-learner:latest` | 05/15/2024 | 
| `mxnet-algorithms:training-cpu/inference-cpu` | 05/15/2024 | 
| `ntm:latest` | 05/15/2024 | 
| `object-detection:latest` | 05/15/2024 | 
| `object2vec:latest` | 05/15/2024 | 
| `pca:latest` | 05/15/2024 | 
| `randomcutforest:latest` | 05/15/2024 | 
| `semantic-segmentation:latest` | 05/15/2024 | 
| `seq2seq:latest` | 05/15/2024 | 

## LLM Hosting Container support policy


[LLM hosting containers](https://github.com/awslabs/llm-hosting-container) such as the HuggingFace Text Generation Inference (TGI) containers hit their end of patch date 30 days after their GitHub release date.

**Important**  
We make an exception when there is a major version update. For example, if the HuggingFace Text Generation Inference (TGI) toolkit updates to TGI 2.0, then we continue to support the most recent version of TGI 1.4 for a period of three months from the date of the GitHub release.


| Toolkit container | Current version | GitHub GA | End of patch | 
| --- | --- | --- | --- | 
| TGI | tgi2.3.1 | 10/14/2024 | 11/14/2024 | 
| TGI | optimum0.0.25 | 10/04/2024 | 11/04/2024 | 
| TGI | tgi2.2.0 | 07/26/2024 | 08/30/2024 | 
| TGI | tgi2.0.0 | 05/15/2024 | 08/15/2024 | 
| TGI |  tgi1.4.5  |  04/03/2024  |  07/03/2024  | 
| TGI |  tgi1.4.2  |  02/22/2024  |  03/22/2024  | 
| TGI |  tgi1.4.0  |  01/29/2024  |  02/29/2024  | 
| TGI |  tgi1.3.3  |  12/19/2023  |  01/19/2024  | 
| TGI |  tgi1.3.1  |  12/11/2023  |  01/11/2024  | 
| TGI |  tgi1.2.0  |  12/04/2023  |  01/04/2024  | 
| TGI |  optimum 0.0.24  |  08/23/2024  |  09/30/2024  | 
| TGI |  optimum 0.0.23  |  07/26/2024  |  08/30/2024  | 
| TGI |  optimum 0.0.21  |  05/10/2024  |  08/15/2024  | 
| TGI |  optimum 0.0.19  |  02/19/2024  |  03/19/2024  | 
| TGI |  optimum 0.0.18  |  02/01/2024  |  03/01/2024  | 
| TGI |  optimum 0.0.17  |  01/24/2024  |  02/24/2024  | 
| TGI |  optimum 0.0.16  |  01/18/2024  |  02/18/2024  | 
| TEI |  tei1.4.0  |  08/01/2024  |  09/01/2024  | 
| TEI |  tei1.2.3  |  04/26/2024  |  05/26/2024  | 
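
To check whether a container version is past its end-of-patch date, you can compare the date in the table with a reference date. The following Python code is a minimal sketch that copies three rows from the preceding table. The dictionary name, function name, and evaluation date are hypothetical.

```python
from datetime import date, datetime

# (container version, end-of-patch date) pairs copied from the table above.
END_OF_PATCH = {
    "tgi2.3.1": "11/14/2024",
    "tgi2.2.0": "08/30/2024",
    "tei1.4.0": "09/01/2024",
}


def past_end_of_patch(version, today):
    """True if `today` is later than the listed end-of-patch date."""
    deadline = datetime.strptime(END_OF_PATCH[version], "%m/%d/%Y").date()
    return today > deadline


as_of = date(2024, 10, 1)  # hypothetical evaluation date
for version in END_OF_PATCH:
    print(version, "past end of patch:", past_end_of_patch(version, as_of))
```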

## Unsupported containers and deprecation


When a container reaches end of patch or is deprecated, it no longer receives security patching. Containers are deprecated when entire frameworks or algorithms are no longer supported.

The following containers no longer receive support: 
+ As of August 2024, the `forecasting-deepar` container is no longer receiving security patches or updates due to the underlying MXNet framework for the container reaching end-of-maintenance.
+ As of April 2024, [SageMaker AI Reinforcement Learning (RL) containers](https://github.com/aws/sagemaker-rl-container) are no longer supported. To build your own RL images, see [Building Your Image](https://github.com/aws/sagemaker-rl-container#building-your-image) in the SageMaker AI RL containers GitHub repository. 
+ As of September 2023, JumpStart Industry: Financial containers are no longer supported.

# Prebuilt SageMaker AI Docker images for deep learning

Amazon SageMaker AI provides prebuilt Docker images that include deep learning frameworks and other dependencies needed for training and inference. For a complete list of the prebuilt Docker images managed by SageMaker AI, see [Docker Registry Paths and Example Code](https://docs.aws.amazon.com/sagemaker/latest/dg-ecr-paths/sagemaker-algo-docker-registry-paths.html).

## Using the SageMaker AI Python SDK


With the [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk#installing-the-sagemaker-python-sdk), you can train and deploy models using these popular deep learning frameworks. For instructions on installing and using the SDK, see [https://github.com/aws/sagemaker-python-sdk#installing-the-sagemaker-python-sdk](https://github.com/aws/sagemaker-python-sdk#installing-the-sagemaker-python-sdk). The following table lists the available frameworks and instructions on how to use them with the [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk#installing-the-sagemaker-python-sdk):


| Framework | Instructions | 
| --- | --- | 
| TensorFlow |  [Using TensorFlow with the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/using_tf.html)  | 
| MXNet |  [Using MXNet with the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/frameworks/mxnet/using_mxnet.html)  | 
| PyTorch |  [Using PyTorch with the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html)  | 
| Chainer |  [Using Chainer with the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/frameworks/chainer/using_chainer.html)  | 
| Hugging Face |  [Using Hugging Face with the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/index.html)  | 

## Extending Prebuilt SageMaker AI Docker Images


You can customize these prebuilt containers or extend them as needed. With this customization, you can handle any additional functional requirements for your algorithm or model that the prebuilt SageMaker AI Docker image doesn't support. For an example of this, see [Fine-tuning and deploying a BERTopic model on SageMaker AI with your own scripts and dataset, by extending existing PyTorch containers](https://sagemaker-examples.readthedocs.io/en/latest/advanced_functionality/pytorch_extend_container_train_deploy_bertopic/BERTtopic_extending_container.html).

You can also use prebuilt containers to deploy your custom models or models that have been trained in a framework other than SageMaker AI. For an overview of the process, see [Bring Your Own Pretrained MXNet or TensorFlow Models into Amazon SageMaker](https://aws.amazon.com/blogs/machine-learning/bring-your-own-pre-trained-mxnet-or-tensorflow-models-into-amazon-sagemaker/). This tutorial covers bringing the trained model artifacts into SageMaker AI and hosting them at an endpoint.

# Accessing Docker Images for Scikit-learn and Spark ML

SageMaker AI provides prebuilt Docker images that install the scikit-learn and Spark ML libraries. These libraries also include the dependencies needed to build Docker images that are compatible with SageMaker AI using the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable). With the SDK, you can use scikit-learn for machine learning tasks and use Spark ML to create and tune machine learning pipelines. For instructions on installing and using the SDK, see [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk#installing-the-sagemaker-python-sdk). 

You can also access the images from an Amazon ECR repository in your own environment.

Use the following commands to find out which versions of the images are available. For example, use the following to find the available `sagemaker-sparkml-serving` image in the `ca-central-1` Region:

```
aws \
    ecr describe-images \
    --region ca-central-1 \
    --registry-id 341280168497 \
    --repository-name sagemaker-sparkml-serving
```
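
The command returns a JSON document with one `imageDetails` entry per image. The following Python code is a sketch of how you might collect the available tags from that output. The sample payload is hypothetical and abbreviated, and the function name is an assumption for this example.

```python
import json

# Hypothetical, abbreviated output of `aws ecr describe-images --output json`.
SAMPLE_OUTPUT = """
{
  "imageDetails": [
    {"repositoryName": "sagemaker-sparkml-serving", "imageTags": ["2.4"]},
    {"repositoryName": "sagemaker-sparkml-serving", "imageTags": ["2.2", "2.4"]},
    {"repositoryName": "sagemaker-sparkml-serving"}
  ]
}
"""


def list_image_tags(describe_images_json):
    """Collect every tag from the imageDetails entries, sorted and de-duplicated."""
    details = json.loads(describe_images_json).get("imageDetails", [])
    tags = set()
    for image in details:
        # Untagged images have no imageTags key, so default to an empty list.
        tags.update(image.get("imageTags", []))
    return sorted(tags)


print(list_image_tags(SAMPLE_OUTPUT))
```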

## Accessing an image from the SageMaker AI Python SDK


The following table contains links to the GitHub repositories with the source code for the scikit-learn and Spark ML containers. The table also contains links to instructions that show how to use these containers with Python SDK estimators to run your own training algorithms and host your own models. 


| Library | Prebuilt Docker Image Source Code | Instructions | 
| --- | --- | --- | 
| scikit-learn |  [SageMaker AI Scikit-learn Containers](https://github.com/aws/sagemaker-scikit-learn-container)  |  [Using Scikit-learn with the Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/using_sklearn.html)  | 
| Spark ML |  [SageMaker AI Spark ML Serving Containers](https://github.com/aws/sagemaker-sparkml-serving-container)  |  [SparkML Python SDK Documentation](https://sagemaker.readthedocs.io/en/stable/sagemaker.sparkml.html)  | 

For more information and links to GitHub repositories, see [Resources for using Scikit-learn with Amazon SageMaker AI](sklearn.md) and [Resources for using SparkML Serving with Amazon SageMaker AI](sparkml-serving.md).

## Specifying the Prebuilt Images Manually


If you are not using the SageMaker Python SDK and one of its estimators to manage the container, you have to retrieve the relevant prebuilt container manually. The SageMaker AI prebuilt Docker images are stored in Amazon Elastic Container Registry (Amazon ECR). You can push or pull them using their full registry addresses. SageMaker AI uses the following Docker Image URL patterns for scikit-learn and Spark ML:
+ `<ACCOUNT_ID>.dkr.ecr.<REGION_NAME>.amazonaws.com/sagemaker-scikit-learn:<SCIKIT-LEARN_VERSION>-cpu-py<PYTHON_VERSION>`

  For example, `746614075791.dkr.ecr.us-west-1.amazonaws.com/sagemaker-scikit-learn:1.2-1-cpu-py3`
+ `<ACCOUNT_ID>.dkr.ecr.<REGION_NAME>.amazonaws.com/sagemaker-sparkml-serving:<SPARK-ML_VERSION>`

  For example, `341280168497.dkr.ecr.ca-central-1.amazonaws.com/sagemaker-sparkml-serving:2.4`

For account IDs and AWS Region names, see [Docker Registry Paths and Example Code](https://docs.aws.amazon.com/sagemaker/latest/dg-ecr-paths/sagemaker-algo-docker-registry-paths).
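
The URL patterns above can be assembled with simple string formatting. The following Python code is a minimal sketch. The helper function names are hypothetical, and the account IDs, Regions, and versions are the ones from the preceding examples.

```python
def ecr_image_uri(account_id, region, repository, tag):
    """Assemble <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def sklearn_image_uri(account_id, region, sklearn_version, python_version):
    """Build a scikit-learn image URI; the tag is <version>-cpu-py<python_version>."""
    tag = f"{sklearn_version}-cpu-py{python_version}"
    return ecr_image_uri(account_id, region, "sagemaker-scikit-learn", tag)


def sparkml_image_uri(account_id, region, sparkml_version):
    """Build a Spark ML serving image URI; the tag is the Spark ML version."""
    return ecr_image_uri(account_id, region, "sagemaker-sparkml-serving", sparkml_version)


print(sklearn_image_uri("746614075791", "us-west-1", "1.2-1", "3"))
print(sparkml_image_uri("341280168497", "ca-central-1", "2.4"))
```

Account IDs differ between Regions and image families, so look up both values rather than reusing the ones shown here.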

# Deep Graph Networks

Deep graph networks refer to a type of neural network that is trained to solve graph problems. A deep graph network uses an underlying deep learning framework like PyTorch or MXNet. The potential for graph networks in practical AI applications is highlighted in the Amazon SageMaker AI tutorials for [Deep Graph Library](https://www.dgl.ai/) (DGL). Example graph datasets for training models come from domains such as social networks, knowledge bases, biology, and chemistry. 

 ![\[The Deep Graph Library (DGL) ecosystem.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/dgl_white_background_bold.png) 

 *Figure 1. The DGL ecosystem* 

Several examples are provided using Amazon SageMaker AI’s deep learning containers that are preconfigured with DGL. If you have special modules you want to use with DGL, you can also build your own container. The examples involve heterographs, which are graphs that have multiple types of nodes and edges, and draw on a variety of applications across disparate scientific fields, such as bioinformatics and social network analysis. DGL provides a wide array of [graph neural network implementations for different types of models](https://docs.dgl.ai/tutorials/models/index.html). Some of the highlights include: 
+ Graph convolutional network (GCN)
+ Relational graph convolutional network (R-GCN)
+ Graph attention network (GAT)
+ Deep generative models of graphs (DGMG)
+ Junction tree neural network (JTNN)

# Getting started with training a deep graph network


DGL is available as a deep learning container in Amazon ECR. You can select deep learning containers when you write your estimator function in an Amazon SageMaker notebook. You can also craft your own custom container with DGL by following the [Bring Your Own Container](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms.html) guide. The easiest way to get started with a deep graph network is to use one of the DGL containers in Amazon Elastic Container Registry.

**Note**  
 Backend framework support is limited to PyTorch and MXNet. 

**Setup**  
If you are using Amazon SageMaker Studio, you need to clone the examples repository first. If you are using a notebook instance, you can find the examples by choosing the SageMaker AI icon at the bottom of the left toolbar. 

**To clone the Amazon SageMaker SDK and notebook examples repository**

1. From the **JupyterLab** view in Amazon SageMaker AI, go to the **File Browser** at the top of the left toolbar. From the **File Browser panel**, you can see a new navigation at the top of the panel. 

1. Choose the icon on the far right to clone a Git repository. 

1. Add the repository URL: [https://github.com/awslabs/amazon-sagemaker-examples.git](https://github.com/awslabs/amazon-sagemaker-examples.git) 

1. Browse the newly added folder and its contents. The DGL examples are stored in the **sagemaker-python-sdk** folder. 

**Train**  
After you've set up, you can train the deep graph network.

**To train a deep graph network**

1. From the **JupyterLab** view in Amazon SageMaker AI, browse the [example notebooks](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-python-sdk) and look for DGL folders. Several files may be included to support an example. Examine the README for any prerequisites. 

1. Run the .ipynb notebook example.  

1. Find the estimator function, and note the line where it is using an Amazon ECR container for DGL and a specific instance type. You may want to update this to use a container in your preferred Region. 

1. Run the function to launch the instance and use the DGL container for training a graph network. Charges are incurred for launching this instance. The instance self-terminates when the training is complete. 
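
Step 3 mentions pointing the estimator at a container in your preferred Region. As a minimal sketch, assuming the image URI follows the usual Amazon ECR pattern `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`, the hypothetical helper `with_region` below rewrites only the Region part. The sample account ID and repository name are placeholders, and the account that hosts an image may differ between Regions, so check the result against the published list of images before you use it.

```python
import re

# Assumed URI layout: <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
# (Regions that use a different domain suffix are not handled by this sketch.)
_ECR_URI = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/(?P<rest>.+)$"
)


def with_region(image_uri: str, region: str) -> str:
    """Return image_uri with its Region component replaced (hypothetical helper)."""
    match = _ECR_URI.match(image_uri)
    if match is None:
        raise ValueError(f"not an Amazon ECR image URI: {image_uri}")
    return f"{match['account']}.dkr.ecr.{region}.amazonaws.com/{match['rest']}"


# Placeholder account and repository, for illustration only.
example = "111122223333.dkr.ecr.us-east-1.amazonaws.com/example-dgl:latest"
print(with_region(example, "eu-west-1"))
```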

An example of knowledge graph embedding (KGE) is provided. It uses the Freebase dataset, a knowledge base of general facts. An example use case would be to graph the relationships of persons and predict their nationality.  

An example implementation of a graph convolutional network (GCN) shows how you can train a graph network to predict toxicity. A physiology dataset, Tox21, provides toxicity measurements for how substances affect biological responses.  

Another GCN example shows you how to train a graph network on a scientific publications bibliography dataset, known as Cora. You can use it to find relationships between authors, topics, and conferences. 

The last example is a recommender system for movie reviews. It uses a graph convolutional matrix completion (GCMC) network trained on the MovieLens datasets. These datasets consist of movie titles, genres, and ratings by users. 

# Extend a Pre-built Container


If a pre-built SageMaker AI container doesn't fulfill all of your requirements, you can extend the existing image to accommodate your needs. Even if there is direct support for your environment or framework, you may want to add additional functionality or configure your container environment differently. By extending a pre-built image, you can leverage the included deep learning libraries and settings without having to create an image from scratch. You can extend the container to add libraries, modify settings, and install additional dependencies. 

The following tutorial shows how to extend a pre-built SageMaker image and publish it to Amazon ECR.

**Topics**
+ [

## Requirements to Extend a Pre-built Container
](#prebuilt-containers-extend-required)
+ [

## Extend SageMaker AI Containers to Run a Python Script
](#prebuilt-containers-extend-tutorial)

## Requirements to Extend a Pre-built Container


To extend a pre-built SageMaker image, you need to set the following environment variables within your Dockerfile. For more information on environment variables with SageMaker AI containers, see the [SageMaker Training Toolkit GitHub repo](https://github.com/aws/sagemaker-training-toolkit/blob/master/ENVIRONMENT_VARIABLES.md).
+ `SAGEMAKER_SUBMIT_DIRECTORY`: The directory within the container in which the Python script for training is located.
+ `SAGEMAKER_PROGRAM`: The Python script that should be invoked and used as the entry point for training.

You can also install additional libraries by including the following in your Dockerfile:

```
RUN pip install <library>
```

The following tutorial shows how to use these environment variables.
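
To make the roles of the two variables concrete, the following sketch joins them into the path of the script to run. It is an illustration only, not the toolkit's implementation. The name `resolve_entry_point` is hypothetical, and the sample values mirror the ones used in the tutorial that follows.

```python
import os
import posixpath


def resolve_entry_point(env=None):
    """Join SAGEMAKER_SUBMIT_DIRECTORY and SAGEMAKER_PROGRAM into a script path."""
    env = os.environ if env is None else env
    submit_dir = env["SAGEMAKER_SUBMIT_DIRECTORY"]
    program = env["SAGEMAKER_PROGRAM"]
    # Paths inside the container are POSIX paths, so join them with posixpath.
    return posixpath.join(submit_dir, program)


sample_env = {
    "SAGEMAKER_SUBMIT_DIRECTORY": "/opt/ml/code",
    "SAGEMAKER_PROGRAM": "cifar10.py",
}
print(resolve_entry_point(sample_env))  # /opt/ml/code/cifar10.py
```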

## Extend SageMaker AI Containers to Run a Python Script

In this tutorial, you learn how to extend the SageMaker AI PyTorch container with a Python file that uses the CIFAR-10 dataset. By extending the SageMaker AI PyTorch container, you utilize the existing training solution made to work with SageMaker AI. This tutorial extends a training image, but the same steps can be taken to extend an inference image. For a full list of the available images, see [Available Deep Learning Containers Images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md).

To run your own training model using the SageMaker AI containers, build a Docker container through a SageMaker Notebook instance. 

### Step 1: Create a SageMaker Notebook Instance

1. Open the [SageMaker AI console](https://console.aws.amazon.com/sagemaker/). 

1. In the left navigation pane, choose **Notebook**, choose **Notebook instances**, and then choose **Create notebook instance**. 

1. On the **Create notebook instance** page, provide the following information: 

   1. For **Notebook instance name**, enter **RunScriptNotebookInstance**.

   1. For **Notebook Instance type**, choose **ml.t2.medium**.

   1. In the **Permissions and encryption** section, do the following:

      1. For **IAM role**, choose **Create a new role**.

      1. On the **Create an IAM role** page, choose **Specific S3 buckets**, specify an Amazon S3 bucket named **sagemaker-run-script**, and then choose **Create role**.

         SageMaker AI creates an IAM role named `AmazonSageMaker-ExecutionRole-YYYYMMDDTHHmmSS`, such as `AmazonSageMaker-ExecutionRole-20190429T110788`. Note that the execution role naming convention uses the date and time when the role was created, separated by a `T`.

   1. For **Root Access**, choose **Enable**.

   1. Choose **Create notebook instance**. 

1. On the **Notebook instances** page, the **Status** is **Pending**. It can take a few minutes for Amazon SageMaker AI to launch a machine learning compute instance (in this case, a notebook instance) and attach an ML storage volume to it. The notebook instance has a preconfigured Jupyter notebook server and a set of Anaconda libraries. For more information, see [CreateNotebookInstance](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateNotebookInstance.html). 

   

1. Choose the name of the notebook instance to open its details page. In the **Permissions and encryption** section, copy the **IAM role ARN**, and paste it into a text file to save it temporarily. You use this IAM role ARN later to configure a local training estimator in the notebook instance. The IAM role ARN looks like the following: `'arn:aws:iam::111122223333:role/service-role/AmazonSageMaker-ExecutionRole-20190429T110788'` 

1. After the status of the notebook instance changes to **InService**, choose **Open JupyterLab**.
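
If you want to sanity-check the value you copied, the sketch below splits a role ARN into its parts and checks the role name against the `AmazonSageMaker-ExecutionRole-YYYYMMDDTHHmmSS` pattern described above. The helper name `parse_role_arn` and the regular expression are assumptions made for illustration. The sample ARN is the placeholder from this tutorial.

```python
import re

# Date and time digits separated by a "T", as described in the naming convention.
ROLE_NAME_PATTERN = re.compile(r"^AmazonSageMaker-ExecutionRole-\d{8}T\d{6}$")


def parse_role_arn(arn: str) -> dict:
    """Split an IAM role ARN into account ID and role name (hypothetical helper)."""
    # ARN layout: arn:<partition>:iam::<account-id>:role/<optional path/><role name>
    prefix, partition, service, _region, account, resource = arn.split(":", 5)
    if prefix != "arn" or service != "iam" or not resource.startswith("role/"):
        raise ValueError(f"not an IAM role ARN: {arn}")
    role_name = resource.rsplit("/", 1)[-1]
    return {
        "partition": partition,
        "account": account,
        "role_name": role_name,
        "matches_default_pattern": bool(ROLE_NAME_PATTERN.match(role_name)),
    }


info = parse_role_arn(
    "arn:aws:iam::111122223333:role/service-role/AmazonSageMaker-ExecutionRole-20190429T110788"
)
print(info["account"], info["role_name"], info["matches_default_pattern"])
```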

### Step 2: Create and Upload the Dockerfile and Python Training Scripts

1. After JupyterLab opens, create a new folder in the home directory of your JupyterLab. In the upper-left corner, choose the **New Folder** icon, and then enter the folder name `docker_test_folder`. 

1.  Create a `Dockerfile` text file in the `docker_test_folder` directory. 

   1. Choose the **New Launcher** icon (+) in the upper-left corner. 

   1. In the right pane under the **Other** section, choose **Text File**.

   1.  Paste the following `Dockerfile` sample code into your text file. 

      ```
      # SageMaker PyTorch image
      FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.5.1-cpu-py36-ubuntu16.04
      
      ENV PATH="/opt/ml/code:${PATH}"
      
      # this environment variable is used by the SageMaker PyTorch container to determine our user code directory.
      ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code
      
      # /opt/ml and all subdirectories are utilized by SageMaker, use the /code subdirectory to store your user code.
      COPY cifar10.py /opt/ml/code/cifar10.py
      
      # Defines cifar10.py as script entrypoint 
      ENV SAGEMAKER_PROGRAM cifar10.py
      ```

      The Dockerfile script performs the following tasks:
      + `FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.5.1-cpu-py36-ubuntu16.04` – Downloads the SageMaker AI PyTorch base image. You can replace this with any SageMaker AI base image you want to bring to build containers.
      + `ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code` – Sets `/opt/ml/code` as the training script directory.
      + `COPY cifar10.py /opt/ml/code/cifar10.py` – Copies the script to the location inside the container that is expected by SageMaker AI. The script must be located in this folder.
      + `ENV SAGEMAKER_PROGRAM cifar10.py` – Sets your `cifar10.py` training script as the entrypoint script.

   1.  On the left directory navigation pane, the text file name might automatically be named `untitled.txt`. To rename the file, right-click the file, choose **Rename**, rename the file as `Dockerfile` without the `.txt` extension, and then press `Ctrl+s` or `Command+s` to save the file.

1. Create or upload a training script `cifar10.py` in the `docker_test_folder`. You can use the following example script for this exercise. 

   ```
   import ast
   import argparse
   import logging
   
   import os
   
   import torch
   import torch.distributed as dist
   import torch.nn as nn
   import torch.nn.parallel
   import torch.optim
   import torch.utils.data
   import torch.utils.data.distributed
   import torchvision
   import torchvision.models
   import torchvision.transforms as transforms
   import torch.nn.functional as F
   
   logger=logging.getLogger(__name__)
   logger.setLevel(logging.DEBUG)
   
   classes=('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
   
   
   # https://github.com/pytorch/tutorials/blob/master/beginner_source/blitz/cifar10_tutorial.py#L118
   class Net(nn.Module):
       def __init__(self):
           super(Net, self).__init__()
           self.conv1=nn.Conv2d(3, 6, 5)
           self.pool=nn.MaxPool2d(2, 2)
           self.conv2=nn.Conv2d(6, 16, 5)
           self.fc1=nn.Linear(16 * 5 * 5, 120)
           self.fc2=nn.Linear(120, 84)
           self.fc3=nn.Linear(84, 10)
   
       def forward(self, x):
           x=self.pool(F.relu(self.conv1(x)))
           x=self.pool(F.relu(self.conv2(x)))
           x=x.view(-1, 16 * 5 * 5)
           x=F.relu(self.fc1(x))
           x=F.relu(self.fc2(x))
           x=self.fc3(x)
           return x
   
   
   def _train(args):
       is_distributed=len(args.hosts) > 1 and args.dist_backend is not None
       logger.debug("Distributed training - {}".format(is_distributed))
   
       if is_distributed:
           # Initialize the distributed environment.
           world_size=len(args.hosts)
           os.environ['WORLD_SIZE']=str(world_size)
           host_rank=args.hosts.index(args.current_host)
           dist.init_process_group(backend=args.dist_backend, rank=host_rank, world_size=world_size)
           logger.info(
               'Initialized the distributed environment: \'{}\' backend on {} nodes. '.format(
                   args.dist_backend,
                   dist.get_world_size()) + 'Current host rank is {}. Using cuda: {}. Number of gpus: {}'.format(
                   dist.get_rank(), torch.cuda.is_available(), args.num_gpus))
   
       device='cuda' if torch.cuda.is_available() else 'cpu'
       logger.info("Device Type: {}".format(device))
   
       logger.info("Loading Cifar10 dataset")
       transform=transforms.Compose(
           [transforms.ToTensor(),
            transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
   
       trainset=torchvision.datasets.CIFAR10(root=args.data_dir, train=True,
                                               download=False, transform=transform)
       train_loader=torch.utils.data.DataLoader(trainset, batch_size=args.batch_size,
                                                  shuffle=True, num_workers=args.workers)
   
       testset=torchvision.datasets.CIFAR10(root=args.data_dir, train=False,
                                              download=False, transform=transform)
       test_loader=torch.utils.data.DataLoader(testset, batch_size=args.batch_size,
                                                 shuffle=False, num_workers=args.workers)
   
       logger.info("Model loaded")
       model=Net()
   
       if torch.cuda.device_count() > 1:
           logger.info("Gpu count: {}".format(torch.cuda.device_count()))
           model=nn.DataParallel(model)
   
       model=model.to(device)
   
       criterion=nn.CrossEntropyLoss().to(device)
       optimizer=torch.optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum)
   
       for epoch in range(0, args.epochs):
           running_loss=0.0
           for i, data in enumerate(train_loader):
               # get the inputs
               inputs, labels=data
               inputs, labels=inputs.to(device), labels.to(device)
   
               # zero the parameter gradients
               optimizer.zero_grad()
   
               # forward + backward + optimize
               outputs=model(inputs)
               loss=criterion(outputs, labels)
               loss.backward()
               optimizer.step()
   
               # print statistics
               running_loss += loss.item()
               if i % 2000 == 1999:  # print every 2000 mini-batches
                   print('[%d, %5d] loss: %.3f' %
                         (epoch + 1, i + 1, running_loss / 2000))
                   running_loss=0.0
       print('Finished Training')
       return _save_model(model, args.model_dir)
   
   
   def _save_model(model, model_dir):
       logger.info("Saving the model.")
       path=os.path.join(model_dir, 'model.pth')
       # recommended way from http://pytorch.org/docs/master/notes/serialization.html
       torch.save(model.cpu().state_dict(), path)
   
   
   def model_fn(model_dir):
       logger.info('model_fn')
       device="cuda" if torch.cuda.is_available() else "cpu"
       model=Net()
       if torch.cuda.device_count() > 1:
           logger.info("Gpu count: {}".format(torch.cuda.device_count()))
           model=nn.DataParallel(model)
   
       with open(os.path.join(model_dir, 'model.pth'), 'rb') as f:
           model.load_state_dict(torch.load(f))
       return model.to(device)
   
   
   if __name__ == '__main__':
       parser=argparse.ArgumentParser()
   
       parser.add_argument('--workers', type=int, default=2, metavar='W',
                           help='number of data loading workers (default: 2)')
       parser.add_argument('--epochs', type=int, default=2, metavar='E',
                           help='number of total epochs to run (default: 2)')
       parser.add_argument('--batch-size', type=int, default=4, metavar='BS',
                           help='batch size (default: 4)')
       parser.add_argument('--lr', type=float, default=0.001, metavar='LR',
                           help='initial learning rate (default: 0.001)')
       parser.add_argument('--momentum', type=float, default=0.9, metavar='M', help='momentum (default: 0.9)')
       parser.add_argument('--dist-backend', type=str, default='gloo', help='distributed backend (default: gloo)')
   
       # The parameters below retrieve their default values from SageMaker environment variables, which are
       # instantiated by the SageMaker containers framework.
       # https://github.com/aws/sagemaker-containers#how-a-script-is-executed-inside-the-container
       parser.add_argument('--hosts', type=str, default=ast.literal_eval(os.environ['SM_HOSTS']))
       parser.add_argument('--current-host', type=str, default=os.environ['SM_CURRENT_HOST'])
       parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
       parser.add_argument('--data-dir', type=str, default=os.environ['SM_CHANNEL_TRAINING'])
       parser.add_argument('--num-gpus', type=int, default=os.environ['SM_NUM_GPUS'])
   
       _train(parser.parse_args())
   ```
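
The `SM_*` defaults at the end of the script can be tried outside a container. The sketch below feeds a hand-written environment through the same parsing steps that the script uses: `SM_HOSTS` holds a list literal that `ast.literal_eval` turns into a Python list, and the position of the current host in that list becomes its rank. The host names and values are made up for illustration, and `read_cluster_info` is a hypothetical helper name.

```python
import ast


def read_cluster_info(env):
    """Derive the host list, rank, and cluster size from SM_* variables."""
    hosts = ast.literal_eval(env["SM_HOSTS"])  # '["algo-1", "algo-2"]' -> list
    current_host = env["SM_CURRENT_HOST"]
    return {
        "hosts": hosts,
        "rank": hosts.index(current_host),  # same rule as host_rank in _train
        "world_size": len(hosts),
        # The script also requires a distributed backend to be set.
        "multiple_hosts": len(hosts) > 1,
        "num_gpus": int(env["SM_NUM_GPUS"]),
    }


# Hypothetical two-node job; real values are set inside the container.
sample_env = {
    "SM_HOSTS": '["algo-1", "algo-2"]',
    "SM_CURRENT_HOST": "algo-2",
    "SM_NUM_GPUS": "0",
}
print(read_cluster_info(sample_env))
```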

### Step 3: Build the Container

1. In the JupyterLab home directory, open a Jupyter notebook. To open a new notebook, choose the **New Launcher** icon, and then choose **conda_pytorch_p39** in the **Notebook** section. 

1. Run the following command in the first notebook cell to change to the `docker_test_folder` directory:

   ```
   %cd ~/SageMaker/docker_test_folder
   ```

   To confirm your current directory, run the following command:

   ```
   ! pwd
   ```

   `output: /home/ec2-user/SageMaker/docker_test_folder`

1. Log in to Docker to access the base container:

   ```
   ! aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com
   ```

1. To build the Docker container, run the following Docker build command, including the space followed by a period at the end:

   ```
   ! docker build -t pytorch-extended-container-test .
   ```

   The Docker build command must be run from the Docker directory you created, in this case `docker_test_folder`.
**Note**  
If you get the following error message that Docker cannot find the Dockerfile, make sure the Dockerfile has the correct name and has been saved to the directory.  

   ```
   unable to prepare context: unable to evaluate symlinks in Dockerfile path: 
   lstat /home/ec2-user/SageMaker/docker/Dockerfile: no such file or directory
   ```
Remember that `docker` looks for a file specifically called `Dockerfile` without any extension within the current directory. If you named it something else, you can pass in the file name manually with the `-f` flag. For example, if you named your Dockerfile `Dockerfile-text.txt`, run the following command:  

   ```
   ! docker build -t pytorch-extended-container-test -f Dockerfile-text.txt .
   ```
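
A quick pre-check can catch the naming problem before you run the build. The following sketch, which uses only the Python standard library, reports whether a file named exactly `Dockerfile` exists in a directory and lists near-miss names such as `Dockerfile.txt`. The function name `check_dockerfile` is hypothetical.

```python
from pathlib import Path


def check_dockerfile(directory="."):
    """Return (found, near_misses) for a Docker build directory."""
    folder = Path(directory)
    found = (folder / "Dockerfile").is_file()
    # Files that look like a Dockerfile but carry an extension or suffix.
    near_misses = sorted(
        p.name
        for p in folder.iterdir()
        if p.is_file()
        and p.name.lower().startswith("dockerfile")
        and p.name != "Dockerfile"
    )
    return found, near_misses


found, near_misses = check_dockerfile(".")
print("Dockerfile found:", found)
if near_misses:
    print("Similarly named files (rename one, or pass it with -f):", near_misses)
```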

### Step 4: Test the Container

1. To test the container locally in the notebook instance, open a Jupyter notebook. Choose **New Launcher**, and then choose **conda_pytorch_p39** in the **Notebook** section. Run the rest of the code snippets from a notebook on the notebook instance.

1. Download the CIFAR-10 dataset.

   ```
   import torch
   import torchvision
   import torchvision.transforms as transforms
   
   def _get_transform():
       return transforms.Compose(
           [transforms.ToTensor(),
            transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
   
   
   def get_train_data_loader(data_dir='/tmp/pytorch/cifar-10-data'):
       transform=_get_transform()
       trainset=torchvision.datasets.CIFAR10(root=data_dir, train=True,
                                               download=True, transform=transform)
       return torch.utils.data.DataLoader(trainset, batch_size=4,
                                          shuffle=True, num_workers=2)
   
   
   def get_test_data_loader(data_dir='/tmp/pytorch/cifar-10-data'):
       transform=_get_transform()
       testset=torchvision.datasets.CIFAR10(root=data_dir, train=False,
                                              download=True, transform=transform)
       return torch.utils.data.DataLoader(testset, batch_size=4,
                                          shuffle=False, num_workers=2)
   
   trainloader=get_train_data_loader('/tmp/pytorch-example/cifar-10-data')
   testloader=get_test_data_loader('/tmp/pytorch-example/cifar-10-data')
   ```

1. Set `role` to the role used to create your Jupyter notebook. This is used to configure your SageMaker AI Estimator.

   ```
   from sagemaker import get_execution_role
   
   role=get_execution_role()
   ```

1. Paste the following example script into the notebook code cell to configure a SageMaker AI Estimator using your extended container.

   ```
   from sagemaker.estimator import Estimator
   
   hyperparameters={'epochs': 1}
   
   estimator=Estimator(
       image_uri='pytorch-extended-container-test',
       role=role,
       instance_count=1,
       instance_type='local',
       hyperparameters=hyperparameters
   )
   
   estimator.fit('file:///tmp/pytorch-example/cifar-10-data')
   ```

1. Run the code cell. This test outputs the training environment configuration, the values used for the environment variables, the source of the data, and the loss reported during training.
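
The `Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))` transform used when loading the data computes `(x - mean) / std` for each channel. Because `ToTensor` first scales pixel values to the range [0, 1], the normalized values lie in [-1, 1]. The plain Python sketch below reproduces that arithmetic for a few sample pixel values without requiring PyTorch. The function name `normalize` is illustrative.

```python
def normalize(pixel_value, mean=0.5, std=0.5):
    """Apply the (x - mean) / std formula to one value already scaled to [0, 1]."""
    return (pixel_value - mean) / std


# Raw 8-bit pixel values, scaled to [0, 1] by dividing by 255 before normalizing.
for raw in (0, 128, 255):
    scaled = raw / 255
    print(raw, "->", round(normalize(scaled), 3))
```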

### Step 5: Push the Container to Amazon Elastic Container Registry (Amazon ECR)

1. After you successfully run the local mode test, you can push the Docker container to [Amazon ECR](https://docs.aws.amazon.com/AmazonECR/latest/userguide/what-is-ecr.html) and use it to run training jobs. 

   Run the following command lines in a notebook cell.

   ```
   %%sh
   
   # Specify an algorithm name
   algorithm_name=pytorch-extended-container-test
   
   account=$(aws sts get-caller-identity --query Account --output text)
   
   # Get the region defined in the current configuration (default to us-west-2 if none defined)
   region=$(aws configure get region)
   region=${region:-us-west-2}
   
   fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"
   
   # If the repository doesn't exist in ECR, create it.
   
   aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1
   if [ $? -ne 0 ]
   then
   aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
   fi
   
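   # Log into Docker (the login target is the registry host, without the repository name or tag)
   aws ecr get-login-password --region ${region}|docker login --username AWS --password-stdin ${account}.dkr.ecr.${region}.amazonaws.com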
   
   # Build the docker image locally with the image name and then push it to ECR
   # with the full name.
   
   docker build -t ${algorithm_name} .
   docker tag ${algorithm_name} ${fullname}
   
   docker push ${fullname}
   ```

1. After you push the container, you can call the Amazon ECR image from anywhere in the SageMaker AI environment. Run the following code example in the next notebook cell. 

   To use this training container with SageMaker Studio and its visualization features, you can also run the following code in a Studio notebook cell to call the Amazon ECR image of your training container.

   ```
   import boto3
   
   client=boto3.client('sts')
   account=client.get_caller_identity()['Account']
   
   my_session=boto3.session.Session()
   region=my_session.region_name
   
   algorithm_name="pytorch-extended-container-test"
   ecr_image='{}.dkr.ecr.{}.amazonaws.com/{}:latest'.format(account, region, algorithm_name)
   
   ecr_image
   # This should return something like
   # 12-digits-of-your-account.dkr.ecr.us-east-2.amazonaws.com/pytorch-extended-container-test:latest
   ```

1. Use the `ecr_image` retrieved from the previous step to configure a SageMaker AI estimator object. The following code sample configures a generic SageMaker AI `Estimator` that uses your extended PyTorch container.

   ```
   import sagemaker
   
   from sagemaker import get_execution_role
   from sagemaker.estimator import Estimator
   
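   instance_type='ml.p2.xlarge'
   
   estimator=Estimator(
       image_uri=ecr_image,
       role=get_execution_role(),
       base_job_name='pytorch-extended-container-test',
       instance_count=1,
       instance_type=instance_type
   )
   
   # start training
   # (cifar10.py reads its data from the 'training' channel, so a real job
   # needs the Amazon S3 location of the dataset passed to fit())
   estimator.fit()
   
   # deploy the trained model
   predictor=estimator.deploy(1, instance_type)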
   ```
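
The repository check in the shell script can also be written with the AWS SDK for Python. The sketch below is one possible translation, not part of the original tutorial. `ensure_repository` is a hypothetical helper that takes an Amazon ECR client and creates the repository only when `describe_repositories` reports that it is missing.

```python
def ensure_repository(ecr_client, name):
    """Create the ECR repository `name` if it is missing; return True if created."""
    try:
        # Raises RepositoryNotFoundException when the repository does not exist.
        ecr_client.describe_repositories(repositoryNames=[name])
        return False
    except ecr_client.exceptions.RepositoryNotFoundException:
        ecr_client.create_repository(repositoryName=name)
        return True
```

In a notebook with AWS credentials, you could call it as `ensure_repository(boto3.client('ecr'), 'pytorch-extended-container-test')` after importing `boto3`.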

### Step 6: Clean up Resources

**To clean up resources when done with the Get Started example**

1. Open the [SageMaker AI console](https://console.aws.amazon.com/sagemaker/), choose the notebook instance **RunScriptNotebookInstance**, choose **Actions**, and choose **Stop**. It can take a few minutes for the instance to stop. 

1. After the instance **Status** changes to **Stopped**, choose **Actions**, choose **Delete**, and then choose **Delete** in the dialog box. It can take a few minutes for the instance to be deleted. The notebook instance disappears from the table when it has been deleted. 

1. Open the [Amazon S3 console](https://console.aws.amazon.com/s3/) and delete the bucket that you created for storing model artifacts and the training dataset. 

1. Open the [IAM console](https://console.aws.amazon.com/iam/) and delete the IAM role. If you created permission policies, you can delete them, too. 
**Note**  
 The Docker container shuts down automatically after it has run. You don't need to delete it.

# Custom Docker containers with SageMaker AI


You can adapt an existing Docker image to work with SageMaker AI. You may need to use an existing, external Docker image with SageMaker AI when you have a container that satisfies feature or safety requirements that are not currently supported by a pre-built SageMaker AI image. There are two toolkits that allow you to bring your own container and adapt it to work with SageMaker AI:
+ [SageMaker Training Toolkit](https://github.com/aws/sagemaker-training-toolkit) – Use this toolkit for training models with SageMaker AI.
+ [SageMaker AI Inference Toolkit](https://github.com/aws/sagemaker-inference-toolkit) – Use this toolkit for deploying models with SageMaker AI.

The following topics show how to adapt your existing image using the SageMaker Training and Inference toolkits:

**Topics**
+ [

## Individual Framework Libraries
](#docker-containers-adapt-your-own-frameworks)
+ [

# SageMaker Training and Inference Toolkits
](amazon-sagemaker-toolkits.md)
+ [

# Adapting your own training container
](adapt-training-container.md)
+ [

# Adapt your own inference container for Amazon SageMaker AI
](adapt-inference-container.md)

## Individual Framework Libraries


In addition to the SageMaker Training Toolkit and SageMaker AI Inference Toolkit, SageMaker AI also provides toolkits specialized for TensorFlow, MXNet, PyTorch, and Chainer. The following table provides links to the GitHub repositories that contain the source code for each framework and their respective serving toolkits. The instructions linked are for using the Python SDK to run training algorithms and host models on SageMaker AI. The functionality for these individual libraries is included in the SageMaker AI Training Toolkit and SageMaker AI Inference Toolkit.


| Framework | Toolkit Source Code | 
| --- | --- | 
| TensorFlow |  [SageMaker AI TensorFlow Training](https://github.com/aws/sagemaker-tensorflow-training-toolkit) [SageMaker AI TensorFlow Serving](https://github.com/aws/sagemaker-tensorflow-serving-container)  | 
| MXNet |  [SageMaker AI MXNet Training](https://github.com/aws/sagemaker-mxnet-training-toolkit) [SageMaker AI MXNet Inference](https://github.com/aws/sagemaker-mxnet-inference-toolkit)  | 
| PyTorch |  [SageMaker AI PyTorch Training](https://github.com/aws/sagemaker-pytorch-training-toolkit) [SageMaker AI PyTorch Inference](https://github.com/aws/sagemaker-pytorch-inference-toolkit)  | 
| Chainer |  [SageMaker AI Chainer Containers](https://github.com/aws/sagemaker-chainer-container)  | 

# SageMaker Training and Inference Toolkits


The [SageMaker Training](https://github.com/aws/sagemaker-training-toolkit) and [SageMaker AI Inference](https://github.com/aws/sagemaker-inference-toolkit) toolkits implement the functionality that you need to adapt your containers to run scripts, train algorithms, and deploy models on SageMaker AI. When installed, the toolkits define the following for users:
+ The locations for storing code and other resources. 
+ The entry point that contains the code to run when the container is started. Your Dockerfile must copy the code that needs to be run into the location expected by a container that is compatible with SageMaker AI. 
+ Other information that a container needs to manage deployments for training and inference. 

## SageMaker AI Toolkits Containers Structure


When SageMaker AI trains a model, it creates the following file folder structure in the container's `/opt/ml` directory.

```
/opt/ml
├── input
│   ├── config
│   │   ├── hyperparameters.json
│   │   └── resourceConfig.json
│   └── data
│       └── <channel_name>
│           └── <input data>
├── model
│
├── code
│
├── output
│
└── failure
```

When you run a model *training* job, the SageMaker AI container uses the `/opt/ml/input/` directory, which contains the JSON files that configure the hyperparameters for the algorithm and the network layout used for distributed training. The `/opt/ml/input/` directory also contains files that specify the channels through which SageMaker AI accesses the data, which is stored in Amazon Simple Storage Service (Amazon S3). The SageMaker AI containers library places the scripts that the container will run in the `/opt/ml/code/` directory. Your script should write the model generated by your algorithm to the `/opt/ml/model/` directory. For more information, see [Containers with custom training algorithms](your-algorithms-training-algo.md).
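
As an illustration of this layout, the sketch below reads the two JSON files from `input/config` and lists the data channels under `input/data`. It is not the toolkit's own code. The function `load_training_inputs` and its `base_dir` parameter are hypothetical, and the demonstration builds a temporary copy of the directory tree with made-up sample values so that it can run outside a container.

```python
import json
import tempfile
from pathlib import Path


def load_training_inputs(base_dir="/opt/ml"):
    """Read hyperparameters, resource config, and channel names under base_dir."""
    config_dir = Path(base_dir) / "input" / "config"
    data_dir = Path(base_dir) / "input" / "data"
    hyperparameters = json.loads((config_dir / "hyperparameters.json").read_text())
    resource_config = json.loads((config_dir / "resourceConfig.json").read_text())
    channels = sorted(p.name for p in data_dir.iterdir() if p.is_dir())
    return hyperparameters, resource_config, channels


# Build a temporary stand-in for /opt/ml with sample values (assumed for the demo).
with tempfile.TemporaryDirectory() as tmp:
    config = Path(tmp, "input", "config")
    config.mkdir(parents=True)
    Path(tmp, "input", "data", "training").mkdir(parents=True)
    (config / "hyperparameters.json").write_text(json.dumps({"epochs": "1"}))
    (config / "resourceConfig.json").write_text(
        json.dumps({"current_host": "algo-1", "hosts": ["algo-1"]})
    )
    hyperparameters, resource_config, channels = load_training_inputs(tmp)
    print(hyperparameters, resource_config["hosts"], channels)
```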

When you *host* a trained model on SageMaker AI to make inferences, you deploy the model to an HTTP endpoint. The model makes real-time predictions in response to inference requests. The container must contain a serving stack to process these requests.

In a hosting or batch transform container, the model files are located in the same folder to which they were written during training.

```
/opt/ml/model
│
└── <model files>
```

For more information, see [Containers with custom inference code](your-algorithms-inference-main.md).
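
To give a sense of what a serving stack does, the following sketch uses Python's built-in `http.server` to answer health checks on `/ping` and inference requests on `/invocations`, the two routes that SageMaker AI hosting calls on port 8080. It is a teaching example under stated assumptions, not a production server: the `instances` request key and the echo-style response are made up, and real containers typically rely on a dedicated model server.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class InferenceHandler(BaseHTTPRequestHandler):
    """Answer health checks on /ping and inference requests on /invocations."""

    def _send(self, status, body=b""):
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        # A 200 response on /ping signals that the container is healthy.
        self._send(200 if self.path == "/ping" else 404)

    def do_POST(self):
        if self.path != "/invocations":
            self._send(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        # Placeholder "model": report how many inputs were received.
        instances = payload.get("instances", []) if isinstance(payload, dict) else payload
        self._send(200, json.dumps({"received": len(instances)}).encode("utf-8"))

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def make_server(port=8080, host="0.0.0.0"):
    """Create the HTTP server without starting it."""
    return HTTPServer((host, port), InferenceHandler)
```

To start serving, call `make_server().serve_forever()`. A real handler would load the model from `/opt/ml/model` once at startup and use it inside `do_POST`.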

## Single Versus Multiple Containers


You can either provide separate Docker images for the training algorithm and inference code or you can use a single Docker image for both. When creating Docker images for use with SageMaker AI, consider the following:
+ Providing two Docker images can increase storage requirements and cost because common libraries might be duplicated.
+ In general, smaller containers start faster for both training and hosting. Models train faster and the hosting service can react to increases in traffic by automatically scaling more quickly.
+ You might be able to write an inference container that is significantly smaller than the training container. This is especially common when you use GPUs for training, but your inference code is optimized for CPUs.
+ SageMaker AI requires that Docker containers run without privileged access.
+ Both Docker containers that you build and those provided by SageMaker AI can send messages to the `Stdout` and `Stderr` files. SageMaker AI sends these messages to Amazon CloudWatch logs in your AWS account.

For more information about how to create SageMaker AI containers and how scripts are executed inside them, see the [SageMaker AI Training Toolkit](https://github.com/aws/sagemaker-training-toolkit) and [SageMaker AI Inference Toolkit](https://github.com/aws/sagemaker-inference-toolkit) repositories on GitHub. They also provide lists of important environment variables and the environment variables provided by SageMaker AI containers.
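
For example, a training or inference script can route its log records to standard output so that they are collected with the rest of the container output. The following is a minimal sketch that uses Python's standard `logging` module. The logger name and the message format are arbitrary choices made for illustration.

```python
import logging
import sys


def configure_logging(level=logging.INFO):
    """Send log records to stdout so they are captured with container output."""
    logger = logging.getLogger("my_container")  # hypothetical logger name
    logger.setLevel(level)
    if not logger.handlers:  # avoid duplicate handlers if called twice
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


logger = configure_logging()
logger.info("training started")
```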

# Adapting your own training container


To run your own training model, build a Docker container using the [Amazon SageMaker Training Toolkit](https://github.com/aws/sagemaker-training-toolkit) through an Amazon SageMaker notebook instance.

## Step 1: Create a SageMaker notebook instance

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. In the left navigation pane, choose **Notebook**, choose **Notebook instances**, and then choose **Create notebook instance**. 

1. On the **Create notebook instance** page, provide the following information: 

   1. For **Notebook instance name**, enter **RunScriptNotebookInstance**.

   1. For **Notebook Instance type**, choose **ml.t2.medium**.

   1. In the **Permissions and encryption** section, do the following:

      1. For **IAM role**, choose **Create a new role**. This opens a new window.

      1. On the **Create an IAM role** page, choose **Specific S3 buckets**, specify an Amazon S3 bucket named **sagemaker-run-script**, and then choose **Create role**.

         SageMaker AI creates an IAM role named `AmazonSageMaker-ExecutionRole-YYYYMMDDTHHmmSS`. For example, `AmazonSageMaker-ExecutionRole-20190429T110788`. Note that the execution role naming convention uses the date and time at which the role was created, separated by a `T`.

   1. For **Root Access**, choose **Enable**.

   1. Choose **Create notebook instance**. 

1. On the **Notebook instances** page, the **Status** is **Pending**. It can take a few minutes for Amazon SageMaker AI to launch a machine learning compute instance (in this case, a notebook instance) and attach an ML storage volume to it. The notebook instance has a preconfigured Jupyter notebook server and a set of Anaconda libraries. For more information, see [CreateNotebookInstance](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateNotebookInstance.html). 

   

1. Click on the **Name** of the notebook you just created. This opens a new page.

1.  In the **Permissions and encryption** section, copy **the IAM role ARN number**, and paste it into a notepad file to save it temporarily. You use this IAM role ARN number later to configure a local training estimator in the notebook instance. **The IAM role ARN number** looks like the following: `'arn:aws:iam::111122223333:role/service-role/AmazonSageMaker-ExecutionRole-20190429T110788'` 

1. After the status of the notebook instance changes to **InService**, choose **Open JupyterLab**.
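The console choices in this step can also be expressed as a request to the `CreateNotebookInstance` API. The following is a minimal sketch, not a replacement for the console procedure: it assumes the IAM role already exists (the console flow above creates it for you), and the helper names `notebook_instance_request` and `create_notebook_instance` are hypothetical.

```python
def notebook_instance_request(role_arn, name="RunScriptNotebookInstance"):
    """Build CreateNotebookInstance parameters that mirror the console choices above."""
    return {
        "NotebookInstanceName": name,
        "InstanceType": "ml.t2.medium",
        "RoleArn": role_arn,       # ARN of an existing execution role (assumption)
        "RootAccess": "Enabled",
    }


def create_notebook_instance(role_arn):
    """Send the request with Boto3; requires AWS credentials and a default Region."""
    # boto3 is imported lazily so the request builder can be used without it.
    import boto3

    client = boto3.client("sagemaker")
    return client.create_notebook_instance(**notebook_instance_request(role_arn))
```

Keeping the request builder separate from the API call makes it possible to inspect the parameters before creating any resources.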

## Step 2: Create and upload the Dockerfile and Python training scripts

1. After JupyterLab opens, create a new folder in the home directory of your JupyterLab. In the upper-left corner, choose the **New Folder** icon, and then enter the folder name `docker_test_folder`. 

1. Create a `Dockerfile` text file in the `docker_test_folder` directory. 

   1. Choose the **New Launcher** icon (+) in the upper-left corner. 

   1. In the right pane under the **Other** section, choose **Text File**.

   1. Paste the following `Dockerfile` sample code into your text file. 

      ```
      #Download an open source TensorFlow Docker image
      FROM tensorflow/tensorflow:latest-gpu-jupyter
      
      # Install sagemaker-training toolkit that contains the common functionality necessary to create a container compatible with SageMaker AI and the Python SDK.
      RUN pip3 install sagemaker-training
      
      # Copies the training code inside the container
      COPY train.py /opt/ml/code/train.py
      
      # Defines train.py as script entrypoint
      ENV SAGEMAKER_PROGRAM train.py
      ```

      The Dockerfile script performs the following tasks:
      + `FROM tensorflow/tensorflow:latest-gpu-jupyter` – Downloads the latest TensorFlow Docker base image. You can replace this with any Docker base image that you want to build on, including AWS pre-built container base images.
      + `RUN pip3 install sagemaker-training` – Installs the [SageMaker AI Training Toolkit](https://github.com/aws/sagemaker-training-toolkit), which contains the common functionality necessary to create a container compatible with SageMaker AI. 
      + `COPY train.py /opt/ml/code/train.py` – Copies the script to the location inside the container that is expected by SageMaker AI. The script must be located in this folder.
      + `ENV SAGEMAKER_PROGRAM train.py` – Sets the training script `train.py`, copied into the `/opt/ml/code` folder of the container, as the entrypoint script. This is the only environment variable that you must specify when you build your own container.

   1.  On the left directory navigation pane, the text file name might automatically be named `untitled.txt`. To rename the file, right-click the file, choose **Rename**, rename the file as `Dockerfile` without the `.txt` extension, and then press `Ctrl+s` or `Command+s` to save the file.

1. Upload a training script `train.py` to the `docker_test_folder`. You can use the following example script to create a model that reads handwritten digits trained on the [MNIST dataset](https://en.wikipedia.org/wiki/MNIST_database) for this exercise.

   ```
   import tensorflow as tf
   import os
   
   mnist = tf.keras.datasets.mnist
   
   (x_train, y_train), (x_test, y_test) = mnist.load_data()
   x_train, x_test = x_train / 255.0, x_test / 255.0
   
   model = tf.keras.models.Sequential([
       tf.keras.layers.Flatten(input_shape=(28, 28)),
       tf.keras.layers.Dense(128, activation='relu'),
       tf.keras.layers.Dropout(0.2),
       tf.keras.layers.Dense(10, activation='softmax')
   ])
   
   model.compile(optimizer='adam',
                 loss='sparse_categorical_crossentropy',
                 metrics=['accuracy'])
   
   model.fit(x_train, y_train, epochs=1)
   model_save_dir = f"{os.environ.get('SM_MODEL_DIR')}/1"
   
   model.evaluate(x_test, y_test)
   tf.saved_model.save(model, model_save_dir)
   ```
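The example script builds its save path from `os.environ.get('SM_MODEL_DIR')`, which evaluates to the string `None/1` when the variable is not set, for example when you run the script outside a SageMaker AI container. The following sketch shows one way to make the path resolution more defensive. The helper name `resolve_model_dir` is hypothetical, and the fallback to `/opt/ml/model` is an assumption based on the default model directory used by SageMaker AI training containers.

```python
import os


def resolve_model_dir(version="1"):
    """Return the versioned directory where the trained model should be saved."""
    # SM_MODEL_DIR is set by the training toolkit inside the container;
    # /opt/ml/model is assumed here as the fallback for runs outside SageMaker AI.
    base_dir = os.environ.get("SM_MODEL_DIR", "/opt/ml/model")
    return os.path.join(base_dir, version)
```

In `train.py`, you could then write `model_save_dir = resolve_model_dir()` in place of the f-string.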

## Step 3: Build the container

1. In the JupyterLab home directory, open a Jupyter notebook. To open a new notebook, choose the **New Launcher** icon and then choose the latest version of **conda_tensorflow2** in the **Notebook** section.

1. Run the following command in the first notebook cell to change to the `docker_test_folder` directory:

   ```
   cd ~/SageMaker/docker_test_folder
   ```

   To confirm your current directory, run the following command:

   ```
   ! pwd
   ```

   `output: /home/ec2-user/SageMaker/docker_test_folder`

1. To build the Docker container, run the following Docker build command, including the space followed by a period at the end:

   ```
   ! docker build -t tf-custom-container-test .
   ```

   The Docker build command must be run from the Docker directory you created, in this case `docker_test_folder`.
**Note**  
If you get the following error message that Docker cannot find the Dockerfile, make sure the Dockerfile has the correct name and has been saved to the directory.  

   ```
   unable to prepare context: unable to evaluate symlinks in Dockerfile path: 
   lstat /home/ec2-user/SageMaker/docker/Dockerfile: no such file or directory
   ```
Remember that `docker` looks for a file specifically called `Dockerfile` without any extension within the current directory. If you named it something else, you can pass in the file name manually with the `-f` flag. For example, if you named your Dockerfile as `Dockerfile-text.txt`, run the following command:  

   ```
   ! docker build -t tf-custom-container-test -f Dockerfile-text.txt .
   ```
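Because the build fails when the Dockerfile is missing or misnamed, it can help to check the build folder before you run `docker build`. The following is a small sketch of such a check. The helper name `missing_build_files` is hypothetical, and the default file list assumes the two files created in Step 2.

```python
from pathlib import Path


def missing_build_files(folder, required=("Dockerfile", "train.py")):
    """Return the names of required files that are absent from the build folder."""
    folder = Path(folder)
    return [name for name in required if not (folder / name).is_file()]


# Example: report what is missing from the current directory, if anything.
print(missing_build_files("."))
```

An empty list means both files are present. A result such as `['Dockerfile']` usually means that the file still has its `.txt` extension.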

## Step 4: Test the container

1. To test the container locally in the notebook instance, open a Jupyter notebook. Choose **New Launcher** and choose the latest version of **conda_tensorflow2** in the **Notebook** section. 

1. Paste the following example script into the notebook code cell to configure a SageMaker AI Estimator.

   ```
   import sagemaker
   from sagemaker.estimator import Estimator
   
   estimator = Estimator(image_uri='tf-custom-container-test',
                         role=sagemaker.get_execution_role(),
                         instance_count=1,
                         instance_type='local')
   
   estimator.fit()
   ```

   In the preceding code example, `sagemaker.get_execution_role()` is passed to the `role` argument to automatically retrieve the role set up for the SageMaker AI session. You can also replace it with the string value of the IAM role ARN that you copied when you configured the notebook instance. The ARN should look like the following: `'arn:aws:iam::111122223333:role/service-role/AmazonSageMaker-ExecutionRole-20190429T110788'`. 

1. Run the code cell. This test outputs the training environment configuration, the values used for the environment variables, the source of the data, and the loss and accuracy obtained during training.
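If you pass the role ARN as a string instead of calling `sagemaker.get_execution_role()`, a quick format check can catch copy-and-paste mistakes before the estimator runs. The following sketch is illustrative only. The pattern and the helper name `looks_like_role_arn` are assumptions, and the check verifies only the general shape of the string, not that the role exists.

```python
import re

# General shape of an IAM role ARN: partition, empty Region field,
# 12-digit account ID, then the role path and name.
ROLE_ARN_PATTERN = re.compile(r"^arn:aws[a-z-]*:iam::\d{12}:role/.+$")


def looks_like_role_arn(value):
    """Return True if the string has the general shape of an IAM role ARN."""
    return bool(ROLE_ARN_PATTERN.match(value))
```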

## Step 5: Push the container to Amazon Elastic Container Registry (Amazon ECR)

1. After you successfully run the local mode test, you can push the Docker container to [Amazon ECR](https://docs.aws.amazon.com/AmazonECR/latest/userguide/what-is-ecr.html) and use it to run training jobs. If you want to use a private Docker registry instead of Amazon ECR, see [Push your training container to a private registry](https://docs.aws.amazon.com/sagemaker/latest/dg/docker-containers-adapt-your-own-private-registry.html).

   Run the following command lines in a notebook cell.

   ```
   %%sh
   
   # Specify an algorithm name
   algorithm_name=tf-custom-container-test
   
   account=$(aws sts get-caller-identity --query Account --output text)
   
   # Get the region defined in the current configuration (default to us-west-2 if none defined)
   region=$(aws configure get region)
   region=${region:-us-west-2}
   
   fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"
   
   # If the repository doesn't exist in ECR, create it.
   
   aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1
   if [ $? -ne 0 ]
   then
       aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
   fi
   
   # Get a login password from ECR and use it to log Docker in to the registry host
   
   aws ecr get-login-password --region ${region} | docker login --username AWS --password-stdin "${account}.dkr.ecr.${region}.amazonaws.com"
   
   # Build the docker image locally with the image name and then push it to ECR
   # with the full name.
   
   docker build -t ${algorithm_name} .
   docker tag ${algorithm_name} ${fullname}
   
   docker push ${fullname}
   ```
**Note**  
This bash shell script may raise a permission issue similar to the following error message:  

   ```
   "denied: User: [ARN] is not authorized to perform: ecr:InitiateLayerUpload on resource:
   arn:aws:ecr:us-east-1:[id]:repository/tf-custom-container-test"
   ```
If this error occurs, you need to attach the **AmazonEC2ContainerRegistryFullAccess** policy to your IAM role. Go to the [IAM console](https://console.aws.amazon.com/iam/home), choose **Roles** from the left navigation pane, and look up the IAM role that you used for the notebook instance. Under the **Permissions** tab, choose the **Attach policies** button, and search for the **AmazonEC2ContainerRegistryFullAccess** policy. Select the check box of the policy, and choose **Add permissions** to finish.

1. Run the following code in a notebook cell to construct the Amazon ECR image URI of your training container.

   ```
   import boto3
   
   account_id = boto3.client('sts').get_caller_identity().get('Account')
   ecr_repository = 'tf-custom-container-test'
   tag = ':latest'
   
   region = boto3.session.Session().region_name
   
   uri_suffix = 'amazonaws.com'
   if region in ['cn-north-1', 'cn-northwest-1']:
       uri_suffix = 'amazonaws.com.cn'
   
   byoc_image_uri = '{}.dkr.ecr.{}.{}/{}'.format(account_id, region, uri_suffix, ecr_repository + tag)
   
   byoc_image_uri
   # This should return something like
   # 111122223333.dkr.ecr.us-east-2.amazonaws.com/tf-custom-container-test:latest
   ```

1. Use the `byoc_image_uri` retrieved from the previous step to configure a SageMaker AI estimator object. The following code sample configures a SageMaker AI estimator with the `byoc_image_uri` and initiates a training job on an Amazon EC2 instance.

------
#### [ SageMaker Python SDK v1 ]

   ```
   import sagemaker
   from sagemaker import get_execution_role
   from sagemaker.estimator import Estimator
   
   estimator = Estimator(image_name=byoc_image_uri,
                         role=get_execution_role(),
                         base_job_name='tf-custom-container-test-job',
                         train_instance_count=1,
                         train_instance_type='ml.g4dn.xlarge')
   
   #train your model
   estimator.fit()
   ```

------
#### [ SageMaker Python SDK v2 ]

   ```
   import sagemaker
   from sagemaker import get_execution_role
   from sagemaker.estimator import Estimator
   
   estimator = Estimator(image_uri=byoc_image_uri,
                         role=get_execution_role(),
                         base_job_name='tf-custom-container-test-job',
                         instance_count=1,
                         instance_type='ml.g4dn.xlarge')
   
   #train your model
   estimator.fit()
   ```

------

1. If you want to deploy your model using your own container, refer to [Adapting Your Own Inference Container](https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-inference-container.html). You can also use an AWS framework container that can deploy a TensorFlow model. To deploy the example model that reads handwritten digits, enter the following example script into the same notebook that you used to train your model in the previous step. The script obtains the image URI (uniform resource identifier) needed for deployment and deploys the model.

   ```
   import boto3
   import sagemaker
   
   #obtain image uris
   from sagemaker import image_uris
   container = image_uris.retrieve(framework='tensorflow',region='us-west-2',version='2.11.0',
                       image_scope='inference',instance_type='ml.g4dn.xlarge')
   
   #create the model entity, endpoint configuration and endpoint
   predictor = estimator.deploy(1,instance_type='ml.g4dn.xlarge',image_uri=container)
   ```

   Test your model using a sample handwritten digit from the MNIST dataset using the following code example.

   ```
   #Retrieve an example test dataset to test
   import numpy as np
   import matplotlib.pyplot as plt
   from keras.datasets import mnist
   
   # Load the MNIST dataset and split it into training and testing sets
   (x_train, y_train), (x_test, y_test) = mnist.load_data()
   # Select a random example from the training set
   example_index = np.random.randint(0, x_train.shape[0])
   example_image = x_train[example_index]
   example_label = y_train[example_index]
   
   # Print the label and show the image
   print(f"Label: {example_label}")
   plt.imshow(example_image, cmap='gray')
   plt.show()
   ```

   Convert the test handwritten digit into a form that TensorFlow can ingest and make a test prediction.

   ```
   from sagemaker.serializers import JSONSerializer
   data = {"instances": [(example_image / 255.0).tolist()]}  # a batch of one 28x28 image, scaled as in train.py
   predictor.serializer=JSONSerializer() #update the predictor to use the JSONSerializer
   predictor.predict(data) #make the prediction
   ```
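The request body in the preceding prediction step follows the TensorFlow Serving row format, in which each element of `instances` is one input example. The following sketch shows how a single image can be wrapped as a batch of one without NumPy. The helper name `to_instances_payload` is hypothetical, and the division by 255 is an assumption that mirrors the scaling applied in `train.py`.

```python
def to_instances_payload(image, scale=255.0):
    """Wrap one 2-D image (a list of pixel rows) as a single-example request body."""
    normalized = [[pixel / scale for pixel in row] for row in image]
    # The outer list is the batch dimension: one entry per example.
    return {"instances": [normalized]}


# A tiny 2x2 "image" stands in for a 28x28 MNIST digit.
payload = to_instances_payload([[0, 255], [255, 0]])
```

With a NumPy array such as `example_image`, you could pass `example_image.tolist()` as the `image` argument.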

For a full example that shows how to test a custom container locally and push it to Amazon ECR, see the [Building Your Own TensorFlow Container](https://sagemaker-examples.readthedocs.io/en/latest/advanced_functionality/tensorflow_bring_your_own/tensorflow_bring_your_own.html) example notebook.
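The image URI construction used earlier in this step can be factored into a small function that needs no AWS calls, which makes the expected format easier to see and to check. This is a sketch under the same assumptions as the notebook code. The function name `build_ecr_image_uri` is hypothetical, and the China Region list is copied from the example above.

```python
def build_ecr_image_uri(account_id, region, repository, tag="latest"):
    """Assemble an Amazon ECR image URI from its parts."""
    # China Regions use a different DNS suffix, as in the notebook code above.
    suffix = "amazonaws.com.cn" if region in ("cn-north-1", "cn-northwest-1") else "amazonaws.com"
    return f"{account_id}.dkr.ecr.{region}.{suffix}/{repository}:{tag}"


print(build_ecr_image_uri("111122223333", "us-east-2", "tf-custom-container-test"))
```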

**Tip**  
To profile and debug training jobs to monitor system utilization issues (such as CPU bottlenecks and GPU underutilization) and identify training issues (such as overfitting, overtraining, exploding tensors, and vanishing gradients), use Amazon SageMaker Debugger. For more information, see [Use Debugger with custom training containers](debugger-bring-your-own-container.md).

## Step 6: Clean up resources

**To clean up resources when done with the get started example**

1. Open the [SageMaker AI console](https://console.aws.amazon.com/sagemaker/), choose the notebook instance **RunScriptNotebookInstance**, choose **Actions**, and choose **Stop**. It can take a few minutes for the instance to stop. 

1. After the instance **Status** changes to **Stopped**, choose **Actions**, choose **Delete**, and then choose **Delete** in the dialog box. It can take a few minutes for the instance to be deleted. The notebook instance disappears from the table when it has been deleted. 

1. Open the [Amazon S3 console](https://console.aws.amazon.com/s3/) and delete the bucket that you created for storing model artifacts and the training dataset. 

1. Open the [IAM console](https://console.aws.amazon.com/iam/) and delete the IAM role. If you created permission policies, you can delete them, too. 
**Note**  
 The Docker container shuts down automatically after it has run. You don't need to delete it.
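The first two cleanup steps can also be scripted with Boto3. The following is a minimal sketch that assumes you pass in a SageMaker client, for example `boto3.client("sagemaker")`. The function name `stop_and_delete_notebook` is hypothetical, and the sketch does not delete the S3 bucket or the IAM role.

```python
def stop_and_delete_notebook(client, name="RunScriptNotebookInstance"):
    """Stop a notebook instance, wait until it is stopped, and then delete it."""
    client.stop_notebook_instance(NotebookInstanceName=name)
    # Deletion is accepted only after the instance reaches the Stopped state.
    client.get_waiter("notebook_instance_stopped").wait(NotebookInstanceName=name)
    client.delete_notebook_instance(NotebookInstanceName=name)
```

Passing the client as an argument keeps the function free of credential handling and lets you exercise the call order with a stand-in object.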

## Blogs and Case Studies


The following blogs discuss case studies about using custom training containers in Amazon SageMaker AI.
+ [Why bring your own container to Amazon SageMaker AI and how to do it right](https://medium.com/@pandey.vikesh/why-bring-your-own-container-to-amazon-sagemaker-and-how-to-do-it-right-bc158fe41ed1), *Medium* (January 20th, 2023)

# Adapt your training job to access images in a private Docker registry


You can use a private [Docker registry](https://docs.docker.com/registry/) instead of Amazon Elastic Container Registry (Amazon ECR) to host your images for SageMaker AI Training. The following instructions show you how to create a Docker registry, configure your virtual private cloud (VPC) and training job, store images, and give SageMaker AI access to the training image in the private Docker registry. These instructions also show you how to use a Docker registry that requires authentication for a SageMaker training job.

## Create and store your images in a private Docker registry


Create a private Docker registry to store your images. Your registry must:
+ Use the [Docker Registry HTTP API](https://docs.docker.com/registry/spec/api/) protocol.
+ Be accessible from the same VPC specified in the [VpcConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html#API_CreateTrainingJob_RequestSyntax) parameter in the `CreateTrainingJob` API. Specify `VpcConfig` when you create your training job.
+ Be secured with a [TLS certificate](https://aws.amazon.com/what-is/ssl-certificate/) from a known public certificate authority.

For more information about creating a Docker registry, see [Deploy a registry server](https://docs.docker.com/registry/deploying/).
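One way to check the first and third requirements is to call the registry's API version endpoint, `/v2/`, over HTTPS from a host inside the VPC. The following sketch uses only the Python standard library. The helper names are hypothetical, and the sketch assumes that the registry host is the part of the image reference before the first slash.

```python
import urllib.error
import urllib.request


def registry_api_url(image_uri):
    """Derive the registry's API base URL from an image reference."""
    host = image_uri.split("/", 1)[0]
    return f"https://{host}/v2/"


def registry_responds(image_uri, timeout=5):
    """Return True if the registry answers the API version check over TLS."""
    try:
        with urllib.request.urlopen(registry_api_url(image_uri), timeout=timeout):
            return True
    except urllib.error.HTTPError as err:
        # 401 means the registry speaks the API but requires authentication.
        return err.code == 401
    except (urllib.error.URLError, OSError):
        # Covers DNS failures, refused connections, and untrusted certificates.
        return False
```

For example, `registry_responds("myteam.myorg.com/docker-local/my-training-image:<IMAGE-TAG>")` returns `False` if the certificate is not trusted or the host is unreachable from where you run it.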

## Configure your VPC and SageMaker training job


SageMaker AI uses a network connection within your VPC to access images in your Docker registry. To use the images in your Docker registry for training, the registry must be accessible from an Amazon VPC in your account. For more information, see [Use a Docker registry that requires authentication for training](docker-containers-adapt-your-own-private-registry-authentication.md).

You must also configure your training job to connect to the same VPC to which your Docker registry has access. For more information, see [Configure a Training Job for Amazon VPC Access](https://docs.aws.amazon.com/sagemaker/latest/dg/train-vpc.html#train-vpc-configure).

## Create a training job using an image from your private Docker registry


To use an image from your private Docker registry for training, use the following guide to configure your image, and then configure and create a training job. The code examples that follow use the AWS SDK for Python (Boto3) client.

1. Create a training image configuration object and set the `TrainingRepositoryAccessMode` field to `Vpc`, as follows.

   ```
   training_image_config = {
       'TrainingRepositoryAccessMode': 'Vpc'
   }
   ```
**Note**  
If your private Docker registry requires authentication, you must add a `TrainingRepositoryAuthConfig` object to the training image configuration object. You must also specify the Amazon Resource Name (ARN) of an AWS Lambda function that provides access credentials to SageMaker AI using the `TrainingRepositoryCredentialsProviderArn` field of the `TrainingRepositoryAuthConfig` object. For more information, see the example code structure below.  

   ```
   training_image_config = {
      'TrainingRepositoryAccessMode': 'Vpc',
      'TrainingRepositoryAuthConfig': {
           'TrainingRepositoryCredentialsProviderArn': 'arn:aws:lambda:Region:Acct:function:FunctionName'
      }
   }
   ```

   For information about how to create the Lambda function to provide authentication, see [Use a Docker registry that requires authentication for training](docker-containers-adapt-your-own-private-registry-authentication.md).

1. Use a Boto3 client to create a training job and pass the correct configuration to the [create_training_job](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) API. The following instructions show you how to configure the components and create a training job.

   1. Create the `AlgorithmSpecification` object that you want to pass to `create_training_job`. Use the training image configuration object that you created in the previous step, as shown in the following code example.

      ```
      algorithm_specification = {
         'TrainingImage': 'myteam.myorg.com/docker-local/my-training-image:<IMAGE-TAG>',
         'TrainingImageConfig': training_image_config,
         'TrainingInputMode': 'File'
      }
      ```
**Note**  
To use a fixed version of an image rather than one that can be updated, refer to the image by its [digest](https://docs.docker.com/engine/reference/commandline/pull/#pull-an-image-by-digest-immutable-identifier) instead of by name or tag.

   1. Specify the name of the training job and role that you want to pass to `create_training_job`, as shown in the following code example. 

      ```
      training_job_name = 'private-registry-job'
      execution_role_arn = 'arn:aws:iam::123456789012:role/SageMakerExecutionRole'
      ```

   1. Specify a security group and subnet for the VPC configuration for your training job. Your private Docker registry must allow inbound traffic from the security groups that you specify, as shown in the following code example.

      ```
      vpc_config = {
          'SecurityGroupIds': ['sg-0123456789abcdef0'],
          'Subnets': ['subnet-0123456789abcdef0','subnet-0123456789abcdef1']
      }
      ```
**Note**  
If your subnet is not in the same VPC as your private Docker registry, you must set up a networking connection between the two VPCs. For more information, see [Connect VPCs using VPC peering](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-peering.html).

   1. Specify the resource configuration, including machine learning compute instances and storage volumes to use for training, as shown in the following code example. 

      ```
      resource_config = {
          'InstanceType': 'ml.m4.xlarge',
          'InstanceCount': 1,
          'VolumeSizeInGB': 10,
      }
      ```

   1. Specify the input and output data configuration, where the training dataset is stored, and where you want to store model artifacts, as shown in the following code example.

      ```
      input_data_config = [
          {
              "ChannelName": "training",
              "DataSource":
              {
                  "S3DataSource":
                  {
                      "S3DataDistributionType": "FullyReplicated",
                      "S3DataType": "S3Prefix",
                      "S3Uri": "s3://your-training-data-bucket/training-data-folder"
                  }
              }
          }
      ]
      
      output_data_config = {
          'S3OutputPath': 's3://your-output-data-bucket/model-folder'
      }
      ```

   1. Specify the maximum number of seconds that a model training job can run as shown in the following code example.

      ```
      stopping_condition = {
          'MaxRuntimeInSeconds': 1800
      }
      ```

   1. Finally, create the training job using the parameters you specified in the previous steps as shown in the following code example.

      ```
      import boto3
      sm = boto3.client('sagemaker')
      try:
          resp = sm.create_training_job(
              TrainingJobName=training_job_name,
              AlgorithmSpecification=algorithm_specification,
              RoleArn=execution_role_arn,
              InputDataConfig=input_data_config,
              OutputDataConfig=output_data_config,
              ResourceConfig=resource_config,
              VpcConfig=vpc_config,
              StoppingCondition=stopping_condition
          )
      except Exception as e:
          print(f'error calling CreateTrainingJob operation: {e}')
      else:
          print(resp)
      ```
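The two variants of the training image configuration shown in step 1 differ only in the optional authentication block, so they can be produced by one helper. The following is a sketch. The function name `build_training_image_config` is hypothetical, and the field names are the ones used in the examples above.

```python
def build_training_image_config(credentials_provider_arn=None):
    """Return a TrainingImageConfig for a private registry reached through a VPC."""
    config = {"TrainingRepositoryAccessMode": "Vpc"}
    if credentials_provider_arn:
        # Only registries that require authentication need the Lambda function ARN.
        config["TrainingRepositoryAuthConfig"] = {
            "TrainingRepositoryCredentialsProviderArn": credentials_provider_arn
        }
    return config
```

You could then assign `training_image_config = build_training_image_config()` for an open registry, or pass the Lambda function ARN for a registry that requires authentication.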

# Use a SageMaker AI estimator to run a training job


You can also use an [estimator](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html) from the SageMaker Python SDK to handle the configuration and running of your SageMaker training job. The following code examples show how to configure and run an estimator using images from a private Docker registry.

1. Import the required libraries and dependencies, as shown in the following code example.

   ```
   import boto3
   import sagemaker
   from sagemaker.estimator import Estimator
   
   session = sagemaker.Session()
   
   role = sagemaker.get_execution_role()
   ```

1. Provide a Uniform Resource Identifier (URI) to your training image, security groups and subnets for the VPC configuration for your training job, as shown in the following code example.

   ```
   image_uri = "myteam.myorg.com/docker-local/my-training-image:<IMAGE-TAG>"
   
   security_groups = ["sg-0123456789abcdef0"]
   subnets = ["subnet-0123456789abcdef0", "subnet-0123456789abcdef1"]
   ```

   For more information about `security_group_ids` and `subnets`, see the appropriate parameter description in the [Estimators](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html) section of the SageMaker Python SDK.
**Note**  
SageMaker AI uses a network connection within your VPC to access images in your Docker registry. To use the images in your Docker registry for training, the registry must be accessible from an Amazon VPC in your account.

1. (Optional) If your Docker registry requires authentication, also specify the Amazon Resource Name (ARN) of an AWS Lambda function that provides access credentials to SageMaker AI. The following code example shows how to specify the ARN. 

   ```
   training_repository_credentials_provider_arn = "arn:aws:lambda:us-west-2:123456789012:function:test"
   ```

   For more information about using images in a Docker registry requiring authentication, see **Use a Docker registry that requires authentication for training** below.

1. Use the values from the previous steps to configure an estimator, as shown in the following code example.

   ```
   # The training repository access mode must be 'Vpc' for private docker registry jobs 
   training_repository_access_mode = "Vpc"
   
   # Specify the instance type, instance count you want to use
   instance_type="ml.m5.xlarge"
   instance_count=1
   
   # Specify the maximum number of seconds that a model training job can run
   max_run_time = 1800
   
   # Specify the output path for the model artifacts
   output_path = "s3://your-output-bucket/your-output-path"
   
   estimator = Estimator(
       image_uri=image_uri,
       role=role,
       subnets=subnets,
       security_group_ids=security_groups,
       training_repository_access_mode=training_repository_access_mode,
       training_repository_credentials_provider_arn=training_repository_credentials_provider_arn,  # remove this line if auth is not needed
       instance_type=instance_type,
       instance_count=instance_count,
       output_path=output_path,
       max_run=max_run_time
   )
   ```

1. Start your training job by calling `estimator.fit` with your job name and input path as parameters, as shown in the following code example.

   ```
   input_path = "s3://your-input-bucket/your-input-path"
   job_name = "your-job-name"
   
   estimator.fit(
       inputs=input_path,
       job_name=job_name
   )
   ```
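The estimator example above asks you to remove the credentials provider line when authentication is not needed. An alternative is to assemble the keyword arguments in a dictionary and add that entry conditionally. The following sketch builds only the dictionary and does not import the SageMaker Python SDK. The helper name `private_registry_estimator_kwargs` is hypothetical, and the keyword names are the ones used in the estimator example above.

```python
def private_registry_estimator_kwargs(image_uri, role, subnets, security_group_ids,
                                      credentials_provider_arn=None):
    """Collect the Estimator arguments that relate to a private Docker registry."""
    kwargs = {
        "image_uri": image_uri,
        "role": role,
        "subnets": list(subnets),
        "security_group_ids": list(security_group_ids),
        "training_repository_access_mode": "Vpc",
    }
    if credentials_provider_arn:
        # Added only when the registry requires authentication.
        kwargs["training_repository_credentials_provider_arn"] = credentials_provider_arn
    return kwargs
```

You could then create the estimator with `Estimator(**kwargs, instance_type=instance_type, instance_count=instance_count, output_path=output_path, max_run=max_run_time)`.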

# Use a Docker registry that requires authentication for training


If your Docker registry requires authentication, you must create an AWS Lambda function that provides access credentials to SageMaker AI. Then, create a training job and provide the ARN of this Lambda function inside the [create_training_job](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_training_job) API. Lastly, you can optionally create an interface VPC endpoint so that your VPC can communicate with your Lambda function without sending traffic over the internet. The following guide shows how to create a Lambda function, assign it the correct role, and create an interface VPC endpoint.

## Create the Lambda function


Create an AWS Lambda function that passes access credentials to SageMaker AI and returns a response. The following code example shows the Lambda function handler.

```
def handler(event, context):
   response = {
      "Credentials": {"Username": "username", "Password": "password"}
   }
   return response
```

The type of authentication used to set up your private Docker registry determines the contents of the response returned by your Lambda function as follows.
+ If your private Docker registry uses basic authentication, the Lambda function will return the username and password needed in order to authenticate to the registry.
+ If your private Docker registry uses [bearer token authentication](https://docs.docker.com/registry/spec/auth/token/), the username and password are sent to your authorization server, which then returns a bearer token. This token is then used to authenticate to your private Docker registry.

**Note**  
If you have more than one Lambda function for your registries in the same account, and the execution role is the same for your training jobs, then training jobs for one registry would have access to the Lambda functions for the other registries.
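The handler shown earlier in this section hardcodes placeholder credentials. The following sketch reads them from environment variables instead, so that the values are not stored in the function code. The variable names `REGISTRY_USERNAME` and `REGISTRY_PASSWORD` are hypothetical and would need to be set in the Lambda function's configuration. A secrets store is another option that this sketch does not cover.

```python
import os


def handler(event, context):
    """Return registry credentials in the response shape that SageMaker AI expects."""
    # REGISTRY_USERNAME and REGISTRY_PASSWORD are hypothetical variable names
    # configured on the Lambda function, not names defined by SageMaker AI.
    return {
        "Credentials": {
            "Username": os.environ["REGISTRY_USERNAME"],
            "Password": os.environ["REGISTRY_PASSWORD"],
        }
    }
```

If either variable is missing, the handler raises a `KeyError`, which surfaces as a failed invocation rather than as empty credentials.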

## Grant the correct role permission to your Lambda function


The [IAM role](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) that you use in the `create_training_job` API must have permission to call an AWS Lambda function. The following code example shows how to extend the permissions policy of an IAM role to call `myLambdaFunction`.

```
{
    "Effect": "Allow",
    "Action": [
        "lambda:InvokeFunction"
    ],
    "Resource": [
        "arn:aws:lambda:*:*:function:*myLambdaFunction*"
    ]
}
```

For information about editing a role permissions policy, see [Modifying a role permissions policy (console)](https://docs.aws.amazon.com/IAM/latest/UserGuide/roles-managingrole-editing-console.html#roles-modify_permissions-policy) in the *AWS Identity and Access Management User Guide*.

**Note**  
An IAM role with an attached **AmazonSageMakerFullAccess** managed policy has permission to call any Lambda function with "SageMaker" in its name.
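The policy statement above uses wildcards for the Region, account, and function name. If you prefer to scope the permission to a single function, you can generate the statement from the function's full ARN. The following is a sketch. The helper name `invoke_function_statement` and the example ARN are hypothetical.

```python
import json


def invoke_function_statement(function_arn):
    """Build a policy statement that allows invoking one specific Lambda function."""
    return {
        "Effect": "Allow",
        "Action": ["lambda:InvokeFunction"],
        "Resource": [function_arn],
    }


statement = invoke_function_statement(
    "arn:aws:lambda:us-west-2:123456789012:function:myLambdaFunction"
)
print(json.dumps(statement, indent=4))
```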

## Create an interface VPC endpoint for Lambda


If you create an interface endpoint, your Amazon VPC can communicate with your Lambda function without sending traffic over the internet. For more information, see [Configuring interface VPC endpoints for Lambda](https://docs.aws.amazon.com/lambda/latest/dg/configuration-vpc-endpoints.html) in the *AWS Lambda Developer Guide*.

After your interface endpoint is created, SageMaker training will call your Lambda function by sending a request through your VPC to `lambda.region.amazonaws.com`. If you select **Enable DNS Name** when you create your interface endpoint, [Amazon Route 53](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/Welcome.html) routes the call to the Lambda interface endpoint. If you use a different DNS provider, you must map `lambda.region.amazonaws.com` to your Lambda interface endpoint.
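To see where the Lambda hostname resolves from inside your VPC, you can look it up and check whether the returned addresses are private. The following sketch uses only the Python standard library. The helper names are hypothetical, and the check assumes that a working interface endpoint resolves to private IPv4 addresses in your subnets.

```python
import ipaddress
import socket


def lambda_endpoint_hostname(region):
    """Return the regional Lambda hostname in the form used above."""
    return f"lambda.{region}.amazonaws.com"


def resolves_to_private_addresses(hostname):
    """Return True if every IPv4 address the name resolves to is private."""
    _, _, addresses = socket.gethostbyname_ex(hostname)
    return all(ipaddress.ip_address(addr).is_private for addr in addresses)
```

Run `resolves_to_private_addresses(lambda_endpoint_hostname("us-west-2"))` from a host in the VPC. A result of `False` suggests that the name still resolves to the public Lambda endpoint.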

# Adapt your own inference container for Amazon SageMaker AI


If you can't use any of the images listed in [Pre-built SageMaker AI Docker images](docker-containers-prebuilt.md) for your use case, you can build your own Docker container and use it inside SageMaker AI for training and inference. To be compatible with SageMaker AI, your container must have the following characteristics:
+ Your container must have a web server listening on port `8080`.
+ Your container must accept `POST` requests to the `/invocations` real-time endpoint and `GET` health check requests to the `/ping` endpoint. Requests sent to these endpoints must be answered within 60 seconds for regular responses and 8 minutes for streaming responses, and have a maximum size of 25 MB.

For more information and an example of how to build your own Docker container for training and inference with SageMaker AI, see [Building your own algorithm container](https://github.com/aws/amazon-sagemaker-examples/blob/main/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb). 
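To make the container contract concrete, the following sketch implements the two routes with only the Python standard library. It is not the NGINX, Gunicorn, and Flask stack used in the guide that follows, and it stands in an echo response for a real model. The names `route`, `InferenceHandler`, and `make_server` are hypothetical. The sketch answers `/ping` for any HTTP method.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def route(method, path, body=b""):
    """Map a request to a (status_code, response_dict) pair."""
    if path == "/ping":
        return 200, {"status": "ok"}  # health check
    if method == "POST" and path == "/invocations":
        request = json.loads(body or b"{}")
        return 200, {"echo": request}  # placeholder for a real prediction
    return 404, {"error": "not found"}


class InferenceHandler(BaseHTTPRequestHandler):
    def _respond(self, status, payload):
        data = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def do_GET(self):
        self._respond(*route("GET", self.path))

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self._respond(*route("POST", self.path, self.rfile.read(length)))


def make_server(port=8080):
    """Create a server that listens on all interfaces on the given port."""
    return HTTPServer(("0.0.0.0", port), InferenceHandler)
```

Calling `make_server().serve_forever()` starts the server on port `8080`. Keeping the routing logic in a plain function makes it possible to check the behavior without opening a socket.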

The following guide shows you how to use a `JupyterLab` space with Amazon SageMaker Studio Classic to adapt an inference container to work with SageMaker AI hosting. The example uses an NGINX web server, Gunicorn as a Python web server gateway interface, and Flask as a web application framework. You can use different applications to adapt your container as long as it meets the previously listed requirements. For more information about using your own inference code, see [Custom Inference Code with Hosting Services](your-algorithms-inference-code.md).

**Adapt your inference container**

Use the following steps to adapt your own inference container to work with SageMaker AI hosting. The example shown in the following steps uses a pre-trained [Named Entity Recognition (NER) model](https://spacy.io/universe/project/video-spacys-ner-model-alt) that uses the [spaCy](https://spacy.io/) natural language processing (NLP) library for `Python` and the following:
+ A Dockerfile to build the container that contains the NER model.
+ Inference scripts to serve the NER model.

If you adapt this example for your use case, you must use a Dockerfile and inference scripts that are needed to deploy and serve your model.

1. Create a JupyterLab space with Amazon SageMaker Studio Classic (optional).

   You can use any notebook to run scripts to adapt your inference container with SageMaker AI hosting. This example shows you how to use a JupyterLab space within Amazon SageMaker Studio Classic to launch a JupyterLab application that comes with a SageMaker AI Distribution image. For more information, see [SageMaker JupyterLab](studio-updated-jl.md).

1. Upload a Dockerfile and inference scripts.

   1. Create a new folder in your home directory. If you’re using JupyterLab, in the upper-left corner, choose the **New Folder** icon, and enter a folder name to contain your Dockerfile. In this example, the folder is called `docker_test_folder`.

   1. Upload a Dockerfile text file into your new folder. The following is an example Dockerfile that creates a Docker container with a pre-trained [Named Entity Recognition (NER) model](https://spacy.io/universe/project/video-spacys-ner-model) from [spaCy](https://spacy.io/), along with the applications and environment variables needed to run the example:

      ```
      FROM python:3.8
      
      RUN apt-get -y update && apt-get install -y --no-install-recommends \
               wget \
               python3 \
               nginx \
               ca-certificates \
          && rm -rf /var/lib/apt/lists/*
      
      RUN wget https://bootstrap.pypa.io/get-pip.py && python3 get-pip.py && \
          pip install flask gevent gunicorn && \
              rm -rf /root/.cache
      
      #pre-trained model package installation
      RUN pip install spacy
      RUN python -m spacy download en_core_web_sm
      
      
      # Set environment variables
      ENV PYTHONUNBUFFERED=TRUE
      ENV PYTHONDONTWRITEBYTECODE=TRUE
      ENV PATH="/opt/program:${PATH}"
      
      COPY NER /opt/program
      WORKDIR /opt/program
      ```

      In the previous code example, the environment variable `PYTHONUNBUFFERED` keeps Python from buffering the standard output stream, which allows for faster delivery of logs to the user. The environment variable `PYTHONDONTWRITEBYTECODE` keeps Python from writing compiled bytecode `.pyc` files, which are unnecessary for this use case. The environment variable `PATH` is used to identify the location of the `train` and `serve` programs when the container is invoked.

   1. Create a new directory inside your new folder to contain scripts to serve your model. This example uses a directory called `NER`, which contains the following scripts necessary to run this example:
      + `predictor.py` – A Python script that contains the logic to load and perform inference with your model.
      + `nginx.conf` – A script to configure a web server.
      + `serve` – A script that starts an inference server.
      + `wsgi.py` – A helper script to serve a model.
**Important**  
If you copy your inference scripts into a notebook ending in `.ipynb` and rename them, your scripts might contain formatting characters that prevent your endpoint from deploying. Instead, create a text file for each script and rename it.

   1. Upload a script to make your model available for inference. The following is an example script called `predictor.py` that uses Flask to provide the `/ping` and `/invocations` endpoints:

      ```
      from flask import Flask
      import flask
      import spacy
      import os
      import json
      import logging
      
      #Load in model
      nlp = spacy.load('en_core_web_sm') 
      #If you plan to use your own model artifacts,
      #your model artifacts should be stored in /opt/ml/model/
      
      
      # The flask app for serving predictions
      app = Flask(__name__)
      @app.route('/ping', methods=['GET'])
      def ping():
          # Check if the classifier was loaded correctly
          health = nlp is not None
          status = 200 if health else 404
          return flask.Response(response= '\n', status=status, mimetype='application/json')
      
      
      @app.route('/invocations', methods=['POST'])
      def transformation():
          
          #Process input
          input_json = flask.request.get_json()
          resp = input_json['input']
          
          #NER
          doc = nlp(resp)
          entities = [(X.text, X.label_) for X in doc.ents]
      
          # Transform predictions to JSON
          result = {
              'output': entities
              }
      
          resultjson = json.dumps(result)
          return flask.Response(response=resultjson, status=200, mimetype='application/json')
      ```

      The `/ping` endpoint in the previous script example returns a status code of `200` if the model is loaded correctly, and `404` if the model is loaded incorrectly. The `/invocations` endpoint processes a request formatted in JSON, extracts the `input` field, and uses the NER model to identify entities and store them in the variable `entities`. The Flask application returns the response that contains these entities. For more information about these required health requests, see [How Your Container Should Respond to Health Check (Ping) Requests](your-algorithms-inference-code.md#your-algorithms-inference-algo-ping-requests).
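
      The script builds each entity as a Python tuple, but `json.dumps` serializes tuples as JSON arrays, so a client that decodes the response receives lists. The following short sketch, which uses hypothetical entity values instead of a loaded model, illustrates that round trip:

      ```python
      import json

      # Hypothetical entities, shaped like the (text, label) tuples built in predictor.py
      entities = [("Amazon", "ORG"), ("Seattle", "GPE")]

      body = json.dumps({"output": entities})
      decoded = json.loads(body)["output"]

      # Tuples do not survive a JSON round trip; each pair comes back as a list.
      print(decoded)  # prints [['Amazon', 'ORG'], ['Seattle', 'GPE']]
      ```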

   1. Upload a script to start an inference server. The following example script, called `serve`, starts Gunicorn as an application server and Nginx as a web server:

      ```
      #!/usr/bin/env python
      
      # This file implements the scoring service shell. You don't necessarily need to modify it for various
      # algorithms. It starts nginx and gunicorn with the correct configurations and then simply waits until
      # gunicorn exits.
      #
      # The flask server is specified to be the app object in wsgi.py
      #
      # We set the following parameters:
      #
      # Parameter                Environment Variable              Default Value
      # ---------                --------------------              -------------
      # number of workers        MODEL_SERVER_WORKERS              the number of CPU cores
      # timeout                  MODEL_SERVER_TIMEOUT              60 seconds
      
      import multiprocessing
      import os
      import signal
      import subprocess
      import sys
      
      cpu_count = multiprocessing.cpu_count()
      
      model_server_timeout = os.environ.get('MODEL_SERVER_TIMEOUT', 60)
      model_server_workers = int(os.environ.get('MODEL_SERVER_WORKERS', cpu_count))
      
      def sigterm_handler(nginx_pid, gunicorn_pid):
          try:
              os.kill(nginx_pid, signal.SIGQUIT)
          except OSError:
              pass
          try:
              os.kill(gunicorn_pid, signal.SIGTERM)
          except OSError:
              pass
      
          sys.exit(0)
      
      def start_server():
          print('Starting the inference server with {} workers.'.format(model_server_workers))
      
      
          # link the log streams to stdout/err so they will be logged to the container logs
          subprocess.check_call(['ln', '-sf', '/dev/stdout', '/var/log/nginx/access.log'])
          subprocess.check_call(['ln', '-sf', '/dev/stderr', '/var/log/nginx/error.log'])
      
          nginx = subprocess.Popen(['nginx', '-c', '/opt/program/nginx.conf'])
          gunicorn = subprocess.Popen(['gunicorn',
                                       '--timeout', str(model_server_timeout),
                                       '-k', 'sync',
                                       '-b', 'unix:/tmp/gunicorn.sock',
                                       '-w', str(model_server_workers),
                                       'wsgi:app'])
      
          signal.signal(signal.SIGTERM, lambda a, b: sigterm_handler(nginx.pid, gunicorn.pid))
      
          # Exit the inference server upon exit of either subprocess
          pids = set([nginx.pid, gunicorn.pid])
          while True:
              pid, _ = os.wait()
              if pid in pids:
                  break
      
          sigterm_handler(nginx.pid, gunicorn.pid)
          print('Inference server exiting')
      
      # The main routine to invoke the start function.
      
      if __name__ == '__main__':
          start_server()
      ```

      The previous script example defines a signal handler function `sigterm_handler`, which shuts down the Nginx and Gunicorn sub-processes when the script receives a `SIGTERM` signal. The `start_server` function links the Nginx log streams to the container logs, starts the Nginx and Gunicorn sub-processes, registers the signal handler, and waits until either sub-process exits.

   1. Upload a script to configure your web server. The following example configuration file, called `nginx.conf`, configures an Nginx web server that uses Gunicorn as an application server to serve your model for inference:

      ```
      worker_processes 1;
      daemon off; # Prevent forking
      
      
      pid /tmp/nginx.pid;
      error_log /var/log/nginx/error.log;
      
      events {
        # defaults
      }
      
      http {
        include /etc/nginx/mime.types;
        default_type application/octet-stream;
        access_log /var/log/nginx/access.log combined;
        
        upstream gunicorn {
          server unix:/tmp/gunicorn.sock;
        }
      
        server {
          listen 8080 deferred;
          client_max_body_size 5m;
      
          keepalive_timeout 5;
          proxy_read_timeout 1200s;
      
          location ~ ^/(ping|invocations) {
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header Host $http_host;
            proxy_redirect off;
            proxy_pass http://gunicorn;
          }
      
          location / {
            return 404 "{}";
          }
        }
      }
      ```

      The previous script example configures Nginx to run in the foreground, sets the location to capture the `error_log`, and defines `upstream` as the Gunicorn server’s Unix socket. The server block listens on port `8080` and sets limits on client request body size and timeout values. The server block forwards requests whose paths start with `/ping` or `/invocations` to the Gunicorn server at `http://gunicorn`, and returns a `404` error for other paths.

   1. Upload any other scripts needed to serve your model. This example needs the following example script called `wsgi.py` to help Gunicorn find your application:

      ```
      import predictor as myapp
      
      # This is just a simple wrapper for gunicorn to find your app.
      # If you want to change the algorithm file, simply change "predictor" above to the
      # new file.
      
      app = myapp.app
      ```

   From the folder `docker_test_folder`, your directory structure should contain a Dockerfile and the folder NER. The NER folder should contain the files `nginx.conf`, `predictor.py`, `serve`, and `wsgi.py` as follows:

    ![\[The Dockerfile structure has inference scripts under the NER directory next to the Dockerfile.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/docker-file-struct-adapt-ex.png) 

1. Build your own container.

   From the folder `docker_test_folder`, build your Docker container. The following example command will build the Docker container that is configured in your Dockerfile:

   ```
   ! docker build -t byo-container-test .
   ```

   The previous command builds an image called `byo-container-test` from the Dockerfile in the current working directory. For more information about the Docker build parameters, see [Build arguments](https://docs.docker.com/build/guide/build-args/).
**Note**  
If you get the following error message that Docker cannot find the Dockerfile, make sure the Dockerfile has the correct name and has been saved to the directory.  

   ```
   unable to prepare context: unable to evaluate symlinks in Dockerfile path:
   lstat /home/ec2-user/SageMaker/docker_test_folder/Dockerfile: no such file or directory
   ```
Docker looks for a file specifically called `Dockerfile`, without any extension, within the current directory. If you named it something else, you can pass in the file name manually with the `-f` flag. For example, if you named your Dockerfile `Dockerfile-text.txt`, build your Docker container using the `-f` flag followed by your file as follows:  

   ```
   ! docker build -t byo-container-test -f Dockerfile-text.txt .
   ```

1. Push your Docker image to Amazon Elastic Container Registry (Amazon ECR).

   In a notebook cell, push your Docker image to Amazon ECR. The following code example shows you how to build your container locally, log in, and push the image to Amazon ECR:

   ```
   %%sh
   # Name of algo -> ECR
   algorithm_name=sm-pretrained-spacy
   
   #make serve executable
   chmod +x NER/serve
   account=$(aws sts get-caller-identity --query Account --output text)
   # Region, defaults to us-east-1
   region=$(aws configure get region)
   region=${region:-us-east-1}
   fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"
   # If the repository doesn't exist in ECR, create it.
   aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1
   if [ $? -ne 0 ]
   then
       aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
   fi
   # Get the login command from ECR and execute it directly
   aws ecr get-login-password --region ${region}|docker login --username AWS --password-stdin ${fullname}
   # Build the docker image locally with the image name and then push it to ECR
   # with the full name.
   
   docker build  -t ${algorithm_name} .
   docker tag ${algorithm_name} ${fullname}
   
   docker push ${fullname}
   ```

   The previous example shows how to complete the following steps, which are necessary to push the example Docker container to Amazon ECR:

   1. Define the algorithm name as `sm-pretrained-spacy`.

   1. Make the `serve` file inside the NER folder executable.

   1. Set the AWS Region.

   1. Create an Amazon ECR repository if it doesn’t already exist.

   1. Log in to Amazon ECR.

   1. Build the Docker container locally.

   1. Push the Docker image to the ECR.
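
   The `fullname` variable in the shell script follows a fixed pattern of account ID, Region, repository name, and tag. As a small sketch of that pattern in Python, the hypothetical helper `ecr_image_uri` below builds the same string. It assumes the standard `amazonaws.com` domain, and the account ID and Region shown are placeholders:

   ```python
   def ecr_image_uri(account_id, region, repository, tag="latest"):
       """Build the image URI used when tagging and pushing an image to Amazon ECR."""
       registry = "{}.dkr.ecr.{}.amazonaws.com".format(account_id, region)
       return "{}/{}:{}".format(registry, repository, tag)


   # Placeholder account ID and Region, for illustration only
   print(ecr_image_uri("123456789012", "us-east-1", "sm-pretrained-spacy"))
   ```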

1. Set up the SageMaker AI client

   If you want to use SageMaker AI hosting services for inference, you must [create a model](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_model.html), create an [endpoint config](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_endpoint_config.html#) and [create an endpoint](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_endpoint.html#). In order to get inferences from your endpoint, you can use the SageMaker AI boto3 Runtime client to invoke your endpoint. The following code shows you how to set up both the SageMaker AI client and the SageMaker Runtime client using the [SageMaker AI boto3 client](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html):

   ```
   import boto3
   from sagemaker import get_execution_role
   
   sm_client = boto3.client(service_name='sagemaker')
   runtime_sm_client = boto3.client(service_name='sagemaker-runtime')
   
   account_id = boto3.client('sts').get_caller_identity()['Account']
   region = boto3.Session().region_name
   
   #used to store model artifacts which SageMaker AI will extract to /opt/ml/model in the container, 
   #in this example case we will not be making use of S3 to store the model artifacts
   #s3_bucket = '<S3Bucket>'
   
   role = get_execution_role()
   ```

   In the previous code example, the Amazon S3 bucket is not used, but inserted as a comment to show how to store model artifacts.

   If you receive a permission error after you run the previous code example, you may need to add permissions to your IAM role. For more information about IAM roles, see [Amazon SageMaker Role Manager](role-manager.md). For more information about adding permissions to your current role, see [AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md).

1. Create your model.

   If you want to use SageMaker AI hosting services for inference, you must create a model in SageMaker AI. The following code example shows you how to create the spaCy NER model inside of SageMaker AI:

   ```
   from time import gmtime, strftime
   
   model_name = 'spacy-nermodel-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
   # Model S3 URL containing model artifacts as either model.tar.gz or extracted artifacts.
   # Here we are not using Amazon S3 to store model artifacts.
   #model_url = 's3://{}/spacy/'.format(s3_bucket) 
   
   container = '{}.dkr.ecr.{}.amazonaws.com/sm-pretrained-spacy:latest'.format(account_id, region)
   instance_type = 'ml.c5d.18xlarge'
   
   print('Model name: ' + model_name)
   #print('Model data Url: ' + model_url)
   print('Container image: ' + container)
   
   container = {
   'Image': container
   }
   
   create_model_response = sm_client.create_model(
       ModelName = model_name,
       ExecutionRoleArn = role,
       Containers = [container])
   
   print("Model Arn: " + create_model_response['ModelArn'])
   ```

   The previous code example shows how to define a `model_url` using the `s3_bucket` if you were to use the Amazon S3 bucket from the comments in Step 5, and defines the ECR URI for the container image. The previous code example defines `ml.c5d.18xlarge` as the instance type. You can also choose a different instance type. For more information about available instance types, see [Amazon EC2 instance types](https://aws.amazon.com/ec2/instance-types/).

   In the previous code example, the `Image` key points to the container image URI. The `create_model_response` definition uses the `create_model` method to create a model from the model name, role, and a list containing the container information. The method returns the Amazon Resource Name (ARN) of the model.

   Example output from the previous script follows:

   ```
   Model name: spacy-nermodel-YYYY-MM-DD-HH-MM-SS
   Container image: 123456789012.dkr.ecr.us-east-2.amazonaws.com/sm-pretrained-spacy:latest
   Model Arn: arn:aws:sagemaker:us-east-2:123456789012:model/spacy-nermodel-YYYY-MM-DD-HH-MM-SS
   ```
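
   If you later store model artifacts in Amazon S3, the container definition can also carry a `ModelDataUrl` entry that points to a `model.tar.gz` archive. The following sketch assumes a hypothetical helper named `build_create_model_request` and shows one way to assemble the `create_model` arguments with or without model data:

   ```python
   def build_create_model_request(model_name, role_arn, image_uri, model_data_url=None):
       """Assemble keyword arguments for the SageMaker AI client's create_model method."""
       container = {"Image": image_uri}
       if model_data_url is not None:
           # SageMaker AI extracts these artifacts to /opt/ml/model in the container
           container["ModelDataUrl"] = model_data_url
       return {
           "ModelName": model_name,
           "ExecutionRoleArn": role_arn,
           "Containers": [container],
       }
   ```

   With the variables from the previous example, a call might look like `sm_client.create_model(**build_create_model_request(model_name, role, image_uri))`, where `image_uri` is the ECR URI string.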

1. Configure and create an endpoint.

   1. Create an endpoint configuration.

      To use SageMaker AI hosting for inference, you must also configure and create an endpoint. SageMaker AI will use this endpoint for inference. The following configuration example shows how to generate and configure an endpoint with the instance type and model name that you defined previously:

      ```
      endpoint_config_name = 'spacy-ner-config' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
      print('Endpoint config name: ' + endpoint_config_name)
      
      create_endpoint_config_response = sm_client.create_endpoint_config(
          EndpointConfigName = endpoint_config_name,
          ProductionVariants=[{
              'InstanceType': instance_type,
              'InitialInstanceCount': 1,
              'InitialVariantWeight': 1,
              'ModelName': model_name,
              'VariantName': 'AllTraffic'}])
              
      print("Endpoint config Arn: " + create_endpoint_config_response['EndpointConfigArn'])
      ```

      In the previous configuration example, `create_endpoint_config_response` associates the `model_name` with a unique endpoint configuration name `endpoint_config_name` that is created with a timestamp.

      Example output from the previous script follows:

      ```
      Endpoint config name: spacy-ner-configYYYY-MM-DD-HH-MM-SS
      Endpoint config Arn: arn:aws:sagemaker:us-east-2:123456789012:endpoint-config/spacy-ner-configYYYY-MM-DD-HH-MM-SS
      ```

      For more information about endpoint errors, see [Why does my Amazon SageMaker AI endpoint go into the failed state when I create or update an endpoint?](https://repost.aws/knowledge-center/sagemaker-endpoint-creation-fail)

   1. Create an endpoint and wait for the endpoint to be in service.

      The following code example creates the endpoint using the configuration from the previous configuration example and deploys the model:

      ```
      %%time
      
      import time
      
      endpoint_name = 'spacy-ner-endpoint' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
      print('Endpoint name: ' + endpoint_name)
      
      create_endpoint_response = sm_client.create_endpoint(
          EndpointName=endpoint_name,
          EndpointConfigName=endpoint_config_name)
      print('Endpoint Arn: ' + create_endpoint_response['EndpointArn'])
      
      resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
      status = resp['EndpointStatus']
      print("Endpoint Status: " + status)
      
      print('Waiting for {} endpoint to be in service...'.format(endpoint_name))
      waiter = sm_client.get_waiter('endpoint_in_service')
      waiter.wait(EndpointName=endpoint_name)
      ```

      In the previous code example, the `create_endpoint` method creates the endpoint with a generated endpoint name that includes a timestamp, and the script prints the Amazon Resource Name (ARN) of the endpoint. The `describe_endpoint` method returns information about the endpoint and its status. A SageMaker AI waiter waits for the endpoint to be in service.
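
      If you want to surface the failure reason yourself, a manual polling loop around `describe_endpoint` is one alternative to the waiter. The following is a minimal sketch under that assumption. The function name `wait_for_endpoint` and its `poll_seconds` and `max_polls` parameters are hypothetical:

      ```python
      import time


      def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=30, max_polls=60):
          """Poll describe_endpoint until the endpoint is InService or has failed."""
          for _ in range(max_polls):
              resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
              status = resp["EndpointStatus"]
              if status == "InService":
                  return resp
              if status == "Failed":
                  # FailureReason is part of the response when the endpoint fails
                  raise RuntimeError(
                      "Endpoint {} failed: {}".format(
                          endpoint_name, resp.get("FailureReason", "unknown reason")
                      )
                  )
              time.sleep(poll_seconds)
          raise TimeoutError(
              "Endpoint {} was not in service after {} polls".format(endpoint_name, max_polls)
          )
      ```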

1. Test your endpoint.

   Once your endpoint is in service, send an [invocation request](https://boto3.amazonaws.com/v1/documentation/api/1.9.42/reference/services/sagemaker-runtime.html#SageMakerRuntime.Client.invoke_endpoint) to your endpoint. The following code example shows how to send a test request to your endpoint:

   ```
   import json
   content_type = "application/json"
   request_body = {"input": "This is a test with NER in America with \
       Amazon and Microsoft in Seattle, writing random stuff."}
   
   #Serialize data for endpoint
   #data = json.loads(json.dumps(request_body))
   payload = json.dumps(request_body)
   
   #Endpoint invocation
   response = runtime_sm_client.invoke_endpoint(
       EndpointName=endpoint_name,
       ContentType=content_type,
       Body=payload)
   
   #Parse results
   result = json.loads(response['Body'].read().decode())['output']
   result
   ```

   In the previous code example, the method `json.dumps` serializes the `request_body` into a string formatted in JSON and saves it in the variable `payload`. Then the SageMaker AI Runtime client uses the [invoke endpoint](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker-runtime/client/invoke_endpoint.html) method to send `payload` to your endpoint. The result contains the response from your endpoint after extracting the `output` field.

   The previous code example should return the following output:

   ```
   [['NER', 'ORG'],
    ['America', 'GPE'],
    ['Amazon', 'ORG'],
    ['Microsoft', 'ORG'],
    ['Seattle', 'GPE']]
   ```
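
   Because the endpoint returns a flat list of `[text, label]` pairs, client code often regroups them. As one possible post-processing step, the hypothetical helper `group_by_label` below collects entity texts under each label, using the sample output shown above:

   ```python
   from collections import defaultdict


   def group_by_label(entities):
       """Group [text, label] pairs returned by the endpoint by entity label."""
       grouped = defaultdict(list)
       for text, label in entities:
           grouped[label].append(text)
       return dict(grouped)


   # Sample taken from the example output above
   sample = [["NER", "ORG"], ["America", "GPE"], ["Amazon", "ORG"],
             ["Microsoft", "ORG"], ["Seattle", "GPE"]]
   print(group_by_label(sample))
   ```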

1. Delete your endpoint

   After you have completed your invocations, delete your endpoint to conserve resources. The following code example shows you how to delete your endpoint:

   ```
   sm_client.delete_endpoint(EndpointName=endpoint_name)
   sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
   sm_client.delete_model(ModelName=model_name)
   ```

   For a complete notebook containing the code in this example, see [BYOC-Single-Model](https://github.com/aws-samples/sagemaker-hosting/tree/main/Bring-Your-Own-Container/BYOC-Single-Model).
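
The three delete calls in the cleanup step run in sequence, so an error in one call prevents the later calls from running. The following sketch shows one way to attempt every call and report what failed. The helper name `delete_hosting_resources` is hypothetical, and the broad `except` clause is a simplification for illustration:

```python
def delete_hosting_resources(sm_client, endpoint_name, endpoint_config_name, model_name):
    """Attempt each delete call and collect failures instead of stopping at the first one."""
    steps = [
        ("endpoint",
         lambda: sm_client.delete_endpoint(EndpointName=endpoint_name)),
        ("endpoint config",
         lambda: sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)),
        ("model",
         lambda: sm_client.delete_model(ModelName=model_name)),
    ]
    failures = {}
    for label, delete in steps:
        try:
            delete()
        except Exception as error:  # boto3 clients raise botocore ClientError on API errors
            failures[label] = str(error)
    return failures
```

An empty dictionary as the return value indicates that all three delete calls completed without raising an error.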

# Container creation with your own algorithms and models


If none of the existing SageMaker AI containers meet your needs and you don't have an existing container of your own, you may need to create a new Docker container. The following sections show how to create Docker containers with your training and inference algorithms for use with SageMaker AI.

**Topics**
+ [

# Containers with custom training algorithms
](your-algorithms-training-algo.md)
+ [

# Containers with custom inference code
](your-algorithms-inference-main.md)

# Containers with custom training algorithms


This section explains how Amazon SageMaker AI interacts with a Docker container that runs your custom training algorithm. Use this information to write training code and create a Docker image for your training algorithms. 

**Topics**
+ [

# How Amazon SageMaker AI Runs Your Training Image
](your-algorithms-training-algo-dockerfile.md)
+ [

# How Amazon SageMaker AI Provides Training Information
](your-algorithms-training-algo-running-container.md)
+ [

# Run Training with EFA
](your-algorithms-training-efa.md)
+ [

# How Amazon SageMaker AI Signals Algorithm Success and Failure
](your-algorithms-training-signal-success-failure.md)
+ [

# How Amazon SageMaker AI Processes Training Output
](your-algorithms-training-algo-output.md)

# How Amazon SageMaker AI Runs Your Training Image
Run Your Training Image

You can use a custom entrypoint script to automate infrastructure to train in a production environment. If you pass your entrypoint script into your Docker container, you can also run it as a standalone script without rebuilding your images. SageMaker AI processes your training image using a Docker container entrypoint script. 

This section shows you how to use a custom entrypoint without using the training toolkit. If you want to use a custom entrypoint but are unfamiliar with how to manually configure a Docker container, we recommend that you use the [SageMaker training toolkit library](https://github.com/aws/sagemaker-training-toolkit) instead. For more information about how to use the training toolkit, see [Adapting your own training container](adapt-training-container.md). 

By default, SageMaker AI looks for a script called `train` inside your container. You can also manually provide your own custom entrypoint by using the `ContainerArguments` and `ContainerEntrypoint` parameters of the [AlgorithmSpecification](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_AlgorithmSpecification.html) API. 

You have the following two options to manually configure your Docker container to run your image.
+ Use the [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) API and a Docker container with an entrypoint instruction contained inside of it.
+ Use the `CreateTrainingJob` API, and pass your training script from outside of your Docker container.

If you pass your training script from outside your Docker container, you don't need to rebuild the Docker container when you update your script. You can also use several different scripts to run in the same container.

Your entrypoint script should contain training code for your image. If you use the optional `source_dir` parameter inside an [estimator](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html), it should reference the relative Amazon S3 path to the folder containing your entrypoint script. You can reference multiple files using the `source_dir` parameter. If you do not use `source_dir`, you can specify the entrypoint using the `entry_point` parameter. For an example of a custom entrypoint script that contains an estimator, see [Bring Your Own Model with SageMaker AI Script Mode](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-script-mode/sagemaker-script-mode.html).

SageMaker AI model training supports high-performance S3 Express One Zone directory buckets as a data input location for file mode, fast file mode, and pipe mode. You can also use S3 Express One Zone directory buckets to store your training output. To use S3 Express One Zone, provide the URI of an S3 Express One Zone directory bucket instead of an Amazon S3 general purpose bucket. You can only encrypt your SageMaker AI output data in directory buckets with server-side encryption with Amazon S3 managed keys (SSE-S3). Server-side encryption with AWS KMS keys (SSE-KMS) is not currently supported for storing SageMaker AI output data in directory buckets. For more information, see [S3 Express One Zone](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-express-one-zone.html).

## Run a training job with an entrypoint script bundled inside the Docker container


SageMaker AI can run an entrypoint script bundled inside your Docker container. 
+ By default, Amazon SageMaker AI runs the following container.

  ```
  docker run image train
  ```
+ SageMaker AI overrides any default [CMD](https://docs.docker.com/engine/reference/builder/#cmd) statements in a container by specifying the `train` argument after the image name. In your Docker container, use the following `exec` form of the `ENTRYPOINT` instruction.

  ```
  ENTRYPOINT ["executable", "param1", "param2", ...]
  ```

  The following example shows how to specify a Python script called `k-means-algorithm.py` in the entrypoint instruction.

  ```
  ENTRYPOINT ["python", "k-means-algorithm.py"]
  ```

  The `exec` form of the `ENTRYPOINT` instruction starts the executable directly, not as a child of `/bin/sh`. This enables it to receive signals like `SIGTERM` and `SIGKILL` from SageMaker APIs. The following conditions apply when using the SageMaker APIs. 
  + The [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) API has a stopping condition that directs SageMaker AI to stop model training after a specific time. 
  + The [StopTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_StopTrainingJob.html) API issues the equivalent of the following `docker stop` command, with a 2-minute timeout, to gracefully stop the specified container.

    ```
    docker stop -t 120
    ```

    The command attempts to stop the running container by sending a `SIGTERM` signal. After the 2-minute timeout, the API sends `SIGKILL` and forcibly stops the containers. If the container handles the `SIGTERM` gracefully and exits within 120 seconds from receiving it, no `SIGKILL` is sent. 

  If you want access to the intermediate model artifacts after SageMaker AI stops the training, add code to handle saving artifacts in your `SIGTERM` handler.
+ If you plan to use GPU devices for model training, make sure that your containers are `nvidia-docker` compatible. Include only the CUDA toolkit on containers; don't bundle NVIDIA drivers with the image. For more information about `nvidia-docker`, see [NVIDIA/nvidia-docker](https://github.com/NVIDIA/nvidia-docker).
+ You can't use the `tini` initializer as your entrypoint script in SageMaker AI containers because it gets confused by the `train` and `serve` arguments.
+ `/opt/ml` and all subdirectories are reserved by SageMaker training. When building your algorithm’s Docker image, make sure that you don't place any data that's required by your algorithm in this directory, because the data might no longer be visible during training.
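
As a sketch of the `SIGTERM` handling described in the previous list, the following Python example records the signal in a flag, lets the training loop finish its current epoch, and then writes a checkpoint. The names `GracefulStop`, `train`, and `checkpoint.json` are hypothetical, the loop body is a placeholder for real training, and the default output directory assumes that model artifacts are written to `/opt/ml/model`:

```python
import json
import os
import signal


class GracefulStop:
    """Record that SIGTERM arrived so the training loop can save and exit."""

    def __init__(self):
        self.requested = False
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame):
        # Only set a flag here; the training loop does the actual saving.
        self.requested = True


def train(num_epochs, model_dir="/opt/ml/model", stop=None):
    """Toy training loop that stops early and saves state when a stop is requested."""
    stop = stop or GracefulStop()
    state = {"epoch": 0}
    for epoch in range(1, num_epochs + 1):
        state["epoch"] = epoch  # placeholder for one epoch of real training
        if stop.requested:
            break
    # Save the latest state whether training finished or was interrupted.
    os.makedirs(model_dir, exist_ok=True)
    with open(os.path.join(model_dir, "checkpoint.json"), "w") as f:
        json.dump(state, f)
    return state
```

Keeping the handler small and doing the save in the main loop avoids writing files from inside a signal handler. The save must still complete within the 2-minute window before `SIGKILL` is sent.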

To bundle your shell or Python scripts inside your Docker image, or to provide the script in an Amazon S3 bucket or by using the AWS Command Line Interface (CLI), continue to the following section.

### Bundle your shell script in a Docker container


 If you want to bundle a custom shell script inside your Docker image, use the following steps. 

1. Copy your shell script from your working directory to inside your Docker container. The following code snippet copies a custom entrypoint script `custom_entrypoint.sh` from the current working directory to the `/mydir` directory inside the Docker container. The following example assumes that the base Docker image has Python installed.

   ```
   FROM <base-docker-image>:<tag>
   
   # Copy custom entrypoint from current dir to /mydir on container
   COPY ./custom_entrypoint.sh /mydir/
   ```

1. Build and push a Docker container to the Amazon Elastic Container Registry ([Amazon ECR](https://docs.aws.amazon.com/AmazonECR/latest/userguide/what-is-ecr.html)) by following the instructions at [Pushing a Docker image](https://docs.aws.amazon.com/AmazonECR/latest/userguide/docker-push-ecr-image.html) in the *Amazon ECR User Guide*.

1. Launch the training job by running the following AWS CLI command.

   ```
   aws --region <your-region> sagemaker create-training-job \
   --training-job-name <your-training-job-name> \
   --role-arn <your-execution-role-arn> \
   --algorithm-specification '{
       "TrainingInputMode": "File",
       "TrainingImage": "<your-ecr-image>",
       "ContainerEntrypoint": ["/bin/sh"],
       "ContainerArguments": ["/mydir/custom_entrypoint.sh"]}' \
   --output-data-config '{"S3OutputPath": "s3://custom-entrypoint-output-bucket/"}' \
   --resource-config '{"VolumeSizeInGB":10,"InstanceCount":1,"InstanceType":"ml.m5.2xlarge"}' \
   --stopping-condition '{"MaxRuntimeInSeconds": 180}'
   ```

### Bundle your Python script in a Docker container


To bundle a custom Python script inside your Docker image, use the following steps. 

1. Copy your Python script from your working directory into your Docker container. The following code snippet copies a custom entrypoint script, `custom_entrypoint.py`, from the current working directory to the `/mydir` directory in the container.

   ```
   FROM <base-docker-image>:<tag>
   # Copy custom entrypoint from current dir to /mydir on container
   COPY ./custom_entrypoint.py /mydir/
   ```

1. Launch the training job by running the same AWS CLI command as in the shell script procedure, with the `--algorithm-specification` option changed as follows.

   ```
   --algorithm-specification '{
       "TrainingInputMode": "File",
       "TrainingImage": "<your-ecr-image>",
       "ContainerEntrypoint": ["python"],
       "ContainerArguments": ["/mydir/custom_entrypoint.py"]}' \
   ```

## Run a training job with an entrypoint script outside the Docker container


You can use your own Docker container for training and pass in an entrypoint script from outside the Docker container. There are some benefits to structuring your entrypoint script outside the container. If you update your entrypoint script, you don't need to rebuild the Docker container. You can also use several different scripts to run in the same container. 

Specify the location of your training script using the `ContainerEntrypoint` and `ContainerArguments` parameters of the [AlgorithmSpecification](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_AlgorithmSpecification.html) API. These entrypoints and arguments behave in the same manner as Docker entrypoints and arguments. The values in these parameters override the corresponding `ENTRYPOINT` or `CMD` provided as part of the Docker container. 

When you pass your custom entrypoint script to your Docker training container, the inputs that you provide determine the behavior of the container.
+ If you provide only `ContainerEntrypoint`, the request syntax using the `CreateTrainingJob` API is as follows.

  ```
  {
      "AlgorithmSpecification": {
          "ContainerEntrypoint": ["string"],   
          ...     
          }       
  }
  ```

  Then, the SageMaker training backend runs your custom entrypoint as follows.

  ```
  docker run --entrypoint <ContainerEntrypoint> image
  ```
**Note**  
If `ContainerEntrypoint` is provided, the SageMaker training backend runs the image with the given entrypoint and overrides the default `ENTRYPOINT` in the image.
+ If you provide only `ContainerArguments`, SageMaker AI assumes that the Docker container contains an entrypoint script. The request syntax using the `CreateTrainingJob` API is as follows.

  ```
  {
      "AlgorithmSpecification": {
          "ContainerArguments": ["arg1", "arg2"],
          ...
      }
  }
  ```

  The SageMaker training backend runs your custom entrypoint as follows.

  ```
  docker run image <ContainerArguments>
  ```
+ If you provide both `ContainerEntrypoint` and `ContainerArguments`, the request syntax using the `CreateTrainingJob` API is as follows.

  ```
  {
      "AlgorithmSpecification": {
          "ContainerEntrypoint": ["string"],
          "ContainerArguments": ["arg1", "arg2"],
          ...
      }
  }
  ```

  The SageMaker training backend runs your custom entrypoint as follows.

  ```
  docker run --entrypoint <ContainerEntrypoint> image <ContainerArguments>
  ```
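
The three cases above can be summarized in a short Python sketch that builds the equivalent `docker run` command line from the two parameters. The function name `equivalent_docker_run` is hypothetical, and the handling of a multi-element `ContainerEntrypoint` (first element passed to `--entrypoint`, the rest moved in front of the arguments) is an assumption based on how the Docker CLI works, not a statement about the SageMaker training backend.

```python
import shlex


def equivalent_docker_run(image, container_entrypoint=None, container_arguments=None):
    """Return the `docker run` command line that mirrors the three cases above."""
    entrypoint = list(container_entrypoint or [])
    arguments = list(container_arguments or [])
    command = ["docker", "run"]
    if entrypoint:
        # `--entrypoint` takes a single executable; any extra entrypoint
        # elements become leading arguments (an assumption of this sketch).
        command += ["--entrypoint", entrypoint[0]]
        arguments = entrypoint[1:] + arguments
    command.append(image)
    command += arguments
    # shlex.join quotes each element so the result is safe to paste in a shell.
    return shlex.join(command)


print(equivalent_docker_run("image", ["/bin/sh"], ["/mydir/custom_entrypoint.sh"]))
```

Running the same command locally with your image is a quick way to check that an entrypoint and its arguments behave as expected before you create a training job.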

You can use any supported `InputDataConfig` source in the `CreateTrainingJob` API to provide an entrypoint script to run your training image. 

### Provide your entrypoint script in an Amazon S3 bucket


 To provide a custom entrypoint script using an S3 bucket, use the `S3DataSource` parameter of the [DataSource](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DataSource.html#sagemaker-Type-DataSource-S3DataSource) API to specify the location of the script. If you use the `S3DataSource` parameter, the following are required.
+ The [InputMode](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_Channel.html#sagemaker-Type-Channel-InputMode) must be of the type `File`.
+ The [S3DataDistributionType](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DataSource.html#sagemaker-Type-DataSource-S3DataSource) must be `FullyReplicated`.

The following example has a script called `custom_entrypoint.sh` placed in an S3 bucket at `s3://<bucket-name>/<bucket_prefix>/custom_entrypoint.sh`.

```
#!/bin/bash
echo "Running custom_entrypoint.sh"
echo "Hello you have provided the following arguments: " "$@"
```

Next, you must set the configuration of the input data channel to run a training job. Do this either by using the AWS CLI directly or with a JSON file.

#### Configure the input data channel using AWS CLI with a JSON file


To configure your input data channel with a JSON file, use AWS CLI as shown in the following code structure. Ensure that all of the following fields use the request syntax defined in the [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html#API_CreateTrainingJob_RequestSyntax) API.

```
// run-my-training-job.json
{
  "AlgorithmSpecification": {
        "ContainerEntrypoint": ["/bin/sh"],
        "ContainerArguments": ["/opt/ml/input/data/<your_channel_name>/custom_entrypoint.sh"],
         ...
   },
  "InputDataConfig": [
    {
        "ChannelName": "<your_channel_name>",
        "DataSource": {
            "S3DataSource": {
                "S3DataDistributionType": "FullyReplicated",
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://<bucket-name>/<bucket_prefix>"
            }
        },
        "InputMode": "File"
    },
    ...]
}
```

Next, run the AWS CLI command to launch the training job from the JSON file as follows.

```
aws sagemaker create-training-job --cli-input-json file://run-my-training-job.json
```
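
If you prefer to generate the JSON file instead of editing it by hand, the following Python sketch assembles a request with the same fields and writes it to `run-my-training-job.json`. The helper names `build_training_request` and `write_request` are hypothetical, and the resource and stopping values are copied from the CLI examples in this section as placeholders, not recommendations.

```python
import json


def build_training_request(job_name, role_arn, image, channel, script_prefix, output_path):
    """Assemble a CreateTrainingJob request that runs a script staged in S3."""
    # File mode places the channel's objects under /opt/ml/input/data/<channel>.
    script = f"/opt/ml/input/data/{channel}/custom_entrypoint.sh"
    return {
        "TrainingJobName": job_name,
        "RoleArn": role_arn,
        "AlgorithmSpecification": {
            "TrainingInputMode": "File",
            "TrainingImage": image,
            "ContainerEntrypoint": ["/bin/sh"],
            "ContainerArguments": [script],
        },
        "InputDataConfig": [{
            "ChannelName": channel,
            "InputMode": "File",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": script_prefix,
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": output_path},
        "ResourceConfig": {"VolumeSizeInGB": 10, "InstanceCount": 1,
                           "InstanceType": "ml.m5.2xlarge"},
        "StoppingCondition": {"MaxRuntimeInSeconds": 180},
    }


def write_request(request, path="run-my-training-job.json"):
    """Serialize the request so it can be passed to --cli-input-json."""
    with open(path, "w") as f:
        json.dump(request, f, indent=2)
    return path
```

Generating the file with `json.dump` avoids hand-written quoting mistakes such as trailing commas, which the AWS CLI rejects.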

#### Configure the input data channel using AWS CLI directly


To configure your input data channel without a JSON file, use the following AWS CLI code structure.

```
aws --region <your-region> sagemaker create-training-job \
--training-job-name <your-training-job-name> \
--role-arn <your-execution-role-arn> \
--algorithm-specification '{
    "TrainingInputMode": "File",
    "TrainingImage": "<your-ecr-image>",
    "ContainerEntrypoint": ["/bin/sh"],
    "ContainerArguments": ["/opt/ml/input/data/<your_channel_name>/custom_entrypoint.sh"]}' \
--input-data-config '[{
    "ChannelName":"<your_channel_name>",
    "DataSource":{
        "S3DataSource":{
            "S3DataType":"S3Prefix",
            "S3Uri":"s3://<bucket-name>/<bucket_prefix>",
            "S3DataDistributionType":"FullyReplicated"}}}]' \
--output-data-config '{"S3OutputPath": "s3://custom-entrypoint-output-bucket/"}' \
--resource-config '{"VolumeSizeInGB":10,"InstanceCount":1,"InstanceType":"ml.m5.2xlarge"}' \
--stopping-condition '{"MaxRuntimeInSeconds": 180}'
```

# How Amazon SageMaker AI Provides Training Information

This section explains how SageMaker AI makes training information, such as training data, hyperparameters, and other configuration information, available to your Docker container. 

When you send a [`CreateTrainingJob`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request to SageMaker AI to start model training, you specify the Amazon Elastic Container Registry (Amazon ECR) path of the Docker image that contains the training algorithm. You also specify the Amazon Simple Storage Service (Amazon S3) location where training data is stored and algorithm-specific parameters. SageMaker AI makes this information available to the Docker container so that your training algorithm can use it. For information about creating a training job, see `CreateTrainingJob`. For more information on the way that SageMaker AI containers organize information, see [SageMaker Training and Inference Toolkits](amazon-sagemaker-toolkits.md).

**Topics**
+ [Hyperparameters](#your-algorithms-training-algo-running-container-hyperparameters)
+ [Environment Variables](#your-algorithms-training-algo-running-container-environment-variables)
+ [Input Data Configuration](#your-algorithms-training-algo-running-container-inputdataconfig)
+ [Training Data](#your-algorithms-training-algo-running-container-trainingdata)
+ [Distributed Training Configuration](#your-algorithms-training-algo-running-container-dist-training)

## Hyperparameters


 SageMaker AI makes the hyperparameters in a `CreateTrainingJob` request available in the Docker container in the `/opt/ml/input/config/hyperparameters.json` file.

The following is an example of a hyperparameter configuration in `hyperparameters.json` to specify the `num_round` and `eta` hyperparameters in the `CreateTrainingJob` operation for [XGBoost](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html). 

```
{
    "num_round": "128",
    "eta": "0.001"
}
```

For a complete list of hyperparameters that can be used for the SageMaker AI built-in XGBoost algorithm, see [XGBoost Hyperparameters](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html).

The hyperparameters that you can tune depend on the algorithm that you are training. For the hyperparameters available for a SageMaker AI built-in algorithm, see the **Hyperparameters** topic under that algorithm's link in [Use Amazon SageMaker AI Built-in Algorithms or Pre-trained Models](https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html).
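
Inside your own container, reading the hyperparameters file might look like the following Python sketch. The values in the example file are JSON strings, so the sketch converts them; the function names and the conversion rule (try `int`, then `float`, otherwise keep the string) are assumptions for illustration, and your algorithm may need stricter parsing.

```python
import json


def load_hyperparameters(path="/opt/ml/input/config/hyperparameters.json"):
    """Read the hyperparameters file and convert numeric strings to numbers."""
    with open(path) as f:
        raw = json.load(f)
    return {name: _parse(value) for name, value in raw.items()}


def _parse(value):
    """Try int, then float; otherwise return the value unchanged."""
    for cast in (int, float):
        try:
            return cast(value)
        except (TypeError, ValueError):
            continue
    return value
```

With the example file shown earlier, this sketch returns `num_round` as the integer 128 and `eta` as the float 0.001.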

## Environment Variables


SageMaker AI sets the following environment variables in your container:
+ `TRAINING_JOB_NAME` – Specified in the `TrainingJobName` parameter of the `CreateTrainingJob` request.
+ `TRAINING_JOB_ARN` – The Amazon Resource Name (ARN) of the training job returned as the `TrainingJobArn` in the `CreateTrainingJob` response.
+ All environment variables specified in the [Environment](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html#sagemaker-CreateTrainingJob-request-Environment) parameter in the `CreateTrainingJob` request.
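
A training script can read these variables with the standard library, as in the following minimal sketch. The function name `training_job_identity` is hypothetical; the optional `environ` argument exists only so that the function can be exercised outside a training container.

```python
import os


def training_job_identity(environ=None):
    """Return the job name and ARN that SageMaker AI sets, or None if absent."""
    environ = os.environ if environ is None else environ
    return {
        "name": environ.get("TRAINING_JOB_NAME"),
        "arn": environ.get("TRAINING_JOB_ARN"),
    }
```

Using `.get` instead of indexing keeps the script usable when you run the container locally, where these variables are not set.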

## Input Data Configuration


SageMaker AI makes the data channel information in the `InputDataConfig` parameter from your `CreateTrainingJob` request available in the `/opt/ml/input/config/inputdataconfig.json` file in your Docker container.

For example, suppose that you specify three data channels (`train`, `evaluation`, and `validation`) in your request. SageMaker AI provides the following JSON:

```
{
  "train" : {"ContentType":  "trainingContentType",
             "TrainingInputMode": "File",
             "S3DistributionType": "FullyReplicated",
             "RecordWrapperType": "None"},
  "evaluation" : {"ContentType":  "evalContentType",
                  "TrainingInputMode": "File",
                  "S3DistributionType": "FullyReplicated",
                  "RecordWrapperType": "None"},
  "validation" : {"TrainingInputMode": "File",
                  "S3DistributionType": "FullyReplicated",
                  "RecordWrapperType": "None"}
}
```

**Note**  
SageMaker AI provides only relevant information about each data channel (for example, the channel name and the content type) to the container, as shown in the previous example. `S3DistributionType` will be set as `FullyReplicated` if you specify EFS or FSxLustre as input data sources.
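
The following Python sketch shows one way to read this file and list the channels. The function name `load_channels` and the shape of the returned dictionary are hypothetical. The `path` entry assumes the `File` mode directory layout (`/opt/ml/input/data/<channel_name>`) and does not apply to `Pipe` mode.

```python
import json


def load_channels(path="/opt/ml/input/config/inputdataconfig.json",
                  data_root="/opt/ml/input/data"):
    """Map each channel name to its input mode, content type, and directory."""
    with open(path) as f:
        config = json.load(f)
    channels = {}
    for name, settings in config.items():
        channels[name] = {
            "input_mode": settings.get("TrainingInputMode"),
            # ContentType is optional, as the validation channel above shows.
            "content_type": settings.get("ContentType"),
            # Directory used by File mode; Pipe mode uses per-epoch pipes instead.
            "path": f"{data_root}/{name}",
        }
    return channels
```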

## Training Data


The `TrainingInputMode` parameter in the `AlgorithmSpecification` of the [`CreateTrainingJob`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request specifies how the training dataset is made available to your container. The following input modes are available.
+ **`File` mode**

  If you use `File` mode as your `TrainingInputMode` value, SageMaker AI sets the following parameters in your container.
  + Your `TrainingInputMode` parameter is written to `inputdataconfig.json` as "File".
  + Your data channel directory is written to `/opt/ml/input/data/channel_name`.

  If you use `File` mode, SageMaker AI creates a directory for each channel. For example, if you have three channels named `training`, `validation`, and `testing`, SageMaker AI makes the following three directories in your Docker container: 
  + `/opt/ml/input/data/training`
  + `/opt/ml/input/data/validation`
  + `/opt/ml/input/data/testing`

  `File` mode also supports the following data sources.
  + Amazon Simple Storage Service (Amazon S3)
  + Amazon Elastic File System (Amazon EFS)
  + Amazon FSx for Lustre
**Note**  
Channels that use file system data sources such as Amazon EFS and Amazon FSx must use `File` mode. In this case, the directory path provided in the channel is mounted at `/opt/ml/input/data/channel_name`.
+ **`FastFile` mode**

  If you use `FastFile` mode as your `TrainingInputMode` value, SageMaker AI sets the following parameters in your container.
  + Similar to `File` mode, in `FastFile` mode, your `TrainingInputMode` parameter is written to `inputdataconfig.json` as "File".
  + Your data channel directory is written to `/opt/ml/input/data/channel_name`.

  `FastFile` mode supports the following data sources.
  + Amazon S3

  If you use `FastFile` mode, the channel directory is mounted with read-only permission.

  Historically, `File` mode preceded `FastFile` mode. To ensure backward compatibility, algorithms that support `File` mode also work with `FastFile` mode as long as the `TrainingInputMode` parameter is set to `File` in `inputdataconfig.json`.
**Note**  
Channels that use `FastFile` mode must use a `S3DataType` of "S3Prefix".  
`FastFile` mode presents a folder view that uses the forward slash (`/`) as the delimiter for grouping Amazon S3 objects into folders. `S3Uri` prefixes must not correspond to a partial folder name. For example, if an Amazon S3 dataset contains `s3://amzn-s3-demo-bucket/train-01/data.csv`, then neither `s3://amzn-s3-demo-bucket/train` nor `s3://amzn-s3-demo-bucket/train-01` is allowed as an `S3Uri` prefix.  
We recommend a trailing forward slash to define a channel that corresponds to a folder, for example, the `s3://amzn-s3-demo-bucket/train-01/` channel for the `train-01` folder. Without the trailing forward slash, the channel would be ambiguous if another folder `s3://amzn-s3-demo-bucket/train-011/` or file `s3://amzn-s3-demo-bucket/train-01.txt` existed.
+ **`Pipe` mode**
  + `TrainingInputMode` parameter written to `inputdataconfig.json`: "Pipe"
  + Data channel directory in the Docker container: `/opt/ml/input/data/channel_name_epoch_number`
  + Supported data sources: Amazon S3

  You need to read from a separate pipe for each channel. For example, if you have three channels named `training`, `validation`, and `testing`, you need to read from the following pipes:
  + `/opt/ml/input/data/training_0, /opt/ml/input/data/training_1, ...`
  + `/opt/ml/input/data/validation_0, /opt/ml/input/data/validation_1, ...`
  + `/opt/ml/input/data/testing_0, /opt/ml/input/data/testing_1, ...`

  Read the pipes sequentially. For example, if you have a channel called `training`, read the pipes in this sequence: 

  1. Open `/opt/ml/input/data/training_0` in read mode and read it to end-of-file (EOF) or, if you are done with the first epoch, close the pipe file early. 

  1. After closing the first pipe file, look for `/opt/ml/input/data/training_1` and read it until you have completed the second epoch, and so on.

  If the file for a given epoch doesn't exist yet, your code might need to retry until the pipe is created. There is no sequencing restriction across channel types. For example, you can read multiple epochs for the `training` channel and only start reading the `validation` channel when you are ready. Or, you can read them simultaneously if your algorithm requires that.

  For an example of a Jupyter notebook that shows how to use Pipe mode when bringing your own container, see [Bring your own pipe-mode algorithm to Amazon SageMaker AI](https://github.com/aws/amazon-sagemaker-examples/blob/main/advanced_functionality/pipe_bring_your_own/pipe_bring_your_own.ipynb).
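
The sequential reading pattern described for `Pipe` mode can be sketched in Python as follows. The function names, the retry count, and the wait interval are assumptions chosen for illustration, and the sketch treats each epoch as an opaque byte stream; your algorithm decides how to decode the records.

```python
import os
import time


def read_epoch(channel, epoch, data_root="/opt/ml/input/data",
               retries=60, wait_seconds=1.0, chunk_size=1 << 20):
    """Yield chunks of bytes from the pipe for one epoch of one channel."""
    path = os.path.join(data_root, f"{channel}_{epoch}")
    for _ in range(retries):
        if os.path.exists(path):
            break
        time.sleep(wait_seconds)  # the pipe for this epoch may not exist yet
    else:
        raise FileNotFoundError(path)
    with open(path, "rb") as pipe:
        while True:
            chunk = pipe.read(chunk_size)
            if not chunk:  # an empty read means end-of-file
                return
            yield chunk


def read_epochs(channel, num_epochs, **kwargs):
    """Read epochs 0..num_epochs-1 in order, one pipe at a time."""
    for epoch in range(num_epochs):
        # Joining the chunks is only for brevity; stream them in real code.
        yield epoch, b"".join(read_epoch(channel, epoch, **kwargs))
```

Because each pipe is opened only after the previous one reaches end-of-file, the sketch follows the required order for a single channel. Reading several channels at once would need one such loop per channel.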

  

SageMaker AI model training supports high-performance S3 Express One Zone directory buckets as a data input location for `File` mode, `FastFile` mode, and `Pipe` mode. To use S3 Express One Zone, input the location of the S3 Express One Zone directory bucket instead of an Amazon S3 general purpose bucket. Provide the ARN for the IAM role with the required access control and permissions policy. For details, see the [AmazonSageMakerFullAccess policy](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerFullAccess.html). You can only encrypt your SageMaker AI output data in directory buckets with server-side encryption with Amazon S3 managed keys (SSE-S3). Server-side encryption with AWS KMS keys (SSE-KMS) is not currently supported for storing SageMaker AI output data in directory buckets. For more information, see [S3 Express One Zone](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-express-one-zone.html).

## Distributed Training Configuration


If you're performing distributed training with multiple containers, SageMaker AI makes information about all containers available in the `/opt/ml/input/config/resourceconfig.json` file.

To enable inter-container communication, this JSON file contains information for all containers. SageMaker AI makes this file available for both `File` and `Pipe` mode algorithms. The file provides the following information:
+ `current_host`—The name of the current container on the container network. For example, `algo-1`. Host values can change at any time. Don't write code with specific values for this variable.
+ `hosts`—The list of names of all containers on the container network, sorted lexicographically. For example, `["algo-1", "algo-2", "algo-3"]` for a three-node cluster. Containers can use these names to address other containers on the container network. Host values can change at any time. Don't write code with specific values for these variables.
+ `network_interface_name`—The name of the network interface that is exposed to your container. For example, containers running the Message Passing Interface (MPI) can use this information to set the network interface name.
+ Do not use the information in `/etc/hostname` or `/etc/hosts` because it might be inaccurate.
+ Hostname information may not be immediately available to the algorithm container. We recommend adding a retry policy on hostname resolution operations as nodes become available in the cluster.

The following is an example file on node 1 in a three-node cluster:

```
{
    "current_host": "algo-1",
    "hosts": ["algo-1","algo-2","algo-3"],
    "network_interface_name":"eth1"
}
```
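
A container can use this file to work out its own position in the cluster without hard-coding host names, as in the following Python sketch. The function names and the `rank` and `is_leader` fields are hypothetical conventions for this example, not part of the file. The retry helper follows the recommendation above; its attempt count and wait time are assumed values.

```python
import json
import socket
import time


def load_cluster(path="/opt/ml/input/config/resourceconfig.json"):
    """Read the resource configuration and derive this container's position."""
    with open(path) as f:
        config = json.load(f)
    hosts = sorted(config["hosts"])
    current = config["current_host"]
    return {
        "current_host": current,
        "hosts": hosts,
        "rank": hosts.index(current),      # a position, not a hard-coded name
        "is_leader": current == hosts[0],
        "network_interface_name": config["network_interface_name"],
    }


def resolve_with_retry(host, attempts=30, wait_seconds=2.0,
                       resolver=socket.gethostbyname):
    """Resolve a peer host name, retrying while the node is still starting."""
    for attempt in range(attempts):
        try:
            return resolver(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            if attempt == attempts - 1:
                raise
            time.sleep(wait_seconds)
```

Deriving the rank from the sorted `hosts` list gives every container the same ordering, which is usually what a distributed framework needs to pick a coordinator.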

# Run Training with EFA


 SageMaker AI provides integration with [EFA](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html) devices to accelerate High Performance Computing (HPC) and machine learning applications. This integration allows you to leverage an EFA device when running your distributed training jobs. You can add EFA integration to an existing Docker container that you bring to SageMaker AI. The following information outlines how to configure your own container to use an EFA device for your distributed training jobs. 

## Prerequisites


 Your container must satisfy the [SageMaker Training container specification](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-dockerfile.html).  

## Install EFA and required packages


Your container must download and install the [EFA software](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html). This allows your container to recognize the EFA device, and provides compatible versions of Libfabric and Open MPI. 

Any tools like MPI and NCCL must be installed and managed inside the container to be used as part of your EFA-enabled training job. For a list of all available EFA versions, see [Verify the EFA installer using a checksum](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-verify.html). The following example shows how to modify the Dockerfile of your EFA-enabled container to install EFA, MPI, OFI, NCCL, and NCCL-TEST.

**Note**  
When using PyTorch with EFA on your container, the NCCL version of your container should match the NCCL version of your PyTorch installation. To verify the PyTorch NCCL version, run the following Python code:  

```
import torch
print(torch.cuda.nccl.version())
```

```
ARG OPEN_MPI_PATH=/opt/amazon/openmpi/
ENV NCCL_VERSION=2.7.8
ENV EFA_VERSION=1.30.0
ENV BRANCH_OFI=1.1.1

#################################################
## EFA and MPI SETUP
RUN cd $HOME \
  && curl -O https://s3-us-west-2.amazonaws.com/aws-efa-installer/aws-efa-installer-${EFA_VERSION}.tar.gz \
  && tar -xf aws-efa-installer-${EFA_VERSION}.tar.gz \
  && cd aws-efa-installer \
  && ./efa_installer.sh -y --skip-kmod -g

ENV PATH="$OPEN_MPI_PATH/bin:$PATH"
ENV LD_LIBRARY_PATH="$OPEN_MPI_PATH/lib/:$LD_LIBRARY_PATH"

#################################################
## NCCL, OFI, NCCL-TEST SETUP
RUN cd $HOME \
  && git clone https://github.com/NVIDIA/nccl.git -b v${NCCL_VERSION}-1 \
  && cd nccl \
  && make -j64 src.build BUILDDIR=/usr/local

RUN apt-get update && apt-get install -y autoconf
RUN cd $HOME \
  && git clone https://github.com/aws/aws-ofi-nccl.git -b v${BRANCH_OFI} \
  && cd aws-ofi-nccl \
  && ./autogen.sh \
  && ./configure --with-libfabric=/opt/amazon/efa \
       --with-mpi=/opt/amazon/openmpi \
       --with-cuda=/usr/local/cuda \
       --with-nccl=/usr/local --prefix=/usr/local \
  && make && make install
  
RUN cd $HOME \
  && git clone https://github.com/NVIDIA/nccl-tests \
  && cd nccl-tests \
  && make MPI=1 MPI_HOME=/opt/amazon/openmpi CUDA_HOME=/usr/local/cuda NCCL_HOME=/usr/local
```

## Considerations when creating your container


The EFA device is mounted to the container as `/dev/infiniband/uverbs0` under the list of devices accessible to the container. On P4d instances, the container has access to 4 EFA devices. The EFA devices can be found in the list of devices accessible to the container as: 
+  `/dev/infiniband/uverbs0` 
+  `/dev/infiniband/uverbs1` 
+  `/dev/infiniband/uverbs2` 
+  `/dev/infiniband/uverbs3` 

 To get information about the hostname, peer hostnames, and network interface (for MPI) from the `resourceconfig.json` file provided to each container instance, see [Distributed Training Configuration](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-running-container.html#your-algorithms-training-algo-running-container-dist-training). Your container handles regular TCP traffic among peers through the default Elastic Network Interfaces (ENI), while handling OFI (kernel bypass) traffic through the EFA device. 
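
To check from inside the container which EFA device files are visible, you could use a short Python sketch like the following. The function name `list_efa_devices` is hypothetical, and the sketch assumes only the `/dev/infiniband/uverbs*` naming shown above.

```python
import glob
import os


def list_efa_devices(device_dir="/dev/infiniband"):
    """Return the uverbs device paths that are visible to this container."""
    return sorted(glob.glob(os.path.join(device_dir, "uverbs*")))
```

An empty list suggests that the instance type has no EFA device or that the device was not exposed to the container.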

## Verify that your EFA device is recognized


  To verify that the EFA device is recognized, run the following command from within your container. 

```
/opt/amazon/efa/bin/fi_info -p efa
```

Your output should look similar to the following.

```
provider: efa
    fabric: EFA-fe80::e5:56ff:fe34:56a8
    domain: efa_0-rdm
    version: 2.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: EFA-fe80::e5:56ff:fe34:56a8
    domain: efa_0-dgrm
    version: 2.0
    type: FI_EP_DGRAM
    protocol: FI_PROTO_EFA
provider: efa;ofi_rxd
    fabric: EFA-fe80::e5:56ff:fe34:56a8
    domain: efa_0-dgrm
    version: 1.0
    type: FI_EP_RDM
    protocol: FI_PROTO_RXD
```
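
If you want to automate this check, the following Python sketch parses `fi_info` output of the shape shown above. The function names are hypothetical, and treating an `efa` provider with an `FI_EP_RDM` endpoint as the success condition is an assumption for this example. The parser expects the `key: value` layout shown in the sample output.

```python
def parse_fi_info(output):
    """Split fi_info text into one dict per `provider:` block."""
    entries = []
    for line in output.splitlines():
        if not line.strip():
            continue
        # Split on the first ": " so IPv6-style fabric names stay intact.
        key, _, value = line.strip().partition(": ")
        if key == "provider":
            entries.append({})
        if entries:
            entries[-1][key] = value
    return entries


def has_efa_rdm_endpoint(entries):
    """Return True if any block is the efa provider with an RDM endpoint."""
    return any(e.get("provider") == "efa" and e.get("type") == "FI_EP_RDM"
               for e in entries)
```

You could feed the function the captured standard output of `/opt/amazon/efa/bin/fi_info -p efa`, for example from `subprocess.run(..., capture_output=True, text=True)`.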

## Running a training job with EFA


 Once you’ve created an EFA-enabled container, you can run a training job with EFA using a SageMaker AI Estimator the same way as you would with any other Docker image. For more information on registering your container and using it for training, see [Adapting Your Own Training Container](https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html#byoc-training-step5).

# How Amazon SageMaker AI Signals Algorithm Success and Failure

A training algorithm indicates whether it succeeded or failed using the exit code of its process. 

A successful training execution should exit with an exit code of 0, and an unsuccessful training execution should exit with a non-zero exit code. SageMaker AI converts these to `Completed` and `Failed` in the `TrainingJobStatus` returned by `DescribeTrainingJob`. This exit code convention is standard and is easy to implement in any language. For example, in Python, you can call `sys.exit(1)` to signal failure, and running to the end of the main routine causes Python to exit with code 0.

If training fails, the algorithm can write a description of the failure to the failure file. For details, see the next section.

# How Amazon SageMaker AI Processes Training Output

As your algorithm runs in a container, it generates output including the status of the training job and model and output artifacts. Your algorithm should write this information to the following files and directories, which are located under the container's `/opt/ml` directory. Amazon SageMaker AI processes the information contained in these locations as follows:
+ `/opt/ml/model` – Your algorithm should write all final model artifacts to this directory. SageMaker AI copies this data as a single object in compressed tar format to the S3 location that you specified in the `CreateTrainingJob` request. If multiple containers in a single training job write to this directory, they should ensure that no file or directory names clash. SageMaker AI aggregates the result in a TAR file and uploads it to S3 at the end of the training job. 
+ `/opt/ml/output/data` – Your algorithm should write artifacts that you want to store, other than the final model, to this directory. SageMaker AI copies this data as a single object in compressed tar format to the S3 location that you specified in the `CreateTrainingJob` request. If multiple containers in a single training job write to this directory, they should ensure that no file or directory names clash. SageMaker AI aggregates the result in a TAR file and uploads it to S3 at the end of the training job.
+ `/opt/ml/output/failure` – If training fails, after all algorithm output (for example, logging) completes, your algorithm should write the failure description to this file. In a `DescribeTrainingJob` response, SageMaker AI returns the first 1024 characters from this file as `FailureReason`. 
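
The failure file and exit code conventions can be combined in a small wrapper, sketched in Python as follows. The function name `run_training` and the message layout (a one-line summary followed by the traceback) are assumptions for illustration. Putting the summary first is a design choice so that it falls within the first 1024 characters that are returned as `FailureReason`.

```python
import os
import traceback

FAILURE_PATH = "/opt/ml/output/failure"


def run_training(train_fn, failure_path=FAILURE_PATH):
    """Run train_fn and return the process exit code (0 success, 1 failure)."""
    try:
        train_fn()
    except Exception as error:
        os.makedirs(os.path.dirname(failure_path), exist_ok=True)
        with open(failure_path, "w") as f:
            # Short summary first, then the full traceback for the logs.
            f.write(f"{type(error).__name__}: {error}\n")
            f.write(traceback.format_exc())
        return 1
    return 0
```

In an entrypoint script, you would typically finish with `sys.exit(run_training(main))`, where `main` is your own training function.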

You can specify either an S3 general purpose or S3 directory bucket to store your training output. Directory buckets use only the Amazon S3 Express One Zone storage class, which is designed for workloads or performance-critical applications that require consistent single-digit millisecond latency. Choose the bucket type that best fits your application and performance requirements. For more information on S3 directory buckets, see [Directory buckets](https://docs.aws.amazon.com/AmazonS3/latest/userguide/directory-buckets-overview.html) in the *Amazon Simple Storage Service User Guide*. 

**Note**  
You can only encrypt your SageMaker AI output data in S3 directory buckets with server-side encryption with Amazon S3 managed keys (SSE-S3). Server-side encryption with AWS KMS keys (SSE-KMS) isn't currently supported for storing SageMaker AI output data in directory buckets.

# Containers with custom inference code


You can use Amazon SageMaker AI to interact with Docker containers and run your own inference code in one of two ways:
+ To use your own inference code with a persistent endpoint to get one prediction at a time, use SageMaker AI hosting services.
+ To use your own inference code to get predictions for an entire dataset, use SageMaker AI batch transform.

**Topics**
+ [Custom Inference Code with Hosting Services](your-algorithms-inference-code.md)
+ [Custom Inference Code with Batch Transform](your-algorithms-batch-code.md)

# Custom Inference Code with Hosting Services


This section explains how Amazon SageMaker AI interacts with a Docker container that runs your own inference code for hosting services. Use this information to write inference code and create a Docker image. 

**Topics**
+ [How SageMaker AI Runs Your Inference Image](#your-algorithms-inference-code-run-image)
+ [How SageMaker AI Loads Your Model Artifacts](#your-algorithms-inference-code-load-artifacts)
+ [How Your Container Should Respond to Inference Requests](#your-algorithms-inference-code-container-response)
+ [How Your Container Should Respond to Health Check (Ping) Requests](#your-algorithms-inference-algo-ping-requests)
+ [Container Contract to Support Bidirectional Streaming Capabilities](#your-algorithms-inference-algo-bidi)
+ [Use a Private Docker Registry for Real-Time Inference Containers](your-algorithms-containers-inference-private.md)

## How SageMaker AI Runs Your Inference Image


To configure a container to run as an executable, use an `ENTRYPOINT` instruction in a Dockerfile. Note the following: 
+ For model inference, SageMaker AI runs the container as:

  ```
  docker run image serve
  ```

  SageMaker AI overrides default `CMD` statements in a container by specifying the `serve` argument after the image name. The `serve` argument overrides arguments that you provide with the `CMD` command in the Dockerfile.

   
+ SageMaker AI expects all containers to run with root users. Create your container so that it uses only root users. When SageMaker AI runs your container, users that do not have root-level access can cause permissions issues.

   
+ We recommend that you use the `exec` form of the `ENTRYPOINT` instruction:

  ```
  ENTRYPOINT ["executable", "param1", "param2"]
  ```

  For example:

  ```
  ENTRYPOINT ["python", "k_means_inference.py"]
  ```

  The `exec` form of the `ENTRYPOINT` instruction starts the executable directly, not as a child of `/bin/sh`. This enables it to receive signals like `SIGTERM` and `SIGKILL` from the SageMaker API operations, which is a requirement. 

   

  For example, when you use the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpoint.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpoint.html) API to create an endpoint, SageMaker AI provisions the number of ML compute instances required by the endpoint configuration, which you specify in the request. SageMaker AI runs the Docker container on those instances. 

   

  If you reduce the number of instances backing the endpoint (by calling the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpointWeightsAndCapacities.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpointWeightsAndCapacities.html) API), SageMaker AI runs a command to stop the Docker container on the instances that are being terminated. The command sends the `SIGTERM` signal, then it sends the `SIGKILL` signal 30 seconds later.

   

  If you update the endpoint (by calling the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpoint.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpoint.html) API), SageMaker AI launches another set of ML compute instances and runs the Docker containers that contain your inference code on them. Then it runs a command to stop the previous Docker containers. To stop a Docker container, the command sends the `SIGTERM` signal, then it sends the `SIGKILL` signal 30 seconds later. 

   
+ SageMaker AI uses the container definition that you provided in your [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html) request to set environment variables and the DNS hostname for the container as follows:

   
  + It sets environment variables using the `ContainerDefinition.Environment` string-to-string map.
  + It sets the DNS hostname using the `ContainerDefinition.ContainerHostname`.

     
+ If you plan to use GPU devices for model inferences (by specifying GPU-based ML compute instances in your `CreateEndpointConfig` request), make sure that your containers are `nvidia-docker` compatible. Don't bundle NVIDIA drivers with the image. For more information about `nvidia-docker`, see [NVIDIA/nvidia-docker](https://github.com/NVIDIA/nvidia-docker). 

   
+ You can't use the `tini` initializer as your entry point in SageMaker AI containers because it gets confused by the `train` and `serve` arguments.
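
The shutdown sequence described above (a `SIGTERM` followed by a `SIGKILL` 30 seconds later) can be sketched in Python as follows. This is a minimal illustration under assumptions, not SageMaker AI code: the names `shutdown_requested`, `handle_sigterm`, and `install_signal_handlers` are hypothetical, and the sketch assumes the Python process is started directly by an `exec`-form `ENTRYPOINT` so that it receives the signal.

```python
import signal
import threading

# Set when SIGTERM arrives; serving loops can poll it to shut down cleanly.
shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Record the shutdown request. SIGKILL cannot be caught, so act on SIGTERM."""
    shutdown_requested.set()


def install_signal_handlers():
    """Register the SIGTERM handler. Must be called from the main thread."""
    signal.signal(signal.SIGTERM, handle_sigterm)
```

A server could call `install_signal_handlers()` at startup and check `shutdown_requested.is_set()` between requests, so that it finishes in-flight work and exits before the `SIGKILL` arrives.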

  

## How SageMaker AI Loads Your Model Artifacts


In your [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html) API request, you can use either the `ModelDataUrl` or `S3DataSource` parameter to identify the S3 location where model artifacts are stored. SageMaker AI copies your model artifacts from the S3 location to the `/opt/ml/model` directory for use by your inference code. Your container has read-only access to `/opt/ml/model`. Do not write to this directory.

The `ModelDataUrl` must point to a tar.gz file. Otherwise, SageMaker AI won't download the file. 

If you trained your model in SageMaker AI, the model artifacts are saved as a single compressed tar file in Amazon S3. If you trained your model outside SageMaker AI, you need to create this single compressed tar file and save it in an S3 location. SageMaker AI decompresses this tar file into the `/opt/ml/model` directory before your container starts.
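
For a model trained outside SageMaker AI, the compressed tar file can be created with the Python standard library. The following is a minimal sketch: `make_model_archive` is a hypothetical helper, and placing the files at the archive root is an assumption about the layout that your inference code expects. Uploading the archive to Amazon S3 is not shown.

```python
import tarfile
from pathlib import Path


def make_model_archive(model_dir, archive_path):
    """Pack every file under model_dir into a gzip-compressed tar archive.

    Files are stored relative to model_dir, so they sit at the archive root.
    """
    model_dir = Path(model_dir)
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in sorted(model_dir.rglob("*")):
            if path.is_file():
                archive.add(path, arcname=str(path.relative_to(model_dir)))
    return archive_path
```

Write the archive to a location outside `model_dir`, so that the archive does not include itself.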

For deploying large models, we recommend that you follow [Deploying uncompressed models](large-model-inference-uncompressed.md).

## How Your Container Should Respond to Inference Requests


To obtain inferences, the client application sends a POST request to the SageMaker AI endpoint. SageMaker AI passes the request to the container, and returns the inference result from the container to the client.

For more information about the inference requests that your container will receive, see the following actions in the *Amazon SageMaker AI API Reference*:
+ [ InvokeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html)
+ [ InvokeEndpointAsync](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpointAsync.html)
+ [ InvokeEndpointWithResponseStream](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpointWithResponseStream.html)
+ [ InvokeEndpointWithBidirectionalStream](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpointWithBidirectionalStream.html)

**Requirements for inference containers**

To respond to inference requests, your container must meet the following requirements:
+ SageMaker AI strips all `POST` headers except those supported by `InvokeEndpoint`. SageMaker AI might add additional headers. Inference containers must be able to safely ignore these additional headers.
+ To receive inference requests, the container must have a web server listening on port 8080 that accepts `POST` requests to the `/invocations` endpoint and `GET` requests to the `/ping` endpoint. 
+ A customer's model containers must accept socket connection requests within 250 ms.
+ A customer's model containers must respond to requests within 60 seconds. The model itself can take a maximum of 60 seconds of processing time before responding to `/invocations` requests. If your model needs 50-60 seconds of processing time, set the SDK socket timeout to 70 seconds.
+ A customer's model container that supports bidirectional streaming must:
  + Support WebSocket connections on port 8080 to `/invocations-bidirectional-stream` by default.
  + Have a web server listening on port 8080 that accepts `GET` requests to the `/ping` endpoint.
  + Respond with a Pong Frame, per [RFC6455](https://datatracker.ietf.org/doc/html/rfc6455#section-5.5.3), to each WebSocket Ping Frame that it receives, in addition to passing container health checks over HTTP.
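
The HTTP requirements above can be illustrated with a small server built only on the Python standard library. This is a sketch under assumptions, not a production server: `predict`, `InferenceHandler`, and `make_server` are hypothetical names, `predict` only echoes the request body, and health checks are handled as `GET` requests to `/ping`. The bidirectional streaming requirements are not covered here.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def predict(payload: bytes) -> bytes:
    """Hypothetical stand-in for model inference: echoes the request body."""
    return payload


class InferenceHandler(BaseHTTPRequestHandler):
    def _send(self, status: int, body: bytes = b"") -> None:
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        # Health check: HTTP 200 with an empty body means "ready".
        self._send(200 if self.path == "/ping" else 404)

    def do_POST(self):
        if self.path != "/invocations":
            self._send(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        self._send(200, predict(self.rfile.read(length)))

    def log_message(self, format, *args):
        pass  # Keep the sketch quiet; a real container would log requests.


def make_server(port: int = 8080) -> ThreadingHTTPServer:
    """Create a server bound to all interfaces on the given port."""
    return ThreadingHTTPServer(("0.0.0.0", port), InferenceHandler)
```

Calling `make_server().serve_forever()` starts the server on port 8080. Headers that the handler does not read are simply ignored, which matches the requirement to tolerate additional headers.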

**Example invocation functions**  
The following examples demonstrate how the code in your container can process inference requests. These examples handle requests that client applications send by using the InvokeEndpoint action.  
FastAPI is a web framework for building APIs with Python.  

```
from fastapi import FastAPI, status, Request, Response
. . .
app = FastAPI()
. . .
@app.post('/invocations')
async def invocations(request: Request):
    # model() is a hypothetical function that gets the inference output:
    model_resp = await model(request)

    response = Response(
        content=model_resp,
        status_code=status.HTTP_200_OK,
        media_type="text/plain",
    )
    return response
. . .
```
In this example, the `invocations` function handles the inference request that SageMaker AI sends to the `/invocations` endpoint.
Flask is a framework for developing web applications with Python.  

```
import flask
. . .
app = flask.Flask(__name__)
. . .
@app.route('/invocations', methods=["POST"])
def invoke():
    # Flask exposes the current request as flask.request instead of passing it as an argument.
    # model() is a hypothetical function that gets the inference output:
    resp_body = model(flask.request)
    return flask.Response(resp_body, mimetype='text/plain')
```
In this example, the `invoke` function handles the inference request that SageMaker AI sends to the `/invocations` endpoint.

**Example invocation functions for streaming requests**  
The following examples demonstrate how the code in your inference container can process streaming inference requests. These examples handle requests that client applications send by using the InvokeEndpointWithResponseStream action.  
When a container handles a streaming inference request, it returns the model's inference as a series of parts incrementally as the model generates them. Client applications start receiving responses immediately when they're available. They don't need to wait for the model to generate the entire response. You can implement streaming to support fast interactive experiences, such as chatbots, virtual assistants, and music generators.  
FastAPI is a web framework for building APIs with Python.  

```
from starlette.responses import StreamingResponse
from fastapi import FastAPI, status, Request
. . .
app = FastAPI()
. . .
@app.post('/invocations')
async def invocations(request: Request):
    # Streams inference response using HTTP chunked encoding
    async def generate():
        # model() is a hypothetical function that gets the inference output:
        yield await model(request)
        yield "\n"

    response = StreamingResponse(
        content=generate(),
        status_code=status.HTTP_200_OK,
        media_type="text/plain",
    )
    return response
. . .
```
In this example, the `invocations` function handles the inference request that SageMaker AI sends to the `/invocations` endpoint. To stream the response, the example uses the `StreamingResponse` class from the Starlette framework.
Flask is a framework for developing web applications with Python.  

```
import flask
. . .
app = flask.Flask(__name__)
. . .
@app.route('/invocations', methods=["POST"])
def invocations():
    # Streams inference response using HTTP chunked encoding

    def generate():
        # model() is a hypothetical function that gets the inference output.
        # Flask exposes the current request as flask.request:
        yield model(flask.request)
        yield "\n"
    return flask.Response(
        flask.stream_with_context(generate()), mimetype='text/plain')
. . .
```
In this example, the `invocations` function handles the inference request that SageMaker AI sends to the `/invocations` endpoint. To stream the response, the example uses the `flask.stream_with_context` function from the Flask framework.

**Example invocation functions for bidirectional streaming**  
The following examples demonstrate how the code in your container can process streaming inference requests and responses. These examples handle streaming requests that client applications send by using the InvokeEndpointWithBidirectionalStream action.  
A container with bidirectional streaming capability handles streaming inference requests where parts are incrementally generated at the client and streamed to the container. It returns the model's inference back to the client as a series of parts as the model generates them. Client applications start receiving responses immediately when they're available. They don't need to wait for the request to be fully generated at the client or for the model to generate the entire response. You can implement bidirectional streaming to support fast interactive experiences, such as chatbots, interactive voice AI assistants, and real-time translation.  
FastAPI is a web framework for building APIs with Python.  

```
import sys
import asyncio
import json
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from fastapi.responses import JSONResponse
import uvicorn

app = FastAPI()
...
@app.websocket("/invocations-bidirectional-stream")
async def websocket_invoke(websocket: WebSocket):
    """
    WebSocket endpoint with RFC 6455 ping/pong and fragmentation support
    
    Handles:
    - Text messages (JSON) - including fragmented frames
    - Binary messages - including fragmented frames
    - Ping frames (automatically responds with pong)
    - Pong frames (logs receipt)
    - Fragmented frames per RFC 6455 Section 5.4
    """
    # manager is assumed to be a connection manager object defined in the elided code above
    await manager.connect(websocket)
    
    # Fragment reassembly buffers per RFC 6455 Section 5.4
    text_fragments = []
    binary_fragments = []
    
    while True:
        # Use receive() to handle all WebSocket frame types
        message = await websocket.receive()
        print(f"Received message: {message}")
        if message["type"] == "websocket.receive":
            if "text" in message:
                # Handle text frames (including fragments)
                text_data = message["text"]
                more_body = message.get("more_body", False)
                
                if more_body:
                    # This is a fragment, accumulate it
                    text_fragments.append(text_data)
                    print(f"Received text fragment: {len(text_data)} chars (more coming)")
                else:
                    # This is the final frame or a complete message
                    if text_fragments:
                        # Reassemble fragmented message
                        text_fragments.append(text_data)
                        complete_text = "".join(text_fragments)
                        text_fragments.clear()
                        print(f"Reassembled fragmented text message: {len(complete_text)} chars total")
                        await handle_text_message(websocket, complete_text)
                    else:
                        # Complete message in single frame
                        await handle_text_message(websocket, text_data)
                
            elif "bytes" in message:
                # Handle binary frames (including fragments)
                binary_data = message["bytes"]
                more_body = message.get("more_body", False)
                
                if more_body:
                    # This is a fragment, accumulate it
                    binary_fragments.append(binary_data)
                    print(f"Received binary fragment: {len(binary_data)} bytes (more coming)")
                else:
                    # This is the final frame or a complete message
                    if binary_fragments:
                        # Reassemble fragmented message
                        binary_fragments.append(binary_data)
                        complete_binary = b"".join(binary_fragments)
                        binary_fragments.clear()
                        print(f"Reassembled fragmented binary message: {len(complete_binary)} bytes total")
                        await handle_binary_message(websocket, complete_binary)
                    else:
                        # Complete message in single frame
                        await handle_binary_message(websocket, binary_data)
                
        elif message["type"] == "websocket.ping":
            # Handle ping frames - RFC 6455 Section 5.5.2
            ping_data = message.get("bytes", b"")
            print(f"Received PING frame with payload: {ping_data}")
            # FastAPI automatically sends pong response
            
        elif message["type"] == "websocket.pong":
            # Handle pong frames
            pong_data = message.get("bytes", b"")
            print(f"Received PONG frame with payload: {pong_data}")
            
        elif message["type"] == "websocket.close":
            # Handle close frames - RFC 6455 Section 5.5.1
            close_code = message.get("code", 1000)
            close_reason = message.get("reason", "")
            print(f"Received CLOSE frame - Code: {close_code}, Reason: '{close_reason}'")
            
            # Send close frame response if not already closing
            try:
                await websocket.close(code=close_code, reason=close_reason)
                print(f"Sent CLOSE frame response - Code: {close_code}")
            except Exception as e:
                print(f"Error sending close frame: {e}")
            break
            
        elif message["type"] == "websocket.disconnect":
            print("Client initiated disconnect")
            break

        else:
            print(f"Received unknown message type: {message['type']}")
            break

                        
async def handle_binary_message(websocket: WebSocket, binary_data: bytes):
    """Handle incoming binary messages (complete or reassembled from fragments)"""
    print(f"Processing complete binary message: {len(binary_data)} bytes")
    
    try:
        # Echo back the binary data
        await websocket.send_bytes(binary_data)
    except Exception as e:
        print(f"Error handling binary message: {e}")

async def handle_text_message(websocket: WebSocket, data: str):
    """Handle incoming text messages"""
    try:
        # Send response back to the same client
        await manager.send_personal_message(data, websocket)
    except Exception as e:
        print(f"Error handling text message: {e}")

def main():
    if len(sys.argv) > 1 and sys.argv[1] == "serve":
        print("Starting server on port 8080...")
        uvicorn.run(app, host="0.0.0.0", port=8080)
    else:
        print("Usage: python app.py serve")
        sys.exit(1)

if __name__ == "__main__":
    main()
```
In this example, the `websocket_invoke` function handles the inference request that SageMaker AI sends to the `/invocations-bidirectional-stream` endpoint. It shows how to handle streamed requests and stream responses back to the client.
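
The example calls `manager.connect` and `manager.send_personal_message`, but the definition of `manager` is elided. One possible shape for it is sketched below. This is an assumption for illustration only: `ConnectionManager` is a hypothetical class, and it relies only on the `accept` and `send_text` methods of the WebSocket object.

```python
class ConnectionManager:
    """Hypothetical helper that tracks open WebSocket connections."""

    def __init__(self):
        self.active_connections = []

    async def connect(self, websocket):
        # Complete the WebSocket handshake, then remember the connection.
        await websocket.accept()
        self.active_connections.append(websocket)

    def disconnect(self, websocket):
        # Forget the connection once the client has gone away.
        if websocket in self.active_connections:
            self.active_connections.remove(websocket)

    async def send_personal_message(self, message, websocket):
        # Reply only to the client that sent the request.
        await websocket.send_text(message)


manager = ConnectionManager()
```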

## How Your Container Should Respond to Health Check (Ping) Requests


SageMaker AI launches new inference containers in the following situations:
+ Responding to `CreateEndpoint`, `UpdateEndpoint`, and `UpdateEndpointWeightsAndCapacities` API calls
+ Security patching
+ Replacing unhealthy instances

Soon after container startup, SageMaker AI starts sending periodic GET requests to the `/ping` endpoint.

The simplest requirement on the container is to respond with an HTTP 200 status code and an empty body. This indicates to SageMaker AI that the container is ready to accept inference requests at the `/invocations` endpoint.

If the container does not begin to pass health checks by consistently responding with 200s during the 8 minutes after startup, the new instance launch fails. This causes `CreateEndpoint` to fail, leaving the endpoint in a failed state. The update requested by `UpdateEndpoint` isn't completed, security patches aren't applied, and unhealthy instances aren't replaced.

While the minimum bar is for the container to return a static 200, a container developer can use this functionality to perform deeper checks. The request timeout on `/ping` attempts is 2 seconds.
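
A deeper check could, for example, report readiness only after the model has finished loading. The following sketch shows one way to keep that check fast enough for the 2-second timeout. `HealthState` is a hypothetical class, and returning 503 before the model is ready is an assumption of this sketch. The text above only specifies the 200 response.

```python
import threading


class HealthState:
    """Tracks whether the container is ready to serve inference requests."""

    def __init__(self):
        self._ready = threading.Event()

    def mark_ready(self):
        # Call once the model has finished loading.
        self._ready.set()

    def ping(self):
        """Return (status_code, body) for a /ping request without blocking."""
        if self._ready.is_set():
            return 200, b""
        return 503, b""
```

Because `ping` only reads a flag, slow work such as loading the model happens elsewhere and never delays the health check response.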

Additionally, a container that is capable of handling bidirectional streaming requests must respond with a Pong Frame (per WebSocket protocol [RFC6455](https://datatracker.ietf.org/doc/html/rfc6455#section-5.5.3)) to a Ping Frame. If no Pong Frame is received for 5 consecutive Pings, the SageMaker AI platform closes the connection to the container. The SageMaker AI platform also responds to Ping Frames from the model container with Pong Frames.

## Container Contract to Support Bidirectional Streaming Capabilities


If you want to host your model container as a SageMaker AI endpoint that supports bidirectional streaming capabilities, the model container must support the following contract:

**1. Bidirectional Docker Label**

The model container should have a Docker label indicating to the SageMaker AI platform that bidirectional streaming capability is supported on this container.

```
com.amazonaws.sagemaker.capabilities.bidirectional-streaming=true
```

**2. Support WebSocket Connection for invocations**

A customer's model container that supports bidirectional streaming must support WebSocket connections on port 8080 to `/invocations-bidirectional-stream` by default. 

You can override this path by passing the `X-Amzn-SageMaker-Model-Invocation-Path` header when you call the `InvokeEndpointWithBidirectionalStream` API. You can also specify a query string to append to this path by passing the `X-Amzn-SageMaker-Model-Query-String` header.

**3. Request Stream Handling**

The `InvokeEndpointWithBidirectionalStream` API input payloads are streamed in as a series of PayloadParts, each of which wraps a binary chunk (“Bytes”: ***<Blob>***):

```
{
   "PayloadPart": {
      "Bytes": <Blob>,
      "DataType": <String: UTF8 | BINARY>,
      "CompletionState": <String: PARTIAL | COMPLETE>,
      "P": <String>
   }
}
```

**3.1. Data Frames**

SageMaker AI passes the input PayloadParts to the model container as WebSocket Data Frames ([RFC6455-Section-5.6](https://datatracker.ietf.org/doc/html/rfc6455#section-5.6)).

1. SageMaker AI does not inspect the binary chunk.

1. On receiving an input PayloadPart:
   + SageMaker AI creates exactly one WebSocket Data Frame from `PayloadPart.Bytes`, then passes it to the model container.
   + If `PayloadPart.DataType = UTF8`, SageMaker AI creates a Text Data Frame.
   + If `PayloadPart.DataType` is not present or `PayloadPart.DataType = BINARY`, SageMaker AI creates a Binary Data Frame.

1. For a sequence of PayloadParts with `PayloadPart.CompletionState = PARTIAL` that is terminated by a PayloadPart with `PayloadPart.CompletionState = COMPLETE`, SageMaker AI translates the sequence into a WebSocket fragmented message ([RFC6455-Section-5.4: Fragmentation](https://datatracker.ietf.org/doc/html/rfc6455#section-5.4)):
   + The initial PayloadPart with `PayloadPart.CompletionState = PARTIAL` is translated into a WebSocket Data Frame with the FIN bit clear.
   + The subsequent PayloadParts with `PayloadPart.CompletionState = PARTIAL` are translated into WebSocket Continuation Frames with the FIN bit clear.
   + The final PayloadPart with `PayloadPart.CompletionState = COMPLETE` is translated into a WebSocket Continuation Frame with the FIN bit set.

1. SageMaker AI does not encode or decode the binary chunk from the input PayloadPart. The bytes are passed to the model container as-is.

1. SageMaker AI does not combine multiple input PayloadParts into one Binary Data Frame.

1. SageMaker AI does not chunk one input PayloadPart into multiple Binary Data Frames.

**Example: Fragmented Message Flow**

```
Client sends:
PayloadPart 1: {Bytes: "Hello ", DataType: "UTF8", CompletionState: "PARTIAL"}
PayloadPart 2: {Bytes: "World", DataType: "UTF8", CompletionState: "COMPLETE"}

Container receives:
Frame 1: Text Data Frame with "Hello " (FIN=0)
Frame 2: Continuation Frame with "World" (FIN=1)
```
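
The translation rules in this section can be expressed as a small pure function, which may help when you write tests for a container. This is an illustrative sketch, not SageMaker AI code. `payload_parts_to_frames` is a hypothetical name, PayloadParts are modeled as plain dictionaries, and two behaviors are assumptions: a missing `CompletionState` is treated as `COMPLETE`, and a `COMPLETE` part that does not follow a `PARTIAL` part becomes a single unfragmented frame.

```python
def payload_parts_to_frames(parts):
    """Map PayloadPart dicts to (frame_type, fin, payload) tuples."""
    frames = []
    in_fragmented_message = False
    for part in parts:
        # Assumption: a part without CompletionState counts as COMPLETE.
        is_final = part.get("CompletionState", "COMPLETE") == "COMPLETE"
        if in_fragmented_message:
            frame_type = "continuation"
        elif part.get("DataType") == "UTF8":
            frame_type = "text"
        else:
            # DataType absent or BINARY maps to a Binary Data Frame.
            frame_type = "binary"
        frames.append((frame_type, is_final, part["Bytes"]))
        # A PARTIAL part opens or continues a fragmented message;
        # a COMPLETE part ends it.
        in_fragmented_message = not is_final
    return frames
```

Each input part produces exactly one frame, which mirrors the rule that PayloadParts are neither combined nor split.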

**3.2. Control Frames**

Besides Data Frames, SageMaker AI also sends Control Frames to model container ([RFC6455-Section-5.5](https://datatracker.ietf.org/doc/html/rfc6455#section-5.5)):

1. Close Frame: SageMaker AI might send a Close Frame ([RFC6455-Section-5.5.1](https://datatracker.ietf.org/doc/html/rfc6455#section-5.5.1)) to the model container if the connection is closed for any reason.

1. Ping Frame: SageMaker AI sends a Ping Frame ([RFC6455-Section-5.5.2](https://datatracker.ietf.org/doc/html/rfc6455#section-5.5.2)) once every 60 seconds. The model container must respond with a Pong Frame ([RFC6455-Section-5.5.3](https://datatracker.ietf.org/doc/html/rfc6455#section-5.5.3)). If no Pong Frame is received for 5 consecutive Pings, SageMaker AI closes the connection.

1. Pong Frame: SageMaker AI responds to Ping Frames from the model container with Pong Frames.
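
The rule of 5 consecutive missed Pongs can be modeled with a simple counter, which shows that a single Pong resets the count. This is only an illustration of the rule as stated above, not the SageMaker AI implementation. `PongTracker` and its method names are hypothetical.

```python
class PongTracker:
    """Counts consecutive Ping Frames that received no Pong Frame."""

    def __init__(self, max_missed=5):
        self.max_missed = max_missed
        self.missed = 0

    def on_ping_timeout(self):
        """Record a Ping with no Pong; return True if the connection should close."""
        self.missed += 1
        return self.missed >= self.max_missed

    def on_pong(self):
        # Any Pong resets the run of consecutive misses.
        self.missed = 0
```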

**4. Response Stream Handling**

The output is streamed out as a series of PayloadParts, ModelStreamErrors, or InternalStreamFailures.

```
{
   "PayloadPart": {
      "Bytes": <Blob>,
      "DataType": <String: UTF8 | BINARY>,
      "CompletionState": <String: PARTIAL | COMPLETE>
   },
   "ModelStreamError": {
      "ErrorCode": <String>,
      "Message": <String>
   },
   "InternalStreamFailure": {
      "Message": <String>
   }
}
```

**4.1. Data Frames**

SageMaker AI converts Data Frames received from the model container into output PayloadParts:

1. On receiving a WebSocket Text Data Frame from the model container, SageMaker AI gets the raw bytes from the Text Data Frame, wraps them into a response PayloadPart, and sets `PayloadPart.DataType = UTF8`.

1. On receiving a WebSocket Binary Data Frame from the model container, SageMaker AI directly wraps the bytes from the data frame into a response PayloadPart and sets `PayloadPart.DataType = BINARY`.

1. For a fragmented message as defined in [RFC6455-Section-5.4: Fragmentation](https://datatracker.ietf.org/doc/html/rfc6455#section-5.4):
   + The initial Data Frame with the FIN bit clear is translated into a PayloadPart with `PayloadPart.CompletionState = PARTIAL`.
   + The subsequent Continuation Frames with the FIN bit clear are translated into PayloadParts with `PayloadPart.CompletionState = PARTIAL`.
   + The final Continuation Frame with the FIN bit set is translated into a PayloadPart with `PayloadPart.CompletionState = COMPLETE`.

1. SageMaker AI does not encode or decode the bytes received from model containers. The bytes are passed to the client as-is.

1. SageMaker AI does not combine multiple Data Frames received from the model container into one response PayloadPart.

1. SageMaker AI does not chunk a Data Frame received from the model container into multiple response PayloadParts.

**Example: Streaming Response Flow**

```
Container sends:
Frame 1: Text Data Frame with "Generating" (FIN=0)
Frame 2: Continuation Frame with " response..." (FIN=1)

Client receives:
PayloadPart 1: {Bytes: "Generating", DataType: "UTF8", CompletionState: "PARTIAL"}
PayloadPart 2: {Bytes: " response...", DataType: "UTF8", CompletionState: "COMPLETE"}
```
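
On the client side, the response PayloadParts can be reassembled into complete messages by buffering until a part with `CompletionState = COMPLETE` arrives. The following is a sketch under assumptions: `assemble_messages` is a hypothetical helper, the parts are modeled as dictionaries with `Bytes` holding a `bytes` object, a missing `CompletionState` is treated as `COMPLETE`, and a missing `DataType` is reported as `BINARY`. Reading the parts from the SDK response stream is not shown.

```python
def assemble_messages(payload_parts):
    """Yield (data_type, payload) once per complete message."""
    buffer = []
    for part in payload_parts:
        buffer.append(part["Bytes"])
        # Assumption: a part without CompletionState counts as COMPLETE.
        if part.get("CompletionState", "COMPLETE") == "COMPLETE":
            yield part.get("DataType", "BINARY"), b"".join(buffer)
            buffer = []
```

Because the function is a generator, a client can act on each complete message as soon as its final part arrives instead of waiting for the stream to end.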

**4.2. Control Frames**

SageMaker AI responds to the following Control Frames from the model container:

1. On receiving a Close Frame ([RFC6455-Section-5.5.1](https://datatracker.ietf.org/doc/html/rfc6455#section-5.5.1)) from the model container, SageMaker AI wraps the status code ([RFC6455-Section-7.4](https://datatracker.ietf.org/doc/html/rfc6455#section-7.4)) and failure messages into a ModelStreamError, and streams it back to the end user.

1. On receiving a Ping Frame ([RFC6455-Section-5.5.2](https://datatracker.ietf.org/doc/html/rfc6455#section-5.5.2)) from the model container, SageMaker AI responds with a Pong Frame.

1. Pong Frame ([RFC6455-Section-5.5.3](https://datatracker.ietf.org/doc/html/rfc6455#section-5.5.3)): If no Pong Frame is received for 5 consecutive Pings, SageMaker AI closes the connection.

# Use a Private Docker Registry for Real-Time Inference Containers
Private Docker Registry for Inference

Amazon SageMaker AI hosting enables you to use images stored in Amazon ECR to build your containers for real-time inference by default. Optionally, you can build containers for real-time inference from images in a private Docker registry. The private registry must be accessible from an Amazon VPC in your account. Models that you create based on the images stored in your private Docker registry must be configured to connect to the same VPC where the private Docker registry is accessible. For information about connecting your model to a VPC, see [Give SageMaker AI Hosted Endpoints Access to Resources in Your Amazon VPC](host-vpc.md).

Your Docker registry must be secured with a TLS certificate from a known public certificate authority (CA).

**Note**  
Your private Docker registry must allow inbound traffic from the security groups you specify in the VPC configuration for your model, so that SageMaker AI hosting is able to pull model images from your registry.  
SageMaker AI can pull model images from DockerHub if there's a path to the open internet inside your VPC.

**Topics**
+ [

## Store Images in a Private Docker Registry other than Amazon Elastic Container Registry
](#your-algorithms-containers-inference-private-registry)
+ [

## Use an Image from a Private Docker Registry for Real-time Inference
](#your-algorithms-containers-inference-private-use)
+ [

## Allow SageMaker AI to authenticate to a private Docker registry
](#inference-private-docker-authenticate)
+ [

## Create the Lambda function
](#inference-private-docker-lambda)
+ [

## Give your execution role permission to Lambda
](#inference-private-docker-perms)
+ [

## Create an interface VPC endpoint for Lambda
](#inference-private-docker-vpc-interface)

## Store Images in a Private Docker Registry other than Amazon Elastic Container Registry


To use a private Docker registry to store your images for SageMaker AI real-time inference, create a private registry that is accessible from your Amazon VPC. For information about creating a Docker registry, see [Deploy a registry server](https://docs.docker.com/registry/deploying/) in the Docker documentation. The Docker registry must comply with the following:
+ The registry must be a [Docker Registry HTTP API V2](https://docs.docker.com/registry/spec/api/) registry.
+ The Docker registry must be accessible from the same VPC that you specify in the `VpcConfig` parameter when you create your model.

## Use an Image from a Private Docker Registry for Real-time Inference


When you create a model and deploy it to SageMaker AI hosting, you can specify that it use an image from your private Docker registry to build the inference container. Specify this in the `ImageConfig` object in the `PrimaryContainer` parameter that you pass to a call to the [create\_model](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_model) function.

**To use an image stored in your private Docker registry for your inference container**

1. Create the image configuration object and specify a value of `Vpc` for the `RepositoryAccessMode` field.

   ```
   image_config = {
       'RepositoryAccessMode': 'Vpc'
   }
   ```

1. If your private Docker registry requires authentication, add a `RepositoryAuthConfig` object to the image configuration object. For the `RepositoryCredentialsProviderArn` field of the `RepositoryAuthConfig` object, specify the Amazon Resource Name (ARN) of an AWS Lambda function that provides credentials that allow SageMaker AI to authenticate to your private Docker registry. For information about how to create the Lambda function to provide authentication, see [Allow SageMaker AI to authenticate to a private Docker registry](#inference-private-docker-authenticate).

   ```
   image_config = {
       'RepositoryAccessMode': 'Vpc',
       'RepositoryAuthConfig': {
           'RepositoryCredentialsProviderArn': 'arn:aws:lambda:Region:Acct:function:FunctionName'
       }
   }
   ```

1. Create the primary container object that you want to pass to `create_model`, using the image configuration object that you created in the previous step. 

   Provide your image in [digest](https://docs.docker.com/engine/reference/commandline/pull/#pull-an-image-by-digest-immutable-identifier) form. If you provide your image using the `:latest` tag, there is a risk that SageMaker AI pulls a newer version of the image than intended. Using the digest form ensures that SageMaker AI pulls the intended image version.

   ```
   primary_container = {
       'ContainerHostname': 'ModelContainer',
       'Image': 'myteam.myorg.com/docker-local/my-inference-image@sha256:<IMAGE-DIGEST>',
       'ImageConfig': image_config
   }
   ```

1. Specify the model name and the execution role that you want to pass to `create_model`.

   ```
   model_name = 'vpc-model'
   execution_role_arn = 'arn:aws:iam::123456789012:role/SageMakerExecutionRole'
   ```

1. Specify one or more security groups and subnets for the VPC configuration for your model. Your private Docker registry must allow inbound traffic from the security groups that you specify. The subnets that you specify must be in the same VPC as your private Docker registry.

   ```
   vpc_config = {
       'SecurityGroupIds': ['sg-0123456789abcdef0'],
       'Subnets': ['subnet-0123456789abcdef0','subnet-0123456789abcdef1']
   }
   ```

1. Get a Boto3 SageMaker AI client.

   ```
   import boto3
   sm = boto3.client('sagemaker')
   ```

1. Create the model by calling `create_model`, using the values you specified in the previous steps for the `PrimaryContainer` and `VpcConfig` parameters.

   ```
   try:
       resp = sm.create_model(
           ModelName=model_name,
           PrimaryContainer=primary_container,
           ExecutionRoleArn=execution_role_arn,
           VpcConfig=vpc_config,
       )
   except Exception as e:
       print(f'error calling CreateModel operation: {e}')
   else:
       print(resp)
   ```

1. Finally, call [create\$1endpoint\$1config](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint_config) and [create\$1endpoint](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint) to create the hosting endpoint, using the model that you created in the previous step.

   ```
   endpoint_config_name = 'my-endpoint-config'
   sm.create_endpoint_config(
       EndpointConfigName=endpoint_config_name,
       ProductionVariants=[
           {
               'VariantName': 'MyVariant',
               'ModelName': model_name,
               'InitialInstanceCount': 1,
               'InstanceType': 'ml.t2.medium'
           },
       ],
   )
   
   endpoint_name = 'my-endpoint'
   sm.create_endpoint(
       EndpointName=endpoint_name,
       EndpointConfigName=endpoint_config_name,
   )
   
   sm.describe_endpoint(EndpointName=endpoint_name)
   ```
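
Endpoint creation is asynchronous, so a single `describe_endpoint` call usually reports a status of `Creating`. The following is a minimal sketch of one way to poll until the status changes. The helper name `wait_for_endpoint` and its polling defaults are assumptions for illustration and aren't part of Boto3. Boto3 also offers a built-in waiter, `sm.get_waiter('endpoint_in_service')`, that you can use instead.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=30, max_polls=60):
    """Poll describe_endpoint until the endpoint leaves the 'Creating' state.

    sm_client is a Boto3 SageMaker client, such as the `sm` object created
    in the earlier step. Returns the final EndpointStatus string.
    """
    for _ in range(max_polls):
        desc = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = desc['EndpointStatus']
        if status != 'Creating':
            # 'InService' means success; 'Failed' comes with a FailureReason.
            if status == 'Failed':
                print(f"endpoint failed: {desc.get('FailureReason')}")
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f'{endpoint_name} still creating after {max_polls} polls')
```

For example, `wait_for_endpoint(sm, endpoint_name)` returns `'InService'` when the endpoint is ready to serve requests.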

## Allow SageMaker AI to authenticate to a private Docker registry


To pull an inference image from a private Docker registry that requires authentication, create an AWS Lambda function that provides credentials, and provide the Amazon Resource Name (ARN) of the Lambda function when you call [create\$1model](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_model). When SageMaker AI runs `create_model`, it calls the Lambda function that you specified to get credentials to authenticate to your Docker registry.

## Create the Lambda function


Create an AWS Lambda function that returns a response with the following form:

```
def handler(event, context):
    # Replace the placeholder values with the credentials for your registry.
    response = {
        "Credentials": {"Username": "username", "Password": "password"}
    }
    return response
```

Depending on how you set up authentication for your private Docker registry, the credentials that your Lambda function returns can mean either of the following:
+ If you set up your private Docker registry to use basic authentication, provide the sign-in credentials to authenticate to the registry.
+ If you set up your private Docker registry to use bearer token authentication, the sign-in credentials are sent to your authorization server, which returns a bearer token that can then be used to authenticate to the private Docker registry.
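
Hardcoding credentials in the function body is convenient for testing but hard to rotate. As one possible alternative, the following sketch reads the credentials from AWS Secrets Manager and returns them in the same response form. It assumes a secret named `my-registry-credentials` (a hypothetical name) whose value is a JSON string with `username` and `password` keys, and it assumes that the Lambda function's own role is allowed to call `secretsmanager:GetSecretValue`.

```python
import json


def build_response(secret_string):
    """Convert a JSON secret into the response form that SageMaker AI expects.

    Assumes the secret looks like '{"username": "...", "password": "..."}'.
    """
    secret = json.loads(secret_string)
    return {
        'Credentials': {
            'Username': secret['username'],
            'Password': secret['password'],
        }
    }


def handler(event, context):
    # Imported here so that build_response can be tested without boto3.
    import boto3

    # Hypothetical secret name; replace it with your own.
    secret_id = 'my-registry-credentials'
    client = boto3.client('secretsmanager')
    secret_string = client.get_secret_value(SecretId=secret_id)['SecretString']
    return build_response(secret_string)
```

Keeping the parsing in `build_response` separates the response format from the AWS call, so you can check the format locally before you deploy the function.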

## Give your execution role permission to Lambda


The execution role that you use to call `create_model` must have permissions to call AWS Lambda functions. Add the following to the permissions policy of your execution role.

```
{
    "Effect": "Allow",
    "Action": [
        "lambda:InvokeFunction"
    ],
    "Resource": [
        "arn:aws:lambda:*:*:function:*myLambdaFunction*"
    ]
}
```

Where *myLambdaFunction* is the name of your Lambda function. For information about editing a role permissions policy, see [Modifying a role permissions policy (console)](https://docs.aws.amazon.com/IAM/latest/UserGuide/roles-managingrole-editing-console.html#roles-modify_permissions-policy) in the *AWS Identity and Access Management User Guide*.

**Note**  
An execution role with the `AmazonSageMakerFullAccess` managed policy attached to it has permission to call any Lambda function with **SageMaker** in its name.
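
If you prefer to add the permission programmatically, the following sketch wraps the statement in a complete policy document and attaches it to the role as an inline policy with the IAM `put_role_policy` operation. The helper names and the policy name `InvokeRegistryCredentialsLambda` are hypothetical and used only for illustration.

```python
import json


def lambda_invoke_policy(function_name):
    """Build a full policy document around the lambda:InvokeFunction statement."""
    return {
        'Version': '2012-10-17',
        'Statement': [
            {
                'Effect': 'Allow',
                'Action': ['lambda:InvokeFunction'],
                'Resource': [f'arn:aws:lambda:*:*:function:*{function_name}*'],
            }
        ],
    }


def attach_policy(iam_client, role_name, function_name):
    # iam_client is a Boto3 IAM client, for example boto3.client('iam').
    # The policy name below is a hypothetical example.
    iam_client.put_role_policy(
        RoleName=role_name,
        PolicyName='InvokeRegistryCredentialsLambda',
        PolicyDocument=json.dumps(lambda_invoke_policy(function_name)),
    )
```

For example, `attach_policy(boto3.client('iam'), 'SageMakerExecutionRole', 'myLambdaFunction')` adds the inline policy to the execution role.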

## Create an interface VPC endpoint for Lambda


Create an interface endpoint so that your Amazon VPC can communicate with your AWS Lambda function without sending traffic over the internet. For information about how to do this, see [Configuring interface VPC endpoints for Lambda](https://docs.aws.amazon.com/lambda/latest/dg/configuration-vpc-endpoints.html) in the *AWS Lambda Developer Guide*.

SageMaker AI hosting sends a request through your VPC to `lambda.region.amazonaws.com`, to call your Lambda function. If you choose Private DNS Name when you create your interface endpoint, Amazon Route 53 routes the call to the Lambda interface endpoint. If you use a different DNS provider, make sure to map `lambda.region.amazonaws.com` to your Lambda interface endpoint.
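
As a rough sketch, you could also create the interface endpoint with the Amazon EC2 `create_vpc_endpoint` operation in Boto3. The helper name `lambda_endpoint_request` is hypothetical, and the IDs in the usage note are placeholders. Setting `PrivateDnsEnabled` to `True` corresponds to choosing the private DNS name option described above.

```python
def lambda_endpoint_request(region, vpc_id, subnet_ids, security_group_ids):
    """Build keyword arguments for the EC2 create_vpc_endpoint call."""
    return {
        'VpcEndpointType': 'Interface',
        'VpcId': vpc_id,
        'ServiceName': f'com.amazonaws.{region}.lambda',
        'SubnetIds': subnet_ids,
        'SecurityGroupIds': security_group_ids,
        # Lets lambda.<region>.amazonaws.com resolve to the interface endpoint.
        'PrivateDnsEnabled': True,
    }
```

You would then pass the result to the client, for example `boto3.client('ec2').create_vpc_endpoint(**lambda_endpoint_request('us-west-2', 'vpc-0123456789abcdef0', ['subnet-0123456789abcdef0'], ['sg-0123456789abcdef0']))`.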

# Custom Inference Code with Batch Transform


This section explains how Amazon SageMaker AI interacts with a Docker container that runs your own inference code for batch transform. Use this information to write inference code and create a Docker image. 

**Topics**
+ [

## How SageMaker AI Runs Your Inference Image
](#your-algorithms-batch-code-run-image)
+ [

## How SageMaker AI Loads Your Model Artifacts
](#your-algorithms-batch-code-load-artifacts)
+ [

## How Containers Serve Requests
](#your-algorithms-batch-code-how-containe-serves-requests)
+ [

## How Your Container Should Respond to Inference Requests
](#your-algorithms-batch-code-how-containers-should-respond-to-inferences)
+ [

## How Your Container Should Respond to Health Check (Ping) Requests
](#your-algorithms-batch-algo-ping-requests)

## How SageMaker AI Runs Your Inference Image


To configure a container to run as an executable, use an `ENTRYPOINT` instruction in a Dockerfile. Note the following: 
+ For batch transforms, SageMaker AI invokes the model on your behalf. SageMaker AI runs the container as:

  ```
  docker run image serve
  ```

  The input to batch transforms must be of a format that can be split into smaller files to process in parallel. These formats include CSV, [JSON](https://www.json.org/json-en.html), [JSON Lines](https://jsonlines.org/), [TFRecord](https://www.tensorflow.org/tutorials/load_data/tfrecord) and [RecordIO](https://mesos.apache.org/documentation/latest/recordio/).

  SageMaker AI overrides default `CMD` statements in a container by specifying the `serve` argument after the image name. The `serve` argument overrides arguments that you provide with the `CMD` command in the Dockerfile.

   
+ We recommend that you use the `exec` form of the `ENTRYPOINT` instruction:

  ```
  ENTRYPOINT ["executable", "param1", "param2"]
  ```

  For example:

  ```
  ENTRYPOINT ["python", "k_means_inference.py"]
  ```

   
+ SageMaker AI sets environment variables specified in [`CreateModel`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html) and [`CreateTransformJob`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html) on your container. Additionally, the following environment variables are populated:
  + `SAGEMAKER_BATCH` is set to `true` when the container runs batch transforms.
  + `SAGEMAKER_MAX_PAYLOAD_IN_MB` is set to the largest size payload that is sent to the container via HTTP.
  + `SAGEMAKER_BATCH_STRATEGY` is set to `SINGLE_RECORD` when the container is sent a single record per call to invocations and `MULTI_RECORD` when the container gets as many records as will fit in the payload.
  + `SAGEMAKER_MAX_CONCURRENT_TRANSFORMS` is set to the maximum number of `/invocations` requests that can be opened simultaneously.
**Note**  
The last three environment variables come from the API call made by the user. If the user doesn't set values for them, they aren't passed. In that case, either the default values or the values requested by the algorithm (in response to the `/execution-parameters` request) are used.
+ If you plan to use GPU devices for model inferences (by specifying GPU-based ML compute instances in your `CreateTransformJob` request), make sure that your containers are nvidia-docker compatible. Don't bundle NVIDIA drivers with the image. For more information about nvidia-docker, see [NVIDIA/nvidia-docker](https://github.com/NVIDIA/nvidia-docker). 

   
+ You can't use the `init` initializer as your entry point in SageMaker AI containers because the `train` and `serve` arguments confuse it.
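
Inference code can read the environment variables listed above at startup. The following is a minimal sketch. The helper name `read_batch_settings` and the keys of the returned dictionary are illustrative choices, and the three optional values are returned as `None` when they aren't passed to the container.

```python
import os


def read_batch_settings(env=os.environ):
    """Collect the batch transform settings that SageMaker AI may set."""
    max_payload = env.get('SAGEMAKER_MAX_PAYLOAD_IN_MB')
    max_concurrent = env.get('SAGEMAKER_MAX_CONCURRENT_TRANSFORMS')
    return {
        'is_batch': env.get('SAGEMAKER_BATCH', '').lower() == 'true',
        # The next three are None when the job request did not set them.
        'max_payload_mb': int(max_payload) if max_payload else None,
        'batch_strategy': env.get('SAGEMAKER_BATCH_STRATEGY'),
        'max_concurrent': int(max_concurrent) if max_concurrent else None,
    }
```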

  

## How SageMaker AI Loads Your Model Artifacts


In a [`CreateModel`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html) request, container definitions include the `ModelDataUrl` parameter, which identifies the location in Amazon S3 where model artifacts are stored. When you use SageMaker AI to run inferences, it uses this information to determine from where to copy the model artifacts. It copies the artifacts to the `/opt/ml/model` directory in the Docker container for use by your inference code.

The `ModelDataUrl` parameter must point to a tar.gz file. Otherwise, SageMaker AI can't download the file. If you train a model in SageMaker AI, it saves the artifacts as a single compressed tar file in Amazon S3. If you train a model in another framework, you need to store the model artifacts in Amazon S3 as a compressed tar file. SageMaker AI decompresses this tar file and saves it in the `/opt/ml/model` directory in the container before the batch transform job starts. 
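
If you trained your model outside of SageMaker AI, you might package the artifacts with Python's standard `tarfile` module before you upload them to Amazon S3. The following is a sketch under that assumption. The helper name `package_model` and the default file name `model.tar.gz` are illustrative, and the output path should be outside the artifact directory.

```python
import os
import tarfile


def package_model(artifact_dir, output_path='model.tar.gz'):
    """Write everything under artifact_dir into a gzip-compressed tar file.

    Entries are stored relative to artifact_dir, so they land directly
    under /opt/ml/model when the archive is extracted.
    """
    with tarfile.open(output_path, 'w:gz') as tar:
        for name in sorted(os.listdir(artifact_dir)):
            # tar.add recurses into subdirectories by default.
            tar.add(os.path.join(artifact_dir, name), arcname=name)
    return output_path
```

After packaging, upload the archive to Amazon S3 and use its S3 URI as the `ModelDataUrl` value.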

## How Containers Serve Requests


Containers must implement a web server that responds to invocations and ping requests on port 8080. For batch transforms, you have the option to set algorithms to implement execution-parameters requests to provide a dynamic runtime configuration to SageMaker AI. SageMaker AI uses the following endpoints: 
+ `ping`—Used to periodically check the health of the container. SageMaker AI waits for an HTTP `200` status code and an empty body for a successful ping request before sending an invocations request. You might use a ping request to load a model into memory to generate inference when invocations requests are sent.
+ (Optional) `execution-parameters`—Allows the algorithm to provide the optimal tuning parameters for a job during runtime. Based on the memory and CPUs available for a container, the algorithm chooses the appropriate `MaxConcurrentTransforms`, `BatchStrategy`, and `MaxPayloadInMB` values for the job.

Before calling the invocations request, SageMaker AI attempts to invoke the execution-parameters request. When you create a batch transform job, you can provide values for the `MaxConcurrentTransforms`, `BatchStrategy`, and `MaxPayloadInMB` parameters. SageMaker AI determines the values for these parameters using this order of precedence:

1. The parameter values that you provide when you create the `CreateTransformJob` request.

1. The values that the model container returns when SageMaker AI invokes the execution-parameters endpoint.

1. The default parameter values, listed in the following table.    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-batch-code.html)
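
The order of precedence above can be illustrated with a short sketch. This is not SageMaker AI's implementation. The function name `resolve_parameters` is hypothetical, and the caller supplies the default values, so any defaults that you pass in are placeholders rather than the service defaults.

```python
PARAMETER_NAMES = ('MaxConcurrentTransforms', 'BatchStrategy', 'MaxPayloadInMB')


def resolve_parameters(job_values, container_values, default_values):
    """Pick each parameter from the first source that supplies it.

    Sources are checked in order: the CreateTransformJob request, the
    container's execution-parameters response, and then the defaults.
    """
    resolved = {}
    for name in PARAMETER_NAMES:
        for source in (job_values, container_values, default_values):
            if source.get(name) is not None:
                resolved[name] = source[name]
                break
    return resolved
```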

The response for a `GET` execution-parameters request is a JSON object with keys for `MaxConcurrentTransforms`, `BatchStrategy`, and `MaxPayloadInMB` parameters. This is an example of a valid response:

```
{
    "MaxConcurrentTransforms": 8,
    "BatchStrategy": "MULTI_RECORD",
    "MaxPayloadInMB": 6
}
```
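
A container might build this response at runtime from the resources it detects. The following sketch assumes, purely for illustration, that one concurrent request per CPU is a reasonable starting point. The function name `execution_parameters` and the fixed `BatchStrategy` and `MaxPayloadInMB` values are assumptions that you should tune for your own algorithm.

```python
import json
import os


def execution_parameters(cpu_count=None):
    """Return a JSON body for a GET execution-parameters request."""
    cpus = cpu_count or os.cpu_count() or 1
    body = {
        # Assumption: one concurrent request per CPU is a sensible start.
        'MaxConcurrentTransforms': cpus,
        'BatchStrategy': 'MULTI_RECORD',
        'MaxPayloadInMB': 6,
    }
    return json.dumps(body)
```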

## How Your Container Should Respond to Inference Requests


To obtain inferences, Amazon SageMaker AI sends a POST request to the inference container. The POST request body contains data from Amazon S3. Amazon SageMaker AI passes the request to the container, and returns the inference result from the container, saving the data from the response to Amazon S3.

To receive inference requests, the container must have a web server listening on port 8080 and must accept POST requests to the `/invocations` endpoint. The inference request timeout and max retries can be configured through [`ModelClientConfig`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ModelClientConfig.html).
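
As an illustration, the following sketch builds the arguments for a Boto3 `create_transform_job` call that sets `ModelClientConfig`. The helper name `transform_job_request`, the timeout and retry values, the content type, and the instance type are example choices, not recommendations.

```python
def transform_job_request(job_name, model_name, input_s3_uri, output_s3_uri):
    """Build keyword arguments for the create_transform_job call."""
    return {
        'TransformJobName': job_name,
        'ModelName': model_name,
        # Timeout and retry settings for each /invocations request.
        'ModelClientConfig': {
            'InvocationsTimeoutInSeconds': 120,
            'InvocationsMaxRetries': 2,
        },
        'TransformInput': {
            'DataSource': {
                'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': input_s3_uri}
            },
            'ContentType': 'text/csv',
            'SplitType': 'Line',
        },
        'TransformOutput': {'S3OutputPath': output_s3_uri},
        'TransformResources': {'InstanceType': 'ml.m5.xlarge', 'InstanceCount': 1},
    }
```

You would pass the result to a SageMaker client, for example `boto3.client('sagemaker').create_transform_job(**transform_job_request(...))`.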

## How Your Container Should Respond to Health Check (Ping) Requests


The simplest requirement on the container is to respond with an HTTP 200 status code and an empty body. This indicates to SageMaker AI that the container is ready to accept inference requests at the `/invocations` endpoint.

While the minimum bar is for the container to return a static 200, a container developer can use this functionality to perform deeper checks. The request timeout on `/ping` attempts is 2 seconds.
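
To make the request contract concrete, the following is a minimal sketch of a web server that answers `/ping` and `/invocations` by using only the Python standard library. The `predict` function is a placeholder that echoes the request body, and `make_server` is a hypothetical helper name. A production container would typically use a dedicated web server and real model code.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def predict(payload: bytes) -> bytes:
    # Placeholder inference logic: echo the request body back.
    return payload


class InferenceHandler(BaseHTTPRequestHandler):
    def _send(self, status, body=b''):
        self.send_response(status)
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        # Health check: HTTP 200 with an empty body on /ping.
        self._send(200 if self.path == '/ping' else 404)

    def do_POST(self):
        if self.path != '/invocations':
            self._send(404)
            return
        length = int(self.headers.get('Content-Length', 0))
        self._send(200, predict(self.rfile.read(length)))

    def log_message(self, format, *args):
        pass  # Silence per-request logging in this sketch.


def make_server(port=8080):
    """Create a server that listens on all interfaces at the given port."""
    return ThreadingHTTPServer(('0.0.0.0', port), InferenceHandler)
```

To run the server from the script that your `ENTRYPOINT` starts, call `make_server().serve_forever()`.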

# Examples and More Information: Use Your Own Algorithm or Model
Examples and more info

The following Jupyter notebooks and additional information show how to use your own algorithms or pretrained models from an Amazon SageMaker notebook instance. For links to the GitHub repositories with the prebuilt Dockerfiles for the TensorFlow, MXNet, Chainer, and PyTorch frameworks, and for instructions on using the AWS SDK for Python (Boto3) estimators to run your own training algorithms on SageMaker AI Learner and your own models on SageMaker AI hosting, see [Prebuilt SageMaker AI Docker images for deep learning](pre-built-containers-frameworks-deep-learning.md).

## Setup


1. Create a SageMaker notebook instance. For instructions on how to create and access Jupyter notebook instances, see [Amazon SageMaker notebook instances](nbi.md).

1. Open the notebook instance you created.

1. Choose the **SageMaker AI Examples** tab for a list of all SageMaker AI example notebooks.

1. Open the sample notebooks from the **Advanced Functionality** section in your notebook instance or from GitHub using the provided links. To open a notebook, choose its **Use** tab, then choose **Create copy**.

## Host models trained in Scikit-learn


To learn how to host models trained in Scikit-learn for making predictions in SageMaker AI by injecting them into first-party k-means and XGBoost containers, see the following sample notebooks.
+ [kmeans\$1bring\$1your\$1own\$1model](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/advanced_functionality/kmeans_bring_your_own_model)
+ [xgboost\$1bring\$1your\$1own\$1model](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/advanced_functionality/xgboost_bring_your_own_model)

## Package TensorFlow and Scikit-learn models for use in SageMaker AI


To learn how to package algorithms that you have developed in TensorFlow and scikit-learn frameworks for training and deployment in the SageMaker AI environment, see the following notebooks. They show you how to build, register, and deploy your own Docker containers using Dockerfiles.
+ [tensorflow\$1bring\$1your\$1own](https://github.com/aws/amazon-sagemaker-examples/blob/main/advanced_functionality/tensorflow_iris_byom/tensorflow_BYOM_iris.ipynb)
+ [scikit\$1bring\$1your\$1own](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/advanced_functionality/scikit_bring_your_own)

## Train and deploy a neural network on SageMaker AI


To learn how to train a neural network locally using MXNet or TensorFlow, and then create an endpoint from the trained model and deploy it on SageMaker AI, see the following notebooks. The MXNet model is trained to recognize handwritten numbers from the MNIST dataset. The TensorFlow model is trained to classify irises.
+ [mxnet\$1mnist\$1byom](https://sagemaker.readthedocs.io/en/stable/frameworks/mxnet/using_mxnet.html)
+ [tensorflow\$1BYOM\$1iris](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/advanced_functionality/tensorflow_iris_byom/tensorflow_BYOM_iris.ipynb)

## Training using pipe mode


To learn how to use a Dockerfile to build a container that calls the `train.py` script and uses pipe mode to custom train an algorithm, see the following notebook. In pipe mode, the input data is transferred to the algorithm while it is training. This can decrease training time compared to using file mode. 
+ [pipe\$1bring\$1your\$1own](https://github.com/aws/amazon-sagemaker-examples/blob/0efd885ef2a5c04929d10c5272681f4ca17dac17/advanced_functionality/pipe_bring_your_own/pipe_bring_your_own.ipynb)

## Bring your own R model


To learn how to add a custom R image to build and train a model in an Amazon SageMaker Studio Classic notebook, see the following blog post. This blog post uses a sample R Dockerfile from a library of [SageMaker AI Studio Classic Custom Image Samples](https://github.com/aws-samples/sagemaker-studio-custom-image-samples).
+ [Bringing your own R environment to Amazon SageMaker Studio Classic](https://aws.amazon.com/blogs/machine-learning/bringing-your-own-r-environment-to-amazon-sagemaker-studio/)

## Extend a pre-built PyTorch container image


To learn how to extend a prebuilt SageMaker AI PyTorch container image when you have additional functional requirements for your algorithm or model that the prebuilt Docker image doesn't support, see the following notebook.
+ [BERTtopic\$1extending\$1container](https://github.com/aws/amazon-sagemaker-examples/blob/0efd885ef2a5c04929d10c5272681f4ca17dac17/advanced_functionality/pytorch_extend_container_train_deploy_bertopic/BERTtopic_extending_container.ipynb)

For more information about extending a container, see [Extend a Pre-built Container](https://docs.aws.amazon.com/sagemaker/latest/dg/prebuilt-containers-extend.html).

## Train and debug training jobs on a custom container


To learn how to train and debug training jobs using SageMaker Debugger, see the following notebook. A training script provided through this example uses the TensorFlow Keras ResNet 50 model and the CIFAR10 dataset. A Docker custom container is built with the training script and pushed to Amazon ECR. While the training job is running, Debugger collects tensor outputs and identifies debugging problems. With `smdebug` client library tools, you can set a `smdebug` trial object that calls the training job and debugging information, check the training and Debugger rule status, and retrieve tensors saved in an Amazon S3 bucket to analyze training issues.
+ [build\$1your\$1own\$1container\$1with\$1debugger](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/build_your_own_container_with_debugger/debugger_byoc.html)

## Troubleshooting your Docker containers and deployments
Troubleshooting

The following are common errors that you might run into when using Docker containers with SageMaker AI. Each error is followed by a solution to the error. 
+ **Error: SageMaker AI has lost the Docker daemon.**

  To fix this error, restart Docker using the following command.

  ```
  sudo service docker restart
  ```
+ **Error: The `/tmp` directory of your Docker container has run out of space.**

  Docker containers use the `/` and `/tmp` partitions to store code. These partitions can fill up easily when using large code modules in local mode. The SageMaker AI Python SDK supports specifying a custom temp directory for your local mode root directory to avoid this issue.

  To specify the custom temp directory in the Amazon Elastic Block Store volume storage, create a file at the following path `~/.sagemaker/config.yaml` and add the following configuration. The directory that you specify as `container_root` must already exist. The SageMaker AI Python SDK will not try to create it.

  ```
  local:
    container_root: /home/ec2-user/SageMaker/temp
  ```

  With this configuration, local mode uses the `/home/ec2-user/SageMaker/temp` directory instead of the default `/tmp` directory.
+ **Low space errors on SageMaker notebook instances**

  A Docker container that runs on SageMaker notebook instances uses the root Amazon EBS volume of the notebook instance by default. To resolve low space errors, provide the path of the Amazon EBS volume attached to the notebook instance as part of the volume parameter of Docker commands.

  ```
  docker run -v EBS-volume-path:container-path
  ```
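
For the `/tmp` space error above, the directory named in `container_root` must exist before local mode runs. The following sketch creates the directory and writes the configuration file in one step. The helper name `write_local_mode_config` is hypothetical, and the function overwrites any existing file at the configuration path, so merge by hand if you already keep other settings there.

```python
import os


def write_local_mode_config(container_root, config_path='~/.sagemaker/config.yaml'):
    """Create container_root, then point local mode at it."""
    container_root = os.path.expanduser(container_root)
    config_path = os.path.expanduser(config_path)
    # The SDK does not create this directory, so create it up front.
    os.makedirs(container_root, exist_ok=True)
    os.makedirs(os.path.dirname(config_path), exist_ok=True)
    # Caution: this replaces any existing configuration file.
    with open(config_path, 'w') as f:
        f.write(f'local:\n  container_root: {container_root}\n')
    return config_path
```

For example, `write_local_mode_config('/home/ec2-user/SageMaker/temp')` produces the configuration shown earlier.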