

# Kubernetes Orchestration


You can orchestrate your SageMaker training and inference jobs with SageMaker AI Operators for Kubernetes and SageMaker AI Components for Kubeflow Pipelines. SageMaker AI Operators for Kubernetes make it easier for developers and data scientists using Kubernetes to train, tune, and deploy machine learning (ML) models in SageMaker AI. SageMaker AI Components for Kubeflow Pipelines allow you to move your data processing and training jobs from the Kubernetes cluster to SageMaker AI’s machine learning-optimized managed service.

**Topics**
+ [SageMaker AI Operators for Kubernetes](kubernetes-sagemaker-operators.md)
+ [SageMaker AI Components for Kubeflow Pipelines](kubernetes-sagemaker-components-for-kubeflow-pipelines.md)

# SageMaker AI Operators for Kubernetes


SageMaker AI Operators for Kubernetes make it easier for developers and data scientists using Kubernetes to train, tune, and deploy machine learning (ML) models in SageMaker AI. You can install these SageMaker AI Operators on your Kubernetes cluster in Amazon Elastic Kubernetes Service (Amazon EKS) to create SageMaker AI jobs natively using the Kubernetes API and command-line Kubernetes tools such as `kubectl`. This guide shows how to set up and use the operators to run model training, hyperparameter tuning, or inference (real-time and batch) on SageMaker AI from a Kubernetes cluster. The procedures and guidelines in this chapter assume that you are familiar with Kubernetes and its basic commands.

**Important**  
We are stopping the development and technical support of the original version of [SageMaker Operators for Kubernetes](https://github.com/aws/amazon-sagemaker-operator-for-k8s/tree/master).  
If you are currently using version `v1.2.2` or below of [SageMaker Operators for Kubernetes](https://github.com/aws/amazon-sagemaker-operator-for-k8s/tree/master), we recommend migrating your resources to the [ACK service controller for Amazon SageMaker](https://github.com/aws-controllers-k8s/sagemaker-controller). The ACK service controller is a new generation of SageMaker Operators for Kubernetes based on [AWS Controllers for Kubernetes (ACK)](https://aws-controllers-k8s.github.io/community/).  
For information on the migration steps, see [Migrate resources to the latest Operators](kubernetes-sagemaker-operators-migrate.md).  
For answers to frequently asked questions on the end of support of the original version of SageMaker Operators for Kubernetes, see [Announcing the End of Support of the Original Version of SageMaker AI Operators for Kubernetes](kubernetes-sagemaker-operators-eos-announcement.md).

**Note**  
There is no additional charge to use these operators. You do incur charges for any SageMaker AI resources that you use through these operators.

## What is an operator?


A Kubernetes operator is an application controller that manages applications on behalf of a Kubernetes user. The controllers of the control plane run control loops that watch the cluster's central state store (etcd) and regulate the state of the applications they control. Examples of such controllers include the [cloud-controller-manager](https://kubernetes.io/docs/concepts/architecture/cloud-controller/) and the [`kube-controller-manager`](https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/). Operators typically provide a higher-level abstraction than the raw Kubernetes API, making it easier for users to deploy and manage applications. To add new capabilities to Kubernetes, developers can extend the Kubernetes API by creating a **custom resource** that contains their application-specific or domain-specific logic and components. Operators in Kubernetes allow users to natively invoke these custom resources and automate the associated workflows.
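
For illustration, a custom resource is declared in a manifest like any built-in resource. The following sketch is purely hypothetical; the group `example.com`, kind `ModelTrainer`, and all fields are invented for this example:

```
# Hypothetical custom resource; an operator watching this kind would
# reconcile the cluster toward the declared spec.
apiVersion: example.com/v1alpha1
kind: ModelTrainer
metadata:
  name: my-trainer
spec:
  image: example/training:latest   # application-specific configuration
  replicas: 1
```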

### How does AWS Controllers for Kubernetes (ACK) work?


The SageMaker AI Operators for Kubernetes allow you to manage jobs in SageMaker AI from your Kubernetes cluster. The latest version of SageMaker AI Operators for Kubernetes is based on AWS Controllers for Kubernetes (ACK). ACK includes a common controller runtime, a code generator, and a set of AWS service-specific controllers, one of which is the SageMaker AI controller.

The following diagram illustrates how ACK works.

![\[ACK based SageMaker AI Operator for Kubernetes explained.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/k8s-orchestration/sagemaker-operators-for-kubernetes-ack-controller.png)


In this diagram, a Kubernetes user wants to run model training on SageMaker AI from within the Kubernetes cluster using the Kubernetes API. The user issues a call to `kubectl apply`, passing in a file, called a manifest, that describes a Kubernetes custom resource for the SageMaker training job. `kubectl apply` passes the manifest to the Kubernetes API server running in the Kubernetes controller node (Step *1* in the workflow diagram). The Kubernetes API server receives the manifest with the SageMaker training job specification and determines whether the user has permissions to create a custom resource of kind `sagemaker.services.k8s.aws/TrainingJob`, and whether the custom resource is properly formatted (Step *2*). If the user is authorized and the custom resource is valid, the Kubernetes API server writes (Step *3*) the custom resource to its etcd data store and then responds back (Step *4*) to the user that the custom resource has been created. The SageMaker AI controller, which runs on a Kubernetes worker node within the context of a normal Kubernetes Pod, is notified (Step *5*) that a new custom resource of kind `sagemaker.services.k8s.aws/TrainingJob` has been created. The SageMaker AI controller then communicates (Step *6*) with the SageMaker API, calling the SageMaker AI `CreateTrainingJob` API to create the training job in AWS. After communicating with the SageMaker API, the SageMaker AI controller calls the Kubernetes API server to update (Step *7*) the custom resource's status with the information it received from SageMaker AI. The SageMaker AI controller therefore provides the same information to developers that they would have received using the AWS SDK.
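
To make Step *1* concrete, the following is a minimal sketch of the kind of manifest such a user might pass to `kubectl apply`. The field names follow the camelCase conventions of the ACK SageMaker controller, but all values are placeholders and the exact schema depends on your controller version, so check the controller's reference documentation before using it:

```
apiVersion: sagemaker.services.k8s.aws/v1alpha1   # API group served by the ACK SageMaker controller
kind: TrainingJob
metadata:
  name: my-training-job
spec:
  trainingJobName: my-training-job                # all values below are placeholders
  roleARN: arn:aws:iam::111122223333:role/my-sagemaker-execution-role
  algorithmSpecification:
    trainingImage: 111122223333.dkr.ecr.us-east-1.amazonaws.com/my-training-image:latest
    trainingInputMode: File
  outputDataConfig:
    s3OutputPath: s3://my-bucket/output/
  resourceConfig:
    instanceCount: 1
    instanceType: ml.m4.xlarge
    volumeSizeInGB: 5
  stoppingCondition:
    maxRuntimeInSeconds: 86400
```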

### Permissions overview


The operators access SageMaker AI resources on your behalf. The IAM role that the operator assumes to interact with AWS resources differs from the credentials you use to access the Kubernetes cluster. The role also differs from the role that AWS assumes when running your machine learning jobs. 

The following image explains the various authentication layers.

![\[SageMaker AI Operator for Kubernetes various authentication layers.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/k8s-orchestration/sagemaker-operators-for-kubernetes-authentication.png)


# Latest SageMaker AI Operators for Kubernetes


This section is based on the latest version of SageMaker AI Operators for Kubernetes using AWS Controllers for Kubernetes (ACK).

**Important**  
If you are currently using version `v1.2.2` or below of [SageMaker Operators for Kubernetes](https://github.com/aws/amazon-sagemaker-operator-for-k8s/tree/master), we recommend migrating your resources to the [ACK service controller for Amazon SageMaker](https://github.com/aws-controllers-k8s/sagemaker-controller). The ACK service controller is a new generation of SageMaker Operators for Kubernetes based on [AWS Controllers for Kubernetes (ACK)](https://aws-controllers-k8s.github.io/community/).  
For information on the migration steps, see [Migrate resources to the latest Operators](kubernetes-sagemaker-operators-migrate.md).  
For answers to frequently asked questions on the end of support of the original version of SageMaker Operators for Kubernetes, see [Announcing the End of Support of the Original Version of SageMaker AI Operators for Kubernetes](kubernetes-sagemaker-operators-eos-announcement.md).

The latest version of [SageMaker AI Operators for Kubernetes](https://github.com/aws-controllers-k8s/sagemaker-controller) is based on [AWS Controllers for Kubernetes (ACK)](https://aws-controllers-k8s.github.io/community/), a framework for building Kubernetes custom controllers, where each controller communicates with an AWS service API. These controllers allow Kubernetes users to provision AWS resources like databases or message queues using the Kubernetes API.

Use the following steps to install and use ACK to train, tune, and deploy machine learning models with Amazon SageMaker AI.

**Topics**
+ [Install SageMaker AI Operators for Kubernetes](#kubernetes-sagemaker-operators-ack-install)
+ [Use SageMaker AI Operators for Kubernetes](#kubernetes-sagemaker-operators-ack-use)
+ [Reference](#kubernetes-sagemaker-operators-ack-reference)

## Install SageMaker AI Operators for Kubernetes


To set up the latest available version of SageMaker AI Operators for Kubernetes, see the *Setup* section in [Machine Learning with the ACK SageMaker AI Controller](https://aws-controllers-k8s.github.io/community/docs/tutorials/sagemaker-example/#setup).

## Use SageMaker AI Operators for Kubernetes


For a tutorial on how to train a machine learning model with the ACK service controller for Amazon SageMaker AI using Amazon EKS, see [Machine Learning with the ACK SageMaker AI Controller](https://aws-controllers-k8s.github.io/community/docs/tutorials/sagemaker-example/).

For an autoscaling example, see [Scale SageMaker AI Workloads with Application Auto Scaling](https://aws-controllers-k8s.github.io/community/docs/tutorials/autoscaling-example/).

## Reference


See also the [ACK service controller for Amazon SageMaker AI GitHub repository](https://github.com/aws-controllers-k8s/sagemaker-controller) or read [AWS Controllers for Kubernetes Documentation](https://aws-controllers-k8s.github.io/community/docs/community/overview/). 

# Old SageMaker AI Operators for Kubernetes


This section is based on the original version of [SageMaker AI Operators for Kubernetes](https://github.com/aws/amazon-sagemaker-operator-for-k8s).

**Important**  
We are stopping the development and technical support of the original version of [SageMaker Operators for Kubernetes](https://github.com/aws/amazon-sagemaker-operator-for-k8s/tree/master).  
If you are currently using version `v1.2.2` or below of [SageMaker Operators for Kubernetes](https://github.com/aws/amazon-sagemaker-operator-for-k8s/tree/master), we recommend migrating your resources to the [ACK service controller for Amazon SageMaker](https://github.com/aws-controllers-k8s/sagemaker-controller). The ACK service controller is a new generation of SageMaker Operators for Kubernetes based on [AWS Controllers for Kubernetes (ACK)](https://aws-controllers-k8s.github.io/community/).  
For information on the migration steps, see [Migrate resources to the latest Operators](kubernetes-sagemaker-operators-migrate.md).  
For answers to frequently asked questions on the end of support of the original version of SageMaker Operators for Kubernetes, see [Announcing the End of Support of the Original Version of SageMaker AI Operators for Kubernetes](kubernetes-sagemaker-operators-eos-announcement.md).

**Topics**
+ [Install SageMaker AI Operators for Kubernetes](#kubernetes-sagemaker-operators-eos-install)
+ [Use Amazon SageMaker AI Jobs](kubernetes-sagemaker-jobs.md)
+ [Migrate resources to the latest Operators](kubernetes-sagemaker-operators-migrate.md)
+ [Announcing the End of Support of the Original Version of SageMaker AI Operators for Kubernetes](kubernetes-sagemaker-operators-eos-announcement.md)

## Install SageMaker AI Operators for Kubernetes


Use the following steps to install and use SageMaker AI Operators for Kubernetes to train, tune, and deploy machine learning models with Amazon SageMaker AI.

**Topics**
+ [IAM role-based setup and operator deployment](#iam-role-based-setup-and-operator-deployment)
+ [Clean up resources](#cleanup-operator-resources)
+ [Delete operators](#delete-operators)
+ [Troubleshooting](#troubleshooting)
+ [Images and SMlogs in each Region](#images-and-smlogs-in-each-region)

### IAM role-based setup and operator deployment


The following sections describe the steps to set up and deploy the original version of the operator.

**Warning**  
**Reminder:** The following steps do not install the latest version of SageMaker AI Operators for Kubernetes. To install the new ACK-based SageMaker AI Operators for Kubernetes, see [Latest SageMaker AI Operators for Kubernetes](kubernetes-sagemaker-operators-ack.md).

#### Prerequisites


This guide assumes that you have completed the following prerequisites: 
+ Install the following tools on the client machine used to access your Kubernetes cluster: 
  + [kubectl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html) Version 1.13 or later. Use a `kubectl` version that is within one minor version of your Amazon EKS cluster control plane. For example, a 1.13 `kubectl` client works with Kubernetes 1.13 and 1.14 clusters. OpenID Connect (OIDC) is not supported in versions earlier than 1.13. 
  + [eksctl](https://github.com/weaveworks/eksctl) Version 0.7.0 or later 
  + [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv1.html) Version 1.16.232 or later 
  + (optional) [Helm](https://helm.sh/docs/intro/install/) Version 3.0 or later 
  + [aws-iam-authenticator](https://docs.aws.amazon.com/eks/latest/userguide/install-aws-iam-authenticator.html) 
+ Have IAM permissions to create roles and attach policies to roles.
+ Have a Kubernetes cluster on which to run the operators. It should be Kubernetes version 1.13 or 1.14. For automated cluster creation using `eksctl`, see [Getting Started with eksctl](https://docs.aws.amazon.com/eks/latest/userguide/getting-started-eksctl.html). It takes 20–30 minutes to provision a cluster. 

#### Cluster-scoped deployment


Before you can deploy your operator using an IAM role, associate an OpenID Connect (OIDC) identity provider (IdP) with your Amazon EKS cluster so that the cluster can authenticate with the IAM service.

##### Create an OIDC provider for your cluster


The following instructions show how to create and associate an OIDC provider with your Amazon EKS cluster.

1. Set the local `CLUSTER_NAME` and `AWS_REGION` environment variables as follows:

   ```
   # Set the Region and cluster
   export CLUSTER_NAME="<your cluster name>"
   export AWS_REGION="<your region>"
   ```

1. Use the following command to associate the OIDC provider with your cluster. For more information, see [Enabling IAM Roles for Service Accounts on your Cluster](https://docs.aws.amazon.com/eks/latest/userguide/enable-iam-roles-for-service-accounts.html). 

   ```
   eksctl utils associate-iam-oidc-provider --cluster ${CLUSTER_NAME} \
         --region ${AWS_REGION} --approve
   ```

   Your output should look like the following: 

   ```
   [_]  eksctl version 0.10.1
   [_]  using region us-east-1
   [_]  IAM OpenID Connect provider is associated with cluster "my-cluster" in "us-east-1"
   ```

Now that the cluster has an OIDC identity provider, you can create a role and give a Kubernetes ServiceAccount permission to assume the role.

##### Get the OIDC ID


To set up the ServiceAccount, obtain the OIDC issuer URL using the following command:

```
aws eks describe-cluster --name ${CLUSTER_NAME} --region ${AWS_REGION} \
      --query cluster.identity.oidc.issuer --output text
```

The command returns a URL like the following: 

```
https://oidc.eks.${AWS_REGION}.amazonaws.com/id/D48675832CA65BD10A532F597OIDCID
```

In this URL, the value `D48675832CA65BD10A532F597OIDCID` is the OIDC ID. The OIDC ID for your cluster is different. You need this OIDC ID value to create a role. 

 If your output is `None`, it means that your client version is old. To work around this, run the following command: 

```
aws eks describe-cluster --region ${AWS_REGION} --query cluster --name ${CLUSTER_NAME} --output text | grep OIDC
```

The OIDC URL is returned as follows: 

```
OIDC https://oidc.eks.us-east-1.amazonaws.com/id/D48675832CA65BD10A532F597OIDCID
```
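
If you want to capture the OIDC ID in a shell variable for the steps that follow, here is a small sketch (the `OIDC_ID` variable name is illustrative):

```
# Extract the OIDC ID (the final path segment of the issuer URL).
export OIDC_ID=$(aws eks describe-cluster --name ${CLUSTER_NAME} --region ${AWS_REGION} \
      --query cluster.identity.oidc.issuer --output text | awk -F'/id/' '{print $2}')
echo ${OIDC_ID}
```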

##### Create an IAM role


1. Create a file named `trust.json` and insert the following trust relationship code block into it. Be sure to replace all `<OIDC ID>`, `<AWS account number>`, and `<EKS Cluster region>` placeholders with values corresponding to your cluster. 

   For standard AWS Regions, use the following trust relationship:

   ```
   {
       "Version": "2012-10-17",
       "Statement": [
         {
           "Effect": "Allow",
           "Principal": {
             "Federated": "arn:aws:iam::111122223333:oidc-provider/oidc.eks.<EKS Cluster region>.amazonaws.com/id/<OIDC ID>"
           },
           "Action": "sts:AssumeRoleWithWebIdentity",
           "Condition": {
             "StringEquals": {
               "oidc.eks.<EKS Cluster region>.amazonaws.com/id/<OIDC ID>:aud": "sts.amazonaws.com",
               "oidc.eks.<EKS Cluster region>.amazonaws.com/id/<OIDC ID>:sub": "system:serviceaccount:sagemaker-k8s-operator-system:sagemaker-k8s-operator-default"
             }
           }
         }
       ]
   }
   ```

   For AWS China Regions, use the same document with the `aws-cn` partition in the `Federated` principal:

   ```
   {
       "Version": "2012-10-17",
       "Statement": [
         {
           "Effect": "Allow",
           "Principal": {
             "Federated": "arn:aws-cn:iam::111122223333:oidc-provider/oidc.eks.<EKS Cluster region>.amazonaws.com/id/<OIDC ID>"
           },
           "Action": "sts:AssumeRoleWithWebIdentity",
           "Condition": {
             "StringEquals": {
               "oidc.eks.<EKS Cluster region>.amazonaws.com/id/<OIDC ID>:aud": "sts.amazonaws.com",
               "oidc.eks.<EKS Cluster region>.amazonaws.com/id/<OIDC ID>:sub": "system:serviceaccount:sagemaker-k8s-operator-system:sagemaker-k8s-operator-default"
             }
           }
         }
       ]
   }
   ```
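
   Rather than hand-editing the placeholders, you can generate `trust.json` from the environment variables set earlier. The following is a minimal sketch; it assumes you captured the OIDC ID in an `OIDC_ID` variable (see the sketch above) and that `AWS_ACCOUNT_ID` holds your account number:

   ```
   # Assumes OIDC_ID and AWS_ACCOUNT_ID are set; both are illustrative names.
   export AWS_ACCOUNT_ID="111122223333"
   cat > trust.json <<EOF
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Principal": {
           "Federated": "arn:aws:iam::${AWS_ACCOUNT_ID}:oidc-provider/oidc.eks.${AWS_REGION}.amazonaws.com/id/${OIDC_ID}"
         },
         "Action": "sts:AssumeRoleWithWebIdentity",
         "Condition": {
           "StringEquals": {
             "oidc.eks.${AWS_REGION}.amazonaws.com/id/${OIDC_ID}:aud": "sts.amazonaws.com",
             "oidc.eks.${AWS_REGION}.amazonaws.com/id/${OIDC_ID}:sub": "system:serviceaccount:sagemaker-k8s-operator-system:sagemaker-k8s-operator-default"
           }
         }
       }
     ]
   }
   EOF
   ```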

1. Run the following command to create a role with the trust relationship defined in `trust.json`. This role allows the Amazon EKS cluster to get and refresh credentials from IAM. 

   ```
   aws iam create-role --region ${AWS_REGION} --role-name <role name> --assume-role-policy-document file://trust.json --output=text
   ```

   Your output should look like the following: 

   ```
   ROLE    arn:aws:iam::123456789012:role/my-role 2019-11-22T21:46:10Z    /       ABCDEFSFODNN7EXAMPLE   my-role
   ASSUMEROLEPOLICYDOCUMENT        2012-10-17
   STATEMENT       sts:AssumeRoleWithWebIdentity   Allow
   STRINGEQUALS    sts.amazonaws.com       system:serviceaccount:sagemaker-k8s-operator-system:sagemaker-k8s-operator-default
   PRINCIPAL       arn:aws:iam::123456789012:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/
   ```

    Take note of `ROLE ARN`; you pass this value to your operator. 

##### Attach the AmazonSageMakerFullAccess policy to the role


To give the role access to SageMaker AI, attach the [AmazonSageMakerFullAccess](https://console.aws.amazon.com/iam/home?#/policies/arn:aws:iam::aws:policy/AmazonSageMakerFullAccess) policy. If you want to limit permissions to the operator, you can create your own custom policy and attach it. 

 To attach `AmazonSageMakerFullAccess`, run the following command: 

```
aws iam attach-role-policy --role-name <role name>  --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
```

The Kubernetes ServiceAccount `sagemaker-k8s-operator-default` should have `AmazonSageMakerFullAccess` permissions. Confirm this when you install the operator. 
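
After you install the operator, one way to confirm this mapping is to check that the operator's ServiceAccount carries the role annotation. A sketch, assuming the cluster-scoped namespace used later in this guide:

```
# The eks.amazonaws.com/role-arn annotation should show the role you created.
kubectl describe serviceaccount sagemaker-k8s-operator-default \
    -n sagemaker-k8s-operator-system
```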

##### Deploy the operator


When deploying your operator, you can use either a YAML file or Helm charts. 

##### Deploy the operator using YAML


This is the simplest way to deploy your operators. The process is as follows: 

1. Download the installer script using the following command: 

   ```
   wget https://raw.githubusercontent.com/aws/amazon-sagemaker-operator-for-k8s/master/release/rolebased/installer.yaml
   ```

1. Edit the `installer.yaml` file to replace the value of the `eks.amazonaws.com/role-arn` annotation with the Amazon Resource Name (ARN) of the OIDC-based role you created. A `sed` sketch for this step follows these steps. 

1. Use the following command to deploy the operator: 

   ```
   kubectl apply -f installer.yaml
   ```
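
   For step 2, a non-interactive alternative to hand-editing is a `sed` substitution. This is a sketch only; `ROLE_ARN` is an illustrative variable name, and you should adjust the pattern to match the annotation value present in your copy of `installer.yaml`:

   ```
   # Find the annotation line, then substitute your role's ARN as its value.
   export ROLE_ARN="arn:aws:iam::111122223333:role/my-role"
   grep -n "eks.amazonaws.com/role-arn" installer.yaml
   sed -i "s|eks.amazonaws.com/role-arn: .*|eks.amazonaws.com/role-arn: ${ROLE_ARN}|" installer.yaml
   ```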

##### Deploy the operator using Helm Charts


Use the provided Helm Chart to install the operator. 

1. Clone the Helm installer directory using the following command: 

   ```
   git clone https://github.com/aws/amazon-sagemaker-operator-for-k8s.git
   ```

1. Navigate to the `amazon-sagemaker-operator-for-k8s/hack/charts/installer` folder. Edit the `rolebased/values.yaml` file, which includes high-level parameters for the chart. Replace the role ARN here with the Amazon Resource Name (ARN) for the OIDC-based role you've created. 

1. Install the Helm Chart using the following command: 

   ```
   kubectl create namespace sagemaker-k8s-operator-system
   helm install --namespace sagemaker-k8s-operator-system sagemaker-operator rolebased/
   ```

   If you decide to install the operator into a namespace other than the one specified, you need to adjust the namespace defined in the IAM role `trust.json` file to match. 

1. After a moment, the chart is installed with the name `sagemaker-operator`. Verify that the installation succeeded by running the following command: 

   ```
   helm ls
   ```

   Your output should look like the following: 

   ```
   NAME                    NAMESPACE                       REVISION        UPDATED                                 STATUS          CHART                           APP VERSION
   sagemaker-operator      sagemaker-k8s-operator-system   1               2019-11-20 23:14:59.6777082 +0000 UTC   deployed        sagemaker-k8s-operator-0.1.0
   ```

##### Verify the operator deployment


1. You should be able to see the SageMaker AI Custom Resource Definitions (CRDs) for each operator deployed to your cluster by running the following command: 

   ```
   kubectl get crd | grep sagemaker
   ```

   Your output should look like the following: 

   ```
   batchtransformjobs.sagemaker.aws.amazon.com         2019-11-20T17:12:34Z
   endpointconfigs.sagemaker.aws.amazon.com            2019-11-20T17:12:34Z
   hostingdeployments.sagemaker.aws.amazon.com         2019-11-20T17:12:34Z
   hyperparametertuningjobs.sagemaker.aws.amazon.com   2019-11-20T17:12:34Z
   models.sagemaker.aws.amazon.com                     2019-11-20T17:12:34Z
   trainingjobs.sagemaker.aws.amazon.com               2019-11-20T17:12:34Z
   ```

1. Ensure that the operator pod is running successfully. Use the following command to list all pods: 

   ```
   kubectl -n sagemaker-k8s-operator-system get pods
   ```

   You should see a pod named `sagemaker-k8s-operator-controller-manager-*****` in the namespace `sagemaker-k8s-operator-system` as follows: 

   ```
   NAME                                                         READY   STATUS    RESTARTS   AGE
   sagemaker-k8s-operator-controller-manager-12345678-r8abc     2/2     Running   0          23s
   ```

#### Namespace-scoped deployment


You have the option to install your operator within the scope of an individual Kubernetes namespace. In this mode, the controller only monitors and reconciles resources with SageMaker AI if the resources are created within that namespace. This allows for finer-grained control over which controller is managing which resources. This is useful for deploying to multiple AWS accounts or controlling which users have access to particular jobs. 

This guide outlines how to install an operator into a particular, predefined namespace. To deploy a controller into a second namespace, follow the guide from beginning to end and change out the namespace in each step. 

##### Create an OIDC provider for your Amazon EKS cluster


The following instructions show how to create and associate an OIDC provider with your Amazon EKS cluster. 

1. Set the local `CLUSTER_NAME` and `AWS_REGION` environment variables as follows: 

   ```
   # Set the Region and cluster
   export CLUSTER_NAME="<your cluster name>"
   export AWS_REGION="<your region>"
   ```

1. Use the following command to associate the OIDC provider with your cluster. For more information, see [Enabling IAM Roles for Service Accounts on your Cluster](https://docs.aws.amazon.com/eks/latest/userguide/enable-iam-roles-for-service-accounts.html). 

   ```
   eksctl utils associate-iam-oidc-provider --cluster ${CLUSTER_NAME} \
         --region ${AWS_REGION} --approve
   ```

   Your output should look like the following: 

   ```
   [_]  eksctl version 0.10.1
   [_]  using region us-east-1
   [_]  IAM OpenID Connect provider is associated with cluster "my-cluster" in "us-east-1"
   ```

Now that the cluster has an OIDC identity provider, create a role and give a Kubernetes ServiceAccount permission to assume the role. 

##### Get your OIDC ID


To set up the ServiceAccount, first obtain the OpenID Connect issuer URL using the following command: 

```
aws eks describe-cluster --name ${CLUSTER_NAME} --region ${AWS_REGION} \
      --query cluster.identity.oidc.issuer --output text
```

The command returns a URL like the following: 

```
https://oidc.eks.${AWS_REGION}.amazonaws.com/id/D48675832CA65BD10A532F597OIDCID
```

In this URL, the value `D48675832CA65BD10A532F597OIDCID` is the OIDC ID. The OIDC ID for your cluster is different. You need this OIDC ID value to create a role. 

 If your output is `None`, it means that your client version is old. To work around this, run the following command: 

```
aws eks describe-cluster --region ${AWS_REGION} --query cluster --name ${CLUSTER_NAME} --output text | grep OIDC
```

The OIDC URL is returned as follows: 

```
OIDC https://oidc.eks.us-east-1.amazonaws.com/id/D48675832CA65BD10A532F597OIDCID
```

##### Create your IAM role


1. Create a file named `trust.json` and insert the following trust relationship code block into it. Be sure to replace all `<OIDC ID>`, `<AWS account number>`, `<EKS Cluster region>`, and `<Namespace>` placeholders with values corresponding to your cluster. For the purposes of this guide, `my-namespace` is used for the `<Namespace>` value. 

   For standard AWS Regions, use the following trust relationship:

   ```
   {
       "Version": "2012-10-17",
       "Statement": [
         {
           "Effect": "Allow",
           "Principal": {
             "Federated": "arn:aws:iam::111122223333:oidc-provider/oidc.eks.<EKS Cluster region>.amazonaws.com/id/<OIDC ID>"
           },
           "Action": "sts:AssumeRoleWithWebIdentity",
           "Condition": {
             "StringEquals": {
               "oidc.eks.<EKS Cluster region>.amazonaws.com/id/<OIDC ID>:aud": "sts.amazonaws.com",
               "oidc.eks.<EKS Cluster region>.amazonaws.com/id/<OIDC ID>:sub": "system:serviceaccount:<Namespace>:sagemaker-k8s-operator-default"
             }
           }
         }
       ]
   }
   ```

   For AWS China Regions, use the same document with the `aws-cn` partition in the `Federated` principal:

   ```
   {
       "Version": "2012-10-17",
       "Statement": [
         {
           "Effect": "Allow",
           "Principal": {
             "Federated": "arn:aws-cn:iam::111122223333:oidc-provider/oidc.eks.<EKS Cluster region>.amazonaws.com/id/<OIDC ID>"
           },
           "Action": "sts:AssumeRoleWithWebIdentity",
           "Condition": {
             "StringEquals": {
               "oidc.eks.<EKS Cluster region>.amazonaws.com/id/<OIDC ID>:aud": "sts.amazonaws.com",
               "oidc.eks.<EKS Cluster region>.amazonaws.com/id/<OIDC ID>:sub": "system:serviceaccount:<Namespace>:sagemaker-k8s-operator-default"
             }
           }
         }
       ]
   }
   ```

1. Run the following command to create a role with the trust relationship defined in `trust.json`. This role allows the Amazon EKS cluster to get and refresh credentials from IAM. 

   ```
   aws iam create-role --region ${AWS_REGION} --role-name <role name> --assume-role-policy-document file://trust.json --output=text
   ```

   Your output should look like the following: 

   ```
   ROLE    arn:aws:iam::123456789012:role/my-role 2019-11-22T21:46:10Z    /       ABCDEFSFODNN7EXAMPLE   my-role
   ASSUMEROLEPOLICYDOCUMENT        2012-10-17
   STATEMENT       sts:AssumeRoleWithWebIdentity   Allow
   STRINGEQUALS    sts.amazonaws.com       system:serviceaccount:my-namespace:sagemaker-k8s-operator-default
   PRINCIPAL       arn:aws:iam::123456789012:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/
   ```

Take note of `ROLE ARN`. You pass this value to your operator. 

##### Attach the AmazonSageMakerFullAccess policy to your role


To give the role access to SageMaker AI, attach the [AmazonSageMakerFullAccess](https://console.aws.amazon.com/iam/home?#/policies/arn:aws:iam::aws:policy/AmazonSageMakerFullAccess) policy. If you want to limit permissions to the operator, you can create your own custom policy and attach it. 

 To attach `AmazonSageMakerFullAccess`, run the following command: 

```
aws iam attach-role-policy --role-name <role name>  --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
```

The Kubernetes ServiceAccount `sagemaker-k8s-operator-default` should have `AmazonSageMakerFullAccess` permissions. Confirm this when you install the operator. 

##### Deploy the operator to your namespace


When deploying your operator, you can use either a YAML file or Helm charts. 

##### Deploy the operator to your namespace using YAML


There are two parts to deploying an operator within the scope of a namespace. The first is the set of CRDs that are installed at a cluster level. These resource definitions only need to be installed once per Kubernetes cluster. The second part is the operator permissions and deployment itself. 

 If you have not already installed the CRDs into the cluster, apply the CRD installer YAML using the following command: 

```
kubectl apply -f https://raw.githubusercontent.com/aws/amazon-sagemaker-operator-for-k8s/master/release/rolebased/namespaced/crd.yaml
```

To install the operator onto the cluster: 

1. Download the operator installer YAML using the following command: 

   ```
   wget https://raw.githubusercontent.com/aws/amazon-sagemaker-operator-for-k8s/master/release/rolebased/namespaced/operator.yaml
   ```

1. Update the installer YAML to place the resources into your specified namespace using the following command: 

   ```
   sed -i -e 's/PLACEHOLDER-NAMESPACE/<YOUR NAMESPACE>/g' operator.yaml
   ```

1. Edit the `operator.yaml` file to replace the value of the `eks.amazonaws.com/role-arn` annotation with the Amazon Resource Name (ARN) of the OIDC-based role you created. 

1. Use the following command to deploy the operator: 

   ```
   kubectl apply -f operator.yaml
   ```

##### Deploy the operator to your namespace using Helm Charts


There are two parts to deploying an operator within the scope of a namespace. The first is the set of CRDs that are installed at a cluster level. These resource definitions only need to be installed once per Kubernetes cluster. The second part is the operator permissions and the deployment itself. When using Helm Charts, you must first create the namespace using `kubectl`. 

1. Clone the Helm installer directory using the following command: 

   ```
   git clone https://github.com/aws/amazon-sagemaker-operator-for-k8s.git
   ```

1. Navigate to the `amazon-sagemaker-operator-for-k8s/hack/charts/installer/namespaced` folder. Edit the `rolebased/values.yaml` file, which includes high-level parameters for the chart. Replace the role ARN here with the Amazon Resource Name (ARN) for the OIDC-based role you've created. 

1. Install the Helm Chart using the following command: 

   ```
   helm install crds crd_chart/
   ```

1. Create the required namespace and install the operator using the following command: 

   ```
   kubectl create namespace <namespace>
   helm install -n <namespace> op operator_chart/
   ```

1. After a moment, the chart is installed with the name `sagemaker-operator`. Verify that the installation succeeded by running the following command: 

   ```
   helm ls
   ```

   Your output should look like the following: 

   ```
   NAME                    NAMESPACE                       REVISION        UPDATED                                 STATUS          CHART                           APP VERSION
   sagemaker-operator      my-namespace                    1               2019-11-20 23:14:59.6777082 +0000 UTC   deployed        sagemaker-k8s-operator-0.1.0
   ```

##### Verify the operator deployment to your namespace


1. You should be able to see the SageMaker AI Custom Resource Definitions (CRDs) for each operator deployed to your cluster by running the following command: 

   ```
   kubectl get crd | grep sagemaker
   ```

   Your output should look like the following: 

   ```
   batchtransformjobs.sagemaker.aws.amazon.com         2019-11-20T17:12:34Z
   endpointconfigs.sagemaker.aws.amazon.com            2019-11-20T17:12:34Z
   hostingdeployments.sagemaker.aws.amazon.com         2019-11-20T17:12:34Z
   hyperparametertuningjobs.sagemaker.aws.amazon.com   2019-11-20T17:12:34Z
   models.sagemaker.aws.amazon.com                     2019-11-20T17:12:34Z
   trainingjobs.sagemaker.aws.amazon.com               2019-11-20T17:12:34Z
   ```

1. Ensure that the operator pod is running successfully. Use the following command to list all pods: 

   ```
   kubectl -n my-namespace get pods
   ```

   You should see a pod named `sagemaker-k8s-operator-controller-manager-*****` in the namespace `my-namespace` as follows: 

   ```
   NAME                                                         READY   STATUS    RESTARTS   AGE
   sagemaker-k8s-operator-controller-manager-12345678-r8abc     2/2     Running   0          23s
   ```

#### Install the SageMaker AI logs `kubectl` plugin


 As part of the SageMaker AI Operators for Kubernetes, you can use the `smlogs` [plugin](https://kubernetes.io/docs/tasks/extend-kubectl/kubectl-plugins/) for `kubectl`. This allows you to stream SageMaker AI CloudWatch logs with `kubectl`. The plugin binary must be on your [PATH](http://www.linfo.org/path_env_var.html) so that `kubectl` can discover it. The following commands place the binary in the `sagemaker-k8s-bin` directory in your home directory and add that directory to your `PATH`. 

```
export os="linux"
  
wget https://amazon-sagemaker-operator-for-k8s-us-east-1.s3.amazonaws.com/kubectl-smlogs-plugin/v1/${os}.amd64.tar.gz
tar xvzf ${os}.amd64.tar.gz
  
# Move binaries to a directory in your homedir.
mkdir ~/sagemaker-k8s-bin
cp ./kubectl-smlogs.${os}.amd64/kubectl-smlogs ~/sagemaker-k8s-bin/.
  
# This line adds the binaries to your PATH in your .bashrc.
  
echo 'export PATH=$PATH:~/sagemaker-k8s-bin' >> ~/.bashrc
  
# Source your .bashrc to update environment variables:
source ~/.bashrc
```

Use the following command to verify that the `kubectl` plugin is installed correctly: 

```
kubectl smlogs
```

If the `kubectl` plugin is installed correctly, your output should look like the following: 

```
View SageMaker AI logs via Kubernetes
  
Usage:
  smlogs [command]
  
Aliases:
  smlogs, SMLogs, Smlogs
  
Available Commands:
  BatchTransformJob       View BatchTransformJob logs via Kubernetes
  TrainingJob             View TrainingJob logs via Kubernetes
  help                    Help about any command
  
Flags:
   -h, --help   help for smlogs
  
Use "smlogs [command] --help" for more information about a command.
```

### Clean up resources


To uninstall the operator from your cluster, you must first make sure to delete all SageMaker AI resources from the cluster. Failure to do so causes the operator delete operation to hang. Run the following commands to stop all jobs: 

```
# Delete all SageMaker AI jobs from Kubernetes
kubectl delete --all --all-namespaces hyperparametertuningjob.sagemaker.aws.amazon.com
kubectl delete --all --all-namespaces trainingjobs.sagemaker.aws.amazon.com
kubectl delete --all --all-namespaces batchtransformjob.sagemaker.aws.amazon.com
kubectl delete --all --all-namespaces hostingdeployment.sagemaker.aws.amazon.com
```

You should see output similar to the following: 

```
$ kubectl delete --all --all-namespaces trainingjobs.sagemaker.aws.amazon.com
trainingjobs.sagemaker.aws.amazon.com "xgboost-mnist-from-for-s3" deleted
  
$ kubectl delete --all --all-namespaces hyperparametertuningjob.sagemaker.aws.amazon.com
hyperparametertuningjob.sagemaker.aws.amazon.com "xgboost-mnist-hpo" deleted
  
$ kubectl delete --all --all-namespaces batchtransformjob.sagemaker.aws.amazon.com
batchtransformjob.sagemaker.aws.amazon.com "xgboost-mnist" deleted
  
$ kubectl delete --all --all-namespaces hostingdeployment.sagemaker.aws.amazon.com
hostingdeployment.sagemaker.aws.amazon.com "host-xgboost" deleted
```
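
Equivalently, you can loop over the four job resource types shown above; a small sketch:

```
# Delete every SageMaker AI job resource type in all namespaces.
for crd in hyperparametertuningjob.sagemaker.aws.amazon.com \
           trainingjobs.sagemaker.aws.amazon.com \
           batchtransformjob.sagemaker.aws.amazon.com \
           hostingdeployment.sagemaker.aws.amazon.com; do
  kubectl delete --all --all-namespaces "${crd}"
done
```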

After you delete all SageMaker AI jobs, see [Delete operators](#delete-operators) to delete the operator from your cluster.

### Delete operators


#### Delete cluster-based operators


##### Operators installed using YAML


To uninstall the operator from your cluster, make sure that all SageMaker AI resources have been deleted from the cluster. Failure to do so causes the operator delete operation to hang.

**Note**  
Before deleting your cluster, be sure to delete all SageMaker AI resources from the cluster. See [Clean up resources](#cleanup-operator-resources) for more information.

After you delete all SageMaker AI jobs, use `kubectl` to delete the operator from the cluster:

```
# Delete the operator and its resources
kubectl delete -f installer.yaml
```

You should see output similar to the following: 

```
$ kubectl delete -f raw-yaml/installer.yaml
namespace "sagemaker-k8s-operator-system" deleted
customresourcedefinition.apiextensions.k8s.io "batchtransformjobs.sagemaker.aws.amazon.com" deleted
customresourcedefinition.apiextensions.k8s.io "endpointconfigs.sagemaker.aws.amazon.com" deleted
customresourcedefinition.apiextensions.k8s.io "hostingdeployments.sagemaker.aws.amazon.com" deleted
customresourcedefinition.apiextensions.k8s.io "hyperparametertuningjobs.sagemaker.aws.amazon.com" deleted
customresourcedefinition.apiextensions.k8s.io "models.sagemaker.aws.amazon.com" deleted
customresourcedefinition.apiextensions.k8s.io "trainingjobs.sagemaker.aws.amazon.com" deleted
role.rbac.authorization.k8s.io "sagemaker-k8s-operator-leader-election-role" deleted
clusterrole.rbac.authorization.k8s.io "sagemaker-k8s-operator-manager-role" deleted
clusterrole.rbac.authorization.k8s.io "sagemaker-k8s-operator-proxy-role" deleted
rolebinding.rbac.authorization.k8s.io "sagemaker-k8s-operator-leader-election-rolebinding" deleted
clusterrolebinding.rbac.authorization.k8s.io "sagemaker-k8s-operator-manager-rolebinding" deleted
clusterrolebinding.rbac.authorization.k8s.io "sagemaker-k8s-operator-proxy-rolebinding" deleted
service "sagemaker-k8s-operator-controller-manager-metrics-service" deleted
deployment.apps "sagemaker-k8s-operator-controller-manager" deleted
secrets "sagemaker-k8s-operator-abcde" deleted
```

##### Operators installed using Helm Charts


To delete the operator CRDs, first delete all the running jobs. Then delete the Helm Chart that was used to deploy the operators using the following commands: 

```
# get the helm charts
helm ls
  
# delete the charts
helm delete <chart_name>
```

#### Delete namespace-based operators


##### Operators installed with YAML


To uninstall the operator from your cluster, first make sure that all SageMaker AI resources have been deleted from the cluster. Failure to do so causes the operator delete operation to hang.

**Note**  
Before deleting your cluster, be sure to delete all SageMaker AI resources from the cluster. See [Clean up resources](#cleanup-operator-resources) for more information.

After you delete all SageMaker AI jobs, use `kubectl` to first delete the operator from the namespace and then the CRDs from the cluster. Run the following commands to delete the operator from the cluster: 

```
# Delete the operator using the same yaml file that was used to install the operator
kubectl delete -f operator.yaml
  
# Now delete the CRDs using the CRD installer yaml
kubectl delete -f https://raw.githubusercontent.com/aws/amazon-sagemaker-operator-for-k8s/master/release/rolebased/namespaced/crd.yaml
  
# Now you can delete the namespace if you want
kubectl delete namespace <namespace>
```

##### Operators installed with Helm Charts


To delete the operator CRDs, first delete all the running jobs. Then delete the Helm Chart that was used to deploy the operators using the following commands: 

```
# Delete the operator
helm delete <chart_name>
  
# delete the crds
helm delete crds
  
# optionally delete the namespace
kubectl delete namespace <namespace>
```

### Troubleshooting


#### Debugging a failed job


Use these steps to debug a failed job.
+ Check the job status by running the following: 

  ```
  kubectl get <CRD Type> <job name>
  ```
+ If the job was created in SageMaker AI, you can use the following command to see the `STATUS` and the `SageMaker Job Name`: 

  ```
  kubectl get <crd type> <job name>
  ```
+ You can use `smlogs` to find the cause of the issue using the following command: 

  ```
  kubectl smlogs <crd type> <job name>
  ```
+  You can also use the `describe` command to get more details about the job using the following command. The output has an `additional` field that has more information about the status of the job. 

  ```
  kubectl describe <crd type> <job name>
  ```
+ If the job was not created in SageMaker AI, then use the logs of the operator's pod to find the cause of the issue as follows: 

  ```
  $ kubectl get pods -A | grep sagemaker
  # Output:
  sagemaker-k8s-operator-system   sagemaker-k8s-operator-controller-manager-5cd7df4d74-wh22z   2/2     Running   0          3h33m
    
  $ kubectl logs -p <pod name> -c manager -n sagemaker-k8s-operator-system
  ```

#### Deleting an operator CRD


If deleting a job is not working, check if the operator is running. If the operator is not running, then you have to delete the finalizer using the following steps: 

1. In a new terminal, open the job in an editor using `kubectl edit` as follows: 

   ```
   kubectl edit <crd type> <job name>
   ```

1. Edit the job to delete the finalizer by removing the following two lines from the file. Save the file, and the job is deleted. 

   ```
   finalizers:
     - sagemaker-operator-finalizer
   ```
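
   Alternatively, you can clear the finalizers non-interactively with `kubectl patch`, using a merge patch to null out the list:

   ```
   # Removing the finalizers lets Kubernetes complete the pending deletion.
   kubectl patch <crd type> <job name> --type=merge -p '{"metadata":{"finalizers":null}}'
   ```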

### Images and SMlogs in each Region


The following table lists the available operator images and `smlogs` plugin binaries in each Region. 


|  Region  |  Controller Image  |  Linux SMLogs  | 
| --- | --- | --- | 
|  us-east-1  |  957583890962.dkr.ecr.us-east-1.amazonaws.com/amazon-sagemaker-operator-for-k8s:v1  |  [https://s3.us-east-1.amazonaws.com/amazon-sagemaker-operator-for-k8s-us-east-1/kubectl-smlogs-plugin/v1/linux.amd64.tar.gz](https://s3.us-east-1.amazonaws.com/amazon-sagemaker-operator-for-k8s-us-east-1/kubectl-smlogs-plugin/v1/linux.amd64.tar.gz)  | 
|  us-east-2  |  922499468684.dkr.ecr.us-east-2.amazonaws.com/amazon-sagemaker-operator-for-k8s:v1  |  [https://s3.us-east-2.amazonaws.com/amazon-sagemaker-operator-for-k8s-us-east-2/kubectl-smlogs-plugin/v1/linux.amd64.tar.gz](https://s3.us-east-2.amazonaws.com/amazon-sagemaker-operator-for-k8s-us-east-2/kubectl-smlogs-plugin/v1/linux.amd64.tar.gz)  | 
|  us-west-2  |  640106867763.dkr.ecr.us-west-2.amazonaws.com/amazon-sagemaker-operator-for-k8s:v1  |  [https://s3.us-west-2.amazonaws.com/amazon-sagemaker-operator-for-k8s-us-west-2/kubectl-smlogs-plugin/v1/linux.amd64.tar.gz](https://s3.us-west-2.amazonaws.com/amazon-sagemaker-operator-for-k8s-us-west-2/kubectl-smlogs-plugin/v1/linux.amd64.tar.gz)  | 
|  eu-west-1  |  613661167059.dkr.ecr.eu-west-1.amazonaws.com/amazon-sagemaker-operator-for-k8s:v1  |  [https://s3.eu-west-1.amazonaws.com/amazon-sagemaker-operator-for-k8s-eu-west-1/kubectl-smlogs-plugin/v1/linux.amd64.tar.gz](https://s3.eu-west-1.amazonaws.com/amazon-sagemaker-operator-for-k8s-eu-west-1/kubectl-smlogs-plugin/v1/linux.amd64.tar.gz)  | 

# Use Amazon SageMaker AI Jobs

This section is based on the original version of [SageMaker AI Operators for Kubernetes](https://github.com/aws/amazon-sagemaker-operator-for-k8s).

**Important**  
We are stopping the development and technical support of the original version of [SageMaker Operators for Kubernetes](https://github.com/aws/amazon-sagemaker-operator-for-k8s/tree/master).  
If you are currently using version `v1.2.2` or below of [SageMaker Operators for Kubernetes](https://github.com/aws/amazon-sagemaker-operator-for-k8s/tree/master), we recommend migrating your resources to the [ACK service controller for Amazon SageMaker](https://github.com/aws-controllers-k8s/sagemaker-controller). The ACK service controller is a new generation of SageMaker Operators for Kubernetes based on [AWS Controllers for Kubernetes (ACK)](https://aws-controllers-k8s.github.io/community/).  
For information on the migration steps, see [Migrate resources to the latest Operators](kubernetes-sagemaker-operators-migrate.md).  
For answers to frequently asked questions on the end of support of the original version of SageMaker Operators for Kubernetes, see [Announcing the End of Support of the Original Version of SageMaker AI Operators for Kubernetes](kubernetes-sagemaker-operators-eos-announcement.md).

To run an Amazon SageMaker AI job using the Operators for Kubernetes, you can either apply a YAML file or use the supplied Helm Charts. 

All sample operator jobs in the following tutorials use sample data taken from a public MNIST dataset. To run these samples, download the dataset into your Amazon S3 bucket. You can find the dataset in [Download the MNIST Dataset](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-preprocess-data-pull-data.html). 

**Topics**
+ [The TrainingJob operator](#trainingjob-operator)
+ [The HyperParameterTuningJob operator](#hyperparametertuningjobs-operator)
+ [The BatchTransformJob operator](#batchtransformjobs-operator)
+ [The HostingDeployment operator](#hosting-deployment-operator)
+ [The ProcessingJob operator](#kubernetes-processing-job-operator)
+ [HostingAutoscalingPolicy (HAP) Operator](#kubernetes-hap-operator)

## The TrainingJob operator


The TrainingJob operator reconciles your specified training job spec with SageMaker AI by launching the job for you in SageMaker AI. You can learn more about SageMaker training jobs in the SageMaker AI [CreateTrainingJob API documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateTrainingJob.html). 

**Topics**
+ [Create a TrainingJob using a YAML file](#create-a-trainingjob-using-a-simple-yaml-file)
+ [Create a TrainingJob Using a Helm Chart](#create-a-trainingjob-using-a-helm-chart)
+ [List TrainingJobs](#list-training-jobs)
+ [Describe a TrainingJob](#describe-a-training-job)
+ [View logs from TrainingJobs](#view-logs-from-training-jobs)
+ [Delete TrainingJobs](#delete-training-jobs)

### Create a TrainingJob using a YAML file


1. Download the sample YAML file for training using the following command: 

   ```
   wget https://raw.githubusercontent.com/aws/amazon-sagemaker-operator-for-k8s/master/samples/xgboost-mnist-trainingjob.yaml
   ```

1. Edit the `xgboost-mnist-trainingjob.yaml` file to replace the `roleArn` parameter with your `<sagemaker-execution-role>`, and `outputPath` with an Amazon S3 bucket to which the SageMaker AI execution role has write access. The `roleArn` must have permissions so that SageMaker AI can access Amazon S3, Amazon CloudWatch, and other services on your behalf. For more information on creating a SageMaker AI execution role, see [SageMaker AI Roles](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html#sagemaker-roles-createtrainingjob-perms). Apply the YAML file using the following command (an illustrative excerpt of the fields to edit follows this step): 

   ```
   kubectl apply -f xgboost-mnist-trainingjob.yaml
   ```
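
   For orientation, the fields to edit sit at the top level of the spec. The following illustrative excerpt mirrors the field names shown by `kubectl describe` later in this section; check your downloaded file for the authoritative layout:

   ```
   apiVersion: sagemaker.aws.amazon.com/v1
   kind: TrainingJob
   metadata:
     name: xgboost-mnist-from-for-s3
   spec:
     roleArn: arn:aws:iam::111122223333:role/service-role/AmazonSageMaker-ExecutionRole  # your execution role
     region: us-east-2
     outputDataConfig:
       s3OutputPath: s3://my-bucket/sagemaker/xgboost-mnist/xgboost/                     # your bucket
   ```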

### Create a TrainingJob Using a Helm Chart


You can use Helm Charts to run TrainingJobs. 

1. Clone the GitHub repository to get the source using the following command: 

   ```
   git clone https://github.com/aws/amazon-sagemaker-operator-for-k8s.git
   ```

1. Navigate to the `amazon-sagemaker-operator-for-k8s/hack/charts/training-jobs/` folder and edit the `values.yaml` file to replace values like `rolearn` and `outputpath` with values that correspond to your account. The role ARN must have permissions so that SageMaker AI can access Amazon S3, Amazon CloudWatch, and other services on your behalf. For more information on creating a SageMaker AI execution role, see [SageMaker AI Roles](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html#sagemaker-roles-createtrainingjob-perms). 

#### Create the TrainingJob


With the roles and Amazon S3 buckets replaced with appropriate values in `values.yaml`, you can create a training job using the following command: 

```
helm install . --generate-name
```

Your output should look like the following: 

```
NAME: chart-12345678
LAST DEPLOYED: Wed Nov 20 23:35:49 2019
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Thanks for installing the sagemaker-k8s-trainingjob.
```

#### Verify your training Helm Chart


To verify that the Helm Chart was created successfully, run: 

```
helm ls
```

Your output should look like the following: 

```
NAME                    NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                           APP VERSION
chart-12345678        default         1               2019-11-20 23:35:49.9136092 +0000 UTC   deployed        sagemaker-k8s-trainingjob-0.1.0
rolebased-12345678    default         1               2019-11-20 23:14:59.6777082 +0000 UTC   deployed        sagemaker-k8s-operator-0.1.0
```

`helm install` creates a `TrainingJob` Kubernetes resource. The operator launches the actual training job in SageMaker AI and updates the `TrainingJob` Kubernetes resource to reflect the status of the job in SageMaker AI. You incur charges for SageMaker AI resources used during the duration of your job. You do not incur any charges once your job completes or stops. 

**Note**: SageMaker AI does not allow you to update a running training job. You cannot edit any parameter and re-apply the config file. Either change the metadata name or delete the existing job and create a new one. Similar to existing training job operators like TFJob in Kubeflow, `update` is not supported. 

### List TrainingJobs


Use the following command to list all jobs created using the Kubernetes operator: 

```
kubectl get TrainingJob
```

The output listing all jobs should look like the following: 

```
kubectl get trainingjobs
NAME                        STATUS       SECONDARY-STATUS   CREATION-TIME          SAGEMAKER-JOB-NAME
xgboost-mnist-from-for-s3   InProgress   Starting           2019-11-20T23:42:35Z   xgboost-mnist-from-for-s3-examplef11eab94e0ed4671d5a8f
```

A training job continues to be listed after the job has completed or failed. You can remove a `TrainingJob` job from the list by following the [Delete TrainingJobs](#delete-training-jobs) steps. Jobs that have completed or stopped do not incur any charges for SageMaker AI resources. 
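
If you need the SageMaker AI job name in a script, you can read it from the resource's status. A sketch, assuming the status field corresponding to the `Sage Maker Training Job Name` line shown by `kubectl describe` below is `sageMakerTrainingJobName`:

```
# Print the SageMaker-side name for the Kubernetes TrainingJob resource.
kubectl get trainingjob xgboost-mnist-from-for-s3 \
    -o jsonpath='{.status.sageMakerTrainingJobName}'
```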

#### TrainingJob status values


The `STATUS` field can be one of the following values: 
+ `Completed` 
+ `InProgress` 
+ `Failed` 
+ `Stopped` 
+ `Stopping` 

These statuses come directly from the official SageMaker AI [API documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/API_DescribeTrainingJob.html#SageMaker-DescribeTrainingJob-response-TrainingJobStatus). 

In addition to the official SageMaker AI status, it is possible for `STATUS` to be `SynchronizingK8sJobWithSageMaker`. This means that the operator has not yet processed the job. 

#### Secondary status values


The secondary statuses come directly from the official SageMaker AI [API documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/API_DescribeTrainingJob.html#SageMaker-DescribeTrainingJob-response-SecondaryStatus). They contain more granular information about the status of the job. 

### Describe a TrainingJob


You can get more details about a training job by using the `kubectl describe` command. This is typically used for debugging a problem or checking the parameters of a training job. To get information about your training job, use the following command: 

```
kubectl describe trainingjob xgboost-mnist-from-for-s3
```

The output for your training job should look like the following: 

```
Name:         xgboost-mnist-from-for-s3
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  sagemaker.aws.amazon.com/v1
Kind:         TrainingJob
Metadata:
  Creation Timestamp:  2019-11-20T23:42:35Z
  Finalizers:
    sagemaker-operator-finalizer
  Generation:        2
  Resource Version:  23119
  Self Link:         /apis/sagemaker.aws.amazon.com/v1/namespaces/default/trainingjobs/xgboost-mnist-from-for-s3
  UID:               6d7uiui-0bef-11ea-b94e-0ed467example
Spec:
  Algorithm Specification:
    Training Image:       8256416981234.dkr.ecr.us-east-2.amazonaws.com/xgboost:1
    Training Input Mode:  File
  Hyper Parameters:
    Name:   eta
    Value:  0.2
    Name:   gamma
    Value:  4
    Name:   max_depth
    Value:  5
    Name:   min_child_weight
    Value:  6
    Name:   num_class
    Value:  10
    Name:   num_round
    Value:  10
    Name:   objective
    Value:  multi:softmax
    Name:   silent
    Value:  0
  Input Data Config:
    Channel Name:      train
    Compression Type:  None
    Content Type:      text/csv
    Data Source:
      S 3 Data Source:
        S 3 Data Distribution Type:  FullyReplicated
        S 3 Data Type:               S3Prefix
        S 3 Uri:                     https://s3-us-east-2.amazonaws.com/amzn-s3-demo-bucket/sagemaker/xgboost-mnist/train/
    Channel Name:                    validation
    Compression Type:                None
    Content Type:                    text/csv
    Data Source:
      S 3 Data Source:
        S 3 Data Distribution Type:  FullyReplicated
        S 3 Data Type:               S3Prefix
        S 3 Uri:                     https://s3-us-east-2.amazonaws.com/amzn-s3-demo-bucket/sagemaker/xgboost-mnist/validation/
  Output Data Config:
    S 3 Output Path:  s3://amzn-s3-demo-bucket/sagemaker/xgboost-mnist/xgboost/
  Region:             us-east-2
  Resource Config:
    Instance Count:     1
    Instance Type:      ml.m4.xlarge
    Volume Size In GB:  5
  Role Arn:             arn:aws:iam::12345678910:role/service-role/AmazonSageMaker-ExecutionRole
  Stopping Condition:
    Max Runtime In Seconds:  86400
  Training Job Name:         xgboost-mnist-from-for-s3-6d7fa0af0bef11eab94e0example
Status:
  Cloud Watch Log URL:           https://us-east-2.console.aws.amazon.com/cloudwatch/home?region=us-east-2#logStream:group=/aws/sagemaker/TrainingJobs;prefix=<example>;streamFilter=typeLogStreamPrefix
  Last Check Time:               2019-11-20T23:44:29Z
  Sage Maker Training Job Name:  xgboost-mnist-from-for-s3-6d7fa0af0bef11eab94eexample
  Secondary Status:              Downloading
  Training Job Status:           InProgress
Events:                          <none>
```

### View logs from TrainingJobs


Use the following command to see the logs from the `xgboost-mnist-from-for-s3` training job: 

```
kubectl smlogs trainingjob xgboost-mnist-from-for-s3
```

Your output should look similar to the following. The logs from instances are ordered chronologically. 

```
"xgboost-mnist-from-for-s3" has SageMaker TrainingJobName "xgboost-mnist-from-for-s3-123456789" in region "us-east-2", status "InProgress" and secondary status "Starting"
xgboost-mnist-from-for-s3-6d7fa0af0bef11eab94e0ed46example/algo-1-1574293123 2019-11-20 23:45:24.7 +0000 UTC Arguments: train
xgboost-mnist-from-for-s3-6d7fa0af0bef11eab94e0ed46example/algo-1-1574293123 2019-11-20 23:45:24.7 +0000 UTC [2019-11-20:23:45:22:INFO] Running standalone xgboost training.
xgboost-mnist-from-for-s3-6d7fa0af0bef11eab94e0ed46example/algo-1-1574293123 2019-11-20 23:45:24.7 +0000 UTC [2019-11-20:23:45:22:INFO] File size need to be processed in the node: 1122.95mb. Available memory size in the node: 8586.0mb
xgboost-mnist-from-for-s3-6d7fa0af0bef11eab94e0ed46example/algo-1-1574293123 2019-11-20 23:45:24.7 +0000 UTC [2019-11-20:23:45:22:INFO] Determined delimiter of CSV input is ','
xgboost-mnist-from-for-s3-6d7fa0af0bef11eab94e0ed46example/algo-1-1574293123 2019-11-20 23:45:24.7 +0000 UTC [23:45:22] S3DistributionType set as FullyReplicated
```

### Delete TrainingJobs


Use the following command to stop a training job on Amazon SageMaker AI: 

```
kubectl delete trainingjob xgboost-mnist-from-for-s3
```

This command removes the training job resource from Kubernetes and returns the following output: 

```
trainingjob.sagemaker.aws.amazon.com "xgboost-mnist-from-for-s3" deleted
```

If the job is still in progress on SageMaker AI, the job stops. You do not incur any charges for SageMaker AI resources after your job stops or completes. 

**Note**: SageMaker AI does not delete training jobs. Stopped jobs continue to show on the SageMaker AI console. The `delete` command takes about 2 minutes to clean up the resources from SageMaker AI. 
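If you want to confirm the final state of a stopped or completed job directly in SageMaker AI, you can query it with the AWS CLI. This is a minimal sketch; the job name below is the sample SageMaker AI job name from the earlier listing, so substitute your own: 

```
# Query SageMaker AI directly for the job's status after deleting the Kubernetes resource.
aws sagemaker describe-training-job \
  --region us-east-2 \
  --training-job-name xgboost-mnist-from-for-s3-6d7fa0af0bef11eab94e0example \
  --query TrainingJobStatus
```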

## The HyperParameterTuningJob operator


Hyperparameter tuning job operators reconcile your specified hyperparameter tuning job spec to SageMaker AI by launching it in SageMaker AI. You can learn more about SageMaker AI hyperparameter tuning jobs in the SageMaker AI [CreateHyperParameterTuningJob API documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateHyperParameterTuningJob.html). 

**Topics**
+ [

### Create a HyperparameterTuningJob using a YAML file
](#create-a-hyperparametertuningjob-using-a-simple-yaml-file)
+ [

### Create a HyperparameterTuningJob using a Helm Chart
](#create-a-hyperparametertuningjob-using-a-helm-chart)
+ [

### List HyperparameterTuningJobs
](#list-hyperparameter-tuning-jobs)
+ [

### Describe a HyperparameterTuningJob
](#describe-a-hyperparameter-tuning-job)
+ [

### View logs from HyperparameterTuningJobs
](#view-logs-from-hyperparametertuning-jobs)
+ [

### Delete a HyperparameterTuningJob
](#delete-hyperparametertuning-jobs)

### Create a HyperparameterTuningJob using a YAML file


1. Download the sample YAML file for the hyperparameter tuning job using the following command: 

   ```
   wget https://raw.githubusercontent.com/aws/amazon-sagemaker-operator-for-k8s/master/samples/xgboost-mnist-hpo.yaml
   ```

1. Edit the `xgboost-mnist-hpo.yaml` file to replace the `roleArn` parameter with your `sagemaker-execution-role`. For the hyperparameter tuning job to succeed, you must also change the `s3InputPath` and `s3OutputPath` to values that correspond to your account (a sketch of these fields follows these steps). Apply the updated YAML file using the following command: 

   ```
   kubectl apply -f xgboost-mnist-hpo.yaml
   ```
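For reference, the values to edit in `xgboost-mnist-hpo.yaml` look roughly like the following. This is an illustrative sketch, not the file's literal layout, so match these keys to where they appear in the sample you downloaded: 

```
# Illustrative snippet only; locate these keys in the downloaded sample file.
roleArn: arn:aws:iam::<your-account-id>:role/service-role/<your-sagemaker-execution-role>
s3InputPath: s3://<your-bucket>/sagemaker/xgboost-mnist/train/
s3OutputPath: s3://<your-bucket>/sagemaker/xgboost-mnist/output/
```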

### Create a HyperparameterTuningJob using a Helm Chart


You can use Helm Charts to run hyperparameter tuning jobs. 

1. Clone the GitHub repository to get the source using the following command: 

   ```
   git clone https://github.com/aws/amazon-sagemaker-operator-for-k8s.git
   ```

1. Navigate to the `amazon-sagemaker-operator-for-k8s/hack/charts/hyperparameter-tuning-jobs/` folder. 

1. Edit the `values.yaml` file to replace the `roleArn` parameter with your `sagemaker-execution-role`. For the hyperparameter tuning job to succeed, you must also change the `s3InputPath` and `s3OutputPath` to values that correspond to your account. 

#### Create the HyperparameterTuningJob


With the roles and Amazon S3 paths replaced with appropriate values in `values.yaml`, you can create a hyperparameter tuning job using the following command: 

```
helm install . --generate-name
```

Your output should look similar to the following: 

```
NAME: chart-1574292948
LAST DEPLOYED: Wed Nov 20 23:35:49 2019
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Thanks for installing the sagemaker-k8s-hyperparametertuningjob.
```

#### Verify chart installation


To verify that the Helm Chart was created successfully, run the following command: 

```
helm ls
```

Your output should look like the following: 

```
NAME                    NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                           APP VERSION
chart-1474292948        default         1               2019-11-20 23:35:49.9136092 +0000 UTC   deployed        sagemaker-k8s-hyperparametertuningjob-0.1.0
chart-1574292948        default         1               2019-11-20 23:35:49.9136092 +0000 UTC   deployed        sagemaker-k8s-trainingjob-0.1.0
rolebased-1574291698    default         1               2019-11-20 23:14:59.6777082 +0000 UTC   deployed        sagemaker-k8s-operator-0.1.0
```

`helm install` creates a `HyperParameterTuningJob` Kubernetes resource. The operator launches the actual hyperparameter optimization job in SageMaker AI and updates the `HyperParameterTuningJob` Kubernetes resource to reflect the status of the job in SageMaker AI. You incur charges for SageMaker AI resources used during the duration of your job. You do not incur any charges once your job completes or stops. 

**Note**: SageMaker AI does not allow you to update a running hyperparameter tuning job. You cannot edit any parameter and re-apply the config file. You must either change the metadata name or delete the existing job and create a new one. Similar to existing training job operators like `TFJob` in Kubeflow, `update` is not supported. 
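For example, to change a tuning parameter, you can delete the existing job and re-create it from the edited file: 

```
# Remove the existing tuning job, then re-create it from the updated spec.
kubectl delete hyperparametertuningjob xgboost-mnist-hpo
kubectl apply -f xgboost-mnist-hpo.yaml
```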

### List HyperparameterTuningJobs


Use the following command to list all jobs created using the Kubernetes operator: 

```
kubectl get hyperparametertuningjob
```

Your output should look like the following: 

```
NAME                STATUS      CREATION-TIME          COMPLETED   INPROGRESS   ERRORS   STOPPED   BEST-TRAINING-JOB                              SAGEMAKER-JOB-NAME
xgboost-mnist-hpo   Completed   2019-10-17T01:15:52Z   10          0            0        0         xgboostha92f5e3cf07b11e9bf6c06d6-009-4c7a123   xgboostha92f5e3cf07b11e9bf6c123
```

A hyperparameter tuning job continues to be listed after the job has completed or failed. You can remove a `hyperparametertuningjob` from the list by following the steps in [Delete a HyperparameterTuningJob](#delete-hyperparametertuning-jobs). Jobs that have completed or stopped do not incur any charges for SageMaker AI resources.

#### Hyperparameter tuning job status values


The `STATUS` field can be one of the following values: 
+ `Completed` 
+ `InProgress` 
+ `Failed` 
+ `Stopped` 
+ `Stopping` 

These statuses come directly from the SageMaker AI official [API documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/API_DescribeHyperParameterTuningJob.html#SageMaker-DescribeHyperParameterTuningJob-response-HyperParameterTuningJobStatus). 

In addition to the official SageMaker AI status, it is possible for `STATUS` to be `SynchronizingK8sJobWithSageMaker`. This means that the operator has not yet processed the job. 

#### Status counters


The output has several counters, like `COMPLETED` and `INPROGRESS`. These represent how many training jobs have completed and are in progress, respectively. For more information about how these are determined, see [TrainingJobStatusCounters](https://docs.aws.amazon.com/sagemaker/latest/dg/API_TrainingJobStatusCounters.html) in the SageMaker API documentation. 

#### Best TrainingJob


This column contains the name of the `TrainingJob` that best optimized the selected metric. 

To see a summary of the tuned hyperparameters, run: 

```
kubectl describe hyperparametertuningjob xgboost-mnist-hpo
```

To see detailed information about the `TrainingJob`, run: 

```
kubectl describe trainingjobs <job-name>
```

#### Spawned TrainingJobs


You can also track all 10 training jobs in Kubernetes launched by `HyperparameterTuningJob` by running the following command: 

```
kubectl get trainingjobs
```

### Describe a HyperparameterTuningJob


You can obtain debugging details using the `kubectl describe` command.

```
kubectl describe hyperparametertuningjob xgboost-mnist-hpo
```

In addition to information about the tuning job, the SageMaker AI Operator for Kubernetes also exposes the [best training job](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-monitor.html#automatic-model-tuning-best-training-job) found by the hyperparameter tuning job in the `describe` output as follows: 

```
Name:         xgboost-mnist-hpo
Namespace:    default
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"sagemaker.aws.amazon.com/v1","kind":"HyperparameterTuningJob","metadata":{"annotations":{},"name":"xgboost-mnist-hpo","namespace":...
API Version:  sagemaker.aws.amazon.com/v1
Kind:         HyperparameterTuningJob
Metadata:
  Creation Timestamp:  2019-10-17T01:15:52Z
  Finalizers:
    sagemaker-operator-finalizer
  Generation:        2
  Resource Version:  8167
  Self Link:         /apis/sagemaker.aws.amazon.com/v1/namespaces/default/hyperparametertuningjobs/xgboost-mnist-hpo
  UID:               a92f5e3c-f07b-11e9-bf6c-06d6f303uidu
Spec:
  Hyper Parameter Tuning Job Config:
    Hyper Parameter Tuning Job Objective:
      Metric Name:  validation:error
      Type:         Minimize
    Parameter Ranges:
      Integer Parameter Ranges:
        Max Value:     20
        Min Value:     10
        Name:          num_round
        Scaling Type:  Linear
    Resource Limits:
      Max Number Of Training Jobs:     10
      Max Parallel Training Jobs:      10
    Strategy:                          Bayesian
    Training Job Early Stopping Type:  Off
  Hyper Parameter Tuning Job Name:     xgboostha92f5e3cf07b11e9bf6c06d6
  Region:                              us-east-2
  Training Job Definition:
    Algorithm Specification:
      Training Image:       12345678910.dkr.ecr.us-east-2.amazonaws.com/xgboost:1
      Training Input Mode:  File
    Input Data Config:
      Channel Name:  train
      Content Type:  text/csv
      Data Source:
        s3DataSource:
          s3DataDistributionType:  FullyReplicated
          s3DataType:              S3Prefix
          s3Uri:                   https://s3-us-east-2.amazonaws.com/amzn-s3-demo-bucket/sagemaker/xgboost-mnist/train/
      Channel Name:                validation
      Content Type:                text/csv
      Data Source:
        s3DataSource:
          s3DataDistributionType:  FullyReplicated
          s3DataType:              S3Prefix
          s3Uri:                   https://s3-us-east-2.amazonaws.com/amzn-s3-demo-bucket/sagemaker/xgboost-mnist/validation/
    Output Data Config:
      s3OutputPath:  https://s3-us-east-2.amazonaws.com/amzn-s3-demo-bucket/sagemaker/xgboost-mnist/xgboost
    Resource Config:
      Instance Count:     1
      Instance Type:      ml.m4.xlarge
      Volume Size In GB:  5
    Role Arn:             arn:aws:iam::123456789012:role/service-role/AmazonSageMaker-ExecutionRole
    Static Hyper Parameters:
      Name:   base_score
      Value:  0.5
      Name:   booster
      Value:  gbtree
      Name:   csv_weights
      Value:  0
      Name:   dsplit
      Value:  row
      Name:   grow_policy
      Value:  depthwise
      Name:   lambda_bias
      Value:  0.0
      Name:   max_bin
      Value:  256
      Name:   max_leaves
      Value:  0
      Name:   normalize_type
      Value:  tree
      Name:   objective
      Value:  reg:linear
      Name:   one_drop
      Value:  0
      Name:   prob_buffer_row
      Value:  1.0
      Name:   process_type
      Value:  default
      Name:   rate_drop
      Value:  0.0
      Name:   refresh_leaf
      Value:  1
      Name:   sample_type
      Value:  uniform
      Name:   scale_pos_weight
      Value:  1.0
      Name:   silent
      Value:  0
      Name:   sketch_eps
      Value:  0.03
      Name:   skip_drop
      Value:  0.0
      Name:   tree_method
      Value:  auto
      Name:   tweedie_variance_power
      Value:  1.5
    Stopping Condition:
      Max Runtime In Seconds:  86400
Status:
  Best Training Job:
    Creation Time:  2019-10-17T01:16:14Z
    Final Hyper Parameter Tuning Job Objective Metric:
      Metric Name:        validation:error
      Value:
    Objective Status:     Succeeded
    Training End Time:    2019-10-17T01:20:24Z
    Training Job Arn:     arn:aws:sagemaker:us-east-2:123456789012:training-job/xgboostha92f5e3cf07b11e9bf6c06d6-009-4sample
    Training Job Name:    xgboostha92f5e3cf07b11e9bf6c06d6-009-4c7a3059
    Training Job Status:  Completed
    Training Start Time:  2019-10-17T01:18:35Z
    Tuned Hyper Parameters:
      Name:                                    num_round
      Value:                                   18
  Hyper Parameter Tuning Job Status:           Completed
  Last Check Time:                             2019-10-17T01:21:01Z
  Sage Maker Hyper Parameter Tuning Job Name:  xgboostha92f5e3cf07b11e9bf6c06d6
  Training Job Status Counters:
    Completed:            10
    In Progress:          0
    Non Retryable Error:  0
    Retryable Error:      0
    Stopped:              0
    Total Error:          0
Events:                   <none>
```

### View logs from HyperparameterTuningJobs


Hyperparameter tuning jobs do not have logs, but all training jobs launched by them do have logs. These logs can be accessed as if they were a normal training job. For more information, see [View logs from TrainingJobs](#view-logs-from-training-jobs).
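For example, to view the logs of one of the spawned training jobs, pass its name (from `kubectl get trainingjobs`) to `smlogs`; the job name below is the sample name from the describe output above: 

```
kubectl smlogs trainingjob xgboostha92f5e3cf07b11e9bf6c06d6-009-4c7a3059
```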

### Delete a HyperparameterTuningJob


Use the following command to stop a hyperparameter tuning job in SageMaker AI. 

```
kubectl delete hyperparametertuningjob xgboost-mnist-hpo
```

This command removes the hyperparameter tuning job and associated training jobs from your Kubernetes cluster and stops them in SageMaker AI. Jobs that have stopped or completed do not incur any charges for SageMaker AI resources. SageMaker AI does not delete hyperparameter tuning jobs. Stopped jobs continue to show on the SageMaker AI console. 

Your output should look like the following: 

```
hyperparametertuningjob.sagemaker.aws.amazon.com "xgboost-mnist-hpo" deleted
```

**Note**: The delete command takes about 2 minutes to clean up the resources from SageMaker AI. 

## The BatchTransformJob operator


Batch transform job operators reconcile your specified batch transform job spec to SageMaker AI by launching it in SageMaker AI. You can learn more about SageMaker AI batch transform jobs in the SageMaker AI [CreateTransformJob API documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateTransformJob.html). 

**Topics**
+ [

### Create a BatchTransformJob using a YAML File
](#create-a-batchtransformjob-using-a-simple-yaml-file)
+ [

### Create a BatchTransformJob using a Helm Chart
](#create-a-batchtransformjob-using-a-helm-chart)
+ [

### List BatchTransformJobs
](#list-batch-transform-jobs)
+ [

### Describe a BatchTransformJob
](#describe-a-batch-transform-job)
+ [

### View logs from BatchTransformJobs
](#view-logs-from-batch-transform-jobs)
+ [

### Delete a BatchTransformJob
](#delete-a-batch-transform-job)

### Create a BatchTransformJob using a YAML File


1. Download the sample YAML file for the batch transform job using the following command: 

   ```
   wget https://raw.githubusercontent.com/aws/amazon-sagemaker-operator-for-k8s/master/samples/xgboost-mnist-batchtransform.yaml
   ```

1. Edit the `xgboost-mnist-batchtransform.yaml` file to replace the `inputdataconfig` with your input data and `s3OutputPath` with an Amazon S3 bucket that the SageMaker AI execution role has write access to. 

1. Apply the YAML file using the following command: 

   ```
   kubectl apply -f xgboost-mnist-batchtransform.yaml
   ```

### Create a BatchTransformJob using a Helm Chart


You can use Helm Charts to run batch transform jobs. 

#### Get the Helm installer directory


Clone the GitHub repository to get the source using the following command: 

```
git clone https://github.com/aws/amazon-sagemaker-operator-for-k8s.git
```

#### Configure the Helm Chart


Navigate to the `amazon-sagemaker-operator-for-k8s/hack/charts/batch-transform-jobs/` folder. 

Edit the `values.yaml` file to replace the `inputdataconfig` with your input data and `outputPath` with an S3 bucket to which the SageMaker AI execution role has write access. 

#### Create a BatchTransformJob


1. Use the following command to create a batch transform job: 

   ```
   helm install . --generate-name
   ```

   Your output should look like the following: 

   ```
   NAME: chart-1574292948
   LAST DEPLOYED: Wed Nov 20 23:35:49 2019
   NAMESPACE: default
   STATUS: deployed
   REVISION: 1
   TEST SUITE: None
   NOTES:
   Thanks for installing the sagemaker-k8s-batch-transform-job.
   ```

1. To verify that the Helm Chart was created successfully, run the following command: 

   ```
   helm ls
   NAME                    NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                           APP VERSION
   chart-1474292948        default         1               2019-11-20 23:35:49.9136092 +0000 UTC   deployed        sagemaker-k8s-batchtransformjob-0.1.0
   chart-1474292948        default         1               2019-11-20 23:35:49.9136092 +0000 UTC   deployed        sagemaker-k8s-hyperparametertuningjob-0.1.0
   chart-1574292948        default         1               2019-11-20 23:35:49.9136092 +0000 UTC   deployed        sagemaker-k8s-trainingjob-0.1.0
   rolebased-1574291698    default         1               2019-11-20 23:14:59.6777082 +0000 UTC   deployed        sagemaker-k8s-operator-0.1.0
   ```

   This command creates a `BatchTransformJob` Kubernetes resource. The operator launches the actual transform job in SageMaker AI and updates the `BatchTransformJob` Kubernetes resource to reflect the status of the job in SageMaker AI. You incur charges for SageMaker AI resources used during the duration of your job. You do not incur any charges once your job completes or stops. 

**Note**: SageMaker AI does not allow you to update a running batch transform job. You cannot edit any parameter and re-apply the config file. You must either change the metadata name or delete the existing job and create a new one. Similar to existing training job operators like `TFJob` in Kubeflow, `update` is not supported. 
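As with tuning jobs, you can delete the existing batch transform job and re-create it from the edited file: 

```
kubectl delete batchtransformjob xgboost-mnist-batch-transform
kubectl apply -f xgboost-mnist-batchtransform.yaml
```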

### List BatchTransformJobs


Use the following command to list all jobs created using the Kubernetes operator: 

```
kubectl get batchtransformjob
```

Your output should look like the following: 

```
NAME                                STATUS      CREATION-TIME          SAGEMAKER-JOB-NAME
xgboost-mnist-batch-transform       Completed   2019-11-18T03:44:00Z   xgboost-mnist-a88fb19809b511eaac440aa8axgboost
```

A batch transform job continues to be listed after the job has completed or failed. You can remove a `batchtransformjob` from the list by following the [Delete a BatchTransformJob](#delete-a-batch-transform-job) steps. Jobs that have completed or stopped do not incur any charges for SageMaker AI resources. 

#### Batch transform status values


The `STATUS` field can be one of the following values: 
+ `Completed` 
+ `InProgress` 
+ `Failed` 
+ `Stopped` 
+ `Stopping` 

These statuses come directly from the SageMaker AI official [API documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/API_DescribeTransformJob.html#SageMaker-DescribeTransformJob-response-TransformJobStatus). 

In addition to the official SageMaker AI status, it is possible for `STATUS` to be `SynchronizingK8sJobWithSageMaker`. This means that the operator has not yet processed the job.

### Describe a BatchTransformJob


You can obtain debugging details using the `kubectl describe` command.

```
kubectl describe batchtransformjob xgboost-mnist-batch-transform
```

Your output should look like the following: 

```
Name:         xgboost-mnist-batch-transform
Namespace:    default
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"sagemaker.aws.amazon.com/v1","kind":"BatchTransformJob","metadata":{"annotations":{},"name":"xgboost-mnist","namespace"...
API Version:  sagemaker.aws.amazon.com/v1
Kind:         BatchTransformJob
Metadata:
  Creation Timestamp:  2019-11-18T03:44:00Z
  Finalizers:
    sagemaker-operator-finalizer
  Generation:        2
  Resource Version:  21990924
  Self Link:         /apis/sagemaker.aws.amazon.com/v1/namespaces/default/batchtransformjobs/xgboost-mnist
  UID:               a88fb198-09b5-11ea-ac44-0aa8a9UIDNUM
Spec:
  Model Name:  TrainingJob-20190814SMJOb-IKEB
  Region:      us-east-1
  Transform Input:
    Content Type:  text/csv
    Data Source:
      S 3 Data Source:
        S 3 Data Type:  S3Prefix
        S 3 Uri:        s3://amzn-s3-demo-bucket/mnist_kmeans_example/input
  Transform Job Name:   xgboost-mnist-a88fb19809b511eaac440aa8a9SMJOB
  Transform Output:
    S 3 Output Path:  s3://amzn-s3-demo-bucket/mnist_kmeans_example/output
  Transform Resources:
    Instance Count:  1
    Instance Type:   ml.m4.xlarge
Status:
  Last Check Time:                2019-11-19T22:50:40Z
  Sage Maker Transform Job Name:  xgboost-mnist-a88fb19809b511eaac440aaSMJOB
  Transform Job Status:           Completed
Events:                           <none>
```

### View logs from BatchTransformJobs


Use the following command to see the logs from the `xgboost-mnist` batch transform job: 

```
kubectl smlogs batchtransformjob xgboost-mnist-batch-transform
```

### Delete a BatchTransformJob


Use the following command to stop a batch transform job in SageMaker AI. 

```
kubectl delete batchTransformJob xgboost-mnist-batch-transform
```

Your output should look like the following: 

```
batchtransformjob.sagemaker.aws.amazon.com "xgboost-mnist" deleted
```

This command removes the batch transform job from your Kubernetes cluster and stops it in SageMaker AI. Jobs that have stopped or completed do not incur any charges for SageMaker AI resources. Delete takes about 2 minutes to clean up the resources from SageMaker AI. 

**Note**: SageMaker AI does not delete batch transform jobs. Stopped jobs continue to show on the SageMaker AI console. 

## The HostingDeployment operator


HostingDeployment operators support creating and deleting an endpoint, as well as updating an existing endpoint, for real-time inference. The hosting deployment operator reconciles your specified hosting deployment job spec to SageMaker AI by creating models, endpoint configs, and endpoints in SageMaker AI. You can learn more about SageMaker AI inference in the SageMaker AI [CreateEndpoint API documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateEndpoint.html). 

**Topics**
+ [

### Configure a HostingDeployment resource
](#configure-a-hostingdeployment-resource)
+ [

### Create a HostingDeployment
](#create-a-hostingdeployment)
+ [

### List HostingDeployments
](#list-hostingdeployments)
+ [

### Describe a HostingDeployment
](#describe-a-hostingdeployment)
+ [

### Invoking the endpoint
](#invoking-the-endpoint)
+ [

### Update HostingDeployment
](#update-hostingdeployment)
+ [

### Delete the HostingDeployment
](#delete-the-hostingdeployment)

### Configure a HostingDeployment resource


Download the sample YAML file for the hosting deployment job using the following command: 

```
wget https://raw.githubusercontent.com/aws/amazon-sagemaker-operator-for-k8s/master/samples/xgboost-mnist-hostingdeployment.yaml
```

The `xgboost-mnist-hostingdeployment.yaml` file has the following components that can be edited as required: 
+ *ProductionVariants*. A production variant is a set of instances serving a single model. SageMaker AI load-balances between all production variants according to the weights that you set. 
+ *Models*. A model is the containers and execution role ARN necessary to serve a model. It requires at least a single container. 
+ *Containers*. A container specifies the dataset and serving image. If you are using your own custom algorithm instead of an algorithm provided by SageMaker AI, the inference code must meet SageMaker AI requirements. For more information, see [Using Your Own Algorithms with SageMaker AI](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms.html). 

### Create a HostingDeployment


To create a HostingDeployment, use `kubectl` to apply your configured YAML file (saved in this example as `hosting.yaml`) with the following command: 

```
kubectl apply -f hosting.yaml
```

SageMaker AI creates an endpoint with the specified configuration. You incur charges for SageMaker AI resources used during the lifetime of your endpoint. You do not incur any charges once your endpoint is deleted. 

The creation process takes approximately 10 minutes. 
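To follow the endpoint's progress without re-running the command, you can add the `kubectl` watch flag; the status changes from `Creating` to `InService` when the endpoint is ready: 

```
# Watch the HostingDeployment until its status reaches InService.
kubectl get hostingdeployments -w
```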

### List HostingDeployments


To verify that the HostingDeployment was created, use the following command: 

```
kubectl get hostingdeployments
```

Your output should look like the following: 

```
NAME           STATUS     SAGEMAKER-ENDPOINT-NAME
host-xgboost   Creating   host-xgboost-def0e83e0d5f11eaaa450aSMLOGS
```

#### HostingDeployment status values


The status field can be one of several values: 
+ `SynchronizingK8sJobWithSageMaker`: The operator is preparing to create the endpoint. 
+ `ReconcilingEndpoint`: The operator is creating, updating, or deleting endpoint resources. If the HostingDeployment remains in this state, use `kubectl describe` to see the reason in the `Additional` field. 
+ `OutOfService`: The endpoint is not available to take incoming requests. 
+ `Creating`: [CreateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateEndpoint.html) is running. 
+ `Updating`: [UpdateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/API_UpdateEndpoint.html) or [UpdateEndpointWeightsAndCapacities](https://docs.aws.amazon.com/sagemaker/latest/dg/API_UpdateEndpointWeightsAndCapacities.html) is running. 
+ `SystemUpdating`: The endpoint is undergoing maintenance and cannot be updated, deleted, or re-scaled until maintenance completes. This maintenance operation does not change any customer-specified values such as VPC config, AWS KMS encryption, model, instance type, or instance count. 
+ `RollingBack`: The endpoint failed to scale up or down, or to change its variant weight, and is rolling back to its previous configuration. Once the rollback completes, the endpoint returns to an `InService` status. This transitional status applies only to an endpoint that has autoscaling turned on and is undergoing variant weight or capacity changes, whether as part of an automatic scaling action or an explicit [UpdateEndpointWeightsAndCapacities](https://docs.aws.amazon.com/sagemaker/latest/dg/API_UpdateEndpointWeightsAndCapacities.html) call. 
+ `InService`: The endpoint is available to process incoming requests. 
+ `Deleting`: [DeleteEndpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/API_DeleteEndpoint.html) is running. 
+ `Failed`: The endpoint could not be created, updated, or re-scaled. Use [DescribeEndpoint:FailureReason](https://docs.aws.amazon.com/sagemaker/latest/dg/API_DescribeEndpoint.html#SageMaker-DescribeEndpoint-response-FailureReason) for information about the failure. [DeleteEndpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/API_DeleteEndpoint.html) is the only operation that can be performed on a failed endpoint. 

### Describe a HostingDeployment


You can obtain debugging details using the `kubectl describe` command.

```
kubectl describe hostingdeployment
```

Your output should look like the following: 

```
Name:         host-xgboost
Namespace:    default
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"sagemaker.aws.amazon.com/v1","kind":"HostingDeployment","metadata":{"annotations":{},"name":"host-xgboost","namespace":"def..."
API Version:  sagemaker.aws.amazon.com/v1
Kind:         HostingDeployment
Metadata:
  Creation Timestamp:  2019-11-22T19:40:00Z
  Finalizers:
    sagemaker-operator-finalizer
  Generation:        1
  Resource Version:  4258134
  Self Link:         /apis/sagemaker.aws.amazon.com/v1/namespaces/default/hostingdeployments/host-xgboost
  UID:               def0e83e-0d5f-11ea-aa45-0a3507uiduid
Spec:
  Containers:
    Container Hostname:  xgboost
    Image:               123456789012.dkr.ecr.us-east-2.amazonaws.com/xgboost:latest
    Model Data URL:      s3://amzn-s3-demo-bucket/inference/xgboost-mnist/model.tar.gz
  Models:
    Containers:
      xgboost
    Execution Role Arn:  arn:aws:iam::123456789012:role/service-role/AmazonSageMaker-ExecutionRole
    Name:                xgboost-model
    Primary Container:   xgboost
  Production Variants:
    Initial Instance Count:  1
    Instance Type:           ml.c5.large
    Model Name:              xgboost-model
    Variant Name:            all-traffic
  Region:                    us-east-2
Status:
  Creation Time:         2019-11-22T19:40:04Z
  Endpoint Arn:          arn:aws:sagemaker:us-east-2:123456789012:endpoint/host-xgboost-def0e83e0d5f11eaaaexample
  Endpoint Config Name:  host-xgboost-1-def0e83e0d5f11e-e08f6c510d5f11eaaa450aexample
  Endpoint Name:         host-xgboost-def0e83e0d5f11eaaa450a350733ba06
  Endpoint Status:       Creating
  Endpoint URL:          https://runtime.sagemaker.us-east-2.amazonaws.com/endpoints/host-xgboost-def0e83e0d5f11eaaaexample/invocations
  Last Check Time:       2019-11-22T19:43:57Z
  Last Modified Time:    2019-11-22T19:40:04Z
  Model Names:
    Name:   xgboost-model
    Value:  xgboost-model-1-def0e83e0d5f11-df5cc9fd0d5f11eaaa450aexample
Events:     <none>
```

The status field provides more information using the following fields: 
+ `Additional`: Additional information about the status of the hosting deployment. This field is optional and only gets populated in case of error. 
+ `Creation Time`: When the endpoint was created in SageMaker AI. 
+ `Endpoint ARN`: The SageMaker AI endpoint ARN. 
+ `Endpoint Config Name`: The SageMaker AI name of the endpoint configuration. 
+ `Endpoint Name`: The SageMaker AI name of the endpoint. 
+ `Endpoint Status`: The status of the endpoint. 
+ `Endpoint URL`: The HTTPS URL that can be used to access the endpoint. For more information, see [Deploy a Model on SageMaker AI Hosting Services](https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html). 
+ `FailureReason`: If a create, update, or delete command fails, the cause is shown here. 
+ `Last Check Time`: The last time the operator checked the status of the endpoint. 
+ `Last Modified Time`: The last time the endpoint was modified. 
+ `Model Names`: A key-value pair of HostingDeployment model names to SageMaker AI model names. 

### Invoking the endpoint


Once the endpoint status is `InService`, you can invoke the endpoint in two ways: using the AWS CLI, which does authentication and URL request signing, or using an HTTP client like cURL. If you use your own client, you need to do AWS Signature Version 4 signing and authentication on your own. 

To invoke the endpoint using the AWS CLI, run the following command. Make sure to replace the Region and endpoint name with your endpoint's Region and SageMaker AI endpoint name. This information can be obtained from the output of `kubectl describe`. 

```
# Invoke the endpoint with mock input data.
aws sagemaker-runtime invoke-endpoint \
  --region us-east-2 \
  --endpoint-name <endpoint name> \
  --body $(seq 784 | xargs echo | sed 's/ /,/g') \
  >(cat) \
  --content-type text/csv > /dev/null
```

For example, if your Region is `us-east-2` and your endpoint name is `host-xgboost-f56b6b280d7511ea824b1299example`, then the following command would invoke the endpoint: 

```
aws sagemaker-runtime invoke-endpoint \
  --region us-east-2 \
  --endpoint-name host-xgboost-f56b6b280d7511ea824b1299example \
  --body $(seq 784 | xargs echo | sed 's/ /,/g') \
  >(cat) \
  --content-type text/csv > /dev/null
4.95847082138
```

Here, `4.95847082138` is the prediction from the model for the mock data. 
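If you prefer to fetch the SageMaker AI endpoint name programmatically rather than copying it from `kubectl describe`, a JSONPath query such as the following may work. The `.status.endpointName` field name is an assumption based on the `Endpoint Name` field in the describe output above, so verify it against your resource: 

```
# Assumed field name; confirm with: kubectl get hostingdeployment host-xgboost -o yaml
kubectl get hostingdeployment host-xgboost -o jsonpath='{.status.endpointName}'
```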

### Update HostingDeployment


1. Once a HostingDeployment has a status of `InService`, it can be updated. It might take about 10 minutes for the HostingDeployment to be in service. To verify that the status is `InService`, use the following command: 

   ```
   kubectl get hostingdeployments
   ```

1. You can also apply an update before the status is `InService`. The operator waits until the SageMaker AI endpoint is `InService` before applying the update. 

   To apply an update, modify the `hosting.yaml` file. For example, change the `initialInstanceCount` field from 1 to 2 as follows: 

   ```
   apiVersion: sagemaker.aws.amazon.com/v1
   kind: HostingDeployment
   metadata:
     name: host-xgboost
   spec:
       region: us-east-2
       productionVariants:
           - variantName: all-traffic
             modelName: xgboost-model
             initialInstanceCount: 2
             instanceType: ml.c5.large
       models:
           - name: xgboost-model
             executionRoleArn: arn:aws:iam::123456789012:role/service-role/AmazonSageMaker-ExecutionRole
             primaryContainer: xgboost
             containers:
               - xgboost
       containers:
           - containerHostname: xgboost
             modelDataUrl: s3://amzn-s3-demo-bucket/inference/xgboost-mnist/model.tar.gz
             image: 123456789012.dkr.ecr.us-east-2.amazonaws.com/xgboost:latest
   ```

1. Save the file, then use `kubectl` to apply your update as follows. You should see the status change from `InService` to `ReconcilingEndpoint`, then `Updating`. 

   ```
   $ kubectl apply -f hosting.yaml
   hostingdeployment.sagemaker.aws.amazon.com/host-xgboost configured
   
   $ kubectl get hostingdeployments
   NAME           STATUS                SAGEMAKER-ENDPOINT-NAME
   host-xgboost   ReconcilingEndpoint   host-xgboost-def0e83e0d5f11eaaa450a350abcdef
   
   $ kubectl get hostingdeployments
   NAME           STATUS     SAGEMAKER-ENDPOINT-NAME
   host-xgboost   Updating   host-xgboost-def0e83e0d5f11eaaa450a3507abcdef
   ```

SageMaker AI deploys a new set of instances with your models, switches traffic to use the new instances, and drains the old instances. As soon as this process begins, the status becomes `Updating`. After the update is complete, your endpoint becomes `InService`. This process takes approximately 10 minutes. 

### Delete the HostingDeployment


1. Use `kubectl` to delete a HostingDeployment with the following command: 

   ```
   kubectl delete hostingdeployments host-xgboost
   ```

   Your output should look like the following: 

   ```
   hostingdeployment.sagemaker.aws.amazon.com "host-xgboost" deleted
   ```

1. To verify that the hosting deployment has been deleted, use the following command: 

   ```
   kubectl get hostingdeployments
   No resources found.
   ```

Endpoints that have been deleted do not incur any charges for SageMaker AI resources. 

## The ProcessingJob operator


ProcessingJob operators are used to launch Amazon SageMaker processing jobs. For more information on SageMaker Processing jobs, see [CreateProcessingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateProcessingJob.html). 

**Topics**
+ [

### Create a ProcessingJob using a YAML file
](#kubernetes-processing-job-yaml)
+ [

### List ProcessingJobs
](#kubernetes-processing-job-list)
+ [

### Describe a ProcessingJob
](#kubernetes-processing-job-description)
+ [

### Delete a ProcessingJob
](#kubernetes-processing-job-delete)

### Create a ProcessingJob using a YAML file


Follow these steps to create an Amazon SageMaker processing job by using a YAML file:

1. Download the `kmeans_preprocessing.py` pre-processing script.

   ```
   wget https://raw.githubusercontent.com/aws/amazon-sagemaker-operator-for-k8s/master/samples/kmeans_preprocessing.py
   ```

1. In one of your Amazon Simple Storage Service (Amazon S3) buckets, create a `mnist_kmeans_example/processing_code` folder and upload the script to the folder.

1. Download the `kmeans-mnist-processingjob.yaml` file.

   ```
   wget https://raw.githubusercontent.com/aws/amazon-sagemaker-operator-for-k8s/master/samples/kmeans-mnist-processingjob.yaml
   ```

1. Edit the YAML file to specify your `sagemaker-execution-role` and replace all instances of `amzn-s3-demo-bucket` with your S3 bucket.

   ```
   ...
   metadata:
     name: kmeans-mnist-processing
   ...
     roleArn: arn:aws:iam::<acct-id>:role/service-role/<sagemaker-execution-role>
     ...
     processingOutputConfig:
       outputs:
         ...
             s3Output:
               s3Uri: s3://<amzn-s3-demo-bucket>/mnist_kmeans_example/output/
     ...
     processingInputs:
       ...
           s3Input:
             s3Uri: s3://<amzn-s3-demo-bucket>/mnist_kmeans_example/processing_code/kmeans_preprocessing.py
   ```

   The `sagemaker-execution-role` must have permissions so that SageMaker AI can access your S3 bucket, Amazon CloudWatch, and other services on your behalf. For more information on creating an execution role, see [SageMaker AI Roles](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html#sagemaker-roles-createtrainingjob-perms).

1. Apply the YAML file using one of the following commands.

   For cluster-scoped installation:

   ```
   kubectl apply -f kmeans-mnist-processingjob.yaml
   ```

   For namespace-scoped installation:

   ```
   kubectl apply -f kmeans-mnist-processingjob.yaml -n <NAMESPACE>
   ```

### List ProcessingJobs


Use one of the following commands to list the jobs created using the ProcessingJob operator. `SAGEMAKER-JOB-NAME` comes from the `metadata` section of the YAML file.

For cluster-scoped installation:

```
kubectl get ProcessingJob kmeans-mnist-processing
```

For namespace-scoped installation:

```
kubectl get ProcessingJob -n <NAMESPACE> kmeans-mnist-processing
```

Your output should look similar to the following:

```
NAME                    STATUS     CREATION-TIME        SAGEMAKER-JOB-NAME
kmeans-mnist-processing InProgress 2020-09-22T21:13:25Z kmeans-mnist-processing-7410ed52fd1811eab19a165ae9f9e385
```

The output lists all jobs regardless of their status. To remove a job from the list, see [Delete a Processing Job](https://docs.aws.amazon.com/sagemaker/latest/dg/kubernetes-processing-job-operator.html#kubernetes-processing-job-delete).

**ProcessingJob Status**
+ `SynchronizingK8sJobWithSageMaker` – The job is first submitted to the cluster. The operator has received the request and is preparing to create the processing job.
+ `Reconciling` – The operator is initializing, or is recovering from transient errors, among other causes. If the processing job remains in this state, use the `kubectl describe` command to see the reason in the `Additional` field.
+ `InProgress | Completed | Failed | Stopping | Stopped` – Status of the SageMaker Processing job. For more information, see [DescribeProcessingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeProcessingJob.html#sagemaker-DescribeProcessingJob-response-ProcessingJobStatus).
+ `Error` – The operator cannot recover by reconciling.

Jobs that have completed, stopped, or failed do not incur further charges for SageMaker AI resources.

### Describe a ProcessingJob


Use one of the following commands to get more details about a processing job. These commands are typically used for debugging a problem or checking the parameters of a processing job.

For cluster-scoped installation:

```
kubectl describe processingjob kmeans-mnist-processing
```

For namespace-scoped installation:

```
kubectl describe processingjob kmeans-mnist-processing -n <NAMESPACE>
```

The output for your processing job should look similar to the following.

```
$ kubectl describe ProcessingJob kmeans-mnist-processing
Name:         kmeans-mnist-processing
Namespace:    default
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"sagemaker.aws.amazon.com/v1","kind":"ProcessingJob","metadata":{"annotations":{},"name":"kmeans-mnist-processing",...
API Version:  sagemaker.aws.amazon.com/v1
Kind:         ProcessingJob
Metadata:
  Creation Timestamp:  2020-09-22T21:13:25Z
  Finalizers:
    sagemaker-operator-finalizer
  Generation:        2
  Resource Version:  21746658
  Self Link:         /apis/sagemaker.aws.amazon.com/v1/namespaces/default/processingjobs/kmeans-mnist-processing
  UID:               7410ed52-fd18-11ea-b19a-165ae9f9e385
Spec:
  App Specification:
    Container Entrypoint:
      python
      /opt/ml/processing/code/kmeans_preprocessing.py
    Image Uri:  763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:1.5.0-cpu-py36-ubuntu16.04
  Environment:
    Name:   MYVAR
    Value:  my_value
    Name:   MYVAR2
    Value:  my_value2
  Network Config:
  Processing Inputs:
    Input Name:  mnist_tar
    s3Input:
      Local Path:   /opt/ml/processing/input
      s3DataType:   S3Prefix
      s3InputMode:  File
      s3Uri:        s3://<s3bucket>-us-west-2/algorithms/kmeans/mnist/mnist.pkl.gz
    Input Name:     source_code
    s3Input:
      Local Path:   /opt/ml/processing/code
      s3DataType:   S3Prefix
      s3InputMode:  File
      s3Uri:        s3://<s3bucket>/mnist_kmeans_example/processing_code/kmeans_preprocessing.py
  Processing Output Config:
    Outputs:
      Output Name:  train_data
      s3Output:
        Local Path:    /opt/ml/processing/output_train/
        s3UploadMode:  EndOfJob
        s3Uri:         s3://<s3bucket>/mnist_kmeans_example/output/
      Output Name:     test_data
      s3Output:
        Local Path:    /opt/ml/processing/output_test/
        s3UploadMode:  EndOfJob
        s3Uri:         s3://<s3bucket>/mnist_kmeans_example/output/
      Output Name:     valid_data
      s3Output:
        Local Path:    /opt/ml/processing/output_valid/
        s3UploadMode:  EndOfJob
        s3Uri:         s3://<s3bucket>/mnist_kmeans_example/output/
  Processing Resources:
    Cluster Config:
      Instance Count:     1
      Instance Type:      ml.m5.xlarge
      Volume Size In GB:  20
  Region:                 us-west-2
  Role Arn:               arn:aws:iam::<acct-id>:role/m-sagemaker-role
  Stopping Condition:
    Max Runtime In Seconds:  1800
  Tags:
    Key:    tagKey
    Value:  tagValue
Status:
  Cloud Watch Log URL:             https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logStream:group=/aws/sagemaker/ProcessingJobs;prefix=kmeans-mnist-processing-7410ed52fd1811eab19a165ae9f9e385;streamFilter=typeLogStreamPrefix
  Last Check Time:                 2020-09-22T21:14:29Z
  Processing Job Status:           InProgress
  Sage Maker Processing Job Name:  kmeans-mnist-processing-7410ed52fd1811eab19a165ae9f9e385
Events:                            <none>
```

### Delete a ProcessingJob


When you delete a processing job, the SageMaker Processing job is removed from Kubernetes but the job isn't deleted from SageMaker AI. If the job status in SageMaker AI is `InProgress` the job is stopped. Processing jobs that are stopped do not incur any charges for SageMaker AI resources. Use one of the following commands to delete a processing job. 

For cluster-scoped installation:

```
kubectl delete processingjob kmeans-mnist-processing
```

For namespace-scoped installation:

```
kubectl delete processingjob kmeans-mnist-processing -n <NAMESPACE>
```

The output for your processing job should look similar to the following.

```
processingjob.sagemaker.aws.amazon.com "kmeans-mnist-processing" deleted
```



**Note**  
SageMaker AI does not delete the processing job. Stopped jobs continue to show in the SageMaker AI console. The `delete` command takes a few minutes to clean up the resources from SageMaker AI.

## HostingAutoscalingPolicy (HAP) Operator


The HostingAutoscalingPolicy (HAP) operator takes a list of resource IDs as input and applies the same policy to each of them. Each resource ID is a combination of an endpoint name and a variant name. The HAP operator performs two steps: it registers the resource IDs and then applies the scaling policy to each resource ID. `Delete` undoes both actions. You can apply the HAP to an existing SageMaker AI endpoint or you can create a new SageMaker AI endpoint using the [HostingDeployment operator](https://docs.aws.amazon.com/sagemaker/latest/dg/hosting-deployment-operator.html#create-a-hostingdeployment). You can read more about SageMaker AI autoscaling in the [ Application Autoscaling Policy documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling.html).

**Note**  
In your `kubectl` commands, you can use the short form, `hap`, in place of `hostingautoscalingpolicy`.
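For example, the following two commands are equivalent:

```
kubectl get hostingautoscalingpolicy
kubectl get hap
```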

**Topics**
+ [

### Create a HostingAutoscalingPolicy using a YAML file
](#kubernetes-hap-job-yaml)
+ [

### List HostingAutoscalingPolicies
](#kubernetes-hap-list)
+ [

### Describe a HostingAutoscalingPolicy
](#kubernetes-hap-describe)
+ [

### Update a HostingAutoscalingPolicy
](#kubernetes-hap-update)
+ [

### Delete a HostingAutoscalingPolicy
](#kubernetes-hap-delete)
+ [

### Update or delete an endpoint with a HostingAutoscalingPolicy
](#kubernetes-hap-update-delete-endpoint)

### Create a HostingAutoscalingPolicy using a YAML file


Use a YAML file to create a HostingAutoscalingPolicy (HAP) that applies a predefined or custom metric to one or multiple SageMaker AI endpoints.

Amazon SageMaker AI requires specific values in order to apply autoscaling to your variant. If these values are not specified in the YAML spec, the HAP operator applies the following default values.

```
# Do not change
Namespace                    = "sagemaker"
# Do not change
ScalableDimension            = "sagemaker:variant:DesiredInstanceCount"
# Only one supported
PolicyType                   = "TargetTrackingScaling"
# This is the default policy name but can be changed to apply a custom policy
DefaultAutoscalingPolicyName = "SageMakerEndpointInvocationScalingPolicy"
```

Use the following samples to create a HAP that applies a predefined or custom metric to one or multiple endpoints.

#### Sample 1: Apply a predefined metric to a single endpoint variant

1. Download the sample YAML file for a predefined metric using the following command:

   ```
   wget https://raw.githubusercontent.com/aws/amazon-sagemaker-operator-for-k8s/master/samples/hap-predefined-metric.yaml
   ```

1. Edit the YAML file to specify your `endpointName`, `variantName`, and `Region` (a sketch of the edited spec follows this procedure).

1. Use one of the following commands to apply a predefined metric to a single resource ID (endpoint name and variant name combination).

   For cluster-scoped installation:

   ```
   kubectl apply -f hap-predefined-metric.yaml
   ```

   For namespace-scoped installation:

   ```
   kubectl apply -f hap-predefined-metric.yaml -n <NAMESPACE>
   ```
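As a point of reference, the edited spec has roughly the following shape. The field names below are illustrative assumptions based on the defaults listed earlier, so defer to the sample file you downloaded:

```
# Illustrative sketch only; verify field names against hap-predefined-metric.yaml.
apiVersion: sagemaker.aws.amazon.com/v1
kind: HostingAutoscalingPolicy
metadata:
  name: hap-predefined
spec:
  region: us-east-2
  resourceId:
    - endpointName: <your-endpoint-name>
      variantName: <your-variant-name>
  minCapacity: 1
  maxCapacity: 2
  targetTrackingScalingPolicyConfig:
    targetValue: 60.0
    predefinedMetricSpecification:
      predefinedMetricType: SageMakerVariantInvocationsPerInstance
```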

#### Sample 2: Apply a custom metric to a single endpoint variant

1. Download the sample YAML file for a custom metric using the following command:

   ```
   wget https://raw.githubusercontent.com/aws/amazon-sagemaker-operator-for-k8s/master/samples/hap-custom-metric.yaml
   ```

1. Edit the YAML file to specify your `endpointName`, `variantName`, and `Region`.

1. Use one of the following commands to apply a custom metric to a single resource ID (endpoint name and variant name combination) in place of the recommended `SageMakerVariantInvocationsPerInstance`.
**Note**  
Amazon SageMaker AI does not check the validity of your YAML spec.

   For cluster-scoped installation:

   ```
   kubectl apply -f hap-custom-metric.yaml
   ```

   For namespace-scoped installation:

   ```
   kubectl apply -f hap-custom-metric.yaml -n <NAMESPACE>
   ```

#### Sample 3: Apply a scaling policy to multiple endpoints and variants

You can use the HAP operator to apply the same scaling policy to multiple resource IDs. A separate `scaling_policy` request is created for each resource ID (endpoint name and variant name combination).

1. Download the sample YAML file for a predefined metric using the following command:

   ```
   wget https://raw.githubusercontent.com/aws/amazon-sagemaker-operator-for-k8s/master/samples/hap-predefined-metric.yaml
   ```

1. Edit the YAML file to specify your `Region` and multiple `endpointName` and `variantName` values.

1. Use one of the following commands to apply a predefined metric to multiple resource IDs (endpoint name and variant name combinations).

   For cluster-scoped installation:

   ```
   kubectl apply -f hap-predefined-metric.yaml
   ```

   For namespace-scoped installation:

   ```
   kubectl apply -f hap-predefined-metric.yaml -n <NAMESPACE>
   ```

#### Considerations for HostingAutoscalingPolicies for multiple endpoints and variants

The following considerations apply when you use multiple resource IDs:
+ If you apply a single policy across multiple resource IDs, one PolicyARN is created per resource ID. Five endpoints have five PolicyARNs. When you run the `describe` command on the policy, the responses show up as one job and include a single job status.
+ If you apply a custom metric to multiple resource IDs, the same dimension or value is used for all the resource ID (variant) values. For example, if you apply a custom metric for variants 1 through 5, and the endpoint variant dimension is mapped to variant 1, then all endpoints are scaled up or down when variant 1 exceeds the metric.
+ The HAP operator supports updating the list of resource IDs. If you modify, add, or delete resource IDs in the spec, the autoscaling policy is removed from the previous list of variants and applied to the newly specified resource ID combinations. Use the [Describe a HostingAutoscalingPolicy](#kubernetes-hap-describe) command to list the resource IDs to which the policy is currently applied.

### List HostingAutoscalingPolicies


Use one of the following commands to list all HostingAutoscalingPolicies (HAPs) created using the HAP operator.

For cluster-scoped installation:

```
kubectl get hap
```

For namespace-scoped installation:

```
kubectl get hap -n <NAMESPACE>
```

Your output should look similar to the following:

```
NAME             STATUS   CREATION-TIME
hap-predefined   Created  2021-07-13T21:32:21Z
```

Use the following command to check the status of your HostingAutoscalingPolicy (HAP).

```
kubectl get hap <job-name>
```

One of the following values is returned:
+ `Reconciling` – Certain types of errors show the status as `Reconciling` instead of `Error`. Some examples are server-side errors and endpoints in the `Creating` or `Updating` state. Check the `Additional` field in status or operator logs for more details.
+ `Created`
+ `Error`

**To view the autoscaling endpoint to which you applied the policy**

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. In the left side panel, expand **Inference**.

1. Choose **Endpoints**.

1. Select the name of the endpoint of interest.

1. Scroll to the **Endpoint runtime settings** section.

### Describe a HostingAutoscalingPolicy


Use the following command to get more details about a HostingAutoscalingPolicy (HAP). This command is typically used for debugging a problem or checking the resource IDs (endpoint name and variant name combinations) of a HAP.

```
kubectl describe hap <job-name>
```

### Update a HostingAutoscalingPolicy


The HostingAutoscalingPolicy (HAP) operator supports updates. You can edit your YAML spec to change the values and then reapply the policy. The HAP operator deletes the existing policy and applies the new policy.

### Delete a HostingAutoscalingPolicy


Use one of the following commands to delete a HostingAutoscalingPolicy (HAP) policy.

For cluster-scoped installation:

```
kubectl delete hap hap-predefined
```

For namespace-scoped installation:

```
kubectl delete hap hap-predefined -n <NAMESPACE>
```

This command deletes the scaling policy and deregisters the scaling target from Kubernetes. This command returns the following output:

```
hostingautoscalingpolicies.sagemaker.aws.amazon.com "hap-predefined" deleted
```

### Update or delete an endpoint with a HostingAutoscalingPolicy

To update an endpoint that has a HostingAutoscalingPolicy (HAP), use the `kubectl delete` command to remove the HAP, update the endpoint, and then reapply the HAP (see the sketch that follows).

To delete an endpoint that has a HAP, use the `kubectl delete` command to remove the HAP before you delete the endpoint.
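A minimal sketch of the update flow, assuming the sample names used earlier in this chapter:

```
# 1. Remove the scaling policy from the endpoint.
kubectl delete hap hap-predefined
# 2. Update the endpoint, for example by re-applying an edited HostingDeployment spec.
kubectl apply -f hosting.yaml
# 3. Re-apply the scaling policy once the endpoint is InService again.
kubectl apply -f hap-predefined-metric.yaml
```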

# Migrate resources to the latest Operators

We are stopping the development and technical support of the original version of [ SageMaker Operators for Kubernetes](https://github.com/aws/amazon-sagemaker-operator-for-k8s/tree/master).

If you are currently using version `v1.2.2` or below of [ SageMaker Operators for Kubernetes](https://github.com/aws/amazon-sagemaker-operator-for-k8s/tree/master), we recommend migrating your resources to the [ACK service controller for Amazon SageMaker](https://github.com/aws-controllers-k8s/sagemaker-controller). The ACK service controller is a new generation of SageMaker Operators for Kubernetes based on [AWS Controllers for Kubernetes (ACK)](https://aws-controllers-k8s.github.io/community/).

For answers to frequently asked questions on the end of support of the original version of SageMaker Operators for Kubernetes, see [Announcing the End of Support of the Original Version of SageMaker AI Operators for Kubernetes](kubernetes-sagemaker-operators-eos-announcement.md).

Use the following steps to migrate your resources and use ACK to train, tune, and deploy machine learning models with Amazon SageMaker AI.

**Note**  
The latest SageMaker AI Operators for Kubernetes are not backward compatible with the original version.

**Topics**
+ [

## Prerequisites
](#migrate-resources-to-new-operators-prerequisites)
+ [

## Adopt resources
](#migrate-resources-to-new-operators-steps)
+ [

## Clean up old resources
](#migrate-resources-to-new-operators-cleanup)
+ [

## Use the new SageMaker AI Operators for Kubernetes
](#migrate-resources-to-new-operators-tutorials)

## Prerequisites


To successfully migrate resources to the latest SageMaker AI Operators for Kubernetes, you must do the following:

1. Install the latest SageMaker AI Operators for Kubernetes. See [Setup](https://aws-controllers-k8s.github.io/community/docs/tutorials/sagemaker-example/#setup) in *Machine Learning with the ACK SageMaker AI Controller* for step-by-step instructions.

1. If you are using [HostingAutoscalingPolicy resources](#migrate-resources-to-new-operators-hap), install the new Application Auto Scaling Operators. See [Setup](https://aws-controllers-k8s.github.io/community/docs/tutorials/autoscaling-example/#setup) in *Scale SageMaker AI Workloads with Application Auto Scaling* for step-by-step instructions. This step is optional if you are not using HostingAutoScalingPolicy resources.

If permissions are configured correctly, then the ACK SageMaker AI service controller can determine the specification and status of the AWS resource and reconcile the resource as if the ACK controller originally created it.

## Adopt resources


The new SageMaker AI Operators for Kubernetes provide the ability to adopt resources that were not originally created by the ACK service controller. For more information, see [Adopt Existing AWS Resources](https://aws-controllers-k8s.github.io/community/docs/user-docs/adopted-resource/) in the ACK documentation.

The following steps show how the new SageMaker AI Operators for Kubernetes can adopt an existing SageMaker AI endpoint. Save the following sample to a file named `adopt-endpoint-sample.yaml`. 

```
apiVersion: services.k8s.aws/v1alpha1
kind: AdoptedResource
metadata:
  name: adopt-endpoint-sample
spec:  
  aws:
    # resource to adopt, not created by ACK
    nameOrID: xgboost-endpoint
  kubernetes:
    group: sagemaker.services.k8s.aws
    kind: Endpoint
    metadata:
      # target K8s CR name
      name: xgboost-endpoint
```

Submit the custom resource (CR) using `kubectl apply`:

```
kubectl apply -f adopt-endpoint-sample.yaml
```

Use `kubectl describe` to check the status conditions of your adopted resource.

```
kubectl describe adoptedresource adopt-endpoint-sample
```

Verify that the `ACK.Adopted` condition is `True`. The output should look similar to the following example:

```
---
kind: AdoptedResource
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: '{"apiVersion":"services.k8s.aws/v1alpha1","kind":"AdoptedResource","metadata":{"annotations":{},"name":"xgboost-endpoint","namespace":"default"},"spec":{"aws":{"nameOrID":"xgboost-endpoint"},"kubernetes":{"group":"sagemaker.services.k8s.aws","kind":"Endpoint","metadata":{"name":"xgboost-endpoint"}}}}'
  creationTimestamp: '2021-04-27T02:49:14Z'
  finalizers:
  - finalizers.services.k8s.aws/AdoptedResource
  generation: 1
  name: adopt-endpoint-sample
  namespace: default
  resourceVersion: '12669876'
  selfLink: "/apis/services.k8s.aws/v1alpha1/namespaces/default/adoptedresources/adopt-endpoint-sample"
  uid: 35f8fa92-29dd-4040-9d0d-0b07bbd7ca0b
spec:
  aws:
    nameOrID: xgboost-endpoint
  kubernetes:
    group: sagemaker.services.k8s.aws
    kind: Endpoint
    metadata:
      name: xgboost-endpoint
status:
  conditions:
  - status: 'True'
    type: ACK.Adopted
```

Check that your resource exists in your cluster:

```
kubectl describe endpoints.sagemaker xgboost-endpoint
```

### HostingAutoscalingPolicy resources


The `HostingAutoscalingPolicy` (HAP) resource consists of multiple Application Auto Scaling resources: `ScalableTarget` and `ScalingPolicy`. Before adopting a HAP resource with ACK, install the [Application Auto Scaling controller](https://github.com/aws-controllers-k8s/applicationautoscaling-controller). To adopt HAP resources, you need to adopt both the `ScalableTarget` and `ScalingPolicy` resources. You can find the resource identifiers for these resources in the status of the `HostingAutoscalingPolicy` resource (`status.ResourceIDList`).
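
As an illustration, a minimal sketch of an `AdoptedResource` for the `ScalableTarget`; the names and the resource ID are hypothetical placeholders taken from `status.ResourceIDList`, and the `ScalingPolicy` is adopted the same way with `kind: ScalingPolicy`:

```
apiVersion: services.k8s.aws/v1alpha1
kind: AdoptedResource
metadata:
  name: adopt-scalable-target-sample
spec:
  aws:
    # resource ID recorded in status.ResourceIDList of the old HAP resource
    nameOrID: endpoint/xgboost-endpoint/variant/AllTraffic
  kubernetes:
    group: applicationautoscaling.services.k8s.aws
    kind: ScalableTarget
    metadata:
      name: adopted-scalable-target
```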

### HostingDeployment resources


The `HostingDeployment` resource consists of multiple SageMaker AI resources: `Endpoint`, `EndpointConfig`, and each `Model`. If you adopt a SageMaker AI endpoint in ACK, you need to adopt the `Endpoint`, `EndpointConfig`, and each `Model` separately. The `Endpoint`, `EndpointConfig`, and `Model` names can be found in status of the `HostingDeployment` resource (`status.endpointName`, `status.endpointConfigName`, and `status.modelNames`).
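
To look up these names, you can read them from the status of the old resource before uninstalling the old operator; a minimal sketch, assuming a hypothetical `HostingDeployment` named `hosting-deployment`:

```
kubectl get hostingdeployments.sagemaker.aws.amazon.com hosting-deployment \
  -o jsonpath='{.status.endpointName}{"\n"}{.status.endpointConfigName}{"\n"}{.status.modelNames}{"\n"}'
```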

For a list of all supported SageMaker AI resources, refer to the [ACK API Reference](https://aws-controllers-k8s.github.io/community/reference/).

## Clean up old resources


After the new SageMaker AI Operators for Kubernetes adopt your resources, you can uninstall old operators and clean up old resources.

### Step 1: Uninstall the old operator


To uninstall the old operator, see [Delete operators](kubernetes-sagemaker-operators-end-of-support.md#delete-operators).

**Warning**  
Uninstall the old operator before deleting any old resources.

### Step 2: Remove finalizers and delete old resources


**Warning**  
Before deleting old resources, be sure that you have uninstalled the old operator.

After uninstalling the old operator, you must explicitly remove the finalizers to delete old operator resources. The following sample script shows how to delete all training jobs managed by the old operator in a given namespace. You can use a similar pattern to delete additional resources once they are adopted by the new operator.

**Note**  
You must use full resource names to get resources. For example, use `kubectl get trainingjobs.sagemaker.aws.amazon.com` instead of `kubectl get trainingjob`.

```
namespace=sagemaker_namespace
training_jobs=$(kubectl get trainingjobs.sagemaker.aws.amazon.com -n $namespace -ojson | jq -r '.items | .[] | .metadata.name')
 
for job in $training_jobs
do
    echo "Deleting $job resource in $namespace namespace"
    kubectl patch trainingjobs.sagemaker.aws.amazon.com $job -n $namespace -p '{"metadata":{"finalizers":null}}' --type=merge
    kubectl delete trainingjobs.sagemaker.aws.amazon.com $job -n $namespace
done
```
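
The same pattern applies to other resource kinds managed by the old operator; for example, a hedged sketch for `HostingDeployment` resources, assuming the same namespace variable and the full CRD name used by the old operator:

```
hosting_deployments=$(kubectl get hostingdeployments.sagemaker.aws.amazon.com -n $namespace -ojson | jq -r '.items | .[] | .metadata.name')

for deployment in $hosting_deployments
do
    echo "Deleting $deployment resource in $namespace namespace"
    kubectl patch hostingdeployments.sagemaker.aws.amazon.com $deployment -n $namespace -p '{"metadata":{"finalizers":null}}' --type=merge
    kubectl delete hostingdeployments.sagemaker.aws.amazon.com $deployment -n $namespace
done
```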

## Use the new SageMaker AI Operators for Kubernetes


For in-depth guides on using the new SageMaker AI Operators for Kubernetes, see [Use SageMaker AI Operators for Kubernetes](kubernetes-sagemaker-operators-ack.md#kubernetes-sagemaker-operators-ack-use).

# Announcing the End of Support of the Original Version of SageMaker AI Operators for Kubernetes

This page announces the end of support for the original version of [SageMaker AI Operators for Kubernetes](https://github.com/aws/amazon-sagemaker-operator-for-k8s) and provides answers to frequently asked questions as well as migration information about the [ACK service controller for Amazon SageMaker AI](https://github.com/aws-controllers-k8s/sagemaker-controller), a new generation of fully supported SageMaker AI Operators for Kubernetes. For general information about the new SageMaker AI Operators for Kubernetes, see [Latest SageMaker AI Operators for Kubernetes](kubernetes-sagemaker-operators-ack.md). 

## End of Support Frequently Asked Questions

**Topics**
+ [

### Why are we ending support for the original version of SageMaker AI Operators for Kubernetes?
](#kubernetes-sagemaker-operators-eos-faq-why)
+ [

### Where can I find more information about the new SageMaker AI Operators for Kubernetes and ACK?
](#kubernetes-sagemaker-operators-eos-faq-more)
+ [

### What does end of support (EOS) mean?
](#kubernetes-sagemaker-operators-eos-faq-definition)
+ [

### How can I migrate my workload to the new SageMaker AI Operators for Kubernetes for training and inference?
](#kubernetes-sagemaker-operators-eos-faq-how)
+ [

### Which version of ACK should I migrate to?
](#kubernetes-sagemaker-operators-eos-faq-version)
+ [

### Are the initial SageMaker AI Operators for Kubernetes and the new Operators (ACK service controller for Amazon SageMaker AI) functionally equivalent?
](#kubernetes-sagemaker-operators-eos-faq-parity)

### Why are we ending support for the original version of SageMaker AI Operators for Kubernetes?


Users can now take advantage of the [ACK service controller for Amazon SageMaker AI](https://github.com/aws-controllers-k8s/sagemaker-controller). The ACK service controller is a new generation of SageMaker AI Operators for Kubernetes based on [AWS Controllers for Kubernetes](https://aws-controllers-k8s.github.io/community/) (ACK), a community-driven project optimized for production, standardizing the way to expose AWS services via a Kubernetes operator. We are therefore announcing the end of support (EOS) for the original version (not ACK-based) of [ SageMaker AI Operators for Kubernetes](https://github.com/aws/amazon-sagemaker-operator-for-k8s). The support ends on **Feb 15, 2023** along with [Amazon Elastic Kubernetes Service Kubernetes 1.21](https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html#kubernetes-release-calendar). 

For more information on ACK, see [ACK history and tenets](https://aws-controllers-k8s.github.io/community/docs/community/background/).

### Where can I find more information about the new SageMaker AI Operators for Kubernetes and ACK?

+ For more information about the new SageMaker AI Operators for Kubernetes, see the [ACK service controller for Amazon SageMaker AI](https://github.com/aws-controllers-k8s/sagemaker-controller) GitHub repository or read [AWS Controllers for Kubernetes Documentation](https://aws-controllers-k8s.github.io/community/docs/community/overview/).
+ For a tutorial on how to train a machine learning model with the ACK service controller for Amazon SageMaker AI using Amazon EKS, see this [SageMaker AI example](https://aws-controllers-k8s.github.io/community/docs/tutorials/sagemaker-example/).

  For an autoscaling example, see [ Scale SageMaker AI Workloads with Application Auto Scaling](https://aws-controllers-k8s.github.io/community/docs/tutorials/autoscaling-example/).
+ For information on AWS Controller for Kubernetes (ACK), see the [AWS Controllers for Kubernetes](https://aws-controllers-k8s.github.io/community/) (ACK) documentation.
+ For a list of supported SageMaker AI resources, see [ACK API Reference](https://aws-controllers-k8s.github.io/community/reference/).

### What does end of support (EOS) mean?


While users can continue to use their current operators, we are no longer developing new features for the operators, nor will we release any patches or security updates for any issues found. `v1.2.2` is the last release of [SageMaker AI Operators for Kubernetes](https://github.com/aws/amazon-sagemaker-operator-for-k8s/tree/master). Users should migrate their workloads to use the [ACK service controller for Amazon SageMaker AI](https://github.com/aws-controllers-k8s/sagemaker-controller).

### How can I migrate my workload to the new SageMaker AI Operators for Kubernetes for training and inference?


For information about migrating resources from the old to the new SageMaker AI Operators for Kubernetes, follow [Migrate resources to the latest Operators](kubernetes-sagemaker-operators-migrate.md).

### Which version of ACK should I migrate to?


Users should migrate to the most recent released version of the [ACK service controller for Amazon SageMaker AI](https://github.com/aws-controllers-k8s/sagemaker-controller/tags).

### Are the initial SageMaker AI Operators for Kubernetes and the new Operators (ACK service controller for Amazon SageMaker AI) functionally equivalent?


Yes, they are at feature parity.

However, there are a few notable differences between the two versions:
+ The Custom Resource Definitions (CRDs) used by the ACK-based SageMaker AI Operators for Kubernetes follow the AWS API definition, making them incompatible with the custom resource specifications of the original SageMaker AI Operators for Kubernetes. Refer to the [CRDs](https://github.com/aws-controllers-k8s/sagemaker-controller/tree/main/helm/crds) in the new controller, or use the migration guide to adopt the resources and use the new controller. 
+ The `HostingAutoscalingPolicy` resource is no longer part of the new SageMaker AI Operators for Kubernetes; it has been migrated to the [Application Auto Scaling](https://github.com/aws-controllers-k8s/applicationautoscaling-controller) ACK controller. To learn how to use the Application Auto Scaling controller to configure autoscaling on SageMaker AI endpoints, follow this [autoscaling example](https://aws-controllers-k8s.github.io/community/docs/tutorials/autoscaling-example/). 
+ The `HostingDeployment` resource was used to create models, endpoint configurations, and endpoints in one CRD. The new SageMaker AI Operators for Kubernetes have a separate CRD for each of these resources. 

# SageMaker AI Components for Kubeflow Pipelines


With SageMaker AI components for Kubeflow Pipelines, you can create and monitor native SageMaker AI training, tuning, endpoint deployment, and batch transform jobs from your Kubeflow Pipelines. By running Kubeflow Pipeline jobs on SageMaker AI, you move data processing and training jobs from the Kubernetes cluster to SageMaker AI's machine learning-optimized managed service. This document assumes prior knowledge of Kubernetes and Kubeflow. 

**Topics**
+ [

## What are Kubeflow Pipelines?
](#what-is-kubeflow-pipelines)
+ [

## What are Kubeflow Pipeline components?
](#kubeflow-pipeline-components)
+ [

## Why use SageMaker AI Components for Kubeflow Pipelines?
](#why-use-sagemaker-components)
+ [

## SageMaker AI Components for Kubeflow Pipelines versions
](#sagemaker-components-versions)
+ [

## List of SageMaker AI Components for Kubeflow Pipelines
](#sagemaker-components-list)
+ [

## IAM permissions
](#iam-permissions)
+ [

## Converting pipelines to use SageMaker AI
](#converting-pipelines-to-use-amazon-sagemaker)
+ [

# Install Kubeflow Pipelines
](kubernetes-sagemaker-components-install.md)
+ [

# Use SageMaker AI components
](kubernetes-sagemaker-components-tutorials.md)

## What are Kubeflow Pipelines?


Kubeflow Pipelines (KFP) is a platform for building and deploying portable, scalable machine learning (ML) workflows based on Docker containers. The Kubeflow Pipelines platform consists of the following:
+ A user interface (UI) for managing and tracking experiments, jobs, and runs. 
+ An engine (Argo) for scheduling multi-step ML workflows.
+ An SDK for defining and manipulating pipelines and components.
+ Notebooks for interacting with the system using the SDK.

A pipeline is a description of an ML workflow expressed as a [directed acyclic graph](https://www.kubeflow.org/docs/pipelines/concepts/graph/). Every step in the workflow is expressed as a Kubeflow Pipeline [component](https://www.kubeflow.org/docs/pipelines/overview/concepts/component/), which is a Python module.

For more information on Kubeflow Pipelines, see the [Kubeflow Pipelines documentation](https://www.kubeflow.org/docs/pipelines/). 

## What are Kubeflow Pipeline components?


A Kubeflow Pipeline component is a set of code used to execute one step of a Kubeflow pipeline. Components are represented by a Python module built into a Docker image. When the pipeline runs, the component's container is instantiated on one of the worker nodes on the Kubernetes cluster running Kubeflow, and your logic is executed. Pipeline components can read outputs from the previous components and create outputs that the next component in the pipeline can consume. These components make it fast and easy to write pipelines for experimentation and production environments without having to interact with the underlying Kubernetes infrastructure.

You can use SageMaker AI Components in your Kubeflow pipeline. Rather than encapsulating your logic in a custom container, you simply load the components and describe your pipeline using the Kubeflow Pipelines SDK. When the pipeline runs, your instructions are translated into a SageMaker AI job or deployment. The workload then runs on the fully managed infrastructure of SageMaker AI. 

## Why use SageMaker AI Components for Kubeflow Pipelines?

SageMaker AI Components for Kubeflow Pipelines offer an alternative way to launch your compute-intensive jobs on SageMaker AI. The components integrate SageMaker AI with the portability and orchestration of Kubeflow Pipelines. Using the SageMaker AI Components for Kubeflow Pipelines, you can create and monitor your SageMaker AI resources as part of a Kubeflow Pipelines workflow. Each of the jobs in your pipelines runs on SageMaker AI instead of the local Kubernetes cluster, allowing you to take advantage of key SageMaker AI features such as data labeling, large-scale hyperparameter tuning and distributed training jobs, or one-click secure and scalable model deployment. The job parameters, status, logs, and outputs from SageMaker AI are still accessible from the Kubeflow Pipelines UI. 

The SageMaker AI components integrate key SageMaker AI features into your ML workflows from preparing data, to building, training, and deploying ML models. You can create a Kubeflow Pipeline built entirely using these components, or integrate individual components into your workflow as needed. The components are available in one or two versions. Each version of a component leverages a different backend. For more information on those versions, see [SageMaker AI Components for Kubeflow Pipelines versions](#sagemaker-components-versions).

There is no additional charge for using SageMaker AI Components for Kubeflow Pipelines. You incur charges for any SageMaker AI resources you use through these components.

## SageMaker AI Components for Kubeflow Pipelines versions

SageMaker AI Components for Kubeflow Pipelines come in two versions. Each version leverages a different backend to create and manage resources on SageMaker AI.
+ Version 1 (v1.x or below) of the SageMaker AI Components for Kubeflow Pipelines uses **[Boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html)** (AWS SDK for Python (Boto3)) as the backend. 
+ Version 2 (v2.0.0-alpha2 and above) of the SageMaker AI Components for Kubeflow Pipelines uses the [SageMaker AI Operator for Kubernetes (ACK)](https://github.com/aws-controllers-k8s/sagemaker-controller). 

  AWS introduced [ACK](https://aws-controllers-k8s.github.io/community/) to facilitate a Kubernetes-native way of managing AWS Cloud resources. ACK includes a set of AWS service-specific controllers, one of which is the SageMaker AI controller. The SageMaker AI controller makes it easier for machine learning developers and data scientists who use Kubernetes as their control plane to train, tune, and deploy machine learning (ML) models in SageMaker AI. For more information, see [SageMaker AI Operators for Kubernetes](https://aws-controllers-k8s.github.io/community/docs/tutorials/sagemaker-example/). 

Both versions of the SageMaker AI Components for Kubeflow Pipelines are supported. However, version 2 provides some additional advantages. In particular, it offers: 

1. A consistent experience to manage your SageMaker AI resources from any application, whether you are using Kubeflow Pipelines, the Kubernetes CLI (`kubectl`), or other Kubeflow applications such as Notebooks. 

1. The flexibility to manage and monitor your SageMaker AI resources outside of the Kubeflow pipeline workflow. 

1. Zero setup time to use the SageMaker AI components if you deployed the full [Kubeflow on AWS](https://awslabs.github.io/kubeflow-manifests/docs/about/) release since the SageMaker AI Operator is part of its deployment. 

## List of SageMaker AI Components for Kubeflow Pipelines

The following is a list of all SageMaker AI Components for Kubeflow Pipelines and their available versions. Alternatively, you can find all [SageMaker AI Components for Kubeflow Pipelines in GitHub](https://github.com/kubeflow/pipelines/tree/master/components/aws/sagemaker#versioning).

**Note**  
We encourage users to use Version 2 of a SageMaker AI component wherever it is available.

### Ground Truth components

+ **Ground Truth**

  The Ground Truth component enables you to submit SageMaker AI Ground Truth labeling jobs directly from a Kubeflow Pipelines workflow.    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/kubernetes-sagemaker-components-for-kubeflow-pipelines.html)
+ **Workteam**

  The Workteam component enables you to create SageMaker AI private workteam jobs directly from a Kubeflow Pipelines workflow.    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/kubernetes-sagemaker-components-for-kubeflow-pipelines.html)

### Data processing components

+ **Processing**

  The Processing component enables you to submit processing jobs to SageMaker AI directly from a Kubeflow Pipelines workflow.    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/kubernetes-sagemaker-components-for-kubeflow-pipelines.html)

### Training components

+ **Training**

  The Training component allows you to submit SageMaker Training jobs directly from a Kubeflow Pipelines workflow.    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/kubernetes-sagemaker-components-for-kubeflow-pipelines.html)
+ **Hyperparameter Optimization**

  The Hyperparameter Optimization component enables you to submit hyperparameter tuning jobs to SageMaker AI directly from a Kubeflow Pipelines workflow.    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/kubernetes-sagemaker-components-for-kubeflow-pipelines.html)

### Inference components

+ **Hosting Deploy**

  The Hosting components allow you to deploy a model using SageMaker AI hosting services from a Kubeflow Pipelines workflow.    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/kubernetes-sagemaker-components-for-kubeflow-pipelines.html)
+ **Batch Transform**

  The Batch Transform component allows you to run inference jobs for an entire dataset in SageMaker AI from a Kubeflow Pipelines workflow.    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/kubernetes-sagemaker-components-for-kubeflow-pipelines.html)
+ **Model Monitor**

  The Model Monitor components allow you to monitor the quality of SageMaker AI machine learning models in production from a Kubeflow Pipelines workflow.    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/kubernetes-sagemaker-components-for-kubeflow-pipelines.html)

## IAM permissions


Deploying Kubeflow Pipelines with SageMaker AI components requires the following three layers of authentication: 
+ An IAM role granting your gateway node (which can be your local machine or a remote instance) access to the Amazon Elastic Kubernetes Service (Amazon EKS) cluster.

  The user accessing the gateway node assumes this role to:
  + Create an Amazon EKS cluster and install KFP
  + Create IAM roles
  + Create Amazon S3 buckets for your sample input data

  The role requires the following permissions:
  + CloudWatchLogsFullAccess 
  + [AWSCloudFormationFullAccess](https://console.aws.amazon.com/iam/home?region=us-east-1#/policies/arn%3Aaws%3Aiam%3A%3Aaws%3Apolicy%2FAWSCloudFormationFullAccess) 
  + IAMFullAccess
  + AmazonS3FullAccess
  + AmazonEC2FullAccess
  + AmazonEKSAdminPolicy (Create this policy using the schema from [Amazon EKS Identity-Based Policy Examples](https://docs.aws.amazon.com/eks/latest/userguide/security_iam_id-based-policy-examples.html)) 
+ A Kubernetes IAM execution role assumed by Kubernetes pipeline pods (**kfp-example-pod-role**) or the SageMaker AI Operator for Kubernetes controller pod to access SageMaker AI. This role is used to create and monitor SageMaker AI jobs from Kubernetes.

  The role requires the following permission:
  + AmazonSageMakerFullAccess 

  You can limit permissions to the KFP and controller pods by creating and attaching your own custom policy; a minimal sketch follows this list.
+ A SageMaker AI IAM execution role assumed by SageMaker AI jobs to access AWS resources such as Amazon S3 or Amazon ECR (**kfp-example-sagemaker-execution-role**).

  SageMaker AI jobs use this role to:
  + Access SageMaker AI resources
  + Read input data from Amazon S3
  + Store your output model to Amazon S3

  The role requires the following permissions:
  + AmazonSageMakerFullAccess 
  + AmazonS3FullAccess 
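
As referenced above, a minimal sketch of such a custom policy, scoped here to a few SageMaker AI training actions; the policy name is hypothetical and the action list is illustrative only, so extend it to cover the components you actually run:

```
cat <<EoF > ./kfp-sagemaker-scoped-policy.json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateTrainingJob",
                "sagemaker:DescribeTrainingJob",
                "sagemaker:StopTrainingJob",
                "sagemaker:AddTags"
            ],
            "Resource": "*"
        }
    ]
}
EoF
aws iam create-policy --policy-name kfp-sagemaker-scoped-policy --policy-document file://kfp-sagemaker-scoped-policy.json
```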

## Converting pipelines to use SageMaker AI


You can convert an existing pipeline to use SageMaker AI by porting your generic Python [processing containers](https://docs.aws.amazon.com/sagemaker/latest/dg/amazon-sagemaker-containers.html) and [training containers](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html). If you are using SageMaker AI for inference, you also need to attach IAM permissions to your cluster and convert an artifact to a model.

# Install Kubeflow Pipelines


[Kubeflow Pipelines (KFP)](https://www.kubeflow.org/docs/components/pipelines/v2/introduction/) is the pipeline orchestration component of Kubeflow.

You can deploy Kubeflow Pipelines (KFP) on an existing Amazon Elastic Kubernetes Service (Amazon EKS) cluster or create a new one. Use a gateway node to interact with your cluster. The gateway node can be your local machine or an Amazon EC2 instance.

The following section guides you through the steps to set up and configure these resources.

**Topics**
+ [

## Choose an installation option
](#choose-install-option)
+ [

## Configure your pipeline permissions to access SageMaker AI
](#configure-permissions-for-pipeline)
+ [

## Access the KFP UI (Kubeflow Dashboard)
](#access-the-kfp-ui)

## Choose an installation option


Kubeflow Pipelines is available as a core component of the full distribution of Kubeflow on AWS or as a standalone installation.

Select the option that applies to your use case:

1. [Full Kubeflow on AWS Deployment](#full-kubeflow-deployment)

   To use other Kubeflow components in addition to Kubeflow Pipelines, choose the full [AWS distribution of Kubeflow](https://awslabs.github.io/kubeflow-manifests) deployment. 

1. [Standalone Kubeflow Pipelines Deployment](#kubeflow-pipelines-standalone)

   To use Kubeflow Pipelines without the other components of Kubeflow, install the Kubeflow Pipelines standalone deployment. 

### Full Kubeflow on AWS Deployment


To install the full release of Kubeflow on AWS, choose the vanilla deployment option from the [Kubeflow on AWS deployment guide](https://awslabs.github.io/kubeflow-manifests/docs/deployment/) or any other deployment option that supports integrations with various AWS services (Amazon S3, Amazon RDS, Amazon Cognito).

### Standalone Kubeflow Pipelines Deployment


This section assumes that your user has permissions to create roles and define policies for the role.

#### Set up a gateway node


You can use your local machine or an Amazon EC2 instance as your gateway node. A gateway node is used to create an Amazon EKS cluster and access the Kubeflow Pipelines UI. 

Complete the following steps to set up your node. 

1. 

**Create a gateway node.**

   You can use an existing Amazon EC2 instance or create a new instance with the latest Ubuntu 18.04 DLAMI version using the steps in [Launching and Configuring a DLAMI](https://docs.aws.amazon.com/dlami/latest/devguide/launch-config.html).

1. 

**Create an IAM role to grant your gateway node access to AWS resources.**

   Create an IAM role with permissions to the following resources: CloudWatch, CloudFormation, IAM, Amazon EC2, Amazon S3, Amazon EKS.

   Attach the following policies to the IAM role:
   + CloudWatchLogsFullAccess 
   + [AWSCloudFormationFullAccess](https://console.aws.amazon.com/iam/home?region=us-east-1#/policies/arn%3Aaws%3Aiam%3A%3Aaws%3Apolicy%2FAWSCloudFormationFullAccess)
   + IAMFullAccess 
   + AmazonS3FullAccess 
   + AmazonEC2FullAccess 
   + AmazonEKSAdminPolicy (Create this policy using the schema from [Amazon EKS Identity-Based Policy Examples](https://docs.aws.amazon.com/eks/latest/userguide/security_iam_id-based-policy-examples.html)) 

   For information on adding IAM permissions to an IAM role, see [Adding and removing IAM identity permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html).

1. 

**Install the following tools and clients**

   Install and configure the following tools and resources on your gateway node to access the Amazon EKS cluster and KFP User Interface (UI). 
   + [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html): The command line tool for working with AWS services. For AWS CLI configuration information, see [Configuring the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html). 
   + [aws-iam-authenticator](https://docs.aws.amazon.com/eks/latest/userguide/install-aws-iam-authenticator.html) version 0.1.31 and above: A tool to use AWS IAM credentials to authenticate to a Kubernetes cluster.
   + [eksctl](https://docs.aws.amazon.com/eks/latest/userguide/eksctl.html) version above 0.15: The command line tool for working with Amazon EKS clusters.
   + [kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/#install-kubectl): The command line tool for working with Kubernetes clusters. The version needs to match your Kubernetes version within one minor version.
   + [AWS SDK for Python (Boto3)](https://aws.amazon.com/sdk-for-python/).

     ```
     pip install boto3
     ```

#### Set up an Amazon EKS cluster


1. If you do not have an existing Amazon EKS cluster, complete the following steps from the command line of your gateway node. Otherwise, skip this step.

   1. Run the following command to create an Amazon EKS cluster with version 1.17 or above. Replace `<clustername>` with any name for your cluster. 

      ```
      eksctl create cluster --name <clustername> --region us-east-1 --auto-kubeconfig --timeout=50m --managed --nodes=1
      ```

   1. When the cluster creation is complete, ensure that you have access to your cluster by listing the cluster's nodes. 

      ```
      kubectl get nodes
      ```

1. Ensure that the current `kubectl` context points to your cluster with the following command. The current context is marked with an asterisk (*) in the output.

   ```
   kubectl config get-contexts
   
   CURRENT NAME     CLUSTER
   *   <username>@<clustername>.us-east-1.eksctl.io   <clustername>.us-east-1.eksctl.io
   ```

1. If the desired cluster is not configured as your current default, update the default with the following command. 

   ```
   aws eks update-kubeconfig --name <clustername> --region us-east-1
   ```

#### Install Kubeflow Pipelines


Run the following steps from the terminal of your gateway node to install Kubeflow Pipelines on your cluster.

1. Install all [cert-manager components](https://cert-manager.io/docs/installation/kubectl/).

   ```
   kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.9.1/cert-manager.yaml
   ```

1. Install the Kubeflow Pipelines.

   ```
   export PIPELINE_VERSION=2.0.0-alpha.5
   kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/env/cert-manager/cluster-scoped-resources?ref=$PIPELINE_VERSION"
   kubectl wait --for condition=established --timeout=60s crd/applications.app.k8s.io
   kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/env/cert-manager/dev?ref=$PIPELINE_VERSION"
   ```

1. Ensure that the Kubeflow Pipelines service and other related resources are running.

   ```
   kubectl -n kubeflow get all | grep pipeline
   ```

   Your output should look like the following.

   ```
   pod/ml-pipeline-6b88c67994-kdtjv                      1/1     Running            0          2d
   pod/ml-pipeline-persistenceagent-64d74dfdbf-66stk     1/1     Running            0          2d
   pod/ml-pipeline-scheduledworkflow-65bdf46db7-5x9qj    1/1     Running            0          2d
   pod/ml-pipeline-ui-66cc4cffb6-cmsdb                   1/1     Running            0          2d
   pod/ml-pipeline-viewer-crd-6db65ccc4-wqlzj            1/1     Running            0          2d
   pod/ml-pipeline-visualizationserver-9c47576f4-bqmx4   1/1     Running            0          2d
   service/ml-pipeline                       ClusterIP   10.100.170.170   <none>        8888/TCP,8887/TCP   2d
   service/ml-pipeline-ui                    ClusterIP   10.100.38.71     <none>        80/TCP              2d
   service/ml-pipeline-visualizationserver   ClusterIP   10.100.61.47     <none>        8888/TCP            2d
   deployment.apps/ml-pipeline                       1/1     1            1           2d
   deployment.apps/ml-pipeline-persistenceagent      1/1     1            1           2d
   deployment.apps/ml-pipeline-scheduledworkflow     1/1     1            1           2d
   deployment.apps/ml-pipeline-ui                    1/1     1            1           2d
   deployment.apps/ml-pipeline-viewer-crd            1/1     1            1           2d
   deployment.apps/ml-pipeline-visualizationserver   1/1     1            1           2d
   replicaset.apps/ml-pipeline-6b88c67994                      1         1         1       2d
   replicaset.apps/ml-pipeline-persistenceagent-64d74dfdbf     1         1         1       2d
   replicaset.apps/ml-pipeline-scheduledworkflow-65bdf46db7    1         1         1       2d
   replicaset.apps/ml-pipeline-ui-66cc4cffb6                   1         1         1       2d
   replicaset.apps/ml-pipeline-viewer-crd-6db65ccc4            1         1         1       2d
   replicaset.apps/ml-pipeline-visualizationserver-9c47576f4   1         1         1       2d
   ```

## Configure your pipeline permissions to access SageMaker AI


In this section, you create an IAM execution role granting Kubeflow Pipeline pods access to SageMaker AI services. 

### Configuration for SageMaker AI components version 2


To run SageMaker AI Components version 2 for Kubeflow Pipelines, you need to install [SageMaker AI Operator for Kubernetes](https://github.com/aws-controllers-k8s/sagemaker-controller) and configure Role-Based Access Control (RBAC) allowing the Kubeflow Pipelines pods to create SageMaker AI custom resources in your Kubernetes cluster.

**Important**  
Follow this section if you are using the Kubeflow Pipelines standalone deployment. If you are using the AWS distribution of Kubeflow version 1.6.0-aws-b1.0.0 or above, SageMaker AI components version 2 are already set up.

1. Install SageMaker AI Operator for Kubernetes to use SageMaker AI components version 2.

   Follow the *Setup* section of [Machine Learning with ACK SageMaker AI Controller tutorial](https://aws-controllers-k8s.github.io/community/docs/tutorials/sagemaker-example/#setup).

1. Configure RBAC permissions for the execution role (service account) used by Kubeflow Pipelines pods. In Kubeflow Pipelines standalone deployment, pipeline runs are executed in the namespace `kubeflow` using the `pipeline-runner` service account.

   1. Create a [RoleBinding](https://kubernetes.io/docs/reference/access-authn-authz/rbac/#rolebinding-example) that gives the service account permission to manage SageMaker AI custom resources.

      ```
      cat > manage_sagemaker_cr.yaml <<EOF
      apiVersion: rbac.authorization.k8s.io/v1
      kind: RoleBinding
      metadata:
        name: manage-sagemaker-cr
        namespace: kubeflow
      subjects:
      - kind: ServiceAccount
        name: pipeline-runner
        namespace: kubeflow
      roleRef:
        kind: ClusterRole
        name: ack-sagemaker-controller
        apiGroup: rbac.authorization.k8s.io
      EOF
      ```

      ```
      kubectl apply -f manage_sagemaker_cr.yaml
      ```

   1. Ensure that the rolebinding was created by running:

      ```
      kubectl get rolebinding manage-sagemaker-cr -n kubeflow -o yaml
      ```

### Configuration for SageMaker AI components version 1


To run SageMaker AI Components version 1 for Kubeflow Pipelines, the Kubeflow Pipeline pods need access to SageMaker AI.

**Important**  
Follow this section whether you are using the full Kubeflow on AWS deployment or the Kubeflow Pipelines standalone deployment.

To create an IAM execution role that grants Kubeflow Pipelines pods access to SageMaker AI, follow these steps:

1. Export your cluster name (e.g., *my-cluster-name*) and cluster region (e.g., *us-east-1*).

   ```
   export CLUSTER_NAME=my-cluster-name
   export CLUSTER_REGION=us-east-1
   ```

1. Export the namespace and service account name according to your installation.
   + For the full Kubeflow on AWS installation, export your profile `namespace` (e.g., *kubeflow-user-example-com*) and *default-editor* as the service account.

     ```
     export NAMESPACE=kubeflow-user-example-com
     export KUBEFLOW_PIPELINE_POD_SERVICE_ACCOUNT=default-editor
     ```
   + For the standalone Pipelines deployment, export *kubeflow* as the `namespace` and *pipeline-runner* as the service account.

     ```
     export NAMESPACE=kubeflow
     export KUBEFLOW_PIPELINE_POD_SERVICE_ACCOUNT=pipeline-runner
     ```

1. Create an [ IAM OIDC provider for the Amazon EKS cluster](https://docs.aws.amazon.com/eks/latest/userguide/enable-iam-roles-for-service-accounts.html) with the following command.

   ```
   eksctl utils associate-iam-oidc-provider --cluster ${CLUSTER_NAME} \
               --region ${CLUSTER_REGION} --approve
   ```

1. Create an IAM execution role for the KFP pods to access AWS services (SageMaker AI, CloudWatch).

   ```
   eksctl create iamserviceaccount \
   --name ${KUBEFLOW_PIPELINE_POD_SERVICE_ACCOUNT} \
   --namespace ${NAMESPACE} --cluster ${CLUSTER_NAME} \
   --region ${CLUSTER_REGION} \
   --attach-policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess \
   --attach-policy-arn arn:aws:iam::aws:policy/CloudWatchLogsFullAccess \
   --override-existing-serviceaccounts \
   --approve
   ```

Once your pipeline permissions are configured to access SageMaker AI Components version 1, follow the SageMaker AI components for Kubeflow Pipelines guide in the [Kubeflow on AWS documentation](https://awslabs.github.io/kubeflow-manifests/docs/amazon-sagemaker-integration/sagemaker-components-for-kubeflow-pipelines/).

## Access the KFP UI (Kubeflow Dashboard)


The Kubeflow Pipelines UI is used for managing and tracking experiments, jobs, and runs on your cluster. For instructions on how to access the Kubeflow Pipelines UI from your gateway node, follow the steps that apply to your deployment option in this section.

### Full Kubeflow on AWS Deployment


Follow the instructions on the [Kubeflow on AWS website](https://awslabs.github.io/kubeflow-manifests/docs/deployment/connect-kubeflow-dashboard/) to connect to the Kubeflow dashboard and navigate to the pipelines tab.

### Standalone Kubeflow Pipelines Deployment


Use port forwarding to access the Kubeflow Pipelines UI from your gateway node by following these steps.

#### Set up port forwarding to the KFP UI service


Run the following commands from the command line of your gateway node.

1. Verify that the KFP UI service is running using the following command.

   ```
   kubectl -n kubeflow get service ml-pipeline-ui
   
   NAME             TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
   ml-pipeline-ui   ClusterIP   10.100.38.71   <none>        80/TCP    2d22h
   ```

1. Run the following command to set up port forwarding to the KFP UI service. This forwards the KFP UI to port 8080 on your gateway node and allows you to access the KFP UI from your browser. 

   ```
   kubectl port-forward -n kubeflow service/ml-pipeline-ui 8080:80
   ```

   The port forward from your remote machine drops if there is no activity. Run this command again if your dashboard is unable to get logs or updates. If the commands return an error, ensure that there is no process already running on the port you are trying to use. 
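
   A quick check for a conflicting process, assuming `lsof` is available on your gateway node:

   ```
   lsof -i :8080
   ```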

#### Access the KFP UI service


Your method of accessing the KFP UI depends on your gateway node type.
+ Local machine as the gateway node:

  1. Access the dashboard in your browser as follows: 

     ```
     http://localhost:8080
     ```

  1. Choose **Pipelines** to access the pipelines UI. 
+ Amazon EC2 instance as the gateway node:

  1. You need to set up an SSH tunnel on your Amazon EC2 instance to access the Kubeflow dashboard from your local machine's browser. 

     From a new terminal session in your local machine, run the following. Replace `<public-DNS-of-gateway-node>` with the IP address of your instance found on the Amazon EC2 console. You can also use the public DNS. Replace `<path_to_key>` with the path to the pem key used to access the gateway node. 

     ```
     public_DNS_address=<public-DNS-of-gateway-node>
     key=<path_to_key>
     
     # On Ubuntu:
     ssh -i ${key} -L 9000:localhost:8080 ubuntu@${public_DNS_address}
     
     # On Amazon Linux:
     ssh -i ${key} -L 9000:localhost:8080 ec2-user@${public_DNS_address}
     ```

  1. Access the dashboard in your browser. 

     ```
     http://localhost:9000
     ```

  1. Choose **Pipelines** to access the KFP UI. 

#### (Optional) Grant SageMaker AI notebook instances access to Amazon EKS, and run KFP pipelines from your notebook.


A SageMaker notebook instance is a fully managed Amazon EC2 compute instance that runs the Jupyter Notebook App. You can use a notebook instance to create and manage Jupyter notebooks, then define, compile, deploy, and run your KFP pipelines using the AWS SDK for Python (Boto3) or the KFP CLI. 

1. Follow the steps in [Create a SageMaker Notebook Instance](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-setup-working-env.html) to create your notebook instance, then attach the `AmazonS3FullAccess` policy to its IAM execution role.

1. From the command line of your gateway node, run the following command to retrieve the IAM role ARN of the notebook instance you created. Replace `<instance-name>` with the name of your instance.

   ```
   aws sagemaker describe-notebook-instance --notebook-instance-name <instance-name> --region <region> --output text --query 'RoleArn'
   ```

   This command outputs the IAM role ARN in the `arn:aws:iam::<account-id>:role/<role-name>` format. Take note of this ARN.

1. Run this command to attach the following policies (AmazonSageMakerFullAccess, AmazonEKSWorkerNodePolicy, AmazonS3FullAccess) to the IAM role. Replace `<role-name>` with the role name from your ARN. 

   ```
   aws iam attach-role-policy --role-name <role-name> --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
   aws iam attach-role-policy --role-name <role-name> --policy-arn arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
   aws iam attach-role-policy --role-name <role-name> --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
   ```

1. Amazon EKS clusters use IAM roles to control access to the cluster. The rules are implemented in a config map named `aws-auth`. `eksctl` provides commands to read and edit the `aws-auth` config map. Only the users that have access to the cluster can edit this config map.

   `system:masters` is one of the default user groups with super user permissions to the cluster. Add your user to this group or create a group with more restrictive permissions.

1. Bind the role to your cluster by running the following command. Replace `<IAM-Role-arn>` with the ARN of the IAM role. `<your-username>` can be any unique username.

   ```
   eksctl create iamidentitymapping \
   --cluster <cluster-name> \
   --arn <IAM-Role-arn> \
   --group system:masters \
   --username <your-username> \
   --region <region>
   ```

1. Open a Jupyter notebook on your SageMaker AI instance and run the following command to ensure that it has access to the cluster.

   ```
   aws eks --region <region> update-kubeconfig --name <cluster-name>
   kubectl -n kubeflow get all | grep pipeline
   ```

# Use SageMaker AI components


In this tutorial, you run a pipeline using SageMaker AI Components for Kubeflow Pipelines to train a classification model using K-Means with the MNIST dataset on SageMaker AI. The workflow uses Kubeflow Pipelines as the orchestrator and SageMaker AI to execute each step of the workflow. The example was taken from an existing [SageMaker AI example](https://github.com/aws/amazon-sagemaker-examples/blob/8279abfcc78bad091608a4a7135e50a0bd0ec8bb/sagemaker-python-sdk/1P_kmeans_highlevel/kmeans_mnist.ipynb) and modified to work with SageMaker AI Components for Kubeflow Pipelines.

You can define your pipeline in Python, then use the KFP dashboard, KFP CLI, or the AWS SDK for Python (Boto3) to compile, deploy, and run your workflows. The full code for the MNIST classification pipeline example is available in the [Kubeflow Github repository](https://github.com/kubeflow/pipelines/tree/master/samples/contrib/aws-samples/mnist-kmeans-sagemaker#mnist-classification-with-kmeans). To use it, clone the Python files to your gateway node.

You can find additional [ SageMaker AI Kubeflow Pipelines examples](https://github.com/kubeflow/pipelines/tree/master/samples/contrib/aws-samples) on GitHub. For information on the components used, see the [KubeFlow Pipelines GitHub repository](https://github.com/kubeflow/pipelines/tree/master/components/aws/sagemaker).

To run the classification pipeline example, create a SageMaker AI IAM execution role granting your training job the permission to access AWS resources, then continue with the steps that correspond to your deployment option.

## Create a SageMaker AI execution role


The `kfp-example-sagemaker-execution-role` IAM role is a runtime role assumed by SageMaker AI jobs to access AWS resources. In the following command, you create an IAM execution role named `kfp-example-sagemaker-execution-role`, attach two managed policies (AmazonSageMakerFullAccess, AmazonS3FullAccess), and create a trust relationship with SageMaker AI to grant SageMaker AI jobs access to those AWS resources.

You provide this role as an input parameter when running the pipeline.

Run the following command to create the role. Note the ARN that is returned in your output.

```
SAGEMAKER_EXECUTION_ROLE_NAME=kfp-example-sagemaker-execution-role

TRUST="{ \"Version\": \"2012-10-17		 	 	 \", \"Statement\": [ { \"Effect\": \"Allow\", \"Principal\": { \"Service\": \"sagemaker.amazonaws.com\" }, \"Action\": \"sts:AssumeRole\" } ] }"
aws iam create-role --role-name ${SAGEMAKER_EXECUTION_ROLE_NAME} --assume-role-policy-document "$TRUST"
aws iam attach-role-policy --role-name ${SAGEMAKER_EXECUTION_ROLE_NAME} --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
aws iam attach-role-policy --role-name ${SAGEMAKER_EXECUTION_ROLE_NAME} --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess

aws iam get-role --role-name ${SAGEMAKER_EXECUTION_ROLE_NAME} --output text --query 'Role.Arn'
```

## Full Kubeflow on AWS Deployment


Follow the instructions in the [SageMaker Training Pipeline tutorial for MNIST Classification with K-Means](https://awslabs.github.io/kubeflow-manifests/docs/amazon-sagemaker-integration/sagemaker-components-for-kubeflow-pipelines/).

## Standalone Kubeflow Pipelines Deployment


### Prepare datasets


To run the pipelines, you need to upload the data extraction pre-processing script to an Amazon S3 bucket. This bucket and all resources for this example must be located in the `us-east-1` region. For information on creating a bucket, see [Creating a bucket](https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html).
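
If you prefer the AWS CLI, a minimal sketch of creating the bucket in the required region (replace `<bucket-name>`):

```
aws s3 mb s3://<bucket-name> --region us-east-1
```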

From the `mnist-kmeans-sagemaker` folder of the Kubeflow repository you cloned on your gateway node, run the following command to upload the `kmeans_preprocessing.py` file to your Amazon S3 bucket. Change `<bucket-name>` to the name of your Amazon S3 bucket.

```
aws s3 cp mnist-kmeans-sagemaker/kmeans_preprocessing.py s3://<bucket-name>/mnist_kmeans_example/processing_code/kmeans_preprocessing.py
```

### Compile and deploy your pipeline


After defining the pipeline, you must compile it to an intermediate representation before you submit it to the Kubeflow Pipelines service on your cluster. The intermediate representation is a workflow specification in the form of a YAML file compressed into a tar.gz file. You need the KFP SDK to compile your pipeline.

#### Install KFP SDK


Run the following from the command line of your gateway node:

1. Install the KFP SDK following the instructions in the [Kubeflow pipelines documentation](https://www.kubeflow.org/docs/pipelines/sdk/install-sdk/).

1. Verify that the KFP SDK is installed with the following command:

   ```
   pip show kfp
   ```

1. Verify that `dsl-compile` has been installed correctly as follows:

   ```
   which dsl-compile
   ```

#### Compile your pipeline


You have three options to interact with Kubeflow Pipelines: KFP UI, KFP CLI, or the KFP SDK. The following sections illustrate the workflow using the KFP UI and CLI.

Complete the following steps from your gateway node.

1. Modify your Python file with your Amazon S3 bucket name and IAM role ARN.

1. Use the `dsl-compile` command from the command line to compile your pipeline as follows. Replace `<path-to-python-file>` with the path to your pipeline and `<path-to-output>` with the location where you want your tar.gz file to be.

   ```
   dsl-compile --py <path-to-python-file> --output <path-to-output>
   ```

#### Upload and run the pipeline using the KFP CLI


Complete the following steps from the command line of your gateway node. KFP organizes runs of your pipeline as experiments. You have the option to specify an experiment name. If you do not specify one, the run is listed under the **Default** experiment.

1. Upload your pipeline as follows:

   ```
   kfp pipeline upload --pipeline-name <pipeline-name> <path-to-output-tar.gz>
   ```

   Your output should look like the following. Take note of the pipeline `ID`.

   ```
   Pipeline 29c3ff21-49f5-4dfe-94f6-618c0e2420fe has been submitted
   
   Pipeline Details
   ------------------
   ID           29c3ff21-49f5-4dfe-94f6-618c0e2420fe
   Name         sm-pipeline
   Description
   Uploaded at  2020-04-30T20:22:39+00:00
   ...
   ...
   ```

1. Create a run using the following command. The KFP CLI run command currently does not support specifying input parameters while creating the run. You need to update the parameters in your Python pipeline file before compiling. Replace `<experiment-name>` and `<job-name>` with any names. Replace `<pipeline-id>` with the ID of your submitted pipeline. Replace `<your-role-arn>` with the ARN of `kfp-example-sagemaker-execution-role`. Replace `<your-bucket-name>` with the name of the Amazon S3 bucket you created. 

   ```
   kfp run submit --experiment-name <experiment-name> --run-name <job-name> --pipeline-id <pipeline-id> role_arn="<your-role-arn>" bucket_name="<your-bucket-name>"
   ```

   You can also directly submit a run using the compiled pipeline package created as the output of the `dsl-compile` command.

   ```
   kfp run submit --experiment-name <experiment-name> --run-name <job-name> --package-file <path-to-output> role_arn="<your-role-arn>" bucket_name="<your-bucket-name>"
   ```

   Your output should look like the following:

   ```
   Creating experiment aws.
   Run 95084a2c-f18d-4b77-a9da-eba00bf01e63 is submitted
   +--------------------------------------+--------+----------+---------------------------+
   | run id                               | name   | status   | created at                |
   +======================================+========+==========+===========================+
   | 95084a2c-f18d-4b77-a9da-eba00bf01e63 | sm-job |          | 2020-04-30T20:36:41+00:00 |
   +--------------------------------------+--------+----------+---------------------------+
   ```

1. Navigate to the UI to check the progress of the job.

#### Upload and run the pipeline using the KFP UI


1. On the left panel, choose the **Pipelines** tab. 

1. In the upper-right corner, choose **+ Upload pipeline**. 

1. Enter the pipeline name and description. 

1. Choose **Upload a file** and enter the path to the tar.gz file you created with the `dsl-compile` command.

1. On the left panel, choose the **Pipelines** tab.

1. Find the pipeline you created.

1. Choose **+ Create run**.

1. Enter your input parameters.

1. Choose **Run**.

### Run predictions


Once your classification pipeline is deployed, you can run classification predictions against the endpoint that was created by the Deploy component. Use the KFP UI to check the output artifacts for `sagemaker-deploy-model-endpoint_name`. Download the .tgz file to extract the endpoint name, or check the SageMaker AI console in the region you used.
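
Alternatively, a quick way to list endpoint names from the AWS CLI:

```
aws sagemaker list-endpoints --region us-east-1 --output text --query 'Endpoints[*].EndpointName'
```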

#### Configure permissions to run predictions


If you want to run predictions from your gateway node, skip this section.

**To use any other machine to run predictions, assign the `sagemaker:InvokeEndpoint` permission to the IAM role used by the client machine.**

1. On your gateway node, run the following to create an IAM policy file:

   ```
   cat <<EoF > ./sagemaker-invoke.json
   {
       "Version": "2012-10-17",		 	 	 
       "Statement": [
           {
               "Effect": "Allow",
               "Action": [
                   "sagemaker:InvokeEndpoint"
               ],
               "Resource": "*"
           }
       ]
   }
   EoF
   ```

1. Attach the policy to the IAM role of the client node.

   Run the following command. Replace `<your-instance-IAM-role>` with the name of the IAM role. Replace `<path-to-sagemaker-invoke-json>` with the path to the policy file you created.

   ```
   aws iam put-role-policy --role-name <your-instance-IAM-role> --policy-name sagemaker-invoke-for-worker --policy-document file://<path-to-sagemaker-invoke-json>
   ```

#### Run predictions


1. On your client machine, create a Python file named `mnist-predictions.py` with the following content. Replace the `ENDPOINT_NAME` variable. The script loads the MNIST dataset, creates a CSV from those digits, then sends the CSV to the endpoint for prediction and prints the results.

   ```
   import boto3
   import gzip
   import io
   import json
   import numpy
   import pickle
   
   ENDPOINT_NAME='<endpoint-name>'
   region = boto3.Session().region_name
   
   # S3 bucket where the original mnist data is downloaded and stored
   downloaded_data_bucket = f"jumpstart-cache-prod-{region}"
   downloaded_data_prefix = "1p-notebooks-datasets/mnist"
   
   # Download the dataset
   s3 = boto3.client("s3")
   s3.download_file(downloaded_data_bucket, f"{downloaded_data_prefix}/mnist.pkl.gz", "mnist.pkl.gz")
   
   # Load the dataset
   with gzip.open('mnist.pkl.gz', 'rb') as f:
       train_set, valid_set, test_set = pickle.load(f, encoding='latin1')
   
   # Simple function to create a csv from our numpy array
   def np2csv(arr):
       csv = io.BytesIO()
       numpy.savetxt(csv, arr, delimiter=',', fmt='%g')
       return csv.getvalue().decode().rstrip()
   
   runtime = boto3.Session(region_name=region).client('sagemaker-runtime')
   
   payload = np2csv(train_set[0][30:31])
   
   response = runtime.invoke_endpoint(EndpointName=ENDPOINT_NAME,
                                      ContentType='text/csv',
                                      Body=payload)
   result = json.loads(response['Body'].read().decode())
   print(result)
   ```

1. Run the Python file as follows:

   ```
   python mnist-predictions.py
   ```

### View results and logs


When the pipeline is running, you can choose any component to check execution details, such as inputs and outputs. These details include the names of the resources that were created.

If the KFP request is successfully processed and a SageMaker AI job is created, the component logs in the KFP UI provide a link to the job created in SageMaker AI. The CloudWatch logs are also provided if the job is successfully created. 

If you run too many pipeline jobs on the same cluster, you may see an error message that indicates that you do not have enough pods available. To fix this, log in to your gateway node and delete the pods created by the pipelines you are not using:

```
kubectl get pods -n kubeflow
kubectl delete pods -n kubeflow <name-of-pipeline-pod>
```

### Cleanup


When you're finished with your pipeline, you need to clean up your resources.

1. From the KFP dashboard, terminate your pipeline runs if they do not exit properly by choosing **Terminate**.

1. If the **Terminate** option doesn't work, log in to your gateway node and manually terminate all the pods created by your pipeline run as follows: 

   ```
   kubectl get pods -n kubeflow
   kubectl delete pods -n kubeflow <name-of-pipeline-pod>
   ```

1. Using your AWS account, log in to the SageMaker AI service. Manually stop all training, batch transform, and HPO jobs. Delete models, data buckets, and endpoints to avoid incurring any additional costs. Terminating the pipeline runs does not stop the jobs in SageMaker AI.
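
   For example, a hedged sketch of removing the example endpoint and its configuration from the AWS CLI, with placeholder names:

   ```
   aws sagemaker delete-endpoint --endpoint-name <endpoint-name> --region us-east-1
   aws sagemaker delete-endpoint-config --endpoint-config-name <endpoint-config-name> --region us-east-1
   ```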