

# SageMaker HyperPod task governance

SageMaker HyperPod task governance is a management system that streamlines resource allocation and ensures efficient utilization of compute resources across teams and projects in your Amazon EKS clusters. It gives administrators the ability to set:
+ Priority levels for various tasks
+ Compute allocation for each team
+ How each team lends and borrows idle compute
+ Whether a team can preempt its own tasks

HyperPod task governance also provides Amazon EKS cluster observability, offering real-time visibility into cluster capacity. This includes compute availability and usage, team allocation and utilization, and task run and wait times, so you can make informed decisions and manage resources proactively.

The following sections cover how to set up, understand key concepts, and use HyperPod task governance for your Amazon EKS clusters.

**Topics**
+ [Setup for SageMaker HyperPod task governance](sagemaker-hyperpod-eks-operate-console-ui-governance-setup.md)
+ [Dashboard](sagemaker-hyperpod-eks-operate-console-ui-governance-metrics.md)
+ [Tasks](sagemaker-hyperpod-eks-operate-console-ui-governance-tasks.md)
+ [Policies](sagemaker-hyperpod-eks-operate-console-ui-governance-policies.md)
+ [Example HyperPod task governance AWS CLI commands](sagemaker-hyperpod-eks-operate-console-ui-governance-cli.md)
+ [Troubleshoot](sagemaker-hyperpod-eks-operate-console-ui-governance-troubleshoot.md)
+ [Attribution document for Amazon SageMaker HyperPod task governance](sagemaker-hyperpod-eks-operate-console-ui-governance-attributions.md)

# Setup for SageMaker HyperPod task governance

The following section provides information on how to set up the Amazon CloudWatch Observability EKS add-on and the SageMaker HyperPod task governance add-on.

Ensure that you have the minimum permission policy for HyperPod cluster administrators with Amazon EKS, in [IAM users for cluster admin](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-cluster-admin). This includes permissions to run the SageMaker HyperPod core APIs, manage SageMaker HyperPod clusters within your AWS account, and perform the tasks in [Managing SageMaker HyperPod clusters orchestrated by Amazon EKS](sagemaker-hyperpod-eks-operate.md).

**Topics**
+ [Dashboard setup](sagemaker-hyperpod-eks-operate-console-ui-governance-setup-dashboard.md)
+ [Task governance setup](sagemaker-hyperpod-eks-operate-console-ui-governance-setup-task-governance.md)

# Dashboard setup


Use the following information to set up the Amazon CloudWatch Observability EKS add-on for your SageMaker HyperPod cluster. The add-on provides a detailed visual dashboard with metrics for your EKS cluster hardware, team allocation, and tasks.

If you run into issues during setup, see [Troubleshoot](sagemaker-hyperpod-eks-operate-console-ui-governance-troubleshoot.md) for known troubleshooting solutions.

**Topics**
+ [HyperPod Amazon CloudWatch Observability EKS add-on prerequisites](#hp-eks-dashboard-prerequisites)
+ [HyperPod Amazon CloudWatch Observability EKS add-on setup](#hp-eks-dashboard-setup)

## HyperPod Amazon CloudWatch Observability EKS add-on prerequisites


The following are the prerequisites to complete before installing the Amazon EKS Observability add-on.
+ Ensure that you have the minimum permission policy for HyperPod cluster administrators, in [IAM users for cluster admin](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-cluster-admin).
+ Attach the `CloudWatchAgentServerPolicy` IAM policy to your worker nodes. To do so, enter the following command. Replace `my-worker-node-role` with the IAM role used by your Kubernetes worker nodes.

  ```
  aws iam attach-role-policy \
    --role-name my-worker-node-role \
    --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
  ```

## HyperPod Amazon CloudWatch Observability EKS add-on setup


Use the following options to set up the Amazon SageMaker HyperPod Amazon CloudWatch Observability EKS add-on.

------
#### [ Setup using the SageMaker AI console ]

The following permissions are required for setup and visualizing the HyperPod task governance dashboard. This section expands upon the permissions listed in [IAM users for cluster admin](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-cluster-admin). 

To manage task governance, use the following sample policy:

------
#### [ JSON ]


```
{
    "Version":"2012-10-17",		 	 	 
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:ListClusters",
                "sagemaker:DescribeCluster",
                "sagemaker:ListComputeQuotas",
                "sagemaker:CreateComputeQuota",
                "sagemaker:UpdateComputeQuota",
                "sagemaker:DescribeComputeQuota",
                "sagemaker:DeleteComputeQuota",
                "sagemaker:ListClusterSchedulerConfigs",
                "sagemaker:DescribeClusterSchedulerConfig",
                "sagemaker:CreateClusterSchedulerConfig",
                "sagemaker:UpdateClusterSchedulerConfig",
                "sagemaker:DeleteClusterSchedulerConfig",
                "eks:ListAddons",
                "eks:CreateAddon",
                "eks:DescribeAddon",
                "eks:DescribeCluster",
                "eks:DescribeAccessEntry",
                "eks:ListAssociatedAccessPolicies",
                "eks:AssociateAccessPolicy",
                "eks:DisassociateAccessPolicy"
            ],
            "Resource": "*"
        }
    ]
}
```

------
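As a quick sanity check before attaching a policy like the one above, you can validate its shape programmatically. The following is a minimal illustrative sketch (not part of the AWS tooling), using only the Python standard library and an abbreviated copy of the policy:

```python
# Sample policy document (the task governance policy from this page, abbreviated).
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:ListComputeQuotas",
                "sagemaker:CreateComputeQuota",
                "eks:ListAddons",
                "eks:CreateAddon",
            ],
            "Resource": "*",
        }
    ],
}

def validate_policy(doc):
    """Basic structural checks on an IAM policy document."""
    assert doc["Version"] == "2012-10-17"
    for stmt in doc["Statement"]:
        assert stmt["Effect"] in ("Allow", "Deny")
        for action in stmt["Action"]:
            service, _, api = action.partition(":")
            # The task governance policy only needs SageMaker AI and EKS actions.
            assert service in ("sagemaker", "eks"), f"unexpected service: {service}"
            assert api, "action must name an API operation"
    return True

print(validate_policy(policy))  # → True
```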

To grant permissions to manage Amazon CloudWatch Observability Amazon EKS and view the HyperPod cluster dashboard through the SageMaker AI console, use the sample policy below:

------
#### [ JSON ]


```
{
    "Version":"2012-10-17",		 	 	 
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "eks:ListAddons",
                "eks:CreateAddon",
                "eks:UpdateAddon",
                "eks:DescribeAddon",
                "eks:DescribeAddonVersions",
                "sagemaker:DescribeCluster",
                "sagemaker:DescribeClusterNode",
                "sagemaker:ListClusterNodes",
                "sagemaker:ListClusters",
                "sagemaker:ListComputeQuotas",
                "sagemaker:DescribeComputeQuota",
                "sagemaker:ListClusterSchedulerConfigs",
                "sagemaker:DescribeClusterSchedulerConfig",
                "eks:DescribeCluster",
                "cloudwatch:GetMetricData",
                "eks:AccessKubernetesApi"
            ],
            "Resource": "*"
        }
    ]
}
```

------

Navigate to the **Dashboard** tab in the SageMaker HyperPod console to install the Amazon CloudWatch Observability EKS add-on. To ensure that task governance metrics are included in the **Dashboard**, select the Kueue metrics checkbox. Enabling Kueue metrics incurs CloudWatch metrics costs after the free-tier limit is reached. For more information, see **Metrics** in [Amazon CloudWatch Pricing](https://aws.amazon.com/cloudwatch/pricing/).

------
#### [ Setup using the EKS AWS CLI ]

Use the following EKS AWS CLI command to install the add-on:

```
aws eks create-addon --cluster-name cluster-name \
--addon-name amazon-cloudwatch-observability \
--configuration-values "configuration json"
```

The following is an example of the configuration values JSON:

```
{
    "agent": {
        "config": {
            "logs": {
                "metrics_collected": {
                    "kubernetes": {
                        "kueue_container_insights": true,
                        "enhanced_container_insights": true
                    },
                    "application_signals": { }
                }
            },
            "traces": {
                "traces_collected": {
                    "application_signals": { }
                }
            }
        }
    }
}
```
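Because `--configuration-values` takes the whole configuration as a single JSON string, it can help to build and quote it programmatically. A minimal sketch, assuming a hypothetical cluster name `my-cluster`:

```python
import json
import shlex

# Add-on configuration enabling Kueue and enhanced Container Insights metrics.
config = {
    "agent": {
        "config": {
            "logs": {
                "metrics_collected": {
                    "kubernetes": {
                        "kueue_container_insights": True,
                        "enhanced_container_insights": True,
                    },
                    "application_signals": {},
                }
            },
            "traces": {"traces_collected": {"application_signals": {}}},
        }
    }
}

config_str = json.dumps(config)

# Quote the JSON so it survives the shell as a single argument.
command = (
    "aws eks create-addon --cluster-name my-cluster "  # cluster name is illustrative
    "--addon-name amazon-cloudwatch-observability "
    f"--configuration-values {shlex.quote(config_str)}"
)
print(command)
```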

------
#### [ Setup using the EKS Console UI ]

1. Navigate to the [EKS console](https://console.aws.amazon.com/eks/home#/clusters).

1. Choose your cluster.

1. Choose **Add-ons**.

1. Find the **Amazon CloudWatch Observability** add-on and install it. Use version 2.4.0 or later of the add-on.

1. Include the following JSON in the **Configuration values** field:

   ```
   {
       "agent": {
           "config": {
               "logs": {
                   "metrics_collected": {
                       "kubernetes": {
                           "kueue_container_insights": true,
                           "enhanced_container_insights": true
                       },
                       "application_signals": { }
                   }
               },
               "traces": {
                   "traces_collected": {
                       "application_signals": { }
                   }
               }
           }
       }
   }
   ```

------

Once the EKS Observability add-on has been successfully installed, you can view your EKS cluster metrics under the HyperPod console **Dashboard** tab.

# Task governance setup


This section includes information on how to set up the Amazon SageMaker HyperPod task governance EKS add-on, including granting permissions that allow you to set task prioritization, compute allocation for teams, how idle compute is shared, and task preemption for teams.

If you run into issues during setup, see [Troubleshoot](sagemaker-hyperpod-eks-operate-console-ui-governance-troubleshoot.md) for known troubleshooting solutions.

**Topics**
+ [Kueue Settings](#hp-eks-task-governance-kueue-settings)
+ [HyperPod Task governance prerequisites](#hp-eks-task-governance-prerequisites)
+ [HyperPod task governance setup](#hp-eks-task-governance-setup)

## Kueue Settings


The HyperPod task governance EKS add-on installs [Kueue](https://github.com/kubernetes-sigs/kueue/tree/main/apis/kueue) on your HyperPod EKS clusters. Kueue is a Kubernetes-native system that manages quotas and how jobs consume them.


| EKS HyperPod task governance add-on version | Version of Kueue that is installed as part of the add-on | 
| --- | --- | 
|  v1.1.3  |  v0.12.0  | 

**Note**  
Kueue v0.12.0 and higher don't include kueue-rbac-proxy as part of the installation. Previous versions might have kueue-rbac-proxy installed. For example, if you're using Kueue v0.8.1, you might have kueue-rbac-proxy v0.18.1.

HyperPod task governance leverages Kueue for Kubernetes-native job queueing, scheduling, and quota management, and Kueue is installed with the HyperPod task governance EKS add-on. When installed, HyperPod creates and modifies SageMaker AI-managed Kubernetes resources such as `KueueManagerConfig`, `ClusterQueues`, `LocalQueues`, `WorkloadPriorityClasses`, `ResourceFlavors`, and `ValidatingAdmissionPolicies`. While Kubernetes administrators have the flexibility to modify the state of these resources, any changes made to a SageMaker AI-managed resource may be overwritten by the service.

The following are the configuration settings used by the HyperPod task governance add-on to set up Kueue.

```
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
health:
  healthProbeBindAddress: :8081
metrics:
  bindAddress: :8443
  enableClusterQueueResources: true
webhook:
  port: 9443
manageJobsWithoutQueueName: false
leaderElection:
  leaderElect: true
  resourceName: c1f6bfd2.kueue.x-k8s.io
controller:
  groupKindConcurrency:
    Job.batch: 5
    Pod: 5
    Workload.kueue.x-k8s.io: 5
    LocalQueue.kueue.x-k8s.io: 1
    ClusterQueue.kueue.x-k8s.io: 1
    ResourceFlavor.kueue.x-k8s.io: 1
clientConnection:
  qps: 50
  burst: 100
integrations:
  frameworks:
  - "batch/job"
  - "kubeflow.org/mpijob"
  - "ray.io/rayjob"
  - "ray.io/raycluster"
  - "jobset.x-k8s.io/jobset"
  - "kubeflow.org/mxjob"
  - "kubeflow.org/paddlejob"
  - "kubeflow.org/pytorchjob"
  - "kubeflow.org/tfjob"
  - "kubeflow.org/xgboostjob"
  - "pod"
  - "deployment"
  - "statefulset"
  - "leaderworkerset.x-k8s.io/leaderworkerset"
  podOptions:
    namespaceSelector:
      matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: NotIn
          values: [ kube-system, kueue-system ]
fairSharing:
  enable: true
  preemptionStrategies: [LessThanOrEqualToFinalShare, LessThanInitialShare]
resources:
  excludeResourcePrefixes: []
```

For more information about each configuration entry, see [Configuration](https://kueue.sigs.k8s.io/docs/reference/kueue-config.v1beta1/#Configuration) in the Kueue documentation.

## HyperPod Task governance prerequisites

+ Ensure that you have the minimum permission policy for HyperPod cluster administrators, in [IAM users for cluster admin](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-cluster-admin). This includes permissions to run the SageMaker HyperPod core APIs, manage SageMaker HyperPod clusters within your AWS account, and perform the tasks in [Managing SageMaker HyperPod clusters orchestrated by Amazon EKS](sagemaker-hyperpod-eks-operate.md).
+ Your cluster must run Kubernetes version 1.30 or later. For instructions, see [Update existing clusters to the new Kubernetes version](https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html).
+ If you already have Kueue installed in your cluster, uninstall it before installing the EKS add-on.
+ A HyperPod node must already exist in the EKS cluster before installing the HyperPod task governance add-on. 

## HyperPod task governance setup


The following provides information on how to get set up with HyperPod task governance.

------
#### [ Setup using the SageMaker AI console ]

The following provides information on how to get set up with HyperPod task governance using the SageMaker HyperPod console.

If you already granted permissions to manage Amazon CloudWatch Observability EKS and view the HyperPod cluster dashboard through the SageMaker AI console in the [HyperPod Amazon CloudWatch Observability EKS add-on setup](sagemaker-hyperpod-eks-operate-console-ui-governance-setup-dashboard.md#hp-eks-dashboard-setup), you already have all of the following permissions attached. If you have not set this up, use the following sample policy to grant permissions to manage the HyperPod task governance add-on and view the HyperPod cluster dashboard through the SageMaker AI console.

------
#### [ JSON ]


```
{
    "Version":"2012-10-17",		 	 	 
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "eks:ListAddons",
                "eks:CreateAddon",
                "eks:UpdateAddon",
                "eks:DescribeAddon",
                "eks:DescribeAddonVersions",
                "sagemaker:DescribeCluster",
                "sagemaker:DescribeClusterNode",
                "sagemaker:ListClusterNodes",
                "sagemaker:ListClusters",
                "eks:DescribeCluster",
                "eks:AccessKubernetesApi"
            ],
            "Resource": "*"
        }
    ]
}
```

------

Navigate to the **Dashboard** tab in the SageMaker HyperPod console to install the Amazon SageMaker HyperPod task governance add-on.

------
#### [ Setup using the Amazon EKS AWS CLI ]

Use the following example [create-addon](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/eks/create-addon.html) EKS AWS CLI command to install the HyperPod task governance add-on:

```
aws eks create-addon --region region --cluster-name cluster-name --addon-name amazon-sagemaker-hyperpod-taskgovernance
```

------

If the installation succeeded, you can view the **Policies** tab in the SageMaker HyperPod console. You can also use the following example [describe-addon](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/eks/describe-addon.html) EKS AWS CLI command to check the status.

```
aws eks describe-addon --region region --cluster-name cluster-name --addon-name amazon-sagemaker-hyperpod-taskgovernance
```
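The `describe-addon` response reports the add-on state in the `addon.status` field. The following sketch checks a captured response; the response body shown is an abbreviated, illustrative shape rather than actual output:

```python
import json

# Abbreviated, illustrative describe-addon response (field values are examples).
response_text = """
{
    "addon": {
        "addonName": "amazon-sagemaker-hyperpod-taskgovernance",
        "clusterName": "my-cluster",
        "status": "ACTIVE"
    }
}
"""

addon = json.loads(response_text)["addon"]
# Task governance features require the add-on to be ACTIVE or DEGRADED.
is_usable = addon["status"] in ("ACTIVE", "DEGRADED")
print(addon["addonName"], addon["status"], is_usable)
```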

# Dashboard

Amazon SageMaker HyperPod task governance provides a comprehensive dashboard view of your Amazon EKS cluster utilization metrics, including hardware, team, and task metrics. You need to install the EKS add-on to view the dashboard. For more information, see [Dashboard setup](sagemaker-hyperpod-eks-operate-console-ui-governance-setup-dashboard.md).

In the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/), under **HyperPod Clusters**, you can navigate to the HyperPod console and view your list of HyperPod clusters in your Region. Choose your cluster and navigate to the **Dashboard** tab. The dashboard contains the following metrics. You can download the data for a section by choosing the corresponding **Export**.

**Utilization**

Provides point-in-time and trend-based health metrics for critical compute resources in the EKS cluster. By default, **All Instance Groups** are shown. Use the dropdown menu to filter your instance groups. The metrics included in this section are:
+ Number of total, running, and pending recovery instances. Pending recovery instances are instances that need attention for recovery.
+ GPUs, GPU memory, vCPUs, and vCPU memory.
+ GPU utilization, GPU memory utilization, vCPU utilization, and vCPU memory utilization.
+ An interactive graph of your GPU and vCPU utilization. 

**Teams**

Provides insight into team-specific resource management. This includes:
+ Instance and GPU allocation.
+ GPU utilization rates.
+ Borrowed GPU statistics.
+ Task status (running or pending).
+ A bar chart view of GPU utilization versus compute allocation across teams.
+ Team detailed GPU and vCPU-related information. By default, the information displayed includes **All teams**. You can filter by team and instances by choosing the dropdown menus. In the interactive plot you can filter by time.

**Tasks**

**Note**  
To view your HyperPod EKS cluster tasks in the dashboard:  
Configure Kubernetes Role-Based Access Control (RBAC) for data scientist users in the designated HyperPod namespace to authorize task execution on Amazon EKS-orchestrated clusters. Namespaces follow the format `hyperpod-ns-team-name`. To establish RBAC permissions, refer to the [team role creation instructions](https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart#5-create-team-role).
Ensure that your job is submitted with the appropriate namespace and priority class labels. For a comprehensive example, see [Submit a job to SageMaker AI-managed queue and namespace](sagemaker-hyperpod-eks-operate-console-ui-governance-cli.md#hp-eks-cli-start-job).

Provides information on task-related metrics. This includes number of running, pending, and preempted tasks, and run and wait time statistics. By default, the information displayed includes **All teams**. You can filter by team by choosing the dropdown menu. In the interactive plot you can filter by time.

# Tasks


The following provides information on Amazon SageMaker HyperPod EKS cluster tasks. Tasks are operations or jobs that are sent to the cluster, such as machine learning training, experiments, or inference. The viewable task details include status, run time, and how much compute each task uses.

In the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/), under **HyperPod Clusters**, you can navigate to the HyperPod console and view your list of HyperPod clusters in your Region. Choose your cluster and navigate to the **Tasks** tab.

For the **Tasks** tab to be viewable by anyone other than the administrator, the administrator needs to [add an access entry to the EKS cluster for the IAM role](https://docs.aws.amazon.com/eks/latest/userguide/access-entries.html).

**Note**  
To view your HyperPod EKS cluster tasks in the dashboard:  
Configure Kubernetes Role-Based Access Control (RBAC) for data scientist users in the designated HyperPod namespace to authorize task execution on Amazon EKS-orchestrated clusters. Namespaces follow the format `hyperpod-ns-team-name`. To establish RBAC permissions, refer to the [team role creation instructions](https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart#5-create-team-role).
Ensure that your job is submitted with the appropriate namespace and priority class labels. For a comprehensive example, see [Submit a job to SageMaker AI-managed queue and namespace](sagemaker-hyperpod-eks-operate-console-ui-governance-cli.md#hp-eks-cli-start-job).

For EKS clusters, Kubeflow (PyTorch, MPI, TensorFlow) tasks are shown. By default, PyTorch tasks are shown. You can filter for PyTorch, MPI, or TensorFlow tasks by choosing the dropdown menu or using the search field. The information shown for each task includes the task name, status, namespace, priority class, and creation time.

# Using topology-aware scheduling in Amazon SageMaker HyperPod task governance

Topology-aware scheduling in Amazon SageMaker HyperPod task governance optimizes the training efficiency of distributed machine learning workloads by placing pods based on the physical network topology of your Amazon EC2 instances. By considering the hierarchical structure of AWS infrastructure, including Availability Zones, network blocks, and physical racks, topology-aware scheduling ensures that pods requiring frequent communication are scheduled in close proximity to minimize network latency. This intelligent placement is particularly beneficial for large-scale machine learning training jobs that involve intensive pod-to-pod communication, resulting in reduced training times and more efficient resource utilization across your cluster.

**Note**  
To use topology-aware scheduling, make sure that your version of HyperPod task governance is v1.2.2-eksbuild.1 or higher.

Topology-aware scheduling supports the following instance types:
+ ml.p3dn.24xlarge
+ ml.p4d.24xlarge
+ ml.p4de.24xlarge
+ ml.p5.48xlarge
+ ml.p5e.48xlarge
+ ml.p5en.48xlarge
+ ml.p6e-gb200.36xlarge
+ ml.trn1.2xlarge
+ ml.trn1.32xlarge
+ ml.trn1n.32xlarge
+ ml.trn2.48xlarge
+ ml.trn2u.48xlarge

Topology-aware scheduling integrates with your existing HyperPod workflows while providing flexible topology preferences through both kubectl YAML files and the HyperPod CLI. HyperPod task governance automatically configures cluster nodes with topology labels and works with HyperPod task governance policies and resource borrowing mechanisms, ensuring that topology-aware scheduling doesn't disrupt your current operational processes. With built-in support for both preferred and required topology specifications, you can fine-tune workload placement to match your specific performance requirements while maintaining the flexibility to fall back to standard scheduling when topology constraints cannot be satisfied.

By leveraging topology-aware labels in HyperPod, you can enhance your machine learning workloads through intelligent pod placement that considers the physical network infrastructure. HyperPod task governance automatically optimizes pod scheduling based on the hierarchical data center topology, which directly translates to reduced network latency and improved training performance for distributed ML tasks. This topology awareness is particularly valuable for large-scale machine learning workloads, as it minimizes communication overhead by strategically placing related pods closer together in the network hierarchy. The result is lower network latency between pods, more efficient resource utilization, and better overall performance for compute-intensive AI/ML applications, all achieved without you needing to manually manage complex network topology configurations.

The following are labels for the available topology network layers that HyperPod task governance can schedule pods in:
+ topology.k8s.aws/network-node-layer-1
+ topology.k8s.aws/network-node-layer-2
+ topology.k8s.aws/network-node-layer-3
+ topology.k8s.aws/ultraserver-id

To use topology-aware scheduling, include the following labels in your YAML file:
+ kueue.x-k8s.io/podset-required-topology - requires that all pods of the job be scheduled on nodes within the same topology layer.
+ kueue.x-k8s.io/podset-preferred-topology - indicates that scheduling all pods of the job within the same topology layer is preferred but not required. HyperPod task governance tries to schedule the pods within one layer before trying the next topology layer.

If resources don't share the same topology label, the job is suspended and placed on the wait list. Once Kueue sees that there are enough resources, it admits and runs the job.

The following example demonstrates how to use the labels in your YAML files:

```
apiVersion: batch/v1
kind: Job
metadata:
  name: test-tas-job
  namespace: hyperpod-ns-team-name
  labels:
    kueue.x-k8s.io/queue-name: hyperpod-ns-team-name-localqueue
    kueue.x-k8s.io/priority-class: PRIORITY_CLASS-priority
spec:
  parallelism: 10
  completions: 10
  suspend: true
  template:
    metadata:
      labels:
        kueue.x-k8s.io/queue-name: hyperpod-ns-team-name-localqueue
      annotations:
        kueue.x-k8s.io/podset-required-topology: "topology.k8s.aws/network-node-layer-3"
        # Or, to prefer rather than require co-location, use this annotation instead:
        # kueue.x-k8s.io/podset-preferred-topology: "topology.k8s.aws/network-node-layer-3"
    spec:
      # Use nodeSelector instead of a topology annotation to pin an exact layer (not both):
      # nodeSelector:
      #   topology.k8s.aws/network-node-layer-3: TOPOLOGY_LABEL_VALUE
      containers:
        - name: dummy-job
          image: gcr.io/k8s-staging-perf-tests/sleep:v0.1.0
          args: ["3600s"]
          resources:
            requests:
              cpu: "100"
      restartPolicy: Never
```

The following table explains the new parameters you can use in the kubectl YAML file.


| Parameter | Description | 
| --- | --- | 
| kueue.x-k8s.io/queue-name | The name of the queue to use to run the job. The format of this queue-name must be hyperpod-ns-team-name-localqueue. | 
| kueue.x-k8s.io/priority-class | Lets you specify a priority for pod scheduling. This specification is optional. | 
| annotations | Contains the topology annotation that you attach to the job. Available topologies are kueue.x-k8s.io/podset-required-topology and kueue.x-k8s.io/podset-preferred-topology. You can use either an annotation or nodeSelector, but not both at the same time. | 
| nodeSelector | Specifies the network layer that represents the layer of Amazon EC2 instance placement. Use either this field or an annotation, but not both at the same time. In your YAML file, you can also use the nodeSelector parameter to choose the exact layer for your pods. To get the value of your label, use the [DescribeInstanceTopology](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_DescribeInstanceTopology.html) API operation. | 
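The queue and namespace names used throughout these examples follow a fixed pattern derived from the team name. A small illustrative sketch of that convention (the team name `research` and the priority class are hypothetical):

```python
def team_scheduling_names(team_name: str) -> dict:
    """Derive the HyperPod namespace and local queue name for a team."""
    namespace = f"hyperpod-ns-{team_name}"
    return {
        "namespace": namespace,
        "queue_name": f"{namespace}-localqueue",
        # Priority class labels follow the PRIORITY_CLASS-priority pattern;
        # "high-priority" here is an illustrative class name.
        "priority_class": "high-priority",
    }

names = team_scheduling_names("research")
print(names["namespace"])   # → hyperpod-ns-research
print(names["queue_name"])  # → hyperpod-ns-research-localqueue
```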

You can also use the HyperPod CLI to run your job with topology-aware scheduling. For more information about the HyperPod CLI, see [SageMaker HyperPod CLI commands](sagemaker-hyperpod-eks-hyperpod-cli-reference.md).

```
hyp create hyp-pytorch-job \
  --version 1.1 \
  --job-name sample-pytorch-job \
  --image 123456789012.dkr.ecr.us-west-2.amazonaws.com/ptjob:latest \
  --pull-policy "Always" \
  --tasks-per-node 1 \
  --max-retry 1 \
  --priority high-priority \
  --namespace hyperpod-ns-team-name \
  --queue-name hyperpod-ns-team-name-localqueue \
  --preferred-topology-label topology.k8s.aws/network-node-layer-1
```

The following is an example configuration file that you might use to run a PyTorchJob with topology labels. The file is largely similar if you want to run MPI or TensorFlow jobs; if you run those jobs instead, change the configuration file accordingly, such as using the correct kind and image. If you're running a PyTorchJob, you can assign different topologies to the master and worker nodes. A PyTorchJob always has one master node, so we recommend that you use topology to support the worker pods instead.

```
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  annotations: {}
  labels:
    kueue.x-k8s.io/queue-name: hyperpod-ns-team-name-localqueue
  name: tas-test-pytorch-job
  namespace: hyperpod-ns-team-name
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          labels:
            kueue.x-k8s.io/queue-name: hyperpod-ns-team-name-localqueue
        spec:
          containers:
          - command:
            - python3
            - /opt/pytorch-mnist/mnist.py
            - --epochs=1
            image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
            imagePullPolicy: Always
            name: pytorch
    Worker:
      replicas: 10
      restartPolicy: OnFailure
      template:
        metadata:
          # annotations:
            # kueue.x-k8s.io/podset-required-topology: "topology.k8s.aws/network-node-layer-3"
          labels:
            kueue.x-k8s.io/queue-name: hyperpod-ns-team-name-localqueue
        spec:
          containers:
          - command:
            - python3
            - /opt/pytorch-mnist/mnist.py
            - --epochs=1
            image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
            imagePullPolicy: Always
            name: pytorch
            resources:
              limits:
                cpu: 1
              requests:
                memory: 200Mi
                cpu: 1
          #nodeSelector:
          #  topology.k8s.aws/network-node-layer-3: xxxxxxxxxxx
```

To see the topologies for your cluster, use the [DescribeInstanceTopology](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_DescribeInstanceTopology.html) API operation. By default, the topologies are hidden in the AWS Management Console and Amazon SageMaker Studio. Follow these steps to see them in the interface that you're using.

**SageMaker Studio**

1. In SageMaker Studio, navigate to your cluster.

1. In the Tasks view, choose the options menu in the Name column, then choose **Manage columns**.

1. Select **Requested topology** and **Topology constraint** to add the columns to see the topology information in the list of Kubernetes pods.

**AWS Management Console**

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Under **HyperPod clusters**, choose **Cluster management**.

1. Choose the **Tasks** tab, then choose the gear icon.

1. Under instance attributes, toggle **Requested topology** and **Topology constraint**.

1. Choose **Confirm** to see the topology information in the table.

# Using gang scheduling in Amazon SageMaker HyperPod task governance

In distributed ML training, a job often requires multiple pods running concurrently across nodes with pod-to-pod communication. HyperPod task governance uses Kueue's `waitForPodsReady` feature to implement gang scheduling. When enabled, the workload is monitored by Kueue until all of its pods are ready, meaning scheduled, running, and passing the optional readiness probe. If not all pods of the workload are ready within the configured timeout, the workload is evicted and requeued.

Gang scheduling provides the following benefits:
+ **Prevents resource waste** — Kueue evicts and requeues the workload if all pods do not become ready, ensuring resources are not held indefinitely by partially running workloads.
+ **Avoids deadlocks** — Prevents jobs from holding partial resources and blocking each other indefinitely.
+ **Automatic recovery** — If pods aren't ready within the timeout, the workload is evicted and requeued with configurable exponential backoff, rather than hanging indefinitely.

## Activate gang scheduling


To activate gang scheduling, you must have a HyperPod Amazon EKS cluster with the task governance Amazon EKS add-on installed. The add-on status must be `Active` or `Degraded`.

**Note**  
Gang scheduling can also be configured directly using `kubectl` by editing the Kueue configuration on the cluster.
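
These settings correspond to Kueue's `waitForPodsReady` block in the controller configuration. The following is a minimal sketch of that section; field names follow the Kueue configuration API, but the exact ConfigMap layout on your cluster may differ, so treat it as illustrative rather than a drop-in edit.

```yaml
# Illustrative Kueue controller configuration fragment for gang
# scheduling; values shown match the HyperPod defaults.
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
waitForPodsReady:
  enable: true
  timeout: 5m
  recoveryTimeout: 5m
  blockAdmission: false
  requeuingStrategy:
    timestamp: Eviction
    backoffBaseSeconds: 60
    backoffMaxSeconds: 3600
```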

**Activate gang scheduling (SageMaker AI console)**

1. Open the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/) and navigate to your HyperPod cluster.

1. Choose the **Policy management** tab.

1. In the **Task governance** section, open **Actions**, then choose **Configure gang scheduling**.

1. Toggle gang scheduling on and configure the settings.

1. Choose **Save**. The Kueue controller restarts to apply the change.

## Gang scheduling configuration settings


The following table describes the configuration settings for gang scheduling.


| Setting | Description | Default | 
| --- | --- | --- | 
| timeout | How long Kueue waits for all pods to become ready before evicting and requeuing the workload. | 5m | 
| recoveryTimeout | How long Kueue waits for a pod to recover after a node failure before requeuing the workload. Set to 0s to disable. Defaults to the value of timeout if not set. | 5m | 
| blockAdmission | When enabled, workloads are admitted sequentially. No new workload is admitted until all pods of the current one are ready. Prevents deadlocks on resource-constrained clusters. | Off | 
| requeuingStrategy timestamp | Whether requeue order uses Creation (original submission time, preserves queue position) or Eviction (time of last eviction, effectively deprioritizing repeatedly failing jobs). | Eviction | 
| requeuingStrategy backoffLimitCount | Maximum requeue attempts before Kueue permanently deactivates the workload. Leave empty for unlimited retries. | Unlimited | 
| requeuingStrategy backoffBaseSeconds | The base time in seconds for exponential backoff when requeuing a workload after each consecutive timeout. The exponent is 2. | 60s | 
| requeuingStrategy backoffMaxSeconds | Cap on the exponential backoff delay. Once reached, Kueue continues requeuing at this fixed interval. | 3600s | 
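
Ignoring the random jitter Kueue adds, the requeue delay grows roughly as `backoffBaseSeconds * 2^(n-1)` for the nth consecutive timeout, capped at `backoffMaxSeconds`. A quick sketch of that growth (the function name is illustrative, not part of any API):

```python
def requeue_delay_seconds(attempt: int, base: int = 60, cap: int = 3600) -> int:
    """Approximate delay before the nth requeue (1-indexed), using
    exponential backoff with exponent 2, capped at backoffMaxSeconds."""
    return min(base * 2 ** (attempt - 1), cap)

# With the defaults (base 60s, cap 3600s), delays grow 60, 120, 240, ...
# and hit the 3600s cap from the 7th consecutive timeout onward.
print([requeue_delay_seconds(n) for n in range(1, 9)])
# → [60, 120, 240, 480, 960, 1920, 3600, 3600]
```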

**Note**  
Modifying gang scheduling settings restarts the Kueue controller, which may temporarily delay job admission. This applies whether you are enabling, disabling, or updating any value. Running jobs are not interrupted.

**Note**  
Gang scheduling is cluster-wide. It applies to all Kueue-managed workloads on the cluster, not just specific teams or queues.

# Policies


Amazon SageMaker HyperPod task governance simplifies how your Amazon EKS cluster resources are allocated and how tasks are prioritized. The following provides information on HyperPod EKS cluster policies. For information on how to set up task governance, see [Task governance setup](sagemaker-hyperpod-eks-operate-console-ui-governance-setup-task-governance.md).

The policies are divided up into **Compute prioritization** and **Compute allocation**. The policy concepts below will be organized in the context of these policies.

**Compute prioritization**, or cluster policy, determines how idle compute is borrowed and how tasks are prioritized by teams.
+ **Idle compute allocation** defines how idle compute is allocated across teams. That is, how unused compute can be borrowed from teams. When choosing an **Idle compute allocation**, you can choose between:
  + **First-come first-serve**: When applied, teams are not prioritized against each other and each incoming task is equally likely to obtain over-quota resources. Tasks are prioritized based on order of submission. This means a user may be able to use 100% of the idle compute if they request it first.
  + **Fair-share**: When applied, teams borrow idle compute based on their assigned **Fair-share weight**. These weights are defined in **Compute allocation**. For more information on how this can be used, see [Sharing idle compute resources examples](#hp-eks-task-governance-policies-examples).
+ **Task prioritization** defines how tasks are queued as compute becomes available. When choosing a **Task prioritization**, you can choose between:
  + **First-come first-serve**: When applied, tasks are queued in the order they are requested.
  + **Task ranking**: When applied, tasks are queued in the order defined by their priority classes. If this option is chosen, you must add priority classes along with the weights at which they should be prioritized. Tasks of the same priority class are executed on a first-come first-serve basis. When preemption is enabled in **Compute allocation**, higher priority tasks preempt lower priority tasks within the team.

    When data scientists submit jobs to the cluster, they use the priority class name in the YAML file. The priority class is in the format `priority-class-name-priority`. For an example, see [Submit a job to SageMaker AI-managed queue and namespace](sagemaker-hyperpod-eks-operate-console-ui-governance-cli.md#hp-eks-cli-start-job).
  + **Priority classes**: These classes establish a relative priority for tasks when borrowing capacity. When a task runs on borrowed quota, it may be preempted by a higher priority task if no more capacity is available for the incoming task. If **Preemption** is enabled in the **Compute allocation**, a higher priority task may also preempt tasks within its own team.
+ **Unallocated resource sharing** enables teams to borrow compute resources that are not allocated to any team through compute quota. When enabled, unallocated cluster capacity becomes available for teams to borrow automatically. For more information, see [How unallocated resource sharing works](#sagemaker-hyperpod-eks-operate-console-ui-governance-policies-idle-resource-sharing-how-it-works).

**Compute allocation**, or compute quota, defines a team’s compute allocation and what weight (or priority level) a team is given for fair-share idle compute allocation. 
+ **Team name**: The team name. A corresponding Kubernetes **Namespace** is created with the format `hyperpod-ns-team-name`. 
+ **Members**: Members of the team namespace. You will need to set up a Kubernetes role-based access control (RBAC) for data scientist users that you want to be part of this team, to run tasks on HyperPod clusters orchestrated with Amazon EKS. To set up a Kubernetes RBAC, use the instructions in [create team role](https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart#5-create-team-role).
+ **Fair-share weight**: This is the level of prioritization assigned to the team when **Fair-share** is applied for **Idle compute allocation**. The highest priority has a weight of 100 and the lowest priority has a weight of 0. Higher weight enables a team to access unutilized resources within shared capacity sooner. A zero weight signifies the lowest priority, implying this team will always be at a disadvantage compared to other teams. 

  The fair-share weight provides a comparative edge to this team when vying for available resources against others. Admission prioritizes scheduling tasks from teams with the highest weights and the lowest borrowing. For example, if Team A has a weight of 10 and Team B has a weight of 5, Team A has priority in accessing unutilized resources, so its jobs are scheduled earlier than Team B's.
+ **Task preemption**: Whether compute is reclaimed from running tasks based on priority. By default, a team lending idle compute preempts the borrowing tasks of other teams when it needs its quota back. 
+ **Lending and borrowing**: How idle compute is being lent by the team and if the team can borrow from other teams.
  + **Percentage-based borrow limit**: The limit of idle compute that a team is allowed to borrow, expressed as a percentage of their guaranteed quota. A team can borrow up to 10,000% of allocated compute. The value you provide here is interpreted as a percentage. For example, a value of 500 will be interpreted as 500%. This percentage applies uniformly across all resource types (CPU, GPU, Memory) and instance types in the team's quota.
  + **Absolute borrow limit**: The limit of idle compute that a team is allowed to borrow, defined as absolute resource values per instance type. This provides granular control over borrowing behavior for specific instance types. You need to specify absolute limits using the same schema as **Compute quota**, including instance count, accelerators, vCPU, memory, or accelerator partitions. You can specify absolute limits for one or more instance types in your team's quota.
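
To make the percentage-based limit concrete, the sketch below shows how a limit value translates into borrowable capacity (the numbers and function name are illustrative only, not part of any HyperPod API):

```python
def max_borrowable(guaranteed_quota: int, borrow_limit_percent: int) -> int:
    """Upper bound on idle compute a team can borrow, as a share of its
    guaranteed quota. A value of 500 means 500%, up to the 10,000% cap."""
    capped = min(borrow_limit_percent, 10_000)
    return guaranteed_quota * capped // 100

# A team guaranteed 8 GPUs with a borrow limit of 500 (i.e. 500%) can
# borrow up to 40 additional GPUs, for a peak usage of 48.
print(max_borrowable(8, 500))      # → 40
print(8 + max_borrowable(8, 500))  # → 48
```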

For information on how these concepts are used, such as priority classes and name spaces, see [Example HyperPod task governance AWS CLI commands](sagemaker-hyperpod-eks-operate-console-ui-governance-cli.md).

## Sharing idle compute resources examples


The total reserved quota should not surpass the cluster's available capacity for that resource, to ensure proper quota management. For example, if a cluster comprises 20 `ml.c5.2xlarge` instances, the cumulative quota assigned to teams should not exceed 20. 

If the **Compute allocation** policies for teams allow **Lend and Borrow** or **Lend**, idle capacity is shared between these teams. For example, suppose Team A and Team B both have **Lend and Borrow** enabled. Team A has a quota of 6 but is using only 2 for its jobs, and Team B has a quota of 5 and is using 4 for its jobs. If a job requiring 4 resources is submitted to Team B, it uses Team B's 1 remaining resource and borrows the other 3 from Team A. 
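
The arithmetic in this example can be sketched as follows (a toy illustration of the accounting, not part of any HyperPod API):

```python
def admit(job_demand: int, own_quota: int, own_usage: int, lender_idle: int):
    """Return (own_used, borrowed) for a new job, borrowing only what
    the team's own free quota cannot cover."""
    own_free = own_quota - own_usage
    from_own = min(job_demand, own_free)
    borrowed = job_demand - from_own
    if borrowed > lender_idle:
        raise RuntimeError("not enough idle capacity to borrow")
    return from_own, borrowed

# Team B (quota 5, using 4) submits a job needing 4 resources while
# Team A (quota 6, using 2) has 4 idle: B uses its 1 free resource
# and borrows 3 from A.
print(admit(4, own_quota=5, own_usage=4, lender_idle=4))  # → (1, 3)
```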

If any team's **Compute allocation** policy is set to **Don't Lend**, the team would not be able to borrow any additional capacity beyond its own allocations.

## How unallocated resource sharing works


Unallocated resource sharing automatically manages the pool of resources that are not allocated to any compute quota in your cluster. This means HyperPod continuously monitors your cluster state and automatically updates to the correct configuration over time.

**Initial Setup**
+ When you set `IdleResourceSharing` to `Enabled` in your ClusterSchedulerConfig (by default it is `Disabled`), HyperPod task governance begins monitoring your cluster and calculates available idle resources by subtracting team quotas from total node capacity.
+ Unallocated resource sharing ClusterQueues are created to represent the borrowable resource pool.
+ When you first enable unallocated resource sharing, infrastructure setup takes several minutes. You can monitor the progress through the policy `Status` and `DetailedStatus` fields in ClusterSchedulerConfig.

**Ongoing Reconciliation**
+ HyperPod task governance continuously monitors for changes such as node additions or removals and cluster queue quota updates.
+ When changes occur, unallocated resource sharing recalculates quota and updates ClusterQueues. Reconciliation typically completes within seconds.

**Monitoring**

You can verify that unallocated resource sharing is fully configured by checking for unallocated resource sharing ClusterQueues:

```
kubectl get clusterqueue | grep hyperpod-ns-idle-resource-sharing
```

When you see ClusterQueues with names like `hyperpod-ns-idle-resource-sharing-cq-1`, unallocated resource sharing is active. Note that multiple unallocated resource sharing ClusterQueues may exist depending on the number of resource flavors in your cluster. 

## Node eligibility for unallocated resource sharing


Unallocated resource sharing only includes nodes that meet the following requirements:

1. **Node Ready Status**
   + Nodes must be in `Ready` status to contribute to the unallocated resource pool.
   + Nodes in `NotReady` or other non-ready states are excluded from capacity calculations.
   + When a node becomes `Ready`, it is automatically included in the next reconciliation cycle.

1. **Node Schedulable Status**
   + Nodes with `spec.unschedulable: true` are excluded from unallocated resource sharing.
   + When a node becomes schedulable again, it is automatically included in the next reconciliation cycle.

1. **MIG Configuration (GPU nodes only)**
   + For GPU nodes with MIG (Multi-Instance GPU) partitioning, the `nvidia.com/mig.config.state` label must show `success` for the node to contribute MIG profiles to unallocated resource sharing.
   + These nodes will be retried automatically once MIG configuration completes successfully.

1. **Supported Instance Types**
   + The instance must be a supported SageMaker HyperPod instance type.
   + See the list of supported instance types for SageMaker HyperPod clusters.

**Topics**
+ [

## Sharing idle compute resources examples
](#hp-eks-task-governance-policies-examples)
+ [

## How unallocated resource sharing works
](#sagemaker-hyperpod-eks-operate-console-ui-governance-policies-idle-resource-sharing-how-it-works)
+ [

## Node eligibility for unallocated resource sharing
](#sagemaker-hyperpod-eks-operate-console-ui-governance-policies-idle-resource-sharing-node-eligibility)
+ [

# Create policies
](sagemaker-hyperpod-eks-operate-console-ui-governance-policies-create.md)
+ [

# Edit policies
](sagemaker-hyperpod-eks-operate-console-ui-governance-policies-edit.md)
+ [

# Delete policies
](sagemaker-hyperpod-eks-operate-console-ui-governance-policies-delete.md)
+ [

# Allocating compute quota in Amazon SageMaker HyperPod task governance
](sagemaker-hyperpod-eks-operate-console-ui-governance-policies-compute-allocation.md)

# Create policies


You can create your **Cluster policy** and **Compute allocation** configurations in the **Policies** tab. The following provides instructions on how to create these configurations.
+ Create your **Cluster policy** to define how tasks are prioritized and idle compute is allocated.
+ Create a **Compute allocation** to define a compute allocation policy for a team.
**Note**  
When you create a **Compute allocation** you will need to set up a Kubernetes role-based access control (RBAC) for data scientist users in the corresponding namespace to run tasks on HyperPod clusters orchestrated with Amazon EKS. The namespaces have the format `hyperpod-ns-team-name`. To set up a Kubernetes RBAC, use the instructions in [create team role](https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart#5-create-team-role).

For information about the HyperPod task governance EKS cluster policy concepts, see [Policies](sagemaker-hyperpod-eks-operate-console-ui-governance-policies.md).

**Create HyperPod task governance policies**

This procedure assumes that you have already created an Amazon EKS cluster set up with HyperPod. If you have not already done so, see [Creating a SageMaker HyperPod cluster with Amazon EKS orchestration](sagemaker-hyperpod-eks-operate-console-ui-create-cluster.md).

1. Navigate to the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, under **HyperPod clusters**, choose **Cluster management**.

1. Choose your Amazon EKS cluster listed under **SageMaker HyperPod clusters**.

1. Choose the **Policies** tab.

1. To create your **Cluster policy**:

   1. Choose the corresponding **Edit** to update how tasks are prioritized and idle compute is allocated.

   1. After you have made your changes, choose **Submit**.

1. To create a **Compute allocation**:

   1. Choose the corresponding **Create**. This takes you to the compute allocation creation page.

   1. After you have made your changes, choose **Submit**.

# Edit policies


You can edit your **Cluster policy** and **Compute allocation** configurations in the **Policies** tab. The following provides instructions on how to edit these configurations.
+ Edit your **Cluster policy** to update how tasks are prioritized and idle compute is allocated.
+ Edit a **Compute allocation** to update a team's compute allocation policy.
**Note**  
When you create a **Compute allocation** you will need to set up a Kubernetes role-based access control (RBAC) for data scientist users in the corresponding namespace to run tasks on HyperPod clusters orchestrated with Amazon EKS. The namespaces have the format `hyperpod-ns-team-name`. To set up a Kubernetes RBAC, use the instructions in [create team role](https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart#5-create-team-role).

For more information about the HyperPod task governance EKS cluster policy concepts, see [Policies](sagemaker-hyperpod-eks-operate-console-ui-governance-policies.md).

**Edit HyperPod task governance policies**

This procedure assumes that you have already created an Amazon EKS cluster set up with HyperPod. If you have not already done so, see [Creating a SageMaker HyperPod cluster with Amazon EKS orchestration](sagemaker-hyperpod-eks-operate-console-ui-create-cluster.md).

1. Navigate to the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, under **HyperPod clusters**, choose **Cluster management**.

1. Choose your Amazon EKS cluster listed under **SageMaker HyperPod clusters**.

1. Choose the **Policies** tab.

1. To edit your **Cluster policy**:

   1. Choose the corresponding **Edit** to update how tasks are prioritized and idle compute is allocated.

   1. After you have made your changes, choose **Submit**.

1. To edit your **Compute allocation**:

   1. Choose the configuration you wish to edit under **Compute allocation**. This takes you to the configuration details page.

   1. If you wish to edit these configurations, choose **Edit**.

   1. After you have made your changes, choose **Submit**.

# Delete policies


You can delete your **Cluster policy** and **Compute allocation** configurations using the SageMaker AI console or AWS CLI. The following page provides instructions on how to delete your SageMaker HyperPod task governance policies and configurations.

For more information about the HyperPod task governance EKS cluster policy concepts, see [Policies](sagemaker-hyperpod-eks-operate-console-ui-governance-policies.md).

**Note**  
If you are having issues with listing or deleting task governance policies, you may need to update your cluster administrator minimum set of permissions. See the **Amazon EKS** tab in the [IAM users for cluster admin](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-cluster-admin) section. For additional information, see [Deleting clusters](sagemaker-hyperpod-eks-operate-console-ui-governance-troubleshoot.md#hp-eks-troubleshoot-delete-policies).

## Delete HyperPod task governance policies (console)


The following uses the SageMaker AI console to delete your HyperPod task governance policies.

**Note**  
You cannot delete your **Cluster policy** (`ClusterSchedulerConfig`) using the SageMaker AI console. To learn how to do so using the AWS CLI, see [Delete HyperPod task governance policies (AWS CLI)](#sagemaker-hyperpod-eks-operate-console-ui-governance-policies-delete-cli).

**To delete task governance policies (console)**

1. Navigate to the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, under **HyperPod clusters**, choose **Cluster management**.

1. Choose your Amazon EKS cluster listed under **SageMaker HyperPod clusters**.

1. Choose the **Policies** tab.

1. To delete your **Compute allocation** (`ComputeQuota`):

   1. In the **Compute allocation** section, select the configuration you want to delete.

   1. In the **Actions** dropdown menu, choose **Delete**.

   1. Follow the instructions in the UI to complete the task.

## Delete HyperPod task governance policies (AWS CLI)


The following uses the AWS CLI to delete your HyperPod task governance policies.

**Note**  
If you are having issues using the following commands, you may need to update your AWS CLI. For more information, see [Installing or updating to the latest version of the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html).

**To delete task governance policies (AWS CLI)**

First set your variables for the AWS CLI commands that follow.

```
REGION=aws-region
```

1. Get the *cluster-arn* associated with the policies you wish to delete. You can use the following AWS CLI command to list the clusters in your AWS Region.

   ```
   aws sagemaker list-clusters \
       --region ${REGION}
   ```

1. To delete your compute allocations (`ComputeQuota`):

   1. List all of the compute quotas associated with the HyperPod cluster.

      ```
      aws sagemaker list-compute-quotas \
          --cluster-arn cluster-arn \
          --region ${REGION}
      ```

   1. For each `compute-quota-id` you wish to delete, run the following command to delete the compute quota.

      ```
      aws sagemaker delete-compute-quota \
          --compute-quota-id compute-quota-id \
          --region ${REGION}
      ```

1. To delete your cluster policies (`ClusterSchedulerConfig`):

   1. List all of the cluster policies associated with the HyperPod cluster.

      ```
      aws sagemaker list-cluster-scheduler-configs \
          --cluster-arn cluster-arn \
          --region ${REGION}
      ```

1. For each `cluster-scheduler-config-id` you wish to delete, run the following command to delete the cluster policy.

      ```
      aws sagemaker delete-cluster-scheduler-config \
          --cluster-scheduler-config-id scheduler-config-id \
          --region ${REGION}
      ```

# Allocating compute quota in Amazon SageMaker HyperPod task governance
Compute allocation

Cluster administrators can decide how the organization uses purchased compute. Doing so reduces waste and idle resources. You can allocate compute quota such that teams can borrow unused resources from each other. Compute quota allocation in HyperPod task governance lets administrators allocate resources at the instance level and at a more granular resource level. This capability provides flexible and efficient resource management for teams by allowing granular control over individual compute resources instead of requiring entire instance allocations. Allocating at a granular level eliminates inefficiencies of traditional instance-level allocation. Through this approach, you can optimize resource utilization and reduce idle compute.

Compute quota allocation supports three types of resource allocation: accelerators, vCPU, and memory. Accelerators are components in accelerated computing instances that perform functions such as floating point calculations, graphics processing, or data pattern matching. Accelerators include GPUs, Trainium accelerators, and Neuron cores. For multi-team GPU sharing, different teams can receive specific GPU allocations from the same instance type, maximizing utilization of accelerator hardware. For memory-intensive workloads that require additional RAM for data preprocessing or model caching, you can allocate memory quota beyond the default GPU-to-memory ratio. For CPU-heavy preprocessing tasks that need substantial CPU resources alongside GPU training, you can allocate vCPU independently.

Once you provide a value, HyperPod task governance calculates the ratio using the formula **allocated resource divided by the total amount of resources available in the instance**. HyperPod task governance then uses this ratio to apply default allocations to other resources, but you can override these defaults and customize them based on your use case. The following are sample scenarios of how HyperPod task governance allocates resources based on your values:
+ **Only accelerator specified** - HyperPod task governance applies the default ratio to vCPU and memory based on the accelerator values.
+ **Only vCPU specified** - HyperPod task governance calculates the ratio and applies it to memory. Accelerators are set to 0.
+ **Only memory specified** - HyperPod task governance calculates the ratio and applies it to vCPU because compute is required to run memory-specified workloads. Accelerators are set to 0.
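
As an illustration of the ratio calculation, the sketch below assumes a hypothetical instance type with 8 accelerators, 192 vCPU, and 768 GiB of memory; actual totals come from your instance type, and the function is a toy model of the documented rules, not a HyperPod API.

```python
def default_allocations(total, accelerators=None, vcpu=None, memory_gib=None):
    """Apply ratio-based defaults: the ratio of the one specified resource
    to its instance total is applied to the unspecified resources.
    `total` is a dict of the instance's total resources."""
    if accelerators is not None:
        ratio = accelerators / total["accelerators"]
        return {"accelerators": accelerators,
                "vcpu": total["vcpu"] * ratio,
                "memory_gib": total["memory_gib"] * ratio}
    if vcpu is not None:
        ratio = vcpu / total["vcpu"]
        return {"accelerators": 0, "vcpu": vcpu,
                "memory_gib": total["memory_gib"] * ratio}
    # Only memory specified: ratio applies to vCPU, accelerators are 0.
    ratio = memory_gib / total["memory_gib"]
    return {"accelerators": 0, "vcpu": total["vcpu"] * ratio,
            "memory_gib": memory_gib}

totals = {"accelerators": 8, "vcpu": 192, "memory_gib": 768}
# Specifying 2 of 8 accelerators (ratio 0.25) defaults vCPU to 48
# and memory to 192 GiB.
print(default_allocations(totals, accelerators=2))
```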

To programmatically control quota allocation, you can use the [ComputeQuotaResourceConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ComputeQuotaResourceConfig.html) object and specify your allocations as numbers.

```
{
    "ComputeQuotaConfig": {
        "ComputeQuotaResources": [{
            "InstanceType": "ml.g5.24xlarge",
            "Accelerators": 16,
            "vCpu": 200.0,
            "MemoryInGiB": 2.0
        }]
    }
}
```

To see all of the allocations, including the defaults, use the [DescribeComputeQuota](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeComputeQuota.html) operation. To update your allocations, use the [UpdateComputeQuota](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateComputeQuota.html) operation.

You can also use the HyperPod CLI to allocate compute quotas. For more information about the HyperPod CLI, see [Running jobs on SageMaker HyperPod clusters orchestrated by Amazon EKS](sagemaker-hyperpod-eks-run-jobs.md). The following example demonstrates how to set compute quotas using the HyperPod CLI.

```
hyp create hyp-pytorch-job --version 1.1 --job-name sample-job \
--image 123456789012.dkr.ecr.us-west-2.amazonaws.com/ptjob:latest \
--pull-policy "Always" \
--tasks-per-node 1 \
--max-retry 1 \
--priority high-priority \
--namespace hyperpod-ns-team-name \
--queue-name hyperpod-ns-team-name-localqueue \
--instance-type sample-instance-type \
--accelerators 1 \
--vcpu 3 \
--memory 1 \
--accelerators-limit 1 \
--vcpu-limit 4 \
--memory-limit 2
```

To allocate quotas using the AWS console, follow these steps.

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Under HyperPod clusters, choose **Cluster management**.

1. Under **Compute allocations**, choose **Create**.

1. If you don’t already have instances, choose **Add allocation** to add an instance.

1. Under **Allocations**, choose whether to allocate by instances or by individual resources. If you allocate by individual resources, SageMaker AI automatically assigns allocations to the other resources using the calculated ratio. To override this ratio-based allocation, use the corresponding toggle and set the value for that resource.

1. Repeat steps 4 and 5 to configure additional instances.

After allocating compute quota, you can then submit jobs through the HyperPod CLI or `kubectl`. HyperPod efficiently schedules workloads based on available quota. 

# Allocating GPU partition quota
GPU partition quota

You can extend compute quota allocation to support GPU partitioning, enabling fine-grained resource sharing at the GPU partition level. When GPU partitioning is enabled on supported GPUs in the cluster, each physical GPU can be partitioned into multiple isolated GPUs with defined compute, memory, and streaming multiprocessor allocations. For more information about GPU partitioning, see [Using GPU partitions in Amazon SageMaker HyperPod](sagemaker-hyperpod-eks-gpu-partitioning.md). You can allocate specific GPU partitions to teams, allowing multiple teams to share a single GPU while maintaining hardware-level isolation and predictable performance.

For example, an ml.p5.48xlarge instance with 8 H100 GPUs can be partitioned into GPU partitions, and you can allocate individual partitions to different teams based on their task requirements. When you specify GPU partition allocations, HyperPod task governance calculates proportional vCPU and memory quotas based on the GPU partition, similar to GPU-level allocation. This approach maximizes GPU utilization by eliminating idle capacity and enabling cost-effective resource sharing across multiple concurrent tasks on the same physical GPU.

## Creating Compute Quotas


The following example AWS CLI command creates a compute quota that allocates MIG partitions of type `mig-1g.5gb` to a team and enables lending and borrowing:

```
aws sagemaker create-compute-quota \
  --name "fractional-gpu-quota" \
  --compute-quota-config '{
    "ComputeQuotaResources": [
      {
        "InstanceType": "ml.p4d.24xlarge",
        "AcceleratorPartition": {
            "Count": 4,
            "Type": "mig-1g.5gb"
        }
      }
    ],
    "ResourceSharingConfig": { 
      "Strategy": "LendAndBorrow", 
      "BorrowLimit": 100 
    }
  }'
```

## Verifying Quota Resources


After the quota is created, you can verify that the corresponding Kueue ClusterQueues and ResourceFlavors exist on the cluster:

```
# Check ClusterQueue
kubectl get clusterqueues
kubectl describe clusterqueue QUEUE_NAME

# Check ResourceFlavors
kubectl get resourceflavor
kubectl describe resourceflavor FLAVOR_NAME
```

# Example HyperPod task governance AWS CLI commands
Example commands

You can use HyperPod with Amazon EKS through `kubectl` or through the HyperPod CLI, from Studio or the AWS CLI. The following provides SageMaker HyperPod task governance examples of how to view cluster details using the HyperPod CLI commands. For more information, including how to install the CLI, see the [HyperPod CLI GitHub repository](https://github.com/aws/sagemaker-hyperpod-cli).

**Topics**
+ [

## Get cluster accelerator device quota information
](#hp-eks-cli-get-clusters)
+ [

## Submit a job to SageMaker AI-managed queue and namespace
](#hp-eks-cli-start-job)
+ [

## List jobs
](#hp-eks-cli-list-jobs)
+ [

## Get job detailed information
](#hp-eks-cli-get-job)
+ [

## Suspend and unsuspend jobs
](#hp-eks-cli-patch-job)
+ [

## Debugging jobs
](#hp-eks-cli-other)

## Get cluster accelerator device quota information


The following example command gets the information on the cluster accelerator device quota.

```
hyperpod get-clusters -n hyperpod-ns-test-team
```

The namespace in this example, `hyperpod-ns-test-team`, is created in Kubernetes based on the team name provided, `test-team`, when the compute allocation is created. For more information, see [Edit policies](sagemaker-hyperpod-eks-operate-console-ui-governance-policies-edit.md).

Example response:

```
[
    {
        "Cluster": "hyperpod-eks-test-cluster-id",
        "InstanceType": "ml.g5.xlarge",
        "TotalNodes": 2,
        "AcceleratorDevicesAvailable": 1,
        "NodeHealthStatus=Schedulable": 2,
        "DeepHealthCheckStatus=Passed": "N/A",
        "Namespaces": {
            "hyperpod-ns-test-team": {
                "TotalAcceleratorDevices": 1,
                "AvailableAcceleratorDevices": 1
            }
        }
    }
]
```

## Submit a job to SageMaker AI-managed queue and namespace


The following example command submits a job to your HyperPod cluster. If you have access to only one team, the HyperPod CLI automatically assigns the queue for you. Otherwise, if multiple queues are discovered, the CLI displays all viable options for you to select from.

```
hyperpod start-job --job-name hyperpod-cli-test \
    --job-kind kubeflow/PyTorchJob \
    --image docker.io/kubeflowkatib/pytorch-mnist-cpu:v1beta1-bc09cfd \
    --entry-script /opt/pytorch-mnist/mnist.py \
    --pull-policy IfNotPresent \
    --instance-type ml.g5.xlarge \
    --node-count 1 \
    --tasks-per-node 1 \
    --results-dir ./result \
    --priority training-priority
```

The priority classes are defined in the **Cluster policy**, which determines how tasks are prioritized and how idle compute is allocated. When a data scientist submits a job, they use one of the priority class names with the format `priority-class-name-priority`. In this example, `training-priority` refers to the priority class named `training`. For more information on policy concepts, see [Policies](sagemaker-hyperpod-eks-operate-console-ui-governance-policies.md).

If a priority class is not specified, the job is treated as a low-priority job with a task ranking value of 0.

If a priority class is specified but does not correspond to one of the priority classes defined in the **Cluster policy**, the submission fails and an error message lists the defined set of priority classes.

You can also submit the job using a YAML configuration file using the following command: 

```
hyperpod start-job --config-file ./yaml-configuration-file-name.yaml
```

The following is an example YAML configuration file that is equivalent to submitting a job as discussed above.

```
defaults:
  - override hydra/job_logging: stdout
hydra:
  run:
    dir: .
  output_subdir: null
training_cfg:
  entry_script: /opt/pytorch-mnist/mnist.py
  script_args: []
  run:
    name: hyperpod-cli-test
    nodes: 1
    ntasks_per_node: 1
cluster:
  cluster_type: k8s
  instance_type: ml.g5.xlarge
  custom_labels:
    kueue.x-k8s.io/priority-class: training-priority
  cluster_config:
    label_selector:
      required:
        sagemaker.amazonaws.com/node-health-status:
          - Schedulable
      preferred:
        sagemaker.amazonaws.com/deep-health-check-status:
          - Passed
      weights:
        - 100
    pullPolicy: IfNotPresent
base_results_dir: ./result
container: docker.io/kubeflowkatib/pytorch-mnist-cpu:v1beta1-bc09cfd
env_vars:
  NCCL_DEBUG: INFO
```

Alternatively, you can submit a job using `kubectl`; with the labels described in this section, the task still appears in the **Dashboard** tab. The following is an example `kubectl` command.

```
kubectl apply -f ./yaml-configuration-file-name.yaml
```

When submitting the job, include your queue name and priority class labels. For example, with the queue name `hyperpod-ns-team-name-localqueue` and priority class `priority-class-name-priority`, you must include the following labels:
+ `kueue.x-k8s.io/queue-name: hyperpod-ns-team-name-localqueue` 
+ `kueue.x-k8s.io/priority-class: priority-class-name-priority`

The following YAML configuration snippet demonstrates how to add labels to your original configuration file to ensure your task appears in the **Dashboard** tab:

```
metadata:
    name: job-name
    namespace: hyperpod-ns-team-name
    labels:
        kueue.x-k8s.io/queue-name: hyperpod-ns-team-name-localqueue
        kueue.x-k8s.io/priority-class: priority-class-name-priority
```

## List jobs


The following command lists the jobs and their details.

```
hyperpod list-jobs
```

Example response:

```
{
    "jobs": [
        {
            "Name": "hyperpod-cli-test",
            "Namespace": "hyperpod-ns-test-team",
            "CreationTime": "2024-11-18T21:21:15Z",
            "Priority": "training",
            "State": "Succeeded"
        }
    ]
}
```

## Get job detailed information


The following command provides a job's details. If no namespace is specified, the HyperPod AWS CLI fetches a namespace managed by SageMaker AI that you have access to.

```
hyperpod get-job --job-name hyperpod-cli-test
```

Example response:

```
{
    "Name": "hyperpod-cli-test",
    "Namespace": "hyperpod-ns-test-team",
    "Label": {
        "app": "hyperpod-cli-test",
        "app.kubernetes.io/managed-by": "Helm",
        "kueue.x-k8s.io/priority-class": "training"
    },
    "CreationTimestamp": "2024-11-18T21:21:15Z",
    "Status": {
        "completionTime": "2024-11-18T21:25:24Z",
        "conditions": [
            {
                "lastTransitionTime": "2024-11-18T21:21:15Z",
                "lastUpdateTime": "2024-11-18T21:21:15Z",
                "message": "PyTorchJob hyperpod-cli-test is created.",
                "reason": "PyTorchJobCreated",
                "status": "True",
                "type": "Created"
            },
            {
                "lastTransitionTime": "2024-11-18T21:21:17Z",
                "lastUpdateTime": "2024-11-18T21:21:17Z",
                "message": "PyTorchJob hyperpod-ns-test-team/hyperpod-cli-test is running.",
                "reason": "PyTorchJobRunning",
                "status": "False",
                "type": "Running"
            },
            {
                "lastTransitionTime": "2024-11-18T21:25:24Z",
                "lastUpdateTime": "2024-11-18T21:25:24Z",
                "message": "PyTorchJob hyperpod-ns-test-team/hyperpod-cli-test successfully completed.",
                "reason": "PyTorchJobSucceeded",
                "status": "True",
                "type": "Succeeded"
            }
        ],
            "replicaStatuses": {
                "Worker": {
                    "selector": "training.kubeflow.org/job-name=hyperpod-cli-test,training.kubeflow.org/operator-name=pytorchjob-controller,training.kubeflow.org/replica-type=worker",
                    "succeeded": 1
                }
            },
        "startTime": "2024-11-18T21:21:15Z"
    },
    "ConsoleURL": "https://us-west-2.console.aws.amazon.com/sagemaker/home?region=us-west-2#/cluster-management/hyperpod-eks-test-cluster-id"
}
```

## Suspend and unsuspend jobs


If you want to remove a submitted job from the scheduler, the HyperPod AWS CLI provides the `suspend` command, which temporarily removes the job from orchestration. A suspended job is no longer scheduled until it is manually unsuspended with the `unsuspend` command.

To temporarily suspend a job:

```
hyperpod patch-job suspend --job-name hyperpod-cli-test
```

To add a job back to the queue:

```
hyperpod patch-job unsuspend --job-name hyperpod-cli-test
```

## Debugging jobs


The HyperPod AWS CLI also provides other commands to help you debug job submission issues, such as `list-pods` and `get-logs`. For details, see the HyperPod AWS CLI GitHub repository.
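As a sketch, a typical debugging flow with these commands might look like the following; the pod name shown is a hypothetical placeholder, and exact flags may vary by CLI version, so check the repository's help output (for example, `hyperpod list-pods --help`):

```
# List the pods created for a job (job name from the earlier examples):
hyperpod list-pods --job-name hyperpod-cli-test

# Fetch the logs of one of the listed pods; the pod name below is a
# placeholder -- substitute a name returned by list-pods:
hyperpod get-logs --job-name hyperpod-cli-test --pod hyperpod-cli-test-worker-0
```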

# Troubleshoot


The following page contains known solutions for troubleshooting your HyperPod EKS clusters.

**Topics**
+ [

## Dashboard tab
](#hp-eks-troubleshoot-dashboard)
+ [

## Tasks tab
](#hp-eks-troubleshoot-tasks)
+ [

## Policies
](#hp-eks-troubleshoot-policies)
+ [

## Deleting clusters
](#hp-eks-troubleshoot-delete-policies)
+ [

## Unallocated resource sharing
](#hp-eks-troubleshoot-unallocated-resource-sharing)

## Dashboard tab


**The EKS add-on fails to install**

For the EKS add-on installation to succeed, your cluster must run Kubernetes version 1.30 or later. To update, see [Update Kubernetes version](https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html).

For the EKS add-on installation to succeed, all of the nodes need to be in **Ready** status and all of the pods need to be in **Running** status. 

To check the status of your nodes, use the [list-cluster-nodes](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/list-cluster-nodes.html) AWS CLI command or navigate to your EKS cluster in the [EKS console](https://console.aws.amazon.com/eks/home#/clusters) and view the status of your nodes. Resolve the issue for each node or reach out to your administrator. If a node's status is **Unknown**, delete the node. Once all node statuses are **Ready**, retry installing the EKS add-on in HyperPod from the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/).

To check the status of your pods, use the [Kubernetes CLI](https://kubernetes.io/docs/reference/kubectl/) command `kubectl get pods -n cloudwatch-agent` or navigate to your EKS cluster in the [EKS console](https://console.aws.amazon.com/eks/home#/clusters) and view the status of your pods with the namespace `cloudwatch-agent`. Resolve the issues with the pods or reach out to your administrator. Once all pod statuses are **Running**, retry installing the EKS add-on in HyperPod from the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/).

For more troubleshooting, see [Troubleshooting the Amazon CloudWatch Observability EKS add-on](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/install-CloudWatch-Observability-EKS-addon.html#Container-Insights-setup-EKS-addon-troubleshoot).

## Tasks tab


If you see an error message stating that the **Custom Resource Definition (CRD) is not configured on the cluster**, grant the `EKSAdminViewPolicy` and `ClusterAccessRole` policies to your domain execution role. 
+ For information on how to get your execution role, see [Get your execution role](sagemaker-roles.md#sagemaker-roles-get-execution-role).
+ To learn how to attach policies to an IAM user or group, see [Adding and removing IAM identity permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html).
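As a hedged sketch, granting the EKS admin view policy through an EKS access entry might look like the following AWS CLI calls. The cluster name and role ARN are placeholders, the policy ARN follows the standard Amazon EKS access policy naming, and your environment may require granting `ClusterAccessRole` through a different mechanism:

```
# Create an access entry for the domain execution role (placeholders shown):
aws eks create-access-entry \
    --cluster-name my-eks-cluster \
    --principal-arn arn:aws:iam::111122223333:role/MyExecutionRole

# Associate the admin view access policy with cluster-wide scope:
aws eks associate-access-policy \
    --cluster-name my-eks-cluster \
    --principal-arn arn:aws:iam::111122223333:role/MyExecutionRole \
    --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSAdminViewPolicy \
    --access-scope type=cluster
```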

## Policies


The following lists solutions to policy-related errors when using the HyperPod APIs or console.
+ If the policy is in `CreateFailed` or `CreateRollbackFailed` status, you need to delete the failed policy and create a new one.
+ If the policy is in `UpdateFailed` status, retry the update with the same policy ARN.
+ If the policy is in `UpdateRollbackFailed` status, you need to delete the failed policy and then create a new one.
+ If the policy is in `DeleteFailed` or `DeleteRollbackFailed` status, retry the delete with the same policy ARN.
  + If you run into an error while trying to delete the **Compute prioritization** (cluster policy) using the HyperPod console, try deleting the `cluster-scheduler-config` using the API. To check the status of the resource, go to the details page of a compute allocation.

For more details on the failure, use the describe API.
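For example, the corresponding SageMaker describe calls can surface the failure reason for a cluster policy or compute allocation; the identifiers below are placeholders, and the parameter names assume the standard AWS CLI naming for these APIs:

```
# Describe a cluster policy (cluster scheduler config); the ID is a placeholder:
aws sagemaker describe-cluster-scheduler-config \
    --cluster-scheduler-config-id placeholder-config-id

# Describe a compute allocation (compute quota); the ID is a placeholder:
aws sagemaker describe-compute-quota \
    --compute-quota-id placeholder-quota-id
```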

## Deleting clusters


The following lists known solutions to errors relating to deleting clusters.
+ When cluster deletion fails due to attached SageMaker HyperPod task governance policies, you will need to [Delete policies](sagemaker-hyperpod-eks-operate-console-ui-governance-policies-delete.md).
+ When cluster deletion fails because the following permissions are missing, you need to update your cluster administrator's minimum set of permissions. See the **Amazon EKS** tab in the [IAM users for cluster admin](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-cluster-admin) section.
  + `sagemaker:ListComputeQuotas`
  + `sagemaker:ListClusterSchedulerConfig`
  + `sagemaker:DeleteComputeQuota`
  + `sagemaker:DeleteClusterSchedulerConfig`
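As a sketch, an IAM policy statement granting these permissions might look like the following; the broad `Resource` scope is for illustration only and should be narrowed to match your security requirements:

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:ListComputeQuotas",
                "sagemaker:ListClusterSchedulerConfig",
                "sagemaker:DeleteComputeQuota",
                "sagemaker:DeleteClusterSchedulerConfig"
            ],
            "Resource": "*"
        }
    ]
}
```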

## Unallocated resource sharing


If your unallocated resource pool capacity is less than expected:

1. **Check node ready status**

   ```
   kubectl get nodes
   ```

   Verify all nodes show `Ready` status in the STATUS column.

1. **Check node schedulable status**

   ```
   kubectl get nodes -o custom-columns=NAME:.metadata.name,UNSCHEDULABLE:.spec.unschedulable
   ```

   Verify nodes show `<none>` or `false` (not `true`).

1. **List unallocated resource sharing ClusterQueues:**

   ```
   kubectl get clusterqueue | grep hyperpod-ns-idle-resource-sharing
   ```

   This shows all unallocated resource sharing ClusterQueues. If the ClusterQueues are not showing up, check the `FailureReason` in the ClusterSchedulerConfig policy for any failure messages to continue debugging.

1. **Verify unallocated resource sharing quota:**

   ```
   kubectl describe clusterqueue hyperpod-ns-idle-resource-sharing-<index>
   ```

   Check the `spec.resourceGroups[].flavors[].resources` section to see the quota allocated for each resource flavor.

   Multiple unallocated resource sharing ClusterQueues may exist depending on the number of resource flavors in your cluster. 

1. **Check MIG configuration status (GPU nodes):**

   ```
   kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.nvidia\.com/mig\.config\.state}{"\n"}{end}'
   ```

   Verify MIG-enabled nodes show `success` state.

# Attribution document for Amazon SageMaker HyperPod task governance
Attribution

The following sections contain attributions and third-party licenses for material used in Amazon SageMaker HyperPod task governance.

**Topics**
+ [

## [base-files](https://packages.debian.org/bookworm/base-files)
](#hp-eks-task-governance-attributions-base-files)
+ [

## [netbase](https://packages.debian.org/source/stable/netbase)
](#hp-eks-task-governance-attributions-netbase)
+ [

## [golang-lru](https://github.com/hashicorp/golang-lru)
](#hp-eks-task-governance-attributions-golang-lru)

## [base-files](https://packages.debian.org/bookworm/base-files)


```
This is the Debian prepackaged version of the Debian Base System
Miscellaneous files. These files were written by Ian Murdock
<imurdock@debian.org> and Bruce Perens <bruce@pixar.com>.

This package was first put together by Bruce Perens <Bruce@Pixar.com>,
from his own sources.

The GNU Public Licenses in /usr/share/common-licenses were taken from
ftp.gnu.org and are copyrighted by the Free Software Foundation, Inc.

The Artistic License in /usr/share/common-licenses is the one coming
from Perl and its SPDX name is "Artistic License 1.0 (Perl)".


Copyright © 1995-2011 Software in the Public Interest.

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

On Debian systems, the complete text of the GNU General
Public License can be found in `/usr/share/common-licenses/GPL'.
```

## [netbase](https://packages.debian.org/source/stable/netbase)


```
Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
Comment:
 This package was created by Peter Tobias tobias@et-inf.fho-emden.de on
 Wed, 24 Aug 1994 21:33:28 +0200 and maintained by Anthony Towns
 <ajt@debian.org> until 2001.
 It is currently maintained by Marco d'Itri <md@linux.it>.

Files: *
Copyright:
 Copyright © 1994-1998 Peter Tobias
 Copyright © 1998-2001 Anthony Towns
 Copyright © 2002-2022 Marco d'Itri
License: GPL-2
 This program is free software; you can redistribute it and/or modify
 it under the terms of the GNU General Public License, version 2, as
 published by the Free Software Foundation.
 .
 This program is distributed in the hope that it will be useful,
 but WITHOUT ANY WARRANTY; without even the implied warranty of
 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 GNU General Public License for more details.
 .
 You should have received a copy of the GNU General Public License along
 with this program; if not, write to the Free Software Foundation,
 Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
 .
 On Debian systems, the complete text of the GNU General Public License
 version 2 can be found in '/usr/share/common-licenses/GPL-2'.
```

## [golang-lru](https://github.com/hashicorp/golang-lru)


```
Copyright © 2014 HashiCorp, Inc.

Mozilla Public License, version 2.0

1. Definitions

1.1. "Contributor"

     means each individual or legal entity that creates, contributes to the
     creation of, or owns Covered Software.

1.2. "Contributor Version"

     means the combination of the Contributions of others (if any) used by a
     Contributor and that particular Contributor's Contribution.

1.3. "Contribution"

     means Covered Software of a particular Contributor.

1.4. "Covered Software"

     means Source Code Form to which the initial Contributor has attached the
     notice in Exhibit A, the Executable Form of such Source Code Form, and
     Modifications of such Source Code Form, in each case including portions
     thereof.

1.5. "Incompatible With Secondary Licenses"
     means

     a. that the initial Contributor has attached the notice described in
        Exhibit B to the Covered Software; or

     b. that the Covered Software was made available under the terms of
        version 1.1 or earlier of the License, but not also under the terms of
        a Secondary License.

1.6. "Executable Form"

     means any form of the work other than Source Code Form.

1.7. "Larger Work"

     means a work that combines Covered Software with other material, in a
     separate file or files, that is not Covered Software.

1.8. "License"

     means this document.

1.9. "Licensable"

     means having the right to grant, to the maximum extent possible, whether
     at the time of the initial grant or subsequently, any and all of the
     rights conveyed by this License.

1.10. "Modifications"

     means any of the following:

     a. any file in Source Code Form that results from an addition to,
        deletion from, or modification of the contents of Covered Software; or

     b. any new file in Source Code Form that contains any Covered Software.

1.11. "Patent Claims" of a Contributor

      means any patent claim(s), including without limitation, method,
      process, and apparatus claims, in any patent Licensable by such
      Contributor that would be infringed, but for the grant of the License,
      by the making, using, selling, offering for sale, having made, import,
      or transfer of either its Contributions or its Contributor Version.

1.12. "Secondary License"

      means either the GNU General Public License, Version 2.0, the GNU Lesser
      General Public License, Version 2.1, the GNU Affero General Public
      License, Version 3.0, or any later versions of those licenses.

1.13. "Source Code Form"

      means the form of the work preferred for making modifications.

1.14. "You" (or "Your")

      means an individual or a legal entity exercising rights under this
      License. For legal entities, "You" includes any entity that controls, is
      controlled by, or is under common control with You. For purposes of this
      definition, "control" means (a) the power, direct or indirect, to cause
      the direction or management of such entity, whether by contract or
      otherwise, or (b) ownership of more than fifty percent (50%) of the
      outstanding shares or beneficial ownership of such entity.


2. License Grants and Conditions

2.1. Grants

     Each Contributor hereby grants You a world-wide, royalty-free,
     non-exclusive license:

     a. under intellectual property rights (other than patent or trademark)
        Licensable by such Contributor to use, reproduce, make available,
        modify, display, perform, distribute, and otherwise exploit its
        Contributions, either on an unmodified basis, with Modifications, or
        as part of a Larger Work; and

     b. under Patent Claims of such Contributor to make, use, sell, offer for
        sale, have made, import, and otherwise transfer either its
        Contributions or its Contributor Version.

2.2. Effective Date

     The licenses granted in Section 2.1 with respect to any Contribution
     become effective for each Contribution on the date the Contributor first
     distributes such Contribution.

2.3. Limitations on Grant Scope

     The licenses granted in this Section 2 are the only rights granted under
     this License. No additional rights or licenses will be implied from the
     distribution or licensing of Covered Software under this License.
     Notwithstanding Section 2.1(b) above, no patent license is granted by a
     Contributor:

     a. for any code that a Contributor has removed from Covered Software; or

     b. for infringements caused by: (i) Your and any other third party's
        modifications of Covered Software, or (ii) the combination of its
        Contributions with other software (except as part of its Contributor
        Version); or

     c. under Patent Claims infringed by Covered Software in the absence of
        its Contributions.

     This License does not grant any rights in the trademarks, service marks,
     or logos of any Contributor (except as may be necessary to comply with
     the notice requirements in Section 3.4).

2.4. Subsequent Licenses

     No Contributor makes additional grants as a result of Your choice to
     distribute the Covered Software under a subsequent version of this
     License (see Section 10.2) or under the terms of a Secondary License (if
     permitted under the terms of Section 3.3).

2.5. Representation

     Each Contributor represents that the Contributor believes its
     Contributions are its original creation(s) or it has sufficient rights to
     grant the rights to its Contributions conveyed by this License.

2.6. Fair Use

     This License is not intended to limit any rights You have under
     applicable copyright doctrines of fair use, fair dealing, or other
     equivalents.

2.7. Conditions

     Sections 3.1, 3.2, 3.3, and 3.4 are conditions of the licenses granted in
     Section 2.1.


3. Responsibilities

3.1. Distribution of Source Form

     All distribution of Covered Software in Source Code Form, including any
     Modifications that You create or to which You contribute, must be under
     the terms of this License. You must inform recipients that the Source
     Code Form of the Covered Software is governed by the terms of this
     License, and how they can obtain a copy of this License. You may not
     attempt to alter or restrict the recipients' rights in the Source Code
     Form.

3.2. Distribution of Executable Form

     If You distribute Covered Software in Executable Form then:

     a. such Covered Software must also be made available in Source Code Form,
        as described in Section 3.1, and You must inform recipients of the
        Executable Form how they can obtain a copy of such Source Code Form by
        reasonable means in a timely manner, at a charge no more than the cost
        of distribution to the recipient; and

     b. You may distribute such Executable Form under the terms of this
        License, or sublicense it under different terms, provided that the
        license for the Executable Form does not attempt to limit or alter the
        recipients' rights in the Source Code Form under this License.

3.3. Distribution of a Larger Work

     You may create and distribute a Larger Work under terms of Your choice,
     provided that You also comply with the requirements of this License for
     the Covered Software. If the Larger Work is a combination of Covered
     Software with a work governed by one or more Secondary Licenses, and the
     Covered Software is not Incompatible With Secondary Licenses, this
     License permits You to additionally distribute such Covered Software
     under the terms of such Secondary License(s), so that the recipient of
     the Larger Work may, at their option, further distribute the Covered
     Software under the terms of either this License or such Secondary
     License(s).

3.4. Notices

     You may not remove or alter the substance of any license notices
     (including copyright notices, patent notices, disclaimers of warranty, or
     limitations of liability) contained within the Source Code Form of the
     Covered Software, except that You may alter any license notices to the
     extent required to remedy known factual inaccuracies.

3.5. Application of Additional Terms

     You may choose to offer, and to charge a fee for, warranty, support,
     indemnity or liability obligations to one or more recipients of Covered
     Software. However, You may do so only on Your own behalf, and not on
     behalf of any Contributor. You must make it absolutely clear that any
     such warranty, support, indemnity, or liability obligation is offered by
     You alone, and You hereby agree to indemnify every Contributor for any
     liability incurred by such Contributor as a result of warranty, support,
     indemnity or liability terms You offer. You may include additional
     disclaimers of warranty and limitations of liability specific to any
     jurisdiction.

4. Inability to Comply Due to Statute or Regulation

   If it is impossible for You to comply with any of the terms of this License
   with respect to some or all of the Covered Software due to statute,
   judicial order, or regulation then You must: (a) comply with the terms of
   this License to the maximum extent possible; and (b) describe the
   limitations and the code they affect. Such description must be placed in a
   text file included with all distributions of the Covered Software under
   this License. Except to the extent prohibited by statute or regulation,
   such description must be sufficiently detailed for a recipient of ordinary
   skill to be able to understand it.

5. Termination

5.1. The rights granted under this License will terminate automatically if You
     fail to comply with any of its terms. However, if You become compliant,
     then the rights granted under this License from a particular Contributor
     are reinstated (a) provisionally, unless and until such Contributor
     explicitly and finally terminates Your grants, and (b) on an ongoing
     basis, if such Contributor fails to notify You of the non-compliance by
     some reasonable means prior to 60 days after You have come back into
     compliance. Moreover, Your grants from a particular Contributor are
     reinstated on an ongoing basis if such Contributor notifies You of the
     non-compliance by some reasonable means, this is the first time You have
     received notice of non-compliance with this License from such
     Contributor, and You become compliant prior to 30 days after Your receipt
     of the notice.

5.2. If You initiate litigation against any entity by asserting a patent
     infringement claim (excluding declaratory judgment actions,
     counter-claims, and cross-claims) alleging that a Contributor Version
     directly or indirectly infringes any patent, then the rights granted to
     You by any and all Contributors for the Covered Software under Section
     2.1 of this License shall terminate.

5.3. In the event of termination under Sections 5.1 or 5.2 above, all end user
     license agreements (excluding distributors and resellers) which have been
     validly granted by You or Your distributors under this License prior to
     termination shall survive termination.

6. Disclaimer of Warranty

   Covered Software is provided under this License on an "as is" basis,
   without warranty of any kind, either expressed, implied, or statutory,
   including, without limitation, warranties that the Covered Software is free
   of defects, merchantable, fit for a particular purpose or non-infringing.
   The entire risk as to the quality and performance of the Covered Software
   is with You. Should any Covered Software prove defective in any respect,
   You (not any Contributor) assume the cost of any necessary servicing,
   repair, or correction. This disclaimer of warranty constitutes an essential
   part of this License. No use of  any Covered Software is authorized under
   this License except under this disclaimer.

7. Limitation of Liability

   Under no circumstances and under no legal theory, whether tort (including
   negligence), contract, or otherwise, shall any Contributor, or anyone who
   distributes Covered Software as permitted above, be liable to You for any
   direct, indirect, special, incidental, or consequential damages of any
   character including, without limitation, damages for lost profits, loss of
   goodwill, work stoppage, computer failure or malfunction, or any and all
   other commercial damages or losses, even if such party shall have been
   informed of the possibility of such damages. This limitation of liability
   shall not apply to liability for death or personal injury resulting from
   such party's negligence to the extent applicable law prohibits such
   limitation. Some jurisdictions do not allow the exclusion or limitation of
   incidental or consequential damages, so this exclusion and limitation may
   not apply to You.

8. Litigation

   Any litigation relating to this License may be brought only in the courts
   of a jurisdiction where the defendant maintains its principal place of
   business and such litigation shall be governed by laws of that
   jurisdiction, without reference to its conflict-of-law provisions. Nothing
   in this Section shall prevent a party's ability to bring cross-claims or
   counter-claims.

9. Miscellaneous

   This License represents the complete agreement concerning the subject
   matter hereof. If any provision of this License is held to be
   unenforceable, such provision shall be reformed only to the extent
   necessary to make it enforceable. Any law or regulation which provides that
   the language of a contract shall be construed against the drafter shall not
   be used to construe this License against a Contributor.


10. Versions of the License

10.1. New Versions

      Mozilla Foundation is the license steward. Except as provided in Section
      10.3, no one other than the license steward has the right to modify or
      publish new versions of this License. Each version will be given a
      distinguishing version number.

10.2. Effect of New Versions

      You may distribute the Covered Software under the terms of the version
      of the License under which You originally received the Covered Software,
      or under the terms of any subsequent version published by the license
      steward.

10.3. Modified Versions

      If you create software not governed by this License, and you want to
      create a new license for such software, you may create and use a
      modified version of this License if you rename the license and remove
      any references to the name of the license steward (except to note that
      such modified license differs from this License).

10.4. Distributing Source Code Form that is Incompatible With Secondary
      Licenses If You choose to distribute Source Code Form that is
      Incompatible With Secondary Licenses under the terms of this version of
      the License, the notice described in Exhibit B of this License must be
      attached.

Exhibit A - Source Code Form License Notice

      This Source Code Form is subject to the
      terms of the Mozilla Public License, v.
      2.0. If a copy of the MPL was not
      distributed with this file, You can
      obtain one at
      http://mozilla.org/MPL/2.0/.

If it is not possible or desirable to put the notice in a particular file,
then You may include the notice in a location (such as a LICENSE file in a
relevant directory) where a recipient would be likely to look for such a
notice.

You may add additional accurate notices of copyright ownership.

Exhibit B - "Incompatible With Secondary Licenses" Notice

      This Source Code Form is "Incompatible
      With Secondary Licenses", as defined by
      the Mozilla Public License, v. 2.0.
```