

# Parameters for Built-in Algorithms

The following table lists parameters for each of the algorithms provided by Amazon SageMaker AI.


| Algorithm name | Channel name | Training input mode | File type | Instance class | Parallelizable | 
| --- | --- | --- | --- | --- | --- | 
| AutoGluon-Tabular | training and (optionally) validation | File | CSV | CPU or GPU (single instance only) | No | 
| BlazingText | train | File or Pipe | Text file (one sentence per line with space-separated tokens)  | CPU or GPU (single instance only)  | No | 
| CatBoost | training and (optionally) validation | File | CSV | CPU (single instance only) | No | 
| DeepAR Forecasting | train and (optionally) test | File | JSON Lines or Parquet | CPU or GPU | Yes | 
| Factorization Machines | train and (optionally) test | File or Pipe | recordIO-protobuf | CPU (GPU for dense data) | Yes | 
| Image Classification - MXNet | train and validation, (optionally) train_lst, validation_lst, and model | File or Pipe | recordIO or image files (.jpg or .png)  | GPU | Yes | 
| Image Classification - TensorFlow | training and validation | File | image files (.jpg, .jpeg, or .png)  | CPU or GPU | Yes (only across multiple GPUs on a single instance) | 
| IP Insights | train and (optionally) validation | File | CSV | CPU or GPU | Yes | 
| K-Means | train and (optionally) test | File or Pipe | recordIO-protobuf or CSV | CPU or GPU (single GPU device on one or more instances) | No | 
| K-Nearest-Neighbors (k-NN) | train and (optionally) test | File or Pipe | recordIO-protobuf or CSV | CPU or GPU (single GPU device on one or more instances) | Yes | 
| LDA | train and (optionally) test | File or Pipe | recordIO-protobuf or CSV | CPU (single instance only) | No | 
| LightGBM | train/training and (optionally) validation | File | CSV | CPU | Yes | 
| Linear Learner | train and (optionally) validation, test, or both | File or Pipe | recordIO-protobuf or CSV | CPU or GPU | Yes | 
| Neural Topic Model | train and (optionally) validation, test, or both | File or Pipe | recordIO-protobuf or CSV | CPU or GPU | Yes | 
| Object2Vec | train and (optionally) validation, test, or both | File | JSON Lines  | CPU or GPU (single instance only) | No | 
| Object Detection - MXNet | train and validation, (optionally) train_annotation, validation_annotation, and model | File or Pipe | recordIO or image files (.jpg or .png)  | GPU | Yes | 
| Object Detection - TensorFlow | training and validation | File | image files (.jpg, .jpeg, or .png)  | GPU | Yes (only across multiple GPUs on a single instance) | 
| PCA | train and (optionally) test | File or Pipe | recordIO-protobuf or CSV | CPU or GPU | Yes | 
| Random Cut Forest | train and (optionally) test | File or Pipe | recordIO-protobuf or CSV | CPU | Yes | 
| Semantic Segmentation | train and validation, train_annotation, validation_annotation, and (optionally) label_map and model | File or Pipe | Image files | GPU (single instance only) | No | 
| Seq2Seq Modeling | train, validation, and vocab | File | recordIO-protobuf | GPU (single instance only) | No | 
| TabTransformer | training and (optionally) validation | File | CSV | CPU or GPU (single instance only) | No | 
| Text Classification - TensorFlow | training and validation | File | CSV | CPU or GPU | Yes (only across multiple GPUs on a single instance) | 
| XGBoost (0.90-1, 0.90-2, 1.0-1, 1.2-1, 1.2-2) | train and (optionally) validation | File or Pipe | CSV, LibSVM, or Parquet | CPU (or GPU for 1.2-1) | Yes | 

Algorithms that are *parallelizable* can be deployed on multiple compute instances for distributed training.

The following topics provide information about data formats, recommended Amazon EC2 instance types, and CloudWatch logs common to all of the built-in algorithms provided by Amazon SageMaker AI.

**Note**  
To look up the Docker image URIs of the built-in algorithms managed by SageMaker AI, see [Docker Registry Paths and Example Code](https://docs.aws.amazon.com/sagemaker/latest/dg-ecr-paths/sagemaker-algo-docker-registry-paths).

**Topics**
+ [Common Data Formats for Training](cdf-training.md)
+ [Common data formats for inference](cdf-inference.md)
+ [Instance Types for Built-in Algorithms](cmn-info-instance-types.md)
+ [Logs for Built-in Algorithms](common-info-all-sagemaker-models-logs.md)

# Common Data Formats for Training


To prepare for training, you can preprocess your data using a variety of AWS services, including AWS Glue, Amazon EMR, Amazon Redshift, Amazon Relational Database Service, and Amazon Athena. After preprocessing, publish the data to an Amazon S3 bucket. For training, the data must go through a series of conversions and transformations, including: 
+ Training data serialization (handled by you) 
+ Training data deserialization (handled by the algorithm) 
+ Training model serialization (handled by the algorithm) 
+ Trained model deserialization (optional, handled by you) 

When using Amazon SageMaker AI for training, make sure that you upload all of the training data at once. If you later add data to that location, you must make a new training call to construct a brand-new model.

**Topics**
+ [Content Types Supported by Built-In Algorithms](#cdf-common-content-types)
+ [Using Pipe Mode](#cdf-pipe-mode)
+ [Using CSV Format](#cdf-csv-format)
+ [Using RecordIO Format](#cdf-recordio-format)
+ [Trained Model Deserialization](#td-deserialization)

## Content Types Supported by Built-In Algorithms


The following table lists some of the commonly supported [ContentType](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_Channel.html#SageMaker-Type-Channel-ContentType) values and the algorithms that use them:

ContentTypes for Built-in Algorithms


| ContentType | Algorithm | 
| --- | --- | 
| application/x-image | Object Detection Algorithm, Semantic Segmentation | 
| application/x-recordio |  Object Detection Algorithm  | 
| application/x-recordio-protobuf |  Factorization Machines, K-Means, k-NN, Latent Dirichlet Allocation, Linear Learner, NTM, PCA, RCF, Sequence-to-Sequence  | 
| application/jsonlines |  BlazingText, DeepAR  | 
| image/jpeg |  Object Detection Algorithm, Semantic Segmentation  | 
| image/png |  Object Detection Algorithm, Semantic Segmentation  | 
| text/csv |  IP Insights, K-Means, k-NN, Latent Dirichlet Allocation, Linear Learner, NTM, PCA, RCF, XGBoost  | 
| text/libsvm |  XGBoost  | 

For a summary of the parameters used by each algorithm, see the documentation for the individual algorithms or this [table](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html).

## Using Pipe Mode


In *Pipe mode*, your training job streams data directly from Amazon Simple Storage Service (Amazon S3). Streaming can provide faster start times for training jobs and better throughput. This is in contrast to *File mode*, in which your data from Amazon S3 is stored on the training instance volumes. File mode uses disk space to store both your final model artifacts and your full training dataset. By streaming in your data directly from Amazon S3 in Pipe mode, you reduce the size of the Amazon Elastic Block Store volumes of your training instances. Pipe mode needs only enough disk space to store your final model artifacts. See [AlgorithmSpecification](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_AlgorithmSpecification.html) for additional details on the training input mode.
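As an illustrative sketch, the input mode is selected through the `TrainingInputMode` field of the `AlgorithmSpecification` in a `CreateTrainingJob` request; the training image URI below is a placeholder, not a real registry path:

```python
# Illustrative AlgorithmSpecification fragment for a CreateTrainingJob request.
# The training image URI is a placeholder, not a real registry path.
algorithm_specification = {
    "TrainingImage": "<account>.dkr.ecr.<region>.amazonaws.com/<algorithm>:1",
    "TrainingInputMode": "Pipe",  # "File" would store the dataset on instance volumes
}
```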

## Using CSV Format


Many Amazon SageMaker AI algorithms support training with data in CSV format. To use data in CSV format for training, in the input data channel specification, specify **text/csv** as the [ContentType](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_Channel.html#SageMaker-Type-Channel-ContentType). Amazon SageMaker AI requires that a CSV file does not have a header record and that the target variable is in the first column. To run unsupervised learning algorithms that don't have a target, specify the number of label columns in the content type. For example, in this case specify **'content_type=text/csv;label_size=0'**. For more information, see [Now use Pipe mode with CSV datasets for faster training on Amazon SageMaker AI built-in algorithms](https://aws.amazon.com/blogs/machine-learning/now-use-pipe-mode-with-csv-datasets-for-faster-training-on-amazon-sagemaker-built-in-algorithms/).
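As a minimal sketch (the rows and values are made up), the following standard-library code produces training data in the required layout: no header record, with the target variable in the first column:

```python
import csv
import io

# Rows: target label first, then the features. No header record.
rows = [
    [1, 1.5, 16.0, 14.0, 23.0],
    [0, -2.0, 100.2, 15.2, 9.2],
]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerows(rows)
train_csv = buf.getvalue()
```

The resulting string (or a file with the same content) would then be uploaded to the S3 location used by the training channel.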

## Using RecordIO Format


In the protobuf recordIO format, SageMaker AI converts each observation in the dataset into a binary representation as a set of 4-byte floats, then loads it in the protobuf values field. If you are using Python for your data preparation, we strongly recommend that you use these existing transformations. However, if you are using another language, the protobuf definition file below provides the schema that you use to convert your data into SageMaker AI protobuf format.

**Note**  
For an example that shows how to convert the commonly used NumPy array into the protobuf recordIO format, see *[An Introduction to Factorization Machines with MNIST](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/factorization_machines_mnist/factorization_machines_mnist.html)* .

```
syntax = "proto2";

 package aialgs.data;

 option java_package = "com.amazonaws.aialgorithms.proto";
 option java_outer_classname = "RecordProtos";

 // A sparse or dense rank-R tensor that stores data as floats (float32).
 message Float32Tensor   {
     // Each value in the vector. If keys is empty, this is treated as a
     // dense vector.
     repeated float values = 1 [packed = true];

     // If key is not empty, the vector is treated as sparse, with
     // each key specifying the location of the value in the sparse vector.
     repeated uint64 keys = 2 [packed = true];

     // An optional shape that allows the vector to represent a matrix.
     // For example, if shape = [ 10, 20 ], floor(keys[i] / 20) gives the row,
     // and keys[i] % 20 gives the column.
     // This also supports n-dimensional tensors.
     // Note: If the tensor is sparse, you must specify this value.
     repeated uint64 shape = 3 [packed = true];
 }

 // A sparse or dense rank-R tensor that stores data as doubles (float64).
 message Float64Tensor {
     // Each value in the vector. If keys is empty, this is treated as a
     // dense vector.
     repeated double values = 1 [packed = true];

     // If this is not empty, the vector is treated as sparse, with
     // each key specifying the location of the value in the sparse vector.
     repeated uint64 keys = 2 [packed = true];

     // An optional shape that allows the vector to represent a matrix.
     // For example, if shape = [ 10, 20 ], floor(keys[i] / 20) gives the row,
     // and keys[i] % 20 gives the column.
     // This also supports n-dimensional tensors.
     // Note: If the tensor is sparse, you must specify this value.
     repeated uint64 shape = 3 [packed = true];
 }

 // A sparse or dense rank-R tensor that stores data as 32-bit ints (int32).
 message Int32Tensor {
     // Each value in the vector. If keys is empty, this is treated as a
     // dense vector.
     repeated int32 values = 1 [packed = true];

     // If this is not empty, the vector is treated as sparse with
     // each key specifying the location of the value in the sparse vector.
     repeated uint64 keys = 2 [packed = true];

     // An optional shape that allows the vector to represent a matrix.
     // For example, if shape = [ 10, 20 ], floor(keys[i] / 20) gives the row,
     // and keys[i] % 20 gives the column.
     // This also supports n-dimensional tensors.
     // Note: If the tensor is sparse, you must specify this value.
     repeated uint64 shape = 3 [packed = true];
 }

 // Support for storing binary data for parsing in other ways (such as JPEG/etc).
 // This is an example of another type of value and may not immediately be supported.
 message Bytes {
     repeated bytes value = 1;

     // If the content type of the data is known, stores it.
     // This allows for the possibility of using decoders for common formats
     // in the future.
     optional string content_type = 2;
 }

 message Value {
     oneof value {
         // The numbering assumes the possible use of:
         // - float16, float128
         // - int8, int16, int32
         Float32Tensor float32_tensor = 2;
         Float64Tensor float64_tensor = 3;
         Int32Tensor int32_tensor = 7;
         Bytes bytes = 9;
     }
 }

 message Record {
     // Map from the name of the feature to the value.
     //
     // For vectors and libsvm-like datasets,
     // a single feature with the name `values`
     // should be specified.
     map<string, Value> features = 1;

     // An optional set of labels for this record.
     // Similar to the features field above, the key used for
     // generic scalar / vector labels should be 'values'.
     map<string, Value> label = 2;

     // A unique identifier for this record in the dataset.
     //
     // Whilst not necessary, this allows better
     // debugging where there are data issues.
     //
     // This is not used by the algorithm directly.
     optional string uid = 3;

     // Textual metadata describing the record.
     //
     // This may include JSON-serialized information
     // about the source of the record.
     //
     // This is not used by the algorithm directly.
     optional string metadata = 4;

     // An optional serialized JSON object that allows per-record
     // hyper-parameters/configuration/other information to be set.
     //
     // The meaning/interpretation of this field is defined by
     // the algorithm author and may not be supported.
     //
     // This is used to pass additional inference configuration
     // when batch inference is used (e.g. types of scores to return).
     optional string configuration = 5;
 }
```

After creating the protocol buffer, store it in an Amazon S3 location that Amazon SageMaker AI can access, and pass that location as part of `InputDataConfig` in the `create_training_job` call. 
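A hedged sketch of what the corresponding `InputDataConfig` channel might look like (the bucket and prefix are placeholders):

```python
# Illustrative InputDataConfig for a create_training_job call.
# The S3 URI is a placeholder.
input_data_config = [
    {
        "ChannelName": "train",
        "ContentType": "application/x-recordio-protobuf",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://my-bucket/train/",
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }
]
```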

**Note**  
For all Amazon SageMaker AI algorithms, the `ChannelName` in `InputDataConfig` must be set to `train`. Some algorithms also support a validation or test input channel. These channels are typically used to evaluate the model's performance with a hold-out dataset. Hold-out datasets are not used in the initial training but can be used to further tune the model.

## Trained Model Deserialization


Amazon SageMaker AI models are stored as `model.tar.gz` in the S3 bucket specified in the `S3OutputPath` parameter of the `OutputDataConfig` in the `create_training_job` call. The S3 bucket must be in the same AWS Region as the notebook instance. You can specify most of these model artifacts when creating a hosting model. You can also open and review them in your notebook instance. When `model.tar.gz` is untarred, it contains `model_algo-1`, which is a serialized Apache MXNet object. For example, use the following code to load the k-means model into memory and view it: 

```
import mxnet as mx
print(mx.ndarray.load('model_algo-1'))
```

# Common data formats for inference


Amazon SageMaker AI algorithms accept and produce several different MIME types for the HTTP payloads used in retrieving online and mini-batch predictions. You can use multiple AWS services to transform or preprocess records before running inference. At a minimum, you need to convert the data for the following:
+ Inference request serialization (handled by you) 
+ Inference request deserialization (handled by the algorithm) 
+ Inference response serialization (handled by the algorithm) 
+ Inference response deserialization (handled by you) 

**Topics**
+ [Convert data for inference request serialization](#ir-serialization)
+ [Convert data for inference response deserialization](#ir-deserialization)
+ [Common request formats for all algorithms](#common-in-formats)
+ [Use batch transform with built-in algorithms](#cm-batch)

## Convert data for inference request serialization


Content type options for Amazon SageMaker AI algorithm inference requests include: `text/csv`, `application/json`, and `application/x-recordio-protobuf`. Algorithms that don't support all of these types can support other types. XGBoost, for example, only supports `text/csv` from this list, but also supports `text/libsvm`.

For `text/csv`, the value for the Body argument to `invoke_endpoint` should be a string with commas separating the values for each feature. For example, a record for a model with four features might look like `1.5,16.0,14,23.0`. Any transformations performed on the training data should also be performed on the data before obtaining inference. The order of the features matters and must remain unchanged. 
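A minimal sketch of this serialization, using the feature values from the example above:

```python
# Serialize one record's features into a text/csv request body.
features = [1.5, 16.0, 14.0, 23.0]
body = ",".join(str(f) for f in features)
# body would then be passed as the Body argument to invoke_endpoint,
# with ContentType="text/csv".
```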

`application/json` is more flexible and provides multiple possible formats for developers to use in their applications. At a high level, in JavaScript, the payload might look like the following: 

```
let request = {
  // "instances" can contain multiple rows for which predictions are sought.
  "instances": [
    {
      // Request and algorithm specific inference parameters.
      "configuration": {},
      // Data in the specific format required by the algorithm.
      "data": {
         "<field name>": dataElement
       }
    }
  ]
}
```

You have the following options for specifying the `dataElement`: 

**Protocol buffers equivalent**

```
// Has the same format as the protocol buffers implementation described for training.
let dataElement = {
  "keys": [],
  "values": [],
  "shape": []
}
```

**Simple numeric vector**

```
// An array containing numeric values is treated as an instance containing a
// single dense vector.
let dataElement = [1.5, 16.0, 14.0, 23.0]

// It will be converted to the following representation by the SDK.
let converted = {
  "features": {
    "values": dataElement
  }
}
```

**For multiple records**

```
let request = {
  "instances": [
    // First instance.
    {
      "features": [ 1.5, 16.0, 14.0, 23.0 ]
    },
    // Second instance.
    {
      "features": [ -2.0, 100.2, 15.2, 9.2 ]
    }
  ]
}
```

## Convert data for inference response deserialization


Amazon SageMaker AI algorithms return JSON in several layouts. At a high level, the structure is:

```
let response = {
  "predictions": [{
    // Fields in the response object are defined on a per algorithm-basis.
  }]
}
```

The fields that are included in predictions differ across algorithms. The following are examples of output for the k-means algorithm.

**Single-record inference** 

```
let response = {
  "predictions": [{
    "closest_cluster": 5,
    "distance_to_cluster": 36.5
  }]
}
```

**Multi-record inference**

```
let response = {
  "predictions": [
    // First instance prediction.
    {
      "closest_cluster": 5,
      "distance_to_cluster": 36.5
    },
    // Second instance prediction.
    {
      "closest_cluster": 2,
      "distance_to_cluster": 90.3
    }
  ]
}
```

**Multi-record inference with protobuf input**

```
{
  "features": [],
  "label": {
    "closest_cluster": {
      "values": [ 5.0 ] // e.g., the closest cluster is cluster 5
    },
    "distance_to_cluster": {
      "values": [ 36.5 ]
    }
  },
  "uid": "abc123",
  "metadata": "{ \"created_at\": \"2017-06-03\" }"
}
```

SageMaker AI algorithms also support the JSONLINES format, where the per-record response content is the same as in JSON format. The multi-record structure is a collection of per-record response objects separated by newline characters. For example, the response content for the built-in k-means algorithm for two input data points is:

```
{"distance_to_cluster": 23.40593910217285, "closest_cluster": 0.0}
{"distance_to_cluster": 27.250282287597656, "closest_cluster": 0.0}
```
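A response in this format can be deserialized one line at a time; the following minimal sketch uses the two-record k-means output above:

```python
import json

# Example JSON Lines response body, as returned in the two-record case above.
response_body = (
    '{"distance_to_cluster": 23.40593910217285, "closest_cluster": 0.0}\n'
    '{"distance_to_cluster": 27.250282287597656, "closest_cluster": 0.0}'
)

# One JSON object per non-empty line.
predictions = [json.loads(line) for line in response_body.splitlines() if line]
```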

While running batch transform, we recommend using the `jsonlines` response type by setting the `Accept` field in the `CreateTransformJobRequest` to `application/jsonlines`.

## Common request formats for all algorithms


Most algorithms accept one or more of the following inference request formats.

### JSON request format


**Content type:** application/json

**Dense format**

```
let request =   {
    "instances":    [
        {
            "features": [1.5, 16.0, 14.0, 23.0]
        }
    ]
}


let request =   {
    "instances":    [
        {
            "data": {
                "features": {
                    "values": [ 1.5, 16.0, 14.0, 23.0]
                }
            }
        }
    ]
}
```

**Sparse format**

```
{
	"instances": [
		{"data": {"features": {
					"keys": [26, 182, 232, 243, 431],
					"shape": [2000],
					"values": [1, 1, 1, 4, 1]
				}
			}
		},
		{"data": {"features": {
					"keys": [0, 182, 232, 243, 431],
					"shape": [2000],
					"values": [13, 1, 1, 4, 1]
				}
			}
		}
	]
}
```

### JSONLINES request format


**Content type:** application/jsonlines

**Dense format**

A single record in dense format can be represented as either:

```
{ "features": [1.5, 16.0, 14.0, 23.0] }
```

or:

```
{ "data": { "features": { "values": [ 1.5, 16.0, 14.0, 23.0] } } }
```

**Sparse format**

A single record in sparse format is represented as:

```
{"data": {"features": { "keys": [26, 182, 232, 243, 431], "shape": [2000], "values": [1, 1, 1, 4, 1] } } }
```

Multiple records are represented as a collection of single-record representations, separated by newline characters:

```
{"data": {"features": { "keys": [0, 1, 3], "shape": [4], "values": [1, 4, 1] } } }
{ "data": { "features": { "values": [ 1.5, 16.0, 14.0, 23.0] } } }
{ "features": [1.5, 16.0, 14.0, 23.0] }
```
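Such a multi-record payload can be assembled with `json.dumps`; a minimal sketch (the records are made up):

```python
import json

# Two records in the two dense-format variants shown above.
records = [
    {"features": [1.5, 16.0, 14.0, 23.0]},
    {"data": {"features": {"values": [-2.0, 100.2, 15.2, 9.2]}}},
]

# One JSON object per line, separated by newline characters.
payload = "\n".join(json.dumps(r) for r in records)
```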

### CSV request format


**Content type:** text/csv; label_size=0

**Note**  
CSV support is not available for factorization machines.

### RECORDIO request format


**Content type:** application/x-recordio-protobuf

## Use batch transform with built-in algorithms


While running batch transform, we recommend using the JSONLINES response type instead of JSON, if supported by the algorithm. To do this, set the `Accept` field in the `CreateTransformJobRequest` to `application/jsonlines`.

When you create a transform job, the `SplitType` must be set based on the `ContentType` of the input data. Similarly, depending on the `Accept` field in the `CreateTransformJobRequest`, `AssembleWith` must be set accordingly. Use the following table to set these fields:


| ContentType | Recommended SplitType | 
| --- | --- | 
| application/x-recordio-protobuf | RecordIO | 
| text/csv | Line | 
| application/jsonlines | Line | 
| application/json | None | 
| application/x-image | None | 
| image/* | None | 


| Accept | Recommended AssembleWith | 
| --- | --- | 
| application/x-recordio-protobuf | None | 
| application/json | None | 
| application/jsonlines | Line | 
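Putting the two tables together, a hedged sketch of the relevant fields for a transform job with CSV input and JSON Lines output (names and URIs are placeholders):

```python
# Illustrative field choices for a CreateTransformJob request with
# text/csv input and application/jsonlines output. S3 URIs are placeholders.
transform_settings = {
    "TransformInput": {
        "ContentType": "text/csv",
        "SplitType": "Line",  # per the table: text/csv -> Line
        "DataSource": {
            "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": "s3://my-bucket/input/"}
        },
    },
    "TransformOutput": {
        "Accept": "application/jsonlines",
        "AssembleWith": "Line",  # per the table: application/jsonlines -> Line
        "S3OutputPath": "s3://my-bucket/output/",
    },
}
```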

For more information on response formats for specific algorithms, see the following:
+ [DeepAR Inference Formats](deepar-in-formats.md)
+ [Factorization Machines Response Formats](fm-in-formats.md)
+ [IP Insights Inference Data Formats](ip-insights-inference-data-formats.md)
+ [K-Means Response Formats](km-in-formats.md)
+ [k-NN Request and Response Formats](kNN-inference-formats.md)
+ [Linear learner response formats](LL-in-formats.md)
+ [NTM Response Formats](ntm-in-formats.md)
+ [Data Formats for Object2Vec Inference](object2vec-inference-formats.md)
+ [Encoder Embeddings for Object2Vec](object2vec-encoder-embeddings.md)
+ [PCA Response Formats](PCA-in-formats.md)
+ [RCF Response Formats](rcf-in-formats.md)

# Instance Types for Built-in Algorithms

Most Amazon SageMaker AI algorithms have been engineered to take advantage of GPU computing for training. Despite higher per-instance costs, GPUs train more quickly, making them more cost-effective. Exceptions are noted in this guide.

To learn about the supported EC2 instances, see [Instance details](https://aws.amazon.com/sagemaker-ai/pricing/#Instance_details).

The size and type of data can have a great effect on which hardware configuration is most effective. When the same model is trained on a recurring basis, initial testing across a spectrum of instance types can discover configurations that are more cost-effective in the long run. Additionally, algorithms that train most efficiently on GPUs might not require GPUs for efficient inference. Experiment to determine the most cost-effective solution. To get an automatic instance recommendation or conduct custom load tests, use [Amazon SageMaker Inference Recommender](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-recommender.html).

For more information on SageMaker AI hardware specifications, see [Amazon SageMaker AI pricing](https://aws.amazon.com/sagemaker/ai/pricing/).

**UltraServers**

UltraServers connect multiple Amazon EC2 instances using a low-latency, high-bandwidth accelerator interconnect. They are built to handle large-scale AI/ML workloads that require significant processing power. For more information, see [Amazon EC2 UltraServers](https://aws.amazon.com/ec2/ultraservers/). To get started with UltraServers, see [Reserve training plans for your training jobs or HyperPod clusters](https://docs.aws.amazon.com/sagemaker/latest/dg/reserve-capacity-with-training-plans.html).

To get started with UltraServers on Amazon SageMaker AI, [create a training plan](https://docs.aws.amazon.com/sagemaker/latest/dg/reserve-capacity-with-training-plans.html). Once your UltraServer is available in the training plan, create a training job with the AWS Management Console, Amazon SageMaker AI API, or AWS CLI. Remember to specify the UltraServer instance type that you purchased in the training plan.

An UltraServer can run one or multiple jobs at a time. UltraServers group instances together, which gives you some flexibility in how to allocate your UltraServer capacity in your organization. As you configure your jobs, also keep your organization's data security guidelines in mind, because instances running one job can potentially access data for another job running on a different instance of the same UltraServer.

If you run into hardware failures in the UltraServer, SageMaker AI automatically tries to resolve the issue. As SageMaker AI investigates and resolves the issue, you might receive notifications and actions through AWS Health Events or AWS Support.

Once your training job finishes, SageMaker AI stops the instances, but they remain available in your training plan if the plan is still active. To keep an instance in an UltraServer running after a job finishes, you can use [managed warm pools](https://docs.aws.amazon.com/sagemaker/latest/dg/train-warm-pools.html).

If your training plan has enough capacity, you can even run training jobs across multiple UltraServers. By default, each UltraServer comes with 18 instances: 17 usable instances and 1 spare. If you need more instances, you must buy more UltraServers. When creating a training job, you can configure how jobs are placed across UltraServers using the `InstancePlacementConfig` parameter.

If you don't configure job placement, SageMaker AI automatically allocates jobs to instances within your UltraServer. This default strategy is best effort and prioritizes filling all of the instances in a single UltraServer before using a different UltraServer. For example, if you request 14 instances and have 2 UltraServers in your training plan, SageMaker AI places all 14 on the first UltraServer. If you request 20 instances and have 2 UltraServers in your training plan, SageMaker AI uses all 17 instances in the first UltraServer and then 3 from the second UltraServer. Instances within an UltraServer use NVLink to communicate, but communication between UltraServers uses Elastic Fabric Adapter (EFA), which might affect model training performance.

# Logs for Built-in Algorithms

Amazon SageMaker AI algorithms produce Amazon CloudWatch logs, which provide detailed information on the training process. To see the logs, in the AWS Management Console, choose **CloudWatch**, choose **Logs**, and then choose the **/aws/sagemaker/TrainingJobs** log group. Each training job has one log stream per node on which it was trained. The log stream's name begins with the value specified in the `TrainingJobName` parameter when the job was created.

**Note**  
If a job fails and logs do not appear in CloudWatch, it's likely that an error occurred before the start of training. Reasons include specifying the wrong training image or S3 location.

The contents of logs vary by algorithms. However, you can typically find the following information:
+ Confirmation of arguments provided at the beginning of the log
+ Errors that occurred during training
+ Measurement of an algorithm's accuracy or numerical performance
+ Timings for the algorithm and any major stages within the algorithm

## Common Errors

If a training job fails, some details about the failure are provided by the `FailureReason` return value in the training job description, as follows:

```
sage = boto3.client('sagemaker')
sage.describe_training_job(TrainingJobName=job_name)['FailureReason']
```

Others are reported only in the CloudWatch logs. Common errors include the following:

1. Misspecifying a hyperparameter or specifying a hyperparameter that is invalid for the algorithm.

   **From the CloudWatch Log**

   ```
   [10/16/2017 23:45:17 ERROR 139623806805824 train.py:48]
   Additional properties are not allowed (u'mini_batch_siz' was
   unexpected)
   ```

1. Specifying an invalid value for a hyperparameter.

   **FailureReason**

   ```
   AlgorithmError: u'abc' is not valid under any of the given
   schemas\n\nFailed validating u'oneOf' in
   schema[u'properties'][u'feature_dim']:\n    {u'oneOf':
   [{u'pattern': u'^([1-9][0-9]*)$', u'type': u'string'},\n
   {u'minimum': 1, u'type': u'integer'}]}\
   ```

   **From the CloudWatch Log**

   ```
   [10/16/2017 23:57:17 ERROR 140373086025536 train.py:48] u'abc'
   is not valid under any of the given schemas
   ```

1. Inaccurate protobuf file format.

   **From the CloudWatch log**

   ```
   [10/17/2017 18:01:04 ERROR 140234860816192 train.py:48] cannot
   copy sequence with size 785 to array axis with dimension 784
   ```