

# XGBoost algorithm with Amazon SageMaker AI
<a name="xgboost"></a>

[XGBoost](https://github.com/dmlc/xgboost) (eXtreme Gradient Boosting) is a popular and efficient open-source implementation of the gradient boosted trees algorithm. Gradient boosting is a supervised learning algorithm that tries to accurately predict a target variable by combining multiple estimates from a set of simpler models. The XGBoost algorithm performs well in machine learning competitions for the following reasons:
+ Its robust handling of a variety of data types, relationships, and distributions.
+ The variety of hyperparameters that you can fine-tune.

You can use XGBoost for regression, classification (binary and multiclass), and ranking problems. 

You can use the new release of the XGBoost algorithm as either:
+ An Amazon SageMaker AI built-in algorithm.
+ A framework to run training scripts in your local environments.

This implementation has a smaller memory footprint, better logging, improved hyperparameter validation, and a larger set of metrics than the original versions. It provides an XGBoost `estimator` that runs a training script in a managed XGBoost environment. The current release of SageMaker AI XGBoost is based on the original XGBoost versions 1.0, 1.2, 1.3, 1.5, 1.7, and 3.0.

For more information about the Amazon SageMaker AI XGBoost algorithm, see the following blog posts:
+ [Introducing the open-source Amazon SageMaker AI XGBoost algorithm container](https://aws.amazon.com/blogs/machine-learning/introducing-the-open-source-amazon-sagemaker-xgboost-algorithm-container/)
+ [Amazon SageMaker AI XGBoost now offers fully distributed GPU training](https://aws.amazon.com/blogs/machine-learning/amazon-sagemaker-xgboost-now-offers-fully-distributed-gpu-training/)

## Supported versions
<a name="xgboost-supported-versions"></a>

For more details, see our [support policy](https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-containers-support-policy.html#pre-built-containers-support-policy-ml-framework).
+ Framework (open source) mode: 1.2-1, 1.2-2, 1.3-1, 1.5-1, 1.7-1, 3.0-5
+ Algorithm mode: 1.2-1, 1.2-2, 1.3-1, 1.5-1, 1.7-1, 3.0-5

**Warning**  
Due to required compute capacity, version 3.0-5 of SageMaker AI XGBoost is not compatible with GPU instances from the P3 instance family for training or inference.

**Warning**  
Due to package compatibility constraints, version 3.0-5 of SageMaker AI XGBoost does not support SageMaker Debugger.

**Warning**  
Due to required compute capacity, version 1.7-1 of SageMaker AI XGBoost is not compatible with GPU instances from the P2 instance family for training or inference.

**Warning**  
Network Isolation Mode: Do not upgrade pip beyond version 25.2. Newer versions may attempt to fetch setuptools from PyPI during module installation.

**Important**  
When you retrieve the SageMaker AI XGBoost image URI, do not use `:latest` or `:1` for the image URI tag. You must specify one of the [Supported versions](#xgboost-supported-versions) to choose the SageMaker AI-managed XGBoost container with the native XGBoost package version that you want to use. To find the package version migrated into the SageMaker AI XGBoost containers, see [Docker Registry Paths and Example Code](https://docs.aws.amazon.com/sagemaker/latest/dg-ecr-paths/sagemaker-algo-docker-registry-paths.html). Then choose your AWS Region, and navigate to the **XGBoost (algorithm)** section.

**Warning**  
The XGBoost 0.90 versions are deprecated. Support for security updates and bug fixes for XGBoost 0.90 has been discontinued. We highly recommend that you upgrade to one of the newer XGBoost versions.

**Note**  
XGBoost v1.1 is not supported on SageMaker AI. XGBoost 1.1 cannot run prediction when the test input has fewer features than the training data in LIBSVM inputs. This capability was restored in XGBoost v1.2. Consider using SageMaker AI XGBoost 1.2-2 or later.

**Note**  
You can use XGBoost v1.0-1, but it's not officially supported.

## EC2 instance recommendation for the XGBoost algorithm
<a name="Instance-XGBoost"></a>

SageMaker AI XGBoost supports CPU and GPU training and inference. Instance recommendations depend on training and inference needs, as well as the version of the XGBoost algorithm. Choose one of the following options for more information:
+ [CPU training](#Instance-XGBoost-training-cpu)
+ [GPU training](#Instance-XGBoost-training-gpu)
+ [Distributed CPU training](#Instance-XGBoost-distributed-training-cpu)
+ [Distributed GPU training](#Instance-XGBoost-distributed-training-gpu)
+ [Inference](#Instance-XGBoost-inference)

### Training
<a name="Instance-XGBoost-training"></a>

The SageMaker AI XGBoost algorithm supports CPU and GPU training.

#### CPU training
<a name="Instance-XGBoost-training-cpu"></a>

SageMaker AI XGBoost 1.0-1 or earlier only trains using CPUs. It is a memory-bound (as opposed to compute-bound) algorithm. So, a general-purpose compute instance (for example, M5) is a better choice than a compute-optimized instance (for example, C4). Further, we recommend that you have enough total memory in selected instances to hold the training data. It supports the use of disk space to handle data that does not fit into main memory. This is a result of the out-of-core feature available with the libsvm input mode. Even so, writing cache files onto disk slows the algorithm processing time. 

#### GPU training
<a name="Instance-XGBoost-training-gpu"></a>

SageMaker AI XGBoost version 1.2-2 or later supports GPU training. Despite higher per-instance costs, GPUs train more quickly, making them more cost effective. 

SageMaker AI XGBoost version 1.2-2 or later supports P2, P3, G4dn, and G5 GPU instance families.

SageMaker AI XGBoost version 1.7-1 or later supports P3, G4dn, and G5 GPU instance families. Note that due to compute capacity requirements, version 1.7-1 or later does not support the P2 instance family.

SageMaker AI XGBoost version 3.0-5 or later supports G4dn and G5 GPU instance families. Note that due to compute capacity requirements, version 3.0-5 or later does not support the P3 instance family.

To take advantage of GPU training:
+ Specify the instance type as one of the GPU instances (for example, G4dn) 
+ Set the `tree_method` hyperparameter to `gpu_hist` in your existing XGBoost script

### Distributed training
<a name="Instance-XGBoost-distributed-training"></a>

SageMaker AI XGBoost supports CPU and GPU instances for distributed training.

#### Distributed CPU training
<a name="Instance-XGBoost-distributed-training-cpu"></a>

To run CPU training on multiple instances, set the `instance_count` parameter for the estimator to a value greater than one. The input data must be divided among the instances. 

##### Divide input data across instances
<a name="Instance-XGBoost-distributed-training-divide-data"></a>

Divide the input data using the following steps:

1. Break the input data down into smaller files. The number of files should be at least equal to the number of instances used for distributed training. Using multiple smaller files as opposed to one large file also decreases the data download time for the training job.

1. When creating your [TrainingInput](https://sagemaker.readthedocs.io/en/stable/api/utility/inputs.html), set the distribution parameter to `ShardedByS3Key`. With this, each instance gets approximately *1/n* of the number of files in S3 if there are *n* instances specified in the training job.
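The approximate *1/n* assignment can be sketched in plain Python. The round-robin split below is only an illustration of the behavior described above, not SageMaker AI's internal implementation:

```python
# Illustrative sketch: how n instances each receive ~1/n of the input files
# under ShardedByS3Key-style distribution.

def shard_files(files, instance_count):
    """Assign files round-robin so each instance gets ~1/n of them."""
    shards = [[] for _ in range(instance_count)]
    for i, f in enumerate(sorted(files)):
        shards[i % instance_count].append(f)
    return shards

files = [f"part-{i:05d}.csv" for i in range(10)]
shards = shard_files(files, instance_count=3)
for idx, shard in enumerate(shards):
    print(f"instance {idx}: {len(shard)} files")  # 4, 3, and 3 files
```

When the file count is not divisible by the instance count, some instances receive one extra file, which is why keeping the files roughly equal in size helps balance the work.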

#### Distributed GPU training
<a name="Instance-XGBoost-distributed-training-gpu"></a>

You can use distributed training with either single-GPU or multi-GPU instances.

**Distributed training with single-GPU instances**

SageMaker AI XGBoost versions 1.2-2 through 1.3-1 only support single-GPU instance training. This means that even if you select a multi-GPU instance, only one GPU is used per instance.

You must divide your input data among the instances if: 
+ You use XGBoost versions 1.2-2 through 1.3-1.
+ You do not need to use multi-GPU instances.

For more information, see [Divide input data across instances](#Instance-XGBoost-distributed-training-divide-data).

**Note**  
Versions 1.2-2 through 1.3-1 of SageMaker AI XGBoost only use one GPU per instance even if you choose a multi-GPU instance.

**Distributed training with multi-GPU instances**

Starting with version 1.5-1, SageMaker AI XGBoost offers distributed GPU training with [Dask](https://www.dask.org/). With Dask, you can use all GPUs when training with one or more multi-GPU instances. Dask also works when using single-GPU instances.

Train with Dask using the following steps:

1. Either omit the `distribution` parameter in your [TrainingInput](https://sagemaker.readthedocs.io/en/stable/api/utility/inputs.html) or set it to `FullyReplicated`.

1. When defining your hyperparameters, set `use_dask_gpu_training` to `"true"`.

**Important**  
Distributed training with Dask only supports CSV and Parquet input formats. If you use other data formats such as LIBSVM or PROTOBUF, the training job fails.   
For Parquet data, ensure that the column names are saved as strings. Columns that have names of other data types will fail to load.

**Important**  
Distributed training with Dask does not support pipe mode. If pipe mode is specified, the training job fails.

There are a few considerations to be aware of when training SageMaker AI XGBoost with Dask. Be sure to split your data into smaller files. Dask reads each Parquet file as a partition. There is a Dask worker for every GPU. As a result, the number of files should be greater than the total number of GPUs (instance count × number of GPUs per instance). Having a very large number of files can also degrade performance. For more information, see [Dask Best Practices](https://docs.dask.org/en/stable/best-practices.html).
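The file-count guidance above is simple arithmetic; the hypothetical helper below only illustrates the check, with instance and GPU counts you would substitute for your own job:

```python
def enough_partitions(num_files, instance_count, gpus_per_instance):
    """Dask creates one worker per GPU and reads each Parquet file as one
    partition, so the file count should exceed the total GPU count."""
    total_gpus = instance_count * gpus_per_instance
    return num_files > total_gpus

# e.g., 64 files on 2 instances with 4 GPUs each: 64 > 8 workers
print(enough_partitions(num_files=64, instance_count=2, gpus_per_instance=4))  # True
```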

#### Variations in output
<a name="Instance-XGBoost-distributed-training-output"></a>

The specified `tree_method` hyperparameter determines the algorithm that is used for XGBoost training. The tree methods `approx`, `hist` and `gpu_hist` are all approximate methods and use sketching for quantile calculation. For more information, see [Tree Methods](https://xgboost.readthedocs.io/en/stable/treemethod.html) in the XGBoost documentation. Sketching is an approximate algorithm. Therefore, you can expect variations in the model depending on factors such as the number of workers chosen for distributed training. The significance of the variation is data-dependent.

### Inference
<a name="Instance-XGBoost-inference"></a>

SageMaker AI XGBoost supports CPU and GPU instances for inference. For information about the instance types for inference, see [Amazon SageMaker AI ML Instance Types](https://aws.amazon.com/sagemaker/pricing/).

# How to use SageMaker AI XGBoost
<a name="xgboost-how-to-use"></a>

With SageMaker AI, you can use XGBoost as a built-in algorithm or framework. When you use XGBoost as a framework, you have more flexibility and access to more advanced scenarios because you can customize your own training scripts. The following sections describe how to use XGBoost with the SageMaker Python SDK, and the input/output interface for the XGBoost algorithm. For information on how to use XGBoost from the Amazon SageMaker Studio Classic UI, see [SageMaker JumpStart pretrained models](studio-jumpstart.md).

**Topics**
+ [Use XGBoost as a framework](#xgboost-how-to-framework)
+ [Use XGBoost as a built-in algorithm](#xgboost-how-to-built-in)
+ [Input/Output interface for the XGBoost algorithm](#InputOutput-XGBoost)

## Use XGBoost as a framework
<a name="xgboost-how-to-framework"></a>

Use XGBoost as a framework to run your customized training scripts that can incorporate additional data processing into your training jobs. In the following code example, SageMaker Python SDK provides the XGBoost API as a framework. This functions similarly to how SageMaker AI provides other framework APIs, such as TensorFlow, MXNet, and PyTorch.

```
import boto3
import sagemaker
from sagemaker.xgboost.estimator import XGBoost
from sagemaker.session import Session
from sagemaker.inputs import TrainingInput

# initialize hyperparameters
hyperparameters = {
        "max_depth":"5",
        "eta":"0.2",
        "gamma":"4",
        "min_child_weight":"6",
        "subsample":"0.7",
        "verbosity":"1",
        "objective":"reg:squarederror",
        "num_round":"50"}

# set an output path where the trained model will be saved
bucket = sagemaker.Session().default_bucket()
prefix = 'DEMO-xgboost-as-a-framework'
output_path = 's3://{}/{}/{}/output'.format(bucket, prefix, 'abalone-xgb-framework')

# construct a SageMaker AI XGBoost estimator
# specify the entry_point to your xgboost training script
estimator = XGBoost(entry_point = "your_xgboost_abalone_script.py", 
                    framework_version='1.7-1',
                    hyperparameters=hyperparameters,
                    role=sagemaker.get_execution_role(),
                    instance_count=1,
                    instance_type='ml.m5.2xlarge',
                    output_path=output_path)

# define the data type and paths to the training and validation datasets
content_type = "libsvm"
train_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'train'), content_type=content_type)
validation_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'validation'), content_type=content_type)

# execute the XGBoost training job
estimator.fit({'train': train_input, 'validation': validation_input})
```

For an end-to-end example of using SageMaker AI XGBoost as a framework, see [Regression with Amazon SageMaker AI XGBoost](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/xgboost_abalone/xgboost_abalone_dist_script_mode.html).

## Use XGBoost as a built-in algorithm
<a name="xgboost-how-to-built-in"></a>

Use the XGBoost built-in algorithm to build an XGBoost training container as shown in the following code example. You can automatically retrieve the XGBoost built-in algorithm image URI using the SageMaker AI `image_uris.retrieve` API. If you are using [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) version 1, use the `get_image_uri` API. To make sure that the `image_uris.retrieve` API finds the correct URI, see [Common parameters for built-in algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html). Then look up `xgboost` from the full list of built-in algorithm image URIs and available regions.

After specifying the XGBoost image URI, use the XGBoost container to construct an estimator using the SageMaker AI Estimator API and initiate a training job. This XGBoost built-in algorithm mode does not incorporate your own XGBoost training script and runs directly on the input datasets.

**Important**  
When you retrieve the SageMaker AI XGBoost image URI, do not use `:latest` or `:1` for the image URI tag. You must specify one of the [Supported versions](xgboost.md#xgboost-supported-versions) to choose the SageMaker AI-managed XGBoost container with the native XGBoost package version that you want to use. To find the package version migrated into the SageMaker AI XGBoost containers, see [Docker Registry Paths and Example Code](https://docs.aws.amazon.com/sagemaker/latest/dg-ecr-paths/sagemaker-algo-docker-registry-paths.html). Then choose your AWS Region, and navigate to the **XGBoost (algorithm)** section.

```
import sagemaker
import boto3
from sagemaker import image_uris
from sagemaker.session import Session
from sagemaker.inputs import TrainingInput

# initialize hyperparameters
hyperparameters = {
        "max_depth":"5",
        "eta":"0.2",
        "gamma":"4",
        "min_child_weight":"6",
        "subsample":"0.7",
        "objective":"reg:squarederror",
        "num_round":"50"}

# set an output path where the trained model will be saved
bucket = sagemaker.Session().default_bucket()
prefix = 'DEMO-xgboost-as-a-built-in-algo'
output_path = 's3://{}/{}/{}/output'.format(bucket, prefix, 'abalone-xgb-built-in-algo')

# automatically look up the XGBoost image URI and build an XGBoost container.
# specify the repo_version depending on your preference.
region = boto3.Session().region_name
xgboost_container = sagemaker.image_uris.retrieve("xgboost", region, "1.7-1")

# construct a SageMaker AI estimator that calls the xgboost-container
estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container, 
                                          hyperparameters=hyperparameters,
                                          role=sagemaker.get_execution_role(),
                                          instance_count=1, 
                                          instance_type='ml.m5.2xlarge', 
                                          volume_size=5, # 5 GB 
                                          output_path=output_path)

# define the data type and paths to the training and validation datasets
content_type = "libsvm"
train_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'train'), content_type=content_type)
validation_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'validation'), content_type=content_type)

# execute the XGBoost training job
estimator.fit({'train': train_input, 'validation': validation_input})
```

For more information about how to set up the XGBoost as a built-in algorithm, see the following notebook examples.
+ [Managed Spot Training for XGBoost](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/xgboost_abalone/xgboost_managed_spot_training.html)
+ [Regression with Amazon SageMaker AI XGBoost (Parquet input)](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/xgboost_abalone/xgboost_parquet_input_training.html)

## Input/Output interface for the XGBoost algorithm
<a name="InputOutput-XGBoost"></a>

Gradient boosting operates on tabular data, with the rows representing observations, one column representing the target variable or label, and the remaining columns representing features. 

The SageMaker AI implementation of XGBoost supports the following data formats for training and inference:
+  *text/libsvm* (default) 
+  *text/csv*
+  *application/x-parquet*
+  *application/x-recordio-protobuf*

**Note**  
There are a few considerations to be aware of regarding training and inference input:  
For increased performance, we recommend using XGBoost with *File mode*, in which your data from Amazon S3 is stored on the training instance volumes.
For training with columnar input, the algorithm assumes that the target variable (label) is the first column. For inference, the algorithm assumes that the input has no label column.
For CSV data, the input should not have a header record.
For LIBSVM training, the algorithm assumes that the columns after the label column contain the zero-based index:value pairs for features. So each row has the format: <label> <index0>:<value0> <index1>:<value1>.
For information on instance types and distributed training, see [EC2 instance recommendation for the XGBoost algorithm](xgboost.md#Instance-XGBoost).
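As a sketch of the LIBSVM row layout described in the note above, the minimal parser below reads one sparse row into a label and a feature map. It is an illustration only, not SageMaker AI's loader:

```python
def parse_libsvm_line(line):
    """Parse one LIBSVM row: '<label> <index0>:<value0> <index1>:<value1> ...'
    Returns the label and a dict mapping zero-based feature index -> value."""
    parts = line.strip().split()
    label = float(parts[0])
    features = {int(i): float(v) for i, v in (p.split(":") for p in parts[1:])}
    return label, features

label, feats = parse_libsvm_line("1 0:0.5 3:2.0")
print(label, feats)  # 1.0 {0: 0.5, 3: 2.0}
```

Note that indices absent from a row (here 1 and 2) are implicit zeros, which is what makes the format compact for sparse data.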

For CSV training input mode, the total memory available to the algorithm must be able to hold the training dataset. The total memory available is calculated as `Instance Count * the memory available in the InstanceType`. For libsvm training input mode, it's not required, but we recommend it.
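The memory requirement above is simple arithmetic. The sketch below illustrates the check with hypothetical sizes; consult the actual specifications of your chosen instance type:

```python
def fits_in_memory(dataset_gib, instance_count, memory_per_instance_gib):
    """Total memory available = instance count * memory per instance.
    For CSV training input, this total must hold the dataset."""
    return dataset_gib <= instance_count * memory_per_instance_gib

# e.g., a 100 GiB dataset on 4 instances with 32 GiB each (hypothetical figures)
print(fits_in_memory(100, 4, 32))  # True: 128 GiB total >= 100 GiB
```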

For v1.3-1 and later, SageMaker AI XGBoost saves the model in the XGBoost internal binary format, using `Booster.save_model`. Previous versions use the Python pickle module to serialize/deserialize the model.

**Note**  
Be mindful of versions when using a SageMaker AI XGBoost model in open source XGBoost. Versions 1.3-1 and later use the XGBoost internal binary format while previous versions use the Python pickle module.

**To use a model trained with SageMaker AI XGBoost v1.3-1 or later in open source XGBoost**
+ Use the following Python code:

  ```
  import xgboost as xgb
  
  xgb_model = xgb.Booster()
  xgb_model.load_model(model_file_path)
  xgb_model.predict(dtest)  # dtest: an xgboost.DMatrix built from your test data
  ```

**To use a model trained with previous versions of SageMaker AI XGBoost in open source XGBoost**
+ Use the following Python code:

  ```
  import pickle as pkl 
  import tarfile
  
  t = tarfile.open('model.tar.gz', 'r:gz')
  t.extractall()
  
  model = pkl.load(open(model_file_path, 'rb'))
  
  # prediction with test data
  pred = model.predict(dtest)
  ```

**To differentiate the importance of labelled data points, use instance weight support**
+ SageMaker AI XGBoost allows customers to differentiate the importance of labelled data points by assigning each instance a weight value. For *text/libsvm* input, customers can assign weight values to data instances by attaching them after the labels. For example, `label:weight idx_0:val_0 idx_1:val_1...`. For *text/csv* input, customers need to turn on the `csv_weights` flag in the parameters and attach weight values in the column after labels. For example: `label,weight,val_0,val_1,...`.
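The two weight layouts described above can be sketched with plain string formatting. These hypothetical helpers only illustrate the column orders the text specifies:

```python
def libsvm_row(label, weight, features):
    """LIBSVM with weights: 'label:weight idx:val ...' (weight attached after the label)."""
    pairs = " ".join(f"{i}:{v}" for i, v in sorted(features.items()))
    return f"{label}:{weight} {pairs}"

def csv_row(label, weight, values):
    """CSV with the csv_weights flag enabled: 'label,weight,val_0,val_1,...'."""
    return ",".join(str(x) for x in [label, weight, *values])

print(libsvm_row(1, 0.3, {0: 0.5, 2: 1.0}))  # 1:0.3 0:0.5 2:1.0
print(csv_row(1, 0.3, [0.5, 1.0]))           # 1,0.3,0.5,1.0
```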

# XGBoost sample notebooks
<a name="xgboost-sample-notebooks"></a>

The following list contains a variety of sample Jupyter notebooks that address different use cases of the Amazon SageMaker AI XGBoost algorithm.
+ [How to Create a Custom XGBoost container](https://sagemaker-examples.readthedocs.io/en/latest/aws_sagemaker_studio/sagemaker_studio_image_build/xgboost_bring_your_own/Batch_Transform_BYO_XGB.html) – This notebook shows you how to build a custom XGBoost Container with Amazon SageMaker AI Batch Transform.
+ [Regression with XGBoost using Parquet](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/xgboost_abalone/xgboost_parquet_input_training.html) – This notebook shows you how to use the Abalone dataset in Parquet to train an XGBoost model.
+ [How to Train and Host a Multiclass Classification Model](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/xgboost_mnist/xgboost_mnist.html) – This notebook shows how to use the MNIST dataset to train and host a multiclass classification model.
+ [How to train a Model for Customer Churn Prediction](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_applying_machine_learning/xgboost_customer_churn/xgboost_customer_churn.html) – This notebook shows you how to train a model to Predict Mobile Customer Departure in an effort to identify unhappy customers.
+ [An Introduction to Amazon SageMaker AI Managed Spot infrastructure for XGBoost Training](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/xgboost_abalone/xgboost_managed_spot_training.html) – This notebook shows you how to use Spot Instances for training with an XGBoost container.
+ [How to use Amazon SageMaker Debugger to debug XGBoost Training Jobs](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/xgboost_census_explanations/xgboost-census-debugger-rules.html) – This notebook shows you how to use Amazon SageMaker Debugger to monitor training jobs to detect inconsistencies using built-in debugging rules.

For instructions on how to create and access Jupyter notebook instances that you can use to run the examples in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, choose the **SageMaker AI Examples** tab to see a list of all of the SageMaker AI samples. The example notebooks that use the XGBoost algorithm are located in the **Introduction to Amazon algorithms** section. To open a notebook, choose its **Use** tab and choose **Create copy**.

# How the SageMaker AI XGBoost algorithm works
<a name="xgboost-HowItWorks"></a>

[XGBoost](https://github.com/dmlc/xgboost) is a popular and efficient open-source implementation of the gradient boosted trees algorithm. Gradient boosting is a supervised learning algorithm, which attempts to accurately predict a target variable by combining the estimates of a set of simpler, weaker models.

When using [gradient boosting](https://en.wikipedia.org/wiki/Gradient_boosting) for regression, the weak learners are regression trees, and each regression tree maps an input data point to one of its leaves that contains a continuous score. XGBoost minimizes a regularized (L1 and L2) objective function that combines a convex loss function (based on the difference between the predicted and target outputs) and a penalty term for model complexity (in other words, the regression tree functions). The training proceeds iteratively, adding new trees that predict the residuals or errors of prior trees that are then combined with previous trees to make the final prediction. It's called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models.
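The residual-fitting loop described above can be sketched numerically. The toy "weak learner" below simply predicts the mean of the current residuals (effectively a depth-0 tree); real XGBoost fits regularized regression trees, so this only illustrates the boosting mechanics:

```python
def boost(targets, rounds=10, eta=0.5):
    """Toy gradient boosting for squared error: each round fits a 'tree'
    that predicts the mean residual, shrinks it by eta, and adds it in."""
    prediction = [0.0] * len(targets)
    for _ in range(rounds):
        residuals = [t - p for t, p in zip(targets, prediction)]
        step = sum(residuals) / len(residuals)   # weak learner: mean residual
        prediction = [p + eta * step for p in prediction]
    return prediction

pred = boost([3.0, 3.0, 3.0], rounds=20, eta=0.5)
print(pred)  # each prediction approaches the target value 3.0
```

Each round reduces the remaining error by a factor related to `eta`, which is why smaller `eta` values require more rounds (`num_round`) but make the ensemble more conservative.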

The following is a brief illustration of how gradient tree boosting works.

![\[A diagram illustrating gradient tree boosting.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/xgboost_illustration.png)


**For more detail on XGBoost, see:**
+ [XGBoost: A Scalable Tree Boosting System](https://arxiv.org/pdf/1603.02754.pdf)
+ [Gradient Tree Boosting ](https://www.sas.upenn.edu/~fdiebold/NoHesitations/BookAdvanced.pdf#page=380)
+ [Introduction to Boosted Trees](https://xgboost.readthedocs.io/en/latest/tutorials/model.html)

# XGBoost hyperparameters
<a name="xgboost_hyperparameters"></a>

The following table contains the subset of hyperparameters that are required or most commonly used for the Amazon SageMaker AI XGBoost algorithm. These are parameters that are set by users to facilitate the estimation of model parameters from data. The required hyperparameters that must be set are listed first, in alphabetical order. The optional hyperparameters that can be set are listed next, also in alphabetical order. The SageMaker AI XGBoost algorithm is an implementation of the open-source DMLC XGBoost package. For details about the full set of hyperparameters that can be configured for this version of XGBoost, see [XGBoost Parameters](https://xgboost.readthedocs.io/en/release_1.2.0/).


| Parameter Name | Description | 
| --- | --- | 
| num\$1class |  The number of classes. **Required** if `objective` is set to *multi:softmax* or *multi:softprob*. Valid values: Integer.  | 
| num\$1round |  The number of rounds to run the training. **Required** Valid values: Integer.  | 
| alpha |  L1 regularization term on weights. Increasing this value makes models more conservative. **Optional** Valid values: Float. Default value: 0  | 
| base\$1score |  The initial prediction score of all instances, global bias. **Optional** Valid values: Float. Default value: 0.5  | 
| booster |  Which booster to use. The `gbtree` and `dart` values use a tree-based model, while `gblinear` uses a linear function. **Optional** Valid values: String. One of `"gbtree"`, `"gblinear"`, or `"dart"`. Default value: `"gbtree"`  | 
| colsample\$1bylevel |  Subsample ratio of columns for each split, in each level. **Optional** Valid values: Float. Range: [0,1]. Default value: 1  | 
| colsample\$1bynode |  Subsample ratio of columns from each node. **Optional** Valid values: Float. Range: (0,1]. Default value: 1  | 
| colsample\$1bytree |  Subsample ratio of columns when constructing each tree. **Optional** Valid values: Float. Range: [0,1]. Default value: 1  | 
| csv\$1weights |  When this flag is enabled, XGBoost differentiates the importance of instances for csv input by taking the second column (the column after labels) in training data as the instance weights. **Optional** Valid values: 0 or 1 Default value: 0  | 
| deterministic\$1histogram |  When this flag is enabled, XGBoost builds histogram on GPU deterministically. Used only if `tree_method` is set to `gpu_hist`. For a full list of valid inputs, please refer to [XGBoost Parameters](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst). **Optional** Valid values: String. Range: `"true"` or `"false"`. Default value: `"true"`  | 
| early\$1stopping\$1rounds |  The model trains until the validation score stops improving. Validation error needs to decrease at least every `early_stopping_rounds` to continue training. SageMaker AI hosting uses the best model for inference. **Optional** Valid values: Integer. Default value: -  | 
| eta |  Step size shrinkage used in updates to prevent overfitting. After each boosting step, you can directly get the weights of new features. The `eta` parameter actually shrinks the feature weights to make the boosting process more conservative. **Optional** Valid values: Float. Range: [0,1]. Default value: 0.3  | 
| eval\$1metric |  Evaluation metrics for validation data. A default metric is assigned according to the objective: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html) For a list of valid inputs, see [XGBoost Learning Task Parameters](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst#learning-task-parameters). **Optional** Valid values: String. Default value: Default according to objective.  | 
| gamma |  Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger, the more conservative the algorithm is. **Optional** Valid values: Float. Range: [0,∞). Default value: 0  | 
| grow\$1policy |  Controls the way that new nodes are added to the tree. Currently supported only if `tree_method` is set to `hist`. **Optional** Valid values: String. Either `"depthwise"` or `"lossguide"`. Default value: `"depthwise"`  | 
| interaction\$1constraints |  Specify groups of variables that are allowed to interact. **Optional** Valid values: Nested list of integers. Each integer represents a feature, and each nested list contains features that are allowed to interact e.g., [[1,2], [3,4,5]]. Default value: None  | 
| lambda |  L2 regularization term on weights. Increasing this value makes models more conservative. **Optional** Valid values: Float. Default value: 1  | 
| lambda\_bias |  L2 regularization term on bias. **Optional** Valid values: Float. Range: [0.0, 1.0]. Default value: 0  | 
| max\_bin |  Maximum number of discrete bins to bucket continuous features. Used only if `tree_method` is set to `hist`.  **Optional** Valid values: Integer. Default value: 256  | 
| max\_delta\_step |  Maximum delta step allowed for each tree's weight estimation. When a positive integer is used, it helps make the update more conservative. The preferred option is to use it in logistic regression. Set it to 1-10 to help control the update.  **Optional** Valid values: Integer. Range: [0,∞). Default value: 0  | 
| max\_depth |  Maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit. 0 indicates no limit. A limit is required when `grow_policy`=`depthwise`. **Optional** Valid values: Integer. Range: [0,∞). Default value: 6  | 
| max\_leaves |  Maximum number of nodes to be added. Relevant only if `grow_policy` is set to `lossguide`. **Optional** Valid values: Integer. Default value: 0  | 
| min\_child\_weight |  Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than `min_child_weight`, the building process gives up further partitioning. In linear regression models, this simply corresponds to a minimum number of instances needed in each node. The larger the value, the more conservative the algorithm is. **Optional** Valid values: Float. Range: [0,∞). Default value: 1  | 
| monotone\_constraints |  Specifies monotonicity constraints on any feature. **Optional** Valid values: Tuple of integers. Valid integers: -1 (decreasing constraint), 0 (no constraint), 1 (increasing constraint). For example, (0, 1) applies no constraint to the first predictor and an increasing constraint to the second; (-1, 1) applies a decreasing constraint to the first predictor and an increasing constraint to the second. Default value: (0, 0)  | 
| normalize\_type |  Type of normalization algorithm. **Optional** Valid values: Either *tree* or *forest*. Default value: *tree*  | 
| nthread |  Number of parallel threads used to run *xgboost*. **Optional** Valid values: Integer. Default value: Maximum number of threads.  | 
| objective |  Specifies the learning task and the corresponding learning objective. Examples: `reg:logistic`, `multi:softmax`, `reg:squarederror`. For a full list of valid inputs, refer to [XGBoost Learning Task Parameters](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst#learning-task-parameters). **Optional** Valid values: String Default value: `"reg:squarederror"`  | 
| one\_drop |  When this flag is enabled, at least one tree is always dropped during the dropout. **Optional** Valid values: 0 or 1 Default value: 0  | 
| process\_type |  The type of boosting process to run. **Optional** Valid values: String. Either `"default"` or `"update"`. Default value: `"default"`  | 
| rate\_drop |  The dropout rate that specifies the fraction of previous trees to drop during the dropout. **Optional** Valid values: Float. Range: [0.0, 1.0]. Default value: 0.0  | 
| refresh\_leaf |  This is a parameter of the 'refresh' updater plug-in. When set to `true` (1), tree leaves and tree node stats are updated. When set to `false` (0), only tree node stats are updated. **Optional** Valid values: 0/1 Default value: 1  | 
| sample\_type |  Type of sampling algorithm. **Optional** Valid values: Either `uniform` or `weighted`. Default value: `uniform`  | 
| scale\_pos\_weight |  Controls the balance of positive and negative weights. It's useful for unbalanced classes. A typical value to consider: `sum(negative cases)` / `sum(positive cases)`. **Optional** Valid values: Float Default value: 1  | 
| seed |  Random number seed. **Optional** Valid values: integer Default value: 0  | 
| single\_precision\_histogram |  When this flag is enabled, XGBoost uses single precision to build histograms instead of double precision. Used only if `tree_method` is set to `hist` or `gpu_hist`. For a full list of valid inputs, refer to [XGBoost Parameters](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst). **Optional** Valid values: String. Range: `"true"` or `"false"` Default value: `"false"`  | 
| sketch\_eps |  Used only for the approximate greedy algorithm. This translates into O(1 / `sketch_eps`) bins. Compared to directly selecting the number of bins, this comes with a theoretical guarantee of sketch accuracy. **Optional** Valid values: Float. Range: [0, 1]. Default value: 0.03  | 
| skip\_drop |  Probability of skipping the dropout procedure during a boosting iteration. **Optional** Valid values: Float. Range: [0.0, 1.0]. Default value: 0.0  | 
| subsample |  Subsample ratio of the training instance. Setting it to 0.5 means that XGBoost randomly collects half of the data instances to grow trees. This prevents overfitting. **Optional** Valid values: Float. Range: [0,1]. Default value: 1  | 
| tree\_method |  The tree construction algorithm used in XGBoost. **Optional** Valid values: One of `auto`, `exact`, `approx`, `hist`, or `gpu_hist`. Default value: `auto`  | 
| tweedie\_variance\_power |  Parameter that controls the variance of the Tweedie distribution. **Optional** Valid values: Float. Range: (1, 2). Default value: 1.5  | 
| updater |  A comma-separated string that defines the sequence of tree updaters to run. This provides a modular way to construct and modify the trees. For a full list of valid inputs, refer to [XGBoost Parameters](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst). **Optional** Valid values: Comma-separated string. Default value: `grow_colmaker`, prune  | 
| use\_dask\_gpu\_training |  Set `use_dask_gpu_training` to `"true"` if you want to run distributed GPU training with Dask. Dask GPU training is only supported for versions 1.5-1 and later. Do not set this value to `"true"` for versions preceding 1.5-1. For more information, see [Distributed GPU training](xgboost.md#Instance-XGBoost-distributed-training-gpu). **Optional** Valid values: String. Range: `"true"` or `"false"` Default value: `"false"`  | 
| verbosity | Verbosity of printing messages. Valid values: 0 (silent), 1 (warning), 2 (info), 3 (debug). **Optional** Default value: 1  | 
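
To show how these hyperparameters are supplied in practice, the following sketch builds a hyperparameter dictionary for a training job. SageMaker passes built-in algorithm hyperparameters as strings; the specific values here are illustrative, not recommendations:

```python
# Illustrative hyperparameter dictionary for the SageMaker XGBoost algorithm.
# All values are passed as strings; the settings shown are examples only.
hyperparameters = {
    "objective": "reg:squarederror",  # learning task and objective
    "num_round": "50",                # number of boosting rounds (required)
    "max_depth": "6",
    "eta": "0.3",
    "subsample": "0.8",
    "verbosity": "1",
}
```

This dictionary is then passed to an estimator through its `hyperparameters` argument, as shown in the upgrade examples later on this page.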

# Tune an XGBoost Model
<a name="xgboost-tuning"></a>

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your training and validation datasets. You choose three types of hyperparameters:
+ a learning `objective` function to optimize during model training
+ an `eval_metric` to use to evaluate model performance during validation
+ a set of hyperparameters and a range of values for each to use when tuning the model automatically

You choose the evaluation metric from the set of evaluation metrics that the algorithm computes. Automatic model tuning searches the chosen hyperparameters to find the combination of values that results in the model that optimizes the evaluation metric. 

**Note**  
Automatic model tuning for XGBoost 0.90 is only available from the Amazon SageMaker SDKs, not from the SageMaker AI console.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Evaluation Metrics Computed by the XGBoost Algorithm
<a name="xgboost-metrics"></a>

The XGBoost algorithm computes the following metrics to use for model validation. When tuning the model, choose one of these metrics to evaluate the model. For a full list of valid `eval_metric` values, refer to [XGBoost Learning Task Parameters](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst#learning-task-parameters).


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| validation:accuracy |  Classification rate, calculated as #(right)/#(all cases).  |  Maximize  | 
| validation:auc |  Area under the curve.  |  Maximize  | 
| validation:error |  Binary classification error rate, calculated as #(wrong cases)/#(all cases).  |  Minimize  | 
| validation:f1 |  Indicator of classification accuracy, calculated as the harmonic mean of precision and recall.  |  Maximize  | 
| validation:logloss |  Negative log-likelihood.  |  Minimize  | 
| validation:mae |  Mean absolute error.  |  Minimize  | 
| validation:map |  Mean average precision.  |  Maximize  | 
| validation:merror |  Multiclass classification error rate, calculated as #(wrong cases)/#(all cases).  |  Minimize  | 
| validation:mlogloss |  Negative log-likelihood for multiclass classification.  |  Minimize  | 
| validation:mse |  Mean squared error.  |  Minimize  | 
| validation:ndcg |  Normalized Discounted Cumulative Gain.  |  Maximize  | 
| validation:rmse |  Root mean square error.  |  Minimize  | 
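
As a toy illustration of the error-style metrics above, the following snippet computes the binary classification error rate (#(wrong cases)/#(all cases)) and its complement, accuracy, on made-up predictions:

```python
# Made-up labels and predictions for four cases.
y_true = [1, 0, 1, 1]
y_pred = [1, 1, 1, 0]

# validation:error = #(wrong cases) / #(all cases)
error = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

# validation:accuracy = #(right cases) / #(all cases) = 1 - error
accuracy = 1.0 - error
```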

## Tunable XGBoost Hyperparameters
<a name="xgboost-tunable-hyperparameters"></a>

Tune the XGBoost model with the following hyperparameters. The hyperparameters that have the greatest effect on optimizing the XGBoost evaluation metrics are: `alpha`, `min_child_weight`, `subsample`, `eta`, and `num_round`. 


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| alpha |  ContinuousParameterRanges  |  MinValue: 0, MaxValue: 1000  | 
| colsample\_bylevel |  ContinuousParameterRanges  |  MinValue: 0.1, MaxValue: 1  | 
| colsample\_bynode |  ContinuousParameterRanges  |  MinValue: 0.1, MaxValue: 1  | 
| colsample\_bytree |  ContinuousParameterRanges  |  MinValue: 0.5, MaxValue: 1  | 
| eta |  ContinuousParameterRanges  |  MinValue: 0.1, MaxValue: 0.5  | 
| gamma |  ContinuousParameterRanges  |  MinValue: 0, MaxValue: 5  | 
| lambda |  ContinuousParameterRanges  |  MinValue: 0, MaxValue: 1000  | 
| max\_delta\_step |  IntegerParameterRanges  |  [0, 10]  | 
| max\_depth |  IntegerParameterRanges  |  [0, 10]  | 
| min\_child\_weight |  ContinuousParameterRanges  |  MinValue: 0, MaxValue: 120  | 
| num\_round |  IntegerParameterRanges  |  [1, 4000]  | 
| subsample |  ContinuousParameterRanges  |  MinValue: 0.5, MaxValue: 1  | 
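
As a sketch of how the recommended ranges above map onto a tuning job, the following fragment expresses a subset of them in the parameter-ranges shape used by the `CreateHyperParameterTuningJob` API. The choice of parameters and the objective metric are illustrative:

```python
# Illustrative parameter-ranges fragment in the shape used by the
# CreateHyperParameterTuningJob API; all values are passed as strings.
parameter_ranges = {
    "ContinuousParameterRanges": [
        {"Name": "eta", "MinValue": "0.1", "MaxValue": "0.5"},
        {"Name": "alpha", "MinValue": "0", "MaxValue": "1000"},
        {"Name": "min_child_weight", "MinValue": "0", "MaxValue": "120"},
        {"Name": "subsample", "MinValue": "0.5", "MaxValue": "1"},
    ],
    "IntegerParameterRanges": [
        {"Name": "num_round", "MinValue": "1", "MaxValue": "4000"},
    ],
}

# The objective metric is chosen from the evaluation metrics table,
# e.g. RMSE for a regression objective.
tuning_objective = {"Type": "Minimize", "MetricName": "validation:rmse"}
```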

# Deprecated Versions of XGBoost and their Upgrades
<a name="xgboost-previous-versions"></a>

This topic contains documentation for previous versions of Amazon SageMaker AI XGBoost that are still available but deprecated. It also provides instructions on how to upgrade deprecated versions of XGBoost, when possible, to more current versions.

**Topics**
+ [Upgrade XGBoost Version 0.90 to Version 1.5](xgboost-version-0.90.md)
+ [XGBoost Version 0.72](xgboost-72.md)

# Upgrade XGBoost Version 0.90 to Version 1.5
<a name="xgboost-version-0.90"></a>

If you are using the SageMaker Python SDK, to upgrade existing XGBoost 0.90 jobs to version 1.5, you must have version 2.x of the SDK installed and change the XGBoost `version` and `framework_version` parameters to `1.5-1`. If you are using Boto3, you need to update the Docker image, as well as a few hyperparameters and learning objectives.

**Topics**
+ [Upgrade SageMaker AI Python SDK Version 1.x to Version 2.x](#upgrade-xgboost-version-0.90-sagemaker-python-sdk)
+ [Change the image tag to 1.5-1](#upgrade-xgboost-version-0.90-change-image-tag)
+ [Change Docker Image for Boto3](#upgrade-xgboost-version-0.90-boto3)
+ [Update Hyperparameters and Learning Objectives](#upgrade-xgboost-version-0.90-hyperparameters)

## Upgrade SageMaker AI Python SDK Version 1.x to Version 2.x
<a name="upgrade-xgboost-version-0.90-sagemaker-python-sdk"></a>

If you are still using version 1.x of the SageMaker Python SDK, you must upgrade to version 2.x. For information about the latest version of the SageMaker Python SDK, see [Use Version 2.x of the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html). To install the latest version, run:

```
python -m pip install --upgrade sagemaker
```

## Change the image tag to 1.5-1
<a name="upgrade-xgboost-version-0.90-change-image-tag"></a>

If you are using the SageMaker Python SDK and the XGBoost built-in algorithm, change the `version` parameter in `image_uris.retrieve` and pass the result to the estimator.

```
import sagemaker
from sagemaker import image_uris

xgboost_container = image_uris.retrieve(framework="xgboost", region="us-west-2", version="1.5-1")

estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container, 
                                          hyperparameters=hyperparameters,
                                          role=sagemaker.get_execution_role(),
                                          instance_count=1, 
                                          instance_type='ml.m5.2xlarge', 
                                          volume_size=5, # 5 GB 
                                          output_path=output_path)
```

If you are using the SageMaker Python SDK and using XGBoost as a framework to run your customized training scripts, change the `framework_version` parameter in the XGBoost API.

```
from sagemaker.xgboost.estimator import XGBoost

estimator = XGBoost(entry_point="your_xgboost_abalone_script.py", 
                    framework_version='1.5-1',
                    hyperparameters=hyperparameters,
                    role=sagemaker.get_execution_role(),
                    instance_count=1,
                    instance_type='ml.m5.2xlarge',
                    output_path=output_path)
```

`sagemaker.session.s3_input` in SageMaker Python SDK version 1.x has been renamed to `sagemaker.inputs.TrainingInput`. You must use `sagemaker.inputs.TrainingInput` as in the following example.

```
from sagemaker.inputs import TrainingInput

content_type = "libsvm"
train_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'train'), content_type=content_type)
validation_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'validation'), content_type=content_type)
```

For the full list of SageMaker Python SDK version 2.x changes, see [Use Version 2.x of the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html). 

## Change Docker Image for Boto3
<a name="upgrade-xgboost-version-0.90-boto3"></a>

If you are using Boto3 to train or deploy your model, change the Docker image tag (`1`, `0.72`, `0.90-1`, or `0.90-2`) to `1.5-1`.

```
{
    "AlgorithmSpecification":: {
        "TrainingImage": "746614075791.dkr.ecr.us-west-1.amazonaws.com/sagemaker-xgboost:1.5-1"
    }
    ...
}
```

If you are using the SageMaker Python SDK to retrieve the registry path, change the `version` parameter in `image_uris.retrieve`.

```
from sagemaker import image_uris
image_uris.retrieve(framework="xgboost", region="us-west-2", version="1.5-1")
```

## Update Hyperparameters and Learning Objectives
<a name="upgrade-xgboost-version-0.90-hyperparameters"></a>

The `silent` parameter has been deprecated and is no longer available in XGBoost 1.5 and later versions. Use `verbosity` instead. The `reg:linear` learning objective has also been deprecated in favor of `reg:squarederror`. Use `reg:squarederror` instead.

```
hyperparameters = {
    "verbosity": "2",
    "objective": "reg:squarederror",
    "num_round": "50",
    ...
}

estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container, 
                                          hyperparameters=hyperparameters,
                                          ...)
```

# XGBoost Version 0.72
<a name="xgboost-72"></a>

**Important**  
XGBoost 0.72 is deprecated by Amazon SageMaker AI. You can still use this older version of XGBoost (as a built-in algorithm) by pulling its image URI as shown in the following code samples. For XGBoost, the image URI ending with `:1` is for the old version.  

```
# SageMaker Python SDK version 1.x
import boto3
from sagemaker.amazon.amazon_estimator import get_image_uri

xgb_image_uri = get_image_uri(boto3.Session().region_name, "xgboost", repo_version="1")
```

```
# SageMaker Python SDK version 2.x
import boto3
from sagemaker import image_uris

xgb_image_uri = image_uris.retrieve("xgboost", boto3.Session().region_name, "1")
```
If you want to use newer versions, you have to explicitly specify the image URI tags (see [Supported versions](xgboost.md#xgboost-supported-versions)).

This previous release of the Amazon SageMaker AI XGBoost algorithm is based on the 0.72 release. [XGBoost](https://github.com/dmlc/xgboost) (eXtreme Gradient Boosting) is a popular and efficient open-source implementation of the gradient boosted trees algorithm. Gradient boosting is a supervised learning algorithm that attempts to accurately predict a target variable by combining the estimates of a set of simpler, weaker models. XGBoost has done remarkably well in machine learning competitions because it robustly handles a variety of data types, relationships, and distributions, and because of the large number of hyperparameters that can be tweaked and tuned for improved fits. This flexibility makes XGBoost a solid choice for problems in regression, classification (binary and multiclass), and ranking.

Customers should consider using the new release of [XGBoost algorithm with Amazon SageMaker AI](xgboost.md). They can use it as a SageMaker AI built-in algorithm or as a framework to run scripts in their local environments, as they typically would with, for example, the TensorFlow deep learning framework. The new implementation has a smaller memory footprint, better logging, improved hyperparameter validation, and an expanded set of metrics. The earlier implementation of XGBoost remains available to customers if they need to postpone migrating to the new version, but this previous implementation will remain tied to the 0.72 release of XGBoost.

## Input/Output Interface for the XGBoost Release 0.72
<a name="xgboost-72-InputOutput"></a>

Gradient boosting operates on tabular data, with the rows representing observations, one column representing the target variable or label, and the remaining columns representing features. 

The SageMaker AI implementation of XGBoost supports CSV and libsvm formats for training and inference:
+ For Training ContentType, valid inputs are *text/libsvm* (default) or *text/csv*.
+ For Inference ContentType, valid inputs are *text/libsvm* or (the default) *text/csv*.

**Note**  
For CSV training, the algorithm assumes that the target variable is in the first column and that the CSV does not have a header record. For CSV inference, the algorithm assumes that CSV input does not have the label column.   
For libsvm training, the algorithm assumes that the label is in the first column. Subsequent columns contain the zero-based index value pairs for features. So each row has the format: `<label> <index0>:<value0> <index1>:<value1> ...` Inference requests for libsvm may or may not have labels in the libsvm format.

To maintain greater consistency with standard XGBoost data formats, this differs from other SageMaker AI algorithms, which use the protobuf training input format.
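
The following sketch shows one made-up record (label 1.0 and three feature values) encoded in each of the two supported row formats:

```python
# Made-up record: label first, then feature values.
label = 1.0
features = [0.5, 0.0, 3.2]

# text/csv: label in the first column, no header record.
csv_row = ",".join(str(v) for v in [label] + features)

# text/libsvm: label first, then zero-based index:value pairs
# (zero-valued features may be omitted).
libsvm_row = str(label) + " " + " ".join(
    f"{i}:{v}" for i, v in enumerate(features) if v != 0.0
)
```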

For CSV training input mode, the total memory available to the algorithm (Instance Count × the memory available in the `InstanceType`) must be able to hold the training dataset. For libsvm training input mode, it's not required, but we recommend it.

SageMaker AI XGBoost uses the Python pickle module to serialize and deserialize the model, which can be used for saving and loading it.

**To use a model trained with SageMaker AI XGBoost in open source XGBoost**
+ Use the following Python code:

  ```
  import pickle as pkl
  import tarfile

  import xgboost

  # Extract the model artifact produced by the SageMaker training job
  t = tarfile.open('model.tar.gz', 'r:gz')
  t.extractall()

  # model_file_path is the path of the extracted model file
  model = pkl.load(open(model_file_path, 'rb'))

  # Prediction with test data (dtest is an xgboost.DMatrix)
  pred = model.predict(dtest)
  ```

**To differentiate the importance of labeled data points, use instance weight supports**
+ SageMaker AI XGBoost allows customers to differentiate the importance of labeled data points by assigning each instance a weight value. For *text/libsvm* input, customers can assign weight values to data instances by attaching them after the labels. For example, `label:weight idx_0:val_0 idx_1:val_1...`. For *text/csv* input, customers need to turn on the `csv_weights` flag in the parameters and attach weight values in the column after labels. For example: `label,weight,val_0,val_1,...`.
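
As a sketch, one made-up record with label 1.0 and instance weight 2.5 would be encoded as follows in each input format (for CSV, remember to enable the `csv_weights` flag):

```python
# Made-up record with an instance weight.
label, weight = 1.0, 2.5
features = [0.5, 3.2]

# text/csv with csv_weights enabled: weight goes in the column after the label.
csv_weighted = ",".join(str(v) for v in [label, weight] + features)

# text/libsvm: weight is attached to the label as label:weight.
libsvm_weighted = f"{label}:{weight} " + " ".join(
    f"{i}:{v}" for i, v in enumerate(features)
)
```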

## EC2 Instance Recommendation for the XGBoost Release 0.72
<a name="xgboost-72-Instance"></a>

SageMaker AI XGBoost currently only trains using CPUs. It is a memory-bound (as opposed to compute-bound) algorithm. So, a general-purpose compute instance (for example, M4) is a better choice than a compute-optimized instance (for example, C4). Further, we recommend that you have enough total memory in selected instances to hold the training data. Although it supports the use of disk space to handle data that does not fit into main memory (the out-of-core feature available with the libsvm input mode), writing cache files onto disk slows the algorithm processing time.

## XGBoost Release 0.72 Sample Notebooks
<a name="xgboost-72-sample-notebooks"></a>

For a sample notebook that shows how to use the latest version of SageMaker AI XGBoost as a built-in algorithm to train and host a regression model, see [Regression with Amazon SageMaker AI XGBoost algorithm](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/xgboost_abalone/xgboost_abalone.html). To use the 0.72 version of XGBoost, change the version in the sample code to 0.72. For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). Once you have created a notebook instance and opened it, select the **SageMaker AI Examples** tab to see a list of all the SageMaker AI samples. The example notebooks that use the XGBoost algorithm are located in the **Introduction to Amazon algorithms** section. To open a notebook, click its **Use** tab and select **Create copy**.

## XGBoost Release 0.72 Hyperparameters
<a name="xgboost-72-hyperparameters"></a>

The following table contains the hyperparameters for the XGBoost algorithm. These are parameters that users set to facilitate the estimation of model parameters from data. The required hyperparameters that must be set are listed first, in alphabetical order. The optional hyperparameters that can be set are listed next, also in alphabetical order. The SageMaker AI XGBoost algorithm is an implementation of the open-source XGBoost package. Currently SageMaker AI supports version 0.72. For more details about hyperparameter configuration for this version of XGBoost, see [XGBoost Parameters](https://xgboost.readthedocs.io/en/release_0.72/parameter.html).


| Parameter Name | Description | 
| --- | --- | 
| num\_class | The number of classes. **Required** if `objective` is set to *multi:softmax* or *multi:softprob*. Valid values: integer  | 
| num\_round | The number of rounds to run the training. **Required** Valid values: integer  | 
| alpha | L1 regularization term on weights. Increasing this value makes models more conservative. **Optional** Valid values: float Default value: 0  | 
| base\_score | The initial prediction score of all instances, global bias. **Optional** Valid values: float Default value: 0.5  | 
| booster | Which booster to use. The `gbtree` and `dart` values use a tree-based model, while `gblinear` uses a linear function. **Optional** Valid values: String. One of `gbtree`, `gblinear`, or `dart`. Default value: `gbtree`  | 
| colsample\_bylevel | Subsample ratio of columns for each split, in each level. **Optional** Valid values: Float. Range: [0,1]. Default value: 1  | 
| colsample\_bytree | Subsample ratio of columns when constructing each tree. **Optional** Valid values: Float. Range: [0,1]. Default value: 1 | 
| csv\_weights | When this flag is enabled, XGBoost differentiates the importance of instances for csv input by taking the second column (the column after labels) in training data as the instance weights. **Optional** Valid values: 0 or 1 Default value: 0  | 
| early\_stopping\_rounds | The model trains until the validation score stops improving. Validation error needs to decrease at least every `early_stopping_rounds` to continue training. SageMaker AI hosting uses the best model for inference. **Optional** Valid values: integer Default value: -  | 
| eta | Step size shrinkage used in updates to prevent overfitting. After each boosting step, you can directly get the weights of new features. The `eta` parameter actually shrinks the feature weights to make the boosting process more conservative. **Optional** Valid values: Float. Range: [0,1]. Default value: 0.3  | 
| eval\_metric | Evaluation metrics for validation data. A default metric is assigned according to the objective. [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/xgboost-72.html) For a list of valid inputs, see [XGBoost Parameters](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst#learning-task-parameters). **Optional** Valid values: string Default value: Default according to objective.  | 
| gamma | Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger, the more conservative the algorithm is. **Optional** Valid values: Float. Range: [0,∞). Default value: 0  | 
| grow\_policy | Controls the way that new nodes are added to the tree. Currently supported only if `tree_method` is set to `hist`. **Optional** Valid values: String. Either `depthwise` or `lossguide`. Default value: `depthwise`  | 
| lambda | L2 regularization term on weights. Increasing this value makes models more conservative. **Optional** Valid values: float Default value: 1  | 
| lambda\_bias | L2 regularization term on bias. **Optional** Valid values: Float. Range: [0.0, 1.0]. Default value: 0  | 
| max\_bin | Maximum number of discrete bins to bucket continuous features. Used only if `tree_method` is set to `hist`.  **Optional** Valid values: integer Default value: 256  | 
| max\_delta\_step | Maximum delta step allowed for each tree's weight estimation. When a positive integer is used, it helps make the update more conservative. The preferred option is to use it in logistic regression. Set it to 1-10 to help control the update.  **Optional** Valid values: Integer. Range: [0,∞). Default value: 0  | 
| max\_depth | Maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit. 0 indicates no limit. A limit is required when `grow_policy`=`depthwise`. **Optional** Valid values: Integer. Range: [0,∞). Default value: 6  | 
| max\_leaves | Maximum number of nodes to be added. Relevant only if `grow_policy` is set to `lossguide`. **Optional** Valid values: integer Default value: 0  | 
| min\_child\_weight | Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than `min_child_weight`, the building process gives up further partitioning. In linear regression models, this simply corresponds to a minimum number of instances needed in each node. The larger the value, the more conservative the algorithm is. **Optional** Valid values: Float. Range: [0,∞). Default value: 1  | 
| normalize\_type | Type of normalization algorithm. **Optional** Valid values: Either *tree* or *forest*. Default value: *tree*  | 
| nthread | Number of parallel threads used to run *xgboost*. **Optional** Valid values: integer Default value: Maximum number of threads.  | 
| objective | Specifies the learning task and the corresponding learning objective. Examples: `reg:logistic`, `multi:softmax`, `reg:linear`. For a full list of valid inputs, refer to [XGBoost Parameters](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst#learning-task-parameters). **Optional** Valid values: string Default value: `reg:linear`  | 
| one\_drop | When this flag is enabled, at least one tree is always dropped during the dropout. **Optional** Valid values: 0 or 1 Default value: 0  | 
| process\_type | The type of boosting process to run. **Optional** Valid values: String. Either `default` or `update`. Default value: `default`  | 
| rate\_drop | The dropout rate that specifies the fraction of previous trees to drop during the dropout. **Optional** Valid values: Float. Range: [0.0, 1.0]. Default value: 0.0  | 
| refresh\_leaf | This is a parameter of the 'refresh' updater plug-in. When set to `true` (1), tree leaves and tree node stats are updated. When set to `false` (0), only tree node stats are updated. **Optional** Valid values: 0/1 Default value: 1  | 
| sample\_type | Type of sampling algorithm. **Optional** Valid values: Either `uniform` or `weighted`. Default value: `uniform`  | 
| scale\_pos\_weight | Controls the balance of positive and negative weights. It's useful for unbalanced classes. A typical value to consider: `sum(negative cases)` / `sum(positive cases)`. **Optional** Valid values: float Default value: 1  | 
| seed | Random number seed. **Optional** Valid values: integer Default value: 0  | 
| silent | 0 means print running messages, 1 means silent mode. Valid values: 0 or 1 **Optional** Default value: 0  | 
| sketch\_eps | Used only for the approximate greedy algorithm. This translates into O(1 / `sketch_eps`) bins. Compared to directly selecting the number of bins, this comes with a theoretical guarantee of sketch accuracy. **Optional** Valid values: Float. Range: [0, 1]. Default value: 0.03  | 
| skip\_drop | Probability of skipping the dropout procedure during a boosting iteration. **Optional** Valid values: Float. Range: [0.0, 1.0]. Default value: 0.0  | 
| subsample | Subsample ratio of the training instance. Setting it to 0.5 means that XGBoost randomly collects half of the data instances to grow trees. This prevents overfitting. **Optional** Valid values: Float. Range: [0,1]. Default value: 1  | 
| tree\_method | The tree construction algorithm used in XGBoost. **Optional** Valid values: One of `auto`, `exact`, `approx`, or `hist`. Default value: `auto`  | 
| tweedie\_variance\_power | Parameter that controls the variance of the Tweedie distribution. **Optional** Valid values: Float. Range: (1, 2). Default value: 1.5  | 
| updater | A comma-separated string that defines the sequence of tree updaters to run. This provides a modular way to construct and modify the trees. For a full list of valid inputs, refer to [XGBoost Parameters](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst). **Optional** Valid values: comma-separated string. Default value: `grow_colmaker`, prune  | 

## Tune an XGBoost Release 0.72 Model
<a name="xgboost-72-tuning"></a>

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your training and validation datasets. You choose three types of hyperparameters:
+ a learning `objective` function to optimize during model training
+ an `eval_metric` to use to evaluate model performance during validation
+ a set of hyperparameters and a range of values for each to use when tuning the model automatically

You choose the evaluation metric from the set of evaluation metrics that the algorithm computes. Automatic model tuning searches the chosen hyperparameters to find the combination of values that results in the model that optimizes the evaluation metric. 

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

### Metrics Computed by the XGBoost Release 0.72 Algorithm
<a name="xgboost-72-metrics"></a>

The XGBoost algorithm based on version 0.72 computes the following nine metrics to use for model validation. When tuning the model, choose one of these metrics to evaluate the model. For a full list of valid `eval_metric` values, refer to [XGBoost Learning Task Parameters](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst#learning-task-parameters).


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| validation:auc |  Area under the curve.  |  Maximize  | 
| validation:error |  Binary classification error rate, calculated as #(wrong cases)/#(all cases).  |  Minimize  | 
| validation:logloss |  Negative log-likelihood.  |  Minimize  | 
| validation:mae |  Mean absolute error.  |  Minimize  | 
| validation:map |  Mean average precision.  |  Maximize  | 
| validation:merror |  Multiclass classification error rate, calculated as #(wrong cases)/#(all cases).  |  Minimize  | 
| validation:mlogloss |  Negative log-likelihood for multiclass classification.  |  Minimize  | 
| validation:ndcg |  Normalized Discounted Cumulative Gain.  |  Maximize  | 
| validation:rmse |  Root mean square error.  |  Minimize  | 

### Tunable XGBoost Release 0.72 Hyperparameters
<a name="xgboost-72-tunable-hyperparameters"></a>

Tune the XGBoost model with the following hyperparameters. The hyperparameters that have the greatest effect on optimizing the XGBoost evaluation metrics are: `alpha`, `min_child_weight`, `subsample`, `eta`, and `num_round`. 


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| alpha |  ContinuousParameterRanges  |  MinValue: 0, MaxValue: 1000  | 
| colsample\_bylevel |  ContinuousParameterRanges  |  MinValue: 0.1, MaxValue: 1  | 
| colsample\_bytree |  ContinuousParameterRanges  |  MinValue: 0.5, MaxValue: 1  | 
| eta |  ContinuousParameterRanges  |  MinValue: 0.1, MaxValue: 0.5  | 
| gamma |  ContinuousParameterRanges  |  MinValue: 0, MaxValue: 5  | 
| lambda |  ContinuousParameterRanges  |  MinValue: 0, MaxValue: 1000  | 
| max\_delta\_step |  IntegerParameterRanges  |  [0, 10]  | 
| max\_depth |  IntegerParameterRanges  |  [0, 10]  | 
| min\_child\_weight |  ContinuousParameterRanges  |  MinValue: 0, MaxValue: 120  | 
| num\_round |  IntegerParameterRanges  |  [1, 4000]  | 
| subsample |  ContinuousParameterRanges  |  MinValue: 0.5, MaxValue: 1  | 