# AutoGluon-Tabular

[AutoGluon-Tabular](https://auto.gluon.ai/stable/index.html) is a popular open-source AutoML framework that trains highly accurate machine learning models on an unprocessed tabular dataset. Unlike existing AutoML frameworks that primarily focus on model and hyperparameter selection, AutoGluon-Tabular succeeds by ensembling multiple models and stacking them in multiple layers. This page includes information about Amazon EC2 instance recommendations and sample notebooks for AutoGluon-Tabular.

# How to use SageMaker AI AutoGluon-Tabular

You can use AutoGluon-Tabular as an Amazon SageMaker AI built-in algorithm. The following section describes how to use AutoGluon-Tabular with the SageMaker Python SDK. For information on how to use AutoGluon-Tabular from the Amazon SageMaker Studio Classic UI, see [SageMaker JumpStart pretrained models](studio-jumpstart.md).
+ **Use AutoGluon-Tabular as a built-in algorithm**

  Use the AutoGluon-Tabular built-in algorithm to build an AutoGluon-Tabular training container as shown in the following code example. You can automatically retrieve the AutoGluon-Tabular built-in algorithm image URI using the SageMaker AI `image_uris.retrieve` API (or the `get_image_uri` API if using [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) version 1).

  After specifying the AutoGluon-Tabular image URI, you can use the AutoGluon-Tabular container to construct an estimator using the SageMaker AI Estimator API and initiate a training job. The AutoGluon-Tabular built-in algorithm runs in script mode, but the training script is provided for you and there is no need to replace it. If you have extensive experience using script mode to create a SageMaker training job, then you can incorporate your own AutoGluon-Tabular training scripts.

  ```
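  import boto3
  import sagemaker
  from sagemaker import image_uris, model_uris, script_uris
  
  # Session, AWS Region, and execution role used throughout this example
  sess = sagemaker.Session()
  aws_region = boto3.Session().region_name
  aws_role = sagemaker.get_execution_role()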
  
  train_model_id, train_model_version, train_scope = "autogluon-classification-ensemble", "*", "training"
  training_instance_type = "ml.p3.2xlarge"
  
  # Retrieve the docker image
  train_image_uri = image_uris.retrieve(
      region=None,
      framework=None,
      model_id=train_model_id,
      model_version=train_model_version,
      image_scope=train_scope,
      instance_type=training_instance_type
  )
  
  # Retrieve the training script
  train_source_uri = script_uris.retrieve(
      model_id=train_model_id, model_version=train_model_version, script_scope=train_scope
  )
  
  train_model_uri = model_uris.retrieve(
      model_id=train_model_id, model_version=train_model_version, model_scope=train_scope
  )
  
  # Sample training data is available in this bucket
  training_data_bucket = f"jumpstart-cache-prod-{aws_region}"
  training_data_prefix = "training-datasets/tabular_binary/"
  
  training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}/train"
  validation_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}/validation"
  
  output_bucket = sess.default_bucket()
  output_prefix = "jumpstart-example-tabular-training"
  
  s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"
  
  from sagemaker import hyperparameters
  
  # Retrieve the default hyperparameters for training the model
  hyperparameters = hyperparameters.retrieve_default(
      model_id=train_model_id, model_version=train_model_version
  )
  
  # [Optional] Override default hyperparameters with custom values
  hyperparameters[
      "auto_stack"
  ] = "True"
  print(hyperparameters)
  
  from sagemaker.estimator import Estimator
  from sagemaker.utils import name_from_base
  
  training_job_name = name_from_base(f"built-in-algo-{train_model_id}-training")
  
  # Create SageMaker Estimator instance
  tabular_estimator = Estimator(
      role=aws_role,
      image_uri=train_image_uri,
      source_dir=train_source_uri,
      model_uri=train_model_uri,
      entry_point="transfer_learning.py",
      instance_count=1,
      instance_type=training_instance_type,
      max_run=360000,
      hyperparameters=hyperparameters,
      output_path=s3_output_location
  )
  
  # Launch a SageMaker Training job by passing the S3 path of the training data
  tabular_estimator.fit(
      {
          "training": training_dataset_s3_path,
          "validation": validation_dataset_s3_path,
      }, logs=True, job_name=training_job_name
  )
  ```

  For more information about how to set up AutoGluon-Tabular as a built-in algorithm, see the following notebook examples. Any S3 bucket used in these examples must be in the same AWS Region as the notebook instance used to run them.
  + [Tabular classification with Amazon SageMaker AI AutoGluon-Tabular algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/autogluon_tabular/Amazon_Tabular_Classification_AutoGluon.ipynb)
  + [Tabular regression with Amazon SageMaker AI AutoGluon-Tabular algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/autogluon_tabular/Amazon_Tabular_Regression_AutoGluon.ipynb)

# Input and Output interface for the AutoGluon-Tabular algorithm


AutoGluon-Tabular operates on tabular data, with the rows representing observations, one column representing the target variable or label, and the remaining columns representing features.

The SageMaker AI implementation of AutoGluon-Tabular supports CSV for training and inference:
+ For **Training ContentType**, valid inputs must be *text/csv*.
+ For **Inference ContentType**, valid inputs must be *text/csv*.

**Note**  
For CSV training, the algorithm assumes that the target variable is in the first column and that the CSV does not have a header record.   
For CSV inference, the algorithm assumes that CSV input does not have the label column. 
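
The two CSV conventions in the note above can be illustrated with a minimal sketch that uses only the Python standard library. The record fields (`age`, `plan`, `churned`) and the helper names are hypothetical and are not part of the SageMaker AI API.

```python
import csv
import io


def to_training_csv(rows, feature_names, target_name):
    """Serialize records as header-less CSV with the target in the first column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow([row[target_name]] + [row[name] for name in feature_names])
    return buffer.getvalue()


def to_inference_csv(rows, feature_names):
    """Serialize the same records for inference: features only, no label column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow([row[name] for name in feature_names])
    return buffer.getvalue()


# Hypothetical example records
records = [
    {"age": 34, "plan": "basic", "churned": 0},
    {"age": 51, "plan": "premium", "churned": 1},
]
features = ["age", "plan"]

training_csv = to_training_csv(records, features, "churned")
inference_csv = to_inference_csv(records, features)
```

The feature columns keep the same order in both files, so the inference payload lines up with the columns seen during training.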

**Input format for training data, validation data, and categorical features**

Be mindful of how to format your training data for input to the AutoGluon-Tabular model. You must provide the path to an Amazon S3 bucket that contains your training and validation data. You can also include a list of categorical features. Use both the `training` and `validation` channels to provide your input data. Alternatively, you can use only the `training` channel.

**Use both the `training` and `validation` channels**

You can provide your input data by way of two S3 paths, one for the `training` channel and one for the `validation` channel. Each S3 path can either be an S3 prefix or a full S3 path pointing to one specific CSV file. The target variable should be in the first column of your CSV file. The predictor variables (features) should be in the remaining columns. The validation data is used to compute a validation score at the end of each boosting iteration. Early stopping is applied when the validation score stops improving.

If your predictors include categorical features, you can provide a JSON file named `categorical_index.json` in the same location as your training data file. If you provide a JSON file for categorical features, your `training` channel must point to an S3 prefix and not a specific CSV file. This file should contain a JSON object (equivalent to a Python dictionary) where the key is the string `"cat_index_list"` and the value is a list of unique integers. Each integer in the value list should indicate the column index of a categorical feature in your training data CSV file. Each value should be a positive integer (greater than zero because zero represents the target value), less than the `Int32.MaxValue` (2147483647), and less than the total number of columns. There should only be one categorical index JSON file.
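
As an illustration of the file format described above, the following sketch builds the contents of `categorical_index.json` from a list of column names. The column names and the `build_categorical_index` helper are hypothetical; only the `"cat_index_list"` key and the index rules come from the description above.

```python
import json


def build_categorical_index(column_names, categorical_columns):
    """Return the categorical_index.json payload for a training CSV.

    column_names lists the CSV columns in order, with the target at index 0.
    """
    indices = sorted({column_names.index(name) for name in categorical_columns})
    if 0 in indices:
        raise ValueError("Index 0 is the target column and cannot be categorical.")
    return {"cat_index_list": indices}


# Hypothetical column order of the training CSV (target first)
columns = ["churned", "age", "plan", "region"]
payload = build_categorical_index(columns, ["plan", "region"])

# Save this string as categorical_index.json next to the training data
categorical_index_json = json.dumps(payload)
```

Because the indices refer to positions in the training CSV, they need to be regenerated whenever the column order changes.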

**Use only the `training` channel**

You can alternatively provide your input data by way of a single S3 path for the `training` channel. This S3 path should point to a directory with a subdirectory named `training/` that contains a CSV file. You can optionally include another subdirectory in the same location called `validation/` that also has a CSV file. If the validation data is not provided, then 20% of your training data is randomly sampled to serve as the validation data. If your predictors include categorical features, you can provide a JSON file named `categorical_index.json` in the same location as your data subdirectories.
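
To make the 20% holdout concrete, the following sketch shows one way a random validation split could be drawn when you want to control it yourself and supply your own `validation/` data. It is an illustration only: the sampling procedure used inside the training container is not documented here, and the `seed` parameter and function name are assumptions.

```python
import random


def holdout_split(rows, validation_fraction=0.2, seed=0):
    """Randomly hold out a fraction of the rows as validation data."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_validation = int(round(len(rows) * validation_fraction))
    validation_idx = set(indices[:n_validation])
    train = [row for i, row in enumerate(rows) if i not in validation_idx]
    validation = [row for i, row in enumerate(rows) if i in validation_idx]
    return train, validation


# Toy rows: label in the first column, followed by two features
rows = [[i % 2, i, i * 10] for i in range(50)]
train_rows, validation_rows = holdout_split(rows)
```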

**Note**  
For CSV training input mode, the total memory available to the algorithm (instance count multiplied by the memory available in the `InstanceType`) must be able to hold the training dataset.

SageMaker AI AutoGluon-Tabular uses the `autogluon.tabular.TabularPredictor` module to serialize or deserialize the model, which can be used for saving or loading the model.

**To use a model trained with SageMaker AI AutoGluon-Tabular with the AutoGluon framework**
+ Use the following Python code:

  ```
  import tarfile
  from autogluon.tabular import TabularPredictor
  
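  # Extract the model artifact downloaded from the training job's S3 output location
  with tarfile.open('model.tar.gz', 'r:gz') as t:
      t.extractall()
  
  # model_file_path must point to the directory that holds the extracted predictor files
  model = TabularPredictor.load(model_file_path)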
  
  # prediction with test data
  # dtest should be a pandas DataFrame with column names feature_0, feature_1, ..., feature_d
  pred = model.predict(dtest)
  ```

## Amazon EC2 instance recommendation for the AutoGluon-Tabular algorithm


SageMaker AI AutoGluon-Tabular supports single-instance CPU and single-instance GPU training. Although GPU instances cost more per hour, they typically train more quickly, which can make them more cost effective overall. To take advantage of GPU training, specify a GPU instance type (for example, a P3 instance such as `ml.p3.2xlarge`). SageMaker AI AutoGluon-Tabular currently does not support multi-GPU training.

## AutoGluon-Tabular sample notebooks

The following table outlines a variety of sample notebooks that address different use cases of the Amazon SageMaker AI AutoGluon-Tabular algorithm.


| **Notebook Title** | **Description** | 
| --- | --- | 
|  [Tabular classification with Amazon SageMaker AI AutoGluon-Tabular algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/autogluon_tabular/Amazon_Tabular_Classification_AutoGluon.ipynb)  |  This notebook demonstrates the use of the Amazon SageMaker AI AutoGluon-Tabular algorithm to train and host a tabular classification model.  | 
|  [Tabular regression with Amazon SageMaker AI AutoGluon-Tabular algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/autogluon_tabular/Amazon_Tabular_Regression_AutoGluon.ipynb)  |  This notebook demonstrates the use of the Amazon SageMaker AI AutoGluon-Tabular algorithm to train and host a tabular regression model.  | 

For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, choose the **SageMaker AI Examples** tab to see a list of all of the SageMaker AI samples. To open a notebook, choose its **Use** tab and choose **Create copy**.

# How AutoGluon-Tabular works

AutoGluon-Tabular combines advanced data processing, deep learning, and multi-layer model ensembling. It automatically recognizes the data type in each column for robust data preprocessing, including special handling of text fields.

AutoGluon fits various models ranging from off-the-shelf boosted trees to customized neural networks. These models are ensembled in a novel way: models are stacked in multiple layers and trained in a layer-wise manner that guarantees raw data can be translated into high-quality predictions within a given time constraint. This process mitigates overfitting by splitting the data in various ways with careful tracking of out-of-fold examples.

The AutoGluon-Tabular algorithm performs well in machine learning competitions because of its robust handling of a variety of data types, relationships, and distributions. You can use AutoGluon-Tabular for regression, classification (binary and multiclass), and ranking problems.

Refer to the following diagram illustrating how the multi-layer stacking strategy works.

![\[AutoGluon's multi-layer stacking strategy shown with two stacking layers.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/autogluon_tabular_illustration.png)


For more information, see *[AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data](https://arxiv.org/pdf/2003.06505.pdf)*.
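
The role of out-of-fold predictions in stacking can be sketched in plain Python. This is a deliberately simplified illustration, not AutoGluon's implementation: it assumes two toy base models (a mean predictor and a one-variable linear fit), a single stacking layer, and a stacker that only picks a blend weight. All function names are hypothetical.

```python
import random


def kfold_indices(n, k, seed=0):
    """Split range(n) into k shuffled, disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_mean(xs, ys):
    """Base model 1: always predict the training mean."""
    mean = sum(ys) / len(ys)
    return lambda x: mean


def fit_line(xs, ys):
    """Base model 2: one-variable least-squares line."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def out_of_fold_predictions(fit, xs, ys, folds):
    """Predict every row with a model that never saw that row during fitting."""
    oof = [None] * len(xs)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in held_out:
            oof[i] = model(xs[i])
    return oof


def mse(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)


def blend(w, preds_a, preds_b):
    return [w * a + (1 - w) * b for a, b in zip(preds_a, preds_b)]


def fit_stacker(oof_a, oof_b, ys, steps=20):
    """Pick the blend weight that minimizes error on out-of-fold predictions."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda w: mse(blend(w, oof_a, oof_b), ys))


# Toy regression data: y = 2x + 1 plus noise
rng = random.Random(1)
xs = [i / 10 for i in range(60)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 0.3) for x in xs]

folds = kfold_indices(len(xs), k=5)
oof_mean = out_of_fold_predictions(fit_mean, xs, ys, folds)
oof_line = out_of_fold_predictions(fit_line, xs, ys, folds)
weight_on_mean = fit_stacker(oof_mean, oof_line, ys)
stacked_error = mse(blend(weight_on_mean, oof_mean, oof_line), ys)
```

Because the stacker is fitted only on predictions for rows that each base model did not train on, its weights reflect how the base models generalize rather than how well they memorize the training data.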

# AutoGluon-Tabular hyperparameters

The following table contains the subset of hyperparameters that are required or most commonly used for the Amazon SageMaker AI AutoGluon-Tabular algorithm. Users set these parameters to facilitate the estimation of model parameters from data. The SageMaker AI AutoGluon-Tabular algorithm is an implementation of the open-source [AutoGluon-Tabular](https://github.com/awslabs/autogluon) package.

**Note**  
The default hyperparameters are based on example datasets in the [AutoGluon-Tabular sample notebooks](autogluon-tabular.md#autogluon-tabular-sample-notebooks).

By default, the SageMaker AI AutoGluon-Tabular algorithm automatically chooses an evaluation metric based on the type of problem. The algorithm detects the type of problem based on the number of labels in your data. For regression problems, the evaluation metric is root mean squared error. For binary classification problems, the evaluation metric is area under the receiver operating characteristic curve (AUC). For multiclass classification problems, the evaluation metric is accuracy. You can use the `eval_metric` hyperparameter to change the default evaluation metric. Refer to the following table for more information on AutoGluon-Tabular hyperparameters, including descriptions, valid values, and default values.
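
The default metric selection described above amounts to a small lookup. The following sketch restates it in Python; the function name is hypothetical, and the problem-type and metric strings follow AutoGluon's usual naming but should be treated as assumptions and checked against the AutoGluon documentation.

```python
# Default metric per problem type, as described in the paragraph above
DEFAULT_EVAL_METRIC = {
    "regression": "root_mean_squared_error",
    "binary": "roc_auc",
    "multiclass": "accuracy",
}


def resolve_eval_metric(eval_metric, problem_type):
    """Return the metric to use, honoring an explicit override of "auto"."""
    if eval_metric != "auto":
        return eval_metric
    return DEFAULT_EVAL_METRIC[problem_type]


default_for_binary = resolve_eval_metric("auto", "binary")
overridden = resolve_eval_metric("f1", "binary")
```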


| Parameter Name | Description | 
| --- | --- | 
| eval\_metric |  The evaluation metric for validation data. If `eval_metric` is set to the default `"auto"` value, then the algorithm automatically chooses an evaluation metric based on the type of problem: root mean squared error for regression, area under the receiver operating characteristic curve (AUC) for binary classification, and accuracy for multiclass classification. Valid values: string, refer to the [AutoGluon documentation](https://auto.gluon.ai/stable/api/autogluon.tabular.TabularPredictor.html) for valid values. Default value: `"auto"`.  | 
| presets |  List of preset configurations for various arguments in `fit()`.  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/autogluon-tabular-hyperparameters.html) For more details, see [AutoGluon Predictors](https://auto.gluon.ai/stable/api/autogluon.tabular.TabularPredictor.html). Valid values: string, any of the following: (`"best_quality"`, `"high_quality"`, `"good_quality"`, `"medium_quality"`, `"optimize_for_deployment"`, or `"interpretable"`). Default value: `"medium_quality"`.  | 
| auto\_stack |  Whether AutoGluon should automatically utilize bagging and multi-layer stack ensembling to boost predictive accuracy. Set `auto_stack` to `"True"` if you are willing to tolerate longer training times in order to maximize predictive accuracy. This automatically sets the `num_bag_folds` and `num_stack_levels` arguments based on dataset properties.  Valid values: string, `"True"` or `"False"`. Default value: `"False"`.  | 
| num\_bag\_folds |  Number of folds used for bagging of models. When `num_bag_folds` is equal to `k`, training time is roughly increased by a factor of `k`. Set `num_bag_folds` to 0 to deactivate bagging. Bagging is disabled by default, but we recommend using values between 5 and 10 to maximize predictive performance. Increasing `num_bag_folds` results in models with lower bias, but that are more prone to overfitting. `1` is an invalid value for this parameter and raises a `ValueError`. Values greater than 10 may produce diminishing returns and can even harm overall results due to overfitting. To further improve predictions, avoid increasing `num_bag_folds` and instead increase `num_bag_sets`. Valid values: string, any integer between (and including) `"0"` and `"10"`. Default value: `"0"`.  | 
| num\_bag\_sets |  Number of repeats of k-fold bagging to perform (values must be greater than or equal to 1). The total number of models trained during bagging is equal to `num_bag_folds` \* `num_bag_sets`. This parameter defaults to one if `time_limit` is not specified. This parameter is disabled if `num_bag_folds` is not specified. Values greater than one result in superior predictive performance, especially on smaller problems and with stacking enabled.  Valid values: integer, range: [`1`, `20`]. Default value: `1`.  | 
| num\_stack\_levels |  Number of stacking levels to use in stack ensemble. Roughly increases model training time by a factor of `num_stack_levels` + 1. Set this parameter to 0 to deactivate stack ensembling. This parameter is deactivated by default, but we recommend using values between 1 and 3 to maximize predictive performance. To prevent overfitting and a `ValueError`, `num_bag_folds` must be greater than or equal to 2. Valid values: float, range: [`0`, `3`]. Default value: `0`.  | 
| refit\_full |  Whether or not to retrain all models on all of the data (training and validation) after the normal training procedure. For more details, see [AutoGluon Predictors](https://auto.gluon.ai/stable/api/autogluon.tabular.TabularPredictor.html). Valid values: string, `"True"` or `"False"`. Default value: `"False"`.  | 
| set\_best\_to\_refit\_full |  Whether or not to change the default model that the predictor uses for prediction. If `set_best_to_refit_full` is set to `"True"`, the default model changes to the model that exhibited the highest validation score as a result of refitting (activated by `refit_full`). Only valid if `refit_full` is set. Valid values: string, `"True"` or `"False"`. Default value: `"False"`.  | 
| save\_space |  Whether or not to reduce the memory and disk size of the predictor by deleting auxiliary model files that aren't needed for prediction on new data. This has no impact on inference accuracy. We recommend setting `save_space` to `"True"` if the only goal is to use the trained model for prediction. Certain advanced functionality may no longer be available if `save_space` is set to `"True"`. Refer to the [`predictor.save_space()`](https://auto.gluon.ai/stable/api/autogluon.tabular.TabularPredictor.save_space.html) documentation for more details. Valid values: string, `"True"` or `"False"`. Default value: `"False"`.  | 
| verbosity |  The verbosity of print messages. `verbosity` levels range from `0` to `4`, with higher levels corresponding to more detailed print statements. A `verbosity` of `0` suppresses warnings.  Valid values: integer, any of the following: (`0`, `1`, `2`, `3`, or `4`). Default value: `2`.  | 
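
The constraints that the table places on the bagging and stacking hyperparameters can be checked before launching a training job. The following sketch encodes only the rules stated in the table; the function name and the returned keys are hypothetical, and the check is an illustration rather than the validation performed by the training container.

```python
def check_stacking_settings(num_bag_folds, num_bag_sets=1, num_stack_levels=0):
    """Validate the bagging and stacking rules from the table above."""
    if num_bag_folds == 1 or not 0 <= num_bag_folds <= 10:
        raise ValueError("num_bag_folds must be 0 or an integer from 2 to 10.")
    if not 1 <= num_bag_sets <= 20:
        raise ValueError("num_bag_sets must be between 1 and 20.")
    if not 0 <= num_stack_levels <= 3:
        raise ValueError("num_stack_levels must be between 0 and 3.")
    if num_stack_levels > 0 and num_bag_folds < 2:
        raise ValueError("Stacking requires num_bag_folds >= 2.")
    return {
        # Total number of models trained during bagging
        "bagged_models": num_bag_folds * num_bag_sets,
        # Rough multiplier on training time caused by stacking
        "stack_time_factor": num_stack_levels + 1,
    }


summary = check_stacking_settings(num_bag_folds=5, num_bag_sets=2, num_stack_levels=1)

# Values are passed as strings, mirroring the "auto_stack" override in the training example
overrides = {
    "num_bag_folds": "5",
    "num_bag_sets": "2",
    "num_stack_levels": "1",
}
```

A dictionary such as `overrides` could then be merged into the default hyperparameters before constructing the estimator.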

# Tuning an AutoGluon-Tabular model

Although AutoGluon-Tabular can be used with model tuning, it is designed to deliver good performance through stacking and ensemble methods, so hyperparameter optimization is often unnecessary. Rather than focusing on model tuning, AutoGluon-Tabular succeeds by stacking models in multiple layers and training in a layer-wise manner.

For more information about AutoGluon-Tabular hyperparameters, see [AutoGluon-Tabular hyperparameters](autogluon-tabular-hyperparameters.md).