

# LightGBM
<a name="lightgbm"></a>

[LightGBM](https://lightgbm.readthedocs.io/en/latest/) is a popular and efficient open-source implementation of the Gradient Boosting Decision Tree (GBDT) algorithm. GBDT is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler and weaker models. LightGBM uses additional techniques to significantly improve the efficiency and scalability of conventional GBDT. This page includes information about Amazon EC2 instance recommendations and sample notebooks for LightGBM.

# How to use SageMaker AI LightGBM
<a name="lightgbm-modes"></a>

You can use LightGBM as an Amazon SageMaker AI built-in algorithm. The following section describes how to use LightGBM with the SageMaker Python SDK. For information on how to use LightGBM from the Amazon SageMaker Studio Classic UI, see [SageMaker JumpStart pretrained models](studio-jumpstart.md).
+ **Use LightGBM as a built-in algorithm**

  Use the LightGBM built-in algorithm to build a LightGBM training container as shown in the following code example. You can retrieve the LightGBM built-in algorithm image URI using the SageMaker AI `image_uris.retrieve` API (or the `get_image_uri` API if you are using [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) version 1). 

  After specifying the LightGBM image URI, you can use the LightGBM container to construct an estimator using the SageMaker AI Estimator API and initiate a training job. The LightGBM built-in algorithm runs in script mode, but the training script is provided for you and there is no need to replace it. If you have extensive experience using script mode to create a SageMaker training job, then you can incorporate your own LightGBM training scripts.

  ```
  from sagemaker import image_uris, model_uris, script_uris

  # aws_region, aws_role, and sess (a sagemaker.session.Session) are assumed
  # to be defined earlier in your notebook, for example:
  #   sess = sagemaker.session.Session()
  #   aws_region = sess.boto_region_name
  #   aws_role = sagemaker.get_execution_role()
  
  train_model_id, train_model_version, train_scope = "lightgbm-classification-model", "*", "training"
  training_instance_type = "ml.m5.xlarge"
  
  # Retrieve the docker image
  train_image_uri = image_uris.retrieve(
      region=None,
      framework=None,
      model_id=train_model_id,
      model_version=train_model_version,
      image_scope=train_scope,
      instance_type=training_instance_type
  )
  
  # Retrieve the training script
  train_source_uri = script_uris.retrieve(
      model_id=train_model_id, model_version=train_model_version, script_scope=train_scope
  )
  
  train_model_uri = model_uris.retrieve(
      model_id=train_model_id, model_version=train_model_version, model_scope=train_scope
  )
  
  # Sample training data is available in this bucket
  training_data_bucket = f"jumpstart-cache-prod-{aws_region}"
  training_data_prefix = "training-datasets/tabular_multiclass/"
  
  # training_data_prefix already ends with a slash, so don't add another one
  training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}train"
  validation_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}validation"
  
  output_bucket = sess.default_bucket()
  output_prefix = "jumpstart-example-tabular-training"
  
  s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"
  
  from sagemaker import hyperparameters
  
  # Retrieve the default hyperparameters for training the model
  hyperparameters = hyperparameters.retrieve_default(
      model_id=train_model_id, model_version=train_model_version
  )
  
  # [Optional] Override default hyperparameters with custom values
  hyperparameters["num_boost_round"] = "500"
  print(hyperparameters)
  
  from sagemaker.estimator import Estimator
  from sagemaker.utils import name_from_base
  
  training_job_name = name_from_base(f"built-in-algo-{train_model_id}-training")
  
  # Create SageMaker Estimator instance
  tabular_estimator = Estimator(
      role=aws_role,
      image_uri=train_image_uri,
      source_dir=train_source_uri,
      model_uri=train_model_uri,
      entry_point="transfer_learning.py",
      instance_count=1, # for distributed training, specify an instance_count greater than 1
      instance_type=training_instance_type,
      max_run=360000,
      hyperparameters=hyperparameters,
      output_path=s3_output_location
  )
  
  # Launch a SageMaker Training job by passing the S3 path of the training data
  tabular_estimator.fit(
      {
          "train": training_dataset_s3_path,
          "validation": validation_dataset_s3_path,
      }, logs=True, job_name=training_job_name
  )
  ```

  For more information about how to set up LightGBM as a built-in algorithm, see the following notebook examples.
  + [Tabular classification with Amazon SageMaker AI LightGBM and CatBoost algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/lightgbm_catboost_tabular/Amazon_Tabular_Classification_LightGBM_CatBoost.ipynb)
  + [Tabular regression with Amazon SageMaker AI LightGBM and CatBoost algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/lightgbm_catboost_tabular/Amazon_Tabular_Regression_LightGBM_CatBoost.ipynb)

# Input and Output interface for the LightGBM algorithm
<a name="InputOutput-LightGBM"></a>

Gradient boosting operates on tabular data, with the rows representing observations, one column representing the target variable or label, and the remaining columns representing features. 

The SageMaker AI implementation of LightGBM supports CSV for training and inference:
+ For **Training ContentType**, valid inputs must be *text/csv*.
+ For **Inference ContentType**, valid inputs must be *text/csv*.

**Note**  
For CSV training, the algorithm assumes that the target variable is in the first column and that the CSV does not have a header record.   
For CSV inference, the algorithm assumes that CSV input does not have the label column. 
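As a sketch of this layout, the following pandas snippet (column names and values are hypothetical) writes a training CSV with the target in the first column and no header row:

```
import pandas as pd

# Hypothetical labeled dataset; names and values are for illustration only.
df = pd.DataFrame({
    "feature_0": [0.5, 1.2, 3.4],
    "feature_1": [7, 2, 9],
    "label": [0, 1, 0],
})

# Move the target to the first column and write without a header or index,
# as the algorithm expects for CSV training input.
ordered = df[["label", "feature_0", "feature_1"]]
ordered.to_csv("train.csv", header=False, index=False)
```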

**Input format for training data, validation data, and categorical features**

Be mindful of how to format your training data for input to the LightGBM model. You must provide the path to an Amazon S3 bucket that contains your training and validation data. You can also include a list of categorical features. Use both the `train` and `validation` channels to provide your input data. Alternatively, you can use only the `train` channel.

**Note**  
Both `train` and `training` are valid channel names for LightGBM training.

**Use both the `train` and `validation` channels**

You can provide your input data by way of two S3 paths, one for the `train` channel and one for the `validation` channel. Each S3 path can either be an S3 prefix that points to one or more CSV files or a full S3 path pointing to one specific CSV file. The target variables should be in the first column of your CSV file. The predictor variables (features) should be in the remaining columns. If multiple CSV files are provided for the `train` or `validation` channels, the LightGBM algorithm concatenates the files. The validation data is used to compute a validation score at the end of each boosting iteration. Early stopping is applied when the validation score stops improving.

If your predictors include categorical features, you can provide a JSON file named `categorical_index.json` in the same location as your training data file or files. If you provide a JSON file for categorical features, your `train` channel must point to an S3 prefix and not a specific CSV file. This file should contain a Python dictionary where the key is the string `"cat_index_list"` and the value is a list of unique integers. Each integer in the list should indicate the column index of the corresponding categorical feature in your training data CSV file. Each value should be a positive integer (greater than zero, because zero represents the target variable), less than `Int32.MaxValue` (2147483647), and less than the total number of columns. There should only be one categorical index JSON file.
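As a minimal sketch, a `categorical_index.json` file marking columns 2 and 5 as categorical (the indices are hypothetical examples) could be written like this:

```
import json

# Column 0 holds the target, so categorical feature indices start at 1.
# Indices 2 and 5 here are hypothetical examples.
categorical_index = {"cat_index_list": [2, 5]}

with open("categorical_index.json", "w") as f:
    json.dump(categorical_index, f)
```

Upload the resulting file to the same S3 prefix as your training CSV files.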

**Use only the `train` channel**

You can alternatively provide your input data by way of a single S3 path for the `train` channel. This S3 path should point to a directory with a subdirectory named `train/` that contains one or more CSV files. You can optionally include another subdirectory in the same location called `validation/` that also has one or more CSV files. If the validation data is not provided, then 20% of your training data is randomly sampled to serve as the validation data. If your predictors include categorical features, you can provide a JSON file named `categorical_index.json` in the same location as your data subdirectories.
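The expected layout can be sketched locally as follows (directory names besides `train/` and `validation/` are assumptions); you could then upload it with, for example, `aws s3 sync data/ s3://your-bucket/your-prefix/` (bucket and prefix are placeholders):

```
from pathlib import Path

# Build the train/ and optional validation/ subdirectories locally.
root = Path("data")
(root / "train").mkdir(parents=True, exist_ok=True)
(root / "validation").mkdir(parents=True, exist_ok=True)  # optional

# Target in the first column, no header row.
(root / "train" / "data.csv").write_text("0,0.5,7\n1,1.2,2\n")
(root / "validation" / "data.csv").write_text("0,3.4,9\n")
```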

**Note**  
For CSV training input mode, the total memory available to the algorithm (instance count multiplied by the memory available in the `InstanceType`) must be able to hold the training dataset.

SageMaker AI LightGBM uses the Python joblib module to serialize and deserialize models, so you can use joblib to save or load a trained model.

**To use a model trained with SageMaker AI LightGBM with the joblib module**
+ Use the following Python code:

  ```
  import tarfile

  import joblib

  # Extract the trained model from the SageMaker AI model artifact
  with tarfile.open('model.tar.gz', 'r:gz') as t:
      t.extractall()

  # model_file_path is the path of the extracted model file, for example 'model.pkl'
  model = joblib.load(model_file_path)

  # prediction with test data
  # dtest should be a pandas DataFrame with column names feature_0, feature_1, ..., feature_d
  pred = model.predict(dtest)
  ```
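  For example, `dtest` could be constructed like this (the values are hypothetical; the column names follow the `feature_0`, `feature_1`, ..., `feature_d` convention noted in the comment above):

  ```
  import pandas as pd

  # Two hypothetical observations with three features each.
  dtest = pd.DataFrame(
      [[0.5, 7, 1.0], [1.2, 2, 0.0]],
      columns=["feature_0", "feature_1", "feature_2"],
  )
  ```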

## Amazon EC2 instance recommendation for the LightGBM algorithm
<a name="Instance-LightGBM"></a>

SageMaker AI LightGBM currently supports single-instance and multi-instance CPU training. For multi-instance CPU training (distributed training), specify an `instance_count` greater than 1 when you define your Estimator. For more information on distributed training with LightGBM, see [Amazon SageMaker AI LightGBM Distributed training using Dask](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_applying_machine_learning/sagemaker_lightgbm_distributed_training_dask/sagemaker-lightgbm-distributed-training-dask.html).

LightGBM is a memory-bound (as opposed to compute-bound) algorithm. So, a general-purpose compute instance (for example, M5) is a better choice than a compute-optimized instance (for example, C5). Further, we recommend that you have enough total memory in selected instances to hold the training data. 
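A rough back-of-the-envelope check of whether the training data fits in memory (assuming dense float64 storage at 8 bytes per value; actual memory use depends on the data types and format):

```
def estimate_dataset_gib(num_rows: int, num_cols: int, bytes_per_value: int = 8) -> float:
    """Rough in-memory size of a dense numeric dataset, in GiB."""
    return num_rows * num_cols * bytes_per_value / 1024**3

# For example, 10 million rows x 50 columns is roughly 3.7 GiB, so a single
# ml.m5.xlarge instance (16 GiB of memory) would have ample headroom.
size_gib = estimate_dataset_gib(10_000_000, 50)
```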

## LightGBM sample notebooks
<a name="lightgbm-sample-notebooks"></a>

The following table outlines a variety of sample notebooks that address different use cases of the Amazon SageMaker AI LightGBM algorithm.



| **Notebook Title** | **Description** | 
| --- | --- | 
|  [Tabular classification with Amazon SageMaker AI LightGBM and CatBoost algorithm](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/lightgbm_catboost_tabular/Amazon_Tabular_Classification_LightGBM_CatBoost.html)  |  This notebook demonstrates the use of the Amazon SageMaker AI LightGBM algorithm to train and host a tabular classification model.   | 
|  [Tabular regression with Amazon SageMaker AI LightGBM and CatBoost algorithm](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/lightgbm_catboost_tabular/Amazon_Tabular_Regression_LightGBM_CatBoost.html)  |  This notebook demonstrates the use of the Amazon SageMaker AI LightGBM algorithm to train and host a tabular regression model.   | 
|  [Amazon SageMaker AI LightGBM Distributed training using Dask](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_applying_machine_learning/sagemaker_lightgbm_distributed_training_dask/sagemaker-lightgbm-distributed-training-dask.html)  |  This notebook demonstrates distributed training with the Amazon SageMaker AI LightGBM algorithm using the Dask framework.  | 

For instructions on how to create and access Jupyter notebook instances that you can use to run the examples in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, choose the **SageMaker AI Examples** tab to see a list of all of the SageMaker AI samples. To open a notebook, choose its **Use** tab and choose **Create copy**.

# How LightGBM works
<a name="lightgbm-HowItWorks"></a>

LightGBM implements a conventional Gradient Boosting Decision Tree (GBDT) algorithm with the addition of two novel techniques: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). These techniques are designed to significantly improve the efficiency and scalability of GBDT.

The LightGBM algorithm performs well in machine learning competitions because of its robust handling of a variety of data types, relationships, distributions, and the diversity of hyperparameters that you can fine-tune. You can use LightGBM for regression, classification (binary and multiclass), and ranking problems.

For more information on gradient boosting, see [How the SageMaker AI XGBoost algorithm works](xgboost-HowItWorks.md). For in-depth details about the additional GOSS and EFB techniques used in the LightGBM method, see *[LightGBM: A Highly Efficient Gradient Boosting Decision Tree](https://proceedings.neurips.cc/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf)*.

# LightGBM hyperparameters
<a name="lightgbm-hyperparameters"></a>

The following table contains the subset of hyperparameters that are required or most commonly used for the Amazon SageMaker AI LightGBM algorithm. Users set these parameters to facilitate the estimation of model parameters from data. The SageMaker AI LightGBM algorithm is an implementation of the open-source [LightGBM](https://github.com/microsoft/LightGBM) package. 

**Note**  
The default hyperparameters are based on example datasets in the [LightGBM sample notebooks](lightgbm.md#lightgbm-sample-notebooks).

By default, the SageMaker AI LightGBM algorithm automatically chooses an evaluation metric and objective function based on the type of classification problem. The LightGBM algorithm detects the type of classification problem based on the number of labels in your data. For regression problems, the evaluation metric is root mean squared error and the objective function is L2 loss. For binary classification problems, the evaluation metric and objective function are both binary cross entropy. For multiclass classification problems, the evaluation metric is multiclass cross entropy and the objective function is softmax. You can use the `metric` hyperparameter to change the default evaluation metric. Refer to the following table for more information on LightGBM hyperparameters, including descriptions, valid values, and default values.


| Parameter Name | Description | 
| --- | --- | 
| num\_boost\_round |  The maximum number of boosting iterations. **Note:** Internally, LightGBM constructs `num_class * num_boost_round` trees for multiclass classification problems. Valid values: integer, range: Positive integer. Default value: `100`.  | 
| early\_stopping\_rounds |  Training stops if one metric on one validation data point does not improve over the last `early_stopping_rounds` rounds. If `early_stopping_rounds` is less than or equal to zero, this hyperparameter is ignored. Valid values: integer. Default value: `10`.  | 
| metric |  The evaluation metric for validation data. If `metric` is set to the default `"auto"` value, then the algorithm automatically chooses an evaluation metric based on the type of problem: `"rmse"` for regression, `"binary_logloss"` for binary classification, and `"multi_logloss"` for multiclass classification. Valid values: string, any of the following: (`"auto"`, `"rmse"`, `"l1"`, `"l2"`, `"huber"`, `"fair"`, `"binary_logloss"`, `"binary_error"`, `"auc"`, `"average_precision"`, `"multi_logloss"`, `"multi_error"`, `"auc_mu"`, or `"cross_entropy"`). Default value: `"auto"`.  | 
| learning\_rate |  The rate at which the model weights are updated after working through each batch of training examples. Valid values: float, range: (`0.0`, `1.0`). Default value: `0.1`.  | 
| num\_leaves |  The maximum number of leaves in one tree. Valid values: integer, range: (`1`, `131072`). Default value: `64`.  | 
| feature\_fraction |  The fraction of features to be randomly selected on each iteration (tree). Must be less than 1.0. Valid values: float, range: (`0.0`, `1.0`). Default value: `0.9`.  | 
| bagging\_fraction |  The fraction of the training data to be randomly selected on each iteration, similar to `feature_fraction` but applied to the data rows, without resampling. Valid values: float, range: (`0.0`, `1.0`]. Default value: `0.9`.  | 
| bagging\_freq |  The frequency to perform bagging. At every `bagging_freq` iteration, LightGBM randomly selects a percentage of the data to use for the next `bagging_freq` iterations. This percentage is determined by the `bagging_fraction` hyperparameter. If `bagging_freq` is zero, then bagging is deactivated. Valid values: integer, range: Non-negative integer. Default value: `1`.  | 
| max\_depth |  The maximum depth for a tree model. This is used to deal with overfitting when the amount of data is small. If `max_depth` is less than or equal to zero, there is no limit on the maximum depth. Valid values: integer. Default value: `6`.  | 
| min\_data\_in\_leaf |  The minimum amount of data in one leaf. Can be used to deal with overfitting. Valid values: integer, range: Non-negative integer. Default value: `3`.  | 
| max\_delta\_step |  Used to limit the maximum output of tree leaves. If `max_delta_step` is less than or equal to 0, then there is no constraint. The final maximum output of leaves is `learning_rate * max_delta_step`. Valid values: float. Default value: `0.0`.  | 
| lambda\_l1 |  L1 regularization. Valid values: float, range: Non-negative float. Default value: `0.0`.  | 
| lambda\_l2 |  L2 regularization. Valid values: float, range: Non-negative float. Default value: `0.0`.  | 
| boosting |  Boosting type. Valid values: string, any of the following: (`"gbdt"`, `"rf"`, `"dart"`, or `"goss"`). Default value: `"gbdt"`.  | 
| min\_gain\_to\_split |  The minimum gain to perform a split. Can be used to speed up training. Valid values: float, range: Non-negative float. Default value: `0.0`.  | 
| scale\_pos\_weight |  The weight of the labels with positive class. Used only for binary classification tasks. `scale_pos_weight` cannot be used if `is_unbalance` is set to `"True"`.  Valid values: float, range: Positive float. Default value: `1.0`.  | 
| tree\_learner |  Tree learner type. Valid values: string, any of the following: (`"serial"`, `"feature"`, `"data"`, or `"voting"`). Default value: `"serial"`.  | 
| feature\_fraction\_bynode |  Selects a fraction of random features on each tree node. For example, if `feature_fraction_bynode` is `0.8`, then 80% of features are selected. Can be used to deal with overfitting. Valid values: float, range: (`0.0`, `1.0`]. Default value: `1.0`.  | 
| is\_unbalance |  Set to `"True"` if training data is unbalanced. Used only for binary classification tasks. `is_unbalance` cannot be used with `scale_pos_weight`. Valid values: string, either: (`"True"` or `"False"`). Default value: `"False"`.  | 
| max\_bin |  The maximum number of bins used to bucket feature values. A small number of bins may reduce training accuracy, but may improve generalization. Can be used to deal with overfitting. Valid values: integer, range: (1, ∞). Default value: `255`.  | 
| tweedie\_variance\_power |  Controls the variance of the Tweedie distribution. Set this closer to `2.0` to shift toward a gamma distribution. Set this closer to `1.0` to shift toward a Poisson distribution. Used only for regression tasks. Valid values: float, range: [`1.0`, `2.0`). Default value: `1.5`.  | 
| num\_threads |  The number of parallel threads used to run LightGBM. A value of 0 means the default number of threads in OpenMP. Valid values: integer, range: Non-negative integer. Default value: `0`.  | 
| verbosity |  The verbosity of print messages. If `verbosity` is less than `0`, then print messages only show fatal errors. If `verbosity` is set to `0`, then print messages include errors and warnings. If `verbosity` is `1`, then print messages show more information. A `verbosity` greater than `1` shows the most information in print messages and can be used for debugging. Valid values: integer. Default value: `1`.  | 

# Tune a LightGBM model
<a name="lightgbm-tuning"></a>

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your training and validation datasets. Model tuning focuses on the following hyperparameters: 
+ A learning objective function to optimize during model training
+ An evaluation metric that is used to evaluate model performance during validation
+ A set of hyperparameters and a range of values for each to use when tuning the model automatically

**Note**  
The learning objective function is automatically assigned based on the type of classification task, which is determined by the number of unique integers in the label column. For more information, see [LightGBM hyperparameters](lightgbm-hyperparameters.md).

Automatic model tuning searches your specified hyperparameters to find the combination of values that results in a model that optimizes the chosen evaluation metric.

**Note**  
Automatic model tuning for LightGBM is only available from the Amazon SageMaker SDKs, not from the SageMaker AI console.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Evaluation metrics computed by the LightGBM algorithm
<a name="lightgbm-metrics"></a>

The SageMaker AI LightGBM algorithm computes the following metrics to use for model validation. The evaluation metric is automatically assigned based on the type of classification task, which is determined by the number of unique integers in the label column.


| Metric Name | Description | Optimization Direction | Regex Pattern | 
| --- | --- | --- | --- | 
| rmse | root mean square error | minimize | "rmse: ([0-9\\.]+)" | 
| l1 | mean absolute error | minimize | "l1: ([0-9\\.]+)" | 
| l2 | mean squared error | minimize | "l2: ([0-9\\.]+)" | 
| huber | huber loss | minimize | "huber: ([0-9\\.]+)" | 
| fair | fair loss | minimize | "fair: ([0-9\\.]+)" | 
| binary\_logloss | binary cross entropy | minimize | "binary\_logloss: ([0-9\\.]+)" | 
| binary\_error | binary error | minimize | "binary\_error: ([0-9\\.]+)" | 
| auc | AUC | maximize | "auc: ([0-9\\.]+)" | 
| average\_precision | average precision score | maximize | "average\_precision: ([0-9\\.]+)" | 
| multi\_logloss | multiclass cross entropy | minimize | "multi\_logloss: ([0-9\\.]+)" | 
| multi\_error | multiclass error score | minimize | "multi\_error: ([0-9\\.]+)" | 
| auc\_mu | AUC-mu | maximize | "auc\_mu: ([0-9\\.]+)" | 
| cross\_entropy | cross entropy | minimize | "cross\_entropy: ([0-9\\.]+)" | 

## Tunable LightGBM hyperparameters
<a name="lightgbm-tunable-hyperparameters"></a>

Tune the LightGBM model with the following hyperparameters. The hyperparameters that have the greatest effect on optimizing the LightGBM evaluation metrics are: `learning_rate`, `num_leaves`, `feature_fraction`, `bagging_fraction`, `bagging_freq`, `max_depth`, and `min_data_in_leaf`. For a list of all the LightGBM hyperparameters, see [LightGBM hyperparameters](lightgbm-hyperparameters.md).


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| learning\_rate | ContinuousParameterRanges | MinValue: 0.001, MaxValue: 0.01 | 
| num\_leaves | IntegerParameterRanges | MinValue: 10, MaxValue: 100 | 
| feature\_fraction | ContinuousParameterRanges | MinValue: 0.1, MaxValue: 1.0 | 
| bagging\_fraction | ContinuousParameterRanges | MinValue: 0.1, MaxValue: 1.0 | 
| bagging\_freq | IntegerParameterRanges | MinValue: 0, MaxValue: 10 | 
| max\_depth | IntegerParameterRanges | MinValue: 15, MaxValue: 100 | 
| min\_data\_in\_leaf | IntegerParameterRanges | MinValue: 10, MaxValue: 200 | 
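
The recommended ranges above can be captured as plain values, as in the following sketch. With the SageMaker Python SDK, these would typically map to `sagemaker.tuner.ContinuousParameter` and `sagemaker.tuner.IntegerParameter` objects passed to a `HyperparameterTuner` along with an estimator; that mapping, described in the comments, is an assumption about your setup rather than part of this table.

```
# Recommended tuning ranges from the preceding table, as (type, min, max) triples.
# With the SageMaker Python SDK, these would typically become, for example,
# ContinuousParameter(0.001, 0.01) for learning_rate and
# IntegerParameter(10, 100) for num_leaves, passed to
# sagemaker.tuner.HyperparameterTuner together with an estimator.
hyperparameter_ranges = {
    "learning_rate": ("continuous", 0.001, 0.01),
    "num_leaves": ("integer", 10, 100),
    "feature_fraction": ("continuous", 0.1, 1.0),
    "bagging_fraction": ("continuous", 0.1, 1.0),
    "bagging_freq": ("integer", 0, 10),
    "max_depth": ("integer", 15, 100),
    "min_data_in_leaf": ("integer", 10, 200),
}
```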