

# Action on Amazon SageMaker Debugger rules
<a name="debugger-action-on-rules"></a>

Based on the Debugger rule evaluation status, you can set up automated actions such as stopping a training job and sending notifications using Amazon Simple Notification Service (Amazon SNS). You can also create your own actions using Amazon CloudWatch Events and AWS Lambda. To learn how to set up automated actions based on the Debugger rule evaluation status, see the following topics.

**Topics**
+ [Use Debugger built-in actions for rules](debugger-built-in-actions.md)
+ [Actions on rules using Amazon CloudWatch and AWS Lambda](debugger-cloudwatch-lambda.md)

# Use Debugger built-in actions for rules
<a name="debugger-built-in-actions"></a>

Use Debugger built-in actions to respond to issues found by [Debugger rule](debugger-built-in-rules.md#debugger-built-in-rules-Rule). The Debugger `rule_configs` class provides tools to configure a list of actions, including automatically stopping training jobs and sending notifications using Amazon Simple Notification Service (Amazon SNS) when the Debugger rules find training issues. The following topics takes you through the steps to accomplish these tasks.

**Topics**
+ [Set up Amazon SNS, create an `SMDebugRules` topic, and subscribe to the topic](#debugger-built-in-actions-sns)
+ [Set up your IAM role to attach required policies](#debugger-built-in-actions-iam)
+ [Configure Debugger rules with the built-in actions](#debugger-built-in-actions-on-rule)
+ [Considerations for using the Debugger built-in actions](#debugger-built-in-actions-considerations)

## Set up Amazon SNS, create an `SMDebugRules` topic, and subscribe to the topic
<a name="debugger-built-in-actions-sns"></a>

This section walks you through how to set up an Amazon SNS **SMDebugRules** topic, subscribe to it, and confirm the subscription to receive notifications from the Debugger rules.

**Note**  
For more information about billing for Amazon SNS, see [Amazon SNS pricing](https://aws.amazon.com/sns/pricing/) and [Amazon SNS FAQs](https://aws.amazon.com/sns/faqs/).

**To create a SMDebugRules topic**

1. Sign in to the AWS Management Console and open the Amazon SNS console at [https://console.aws.amazon.com/sns/v3/home](https://console.aws.amazon.com/sns/v3/home).

1. In the left navigation pane, choose **Topics**. 

1. On the **Topics** page, choose **Create topic**.

1. On the **Create topic** page, in the **Details** section, do the following:

   1. For **Type**, choose **Standard** for topic type.

   1. In **Name**, enter **SMDebugRules**.

1. Skip all other optional settings and choose **Create topic**. If you want to learn more about the optional settings, see [Creating an Amazon SNS topic](https://docs.aws.amazon.com/sns/latest/dg/sns-create-topic.html).

**To subscribe to the SMDebugRules topic**

1. Open the Amazon SNS console at [https://console.aws.amazon.com/sns/v3/home](https://console.aws.amazon.com/sns/v3/home).

1. In the left navigation pane, choose **Subscriptions**. 

1. On the **Subscriptions** page, choose **Create subscription**.

1. On the **Create subscription** page, in the **Details** section, do the following: 

   1. For **Topic ARN**, choose the **SMDebugRules** topic ARN. The ARN should be in format of `arn:aws:sns:<region-id>:111122223333:SMDebugRules`.

   1. For **Protocol**, choose **Email** or **SMS**. 

   1. For **Endpoint**, enter the endpoint value, such as an email address or a phone number that you want to receive notifications.
**Note**  
Make sure you type the correct email address and phone number. Phone numbers must include `+`, a country code, and phone number, with no special characters or spaces. For example, the phone number \$11 (222) 333-4444 is formatted as **\$112223334444**.

1. Skip all other optional settings and choose **Create subscription**. If you want to learn more about the optional settings, see [Subscribing to an Amazon SNS topic](https://docs.aws.amazon.com/sns/latest/dg/sns-create-subscribe-endpoint-to-topic.html).

After you subscribe to the **SMDebugRules** topic, you receive the following confirmation message in email or by phone:

![\[A subscription confirmation email message for the Amazon SNS SMDebugRules topic.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-built-in-action-subscription-confirmation.png)


For more information about Amazon SNS, see [Mobile text messaging (SMS)](https://docs.aws.amazon.com/sns/latest/dg/sns-mobile-phone-number-as-subscriber.html) and [Email notifications](https://docs.aws.amazon.com/sns/latest/dg/sns-email-notifications.html) in the *Amazon SNS Developer Guide*.

## Set up your IAM role to attach required policies
<a name="debugger-built-in-actions-iam"></a>

In this step, you add the required policies to your IAM role.

**To add the required policies to your IAM role**

1. Sign in to the AWS Management Console and open the IAM console at [https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/).

1. In the left navigation pane, choose **Policies**, and choose **Create policy**.

1. On the **Create policy** page, do the following to create a new sns-access policy:

   1. Choose the **JSON** tab.

   1. Paste the JSON strings formatted in bold in the following code into the `"Statement"`, replacing the 12-digit AWS account ID with your AWS account ID.

------
#### [ JSON ]

****  

      ```
      {
          "Version":"2012-10-17",		 	 	 
          "Statement": [
              {
                  "Sid": "VisualEditor0",
                  "Effect": "Allow",
                  "Action": [
                      "sns:Publish",
                      "sns:CreateTopic",
                      "sns:Subscribe"
                  ],
                  "Resource": "arn:aws:sns:*:111122223333:SMDebugRules"
              }
          ]
      }
      ```

------

   1. At the bottom of the page, choose **Review policy**.

   1. On the **Review policy** page, for **Name**, enter **sns-access**.

   1. At the bottom of the page, choose **Create policy**.

1. Go back to the IAM console, and choose **Roles** in the left navigation pane.

1. Look up the IAM role that you use for SageMaker AI model training and choose that IAM role.

1. On the **Permissions** tab of the **Summary** page, choose **Attach policies**.

1. Search for the **sns-access** policy, select the check box next to the policy, and then choose **Attach policy**.

For more examples of setting up IAM policies for Amazon SNS, see [Example cases for Amazon SNS access control](https://docs.aws.amazon.com/sns/latest/dg/sns-access-policy-use-cases.html).

## Configure Debugger rules with the built-in actions
<a name="debugger-built-in-actions-on-rule"></a>

After successfully finishing the required settings in the preceding steps, you can configure the Debugger built-in actions for debugging rules as shown in the following example script. You can choose which built-in actions to use while building the `actions` list object. The `rule_configs` is a helper module that provides high-level tools to configure Debugger built-in rules and actions. The following built-in actions are available for Debugger:
+ `rule_configs.StopTraining()` – Stops a training job when the Debugger rule finds an issue.
+ `rule_configs.Email("abc@abc.com")` – Sends a notification via email when the Debugger rule finds an issue. Use the email address that you used when you set up your SNS topic subscription.
+ `rule_configs.SMS("+1234567890")` – Sends a notification via text message when the Debugger rule finds an issue. Use the phone number that you used when you set up your SNS topic subscription.
**Note**  
Make sure you type the correct email address and phone number. Phone numbers must include `+`, a country code, and a phone number, with no special characters or spaces. For example, the phone number \$11 (222) 333-4444 is formatted as **\$112223334444**.

You can use all of the built-in actions or a subset of actions by wrapping up using the `rule_configs.ActionList()` method, which takes the built-in actions and configures a list of actions.

**To add all of the three built-in actions to a single rule**

If you want to assign all of the three built-in actions to a single rule, configure a Debugger built-in action list while constructing an estimator. Use the following template to construct the estimator, and Debugger will stop training jobs and send notifications through email and text for any rules that you use to monitor your training job progress.

```
from sagemaker.debugger import Rule, rule_configs

# Configure an action list object for Debugger rules
actions = rule_configs.ActionList(
    rule_configs.StopTraining(), 
    rule_configs.Email("abc@abc.com"), 
    rule_configs.SMS("+1234567890")
)

# Configure rules for debugging with the actions parameter
rules = [
    Rule.sagemaker(
        base_config=rule_configs.built_in_rule(),         # Required
        rule_parameters={"paramter_key": value },        # Optional
        actions=actions
    )
]

estimator = Estimator(
    ...
    rules = rules
)

estimator.fit(wait=False)
```

**To create multiple built-in action objects to assign different actions to a single rule**

If you want to assign the built-in actions to be triggered at different threshold values of a single rule, you can create multiple built-in action objects as shown in the following script. To avoid a conflict error by running the same rule, you must submit different rule job names (specify different strings for the rules' `name` attribute) as shown in the following example script template. This example shows how to set up [StalledTrainingRule](debugger-built-in-rules.md#stalled-training) to take two different actions: send an email to `abc@abc.com` when a training job stalls for 60 seconds, and stop the training job if stalling for 120 seconds.

```
from sagemaker.debugger import Rule, rule_configs
import time

base_job_name_prefix= 'smdebug-stalled-demo-' + str(int(time.time()))

# Configure an action object for StopTraining
action_stop_training = rule_configs.ActionList(
    rule_configs.StopTraining()
)

# Configure an action object for Email
action_email = rule_configs.ActionList(
    rule_configs.Email("abc@abc.com")
)

# Configure a rule with the Email built-in action to trigger if a training job stalls for 60 seconds
stalled_training_job_rule_email = Rule.sagemaker(
        base_config=rule_configs.stalled_training_rule(),
        rule_parameters={
                "threshold": "60", 
                "training_job_name_prefix": base_job_name_prefix
        },
        actions=action_email
)
stalled_training_job_rule_text.name="StalledTrainingJobRuleEmail"

# Configure a rule with the StopTraining built-in action to trigger if a training job stalls for 120 seconds
stalled_training_job_rule = Rule.sagemaker(
        base_config=rule_configs.stalled_training_rule(),
        rule_parameters={
                "threshold": "120", 
                "training_job_name_prefix": base_job_name_prefix
        },
        actions=action_stop_training
)
stalled_training_job_rule.name="StalledTrainingJobRuleStopTraining"

estimator = Estimator(
    ...
    rules = [stalled_training_job_rule_email, stalled_training_job_rule]
)

estimator.fit(wait=False)
```

While the training job is running, the Debugger built-in action sends notification emails and text messages whenever the rule finds issues with your training job. The following screenshot shows an example of email notification for a training job that has a stalled training job issue. 

![\[An example email notification sent by Debugger when it detects a StalledTraining issue.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-built-in-action-email.png)


The following screenshot shows an example text notification that Debugger sends when the rule finds a StalledTraining issue.

![\[An example text notification sent by Debugger when it detects a StalledTraining issue.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-built-in-action-text.png)


## Considerations for using the Debugger built-in actions
<a name="debugger-built-in-actions-considerations"></a>
+ To use the Debugger built-in actions, an internet connection is required. This feature is not supported in the network isolation mode provided by Amazon SageMaker AI or Amazon VPC.
+ The built-in actions cannot be used for [Profiler rules](debugger-built-in-profiler-rules.md#debugger-built-in-profiler-rules-ProfilerRule).
+ The built-in actions cannot be used on training jobs with spot training interruptions.
+ In email or text notifications, `None` appears at the end of messages. This does not have any meaning, so you can disregard the text `None`.

# Actions on rules using Amazon CloudWatch and AWS Lambda
<a name="debugger-cloudwatch-lambda"></a>

Amazon CloudWatch collects Amazon SageMaker AI model training job logs and Amazon SageMaker Debugger rule processing job logs. Configure Debugger with Amazon CloudWatch Events and AWS Lambda to take action based on Debugger rule evaluation status. 

## Example notebooks
<a name="debugger-test-stop-training"></a>

You can run the following example notebooks, which are prepared for experimenting with stopping a training job using actions on Debugger's built-in rules using Amazon CloudWatch and AWS Lambda.
+ [Amazon SageMaker Debugger - Reacting to CloudWatch Events from Rules](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/tensorflow_action_on_rule/tf-mnist-stop-training-job.html)

  This example notebook runs a training job that has a vanishing gradient issue. The Debugger [VanishingGradient](debugger-built-in-rules.md#vanishing-gradient) built-in rule is used while constructing the SageMaker AI TensorFlow estimator. When the Debugger rule detects the issue, the training job is terminated.
+ [Detect Stalled Training and Invoke Actions Using SageMaker Debugger Rule](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/tensorflow_action_on_rule/detect_stalled_training_job_and_actions.html)

  This example notebook runs a training script with a code line that forces it to sleep for 10 minutes. The Debugger [StalledTrainingRule](debugger-built-in-rules.md#stalled-training) built-in rule invokes issues and stops the training job.

**Topics**
+ [Example notebooks](#debugger-test-stop-training)
+ [Access CloudWatch logs for Debugger rules and training jobs](debugger-cloudwatch-metric.md)
+ [Set up Debugger for automated training job termination using CloudWatch and Lambda](debugger-stop-training.md)
+ [Disable the CloudWatch Events rule to stop using the automated training job termination](debugger-disable-cw.md)

# Access CloudWatch logs for Debugger rules and training jobs
<a name="debugger-cloudwatch-metric"></a>

You can use the training and Debugger rule job status in the CloudWatch logs to take further actions when there are training issues. The following procedure shows how to access the related CloudWatch logs. For more information about monitoring training jobs using CloudWatch, see [Monitor Amazon SageMaker AI](https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-overview.html).

**To access training job logs and Debugger rule job logs**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the left navigation pane under the **Log** node, choose **Log Groups**.

1. In the log groups list, do the following:
   + Choose **/aws/sagemaker/TrainingJobs** for training job logs.
   + Choose **/aws/sagemaker/ProcessingJobs** for Debugger rule job logs.

# Set up Debugger for automated training job termination using CloudWatch and Lambda
<a name="debugger-stop-training"></a>

The Debugger rules monitor training job status, and a CloudWatch Events rule watches the Debugger rule training job evaluation status. The following sections outline the process needed to automate training job termination using using CloudWatch and Lambda.

**Topics**
+ [Step 1: Create a Lambda function](#debugger-lambda-function-create)
+ [Step 2: Configure the Lambda function](#debugger-lambda-function-configure)
+ [Step 3: Create a CloudWatch events rule and link to the Lambda function for Debugger](#debugger-cloudwatch-events)

## Step 1: Create a Lambda function
<a name="debugger-lambda-function-create"></a>

**To create a Lambda function**

1. Open the AWS Lambda console at [https://console.aws.amazon.com/lambda/](https://console.aws.amazon.com/lambda/).

1. In the left navigation pane, choose **Functions** and then choose **Create function**.

1. On the **Create function** page, choose **Author from scratch** option.

1. In the **Basic information** section, enter a **Function name** (for example, **debugger-rule-stop-training-job**).

1. For **Runtime**, choose **Python 3.7**.

1. For **Permissions**, expand the drop down option, and choose **Change default execution role**.

1. For **Execution role**, choose **Use an existing role** and choose the IAM role that you use for training jobs on SageMaker AI.
**Note**  
Make sure you use the execution role with `AmazonSageMakerFullAccess` and `AWSLambdaBasicExecutionRole` attached. Otherwise, the Lambda function won't properly react to the Debugger rule status changes of the training job. If you are unsure which execution role is being used, run the following code in a Jupyter notebook cell to retrieve the execution role output:  

   ```
   import sagemaker
   sagemaker.get_execution_role()
   ```

1. At the bottom of the page, choose **Create function**.

The following figure shows an example of the **Create function** page with the input fields and selections completed.

![\[Create Function page.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-lambda-create.png)


## Step 2: Configure the Lambda function
<a name="debugger-lambda-function-configure"></a>

**To configure the Lambda function**

1. In the **Function code** section of the configuration page, paste the following Python script in the Lambda code editor pane. The `lambda_handler` function monitors the Debugger rule evaluation status collected by CloudWatch and triggers the `StopTrainingJob` API operation. The AWS SDK for Python (Boto3) `client` for SageMaker AI provides a high-level method, `stop_training_job`, which triggers the `StopTrainingJob` API operation.

   ```
   import json
   import boto3
   import logging
   
   logger = logging.getLogger()
   logger.setLevel(logging.INFO)
   
   def lambda_handler(event, context):
       training_job_name = event.get("detail").get("TrainingJobName")
       logging.info(f'Evaluating Debugger rules for training job: {training_job_name}')
       eval_statuses = event.get("detail").get("DebugRuleEvaluationStatuses", None)
   
       if eval_statuses is None or len(eval_statuses) == 0:
           logging.info("Couldn't find any debug rule statuses, skipping...")
           return {
               'statusCode': 200,
               'body': json.dumps('Nothing to do')
           }
   
       # should only attempt stopping jobs with InProgress status
       training_job_status = event.get("detail").get("TrainingJobStatus", None)
       if training_job_status != 'InProgress':
           logging.debug(f"Current Training job status({training_job_status}) is not 'InProgress'. Exiting")
           return {
               'statusCode': 200,
               'body': json.dumps('Nothing to do')
           }
   
       client = boto3.client('sagemaker')
   
       for status in eval_statuses:
           logging.info(status.get("RuleEvaluationStatus") + ', RuleEvaluationStatus=' + str(status))
           if status.get("RuleEvaluationStatus") == "IssuesFound":
               secondary_status = event.get("detail").get("SecondaryStatus", None)
               logging.info(
                   f'About to stop training job, since evaluation of rule configuration {status.get("RuleConfigurationName")} resulted in "IssuesFound". ' +
                   f'\ntraining job "{training_job_name}" status is "{training_job_status}", secondary status is "{secondary_status}"' +
                   f'\nAttempting to stop training job "{training_job_name}"'
               )
               try:
                   client.stop_training_job(
                       TrainingJobName=training_job_name
                   )
               except Exception as e:
                   logging.error(
                       "Encountered error while trying to "
                       "stop training job {}: {}".format(
                           training_job_name, str(e)
                       )
                   )
                   raise e
       return None
   ```

   For more information about the Lambda code editor interface, see [Creating functions using the AWS Lambda console editor](https://docs.aws.amazon.com/lambda/latest/dg/code-editor.html).

1. Skip all other settings and choose **Save** at the top of the configuration page.

## Step 3: Create a CloudWatch events rule and link to the Lambda function for Debugger
<a name="debugger-cloudwatch-events"></a>

**To create a CloudWatch Events rule and link to the Lambda function for Debugger**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the left navigation pane, choose **Rules** under the **Events** node.

1. Choose **Create rule**.

1. In the **Event Source** section of the **Step 1: Create rule** page, choose **SageMaker AI** for **Service Name**, and choose **SageMaker AI Training Job State Change** for **Event Type**. The Event Pattern Preview should look like the following example JSON strings: 

   ```
   {
       "source": [
           "aws.sagemaker"
       ],
       "detail-type": [
           "SageMaker Training Job State Change"
       ]
   }
   ```

1. In the **Targets** section, choose **Add target\$1**, and choose the **debugger-rule-stop-training-job** Lambda function that you created. This step links the CloudWatch Events rule with the Lambda function.

1. Choose **Configure details** and go to the **Step 2: Configure rule details** page.

1. Specify the CloudWatch rule definition name. For example, **debugger-cw-event-rule**.

1. Choose **Create rule** to finish.

1. Go back to the Lambda function configuration page and refresh the page. Confirm that it's configured correctly in the **Designer** panel. The CloudWatch Events rule should be registered as a trigger for the Lambda function. The configuration design should look like the following example:  
<a name="lambda-designer-example"></a>![\[Designer panel for the CloudWatch configuration.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-lambda-designer.png)

# Disable the CloudWatch Events rule to stop using the automated training job termination
<a name="debugger-disable-cw"></a>

If you want to disable the automated training job termination, you need to disable the CloudWatch Events rule. In the Lambda **Designer** panel, choose the **EventBridge (CloudWatch Events)** block linked to the Lambda function. This shows an **EventBridge** panel below the **Designer** panel (for example, see the previous screen shot). Select the check box next to **EventBridge (CloudWatch Events): debugger-cw-event-rule**, and then choose **Disable**. If you want to use the automated termination functionality later, you can enable the CloudWatch Events rule again.