

# Profile and optimize computational performance
<a name="train-profile-computational-performance"></a>

As state-of-the-art deep learning models rapidly grow in size, it becomes a challenge to scale their training jobs to a large GPU cluster and to identify computational performance issues among the billions or trillions of operations and communications in every iteration of the gradient descent process.

SageMaker AI provides profiling tools to visualize and diagnose such complex computation issues arising from training jobs running on AWS cloud computing resources. SageMaker AI offers two profiling options: Amazon SageMaker Profiler and a resource utilization monitor in Amazon SageMaker Studio Classic. See the following introductions to the two functionalities to gain quick insights and learn which one to use depending on your needs.

**Amazon SageMaker Profiler**

Amazon SageMaker Profiler is a profiling capability of SageMaker AI with which you can deep dive into compute resources provisioned while training deep learning models, and gain visibility into operation-level details. SageMaker Profiler provides Python modules for adding annotations throughout PyTorch or TensorFlow training scripts and activating SageMaker Profiler. You can access the modules through the SageMaker Python SDK and AWS Deep Learning Containers. 

With SageMaker Profiler, you can track all activities on CPUs and GPUs, such as CPU and GPU utilizations, kernel runs on GPUs, kernel launches on CPUs, sync operations, memory operations across CPUs and GPUs, latencies between kernel launches and corresponding runs, and data transfer between CPUs and GPUs. 

SageMaker Profiler also offers a user interface (UI) that visualizes the *profile*, a statistical summary of profiled events, and the timeline of a training job for tracking and understanding the time relationship of the events between GPUs and CPUs.

To learn more about SageMaker Profiler, see [Amazon SageMaker Profiler](train-use-sagemaker-profiler.md).

**Monitoring AWS compute resources in Amazon SageMaker Studio Classic**

SageMaker AI also provides a user interface in Studio Classic for monitoring resource utilization at a high level, but with more granularity than the default utilization metrics that SageMaker AI publishes to CloudWatch.

For any training job you run in SageMaker AI using the SageMaker Python SDK, SageMaker AI starts profiling basic resource utilization metrics, such as CPU utilization, GPU utilization, GPU memory utilization, network, and I/O wait time. It collects these resource utilization metrics every 500 milliseconds. 

Compared to Amazon CloudWatch, which collects metrics at 1-second intervals, the monitoring functionality of SageMaker AI provides finer granularity into the resource utilization metrics, down to 100-millisecond (0.1 second) intervals, so you can dive deep into the metrics at the level of an operation or a step.

To access the dashboard for monitoring the resource utilization metrics of a training job, see the [SageMaker AI Debugger UI in SageMaker Studio Experiments](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-on-studio.html).



**Topics**
+ [Amazon SageMaker Profiler](train-use-sagemaker-profiler.md)
+ [Monitor AWS compute resource utilization in Amazon SageMaker Studio Classic](debugger-profile-training-jobs.md)
+ [Release notes for profiling capabilities of Amazon SageMaker AI](profiler-release-notes.md)

# Amazon SageMaker Profiler
<a name="train-use-sagemaker-profiler"></a>


|  | 
| --- |
|  Amazon SageMaker Profiler is currently in preview release and available at no cost in supported AWS Regions. The generally available version of Amazon SageMaker Profiler (if any) may include features and pricing that are different than those offered in preview.  | 

Amazon SageMaker Profiler is a capability of Amazon SageMaker AI that provides a detailed view into the AWS compute resources provisioned while training deep learning models on SageMaker AI. It focuses on profiling CPU and GPU usage, kernel runs on GPUs, kernel launches on CPUs, sync operations, memory operations across CPUs and GPUs, latencies between kernel launches and corresponding runs, and data transfer between CPUs and GPUs. SageMaker Profiler also offers a user interface (UI) that visualizes the *profile*, a statistical summary of profiled events, and the timeline of a training job for tracking and understanding the time relationship of the events between GPUs and CPUs.

**Note**  
SageMaker Profiler supports PyTorch and TensorFlow and is available in [AWS Deep Learning Containers for SageMaker AI](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only). To learn more, see [Supported framework images, AWS Regions, and instance types](profiler-support.md).

**For data scientists**

Training deep learning models on a large compute cluster often surfaces computational optimization problems, such as bottlenecks, kernel launch latencies, memory limits, and low resource utilization.

To identify such computational performance issues, you need to profile deeper into the compute resources to understand which kernels introduce latencies and which operations cause bottlenecks. Data scientists can benefit from using the SageMaker Profiler UI to visualize the detailed profile of training jobs. The UI provides a dashboard furnished with summary charts and a timeline interface to track every event on the compute resources. Data scientists can also add custom annotations to track certain parts of the training job using the SageMaker Profiler Python modules.

**For administrators**

Through the Profiler landing page in the SageMaker AI console or the [SageMaker AI domain](https://docs.aws.amazon.com/sagemaker/latest/dg/sm-domain.html), administrators of an AWS account or a SageMaker AI domain can manage the Profiler application users. Each domain user can access their own Profiler application, given the granted permissions. As a SageMaker AI domain administrator or domain user, you can create and delete the Profiler application according to your permission level.

**Topics**
+ [Supported framework images, AWS Regions, and instance types](profiler-support.md)
+ [Prerequisites for SageMaker Profiler](profiler-prereq.md)
+ [Prepare and run a training job with SageMaker Profiler](profiler-prepare.md)
+ [Open the SageMaker Profiler UI application](profiler-access-smprofiler-ui.md)
+ [Explore the profile output data visualized in the SageMaker Profiler UI](profiler-explore-viz.md)
+ [Troubleshooting for SageMaker Profiler](profiler-faq.md)

# Supported framework images, AWS Regions, and instance types
<a name="profiler-support"></a>

This feature supports the following machine learning frameworks and AWS Regions.

**Note**  
To use this feature, make sure that you have installed the SageMaker Python SDK [version 2.180.0](https://pypi.org/project/sagemaker/2.180.0/) or later.
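To check programmatically whether your environment meets this requirement, you can compare version tuples. The following is a minimal, illustrative sketch (the `meets_minimum` helper is not part of the SDK); in practice you would pass it `sagemaker.__version__`.

```python
def meets_minimum(installed: str, required: str = "2.180.0") -> bool:
    """Return True if an installed SageMaker Python SDK version string
    satisfies the minimum required for SageMaker Profiler."""
    def to_tuple(version: str):
        # Compare only the numeric major.minor.patch components
        return tuple(int(part) for part in version.split(".")[:3])
    return to_tuple(installed) >= to_tuple(required)
```

For example, `meets_minimum("2.179.9")` returns `False`, signaling that you should run `pip install --upgrade sagemaker`.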

## SageMaker AI framework images pre-installed with SageMaker Profiler
<a name="profiler-support-frameworks"></a>

SageMaker Profiler is pre-installed in the following [AWS Deep Learning Containers for SageMaker AI](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only).

### PyTorch images
<a name="profiler-support-frameworks-pytorch"></a>


| PyTorch versions | AWS DLC image URI | 
| --- | --- | 
| 2.2.0 |  *763104351884*.dkr.ecr.*<region>*.amazonaws.com/pytorch-training:2.2.0-gpu-py310-cu121-ubuntu20.04-sagemaker  | 
| 2.1.0 |  *763104351884*.dkr.ecr.*<region>*.amazonaws.com/pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-sagemaker  | 
| 2.0.1 |  *763104351884*.dkr.ecr.*<region>*.amazonaws.com/pytorch-training:2.0.1-gpu-py310-cu118-ubuntu20.04-sagemaker *763104351884*.dkr.ecr.*<region>*.amazonaws.com/pytorch-training:2.0.1-gpu-py310-cu121-ubuntu20.04-sagemaker  | 
| 1.13.1 |  *763104351884*.dkr.ecr.*<region>*.amazonaws.com/pytorch-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker  | 

### TensorFlow images
<a name="profiler-support-frameworks-tensorflow"></a>


| TensorFlow versions | AWS DLC image URI | 
| --- | --- | 
| 2.13.0 |  *763104351884*.dkr.ecr.*<region>*.amazonaws.com/tensorflow-training:2.13.0-gpu-py310-cu118-ubuntu20.04-sagemaker  | 
| 2.12.0 |  *763104351884*.dkr.ecr.*<region>*.amazonaws.com/tensorflow-training:2.12.0-gpu-py310-cu118-ubuntu20.04-sagemaker  | 
| 2.11.0 |  *763104351884*.dkr.ecr.*<region>*.amazonaws.com/tensorflow-training:2.11.0-gpu-py39-cu112-ubuntu20.04-sagemaker  | 
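The image URIs in both tables share one pattern: the AWS Deep Learning Containers account ID, an ECR hostname containing your Region, and a framework-specific tag. The following sketch assembles a URI from those parts; the `dlc_image_uri` helper is illustrative only (the SageMaker Python SDK's `sagemaker.image_uris.retrieve` is the supported lookup).

```python
def dlc_image_uri(framework: str, tag: str, region: str) -> str:
    """Assemble an AWS Deep Learning Container training image URI
    from the account ID, Region, and tag shown in the tables above."""
    account = "763104351884"  # AWS DLC account ID used in the tables
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{framework}-training:{tag}"

# For example, the PyTorch 2.2.0 image in us-west-2:
uri = dlc_image_uri("pytorch", "2.2.0-gpu-py310-cu121-ubuntu20.04-sagemaker", "us-west-2")
```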

**Important**  
Distribution and maintenance of the framework containers in the preceding tables are governed by the [Framework Support Policy](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/support-policy.html) managed by the AWS Deep Learning Containers service. If you are using a prior framework version that is no longer supported, we highly recommend that you upgrade to one of the [currently supported framework versions](https://aws.amazon.com/releasenotes/dlc-support-policy/).

**Note**  
If you want to use SageMaker Profiler for other framework images or your own Docker images, you can install SageMaker Profiler using the SageMaker Profiler Python package binary files provided in the following section.

## SageMaker Profiler Python package binary files
<a name="profiler-python-package"></a>

If you want to configure your own Docker container, use SageMaker Profiler in other pre-built containers for PyTorch and TensorFlow, or install the SageMaker Profiler Python package locally, choose one of the following binary files depending on the Python and CUDA versions in your environment.

### PyTorch
<a name="profiler-python-package-for-pytorch"></a>
+ Python 3.8, CUDA 11.3: [https://smppy.s3.amazonaws.com/pytorch/cu113/smprof-0.3.334-cp38-cp38-linux_x86_64.whl](https://smppy.s3.amazonaws.com/pytorch/cu113/smprof-0.3.334-cp38-cp38-linux_x86_64.whl)
+ Python 3.9, CUDA 11.7: [https://smppy.s3.amazonaws.com/pytorch/cu117/smprof-0.3.334-cp39-cp39-linux_x86_64.whl](https://smppy.s3.amazonaws.com/pytorch/cu117/smprof-0.3.334-cp39-cp39-linux_x86_64.whl)
+ Python 3.10, CUDA 11.8: [https://smppy.s3.amazonaws.com/pytorch/cu118/smprof-0.3.334-cp310-cp310-linux_x86_64.whl](https://smppy.s3.amazonaws.com/pytorch/cu118/smprof-0.3.334-cp310-cp310-linux_x86_64.whl)
+ Python 3.10, CUDA 12.1: [https://smppy.s3.amazonaws.com/pytorch/cu121/smprof-0.3.334-cp310-cp310-linux_x86_64.whl](https://smppy.s3.amazonaws.com/pytorch/cu121/smprof-0.3.334-cp310-cp310-linux_x86_64.whl)

### TensorFlow
<a name="profiler-python-package-for-tensorflow"></a>
+ Python 3.9, CUDA 11.2: [https://smppy.s3.amazonaws.com/tensorflow/cu112/smprof-0.3.334-cp39-cp39-linux_x86_64.whl](https://smppy.s3.amazonaws.com/tensorflow/cu112/smprof-0.3.334-cp39-cp39-linux_x86_64.whl)
+ Python 3.10, CUDA 11.8: [https://smppy.s3.amazonaws.com/tensorflow/cu118/smprof-0.3.334-cp310-cp310-linux_x86_64.whl](https://smppy.s3.amazonaws.com/tensorflow/cu118/smprof-0.3.334-cp310-cp310-linux_x86_64.whl)

For more information about how to install SageMaker Profiler using the binary files, see [(Optional) Install the SageMaker Profiler Python package](profiler-prepare.md#profiler-install-python-package).

## Supported AWS Regions
<a name="profiler-support-regions"></a>

SageMaker Profiler is available in the following AWS Regions.
+ US East (N. Virginia) (`us-east-1`)
+ US East (Ohio) (`us-east-2`)
+ US West (Oregon) (`us-west-2`)
+ Europe (Frankfurt) (`eu-central-1`)
+ Europe (Ireland) (`eu-west-1`)

## Supported instance types
<a name="profiler-support-instance-types"></a>

SageMaker Profiler supports profiling of training jobs on the following instance types.

**CPU and GPU profiling**
+ `ml.g4dn.12xlarge`
+ `ml.g5.24xlarge`
+ `ml.g5.48xlarge`
+ `ml.p3dn.24xlarge`
+ `ml.p4de.24xlarge`
+ `ml.p4d.24xlarge`
+ `ml.p5.48xlarge`

**GPU profiling only**
+ `ml.g5.2xlarge`
+ `ml.g5.4xlarge`
+ `ml.g5.8xlarge`
+ `ml.g5.16xlarge`
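Before configuring a job, it can be worth validating the Region and instance type against these lists up front. The following is an illustrative sketch based only on the lists above; the helper and constant names are assumptions, not part of any SDK.

```python
# Supported values transcribed from the lists in this section
PROFILER_REGIONS = {
    "us-east-1", "us-east-2", "us-west-2", "eu-central-1", "eu-west-1",
}

CPU_AND_GPU_PROFILING = {
    "ml.g4dn.12xlarge", "ml.g5.24xlarge", "ml.g5.48xlarge",
    "ml.p3dn.24xlarge", "ml.p4de.24xlarge", "ml.p4d.24xlarge",
    "ml.p5.48xlarge",
}

GPU_PROFILING_ONLY = {
    "ml.g5.2xlarge", "ml.g5.4xlarge", "ml.g5.8xlarge", "ml.g5.16xlarge",
}

def profiler_supported(region: str, instance_type: str) -> bool:
    """Check whether SageMaker Profiler supports the given Region and instance type."""
    return region in PROFILER_REGIONS and (
        instance_type in CPU_AND_GPU_PROFILING
        or instance_type in GPU_PROFILING_ONLY
    )
```

Checking early avoids discovering an unsupported configuration only after the job starts.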

# Prerequisites for SageMaker Profiler
<a name="profiler-prereq"></a>

The following list shows the prerequisites to start using SageMaker Profiler.
+ A SageMaker AI domain set up with Amazon VPC in your AWS account. 

  For instructions on setting up a domain, see [Onboard to Amazon SageMaker AI domain using quick setup](https://docs.aws.amazon.com/sagemaker/latest/dg/onboard-quick-start.html). You also need to add domain user profiles for individual users to access the Profiler UI application. For more information, see [Add user profiles](https://docs.aws.amazon.com/sagemaker/latest/dg/domain-user-profile-add.html).
+ The following is the minimum set of permissions required to use the Profiler UI application.
  + `sagemaker:CreateApp`
  + `sagemaker:DeleteApp`
  + `sagemaker:DescribeTrainingJob`
  + `sagemaker:Search`
  + `s3:GetObject`
  + `s3:ListBucket`
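These permissions can be granted through an identity-based IAM policy attached to the user's role. The following is a hedged sketch of such a policy; in practice, scope the `Resource` elements to your specific S3 buckets and SageMaker AI resources instead of `*`.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ProfilerAppPermissions",
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreateApp",
        "sagemaker:DeleteApp",
        "sagemaker:DescribeTrainingJob",
        "sagemaker:Search"
      ],
      "Resource": "*"
    },
    {
      "Sid": "ProfilerOutputAccess",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": "*"
    }
  ]
}
```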

# Prepare and run a training job with SageMaker Profiler
<a name="profiler-prepare"></a>

Setting up and running a training job with SageMaker Profiler consists of two steps: adapting the training script and configuring the SageMaker AI training job launcher.

**Topics**
+ [Step 1: Adapt your training script using the SageMaker Profiler Python modules](#profiler-prepare-training-script)
+ [Step 2: Create a SageMaker AI framework estimator and activate SageMaker Profiler](#profiler-profilerconfig)
+ [(Optional) Install the SageMaker Profiler Python package](#profiler-install-python-package)

## Step 1: Adapt your training script using the SageMaker Profiler Python modules
<a name="profiler-prepare-training-script"></a>

To start capturing kernel runs on GPUs while the training job is running, modify your training script using the SageMaker Profiler Python modules. Import the library and add the `start_profiling()` and `stop_profiling()` methods to define the beginning and the end of profiling. You can also use optional custom annotations to add markers in the training script to visualize hardware activities during particular operations in each step.

Note that the annotations capture operations on GPUs. To profile operations on CPUs, you don't need to add any additional annotations; CPU profiling is also activated when you specify the profiling configuration, which you'll practice in [Step 2: Create a SageMaker AI framework estimator and activate SageMaker Profiler](#profiler-profilerconfig).

**Note**  
Profiling an entire training job is not the most efficient use of resources. We recommend profiling at most 300 steps of a training job.

**Important**  
The release on [December 14, 2023](profiler-release-notes.md#profiler-release-notes-20231214) involves a breaking change. The SageMaker Profiler Python package name changed from `smppy` to `smprof`. This is effective in the [SageMaker AI Framework Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only) for TensorFlow v2.12 and later.  
If you use one of the previous versions of the [SageMaker AI Framework Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only), such as TensorFlow v2.11.0, the SageMaker Profiler Python package is still available as `smppy`. If you are uncertain about which version or package name to use, replace the import statement of the SageMaker Profiler package with the following code snippet.  

```
try:
    import smprof 
except ImportError:
    # backward compatibility for TF 2.11 and PT 1.13.1 images
    import smppy as smprof
```

**Approach 1.** Use the context manager `smprof.annotate` to annotate full functions

You can wrap full functions with the `smprof.annotate()` context manager. This wrapper is recommended if you want to profile by functions instead of by code lines. The following example script shows how to implement the context manager to wrap the training loop and full functions in each iteration.

```
import smprof

SMProf = smprof.SMProfiler.instance()
config = smprof.Config()
config.profiler = {
    "EnableCuda": "1",
}
SMProf.configure(config)
SMProf.start_profiling()

for epoch in range(args.epochs):
    if world_size > 1:
        sampler.set_epoch(epoch)
    tstart = time.perf_counter()
    for i, data in enumerate(trainloader, 0):
        with smprof.annotate("step_"+str(i)):
            inputs, labels = data
            inputs = inputs.to("cuda", non_blocking=True)
            labels = labels.to("cuda", non_blocking=True)
    
            optimizer.zero_grad()
    
            with smprof.annotate("Forward"):
                outputs = net(inputs)
            with smprof.annotate("Loss"):
                loss = criterion(outputs, labels)
            with smprof.annotate("Backward"):
                loss.backward()
            with smprof.annotate("Optimizer"):
                optimizer.step()

SMProf.stop_profiling()
```

**Approach 2.** Use `smprof.annotation_begin()` and `smprof.annotation_end()` to annotate specific code lines in functions

You can also define annotations to profile specific code lines. With this approach, you set the exact start and end points of profiling at the level of individual code lines rather than at the function level. For example, in the following script, the `step_annotator` is defined at the beginning of each iteration and ended at the end of the iteration. Meanwhile, other detailed annotators for each operation are defined to wrap the target operations throughout each iteration.

```
import smprof

SMProf = smprof.SMProfiler.instance()
config = smprof.Config()
config.profiler = {
    "EnableCuda": "1",
}
SMProf.configure(config)
SMProf.start_profiling()

for epoch in range(args.epochs):
    if world_size > 1:
        sampler.set_epoch(epoch)
    tstart = time.perf_counter()
    for i, data in enumerate(trainloader, 0):
        step_annotator = smprof.annotation_begin("step_" + str(i))

        inputs, labels = data
        inputs = inputs.to("cuda", non_blocking=True)
        labels = labels.to("cuda", non_blocking=True)
        optimizer.zero_grad()

        forward_annotator = smprof.annotation_begin("Forward")
        outputs = net(inputs)
        smprof.annotation_end(forward_annotator)

        loss_annotator = smprof.annotation_begin("Loss")
        loss = criterion(outputs, labels)
        smprof.annotation_end(loss_annotator)

        backward_annotator = smprof.annotation_begin("Backward")
        loss.backward()
        smprof.annotation_end(backward_annotator)

        optimizer_annotator = smprof.annotation_begin("Optimizer")
        optimizer.step()
        smprof.annotation_end(optimizer_annotator)

        smprof.annotation_end(step_annotator)

SMProf.stop_profiling()
```

After annotating and setting up the profiler initiation modules, save the script to submit using a SageMaker training job launcher in the following Step 2. The sample launcher assumes that the training script is named `train_with_profiler_demo.py`.

## Step 2: Create a SageMaker AI framework estimator and activate SageMaker Profiler
<a name="profiler-profilerconfig"></a>

The following procedure shows how to prepare a SageMaker AI framework estimator for training using the SageMaker Python SDK.

1. Set up a `profiler_config` object using the `ProfilerConfig` and `Profiler` modules as follows.

   ```
   from sagemaker import ProfilerConfig, Profiler
   profiler_config = ProfilerConfig(
       profile_params = Profiler(cpu_profiling_duration=3600)
   )
   ```

   The following is the description of the `Profiler` module and its argument.
   +  `Profiler`: The module for activating SageMaker Profiler with the training job.
     +  `cpu_profiling_duration` (int): Specify the time duration in seconds for profiling on CPUs. Default is 3600 seconds. 

1. Create a SageMaker AI framework estimator with the `profiler_config` object created in the previous step. The following code shows an example of creating a PyTorch estimator. If you want to create a TensorFlow estimator, import `sagemaker.tensorflow.TensorFlow` instead, and specify one of the [TensorFlow versions](profiler-support.md#profiler-support-frameworks-tensorflow) supported by SageMaker Profiler. For more information about supported frameworks and instance types, see [SageMaker AI framework images pre-installed with SageMaker Profiler](profiler-support.md#profiler-support-frameworks).

   ```
   import sagemaker
   from sagemaker.pytorch import PyTorch
   
   estimator = PyTorch(
       framework_version="2.0.0",
       role=sagemaker.get_execution_role(),
       entry_point="train_with_profiler_demo.py", # your training job entry point
       source_dir=source_dir, # source directory for your training script
       output_path=output_path,
       base_job_name="sagemaker-profiler-demo",
       hyperparameters=hyperparameters, # if any
       instance_count=1, # Recommended to test with < 8
       instance_type="ml.p4d.24xlarge",
       profiler_config=profiler_config
   )
   ```

1. Start the training job by running the `fit` method. With `wait=False`, you can silence the training job logs and let it run in the background.

   ```
   estimator.fit(wait=False)
   ```

While running the training job or after the job has completed, you can go to the next topic at [Open the SageMaker Profiler UI application](profiler-access-smprofiler-ui.md) and start exploring and visualizing the saved profiles.

If you want to directly access the profile data saved in the Amazon S3 bucket, use the following script to retrieve the S3 URI.

```
import os
# This is an ad-hoc function to get the S3 URI
# to where the profile output data is saved
def get_detailed_profiler_output_uri(estimator):
    config_name = None
    for processing in estimator.profiler_rule_configs:
        params = processing.get("RuleParameters", dict())
        rule = params.get("rule_to_invoke", "")
        if rule == "DetailedProfilerProcessing":
            config_name = processing.get("RuleConfigurationName")
            break
    return os.path.join(
        estimator.output_path, 
        estimator.latest_training_job.name, 
        "rule-output",
        config_name,
    )

print(
    "Profiler output S3 bucket:", 
    get_detailed_profiler_output_uri(estimator)
)
```

## (Optional) Install the SageMaker Profiler Python package
<a name="profiler-install-python-package"></a>

To use SageMaker Profiler on PyTorch or TensorFlow framework images not listed in [SageMaker AI framework images pre-installed with SageMaker Profiler](profiler-support.md#profiler-support-frameworks), or on your own custom Docker container for training, you can install SageMaker Profiler by using one of the [SageMaker Profiler Python package binary files](profiler-support.md#profiler-python-package).

**Option 1: Install the SageMaker Profiler package while launching a training job**

If you want to use SageMaker Profiler for training jobs using PyTorch or TensorFlow images not listed in [SageMaker AI framework images pre-installed with SageMaker Profiler](profiler-support.md#profiler-support-frameworks), create a `requirements.txt` file and place it under the path you specify for the `source_dir` parameter of the SageMaker AI framework estimator in [Step 2](#profiler-profilerconfig). For more information about setting up a `requirements.txt` file in general, see [Using third-party libraries](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#using-third-party-libraries) in the *SageMaker Python SDK documentation*. In the `requirements.txt` file, add one of the S3 bucket paths for the [SageMaker Profiler Python package binary files](profiler-support.md#profiler-python-package).

```
# requirements.txt
https://smppy.s3.amazonaws.com/tensorflow/cu112/smprof-0.3.332-cp39-cp39-linux_x86_64.whl
```

**Option 2: Install the SageMaker Profiler package in your custom Docker containers**

If you use a custom Docker container for training, add one of the [SageMaker Profiler Python package binary files](profiler-support.md#profiler-python-package) to your Dockerfile.

```
# Install the smprof package version compatible with your CUDA version
RUN pip install https://smppy.s3.amazonaws.com/tensorflow/cu112/smprof-0.3.332-cp39-cp39-linux_x86_64.whl
```

For guidance on running a custom Docker container for training on SageMaker AI in general, see [Adapting your own training container](https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html).

# Open the SageMaker Profiler UI application
<a name="profiler-access-smprofiler-ui"></a>

You can access the SageMaker Profiler UI application through the following options.

**Topics**
+ [Option 1: Launch the SageMaker Profiler UI from the domain details page](#profiler-access-smprofiler-ui-console-smdomain)
+ [Option 2: Launch the SageMaker Profiler UI application from the SageMaker Profiler landing page in the SageMaker AI console](#profiler-access-smprofiler-ui-console-profiler-landing-page)
+ [Option 3: Use the application launcher function in the SageMaker AI Python SDK](#profiler-access-smprofiler-ui-app-launcher-function)

## Option 1: Launch the SageMaker Profiler UI from the domain details page
<a name="profiler-access-smprofiler-ui-console-smdomain"></a>

If you have access to the SageMaker AI console, you can use this option.

**Navigate to the domain details page**

 The following procedure shows how to navigate to the domain details page. 

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Domains**. 

1. From the list of domains, select the domain in which you want to launch the SageMaker Profiler application.

**Launch the SageMaker Profiler UI application**

The following procedure shows how to launch the SageMaker Profiler application that is scoped to a user profile. 

1. On the domain details page, choose the **User profiles** tab. 

1. Identify the user profile for which you want to launch the SageMaker Profiler UI application. 

1. Choose **Launch** for the selected user profile, and choose **Profiler**. 

## Option 2: Launch the SageMaker Profiler UI application from the SageMaker Profiler landing page in the SageMaker AI console
<a name="profiler-access-smprofiler-ui-console-profiler-landing-page"></a>

The following procedure describes how to launch the SageMaker Profiler UI application from the SageMaker Profiler landing page in the SageMaker AI console. If you have access to the SageMaker AI console, you can use this option.

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Profiler**.

1. Under **Get started**, select the domain in which you want to launch the Studio Classic application. If your user profile only belongs to one domain, you do not see the option for selecting a domain.

1. Select the user profile for which you want to launch the SageMaker Profiler UI application. If there is no user profile in the domain, choose **Create user profile**. For more information about creating a new user profile, see [Add user profiles](https://docs.aws.amazon.com/sagemaker/latest/dg/domain-user-profile-add.html).

1. Choose **Open Profiler**.

## Option 3: Use the application launcher function in the SageMaker AI Python SDK
<a name="profiler-access-smprofiler-ui-app-launcher-function"></a>

If you are a SageMaker AI domain user and have access only to SageMaker Studio, you can access the SageMaker Profiler UI application through SageMaker Studio Classic by running the application launcher function in the [`sagemaker.interactive_apps.detail_profiler_app`](https://sagemaker.readthedocs.io/en/stable/api/utility/interactive_apps.html#module-sagemaker.interactive_apps.detail_profiler_app) module.

Note that SageMaker Studio Classic is the previous Studio UI experience from before re:Invent 2023, and was migrated as an application into the newly designed Studio UI at re:Invent 2023. The SageMaker Profiler UI application is available at the SageMaker AI domain level, and thus requires your domain ID and user profile name. Currently, the `DetailProfilerApp` function works only within the SageMaker Studio Classic application; the function properly takes in the domain and user profile information from SageMaker Studio Classic.

For domains, domain users, and Studio created before re:Invent 2023, Studio Classic is the default experience unless you have updated it by following the instructions at [Migrating from Amazon SageMaker Studio Classic](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated-migrate.html). If this is your case, no further action is needed, and you can directly launch the SageMaker Profiler UI application by running the `DetailProfilerApp` function.

If you created a new domain and Studio after re:Invent 2023, launch the Studio Classic application within the Studio UI and then run the `DetailProfilerApp` function to launch the SageMaker Profiler UI application.

Note that the `DetailProfilerApp` function doesn't work in other SageMaker AI machine learning IDEs, such as the SageMaker Studio JupyterLab application, the SageMaker Studio Code Editor application, and SageMaker Notebook instances. If you run the `DetailProfilerApp` function in those IDEs, it returns a URL to the Profiler landing page in the SageMaker AI console, instead of a direct link to open the Profiler UI application.

# Explore the profile output data visualized in the SageMaker Profiler UI
<a name="profiler-explore-viz"></a>

This section walks through the SageMaker Profiler UI and provides tips for how to use and gain insights from it.

## Load profile
<a name="profiler-explore-viz-load"></a>

When you open the SageMaker Profiler UI, the **Load profile** page opens up. To load and generate the **Dashboard** and **Timeline**, go through the following procedure.<a name="profiler-explore-viz-load-procedure"></a>

**To load the profile of a training job**

1. From the **List of training jobs** section, use the check box to choose the training job for which you want to load the profile.

1. Choose **Load**. The job name should appear in the **Loaded profile** section at the top.

1. Choose the radio button to the left of the **Job name** to generate the **Dashboard** and **Timeline**. When you choose the radio button, the UI automatically opens the **Dashboard**. If you generate the visualizations while the job status and loading status still appear to be in progress, the SageMaker Profiler UI generates **Dashboard** plots and a **Timeline** up to the most recent profile data collected from the ongoing training job or the partially loaded profile data.

**Tip**  
You can load and visualize one profile at a time. To load another profile, you must first unload the previously loaded profile. To unload a profile, use the trash bin icon on the right end of the profile in the **Loaded profile** section.

![\[\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/profiler/sagemaker-profiler-ui-load-data.png)


## Dashboard
<a name="profiler-explore-viz-overview"></a>

After you finish loading and selecting the training job, the UI opens the **Dashboard** page furnished with the following panels by default.
+ **GPU active time** – This pie chart shows the percentage of GPU active time versus GPU idle time. You can check if your GPUs are more active than idle throughout the entire training job. GPU active time is based on the profile data points with a utilization rate greater than 0%, whereas GPU idle time is the profiled data points with 0% utilization.
+ **GPU utilization over time** – This timeline graph shows the average GPU utilization rate over time per node, aggregating all of the nodes in a single chart. You can check if the GPUs have an unbalanced workload, under-utilization issues, bottlenecks, or idle issues during certain time intervals. To track the utilization rate at the individual GPU level and related kernel runs, use the [Timeline interface](#profiler-explore-viz-timeline). Note that the GPU activity collection starts from where you added the profiler starter function `SMProf.start_profiling()` in your training script, and stops at `SMProf.stop_profiling()`.
+ **CPU active time** – This pie chart shows the percentage of CPU active time versus CPU idle time. You can check if your CPUs are more active than idle throughout the entire training job. CPU active time is based on the profiled data points with a utilization rate greater than 0%, whereas CPU idle time is the profiled data points with 0% utilization.
+ **CPU utilization over time** – This timeline graph shows the average CPU utilization rate over time per node, aggregating all of the nodes in a single chart. You can check if the CPUs are bottlenecked or underutilized during certain time intervals. To track the utilization rate of the CPUs aligned with the individual GPU utilization and kernel runs, use the [Timeline interface](#profiler-explore-viz-timeline). Note that the utilization metrics start from job initialization.
+ **Time spent by all GPU kernels** – This pie chart shows all GPU kernels operated throughout the training job. It shows the top 15 GPU kernels by default as individual sectors and all other kernels in one sector. Hover over the sectors to see more detailed information. The value shows the total time of the GPU kernels operated in seconds, and the percentage is based on the entire time of the profile. 
+ **Time spent by top 15 GPU kernels** – This pie chart shows all GPU kernels operated throughout the training job. It shows the top 15 GPU kernels as individual sectors. Hover over the sectors to see more detailed information. The value shows the total time of the GPU kernels operated in seconds, and the percentage is based on the entire time of the profile. 
+ **Launch counts of all GPU kernels** – This pie chart shows the number of counts for every GPU kernel launched throughout the training job. It shows the top 15 GPU kernels as individual sectors and all other kernels in one sector. Hover over the sectors to see more detailed information. The value shows the total count of the launched GPU kernels, and the percentage is based on the entire count of all kernels. 
+ **Launch counts of top 15 GPU kernels** – This pie chart shows the number of counts of every GPU kernel launched throughout the training job. It shows the top 15 GPU kernels. Hover over the sectors to see more detailed information. The value shows the total count of the launched GPU kernels, and the percentage is based on the entire count of all kernels. 
+ **Step time distribution** – This histogram shows the distribution of step durations on GPUs. This plot is generated only after you add the step annotator in your training script.
+ **Kernel precision distribution** – This pie chart shows the percentage of time spent on running kernels in different data types such as FP32, FP16, INT32, and INT8. 
+ **GPU activity distribution** – This pie chart shows the percentage of time spent on GPU activities, such as running kernels, memory (`memcpy` and `memset`), and synchronization (`sync`).
+ **GPU memory operations distribution** – This pie chart shows the percentage of time spent on GPU memory operations. This visualizes the `memcpy` activities and helps identify if your training job is spending excessive time on certain memory operations.
+ **Create a new histogram** – Create a new diagram of a custom metric you annotated manually during [Step 1: Adapt your training script using the SageMaker Profiler Python modules](profiler-prepare.md#profiler-prepare-training-script). When adding a custom annotation to a new histogram, select or type the name of the annotation you added in the training script. For example, in the demo training script in Step 1, `step`, `Forward`, `Backward`, `Optimize`, and `Loss` are the custom annotations. While creating a new histogram, these annotation names should appear in the drop-down menu for metric selection. If you choose `Backward`, the UI adds the histogram of the time spent on backward passes throughout the profiled time to the **Dashboard**. This type of histogram is useful for checking if there are outliers taking abnormally longer time and causing bottleneck problems.
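The custom annotations referenced in the last panel come from the SageMaker Profiler Python package. The following minimal sketch shows the annotation pattern; the training loop body and loss values are stand-ins, and the no-op fallback exists only so the snippet runs outside a SageMaker image that has `smprof` installed:

```
import contextlib

try:
    import smprof  # SageMaker Profiler Python package
    annotate = smprof.annotate
except ImportError:
    # No-op fallback: lets this sketch run where smprof isn't installed
    @contextlib.contextmanager
    def annotate(name):
        yield

losses = []
for step, batch_loss in enumerate([1.0, 0.5, 0.25]):  # stand-in data loader
    with annotate(f"step_{step}"):
        with annotate("Backward"):  # appears as a selectable metric in the UI
            losses.append(batch_loss)  # stand-in for loss.backward()

print(len(losses))
```

Each annotation name you use (`step_0`, `Backward`, and so on) then appears in the drop-down menu for metric selection when you create a new histogram.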

The following screenshots show the GPU and CPU active time ratio and the average GPU and CPU utilization rate with respect to time per compute node.

![\[\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/profiler/sagemaker-profiler-ui-dashboard-1.png)


The following screenshot shows an example of pie charts for comparing how many times the GPU kernels are launched and measuring the time spent on running them. In the **Time spent by all GPU kernels** and **Launch counts of all GPU kernels** panels, you can also specify an integer in the input field for *k* to adjust the number of kernels shown in the legend. For example, if you specify 10, the plots show the top 10 most run and most launched kernels, respectively.

![\[\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/profiler/sagemaker-profiler-ui-dashboard-2.png)


The following screenshot shows an example of step time duration histogram, and pie charts for the kernel precision distribution, GPU activity distribution, and GPU memory operation distribution.

![\[\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/profiler/sagemaker-profiler-ui-dashboard-3.png)


## Timeline interface
<a name="profiler-explore-viz-timeline"></a>

To gain a detailed view into the compute resources at the level of operations and kernels scheduled on the CPUs and run on the GPUs, use the **Timeline** interface.

You can zoom in and out and pan left or right in the timeline interface using your mouse, the `[w, a, s, d]` keys, or the four arrow keys on the keyboard.

**Tip**  
For more tips on the keyboard shortcuts to interact with the **Timeline** interface, choose **Keyboard shortcuts** in the left pane.

The timeline tracks are organized in a tree structure, giving you information from the host level to the device level. For example, if you run `N` instances with eight GPUs in each, the timeline structure of each instance would be as follows.
+ **algo-inode** – This is the tag SageMaker AI uses to assign jobs to provisioned instances, where *inode* is the instance number. For example, if you use 4 instances, this section expands from **algo-1** to **algo-4**.
  + **CPU** – In this section, you can check the average CPU utilization rate and performance counters.
  + **GPUs** – In this section, you can check the average GPU utilization rate, individual GPU utilization rate, and kernels.
    + **SUM Utilization** – The average GPU utilization rates per instance.
    + **HOST-0 PID-123** – A unique name assigned to each process track. The acronym PID is the process ID, and the number appended to it is the process ID number that's recorded during data capture from the process. This section shows the following information from the process.
      + **GPU-inum\_gpu utilization** – The utilization rate of the inum\_gpu-th GPU over time.
      + **GPU-inum\_gpu device** – The kernel runs on the inum\_gpu-th GPU device.
        + **stream icuda\_stream** – CUDA streams showing kernel runs on the GPU device. To learn more about CUDA streams, see the slides in PDF at [CUDA C/C++ Streams and Concurrency](https://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf) provided by NVIDIA.
      + **GPU-inum\_gpu host** – The kernel launches on the inum\_gpu-th GPU host.

The following screenshots show the **Timeline** of the profile of a training job run on `ml.p4d.24xlarge` instances, each equipped with 8 NVIDIA A100 Tensor Core GPUs.

The following is a zoomed-out view of the profile, showing a dozen steps, including an intermittent data loader between `step_232` and `step_233` that fetches the next data batch.

![\[\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/profiler/sagemaker-profiler-ui-timeline-1.png)


For each CPU, you can track the CPU utilization and performance counters, such as `"clk_unhalted_ref.tsc"` and `"itlb_misses.miss_causes_a_walk"`, which are indicative of instructions run on the CPU.

For each GPU, you can see a host timeline and a device timeline. Kernel launches are on the host timeline and kernel runs are on the device timeline. You can also see annotations (such as forward, backward, and optimize) in the GPU host timeline if you added them in your training script.

In the timeline view, you can also track kernel launch-and-run pairs. This helps you understand how a kernel launch scheduled on a host (CPU) is run on the corresponding GPU device.

**Tip**  
Press the `f` key to zoom into the selected kernel.

The following screenshot is a zoomed-in view of `step_233` and `step_234` from the previous screenshot. The timeline interval selected in the following screenshot is the `AllReduce` operation, an essential communication and synchronization step in distributed training, run on the GPU-0 device. In the screenshot, note that the kernel launch in the GPU-0 host connects to the kernel run in the GPU-0 device stream 1, indicated by the cyan arrow.

![\[\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/profiler/sagemaker-profiler-ui-timeline-2.png)


Also, two information tabs appear in the bottom pane of the UI when you select a timeline interval, as shown in the previous screenshot. The **Current Selection** tab shows the details of the selected kernel and the connected kernel launch from the host. The connection direction is always from host (CPU) to device (GPU), since each GPU kernel is always called from a CPU. The **Connections** tab shows the chosen kernel launch-and-run pair. You can select either of them to move it to the center of the **Timeline** view.

The following screenshot zooms in further into the `AllReduce` operation launch and run pair. 

![\[\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/profiler/sagemaker-profiler-ui-timeline-3.png)


## Information
<a name="profiler-expore-viz-information"></a>

In **Information**, you can access information about the loaded training job, such as the instance type, Amazon Resource Names (ARNs) of compute resources provisioned for the job, node names, and hyperparameters.

## Settings
<a name="profiler-expore-viz-settings"></a>

The SageMaker AI Profiler UI application instance is configured to shut down after 2 hours of idle time by default. In **Settings**, use the following settings to adjust the auto shutdown timer.
+ **Enable app auto shutdown** – Set to **Enabled** to let the application automatically shut down after the specified number of hours of idle time. To turn off the auto-shutdown functionality, choose **Disabled**.
+ **Auto shutdown threshold in hours** – If you choose **Enabled** for **Enable app auto shutdown**, you can set the threshold time in hours for the application to shut down automatically. This is set to 2 by default.

# Troubleshooting for SageMaker Profiler
<a name="profiler-faq"></a>

Use the following question-and-answer pairs to troubleshoot problems while using SageMaker Profiler.

**Q. I’m getting an error message, `ModuleNotFoundError: No module named 'smppy'`**

Since December 2023, the name of the SageMaker Profiler Python package has changed from `smppy` to `smprof` to resolve a duplicate package name issue; `smppy` is already used by an open source package.

Therefore, if you have been using `smppy` since before December 2023 and are experiencing this `ModuleNotFoundError`, it might be because your training script still uses the outdated package name while you have the latest `smprof` package installed or are using one of the latest [SageMaker AI framework images pre-installed with SageMaker Profiler](profiler-support.md#profiler-support-frameworks). In this case, make sure that you replace all mentions of `smppy` with `smprof` throughout your training script.

While updating the SageMaker Profiler Python package name in your training scripts, consider using a conditional import statement as shown in the following code snippet, so the script works regardless of which package name is installed.

```
try:
    import smprof 
except ImportError:
    # backward-compatibility for TF 2.11 and PT 1.13.1 images
    import smppy as smprof
```

Also note that if you have been using `smppy` while upgrading to the latest PyTorch or TensorFlow versions, make sure that you install the latest `smprof` package by following instructions at [(Optional) Install the SageMaker Profiler Python package](profiler-prepare.md#profiler-install-python-package).

**Q. I’m getting an error message, `ModuleNotFoundError: No module named 'smprof'`**

First, make sure that you use one of the officially supported SageMaker AI Framework Containers. If you don’t use one of those, you can install the `smprof` package by following instructions at [(Optional) Install the SageMaker Profiler Python package](profiler-prepare.md#profiler-install-python-package).

**Q. I’m not able to import `ProfilerConfig`**

If you are unable to import `ProfilerConfig` in your job launcher script using the SageMaker Python SDK, your local environment or the Jupyter kernel might have a significantly outdated version of the SageMaker Python SDK. Make sure that you upgrade the SDK to the latest version.

```
$ pip install --upgrade sagemaker
```

**Q. I’m getting an error message, `aborted: core dumped`, when importing `smprof` into my training script**

In an earlier version of `smprof`, this issue occurs with PyTorch 2.0+ and PyTorch Lightning. To resolve this issue, install the latest `smprof` package by following the instructions at [(Optional) Install the SageMaker Profiler Python package](profiler-prepare.md#profiler-install-python-package).

**Q. I cannot find the SageMaker Profiler UI from SageMaker Studio. How can I find it?**

If you have access to the SageMaker AI console, choose one of the following options.
+ [Option 1: Launch the SageMaker Profiler UI from the domain details page](profiler-access-smprofiler-ui.md#profiler-access-smprofiler-ui-console-smdomain)
+ [Option 2: Launch the SageMaker Profiler UI application from the SageMaker Profiler landing page in the SageMaker AI console](profiler-access-smprofiler-ui.md#profiler-access-smprofiler-ui-console-profiler-landing-page)

If you are a domain user and don't have access to the SageMaker AI console, you can access the application through SageMaker Studio Classic. If this is your case, choose the following option.
+ [Option 3: Use the application launcher function in the SageMaker AI Python SDK](profiler-access-smprofiler-ui.md#profiler-access-smprofiler-ui-app-launcher-function)

# Monitor AWS compute resource utilization in Amazon SageMaker Studio Classic
<a name="debugger-profile-training-jobs"></a>

To track compute resource utilization of your training job, use the monitoring tools offered by Amazon SageMaker Debugger. 

For any training job you run in SageMaker AI using the SageMaker Python SDK, Debugger collects basic resource utilization metrics, such as CPU utilization, GPU utilization, GPU memory utilization, network, and I/O wait time every 500 milliseconds. To see the dashboard of the resource utilization metrics of your training job, use the [SageMaker Debugger UI in SageMaker Studio Experiments](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-on-studio.html).

Deep learning operations and steps might operate in intervals of milliseconds. Compared to Amazon CloudWatch metrics, which collect metrics at intervals of 1 second, Debugger provides finer granularity into the resource utilization metrics down to 100-millisecond (0.1 second) intervals so you can dive deep into the metrics at the level of an operation or a step. 

If you want to change the metric collection time interval, add a profiling configuration parameter to your training job launcher. For example, if you're using the SageMaker AI Python SDK, pass the `profiler_config` parameter when you create an estimator object. To learn how to adjust the resource utilization metric collection interval, see [Code template for configuring a SageMaker AI estimator object with the SageMaker Debugger Python modules in the SageMaker AI Python SDK](debugger-configuration-for-profiling.md#debugger-configuration-structure-profiler) and then [Configure settings for basic profiling of system resource utilization](debugger-configure-system-monitoring.md).

Additionally, you can add issue detecting tools called *built-in profiling rules* provided by SageMaker Debugger. The built-in profiling rules run analysis against the resource utilization metrics and detect computational performance issues. For more information, see [Use built-in profiler rules managed by Amazon SageMaker Debugger](use-debugger-built-in-profiler-rules.md). You can receive rule analysis results through the [SageMaker Debugger UI in SageMaker Studio Experiments](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-on-studio.html) or the [SageMaker Debugger Profiling Report](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-profiling-report.html). You can also create custom profiling rules using the SageMaker Python SDK. 

To learn more about monitoring functionalities provided by SageMaker Debugger, see the following topics.

**Topics**
+ [Estimator configuration with parameters for basic profiling using the Amazon SageMaker Debugger Python modules](debugger-configuration-for-profiling.md)
+ [Use built-in profiler rules managed by Amazon SageMaker Debugger](use-debugger-built-in-profiler-rules.md)
+ [List of Debugger built-in profiler rules](debugger-built-in-profiler-rules.md)
+ [Amazon SageMaker Debugger UI in Amazon SageMaker Studio Classic Experiments](debugger-on-studio.md)
+ [SageMaker Debugger interactive report](debugger-profiling-report.md)
+ [Analyze data using the Debugger Python client library](debugger-analyze-data.md)

# Estimator configuration with parameters for basic profiling using the Amazon SageMaker Debugger Python modules
<a name="debugger-configuration-for-profiling"></a>

SageMaker Debugger basic profiling is on by default and monitors resource utilization metrics, such as CPU utilization, GPU utilization, GPU memory utilization, network, and I/O wait time, for all SageMaker training jobs submitted using the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable). SageMaker Debugger collects these resource utilization metrics every 500 milliseconds. You don't need to make any additional changes to your code, training script, or job launcher to track basic resource utilization. If you want to change the metric collection interval for basic profiling, you can specify Debugger-specific parameters while creating a SageMaker training job launcher using the SageMaker Python SDK, AWS SDK for Python (Boto3), or AWS Command Line Interface (CLI). In this guide, we focus on how to change profiling options using the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable). This page gives reference templates for configuring the estimator object.

To access the resource utilization metrics dashboard of your training job in SageMaker Studio, see the [Amazon SageMaker Debugger UI in Amazon SageMaker Studio Classic Experiments](debugger-on-studio.md).

If you want to activate the rules that detect system resource utilization problems automatically, add the `rules` parameter to the estimator object.

**Important**  
To use the latest SageMaker Debugger features, you need to upgrade the SageMaker Python SDK and the `SMDebug` client library. In your iPython kernel, Jupyter Notebook, or JupyterLab environment, run the following code to install the latest versions of the libraries and restart the kernel.  

```
import sys
import IPython
!{sys.executable} -m pip install -U sagemaker smdebug
IPython.Application.instance().kernel.do_shutdown(True)
```

## Code template for configuring a SageMaker AI estimator object with the SageMaker Debugger Python modules in the SageMaker AI Python SDK
<a name="debugger-configuration-structure-profiler"></a>

To adjust the basic profiling configuration (`profiler_config`) or add the profiler rules (`rules`), choose one of the tabs to get the template for setting up a SageMaker AI estimator. In the subsequent pages, you can find more information about how to configure the two parameters.

**Note**  
The following code examples are not directly executable. Proceed to the next sections to learn how to configure each parameter.

------
#### [ PyTorch ]

```
# An example of constructing a SageMaker AI PyTorch estimator
import boto3
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.debugger import ProfilerConfig, ProfilerRule, rule_configs

session=boto3.session.Session()
region=session.region_name

profiler_config=ProfilerConfig(...)
rules=[
    ProfilerRule.sagemaker(rule_configs.BuiltInRule())
]

estimator=PyTorch(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-profiling-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.12.0",
    py_version="py37",
    
    # SageMaker Debugger parameters
    profiler_config=profiler_config,
    rules=rules
)

estimator.fit(wait=False)
```

------
#### [ TensorFlow ]

```
# An example of constructing a SageMaker AI TensorFlow estimator
import boto3
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import ProfilerConfig, ProfilerRule, rule_configs

session=boto3.session.Session()
region=session.region_name

profiler_config=ProfilerConfig(...)
rules=[
    ProfilerRule.sagemaker(rule_configs.BuiltInRule())
]

estimator=TensorFlow(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-profiling-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.8.0",
    py_version="py37",
    
    # SageMaker Debugger parameters
    profiler_config=profiler_config,
    rules=rules
)

estimator.fit(wait=False)
```

------
#### [ MXNet ]

```
# An example of constructing a SageMaker AI MXNet estimator
import sagemaker
from sagemaker.mxnet import MXNet
from sagemaker.debugger import ProfilerConfig, ProfilerRule, rule_configs

profiler_config=ProfilerConfig(...)
rules=[
    ProfilerRule.sagemaker(rule_configs.BuiltInRule())
]

estimator=MXNet(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-profiling-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.7.0",
    py_version="py37",
    
    # SageMaker Debugger parameters
    profiler_config=profiler_config,
    rules=rules
)

estimator.fit(wait=False)
```

**Note**  
For MXNet, when configuring the `profiler_config` parameter, you can only configure for system monitoring. Profiling framework metrics is not supported for MXNet.

------
#### [ XGBoost ]

```
# An example of constructing a SageMaker AI XGBoost estimator
import sagemaker
from sagemaker.xgboost.estimator import XGBoost
from sagemaker.debugger import ProfilerConfig, ProfilerRule, rule_configs

profiler_config=ProfilerConfig(...)
rules=[
    ProfilerRule.sagemaker(rule_configs.BuiltInRule())
]

estimator=XGBoost(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-profiling-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.5-1",

    # Debugger-specific parameters
    profiler_config=profiler_config,
    rules=rules
)

estimator.fit(wait=False)
```

**Note**  
For XGBoost, when configuring the `profiler_config` parameter, you can only configure for system monitoring. Profiling framework metrics is not supported for XGBoost.

------
#### [ Generic estimator ]

```
# An example of constructing a SageMaker AI generic estimator using the XGBoost algorithm base image
import boto3
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker import image_uris
from sagemaker.debugger import ProfilerConfig, DebuggerHookConfig, Rule, ProfilerRule, rule_configs

profiler_config=ProfilerConfig(...)
rules=[
    ProfilerRule.sagemaker(rule_configs.BuiltInRule())
]

region=boto3.Session().region_name
xgboost_container=sagemaker.image_uris.retrieve("xgboost", region, "1.5-1")

estimator=Estimator(
    role=sagemaker.get_execution_role(),
    image_uri=xgboost_container,
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    
    # Debugger-specific parameters
    profiler_config=profiler_config,
    rules=rules
)

estimator.fit(wait=False)
```

------

The following provides brief descriptions of the parameters.
+ `profiler_config` – Configure Debugger to collect system metrics and framework metrics from your training job and save them to your secured S3 bucket or local machine. You can set how frequently to collect the system metrics. To learn how to configure the `profiler_config` parameter, see [Configure settings for basic profiling of system resource utilization](debugger-configure-system-monitoring.md) and [Estimator configuration for framework profiling](debugger-configure-framework-profiling.md).
+ `rules` – Configure this parameter to activate SageMaker Debugger built-in rules that you want to run in parallel. The rules run on processing containers and automatically analyze your training job to find computational and operational performance issues. The [ProfilerReport](debugger-built-in-profiler-rules.md#profiler-report) rule is the most integrated rule: it runs all built-in profiling rules and saves the profiling results as a report in your secured S3 bucket. Make sure that your training job has access to this S3 bucket. To learn how to configure the `rules` parameter, see [Use built-in profiler rules managed by Amazon SageMaker Debugger](use-debugger-built-in-profiler-rules.md).

**Note**  
Debugger securely saves output data in subfolders of your default S3 bucket. For example, the format of the default S3 bucket URI is `s3://sagemaker-<region>-<12digit_account_id>/<base-job-name>/<debugger-subfolders>/`. There are three subfolders created by Debugger: `debug-output`, `profiler-output`, and `rule-output`. You can also retrieve the default S3 bucket URIs using the [SageMaker AI estimator classmethods](debugger-estimator-classmethods.md).
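For example, the default output locations described in the note can be composed from the URI format above; the region, account ID, and job name below are placeholders:

```
DEBUGGER_SUBFOLDERS = ("debug-output", "profiler-output", "rule-output")

def debugger_output_uri(region, account_id, base_job_name, subfolder):
    # Composes the default S3 URI pattern that Debugger writes to
    if subfolder not in DEBUGGER_SUBFOLDERS:
        raise ValueError(f"unknown Debugger subfolder: {subfolder}")
    return f"s3://sagemaker-{region}-{account_id}/{base_job_name}/{subfolder}/"

uri = debugger_output_uri("us-west-2", "111122223333", "debugger-profiling-demo", "profiler-output")
print(uri)
```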

See the following topics to find out how to configure the Debugger-specific parameters in detail.

**Topics**
+ [Code template for configuring a SageMaker AI estimator object with the SageMaker Debugger Python modules in the SageMaker AI Python SDK](#debugger-configuration-structure-profiler)
+ [Configure settings for basic profiling of system resource utilization](debugger-configure-system-monitoring.md)
+ [Estimator configuration for framework profiling](debugger-configure-framework-profiling.md)
+ [Updating Debugger system monitoring and framework profiling configuration while a training job is running](debugger-update-monitoring-profiling.md)
+ [Turn off Debugger](debugger-turn-off-profiling.md)

# Configure settings for basic profiling of system resource utilization
<a name="debugger-configure-system-monitoring"></a>

To adjust the time interval for collecting the utilization metrics, use the `ProfilerConfig` API operation to create a parameter object while constructing a SageMaker AI framework or generic estimator depending on your preference.

**Note**  
By default, for all SageMaker training jobs, Debugger collects resource utilization metrics from Amazon EC2 instances every 500 milliseconds for system monitoring, without any Debugger-specific parameters specified in SageMaker AI estimators.   
Debugger saves the system metrics in the default S3 bucket. The format of the default S3 bucket URI is `s3://sagemaker-<region>-<12digit_account_id>/<training-job-name>/profiler-output/`.

The following code example shows how to set up the `profiler_config` parameter with a system monitoring time interval of 1000 milliseconds.

```
from sagemaker.debugger import ProfilerConfig

profiler_config=ProfilerConfig(
    system_monitor_interval_millis=1000
)
```
+  `system_monitor_interval_millis` (int) – Specify the monitoring intervals in milliseconds to record system metrics. Available values are 100, 200, 500, 1000 (1 second), 5000 (5 seconds), and 60000 (1 minute) milliseconds. The default value is 500 milliseconds.
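Because only the listed intervals are accepted, a small guard in your job launcher script (an illustrative helper, not part of the SDK) can catch an unsupported value before the job is submitted:

```
VALID_INTERVALS_MS = (100, 200, 500, 1000, 5000, 60000)

def check_monitor_interval(millis=500):
    # 500 ms is the Debugger default for system monitoring
    if millis not in VALID_INTERVALS_MS:
        raise ValueError(
            f"unsupported interval {millis} ms; choose one of {VALID_INTERVALS_MS}"
        )
    return millis

print(check_monitor_interval(1000))
```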

To see the progress of system monitoring, see [Open the Amazon SageMaker Debugger Insights dashboard](debugger-on-studio-insights.md).

# Estimator configuration for framework profiling
<a name="debugger-configure-framework-profiling"></a>

**Warning**  
In favor of [Amazon SageMaker Profiler](train-use-sagemaker-profiler.md), SageMaker AI Debugger deprecates the framework profiling feature starting from TensorFlow 2.11 and PyTorch 2.0. You can still use the feature in the previous versions of the frameworks and SDKs as follows.   
SageMaker Python SDK <= v2.130.0
PyTorch >= v1.6.0, < v2.0
TensorFlow >= v2.3.1, < v2.11
See also [March 16, 2023](debugger-release-notes.md#debugger-release-notes-20230315).

To enable Debugger framework profiling, configure the `framework_profile_params` parameter when you construct an estimator. Debugger framework profiling collects framework metrics, such as data from the initialization stage, data loader processes, and Python operators of deep learning frameworks and training scripts, and performs detailed profiling within and between steps, with cProfile or Pyinstrument options. Using the `FrameworkProfile` class, you can configure custom framework profiling options.

**Note**  
Before getting started with Debugger framework profiling, verify that the framework used to build your model is supported by Debugger for framework profiling. For more information, see [Supported frameworks and algorithms](debugger-supported-frameworks.md).   
Debugger saves the framework metrics in a default S3 bucket. The format of the default S3 bucket URI is `s3://sagemaker-<region>-<12digit_account_id>/<training-job-name>/profiler-output/`.

**Topics**
+ [Default framework profiling](debugger-configure-framework-profiling-basic.md)
+ [Default system monitoring and customized framework profiling for target steps or a target time range](debugger-configure-framework-profiling-range.md)
+ [Default system monitoring and customized framework profiling with different profiling options](debugger-configure-framework-profiling-options.md)

# Default framework profiling
<a name="debugger-configure-framework-profiling-basic"></a>

The default framework profiling of Debugger includes the following options: detailed profiling, data loader profiling, and Python profiling. The following example code shows the simplest `profiler_config` parameter setting to start the default system monitoring and the default framework profiling. The `FrameworkProfile` class in the following example code initiates the default framework profiling when a training job starts. 

```
from sagemaker.debugger import ProfilerConfig, FrameworkProfile
    
profiler_config=ProfilerConfig(
    framework_profile_params=FrameworkProfile()
)
```

With this `profiler_config` parameter configuration, Debugger applies the default settings of monitoring and profiling. Debugger monitors system metrics every 500 milliseconds; profiles the fifth step with the detailed profiling option; the seventh step with the data loader profiling option; and the ninth, tenth, and eleventh steps with the Python profiling option. 

To find available profiling configuration options, the default parameter settings, and examples of how to configure them, see [Default system monitoring and customized framework profiling with different profiling options](debugger-configure-framework-profiling-options.md) and [SageMaker Debugger APIs – FrameworkProfile](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.FrameworkProfile) in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable).

If you want to change the system monitoring interval and enable the default framework profiling, you can specify the `system_monitor_interval_millis` parameter explicitly with the `framework_profile_params` parameter. For example, to monitor every 1000 milliseconds and enable the default framework profiling, use the following example code.

```
from sagemaker.debugger import ProfilerConfig, FrameworkProfile
    
profiler_config=ProfilerConfig(
    system_monitor_interval_millis=1000,
    framework_profile_params=FrameworkProfile()
)
```

For more information about the `FrameworkProfile` class, see [SageMaker Debugger APIs – FrameworkProfile](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.FrameworkProfile) in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable).

# Default system monitoring and customized framework profiling for target steps or a target time range
<a name="debugger-configure-framework-profiling-range"></a>

If you want to specify target steps or target time intervals to profile your training job, you need to specify parameters for the `FrameworkProfile` class. The following code examples show how to specify the target ranges for profiling along with system monitoring.
+ **For a target step range**

  With the following example configuration, Debugger monitors the entire training job every 500 milliseconds (the default monitoring) and profiles a target step range from step 5 to step 15 (for 10 steps).

  ```
  from sagemaker.debugger import ProfilerConfig, FrameworkProfile
      
  profiler_config=ProfilerConfig(
      framework_profile_params=FrameworkProfile(start_step=5, num_steps=10)
  )
  ```

  With the following example configuration, Debugger monitors the entire training job every 1000 milliseconds and profiles a target step range from step 5 to step 15 (for 10 steps).

  ```
  from sagemaker.debugger import ProfilerConfig, FrameworkProfile
      
  profiler_config=ProfilerConfig(
      system_monitor_interval_millis=1000,
      framework_profile_params=FrameworkProfile(start_step=5, num_steps=10)
  )
  ```
+ **For a target time range**

  With the following example configuration, Debugger monitors the entire training job every 500 milliseconds (the default monitoring) and profiles a target time range from the current Unix time for 600 seconds.

  ```
  import time
  from sagemaker.debugger import ProfilerConfig, FrameworkProfile
  
  profiler_config=ProfilerConfig(
      framework_profile_params=FrameworkProfile(start_unix_time=int(time.time()), duration=600)
  )
  ```

  With the following example configuration, Debugger monitors the entire training job every 1000 milliseconds and profiles a target time range from the current Unix time for 600 seconds.

  ```
  import time
  from sagemaker.debugger import ProfilerConfig, FrameworkProfile
  
  profiler_config=ProfilerConfig(
      system_monitor_interval_millis=1000,
      framework_profile_params=FrameworkProfile(start_unix_time=int(time.time()), duration=600)
  )
  ```

  Framework profiling is performed with all of the profiling options over the target step or time range. 

  To find more information about available profiling options, see [SageMaker Debugger APIs – FrameworkProfile](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.FrameworkProfile) in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable).

  The next section shows you how to script the available profiling options.

# Default system monitoring and customized framework profiling with different profiling options
<a name="debugger-configure-framework-profiling-options"></a>

This section gives information about the supported profiling configuration classes, as well as an example configuration. You can use the following profiling configuration classes to manage the framework profiling options:
+ [DetailedProfilingConfig](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.DetailedProfilingConfig) – Specify a target step or time range to profile framework operations using the native framework profilers (TensorFlow profiler and PyTorch profiler). For example, if using TensorFlow, the Debugger hooks enable the TensorFlow profiler to collect TensorFlow-specific framework metrics. Detailed profiling enables you to profile all framework operators at a pre-step (before the first step), within steps, and between steps of a training job.
**Note**  
Detailed profiling might significantly increase GPU memory consumption. We do not recommend enabling detailed profiling for more than a couple of steps.
+ [DataloaderProfilingConfig](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.DataloaderProfilingConfig) – Specify a target step or time range to profile deep learning framework data loader processes. Debugger collects every data loader event of the frameworks.
**Note**  
Data loader profiling might lower the training performance while collecting information from data loaders. We don't recommend enabling data loader profiling for more than a couple of steps.  
Debugger is preconfigured to annotate data loader processes only for the AWS deep learning containers. Debugger cannot profile data loader processes from any other custom or external training containers.
+ [PythonProfilingConfig](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.PythonProfilingConfig) – Specify a target step or time range to profile Python functions. You can also choose between two Python profilers: cProfile and Pyinstrument.
  + *cProfile* – The standard Python profiler. cProfile collects information for every Python operator called during training. With cProfile, Debugger saves cumulative time and annotation for each function call, providing complete detail about Python functions. In deep learning, for example, the most frequently called functions might be the convolutional filters and backward pass operators, and cProfile profiles every single one of them. For the cProfile option, you can further select a timer option: total time, CPU time, or off-CPU time. While you can profile every function call executing on processors (both CPU and GPU) in CPU time, you can also identify I/O or network bottlenecks with the off-CPU time option. The default is total time, with which Debugger profiles both CPU and off-CPU time. With cProfile, you can drill down into every single function when analyzing the profile data.
  + *Pyinstrument* – Pyinstrument is a low-overhead Python profiler that works by sampling. With the Pyinstrument option, Debugger samples profiling events every millisecond. Because Pyinstrument measures elapsed wall-clock time instead of CPU time, the Pyinstrument option can be a better choice than the cProfile option for reducing profiling noise (filtering out irrelevant function calls that are cumulatively fast) and capturing operators that are actually compute intensive (cumulatively slow) for training your model. With Pyinstrument, you can see a tree of function calls and better understand the structure and root cause of any slowness.
**Note**  
Enabling Python profiling might slow down the overall training time. cProfile profiles the most frequently called Python operators at every call, so the processing time on profiling increases with respect to the number of calls. For Pyinstrument, the cumulative profiling time increases with respect to time because of its sampling mechanism.
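To get a concrete sense of the deterministic, per-call data that the cProfile option collects, the following standalone sketch runs the Python standard library's cProfile directly. This illustrates the profiler itself, not how Debugger invokes it; the function names are hypothetical stand-ins for training operators.

```python
import cProfile
import io
import pstats

def inner(n):
    # Simulated compute-heavy operator.
    return sum(i * i for i in range(n))

def train_step():
    # Each "step" calls the operator several times, as a training loop would.
    return [inner(10_000) for _ in range(5)]

profiler = cProfile.Profile()
profiler.enable()
train_step()
profiler.disable()

# Print cumulative time per function, similar in spirit to the per-call
# data Debugger saves with the cProfile option and the total-time timer.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
print(stream.getvalue())
```

Note how every call site appears in the output; a sampling profiler such as Pyinstrument would instead report only the functions it observes on the stack at each sampling tick.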

The following example configuration shows the full structure when you use the different profiling options with specified values.

```
import time
from sagemaker.debugger import (ProfilerConfig, 
                                FrameworkProfile, 
                                DetailedProfilingConfig, 
                                DataloaderProfilingConfig, 
                                PythonProfilingConfig,
                                PythonProfiler, cProfileTimer)

profiler_config=ProfilerConfig(
    system_monitor_interval_millis=500,
    framework_profile_params=FrameworkProfile(
        detailed_profiling_config=DetailedProfilingConfig(
            start_step=5, 
            num_steps=1
        ),
        dataloader_profiling_config=DataloaderProfilingConfig(
            start_step=7, 
            num_steps=1
        ),
        python_profiling_config=PythonProfilingConfig(
            start_step=9, 
            num_steps=1, 
            python_profiler=PythonProfiler.CPROFILE, 
            cprofile_timer=cProfileTimer.TOTAL_TIME
        )
    )
)
```

For more information about available profiling options, see [DetailedProfilingConfig](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.DetailedProfilingConfig), [DataloaderProfilingConfig](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.DataloaderProfilingConfig), and [PythonProfilingConfig](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.PythonProfilingConfig) in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable).

# Updating Debugger system monitoring and framework profiling configuration while a training job is running
<a name="debugger-update-monitoring-profiling"></a>

If you want to activate or update the Debugger monitoring configuration for a training job that is currently running, use the following SageMaker AI estimator extension methods:
+ To activate Debugger system monitoring for a running training job and receive a Debugger profiling report, use the following:

  ```
  estimator.enable_default_profiling()
  ```

  When you use the `enable_default_profiling` method, Debugger initiates the default system monitoring and the `ProfileReport` built-in rule, which generates a comprehensive profiling report at the end of the training job. You can call this method only if the current training job is running without Debugger monitoring and profiling.

  For more information, see [estimator.enable\_default\_profiling](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Estimator.enable_default_profiling) in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable).
+ To update system monitoring configuration, use the following:

  ```
  estimator.update_profiler(
      system_monitor_interval_millis=500
  )
  ```

  For more information, see [estimator.update\_profiler](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Estimator.update_profiler) in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable).

# Turn off Debugger
<a name="debugger-turn-off-profiling"></a>

If you want to completely turn off Debugger, do one of the following:
+ Before starting a training job, do the following:

  To turn off profiling, add the `disable_profiler` parameter to your estimator and set it to `True`.
**Warning**  
If you disable it, you won't be able to view the comprehensive Studio Debugger insights dashboard and the autogenerated profiling report.

  To turn off debugging, set the `debugger_hook_config` parameter to `False`.
**Warning**  
If you disable it, you won't be able to collect output tensors and cannot debug your model parameters.

  ```
  estimator=Estimator(
      ...
      disable_profiler=True,
      debugger_hook_config=False
  )
  ```

  For more information about the Debugger-specific parameters, see [SageMaker AI Estimator](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Estimator) in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable).
+ While a training job is running, do the following:

  To disable both monitoring and profiling while your training job is running, use the following estimator method:

  ```
  estimator.disable_profiling()
  ```

  To disable framework profiling only and keep system monitoring, use the `update_profiler` method:

  ```
  estimator.update_profiler(disable_framework_metrics=True)
  ```

  For more information about the estimator extension methods, see the [estimator.disable\_profiling](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Estimator.disable_profiling) and [estimator.update\_profiler](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Estimator.update_profiler) methods in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) documentation.

# Use built-in profiler rules managed by Amazon SageMaker Debugger
<a name="use-debugger-built-in-profiler-rules"></a>

The Amazon SageMaker Debugger built-in profiler rules analyze system metrics and framework operations collected during the training of a model. Debugger offers the `ProfilerRule` API operation that helps configure the rules to monitor training compute resources and operations and to detect anomalies. For example, the profiling rules can help you detect whether there are computational problems such as CPU bottlenecks, excessive I/O wait time, imbalanced workload across GPU workers, and compute resource underutilization. To see a full list of available built-in profiling rules, see [List of Debugger built-in profiler rules](debugger-built-in-profiler-rules.md). The following topics show how to use the Debugger built-in rules with default parameter settings and custom parameter values.

**Note**  
The built-in rules are provided through Amazon SageMaker processing containers and fully managed by SageMaker Debugger at no additional cost. For more information about billing, see the [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/) page.

**Topics**
+ [Use SageMaker Debugger built-in profiler rules with their default parameter settings](#debugger-built-in-profiler-rules-configuration)
+ [Use Debugger built-in profiler rules with custom parameter values](#debugger-built-in-profiler-rules-configuration-param-change)

## Use SageMaker Debugger built-in profiler rules with their default parameter settings
<a name="debugger-built-in-profiler-rules-configuration"></a>

To add SageMaker Debugger built-in rules in your estimator, you need to configure a `rules` list object. The following example code shows the basic structure of listing the SageMaker Debugger built-in rules.

```
from sagemaker.debugger import Rule, ProfilerRule, rule_configs

rules=[
    ProfilerRule.sagemaker(rule_configs.BuiltInProfilerRuleName_1()),
    ProfilerRule.sagemaker(rule_configs.BuiltInProfilerRuleName_2()),
    ...
    ProfilerRule.sagemaker(rule_configs.BuiltInProfilerRuleName_n()),
    ... # You can also append more debugging rules in the Rule.sagemaker(rule_configs.*()) format.
]

estimator=Estimator(
    ...
    rules=rules
)
```

For a complete list of available built-in rules, see [List of Debugger built-in profiler rules](debugger-built-in-profiler-rules.md).

To use the profiling rules and inspect the computational performance and progress of your training job, add the [ProfilerReport](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-profiler-rules.html#profiler-report) rule of SageMaker Debugger. This rule activates all built-in rules in the [Debugger `ProfilerRule`](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-profiler-rules.html#debugger-built-in-profiler-rules-ProfilerRule) family and generates an aggregated profiling report. For more information, see [Profiling Report Generated Using SageMaker Debugger](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-profiling-report.html). You can use the following code to add the profiling report rule to your training estimator.

```
from sagemaker.debugger import Rule, ProfilerRule, rule_configs

rules=[
    ProfilerRule.sagemaker(rule_configs.ProfilerReport())
]
```

When you start the training job with the `ProfilerReport` rule, Debugger collects resource utilization data every 500 milliseconds. Debugger analyzes the resource utilization to identify whether your model is having bottleneck problems. If the rules detect training anomalies, the rule evaluation status changes to `IssueFound`. You can set up automated actions, such as sending notifications about training issues and stopping training jobs, using Amazon CloudWatch Events and AWS Lambda. For more information, see [Action on Amazon SageMaker Debugger rules](debugger-action-on-rules.md).

## Use Debugger built-in profiler rules with custom parameter values
<a name="debugger-built-in-profiler-rules-configuration-param-change"></a>

If you want to adjust the built-in rule parameter values and customize the tensor collection regex, configure the `base_config` and `rule_parameters` parameters for the `ProfilerRule.sagemaker` and `Rule.sagemaker` class methods. For the `Rule.sagemaker` class method, you can also customize tensor collections through the `collections_to_save` parameter. For instructions on how to use the `CollectionConfig` class, see [Configure tensor collections using the `CollectionConfig` API](debugger-configure-tensor-collections.md). 

Use the following configuration template for built-in rules to customize parameter values. By changing the rule parameters, you can adjust how sensitively the rules are triggered. 
+ The `base_config` argument is where you call the built-in rule methods.
+ The `rule_parameters` argument is to adjust the default key values of the built-in rules listed in [List of Debugger built-in profiler rules](debugger-built-in-profiler-rules.md).

For more information about the Debugger rule class, methods, and parameters, see [SageMaker AI Debugger Rule class](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html) in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable).

```
from sagemaker.debugger import Rule, ProfilerRule, rule_configs, CollectionConfig

rules=[
    ProfilerRule.sagemaker(
        base_config=rule_configs.BuiltInProfilerRuleName(),
        rule_parameters={
                "key": "value"
        }
    )
]
```

The parameter descriptions and value customization examples are provided for each rule at [List of Debugger built-in profiler rules](debugger-built-in-profiler-rules.md).

For a low-level JSON configuration of the Debugger built-in rules using the `CreateTrainingJob` API, see [Configure Debugger using SageMaker API](debugger-createtrainingjob-api.md).

# List of Debugger built-in profiler rules
<a name="debugger-built-in-profiler-rules"></a>

Use the Debugger built-in profiler rules provided by Amazon SageMaker Debugger and analyze metrics collected while training your models. The Debugger built-in rules monitor various common conditions that are critical for the success of running a performant training job. You can call the built-in profiler rules using [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) or the low-level SageMaker API operations. There's no additional cost for using the built-in rules. For more information about billing, see the [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/) page.

**Note**  
The maximum number of built-in profiler rules that you can attach to a training job is 20. SageMaker Debugger fully manages the built-in rules and analyzes your training job synchronously.

**Important**  
To use the new Debugger features, you need to upgrade the SageMaker Python SDK and the SMDebug client library. In your IPython kernel, Jupyter notebook, or JupyterLab environment, run the following code to install the latest versions of the libraries and restart the kernel.  

```
import sys
import IPython
!{sys.executable} -m pip install -U sagemaker smdebug
IPython.Application.instance().kernel.do_shutdown(True)
```

## Profiler rules
<a name="debugger-built-in-profiler-rules-ProfilerRule"></a>

The following rules are the Debugger built-in rules that are callable using the `ProfilerRule.sagemaker` classmethod.

Debugger built-in rule for generating the profiling report


| Scope of Validity | Built-in Rules | 
| --- | --- | 
| Profiling Report for any SageMaker training job |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-profiler-rules.html)  | 

Debugger built-in rules for profiling hardware system resource utilization (system metrics)


| Scope of Validity | Built-in Rules | 
| --- | --- | 
| Generic system monitoring rules for any SageMaker training job |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-profiler-rules.html)  | 

Debugger built-in rules for profiling framework metrics


| Scope of Validity | Built-in Rules | 
| --- | --- | 
| Profiling rules for deep learning frameworks (TensorFlow and PyTorch) |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-profiler-rules.html)  | 

**Warning**  
In favor of [Amazon SageMaker Profiler](train-use-sagemaker-profiler.md), SageMaker AI Debugger deprecates the framework profiling feature starting from TensorFlow 2.11 and PyTorch 2.0. You can still use the feature in the previous versions of the frameworks and SDKs as follows.
+ SageMaker Python SDK <= v2.130.0
+ PyTorch >= v1.6.0, < v2.0
+ TensorFlow >= v2.3.1, < v2.11
See also [March 16, 2023](debugger-release-notes.md#debugger-release-notes-20230315).

**To use the built-in rules with default parameter values** – use the following configuration format:

```
from sagemaker.debugger import Rule, ProfilerRule, rule_configs

rules = [
    ProfilerRule.sagemaker(rule_configs.BuiltInRuleName_1()),
    ProfilerRule.sagemaker(rule_configs.BuiltInRuleName_2()),
    ...
    ProfilerRule.sagemaker(rule_configs.BuiltInRuleName_n())
]
```

**To use the built-in rules with customized parameter values** – use the following configuration format:

```
from sagemaker.debugger import Rule, ProfilerRule, rule_configs

rules = [
    ProfilerRule.sagemaker(
        base_config=rule_configs.BuiltInRuleName(),
        rule_parameters={
                "key": "value"
        }
    )
]
```

To find available keys for the `rule_parameters` parameter, see the parameter description tables.

Sample rule configuration codes are provided for each built-in rule below the parameter description tables.
+ For a full instruction and examples of using the Debugger built-in rules, see [Debugger built-in rules example code](debugger-built-in-rules-example.md#debugger-deploy-built-in-rules).
+ For a full instruction on using the built-in rules with the low-level SageMaker API operations, see [Configure Debugger using SageMaker API](debugger-createtrainingjob-api.md).

## ProfilerReport
<a name="profiler-report"></a>

The ProfilerReport rule invokes all of the built-in rules for monitoring and profiling. It creates a profiling report and updates it when the individual rules are triggered. You can download a comprehensive profiling report while a training job is running or after the training job is complete. You can adjust the rule parameter values to customize the sensitivity of the built-in monitoring and profiling rules. The following example code shows the basic format to adjust the built-in rule parameters through the ProfilerReport rule.

```
rules=[
    ProfilerRule.sagemaker(
        rule_configs.ProfilerReport(
            <BuiltInRuleName>_<parameter_name> = value
        )
    )  
]
```

If you trigger this ProfilerReport rule without any customized parameters, as shown in the following example code, then the ProfilerReport rule triggers all of the built-in rules for monitoring and profiling with their default parameter values.

```
rules=[ProfilerRule.sagemaker(rule_configs.ProfilerReport())]
```

The following example code shows how to specify and adjust the CPUBottleneck rule's `cpu_threshold` parameter and the IOBottleneck rule's `threshold` parameter.

```
rules=[
    ProfilerRule.sagemaker(
        rule_configs.ProfilerReport(
            CPUBottleneck_cpu_threshold = 90,
            IOBottleneck_threshold = 90
        )
    )  
]
```

To explore what's in the profiler report, see [SageMaker Debugger Profiling Report](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-profiling-report.html). Also, because this rule activates all of the profiling rules, you can also check the rule analysis status using the [SageMaker Debugger UI in SageMaker Studio Experiments](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-on-studio.html).

Parameter Descriptions for the ProfilerReport Rule


| Parameter Name | Description | 
| --- | --- | 
| base\_trial | The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. **Required** Valid values: String  | 
| <BuiltInRuleName>\_<parameter\_name> |  Customizable parameter to adjust thresholds of other built-in monitoring and profiling rules.  **Optional** Default value: `None`  | 

## BatchSize
<a name="batch-size-rule"></a>

The BatchSize rule helps detect whether the GPU is underutilized because of a small batch size. To detect this issue, the rule monitors the average CPU utilization, GPU utilization, and GPU memory utilization. If utilization on CPU, GPU, and GPU memory is low on average, it may indicate that the training job can either run on a smaller instance type or run with a bigger batch size. This analysis does not work for frameworks that heavily overallocate memory. Note, however, that increasing the batch size can lead to processing or data loading bottlenecks, because more data preprocessing time is required in each iteration.
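As an illustrative sketch (not the rule's actual implementation), the BatchSize check can be pictured as comparing the 95th percentiles of the utilization metrics against the thresholds from the parameter table:

```python
import statistics

def p95(values):
    # 95th percentile: the 95th of the 99 cut points for n=100.
    return statistics.quantiles(values, n=100)[94]

def batch_size_rule_fires(cpu_util, gpu_util, gpu_mem_util,
                          cpu_threshold_p95=70, gpu_threshold_p95=70,
                          gpu_memory_threshold_p95=70):
    """Sketch of the BatchSize check: flag a potentially too-small batch
    size when the 95th percentiles of CPU, GPU, and GPU memory
    utilization are all below their thresholds (table defaults)."""
    return (p95(cpu_util) < cpu_threshold_p95
            and p95(gpu_util) < gpu_threshold_p95
            and p95(gpu_mem_util) < gpu_memory_threshold_p95)

# Synthetic utilization windows (in percent), 500 data points each.
low = [20 + i % 10 for i in range(500)]    # mostly idle
busy = [85 + i % 10 for i in range(500)]   # heavily used

print(batch_size_rule_fires(low, low, low))    # everything idle: flagged
print(batch_size_rule_fires(low, busy, low))   # GPU is busy: not flagged
```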

Parameter Descriptions for the BatchSize Rule


| Parameter Name | Description | 
| --- | --- | 
| base\_trial | The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. **Required** Valid values: String  | 
| cpu\_threshold\_p95 |  Defines the threshold for the 95th percentile of CPU utilization in percentage. **Optional** Valid values: Integer Default value: `70` (in percentage)  | 
| gpu\_threshold\_p95 |  Defines the threshold for the 95th percentile of GPU utilization in percentage. **Optional** Valid values: Integer Default value: `70` (in percentage)  | 
| gpu\_memory\_threshold\_p95 | Defines the threshold for the 95th percentile of GPU memory utilization in percentage. **Optional** Valid values: Integer Default value: `70` (in percentage)  | 
| patience | Defines the number of data points to skip until the rule starts evaluation. The first several steps of a training job usually show a high volume of data processing, so use this parameter to keep the rule patient and prevent it from being invoked too soon. **Optional** Valid values: Integer Default value: `100`  | 
| window |  Window size for computing quantiles. **Optional** Valid values: Integer Default value: `500`  | 
| scan\_interval\_us |  Time interval at which timeline files are scanned. **Optional** Valid values: Integer Default value: `60000000` (in microseconds)  | 

## CPUBottleneck
<a name="cpu-bottleneck"></a>

The CPUBottleneck rule helps detect if the GPU is underutilized due to CPU bottlenecks. The rule returns True if the number of CPU bottlenecks exceeds a predefined threshold.
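As an illustrative sketch (not the rule's actual implementation), the check can be pictured as counting samples where the CPU is busy while the GPU sits idle, then comparing their proportion against the threshold from the parameter table:

```python
def cpu_bottleneck_rule_fires(samples, threshold=50,
                              gpu_threshold=10, cpu_threshold=90):
    """Sketch of the CPUBottleneck check: a sample counts as a
    bottleneck when CPU utilization is high while GPU utilization is
    low (table defaults). The rule fires when bottlenecked samples
    exceed `threshold` percent of all samples."""
    bottlenecked = sum(
        1 for cpu, gpu in samples
        if cpu > cpu_threshold and gpu < gpu_threshold
    )
    return 100 * bottlenecked / len(samples) > threshold

# 60% of the (cpu_util, gpu_util) samples show a busy CPU feeding an
# idle GPU, so the rule fires on this synthetic data.
samples = [(95, 5)] * 60 + [(40, 80)] * 40
print(cpu_bottleneck_rule_fires(samples))
```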

Parameter Descriptions for the CPUBottleneck Rule


| Parameter Name | Description | 
| --- | --- | 
| base\_trial | The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. **Required** Valid values: String  | 
| threshold |  Defines the threshold for the proportion of bottlenecked time to the total training time. If the proportion exceeds the percentage specified by the threshold parameter, the rule status switches to True. **Optional** Valid values: Integer Default value: `50` (in percentage)  | 
| gpu\_threshold |  A threshold that defines low GPU utilization. **Optional** Valid values: Integer Default value: `10` (in percentage)  | 
| cpu\_threshold | A threshold that defines high CPU utilization. **Optional** Valid values: Integer Default value: `90` (in percentage)  | 
| patience | Defines the number of data points to skip until the rule starts evaluation. The first several steps of a training job usually show a high volume of data processing, so use this parameter to keep the rule patient and prevent it from being invoked too soon. **Optional** Valid values: Integer Default value: `100`  | 
| scan\_interval\_us | Time interval at which timeline files are scanned. **Optional** Valid values: Integer Default value: `60000000` (in microseconds)  | 

## GPUMemoryIncrease
<a name="gpu-memory-increase"></a>

The GPUMemoryIncrease rule helps detect a large increase in memory usage on GPUs.

Parameter Descriptions for the GPUMemoryIncrease Rule


| Parameter Name | Description | 
| --- | --- | 
| base\_trial | The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. **Required** Valid values: String  | 
| increase |  Defines the threshold for absolute memory increase. **Optional** Valid values: Integer Default value: `10` (in percentage)  | 
| patience |  Defines the number of data points to skip until the rule starts evaluation. The first several steps of a training job usually show a high volume of data processing, so use this parameter to keep the rule patient and prevent it from being invoked too soon. **Optional** Valid values: Integer Default value: `100`  | 
| window |  Window size for computing quantiles. **Optional** Valid values: Integer Default value: `500`  | 
| scan\_interval\_us |  Time interval at which timeline files are scanned. **Optional** Valid values: Integer Default value: `60000000` (in microseconds)  | 

## IOBottleneck
<a name="io-bottleneck"></a>

This rule helps detect whether the GPU is underutilized because of data I/O bottlenecks. The rule returns `True` if the number of I/O bottlenecks exceeds a predefined threshold.

Parameter Descriptions for the IOBottleneck Rule


| Parameter Name | Description | 
| --- | --- | 
| base\_trial | The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. **Required** Valid values: String  | 
| threshold | Defines the threshold at which the rule returns `True`. **Optional** Valid values: Integer Default value: `50` (in percentage)  | 
| gpu\_threshold |  A threshold that defines when the GPU is considered underutilized. **Optional** Valid values: Integer Default value: `70` (in percentage)  | 
| io\_threshold | A threshold that defines high I/O wait time. **Optional** Valid values: Integer Default value: `50` (in percentage)  | 
| patience | Defines the number of data points to skip until the rule starts evaluation. The first several steps of a training job usually show a high volume of data processing, so keep the rule patient and prevent it from being invoked too soon by specifying the number of profiling data points to skip with this parameter. **Optional** Valid values: Integer Default value: `1000`  | 
| scan\_interval\_us |  The time interval with which timeline files are scanned. **Optional** Valid values: Integer Default value: `60000000` (in microseconds)  | 
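
The interaction of the three thresholds can be sketched in plain Python. This is a simplified illustration under the assumption that a bottleneck data point is one where the GPU is underutilized while I/O wait is high; it is not Debugger's actual implementation, and the function name and sample data are hypothetical.

```python
def io_bottleneck(samples, threshold=50, gpu_threshold=70, io_threshold=50):
    """Simplified sketch of the IOBottleneck decision.

    `samples` is a list of (gpu_util, io_wait) percentage pairs. A data
    point counts as a bottleneck when the GPU is underutilized while I/O
    wait is high; the rule fires when the share of such points exceeds
    `threshold` percent.
    """
    bottlenecks = sum(
        1 for gpu, io in samples if gpu < gpu_threshold and io > io_threshold
    )
    return 100.0 * bottlenecks / len(samples) > threshold

# 3 of 4 points (75%) show low GPU utilization with high I/O wait
assert io_bottleneck([(30, 80), (35, 90), (95, 5), (40, 70)]) is True
assert io_bottleneck([(95, 5), (90, 10), (85, 8), (92, 3)]) is False
```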

## LoadBalancing
<a name="load-balancing"></a>

The LoadBalancing rule helps detect issues in workload balancing among multiple GPUs.

Parameter Descriptions for the LoadBalancing Rule


| Parameter Name | Description | 
| --- | --- | 
| base\_trial | The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. **Required** Valid values: String  | 
| threshold |  Defines the threshold for the workload imbalance proportion among GPUs. **Optional** Valid values: Float Default value: `0.5` (unitless proportion)  | 
| patience |  Defines the number of data points to skip until the rule starts evaluation. The first several steps of a training job usually show a high volume of data processing, so keep the rule patient and prevent it from being invoked too soon by specifying the number of profiling data points to skip with this parameter. **Optional** Valid values: Integer Default value: `10`  | 
| scan\_interval\_us |  The time interval with which timeline files are scanned. **Optional** Valid values: Integer Default value: `60000000` (in microseconds)  | 
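
A workload-balance check of this kind can be sketched in plain Python. This illustration assumes the rule compares each GPU's share of the total workload and flags a relative gap larger than the threshold; the exact imbalance measure Debugger uses may differ, and the function name and sample data are hypothetical.

```python
def load_balancing_issue(gpu_workloads, threshold=0.5):
    """Simplified sketch: detect workload imbalance across GPUs.

    `gpu_workloads` maps each GPU to a measure of work done (for
    example, total kernel time). This sketch fires when the smallest
    workload share falls short of the largest by more than `threshold`
    as a relative proportion.
    """
    total = sum(gpu_workloads.values())
    shares = [w / total for w in gpu_workloads.values()]
    return (max(shares) - min(shares)) / max(shares) > threshold

# GPU 3 does a tenth of the work of GPU 0: imbalanced
assert load_balancing_issue({0: 100, 1: 90, 2: 95, 3: 10}) is True
assert load_balancing_issue({0: 100, 1: 95, 2: 98, 3: 92}) is False
```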

## LowGPUUtilization
<a name="low-gpu-utilization"></a>

The LowGPUUtilization rule helps detect whether GPU utilization is low or fluctuating. This is checked for each GPU on each worker. The rule returns `True` if the 95th percentile is below `threshold_p95`, which indicates underutilization. The rule also returns `True` if the 95th percentile is above `threshold_p95` and the 5th percentile is below `threshold_p5`, which indicates fluctuation.

Parameter Descriptions for the LowGPUUtilization Rule


| Parameter Name | Description | 
| --- | --- | 
| base\_trial | The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. **Required** Valid values: String  | 
| threshold\_p95 |  A threshold for the 95th percentile below which the GPU is considered underutilized. **Optional** Valid values: Integer Default value: `70` (in percentage)  | 
| threshold\_p5 | A threshold for the 5th percentile. **Optional** Valid values: Integer Default value: `10` (in percentage)  | 
| patience |  Defines the number of data points to skip until the rule starts evaluation. The first several steps of a training job usually show a high volume of data processing, so keep the rule patient and prevent it from being invoked too soon by specifying the number of profiling data points to skip with this parameter. **Optional** Valid values: Integer Default value: `1000`  | 
| window |  The window size for computing quantiles. **Optional** Valid values: Integer Default value: `500`  | 
| scan\_interval\_us |  The time interval with which timeline files are scanned. **Optional** Valid values: Integer Default value: `60000000` (in microseconds)  | 
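
The two-threshold decision described above can be sketched in plain Python. This is a simplified illustration (with a naive nearest-rank percentile), not Debugger's actual implementation; the function name and sample data are hypothetical.

```python
def low_gpu_utilization(util, threshold_p95=70, threshold_p5=10):
    """Simplified sketch of the LowGPUUtilization decision for one GPU.

    Returns True for underutilization (95th percentile below
    threshold_p95) or for fluctuation (95th percentile above
    threshold_p95 while the 5th percentile falls below threshold_p5).
    """
    s = sorted(util)
    p95 = s[int(0.95 * (len(s) - 1))]
    p5 = s[int(0.05 * (len(s) - 1))]
    if p95 < threshold_p95:
        return True           # consistently underutilized
    return p5 < threshold_p5  # high peaks but deep dips: fluctuation

# Steady 40% utilization: underutilized
assert low_gpu_utilization([40] * 100) is True
# Steady 90% utilization: healthy
assert low_gpu_utilization([90] * 100) is False
# Alternating 0% and 95%: fluctuation
assert low_gpu_utilization([0, 95] * 50) is True
```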

## OverallSystemUsage
<a name="overall-system-usage"></a>

The OverallSystemUsage rule measures overall system usage per worker node. The rule currently only aggregates values per node and computes their percentiles.

Parameter Descriptions for the OverallSystemUsage Rule


| Parameter Name | Description | 
| --- | --- | 
| base\_trial | The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. **Required** Valid values: String  | 
| scan\_interval\_us |  The time interval with which timeline files are scanned. **Optional** Valid values: Integer Default value: `60000000` (in microseconds)  | 

## MaxInitializationTime
<a name="max-initialization-time"></a>

The MaxInitializationTime rule helps detect if the training initialization is taking too much time. The rule waits until the first step is available.

Parameter Descriptions for the MaxInitializationTime Rule


| Parameter Name | Description | 
| --- | --- | 
| base\_trial | The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. **Required** Valid values: String  | 
| threshold |  Defines the threshold in minutes to wait for the first step to become available. **Optional** Valid values: Integer Default value: `20` (in minutes)  | 
| scan\_interval\_us |  The time interval with which timeline files are scanned. **Optional** Valid values: Integer Default value: `60000000` (in microseconds)  | 
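
The rule's condition is straightforward and can be sketched in a few lines of Python. This is a simplified illustration, not Debugger's actual implementation; the function name is hypothetical.

```python
def max_initialization_time_exceeded(first_step_seen, elapsed_minutes, threshold=20):
    """Simplified sketch: fire if the first training step still has not
    appeared after `threshold` minutes of waiting."""
    return (not first_step_seen) and elapsed_minutes > threshold

# No step after 25 minutes: rule fires
assert max_initialization_time_exceeded(False, 25) is True
# First step arrived: no issue regardless of elapsed time
assert max_initialization_time_exceeded(True, 25) is False
assert max_initialization_time_exceeded(False, 5) is False
```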

## OverallFrameworkMetrics
<a name="overall-framework-metrics"></a>

The OverallFrameworkMetrics rule summarizes the time spent on framework metrics, such as forward and backward pass, and data loading.

Parameter Descriptions for the OverallFrameworkMetrics Rule


| Parameter Name | Description | 
| --- | --- | 
| base\_trial | The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. **Required** Valid values: String  | 
| scan\_interval\_us |  The time interval with which timeline files are scanned. **Optional** Valid values: Integer Default value: `60000000` (in microseconds)  | 

## StepOutlier
<a name="step-outlier"></a>

The StepOutlier rule helps detect outliers in step durations. This rule returns `True` if there are outliers with step durations that deviate by more than `stddev` standard deviations from the mean of all step durations in a time range.

Parameter Descriptions for the StepOutlier Rule


| Parameter Name | Description | 
| --- | --- | 
| base\_trial | The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. **Required** Valid values: String  | 
| stddev |  Defines a factor by which to multiply the standard deviation. For example, the rule is invoked by default when a step duration is larger or smaller than 5 times the standard deviation.  **Optional** Valid values: Integer Default value: `5`  | 
| mode | The mode under which steps have been saved and on which the rule runs. By default, the rule runs on steps from the `EVAL` and `TRAIN` phases. **Optional** Valid values: String  | 
| n\_outliers | The number of outliers to ignore before the rule returns `True`. **Optional** Valid values: Integer Default value: `10`  | 
| scan\_interval\_us |  The time interval with which timeline files are scanned. **Optional** Valid values: Integer Default value: `60000000` (in microseconds)  | 
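
The `stddev` and `n_outliers` parameters combine as sketched below in plain Python. This is a simplified illustration of the decision, not Debugger's actual implementation; the function name and sample data are hypothetical.

```python
from statistics import mean, stdev

def step_outlier(durations, stddev=5, n_outliers=10):
    """Simplified sketch of the StepOutlier decision.

    A step counts as an outlier when its duration deviates from the
    mean by more than `stddev` standard deviations. The rule fires
    once more than `n_outliers` such steps have been seen.
    """
    mu, sigma = mean(durations), stdev(durations)
    outliers = [d for d in durations if abs(d - mu) > stddev * sigma]
    return len(outliers) > n_outliers

# 1,000 steps of ~1 s with 11 stalls of 100 s: rule fires
assert step_outlier([1.0] * 1000 + [100.0] * 11) is True
# Only 5 stalls: below the n_outliers default of 10
assert step_outlier([1.0] * 1000 + [100.0] * 5) is False
```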

# Amazon SageMaker Debugger UI in Amazon SageMaker Studio Classic Experiments
<a name="debugger-on-studio"></a>

Use the Amazon SageMaker Debugger Insights dashboard in Amazon SageMaker Studio Classic Experiments to analyze your model performance and system bottlenecks while running training jobs on Amazon Elastic Compute Cloud (Amazon EC2) instances. Gain insights into your training jobs and improve your model training performance and accuracy with the Debugger dashboards. By default, Debugger monitors system metrics (CPU, GPU, GPU memory, network, and data I/O) every 500 milliseconds and basic output tensors (loss and accuracy) every 500 iterations for training jobs. You can also further customize Debugger configuration parameter values and adjust the saving intervals through the Studio Classic UI or using the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable). 

**Important**  
If you're using an existing Studio Classic app, delete the app and restart to use the latest Studio Classic features. For instructions on how to restart and update your Studio Classic environment, see [Update Amazon SageMaker AI Studio Classic](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-tasks-update.html). 

**Topics**
+ [Open the Amazon SageMaker Debugger Insights dashboard](debugger-on-studio-insights.md)
+ [Amazon SageMaker Debugger Insights dashboard controller](debugger-on-studio-insights-controllers.md)
+ [Explore the Amazon SageMaker Debugger Insights dashboard](debugger-on-studio-insights-walkthrough.md)
+ [Shut down the Amazon SageMaker Debugger Insights instance](debugger-on-studio-insights-close.md)

# Open the Amazon SageMaker Debugger Insights dashboard
<a name="debugger-on-studio-insights"></a>

In the SageMaker Debugger Insights dashboard in Studio Classic, you can see compute resource utilization and system bottleneck information for your training job running on Amazon EC2 instances, both in real time and after training.

**Note**  
The SageMaker Debugger Insights dashboard runs a Studio Classic application on an `ml.m5.4xlarge` instance to process and render the visualizations. Each SageMaker Debugger Insights tab runs one Studio Classic kernel session. Multiple kernel sessions for multiple SageMaker Debugger Insights tabs run on the single instance. When you close a SageMaker Debugger Insights tab, the corresponding kernel session is also closed. The Studio Classic application remains active and accrues charges for the `ml.m5.4xlarge` instance usage. For information about pricing, see the [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/) page.

**Important**  
When you are done using the SageMaker Debugger Insights dashboard, you must shut down the `ml.m5.4xlarge` instance to avoid accruing charges. For instructions on how to shut down the instance, see [Shut down the Amazon SageMaker Debugger Insights instance](debugger-on-studio-insights-close.md).

**To open the SageMaker Debugger Insights dashboard**

1. On the Studio Classic **Home** page, choose **Experiments** in the left navigation pane.

1. Search for your training job on the **Experiments** page. If your training job is set up with an Experiments run, the job should appear in the **Experiments** tab; if you didn't set up an Experiments run, the job should appear in the **Unassigned runs** tab.

1. Choose the training job name link to see the job details.

1. Under the **OVERVIEW** menu, choose **Debugger**. This should show the following two sections.
   + In the **Debugger rules** section, you can browse the status of the Debugger built-in rules associated with the training job.
   + In the **Debugger insights** section, you can find links to open SageMaker Debugger Insights on the dashboard.

1. In the **SageMaker Debugger Insights** section, choose the link of the training job name to open the SageMaker Debugger Insights dashboard. This opens a **Debug [your-training-job-name]** window. In this window, Debugger provides an overview of the computational performance of your training job on Amazon EC2 instances and helps you identify issues in compute resource utilization.

You can also download an aggregated profiling report by adding the built-in [ProfilerReport](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html#profiler-report) rule of SageMaker Debugger. For more information, see [Configure Built-in Profiler Rules](https://docs.aws.amazon.com/sagemaker/latest/dg/use-debugger-built-in-profiler-rules.html) and [Profiling Report Generated Using SageMaker Debugger](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-profiling-report.html).

# Amazon SageMaker Debugger Insights dashboard controller
<a name="debugger-on-studio-insights-controllers"></a>

The Debugger controller provides different components for monitoring and profiling. In this guide, you learn about these components.

**Note**  
The SageMaker Debugger Insights dashboard runs a Studio Classic app on an `ml.m5.4xlarge` instance to process and render the visualizations. Each SageMaker Debugger Insights tab runs one Studio Classic kernel session. Multiple kernel sessions for multiple SageMaker Debugger Insights tabs run on the single instance. When you close a SageMaker Debugger Insights tab, the corresponding kernel session is also closed. The Studio Classic app remains active and accrues charges for the `ml.m5.4xlarge` instance usage. For information about pricing, see the [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/) page.

**Important**  
When you are done using the SageMaker Debugger Insights dashboard, shut down the `ml.m5.4xlarge` instance to avoid accruing charges. For instructions on how to shut down the instance, see [Shut down the Amazon SageMaker Debugger Insights instance](debugger-on-studio-insights-close.md).

## SageMaker Debugger Insights controller UI
<a name="debugger-on-studio-insights-controller"></a>

Using the Debugger controller located at the upper-left corner of the Insights dashboard, you can refresh the dashboard, configure or update Debugger settings for monitoring system metrics, stop a training job, and download a Debugger profiling report.

![\[SageMaker Debugger Insights Dashboard Controllers\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-studio-insights-refresh.png)

+ If you want to manually refresh the dashboard, choose the refresh button (the round arrow at the upper-left corner) as shown in the preceding screenshot. 
+ The **Monitoring** toggle button is on by default for any SageMaker training job initiated using the SageMaker Python SDK. If not activated, you can use the toggle button to start monitoring. During monitoring, Debugger only collects resource utilization metrics to detect computational problems such as CPU bottlenecks and GPU underutilization. For a complete list of resource utilization problems that Debugger monitors, see [Debugger built-in rules for profiling hardware system resource utilization (system metrics)](debugger-built-in-profiler-rules.md#built-in-rules-monitoring).
+ The **Configure monitoring** button opens a pop-up window that you can use to set or update the data collection frequency and the S3 path to save the data.   
![\[The pop-up window for configuring Debugger monitoring settings\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-studio-enable-profiling-2.png)

  You can specify values for the following fields.
  + **S3 bucket URI**: Specify the base S3 bucket URI.
  + **Collect monitoring data every**: Select a time interval to collect system metrics. You can choose one of the monitoring intervals from the dropdown list. Available intervals are 100 milliseconds, 200 milliseconds, 500 milliseconds (default), 1 second, 5 seconds, and 1 minute. 
**Note**  
If you choose one of the lower time intervals, you increase the granularity of resource utilization metrics, so you can capture spikes and anomalies with a higher time resolution. However, the higher the resolution, the larger the volume of system metrics to process. This might introduce additional overhead and impact the overall training and processing time.
+ Using the **Stop training** button, you can stop the training job when you find anomalies in resource utilization.
+ Using the **Download report** button, you can download an aggregated profiling report by using the built-in [ProfilerReport](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html#profiler-report) rule of SageMaker Debugger. The button is activated when you add the built-in [ProfilerReport](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html#profiler-report) rule to the estimator. For more information, see [Configure Built-in Profiler Rules](https://docs.aws.amazon.com/sagemaker/latest/dg/use-debugger-built-in-profiler-rules.html) and [Profiling Report Generated Using SageMaker Debugger](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-profiling-report.html).
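
To put the granularity tradeoff of the monitoring interval in concrete terms, the number of samples collected per hour for each metric grows inversely with the interval. The following is a back-of-the-envelope calculation, not an official sizing guide.

```python
def samples_per_hour(interval_ms):
    """Number of samples collected in one hour, per metric, at a given
    monitoring interval in milliseconds."""
    return 3_600_000 // interval_ms

# Moving from the 500 ms default to the finest 100 ms interval
# multiplies the data volume by 5
assert samples_per_hour(500) == 7200
assert samples_per_hour(100) == 36000
```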

# Explore the Amazon SageMaker Debugger Insights dashboard
<a name="debugger-on-studio-insights-walkthrough"></a>

When you initiate a SageMaker training job, SageMaker Debugger starts monitoring the resource utilization of the Amazon EC2 instances by default. You can track the system utilization rates, statistics overview, and built-in rule analysis through the Insights dashboard. This guide walks you through the content of the SageMaker Debugger Insights dashboard under the following tabs: **System Metrics** and **Rules**. 

**Note**  
The SageMaker Debugger Insights dashboard runs a Studio Classic application on an `ml.m5.4xlarge` instance to process and render the visualizations. Each SageMaker Debugger Insights tab runs one Studio Classic kernel session. Multiple kernel sessions for multiple SageMaker Debugger Insights tabs run on the single instance. When you close a SageMaker Debugger Insights tab, the corresponding kernel session is also closed. The Studio Classic application remains active and accrues charges for the `ml.m5.4xlarge` instance usage. For information about pricing, see the [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/) page.

**Important**  
When you are done using the SageMaker Debugger Insights dashboard, shut down the `ml.m5.4xlarge` instance to avoid accruing charges. For instructions on how to shut down the instance, see [Shut down the Amazon SageMaker Debugger Insights instance](debugger-on-studio-insights-close.md).

**Important**  
In the reports, plots and recommendations are provided for informational purposes and are not definitive. You are responsible for making your own independent assessment of the information.

**Topics**
+ [System metrics](#debugger-insights-system-metrics-tab)
+ [Rules](#debugger-on-studio-insights-rules)

## System metrics
<a name="debugger-insights-system-metrics-tab"></a>

In the **System Metrics** tab, you can use the summary table and timeseries plots to understand resource utilization.

### Resource utilization summary
<a name="debugger-on-studio-insights-sys-resource-summary"></a>

This summary table shows the statistics of compute resource utilization metrics of all nodes (denoted as algo-*n*). The resource utilization metrics include the total CPU utilization, the total GPU utilization, the total CPU memory utilization, the total GPU memory utilization, the total I/O wait time, and the total network in bytes. The table shows the minimum and the maximum values, and p99, p90, and p50 percentiles.

![\[A summary table of resource utilization\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-studio-insights-resource-util-summary.png)


### Resource utilization time series plots
<a name="debugger-on-studio-insights-sys-controller"></a>

Use the time series graphs to see more details of resource utilization and to identify the time intervals in which each instance shows undesired utilization rates, such as low GPU utilization or CPU bottlenecks, which waste expensive instance capacity.

**The time series graph controller UI**

The following screenshot shows the UI controller for adjusting the time series graphs.

![\[\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-insights-graph-controller.png)

+ **algo-1**: Use this dropdown menu to choose the node that you want to look into.
+ **Zoom In**: Use this button to zoom in the time series graphs and view shorter time intervals.
+ **Zoom Out**: Use this button to zoom out the time series graphs and view wider time intervals.
+ **Pan Left**: Move the time series graphs to an earlier time interval.
+ **Pan Right**: Move the time series graphs to a later time interval.
+ **Fix Timeframe**: Use this check box to fix or bring back the time series graphs to show the whole view from the first data point to the last data point.

**CPU utilization and I/O wait time**

The first two graphs show CPU utilization and I/O wait time over time. By default, the graphs show the average of the CPU utilization rate and I/O wait time spent on the CPU cores. You can select one or more CPU cores by selecting their labels to graph them on a single chart and compare utilization across cores. You can drag and zoom in and out to have a closer look at specific time intervals.

![\[debugger-studio-insight-mockup\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-insights-node-cpu.png)


**GPU utilization and GPU memory utilization**

The following graphs show GPU utilization and GPU memory utilization over time. By default, the graphs show the mean utilization rate over time. You can select the GPU core labels to see the utilization rate of each core. Taking the mean of the utilization rates over the total number of GPU cores shows the mean utilization of the entire hardware system resource. By looking at the mean utilization rate, you can check the overall system resource usage of an Amazon EC2 instance. The following figure shows an example training job on an `ml.p3.16xlarge` instance with 8 GPU cores. You can monitor whether the training job is well distributed and fully utilizes all GPUs.

![\[debugger-studio-insight-mockup\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-studio-insights-node-gpu.gif)


**Overall system utilization over time**

The following heatmap shows an example of the entire system utilization of an `ml.p3.16xlarge` instance over time, projected onto the two-dimensional plot. Every CPU and GPU core is listed in the vertical axis, and the utilization is recorded over time with a color scheme, where the bright colors represent low utilization and the darker colors represent high utilization. See the labeled color bar on the right side of the plot to find out which color level corresponds to which utilization rate.

![\[debugger-studio-insight-mockup\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-studio-insights-node-heatmap.png)


## Rules
<a name="debugger-on-studio-insights-rules"></a>

Use the **Rules** tab to find a summary of the profiling rule analysis on your training job. If a profiling rule is activated for the training job, its name appears in solid white text. Inactive rules are dimmed in gray text. To activate these rules, follow the instructions at [Use built-in profiler rules managed by Amazon SageMaker Debugger](use-debugger-built-in-profiler-rules.md).

![\[The Rules tab in the SageMaker Debugger Insights dashboard\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-insights-rules.png)


# Shut down the Amazon SageMaker Debugger Insights instance
<a name="debugger-on-studio-insights-close"></a>

When you are not using the SageMaker Debugger Insights dashboard, you should shut down the app instance to avoid incurring additional fees.

**To shut down the SageMaker Debugger Insights app instance in Studio Classic**

![\[An animated screenshot that shows how to shut down a SageMaker Debugger Insights dashboard instance.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-studio-insights-shut-down.png)


1. In Studio Classic, select the **Running Instances and Kernels** icon (![\[Square icon with a white outline of a cloud on a dark blue background.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/icons/Running_squid.png)). 

1. Under the **RUNNING APPS** list, look for the **sagemaker-debugger-1.0** app. Select the shutdown icon (![\[Power button icon with a circular shape and vertical line symbol.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/icons/Shutdown_light.png)) next to the app. The SageMaker Debugger Insights dashboards run on an `ml.m5.4xlarge` instance. This instance also disappears from the **RUNNING INSTANCES** when you shut down the **sagemaker-debugger-1.0** app. 

# SageMaker Debugger interactive report
<a name="debugger-profiling-report"></a>

Receive profiling reports autogenerated by Debugger. The Debugger report provides insights into your training jobs and suggests recommendations to improve your model performance. The following screenshot shows a collage of the Debugger profiling report.

**Note**  
You can download a Debugger report while your training job is running or after the job has finished. During training, Debugger concurrently updates the report to reflect the current rules' evaluation status. You can download a complete Debugger report only after the training job has completed.

**Important**  
In the reports, plots and recommendations are provided for informational purposes and are not definitive. You are responsible for making your own independent assessment of the information.

![\[An example of a Debugger training job summary report\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-profile-report.jpg)


For any SageMaker training jobs, the SageMaker Debugger [ProfilerReport](debugger-built-in-profiler-rules.md#profiler-report) rule invokes all of the [monitoring and profiling rules](debugger-built-in-profiler-rules.md#built-in-rules-monitoring) and aggregates the rule analysis into a comprehensive report. Following this guide, download the report using the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) or the S3 console, and learn what you can interpret from the profiling results.

**Important**  
In the report, plots and recommendations are provided for informational purposes and are not definitive. You are responsible for making your own independent assessment of the information.

# Download the SageMaker Debugger profiling report
<a name="debugger-profiling-report-download"></a>

Download the SageMaker Debugger profiling report while your training job is running or after the job has finished using the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) and AWS Command Line Interface (CLI).

**Note**  
To get the profiling report generated by SageMaker Debugger, you must use the built-in [ProfilerReport](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html#profiler-report) rule offered by SageMaker Debugger. To activate the rule with your training job, see [Configure Built-in Profiler Rules](https://docs.aws.amazon.com/sagemaker/latest/dg/use-debugger-built-in-profiler-rules.html).

**Tip**  
You can also download the report with a single click in the SageMaker Studio Debugger insights dashboard. This doesn't require any additional scripting to download the report. To find out how to download the report from Studio, see [Open the Amazon SageMaker Debugger Insights dashboard](debugger-on-studio-insights.md).

------
#### [ Download using SageMaker Python SDK and AWS CLI ]

1. Check the current job's default S3 output base URI.

   ```
   estimator.output_path
   ```

1. Check the current job name.

   ```
   estimator.latest_training_job.job_name
   ```

1. The Debugger profiling report is stored under `<default-s3-output-base-uri>/<training-job-name>/rule-output`. Configure the rule output path as follows:

   ```
   rule_output_path = estimator.output_path + estimator.latest_training_job.job_name + "/rule-output"
   ```

1. To check if the report is generated, list directories and files recursively under the `rule_output_path` using `aws s3 ls` with the `--recursive` option.

   ```
   ! aws s3 ls {rule_output_path} --recursive
   ```

   This should return a complete list of files under an autogenerated folder named `ProfilerReport-1234567890`. The folder name combines the string `ProfilerReport` with a unique 10-digit tag based on the Unix timestamp of when the ProfilerReport rule was initiated.   
![\[An example of rule output\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-rule-output-ls-example.png)

   The `profiler-report.html` file is the profiling report autogenerated by Debugger. The remaining files are the built-in rule analysis components, stored in JSON, and a Jupyter notebook used to aggregate them into the report.

1. Download the files recursively using `aws s3 cp`. The following command saves all of the rule output files to the `ProfilerReport-1234567890` folder under the current working directory.

   ```
   ! aws s3 cp {rule_output_path} ./ --recursive
   ```
**Tip**  
If using a Jupyter notebook server, run `!pwd` to double check the current working directory.

1. Under the `/ProfilerReport-1234567890/profiler-output` directory, open `profiler-report.html`. If using JupyterLab, choose **Trust HTML** to see the autogenerated Debugger profiling report.  
![\[An example of rule output\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-rule-output-open-html.png)

1. Open the `profiler-report.ipynb` file to explore how the report is generated. You can also customize and extend the profiling report using the Jupyter notebook file.
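
Note that the string concatenation in step 3 assumes `estimator.output_path` ends with a trailing slash, which holds for the default output path but is not guaranteed for a user-supplied one. A defensive variant, shown here with hypothetical bucket and job names, joins the pieces correctly either way:

```python
def build_rule_output_path(output_path, job_name):
    """Join the S3 output base URI and the training job name into the
    rule output path, regardless of whether the base URI carries a
    trailing slash."""
    return output_path.rstrip("/") + "/" + job_name + "/rule-output"

# Hypothetical bucket and job names for illustration
assert build_rule_output_path(
    "s3://sagemaker-us-east-1-111122223333/", "my-training-job"
) == "s3://sagemaker-us-east-1-111122223333/my-training-job/rule-output"
assert build_rule_output_path(
    "s3://sagemaker-us-east-1-111122223333", "my-training-job"
) == "s3://sagemaker-us-east-1-111122223333/my-training-job/rule-output"
```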

------
#### [ Download using Amazon S3 Console ]

1. Sign in to the AWS Management Console and open the Amazon S3 console at [https://console.aws.amazon.com/s3/](https://console.aws.amazon.com/s3/).

1. Search for the base S3 bucket. For example, if you haven't specified any base job name, the base S3 bucket name should be in the following format: `sagemaker-<region>-111122223333`. Look up the base S3 bucket through the *Find bucket by name* field.  
![\[An example to the rule output S3 bucket URI\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-report-download-s3console-0.png)

1. In the base S3 bucket, look up the training job name by specifying your job name prefix into the *Find objects by prefix* input field. Choose the training job name.  
![\[An example to the rule output S3 bucket URI\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-report-download-s3console-1.png)

1. In the training job's S3 bucket, there are three subfolders for training data collected by Debugger: **debug-output/**, **profiler-output/**, and **rule-output/**. Choose **rule-output/**.   
![\[An example to the rule output S3 bucket URI\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-report-download-s3console-2.png)

1. In the **rule-output/** folder, choose **ProfilerReport-1234567890**, and then choose the **profiler-output/** folder. The **profiler-output/** folder contains **profiler-report.html** (the autogenerated profiling report in HTML), **profiler-report.ipynb** (a Jupyter notebook with the scripts used to generate the report), and a **profiler-report/** folder (which contains the rule analysis JSON files used as components of the report).

1. Select the **profiler-report.html** file, choose **Actions**, and **Download**.  
![\[An example to the rule output S3 bucket URI\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-report-download-s3console-3.png)

1. Open the downloaded **profiler-report.html** file in a web browser.

------

**Note**  
If you started your training job without configuring the Debugger-specific parameters, Debugger generates the report based only on the system monitoring rules because the Debugger parameters are not configured to save framework metrics. To enable framework metrics profiling and receive an extended Debugger profiling report, configure the `profiler_config` parameter when constructing or updating SageMaker AI estimators.  
To learn how to configure the `profiler_config` parameter before starting a training job, see [Estimator configuration for framework profiling](debugger-configure-framework-profiling.md).  
To update the current training job and enable framework metrics profiling, see [Update Debugger Framework Profiling Configuration](debugger-update-monitoring-profiling.md).
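For example, a hedged sketch of enabling framework profiling with the SageMaker Python SDK might look like the following. The script name, role ARN, instance settings, and step range here are placeholders, not recommendations:

```python
# Sketch only: the entry point, role ARN, and instance settings are placeholders.
from sagemaker.debugger import ProfilerConfig, FrameworkProfile
from sagemaker.tensorflow import TensorFlow

profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500,
    # Collect framework metrics for ten steps starting at step 5 (example values).
    framework_profile_params=FrameworkProfile(start_step=5, num_steps=10),
)

estimator = TensorFlow(
    entry_point="train.py",                                      # placeholder script
    role="arn:aws:iam::111122223333:role/your-sagemaker-role",   # placeholder role
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.9",
    py_version="py39",
    profiler_config=profiler_config,
)
```

With this configuration, the Debugger report includes the framework metrics sections in addition to the system monitoring sections.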

# Debugger profiling report walkthrough
<a name="debugger-profiling-report-walkthrough"></a>

This section walks you through the Debugger profiling report section by section. The profiling report is generated based on the built-in rules for monitoring and profiling. The report shows result plots only for the rules that found issues.

**Important**  
In the report, plots and recommendations are provided for informational purposes and are not definitive. You are responsible for making your own independent assessment of the information.

**Topics**
+ [Training job summary](#debugger-profiling-report-walkthrough-summary)
+ [System usage statistics](#debugger-profiling-report-walkthrough-system-usage)
+ [Framework metrics summary](#debugger-profiling-report-walkthrough-framework-metrics)
+ [Rules summary](#debugger-profiling-report-walkthrough-rules-summary)
+ [Analyzing the training loop – step durations](#debugger-profiling-report-walkthrough-step-durations)
+ [GPU utilization analysis](#debugger-profiling-report-walkthrough-gpu-utilization)
+ [Batch size](#debugger-profiling-report-walkthrough-batch-size)
+ [CPU bottlenecks](#debugger-profiling-report-walkthrough-cpu-bottlenecks)
+ [I/O bottlenecks](#debugger-profiling-report-walkthrough-io-bottlenecks)
+ [Load balancing in multi-GPU training](#debugger-profiling-report-walkthrough-workload-balancing)
+ [GPU memory analysis](#debugger-profiling-report-walkthrough-gpu-memory)

## Training job summary
<a name="debugger-profiling-report-walkthrough-summary"></a>

At the beginning of the report, Debugger provides a summary of your training job. In this section, you can review the durations and timestamps of the different training phases.

![\[An example of Debugger profiling report\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-profiling-report-summary.gif)


The summary table contains the following information:
+ **start\_time** – The exact time when the training job started.
+ **end\_time** – The exact time when the training job finished.
+ **job\_duration\_in\_seconds** – The total training time from the **start\_time** to the **end\_time**.
+ **training\_loop\_start** – The exact time when the first step of the first epoch started.
+ **training\_loop\_end** – The exact time when the last step of the last epoch finished.
+ **training\_loop\_duration\_in\_seconds** – The total time between the training loop start time and the training loop end time.
+ **initialization\_in\_seconds** – Time spent initializing the training job. The initialization phase covers the period from the **start\_time** to the **training\_loop\_start** time. Initialization time is spent on compiling and starting the training script, creating and initializing the model, launching EC2 instances, and downloading training data.
+ **finalization\_in\_seconds** – Time spent finalizing the training job, such as finishing the model training, updating the model artifacts, and shutting down the EC2 instances. The finalization phase covers the period from the **training\_loop\_end** time to the **end\_time**.
+ **initialization (%)** – The percentage of time spent on **initialization** over the total **job\_duration\_in\_seconds**. 
+ **training loop (%)** – The percentage of time spent on the **training loop** over the total **job\_duration\_in\_seconds**.
+ **finalization (%)** – The percentage of time spent on **finalization** over the total **job\_duration\_in\_seconds**.
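The percentage rows follow directly from the four timestamps. The following minimal sketch, using hypothetical timestamps, shows the arithmetic:

```python
# Hypothetical timestamps (seconds since the epoch) for illustration only.
start_time = 1_600_000_000.0
training_loop_start = 1_600_000_120.0  # initialization took 120 s
training_loop_end = 1_600_003_720.0    # the training loop took 3,600 s
end_time = 1_600_003_750.0             # finalization took 30 s

job_duration = end_time - start_time
initialization = training_loop_start - start_time
training_loop = training_loop_end - training_loop_start
finalization = end_time - training_loop_end

initialization_pct = round(100 * initialization / job_duration, 1)  # 3.2
training_loop_pct = round(100 * training_loop / job_duration, 1)    # 96.0
finalization_pct = round(100 * finalization / job_duration, 1)      # 0.8
```

A low training loop percentage relative to initialization or finalization is often the first hint that time is being lost outside of actual training.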

## System usage statistics
<a name="debugger-profiling-report-walkthrough-system-usage"></a>

In this section, you can see an overview of system utilization statistics.

![\[An example of Debugger profiling report\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-profiling-report-system-usage.png)


The Debugger profiling report includes the following information:
+ **node** – Lists the names of the nodes. If you use distributed training on multiple nodes (multiple EC2 instances), the node names are in the format `algo-n`.
+ **metric** – The system metrics collected by Debugger: CPU, GPU, CPU memory, GPU memory, I/O, and Network metrics.
+ **unit** – The unit of the system metrics.
+ **max** – The maximum value of each system metric.
+ **p99** – The 99th percentile of each system utilization.
+ **p95** – The 95th percentile of each system utilization.
+ **p50** – The 50th percentile (median) of each system utilization.
+ **min** – The minimum value of each system metric.
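The summary statistics can be illustrated with a short sketch. The utilization samples and the nearest-rank percentile helper below are hypothetical and only approximate how the report computes its statistics:

```python
import math

# Hypothetical GPU utilization samples (%).
samples = [88, 91, 35, 97, 76, 93, 12, 89, 94, 90]

def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

stats = {
    "max": max(samples),
    "p99": percentile(samples, 99),
    "p95": percentile(samples, 95),
    "p50": percentile(samples, 50),
    "min": min(samples),
}
print(stats)
```

Comparing p99 against p50 in this way shows whether utilization is steady or spiky, which is the same reading the report's table supports.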

## Framework metrics summary
<a name="debugger-profiling-report-walkthrough-framework-metrics"></a>

In this section, the following pie charts show the breakdown of framework operations on CPUs and GPUs.

![\[An example of Debugger profiling report\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-profiling-report-framework-metrics-summary.gif)


Each of the pie charts analyzes the collected framework metrics in various aspects as follows:
+ **Ratio between TRAIN/EVAL phase and others** – Shows the ratio between time durations spent on different training phases.
+ **Ratio between forward and backward pass** – Shows the ratio between time durations spent on forward and backward pass in the training loop.
+ **Ratio between CPU/GPU operators** – Shows the ratio between time spent on operators running on CPU or GPU, such as convolutional operators.
+ **General metrics recorded in framework** – Shows the ratio between time spent on major framework metrics, such as data loading, forward and backward pass.

### Overview: CPU Operators
<a name="debugger-profiling-report-walkthrough-cpu-operators"></a>

This section provides detailed information about the CPU operators. The table shows the percentage of the time and the absolute cumulative time spent on the most frequently called CPU operators.

![\[An example of Debugger profiling report\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-profiling-report-framework-cpu-operators.gif)


### Overview: GPU operators
<a name="debugger-profiling-report-walkthrough-gpu-operators"></a>

This section provides detailed information about the GPU operators. The table shows the percentage of the time and the absolute cumulative time spent on the most frequently called GPU operators.

![\[An example of Debugger profiling report\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-profiling-report-framework-gpu-operators.gif)


## Rules summary
<a name="debugger-profiling-report-walkthrough-rules-summary"></a>

In this section, Debugger aggregates all of the rule evaluation results, analysis, rule descriptions, and suggestions.

![\[An example of Debugger profiling report\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-profiling-report-rules-summary.png)


## Analyzing the training loop – step durations
<a name="debugger-profiling-report-walkthrough-step-durations"></a>

In this section, you can find detailed statistics of step durations on each GPU core of each node. Debugger evaluates the mean, maximum, p99, p95, p50, and minimum values of step durations, and evaluates step outliers. The following histogram shows the step durations captured on different worker nodes and GPUs. You can enable or disable the histogram of each worker by choosing the legends on the right side. You can check if there is a particular GPU that's causing step duration outliers.

![\[An example of Debugger profiling report\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-profiling-report-framework-step-duration.gif)


## GPU utilization analysis
<a name="debugger-profiling-report-walkthrough-gpu-utilization"></a>

This section shows detailed statistics about GPU core utilization based on the LowGPUUtilization rule. It also summarizes GPU utilization statistics (mean, p95, and p5) to determine if the training job is underutilizing GPUs.

## Batch size
<a name="debugger-profiling-report-walkthrough-batch-size"></a>

This section shows detailed statistics of total CPU utilization, individual GPU utilizations, and GPU memory footprints. The BatchSize rule determines whether you need to change the batch size to better utilize the GPUs. You can check whether the batch size is too small, resulting in underutilization, or too large, causing overutilization and out-of-memory issues. In the plot, the boxes show the p25 to p75 percentile ranges (filled with dark purple and bright yellow, respectively) around the median (p50), and the error bars show the 5th percentile for the lower bound and the 95th percentile for the upper bound.

![\[An example of Debugger profiling report\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-profiling-report-batch-size.png)


## CPU bottlenecks
<a name="debugger-profiling-report-walkthrough-cpu-bottlenecks"></a>

In this section, you can drill down into the CPU bottlenecks that the CPUBottleneck rule detected from your training job. The rule checks if the CPU utilization is above `cpu_threshold` (90% by default) and also if the GPU utilization is below `gpu_threshold` (10% by default).

![\[An example of Debugger profiling report\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-profiling-report-cpu-bottlenecks.png)


The pie charts show the following information:
+ **Low GPU usage caused by CPU bottlenecks** – Shows the ratio of data points between the ones with GPU utilization above and below the threshold and the ones that match the CPU bottleneck criteria.
+ **Ratio between TRAIN/EVAL phase and others** – Shows the ratio between time durations spent on different training phases.
+ **Ratio between forward and backward pass** – Shows the ratio between time durations spent on forward and backward pass in the training loop.
+ **Ratio between CPU/GPU operators** – Shows the ratio between time durations spent on GPUs and CPUs by Python operators, such as data loader processes and forward and backward pass operators.
+ **General metrics recorded in framework** – Shows major framework metrics and the ratio between time durations spent on the metrics.
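The rule's criteria can be sketched in a few lines. The (CPU, GPU) utilization data points below are hypothetical; the thresholds are the rule's defaults:

```python
# Hypothetical (CPU %, GPU %) utilization data points, checked against the
# CPUBottleneck rule's default thresholds (cpu_threshold=90, gpu_threshold=10).
cpu_threshold, gpu_threshold = 90, 10
samples = [
    (95, 5), (97, 8), (40, 85), (92, 60), (35, 90), (98, 3),
]

bottlenecks = [
    (cpu, gpu) for cpu, gpu in samples
    if cpu > cpu_threshold and gpu < gpu_threshold
]
ratio = len(bottlenecks) / len(samples)
print(f"{len(bottlenecks)} of {len(samples)} data points match the CPU bottleneck criteria ({ratio:.0%})")
```

A high ratio of such data points usually points at data preprocessing or data loading work starving the GPUs.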

## I/O bottlenecks
<a name="debugger-profiling-report-walkthrough-io-bottlenecks"></a>

In this section, you can find a summary of I/O bottlenecks. The rule evaluates the I/O wait time and GPU utilization rates and monitors if the time spent on the I/O requests exceeds a threshold percent of the total training time. It might indicate I/O bottlenecks where GPUs are waiting for data to arrive from storage.
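Conceptually, the check compares the share of time spent waiting on I/O against a threshold percentage of total training time. The durations and the threshold in the following sketch are hypothetical:

```python
# Hypothetical durations; the threshold mirrors the rule's idea of flagging
# I/O wait time that exceeds a percentage of total training time.
total_training_seconds = 3600.0
io_wait_seconds = 2100.0
io_percent_threshold = 50.0  # hypothetical threshold (%)

io_wait_percent = 100.0 * io_wait_seconds / total_training_seconds
is_bottleneck = io_wait_percent > io_percent_threshold
print(f"I/O wait: {io_wait_percent:.1f}% of training time; bottleneck: {is_bottleneck}")
```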

## Load balancing in multi-GPU training
<a name="debugger-profiling-report-walkthrough-workload-balancing"></a>

In this section, you can identify workload balancing issues across GPUs.

![\[An example of Debugger profiling report\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-profiling-report-workload-balancing.gif)


## GPU memory analysis
<a name="debugger-profiling-report-walkthrough-gpu-memory"></a>

In this section, you can analyze the GPU memory utilization collected by the GPUMemoryIncrease rule. In the plot, the boxes show the p25 to p75 percentile ranges (filled with dark purple and bright yellow, respectively) around the median (p50), and the error bars show the 5th percentile for the lower bound and the 95th percentile for the upper bound.

![\[An example of Debugger profiling report\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-profiling-report-gpu-memory-utilization.png)


# Opt out of the collection of Amazon SageMaker Debugger usage statistics
<a name="debugger-telemetry"></a>

For all SageMaker training jobs, Amazon SageMaker Debugger runs the [ProfilerReport](debugger-built-in-profiler-rules.md#profiler-report) rule and autogenerates a [SageMaker Debugger interactive report](debugger-profiling-report.md). The `ProfilerReport` rule provides a Jupyter notebook file (`profiler-report.ipynb`) that generates a corresponding HTML file (`profiler-report.html`). 

Debugger collects profiling report usage statistics by including code in the Jupyter notebook that collects the unique `ProfilerReport` rule's processing job ARN if the user opens the final `profiler-report.html` file.

Debugger only collects information about whether a user opens the final HTML report. It **DOES NOT** collect any information from training jobs, training data, training scripts, processing jobs, logs, or the content of the profiling report itself.

You can opt out of the collection of usage statistics using one of the following options.

## (Recommended) Option 1: Opt out before running a training job
<a name="debugger-telemetry-profiler-report-opt-out-1"></a>

To opt out, you need to add the following Debugger `ProfilerReport` rule configuration to your training job request.

------
#### [ SageMaker Python SDK ]

```
from sagemaker.debugger import ProfilerRule, rule_configs

estimator=sagemaker.estimator.Estimator(
    ...

    rules=[
        ProfilerRule.sagemaker(
            base_config=rule_configs.ProfilerReport(),
            rule_parameters={"opt_out_telemetry": "True"}
        )
    ]
)
```

------
#### [ AWS CLI ]

```
"ProfilerRuleConfigurations": [ 
    { 
        "RuleConfigurationName": "ProfilerReport-1234567890",
        "RuleEvaluatorImage": "895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest",
        "RuleParameters": {
            "rule_to_invoke": "ProfilerReport", 
            "opt_out_telemetry": "True"
        }
    }
]
```

------
#### [ AWS SDK for Python (Boto3) ]

```
ProfilerRuleConfigurations=[ 
    {
        'RuleConfigurationName': 'ProfilerReport-1234567890',
        'RuleEvaluatorImage': '895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest',
        'RuleParameters': {
            'rule_to_invoke': 'ProfilerReport',
            'opt_out_telemetry': 'True'
        }
    }
]
```

------

## Option 2: Opt out after a training job has completed
<a name="debugger-telemetry-profiler-report-opt-out-2"></a>

To opt out after training has completed, you need to modify the `profiler-report.ipynb` file. 

**Note**  
If your training job request did not include the configuration from **Option 1**, the autogenerated HTML reports still report usage statistics even after you opt out using **Option 2**.

1. Follow the instructions on downloading the Debugger profiling report files in the [Download the SageMaker Debugger profiling report](debugger-profiling-report-download.md) page.

1. In the `/ProfilerReport-1234567890/profiler-output` directory, open `profiler-report.ipynb`. 

1. Add `opt_out=True` to the `setup_profiler_report()` function call in the fifth code cell, as shown in the following example code:

   ```
   setup_profiler_report(processing_job_arn, opt_out=True)
   ```

1. Run the code cell to finish opting out.

# Analyze data using the Debugger Python client library
<a name="debugger-analyze-data"></a>

While your training job is running or after it has completed, you can access the training data collected by Debugger using the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) and the [SMDebug client library](https://github.com/awslabs/sagemaker-debugger/). The Debugger Python client library provides analysis and visualization tools that enable you to drill down into your training job data.

**To install the library and use its analysis tools (in a JupyterLab notebook or an IPython kernel)**

```
! pip install -U smdebug
```

The following topics walk you through how to use the Debugger Python tools to visualize and analyze the training data collected by Debugger.

**Analyze system and framework metrics**
+ [Access the profile data](debugger-analyze-data-profiling.md)
+ [Plot the system metrics and framework metrics data](debugger-access-data-profiling-default-plot.md)
+ [Access the profiling data using the pandas data parsing tool](debugger-access-data-profiling-pandas-frame.md)
+ [Access the Python profiling stats data](debugger-access-data-python-profiling.md)
+ [Merge timelines of multiple profile trace files](debugger-merge-timeline.md)
+ [Profiling data loaders](debugger-data-loading-time.md)

# Access the profile data
<a name="debugger-analyze-data-profiling"></a>

The SMDebug `TrainingJob` class reads data from the S3 bucket where the system and framework metrics are saved. 

**To set up a `TrainingJob` object and retrieve profiling event files of a training job**

```
from smdebug.profiler.analysis.notebook_utils.training_job import TrainingJob
tj = TrainingJob(training_job_name, region)
```

**Tip**  
You must specify the `training_job_name` and `region` parameters to locate the training job. There are two ways to specify the training job information:   
Use the SageMaker Python SDK while the estimator is still attached to the training job.  

  ```
  import sagemaker
  training_job_name=estimator.latest_training_job.job_name
  region=sagemaker.Session().boto_region_name
  ```
Pass strings directly.  

  ```
  training_job_name="your-training-job-name-YYYY-MM-DD-HH-MM-SS-SSS"
  region="us-west-2"
  ```

**Note**  
By default, SageMaker Debugger collects system metrics to monitor hardware resource utilization and system bottlenecks. If you run the following functions, you might receive error messages about the unavailability of framework metrics. To retrieve framework profiling data and gain insights into framework operations, you must enable framework profiling.  
If you use the SageMaker Python SDK to configure your training job request, pass `framework_profile_params` to the `profiler_config` argument of your estimator. To learn more, see [Configure SageMaker Debugger Framework Profiling](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-configure-framework-profiling.html).
If you use Studio Classic, turn on profiling using the **Profiling** toggle button in the Debugger insights dashboard. To learn more, see [SageMaker Debugger Insights Dashboard Controller](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-on-studio-insights-controllers.html).

**To retrieve a description of the training job and the S3 bucket URI where the metric data is saved**

```
tj.describe_training_job()
tj.get_config_and_profiler_s3_output_path()
```

**To check if the system and framework metrics are available from the S3 URI**

```
tj.wait_for_sys_profiling_data_to_be_available()
tj.wait_for_framework_profiling_data_to_be_available()
```

**To create system and framework reader objects after the metric data become available**

```
system_metrics_reader = tj.get_systems_metrics_reader()
framework_metrics_reader = tj.get_framework_metrics_reader()
```

**To refresh and retrieve the latest training event files**

The reader objects have an extended method, `refresh_event_file_list()`, to retrieve the latest training event files.

```
system_metrics_reader.refresh_event_file_list()
framework_metrics_reader.refresh_event_file_list()
```

# Plot the system metrics and framework metrics data
<a name="debugger-access-data-profiling-default-plot"></a>

You can use the system and framework metrics reader objects with the following visualization classes to plot timeline graphs and histograms.

**Note**  
To narrow down the metrics visualized by the following plot methods, specify the `select_dimensions` and `select_events` parameters. For example, if you specify `select_dimensions=["GPU"]`, the plot methods filter for metrics that include the "GPU" keyword. If you specify `select_events=["total"]`, the plot methods filter for metrics that end with the "total" event tag. If you specify these parameters with keyword strings, the visualization classes return charts with the filtered metrics.
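The keyword filtering described in the note can be sketched in plain Python. The metric names below are hypothetical:

```python
# Hypothetical metric names; the filter mirrors how select_dimensions and
# select_events narrow down which metrics are plotted.
metric_names = [
    "GPU0_gpu0.total", "GPU0_gpu0.memory", "CPU_cpu0.total",
    "I/O_read.total", "Network_in.total",
]
select_dimensions = ["GPU"]
select_events = ["total"]

filtered = [
    name for name in metric_names
    if any(dimension in name for dimension in select_dimensions)
    and any(name.endswith(event) for event in select_events)
]
print(filtered)
```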
+ The `MetricsHistogram` class

  ```
  from smdebug.profiler.analysis.notebook_utils.metrics_histogram import MetricsHistogram
  
  metrics_histogram = MetricsHistogram(system_metrics_reader)
  metrics_histogram.plot(
      starttime=0, 
      endtime=system_metrics_reader.get_timestamp_of_latest_available_file(), 
      select_dimensions=["CPU", "GPU", "I/O"], # optional
      select_events=["total"]                  # optional
  )
  ```
+ The `StepTimelineChart` class

  ```
  from smdebug.profiler.analysis.notebook_utils.step_timeline_chart import StepTimelineChart
  
  view_step_timeline_chart = StepTimelineChart(framework_metrics_reader)
  ```
+ The `StepHistogram` class

  ```
  from smdebug.profiler.analysis.notebook_utils.step_histogram import StepHistogram
  
  step_histogram = StepHistogram(framework_metrics_reader)
  step_histogram.plot(
      starttime=step_histogram.last_timestamp - 5 * 1000 * 1000, 
      endtime=step_histogram.last_timestamp, 
      show_workers=True
  )
  ```
+ The `TimelineCharts` class

  ```
  from smdebug.profiler.analysis.notebook_utils.timeline_charts import TimelineCharts
  
  view_timeline_charts = TimelineCharts(
      system_metrics_reader, 
      framework_metrics_reader,
      select_dimensions=["CPU", "GPU", "I/O"], # optional
      select_events=["total"]                  # optional 
  )
  
  view_timeline_charts.plot_detailed_profiler_data([700,710])
  ```
+ The `Heatmap` class

  ```
  from smdebug.profiler.analysis.notebook_utils.heatmap import Heatmap
  
  view_heatmap = Heatmap(
      system_metrics_reader,
      framework_metrics_reader,
      select_dimensions=["CPU", "GPU", "I/O"], # optional
      select_events=["total"],                 # optional
      plot_height=450
  )
  ```

# Access the profiling data using the pandas data parsing tool
<a name="debugger-access-data-profiling-pandas-frame"></a>

The following `PandasFrame` class provides tools to convert the collected profiling data to pandas DataFrames. 

```
from smdebug.profiler.analysis.utils.profiler_data_to_pandas import PandasFrame
```

The `PandasFrame` class takes the `tj` object's S3 output path, and its methods `get_all_system_metrics()` and `get_all_framework_metrics()` return system metrics and framework metrics as pandas DataFrames.

```
pf = PandasFrame(tj.profiler_s3_output_path)
system_metrics_df = pf.get_all_system_metrics()
framework_metrics_df = pf.get_all_framework_metrics(
    selected_framework_metrics=[
        'Step:ModeKeys.TRAIN', 
        'Step:ModeKeys.GLOBAL'
    ]
)
```
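Once the metrics are in DataFrames, you can analyze them with standard pandas operations. The following sketch uses hypothetical data and column names, which may differ from the actual output of `get_all_framework_metrics()`:

```python
# Illustrative sketch with hypothetical data and column names.
import pandas as pd

framework_metrics_df = pd.DataFrame({
    "framework_metric": [
        "Step:ModeKeys.TRAIN", "Step:ModeKeys.TRAIN", "Step:ModeKeys.GLOBAL",
    ],
    "duration_us": [1200, 1350, 2600],
})

# Average duration per framework metric name.
summary = framework_metrics_df.groupby("framework_metric")["duration_us"].mean()
print(summary)
```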

# Access the Python profiling stats data
<a name="debugger-access-data-python-profiling"></a>

Python profiling provides framework metrics related to Python functions and operators in your training scripts and in the SageMaker AI deep learning frameworks. 

<a name="debugger-access-data-python-profiling-modes"></a>**Training Modes and Phases for Python Profiling**

To profile specific intervals during training and partition statistics for each interval, Debugger provides tools to set modes and phases. 

For training modes, use the following `PythonProfileModes` class:

```
from smdebug.profiler.python_profile_utils import PythonProfileModes
```

This class provides the following options:
+ `PythonProfileModes.TRAIN` – Use if you want to profile the target steps in the training phase. This mode option is available only for TensorFlow.
+ `PythonProfileModes.EVAL` – Use if you want to profile the target steps in the evaluation phase. This mode option is available only for TensorFlow.
+ `PythonProfileModes.PREDICT` – Use if you want to profile the target steps in the prediction phase. This mode option is available only for TensorFlow.
+ `PythonProfileModes.GLOBAL` – Use if you want to profile the target steps in the global phase, which includes the previous three phases. This mode option is available only for PyTorch.
+ `PythonProfileModes.PRE_STEP_ZERO` – Use if you want to profile the target steps in the initialization stage, before the first training step of the first epoch starts. This phase includes the initial job submission, uploading the training scripts to EC2 instances, preparing the EC2 instances, and downloading input data. This mode option is available for both TensorFlow and PyTorch.
+ `PythonProfileModes.POST_HOOK_CLOSE` – Use if you want to profile the target steps in the finalization stage, after the training job is done and the Debugger hook is closed. This phase includes profiling data while the training jobs are finalized and completed. This mode option is available for both TensorFlow and PyTorch.

<a name="debugger-access-data-python-profiling-phases"></a>For training phases, use the following `StepPhase` class:

```
from smdebug.profiler.analysis.utils.python_profile_analysis_utils import StepPhase
```

This class provides the following options:
+ `StepPhase.START` – Use to specify the start point of the initialization phase.
+ `StepPhase.STEP_START` – Use to specify the start step of the training phase.
+ `StepPhase.FORWARD_PASS_END` – Use to specify the steps where the forward pass ends. This option is available only for PyTorch.
+ `StepPhase.STEP_END` – Use to specify the end steps in the training phase. This option is available only for TensorFlow.
+ `StepPhase.END` – Use to specify the ending point of the finalization (post-hook-close) phase. If the callback hook is not closed, the finalization phase profiling does not occur.

**Python Profiling Analysis Tools**

Debugger supports the Python profiling with two profiling tools:
+ cProfile – The standard Python profiler. cProfile collects framework metrics on CPU time for every function called while profiling is enabled.
+ Pyinstrument – A low-overhead Python profiler that samples profiling events every millisecond.

To learn more about the Python profiling options and what's collected, see [Default system monitoring and customized framework profiling with different profiling options](debugger-configure-framework-profiling-options.md).
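Because cProfile is part of the Python standard library, you can see the kind of data it collects with a short, self-contained example. The `train_step` function here is a stand-in for real training work:

```python
# Self-contained illustration of what cProfile records: CPU time for every
# function called while profiling is enabled.
import cProfile
import io
import pstats

def train_step():
    # Stand-in for real training work.
    return sum(i * i for i in range(10_000))

profiler = cProfile.Profile()
profiler.enable()
for _ in range(5):
    train_step()
profiler.disable()

# Render the collected stats, sorted by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

Debugger's Python profiling collects this same kind of per-function timing for the annotated intervals of your training script.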

The following methods of the `PythonProfileAnalysis`, `cProfileAnalysis`, `PyinstrumentAnalysis` classes are provided to fetch and analyze the Python profiling data. Each function loads the latest data from the default S3 URI.

```
from smdebug.profiler.analysis.python_profile_analysis import PythonProfileAnalysis, cProfileAnalysis, PyinstrumentAnalysis
```

To set up Python profiling objects for analysis, use the `cProfileAnalysis` or `PyinstrumentAnalysis` class as shown in the following example code. The example sets up a `cProfileAnalysis` object; to use `PyinstrumentAnalysis` instead, replace the class name.

```
python_analysis = cProfileAnalysis(
    local_profile_dir=tf_python_stats_dir, 
    s3_path=tj.profiler_s3_output_path
)
```

The following methods are available for the `cProfileAnalysis` and `PyinstrumentAnalysis` classes to fetch the Python profiling stats data:
+ `python_analysis.fetch_python_profile_stats_by_time(start_time_since_epoch_in_secs, end_time_since_epoch_in_secs)` – Takes in a start time and end time, and returns the function stats of step stats whose start or end times overlap with the provided interval.
+ `python_analysis.fetch_python_profile_stats_by_step(start_step, end_step, mode, start_phase, end_phase)` – Takes in a start step and end step and returns the function stats of all step stats whose profiled `step` satisfies `start_step <= step < end_step`. 
  + `start_step` and `end_step` (str) – Specify the start step and end step to fetch the Python profiling stats data.
  + `mode` (str) – Specify the mode of the training job using the `PythonProfileModes` enumerator class. The default is `PythonProfileModes.TRAIN`. Available options are provided in the [Training Modes and Phases for Python Profiling](#debugger-access-data-python-profiling-modes) section.
  + `start_phase` (str) – Specify the start phase in the target step(s) using the `StepPhase` enumerator class. This parameter enables profiling between different phases of training. The default is `StepPhase.STEP_START`. Available options are provided in the [Training Modes and Phases for Python Profiling](#debugger-access-data-python-profiling-phases) section.
  + `end_phase` (str) – Specify the end phase in the target step(s) using the `StepPhase` enumerator class. Available options are the same as those for the `start_phase` parameter, provided in the [Training Modes and Phases for Python Profiling](#debugger-access-data-python-profiling-phases) section. The default is `StepPhase.STEP_END`.
+ `python_analysis.fetch_profile_stats_between_modes(start_mode, end_mode)` – Fetches stats from the Python profiling between the start and end modes.
+ `python_analysis.fetch_pre_step_zero_profile_stats()` – Fetches the stats from the Python profiling until step 0.
+ `python_analysis.fetch_post_hook_close_profile_stats()` – Fetches stats from the Python profiling after the hook is closed.
+ `python_analysis.list_profile_stats()` – Returns a DataFrame of the Python profiling stats. Each row holds the metadata for each instance of profiling and the corresponding stats file (one per step).
+ `python_analysis.list_available_node_ids()` – Returns a list of the available node IDs for the Python profiling stats.

The following methods are specific to the `cProfileAnalysis` class:
+  `fetch_profile_stats_by_training_phase()` – Fetches and aggregates the Python profiling stats for every possible combination of start and end modes. For example, if training and validation phases run while detailed profiling is enabled, the combinations are `(PRE_STEP_ZERO, TRAIN)`, `(TRAIN, TRAIN)`, `(TRAIN, EVAL)`, `(EVAL, EVAL)`, and `(EVAL, POST_HOOK_CLOSE)`. All stats files within each of these combinations are aggregated.
+  `fetch_profile_stats_by_job_phase()` – Fetches and aggregates the Python profiling stats by job phase. The job phases are `initialization` (profiling until step 0), `training_loop` (training and validation), and `finalization` (profiling after the hook is closed).

# Merge timelines of multiple profile trace files
<a name="debugger-merge-timeline"></a>

The SMDebug client library provides profiling analysis and visualization tools for merging timelines of system metrics, framework metrics, and Python profiling data collected by Debugger. 

**Tip**  
Before proceeding, you need to set up a `TrainingJob` object that is used throughout the examples on this page. For more information about setting up a `TrainingJob` object, see [Access the profile data](debugger-analyze-data-profiling.md).

The `MergedTimeline` class provides tools to integrate and correlate different profiling information in a single timeline. After Debugger captures profiling data and annotations from different phases of a training job, JSON files of trace events are saved in a default `tracefolder` directory.
+ For annotations in the Python layers, the trace files are saved in `*pythontimeline.json`. 
+ For annotations in the TensorFlow C++ layers, the trace files are saved in `*model_timeline.json`. 
+ The TensorFlow profiler saves events in a `*trace.json.gz` file. 

**Tip**  
If you want to list all of the JSON trace files, use the following AWS CLI command:  

```
! aws s3 ls {tj.profiler_s3_output_path} --recursive | grep '\.json$'
```

As shown in the following animated screenshot, putting and aligning the trace events captured from the different profiling sources in a single plot provides an overview of all events occurring in different phases of the training job.

![\[An example of merged timeline\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-merged-timeline.gif)


**Tip**  
To interact with the merged timeline on the tracing app using a keyboard, use the `W` key to zoom in, the `A` key to shift left, the `S` key to zoom out, and the `D` key to shift right.

The multiple event trace JSON files can be merged into one trace event JSON file using the following `MergedTimeline` API operation and class method from the `smdebug.profiler.analysis.utils.merge_timelines` module.

```
from smdebug.profiler.analysis.utils.merge_timelines import MergedTimeline

combined_timeline = MergedTimeline(path, file_suffix_filter, output_directory)
combined_timeline.merge_timeline(start, end, unit)
```

The `MergedTimeline` API operation passes the following parameters:
+ `path` (str) – Specify a root folder (`/profiler-output`) that contains system and framework profiling trace files. You can locate the `profiler-output` using the SageMaker AI estimator classmethod or the TrainingJob object. For example, `estimator.latest_job_profiler_artifacts_path()` or `tj.profiler_s3_output_path`.
+ `file_suffix_filter` (list) – Specify a list of file suffix filters to merge timelines. Available suffix filters are `["model_timeline.json", "pythontimeline.json", "trace.json.gz"]`. If this parameter is not manually specified, all of the trace files are merged by default.
+ `output_directory` (str) – Specify a path to save the merged timeline JSON file. The default is to the directory specified for the `path` parameter.

The `merge_timeline()` classmethod passes the following parameters to execute the merging process:
+ `start` (int) – Specify start time (in microseconds and in Unix time format) or start step to merge timelines.
+ `end` (int) – Specify end time (in microseconds and in Unix time format) or end step to merge timelines.
+ `unit` (str) – Choose between `"time"` and `"step"`. The default is `"time"`.

Use the following example code to run the `merge_timeline()` method and download the merged JSON file.
+ Merge timeline with the `"time"` unit option. The following example code merges all available trace files between the Unix start time (the absolute zero Unix time) and the current Unix time, which means that you can merge the timelines for the entire training duration.

  ```
  import time
  from smdebug.profiler.analysis.utils.merge_timelines import MergedTimeline
  from smdebug.profiler.profiler_constants import CONVERT_TO_MICROSECS
  
  combined_timeline = MergedTimeline(tj.profiler_s3_output_path, output_directory="./")
  combined_timeline.merge_timeline(0, int(time.time() * CONVERT_TO_MICROSECS))
  ```
+ Merge timeline with the `"step"` unit option. The following example code merges all available timelines between step 3 and step 9.

  ```
  from smdebug.profiler.analysis.utils.merge_timelines import MergedTimeline
  
  combined_timeline = MergedTimeline(tj.profiler_s3_output_path, output_directory="./")
  combined_timeline.merge_timeline(3, 9, unit="step")
  ```

Open the Chrome tracing app at `chrome://tracing` in a Chrome browser, and open the merged JSON file to plot and explore the merged timeline.
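The merged file that the tracing app loads is a JSON array of events in the Chrome trace event format. As a rough, self-contained sketch of that structure (the event names and timestamps below are made up, not values Debugger actually emits):

```python
import json

# A minimal trace in Chrome trace event format: "ph": "X" denotes a complete
# event, "ts" and "dur" are in microseconds, and "pid"/"tid" group events
# into the rows displayed in chrome://tracing.
events = [
    {"name": "Step:ModeKeys.TRAIN", "ph": "X", "ts": 0, "dur": 500_000,
     "pid": 0, "tid": 0},
    {"name": "DataLoaderIter::GetNext", "ph": "X", "ts": 20_000, "dur": 15_000,
     "pid": 0, "tid": 1},
]

with open("sketch_timeline.json", "w") as f:
    json.dump(events, f)

# The resulting file can be opened at chrome://tracing like a merged timeline.
```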

# Profiling data loaders
<a name="debugger-data-loading-time"></a>

In PyTorch, data loader iterators, such as `SingleProcessingDataLoaderIter` and `MultiProcessingDataLoaderIter`, are initiated at the beginning of every iteration over a dataset. During the initialization phase, PyTorch spins up worker processes depending on the configured number of workers, establishes a data queue to fetch data, and starts `pin_memory` threads.

To use the PyTorch data loader profiling analysis tool, import the following `PT_dataloader_analysis` class:

```
from smdebug.profiler.analysis.utils.pytorch_dataloader_analysis import PT_dataloader_analysis
```

Pass the profiling data retrieved as a pandas DataFrame object in the [Access the profiling data using the pandas data parsing tool](debugger-access-data-profiling-pandas-frame.md) section:

```
pt_analysis = PT_dataloader_analysis(pf)
```

The following functions are available for the `pt_analysis` object:

+ `pt_analysis.analyze_dataloaderIter_initialization()`

  The analysis outputs the median and maximum duration for these initializations. If there are outliers (that is, durations greater than twice the median), the function prints the start and end times for those durations. You can use these to inspect system metrics during those time intervals.

  The following list shows what analysis is available from this class method:
  + Which type of data loader iterators were initialized.
  + The number of workers per iterator.
  + Whether the iterator was initialized with or without `pin_memory`.
  + Number of times the iterators were initialized during training.
+ `pt_analysis.analyze_dataloaderWorkers()`

  The following list shows what analysis is available from this class method:
  + The number of worker processes that were spun off during the entire training. 
  + Median and maximum duration for the worker processes. 
  + Start and end time for the worker processes that are outliers. 
+ `pt_analysis.analyze_dataloader_getnext()`

  The following list shows what analysis is available from this class method:
  + Number of GetNext calls made during the training. 
  + Median and maximum duration in microseconds for GetNext calls. 
  + Start time, end time, duration, and worker ID for the outlier GetNext call durations. 
+ `pt_analysis.analyze_batchtime(start_timestamp, end_timestamp, select_events=[".*"], select_dimensions=[".*"])`

  Debugger collects the start and end times of all the GetNext calls. You can find the amount of time spent by the training script on one batch of data. Within the specified time window, you can identify the calls that are not directly contributing to the training. These calls can be from the following operations: computing the accuracy, adding the losses for debugging or logging purposes, and printing the debugging information. Operations like these can be compute intensive or time consuming. You can identify such operations by correlating the Python profiler output, system metrics, and framework metrics.

  The following list shows what analysis is available from this class method:
  + Profile time spent on each data batch, `BatchTime_in_seconds`, by finding the difference between start times of current and subsequent GetNext calls. 
  + Find the outliers in `BatchTime_in_seconds` and start and end time for those outliers.
  + Obtain the system and framework metrics during those `BatchTime_in_seconds` timestamps. This indicates where the time was spent.
+ `pt_analysis.plot_the_window()`

  Plots a timeline chart between the start and end timestamps.
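The `analyze_batchtime()` logic described above — differencing consecutive GetNext start times to get `BatchTime_in_seconds` and flagging durations greater than twice the median — can be sketched independently of SMDebug. The timestamps below are hypothetical:

```python
import statistics

# Hypothetical GetNext start times, in microseconds.
getnext_starts = [0, 100_000, 205_000, 300_000, 950_000, 1_050_000]

# BatchTime_in_seconds: difference between consecutive GetNext start times.
batch_times = [
    (b - a) / 1_000_000 for a, b in zip(getnext_starts, getnext_starts[1:])
]

median = statistics.median(batch_times)

# Flag outliers where a batch took more than twice the median, keeping the
# (start, end) window so system metrics can be inspected for that interval.
outliers = [
    (start / 1_000_000, end / 1_000_000, bt)
    for start, end, bt in zip(getnext_starts, getnext_starts[1:], batch_times)
    if bt > 2 * median
]
print(outliers)
```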

# Release notes for profiling capabilities of Amazon SageMaker AI
<a name="profiler-release-notes"></a>

See the following release notes to track the latest updates for profiling capabilities of Amazon SageMaker AI.

## March 21, 2024
<a name="profiler-release-notes-20240321"></a>

**Currency updates**

[SageMaker Profiler](train-use-sagemaker-profiler.md) has added support for PyTorch v2.2.0, v2.1.0, and v2.0.1.

**AWS Deep Learning Containers pre-installed with SageMaker Profiler**

[SageMaker Profiler](train-use-sagemaker-profiler.md) is packaged in the following [AWS Deep Learning Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md).
+ SageMaker AI Framework Container for PyTorch v2.2.0
+ SageMaker AI Framework Container for PyTorch v2.1.0
+ SageMaker AI Framework Container for PyTorch v2.0.1

## December 14, 2023
<a name="profiler-release-notes-20231214"></a>

**Currency updates**

[SageMaker Profiler](train-use-sagemaker-profiler.md) has added support for TensorFlow v2.13.0.

**Breaking changes**

This release involves a breaking change. The SageMaker Profiler Python package name is changed from `smppy` to `smprof`. If you have been using the previous version of the package while you have started using the latest [SageMaker AI Framework Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only) for TensorFlow listed in the following section, make sure that you update the package name from `smppy` to `smprof` in the import statement in your training script.

**AWS Deep Learning Containers pre-installed with SageMaker Profiler**

[SageMaker Profiler](train-use-sagemaker-profiler.md) is packaged in the following [AWS Deep Learning Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md).
+ SageMaker AI Framework Container for TensorFlow v2.13.0
+ SageMaker AI Framework Container for TensorFlow v2.12.0

If you use previous versions of the [framework containers](profiler-support.md#profiler-support-frameworks), such as TensorFlow v2.11.0, the SageMaker Profiler Python package is still available as `smppy`. If you are uncertain which version or package name to use, replace the import statement of the SageMaker Profiler package with the following code snippet.

```
try:
    import smprof
except ImportError:
    # Backward compatibility for TF 2.11 and PT 1.13.1 images
    import smppy as smprof
```

## August 24, 2023
<a name="profiler-release-notes-20230824"></a>

**New features**

Released Amazon SageMaker Profiler, a profiling and visualization capability of SageMaker AI to deep dive into compute resources provisioned while training deep learning models and gain visibility into operation-level details. SageMaker Profiler provides Python modules (`smppy`) for adding annotations throughout PyTorch or TensorFlow training scripts and activating SageMaker Profiler. You can access the modules through the SageMaker AI Python SDK and AWS Deep Learning Containers. For any jobs run with the SageMaker Profiler Python modules, you can load the profile data in the SageMaker Profiler UI application that provides a summary dashboard and a detailed timeline. To learn more, see [Amazon SageMaker Profiler](train-use-sagemaker-profiler.md).

This release of the SageMaker Profiler Python package is integrated into the following [SageMaker AI Framework Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only) for PyTorch and TensorFlow.
+ PyTorch v2.0.0
+ PyTorch v1.13.1
+ TensorFlow v2.12.0
+ TensorFlow v2.11.0