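Prepare and run a training job with SageMaker Profiler

Setting up a training job with SageMaker Profiler involves two steps: adapting the training script and configuring the SageMaker training job launcher.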

Topics

Step 1: Adapt your training script using the SageMaker Profiler Python modules
Step 2: Create a SageMaker AI framework estimator and activate SageMaker Profiler
(Optional) Install the SageMaker Profiler Python package

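Step 1: Adapt your training script using the SageMaker Profiler Python modules

To start capturing kernel runs on GPUs while the training job is running, modify your training script using the SageMaker Profiler Python modules. Import the library and add the start_profiling() and stop_profiling() methods to define the beginning and the end of profiling. You can also use optional custom annotations to add markers in the training script, so that you can visualize hardware activity during particular operations in each step.

Note that the annotators extract operations from GPUs. For profiling operations in CPUs, you don't need to add any additional annotations. CPU profiling is also activated when you specify the profiling configuration, which you use in Step 2: Create a SageMaker AI framework estimator and activate SageMaker Profiler.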

Note

Profiling an entire training job is not the most efficient use of resources. We recommend profiling at most 300 steps of a training job.
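
One way to keep profiling within a bounded number of steps is to gate the start and stop calls on a step counter. The following is a minimal sketch, not part of the SageMaker Profiler API: StepWindow is a hypothetical helper, and its start and stop arguments are placeholders for whatever callables begin and end profiling in your script (for example, the start_profiling() and stop_profiling() methods described above).

```python
from typing import Callable


class StepWindow:
    """Hypothetical helper: call start() at first_step and stop() after num_steps steps."""

    def __init__(self, start: Callable[[], None], stop: Callable[[], None],
                 first_step: int = 0, num_steps: int = 300) -> None:
        if num_steps <= 0:
            raise ValueError("num_steps must be positive")
        self._start = start
        self._stop = stop
        self._first = first_step
        self._last = first_step + num_steps  # exclusive upper bound
        self._active = False

    def on_step_begin(self, global_step: int) -> None:
        """Call at the top of each training step."""
        if not self._active and global_step == self._first:
            self._start()
            self._active = True

    def on_step_end(self, global_step: int) -> None:
        """Call at the bottom of each training step."""
        if self._active and global_step + 1 >= self._last:
            self._stop()
            self._active = False

    def close(self) -> None:
        """Stop profiling if training ends before the window is complete."""
        if self._active:
            self._stop()
            self._active = False
```

In a training loop, you would call on_step_begin and on_step_end around each iteration with a step counter that keeps increasing across epochs, and call close() once after the loop.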

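Important

The release on December 14, 2023 involves a breaking change. The SageMaker Profiler Python package name changed from smppy to smprof. This is effective in the SageMaker AI framework containers for TensorFlow v2.12 and later.

If you use an earlier version of the SageMaker AI framework containers, such as TensorFlow v2.11.0, the SageMaker Profiler Python package is still available as smppy. If you are unsure which version or package name to use, replace the import statement of the SageMaker Profiler package with the following code snippet.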


try:
    import smprof 
except ImportError:
    # backward compatibility for TF 2.11 and PT 1.13.1 images
    import smppy as smprof
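
If you want to log which package name a given container provides before importing it, a small check with the standard library is one option. This is only a sketch: first_available is a hypothetical helper, and the two candidate names are the package names mentioned above.

```python
import importlib.util
from typing import Optional, Sequence


def first_available(candidates: Sequence[str]) -> Optional[str]:
    """Return the first top-level module name that can be imported, or None."""
    for name in candidates:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is not None:
            return name
    return None


# "smprof" is the current package name and "smppy" is the earlier one.
package_name = first_available(["smprof", "smppy"])
print(f"SageMaker Profiler package found: {package_name}")
```

A result of None suggests that neither package is installed in the image, in which case the optional installation section of this page applies.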

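Approach 1. Use the context manager smprof.annotate to annotate full functions

You can wrap full functions with the smprof.annotate() context manager. This wrapper is recommended if you want to profile by function instead of by code line. The following example script shows how to implement the context manager to wrap the training loop and full functions in each iteration.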


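import time

import smprof

# The loop below assumes that args, world_size, sampler, trainloader, net,
# criterion, and optimizer are defined earlier in your training script.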

SMProf = smprof.SMProfiler.instance()
config = smprof.Config()
config.profiler = {
    "EnableCuda": "1",
}
SMProf.configure(config)
SMProf.start_profiling()

for epoch in range(args.epochs):
    if world_size > 1:
        sampler.set_epoch(epoch)
    tstart = time.perf_counter()
    for i, data in enumerate(trainloader, 0):
        with smprof.annotate("step_"+str(i)):
            inputs, labels = data
            inputs = inputs.to("cuda", non_blocking=True)
            labels = labels.to("cuda", non_blocking=True)
    
            optimizer.zero_grad()
    
            with smprof.annotate("Forward"):
                outputs = net(inputs)
            with smprof.annotate("Loss"):
                loss = criterion(outputs, labels)
            with smprof.annotate("Backward"):
                loss.backward()
            with smprof.annotate("Optimizer"):
                optimizer.step()

SMProf.stop_profiling()

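Approach 2. Use smprof.annotation_begin() and smprof.annotation_end() to annotate specific code lines in functions

You can also define annotations to profile specific code lines. You can set the exact starting point and end point of profiling at the level of individual code lines, not by function. For example, in the following script, step_annotator is defined at the beginning of each iteration and ends at the end of the iteration. Meanwhile, other detailed annotators for each operation are defined and wrap around the target operations throughout each iteration.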


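import time

import smprof

# The loop below assumes that args, world_size, sampler, trainloader, net,
# criterion, and optimizer are defined earlier in your training script.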

SMProf = smprof.SMProfiler.instance()
config = smprof.Config()
config.profiler = {
    "EnableCuda": "1",
}
SMProf.configure(config)
SMProf.start_profiling()

for epoch in range(args.epochs):
    if world_size > 1:
        sampler.set_epoch(epoch)
    tstart = time.perf_counter()
    for i, data in enumerate(trainloader, 0):
        step_annotator = smprof.annotation_begin("step_" + str(i))

        inputs, labels = data
        inputs = inputs.to("cuda", non_blocking=True)
        labels = labels.to("cuda", non_blocking=True)
        optimizer.zero_grad()

        forward_annotator = smprof.annotation_begin("Forward")
        outputs = net(inputs)
        smprof.annotation_end(forward_annotator)

        loss_annotator = smprof.annotation_begin("Loss")
        loss = criterion(outputs, labels)
        smprof.annotation_end(loss_annotator)

        backward_annotator = smprof.annotation_begin("Backward")
        loss.backward()
        smprof.annotation_end(backward_annotator)

        optimizer_annotator = smprof.annotation_begin("Optimizer")
        optimizer.step()
        smprof.annotation_end(optimizer_annotator)

        smprof.annotation_end(step_annotator)

SMProf.stop_profiling()
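
With explicit begin and end calls, an exception raised between them skips the end call. If you prefer the pair to always close, one option is to wrap it in a small context manager. This is a sketch under assumptions, not part of the smprof package: annotated_region is a hypothetical helper that takes the begin and end callables as arguments, so it makes no assumption about their behavior beyond the call pattern shown in the script above.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def annotated_region(begin: Callable[[str], Any],
                     end: Callable[[Any], None],
                     name: str) -> Iterator[Any]:
    """Open an annotation with begin(name) and always close it with end(handle)."""
    handle = begin(name)
    try:
        yield handle
    finally:
        # Runs on normal exit and when the body raises.
        end(handle)


# Intended use inside the training loop (shown as comments because it needs smprof):
#   with annotated_region(smprof.annotation_begin, smprof.annotation_end, "Forward"):
#       outputs = net(inputs)
```

If you only need block-level annotations, the smprof.annotate() context manager from Approach 1 already covers that case; this wrapper is only useful when you want to keep the begin and end functions.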

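After annotating and setting up the profiler initiation modules, save the script to submit using a SageMaker training job launcher in the following Step 2. The sample launcher assumes that the training script is named train_with_profiler_demo.py.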

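Step 2: Create a SageMaker AI framework estimator and activate SageMaker Profiler

The following procedure shows how to prepare a SageMaker AI framework estimator for training using the SageMaker Python SDK.

Set up a profiler_config object using the ProfilerConfig and Profiler modules as follows.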
```
from sagemaker import ProfilerConfig, Profiler
profiler_config = ProfilerConfig(
    profile_params = Profiler(cpu_profiling_duration=3600)
)
```
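The following is a description of the Profiler module and its argument.
- Profiler: The module for activating SageMaker Profiler with the training job.
  - cpu_profiling_duration (int): The duration, in seconds, for profiling on CPUs. The default is 3600 seconds.

Create a SageMaker AI framework estimator with the profiler_config object created in the previous step. The following code shows an example of creating a PyTorch estimator. If you want to create a TensorFlow estimator, import sagemaker.tensorflow.TensorFlow instead, and specify one of the TensorFlow versions supported by SageMaker Profiler. For more information about supported frameworks and instance types, see SageMaker AI framework images pre-installed with SageMaker Profiler.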


import sagemaker
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    framework_version="2.0.0",
    role=sagemaker.get_execution_role(),
    entry_point="train_with_profiler_demo.py", # your training job entry point
    source_dir=source_dir, # source directory for your training script
    output_path=output_path,
    base_job_name="sagemaker-profiler-demo",
    hyperparameters=hyperparameters, # if any
    instance_count=1, # Recommended to test with < 8
    instance_type="ml.p4d.24xlarge",
    profiler_config=profiler_config
)

Start the training job by running the fit method. With wait=False, you can silence the training job logs and let the job run in the background.
```
estimator.fit(wait=False)
```
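
Because wait=False returns immediately, you might want a separate way to find out when the job finishes. The following is a minimal polling sketch. wait_for_training_job is a hypothetical helper, and the describe callable is injected so that the function itself has no AWS dependency. The commented wiring assumes boto3 is installed and credentials are configured.

```python
import time
from typing import Callable, Dict

TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(describe: Callable[[str], Dict],
                          job_name: str,
                          poll_seconds: float = 60.0,
                          sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll describe(job_name) until the job reaches a terminal status."""
    while True:
        status = describe(job_name)["TrainingJobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)


# Example wiring with boto3 (requires AWS credentials; shown as comments only):
#   import boto3
#   client = boto3.client("sagemaker")
#   describe = lambda name: client.describe_training_job(TrainingJobName=name)
#   wait_for_training_job(describe, estimator.latest_training_job.name)
```

Injecting the describe and sleep callables keeps the loop easy to exercise locally with stand-in functions before pointing it at a real job.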

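While the training job is running or after the job has completed, you can go to the next topic, Open the SageMaker Profiler UI application, and start exploring and visualizing the saved profiles.

If you want to directly access the profile data saved in the Amazon S3 bucket, use the following script to retrieve the S3 URI.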


import os
# This is an ad-hoc function to get the S3 URI
# to where the profile output data is saved
def get_detailed_profiler_output_uri(estimator):
    config_name = None
    for processing in estimator.profiler_rule_configs:
        params = processing.get("RuleParameters", dict())
        rule = params.get("rule_to_invoke", "")
        if rule == "DetailedProfilerProcessing":
            config_name = processing.get("RuleConfigurationName")
            break
    if config_name is None:
        raise ValueError("No DetailedProfilerProcessing rule configuration found")
    # os.path.join uses "/" on Linux and macOS, which matches S3 URI syntax
    return os.path.join(
        estimator.output_path,
        estimator.latest_training_job.name,
        "rule-output",
        config_name,
    )

print(
    "Profiler output S3 bucket: ",
    get_detailed_profiler_output_uri(estimator)
)
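
Most S3 client calls take a bucket name and a key prefix instead of a full URI, so a common next step is to split the URI returned by the script above. The following sketch uses only the standard library. split_s3_uri is a hypothetical helper, and the URI in the example is a placeholder.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/key/prefix' into (bucket, key_prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Placeholder URI for illustration only.
bucket, prefix = split_s3_uri("s3://amzn-s3-demo-bucket/demo-job/rule-output/")
print(bucket, prefix)
```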

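(Optional) Install the SageMaker Profiler Python package

To use SageMaker Profiler on PyTorch or TensorFlow framework images not listed in SageMaker AI framework images pre-installed with SageMaker Profiler, or on your own custom Docker container for training, you can install SageMaker Profiler by using one of the SageMaker Profiler Python package binary files.

Option 1: Install the SageMaker Profiler package while launching a training job

If you want to use SageMaker Profiler for training jobs using PyTorch or TensorFlow images not listed in SageMaker AI framework images pre-installed with SageMaker Profiler, create a requirements.txt file and place it under the path you specify for the source_dir parameter of the SageMaker AI framework estimator in Step 2. For more information about setting up a requirements.txt file in general, see Using third-party libraries in the SageMaker Python SDK documentation. In the requirements.txt file, add one of the S3 bucket paths for the SageMaker Profiler Python package binary files.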


# requirements.txt
https://smppy.s3.amazonaws.com/tensorflow/cu112/smprof-0.3.332-cp39-cp39-linux_x86_64.whl
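
The wheel file name encodes the Python and platform tags that the binary was built for, which helps when choosing a binary for your image. The following sketch parses those tags and compares the Python tag with the running interpreter. parse_wheel_filename and matches_interpreter are hypothetical helpers, and the parser assumes a five-part file name with no build tag, like the one above. The CUDA version appears only in the URL path (cu112), so this check does not cover it.

```python
import sys
from typing import NamedTuple


class WheelTags(NamedTuple):
    name: str
    version: str
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelTags:
    """Parse '{name}-{version}-{python}-{abi}-{platform}.whl' (no build tag)."""
    if not filename.endswith(".whl"):
        raise ValueError(f"Not a wheel file: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) != 5:
        raise ValueError(f"Unexpected wheel filename layout: {filename!r}")
    return WheelTags(*parts)


def matches_interpreter(tags: WheelTags) -> bool:
    """Check the CPython tag (for example cp39) against the running interpreter."""
    current = f"cp{sys.version_info.major}{sys.version_info.minor}"
    return tags.python_tag == current


tags = parse_wheel_filename("smprof-0.3.332-cp39-cp39-linux_x86_64.whl")
print(tags, matches_interpreter(tags))
```

Running this inside the target container shows whether the cp39 wheel matches the container's Python version before you add the URL to requirements.txt.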

Option 2: Install the SageMaker Profiler package in your custom Docker container

If you use a custom Docker container for training, add one of the SageMaker Profiler Python package binary files to your Dockerfile.


# Install the smprof package version compatible with your CUDA version
RUN pip install https://smppy.s3.amazonaws.com/tensorflow/cu112/smprof-0.3.332-cp39-cp39-linux_x86_64.whl

For guidance on running a custom Docker container for training on SageMaker AI in general, see Adapting your own training container.
