
SDK for Python (Boto3)

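You can configure Amazon SageMaker Debugger built-in rules for a training job using the create_training_job() function of the AWS Boto3 SageMaker AI client. You need to specify the correct image URI in the RuleEvaluatorImage parameter. The following example shows how to set up the request body for the create_training_job() function.

The following code is a complete example of how to configure Debugger in the create_training_job() request body and start a training job in us-west-2. It assumes that a training script, entry_point/train.py, has been prepared using TensorFlow. To find an end-to-end example notebook, see Profiling TensorFlow Multi GPU Multi Node Training Job with Amazon SageMaker Debugger (Boto3).

Note

Make sure that you use the correct Docker container images. To find available AWS Deep Learning Containers images, see Available Deep Learning Containers Images. To find a complete list of the Docker images available for Debugger rules, see Docker images for Debugger rules.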


import sagemaker, boto3
import datetime, tarfile

# Start setting up a SageMaker session and a Boto3 SageMaker client
session = sagemaker.Session()
region = session.boto_region_name
bucket = session.default_bucket()

# Upload a training script to a default Amazon S3 bucket of the current SageMaker session
source = 'source.tar.gz'
project = 'debugger-boto3-test'

with tarfile.open(source, 'w:gz') as tar:
    tar.add('entry_point/train.py') # Specify the directory and name of your training script

s3 = boto3.client('s3')
s3.upload_file(source, bucket, project+'/'+source)

# Set up a Boto3 session client for SageMaker
sm = boto3.Session(region_name=region).client("sagemaker")

# Start a training job
sm.create_training_job(
    TrainingJobName='debugger-boto3-'+datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S'),
    HyperParameters={
        'sagemaker_submit_directory': 's3://'+bucket+'/'+project+'/'+source,
        'sagemaker_program': '/entry_point/train.py' # training script file location and name under the sagemaker_submit_directory
    },
    AlgorithmSpecification={
        # Specify a training Docker container image URI (Deep Learning Container or your own training container) to TrainingImage.
        'TrainingImage': '763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:2.4.1-gpu-py37-cu110-ubuntu18.04',
        'TrainingInputMode': 'File',
        'EnableSageMakerMetricsTimeSeries': False
    },
    RoleArn='arn:aws:iam::111122223333:role/service-role/AmazonSageMaker-ExecutionRole-20201014T161125',
    OutputDataConfig={'S3OutputPath': 's3://'+bucket+'/'+project+'/output'},
    ResourceConfig={
        'InstanceType': 'ml.p3.8xlarge',
        'InstanceCount': 1,
        'VolumeSizeInGB': 30
    },
    StoppingCondition={
        'MaxRuntimeInSeconds': 86400
    },
    DebugHookConfig={
        'S3OutputPath': 's3://'+bucket+'/'+project+'/debug-output',
        'CollectionConfigurations': [
            {
                'CollectionName': 'losses',
                'CollectionParameters' : {
                    'train.save_interval': '500',
                    'eval.save_interval': '50'
                }
            }
        ]
    },
    DebugRuleConfigurations=[
        {
            'RuleConfigurationName': 'LossNotDecreasing',
            'RuleEvaluatorImage': '895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest',
            'RuleParameters': {'rule_to_invoke': 'LossNotDecreasing'}
        }
    ],
    ProfilerConfig={
        'S3OutputPath': 's3://'+bucket+'/'+project+'/profiler-output',
        'ProfilingIntervalInMilliseconds': 500,
        'ProfilingParameters': {
            'DataloaderProfilingConfig': '{"StartStep": 5, "NumSteps": 3, "MetricsRegex": ".*"}',
            'DetailedProfilingConfig': '{"StartStep": 5, "NumSteps": 3}',
            'PythonProfilingConfig': '{"StartStep": 5, "NumSteps": 3, "ProfilerName": "cprofile", "cProfileTimer": "total_time"}',
            'LocalPath': '/opt/ml/output/profiler/' # Optional. Local path for profiling outputs
        }
    },
    ProfilerRuleConfigurations=[
        {
            'RuleConfigurationName': 'ProfilerReport',
            'RuleEvaluatorImage': '895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest',
            'RuleParameters': {'rule_to_invoke': 'ProfilerReport'}
        }
    ]
)
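After the training job starts, you may want to check how the configured rules are doing. The following is a minimal sketch, not part of the original example, that summarizes the rule evaluation statuses found in a describe_training_job() response. The helper names summarize_rule_statuses and print_rule_statuses are hypothetical; the response keys DebugRuleEvaluationStatuses and ProfilerRuleEvaluationStatuses are assumed to be present only when the corresponding rules were configured.

```python
def summarize_rule_statuses(description):
    """Map each rule configuration name to its evaluation status.

    `description` is the dict returned by describe_training_job().
    Missing keys are treated as "no rules of that kind configured".
    """
    summary = {}
    for key in ("DebugRuleEvaluationStatuses", "ProfilerRuleEvaluationStatuses"):
        for status in description.get(key, []):
            summary[status["RuleConfigurationName"]] = status["RuleEvaluationStatus"]
    return summary


def print_rule_statuses(training_job_name, region="us-west-2"):
    """Hypothetical convenience wrapper; requires AWS credentials."""
    import boto3  # imported lazily so the helper above has no AWS dependency

    sm = boto3.Session(region_name=region).client("sagemaker")
    description = sm.describe_training_job(TrainingJobName=training_job_name)
    for name, status in summarize_rule_statuses(description).items():
        print(f"{name}: {status}")
```

Calling print_rule_statuses() with the TrainingJobName used above would print one line per rule, such as the LossNotDecreasing and ProfilerReport rules from this example.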

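To configure a Debugger rule for debugging model parameters

The following code examples show how to configure a built-in VanishingGradient rule using this SageMaker API.

To enable Debugger to collect output tensors

Specify the Debugger hook configuration as follows: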


DebugHookConfig={
    'S3OutputPath': 's3://<default-bucket>/<training-job-name>/debug-output',
    'CollectionConfigurations': [
        {
            'CollectionName': 'gradients',
            'CollectionParameters' : {
                'train.save_interval': '500',
                'eval.save_interval': '50'
            }
        }
    ]
}

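With this configuration, the training job saves the gradients tensor collection every 500 steps during training (train.save_interval) and every 50 steps during evaluation (eval.save_interval). To find available CollectionName values, see Debugger Built-in Collections in the SMDebug client library documentation. To find available CollectionParameters parameter keys and values, see the sagemaker.debugger.CollectionConfig class in the SageMaker Python SDK documentation.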
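Because every CollectionParameters value is passed as a string, it can be convenient to build the hook configuration from ordinary Python values. The following is a minimal sketch under that assumption; build_debug_hook_config is a hypothetical helper, not part of Boto3 or the SageMaker Python SDK.

```python
def build_debug_hook_config(s3_output_path, collections):
    """Build a DebugHookConfig dict.

    `collections` maps a collection name to a dict of parameters.
    Parameter values are converted to strings, matching the
    string-valued CollectionParameters shown in the example above.
    """
    return {
        "S3OutputPath": s3_output_path,
        "CollectionConfigurations": [
            {
                "CollectionName": name,
                "CollectionParameters": {k: str(v) for k, v in params.items()},
            }
            for name, params in collections.items()
        ],
    }


hook_config = build_debug_hook_config(
    "s3://<default-bucket>/<training-job-name>/debug-output",
    {"gradients": {"train.save_interval": 500, "eval.save_interval": 50}},
)
```

The resulting hook_config can be passed as the DebugHookConfig argument of create_training_job().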

To enable Debugger rules for debugging the output tensors

The following DebugRuleConfigurations API example shows how to run the built-in VanishingGradient rule on the saved gradients collection.


DebugRuleConfigurations=[
    {
        'RuleConfigurationName': 'VanishingGradient',
        'RuleEvaluatorImage': '895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest',
        'RuleParameters': {
            'rule_to_invoke': 'VanishingGradient',
            'threshold': '20.0'
        }
    }
]

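With a configuration like the one in this example, Debugger starts a rule evaluation job for your training job that applies the VanishingGradient rule to the gradients tensor collection. To find a complete list of the Docker images available for Debugger rules, see Docker images for Debugger rules. To find the key-value pairs for RuleParameters, see List of Debugger built-in rules.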
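The rule evaluator image URI depends on the AWS Region, so hard-coding it is easy to get wrong. The following is a minimal sketch of a hypothetical helper, build_builtin_rule_config, that assembles the image URI from a registry account ID and Region. The account ID 895741380848 for us-west-2 is taken from the examples on this page; look up the account ID for any other Region in Docker images for Debugger rules rather than assuming it is the same.

```python
def build_builtin_rule_config(rule_name, registry_account_id, region, **rule_parameters):
    """Build one DebugRuleConfigurations entry for a built-in rule.

    Extra keyword arguments become RuleParameters; values are converted
    to strings, matching the string-valued parameters shown above.
    """
    image_uri = (
        f"{registry_account_id}.dkr.ecr.{region}.amazonaws.com/"
        "sagemaker-debugger-rules:latest"
    )
    parameters = {"rule_to_invoke": rule_name}
    parameters.update({key: str(value) for key, value in rule_parameters.items()})
    return {
        "RuleConfigurationName": rule_name,
        "RuleEvaluatorImage": image_uri,
        "RuleParameters": parameters,
    }


vanishing_gradient = build_builtin_rule_config(
    "VanishingGradient", "895741380848", "us-west-2", threshold=20.0
)
```

Passing [vanishing_gradient] as DebugRuleConfigurations reproduces the configuration shown above.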

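To configure a Debugger built-in rule for profiling system and framework metrics

The following example code shows how to specify the ProfilerConfig parameter to enable collecting system and framework metrics.

To enable Debugger profiling to collect system and framework metrics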

Target Step


ProfilerConfig={
    'S3OutputPath': 's3://<default-bucket>/<training-job-name>/profiler-output', # Optional. Path to an S3 bucket to save profiling outputs
    # Available values for ProfilingIntervalInMilliseconds: 100, 200, 500, 1000 (1 second), 5000 (5 seconds), and 60000 (1 minute) milliseconds.
    'ProfilingIntervalInMilliseconds': 500,
    # Each profiling configuration value is a JSON string, so it cannot contain comments.
    # ProfilerName options: cprofile, pyinstrument
    # cProfileTimer options (include only when using cprofile): cpu, off_cpu, total_time
    'ProfilingParameters': {
        'DataloaderProfilingConfig': '{"StartStep": 5, "NumSteps": 3, "MetricsRegex": ".*"}',
        'DetailedProfilingConfig': '{"StartStep": 5, "NumSteps": 3}',
        'PythonProfilingConfig': '{"StartStep": 5, "NumSteps": 3, "ProfilerName": "cprofile", "cProfileTimer": "total_time"}',
        'LocalPath': '/opt/ml/output/profiler/' # Optional. Local path for profiling outputs
    }
}

Target Time Duration


ProfilerConfig={
    'S3OutputPath': 's3://<default-bucket>/<training-job-name>/profiler-output', # Optional. Path to an S3 bucket to save profiling outputs
    # Available values for ProfilingIntervalInMilliseconds: 100, 200, 500, 1000 (1 second), 5000 (5 seconds), and 60000 (1 minute) milliseconds.
    'ProfilingIntervalInMilliseconds': 500,
    # Each profiling configuration value is a JSON string, so it cannot contain comments.
    # StartTimeInSecSinceEpoch is a placeholder; replace it with the Unix time at which profiling should start.
    # ProfilerName options: cprofile, pyinstrument
    # cProfileTimer options (include only when using cprofile): cpu, off_cpu, total_time
    'ProfilingParameters': {
        'DataloaderProfilingConfig': '{"StartTimeInSecSinceEpoch": 12345567789, "DurationInSeconds": 10, "MetricsRegex": ".*"}',
        'DetailedProfilingConfig': '{"StartTimeInSecSinceEpoch": 12345567789, "DurationInSeconds": 10}',
        'PythonProfilingConfig': '{"StartTimeInSecSinceEpoch": 12345567789, "DurationInSeconds": 10, "ProfilerName": "cprofile", "cProfileTimer": "total_time"}',
        'LocalPath': '/opt/ml/output/profiler/' # Optional. Local path for profiling outputs
    }
}
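Hand-writing the JSON strings inside ProfilingParameters is error-prone, because a stray trailing comma or an inline comment makes the string invalid JSON. The following is a minimal sketch that generates the same strings with json.dumps for both the step-based and the time-based variants. The helper names are hypothetical, and the sketch always uses the cprofile profiler with the total_time timer, as in the examples above.

```python
import json
import time


def _profiling_parameters(window, metrics_regex):
    """Build ProfilingParameters from a shared profiling window dict."""
    python_config = {**window, "ProfilerName": "cprofile", "cProfileTimer": "total_time"}
    return {
        "DataloaderProfilingConfig": json.dumps({**window, "MetricsRegex": metrics_regex}),
        "DetailedProfilingConfig": json.dumps(window),
        "PythonProfilingConfig": json.dumps(python_config),
        "LocalPath": "/opt/ml/output/profiler/",  # Optional local output path
    }


def step_profiling_parameters(start_step, num_steps, metrics_regex=".*"):
    """Profile num_steps steps, starting at start_step."""
    window = {"StartStep": start_step, "NumSteps": num_steps}
    return _profiling_parameters(window, metrics_regex)


def time_profiling_parameters(start_epoch_seconds, duration_seconds, metrics_regex=".*"):
    """Profile for duration_seconds, starting at a Unix timestamp."""
    window = {
        "StartTimeInSecSinceEpoch": int(start_epoch_seconds),
        "DurationInSeconds": duration_seconds,
    }
    return _profiling_parameters(window, metrics_regex)


step_params = step_profiling_parameters(5, 3)
# Start profiling one minute from now and run for 10 seconds.
time_params = time_profiling_parameters(time.time() + 60, 10)
```

Either dictionary can be used as the value of 'ProfilingParameters' in ProfilerConfig.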

To enable Debugger rules for profiling the metrics

The following example code shows how to configure the ProfilerReport rule.


ProfilerRuleConfigurations=[ 
    {
        'RuleConfigurationName': 'ProfilerReport',
        'RuleEvaluatorImage': '895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest',
        'RuleParameters': {
            'rule_to_invoke': 'ProfilerReport',
            'CPUBottleneck_cpu_threshold': '90',
            'IOBottleneck_threshold': '90'
        }
    }
]

To find a complete list of the Docker images available for Debugger rules, see Docker images for Debugger rules. To find the key-value pairs for RuleParameters, see List of Debugger built-in rules.

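Update Debugger profiling configuration using the `UpdateTrainingJob` API operation

You can update the Debugger profiling configuration while your training job is running by using the update_training_job() function of the AWS Boto3 SageMaker AI client. Configure new ProfilerConfig and ProfilerRuleConfigurations objects, and specify the training job name in the TrainingJobName parameter.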


ProfilerConfig={ 
    'DisableProfiler': boolean,
    'ProfilingIntervalInMilliseconds': number,
    'ProfilingParameters': { 
        'string' : 'string' 
    }
},
ProfilerRuleConfigurations=[ 
    { 
        'RuleConfigurationName': 'string',
        'RuleEvaluatorImage': 'string',
        'RuleParameters': { 
            'string' : 'string' 
        }
    }
],
TrainingJobName='your-training-job-name-YYYY-MM-DD-HH-MM-SS-SSS'
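As a concrete instance of the request syntax above, the following minimal sketch builds the keyword arguments for update_training_job() either to change the profiling interval or to turn profiling off. The helper build_profiler_update is hypothetical; the set of allowed interval values is the one listed in the ProfilerConfig examples on this page.

```python
VALID_INTERVALS_MS = {100, 200, 500, 1000, 5000, 60000}


def build_profiler_update(training_job_name, interval_ms=None, disable=False):
    """Build keyword arguments for update_training_job().

    With disable=True, profiling is turned off. Otherwise interval_ms
    must be one of the supported profiling intervals in milliseconds.
    """
    if disable:
        profiler_config = {"DisableProfiler": True}
    else:
        if interval_ms not in VALID_INTERVALS_MS:
            raise ValueError(f"interval_ms must be one of {sorted(VALID_INTERVALS_MS)}")
        profiler_config = {
            "DisableProfiler": False,
            "ProfilingIntervalInMilliseconds": interval_ms,
        }
    return {"TrainingJobName": training_job_name, "ProfilerConfig": profiler_config}


request = build_profiler_update("your-training-job-name-YYYY-MM-DD-HH-MM-SS-SSS", interval_ms=1000)
# With a Boto3 SageMaker client named sm (requires AWS credentials):
# sm.update_training_job(**request)
```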

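Add Debugger custom rule configuration to the CreateTrainingJob API operation

You can configure a custom rule for a training job by using the DebugHookConfig and DebugRuleConfigurations objects with the create_training_job() function of the AWS Boto3 SageMaker AI client. The following code example shows how to use this SageMaker API to configure a custom ImproperActivation rule written with the smdebug library. This example assumes that you have written the custom rule in a custom_rules.py file and uploaded it to an Amazon S3 bucket. The example uses one of the pre-built Docker images that you can use to run custom rules. These images are listed in Amazon SageMaker Debugger image URIs for custom rule evaluators. Specify the registry URL of the pre-built Docker image in the RuleEvaluatorImage parameter.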


DebugHookConfig={
    'S3OutputPath': 's3://<default-bucket>/<training-job-name>/debug-output',
    'CollectionConfigurations': [
        {
            'CollectionName': 'relu_activations',
            'CollectionParameters': {
                'include_regex': 'relu',
                'save_interval': '500',
                'end_step': '5000'
            }
        }
    ]
},
DebugRuleConfigurations=[
    {
        'RuleConfigurationName': 'improper_activation_job',
        'RuleEvaluatorImage': '552407032007.dkr.ecr.ap-south-1.amazonaws.com/sagemaker-debugger-rule-evaluator:latest',
        'InstanceType': 'ml.c4.xlarge',
        'VolumeSizeInGB': 400,
        'RuleParameters': {
           'source_s3_uri': 's3://bucket/custom_rules.py',
           'rule_to_invoke': 'ImproperActivation',
           'collection_names': 'relu_activations'
        }
    }
]
