This document is a machine translation of the English version. If there is any ambiguity or inconsistency, the English version prevails.

# Debug training jobs using Amazon SageMaker Debugger
<a name="debugger-debug-training-jobs"></a>

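To debug the model training process with SageMaker Debugger, prepare your training script and run a training job in two steps: modify the training script using the `sagemaker-debugger` Python SDK, and construct a SageMaker AI estimator using the SageMaker Python SDK. The following topics explain how to use the debugging functionality of SageMaker Debugger.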

**Topics**
+ [Adapting your training script to register a hook](debugger-modify-script.md)
+ [Launch training jobs with Debugger using the SageMaker Python SDK](debugger-configuration-for-debugging.md)
+ [SageMaker Debugger interactive report for XGBoost](debugger-report-xgboost.md)
+ [Action on Amazon SageMaker Debugger rules](debugger-action-on-rules.md)
+ [Visualize Amazon SageMaker Debugger output tensors in TensorBoard](debugger-enable-tensorboard-summaries.md)

# Adapting your training script to register a hook
<a name="debugger-modify-script"></a>

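Amazon SageMaker Debugger comes with a client library called the [`sagemaker-debugger` Python SDK](https://sagemaker-debugger.readthedocs.io/en/website). The `sagemaker-debugger` Python SDK provides tools for adapting your training script before training, and analysis tools for use after training. On this page, you learn how to adapt your training script using the client library.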

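The `sagemaker-debugger` Python SDK provides wrapper functions that help register a hook to extract model tensors with only a few added lines in your training script. To start collecting model output tensors and debug them to find training issues, make the following modifications in your training script.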

**Tip**  
While you follow this page, use the [`sagemaker-debugger` open source SDK documentation](https://sagemaker-debugger.readthedocs.io/en/website/index.html) for API reference.

**Topics**
+ [Adapt your PyTorch training script](debugger-modify-script-pytorch.md)
+ [Adapt your TensorFlow training script](debugger-modify-script-tensorflow.md)

# Adapt your PyTorch training script
<a name="debugger-modify-script-pytorch"></a>

To start collecting model output tensors and debug training issues, make the following modifications to your PyTorch training script.

**Note**  
SageMaker Debugger cannot collect model output tensors from the [`torch.nn.functional`](https://pytorch.org/docs/stable/nn.functional.html) API operations. When you write a PyTorch training script, we recommend that you use the [`torch.nn`](https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html) modules instead.

## For PyTorch 1.12.0
<a name="debugger-modify-script-pytorch-1-12-0"></a>

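If you bring a PyTorch training script, you can run the training job and extract model output tensors with a few additional lines of code in your training script. You need to use the [hook APIs](https://sagemaker-debugger.readthedocs.io/en/website/hook-api.html) in the `sagemaker-debugger` client library. Walk through the following instructions, which break down the steps with code examples.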

1. Create a hook.

   **(Recommended) For training jobs within SageMaker AI**

   ```
   import smdebug.pytorch as smd
   hook=smd.get_hook(create_if_not_exists=True)
   ```

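   When you launch a training job in [Launch training jobs with Debugger using the SageMaker Python SDK](debugger-configuration-for-debugging.md) with any of `DebuggerHookConfig`, `TensorBoardOutputConfig`, or `Rules` in your estimator, SageMaker AI adds a JSON configuration file to your training instance, and the `get_hook` function picks it up. Note that if you do not include any of the configuration APIs in your estimator, there is no configuration file for the hook to find, and the function returns `None`.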

   **(Optional) For training jobs outside SageMaker AI**

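   If you run training jobs in local mode, that is, directly on SageMaker notebook instances, Amazon EC2 instances, or your own local devices, use the `smd.Hook` class to create a hook. However, this approach can only store the tensor collections, and they are usable only for TensorBoard visualization. SageMaker Debugger's built-in rules do not work in local mode because the rules require SageMaker AI ML training instances and S3 to store outputs from the remote instances in real time. The `smd.get_hook` API returns `None` in this case.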

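   If you want to create a manual hook to save tensors in local mode, use the following code snippet, which checks whether the `smd.get_hook` API returns `None` and, if so, creates a manual hook using the `smd.Hook` class. Note that you can specify any output directory on your local machine.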

   ```
   import smdebug.pytorch as smd
   hook=smd.get_hook(create_if_not_exists=True)
   
   if hook is None:
       hook=smd.Hook(
           out_dir='/path/to/your/local/output/',
           export_tensorboard=True
       )
   ```

1. Wrap your model with the hook's class methods.

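   The `hook.register_module()` method takes your model and iterates through each layer, looking for any tensors that match the regular expressions that you provide through the configuration in [Launch training jobs with Debugger using the SageMaker Python SDK](debugger-configuration-for-debugging.md). The tensors collectable through this hook method are weights, biases, activations, gradients, inputs, and outputs.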

   ```
   hook.register_module(model)
   ```
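**Tip**  
If you collect the entire set of output tensors from a large deep learning model, the total size of those collections can grow very quickly and might cause bottlenecks. If you want to save specific tensors only, you can also use the `hook.save_tensor()` method. This method helps you pick the variable for a specific tensor and save it to a custom collection named as you want. For more information, see [Step 7](#debugger-modify-script-pytorch-save-custom-tensor) of this instruction.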

1. Wrap the loss function with the hook's class methods.

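   The `hook.register_loss` method wraps the loss function. It extracts loss values every `save_interval` that you set during configuration in [Launch training jobs with Debugger using the SageMaker Python SDK](debugger-configuration-for-debugging.md), and saves them to the `"losses"` collection.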

   ```
   hook.register_loss(loss_function)
   ```

1. Add `hook.set_mode(ModeKeys.TRAIN)` in the train block. This indicates that the tensor collection is extracted during the training phase.

   ```
   from smdebug.core.modes import ModeKeys

   def train():
       ...
       hook.set_mode(ModeKeys.TRAIN)
   ```

1. Add `hook.set_mode(ModeKeys.EVAL)` in the validation block. This indicates that the tensor collection is extracted during the validation phase.

   ```
   from smdebug.core.modes import ModeKeys

   def validation():
       ...
       hook.set_mode(ModeKeys.EVAL)
   ```

1. Use [`hook.save_scalar()`](https://sagemaker-debugger.readthedocs.io/en/website/hook-constructor.html#smdebug.core.hook.BaseHook.save_scalar) to save custom scalars. You can save scalar values that are not in your model. For example, to record the accuracy values computed during evaluation, add the following line of code below the line where you calculate accuracy.

   ```
   hook.save_scalar("accuracy", accuracy)
   ```

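   Note that you need to provide a string as the first argument to name the custom scalar collection. This name is used for visualizing the scalar values in TensorBoard, and it can be any string you want.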

1. <a name="debugger-modify-script-pytorch-save-custom-tensor"></a>Use [`hook.save_tensor()`](https://sagemaker-debugger.readthedocs.io/en/website/hook-constructor.html#smdebug.core.hook.BaseHook.save_tensor) to save custom tensors. Similar to [`hook.save_scalar()`](https://sagemaker-debugger.readthedocs.io/en/website/hook-constructor.html#smdebug.core.hook.BaseHook.save_scalar), you can save additional tensors and define your own tensor collection. For example, you can extract the input image data that is passed into the model and save it as a custom tensor by adding the following line of code, where `"images"` is an example name for the custom tensor and `image_inputs` is an example variable for the input image data.

   ```
   hook.save_tensor("images", image_inputs)
   ```

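   Note that you must provide a string as the first argument to name the custom tensor. `hook.save_tensor()` has a third argument, `collections_to_write`, which specifies the tensor collection that the custom tensor is saved to. The default is `collections_to_write="default"`. If you do not explicitly specify the third argument, the custom tensor is saved to the `"default"` tensor collection.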

After you have completed adapting your training script, proceed to [Launch training jobs with Debugger using the SageMaker Python SDK](debugger-configuration-for-debugging.md).

# Adapt your TensorFlow training script
<a name="debugger-modify-script-tensorflow"></a>

To start collecting model output tensors and debug training issues, make the following modifications to your TensorFlow training script.

**Create a hook for training jobs within SageMaker AI**

```
import smdebug.tensorflow as smd

hook=smd.get_hook(hook_type="keras", create_if_not_exists=True)
```

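This creates a hook when you start a SageMaker training job. When you launch a training job in [Launch training jobs with Debugger using the SageMaker Python SDK](debugger-configuration-for-debugging.md) with any of `DebuggerHookConfig`, `TensorBoardOutputConfig`, or `Rules` in your estimator, SageMaker AI adds a JSON configuration file to your training instance, and the `smd.get_hook` method picks it up. Note that if you do not include any of the configuration APIs in your estimator, there is no configuration file for the hook to find, and the function returns `None`.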

**(Optional) Create a hook for training jobs outside SageMaker AI**

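If you run training jobs in local mode, that is, directly on SageMaker notebook instances, Amazon EC2 instances, or your own local devices, use the `smd.KerasHook` class to create a hook. However, this approach can only store the tensor collections, and they are usable only for TensorBoard visualization. SageMaker Debugger's built-in rules do not work in local mode. The `smd.get_hook` method also returns `None` in this case.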

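If you want to create a manual hook, use the following code snippet, which checks whether `smd.get_hook` returns `None` and, if so, creates a manual hook using the `smd.KerasHook` class.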

```
import smdebug.tensorflow as smd

hook=smd.get_hook(hook_type="keras", create_if_not_exists=True) 

if hook is None:
    hook=smd.KerasHook(
        out_dir='/path/to/your/local/output/',
        export_tensorboard=True
    )
```

After you have added the hook creation code, proceed to the following topic for TensorFlow Keras.

**Note**  
SageMaker Debugger currently supports TensorFlow Keras only.

## Register the hook in your TensorFlow Keras training script
<a name="debugger-modify-script-tensorflow-keras"></a>

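The following procedure walks you through how to use the hook and its methods to collect output scalars and tensors from your model and optimizer.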

1. Wrap your Keras model and optimizer with the hook's class methods.

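   The `hook.register_model()` method takes your model and iterates through each layer, looking for any tensors that match the regular expressions that you provide through the configuration in [Launch training jobs with Debugger using the SageMaker Python SDK](debugger-configuration-for-debugging.md). The tensors collectable through this hook method are weights, biases, and activations.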

   ```
   model=tf.keras.Model(...)
   hook.register_model(model)
   ```

1. Wrap the optimizer with the `hook.wrap_optimizer()` method.

   ```
   optimizer=tf.keras.optimizers.Adam(...)
   optimizer=hook.wrap_optimizer(optimizer)
   ```

1. Compile the model in eager mode in TensorFlow.

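   To collect tensors from the model, such as the input and output tensors of each layer, you must run the training in eager mode. Otherwise, SageMaker Debugger cannot collect those tensors. However, other tensors, such as model weights, biases, and the loss, can be collected without explicitly running in eager mode.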

   ```
   model.compile(
       loss="categorical_crossentropy", 
       optimizer=optimizer, 
       metrics=["accuracy"],
       # Required for collecting tensors of each layer
       run_eagerly=True
   )
   ```

1. Register the hook to the [`tf.keras.Model.fit()`](https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit) method.

   To collect the tensors from the hook that you registered, add `callbacks=[hook]` to the Keras `model.fit()` class method. This passes the `sagemaker-debugger` hook as a Keras callback.

   ```
   model.fit(
       X_train, Y_train,
       batch_size=batch_size,
       epochs=epoch,
       validation_data=(X_valid, Y_valid),
       shuffle=True, 
       callbacks=[hook]
   )
   ```

1. TensorFlow 2.x provides only symbolic gradient variables, which do not give access to their values. To collect gradients, wrap `tf.GradientTape` with the [`hook.wrap_tape()`](https://sagemaker-debugger.readthedocs.io/en/website/hook-methods.html#tensorflow-specific-hook-api) method, which requires you to write your own training step as follows.

   ```
   def training_step(model, data, labels):
       # loss_fn and optimizer are assumed to be defined in the enclosing scope
       with hook.wrap_tape(tf.GradientTape()) as tape:
           pred=model(data)
           loss_value=loss_fn(labels, pred)
       grads=tape.gradient(loss_value, model.trainable_variables)
       optimizer.apply_gradients(zip(grads, model.trainable_variables))
   ```

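   By wrapping the tape, the `sagemaker-debugger` hook can identify output tensors such as gradients, parameters, and losses. Wrapping the tape ensures that the `hook.wrap_tape()` method around the functions of the tape object, such as `push_tape()`, `pop_tape()`, and `gradient()`, sets up the writers of SageMaker Debugger and saves the tensors that are provided as input to `gradient()` (trainable variables and loss) and as output of `gradient()` (gradients).
**Note**  
To collect tensors with a custom training loop, make sure that you use eager mode. Otherwise, SageMaker Debugger cannot collect any tensors.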

For a full list of actions that the `sagemaker-debugger` hook APIs offer to construct hooks and save tensors, see [Hook Methods](https://sagemaker-debugger.readthedocs.io/en/website/hook-methods.html) in the *`sagemaker-debugger` Python SDK documentation*.

After you have completed adapting your training script, proceed to [Launch training jobs with Debugger using the SageMaker Python SDK](debugger-configuration-for-debugging.md).

# Launch training jobs with Debugger using the SageMaker Python SDK
<a name="debugger-configuration-for-debugging"></a>

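To configure a SageMaker AI estimator with SageMaker Debugger, use the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) and specify Debugger-specific parameters. To fully utilize the debugging functionality, you need to configure three parameters: `debugger_hook_config`, `tensorboard_output_config`, and `rules`.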

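**Important**  
Before you construct the estimator and run its fit method to launch a training job, make sure that you have adapted your training script following the instructions at [Adapting your training script to register a hook](debugger-modify-script.md).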

## Constructing a SageMaker AI estimator with Debugger-specific parameters
<a name="debugger-configuration-structure"></a>

The code examples in this section show how to construct a SageMaker AI estimator with the Debugger-specific parameters.

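**Note**  
The following code examples are templates for constructing the SageMaker AI framework estimators and are not directly executable. You need to proceed to the next sections and configure the Debugger-specific parameters.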

------
#### [ PyTorch ]

```
# An example of constructing a SageMaker AI PyTorch estimator
import boto3
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, Rule, rule_configs

session=boto3.session.Session()
region=session.region_name

debugger_hook_config=DebuggerHookConfig(...)
rules=[
    Rule.sagemaker(rule_configs.built_in_rule())
]

estimator=PyTorch(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.12.0",
    py_version="py38",
    
    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)
```

------
#### [ TensorFlow ]

```
# An example of constructing a SageMaker AI TensorFlow estimator
import boto3
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, ProfilerRule, Rule, rule_configs

session=boto3.session.Session()
region=session.region_name

debugger_hook_config=DebuggerHookConfig(...)
rules=[
    Rule.sagemaker(rule_configs.built_in_rule()),
    ProfilerRule.sagemaker(rule_configs.BuiltInRule())
]

estimator=TensorFlow(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.9.0",
    py_version="py39",
    
    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)
```

------
#### [ MXNet ]

```
# An example of constructing a SageMaker AI MXNet estimator
import sagemaker
from sagemaker.mxnet import MXNet
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, Rule, rule_configs

debugger_hook_config=DebuggerHookConfig(...)
rules=[
    Rule.sagemaker(rule_configs.built_in_rule())
]

estimator=MXNet(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.7.0",
    py_version="py37",
    
    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)
```

------
#### [ XGBoost ]

```
# An example of constructing a SageMaker AI XGBoost estimator
import sagemaker
from sagemaker.xgboost.estimator import XGBoost
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, Rule, rule_configs

debugger_hook_config=DebuggerHookConfig(...)
rules=[
    Rule.sagemaker(rule_configs.built_in_rule())
]

estimator=XGBoost(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.5-1",

    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)
```

------
#### [ Generic estimator ]

```
# An example of constructing a SageMaker AI generic estimator using the XGBoost algorithm base image
import boto3
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker import image_uris
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, Rule, rule_configs

debugger_hook_config=DebuggerHookConfig(...)
rules=[
    Rule.sagemaker(rule_configs.built_in_rule())
]

region=boto3.Session().region_name
xgboost_container=sagemaker.image_uris.retrieve("xgboost", region, "1.5-1")

estimator=Estimator(
    role=sagemaker.get_execution_role(),
    image_uri=xgboost_container,
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    
    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)
```

------

Configure the following parameters to activate SageMaker Debugger:
+ `debugger_hook_config` (an object of [`DebuggerHookConfig`](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.DebuggerHookConfig)): Required to activate the hook in the training script that you adapted in [Adapting your training script to register a hook](debugger-modify-script.md), to configure the SageMaker training launcher (estimator) to collect output tensors from your training job, and to save the tensors to your secured S3 bucket or local machine. To learn how to configure `debugger_hook_config`, see [Configuring SageMaker Debugger to save tensors](debugger-configure-hook.md).
+ `rules` (a list of [`Rule`](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.Rule) objects): Configure this parameter to activate the SageMaker Debugger built-in rules that you want to run in real time. The built-in rules are logic that automatically debugs the training progress of your model and finds training issues by analyzing the output tensors saved in your secured S3 bucket. To learn how to configure `rules`, see [How to configure Debugger built-in rules](use-debugger-built-in-rules.md). To find a complete list of built-in rules for debugging output tensors, see [Debugger rule](debugger-built-in-rules.md#debugger-built-in-rules-Rule). If you want to create your own logic to detect any training issues, see [Creating custom rules using the Debugger client library](debugger-custom-rules.md).
**Note**  
The built-in rules are available only through SageMaker training instances. You cannot use them in local mode.
+ `tensorboard_output_config` (an object of [`TensorBoardOutputConfig`](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.TensorBoardOutputConfig)): Configure SageMaker Debugger to collect output tensors in a TensorBoard-compatible format and save them to the S3 output path specified in the `TensorBoardOutputConfig` object. To learn more, see [Visualize Amazon SageMaker Debugger output tensors in TensorBoard](debugger-enable-tensorboard-summaries.md).
**Note**  
`tensorboard_output_config` must be configured together with the `debugger_hook_config` parameter, which also requires you to adapt your training script by adding the `sagemaker-debugger` hook.

**Note**  
SageMaker Debugger securely saves output tensors in subfolders of your S3 bucket. For example, the format of the default S3 bucket URI in your account is `s3://amzn-s3-demo-bucket-sagemaker-<region>-<12digit_account_id>/<base-job-name>/<debugger-subfolders>/`. SageMaker Debugger creates two subfolders: `debug-output` and `rule-output`. If you add the `tensorboard_output_config` parameter, you also find a `tensorboard-output` folder.
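
The folder layout described in the preceding note can be sketched with plain string handling. This is only an illustration: `debugger_output_uris` is a hypothetical helper, not part of the SageMaker Python SDK, and the bucket and job names are placeholders.

```python
# Hypothetical helper (not a SageMaker SDK function): lists the Debugger
# subfolders expected under one training job prefix.
def debugger_output_uris(bucket, base_job_name, with_tensorboard=False):
    """Return a dict that maps each Debugger subfolder name to its S3 URI."""
    subfolders = ["debug-output", "rule-output"]
    if with_tensorboard:
        # This folder appears only when tensorboard_output_config is set
        subfolders.append("tensorboard-output")
    prefix = f"s3://{bucket}/{base_job_name}"
    return {name: f"{prefix}/{name}/" for name in subfolders}


uris = debugger_output_uris(
    "amzn-s3-demo-bucket", "debugger-demo", with_tensorboard=True
)
for name, uri in uris.items():
    print(f"{name}: {uri}")
```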

See the following topics to find more detailed examples of how to configure the Debugger-specific parameters.

**Topics**
+ [Constructing a SageMaker AI estimator with Debugger-specific parameters](#debugger-configuration-structure)
+ [Configuring SageMaker Debugger to save tensors](debugger-configure-hook.md)
+ [How to configure Debugger built-in rules](use-debugger-built-in-rules.md)
+ [Turn off Debugger](debugger-turn-off.md)
+ [Useful SageMaker AI estimator class methods for Debugger](debugger-estimator-classmethods.md)

# Configuring SageMaker Debugger to save tensors
<a name="debugger-configure-hook"></a>

*Tensors* are data collections of updated parameters from the backward and forward pass of each training iteration. SageMaker Debugger collects the output tensors to analyze the state of a training job. SageMaker Debugger's [`CollectionConfig`](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.CollectionConfig) and [`DebuggerHookConfig`](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.DebuggerHookConfig) API operations provide methods for grouping tensors into *collections* and saving them to a target S3 bucket. The following topics show how to use the `CollectionConfig` and `DebuggerHookConfig` API operations, followed by examples of how to use the Debugger hook to save, access, and visualize output tensors.

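When you construct a SageMaker AI estimator, activate SageMaker Debugger by specifying the `debugger_hook_config` parameter. The following topics include examples of how to set up `debugger_hook_config` using the `CollectionConfig` and `DebuggerHookConfig` API operations to pull tensors out of your training jobs and save them.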

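**Note**  
After it is properly configured and activated, SageMaker Debugger saves the output tensors in a default S3 bucket, unless otherwise specified. The format of the default S3 bucket URI is `s3://amzn-s3-demo-bucket-sagemaker-<region>-<12digit_account_id>/<training-job-name>/debug-output/`.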

**Topics**
+ [Configure tensor collections using the `CollectionConfig` API](debugger-configure-tensor-collections.md)
+ [Configure the `DebuggerHookConfig` API to save tensors](debugger-configure-tensor-hook.md)
+ [Example notebooks and code samples to configure Debugger hook](debugger-save-tensors.md)

# Configure tensor collections using the `CollectionConfig` API
<a name="debugger-configure-tensor-collections"></a>

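Use the `CollectionConfig` API operation to configure tensor collections. If you use Debugger-supported deep learning frameworks and machine learning algorithms, Debugger provides pre-built tensor collections that cover a variety of regular expressions (regex) of parameters. As shown in the following example code, add the built-in tensor collections that you want to debug.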

```
from sagemaker.debugger import CollectionConfig

collection_configs=[
    CollectionConfig(name="weights"),
    CollectionConfig(name="gradients")
]
```

The preceding collections set up the Debugger hook to save the tensors every 500 steps, based on the default `"save_interval"` value.
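
To make the effect of the save interval concrete, the following minimal sketch lists which step indices would be written for a given interval. It assumes that a step is saved whenever its zero-based index is a multiple of the interval; `steps_to_save` is a hypothetical helper written for this illustration and is not a Debugger API.

```python
DEFAULT_SAVE_INTERVAL = 500  # default "save_interval" value stated above


def steps_to_save(total_steps, save_interval=DEFAULT_SAVE_INTERVAL):
    """List the step indices at which tensors would be saved.

    Assumption: a step is saved when its zero-based index is a
    multiple of save_interval.
    """
    return [step for step in range(total_steps) if step % save_interval == 0]


print(steps_to_save(1200))      # [0, 500, 1000]
print(steps_to_save(300, 100))  # [0, 100, 200]
```

A smaller interval gives finer-grained tensor history at the cost of more data written to S3.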

For a full list of the available Debugger built-in collections, see [Debugger Built-in Collections](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#collection).

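If you want to customize the built-in collections, for example by changing the save intervals and the tensor regex, use the following `CollectionConfig` template to adjust the parameters.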

```
from sagemaker.debugger import CollectionConfig

collection_configs=[
    CollectionConfig(
        name="tensor_collection",
        parameters={
            "key_1": "value_1",
            "key_2": "value_2",
            ...
            "key_n": "value_n"
        }
    )
]
```

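For more information about the available parameter keys, see [CollectionConfig](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.CollectionConfig) in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable). For example, the following code example shows how you can adjust the save intervals of the "losses" tensor collection at different phases of training: save the loss every 100 steps in the training phase, and save the validation loss every 10 steps in the validation phase.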

```
from sagemaker.debugger import CollectionConfig

collection_configs=[
    CollectionConfig(
        name="losses",
        parameters={
            "train.save_interval": "100",
            "eval.save_interval": "10"
        }
    )
]
```
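
The following sketch shows one way to read such a `parameters` dictionary. It assumes that a mode-prefixed key such as `train.save_interval` takes precedence over a plain `save_interval` key, and that the 500-step default applies when neither is present. `resolve_save_interval` is a hypothetical helper for illustration only, not an SDK function.

```python
def resolve_save_interval(parameters, mode, default=500):
    """Return the save interval that applies to a mode ("train" or "eval").

    Assumption: "<mode>.save_interval" overrides "save_interval",
    and the default is used when neither key is present.
    """
    for key in (f"{mode}.save_interval", "save_interval"):
        if key in parameters:
            # Parameter values are given as strings in the examples above
            return int(parameters[key])
    return default


losses_parameters = {"train.save_interval": "100", "eval.save_interval": "10"}
print(resolve_save_interval(losses_parameters, "train"))  # 100
print(resolve_save_interval(losses_parameters, "eval"))   # 10
print(resolve_save_interval({}, "train"))                 # 500
```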

**Tip**  
This tensor collection configuration object can be used for both the [DebuggerHookConfig](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-configure-hook.html#debugger-configure-tensor-hook) and [Rule](https://docs.aws.amazon.com/sagemaker/latest/dg/use-debugger-built-in-rules.html#debugger-built-in-rules-configuration-param-change) API operations.

# Configure the `DebuggerHookConfig` API to save tensors
<a name="debugger-configure-tensor-hook"></a>

Use the [DebuggerHookConfig](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.DebuggerHookConfig) API to create a `debugger_hook_config` object from the `collection_configs` object that you created in the previous step.

```
from sagemaker.debugger import DebuggerHookConfig

debugger_hook_config=DebuggerHookConfig(
    collection_configs=collection_configs
)
```

Debugger saves the model training output tensors to the default S3 bucket. The format of the default S3 bucket URI is `s3://amzn-s3-demo-bucket-sagemaker-<region>-<12digit_account_id>/<training-job-name>/debug-output/`.

If you want to specify an exact S3 bucket URI, use the following code example:

```
from sagemaker.debugger import DebuggerHookConfig

debugger_hook_config=DebuggerHookConfig(
    s3_output_path="specify-uri",
    collection_configs=collection_configs
)
```

For more information, see [DebuggerHookConfig](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.DebuggerHookConfig) in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable).

# Example notebooks and code samples to configure Debugger hook
<a name="debugger-save-tensors"></a>

The following sections provide notebooks and code examples of how to use the Debugger hook to save, access, and visualize output tensors.

**Topics**
+ [Tensor visualization example notebooks](#debugger-tensor-visualization-notebooks)
+ [Save tensors using Debugger built-in collections](#debugger-save-built-in-collections)
+ [Save tensors by modifying Debugger built-in collections](#debugger-save-modified-built-in-collections)
+ [Save tensors using Debugger custom collections](#debugger-save-custom-collections)

## Tensor visualization example notebooks
<a name="debugger-tensor-visualization-notebooks"></a>

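The following notebook examples show advanced use of Amazon SageMaker Debugger for visualizing tensors. Debugger provides a transparent view into training deep learning models.
+ [Interactive tensor analysis in a SageMaker Studio notebook with MXNet](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger/mnist_tensor_analysis)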

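  This notebook example shows how to visualize saved tensors using Amazon SageMaker Debugger. By visualizing the tensors, you can see how the tensor values change while training deep learning algorithms. This notebook includes a training job with a poorly configured neural network and uses Amazon SageMaker Debugger to aggregate and analyze tensors, including gradients, activation outputs, and weights. For example, the following plot shows the distribution of gradients of a convolutional layer that is suffering from a vanishing gradient problem.  
![\[A graph plotting the distribution of gradients.\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/debugger/debugger-vanishing-gradient.gif)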

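  This notebook also illustrates how a good initial hyperparameter setting improves the training process by generating the same tensor distribution plots.
+ [Visualizing and debugging tensors from MXNet model training](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger/mnist_tensor_plot)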

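  This notebook example shows how to save and visualize tensors from an MXNet Gluon model training job using Amazon SageMaker Debugger. It illustrates that Debugger is set to save all tensors to an Amazon S3 bucket and retrieves ReLU activation outputs for the visualization. The following figure shows a three-dimensional visualization of the ReLU activation outputs. The color scheme is set to blue to indicate values close to 0 and yellow to indicate values close to 1.  
![\[A visualization of the ReLU activation outputs\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/tensorplot.gif)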

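  In this notebook, the `TensorPlot` class imported from `tensor_plot.py` is designed to plot convolutional neural networks (CNNs) that take two-dimensional images as inputs. The `tensor_plot.py` script provided with the notebook retrieves tensors using Debugger and visualizes the CNN. You can run this notebook on SageMaker Studio to reproduce the tensor visualization and implement your own convolutional neural network model.
+ [Real-time tensor analysis in a SageMaker notebook with MXNet](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger/mxnet_realtime_analysis)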

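  This example guides you through installing the required components for emitting tensors in an Amazon SageMaker training job and using the Debugger API operations to access those tensors while training is running. A Gluon CNN model is trained on the Fashion MNIST dataset. While the job is running, you see how Debugger retrieves the activation outputs of the first convolutional layer from each of 100 batches and visualizes them. The example also shows you how to visualize weights after the job is done.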

## Save tensors using Debugger built-in collections
<a name="debugger-save-built-in-collections"></a>

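You can use built-in collections of tensors through the `CollectionConfig` API and save them through the `DebuggerHookConfig` API. The following example shows how to use the default settings of the Debugger hook configuration to construct a SageMaker AI TensorFlow estimator. You can also use this for MXNet, PyTorch, and XGBoost estimators.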

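**Note**  
In the following example code, the `s3_output_path` parameter for `DebuggerHookConfig` is optional. If you do not specify it, Debugger saves the tensors at `s3://<output_path>/debug-output/`, where `<output_path>` is the default output path of SageMaker training jobs. For example:  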

```
"s3://sagemaker-us-east-1-111122223333/sagemaker-debugger-training-YYYY-MM-DD-HH-MM-SS-123/debug-output"
```

```
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig

# use Debugger CollectionConfig to call built-in collections
collection_configs=[
        CollectionConfig(name="weights"),
        CollectionConfig(name="gradients"),
        CollectionConfig(name="losses"),
        CollectionConfig(name="biases")
    ]

# configure Debugger hook
# set a target S3 bucket as you want
sagemaker_session=sagemaker.Session()
BUCKET_NAME=sagemaker_session.default_bucket()
LOCATION_IN_BUCKET='debugger-built-in-collections-hook'

hook_config=DebuggerHookConfig(
    s3_output_path='s3://{BUCKET_NAME}/{LOCATION_IN_BUCKET}'.
                    format(BUCKET_NAME=BUCKET_NAME, 
                           LOCATION_IN_BUCKET=LOCATION_IN_BUCKET),
    collection_configs=collection_configs
)

# construct a SageMaker TensorFlow estimator
sagemaker_estimator=TensorFlow(
    entry_point='directory/to/your_training_script.py',
    role=sagemaker.get_execution_role(),
    base_job_name='debugger-demo-job',
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.9.0",
    py_version="py39",
    
    # debugger-specific hook argument below
    debugger_hook_config=hook_config
)

sagemaker_estimator.fit()
```

To see a list of the Debugger built-in collections, see [Debugger Built-in Collections](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#collection).

## Save tensors by modifying Debugger built-in collections
<a name="debugger-save-modified-built-in-collections"></a>

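You can modify the Debugger built-in collections using the `CollectionConfig` API operation. The following example shows how to tweak the built-in `losses` collection and construct a SageMaker AI TensorFlow estimator. You can also use this for MXNet, PyTorch, and XGBoost estimators.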

```
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig

# use Debugger CollectionConfig to call and modify built-in collections
collection_configs=[
    CollectionConfig(
                name="losses", 
                parameters={"save_interval": "50"})]

# configure Debugger hook
# set a target S3 bucket as you want
sagemaker_session=sagemaker.Session()
BUCKET_NAME=sagemaker_session.default_bucket()
LOCATION_IN_BUCKET='debugger-modified-collections-hook'

hook_config=DebuggerHookConfig(
    s3_output_path='s3://{BUCKET_NAME}/{LOCATION_IN_BUCKET}'.
                    format(BUCKET_NAME=BUCKET_NAME, 
                           LOCATION_IN_BUCKET=LOCATION_IN_BUCKET),
    collection_configs=collection_configs
)

# construct a SageMaker TensorFlow estimator
sagemaker_estimator=TensorFlow(
    entry_point='directory/to/your_training_script.py',
    role=sagemaker.get_execution_role(),
    base_job_name='debugger-demo-job',
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.9.0",
    py_version="py39",
    
    # debugger-specific hook argument below
    debugger_hook_config=hook_config
)

sagemaker_estimator.fit()
```

For a full list of `CollectionConfig` parameters, see [Debugger CollectionConfig API](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#configuring-collection-using-sagemaker-python-sdk).

## Save tensors using Debugger custom collections
<a name="debugger-save-custom-collections"></a>

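You can also save a reduced number of tensors instead of the full set of tensors (for example, if you want to reduce the amount of data saved in your Amazon S3 bucket). The following example shows how to customize the Debugger hook configuration to specify the target tensors that you want to save. You can use this for TensorFlow, MXNet, PyTorch, and XGBoost estimators.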

```
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig

# use Debugger CollectionConfig to create a custom collection
collection_configs=[
        CollectionConfig(
            name="custom_activations_collection",
            parameters={
                "include_regex": "relu|tanh", # Required
                "reductions": "mean,variance,max,abs_mean,abs_variance,abs_max"
            })
    ]
    
# configure Debugger hook
# set a target S3 bucket as you want
sagemaker_session=sagemaker.Session()
BUCKET_NAME=sagemaker_session.default_bucket()
LOCATION_IN_BUCKET='debugger-custom-collections-hook'

hook_config=DebuggerHookConfig(
    s3_output_path='s3://{BUCKET_NAME}/{LOCATION_IN_BUCKET}'.
                    format(BUCKET_NAME=BUCKET_NAME, 
                           LOCATION_IN_BUCKET=LOCATION_IN_BUCKET),
    collection_configs=collection_configs
)

# construct a SageMaker TensorFlow estimator
sagemaker_estimator=TensorFlow(
    entry_point='directory/to/your_training_script.py',
    role=sagemaker.get_execution_role(),
    base_job_name='debugger-demo-job',
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.9.0",
    py_version="py39",
    
    # debugger-specific hook argument below
    debugger_hook_config=hook_config
)

sagemaker_estimator.fit()
```

For a full list of `CollectionConfig` parameters, see [Debugger CollectionConfig](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#configuring-collection-using-sagemaker-python-sdk).
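
To get a feel for what the `include_regex` value `"relu|tanh"` selects, the following sketch applies the same pattern to a few tensor names with Python's `re` module. The tensor names are hypothetical, and the sketch assumes that a tensor is included when the pattern matches anywhere in its name; check the `sagemaker-debugger` documentation for the exact matching behavior.

```python
import re

include_regex = "relu|tanh"  # same pattern as in the custom collection above

# Hypothetical tensor names, for illustration only
tensor_names = [
    "conv1/relu_output_0",
    "dense_1/tanh_output_0",
    "dense_1/kernel:0",
    "dense_1/bias:0",
]

pattern = re.compile(include_regex)
# Assumption: a name is selected when the pattern matches anywhere in it
selected = [name for name in tensor_names if pattern.search(name)]
print(selected)  # ['conv1/relu_output_0', 'dense_1/tanh_output_0']
```

Testing a pattern locally like this can help you confirm that a custom collection is not broader or narrower than intended before you launch a training job.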

# How to configure Debugger built-in rules
<a name="use-debugger-built-in-rules"></a>

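In the following topics, you learn how to use the SageMaker Debugger built-in rules. Amazon SageMaker Debugger's built-in rules analyze the tensors emitted during the training of a model. SageMaker Debugger offers the `Rule` API operation, which monitors training job progress and errors so that you can train your model successfully. For example, the rules can detect whether gradients are getting too large or too small, whether a model is overfitting or overtraining, and whether a training job fails to decrease the loss function and improve. To see a full list of the available built-in rules, see [List of Debugger built-in rules](debugger-built-in-rules.md).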

**Topics**
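+ [Use Debugger built-in rules with the default parameter settings](debugger-built-in-rules-configuration.md)
+ [Use Debugger built-in rules with custom parameter values](debugger-built-in-rules-configuration-param-change.md)
+ [Example notebooks and code samples to configure Debugger rules](debugger-built-in-rules-example.md)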

# Use Debugger built-in rules with the default parameter settings
<a name="debugger-built-in-rules-configuration"></a>

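To specify Debugger built-in rules in an estimator, you need to configure a list object. The following example code shows the basic structure of a list of Debugger built-in rules.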

```
from sagemaker.debugger import Rule, rule_configs

rules=[
    Rule.sagemaker(rule_configs.built_in_rule_name_1()),
    Rule.sagemaker(rule_configs.built_in_rule_name_2()),
    ...
    Rule.sagemaker(rule_configs.built_in_rule_name_n()),
    ... # You can also append more profiler rules in the ProfilerRule.sagemaker(rule_configs.*()) format.
]
```

For more information about the default parameter values and descriptions of the built-in rules, see [List of Debugger built-in rules](debugger-built-in-rules.md).

To find the SageMaker Debugger API reference, see [`sagemaker.debugger.rule_configs`](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.sagemaker.debugger.rule_configs) and [`sagemaker.debugger.Rule`](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.Rule).

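For example, to inspect the overall training performance and progress of your model, construct a SageMaker AI estimator with the following built-in rule configuration.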

```
from sagemaker.debugger import Rule, rule_configs

rules=[
    Rule.sagemaker(rule_configs.loss_not_decreasing()),
    Rule.sagemaker(rule_configs.overfit()),
    Rule.sagemaker(rule_configs.overtraining()),
    Rule.sagemaker(rule_configs.stalled_training_rule())
]
```

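When you start the training job, Debugger collects system resource utilization data every 500 milliseconds, and the loss and accuracy values every 500 steps by default. Debugger analyzes the resource utilization to identify whether your model has bottleneck problems. The `loss_not_decreasing`, `overfit`, `overtraining`, and `stalled_training_rule` rules monitor whether your model is optimizing the loss function without those training issues. If the rules detect training anomalies, the rule evaluation status changes to `IssuesFound`. You can set up automated actions, such as notifying you of training issues and stopping training jobs, using Amazon CloudWatch Events and AWS Lambda. For more information, see [Action on Amazon SageMaker Debugger rules](debugger-action-on-rules.md).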
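
As a rough intuition for what a rule such as `loss_not_decreasing` checks, the following sketch flags a loss history that has stopped improving. It is a simplified stand-in and not the built-in rule's actual logic: the `window` and `min_decrease` parameters and the `evaluate_loss_rule` function are hypothetical, and returning `InProgress` while there is too little data is a choice made for this sketch. The status strings follow the `RuleEvaluationStatus` values of the SageMaker API.

```python
def evaluate_loss_rule(losses, window=3, min_decrease=0.01):
    """Return a rule-evaluation-style status for a list of loss values.

    Hypothetical simplification: compare the latest loss with the loss
    `window` steps earlier and require a relative decrease of at least
    `min_decrease`.
    """
    if len(losses) <= window:
        return "InProgress"  # not enough history to judge yet
    previous, latest = losses[-window - 1], losses[-1]
    relative_decrease = (previous - latest) / abs(previous) if previous else 0.0
    return "IssuesFound" if relative_decrease < min_decrease else "NoIssuesFound"


print(evaluate_loss_rule([2.0, 1.5, 1.1, 0.8, 0.6]))  # NoIssuesFound
print(evaluate_loss_rule([2.0, 0.9, 0.9, 0.9, 0.9]))  # IssuesFound
```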


# Use Debugger built-in rules with custom parameter values
<a name="debugger-built-in-rules-configuration-param-change"></a>

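If you want to adjust the built-in rule parameter values and customize the tensor collection regex, configure the `base_config` and `rule_parameters` parameters of the `ProfilerRule.sagemaker` and `Rule.sagemaker` class methods. For the `Rule.sagemaker` class method, you can also customize tensor collections through the `collections_to_save` parameter. For instructions on how to use the `CollectionConfig` class, see [Configure tensor collections using the `CollectionConfig` API](debugger-configure-tensor-collections.md).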

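Use the following configuration template for built-in rules to customize parameter values. By changing the rule parameters as you want, you can adjust how sensitive the rules are to being triggered.
+ The `base_config` argument is where you call the built-in rule methods.
+ The `rule_parameters` argument adjusts the default key values of the built-in rules listed in [List of Debugger built-in rules](debugger-built-in-rules.md).
+ The `collections_to_save` argument takes a tensor configuration through the `CollectionConfig` API, which requires the `name` and `parameters` arguments.
  + To find the available tensor collections for `name`, see [Debugger Built-in Tensor Collections](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#built-in-collections).
  + For a full list of adjustable `parameters`, see [Debugger CollectionConfig API](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#configuring-collection-using-sagemaker-python-sdk).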

For more information about the Debugger rule class, methods, and parameters, see [SageMaker AI Debugger Rule class](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html) in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable).

```
from sagemaker.debugger import Rule, ProfilerRule, rule_configs, CollectionConfig

rules=[
    Rule.sagemaker(
        base_config=rule_configs.built_in_rule_name(),
        rule_parameters={
                "key": "value"
        },
        collections_to_save=[ 
            CollectionConfig(
                name="tensor_collection_name", 
                parameters={
                    "key": "value"
                } 
            )
        ]
    )
]
```

The parameter descriptions and examples of customizing parameter values are provided for each rule in [List of Debugger built-in rules](debugger-built-in-rules.md).
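
Conceptually, `rule_parameters` acts as a set of overrides on top of a rule's default key values. The following sketch illustrates that idea with a plain dictionary merge. The `merge_rule_parameters` helper and the parameter names `threshold` and `patience` are hypothetical, and serializing the values as strings is an assumption based on the string values used in the examples in this guide.

```python
def merge_rule_parameters(defaults, overrides):
    """Apply user overrides to a rule's default parameters.

    Hypothetical sketch: rejects keys that the rule does not define and
    returns every value as a string.
    """
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise KeyError(f"Unknown rule parameter(s): {sorted(unknown)}")
    merged = {**defaults, **overrides}
    return {key: str(value) for key, value in merged.items()}


# Hypothetical default key values for an imaginary rule
defaults = {"threshold": 10.0, "patience": 5}
print(merge_rule_parameters(defaults, {"threshold": 20.0}))
# {'threshold': '20.0', 'patience': '5'}
```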

# Example notebooks and code samples to configure Debugger rules
<a name="debugger-built-in-rules-example"></a>

The following sections provide notebooks and code samples of how to use Debugger rules to monitor SageMaker training jobs.

**Topics**
+ [Debugger built-in rules example notebooks](#debugger-built-in-rules-notebook-example)
+ [Debugger built-in rules example code](#debugger-deploy-built-in-rules)
+ [Use Debugger built-in rules with parameter modifications](#debugger-deploy-modified-built-in-rules)

## 偵錯工具內建規則範例筆記本
<a name="debugger-built-in-rules-notebook-example"></a>

下列範例筆記本示範如何在使用 Amazon SageMaker AI 執行訓練任務時，使用偵錯工具內建規則：
+ [搭配 TensorFlow 使用 SageMaker Debugger 內建規則](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger/tensorflow_builtin_rule)
+ [搭配受管 Spot 訓練和 MXNet 使用 SageMaker Debugger 內建規則](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger/mxnet_spot_training)
+ [使用具有參數修改功能的 SageMaker Debugger 內建規則，以利用 XGBoost 即時訓練任務分析](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger/xgboost_realtime_analysis)

在 SageMaker Studio 中執行範例筆記本時，您可以找到在 **Studio 實驗清單**索引標籤上建立的訓練任務試驗。例如，如下列螢幕擷取畫面所示，您可以尋找並開啟目前訓練任務的**描述試驗元件**視窗。在偵錯工具索引標籤上，您可以檢查偵錯程式規則 (`vanishing_gradient()` 和 `loss_not_decreasing()`) 是否平行監控訓練任務工作階段。有關如何在 Studio 使用者介面中查找訓練工作試用組件的完整說明，請參閱[SageMaker Studio - 查看實驗、試驗和試用組件](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-tasks.html#studio-tasks-experiments)。

![\[使用 SageMaker Studio 中啟動的偵錯工具內建規則來執行訓練任務的映像\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/debugger/debugger-built-in-rule-studio.png)


在 SageMaker AI 環境中有兩種使用偵錯工具內建規則的方法：在準備時部署內建規則，或根據需要調整其參數。下列主題示範如何搭配範例程式碼使用內建規則。

## 偵錯工具內建規則範例程式碼
<a name="debugger-deploy-built-in-rules"></a>

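The following code example shows how to set Debugger built-in rules using the `Rule.sagemaker` method. To specify the built-in rules that you want to run, call them through the `rule_configs` API. For the full list of Debugger built-in rules and their default parameter values, see [List of Debugger built-in rules](debugger-built-in-rules.md).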

```
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import Rule, CollectionConfig, rule_configs

# call built-in rules that you want to use.
built_in_rules=[ 
            Rule.sagemaker(rule_configs.vanishing_gradient()),
            Rule.sagemaker(rule_configs.loss_not_decreasing())
]

# construct a SageMaker AI estimator with the Debugger built-in rules
sagemaker_estimator=TensorFlow(
    entry_point='directory/to/your_training_script.py',
    role=sagemaker.get_execution_role(),
    base_job_name='debugger-built-in-rules-demo',
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.9.0",
    py_version="py39",

    # debugger-specific arguments below
    rules=built_in_rules
)
sagemaker_estimator.fit()
```

**注意**  
偵錯工具內建規則會與您的訓練任務平行執行。訓練任務的內建規則容器數量上限為 20。

有關 Debugger 規則類別、方法和參數的詳細資訊，請參閱 [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) 中的 [SageMaker Debugger 規則類別](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html) 

要查找有關如何調整 Debugger 規則參數的範例，請參閱以下 [使用偵錯工具內建規則與參數修改](#debugger-deploy-modified-built-in-rules) 部分。

## 使用偵錯工具內建規則與參數修改
<a name="debugger-deploy-modified-built-in-rules"></a>

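The following code example shows the structure of a built-in rule with adjusted parameters. In this example, the `stalled_training_rule` collects the `losses` tensor collection from the training job every 50 steps and from the evaluation stage every 10 steps. If the training process stalls and no tensor output is collected for 120 seconds, the `stalled_training_rule` stops the training job.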

```
import time
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import Rule, CollectionConfig, rule_configs

# call the built-in rules and modify the CollectionConfig parameters

base_job_name_prefix= 'smdebug-stalled-demo-' + str(int(time.time()))

built_in_rules_modified=[
    Rule.sagemaker(
        base_config=rule_configs.stalled_training_rule(),
        rule_parameters={
                'threshold': '120',
                'training_job_name_prefix': base_job_name_prefix,
                'stop_training_on_fire' : 'True'
        },
        collections_to_save=[ 
            CollectionConfig(
                name="losses", 
                parameters={
                      "train.save_interval": "50",
                      "eval.save_interval": "10"
                } 
            )
        ]
    )
]

# construct a SageMaker AI estimator with the modified Debugger built-in rule
sagemaker_estimator=TensorFlow(
    entry_point='directory/to/your_training_script.py',
    role=sagemaker.get_execution_role(),
    base_job_name=base_job_name_prefix,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.9.0",
    py_version="py39",

    # debugger-specific arguments below
    rules=built_in_rules_modified
)
sagemaker_estimator.fit()
```

關於使用 `CreateTrainingJob` API 的 Debugger 內建規則的進階組態，請參閱[使用 SageMaker API 設定 Debugger](debugger-createtrainingjob-api.md)。

# 關閉偵錯工具
<a name="debugger-turn-off"></a>

如果您想完全關閉 Debugger，請執行下列其中一項操作：
+ 開始訓練任務之前，請執行下列操作：

  若要同時停止監控和效能分析，請將 `disable_profiler` 參數包含在估算器中，並將其設定為 `True`。
**警告**  
如果停用它，您將無法檢視全方位的 Studio Debugger 深入分析儀表板和自動產生的分析報告。

  若要停止偵錯，請將 `debugger_hook_config` 參數設定為 `False`。
**警告**  
如果停用它，就無法收集輸出張量，也無法偵錯模型參數。

  ```
  estimator=Estimator(
      ...
      disable_profiler=True,
      debugger_hook_config=False
  )
  ```

  有關偵錯工具特定參數的更多相關資訊，請參閱 [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) 中的 [SageMaker AI 估算器](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Estimator)。
+ 當訓練工作正在執行時，請執行下列操作：

  To disable both monitoring and profiling while your training job is running, use the following estimator class method:

  ```
  estimator.disable_profiling()
  ```

  To disable only framework profiling and keep system monitoring, use the `update_profiler` method:

  ```
  estimator.update_profiler(disable_framework_metrics=True)
  ```

  For more information about the estimator extension methods, see the [estimator.disable_profiling](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Estimator.disable_profiling) and [estimator.update_profiler](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Estimator.update_profiler) class methods in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) documentation.

# Debugger 的實用 SageMaker AI 估算器類別方法
<a name="debugger-estimator-classmethods"></a>

下列估算器類別方法適用於存取 SageMaker 訓練任務資訊，以及擷取 Debugger 所收集之訓練資料的輸出路徑。使用 `estimator.fit()` 方法初始化訓練任務後，可執行下列方法。
+ 若要檢查 SageMaker 訓練任務的基礎 S3 儲存貯體 URI：

  ```
  estimator.output_path
  ```
+ 若要檢查 SageMaker 訓練任務的基礎作業名稱：

  ```
  estimator.latest_training_job.job_name
  ```
+ 若要查看 SageMaker 訓練任務的完整 `CreateTrainingJob` API 作業組態：

  ```
  estimator.latest_training_job.describe()
  ```
+ 若要在 SageMaker 訓練任務執行時檢查 Debugger 規則的完整清單：

  ```
  estimator.latest_training_job.rule_job_summary()
  ```
+ 若要檢查儲存模型參數資料 (輸出張量) 的 S3 儲存貯體 URI：

  ```
  estimator.latest_job_debugger_artifacts_path()
  ```
+ 若要檢查儲存模型效能資料 (系統和架構指標) 的 S3 儲存貯體 URI：

  ```
  estimator.latest_job_profiler_artifacts_path()
  ```
+ 若要檢查偵錯輸出張量的 Debugger 規則組態：

  ```
  estimator.debugger_rule_configs
  ```
+ 若要在 SageMaker 訓練任務執行時檢查 Debugger 的偵錯規則清單：

  ```
  estimator.debugger_rules
  ```
+ 若要檢查 Debugger 的監控和分析系統規則組態與架構指標：

  ```
  estimator.profiler_rule_configs
  ```
+ 若要在 SageMaker 訓練任務執行時檢查 Debugger 的監控和分析規則清單：

  ```
  estimator.profiler_rules
  ```

如需 SageMaker AI 估算器類別及其方法的更多相關資訊，請參閱 [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) 的 [估算器 API](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Estimator)。

# SageMaker Debugger 針對 XGBoost 的互動式報告
<a name="debugger-report-xgboost"></a>

獲得由 Debugger 自動產生的訓練報告。Debugger 報告可提供訓練任務的深入解析，並提供改善模型效能的建議。對於 SageMaker AI XGBoost 訓練任務，請使用 Debugger [CreateXgboostReport](debugger-built-in-rules.md#create-xgboost-report) 規則來獲得訓練進度和結果的完整訓練報告。依照本指南，指定建構 XGBoost 估算器時的 [CreateXgboostReport](debugger-built-in-rules.md#create-xgboost-report) 規則、使用 [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) 或 Amazon S3 主控台下載報告，並深入了解訓練結果。

**注意**  
您可以在訓練工作執行時或工作完成時，下載 Debugger 報告。在訓練期間，Debugger 會同時更新報告，反映目前規則的評估狀態。只有在訓練工作完成後，您才能下載完整的 Debugger 報告。

**重要**  
報告中的圖表和建議僅用於提供資訊，並非絕對。由您負責對資訊進行您自己獨立的評估。

**Topics**
+ [使用 Debugger XGBoost 報告規則建構 SageMaker AI XGBoost 估算器](debugger-training-xgboost-report-estimator.md)
+ [下載 Debugger XGBoost 訓練報告](debugger-training-xgboost-report-download.md)
+ [Debugger XGBoost 訓練報告演練](debugger-training-xgboost-report-walkthrough.md)

# 使用 Debugger XGBoost 報告規則建構 SageMaker AI XGBoost 估算器
<a name="debugger-training-xgboost-report-estimator"></a>

[CreateXgboostReport](debugger-built-in-rules.md#create-xgboost-report) 規則會從訓練任務收集下列輸出張量：
+ `hyperparameters` – 在第一個步驟進行儲存。
+ `metrics` – 每 5 個步驟儲存損失和準確性。
+ `feature_importance` – 每 5 個步驟進行儲存。
+ `predictions` – 每 5 個步驟進行儲存。
+ `labels` – 每 5 個步驟進行儲存。

輸出張量會儲存在預設的 S3 儲存貯體。例如 `s3://sagemaker-<region>-<12digit_account_id>/<base-job-name>/debug-output/`。

當您為 XGBoost 訓練任務建構 SageMaker AI 估算器時，請指定如下列範例程式碼所示的規則。

------
#### [ Using the SageMaker AI generic estimator ]

```
import boto3
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker import image_uris
from sagemaker.debugger import Rule, rule_configs

rules=[
    Rule.sagemaker(rule_configs.create_xgboost_report())
]

region = boto3.Session().region_name
xgboost_container=sagemaker.image_uris.retrieve("xgboost", region, "1.2-1")

estimator=Estimator(
    role=sagemaker.get_execution_role(),
    image_uri=xgboost_container,
    base_job_name="debugger-xgboost-report-demo",
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    
    # Add the Debugger XGBoost report rule
    rules=rules
)

estimator.fit(wait=False)
```

------

# 下載 Debugger XGBoost 訓練報告
<a name="debugger-training-xgboost-report-download"></a>

Download the Debugger XGBoost training report while your training job is running or after the job has finished, using the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) and the AWS Command Line Interface (CLI).

------
#### [ Download using the SageMaker Python SDK and AWS CLI ]

1. 檢查目前任務的預設 S3 輸出基礎 URI。

   ```
   estimator.output_path
   ```

1. 檢查目前的任務名稱。

   ```
   estimator.latest_training_job.job_name
   ```

1. Debugger XGBoost 報告會儲存在 `<default-s3-output-base-uri>/<training-job-name>/rule-output` 底下。設定規則輸出路徑，如下所示：

   ```
   rule_output_path = estimator.output_path + "/" + estimator.latest_training_job.job_name + "/rule-output"
   ```

1. To check whether the report has been generated, list the directories and files under `rule_output_path` recursively using `aws s3 ls` with the `--recursive` option.

   ```
   ! aws s3 ls {rule_output_path} --recursive
   ```

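   This should return a complete list of files under autogenerated folders named `CreateXgboostReport` and `ProfilerReport-1234567890`. The XGBoost training report is stored in the `CreateXgboostReport` folder, and the profiling report is stored in the `ProfilerReport-1234567890` folder. To learn more about the profiling report generated by default with the XGBoost training job, see [SageMaker Debugger interactive report](debugger-profiling-report.md).  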
![\[規則輸出的範例。\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/debugger/debugger-xgboost-report-ls.png)

   `xgboost_report.html` 是由 Debugger 自動產生的 XGBoost 訓練報告。`xgboost_report.ipynb` 是 Jupyter 筆記本，用來將訓練結果彙總到報告中。您可以使用筆記本下載所有檔案、瀏覽 HTML 報告檔案並修改報告。

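1. Download the files recursively using `aws s3 cp`. The following command saves all of the rule output files to the current working directory, under the `CreateXgboostReport` and `ProfilerReport-1234567890` folders.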

   ```
   ! aws s3 cp {rule_output_path} ./ --recursive
   ```
**提示**  
如果您使用 Jupyter 筆記本伺服器，請執行 `!pwd` 以驗證目前的工作目錄。

1. 在 `/CreateXgboostReport` 目錄底下，開啟 `xgboost_report.html`。如果您使用 JupyterLab，請選擇**信任 HTML** 以查看自動產生的 Debugger 訓練報告。  
![\[規則輸出的範例。\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/debugger/debugger-xgboost-report-open-trust.png)

1. 開啟 `xgboost_report.ipynb` 檔案以探索產生報告的方式。您可以使用 Jupyter 筆記本檔案來自訂及擴充訓練報告。
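
The rule output path in the steps above is built by joining the S3 output base URI, the training job name, and `rule-output`. The following is a minimal sketch of that join that also tolerates a trailing slash on the base URI. The function name `rule_output_path` and the bucket and job names are placeholders for illustration, not values produced by the SDK.

```python
def rule_output_path(output_path, job_name):
    """Join the S3 base URI and job name into the rule-output prefix."""
    # Strip a trailing slash so the result never contains "//" after the bucket.
    return output_path.rstrip("/") + "/" + job_name + "/rule-output"


# Placeholder values for illustration only.
print(rule_output_path("s3://amzn-s3-demo-bucket/", "debugger-xgboost-report-demo-0000"))
```

In a notebook, you would pass `estimator.output_path` and `estimator.latest_training_job.job_name` as the two arguments.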

------
#### [ Download using the Amazon S3 console ]

1. Sign in to the AWS Management Console and open the Amazon S3 console at [https://console.aws.amazon.com/s3/](https://console.aws.amazon.com/s3/).

1. 搜尋基礎 S3 儲存貯體。例如，如果您尚未指定任何基本作業名稱，則基礎 S3 儲存貯體名稱應採用下列格式：`sagemaker-<region>-111122223333`。透過**依名稱尋找儲存貯體**欄位查詢基礎 S3 儲存貯體。  
![\[Amazon S3 主控台中的依名稱尋找儲存貯體欄位。\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/debugger/debugger-report-download-s3console-0.png)

1. 在基礎 S3 儲存貯體中，在**依字首尋找物件**中輸入工作名稱字首，然後選擇訓練工作名稱，以查詢訓練工作名稱。  
![\[Amazon S3 主控台中的依字首尋找物件欄位。\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/debugger/debugger-report-download-s3console-1.png)

1. 在訓練任務的 S3 儲存貯體中，選擇 **rule-output/** 子資料夾。Debugger 所收集的訓練資料必須有三個子資料夾：**debug-output/**、**profiler-output/** 和 **rule-output/**。  
![\[規則輸出 S3 儲存貯體 URI 的範例。\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/debugger/debugger-report-download-s3console-2.png)

1. In the **rule-output/** folder, choose the **CreateXgboostReport/** folder. The folder contains **xgboost_report.html** (an autogenerated report in HTML format) and **xgboost_report.ipynb** (a Jupyter notebook with the scripts used to generate the report).

1. Choose the **xgboost_report.html** file, choose **Download actions**, and then choose **Download**.  
![\[規則輸出 S3 儲存貯體 URI 的範例。\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/debugger/debugger-xgboost-report-s3-download.png)

1. Open the downloaded **xgboost_report.html** file in a web browser.

------

# Debugger XGBoost 訓練報告演練
<a name="debugger-training-xgboost-report-walkthrough"></a>

本節將逐步引導您完成 Debugger XGBoost 訓練報告。報告會根據輸出張量 regex 自動彙總，以識別二進制分類、多類別分類和迴歸中的訓練任務類型。

**重要**  
在報告中，系統會提供資訊圖表和相關建議，其中的內容並非絕對。您有責任對當中的資訊進行自己的獨立評估。

**Topics**
+ [資料集準確標籤的分佈](#debugger-training-xgboost-report-walkthrough-dist-label)
+ [損失對比步驟圖表](#debugger-training-xgboost-report-walkthrough-loss-vs-step)
+ [功能重要性](#debugger-training-xgboost-report-walkthrough-feature-importance)
+ [混淆矩陣](#debugger-training-xgboost-report-walkthrough-confusion-matrix)
+ [混淆矩陣的評估](#debugger-training-xgboost-report-walkthrough-eval-conf-matrix)
+ [每個對角線元素超過迭代的準確率](#debugger-training-xgboost-report-walkthrough-accuracy-rate)
+ [接收器操作特性曲線](#debugger-training-xgboost-report-walkthrough-rec-op-char)
+ [上次儲存的步驟之殘差分佈](#debugger-training-xgboost-report-walkthrough-dist-residual)
+ [每個標籤容器超過迭代的絕對驗證錯誤](#debugger-training-xgboost-report-walkthrough-val-error-per-label-bin)

## 資料集準確標籤的分佈
<a name="debugger-training-xgboost-report-walkthrough-dist-label"></a>

此長條圖會顯示原始資料集中標籤類別 (用於分類) 或值 (用於迴歸) 的分佈。資料集中的偏態可能會導致不準確。此視覺化內容適用於下列模型類型：二進制分類、多重分類和迴歸。

![\[資料集圖表準確標籤分佈的範例。\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/debugger/debugger-training-xgboost-report-walkthrough-dist-label.png)


## 損失對比步驟圖表
<a name="debugger-training-xgboost-report-walkthrough-loss-vs-step"></a>

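This is a line chart that shows how the loss on the training data and the validation data progresses throughout the training steps. The loss is what you defined in your objective function, such as mean squared error. From this plot, you can gauge whether the model is overfit or underfit. This section also provides insights that you can use to determine how to resolve overfit and underfit problems. This visualization applies to the following model types: binary classification, multiclass classification, and regression.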

![\[損失對照步驟圖表的一個範例。\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/debugger/debugger-training-xgboost-report-walkthrough-loss-vs-step.png)


## 功能重要性
<a name="debugger-training-xgboost-report-walkthrough-feature-importance"></a>

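Three different types of feature importance visualizations are provided: weight, gain, and coverage. The report gives detailed definitions for each of the three. Feature importance visualizations help you learn which features in your training dataset contributed to the predictions. Feature importance visualizations apply to the following model types: binary classification, multiclass classification, and regression.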

![\[功能重要性圖表的範例。\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/debugger/debugger-training-xgboost-report-walkthrough-feature-importance.png)


## 混淆矩陣
<a name="debugger-training-xgboost-report-walkthrough-confusion-matrix"></a>

此視覺化內容僅適用於二進位和多類別分類模型。單憑準確度可能不足以評估模型效能。對於某些使用案例，例如醫療保健和詐騙偵測，了解假陽性率和假陰性率也很重要。混淆矩陣為您提供用於評估模型效能的其他維度。

![\[混淆矩陣的一個範例。\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/debugger/debugger-training-xgboost-report-walkthrough-confusion-matrix.png)


## 混淆矩陣的評估
<a name="debugger-training-xgboost-report-walkthrough-eval-conf-matrix"></a>

This section provides more insights into the micro, macro, and weighted metrics for your model's precision, recall, and F1-score.

![\[混淆矩陣的評估。\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/debugger/debugger-training-xgboost-report-walkthrough-eval-conf-matrix.png)
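
To make the three averaging schemes concrete, the following is a minimal sketch that computes micro, macro, and weighted precision, recall, and F1 from a confusion matrix using the standard definitions. It is not the code that generates the report. It assumes single-label classification, a non-empty matrix, and that `cm[i][j]` counts samples of true class `i` predicted as class `j`. The function name `averaged_metrics` is hypothetical.

```python
def averaged_metrics(cm):
    """Return micro, macro, and weighted (precision, recall, F1) tuples.

    cm[i][j] is the number of samples with true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)

    def f1(precision, recall):
        denominator = precision + recall
        return 2 * precision * recall / denominator if denominator else 0.0

    per_class = []
    for i in range(n):
        tp = cm[i][i]
        predicted = sum(cm[row][i] for row in range(n))  # column sum: TP + FP
        support = sum(cm[i])  # row sum: TP + FN
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        per_class.append((precision, recall, f1(precision, recall), support))

    # Macro: unweighted mean over classes. Weighted: mean weighted by class support.
    macro = tuple(sum(c[k] for c in per_class) / n for k in range(3))
    weighted = tuple(sum(c[k] * c[3] for c in per_class) / total for k in range(3))
    # Micro: for single-label classification all three values equal overall accuracy.
    accuracy = sum(cm[i][i] for i in range(n)) / total
    return {"micro": (accuracy,) * 3, "macro": macro, "weighted": weighted}


print(averaged_metrics([[8, 2], [1, 9]]))
```

Macro averaging treats every class equally, while weighted averaging gives larger classes more influence, so the two diverge most on imbalanced datasets.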


## 每個對角線元素超過迭代的準確率
<a name="debugger-training-xgboost-report-walkthrough-accuracy-rate"></a>

此視覺化內容僅適用於二進制分類和多類別分類模型。這是一個折線圖，用於在每個類別訓練步驟中繪製混淆矩陣中的對角值。此圖顯示了每個類別在整個訓練步驟中的準確性如何進展。您可以從此圖中識別表現不佳的類別。

![\[每個對角線元素在迭代圖表上的準確率之範例。\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/debugger/debugger-training-xgboost-report-walkthrough-accuracy-rate.gif)


## 接收器操作特性曲線
<a name="debugger-training-xgboost-report-walkthrough-rec-op-char"></a>

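This visualization applies only to binary classification models. The receiver operating characteristic (ROC) curve is commonly used to evaluate the performance of binary classification models. The y-axis of the curve is the true positive rate (TPR), and the x-axis is the false positive rate (FPR). The plot also displays the value of the area under the curve (AUC). The higher the AUC value, the more predictive your classifier is. You can also use the ROC curve to understand the trade-off between TPR and FPR and to identify the optimal classification threshold for your use case. The classification threshold can be adjusted to tune the behavior of the model and reduce one or the other type of error (FP/FN).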

![\[接收器操作特性曲線圖表的範例。\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/debugger/debugger-training-xgboost-report-walkthrough-rec-op-char.png)
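
Each point on an ROC curve corresponds to one classification threshold. As a minimal sketch with made-up labels and scores, the following computes the (FPR, TPR) pair for a few thresholds to show the trade-off. The function name `roc_point` is hypothetical, and the sketch assumes the data contains at least one positive and one negative label.

```python
def roc_point(labels, scores, threshold):
    """Return (FPR, TPR) when scores >= threshold are classified as positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    return fp / negatives, tp / positives


# Made-up data for illustration only.
labels = [0, 0, 1, 1]
scores = [0.1, 0.6, 0.4, 0.9]
for threshold in (0.3, 0.5, 0.7):
    print(threshold, roc_point(labels, scores, threshold))
```

Raising the threshold lowers the false positive rate at the cost of a lower true positive rate, which is the trade-off the curve visualizes.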


## 上次儲存的步驟之殘差分佈
<a name="debugger-training-xgboost-report-walkthrough-dist-residual"></a>

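This visualization is a column chart that shows the distribution of residuals at the last step that Debugger captures. In this visualization, you can check whether the residual distribution is close to a normal distribution centered at zero. If the residuals are skewed, your features may not be sufficient for predicting the labels.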

![\[上次儲存步驟圖表之殘差分佈的範例。\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/debugger/debugger-training-xgboost-report-walkthrough-dist-residual.png)


## 每個標籤容器超過迭代的絕對驗證錯誤
<a name="debugger-training-xgboost-report-walkthrough-val-error-per-label-bin"></a>

此視覺化內容僅適用於迴歸模型。實際的目標值會分割為 10 個間隔。此視覺化內容顯示直線圖中訓練步驟中每個間隔的驗證錯誤如何進展。絕對驗證錯誤是驗證期間預測和實際差異的絕對值。您可以從此視覺化內容中識別效能不佳的間隔。

![\[每個標籤容器在迭代圖表上的絕對驗證錯誤之範例。\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/debugger/debugger-training-xgboost-report-walkthrough-val-error-per-label-bin.png)
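
The following is a minimal sketch of the per-interval absolute validation error for a single step. It assumes equal-width intervals over the range of the actual target values and averages the absolute error within each interval; the report's exact binning and aggregation are not described here, so treat both as assumptions. The function name `abs_error_per_bin` is hypothetical.

```python
def abs_error_per_bin(actual, predicted, num_bins=10):
    """Mean absolute error, grouped by equal-width bins of the actual values."""
    lo, hi = min(actual), max(actual)
    # Fall back to a width of 1.0 when all target values are identical.
    width = (hi - lo) / num_bins or 1.0
    sums = [0.0] * num_bins
    counts = [0] * num_bins
    for y, y_hat in zip(actual, predicted):
        # The maximum target value is clamped into the last bin.
        index = min(int((y - lo) / width), num_bins - 1)
        sums[index] += abs(y_hat - y)
        counts[index] += 1
    # Bins with no samples are reported as None.
    return [s / c if c else None for s, c in zip(sums, counts)]


print(abs_error_per_bin([0.0, 10.0], [1.0, 8.0], num_bins=2))
```

Repeating this computation at each saved step and plotting one line per interval gives a chart of the kind described above.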


# Amazon SageMaker Debugger 規則的動作
<a name="debugger-action-on-rules"></a>

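Based on the Debugger rule evaluation status, you can set up automated actions such as stopping a training job and sending notifications using Amazon Simple Notification Service (Amazon SNS). You can also create your own actions using Amazon CloudWatch Events and AWS Lambda. To learn how to set up automated actions based on the Debugger rule evaluation status, see the following topics.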

**Topics**
+ [使用偵錯工具內建動作來執行規則](debugger-built-in-actions.md)
+ [Action on rules using Amazon CloudWatch and AWS Lambda](debugger-cloudwatch-lambda.md)

# 使用偵錯工具內建動作來執行規則
<a name="debugger-built-in-actions"></a>

使用偵錯工具內建動作來回應 [偵錯工具規則](debugger-built-in-rules.md#debugger-built-in-rules-Rule) 找到的問題。偵錯工具 `rule_configs` 類別提供設定動作清單的工具，包含在偵錯工具規則發現訓練問題時，自動停止訓練任務及使用 Amazon Simple Notification Service (Amazon SNS) 傳送通知。下列主題會逐步引導您完成這些任務。

**Topics**
+ [設定 Amazon SNS，建立 `SMDebugRules` 主題，並訂閱該主題](#debugger-built-in-actions-sns)
+ [設定 IAM 角色以附加必要政策](#debugger-built-in-actions-iam)
+ [使用內建動作設定偵錯工具規則](#debugger-built-in-actions-on-rule)
+ [使用偵錯工具內建動作的考量事項](#debugger-built-in-actions-considerations)

## 設定 Amazon SNS，建立 `SMDebugRules` 主題，並訂閱該主題
<a name="debugger-built-in-actions-sns"></a>

本節將逐步引導您如何設定 Amazon SNS **SMDebugRules** 主題、訂閱並確認訂閱以獲得來自偵錯工具規則的通知。

**注意**  
關於 Amazon SNS 的計費，如需更多相關資訊，請參閱 [Amazon SNS 定價](https://aws.amazon.com/sns/pricing/)和 [Amazon SNS 常見問答集](https://aws.amazon.com/sns/faqs/)。

**Create an SMDebugRules topic**

1. Sign in to the AWS Management Console and open the Amazon SNS console at [https://console.aws.amazon.com/sns/v3/home](https://console.aws.amazon.com/sns/v3/home).

1. 在左側導覽窗格中，選擇**主題**。

1. 在**主題**頁面上，選擇**建立主題**。

1. 在**建立主題**頁面上，於**詳細資訊**區段中，執行以下作業：

   1. 在**類型**中，選擇**標準**做為主題類型。

   1. 在**名稱**中，輸入 **SMDebugRules**。

1. 略過所有其他選項設定，然後選擇**建立主題**。如果您想進一步了解可選設定，請參閱[建立一個 Amazon SNS 主題](https://docs.aws.amazon.com/sns/latest/dg/sns-create-topic.html)。

**訂閱 SMDebugRules 主題**

1. 在 [https://console.aws.amazon.com/sns/v3/home](https://console.aws.amazon.com/sns/v3/home) 開啟 Amazon SNS 主控台。

1. 在左導覽窗格中，選擇**訂閱**。

1. 在**訂閱**頁面，選擇**建立訂閱**。

1. 在**建立訂閱**頁面上，於**詳細資訊**區段中，執行以下作業：

   1. 在**主題 ARN**，請選擇 **SMDebugRules** 主題 ARN。ARN 應為 `arn:aws:sns:<region-id>:111122223333:SMDebugRules` 格式。

   1. 針對**通訊協定**，選擇**電子郵件** 或**簡訊**。

   1. 在**端點**中，輸入您要接收通知的端點值，例如電子郵件地址或電話號碼。
**注意**  
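Make sure that you enter the correct email address and phone number. Phone numbers must include `+`, a country code, and the phone number, with no special characters or spaces. For example, the phone number +1 (222) 333-4444 is formatted as **+12223334444**.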

1. 略過所有其他選項設定，然後選擇**建立訂閱**。如果您想進一步了解可選設定，請參閱[訂閱 Amazon SNS 主題](https://docs.aws.amazon.com/sns/latest/dg/sns-create-subscribe-endpoint-to-topic.html)。

訂閱 **SMDebugRules** 主題後，您會在電子郵件或電話中收到下列確認訊息：

![\[Amazon SNS SMDebugRules 主題的訂閱確認電子郵件訊息。\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/debugger/debugger-built-in-action-subscription-confirmation.png)


關於 Amazon SNS，如需更多相關資訊，請參閱 *Amazon SNS 開發人員指南*內的[行動電話簡訊 (SMS)](https://docs.aws.amazon.com/sns/latest/dg/sns-mobile-phone-number-as-subscriber.html)及[電子郵件通知](https://docs.aws.amazon.com/sns/latest/dg/sns-email-notifications.html)章節。
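
The phone number format required in the subscription steps above (a leading `+`, the country code, and digits only) can be produced programmatically. The following is a minimal sketch; the function name `normalize_phone_number` is hypothetical, and the sketch only strips formatting characters without validating that the number exists.

```python
import re


def normalize_phone_number(raw_number):
    """Strip spaces and special characters, keeping the leading + and the digits."""
    if not raw_number.strip().startswith("+"):
        raise ValueError("Phone number must start with + and a country code")
    digits = re.sub(r"\D", "", raw_number)  # remove everything that is not a digit
    return "+" + digits


print(normalize_phone_number("+1 (222) 333-4444"))
```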

## 設定 IAM 角色以附加必要政策
<a name="debugger-built-in-actions-iam"></a>

您在此步驟中，新增必要政策至 IAM 角色。

**將必要政策新增至您的 IAM 角色**

1. Sign in to the AWS Management Console and open the IAM console at [https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/).

1. 在導覽窗格中，選擇**政策**，並選擇**建立政策**。

1. 在**建立政策**頁面上，執行下列動作以建立新的 sns-access 存取政策：

   1. 選擇 **JSON** 標籤。

   1. Paste the following JSON policy into the editor, replacing the 12-digit AWS account ID in the `"Resource"` value with your own AWS account ID.

------
#### [ JSON ]


      ```
      {
          "Version":"2012-10-17",
          "Statement": [
              {
                  "Sid": "VisualEditor0",
                  "Effect": "Allow",
                  "Action": [
                      "sns:Publish",
                      "sns:CreateTopic",
                      "sns:Subscribe"
                  ],
                  "Resource": "arn:aws:sns:*:111122223333:SMDebugRules"
              }
          ]
      }
      ```

------

   1. 在頁面底部選擇**檢閱政策**。

   1. 在**檢閱政策**頁面的**名稱**中，輸入 **sns-access**。

   1. 請在頁面底部選擇**建立政策**。

1. 返回 IAM 主控台，然後在左側導覽窗格中選擇**角色**。

1. 查詢您用於 SageMaker AI 模型訓練的 IAM 角色，然後選擇該 IAM 角色。

1. 在**許可**索引標籤的**總結**頁面上，選擇**連接政策**。

1. 搜尋 **sns-access** 存取政策，選取該政策旁的核取方塊，然後選擇**連接政策**。

如需為 Amazon SNS 設定 IAM 政策的更多範例，請參閱 [Amazon SNS 存取控制的範例](https://docs.aws.amazon.com/sns/latest/dg/sns-access-policy-use-cases.html)。
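
If you prefer to generate the policy document rather than edit the JSON by hand, the following is a minimal sketch that builds the same `sns-access` policy shown above for a given account ID. The function name `sns_access_policy` is hypothetical; the sketch only produces the JSON and does not call any AWS API.

```python
import json


def sns_access_policy(account_id, topic_name="SMDebugRules"):
    """Build the sns-access policy document for the given 12-digit account ID."""
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("AWS account IDs are 12 digits")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": ["sns:Publish", "sns:CreateTopic", "sns:Subscribe"],
                "Resource": f"arn:aws:sns:*:{account_id}:{topic_name}",
            }
        ],
    }


# 111122223333 is the placeholder account ID used throughout this guide.
print(json.dumps(sns_access_policy("111122223333"), indent=4))
```

You can paste the printed JSON into the **JSON** tab of the **Create policy** page.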

## 使用內建動作設定偵錯工具規則
<a name="debugger-built-in-actions-on-rule"></a>

在前面的步驟中成功完成必要設定之後，您可以為偵錯規則設定偵錯工具內建動作，如下列範例指令碼所示。您可以選擇建置 `actions` 清單物件時要使用的內建動作。`rule_configs` 是一個輔助模組，提供進階工具來配置偵錯工具的內建規則和動作。偵錯工具可使用下列內建動作：
+ `rule_configs.StopTraining()` — 當偵錯工具規則發現問題時，停止訓練工作。
+ `rule_configs.Email("abc@abc.com")` — 當偵錯工具規則發現問題時，透過電子郵件傳送通知。使用您在設定 SNS 主題訂閱時使用的電子郵件地址。
+ `rule_configs.SMS("+1234567890")` — 當偵錯工具規則發現問題時，透過簡訊傳送通知。使用您在設定 SNS 主題訂閱時使用的電話號碼。
**注意**  
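Make sure that you enter the correct email address and phone number. Phone numbers must include `+`, a country code, and the phone number, with no special characters or spaces. For example, the phone number +1 (222) 333-4444 is formatted as **+12223334444**.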

您可以總結使用 `rule_configs.ActionList()` 方法以使用所有內建動作或動作子集，該方法會採取內建動作並設定動作清單。

**將三個內建動作全部新增至單一項規則**

如果您想要將三個內建動作全部指派給單一項規則，請在建構估算器時設定偵錯工具內建動作清單。使用下列範本建構估算器，偵錯工具會以您用來監控訓練工作進度的一切規則，停止訓練工作並透過電子郵件和簡訊傳送通知。

```
from sagemaker.debugger import Rule, rule_configs

# Configure an action list object for Debugger rules
actions = rule_configs.ActionList(
    rule_configs.StopTraining(), 
    rule_configs.Email("abc@abc.com"), 
    rule_configs.SMS("+1234567890")
)

# Configure rules for debugging with the actions parameter
rules = [
    Rule.sagemaker(
        base_config=rule_configs.built_in_rule(),         # Required
        rule_parameters={"parameter_key": "value"},      # Optional
        actions=actions
    )
]

estimator = Estimator(
    ...
    rules = rules
)

estimator.fit(wait=False)
```

**建立多個內建動作物件，以將不同動作指派給單一項規則**

如果您要指派在單一規則的不同閾值時觸發的內建動作，您可以建立多個內建動作物件，如下列指令碼所示。若要藉由執行相同的規則來避免發生衝突錯誤，您必須提交不同的規則作業名稱 (在規則的 `name` 屬性指定不同的字串)，如下列範例中的指令碼範本所示。此範例顯示如何設定 [StalledTrainingRule](debugger-built-in-rules.md#stalled-training) 採取兩種不同的動作：在訓練工作停頓 60 秒時傳送電子郵件至 `abc@abc.com`；若停頓 120 秒，則停止訓練工作。

```
from sagemaker.debugger import Rule, rule_configs
import time

base_job_name_prefix= 'smdebug-stalled-demo-' + str(int(time.time()))

# Configure an action object for StopTraining
action_stop_training = rule_configs.ActionList(
    rule_configs.StopTraining()
)

# Configure an action object for Email
action_email = rule_configs.ActionList(
    rule_configs.Email("abc@abc.com")
)

# Configure a rule with the Email built-in action to trigger if a training job stalls for 60 seconds
stalled_training_job_rule_email = Rule.sagemaker(
        base_config=rule_configs.stalled_training_rule(),
        rule_parameters={
                "threshold": "60", 
                "training_job_name_prefix": base_job_name_prefix
        },
        actions=action_email
)
stalled_training_job_rule_email.name="StalledTrainingJobRuleEmail"

# Configure a rule with the StopTraining built-in action to trigger if a training job stalls for 120 seconds
stalled_training_job_rule = Rule.sagemaker(
        base_config=rule_configs.stalled_training_rule(),
        rule_parameters={
                "threshold": "120", 
                "training_job_name_prefix": base_job_name_prefix
        },
        actions=action_stop_training
)
stalled_training_job_rule.name="StalledTrainingJobRuleStopTraining"

estimator = Estimator(
    ...
    rules = [stalled_training_job_rule_email, stalled_training_job_rule]
)

estimator.fit(wait=False)
```

訓練工作正在執行時，當規則發現訓練工作的問題時，偵錯工具內建動作就會隨時傳送通知電子郵件和簡訊。下列螢幕擷取畫面顯示，當訓練工作出現停頓訓練工作問題時，電子郵件通知的範例。

![\[當偵錯工具偵測到 StalledTraining 問題時，傳送的電子郵件通知範例。\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/debugger/debugger-built-in-action-email.png)


下列螢幕擷取畫面顯示當規則發現 StalledTraining 問題時，偵錯工具會傳送的簡訊通知範例。

![\[當偵錯工具偵測到 StalledTraining 問題時，所傳送的簡訊通知範例。\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/debugger/debugger-built-in-action-text.png)


## 使用偵錯工具內建動作的考量事項
<a name="debugger-built-in-actions-considerations"></a>
+ 若要使用偵錯工具內建動作，網際網路連接為必要項目。Amazon SageMaker AI 或 Amazon VPC 提供的網路隔離模式不支援此功能。
+ 內建動作無法用於 [剖析工具規則](debugger-built-in-profiler-rules.md#debugger-built-in-profiler-rules-ProfilerRule)。
+ 內建動作無法用於具有 Spot 訓練中斷的訓練工作。
+ 在電子郵件或簡訊通知中，`None` 會出現在訊息結尾。這沒有任何意義，所以您可以忽略文字 `None`。

# Action on rules using Amazon CloudWatch and AWS Lambda
<a name="debugger-cloudwatch-lambda"></a>

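Amazon CloudWatch collects Amazon SageMaker AI model training job logs and Amazon SageMaker Debugger rule processing job logs. Configure Debugger with Amazon CloudWatch Events and AWS Lambda to take action based on the Debugger rule evaluation status.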

## 範例筆記本
<a name="debugger-test-stop-training"></a>

您可以執行下列範例筆記本，這些筆記本是為了使用 Amazon CloudWatch 和 AWS Lambda對 Debugger 的內建規則執行動作來停止訓練任務而預備的。
+ [Amazon SageMaker Debugger - 對 CloudWatch 活動做出反應](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/tensorflow_action_on_rule/tf-mnist-stop-training-job.html)

  這個範例筆記本執行的訓練工作有梯度消失的問題。建構 SageMaker AI TensorFlow 估算器時，會使用 Debugger [VanishingGradient](debugger-built-in-rules.md#vanishing-gradient) 內建規則。Debugger 規則偵測到問題時，就會終止訓練工作。
+ [使用 SageMaker Debugger 規則偵測停止的訓練並調用動作](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/tensorflow_action_on_rule/detect_stalled_training_job_and_actions.html)

  這個範例筆記本會執行具有程式碼行的訓練指令碼，強制它進入睡眠 10 分鐘。Debugger [StalledTrainingRule](debugger-built-in-rules.md#stalled-training) 內建規則會調用問題並停止訓練工作。

**Topics**
+ [範例筆記本](#debugger-test-stop-training)
+ [存取 CloudWatch 日誌以取得偵錯工具規則和訓練任務](debugger-cloudwatch-metric.md)
+ [使用 CloudWatch 和 Lambda 設定偵錯工具，讓自動化訓練工作終止](debugger-stop-training.md)
+ [停用 CloudWatch 事件規則以停止使用讓自動化訓練任務終止](debugger-disable-cw.md)

# 存取 CloudWatch 日誌以取得偵錯工具規則和訓練任務
<a name="debugger-cloudwatch-metric"></a>

您可以透過 CloudWatch 記錄內的訓練和 Debugger 規則任務狀態，在發生訓練問題時採取進一步的動作。下列程序說明如何存取相關的 CloudWatch 日誌。對於使用 CloudWatch 監控訓練任務，如需更多相關資訊，請參閱[監控 Amazon SageMaker AI](https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-overview.html)。

**若要存取訓練任務日誌和偵錯工具規則任務日誌**

1. 透過 [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/) 開啟 CloudWatch 主控台。

1. 在左側導覽窗格中的**日誌**節點下，選擇**日誌群組**。

1. 在日誌群組清單中，執行下列動作：
   + 在訓練工作日誌選擇 **/aws/sagemaker/TrainingJobs**。
   + 在 Debugger 規則工作日誌選擇 **/aws/sagemaker/ProcessingJobs**。

# 使用 CloudWatch 和 Lambda 設定偵錯工具，讓自動化訓練工作終止
<a name="debugger-stop-training"></a>

偵錯工具規則會監控訓練工作狀態，而 CloudWatch 事件規則會監看偵錯工具規則訓練工作評估狀態。下列各節概述使用 CloudWatch 和 Lambda 自動化訓練任務終止所需的程序。

**Topics**
+ [步驟 1：建立 Lambda 函式](#debugger-lambda-function-create)
+ [步驟 2：設定 Lambda 函式](#debugger-lambda-function-configure)
+ [步驟 3：建立 CloudWatch 事件規則，並連結至偵錯工具的 Lambda 函式](#debugger-cloudwatch-events)

## 步驟 1：建立 Lambda 函式
<a name="debugger-lambda-function-create"></a>

**建立 Lambda 函式**

1. Open the AWS Lambda console at [https://console.aws.amazon.com/lambda/](https://console.aws.amazon.com/lambda/).

1. 在導覽面板上，選擇**函式**，然後選擇**建立函式**。

1. 在**建立函式**頁面上，選擇**從頭開始撰寫**。

1. 在**基本資訊**區段中，輸入**函式名稱** (例如 **debugger-rule-stop-training-job**)。

1. 針對**執行期**，選擇 **Python 3.7**。

1. 針對**許可**，請展開下拉式清單選項，然後選擇**變更預設執行角色**。

1. 針對**執行角色**，請選擇**使用現有的角色**，然後選擇您在 SageMaker AI 上用於訓練工作的 IAM 角色。
**注意**  
確保您使用的執行角色連接 `AmazonSageMakerFullAccess` 和 `AWSLambdaBasicExecutionRole`。否則，Lambda 函式將無法正確回應訓練工作的 Debugger 規則狀態變更。如果您不確定正在使用哪個執行角色，請在 Jupyter 筆記本儲存格中執行下列程式碼，以擷取執行角色輸出：  

   ```
   import sagemaker
   sagemaker.get_execution_role()
   ```

1. 請在頁面底部，選擇**建立函式**。

下圖顯示**建立函式**頁面的範例，其輸入欄位和選取已完成。

![\[建立函式頁面。\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/debugger/debugger-lambda-create.png)


## 步驟 2：設定 Lambda 函式
<a name="debugger-lambda-function-configure"></a>

**配置 Lambda 函式**

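1. In the **Function code** section of the configuration page, paste the following Python script into the Lambda code editor pane. The `lambda_handler` function monitors the Debugger rule evaluation status collected by CloudWatch and triggers the `StopTrainingJob` API operation. The AWS SDK for Python (Boto3) `client` for SageMaker AI provides a high-level method, `stop_training_job`, which triggers the `StopTrainingJob` API operation.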

   ```
   import json
   import boto3
   import logging
   
   logger = logging.getLogger()
   logger.setLevel(logging.INFO)
   
   def lambda_handler(event, context):
       training_job_name = event.get("detail").get("TrainingJobName")
       logging.info(f'Evaluating Debugger rules for training job: {training_job_name}')
       eval_statuses = event.get("detail").get("DebugRuleEvaluationStatuses", None)
   
       if eval_statuses is None or len(eval_statuses) == 0:
           logging.info("Couldn't find any debug rule statuses, skipping...")
           return {
               'statusCode': 200,
               'body': json.dumps('Nothing to do')
           }
   
       # should only attempt stopping jobs with InProgress status
       training_job_status = event.get("detail").get("TrainingJobStatus", None)
       if training_job_status != 'InProgress':
           logging.debug(f"Current Training job status({training_job_status}) is not 'InProgress'. Exiting")
           return {
               'statusCode': 200,
               'body': json.dumps('Nothing to do')
           }
   
       client = boto3.client('sagemaker')
   
       for status in eval_statuses:
           logging.info(status.get("RuleEvaluationStatus") + ', RuleEvaluationStatus=' + str(status))
           if status.get("RuleEvaluationStatus") == "IssuesFound":
               secondary_status = event.get("detail").get("SecondaryStatus", None)
               logging.info(
                   f'About to stop training job, since evaluation of rule configuration {status.get("RuleConfigurationName")} resulted in "IssuesFound". ' +
                   f'\ntraining job "{training_job_name}" status is "{training_job_status}", secondary status is "{secondary_status}"' +
                   f'\nAttempting to stop training job "{training_job_name}"'
               )
               try:
                   client.stop_training_job(
                       TrainingJobName=training_job_name
                   )
               except Exception as e:
                   logging.error(
                       "Encountered error while trying to "
                       "stop training job {}: {}".format(
                           training_job_name, str(e)
                       )
                   )
                   raise e
       return None
   ```

   如需 Lambda 程式碼編輯器界面的詳細資訊，請參閱[使用 AWS Lambda 主控台編輯器建立函數](https://docs.aws.amazon.com/lambda/latest/dg/code-editor.html)。

1. 略過所有其他設定，然後選擇組態頁面頂端的**儲存**。
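
Before connecting the function to real events, it can help to check the decision logic locally. The following is a minimal sketch that mirrors the checks in the handler above without calling any AWS API. The function name `should_stop` and the sample event are hypothetical; the event is hand-written and contains only the fields that the handler reads, not a complete event as delivered by CloudWatch Events.

```python
def should_stop(event):
    """Return True when the handler above would call stop_training_job."""
    detail = event.get("detail", {})
    statuses = detail.get("DebugRuleEvaluationStatuses") or []
    # The handler only attempts to stop jobs that are still in progress.
    if detail.get("TrainingJobStatus") != "InProgress":
        return False
    return any(s.get("RuleEvaluationStatus") == "IssuesFound" for s in statuses)


# Hand-written sample event for illustration only.
sample_event = {
    "detail": {
        "TrainingJobName": "debugger-demo-job",
        "TrainingJobStatus": "InProgress",
        "DebugRuleEvaluationStatuses": [
            {"RuleConfigurationName": "VanishingGradient", "RuleEvaluationStatus": "IssuesFound"}
        ],
    }
}
print(should_stop(sample_event))
```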

## 步驟 3：建立 CloudWatch 事件規則，並連結至偵錯工具的 Lambda 函式
<a name="debugger-cloudwatch-events"></a>

**建立 CloudWatch 事件規則並連結至偵錯工具的 Lambda 函式**

1. 透過 [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/) 開啟 CloudWatch 主控台。

1. 在左側導覽窗格內的**事件**下，選擇**規則**。

1. 選擇**建立規則**。

1. 在**步驟 1：建立規則**頁面中的**事件來源**區段中，在**服務名稱**選擇 **SageMaker AI**，並在**事件類型**選擇 **SageMaker AI 訓練工作狀態變更**。事件模式預覽看起來應該如下列範例的 JSON 字串所示：

   ```
   {
       "source": [
           "aws.sagemaker"
       ],
       "detail-type": [
           "SageMaker Training Job State Change"
       ]
   }
   ```

1. In the **Targets** section, choose **Add target\***, and then choose the **debugger-rule-stop-training-job** Lambda function that you created. This step links the CloudWatch event rule with the Lambda function.

1. 選擇**設定詳細資訊**，然後前往**步驟 2：設定規則詳細資訊**頁面。

1. 指定 CloudWatch 規則定義名稱。例如 **debugger-cw-event-rule**。

1. 選擇**建立規則**以完成。

1. 返回 Lambda 函式組態頁面，並重新整理頁面。在**設計工具**面板中確認已正確設定。CloudWatch 事件規則應該註冊為 Lambda 函式的觸發器。組態設計看起來應該類似下列範例：  
<a name="lambda-designer-example"></a>![\[CloudWatch 設定的設計工具面板。\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/debugger/debugger-lambda-designer.png)
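
The event pattern in the steps above selects events by listing the accepted values for each field: an event matches when, for every field in the pattern, the event's value is one of the listed values. The following is a deliberately simplified sketch of that idea for top-level fields only. It is not the matching implementation used by CloudWatch Events and does not handle nested fields or advanced operators. The function name `matches` is hypothetical.

```python
def matches(pattern, event):
    """Simplified matcher: every pattern field must list the event's value."""
    return all(event.get(field) in allowed for field, allowed in pattern.items())


pattern = {
    "source": ["aws.sagemaker"],
    "detail-type": ["SageMaker Training Job State Change"],
}
# Hand-written event for illustration only.
event = {
    "source": "aws.sagemaker",
    "detail-type": "SageMaker Training Job State Change",
    "detail": {},
}
print(matches(pattern, event))
```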

# 停用 CloudWatch 事件規則以停止使用讓自動化訓練任務終止
<a name="debugger-disable-cw"></a>

如果要停用讓自動化訓練工作終止，則需要停用 CloudWatch 事件規則。在 Lambda **設計工具**面板中，選擇連結至 Lambda 函式的 **EventBridge (CloudWatch Events)** 區塊。這會在**設計工具**面板下方顯示 **EventBridge** 面板 (如需範例，請參閱先前的螢幕擷取畫面)。選取 **EventBridge (CloudWatch Events): debugger-cw-event-rule** 旁邊的核取方塊，然後選擇**停用**。如果稍後想使用自動化終止功能，您可以再次啟用 CloudWatch 事件規則。

# 在 TensorBoard 中視覺化 Amazon SageMaker Debugger 輸出張量
<a name="debugger-enable-tensorboard-summaries"></a>

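**Important**  
This page is deprecated in favor of Amazon SageMaker AI with TensorBoard, which provides a comprehensive TensorBoard experience integrated with SageMaker Training and the access control functionalities of SageMaker AI domain. To learn more, see [TensorBoard in Amazon SageMaker AI](tensorboard-on-sagemaker.md).

Use SageMaker Debugger to create output tensor files that are compatible with TensorBoard. Load the files to visualize in TensorBoard and analyze your SageMaker training jobs. Debugger automatically generates output tensor files that are compatible with TensorBoard. For any hook configuration you customize for saving output tensors, Debugger has the flexibility to create scalar summaries, distributions, and histograms that you can import to TensorBoard.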

![\[An architecture diagram of the Debugger output tensor saving mechanism.\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/debugger/debugger-tensorboard-concept.png)


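You can enable this by passing `DebuggerHookConfig` and `TensorBoardOutputConfig` objects to an `estimator`.

The following procedure explains how to save scalars, weights, and biases as full tensors, histograms, and distributions that can be visualized with TensorBoard. Debugger saves them to the training container's local path (the default path is `/opt/ml/output/tensors`) and syncs them to the Amazon S3 locations passed through the Debugger output configuration objects.

**To save TensorBoard compatible output tensor files using Debugger**

1. Set up a `tensorboard_output_config` configuration object to save TensorBoard output using the Debugger `TensorBoardOutputConfig` class. For the `s3_output_path` parameter, specify the default S3 bucket of the current SageMaker AI session or a preferred S3 bucket. This example does not add the `container_local_output_path` parameter; instead, it is set to the default local path `/opt/ml/output/tensors`.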

   ```
   import sagemaker
   from sagemaker.debugger import TensorBoardOutputConfig
   
   bucket = sagemaker.Session().default_bucket()
   tensorboard_output_config = TensorBoardOutputConfig(
       s3_output_path='s3://{}'.format(bucket)
   )
   ```

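   For additional information, see the Debugger [`TensorBoardOutputConfig`](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.TensorBoardOutputConfig) API in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable).

1. Configure the Debugger hook and customize the hook parameter values. For example, the following code configures a Debugger hook to save all scalar outputs every 100 steps in training phases and every 10 steps in validation phases, the `weights` parameters every 500 steps (the default `save_interval` value for saving tensor collections is 500), and the `bias` parameters (the `biases` collection) every 10 global steps until the global step reaches 500.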

   ```
   from sagemaker.debugger import CollectionConfig, DebuggerHookConfig
   
   hook_config = DebuggerHookConfig(
       hook_parameters={
           "train.save_interval": "100",
           "eval.save_interval": "10"
       },
       collection_configs=[
           CollectionConfig("weights"),
           CollectionConfig(
               name="biases",
               parameters={
                   "save_interval": "10",
                   "end_step": "500",
                   "save_histogram": "True"
               }
           ),
       ]
   )
   ```

   For more information about the Debugger configuration APIs, see the Debugger [`CollectionConfig`](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.CollectionConfig) and [`DebuggerHookConfig`](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.DebuggerHookConfig) APIs in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable).

1. Construct a SageMaker AI estimator with the Debugger parameters passing the configuration objects. The following example template shows how to create a generic SageMaker AI estimator. You can replace `estimator` and `Estimator` with the estimator parent classes and estimator classes of other SageMaker AI frameworks. The SageMaker AI framework estimators available for this functionality are [`TensorFlow`](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/using_tf.html#create-an-estimator), [`PyTorch`](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#create-an-estimator), and [`MXNet`](https://sagemaker.readthedocs.io/en/stable/frameworks/mxnet/using_mxnet.html#create-an-estimator).

   ```
   from sagemaker.estimator import Estimator
   
   estimator = Estimator(
       ...
       # Debugger parameters
       debugger_hook_config=hook_config,
       tensorboard_output_config=tensorboard_output_config
   )
   estimator.fit()
   ```

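   The `estimator.fit()` method starts a training job, and Debugger writes the output tensor files in real time to the Debugger S3 output path and to the TensorBoard S3 output path. To retrieve the output paths, use the following estimator methods:
   + For the Debugger S3 output path, use `estimator.latest_job_debugger_artifacts_path()`.
   + For the TensorBoard S3 output path, use `estimator.latest_job_tensorboard_artifacts_path()`.

1. After the training has completed, check the names of the saved output tensors: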

   ```
   from smdebug.trials import create_trial
   trial = create_trial(estimator.latest_job_debugger_artifacts_path())
   trial.tensor_names()
   ```

1. Check the TensorBoard output data in Amazon S3:

   ```
   tensorboard_output_path=estimator.latest_job_tensorboard_artifacts_path()
   print(tensorboard_output_path)
   !aws s3 ls {tensorboard_output_path}/
   ```

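1. Download the TensorBoard output data to your notebook instance. For example, the following AWS CLI command downloads the TensorBoard files to `logs/fit` under the current working directory of your notebook instance.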

   ```
   !aws s3 cp --recursive {tensorboard_output_path} ./logs/fit
   ```

1. Package the file directory into a TAR file to download to your local machine.

   ```
   !tar -cf logs.tar logs
   ```

1. Download and extract the TensorBoard TAR file to a directory on your device, launch a Jupyter notebook server, open a new notebook, and run the TensorBoard app.

   ```
   !tar -xf logs.tar
   %load_ext tensorboard
   %tensorboard --logdir logs/fit
   ```

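The following animated screenshot demonstrates steps 5 through 8. It shows how to download the Debugger TensorBoard TAR file and load the file in a Jupyter notebook on your local device.

![\[Animation on how to download and load the Debugger TensorBoard file locally.\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/debugger/debugger-tensorboard.gif)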
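
If you prefer not to rely on notebook shell magics, the archive steps (7 and 8) can be sketched in plain Python with the standard library `tarfile` module. This is an illustrative alternative, not part of the original procedure; the helper names `pack_logs` and `unpack_logs` are hypothetical, and the `logs` directory and `logs.tar` file names are taken from the commands above.

```python
import tarfile
from pathlib import Path


def pack_logs(log_dir="logs", archive_path="logs.tar"):
    """Archive log_dir into an uncompressed TAR file (like `tar -cf`)."""
    log_dir = Path(log_dir)
    with tarfile.open(archive_path, "w") as tar:
        # Store entries under the directory's own name, for example logs/fit/...
        tar.add(log_dir, arcname=log_dir.name)
    return Path(archive_path)


def unpack_logs(archive_path="logs.tar", dest_dir="."):
    """Extract the TAR file into dest_dir (like `tar -xf`)."""
    with tarfile.open(archive_path, "r") as tar:
        tar.extractall(dest_dir)
    return Path(dest_dir)
```

After `unpack_logs()` runs on your local device, the `logs/fit` directory is available for `%tensorboard --logdir logs/fit` as in step 8.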