# SageMaker model parallelism library v2 reference
<a name="distributed-model-parallel-v2-reference"></a>

The following are references for the SageMaker model parallelism library v2 (SMP v2).

**Topics**
+ [SMP v2 core feature configuration parameters](#distributed-model-parallel-v2-reference-init-config)
+ [Reference for the SMP v2 `torch.sagemaker` package](#model-parallel-v2-torch-sagemaker-reference)
+ [Upgrade from SMP v1 to SMP v2](#model-parallel-v2-upgrade-from-v1)

## SMP v2 core feature configuration parameters
<a name="distributed-model-parallel-v2-reference-init-config"></a>

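The following is the complete list of parameters for activating and configuring the [core features of the SageMaker model parallelism library v2](model-parallel-core-features-v2.md). These parameters must be written in JSON format and either passed to the PyTorch estimator in the SageMaker Python SDK or saved as a JSON file for SageMaker HyperPod.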

```
{
    "hybrid_shard_degree": Integer,
    "sm_activation_offloading": Boolean,
    "activation_loading_horizon": Integer,
    "fsdp_cache_flush_warnings": Boolean,
    "allow_empty_shards": Boolean,
    "tensor_parallel_degree": Integer,
    "context_parallel_degree": Integer,
    "expert_parallel_degree": Integer,
    "random_seed": Integer
}
```
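+ `hybrid_shard_degree` (Integer) – Specifies the sharded parallelism degree. The value must be an integer between `0` and `world_size`. The default value is `0`.
  + If set to `0`, it falls back to the native PyTorch implementation and API in the script when `tensor_parallel_degree` is 1. Otherwise, it computes the largest possible `hybrid_shard_degree` based on `tensor_parallel_degree` and `world_size`. When falling back to the native PyTorch FSDP use cases, if `FULL_SHARD` is the strategy you use, it shards across the whole cluster of GPUs. If `HYBRID_SHARD` or `_HYBRID_SHARD_ZERO2` is the strategy, it is equivalent to a `hybrid_shard_degree` of 8. When tensor parallelism is enabled, it shards based on the revised `hybrid_shard_degree`.
  + If set to `1`, it falls back to the native PyTorch implementation and API for `NO_SHARD` in the script when `tensor_parallel_degree` is 1. Otherwise, it is equivalent to `NO_SHARD` within any given tensor parallel group.
  + If set to an integer between 2 and `world_size`, sharding happens across the specified number of GPUs. If you don't set `sharding_strategy` in the FSDP script, it is overridden to `HYBRID_SHARD`. If you set `_HYBRID_SHARD_ZERO2`, the `sharding_strategy` you specify is used.
+ `sm_activation_offloading` (Boolean) – Specifies whether to enable the SMP activation offloading implementation. If `False`, offloading uses the native PyTorch implementation. If `True`, it uses the SMP activation offloading implementation. You also need to use the PyTorch activation offload wrapper (`torch.distributed.algorithms._checkpoint.checkpoint_wrapper.offload_wrapper`) in your script. To learn more, see [Activation offloading](model-parallel-core-features-v2-pytorch-activation-offloading.md). The default value is `True`.
+ `activation_loading_horizon` (Integer) – Specifies the activation offloading horizon for FSDP. This is the maximum number of checkpointed or offloaded layers whose inputs can be in GPU memory simultaneously. To learn more, see [Activation offloading](model-parallel-core-features-v2-pytorch-activation-offloading.md). The value must be a positive integer. The default value is `2`.
+ `fsdp_cache_flush_warnings` (Boolean) – Detects and warns if cache flushes happen in the PyTorch memory manager, because they can degrade computational performance. The default value is `True`.
+ `allow_empty_shards` (Boolean) – Whether to allow empty shards when sharding tensors if a tensor is not divisible. This is an experimental fix for a crash during checkpointing in certain scenarios. Disabling it falls back to the original PyTorch behavior. The default value is `False`.
+ `tensor_parallel_degree` (Integer) – Specifies the tensor parallelism degree. The value must be between `1` and `world_size`. The default value is `1`. Note that passing a value greater than 1 does not enable tensor parallelism automatically; you also need to use the [`torch.sagemaker.transform`](#model-parallel-v2-torch-sagemaker-reference-transform) API to wrap the model in your training script. To learn more, see [Tensor parallelism](model-parallel-core-features-v2-tensor-parallelism.md).
+ `context_parallel_degree` (Integer) – Specifies the context parallelism degree. The value must be between `1` and `world_size`, and must be `<= hybrid_shard_degree`. The default value is `1`. Note that passing a value greater than 1 does not enable context parallelism automatically; you also need to use the [`torch.sagemaker.transform`](#model-parallel-v2-torch-sagemaker-reference-transform) API to wrap the model in your training script. To learn more, see [Context parallelism](model-parallel-core-features-v2-context-parallelism.md).
+ `expert_parallel_degree` (Integer) – Specifies the expert parallelism degree. The value must be between `1` and `world_size`. The default value is `1`. Note that passing a value greater than 1 does not enable expert parallelism automatically; you also need to use the [`torch.sagemaker.transform`](#model-parallel-v2-torch-sagemaker-reference-transform) API to wrap the model in your training script. To learn more, see [Expert parallelism](model-parallel-core-features-v2-expert-parallelism.md).
+ `random_seed` (Integer) – A seed number for the random operations in distributed modules under SMP tensor parallelism or expert parallelism. This seed is added to the tensor-parallel or expert-parallel rank to set the actual seed for each rank, so the seed is unique for each tensor-parallel and expert-parallel rank. SMP v2 makes sure that the random numbers generated across tensor-parallel and expert-parallel ranks match the non-tensor-parallelism and non-expert-parallelism cases, respectively.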
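
The range rules listed above can be checked before a job is launched. The following is a minimal sketch and not part of SMP: `validate_smp_config` is a hypothetical helper, the `world_size` of 16 and the config values are example values, and only the rules stated in this section are encoded. Skipping the `context_parallel_degree <= hybrid_shard_degree` check when `hybrid_shard_degree` is `0` is an assumption, made because SMP computes the degree itself in that case.

```python
import json

PARALLEL_DEGREE_KEYS = (
    "tensor_parallel_degree",
    "context_parallel_degree",
    "expert_parallel_degree",
)


def validate_smp_config(config: dict, world_size: int) -> dict:
    """Check a config dict against the documented ranges (hypothetical helper)."""
    hybrid_shard_degree = config.get("hybrid_shard_degree", 0)
    if not 0 <= hybrid_shard_degree <= world_size:
        raise ValueError("hybrid_shard_degree must be between 0 and world_size")

    for key in PARALLEL_DEGREE_KEYS:
        if not 1 <= config.get(key, 1) <= world_size:
            raise ValueError(f"{key} must be between 1 and world_size")

    # Assumption: when hybrid_shard_degree is 0, SMP picks the degree itself,
    # so the upper bound on context_parallel_degree is not checked here.
    context_parallel_degree = config.get("context_parallel_degree", 1)
    if hybrid_shard_degree > 0 and context_parallel_degree > hybrid_shard_degree:
        raise ValueError("context_parallel_degree must be <= hybrid_shard_degree")

    if config.get("activation_loading_horizon", 2) < 1:
        raise ValueError("activation_loading_horizon must be a positive integer")
    return config


# Example values only; adjust them to your cluster.
smp_config = {
    "hybrid_shard_degree": 8,
    "sm_activation_offloading": True,
    "activation_loading_horizon": 2,
    "tensor_parallel_degree": 2,
    "context_parallel_degree": 1,
    "expert_parallel_degree": 1,
    "random_seed": 12345,
}

# JSON text that could be saved to a file or passed on to the estimator.
smp_config_json = json.dumps(validate_smp_config(smp_config, world_size=16), indent=4)
```

Validating locally catches an out-of-range degree before any instances are provisioned.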

## Reference for the SMP v2 `torch.sagemaker` package
<a name="model-parallel-v2-torch-sagemaker-reference"></a>

This section is a reference for the `torch.sagemaker` package provided by SMP v2.

**Topics**
+ [`torch.sagemaker.delayed_param.DelayedParamIniter`](#model-parallel-v2-torch-sagemaker-reference-delayed-param-init)
+ [`torch.sagemaker.distributed.checkpoint.state_dict_saver.async_save`](#model-parallel-v2-torch-sagemaker-reference-checkpoint-async-save)
+ [`torch.sagemaker.distributed.checkpoint.state_dict_saver.maybe_finalize_async_calls`](#model-parallel-v2-torch-sagemaker-reference-checkpoint-state-dict-saver)
+ [`torch.sagemaker.distributed.checkpoint.state_dict_saver.save`](#model-parallel-v2-torch-sagemaker-reference-checkpoint-save)
+ [`torch.sagemaker.distributed.checkpoint.state_dict_loader.load`](#model-parallel-v2-torch-sagemaker-reference-checkpoint-load)
+ [`torch.sagemaker.moe.moe_config.MoEConfig`](#model-parallel-v2-torch-sagemaker-reference-moe)
+ [`torch.sagemaker.nn.attn.FlashSelfAttention`](#model-parallel-v2-torch-sagemaker-reference-flashselfattention)
+ [`torch.sagemaker.nn.attn.FlashGroupedQueryAttention`](#model-parallel-v2-torch-sagemaker-reference-flashGroupedQueryAttn)
+ [`torch.sagemaker.nn.huggingface.llama_flashattn.LlamaFlashAttention`](#model-parallel-v2-torch-sagemaker-reference-llamaFlashAttn)
+ [`torch.sagemaker.transform`](#model-parallel-v2-torch-sagemaker-reference-transform)
+ [`torch.sagemaker` util functions and properties](#model-parallel-v2-torch-sagemaker-reference-utils)

### `torch.sagemaker.delayed_param.DelayedParamIniter`
<a name="model-parallel-v2-torch-sagemaker-reference-delayed-param-init"></a>

An API for applying [Delayed parameter initialization](model-parallel-core-features-v2-delayed-param-init.md) to a PyTorch model.

```
class torch.sagemaker.delayed_param.DelayedParamIniter(
    model: nn.Module,
    init_method_using_config : Callable = None,
    verbose: bool = False,
)
```

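**Parameters**
+ `model` (`nn.Module`) – A PyTorch model to wrap and to apply the delayed parameter initialization functionality of SMP v2 to.
+ `init_method_using_config` (Callable) – If you use the tensor parallel implementation of SMP v2 or one of the supported [Hugging Face Transformer models compatible with the SMP tensor parallelism](model-parallel-core-features-v2-tensor-parallelism.md#model-parallel-core-features-v2-tensor-parallelism-supported-models), keep this parameter at its default value, which is `None`. By default, the `DelayedParamIniter` API works out how to initialize the given model correctly. For any other model, you need to create a custom parameter initialization function and add it to your script. The following code snippet is the default `init_method_using_config` function that SMP v2 implements for the [Hugging Face Transformer models compatible with the SMP tensor parallelism](model-parallel-core-features-v2-tensor-parallelism.md#model-parallel-core-features-v2-tensor-parallelism-supported-models). Use it as a reference for creating your own initialization configuration function, adding it to your script, and passing it to the `init_method_using_config` parameter of the SMP `DelayedParamIniter` API.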

  ```
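  import torch
  import torch.nn as nn
  from torch.sagemaker.delayed_param import DelayedParamIniter
  from torch.sagemaker.utils.module_utils import empty_module_params, move_buffers_to_device
  from transformers.models.llama.modeling_llama import LlamaRMSNorm
  from transformers.pytorch_utils import Conv1D

  # `model` and `config` (the model's Hugging Face configuration) are assumed
  # to be defined earlier in the training script.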
  
  # Define a custom init config function.
  def custom_init_method_using_config(module):
      d = torch.cuda.current_device()
      empty_module_params(module, device=d)
      if isinstance(module, (nn.Linear, Conv1D)):
          module.weight.data.normal_(mean=0.0, std=config.initializer_range)
          if module.bias is not None:
              module.bias.data.zero_()
      elif isinstance(module, nn.Embedding):
          module.weight.data.normal_(mean=0.0, std=config.initializer_range)
          if module.padding_idx is not None:
              module.weight.data[module.padding_idx].zero_()
      elif isinstance(module, nn.LayerNorm):
          module.weight.data.fill_(1.0)
          module.bias.data.zero_()
      elif isinstance(module, LlamaRMSNorm):
          module.weight.data.fill_(1.0)
      move_buffers_to_device(module, device=d)
  
  delayed_initer = DelayedParamIniter(model, init_method_using_config=custom_init_method_using_config)
  ```

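  For more information about the `torch.sagemaker.utils.module_utils` functions in the preceding code snippet, see [`torch.sagemaker` util functions and properties](#model-parallel-v2-torch-sagemaker-reference-utils).
+ `verbose` (Boolean) – Whether to enable more detailed logging during initialization and validation. The default value is `False`.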

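**Methods**
+ `get_param_init_fn()` – Returns the parameter initialization function that you can pass to the `param_init_fn` argument of the PyTorch FSDP wrapper class.
+ `get_post_param_init_fn()` – Returns the parameter initialization function that you can pass to the `post_param_init_fn` argument of the PyTorch FSDP wrapper class. This is needed when you have tied weights in the model. The model must implement the method `tie_weights`. For more information, see the **Notes on tied weight** in [Delayed parameter initialization](model-parallel-core-features-v2-delayed-param-init.md).
+ `count_num_params` (`module: nn.Module, *args: Tuple[nn.Parameter]`) – Tracks how many parameters are being initialized by the parameter initialization function. This helps implement the following `validate_params_and_buffers_inited` method. You usually don't need to call this function explicitly, because the `validate_params_and_buffers_inited` method calls it implicitly in the backend.
+ `validate_params_and_buffers_inited` (`enabled: bool=True`) – A context manager that helps validate that the number of parameters initialized matches the total number of parameters in the model. It also validates that all parameters and buffers are now on GPU devices instead of meta devices. It raises `AssertionErrors` if these conditions are not met. This context manager is optional; you are not required to use it to initialize parameters.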
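
The methods above are meant to be combined when wrapping a model with FSDP. The following is a minimal sketch under stated assumptions: `wrap_with_delayed_init` is a hypothetical helper name, the model is assumed to have been created on the meta device, the `post_param_init_fn` argument is the one this reference describes for the FSDP wrapper in an SMP v2 environment, and the imports are deferred so that the module can be loaded where `torch.sagemaker` is not installed.

```python
def wrap_with_delayed_init(model, **fsdp_kwargs):
    """Wrap `model` with FSDP using SMP v2 delayed parameter initialization.

    Hypothetical helper: it only runs inside an SMP v2 training environment.
    """
    # Deferred imports: torch.sagemaker ships with the SMP v2 containers.
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    from torch.sagemaker.delayed_param import DelayedParamIniter

    # init_method_using_config stays at its default (None), which this
    # reference recommends for supported Hugging Face Transformer models.
    delayed_initer = DelayedParamIniter(model)

    # Optional validation: raises if some parameters were not initialized
    # or are still on the meta device after wrapping.
    with delayed_initer.validate_params_and_buffers_inited():
        wrapped = FSDP(
            model,
            param_init_fn=delayed_initer.get_param_init_fn(),
            # Needed when the model has tied weights (requires `tie_weights`).
            post_param_init_fn=delayed_initer.get_post_param_init_fn(),
            **fsdp_kwargs,
        )
    return wrapped
```

Any other FSDP arguments, such as the sharding strategy or mixed precision settings, can be forwarded through `fsdp_kwargs`.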

### `torch.sagemaker.distributed.checkpoint.state_dict_saver.async_save`
<a name="model-parallel-v2-torch-sagemaker-reference-checkpoint-async-save"></a>

Entry API for asynchronous save. Use this method to save a `state_dict` asynchronously to a specified `checkpoint_id`.

```
def async_save(
    state_dict: STATE_DICT_TYPE,
    *,
    checkpoint_id: Union[str, os.PathLike, None] = None,
    storage_writer: Optional[StorageWriter] = None,
    planner: Optional[SavePlanner] = None,
    process_group: Optional[dist.ProcessGroup] = None,
    coordinator_rank: int = 0,
    queue : AsyncCallsQueue = None,
    sharded_strategy: Union[SaveShardedStrategy, Tuple[str, int], None] = None,
    wait_error_handling: bool = True,
    force_check_all_plans: bool = True,
    s3_region: Optional[str] = None,
    s3client_config: Optional[S3ClientConfig] = None
) -> None:
```

**Parameters**
+ `state_dict` (dict) - Required. The state dictionary to save.
+ `checkpoint_id` (str) - Required. The storage path to save checkpoints to.
+ `storage_writer` (StorageWriter) - Optional. An instance of [`StorageWriter`](https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.StorageWriter) in PyTorch that performs write operations. If not specified, the default configuration of [`StorageWriter`](https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.StorageWriter) is used.
+ `planner` (SavePlanner) - Optional. An instance of [`SavePlanner`](https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.SavePlanner) in PyTorch. If not specified, the default configuration of [`SavePlanner`](https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.SavePlanner) is used.
+ `process_group` (ProcessGroup) - Optional. The process group to work on. If `None`, the default (global) process group is used.
+ `coordinator_rank` (int) - Optional. The rank of the coordinator when performing collective communication operators such as `AllReduce`.
+ `queue` (AsyncRequestQueue) - Optional. The async scheduler to use. By default, it takes the global parameter `DEFAULT_ASYNC_REQUEST_QUEUE`.
+ `sharded_strategy` (PyTorchDistSaveShardedStrategy) - Optional. The sharded strategy to use for saving checkpoints. If not specified, `torch.sagemaker.distributed.checkpoint.state_dict_saver.PyTorchDistSaveShardedStrategy` is used by default.
+ `wait_error_handling` (bool) – Optional. A flag specifying whether to wait for all ranks to finish error handling. The default value is `True`.
+ `force_check_all_plans` (bool) – Optional. A flag that determines whether to forcibly synchronize plans across ranks, even in the case of a cache hit. The default value is `True`.
+ `s3_region` (str) - Optional. The Region where the S3 bucket is located. If not specified, the Region is inferred from the `checkpoint_id`.
+ `s3client_config` (S3ClientConfig) - Optional. The dataclass exposing configurable parameters for the S3 client. If not provided, the default configuration of [S3ClientConfig](https://github.com/awslabs/s3-connector-for-pytorch/blob/main/s3torchconnector/src/s3torchconnector/_s3client/s3client_config.py#L7) is used. The `part_size` parameter is set to 64MB by default.

### `torch.sagemaker.distributed.checkpoint.state_dict_saver.maybe_finalize_async_calls`
<a name="model-parallel-v2-torch-sagemaker-reference-checkpoint-state-dict-saver"></a>

This function lets a training process monitor multiple asynchronous requests until they are done.

```
def maybe_finalize_async_calls(
    blocking=True, 
    process_group=None
) -> List[int]:
```

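**Parameters**
+ `blocking` (bool) – Optional. If `True`, it waits until all active requests are completed. Otherwise, it finalizes only the asynchronous requests that have already finished. The default value is `True`.
+ `process_group` (ProcessGroup) - Optional. The process group to operate on. If set to `None`, the default (global) process group is used.

**Returns**
+ A list containing the indices of the asynchronous calls that were successfully finalized.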
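
`async_save` and `maybe_finalize_async_calls` are typically used together in a training loop. The following is a minimal sketch under stated assumptions: `checkpoint_path`, `save_checkpoint_async`, and `finish_checkpoints` are hypothetical helper names, the `step_<n>` directory layout is an illustrative choice, and the SMP imports are deferred so that the path helper can run without `torch.sagemaker` installed.

```python
def checkpoint_path(root: str, step: int) -> str:
    """Build a per-step checkpoint location (illustrative layout, not an SMP rule)."""
    return f"{root.rstrip('/')}/step_{step}"


def save_checkpoint_async(state_dict, root: str, step: int) -> None:
    """Queue an asynchronous save for this step (runs only in an SMP v2 environment)."""
    from torch.sagemaker.distributed.checkpoint.state_dict_saver import (
        async_save,
        maybe_finalize_async_calls,
    )

    # Finalize earlier requests that have already finished, without waiting
    # for the ones that are still in flight.
    maybe_finalize_async_calls(blocking=False)
    async_save(state_dict, checkpoint_id=checkpoint_path(root, step))


def finish_checkpoints():
    """Wait for every outstanding request, for example at the end of training."""
    from torch.sagemaker.distributed.checkpoint.state_dict_saver import (
        maybe_finalize_async_calls,
    )

    # Returns the indices of the asynchronous calls that were finalized.
    return maybe_finalize_async_calls(blocking=True)
```

Calling the non-blocking variant before each save keeps the training loop from stalling, while the final blocking call makes sure no checkpoint is left half written when the job exits.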

### `torch.sagemaker.distributed.checkpoint.state_dict_saver.save`
<a name="model-parallel-v2-torch-sagemaker-reference-checkpoint-save"></a>

Use this method to save a `state_dict` synchronously to a specified `checkpoint_id`.

```
def save(
    state_dict: STATE_DICT_TYPE,
    *,
    checkpoint_id: Union[str, os.PathLike, None] = None,
    storage_writer: Optional[StorageWriter] = None,
    planner: Optional[SavePlanner] = None,
    process_group: Optional[dist.ProcessGroup] = None,
    coordinator_rank: int = 0,
    wait_error_handling: bool = True,
    force_check_all_plans: bool = True,
    s3_region: Optional[str] = None,
    s3client_config: Optional[S3ClientConfig] = None
) -> None:
```

**Parameters**
+ `state_dict` (dict) - Required. The state dictionary to save.
+ `checkpoint_id` (str) - Required. The storage path to save checkpoints to.
+ `storage_writer` (StorageWriter) - Optional. An instance of [`StorageWriter`](https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.StorageWriter) in PyTorch that performs write operations. If not specified, the default configuration of [`StorageWriter`](https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.StorageWriter) is used.
+ `planner` (SavePlanner) - Optional. An instance of [`SavePlanner`](https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.SavePlanner) in PyTorch. If not specified, the default configuration of [`SavePlanner`](https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.SavePlanner) is used.
+ `process_group` (ProcessGroup) - Optional. The process group to work on. If `None`, the default (global) process group is used.
+ `coordinator_rank` (int) - Optional. The rank of the coordinator when performing collective communication operators such as `AllReduce`.
+ `wait_error_handling` (bool) – Optional. A flag specifying whether to wait for all ranks to finish error handling. The default value is `True`.
+ `force_check_all_plans` (bool) – Optional. A flag that determines whether to forcibly synchronize plans across ranks, even in the case of a cache hit. The default value is `True`.
+ `s3_region` (str) - Optional. The Region where the S3 bucket is located. If not specified, the Region is inferred from the `checkpoint_id`.
+ `s3client_config` (S3ClientConfig) - Optional. The dataclass exposing configurable parameters for the S3 client. If not provided, the default configuration of [S3ClientConfig](https://github.com/awslabs/s3-connector-for-pytorch/blob/main/s3torchconnector/src/s3torchconnector/_s3client/s3client_config.py#L7) is used. The `part_size` parameter is set to 64MB by default.

### `torch.sagemaker.distributed.checkpoint.state_dict_loader.load`
<a name="model-parallel-v2-torch-sagemaker-reference-checkpoint-load"></a>

Loads the state dictionary (`state_dict`) of a distributed model.

```
def load(
    state_dict: Dict[str, Any],
    *,
    checkpoint_id: Union[str, os.PathLike, None] = None,
    storage_reader: Optional[StorageReader] = None,
    planner: Optional[LoadPlanner] = None,
    process_group: Optional[dist.ProcessGroup] = None,
    check_keys_matched: bool = True,
    coordinator_rank: int = 0,
    s3_region: Optional[str] = None,
    s3client_config: Optional[S3ClientConfig] = None
) -> None:
```

**Parameters**
+ `state_dict` (dict) - Required. The `state_dict` to load.
+ `checkpoint_id` (str) - Required. The ID of a checkpoint. The meaning of the `checkpoint_id` depends on the storage. It can be a path to a folder or to a file. It can also be a key if the storage is a key-value store.
+ `storage_reader` (StorageReader) - Optional. An instance of [`StorageReader`](https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.StorageReader) in PyTorch that performs read operations. If not specified, distributed checkpointing automatically infers the reader based on the `checkpoint_id`. If `checkpoint_id` is also `None`, an exception is raised.
+ `planner` (LoadPlanner) - Optional. An instance of [`LoadPlanner`](https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.LoadPlanner) in PyTorch. If not specified, the default configuration of [`LoadPlanner`](https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.LoadPlanner) is used.
+ `check_keys_matched` (bool) – Optional. If enabled, checks whether the `state_dict` keys of all ranks match, using `AllGather`.
+ `s3_region` (str) - Optional. The Region where the S3 bucket is located. If not specified, the Region is inferred from the `checkpoint_id`.
+ `s3client_config` (S3ClientConfig) - Optional. The dataclass exposing configurable parameters for the S3 client. If not provided, the default configuration of [S3ClientConfig](https://github.com/awslabs/s3-connector-for-pytorch/blob/main/s3torchconnector/src/s3torchconnector/_s3client/s3client_config.py#L7) is used. The `part_size` parameter is set to 64MB by default.

### `torch.sagemaker.moe.moe_config.MoEConfig`
<a name="model-parallel-v2-torch-sagemaker-reference-moe"></a>

A configuration class for setting up the SMP implementation of Mixture-of-Experts (MoE). You can specify MoE configuration values through this class and pass it to the [`torch.sagemaker.transform`](#model-parallel-v2-torch-sagemaker-reference-transform) API call. To learn more about using this class for training MoE models, see [Expert parallelism](model-parallel-core-features-v2-expert-parallelism.md).

```
class torch.sagemaker.moe.moe_config.MoEConfig(
    smp_moe=True,
    random_seed=12345,
    moe_load_balancing="sinkhorn",
    global_token_shuffle=False,
    moe_all_to_all_dispatcher=True,
    moe_aux_loss_coeff=0.001,
    moe_z_loss_coeff=0.001
)
```

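**Parameters**
+ `smp_moe` (Boolean) - Whether to use the SMP implementation of MoE. The default value is `True`.
+ `random_seed` (Integer) - A seed number for the random operations in expert-parallel distributed modules. This seed is added to the expert-parallel rank to set the actual seed for each rank, so the seed is unique for each expert-parallel rank. The default value is `12345`.
+ `moe_load_balancing` (String) - Specifies the load balancing type of the MoE router. Valid options are `aux_loss`, `sinkhorn`, `balanced`, and `none`. The default value is `sinkhorn`.
+ `global_token_shuffle` (Boolean) - Whether to shuffle tokens across EP ranks within the same EP group. The default value is `False`.
+ `moe_all_to_all_dispatcher` (Boolean) - Whether to use the all-to-all dispatcher for communication in MoE. The default value is `True`.
+ `moe_aux_loss_coeff` (Float) - A coefficient for the auxiliary load balancing loss. The default value is `0.001`.
+ `moe_z_loss_coeff` (Float) - A coefficient for the z-loss. The default value is `0.001`.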
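
The `random_seed` description says that the seed is added to the expert-parallel rank. The following sketch illustrates that arithmetic only: `per_rank_seeds` is a hypothetical helper, and treating the combination as plain integer addition of the rank index is an assumption based on the wording above, not a description of SMP internals.

```python
from typing import List


def per_rank_seeds(random_seed: int, expert_parallel_degree: int) -> List[int]:
    """Seed for each expert-parallel rank, assuming seed = random_seed + rank."""
    return [random_seed + ep_rank for ep_rank in range(expert_parallel_degree)]


# With the default seed and an expert parallel degree of 4,
# every rank ends up with a distinct seed.
seeds = per_rank_seeds(random_seed=12345, expert_parallel_degree=4)
```

Because each rank's seed differs by its rank index, the seeds are unique within the expert-parallel group while staying reproducible from a single configured value.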

### `torch.sagemaker.nn.attn.FlashSelfAttention`
<a name="model-parallel-v2-torch-sagemaker-reference-flashselfattention"></a>

An API for using [FlashAttention](model-parallel-core-features-v2-flashattention.md) with SMP v2.

```
class torch.sagemaker.nn.attn.FlashSelfAttention(
   attention_dropout_prob: float = 0.0,
   scale: Optional[float] = None,
   triton_flash_attention: bool = False,
   use_alibi: bool = False,
)
```

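**Parameters**
+ `attention_dropout_prob` (float) – The dropout probability to apply to attention. The default value is `0.0`.
+ `scale` (float) – If passed, this scale factor is applied for softmax. If set to `None`, which is the default value, the scale factor is `1 / sqrt(attention_head_size)`.
+ `triton_flash_attention` (bool) – If passed, the Triton implementation of flash attention is used. This is necessary to support Attention with Linear Biases (ALiBi) (see the following `use_alibi` parameter). This version of the kernel doesn't support dropout. The default value is `False`.
+ `use_alibi` (bool) – If passed, it enables Attention with Linear Biases (ALiBi) using the mask provided. When using ALiBi, you need to prepare an attention mask as follows. The default value is `False`.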

  ```
  def generate_alibi_attn_mask(attention_mask, batch_size, seq_length, 
      num_attention_heads, alibi_bias_max=8):
      device, dtype = attention_mask.device, attention_mask.dtype
      alibi_attention_mask = torch.zeros(
          1, num_attention_heads, 1, seq_length, dtype=dtype, device=device
      )
  
      alibi_bias = torch.arange(1 - seq_length, 1, dtype=dtype, device=device).view(
          1, 1, 1, seq_length
      )
      m = torch.arange(1, num_attention_heads + 1, dtype=dtype, device=device)
      m.mul_(alibi_bias_max / num_attention_heads)
      alibi_bias = alibi_bias * (1.0 / (2 ** m.view(1, num_attention_heads, 1, 1)))
  
      alibi_attention_mask.add_(alibi_bias)
      alibi_attention_mask = alibi_attention_mask[..., :seq_length, :seq_length]
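      if attention_mask is not None and attention_mask.bool().any():
          # masked_fill is out-of-place, so keep its result. Positions where
          # attention_mask is True are set to -inf (broadcast over the batch).
          alibi_attention_mask = alibi_attention_mask.masked_fill(
              attention_mask.bool().view(batch_size, 1, 1, seq_length), float("-inf")
          )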
  
      return alibi_attention_mask
  ```
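
The bias values built by `generate_alibi_attn_mask` can be reproduced with plain Python for a small case. The following sketch mirrors the arithmetic of that snippet (slope `2 ** -(h * alibi_bias_max / num_attention_heads)` for head `h`, multiplied by the relative offsets `1 - seq_length` through `0`); `alibi_slopes` and `alibi_bias_row` are hypothetical helper names used for illustration only.

```python
from typing import List


def alibi_slopes(num_attention_heads: int, alibi_bias_max: int = 8) -> List[float]:
    """Per-head slopes 2 ** -(h * alibi_bias_max / num_heads) for h = 1..num_heads."""
    return [
        2.0 ** -(head * alibi_bias_max / num_attention_heads)
        for head in range(1, num_attention_heads + 1)
    ]


def alibi_bias_row(slope: float, seq_length: int) -> List[float]:
    """Bias per key position: offsets (1 - seq_length)..0 scaled by the head's slope."""
    return [offset * slope for offset in range(1 - seq_length, 1)]


slopes = alibi_slopes(num_attention_heads=8)          # 1/2, 1/4, ..., 1/256
first_head_bias = alibi_bias_row(slopes[0], seq_length=4)
```

Heads with a smaller slope penalize distant positions less, which is how ALiBi gives different heads different effective context lengths.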

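**Methods**
+ `forward(self, qkv, attn_mask=None, causal=False, cast_dtype=None, layout="b h s d")` – A regular PyTorch module function. When `module(x)` is called, SMP runs this function automatically.
  + `qkv` – A `torch.Tensor` of the form `(batch_size x seqlen x (3 x num_heads) x head_size)` or `(batch_size, (3 x num_heads) x seqlen x head_size)`, or a tuple of `torch.Tensors`, each of which can be of shape `(batch_size x seqlen x num_heads x head_size)` or `(batch_size x num_heads x seqlen x head_size)`. An appropriate `layout` argument must be passed based on the shape.
  + `attn_mask` – A `torch.Tensor` of the form `(batch_size x 1 x 1 x seqlen)`. Enabling this attention mask parameter requires `triton_flash_attention=True` and `use_alibi=True`. To learn how to generate an attention mask using this method, see the code examples at [FlashAttention](model-parallel-core-features-v2-flashattention.md). The default value is `None`.
  + `causal` – When set to `False`, which is the default value of the argument, no mask is applied. When set to `True`, the `forward` method uses the standard lower triangular mask.
  + `cast_dtype` – When set to a particular `dtype`, it casts the `qkv` tensors to that `dtype` before `attn`. This is useful for implementations such as the Hugging Face Transformer GPT-NeoX model, which has `q` and `k` in `fp32` after rotary embeddings. If set to `None`, no cast is applied. The default value is `None`.
  + `layout` (string) – Available values are `b h s d` or `b s h d`. This should be set to the layout of the `qkv` tensors passed, so that appropriate transformations can be applied for `attn`. The default value is `b h s d`.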

**Returns**

A single `torch.Tensor` with shape `(batch_size x num_heads x seq_len x head_size)`.
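
The relationship between the `layout` argument, a packed `qkv` shape, and the documented output shape can be sketched without any tensors. In the following illustration, `default_scale` and `flash_self_attention_output_shape` are hypothetical helpers that only encode the shapes and the default scale stated in this section; they do not call SMP.

```python
import math
from typing import Tuple


def default_scale(attention_head_size: int) -> float:
    """Softmax scale used when `scale` is None: 1 / sqrt(attention_head_size)."""
    return 1.0 / math.sqrt(attention_head_size)


def flash_self_attention_output_shape(
    qkv_shape: Tuple[int, int, int, int], layout: str = "b h s d"
) -> Tuple[int, int, int, int]:
    """Map a packed qkv shape to (batch_size, num_heads, seq_len, head_size)."""
    if layout == "b s h d":
        batch_size, seq_len, packed_heads, head_size = qkv_shape
    elif layout == "b h s d":
        batch_size, packed_heads, seq_len, head_size = qkv_shape
    else:
        raise ValueError("layout must be 'b h s d' or 'b s h d'")
    if packed_heads % 3 != 0:
        raise ValueError("packed qkv must hold 3 x num_heads along the head axis")
    return (batch_size, packed_heads // 3, seq_len, head_size)


# 16 heads of size 64, sequence length 128, batch size 2, in both layouts.
shape_bshd = flash_self_attention_output_shape((2, 128, 3 * 16, 64), layout="b s h d")
shape_bhsd = flash_self_attention_output_shape((2, 3 * 16, 128, 64), layout="b h s d")
```

Both calls describe the same logical tensor, which is why the `layout` string has to match the way the input was packed.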

### `torch.sagemaker.nn.attn.FlashGroupedQueryAttention`
<a name="model-parallel-v2-torch-sagemaker-reference-flashGroupedQueryAttn"></a>

An API for using `FlashGroupedQueryAttention` with SMP v2. To learn more about the usage of this API, see [Use FlashAttention kernels for grouped-query attention](model-parallel-core-features-v2-flashattention.md#model-parallel-core-features-v2-flashattention-grouped-query).

```
class torch.sagemaker.nn.attn.FlashGroupedQueryAttention(
    attention_dropout_prob: float = 0.0,
    scale: Optional[float] = None,
)
```

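**Parameters**
+ `attention_dropout_prob` (float) – The dropout probability to apply to attention. The default value is `0.0`.
+ `scale` (float) – If passed, this scale factor is applied for softmax. If set to `None`, which is the default value, `1 / sqrt(attention_head_size)` is used as the scale factor.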

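**Methods**
+ `forward(self, q, kv, causal=False, cast_dtype=None, layout="b s h d")` – A regular PyTorch module function. When `module(x)` is called, SMP runs this function automatically.
  + `q` – A `torch.Tensor` of the form `(batch_size x seqlen x num_heads x head_size)` or `(batch_size x num_heads x seqlen x head_size)`. An appropriate `layout` argument must be passed based on the shape.
  + `kv` – A `torch.Tensor` of the form `(batch_size x seqlen x (2 x num_heads) x head_size)` or `(batch_size, (2 x num_heads) x seqlen x head_size)`, or a tuple of two `torch.Tensor`s, each of which can be of shape `(batch_size x seqlen x num_heads x head_size)` or `(batch_size x num_heads x seqlen x head_size)`. An appropriate `layout` argument must also be passed based on the shape.
  + `causal` – When set to `False`, which is the default value of the argument, no mask is applied. When set to `True`, the `forward` method uses the standard lower triangular mask.
  + `cast_dtype` – When set to a particular dtype, it casts the `qkv` tensors to that dtype before `attn`. This is useful for implementations such as Hugging Face Transformer GPT-NeoX, which has `q,k` in `fp32` after rotary embeddings. If set to `None`, no cast is applied. The default value is `None`.
  + `layout` (string) – Available values are `"b h s d"` or `"b s h d"`. This should be set to the layout of the `qkv` tensors passed, so that appropriate transformations can be applied for `attn`. The default value is `"b h s d"`.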

**Returns**

Returns a single `torch.Tensor` of shape `(batch_size x num_heads x seq_len x head_size)` that represents the output of the attention computation.

### `torch.sagemaker.nn.huggingface.llama_flashattn.LlamaFlashAttention`
<a name="model-parallel-v2-torch-sagemaker-reference-llamaFlashAttn"></a>

An API that supports FlashAttention for the Llama model. This API uses the [`torch.sagemaker.nn.attn.FlashGroupedQueryAttention`](#model-parallel-v2-torch-sagemaker-reference-flashGroupedQueryAttn) API at a low level. To learn how to use it, see [Use FlashAttention kernels for grouped-query attention](model-parallel-core-features-v2-flashattention.md#model-parallel-core-features-v2-flashattention-grouped-query).

```
class torch.sagemaker.nn.huggingface.llama_flashattn.LlamaFlashAttention(
    config: LlamaConfig
)
```

**Parameters**
+ `config` – A FlashAttention configuration for the Llama model.

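**Methods**
+ `forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, use_cache)`
  + `hidden_states` (`torch.Tensor`) – Hidden states of a tensor in the form `(batch_size x seq_len x num_heads x head_size)`.
  + `attention_mask` (`torch.LongTensor`) – A mask, in the form `(batch_size x seqlen)`, that avoids performing attention on padding token indices. The default value is `None`.
  + `position_ids` (`torch.LongTensor`) – When not `None`, it is in the form `(batch_size x seqlen)` and indicates the position index of each input sequence token in the position embeddings. The default value is `None`.
  + `past_key_value` (Cache) – Pre-computed hidden states (keys and values in the self-attention blocks and in the cross-attention blocks). The default value is `None`.
  + `output_attentions` (bool) – Indicates whether to return the attention tensors of all attention layers. The default value is `False`.
  + `use_cache` (bool) – Indicates whether to return the `past_key_values` key-value states. The default value is `False`.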

**Returns**

Returns a single `torch.Tensor` of shape `(batch_size x num_heads x seq_len x head_size)` that represents the output of the attention computation.

### `torch.sagemaker.transform`
<a name="model-parallel-v2-torch-sagemaker-reference-transform"></a>

SMP v2 provides this `torch.sagemaker.transform()` API for transforming Hugging Face Transformer models into SMP model implementations and enabling SMP tensor parallelism.

```
torch.sagemaker.transform(
    model: nn.Module, 
    device: Optional[torch.device] = None, 
    dtype: Optional[torch.dtype] = None, 
    config: Optional[Dict] = None, 
    load_state_dict_from_rank0: bool = False,
    cp_comm_type: str = "p2p"
)
```

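SMP v2 maintains transformation policies for the [Hugging Face Transformer models compatible with the SMP tensor parallelism](model-parallel-core-features-v2-tensor-parallelism.md#model-parallel-core-features-v2-tensor-parallelism-supported-models) by converting the configuration of the Hugging Face Transformer models to the SMP transformer configuration.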

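**Parameters**
+ `model` (`torch.nn.Module`) – A model from the [Hugging Face Transformer models compatible with the SMP tensor parallelism](model-parallel-core-features-v2-tensor-parallelism.md#model-parallel-core-features-v2-tensor-parallelism-supported-models) to transform and apply the tensor parallelism feature of the SMP library to.
+ `device` (`torch.device`) – If passed, a new model is created on this device. If the original module has any parameter on the meta device (see [Delayed parameter initialization](model-parallel-core-features-v2-delayed-param-init.md)), the transformed module is also created on the meta device, ignoring the argument passed here. The default value is `None`.
+ `dtype` (`torch.dtype`) – If passed, it is set as the dtype context manager for creating the model, and the model is created with this dtype. This is typically unnecessary, because you usually want to create the model in `fp32` when using `MixedPrecision`, and `fp32` is the default dtype in PyTorch. The default value is `None`.
+ `config` (dict) – A dictionary for configuring the SMP transformer. The default value is `None`.
+ `load_state_dict_from_rank0` (Boolean) – By default, this module creates a new instance of the model with new weights. When this argument is set to `True`, SMP tries to load the state dictionary of the original PyTorch model from rank 0 into the transformed model for the tensor parallel group that rank 0 is part of. When this is set to `True`, rank 0 cannot have any parameters on the meta device. After this transform call, only the first tensor parallel group is populated with the weights from rank 0. You need to set `sync_module_states` to `True` in the FSDP wrapper to get these weights from the first tensor parallel group to all other processes. With this enabled, the SMP library takes the `state_dict` of the model before transformation, converts it to match the structure of the transformed model, shards it for each tensor parallel rank, communicates this state from rank 0 to the other ranks in the tensor parallel group that rank 0 is part of, and then loads it. The default value is `False`.
+ `cp_comm_type` (str) – Determines the context parallelism implementation and applies only when `context_parallel_degree` is greater than 1. The available values are `p2p` and `all_gather`. The `p2p` implementation uses peer-to-peer send-receive calls for key-and-value (KV) tensor accumulation during the attention computation; it runs asynchronously and lets communication overlap with computation. The `all_gather` implementation instead uses the `AllGather` collective communication operation for KV tensor accumulation. The default value is `"p2p"`.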

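**Returns**

Returns a transformed model that you can wrap with PyTorch FSDP. When `load_state_dict_from_rank0` is set to `True`, the tensor parallel group that involves rank 0 has weights loaded from the original state dictionary on rank 0. When using [Delayed parameter initialization](model-parallel-core-features-v2-delayed-param-init.md) on the original model, only these ranks have the actual tensors on CPUs for the parameters and buffers of the transformed model. The rest of the ranks continue to have the parameters and buffers on the meta device to save memory.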
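
The interaction between `load_state_dict_from_rank0` and the FSDP `sync_module_states` argument described above can be sketched as follows. This is an illustration under stated assumptions, not the library's reference usage: `check_cp_comm_type` and `transform_and_wrap` are hypothetical helpers, `torch.sagemaker.init()` is assumed to have been called earlier in the script, and the SMP imports are deferred so that the validation helper can run without `torch.sagemaker` installed.

```python
VALID_CP_COMM_TYPES = ("p2p", "all_gather")


def check_cp_comm_type(cp_comm_type: str) -> str:
    """Return cp_comm_type unchanged if it is one of the documented values."""
    if cp_comm_type not in VALID_CP_COMM_TYPES:
        raise ValueError(f"cp_comm_type must be one of {VALID_CP_COMM_TYPES}")
    return cp_comm_type


def transform_and_wrap(model, load_state_dict_from_rank0=False, cp_comm_type="p2p"):
    """Transform a supported model and wrap it with FSDP (SMP v2 environment only)."""
    # Deferred imports; assumes torch.sagemaker.init() already ran in this process.
    import torch.sagemaker as tsm
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    model = tsm.transform(
        model,
        load_state_dict_from_rank0=load_state_dict_from_rank0,
        cp_comm_type=check_cp_comm_type(cp_comm_type),
    )
    # When weights were loaded from rank 0, sync_module_states=True propagates
    # them from the first tensor parallel group to all other processes.
    return FSDP(model, sync_module_states=load_state_dict_from_rank0)
```

Tying `sync_module_states` to the same flag keeps the two settings from drifting apart, which would otherwise leave most ranks with uninitialized weights.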

### `torch.sagemaker` util functions and properties
<a name="model-parallel-v2-torch-sagemaker-reference-utils"></a>

**torch.sagemaker util functions**
+ `torch.sagemaker.init(config: Optional[Union[str, Dict[str, Any]]] = None) -> None` – Initializes the PyTorch training job with SMP.
+ `torch.sagemaker.is_initialized() -> bool` – Checks whether the training job is initialized with SMP. When the job is initialized with SMP but falls back to native PyTorch, some properties are not relevant and become `None`, as indicated in the following **Properties** list.
+ `torch.sagemaker.utils.module_utils.empty_module_params(module: nn.Module, device: Optional[torch.device] = None, recurse: bool = False) -> nn.Module` – Creates empty parameters on the given `device`, if one is specified, and can optionally recurse into all nested modules.
+ `torch.sagemaker.utils.module_utils.move_buffers_to_device(module: nn.Module, device: torch.device, recurse: bool = False) -> nn.Module` – Moves module buffers to the given `device`, and can optionally recurse into all nested modules.

**Properties**

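`torch.sagemaker.state` holds several useful properties after SMP is initialized with `torch.sagemaker.init`.
+ `torch.sagemaker.state.hybrid_shard_degree` (int) – The sharded data parallelism degree, a copy of the user input in the SMP configuration passed to `torch.sagemaker.init()`. To learn more, see [Use the SageMaker model parallelism library v2](model-parallel-use-api-v2.md).
+ `torch.sagemaker.state.rank` (int) – The global rank of the device, in the range `[0, world_size)`.
+ `torch.sagemaker.state.rep_rank_process_group` (`torch.distributed.ProcessGroup`) – The process group that includes all devices with the same replication rank. Note the subtle but fundamental difference from `torch.sagemaker.state.tp_process_group`. When falling back to native PyTorch, it returns `None`.
+ `torch.sagemaker.state.tensor_parallel_degree` (int) – The tensor parallelism degree, a copy of the user input in the SMP configuration passed to `torch.sagemaker.init()`. To learn more, see [Use the SageMaker model parallelism library v2](model-parallel-use-api-v2.md).
+ `torch.sagemaker.state.tp_size` (int) – An alias of `torch.sagemaker.state.tensor_parallel_degree`.
+ `torch.sagemaker.state.tp_rank` (int) – The tensor parallelism rank of the device, in the range `[0, tp_size)`, determined by the tensor parallelism degree and the ranking mechanism.
+ `torch.sagemaker.state.tp_process_group` (`torch.distributed.ProcessGroup`) – The tensor parallel process group that includes all devices with the same rank in the other dimensions (for example, sharded data parallelism and replication) but unique tensor parallel ranks. When falling back to native PyTorch, it returns `None`.
+ `torch.sagemaker.state.world_size` (int) – The total number of devices used in training.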

## Upgrade from SMP v1 to SMP v2
<a name="model-parallel-v2-upgrade-from-v1"></a>

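To move from SMP v1 to SMP v2, you must make script changes to remove the SMP v1 APIs and apply the SMP v2 APIs. Instead of starting from your SMP v1 script, we recommend that you start from a PyTorch FSDP script and follow the instructions at [Use the SageMaker model parallelism library v2](model-parallel-use-api-v2.md).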

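To bring SMP v1 *models* to SMP v2, you must first, in SMP v1, collect the full model state dictionary and apply the translation functions to it to convert it into the Hugging Face Transformers model checkpoint format. Then, in SMP v2, as discussed in [Checkpointing using SMP](model-parallel-core-features-v2-checkpoints.md), you can load the Hugging Face Transformers model checkpoints and continue using the PyTorch checkpoint APIs with SMP v2. To use SMP with your PyTorch FSDP model, make sure that you move to SMP v2 and change your training script to use PyTorch FSDP and other latest features.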

```
import smdistributed.modelparallel.torch as smp

# Create model
model = ...
model = smp.DistributedModel(model)

# Run training
...

# Save v1 full checkpoint
if smp.rdp_rank() == 0:
    model_dict = model.state_dict(gather_to_rank0=True) # save the full model
    # Get the corresponding translation function in smp v1 and translate
    if model_type == "gpt_neox":
        from smdistributed.modelparallel.torch.nn.huggingface.gptneox import translate_state_dict_to_hf_gptneox
        translated_state_dict = translate_state_dict_to_hf_gptneox(model_dict, max_seq_len=None)
    
    # Save the checkpoint
    checkpoint_path = "checkpoint.pt"
    if smp.rank() == 0:
        smp.save(
            {"model_state_dict": translated_state_dict},
            checkpoint_path,
            partial=False,
        )
```

To find the translation functions available in SMP v1, see [Support for Hugging Face Transformer Models](model-parallel-extended-features-pytorch-hugging-face.md).

For instructions on saving and loading model checkpoints in SMP v2, see [Checkpointing using SMP](model-parallel-core-features-v2-checkpoints.md).