# Autopilot model deployment and predictions
<a name="autopilot-llms-finetuning-deploy-models"></a>

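After fine-tuning a large language model (LLM), you can deploy the model for real-time text generation by setting up an endpoint that serves interactive predictions.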

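**Note**  
We recommend running real-time inference jobs on `ml.g5.12xlarge` for better performance. Alternatively, `ml.g5.8xlarge` instances are suitable for Falcon-7B-Instruct and MPT-7B-Instruct text generation tasks.  
You can find the details of these instances in the [Accelerated Computing](https://aws.amazon.com/ec2/instance-types/) category of the instance types provided by Amazon EC2.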

## Real-time text generation
<a name="autopilot-llms-finetuning-realtime"></a>

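You can use SageMaker APIs to manually deploy your fine-tuned model to a SageMaker AI Hosting [real-time inference endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html), then start making predictions by invoking the endpoint as follows.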

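**Note**  
Alternatively, you can choose the automatic deployment option when you create your fine-tuning experiment in Autopilot. For information on setting up automatic model deployment, see [How to enable automatic deployment](autopilot-create-experiment-finetune-llms.md#autopilot-llms-finetuning-auto-model-deployment).  
You can also use the SageMaker Python SDK and the `JumpStartModel` class to run inference with models fine-tuned by Autopilot. To do so, specify a custom location for the model's artifact in Amazon S3. For information on defining your model as a JumpStart model and deploying it for inference, see [Low-code deployment with the JumpStartModel class](https://sagemaker.readthedocs.io/en/stable/overview.html#deploy-a-pre-trained-model-directly-to-a-sagemaker-endpoint).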

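1. **Obtain the candidate inference container definitions**

   You can find the `InferenceContainerDefinitions` within the `BestCandidate` object retrieved from the response to the [DescribeAutoMLJobV2](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeAutoMLJobV2.html#API_DescribeAutoMLJobV2_ResponseSyntax) API call. A container definition for inference describes the containerized environment designed for deploying and running your trained model to make predictions.

   The following AWS CLI command example uses the [DescribeAutoMLJobV2](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeAutoMLJobV2.html) API to obtain the recommended container definitions for your job name.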

   ```
   aws sagemaker describe-auto-ml-job-v2 --auto-ml-job-name {{job-name}} --region {{region}}
   ```

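1. **Create a SageMaker AI model**

   Use the container definitions from the previous step to create a SageMaker AI model with the [CreateModel](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html) API. See the following AWS CLI command as an example. Use the `CandidateName` as your model name.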

   ```
   aws sagemaker create-model --model-name '{{<your-candidate-name>}}' \
                       --primary-container '{{<container-definition>}}' \
                       --execution-role-arn '{{<execution-role-arn>}}' --region '{{<region>}}'
   ```

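1. **Create an endpoint configuration**

   The following AWS CLI command example uses the [CreateEndpointConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpointConfig.html) API to create an endpoint configuration.
**Note**  
To prevent endpoint creation from timing out because of a lengthy model download, we recommend setting `ModelDataDownloadTimeoutInSeconds = 3600` and `ContainerStartupHealthCheckTimeoutInSeconds = 3600` in each production variant.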

   ```
   aws sagemaker create-endpoint-config --endpoint-config-name '{{<your-endpoint-config-name>}}' \
                       --production-variants '{{<list-of-production-variants>}}' ModelDataDownloadTimeoutInSeconds=3600 ContainerStartupHealthCheckTimeoutInSeconds=3600 \
                       --region '{{<region>}}'
   ```

1. **Create the endpoint**

   The following AWS CLI example uses the [CreateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpoint.html) API to create the endpoint.

   ```
   aws sagemaker create-endpoint --endpoint-name '{{<your-endpoint-name>}}' \
                       --endpoint-config-name '{{<endpoint-config-name-you-just-created>}}' \
                       --region '{{<region>}}'
   ```

   Check the progress of your endpoint deployment by using the [DescribeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeEndpoint.html) API. See the following AWS CLI command as an example.

   ```
   aws sagemaker describe-endpoint --endpoint-name '{{<endpoint-name>}}' --region '{{<region>}}'
   ```

   After the `EndpointStatus` changes to `InService`, the endpoint is ready to use for real-time inference.

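1. **Invoke the endpoint**

   The following command invokes the endpoint for real-time inference. Your prompt must be encoded as bytes.
**Note**  
The format of your input prompt depends on the language model. For more information on the format of text generation prompts, see [Request format for text generation models real-time inference](#autopilot-llms-finetuning-realtime-prompt-examples).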

   ```
   aws sagemaker-runtime invoke-endpoint --endpoint-name '{{<endpoint-name>}}' \
                     --region '{{<region>}}' --body '{{<your-prompt-in-bytes>}}' --content-type 'application/json' {{<outfile>}}
   ```
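
The five steps above can also be scripted. The following is a minimal Python sketch under stated assumptions, not a definitive implementation. It assumes you pass in a boto3 SageMaker client (for example, `boto3.client("sagemaker")`), that `InferenceContainerDefinitions` maps a processing unit such as `"GPU"` to a list of container definitions, and that a single-instance variant on `ml.g5.12xlarge` suits your model. The helper names, the `AllTraffic` variant name, and the `-config` suffix are hypothetical choices made for illustration.

```python
import time


def pick_container(describe_response, processing_unit="GPU"):
    """Return (candidate name, CreateModel container dict) for the best candidate.

    Assumption: InferenceContainerDefinitions maps a processing unit name to a
    list of definitions that carry Image, ModelDataUrl, and Environment.
    """
    candidate = describe_response["BestCandidate"]
    definition = candidate["InferenceContainerDefinitions"][processing_unit][0]
    container = {
        "Image": definition["Image"],
        "ModelDataUrl": definition["ModelDataUrl"],
        "Environment": definition.get("Environment", {}),
    }
    return candidate["CandidateName"], container


def build_production_variant(model_name, instance_type="ml.g5.12xlarge",
                             timeout_seconds=3600):
    """Build one production variant with the timeouts recommended above."""
    return {
        "VariantName": "AllTraffic",  # hypothetical variant name
        "ModelName": model_name,
        "InstanceType": instance_type,
        "InitialInstanceCount": 1,
        "ModelDataDownloadTimeoutInSeconds": timeout_seconds,
        "ContainerStartupHealthCheckTimeoutInSeconds": timeout_seconds,
    }


def wait_until_in_service(client, endpoint_name, poll_seconds=30, max_polls=120):
    """Poll DescribeEndpoint until the endpoint is InService or has failed."""
    for _ in range(max_polls):
        status = client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to deploy")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Endpoint {endpoint_name} was not InService in time")


def deploy_best_candidate(client, job_name, role_arn, endpoint_name, poll_seconds=30):
    """Run steps 1 to 4: describe the job, then create model, config, and endpoint."""
    response = client.describe_auto_ml_job_v2(AutoMLJobName=job_name)
    model_name, container = pick_container(response)
    client.create_model(
        ModelName=model_name,
        PrimaryContainer=container,
        ExecutionRoleArn=role_arn,
    )
    config_name = endpoint_name + "-config"  # hypothetical naming scheme
    client.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[build_production_variant(model_name)],
    )
    client.create_endpoint(EndpointName=endpoint_name, EndpointConfigName=config_name)
    return wait_until_in_service(client, endpoint_name, poll_seconds=poll_seconds)
```

Passing the client in as an argument keeps the sketch independent of credentials and Region configuration. In practice you would call `deploy_best_candidate(boto3.client("sagemaker"), job_name, role_arn, endpoint_name)` with your own values.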

## Request format for text generation models real-time inference
<a name="autopilot-llms-finetuning-realtime-prompt-examples"></a>

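Different large language models (LLMs) can have specific software dependencies, runtime environments, and hardware requirements that influence which container Autopilot recommends to host the model for inference. In addition, each model dictates the required input data format and the expected format of its predictions and outputs.

Here are example inputs for some models and their recommended containers.
+ For Falcon models with the recommended container `huggingface-pytorch-tgi-inference:2.0.1-tgi1.0.3-gpu-py39-cu118-ubuntu20.04`: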

  ```
  payload = {
      "inputs": "Large language model fine-tuning is defined as",
      "parameters": {
          "do_sample": False,  # Python boolean; serialized to JSON as false
          "top_p": 0.9,
          "temperature": 0.1,
          "max_new_tokens": 128,
          "stop": ["<|endoftext|>", "</s>"]
      }
  }
  ```
+ For all other models with the recommended container `djl-inference:0.22.1-fastertransformer5.3.0-cu118`:

  ```
  payload = {
      "text_inputs": "Large language model fine-tuning is defined as"
  }
  ```
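
To connect the two payload shapes above to the invocation step, the following minimal Python sketch picks a payload by container image name, encodes it as bytes, and sends it to the endpoint. It assumes a boto3 SageMaker Runtime client (for example, `boto3.client("sagemaker-runtime")`) is passed in and that the endpoint returns a JSON body. The function names and the `TGI_IMAGE_MARKER` constant are hypothetical.

```python
import json

# Substring of the recommended Falcon container image name shown above.
TGI_IMAGE_MARKER = "huggingface-pytorch-tgi-inference"


def build_request_body(prompt, container_image, max_new_tokens=128):
    """Return the request body as UTF-8 bytes in the format the container expects."""
    if TGI_IMAGE_MARKER in container_image:
        # Falcon models on the Hugging Face TGI container.
        payload = {
            "inputs": prompt,
            "parameters": {
                "do_sample": False,
                "top_p": 0.9,
                "temperature": 0.1,
                "max_new_tokens": max_new_tokens,
                "stop": ["<|endoftext|>", "</s>"],
            },
        }
    else:
        # All other models on the DJL FasterTransformer container.
        payload = {"text_inputs": prompt}
    return json.dumps(payload).encode("utf-8")


def invoke(runtime_client, endpoint_name, body):
    """Invoke the endpoint and decode the response (assumed to be JSON)."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=body,
    )
    return json.loads(response["Body"].read())
```

Selecting the format from the image name keeps the caller independent of the model family. The container image is available from the container definition retrieved in the first deployment step.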