# Autopilot model deployment and predictions
<a name="autopilot-llms-finetuning-deploy-models"></a>

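After fine-tuning a large language model (LLM), you can deploy the model for real-time text generation by setting up an endpoint that serves interactive predictions.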

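**Note**  
For better performance, we recommend running real-time inference jobs on `ml.g5.12xlarge`. Alternatively, `ml.g5.8xlarge` instances are suitable for Falcon-7B-Instruct and MPT-7B-Instruct text generation tasks.  
You can find the specifics of these instances in the [Accelerated Computing](https://aws.amazon.com/ec2/instance-types/) category of the instance types provided by Amazon EC2.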

## Real-time text generation
<a name="autopilot-llms-finetuning-realtime"></a>

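You can use the SageMaker APIs to manually deploy your fine-tuned model to a SageMaker AI Hosting [real-time inference endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html), then start making predictions by invoking the endpoint as follows.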

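**Note**  
Alternatively, you can choose the automatic deployment option when you create your fine-tuning experiment in Autopilot. For information on setting up automatic model deployment, see [How to enable automatic deployment](autopilot-create-experiment-finetune-llms.md#autopilot-llms-finetuning-auto-model-deployment).  
You can also use the SageMaker Python SDK and the `JumpStartModel` class to run inference with models fine-tuned by Autopilot. To do so, specify a custom location for the model's artifact in Amazon S3. For information on defining your model as a JumpStart model and deploying it for inference, see [Low-code deployment with the JumpStartModel class](https://sagemaker.readthedocs.io/en/stable/overview.html#deploy-a-pre-trained-model-directly-to-a-sagemaker-endpoint).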

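1. **Obtain the candidate inference container definitions**

   You can find the `InferenceContainerDefinitions` in the `BestCandidate` object returned in the response to the [DescribeAutoMLJobV2](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeAutoMLJobV2.html#API_DescribeAutoMLJobV2_ResponseSyntax) API call. A container definition for inference describes the containerized environment designed for deploying and running your trained model to make predictions.

   The following AWS CLI command example uses the [DescribeAutoMLJobV2](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeAutoMLJobV2.html) API to obtain the recommended container definitions for your job name.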

   ```
   aws sagemaker describe-auto-ml-job-v2 --auto-ml-job-name {{job-name}} --region {{region}}
   ```

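1. **Create a SageMaker AI model**

   Use the container definitions from the previous step to create a SageMaker AI model with the [CreateModel](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html) API. See the following AWS CLI command as an example. Use the `CandidateName` as your model name.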

   ```
   aws sagemaker create-model --model-name '{{<your-candidate-name>}}' \
                       --primary-container '{{<container-definition>}}' \
                       --execution-role-arn '{{<execution-role-arn>}}' --region '{{<region>}}'
   ```

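1. **Create an endpoint configuration**

   The following AWS CLI command example uses the [CreateEndpointConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpointConfig.html) API to create an endpoint configuration.

   **Note**  
   To prevent the endpoint creation from timing out because of a lengthy model download, we recommend setting `ModelDataDownloadTimeoutInSeconds = 3600` and `ContainerStartupHealthCheckTimeoutInSeconds = 3600`.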

   ```
   aws sagemaker create-endpoint-config --endpoint-config-name '{{<your-endpoint-config-name>}}' \
                       --production-variants '{{<list-of-production-variants>}}' ModelDataDownloadTimeoutInSeconds=3600 ContainerStartupHealthCheckTimeoutInSeconds=3600 \
                       --region '{{<region>}}'
   ```

1. **Create the endpoint**

   The following AWS CLI example uses the [CreateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpoint.html) API to create the endpoint.

   ```
   aws sagemaker create-endpoint --endpoint-name '{{<your-endpoint-name>}}' \
                       --endpoint-config-name '{{<endpoint-config-name-you-just-created>}}' \
                       --region '{{<region>}}'
   ```

   Check the progress of your endpoint deployment by using the [DescribeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeEndpoint.html) API. See the following AWS CLI command as an example.

   ```
   aws sagemaker describe-endpoint --endpoint-name '{{<endpoint-name>}}' --region '{{<region>}}'
   ```

   After the `EndpointStatus` changes to `InService`, the endpoint is ready to use for real-time inference.

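1. **Invoke the endpoint**

   The following command invokes the endpoint for real-time inference. Your prompt needs to be encoded in bytes.

   **Note**  
   The format of your input prompt depends on the language model. For more information on the format of text generation prompts, see [Request format for real-time inference of text generation models](#autopilot-llms-finetuning-realtime-prompt-examples).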

   ```
   aws sagemaker-runtime invoke-endpoint --endpoint-name '{{<endpoint-name>}}' \
                     --region '{{<region>}}' --body '{{<your-prompt-in-bytes>}}' --content-type 'application/json' {{<outfile>}}
   ```
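The steps above can also be sketched in Python. This is a minimal sketch, not a definitive implementation: it assumes a boto3-style SageMaker client is passed in as `sm`, that `InferenceContainerDefinitions` is a map from processing unit (for example `"GPU"`) to a list of container definitions, and that the first listed container is the one to deploy. The variant name `AllTraffic`, the `-config` suffix, and the function names are hypothetical choices made for illustration.

```python
import time


def pick_container(describe_response, processing_unit="GPU"):
    """Return (candidate name, first container definition) of the best candidate.

    Assumption: InferenceContainerDefinitions maps a processing unit
    such as "GPU" to a list of container definitions.
    """
    best = describe_response["BestCandidate"]
    containers = best["InferenceContainerDefinitions"][processing_unit]
    return best["CandidateName"], containers[0]


def production_variant(model_name, instance_type="ml.g5.12xlarge"):
    """Build one production variant with the extended timeouts from step 3."""
    return {
        "VariantName": "AllTraffic",  # hypothetical variant name
        "ModelName": model_name,
        "InstanceType": instance_type,
        "InitialInstanceCount": 1,
        "ModelDataDownloadTimeoutInSeconds": 3600,
        "ContainerStartupHealthCheckTimeoutInSeconds": 3600,
    }


def deploy(sm, job_name, role_arn, endpoint_name, poll_seconds=30):
    """Run steps 1 to 4 with the given client, then wait for InService."""
    # Step 1: fetch the recommended container definition.
    response = sm.describe_auto_ml_job_v2(AutoMLJobName=job_name)
    model_name, container = pick_container(response)

    # Step 2: create the model, named after the candidate.
    sm.create_model(
        ModelName=model_name,
        PrimaryContainer=container,
        ExecutionRoleArn=role_arn,
    )

    # Step 3: create the endpoint configuration.
    config_name = endpoint_name + "-config"  # hypothetical naming scheme
    sm.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[production_variant(model_name)],
    )

    # Step 4: create the endpoint and poll its status.
    sm.create_endpoint(EndpointName=endpoint_name, EndpointConfigName=config_name)
    while True:
        status = sm.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError("endpoint creation failed")
        time.sleep(poll_seconds)
```

With boto3 installed and AWS credentials configured, `sm` would typically be `boto3.client("sagemaker")`. Passing the client in as an argument keeps the sketch easy to exercise against a stub before running it against a real account.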

## Request format for real-time inference of text generation models
<a name="autopilot-llms-finetuning-realtime-prompt-examples"></a>

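Different large language models (LLMs) may have specific software dependencies, runtime environments, and hardware requirements, which influence the container that Autopilot recommends for hosting the model for inference. In addition, each model dictates the required input data format and the expected format of predictions and outputs.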

The following are example inputs and recommended containers for some models.
+ For Falcon models, with the recommended container `huggingface-pytorch-tgi-inference:2.0.1-tgi1.0.3-gpu-py39-cu118-ubuntu20.04`:

  ```
  payload = {
      "inputs": "Large language model fine-tuning is defined as",
      "parameters": {
          "do_sample": False,
          "top_p": 0.9,
          "temperature": 0.1,
          "max_new_tokens": 128,
          "stop": ["<|endoftext|>", "</s>"]
      }
  }
  ```
+ For all other models, with the recommended container `djl-inference:0.22.1-fastertransformer5.3.0-cu118`:

  ```
  payload = {
      "text_inputs": "Large language model fine-tuning is defined as"
  }
  ```
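The two request formats above can be selected and encoded to bytes with a small helper. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the format is chosen by checking the container image name for the Hugging Face TGI prefix, `runtime` is a boto3-style SageMaker runtime client passed in by the caller, and the endpoint is assumed to return a JSON body.

```python
import json


def build_payload(prompt, container_image):
    """Pick the request format based on the recommended container image."""
    if "huggingface-pytorch-tgi-inference" in container_image:
        # Falcon models served by the Hugging Face TGI container.
        return {
            "inputs": prompt,
            "parameters": {
                "do_sample": False,
                "top_p": 0.9,
                "temperature": 0.1,
                "max_new_tokens": 128,
                "stop": ["<|endoftext|>", "</s>"],
            },
        }
    # All other models, served by the DJL inference container.
    return {"text_inputs": prompt}


def encode_payload(payload):
    """Serialize the payload to UTF-8 encoded JSON bytes."""
    return json.dumps(payload).encode("utf-8")


def invoke(runtime, endpoint_name, payload):
    """Send the payload to the endpoint and parse the response.

    Assumption: the response body is JSON.
    """
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=encode_payload(payload),
    )
    return json.loads(response["Body"].read())
```

With boto3 installed and AWS credentials configured, `runtime` would typically be `boto3.client("sagemaker-runtime")`. Note that `json.dumps` writes the Python value `False` as the JSON literal `false`, which is what the endpoint receives.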