# Model parallelism and large model inference
<a name="large-model-inference"></a>

Amazon SageMaker AI includes specialized deep learning containers (DLCs), libraries, and tooling for model parallelism and large model inference (LMI). The following sections point to resources for getting started with LMI on SageMaker AI.

**Topics**
+ [Large model inference (LMI) container documentation](large-model-inference-container-docs.md)
+ [SageMaker AI endpoint parameters for large model inference](large-model-inference-hosting.md)
+ [Deploying uncompressed models](large-model-inference-uncompressed.md)
+ [Deploying large models for inference with TorchServe](large-model-inference-tutorials-torchserve.md)

# Large model inference (LMI) container documentation
<a name="large-model-inference-container-docs"></a>

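The [Large model inference (LMI) container documentation](https://docs.djl.ai/master/docs/serving/serving/docs/lmi/index.html) is provided on the Deep Java Library documentation site.

The documentation is written for developers, data scientists, and machine learning engineers who need to deploy and optimize large language models (LLMs) on Amazon SageMaker AI. It helps you use LMI containers, which are specialized Docker containers for LLM inference provided by AWS. It provides an overview, deployment guides, user guides for the supported inference libraries, and advanced tutorials.

By using the LMI container documentation, you can:
+ Understand the components and architecture of LMI containers
+ Learn how to select the appropriate instance type and backend for your use case
+ Configure and deploy LLMs on SageMaker AI using LMI containers
+ Optimize performance by using features such as quantization, tensor parallelism, and continuous batching
+ Benchmark and tune your SageMaker AI endpoints for optimal throughput and latency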

# SageMaker AI endpoint parameters for large model inference
<a name="large-model-inference-hosting"></a>

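You can customize the following parameters to facilitate low-latency large model inference (LMI) with SageMaker AI:
+  **Maximum Amazon EBS volume size on the instance (`VolumeSizeInGB`)** - If the model is larger than 30 GB and you are using an instance without a local disk, increase this parameter to be slightly larger than the size of your model.
+  **Health check timeout quota (`ContainerStartupHealthCheckTimeoutInSeconds`)** - If your container is set up correctly and the CloudWatch logs indicate a health check timeout, increase this quota so that the container has enough time to respond to health checks.
+  **Model download timeout quota (`ModelDataDownloadTimeoutInSeconds`)** - If your model is larger than 40 GB, increase this quota to allow enough time to download the model from Amazon S3 to the instance.

The following code snippet shows how to configure these parameters programmatically. Replace the *italicized placeholder text* in the example with your own information.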

```
import boto3

aws_region = "aws-region"
sagemaker_client = boto3.client('sagemaker', region_name=aws_region)

# The name of the endpoint. The name must be unique within an AWS Region in your AWS account.
endpoint_name = "endpoint-name"

# Create an endpoint config name.
endpoint_config_name = "endpoint-config-name"

# The name of the model that you want to host.
model_name = "the-name-of-your-model"

instance_type = "instance-type"

sagemaker_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1", # The name of the production variant.
            "ModelName": model_name,
            "InstanceType": instance_type, # Specify the compute instance type.
            "InitialInstanceCount": 1, # Number of instances to launch initially.
            "VolumeSizeInGB": 256, # Specify the size of the Amazon EBS volume.
            "ModelDataDownloadTimeoutInSeconds": 1800, # Specify the model download timeout in seconds.
            "ContainerStartupHealthCheckTimeoutInSeconds": 1800, # Specify the health checkup timeout in seconds
        },
    ],
)

sagemaker_client.create_endpoint(EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name)
```
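
The size thresholds in the parameter list above can be collected into a small helper. The following is a minimal sketch, not part of the SageMaker API: the function name `suggest_variant_overrides`, the 10 GB of volume headroom, and the 3600-second timeout are assumptions chosen for illustration. Only the 30 GB and 40 GB thresholds and the parameter names come from this page. The health check timeout is left out because it depends on what the CloudWatch logs show, not on model size.

```python
import math


def suggest_variant_overrides(model_size_gb, has_local_disk):
    """Return ProductionVariant overrides suggested by the guidance above.

    Hypothetical helper: the 10 GB headroom and the 3600-second timeout
    are illustrative assumptions, not values mandated by SageMaker AI.
    """
    overrides = {}
    # Models over 30 GB on instances without a local disk need an EBS
    # volume slightly larger than the model.
    if model_size_gb > 30 and not has_local_disk:
        overrides["VolumeSizeInGB"] = math.ceil(model_size_gb) + 10
    # Models over 40 GB need more time to download from Amazon S3.
    if model_size_gb > 40:
        overrides["ModelDataDownloadTimeoutInSeconds"] = 3600
    return overrides


print(suggest_variant_overrides(60, has_local_disk=False))
```

The returned dictionary can be merged into a production variant definition before it is passed to `create_endpoint_config`.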

For more information about the keys for `ProductionVariants`, see [ProductionVariant](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProductionVariant.html).

For examples that demonstrate how to achieve low-latency inference with large models, see [Generative AI Inference Examples on Amazon SageMaker AI](https://github.com/aws-samples/sagemaker-genai-hosting-examples/tree/main) in the aws-samples GitHub repository.

# Deploying uncompressed models
<a name="large-model-inference-uncompressed"></a>

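When you deploy ML models, one option is to archive and compress the model artifacts into a `tar.gz` file. This method works well for small models, but compressing a large model artifact with hundreds of billions of parameters and then decompressing it on an endpoint can take a significant amount of time. For large model inference, we recommend that you deploy uncompressed ML models. This guide shows how to deploy an uncompressed ML model.

To deploy uncompressed ML models, upload all of the model artifacts to Amazon S3 and organize them under a common Amazon S3 prefix. An Amazon S3 prefix is a string of characters at the beginning of an Amazon S3 object key name, separated from the rest of the name by a delimiter. For more information about Amazon S3 prefixes, see [Organizing objects using prefixes](https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-prefixes.html).

To deploy with SageMaker AI, you must use a slash (/) as the delimiter. Make sure that only artifacts associated with your ML model are organized under the prefix. For an ML model with a single uncompressed artifact, the prefix is identical to the key name. You can check which objects are associated with your prefix by using the AWS CLI: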

```
aws s3 ls --recursive s3://bucket/prefix
```
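
To illustrate how a prefix and the slash delimiter select objects, here is a minimal sketch in plain Python. The function `keys_under_prefix` and the sample keys are hypothetical. The sketch only mimics the prefix rule described above and does not call Amazon S3.

```python
def keys_under_prefix(keys, prefix):
    """Return the object keys that fall under a slash-delimited prefix."""
    # Normalize so "models/opt" and "models/opt/" behave the same, and so
    # "models/opt-old/..." is not mistaken for part of "models/opt/".
    normalized = prefix.rstrip("/") + "/"
    return [key for key in keys if key.startswith(normalized)]


# Hypothetical bucket listing used only for this illustration.
sample_keys = [
    "models/opt/config.json",
    "models/opt/weights/shard-0.bin",
    "models/opt-old/config.json",
    "notes/readme.txt",
]
print(keys_under_prefix(sample_keys, "models/opt"))
# ['models/opt/config.json', 'models/opt/weights/shard-0.bin']
```

Ending the prefix with the delimiter is what keeps unrelated objects such as `models/opt-old/config.json` out of the result.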

After you upload the model artifacts to Amazon S3 and organize them under a common prefix, you can specify their location as part of the [ModelDataSource](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ModelDataSource.html) field when you invoke the [CreateModel](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html) request. SageMaker AI automatically downloads the uncompressed model artifacts to `/opt/ml/model` for inference. For more information about the rules that SageMaker AI uses when downloading the artifacts, see [S3ModelDataSource](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_S3ModelDataSource.html).

The following code snippet shows how to invoke the `CreateModel` API when deploying an uncompressed model. Replace the *italicized user text* with your own information.

```
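import boto3

# Replace "aws-region" with your AWS Region, for example "us-west-2".
sagemaker_client = boto3.client("sagemaker", region_name="aws-region")

model_name = "model-name"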
sagemaker_role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"
container = "123456789012.dkr.ecr.us-west-2.amazonaws.com/inference-image:latest"

create_model_response = sagemaker_client.create_model(
    ModelName = model_name,
    ExecutionRoleArn = sagemaker_role,
    PrimaryContainer = {
        "Image": container,
        "ModelDataSource": {
            "S3DataSource": {
                "S3Uri": "s3://amzn-s3-demo-bucket/prefix/to/model/data/", 
                "S3DataType": "S3Prefix",
                "CompressionType": "None",
            },
        },
    },
)
```

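The preceding example assumes that your model artifacts are organized under a common prefix. If your model artifact is instead a single uncompressed Amazon S3 object, change `"S3Uri"` to point to that Amazon S3 object and change `"S3DataType"` to `"S3Object"`.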
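
The choice between the two data types can be sketched as a small helper. This is an illustrative assumption, not an AWS API: the function name `build_model_data_source` is hypothetical, and it relies on the convention that a URI ending in a slash is a prefix while anything else is a single object.

```python
def build_model_data_source(s3_uri):
    """Build a ModelDataSource dictionary for uncompressed artifacts.

    Assumption for this sketch: a URI that ends with "/" is treated as a
    key name prefix, and any other URI as a single S3 object.
    """
    data_type = "S3Prefix" if s3_uri.endswith("/") else "S3Object"
    return {
        "S3DataSource": {
            "S3Uri": s3_uri,
            "S3DataType": data_type,
            "CompressionType": "None",
        }
    }


print(build_model_data_source("s3://amzn-s3-demo-bucket/prefix/to/model/data/"))
print(build_model_data_source("s3://amzn-s3-demo-bucket/prefix/to/model.bin"))
```

The result can be passed as the `"ModelDataSource"` value of `PrimaryContainer` in a `create_model` call.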

**Note**  
Currently, you cannot use `ModelDataSource` with AWS Marketplace, SageMaker AI batch transform, SageMaker Serverless Inference endpoints, or SageMaker multi-model endpoints.

# Deploying large models for inference with TorchServe
<a name="large-model-inference-tutorials-torchserve"></a>

This tutorial shows how to deploy large models and serve inference on Amazon SageMaker AI with TorchServe on GPUs. The example deploys the [OPT-30b](https://huggingface.co/facebook/opt-30b) model to an `ml.g5` instance. You can modify it to work with other models and instance types. Replace the `italicized placeholder text` in the examples with your own information.

TorchServe is a powerful open platform for large distributed model inference. By supporting popular libraries such as PyTorch, native PiPPy, DeepSpeed, and HuggingFace Accelerate, it offers uniform handler APIs that stay consistent across distributed large model and non-distributed model inference scenarios. For more information, see [TorchServe's large model inference documentation](https://pytorch.org/serve/large_model_inference.html#).

## Deep learning containers with TorchServe
<a name="large-model-inference-tutorials-torchserve-dlcs"></a>

To deploy a large model with TorchServe on SageMaker AI, you can use one of the SageMaker AI deep learning containers (DLCs). By default, TorchServe is installed in all AWS PyTorch DLCs. During model loading, TorchServe can install specialized libraries tailored for large models, such as PiPPy, Deepspeed, and Accelerate.

The following table lists all of the [SageMaker AI DLCs with TorchServe](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only).


| DLC category | Framework | Hardware | Example URL | 
| --- | --- | --- | --- | 
| [SageMaker AI Framework Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only) |  PyTorch 2.0.0+  | CPU, GPU | 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.0.1-gpu-py310-cu118-ubuntu20.04-sagemaker | 
| [SageMaker AI Framework Graviton Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-graviton-containers-sm-support-only) |  PyTorch 2.0.0+  | CPU | 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference-graviton:2.0.1-cpu-py310-ubuntu20.04-sagemaker | 
| [StabilityAI Inference Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#stabilityai-inference-containers) |  PyTorch 2.0.0+  | GPU | 763104351884.dkr.ecr.us-east-1.amazonaws.com/stabilityai-pytorch-inference:2.0.1-sgm0.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker | 
| [Neuron Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#neuron-containers) | PyTorch 1.13.1 | Neuronx | 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference-neuron:1.13.1-neuron-py310-sdk2.12.0-ubuntu20.04 | 

## Getting started
<a name="large-model-inference-tutorials-torchserve-getting-started"></a>

Before you deploy your model, complete the prerequisites. You can also configure the model parameters and customize the handler code.

### Prerequisites
<a name="large-model-inference-tutorials-torchserve-getting-started-prereqs"></a>

To get started, make sure that you have the following prerequisites:

1. Make sure that you have access to an AWS account. [Set up your environment](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) so that the AWS CLI can access your account through either an AWS IAM user or an IAM role. We recommend using an IAM role. For testing in your personal account, you can attach the following managed permissions policies to the IAM role:
   + [AmazonEC2ContainerRegistryFullAccess](https://console.aws.amazon.com/iam/home#policies/arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryFullAccess)
   + [AmazonEC2FullAccess](https://console.aws.amazon.com/iam/home#policies/arn:aws:iam::aws:policy/AmazonEC2FullAccess)
   + [AWSServiceRoleForAmazonEKSNodegroup](https://console.aws.amazon.com/iam/home#policies/arn:aws:iam::aws:policy/AWSServiceRoleForAmazonEKSNodegroup)
   + [AmazonSageMakerFullAccess](https://console.aws.amazon.com/iam/home#policies/arn:aws:iam::aws:policy/AmazonSageMakerFullAccess)
   + [AmazonS3FullAccess](https://console.aws.amazon.com/iam/home#policies/arn:aws:iam::aws:policy/AmazonS3FullAccess)

   For more information about attaching IAM policies to a role, see [Adding and removing IAM identity permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html) in the *AWS IAM User Guide*.

1. Configure your dependencies locally, as shown in the following examples.

   1. Install version 2 of the AWS CLI:

      ```
      # Install the latest AWS CLI v2 if it is not installed
      !curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
      !unzip awscliv2.zip
      # Follow the instructions to install v2 on the terminal
      !cat aws/README.md
      ```

   1. Install the SageMaker AI and Boto3 clients:

      ```
      # If already installed, update your client
      #%pip install sagemaker pip --upgrade --quiet
      !pip install -U sagemaker
      !pip install -U botocore
      !pip install -U boto3
      ```

### Configure model settings and parameters
<a name="large-model-inference-tutorials-torchserve-getting-started-config"></a>

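TorchServe uses [`torchrun`](https://pytorch.org/docs/stable/elastic/run.html) to set up the distributed environment for model parallel processing. TorchServe can support multiple workers for a large model. By default, TorchServe uses a round-robin algorithm to assign GPUs to the workers on a host. For large model inference, the number of GPUs assigned to each worker is calculated automatically from the number of GPUs specified in the `model_config.yaml` file. The environment variable `CUDA_VISIBLE_DEVICES`, which specifies the GPU device IDs that are visible at a given time, is set based on this number.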

For example, suppose that a node has 8 GPUs and one worker needs 4 GPUs on the node (`nproc_per_node=4`). In this case, TorchServe assigns four GPUs to the first worker (`CUDA_VISIBLE_DEVICES="0,1,2,3"`) and four GPUs to the second worker (`CUDA_VISIBLE_DEVICES="4,5,6,7"`).

In addition to this default behavior, TorchServe lets you specify the GPUs for a worker. For example, if you set the variable `deviceIds: [2,3,4,5]` in the [model config YAML file](https://github.com/pytorch/serve/blob/5ee02e4f050c9b349025d87405b246e970ee710b/model-archiver/README.md?plain=1#L164) and set `nproc_per_node=2`, TorchServe assigns `CUDA_VISIBLE_DEVICES="2,3"` to the first worker and `CUDA_VISIBLE_DEVICES="4,5"` to the second worker.
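
The two assignments described above can be reproduced with a few lines of Python. This is a minimal sketch that mirrors the examples on this page, not TorchServe's actual implementation. The function name `assign_gpus` and its parameters are hypothetical.

```python
def assign_gpus(device_ids, nproc_per_node, num_workers):
    """Return the CUDA_VISIBLE_DEVICES value for each worker.

    Each worker receives the next nproc_per_node device IDs, in order.
    """
    needed = nproc_per_node * num_workers
    if needed > len(device_ids):
        raise ValueError(
            f"{num_workers} workers x {nproc_per_node} GPUs need {needed} "
            f"devices, but only {len(device_ids)} are available"
        )
    assignments = []
    for worker in range(num_workers):
        start = worker * nproc_per_node
        chunk = device_ids[start:start + nproc_per_node]
        assignments.append(",".join(str(device) for device in chunk))
    return assignments


# Default case: 8 GPUs on the node, 4 GPUs per worker.
print(assign_gpus(list(range(8)), nproc_per_node=4, num_workers=2))
# ['0,1,2,3', '4,5,6,7']

# User-specified device IDs, 2 GPUs per worker.
print(assign_gpus([2, 3, 4, 5], nproc_per_node=2, num_workers=2))
# ['2,3', '4,5']
```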

The following `model_config.yaml` example configures both front-end and back-end parameters for the [OPT-30b](https://huggingface.co/facebook/opt-30b) model. The configured front-end parameters are `parallelType`, `deviceType`, `deviceIds`, and `torchrun`. For more information about the front-end parameters that you can configure, see the [PyTorch GitHub documentation](https://github.com/pytorch/serve/blob/2bf505bae3046b0f7d0900727ec36e611bb5dca3/docs/configuration.md?plain=1#L267). The back-end configuration is based on a YAML map that allows free-style customization. For the back-end parameters, the example defines the DeepSpeed configuration and additional parameters used by the custom handler code.

```
# TorchServe front-end parameters
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 100
responseTimeout: 1200
parallelType: "tp"
deviceType: "gpu"
# example of user specified GPU deviceIds
deviceIds: [0,1,2,3] # sets CUDA_VISIBLE_DEVICES

torchrun:
    nproc-per-node: 4

# TorchServe back-end parameters
deepspeed:
    config: ds-config.json
    checkpoint: checkpoints.json

handler: # parameters for custom handler code
    model_name: "facebook/opt-30b"
    model_path: "model/models--facebook--opt-30b/snapshots/ceea0a90ac0f6fae7c2c34bcb40477438c152546"
    max_length: 50
    max_new_tokens: 10
    manual_seed: 40
```

### Customize handlers
<a name="large-model-inference-tutorials-torchserve-getting-started-handlers"></a>

TorchServe offers [base handlers](https://github.com/pytorch/serve/tree/master/ts/torch_handler/distributed) and [handler utilities](https://github.com/pytorch/serve/tree/master/ts/handler_utils) for large model inference built with popular libraries. The following example shows how the custom handler class [TransformersSeqClassifierHandler](https://github.com/pytorch/serve/blob/ab69b69a59d6ca6074df7e6d4014f07eb48dedba/examples/large_models/deepspeed/custom_handler.py#L16C7-L16C39) extends [BaseDeepSpeedHandler](https://github.com/pytorch/serve/blob/ab69b69a59d6ca6074df7e6d4014f07eb48dedba/ts/torch_handler/distributed/base_deepspeed_handler.py#L8) and uses the [handler utilities](https://github.com/pytorch/serve/blob/master/ts/handler_utils/distributed/deepspeed.py). For the full code example, see the [`custom_handler.py` code](https://github.com/pytorch/serve/blob/master/examples/large_models/deepspeed/custom_handler.py) in the PyTorch GitHub documentation.

```
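import logging
from abc import ABC

import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

from ts.context import Context
from ts.handler_utils.distributed.deepspeed import get_ds_engine
from ts.torch_handler.distributed.base_deepspeed_handler import BaseDeepSpeedHandler

logger = logging.getLogger(__name__)


class TransformersSeqClassifierHandler(BaseDeepSpeedHandler, ABC):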
    """
    Transformers handler class for sequence, token classification and question answering.
    """

    def __init__(self):
        super(TransformersSeqClassifierHandler, self).__init__()
        self.max_length = None
        self.max_new_tokens = None
        self.tokenizer = None
        self.initialized = False

    def initialize(self, ctx: Context):
        """In this initialize function, the HF large model is loaded and
        partitioned using DeepSpeed.
        Args:
            ctx (context): It is a JSON Object containing information
            pertaining to the model artifacts parameters.
        """
        super().initialize(ctx)
        model_dir = ctx.system_properties.get("model_dir")
        self.max_length = int(ctx.model_yaml_config["handler"]["max_length"])
        self.max_new_tokens = int(ctx.model_yaml_config["handler"]["max_new_tokens"])
        model_name = ctx.model_yaml_config["handler"]["model_name"]
        model_path = ctx.model_yaml_config["handler"]["model_path"]
        seed = int(ctx.model_yaml_config["handler"]["manual_seed"])
        torch.manual_seed(seed)

        logger.info("Model %s loading tokenizer", ctx.model_name)

        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.tokenizer.pad_token = self.tokenizer.eos_token
        config = AutoConfig.from_pretrained(model_name)
        with torch.device("meta"):
            self.model = AutoModelForCausalLM.from_config(
                config, torch_dtype=torch.float16
            )
        self.model = self.model.eval()

        ds_engine = get_ds_engine(self.model, ctx)
        self.model = ds_engine.module
        logger.info("Model %s loaded successfully", ctx.model_name)
        self.initialized = True

    def preprocess(self, requests):
        """
        Basic text preprocessing, based on the user's choice of application mode.
        Args:
            requests (list): A list of dictionaries with a "data" or "body" field, each
                            containing the input text to be processed.
        Returns:
            tuple: A tuple with two tensors: the batch of input ids and the batch of
                attention masks.
        """

    def inference(self, input_batch):
        """
        Predicts the class (or classes) of the received text using the serialized transformers
        checkpoint.
        Args:
            input_batch (tuple): A tuple with two tensors: the batch of input ids and the batch
                                of attention masks, as returned by the preprocess function.
        Returns:
            list: A list of strings with the predicted values for each input text in the batch.
        """
        
    def postprocess(self, inference_output):
        """Post Process Function converts the predicted response into Torchserve readable format.
        Args:
            inference_output (list): It contains the predicted response of the input text.
        Returns:
            (list): Returns a list of the Predictions and Explanations.
        """
```
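
In the excerpt above, `preprocess` is shown with only its docstring. As a rough illustration of the first step that the docstring describes, the following sketch pulls the input text out of each request's `"data"` or `"body"` field. The helper name `extract_texts` is hypothetical and this is not the code from `custom_handler.py`. Tokenization into input IDs and attention masks is left out.

```python
def extract_texts(requests):
    """Return the input text of each request in a batch as a string."""
    texts = []
    for request in requests:
        # Use "data" when it is present, otherwise fall back to "body".
        payload = request.get("data")
        if payload is None:
            payload = request.get("body")
        # Payloads may arrive as raw bytes, so decode them to text.
        if isinstance(payload, (bytes, bytearray)):
            payload = payload.decode("utf-8")
        texts.append(payload)
    return texts


print(extract_texts([{"data": b"Today the weather is"}, {"body": "really nice"}]))
# ['Today the weather is', 'really nice']
```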

## Prepare your model artifacts
<a name="large-model-inference-tutorials-torchserve-artifacts"></a>

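Before you deploy your model on SageMaker AI, you must package your model artifacts. For large models, we recommend that you use the PyTorch [torch-model-archiver](https://github.com/pytorch/serve/blob/master/model-archiver/README.md) tool with the argument `--archive-format no-archive`, which skips compressing the model artifacts. The following example saves all of the model artifacts to a new folder named `opt/`.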

```
torch-model-archiver --model-name opt --version 1.0 --handler custom_handler.py --extra-files ds-config.json -r requirements.txt --config-file opt/model-config.yaml --archive-format no-archive
```

After the `opt/` folder is created, download the OPT-30b model to the folder by using the PyTorch [Download_model](https://github.com/pytorch/serve/blob/master/examples/large_models/utils/Download_model.py) tool.

```
cd opt
python path_to/Download_model.py --model_path model --model_name facebook/opt-30b --revision main
```

Lastly, upload the model artifacts to an Amazon S3 bucket.

```
aws s3 cp opt {your_s3_bucket}/opt --recursive
```

You should now have model artifacts stored in Amazon S3 that are ready to deploy to a SageMaker AI endpoint.

## Deploy the model using the SageMaker Python SDK
<a name="large-model-inference-tutorials-torchserve-deploy"></a>

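After you prepare your model artifacts, you can deploy your model to a SageMaker AI hosting endpoint. This section describes how to deploy a single large model to an endpoint and make streaming response predictions. For more information about streaming responses from endpoints, see [Invoke real-time endpoints](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-test-endpoints.html).

To deploy your model, complete the following steps:

1. Create a SageMaker AI session, as shown in the following example.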

   ```
   import boto3
   import sagemaker
   from sagemaker import Model, image_uris, serializers, deserializers
   
   boto3_session = boto3.session.Session(region_name="us-west-2")
   smr = boto3_session.client("sagemaker-runtime")  # runtime client used to invoke the endpoint
   sm = boto3_session.client("sagemaker")
   role = sagemaker.get_execution_role()  # execution role for the endpoint
   sess = sagemaker.session.Session(boto3_session, sagemaker_client=sm, sagemaker_runtime_client=smr)  # SageMaker AI session for interacting with different AWS APIs
   region = sess.boto_region_name  # region name of the current SageMaker Studio Classic environment
   account = sess.account_id()  # account_id of the current SageMaker Studio Classic environment
   
   # Configuration:
   bucket_name = sess.default_bucket()
   prefix = "torchserve"
   output_path = f"s3://{bucket_name}/{prefix}"
   print(f'account={account}, region={region}, role={role}, output_path={output_path}')
   ```

1. Create an uncompressed model in SageMaker AI, as shown in the following example.

   ```
   from datetime import datetime
   
   instance_type = "ml.g5.24xlarge"
   endpoint_name = sagemaker.utils.name_from_base("ts-opt-30b")
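   # Replace {your_s3_bucket} with the S3 location used in the upload step.
   # The trailing slash marks the value as a key name prefix.
   s3_uri = "{your_s3_bucket}/opt/"
   # Example image only (assumption): choose a TorchServe DLC from the table above for your Region.
   container = "763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.0.1-gpu-py310-cu118-ubuntu20.04-sagemaker"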
   
   model = Model(
       name="torchserve-opt-30b" + datetime.now().strftime("%Y-%m-%d-%H-%M-%S"),
       # Enable SageMaker uncompressed model artifacts
       model_data={
           "S3DataSource": {
                   "S3Uri": s3_uri,
                   "S3DataType": "S3Prefix",
                   "CompressionType": "None",
           }
       },
       image_uri=container,
       role=role,
       sagemaker_session=sess,
       env={"TS_INSTALL_PY_DEP_PER_MODEL": "true"},
   )
   print(model)
   ```

1. Deploy the model to an Amazon EC2 instance, as shown in the following example.

   ```
   model.deploy(
       initial_instance_count=1,
       instance_type=instance_type,
       endpoint_name=endpoint_name,
       volume_size=512, # increase the size to store large model
       model_data_download_timeout=3600, # increase the timeout to download large model
       container_startup_health_check_timeout=600, # increase the timeout to load large model
   )
   ```

1. Initialize a class to process the streaming response, as shown in the following example.

   ```
   import io
   
   class Parser:
       """
       A helper class for parsing the byte stream input. 
       
       The output of the model will be in the following format:
       ```
       b'{"outputs": [" a"]}\n'
       b'{"outputs": [" challenging"]}\n'
       b'{"outputs": [" problem"]}\n'
       ...
       ```
       
       While usually each PayloadPart event from the event stream will contain a byte array 
       with a full json, this is not guaranteed and some of the json objects may be split across
       PayloadPart events. For example:
       ```
       {'PayloadPart': {'Bytes': b'{"outputs": '}}
       {'PayloadPart': {'Bytes': b'[" problem"]}\n'}}
       ```
       
       This class accounts for this by concatenating bytes written via the 'write' function
       and then exposing a method which will return lines (ending with a '\n' character) within
       the buffer via the 'scan_lines' function. It maintains the position of the last read 
       position to ensure that previous bytes are not exposed again. 
       """
       
       def __init__(self):
           self.buff = io.BytesIO()
           self.read_pos = 0
           
       def write(self, content):
           # Append the new bytes at the end of the buffer.
           self.buff.seek(0, io.SEEK_END)
           self.buff.write(content)

       def scan_lines(self):
           self.buff.seek(self.read_pos)
           for line in self.buff.readlines():
               # Emit only complete lines. A partial line stays buffered
               # until the rest of it arrives in a later PayloadPart.
               if line.endswith(b'\n'):
                   self.read_pos += len(line)
                   yield line[:-1]

       def reset(self):
           self.read_pos = 0
   ```

1. Test a streaming response prediction, as shown in the following example.

   ```
   import json
   
   body = "Today the weather is really nice and I am planning on".encode('utf-8')
   resp = smr.invoke_endpoint_with_response_stream(EndpointName=endpoint_name, Body=body, ContentType="application/json")
   event_stream = resp['Body']
   parser = Parser()
   for event in event_stream:
       parser.write(event['PayloadPart']['Bytes'])
       for line in parser.scan_lines():
           print(line.decode("utf-8"), end=' ')
   ```
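
The split-payload case that the `Parser` docstring describes can be checked locally without an endpoint. The following is a self-contained sketch: the `LineBuffer` class is a hypothetical, simplified stand-in for `Parser`, and the events are hand-written samples rather than real endpoint output.

```python
import io
import json


class LineBuffer:
    """Minimal newline-delimited byte buffer (hypothetical stand-in for Parser)."""

    def __init__(self):
        self.buff = io.BytesIO()
        self.read_pos = 0

    def write(self, content):
        # Append the new bytes at the end of the buffer.
        self.buff.seek(0, io.SEEK_END)
        self.buff.write(content)

    def scan_lines(self):
        # Yield only complete lines that have not been returned before.
        self.buff.seek(self.read_pos)
        for line in self.buff.readlines():
            if line.endswith(b"\n"):
                self.read_pos += len(line)
                yield line[:-1]


# Simulated events: the second JSON object is split across two parts.
events = [
    {"PayloadPart": {"Bytes": b'{"outputs": [" a"]}\n'}},
    {"PayloadPart": {"Bytes": b'{"outputs": '}},
    {"PayloadPart": {"Bytes": b'[" problem"]}\n'}},
]

buffer = LineBuffer()
lines = []
for event in events:
    buffer.write(event["PayloadPart"]["Bytes"])
    lines.extend(buffer.scan_lines())

# Every collected line is a complete JSON document.
tokens = [json.loads(line)["outputs"][0] for line in lines]
print(tokens)
```

Because incomplete lines stay in the buffer, each item that the loop collects can be passed to `json.loads` safely.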

You have now deployed your model to a SageMaker AI endpoint and should be able to invoke it for responses. For more information about SageMaker AI real-time endpoints, see [Single-model endpoints](realtime-single-model.md).