
Getting started

This guide shows how to deploy a customized Amazon Nova model to a real-time SageMaker endpoint, configure inference parameters, and invoke your model for testing.

Prerequisites

The following are the prerequisites for deploying Amazon Nova models on SageMaker inference:

Create an AWS account - If you don't already have one, see Creating an AWS account.
Required IAM permissions - Make sure your IAM user or role has the following managed policies attached:
- AmazonSageMakerFullAccess
- AmazonS3FullAccess
Required SDK/CLI versions - The following SDK versions have been tested and validated with Amazon Nova models on SageMaker inference:
- SageMaker Python SDK v3.0.0+ (sagemaker>=3.0.0) for the resource-based API approach
- Boto3 version 1.35.0+ (boto3>=1.35.0) for direct API calls. The examples in this guide use this approach.
Service quota increase - Request an Amazon SageMaker service quota increase for the ML instance type you plan to use for your SageMaker Inference endpoint (for example, ml.p5.48xlarge for endpoint usage). For a list of supported instance types, see Supported models and instances. To request an increase, see Requesting a quota increase. For information about SageMaker instance quotas, see SageMaker endpoints and quotas.
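
The minimum SDK versions listed above can be checked before you start. The following is a minimal sketch, not part of the official guide: the helper names (parse_version, meets_minimum, check_package) are hypothetical, and the comparison only looks at the leading numeric parts of each version string.

```python
from importlib import metadata


def parse_version(version):
    """Turn '1.35.0' into (1, 35, 0), stopping at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '1.40' compares like '1.40.0'
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_package(name, minimum):
    """Describe whether a package is installed at the minimum version or newer."""
    try:
        installed = metadata.version(name)
    except metadata.PackageNotFoundError:
        return f"{name}: not installed (need >= {minimum})"
    status = "OK" if meets_minimum(installed, minimum) else f"too old (need >= {minimum})"
    return f"{name} {installed}: {status}"


if __name__ == "__main__":
    # Minimum versions taken from the prerequisites list
    print(check_package("boto3", "1.35.0"))
    print(check_package("sagemaker", "3.0.0"))
```

Pre-release suffixes such as rc1 are ignored by this sketch, so treat the result as a quick sanity check rather than a strict compatibility test.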

Tip

For a quick end-to-end deployment, you can run the Custom Nova Model SageMaker Inference notebook to deploy a customized Amazon Nova model on SageMaker inference in a single notebook.

Step 1: Configure AWS credentials

Configure your AWS credentials using one of the following methods:

Option 1: AWS CLI (Recommended)


aws configure

Enter your AWS access key, secret key, and default Region when prompted.

Option 2: AWS credentials file

Create or edit ~/.aws/credentials:


[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY

Option 3: Environment variables


export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key

Note

For more information about AWS credentials, see Configuration and credential file settings.
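
If you use the credentials file from Option 2, a quick local check can catch a missing key before any AWS call is made. The following is a minimal sketch using only the Python standard library; the function name missing_credential_keys is hypothetical, and the sketch only inspects the two keys shown in this guide.

```python
import configparser

REQUIRED_KEYS = ("aws_access_key_id", "aws_secret_access_key")


def missing_credential_keys(file_text, profile="default"):
    """Return the required keys that are absent or empty for a profile."""
    # Interpolation is disabled so that special characters in a key are read literally
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(file_text)
    if not parser.has_section(profile):
        return list(REQUIRED_KEYS)
    return [
        key for key in REQUIRED_KEYS
        if not parser.get(profile, key, fallback="").strip()
    ]


sample = """
[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY
"""

# An empty list means both keys are present for the profile
print(missing_credential_keys(sample))
```

To check your real file, read the text of ~/.aws/credentials and pass it to the function. The sketch does not verify that the keys are valid; the authentication check in the next section does that.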

Initialize AWS clients

Create a Python script or notebook with the following code to initialize the AWS SDK and verify your credentials:


import boto3

# AWS Configuration - Update these for your environment
REGION = "us-east-1"  # Supported regions: us-east-1, us-west-2
AWS_ACCOUNT_ID = "YOUR_ACCOUNT_ID"  # Replace with your AWS account ID

# Initialize AWS clients using default credential chain
sagemaker = boto3.client('sagemaker', region_name=REGION)
sts = boto3.client('sts')

# Verify credentials
try:
    identity = sts.get_caller_identity()
    print(f"Successfully authenticated to AWS Account: {identity['Account']}")
    
    if identity['Account'] != AWS_ACCOUNT_ID:
        print(f"Warning: Connected to account {identity['Account']}, expected {AWS_ACCOUNT_ID}")

except Exception as e:
    print(f"Failed to authenticate: {e}")
    print("Please verify your credentials are configured correctly.")

If authentication succeeds, you see output confirming your AWS account ID.

Step 2: Create a SageMaker execution role

A SageMaker execution role is an IAM role that grants SageMaker permission to access AWS resources on your behalf, such as Amazon S3 buckets for model artifacts and CloudWatch for logging.

Create an execution role

Note

Creating an IAM role requires the iam:CreateRole and iam:AttachRolePolicy permissions. Make sure your IAM user or role has these permissions before continuing.

The following code creates an IAM role with the permissions required to deploy a custom Amazon Nova model:


import json

# Create SageMaker Execution Role
role_name = f"SageMakerInference-ExecutionRole-{AWS_ACCOUNT_ID}"

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole"
        }
    ]
}

iam = boto3.client('iam', region_name=REGION)

# Create the role
role_response = iam.create_role(
    RoleName=role_name,
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    Description='SageMaker execution role with S3 and SageMaker access'
)

# Attach required policies
iam.attach_role_policy(
    RoleName=role_name,
    PolicyArn='arn:aws:iam::aws:policy/AmazonSageMakerFullAccess'
)

iam.attach_role_policy(
    RoleName=role_name,
    PolicyArn='arn:aws:iam::aws:policy/AmazonS3FullAccess'
)

SAGEMAKER_EXECUTION_ROLE_ARN = role_response['Role']['Arn']
print(f"Created SageMaker execution role: {SAGEMAKER_EXECUTION_ROLE_ARN}")
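
Running the role creation code a second time fails because the role name is already taken. The following is a minimal sketch of one way to make that step repeatable, assuming iam is a boto3 IAM client like the one created above. The function name get_or_create_role_arn is hypothetical and not part of the official guide.

```python
import json


def get_or_create_role_arn(iam, role_name, trust_policy):
    """Create the role, or return the existing role's ARN if the name is taken.

    `iam` is expected to be a boto3 IAM client, for example boto3.client('iam').
    """
    try:
        response = iam.create_role(
            RoleName=role_name,
            AssumeRolePolicyDocument=json.dumps(trust_policy),
        )
    except iam.exceptions.EntityAlreadyExistsException:
        # The role already exists, so look it up instead of failing
        response = iam.get_role(RoleName=role_name)
    return response["Role"]["Arn"]
```

With the variables from the code above, the call would be get_or_create_role_arn(iam, role_name, trust_policy). The sketch does not compare the existing role's trust policy or attached policies with the ones you intended, so review a reused role before relying on it.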

Use an existing execution role (Optional)

If you already have a SageMaker execution role, you can use it instead:


# Replace with your existing role ARN
SAGEMAKER_EXECUTION_ROLE_ARN = "arn:aws:iam::YOUR_ACCOUNT_ID:role/YOUR_EXISTING_ROLE_NAME"

To find existing SageMaker roles in your account:


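iam = boto3.client('iam', region_name=REGION)

# list_roles is paginated, so walk every page rather than only the first one
paginator = iam.get_paginator('list_roles')
for page in paginator.paginate():
    for role in page['Roles']:
        # Keep only roles whose name mentions SageMaker
        if 'SageMaker' in role['RoleName']:
            print(f"{role['RoleName']}: {role['Arn']}")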

Important

The execution role must have a trust relationship with sagemaker.amazonaws.com and permissions to access Amazon S3 and SageMaker resources.

For more information about SageMaker execution roles, see SageMaker Roles.
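
The trust relationship requirement can be checked on a policy document before you reuse a role. The following is a minimal sketch that inspects a trust policy held as a Python dict, such as the trust_policy defined earlier in this step. The function name trusts_service is hypothetical, and the sketch ignores Condition blocks and wildcard principals.

```python
def trusts_service(trust_policy, service="sagemaker.amazonaws.com"):
    """Return True if an Allow statement lets the service call sts:AssumeRole."""
    statements = trust_policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        # Action and Principal.Service may each be a string or a list
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        principal = statement.get("Principal", {})
        services = principal.get("Service", []) if isinstance(principal, dict) else []
        if isinstance(services, str):
            services = [services]
        if "sts:AssumeRole" in actions and service in services:
            return True
    return False
```

A True result only confirms the trust relationship. The role still needs the Amazon S3 and SageMaker permissions described above.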

Step 3: Configure model parameters

Configure the deployment parameters for your Amazon Nova model. These settings control model behavior, resource allocation, and inference characteristics. For the list of supported instance types and the supported CONTEXT_LENGTH and MAX_CONCURRENCY values for each, see Supported models and instances. For the full list of additional container features, such as sampling defaults, speculative decoding, and quantization, see Inference Container Features.

Required parameters

IMAGE: The Docker container image URI for the Amazon Nova inference container. This is provided by AWS.
CONTEXT_LENGTH: The context length of the model.
MAX_CONCURRENCY: The maximum number of sequences per iteration. This caps how many individual user requests (prompts) can be processed concurrently in a single batch on the GPU. Range: an integer greater than 0.
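
Because container environment variables are passed as strings, a typo in either value may only surface once the container starts. The following is a minimal sketch that validates both required values up front. The function name build_environment is hypothetical, and treating CONTEXT_LENGTH as a positive integer is an assumption; this guide states a range only for MAX_CONCURRENCY.

```python
def build_environment(context_length, max_concurrency):
    """Validate the two required settings and return them as string values."""
    values = {"CONTEXT_LENGTH": context_length, "MAX_CONCURRENCY": max_concurrency}
    environment = {}
    for name, raw in values.items():
        # Accept integers or numeric strings; reject booleans, floats, and None
        if isinstance(raw, bool) or not isinstance(raw, (int, str)):
            raise ValueError(f"{name} must be an integer, got {raw!r}")
        try:
            number = int(raw)
        except ValueError:
            raise ValueError(f"{name} must be an integer, got {raw!r}") from None
        if number <= 0:
            raise ValueError(f"{name} must be greater than 0, got {number}")
        environment[name] = str(number)
    return environment


print(build_environment("8000", 8))
```

The sketch does not check the values against the per-instance limits; those are listed in Supported models and instances.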

Configure your deployment


# AWS Configuration
REGION = "us-east-1"  # Must match region from Step 1

# ECR Account mapping by region
ECR_ACCOUNT_MAP = {
    "us-east-1": "708977205387",
    "us-west-2": "176779409107"
}

# Container Image
IMAGE = f"{ECR_ACCOUNT_MAP[REGION]}.dkr.ecr.{REGION}.amazonaws.com/nova-inference-repo:SM-Inference-latest"
print(f"IMAGE = {IMAGE}")

# Required parameters
CONTEXT_LENGTH = "8000"        # Maximum total context length
MAX_CONCURRENCY = "8"          # Maximum concurrent sequences

# Build environment variables for the container
environment = {
    'CONTEXT_LENGTH': CONTEXT_LENGTH,
    'MAX_CONCURRENCY': MAX_CONCURRENCY,
    # Optional: add container feature environment variables here.
    # See "Inference Container Features" for the full list.
    # Examples:
    # 'DEFAULT_TEMPERATURE': '0.7',
    # 'DEFAULT_MAX_NEW_TOKENS': '512',
    # 'QUANTIZATION_DTYPE': 'fp8',
}

print("Environment configuration:")
for key, value in environment.items():
    print(f"  {key}: {value}")

Configure deployment-specific parameters

Now configure the parameters specific to your Amazon Nova model deployment, including the model artifact location and the instance type selection.

Set the deployment identifier


# Deployment identifier - use a descriptive name for your use case
JOB_NAME = "my-nova-deployment"

Specify the model artifact location

Provide the Amazon S3 URI where your trained Amazon Nova model artifacts are stored. This should be the output location of your model training or fine-tuning job.


# S3 location of your trained Nova model artifacts
# Replace with your model's S3 URI - must end with /
MODEL_S3_LOCATION = "s3://your-bucket-name/path/to/model/artifacts/"
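
The comment in the code above notes that the URI must end with a slash. The following is a minimal sketch that checks the URI shape and adds the trailing slash if it is missing. The function name normalize_model_s3_uri is hypothetical, and the sketch does not confirm that the bucket or prefix exists.

```python
from urllib.parse import urlparse


def normalize_model_s3_uri(uri):
    """Check that the URI looks like s3://bucket/... and make it end with '/'."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Expected an s3://bucket/prefix URI, got {uri!r}")
    return uri if uri.endswith("/") else uri + "/"


print(normalize_model_s3_uri("s3://your-bucket-name/path/to/model/artifacts"))
```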

Select the model variant and instance type


# Configure model variant and instance type
TESTCASE = {
    "model": "micro",              # Options: micro, lite, lite2
    "instance": "ml.g5.12xlarge"   # Refer to "Supported models and instances" section
}

# Generate resource names
INSTANCE_TYPE = TESTCASE["instance"]
MODEL_NAME = JOB_NAME + "-" + TESTCASE["model"] + "-" + INSTANCE_TYPE.replace(".", "-")
ENDPOINT_CONFIG_NAME = MODEL_NAME + "-Config"
ENDPOINT_NAME = MODEL_NAME + "-Endpoint"

print(f"Model Name: {MODEL_NAME}")
print(f"Endpoint Config: {ENDPOINT_CONFIG_NAME}")
print(f"Endpoint Name: {ENDPOINT_NAME}")

Naming conventions

The code automatically generates consistent names for your AWS resources:

Model name: {JOB_NAME}-{model}-{instance-type}
Endpoint configuration: {MODEL_NAME}-Config
Endpoint name: {MODEL_NAME}-Endpoint
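
Because every name is derived from JOB_NAME, a long job name can push the derived names past the service's limits. The following is a minimal sketch that builds the names with the same convention and validates them. The function name build_resource_names is hypothetical, and the 63-character limit and character pattern are assumptions to confirm against the current SageMaker API reference.

```python
import re

# Assumption: names allow letters, digits, and hyphens, up to 63 characters
NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")
MAX_NAME_LENGTH = 63


def build_resource_names(job_name, model, instance_type):
    """Derive the resource names used in this guide and validate each one."""
    model_name = f"{job_name}-{model}-{instance_type.replace('.', '-')}"
    names = {
        "model": model_name,
        "endpoint_config": model_name + "-Config",
        "endpoint": model_name + "-Endpoint",
        "inference_component": model_name + "-IC",
    }
    for kind, name in names.items():
        if len(name) > MAX_NAME_LENGTH or not NAME_PATTERN.match(name):
            raise ValueError(f"Invalid {kind} name: {name!r}")
    return names


for kind, name in build_resource_names("my-nova-deployment", "micro", "ml.g5.12xlarge").items():
    print(f"{kind}: {name}")
```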

Step 4: Create SageMaker resources and deploy the endpoint

SageMaker offers two approaches for deploying a model to a real-time endpoint. Choose the approach that fits your use case:

Inference components (Recommended): Deploys the model as an inference component on an endpoint. This approach lets you host multiple models on a single endpoint, scale models independently, and optimize resource utilization.
Single model endpoint: Deploys a single model directly to an endpoint using a model object and an endpoint configuration. This approach is simpler to set up and is suitable for development, testing, or workloads that need only one model per endpoint.

Option A: Create with inference components

With inference components, you first create an endpoint and then deploy your model as an inference component on that endpoint. This decouples the model from the endpoint infrastructure and gives you more flexibility.

Create an endpoint configuration

Create an endpoint configuration that defines the infrastructure without specifying a model. The instance type and instance count are managed at the endpoint level:


# Create Endpoint Configuration for inference components
INFERENCE_COMPONENT_NAME = MODEL_NAME + "-IC"

try:
    config_response = sagemaker.create_endpoint_config(
        EndpointConfigName=ENDPOINT_CONFIG_NAME,
        ProductionVariants=[
            {
                'VariantName': 'primary',
                'InstanceType': INSTANCE_TYPE,
                'InitialInstanceCount': 1,
                'RoutingConfig': {
                    'RoutingStrategy': 'LEAST_OUTSTANDING_REQUESTS'
                }
            }
        ],
        Tags=[
            {
                'Key': 'sagemaker:nova-inference-component',
                'Value': 'true'
            }
        ]
    )
    print("Endpoint configuration created successfully!")
    print(f"Config ARN: {config_response['EndpointConfigArn']}")

except sagemaker.exceptions.ClientError as e:
    print(f"Error creating endpoint configuration: {e}")

Create and deploy the endpoint


import time

try:
    endpoint_response = sagemaker.create_endpoint(
        EndpointName=ENDPOINT_NAME,
        EndpointConfigName=ENDPOINT_CONFIG_NAME
    )
    print("Endpoint creation initiated successfully!")
    print(f"Endpoint ARN: {endpoint_response['EndpointArn']}")
except Exception as e:
    print(f"Error creating endpoint: {e}")

# Wait for endpoint to be InService
print("Waiting for endpoint to be InService...")
print("This typically takes 5-10 minutes...\n")

while True:
    try:
        response = sagemaker.describe_endpoint(EndpointName=ENDPOINT_NAME)
        status = response['EndpointStatus']
        
        if status == 'Creating':
            print(f"⏳ Status: {status} - Provisioning infrastructure...")
        elif status == 'InService':
            print(f"✅ Status: {status}")
            print(f"\nEndpoint '{ENDPOINT_NAME}' is ready.")
            break
        elif status == 'Failed':
            print(f"❌ Status: {status}")
            print(f"Failure Reason: {response.get('FailureReason', 'Unknown')}")
            break
        else:
            print(f"Status: {status}")
    except Exception as e:
        print(f"Error checking endpoint status: {e}")
        break
    
    time.sleep(30)

Create the inference component

After the endpoint is InService, deploy your Amazon Nova model as an inference component:


try:
    ic_response = sagemaker.create_inference_component(
        InferenceComponentName=INFERENCE_COMPONENT_NAME,
        EndpointName=ENDPOINT_NAME,
        VariantName='primary',
        Specification={
            'Container': {
                'Image': IMAGE,
                'ArtifactUrl': MODEL_S3_LOCATION,
                'Environment': environment
            },
            'ComputeResourceRequirements': {
                'NumberOfCpuCoresRequired': 15,
                'NumberOfAcceleratorDevicesRequired': 4,
                'MinMemoryRequiredInMb': 25000
            }
        },
        RuntimeConfig={
            'CopyCount': 1
        }
    )
    print("Inference component creation initiated!")
    print(f"Inference Component ARN: {ic_response['InferenceComponentArn']}")

except sagemaker.exceptions.ClientError as e:
    print(f"Error creating inference component: {e}")

Key parameters:

InferenceComponentName: A unique identifier for your inference component
EndpointName: The endpoint on which to deploy the component
Image: The Docker container image URI for Amazon Nova inference
ArtifactUrl: The Amazon S3 location of your model artifacts
Environment: The environment variables configured in Step 3
NumberOfCpuCoresRequired: The number of CPU cores required per model copy
NumberOfAcceleratorDevicesRequired: The number of accelerator devices (GPUs) required per model copy
MinMemoryRequiredInMb: The minimum memory, in MB, required per model copy
CopyCount: The number of model copies to deploy
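
The per-copy requirements and CopyCount interact: a copy is placed only if an instance has enough of every resource. The following is a minimal sketch of that arithmetic. The function name copies_that_fit is hypothetical, and the instance capacity figures are illustrative assumptions, not values from this guide, so look up the real numbers for your instance type.

```python
def copies_that_fit(instance_capacity, per_copy):
    """Return how many model copies fit on one instance.

    Both arguments are dicts with the keys 'cpu_cores', 'accelerators',
    and 'memory_mb'. The scarcest resource decides the answer.
    """
    return min(
        instance_capacity[key] // per_copy[key]
        for key in ("cpu_cores", "accelerators", "memory_mb")
        if per_copy[key] > 0
    )


# Per-copy requirements from the create_inference_component call above
per_copy = {"cpu_cores": 15, "accelerators": 4, "memory_mb": 25000}

# Hypothetical capacity figures for illustration only
capacity = {"cpu_cores": 48, "accelerators": 4, "memory_mb": 196608}

print(copies_that_fit(capacity, per_copy))  # prints 1
```

With these assumed figures the accelerator count is the limiting resource, so only one copy fits. The sketch also ignores any capacity the service reserves for itself, which may lower the real number.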

Monitor the inference component deployment


# Wait for inference component to be InService
print("Waiting for inference component deployment...")
print("This typically takes 10-20 minutes as the model is loaded...\n")

while True:
    try:
        ic_desc = sagemaker.describe_inference_component(
            InferenceComponentName=INFERENCE_COMPONENT_NAME
        )
        ic_status = ic_desc['InferenceComponentStatus']
        
        if ic_status == 'Creating':
            print(f"⏳ Status: {ic_status} - Loading model artifacts...")
        elif ic_status == 'InService':
            print(f"✅ Status: {ic_status}")
            print(f"\nInference component '{INFERENCE_COMPONENT_NAME}' is ready!")
            break
        elif ic_status == 'Failed':
            print(f"❌ Status: {ic_status}")
            print(f"Failure Reason: {ic_desc.get('FailureReason', 'Unknown')}")
            break
        else:
            print(f"Status: {ic_status}")
    except Exception as e:
        print(f"Error checking inference component status: {e}")
        break
    
    time.sleep(30)

Note

When you invoke the endpoint in Step 5, you must include the InferenceComponentName parameter in your invocation call. See Step 5 for details.

Option B: Create with a single model endpoint

With a single model endpoint, you create a SageMaker model object, then an endpoint configuration, and then deploy the endpoint. This approach packages the model directly into the endpoint configuration.

Create a SageMaker model

The following code creates a SageMaker model that references your Amazon Nova model artifacts:


try:
    model_response = sagemaker.create_model(
        ModelName=MODEL_NAME,
        PrimaryContainer={
            'Image': IMAGE,
            'ModelDataSource': {
                'S3DataSource': {
                    'S3Uri': MODEL_S3_LOCATION,
                    'S3DataType': 'S3Prefix',
                    'CompressionType': 'None'
                }
            },
            'Environment': environment
        },
        ExecutionRoleArn=SAGEMAKER_EXECUTION_ROLE_ARN,
        EnableNetworkIsolation=True
    )
    print("Model created successfully!")
    print(f"Model ARN: {model_response['ModelArn']}")
    
except sagemaker.exceptions.ClientError as e:
    print(f"Error creating model: {e}")

Key parameters:

ModelName: A unique identifier for your model
Image: The Docker container image URI for Amazon Nova inference
ModelDataSource: The Amazon S3 location of your model artifacts
Environment: The environment variables configured in Step 3
ExecutionRoleArn: The IAM role from Step 2
EnableNetworkIsolation: Set to True for enhanced security (prevents the container from making outbound network calls)

Create an endpoint configuration

Next, create an endpoint configuration that defines your deployment infrastructure:


# Create Endpoint Configuration
try:
    production_variant = {
        'VariantName': 'primary',
        'ModelName': MODEL_NAME,
        'InitialInstanceCount': 1,
        'InstanceType': INSTANCE_TYPE,
    }
    
    config_response = sagemaker.create_endpoint_config(
        EndpointConfigName=ENDPOINT_CONFIG_NAME,
        ProductionVariants=[production_variant]
    )
    print("Endpoint configuration created successfully!")
    print(f"Config ARN: {config_response['EndpointConfigArn']}")
    
except sagemaker.exceptions.ClientError as e:
    print(f"Error creating endpoint configuration: {e}")

Key parameters:

VariantName: An identifier for this model variant (use 'primary' for single-model deployments)
ModelName: References the model created above
InitialInstanceCount: The number of instances to deploy (start with 1 and scale later if needed)
InstanceType: The ML instance type selected in Step 3

Deploy the endpoint


import time

try:
    endpoint_response = sagemaker.create_endpoint(
        EndpointName=ENDPOINT_NAME,
        EndpointConfigName=ENDPOINT_CONFIG_NAME
    )
    print("Endpoint creation initiated successfully!")
    print(f"Endpoint ARN: {endpoint_response['EndpointArn']}")
except Exception as e:
    print(f"Error creating endpoint: {e}")

Monitor endpoint creation

The following code polls the endpoint status until the deployment completes:


# Monitor endpoint creation progress
print("Waiting for endpoint creation to complete...")
print("This typically takes 15-30 minutes...\n")

while True:
    try:
        response = sagemaker.describe_endpoint(EndpointName=ENDPOINT_NAME)
        status = response['EndpointStatus']
        
        if status == 'Creating':
            print(f"⏳ Status: {status} - Provisioning infrastructure and loading model...")
        elif status == 'InService':
            print(f"✅ Status: {status}")
            print("\nEndpoint creation completed successfully!")
            print(f"Endpoint Name: {ENDPOINT_NAME}")
            print(f"Endpoint ARN: {response['EndpointArn']}")
            break
        elif status == 'Failed':
            print(f"❌ Status: {status}")
            print(f"Failure Reason: {response.get('FailureReason', 'Unknown')}")
            print("\nFull response:")
            print(response)
            break
        else:
            print(f"Status: {status}")
        
    except Exception as e:
        print(f"Error checking endpoint status: {e}")
        break
    
    time.sleep(30)  # Check every 30 seconds
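
The polling loops in this step run until the resource reaches InService or Failed, with no upper bound on the wait. The following is a minimal sketch of a reusable variant with a timeout. The function name wait_for_status and its parameters are hypothetical; describe is any zero-argument callable that returns a describe response.

```python
import time


def wait_for_status(describe, status_key, timeout_seconds=3600,
                    poll_seconds=30, sleep=time.sleep, clock=time.monotonic):
    """Poll describe() until status_key reports InService or Failed.

    Returns the final describe() response. Raises TimeoutError if the
    resource is still in another state when the deadline passes.
    """
    deadline = clock() + timeout_seconds
    while True:
        response = describe()
        status = response[status_key]
        if status in ("InService", "Failed"):
            return response
        if clock() >= deadline:
            raise TimeoutError(f"Still {status} after {timeout_seconds} seconds")
        sleep(poll_seconds)


# Usage with the clients and names defined earlier in this guide:
# result = wait_for_status(
#     lambda: sagemaker.describe_endpoint(EndpointName=ENDPOINT_NAME),
#     "EndpointStatus",
# )
# print(result["EndpointStatus"], result.get("FailureReason", ""))
```

The same helper would work for the inference component by passing a lambda around describe_inference_component and the key InferenceComponentStatus. The sleep and clock parameters exist only so the function can be exercised without waiting.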

Verify resource creation

You can verify that your resources were created successfully:


# Describe the model
model_info = sagemaker.describe_model(ModelName=MODEL_NAME)
print(f"Model Status: {model_info['ModelName']} created")

# Describe the endpoint configuration
config_info = sagemaker.describe_endpoint_config(EndpointConfigName=ENDPOINT_CONFIG_NAME)
print(f"Endpoint Config Status: {config_info['EndpointConfigName']} created")

Verify that the endpoint is ready

Regardless of which approach you chose, you can verify the endpoint configuration:


# Get detailed endpoint information
endpoint_info = sagemaker.describe_endpoint(EndpointName=ENDPOINT_NAME)

print("\n=== Endpoint Details ===")
print(f"Endpoint Name: {endpoint_info['EndpointName']}")
print(f"Endpoint ARN: {endpoint_info['EndpointArn']}")
print(f"Status: {endpoint_info['EndpointStatus']}")
print(f"Creation Time: {endpoint_info['CreationTime']}")
print(f"Last Modified: {endpoint_info['LastModifiedTime']}")

# Get endpoint config for instance type details
endpoint_config_name = endpoint_info['EndpointConfigName']
endpoint_config = sagemaker.describe_endpoint_config(EndpointConfigName=endpoint_config_name)

# Display production variant details
for variant in endpoint_info['ProductionVariants']:
    print(f"\nProduction Variant: {variant['VariantName']}")
    print(f"  Current Instance Count: {variant['CurrentInstanceCount']}")
    print(f"  Desired Instance Count: {variant['DesiredInstanceCount']}")
    # Get instance type from endpoint config
    for config_variant in endpoint_config['ProductionVariants']:
        if config_variant['VariantName'] == variant['VariantName']:
            print(f"  Instance Type: {config_variant['InstanceType']}")
            break

Troubleshoot endpoint creation failures

Common failure reasons:

Insufficient capacity: The requested instance type is not available in your Region
- Solution: Try a different instance type or request a quota increase
IAM permissions: The execution role lacks the required permissions
- Solution: Verify that the role has access to the Amazon S3 model artifacts and the required SageMaker permissions
Model artifacts not found: The Amazon S3 URI is incorrect or inaccessible
- Solution: Verify the Amazon S3 URI, check the bucket permissions, and make sure you are in the correct Region
Resource limits: Account limits exceeded for endpoints or instances
- Solution: Request a service quota increase through Service Quotas or AWS Support

Note

If you need to delete a failed endpoint and start over:


sagemaker.delete_endpoint(EndpointName=ENDPOINT_NAME)

Step 5: Invoke the endpoint

After your endpoint is InService, you can send inference requests to generate predictions from your Amazon Nova model. SageMaker supports synchronous endpoints (real-time, with streaming and non-streaming modes) and asynchronous endpoints (Amazon S3-based, for batch processing).

Set up the runtime client

Create a SageMaker Runtime client with appropriate timeout settings:


import json
import boto3
import botocore
from botocore.exceptions import ClientError

# Configure client with appropriate timeouts
config = botocore.config.Config(
    read_timeout=120,      # Maximum time to wait for response
    connect_timeout=10,    # Maximum time to establish connection
    retries={'max_attempts': 3}  # Number of retry attempts
)

# Create SageMaker Runtime client
runtime_client = boto3.client('sagemaker-runtime', config=config, region_name=REGION)

Create a universal inference function

The following function handles both streaming and non-streaming requests. It uses the INFERENCE_COMPONENT_NAME variable defined in Step 4. If you deployed using inference components (Option A), this is set to MODEL_NAME + "-IC". If you deployed using a single model endpoint (Option B), it is not defined, so set it to None before running this step:


# Only needed if you followed Option B (single model endpoints) in Step 4:
# INFERENCE_COMPONENT_NAME = None

def invoke_nova_endpoint(request_body):
    """
    Invoke Nova endpoint with automatic streaming detection.
    Supports both inference component and single model endpoint deployments.
    
    Args:
        request_body (dict): Request payload containing prompt and parameters
    
    Returns:
        dict: Response from the model (for non-streaming requests)
        None: For streaming requests (prints output directly)
    """
    body = json.dumps(request_body)
    is_streaming = request_body.get("stream", False)
    
    # Build invoke parameters
    invoke_params = {
        'EndpointName': ENDPOINT_NAME,
        'ContentType': 'application/json',
        'Body': body
    }
    
    # Add InferenceComponentName if using inference components
    if INFERENCE_COMPONENT_NAME:
        invoke_params['InferenceComponentName'] = INFERENCE_COMPONENT_NAME
    
    try:
        print(f"Invoking endpoint ({'streaming' if is_streaming else 'non-streaming'})...")
        
        if is_streaming:
            response = runtime_client.invoke_endpoint_with_response_stream(**invoke_params)
            
            event_stream = response['Body']
            for event in event_stream:
                if 'PayloadPart' in event:
                    chunk = event['PayloadPart']
                    if 'Bytes' in chunk:
                        data = chunk['Bytes'].decode()
                        print("Chunk:", data)
        else:
            # Non-streaming inference
            invoke_params['Accept'] = 'application/json'
            response = runtime_client.invoke_endpoint(**invoke_params)
            
            response_body = response['Body'].read().decode('utf-8')
            result = json.loads(response_body)
            print("✅ Response received successfully")
            return result
    
    except ClientError as e:
        error_code = e.response['Error']['Code']
        error_message = e.response['Error']['Message']
        print(f"❌ AWS Error: {error_code} - {error_message}")
    except Exception as e:
        print(f"❌ Unexpected error: {str(e)}")

Example 1: Non-streaming chat completion

Use the chat format for conversational interactions:


# Non-streaming chat request
chat_request = {
    "messages": [
        {"role": "user", "content": "Hello! How are you?"}
    ],
    "max_tokens": 100,
    "max_completion_tokens": 100,  # Alternative to max_tokens
    "stream": False,
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 50,
    "logprobs": True,
    "top_logprobs": 3,
    "allowed_token_ids": None,  # List of allowed token IDs
    "truncate_prompt_tokens": None,  # Truncate prompt to this many tokens
    "stream_options": None
}

response = invoke_nova_endpoint(chat_request)

Sample response:


{
    "id": "chatcmpl-123456",
    "object": "chat.completion",
    "created": 1234567890,
    "model": "default",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Hello! I'm doing well, thank you for asking. I'm here and ready to help you with any questions or tasks you might have. How can I assist you today?"
            },
            "logprobs": {
                "content": [
                    {
                        "token": "Hello",
                        "logprob": -0.123,
                        "top_logprobs": [
                            {"token": "Hello", "logprob": -0.123},
                            {"token": "Hi", "logprob": -2.456},
                            {"token": "Hey", "logprob": -3.789}
                        ]
                    }
                    # Additional tokens...
                ]
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 12,
        "completion_tokens": 28,
        "total_tokens": 40
    }
}

Example 2: Simple text completion

Use the completion format for simple text generation:


# Simple completion request
completion_request = {
    "prompt": "The capital of France is",
    "max_tokens": 50,
    "stream": False,
    "temperature": 0.0,
    "top_p": 1.0,
    "top_k": -1,  # -1 means no limit
    "logprobs": 3,  # Number of log probabilities to return
    "allowed_token_ids": None,  # List of allowed token IDs
    "truncate_prompt_tokens": None,  # Truncate prompt to this many tokens
    "stream_options": None
}

response = invoke_nova_endpoint(completion_request)

Sample response:


{
    "id": "cmpl-789012",
    "object": "text_completion",
    "created": 1234567890,
    "model": "default",
    "choices": [
        {
            "text": " Paris.",
            "index": 0,
            "logprobs": {
                "tokens": [" Paris", "."],
                "token_logprobs": [-0.001, -0.002],
                "top_logprobs": [
                    {" Paris": -0.001, " London": -5.234, " Rome": -6.789},
                    {".": -0.002, ",": -4.567, "!": -7.890}
                ]
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 6,
        "completion_tokens": 2,
        "total_tokens": 8
    }
}
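
The chat and text completion responses shown above keep the generated text in different fields. The following is a minimal sketch that reads either shape, based only on the sample responses in this guide. The function names extract_text and summarize_usage are hypothetical.

```python
def extract_text(response):
    """Return the generated text from a chat or text completion response."""
    choice = response["choices"][0]
    if response.get("object") == "chat.completion":
        return choice["message"]["content"]
    return choice["text"]


def summarize_usage(response):
    """Format the token counts reported in the 'usage' block."""
    usage = response.get("usage", {})
    return (f"prompt={usage.get('prompt_tokens')} "
            f"completion={usage.get('completion_tokens')} "
            f"total={usage.get('total_tokens')}")


# Trimmed copy of the text completion sample above
sample = {
    "object": "text_completion",
    "choices": [{"text": " Paris.", "index": 0, "finish_reason": "stop"}],
    "usage": {"prompt_tokens": 6, "completion_tokens": 2, "total_tokens": 8},
}

print(extract_text(sample))
print(summarize_usage(sample))
```

Because invoke_nova_endpoint returns None when a call fails, check the result for None before passing it to these helpers.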

Example 3: Streaming chat completion


# Streaming chat request
streaming_request = {
    "messages": [
        {"role": "user", "content": "Tell me a short story about a robot"}
    ],
    "max_tokens": 200,
    "stream": True,
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 40,
    "logprobs": True,
    "top_logprobs": 2,
    "stream_options": {"include_usage": True}
}

invoke_nova_endpoint(streaming_request)

Sample streaming output:


Chunk: data: {"id":"chatcmpl-029ca032-fa01-4868-80b7-c4cb1af90fb9","object":"chat.completion.chunk","created":1772060532,"model":"default","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}],"prompt_token_ids":null}
Chunk: data: {"id":"chatcmpl-029ca032-fa01-4868-80b7-c4cb1af90fb9","object":"chat.completion.chunk","created":1772060532,"model":"default","choices":[{"index":0,"delta":{"content":" Once"},"logprobs":{"content":[{"token":"\u2581Once","logprob":-0.6078429222106934,"bytes":[226,150,129,79,110,99,101],"top_logprobs":[{"token":"\u2581Once","logprob":-0.6078429222106934,"bytes":[226,150,129,79,110,99,101]},{"token":"\u2581In","logprob":-0.7864127159118652,"bytes":[226,150,129,73,110]}]}]},"finish_reason":null,"token_ids":null}]}
Chunk: data: {"id":"chatcmpl-029ca032-fa01-4868-80b7-c4cb1af90fb9","object":"chat.completion.chunk","created":1772060532,"model":"default","choices":[{"index":0,"delta":{"content":" upon"},"logprobs":{"content":[{"token":"\u2581upon","logprob":-0.0012345,"bytes":[226,150,129,117,112,111,110],"top_logprobs":[{"token":"\u2581upon","logprob":-0.0012345,"bytes":[226,150,129,117,112,111,110]},{"token":"\u2581a","logprob":-6.789,"bytes":[226,150,129,97]}]}]},"finish_reason":null,"token_ids":null}]}
Chunk: data: {"id":"chatcmpl-029ca032-fa01-4868-80b7-c4cb1af90fb9","object":"chat.completion.chunk","created":1772060532,"model":"default","choices":[{"index":0,"delta":{"content":" a"},"logprobs":{"content":[{"token":"\u2581a","logprob":-0.0001234,"bytes":[226,150,129,97],"top_logprobs":[{"token":"\u2581a","logprob":-0.0001234,"bytes":[226,150,129,97]},{"token":"\u2581time","logprob":-9.123,"bytes":[226,150,129,116,105,109,101]}]}]},"finish_reason":null,"token_ids":null}]}
Chunk: data: {"id":"chatcmpl-029ca032-fa01-4868-80b7-c4cb1af90fb9","object":"chat.completion.chunk","created":1772060532,"model":"default","choices":[{"index":0,"delta":{"content":" time"},"logprobs":{"content":[{"token":"\u2581time","logprob":-0.0023456,"bytes":[226,150,129,116,105,109,101],"top_logprobs":[{"token":"\u2581time","logprob":-0.0023456,"bytes":[226,150,129,116,105,109,101]},{"token":",","logprob":-6.012,"bytes":[44]}]}]},"finish_reason":null,"token_ids":null}]}

# Additional chunks...

Chunk: data: {"id":"chatcmpl-029ca032-fa01-4868-80b7-c4cb1af90fb9","object":"chat.completion.chunk","created":1772060532,"model":"default","choices":[{"index":0,"delta":{},"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":15,"completion_tokens":87,"total_tokens":102}}
Chunk: data: [DONE]
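
Each streamed chunk above is a data: line carrying a JSON event, and the generated text is spread across the delta.content fields. The following is a minimal sketch that reassembles the text from decoded chunks, based only on the sample output in this guide. The function name extract_stream_text is hypothetical, and the sketch assumes every chunk contains whole lines; if an event is split across chunks, you would need to buffer the pieces first.

```python
import json


def extract_stream_text(chunks):
    """Join the delta content from decoded 'data: ...' stream chunks."""
    pieces = []
    for chunk in chunks:
        for line in chunk.splitlines():
            line = line.strip()
            if not line.startswith("data:"):
                continue
            payload = line[len("data:"):].strip()
            if payload == "[DONE]":
                # End-of-stream marker, as in the last sample chunk
                return "".join(pieces)
            event = json.loads(payload)
            for choice in event.get("choices", []):
                content = choice.get("delta", {}).get("content")
                if content:
                    pieces.append(content)
    return "".join(pieces)


# Trimmed versions of the sample chunks above
chunks = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":""}}]}',
    'data: {"choices":[{"index":0,"delta":{"content":" Once"}}]}',
    'data: {"choices":[{"index":0,"delta":{"content":" upon"}}]}',
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    'data: [DONE]',
]

print(repr(extract_stream_text(chunks)))
```

To use it with invoke_nova_endpoint, you could collect each decoded chunk into a list inside the streaming loop instead of printing it.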

Example 4: Multimodal chat completion

Use the multimodal format for image and text input:


# Multimodal chat request (if supported by your model)
multimodal_request = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
            ]
        }
    ],
    "max_tokens": 150,
    "temperature": 0.3,
    "top_p": 0.8,
    "stream": False
}

response = invoke_nova_endpoint(multimodal_request)

Sample response:


{
    "id": "chatcmpl-345678",
    "object": "chat.completion",
    "created": 1234567890,
    "model": "default",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "The image shows..."
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 1250,
        "completion_tokens": 45,
        "total_tokens": 1295
    }
}
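The multimodal request in this example leaves the image payload elided as `data:image/jpeg;base64,...`. The following sketch shows one way such a data URL could be produced from raw image bytes and placed into the same request shape. Both function names are hypothetical, and whether your model accepts image input, which MIME types it accepts, and any size limits are assumptions you should verify for your deployment.

```python
import base64


def image_bytes_to_data_url(image_bytes, mime_type="image/jpeg"):
    """Encode raw image bytes as a base64 data URL (hypothetical helper)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_multimodal_request(prompt, image_bytes, mime_type="image/jpeg"):
    """Build a request with the same structure and parameters as Example 4."""
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": image_bytes_to_data_url(image_bytes, mime_type)
                        },
                    },
                ],
            }
        ],
        "max_tokens": 150,
        "temperature": 0.3,
        "top_p": 0.8,
        "stream": False,
    }
```

To use it, read the file in binary mode, for example `open(path, "rb").read()`, and pass the resulting request to `invoke_nova_endpoint`.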

Step 6: Clean up resources (Optional)

To avoid unnecessary charges, delete the AWS resources that you created during this tutorial. SageMaker endpoints incur charges while they are running, even if you are not actively making inference requests.

Important

Deleting resources is permanent and cannot be undone. Make sure that you no longer need these resources before you proceed.

Initialize the cleanup client


import boto3
import time

# Initialize SageMaker client
sagemaker = boto3.client('sagemaker', region_name=REGION)

Delete the inference component (if using Option A)

If you deployed using inference components, delete the inference component before you delete the endpoint:


# Delete inference component (Option A only)
try:
    print("Deleting inference component...")
    sagemaker.delete_inference_component(InferenceComponentName=INFERENCE_COMPONENT_NAME)
    print(f"✅ Inference component '{INFERENCE_COMPONENT_NAME}' deletion initiated")
except Exception as e:
    print(f"❌ Error deleting inference component: {e}")

# Wait for inference component to be deleted before proceeding
print("Waiting for inference component deletion...")
while True:
    try:
        sagemaker.describe_inference_component(InferenceComponentName=INFERENCE_COMPONENT_NAME)
        time.sleep(10)
    except sagemaker.exceptions.ClientError as e:
        if e.response['Error']['Code'] == 'ValidationException':
            print("✅ Inference component successfully deleted")
            break
        else:
            print(f"Error: {e}")
            break

Delete the endpoint


try:
    print("Deleting endpoint...")
    sagemaker.delete_endpoint(EndpointName=ENDPOINT_NAME)
    print(f"✅ Endpoint '{ENDPOINT_NAME}' deletion initiated")
    print("Charges will stop once deletion completes (typically 2-5 minutes)")
except Exception as e:
    print(f"❌ Error deleting endpoint: {e}")

Note

Endpoint deletion is asynchronous. You can monitor the deletion status:


import time

print("Monitoring endpoint deletion...")
while True:
    try:
        response = sagemaker.describe_endpoint(EndpointName=ENDPOINT_NAME)
        status = response['EndpointStatus']
        print(f"Status: {status}")
        time.sleep(10)
    except sagemaker.exceptions.ClientError as e:
        if e.response['Error']['Code'] == 'ValidationException':
            print("✅ Endpoint successfully deleted")
            break
        else:
            print(f"Error: {e}")
            break
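The polling loops in this step run until the describe call fails, with no upper bound on how long they wait. The following is a minimal sketch of the same pattern with a timeout added. The helper `wait_until_gone` and its parameters are hypothetical; it takes the describe call and the "not found" check as arguments, so it is not tied to any specific SageMaker API.

```python
import time


def wait_until_gone(describe, is_not_found, timeout_s=600.0, poll_s=10.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll `describe` until it raises a 'not found' error or time runs out.

    Hypothetical helper for illustration.
    Returns True if the resource disappeared, False on timeout.
    Any error that `is_not_found` does not recognize is re-raised.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            describe()
        except Exception as exc:
            if is_not_found(exc):
                return True
            raise
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

With the client from this step, the call might look like `wait_until_gone(lambda: sagemaker.describe_endpoint(EndpointName=ENDPOINT_NAME), lambda e: getattr(e, "response", {}).get("Error", {}).get("Code") == "ValidationException")`, which reuses the same error-code check as the loops above.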

Delete the endpoint configuration

After the endpoint is deleted, delete the endpoint configuration:


try:
    print("Deleting endpoint configuration...")
    sagemaker.delete_endpoint_config(EndpointConfigName=ENDPOINT_CONFIG_NAME)
    print(f"✅ Endpoint configuration '{ENDPOINT_CONFIG_NAME}' deleted")
except Exception as e:
    print(f"❌ Error deleting endpoint configuration: {e}")

Delete the model (Option B only)

If you used a single-model endpoint, delete the SageMaker model object:


try:
    print("Deleting model...")
    sagemaker.delete_model(ModelName=MODEL_NAME)
    print(f"✅ Model '{MODEL_NAME}' deleted")
except Exception as e:
    print(f"❌ Error deleting model: {e}")
