Deploy custom fine-tuned models from Amazon S3 and Amazon FSx using kubectl

The following steps show you how to deploy models stored on Amazon S3 or Amazon FSx to an Amazon SageMaker HyperPod cluster using kubectl.

The following instructions contain code cells and commands designed to run in a Jupyter notebook environment, such as Amazon SageMaker Studio or SageMaker Notebook Instances. Each code block represents a notebook cell that should be executed sequentially. The interactive elements, including model discovery tables and status monitoring commands, are optimized for the notebook interface and may not function properly in other environments. Ensure you have access to a notebook environment with the necessary AWS permissions before proceeding.

Prerequisites

Verify that you’ve set up inference capabilities on your Amazon SageMaker HyperPod clusters. For more information, see Setting up your HyperPod clusters for model deployment.
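If kubectl is already configured for your cluster (this walkthrough configures it later, under Deploy the model to your cluster), you can optionally confirm that the HyperPod inference operator and the hyperpod-inference service account are in place. The following is a minimal sketch and not part of the original setup steps; replace <namespace> with your namespace.

    # Optional sanity check, assuming kubectl already points at your HyperPod EKS cluster

    # Confirm that the inference operator's custom resource definitions are installed
    !kubectl get crd | grep sagemaker

    # Confirm that the hyperpod-inference service account exists in your namespace
    !kubectl get serviceaccount hyperpod-inference -n <namespace>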

Setup and configuration

Replace all placeholder values with your actual resource identifiers.
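The cells in this walkthrough also reference a few names that are assumed to be defined earlier in your notebook: region_name, boto3_config, and tls_certificate_s3_location, along with common imports such as boto3, yaml, os, time, and datetime. A minimal setup cell might look like the following sketch; the specific values shown are placeholders, not part of the original walkthrough.

    # Minimal setup sketch: later cells in this walkthrough assume these names exist.
    import os
    import time
    from datetime import datetime

    import boto3
    import botocore.exceptions  # used for ClientError handling when creating the S3 bucket
    import yaml
    from botocore.config import Config

    # Region of your HyperPod cluster and S3 bucket (placeholder value)
    region_name = "us-east-2"

    # Shared botocore client configuration with retries
    boto3_config = Config(retries={"max_attempts": 10, "mode": "standard"})

    # S3 URI where the TLS certificate for the endpoint should be written (placeholder value)
    tls_certificate_s3_location = "s3://<certificate-bucket>/certificates/"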

  1. Initialize your cluster name. This identifies the HyperPod cluster where your model will be deployed.

    # Specify your HyperPod cluster name here
    hyperpod_cluster_name="<Hyperpod_cluster_name>"

    # NOTE: For the sample deployment, we use ml.g5.8xlarge for the DeepSeek-R1 1.5B model,
    # which has sufficient memory and GPU capacity
    instance_type="ml.g5.8xlarge"
  2. Initialize your cluster namespace. Your cluster admin should've already created a hyperpod-inference service account in your namespace.

    cluster_namespace="<namespace>"
  3. Define the helper method to create YAML files for deployment.

    The following helper function generates the Kubernetes YAML configuration files needed to deploy your model. This function creates different YAML structures depending on whether your model is stored on Amazon S3 or Amazon FSx, handling the storage-specific configurations automatically. You'll use this function in the next sections to generate the deployment files for your chosen storage backend.

    def generate_inferenceendpointconfig_yaml(deployment_name, model_id, namespace, instance_type,
                                              output_file_path, region, tls_certificate_s3_location,
                                              model_location, sagemaker_endpoint_name,
                                              fsxFileSystemId="", isFsx=False, s3_bucket=None):
        """
        Generate an InferenceEndpointConfig YAML file for S3 or FSx storage with the provided parameters.

        Args:
            deployment_name (str): The deployment name
            model_id (str): The model ID
            namespace (str): The namespace
            instance_type (str): The instance type
            output_file_path (str): Path where the YAML file will be saved
            region (str): Region where the bucket exists
            tls_certificate_s3_location (str): S3 location for the TLS certificate
            model_location (str): Location of the model
            sagemaker_endpoint_name (str): Name of the SageMaker endpoint
            fsxFileSystemId (str): FSx file system ID (optional)
            isFsx (bool): Whether to use FSx storage (optional)
            s3_bucket (str): S3 bucket where the model exists (optional, only needed when isFsx is False)
        """
        # Create the YAML structure
        model_config = {
            "apiVersion": "inference.sagemaker.aws.amazon.com/v1alpha1",
            "kind": "InferenceEndpointConfig",
            "metadata": {
                "name": deployment_name,
                "namespace": namespace
            },
            "spec": {
                "modelName": model_id,
                "endpointName": sagemaker_endpoint_name,
                "invocationEndpoint": "invocations",
                "instanceType": instance_type,
                "modelSourceConfig": {},
                "worker": {
                    "resources": {
                        "limits": {
                            "nvidia.com/gpu": 1,
                        },
                        "requests": {
                            "nvidia.com/gpu": 1,
                            "cpu": "30000m",
                            "memory": "100Gi"
                        }
                    },
                    "image": "763104351884.dkr.ecr.us-east-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.4.0-tgi2.3.1-gpu-py311-cu124-ubuntu22.04-v2.0",
                    "modelInvocationPort": {
                        "containerPort": 8080,
                        "name": "http"
                    },
                    "modelVolumeMount": {
                        "name": "model-weights",
                        "mountPath": "/opt/ml/model"
                    },
                    "environmentVariables": [
                        {"name": "HF_MODEL_ID", "value": "/opt/ml/model"},
                        {"name": "SAGEMAKER_PROGRAM", "value": "inference.py"},
                        {"name": "SAGEMAKER_SUBMIT_DIRECTORY", "value": "/opt/ml/model/code"},
                        {"name": "MODEL_CACHE_ROOT", "value": "/opt/ml/model"},
                        {"name": "SAGEMAKER_ENV", "value": "1"}
                    ]
                },
                "tlsConfig": {
                    "tlsCertificateOutputS3Uri": tls_certificate_s3_location,
                }
            },
        }

        if not isFsx:
            if s3_bucket is None:
                raise ValueError("s3_bucket is required when isFsx is False")
            model_config["spec"]["modelSourceConfig"] = {
                "modelSourceType": "s3",
                "s3Storage": {
                    "bucketName": s3_bucket,
                    "region": region,
                },
                "modelLocation": model_location
            }
        else:
            model_config["spec"]["modelSourceConfig"] = {
                "modelSourceType": "fsx",
                "fsxStorage": {
                    "fileSystemId": fsxFileSystemId,
                },
                "modelLocation": model_location
            }

        # Write to YAML file
        with open(output_file_path, 'w') as file:
            yaml.dump(model_config, file, default_flow_style=False)

        print(f"YAML file created successfully at: {output_file_path}")

Deploy your model from Amazon S3 or Amazon FSx

Stage the model to Amazon S3
  1. Create the Amazon S3 bucket to store your model artifacts. The S3 bucket needs to be in the same Region as your HyperPod cluster.

    s3_client = boto3.client('s3', region_name=region_name, config=boto3_config)

    base_name = "hyperpod-inference-s3-beta"

    def get_account_id():
        sts = boto3.client('sts')
        return sts.get_caller_identity()["Account"]

    account_id = get_account_id()
    s3_bucket = f"{base_name}-{account_id}"

    try:
        s3_client.create_bucket(
            Bucket=s3_bucket,
            CreateBucketConfiguration={"LocationConstraint": region_name}
        )
        print(f"Bucket '{s3_bucket}' is created successfully.")
    except botocore.exceptions.ClientError as e:
        error_code = e.response["Error"]["Code"]
        if error_code in ("BucketAlreadyExists", "BucketAlreadyOwnedByYou"):
            print(f"Bucket '{s3_bucket}' already exists. Skipping creation.")
        else:
            raise  # Re-raise unexpected exceptions
  2. Generate the deployment YAML to deploy the model from the S3 bucket. If you still need to upload your model artifacts to the bucket, see the upload sketch after this list.

    # Get the current time in a format suitable for the endpoint name
    current_time = datetime.now().strftime("%Y%m%d%H%M%S")

    model_id = "deepseek15b"        ## Can be a name of your choice
    deployment_name = f"{model_id}-{current_time}"
    model_location = "deepseek15b"  ## This is the folder in your S3 bucket where the model is located
    sagemaker_endpoint_name=f"{model_id}-{current_time}"
    output_file_path=f"inferenceendpointconfig-s3-model-{model_id}.yaml"

    generate_inferenceendpointconfig_yaml(
        deployment_name=deployment_name,
        model_id=model_id,
        model_location=model_location,
        namespace=cluster_namespace,
        instance_type=instance_type,
        output_file_path=output_file_path,
        sagemaker_endpoint_name=sagemaker_endpoint_name,
        s3_bucket=s3_bucket,
        region=region_name,
        tls_certificate_s3_location=tls_certificate_s3_location
    )

    os.environ["INFERENCE_ENDPOINT_CONFIG_YAML_FILE_PATH"]=output_file_path
    os.environ["MODEL_ID"]=model_id
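The generated YAML expects the model artifacts to already exist under the model_location prefix (deepseek15b in this example) in the bucket created in step 1. If you still need to stage them, the following sketch copies artifacts into place; the source location is a placeholder for wherever your fine-tuned model currently resides and is not part of the original walkthrough.

    # Sketch: upload model artifacts to the staging bucket (the source path is a placeholder)
    model_artifacts_source = "s3://<your-source-bucket>/<your-model-artifacts>/"
    !aws s3 cp {model_artifacts_source} s3://{s3_bucket}/deepseek15b/ --recursive

    # Confirm the artifacts are in place
    !aws s3 ls s3://{s3_bucket}/deepseek15b/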
Stage the model to Amazon FSx
  1. (Optional) Create an FSx volume. This step is optional because you might already have an FSx filesystem in the same VPC, security group, and subnet as the HyperPod cluster you want to use.

    # Initialize the subnet ID and security group for FSx.
    # These should be the same as those of the HyperPod cluster.
    SUBNET_ID = "<HyperPod-subnet-id>"
    SECURITY_GROUP_ID = "<HyperPod-security-group-id>"

    # Configuration
    CONFIG = {
        'SUBNET_ID': SUBNET_ID,
        'SECURITY_GROUP_ID': SECURITY_GROUP_ID,
        'STORAGE_CAPACITY': 1200,
        'DEPLOYMENT_TYPE': 'PERSISTENT_2',
        'THROUGHPUT': 250,
        'COMPRESSION_TYPE': 'LZ4',
        'LUSTRE_VERSION': '2.15'
    }

    JUMPSTART_MODEL_LOCATION_ON_S3 = "s3://jumpstart-cache-prod-us-east-2/deepseek-llm/deepseek-llm-r1-distill-qwen-1-5b/artifacts/inference-prepack/v2.0.0/"

    # Create FSx client
    fsx = boto3.client('fsx')

    # Create FSx for Lustre file system
    response = fsx.create_file_system(
        FileSystemType='LUSTRE',
        FileSystemTypeVersion=CONFIG['LUSTRE_VERSION'],
        StorageCapacity=CONFIG['STORAGE_CAPACITY'],
        SubnetIds=[CONFIG['SUBNET_ID']],
        SecurityGroupIds=[CONFIG['SECURITY_GROUP_ID']],
        LustreConfiguration={
            'DeploymentType': CONFIG['DEPLOYMENT_TYPE'],
            'PerUnitStorageThroughput': CONFIG['THROUGHPUT'],
            'DataCompressionType': CONFIG['COMPRESSION_TYPE'],
        }
    )

    # Get the file system ID
    file_system_id = response['FileSystem']['FileSystemId']
    print(f"Creating FSx filesystem with ID: {file_system_id}")
    print(f"In subnet: {CONFIG['SUBNET_ID']}")
    print(f"With security group: {CONFIG['SECURITY_GROUP_ID']}")

    # Wait for the file system to become available
    while True:
        response = fsx.describe_file_systems(FileSystemIds=[file_system_id])
        status = response['FileSystems'][0]['Lifecycle']
        if status == 'AVAILABLE':
            break
        print(f"Waiting for file system to become available... Current status: {status}")
        time.sleep(30)

    dns_name = response['FileSystems'][0]['DNSName']
    mount_name = response['FileSystems'][0]['LustreConfiguration']['MountName']

    # Print the file system details
    print("\nFile System Details:")
    print(f"File System ID: {file_system_id}")
    print(f"DNS Name: {dns_name}")
    print(f"Mount Name: {mount_name}")
  2. (Optional) Mount FSx and copy data from Amazon S3 to FSx. This step is only needed if your model data doesn't already exist on the FSx filesystem.

    Note

    If you are not using the FSx filesystem created in the previous step, replace the values of file_system_id, dns_name, and mount_name with those of your own FSx filesystem.

    ## NOTE: Replace the values of file_system_id, dns_name, and mount_name with those of your
    ## own FSx filesystem if you are not using the filesystem created in the previous step.
    # file_system_id = response['FileSystems'][0]['FileSystemId']
    # dns_name = response['FileSystems'][0]['DNSName']
    # mount_name = response['FileSystems'][0]['LustreConfiguration']['MountName']
    # print(f"File System ID: {file_system_id}")
    # print(f"DNS Name: {dns_name}")
    # print(f"Mount Name: {mount_name}")

    # FSx file system details
    mount_point = f'/mnt/fsx_{file_system_id}'  # Creates a path like /mnt/fsx_fs-0123456789abcdef0
    print(f"Creating mount point at: {mount_point}")

    # Create the mount directory if it doesn't exist
    !sudo mkdir -p {mount_point}

    # Mount the FSx for Lustre file system
    mount_command = f"sudo mount -t lustre {dns_name}@tcp:/{mount_name} {mount_point}"
    !{mount_command}

    # Verify the mount
    !df -h | grep fsx
    print(f"File system mounted at {mount_point}")

    !sudo chmod 777 {mount_point}

    # Copy the model artifacts from S3 to FSx
    !aws s3 cp $JUMPSTART_MODEL_LOCATION_ON_S3 $mount_point/deepseek-1-5b --recursive
    !ls $mount_point

    # Unmount and remove the local mount point
    !sudo umount {mount_point}
    !sudo rm -rf {mount_point}
  3. Generate the deployment YAML to deploy the model from the FSx filesystem.

    # Get the current time in a format suitable for the endpoint name
    current_time = datetime.now().strftime("%Y%m%d%H%M%S")

    model_id = "deepseek15b"          ## Can be a name of your choice
    deployment_name = f"{model_id}-{current_time}"
    model_location = "deepseek-1-5b"  ## This is the folder on your FSx filesystem where the model is located
    sagemaker_endpoint_name=f"{model_id}-{current_time}"
    output_file_path=f"inferenceendpointconfig-fsx-model-{model_id}.yaml"

    generate_inferenceendpointconfig_yaml(
        deployment_name=deployment_name,
        model_id=model_id,
        model_location=model_location,
        namespace=cluster_namespace,
        instance_type=instance_type,
        output_file_path=output_file_path,
        region=region_name,
        tls_certificate_s3_location=tls_certificate_s3_location,
        sagemaker_endpoint_name=sagemaker_endpoint_name,
        fsxFileSystemId=file_system_id,
        isFsx=True
    )

    os.environ["INFERENCE_ENDPOINT_CONFIG_YAML_FILE_PATH"]=output_file_path
    os.environ["MODEL_ID"]=model_id
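Whichever storage backend you chose, it can help to review the generated manifest before deploying it. This quick check isn't part of the original steps:

    # Inspect the generated InferenceEndpointConfig manifest before applying it
    !cat $INFERENCE_ENDPOINT_CONFIG_YAML_FILE_PATH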
Deploy the model to your cluster
  1. Get the Amazon EKS cluster name from the HyperPod cluster ARN for kubectl authentication.

    cluster_arn = !aws sagemaker describe-cluster --cluster-name $hyperpod_cluster_name --query "Orchestrator.Eks.ClusterArn" --region $region_name
    cluster_name = cluster_arn[0].strip('"').split('/')[-1]
    print(cluster_name)
  2. Configure kubectl to authenticate with the HyperPod EKS cluster using AWS credentials.

    !aws eks update-kubeconfig --name $cluster_name --region $region_name
  3. Deploy your InferenceEndpointConfig model.

    !kubectl apply -f $INFERENCE_ENDPOINT_CONFIG_YAML_FILE_PATH
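Right after applying the manifest, you can watch the resources the operator creates in your namespace; the worker pod typically needs a few minutes to pull the container image and load the model weights. A quick check, assuming kubectl is still pointed at the cluster:

    # Watch the InferenceEndpointConfig and worker pods come up in your namespace
    !kubectl get inferenceendpointconfig,pods -n $cluster_namespace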

Verify the status of your deployment

  1. Check if the model successfully deployed.

    !kubectl describe InferenceEndpointConfig $deployment_name -n $cluster_namespace

    The command returns output similar to the following:

    Name:                             deepseek15b-20250624043033
    Reason:                           NativeDeploymentObjectFound
    Status:
      Conditions:
        Last Transition Time:  2025-07-10T18:39:51Z
        Message:               Deployment, ALB Creation or SageMaker endpoint registration creation for model is in progress
        Reason:                InProgress
        Status:                True
        Type:                  DeploymentInProgress
        Last Transition Time:  2025-07-10T18:47:26Z
        Message:               Deployment and SageMaker endpoint registration for model have been created successfully
        Reason:                Success
        Status:                True
        Type:                  DeploymentComplete
  2. Check that the endpoint is successfully created.

    !kubectl describe SageMakerEndpointRegistration $sagemaker_endpoint_name -n $cluster_namespace

    The command returns output similar to the following:

    Name:         deepseek15b-20250624043033
    Namespace:    ns-team-a
    Kind:         SageMakerEndpointRegistration
    
    Status:
      Conditions:
        Last Transition Time:  2025-06-24T04:33:42Z
        Message:               Endpoint created.
        Status:                True
        Type:                  EndpointCreated
        State:                 CreationCompleted
  3. Test the deployed endpoint to verify it's working correctly. This step confirms that your model is successfully deployed and can process inference requests.

    import boto3

    # JSON payload with an "inputs" field, as expected by the TGI serving container
    prompt = '{"inputs": "How tall is Mt Everest?"}'

    runtime_client = boto3.client('sagemaker-runtime', region_name=region_name, config=boto3_config)

    response = runtime_client.invoke_endpoint(
        EndpointName=sagemaker_endpoint_name,
        ContentType="application/json",
        Body=prompt
    )

    print(response["Body"].read().decode())
    [{"generated_text":"As of the last update in July 2024, Mount Everest stands at a height of **8,850 meters** (29,029 feet) above sea level. The exact elevation can vary slightly due to changes caused by tectonic activity and the melting of ice sheets."}]

Manage your deployment

When you're finished testing your deployment, use the following commands to clean up your resources.

Note

Verify that you no longer need the deployed model or stored data before proceeding.

Clean up your resources
  1. Delete the inference deployment and associated Kubernetes resources. This stops the running model containers and removes the SageMaker endpoint.

    !kubectl delete inferenceendpointconfig.inference.sagemaker.aws.amazon.com/$deployment_name -n $cluster_namespace
  2. (Optional) Delete the FSx volume.

    try:
        # Delete the file system
        response = fsx.delete_file_system(
            FileSystemId=file_system_id
        )
        print(f"Deleting FSx filesystem: {file_system_id}")

        # Optional: Wait for deletion to complete
        while True:
            try:
                response = fsx.describe_file_systems(FileSystemIds=[file_system_id])
                status = response['FileSystems'][0]['Lifecycle']
                print(f"Current status: {status}")
                time.sleep(30)
            except fsx.exceptions.FileSystemNotFound:
                print("File system deleted successfully")
                break

    except Exception as e:
        print(f"Error deleting file system: {str(e)}")
  3. Verify the cleanup was done successfully.

    # Check that the Kubernetes resources are removed
    !kubectl get pods,svc,deployment,InferenceEndpointConfig,sagemakerendpointregistration -n $cluster_namespace

    # Verify that the SageMaker endpoint is deleted (should return an error or empty output)
    !aws sagemaker describe-endpoint --endpoint-name $sagemaker_endpoint_name --region $region_name
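If you created the staging S3 bucket earlier only for this example, you can remove it as well. This cleanup isn't part of the original steps; make sure the bucket contains nothing you want to keep before running it.

    # Empty and delete the staging bucket created earlier (destructive; double-check the bucket name first)
    !aws s3 rm s3://{s3_bucket} --recursive
    !aws s3 rb s3://{s3_bucket}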
Troubleshooting
  1. Check the Kubernetes deployment status.

    !kubectl describe deployment $deployment_name -n $cluster_namespace
  2. Check the InferenceEndpointConfig status to see the high-level deployment state and any configuration issues.

    !kubectl describe InferenceEndpointConfig $deployment_name -n $cluster_namespace
  3. Check status of all Kubernetes objects. Get a comprehensive view of all related Kubernetes resources in your namespace. This gives you a quick overview of what's running and what might be missing.

    !kubectl get pods,svc,deployment,InferenceEndpointConfig,sagemakerendpointregistration -n $cluster_namespace
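If any of these resources look unhealthy, the worker pod is usually where the underlying error surfaces (image pull failures, missing model files, insufficient GPU memory, and so on). The following sketch pulls pod details and logs; replace <pod-name> with a pod name from the previous command's output.

    # Describe a worker pod and review its logs to find the underlying error
    !kubectl describe pod <pod-name> -n $cluster_namespace
    !kubectl logs <pod-name> -n $cluster_namespace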