Deploy custom fine-tuned models from Amazon S3 and Amazon FSx using kubectl
The following steps show you how to deploy models stored on Amazon S3 or Amazon FSx to an Amazon SageMaker HyperPod cluster using kubectl.
The following instructions contain code cells and commands designed to run in a Jupyter notebook environment, such as Amazon SageMaker Studio or SageMaker Notebook Instances. Each code block represents a notebook cell that should be executed sequentially. The interactive elements, including model discovery tables and status monitoring commands, are optimized for the notebook interface and may not function properly in other environments. Ensure you have access to a notebook environment with the necessary AWS permissions before proceeding.
Prerequisites
Verify that you’ve set up inference capabilities on your Amazon SageMaker HyperPod clusters. For more information, see Setting up your HyperPod clusters for model deployment.
Setup and configuration
Replace all placeholder values with your actual resource identifiers.
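The cells in this section rely on a few shared imports and objects (yaml for writing the deployment file, a boto3 Config, a region name, and an FSx client for the optional cleanup step) that aren't defined in the snippets below. A minimal setup cell, assuming default AWS credentials, might look like the following; replace the region placeholder with your own.
import time

import boto3
import yaml
from botocore.config import Config

# Region that hosts your HyperPod cluster, for example "us-east-2"
region_name = "<aws_region>"
boto3_config = Config(region_name=region_name)

# The FSx client is only needed if your model weights are stored on Amazon FSx
fsx = boto3.client("fsx", region_name=region_name)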
-
Initialize your cluster name. This identifies the HyperPod cluster where your model will be deployed.
# Specify your hyperpod cluster name here
hyperpod_cluster_name="<Hyperpod_cluster_name>"

# NOTE: For the sample deployment, we use ml.g5.8xlarge, which has sufficient memory and GPU for the DeepSeek-R1 1.5B model
instance_type="ml.g5.8xlarge"
-
Initialize your cluster namespace. Your cluster admin should have already created a hyperpod-inference service account in your namespace.
cluster_namespace="<namespace>"
-
Define the helper method to create YAML files for deployment
The following helper function generates the Kubernetes YAML configuration files needed to deploy your model. This function creates different YAML structures depending on whether your model is stored on Amazon S3 or Amazon FSx, handling the storage-specific configurations automatically. You'll use this function in the next sections to generate the deployment files for your chosen storage backend.
def generate_inferenceendpointconfig_yaml(deployment_name, model_id, namespace, instance_type,
                                          output_file_path, region, tls_certificate_s3_location,
                                          model_location, sagemaker_endpoint_name,
                                          fsxFileSystemId="", isFsx=False, s3_bucket=None):
    """
    Generate an InferenceEndpointConfig YAML file for S3 or FSx storage with the provided parameters.

    Args:
        deployment_name (str): The deployment name
        model_id (str): The model ID
        namespace (str): The namespace
        instance_type (str): The instance type
        output_file_path (str): Path where the YAML file will be saved
        region (str): Region where bucket exists
        tls_certificate_s3_location (str): S3 location for TLS certificate
        model_location (str): Location of the model
        sagemaker_endpoint_name (str): Name of the SageMaker endpoint
        fsxFileSystemId (str): FSx filesystem ID (optional)
        isFsx (bool): Whether to use FSx storage (optional)
        s3_bucket (str): S3 bucket where model exists (optional, only needed when isFsx is False)
    """
    # Create the YAML structure
    model_config = {
        "apiVersion": "inference.sagemaker.aws.amazon.com/v1alpha1",
        "kind": "InferenceEndpointConfig",
        "metadata": {
            "name": deployment_name,
            "namespace": namespace
        },
        "spec": {
            "modelName": model_id,
            "endpointName": sagemaker_endpoint_name,
            "invocationEndpoint": "invocations",
            "instanceType": instance_type,
            "modelSourceConfig": {},
            "worker": {
                "resources": {
                    "limits": {
                        "nvidia.com/gpu": 1,
                    },
                    "requests": {
                        "nvidia.com/gpu": 1,
                        "cpu": "30000m",
                        "memory": "100Gi"
                    }
                },
                "image": "763104351884.dkr.ecr.us-east-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.4.0-tgi2.3.1-gpu-py311-cu124-ubuntu22.04-v2.0",
                "modelInvocationPort": {
                    "containerPort": 8080,
                    "name": "http"
                },
                "modelVolumeMount": {
                    "name": "model-weights",
                    "mountPath": "/opt/ml/model"
                },
                "environmentVariables": [
                    {
                        "name": "HF_MODEL_ID",
                        "value": "/opt/ml/model"
                    },
                    {
                        "name": "SAGEMAKER_PROGRAM",
                        "value": "inference.py",
                    },
                    {
                        "name": "SAGEMAKER_SUBMIT_DIRECTORY",
                        "value": "/opt/ml/model/code",
                    },
                    {
                        "name": "MODEL_CACHE_ROOT",
                        "value": "/opt/ml/model"
                    },
                    {
                        "name": "SAGEMAKER_ENV",
                        "value": "1",
                    }
                ]
            },
            "tlsConfig": {
                "tlsCertificateOutputS3Uri": tls_certificate_s3_location,
            }
        },
    }

    if not isFsx:
        if s3_bucket is None:
            raise ValueError("s3_bucket is required when isFsx is False")
        model_config["spec"]["modelSourceConfig"] = {
            "modelSourceType": "s3",
            "s3Storage": {
                "bucketName": s3_bucket,
                "region": region,
            },
            "modelLocation": model_location
        }
    else:
        model_config["spec"]["modelSourceConfig"] = {
            "modelSourceType": "fsx",
            "fsxStorage": {
                "fileSystemId": fsxFileSystemId,
            },
            "modelLocation": model_location
        }

    # Write to YAML file
    with open(output_file_path, 'w') as file:
        yaml.dump(model_config, file, default_flow_style=False)

    print(f"YAML file created successfully at: {output_file_path}")
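This section's cells don't show the helper being invoked, so the following sketch illustrates one way to call it for a model stored on Amazon S3 before you apply the resulting file in the next section. Every value here (model ID, bucket, prefix, TLS output location, YAML path) is a placeholder to replace with your own; for an FSx-backed model you would pass isFsx=True and fsxFileSystemId=<file_system_id> instead of s3_bucket.
import datetime

# Illustrative placeholder values -- replace with your own resources
model_id = "deepseek15b"
s3_bucket = "<bucket_containing_model_weights>"
model_location = "<prefix/of/model/weights>"
tls_certificate_s3_location = "s3://<bucket_for_tls_certificates>/certs/"

# Unique names so repeated runs don't collide
timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
deployment_name = f"{model_id}-{timestamp}"
sagemaker_endpoint_name = deployment_name

INFERENCE_ENDPOINT_CONFIG_YAML_FILE_PATH = "./inference_endpoint_config.yaml"

generate_inferenceendpointconfig_yaml(
    deployment_name=deployment_name,
    model_id=model_id,
    namespace=cluster_namespace,
    instance_type=instance_type,
    output_file_path=INFERENCE_ENDPOINT_CONFIG_YAML_FILE_PATH,
    region=region_name,
    tls_certificate_s3_location=tls_certificate_s3_location,
    model_location=model_location,
    sagemaker_endpoint_name=sagemaker_endpoint_name,
    s3_bucket=s3_bucket,
)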
Deploy your model from Amazon S3 or Amazon FSx
Deploy the model to your cluster
-
Get the Amazon EKS cluster name from the HyperPod cluster ARN for kubectl authentication.
cluster_arn = !aws sagemaker describe-cluster --cluster-name $hyperpod_cluster_name --query "Orchestrator.Eks.ClusterArn" --region $region_name
cluster_name = cluster_arn[0].strip('"').split('/')[-1]
print(cluster_name)
-
Configure kubectl to authenticate with the HyperPod EKS cluster using AWS credentials.
!aws eks update-kubeconfig --name $cluster_name --region $region_name
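As an optional sanity check (not part of the original steps), confirm that kubectl can reach the cluster and that the hyperpod-inference service account mentioned earlier exists in your namespace:
!kubectl get nodes
!kubectl get serviceaccount hyperpod-inference -n $cluster_namespace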
-
Deploy your InferenceEndpointConfig model.
!kubectl apply -f $INFERENCE_ENDPOINT_CONFIG_YAML_FILE_PATH
Verify the status of your deployment
-
Check if the model successfully deployed.
!kubectl describe InferenceEndpointConfig $deployment_name -n $cluster_namespace
The command returns output similar to the following:
Name:    deepseek15b-20250624043033
Reason:  NativeDeploymentObjectFound
Status:
  Conditions:
    Last Transition Time:  2025-07-10T18:39:51Z
    Message:               Deployment, ALB Creation or SageMaker endpoint registration creation for model is in progress
    Reason:                InProgress
    Status:                True
    Type:                  DeploymentInProgress
    Last Transition Time:  2025-07-10T18:47:26Z
    Message:               Deployment and SageMaker endpoint registration for model have been created successfully
    Reason:                Success
    Status:                True
    Type:                  DeploymentComplete
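While the status still shows DeploymentInProgress, one optional way to watch progress is to list the pods and deployments in your namespace; exact resource names can vary, so this simply shows everything the operator has created there:
!kubectl get pods,deployments -n $cluster_namespace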
-
Check that the endpoint is successfully created.
!kubectl describe SageMakerEndpointRegistration $sagemaker_endpoint_name -n $cluster_namespace
The command returns output similar to the following:
Name:       deepseek15b-20250624043033
Namespace:  ns-team-a
Kind:       SageMakerEndpointRegistration
Status:
  Conditions:
    Last Transition Time:  2025-06-24T04:33:42Z
    Message:               Endpoint created.
    Status:                True
    Type:                  EndpointCreated
  State:                   CreationCompleted
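Optionally, you can also confirm the registration from the SageMaker side with the AWS CLI; the endpoint should report InService once creation completes:
!aws sagemaker describe-endpoint --endpoint-name $sagemaker_endpoint_name --region $region_name --query "EndpointStatus"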
-
Test the deployed endpoint to verify it's working correctly. This step confirms that your model is successfully deployed and can process inference requests.
import boto3

prompt = "{\"inputs\": \"How tall is Mt Everest?\"}"

runtime_client = boto3.client('sagemaker-runtime',
                              region_name=region_name,
                              config=boto3_config)

response = runtime_client.invoke_endpoint(
    EndpointName=sagemaker_endpoint_name,
    ContentType="application/json",
    Body=prompt
)
print(response["Body"].read().decode())
[{"generated_text":"As of the last update in July 2024, Mount Everest stands at a height of **8,850 meters** (29,029 feet) above sea level. The exact elevation can vary slightly due to changes caused by tectonic activity and the melting of ice sheets."}]
Manage your deployment
When you're finished testing your deployment, use the following commands to clean up your resources.
Note
Verify that you no longer need the deployed model or stored data before proceeding.
Clean up your resources
-
Delete the inference deployment and associated Kubernetes resources. This stops the running model containers and removes the SageMaker endpoint.
!kubectl delete inferenceendpointconfig.inference.sagemaker.aws.amazon.com/$deployment_name -n $cluster_namespace
-
(Optional) Delete the FSx volume.
try:
    # Delete the file system
    response = fsx.delete_file_system(
        FileSystemId=file_system_id
    )
    print(f"Deleting FSx filesystem: {file_system_id}")

    # Optional: Wait for deletion to complete
    while True:
        try:
            response = fsx.describe_file_systems(FileSystemIds=[file_system_id])
            status = response['FileSystems'][0]['Lifecycle']
            print(f"Current status: {status}")
            time.sleep(30)
        except fsx.exceptions.FileSystemNotFound:
            print("File system deleted successfully")
            break

except Exception as e:
    print(f"Error deleting file system: {str(e)}")
-
Verify that the cleanup completed successfully.
# Check that Kubernetes resources are removed
!kubectl get pods,svc,deployment,InferenceEndpointConfig,sagemakerendpointregistration -n $cluster_namespace

# Verify the SageMaker endpoint is deleted (should return an error or an empty result)
!aws sagemaker describe-endpoint --endpoint-name $sagemaker_endpoint_name --region $region_name
Troubleshooting
-
Check the Kubernetes deployment status.
!kubectl describe deployment $deployment_name -n $cluster_namespace
-
Check the InferenceEndpointConfig status to see the high-level deployment state and any configuration issues.
!kubectl describe InferenceEndpointConfig $deployment_name -n $cluster_namespace
-
Check the status of all Kubernetes objects. This gives you a comprehensive view of the related resources in your namespace and a quick overview of what's running and what might be missing.
!kubectl get pods,svc,deployment,InferenceEndpointConfig,sagemakerendpointregistration -n $cluster_namespace
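-
Inspect the model container logs. If the Kubernetes objects exist but the endpoint never becomes ready, the container logs usually show the underlying error (for example, missing model files or insufficient GPU memory). This assumes the Deployment backing your model shares your deployment name; adjust the name if kubectl get deployments shows something different.
!kubectl logs deployment/$deployment_name -n $cluster_namespace --all-containers --tail=100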