
Inference operator installation through the AWS CLI fails - Amazon SageMaker AI



Overview: When you install the inference operator through the AWS CLI, the add-on installation can fail because of missing dependencies. This section describes common CLI installation failure scenarios and their resolutions.

Inference add-on installation fails because of a missing CSI driver

Issue: Creation of the inference operator add-on fails because a required CSI driver dependency is not installed on the EKS cluster.

Symptoms and diagnosis:

Error message:

The following errors appear in the add-on creation logs or the inference operator logs:

S3 CSI driver not installed (missing CSIDriver s3.csi.aws.com). Please install the required CSI driver and see the troubleshooting guide for more information.

FSx CSI driver not installed (missing CSIDriver fsx.csi.aws.com). Please install the required CSI driver and see the troubleshooting guide for more information.

Diagnostic steps:

  1. Check whether the CSI drivers are installed:

    # Check for S3 CSI driver
    kubectl get csidriver s3.csi.aws.com
    kubectl get pods -n kube-system | grep mountpoint

    # Check for FSx CSI driver
    kubectl get csidriver fsx.csi.aws.com
    kubectl get pods -n kube-system | grep fsx

  2. Check the EKS add-on status:

    # List all add-ons
    aws eks list-addons --cluster-name $EKS_CLUSTER_NAME --region $REGION

    # Check specific CSI driver add-ons
    aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name aws-mountpoint-s3-csi-driver --region $REGION 2>/dev/null || echo "S3 CSI driver not installed"
    aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name aws-fsx-csi-driver --region $REGION 2>/dev/null || echo "FSx CSI driver not installed"

  3. Check the inference operator add-on status:

    aws eks describe-addon \
      --cluster-name $EKS_CLUSTER_NAME \
      --addon-name amazon-sagemaker-hyperpod-inference \
      --region $REGION \
      --query "addon.{Status:status,Health:health,Issues:issues}" \
      --output json
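The per-driver checks in step 1 can be collapsed into a small helper. The following is a sketch: `missing_csi_drivers` is a hypothetical function, not part of any AWS tooling, and it operates on the output of `kubectl get csidriver -o name` so the logic can be exercised without a live cluster.

```shell
# Sketch: report which of the required CSI drivers are absent, given the
# output of `kubectl get csidriver -o name`. Pure POSIX shell.
REQUIRED_CSI_DRIVERS="s3.csi.aws.com fsx.csi.aws.com"

missing_csi_drivers() {
  installed="$1"   # newline-separated CSIDriver names
  missing=""
  for driver in $REQUIRED_CSI_DRIVERS; do
    case "$installed" in
      *"$driver"*) ;;                    # driver present
      *) missing="$missing $driver" ;;   # driver absent
    esac
  done
  # Print the missing drivers, stripping the leading space
  printf '%s\n' "${missing# }"
}

# Against a live cluster you would run:
#   missing_csi_drivers "$(kubectl get csidriver -o name)"
```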

Resolution:

Step 1: Install the missing S3 CSI driver

  1. Create the IAM role for the S3 CSI driver (if not already created):

    # Set up service account role ARN (from installation steps)
    export S3_CSI_ROLE_ARN=$(aws iam get-role --role-name $S3_CSI_ROLE_NAME --query 'Role.Arn' --output text 2>/dev/null || echo "Role not found")
    echo "S3 CSI Role ARN: $S3_CSI_ROLE_ARN"

  2. Install the S3 CSI driver add-on:

    aws eks create-addon \
      --cluster-name $EKS_CLUSTER_NAME \
      --addon-name aws-mountpoint-s3-csi-driver \
      --addon-version v1.14.1-eksbuild.1 \
      --service-account-role-arn $S3_CSI_ROLE_ARN \
      --region $REGION

  3. Verify the S3 CSI driver installation:

    # Wait for the add-on to become active
    aws eks wait addon-active --cluster-name $EKS_CLUSTER_NAME --addon-name aws-mountpoint-s3-csi-driver --region $REGION

    # Verify the CSI driver is available
    kubectl get csidriver s3.csi.aws.com
    kubectl get pods -n kube-system | grep mountpoint

Step 2: Install the missing FSx CSI driver

  1. Create the IAM role for the FSx CSI driver (if not already created):

    # Set up service account role ARN (from installation steps)
    export FSX_CSI_ROLE_ARN=$(aws iam get-role --role-name $FSX_CSI_ROLE_NAME --query 'Role.Arn' --output text 2>/dev/null || echo "Role not found")
    echo "FSx CSI Role ARN: $FSX_CSI_ROLE_ARN"

  2. Install the FSx CSI driver add-on:

    aws eks create-addon \
      --cluster-name $EKS_CLUSTER_NAME \
      --addon-name aws-fsx-csi-driver \
      --addon-version v1.6.0-eksbuild.1 \
      --service-account-role-arn $FSX_CSI_ROLE_ARN \
      --region $REGION

    # Wait for the add-on to become active
    aws eks wait addon-active --cluster-name $EKS_CLUSTER_NAME --addon-name aws-fsx-csi-driver --region $REGION

    # Verify the FSx CSI driver is running
    kubectl get pods -n kube-system | grep fsx

Step 3: Verify all dependencies

After you install the missing dependencies, verify that they are functioning before you retry the inference operator installation:

# Check that all required add-ons are active
aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name aws-mountpoint-s3-csi-driver --region $REGION
aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name aws-fsx-csi-driver --region $REGION
aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name metrics-server --region $REGION
aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name cert-manager --region $REGION

# Verify all pods are running
kubectl get pods -n kube-system | grep -E "(mountpoint|fsx|metrics-server)"
kubectl get pods -n cert-manager

Inference custom resource definitions are missing during model deployment

Issue: Custom resource definitions (CRDs) are missing when you try to create a model deployment. This issue occurs if you previously installed and then deleted the inference add-on without cleaning up model deployments that carry finalizers.

Symptoms and diagnosis:

Root cause:

If you delete the inference add-on without first removing all model deployments, custom resources with finalizers remain in the cluster. These finalizers must complete before the CRDs can be deleted. The add-on deletion process does not wait for CRD deletion to complete, which leaves the CRDs in a Terminating state and blocks new installations.

Diagnose the issue

  1. Check whether the CRDs exist.

    kubectl get crd | grep inference.sagemaker.aws.amazon.com

  2. Check for stuck custom resources.

    # Check for JumpStartModel resources
    kubectl get jumpstartmodels -A

    # Check for InferenceEndpointConfig resources
    kubectl get inferenceendpointconfigs -A

  3. Check the finalizers on the stuck resources.

    # Example for a specific JumpStartModel
    kubectl get jumpstartmodels <model-name> -n <namespace> -o jsonpath='{.metadata.finalizers}'

    # Example for a specific InferenceEndpointConfig
    kubectl get inferenceendpointconfigs <config-name> -n <namespace> -o jsonpath='{.metadata.finalizers}'

Resolution:

Manually remove the finalizers from every model deployment that was not deleted when the inference add-on was removed. Complete the following steps for each stuck custom resource.
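When many resources are stuck, patching them one by one is tedious. The following loop is a sketch of a bulk cleanup; `patch_all` is a hypothetical helper that emits one `kubectl patch` command per `namespace name` pair, so its logic can be checked without a cluster (drop the leading `echo` to execute the patches directly).

```shell
# Sketch: strip finalizers from every stuck custom resource of a given
# kind. patch_all reads a newline-separated "namespace name" list and
# prints the kubectl patch command for each entry.
patch_all() {
  kind="$1"; list="$2"
  printf '%s\n' "$list" | while read -r ns name; do
    [ -n "$name" ] || continue   # skip blank lines
    echo kubectl patch "$kind" "$name" -n "$ns" \
      -p '{"metadata":{"finalizers":[]}}' --type=merge
  done
}

# Against a live cluster, feed it the stuck resources, for example:
#   patch_all jumpstartmodels "$(kubectl get jumpstartmodels -A \
#     -o jsonpath='{range .items[*]}{.metadata.namespace} {.metadata.name}{"\n"}{end}')"
```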

Remove finalizers from JumpStartModel resources

  1. List all JumpStartModel resources in all namespaces.

    kubectl get jumpstartmodels -A

  2. For each JumpStartModel resource, remove the finalizers by patching the resource to set metadata.finalizers to an empty array.

    kubectl patch jumpstartmodels <model-name> -n <namespace> -p '{"metadata":{"finalizers":[]}}' --type=merge

    The following example shows how to patch a resource named kv-l1-only.

    kubectl patch jumpstartmodels kv-l1-only -n default -p '{"metadata":{"finalizers":[]}}' --type=merge

  3. Verify that the model instances are deleted.

    kubectl get jumpstartmodels -A

    After all resources are cleaned up, you should see the following output.

    Error from server (NotFound): Unable to list "inference.sagemaker.aws.amazon.com/v1, Resource=jumpstartmodels": the server could not find the requested resource (get jumpstartmodels.inference.sagemaker.aws.amazon.com)

  4. Verify that the JumpStartModel CRD has been removed.

    kubectl get crd | grep jumpstartmodels.inference.sagemaker.aws.amazon.com

    If the CRD was successfully deleted, this command returns no output.

Remove finalizers from InferenceEndpointConfig resources

  1. List all InferenceEndpointConfig resources in all namespaces.

    kubectl get inferenceendpointconfigs -A

  2. For each InferenceEndpointConfig resource, remove the finalizers.

    kubectl patch inferenceendpointconfigs <config-name> -n <namespace> -p '{"metadata":{"finalizers":[]}}' --type=merge

    The following example shows how to patch a resource named my-inference-config.

    kubectl patch inferenceendpointconfigs my-inference-config -n default -p '{"metadata":{"finalizers":[]}}' --type=merge

  3. Confirm that the configuration instances are deleted.

    kubectl get inferenceendpointconfigs -A

    After all resources are cleaned up, you should see the following output.

    Error from server (NotFound): Unable to list "inference.sagemaker.aws.amazon.com/v1, Resource=inferenceendpointconfigs": the server could not find the requested resource (get inferenceendpointconfigs.inference.sagemaker.aws.amazon.com)

  4. Verify that the InferenceEndpointConfig CRD has been removed.

    kubectl get crd | grep inferenceendpointconfigs.inference.sagemaker.aws.amazon.com

    If the CRD was successfully deleted, this command returns no output.

Reinstall the inference add-on

After you clean up all stuck resources and confirm that the CRDs have been removed, reinstall the inference add-on. For more information, see Install the inference operator using EKS add-ons.

Verification:

  1. Confirm that the inference add-on installed successfully.

    aws eks describe-addon \
      --cluster-name $EKS_CLUSTER_NAME \
      --addon-name amazon-sagemaker-hyperpod-inference \
      --region $REGION \
      --query "addon.{Status:status,Health:health}" \
      --output table

    The status should be ACTIVE and the health should be healthy.

  2. Verify that the CRDs are installed correctly.

    kubectl get crd | grep inference.sagemaker.aws.amazon.com

    You should see the inference-related CRDs in the output.

  3. Test creating a new model deployment to confirm that the issue is resolved.

    # Create a test deployment using your preferred method
    kubectl apply -f <your-model-deployment.yaml>

Prevention

To prevent this issue, complete the following steps before you uninstall the inference add-on.

  1. Delete all model deployments.

    # Delete all JumpStartModel resources
    kubectl delete jumpstartmodels --all -A

    # Delete all InferenceEndpointConfig resources
    kubectl delete inferenceendpointconfigs --all -A

    # Wait for all resources to be fully deleted
    kubectl get jumpstartmodels -A
    kubectl get inferenceendpointconfigs -A

  2. Confirm that all custom resources are deleted.

  3. After you confirm that all resources are cleaned up, delete the inference add-on.
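The confirmation in step 2 can be automated with a small polling loop. This is a sketch: `wait_for_empty` is a hypothetical helper, not part of any AWS tooling, that retries until no resources of the given kind remain, timing out after about a minute.

```shell
# Sketch: poll until no custom resources of a kind remain. Treats a
# missing CRD (kubectl error) the same as an empty list, since either
# way there is nothing left to block add-on deletion.
wait_for_empty() {
  kind="$1"
  attempts=0
  while [ "$attempts" -lt 12 ]; do
    count=$(kubectl get "$kind" -A --no-headers 2>/dev/null | wc -l)
    [ "$count" -eq 0 ] && return 0
    attempts=$((attempts + 1))
    sleep 5
  done
  echo "timed out waiting for $kind resources to be deleted" >&2
  return 1
}

# Usage:
#   wait_for_empty jumpstartmodels && wait_for_empty inferenceendpointconfigs
```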

Inference add-on installation fails because cert-manager is missing

Issue: Creation of the inference operator add-on fails because the cert-manager EKS add-on is not installed, resulting in missing custom resource definitions (CRDs).

Symptoms and diagnosis:

Error message:

The following error appears in the add-on creation logs or the inference operator logs:

Missing required CRD: certificaterequests.cert-manager.io. The cert-manager add-on is not installed. Please install cert-manager and see the troubleshooting guide for more information.

Diagnostic steps:

  1. Check whether cert-manager is installed:

    # Check for cert-manager CRDs
    kubectl get crd | grep cert-manager
    kubectl get pods -n cert-manager

    # Check EKS add-on status
    aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name cert-manager --region $REGION 2>/dev/null || echo "Cert-manager not installed"

  2. Check the inference operator add-on status:

    aws eks describe-addon \
      --cluster-name $EKS_CLUSTER_NAME \
      --addon-name amazon-sagemaker-hyperpod-inference \
      --region $REGION \
      --query "addon.{Status:status,Health:health,Issues:issues}" \
      --output json

Resolution:

Step 1: Install the cert-manager add-on

  1. Install the cert-manager EKS add-on:

    aws eks create-addon \
      --cluster-name $EKS_CLUSTER_NAME \
      --addon-name cert-manager \
      --addon-version v1.18.2-eksbuild.2 \
      --region $REGION

  2. Verify the cert-manager installation:

    # Wait for the add-on to become active
    aws eks wait addon-active --cluster-name $EKS_CLUSTER_NAME --addon-name cert-manager --region $REGION

    # Verify cert-manager pods are running
    kubectl get pods -n cert-manager

    # Verify CRDs are installed
    kubectl get crd | grep cert-manager | wc -l
    # Expected: Should show multiple cert-manager CRDs

Step 2: Retry the inference operator installation

  1. After cert-manager is installed, retry the inference operator installation:

    # Delete the failed add-on if it exists
    aws eks delete-addon \
      --cluster-name $EKS_CLUSTER_NAME \
      --addon-name amazon-sagemaker-hyperpod-inference \
      --region $REGION 2>/dev/null || echo "Add-on not found, proceeding with installation"

    # Wait for deletion to complete
    sleep 30

    # Reinstall the inference operator add-on
    aws eks create-addon \
      --cluster-name $EKS_CLUSTER_NAME \
      --addon-name amazon-sagemaker-hyperpod-inference \
      --addon-version v1.0.0-eksbuild.1 \
      --configuration-values file://addon-config.json \
      --region $REGION
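The fixed `sleep 30` in the step above is only a heuristic; the AWS CLI ships a waiter, `aws eks wait addon-deleted`, that blocks until the add-on is actually gone. A sketch, assuming `$EKS_CLUSTER_NAME` and `$REGION` are set as in the steps above:

```shell
# Sketch: use the built-in `aws eks wait addon-deleted` waiter instead
# of a fixed sleep. Wrapped in a function so the wait can be reused for
# any add-on name.
wait_addon_deleted() {
  aws eks wait addon-deleted \
    --cluster-name "$EKS_CLUSTER_NAME" \
    --addon-name "$1" \
    --region "$REGION"
}

# In a live account, replace the `sleep 30` with:
#   wait_addon_deleted amazon-sagemaker-hyperpod-inference
```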
  2. Monitor the installation:

    # Check installation status
    aws eks describe-addon \
      --cluster-name $EKS_CLUSTER_NAME \
      --addon-name amazon-sagemaker-hyperpod-inference \
      --region $REGION \
      --query "addon.{Status:status,Health:health}" \
      --output table

    # Verify inference operator pods are running
    kubectl get pods -n hyperpod-inference-system

Inference add-on installation fails because the ALB Controller is missing

Issue: Creation of the inference operator add-on fails because the AWS Load Balancer Controller is not installed or is not configured correctly for the inference add-on.

Symptoms and diagnosis:

Error message:

The following error appears in the add-on creation logs or the inference operator logs:

ALB Controller not installed (missing aws-load-balancer-controller pods). Please install the Application Load Balancer Controller and see the troubleshooting guide for more information.

Diagnostic steps:

  1. Check whether the ALB Controller is installed:

    # Check for ALB Controller pods
    kubectl get pods -n kube-system | grep aws-load-balancer-controller
    kubectl get pods -n hyperpod-inference-system | grep aws-load-balancer-controller

    # Check ALB Controller service account
    kubectl get serviceaccount aws-load-balancer-controller -n kube-system 2>/dev/null || echo "ALB Controller service account not found"
    kubectl get serviceaccount aws-load-balancer-controller -n hyperpod-inference-system 2>/dev/null || echo "ALB Controller service account not found in inference namespace"

  2. Check the inference operator add-on configuration:

    aws eks describe-addon \
      --cluster-name $EKS_CLUSTER_NAME \
      --addon-name amazon-sagemaker-hyperpod-inference \
      --region $REGION \
      --query "addon.{Status:status,Health:health,ConfigurationValues:configurationValues}" \
      --output json

Resolution:

Choose one of the following options based on your setup:

Option 1: Let the inference add-on install the ALB Controller (recommended)

  • Ensure that the ALB role is created and configured correctly in the add-on configuration:

    # Verify ALB role exists
    export ALB_ROLE_ARN=$(aws iam get-role --role-name alb-role --query 'Role.Arn' --output text 2>/dev/null || echo "Role not found")
    echo "ALB Role ARN: $ALB_ROLE_ARN"

    # Update your addon-config.json to enable ALB
    cat > addon-config.json << EOF
    {
      "executionRoleArn": "$EXECUTION_ROLE_ARN",
      "tlsCertificateS3Bucket": "$BUCKET_NAME",
      "hyperpodClusterArn": "$HYPERPOD_CLUSTER_ARN",
      "alb": {
        "enabled": true,
        "serviceAccount": {
          "create": true,
          "roleArn": "$ALB_ROLE_ARN"
        }
      },
      "keda": {
        "auth": {
          "aws": {
            "irsa": {
              "roleArn": "$KEDA_ROLE_ARN"
            }
          }
        }
      }
    }
    EOF
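Note that if the `aws iam get-role` lookup fails, the literal string `Role not found` ends up in the generated configuration. A small guard, sketched below (`is_role_arn` is a hypothetical helper, not part of any AWS tooling), can catch this before the add-on is created:

```shell
# Sketch: verify that a value looks like an IAM role ARN before it is
# substituted into addon-config.json, so a failed lookup ("Role not
# found") does not silently produce a broken configuration.
is_role_arn() {
  case "$1" in
    arn:aws:iam::[0-9]*:role/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example guard before writing the config:
#   is_role_arn "$ALB_ROLE_ARN" || { echo "ALB role missing" >&2; exit 1; }
```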

Option 2: Use an existing ALB Controller installation

  • If you already have the ALB Controller installed, configure the add-on to use the existing installation:

    # Update your addon-config.json to disable ALB installation
    cat > addon-config.json << EOF
    {
      "executionRoleArn": "$EXECUTION_ROLE_ARN",
      "tlsCertificateS3Bucket": "$BUCKET_NAME",
      "hyperpodClusterArn": "$HYPERPOD_CLUSTER_ARN",
      "alb": {
        "enabled": false
      },
      "keda": {
        "auth": {
          "aws": {
            "irsa": {
              "roleArn": "$KEDA_ROLE_ARN"
            }
          }
        }
      }
    }
    EOF

Step 3: Retry the inference operator installation

  1. Reinstall the inference operator add-on with the updated configuration:

    # Delete the failed add-on if it exists
    aws eks delete-addon \
      --cluster-name $EKS_CLUSTER_NAME \
      --addon-name amazon-sagemaker-hyperpod-inference \
      --region $REGION 2>/dev/null || echo "Add-on not found, proceeding with installation"

    # Wait for deletion to complete
    sleep 30

    # Reinstall with updated configuration
    aws eks create-addon \
      --cluster-name $EKS_CLUSTER_NAME \
      --addon-name amazon-sagemaker-hyperpod-inference \
      --addon-version v1.0.0-eksbuild.1 \
      --configuration-values file://addon-config.json \
      --region $REGION

  2. Verify that the ALB Controller is working:

    # Check ALB Controller pods
    kubectl get pods -n hyperpod-inference-system | grep aws-load-balancer-controller
    kubectl get pods -n kube-system | grep aws-load-balancer-controller

    # Check service account annotations
    kubectl describe serviceaccount aws-load-balancer-controller -n hyperpod-inference-system 2>/dev/null
    kubectl describe serviceaccount aws-load-balancer-controller -n kube-system 2>/dev/null

Inference add-on installation fails because the KEDA operator is missing

Issue: Creation of the inference operator add-on fails because the KEDA (Kubernetes Event-driven Autoscaling) operator is not installed or is not configured correctly for the inference add-on.

Symptoms and diagnosis:

Error message:

The following error appears in the add-on creation logs or the inference operator logs:

KEDA operator not installed (missing keda-operator pods). KEDA can be installed separately in any namespace or via the Inference addon.

Diagnostic steps:

  1. Check whether the KEDA operator is installed:

    # Check for KEDA operator pods in common namespaces
    kubectl get pods -n keda-system | grep keda-operator 2>/dev/null || echo "KEDA not found in keda-system namespace"
    kubectl get pods -n kube-system | grep keda-operator 2>/dev/null || echo "KEDA not found in kube-system namespace"
    kubectl get pods -n hyperpod-inference-system | grep keda-operator 2>/dev/null || echo "KEDA not found in inference namespace"

    # Check for KEDA CRDs
    kubectl get crd | grep keda 2>/dev/null || echo "KEDA CRDs not found"

    # Check KEDA service account
    kubectl get serviceaccount keda-operator -A 2>/dev/null || echo "KEDA service account not found"

  2. Check the inference operator add-on configuration:

    aws eks describe-addon \
      --cluster-name $EKS_CLUSTER_NAME \
      --addon-name amazon-sagemaker-hyperpod-inference \
      --region $REGION \
      --query "addon.{Status:status,Health:health,ConfigurationValues:configurationValues}" \
      --output json

Resolution:

Choose one of the following options based on your setup:

Option 1: Let the inference add-on install KEDA (recommended)

  • Ensure that the KEDA role is created and configured correctly in the add-on configuration:

    # Verify KEDA role exists
    export KEDA_ROLE_ARN=$(aws iam get-role --role-name keda-operator-role --query 'Role.Arn' --output text 2>/dev/null || echo "Role not found")
    echo "KEDA Role ARN: $KEDA_ROLE_ARN"

    # Update your addon-config.json to enable KEDA
    cat > addon-config.json << EOF
    {
      "executionRoleArn": "$EXECUTION_ROLE_ARN",
      "tlsCertificateS3Bucket": "$BUCKET_NAME",
      "hyperpodClusterArn": "$HYPERPOD_CLUSTER_ARN",
      "alb": {
        "serviceAccount": {
          "create": true,
          "roleArn": "$ALB_ROLE_ARN"
        }
      },
      "keda": {
        "enabled": true,
        "auth": {
          "aws": {
            "irsa": {
              "roleArn": "$KEDA_ROLE_ARN"
            }
          }
        }
      }
    }
    EOF

Option 2: Use an existing KEDA installation

  • If you already have KEDA installed, configure the add-on to use the existing installation:

    # Update your addon-config.json to disable KEDA installation
    cat > addon-config.json << EOF
    {
      "executionRoleArn": "$EXECUTION_ROLE_ARN",
      "tlsCertificateS3Bucket": "$BUCKET_NAME",
      "hyperpodClusterArn": "$HYPERPOD_CLUSTER_ARN",
      "alb": {
        "serviceAccount": {
          "create": true,
          "roleArn": "$ALB_ROLE_ARN"
        }
      },
      "keda": {
        "enabled": false
      }
    }
    EOF
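Because unexpanded or empty variables in the heredocs above can silently produce malformed JSON, it can help to validate `addon-config.json` before passing it to `create-addon`. A sketch, using python3's standard-library JSON parser (assumed to be available on the workstation):

```shell
# Sketch: fail fast if addon-config.json is not well-formed JSON, for
# example because a variable in the heredoc expanded to nothing.
validate_addon_config() {
  if python3 -m json.tool "$1" > /dev/null 2>&1; then
    echo "$1: valid JSON"
  else
    echo "$1: invalid JSON" >&2
    return 1
  fi
}

# Usage:
#   validate_addon_config addon-config.json || exit 1
```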

Step 3: Retry the inference operator installation

  1. Reinstall the inference operator add-on with the updated configuration:

    # Delete the failed add-on if it exists
    aws eks delete-addon \
      --cluster-name $EKS_CLUSTER_NAME \
      --addon-name amazon-sagemaker-hyperpod-inference \
      --region $REGION 2>/dev/null || echo "Add-on not found, proceeding with installation"

    # Wait for deletion to complete
    sleep 30

    # Reinstall with updated configuration
    aws eks create-addon \
      --cluster-name $EKS_CLUSTER_NAME \
      --addon-name amazon-sagemaker-hyperpod-inference \
      --addon-version v1.0.0-eksbuild.1 \
      --configuration-values file://addon-config.json \
      --region $REGION

  2. Verify that KEDA is working:

    # Check KEDA pods
    kubectl get pods -n hyperpod-inference-system | grep keda
    kubectl get pods -n kube-system | grep keda
    kubectl get pods -n keda-system | grep keda 2>/dev/null

    # Check KEDA CRDs
    kubectl get crd | grep scaledobjects
    kubectl get crd | grep scaledjobs

    # Check KEDA service account annotations
    kubectl describe serviceaccount keda-operator -n hyperpod-inference-system 2>/dev/null
    kubectl describe serviceaccount keda-operator -n kube-system 2>/dev/null
    kubectl describe serviceaccount keda-operator -n keda-system 2>/dev/null