本文為英文版的機器翻譯版本,如內容有任何歧義或不一致之處,概以英文版為準。
使用 p6e-gb200 執行個體支援 NVIDIA-Imex
本教學課程說明如何在 P6e-GB200 AWS ParallelCluster 上開始使用 ,以利用最高 GPU 效能進行 AI 訓練和推論。p6e-gb200.36xlarge 執行個體只能透過 P6e-GB200 UltraServers 使用u-p6e-gb200x72是 Ultraserver Size,而 p6e-gb200.36xlarge是 InstanceType,其會形成超伺服器。購買 Ultraserver 時u-p6e-gb200x72,可透過具有 18 個p6e-gb200.36xlarge執行個體的適用於 ML 的 EC2 容量區塊
AWS ParallelCluster 3.14.0 版:
-
提供此執行個體類型所需的完整 NVIDIA 軟體堆疊 (驅動程式、CUDA、EFA、NVIDIA-IMEX)
-
為 P6e-GB200 Ultraserver 建立 nvidia-imex 組態
-
啟用和啟動 P6e-GB200 Ultraserver
nvidia-imex的服務 -
設定 Slurm Block 拓撲外掛程式,讓每個 P6e-GB200 Ultraserver (EC2 容量區塊) 都是大小正確的 Slurm Block (請參閱 3.14.0 版版本備註和文件歷史記錄的項目)。
不過,透過 NVLink 的 GPU-to-GPU 通訊需要額外的組態,特別是包含 ParallelCluster 不會自動產生之 IMEX 網域中運算節點 IP 地址nodes_config.cfgnodes_config.cfg
注意
從 Amazon Linux 2023、Ubuntu 22.04 和 Ubuntu 24.04 的 v3.14.0 開始,支援 P6e-GB200。 AWS ParallelCluster 如需詳細的軟體版本和更新的支援分佈清單,請參閱AWS ParallelCluster 變更日誌
建立 Prolog 指令碼以管理 NVIDIA-Imex
限制:
-
此 prolog 指令碼將在提交專屬任務時執行。這是為了確保 IMEX 重新啟動不會中斷屬於 IMEX 網域之 p6e-Gb200 節點上的任何執行中任務。
以下是您應該在 Slurm 中設定為 Prolog 的91_nvidia_imex_prolog.sh指令碼。它用於自動更新運算節點上的 nvidia-imex 組態。指令碼名稱的字首為 91,以遵循 SchedMD 的命名慣例
注意
如果相同節點上同時啟動多個任務,則不會執行此指令碼,因此我們建議在提交時使用 --exclusive旗標。
#!/usr/bin/env bash # This prolog script configures the NVIDIA IMEX on compute nodes involved in the job execution. # # In particular: # - Checks whether the job is executed exclusively. # If not, it exits immediately because it requires jobs to be executed exclusively. # - Checks if it is running on a p6e-gb200 instance type. # If not, it exits immediately because IMEX must be configured only on that instance type. # - Checks if the IMEX service is enabled. # If not, it exits immediately because IMEX must be enabled to get configured. # - Creates the IMEX default channel. # For more information about IMEX channels, see https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/imexchannels.html # - Writes the private IP addresses of compute nodes into /etc/nvidia-imex/nodes_config.cfg. # - Restarts the IMEX system service. # # REQUIREMENTS: # - This prolog assumes to be run only with exclusive jobs. LOG_FILE_PATH="/var/log/parallelcluster/nvidia-imex-prolog.log" SCONTROL_CMD="/opt/slurm/bin/scontrol" IMEX_START_TIMEOUT=60 IMEX_STOP_TIMEOUT=15 ALLOWED_INSTANCE_TYPES="^(p6e-gb200)" IMEX_SERVICE="nvidia-imex" IMEX_NODES_CONFIG="/etc/nvidia-imex/nodes_config.cfg" function info() { echo "$(date "+%Y-%m-%dT%H:%M:%S.%3N") [INFO] [PID:$$] [JOB:${SLURM_JOB_ID}] $1" } function warn() { echo "$(date "+%Y-%m-%dT%H:%M:%S.%3N") [WARN] [PID:$$] [JOB:${SLURM_JOB_ID}] $1" } function error() { echo "$(date "+%Y-%m-%dT%H:%M:%S.%3N") [ERROR] [PID:$$] [JOB:${SLURM_JOB_ID}] $1" } function error_exit() { error "$1" && exit 1 } function prolog_end() { info "PROLOG End JobId=${SLURM_JOB_ID}: $0" info "----------------" exit 0 } function get_instance_type() { local token=$(curl -X PUT -s "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600") curl -s -H "X-aws-ec2-metadata-token: ${token}" http://169.254.169.254/latest/meta-data/instance-type } function return_if_unsupported_instance_type() { local instance_type=$(get_instance_type) if [[ ! ${instance_type} =~ ${ALLOWED_INSTANCE_TYPES} ]]; then info "Skipping IMEX configuration because instance type ${instance_type} does not support it" prolog_end fi } function return_if_imex_disabled() { if ! systemctl is-enabled "${IMEX_SERVICE}" &>/dev/null; then warn "Skipping IMEX configuration because system service ${IMEX_SERVICE} is not enabled" prolog_end fi } function return_if_job_is_not_exclusive() { if [[ "${SLURM_JOB_OVERSUBSCRIBE}" =~ ^(NO|TOPO)$ ]]; then info "Job is exclusive, proceeding with IMEX configuration" else info "Skipping IMEX configuration because the job is not exclusive" prolog_end fi } function get_ips_from_node_names() { local _nodes=$1 ${SCONTROL_CMD} -ao show node "${_nodes}" | sed 's/^.* NodeAddr=\([^ ]*\).*/\1/' } function get_compute_resource_name() { local _queue_name_prefix=$1 local _slurmd_node_name=$2 echo "${_slurmd_node_name}" | sed -E "s/${_queue_name_prefix}(.+)-[0-9]+$/\1/" } function reload_imex() { info "Stopping IMEX" timeout ${IMEX_STOP_TIMEOUT} systemctl stop ${IMEX_SERVICE} pkill -9 ${IMEX_SERVICE} info "Restarting IMEX" timeout ${IMEX_START_TIMEOUT} systemctl start ${IMEX_SERVICE} } function create_default_imex_channel() { info "Creating IMEX default channel" MAJOR_NUMBER=$(cat /proc/devices | grep nvidia-caps-imex-channels | cut -d' ' -f1) if [ ! -d "/dev/nvidia-caps-imex-channels" ]; then sudo mkdir /dev/nvidia-caps-imex-channels fi # Then check and create device node if [ ! -e "/dev/nvidia-caps-imex-channels/channel0" ]; then sudo mknod /dev/nvidia-caps-imex-channels/channel0 c $MAJOR_NUMBER 0 info "IMEX default channel created" else info "IMEX default channel already exists" fi } { info "PROLOG Start JobId=${SLURM_JOB_ID}: $0" return_if_job_is_not_exclusive return_if_unsupported_instance_type return_if_imex_disabled create_default_imex_channel IPS_FROM_CR=$(get_ips_from_node_names "${SLURM_NODELIST}") info "Node Names: ${SLURM_NODELIST}" info "Node IPs: ${IPS_FROM_CR}" info "IMEX Nodes Config: ${IMEX_NODES_CONFIG}" info "Updating IMEX nodes config ${IMEX_NODES_CONFIG}" echo "${IPS_FROM_CR}" > "${IMEX_NODES_CONFIG}" reload_imex prolog_end } 2>&1 | tee -a "${LOG_FILE_PATH}" | logger -t "91_nvidia_imex_prolog"
建立 HeadNode OnNodeStart 自訂動作指令碼
建立install_custom_action.sh自訂動作,其將在運算節點存取的共用目錄中下載上述 prolog 指令碼/opt/slurm/etc/scripts/prolog.d/,並設定要執行的適當許可。
#!/bin/bash set -e echo "Executing $0" PROLOG_NVIDIA_IMEX=/opt/slurm/etc/scripts/prolog.d/91_nvidia_imex_prolog.sh aws s3 cp "s3://<Bucket>/91_nvidia_imex_prolog.sh" "${PROLOG_NVIDIA_IMEX}" chmod 0755 "${PROLOG_NVIDIA_IMEX}"
建立叢集
建立包含 P6e-GB200 執行個體的叢集。您可以在下面找到包含 Ultraserver 類型 的 SlurmQueues 的範例組態u-p6e-gb200x72。
P6e-GB200 目前僅適用於 Local Zones。有些 Local Zones 不支援 NAT Gateway,因此請遵循 Local Zones 的連線選項,因為 ParallelCluster 設定受限環境的安全群組需要連線到 AWS 服務。請遵循 使用容量區塊 (CB) 啟動執行個體(AWS ParallelClusterLaunch),因為 Ultraserver 僅供容量區塊使用。
HeadNode: CustomActions: OnNodeStart: Script: s3://<s3-bucket-name>/install_custom_action.sh S3Access: - BucketName: <s3-bucket-name> InstanceType: <HeadNode-instance-type> Networking: SubnetId: <subnet-abcd78901234567890> Ssh: KeyName: <Key-name> Image: Os: ubuntu2404 Scheduling: Scheduler: slurm SlurmSettings: CustomSlurmSettings: - PrologFlags: "Alloc,NoHold" - MessageTimeout: 240 SlurmQueues: - CapacityReservationTarget: CapacityReservationId: <cr-123456789012345678> CapacityType: CAPACITY_BLOCK ComputeResources: ### u-p6e-gb200x72 - DisableSimultaneousMultithreading: true Efa: Enabled: true InstanceType: p6e-gb200.36xlarge MaxCount: 18 MinCount: 18 Name: cr1 Name: q1 Networking: SubnetIds: - <subnet-1234567890123456>
驗證 IMEX 設定
當您提交 Slurm 任務時,將會執行 91_nvidia_imex_prolog.sh prolog。以下是檢查 NVIDIA-imex 網域狀態的範例任務。
#!/bin/bash #SBATCH --job-name=nvidia-imex-status-job #SBATCH --ntasks-per-node=1 #SBATCH --output=slurm-%j.out #SBATCH --error=slurm-%j.err QUEUE_NAME="q1" COMPUTE_RES_NAME="cr1" IMEX_CONFIG_FILE="/opt/parallelcluster/shared/nvidia-imex/config_${QUEUE_NAME}_${COMPUTE_RES_NAME}.cfg" srun bash -c "/usr/bin/nvidia-imex-ctl -N -c ${IMEX_CONFIG_FILE} > result_\${SLURM_JOB_ID}_\$(hostname).out 2> result_\${SLURM_JOB_ID}_\$(hostname).err"
檢查任務的輸出:
Connectivity Table Legend: I - Invalid - Node wasn't reachable, no connection status available N - Never Connected R - Recovering - Connection was lost, but clean up has not yet been triggered. D - Disconnected - Connection was lost, and clean up has been triggreed. A - Authenticating - If GSSAPI enabled, client has initiated mutual authentication. !V! - Version mismatch, communication disabled. !M! - Node map mismatch, communication disabled. C - Connected - Ready for operation 5/12/2025 06:08:10.580 Nodes: Node #0 - 172.31.48.81 - READY - Version: 570.172 Node #1 - 172.31.48.98 - READY - Version: 570.172 Node #2 - 172.31.48.221 - READY - Version: 570.172 Node #3 - 172.31.49.228 - READY - Version: 570.172 Node #4 - 172.31.50.39 - READY - Version: 570.172 Node #5 - 172.31.50.44 - READY - Version: 570.172 Node #6 - 172.31.51.66 - READY - Version: 570.172 Node #7 - 172.31.51.157 - READY - Version: 570.172 Node #8 - 172.31.52.239 - READY - Version: 570.172 Node #9 - 172.31.53.80 - READY - Version: 570.172 Node #10 - 172.31.54.95 - READY - Version: 570.172 Node #11 - 172.31.54.183 - READY - Version: 570.172 Node #12 - 172.31.54.203 - READY - Version: 570.172 Node #13 - 172.31.54.241 - READY - Version: 570.172 Node #14 - 172.31.55.59 - READY - Version: 570.172 Node #15 - 172.31.55.187 - READY - Version: 570.172 Node #16 - 172.31.55.197 - READY - Version: 570.172 Node #17 - 172.31.56.47 - READY - Version: 570.172 Nodes From\To 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 0 C C C C C C C C C C C C C C C C C C 1 C C C C C C C C C C C C C C C C C C 2 C C C C C C C C C C C C C C C C C C 3 C C C C C C C C C C C C C C C C C C 4 C C C C C C C C C C C C C C C C C C 5 C C C C C C C C C C C C C C C C C C 6 C C C C C C C C C C C C C C C C C C 7 C C C C C C C C C C C C C C C C C C 8 C C C C C C C C C C C C C C C C C C 9 C C C C C C C C C C C C C C C C C C 10 C C C C C C C C C C C C C C C C C C 11 C C C C C C C C C C C C C C C C C C 12 C C C C C C C C C C C C C C C C C C 13 C C C C C C C C C C C C C C C C C C 14 C C C C C C C C C C C C C C C C C C 15 C C C C C C C C C C C C C C C C C C 16 C C C C C C C C C C C C C C C C C C 17 C C C C C C C C C C C C C C C C C C Domain State: UP