/AWS1/IF_SGM=>BATCHREPLACECLUSTERNODES()¶
About BatchReplaceClusterNodes¶
Replaces specific nodes within a SageMaker HyperPod cluster with new hardware. BatchReplaceClusterNodes terminates the specified instances and provisions new replacement instances with the same configuration but fresh hardware. The Amazon Machine Image (AMI) and instance configuration remain the same.
This operation is useful for recovering from hardware failures or persistent issues that cannot be resolved through a reboot.
- Data Loss Warning: Replacing nodes destroys all instance volumes, including both root and secondary volumes. All data stored on these volumes will be permanently lost and cannot be recovered.
- To safeguard your work, back up your data to Amazon S3 or an FSx for Lustre file system before invoking the API on a worker node group. This will help prevent any potential data loss from the instance root volume. For more information about backup, see Use the backup script provided by SageMaker HyperPod.
- If you want to invoke this API on an existing cluster, you'll first need to patch the cluster by running the UpdateClusterSoftware API. For more information about patching a cluster, see Update the SageMaker HyperPod platform software of a cluster.
- You can replace up to 25 nodes in a single request.
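Because a single request accepts at most 25 nodes, a larger replacement has to be split across several calls. The following is a minimal sketch of one way to do that, not an official pattern. The profile ID ZDEMO_PROFILE, the cluster name, and all local variable names are hypothetical. It assumes ABAP 7.40 or later (inline declarations), that the specific SageMaker exceptions derive from the two generic classes caught here, and that the data on every listed node has already been backed up.

```abap
CONSTANTS lc_max_per_request TYPE i VALUE 25.  " limit stated above

" Hypothetical input: the full list of instance IDs is filled elsewhere.
DATA lt_all_ids TYPE /aws1/cl_sgmclusternodeids_w=>tt_clusternodeids.
DATA lt_batch   TYPE /aws1/cl_sgmclusternodeids_w=>tt_clusternodeids.

" ZDEMO_PROFILE is a placeholder for an SDK profile configured in your system.
DATA(lo_session) = /aws1/cl_rt_session_aws=>create( 'ZDEMO_PROFILE' ).
DATA(lo_client)  = /aws1/cl_sgm_factory=>create( lo_session ).
DATA(lv_total)   = lines( lt_all_ids ).

LOOP AT lt_all_ids INTO DATA(lo_id).
  DATA(lv_index) = sy-tabix.   " save the loop index before APPEND overwrites sy-tabix
  APPEND lo_id TO lt_batch.

  " Send when the batch is full or the last ID has been reached.
  IF lines( lt_batch ) = lc_max_per_request OR lv_index = lv_total.
    TRY.
        lo_client->batchreplaceclusternodes(
          iv_clustername = |my-hyperpod-cluster|   " hypothetical cluster name
          it_nodeids     = lt_batch ).
      CATCH /aws1/cx_rt_service_generic /aws1/cx_rt_technical_generic.
        " Stop at the first failed request; real code would log the error.
        EXIT.
    ENDTRY.
    CLEAR lt_batch.
  ENDIF.
ENDLOOP.
```

The sketch discards each response for brevity. In practice you would inspect the failed entries of every response before sending the next batch.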
Method Signature¶
METHODS /AWS1/IF_SGM~BATCHREPLACECLUSTERNODES
IMPORTING
!IV_CLUSTERNAME TYPE /AWS1/SGMCLUSTERNAMEORARN OPTIONAL
!IT_NODEIDS TYPE /AWS1/CL_SGMCLUSTERNODEIDS_W=>TT_CLUSTERNODEIDS OPTIONAL
!IT_NODELOGICALIDS TYPE /AWS1/CL_SGMCLSTNODELOGICALI00=>TT_CLUSTERNODELOGICALIDLIST OPTIONAL
RETURNING
VALUE(OO_OUTPUT) TYPE REF TO /aws1/cl_sgmbtcrplclstnodesrsp
RAISING
/AWS1/CX_SGMRESOURCENOTFOUND
/AWS1/CX_SGMCLIENTEXC
/AWS1/CX_SGMSERVEREXC
/AWS1/CX_RT_TECHNICAL_GENERIC
/AWS1/CX_RT_SERVICE_GENERIC.
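The RAISING clause above lists five exception classes. As a hedged sketch of how a caller might handle them, the following catches each one separately, ordered from most specific to most general. The profile ID, cluster name, and variable names are hypothetical, and the ordering assumes the SageMaker-specific classes are not superclasses of one another in a way that conflicts with it.

```abap
DATA lv_error TYPE string.

" ZDEMO_PROFILE is a placeholder for an SDK profile configured in your system.
DATA(lo_session) = /aws1/cl_rt_session_aws=>create( 'ZDEMO_PROFILE' ).
DATA(lo_client)  = /aws1/cl_sgm_factory=>create( lo_session ).

TRY.
    DATA(lo_result) = lo_client->batchreplaceclusternodes(
      iv_clustername = |my-hyperpod-cluster|   " hypothetical cluster name
      it_nodeids     = VALUE /aws1/cl_sgmclusternodeids_w=>tt_clusternodeids(
        ( NEW /aws1/cl_sgmclusternodeids_w( |i-0123456789abcdef0| ) ) ) ).
  CATCH /aws1/cx_sgmresourcenotfound INTO DATA(lo_not_found).
    " For example, the named cluster does not exist.
    lv_error = lo_not_found->get_text( ).
  CATCH /aws1/cx_sgmclientexc INTO DATA(lo_client_exc).
    " The request was rejected as invalid.
    lv_error = lo_client_exc->get_text( ).
  CATCH /aws1/cx_sgmserverexc INTO DATA(lo_server_exc).
    " The service failed; a retry may be appropriate.
    lv_error = lo_server_exc->get_text( ).
  CATCH /aws1/cx_rt_service_generic INTO DATA(lo_service_exc).
    lv_error = lo_service_exc->get_text( ).
  CATCH /aws1/cx_rt_technical_generic INTO DATA(lo_technical_exc).
    " Connectivity or SDK configuration problem.
    lv_error = lo_technical_exc->get_text( ).
ENDTRY.

IF lv_error IS NOT INITIAL.
  MESSAGE lv_error TYPE 'I'.
ENDIF.
```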
IMPORTING¶
Required arguments:¶
iv_clustername TYPE /AWS1/SGMCLUSTERNAMEORARN /AWS1/SGMCLUSTERNAMEORARN¶
The name or Amazon Resource Name (ARN) of the SageMaker HyperPod cluster containing the nodes to replace.
Optional arguments:¶
it_nodeids TYPE /AWS1/CL_SGMCLUSTERNODEIDS_W=>TT_CLUSTERNODEIDS TT_CLUSTERNODEIDS¶
A list of EC2 instance IDs to replace with new hardware. You can specify between 1 and 25 instance IDs.
Replace operations destroy all instance volumes (root and secondary). Ensure you have backed up any important data before proceeding.
You must provide NodeIds, NodeLogicalIds, or both.
Each instance ID must follow the pattern i- followed by 17 hexadecimal characters (for example, i-0123456789abcdef0).
For SageMaker HyperPod clusters using the Slurm workload manager, you cannot replace instances that are configured as Slurm controller nodes.
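The format and count constraints above can be checked locally before the request is sent. This is an illustrative sketch only: the input values and variable names are hypothetical, and it assumes lowercase hexadecimal digits, as in the documented example. It uses the built-in matches( ) function with the regex parameter, which newer ABAP releases flag as obsolete in favor of pcre.

```abap
" Hypothetical input: raw instance IDs collected elsewhere.
DATA(lt_raw_ids) = VALUE string_table( ( `i-0123456789abcdef0` ) ( `i-123` ) ).
DATA lt_nodeids  TYPE /aws1/cl_sgmclusternodeids_w=>tt_clusternodeids.
DATA lt_rejected TYPE string_table.

LOOP AT lt_raw_ids INTO DATA(lv_id).
  " matches( ) succeeds only if the whole string fits the pattern.
  " Assumption: lowercase hex digits only, as in the documented example.
  IF matches( val = lv_id regex = `i-[0-9a-f]{17}` ).
    APPEND NEW /aws1/cl_sgmclusternodeids_w( lv_id ) TO lt_nodeids.
  ELSE.
    APPEND lv_id TO lt_rejected.
  ENDIF.
ENDLOOP.

" The parameter accepts between 1 and 25 instance IDs.
IF lines( lt_nodeids ) BETWEEN 1 AND 25.
  " lt_nodeids can now be passed as it_nodeids.
ENDIF.
```

With the sample input, the first ID is accepted and `i-123` ends up in lt_rejected. A local check like this does not detect Slurm controller nodes; the service decides that.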
it_nodelogicalids TYPE /AWS1/CL_SGMCLSTNODELOGICALI00=>TT_CLUSTERNODELOGICALIDLIST TT_CLUSTERNODELOGICALIDLIST¶
A list of logical node IDs to replace with new hardware. You can specify between 1 and 25 logical node IDs.
The NodeLogicalId is a unique identifier that persists throughout the node's lifecycle and can be used to track nodes that are still being provisioned and don't yet have an EC2 instance ID assigned.
Replace operations destroy all instance volumes (root and secondary). Ensure you have backed up any important data before proceeding.
This parameter is only supported for clusters using Continuous as the NodeProvisioningMode. For clusters using the default provisioning mode, use NodeIds instead.
You must provide NodeIds, NodeLogicalIds, or both.
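For a cluster that uses Continuous node provisioning, a request might pass logical node IDs only and omit it_nodeids. The sketch below assumes such a cluster. The profile ID, cluster name, and logical node ID are placeholders; the real logical ID format is not described here.

```abap
" ZDEMO_PROFILE is a placeholder for an SDK profile configured in your system.
DATA(lo_session) = /aws1/cl_rt_session_aws=>create( 'ZDEMO_PROFILE' ).
DATA(lo_client)  = /aws1/cl_sgm_factory=>create( lo_session ).

TRY.
    " Only it_nodelogicalids is supplied; it_nodeids is left out.
    DATA(lo_result) = lo_client->batchreplaceclusternodes(
      iv_clustername    = |my-hyperpod-cluster|   " hypothetical cluster name
      it_nodelogicalids = VALUE /aws1/cl_sgmclstnodelogicali00=>tt_clusternodelogicalidlist(
        ( NEW /aws1/cl_sgmclstnodelogicali00( |my-logical-node-id| ) ) ) ).  " placeholder ID
  CATCH /aws1/cx_rt_service_generic /aws1/cx_rt_technical_generic.
    " Error handling is omitted in this sketch.
ENDTRY.
```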
RETURNING¶
oo_output TYPE REF TO /aws1/cl_sgmbtcrplclstnodesrsp /AWS1/CL_SGMBTCRPLCLSTNODESRSP¶
Examples¶
Syntax Example¶
This is an example of the syntax for calling the method. It includes every possible argument and initializes every possible value. The data provided is not necessarily semantically accurate (for example, the value "string" may be provided for something that is intended to be an instance ID, or two arguments may be mutually exclusive). The example shows the ABAP syntax for creating the various data structures.
DATA(lo_result) = lo_client->batchreplaceclusternodes(
  it_nodeids = VALUE /aws1/cl_sgmclusternodeids_w=>tt_clusternodeids(
    ( new /aws1/cl_sgmclusternodeids_w( |string| ) )
  )
  it_nodelogicalids = VALUE /aws1/cl_sgmclstnodelogicali00=>tt_clusternodelogicalidlist(
    ( new /aws1/cl_sgmclstnodelogicali00( |string| ) )
  )
  iv_clustername = |string|
).
This is an example of reading all possible response values.
lo_result = lo_result.
IF lo_result IS NOT INITIAL.
  LOOP AT lo_result->get_successful( ) into lo_row.
    lo_row_1 = lo_row.
    IF lo_row_1 IS NOT INITIAL.
      lv_clusternodeid = lo_row_1->get_value( ).
    ENDIF.
  ENDLOOP.
  LOOP AT lo_result->get_failed( ) into lo_row_2.
    lo_row_3 = lo_row_2.
    IF lo_row_3 IS NOT INITIAL.
      lv_clusternodeid = lo_row_3->get_nodeid( ).
      lv_batchreplaceclusternode = lo_row_3->get_errorcode( ).
      lv_string = lo_row_3->get_message( ).
    ENDIF.
  ENDLOOP.
  LOOP AT lo_result->get_failednodelogicalids( ) into lo_row_4.
    lo_row_5 = lo_row_4.
    IF lo_row_5 IS NOT INITIAL.
      lv_clusternodelogicalid = lo_row_5->get_nodelogicalid( ).
      lv_batchreplaceclusternode = lo_row_5->get_errorcode( ).
      lv_string = lo_row_5->get_message( ).
    ENDIF.
  ENDLOOP.
  LOOP AT lo_result->get_successfulnodelogicalids( ) into lo_row_6.
    lo_row_7 = lo_row_6.
    IF lo_row_7 IS NOT INITIAL.
      lv_clusternodelogicalid = lo_row_7->get_value( ).
    ENDIF.
  ENDLOOP.
ENDIF.
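The generated example above reads each field into a throwaway variable. A more practical variant might condense the response into a short report. The sketch below is one possible shape, using only the getters shown above. The variable names and report wording are hypothetical, and lo_result is assumed to hold the response of an earlier call.

```abap
" Assumption: lo_result was filled by an earlier batchreplaceclusternodes( ) call.
DATA lo_result TYPE REF TO /aws1/cl_sgmbtcrplclstnodesrsp.
DATA lt_report TYPE string_table.

IF lo_result IS BOUND.
  " Count the entries reported as successful, by instance ID and by logical ID.
  DATA(lt_ok)         = lo_result->get_successful( ).
  DATA(lt_ok_logical) = lo_result->get_successfulnodelogicalids( ).
  APPEND |Succeeded (instance IDs): { lines( lt_ok ) }| TO lt_report.
  APPEND |Succeeded (logical IDs): { lines( lt_ok_logical ) }| TO lt_report.

  " List every failure with its error code and message.
  LOOP AT lo_result->get_failed( ) INTO DATA(lo_failed).
    IF lo_failed IS BOUND.
      APPEND |{ lo_failed->get_nodeid( ) }: { lo_failed->get_errorcode( ) } { lo_failed->get_message( ) }|
        TO lt_report.
    ENDIF.
  ENDLOOP.
  LOOP AT lo_result->get_failednodelogicalids( ) INTO DATA(lo_failed_logical).
    IF lo_failed_logical IS BOUND.
      APPEND |{ lo_failed_logical->get_nodelogicalid( ) }: { lo_failed_logical->get_errorcode( ) } { lo_failed_logical->get_message( ) }|
        TO lt_report.
    ENDIF.
  ENDLOOP.
ENDIF.
```

Because a request can succeed for some nodes and fail for others, checking both failure lists after every call is the safer habit.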