Invoke a Multi-Model Endpoint
To invoke a multi-model endpoint, use invoke_endpoint just as you would for a single-model endpoint, with one change: pass a TargetModel parameter that specifies which of the models at the endpoint to target. The SageMaker AI Runtime InvokeEndpoint request supports X-Amzn-SageMaker-Target-Model as a new header that takes the relative path of the model specified for invocation. The SageMaker AI system constructs the absolute path of the model by combining the prefix that is provided as part of the CreateModel API call with the relative path of the model.
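As a concrete sketch, the following helper wraps a Boto3 invoke_endpoint call that passes TargetModel. The helper name, endpoint name, artifact name, and payload format are placeholder assumptions for illustration, not values defined by this guide:

```python
def invoke_target_model(runtime_client, endpoint_name, target_model, payload):
    """Invoke one of the models hosted behind a multi-model endpoint.

    TargetModel is the relative path of the model artifact; SageMaker AI joins
    it with the S3 prefix from the CreateModel call to locate the artifact.
    """
    return runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",    # adjust to your payload format
        TargetModel=target_model,  # e.g. "model-a.tar.gz" (placeholder name)
        Body=payload,
    )

# Placeholder usage (requires AWS credentials and a deployed endpoint):
# import boto3
# runtime = boto3.client("sagemaker-runtime")
# response = invoke_target_model(runtime, "my-multi-model-endpoint",
#                                "model-a.tar.gz", "1.0,2.0,3.0")
# print(response["Body"].read())
```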
The following procedures are the same for both CPU- and GPU-backed multi-model endpoints. The multi-model endpoint dynamically loads target models as needed. You can observe this when running the MME Sample Notebook.
Note
For GPU-backed instances, an HTTP 507 response code from the GPU container indicates a lack of memory or other resources. In this case, unused models are unloaded from the container in order to load more frequently used models.
Retry Requests on ModelNotReadyException Errors
The first time you call invoke_endpoint for a model, the model is
        downloaded from Amazon Simple Storage Service and loaded into the inference container. This makes the first call
        take longer to return. Subsequent calls to the same model finish faster, because the model
        is already loaded.
SageMaker AI returns a response for a call to invoke_endpoint within 60 seconds.
        Some models are too large to download within 60 seconds. If the model does not finish
        loading before the 60 second timeout limit, the request to invoke_endpoint
        returns with the error code ModelNotReadyException, and the model continues to
        download and load into the inference container for up to 360 seconds. If you get a
          ModelNotReadyException error code for an invoke_endpoint
request, retry the request. By default, the AWS SDK for Python (Boto3), using legacy retry mode, retries invoke_endpoint requests that result in ModelNotReadyException errors. You can configure the retry strategy
        to continue retrying the request for up to 360 seconds. If you expect your model to take
        longer than 60 seconds to download and load into the container, set the SDK socket timeout
to 70 seconds. For more information about configuring the retry strategy for the AWS SDK for Python (Boto3), see Configuring a retry mode. The following example configures the retry strategy to retry calls to invoke_endpoint for up to 180 seconds.
```python
import boto3
from botocore.config import Config

# This example retry strategy sets the retry attempts to 2.
# With this setting, the request can attempt to download and/or load the model
# for up to 180 seconds: 1 original request (60 seconds) + 2 retries (120 seconds)
config = Config(
    read_timeout=70,
    retries={
        'max_attempts': 2  # This value can be adjusted to 5 to go up to the 360-second max timeout
    }
)

runtime_sagemaker_client = boto3.client('sagemaker-runtime', config=config)
```