

# Monitoring AWS resources in Amazon SageMaker AI

Monitoring is an important part of maintaining the reliability, availability, and performance of SageMaker AI and your other AWS solutions. AWS provides the following monitoring tools to watch SageMaker AI, report when something is wrong, and take automatic actions when appropriate:
+ *Amazon CloudWatch* monitors your AWS resources and the applications that you run on AWS in real time. You can collect and track metrics, create customized dashboards, and set alarms that notify you or take actions when a specified metric reaches a threshold that you specify. For example, you can have CloudWatch track CPU usage or other metrics of your Amazon EC2 instances and automatically launch new instances when needed. For more information, see the [Amazon CloudWatch User Guide](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/).
+ *Amazon CloudWatch Logs* enables you to monitor, store, and access your log files from EC2 instances, AWS CloudTrail, and other sources. CloudWatch Logs can monitor information in the log files and notify you when certain thresholds are met. You can also archive your log data in highly durable storage. For more information, see the [Amazon CloudWatch Logs User Guide](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/).
+ *AWS CloudTrail* captures API calls and related events made by or on behalf of your AWS account and delivers the log files to an Amazon S3 bucket that you specify. You can identify which users and accounts called AWS, the source IP address from which the calls were made, and when the calls occurred. For more information, see the [AWS CloudTrail User Guide](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/).
+ *CloudWatch Events* delivers a near real-time stream of system events that describe changes in AWS resources. Create CloudWatch Events rules that react to a status change in a SageMaker AI training, hyperparameter tuning, or batch transform job.

**Topics**
+ [Amazon SageMaker AI metrics in Amazon CloudWatch](monitoring-cloudwatch.md)
+ [CloudWatch Logs for Amazon SageMaker AI](logging-cloudwatch.md)
+ [Logging Amazon SageMaker AI API calls using AWS CloudTrail](logging-using-cloudtrail.md)
+ [Monitoring user resource access from SageMaker AI Studio Classic with sourceIdentity](monitor-user-access.md)
+ [Events that Amazon SageMaker AI sends to Amazon EventBridge](automating-sagemaker-with-eventbridge.md)

# Amazon SageMaker AI metrics in Amazon CloudWatch

You can monitor Amazon SageMaker AI using Amazon CloudWatch, which collects raw data and processes it into readable, near real-time metrics. These statistics are kept for 15 months. With them, you can access historical information and gain a better perspective on how your web application or service is performing. However, the Amazon CloudWatch console limits the search to metrics that were updated in the last 2 weeks. This limitation ensures that the most current jobs are shown in your namespace. 

To graph a metric without using a search, specify its exact name in the source view. You can also set alarms that watch for certain thresholds and send notifications or take actions when those thresholds are met. For more information, see the [Amazon CloudWatch User Guide](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/).
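
As a rough illustration of the alarm workflow described above, the following Python sketch assembles the parameters for a CloudWatch `PutMetricAlarm` request against an endpoint metric. The endpoint name `my-endpoint`, the variant name `AllTraffic`, the alarm naming scheme, and the 80% threshold are hypothetical placeholders. The boto3 call itself is left commented out because it requires AWS credentials.

```python
def build_cpu_alarm_params(endpoint_name, variant_name, threshold_percent):
    """Return keyword arguments for a CloudWatch PutMetricAlarm request (a sketch)."""
    return {
        # Hypothetical naming scheme, chosen only for this example.
        "AlarmName": f"{endpoint_name}-{variant_name}-high-cpu",
        "Namespace": "/aws/sagemaker/Endpoints",
        "MetricName": "CPUUtilization",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "Statistic": "Average",
        "Period": 60,              # seconds; matches the 1-minute metric frequency
        "EvaluationPeriods": 5,    # assumed: alarm after 5 consecutive breaching periods
        "Threshold": float(threshold_percent),
        "ComparisonOperator": "GreaterThanThreshold",
    }


params = build_cpu_alarm_params("my-endpoint", "AllTraffic", 80)

# With credentials configured, the request could be sent like this:
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**params)
```

Keeping the parameter construction separate from the API call makes it easy to review or unit test the alarm definition before anything is created in your account.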



**Topics**
+ [SageMaker AI endpoint metrics](#cloudwatch-metrics-endpoints)
+ [SageMaker AI endpoint invocation metrics](#cloudwatch-metrics-endpoint-invocation)
+ [SageMaker AI inference component metrics](#cloudwatch-metrics-inference-component)
+ [SageMaker AI multi-model endpoint metrics](#cloudwatch-metrics-multimodel-endpoints)
+ [SageMaker AI job metrics](#cloudwatch-metrics-jobs)
+ [SageMaker Inference Recommender jobs metrics](#cloudwatch-metrics-inference-recommender)
+ [SageMaker Ground Truth metrics](#cloudwatch-metrics-ground-truth)
+ [Amazon SageMaker Feature Store metrics](#cloudwatch-metrics-feature-store)
+ [SageMaker pipelines metrics](#cloudwatch-metrics-pipelines)

## SageMaker AI endpoint metrics

The `/aws/sagemaker/Endpoints` namespace includes the following metrics for endpoint instances.

Metrics are available at a 1-minute frequency.

**Note**  
Amazon CloudWatch supports [high-resolution custom metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/publishingMetrics.html) and its finest resolution is 1 second. However, the finer the resolution, the shorter the lifespan of the CloudWatch metrics. For the 1-second frequency resolution, the CloudWatch metrics are available for 3 hours. For more information about the resolution and the lifespan of the CloudWatch metrics, see [GetMetricStatistics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_GetMetricStatistics.html) in the *Amazon CloudWatch API Reference*. 

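The trade-off between resolution and lifespan can be sketched as a small lookup. The 3-hour figure for sub-minute data and the 15-month figure come from this page. The 15-day and 63-day tiers are taken from the retention schedule described in the CloudWatch documentation and should be checked against the current API reference. The function name is hypothetical.

```python
def retention_hours(period_seconds):
    """Approximate CloudWatch retention, in hours, for data points of a given period."""
    if period_seconds < 60:
        return 3            # high-resolution data points: 3 hours
    if period_seconds < 300:
        return 15 * 24      # 1-minute data points: 15 days
    if period_seconds < 3600:
        return 63 * 24      # 5-minute data points: 63 days
    return 455 * 24         # 1-hour data points: 455 days (about 15 months)


for period in (1, 60, 300, 3600):
    print(period, retention_hours(period))
```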

**Endpoint metrics**  

| Metric | Description | 
| --- | --- | 
| CPUReservation |  The sum of CPUs reserved by containers on an instance. This metric is provided only for endpoints that host active inference components. The value ranges between 0%–100%. In the settings for an inference component, you set the CPU reservation with the `NumberOfCpuCoresRequired` parameter. For example, if there are 4 CPUs, and 2 are reserved, the `CPUReservation` metric is 50%.  | 
| CPUUtilization |  The sum of each individual CPU core's utilization. The CPU utilization range of each core is 0–100. For example, if there are four CPUs, the `CPUUtilization` range is 0%–400%. For endpoint variants, the value is the sum of the CPU utilization of the primary and supplementary containers on the instance. Units: Percent  | 
| CPUUtilizationNormalized |  The normalized sum of the utilization of each individual CPU core. This metric is provided only for endpoints that host active inference components. The value ranges between 0%–100%. For example, if there are four CPUs, and the `CPUUtilization` metric is 200%, then the `CPUUtilizationNormalized` metric is 50%.  | 
| DiskUtilization |  The percentage of disk space used by the containers on an instance. This value range is 0%–100%. For endpoint variants, the value is the sum of the disk space utilization of the primary and supplementary containers on the instance. Units: Percent  | 
| GPUMemoryUtilization |  The percentage of GPU memory used by the containers on an instance. The value range is 0–100 and is multiplied by the number of GPUs. For example, if there are four GPUs, the `GPUMemoryUtilization` range is 0%–400%. For endpoint variants, the value is the sum of the GPU memory utilization of the primary and supplementary containers on the instance. Units: Percent  | 
| GPUMemoryUtilizationNormalized |  The normalized percentage of GPU memory used by the containers on an instance. This metric is provided only for endpoints that host active inference components. The value ranges between 0%–100%. For example, if there are four GPUs, and the `GPUMemoryUtilization` metric is 200%, then the `GPUMemoryUtilizationNormalized` metric is 50%.  | 
| GPUReservation |  The sum of GPUs reserved by containers on an instance. This metric is provided only for endpoints that host active inference components. The value ranges between 0%–100%. In the settings for an inference component, you set the GPU reservation with the `NumberOfAcceleratorDevicesRequired` parameter. For example, if there are 4 GPUs and 2 are reserved, the `GPUReservation` metric is 50%.  | 
| GPUUtilization |  The percentage of GPU units that are used by the containers on an instance. The value can range between 0–100 and is multiplied by the number of GPUs. For example, if there are four GPUs, the `GPUUtilization` range is 0%–400%. For endpoint variants, the value is the sum of the GPU utilization of the primary and supplementary containers on the instance. Units: Percent  | 
| GPUUtilizationNormalized |  The normalized percentage of GPU units that are used by the containers on an instance. This metric is provided only for endpoints that host active inference components. The value ranges between 0%–100%. For example, if there are four GPUs, and the `GPUUtilization` metric is 200%, then the `GPUUtilizationNormalized` metric is 50%.   | 
| MemoryReservation |  The sum of memory reserved by containers on an instance. This metric is provided only for endpoints that host active inference components. The value ranges between 0%–100%. In the settings for an inference component, you set the memory reservation with the `MinMemoryRequiredInMb` parameter. For example, if a 32 GiB instance reserved 1024 MB, the `MemoryReservation` metric would be 3.125%.  | 
| MemoryUtilization |  The percentage of memory that is used by the containers on an instance. This value range is 0%–100%. For endpoint variants, the value is the sum of the memory utilization of the primary and supplementary containers on the instance. Units: Percent  | 

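The arithmetic behind the normalized and reservation metrics in the table above can be reproduced with a short sketch. The function names are hypothetical, and the memory example treats the 1024 MB reservation as 1024 MiB so that it matches the 3.125% figure quoted for a 32 GiB instance.

```python
def normalized_utilization(summed_percent, device_count):
    """Scale a summed per-device utilization (0 to 100 * N) back to a 0-100 range."""
    return summed_percent / device_count


def reservation_percent(reserved, total):
    """Share of an instance resource that containers have reserved, in percent."""
    return 100.0 * reserved / total


# CPUUtilization of 200% on four CPUs
print(normalized_utilization(200.0, 4))       # 50.0
# 2 of 4 CPUs (or GPUs) reserved
print(reservation_percent(2, 4))              # 50.0
# 1024 MiB reserved on a 32 GiB instance
print(reservation_percent(1024, 32 * 1024))   # 3.125
```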

**Dimensions for endpoint metrics**  

| Dimension | Description | 
| --- | --- | 
| EndpointName, VariantName |  Filters endpoint metrics for a `ProductionVariant` of the specified endpoint and variant.  | 

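One way to use these dimensions is to pass them in a CloudWatch `GetMetricStatistics` request. The sketch below only builds the request parameters. The endpoint and variant names are hypothetical, and the commented-out boto3 call requires AWS credentials.

```python
from datetime import datetime, timedelta, timezone


def endpoint_metric_query(endpoint_name, variant_name, metric_name, minutes=60):
    """Build GetMetricStatistics parameters for one endpoint variant (a sketch)."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "/aws/sagemaker/Endpoints",
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 60,  # seconds; metrics are available at a 1-minute frequency
        "Statistics": ["Average", "Maximum"],
    }


query = endpoint_metric_query("my-endpoint", "AllTraffic", "MemoryUtilization")

# With credentials configured:
# import boto3
# datapoints = boto3.client("cloudwatch").get_metric_statistics(**query)["Datapoints"]
```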
## SageMaker AI endpoint invocation metrics

The `AWS/SageMaker` namespace includes the following request metrics from calls to [InvokeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html).

Metrics are available at a 1-minute frequency.

The following illustration shows how a SageMaker AI endpoint interacts with the Amazon SageMaker Runtime API. The overall time between sending a request to an endpoint and receiving a response depends on the following three components.
+ Network latency – the time that it takes between making a request to and receiving a response back from the SageMaker Runtime API.
+ Overhead latency – the time that it takes to transport a request from the SageMaker Runtime API to the model container and to transport the response back.
+ Model latency – the time that it takes the model container to process the request and return a response.

![\[\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/cloudwatch-latency-types.png)


For more information about total latency, see [Best practices for load testing Amazon SageMaker AI real-time inference endpoints](https://aws.amazon.com/blogs/machine-learning/best-practices-for-load-testing-amazon-sagemaker-real-time-inference-endpoints/). For information about how long CloudWatch metrics are retained for, see [GetMetricStatistics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_GetMetricStatistics.html) in the *Amazon CloudWatch API Reference*.
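
The three components can be combined into a rough latency budget. In the sketch below, the client-side total is assumed to be measured by your own application in milliseconds, and network latency is estimated as whatever remains after subtracting the `ModelLatency` and `OverheadLatency` values, which CloudWatch reports in microseconds. The function name and the sample numbers are hypothetical.

```python
def latency_breakdown_ms(client_total_ms, model_latency_us, overhead_latency_us):
    """Split a client-observed latency into network, overhead, and model parts (ms)."""
    model_ms = model_latency_us / 1000.0        # microseconds -> milliseconds
    overhead_ms = overhead_latency_us / 1000.0  # microseconds -> milliseconds
    network_ms = client_total_ms - model_ms - overhead_ms
    return {"network_ms": network_ms, "overhead_ms": overhead_ms, "model_ms": model_ms}


print(latency_breakdown_ms(120.0, 80_000, 15_000))
# {'network_ms': 25.0, 'overhead_ms': 15.0, 'model_ms': 80.0}
```

Because the client and CloudWatch measure over different windows, treat the network figure as an estimate and not as a reported metric.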


**Endpoint invocation metrics**  

| Metric | Description | 
| --- | --- | 
| ConcurrentRequestsPerCopy |  The number of concurrent requests being received by the inference component, normalized by each copy of an inference component. Valid statistics: Min, Max  | 
| ConcurrentRequestsPerModel |  The number of concurrent requests being received by the model. Valid statistics: Min, Max  | 
| Invocation4XXErrors |  The number of `InvokeEndpoint` requests where the model returned a 4xx HTTP response code. For each 4xx response, 1 is sent; otherwise, 0 is sent. Units: None Valid statistics: Average, Sum  | 
| Invocation5XXErrors |  The number of `InvokeEndpoint` requests where the model returned a 5xx HTTP response code. For each 5xx response, 1 is sent; otherwise, 0 is sent. Units: None Valid statistics: Average, Sum  | 
| InvocationModelErrors |  The number of model invocation requests that did not result in 2XX HTTP response. This includes 4XX/5XX status codes, low-level socket errors, malformed HTTP responses, and request timeouts. For each error response, 1 is sent; otherwise, 0 is sent. Units: None Valid statistics: Average, Sum  | 
| Invocations |  The number of `InvokeEndpoint` requests sent to a model endpoint.  To get the total number of requests sent to a model endpoint, use the Sum statistic. Units: None Valid statistics: Sum  | 
| InvocationsPerCopy |  The number of invocations normalized by each copy of an inference component. Valid statistics: Sum  | 
| InvocationsPerInstance |  The number of invocations sent to a model, normalized by `InstanceCount` in each ProductionVariant. 1/`numberOfInstances` is sent as the value on each request. `numberOfInstances` is the number of active instances for the ProductionVariant behind the endpoint at the time of the request. Units: None Valid statistics: Sum  | 
| ModelLatency |  The interval of time taken by a model to respond to a SageMaker Runtime API request. This interval includes the local communication times taken to send the request and to fetch the response from the model container. It also includes the time taken to complete the inference in the container. Units: Microseconds Valid statistics: Average, Sum, Min, Max, Sample Count, Percentiles  | 
| ModelSetupTime |  The time it takes to launch new compute resources for a serverless endpoint. The time can vary depending on the model size, how long it takes to download the model, and the start-up time of the container. Units: Microseconds Valid statistics: Average, Min, Max, Sample Count, Percentiles  | 
| OverheadLatency |  The interval of time added to the time taken to respond to a client request by SageMaker AI overheads. This interval is measured from the time SageMaker AI receives the request until it returns a response to the client, minus the `ModelLatency`. Overhead latency can vary depending on multiple factors, including request and response payload sizes, request frequency, and authentication/authorization of the request. Units: Microseconds Valid statistics: Average, Sum, Min, Max, Sample Count  | 
|  MidStreamErrors  |  The number of errors that occur during response streaming after the initial response has been sent to the customer.  Units: None Valid statistics: Average, Sum  | 
|  FirstChunkLatency  |  The time elapsed from when the request arrives at the SageMaker AI endpoint until the first chunk of the response is sent to the customer. This metric applies to bidirectional streaming inference requests. Units: Microseconds Valid statistics: Average, Sum, Min, Max, Sample Count, Percentiles  | 
|  FirstChunkModelLatency  |  The time taken by the model container to process the request and return the first chunk of the response. This is measured from when the request is sent to the model container until the first byte is received from the model. This metric applies to bidirectional streaming inference requests. Units: Microseconds Valid statistics: Average, Sum, Min, Max, Sample Count, Percentiles  | 
|  FirstChunkOverheadLatency  |  The overhead latency for the first chunk, excluding model processing time. This is calculated as `FirstChunkLatency` minus `FirstChunkModelLatency`, representing the time spent in routing, preprocessing, and postprocessing operations within the SageMaker AI platform. Overhead latency can vary depending on multiple factors, including request frequency, load, and authentication/authorization of the request. This metric applies to bidirectional streaming inference requests. Units: Microseconds Valid statistics: Average, Sum, Min, Max, Sample Count, Percentiles  | 

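Several metrics in the table above are emitted as a 1 or a 0 for each request, and `InvocationsPerInstance` is emitted as 1/`numberOfInstances` for each request. The following sketch, with hypothetical function names and made-up sample values, shows why the Sum statistic yields a count and the Average statistic yields a rate for such metrics.

```python
def summarize_indicator(samples):
    """Aggregate per-request 0/1 values the way Sum, Sample Count, and Average would."""
    total = sum(samples)
    return {"Sum": total, "SampleCount": len(samples), "Average": total / len(samples)}


def invocations_per_instance(request_count, instance_count):
    """Sum of the 1/numberOfInstances values sent for each request."""
    return sum(1.0 / instance_count for _ in range(request_count))


# One 5xx response out of four requests: Sum is the error count, Average the error rate.
print(summarize_indicator([0, 0, 1, 0]))   # {'Sum': 1, 'SampleCount': 4, 'Average': 0.25}
# 120 requests spread over 4 instances
print(invocations_per_instance(120, 4))    # 30.0
```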

**Dimensions for endpoint invocation metrics**  

| Dimension | Description | 
| --- | --- | 
| EndpointName, VariantName |  Filters endpoint invocation metrics for a `ProductionVariant` of the specified endpoint and variant.  | 
| InferenceComponentName |  Filters inference component invocation metrics.  | 

## SageMaker AI inference component metrics

The `/aws/sagemaker/InferenceComponents` namespace includes the following metrics from calls to [InvokeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html) for endpoints that host inference components.

Metrics are available at a 1-minute frequency.


**Inference component metrics**  

| Metric | Description | 
| --- | --- | 
| CPUUtilizationNormalized |  The value of the `CPUUtilizationNormalized` metric reported by each copy of the inference component. The value ranges between 0%–100%. If you set the `NumberOfCpuCoresRequired` parameter in the settings for the inference component copy, the metric presents the utilization over the reservation. Otherwise, the metric presents the utilization over the limit.  | 
| GPUMemoryUtilizationNormalized |  The value of the `GPUMemoryUtilizationNormalized` metric reported by each copy of the inference component.  | 
| GPUUtilizationNormalized |  The value of the `GPUUtilizationNormalized` metric reported by each copy of the inference component. If you set the `NumberOfAcceleratorDevicesRequired` parameter in the settings for the inference component copy, the metric presents the utilization over the reservation. Otherwise, the metric presents the utilization over the limit.  | 
| MemoryUtilizationNormalized |  The value of `MemoryUtilizationNormalized` reported by each copy of the inference component. If you set the `MinMemoryRequiredInMb` parameter in the settings for the inference component copy, the metrics present the utilization over the reservation. Otherwise, the metrics present the utilization over the limit.  | 


**Dimensions for inference component metrics**  

| Dimension | Description | 
| --- | --- | 
| InferenceComponentName |  Filters inference component metrics.  | 

## SageMaker AI multi-model endpoint metrics

The `AWS/SageMaker` namespace includes the following model loading metrics from calls to [InvokeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html).

Metrics are available at a 1-minute frequency.

For information about how long CloudWatch metrics are retained for, see [GetMetricStatistics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_GetMetricStatistics.html) in the *Amazon CloudWatch API Reference*.


**Multi-model endpoint model loading metrics**  

| Metric | Description | 
| --- | --- | 
| ModelLoadingWaitTime  |  The interval of time that an invocation request has waited for the target model to be downloaded, loaded, or both in order to run inference.  Units: Microseconds  Valid statistics: Average, Sum, Min, Max, Sample Count   | 
| ModelUnloadingTime  |  The interval of time that it took to unload the model through the container's `UnloadModel` API call.  Units: Microseconds  Valid statistics: Average, Sum, Min, Max, Sample Count   | 
| ModelDownloadingTime |  The interval of time that it took to download the model from Amazon Simple Storage Service (Amazon S3). Units: Microseconds Valid statistics: Average, Sum, Min, Max, Sample Count   | 
| ModelLoadingTime  |  The interval of time that it took to load the model through the container's `LoadModel` API call. Units: Microseconds  Valid statistics: Average, Sum, Min, Max, Sample Count   | 
| ModelCacheHit  |  The number of `InvokeEndpoint` requests sent to the multi-model endpoint for which the model was already loaded. The Average statistic shows the ratio of requests for which the model was already loaded. Units: None Valid statistics: Average, Sum, Sample Count  | 


**Dimensions for multi-model endpoint model loading metrics**  

| Dimension | Description | 
| --- | --- | 
| EndpointName, VariantName |  Filters endpoint invocation metrics for a `ProductionVariant` of the specified endpoint and variant.  | 

The `/aws/sagemaker/Endpoints` namespace includes the following instance metrics from calls to [InvokeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html).

Metrics are available at a 1-minute frequency.

For information about how long CloudWatch metrics are retained for, see [GetMetricStatistics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_GetMetricStatistics.html) in the *Amazon CloudWatch API Reference*.


**Multi-model endpoint model instance metrics**  

| Metric | Description | 
| --- | --- | 
| LoadedModelCount  |  The number of models loaded in the containers of the multi-model endpoint. This metric is emitted per instance. The Average statistic with a period of 1 minute tells you the average number of models loaded per instance. The Sum statistic tells you the total number of models loaded across all instances in the endpoint. The models that this metric tracks are not necessarily unique because a model might be loaded in multiple containers at the endpoint. Units: None Valid statistics: Average, Sum, Min, Max, Sample Count  | 

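Because `LoadedModelCount` is emitted per instance, the Sum and Average statistics answer different questions. A minimal sketch with made-up per-instance values (the function name is hypothetical):

```python
def loaded_model_stats(per_instance_counts):
    """Aggregate one LoadedModelCount sample per instance for a single period."""
    total = sum(per_instance_counts)
    return {"Sum": total, "Average": total / len(per_instance_counts)}


# Three instances reporting 3, 5, and 4 loaded models in the same minute
print(loaded_model_stats([3, 5, 4]))   # {'Sum': 12, 'Average': 4.0}
```

The Sum can count the same model more than once when it is loaded on several instances, so it is an upper bound on the number of unique models.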

**Dimensions for multi-model endpoint model instance metrics**  

| Dimension | Description | 
| --- | --- | 
| EndpointName, VariantName |  Filters endpoint invocation metrics for a `ProductionVariant` of the specified endpoint and variant.  | 

## SageMaker AI job metrics

The `/aws/sagemaker/ProcessingJobs`, `/aws/sagemaker/TrainingJobs`, and `/aws/sagemaker/TransformJobs` namespaces include the following metrics for processing jobs, training jobs, and batch transform jobs.

Metrics are available at a 1-minute frequency.

**Note**  
Amazon CloudWatch supports [high-resolution custom metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/publishingMetrics.html) and its finest resolution is 1 second. However, the finer the resolution, the shorter the lifespan of the CloudWatch metrics. For the 1-second frequency resolution, the CloudWatch metrics are available for 3 hours. For more information about the resolution and the lifespan of the CloudWatch metrics, see [GetMetricStatistics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_GetMetricStatistics.html) in the *Amazon CloudWatch API Reference*. 

**Tip**  
To profile your training job with a finer resolution down to 100-millisecond (0.1 second) granularity and store the training metrics indefinitely in Amazon S3 for custom analysis at any time, consider using [Amazon SageMaker Debugger](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html). SageMaker Debugger provides built-in rules to automatically detect common training issues. It detects hardware resource utilization issues (such as CPU, GPU, and I/O bottlenecks). It also detects non-converging model issues (such as overfit, vanishing gradients, and exploding tensors). SageMaker Debugger also provides visualizations through Studio Classic and its profiling report. To explore the Debugger visualizations, see [SageMaker Debugger Insights Dashboard Walkthrough](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-on-studio-insights.html), [Debugger Profiling Report Walkthrough](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-report.html), and [Analyze Data Using the SMDebug Client Library](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-analyze-data.html).


**Processing job, training job, and batch transform job metrics**  

| Metric | Description | 
| --- | --- | 
| CPUUtilization | The sum of each individual CPU core's utilization. The CPU utilization range of each core is 0–100. For example, if there are four CPUs, the `CPUUtilization` range is 0%–400%. For processing jobs, the value is the CPU utilization of the processing container on the instance. For training jobs, the value is the CPU utilization of the algorithm container on the instance. For batch transform jobs, the value is the CPU utilization of the transform container on the instance. For multi-instance jobs, each instance reports CPU utilization metrics. However, the default view in CloudWatch shows the average CPU utilization across all instances. Units: Percent | 
| DiskUtilization | The percentage of disk space used by the containers on an instance. This value range is 0%–100%. This metric is not supported for batch transform jobs. For processing jobs, the value is the disk space utilization of the processing container on the instance. For training jobs, the value is the disk space utilization of the algorithm container on the instance. For multi-instance jobs, each instance reports disk utilization metrics. However, the default view in CloudWatch shows the average disk utilization across all instances. Units: Percent | 
| GPUMemoryUtilization | The percentage of GPU memory used by the containers on an instance. The value range is 0–100 and is multiplied by the number of GPUs. For example, if there are four GPUs, the `GPUMemoryUtilization` range is 0%–400%. For processing jobs, the value is the GPU memory utilization of the processing container on the instance. For training jobs, the value is the GPU memory utilization of the algorithm container on the instance. For batch transform jobs, the value is the GPU memory utilization of the transform container on the instance. For multi-instance jobs, each instance reports GPU memory utilization metrics. However, the default view in CloudWatch shows the average GPU memory utilization across all instances. Units: Percent | 
| GPUUtilization | The percentage of GPU units that are used by the containers on an instance. The value can range between 0–100 and is multiplied by the number of GPUs. For example, if there are four GPUs, the `GPUUtilization` range is 0%–400%. For processing jobs, the value is the GPU utilization of the processing container on the instance. For training jobs, the value is the GPU utilization of the algorithm container on the instance. For batch transform jobs, the value is the GPU utilization of the transform container on the instance. For multi-instance jobs, each instance reports GPU utilization metrics. However, the default view in CloudWatch shows the average GPU utilization across all instances. Units: Percent | 
| MemoryUtilization | The percentage of memory that is used by the containers on an instance. This value range is 0%–100%. For processing jobs, the value is the memory utilization of the processing container on the instance. For training jobs, the value is the memory utilization of the algorithm container on the instance. For batch transform jobs, the value is the memory utilization of the transform container on the instance. For multi-instance jobs, each instance reports memory utilization metrics. However, the default view in CloudWatch shows the average memory utilization across all instances. Units: Percent | 


**Dimensions for job metrics**  

| Dimension | Description | 
| --- | --- | 
| Host |  For processing jobs, the value for this dimension has the format `[processing-job-name]/algo-[instance-number-in-cluster]`. Use this dimension to filter instance metrics for the specified processing job and instance. This dimension format is present only in the `/aws/sagemaker/ProcessingJobs` namespace. For training jobs, the value for this dimension has the format `[training-job-name]/algo-[instance-number-in-cluster]`. Use this dimension to filter instance metrics for the specified training job and instance. This dimension format is present only in the `/aws/sagemaker/TrainingJobs` namespace. For batch transform jobs, the value for this dimension has the format `[transform-job-name]/[instance-id]`. Use this dimension to filter instance metrics for the specified batch transform job and instance. This dimension format is present only in the `/aws/sagemaker/TransformJobs` namespace.  | 

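The `Host` value formats described above can be sketched as a small helper that picks the format by namespace. The helper name, the job names, and the instance identifiers below are hypothetical and serve only to show the shape of the value.

```python
HOST_FORMATS = {
    "/aws/sagemaker/ProcessingJobs": "{job}/algo-{instance}",
    "/aws/sagemaker/TrainingJobs": "{job}/algo-{instance}",
    "/aws/sagemaker/TransformJobs": "{job}/{instance}",
}


def host_dimension(namespace, job_name, instance):
    """Build the Host dimension for a job metric in the given namespace."""
    value = HOST_FORMATS[namespace].format(job=job_name, instance=instance)
    return {"Name": "Host", "Value": value}


print(host_dimension("/aws/sagemaker/TrainingJobs", "my-training-job", 1))
# {'Name': 'Host', 'Value': 'my-training-job/algo-1'}
print(host_dimension("/aws/sagemaker/TransformJobs", "my-transform-job", "i-0123456789abcdef0"))
# {'Name': 'Host', 'Value': 'my-transform-job/i-0123456789abcdef0'}
```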
## SageMaker Inference Recommender jobs metrics

The `/aws/sagemaker/InferenceRecommendationsJobs` namespace includes the following metrics for inference recommendation jobs.


**Inference Recommender metrics**  

| Metric | Description | 
| --- | --- | 
| ClientInvocations |  The number of `InvokeEndpoint` requests sent to a model endpoint, as observed by Inference Recommender. Units: None Valid statistics: Sum  | 
| ClientInvocationErrors |  The number of `InvokeEndpoint` requests that failed, as observed by Inference Recommender. Units: None Valid statistics: Sum  | 
| ClientLatency |  The interval of time taken between sending an `InvokeEndpoint` call and receiving a response as observed by Inference Recommender. Note that the time is in milliseconds, whereas the `ModelLatency` endpoint invocation metric is in microseconds. Units: Milliseconds Valid statistics: Average, Sum, Min, Max, Sample Count, Percentiles  | 
| NumberOfUsers |  The number of concurrent users sending `InvokeEndpoint` requests to the model endpoint. Units: None Valid statistics: Max, Min, Average  | 


**Dimensions for Inference Recommender job metrics**  

| Dimension | Description | 
| --- | --- | 
| JobName |  Filters Inference Recommender job metrics for the specified Inference Recommender job.  | 
| EndpointName |  Filters Inference Recommender job metrics for the specified endpoint.  | 

## SageMaker Ground Truth metrics


**Ground Truth metrics**  

| Metric | Description | 
| --- | --- | 
| ActiveWorkers |  A single active worker on a private work team submitted, released, or declined a task. To get the total number of active workers, use the Sum statistic. Ground Truth tries to deliver each individual `ActiveWorkers` event once. If this delivery is unsuccessful, this metric may not report the total number of active workers. Units: None Valid statistics: Sum, Sample Count  | 
| DatasetObjectsAutoAnnotated |  The number of dataset objects auto-annotated in a labeling job. This metric is only emitted when automated labeling is enabled. To view the labeling job progress, use the Max statistic. Units: None Valid statistics: Max  | 
| DatasetObjectsHumanAnnotated |  The number of dataset objects annotated by a human in a labeling job. To view the labeling job progress, use the Max statistic. Units: None Valid statistics: Max  | 
| DatasetObjectsLabelingFailed |  The number of dataset objects that failed labeling in a labeling job. To view the labeling job progress, use the Max statistic. Units: None Valid statistics: Max  | 
| JobsFailed |  A single labeling job failed. To get the total number of labeling jobs that failed, use the Sum statistic. Units: None Valid statistics: Sum, Sample Count  | 
| JobsSucceeded |  A single labeling job succeeded. To get the total number of labeling jobs that succeeded, use the Sum statistic. Units: None Valid statistics: Sum, Sample Count  | 
| JobsStopped |  A single labeling job was stopped. To get the total number of labeling jobs that were stopped, use the Sum statistic. Units: None Valid statistics: Sum, Sample Count  | 
| TasksAccepted |  A single task was accepted by a worker. To get the total number of tasks accepted by workers, use the Sum statistic. Ground Truth attempts to deliver each individual `TasksAccepted` event once. If this delivery is unsuccessful, this metric may not report the total number of tasks accepted. Units: None Valid statistics: Sum, Sample Count  | 
| TasksDeclined |  A single task was declined by a worker. To get the total number of tasks declined by workers, use the Sum statistic. Ground Truth attempts to deliver each individual `TasksDeclined` event once. If this delivery is unsuccessful, this metric may not report the total number of tasks declined. Units: None Valid statistics: Sum, Sample Count  | 
| TasksReturned |  A single task was returned. To get the total number of tasks returned, use the Sum statistic. Ground Truth attempts to deliver each individual `TasksReturned` event once. If this delivery is unsuccessful, this metric may not report the total number of tasks returned. Units: None  Valid statistics: Sum, Sample Count  | 
| TasksSubmitted |  A single task was submitted/completed by a private worker. To get the total number of tasks submitted by workers, use the Sum statistic. Ground Truth attempts to deliver each individual `TasksSubmitted` event once. If this delivery is unsuccessful, this metric may not report the total number of tasks submitted. Units: None Valid statistics: Sum, Sample Count  | 
| TimeSpent |  Time spent on a task completed by a private worker. This metric does not include time when a worker paused or took a break. Ground Truth attempts to deliver each `TimeSpent` event once. If this delivery is unsuccessful, this metric may not report the total amount of time spent. Units: Seconds Valid statistics: Sum, Sample Count  | 
| TotalDatasetObjectsLabeled |  The number of dataset objects labeled successfully in a labeling job. To view the labeling job progress, use the Max statistic. Units: None Valid statistics: Max  | 


**Dimensions for dataset object metrics**  

| Dimension | Description | 
| --- | --- | 
| LabelingJobName |  Filters dataset object count metrics for a labeling job.  | 

## Amazon SageMaker Feature Store metrics


**Feature Store consumption metrics**  

| Metric | Description | 
| --- | --- | 
| ConsumedReadRequestsUnits |  The number of consumed read units over the specified time period. You can retrieve the consumed read units for a feature store runtime operation and its corresponding feature group. Units: None Valid statistics: All  | 
| ConsumedWriteRequestsUnits |  The number of consumed write units over the specified time period. You can retrieve the consumed write units for a feature store runtime operation and its corresponding feature group. Units: None Valid statistics: All  | 
| ConsumedReadCapacityUnits |  The number of provisioned read capacity units consumed over the specified time period. You can retrieve the consumed read capacity units for a feature store runtime operation and its corresponding feature group. Units: None Valid statistics: All  | 
| ConsumedWriteCapacityUnits |  The number of provisioned write capacity units consumed over the specified time period. You can retrieve the consumed write capacity units for a feature store runtime operation and its corresponding feature group. Units: None Valid statistics: All  | 


**Dimensions for Feature Store consumption metrics**  

| Dimension | Description | 
| --- | --- | 
| FeatureGroupName, OperationName |  Filters feature store runtime consumption metrics of the feature group and the operation that you've specified.  | 


**Feature Store operational metrics**  

| Metric | Description | 
| --- | --- | 
| Invocations |  The number of requests made to the feature store runtime operations over the specified time period. Units: None Valid statistics: Sum  | 
| Operation4XXErrors |  The number of requests made to the Feature Store runtime operations where the operation returned a 4xx HTTP response code. For each 4xx response, 1 is sent; else, 0 is sent. Units: None Valid statistics: Average, Sum  | 
| Operation5XXErrors |  The number of requests made to the feature store runtime operations where the operation returned a 5xx HTTP response code. For each 5xx response, 1 is sent; else, 0 is sent. Units: None Valid statistics: Average, Sum  | 
| ThrottledRequests |  The number of requests made to the feature store runtime operations where the request got throttled. For each throttled request, 1 is sent; else, 0 is sent. Units: None Valid statistics: Average, Sum  | 
| Latency |  The time interval to process requests made to the Feature Store runtime operations. This interval is measured from the time SageMaker AI receives the request until it returns a response to the client. Units: Microseconds Valid statistics: Average, Sum, Min, Max, Sample Count, Percentiles  | 


**Dimensions for Feature Store operational metrics**  

| Dimension | Description | 
| --- | --- | 
| FeatureGroupName, OperationName |  Filters feature store runtime operational metrics of the feature group and the operation that you've specified. You can use these dimensions for non-batch operations, such as GetRecord, PutRecord, and DeleteRecord.  | 
| OperationName |  Filters feature store runtime operational metrics for the operation that you've specified. You can use this dimension for batch operations such as BatchGetRecord.  | 
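
The error and throttling metrics above publish 1 or 0 for each request, so the Sum statistic is a count and the Average statistic is a rate. The following minimal sketch illustrates that arithmetic locally. The function name and the sample data are hypothetical, and the code does not call CloudWatch.

```python
from statistics import mean


def summarize_indicator_metric(datapoints):
    """Summarize a 0/1 indicator metric such as Operation4XXErrors.

    Each data point is 1 if the request had the condition (for example,
    a 4xx response) and 0 otherwise.
    """
    if not datapoints:
        return {"sum": 0, "average": 0.0, "sample_count": 0}
    return {
        "sum": sum(datapoints),           # number of requests with the condition
        "average": mean(datapoints),      # fraction of requests with the condition
        "sample_count": len(datapoints),  # total number of requests
    }


# Hypothetical data: five requests, two of which returned a 4xx response.
summary = summarize_indicator_metric([0, 1, 0, 0, 1])
print(summary)
```

In this sketch, the sum is the number of failed requests and the average is the share of requests that failed, which is why Average and Sum are the valid statistics for these metrics.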

## SageMaker pipelines metrics

The `AWS/Sagemaker/ModelBuildingPipeline` namespace includes the following metrics for pipeline executions.

Two categories of pipeline execution metrics are available:
+  **Execution Metrics across All Pipelines** – Account level pipeline execution metrics (for all pipelines in the current account)
+  **Execution Metrics by Pipeline** – Pipeline execution metrics per pipeline

Metrics are available at a 1-minute frequency.


**Pipeline execution metrics**  

| Metric | Description | 
| --- | --- | 
| ExecutionStarted |  The number of pipeline executions that started. Units: Count Valid statistics: Average, Sum  | 
| ExecutionFailed |  The number of pipeline executions that failed. Units: Count Valid statistics: Average, Sum  | 
| ExecutionSucceeded |  The number of pipeline executions that succeeded. Units: Count Valid statistics: Average, Sum  | 
| ExecutionStopped |  The number of pipeline executions that stopped. Units: Count Valid statistics: Average, Sum  | 
| ExecutionDuration |  The duration in milliseconds that the pipeline execution ran. Units: Milliseconds Valid statistics: Average, Sum, Min, Max, Sample Count  | 


**Dimensions for pipeline execution metrics**  

| Dimension | Description | 
| --- | --- | 
| PipelineName |  Filters pipeline execution metrics for a specified pipeline.  | 

The `AWS/Sagemaker/ModelBuildingPipeline` namespace includes the following metrics for pipeline steps.

Metrics are available at a 1-minute frequency.


**Pipeline step metrics**  

| Metric | Description | 
| --- | --- | 
| StepStarted |  The number of steps that started. Units: Count Valid statistics: Average, Sum  | 
| StepFailed |  The number of steps that failed. Units: Count Valid statistics: Average, Sum  | 
| StepSucceeded |  The number of steps that succeeded. Units: Count Valid statistics: Average, Sum  | 
| StepStopped |  The number of steps that stopped. Units: Count Valid statistics: Average, Sum  | 
| StepDuration |  The duration in milliseconds that the step ran. Units: Milliseconds Valid statistics: Average, Sum, Min, Max, Sample Count  | 


**Dimensions for pipeline step metrics**  

| Dimension | Description | 
| --- | --- | 
| PipelineName, StepName |  Filters step metrics for a specified pipeline and step.  | 
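
As an illustration of how the namespace, metric names, and dimensions above fit together, the following sketch builds the parameters for a CloudWatch `GetMetricStatistics` request. The helper name, the pipeline name `my-pipeline`, and the step name `train` are hypothetical. The sketch only builds a dictionary, so it runs without AWS credentials.

```python
from datetime import datetime, timedelta, timezone

# Namespace for pipeline execution and step metrics, as stated above.
NAMESPACE = "AWS/Sagemaker/ModelBuildingPipeline"


def pipeline_metric_query(metric_name, pipeline_name, end_time,
                          hours=1, step_name=None, statistics=("Sum",)):
    """Build GetMetricStatistics parameters for a pipeline or step metric."""
    dimensions = [{"Name": "PipelineName", "Value": pipeline_name}]
    if step_name is not None:
        # Step metrics are filtered by both PipelineName and StepName.
        dimensions.append({"Name": "StepName", "Value": step_name})
    return {
        "Namespace": NAMESPACE,
        "MetricName": metric_name,
        "Dimensions": dimensions,
        "StartTime": end_time - timedelta(hours=hours),
        "EndTime": end_time,
        "Period": 60,  # metrics are available at a 1-minute frequency
        "Statistics": list(statistics),
    }


end = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
query = pipeline_metric_query("ExecutionFailed", "my-pipeline", end)
step_query = pipeline_metric_query("StepDuration", "my-pipeline", end,
                                   step_name="train", statistics=("Average",))

# With boto3 installed and credentials configured, you could then run:
#   boto3.client("cloudwatch").get_metric_statistics(**query)
```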

# CloudWatch Logs for Amazon SageMaker AI

Anything that an algorithm container, a model container, or a notebook instance lifecycle configuration sends to `stdout` or `stderr` is also sent to Amazon CloudWatch Logs. These logs help you debug your compilation jobs, processing jobs, training jobs, endpoints, transform jobs, notebook instances, and notebook instance lifecycle configurations. You can also use them for progress analysis.

By default, log data is stored in CloudWatch Logs indefinitely. However, you can configure how long to store log data in a log group. For information, see [Change Log Data Retention in CloudWatch Logs](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/Working-with-log-groups-and-streams.html#SettingLogRetention) in the *Amazon CloudWatch Logs User Guide*.
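
As a minimal sketch of changing the retention period, the following code builds the parameters for the CloudWatch Logs `PutRetentionPolicy` operation. The helper name is hypothetical, and the list of allowed retention values is an assumption based on the CloudWatch Logs API reference, so verify it before relying on it. The log group name is the notebook instance log group mentioned in this topic.

```python
# Retention periods (in days) that PutRetentionPolicy is assumed to accept.
ALLOWED_RETENTION_DAYS = {
    1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731,
    1096, 1827, 2192, 2557, 2922, 3288, 3653,
}


def retention_request(log_group_name, days):
    """Build PutRetentionPolicy parameters, rejecting unsupported periods."""
    if days not in ALLOWED_RETENTION_DAYS:
        raise ValueError(f"{days} is not a supported retention period")
    return {"logGroupName": log_group_name, "retentionInDays": days}


request = retention_request("/aws/sagemaker/NotebookInstances", 30)

# With boto3 installed and credentials configured, you could then run:
#   boto3.client("logs").put_retention_policy(**request)
```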

**Logs**

The following table lists all of the logs provided by Amazon SageMaker AI.

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/logging-cloudwatch.html)

**Note**  
1. The `/aws/sagemaker/NotebookInstances/[LifecycleConfigHook]` log stream is created when you create a notebook instance with a lifecycle configuration. For more information, see [Customization of a SageMaker notebook instance using an LCC script](notebook-lifecycle-config.md).  
2. For Inference Pipelines, if you don't provide container names, the platform uses `container-1`, `container-2`, and so on, corresponding to the order provided in the SageMaker AI model.

For more information about CloudWatch Logs, see [What is Amazon CloudWatch Logs?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/WhatIsCloudWatchLogs.html) in the *Amazon CloudWatch Logs User Guide*.

# Logging Amazon SageMaker AI API calls using AWS CloudTrail

Amazon SageMaker AI is integrated with [AWS CloudTrail](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-user-guide.html), a service that provides a record of actions taken by a user, role, or an AWS service. CloudTrail captures all API calls for Amazon SageMaker AI as events. The calls captured include calls from the Amazon SageMaker AI console and code calls to the Amazon SageMaker AI API operations. Using the information collected by CloudTrail, you can determine the request that was made to Amazon SageMaker AI, the IP address from which the request was made, when it was made, and additional details.

Every event or log entry contains information about who generated the request. The identity information helps you determine the following:
+ Whether the request was made with root user or user credentials.
+ Whether the request was made on behalf of an IAM Identity Center user.
+ Whether the request was made with temporary security credentials for a role or federated user.
+ Whether the request was made by another AWS service.

CloudTrail is active in your AWS account when you create the account and you automatically have access to the CloudTrail **Event history**. The CloudTrail **Event history** provides a viewable, searchable, downloadable, and immutable record of the past 90 days of recorded management events in an AWS Region. For more information, see [Working with CloudTrail Event history](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/view-cloudtrail-events.html) in the *AWS CloudTrail User Guide*. There are no CloudTrail charges for viewing the **Event history**.

For an ongoing record of events in your AWS account past 90 days, create a trail or a [CloudTrail Lake](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-lake.html) event data store.
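
To look up recent SageMaker AI management events in the **Event history** programmatically, you can call the CloudTrail `LookupEvents` operation. The following sketch builds the request parameters and keeps the time window within the 90-day limit described above. The helper name and the default 7-day window are illustrative choices.

```python
from datetime import datetime, timedelta, timezone

EVENT_HISTORY_DAYS = 90  # Event history covers the past 90 days


def sagemaker_event_history_query(end_time, days=7):
    """Build LookupEvents parameters for SageMaker AI management events."""
    if not 0 < days <= EVENT_HISTORY_DAYS:
        raise ValueError("Event history only covers the past 90 days")
    return {
        "LookupAttributes": [
            {"AttributeKey": "EventSource",
             "AttributeValue": "sagemaker.amazonaws.com"},
        ],
        "StartTime": end_time - timedelta(days=days),
        "EndTime": end_time,
    }


lookup = sagemaker_event_history_query(
    datetime(2024, 1, 31, tzinfo=timezone.utc), days=30)

# With boto3 installed and credentials configured, you could then run:
#   boto3.client("cloudtrail").lookup_events(**lookup)
```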

**CloudTrail trails**  
A *trail* enables CloudTrail to deliver log files to an Amazon S3 bucket. All trails created using the AWS Management Console are multi-Region. You can create a single-Region or a multi-Region trail by using the AWS CLI. Creating a multi-Region trail is recommended because you capture activity in all AWS Regions in your account. If you create a single-Region trail, you can view only the events logged in the trail's AWS Region. For more information about trails, see [Creating a trail for your AWS account](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-create-and-update-a-trail.html) and [Creating a trail for an organization](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/creating-trail-organization.html) in the *AWS CloudTrail User Guide*.  
You can deliver one copy of your ongoing management events to your Amazon S3 bucket at no charge from CloudTrail by creating a trail; however, Amazon S3 storage charges apply. For more information about CloudTrail pricing, see [AWS CloudTrail Pricing](https://aws.amazon.com/cloudtrail/pricing/). For information about Amazon S3 pricing, see [Amazon S3 Pricing](https://aws.amazon.com/s3/pricing/).

**CloudTrail Lake event data stores**  
*CloudTrail Lake* lets you run SQL-based queries on your events. CloudTrail Lake converts existing events in row-based JSON format to [Apache ORC](https://orc.apache.org/) format. ORC is a columnar storage format that is optimized for fast retrieval of data. Events are aggregated into *event data stores*, which are immutable collections of events based on criteria that you select by applying [advanced event selectors](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-lake-concepts.html#adv-event-selectors). The selectors that you apply to an event data store control which events persist and are available for you to query. For more information about CloudTrail Lake, see [Working with AWS CloudTrail Lake](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-lake.html) in the *AWS CloudTrail User Guide*.  
CloudTrail Lake event data stores and queries incur costs. When you create an event data store, you choose the [pricing option](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-lake-manage-costs.html#cloudtrail-lake-manage-costs-pricing-option) you want to use for the event data store. The pricing option determines the cost for ingesting and storing events, and the default and maximum retention period for the event data store. For more information about CloudTrail pricing, see [AWS CloudTrail Pricing](https://aws.amazon.com/cloudtrail/pricing/).

For security purposes, you can monitor CloudTrail logs to identify abnormal user activity. For more information about monitoring logs, see [Logging and Monitoring](sagemaker-incident-response.md).

## Amazon SageMaker AI data events in CloudTrail


[Data events](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/logging-data-events-with-cloudtrail.html#logging-data-events) provide information about the resource operations performed on or in a resource (for example, reading or writing to an Amazon S3 object). These are also known as data plane operations. Data events are often high-volume activities. By default, CloudTrail doesn’t log data events. The CloudTrail **Event history** doesn't record data events.

Additional charges apply for data events. For more information about CloudTrail pricing, see [AWS CloudTrail Pricing](https://aws.amazon.com/cloudtrail/pricing/).

You can log data events for various Amazon SageMaker AI resource types by using the CloudTrail console, AWS CLI, or CloudTrail API operations. For more information about how to log data events, see [Logging data events with the AWS Management Console](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/logging-data-events-with-cloudtrail.html#logging-data-events-console) and [Logging data events with the AWS Command Line Interface](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/logging-data-events-with-cloudtrail.html#creating-data-event-selectors-with-the-AWS-CLI) in the *AWS CloudTrail User Guide*.

The following table lists the Amazon SageMaker AI resource types for which you can log data events. The **Resource type (console)** column shows the value to choose from the **Resource type** list on the CloudTrail console. The **resources.type value** column shows the `resources.type` value, which you would specify when configuring advanced event selectors using the AWS CLI or CloudTrail APIs. The **Data APIs logged to CloudTrail** column shows the API calls logged to CloudTrail for the resource type. 


| Resource type (console) | resources.type value | Data APIs logged to CloudTrail | 
| --- | --- | --- | 
| SageMaker endpoint |  AWS::SageMaker::Endpoint  |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/logging-using-cloudtrail.html)  | 

**Note**  
The `InvokeEndpoint` and `InvokeEndpointAsync` API calls don't log the request parameters.

You can configure advanced event selectors to filter on the `eventName`, `readOnly`, and `resources.ARN` fields to log only those events that are important to you. For more information about these fields, see [AdvancedFieldSelector](https://docs.aws.amazon.com/awscloudtrail/latest/APIReference/API_AdvancedFieldSelector.html) in the *AWS CloudTrail API Reference*.

The following example shows you how to log data events for an Amazon SageMaker endpoint. In this example, you use the [put-event-selectors](https://docs.aws.amazon.com/cli/latest/reference/cloudtrail/put-event-selectors.html) AWS CLI command to add advanced event selectors that capture data events from your endpoint. You should have an existing CloudTrail trail. Before running the command, you can also save the advanced event selectors JSON object in a file like the following:

```
[
  {
    "FieldSelectors": [
      {
        "Field": "eventCategory",
        "Equals": ["Data"]
      },
      {
        "Field": "resources.ARN",
        "Equals": ["arn:aws:sagemaker:us-east-1:111122223333:endpoint/your-inference-endpoint-name"]
      },
      {
        "Field": "resources.type",
        "Equals": ["AWS::SageMaker::Endpoint"]
      }
    ]
  }
]
```

Then, you can run the following command to start logging data events from the endpoint.

```
aws cloudtrail put-event-selectors \
      --trail-name your-trail-name \
      --advanced-event-selectors file://advanced-event-selectors.json  # specify your previously created JSON file
```
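
If you prefer to generate the selectors file rather than write it by hand, the following sketch produces the same JSON structure and optionally adds an `eventName` filter. The helper name and the account ID and endpoint name in the ARN are placeholders.

```python
import json


def endpoint_data_event_selectors(endpoint_arn, event_names=None):
    """Build advanced event selectors for data events on one endpoint."""
    field_selectors = [
        {"Field": "eventCategory", "Equals": ["Data"]},
        {"Field": "resources.type", "Equals": ["AWS::SageMaker::Endpoint"]},
        {"Field": "resources.ARN", "Equals": [endpoint_arn]},
    ]
    if event_names:
        # Optionally log only specific data APIs, such as InvokeEndpoint.
        field_selectors.append(
            {"Field": "eventName", "Equals": list(event_names)})
    return [{"FieldSelectors": field_selectors}]


selectors = endpoint_data_event_selectors(
    "arn:aws:sagemaker:us-east-1:111122223333:endpoint/my-endpoint",
    event_names=["InvokeEndpoint"],
)
selectors_json = json.dumps(selectors, indent=2)
print(selectors_json)

# To use the result with the CLI command above, save it to a file:
#   with open("advanced-event-selectors.json", "w") as f:
#       f.write(selectors_json)
```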

## Amazon SageMaker AI management events in CloudTrail


[Management events](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/logging-management-events-with-cloudtrail.html#logging-management-events) provide information about management operations that are performed on resources in your AWS account. These are also known as control plane operations. By default, CloudTrail logs management events.

Amazon SageMaker AI logs all Amazon SageMaker AI control plane operations as management events. For a list of the Amazon SageMaker AI control plane operations that Amazon SageMaker AI logs to CloudTrail, see the [Amazon SageMaker AI API Reference](https://docs.aws.amazon.com/sagemaker/latest/APIReference).

## Operations Performed by Automatic Model Tuning


SageMaker AI supports logging non-API service events to your CloudTrail log files for automatic model tuning jobs. These events are related to your tuning jobs but are not the direct result of a customer request to the public AWS API. For example, when you create a hyperparameter tuning job by calling [CreateHyperParameterTuningJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateHyperParameterTuningJob.html), SageMaker AI creates training jobs to evaluate various combinations of hyperparameters to find the best result. Similarly, when you call [StopHyperParameterTuningJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_StopHyperParameterTuningJob.html) to stop a hyperparameter tuning job, SageMaker AI might stop any of the associated running training jobs. Non-API events for your tuning jobs are logged to CloudTrail to help you improve governance, compliance, and operational and risk auditing of your AWS account.

Log entries that result from non-API service events have an `eventType` of `AwsServiceEvent` instead of `AwsApiCall`.
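
For example, when you process CloudTrail records, you can use the `eventType` field to separate service events from direct API calls. The following is a minimal sketch. The function name and the sample records are hypothetical and contain only the fields that the sketch reads.

```python
def split_by_event_type(records):
    """Separate direct API calls from non-API service events."""
    api_calls, service_events = [], []
    for record in records:
        event_type = record.get("eventType")
        if event_type == "AwsServiceEvent":
            service_events.append(record)
        elif event_type == "AwsApiCall":
            api_calls.append(record)
    return api_calls, service_events


# Hypothetical, heavily trimmed records for illustration only.
records = [
    {"eventID": "example-1", "eventType": "AwsApiCall"},
    {"eventID": "example-2", "eventType": "AwsServiceEvent"},
    {"eventID": "example-3", "eventType": "AwsServiceEvent"},
]
api_calls, service_events = split_by_event_type(records)
print(len(api_calls), len(service_events))
```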

## Amazon SageMaker AI event examples


An event represents a single request from any source and includes information about the requested API operation, the date and time of the operation, request parameters, and so on. CloudTrail log files aren't an ordered stack trace of the public API calls, so events don't appear in any specific order.

The following example shows a CloudTrail event that demonstrates the `CreateEndpoint` operation.

```
{
    "eventVersion":"1.05",
    "userIdentity": {
        "type":"IAMUser",
        "principalId":"AIXDAYQEXAMPLEUMLYNGL",
        "arn":"arn:aws:iam::123456789012:user/intern",
        "accountId":"123456789012",
        "accessKeyId":"ASXIAGXEXAMPLEQULKNXV",
        "userName":"intern"
    },
    "eventTime":"2018-01-02T13:39:06Z",
    "eventSource":"sagemaker.amazonaws.com",
    "eventName":"CreateEndpoint",
    "awsRegion":"us-west-2",
    "sourceIPAddress":"127.0.0.1",
    "userAgent":"USER_AGENT",
    "requestParameters": {
        "endpointName":"ExampleEndpoint",
        "endpointConfigName":"ExampleEndpointConfig"
    },
    "responseElements": {
        "endpointArn":"arn:aws:sagemaker:us-west-2:123456789012:endpoint/exampleendpoint"
    },
    "requestID":"6b1b42b9-EXAMPLE",
    "eventID":"a6f85b21-EXAMPLE",
    "eventType":"AwsApiCall",
    "recipientAccountId":"444455556666"
}
```

The following example shows a CloudTrail event that demonstrates the `CreateModel` operation.

```
{
    "eventVersion":"1.05",
    "userIdentity": {
        "type":"IAMUser",
        "principalId":"AIXDAYQEXAMPLEUMLYNGL",
        "arn":"arn:aws:iam::123456789012:user/intern",
        "accountId":"123456789012",
        "accessKeyId":"ASXIAGXEXAMPLEQULKNXV",
        "userName":"intern"
    },
    "eventTime":"2018-01-02T15:23:46Z",
    "eventSource":"sagemaker.amazonaws.com",
    "eventName":"CreateModel",
    "awsRegion":"us-west-2",
    "sourceIPAddress":"127.0.0.1",
    "userAgent":"USER_AGENT",
    "requestParameters": {
        "modelName":"ExampleModel",
        "primaryContainer": {
            "image":"174872318107.dkr.ecr.us-west-2.amazonaws.com/kmeans:latest"
        },
        "executionRoleArn":"arn:aws:iam::123456789012:role/EXAMPLEARN"
    },
    "responseElements": {
        "modelArn":"arn:aws:sagemaker:us-west-2:123456789012:model/barkinghappy2018-01-02t15-23-32-275z-ivrdog"
    },
    "requestID":"417b8dab-EXAMPLE",
    "eventID":"0f2b3e81-EXAMPLE",
    "eventType":"AwsApiCall",
    "recipientAccountId":"444455556666"
}
```

For information about CloudTrail record contents, see [CloudTrail record contents](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-event-reference-record-contents.html) in the *AWS CloudTrail User Guide*.
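
As a sketch of how you might read these records in an audit script, the following code pulls out who made a request, what they requested, and when. The function name and the keys of the returned dictionary are illustrative. The sample record is a trimmed copy of the `CreateEndpoint` example above.

```python
def summarize_event(event):
    """Return the who, what, when, and where of a CloudTrail event."""
    identity = event.get("userIdentity", {})
    return {
        "who": identity.get("arn"),
        "what": event.get("eventName"),
        "when": event.get("eventTime"),
        "source_ip": event.get("sourceIPAddress"),
        "request": event.get("requestParameters"),
    }


# Trimmed copy of the CreateEndpoint example event.
create_endpoint_event = {
    "userIdentity": {
        "type": "IAMUser",
        "arn": "arn:aws:iam::123456789012:user/intern",
    },
    "eventTime": "2018-01-02T13:39:06Z",
    "eventSource": "sagemaker.amazonaws.com",
    "eventName": "CreateEndpoint",
    "sourceIPAddress": "127.0.0.1",
    "requestParameters": {
        "endpointName": "ExampleEndpoint",
        "endpointConfigName": "ExampleEndpointConfig",
    },
    "eventType": "AwsApiCall",
}

event_summary = summarize_event(create_endpoint_event)
print(event_summary)
```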

# Monitoring user resource access from SageMaker AI Studio Classic with sourceIdentity

With Amazon SageMaker Studio Classic, you can monitor user resource access. To view resource access activity, you can configure AWS CloudTrail to monitor and record user activities by following the steps in [Log Amazon SageMaker API Calls with AWS CloudTrail](https://docs.aws.amazon.com/sagemaker/latest/dg/logging-using-cloudtrail.html). 

However, the AWS CloudTrail logs for resource access list only the Studio Classic execution IAM role as the identifier. This level of logging is enough to audit user activity when each user profile has a distinct execution role. When several user profiles share a single execution IAM role, you can't tell which specific user accessed the AWS resources.

When user profiles share an execution role, you can use the `sourceIdentity` configuration to propagate the Studio Classic user profile name, so that AWS CloudTrail logs show which specific user performed an action. For more information about source identity, see [Monitor and control actions taken with assumed roles](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_control-access_monitor.html). To turn `sourceIdentity` on or off for your CloudTrail logs, see [Turn on sourceIdentity in CloudTrail logs for SageMaker AI Studio Classic](monitor-user-access-how-to.md).

## Considerations when using sourceIdentity


When you make AWS API calls from Studio Classic notebooks, SageMaker Canvas, or Amazon SageMaker Data Wrangler, the `sourceIdentity` is only recorded in CloudTrail if those calls are made using the Studio Classic [execution role](sagemaker-roles.md) session or any [chained role](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_terms-and-concepts.html#iam-term-role-chaining) from that session.

When these API calls invoke other services to perform additional operations, `sourceIdentity` logging depends on the specific implementation of the invoked services. 
+ Amazon SageMaker Training and Processing: When you create a job using the training feature or the processing feature, the job creation API calls ingest the `sourceIdentity` that exists in the session. As a result, any AWS API calls made from these jobs record the `sourceIdentity` in the CloudTrail logs.
+ Amazon SageMaker Pipelines: When you create jobs using automated CI/CD pipelines, `sourceIdentity` propagates downstream and can be viewed in the CloudTrail logs.
+ Amazon EMR: When connecting to Amazon EMR from Studio Classic using [runtime roles](studio-notebooks-emr-cluster-rbac.md), administrators must explicitly [set the PropagateSourceIdentity field](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-steps-runtime-roles.html). This ensures that Amazon EMR applies the `sourceIdentity` from the calling credentials to a job or query session. The `sourceIdentity` is then recorded in CloudTrail logs.

**Note**  
The following exceptions apply when using `sourceIdentity`.  
+ SageMaker Studio Classic shared spaces do not support `sourceIdentity` passthrough. AWS API calls made from shared spaces do not record `sourceIdentity` in CloudTrail logs.
+ If AWS API calls are made from sessions that are created by users or other services and the sessions are not based on the Studio Classic execution role session, then the `sourceIdentity` is not recorded in CloudTrail logs.

# Turn on sourceIdentity in CloudTrail logs for SageMaker AI Studio Classic

With Amazon SageMaker Studio Classic, you can monitor user resource access. However, the AWS CloudTrail logs for resource access only list the Studio Classic execution IAM role as the identifier. When a single execution IAM role is shared between several user profiles, you must use the `sourceIdentity` configuration to get information about the specific user that accessed the AWS resources.

The following topics explain how to turn on or off the `sourceIdentity` configuration.

**Topics**
+ [Prerequisites](#monitor-user-access-prereq)
+ [Turn on sourceIdentity](#monitor-user-access-enable)
+ [Turn off sourceIdentity](#monitor-user-access-disable)

## Prerequisites

+ Install and configure the AWS Command Line Interface following the steps in [Installing or updating the latest version of the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html).
+ Ensure that Studio Classic users in your domain don’t have a policy that allows them to update or modify the domain.  
+ To turn on or turn off `sourceIdentity` propagation, all apps in the domain must be in the `Stopped` or `Deleted` state. For more information about how to stop and shut down apps, see [Shut down and Update Studio Classic Apps](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-tasks-update-apps.html).
+ If source identity propagation is turned on, all execution roles must have the following trust policy permissions: 
  + Any role that the domain's execution role assumes must have the `sts:SetSourceIdentity` permission in the trust policy. If this permission is missing, your actions fail with `AccessDeniedException` or `ValidationError` when you call the job creation API. The following example trust policy includes the `sts:SetSourceIdentity` permission.

    ```
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Service": "sagemaker.amazonaws.com"
                },
                "Action": [
                    "sts:AssumeRole",
                    "sts:SetSourceIdentity"
                ]
            }
        ]
    }
    ```

  + When one role assumes another role, which is called role chaining, the following applies:
    + The `sts:SetSourceIdentity` permission is required in both the permissions policy of the principal that assumes the role and the role trust policy of the target role. Otherwise, the assume role operation fails.
    + Role chaining can happen in Studio Classic or in any downstream service, such as Amazon EMR. For more information about role chaining, see [Roles terms and concepts](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_terms-and-concepts.html).
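
Before you turn on source identity propagation, you might want to check that a role's trust policy already includes the required permission. The following sketch inspects a trust policy document that you have loaded as a Python dictionary. The function name is hypothetical, and the check is deliberately simple: it matches the exact action name and does not evaluate wildcards, conditions, or `Deny` statements.

```python
def allows_set_source_identity(trust_policy):
    """Return True if an Allow statement lists sts:SetSourceIdentity."""
    statements = trust_policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]  # a single statement can be an object
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]  # a single action can be a string
        if "sts:SetSourceIdentity" in actions:
            return True
    return False


trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": ["sts:AssumeRole", "sts:SetSourceIdentity"],
        }
    ],
}
print(allows_set_source_identity(trust_policy))
```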

## Turn on sourceIdentity


The ability to propagate the user profile name as the `sourceIdentity` in Studio Classic is turned off by default.

To enable the ability to propagate the user profile name as the `sourceIdentity`, use the AWS CLI during domain creation and domain update. This feature is enabled at the domain level and not at the user profile level.

After you enable this configuration, administrators can view the user profile in the AWS CloudTrail log for the service accessed. The user profile is given as the `sourceIdentity` value in the `userIdentity` section. For more information about using AWS CloudTrail logs with SageMaker AI, see [Log Amazon SageMaker AI API Calls with AWS CloudTrail](https://docs.aws.amazon.com/sagemaker/latest/dg/logging-using-cloudtrail.html). 
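
As a sketch of how an administrator might use this value, the following code counts CloudTrail records per user profile. It assumes that `sourceIdentity` appears under `userIdentity.sessionContext`, and it falls back to a top-level `userIdentity.sourceIdentity` key. Verify the exact location against your own log records. The function names and sample records are hypothetical.

```python
from collections import Counter


def source_identity(record):
    """Return the propagated user profile name, or None if absent."""
    user_identity = record.get("userIdentity", {})
    session_context = user_identity.get("sessionContext", {})
    # Assumed location first, then a fallback directly under userIdentity.
    return (session_context.get("sourceIdentity")
            or user_identity.get("sourceIdentity"))


def actions_per_user_profile(records):
    """Count records per user profile; '<unknown>' means no sourceIdentity."""
    counts = Counter()
    for record in records:
        counts[source_identity(record) or "<unknown>"] += 1
    return counts


# Hypothetical, heavily trimmed records for illustration only.
records = [
    {"userIdentity": {"sessionContext": {"sourceIdentity": "profile-a"}}},
    {"userIdentity": {"sessionContext": {"sourceIdentity": "profile-a"}}},
    {"userIdentity": {"sessionContext": {"sourceIdentity": "profile-b"}}},
    {"userIdentity": {"type": "AssumedRole"}},
]
profile_counts = actions_per_user_profile(records)
print(profile_counts)
```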

You can use the following code to enable the propagation of the user profile name as the `sourceIdentity` during domain creation using the `create-domain` API. 

```
create-domain
--domain-name <value>
--auth-mode <value>
--default-user-settings <value>
--subnet-ids <value>
--vpc-id <value>
[--tags <value>]
[--app-network-access-type <value>]
[--home-efs-file-system-kms-key-id <value>]
[--kms-key-id <value>]
[--app-security-group-management <value>]
[--domain-settings "ExecutionRoleIdentityConfig=USER_PROFILE_NAME"]
[--cli-input-json <value>]
[--generate-cli-skeleton <value>]
```

You can enable the propagation of the user profile name as the `sourceIdentity` during domain update using the `update-domain` API.

To update this configuration, all apps in the domain must be in the `Stopped` or `Deleted` state. For more information about how to stop and shut down apps, see [Shut down and Update Studio Classic Apps](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-tasks-update-apps.html).

Use the following code to enable the propagation of the user profile name as the `sourceIdentity`.

```
update-domain
--domain-id <value>
[--default-user-settings <value>]
[--domain-settings-for-update "ExecutionRoleIdentityConfig=USER_PROFILE_NAME"]
[--cli-input-json <value>]
[--generate-cli-skeleton <value>]
```

## Turn off sourceIdentity


You can also turn off the propagation of the user profile name as the `sourceIdentity` using the AWS CLI. To do so, pass the `ExecutionRoleIdentityConfig=DISABLED` value for the `--domain-settings-for-update` parameter when you call `update-domain`.

In the AWS CLI, use the following code to disable the propagation of the user profile name as the `sourceIdentity`.

```
update-domain
--domain-id <value>
[--default-user-settings <value>]
[--domain-settings-for-update "ExecutionRoleIdentityConfig=DISABLED"]
[--cli-input-json <value>]
[--generate-cli-skeleton <value>]
```
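
If you manage domains with the AWS SDK for Python (Boto3) instead of the AWS CLI, the same setting maps to the `DomainSettingsForUpdate` parameter of `update_domain`. The following sketch only builds the request parameters. The helper name and the domain ID `d-exampledomain` are placeholders.

```python
def source_identity_update(domain_id, enabled):
    """Build UpdateDomain parameters that turn sourceIdentity on or off."""
    config = "USER_PROFILE_NAME" if enabled else "DISABLED"
    return {
        "DomainId": domain_id,
        "DomainSettingsForUpdate": {"ExecutionRoleIdentityConfig": config},
    }


turn_on = source_identity_update("d-exampledomain", enabled=True)
turn_off = source_identity_update("d-exampledomain", enabled=False)

# With boto3 installed and credentials configured, and with all apps in
# the domain stopped or deleted, you could then run:
#   boto3.client("sagemaker").update_domain(**turn_on)
```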

# Events that Amazon SageMaker AI sends to Amazon EventBridge

Amazon EventBridge monitors status change events in Amazon SageMaker AI. EventBridge enables you to automate SageMaker AI and respond automatically to events such as a training job status change or endpoint status change. Events from SageMaker AI are delivered to EventBridge in near real time. You can write simple rules to indicate which events are of interest to you, and what automated actions to take when an event matches a rule. To create a rule, see [Creating rules that react to events in EventBridge](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-create-rule.html). If you use the AWS CLI, see [put-rule](https://docs.aws.amazon.com/cli/latest/reference/events/put-rule.html) in the *AWS CLI Command Reference*.

The following sections describe the events that SageMaker AI sends to EventBridge, along with examples. You can use the examples to help you write automation rules.

**Note**  
SageMaker AI may send multiple events to EventBridge for each state change. This behavior is expected and does not necessarily indicate an error.

Some examples of the actions that can be automatically triggered include the following:
+ Invoking an AWS Lambda function
+ Invoking Amazon EC2 Run Command
+ Relaying the event to Amazon Kinesis Data Streams
+ Activating an AWS Step Functions state machine
+ Notifying an Amazon SNS topic or an Amazon SQS queue

**Topics**
+ [SageMaker endpoint deployment state change](#eventbridge-deployment-state)
+ [SageMaker endpoint state change](#eventbridge-endpoint)
+ [SageMaker feature group state change](#eventbridge-feature-group)
+ [SageMaker hyperparameter tuning job state change](#eventbridge-hpo)
+ [SageMaker HyperPod cluster event](#eventbridge-hyperpod-cluster-event)
+ [SageMaker HyperPod cluster node health](#eventbridge-hyperpod-node-health)
+ [SageMaker HyperPod cluster state change](#eventbridge-hyperpod-cluster-state)
+ [SageMaker image state change](#eventbridge-image-state)
+ [SageMaker image version state change](#eventbridge-image-version-state)
+ [SageMaker model card state change](#eventbridge-model-card-state)
+ [SageMaker model package state change](#eventbridge-model-package)
+ [SageMaker model state change](#eventbridge-model)
+ [SageMaker pipeline execution state change](#eventbridge-pipeline)
+ [SageMaker pipeline step state change](#eventbridge-pipeline-step)
+ [SageMaker processing job state change](#processing-job-state)
+ [SageMaker training job state change](#eventbridge-training)
+ [SageMaker transform job state change](#eventbridge-transform)

## SageMaker endpoint deployment state change
Endpoint deployment state change

**Important**  
The following examples may not work for all endpoints. For a list of features that may exclude your endpoint, see the [Exclusions](deployment-guardrails-exclusions.md) page.

Indicates a state change for an endpoint deployment. The following example shows an endpoint updating with a blue/green canary deployment.

```
{
    "version": "0",
    "id": "0bd4a141-0a02-9d8a-f977-3924c3fb259c",
    "detail-type": "SageMaker Endpoint Deployment State Change",
    "source": "aws.sagemaker",
    "account": "111122223333",
    "time": "2021-10-25T01:52:12Z",
    "region": "us-west-2",
    "resources": [
        "arn:aws:sagemaker:us-west-2:111122223333:endpoint/sample-endpoint"
    ],
    "detail": {
        "EndpointName": "sample-endpoint",
        "EndpointArn": "arn:aws:sagemaker:us-west-2:111122223333:endpoint/sample-endpoint",
        "EndpointConfigName": "sample-endpoint-config-1",
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",
                "CurrentWeight": 1,
                "DesiredWeight": 1,
                "CurrentInstanceCount": 3,
                "DesiredInstanceCount": 3
            }
        ],
        "EndpointStatus": "UPDATING",
        "CreationTime": 1635195148181,
        "LastModifiedTime": 1635195148181,
        "Tags": {},
        "PendingDeploymentSummary": {
            "EndpointConfigName": "sample-endpoint-config-2",
            "StartTime": Timestamp,
            "ProductionVariants": [
                {
                    "VariantName": "AllTraffic",
                    "CurrentWeight": 1,
                    "DesiredWeight": 1,
                    "CurrentInstanceCount": 1,
                    "DesiredInstanceCount": 3,
                    "VariantStatus": [
                        {
                            "Status": "Baking",
                            "StatusMessage": "Baking for 600 seconds (TerminationWaitInSeconds) with traffic enabled on canary capacity of 1 instance(s).",
                            "StartTime": 1635195269181
                        }
                    ]
                }
            ]
        }
    }
}
```

The following example indicates a state change for an endpoint deployment, which is being updated with new capacity on an existing endpoint configuration.

```
{
    "version": "0",
    "id": "0bd4a141-0a02-9d8a-f977-3924c3fb259c",
    "detail-type": "SageMaker Endpoint Deployment State Change",
    "source": "aws.sagemaker",
    "account": "111122223333",
    "time": "2021-10-25T01:52:12Z",
    "region": "us-west-2",
    "resources": [
        "arn:aws:sagemaker:us-west-2:111122223333:endpoint/sample-endpoint"
    ],
    "detail": {
        "EndpointName": "sample-endpoint",
        "EndpointArn": "arn:aws:sagemaker:us-west-2:111122223333:endpoint/sample-endpoint",
        "EndpointConfigName": "sample-endpoint-config-1",
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",
                "CurrentWeight": 1,
                "DesiredWeight": 1,
                "CurrentInstanceCount": 3,
                "DesiredInstanceCount": 6,
                "VariantStatus": [
                    {
                        "Status": "Updating",
                        "StatusMessage": "Scaling out desired instance count to 6.",
                        "StartTime": 1635195269181
                    }
                ]
            }
        ],
        "EndpointStatus": "UPDATING",
        "CreationTime": 1635195148181,
        "LastModifiedTime": 1635195148181,
        "Tags": {}
    }
}
```

The `VariantStatus` object can also report the following secondary deployment statuses for an endpoint.
+ `Creating`: creating instances for the production variant.

  Example message: `"Launching X instance(s)."`
+ `Deleting`: terminating instances for the production variant.

  Example message: `"Terminating X instance(s)."`
+ `Updating`: updating capacity for the production variant.

  Example messages: `"Launching X instance(s)."`, `"Scaling out desired instance count to X."`
+ `ActivatingTraffic`: turning on traffic for the production variant.

  Example message: `"Activating traffic on canary capacity of X instance(s)."`
+ `Baking`: waiting period to monitor the CloudWatch alarms in the auto-rollback configuration.

  Example message: `"Baking for X seconds (TerminationWaitInSeconds) with traffic enabled on full capacity of Y instance(s)."`
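
To act on these secondary statuses, a target can read the `VariantStatus` entries from both the current and the pending production variants. The following Python sketch assumes the event shape shown in the preceding examples; the function name `variant_statuses` is hypothetical.

```python
def variant_statuses(event):
    """Collect (config name, variant name, status, message) tuples from an event."""
    detail = event.get("detail", {})
    pending = detail.get("PendingDeploymentSummary", {})
    sources = [
        (detail.get("EndpointConfigName"), detail.get("ProductionVariants", [])),
        (pending.get("EndpointConfigName"), pending.get("ProductionVariants", [])),
    ]
    rows = []
    for config_name, variants in sources:
        for variant in variants:
            # A variant without a secondary status has no VariantStatus list.
            for status in variant.get("VariantStatus", []):
                rows.append(
                    (
                        config_name,
                        variant.get("VariantName"),
                        status.get("Status"),
                        status.get("StatusMessage"),
                    )
                )
    return rows


# Trimmed version of the blue/green canary example event.
sample_event = {
    "detail-type": "SageMaker Endpoint Deployment State Change",
    "detail": {
        "EndpointConfigName": "sample-endpoint-config-1",
        "ProductionVariants": [{"VariantName": "AllTraffic"}],
        "PendingDeploymentSummary": {
            "EndpointConfigName": "sample-endpoint-config-2",
            "ProductionVariants": [
                {
                    "VariantName": "AllTraffic",
                    "VariantStatus": [
                        {
                            "Status": "Baking",
                            "StatusMessage": "Baking for 600 seconds (TerminationWaitInSeconds) with traffic enabled on canary capacity of 1 instance(s).",
                        }
                    ],
                }
            ],
        },
    },
}

rows = variant_statuses(sample_event)
for row in rows:
    print(row)
```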

## SageMaker endpoint state change
Endpoint state change

Indicates a change in the status of a SageMaker AI hosted real-time inference endpoint.

The following shows an event with an endpoint in the `IN_SERVICE` state.

```
{
  "version": "0",
  "id": "d2921b5a-b0ad-cace-a8e3-0f159d018e06",
  "detail-type": "SageMaker Endpoint State Change",
  "source": "aws.sagemaker",
  "account": "111122223333",
  "time": "1583831889050",
  "region": "us-west-2",
  "resources": [
      "arn:aws:sagemaker:us-west-2:111122223333:endpoint/myendpoint"
  ],
  "detail": {
      "EndpointName": "MyEndpoint",
      "EndpointArn": "arn:aws:sagemaker:us-west-2:111122223333:endpoint/myendpoint",
      "EndpointConfigName": "MyEndpointConfig",
      "ProductionVariants": [
          {
              "DesiredWeight": 1.0,
              "DesiredInstanceCount": 1.0
          }
      ],
      "EndpointStatus": "IN_SERVICE",
      "CreationTime": 1592411992203.0,
      "LastModifiedTime": 1592411994287.0,
      "Tags": {

      }
  }
}
```

## SageMaker feature group state change
Feature group state change

Indicates a change either in the `FeatureGroupStatus` or the `OfflineStoreStatus` of a SageMaker feature group.

```
{
  "version": "0",
  "id": "93201303-abdb-36a4-1b9b-4c1c3e3671c0",
  "detail-type": "SageMaker Feature Group State Change",
  "source": "aws.sagemaker",
  "account": "111122223333",
  "time": "2021-01-26T01:22:01Z",
  "region": "us-east-1",
  "resources": [
    "arn:aws:sagemaker:us-east-1:111122223333:feature-group/sample-feature-group"
  ],
  "detail": {
    "FeatureGroupArn": "arn:aws:sagemaker:us-east-1:111122223333:feature-group/sample-feature-group",
    "FeatureGroupName": "sample-feature-group",
    "RecordIdentifierFeatureName": "RecordIdentifier",
    "EventTimeFeatureName": "EventTime",
    "FeatureDefinitions": [
      {
        "FeatureName": "RecordIdentifier",
        "FeatureType": "Integral"
      },
      {
        "FeatureName": "EventTime",
        "FeatureType": "Fractional"
      }
    ],
    "CreationTime": 1611624059000,
    "OnlineStoreConfig": {
      "EnableOnlineStore": true
    },
    "OfflineStoreConfig": {
      "S3StorageConfig": {
        "S3Uri": "s3://offline/s3/uri"
      },
      "DisableGlueTableCreation": false,
      "DataCatalogConfig": {
        "TableName": "sample-feature-group-1611624059",
        "Catalog": "AwsDataCatalog",
        "Database": "sagemaker_featurestore"
      }
    },
    "RoleArn": "arn:aws:iam::111122223333:role/SageMakerRole",
    "FeatureGroupStatus": "Active",
    "Tags": {}
  }
}
```

## SageMaker hyperparameter tuning job state change
Hyperparameter tuning job state change

Indicates a change in the status of a SageMaker hyperparameter tuning job.

```
{
  "version": "0",
  "id": "844e2571-85d4-695f-b930-0153b71dcb42",
  "detail-type": "SageMaker HyperParameter Tuning Job State Change",
  "source": "aws.sagemaker",
  "account": "111122223333",
  "time": "2018-10-06T12:26:13Z",
  "region": "us-east-1",
  "resources": [
    "arn:aws:sagemaker:us-east-1:111122223333:tuningJob/x"
  ],
  "detail": {
    "HyperParameterTuningJobName": "016bffd3-6d71-4d3a-9710-0a332b2759fc",
    "HyperParameterTuningJobArn": "arn:aws:sagemaker:us-east-1:111122223333:tuningJob/x",
    "TrainingJobDefinition": {
      "StaticHyperParameters": {},
      "AlgorithmSpecification": {
        "TrainingImage": "trainingImageName",
        "TrainingInputMode": "inputModeFile",
        "MetricDefinitions": [
          {
            "Name": "metricName",
            "Regex": "regex"
          }
        ]
      },
      "RoleArn": "roleArn",
      "InputDataConfig": [
        {
          "ChannelName": "channelName",
          "DataSource": {
            "S3DataSource": {
              "S3DataType": "s3DataType",
              "S3Uri": "s3Uri",
              "S3DataDistributionType": "s3DistributionType"
            }
          },
          "ContentType": "contentType",
          "CompressionType": "gz",
          "RecordWrapperType": "RecordWrapper"
        }
      ],
      "VpcConfig": {
        "SecurityGroupIds": [
          "securityGroupIds"
        ],
        "Subnets": [
          "subnets"
        ]
      },
      "OutputDataConfig": {
        "KmsKeyId": "kmsKeyId",
        "S3OutputPath": "s3OutputPath"
      },
      "ResourceConfig": {
        "InstanceType": "instanceType",
        "InstanceCount": 10,
        "VolumeSizeInGB": 500,
        "VolumeKmsKeyId": "volumeKeyId"
      },
      "StoppingCondition": {
        "MaxRuntimeInSeconds": 3600
      }
    },
    "HyperParameterTuningJobStatus": "status",
    "CreationTime": "1583831889050",
    "LastModifiedTime": "1583831889050",
    "TrainingJobStatusCounters": {
      "Completed": 1,
      "InProgress": 0,
      "RetryableError": 0,
      "NonRetryableError": 0,
      "Stopped": 0
    },
    "ObjectiveStatusCounters": {
      "Succeeded": 1,
      "Pending": 0,
      "Failed": 0
    },
    "Tags": {}
  }
}
```

## SageMaker HyperPod cluster event
HyperPod cluster event

Indicates a new event in the state of a SageMaker HyperPod cluster. For more information, see the [DescribeClusterEvent](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeClusterEvent.html) operation.

```
{
    "version": "0",
    "id": "0bd4a141-0a02-9d8a-f977-3924c3fb259c",
    "detail-type": "SageMaker HyperPod Cluster Event",
    "source": "aws.sagemaker",
    "account": "111122223333",
    "time": "2025-04-28T16:59:01Z",
    "region": "us-west-2",
    "resources": [
        "arn:aws:sagemaker:us-west-2:111122223333:cluster/sample-cluster"
    ],
    "detail": {
        "EventDetails": {
            "EventId": "a307fae0-6937-40f9-af2f-16eb873d340a",
            "ClusterArn": "arn:aws:sagemaker:us-west-2:111122223333:cluster/sample-cluster",
            "ClusterName": "sample-cluster",
            "InstanceGroupName": "sample-instance-group",
            "InstanceId": "i-0391f86fa0fe0d465",
            "ResourceType": "Instance",
            "EventTime": 1745858447412,
            "EventDetails": {
                "EventMetadata": {
                    "Instance": {
                        "LcsExecutionState": "Started"
                    }
                }
            },
            "Description": "Instance lifecycle script execution for EC2InstanceId i-0391f86fa0fe0d465 has Started"
        }
    }
}
```

## SageMaker HyperPod cluster node health
HyperPod cluster node health

Indicates when HyperPod detects unhealthy nodes or when unhealthy nodes transition to a healthy state.

```
{
    "version": "0",
    "id": "0bd4a141-0a02-9d8a-f977-3924c3fb259c",
    "detail-type": "SageMaker HyperPod Cluster Node Health Event",
    "source": "aws.sagemaker",
    "account": "111122223333",
    "time": "2021-10-25T01:52:12Z",
    "region": "us-west-2",
    "resources": [
        "arn:aws:sagemaker:us-west-2:111122223333:cluster/sample-cluster"
    ],
    "detail": {
        "ClusterName": "sample-cluster",
        "ClusterArn": "arn:aws:sagemaker:us-west-2:111122223333:cluster/sample-cluster",
        "InstanceId": "i-12345678abcdefghi",
        "Tags": {},
        "HealthSummary": {
            "HealthStatus": "Unhealthy",
            "HealthStatusReason": "HyperPod Health Monitoring Agent (HMA) has detected fault type NvidiaErrorTerminate on this node and is unhealthy.",
            "RepairAction": "None",
            "Recommendation": "Please Replace the Faulty Node."
        }
    }
}
```
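
A target for this event might extract only the fields needed to alert an operator. The following Python sketch assumes the event shape shown in the preceding example; `unhealthy_node_alert` is a hypothetical helper, and it checks only for the `Unhealthy` value that appears in that example.

```python
def unhealthy_node_alert(event):
    """Return alert fields for an unhealthy node, or None for any other event."""
    if event.get("detail-type") != "SageMaker HyperPod Cluster Node Health Event":
        return None
    detail = event.get("detail", {})
    summary = detail.get("HealthSummary", {})
    if summary.get("HealthStatus") != "Unhealthy":
        return None
    return {
        "cluster": detail.get("ClusterName"),
        "instance_id": detail.get("InstanceId"),
        "reason": summary.get("HealthStatusReason"),
        "repair_action": summary.get("RepairAction"),
        "recommendation": summary.get("Recommendation"),
    }


# Trimmed version of the example event.
sample_event = {
    "detail-type": "SageMaker HyperPod Cluster Node Health Event",
    "detail": {
        "ClusterName": "sample-cluster",
        "InstanceId": "i-12345678abcdefghi",
        "HealthSummary": {
            "HealthStatus": "Unhealthy",
            "RepairAction": "None",
            "Recommendation": "Please Replace the Faulty Node.",
        },
    },
}

alert = unhealthy_node_alert(sample_event)
print(alert)
```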

## SageMaker HyperPod cluster state change
HyperPod cluster state change

Indicates a change in the state of a SageMaker HyperPod cluster. For more information, see the [DescribeCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeCluster.html#API_DescribeCluster_ResponseSyntax) API reference.

```
{
   "version": "0",
   "id": "0bd4a141-0a02-9d8a-f977-3924c3fb259c",
   "detail-type": "SageMaker HyperPod Cluster State Change",
   "source": "aws.sagemaker",
   "account": "111122223333",
   "time": "2025-04-28T16:59:01Z",
   "region": "us-west-2",
   "resources": [
      "arn:aws:sagemaker:us-west-2:111122223333:cluster/sample-cluster"
   ],
   "detail": {
      "ClusterArn": "arn:aws:sagemaker:us-west-2:111122223333:cluster/sample-cluster",
      "ClusterName": "sample-cluster",
      "ClusterStatus": "InService",
      "CreationTime": 1745858447412,
      "FailureMessage": "",
      "InstanceGroups": [
         {
            "CurrentCount": 1,
            "ExecutionRole": "arn:aws:iam::111122223333:role/sagemaker-hyperpod-AmazonSagemakerClusterExecutionR-123OTacPcKk1",
            "InstanceGroupName": "example instance group name",
            "InstanceStorageConfigs": [
               {}
            ],
            "InstanceType": "ml.t3.medium",
            "LifeCycleConfig": {
               "OnCreate": "on_create.sh",
               "SourceS3Uri": "s3://sagemaker-hyperpod//LifeCycleScripts/base-config/provisioning_parameters.json"
            },
            "OnStartDeepHealthChecks": [
               "example health checks"
            ],
            "OverrideVpcConfig": {
               "SecurityGroupIds": [
                  "SecurityGroupId1"
               ],
               "Subnets": [
                  "Subnet1"
               ]
            },
            "Status": "Failed",
            "TargetCount": 2,
            "ThreadsPerCore": 2,
            "TrainingPlanArn": "arn:aws:sagemaker:us-west-2:111122223333:training-plan/large-models-fine-tuning",
            "TrainingPlanStatus": "NotApplicable"
         }
      ],
      "NodeRecovery": "Automatic",
      "Orchestrator": {
         "Eks": {
            "ClusterArn": "arn:aws:eks:us-west-2:111122223333:cluster/my-hyperPod-eks-cluster"
         }
      },
      "VpcConfig": {
         "SecurityGroupIds": [
            "SecurityGroupId2"
         ],
         "Subnets": [
            "Subnet2"
         ]
      }
   }
}
```

## SageMaker image state change
Image state change

Indicates a change in the status of a SageMaker image.

```
{
  "version": "0",
  "id": "cee033a3-17d8-49f8-865f-b9ebf485d9ee",
  "detail-type": "SageMaker Image State Change",
  "source": "aws.sagemaker",
  "account": "111122223333",
  "time": "2021-04-29T01:29:59Z",
  "region": "us-east-1",
  "resources": ["arn:aws:sagemaker:us-west-2:111122223333:image/cee033a3-17d8-49f8-865f-b9ebf485d9ee"],
  "detail": {
    "ImageName": "cee033a3-17d8-49f8-865f-b9ebf485d9ee",
    "ImageArn": "arn:aws:sagemaker:us-west-2:111122223333:image/cee033a3-17d8-49f8-865f-b9ebf485d9ee",
    "ImageStatus": "Creating",
    "Version": 1.0,
    "Tags": {}
  }
}
```

## SageMaker image version state change
Image version state change

Indicates a change in the status of a SageMaker image version.

```
{
  "version": "0",
  "id": "07fc4615-ebd7-15fc-1746-243411f09f04",
  "detail-type": "SageMaker Image Version State Change",
  "source": "aws.sagemaker",
  "account": "111122223333",
  "time": "2021-04-29T01:29:59Z",
  "region": "us-east-1",
  "resources": ["arn:aws:sagemaker:us-west-2:111122223333:image-version/07800032-2d29-48b7-8f82-5129225b2a85"],
  "detail": {
    "ImageArn": "arn:aws:sagemaker:us-west-2:111122223333:image/a70ff896-c832-4fe8-add6-eba25a0f43e6",
    "ImageVersionArn": "arn:aws:sagemaker:us-west-2:111122223333:image-version/07800032-2d29-48b7-8f82-5129225b2a85",
    "ImageVersionStatus": "Creating",
    "Version": 1.0,
    "Tags": {}
  }
}
```

## SageMaker model card state change
Model card state change

Indicates a change in the status of an Amazon SageMaker model card. For more information about model cards, see [Amazon SageMaker Model Cards](model-cards.md).

```
{
    "version": "0",
    "id": "aa7a9c4f-2caa-4d04-a6de-e67227ba4302",
    "detail-type": "SageMaker Model Card State Change",
    "source": "aws.sagemaker",
    "account": "111122223333",
    "time": "2022-11-30T00:00:00Z",
    "region": "us-east-1",
    "resources": [
        "arn:aws:sagemaker:us-east-1:111122223333:model-card/example-card"
    ],
    "detail": {
        "ModelCardVersion": 2,
        "LastModifiedTime": "2022-12-03T00:09:44.893854735Z",
        "LastModifiedBy": {
            "DomainId": "us-east-1",
            "UserProfileArn": "arn:aws:sagemaker:us-east-1:111122223333:user-profile/user",
            "UserProfileName": "user"
        },
        "CreationTime": "2022-12-03T00:09:33.084Z",
        "CreatedBy": {
            "DomainId": "us-east-1",
            "UserProfileArn": "arn:aws:sagemaker:us-east-1:111122223333:user-profile/user",
            "UserProfileName": "user"
        },
        "ModelCardName": "example-card",
        "ModelId": "example-model",
        "ModelCardStatus": "Draft",
        "AccountId": "111122223333",
        "SecurityConfig": {}
    }
}
```

## SageMaker model package state change
Model package state change

Indicates a change in the status of a SageMaker model package.

```
{
  "version": "0",
  "id": "844e2571-85d4-695f-b930-0153b71dcb42",
  "detail-type": "SageMaker Model Package State Change",
  "source": "aws.sagemaker",
  "account": "111122223333",
  "time": "2021-02-24T17:00:14Z",
  "region": "us-east-2",
  "resources": [
    "arn:aws:sagemaker:us-east-2:111122223333:model-package/versionedmp-p-idy6c3e1fiqj/2"
  ],
  "detail": {
    "ModelPackageGroupName": "versionedmp-p-idy6c3e1fiqj",
    "ModelPackageVersion": 2,
    "ModelPackageArn": "arn:aws:sagemaker:us-east-2:111122223333:model-package/versionedmp-p-idy6c3e1fiqj/2",
    "CreationTime": "2021-02-24T17:00:14Z",
    "InferenceSpecification": {
      "Containers": [
        {
          "Image": "257758044811.dkr.ecr.us-east-2.amazonaws.com/sagemaker-xgboost:1.0-1-cpu-py3",
          "ImageDigest": "sha256:4dc8a7e4a010a19bb9e0a6b063f355393f6e623603361bd8b105f554d4f0c004",
          "ModelDataUrl": "s3://sagemaker-project-p-idy6c3e1fiqj/versionedmp-p-idy6c3e1fiqj/AbaloneTrain/pipelines-4r83jejmhorv-TrainAbaloneModel-xw869y8C4a/output/model.tar.gz"
        }
      ],
      "SupportedContentTypes": [
        "text/csv"
      ],
      "SupportedResponseMIMETypes": [
        "text/csv"
      ]
    },
    "ModelPackageStatus": "Completed",
    "ModelPackageStatusDetails": {
      "ValidationStatuses": [],
      "ImageScanStatuses": []
    },
    "CertifyForMarketplace": false,
    "ModelApprovalStatus": "Rejected",
    "MetadataProperties": {
      "GeneratedBy": "arn:aws:sagemaker:us-east-2:111122223333:pipeline/versionedmp-p-idy6c3e1fiqj/execution/4r83jejmhorv"
    },
    "ModelMetrics": {
      "ModelQuality": {
        "Statistics": {
          "ContentType": "application/json",
          "S3Uri": "s3://sagemaker-project-p-idy6c3e1fiqj/versionedmp-p-idy6c3e1fiqj/script-2021-02-24-10-55-15-413/output/evaluation/evaluation.json"
        }
      }
    },
    "ModelLifeCycle": {
      "Stage": "Development",
      "StageStatus": "Approved",
      "StageDescription": "StageDescription"
    },
    "UpdatedModelPackageFields": [
      "ModelLifeCycle"
      # Other possible values are
      # "ModelApprovalStatus", "ApprovalDescription", "sourceUri", "CustomerMetadataProperties", "InferenceSpecification"
    ],
    "LastModifiedTime": "2021-02-24T17:00:14Z"
  }
}
```

## SageMaker model state change
Model state change

 Indicates a change in the state of a SageMaker AI model. The state changes when a SageMaker AI model is either created or deleted. 

```
{
  "source": ["aws.sagemaker"],
  "detail-type": ["SageMaker Model State Change"],
  "Resources" : ["arn:aws:sagemaker:us-east-1:111122223333:model/model-name"]
}
```

The preceding example is an event pattern. If you specify a model under `Resources`, an event is generated and sent to EventBridge when the state of that model changes. If you do not specify a value for `Resources`, an event is generated when the state of any SageMaker AI model associated with your account changes.

## SageMaker pipeline execution state change
Pipeline execution state change

Indicates a change in the status of a SageMaker pipeline execution.

`currentPipelineExecutionStatus` and `previousPipelineExecutionStatus` can be one of the following values:
+ Executing
+ Succeeded
+ Failed
+ Stopping
+ Stopped

```
{
  "version": "0",
  "id": "315c1398-40ff-a850-213b-158f73kd93ir",
  "detail-type": "SageMaker Model Building Pipeline Execution Status Change",
  "source": "aws.sagemaker",
  "account": "111122223333",
  "time": "2021-03-15T16:10:11Z",
  "region": "us-east-1",
  "resources": ["arn:aws:sagemaker:us-east-1:111122223333:pipeline/myPipeline-123", "arn:aws:sagemaker:us-east-1:111122223333:pipeline/myPipeline-123/execution/p4jn9xou8a8s"],
  "detail": {
    "pipelineExecutionDisplayName": "SomeDisplayName",
    "currentPipelineExecutionStatus": "Succeeded",
    "previousPipelineExecutionStatus": "Executing",
    "executionStartTime": "2021-03-15T16:03:13Z",
    "executionEndTime": "2021-03-15T16:10:10Z",
    "pipelineExecutionDescription": "SomeDescription",
    "pipelineArn": "arn:aws:sagemaker:us-east-1:111122223333:pipeline/myPipeline-123",
    "pipelineExecutionArn": "arn:aws:sagemaker:us-east-1:111122223333:pipeline/myPipeline-123/execution/p4jn9xou8a8s"
  }
}
```
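
As one example of using these fields, the following Python sketch computes how long a pipeline execution ran. It assumes that `executionStartTime` and `executionEndTime` always use the UTC format shown in the preceding example; the function names are hypothetical.

```python
from datetime import datetime, timezone


def parse_utc(timestamp):
    """Parse a timestamp such as 2021-03-15T16:03:13Z into an aware datetime."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def execution_duration_seconds(event):
    """Return the execution duration in seconds, or None if a time is missing."""
    detail = event.get("detail", {})
    start = detail.get("executionStartTime")
    end = detail.get("executionEndTime")
    if not start or not end:
        # For example, an execution that is still running has no end time.
        return None
    return (parse_utc(end) - parse_utc(start)).total_seconds()


# Trimmed version of the example event.
sample_event = {
    "detail-type": "SageMaker Model Building Pipeline Execution Status Change",
    "detail": {
        "currentPipelineExecutionStatus": "Succeeded",
        "executionStartTime": "2021-03-15T16:03:13Z",
        "executionEndTime": "2021-03-15T16:10:10Z",
    },
}

duration = execution_duration_seconds(sample_event)
print(duration)  # 417.0
```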

## SageMaker pipeline step state change
Pipeline step state change

Indicates a change in the status of a SageMaker pipeline step.

If there is a cache hit, the event contains the `cacheHitResult` field. `currentStepStatus` and `previousStepStatus` can be one of the following values:
+ Starting
+ Executing
+ Succeeded
+ Failed
+ Stopping
+ Stopped

If the value of `currentStepStatus` is `Failed`, the event contains the `failureReason` field, which provides a description of why the step failed.

```
{
  "version": "0",
  "id": "ea37ccbb-5e2b-05e9-4073-1daazc940304",
  "detail-type": "SageMaker Model Building Pipeline Execution Step Status Change",
  "source": "aws.sagemaker",
  "account": "111122223333",
  "time": "2021-03-15T16:10:10Z",
  "region": "us-east-1",
  "resources": ["arn:aws:sagemaker:us-east-1:111122223333:pipeline/myPipeline-123", "arn:aws:sagemaker:us-east-1:111122223333:pipeline/myPipeline-123/execution/p4jn9xou8a8s"],
  "detail": {
    "metadata": {
      "processingJob": {
        "arn": "arn:aws:sagemaker:us-east-1:111122223333:processing-job/pipelines-p4jn9xou8a8s-myprocessingstep1-tmgxry49ug"
      }
    },
    "stepStartTime": "2021-03-15T16:03:14Z",
    "stepEndTime": "2021-03-15T16:10:09Z",
    "stepName": "myprocessingstep1",
    "stepType": "Processing",
    "previousStepStatus": "Executing",
    "currentStepStatus": "Succeeded",
    "pipelineArn": "arn:aws:sagemaker:us-east-1:111122223333:pipeline/myPipeline-123",
    "pipelineExecutionArn": "arn:aws:sagemaker:us-east-1:111122223333:pipeline/myPipeline-123/execution/p4jn9xou8a8s"
  }
}
```
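
The following Python sketch shows one way a target might summarize a step event, including the optional `failureReason` and `cacheHitResult` fields described above. The function name `summarize_step` is hypothetical, and the sketch checks only whether `cacheHitResult` is present because its contents are not shown here.

```python
def summarize_step(event):
    """Return a one-line summary of a pipeline step state change."""
    detail = event.get("detail", {})
    status = detail.get("currentStepStatus")
    summary = f"{detail.get('stepName')}: {detail.get('previousStepStatus')} -> {status}"
    if status == "Failed":
        # failureReason describes why the step failed.
        summary += f" ({detail.get('failureReason', 'no failureReason provided')})"
    if "cacheHitResult" in detail:
        summary += " [cache hit]"
    return summary


# Trimmed version of the example event.
sample_event = {
    "detail-type": "SageMaker Model Building Pipeline Execution Step Status Change",
    "detail": {
        "stepName": "myprocessingstep1",
        "stepType": "Processing",
        "previousStepStatus": "Executing",
        "currentStepStatus": "Succeeded",
    },
}

summary = summarize_step(sample_event)
print(summary)
```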

## SageMaker processing job state change
Processing job state change

Indicates a change in the status of a SageMaker processing job.

The following example event is for a failed processing job, where the `ProcessingJobStatus` value is `Failed`.

```
{
  "version": "0",
  "id": "0a15f67d-aa23-0123-0123-01a23w89r01t",
  "detail-type": "SageMaker Processing Job State Change",
  "source": "aws.sagemaker",
  "account": "111122223333",
  "time": "2019-05-31T21:49:54Z",
  "region": "us-east-1",
  "resources": ["arn:aws:sagemaker:us-west-2:111122223333:processing-job/integ-test-analytics-algo-54ee3282-5899-4aa3-afc2-7ce1d02"],
  "detail": {
    "ProcessingInputs": [{
      "InputName": "InputName",
      "S3Input": {
        "S3Uri": "s3://input/s3/uri",
        "LocalPath": "/opt/ml/processing/input/local/path",
        "S3DataType": "MANIFEST_FILE",
        "S3InputMode": "PIPE",
        "S3DataDistributionType": "FULLYREPLICATED"
      }
    }],
    "ProcessingOutputConfig": {
      "Outputs": [{
        "OutputName": "OutputName",
        "S3Output": {
          "S3Uri": "s3://output/s3/uri",
          "LocalPath": "/opt/ml/processing/output/local/path",
          "S3UploadMode": "CONTINUOUS"
        }
      }],
      "KmsKeyId": "KmsKeyId"
    },
    "ProcessingJobName": "integ-test-analytics-algo-54ee3282-5899-4aa3-afc2-7ce1d02",
    "ProcessingResources": {
      "ClusterConfig": {
        "InstanceCount": 3,
        "InstanceType": "ml.c5.xlarge",
        "VolumeSizeInGB": 5,
        "VolumeKmsKeyId": "VolumeKmsKeyId"
      }
    },
    "StoppingCondition": {
      "MaxRuntimeInSeconds": 2000
    },
    "AppSpecification": {
      "ImageUri": "012345678901.dkr.ecr.us-west-2.amazonaws.com/processing-uri:latest"
    },
    "NetworkConfig": {
      "EnableInterContainerTrafficEncryption": true,
      "EnableNetworkIsolation": false,
      "VpcConfig": {
        "SecurityGroupIds": ["SecurityGroupId1", "SecurityGroupId2", "SecurityGroupId3"],
        "Subnets": ["Subnet1", "Subnet2"]
      }
    },
    "RoleArn": "arn:aws:iam::111122223333:role/SageMakerPowerUser",
    "ExperimentConfig": {},
    "ProcessingJobArn": "arn:aws:sagemaker:us-west-2:111122223333:processing-job/integ-test-analytics-algo-54ee3282-5899-4aa3-afc2-7ce1d02",
    "ProcessingJobStatus":"Failed",
    "FailureReason":"InternalServerError: We encountered an internal error.  Please try again.",
    "ProcessingEndTime":1704320746000,
    "ProcessingStartTime":1704320734000,
    "LastModifiedTime":1704320746000,
    "CreationTime":1704320199000
  }
}
```

## SageMaker training job state change
Training job state change

Indicates a change in the status of a SageMaker training job.

If the value of `TrainingJobStatus` is `Failed`, the event contains the `FailureReason` field, which provides a description of why the training job failed.

```
{
    "version": "0",
    "id": "844e2571-85d4-695f-b930-0153b71dcb42",
    "detail-type": "SageMaker Training Job State Change",
    "source": "aws.sagemaker",
    "account": "111122223333",
    "time": "2018-10-06T12:26:13Z",
    "region": "us-east-1",
    "resources": [
        "arn:aws:sagemaker:us-east-1:111122223333:training-job/kmeans-1"
    ],
    "detail": {
        "TrainingJobName": "89c96cc8-dded-4739-afcc-6f1dc936701d",
        "TrainingJobArn": "arn:aws:sagemaker:us-east-1:111122223333:training-job/kmeans-1",
        "TrainingJobStatus": "Completed",
        "SecondaryStatus": "Completed",
        "HyperParameters": {
            "Hyper": "Parameters"
        },
        "AlgorithmSpecification": {
            "TrainingImage": "TrainingImage",
            "TrainingInputMode": "TrainingInputMode"
        },
        "RoleArn": "arn:aws:iam::111122223333:role/SMRole",
        "InputDataConfig": [
            {
                "ChannelName": "Train",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3DataType",
                        "S3Uri": "S3Uri",
                        "S3DataDistributionType": "S3DataDistributionType"
                    }
                },
                "ContentType": "ContentType",
                "CompressionType": "CompressionType",
                "RecordWrapperType": "RecordWrapperType"
            }
        ],
        "OutputDataConfig": {
            "KmsKeyId": "KmsKeyId",
            "S3OutputPath": "S3OutputPath"
        },
        "ResourceConfig": {
            "InstanceType": "InstanceType",
            "InstanceCount": 3,
            "VolumeSizeInGB": 20,
            "VolumeKmsKeyId": "VolumeKmsKeyId"
        },
        "VpcConfig": {

        },
        "StoppingCondition": {
            "MaxRuntimeInSeconds": 60
        },
        "CreationTime": "1583831889050",
        "TrainingStartTime": "1583831889050",
        "TrainingEndTime": "1583831889050",
        "LastModifiedTime": "1583831889050",
        "SecondaryStatusTransitions": [

        ],
        "Tags": {

        }
    }
}
```
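
In the preceding example, the time fields in `detail` are strings that hold epoch milliseconds. The following Python sketch converts such a value and collects the status fields, adding `FailureReason` only for failed jobs. It assumes the event shape shown above; the function names are hypothetical.

```python
from datetime import datetime, timezone


def epoch_millis_to_datetime(value):
    """Convert epoch milliseconds (an int or a string of digits) to a UTC datetime."""
    return datetime.fromtimestamp(int(value) / 1000, tz=timezone.utc)


def training_job_summary(event):
    """Return the main status fields of a training job state change event."""
    detail = event.get("detail", {})
    status = detail.get("TrainingJobStatus")
    summary = {
        "name": detail.get("TrainingJobName"),
        "status": status,
        "secondary_status": detail.get("SecondaryStatus"),
    }
    if status == "Failed":
        summary["failure_reason"] = detail.get("FailureReason")
    if detail.get("LastModifiedTime"):
        summary["last_modified"] = epoch_millis_to_datetime(detail["LastModifiedTime"])
    return summary


# Trimmed version of the example event.
sample_event = {
    "detail-type": "SageMaker Training Job State Change",
    "detail": {
        "TrainingJobName": "89c96cc8-dded-4739-afcc-6f1dc936701d",
        "TrainingJobStatus": "Completed",
        "SecondaryStatus": "Completed",
        "LastModifiedTime": "1583831889050",
    },
}

summary = training_job_summary(sample_event)
print(summary["status"], summary["last_modified"].isoformat())
```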

## SageMaker transform job state change
Transform job state change

Indicates a change in the status of a SageMaker batch transform job.

If the value of `TransformJobStatus` is `Failed`, the event contains the `FailureReason` field, which provides a description of why the transform job failed.

```
{
  "version": "0",
  "id": "844e2571-85d4-695f-b930-0153b71dcb42",
  "detail-type": "SageMaker Transform Job State Change",
  "source": "aws.sagemaker",
  "account": "111122223333",
  "time": "2018-10-06T12:26:13Z",
  "region": "us-east-1",
  "resources": ["arn:aws:sagemaker:us-east-1:111122223333:transform-job/myjob"],
  "detail": {
    "TransformJobName": "4b52bd8f-e034-4345-818d-884bdd7c9724",
    "TransformJobArn": "arn:aws:sagemaker:us-east-1:111122223333:transform-job/myjob",
    "TransformJobStatus": "Failed",
    "FailureReason": "failureReason",
    "ModelName": "modelName",
    "MaxConcurrentTransforms": 5,
    "MaxPayloadInMB": 10,
    "BatchStrategy": "MultiRecord",
    "Environment": {
      "environment1": "environment2"
    },
    "TransformInput": {
      "DataSource": {
        "S3DataSource": {
          "S3DataType": "s3DataType",
          "S3Uri": "s3Uri"
        }
      },
      "ContentType": "content type",
      "CompressionType": "compression type",
      "SplitType": "split type"
    },
    "TransformOutput": {
      "S3OutputPath": "s3Uri",
      "Accept": "accept",
      "AssembleWith": "assemblyType",
      "KmsKeyId": "kmsKeyId"
    },
    "TransformResources": {
      "InstanceType": "instanceType",
      "InstanceCount": 3
    },
    "CreationTime": "2018-10-06T12:26:13Z",
    "TransformStartTime": "2018-10-06T12:26:13Z",
    "TransformEndTime": "2018-10-06T12:26:13Z",
    "Tags": {}
  }
}
```
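
The job events in this topic report their status in differently named fields. The following Python sketch maps each job `detail-type` shown above to its status field so that a single target can handle all of them. The mapping and the function name `job_status` are illustrative assumptions based on the example events on this page.

```python
# Status field for each job event type, taken from the example events above.
STATUS_FIELDS = {
    "SageMaker Training Job State Change": "TrainingJobStatus",
    "SageMaker Transform Job State Change": "TransformJobStatus",
    "SageMaker Processing Job State Change": "ProcessingJobStatus",
    "SageMaker HyperParameter Tuning Job State Change": "HyperParameterTuningJobStatus",
}


def job_status(event):
    """Return (status, failure_reason) for a job event, or None for other events."""
    field = STATUS_FIELDS.get(event.get("detail-type"))
    if field is None:
        return None
    detail = event.get("detail", {})
    status = detail.get(field)
    # FailureReason is only meaningful when the job failed.
    reason = detail.get("FailureReason") if status == "Failed" else None
    return status, reason


sample_event = {
    "detail-type": "SageMaker Transform Job State Change",
    "detail": {
        "TransformJobStatus": "Failed",
        "FailureReason": "example failure reason",
    },
}

result = job_status(sample_event)
print(result)
```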

For more information about the status values and their meanings for SageMaker AI jobs, endpoints, and pipelines, see the following links:
+ [AlgorithmStatus](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeAlgorithm.html#sagemaker-DescribeAlgorithm-response-AlgorithmStatus)
+ [EndpointStatus](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeEndpoint.html#sagemaker-DescribeEndpoint-response-EndpointStatus)
+ [FeatureGroupStatus](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeFeatureGroup.html#sagemaker-DescribeFeatureGroup-response-FeatureGroupStatus)
+ [HyperParameterTuningJobStatus](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeHyperParameterTuningJob.html#sagemaker-DescribeHyperParameterTuningJob-response-HyperParameterTuningJobStatus)
+ [LabelingJobStatus](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeLabelingJob.html#sagemaker-DescribeLabelingJob-response-LabelingJobStatus)
+ [ModelPackageStatus](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeModelPackage.html#sagemaker-DescribeModelPackage-response-ModelPackageStatus)
+ [NotebookInstanceStatus](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeNotebookInstance.html#sagemaker-DescribeNotebookInstance-response-NotebookInstanceStatus)
+ [PipelineExecutionStatus](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribePipelineExecution.html#sagemaker-DescribePipelineExecution-response-PipelineExecutionStatus)
+ [StepStatus](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_PipelineExecutionStep.html#sagemaker-Type-PipelineExecutionStep-StepStatus)
+ [ProcessingJobStatus](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeProcessingJob.html#sagemaker-DescribeProcessingJob-response-ProcessingJobStatus)
+ [TrainingJobStatus](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeTrainingJob.html#sagemaker-DescribeTrainingJob-response-TrainingJobStatus)
+ [TransformJobStatus](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeTransformJob.html#sagemaker-DescribeTransformJob-response-TransformJobStatus)

For more information, see the [Amazon EventBridge User Guide](https://docs.aws.amazon.com/eventbridge/latest/userguide/what-is-amazon-eventbridge.html).