
Metrics reference - Amazon Managed Streaming for Apache Kafka

Metrics reference

The following metrics describe the performance and connection health of the MSK Replicator.

AuthError metrics do not cover topic-level auth errors. To monitor your MSK Replicator's topic-level auth errors, monitor the Replicator's ReplicationLatency metric together with the source cluster's topic-level MessagesInPerSec metric. If a topic's ReplicationLatency drops to 0 while data is still being produced to the topic, the Replicator has an auth issue with that topic. Check that the Replicator's service execution IAM role has sufficient permissions to access the topic.
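The check described above can be sketched as a small helper. This is a minimal illustration, not an AWS-provided API: it assumes you have already fetched the latest per-topic ReplicationLatency datapoints (from the Replicator) and MessagesInPerSec datapoints (from the source cluster), and the function name and dictionary inputs are hypothetical.

```python
from typing import Dict, List


def topics_with_suspected_auth_issue(
    replication_latency_ms: Dict[str, float],
    messages_in_per_sec: Dict[str, float],
) -> List[str]:
    """Return topics whose ReplicationLatency is 0 while data is still produced.

    Both arguments map a topic name to its latest datapoint. They are
    hypothetical inputs; fetching them from CloudWatch is out of scope here.
    """
    suspects = []
    for topic, produced_rate in messages_in_per_sec.items():
        latency = replication_latency_ms.get(topic)
        if latency is None:
            # Assumption: a missing latency datapoint means "unknown", not 0.
            continue
        if latency == 0 and produced_rate > 0:
            suspects.append(topic)
    return sorted(suspects)
```

For any topic this returns, review the permissions of the Replicator's service execution IAM role for that topic.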

| Metric type | Metric | Description | Dimensions | Unit | Raw Metric Granularity | Raw Metric Aggregation Stat |
| --- | --- | --- | --- | --- | --- | --- |
| Performance | ReplicationLatency | Time it takes records to replicate from the source cluster to the target cluster, measured as the duration between the record's produce time at the source and its replication to the target. If ReplicationLatency increases, check whether the clusters have enough partitions to support replication. High replication latency can occur when the partition count is too low for high throughput. | ReplicatorName | Milliseconds | Partition | Maximum |
| | | | ReplicatorName, Topic | Milliseconds | Partition | Maximum |
| Performance | MessageLag | Monitors the sync between the MSK Replicator and the source cluster. MessageLag indicates the lag between the messages produced to the source cluster and the messages consumed by the Replicator. It is not the lag between the source and target clusters. Even if the source cluster is unavailable or interrupted, the Replicator finishes writing the messages it has already consumed to the target cluster. After an outage, MessageLag increases to show the number of messages the Replicator is behind the source cluster. Monitor it until it reaches 0, which shows that the Replicator has caught up with the source cluster. | ReplicatorName | Count | Partition | Sum |
| | | | ReplicatorName, Topic | Count | Partition | Sum |
| Performance | ReplicatorBytesInPerSec | Average number of bytes processed by the Replicator per second. Data processed by MSK Replicator consists of all the data it receives: the data replicated to the target cluster and, only if the Replicator is configured with the Identical topic name configuration, the data that MSK Replicator filters out to prevent it from being copied back to the topic it originated from. If the Replicator is configured with the Prefixed topic name configuration, ReplicatorBytesInPerSec and ReplicatorThroughput have the same value because no data is filtered. | ReplicatorName | BytesPerSecond | ReplicatorName | Sum |
| Performance | ReplicatorThroughput | Average number of bytes replicated per second. If ReplicatorThroughput drops for a topic, check the KafkaClusterPingSuccessCount and AuthError metrics to ensure the Replicator can communicate with the clusters, then check cluster metrics to ensure the cluster is not down. | ReplicatorName | BytesPerSecond | Partition | Sum |
| | | | ReplicatorName, Topic | BytesPerSecond | Partition | Sum |
| Performance | ReplicationFailures | Number of replication failures. Should be 0 for healthy replication. A non-zero value may indicate message size limitations, timestamp violations, or record batch size problems. | ReplicatorName | Count | | Sum |
| Debug | AuthError | The number of connections with failed authentication per second. If this metric is above 0, check that the service execution role policy for the Replicator is valid and that no deny permissions are set for the cluster permissions. Use the ClusterAlias dimension to identify whether the source or the target cluster is experiencing auth errors. | ReplicatorName, ClusterAlias | Count | Worker | Sum |
| Debug | ThrottleTime | The average time in ms a request was throttled by brokers on the cluster. Set throttling to avoid having the MSK Replicator overwhelm the cluster. If this metric is 0, ReplicationLatency is not high, and ReplicatorThroughput is as expected, then throttling is working as expected. If this metric is above 0, you can adjust throttling accordingly. | ReplicatorName, ClusterAlias | Milliseconds | Worker | Maximum |
| Debug | ReplicatorFailure | Number of failures that the Replicator is experiencing. | ReplicatorName | Count | | Sum |
| Debug | KafkaClusterPingSuccessCount | Indicates the health of the Replicator's connection to the Kafka cluster. If the value is 1, the connection is healthy. If the value is 0 or there is no datapoint, the connection is unhealthy. If the value is 0, check the network or IAM permission settings for the Kafka cluster. Use the ClusterAlias dimension to identify whether this metric is for the source or the target cluster. | ReplicatorName, ClusterAlias | Count | | Sum |
| Consumer Group | ConsumerGroupCount | Number of consumer groups being synchronized. Verify that it matches the expected number of consumer groups. | ReplicatorName | Count | | Sum |
| Consumer Group | ConsumerGroupOffsetSyncFailure | Number of consumer group offset sync failures. Should be 0. If greater than 0, check that the consumer groups are active and verify permissions. | ReplicatorName | Count | | Sum |
| Consumer Group | OffsetLag (MSK Cluster) | Partition-level consumer lag on the MSK target cluster. Compare with OffsetLag (Non-MSK Cluster) to verify that the lag is equal. | Partition | Count | | Sum |
| Consumer Group | OffsetLag (Non-MSK Cluster) | Partition-level consumer lag on the self-managed (non-MSK) source cluster. Compare with OffsetLag (MSK Cluster). | Partition | Count | | Sum |
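The guidance in the table can be combined into a simple health check. The following is a minimal sketch under stated assumptions: the function name and the input dictionary (metric name mapped to its latest datapoint, or absent when there is no datapoint) are hypothetical, and per-ClusterAlias and per-Topic dimensions are collapsed into a single value for brevity.

```python
from typing import Dict, List, Optional


def replicator_health_findings(latest: Dict[str, Optional[float]]) -> List[str]:
    """Return human-readable findings for a snapshot of Replicator metrics.

    `latest` is a hypothetical input mapping a metric name to its most
    recent datapoint; a missing key or None means "no datapoint".
    """
    findings = []

    # The table states these should be 0 (or not above 0) when healthy.
    for name in ("ReplicationFailures", "ConsumerGroupOffsetSyncFailure", "AuthError"):
        value = latest.get(name)
        if value is not None and value > 0:
            findings.append(f"{name} is {value:g}; expected 0")

    # A value of 1 means healthy; 0 or no datapoint means unhealthy.
    ping = latest.get("KafkaClusterPingSuccessCount")
    if ping is None or ping == 0:
        findings.append("KafkaClusterPingSuccessCount is 0 or missing; connection unhealthy")

    # MessageLag above 0 means the Replicator has not caught up with the source.
    lag = latest.get("MessageLag")
    if lag is not None and lag > 0:
        findings.append(f"MessageLag is {lag:g}; replicator is behind the source cluster")

    return findings
```

An empty list means none of the checked conditions were triggered. It does not replace reviewing ReplicationLatency, ThrottleTime, or the OffsetLag comparison, which depend on your workload's expected values.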