Set up managed Prometheus collectors for Amazon MSK
To use an Amazon Managed Service for Prometheus collector, you create a scraper that discovers and pulls metrics from your Amazon Managed Streaming for Apache Kafka cluster. You can also create a scraper that integrates with Amazon Elastic Kubernetes Service. For more information, see Integrate Amazon EKS.
Create a scraper
An Amazon Managed Service for Prometheus collector consists of a scraper that discovers and collects metrics from an Amazon MSK cluster. Amazon Managed Service for Prometheus manages the scraper for you, giving you the scalability, security, and reliability that you need, without having to manage any instances, agents, or scrapers yourself.
You can create a scraper using either the AWS API or the AWS CLI as described in the following procedures.
There are a few prerequisites for creating your own scraper:
-
You must have an Amazon MSK cluster created.
-
Configure your Amazon MSK cluster's security group to allow inbound traffic on ports 11001 (JMX Exporter) and 11002 (Node Exporter) within your Amazon VPC. The scraper must be able to reach these ports to collect Prometheus metrics.
-
The Amazon VPC in which the Amazon MSK cluster resides must have DNS enabled.
Note
The cluster will be associated with the scraper by its Amazon resource name (ARN). If you delete a cluster, and then create a new one with the same name, the ARN will be reused for the new cluster. Because of this, the scraper will attempt to collect metrics for the new cluster. You delete scrapers separately from deleting the cluster.
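The port prerequisite above can be checked before you create a scraper. The following is a minimal sketch, not an official tool: it assumes the rules are supplied in the shape of the `IpPermissions` list that `aws ec2 describe-security-groups` prints, and the function name `missing_scraper_ports` is hypothetical. It checks only protocol and port range, not where the traffic is allowed from.

```python
# Ports named in the prerequisites above.
REQUIRED_PORTS = {11001: "JMX Exporter", 11002: "Node Exporter"}


def missing_scraper_ports(ip_permissions):
    """Return the required ports that no inbound rule covers.

    `ip_permissions` is assumed to look like the `IpPermissions` list of a
    security group. Only protocol and port range are checked, not the
    traffic source (CIDR block or security group).
    """
    missing = []
    for port in sorted(REQUIRED_PORTS):
        covered = False
        for rule in ip_permissions:
            protocol = rule.get("IpProtocol")
            if protocol == "-1":
                # "-1" means all protocols and all ports.
                covered = True
            elif protocol == "tcp":
                # A missing bound defaults to -1, so the rule never matches.
                low = rule.get("FromPort", -1)
                high = rule.get("ToPort", -1)
                covered = covered or low <= port <= high
        if not covered:
            missing.append(port)
    return missing


if __name__ == "__main__":
    example_rules = [{"IpProtocol": "tcp", "FromPort": 11001, "ToPort": 11001}]
    for port in missing_scraper_ports(example_rules):
        print(f"No inbound rule for port {port} ({REQUIRED_PORTS[port]})")
```

An empty result means that both ports are covered by at least one inbound rule. You still need to confirm that the rule's source includes the subnets that the scraper uses.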
The following is a full list of the scraper operations that you can use with the AWS API:
-
Create a scraper with the CreateScraper API operation.
-
List your existing scrapers with the ListScrapers API operation.
-
Update the alias, configuration, or destination of a scraper with the UpdateScraper API operation.
-
Delete a scraper with the DeleteScraper API operation.
-
Get more details about a scraper with the DescribeScraper API operation.
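When you call these operations through the AWS CLI, the scrape configuration is passed as a base64-encoded blob (the `<base64-encoded-blob>` placeholder in the CLI examples on this page). The following sketch shows one way to produce and inspect such a value with the Python standard library. The helper names are hypothetical, and `PLACEHOLDER_CONFIG` is only a stand-in: it is not a working Amazon MSK scrape configuration.

```python
import base64

# Placeholder only: NOT a working Amazon MSK scrape configuration.
# Replace it with the Prometheus scrape configuration for your cluster.
PLACEHOLDER_CONFIG = "global:\n  scrape_interval: 30s\n"


def encode_configuration_blob(config_text):
    """Return the UTF-8 text of a scrape configuration as base64 ASCII."""
    return base64.b64encode(config_text.encode("utf-8")).decode("ascii")


def decode_configuration_blob(blob):
    """Reverse encode_configuration_blob, for example to inspect CLI output."""
    return base64.b64decode(blob).decode("utf-8")


if __name__ == "__main__":
    # Paste the printed value in place of <base64-encoded-blob>.
    print(encode_configuration_blob(PLACEHOLDER_CONFIG))
```

Decoding the `configurationBlob` value that the CLI prints for an existing scraper is a quick way to confirm which configuration is actually deployed.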
Cross-account setup
Use the following procedure to create a scraper in a cross-account setup, where the Amazon MSK cluster that you want to collect metrics from is in a different account from the Amazon Managed Service for Prometheus collector.
For example, suppose that you have two accounts: a source account, account_id_source, where the Amazon MSK cluster is located, and a target account, account_id_target, where the Amazon Managed Service for Prometheus workspace resides.
To create a scraper in a cross-account setup
-
In the source account, create a role arn:aws:iam::111122223333:role/Source and add the following trust policy.
{
  "Effect": "Allow",
  "Principal": {
    "Service": [
      "scraper.aps.amazonaws.com"
    ]
  },
  "Action": "sts:AssumeRole",
  "Condition": {
    "ArnEquals": {
      "aws:SourceArn": "arn:aws:aps:aws-region:111122223333:scraper/scraper-id"
    },
    "StringEquals": {
      "aws:SourceAccount": "111122223333"
    }
  }
}
-
For every combination of source (Amazon MSK cluster) and target (Amazon Managed Service for Prometheus workspace), create a role arn:aws:iam::444455556666:role/Target in the target account and add the following trust policy, with permissions for AmazonPrometheusRemoteWriteAccess.
{
  "Effect": "Allow",
  "Principal": {
    "AWS": "arn:aws:iam::111122223333:role/Source"
  },
  "Action": "sts:AssumeRole",
  "Condition": {
    "StringEquals": {
      "sts:ExternalId": "arn:aws:aps:aws-region:111122223333:scraper/scraper-id"
    }
  }
}
-
Create a scraper with the --role-configuration option.
aws amp create-scraper \
  --source vpcConfiguration="{subnetIds=[subnet-subnet-id],securityGroupIds=[sg-security-group-id]}" \
  --scrape-configuration configurationBlob=<base64-encoded-blob> \
  --destination ampConfiguration="{workspaceArn='arn:aws:aps:aws-region:444455556666:workspace/ws-workspace-id'}" \
  --role-configuration '{"sourceRoleArn":"arn:aws:iam::111122223333:role/Source", "targetRoleArn":"arn:aws:iam::444455556666:role/Target"}'
-
Validate the scraper creation.
aws amp list-scrapers
{
  "scrapers": [
    {
      "scraperId": "s-example123456789abcdef0",
      "arn": "arn:aws:aps:aws-region:111122223333:scraper/s-example123456789abcdef0",
      "roleArn": "arn:aws:iam::111122223333:role/Source",
      "status": "ACTIVE",
      "creationTime": "2025-10-27T18:45:00.000Z",
      "lastModificationTime": "2025-10-27T18:50:00.000Z",
      "tags": {},
      "statusReason": "Scraper is running successfully",
      "source": {
        "vpcConfiguration": {
          "subnetIds": ["subnet-subnet-id"],
          "securityGroupIds": ["sg-security-group-id"]
        }
      },
      "destination": {
        "ampConfiguration": {
          "workspaceArn": "arn:aws:aps:aws-region:444455556666:workspace/ws-workspace-id"
        }
      },
      "scrapeConfiguration": {
        "configurationBlob": "<base64-encoded-blob>"
      }
    }
  ]
}
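The two trust policy statements and the --role-configuration value in the procedure above differ only in the account IDs, Region, and scraper ID. The following sketch generates all three from those inputs so that the ARNs stay consistent. It is an illustration under stated assumptions: the function names are hypothetical, and the role names Source and Target are taken from the example above.

```python
import json


def scraper_arn(region, source_account, scraper_id):
    """Build the scraper ARN used in both trust policy statements."""
    return f"arn:aws:aps:{region}:{source_account}:scraper/{scraper_id}"


def role_arn(account, role_name):
    return f"arn:aws:iam::{account}:role/{role_name}"


def source_trust_statement(region, source_account, scraper_id):
    """Trust statement for the role in the source (Amazon MSK) account."""
    return {
        "Effect": "Allow",
        "Principal": {"Service": ["scraper.aps.amazonaws.com"]},
        "Action": "sts:AssumeRole",
        "Condition": {
            "ArnEquals": {
                "aws:SourceArn": scraper_arn(region, source_account, scraper_id)
            },
            "StringEquals": {"aws:SourceAccount": source_account},
        },
    }


def target_trust_statement(region, source_account, scraper_id, source_role="Source"):
    """Trust statement for the role in the target (workspace) account."""
    return {
        "Effect": "Allow",
        "Principal": {"AWS": role_arn(source_account, source_role)},
        "Action": "sts:AssumeRole",
        "Condition": {
            "StringEquals": {
                "sts:ExternalId": scraper_arn(region, source_account, scraper_id)
            }
        },
    }


def role_configuration(source_account, target_account,
                       source_role="Source", target_role="Target"):
    """Value for the --role-configuration option, as a dictionary."""
    return {
        "sourceRoleArn": role_arn(source_account, source_role),
        "targetRoleArn": role_arn(target_account, target_role),
    }


if __name__ == "__main__":
    # "aws-region" and "scraper-id" are the placeholders used on this page.
    args = ("aws-region", "111122223333", "scraper-id")
    print(json.dumps(source_trust_statement(*args), indent=2))
    print(json.dumps(target_trust_statement(*args), indent=2))
    print(json.dumps(role_configuration("111122223333", "444455556666")))
```

Generating the statements from one set of inputs avoids a mismatch between the `aws:SourceArn` condition in the source role and the `sts:ExternalId` condition in the target role, which must both name the same scraper.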
Changing between RoleConfiguration and service-linked role
To switch back to a service-linked role instead of the RoleConfiguration for writing to an Amazon Managed Service for Prometheus workspace, call UpdateScraper and provide a workspace in the same account as the scraper, without the RoleConfiguration. The RoleConfiguration is then removed from the scraper and the service-linked role is used.
When you change to a different workspace in the same account as the scraper and you want to continue using the RoleConfiguration, you must provide the RoleConfiguration again in the UpdateScraper call.
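The rules above can be expressed as a small helper that assembles UpdateScraper input. This is a sketch only: the function names are hypothetical, and the dictionary keys (`scraperId`, `destination`, `roleConfiguration`) are assumed to mirror the request fields used in the CLI examples on this page. The helper refuses to omit the RoleConfiguration when the workspace is in a different account than the scraper, because the service-linked role applies only to a workspace in the same account.

```python
def account_of(arn):
    """Return the account ID, the fifth colon-separated field of an ARN."""
    return arn.split(":")[4]


def update_scraper_request(scraper_arn, workspace_arn, role_configuration=None):
    """Assemble UpdateScraper input for a change of workspace.

    Omitting role_configuration switches the scraper back to the
    service-linked role, which requires a same-account workspace.
    """
    if role_configuration is None and account_of(scraper_arn) != account_of(workspace_arn):
        raise ValueError(
            "A workspace in another account requires a RoleConfiguration"
        )
    request = {
        # The scraper ID is the part of the ARN after the last slash.
        "scraperId": scraper_arn.rsplit("/", 1)[1],
        "destination": {"ampConfiguration": {"workspaceArn": workspace_arn}},
    }
    if role_configuration is not None:
        # Must be provided again on every update that should keep it.
        request["roleConfiguration"] = role_configuration
    return request


if __name__ == "__main__":
    print(update_scraper_request(
        "arn:aws:aps:aws-region:111122223333:scraper/scraper-id",
        "arn:aws:aps:aws-region:111122223333:workspace/ws-workspace-id",
    ))
```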
Find and delete scrapers
You can use the AWS API or the AWS CLI to list the scrapers in your account or to delete them.
Note
Make sure that you are using the latest version of the AWS CLI or SDK. The latest version provides you with the latest features and functionality, as well as security updates. Alternatively, use AWS CloudShell, which provides an always up-to-date command line experience, automatically.
To list all the scrapers in your account, use the ListScrapers API operation.
Alternatively, with the AWS CLI, call:
aws amp list-scrapers
ListScrapers returns all of the scrapers in your account, for example:
{
  "scrapers": [
    {
      "scraperId": "s-1234abcd-56ef-7890-abcd-1234ef567890",
      "arn": "arn:aws:aps:aws-region:123456789012:scraper/s-1234abcd-56ef-7890-abcd-1234ef567890",
      "roleArn": "arn:aws:iam::123456789012:role/aws-service-role/AWSServiceRoleForAmazonPrometheusScraper_1234abcd-2931",
      "status": {
        "statusCode": "DELETING"
      },
      "createdAt": "2023-10-12T15:22:19.014000-07:00",
      "lastModifiedAt": "2023-10-12T15:55:43.487000-07:00",
      "tags": {},
      "source": {
        "vpcConfiguration": {
          "securityGroupIds": [
            "sg-1234abcd5678ef90"
          ],
          "subnetIds": [
            "subnet-abcd1234ef567890",
            "subnet-1234abcd5678ab90"
          ]
        }
      },
      "destination": {
        "ampConfiguration": {
          "workspaceArn": "arn:aws:aps:aws-region:123456789012:workspace/ws-1234abcd-5678-ef90-ab12-cdef3456a78"
        }
      }
    }
  ]
}
To delete a scraper, find the scraperId for the scraper that you want
to delete, using the ListScrapers operation, and then use the DeleteScraper operation to delete it.
Alternatively, with the AWS CLI, call:
aws amp delete-scraper --scraper-id scraperId
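When an account has several scrapers, it can help to filter the ListScrapers output before deleting anything. The following sketch assumes the JSON shape shown in the ListScrapers example above, with `status.statusCode` and `destination.ampConfiguration.workspaceArn`. It selects the scrapers that write to a given workspace and are not already being deleted, and prints the matching CLI commands rather than running them. The function names are hypothetical.

```python
import json


def deletable_scraper_ids(list_scrapers_output, workspace_arn):
    """Return IDs of scrapers that target workspace_arn and are not DELETING."""
    ids = []
    for scraper in list_scrapers_output.get("scrapers", []):
        destination = scraper.get("destination", {}).get("ampConfiguration", {})
        status = scraper.get("status", {}).get("statusCode")
        if destination.get("workspaceArn") == workspace_arn and status != "DELETING":
            ids.append(scraper["scraperId"])
    return ids


def delete_commands(scraper_ids):
    """Format one delete-scraper command per scraper ID, for review."""
    return [f"aws amp delete-scraper --scraper-id {sid}" for sid in scraper_ids]


if __name__ == "__main__":
    # Save the CLI output first: aws amp list-scrapers > scrapers.json
    with open("scrapers.json", encoding="utf-8") as handle:
        output = json.load(handle)
    workspace = "arn:aws:aps:aws-region:123456789012:workspace/ws-workspace-id"
    for command in delete_commands(deletable_scraper_ids(output, workspace)):
        print(command)
```

Printing the commands instead of executing them leaves a review step before anything is deleted.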
Metrics collected from Amazon MSK
When you integrate with Amazon MSK, the Amazon Managed Service for Prometheus collector automatically scrapes the following metrics:
| Metric | Description / Purpose |
|---|---|
| jmx_config_reload_failure_total | Total number of times the JMX exporter failed to reload its configuration file. |
| jmx_scrape_duration_seconds | Time taken to scrape JMX metrics in seconds for the current collection cycle. |
| jmx_scrape_error | Indicates whether an error occurred during JMX metric scraping (1 = error, 0 = success). |
| java_lang_Memory_HeapMemoryUsage_used | Amount of heap memory (in bytes) currently used by the JVM. |
| java_lang_Memory_HeapMemoryUsage_max | Maximum amount of heap memory (in bytes) that can be used for memory management. |
| java_lang_Memory_NonHeapMemoryUsage_used | Amount of non-heap memory (in bytes) currently used by the JVM. |
| kafka_cluster_Partition_Value | Current state or value related to Kafka cluster partitions, broken down by partition ID and topic. |
| kafka_consumer_consumer_coordinator_metrics_assigned_partitions | Number of partitions currently assigned to this consumer. |
| kafka_consumer_consumer_coordinator_metrics_commit_latency_avg | Average time taken to commit offsets in milliseconds. |
| kafka_consumer_consumer_coordinator_metrics_commit_rate | Number of offset commits per second. |
| kafka_consumer_consumer_coordinator_metrics_failed_rebalance_total | Total number of failed consumer group rebalances. |
| kafka_consumer_consumer_coordinator_metrics_last_heartbeat_seconds_ago | Number of seconds since the last heartbeat was sent to the coordinator. |
| kafka_consumer_consumer_coordinator_metrics_rebalance_latency_avg | Average time taken for consumer group rebalances in milliseconds. |
| kafka_consumer_consumer_coordinator_metrics_rebalance_total | Total number of consumer group rebalances. |
| kafka_consumer_consumer_fetch_manager_metrics_bytes_consumed_rate | Average number of bytes consumed per second by the consumer. |
| kafka_consumer_consumer_fetch_manager_metrics_fetch_latency_avg | Average time taken for a fetch request in milliseconds. |
| kafka_consumer_consumer_fetch_manager_metrics_fetch_rate | Number of fetch requests per second. |
| kafka_consumer_consumer_fetch_manager_metrics_records_consumed_rate | Average number of records consumed per second. |
| kafka_consumer_consumer_fetch_manager_metrics_records_lag_max | Maximum lag in terms of number of records for any partition in this consumer. |
| kafka_consumer_consumer_metrics_connection_count | Current number of active connections. |
| kafka_consumer_consumer_metrics_incoming_byte_rate | Average number of bytes received per second from all servers. |
| kafka_consumer_consumer_metrics_last_poll_seconds_ago | Number of seconds since the last consumer poll() call. |
| kafka_consumer_consumer_metrics_request_rate | Number of requests sent per second. |
| kafka_consumer_consumer_metrics_response_rate | Number of responses received per second. |
| kafka_consumer_group_ConsumerLagMetrics_Value | Current consumer lag value for a consumer group, indicating how far behind the consumer is. |
| kafka_controller_KafkaController_Value | Current state or value of the Kafka controller (1 = active controller, 0 = not active). |
| kafka_controller_ControllerEventManager_Count | Total number of controller events processed. |
| kafka_controller_ControllerEventManager_Mean | Mean (average) time taken to process controller events. |
| kafka_controller_ControllerStats_MeanRate | Mean rate of controller statistics operations per second. |
| kafka_coordinator_group_GroupMetadataManager_Value | Current state or value of the group metadata manager for consumer groups. |
| kafka_log_LogFlushStats_Count | Total number of log flush operations. |
| kafka_log_LogFlushStats_Mean | Mean (average) time taken for log flush operations. |
| kafka_log_LogFlushStats_MeanRate | Mean rate of log flush operations per second. |
| kafka_network_RequestMetrics_Count | Total count of network requests processed. |
| kafka_network_RequestMetrics_Mean | Mean (average) time taken to process network requests. |
| kafka_network_RequestMetrics_MeanRate | Mean rate of network requests per second. |
| kafka_network_Acceptor_MeanRate | Mean rate of accepted connections per second. |
| kafka_server_Fetch_queue_size | Current size of the fetch request queue. |
| kafka_server_Produce_queue_size | Current size of the produce request queue. |
| kafka_server_Request_queue_size | Current size of the general request queue. |
| kafka_server_BrokerTopicMetrics_Count | Total count of broker topic operations (messages in/out, bytes in/out). |
| kafka_server_BrokerTopicMetrics_MeanRate | Mean rate of broker topic operations per second. |
| kafka_server_BrokerTopicMetrics_OneMinuteRate | One-minute moving average rate of broker topic operations. |
| kafka_server_DelayedOperationPurgatory_Value | Current number of delayed operations in the purgatory (waiting to be completed). |
| kafka_server_DelayedFetchMetrics_MeanRate | Mean rate of delayed fetch operations per second. |
| kafka_server_FetcherLagMetrics_Value | Current lag value for replica fetcher threads (how far behind the leader). |
| kafka_server_FetcherStats_MeanRate | Mean rate of fetcher operations per second. |
| kafka_server_ReplicaManager_Value | Current state or value of the replica manager. |
| kafka_server_ReplicaManager_MeanRate | Mean rate of replica manager operations per second. |
| kafka_server_LeaderReplication_byte_rate | Rate of bytes replicated per second for partitions where this broker is the leader. |
| kafka_server_group_coordinator_metrics_group_completed_rebalance_count | Total number of completed consumer group rebalances. |
| kafka_server_group_coordinator_metrics_offset_commit_count | Total number of offset commit operations. |
| kafka_server_group_coordinator_metrics_offset_commit_rate | Rate of offset commit operations per second. |
| kafka_server_socket_server_metrics_connection_count | Current number of active connections. |
| kafka_server_socket_server_metrics_connection_creation_rate | Rate of new connection creation per second. |
| kafka_server_socket_server_metrics_connection_close_rate | Rate of connection closures per second. |
| kafka_server_socket_server_metrics_failed_authentication_total | Total number of failed authentication attempts. |
| kafka_server_socket_server_metrics_incoming_byte_rate | Rate of incoming bytes per second. |
| kafka_server_socket_server_metrics_outgoing_byte_rate | Rate of outgoing bytes per second. |
| kafka_server_socket_server_metrics_request_rate | Rate of requests per second. |
| kafka_server_socket_server_metrics_response_rate | Rate of responses per second. |
| kafka_server_socket_server_metrics_network_io_rate | Rate of network I/O operations per second. |
| kafka_server_socket_server_metrics_io_ratio | Fraction of time spent in I/O operations. |
| kafka_server_controller_channel_metrics_connection_count | Current number of active connections for controller channels. |
| kafka_server_controller_channel_metrics_incoming_byte_rate | Rate of incoming bytes per second for controller channels. |
| kafka_server_controller_channel_metrics_outgoing_byte_rate | Rate of outgoing bytes per second for controller channels. |
| kafka_server_controller_channel_metrics_request_rate | Rate of requests per second for controller channels. |
| kafka_server_replica_fetcher_metrics_connection_count | Current number of active connections for replica fetcher. |
| kafka_server_replica_fetcher_metrics_incoming_byte_rate | Rate of incoming bytes per second for replica fetcher. |
| kafka_server_replica_fetcher_metrics_request_rate | Rate of requests per second for replica fetcher. |
| kafka_server_replica_fetcher_metrics_failed_authentication_total | Total number of failed authentication attempts for replica fetcher. |
| kafka_server_ZooKeeperClientMetrics_Count | Total count of ZooKeeper client operations. |
| kafka_server_ZooKeeperClientMetrics_Mean | Mean latency of ZooKeeper client operations. |
| kafka_server_KafkaServer_Value | Current state or value of the Kafka server (typically indicates server is running). |
| node_cpu_seconds_total | Total seconds the CPUs spent in each mode (user, system, idle, etc.), broken down by CPU and mode. |
| node_disk_read_bytes_total | Total number of bytes read successfully from disks, broken down by device. |
| node_disk_reads_completed_total | Total number of reads completed successfully for disks, broken down by device. |
| node_disk_writes_completed_total | Total number of writes completed successfully for disks, broken down by device. |
| node_disk_written_bytes_total | Total number of bytes written successfully to disks, broken down by device. |
| node_filesystem_avail_bytes | Available filesystem space in bytes for non-root users, broken down by device and mount point. |
| node_filesystem_size_bytes | Total size of the filesystem in bytes, broken down by device and mount point. |
| node_filesystem_free_bytes | Free filesystem space in bytes, broken down by device and mount point. |
| node_filesystem_files | Total number of file nodes (inodes) on the filesystem, broken down by device and mount point. |
| node_filesystem_files_free | Number of free file nodes (inodes) on the filesystem, broken down by device and mount point. |
| node_filesystem_readonly | Indicates whether the filesystem is mounted read-only (1 = read-only, 0 = read-write). |
| node_filesystem_device_error | Indicates whether an error occurred while getting filesystem statistics (1 = error, 0 = success). |
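Several of these metrics are most useful in combination. As an illustration only, the following sketch derives two common ratios from sample values: heap utilization from java_lang_Memory_HeapMemoryUsage_used and java_lang_Memory_HeapMemoryUsage_max, and the used fraction of a filesystem from node_filesystem_size_bytes and node_filesystem_avail_bytes. The function names are hypothetical, and the inputs are plain numbers that you have already queried from your workspace.

```python
def heap_utilization(used_bytes, max_bytes):
    """Return used/max heap as a fraction, or None if max is undefined.

    The JVM can report a maximum of -1 when no limit is defined, so
    non-positive maximums are treated as "no ratio available".
    """
    if max_bytes <= 0:
        return None
    return used_bytes / max_bytes


def filesystem_used_fraction(size_bytes, avail_bytes):
    """Return the fraction of the filesystem not available to non-root users."""
    if size_bytes <= 0:
        return None
    return 1 - avail_bytes / size_bytes


if __name__ == "__main__":
    # Sample values, not real measurements.
    print(heap_utilization(512 * 1024**2, 1024**3))
    print(filesystem_used_fraction(100 * 1024**3, 25 * 1024**3))
```

The filesystem ratio uses node_filesystem_avail_bytes rather than node_filesystem_free_bytes, so it reflects the space that non-root processes can actually use.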
Limitations
The current Amazon MSK integration with Amazon Managed Service for Prometheus has the following limitations:
-
Only supported for Amazon MSK Provisioned clusters (not available for Amazon MSK Serverless)
-
Not supported for Amazon MSK clusters with public access enabled in combination with KRaft metadata mode
-
Not supported for Amazon MSK Express brokers
-
Currently supports a 1:1 mapping between Amazon MSK clusters and Amazon Managed Service for Prometheus collectors/workspaces