Health checks for Managed Endpoints - AWS RTB Fabric

Health checks for Managed Endpoints

RTB Fabric health checks let you configure application-level health checking for Auto Scaling group (ASG)-backed responder endpoints. When enabled, RTB Fabric continuously probes each Amazon EC2 instance in your ASGs using HTTP or HTTPS health checks and routes traffic only to instances that are confirmed healthy. This reduces error rates during instance lifecycle events such as deployments, scaling events, instance failures, and decommissioning.

Without health checks, RTB Fabric routes traffic to all running instances in your ASGs regardless of application readiness. Instances that are booting, deploying, decommissioning, or experiencing application failures still receive traffic, which can cause elevated error rates for your bidding traffic.

Note

Health checks are available only for Auto Scaling group Managed Endpoints on responder gateways. EKS endpoints and domain name endpoints do not support this feature.

How health checks work

When you enable health checks, RTB Fabric performs the following operations:

  1. Instance discovery – RTB Fabric periodically queries your Auto Scaling groups to discover the current set of Amazon EC2 instance IP addresses, grouped by Availability Zone.

  2. Health probing – RTB Fabric sends HTTP or HTTPS requests to each instance using your configured port, path, protocol, and timeout. An instance is considered healthy if the response status code matches your configured statusCodeMatcher.

  3. Health status determination – RTB Fabric maintains a per-instance health status with threshold-based transitions. An instance must pass a consecutive number of probes (defined by healthyThresholdCount) to be marked healthy, and fail a consecutive number of probes (defined by unhealthyThresholdCount) to be marked unhealthy. This prevents single transient failures from removing instances from rotation.

  4. Traffic routing – RTB Fabric routes bid request traffic only to instances that are confirmed healthy. Healthy instances are tracked per Availability Zone, enabling AZ-aware routing decisions.

Health checks are distributed across multiple hosts, which use a consensus mechanism to determine target health. As a result, your instances may receive probes more often than the configured intervalSeconds alone would suggest.
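The threshold-based status determination in step 3 can be sketched as a small state machine. This is an illustrative sketch only, not RTB Fabric's implementation: the class name InstanceHealth, the record_probe method, and the choice to start a new instance as unhealthy are assumptions made for the example. Only the two threshold settings and their defaults come from this page.

```python
from dataclasses import dataclass


@dataclass
class InstanceHealth:
    """Tracks one instance's status using consecutive-probe thresholds."""

    healthy_threshold_count: int = 5    # documented default
    unhealthy_threshold_count: int = 2  # documented default
    healthy: bool = False               # assumption: new instances start unhealthy
    _streak: int = 0                    # consecutive probes contradicting the current status

    def record_probe(self, passed: bool) -> bool:
        """Record one probe result and return the resulting health status."""
        if passed == self.healthy:
            # The result agrees with the current status, so any opposing streak is broken.
            self._streak = 0
            return self.healthy
        self._streak += 1
        needed = (self.healthy_threshold_count if passed
                  else self.unhealthy_threshold_count)
        if self._streak >= needed:
            # Enough consecutive contradicting probes: flip the status.
            self.healthy = passed
            self._streak = 0
        return self.healthy


tracker = InstanceHealth(healthy_threshold_count=3, unhealthy_threshold_count=2)
results = [tracker.record_probe(p) for p in [True, True, True, False, True, False, False]]
print(results)  # [False, False, True, True, True, True, False]
```

In the sample run, the single failed probe after the instance becomes healthy does not change its status. Only two consecutive failures do.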

Key behaviors

  • Threshold-based transitions – Once healthy, an instance must fail for unhealthyThresholdCount consecutive probes before being marked unhealthy. A single failed probe does not remove an instance from rotation. Similarly, an unhealthy instance must pass healthyThresholdCount consecutive probes before being marked healthy again.

  • Fail-open – If health checking is temporarily unavailable, or if all your instances are deemed unhealthy, RTB Fabric falls back to routing traffic to all discovered instances rather than routing to no instances. This ensures your bidding traffic continues to flow.

  • Configuration updates – Changes to health check parameters (such as interval, thresholds, port, or path) take effect automatically after you update the responder gateway.
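The fail-open behavior described above can be sketched as a target-selection function. This is a simplified illustration under stated assumptions: the (ip, availability_zone, healthy) tuple shape, the routable_targets name, and the health_data_available flag are hypothetical and do not reflect RTB Fabric internals.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical record shape: (ip_address, availability_zone, healthy)
Target = Tuple[str, str, bool]


def routable_targets(targets: Iterable[Target],
                     health_data_available: bool = True) -> Dict[str, List[str]]:
    """Return routable IPs grouped by Availability Zone, failing open when needed."""
    all_targets = list(targets)
    healthy = [t for t in all_targets if t[2]]
    # Fail open: with no health data, or with nothing healthy,
    # fall back to every discovered instance instead of routing to none.
    chosen = healthy if (health_data_available and healthy) else all_targets
    by_az: Dict[str, List[str]] = defaultdict(list)
    for ip, az, _ in chosen:
        by_az[az].append(ip)
    return dict(by_az)


discovered = [("10.0.1.5", "us-east-1a", True), ("10.0.2.7", "us-east-1b", False)]
print(routable_targets(discovered))  # {'us-east-1a': ['10.0.1.5']}
```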

Enabling health checks

To enable health checks, add a healthCheckConfig object to the autoScalingGroups configuration when creating or updating a responder gateway. The presence of healthCheckConfig is the enablement signal – when omitted, no active health checking is performed.

Important

You must explicitly specify port and path so that RTB Fabric sends health check probes to the correct endpoint on your instances.

Enabling health checks (console)

To enable health checks when creating a responder gateway
  1. Sign in to the AWS Management Console and open the RTB Fabric console at https://console.aws.amazon.com/rtbfabric.

  2. In the navigation pane, choose Responder gateways.

  3. Choose Create responder gateway.

  4. In the Responder endpoint configuration section, choose Auto Scaling group as the endpoint type.

  5. Configure the Auto Scaling group settings (group names, IAM role, port, and protocol).

  6. In the Health check configuration section, configure the health check settings:

    1. For Health check port, enter the port on each instance where your health check endpoint is available.

    2. For Health check path, enter the HTTP path to probe (for example, /health). The path must start with /.

    3. (Optional) For Protocol, select HTTP or HTTPS. Default: HTTP.

    4. (Optional) Configure advanced settings such as timeout, interval, status code matcher, and threshold counts.

  7. Choose Create Gateway.

Create a responder gateway with health checks (AWS CLI)

$ aws rtbfabric create-responder-gateway \
    --description "Responder gateway with health checks" \
    --vpc-id vpc-01f345ad6524a6d7 \
    --subnet-ids subnet-abc12345 subnet-def67890 \
    --security-group-ids sg-12345678 \
    --port 8080 \
    --protocol HTTP \
    --managed-endpoint-configuration '{"autoScalingGroups":{"autoScalingGroupNames":["my-bidder-asg"],"roleArn":"arn:aws:iam::123456789012:role/MyASGRole","healthCheckConfig":{"port":8081,"path":"/health"}}}' \
    --client-token "unique-client-token-456" \
    --endpoint-url https://rtbfabric.us-east-1.amazonaws.com \
    --region us-east-1

When only port and path are specified, all other fields use their default values.
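To see which values apply when only port and path are supplied, the following sketch overlays a user-supplied healthCheckConfig on the defaults listed in the configuration reference on this page. The helper name effective_health_check_config is hypothetical, and the merge happens locally for illustration; the service applies its own defaults.

```python
import json

# Defaults copied from the health check configuration reference on this page.
HEALTH_CHECK_DEFAULTS = {
    "protocol": "HTTP",
    "timeoutMs": 500,
    "intervalSeconds": 5,
    "statusCodeMatcher": "200",
    "healthyThresholdCount": 5,
    "unhealthyThresholdCount": 2,
}


def effective_health_check_config(user_config: dict) -> dict:
    """Overlay user-supplied fields on the documented defaults."""
    missing = {"port", "path"} - user_config.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    # Later keys win, so user-supplied values override the defaults.
    return {**HEALTH_CHECK_DEFAULTS, **user_config}


print(json.dumps(effective_health_check_config({"port": 8081, "path": "/health"}), indent=2))
```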

Update health check configuration (AWS CLI)

Use UpdateResponderGateway to add health checks to an existing gateway or modify the health check settings.

$ aws rtbfabric update-responder-gateway \
    --gateway-id "rtb-gw-abc123def456" \
    --managed-endpoint-configuration '{"autoScalingGroups":{"autoScalingGroupNames":["my-bidder-asg"],"roleArn":"arn:aws:iam::123456789012:role/MyASGRole","healthCheckConfig":{"port":8081,"path":"/health","protocol":"HTTPS","timeoutMs":1000,"intervalSeconds":10,"statusCodeMatcher":"200-299","healthyThresholdCount":3,"unhealthyThresholdCount":3}}}' \
    --endpoint-url https://rtbfabric.us-east-1.amazonaws.com \
    --region us-east-1

Health check configuration reference

The following table describes the settings for health checks. The setting names used in the table are the API field names within the healthCheckConfig object.

| Setting | Description | Default | Range | Required |
|---|---|---|---|---|
| port | The port on each instance to probe. This should be the port where your application's health check endpoint is available, which is typically different from the RTB traffic port. | - | 80–65535 | Yes |
| path | The HTTP path to probe on each instance (for example, /health). Must start with /. | - | 1–128 characters | Yes |
| protocol | The protocol for the health check probe. | HTTP | HTTP, HTTPS | No |
| timeoutMs | The time, in milliseconds, during which no response from a target means a failed health check. | 500 | 100–5000 | No |
| intervalSeconds | The approximate amount of time, in seconds, between health checks of an individual target. | 5 | 5–60 | No |
| statusCodeMatcher | The HTTP status codes to use when checking for a successful response from a target. You can specify a single code ("200"), a comma-separated list ("200,204"), or a range ("200-299"). | "200" | 1–50 characters | No |
| healthyThresholdCount | The number of consecutive successful health checks required before an unhealthy target is considered healthy. | 5 | 2–10 | No |
| unhealthyThresholdCount | The number of consecutive failed health checks required before a healthy target is considered unhealthy. | 2 | 2–10 | No |
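The ranges in the table above can be checked locally before you call the API. The following is a minimal sketch, not the service's validation logic: the function names are hypothetical, and accepting a mix of codes and ranges in one statusCodeMatcher (such as "200,301-302") is an assumption, since this page only shows the three forms separately.

```python
# Ranges copied from the configuration reference table.
INTEGER_RANGES = {
    "port": (80, 65535),
    "timeoutMs": (100, 5000),
    "intervalSeconds": (5, 60),
    "healthyThresholdCount": (2, 10),
    "unhealthyThresholdCount": (2, 10),
}


def parse_status_code_matcher(matcher: str) -> set:
    """Expand "200", "200,204", or "200-299" into a set of integer status codes."""
    codes = set()
    for part in matcher.split(","):
        part = part.strip()
        if "-" in part:
            low, high = (int(value) for value in part.split("-", 1))
            if low > high:
                raise ValueError(f"range {part!r} is reversed")
            codes.update(range(low, high + 1))
        else:
            codes.add(int(part))  # int() raises ValueError for non-numeric input
    return codes


def validate_health_check_config(config: dict) -> list:
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    for name in ("port", "path"):
        if name not in config:
            problems.append(f"{name} is required")
    for name, (low, high) in INTEGER_RANGES.items():
        value = config.get(name)
        if value is not None and not (isinstance(value, int) and low <= value <= high):
            problems.append(f"{name} must be an integer from {low} to {high}")
    path = config.get("path")
    if path is not None and not (
        isinstance(path, str) and path.startswith("/") and len(path) <= 128
    ):
        problems.append("path must start with / and be 1-128 characters")
    if config.get("protocol", "HTTP") not in ("HTTP", "HTTPS"):
        problems.append("protocol must be HTTP or HTTPS")
    matcher = config.get("statusCodeMatcher", "200")
    if not (isinstance(matcher, str) and 1 <= len(matcher) <= 50):
        problems.append("statusCodeMatcher must be 1-50 characters")
    else:
        try:
            parse_status_code_matcher(matcher)
        except ValueError as error:
            problems.append(f"statusCodeMatcher is malformed: {error}")
    return problems


print(validate_health_check_config({"port": 8081, "path": "/health"}))  # []
```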

Security group requirements

For health checks to work, RTB Fabric must be able to reach the health check port on your instances. RTB Fabric sends health check probes through a network interface in your VPC. You must configure your instance security groups to allow inbound traffic on the health check port from your VPC CIDR.

Important

If your health check port is different from the RTB traffic port, you must add an additional inbound rule to your instance security groups for the health check port. Without this rule, all health check probes will fail and no instances will be marked healthy.

Add the following inbound rule to the security groups attached to your Amazon EC2 instances:

| Type | Protocol | Port range | Source |
|---|---|---|---|
| Custom TCP | TCP (HTTP or HTTPS based on your health check protocol configuration) | Your health check port (for example, 8081) | Your VPC CIDR (for example, 10.0.0.0/16). If your VPC has secondary CIDR blocks, include those as well. |

This is similar to how Elastic Load Balancing health checks require inbound rules on target security groups for the health check port. For more information about load balancer security group requirements, see Update the security groups for your Application Load Balancer in the Elastic Load Balancing User Guide.
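If you manage security groups programmatically, the inbound rule above can be expressed in the IpPermissions shape accepted by the Amazon EC2 AuthorizeSecurityGroupIngress API. The sketch below only builds and validates that structure; the helper name build_health_check_ingress and the rule description text are hypothetical, and the boto3 call in the trailing comment is shown for orientation and is not executed here.

```python
import ipaddress


def build_health_check_ingress(port: int, vpc_cidrs: list) -> list:
    """Build an IpPermissions list allowing TCP on the health check port from each VPC CIDR."""
    if not 1 <= port <= 65535:
        raise ValueError(f"invalid port: {port}")
    ip_ranges = []
    for cidr in vpc_cidrs:
        # IPv4Network raises ValueError for malformed CIDRs or CIDRs with host bits set.
        network = ipaddress.IPv4Network(cidr)
        ip_ranges.append({
            "CidrIp": str(network),
            "Description": "RTB Fabric health checks",  # hypothetical description text
        })
    return [{
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": ip_ranges,
    }]


# Include secondary CIDR blocks alongside the primary VPC CIDR.
permissions = build_health_check_ingress(8081, ["10.0.0.0/16", "10.1.0.0/16"])
print(permissions)

# With boto3 installed and credentials configured, the structure could be applied as:
#   boto3.client("ec2").authorize_security_group_ingress(
#       GroupId="sg-12345678", IpPermissions=permissions)
```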

Tip

Verify that your security groups allow traffic on the health check port before enabling health checks. If health check probes cannot reach your instances, all instances will be marked unhealthy. Because RTB Fabric uses fail-open behavior, traffic will fall back to all instances, but the health check feature will not provide any benefit until the security group is corrected.

Viewing health check configuration

You can view the current health check configuration for a responder gateway using the console or the AWS CLI.

To view health check configuration (console)
  1. Sign in to the AWS Management Console and open the RTB Fabric console at https://console.aws.amazon.com/rtbfabric.

  2. In the navigation pane, choose Responder gateways.

  3. Select the gateway and choose View details.

  4. The Health check configuration section displays the current settings, or indicates that health checks are not configured.

Use the following command to view the current health check configuration for a responder gateway.

$ aws rtbfabric get-responder-gateway \
    --gateway-id "rtb-gw-abc123def456" \
    --endpoint-url https://rtbfabric.us-east-1.amazonaws.com \
    --region us-east-1

If health checks are configured, the response includes a healthCheckConfig object within the managedEndpointConfiguration:

{
  "gatewayId": "rtb-gw-abc123def456",
  "managedEndpointConfiguration": {
    "autoScalingGroups": {
      "autoScalingGroupNames": ["my-bidder-asg"],
      "roleArn": "arn:aws:iam::123456789012:role/MyASGRole",
      "healthCheckConfig": {
        "port": 8081,
        "path": "/health",
        "protocol": "HTTP",
        "timeoutMs": 500,
        "intervalSeconds": 5,
        "statusCodeMatcher": "200",
        "healthyThresholdCount": 5,
        "unhealthyThresholdCount": 2
      }
    }
  }
}

If health checks are not configured, the healthCheckConfig field is absent from the response.
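Because the presence of healthCheckConfig is the only signal, a script can tell whether health checks are enabled by looking for that key in the response. The following sketch assumes the response shape shown on this page; the function name extract_health_check_config is hypothetical.

```python
import json
from typing import Optional


def extract_health_check_config(response_json: str) -> Optional[dict]:
    """Return the healthCheckConfig object, or None when health checks are not configured."""
    response = json.loads(response_json)
    asg_config = response.get("managedEndpointConfiguration", {}).get("autoScalingGroups", {})
    return asg_config.get("healthCheckConfig")


sample = """
{"gatewayId": "rtb-gw-abc123def456",
 "managedEndpointConfiguration": {"autoScalingGroups": {
   "autoScalingGroupNames": ["my-bidder-asg"],
   "healthCheckConfig": {"port": 8081, "path": "/health"}}}}
"""
config = extract_health_check_config(sample)
print("enabled" if config is not None else "not configured", config)
```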

Best practices

  • Update security groups before enabling health checks – Ensure that your instance security groups allow inbound traffic on the health check port from your VPC CIDR before you enable health checks. See Security group requirements.

  • Choose appropriate thresholds – Set unhealthyThresholdCount low to quickly remove failing instances from rotation. Set healthyThresholdCount higher to ensure instances are fully stable before receiving traffic again.

  • Match the timeout to your health endpoint – Set timeoutMs to a value that accommodates your health endpoint's typical response time with some margin.

  • Monitor healthy target IP counts – After enabling health checks, monitor the healthy-target-ip-count CloudWatch metric. A sudden drop compared to target-ip-count may indicate that health check probes are failing for your instances. For more information, see RTB Fabric metrics.
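When choosing thresholds, it can help to estimate how long a status change takes. The sketch below uses a rough approximation (threshold count multiplied by the interval, plus one probe timeout). This formula is an assumption for planning purposes, not a documented guarantee: intervalSeconds is approximate, and the distributed consensus mechanism may shorten or lengthen the actual time. The function name is hypothetical.

```python
def approx_transition_seconds(threshold_count: int,
                              interval_seconds: int,
                              timeout_ms: int = 0) -> float:
    """Rough estimate: `threshold_count` probes one interval apart, plus one probe timeout."""
    return threshold_count * interval_seconds + timeout_ms / 1000


# Documented defaults: unhealthyThresholdCount=2, healthyThresholdCount=5,
# intervalSeconds=5, timeoutMs=500.
print(approx_transition_seconds(2, 5, 500))  # 10.5 seconds to mark an instance unhealthy
print(approx_transition_seconds(5, 5))       # 25.0 seconds to mark an instance healthy again
```

Under this approximation, the defaults remove a failing instance in roughly 10 seconds and readmit a recovered instance after roughly 25 seconds.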

Troubleshooting health checks

The following are common issues with health checks and steps to resolve them.

All instances are unhealthy

Symptom: After enabling health checks, the healthy-target-ip-count metric is zero while target-ip-count shows the expected number of instances.

Possible causes:

  • Security group misconfigured – Your instance security groups do not allow inbound traffic on the health check port from your VPC CIDR. Verify the inbound rules on your instance security groups. See Security group requirements.

  • Wrong port or path – The configured port or path does not match where your health endpoint is actually listening. Verify your health check configuration using GetResponderGateway, and test the endpoint locally on an instance by running curl http://localhost:<port><path>, substituting your configured values (for example, curl http://localhost:8081/health).

  • Health endpoint returning unexpected status code – Your health endpoint returns a status code that does not match the configured statusCodeMatcher. Verify the response code by testing the endpoint locally and adjust the statusCodeMatcher if needed.

  • Timeout too short – The configured timeoutMs is shorter than your health endpoint's response time. Increase the timeout value or optimize your health endpoint's response time.
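Several of the causes above (wrong port or path, unexpected status code, timeout too short) can be checked from the instance itself with a single probe that mirrors your configuration. The following is a diagnostic sketch using only the Python standard library; the probe function and its return shape are hypothetical, and it does not reproduce how RTB Fabric issues probes. Note that urllib follows redirects, so a 3xx response is reported as the status of the final destination.

```python
import time
import urllib.error
import urllib.request

# Bypass any configured HTTP proxy: the probe targets the instance itself.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe(url: str, timeout_ms: int = 500, expected_codes=frozenset({200})) -> dict:
    """Send one GET request and report the status code, latency, and match result."""
    started = time.monotonic()
    status, error = None, None
    try:
        with _OPENER.open(url, timeout=timeout_ms / 1000) as response:
            status = response.status
    except urllib.error.HTTPError as http_error:
        status = http_error.code  # 4xx/5xx responses still carry a status code
    except OSError as os_error:   # connection refused, timeout, DNS failure
        error = str(os_error)
    elapsed_ms = round((time.monotonic() - started) * 1000)
    return {
        "status": status,
        "elapsed_ms": elapsed_ms,
        "matches": status in expected_codes,
        "error": error,
    }
```

For example, probe("http://localhost:8081/health", timeout_ms=500) returns a dictionary whose matches field is False when the endpoint is unreachable, too slow, or returns a code outside expected_codes. The error and elapsed_ms fields help distinguish between those cases.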

Some instances are unhealthy

Symptom: The healthy-target-ip-count metric is lower than target-ip-count, indicating that some instances are failing health checks.

Possible causes:

  • Instances are deploying – During rolling deployments, instances that are restarting or not yet ready may fail health checks. This is expected behavior. The healthy-target-ip-count should recover after deployment completes.

  • Application issues on specific instances – Some instances may have application-level issues. Check your application logs on the affected instances.

  • Availability Zone issues – If all unhealthy instances are in the same Availability Zone, there may be a network or infrastructure issue affecting that zone.

No healthy-target-ip-count metric

Symptom: The healthy-target-ip-count metric does not appear in CloudWatch after enabling health checks.

Possible causes:

  • Health check configuration not saved – Verify that the healthCheckConfig is present in the gateway configuration by calling GetResponderGateway. See Viewing health check configuration.

  • Gateway still activating or updating – After enabling or updating health checks, the health checking infrastructure requires a few minutes to deploy. Wait for the gateway status to return to Active and check the metric again.
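The three symptoms in this section can be told apart from the latest datapoints of the two metrics. The sketch below maps a pair of values to the matching troubleshooting heading. It assumes you have already retrieved the datapoints from CloudWatch (retrieval is out of scope here), and the triage function name is hypothetical.

```python
from typing import Optional


def triage(target_ip_count: Optional[float],
           healthy_target_ip_count: Optional[float]) -> str:
    """Map the latest metric datapoints to a troubleshooting heading on this page."""
    if healthy_target_ip_count is None:
        # The metric has no datapoints at all.
        return "No healthy-target-ip-count metric"
    if target_ip_count and healthy_target_ip_count == 0:
        return "All instances are unhealthy"
    if target_ip_count is not None and healthy_target_ip_count < target_ip_count:
        return "Some instances are unhealthy"
    return "No issue indicated"


print(triage(12, 0))     # All instances are unhealthy
print(triage(12, 9))     # Some instances are unhealthy
print(triage(12, None))  # No healthy-target-ip-count metric
```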