
Experiment planning document - AWS Prescriptive Guidance

Experiment planning document

Steady state

Process name

Pet adoption site

Physical architecture

(Link to architecture diagram.)

Logical architecture

(Link to logical diagram.)

Define steady state

Average page load time, measured by using Largest Contentful Paint (LCP), for the pet adoption site is 2.5 seconds or less, with a 99th percentile latency (P99) of 4.0 seconds or less, at a baseline of 5000 concurrent users.

Steady state metrics

The LCP metric captured across the user base, plus the golden metrics (latency, throughput, error rates, and saturation).

Steady state observability

LCP will be captured by the user's browser, sent to Amazon CloudWatch, and inspected with CloudWatch RUM. The average and P99 LCP times will be aggregated over each 60-second period for all requests in that period. Top-level golden metrics are captured by using CloudWatch.
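
The 60-second aggregation described above can be sketched in a few lines of Python. This is a minimal illustration, not the CloudWatch RUM implementation: the input format (pairs of epoch seconds and LCP seconds), the function names, and the nearest-rank percentile method are all assumptions made for the example. Only the 2.5-second and 4.0-second limits come from the steady state definition.

```python
import math
from collections import defaultdict
from statistics import mean

AVG_LCP_LIMIT_S = 2.5  # average LCP limit from the steady state definition
P99_LCP_LIMIT_S = 4.0  # P99 LCP limit from the steady state definition
WINDOW_S = 60          # aggregation period in seconds


def p99(values):
    """Return the 99th percentile by using the nearest-rank method (an assumption)."""
    ordered = sorted(values)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]


def evaluate_windows(samples):
    """Group (epoch_seconds, lcp_seconds) samples into 60-second windows.

    Returns a dict keyed by window start time with the average, the P99,
    and whether the window satisfies the steady state definition.
    """
    windows = defaultdict(list)
    for timestamp, lcp in samples:
        window_start = int(timestamp // WINDOW_S) * WINDOW_S
        windows[window_start].append(lcp)

    results = {}
    for window_start in sorted(windows):
        values = windows[window_start]
        average, tail = mean(values), p99(values)
        results[window_start] = {
            "avg": average,
            "p99": tail,
            "steady": average <= AVG_LCP_LIMIT_S and tail <= P99_LCP_LIMIT_S,
        }
    return results
```

A window is reported as steady only when both limits hold, which mirrors how the steady state is worded.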

Process to achieve steady state

Grafana K6 will be used to generate a load that simulates normal production traffic levels of approximately 5000 concurrent users.

Observability requirements

Teams should be able to view the following:

  • Steady state: What will be observed to verify that the application is under normal conditions?

  • Failure condition: How will the failure condition appear in the dashboard? For example:

    • Alarms that should be triggered

    • Logs that should be generated

  • Failure impact: What should be observed to view components that are expected to be impacted (scope of impact)?

  • Recovery: How will the recovery be viewed and measured to capture MTTR?

  • Debug: What troubleshooting details are available to investigate experiment failures?

The following table provides suggestions and examples for an observability requirements chart. You should define what should be observed based on your specific experiment.

| What needs to be observed | Link to observability tool | What is being observed |
| --- | --- | --- |
| Source of input | Grafana K6 dashboard | Running container count; requests per second |
| Overall application health | Pet adoption CloudWatch dashboard; pet adoption user experience dashboard (RUM) | Amazon EKS healthy node count; Amazon EKS node CPU utilization |
| Workflow health | Pet adoption CloudWatch dashboard | LCP time, golden metrics |
| Traces | Pet adoption X-Ray dashboard | Request latency; request count; failure count |
| Logs | Pet adoption CloudWatch Logs | Any errors encountered by the pods will be issued to CloudWatch Logs. |

Experiment definition

Experiment name

Amazon EKS PetSite pod CPU stress

Experiment source code

(Link to experiment source repository.)

Experiment description

This experiment explores how an increase in CPU usage in the PetSite application pods would affect the overall customer experience. By injecting CPU stress into each running PetSite pod, we will be able to determine whether customers are impacted and the extent of that impact.

Experiment requirements or parameters

Application load: Production average

Pod label selector: app=petsite

Experiment duration

10 minutes

Environment

Alpha test environment

Experiment target resources

PetSite application pods

Experiment baseline that is introduced through the load generating tool

  • 54% of requests have an LCP of <2.5 seconds.

  • The remaining 46% of requests have an LCP of <4 seconds.

  • No errors are observed.
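
As an illustration of how a baseline like this could be derived from raw measurements, the following sketch buckets LCP samples by the two thresholds. The function name and input format are hypothetical. The sketch also assumes that the second bucket means "at least 2.5 seconds but under 4 seconds", which is one reading of the baseline figures.

```python
def lcp_distribution(lcp_seconds, fast=2.5, slow=4.0):
    """Return the percentage of samples in each LCP bucket.

    Assumption: buckets are LCP < fast, fast <= LCP < slow, and LCP >= slow.
    """
    total = len(lcp_seconds)
    if total == 0:
        raise ValueError("at least one LCP sample is required")

    under_fast = sum(1 for value in lcp_seconds if value < fast)
    under_slow = sum(1 for value in lcp_seconds if fast <= value < slow)
    over_slow = total - under_fast - under_slow

    return {
        "under_fast_pct": 100 * under_fast / total,
        "under_slow_pct": 100 * under_slow / total,
        "over_slow_pct": 100 * over_slow / total,
    }
```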

Backoff condition

None
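
The definition above could be captured as an AWS FIS experiment template. The following is a sketch under stated assumptions, not the template used for this example. It relies on the `aws:eks:pod-cpu-stress` action and the `aws:eks:pod` resource type as we understand them from the AWS FIS documentation, so verify the parameter names before use. The cluster identifier, IAM role ARN, namespace, Kubernetes service account name, and the 60 percent load value are hypothetical placeholders. The label selector, the 10-minute duration, and the absence of a stop condition come from the definition.

```python
import uuid


def build_pod_cpu_stress_template(
    cluster_identifier,
    role_arn,
    namespace="default",                     # hypothetical namespace
    service_account="fis-service-account",   # hypothetical service account
    cpu_percent=60,                          # assumed load level
):
    """Build a request body for an AWS FIS experiment template (sketch)."""
    return {
        "clientToken": str(uuid.uuid4()),
        "description": "Amazon EKS PetSite pod CPU stress",
        "roleArn": role_arn,
        # Backoff condition: None
        "stopConditions": [{"source": "none"}],
        "targets": {
            "petsite-pods": {
                "resourceType": "aws:eks:pod",
                "selectionMode": "ALL",  # stress each running PetSite pod
                "parameters": {
                    "clusterIdentifier": cluster_identifier,
                    "namespace": namespace,
                    "selectorType": "labelSelector",
                    "selectorValue": "app=petsite",
                },
            }
        },
        "actions": {
            "cpu-stress": {
                "actionId": "aws:eks:pod-cpu-stress",
                "parameters": {
                    "duration": "PT10M",  # experiment duration: 10 minutes
                    "percent": str(cpu_percent),
                    "kubernetesServiceAccount": service_account,
                },
                "targets": {"Pods": "petsite-pods"},
            }
        },
        "tags": {"Name": "pet-adoption-pod-stress"},
    }
```

With boto3 installed and credentials configured, the returned dictionary could be passed to `boto3.client("fis").create_experiment_template(**template)`. Building the body as a plain dictionary keeps the definition reviewable and testable without AWS access.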

Hypothesis

What if

What would happen to steady state if the PetSite application pods experienced or caused more than 60% CPU utilization for 10 minutes under normal production-level traffic?

Impact

LCP times will remain under 2.5 seconds for P50 of users with P99 of 4.0 seconds or less. The consumer should be able to load the PetSite landing page.

Recovery

Detection:

  • CPU stress will be detected by alarms that are configured in CloudWatch.

  • LCP metrics will also generate alarms for the degradation of user experience.

Self-healing:

  • The distributed nature of the microservice architecture means that many instances of pods are running across multiple Availability Zones. 

  • The EKS cluster control plane will shift traffic away from the affected pods, and will launch new pods on worker nodes.

Recovery:  

When CPU utilization returns to normal, the LCP should recover automatically.
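
The LCP alarm mentioned under detection could be defined along the following lines. This is a hedged sketch: it assumes that CloudWatch RUM publishes LCP as the `WebVitalsLargestContentfulPaint` metric in the `AWS/RUM` namespace, in milliseconds, with an `application_name` dimension, so confirm those details against the CloudWatch RUM documentation. The alarm name, the three evaluation periods, and the missing-data setting are illustrative choices that are not part of the original plan.

```python
def build_lcp_p99_alarm(app_monitor_name, threshold_ms=4000.0):
    """Build keyword arguments for a CloudWatch alarm on P99 LCP (sketch)."""
    return {
        "AlarmName": f"{app_monitor_name}-lcp-p99",  # hypothetical name
        "Namespace": "AWS/RUM",
        "MetricName": "WebVitalsLargestContentfulPaint",
        "Dimensions": [
            {"Name": "application_name", "Value": app_monitor_name}
        ],
        "ExtendedStatistic": "p99",
        "Period": 60,                # matches the 60-second aggregation
        "EvaluationPeriods": 3,      # assumption: alarm after 3 breaching periods
        "Threshold": threshold_ms,   # 4.0 seconds expressed in milliseconds
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # assumption
    }
```

The dictionary matches the keyword arguments of the boto3 CloudWatch client's `put_metric_alarm` call, so it could be applied with `boto3.client("cloudwatch").put_metric_alarm(**alarm)`.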

Experiment process

Tailor the following example step-by-step process to your specific experiment:

  1. Validate access to, and functionality of, all Amazon CloudWatch, CloudWatch RUM, and AWS X-Ray dashboards.

  2. Validate the health of the application environment:

    1. Confirm that the EKS cluster is healthy by using the CloudWatch dashboard.

    2. Visit the test pet adoption site application deployment at the example URL.

  3. Initiate a load to achieve steady state:

    1. Confirm that the load generator is running and simulating the baseline load of 5000 concurrent users.

    2. Allow 5 minutes for the application to reach its steady state.

    3. Confirm the steady state of the application by using the CloudWatch RUM dashboard.

  4. Initiate a fault (experiment):

    1. Open the AWS FIS console.

    2. Run the pet-adoption-pod-stress experiment.

    3. Confirm that the experiment is running in the console.

  5. Observe the impact of the fault on your application:

    1. Capture screenshots from the CloudWatch RUM and CloudWatch dashboards, and note any anomalous data points.

    2. After the experiment has completed in AWS FIS, capture additional screenshots to record if the application returns to steady state in the absence of stress, and note any anomalies in the data points.

    3. If the steady state doesn't resume, take steps to recover the application and record the steps taken.

  6. Validate that the environment has returned to normal:

    • Review all business, user experience, application, and infrastructure metrics to verify that the system has returned to a known state. Capture dashboard screenshots if helpful.

Experiment timeline

Make sure that you capture the timeline of the end-to-end experiment, starting with load generation, injection of the fault, observation of impact, and recovery of the application, and ending when you stop the load generation. This is illustrated in the following example.

(Figure: Example timeline for a chaos experiment.)
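
One way to record the timeline is as a small data structure that holds the five milestones named above and derives durations from them. The class and field names in this sketch are hypothetical, and the time-to-recover definition (from first observed impact to restored steady state) is one possible convention for reporting MTTR, not one prescribed by this guide.

```python
from dataclasses import dataclass, fields
from datetime import datetime, timedelta


@dataclass
class ExperimentTimeline:
    """End-to-end milestones of one experiment run, in chronological order."""

    load_started: datetime
    fault_injected: datetime
    impact_observed: datetime
    steady_state_restored: datetime
    load_stopped: datetime

    def __post_init__(self):
        # Reject timelines whose milestones are out of order.
        moments = [getattr(self, f.name) for f in fields(self)]
        if moments != sorted(moments):
            raise ValueError("timeline milestones are out of order")

    def time_to_detect(self) -> timedelta:
        """Time from fault injection to the first observed impact."""
        return self.impact_observed - self.fault_injected

    def time_to_recover(self) -> timedelta:
        """Time from the first observed impact to restored steady state."""
        return self.steady_state_restored - self.impact_observed

    def total_duration(self) -> timedelta:
        """Time from the start to the end of load generation."""
        return self.load_stopped - self.load_started
```

Recording timestamps this way makes runs comparable across repetitions of the same experiment.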

Experiment results

Experiment run ID

PET-ADOPT-EXP-23

Experiment results

(Link to experiment results.)

Identified defects

  • The Kubernetes cluster didn't detect the CPU impairment of the PetSite pods, so it didn't schedule additional deployments.

  • There was no increase in 4XX or 5XX error rates as a result of this experiment.

  • We need to adjust the health check of the pod to account for impact to LCP when there are resource constraints.