Experiment result document - AWS Prescriptive Guidance

Experiment result document

Configuration

Document the specific configurations for the experiment. For example:

  • Load generation was set to simulate 5,000 users issuing a combined total of 85 requests per second.
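
As a quick sanity check on the example load figures, the following sketch works out how often each simulated user would need to send a request to reach the combined rate. The function name is hypothetical, and the calculation assumes that requests are spread evenly across users, which the source does not state.

```python
def per_user_interval_seconds(users: int, total_rps: float) -> float:
    """Average number of seconds between requests for one simulated user.

    Assumes (for illustration only) that the total request rate is
    divided evenly across all users.
    """
    if users <= 0 or total_rps <= 0:
        raise ValueError("users and total_rps must be positive")
    return users / total_rps


# Example figures from the configuration above: 5,000 users, 85 requests per second.
interval = per_user_interval_seconds(users=5000, total_rps=85)
print(f"Each user sends roughly one request every {interval:.1f} seconds")
```

A figure like this can help when configuring the think time or pacing setting in whichever load generation tool is used.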

Prerequisites

  • Verified that the pet adoption site was running in the alpha test environment.

  • Verified that the experiment template was configured to apply CPU stress to the PetSite application pods running in the Amazon EKS cluster. The pods were identified by the Kubernetes label app=petsite.

  • Confirmed that the load generator was running and producing 85 requests per second.

Steady state

Document the steps taken to achieve the steady state and how you verified it. For example:

For the test deployment of the pet adoption site, a load of 85 requests per second (RPS) was generated to simulate steady state. The CloudWatch RUM and CloudWatch dashboards were reviewed to verify that all business and application metrics were within normal ranges before the experiment was run.

Observability data:

Expected:

  • Largest Contentful Paint (LCP) is less than 4 seconds for P99 of requests.

  • Response latency is less than 500 ms.

  • There are no 4XX or 5XX errors.

Observed (screenshots):

  • Steady state report 1 for chaos experiment.

  • Steady state report 2 for chaos experiment.
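
The expected steady-state values above can also be checked programmatically instead of by reading dashboards. The following is a minimal sketch, not part of the original template: the `SteadyStateSample` type and `steady_state_violations` function are hypothetical names, and the metric values are assumed to have been collected already (for example, exported from CloudWatch).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SteadyStateSample:
    """One observation window of the metrics named in this document."""
    lcp_p99_seconds: float
    response_latency_ms: float
    error_4xx_count: int
    error_5xx_count: int


def steady_state_violations(sample: SteadyStateSample) -> List[str]:
    """Return a list of human-readable violations; an empty list means steady state."""
    violations = []
    if sample.lcp_p99_seconds >= 4.0:
        violations.append(f"P99 LCP is {sample.lcp_p99_seconds} s (expected < 4 s)")
    if sample.response_latency_ms >= 500.0:
        violations.append(
            f"Response latency is {sample.response_latency_ms} ms (expected < 500 ms)"
        )
    errors = sample.error_4xx_count + sample.error_5xx_count
    if errors > 0:
        violations.append(f"{errors} 4XX/5XX errors observed (expected none)")
    return violations


# Illustrative values only; these are not results from a real experiment.
healthy = SteadyStateSample(2.9, 310.0, 0, 0)
degraded = SteadyStateSample(4.6, 720.0, 3, 1)
print(steady_state_violations(healthy))
print(steady_state_violations(degraded))
```

The same check can be reused during fault observation, because the expected values there are identical.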

Fault injection

AWS Fault Injection Service (AWS FIS) was used to inject faults through the experiment template (provide link). The experiment was set to run for 10 minutes and was configured to roll back if CPU utilization on the worker nodes exceeded 60 percent.
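
For orientation, the sketch below shows roughly what an experiment template matching this description might contain, built as a Python dictionary and printed as JSON. It is an illustration under stated assumptions, not the template used in the experiment: the cluster ARN, namespace, service account, role ARN, and alarm ARN are placeholders, the 60 percent threshold is assumed to live in a separately defined CloudWatch alarm, and the field names should be checked against the current AWS FIS documentation before use.

```python
import json

# Placeholder values; none of these come from the original experiment.
CLUSTER_ARN = "arn:aws:eks:REGION:ACCOUNT_ID:cluster/CLUSTER_NAME"
ROLE_ARN = "arn:aws:iam::ACCOUNT_ID:role/FIS_ROLE_NAME"
CPU_ALARM_ARN = "arn:aws:cloudwatch:REGION:ACCOUNT_ID:alarm:ALARM_NAME"

template = {
    "description": "CPU stress on PetSite pods",
    "targets": {
        "petsitePods": {
            "resourceType": "aws:eks:pod",
            "selectionMode": "ALL",
            "parameters": {
                "clusterIdentifier": CLUSTER_ARN,
                "namespace": "default",  # assumption: namespace is not given in the text
                "selectorType": "labelSelector",
                "selectorValue": "app=petsite",  # label named in the prerequisites
            },
        }
    },
    "actions": {
        "cpuStress": {
            "actionId": "aws:eks:pod-cpu-stress",
            "parameters": {
                "duration": "PT10M",  # ISO 8601 duration: 10 minutes
                "kubernetesServiceAccount": "SERVICE_ACCOUNT_NAME",  # placeholder
            },
            "targets": {"Pods": "petsitePods"},
        }
    },
    # The stop condition points at an alarm; the alarm itself would hold
    # the worker-node CPU threshold (60 percent in this example).
    "stopConditions": [{"source": "aws:cloudwatch:alarm", "value": CPU_ALARM_ARN}],
    "roleArn": ROLE_ARN,
}

print(json.dumps(template, indent=2))
```

Keeping the threshold in a CloudWatch alarm rather than in the template means that the rollback criterion can be tuned without editing the experiment definition.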

Fault observation

The CloudWatch RUM and CloudWatch dashboards were reviewed during the experiment to track the steady state of the application (defined by using LCP metrics). Screenshots are captured in the following table.

Observability data:

Expected:

  • LCP should remain under 4 seconds for P99.

  • Response time should remain under 500 ms.

  • No 4XX or 5XX errors should be encountered.

Observed (screenshots):

  • Fault observation report 1 for chaos experiment.

  • Fault observation report 2 for chaos experiment.

Recovery

After the stress is removed (that is, after the AWS FIS experiment completes and removes the CPU stress from the pods), the application should return to its normal steady state. No manual intervention should be required.

Observability data:

Expected:

  • LCP P99 should be under 4 seconds with the average under 2.5 seconds.

Observed (screenshot):

  • Sample recovery results from chaos experiment.
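
The recovery criterion above combines a percentile and an average, so it can help to compute both from the same set of LCP samples. The following is a minimal sketch with hypothetical function names; it assumes that raw LCP values in seconds are available and uses the nearest-rank method for the percentile, which might differ slightly from how CloudWatch calculates percentile statistics.

```python
import math
from statistics import mean
from typing import Sequence


def percentile_nearest_rank(values: Sequence[float], pct: int) -> float:
    """Return the pct-th percentile of values by using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range 1-100")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def has_recovered(lcp_seconds: Sequence[float]) -> bool:
    """Check the recovery criterion: P99 LCP < 4 s and average LCP < 2.5 s."""
    return percentile_nearest_rank(lcp_seconds, 99) < 4.0 and mean(lcp_seconds) < 2.5


# Illustrative samples only; these are not results from a real experiment.
recovered = [1.8] * 99 + [3.5]
still_slow = [3.0] * 100
print(has_recovered(recovered), has_recovered(still_slow))
```

Checking both conditions guards against a case in which the tail looks acceptable but the typical page load is still slower than before the fault.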