

# Improvement process

 The architectural improvement process includes understanding what you have and what you can do to improve, selecting targets for improvement, testing improvements, adopting successful improvements, quantifying your success, and sharing what you have learned so that it can be replicated elsewhere. You then repeat the cycle. 

 The goals of your improvements can be: 
+  To eliminate waste, low utilization, and idle or unused resources 
+  To maximize the value from resources you consume 

**Note**  
Use all the resources you provision, and complete the same work with the minimum resources possible. 

 In early stages of optimization, first eliminate areas with waste or low utilization, and then move toward more targeted optimizations that fit your specific workload. 

 Monitor changes to the consumption of resources over time. Identify where accumulated changes result in inefficiency or significant increases in resource consumption. Determine the need for improvements to address the changes in consumption, and implement the improvements you prioritize. 

 The following steps are designed to be an iterative process that evaluates, prioritizes, tests, and deploys sustainability-focused improvements for cloud workloads. 

1.  **Identify targets for improvement:** Review your workloads against best practices for sustainability that are identified in this document, and identify targets for improvement. 

1.  **Evaluate specific improvements:** Evaluate specific changes for potential improvement, projected cost, and business risk. 

1.  **Prioritize and plan improvements:** Prioritize changes that offer the largest improvements at the least cost and risk, and establish a plan for testing and implementation. 

1.  **Test and validate improvements:** Implement changes in testing environments to validate their potential for improvement. 

1.  **Deploy changes to production:** Implement changes across production environments. 

1.  **Measure results and replicate successes:** Look for opportunities to replicate successes across workloads, and revert changes with unacceptable outcomes. 

## Example scenario

 The following example scenario is referenced later in this document to illustrate each step of the improvement process. 

 Your company has a workload that performs complex image manipulations on Amazon EC2 instances and stores the modified and original files for user access. The processing activities are CPU intensive, and the output files are extremely large. 

# Identify targets for improvement

 Understand the best practices that can help you achieve your sustainability goals. You can find detailed descriptions of these [best practices](best-practices-for-sustainability-in-the-cloud.md) and recommendations for improvement later in this document. 

 Review your workloads and the resources used. Identify *hot spots* such as large deployments and frequently used resources. Evaluate these hot spots for opportunities to improve the effective utilization of your resources and to reduce the total resources required to achieve your business outcomes. 

 Review your workload against best practices, and identify candidates for improvement. 

 Applying this step to the [Example scenario](improvement-process.md#example-scenario), you identify the following best practices as likely targets for improvement: 
+  Use the minimum amount of hardware to meet your needs 
+  Use technologies that best support your data access and storage patterns 

## Resources
+  [Optimizing your AWS Infrastructure for Sustainability, Part I: Compute](https://aws.amazon.com/blogs/architecture/optimizing-your-aws-infrastructure-for-sustainability-part-i-compute/) 
+  [Optimizing your AWS Infrastructure for Sustainability, Part II: Storage](https://aws.amazon.com/blogs/architecture/optimizing-your-aws-infrastructure-for-sustainability-part-ii-storage/) 
+  [Optimizing your AWS Infrastructure for Sustainability, Part III: Networking](https://aws.amazon.com/blogs/architecture/optimizing-your-aws-infrastructure-for-sustainability-part-iii-networking/) 

# Evaluate specific improvements

 Understand the resources provisioned by your workload to complete a unit of work. Evaluate potential improvements, and estimate their potential impact, the cost to implement, and the associated risks. 

 To measure improvements over time, first understand what you have provisioned in AWS and how those resources are being consumed. 

 Start with a full overview of your AWS usage, and use AWS Cost and Usage Reports to help identify hot spots. Use this [AWS sample code](https://github.com/aws-samples/aws-usage-queries) to help you review and analyze your report with the help of Amazon Athena. 

## Proxy metrics

 When you evaluate specific changes, you must also evaluate which metrics best quantify the effect of that change on the associated resource. These metrics are called *proxy metrics*. Select proxy metrics that best reflect the type of improvement you are evaluating and the resources targeted by improvement. These metrics might evolve over time. 

 The resources provisioned to support your workload include compute, storage, and network resources. Evaluate the resources provisioned using your proxy metrics to see how those resources are consumed. 

 Use your proxy metrics to measure the resources provisioned to achieve business outcomes. 


|  **Resource**  |  **Example proxy metrics**  |  **Improvement goals**  | 
| --- | --- | --- | 
|  Compute  |  vCPU minutes  |  Maximize utilization of provisioned resources  | 
|  Storage  |  GB provisioned  |  Reduce total provisioned  | 
|  Network  |  GB transferred or packets transferred  |  Reduce total transferred and transferred distance  | 

## Business metrics

 Select business metrics to quantify the achievement of business outcomes. Your business metrics should reflect the value provided by your workload, for example, the number of simultaneous active users, API calls served, or the number of transactions completed. These metrics may evolve over time. Be cautious when evaluating financial-based business metrics, since inconsistency in the value of transactions invalidates comparisons. 

## Key performance indicators

 Using the following formula, divide the provisioned resources by the business outcomes achieved to determine the provisioned resources per unit of work. 

![\[Diagram showing this formula: Resources provisioned per unit of work = proxy metric for provisioned resource / business metric for outcome\]](http://docs.aws.amazon.com/wellarchitected/latest/sustainability-pillar/images/key-performance-indicators-formula.png)


 Use your resources per unit of work as your KPIs. Establish baselines based on provisioned resources as the basis for comparisons. 


|  **Resource**  |  **Example KPIs**  |  **Improvement goals**  | 
| --- | --- | --- | 
|  Compute  |  vCPU minutes per transaction  |  Maximize utilization of provisioned resources  | 
|  Storage  |  GB per transaction  |  Reduce total provisioned  | 
|  Network  |  GB transferred per transaction or packets transferred per transaction  |  Reduce total transferred and transferred distance  | 
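
The KPI formula can be sketched in a few lines of Python. This is a minimal illustration rather than part of the guidance itself: the function name `resources_per_unit_of_work` and the sample figures (12,000 vCPU minutes, 48,000 transactions) are assumptions chosen for the example.

```python
def resources_per_unit_of_work(proxy_metric: float, business_metric: float) -> float:
    """Divide a proxy metric (resources provisioned) by a business metric (outcomes)."""
    if business_metric <= 0:
        raise ValueError("business_metric must be positive to compute a KPI")
    return proxy_metric / business_metric


# Hypothetical baseline: 12,000 vCPU minutes provisioned for 48,000 transactions.
baseline_kpi = resources_per_unit_of_work(proxy_metric=12_000, business_metric=48_000)
print(f"Baseline: {baseline_kpi:.2f} vCPU minutes per transaction")
```

Recording the baseline this way, with the same proxy metric and business metric each time, keeps later comparisons like for like.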

## Estimate improvement

 Estimate improvement as both the quantitative reduction in resources provisioned (as indicated by your proxy metrics) and the percentage change from your baseline resources provisioned per unit of work. 


|  **Resource**  |  **Example KPIs**  |  **Improvement goals**  | 
| --- | --- | --- | 
|  Compute  |  % reduction of vCPU minutes per transaction  |  Maximize utilization  | 
|  Storage  |  % reduction of GB per transaction  |  Reduce total provisioned  | 
|  Network  |  % reduction of GB transferred per transaction or packets transferred per transaction  |  Reduce total transferred and transferred distance  | 
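
Both estimates are simple arithmetic, sketched below. The function names and the storage figures are hypothetical and only illustrate the two quantities described above: the percentage change from the baseline KPI and the absolute reduction in the proxy metric.

```python
def percent_reduction(baseline_kpi: float, improved_kpi: float) -> float:
    """Percentage change from the baseline KPI; positive values are reductions."""
    if baseline_kpi <= 0:
        raise ValueError("baseline_kpi must be positive")
    return (baseline_kpi - improved_kpi) / baseline_kpi * 100.0


def quantitative_reduction(baseline_total: float, improved_total: float) -> float:
    """Absolute reduction in the proxy metric, in the proxy metric's own unit."""
    return baseline_total - improved_total


# Hypothetical figures: storage drops from 4.0 GB to 3.0 GB per transaction,
# and total provisioned storage drops from 400 TB to 300 TB.
pct = percent_reduction(baseline_kpi=4.0, improved_kpi=3.0)
saved_tb = quantitative_reduction(baseline_total=400.0, improved_total=300.0)
print(f"{pct:.1f}% reduction per transaction, {saved_tb:.0f} TB less provisioned")
```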

## Evaluate improvements

 Evaluate potential improvements against the anticipated net benefit. Evaluate the time, cost, and level of effort to implement and maintain, and business risks such as unanticipated impacts. 

 Targeted improvements often represent trade-offs between the types of resources consumed. For example, to reduce compute consumption, you can store a result, or to limit data transferred, you can process data before sending the result to a client. These [trade-offs](sustainability-as-a-non-functional-requirement.md) are discussed in additional detail later. 

 Include non-functional requirements when evaluating the risks for your workload, including security, reliability, performance efficiency, cost optimization, and the impact of improvements on your ability to operate your workload. 

 Applying this step to the [Example scenario](improvement-process.md#example-scenario), you evaluate the target improvements with the following results: 


|  **Best practice**  |  **Targeted improvement**  |  **Potential**  |  **Cost**  |  **Risk**  | 
| --- | --- | --- | --- | --- | 
|  Use the minimum amount of hardware to meet your needs  |  Implement predictive scaling to reduce low utilization periods  |  Medium  |  Low  |  Low  | 
|  Use technologies that best support your data access and storage patterns  |  Implement more effective compression mechanisms to reduce total storage and the time to achieve it  |  High  |  Low  |  Low  | 

 Implementing predictive scaling reduces the vCPU hours consumed by under-utilized or unused instances, providing moderate benefits over existing scaling mechanisms with an estimated 11% reduction in resources consumed. The costs involved are low and include the configuration of the cloud resources and the operation of predictive scaling for Amazon EC2 Auto Scaling. The risk is constrained performance when scale-out is performed reactively in response to demand exceeding predictions. 

 Implementing more effective compression can have a significant impact with large reductions in file size across all of your original and manipulated images, with an estimated 25% reduction in storage requirements in production. Implementing the new algorithm is a low-effort substitution with little risk involved. 

# Prioritize and plan improvements

 Prioritize your identified improvements based on the greatest anticipated impact with the lowest costs and acceptable risk. 

 Decide which improvements to focus on initially, and include them in your resource planning and development roadmap. 

 Applying this step to the [Example scenario](improvement-process.md#example-scenario), you prioritize the target improvements as follows: 


|  **Priority**  |  **Improvement**  |  **Potential**  |  **Cost**  |  **Risk**  | 
| --- | --- | --- | --- | --- | 
|  1  |  Implement more effective compression mechanisms  |  High  |  Low  |  Low  | 
|  2  |  Implement predictive scaling  |  Medium  |  Low  |  Low  | 

 The high potential, low cost, and low risk of updating file compression make it a high-value target for your company and a priority over implementing predictive scaling. You determine that predictive scaling, with its medium potential impact, low cost, and low risk, should be the next improvement after file compression is complete. 

 You assign a team member to implement improved file compression and add predictive scaling to your backlog. 
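
One way to make this ranking repeatable is to sort candidates by potential (highest first), then by cost and risk (lowest first). The sketch below is an assumption about how you might encode the ratings, not a prescribed method: the `Improvement` class, the `LEVEL` ordinal mapping, and the tie-break order are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical ordinal scale for the qualitative ratings used in the tables.
LEVEL = {"Low": 1, "Medium": 2, "High": 3}


@dataclass(frozen=True)
class Improvement:
    name: str
    potential: str
    cost: str
    risk: str


def prioritize(candidates: list[Improvement]) -> list[Improvement]:
    """Sort by highest potential first, then lowest cost, then lowest risk."""
    return sorted(
        candidates,
        key=lambda c: (-LEVEL[c.potential], LEVEL[c.cost], LEVEL[c.risk]),
    )


candidates = [
    Improvement("Implement predictive scaling", "Medium", "Low", "Low"),
    Improvement("Implement more effective compression mechanisms", "High", "Low", "Low"),
]

for priority, item in enumerate(prioritize(candidates), start=1):
    print(priority, item.name)
```

With the ratings from the example scenario, compression sorts ahead of predictive scaling, which matches the priorities in the table.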

# Test and validate improvements

 Perform small tests with minimized investment to reduce the risk of a large-scale effort. 

 Implement a representative copy of your workload in your testing environment to limit the cost and risk to perform testing and validation. Perform a predefined set of test transactions, measure the provisioned resources, and determine the resources used per unit of work to establish a testing baseline. 

 Implement your target improvement in the testing environment and repeat the test using the same methodology under the same conditions. Then measure the provisioned resources and resources used per unit of work with your improvement in place. 

 Calculate the percentage change from your baseline of the resources provisioned per unit of work, and determine the expected quantitative reduction in resources provisioned in your production environment. Compare these values against the anticipated values. Determine if the result is an acceptable level of improvement. Evaluate if any trade-offs in additional resources consumed make the net benefit from the improvement unacceptable. 

 Determine if the improvement is a success and if resources should be invested in implementing the change in production. If the change is evaluated as unsuccessful at this time, redirect your resources to test and validate your next target and continue your improvement cycle. 


|  **% Reduction in provisioned resources per unit of work**  |  **Quantitative reduction in provisioned resources**  |  **Action**  | 
| --- | --- | --- | 
|  Met expectations  |  Met expectations  |  Proceed with improvement  | 
|  Did not meet expectations  |  Met expectations  |  Proceed with improvement  | 
|  Met expectations  |  Did not meet expectations  |  Pursue alternative improvement  | 
|  Did not meet expectations  |  Did not meet expectations  |  Pursue alternative improvement  | 
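
The decision table can be encoded directly as a lookup, which makes the rule easy to apply consistently across tests. This is a minimal sketch: the names `ACTIONS` and `next_action` are hypothetical, and the mapping simply mirrors the four rows above.

```python
# Keys are (percentage reduction met?, quantitative reduction met?),
# copied row by row from the decision table.
ACTIONS = {
    (True, True): "Proceed with improvement",
    (False, True): "Proceed with improvement",
    (True, False): "Pursue alternative improvement",
    (False, False): "Pursue alternative improvement",
}


def next_action(pct_reduction_met: bool, quantitative_reduction_met: bool) -> str:
    """Return the action for one tested improvement."""
    return ACTIONS[(pct_reduction_met, quantitative_reduction_met)]


print(next_action(pct_reduction_met=False, quantitative_reduction_met=True))
```

As encoded, the action follows the quantitative reduction: an improvement proceeds whenever the absolute reduction in provisioned resources meets expectations.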

 Applying this step to the [Example scenario](improvement-process.md#example-scenario), you perform tests to validate success. 

After you perform the tests on the improved compression algorithm, the percentage reduction in resources provisioned per unit of work (the storage required for both the original image and the modified image) met expectations, with an average 30% reduction in provisioned storage and a negligible increase in compute load.

You determine that the additional compute resources required to apply the improved compression algorithm to existing files in production are insignificant compared to the reduction in storage achieved. You confirm success with the quantitative reduction in resources required (TBs of storage), and the improvement is approved for production deployment.

# Deploy changes to production

 Implement tested, validated, and approved improvements to production. Implement using limited deployments, confirm the functionality of your workload, test the actual reduction in provisioned resources and resources consumed per unit of work within the limited deployment, and check for unintended consequences of the change. Proceed with full deployments after successful testing. 

 Revert changes if tests fail or you encounter unacceptable unintended consequences of your change. 
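
The proceed-or-revert check after a limited deployment can be sketched as a small gate. Everything here is illustrative: the `LimitedDeploymentResult` fields, the `deployment_decision` function, and the idea of a minimum acceptable reduction threshold are assumptions, since the guidance does not prescribe specific criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LimitedDeploymentResult:
    functionality_tests_passed: bool
    measured_reduction_pct: float            # reduction in resources per unit of work
    minimum_acceptable_reduction_pct: float  # hypothetical threshold you define
    unacceptable_side_effects: bool


def deployment_decision(result: LimitedDeploymentResult) -> str:
    """Decide whether to continue to full deployment or revert the change."""
    if not result.functionality_tests_passed or result.unacceptable_side_effects:
        return "revert"
    if result.measured_reduction_pct < result.minimum_acceptable_reduction_pct:
        return "revert"
    return "proceed with full deployment"


# Hypothetical limited deployment: tests pass, 26% measured against a 20% threshold.
result = LimitedDeploymentResult(True, 26.0, 20.0, False)
print(deployment_decision(result))
```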

 Applying this step to the [Example scenario](improvement-process.md#example-scenario), you take the following actions. 

 You implement the changes in production using a limited deployment through a blue-green deployment methodology. Functionality tests against the newly deployed instances are successful. You see a 26% average reduction in provisioned storage for original and manipulated image files. You don’t see any evidence of an increase in compute load when compressing new files. 

 You notice an unanticipated decrease in the elapsed time to compress image files, and you attribute this to the highly optimized code for the new compression algorithm. 

 You proceed with full deployment of the new version. 

# Measure results and replicate successes

Measure results and replicate successes in the following ways: 
+ Measure the initial improvement to provisioned resources per unit of work and the quantitative decrease in resources provisioned. 
+  Compare initial estimates and testing results to your production measurements. Identify factors that might have contributed to differences, and update your estimation and testing methodologies where appropriate. 
+  Determine success and the degree of success, and share results with stakeholders.
+  If you had to revert changes due to failed tests or unintended negative consequences from the change, identify the contributing factors. Iterate where viable, or evaluate new approaches to achieve the goals of the change.
+  Take what you have learned, establish standards, and apply successful improvements to other systems that can similarly benefit. Capture and share your methodology, related artifacts, and net benefits across teams and organizations so that others can adopt your standard and replicate your success. 
+ Monitor provisioned resources per unit of work and track changes and total impact over time. Changes to your workload, or how your customers consume your workload, can have an impact on the effectiveness of your improvement. Re-evaluate improvement opportunities if you notice significant short-term decreases in the effectiveness of your improvement or an accumulated reduction in effectiveness over time.
+ Quantify the net benefit from your improvement over time (including the benefits received by other teams who applied your improvement if available) to show the return on investment from your improvement activities. 
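
Ongoing monitoring of provisioned resources per unit of work can be reduced to a simple check against the KPI you measured right after the improvement. The sketch below is one possible approach under stated assumptions: the function name `effectiveness_degraded`, the 10% default tolerance, and the sample KPI values are hypothetical.

```python
def effectiveness_degraded(
    kpi_history: list[float],
    post_improvement_kpi: float,
    tolerance_pct: float = 10.0,
) -> bool:
    """Flag when the latest KPI exceeds the post-improvement KPI by more than the tolerance."""
    if post_improvement_kpi <= 0:
        raise ValueError("post_improvement_kpi must be positive")
    if not kpi_history:
        return False  # nothing measured yet, so nothing to flag
    latest = kpi_history[-1]
    increase_pct = (latest - post_improvement_kpi) / post_improvement_kpi * 100.0
    return increase_pct > tolerance_pct


# Hypothetical GB-per-transaction readings taken after the improvement (KPI was 3.0).
print(effectiveness_degraded([3.0, 3.1, 3.5], post_improvement_kpi=3.0))
```

A flagged result is a prompt to re-evaluate improvement opportunities, not proof that the improvement has failed; workload or customer usage changes can move the KPI as well.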

 Applying this step to the [Example scenario](improvement-process.md#example-scenario), you measure the following results. 

 Your workload shows an initial 23% reduction in storage requirements after deploying and applying the new compression algorithm to existing image files. 

 The measured value is largely in agreement with initial estimates (25%), and the significant difference compared to testing (30%) is determined to be the result of the image files used in testing not being representative of image files present in production. You modify the testing image set to more appropriately reflect the images in production. 

 The improvement is considered a complete success. The total reduction in provisioned storage is 2 percentage points less than the estimated 25%, but 23% is still a substantial improvement in sustainability impact and is accompanied by equivalent cost savings. 
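
The comparison between the estimate, the test result, and the production measurement in this scenario can be expressed in percentage points. The sketch below uses the three figures from the scenario (25%, 30%, and 23%); the function name `percentage_point_gap` is hypothetical.

```python
def percentage_point_gap(measured_pct: float, reference_pct: float) -> float:
    """Difference between two percentage reductions, in percentage points."""
    return measured_pct - reference_pct


# Figures from the example scenario.
estimated_pct, tested_pct, production_pct = 25.0, 30.0, 23.0

gap_vs_estimate = percentage_point_gap(production_pct, estimated_pct)
gap_vs_test = percentage_point_gap(production_pct, tested_pct)

print(f"Production vs. estimate: {gap_vs_estimate:+.1f} percentage points")
print(f"Production vs. testing:  {gap_vs_test:+.1f} percentage points")
```

The larger gap against testing is what points to the test image set, rather than the original estimate, as the thing to correct.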

 The only unintended consequences of the change are the beneficial reduction in elapsed time to perform the compression and an equivalent reduction in vCPU consumed. These improvements are attributed to the highly optimized code. 

 You establish an internal open-source project where you share your code, associated artifacts, guidance on how to implement the change, and the results of your implementation. The internal open-source project makes it easy for your teams to adopt the code for all their persistent file storage use cases. Your teams adopt the improvement as a standard. Secondary benefits of the internal open-source project are that everyone who adopts the solution benefits from improvements to the solution, and anyone can contribute improvements to the project. 

 You publish your success and share the open-source project across your organization. Every team that adopts the solution replicates the benefit with minimum investment and adds to the net benefit received from your investment. You publish this data as a continuing success story. 

 You continue to monitor the impact of the improvement over time and will make changes to the internal open-source project as required. 