# Foundations
<a name="a-foundations"></a>

**Topics**
+ [REL 1. How do you manage Service Quotas and constraints?](rel-01.md)
+ [REL 2. How do you plan your network topology?](rel-02.md)

# REL 1. How do you manage Service Quotas and constraints?
<a name="rel-01"></a>

For cloud-based workload architectures, there are Service Quotas (which are also referred to as service limits). These quotas exist to prevent accidentally provisioning more resources than you need and to limit request rates on API operations so as to protect services from abuse. There are also resource constraints, for example, the rate that you can push bits down a fiber-optic cable, or the amount of storage on a physical disk. 

**Topics**
+ [REL01-BP01 Aware of service quotas and constraints](rel_manage_service_limits_aware_quotas_and_constraints.md)
+ [REL01-BP02 Manage service quotas across accounts and regions](rel_manage_service_limits_limits_considered.md)
+ [REL01-BP03 Accommodate fixed service quotas and constraints through architecture](rel_manage_service_limits_aware_fixed_limits.md)
+ [REL01-BP04 Monitor and manage quotas](rel_manage_service_limits_monitor_manage_limits.md)
+ [REL01-BP05 Automate quota management](rel_manage_service_limits_automated_monitor_limits.md)
+ [REL01-BP06 Ensure that a sufficient gap exists between the current quotas and the maximum usage to accommodate failover](rel_manage_service_limits_suff_buffer_limits.md)

# REL01-BP01 Aware of service quotas and constraints
<a name="rel_manage_service_limits_aware_quotas_and_constraints"></a>

 Be aware of your default quotas and manage your quota increase requests for your workload architecture. Know which cloud resource constraints, such as disk or network, are potentially impactful. 

 **Desired outcome:** Customers can prevent service degradation or disruption in their AWS accounts by implementing guidelines for monitoring key metrics, reviewing infrastructure, and automating remediation steps, so that service quotas and constraints are not reached unexpectedly. 

 **Common anti-patterns:** 
+ Deploying a workload without understanding the hard or soft quotas and their limits for the services used. 
+ Deploying a replacement workload without analyzing and reconfiguring the necessary quotas or contacting Support in advance. 
+ Assuming that cloud services have no limits and can be used without considering rates, limits, counts, or quantities.
+  Assuming that quotas will automatically be increased. 
+  Not knowing the process and timeline of quota requests. 
+  Assuming that the default quota for every cloud service is identical across Regions. 
+  Assuming that service constraints can be breached and that systems will automatically scale or raise the limit beyond the resource’s constraints. 
+  Not testing the application at peak traffic in order to stress the utilization of its resources. 
+  Provisioning the resource without analysis of the required resource size. 
+  Overprovisioning capacity by choosing resource types that go well beyond actual need or expected peaks. 
+  Not assessing capacity requirements for new levels of traffic in advance of a new customer event or deploying a new technology. 

 **Benefits of establishing this best practice:** Monitoring and automated management of service quotas and resource constraints can proactively reduce failures. Changes in traffic patterns for a customer’s service can cause a disruption or degradation if best practices are not followed. By monitoring and managing these values across all regions and all accounts, applications can have improved resiliency under adverse or unplanned events. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 Service Quotas is an AWS service that helps you manage your quotas for over 250 AWS services from one location. Along with looking up the quota values, you can also request and track quota increases from the Service Quotas console or using the AWS SDK. AWS Trusted Advisor offers a service quotas check that displays your usage and quotas for some aspects of some services. The default service quotas per service are also in the AWS documentation per respective service (for example, see [Amazon VPC Quotas](https://docs.aws.amazon.com/vpc/latest/userguide/amazon-vpc-limits.html)). 
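
 As a minimal sketch of looking up quota values with the AWS SDK, the following function pages through the `list_service_quotas` operation of a Service Quotas client and keeps a few fields from each entry. The function name `list_quotas` and the returned dictionary shape are illustrative assumptions; the client is passed in so that the logic can be exercised without AWS credentials.

```python
from typing import Any, Dict, List


def list_quotas(client: Any, service_code: str) -> List[Dict[str, Any]]:
    """Return code, name, value, and adjustability for each quota of one service.

    `client` is expected to behave like boto3.client("service-quotas").
    """
    paginator = client.get_paginator("list_service_quotas")
    quotas: List[Dict[str, Any]] = []
    for page in paginator.paginate(ServiceCode=service_code):
        for quota in page.get("Quotas", []):
            quotas.append(
                {
                    "code": quota["QuotaCode"],
                    "name": quota["QuotaName"],
                    "value": quota["Value"],
                    "adjustable": quota["Adjustable"],
                }
            )
    return quotas


# Example use against a real account (requires boto3 and credentials):
#   import boto3
#   client = boto3.client("service-quotas", region_name="us-east-1")
#   for q in list_quotas(client, "vpc"):
#       print(q["name"], q["value"], q["adjustable"])
```

 Because quotas are generally Region-specific, a real inventory would likely repeat this call with one client per Region.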

 Some service limits, like rate limits on throttled APIs, are set within Amazon API Gateway itself by configuring a usage plan. Some limits that are set as configuration on their respective services include Provisioned IOPS, Amazon RDS storage allocated, and Amazon EBS volume allocations. Amazon Elastic Compute Cloud (Amazon EC2) has its own service limits dashboard that can help you manage your instance, Amazon Elastic Block Store, and Elastic IP address limits. 

 Service quotas can be Region-specific or global. A workload that reaches a service quota will not behave as expected and may suffer service disruption or degradation. For example, a service quota limits the number of DL Amazon EC2 instances used in a Region. That limit may be reached during a traffic scaling event using Auto Scaling groups (ASG). 
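
 To make the scaling example concrete, the following sketch estimates how many more instances an Auto Scaling group could launch before a quota is reached. It assumes the quota is expressed in vCPUs and that every instance in the group has the same size; the function names and the numbers in the usage note are hypothetical.

```python
def instances_before_quota(
    quota_vcpus: float, used_vcpus: float, vcpus_per_instance: int
) -> int:
    """Number of additional instances that still fit under the quota."""
    if vcpus_per_instance <= 0:
        raise ValueError("vcpus_per_instance must be positive")
    remaining = max(quota_vcpus - used_vcpus, 0)
    return int(remaining // vcpus_per_instance)


def scale_out_fits(
    quota_vcpus: float,
    used_vcpus: float,
    vcpus_per_instance: int,
    additional_instances: int,
) -> bool:
    """True if a planned scale-out stays within the quota."""
    headroom = instances_before_quota(quota_vcpus, used_vcpus, vcpus_per_instance)
    return additional_instances <= headroom
```

 With a hypothetical quota of 384 vCPUs, 192 vCPUs in use, and 96 vCPUs per instance, `instances_before_quota(384, 192, 96)` leaves room for two more instances, so a scale-out of three would fail partway through.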

 Service quotas for each account should be assessed for usage on a regular basis to determine what the appropriate service limits might be for that account. These service quotas exist as operational guardrails, to prevent accidentally provisioning more resources than you need. They also serve to limit request rates on API operations to protect services from abuse. 

 Service constraints are different from service quotas. Service constraints represent a particular resource’s limits as defined by that resource type. These might be storage capacity (for example, a gp2 volume can be sized from 1 GB to 16 TB) or disk performance (for example, a gp2 volume supports up to 16,000 IOPS). Design for each resource type’s constraints and regularly assess usage that might approach them. If a constraint is reached unexpectedly, the account’s applications or services may be degraded or disrupted. 

 If there is a use case where service quotas impact an application’s performance and they cannot be adjusted to required needs, contact Support to see if there are mitigations. For more detail on working with fixed quotas, see [REL01-BP03 Accommodate fixed service quotas and constraints through architecture](rel_manage_service_limits_aware_fixed_limits.md). 

 There are a number of AWS services and tools to help monitor and manage Service Quotas. Use these services and tools to provide automated or manual checks of quota levels. 
+  AWS Trusted Advisor offers a service quota check that displays your usage and quotas for some aspects of some services. It can aid in identifying services that are near quota. 
+  AWS Management Console provides methods to display service quota values, request new quotas, monitor the status of quota requests, and display quota history. 
+  The AWS CLI and AWS SDKs offer programmatic methods to automatically manage and monitor service quota levels and usage. 

 **Implementation steps** 

 For Service Quotas: 
+ [ Review AWS Service Quotas. ](https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html)
+  To be aware of your existing service quotas, determine the services (like IAM Access Analyzer) that are used. There are approximately 250 AWS services controlled by service quotas. Then, determine the specific service quota name that might be used within each account and Region. There are approximately 3000 service quota names per Region. 
+  Augment this quota analysis with AWS Config to find all [AWS resources](https://docs.aws.amazon.com/config/latest/developerguide/resource-config-reference.html) used in your AWS accounts. 
+  Use [AWS CloudFormation data](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-console-view-stack-data-resources.html) to determine your AWS resources used. Look at the resources that were created either in the AWS Management Console or with the [`list-stack-resources`](https://docs.aws.amazon.com/cli/latest/reference/cloudformation/list-stack-resources.html) AWS CLI command. You can also see resources configured to be deployed in the template itself. 
+  Determine all the services your workload requires by looking at the deployment code. 
+  Determine the service quotas that apply. Use the programmatically accessible information from Trusted Advisor and Service Quotas. 
+  Establish an automated monitoring method (see [REL01-BP02 Manage service quotas across accounts and regions](rel_manage_service_limits_limits_considered.md) and [REL01-BP04 Monitor and manage quotas](rel_manage_service_limits_monitor_manage_limits.md)) to alert and inform if service quotas are near or have reached their limit. 
+  Establish an automated and programmatic method to check if a service quota has been changed in one region but not in other regions in the same account (see [REL01-BP02 Manage service quotas across accounts and regions](rel_manage_service_limits_limits_considered.md) and [REL01-BP04 Monitor and manage quotas](rel_manage_service_limits_monitor_manage_limits.md)). 
+  Automate scanning application logs and metrics to determine if there are any quota or service constraint errors. If these errors are present, send alerts to the monitoring system. 
+  Establish engineering procedures to calculate the required change in quota (see [REL01-BP05 Automate quota management](rel_manage_service_limits_automated_monitor_limits.md)) once it has been identified that larger quotas are required for specific services. 
+  Create a provisioning and approval workflow to request changes in service quota. This should include an exception workflow in case a request is denied or only partially approved. 
+  Create an engineering method to review service quotas before provisioning and using new AWS services, and before rolling out to production or loaded environments (for example, a load testing account). 
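
 The log-scanning step above can be sketched as a simple pattern match over application log lines. The list of error codes is an assumption for illustration and would need to be extended for the services a workload actually uses; `count_quota_errors` is a hypothetical helper name.

```python
import re
from collections import Counter
from typing import Iterable

# Error codes that commonly indicate a quota or throttling problem.
# This list is illustrative, not exhaustive.
QUOTA_ERROR_PATTERN = re.compile(
    r"\b(LimitExceededException|ServiceQuotaExceededException|"
    r"ThrottlingException|TooManyRequestsException|RequestLimitExceeded)\b"
)


def count_quota_errors(log_lines: Iterable[str]) -> Counter:
    """Count occurrences of each quota-related error code in the log lines."""
    counts: Counter = Counter()
    for line in log_lines:
        for error_code in QUOTA_ERROR_PATTERN.findall(line):
            counts[error_code] += 1
    return counts
```

 A non-empty result could then be forwarded to the monitoring system as an alert, for example when any count exceeds zero within a scan window.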

 For service constraints: 
+  Establish monitoring and metrics methods to alert on resources that are approaching their resource constraints. Leverage CloudWatch as appropriate for metrics or log monitoring. 
+  Establish alert thresholds for each resource that has a constraint that is meaningful to the application or system. 
+  Create workflow and infrastructure management procedures to change the resource type if utilization nears the constraint. This workflow should include load testing as a best practice to verify that the new type is the correct resource type with the new constraints. 
+  Migrate the identified resource to the recommended new resource type, using existing procedures and processes. 
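
 As a rough sketch of the alert-threshold step, the following code compares the peak of an observed metric series with the constraint for each resource and lists resources at or above a threshold. The 80% default, the resource names, and the data shapes are assumptions; in practice the datapoints might come from CloudWatch metrics.

```python
from typing import Dict, List, Sequence


def peak_utilization(datapoints: Sequence[float], constraint_limit: float) -> float:
    """Peak observed value as a fraction of the resource constraint."""
    if constraint_limit <= 0:
        raise ValueError("constraint_limit must be positive")
    if not datapoints:
        return 0.0
    return max(datapoints) / constraint_limit


def resources_near_constraint(
    metrics: Dict[str, Sequence[float]],
    limits: Dict[str, float],
    threshold: float = 0.8,
) -> List[str]:
    """Names of resources whose peak utilization is at or above the threshold."""
    return sorted(
        name
        for name, points in metrics.items()
        if peak_utilization(points, limits[name]) >= threshold
    )
```

 Resources returned by `resources_near_constraint` are candidates for the resource type change workflow described above.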

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [REL01-BP02 Manage service quotas across accounts and regions](rel_manage_service_limits_limits_considered.md) 
+  [REL01-BP03 Accommodate fixed service quotas and constraints through architecture](rel_manage_service_limits_aware_fixed_limits.md) 
+  [REL01-BP04 Monitor and manage quotas](rel_manage_service_limits_monitor_manage_limits.md) 
+  [REL01-BP05 Automate quota management](rel_manage_service_limits_automated_monitor_limits.md) 
+  [REL01-BP06 Ensure that a sufficient gap exists between the current quotas and the maximum usage to accommodate failover](rel_manage_service_limits_suff_buffer_limits.md) 
+  [REL03-BP01 Choose how to segment your workload](rel_service_architecture_monolith_soa_microservice.md) 
+  [REL10-BP01 Deploy the workload to multiple locations](rel_fault_isolation_multiaz_region_system.md) 
+  [REL11-BP01 Monitor all components of the workload to detect failures](rel_withstand_component_failures_monitoring_health.md) 
+  [REL11-BP03 Automate healing on all layers](rel_withstand_component_failures_auto_healing_system.md) 
+  [REL12-BP05 Test resiliency using chaos engineering](rel_testing_resiliency_failure_injection_resiliency.md) 

 **Related documents:** 
+ [AWS Well-Architected Framework’s Reliability Pillar: Availability ](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/availability.html)
+  [AWS Service Quotas (formerly referred to as service limits)](https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html) 
+  [AWS Trusted Advisor Best Practice Checks (see the Service Limits section)](https://aws.amazon.com/premiumsupport/technology/trusted-advisor/best-practice-checklist/) 
+  [AWS limit monitor on AWS answers](https://aws.amazon.com/answers/account-management/limit-monitor/) 
+  [Amazon EC2 Service Limits](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-resource-limits.html) 
+  [What is Service Quotas?](https://docs.aws.amazon.com/servicequotas/latest/userguide/intro.html) 
+ [ How to Request Quota Increase ](https://docs.aws.amazon.com/servicequotas/latest/userguide/request-quota-increase.html)
+ [ Service endpoints and quotas ](https://docs.aws.amazon.com/general/latest/gr/aws-service-information.html)
+  [Service Quotas User Guide](https://docs.aws.amazon.com/servicequotas/latest/userguide/intro.html) 
+ [ Quota Monitor for AWS](https://aws.amazon.com/solutions/implementations/quota-monitor/)
+ [AWS Fault Isolation Boundaries ](https://docs.aws.amazon.com/whitepapers/latest/aws-fault-isolation-boundaries/abstract-and-introduction.html)
+ [ Availability with redundancy ](https://docs.aws.amazon.com/whitepapers/latest/availability-and-beyond-improving-resilience/availability-with-redundancy.html)
+ [AWS for Data ](https://aws.amazon.com/data/)
+ [ What is Continuous Integration? ](https://aws.amazon.com/devops/continuous-integration/)
+ [ What is Continuous Delivery? ](https://aws.amazon.com/devops/continuous-delivery/)
+ [ APN Partner: partners that can help with configuration management ](https://partners.amazonaws.com/search/partners?keyword=Configuration+Management&ref=wellarchitected)
+ [ Managing the account lifecycle in account-per-tenant SaaS environments on AWS](https://aws.amazon.com/blogs/mt/managing-the-account-lifecycle-in-account-per-tenant-saas-environments-on-aws/)
+ [ Managing and monitoring API throttling in your workloads ](https://aws.amazon.com/blogs/mt/managing-monitoring-api-throttling-in-workloads/)
+ [ View AWS Trusted Advisor recommendations at scale with AWS Organizations](https://aws.amazon.com/blogs/mt/organizational-view-for-trusted-advisor/)
+ [ Automating Service Limit Increases and Enterprise Support with AWS Control Tower](https://aws.amazon.com/blogs/mt/automating-service-limit-increases-enterprise-support-aws-control-tower/)

 **Related videos:** 
+  [AWS Live re:Inforce 2019 - Service Quotas](https://youtu.be/O9R5dWgtrVo) 
+ [ View and Manage Quotas for AWS Services Using Service Quotas ](https://www.youtube.com/watch?v=ZTwfIIf35Wc)
+ [AWS IAM Quotas Demo ](https://www.youtube.com/watch?v=srJ4jr6M9YQ)

 **Related tools:** 
+ [ Amazon CodeGuru Reviewer ](https://aws.amazon.com/codeguru/)
+ [AWS CodeDeploy](https://aws.amazon.com/codedeploy/)
+ [AWS CloudTrail](https://aws.amazon.com/cloudtrail/)
+ [ Amazon CloudWatch ](https://aws.amazon.com/cloudwatch/)
+ [ Amazon EventBridge ](https://aws.amazon.com/eventbridge/)
+ [ Amazon DevOps Guru ](https://aws.amazon.com/devops-guru/)
+ [AWS Config](https://aws.amazon.com/config/)
+ [AWS Trusted Advisor](https://aws.amazon.com/premiumsupport/technology/trusted-advisor/)
+ [AWS CDK ](https://aws.amazon.com/cdk/)
+ [AWS Systems Manager](https://aws.amazon.com/systems-manager/)
+ [AWS Marketplace](https://aws.amazon.com/marketplace/search/results?searchTerms=CMDB)

# REL01-BP02 Manage service quotas across accounts and regions
<a name="rel_manage_service_limits_limits_considered"></a>

 If you are using multiple accounts or Regions, request the appropriate quotas in all environments in which your production workloads run. 

 **Desired outcome:** Services and applications should not be affected by service quota exhaustion for configurations that span accounts or Regions or that have resilience designs using zone, Region, or account failover. 

 **Common anti-patterns:** 
+ Allowing resource usage in one isolation Region to grow with no mechanism to maintain capacity in the other ones. 
+  Manually setting all quotas independently in isolation Regions. 
+  Not considering the effect of resiliency architectures (like active/passive) on future quota needs in the non-primary Region during a degradation. 
+  Not evaluating quotas regularly and making necessary changes in every Region and account in which the workload runs. 
+  Not leveraging [quota request templates](https://docs.aws.amazon.com/servicequotas/latest/userguide/organization-templates.html) to request increases across multiple Regions and accounts. 
+  Not updating service quotas due to incorrectly thinking that increasing quotas has cost implications like compute reservation requests. 

 **Benefits of establishing this best practice:** You can verify that your current load can be handled in secondary Regions or accounts if Regional services become unavailable. This can help reduce the number of errors or the level of degradation that occurs during the loss of a Region. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 Service quotas are tracked per account. Unless otherwise noted, each quota is AWS Region-specific. In addition to the production environments, also manage quotas in all applicable non-production environments so that testing and development are not hindered. Maintaining a high degree of resiliency requires that service quotas are assessed continually (whether automated or manual). 

 With more workloads spanning Regions due to the implementation of designs using *Active/Active*, *Active/Passive – Hot*, *Active/Passive-Cold*, and *Active/Passive-Pilot Light* approaches, it is essential to understand all Region and account quota levels. Past traffic patterns are not always a good indicator of whether a service quota is set correctly. 

 Equally important, the limit for a named service quota is not always the same in every Region. In one Region, the value could be five, and in another Region the value could be ten. Management of these quotas must span all the same services, accounts, and Regions to provide consistent resilience under load. 

 Reconcile all the service quota differences across different Regions (Active Region or Passive Region) and create processes to continually reconcile these differences. The testing plans of passive Region failovers are rarely scaled to peak active capacity, meaning that game day or table top exercises can fail to find differences in service quotas between Regions, and the correct limits are then not maintained. 

 *Service quota drift*, the condition where the limit for a specific named quota is changed in one Region but not in all Regions, is very important to track and assess. Consider changing the quota in every Region that carries traffic or could potentially carry traffic. 
+  Select relevant accounts and Regions based on your service requirements, latency, regulatory, and disaster recovery (DR) requirements. 
+  Identify service quotas across all relevant accounts, Regions, and Availability Zones. The limits are scoped to account and Region. These values should be compared for differences. 
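
 Comparing quota values for differences, as described above, can be sketched as follows. The input shape (Region name mapped to quota code and value) and the function name `find_quota_drift` are assumptions; the values themselves would typically be collected per Region from Service Quotas.

```python
from typing import Dict, Mapping, Optional


def find_quota_drift(
    quotas_by_region: Mapping[str, Mapping[str, float]]
) -> Dict[str, Dict[str, Optional[float]]]:
    """Return {quota_code: {region: value}} for quotas that differ between Regions.

    A quota that is missing in a Region is reported with a value of None.
    """
    all_codes = set()
    for quotas in quotas_by_region.values():
        all_codes.update(quotas)

    drift: Dict[str, Dict[str, Optional[float]]] = {}
    for code in sorted(all_codes):
        values = {
            region: quotas.get(code) for region, quotas in quotas_by_region.items()
        }
        # More than one distinct value means the Regions disagree.
        if len(set(values.values())) > 1:
            drift[code] = values
    return drift
```

 Running this on a schedule and alerting on a non-empty result is one possible way to detect drift between an active and a passive Region before a failover exposes it.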

 **Implementation steps** 
+  Review Service Quotas values that might have exceeded a risk level of usage. AWS Trusted Advisor provides alerts for 80% and 90% threshold breaches. 
+  Review values for service quotas in any Passive Regions (in an Active/Passive design). Verify that load will successfully run in secondary Regions in the event of a failure in the primary Region. 
+  Automate assessing if any service quota drift has occurred between Regions in the same account and act accordingly to change the limits. 
+  If the customer’s organizational units (OUs) are structured in the supported manner, service quota templates should be updated to reflect changes in any quotas that should be applied to multiple Regions and accounts. 
  +  Create a template and associate Regions to the quota change. 
  +  Review all existing service quota templates for any changes required (Region, limits, and accounts). 
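
 The first two review steps can be sketched as two small checks: classifying utilization against the 80% and 90% thresholds mentioned above, and verifying that a passive Region could absorb the active Region’s peak usage. The function names and the simple additive failover model are assumptions for illustration.

```python
def classify_utilization(
    usage: float, quota: float, warn: float = 0.80, critical: float = 0.90
) -> str:
    """Classify quota utilization as 'ok', 'warning', or 'critical'."""
    if quota <= 0:
        raise ValueError("quota must be positive")
    ratio = usage / quota
    if ratio >= critical:
        return "critical"
    if ratio >= warn:
        return "warning"
    return "ok"


def passive_region_can_absorb(
    active_peak_usage: float, passive_usage: float, passive_quota: float
) -> bool:
    """True if the passive Region's quota covers its own usage plus the active peak."""
    return active_peak_usage + passive_usage <= passive_quota
```

 For example, a passive Region with a quota of 100, its own usage of 50, and an active peak of 60 would not be able to take over the full load, even though neither Region is near its quota in normal operation.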

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [REL01-BP01 Aware of service quotas and constraints](rel_manage_service_limits_aware_quotas_and_constraints.md) 
+  [REL01-BP03 Accommodate fixed service quotas and constraints through architecture](rel_manage_service_limits_aware_fixed_limits.md) 
+  [REL01-BP04 Monitor and manage quotas](rel_manage_service_limits_monitor_manage_limits.md) 
+  [REL01-BP05 Automate quota management](rel_manage_service_limits_automated_monitor_limits.md) 
+  [REL01-BP06 Ensure that a sufficient gap exists between the current quotas and the maximum usage to accommodate failover](rel_manage_service_limits_suff_buffer_limits.md) 
+  [REL03-BP01 Choose how to segment your workload](rel_service_architecture_monolith_soa_microservice.md) 
+  [REL10-BP01 Deploy the workload to multiple locations](rel_fault_isolation_multiaz_region_system.md) 
+  [REL11-BP01 Monitor all components of the workload to detect failures](rel_withstand_component_failures_monitoring_health.md) 
+  [REL11-BP03 Automate healing on all layers](rel_withstand_component_failures_auto_healing_system.md) 
+  [REL12-BP05 Test resiliency using chaos engineering](rel_testing_resiliency_failure_injection_resiliency.md) 

 **Related documents:** 
+ [AWS Well-Architected Framework’s Reliability Pillar: Availability ](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/availability.html)
+  [AWS Service Quotas (formerly referred to as service limits)](https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html) 
+  [AWS Trusted Advisor Best Practice Checks (see the Service Limits section)](https://aws.amazon.com/premiumsupport/technology/trusted-advisor/best-practice-checklist/) 
+  [AWS limit monitor on AWS answers](https://aws.amazon.com/answers/account-management/limit-monitor/) 
+  [Amazon EC2 Service Limits](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-resource-limits.html) 
+  [What is Service Quotas?](https://docs.aws.amazon.com/servicequotas/latest/userguide/intro.html) 
+ [ How to Request Quota Increase ](https://docs.aws.amazon.com/servicequotas/latest/userguide/request-quota-increase.html)
+ [ Service endpoints and quotas ](https://docs.aws.amazon.com/general/latest/gr/aws-service-information.html)
+  [Service Quotas User Guide](https://docs.aws.amazon.com/servicequotas/latest/userguide/intro.html) 
+ [ Quota Monitor for AWS](https://aws.amazon.com/solutions/implementations/quota-monitor/)
+ [AWS Fault Isolation Boundaries ](https://docs.aws.amazon.com/whitepapers/latest/aws-fault-isolation-boundaries/abstract-and-introduction.html)
+ [ Availability with redundancy ](https://docs.aws.amazon.com/whitepapers/latest/availability-and-beyond-improving-resilience/availability-with-redundancy.html)
+ [AWS for Data ](https://aws.amazon.com/data/)
+ [ What is Continuous Integration? ](https://aws.amazon.com/devops/continuous-integration/)
+ [ What is Continuous Delivery? ](https://aws.amazon.com/devops/continuous-delivery/)
+ [ APN Partner: partners that can help with configuration management ](https://partners.amazonaws.com/search/partners?keyword=Configuration+Management&ref=wellarchitected)
+ [ Managing the account lifecycle in account-per-tenant SaaS environments on AWS](https://aws.amazon.com/blogs/mt/managing-the-account-lifecycle-in-account-per-tenant-saas-environments-on-aws/)
+ [ Managing and monitoring API throttling in your workloads ](https://aws.amazon.com/blogs/mt/managing-monitoring-api-throttling-in-workloads/)
+ [ View AWS Trusted Advisor recommendations at scale with AWS Organizations](https://aws.amazon.com/blogs/mt/organizational-view-for-trusted-advisor/)
+ [ Automating Service Limit Increases and Enterprise Support with AWS Control Tower](https://aws.amazon.com/blogs/mt/automating-service-limit-increases-enterprise-support-aws-control-tower/)

 **Related videos:** 
+  [AWS Live re:Inforce 2019 - Service Quotas](https://youtu.be/O9R5dWgtrVo) 
+ [ View and Manage Quotas for AWS Services Using Service Quotas ](https://www.youtube.com/watch?v=ZTwfIIf35Wc)
+ [AWS IAM Quotas Demo ](https://www.youtube.com/watch?v=srJ4jr6M9YQ)

 **Related services:** 
+ [ Amazon CodeGuru Reviewer ](https://aws.amazon.com/codeguru/)
+ [AWS CodeDeploy](https://aws.amazon.com/codedeploy/)
+ [AWS CloudTrail](https://aws.amazon.com/cloudtrail/)
+ [ Amazon CloudWatch ](https://aws.amazon.com/cloudwatch/)
+ [ Amazon EventBridge ](https://aws.amazon.com/eventbridge/)
+ [ Amazon DevOps Guru ](https://aws.amazon.com/devops-guru/)
+ [AWS Config](https://aws.amazon.com/config/)
+ [AWS Trusted Advisor](https://aws.amazon.com/premiumsupport/technology/trusted-advisor/)
+ [AWS CDK ](https://aws.amazon.com/cdk/)
+ [AWS Systems Manager](https://aws.amazon.com/systems-manager/)
+ [AWS Marketplace](https://aws.amazon.com/marketplace/search/results?searchTerms=CMDB)

# REL01-BP03 Accommodate fixed service quotas and constraints through architecture
<a name="rel_manage_service_limits_aware_fixed_limits"></a>

Be aware of unchangeable service quotas, service constraints, and physical resource limits. Design architectures for applications and services to prevent these limits from impacting reliability.

Examples include network bandwidth, serverless function invocation payload size, throttle burst rate of an API gateway, and concurrent user connections to a database.

 **Desired outcome:** The application or service performs as expected under normal and high traffic conditions. They have been designed to work within the limitations for that resource’s fixed constraints or service quotas. 

 **Common anti-patterns:** 
+ Choosing a design that uses a resource of a service, unaware that there are design constraints that will cause this design to fail as you scale.
+ Performing benchmarking that is unrealistic and will reach service fixed quotas during the testing. For example, running tests at a burst limit but for an extended amount of time.
+  Choosing a design that cannot scale or be modified if fixed service quotas are to be exceeded. For example, an SQS payload size of 256 KB. 
+  Not designing and implementing observability to monitor and alert on thresholds for service quotas that might be at risk during high traffic events. 

 **Benefits of establishing this best practice:** Verifying that the application will run under all projected services load levels without disruption or degradation. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 Unlike soft service quotas or resources that can be replaced with higher capacity units, AWS services’ fixed quotas cannot be changed. This means that every AWS service with fixed quotas must be evaluated for potential hard capacity limits when used in an application design. 

 Hard limits are shown in the Service Quotas console. If the column shows `ADJUSTABLE = No`, the service has a hard limit. Hard limits are also shown in some resource configuration pages. For example, Lambda has specific hard limits that cannot be adjusted. 
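
 The same adjustability information is available programmatically. As a small sketch, the following function filters a list of quota entries, shaped like the `Quotas` list returned by the Service Quotas `ListServiceQuotas` API operation, down to those reported as not adjustable. The function name and the sample data used to exercise it are hypothetical.

```python
from typing import Any, Dict, Iterable, List


def fixed_quotas(quotas: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only quotas that are reported as not adjustable."""
    return [quota for quota in quotas if quota.get("Adjustable") is False]
```

 Not every fixed limit appears in Service Quotas, so a result like this is best treated as a starting point and cross-checked against each service’s own documentation.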

 As an example, when designing a Python application to run in a Lambda function, the application should be evaluated to determine if there is any chance of Lambda running longer than 15 minutes. If the code may run longer than this service quota limit, alternate technologies or designs must be considered. If this limit is reached after production deployment, the application will suffer degradation and disruption until it can be remediated. Unlike soft quotas, there is no method to change these limits, even under emergency Severity 1 events. 
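
 One way to evaluate the 15-minute limit at design time is a simple estimate of how much work fits in a single invocation. The sketch below assumes work is split into items with a roughly constant processing time and reserves a safety margin; the function names, the 0.8 safety factor, and the per-item timing model are assumptions.

```python
import math

LAMBDA_MAX_TIMEOUT_SECONDS = 900  # 15 minutes, the fixed limit discussed above


def max_items_per_invocation(seconds_per_item: float, safety_factor: float = 0.8) -> int:
    """Items that fit in one invocation while leaving a safety margin."""
    if seconds_per_item <= 0:
        raise ValueError("seconds_per_item must be positive")
    budget = LAMBDA_MAX_TIMEOUT_SECONDS * safety_factor
    return int(budget // seconds_per_item)


def invocations_needed(
    total_items: int, seconds_per_item: float, safety_factor: float = 0.8
) -> int:
    """Number of invocations required to process all items within the limit."""
    per_invocation = max_items_per_invocation(seconds_per_item, safety_factor)
    if per_invocation < 1:
        # A single item exceeds the time budget: this design needs another approach.
        raise ValueError("one item does not fit in a single invocation")
    return math.ceil(total_items / per_invocation)
```

 If `invocations_needed` raises, a single unit of work cannot complete within the limit, which is the signal to consider an alternate technology or design.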

 Once the application has been deployed to a testing environment, strategies should be used to find if any hard limits can be reached. Stress testing, load testing, and chaos testing should be part of the introduction test plan. 

 **Implementation steps** 
+  Review the complete list of AWS services that could be used in the application design phase. 
+  Review the soft quota limits and hard quota limits for all these services. Not all limits are shown in the Service Quotas console. Some services [describe these limits in alternate locations](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html). 
+  As you design your application, review your workload’s business and technology drivers, such as business outcomes, use case, dependent systems, availability targets, and disaster recovery objectives. Let your business and technology drivers guide the process to identify the distributed system that is right for your workload. 
+  Analyze service load across Regions and accounts. Many hard limits are regionally based for services. However, some limits are account based. 
+  Analyze resilience architectures for resource usage during a zonal failure and Regional failure. In the progression of multi-Region designs using active/active, active/passive – hot, active/passive - cold, and active/passive - pilot light approaches, these failure cases will cause higher usage. This creates a potential use case for hitting hard limits. 
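
 The failure analysis in the last step can be approximated with a small calculation. The sketch below assumes that the failed Region’s usage is redistributed evenly across the surviving Regions, which is a simplification, and compares the projected usage with a hard limit. Region names, function names, and numbers are hypothetical.

```python
from typing import Dict, List


def load_after_region_failure(
    usage_by_region: Dict[str, float], failed_region: str
) -> Dict[str, float]:
    """Projected usage per surviving Region after one Region fails."""
    survivors = {r: u for r, u in usage_by_region.items() if r != failed_region}
    if not survivors:
        raise ValueError("no surviving Regions to absorb the load")
    extra = usage_by_region[failed_region] / len(survivors)
    return {region: usage + extra for region, usage in survivors.items()}


def regions_over_limit(
    usage_by_region: Dict[str, float], failed_region: str, hard_limit: float
) -> List[str]:
    """Surviving Regions whose projected usage would exceed the hard limit."""
    projected = load_after_region_failure(usage_by_region, failed_region)
    return sorted(region for region, usage in projected.items() if usage > hard_limit)
```

 With three Regions each using 600 units of a resource, losing one Region raises the other two to 900 each, which fits under a hard limit of 1,000 but not under a limit of 800.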

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [REL01-BP01 Aware of service quotas and constraints](rel_manage_service_limits_aware_quotas_and_constraints.md) 
+  [REL01-BP02 Manage service quotas across accounts and regions](rel_manage_service_limits_limits_considered.md) 
+  [REL01-BP04 Monitor and manage quotas](rel_manage_service_limits_monitor_manage_limits.md) 
+  [REL01-BP05 Automate quota management](rel_manage_service_limits_automated_monitor_limits.md) 
+  [REL01-BP06 Ensure that a sufficient gap exists between the current quotas and the maximum usage to accommodate failover](rel_manage_service_limits_suff_buffer_limits.md) 
+  [REL03-BP01 Choose how to segment your workload](rel_service_architecture_monolith_soa_microservice.md) 
+  [REL10-BP01 Deploy the workload to multiple locations](rel_fault_isolation_multiaz_region_system.md) 
+  [REL11-BP01 Monitor all components of the workload to detect failures](rel_withstand_component_failures_monitoring_health.md) 
+  [REL11-BP03 Automate healing on all layers](rel_withstand_component_failures_auto_healing_system.md) 
+  [REL12-BP05 Test resiliency using chaos engineering](rel_testing_resiliency_failure_injection_resiliency.md) 

 **Related documents:** 
+ [AWS Well-Architected Framework’s Reliability Pillar: Availability ](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/availability.html)
+  [AWS Service Quotas (formerly referred to as service limits)](https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html) 
+  [AWS Trusted Advisor Best Practice Checks (see the Service Limits section)](https://aws.amazon.com/premiumsupport/technology/trusted-advisor/best-practice-checklist/) 
+  [AWS limit monitor on AWS answers](https://aws.amazon.com/answers/account-management/limit-monitor/) 
+  [Amazon EC2 Service Limits](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-resource-limits.html) 
+  [What is Service Quotas?](https://docs.aws.amazon.com/servicequotas/latest/userguide/intro.html) 
+ [ How to Request Quota Increase ](https://docs.aws.amazon.com/servicequotas/latest/userguide/request-quota-increase.html)
+ [ Service endpoints and quotas ](https://docs.aws.amazon.com/general/latest/gr/aws-service-information.html)
+  [Service Quotas User Guide](https://docs.aws.amazon.com/servicequotas/latest/userguide/intro.html) 
+ [ Quota Monitor for AWS](https://aws.amazon.com/solutions/implementations/quota-monitor/)
+ [AWS Fault Isolation Boundaries ](https://docs.aws.amazon.com/whitepapers/latest/aws-fault-isolation-boundaries/abstract-and-introduction.html)
+ [ Availability with redundancy ](https://docs.aws.amazon.com/whitepapers/latest/availability-and-beyond-improving-resilience/availability-with-redundancy.html)
+ [AWS for Data ](https://aws.amazon.com/data/)
+ [ What is Continuous Integration? ](https://aws.amazon.com/devops/continuous-integration/)
+ [ What is Continuous Delivery? ](https://aws.amazon.com/devops/continuous-delivery/)
+ [ APN Partner: partners that can help with configuration management ](https://partners.amazonaws.com/search/partners?keyword=Configuration+Management&ref=wellarchitected)
+ [ Managing the account lifecycle in account-per-tenant SaaS environments on AWS](https://aws.amazon.com/blogs/mt/managing-the-account-lifecycle-in-account-per-tenant-saas-environments-on-aws/)
+ [ Managing and monitoring API throttling in your workloads ](https://aws.amazon.com/blogs/mt/managing-monitoring-api-throttling-in-workloads/)
+ [ View AWS Trusted Advisor recommendations at scale with AWS Organizations](https://aws.amazon.com/blogs/mt/organizational-view-for-trusted-advisor/)
+ [ Automating Service Limit Increases and Enterprise Support with AWS Control Tower](https://aws.amazon.com/blogs/mt/automating-service-limit-increases-enterprise-support-aws-control-tower/)
+ [ Actions, resources, and condition keys for Service Quotas ](https://docs.aws.amazon.com/service-authorization/latest/reference/list_servicequotas.html)

 **Related videos:** 
+  [AWS Live re:Inforce 2019 - Service Quotas](https://youtu.be/O9R5dWgtrVo) 
+ [ View and Manage Quotas for AWS Services Using Service Quotas ](https://www.youtube.com/watch?v=ZTwfIIf35Wc)
+ [AWS IAM Quotas Demo ](https://www.youtube.com/watch?v=srJ4jr6M9YQ)
+ [AWS re:Invent 2018: Close Loops and Opening Minds: How to Take Control of Systems, Big and Small ](https://www.youtube.com/watch?v=O8xLxNje30M)

 **Related tools:** 
+ [AWS CodeDeploy](https://aws.amazon.com/codedeploy/)
+ [AWS CloudTrail](https://aws.amazon.com/cloudtrail/)
+ [ Amazon CloudWatch ](https://aws.amazon.com/cloudwatch/)
+ [ Amazon EventBridge ](https://aws.amazon.com/eventbridge/)
+ [ Amazon DevOps Guru ](https://aws.amazon.com/devops-guru/)
+ [AWS Config](https://aws.amazon.com/config/)
+ [AWS Trusted Advisor](https://aws.amazon.com/premiumsupport/technology/trusted-advisor/)
+ [AWS CDK ](https://aws.amazon.com/cdk/)
+ [AWS Systems Manager](https://aws.amazon.com/systems-manager/)
+ [AWS Marketplace](https://aws.amazon.com/marketplace/search/results?searchTerms=CMDB)

# REL01-BP04 Monitor and manage quotas
<a name="rel_manage_service_limits_monitor_manage_limits"></a>

 Evaluate your potential usage and increase your quotas appropriately, allowing for planned growth in usage. 

 **Desired outcome:** Active and automated systems that monitor and manage quotas have been deployed. These operations solutions detect when quota usage thresholds are close to being reached, and the condition is proactively remediated by requesting quota changes. 

 **Common anti-patterns:** 
+ Not configuring monitoring to check for service quota thresholds.
+ Not configuring monitoring for hard limits, even though those values cannot be changed.
+  Assuming that the time required to request and secure a soft quota change is immediate or short. 
+  Configuring alarms for when service quotas are being approached, but having no process on how to respond to an alert. 
+  Only configuring alarms for services supported by AWS Service Quotas and not monitoring other AWS services. 
+  Not considering quota management for multi-Region resiliency designs, like active/active, active/passive - hot, active/passive - cold, and active/passive - pilot light approaches. 
+  Not assessing quota differences between Regions. 
+  Not assessing the needs in every Region for a specific quota increase request. 
+  Not leveraging [templates for multi-Region quota management](https://docs.aws.amazon.com/servicequotas/latest/userguide/organization-templates.html). 

 **Benefits of establishing this best practice:** Automatic tracking of the AWS Service Quotas and monitoring your usage against those quotas will allow you to see when you are approaching a quota limit. You can also use this monitoring data to help limit any degradations due to quota exhaustion. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 For supported services, you can monitor your quotas by configuring services that assess usage and then send alerts or alarms when usage approaches a quota. These alarms can be invoked from AWS Config, Lambda functions, Amazon CloudWatch, or AWS Trusted Advisor. You can also use metric filters on CloudWatch Logs to search and extract patterns in logs to determine if usage is approaching quota thresholds. 

 **Implementation steps** 

 For monitoring: 
+  Capture current resource consumption (for example, buckets or instances). Use service API operations, such as the Amazon EC2 `DescribeInstances` API, to collect current resource consumption. 
+  Capture your current quotas that are essential and applicable to the services using: 
  +  AWS Service Quotas 
  +  AWS Trusted Advisor 
  +  AWS documentation 
  +  AWS service-specific pages 
  +  AWS Command Line Interface (AWS CLI) 
  +  AWS Cloud Development Kit (AWS CDK) 
+  Use AWS Service Quotas, an AWS service that helps you manage your quotas for over 250 AWS services from one location. 
+  Use Trusted Advisor service limits to monitor your current service limits at various thresholds. 
+  Use the service quota history (console or AWS CLI) to check on Regional increases. 
+  Compare service quota changes in each Region and each account to create equivalency, if required. 
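
The monitoring steps above come down to comparing captured consumption with the applied quota. The following is a minimal sketch in Python, assuming usage counts (for example, from `DescribeInstances`) and quota values (for example, from Service Quotas) have already been collected into plain dictionaries. The quota names, the 80% threshold, and the function name `find_quotas_near_limit` are illustrative assumptions, not AWS defaults.

```python
from typing import Dict, List, Tuple


def find_quotas_near_limit(
    usage: Dict[str, float],
    quotas: Dict[str, float],
    threshold: float = 0.8,
) -> List[Tuple[str, float]]:
    """Return (quota_name, utilization) pairs at or above the threshold.

    Quotas with no recorded usage are treated as 0% utilized, and quotas
    with a value of zero or less are skipped to avoid division by zero.
    """
    flagged = []
    for name, limit in quotas.items():
        if limit <= 0:
            continue
        utilization = usage.get(name, 0.0) / limit
        if utilization >= threshold:
            flagged.append((name, utilization))
    # Highest utilization first, so the most urgent quota leads the report.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    quotas = {"ec2-on-demand-vcpus": 256, "s3-buckets": 100, "vpcs": 5}
    usage = {"ec2-on-demand-vcpus": 230, "s3-buckets": 40, "vpcs": 4}
    for name, utilization in find_quotas_near_limit(usage, quotas):
        print(f"{name}: {utilization:.0%} of quota in use")
```

A report like this can feed whichever alerting path you already use; the threshold is a parameter so that it can differ by service or by how long a quota increase typically takes.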

 For management: 
+  Automated: Set up an AWS Config custom rule to scan service quotas across Regions and compare for differences. 
+  Automated: Set up a scheduled Lambda function to scan service quotas across Regions and compare for differences. 
+  Manual: Use the AWS CLI, API, or AWS Management Console to scan service quotas across Regions and compare for differences. Report the differences. 
+  If differences in quotas are identified between Regions, request a quota change, if required. 
+  Review the result of all requests. 
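
The scheduled scan described above could look roughly like the following sketch, which might run inside a Lambda function or a local script. It assumes boto3 and credentials with permission to call the Service Quotas `ListServiceQuotas` operation; the function names `fetch_quotas` and `diff_quotas` and the quota codes in the usage note are hypothetical. The comparison logic is kept separate from the API call so that it can be exercised without AWS access.

```python
from typing import Dict, List, Optional

# Maps Region name -> {quota code -> applied quota value}.
QuotasByRegion = Dict[str, Dict[str, float]]


def fetch_quotas(service_code: str, regions: List[str]) -> QuotasByRegion:
    """Collect applied quota values per Region with the Service Quotas API.

    Requires boto3 and AWS credentials. boto3 is imported lazily so the
    comparison logic below can be used and tested without it.
    """
    import boto3

    result: QuotasByRegion = {}
    for region in regions:
        client = boto3.client("service-quotas", region_name=region)
        paginator = client.get_paginator("list_service_quotas")
        result[region] = {
            quota["QuotaCode"]: quota["Value"]
            for page in paginator.paginate(ServiceCode=service_code)
            for quota in page["Quotas"]
        }
    return result


def diff_quotas(quotas: QuotasByRegion) -> Dict[str, Dict[str, Optional[float]]]:
    """Return quota codes whose values are not identical in every Region.

    A Region that does not report a quota is shown with the value None.
    """
    all_codes = set()
    for values in quotas.values():
        all_codes.update(values)

    differences: Dict[str, Dict[str, Optional[float]]] = {}
    for code in sorted(all_codes):
        per_region = {region: values.get(code) for region, values in quotas.items()}
        if len(set(per_region.values())) > 1:
            differences[code] = per_region
    return differences
```

For example, `diff_quotas({"us-east-1": {"L-AAAA": 100.0}, "eu-west-1": {"L-AAAA": 40.0}})` reports `L-AAAA` with both values, which is the input you need when deciding whether to request a matching increase.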

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [REL01-BP01 Aware of service quotas and constraints](rel_manage_service_limits_aware_quotas_and_constraints.md) 
+  [REL01-BP02 Manage service quotas across accounts and regions](rel_manage_service_limits_limits_considered.md) 
+  [REL01-BP03 Accommodate fixed service quotas and constraints through architecture](rel_manage_service_limits_aware_fixed_limits.md) 
+  [REL01-BP05 Automate quota management](rel_manage_service_limits_automated_monitor_limits.md) 
+  [REL01-BP06 Ensure that a sufficient gap exists between the current quotas and the maximum usage to accommodate failover](rel_manage_service_limits_suff_buffer_limits.md) 
+  [REL03-BP01 Choose how to segment your workload](rel_service_architecture_monolith_soa_microservice.md) 
+  [REL10-BP01 Deploy the workload to multiple locations](rel_fault_isolation_multiaz_region_system.md) 
+  [REL11-BP01 Monitor all components of the workload to detect failures](rel_withstand_component_failures_monitoring_health.md) 
+  [REL11-BP03 Automate healing on all layers](rel_withstand_component_failures_auto_healing_system.md) 
+  [REL12-BP05 Test resiliency using chaos engineering](rel_testing_resiliency_failure_injection_resiliency.md) 

 **Related documents:** 
+ [AWS Well-Architected Framework’s Reliability Pillar: Availability ](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/availability.html)
+  [AWS Service Quotas (formerly referred to as service limits)](https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html) 
+  [AWS Trusted Advisor Best Practice Checks (see the Service Limits section)](https://aws.amazon.com/premiumsupport/technology/trusted-advisor/best-practice-checklist/) 
+  [AWS limit monitor on AWS answers](https://aws.amazon.com/answers/account-management/limit-monitor/) 
+  [Amazon EC2 Service Limits](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-resource-limits.html) 
+  [What is Service Quotas?](https://docs.aws.amazon.com/servicequotas/latest/userguide/intro.html) 
+ [ How to Request Quota Increase ](https://docs.aws.amazon.com/servicequotas/latest/userguide/request-quota-increase.html)
+ [ Service endpoints and quotas ](https://docs.aws.amazon.com/general/latest/gr/aws-service-information.html)
+  [Service Quotas User Guide](https://docs.aws.amazon.com/servicequotas/latest/userguide/intro.html) 
+ [ Quota Monitor for AWS](https://aws.amazon.com/solutions/implementations/quota-monitor/)
+ [AWS Fault Isolation Boundaries ](https://docs.aws.amazon.com/whitepapers/latest/aws-fault-isolation-boundaries/abstract-and-introduction.html)
+ [ Availability with redundancy ](https://docs.aws.amazon.com/whitepapers/latest/availability-and-beyond-improving-resilience/availability-with-redundancy.html)
+ [AWS for Data ](https://aws.amazon.com/data/)
+ [ What is Continuous Integration? ](https://aws.amazon.com/devops/continuous-integration/)
+ [ What is Continuous Delivery? ](https://aws.amazon.com/devops/continuous-delivery/)
+ [ APN Partner: partners that can help with configuration management ](https://partners.amazonaws.com/search/partners?keyword=Configuration+Management&ref=wellarchitected)
+ [ Managing the account lifecycle in account-per-tenant SaaS environments on AWS](https://aws.amazon.com/blogs/mt/managing-the-account-lifecycle-in-account-per-tenant-saas-environments-on-aws/)
+ [ Managing and monitoring API throttling in your workloads ](https://aws.amazon.com/blogs/mt/managing-monitoring-api-throttling-in-workloads/)
+ [ View AWS Trusted Advisor recommendations at scale with AWS Organizations](https://aws.amazon.com/blogs/mt/organizational-view-for-trusted-advisor/)
+ [ Automating Service Limit Increases and Enterprise Support with AWS Control Tower](https://aws.amazon.com/blogs/mt/automating-service-limit-increases-enterprise-support-aws-control-tower/)
+ [ Actions, resources, and condition keys for Service Quotas ](https://docs.aws.amazon.com/service-authorization/latest/reference/list_servicequotas.html)

 **Related videos:** 
+  [AWS Live re:Inforce 2019 - Service Quotas](https://youtu.be/O9R5dWgtrVo) 
+ [ View and Manage Quotas for AWS Services Using Service Quotas ](https://www.youtube.com/watch?v=ZTwfIIf35Wc)
+ [AWS IAM Quotas Demo ](https://www.youtube.com/watch?v=srJ4jr6M9YQ)
+ [AWS re:Invent 2018: Close Loops and Opening Minds: How to Take Control of Systems, Big and Small ](https://www.youtube.com/watch?v=O8xLxNje30M)

 **Related tools:** 
+ [AWS CodeDeploy](https://aws.amazon.com/codedeploy/)
+ [AWS CloudTrail](https://aws.amazon.com/cloudtrail/)
+ [ Amazon CloudWatch ](https://aws.amazon.com/cloudwatch/)
+ [ Amazon EventBridge ](https://aws.amazon.com/eventbridge/)
+ [ Amazon DevOps Guru ](https://aws.amazon.com/devops-guru/)
+ [AWS Config](https://aws.amazon.com/config/)
+ [AWS Trusted Advisor](https://aws.amazon.com/premiumsupport/technology/trusted-advisor/)
+ [AWS CDK ](https://aws.amazon.com/cdk/)
+ [AWS Systems Manager](https://aws.amazon.com/systems-manager/)
+ [AWS Marketplace](https://aws.amazon.com/marketplace/search/results?searchTerms=CMDB)

# REL01-BP05 Automate quota management
<a name="rel_manage_service_limits_automated_monitor_limits"></a>

 Implement tools to alert you when thresholds are being approached. You can automate quota increase requests by using AWS Service Quotas APIs. 

 If you integrate your Configuration Management Database (CMDB) or ticketing system with Service Quotas, you can automate the tracking of quota increase requests and current quotas. In addition to the AWS SDK, Service Quotas offers automation using the AWS Command Line Interface (AWS CLI). 

 **Common anti-patterns:** 
+  Tracking the quotas and usage in spreadsheets. 
+  Running reports on usage daily, weekly, or monthly, and then comparing usage to the quotas. 

 **Benefits of establishing this best practice:** Automated tracking of the AWS service quotas and monitoring of your usage against that quota allows you to see when you are approaching a quota. You can set up automation to assist you in requesting a quota increase when needed. You might want to consider lowering some quotas when your usage trends in the opposite direction to realize the benefits of lowered risk (in case of compromised credentials) and cost savings. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Set up automated monitoring: Implement tools using SDKs to alert you when thresholds are being approached. 
  +  Use Service Quotas and augment the service with an automated quota monitoring solution, such as Quota Monitor for AWS (formerly AWS Limit Monitor) or an offering from AWS Marketplace. 
    +  [What is Service Quotas?](https://docs.aws.amazon.com/servicequotas/latest/userguide/intro.html) 
    +  [Quota Monitor on AWS - AWS Solution](https://aws.amazon.com/answers/account-management/limit-monitor/) 
  +  Set up automated responses based on quota thresholds, using Amazon SNS and AWS Service Quotas APIs. 
  +  Test automation. 
    +  Configure limit thresholds. 
    +  Integrate with change events from AWS Config, deployment pipelines, Amazon EventBridge, or third parties. 
    +  Artificially set low quota thresholds to test responses. 
    +  Set up automated operations to take appropriate action on notifications and contact AWS Support when necessary. 
    +  Manually start change events. 
    +  Run a game day to test the quota increase change process. 
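
As one way to picture the automated response, the sketch below decides whether an increase is warranted and, if so, submits it through the Service Quotas `RequestServiceQuotaIncrease` operation. It assumes boto3 and suitable permissions for the API call. The 80% trigger, the 50% target utilization, and the function names are assumptions made for illustration, not recommended values.

```python
import math
from typing import Optional


def plan_quota_increase(
    current_quota: float,
    usage: float,
    threshold: float = 0.8,
    target_utilization: float = 0.5,
) -> Optional[float]:
    """Return a desired quota value, or None if no increase is needed.

    An increase is planned once usage reaches `threshold` of the quota.
    The new value is sized so that current usage would sit at
    `target_utilization` of it, rounded up to a whole number.
    """
    if current_quota <= 0:
        raise ValueError("current_quota must be positive")
    # Requiring target < threshold guarantees the result exceeds the quota.
    if not 0 < target_utilization < threshold <= 1:
        raise ValueError("expected 0 < target_utilization < threshold <= 1")
    if usage / current_quota < threshold:
        return None
    return float(math.ceil(usage / target_utilization))


def request_increase(
    service_code: str, quota_code: str, desired_value: float, region: str
) -> str:
    """Submit the increase and return the request ID for tracking.

    Requires boto3 and AWS credentials; imported lazily so the planning
    logic above can be tested without them.
    """
    import boto3

    client = boto3.client("service-quotas", region_name=region)
    response = client.request_service_quota_increase(
        ServiceCode=service_code,
        QuotaCode=quota_code,
        DesiredValue=desired_value,
    )
    return response["RequestedQuota"]["Id"]
```

Passing an artificially low `threshold` is a simple way to exercise the response path during a test or game day without waiting for real usage to climb. The returned request ID is what a CMDB or ticketing integration would record.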

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [APN Partner: partners that can help with configuration management](https://aws.amazon.com/partners/find/results/?keyword=Configuration+Management) 
+  [AWS Marketplace: CMDB products that help track limits](https://aws.amazon.com/marketplace/search/results?searchTerms=CMDB) 
+  [AWS Service Quotas (formerly referred to as service limits)](https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html) 
+  [AWS Trusted Advisor Best Practice Checks (see the Service Limits section)](https://aws.amazon.com/premiumsupport/technology/trusted-advisor/best-practice-checklist/) 
+  [Quota Monitor on AWS - AWS Solution](https://aws.amazon.com/answers/account-management/limit-monitor/) 
+  [Amazon EC2 Service Limits](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-resource-limits.html) 
+  [What is Service Quotas?](https://docs.aws.amazon.com/servicequotas/latest/userguide/intro.html) 

 **Related videos:** 
+  [AWS Live re:Inforce 2019 - Service Quotas](https://youtu.be/O9R5dWgtrVo) 

# REL01-BP06 Ensure that a sufficient gap exists between the current quotas and the maximum usage to accommodate failover
<a name="rel_manage_service_limits_suff_buffer_limits"></a>

This article explains how to maintain space between the resource quota and your usage, and how it can benefit your organization. After you finish using a resource, the usage quota may continue to account for that resource. This can result in a failing or inaccessible resource. Prevent resource failure by verifying that your quotas cover the overlap of inaccessible resources and their replacements. Consider cases like network failure, Availability Zone failure, or Region failures when calculating this gap.

 **Desired outcome:** Small or large failures in resources or resource accessibility can be covered within the current service thresholds. Zone failures, network failures, or even Regional failures have been considered in the resource planning. 

 **Common anti-patterns:** 
+  Setting service quotas based on current needs without accounting for failover scenarios. 
+  Not considering the principles of static stability when calculating the peak quota for a service. 
+  Not considering the potential of inaccessible resources in calculating total quota needed for each Region. 
+  Not considering AWS service fault isolation boundaries for some services and their potential abnormal usage patterns. 

 **Benefits of establishing this best practice:** When service disruption events impact application availability, use the cloud to implement strategies to recover from these events. An example strategy is creating additional resources to replace inaccessible resources to accommodate failover conditions without exhausting your service limit. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 When evaluating a quota limit, consider failover cases that might occur due to some degradation. Consider the following failover cases. 
+  A disrupted or inaccessible VPC. 
+  An inaccessible subnet. 
+  A degraded Availability Zone that impacts resource accessibility. 
+  Networking routes or ingress and egress points are blocked or changed. 
+  A degraded Region that impacts resource accessibility. 
+  A subset of resources affected by a failure in a Region or an Availability Zone. 

 The decision to fail over is unique for each situation, as the business impact can vary. Address resource capacity planning in the failover location and the resources’ quotas before deciding to fail over an application or service. 

 Consider higher than normal peaks of activity when reviewing quotas for each service. These peaks might be related to resources that are inaccessible due to networking or permissions, but are still active. Unterminated active resources count against the service quota limit. 

 **Implementation steps** 
+  Maintain space between your service quota and your maximum usage to accommodate a failover or loss of accessibility. 
+  Determine your service quotas. Account for typical deployment patterns, availability requirements, and consumption growth. 
+  Request quota increases if necessary. Anticipate a wait time for the quota increase request. 
+  Determine your reliability requirements (also known as your number of nines). 
+  Understand potential fault scenarios such as loss of a component, an Availability Zone, or a Region. 
+  Establish your deployment methodology (examples include canary, blue/green, red/black, and rolling). 
+  Add an appropriate buffer to the current quota limit. An example buffer could be 15%. 
+  Include calculations for static stability (Zonal and Regional) where appropriate. 
+  Plan consumption growth and monitor your consumption trends. 
+  Consider the static stability impact for your most critical workloads. Assess resources conforming to a statically stable system in all Regions and Availability Zones. 
+  Consider using On-Demand Capacity Reservations to schedule capacity ahead of any failover. This is a useful strategy for critical business schedules, because it reduces the risk of not obtaining the correct quantity and type of resources during failover. 
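
The gap calculation in these steps can be made concrete with a small worked example. The sketch below assumes usage is spread evenly across Availability Zones and that resources in a failed zone remain unterminated, so they still count against the quota while replacements launch elsewhere. The function name `required_quota` and its parameters are hypothetical, and the 15% default mirrors the example buffer above.

```python
def _ceil_div(numerator: int, denominator: int) -> int:
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-numerator // denominator)


def required_quota(
    peak_usage: int,
    zones: int,
    zones_lost: int = 1,
    buffer_percent: int = 15,
) -> int:
    """Estimate the quota needed to replace capacity lost in a zonal failure.

    peak_usage:     resources in use at peak, across all zones
    zones:          number of Availability Zones the workload spans
    zones_lost:     number of zones assumed to fail at the same time
    buffer_percent: extra headroom added on top of the overlap peak
    """
    if not 0 < zones_lost < zones:
        raise ValueError("zones_lost must be between 1 and zones - 1")
    # Resources stranded in the failed zones (rounded up).
    lost = _ceil_div(peak_usage * zones_lost, zones)
    # Existing resources, including inaccessible ones, plus replacements.
    overlap_peak = peak_usage + lost
    return _ceil_div(overlap_peak * (100 + buffer_percent), 100)


if __name__ == "__main__":
    # 90 instances across 3 zones: losing one zone strands 30 instances,
    # so 120 count against the quota during recovery, plus a 15% buffer.
    print(required_quota(90, 3))  # prints 138
```

This models the case where replacements are launched after the failure. A statically stable design that pre-provisions spare capacity would already carry that capacity in its peak usage, so the inputs rather than the formula would change.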

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [REL01-BP01 Aware of service quotas and constraints](rel_manage_service_limits_aware_quotas_and_constraints.md) 
+  [REL01-BP02 Manage service quotas across accounts and regions](rel_manage_service_limits_limits_considered.md) 
+  [REL01-BP03 Accommodate fixed service quotas and constraints through architecture](rel_manage_service_limits_aware_fixed_limits.md) 
+  [REL01-BP04 Monitor and manage quotas](rel_manage_service_limits_monitor_manage_limits.md) 
+  [REL01-BP05 Automate quota management](rel_manage_service_limits_automated_monitor_limits.md) 
+  [REL03-BP01 Choose how to segment your workload](rel_service_architecture_monolith_soa_microservice.md) 
+  [REL10-BP01 Deploy the workload to multiple locations](rel_fault_isolation_multiaz_region_system.md) 
+  [REL11-BP01 Monitor all components of the workload to detect failures](rel_withstand_component_failures_monitoring_health.md) 
+  [REL11-BP03 Automate healing on all layers](rel_withstand_component_failures_auto_healing_system.md) 
+  [REL12-BP05 Test resiliency using chaos engineering](rel_testing_resiliency_failure_injection_resiliency.md) 

 **Related documents:** 
+ [AWS Well-Architected Framework’s Reliability Pillar: Availability ](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/availability.html)
+  [AWS Service Quotas (formerly referred to as service limits)](https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html) 
+  [AWS Trusted Advisor Best Practice Checks (see the Service Limits section)](https://aws.amazon.com/premiumsupport/technology/trusted-advisor/best-practice-checklist/) 
+  [AWS limit monitor on AWS answers](https://aws.amazon.com/answers/account-management/limit-monitor/) 
+  [Amazon EC2 Service Limits](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-resource-limits.html) 
+  [What is Service Quotas?](https://docs.aws.amazon.com/servicequotas/latest/userguide/intro.html) 
+ [ How to Request Quota Increase ](https://docs.aws.amazon.com/servicequotas/latest/userguide/request-quota-increase.html)
+ [ Service endpoints and quotas ](https://docs.aws.amazon.com/general/latest/gr/aws-service-information.html)
+  [Service Quotas User Guide](https://docs.aws.amazon.com/servicequotas/latest/userguide/intro.html) 
+ [ Quota Monitor for AWS](https://aws.amazon.com/solutions/implementations/quota-monitor/)
+ [AWS Fault Isolation Boundaries ](https://docs.aws.amazon.com/whitepapers/latest/aws-fault-isolation-boundaries/abstract-and-introduction.html)
+ [ Availability with redundancy ](https://docs.aws.amazon.com/whitepapers/latest/availability-and-beyond-improving-resilience/availability-with-redundancy.html)
+ [AWS for Data ](https://aws.amazon.com/data/)
+ [ What is Continuous Integration? ](https://aws.amazon.com/devops/continuous-integration/)
+ [ What is Continuous Delivery? ](https://aws.amazon.com/devops/continuous-delivery/)
+ [ APN Partner: partners that can help with configuration management ](https://partners.amazonaws.com/search/partners?keyword=Configuration+Management&ref=wellarchitected)
+ [ Managing the account lifecycle in account-per-tenant SaaS environments on AWS](https://aws.amazon.com/blogs/mt/managing-the-account-lifecycle-in-account-per-tenant-saas-environments-on-aws/)
+ [ Managing and monitoring API throttling in your workloads ](https://aws.amazon.com/blogs/mt/managing-monitoring-api-throttling-in-workloads/)
+ [ View AWS Trusted Advisor recommendations at scale with AWS Organizations](https://aws.amazon.com/blogs/mt/organizational-view-for-trusted-advisor/)
+ [ Automating Service Limit Increases and Enterprise Support with AWS Control Tower](https://aws.amazon.com/blogs/mt/automating-service-limit-increases-enterprise-support-aws-control-tower/)
+ [ Actions, resources, and condition keys for Service Quotas ](https://docs.aws.amazon.com/service-authorization/latest/reference/list_servicequotas.html)

 **Related videos:** 
+  [AWS Live re:Inforce 2019 - Service Quotas](https://youtu.be/O9R5dWgtrVo) 
+ [ View and Manage Quotas for AWS Services Using Service Quotas ](https://www.youtube.com/watch?v=ZTwfIIf35Wc)
+ [AWS IAM Quotas Demo ](https://www.youtube.com/watch?v=srJ4jr6M9YQ)
+ [AWS re:Invent 2018: Close Loops and Opening Minds: How to Take Control of Systems, Big and Small ](https://www.youtube.com/watch?v=O8xLxNje30M)

 **Related tools:** 
+ [AWS CodeDeploy](https://aws.amazon.com/codedeploy/)
+ [AWS CloudTrail](https://aws.amazon.com/cloudtrail/)
+ [ Amazon CloudWatch ](https://aws.amazon.com/cloudwatch/)
+ [ Amazon EventBridge ](https://aws.amazon.com/eventbridge/)
+ [ Amazon DevOps Guru ](https://aws.amazon.com/devops-guru/)
+ [AWS Config](https://aws.amazon.com/config/)
+ [AWS Trusted Advisor](https://aws.amazon.com/premiumsupport/technology/trusted-advisor/)
+ [AWS CDK ](https://aws.amazon.com/cdk/)
+ [AWS Systems Manager](https://aws.amazon.com/systems-manager/)
+ [AWS Marketplace](https://aws.amazon.com/marketplace/search/results?searchTerms=CMDB)

# REL 2. How do you plan your network topology?
<a name="rel-02"></a>

Workloads often exist in multiple environments. These include multiple cloud environments (both publicly accessible and private) and possibly your existing data center infrastructure. Plans must include network considerations such as intra- and intersystem connectivity, public IP address management, private IP address management, and domain name resolution.

**Topics**
+ [REL02-BP01 Use highly available network connectivity for your workload public endpoints](rel_planning_network_topology_ha_conn_users.md)
+ [REL02-BP02 Provision redundant connectivity between private networks in the cloud and on-premises environments](rel_planning_network_topology_ha_conn_private_networks.md)
+ [REL02-BP03 Ensure IP subnet allocation accounts for expansion and availability](rel_planning_network_topology_ip_subnet_allocation.md)
+ [REL02-BP04 Prefer hub-and-spoke topologies over many-to-many mesh](rel_planning_network_topology_prefer_hub_and_spoke.md)
+ [REL02-BP05 Enforce non-overlapping private IP address ranges in all private address spaces where they are connected](rel_planning_network_topology_non_overlap_ip.md)

# REL02-BP01 Use highly available network connectivity for your workload public endpoints
<a name="rel_planning_network_topology_ha_conn_users"></a>

 Building highly available network connectivity to public endpoints of your workloads can help you reduce downtime due to loss of connectivity and improve the availability and SLA of your workload. To achieve this, use highly available DNS, content delivery networks (CDNs), API gateways, load balancing, or reverse proxies. 

 **Desired outcome:** It is critical to plan, build, and operationalize highly available network connectivity for your public endpoints. If your workload becomes unreachable due to a loss in connectivity, even if your workload is running and available, your customers will see your system as down. By combining the highly available and resilient network connectivity for your workload’s public endpoints, along with a resilient architecture for your workload itself, you can provide the best possible availability and service level for your customers. 

 AWS Global Accelerator, Amazon CloudFront, Amazon API Gateway, AWS Lambda Function URLs, AWS AppSync APIs, and Elastic Load Balancing (ELB) all provide highly available public endpoints. Amazon Route 53 provides a highly available DNS service for domain name resolution to verify that your public endpoint addresses can be resolved. 

 You can also evaluate AWS Marketplace software appliances for load balancing and proxying. 

 **Common anti-patterns:** 
+ Designing a highly available workload without planning out DNS and network connectivity for high availability.
+  Using public internet addresses on individual instances or containers and managing the connectivity to them with DNS.
+  Using IP addresses instead of domain names for locating services.
+  Not testing out scenarios where connectivity to your public endpoints is lost. 
+  Not analyzing network throughput needs and distribution patterns. 
+  Not testing and planning for scenarios where internet connectivity to the public endpoints of your workload might be interrupted. 
+  Providing content (like web pages, static assets, or media files) to a large geographic area and not using a content delivery network. 
+  Not planning for distributed denial of service (DDoS) attacks. DDoS attacks risk shutting out legitimate traffic and lowering availability for your users. 

 **Benefits of establishing this best practice:** Designing for highly available and resilient network connectivity ensures that your workload is accessible and available to your users. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 At the core of building highly available network connectivity to your public endpoints is the routing of the traffic. To verify your traffic is able to reach the endpoints, the DNS must be able to resolve the domain names to their corresponding IP addresses. Use a highly available and scalable [Domain Name System (DNS)](https://aws.amazon.com/route53/what-is-dns/) such as Amazon Route 53 to manage your domain’s DNS records. You can also use health checks provided by Amazon Route 53. The health checks verify that your application is reachable, available, and functional, and they can be set up in a way that they mimic your user’s behavior, such as requesting a web page or a specific URL. In case of failure, Amazon Route 53 responds to DNS resolution requests and directs the traffic to only healthy endpoints. You can also consider using Geo DNS and Latency Based Routing capabilities offered by Amazon Route 53. 
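
To make the health check idea concrete, the following sketch builds the configuration for an HTTPS check that requests a specific URL path, and optionally creates it through the Route 53 `CreateHealthCheck` operation. It assumes boto3 and credentials for the API call. The domain name, path, and function names are placeholders chosen for illustration.

```python
from typing import Any, Dict


def build_health_check_config(
    domain_name: str,
    path: str = "/",
    port: int = 443,
    request_interval: int = 30,
    failure_threshold: int = 3,
) -> Dict[str, Any]:
    """Build a HealthCheckConfig for an HTTPS endpoint health check."""
    if request_interval not in (10, 30):
        raise ValueError("request_interval must be 10 or 30 seconds")
    if not 1 <= failure_threshold <= 10:
        raise ValueError("failure_threshold must be between 1 and 10")
    return {
        "Type": "HTTPS",
        "FullyQualifiedDomainName": domain_name,
        "Port": port,
        # Request a page that exercises the application, not just the server.
        "ResourcePath": path,
        "RequestInterval": request_interval,
        "FailureThreshold": failure_threshold,
    }


def create_health_check(config: Dict[str, Any], caller_reference: str) -> str:
    """Create the health check and return its ID.

    Requires boto3 and AWS credentials; imported lazily so the config
    builder above can be tested without them. `caller_reference` must be
    unique per request so that retries do not create duplicates.
    """
    import boto3

    client = boto3.client("route53")
    response = client.create_health_check(
        CallerReference=caller_reference,
        HealthCheckConfig=config,
    )
    return response["HealthCheck"]["Id"]


if __name__ == "__main__":
    # Placeholder endpoint; prints the config without calling AWS.
    print(build_health_check_config("www.example.com", path="/health"))
```

Pointing `ResourcePath` at a page that touches the application's real dependencies is what lets the check mimic user behavior rather than only confirming that a port is open.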

 To verify that your workload itself is highly available, use Elastic Load Balancing (ELB). Amazon Route 53 can be used to target traffic to ELB, which distributes the traffic to the target compute instances. You can also use Amazon API Gateway along with AWS Lambda for a serverless solution. Customers can also run workloads in multiple AWS Regions. With a [multi-site active/active pattern](https://aws.amazon.com/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-i-strategies-for-recovery-in-the-cloud/), the workload can serve traffic from multiple Regions. With a multi-site active/passive pattern, the workload serves traffic from the active Region while data is replicated to the secondary Region, which becomes active in the event of a failure in the primary Region. Route 53 health checks can then be used to control DNS failover from any endpoint in a primary Region to an endpoint in a secondary Region, verifying that your workload is reachable and available to your users. 
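
The active/passive decision described above can be modeled in a few lines. This is a simplified illustration of the selection logic, not Route 53's implementation: endpoint names are placeholders, health is supplied as a plain dictionary, and the choice to answer with the primary when nothing is healthy is an assumption of this sketch.

```python
from typing import Dict


def choose_failover_endpoint(
    primary: str,
    secondary: str,
    healthy: Dict[str, bool],
) -> str:
    """Pick the endpoint to answer with under an active/passive policy.

    Endpoints missing from `healthy` are treated as unhealthy.
    """
    if healthy.get(primary, False):
        return primary
    if healthy.get(secondary, False):
        return secondary
    # Assumption of this sketch: with no healthy endpoint, still answer
    # with the primary rather than returning nothing.
    return primary


if __name__ == "__main__":
    # Placeholder endpoint names for a primary and a secondary Region.
    status = {"app.us-east-1.example.com": False, "app.eu-west-1.example.com": True}
    print(choose_failover_endpoint(
        "app.us-east-1.example.com", "app.eu-west-1.example.com", status
    ))
```

Writing the rule down this way is mainly useful for testing failover runbooks: you can enumerate the health combinations you expect and confirm where traffic should land in each.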

 Amazon CloudFront provides a simple API for distributing content with low latency and high data transfer rates by serving requests using a network of edge locations around the world. Content delivery networks (CDNs) serve customers by delivering content located or cached at a location near the user. This also improves availability of your application as the load for content is shifted away from your servers over to CloudFront’s [edge locations](https://aws.amazon.com/products/networking/edge-networking/). The edge locations and regional edge caches hold cached copies of your content close to your viewers, resulting in quick retrieval and increasing reachability and availability of your workload. 

 For workloads with users spread out geographically, AWS Global Accelerator helps you improve the availability and performance of the applications. AWS Global Accelerator provides Anycast static IP addresses that serve as a fixed entry point to your application hosted in one or more AWS Regions. This allows traffic to ingress onto the AWS global network as close to your users as possible, improving reachability and availability of your workload. AWS Global Accelerator also monitors the health of your application endpoints by using TCP, HTTP, and HTTPS health checks. Any changes in the health or configuration of your endpoints permit redirection of user traffic to healthy endpoints that deliver the best performance and availability to your users. In addition, AWS Global Accelerator has a fault-isolating design that uses two static IPv4 addresses that are serviced by independent network zones increasing the availability of your applications. 

 To help protect customers from DDoS attacks, AWS provides AWS Shield Standard. Shield Standard comes automatically turned on and protects from common infrastructure (layer 3 and 4) attacks like SYN/UDP floods and reflection attacks to support high availability of your applications on AWS. For additional protections against more sophisticated and larger attacks (like UDP floods), state exhaustion attacks (like TCP SYN floods), and to help protect your applications running on Amazon Elastic Compute Cloud (Amazon EC2), Elastic Load Balancing (ELB), Amazon CloudFront, AWS Global Accelerator, and Route 53, you can consider using AWS Shield Advanced. For protection against Application layer attacks like HTTP POST or GET floods, use AWS WAF. AWS WAF can use IP addresses, HTTP headers, HTTP body, URI strings, SQL injection, and cross-site scripting conditions to determine if a request should be blocked or allowed. 

 **Implementation steps** 

1.  Set up highly available DNS: Amazon Route 53 is a highly available and scalable [domain name system (DNS)](https://aws.amazon.com/route53/what-is-dns/) web service. Route 53 connects user requests to internet applications running on AWS or on-premises. For more information, see [configuring Amazon Route 53 as your DNS service](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/dns-configuring.html). 

1.  Set up health checks: When using Route 53, verify that only healthy targets are resolvable. Start by [creating Route 53 health checks and configuring DNS failover](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/dns-failover.html). The following aspects are important to consider when setting up health checks: 

   1. [ How Amazon Route 53 determines whether a health check is healthy ](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/dns-failover-determining-health-of-endpoints.html)

   1. [ Creating, updating, and deleting health checks ](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/health-checks-creating-deleting.html)

   1. [ Monitoring health check status and getting notifications ](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/health-checks-monitor-view-status.html)

   1. [ Best practices for Amazon Route 53 DNS ](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/best-practices-dns.html)

1. [ Connect your DNS service to your endpoints. ](https://docs.aws.amazon.com/)

   1.  When using Elastic Load Balancing as a target for your traffic, create an [alias record](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/resource-record-sets-choosing-alias-non-alias.html) using Amazon Route 53 that points to your load balancer’s regional endpoint. During the creation of the alias record, set the Evaluate target health option to Yes. 

   1.  For serverless workloads or private APIs when API Gateway is used, use [Route 53 to direct traffic to API Gateway](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/routing-to-api-gateway.html). 

1.  Decide on a content delivery network. 

   1.  For delivering content using edge locations closer to the user, start by understanding [how CloudFront delivers content](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/HowCloudFrontWorks.html). 

   1.  Get started with a [simple CloudFront distribution](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/GettingStarted.SimpleDistribution.html). CloudFront then knows where you want the content to be delivered from, and the details about how to track and manage content delivery. The following aspects are important to understand and consider when setting up CloudFront distribution: 

      1. [ How caching works with CloudFront edge locations ](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/cache-hit-ratio-explained.html)

      1. [ Increasing the proportion of requests that are served directly from the CloudFront caches (cache hit ratio) ](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/cache-hit-ratio.html)

      1. [ Using Amazon CloudFront Origin Shield ](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/origin-shield.html)

      1. [ Optimizing high availability with CloudFront origin failover ](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/high_availability_origin_failover.html)

1.  Set up application layer protection: AWS WAF helps you protect against common web exploits and bots that can affect availability, compromise security, or consume excessive resources. To get a deeper understanding, review [how AWS WAF works](https://docs.aws.amazon.com/waf/latest/developerguide/how-aws-waf-works.html). When you are ready to implement protections from application layer HTTP POST and GET floods, review [Getting started with AWS WAF](https://docs.aws.amazon.com/waf/latest/developerguide/getting-started.html). You can also use AWS WAF with CloudFront. For details, see the documentation on [how AWS WAF works with Amazon CloudFront features](https://docs.aws.amazon.com/waf/latest/developerguide/cloudfront-features.html). 

1.  Set up additional DDoS protection: By default, all AWS customers receive protection, at no additional charge, from the most common network and transport layer DDoS attacks that target your website or application through AWS Shield Standard. For additional protection of internet-facing applications running on Amazon EC2, Elastic Load Balancing, Amazon CloudFront, AWS Global Accelerator, and Amazon Route 53, you can consider [AWS Shield Advanced](https://docs.aws.amazon.com/waf/latest/developerguide/ddos-advanced-summary.html) and review [examples of DDoS resilient architectures](https://docs.aws.amazon.com/waf/latest/developerguide/ddos-resiliency.html). To protect your workload and your public endpoints from DDoS attacks, review [Getting started with AWS Shield Advanced](https://docs.aws.amazon.com/waf/latest/developerguide/getting-started-ddos.html). 
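
 The DNS failover and alias record steps above can be sketched as a Route 53 change batch. The following is a minimal illustration, not a definitive implementation: the helper names, the record name `app.example.com.`, and the load balancer DNS names and hosted zone IDs are all hypothetical placeholders. The dictionary layout follows the `ChangeBatch` structure accepted by the Route 53 `ChangeResourceRecordSets` operation.

```python
def failover_alias_record(name, role, set_id, lb_dns_name, lb_zone_id):
    """Build one failover alias record set that points at a load balancer."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    return {
        "Name": name,
        "Type": "A",
        "SetIdentifier": set_id,  # unique among records sharing name and type
        "Failover": role,
        "AliasTarget": {
            "HostedZoneId": lb_zone_id,  # the load balancer's zone, not your own
            "DNSName": lb_dns_name,
            "EvaluateTargetHealth": True,  # "Evaluate target health: Yes"
        },
    }


def failover_change_batch(name, primary_lb, secondary_lb):
    """Return a ChangeBatch that upserts a primary/secondary alias pair.

    primary_lb and secondary_lb are (dns_name, hosted_zone_id) tuples.
    """
    records = [
        failover_alias_record(name, "PRIMARY", "primary", *primary_lb),
        failover_alias_record(name, "SECONDARY", "secondary", *secondary_lb),
    ]
    return {
        "Changes": [
            {"Action": "UPSERT", "ResourceRecordSet": record}
            for record in records
        ]
    }


# Hypothetical values for illustration only.
batch = failover_change_batch(
    "app.example.com.",
    ("primary-lb.example.com.", "ZPRIMARYEXAMPLE"),
    ("secondary-lb.example.com.", "ZSECONDARYEXAMPLE"),
)
for change in batch["Changes"]:
    rrset = change["ResourceRecordSet"]
    print(rrset["Failover"], "->", rrset["AliasTarget"]["DNSName"])

# With boto3 installed and credentials configured, the batch could be applied
# with (not executed here):
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="<your hosted zone ID>", ChangeBatch=batch)
```

 In this sketch both records rely on the health of the alias target rather than on a separate health check. If you create dedicated Route 53 health checks, each record set can also reference one through its `HealthCheckId` field.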

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [REL10-BP01 Deploy the workload to multiple locations](rel_fault_isolation_multiaz_region_system.md) 
+  [REL10-BP02 Select the appropriate locations for your multi-location deployment](rel_fault_isolation_select_location.md) 
+  [REL11-BP04 Rely on the data plane and not the control plane during recovery](rel_withstand_component_failures_avoid_control_plane.md) 
+  [REL11-BP06 Send notifications when events impact availability](rel_withstand_component_failures_notifications_sent_system.md) 

 **Related documents:** 
+  [APN Partner: partners that can help plan your networking](https://aws.amazon.com/partners/find/results/?keyword=network) 
+  [AWS Marketplace for Network Infrastructure](https://aws.amazon.com/marketplace/b/2649366011) 
+  [What Is AWS Global Accelerator?](https://docs.aws.amazon.com/global-accelerator/latest/dg/what-is-global-accelerator.html) 
+  [What is Amazon CloudFront?](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/Introduction.html) 
+  [What is Amazon Route 53?](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/Welcome.html) 
+  [What is Elastic Load Balancing?](https://docs.aws.amazon.com/elasticloadbalancing/latest/userguide/what-is-load-balancing.html) 
+ [ Network Connectivity capability - Establishing Your Cloud Foundations ](https://docs.aws.amazon.com/whitepapers/latest/establishing-your-cloud-foundation-on-aws/network-connectivity-capability.html)
+ [ What is Amazon API Gateway? ](https://docs.aws.amazon.com/apigateway/latest/developerguide/welcome.html)
+ [ What are AWS WAF, AWS Shield, and AWS Firewall Manager? ](https://docs.aws.amazon.com/waf/latest/developerguide/what-is-aws-waf.html)
+ [ What is Amazon Route 53 Application Recovery Controller? ](https://docs.aws.amazon.com/r53recovery/latest/dg/what-is-route53-recovery.html)
+ [ Configure custom health checks for DNS failover ](https://docs.aws.amazon.com/apigateway/latest/developerguide/dns-failover.html)

 **Related videos:** 
+ [AWS re:Invent 2022 - Improve performance and availability with AWS Global Accelerator](https://www.youtube.com/watch?v=s5sjsdDC0Lg)
+ [AWS re:Invent 2020: Global traffic management with Amazon Route 53 ](https://www.youtube.com/watch?v=E33dA6n9O7I)
+ [AWS re:Invent 2022 - Operating highly available Multi-AZ applications ](https://www.youtube.com/watch?v=mwUV5skJJ0s)
+ [AWS re:Invent 2022 - Dive deep on AWS networking infrastructure ](https://www.youtube.com/watch?v=HJNR_dX8g8c)
+ [AWS re:Invent 2022 - Building resilient networks ](https://www.youtube.com/watch?v=u-qamiNgH7Q)

 **Related examples:** 
+ [ Disaster Recovery with Amazon Route 53 Application Recovery Controller (ARC) ](https://catalog.us-east-1.prod.workshops.aws/workshops/4d9ab448-5083-4db7-bee8-85b58cd53158/en-US/)
+ [ Reliability Workshops ](https://wellarchitectedlabs.com/reliability/)
+ [AWS Global Accelerator Workshop ](https://catalog.us-east-1.prod.workshops.aws/workshops/effb1517-b193-4c59-8da5-ce2abdb0b656/en-US)

# REL02-BP02 Provision redundant connectivity between private networks in the cloud and on-premises environments
<a name="rel_planning_network_topology_ha_conn_private_networks"></a>

 Use multiple AWS Direct Connect connections or VPN tunnels between separately deployed private networks. Use multiple Direct Connect locations for high availability. If using multiple AWS Regions, ensure redundancy in at least two of them. You might want to evaluate AWS Marketplace appliances that terminate VPNs. If you use AWS Marketplace appliances, deploy redundant instances for high availability in different Availability Zones. 

 AWS Direct Connect is a cloud service that makes it easy to establish a dedicated network connection from your on-premises environment to AWS. Using Direct Connect Gateway, your on-premises data center can be connected to multiple AWS VPCs spread across multiple AWS Regions. 

 This redundancy addresses possible failures that impact connectivity resiliency: 
+  How are you going to be resilient to failures in your topology? 
+  What happens if you misconfigure something and remove connectivity? 
+  Will you be able to handle an unexpected increase in traffic or use of your services? 
+  Will you be able to absorb an attempted Distributed Denial of Service (DDoS) attack? 

 When connecting your VPC to your on-premises data center via VPN, you should consider the resiliency and bandwidth requirements that you need when you select the vendor and instance size on which you need to run the appliance. If you use a VPN appliance that is not resilient in its implementation, then you should have a redundant connection through a second appliance. For all these scenarios, you need to define an acceptable time to recovery and test to ensure that you can meet those requirements. 

 If you choose to connect your VPC to your data center using a Direct Connect connection and you need this connection to be highly available, have redundant Direct Connect connections from each data center. The redundant connection should use a second Direct Connect connection from a different location than the first. If you have multiple data centers, ensure that the connections terminate at different locations. Use the [Direct Connect Resiliency Toolkit](https://docs.aws.amazon.com/directconnect/latest/UserGuide/resiliency_toolkit.html) to help you set this up. 

 If you choose to fail over to VPN over the internet using Site-to-Site VPN, it’s important to understand that it supports up to 1.25-Gbps throughput per VPN tunnel, but does not support Equal Cost Multi Path (ECMP) for outbound traffic in the case of multiple AWS Managed VPN tunnels terminating on the same VGW. We do not recommend that you use AWS Managed VPN as a backup for Direct Connect connections unless you can tolerate speeds less than 1 Gbps during failover. 
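
 To make the failover trade-off concrete, the following minimal sketch compares the bandwidth a workload needs during failover with the 1.25 Gbps per-tunnel ceiling quoted above. The function name and the sample requirements are hypothetical. The ceiling is an upper bound, so measured throughput can be lower.

```python
VPN_TUNNEL_MAX_GBPS = 1.25  # per-tunnel ceiling quoted in the text above


def vpn_failover_report(required_gbps, tunnel_max_gbps=VPN_TUNNEL_MAX_GBPS):
    """Compare failover bandwidth needs with a single VPN tunnel's ceiling.

    Without ECMP for outbound traffic on a virtual private gateway, this
    sketch plans on one tunnel carrying the traffic.
    """
    shortfall = max(0.0, required_gbps - tunnel_max_gbps)
    return {
        "required_gbps": required_gbps,
        "vpn_ceiling_gbps": tunnel_max_gbps,
        "shortfall_gbps": shortfall,
        "sufficient": shortfall == 0.0,
    }


# Hypothetical requirements: a workload needing 5 Gbps and one needing 0.8 Gbps.
for need in (5.0, 0.8):
    report = vpn_failover_report(need)
    status = "fits" if report["sufficient"] else "exceeds"
    print(f"{need} Gbps {status} the VPN ceiling "
          f"(shortfall {report['shortfall_gbps']} Gbps)")
```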

 You can also use VPC endpoints to privately connect your VPC to supported AWS services and VPC endpoint services powered by AWS PrivateLink without traversing the public internet. Endpoints are virtual devices. They are horizontally scaled, redundant, and highly available VPC components. They allow communication between instances in your VPC and services without imposing availability risks or bandwidth constraints on your network traffic. 

 **Common anti-patterns:** 
+  Having only one connectivity provider between your on-site network and AWS. 
+  Consuming the connectivity capabilities of your AWS Direct Connect connection, but only having one connection. 
+  Having only one path for your VPN connectivity. 

 **Benefits of establishing this best practice:** By implementing redundant connectivity between your cloud environment and your corporate or on-premises environment, you can ensure that the dependent services between the two environments can communicate reliably. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Ensure that you have highly available connectivity between AWS and on-premises environment. Use multiple AWS Direct Connect connections or VPN tunnels between separately deployed private networks. Use multiple Direct Connect locations for high availability. If using multiple AWS Regions, ensure redundancy in at least two of them. You might want to evaluate AWS Marketplace appliances that terminate VPNs. If you use AWS Marketplace appliances, deploy redundant instances for high availability in different Availability Zones. 
  +  Ensure that you have a redundant connection to your on-premises environment. You may need redundant connections to multiple AWS Regions to achieve your availability needs. 
    +  [AWS Direct Connect Resiliency Recommendations](https://aws.amazon.com/directconnect/resiliency-recommendation/) 
    +  [Using Redundant Site-to-Site VPN Connections to Provide Failover](https://docs.aws.amazon.com/vpn/latest/s2svpn/VPNConnections.html) 
      +  Use service API operations to identify correct use of Direct Connect circuits. 
        +  [DescribeConnections](https://docs.aws.amazon.com/directconnect/latest/APIReference/API_DescribeConnections.html) 
        +  [DescribeConnectionsOnInterconnect](https://docs.aws.amazon.com/directconnect/latest/APIReference/API_DescribeConnectionsOnInterconnect.html) 
        +  [DescribeDirectConnectGatewayAssociations](https://docs.aws.amazon.com/directconnect/latest/APIReference/API_DescribeDirectConnectGatewayAssociations.html) 
        +  [DescribeDirectConnectGatewayAttachments](https://docs.aws.amazon.com/directconnect/latest/APIReference/API_DescribeDirectConnectGatewayAttachments.html) 
        +  [DescribeDirectConnectGateways](https://docs.aws.amazon.com/directconnect/latest/APIReference/API_DescribeDirectConnectGateways.html) 
        +  [DescribeHostedConnections](https://docs.aws.amazon.com/directconnect/latest/APIReference/API_DescribeHostedConnections.html) 
        +  [DescribeInterconnects](https://docs.aws.amazon.com/directconnect/latest/APIReference/API_DescribeInterconnects.html) 
      +  If only one Direct Connect connection exists or you have none, set up redundant VPN tunnels to your virtual private gateways. 
        +  [What is AWS Site-to-Site VPN?](https://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_VPN.html) 
  +  Capture your current connectivity (for example, Direct Connect, virtual private gateways, AWS Marketplace appliances). 
    +  Use service API operations to query configuration of Direct Connect connections. 
      +  [DescribeConnections](https://docs.aws.amazon.com/directconnect/latest/APIReference/API_DescribeConnections.html) 
      +  [DescribeConnectionsOnInterconnect](https://docs.aws.amazon.com/directconnect/latest/APIReference/API_DescribeConnectionsOnInterconnect.html) 
      +  [DescribeDirectConnectGatewayAssociations](https://docs.aws.amazon.com/directconnect/latest/APIReference/API_DescribeDirectConnectGatewayAssociations.html) 
      +  [DescribeDirectConnectGatewayAttachments](https://docs.aws.amazon.com/directconnect/latest/APIReference/API_DescribeDirectConnectGatewayAttachments.html) 
      +  [DescribeDirectConnectGateways](https://docs.aws.amazon.com/directconnect/latest/APIReference/API_DescribeDirectConnectGateways.html) 
      +  [DescribeHostedConnections](https://docs.aws.amazon.com/directconnect/latest/APIReference/API_DescribeHostedConnections.html) 
      +  [DescribeInterconnects](https://docs.aws.amazon.com/directconnect/latest/APIReference/API_DescribeInterconnects.html) 
    +  Use service API operations to collect virtual private gateways where route tables use them. 
      +  [DescribeVpnGateways](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_DescribeVpnGateways.html) 
      +  [DescribeRouteTables](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_DescribeRouteTables.html) 
    +  Use service API operations to collect AWS Marketplace applications where route tables use them. 
      +  [DescribeRouteTables](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_DescribeRouteTables.html) 
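
 The inventory steps above can be sketched as a small check over a `DescribeConnections` response. This is an illustration under stated assumptions: the helper names and the sample connection IDs and location codes are hypothetical, and the input is assumed to have the shape returned by `boto3.client("directconnect").describe_connections()`, a dictionary with a `connections` list whose items carry `connectionId`, `connectionState`, and `location`.

```python
from collections import defaultdict


def available_by_location(response):
    """Group available Direct Connect connections by Direct Connect location."""
    grouped = defaultdict(list)
    for conn in response.get("connections", []):
        if conn.get("connectionState") == "available":
            grouped[conn["location"]].append(conn["connectionId"])
    return dict(grouped)


def redundancy_findings(response):
    """Return a list of human-readable redundancy gaps (empty if none found)."""
    grouped = available_by_location(response)
    total = sum(len(ids) for ids in grouped.values())
    if total == 0:
        return ["No available Direct Connect connection: "
                "set up redundant VPN tunnels."]
    if total == 1:
        return ["Only one available Direct Connect connection: "
                "add a second connection or redundant VPN tunnels."]
    if len(grouped) < 2:
        return ["All available connections terminate at one location: "
                "add a connection at a second Direct Connect location."]
    return []


# Hypothetical response data for illustration only.
sample = {
    "connections": [
        {"connectionId": "dxcon-example1", "connectionState": "available",
         "location": "LOC-A"},
        {"connectionId": "dxcon-example2", "connectionState": "available",
         "location": "LOC-A"},
        {"connectionId": "dxcon-example3", "connectionState": "down",
         "location": "LOC-B"},
    ]
}
for finding in redundancy_findings(sample):
    print(finding)
```

 The check only counts connections in the `available` state, so a second connection that is down is reported as a gap rather than as redundancy.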

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [APN Partner: partners that can help plan your networking](https://aws.amazon.com/partners/find/results/?keyword=network) 
+  [AWS Direct Connect Resiliency Recommendations](https://aws.amazon.com/directconnect/resiliency-recommendation/) 
+  [AWS Marketplace for Network Infrastructure](https://aws.amazon.com/marketplace/b/2649366011) 
+  [Amazon Virtual Private Cloud Connectivity Options Whitepaper](https://docs.aws.amazon.com/whitepapers/latest/aws-vpc-connectivity-options/introduction.html) 
+  [Multiple data center HA network connectivity](https://aws.amazon.com/answers/networking/aws-multiple-data-center-ha-network-connectivity/) 
+  [Using Redundant Site-to-Site VPN Connections to Provide Failover](https://docs.aws.amazon.com/vpn/latest/s2svpn/VPNConnections.html) 
+  [Using the Direct Connect Resiliency Toolkit to get started](https://docs.aws.amazon.com/directconnect/latest/UserGuide/resilency_toolkit.html) 
+  [VPC Endpoints and VPC Endpoint Services (AWS PrivateLink)](https://docs.aws.amazon.com/vpc/latest/userguide/endpoint-services-overview.html) 
+  [What Is Amazon VPC?](https://docs.aws.amazon.com/vpc/latest/userguide/what-is-amazon-vpc.html) 
+  [What Is a Transit Gateway?](https://docs.aws.amazon.com/vpc/latest/tgw/what-is-transit-gateway.html) 
+  [What is AWS Site-to-Site VPN?](https://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_VPN.html) 
+  [Working with Direct Connect Gateways](https://docs.aws.amazon.com/directconnect/latest/UserGuide/direct-connect-gateways.html) 

 **Related videos:** 
+  [AWS re:Invent 2018: Advanced VPC Design and New Capabilities for Amazon VPC (NET303)](https://youtu.be/fnxXNZdf6ew) 
+  [AWS re:Invent 2019: AWS Transit Gateway reference architectures for many VPCs (NET406-R1)](https://youtu.be/9Nikqn_02Oc) 

# REL02-BP03 Ensure IP subnet allocation accounts for expansion and availability
<a name="rel_planning_network_topology_ip_subnet_allocation"></a>

 Amazon VPC IP address ranges must be large enough to accommodate workload requirements, including factoring in future expansion and allocation of IP addresses to subnets across Availability Zones. This includes load balancers, EC2 instances, and container-based applications. 

 When you plan your network topology, the first step is to define the IP address space itself. Private IP address ranges (following RFC 1918 guidelines) should be allocated for each VPC. Accommodate the following requirements as part of this process: 
+  Allow IP address space for more than one VPC per Region. 
+  Within a VPC, allow space for multiple subnets so that you can cover multiple Availability Zones. 
+  Always leave unused CIDR block space within a VPC for future expansion. 
+  Ensure that there is IP address space to meet the needs of any transient fleets of EC2 instances that you might use, such as Spot Fleets for machine learning, Amazon EMR clusters, or Amazon Redshift clusters. 
+  Note that the first four IP addresses and the last IP address in each subnet CIDR block are reserved and not available for your use. 
+  You should plan on deploying large VPC CIDR blocks. Note that the initial VPC CIDR block allocated to your VPC cannot be changed or deleted, but you can add additional non-overlapping CIDR blocks to the VPC. Subnet IPv4 CIDRs cannot be changed; however, IPv6 CIDRs can. Keep in mind that deploying the largest VPC possible (/16) results in over 65,000 IP addresses. In the base 10.x.x.x IP address space alone, you could provision 256 such VPCs. You should therefore err on the side of being too large rather than too small to make it easier to manage your VPCs. 
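
 The requirements above can be illustrated with Python's standard `ipaddress` module. The following minimal sketch carves one subnet per Availability Zone out of a VPC CIDR block, subtracts the five reserved addresses per subnet, and reports how many equally sized blocks remain unallocated for future expansion. The function name, the example CIDR block, the /20 subnet size, and the Availability Zone names are assumptions chosen for illustration.

```python
import ipaddress

RESERVED_PER_SUBNET = 5  # first four addresses plus the last one in each subnet


def plan_subnets(vpc_cidr, subnet_prefix, az_names):
    """Assign one subnet per Availability Zone and count the spare blocks.

    Returns (plan, spare_blocks), where plan maps each AZ name to its CIDR
    and usable address count.
    """
    vpc = ipaddress.ip_network(vpc_cidr)
    total_blocks = 2 ** (subnet_prefix - vpc.prefixlen)
    if len(az_names) > total_blocks:
        raise ValueError("VPC is too small for one subnet per Availability Zone")
    plan = {}
    # zip stops after the last AZ, so the remaining blocks stay unallocated.
    for az, subnet in zip(az_names, vpc.subnets(new_prefix=subnet_prefix)):
        plan[az] = {
            "cidr": str(subnet),
            "usable": subnet.num_addresses - RESERVED_PER_SUBNET,
        }
    return plan, total_blocks - len(plan)


# Example values for illustration only.
plan, spare = plan_subnets(
    "10.0.0.0/16", 20, ["us-east-1a", "us-east-1b", "us-east-1c"]
)
for az, info in plan.items():
    print(az, info["cidr"], info["usable"], "usable addresses")
print(spare, "blocks of the same size left for expansion")
```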

 **Common anti-patterns:** 
+  Creating small VPCs. 
+  Creating small subnets and then having to add subnets to configurations as you grow. 
+  Incorrectly estimating how many IP addresses an elastic load balancer can use. 
+  Deploying many high traffic load balancers into the same subnets. 

 **Benefits of establishing this best practice:** This ensures that you can accommodate the growth of your workloads and continue to provide availability as you scale up. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Plan your network to accommodate growth, regulatory compliance, and integration with others. Growth can be underestimated, regulatory compliance can change, and acquisitions or private network connections can be difficult to implement without proper planning. 
  +  Select relevant AWS accounts and Regions based on your service requirements, latency, regulatory, and disaster recovery (DR) requirements. 
  +  Identify your needs for regional VPC deployments. 
  +  Identify the size of the VPCs. 
    +  Determine if you are going to deploy multi-VPC connectivity. 
      +  [What Is a Transit Gateway?](https://docs.aws.amazon.com/vpc/latest/tgw/what-is-transit-gateway.html) 
      +  [Single Region Multi-VPC Connectivity](https://aws.amazon.com/answers/networking/aws-single-region-multi-vpc-connectivity/) 
    +  Determine if you need segregated networking for regulatory requirements. 
    +  Make VPCs as large as possible. The initial VPC CIDR block allocated to your VPC cannot be changed or deleted, but you can add additional non-overlapping CIDR blocks to the VPC. This, however, may fragment your address ranges. 
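
 When identifying sizes, it can help to compute a lower bound before rounding up generously. The following sketch estimates the smallest subnet prefix that fits a peak address count plus headroom. The function name and the default growth factor of 2 are assumptions for illustration, not AWS recommendations. The bounds reflect that VPC subnets range from /28 to /16 and that five addresses per subnet are reserved.

```python
import math

RESERVED_PER_SUBNET = 5  # addresses AWS reserves in every subnet
LARGEST_PREFIX = 28      # smallest subnet a VPC allows
SMALLEST_PREFIX = 16     # largest subnet a VPC allows


def smallest_subnet_prefix(peak_addresses, growth_factor=2.0):
    """Return the longest prefix (smallest subnet) that still fits the demand.

    growth_factor is a hypothetical planning multiplier for future expansion.
    """
    needed = math.ceil(peak_addresses * growth_factor) + RESERVED_PER_SUBNET
    host_bits = (needed - 1).bit_length()  # integer ceil(log2(needed))
    prefix = min(LARGEST_PREFIX, 32 - host_bits)
    if prefix < SMALLEST_PREFIX:
        raise ValueError("demand exceeds a /16; split it across subnets or VPCs")
    return prefix


# Hypothetical peak address counts for illustration only.
for peak in (10, 900):
    print(f"peak {peak} addresses: at least a /{smallest_subnet_prefix(peak)}")
```

 Treat the result as a floor. The guidance above still favors choosing a larger block than the minimum, because subnet IPv4 CIDR blocks cannot be resized later.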

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [APN Partner: partners that can help plan your networking](https://aws.amazon.com/partners/find/results/?keyword=network) 
+  [AWS Marketplace for Network Infrastructure](https://aws.amazon.com/marketplace/b/2649366011) 
+  [Amazon Virtual Private Cloud Connectivity Options Whitepaper](https://docs.aws.amazon.com/whitepapers/latest/aws-vpc-connectivity-options/introduction.html) 
+  [Multiple data center HA network connectivity](https://aws.amazon.com/answers/networking/aws-multiple-data-center-ha-network-connectivity/) 
+  [Single Region Multi-VPC Connectivity](https://aws.amazon.com/answers/networking/aws-single-region-multi-vpc-connectivity/) 
+  [What Is Amazon VPC?](https://docs.aws.amazon.com/vpc/latest/userguide/what-is-amazon-vpc.html) 

 **Related videos:** 
+  [AWS re:Invent 2018: Advanced VPC Design and New Capabilities for Amazon VPC (NET303)](https://youtu.be/fnxXNZdf6ew) 
+  [AWS re:Invent 2019: AWS Transit Gateway reference architectures for many VPCs (NET406-R1)](https://youtu.be/9Nikqn_02Oc) 

# REL02-BP04 Prefer hub-and-spoke topologies over many-to-many mesh
<a name="rel_planning_network_topology_prefer_hub_and_spoke"></a>

 If more than two network address spaces (for example, VPCs and on-premises networks) are connected via VPC peering, AWS Direct Connect, or VPN, then use a hub-and-spoke model, like that provided by AWS Transit Gateway. 

 If you have only two such networks, you can simply connect them to each other, but as the number of networks grows, the complexity of such meshed connections becomes untenable. AWS Transit Gateway provides an easy to maintain hub-and-spoke model, allowing the routing of traffic across your multiple networks. 
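
 The growth in complexity can be quantified with a short calculation. In a full mesh, every pair of networks needs its own connection, which gives n(n-1)/2 connections for n networks, while a hub-and-spoke model needs one attachment per network. The function names below are hypothetical and the sketch ignores routing configuration, which also grows with each connection.

```python
def full_mesh_connections(networks):
    """Point-to-point connections needed so every pair of networks is linked."""
    return networks * (networks - 1) // 2


def hub_and_spoke_attachments(networks):
    """Attachments needed when every network connects to one central hub."""
    return networks


for n in (2, 5, 10, 50):
    print(f"{n} networks: mesh={full_mesh_connections(n)}, "
          f"hub-and-spoke={hub_and_spoke_attachments(n)}")
```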

![\[Diagram showing not using AWS Transit Gateway\]](http://docs.aws.amazon.com/wellarchitected/2023-10-03/framework/images/without-transit-gateway.png)


![\[Diagram showing using AWS Transit Gateway\]](http://docs.aws.amazon.com/wellarchitected/2023-10-03/framework/images/with-transit-gateway.png)


 **Common anti-patterns:** 
+  Using VPC peering to connect more than two VPCs. 
+  Establishing multiple BGP sessions for each VPC to establish connectivity that spans Virtual Private Clouds (VPCs) spread across multiple AWS Regions. 

 **Benefits of establishing this best practice:** As the number of networks grows, the complexity of such meshed connections becomes untenable. AWS Transit Gateway provides an easy to maintain hub-and-spoke model, allowing routing of traffic among your multiple networks. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Prefer hub-and-spoke topologies over many-to-many mesh. If more than two network address spaces (VPCs, on-premises networks) are connected via VPC peering, AWS Direct Connect, or VPN, then use a hub-and-spoke model like that provided by AWS Transit Gateway. 
  +  For only two such networks, you can simply connect them to each other, but as the number of networks grows, the complexity of such meshed connections becomes untenable. AWS Transit Gateway provides an easy to maintain hub-and-spoke model, allowing routing of traffic across your multiple networks. 
    +  [What Is a Transit Gateway?](https://docs.aws.amazon.com/vpc/latest/tgw/what-is-transit-gateway.html) 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [APN Partner: partners that can help plan your networking](https://aws.amazon.com/partners/find/results/?keyword=network) 
+  [AWS Marketplace for Network Infrastructure](https://aws.amazon.com/marketplace/b/2649366011) 
+  [Multiple data center HA network connectivity](https://aws.amazon.com/answers/networking/aws-multiple-data-center-ha-network-connectivity/) 
+  [VPC Endpoints and VPC Endpoint Services (AWS PrivateLink)](https://docs.aws.amazon.com/vpc/latest/userguide/endpoint-services-overview.html) 
+  [What Is Amazon VPC?](https://docs.aws.amazon.com/vpc/latest/userguide/what-is-amazon-vpc.html) 
+  [What Is a Transit Gateway?](https://docs.aws.amazon.com/vpc/latest/tgw/what-is-transit-gateway.html) 

 **Related videos:** 
+  [AWS re:Invent 2018: Advanced VPC Design and New Capabilities for Amazon VPC (NET303)](https://youtu.be/fnxXNZdf6ew) 
+  [AWS re:Invent 2019: AWS Transit Gateway reference architectures for many VPCs (NET406-R1)](https://youtu.be/9Nikqn_02Oc) 

# REL02-BP05 Enforce non-overlapping private IP address ranges in all private address spaces where they are connected
<a name="rel_planning_network_topology_non_overlap_ip"></a>

 The IP address ranges of each of your VPCs must not overlap when peered or connected via VPN. You must similarly avoid IP address conflicts between a VPC and on-premises environments or with other cloud providers that you use. You must also have a way to allocate private IP address ranges when needed. 

 An IP address management (IPAM) system can help with this. Several IPAMs are available from the AWS Marketplace. 

 **Common anti-patterns:** 
+  Using the same IP range in your VPC as you have on premises or in your corporate network. 
+  Not tracking IP ranges of VPCs used to deploy your workloads. 

 **Benefits of establishing this best practice:** Active planning of your network ensures that the same IP address does not occur more than once in interconnected networks. This prevents routing problems in the parts of your workload that communicate across those networks. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Monitor and manage your CIDR use. Evaluate your potential usage on AWS, add CIDR ranges to existing VPCs, and create VPCs to allow planned growth in usage. 
  +  Capture current CIDR consumption (for example, VPCs, subnets). 
    +  Use service API operations to collect current CIDR consumption. 
  +  Capture your current subnet usage. 
    +  Use service API operations to collect subnets per VPC in each Region. 
      +  [DescribeSubnets](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_DescribeSubnets.html) 
    +  Record the current usage. 
    +  Determine if you created any overlapping IP ranges. 
    +  Calculate the spare capacity. 
    +  Identify overlapping IP ranges. You can either migrate to a new range of addresses or use Network Address Translation (NAT) appliances from AWS Marketplace if you need to connect the overlapping ranges. 
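
 The overlap check in the steps above can be sketched with Python's standard `ipaddress` module. The function name, the network labels, and the CIDR ranges are hypothetical. In practice the input could be assembled from the CIDR blocks you record for each VPC and on-premises network.

```python
import ipaddress
from itertools import combinations


def find_overlaps(ranges):
    """Return pairs of network names whose CIDR ranges overlap.

    ranges maps a descriptive name to a CIDR string.
    """
    networks = [
        (name, ipaddress.ip_network(cidr)) for name, cidr in ranges.items()
    ]
    return [
        (name_a, name_b)
        for (name_a, net_a), (name_b, net_b) in combinations(networks, 2)
        if net_a.overlaps(net_b)
    ]


# Hypothetical address plan for illustration only.
address_plan = {
    "vpc-prod": "10.0.0.0/16",
    "vpc-dev": "10.1.0.0/16",
    "on-premises": "10.0.128.0/17",
}
for name_a, name_b in find_overlaps(address_plan):
    print(f"{name_a} overlaps {name_b}")
```

 Running a check like this before peering VPCs or attaching a VPN makes conflicts visible while the ranges can still be changed.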

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [APN Partner: partners that can help plan your networking](https://aws.amazon.com/partners/find/results/?keyword=network) 
+  [AWS Marketplace for Network Infrastructure](https://aws.amazon.com/marketplace/b/2649366011) 
+  [Amazon Virtual Private Cloud Connectivity Options Whitepaper](https://docs.aws.amazon.com/whitepapers/latest/aws-vpc-connectivity-options/introduction.html) 
+  [Multiple data center HA network connectivity](https://aws.amazon.com/answers/networking/aws-multiple-data-center-ha-network-connectivity/) 
+  [What Is Amazon VPC?](https://docs.aws.amazon.com/vpc/latest/userguide/what-is-amazon-vpc.html) 
+  [What is IPAM?](https://docs.aws.amazon.com/vpc/latest/ipam/what-it-is-ipam.html) 

 **Related videos:** 
+  [AWS re:Invent 2018: Advanced VPC Design and New Capabilities for Amazon VPC (NET303)](https://youtu.be/fnxXNZdf6ew) 
+  [AWS re:Invent 2019: AWS Transit Gateway reference architectures for many VPCs (NET406-R1)](https://youtu.be/9Nikqn_02Oc)