REL05-BP03 Control and limit retry calls
Use exponential backoff to retry after progressively longer intervals. Introduce jitter to randomize those retry intervals, and limit the maximum number of retries.
Typical components in a distributed software system include servers, load balancers, databases, and DNS servers. In operation, and subject to failures, any of these can start generating errors. The default technique for dealing with errors is to implement retries on the client side. This technique increases the reliability and availability of the application. However, at scale—and if clients attempt to retry the failed operation as soon as an error occurs—the network can quickly become saturated with new and retried requests, each competing for network bandwidth. This can result in a retry storm, which will reduce availability of the service. This pattern might continue until a full system failure occurs.
To avoid such scenarios, backoff algorithms such as the common exponential backoff should be used. Exponential backoff algorithms gradually decrease the rate at which retries are performed, thus avoiding network congestion.
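As a rough illustration of how the wait grows between attempts, the sketch below computes a doubling delay per retry. The function name `backoff_delay`, the 0.1 second base, and the 20 second cap are assumptions chosen for the example, not values taken from any AWS SDK.

```python
def backoff_delay(attempt: int, base: float = 0.1, cap: float = 20.0) -> float:
    """Return the wait in seconds before retry number `attempt` (0-based).

    The delay doubles with every attempt. `cap` is an assumed upper bound
    that keeps the wait from growing without limit.
    """
    return min(cap, base * 2 ** attempt)


# Delay before each of the first ten retries, using the assumed defaults.
schedule = [backoff_delay(n) for n in range(10)]
```

With these assumed defaults the delay doubles from 0.1 seconds until it reaches the cap, so each retry is issued less often than the one before it.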
Many SDKs and software libraries, including those from AWS, implement a version of these algorithms. However, never assume a backoff algorithm exists—always test and verify this to be the case.
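Simple backoff alone is not enough, because in distributed systems all clients may back off simultaneously, creating clusters of retry calls. Marc Brooker addresses this problem in his blog post Exponential Backoff and Jitter: adding jitter randomizes the retry intervals so that retries spread out over time instead of arriving together.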
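One way to randomize the interval is the variant commonly called full jitter, which picks a uniformly random wait between zero and the capped exponential delay. The sketch below assumes the same hypothetical base and cap values as before. The `rng` parameter is an addition for this example that makes the randomness injectable for testing.

```python
import random


def full_jitter_delay(attempt: int, base: float = 0.1, cap: float = 20.0,
                      rng=random) -> float:
    """Return a randomized wait in seconds before retry number `attempt`.

    The exponential delay is only an upper bound: the actual wait is drawn
    uniformly from [0, min(cap, base * 2 ** attempt)], so clients that fail
    at the same moment are unlikely to retry at the same moment.
    """
    ceiling = min(cap, base * 2 ** attempt)
    return rng.uniform(0.0, ceiling)
```

Other jitter strategies exist. This one is shown only because it is short and spreads retries across the whole interval.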
Finally, it is important to configure a maximum number of retries or a maximum elapsed time, after which the client stops retrying and the call fails. AWS SDKs implement this by default, and it can be configured. For services lower in the stack, a maximum retry limit of zero or one can limit risk yet still be effective, because retries are delegated to services higher in the stack.
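The reason for keeping retry limits low in the lower layers can be seen with some simple arithmetic. In the worst case, when every attempt fails, each layer multiplies the number of calls that reach the layer below it. The sketch below is an illustrative model only. The function name and the three-layer stack are assumptions made for the example.

```python
from math import prod


def worst_case_calls(max_retries_per_layer):
    """Worst-case number of calls reaching the lowest dependency for one
    top-level request, assuming every attempt at every layer fails.

    Each layer makes 1 initial attempt plus its configured retries, and
    every one of those attempts triggers the full attempt count below it.
    """
    return prod(1 + retries for retries in max_retries_per_layer)


# Three-layer stack where every layer retries up to 3 times: 4 * 4 * 4 calls.
every_layer_retries = worst_case_calls([3, 3, 3])

# Same stack, but only the top layer retries: 4 * 1 * 1 calls.
top_layer_only = worst_case_calls([3, 0, 0])
```

Under this model, delegating retries to the top layer keeps the worst-case load on the lowest dependency at 4 calls instead of 64.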
Level of risk exposed if this best practice is not established: High
Implementation guidance
Control and limit retry calls. Use exponential backoff to retry after progressively longer intervals. Introduce jitter to randomize those retry intervals, and limit the maximum number of retries.
- Error Retries and Exponential Backoff in AWS
- Amazon SDKs implement retries and exponential backoff by default. Implement similar logic in your dependency layer when calling your own dependent services. Decide what the timeouts are and when to stop retrying based on your use case.
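For calls to your own dependent services, the guidance above could be sketched as a small wrapper that retries with jittered exponential backoff and stops after either a maximum number of retries or a maximum elapsed time. Everything named here is hypothetical: `TransientError`, `call_with_retries`, and the default limits are placeholders to adapt to your use case, not part of any AWS SDK.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def call_with_retries(operation, max_retries=3, max_elapsed=10.0,
                      base=0.1, cap=2.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Call `operation()` and retry on TransientError.

    Stops retrying, and re-raises the last error, once `max_retries`
    retries have been made or when the next wait would push the total
    elapsed time past `max_elapsed` seconds.
    """
    start = clock()
    attempt = 0
    while True:
        try:
            return operation()
        except TransientError:
            if attempt >= max_retries:
                raise  # retry budget exhausted
            # Jittered exponential backoff: random wait up to the capped delay.
            delay = random.uniform(0.0, min(cap, base * 2 ** attempt))
            if clock() - start + delay > max_elapsed:
                raise  # time budget would be exceeded
            sleep(delay)
            attempt += 1
```

Only errors raised as `TransientError` are retried in this sketch; any other exception propagates immediately. The `sleep` and `clock` parameters exist so the wrapper can be exercised in tests without real waiting.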
Resources
Related documents:
- Amazon API Gateway: Throttle API Requests for Better Throughput
- The Amazon Builders' Library: Avoiding fallback in distributed systems
- The Amazon Builders' Library: Avoiding insurmountable queue backlogs
- The Amazon Builders' Library: Caching challenges and strategies
- The Amazon Builders' Library: Timeouts, retries, and backoff with jitter
Related videos: