Retries for Lambda durable functions
Durable functions provide automatic retry capabilities that make your applications resilient to transient failures. The SDK handles retries at two levels: step retries for business logic failures and backend retries for infrastructure failures.
Step retries
When an uncaught exception occurs within a step, the SDK automatically retries the step based on the configured retry strategy. Step retries are checkpointed operations that allow the SDK to suspend execution and resume later without losing progress.
Step retry behavior
The following table describes how the SDK handles exceptions within steps:
| Scenario | What happens | Metering impact |
|---|---|---|
| Exception in step with remaining retry attempts | The SDK creates a checkpoint for the retry and suspends the function. On the next invocation, the step retries with the configured backoff delay. | 1 operation + error payload size |
| Exception in step with no remaining retry attempts | The step fails and throws an exception. If your handler code doesn't catch this exception, the entire execution fails. | 1 operation + error payload size |
When a step needs to retry, the SDK checkpoints the retry state and exits the Lambda invocation if no other work is running. This allows the SDK to implement backoff delays without consuming compute resources. The function resumes automatically after the backoff period.
Configuring step retry strategies
Configure retry strategies to control how steps handle failures. You can specify maximum attempts, backoff intervals, and conditions for retrying.
Exponential backoff with max attempts:
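The following TypeScript sketch is illustrative rather than copy-paste ready: the import path (`@aws/durable-functions-sdk`), the `context.step` signature, and the option names (`maxAttempts`, `initialDelaySeconds`, `backoffMultiplier`, `maxDelaySeconds`) are assumptions, so match them to the API of your SDK version.

```ts
// Illustrative sketch: the import path, step() signature, and retry option
// names below are assumptions; check your SDK version for the exact API.
import { makeHandler, DurableContext } from "@aws/durable-functions-sdk";

export const handler = makeHandler(async (event: { orderId: string }, context: DurableContext) => {
  // Retry up to 5 times, doubling the delay between attempts (1s, 2s, 4s, ...).
  const receipt = await context.step(
    "charge-payment",
    async () => callPaymentService(event.orderId),
    {
      retryStrategy: {
        maxAttempts: 5,          // total attempts, including the first
        initialDelaySeconds: 1,  // delay before the first retry
        backoffMultiplier: 2,    // exponential growth factor
        maxDelaySeconds: 30,     // cap on the backoff delay
      },
    },
  );
  return receipt;
});

// Placeholder for the real downstream call.
async function callPaymentService(orderId: string): Promise<{ chargeId: string }> {
  return { chargeId: `charge-${orderId}` };
}
```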
Fixed interval backoff:
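A sketch under the same assumptions; a fixed interval is modeled here by keeping the backoff multiplier at 1 so every attempt waits the same amount of time (your SDK may instead expose a dedicated fixed-interval option).

```ts
// Illustrative sketch: option names are assumptions; a multiplier of 1 keeps the delay constant.
import { makeHandler, DurableContext } from "@aws/durable-functions-sdk";

export const handler = makeHandler(async (event: unknown, context: DurableContext) => {
  // Retry every 10 seconds, up to 3 attempts in total.
  return context.step("refresh-cache", async () => refreshCache(), {
    retryStrategy: {
      maxAttempts: 3,
      initialDelaySeconds: 10,  // constant delay between attempts
      backoffMultiplier: 1,     // no exponential growth, so the interval stays fixed
    },
  });
});

// Placeholder for the real cache refresh.
async function refreshCache(): Promise<{ refreshed: boolean }> {
  return { refreshed: true };
}
```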
Conditional retry (retry only specific errors):
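A sketch under the same assumptions; the `shouldRetry` callback name comes from this page, but its exact signature here is an assumption, so verify it against your SDK version.

```ts
// Illustrative sketch: shouldRetry is described on this page, but its exact
// signature here is an assumption; the other names are also placeholders.
import { makeHandler, DurableContext } from "@aws/durable-functions-sdk";

export const handler = makeHandler(async (event: { userId: string }, context: DurableContext) => {
  return context.step("lookup-user", async () => lookupUser(event.userId), {
    retryStrategy: {
      maxAttempts: 4,
      initialDelaySeconds: 2,
      backoffMultiplier: 2,
      // Retry only transient errors; fail fast on permanent ones.
      shouldRetry: (error: unknown) => {
        const name = error instanceof Error ? error.name : "";
        return name === "ThrottlingException" || name === "TimeoutError";
      },
    },
  });
});

// Placeholder for the real data-store call.
async function lookupUser(userId: string): Promise<{ userId: string; plan: string }> {
  return { userId, plan: "standard" };
}
```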
Disable retries:
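A sketch under the same assumptions; retries are disabled here by returning false from `shouldRetry`.

```ts
// Illustrative sketch: import path and option names are assumptions.
import { makeHandler, DurableContext } from "@aws/durable-functions-sdk";

export const handler = makeHandler(async (event: { paymentId: string }, context: DurableContext) => {
  // The step fails on the first error; the exception propagates to the handler.
  return context.step("capture-payment-once", async () => capturePayment(event.paymentId), {
    retryStrategy: {
      shouldRetry: () => false,  // never retry this step
    },
  });
});

// Placeholder for a non-idempotent downstream call that must not run twice.
async function capturePayment(paymentId: string): Promise<{ captured: boolean }> {
  return { captured: true };
}
```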
When the retry strategy returns shouldRetry: false, the step fails immediately without retries. Use this for operations that should not be retried, such as idempotency checks or operations with side effects that cannot be safely repeated.
Exceptions outside steps
When an uncaught exception occurs in your handler code but outside any step, the SDK marks the execution as failed. This ensures errors in your application logic are properly captured and reported.
| Scenario | What happens | Metering impact |
|---|---|---|
| Exception in handler code outside any step | The SDK marks the execution as FAILED and returns the error. The exception is not automatically retried. | Error payload size |
To enable automatic retry for error-prone code, wrap it in a step with a retry strategy. Steps provide automatic retry with configurable backoff, while code outside steps fails immediately.
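For example (using the same assumed API shape as the sketches above), moving an external HTTP call from the handler body into a step turns an immediate execution failure into a retriable, checkpointed operation.

```ts
// Illustrative sketch: import path and step() signature are assumptions.
import { makeHandler, DurableContext } from "@aws/durable-functions-sdk";

export const handler = makeHandler(async (event: { endpoint: string }, context: DurableContext) => {
  // If this call were made directly in the handler, an uncaught error would
  // fail the whole execution with no retry. Inside a step, it is checkpointed
  // and retried with backoff.
  const status = await context.step(
    "call-external-api",
    async () => {
      const response = await fetch(event.endpoint);
      if (!response.ok) {
        throw new Error(`Upstream returned ${response.status}`);
      }
      return response.status;
    },
    { retryStrategy: { maxAttempts: 3, initialDelaySeconds: 5, backoffMultiplier: 2 } },
  );

  return { status };
});
```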
Backend retries
Backend retries occur when Lambda encounters infrastructure failures, runtime errors, or when the SDK cannot communicate with the durable execution service. Lambda automatically retries these failures to ensure your durable functions can recover from transient infrastructure issues.
Backend retry scenarios
Lambda automatically retries your function when it encounters the following scenarios:
- Internal service errors - When Lambda or the durable execution service returns a 5xx error, indicating a temporary service issue.
- Throttling - When your function is throttled due to concurrency limits or service quotas.
- Timeouts - When the SDK cannot reach the durable execution service within the timeout period.
- Sandbox initialization failures - When Lambda cannot initialize the execution environment.
- Runtime errors - When the Lambda runtime encounters errors outside your function code, such as out-of-memory errors or process crashes.
- Invalid checkpoint token errors - When the checkpoint token is no longer valid, typically due to service-side state changes.
The following table describes how the SDK handles these scenarios:
| Scenario | What happens | Metering impact |
|---|---|---|
| Runtime error outside durable handler (OOM, timeout, crash) | Lambda automatically retries the invocation. The SDK replays from the last checkpoint, skipping completed steps. | Error payload size + 1 operation per retry |
| Service error (5xx) or timeout when calling the CheckpointDurableExecution / GetDurableExecutionState APIs | Lambda automatically retries the invocation. The SDK replays from the last checkpoint. | Error payload size + 1 operation per retry |
| Throttling (429) or invalid checkpoint token when calling the CheckpointDurableExecution / GetDurableExecutionState APIs | Lambda automatically retries the invocation with exponential backoff. The SDK replays from the last checkpoint. | Error payload size + 1 operation per retry |
| Client error (4xx, except 429 and invalid token) when calling the CheckpointDurableExecution / GetDurableExecutionState APIs | The SDK marks the execution as FAILED. No automatic retry occurs because the error indicates a permanent issue. | Error payload size |
Backend retries use exponential backoff and continue until the function succeeds or the execution timeout is reached. During replay, the SDK skips completed checkpoints and continues execution from the last successful operation, ensuring your function doesn't re-execute completed work.
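As a concrete illustration of replay (same assumed API shape as the earlier sketches), consider a two-step function: if the runtime crashes after the first step's checkpoint, the retried invocation replays the handler, returns the recorded result for the first step without re-executing it, and resumes at the second step.

```ts
// Illustrative sketch of replay behavior; import path and step() signature are assumptions.
import { makeHandler, DurableContext } from "@aws/durable-functions-sdk";

export const handler = makeHandler(async (event: { orderId: string }, context: DurableContext) => {
  // Checkpointed after it succeeds; on replay the recorded result is returned
  // instead of running reserveInventory() again.
  const reservation = await context.step("reserve-inventory", async () => reserveInventory(event.orderId));

  // If the sandbox crashes here (for example, out of memory), Lambda retries the
  // invocation and execution resumes at this step, not from the beginning.
  const shipment = await context.step("schedule-shipment", async () => scheduleShipment(reservation.id));

  return { reservationId: reservation.id, shipmentId: shipment.id };
});

// Placeholders for the real downstream calls.
async function reserveInventory(orderId: string): Promise<{ id: string }> {
  return { id: `res-${orderId}` };
}
async function scheduleShipment(reservationId: string): Promise<{ id: string }> {
  return { id: `ship-${reservationId}` };
}
```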
Retry best practices
Follow these best practices when configuring retry strategies:
- Configure explicit retry strategies - Don't rely on default retry behavior in production. Configure explicit retry strategies with appropriate max attempts and backoff intervals for your use case.
- Use conditional retries - Implement shouldRetry logic to retry only transient errors (rate limits, timeouts) and fail fast on permanent errors (validation failures, not-found errors).
- Set appropriate max attempts - Balance between resilience and execution time. Too many retries can delay failure detection, while too few can cause unnecessary failures.
- Use exponential backoff - Exponential backoff reduces load on downstream services and increases the likelihood of recovery from transient failures.
- Wrap error-prone code in steps - Code outside steps cannot be automatically retried. Wrap external API calls, database queries, and other error-prone operations in steps with retry strategies.
- Monitor retry metrics - Track step retry operations and execution failures in Amazon CloudWatch to identify patterns and optimize retry strategies.