
Retries for Lambda durable functions

Durable functions provide automatic retry capabilities that make your applications resilient to transient failures. Retries happen at two levels: step retries for business logic failures, handled by the SDK, and backend retries for infrastructure failures, handled by Lambda.

Step retries

When an uncaught exception occurs within a step, the SDK automatically retries the step based on the configured retry strategy. Step retries are checkpointed operations that allow the SDK to suspend execution and resume later without losing progress.
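
For example, any code whose failures should be retried can be wrapped in a step. The following is a minimal sketch that relies on the SDK's default retry behavior; sendWelcomeEmail is a hypothetical helper used for illustration.

TypeScript
const receipt = await context.step('send-welcome-email', async () => {
  // Any uncaught exception thrown here is checkpointed by the SDK and the
  // step is retried according to the retry strategy in effect (the default
  // behavior, unless a strategy is configured as shown later in this topic).
  return await sendWelcomeEmail(event.userId); // hypothetical helper
});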

Step retry behavior

The following table describes how the SDK handles exceptions within steps:

Scenario | What happens | Metering impact
Exception in step with remaining retry attempts | The SDK creates a checkpoint for the retry and suspends the function. On the next invocation, the step retries with the configured backoff delay. | 1 operation + error payload size
Exception in step with no remaining retry attempts | The step fails and throws an exception. If your handler code doesn't catch this exception, the entire execution fails. | 1 operation + error payload size

When a step needs to retry, the SDK checkpoints the retry state and exits the Lambda invocation if no other work is running. This allows the SDK to implement backoff delays without consuming compute resources. The function resumes automatically after the backoff period.
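
If a step exhausts its retry attempts, the exception from the final attempt surfaces to your handler code, where you can catch it and run fallback logic instead of failing the whole execution. The following is a minimal sketch; chargeCard is a hypothetical helper used for illustration.

TypeScript
let paymentResult;
try {
  paymentResult = await context.step('charge-card', async () => {
    return await chargeCard(event.orderId); // hypothetical helper
  });
} catch (error) {
  // All retry attempts for the step failed. Catching the exception here
  // keeps the overall execution from failing; without this catch, the
  // execution would fail with this error.
  paymentResult = { status: 'FAILED', reason: String(error) };
}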

Configuring step retry strategies

Configure retry strategies to control how steps handle failures. You can specify maximum attempts, backoff intervals, and conditions for retrying.

Exponential backoff with max attempts:

TypeScript
const result = await context.step('call-api', async () => {
  const response = await fetch('https://api.example.com/data');
  if (!response.ok) throw new Error(`API error: ${response.status}`);
  return await response.json();
}, {
  retryStrategy: (error, attemptCount) => {
    if (attemptCount >= 5) {
      return { shouldRetry: false };
    }
    // Exponential backoff: 2s, 4s, 8s, 16s, 32s (capped at 300s)
    const delay = Math.min(2 * Math.pow(2, attemptCount - 1), 300);
    return { shouldRetry: true, delay: { seconds: delay } };
  }
});
Python
def retry_strategy(error, attempt_count):
    if attempt_count >= 5:
        return {'should_retry': False}
    # Exponential backoff: 2s, 4s, 8s, 16s, 32s (capped at 300s)
    delay = min(2 * (2 ** (attempt_count - 1)), 300)
    return {'should_retry': True, 'delay': delay}

result = context.step(
    lambda _: call_external_api(),
    name='call-api',
    config=StepConfig(retry_strategy=retry_strategy)
)

Fixed interval backoff:

TypeScript
const orders = await context.step('query-orders', async () => {
  return await queryDatabase(event.userId);
}, {
  retryStrategy: (error, attemptCount) => {
    if (attemptCount >= 3) {
      return { shouldRetry: false };
    }
    return { shouldRetry: true, delay: { seconds: 5 } };
  }
});
Python
def retry_strategy(error, attempt_count):
    if attempt_count >= 3:
        return {'should_retry': False}
    return {'should_retry': True, 'delay': 5}

orders = context.step(
    lambda _: query_database(event['userId']),
    name='query-orders',
    config=StepConfig(retry_strategy=retry_strategy)
)

Conditional retry (retry only specific errors):

TypeScript
const result = await context.step('call-rate-limited-api', async () => {
  const response = await fetch('https://api.example.com/data');
  if (response.status === 429) throw new Error('RATE_LIMIT');
  if (response.status === 504) throw new Error('TIMEOUT');
  if (!response.ok) throw new Error(`API_ERROR_${response.status}`);
  return await response.json();
}, {
  retryStrategy: (error, attemptCount) => {
    // Only retry rate limits and timeouts
    const isRetryable = error.message === 'RATE_LIMIT' || error.message === 'TIMEOUT';
    if (!isRetryable || attemptCount >= 3) {
      return { shouldRetry: false };
    }
    // Exponential backoff: 1s, 2s, 4s (capped at 30s)
    const delay = Math.min(Math.pow(2, attemptCount - 1), 30);
    return { shouldRetry: true, delay: { seconds: delay } };
  }
});
Python
def retry_strategy(error, attempt_count):
    # Only retry rate limits and timeouts
    is_retryable = str(error) in ['RATE_LIMIT', 'TIMEOUT']
    if not is_retryable or attempt_count >= 3:
        return {'should_retry': False}
    # Exponential backoff: 1s, 2s, 4s (capped at 30s)
    delay = min(2 ** (attempt_count - 1), 30)
    return {'should_retry': True, 'delay': delay}

result = context.step(
    lambda _: call_rate_limited_api(),
    name='call-rate-limited-api',
    config=StepConfig(retry_strategy=retry_strategy)
)

Disable retries:

TypeScript
const isDuplicate = await context.step('check-duplicate', async () => {
  return await checkIfOrderExists(event.orderId);
}, {
  retryStrategy: () => ({ shouldRetry: false })
});
Python
is_duplicate = context.step(
    lambda _: check_if_order_exists(event['orderId']),
    name='check-duplicate',
    config=StepConfig(
        retry_strategy=lambda error, attempt: {'should_retry': False}
    )
)

When the retry strategy returns shouldRetry: false (should_retry in Python), the step fails immediately without retrying. Use this for operations that should not be retried, such as idempotency checks or operations with side effects that cannot be safely repeated.

Exceptions outside steps

When an uncaught exception occurs in your handler code but outside any step, the SDK marks the execution as failed. This ensures errors in your application logic are properly captured and reported.

Scenario | What happens | Metering impact
Exception in handler code outside any step | The SDK marks the execution as FAILED and returns the error. The exception is not automatically retried. | Error payload size

To enable automatic retry for error-prone code, wrap it in a step with a retry strategy. Steps provide automatic retry with configurable backoff, while code outside steps fails immediately.
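
The following sketch shows the contrast: the same call fails the execution immediately when it runs outside a step, but is retried when wrapped in a step with a retry strategy. fetchExchangeRate is a hypothetical helper used for illustration.

TypeScript
// Outside a step: if fetchExchangeRate throws, the execution is marked
// FAILED immediately and the call is not retried.
// const rate = await fetchExchangeRate('EUR');

// Inside a step: failures are retried according to the configured strategy.
const rate = await context.step('fetch-exchange-rate', async () => {
  return await fetchExchangeRate('EUR'); // hypothetical helper
}, {
  retryStrategy: (error, attemptCount) => {
    if (attemptCount >= 3) {
      return { shouldRetry: false };
    }
    return { shouldRetry: true, delay: { seconds: 10 } };
  }
});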

Backend retries

Backend retries occur when Lambda encounters infrastructure failures, runtime errors, or when the SDK cannot communicate with the durable execution service. Lambda automatically retries these failures to ensure your durable functions can recover from transient infrastructure issues.

Backend retry scenarios

Lambda automatically retries your function when it encounters the following scenarios:

  • Internal service errors - When Lambda or the durable execution service returns a 5xx error, indicating a temporary service issue.

  • Throttling - When your function is throttled due to concurrency limits or service quotas.

  • Timeouts - When the SDK cannot reach the durable execution service within the timeout period.

  • Sandbox initialization failures - When Lambda cannot initialize the execution environment.

  • Runtime errors - When the Lambda runtime encounters errors outside your function code, such as out-of-memory errors or process crashes.

  • Invalid checkpoint token errors - When the checkpoint token is no longer valid, typically due to service-side state changes.

The following table describes how the SDK handles these scenarios:

Scenario | What happens | Metering impact
Runtime error outside the durable handler (OOM, timeout, crash) | Lambda automatically retries the invocation. The SDK replays from the last checkpoint, skipping completed steps. | Error payload size + 1 operation per retry
Service error (5xx) or timeout when calling the CheckpointDurableExecution / GetDurableExecutionState APIs | Lambda automatically retries the invocation. The SDK replays from the last checkpoint. | Error payload size + 1 operation per retry
Throttling (429) or invalid checkpoint token when calling the CheckpointDurableExecution / GetDurableExecutionState APIs | Lambda automatically retries the invocation with exponential backoff. The SDK replays from the last checkpoint. | Error payload size + 1 operation per retry
Client error (4xx, except 429 and invalid checkpoint token) when calling the CheckpointDurableExecution / GetDurableExecutionState APIs | The SDK marks the execution as FAILED. No automatic retry occurs because the error indicates a permanent issue. | Error payload size

Backend retries use exponential backoff and continue until the function succeeds or the execution timeout is reached. During replay, the SDK skips completed checkpoints and continues execution from the last successful operation, ensuring your function doesn't re-execute completed work.
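
In practice, this means code outside steps runs again on every replay, while completed steps return their checkpointed results without re-executing. The following is a minimal sketch; createOrder is a hypothetical helper used for illustration.

TypeScript
// Runs on every invocation, including replays after a backend retry, so keep
// code outside steps free of side effects.
console.log('handler invoked');

// Runs to completion at most once: on replay the SDK returns the checkpointed
// result instead of invoking createOrder again.
const order = await context.step('create-order', async () => {
  return await createOrder(event); // hypothetical helper
});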

Retry best practices

Follow these best practices when configuring retry strategies:

  • Configure explicit retry strategies - Don't rely on default retry behavior in production. Configure explicit retry strategies with appropriate max attempts and backoff intervals for your use case.

  • Use conditional retries - Implement shouldRetry logic to retry only transient errors (rate limits, timeouts) and fail fast on permanent errors (validation failures, not found).

  • Set appropriate max attempts - Balance between resilience and execution time. Too many retries can delay failure detection, while too few can cause unnecessary failures.

  • Use exponential backoff - Exponential backoff reduces load on downstream services and increases the likelihood of recovery from transient failures.

  • Wrap error-prone code in steps - Code outside steps cannot be automatically retried. Wrap external API calls, database queries, and other error-prone operations in steps with retry strategies.

  • Monitor retry metrics - Track step retry operations and execution failures in Amazon CloudWatch to identify patterns and optimize retry strategies.