Prompt caching for faster model inference
Note
Amazon Bedrock prompt caching is generally available with Claude 3.7 Sonnet, Claude 3.5 Haiku, Amazon Nova Micro, Amazon Nova Lite, Amazon Nova Pro, and Amazon Nova Premier. Customers who were given access to Claude 3.5 Sonnet v2 during the prompt caching preview will retain their access, however no additional customers will be granted access to prompt caching on the Claude 3.5 Sonnet v2 model.
Prompt caching is an optional feature that you can use with supported models on Amazon Bedrock to reduce inference response latency and input token costs. By adding portions of your context to a cache, the model can leverage the cache to skip recomputation of inputs, allowing Bedrock to share in the compute savings and lower your response latencies.
Prompt caching can help when you have workloads with long and repeated contexts that are frequently reused for multiple queries. For example, if you have a chatbot where users can upload documents and ask questions about them, it can be time consuming for the model to process the document every time the user provides input. With prompt caching, you can cache the document so that future queries containing the document don't need to reprocess it.
When using prompt caching, you're charged at a reduced rate for tokens read from cache.
Depending on the model, tokens written to cache may be charged at a rate that is higher than
that of uncached input tokens. Any tokens not read from or written to cache, are charged at
the standard input token rate for that model. For more information, see
the Amazon Bedrock pricing page
How it works
If you opt to use prompt caching, Amazon Bedrock creates a cache composed of cache checkpoints. These are markers that define the contiguous subsection of your prompt that you wish to cache (often referred to as a prompt prefix). These prompt prefixes should be static between requests, alterations to the prompt prefix in subsequent requests will result in cache misses.
Cache checkpoints have a minimum and maximum number of tokens, dependent on the specific model you're using. You can only create a cache checkpoint if your total prompt prefix meets the minimum number of tokens. For example, the Anthropic Claude 3.7 Sonnet model requires at least 1,024 tokens per cache checkpoint. That means that your first cache checkpoint can be defined after 1,024 tokens and your second cache checkpoint can be defined after 2,048 tokens. If you try to add a cache checkpoint before meeting the minimum number of tokens, your inference will still succeed, but your prefix will not be cached. The cache has a five minute Time To Live (TTL), which resets with each successful cache hit. During this period, the context in the cache is preserved. If no cache hits occur within the TTL window, your cache expires.
You can use prompt caching anytime you get model inference in Amazon Bedrock for supported models. Prompt caching is supported by the following Amazon Bedrock features:
- Converse and ConverseStream APIs
-
You can carry on a conversation with a model where you specify cache checkpoints in your prompts.
- InvokeModel and InvokeModelWithResponseStream APIs
-
You can submit single prompt requests in which you enable prompt caching and specify your cache checkpoints.
- Prompt Caching with Cross-region Inference
-
Prompt caching can be used in conjunction with cross region inference. Cross-region inference automatically selects the optimal AWS Region within your geography to serve your inference request, thereby maximizing available resources and model availability. At times of high demand, these optimizations may lead to increased cache writes.
- Amazon Bedrock Prompt management
-
When you create or modify a prompt, you can choose to enable prompt caching. Depending on the model, you can cache system prompts, system instructions, and messages (user and assistant). You can also choose to disable prompt caching.
The APIs provide you with the most flexibility and granular control over the prompt cache. You can set an individual cache checkpoint within your prompts. You can add to the cache by creating more cache checkpoints, up to the maximum number of cache checkpoints allowed for the specific model. For more information, see Supported models, Regions, and limits.
Supported models, Regions, and limits
The following table lists the supported models along with their token minimums, maximum number of cache checkpoints, and fields that allow cache checkpoints.
Model name | Model ID | Release Type | Minimum number of tokens per cache checkpoint | Maximum number of cache checkpoints per request | Fields that accept prompt cache checkpoints |
---|---|---|---|---|---|
Claude 3 Opus 4.1 |
anthropic.claude-opus-4-1-20250805-v1:0 |
Generally Available |
1,024 |
4 |
`system`, `messages`, and `tools` |
Claude Opus 4 |
anthropic.claude-opus-4-20250514-v1:0 |
Generally Available |
1,024 |
4 |
`system`, `messages`, and `tools` |
Claude Sonnet 4.5 |
anthropic.claude-sonnet-4-5-20250929-v1:0 |
Generally Available |
1,024 |
4 |
`system`, `messages`, and `tools` |
Claude Haiku 4.5 |
anthropic.claude-haiku-4-5-20251001-v1:0 |
Generally Available |
4,096 |
4 |
`system`, `messages`, and `tools` |
Claude Sonnet 4 |
anthropic.claude-sonnet-4-20250514-v1:0 |
Generally Available |
1,024 |
4 |
`system`, `messages`, and `tools` |
Claude 3.7 Sonnet |
anthropic.claude-3-7-sonnet-20250219-v1:0 |
Generally Available |
1,024 |
4 |
`system`, `messages`, and `tools` |
Claude 3.5 Haiku |
anthropic.claude-3-5-haiku-20241022-v1:0 |
Generally Available |
2,048 |
4 |
`system`, `messages`, and `tools` |
Claude 3.5 Sonnet v2 |
anthropic.claude-3-5-sonnet-20241022-v2:0 |
Preview |
1,024 |
4 |
`system`, `messages`, and `tools` |
Amazon Nova Micro |
amazon.nova-micro-v1:0 |
Generally available |
1K1 |
4 |
`system` and `messages` |
Amazon Nova Lite |
amazon.nova-lite-v1:0 |
Generally available |
1K1 |
4 |
`system` and `messages`2 |
Amazon Nova Pro |
amazon.nova-pro-v1:0 |
Generally available |
1K1 |
4 |
`system` and `messages`2 |
Amazon Nova Premier |
amazon.nova-premier-v1:0 |
Generally available |
1K1 |
4 |
`system` and `messages`2 |
1: The Amazon Nova models support a maximum number of 20K tokens for prompt caching.
2: Prompt caching is primarily for text prompts.
Amazon Nova offers automatic prompt caching for all text prompts, including User
and System
messages. This mechanism can provide latency benefits when prompts begin with repetitive parts, even without explicit configuration. However, to unlock cost savings and ensure more consistent performance benefits, we recommend opting in to Explicit Prompt Caching.
Simplified Cache Management for Claude Models
For Claude models, Amazon Bedrock offers a simplified approach to cache management that reduces the complexity of manually placing cache checkpoints. Instead of requiring you to specify exact cache checkpoint locations, you can use automatic cache management with a single breakpoint at the end of your static content.
When you enable simplified cache management, the system automatically checks for cache hits at previous content block boundaries, looking back up to approximately 20 content blocks from your specified breakpoint. This allows the model to find the longest matching prefix from your cache without requiring you to predict the optimal checkpoint locations. To use this, place a single cache checkpoint at the end of your static content, before any dynamic or variable content. The system will automatically find the best cache match.
For more granular control, you can still use multiple cache checkpoints (up to 4 for Claude models) to specify exact cache boundaries. You should use multipled cache checkpoints if you are caching sections that change at different frequencies or want more control over exactly what gets cached.
Important
The automatic prefix checking only looks back approximately 20 content blocks from your cache checkpoint. If your static content extends beyond this range, consider using multiple cache checkpoints or restructuring your prompt to place the most frequently reused content within this range.
Getting started
The following sections show you a brief overview of how to use the prompt caching feature for each method of interacting with models through Amazon Bedrock.
The Converse API provides advanced and flexible options for implementing prompt caching in multi-turn conversations. For more information about the prompt requirements for each model, see the preceding section Supported models, Regions, and limits.
Example request
The following examples show a cache checkpoint set in the
messages
, system
, or tools
fields of a request to the Converse API. You can place checkpoints in any of these
locations for a given request. For example, if sending a request to the
Claude 3.5 Sonnet v2 model, you could place two cache checkpoints in
messages
, one cache checkpoint in system
,
and one in tools
. For more detailed information and examples of
structuring and sending Converse API requests, see
Carry out a conversation with the
Converse API operations.
The model response from the Converse API includes two new fields that are specific to prompt
caching. The CacheReadInputTokens
and
CacheWriteInputTokens
values tell you how many tokens were
read from the cache and how many tokens were written to the cache because of
your previous request. These are values that you're charged for by Amazon Bedrock, at a
rate that's lower than the cost of full model inference.
Prompt caching is enabled by default when you call the InvokeModel API. You can set cache checkpoints at any point in your request body, similar to the previous example for the Converse API.
For more information about sending an InvokeModel request, see Submit a single prompt with InvokeModel.
In a chat playground in the Amazon Bedrock console, you can turn on the prompt caching option, and Amazon Bedrock automatically creates cache checkpoints for you.
Follow the instructions in Generate responses in the console using playgrounds to get started with prompting in an Amazon Bedrock playground. For supported models, prompt caching is automatically turned on in the playground. However, if it’s not, then do the following to turn on prompt caching:
-
In the left side panel, open the Configurations menu.
-
Turn on the Prompt caching toggle.
-
Run your prompts.
After your combined input and model responses reach the minimum required number of tokens for a checkpoint (which varies by model), Amazon Bedrock automatically creates the first cache checkpoint for you. As you continue chatting, each subsequent reach of the minimum number of tokens creates a new checkpoint, up to the maximum number of checkpoints allowed for the model. You can view your cache checkpoints at any time by choosing View cache checkpoints next to the Prompt caching toggle, as shown in the following screenshot.

You can view how many tokens are being read from and written to the cache due
to each interaction with the model by viewing the Caching metrics
pop-up (
) in the playground responses.

If you turn off the prompt caching toggle while in the middle of a conversation, you can continue chatting with the model.