GENCOST03-BP03 Implement prompt caching to reduce token costs

Implement prompt caching for supported foundation models to reduce inference response latency and input token costs. This best practice helps organizations optimize costs by caching frequently used portions of prompts to avoid recomputation, while maintaining performance and reliability.

Desired outcome: Reduce inference costs by caching commonly used prompt components and using cached tokens at a reduced rate.

Benefits of establishing this best practice:

Control resource consumption parameters - Reduce token costs by reusing cached prompt components.
Optimize model and inference selection - Decrease latency by avoiding recomputation of cached prompt sections.

Level of risk exposed if this best practice is not established: Medium

Implementation guidance

Prompt caching is an optional feature available on supported models in Amazon Bedrock that can reduce inference response latency and input token costs. By caching portions of your context, the model can use the cache to skip recomputation, allowing Bedrock to achieve cost savings through lower token rates.

Prompt caching can help when you have workloads with long and repeated contexts that are frequently reused across multiple queries. For example, if you have a chatbot where users can upload documents and ask questions about them, caching the document content avoids reprocessing it for each user query.

When using prompt caching, cached tokens are charged at a reduced rate. Depending on the model, tokens written to cache may be charged at a higher rate than uncached input tokens. Tokens not read from or written to cache are charged at the standard input token rate.

Cache checkpoints have model-specific minimum and maximum token requirements. You can only create a checkpoint if your prompt prefix meets the minimum token count. For example, Claude 3.7 Sonnet requires at least 1,024 tokens per checkpoint. The cache has a five minute TTL that resets with each successful hit.

Implementation steps

Identify opportunities for caching:
- Review workload for repeated prompt components
- Verify prompts meet minimum token requirements
- Assess potential cost savings from reduced token rates
Enable prompt caching for supported models:
- Turn on caching in Amazon Bedrock console
- For APIs, set appropriate caching flags
- Configure cache checkpoints at optimal locations
Monitor caching metrics:
- Track cache hit and miss rates
- Monitor token costs for cached compared to uncached content
- Analyze latency improvements
Optimize cache usage:
- Tune checkpoint placement
- Adjust prompt structure to maximize cache hits
- Balance cache write costs with read savings

Resources

Related best practices:

COST10-BP01

Related documents:

Related examples:

Warning Javascript is disabled or is unavailable in your browser.

To use the Amazon Web Services Documentation, Javascript must be enabled. Please refer to your browser's Help pages for instructions.

Document Conventions

GENCOST03-BP02 Control model response length

GENCOST03-BP04 Annotate user input to enable cost-aware content filtering