

# Using the Amazon ElastiCache Well-Architected Lens

This section describes the Amazon ElastiCache Well-Architected Lens, a collection of design principles and guidance for designing well-architected ElastiCache workloads.
+ The ElastiCache Lens is additive to the [AWS Well-Architected Framework](https://docs.aws.amazon.com/wellarchitected/latest/framework/welcome.html).
+ Each Pillar has a set of questions to help start the discussion around an ElastiCache Architecture.
  + Each question has a number of leading practices along with their scores for reporting.
    + *Required* - Necessary before going to production (absence is a high risk)
    + *Best* - The best possible state a customer can achieve
    + *Good* - What we recommend customers have (absence is a medium risk)
+ Well-Architected terminology
  + [Component](https://docs.aws.amazon.com/wellarchitected/latest/framework/definitions.html) – Code, configuration and AWS Resources that together deliver against a requirement. Components interact with other components, and often equate to a service in microservice architectures.
  + [Workload](https://docs.aws.amazon.com/wellarchitected/latest/framework/definitions.html) - A set of components that together deliver business value. Examples of workloads are marketing websites, e-commerce websites, the back-ends for a mobile app, analytic platforms, etc.

**Note**  
This guide has not been updated to include information on ElastiCache serverless caching and the new Valkey engine.

**Topics**
+ [Amazon ElastiCache Well-Architected Lens Operational Excellence Pillar](OperationalExcellencePillar.md)
+ [Amazon ElastiCache Well-Architected Lens Security Pillar](SecurityPillar.md)
+ [Amazon ElastiCache Well-Architected Lens Reliability Pillar](ReliabilityPillar.md)
+ [Amazon ElastiCache Well-Architected Lens Performance Efficiency Pillar](PerformanceEfficiencyPillar.md)
+ [Amazon ElastiCache Well-Architected Lens Cost Optimization Pillar](CostOptimizationPillar.md)

# Amazon ElastiCache Well-Architected Lens Operational Excellence Pillar

The operational excellence pillar focuses on running and monitoring systems to deliver business value, and continually improving processes and procedures. Key topics include automating changes, responding to events, and defining standards to manage daily operations.

**Topics**
+ [OE 1: How do you understand and respond to alerts and events triggered by your ElastiCache cluster?](#OperationalExcellencePillarOE1)
+ [OE 2: When and how do you scale your existing ElastiCache clusters?](#OperationalExcellencePillarOE2)
+ [OE 3: How do you manage your ElastiCache cluster resources and keep your clusters up to date?](#OperationalExcellencePillarOE3)
+ [OE 4: How do you manage clients’ connections to your ElastiCache clusters?](#OperationalExcellencePillarOE4)
+ [OE 5: How do you deploy ElastiCache components for a workload?](#OperationalExcellencePillarOE5)
+ [OE 6: How do you plan for and mitigate failures?](#OperationalExcellencePillarOE6)
+ [OE 7: How do you troubleshoot Valkey or Redis OSS engine events?](#OperationalExcellencePillarOE7)

## OE 1: How do you understand and respond to alerts and events triggered by your ElastiCache cluster?


**Question-level introduction:** When you operate ElastiCache clusters, you can optionally receive notifications and alerts when specific events occur. ElastiCache, by default, logs [events](ECEvents.md) that relate to your resources, such as a failover, node replacement, scaling operation, scheduled maintenance, and more. Each event includes the date and time, the source name and source type, and a description.

**Question-level benefit:** Understanding and managing the underlying reasons behind the events that trigger alerts generated by your cluster enables you to operate more effectively and respond to events appropriately.
+ **[Required]** Review the events generated by ElastiCache on the ElastiCache console (after selecting your region), using the [AWS Command Line Interface](http://aws.amazon.com/cli) (AWS CLI) [describe-events](https://docs.aws.amazon.com/cli/latest/reference/elasticache/describe-events.html) command, or through the [ElastiCache API](https://docs.aws.amazon.com/AmazonElastiCache/latest/APIReference/API_DescribeEvents.html). Configure ElastiCache to send notifications for important cluster events using Amazon Simple Notification Service (Amazon SNS). Using Amazon SNS with your clusters allows you to programmatically take actions upon ElastiCache events.
  + There are two broad categories of events: current and scheduled. Current events include resource creation and deletion, scaling operations, failover, node reboot, snapshot creation, cluster parameter modification, CA certificate renewal, and failure events (cluster provisioning failures due to VPC or ENI issues, scaling failures due to ENI issues, and snapshot failures). Scheduled events include a node scheduled for replacement during the maintenance window and a node replacement being rescheduled.
  + Although you may not need to react immediately to some of these events, it is critical to first look at all failure events:
    + ElastiCache:AddCacheNodeFailed
    + ElastiCache:CacheClusterProvisioningFailed
    + ElastiCache:CacheClusterScalingFailed
    + ElastiCache:CacheNodesRebooted
    + ElastiCache:SnapshotFailed (Valkey or Redis OSS only)
  + **[Resources]:**
    + [Managing ElastiCache Amazon SNS notifications](ECEvents.SNS.md)
    + [Event Notifications and Amazon SNS](ElastiCacheSNS.md)
+ **[Best]** To automate responses to events, leverage AWS services such as Amazon SNS and AWS Lambda. Follow best practices by making small, frequent, reversible changes as code to evolve your operations over time. Use Amazon CloudWatch metrics to monitor your clusters.

  **[Resources]:** [Monitor ElastiCache (cluster mode disabled) read replica endpoints using AWS Lambda, Amazon Route 53, and Amazon SNS](https://aws.amazon.com/blogs/database/monitor-amazon-elasticache-for-redis-cluster-mode-disabled-read-replica-endpoints-using-aws-lambda-amazon-route-53-and-amazon-sns/) for a use case that uses Lambda and SNS. 
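As a minimal sketch of the first triage step, the following Python filters a batch of events (shaped like the `DescribeEvents` response) down to the failure events listed above. The match-on-message heuristic and the sample event shapes are illustrative assumptions, not official SDK behavior; in practice the events would come from the AWS CLI or an SDK call.

```python
# Event names come from the failure-event list above; the message-matching
# heuristic and event shape (mirroring the DescribeEvents response) are
# illustrative assumptions.
FAILURE_EVENTS = (
    "ElastiCache:AddCacheNodeFailed",
    "ElastiCache:CacheClusterProvisioningFailed",
    "ElastiCache:CacheClusterScalingFailed",
    "ElastiCache:CacheNodesRebooted",
    "ElastiCache:SnapshotFailed",
)

def failure_events(events):
    """Return the events whose message references a failure event type."""
    return [event for event in events
            if any(name in event.get("Message", "") for name in FAILURE_EVENTS)]
```

A function like this could sit behind an SNS-triggered Lambda to page only on the failure events rather than on every notification.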

## OE 2: When and how do you scale your existing ElastiCache clusters?


**Question-level introduction:** Right-sizing your ElastiCache cluster is a balancing act that needs to be re-evaluated whenever the underlying workload changes. Your objective is to operate a right-sized environment for your workload.

**Question-level benefit:** Over-utilization of your resources can result in elevated latency and overall decreased performance. Under-utilization, on the other hand, results in over-provisioned resources and unnecessary cost. By right-sizing your environments you can strike a balance between performance efficiency and cost optimization. To remediate over- or under-utilization of your resources, ElastiCache can scale in two dimensions. You can scale vertically by increasing or decreasing node capacity, and horizontally by adding or removing nodes.
+ **[Required]** CPU and network over-utilization on primary nodes should be addressed by offloading and redirecting the read operations to replica nodes. Use replica nodes for read operations to reduce primary node utilization. This can be configured in your Valkey or Redis OSS client library by connecting to the ElastiCache reader endpoint for cluster mode disabled, or by using the READONLY command for cluster mode enabled.

  **[Resources]:**
  + [Finding connection endpoints in ElastiCache](Endpoints.md)
  + [Cluster Right-Sizing](https://aws.amazon.com/blogs/database/five-workload-characteristics-to-consider-when-right-sizing-amazon-elasticache-redis-clusters/)
  + [READONLY Command](https://valkey.io/commands/readonly)
+ **[Required]** Monitor the utilization of critical cluster resources such as CPU, memory, and network. The utilization of these specific cluster resources needs to be tracked to inform your decision to scale, and the type of scaling operation. For ElastiCache cluster mode disabled, primary and replica nodes can scale vertically. Replica nodes can also scale horizontally from 0 to 5 nodes. For cluster mode enabled, the same applies within each shard of your cluster. In addition, you can increase or reduce the number of shards.

  **[Resources]:**
  + [Monitoring best practices with ElastiCache using Amazon CloudWatch](https://aws.amazon.com/blogs/database/monitoring-best-practices-with-amazon-elasticache-for-redis-using-amazon-cloudwatch/)
  + [Scaling ElastiCache Clusters for Valkey and Redis OSS](Scaling.md)
  + [Scaling ElastiCache Clusters for Memcached](Scaling.md)
+ **[Best]** Monitoring trends over time can help you detect workload changes that would remain unnoticed if monitored at a particular point in time. To detect longer term trends, use CloudWatch metrics to scan for longer time ranges. The learnings from observing extended periods of CloudWatch metrics should inform your forecast around cluster resources utilization. CloudWatch data points and metrics are available for up to 455 days.

  **[Resources]:**
  + [Monitoring ElastiCache with CloudWatch Metrics](CacheMetrics.md)
  + [Monitoring Memcached with CloudWatch Metrics](CacheMetrics.md)
  + [Monitoring best practices with ElastiCache using Amazon CloudWatch](https://aws.amazon.com/blogs/database/monitoring-best-practices-with-amazon-elasticache-for-redis-using-amazon-cloudwatch/)
+ **[Best]** If your ElastiCache resources are created with CloudFormation it is best practice to perform changes using CloudFormation templates to preserve operational consistency and avoid unmanaged configuration changes and stack drifts.

  **[Resources]:**
  + [ElastiCache resource type reference for CloudFormation](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/AWS_ElastiCache.html)
+ **[Best]** Automate your scaling operations using cluster operational data, and define thresholds in CloudWatch to set up alarms. Use CloudWatch Events and Amazon Simple Notification Service (SNS) to trigger Lambda functions that call the ElastiCache API and scale your clusters automatically. An example would be to add a shard to your cluster when the `EngineCPUUtilization` metric exceeds 80% for an extended period of time. Another option would be to use `DatabaseMemoryUsagePercentage` for a memory-based threshold.

  **[Resources]:**
  + [Using Amazon CloudWatch Alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html)
  + [What are Amazon CloudWatch events?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/WhatIsCloudWatchEvents.html)
  + [Using AWS Lambda with Amazon Simple Notification Service](https://docs.aws.amazon.com/lambda/latest/dg/with-sns.html)
  + [ElastiCache API Reference](https://docs.aws.amazon.com/AmazonElastiCache/latest/APIReference/Welcome.html)
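The threshold-based automation above can be sketched as a pure decision function; in practice the datapoints would come from CloudWatch and the action would be a Lambda function calling the ElastiCache API. The 80% threshold and the three-datapoint window below are illustrative assumptions, not recommended values.

```python
def should_add_shard(cpu_datapoints, threshold=80.0, sustained=3):
    """Return True when EngineCPUUtilization stays above `threshold`
    for `sustained` consecutive datapoints, signalling a scale-out.
    Threshold and window length here are illustrative."""
    run = 0
    for value in cpu_datapoints:
        run = run + 1 if value > threshold else 0
        if run >= sustained:
            return True
    return False

print(should_add_shard([70, 85, 70, 85]))  # brief spikes only -> False
print(should_add_shard([70, 85, 90, 95]))  # sustained high CPU -> True
```

Requiring a sustained breach, rather than reacting to a single datapoint, avoids scaling on transient spikes.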

## OE 3: How do you manage your ElastiCache cluster resources and keep your clusters up to date?


**Question-level introduction:** When operating at scale, it is essential that you can pinpoint and identify all your ElastiCache resources. When rolling out new application features, you need to create cluster version symmetry across all your ElastiCache environment types: development, testing, and production. Resource attributes allow you to separate environments for different operational objectives, such as when rolling out new features or enabling new security mechanisms.

**Question-level benefit:** Separating your development, testing, and production environments is an operational best practice. It is also best practice to keep clusters and nodes across environments on the latest software patches, using well-understood and documented processes. Taking advantage of native ElastiCache features lets your engineering team focus on meeting business objectives rather than on ElastiCache maintenance.
+ **[Best]** Run on the latest engine version available and apply Self-Service Updates as soon as they become available. ElastiCache automatically updates its underlying infrastructure during your cluster's specified maintenance window. However, the nodes running in your clusters are updated via Self-Service Updates. These updates can be of two types: security patches or minor software updates. Ensure you understand the difference between the types of patches and when they are applied.

  **[Resources]:**
  + [Self-Service Updates in Amazon ElastiCache](Self-Service-Updates.md)
  + [Amazon ElastiCache Managed Maintenance and Service Updates Help Page](https://aws.amazon.com/elasticache/elasticache-maintenance/)
+ **[Best]** Organize your ElastiCache resources using tags. Use tags on replication groups and not on individual nodes. You can configure tags to be displayed when you query resources and you can use tags to perform searches and apply filters. You should use Resource Groups to easily create and maintain collections of resources that share common sets of tags.

  **[Resources]:**
  + [Tagging Best Practices](https://d1.awsstatic.com/whitepapers/aws-tagging-best-practices.pdf)
  + [ElastiCache resource type reference for CloudFormation](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/AWS_ElastiCache.html)
  + [Parameter Groups](ParameterGroups.Engine.md#ParameterGroups.Redis)

## OE 4: How do you manage clients’ connections to your ElastiCache clusters?


**Question-level introduction:** When operating at scale, you need to understand how your clients connect to the ElastiCache cluster in order to manage the operational aspects of your application (such as response times).

**Question-level benefit:** Choosing the most appropriate connection mechanism ensures that your application does not disconnect due to connectivity errors, such as time-outs.
+ **[Required]** Separate read from write operations and connect to replica nodes to execute read operations. However, be aware that when you separate writes from reads, you lose the ability to read a key immediately after writing it, due to the asynchronous nature of Valkey and Redis OSS replication. The WAIT command can be leveraged to improve real-world data safety and force replicas to acknowledge writes before responding to clients, at an overall performance cost. Using replica nodes for read operations can be configured in your ElastiCache client library using the ElastiCache reader endpoint for cluster mode disabled. For cluster mode enabled, use the READONLY command. For many ElastiCache client libraries, READONLY is implemented by default or via a configuration setting.

  **[Resources]:**
  + [Finding connection endpoints in ElastiCache](Endpoints.md)
  + [READONLY](https://valkey.io/commands/readonly)
+ **[Required]** Use connection pooling. Establishing a TCP connection has a cost in CPU time on both the client and server sides, and pooling allows you to reuse the TCP connection.

  With a pool of connections, your application can reuse and release connections at will, without the cost of establishing each connection. You can implement connection pooling via your ElastiCache client library (if supported), with a framework available for your application environment, or by building it from the ground up.
+ **[Best]** Ensure that the socket timeout of the client is set to at least one second (vs. the typical “none” default in several clients).
  + Setting the timeout value too low can lead to possible timeouts when the server load is high. Setting it too high can result in your application taking a long time to detect connection issues.
  + Control the volume of new connections by implementing connection pooling in your client application. This reduces latency and CPU utilization needed to open and close connections, and perform a TLS handshake if TLS is enabled on the cluster.

  **[Resources]:** [Configure ElastiCache for higher availability](https://aws.amazon.com/blogs/database/configuring-amazon-elasticache-for-redis-for-higher-availability/)
+ **[Good]** Using pipelining (when your use cases allow it) can significantly boost performance.
  + With pipelining you reduce the Round-Trip Time (RTT) between your application clients and the cluster, and new requests can be processed even if the client has not yet read the previous responses.
  + With pipelining you can send multiple commands to the server without waiting for replies/acknowledgments. The downside of pipelining is that when you eventually fetch all the responses in bulk, there may have been an error that you will not catch until the end.
  + Implement retry logic that, when an error is returned, resends the affected requests while omitting the bad request.

  **[Resources]:** [Pipelining](https://valkey.io/topics/pipelining/)
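As a hedged sketch of the retry pattern in the last bullet: the pipeline below resends commands that never received a reply, while omitting bad requests (which would only fail again). `send_batch` stands in for a real client's pipeline call, and tracking replies by command string (rather than by position, as real clients do) is a simplification for illustration.

```python
class CommandError(Exception):
    """A per-command error reply (a 'bad request')."""

def run_pipeline(commands, send_batch):
    """Run `commands` as one pipelined batch.

    send_batch(list_of_commands) returns one result per command:
    a value, a CommandError (bad request), or None (no reply, e.g.
    the connection dropped mid-pipeline). Unanswered commands are
    resent once; bad requests are kept as errors and omitted from
    the retry, since they would simply fail again.
    """
    results = dict(zip(commands, send_batch(commands)))
    pending = [cmd for cmd, reply in results.items() if reply is None]
    if pending:
        results.update(zip(pending, send_batch(pending)))
    return results
```

A production client would track replies by position, retry with backoff, and distinguish transient network errors from command errors; this sketch only shows the omit-the-bad-request shape.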

## OE 5: How do you deploy ElastiCache components for a workload?


**Question-level introduction:** ElastiCache environments can be deployed manually through the AWS console, or programmatically through APIs, the CLI, toolkits, and so on. Operational Excellence best practices suggest automating deployments through code whenever possible. Additionally, ElastiCache clusters can either be isolated by workload or combined for cost optimization purposes.

**Question-level benefit:** Choosing the most appropriate deployment mechanism for your ElastiCache environments can improve Operational Excellence over time. It is recommended to perform operations as code whenever possible to minimize human error and increase repeatability, flexibility, and response time to events.

By understanding the workload isolation requirements, you can choose to have dedicated ElastiCache environments per workload, combine multiple workloads into single clusters, or use combinations thereof. Understanding the tradeoffs helps strike a balance between Operational Excellence and Cost Optimization.
+ **[Required]** Understand the deployment options available to ElastiCache, and automate these procedures whenever possible. Possible avenues of automation include CloudFormation, AWS CLI/SDK, and APIs.

  **[Resources]:**
  + [Amazon ElastiCache resource type reference](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/AWS_ElastiCache.html)
  + [elasticache](https://docs.aws.amazon.com/cli/latest/reference/elasticache/index.html)
  + [Amazon ElastiCache API Reference](https://docs.aws.amazon.com/AmazonElastiCache/latest/APIReference/Welcome.html)
+ **[Required]** For all workloads, determine the level of cluster isolation needed.
  + **[Best]:** High Isolation – a 1:1 workload to cluster mapping. Allows for finest grained control over access, sizing, scaling, and management of ElastiCache resources on a per workload basis.
  + **[Better]:** Medium Isolation – M:1 isolated by purpose but perhaps shared across multiple workloads (for example a cluster dedicated to caching workloads, and another dedicated for messaging).
  + **[Good]:** Low Isolation – M:1 all purpose, fully shared. Recommended for workloads where shared access is acceptable.

## OE 6: How do you plan for and mitigate failures?


**Question-level introduction:** Operational Excellence includes anticipating failures by performing regular "pre-mortem" exercises to identify potential sources of failure so they can be removed or mitigated. ElastiCache offers a Failover API that allows for simulated node failure events, for testing purposes.

**Question-level benefit:** By testing failure scenarios ahead of time you can learn how they impact your workload. This allows for safe testing of response procedures and their effectiveness, and gets your team familiar with their execution.

**[Required]** Regularly perform failover testing in dev/test accounts using the [TestFailover](https://docs.aws.amazon.com/AmazonElastiCache/latest/APIReference/API_TestFailover.html) API to simulate a node failure on your cluster.

## OE 7: How do you troubleshoot Valkey or Redis OSS engine events?


**Question-level introduction:** Operational Excellence requires the ability to investigate both service-level and engine-level information to analyze the health and status of your clusters. ElastiCache can emit Valkey or Redis OSS engine logs to both Amazon CloudWatch and Amazon Kinesis Data Firehose.

**Question-level benefit:** Enabling Valkey or Redis OSS engine logs on ElastiCache clusters provides insight into events that impact the health and performance of clusters. Engine logs provide data directly from the engine that is not available through the ElastiCache events mechanism. Through careful observation of both ElastiCache events (see OE 1 preceding) and engine logs, it is possible to determine an order of events when troubleshooting, from both the ElastiCache service perspective and the engine perspective.
+ **[Required]** Ensure that engine logging is enabled, which is available in ElastiCache for Redis OSS version 6.2 and newer. This can be done during cluster creation or by modifying the cluster after creation.
  + Determine whether Amazon CloudWatch Logs or Amazon Kinesis Data Firehose is the appropriate target for Redis OSS engine logs.
  + Select an appropriate target log within either CloudWatch or Kinesis Data Firehose to persist the logs. If you have multiple clusters, consider a different target log for each cluster as this will help isolate data when troubleshooting.

  **[Resources]:**
  + Log delivery: [Log delivery](Log_Delivery.md)
  + Logging destinations: [Amazon CloudWatch Logs](Logging-destinations.md#Destination_Specs_CloudWatch_Logs)
  + Amazon CloudWatch Logs introduction: [What is Amazon CloudWatch Logs?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/WhatIsCloudWatchLogs.html)
  + Amazon Kinesis Data Firehose introduction: [What Is Amazon Kinesis Data Firehose?](https://docs.aws.amazon.com/firehose/latest/dev/what-is-this-service.html)
+ **[Best]** If using Amazon CloudWatch Logs, consider leveraging Amazon CloudWatch Logs Insights to query the Valkey or Redis OSS engine logs for important information.

  As an example, create a query against the CloudWatch log group that contains the Valkey or Redis OSS engine logs, returning events with a LogLevel of ‘WARNING’, such as:

  ```
  fields @timestamp, LogLevel, Message
  | sort @timestamp desc
  | filter LogLevel = "WARNING"
  ```

  **[Resources]:** [Analyzing log data with CloudWatch Logs Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AnalyzingLogData.html)
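As a local complement to the Logs Insights query above, the following sketch applies the same `LogLevel = "WARNING"` filter to engine log lines delivered in JSON format. The field names match those used in the query above; the sample lines in the test are invented, and real entries carry additional fields.

```python
import json

def warning_entries(log_lines):
    """Return parsed engine-log entries whose LogLevel is WARNING.

    Assumes the JSON log format, where each line carries fields such
    as LogLevel and Message; malformed lines are skipped.
    """
    entries = []
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if entry.get("LogLevel") == "WARNING":
            entries.append(entry)
    return entries
```

This kind of filter is useful when logs have been exported or are being processed downstream of Kinesis Data Firehose rather than queried in place.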

# Amazon ElastiCache Well-Architected Lens Security Pillar

The security pillar focuses on protecting information and systems. Key topics include confidentiality and integrity of data, identifying and managing who can do what with privilege-based management, protecting systems, and establishing controls to detect security events.

**Topics**
+ [SEC 1: What steps are you taking in controlling authorized access to ElastiCache data?](#SecurityPillarSEC1)
+ [SEC 2: Do your applications require additional authorization to ElastiCache over and above networking-based controls?](#SecurityPillarSEC2)
+ [SEC 3: Is there a risk that commands can be executed inadvertently, causing data loss or failure?](#SecurityPillarSEC3)
+ [SEC 4: How do you ensure data encryption at rest with ElastiCache?](#SecurityPillarSEC4)
+ [SEC 5: How do you encrypt in-transit data with ElastiCache?](#SecurityPillarSEC5)
+ [SEC 6: How do you restrict access to control plane resources?](#SecurityPillarSEC6)
+ [SEC 7: How do you detect and respond to security events?](#SecurityPillarSEC7)

## SEC 1: What steps are you taking in controlling authorized access to ElastiCache data?


**Question-level introduction:** All ElastiCache clusters are designed to be accessed from Amazon Elastic Compute Cloud (Amazon EC2) instances in a VPC, serverless functions (AWS Lambda), or containers (Amazon Elastic Container Service). The most common scenario is to access an ElastiCache cluster from an Amazon EC2 instance within the same Amazon Virtual Private Cloud (Amazon VPC). Before you can connect to a cluster from an Amazon EC2 instance, you must authorize the instance to access the cluster. To access an ElastiCache cluster running in a VPC, it is necessary to grant network ingress to the cluster.

**Question-level benefit:** Network ingress into the cluster is controlled via VPC security groups. A security group acts as a virtual firewall for your Amazon EC2 instances to control incoming and outgoing traffic. Inbound rules control the incoming traffic to your instance, and outbound rules control the outgoing traffic from your instance. In the case of ElastiCache, launching a cluster requires associating a security group. This ensures that inbound and outbound traffic rules are in place for all nodes that make up the cluster. Additionally, ElastiCache is configured to deploy exclusively on private subnets, so that clusters are only accessible via the VPC's private networking.
+ **[Required]** The security group associated with your cluster controls network ingress and access to the cluster. By default, a security group has no inbound rules defined and, therefore, no ingress path to ElastiCache. To enable access, configure an inbound rule on the security group specifying the source IP address/range, TCP traffic, and the port for your ElastiCache cluster (for example, the default port 6379 for ElastiCache for Valkey and Redis OSS). While it is possible to allow a very broad set of ingress sources, such as all traffic (0.0.0.0/0), it is advised to be as granular as possible when defining inbound rules, for example authorizing inbound access only from Valkey or Redis OSS clients running on Amazon EC2 instances associated with a specific security group.

  **[Resources]:**
  + [Subnets and subnet groups](SubnetGroups.md)
  + [Accessing your cluster or replication group](accessing-elasticache.md)
  + [Control traffic to resources using security groups](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-security-groups.html)
  + [Amazon Elastic Compute Cloud security groups for Linux instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-security-groups.html#creating-your-own-security-groups)
+ **[Required]** AWS Identity and Access Management roles can be assigned to AWS Lambda functions, allowing them to access ElastiCache data. To enable this feature, create an IAM execution role with the `AWSLambdaVPCAccessExecutionRole` managed policy, then assign the role to the AWS Lambda function.

  **[Resources]:** [Tutorial: Configuring a Lambda function to access Amazon ElastiCache in an Amazon VPC](https://docs.aws.amazon.com/lambda/latest/dg/services-elasticache-tutorial.html)
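The granularity advice above can also be checked programmatically. This illustrative sketch flags ingress rules on the cache port whose CIDR ranges are broader than a chosen prefix length (0.0.0.0/0 being the extreme case). The rule dictionaries mimic the `IpPermissions` shape of the EC2 `DescribeSecurityGroups` response, and the /16 cutoff is an arbitrary example, not a recommendation.

```python
import ipaddress

def overly_broad_rules(ip_permissions, port=6379, max_prefix=16):
    """Flag CIDR ranges on inbound rules covering `port` that are
    broader than /max_prefix (0.0.0.0/0 is the extreme case).

    `ip_permissions` mimics the IpPermissions list returned by the
    EC2 DescribeSecurityGroups API; the /16 cutoff is an example.
    """
    flagged = []
    for perm in ip_permissions:
        if not (perm.get("FromPort", 0) <= port <= perm.get("ToPort", 65535)):
            continue
        for ip_range in perm.get("IpRanges", []):
            network = ipaddress.ip_network(ip_range["CidrIp"])
            if network.prefixlen < max_prefix:
                flagged.append(ip_range["CidrIp"])
    return flagged
```

A review like this could run periodically against all security groups attached to ElastiCache clusters, alerting when ingress widens unexpectedly.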

## SEC 2: Do your applications require additional authorization to ElastiCache over and above networking-based controls?


**Question-level introduction:** In scenarios where it is necessary to restrict or control access to clusters at an individual client level, it is recommended to authenticate via the AUTH command. ElastiCache authentication tokens, with optional user and user group management, enable ElastiCache to require a password before allowing clients to run commands and access keys, thereby improving data plane security.

**Question-level benefit:** To help keep your data secure, ElastiCache provides mechanisms to safeguard against unauthorized access to your data. This includes requiring clients to authenticate via Role-Based Access Control (RBAC) or an AUTH token (password) before performing authorized commands.
+ **[Best]** For ElastiCache version 6.x and higher for Redis OSS, and ElastiCache version 7.2 and higher for Valkey, define authentication and authorization controls by defining user groups, users, and access strings. Assign users to user groups, then assign user groups to clusters. To utilize RBAC, it must be selected upon cluster creation, and in-transit encryption must be enabled. Ensure you are using a Valkey or Redis OSS client that supports TLS to be able to leverage RBAC.

  **[Resources]:**
  + [Applying RBAC to a Replication Group for ElastiCache](Clusters.RBAC.md#rbac-using)
  + [Specifying Permissions Using an Access String](Clusters.RBAC.md#Access-string)
  + [ACL](https://valkey.io/topics/acl/)
  + [Supported ElastiCache versions](VersionManagement.md#supported-engine-versions)
+ **[Best]** For ElastiCache versions prior to 6.x for Redis OSS, in addition to setting a strong token/password and maintaining a strict password policy for AUTH, it is best practice to rotate the password/token. ElastiCache can manage up to two (2) authentication tokens at any given time. You can also modify the cluster to explicitly require the use of authentication tokens.

  **[Resources]:** [Modifying the AUTH token on an existing ElastiCache cluster](auth.md#auth-modifyng-token)
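To make the RBAC access-string syntax concrete, here is an illustrative helper that composes a least-privilege access string: enable the user, deny everything, then allow only specific key patterns and command categories. The syntax follows the ACL rules linked above; the pattern and category names in the example are assumptions for illustration, and real user creation happens through the ElastiCache API or console.

```python
def build_access_string(key_patterns, allowed_categories):
    """Compose a least-privilege access string: enable the user,
    deny everything, then allow only the given key patterns and
    command categories. Example names are illustrative."""
    parts = ["on"]
    parts += [f"~{pattern}" for pattern in key_patterns]
    parts.append("-@all")
    parts += [f"+@{category}" for category in allowed_categories]
    return " ".join(parts)

# A read-only user scoped to keys under "app::":
print(build_access_string(["app::*"], ["read"]))  # on ~app::* -@all +@read
```

Ordering matters in ACL rules (they apply left to right), which is why the broad `-@all` deny precedes the narrow allows.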

## SEC 3: Is there a risk that commands can be executed inadvertently, causing data loss or failure?


**Question-level introduction:** There are a number of Valkey or Redis OSS commands that can have adverse impacts on operations if executed by mistake or by malicious actors. These commands can have unintended consequences from a performance and data-safety perspective. For example, a developer may routinely call the FLUSHALL command in a development environment and, by mistake, inadvertently call this command on a production system, resulting in accidental data loss.

**Question-level benefit:** Beginning with ElastiCache version 5.0.3 for Redis OSS, you have the ability to rename certain commands that might be disruptive to your workload. Renaming the commands can help prevent them from being inadvertently executed on the cluster.
+ **[Required]** Rename potentially disruptive commands (such as FLUSHALL or FLUSHDB) using the `rename-commands` parameter in a custom parameter group, so they cannot be invoked by their default names.

  **[Resources]:**
  + [ElastiCache version 5.0.3 for Redis OSS (deprecated, use version 5.0.6)](engine-versions.md#redis-version-5-0.3)
  + [ElastiCache version 5.0.3 for Redis OSS parameter changes](ParameterGroups.Engine.md#ParameterGroups.Redis.5-0-3)
  + [Redis OSS security](https://redis.io/docs/management/security/)

## SEC 4: How do you ensure data encryption at rest with ElastiCache?


**Question-level introduction:** While ElastiCache is an in-memory data store, it is possible to encrypt any data that may be persisted (on storage) as part of standard operations of the cluster. This includes both scheduled and manual backups written to Amazon S3, as well as data saved to disk storage as a result of sync and swap operations. Instance types in the M6g and R6g families also feature always-on, in-memory encryption.

**Question-level benefit:** ElastiCache provides optional encryption at rest to increase data security.
+ **[Required]** At-rest encryption can be enabled on an ElastiCache cluster (replication group) only when it is created. An existing cluster cannot be modified to begin encrypting data at rest. By default, ElastiCache provides and manages the keys used in at-rest encryption.

  **[Resources]:**
  + [At-Rest Encryption Constraints](at-rest-encryption.md#at-rest-encryption-constraints)
  + [Enabling At-Rest Encryption](at-rest-encryption.md#at-rest-encryption-enable)
+ **[Best] **Leverage Amazon EC2 instance types that encrypt data while it is in memory (such as M6g or R6g). Where possible, consider managing your own keys for at-rest encryption. For more stringent data security environments, AWS Key Management Service (KMS) can be used to self-manage Customer Master Keys (CMK). Through ElastiCache integration with AWS Key Management Service, you are able to create, own, and manage the keys used for encryption of data at rest for your ElastiCache cluster.

  **[Resources]: **
  + [Using customer managed keys from AWS Key Management Service](at-rest-encryption.md#using-customer-managed-keys-for-elasticache-security)
  + [AWS Key Management Service](https://docs.aws.amazon.com/kms/latest/developerguide/overview.html)
  + [AWS KMS concepts](https://docs.aws.amazon.com/kms/latest/developerguide/concepts.html#master_keys)
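Because at-rest encryption can only be enabled at creation time, it belongs in the initial provisioning request. A sketch of the relevant parameters (the replication group ID and KMS key ARN are hypothetical; the keys match the ElastiCache CreateReplicationGroup API as exposed by boto3's `create_replication_group`):

```python
create_request = {
    "ReplicationGroupId": "secure-cache",                 # hypothetical ID
    "ReplicationGroupDescription": "Cache with encryption at rest",
    "Engine": "redis",
    "CacheNodeType": "cache.r6g.large",   # Graviton2: always-on in-memory encryption
    "AtRestEncryptionEnabled": True,      # cannot be enabled after creation
    # Optional: supply a customer managed key instead of the service default.
    "KmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",
}
# The request would be sent as:
# boto3.client("elasticache").create_replication_group(**create_request)
```

Omitting `KmsKeyId` leaves key management to the service, which satisfies the [Required] practice; supplying a customer managed key implements the [Best] practice.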

## SEC 5: How do you encrypt in-transit data with ElastiCache?


**Question-level introduction: **It is a common requirement to mitigate the risk of data being compromised while in transit. This includes data moving between components of a distributed system, as well as between application clients and cluster nodes. ElastiCache supports this requirement by allowing encryption of data in transit between clients and the cluster, and between cluster nodes themselves. Instance types in the M6g and R6g families also feature always-on, in-memory encryption. 

**Question-level benefit: **Amazon ElastiCache in-transit encryption is an optional feature that allows you to increase the security of your data at its most vulnerable points, when it is in-transit from one location to another.
+ **[Required] **In-transit encryption can only be enabled on a cluster (replication group) upon creation. Please note that, due to the additional processing required for encrypting/decrypting data, implementing in-transit encryption will have some performance impact. To understand the impact, it is recommended to benchmark your workload before and after enabling encryption-in-transit.

  **[Resources]: **
  + [In-transit encryption overview](in-transit-encryption.md#in-transit-encryption-overview)

## SEC 6: How do you restrict access to control plane resources?


**Question-level introduction: **IAM policies and ARNs enable fine-grained access control for ElastiCache for Valkey and Redis OSS, allowing tighter control over the creation, modification, and deletion of clusters.

**Question-level benefit: **Management of Amazon ElastiCache resources, such as replication groups and nodes, can be constrained to AWS accounts that have specific permissions based on IAM policies, improving the security and reliability of resources.
+ **[Required] **Manage access to Amazon ElastiCache resources by assigning specific AWS Identity and Access Management (IAM) policies to AWS users, allowing finer control over which accounts can perform what actions on clusters.

  **[Resources]: **
  + [Overview of managing access permissions to your ElastiCache resources](IAM.Overview.md)
  + [Using identity-based policies (IAM policies) for Amazon ElastiCache](IAM.IdentityBasedPolicies.md)
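An identity-based policy along these lines might allow describe operations while explicitly denying destructive control-plane actions. A sketch expressed as a Python dict for readability (the action names are real ElastiCache IAM actions; real policies should scope `Resource` down to specific ARNs rather than `*`):

```python
# Illustrative read-mostly operator policy; not a recommended production
# policy as-is, since Resource is left wide open for brevity.
read_only_operator_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # Allow inspecting clusters and replication groups.
            "Effect": "Allow",
            "Action": [
                "elasticache:DescribeCacheClusters",
                "elasticache:DescribeReplicationGroups",
            ],
            "Resource": "*",
        },
        {   # Explicitly deny destructive control-plane actions.
            "Effect": "Deny",
            "Action": [
                "elasticache:DeleteCacheCluster",
                "elasticache:DeleteReplicationGroup",
            ],
            "Resource": "*",
        },
    ],
}
```

An explicit `Deny` wins over any `Allow` granted elsewhere, which makes it a useful guardrail against accidental cluster deletion.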

## SEC 7: How do you detect and respond to security events?


**Question-level introduction: **ElastiCache, when deployed with RBAC enabled, exports CloudWatch metrics to notify users of security events. These metrics help identify failed attempts to authenticate, access keys, or run commands that connecting RBAC users are not authorized for.

Additionally, AWS products and services help secure your overall workload by automating deployments and logging all actions and modifications for later review and audit.

**Question-level benefit: **By monitoring events, you enable your organization to respond according to your requirements, policies, and procedures. Automating the monitoring and responses to these security events hardens your overall security posture.
+ **[Required] **Familiarize yourself with the published CloudWatch metrics that pertain to RBAC authentication and authorization failures. 
  + AuthenticationFailures = Failed attempts to authenticate to Valkey or Redis OSS
  + KeyAuthorizationFailures = Failed attempts by users to access keys without permission
  + CommandAuthorizationFailures = Failed attempts by users to run commands without permission

  **[Resources]: **
  + [Metrics for Valkey or Redis OSS](CacheMetrics.Redis.md)
+ **[Best] **It is recommended to set up alerts and notifications on these metrics and respond as necessary.

  **[Resources]: **
  + [Using Amazon CloudWatch alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html)
+ **[Best] **Use the Valkey or Redis OSS ACL LOG command to gather further details

  **[Resources]: **
  + [ACL LOG](https://valkey.io/commands/acl-log/)
+ **[Best] **Familiarize yourself with AWS product and service capabilities as they pertain to monitoring, logging, and analyzing ElastiCache deployments and events.

  **[Resources]: **
  + [Logging Amazon ElastiCache API calls with AWS CloudTrail](logging-using-cloudtrail.md)
  + [elasticache-redis-cluster-automatic-backup-check](https://docs.aws.amazon.com/config/latest/developerguide/elasticache-redis-cluster-automatic-backup-check.html)
  + [Monitoring use with CloudWatch Metrics](CacheMetrics.md)
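Alerting on the three RBAC failure metrics above is a CloudWatch alarm per metric. A sketch of the request shape for the `AuthenticationFailures` case (alarm name, cluster ID, and SNS topic ARN are hypothetical; the keys match CloudWatch's PutMetricAlarm API as exposed by boto3's `put_metric_alarm`):

```python
alarm_request = {
    "AlarmName": "elasticache-authentication-failures",   # hypothetical name
    "Namespace": "AWS/ElastiCache",
    "MetricName": "AuthenticationFailures",
    "Dimensions": [{"Name": "CacheClusterId", "Value": "my-cluster-0001-001"}],
    "Statistic": "Sum",
    "Period": 60,
    "EvaluationPeriods": 1,
    # Any failed authentication attempt in a one-minute window fires the alarm.
    "Threshold": 0,
    "ComparisonOperator": "GreaterThanThreshold",
    "AlarmActions": ["arn:aws:sns:us-east-1:111122223333:security-alerts"],
}
# The request would be sent as:
# boto3.client("cloudwatch").put_metric_alarm(**alarm_request)
```

The same shape applies to `KeyAuthorizationFailures` and `CommandAuthorizationFailures`, with the threshold tuned to your tolerance for noise.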

# Amazon ElastiCache Well-Architected Lens Reliability Pillar
Reliability Pillar

The reliability pillar focuses on workloads performing their intended functions and how to recover quickly from failure to meet demands. Key topics include distributed system design, recovery planning, and adapting to changing requirements.

**Topics**
+ [REL 1: How are you supporting high availability (HA) architecture deployments?](#ReliabilityPillarREL1)
+ [REL 2: How are you meeting your Recovery Point Objectives (RPOs) with ElastiCache?](#ReliabilityPillarREL2)
+ [REL 3: How do you support disaster recovery (DR) requirements?](#ReliabilityPillarREL3)
+ [REL 4: How do you effectively plan for failovers?](#ReliabilityPillarREL4)
+ [REL 5: Are your ElastiCache components designed to scale?](#ReliabilityPillarREL5)

## REL 1: How are you supporting high availability (HA) architecture deployments?


**Question-level introduction: ** Understanding the high availability architecture of Amazon ElastiCache will enable you to operate in a resilient state during availability events. 

**Question-level benefit: **Architecting your ElastiCache clusters to be resilient to failures ensures higher availability for your ElastiCache deployments. 
+ **[Required] **Determine the level of reliability you require for your ElastiCache cluster. Different workloads have different resiliency standards, from entirely ephemeral to mission critical workloads. Define needs for each type of environment you operate such as dev, test, and production.

  Caching engine: ElastiCache for Memcached vs ElastiCache for Valkey and Redis OSS

  1. ElastiCache for Memcached does not provide any replication mechanism and is used primarily for ephemeral workloads.

  1. ElastiCache for Valkey and Redis OSS offers the HA features discussed below.
+ **[Best] **For workloads that require HA, use ElastiCache in cluster mode with a minimum of two replicas per shard, even for small throughput requirement workloads that require only one shard. 

  1. For cluster mode enabled, multi-AZ is enabled automatically.

     Multi-AZ minimizes downtime by performing an automatic failover from the primary node to a replica during any planned or unplanned maintenance, as well as mitigating the impact of an AZ failure.

  1. For sharded workloads, a minimum of three shards provides faster recovery during failover events as the Valkey or Redis OSS Cluster Protocol requires a majority of primary nodes be available to achieve quorum.

  1. Set up two or more replicas across Availability Zones.

     Having two replicas provides improved read scalability and also read availability in scenarios where one replica is undergoing maintenance.

  1. Use Graviton2-based node types (default nodes in most regions).

     ElastiCache has added optimized performance on these nodes. As a result, you get better replication and synchronization performance, resulting in overall improved availability.

  1. Monitor and right-size to deal with anticipated traffic peaks: under heavy load, the engine may become unresponsive, which affects availability. `BytesUsedForCache` and `DatabaseMemoryUsagePercentage` are good indicators of your memory usage, whereas `ReplicationLag` is an indicator of your replication health based on your write rate. You can use these metrics to trigger cluster scaling.

  1. Ensure client-side resiliency by testing with the [Failover API](https://docs.amazonaws.cn/en_us/AmazonElastiCache/latest/APIReference/API_TestFailover.html) prior to a production failover event.

  **[Resources]: **
  + [Configure ElastiCache for Redis OSS for higher availability](https://aws.amazon.com/blogs/database/configuring-amazon-elasticache-for-redis-for-higher-availability/)
  + [High availability using replication groups](Replication.md)
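The quorum guidance above can be made concrete: the Valkey or Redis OSS Cluster Protocol requires a majority of primary nodes to be reachable, so a three-shard cluster tolerates the loss of one primary, while one- or two-shard clusters tolerate none. A minimal sketch (function names are hypothetical):

```python
def primary_majority(num_shards: int) -> int:
    """Primaries that must remain reachable for the cluster to keep quorum."""
    return num_shards // 2 + 1

def tolerable_primary_failures(num_shards: int) -> int:
    """Simultaneous primary failures the cluster can survive."""
    return num_shards - primary_majority(num_shards)
```

For example, `tolerable_primary_failures(3)` is 1, whereas `tolerable_primary_failures(2)` is 0, which is why a minimum of three shards recovers faster from failover events.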

## REL 2: How are you meeting your Recovery Point Objectives (RPOs) with ElastiCache?


**Question-level introduction: **Understand workload RPO to inform decisions on ElastiCache backup and recovery strategies.

**Question-level benefit: **Having an RPO strategy in place can improve business continuity in disaster recovery scenarios. Designing your backup and restore policies can help you meet your Recovery Point Objective (RPO) for your ElastiCache data. ElastiCache offers snapshot capabilities, stored in Amazon S3, along with a configurable retention policy. These snapshots are taken during a defined backup window and handled by the service automatically. If your workload requires additional backup granularity, you have the option to create up to 20 manual backups per day. Manually created backups do not have a service retention policy and can be kept indefinitely.
+ **[Required] **Understand and document the RPO of your ElastiCache deployments.
  + Be aware that Memcached does not offer any backup processes.
  + Review the capabilities of ElastiCache Backup and Restore features.
+ **[Best] **Have a well-communicated process in place for backing up your cluster.
  + Initiate manual backups on an as-needed basis.
  + Review retention policies for automatic backups.
  + Note that manual backups will be retained indefinitely.
  + Schedule your automatic backups during periods of low usage.
  + Perform backup operations against read-replicas to ensure you minimize the impact on cluster performance.
+ **[Good] **Leverage the scheduled backup feature of ElastiCache to regularly back up your data during a defined window. 
  + Periodically test restores from your backups.
+ **[Resources]: **
  + [Redis OSS](https://aws.amazon.com/elasticache/faqs/#Redis)
  + [Backup and restore for ElastiCache](backups.md)
  + [Making manual backups](backups-manual.md)
  + [Scheduling automatic backups](backups-automatic.md)
  + [Backup and Restore ElastiCache Clusters](https://aws.amazon.com/blogs/aws/backup-and-restore-elasticache-redis-nodes/)
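A simple way to reason about whether a backup schedule satisfies an RPO: the worst-case data loss is the time elapsed since the last successful backup. A minimal sketch (function names are hypothetical):

```python
from datetime import datetime, timedelta

def worst_case_data_loss(last_backup: datetime, now: datetime) -> timedelta:
    """Data written since the last backup would be lost in a disaster."""
    return now - last_backup

def meets_rpo(last_backup: datetime, now: datetime, rpo: timedelta) -> bool:
    """True if the current exposure window is within the stated RPO."""
    return worst_case_data_loss(last_backup, now) <= rpo

# Six hours after a nightly automatic backup:
last = datetime(2024, 1, 1, 0, 0)
now = datetime(2024, 1, 1, 6, 0)
```

With these inputs, a 24-hour RPO is satisfied but a 1-hour RPO is not, which is the kind of gap that the manual backup capability (up to 20 per day) can close.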

## REL 3: How do you support disaster recovery (DR) requirements?


**Question-level introduction: **Disaster recovery is an important aspect of any workload planning. ElastiCache offers several options to implement disaster recovery based on workload resilience requirements. With Amazon ElastiCache Global Datastore, you can write to your cluster in one Region and have the data available to be read from two other cross-Region replica clusters, thereby enabling low-latency reads and disaster recovery across regions.

**Question-level benefit: **Understanding and planning for a variety of disaster scenarios can ensure business continuity. DR strategies must be balanced against cost, performance impact, and data loss potential.
+ **[Required] **Develop and document DR strategies for all your ElastiCache components based upon workload requirements. ElastiCache is unique in that some use cases are entirely ephemeral and don’t require any DR strategy, whereas others are on the opposite end of the spectrum and require an extremely robust DR strategy. All options must be weighed against Cost Optimization – greater resiliency requires larger amounts of infrastructure.

  Understand the DR options available on a regional and multi-region level.
  + Multi-AZ Deployments are recommended to guard against AZ failure. Be sure to deploy with Cluster-Mode enabled in Multi-AZ architectures, with a minimum of 3 AZs available.
  + Global Datastore is recommended to guard against regional failures.
+ **[Best] **Enable Global Datastore for workloads that require region level resiliency.
  + Have a plan to fail over to the secondary region in case of primary region degradation.
  + Test the multi-region failover process prior to a failover in production.
  + Monitor `ReplicationLag` metric to understand potential impact of data loss during failover events.
+ **[Resources]: **
  + [Mitigating Failures](disaster-recovery-resiliency.md#FaultTolerance)
  + [Replication across AWS Regions using global datastores](Redis-Global-Datastore.md)
  + [Restoring from a backup with optional cluster resizing](backups-restoring.md)
  + [Minimizing downtime in ElastiCache for Valkey and Redis OSS with Multi-AZ](AutoFailover.md)

## REL 4: How do you effectively plan for failovers?


**Question-level introduction: **Enabling multi-AZ with automatic failovers is an ElastiCache best practice. In certain cases, ElastiCache for Valkey and Redis OSS replaces primary nodes as part of service operations. Examples include planned maintenance events and the unlikely case of a node failure or availability zone issue. Successful failovers rely on both ElastiCache and your client library configuration.

**Question-level benefit: **Following best practices for ElastiCache failovers in conjunction with your specific ElastiCache client library helps you minimize potential downtime during failover events. 
+ **[Required] **For cluster mode disabled, use timeouts so your client detects when it needs to disconnect from the old primary node and reconnect to the new primary node using the updated primary endpoint IP address. For cluster mode enabled, the client library is responsible for detecting changes in the underlying cluster topology. This is most often accomplished through configuration settings in the ElastiCache client library, which also let you control the frequency and method of topology refresh. Each client library offers its own settings; more details are available in its documentation.

  **[Resources]: **
  + [Minimizing downtime in ElastiCache for Valkey and Redis OSS with Multi-AZ](AutoFailover.md)
  + Review the best practices of your ElastiCache client library. 
+ **[Required] **Successful failovers depend on a healthy replication environment between the primary and the replica nodes. Review and understand the asynchronous nature of Valkey and Redis OSS replication, as well as the available CloudWatch metrics to report on the replication lag between primary and replica nodes. For use cases that require greater data safety, leverage the WAIT command to force replicas to acknowledge writes before responding to connected clients. 

  **[Resources]: **
  + [Metrics for Valkey or Redis OSS](CacheMetrics.Redis.md)
  +  [Monitoring best practices with ElastiCache using Amazon CloudWatch](https://aws.amazon.com/blogs/database/monitoring-best-practices-with-amazon-elasticache-for-redis-using-amazon-cloudwatch/)
+ **[Best] **Regularly validate the responsiveness of your application during failover using the ElastiCache Test Failover API. 

  **[Resources]: **
  + [Testing Automatic Failover to a Read Replica on ElastiCache](https://aws.amazon.com/blogs/database/testing-automatic-failover-to-a-read-replica-on-amazon-elasticache-for-redis/)
  + [Testing automatic failover](AutoFailover.md#auto-failover-test)

## REL 5: Are your ElastiCache components designed to scale?


**Question-level introduction: **By understanding the scaling capabilities and available deployment topologies, your ElastiCache components can adjust over time to meet changing workload requirements. ElastiCache offers 4-way scaling: in/out (horizontal) as well as up/down (vertical).

**Question-level benefit: **Following best practices for ElastiCache deployments provides the greatest amount of scaling flexibility, as well as meeting the Well Architected principle of scaling horizontally to minimize the impact of failures.
+ **[Required] **Understand the difference between Cluster-mode enabled and Cluster-mode disabled topologies. In almost all cases it is recommended to deploy with Cluster-mode enabled, as it allows for greater scalability over time. Cluster-mode disabled deployments can scale horizontally only by adding read replicas.
+ **[Required] **Understand when and how to scale.
  + For more READIOPS: add replicas
  + For more WRITEOPS: add shards (scale out)
  + For more network I/O: use network-optimized instances (scale up)
+ **[Best] **Deploy your ElastiCache components with Cluster-mode enabled, with a bias toward more, smaller nodes rather than fewer, larger nodes. This effectively limits the blast radius of a node failure.
+ **[Best] **Include replicas in your clusters for enhanced responsiveness during scaling events.
+ **[Good] **For cluster-mode disabled, leverage read replicas to increase overall read capacity. ElastiCache has support for up to 5 read replicas in cluster-mode disabled, as well as vertical scaling.
+ **[Resources]: **
  + [Scaling ElastiCache clusters](Scaling.md)
  + [Online scaling up](redis-cluster-vertical-scaling.md#redis-cluster-vertical-scaling-scaling-up)
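The when-and-how guidance above amounts to a small decision table. A minimal sketch (the bottleneck labels are hypothetical; in practice they would be derived from CloudWatch metrics such as CPU, `ReplicationLag`, or network throughput):

```python
def scaling_action(bottleneck: str) -> str:
    """Map an observed bottleneck to the scaling direction described above."""
    actions = {
        "read_ops": "add replicas (scale out reads)",
        "write_ops": "add shards (scale out)",
        "network_io": "move to network-optimized node types (scale up)",
    }
    return actions[bottleneck]
```

Encoding the decision this way, for example in a runbook or an automation step, keeps scaling responses consistent across operators.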

# Amazon ElastiCache Well-Architected Lens Performance Efficiency Pillar
Performance Efficiency Pillar

The performance efficiency pillar focuses on using IT and computing resources efficiently. Key topics include selecting the right resource types and sizes based on workload requirements, monitoring performance, and making informed decisions to maintain efficiency as business needs evolve.

**Topics**
+ [PE 1: How do you monitor the performance of your Amazon ElastiCache cluster?](#PerformanceEfficiencyPillarPE1)
+ [PE 2: How are you distributing work across your ElastiCache Cluster nodes?](#PerformanceEfficiencyPillarPE2)
+ [PE 3: For caching workloads, how do you track and report the effectiveness and performance of your cache?](#PerformanceEfficiencyPillarPE3)
+ [PE 4: How does your workload optimize the use of networking resources and connections?](#PerformanceEfficiencyPillarPE4)
+ [PE 5: How do you manage key deletion and/or eviction?](#PerformanceEfficiencyPillarPE5)
+ [PE 6: How do you model and interact with data in ElastiCache?](#PerformanceEfficiencyPillarPE6)
+ [PE 7: How do you log slow running commands in your Amazon ElastiCache cluster?](#PerformanceEfficiencyPillarPE7)
+ [PE 8: How does Auto Scaling help in increasing the performance of the ElastiCache cluster?](#PerformanceEfficiencyPillarPE8)

## PE 1: How do you monitor the performance of your Amazon ElastiCache cluster?


**Question-level introduction: **By understanding the existing monitoring metrics, you can identify current utilization. Proper monitoring can help identify potential bottlenecks impacting the performance of your cluster. 

**Question-level benefit: **Understanding the metrics associated with your cluster can help guide optimization techniques that can lead to reduced latency and increased throughput. 
+ **[Required] **Establish a performance baseline by testing with a subset of your workload.
  + You should monitor performance of the actual workload using mechanisms such as load testing. 
  + Monitor the CloudWatch metrics while running these tests to gain an understanding of metrics available, and to establish a performance baseline. 
+ **[Best] **For ElastiCache for Valkey and Redis OSS workloads, rename computationally expensive commands, such as `KEYS`, to limit the ability of users to run blocking commands on production clusters. 
  + ElastiCache workloads running engine 6.x for Redis OSS can leverage role-based access control (RBAC) to restrict certain commands. Access to commands can be controlled by creating users and user groups with the AWS Console or CLI, and associating the user groups with a cluster. In Redis OSS 6, when RBAC is enabled, you can include `-@dangerous` in a user's access string to disallow expensive commands such as `KEYS`, `MONITOR`, and `SORT` for that user.
  + For engine version 5.x, rename commands using the `rename-commands` parameter on the cluster parameter group.
+ **[Better] **Analyze slow queries and look for optimization techniques. 
  + For ElastiCache for Valkey and Redis OSS workloads, learn more about your queries by analyzing the Slow Log. For example, you can run `valkey-cli slowlog get 10` to show the last 10 commands that exceeded the latency threshold (10 milliseconds by default).
  + Certain queries can be performed more efficiently using complex ElastiCache for Valkey and Redis OSS data structures. As an example, for numerical style range lookups, an application can implement simple numerical indexes with Sorted Sets. Managing these indexes can reduce scans performed on the data set, and return data with greater performance efficiency. 
  + For ElastiCache for Valkey and Redis OSS workloads, `redis-benchmark` provides a simple interface for testing the performance of different commands using user defined inputs like number of clients, and size of data.
  + Since Memcached only supports simple key level commands, consider building additional keys as indexes to avoid iterating through the key space to serve client queries.
+ **[Resources]: **
  + [Monitoring use with CloudWatch Metrics](CacheMetrics.md)
  + [Using Amazon CloudWatch alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html)
  + [Valkey and Redis OSS specific parameters](ParameterGroups.Engine.md#ParameterGroups.Redis)
  + [SLOWLOG](https://valkey.io/commands/slowlog/)
  + [Benchmark](https://valkey.io/topics/benchmark/)

## PE 2: How are you distributing work across your ElastiCache Cluster nodes?


**Question-level introduction: **The way your application connects to Amazon ElastiCache nodes can impact the performance and scalability of the cluster. 

**Question-level benefit: **Making proper use of the available nodes in the cluster will ensure that work is distributed across the available resources. The following techniques help avoid idle resources as well.
+ **[Required] **Have clients connect to the proper ElastiCache endpoint.
  + ElastiCache for Valkey and Redis OSS implements different endpoints based on the cluster mode in use. For cluster mode enabled, ElastiCache provides a configuration endpoint. For cluster mode disabled, ElastiCache provides a primary endpoint, typically used for writes, and a reader endpoint for balancing reads across replicas. Implementing these endpoints correctly results in better performance and easier scaling operations. Avoid connecting to individual node endpoints unless there is a specific requirement to do so. 
  + For multi-node Memcached clusters, ElastiCache provides a configuration endpoint which enables Auto Discovery. It is recommended to use a hashing algorithm to distribute work evenly across the cache nodes. Many Memcached client libraries implement consistent hashing. Check the documentation for the library you are using to see if it supports consistent hashing and how to implement it. You can find more information on implementing these features [here](BestPractices.LoadBalancing.md).
+ **[Better] **Take advantage of ElastiCache for Valkey and Redis OSS cluster mode enabled clusters to improve scalability.
  + ElastiCache for Valkey and Redis OSS (cluster mode enabled) clusters support [online scaling operations](scaling-redis-cluster-mode-enabled.md#redis-cluster-resharding-online) (out/in and up/down) to help distribute data dynamically across shards. Using the Configuration Endpoint will ensure your cluster aware clients can adjust to changes in the cluster topology.
  + You may also rebalance the cluster by moving hashslots between available shards in your ElastiCache for Valkey and Redis OSS (cluster mode enabled) cluster. This helps distribute work more efficiently across available shards. 
+ **[Better] **Implement a strategy for identifying and remediating hot keys in your workload.
  + Consider the impact of multi-dimensional Valkey or Redis OSS data structures such as lists, streams, sets, etc. Each of these data structures is stored in a single key, which resides on a single node. A very large multi-dimensional key has the potential to use more network capacity and memory than other data types and can cause disproportionate use of that node. If possible, design your workload to spread data access across many discrete keys.
  + Hot keys in the workload can impact performance of the node in use. For ElastiCache for Valkey and Redis OSS workloads, you can detect hot keys using `valkey-cli --hotkeys` if an LFU max-memory policy is in place.
  + Consider replicating hot keys across multiple nodes to distribute access to them more evenly. This approach requires the client to write to multiple primary nodes (the Valkey or Redis OSS node itself will not provide this functionality) and to maintain a list of key names to read from, in addition to the original key name.
  + ElastiCache engine 7.2 for Valkey and above, and ElastiCache version 6 for Redis OSS and above, all support server-assisted [client-side caching](https://valkey.io/topics/client-side-caching/). This enables applications to wait for changes to a key before making network calls back to ElastiCache. 
+ **[Resources]: **
  + [Configure ElastiCache for Valkey and Redis OSS for higher availability](https://aws.amazon.com/blogs/database/configuring-amazon-elasticache-for-redis-for-higher-availability/)
  + [Finding connection endpoints in ElastiCache](Endpoints.md)
  + [Load balancing best practices](https://docs.aws.amazon.com/AmazonElastiCache/latest/dg/BestPractices.LoadBalancing.html)
  + [Online resharding for Valkey or Redis OSS (cluster mode enabled)](scaling-redis-cluster-mode-enabled.md#redis-cluster-resharding-online)
  + [Client-side caching in Valkey and Redis OSS](https://valkey.io/topics/client-side-caching/)
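The consistent hashing recommended for Memcached above can be sketched as a hash ring: each node is hashed onto a circle (with virtual nodes for smoother balance), and a key maps to the first node clockwise from its own hash, so adding or removing a node only remaps a fraction of keys. A minimal stdlib-only illustration, not a specific client library's implementation:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Toy consistent-hash ring; real Memcached clients ship their own."""

    def __init__(self, nodes, vnodes=64):
        self._ring = []  # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):      # virtual nodes smooth the distribution
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        """First node clockwise from the key's position on the ring."""
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._keys)
        return self._ring[idx][1]
```

Because only the keys between the new node's positions and their predecessors move, cache misses after a topology change stay bounded instead of invalidating the whole key space, which a naive `hash(key) % node_count` scheme would do.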

## PE 3: For caching workloads, how do you track and report the effectiveness and performance of your cache?


**Question-level introduction: **Caching is a commonly encountered workload on ElastiCache and it is important that you understand how to manage the effectiveness and performance of your cache.

**Question-level benefit: **Your application may show signs of sluggish performance. Your ability to use cache specific metrics to inform your decision on how to increase app performance is critical for your cache workload.
+ **[Required] **Measure and track your cache hit ratio over time. The efficiency of your cache is determined by its cache hit ratio, defined as the total number of key hits divided by the total of hits and misses. The closer the ratio is to 1, the more effective your cache is. A low cache hit ratio is caused by the volume of cache misses. Cache misses occur when a requested key is not found in the cache, either because it has been evicted or deleted, has expired, or has never existed. Understand why keys are not in the cache and develop appropriate strategies to have them in the cache. 

  **[Resources]: **
  + [Metrics for Valkey and Redis OSS](CacheMetrics.Redis.md)
+ **[Required] **Measure and collect your application cache performance in conjunction with latency and CPU utilization values to understand whether you need to make adjustments to your time-to-live or other application components. ElastiCache provides a set of CloudWatch metrics for aggregated latencies for each data structure. These latency metrics are calculated using the commandstats statistic from the INFO command and do not include network and I/O time; they measure only the time ElastiCache spends processing the operations.

  **[Resources]: **
  + [Metrics for Valkey and Redis OSS](CacheMetrics.Redis.md)
  + [Monitoring best practices with ElastiCache using Amazon CloudWatch](https://aws.amazon.com/blogs/database/monitoring-best-practices-with-amazon-elasticache-for-redis-using-amazon-cloudwatch/)
+ **[Best] **Choose the right caching strategy for your needs. A low cache hit ratio is caused by the volume of cache misses. If your workload is designed to have a low volume of cache misses (such as real-time communication), it is best to conduct reviews of your caching strategies and apply the most appropriate resolutions for your workload, such as query instrumentation to measure memory and performance. The actual strategies you implement for populating and maintaining your cache depend on what data your clients need to cache and the access patterns to that data. For example, it is unlikely that you will use the same strategy for both personalized recommendations on a streaming application and for trending news stories. 

  **[Resources]: **
  + [Caching strategies for Memcached](Strategies.md)
  + [Caching Best Practices](https://aws.amazon.com/caching/best-practices/)
  + [Performance at Scale with Amazon ElastiCache Whitepaper](https://d0.awsstatic.com/whitepapers/performance-at-scale-with-amazon-elasticache.pdf)
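The cache hit ratio described above is a one-line calculation over the hit and miss counters (in CloudWatch terms, metrics like `CacheHits` and `CacheMisses`). A minimal sketch, with a hypothetical review threshold that should be tuned per workload:

```python
def cache_hit_ratio(hits: int, misses: int) -> float:
    """Hits divided by total lookups; 0.0 when there is no traffic yet."""
    total = hits + misses
    return hits / total if total else 0.0

def needs_review(hits: int, misses: int, threshold: float = 0.8) -> bool:
    """Flag caches below a hypothetical target ratio for strategy review."""
    return cache_hit_ratio(hits, misses) < threshold
```

Tracking this ratio over time, rather than as a point-in-time value, is what reveals whether TTL changes or caching-strategy changes are actually helping.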

## PE 4: How does your workload optimize the use of networking resources and connections?


**Question-level introduction: **ElastiCache for Valkey, Memcached, and Redis OSS are supported by many application clients, and implementations may vary. You need to understand the networking and connection management in place to analyze potential performance impact.

**Question-level benefit: **Efficient use of networking resources can improve the performance efficiency of your cluster. The following recommendations can reduce networking demands, and improve cluster latency and throughput. 
+ **[Required] **Proactively manage connections to your ElastiCache cluster.
  + Connection pooling in the application reduces the amount of overhead on the cluster created by opening and closing connections. Monitor connection behavior in Amazon CloudWatch using `CurrConnections` and `NewConnections`.
  + Avoid connection leaking by properly closing client connections where appropriate. Connection management strategies include properly closing connections that are not in use, and setting connection time-outs. 
  + For Memcached workloads, there is a configurable amount of memory reserved for handling connections called `memcached_connections_overhead`. 
+ **[Better] **Compress large objects to reduce memory, and improve network throughput.
  + Data compression can reduce the amount of network throughput required (Gbps), but increases the amount of work on the application to compress and decompress data. 
  + Compression also reduces the amount of memory consumed by keys.
  + Based on your application needs, consider the trade-offs between compression ratio and compression speed.
+ **[Resources]: **
  + [ElastiCache - Global Datastore](https://aws.amazon.com/elasticache/redis/global-datastore/)
  + [Memcached specific parameters](ParameterGroups.Engine.md#ParameterGroups.Memcached)
  + [ElastiCache version 5.0.3 for Redis OSS enhances I/O handling to boost performance](https://aws.amazon.com/about-aws/whats-new/2019/03/amazon-elasticache-for-redis-503-enhances-io-handling-to-boost-performance/)
  + [Metrics for Valkey and Redis OSS](CacheMetrics.Redis.md)
  + [Configure ElastiCache for higher availability](https://aws.amazon.com/blogs/database/configuring-amazon-elasticache-for-redis-for-higher-availability/)
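The compression trade-off described above can be explored empirically before you change your application. The sketch below is illustrative only: it uses Python's standard `zlib` module (not a specific ElastiCache client) to compare a fast compression level against the best-ratio level on a repetitive payload, the kind of measurement you would run on your own objects before storing them compressed.

```python
import json
import zlib

# A repetitive JSON payload, standing in for a large cached object.
payload = json.dumps(
    [{"sku": f"ITEM-{i % 50}", "qty": i % 7, "status": "in-stock"} for i in range(2000)]
).encode("utf-8")

for level in (1, 9):  # 1 = fastest compression, 9 = best ratio
    compressed = zlib.compress(payload, level)
    ratio = len(compressed) / len(payload)
    print(f"level={level}: {len(payload)} -> {len(compressed)} bytes "
          f"({ratio:.1%} of original)")

# The application must decompress on read, trading CPU for network and memory.
restored = json.loads(zlib.decompress(compressed).decode("utf-8"))
assert len(restored) == 2000
```

On highly repetitive payloads the reduction is large; on already-compressed data (images, encrypted blobs) it can be negligible, which is why the trade-off should be measured per workload.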

## PE 5: How do you manage key deletion and/or eviction?


**Question-level introduction: **Workloads have different requirements and expected behaviors when a cluster node approaches memory consumption limits. ElastiCache has different policies for handling these situations. 

**Question-level benefit: **Proper management of available memory and an understanding of eviction policies help ensure awareness of cluster behavior when instance memory limits are exceeded. 
+ **[Required] **Instrument data access patterns to evaluate which policy to apply. Identify an appropriate max-memory policy to control if and how evictions are performed on the cluster.
  + Eviction occurs when the max-memory on the cluster is consumed and a policy is in place to allow eviction. The behavior of the cluster in this situation depends on the eviction policy specified. This policy can be managed using the `maxmemory-policy` on the cluster parameter group. 
  + The default policy `volatile-lru` frees up memory by evicting keys with a set expiration time (TTL value). Least frequently used (LFU) and least recently used (LRU) policies remove keys based on usage. 
  + For Memcached workloads, there is a default LRU policy in place controlling evictions on each node. The number of evictions on your Amazon ElastiCache cluster can be monitored using the `Evictions` metric in Amazon CloudWatch.
+ **[Better] **Standardize delete behavior to control the performance impact on your cluster and avoid unexpected performance bottlenecks.
  + For ElastiCache for Valkey and Redis OSS workloads, when explicitly removing keys from the cluster, `UNLINK` is like `DEL`: it removes the specified keys. However, `UNLINK` performs the actual memory reclamation in a separate thread, so it is not blocking, while `DEL` is. The actual removal happens later, asynchronously. 
  + For ElastiCache version 6.x for Redis OSS workloads, the behavior of the `DEL` command can be modified in the parameter group using the `lazyfree-lazy-user-del` parameter.
+ **[Resources]: **
  + [Configuring engine parameters using ElastiCache parameter groups](ParameterGroups.md)
  + [UNLINK](https://valkey.io/commands/unlink/)
  + [Cloud Financial Management with AWS](https://aws.amazon.com/aws-cost-management/)
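The non-blocking behavior of `UNLINK` can be pictured without a Redis client at all. The toy class below is a plain-Python simulation (not how the engine is implemented) of the idea: the key disappears from the keyspace immediately, while the expensive memory reclamation happens on a background thread, so the caller never waits on a large value.

```python
import threading
import time

class LazyFreeStore:
    """Toy keyspace illustrating UNLINK-style asynchronous reclamation."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def unlink(self, key):
        # Remove the key from the keyspace immediately (non-blocking)...
        with self._lock:
            value = self._data.pop(key, None)
        # ...and reclaim the value's memory on a background thread.
        if value is not None:
            threading.Thread(target=self._reclaim, args=(value,), daemon=True).start()
            return 1
        return 0

    @staticmethod
    def _reclaim(value):
        time.sleep(0.01)  # stand-in for freeing a large data structure
        del value

store = LazyFreeStore()
store._data["big-set"] = set(range(100_000))
print(store.unlink("big-set"))       # 1: key is gone as soon as unlink returns
print("big-set" in store._data)      # False, even though reclamation may still run
```

A blocking `DEL`, by contrast, would free the whole structure on the calling thread before returning, which is where the performance impact on large keys comes from.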

## PE 6: How do you model and interact with data in ElastiCache?


**Question-level introduction: **How you use ElastiCache depends heavily on the data structures and data model of your application, and you also need to consider the underlying data store (if present). Understand the data structures available, and ensure you are using the most appropriate data structures for your needs. 

**Question-level benefit: **Data modeling in ElastiCache has several layers, including application use case, data types, and relationships between data elements. Additionally, each data type and command has its own well-documented performance signature.
+ **[Best] **A best practice is to reduce unintentional overwriting of data. Use a naming convention that minimizes overlapping key names. Conventional naming of your data structures uses a hierarchical method, such as `APPNAME:CONTEXT:ID` (for example, `ORDER-APP:CUSTOMER:123`).

  **[Resources]: **
  + [Key naming](https://docs.gitlab.com/ee/development/redis.html#key-naming)
+ **[Best] **ElastiCache for Valkey and Redis OSS commands have a time complexity defined by Big O notation. The time complexity of a command is an algorithmic/mathematical representation of its impact. When introducing a new data type in your application, carefully review the time complexity of the related commands. Commands with a time complexity of O(1) are constant in time and do not depend on the size of the input, whereas commands with a time complexity of O(N) are linear in time and subject to the size of the input. Due to the single-threaded design of ElastiCache for Valkey and Redis OSS, a large volume of high-time-complexity operations results in lower performance and potential operation timeouts.

  **[Resources]: **
  + [Commands](https://valkey.io/commands/)
+ **[Best] **Use APIs to gain GUI visibility into the data model in your cluster.

  **[Resources]: **
  + [Redis OSS Commander](https://www.npmjs.com/package/redis-commander)
  + [Redis OSS Browser](https://github.com/humante/redis-browser)
  + [Redsmin](https://www.redsmin.com/)
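A small helper makes the hierarchical convention above easy to enforce consistently. The function below is a hypothetical sketch (the `APPNAME:CONTEXT:ID` pattern comes from the practice above; the validation rules are an assumption) that builds keys and rejects segments that would corrupt the hierarchy.

```python
SEPARATOR = ":"

def build_key(app: str, context: str, entity_id) -> str:
    """Build a hierarchical cache key such as ORDER-APP:CUSTOMER:123."""
    segments = [app, context, str(entity_id)]
    for segment in segments:
        # Empty segments or embedded separators would break the hierarchy.
        if not segment or SEPARATOR in segment:
            raise ValueError(f"invalid key segment: {segment!r}")
    return SEPARATOR.join(segments)

print(build_key("ORDER-APP", "CUSTOMER", 123))  # ORDER-APP:CUSTOMER:123
```

Centralizing key construction in one helper also makes it easy to audit which key patterns your application produces.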

## PE 7: How do you log slow running commands in your Amazon ElastiCache cluster?


**Question-level introduction: **Performance tuning benefits from the capture, aggregation, and notification of long-running commands. By understanding how long commands take to execute, you can determine which commands result in poor performance, as well as commands that block the engine from performing optimally. ElastiCache can also forward this information to Amazon CloudWatch or Amazon Kinesis Data Firehose.

**Question-level benefit: **Logging to a dedicated permanent location and providing notification events for slow commands can help with detailed performance analysis and can be used to trigger automated events.
+ **[Required] **For ElastiCache running Valkey engine version 7.2 or newer, or Redis OSS engine version 6.0 or newer, use a properly configured parameter group and enable SLOWLOG logging on the cluster.
  + The required parameters are only available when engine version compatibility is set to Valkey 7.2 and higher, or Redis OSS version 6.0 or higher.
  + SLOWLOG logging occurs when the server execution time of a command takes longer than a specified value. The behavior of the cluster depends on the associated parameter group parameters, `slowlog-log-slower-than` and `slowlog-max-len`.
  + Changes take effect immediately.
+ **[Best] **Take advantage of CloudWatch or Kinesis Data Firehose capabilities.
  + Use the filtering and alarm capabilities of CloudWatch, CloudWatch Logs Insights, and Amazon Simple Notification Service to achieve performance monitoring and event notification.
  + Use the streaming capabilities of Kinesis Data Firehose to archive SLOWLOG logs to permanent storage or to trigger automated cluster parameter tuning.
  + Determine whether JSON or plain text format suits your needs best.
  + Provide IAM permissions to publish to CloudWatch or Kinesis Data Firehose.
+ **[Better] **Configure `slowlog-log-slower-than` to a value other than the default.
  + This parameter determines how long a command may execute for within the Valkey or Redis OSS engine before it is logged as a slow running command. The default value is 10,000 microseconds (10 milliseconds). The default value may be too high for some workloads.
  + Determine a value that is more appropriate for your workload based on application needs and testing results; however, a value that is too low may generate excessive data.
+ **[Better] **Leave `slowlog-max-len` at the default value.
  + This parameter determines the upper limit for how many slow-running commands are captured in Valkey or Redis OSS memory at any given time. A value of 0 effectively disables the capture. The higher the value, the more entries will be stored in memory, reducing the chance of important information being evicted before it can be reviewed. The default value is 128.
  + The default value is appropriate for most workloads. If you need to analyze data in an expanded time window from `valkey-cli` via the `SLOWLOG` command, consider increasing this value. This allows more commands to remain in Valkey or Redis OSS memory.

    If you are emitting the SLOWLOG data to either CloudWatch Logs or Kinesis Data Firehose, the data will be persisted and can be analyzed outside of the ElastiCache system, reducing the need to store large numbers of slow running commands in Valkey or Redis OSS memory.
+ **[Resources]: **
  + [How do I turn on Slow log in a cluster?](https://repost.aws/knowledge-center/elasticache-turn-on-slow-log)
  + [Log delivery](Log_Delivery.md)
  + [Redis OSS-specific parameters](ParameterGroups.Engine.md#ParameterGroups.Redis)
  + [Amazon CloudWatch](https://aws.amazon.com/cloudwatch/)
  + [Amazon Kinesis Data Firehose](https://aws.amazon.com/kinesis/data-firehose/)
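To see what SLOWLOG output looks like in practice, the sketch below parses entries in the shape returned by `SLOWLOG GET` (each entry carries an ID, a Unix timestamp, the execution time in microseconds, and the command with its arguments). The entries themselves are fabricated for illustration, and the threshold mirrors the default `slowlog-log-slower-than` value of 10,000 microseconds.

```python
# Each SLOWLOG GET entry: [id, unix timestamp, duration in microseconds,
# command and arguments, client address, client name]. These sample
# entries are fabricated for illustration.
sample_entries = [
    [14, 1700000000, 25_400, ["HGETALL", "sessions"], "10.0.0.5:53522", ""],
    [13, 1699999990, 4_800, ["GET", "user:42"], "10.0.0.5:53522", ""],
    [12, 1699999950, 118_000, ["KEYS", "*"], "10.0.0.9:50110", "batch-job"],
]

THRESHOLD_MS = 10.0  # mirrors slowlog-log-slower-than = 10,000 microseconds

for entry_id, ts, micros, command, addr, name in sample_entries:
    millis = micros / 1000
    if millis >= THRESHOLD_MS:
        print(f"entry {entry_id}: {' '.join(command)} took {millis:.1f} ms from {addr}")
```

The same kind of filtering and aggregation can be done automatically on entries delivered to CloudWatch Logs or Kinesis Data Firehose.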

## PE 8: How does Auto Scaling help increase the performance of the ElastiCache cluster?


**Question-level introduction: **With Valkey or Redis OSS auto scaling, your ElastiCache cluster can automatically increase or decrease its desired number of shards or replicas over time. You can do this by implementing either a target tracking or scheduled scaling policy.

**Question-level benefit: **Understanding and planning for spikes in the workload can ensure enhanced caching performance and business continuity. ElastiCache Auto Scaling continually monitors your CPU and memory utilization to make sure your cluster operates at your desired performance levels.
+ **[Required] **When launching a cluster for ElastiCache for Valkey or Redis OSS:

  1. Ensure that cluster mode is enabled

  1. Make sure the instance belongs to a node family and size that supports auto scaling

  1. Ensure the cluster is not running in Global Datastore, on Outposts, or in Local Zones

  **[Resources]: **
  + [Scaling clusters in Valkey and Redis OSS (Cluster Mode Enabled)](scaling-redis-cluster-mode-enabled.md)
  + [Using Auto Scaling with shards](AutoScaling-Using-Shards.md)
  + [Using Auto Scaling with replicas](AutoScaling-Using-Replicas.md)
+ **[Best] **Identify if your workload is read-heavy or write-heavy to define scaling policy. For best performance, use just one tracking metric. It is recommended to avoid multiple policies for each dimension, as auto scaling policies scale out when the target is hit, but scale in only when all target tracking policies are ready to scale in.

  **[Resources]: **
  + [Auto Scaling policies](AutoScaling-Policies.md)
  + [Defining a scaling policy](AutoScaling-Scaling-Defining-Policy-API.md)
+ **[Best] **Monitoring performance over time can help you detect workload changes that would remain unnoticed if monitored at a particular point in time. You can analyze corresponding CloudWatch metrics for cluster utilization over a four-week period to determine the target value threshold. If you are still not sure of what value to choose, we recommend starting with a minimum supported predefined metric value.

  **[Resources]: **
  + [Monitoring use with CloudWatch Metrics](CacheMetrics.md)
+ **[Better] **We advise testing your application with expected minimum and maximum workloads to identify the exact number of shards and replicas required for the cluster, develop scaling policies, and mitigate availability issues.

  **[Resources]: **
  + [Registering a Scalable Target](AutoScaling-Register-Policy.md)
  + [Registering a Scalable Target using the AWS CLI](AutoScaling-Scaling-Registering-Policy-CLI.md)
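The target tracking behavior above can be approximated with simple arithmetic. The helper below is only an illustration (Application Auto Scaling performs its own calculation, and the bounds here are placeholders): it applies the standard proportional estimate, in which desired capacity scales with the ratio of the observed metric to its target.

```python
import math

def estimated_shards(current_shards: int, metric_value: float, target_value: float,
                     min_shards: int = 1, max_shards: int = 250) -> int:
    """Proportional estimate: desired capacity scales with metric / target.
    min_shards and max_shards are placeholder bounds, not service limits."""
    desired = math.ceil(current_shards * (metric_value / target_value))
    return max(min_shards, min(max_shards, desired))

# At 75% average CPU with a 50% target, 4 shards -> 6 shards.
print(estimated_shards(4, metric_value=75.0, target_value=50.0))
```

This also illustrates why a single tracking metric is preferred: with several policies, the largest estimate wins on scale-out, while scale-in waits until every policy agrees.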

# Amazon ElastiCache Well-Architected Lens Cost Optimization Pillar
Cost Optimization Pillar

The cost optimization pillar focuses on avoiding unnecessary costs. Key topics include understanding and controlling where money is being spent, selecting the most appropriate node type (use instances that support data tiering based on workload needs), choosing the right number of resources (such as how many read replicas), analyzing spend over time, and scaling to meet business needs without overspending.

**Topics**
+ [

## COST 1: How do you identify and track costs associated with your ElastiCache resources? How do you develop mechanisms to enable users to create, manage, and dispose of created resources?
](#CostOptimizationPillarCOST1)
+ [

## COST 2: How do you use continuous monitoring tools to help you optimize the costs associated with your ElastiCache resources?
](#CostOptimizationPillarCOST2)
+ [

## COST 3: Should you use an instance type that supports data tiering? What are the advantages of data tiering? When should you not use data tiering instances?
](#CostOptimizationPillarCOST3)

## COST 1: How do you identify and track costs associated with your ElastiCache resources? How do you develop mechanisms to enable users to create, manage, and dispose of created resources?


**Question-level introduction: **Understanding cost metrics requires the participation of and collaboration across multiple teams: software engineering, data management, product owners, finance, and leadership. Identifying key cost drivers requires that all involved parties understand service usage control levers and cost management trade-offs, and it is frequently the key difference between successful and less successful cost optimization efforts. Ensuring you have processes and tools in place to track resources created, from development to production and retirement, helps you manage the costs associated with ElastiCache.

**Question-level benefit: **Continuous tracking of all costs associated with your workload requires a deep understanding of the architecture that includes ElastiCache as one of its components. Additionally, you should have a cost management plan in place to collect and compare usage against your budget. 
+ **[Required] **Institute a Cloud Center of Excellence (CCoE) with one of its founding charters being to own defining, tracking, and taking action on metrics around your organization's ElastiCache usage. If a CCoE already exists and functions, ensure that it knows how to read and track costs associated with ElastiCache. When resources are created, use IAM roles and policies to validate that only specific teams and groups can instantiate resources. This ensures that costs are associated with business outcomes and that a clear line of accountability is established from a cost perspective.

  1. CCoE should identify, define, and publish cost metrics that are updated on a regular (monthly) basis around key ElastiCache usage across categorical data such as: 

     1. Types of nodes used and their attributes: standard vs. memory optimized, on-demand vs. reserved instances, regions and availability zones

     1. Types of environments: free, dev, testing, and production

     1. Backup storage and retention strategies

     1. Data transfer within and across regions

     1. Instances running on Amazon Outposts 

  1. CCoE consists of a cross-functional team with non-exclusive representation from software engineering, data management, product team, finance, and leadership teams in your organization.

  **[Resources]: **
  + [Create a Cloud Center of Excellence](https://docs.aws.amazon.com/whitepapers/latest/cost-optimization-laying-the-foundation/cloud-center-of-excellence.html)
  + [Amazon ElastiCache pricing](https://aws.amazon.com/elasticache/pricing/)
+ **[Required] **Use cost allocation tags to track costs at a low level of granularity. Use AWS Cost Management to visualize, understand, and manage your AWS costs and usage over time. 

  1. Use tags to organize your resources, and cost allocation tags to track your AWS costs on a detailed level. After you activate cost allocation tags, AWS uses them to organize your resource costs on your cost allocation report, making it easier for you to categorize and track your AWS costs. AWS provides two types of cost allocation tags: AWS-generated tags and user-defined tags. AWS defines, creates, and applies the AWS-generated tags for you, and you define, create, and apply user-defined tags. You must activate both types of tags separately before they can appear in Cost Management or on a cost allocation report.

  1. Use cost allocation tags to organize your AWS bill to reflect your own cost structure. When you add cost allocation tags to your resources in Amazon ElastiCache, you will be able to track costs by grouping expenses on your invoices by resource tag values. You should consider combining tags to track costs at a greater level of detail.

  **[Resources]: **
  + [Using AWS cost allocation tags](https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/cost-alloc-tags.html)
  + [Monitoring costs with cost allocation tags](Tagging.md)
  + [AWS Cost Explorer](https://aws.amazon.com/aws-cost-management/aws-cost-explorer/)
+ **[Best] **Connect ElastiCache cost to metrics that reach across the organization.

  1. Consider business metrics as well as operational metrics like latency. Which concepts in your business model are understandable across roles? The metrics need to be understandable by as many roles as possible in the organization. 

  1. Examples: simultaneous served users, max and average latency per operation and user, user engagement scores, user return rates per week, session length per user, abandonment rate, cache hit rate, and keys tracked

  **[Resources]: **
  + [Monitoring use with CloudWatch Metrics](CacheMetrics.md)
+ **[Good] **Maintain up-to-date architectural and operational visibility on metrics and costs across the entire workload that uses ElastiCache.

  1. Understand your entire solution ecosystem. ElastiCache tends to be part of a full ecosystem of AWS services in your technology set, from clients to API Gateway, Redshift, and QuickSight for reporting tools (for example). 

  1. Map the components of your solution (clients, connections, security, in-memory operations, storage, resource automation, data access, and management) onto your architecture diagram. Each layer connects to the entire solution and has its own needs and capabilities that add to, and help you manage, the overall cost.

  1. Your diagram should include the use of compute, networking, storage, lifecycle policies, and metrics gathering, as well as the operational and functional ElastiCache elements of your application.

  1. The requirements of your workload are likely to evolve over time and it is essential that you continue to maintain and document your understanding of the underlying components as well as your primary functional objectives in order to remain proactive in your workload cost management.

  1. Executive support for visibility, accountability, prioritization, and resources is crucial to you having an effective cost management strategy for your ElastiCache.
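Once cost allocation tags are active, the grouping they enable is straightforward. The sketch below uses fabricated usage records (in practice the numbers come from your cost allocation report or Cost Explorer, and the `team`/`env` tag keys are hypothetical) to total spend per user-defined tag value.

```python
from collections import defaultdict

# Fabricated usage records; in practice these come from your cost
# allocation report or Cost Explorer, grouped by a user-defined tag.
records = [
    {"resource": "orders-cache", "tags": {"team": "checkout", "env": "prod"}, "usd": 412.50},
    {"resource": "sessions-cache", "tags": {"team": "identity", "env": "prod"}, "usd": 188.20},
    {"resource": "dev-cache", "tags": {"team": "checkout", "env": "dev"}, "usd": 37.10},
]

def cost_by_tag(records, tag_key):
    """Total spend per value of a cost allocation tag; untagged spend is flagged."""
    totals = defaultdict(float)
    for record in records:
        totals[record["tags"].get(tag_key, "(untagged)")] += record["usd"]
    return dict(totals)

print(cost_by_tag(records, "team"))  # checkout: 449.60, identity: 188.20
```

Surfacing an explicit `(untagged)` bucket is useful in practice: it shows how much spend your tagging policy is failing to attribute.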

## COST 2: How do you use continuous monitoring tools to help you optimize the costs associated with your ElastiCache resources?


**Question-level introduction: **You need to aim for a proper balance between your ElastiCache cost and application performance metrics. Amazon CloudWatch provides visibility into key operational metrics that can help you assess whether your ElastiCache resources are over or under utilized, relative to your needs. From a cost optimization perspective, you need to understand when you are overprovisioned and be able to develop appropriate mechanisms to resize your ElastiCache resources while maintaining your operational, availability, resilience, and performance needs. 

**Question-level benefit: **In an ideal state, you will have provisioned sufficient resources to meet your workload operational needs and not have under-utilized resources that can lead to a sub-optimal cost state. You need to be able to both identify and avoid operating oversized ElastiCache resources for long periods of time. 
+ **[Required] **Use CloudWatch to monitor your ElastiCache clusters and analyze how these metrics relate to your AWS Cost Explorer dashboards. 

  1. ElastiCache provides both host-level metrics (for example, CPU usage) and metrics that are specific to the cache engine software (for example, cache gets and cache misses). These metrics are measured and published for each cache node in 60-second intervals.

  1. ElastiCache performance metrics (`CPUUtilization`, `EngineCPUUtilization`, `SwapUsage`, `CurrConnections`, and `Evictions`) may indicate that you need to scale up or down (use larger or smaller cache node types) or in or out (add or remove shards). Understand the cost implications of scaling decisions by creating a playbook matrix that estimates the additional cost and the minimum and maximum lengths of time required to meet your application performance thresholds.

  **[Resources]: **
  + [Monitoring use with CloudWatch Metrics](CacheMetrics.md)
  + [Which Metrics Should I Monitor?](CacheMetrics.WhichShouldIMonitor.md)
  + [Amazon ElastiCache pricing](https://aws.amazon.com/elasticache/pricing/)
+ **[Required] **Understand and document your backup strategy and cost implications.

  1. With ElastiCache, the backups are stored in Amazon S3, which provides durable storage. You need to understand the cost implications in relation to your ability to recover from failures.

  1. Enable automatic backups that will delete backup files that are past the retention limit.

  **[Resources]: **
  + [Scheduling automatic backups](backups-automatic.md)
  + [Amazon Simple Storage Service pricing](https://aws.amazon.com/s3/pricing/)
+ **[Best] **Use reserved nodes for your instances as a deliberate strategy to manage costs for workloads that are well understood and documented. Reserved nodes are charged an upfront fee that depends on the node type and the length of the reservation (one or three years). This charge is much less than the hourly usage charge that you incur with on-demand nodes.

  1. You may need to operate your ElastiCache clusters using on-demand nodes until you have gathered sufficient data to estimate the reserved instance requirements. Plan and document the resources needed to meet your needs, and compare expected costs across instance types (on-demand vs. reserved).

  1. Regularly evaluate new cache node types as they become available, and assess whether it makes sense, from a cost and operational metrics perspective, to migrate your instance fleet to new cache node types.
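The reserved-versus-on-demand decision above comes down to a break-even point. The helper below is illustrative only: the prices are hypothetical and not an ElastiCache rate card, and it ignores partial-upfront options, but it shows the arithmetic of weighing the upfront fee against the hourly savings.

```python
def break_even_months(upfront: float, reserved_hourly: float,
                      on_demand_hourly: float, hours_per_month: float = 730.0) -> float:
    """Months of continuous use after which a reserved node becomes cheaper."""
    monthly_savings = (on_demand_hourly - reserved_hourly) * hours_per_month
    if monthly_savings <= 0:
        raise ValueError("reserved hourly rate must be below the on-demand rate")
    return upfront / monthly_savings

# Hypothetical prices, not a real ElastiCache rate card.
print(f"{break_even_months(1200.0, 0.10, 0.25):.1f} months")  # 11.0 months
```

If the break-even point lands well inside the reservation term for a steadily running cluster, the reservation is likely worthwhile; if it lands near or past the term, stay on-demand and re-evaluate.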

## COST 3: Should you use an instance type that supports data tiering? What are the advantages of data tiering? When should you not use data tiering instances?


**Question-level introduction: **Selecting the appropriate instance type has not only performance and service-level impact but also financial impact. Instance types have different costs associated with them. Selecting one or a few large instance types that can accommodate all storage needs in memory might be a natural decision. However, this could have significant cost impact as the project matures. Ensuring that the correct instance type is selected requires periodic examination of ElastiCache object idle time.

**Question-level benefit: **You should have a clear understanding of how various instance types impact your costs now and in the future. Marginal or periodic workload changes should not cause disproportionate cost changes. If the workload permits it, instance types that support data tiering offer a better price per unit of available storage. Because of the SSD storage available on each instance, data tiering instances support a much higher total data capacity per instance.
+ **[Required] **Understand the limitations of data tiering instances.

  1. Only available for ElastiCache for Valkey or Redis OSS clusters.

  1. Only limited instance types support data tiering.

  1. Only ElastiCache version 6.2 for Redis OSS and above is supported.

  1. Large items are not swapped out to SSD. Objects over 128 MiB are kept in memory.

  **[Resources]: **
  + [Data tiering](data-tiering.md)
  + [Amazon ElastiCache pricing](https://aws.amazon.com/elasticache/pricing/)
+ **[Required] **Understand what percentage of your database is regularly accessed by your workload.

  1. Data tiering instances are ideal for workloads that often access a small portion of the overall dataset but still require fast access to the remaining data. In other words, the ratio of hot to warm data is about 20:80.

  1. Develop cluster-level tracking of object idle time.

  1. Large implementations with over 500 GB of data are good candidates.
+ **[Required] **Understand that data tiering instances are not optimal for certain workloads.

  1. There is a small performance cost for accessing less frequently used objects, as those are swapped out to local SSD. If your application is sensitive to response time, test the impact on your workload.

  1. Not suitable for caches that store mostly large objects over 128 MiB in size.

  **[Resources]: **
  + [Limitations](data-tiering.md#data-tiering-prerequisites)
+ **[Best] **Use reserved instances of a type that supports data tiering. This ensures the lowest cost in terms of data storage per instance.

  1. You may need to operate your ElastiCache clusters using non-data tiering instances until you have a better understanding of your requirements.

  1. Analyze your ElastiCache clusters data usage pattern.

  1. Create an automated job that periodically collects object idle time.

  1. If you notice that a large percentage (about 80%) of objects are idle for a period of time deemed appropriate for your workload, document the findings and suggest migrating the cluster to instances that support data tiering.

  1. Regularly evaluate new cache node types available and assess whether it makes sense, from a cost and operational metrics perspective, to migrate your instance fleet to new cache node types.

  **[Resources]: **
  + [OBJECT IDLETIME](https://valkey.io/commands/object-idletime/)
  + [Amazon ElastiCache pricing](https://aws.amazon.com/elasticache/pricing/)
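The rules of thumb in this question can be collected into one screening check. The function below is a sketch that simply encodes the thresholds stated above (a roughly 20:80 hot-to-warm ratio, datasets over 500 GB, and no objects over 128 MiB); the exact cut-offs for your workload are a judgment call, not hard limits.

```python
def data_tiering_candidate(total_gb: float, hot_fraction: float,
                           max_item_mib: float) -> bool:
    """Rules of thumb from this section: large dataset, mostly warm data,
    and no items too large to swap to SSD (objects over 128 MiB stay in memory)."""
    return total_gb >= 500 and hot_fraction <= 0.2 and max_item_mib <= 128

print(data_tiering_candidate(750, hot_fraction=0.15, max_item_mib=4))    # True
print(data_tiering_candidate(750, hot_fraction=0.15, max_item_mib=512))  # False
```

Feeding this check from your automated object idle time collection (as suggested above) turns the migration suggestion into a repeatable, data-driven review.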