Operational observability - Best Practices for Tagging AWS Resources

Operational observability

Observability gives you actionable insights into the performance of your environments and helps you detect and investigate problems. It also serves a secondary purpose: it lets you define and measure key performance indicators (KPIs) and service level objectives (SLOs), such as uptime. For most organizations, important operations KPIs are mean time to detect (MTTD) an incident and mean time to recover (MTTR) from one.

Context is important throughout observability: when data is collected, the associated tags are gathered with it. Whichever service, application, or application tier you are focusing on, you can then filter and analyze that specific dataset. Tags can also be used to automate onboarding to CloudWatch alarms so that the right teams are alerted when metric thresholds are breached. For example, the tag key example-inc:ops:alarm-tag, together with its value, could indicate that a CloudWatch alarm should be created for the resource. A solution demonstrating this is described in Use tags to create and maintain Amazon CloudWatch alarms for Amazon EC2 instances.
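As a rough illustration of tag-driven alarm onboarding, the sketch below builds the parameters for a CPU alarm only when an instance carries the opt-in tag. The function name, the opt-in value "on", the alarm naming scheme, and the threshold are assumptions made for this example; they are not taken from the referenced solution.

```python
from typing import Dict, Optional

ALARM_TAG_KEY = "example-inc:ops:alarm-tag"  # tag key used in the text above
ALARM_TAG_ON_VALUE = "on"                    # assumption: the value that opts a resource in


def build_cpu_alarm_params(
    instance_id: str,
    tags: Dict[str, str],
    sns_topic_arn: str,
    threshold: float = 80.0,                 # assumption: illustrative CPU threshold (percent)
) -> Optional[dict]:
    """Return keyword arguments for a CloudWatch alarm, or None if the
    instance has not opted in through its tags."""
    if tags.get(ALARM_TAG_KEY, "").lower() != ALARM_TAG_ON_VALUE:
        return None
    return {
        "AlarmName": f"cpu-high-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,                           # seconds per evaluation period
        "EvaluationPeriods": 2,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],         # where the owning team is notified
    }
```

The returned dictionary uses the parameter names of the CloudWatch PutMetricAlarm API, so with boto3 installed and credentials configured it could be passed as boto3.client("cloudwatch").put_metric_alarm(**params). Keeping the tag check separate from the API call makes the onboarding rule easy to test without an AWS account.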

Having too many alarms configured can easily create an alert storm, in which a large number of alarms or notifications rapidly overwhelm operators and reduce their effectiveness while they manually triage and prioritize individual alarms. Tags can provide additional context for the alarms, so that rules can be defined within Amazon EventBridge to help ensure that focus is given to the upstream issue rather than to downstream dependencies.
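One way to picture the upstream-versus-downstream filtering is the sketch below, which could run in a target invoked by an EventBridge rule. It assumes the tags of each firing alarm have already been looked up, and both tag keys (example-inc:ops:application and example-inc:ops:depends-on) are hypothetical names chosen for this example, not an established convention.

```python
from typing import Dict, List

APP_TAG = "example-inc:ops:application"        # hypothetical: application the alarm belongs to
DEPENDS_ON_TAG = "example-inc:ops:depends-on"  # hypothetical: application it depends on


def upstream_alarms(firing: List[Dict]) -> List[Dict]:
    """Keep only alarms whose tagged upstream dependency is not itself alarming.

    Each item in `firing` is a dict with a 'name' and a 'tags' mapping.
    """
    firing_apps = {alarm["tags"].get(APP_TAG) for alarm in firing}
    firing_apps.discard(None)  # alarms without an application tag suppress nothing

    result = []
    for alarm in firing:
        upstream = alarm["tags"].get(DEPENDS_ON_TAG)
        if upstream in firing_apps:
            continue  # the dependency is already alarming; treat this one as a symptom
        result.append(alarm)
    return result
```

For instance, if a database alarm, an API alarm tagged as depending on the database, and a web alarm tagged as depending on the API all fire together, only the database alarm is returned. Circular dependency tags would suppress every alarm in the cycle, so a real implementation would need to guard against that case.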

The role of operations alongside DevOps is often overlooked, but for many organizations, central operations teams still provide a critical first response outside of normal business hours. (More details about this model can be found in the Operational Excellence whitepaper.) These teams typically do not have the same depth of knowledge as the DevOps team that owns the workload, so the context that tags provide within dashboards and alerts can direct them to the correct runbook for the issue or initiate an automated runbook (refer to the blog post Automating Amazon CloudWatch Alarms with AWS Systems Manager).
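To illustrate how tag context might reach a first responder, the sketch below formats an alert line that includes the owning team and a runbook reference taken from resource tags. The tag keys example-inc:ops:owner and example-inc:ops:runbook, the message layout, and the sample values are assumptions for this example only.

```python
from typing import Dict

OWNER_TAG = "example-inc:ops:owner"      # hypothetical: team that owns the workload
RUNBOOK_TAG = "example-inc:ops:runbook"  # hypothetical: name or link of the runbook


def format_alert(alarm_name: str, tags: Dict[str, str]) -> str:
    """Build a one-line alert that tells the responder who owns the
    resource and which runbook to follow."""
    owner = tags.get(OWNER_TAG, "unknown owner")
    runbook = tags.get(RUNBOOK_TAG, "no runbook tagged; escalate to the owning team")
    return f"[{owner}] {alarm_name} | runbook: {runbook}"


print(format_alert(
    "cpu-high-i-0123456789abcdef0",
    {OWNER_TAG: "payments", RUNBOOK_TAG: "restart-payments-service"},
))
# prints: [payments] cpu-high-i-0123456789abcdef0 | runbook: restart-payments-service
```

The explicit fallback text makes missing tags visible in the alert itself instead of leaving the responder with no direction.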