Operational observability
Observability gives you actionable insight into the performance of your environments and helps you detect and investigate problems. It also serves a secondary purpose: it lets you define and measure key performance indicators (KPIs) and service level objectives (SLOs), such as uptime. For most organizations, important operations KPIs are mean time to detect (MTTD) and mean time to recover (MTTR) from an incident.
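As a rough illustration of how these two KPIs could be computed from incident records, the following sketch averages the detection and recovery delays. The Incident record and its field names are hypothetical, and it assumes both intervals are measured from the moment the incident started; some organizations measure MTTR from the time of detection instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    started: datetime
    detected: datetime
    recovered: datetime


def _mean(deltas):
    """Average a non-empty collection of timedelta values."""
    deltas = list(deltas)
    if not deltas:
        raise ValueError("at least one incident is required")
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents):
    """Mean time to detect: average of (detected - started)."""
    return _mean(i.detected - i.started for i in incidents)


def mttr(incidents):
    """Mean time to recover: average of (recovered - started).

    Assumption: recovery time is measured from incident start.
    """
    return _mean(i.recovered - i.started for i in incidents)
```

Because the results are timedelta values, they can be compared directly against an SLO target such as timedelta(minutes=15).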
Context is important throughout observability, because as data is collected, the tags
associated with it are gathered as well. Regardless of the service, application, or
application tier that you are focusing on, you can filter and analyze that specific dataset.
Tags can also be used to automate onboarding to CloudWatch Alarms so that the right teams
are alerted when certain metric thresholds are breached. For example, the tag key
example-inc:ops:alarm-tag and its value could indicate that a CloudWatch Alarm should be
created. A solution demonstrating this is described in Use tags to create and maintain
Amazon CloudWatch alarms for Amazon EC2 instances.
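The tag-driven onboarding described above could be sketched as follows. This is a minimal illustration, not the referenced solution: the tag value "enabled", the function name, the alarm naming scheme, and the default threshold are all assumptions. The returned dictionary uses the parameter names accepted by the CloudWatch PutMetricAlarm API.

```python
ALARM_TAG_KEY = "example-inc:ops:alarm-tag"
ENABLED_VALUE = "enabled"  # assumed tag value; the source does not define one


def build_cpu_alarm_params(instance_id, tags, threshold=80.0):
    """Return PutMetricAlarm parameters for a tagged EC2 instance.

    `tags` is a plain dict of tag key to tag value. Returns None when the
    instance has not opted in through the alarm tag.
    """
    if tags.get(ALARM_TAG_KEY, "").lower() != ENABLED_VALUE:
        return None
    return {
        "AlarmName": f"cpu-high-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,              # seconds per datapoint
        "EvaluationPeriods": 2,     # consecutive breaching periods required
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
    }
```

With boto3, the result could be passed as keyword arguments to the CloudWatch client's put_metric_alarm method. Keeping the parameter construction separate from the API call makes the tag logic easy to test without AWS credentials.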
Having too many alarms configured can easily create an alert storm, in which a large number of alarms or notifications rapidly overwhelm operators and reduce their overall effectiveness while they manually triage and prioritize individual alarms. Additional context for the alarms can be provided in the form of tags, so that rules can be defined within Amazon EventBridge to help ensure that focus is given to the upstream issue rather than downstream dependencies.
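The filtering idea can be shown in plain Python, independent of how it would be expressed as EventBridge rules. This sketch assumes two hypothetical tag keys, one naming the workload an alarm belongs to and one naming the workload it depends on; neither key comes from the source.

```python
WORKLOAD_TAG = "example-inc:ops:workload"      # hypothetical tag key
DEPENDS_ON_TAG = "example-inc:ops:depends-on"  # hypothetical tag key


def upstream_alarms(firing):
    """Keep only alarms whose upstream dependency is not itself alarming.

    `firing` is a list of dicts such as {"name": ..., "tags": {...}}, one per
    alarm currently in the ALARM state.
    """
    alarming_workloads = {a["tags"].get(WORKLOAD_TAG) for a in firing}
    alarming_workloads.discard(None)  # ignore alarms without a workload tag
    return [
        a for a in firing
        if a["tags"].get(DEPENDS_ON_TAG) not in alarming_workloads
    ]
```

If a database alarm and the alarm of a web tier that depends on it fire together, only the database alarm is returned, so operators see the likely source first.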
The role of operations alongside DevOps is often overlooked, but
for many organizations, central operations teams still provide a
critical first response outside of normal business hours.
(More details about this model can be found in the Operational
Excellence whitepaper.) Unlike the DevOps team that owns
the workload, these central teams typically do not have the same
depth of knowledge, so the context that tags provide within dashboards
and alerts can direct them to the correct runbook for the
issue, or initiate an automated runbook (refer to the blog post
Automating Amazon CloudWatch Alarms with AWS Systems Manager).