Operational observability
Observability gives you actionable insight into the performance of your environments and helps you detect and investigate problems. It also serves a secondary purpose: it lets you define and measure key performance indicators (KPIs) and service level objectives (SLOs), such as uptime. For most organizations, important operations KPIs are mean time to detect (MTTD) an incident and mean time to recover (MTTR) from it.
Throughout observability, context is important, because tags are gathered along with the data that is collected and stay associated with it. Regardless of the service, application, or application tier that you are focusing on, you can filter and analyze that specific dataset. Tags can also be used to automate onboarding to CloudWatch alarms so that the right teams are alerted when certain metric thresholds are breached. For example, a tag key example-inc:ops:alarm-tag and its value could indicate that a CloudWatch alarm should be created for the resource. A solution demonstrating this is described in Use tags to create and maintain Amazon CloudWatch alarms for Amazon EC2 instances.
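The tag-driven onboarding described above can be sketched as a small function that decides, from a resource's tags, whether an alarm should exist and what its definition would be. This is a minimal illustration, not the referenced solution: the opt-in value "on", the alarm name, the CPU threshold, and the function name build_cpu_alarm_request are all assumptions made for the example. Only the tag key comes from the text.

```python
from typing import Dict, Optional

ALARM_TAG_KEY = "example-inc:ops:alarm-tag"  # tag key used in the text
ALARM_TAG_VALUE = "on"                       # assumed opt-in value


def build_cpu_alarm_request(
    instance_id: str, tags: Dict[str, str], topic_arn: str
) -> Optional[dict]:
    """Return alarm parameters for a tagged EC2 instance, or None if not opted in.

    The threshold, period, and naming scheme are illustrative assumptions.
    """
    if tags.get(ALARM_TAG_KEY, "").lower() != ALARM_TAG_VALUE:
        return None  # instance has not opted in to alarm creation
    return {
        "AlarmName": f"cpu-high-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,            # seconds per datapoint
        "EvaluationPeriods": 2,   # consecutive breaching periods before ALARM
        "Threshold": 80.0,        # percent CPU
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],  # for example, the owning team's SNS topic
    }
```

The returned dictionary uses the parameter names of the CloudWatch PutMetricAlarm API, so with boto3 installed and credentials configured it could be passed as boto3.client("cloudwatch").put_metric_alarm(**request). Keeping the decision logic separate from the API call makes it easy to test without AWS access.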
Having too many alarms configured can easily create an alert storm, in which a large number of alarms or notifications rapidly overwhelm operators and reduce their effectiveness while they manually triage and prioritize individual alarms. Tags can provide additional context for the alarms, so rules can be defined within Amazon EventBridge to help ensure that focus is given to the upstream issue rather than downstream dependencies.
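One way to picture how tag context can narrow an alert storm to the upstream issue is the filter below. It is a sketch under stated assumptions: the tag keys example-inc:ops:service and example-inc:ops:depends-on are hypothetical, each alarm is represented as a plain dictionary, and each workload is assumed to declare at most one dependency. The source does not prescribe this scheme; in practice such logic might run in a target invoked by an EventBridge rule that matches CloudWatch alarm state changes, with the alarm's tags looked up separately.

```python
from typing import Dict, List

SERVICE_TAG = "example-inc:ops:service"        # hypothetical: service the alarm covers
DEPENDS_ON_TAG = "example-inc:ops:depends-on"  # hypothetical: its upstream service


def upstream_alarms(firing: List[Dict]) -> List[Dict]:
    """Keep only alarms whose declared upstream service is not itself alarming.

    Each item is expected to look like {"name": str, "tags": {key: value}}.
    """
    # Services that currently have at least one alarm in the ALARM state.
    firing_services = {alarm["tags"].get(SERVICE_TAG) for alarm in firing}
    firing_services.discard(None)

    # An alarm is treated as downstream noise if the service it depends on
    # is also alarming; alarms with no declared dependency are always kept.
    return [
        alarm
        for alarm in firing
        if alarm["tags"].get(DEPENDS_ON_TAG) not in firing_services
    ]
```

For example, if alarms for a web tier, an API tier, and a database are all firing, and the tags declare web depends on api and api depends on db, only the database alarm is returned, which is where an operator's attention is most useful.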
The role of operations alongside DevOps is often overlooked, but for many organizations, central operations teams still provide a critical first response outside of normal business hours. (More details about this model can be found in the Operational Excellence whitepaper.) Unlike the DevOps team that owns the workload, central operations teams typically do not have the same depth of knowledge, so the context that tags provide within dashboards and alerts can direct them to the correct runbook for the issue, or initiate an automated runbook (refer to the blog post Automating Amazon CloudWatch Alarms with AWS Systems Manager).