

# Operational analytics
<a name="operational-analytics"></a>

 Operational analytics refers to the interdisciplinary techniques and methodologies used to measure and improve day-to-day business performance by increasing the efficiency of internal business processes and improving customer experience and value. 

 Traditional analytics like Business Intelligence (BI) provide each Line of Business (LOB) with insights to identify trends and make decisions based on what happened in the past.  

 But this is no longer sufficient. To deliver a good customer experience, organizations must continually measure their workload performance and quickly respond to operational inefficiencies. 

 By using operational analytics systems, organizations can initiate business actions based on the recommendations that the systems provide. They can also automate execution processes to reduce human error. This takes the system beyond being descriptive to being prescriptive, and even predictive, in nature. 

 At the same time, IT infrastructures are becoming increasingly distributed, adding complexity to workloads: it becomes harder to identify the operational data that captures the system's state, characterize its behavior, and rectify potential issues in the pipelines. 

 Several tools and methodologies have emerged that help companies keep their systems reliable. Every system or application must be instrumented to expose telemetry data that provides operational insights in real or near real time. 

 Telemetry data can take the form of different signals: logs, traces, and metrics. Traditionally, this data came in the form of logs, which record an event that happened within an application, a server, or a system operation. Logs come in different types, such as application logs, security logs, system logs, audit trails, and infrastructure logs. Logs are usually used for troubleshooting and root-cause analysis of a system or application failure at a specific point in time. 
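For instance, an application log event is often emitted as a structured record, one JSON object per line, so that it is easy to ship and index. A minimal illustrative sketch (the field names are assumptions, not a standard):

```python
import datetime
import json

# An illustrative structured application log event (field names are assumptions).
event = {
    "timestamp": datetime.datetime(2024, 1, 1, 12, 0, 0).isoformat() + "Z",
    "level": "ERROR",
    "service": "checkout",
    "message": "payment gateway timeout after 3 retries",
}
log_line = json.dumps(event)  # one JSON object per line is easy to ship and index
```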

 A trace signal captures a user request for resources as it passes through different systems all the way to the destination, and the response back to the user. It indicates the causal relationship between all the services that are part of a distributed transaction. Organizations used to develop their own trace mechanisms, but we recommend using existing tools that support a standard trace-context propagation format. The trace context holds the information that links the producer of a message to its downstream consumers. 
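A widely used propagation format of this kind is the W3C Trace Context `traceparent` HTTP header, which carries a version, a trace ID, a parent span ID, and trace flags. A small sketch parsing its four fields (the sample IDs below are the example values from the W3C specification):

```python
import re

def parse_traceparent(header):
    """Parse a W3C Trace Context traceparent header into its four fields:
    version-traceid-spanid-flags, all lowercase hex."""
    m = re.fullmatch(
        r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})", header
    )
    if not m:
        raise ValueError("malformed traceparent header")
    return {
        "version": m.group(1),
        "trace_id": m.group(2),   # links all spans of one distributed transaction
        "span_id": m.group(3),    # the parent span that produced this request
        "flags": m.group(4),      # e.g. "01" = sampled
    }

fields = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
```

The trace ID stays constant across every service in the transaction, which is what lets downstream consumers correlate their logs and spans back to the original request.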

 Metric data provides a point-in-time measure of the health of the system, such as resource consumption in terms of CPU utilization. Metric signals offer an overview of overall system health while reducing the manual effort needed to build and store these metrics. With metrics, system operators can be notified in real time about anomalies in production environments and can establish automated recovery processes for recurrent incidents. 

 The signals described above are instrumented in different ways and support different approaches to implementing operational analytics use cases. Therefore, organizations must have an operational objective in mind, from which they can work backwards to identify what data output they need from their system, which tool is the best fit for their business and IT environment, and finally what insights are needed to better understand their customers and improve their production resiliency. 

# Characteristics
<a name="characteristics-4"></a>

 **Discoverability:** The ability of the system to make operational data available for consumption. This involves discovering multiple disparate types of data available within an application that can be used for various ad hoc explorations. 

 **Connectivity:** Operational data can emanate from a variety of data sources, in different formats and at disparate volumes. For this reason, the operational system has to provide the capability to seamlessly integrate all the data with the least overhead for the production application. 

 **Scalability:** The ability of the system to scale up and out to adapt to changes in the operational analytics workload in terms of storage or compute requirements. 

 **Monitoring:** You should be able to continuously monitor the operational system performance and get notified about the resource utilization and the overall health of your system. 

 **Security:** Access to the operational system must be secure. With Amazon OpenSearch Service, you can configure the domain to be accessible through an endpoint within your VPC or through a public endpoint accessible from the internet. In addition to network-based access control, you must set up user authentication and authorization to secure access to data based on business requirements. OpenSearch Service supports encryption at rest and in transit. 

 **Data durability:** With operational analytics, retention requirements differ by use case. You should understand your business requirements in terms of analyzing historical data. With Amazon OpenSearch Service, you can retain more data at less cost using the UltraWarm and cold storage tiers. 

 **Automation:** The data lifecycle in your operational system should be automated in order to easily onboard new data pipelines and reduce the overhead of managing the lifecycle of the data. With Index State Management (ISM) in Amazon OpenSearch Service, you can create your own policies to automate the lifecycle management of indices stored in the service. 
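As an illustration of such a policy, a minimal ISM policy body might delete indices a fixed time after creation. In this sketch the index pattern, state names, and the 30-day retention period are all assumptions chosen for the example:

```python
# A minimal ISM policy sketch: indices matching the pattern start in the
# "hot" state and transition to deletion once they are 30 days old.
# (Pattern, state names, and retention period are illustrative assumptions.)
ism_policy = {
    "policy": {
        "description": "Delete operational indices after 30 days (example values)",
        "default_state": "hot",
        "states": [
            {
                "name": "hot",
                "actions": [],
                "transitions": [
                    {
                        "state_name": "delete",
                        "conditions": {"min_index_age": "30d"},
                    }
                ],
            },
            {
                "name": "delete",
                "actions": [{"delete": {}}],
                "transitions": [],
            },
        ],
        "ism_template": [{"index_patterns": ["app-logs-*"]}],
    }
}
# A policy body like this is typically PUT to _plugins/_ism/policies/<policy-id>
# on the domain; check the ISM documentation for your OpenSearch version.
```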

 **Observability:** The ability to understand the internal state of the system from its various signal outputs. By providing a holistic view of these various signals along with a meaningful inference, it becomes easy to understand how healthy and performant the overall system is. 

 **User centricity:** Each analytics application should address a well-defined operational scope and solve a particular problem at hand. Users of the system often won't understand or care about the analytics process; they only see the value of the result.  

 **Agility:** The system must be flexible enough to accommodate changing needs of an analytics application and offer necessary control to bring in additional data with low overhead.

# Reference architecture
<a name="section-17"></a>

![\[Reference architecture diagram for operational analytics\]](http://docs.aws.amazon.com/wellarchitected/latest/analytics-lens/images/operational-analytics-reference-architecture.png)




 The reference architecture covers the data flow in an operational analytics use case. The ingestion pipeline contains up to five stages, as follows: 

1.  With your operational and business goal in mind, you should instrument your system or platform to ***produce*** the relevant types of signals, such as *various* logs, traces, and metrics, and expose the data to a set of collectors. At this stage, you can choose open-source instrumentation tools such as [Jaeger](https://www.jaegertracing.io/) or [Zipkin](https://zipkin.io/). If you plan to generate different types of signals, we recommend that you include signal correlation beginning with the design step. Open-source tools such as [OpenTelemetry](https://opentelemetry.io/docs/) facilitate context propagation by adding a Trace ID to all logs related to a specific request. This reduces the mean time to problem resolution by enhancing the observability of the system from multiple viewpoints. 

1.  The second step is to **collect** the telemetry data from the producers and deliver it to the aggregators or buffers. You can use native AWS services (such as [Amazon Kinesis Agent](https://docs.aws.amazon.com/firehose/latest/dev/writing-with-agents.html), [CloudWatch agents](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_OpenSearch_Stream.html), or [AWS Distro for OpenTelemetry](https://aws.amazon.com/otel/?otel-blogs.sort-by=item.additionalFields.createdDate&otel-blogs.sort-order=desc)), which let you instrument your applications just once and collect and correlate metrics and traces, along with contextual information and metadata about where the application is running. You can also use a number of lightweight shippers, such as [Fluentd](https://docs.fluentd.org/) to collect logs, [Fluent Bit](https://fluentbit.io/) to collect both logs and metrics, or open-source [OpenTelemetry](https://opentelemetry.io/docs/). 

1.  Before sending the data to [Amazon OpenSearch Service](https://aws.amazon.com/opensearch-service/), we recommend that you **buffer or aggregate** information from the collectors to reduce the overall number of connections to the domain, and that you use the `_bulk` API to send batches of documents rather than single documents. At this stage (or at the collection stage), it is also possible to transform and aggregate the data for the downstream analytics tools. To do this, you can use AWS services such as [Amazon Data Firehose](https://aws.amazon.com/kinesis/data-firehose/) and [Amazon Managed Streaming for Apache Kafka](https://aws.amazon.com/msk/). For large-scale environments, you can use [Amazon S3](https://aws.amazon.com/s3/) to back up the data. It is also possible to use open-source tools such as OpenSearch Data Prepper for trace and log analytics, or the [open source version of Logstash](https://opensearch.org/docs/latest/clients/logstash/index/) (check compatibility with Amazon OpenSearch Service [here](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/managedomains-logstash.html)). 

1.  Amazon OpenSearch Service makes it easy for you to index and **store** telemetry data to perform interactive analytics. Amazon OpenSearch Service is built to handle a large volume of structured and unstructured data from multiple data sources at high ingestion rates. It integrates not only with AWS services, but also with open-source tools such as the ones listed previously. It is also possible to use [Amazon Managed Service for Prometheus](https://aws.amazon.com/prometheus/) to **store** and **query** operational metrics. The service is integrated with Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Elastic Container Service (Amazon ECS), and AWS Distro for OpenTelemetry. 

1.  OpenSearch Dashboards is the default **visualization** tool for data in Amazon OpenSearch Service. It also serves as a user interface for many of the OpenSearch plugins, including Observability, Security, Alerting, Index State Management, and SQL. You can also conduct interactive analysis and visualization on data with [Piped Processing Language (PPL)](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ppl-support.html), a query interface. You can use [Amazon Managed Grafana](https://aws.amazon.com/grafana/) to complement Amazon OpenSearch Service on the visualization layer, and you can connect Amazon Managed Grafana to Amazon Managed Service for Prometheus to query, **visualize**, alert on, and understand metric data. 
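As a sketch of the buffering stage (step 3), the `_bulk` API expects an NDJSON body of alternating action and document lines. A minimal helper that batches documents into such a payload (the index name and documents are illustrative assumptions):

```python
import json

def to_bulk_payload(index, docs):
    """Format documents as an OpenSearch _bulk NDJSON body: each document
    becomes an action line followed by the document source line."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the _bulk API requires a trailing newline

payload = to_bulk_payload("app-logs", [{"msg": "started"}, {"msg": "ready"}])
# The payload would then be POSTed to https://<domain-endpoint>/_bulk
# with the Content-Type header set to application/x-ndjson.
```

Sending one batched request per few hundred or few thousand documents, instead of one request per document, is what keeps the connection count to the domain low.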

# Configuration notes
<a name="configuration-notes-3"></a>

 As shown in the previous section, there is a non-exhaustive list of options and tools that you can choose from to implement an operational analytics pipeline. The following configuration considerations will help you build a well-architected operational pipeline. 

 **Define operational goals and business requirements:** As a best practice, you should always start by identifying your operational goals and what business outcome you must reach. Think about who your end users are, what insights will help drive their decisions, and how they will access these insights. After you define the business requirements, you can start designing your technical pipeline, establishing the integration options in your environment, and reviewing the skill sets you have, to choose the right option. 

 **Choose a data model before ingestion:** When bringing data in from disparate sources, especially from structured systems into schemaless systems such as OpenSearch, special care must be taken to ensure that the chosen data model provides a frictionless search experience for users. 
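For example, a deliberate index mapping chosen before ingestion might distinguish exact-match fields from full-text fields. A minimal sketch (the field names and index layout are illustrative assumptions):

```python
# A minimal OpenSearch index mapping sketch. "keyword" fields support
# exact-match filtering and aggregations; "text" fields support full-text
# search. (Field names here are illustrative assumptions.)
mapping = {
    "mappings": {
        "properties": {
            "timestamp": {"type": "date"},
            "service": {"type": "keyword"},
            "message": {"type": "text"},
            "latency_ms": {"type": "integer"},
        }
    }
}
# A body like this would be PUT to https://<domain-endpoint>/<index-name>
# before the first documents are ingested.
```

Deciding up front which fields are `keyword` versus `text` avoids costly reindexing later and keeps dashboards and aggregations fast.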

 **Ingestion pipeline:** You should make sure that your ingestion framework is reusable and extensible, so it can scale and include new use cases over the long term. Otherwise, check which parts of your infrastructure would require modernization.  

 **Production ready tools and services:** AWS offers a set of managed services that are production ready and which eliminate the operational overhead of managing the infrastructure, such as Amazon OpenSearch Service. As shared in the reference architecture, you can also integrate open source tools, such as OpenSearch Data Prepper, to transform and aggregate the operational data for downstream analytics and visualizations. 

 **Sizing an OpenSearch domain:** The first step in sizing an OpenSearch cluster is to check your data size and identify your storage and query requirements. Estimate the number of active shards you will have per index based on your input data and the shard size that you identify. Then, estimate your vCPU requirements and choose the type of instances that will be able to handle both storage and vCPUs. Plan for time to benchmark the domain with a realistic dataset using [OpenSearch Benchmark](https://github.com/opensearch-project/opensearch-benchmark), then tune the configuration and iterate until you meet the required performance in terms of throughput, search latency, and index latency. For more information, see [Sizing Amazon OpenSearch Service domains](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/sizing-domains.html) and [Best practices for configuring your Amazon OpenSearch Service domain](https://aws.amazon.com/blogs/big-data/best-practices-for-configuring-your-amazon-opensearch-service-domain/). 
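The first sizing pass can be sketched as back-of-the-envelope arithmetic. The overhead factors below (1.45 for storage on top of source data, ~10% index overhead, and a 30 GiB target shard size) are illustrative heuristics in the spirit of the AWS sizing guidance; verify them against the sizing documentation for your workload:

```python
def estimate_domain(daily_source_gb, retention_days, replicas=1, target_shard_gb=30):
    """Rough first-pass sizing sketch using illustrative heuristics.

    - minimum storage: source data x (1 + replicas) x 1.45, where 1.45
      approximates index, operating-system, and service overhead
    - primary shards: source data x ~1.1 index overhead / target shard size
    """
    source_gb = daily_source_gb * retention_days
    minimum_storage_gb = source_gb * (1 + replicas) * 1.45
    primary_shards = max(1, round(source_gb * 1.1 / target_shard_gb))
    return minimum_storage_gb, primary_shards

# Example: 10 GB/day of source data retained for 30 days, with 1 replica.
storage_gb, shards = estimate_domain(10, 30)
```

The output is only a starting point for choosing instance types; benchmarking with a realistic dataset, as described above, is what validates the final configuration.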

 **Use tiered storage:** The value of operational data or any timestamped data generally decreases with the age of the data. Moving aged data into tiered storage can save significant operational cost. Summarized rollups that can condense the data can also help address storage cost. 

 **Performance:** There are multiple parameters to consider when thinking about performance, and it is always specific to each workload. However, Amazon OpenSearch Service offers features that you can already enable in your domain, such as [Auto-Tune](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/auto-tune.html), which automatically deploys optional changes to improve cluster speed and stability. Other items to take into consideration include using the `_bulk` API to load data into OpenSearch, and only indexing data fields that need to be searchable. 

 **Define security requirements:** Make sure to set up your domain inside a Virtual Private Cloud (VPC) to secure the traffic to your domain. Apply the least privilege access approach with restrictive access policies, or with fine-grained access control for OpenSearch Dashboards. OpenSearch Service also offers encryption of data at rest and in transit. 

 **Monitor all involved components:** Monitor all involved components with metrics in Amazon CloudWatch. With the CloudWatch metrics available for Amazon OpenSearch Service, you can monitor the overall cluster health, check the performance of individual nodes, and monitor EBS volume metrics. It is also a best practice to set [CloudWatch alarms](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/cloudwatch-alarms.html) to get notified about any issues that your production domain encounters. You can start by setting the following alarms: 
+  `CPUUtilization` maximum is >= 80% for 15 minutes, 3 consecutive times 
+  `ClusterStatus.yellow` maximum is >= 1 for 1 minute, 1 consecutive time 
+  `JVMMemoryPressure` maximum is >= 80% for 5 minutes, 3 consecutive times 
+  `FreeStorageSpace` minimum is <= 25% of the storage space for 1 minute, 1 consecutive time 
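The first alarm above can also be created programmatically. A hedged sketch that builds the request parameters for `put_metric_alarm` (the domain name, account ID, and SNS topic ARN are placeholders for your environment):

```python
def cpu_high_alarm_params(domain_name, account_id, sns_topic_arn):
    """Parameters for the CPUUtilization alarm described above:
    maximum >= 80% over three consecutive 5-minute periods (15 minutes)."""
    return {
        "AlarmName": f"{domain_name}-cpu-utilization-high",
        "Namespace": "AWS/ES",  # namespace used by Amazon OpenSearch Service metrics
        "MetricName": "CPUUtilization",
        "Dimensions": [
            {"Name": "DomainName", "Value": domain_name},
            {"Name": "ClientId", "Value": account_id},
        ],
        "Statistic": "Maximum",
        "Period": 300,            # 5-minute periods
        "EvaluationPeriods": 3,   # 3 consecutive periods = 15 minutes
        "Threshold": 80.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],
    }

params = cpu_high_alarm_params("my-domain", "123456789012", "arn:aws:sns:...:alerts")
# boto3.client("cloudwatch").put_metric_alarm(**params) would create the alarm.
```

The other three alarms follow the same shape with the metric name, statistic, period, and threshold swapped for the values listed above.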