

# Scenarios
<a name="scenarios"></a>

In this section, we cover seven key scenarios that are common in many analytics applications. We describe how they influence the design and architecture of your analytics environment in AWS. For each scenario, we present the assumptions made, the common drivers for the design, and a reference architecture for how to implement it. 

**Topics**
+ [Data discovery](data-discovery.md)
+ [Modern data architecture](modern-data-architecture.md)
+ [Batch data processing](batch-data-processing.md)
+ [Streaming ingest and stream processing](streaming-ingest-and-stream-processing.md)
+ [Operational analytics](operational-analytics.md)
+ [Data visualization](data-visualization.md)
+ [Data mesh](data-mesh.md)

# Data discovery
<a name="data-discovery"></a>

 Many organizations now treat data as an organizational asset, meaning it is no longer the property of individual departments. You want to analyze all types of data to drive actionable insights, be prepared for the unexpected, create new revenue streams, improve customer experience, and increase operational efficiencies. 

 The data discovery process consists of a number of interactive sessions with various stakeholders within an organization. Sometimes this starts with an exploratory session to identify new ways to extract value from your data; at other times, it starts with a specific use case you already have in mind. You can go straight into identifying key people and diving deep to gather the information needed to determine your best possible solution. 

 In either case, the end goal is to maximize the value you get from the data and identify appropriate next steps. This includes how you plan to consume data, what data sources you have and how to ingest that data, and then potentially what types of transformations you may need for the data. 

# Characteristics
<a name="characteristics"></a>

 The common issue that can hold you back from maximizing the value of your data is the variety of data silos within your organization. Silos can prevent you from extracting maximum value from all your data with the greatest flexibility. Data warehouses can help with this to a point, but often only a small portion of raw data is brought into the data warehouse. Organizations often end up with multiple data warehouses, so you can still have these silos. There are a number of modern approaches to enterprise-wide analytics that can help solve this – such as data lakes, modern data architectures, and data mesh. If you are not already exploring these modern approaches, this is a good opportunity to learn more about these architectures in the following sections. 

 Another common issue is whether you can benefit from increasing the velocity at which you ingest and process your data. Many organizations still have a predominantly batch-oriented strategy, where the majority of data is processed on a daily schedule. Ask yourself questions such as "How can we benefit if we have access to more up-to-date data?" AWS can help you explore options for ingesting streaming data, and for processing data in micro-batches or employing stream processing. 
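To make the micro-batch idea concrete, here is a minimal sketch in plain Python (the 15-minute window size, event shapes, and function names are illustrative, not tied to any AWS service) that groups timestamped events into fixed processing windows:

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=15)  # illustrative micro-batch interval

def micro_batches(events):
    """Group (timestamp, payload) events into fixed 15-minute windows."""
    batches = defaultdict(list)
    for ts, payload in events:
        # Align each timestamp down to the start of its window.
        window_start = datetime.min + ((ts - datetime.min) // WINDOW) * WINDOW
        batches[window_start].append(payload)
    return dict(batches)

events = [
    (datetime(2024, 1, 1, 9, 2), "order-1"),
    (datetime(2024, 1, 1, 9, 14), "order-2"),
    (datetime(2024, 1, 1, 9, 16), "order-3"),
]
batches = micro_batches(events)  # two windows: 09:00 and 09:15
```

A daily batch would use a 24-hour window in the same structure; shrinking the window is what moves you from batch toward near-real-time processing.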

 If you have explored streaming options in the past, you might have been concerned about the complexity of some of these solutions. However, there are many AWS managed solutions for streaming that significantly reduce this complexity. 

 In general, data discovery consists of five steps: 

## 1.  Define the business value 
<a name="define-the-business-value"></a>

 This is the first step in data discovery where you define the business value or opportunity by conducting interactive sessions. Here are a few example questions to define the business opportunity. 
+  What insights are you getting from the data? 
+  How would getting insight into data provide value to the business? 
+  Are you looking to create a new revenue stream from your data? 
+  What are the challenges with your current approach and tools?  
+  What are you not providing to your customers that you would like to provide? 
+  Who is the executive level stakeholder for this effort? 
+  Example-specific use case questions: 
  +  How does data define your customer acquisition strategy? 
  +  Would your business benefit from modern approaches to use cases such as fraud detection, predictive maintenance, customer 360, IoT, clickstream, or operational analytics, or from root-cause analysis that reduces mean time to detection and mean time to recovery? 
  +  How are you continually innovating on behalf of your customers and improving their user experience? 

## 2. Identify your user personas 
<a name="identify-your-user-personas"></a>

 In this step, you focus on your data consumers, such as business analysts, data engineers, data analysts, and data scientists. Once you have developed your user personas, enable them for purpose-built analytics and machine learning.  

 Here are a few example questions to identify your data consumers. 
+  Who are the end users?  
+  What insights are you currently getting from your data? 
+  What insights are on your roadmap? 
+  Do you have a multi-tenant data model? 
+  What are the different consumption models? 
  +  Which tool or interface do your data consumers use? 
  +  How real time does the data need to be for this use case (for example, near real time, every 15 minutes, hourly, daily)?  
  +  What is the total number of consumers for this consumption model? 
  +  What is the peak concurrency? 

## 3. Identify your data sources
<a name="identify-your-data-sources"></a>

 In this step, you focus on your data sources and tools to bring that data into the data platform. This allows you to perform comprehensive analytics and machine learning from a wide variety of data from various data sources. 

**Data types and sources**

 Table 3: Typical data sources in an organization 


|  **Data type**  |  **Example data sources**  | 
| --- | --- | 
|  Structured data  |  ERP applications, CRM applications, CMS applications, SaaS applications, SAP applications, line of business (LOB) applications, and SQL databases  | 
|  Semi-structured data  |  Web applications, NoSQL databases, EDI (electronic data interchange), CSV, XML, and JSON documents  | 
|  Unstructured data  |  Video files, audio files, images, IoT data, sensor data, and invoices  | 
|  Batch data  |  Internal applications that generate structured data on regularly defined schedules  | 
|  Streaming data  |  Sensors, social media, video streams, IoT devices, and mobile devices that generate semi-structured and unstructured data as continuous streams  | 

 Here are a few example questions to identify your data sources. 
+  How many data sources do you have to support? 
  +  Where and how is the data generated? 
  +  What are the different types of your data? (for example, structured, semi-structured, unstructured, batch, streaming) 
  +  What are the different formats of your data? (for example, JSON, CSV, FHIR) 
  +  Is your data originating from on premises, a third-party vendor, or the cloud? 
  +  Is the data source streaming, batch, or micro-batch? 
  +  What is the rate and volume of ingestion? 
  +  What is the ingestion interface? (for example, API, SFTP, Amazon S3, AWS Marketplace) 
+  How does your team onboard new data sources? 

## 4. Define your data storage, catalog, and data access needs
<a name="define-your-data-storage-catalog-and-data-access-needs"></a>

In this step, you focus on your data storage, data cataloging, security, compliance, and data access requirements. 

 Here are a few example questions to identify your data storage and data access requirements. 
+  What data stores do you have? 
+  What is the purpose of each data store? 
+  Why that storage method? (for example, files, SQL, NoSQL, data warehouse) 
+  How do you currently organize your data? (for example, data tiering, partition) 
+  How much data are you storing now, and how much do you expect to be storing in the future, for example, 18 months from now? 
+  How do you manage data governance?  
+  What data regulatory and governance compliance do you face? 
+  What is your disaster recovery (DR) strategy? 

## 5. Define your data processing requirements
<a name="define-your-data-processing-requirements"></a>

In this step, you focus on your data processing requirements. 

 Here are a few example questions to identify your data processing requirements. 
+  Do you have to transform or enrich the data before you consume it? 
+  What tools do you use for transforming your data? 
+  Do you have a visual editor for the transformation code?  
+  What is your frequency of data transformation? (for example, real time, micro-batching, overnight batch)  
+  Are there any constraints with your current tool of choice? 

# Reference architecture
<a name="s1-reference-architecture"></a>

 The following diagram illustrates the solution architecture and its key components for data cataloging, security, compliance, and data access requirements using DataHub. 

![\[Reference architecture for data discovery\]](http://docs.aws.amazon.com/wellarchitected/latest/analytics-lens/images/scenario-1-ref.png)


1.  DataHub is an open-source metadata management platform that enables end-to-end discovery, data observability, data governance, data lineage, and more. It runs on an Amazon EKS cluster, using Amazon OpenSearch Service, Amazon Managed Streaming for Apache Kafka (Amazon MSK), and Amazon RDS for MySQL as the storage layer for the underlying data model and indexes. 

1.  Pull technical metadata from AWS Glue and Amazon Redshift to DataHub. 

1.  Enrich the technical metadata with a business glossary. 

1.  Run an AWS Glue job to transform the data and observe the data lineage in DataHub. 

# Modern data architecture
<a name="modern-data-architecture"></a>

 Organizations have been building data lakes to analyze massive amounts of data for deeper insights. To do this, they bring data from multiple silos into their data lake, and then run analytics and AI/ML directly on it. It is also common for these organizations to have data stored in specialized data stores, such as a NoSQL database, a search service, or a data warehouse, to support different use cases. To analyze all of the data that is spread across the data lake and other data stores efficiently, businesses often move data in and out of the data lake and between these data stores. This data movement can become complex and messy as the data in these stores grows. 

 To address this, businesses need a data architecture that allows building scalable, cost-effective data lakes. The architecture can also support simplified governance and data movement between various data stores. We refer to this as a *modern data architecture*. Modern data architecture integrates a data lake, a data warehouse, and other purpose-built data stores while enabling unified governance and seamless data movement. 

 As shown in the following diagram, with a modern data architecture, organizations can store their data in a data lake and use purpose-built data stores that work with the data lake. This approach allows access to all of the data to make better decisions with agility. 

![\[Diagram showing a modern data architecture\]](http://docs.aws.amazon.com/wellarchitected/latest/analytics-lens/images/modern-data-architecture.png)




 There are three different patterns for data movement. They can be described as follows: 

 **Inside-out data movement:** A subset of data in a data lake is sometimes moved to a data store, such as an Amazon OpenSearch Service cluster or an Amazon Neptune cluster. This pattern supports specialized analytics, such as search analytics, building knowledge graphs, or both. For example, enterprises send information from structured sources (such as relational databases), unstructured sources (such as metadata, media, or spreadsheets) and other assets to a data lake. From there, it is moved to Amazon Neptune to build a knowledge graph. We refer to this kind of data movement as inside-out. 

 **Outside-in data movement:** Organizations use data stores that best fit their applications and later move that data into a data lake for analytics. For example, to maintain game state, player data, session history, and leader boards, a gaming company might choose Amazon DynamoDB as the data store. This data can later be exported to a data lake for additional analytics to improve the gaming experience for their players. We refer to this kind of data movement as outside-in. 

 **Around the perimeter:** In addition to the two preceding patterns, there are scenarios where the data is moved from one specialized data store to another. For example, enterprises might copy customer profile data from their relational database to a NoSQL database to support their reporting dashboards. We refer to this kind of data movement as around the perimeter.

# Characteristics
<a name="characteristics-1"></a>

 **Scalable data lake:** A data lake should be able to scale easily to petabytes and exabytes as data grows. Use a scalable, durable data store that provides the fastest performance at the lowest cost, supports multiple ways to bring data in, and has a good partner ecosystem. 

 **Data diversity:** Applications generate data in many formats. A data lake should support diverse data types—structured, semi-structured, or unstructured. 

 **Schema management:** A modern data architecture should support schema on read for a data lake, without imposing strict requirements on source data. The choice of storage structure, schema, ingestion frequency, and data quality should be left to the data producer. A data lake should also be able to incorporate changes to the structure of incoming data, a capability referred to as schema evolution. In addition, schema enforcement helps businesses ensure data quality by preventing writes that do not match the schema. 
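The distinction between enforcement (reject non-conforming writes) and evolution (accept new fields) can be sketched with a toy validator. The dict-based schema registry and field names below are purely illustrative; a real data lake would track schemas in a catalog such as the AWS Glue Data Catalog.

```python
# Toy sketch of schema enforcement and schema evolution for a data lake table.
schema = {"order_id": int, "amount": float}

def enforce(record, schema):
    """Schema enforcement: reject writes whose fields do not match."""
    for field, value in record.items():
        if field in schema and not isinstance(value, schema[field]):
            raise TypeError(f"{field}: expected {schema[field].__name__}")
    return record

def evolve(schema, record):
    """Schema evolution: add newly seen fields to the schema."""
    for field, value in record.items():
        schema.setdefault(field, type(value))
    return schema

enforce({"order_id": 7, "amount": 19.99}, schema)              # passes
evolve(schema, {"order_id": 8, "amount": 5.0, "coupon": "A"})  # adds "coupon"
```

With enforcement in place, a write such as `{"order_id": "seven"}` fails fast instead of silently corrupting the table.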

 **Metadata management:** Data should be self-discoverable, with the ability to track lineage as data flows through tiers within the data lake. A comprehensive data catalog that captures the metadata and provides a queryable interface for all data assets is recommended. 

 **Unified governance:** A modern data architecture should have a robust mechanism for centralized authorization and auditing. Configuring access policies in the data lake and across all the data stores can be overly complex and error prone. Having a centralized location to define the policies and enforce them is critical to a secure modern data architecture. 

 **Transactional semantics:** In a data lake, data is often ingested nearly continuously from multiple sources and is queried concurrently by multiple analytic engines. Having atomic, consistent, isolated, and durable (ACID) transactions is pivotal to keeping data consistent. 

 **Transactional data lake:** Data lakes offer one of the best options for cost, scalability, and flexibility to store data at a low cost, and to use this data for different types of analytics workloads. However, data lakes are not databases, and object storage does not provide support for ACID processing semantics, which you may require to effectively optimize and manage your data at scale across hundreds or thousands of users using a multitude of different technologies. Open table formats provide additional database-like functionality that simplifies the optimization and management overhead of data lakes, while still supporting storage on cost-effective systems. These features include: 
+  **ACID transactions:** Allowing a write to completely succeed or be rolled back in its entirety 
+  **Record-level operations:** Allowing for single rows to be inserted, updated, or deleted 
+  **Indexes:** Improving performance in addition to data lake techniques like partitioning 
+  **Concurrency control:** Allowing for multiple processes to read and write the same data at the same time 
+  **Schema evolution:** Allowing for columns of a table to be added or modified over the life of a table 
+  **Time travel:** Allowing data to be queried as of a point in time in the past 
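The time travel feature in the list above can be illustrated with a toy snapshot model: each commit appends an immutable snapshot, and a query "as of" a timestamp reads the latest snapshot at or before that time. Open table formats such as Apache Iceberg track real snapshots in table metadata; the class below only sketches the concept.

```python
import bisect

class Table:
    """Toy table where every commit is an immutable snapshot."""
    def __init__(self):
        self.commits = []  # (timestamp, rows) pairs, timestamps ascending

    def commit(self, ts, rows):
        self.commits.append((ts, list(rows)))

    def as_of(self, ts):
        # Find the latest snapshot at or before `ts`.
        times = [t for t, _ in self.commits]
        i = bisect.bisect_right(times, ts)
        if i == 0:
            raise LookupError("no snapshot at or before that time")
        return self.commits[i - 1][1]

t = Table()
t.commit(100, [{"id": 1}])
t.commit(200, [{"id": 1}, {"id": 2}])
print(t.as_of(150))  # prints [{'id': 1}]
```

Because old snapshots are never mutated, time travel also gives you reproducible queries and easy rollback after a bad write.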

 The three most common and prevalent open table formats are Apache Hudi, Apache Iceberg, and Delta Lake. 

# Reference architecture
<a name="reference-architecture"></a>

![\[Reference architecture diagram for a modern data architecture\]](http://docs.aws.amazon.com/wellarchitected/latest/analytics-lens/images/modern-data-architecture-reference-architecture.png)


# Configuration notes
<a name="configuration-notes"></a>
+  To organize data for efficient access and easy management: 
  +  The storage layer can store data in different states of consumption readiness, including raw, trusted, conformed, enriched, and modeled. It’s important to segment your data lake into landing, raw, trusted, and curated zones to store data depending on its consumption readiness. Typically, data is ingested and stored as is in the data lake (without having to first define schema) to accelerate ingestion and reduce time needed for preparation before data can be explored. 
  +  Partition data with keys that align to common query criteria. 
  +  Convert data to an open columnar file format, and apply compression. This lowers storage usage and increases query performance. 
+  Choose the proper storage tier based on data temperature. Establish a data lifecycle policy to delete old data automatically to meet your data retention requirements. 
+  Decide on a location for data lake ingestion, for example, an S3 bucket. Select a frequency and isolation mechanism that meet your business needs. 
+  Depending on your ingestion frequency and data mutation rate, schedule file compaction to maintain optimal performance. 
+  Use AWS Glue crawlers to discover new datasets, track lineage, and avoid a data swamp. 
+  Manage access control and security using AWS Lake Formation, IAM role setting, AWS KMS, and AWS CloudTrail. 
+  There is no need to move data between a data lake and the data warehouse for the data warehouse to access it. Amazon Redshift Spectrum can directly access the dataset in the data lake. 
+  For more details, refer to the [Derive Insights from AWS Modern Data](https://docs.aws.amazon.com/whitepapers/latest/derive-insights-from-aws-modern-data/derive-insights-from-aws-modern-data.html) whitepaper. 
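As a minimal illustration of the partitioning guidance above (the bucket name, column names, and records are hypothetical), the following sketch groups records under Hive-style `key=value` prefixes, the layout that lets engines such as Athena or Redshift Spectrum prune partitions that a query's filter excludes:

```python
from collections import defaultdict

def partition_prefix(record, keys):
    """Build a Hive-style partition path such as 'year=2024/month=1'."""
    return "/".join(f"{k}={record[k]}" for k in keys)

def layout(records, keys, bucket="s3://my-data-lake/sales"):
    """Map each partition prefix to the records that would land there."""
    files = defaultdict(list)
    for r in records:
        files[f"{bucket}/{partition_prefix(r, keys)}/"].append(r)
    return dict(files)

records = [
    {"year": 2024, "month": 1, "store": "A", "total": 120.0},
    {"year": 2024, "month": 1, "store": "B", "total": 80.0},
    {"year": 2024, "month": 2, "store": "A", "total": 95.0},
]
tree = layout(records, ["year", "month"])  # two partitions: month=1, month=2
```

A query filtered on `year = 2024 AND month = 1` would then read only the first prefix; choosing partition keys that match common query criteria is what makes this pruning effective.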

   

## User personas
<a name="user-personas"></a>

 To get the full value from your modern data architecture, there are various personas who will access the data and perform data analytics. For example, the chief data officer (CDO) of an organization is responsible for driving digital innovation and transformation across lines of business. This CDO should set a data-driven vision for the organization and be a champion of using data, analytics, and AI/ML to inform business decisions. 

 Table 4: Key personas for a modern data architecture 


|  Personas  |  Responsibility  |  Areas of interest  |  Modern data architecture purpose-built AWS services  | 
| --- | --- | --- | --- | 
|  Chief data officer (CDO)  |  Build a culture of using data to solve problems and accelerate innovation.  |  Data quality, data governance, data and AI strategy, evangelizing the value of data to the business.  |  AWS Lake Formation, Amazon OpenSearch Service  | 
|  Data architect  |  Architect technical solutions to meet business needs. Focuses on solving complex data challenges to help the CDO deliver on their vision.  |  Data pipelines, data processing, data integration, data governance, and data catalogs.  |  AWS Glue, Amazon EMR, Amazon Redshift, Amazon Athena, Amazon OpenSearch Service  | 
|  Data engineer  |  Deliver usable, accurate datasets to the organization in a secure and performant manner.  |  Variety of tools to build data pipelines, ease of use, configuration, and maintenance.  |  AWS Glue, Amazon EMR, Amazon Kinesis, Amazon Redshift, Amazon Athena, Amazon OpenSearch Service  | 
|  Data security officer  |  Ensure that data security, privacy, and governance are strictly defined and adhered to.  |  Keeping information secure, complying with data privacy regulations, protecting personally identifiable information (PII), and applying fine-grained access controls and data masking.  |  AWS Lake Formation, AWS Identity and Access Management (IAM)  | 
|  Data scientist  |  Construct the means for quickly extracting business-focused insight from data so that the business can make better decisions.  |  Tools that simplify data manipulation and provide deeper insight than visualization tools. Tools that help build the ML pipeline.  |  Amazon SageMaker AI, Amazon Athena, Amazon QuickSight, AWS Glue Studio, AWS Glue DataBrew  | 
|  Data analyst  |  React to market conditions in real time; must be able to find data and perform analytics quickly and easily.  |  Querying data and performing analysis to create new business insights, producing reports and visualizations that explain the business insights.  |  Amazon Athena, Amazon QuickSight, AWS Glue Studio, Amazon Redshift  | 

# Batch data processing
<a name="batch-data-processing"></a>

 Most analytics applications require frequent batch processing that allows them to process data in batches at varying intervals. For example, processing daily sales aggregations by individual store and then writing that data to the data warehouse on a nightly basis can allow business intelligence (BI) reporting queries to run faster. Batch systems must be built to scale for all sizes of data and to scale seamlessly to the size of the dataset being processed by various job runs. 

 It is important for the batch processing system to be able to support disparate source and target systems. This includes processing various data formats, seamlessly scaling out to process peak data volumes, orchestrating jobs using workflows, providing a simple way to monitor the jobs, and, most importantly, offering an easy-to-use development framework that accelerates job development. Business requirements might dictate that batch data processing jobs be bound by an SLA, or have certain budget thresholds. Use these requirements to determine the characteristics of the batch processing architecture. 

 On AWS, analytics services such as Amazon EMR, Amazon Redshift, Lake Formation blueprints, and the [AWS Glue](https://aws.amazon.com/glue/) family of services, namely AWS Glue ETL, [AWS Glue workflows](https://docs.aws.amazon.com/glue/latest/dg/workflows_overview.html), and [AWS Glue DataBrew](https://aws.amazon.com/glue/features/databrew/), allow you to run batch data processing jobs at scale for all batch data processing use cases and for various personas. These personas include data engineers, data analysts, and data scientists. While there are some overlapping capabilities between these services, knowing the core competencies and when to use which service or services allows you to accomplish your objectives in the most effective way. 

# Characteristics
<a name="characteristics-2"></a>

 Following are the key characteristics that determine how you should plan when developing a batch processing architecture. 

 **Easy-to-use development framework:** This is one of the most important characteristics, because it allows ETL developers, data engineers, data analysts, and data scientists to improve their overall efficiency. An ETL developer benefits from a hybrid development interface: building part of a job visually and switching to writing customized complex code where applicable. Data analysts and data scientists spend much of their time preparing data for actual analysis, or capturing feature engineering data for their machine learning models. You can improve their efficiency in data preparation by adopting a no-code data preparation interface, which helps them normalize and clean data up to 80% faster compared to traditional approaches. 

 **Support disparate source and target systems:** Your batch processing system should support different types of data sources and targets between relational, semi-structured, non-relational, and SaaS providers. When operating in the cloud, a connector ecosystem can benefit you by seamlessly connecting to various sources and targets, and can simplify your job development. 

 **Support various data file formats:** Some of the commonly seen data formats are CSV, Excel, JSON, Apache Parquet, Apache ORC, XML, and Logstash Grok. Your job development can be accelerated and simplified if the batch processing services can natively profile these various file formats, and infer schema automatically (including complex nested structures) so that you can focus more on building transformations. 

 **Seamlessly scale out to process peak data volumes:** Most batch processing jobs experience varying data volumes. Your batch processing job should scale out to handle peak data spikes and scale back in when the job completes. 

 **Simplified job orchestration with job bookmarking capability:** The ability to develop job orchestration with dependency management, and to author the workflow using an API, the CLI, or a graphical user interface, allows for robust CI/CD integration. 
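The bookmarking idea is simply persisted progress: each run records the last position it processed, and the next run resumes from there. AWS Glue job bookmarks manage this state for you; the file-backed store and record shapes in this sketch are only illustrative.

```python
import json
import os
import tempfile

# Illustrative bookmark location; Glue would keep this state internally.
BOOKMARK = os.path.join(tempfile.gettempdir(), "job_bookmark.json")
if os.path.exists(BOOKMARK):
    os.remove(BOOKMARK)  # start fresh for this demo

def load_bookmark():
    if os.path.exists(BOOKMARK):
        with open(BOOKMARK) as f:
            return json.load(f)["last_id"]
    return 0

def run_job(records):
    """Process only records newer than the bookmark, then advance it."""
    last = load_bookmark()
    new = [r for r in records if r["id"] > last]
    # ... transform and load `new` here ...
    if new:
        with open(BOOKMARK, "w") as f:
            json.dump({"last_id": max(r["id"] for r in new)}, f)
    return new

records = [{"id": 1}, {"id": 2}, {"id": 3}]
first = run_job(records)   # processes all three records
second = run_job(records)  # processes nothing; bookmark is up to date
```

Because progress survives between runs, a rerun after a failure picks up where the last successful run left off instead of reprocessing the whole source.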

 **Ability to monitor and alert on job failure:** This is an important measure for ease of operational management. Having quick and easy access to job logs, and a graphical monitoring interface to access job metrics can help you identify errors and tuning opportunities quickly for your job. Coupling that with an event-driven approach to alert on job failure will be invaluable for easing operational management. 

 **Provide a low-cost solution:** Costs can quickly get out of control if you do not plan correctly. A pay-as-you-go pricing model for both compute and authoring jobs can help you overcome hefty costs upfront and allows you to pay only for what you use instead of overpaying to accommodate for peak workloads. Use automatic scaling to accommodate spiky workloads when necessary. Using Spot Instances where applicable can bring your costs down for workloads where they are a good fit. 

# Reference architecture
<a name="reference-architecture-1"></a>

![\[Diagram showing a batch data processing reference architecture\]](http://docs.aws.amazon.com/wellarchitected/latest/analytics-lens/images/batch-data-processing-reference-architecture.png)


 

1.  Batch data processing systems typically require a persistent data store for source data. When developing batch data processing applications on AWS, you can use data from various sources, including your on-premises data stores, Amazon RDS, Amazon S3, DynamoDB, and any other databases that are accessible in the cloud. 

1.  Data processing jobs need access to a variety of data stores to read data. You can use AWS Glue connectors from the AWS Marketplace to connect to a variety of data stores, such as Google BigQuery and SAP HANA. You also can connect to SaaS application providers, such as Salesforce, ServiceNow, and Google Analytics, using AWS Glue DataBrew and Amazon AppFlow. In addition, you can always rely on the custom JDBC capability in Apache Spark and connect to any JDBC-compliant data store from Amazon EMR or AWS Glue jobs. 

1.  Choosing the right authoring tool for the job simplifies job development and improves agility. 

   1.  You can use AWS Glue Studio or Glue interactive sessions when authoring jobs for the AWS Glue Spark runtime engine. 

   1.  Use AWS Glue blueprints when you create a self-service parametrized job for analysts and control what data the analyst is allowed to access. 

   1.  Use Amazon EMR notebooks for interactive job development and scheduling notebook jobs against Amazon EMR. 

   1.  Use Amazon SageMaker AI notebooks when working within SageMaker AI and pre-processing data using Spark on Amazon EMR. 

   1.  Use AWS Glue DataBrew from the AWS Management Console or from a Jupyter notebook for a no-code development experience. 

   1.  Use Lake Formation blueprints to quickly create batch data ingestion jobs to rapidly build a data lake in AWS. 

1.  Choosing the right processing engine for your batch jobs allows you to be flexible with managing costs and lowering operational overhead. Amazon EMR, AWS Glue ETL, and Amazon Redshift offer the ability to scale seamlessly based on your job runtime metrics using managed scaling, automatic scaling, and concurrency scaling features, respectively. Amazon EMR and Amazon Redshift offer both server-based and serverless architectures, while the other services depicted in the reference architecture are fully serverless. Amazon EMR (server-based) allows you to use Spot Instances for suitable workloads, which can further reduce your costs. A good strategy is to complement these processing engines to meet the business objectives of SLA, functionality, and lower TCO by choosing the right engine for the right job. 

1.  Batch processing jobs usually require writing processed data to a target persistent store. This store can reside in AWS, in on-premises environments, or with other cloud providers. You can use the rich connector interface AWS Glue offers to write data to various target platforms, such as Amazon S3, Snowflake, and Amazon OpenSearch Service. You can also use the native Spark JDBC connector feature to write data to any supported JDBC target. 

1.  All batch jobs require a workflow that can handle dependency checks to ensure no downstream impacts, and a bookmarking capability that allows them to resume where they left off in the event of a failure or at the next run of the job. When using AWS Glue as your batch job processing engine, you can use the native workflow capability to create a workflow with a built-in state machine that tracks the state of your job across the entire workflow. AWS Glue jobs also support a bookmarking capability that keeps track of what has been processed and what will be processed during the next run. Similarly, AWS Lake Formation blueprints support a bookmarking capability when processing incremental data. With Amazon EMR Studio, you can schedule notebook jobs. When using any of the analytics processing engines, you can build job workflows using an external scheduler, such as AWS Step Functions or Amazon Managed Workflows for Apache Airflow (Amazon MWAA), which allows you to interoperate between any services, including external dependencies. 

1.  Batch processing jobs write data output to a target data store, which can be anywhere in the AWS Cloud, on premises, or at another cloud provider. You can use the AWS Glue Data Catalog to crawl the supported target databases to simplify writing to your target database. 
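The dependency-check requirement in the steps above reduces to running jobs in topological order. Orchestrators such as AWS Step Functions or Amazon MWAA handle this (plus retries, state, and alerting) for you; the job names and stdlib-only runner below are a hypothetical sketch of the ordering itself.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs it depends on (illustrative names).
dependencies = {
    "load_to_warehouse": {"transform"},
    "transform": {"ingest_orders", "ingest_customers"},
    "ingest_orders": set(),
    "ingest_customers": set(),
}

def run_workflow(deps):
    """Run jobs in an order that respects every dependency edge."""
    completed = []
    for job in TopologicalSorter(deps).static_order():
        # ... submit the job and wait for success here ...
        completed.append(job)
    return completed

order = run_workflow(dependencies)  # both ingests run before transform
```

A cycle in `dependencies` would raise `graphlib.CycleError`, which is exactly the misconfiguration a workflow engine must reject before scheduling anything.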

# Configuration notes
<a name="configuration-notes-1"></a>

 **Use AWS Glue Data Catalog as a central metastore for your batch processing jobs, regardless of which AWS analytics service you use as a processing engine.** Batch processing jobs cater to a variety of workloads ranging from running several times an hour or day, to running monthly or quarterly. The data volumes vary significantly and so do the consumption patterns on the processed dataset. Always work backwards to understand the business SLAs and develop your job accordingly. The central Data Catalog makes it easy for you to use the right analytic service to meet your business SLAs and other objectives, thereby creating a central analytic ecosystem. 

 **Avoid lifting and shifting server-based batch processing systems to AWS.** By lifting and shifting traditional batch processing systems into AWS, you risk running overprovisioned resources on Amazon EC2. For example, traditional Hadoop clusters are often overprovisioned and idle in an on-premises setting. Use AWS Managed Services, such as AWS Glue, Amazon EMR, and Amazon Redshift, to simplify your architecture using a modern data architecture pattern and remove the undifferentiated heavy lifting of managing clustered and distributed environments. 

 **Automate and orchestrate everywhere.** In a traditional batch data processing environment, it’s a best practice to automate and schedule your jobs in the system. In AWS, you should use automation and orchestration for your batch data processing jobs in conjunction with the AWS APIs to spin up and tear down entire compute environments, so that you are only charged when the compute services are in use. For example, when a job is scheduled, a workflow service, such as AWS Step Functions, would use the AWS SDK to provision a new EMR cluster, submit the work, and shut down the cluster after the job is complete. Similarly, you can use Terraform or a CloudFormation template to achieve similar functionality. 
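As a rough sketch of this spin-up-and-tear-down pattern, the following builds a request for a transient EMR cluster that terminates itself when its step completes. The helper name and instance sizes are illustrative assumptions; the keys follow the shape of the EMR `RunJobFlow` API, and a scheduler such as Step Functions or Amazon MWAA would pass the result to the AWS SDK (for example, boto3's `run_job_flow`).

```python
def build_transient_emr_request(name, release, step_jar, step_args, instance_count=3):
    """Build a RunJobFlow-style request for a transient EMR cluster:
    the cluster terminates when the submitted step finishes, so you
    pay only while the batch job runs."""
    return {
        "Name": name,
        "ReleaseLabel": release,
        "Instances": {
            "InstanceCount": instance_count,
            "MasterInstanceType": "m5.xlarge",     # illustrative sizes
            "SlaveInstanceType": "m5.xlarge",
            "KeepJobFlowAliveWhenNoSteps": False,  # tear down after steps
        },
        "Steps": [{
            "Name": "batch-job",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {"Jar": step_jar, "Args": step_args},
        }],
    }
```

Because `KeepJobFlowAliveWhenNoSteps` is false and the step terminates the cluster on failure, no idle cluster remains after the run, which is the cost behavior this note recommends.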

 **Use Spot Instances and Graviton-based instance types on EMR to save costs and get a better price-performance ratio**. Use Spot Instances when you have flexible SLAs that are resilient to job reruns upon failure and when there is a need to process very large volumes of data. Use Spot Fleet, EC2 Fleet, and Spot Instance features in Amazon EMR to manage Spot Instances. 

 **Continually monitor and improve batch processing jobs.** Batch processing systems evolve rapidly as data source volumes increase, new batch processing jobs are authored, and new batch processing frameworks are launched. Instrument your jobs with metrics, timeouts, and alarms to have the metrics and insight to make informed decisions on batch data processing system changes. 

# Streaming ingest and stream processing
<a name="streaming-ingest-and-stream-processing"></a>

 Processing real-time streaming data requires throughput scalability, reliability, high availability, and low latency to support a variety of applications and workloads. Some examples include: streaming ETL, real-time analytics, fraud detection, API microservices integration, activity tracking, real-time inventory and recommendations, and click-stream, log file, and IoT device analysis. 

 Streaming data architectures are built on five core constructs: data sources, stream ingestion, stream storage, stream processing, and destinations. Each of these components can be created and launched using AWS Managed Services and deployed and managed as a purpose-built solution on Amazon EC2, Amazon Elastic Container Service (Amazon ECS), or Amazon Elastic Kubernetes Service (Amazon EKS). 

 Examples of each of these components include: 

 **Data sources**: Application and click stream logs, mobile apps, existing transactional relational and NoSQL databases, IoT sensors, and metering devices. 

 **Stream ingestion and producers**: Both open source and proprietary toolkits, libraries, and SDKs for Kinesis Data Streams and Apache Kafka to create custom stream producers; AWS service integrations such as AWS IoT Core, CloudWatch Logs and Events, Amazon Data Firehose, and AWS Database Migration Service (AWS DMS); and third-party integrations. 

 **Stream storage**: Kinesis Data Streams, Amazon Managed Streaming for Apache Kafka (Amazon MSK), and Apache Kafka. 

 **Stream processing and consumers**: Amazon EMR (Spark Structured Streaming, Apache Flink), AWS Glue streaming ETL, Amazon Managed Service for Apache Flink, third-party integrations, and build-your-own custom applications using AWS and open source community SDKs and libraries. 

 **Downstream destinations**: Databases, data warehouses, purpose-built systems such as OpenSearch services, data lakes, and various third-party integrations. 

 With these five components in mind, next let’s consider the characteristics as you design your stream processing pipeline for real-time ingestion and nearly continuous stream processing. 

# Characteristics
<a name="characteristics-3"></a>

 **Scalable throughput:** For real-time analytics, you should plan a resilient stream storage infrastructure that can adapt to changes in the rate of data flowing through the stream. Scaling is typically performed by an administrative application that monitors shard and partition data-handling metrics. 

 **Dynamic stream processor consumption and collaboration:** Stream processors and consumers should automatically discover newly added Kinesis shards or Kafka partitions, and distribute them equitably across all available resources to process independently or collaboratively as a consumption group (Kinesis Application Name, Kafka Consumer Group). 
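Conceptually, equitable distribution across a consumption group can be sketched as a simple round-robin assignment. This is illustrative only; in practice, KCL leases and the Kafka consumer group rebalance protocol handle the assignment dynamically, including failover when a worker dies.

```python
def assign_shards(shards, workers):
    """Distribute shards (Kinesis) or partitions (Kafka) evenly across
    the workers of a consumption group, round-robin over sorted names."""
    assignment = {w: [] for w in workers}
    for i, shard in enumerate(sorted(shards)):
        assignment[workers[i % len(workers)]].append(shard)
    return assignment
```

When a shard is added or a worker joins or leaves, a real consumer group recomputes an assignment like this automatically, which is the dynamic discovery behavior described above.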

 **Durable:** Real-time streaming systems should provide high availability and data durability. For example, Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (Amazon MSK) replicate data across Availability Zones providing the high durability that streaming applications need. 

 **Replay-ability:** Stream storage systems should provide the ordering of records within shards and partitions, as well as the ability to independently read or replay records in the same order to stream processors and consumers. 

 **Fault-tolerance, checkpoint, and replay:** Checkpointing refers to recording the farthest point in the stream that data records have been consumed and processed. If the consuming application crashes, it can resume reading the stream from that point instead of having to start at the beginning. 
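A minimal sketch of checkpoint-and-resume, with illustrative names; real implementations store checkpoints durably, for example in a DynamoDB lease table as KCL does, or as Kafka consumer group offsets.

```python
def consume_with_checkpoint(records, checkpoint, process):
    """Process records from the last checkpointed position onward,
    so a restarted consumer resumes instead of re-reading the stream."""
    start = checkpoint.get("seq", -1) + 1   # farthest processed + 1
    for seq in range(start, len(records)):
        process(records[seq])
        checkpoint["seq"] = seq             # advance the checkpoint
    return checkpoint.get("seq", -1)
```

If the consumer crashes and restarts with the same checkpoint, it continues from the record after the last one it processed rather than from the beginning of the stream.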

 **Loosely coupled integration:** A key benefit of streaming applications is loose coupling: the ability of stream ingestion, stream producers, stream processors, and stream consumers to act and behave independently of one another. Examples include scaling consumers independently of the producer configuration, and adding stream processors and consumers that receive from the same stream or topic as existing ones but perform different actions. 

 **Allow multiple processing applications in parallel:** The ability for multiple applications to consume the same stream concurrently is an essential characteristic of a stream processing system. For example, you might have one application that updates a real-time dashboard and another that archives data to Amazon Redshift. You want both applications to consume data from the same stream concurrently and independently. 

 **Messaging semantics:** In a distributed messaging system, components might fail independently. Different messaging systems implement different semantic guarantees between a producer and a consumer in the case of such a failure. The most common message delivery guarantees implemented are: 
+  **At most once**: Messages that could not be delivered, or are lost, are never redelivered 
+  **At least once**: Message might be delivered more than once to the consumer 
+  **Exactly once**: Message is delivered exactly once 

 Depending on your application needs, you can choose a message delivery system that supports one or more of these required semantics. 
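The difference between these semantics can be illustrated with a toy producer and consumer. The sketch below (all names are illustrative) shows how a lost acknowledgement under at-least-once delivery produces a duplicate, and how an idempotent, deduplicating consumer turns at-least-once delivery into effectively exactly-once processing.

```python
def deliver_at_least_once(messages, send, ack_lost_for=()):
    """At-least-once: any message whose ack was lost is redelivered,
    so the consumer may see duplicates (but never misses a message)."""
    for msg in messages:
        send(msg)
        if msg in ack_lost_for:   # ack lost -> producer retries
            send(msg)

def idempotent_consumer():
    """Exactly-once *processing* is typically built from at-least-once
    delivery plus a consumer that deduplicates by message identity."""
    seen = set()
    received = []
    def handle(msg):
        if msg not in seen:       # drop duplicate deliveries
            seen.add(msg)
            received.append(msg)
    return handle, received
```

A naive consumer sees the duplicate; the deduplicating consumer processes each message once, which is the pattern most "exactly once" systems implement under the hood.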

 **Security:** Streaming ingest and processing systems must be secure by default. You must grant access using the principle of least privilege to the streaming APIs and infrastructure, and encrypt data at rest and in transit. Both Kinesis Data Streams and Amazon MSK can be configured to use IAM policies to grant least-privilege access. For stream storage in particular, enable encryption in transit for producers and consumers, and encryption at rest. 

# Reference architecture
<a name="reference-architecture-2"></a>

![\[Reference architecture diagram for streaming data analytics\]](http://docs.aws.amazon.com/wellarchitected/latest/analytics-lens/images/streaming-data-analytics-reference-architecture.png)




 The preceding streaming reference architecture diagram is segmented into the previously described components of streaming scenarios: 
+  Data sources 
+  Stream ingestion and producers 
+  Stream storage 
+  Stream processing and consumers 
+  Downstream destinations 

 All, or portions, of this reference architecture can be used for workloads such as application modernization with microservices, streaming ETL, ingest, real-time inventory, recommendations, or fraud detection. 

 In this section, we will identify each layer of components shown in the preceding diagram with specific examples. The examples are not intended to be an exhaustive list, but rather an attempt to describe some of the more popular options. 

 The subsequent Configuration notes section provides recommendations and considerations when implementing streaming data scenarios. 

 We will review the five core components of streaming architecture first, and then discuss these specialized flows. 

1.  **Data sources**: The number of potential data sources is in the millions. Examples include application logs, mobile apps and applications with REST APIs, IoT sensors, existing application databases (RDBMS, NoSQL) and metering records. 

1.  **Stream ingestion and producers**: Multiple data sources generate data continually that might amount to terabytes of data per day. Toolkits, libraries, and SDKs can be used to develop custom stream producers that write to streaming storage. In contrast to custom-developed producers, examples of pre-built producers include Kinesis Agent, Change Data Capture (CDC) solutions, and Kafka Connect source connectors. 

1.  **Streaming storage**: Kinesis Data Streams, Amazon MSK, and self-managed Apache Kafka are all examples of stream storage options for ingesting, processing, and storing large streams of data records and events. Streaming storage implementations are modeled on the idea of a distributed, immutable commit log. Events are stored for a configurable duration (hours to days to months, or even permanently in some cases). While stored, events are available to any client. 

1.  **Stream processing and consumers**: Real-time data streams can be processed sequentially and incrementally on a record-by-record basis over sliding time windows using a variety of services. Or, put another way, this can be where particular domain-specific logic resides and is computed. 

 With Amazon Managed Service for Apache Flink or Amazon Managed Service for Apache Flink Studio, you can process and analyze streaming data using standard SQL in a serverless way. The service allows you to quickly author and run SQL queries against streaming sources to perform time series analytics, feed real-time dashboards, and create real-time metrics. 

 If you work in an Amazon EMR environment, you can process streaming data using multiple options— Apache Flink or Spark Structured Streaming. 

 Finally, there are options for AWS Lambda, third-party integrations, and build-your-own custom applications using AWS SDKs, libraries, and open-source libraries and connectors for consuming from Kinesis Data Streams, Amazon MSK, and Apache Kafka. 

1.  **Downstream destinations:** Data can be persisted to durable storage to serve a variety of use cases including ad hoc analytics and search, machine learning, alerts, data science experiments, and additional custom actions. 

 A special note on the data flow lanes marked with an asterisk (\*): there are two examples that both involve bidirectional flow of data to and from the stream storage layer (3). 

 The first example is the bidirectional flow of in-stream ETL, where a stream processor (4) reads one or more raw event sources from stream storage (3), performs filtering, aggregations, joins, and so on, and writes the results back to streaming storage as a refined (that is, curated, hydrated) result stream or topic (3), where it can be used by a different stream processor or downstream consumer. 

 The second bidirectional flow example is the ubiquitous application modernization microservice design (2), which often uses the streaming storage layer (3) for decoupled microservice interaction. 

 The key takeaway is not to presume that streaming events flow exclusively from left to right over time in the reference architecture diagram. 

## Configuration notes
<a name="configuration-notes-2"></a>

 As explored so far, we know streaming data architects have options for implementing particular components in their stack, for example, different options for streaming storage, streaming ingest, and streaming producers. While it’s impractical to provide in-depth recommendations for each layer’s options in this whitepaper, there are some high-level concepts to consider as guide posts, which we will present next. 

 For more in-depth analysis of a particular layer in your design, consider exploring the provided links within the following guidelines. 

## Streaming application guidelines
<a name="streaming-application-guidelines"></a>

 **Determine business requirements first.** It’s a best practice, and practical, to focus on your workload’s particular needs first, rather than starting with a feature-by-feature comparison between the technical options. For example, we often see organizations prioritizing Technical Feature A vs. Technical Feature B before determining their workload’s requirements. This is the wrong order. Determine your workload’s requirements first, because AWS has a wide variety of purpose-built options at each streaming architecture layer to best match those requirements. 

 **Technical comparisons second.** After business requirements have been clearly established, the next step is to match your business requirements with the technical options that offer the best chance for success. For example, if your team has few technical operators, serverless might be a good option. 

 Other technical questions about your workload might include: Do you require a large number of independent stream processors and consuming applications, that is, one vs. many stream processors and consumers? What kind of manual or automatic scaling options are available to match business-requirement throughput, latency, SLA, RPO, and RTO objectives? Is there a desire to use open source-based solutions? What are the security options, and how well do they integrate into existing security postures? Is one path easier or more straightforward to migrate to than another, for example, from self-managed Apache Kafka to Amazon MSK? 

 To learn more about your options for various layers in the reference architecture stack, refer to the following: 
+  **Streaming ingest and producers** — Can be workload-dependent and use AWS service integrations, such as AWS IoT Core, CloudWatch Logs and Events, AWS Database Migration Service (AWS DMS), and third-party integrations (refer to [Writing Data to Amazon Kinesis Data Streams](https://docs.aws.amazon.com/streams/latest/dev/building-producers.html) in the *Amazon Kinesis Data Streams Developer Guide*). 
+  **Streaming storage** — Kinesis Data Streams, Firehose, Amazon MSK, and Apache Kafka (refer to [Best Practices](https://docs.aws.amazon.com/msk/latest/developerguide/bestpractices.html) in the *Amazon Managed Streaming for Apache Kafka Developer Guide*). 
+  **Stream processing and consumers** — Amazon Managed Service for Apache Flink, Firehose, AWS Lambda, and open source and proprietary SDKs (refer to [Advanced Topics for Amazon Kinesis Data Streams Consumers](https://docs.aws.amazon.com/streams/latest/dev/advanced-consumers.html) in the *Amazon Kinesis Data Streams Developer Guide* and [Best Practices](https://docs.aws.amazon.com/kinesisanalytics/latest/java/best-practices.html) in the *Amazon Managed Service for Apache Flink Developer Guide*). 

 For more information, refer to the [Build Modern Data Streaming Architectures on AWS](https://docs.aws.amazon.com/whitepapers/latest/build-modern-data-streaming-analytics-architectures/build-modern-data-streaming-analytics-architectures.html) whitepaper. 

 **Remember separation of concerns.** Separation of concerns is the application design principle of segmenting an application into distinct areas of concern. For example, your application might require stream processors and consumers that perform an aggregation computation in addition to recording the computation results to a downstream destination. While it can be tempting to clump both of these concerns into one stream processor or consumer, consider separating them instead. Isolating concerns into multiple stream processors or consumers is often better for operational monitoring, performance tuning isolation, and reducing the downtime blast radius. 

## Development
<a name="development"></a>

 **Existing or desired skills match.** Realizing value from streaming architectures can be difficult and often a new endeavor for various roles within organizations. Increase your chances of success by aligning your team’s existing skillsets, or desired skillsets, wherever possible. For example, is your team familiar with Java or do they prefer a different language, such as Python or Go? Does your team prefer a graphical user interface for writing and deploying code? Work backwards from your existing skill resources and preferences to appropriate options for each component. 

 **Build vs. buy (Write your own or use off-the-shelf).** Consider whether an integration between components already exists or if you must write your own. Or perhaps both options are available. Many teams new to streaming incorrectly assume that everything must be written from scratch. Instead, consider services such as Kafka Connect Connectors for inbound and outbound traffic, AWS Lambda, and Firehose. 

## Performance
<a name="performance"></a>

 **Aggregate records before sending to stream storage for increased throughput.** When using Kinesis, Amazon MSK, or Kafka, ensure that the messages are accumulated on the producer side before sending to stream storage. This is also referred to as batching records to increase throughput, but at the cost of increased latency. 
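A hypothetical sketch of producer-side batching follows; the class and parameter names are illustrative. In practice, the Kinesis Producer Library's record aggregation or the Kafka producer's `batch.size` and `linger.ms` settings provide this behavior.

```python
class BatchingProducer:
    """Accumulate records on the producer and flush them in batches:
    higher throughput per request, at the cost of added latency."""
    def __init__(self, send_batch, max_batch=500):
        self.send_batch = send_batch   # e.g. wraps a PutRecords call
        self.max_batch = max_batch
        self.buffer = []

    def put(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.max_batch:
            self.flush()

    def flush(self):
        """Send whatever is buffered; call on shutdown or on a timer
        to bound the extra latency batching introduces."""
        if self.buffer:
            self.send_batch(self.buffer)
            self.buffer = []
```

The trade-off is visible in the flush logic: a larger `max_batch` means fewer, bigger requests to stream storage, but records wait longer in the buffer before they are sent.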

 **When working with Kinesis Data Streams, use Kinesis Client Library (KCL) to de-aggregate records.** KCL takes care of many of the complex tasks associated with distributed computing, such as load balancing across multiple instances, responding to instance failures, checkpointing processed records, and reacting to re-sharding. 

 **Initial planning and adjustment of shards and partitions.** The most common mechanism to scale stream storage for stream processors and consumers is through the number of configured shards (Kinesis Data Streams) or partitions (Apache Kafka, Amazon MSK) for a particular stream. This is a common element across Kinesis Data Streams, Amazon MSK, and Apache Kafka, but options for scaling out (and in) the number of shards or partitions vary. 
+  Amazon Kinesis Data Streams Developer Guide: [Resharding a Stream](https://docs.aws.amazon.com/streams/latest/dev/kinesis-using-sdk-java-resharding.html) 
+  Apache Kafka Documentation - Operations: [Expanding your cluster](https://kafka.apache.org/documentation/#basic_ops_cluster_expansion) (also applicable to Amazon MSK) 
+  Amazon Managed Streaming for Apache Kafka Developer Guide: [Using LinkedIn's Cruise Control for Apache Kafka with Amazon MSK](https://docs.aws.amazon.com/msk/latest/developerguide/cruise-control.html) (for partition rebalancing) 
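To see why shard and partition planning matters, the following sketch mimics Kinesis-style routing, where the MD5 hash of a partition key selects a shard from the 128-bit hash-key space. It is conceptual only, but it shows two planning implications: all records with the same key land on the same shard (a hot key cannot be spread out), and changing the shard count changes key-to-shard assignments.

```python
import hashlib

def shard_for_key(partition_key, num_shards):
    """Map a partition key to a shard by hashing it into the 128-bit
    hash-key space and splitting that space evenly across shards."""
    h = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    shard_size = (2 ** 128) // num_shards
    return min(h // shard_size, num_shards - 1)
```

Because the mapping is deterministic, per-key ordering is preserved within a shard; because it depends on `num_shards`, resharding or partition expansion redistributes keys, which is why consumers must rediscover shards after scaling.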

 **Use Spot Instances and automatic scaling to process streaming data cost effectively.** You can also process the data using AWS Lambda with Kinesis, Amazon MSK, or both, and Kinesis record aggregation and de-aggregation modules for AWS Lambda. Various AWS services offer automatic scaling options to keep costs lower than provisioning for peak volumes. 

## Operations
<a name="operations"></a>

 **Monitor Kinesis Data Streams and Amazon MSK metrics using Amazon CloudWatch.** You can get basic stream and topic level metrics in addition to shard and partition level metrics. Amazon MSK also provides an [Open Monitoring with Prometheus](https://docs.aws.amazon.com/msk/latest/developerguide/open-monitoring.html) option. 
+  Amazon Kinesis Data Streams Developer Guide: [Monitoring Amazon Kinesis Data Streams](https://docs.aws.amazon.com/streams/latest/dev/monitoring.html) 
+  Amazon Managed Streaming for Apache Kafka Developer Guide: [Monitoring an Amazon MSK Cluster](https://docs.aws.amazon.com/msk/latest/developerguide/monitoring.html) 

 **Plan for the unexpected / No single point of failure.** Some components in your streaming architecture offer different durability options in case of a failure. For example, Kinesis Data Streams replicates data across three Availability Zones. With Apache Kafka and Amazon MSK, producers can be configured to require acknowledgement from the partition leader, as well as from a configurable number of in-sync replica followers, before a write is considered successful. With these options, you can plan for disruptions in your AWS environment, for example, an Availability Zone going offline, without downtime of your producing and consuming layers.

## Security
<a name="security-1"></a>

### Authentication and authorization
<a name="authentication-and-authorization."></a>
+  Amazon Managed Streaming for Apache Kafka Developer Guide: [Authentication and Authorization for Apache Kafka APIs](https://docs.aws.amazon.com/msk/latest/developerguide/kafka_apis_iam.html) 
+  Amazon Kinesis Data Streams Developer Guide: [Controlling Access to Amazon Kinesis Data Streams Resources Using IAM](https://docs.aws.amazon.com/streams/latest/dev/controlling-access.html) 

 **Encryption in transit and encryption at rest.** Streaming data actively moves from one layer to another, such as from a streaming data producer to stream storage over the internet or through a private network. 

 To protect data in transit, enterprises often choose encrypted connections (HTTPS, SSL, TLS) to protect its contents as it moves. Many AWS streaming services also offer protection of data at rest through encryption. 
+  AWS Well-Architected Framework Security Pillar: [AWS Identity and Access Management](https://docs.aws.amazon.com/wellarchitected/latest/security-pillar/identity-management.html) 
+  AWS Lake Formation Developer Guide: [Security in AWS Lake Formation](https://docs.aws.amazon.com/lake-formation/latest/dg/security.html) 

# Operational analytics
<a name="operational-analytics"></a>

 Operational analytics refers to inter-disciplinary techniques and methodologies that aim to measure and improve day-to-day business performance in terms of increasing the efficiency of internal business processes and improving customer experience and value. 

 Traditional analytics, like Business Intelligence (BI), provide each Line of Business (LOB) with insights to identify trends and make decisions based on what happened in the past.  

 But this is no longer sufficient. To deliver a good customer experience, organizations must continually measure their workload performance and quickly respond to operational inefficiencies. 

 By using operational analytics systems, organizations can initiate business actions based on the recommendations that these systems provide. They can also automate the execution processes to reduce human error. This moves the system beyond being descriptive to being prescriptive and even predictive in nature. 

 At the same time, IT infrastructures are becoming increasingly distributed, adding complexity to workloads in terms of identifying the operational data that captures the system’s state, characterizing its behavior, and ultimately rectifying potential issues in the pipelines. 

 Several tools and methodologies have emerged that help companies keep their systems reliable. Every system or application must be instrumented to expose telemetry data that provides operational insights in real or near real time. 

 The telemetry data can take the form of different signals: logs, traces, and metrics. Traditionally, this data came in the form of logs, which record an event that happened within an application, server, or system operation. Logs can be of different types, such as application logs, security logs, system logs, audit trails, and infrastructure logs. Logs are usually used for troubleshooting and root-cause analysis of a system or application failure at a specific point in time. 

 A trace signal captures a user request for resources as it passes through different systems all the way to the destination, and the response back to the user. It indicates a causal relationship between all the services that are part of a distributed transaction. Organizations used to develop their own trace mechanisms, but it is recommended to use existing tools that support a standard trace-context propagation format. The trace context holds the information that links the producer of a message to its downstream consumers. 

 Metric data provides a point-in-time measure of the health of the system, such as resource consumption in terms of CPU utilization. Metric signals offer an overview of overall system health while reducing the manual effort required to build and store these metrics. With metrics, system operators can be notified in real time about anomalies in production environments and can establish an automated recovery process for recurring incidents. 

 The signals mentioned above are instrumented in different ways and support different approaches to implementing operational analytics use cases. Therefore, organizations must have an operational objective in mind from which they can work backwards to identify what data output they need from their system, which tool best fits their business and IT environment, and what insights are needed to better understand their customers and improve their production resiliency. 

# Characteristics
<a name="characteristics-4"></a>

 **Discoverability:** The ability of the system to make operational data available for consumption. This involves discovering multiple disparate types of data available within an application that can be used for various ad hoc explorations. 

 **Connectivity:** Operational data can emanate from a variety of data sources in different formats and with disparate volumes. For this reason, the operational system has to provide the capability to seamlessly integrate all the data with the least overhead for the production application. 

 **Scalability:** The ability of the system to scale up and out to adapt to changes in the operational analytics workload in terms of storage or compute requirements. 

 **Monitoring:** You should be able to continuously monitor the operational system performance and get notified about the resource utilization and the overall health of your system. 

 **Security:** Access to the operational system must be secure. With Amazon OpenSearch Service, you can configure the domain to be accessible through an endpoint within your VPC or a public endpoint accessible from the internet. In addition to network-based access control, you must set up user authentication and authorization to secure access to data based on business requirements. OpenSearch Service supports encryption at rest and in transit. 

 **Data durability:** With operational analytics, the use cases differ as to the retention requirements. You should understand your business requirements in terms of analyzing historical data. With Amazon OpenSearch Service, you can retain more data with less cost using the UltraWarm and cold storage tiers. 

 **Automation:** The data lifecycle in your operational system should be automated in order to easily onboard new data pipelines and reduce the overhead of managing the lifecycle of the data. With Index State Management (ISM) in Amazon OpenSearch Service, you can create your own policies to automate the lifecycle management of indices stored in the service. 

 **Observability:** The ability to understand internal state from the various signal outputs of the system. By providing a holistic view of these various signals along with meaningful inference, it becomes easy to understand the health and performance of the overall system. 

 **User centricity:** Each analytics application should address a well-defined operational scope and solve a particular problem at hand. Users of the system often won’t understand or care about the analytics process, but only see the value of the result.  

 **Agility:** The system must be flexible enough to accommodate the changing needs of an analytics application and offer the necessary control to bring in additional data with low overhead.

# Reference architecture
<a name="section-17"></a>

![\[Reference architecture diagram for operational analytics\]](http://docs.aws.amazon.com/wellarchitected/latest/analytics-lens/images/operational-analytics-reference-architecture.png)




 The reference architecture covers the data flow in an operational analytics use case. The ingestion pipeline contains up to five stages, as follows: 

1.  With your operational and business goals in mind, you should instrument your system or platform to **produce** the relevant types of signals, such as logs, traces, and metrics, and expose the data to a set of collectors. At this stage, you can choose open-source instrumentation tools such as [Jaeger](https://www.jaegertracing.io/) or [Zipkin](https://zipkin.io/). If you plan to generate different types of signals, we recommend that you include signal correlation beginning with the design step. Open-source tools such as [OpenTelemetry](https://opentelemetry.io/docs/) facilitate context propagation by adding a trace ID to all logs related to a specific request. This reduces the mean time to problem resolution by enhancing the observability of the system from multiple viewpoints. 

1.  The second step is to **collect** the telemetry data from the producers and deliver it to the aggregators or buffers. You can use native AWS services (such as the [Amazon Kinesis Agent](https://docs.aws.amazon.com/firehose/latest/dev/writing-with-agents.html), [CloudWatch agents](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_OpenSearch_Stream.html), or [AWS Distro for OpenTelemetry](https://aws.amazon.com/otel/?otel-blogs.sort-by=item.additionalFields.createdDate&otel-blogs.sort-order=desc)) to instrument your applications just once, and to collect and correlate metrics and traces along with contextual information and metadata about where the application is running. You can also use a number of lightweight shippers, such as [Fluentd](https://docs.fluentd.org/) to collect logs, [Fluentbit](https://fluentbit.io/) to collect both logs and metrics, and open-source [OpenTelemetry](https://opentelemetry.io/docs/). 

1.  Before sending the data to [Amazon OpenSearch Service](https://aws.amazon.com/opensearch-service/), it is recommended that you **buffer or aggregate** information from the collectors to reduce the overall number of connections to the domain, and use the batch (`_bulk`) API to send batches of documents rather than single documents. It is also possible at this stage (or at the collection stage) to transform and aggregate the data for the downstream analytics tools. To do this, you can use AWS services such as [Amazon Data Firehose](https://aws.amazon.com/kinesis/data-firehose/) and [Amazon Managed Streaming for Apache Kafka](https://aws.amazon.com/msk/). For large-scale environments, you can use [Amazon S3](https://aws.amazon.com/s3/) to back up the data. It is also possible to use open-source tools such as OpenSearch Data Prepper for trace and log analytics, or the [open source version of Logstash](https://opensearch.org/docs/latest/clients/logstash/index/) (check compatibility with Amazon OpenSearch Service [here](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/managedomains-logstash.html)). 

1.  Amazon OpenSearch Service makes it easy for you to index and **store** telemetry data to perform interactive analytics. Amazon OpenSearch Service is built to handle a large volume of structured and unstructured data from multiple data sources at high ingestion rates, and it integrates not only with AWS services but also with open-source tools such as those listed previously. It is also possible to use [Amazon Managed Service for Prometheus](https://aws.amazon.com/prometheus/) to **store** and **query** operational metrics. The service is integrated with Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Elastic Container Service (Amazon ECS), and AWS Distro for OpenTelemetry. 

1.  OpenSearch Dashboards is the default **visualization** tool for data in Amazon OpenSearch Service. It also serves as a user interface for many of the OpenSearch plugins, including Observability, Security, Alerting, Index State Management, and SQL. You can also conduct interactive analysis and visualization of data with [Piped Processing Language (PPL)](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ppl-support.html), a query interface. You can use [Amazon Managed Grafana](https://aws.amazon.com/grafana/) to complement Amazon OpenSearch Service at the visualization layer, and connect Amazon Managed Grafana to Amazon Managed Service for Prometheus to query, **visualize**, alert on, and understand metric data. 
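
The buffering guidance in step 2 comes down to batching documents into `_bulk` requests instead of indexing them one at a time. The sketch below builds such a request body in Python; the index name and sample log records are hypothetical, and request signing is omitted.

```python
import json

def build_bulk_body(index, docs):
    """Build an NDJSON _bulk request body: one action line plus one
    document line per record, terminated by a trailing newline."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

# Hypothetical log records collected by a shipper such as Fluent Bit.
batch = [
    {"@timestamp": "2024-01-01T00:00:00Z", "level": "ERROR", "message": "timeout"},
    {"@timestamp": "2024-01-01T00:00:01Z", "level": "INFO", "message": "retry ok"},
]
body = build_bulk_body("app-logs", batch)
# POST this body to https://<domain-endpoint>/_bulk with
# Content-Type: application/x-ndjson (SigV4 signing not shown here).
```

Batching this way amortizes connection and request overhead across many documents, which is why the buffer layer, not the producers, should talk to the domain.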

# Configuration notes
<a name="configuration-notes-3"></a>

 As shown in the previous section, there are many options, and a non-exhaustive list of tools, to choose from when implementing an operational analytics pipeline. The following configuration considerations will help you build a well-architected operational pipeline. 

 **Define operational goals and business requirements:** As a best practice, you should always start by identifying your operational goals and the business outcomes you must reach. Think about who your end users are, which insights will help drive their decisions, and how they will access those insights. After you define the business requirements, you can start designing your technical pipeline, establishing the integration options in your environment, and reviewing the skill sets you have so that you can choose the right option. 

 **Choose a data model before ingestion:** When bringing data in from disparate sources, especially from structured systems into schema-flexible systems such as OpenSearch, take special care to ensure that the chosen data model provides a frictionless search experience for users. 

 **Ingestion pipeline:** Make sure that your ingestion framework is reusable and extensible, so that it can scale and accommodate new use cases over the long term; otherwise, identify which parts of your infrastructure would require modernization. 

 **Production-ready tools and services:** AWS offers a set of managed services that are production ready and that eliminate the operational overhead of managing infrastructure, such as Amazon OpenSearch Service. As shown in the reference architecture, you can also integrate open-source tools, such as OpenSearch Data Prepper, to transform and aggregate operational data for downstream analytics and visualizations. 

 **Sizing OpenSearch domains:** The first step in sizing an OpenSearch cluster is to check your data size and identify your storage and query requirements. Estimate the number of active shards you will have per index based on your input data and the shard size that you identify. Then, estimate your vCPU requirements and choose instance types that can handle both the storage and vCPU needs. Plan time to benchmark the domain with a realistic dataset using [OpenSearch Benchmark](https://github.com/opensearch-project/opensearch-benchmark), then tune the configuration and iterate until you meet the required performance in terms of throughput, search latency, and index latency. For more information, see [Sizing Amazon OpenSearch Service domains](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/sizing-domains.html) and [Best practices for configuring your Amazon OpenSearch Service domain](https://aws.amazon.com/blogs/big-data/best-practices-for-configuring-your-amazon-opensearch-service-domain/). 
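
The sizing steps above can be sketched as a back-of-the-envelope calculator. The constants below are starting-point rules of thumb drawn from the sizing guidance linked above (roughly 1.45x storage overhead over source data, and a 10-30 GiB target shard size for search workloads); treat the output as a first estimate to validate with OpenSearch Benchmark, not a final answer.

```python
import math

# Rule-of-thumb constants from the sizing guidance; tune per workload.
SOURCE_TO_STORAGE_OVERHEAD = 1.45  # indexing + OS + OpenSearch overhead
INDEX_OVERHEAD = 1.1               # indexed data is ~10% larger than source
TARGET_SHARD_SIZE_GIB = 30         # 10-30 GiB for search, up to 50 for logs

def estimate_domain(source_data_gib, replica_count=1):
    """Rough storage and shard-count estimate for one index."""
    storage_gib = source_data_gib * (1 + replica_count) * SOURCE_TO_STORAGE_OVERHEAD
    primary_shards = math.ceil(
        source_data_gib * INDEX_OVERHEAD / TARGET_SHARD_SIZE_GIB
    )
    active_shards = primary_shards * (1 + replica_count)
    return {
        "storage_gib": round(storage_gib, 1),
        "primary_shards": primary_shards,
        "active_shards": active_shards,
    }

print(estimate_domain(source_data_gib=100, replica_count=1))
# → {'storage_gib': 290.0, 'primary_shards': 4, 'active_shards': 8}
```

From the active shard count you can then estimate vCPU needs (commonly on the order of one to two vCPUs per active shard as a starting point) and pick instance types that satisfy both storage and compute.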

 **Use tiered storage:** The value of operational data or any timestamped data generally decreases with the age of the data. Moving aged data into tiered storage can save significant operational cost. Summarized rollups that can condense the data can also help address storage cost. 
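
One way to act on the tiered-storage advice is an Index State Management (ISM) policy that ages indexes out automatically. The sketch below builds a minimal policy document of roughly the shape the ISM plugin expects; the 30-day threshold is illustrative, and a real policy might add a warm-migration state before deletion.

```python
import json

# Minimal ISM policy sketch: keep an index in the "hot" state, then
# delete it once it is 30 days old. Threshold is illustrative only.
ism_policy = {
    "policy": {
        "description": "Expire operational indexes after 30 days",
        "default_state": "hot",
        "states": [
            {
                "name": "hot",
                "actions": [],
                "transitions": [
                    {"state_name": "delete",
                     "conditions": {"min_index_age": "30d"}}
                ],
            },
            {"name": "delete", "actions": [{"delete": {}}], "transitions": []},
        ],
    }
}
# PUT this document to _plugins/_ism/policies/<policy-id> on the domain.
print(json.dumps(ism_policy, indent=2))
```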

 **Performance:** There are multiple parameters to consider when thinking about performance, and the right values are always specific to each workload. However, Amazon OpenSearch Service offers features that you can enable in your domain today, such as [Auto-Tune](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/auto-tune.html), which automatically deploys optional changes to improve cluster speed and stability. Other items to take into consideration include using the `_bulk` API to load data into OpenSearch, and only indexing data fields that need to be searchable. 

 **Define security requirements:** Make sure to set up your domain inside a virtual private cloud (VPC) to secure the traffic to your domain. Apply the least privilege access approach with restrictive access policies, or with fine-grained access control for OpenSearch Dashboards. Amazon OpenSearch Service also offers encryption of data at rest and in transit. 
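
As an illustration of these requirements, the following sketch assembles `create_domain` parameters (for the boto3 `opensearch` client) that place the domain in a VPC and turn on encryption at rest, node-to-node encryption, and HTTPS enforcement. The domain name, subnet ID, security group ID, and engine version are placeholders.

```python
# Hedged sketch of a security-hardened domain configuration, expressed
# as boto3 create_domain parameters. All identifiers are placeholders.
params = {
    "DomainName": "ops-analytics",
    "EngineVersion": "OpenSearch_2.11",
    "VPCOptions": {
        # Placing the domain in a VPC keeps traffic off the public internet.
        "SubnetIds": ["subnet-0example"],
        "SecurityGroupIds": ["sg-0example"],
    },
    "EncryptionAtRestOptions": {"Enabled": True},
    "NodeToNodeEncryptionOptions": {"Enabled": True},
    "DomainEndpointOptions": {"EnforceHTTPS": True},
}
# import boto3
# boto3.client("opensearch").create_domain(**params)
```

Restrictive access policies or fine-grained access control would be layered on top of this baseline.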

 **Monitor all involved components:** Monitor all involved components with metrics in Amazon CloudWatch. With the CloudWatch metrics available for Amazon OpenSearch Service, you can monitor overall cluster health, check the performance of individual nodes, and monitor EBS volume metrics. It is also a best practice to set [CloudWatch alarms](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/cloudwatch-alarms.html) to get notified about any issues that your production domain encounters. You can start by setting the following alarms: 
+  `CPUUtilization` maximum is >= 80% for 15 minutes, 3 consecutive times 
+  `ClusterStatus.yellow` maximum is >= 1 for 1 minute, 1 consecutive time 
+  `JVMMemoryPressure` maximum is >= 80% for 5 minutes, 3 consecutive times 
+  `FreeStorageSpace` minimum is <= 25% of the storage space for 1 minute, 1 consecutive time 
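
The first alarm in the list above can be codified with the CloudWatch `put_metric_alarm` API. The sketch below builds the parameters for the `CPUUtilization` alarm (maximum >= 80% over three consecutive 15-minute periods); the domain name, account ID, and SNS topic ARN are placeholders.

```python
# Hedged sketch of the CPUUtilization alarm as put_metric_alarm
# parameters; all identifiers below are placeholders.
alarm = {
    "AlarmName": "ops-analytics-cpu-high",
    "Namespace": "AWS/ES",  # metrics namespace for OpenSearch Service
    "MetricName": "CPUUtilization",
    "Dimensions": [
        {"Name": "DomainName", "Value": "ops-analytics"},
        {"Name": "ClientId", "Value": "111122223333"},  # account ID
    ],
    "Statistic": "Maximum",
    "Period": 900,             # one 15-minute period
    "EvaluationPeriods": 3,    # 3 consecutive periods breach the alarm
    "Threshold": 80.0,
    "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    "AlarmActions": ["arn:aws:sns:us-east-1:111122223333:ops-alerts"],
}
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**alarm)
```

The remaining alarms follow the same shape, varying only the metric name, statistic, period, and threshold.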

# Data visualization
<a name="data-visualization"></a>

 Every day, the people in your organization make decisions that affect your business. When they have the right information at the right time, they can make the choices that move your company in the right direction. Data visualization gives decision makers the opportunity to explore and interpret information in an interactive visual environment, democratizing data and accelerating data-driven insights that are easy to understand and navigate. 

 Building a BI and data visualization service in the cloud allows you to take advantage of capabilities such as scalability, availability, redundancy, and enterprise-grade security. It also lowers the barrier to data connectivity and allows access to a far wider range of data sources, both traditional (such as databases) and non-traditional (such as SaaS sources). An added advantage of a cloud-based data visualization service is the elimination of undifferentiated heavy lifting related to managing server infrastructure. 

# Characteristics
<a name="characteristics-5"></a>

 **Scalability:** Ensure that the underlying BI infrastructure can scale both vertically and horizontally, in terms of concurrent users as well as data volume. For example, QuickSight SPICE and its web application automatically scale up server capacity to accommodate a large number of concurrent users without any manual intervention to provision additional capacity, load balancing, or other services. 

 **Connectivity:** BI applications must be able to not only connect with data platforms such as traditional data warehouses and databases, but also support connectivity to data lakes and modern data architectures. The application must also have the capacity to connect to non-traditional sources, such as SaaS applications. Typically, data stores are secured behind a private subnet, and BI tools and applications must be able to connect securely using mechanisms such as VPC endpoints and secure firewalls. 

 **Centralized security and compliance:** BI applications must allow for a layered approach to security. This includes securing the perimeter using techniques such as IP allow lists, security groups, ENIs, and IAM policies for cloud resource access; securing data in transit and at rest using SSL and encryption; and restricting varying levels of access through fine-grained permissions for users to the underlying data and BI assets. The application must also comply with the governmental and industry regulations of the country or region the company is bound by. 

 **Sharing and collaboration:** BI applications must support data democratization. They must have features that allow sharing of dashboards with other users in the company, as well as allow multiple report authors to collaborate with one another by sharing access to the underlying dataset. Not all BI tools have this capability. QuickSight allows the sharing of assets such as data sources, datasets, analyses, dashboards, themes, and templates. 

 **Logging, monitoring, and auditing:** BI applications must provide adequate mechanisms to monitor and audit usage of the application, both for security (to prevent unwanted access to data assets and other resources) and for troubleshooting. QuickSight can be used with Amazon CloudWatch, AWS CloudTrail, and IAM to track actions taken by a user, user role, or AWS service. This provides the who, what, when, and where of every user action in QuickSight. 

 **Perform advanced analytics** 

 Modern BI applications must be able to discover hidden insights in your data, perform forecasting and what-if analysis, and add easy-to-understand natural language narratives to dashboards. Business users need the ability to perform analytics without deep statistical and machine learning knowledge. 

 QuickSight ML Insights provides features that make it easy to discover hidden trends and outliers, identify key business drivers, and perform powerful what-if analysis and forecasting with no technical or ML experience.  

 **Enable self-service business intelligence** 

 A common challenge for BI tools is how to make data more accessible to more people without extensive user training and technical understanding. Data must be available in all formats: raw, semi-processed, and processed. Self-service BI should allow users to interact with data on an as-needed basis without involving IT.  

 QuickSight Q allows users to ask business questions in natural language and receive answers with relevant visualizations that help them gain insights from the data. QuickSight Q uses machine learning to interpret the intent of a question and analyze the correct data to provide accurate answers to business questions quickly. 

# Reference architecture
<a name="reference-architecture-4"></a>

![\[Diagram showing QuickSight dashboard end-to-end design\]](http://docs.aws.amazon.com/wellarchitected/latest/analytics-lens/images/quicksight-dashboard-design.png)




 **Data sources:** Supports connections with traditional data warehouses or databases, and also has the capacity to connect to non-traditional sources such as SaaS applications. Supported data sources in QuickSight include Amazon S3, Amazon Redshift, Amazon Aurora, Oracle, MySQL, Microsoft SQL Server, Snowflake, Teradata, Jira, and ServiceNow. See the [complete list of data sources supported in QuickSight](https://docs.aws.amazon.com/quicksight/latest/user/supported-data-sources.html). These data sources can be secured behind a private subnet, and QuickSight can connect securely using mechanisms such as VPC endpoints and secure firewalls. 

 **Visualization tool:** Amazon QuickSight. 

 **Consumers:** Visual dashboard consumers accessing a QuickSight console or embedded QuickSight analytics dashboard. 

# Configuration notes
<a name="configuration-notes-4"></a>

 **Security:** Implement the principle of least privilege throughout the visualization application stack. Ensure data sources are connected using VPCs, and restrict security groups to only the required protocols, sources, and destinations. Ensure that users and applications in every layer of the stack are given just the right level of access permissions to data and the underlying resources. Ensure seamless integration with identity providers, whether industry supported or customized. To ease the flow and remove confusion, set up QuickSight and single sign-on (SSO) so that email addresses for end users are automatically synced at their first login. In the case of multi-tenancy, use namespaces for better isolation of principals and other assets across tenants. For example, QuickSight follows the least privilege principle, and access to AWS resources such as Amazon Redshift, Amazon S3, or Amazon Athena (common services used in data warehouse, data lake, or modern data architectures) can be managed through the QuickSight user interface. Additional security at the user or group level is supported using fine-grained access control through IAM permissions. QuickSight also offers features such as row-level security, column-level security, and a range of asset governance capabilities that can be configured directly through the QuickSight user interface. 

 **Cost optimization:** Accurately identify the volume of dashboard consumers and the embedding requirements to determine the optimal pricing model for the given visualization use case. QuickSight offers two pricing options (capacity-based and user-based) that allow clients to implement cost-effective BI solutions. Capacity pricing allows large-scale implementations, and user-based pricing allows clients to get started with minimal investment. (Note: SPICE has a limit of 500 million records or 500 GB per dataset.) 
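
The pricing decision above reduces to a breakeven on reader count. The toy calculator below illustrates the comparison; the two prices are hypothetical placeholders, not actual QuickSight rates, so substitute current figures from the AWS pricing page before using this for a real decision.

```python
# Hypothetical placeholder prices -- NOT actual QuickSight rates.
PER_READER_MONTHLY = 3.0   # placeholder: cost per reader per month
CAPACITY_MONTHLY = 500.0   # placeholder: flat capacity price per month

def cheaper_model(reader_count):
    """Return which pricing model costs less for a given audience size."""
    user_based_cost = reader_count * PER_READER_MONTHLY
    return "user-based" if user_based_cost < CAPACITY_MONTHLY else "capacity"

print(cheaper_model(50))    # small audience → "user-based"
print(cheaper_model(5000))  # large embedded audience → "capacity"
```

The same shape of calculation extends to author seats and embedded sessions once real rates are plugged in.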

 **Low latency considerations:** Use an in-memory caching option, such as Memcached, Redis, or SPICE (Super-fast, Parallel, In-memory Calculation Engine), the in-memory caching engine in QuickSight, to prevent latency in dashboard rendering, while accommodating any built-in restrictions that the caching technology might have. 

 **Pre-process data views:** Ensure that the data is cleansed, standardized, enhanced, and pre-processed to allow analysis within the BI layer. If possible, create pre-processed, pre-combined, pre-aggregated data views for analysis purposes. ETL tools, such as AWS Glue DataBrew, or techniques, such as materialized views, can be employed to achieve this. After uploading the dataset, users can add calculated fields during data preparation or from the analysis page to derive additional insights from the data.  
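
As a minimal illustration of pre-aggregation, the sketch below collapses raw rows into a summary view before it reaches the BI layer; in practice this step would run in an ETL tool such as AWS Glue DataBrew, or live as a materialized view in the source database. The sample rows and field names are hypothetical.

```python
from collections import defaultdict

# Hypothetical raw fact rows as they might arrive from a source system.
raw_rows = [
    {"region": "EMEA", "sales": 120.0},
    {"region": "EMEA", "sales": 80.0},
    {"region": "APAC", "sales": 200.0},
]

# Pre-aggregate by region so the dashboard queries a small summary
# view instead of scanning every raw row at render time.
totals = defaultdict(float)
for row in raw_rows:
    totals[row["region"]] += row["sales"]

summary = [{"region": r, "total_sales": t} for r, t in sorted(totals.items())]
print(summary)
```

Shifting this work upstream keeps dashboard queries small and fast, at the cost of refreshing the summary when the raw data changes.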

# Data mesh
<a name="data-mesh"></a>

 A *data mesh* is an architectural framework that enables domain teams to perform cross-domain data analysis through distributed, decentralized ownership. 

 Organizations have multiple data sources from different lines of business that must be integrated for analytics. Managing all these data sources from a central data repository can be challenging. Similar to how application architecture has evolved toward building microservices rather than a single application entity, data teams are exploring ways to modularize their data platforms into federated, decentralized solutions. 

 A data mesh is an analytics design pattern that effectively unites disparate data sources and links them together through self-service data sharing and governance guidelines. Business functions can maintain control over how shared data is accessed, who can access it, and when it can be accessed. Organizations that have built data lakes, data warehouses, and other data repositories, and require these environments to be more connected, could benefit from a data mesh architecture. 

 The trade-off of implementing a data mesh is added architectural complexity, but in return it brings efficiency by improving data searchability, accessibility, security, and scalability. 

 A data mesh transfers data control to domain experts who create meaningful data products within a decentralized governance framework. Data consumers request access to the data products and seek approvals or changes directly from data owners. As a result, everyone gets faster access to relevant data, and faster access improves business agility. 

 A data mesh may be suitable for customers who: 
+  Have a well-established data strategy 
+  Have a current implementation of a modern data architecture 
+  Have decoupled business units that operate autonomously 
+  Need to share data across business units, or with external partners 
+  Require consistent data governance across multiple teams that aren’t part of a single organization 
+  Need to have quick delivery cycles with well-defined agile practices, and are willing to iterate changes from lessons learned 

 Technology, people, and processes are the key principles that help deliver and maintain a successful data mesh. The people and processes can be identified as follows: 
+  **Data owner:** A data mesh features data domains as nodes, which exist in data lake accounts; it is founded in decentralization and distribution of data responsibility to people closest to the data, which become data domain owners. 
+  **Data steward:** Federated data governance is how data products are shared. Delivering discoverable, auditable metadata based on federated decision-making and accountability structures falls to the data steward. 
+  **Data engineer:** A data producer contributes one or more data products to a central catalog in a data mesh account. Data products must be autonomous, discoverable, secure, and reusable. 
+  **Data consumer:** The platform streamlines the experience of data users to discover, access, and use data products. It streamlines the experience of data consumers to easily consume and drive value from the data. 

# Characteristics
<a name="data-mesh-characteristics"></a>

 The following are characteristics of a data mesh: 
+  **Data diversity:** Treats data platforms as independent data domains, connecting data domains into the mesh to create business-oriented data products that can support strategic goals. The information persisted in their respective environments comes from different applications or source systems adding to the overall data diversity that analysts and data scientists benefit from. 
+  **Data democratization:** Rather than trying to combine multiple domains into a centrally managed data lake, data is intentionally left distributed. By adopting this approach, your organization’s data becomes democratized and accessible to more teams. 
+  **Data governance:** Improve data governance by pushing data access policy down into the data domains. Large enterprise organizations experience challenges when scaling their data governance to the number of subscribers because this is managed centrally. A data mesh allows for disparate teams to inherit the data governance policy from the data producer domain. 
+  **Searchability:** Establishing a central mechanism for data discovery is valuable for analysts and researchers to know what data is available. An enterprise-level data catalog contains the metadata of the organization’s data assets. The data catalog contains data attributes, data quality, data classification, and a business glossary of the data. 
+  **Data sharing:** Provide self-service data sharing features to allow domain owners to grant access to consumers. 
+  **Increased flexibility:** Increase data flexibility by implementing an enterprise data mesh. A data mesh provides organizations greater agility as data becomes widely available and supports faster data-driven business decisions. 
+  **Reusability:** A data mesh increases the adoption of reusable data pipeline design patterns to share data across your organization. 

# Design
<a name="data-mesh-design"></a>

 The following are data mesh design goals: 
+  **Data as a product:** Each organizational domain owns their data end-to-end. They’re responsible for building, operating, serving, and resolving any issues arising from the use of their data. Data accuracy and accountability lies with the data owner within the domain. 
+  **Federated data governance:** Data governance helps ensure that data is secure, accurate, and the right personas have access to the right data. The technical implementation of data governance, such as collecting lineage, validating data quality, and enforcing appropriate access controls, can be managed by each of the data domains. However, central data discovery, reporting, and auditing is needed to make it easy for users to find data, and for auditors to verify compliance. 
+  **Common access:** Data must be easily consumable by subject matter experts, such as data analysts and data scientists, and by purpose-built analytics and machine learning (ML) services. This requires data domains to expose a set of interfaces that make data consumable while enforcing appropriate access controls and audit tracking. 
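
The common-access goal can be sketched with AWS Lake Formation, where a producer domain grants a consumer account `SELECT` on one cataloged table. The dictionary below shows plausible `grant_permissions` parameters for the boto3 `lakeformation` client; the account IDs, database name, and table name are placeholders.

```python
# Hedged sketch of a cross-account Lake Formation grant, expressed as
# grant_permissions parameters. All identifiers are placeholders.
grant = {
    "Principal": {
        # The consumer domain's AWS account receives the grant.
        "DataLakePrincipalIdentifier": "999988887777"
    },
    "Resource": {
        "Table": {
            "CatalogId": "111122223333",   # producer / central catalog account
            "DatabaseName": "sales_domain",
            "Name": "orders",
        }
    },
    "Permissions": ["SELECT"],
    "PermissionsWithGrantOption": [],  # consumer cannot re-share
}
# import boto3
# boto3.client("lakeformation").grant_permissions(**grant)
```

Because the grant lives in the governance layer rather than in copied credentials, access can be audited centrally and revoked in one place.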

# Reference architecture
<a name="data-mesh-reference-architecture"></a>

![\[Data mesh reference architecture\]](http://docs.aws.amazon.com/wellarchitected/latest/analytics-lens/images/data-mesh-reference-architecture.png)


 Each consumer, producer, and central governance layer is its own separate data domain and typically resides in its own AWS account. Information is shared between domains. 

1.  Data producers are source systems that generate data, which is shared throughout the organization. Data producers can be an application, data stream, data lake, or data warehouse – essentially a domain that either generates or updates data. The business owners that are responsible for the data producers must have their data attributes classified for consumers to inherit the classification so data processing and data access to that data meets the organization’s or industry’s data governance policy. 

1.  Metadata relating to producer data must be shared with the central federated data catalog. Data owner information, data quality information, data location and any other metadata must be shared with the central data catalog at the earliest possible opportunity. 

1.  The federated governance layer is a centralized data governance domain that supports data cataloging, asset discoverability, permission management, and a central log for audit history. 

1.  Data governance rules, such as data classifications, access permissions, and metadata, are shared with the consumer system. This is typically done through an API connection, but can also be shared as a manual extract. 

1.  Data consumers are systems that consume information typically for analytical or data science type workloads. Information is either copied from or accessed directly from the producer domains through the federated governance environment. Access permissions are then inherited and propagated into the respective system to ensure the right people have access to the right data. 

 For more details, see [Design a data mesh architecture using AWS Lake Formation and AWS Glue](https://aws.amazon.com/blogs/big-data/design-a-data-mesh-architecture-using-aws-lake-formation-and-aws-glue/) and [What is a Data Mesh?](https://aws.amazon.com/what-is/data-mesh/) 