Architecture overview - Clickstream Analytics on AWS

Architecture overview

This section provides a reference implementation architecture diagram for the components deployed with this solution.

Architecture diagram

Solution end-to-end architecture

Deploying this solution with the default parameters builds the following environment in AWS:

AWS architecture diagram showing data flow through various services for authentication and processing.

Clickstream Analytics on AWS architecture

This solution deploys the AWS CloudFormation template in your AWS Cloud account and completes the following settings.

  1. Amazon CloudFront distributes the frontend web UI assets hosted in the Amazon S3 bucket, and the backend APIs hosted with Amazon API Gateway and AWS Lambda.

  2. The Amazon Cognito user pool or OpenID Connect (OIDC) is used for authentication.

  3. The web UI console uses Amazon DynamoDB to store persistent data.

  4. AWS Step Functions, AWS CloudFormation, AWS Lambda, and Amazon EventBridge are used for orchestrating the lifecycle management of data pipelines.

  5. The data pipeline is provisioned in the Region specified by the system operator. It consists of Application Load Balancer, Amazon ECS, Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon Kinesis Data Streams, Amazon S3, Amazon EMR Serverless, Amazon Redshift, and QuickSight.

The key functionality of this solution is to build a data pipeline that collects, processes, and analyzes your clickstream data. The data pipeline consists of four modules:

  • Ingestion module

  • Data processing module

  • Data modeling module

  • Reporting module

The following introduces the architecture diagram for each module.

Ingestion module

AWS architecture diagram showing data flow through various services including Cognito, ECS, Lambda, and S3.

Ingestion module architecture

Suppose you create a data pipeline in the solution. This solution deploys the AWS CloudFormation template in your AWS account and completes the following settings.

Note

The ingestion module supports three types of data sinks. You can only have one type of data sink in a data pipeline.

  1. (Optional) The ingestion module creates an AWS Global Accelerator endpoint to reduce the latency of sending events from your clients (web applications or mobile applications).

  2. An Application Load Balancer (ALB), part of Elastic Load Balancing (ELB), is used for load balancing the ingestion web servers.

  3. (Optional) If you enable the authentication feature, the ALB communicates with the OIDC provider to authenticate the requests.

  4. ALB forwards all authenticated and valid requests to the ingestion servers.

  5. An Amazon ECS cluster hosts the ingestion fleet servers. Each server consists of a proxy and a worker service. The proxy is a facade for the HTTP protocol, and the worker sends the events to the data sink that you choose.

  6. If Amazon Kinesis Data Streams is used as a buffer, AWS Lambda consumes the clickstream data in Kinesis Data Streams and then sinks it to Amazon S3 in batches.

  7. If Amazon MSK is used as a buffer, MSK Connector is provisioned with an S3 connector plugin that sinks the clickstream data to Amazon S3 in batches.

  8. If Amazon S3 is selected as the data sink, the ingestion server buffers a batch of events and sinks them to Amazon S3.
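
The batching behavior described in steps 6 through 8 can be illustrated with a small sketch. This is not the solution's ingestion code: the `BatchBuffer` class, the `max_events` and `max_age_seconds` thresholds, and the newline-delimited JSON layout are all assumptions made for illustration, and the `sink` callback stands in for whatever writes the batch to Amazon S3.

```python
import json
import time
from typing import Any, Callable, Dict, List, Optional


class BatchBuffer:
    """Collects events and hands them to a sink callback in batches.

    Hypothetical sketch: the thresholds and the output layout are
    illustrative assumptions, not parameters of the solution.
    """

    def __init__(
        self,
        sink: Callable[[str], None],
        max_events: int = 100,
        max_age_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._sink = sink
        self._max_events = max_events
        self._max_age = max_age_seconds
        self._clock = clock
        self._events: List[Dict[str, Any]] = []
        self._first_at: Optional[float] = None

    def add(self, event: Dict[str, Any]) -> None:
        # Remember when the current batch started so it can age out.
        if not self._events:
            self._first_at = self._clock()
        self._events.append(event)
        too_many = len(self._events) >= self._max_events
        too_old = self._clock() - self._first_at >= self._max_age
        if too_many or too_old:
            self.flush()

    def flush(self) -> None:
        # Nothing buffered means nothing to write.
        if not self._events:
            return
        # One JSON document per line; the sink decides where it goes.
        body = "\n".join(json.dumps(e, sort_keys=True) for e in self._events)
        self._sink(body)
        self._events = []
        self._first_at = None
```

In this sketch the age check only runs when a new event arrives, so a real worker would presumably also flush on a timer and on shutdown.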

Data processing module

Data processing flow from Amazon EventBridge through various AWS services to Amazon S3.

Data processing module architecture

Suppose you create a data pipeline in the solution and enable data processing. This solution deploys the AWS CloudFormation template in your AWS Cloud account and completes the following settings.

  1. Amazon EventBridge is used to trigger the data processing jobs periodically.

  2. The configurable time-based scheduler invokes an AWS Lambda function.

  3. The Lambda function kicks off an EMR Serverless application based on Spark to process a batch of clickstream events.

  4. The EMR Serverless application uses the configurable transformer and enrichment plug-ins to process the clickstream data from the source S3 bucket.

  5. After processing the clickstream events, the EMR Serverless application sinks the processed clickstream data to the sink S3 bucket.
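
Step 3 above, where the Lambda function starts a Spark job on EMR Serverless, might look roughly like the following sketch. The request keys follow the EMR Serverless `StartJobRun` API, but the function names, the argument order passed to the Spark entry point, and the way plug-ins are encoded are assumptions for illustration only. The client is passed in so the sketch runs without AWS credentials; inside a Lambda function it would typically be `boto3.client("emr-serverless")`.

```python
from typing import Any, Dict, List


def build_job_request(
    application_id: str,
    execution_role_arn: str,
    entry_point: str,
    source_uri: str,
    sink_uri: str,
    plugins: List[str],
) -> Dict[str, Any]:
    """Assemble keyword arguments for an EMR Serverless StartJobRun call.

    The entry point arguments (source, sink, comma-joined plug-in names)
    are a hypothetical convention, not the solution's actual contract.
    """
    return {
        "applicationId": application_id,
        "executionRoleArn": execution_role_arn,
        "jobDriver": {
            "sparkSubmit": {
                "entryPoint": entry_point,
                "entryPointArguments": [source_uri, sink_uri, ",".join(plugins)],
            }
        },
    }


def start_processing_job(emr_client: Any, **config: Any) -> str:
    """Start one batch job and return its job run ID."""
    request = build_job_request(**config)
    response = emr_client.start_job_run(**request)
    return response["jobRunId"]
```

Keeping the request builder separate from the call makes the payload easy to inspect in a unit test with a stub client.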

Data modeling module

Data modeling workflow using AWS services, including S3, EventBridge, Lambda, DynamoDB, and Redshift.

Data modeling in Redshift architecture

Suppose you create a data pipeline in the solution and enable data modeling in Amazon Redshift. This solution deploys the AWS CloudFormation template in your AWS Cloud account and completes the following settings.

  1. After the processed clickstream data is written to the Amazon S3 bucket, an Object Created event is emitted.

  2. An Amazon EventBridge rule is created for the event emitted in step 1, and an AWS Lambda function is invoked when the event happens.

  3. The Lambda function persists the source event to be loaded in an Amazon DynamoDB table.

  4. When the data processing job is done, an event is emitted to Amazon EventBridge.

  5. The pre-defined event rule of Amazon EventBridge processes the EMR job success event.

  6. The rule invokes the AWS Step Functions workflow.

  7. The workflow invokes the list objects Lambda function that queries the DynamoDB table to find out the data to be loaded, then creates a manifest file for a batch of event data to optimize the load performance.

  8. After a few seconds, the check status Lambda function checks the status of the loading job.

  9. If the load is still in progress, the check status Lambda function waits for a few more seconds.

  10. After all objects are loaded, the workflow ends.
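
The manifest file mentioned in step 7 can be sketched as follows. The output uses the Amazon Redshift COPY manifest layout (an `entries` list of `url` and `mandatory` fields, plus `meta.content_length`, which Redshift requires for columnar formats such as Parquet and ORC). The shape of the input records (`bucket`, `key`, `size`), the function name, and the batch size are assumptions; the section does not describe how the solution stores pending objects in DynamoDB.

```python
import json
from typing import Any, Dict, List


def build_manifests(objects: List[Dict[str, Any]], batch_size: int) -> List[str]:
    """Group pending S3 objects into Redshift COPY manifests.

    Each input record is assumed (hypothetically) to carry the keys
    "bucket", "key", and "size". Returns one JSON string per batch.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    manifests: List[str] = []
    for start in range(0, len(objects), batch_size):
        batch = objects[start:start + batch_size]
        entries = [
            {
                "url": f"s3://{obj['bucket']}/{obj['key']}",
                # Fail the load if a listed object is missing.
                "mandatory": True,
                "meta": {"content_length": obj["size"]},
            }
            for obj in batch
        ]
        manifests.append(json.dumps({"entries": entries}))
    return manifests
```

Loading many files through one manifest lets a single COPY command ingest the whole batch, which is the load-performance benefit the step refers to.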

Data modeling workflow with Amazon EventBridge, Lambda, Glue, Athena, and S3 components.

Data modeling in Athena architecture

Suppose you create a data pipeline in the solution and enable data modeling in Amazon Athena. This solution deploys the AWS CloudFormation template in your AWS Cloud account and completes the following settings.

  1. Amazon EventBridge invokes the data load into Amazon Athena periodically.

  2. The configurable time-based scheduler invokes an AWS Lambda function.

  3. The AWS Lambda function creates the partitions of the AWS Glue table for the processed clickstream data.

  4. Amazon Athena is used for interactive querying of clickstream events.

  5. The processed clickstream data is scanned via the Glue table.
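
To make step 3 concrete, the following sketch generates partition definitions for a range of days. The dictionaries use the `Values` and `StorageDescriptor.Location` fields of the AWS Glue `PartitionInput` structure, which a caller could pass to the Glue `batch_create_partition` API as `PartitionInputList`. The `year`/`month`/`day` partition scheme and the S3 path layout are assumptions chosen for illustration; the section does not state which partition keys the solution uses, and a real call would normally also copy the table's column and format settings into each storage descriptor.

```python
from datetime import date, timedelta
from typing import Any, Dict, List


def partition_inputs(base_location: str, start: date, days: int) -> List[Dict[str, Any]]:
    """Build Glue PartitionInput dicts for `days` consecutive dates.

    Assumes (hypothetically) a table partitioned by year, month, and day,
    with data stored under Hive-style key=value prefixes.
    """
    base = base_location.rstrip("/")
    inputs: List[Dict[str, Any]] = []
    for offset in range(days):
        current = start + timedelta(days=offset)
        year = f"{current.year:04d}"
        month = f"{current.month:02d}"
        day = f"{current.day:02d}"
        inputs.append(
            {
                # Order must match the table's partition key order.
                "Values": [year, month, day],
                "StorageDescriptor": {
                    "Location": f"{base}/year={year}/month={month}/day={day}/"
                },
            }
        )
    return inputs
```

Registering partitions ahead of time this way means Athena only scans the prefixes that match a query's partition filters.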

Reporting module

Diagram showing Amazon Redshift connected to Amazon QuickSight via QuickSight VPC Connection within a VPC.

Reporting module architecture

Suppose you create a data pipeline in the solution, enable data modeling in Amazon Redshift, and enable reporting in QuickSight. This solution deploys the AWS CloudFormation template in your AWS Cloud account and completes the following settings.

  1. A VPC connection in QuickSight is used to securely connect to your Amazon Redshift within the VPC.

  2. The data source, data sets, template, analysis, and dashboard are created in QuickSight for out-of-the-box analysis and visualization.

Analytics Studio

Analytics Studio is a unified web interface for business analysts or data analysts to view and create dashboards, query and explore clickstream data, and manage metadata.

Diagram showing AWS services interaction for Analytics Studio, including authentication and data flow.

Analytics Studio architecture

  1. When analysts access Analytics Studio, requests are sent to Amazon CloudFront, which distributes the web application.

  2. When the analysts log in to Analytics Studio, the requests are redirected to the Amazon Cognito user pool or OpenID Connect (OIDC) for authentication.

  3. Amazon API Gateway hosts the backend API requests and uses the custom Lambda authorizer to authorize the requests with the public key of OIDC.

  4. API Gateway integrates with AWS Lambda to serve the API requests.

  5. The AWS Lambda function uses Amazon DynamoDB to retrieve and persist the data.

  6. When analysts create analyses, the Lambda function requests Amazon QuickSight to create assets and get the embed URL in the data pipeline Region.

  7. The analyst's browser accesses the QuickSight embed URL to view the QuickSight dashboards and visuals.
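
The custom Lambda authorizer in step 3 can be sketched as below. The returned structure (`principalId` plus an IAM `policyDocument` allowing or denying `execute-api:Invoke`) is the standard API Gateway Lambda authorizer output. Everything else is an assumption: the section does not say whether the solution uses a token-based or request-based authorizer, so this sketch assumes a token-based event with `authorizationToken` and `methodArn`, and the `verify_token` callable is a hypothetical stand-in for validating the JWT signature against the OIDC provider's public key.

```python
from typing import Any, Callable, Dict


def build_policy(principal_id: str, effect: str, resource: str) -> Dict[str, Any]:
    """Return an API Gateway Lambda authorizer response."""
    return {
        "principalId": principal_id,
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Action": "execute-api:Invoke",
                    "Effect": effect,
                    "Resource": resource,
                }
            ],
        },
    }


def authorize(
    event: Dict[str, Any],
    verify_token: Callable[[str], Dict[str, Any]],
) -> Dict[str, Any]:
    """Allow the request only if the bearer token verifies.

    `verify_token` is a hypothetical callable that returns the token's
    claims or raises ValueError when the signature or expiry is invalid.
    """
    token = event.get("authorizationToken", "")
    if token.startswith("Bearer "):
        token = token[len("Bearer "):]
    try:
        claims = verify_token(token)
    except ValueError:
        return build_policy("anonymous", "Deny", event["methodArn"])
    return build_policy(claims["sub"], "Allow", event["methodArn"])
```

Injecting the verifier keeps the policy logic independent of any particular JWT library, which this sketch deliberately does not choose.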