

# Data lifecycle


To build a data pipeline, you must first ingest data into AWS from an external or internal data source, such as a file server, database, storage bucket, or API call. The ingested data may or may not go through transformations such as anonymization, column dropping, or data cleaning.

This section provides an overview of the stages in the data lifecycle process, as shown in the following diagram.

![\[Data lifecycle overview diagram\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/modern-data-centric-use-cases/images/data_lifecycle_overview.png)


These stages include the following:
+ Data collection
+ Data preparation and cleaning
+ Data quality checks
+ Data visualization and analysis
+ Monitoring and debugging
+ IaC deployment
+ Automation and access control

# Data collection


You can collect data from a variety of sources within AWS, but it's important to choose the right data collection tool for your use case. The following diagram shows how the data collection stage fits into the data engineering automation and access control lifecycle.

![\[Data collection diagram\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/modern-data-centric-use-cases/images/data_collection.png)


AWS provides the following data collection tools:
+ [Amazon Kinesis](https://aws.amazon.com/kinesis/) helps you collect streaming data. Kinesis also offers seamless integration and processing capabilities.
+ [AWS Database Migration Service (AWS DMS)](https://aws.amazon.com/dms/) helps you ingest data from relational databases. AWS DMS provides configuration options and direct connections between on-premises databases and targets hosted on AWS, such as Amazon Simple Storage Service (Amazon S3).
+ [AWS Glue](https://aws.amazon.com/glue/) is an extract, transform, and load (ETL) tool that helps you ingest unstructured data.

There are several use cases for collecting unstructured or semi-structured data and storing it in Amazon S3. For example, a manufacturing site might need to ingest machine history data as XML files, event data as JSON files, and purchase data from a relational database. The use case could also require that all three data sources be joined.
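Before wiring up the AWS services, the join described above can be sketched locally with standard Python to validate the field mapping. The file layouts, field names, and sample values here are hypothetical:

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical samples of the three sources described above.
machine_xml = """<machines>
  <machine><id>M-1</id><status>running</status></machine>
  <machine><id>M-2</id><status>stopped</status></machine>
</machines>"""

event_json = '[{"machine_id": "M-1", "event": "overheat"}]'

# Purchase rows as they might arrive from a relational database.
purchase_rows = [{"machine_id": "M-1", "part": "belt", "cost": 42.0}]

# Parse the XML machine history into dictionaries keyed by machine ID.
machines = {
    m.findtext("id"): {"status": m.findtext("status")}
    for m in ET.fromstring(machine_xml).iter("machine")
}

# Merge the JSON events and the relational purchases onto each machine record.
for event in json.loads(event_json):
    machines[event["machine_id"]].setdefault("events", []).append(event["event"])
for row in purchase_rows:
    machines[row["machine_id"]].setdefault("purchases", []).append(row)

print(machines["M-1"])
```

At scale, the same join keyed on a shared identifier would run in an AWS Glue or Amazon EMR job rather than on a single machine.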

Before you start the data ingestion process, we recommend that you understand what data must be ingested, and then choose the right tool to collect this data.

# Data preparation and cleaning


Data preparation and cleaning is one of the most important yet most time-consuming stages of the data lifecycle. The following diagram shows how the data preparation and cleaning stage fits into the data engineering automation and access control lifecycle.

![\[Data prep and cleaning diagram\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/modern-data-centric-use-cases/images/data_prep_cleaning.png)


Here are some examples of data preparation or cleaning:
+ Mapping text columns to codes
+ Ignoring empty columns
+ Filling empty data fields with `0`, `None`, or `''`
+ Anonymizing or masking personally identifiable information (PII)
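The operations above can be sketched in plain Python. The column names, code mapping, and masking rule here are illustrative, not a prescribed schema:

```python
# Hypothetical raw records; column names and code mappings are illustrative.
raw_rows = [
    {"name": "Alice Smith", "country": "Germany", "score": "", "notes": None},
    {"name": "Bob Jones", "country": "France", "score": "7", "notes": None},
]

COUNTRY_CODES = {"Germany": "DE", "France": "FR"}  # text-to-code mapping

def clean(row):
    cleaned = dict(row)
    cleaned["country"] = COUNTRY_CODES.get(row["country"], "UNKNOWN")
    cleaned["score"] = int(row["score"]) if row["score"] else 0  # fill empties with 0
    cleaned.pop("notes", None)   # drop an always-empty column
    cleaned["name"] = "***"      # mask PII before downstream use
    return cleaned

cleaned_rows = [clean(r) for r in raw_rows]
print(cleaned_rows[0])
```

In production, the same row-level logic would typically live inside a Spark transformation in AWS Glue or Amazon EMR, or in a DataBrew recipe.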

If you have a large workload with a variety of data, we recommend that you use [Amazon EMR](https://aws.amazon.com/emr/) or [AWS Glue](https://aws.amazon.com/glue/) for your data preparation and cleaning tasks. Amazon EMR and AWS Glue both work with unstructured, semi-structured, and relational data, and both can use Apache Spark to create a `DataFrame` or `DynamicFrame` for distributed (horizontal) processing. Moreover, you can use [AWS Glue DataBrew](https://aws.amazon.com/glue/features/databrew/) to clean and process data with a no-code approach. Additionally, DataBrew can profile your dataset with column statistics, provide data lineage, and apply data quality rules to all or specified columns.

For smaller workloads that don't require distributed processing and can be completed in under 15 minutes, we recommend that you use [AWS Lambda](https://aws.amazon.com/lambda/) for data preparation and cleaning. Lambda is a cost-effective and lightweight option for smaller workloads. For highly secure data that can't enter the cloud, we recommend that you perform data anonymization on Amazon Elastic Compute Cloud (Amazon EC2) instances by using an [AWS Outposts](https://aws.amazon.com/outposts/) server.
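For the Lambda path, a lightweight cleaning function might look like the following sketch. The event shape (a `records` list) is an assumption; adapt it to your actual trigger (S3 notification, Kinesis batch, and so on):

```python
# Sketch of an AWS Lambda handler for lightweight data cleaning.
# The event shape (a "records" list) is a hypothetical example.
def lambda_handler(event, context):
    cleaned = []
    for record in event.get("records", []):
        cleaned.append({
            "id": record["id"],
            # Fill missing values with 0 so downstream joins don't fail.
            "value": record.get("value") or 0,
        })
    return {"statusCode": 200, "cleaned": cleaned}

# Local invocation with a sample payload (context is unused here).
print(lambda_handler({"records": [{"id": 1, "value": None}]}, None))
```

Keeping the handler stateless and per-record makes it easy to test locally before deployment.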

It's essential to choose the right AWS service for data preparation and cleaning and to understand the tradeoffs involved. For example, consider a scenario where you're choosing among AWS Glue, DataBrew, and Amazon EMR. AWS Glue is ideal if the ETL job is infrequent (for example, once a day, once a week, or once a month) and your data engineers are proficient in writing Spark code (for big data use cases) or scripting in general. If the job is more frequent, running AWS Glue constantly can get expensive; in that case, Amazon EMR provides distributed processing capabilities in both serverless and server-based versions. If your data engineers don't have the right skill set or if you must deliver results fast, DataBrew is a good option: it reduces the effort to develop code and speeds up the data preparation and cleaning process.

After the processing is completed, the data from the ETL process is stored on AWS. The choice of storage depends on what type of data you're dealing with. For example, you could be working with non-relational data like graph data, key-value pair data, images, text files, or relational structured data.

As shown in the following diagram, you can use the following AWS services for data storage:
+ [Amazon S3](https://aws.amazon.com/s3/) stores unstructured data or semi-structured data (for example, Apache Parquet files, images, and videos).
+ [Amazon Neptune](https://aws.amazon.com/neptune/) stores graph datasets that you can query by using SPARQL or GREMLIN.
+ [Amazon Keyspaces (for Apache Cassandra)](https://aws.amazon.com/keyspaces/) stores datasets that are compatible with Apache Cassandra.
+ [Amazon Aurora](https://aws.amazon.com/rds/aurora/) stores relational datasets.
+ [Amazon DynamoDB](https://aws.amazon.com/dynamodb/) stores key-value or document data in a NoSQL database.
+ [Amazon Redshift](https://aws.amazon.com/redshift/) stores workloads for structured data in a data warehouse.



![\[Data storage services.\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/modern-data-centric-use-cases/images/data_prep_cleaning_storage_1.png)


By using the right service with the correct configurations, you can store your data in the most efficient and effective way. This minimizes the effort involved in data retrieval.
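For Amazon S3 in particular, efficient retrieval often comes down to the object key layout. One common convention is Hive-style partitioning, which query engines such as Amazon Athena and AWS Glue can use to prune scans. A minimal helper, with hypothetical prefix and dataset names:

```python
from datetime import date

def partitioned_key(prefix, dataset, day, filename):
    """Build a Hive-style partitioned S3 object key (year=/month=/day=)."""
    return (f"{prefix}/{dataset}/"
            f"year={day.year}/month={day.month:02d}/day={day.day:02d}/{filename}")

key = partitioned_key("raw", "purchases", date(2024, 1, 5), "part-0000.parquet")
print(key)
# A query engine can then prune partitions instead of scanning every object.
```

Writing keys this way lets date-filtered queries read only the matching prefixes, which reduces both cost and latency.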

# Data quality checks


Data quality is an integral yet often overlooked part of the data cleaning process. The following diagram shows how data quality checks fit into the data engineering automation and access control lifecycle.

![\[Data quality diagram\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/modern-data-centric-use-cases/images/data_quality_checks.png)


The following table provides an overview of different data quality solutions based on use case.


| Use case | Solution | Example |
| --- | --- | --- |
| No-code solution to add column-level or table-level quality conditions | [AWS Glue DataBrew](https://aws.amazon.com/glue/features/databrew/) | Checks whether all column values are between 1 and 12, or whether a table or column is empty |
| Custom code added to an AWS Glue job, or a no-code solution (in preview) to add column-level or table-level quality conditions | [AWS Glue Data Quality](https://docs.aws.amazon.com/glue/latest/dg/glue-data-quality.html) | Checks whether the column `first_name` is not null, whether the column `phone_number` contains only numbers, or whether statistical functions, such as average or sum, satisfy a condition |
| Custom checks | ETL of choice, such as [AWS Lambda](https://aws.amazon.com/lambda/), [AWS Glue](https://aws.amazon.com/glue/), or [Amazon EMR](https://aws.amazon.com/emr/) | Checks whether the value of column A is always greater than the corresponding values of columns B and C, or whether the value of the `continent` column is always geographically consistent with the `city` column |
| Sophisticated solution with a metrics report, constraint validation, and constraint suggestions | [Deequ](https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/) | Checks whether the `CompletenessConstraint` on the `Completeness` metric of the `review_id` column equals `1` |
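The "custom checks" row can be illustrated with plain Python of the kind you might embed in a Lambda function or an AWS Glue job. The column names and rules here are hypothetical:

```python
# Minimal custom data quality checks. Column names and rules are illustrative.
def run_checks(rows):
    failures = []
    for i, row in enumerate(rows):
        if row.get("first_name") is None:
            failures.append((i, "first_name is null"))
        if not str(row.get("phone_number", "")).isdigit():
            failures.append((i, "phone_number contains non-digits"))
        if not row.get("a", 0) > row.get("b", 0):
            failures.append((i, "column a not greater than column b"))
    return failures

rows = [{"first_name": "Ada", "phone_number": "5551234", "a": 3, "b": 1},
        {"first_name": None, "phone_number": "555-9876", "a": 0, "b": 2}]
print(run_checks(rows))
```

Returning the failures as structured records (rather than raising on the first error) lets the pipeline log every violation and decide whether to quarantine rows or stop the run.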

# Data visualization and analysis


After you complete your data quality checks, you can move to the data analysis or visualization stage, as shown in the following diagram.

![\[Data visualization diagram\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/modern-data-centric-use-cases/images/data_visualization_analysis.png)


In this stage, you can use [Amazon QuickSight](https://aws.amazon.com/quicksight/) to create graphs or charts, [Neptune](https://aws.amazon.com/neptune/) for graph database operations and visualization, or [OpenSearch](https://aws.amazon.com/what-is/opensearch/) for open-source search and analytics. Typically, you can also feed clean data into data science or machine learning (ML) workflows by using Amazon SageMaker Pipelines or simple reads from Amazon S3. The data visualization and analysis stage concludes the sequential portion of the data engineering pipeline.

# Monitoring and debugging


Certain stages in the data lifecycle are not sequential; instead, they are consistently present throughout. This is true for the monitoring and debugging stage, as shown in the following diagram.

![\[Monitoring and debugging diagram\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/modern-data-centric-use-cases/images/monitoring_debugging.png)


The process of data engineering must be continually monitored for correctness and performance. [Amazon CloudWatch](https://aws.amazon.com/cloudwatch/) plays a crucial role in this monitoring because the pipeline's services write their error and informational logs to CloudWatch log groups. You can use monitoring to build automated error recovery. For example, you can stop pipelines if you find that your data quality rules are not satisfied, or you can log successful runs and failed runs separately to enable a recovery action. Monitoring improves the overall reliability of the data engineering process (that is, the full ETL process) as well as the data.
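One way to feed such dashboards is to publish custom metrics after each pipeline run. The following sketch shapes run results into CloudWatch metric data; the namespace and metric names are assumptions, and the `boto3` call is shown commented out so the logic runs without AWS credentials:

```python
# Sketch: turn pipeline run results into CloudWatch custom metric data.
# Metric and namespace names are hypothetical.
def run_metrics(results):
    ok = sum(1 for r in results if r["status"] == "succeeded")
    failed = len(results) - ok
    return [
        {"MetricName": "SuccessfulRuns", "Value": ok, "Unit": "Count"},
        {"MetricName": "FailedRuns", "Value": failed, "Unit": "Count"},
    ]

metrics = run_metrics([{"status": "succeeded"}, {"status": "failed"}])
print(metrics)

# To publish, you would call the CloudWatch API, for example:
# import boto3
# boto3.client("cloudwatch").put_metric_data(Namespace="EtlPipeline",
#                                            MetricData=metrics)
```

A dashboard widget graphing `FailedRuns` per source then surfaces the pain points described above.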

Additionally, we recommend that you create CloudWatch dashboards that include the relevant metrics for the monitoring and debugging process. This helps ensure that the data engineering process runs smoothly and as expected, which is important for operations as well as reporting. For example, a CloudWatch dashboard can show users the status of loads, what percentage of their data was dropped due to low quality, and which sources have the most failures, helping them gauge the reliability of their processes. A CloudWatch dashboard not only helps you visualize results but also helps you improve processes by identifying the pain points in the ETL process.

# IaC deployment


Modern architecture is incomplete without a mechanism for an infrastructure as code (IaC) deployment. The following diagram shows the AWS services related to IaC deployment.

![\[IaC deployment diagram\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/modern-data-centric-use-cases/images/iac_deployment.png)


We recommend that any deployed infrastructure always be backed by code through IaC tools such as [AWS CloudFormation](https://aws.amazon.com/cloudformation/) or the [AWS Cloud Development Kit (AWS CDK)](https://docs.aws.amazon.com/cdk/v2/guide/home.html). AWS CDK synthesizes the infrastructure that you define in code into CloudFormation templates.
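As a sketch of the IaC approach, a minimal CloudFormation template could declare a single S3 bucket for the pipeline's raw data. The logical ID and properties here are illustrative:

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: Minimal sketch - one versioned S3 bucket for raw pipeline data
Resources:
  RawDataBucket:
    Type: AWS::S3::Bucket
    Properties:
      VersioningConfiguration:
        Status: Enabled
```

Because the bucket is declared rather than clicked together, the same template can be reviewed, versioned, and deployed identically across environments.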

As a best practice, we recommend that you push your code to a code repository of your choice. It's also a best practice to use source control in your code repository so that you have versioning and collaboration capabilities that enable multiple team members to work simultaneously on the same code base, while ensuring that the code integration from different developers into the main branch doesn't result in any conflicts. 

# Automation and access control


## Automation


Pipeline automation is a crucial part of modern data-centric architecture design. To successfully run your production system, we recommend that you have a data pipeline that has a start trigger, connecting steps, and a mechanism for separating failed and passed stages. It's also important to log failures while not hindering the rest of the ETL process.

You can use [AWS Glue workflows](https://docs.aws.amazon.com/glue/latest/dg/workflows_overview.html) to create a pipeline. The pipeline supports all AWS Glue jobs, Amazon EventBridge triggers, and crawlers. You can also create workflows from scratch or by using AWS Glue [blueprints](https://docs.aws.amazon.com/glue/latest/dg/blueprints-overview.html). A blueprint provides a framework that helps you get started on reusable use cases. For example, this could be a workflow to import data from Amazon S3 into a DynamoDB table. You can even use parameters to make the blueprint reusable.

If the data pipeline involves more services outside of AWS Glue, then we recommend that you use [AWS Step Functions](https://aws.amazon.com/step-functions/) as the orchestrator. Step Functions can create automated workflows, including manual approval steps for security incident response. You can also use Step Functions for large-scale parallel or sequential processing.
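A Step Functions workflow is defined in Amazon States Language (JSON). Building the definition as a Python dict, a minimal ETL flow with a failure branch might look like the following sketch; the state names and resource ARNs are placeholders:

```python
import json

# Amazon States Language definition: run an ETL step, catch failures, and
# route them to a separate logging state. ARNs are placeholders.
definition = {
    "StartAt": "RunEtl",
    "States": {
        "RunEtl": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:etl-step",
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "LogFailure"}],
            "Next": "Success",
        },
        "LogFailure": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:log-failure",
            "End": True,
        },
        "Success": {"Type": "Succeed"},
    },
}

print(json.dumps(definition, indent=2))
# You would pass json.dumps(definition) to Step Functions, for example via
# boto3's stepfunctions client, when creating the state machine.
```

The `Catch` branch implements the "separating failed and passed stages" pattern described above: failures are logged in their own state instead of silently stopping the workflow.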

Finally, we recommend using [EventBridge](https://aws.amazon.com/eventbridge/) to insert triggers on schedules, events, or on demand. You can also use EventBridge to create pipelines with filters.

## Access control


We recommend that you use [AWS Identity and Access Management (IAM)](https://aws.amazon.com/iam/) for access control. IAM allows you to specify who or what can access services and resources in AWS and centrally manage fine-grained permissions. Every phase of the lifecycle—from storage to automation to using processing tools—requires the right access permissions. While working with data-centric use cases, you can use [AWS Lake Formation](https://aws.amazon.com/lake-formation/) to simplify the process of making data available for wide-ranging analytics and also across accounts.
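As a sketch of fine-grained permissions, a least-privilege IAM policy granting read-only access to a single pipeline bucket could be expressed as follows; the bucket name is a placeholder:

```python
import json

# Hypothetical read-only IAM policy for one pipeline bucket.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": ["arn:aws:s3:::example-pipeline-bucket",
                     "arn:aws:s3:::example-pipeline-bucket/*"],
    }],
}

print(json.dumps(policy, indent=2))
# You would attach the JSON document to a role through IAM, for example via
# boto3's iam client, or manage it in your IaC templates.
```

Scoping the `Resource` list to one bucket (and its objects) keeps each lifecycle stage limited to exactly the data it needs.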