Content Domain 1: Data Ingestion and Transformation
Tasks
Task 1.1: Perform data ingestion
Skill 1.1.1: Read data from streaming sources (for example, Amazon Kinesis, Amazon Managed Streaming for Apache Kafka [Amazon MSK], Amazon DynamoDB Streams, AWS Database Migration Service [AWS DMS], AWS Glue, Amazon Redshift).
Skill 1.1.2: Read data from batch sources (for example, Amazon S3, AWS Glue, Amazon EMR, AWS DMS, Amazon Redshift, AWS Lambda, Amazon AppFlow).
Skill 1.1.3: Implement appropriate configuration options for batch ingestion.
Skill 1.1.4: Consume data APIs.
Skill 1.1.5: Set up schedulers by using Amazon EventBridge, Apache Airflow, or time-based schedules for jobs and crawlers.
Skill 1.1.6: Set up event triggers (for example, Amazon S3 Event Notifications, EventBridge).
Skill 1.1.7: Call a Lambda function from Kinesis.
Skill 1.1.8: Create allowlists for IP addresses to allow connections to data sources.
Skill 1.1.9: Implement throttling and overcome rate limits (for example, DynamoDB, Amazon RDS, Kinesis).
Skill 1.1.10: Manage fan-in and fan-out for streaming data distribution.
Skill 1.1.11: Describe replayability of data ingestion pipelines.
Skill 1.1.12: Define stateful and stateless data transactions.
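Skill 1.1.7 (calling a Lambda function from Kinesis) can be illustrated with a minimal handler sketch. This is not an official AWS sample: the record payloads are assumed to be JSON, and the partial-batch return shape (`batchItemFailures`) applies only when the event source mapping enables `ReportBatchItemFailures`.

```python
import base64
import json


def handler(event, context):
    """Decode and process a batch of Kinesis records delivered to Lambda.

    Hypothetical sketch: payloads are assumed to be JSON. Returning the
    sequence numbers of failed records lets Lambda retry only those
    records when ReportBatchItemFailures is enabled on the mapping.
    """
    failures = []
    processed = []
    for record in event["Records"]:
        try:
            # Kinesis data arrives base64-encoded inside the event.
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
            processed.append(payload)  # replace with the real transformation/sink
        except (ValueError, KeyError):
            failures.append(
                {"itemIdentifier": record["kinesis"]["sequenceNumber"]}
            )
    return {"batchItemFailures": failures, "processed": processed}
```

In a deployed function, only `batchItemFailures` matters to the Lambda service; the `processed` key is included here so the sketch can be exercised locally with a sample event.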
Task 1.2: Transform and process data
Skill 1.2.1: Optimize container usage for performance needs (for example, Amazon Elastic Kubernetes Service [Amazon EKS], Amazon Elastic Container Service [Amazon ECS]).
Skill 1.2.2: Connect to different data sources (for example, Java Database Connectivity [JDBC], Open Database Connectivity [ODBC]).
Skill 1.2.3: Integrate data from multiple sources.
Skill 1.2.4: Optimize costs while processing data.
Skill 1.2.5: Implement data transformation services based on requirements (for example, Amazon EMR, AWS Glue, Lambda, Amazon Redshift).
Skill 1.2.6: Transform data between formats (for example, from .csv to Apache Parquet).
Skill 1.2.7: Troubleshoot and debug common transformation failures and performance issues.
Skill 1.2.8: Create data APIs to make data available to other systems by using AWS services.
Skill 1.2.9: Define volume, velocity, and variety of data (for example, structured data, unstructured data).
Skill 1.2.10: Integrate large language models (LLMs) for data processing.
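Skill 1.2.6 (transforming data between formats) starts with parsing and typing the source records. A minimal sketch of that step, using only the standard library: the `schema` mapping of column name to Python type is an assumption for this example, and in a real pipeline the typed rows would be handed to a Parquet writer such as pyarrow (`pyarrow.Table.from_pylist` plus `pyarrow.parquet.write_table`) or an AWS Glue job.

```python
import csv
import io


def csv_to_typed_rows(csv_text, schema):
    """Parse CSV text into typed row dicts ready for a columnar writer.

    `schema` maps column name -> Python type (a hypothetical shape for
    this sketch). Typing rows before writing is what lets a columnar
    format like Parquet store real ints and floats instead of strings.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {col: cast(raw[col]) for col, cast in schema.items()}
        for raw in reader
    ]
```

For example, `csv_to_typed_rows("id,price\n1,9.99\n", {"id": int, "price": float})` yields one row with a true integer `id` and float `price`.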
Task 1.3: Orchestrate data pipelines
Skill 1.3.1: Use orchestration services to build workflows for data ETL pipelines (for example, Lambda, EventBridge, Amazon Managed Workflows for Apache Airflow [Amazon MWAA], AWS Step Functions, AWS Glue workflows).
Skill 1.3.2: Build data pipelines for performance, availability, scalability, resiliency, and fault tolerance.
Skill 1.3.3: Implement and maintain serverless workflows.
Skill 1.3.4: Use notification services to send alerts (for example, Amazon Simple Notification Service [Amazon SNS], Amazon Simple Queue Service [Amazon SQS]).
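Skills 1.3.2 and 1.3.4 together come down to a pattern: retry a pipeline step with backoff, and alert when retries are exhausted. A small sketch under assumed names — `step` is any zero-argument callable, and `on_failure` stands in for an alerting call (for example, a function that publishes to an Amazon SNS topic via boto3's `sns.publish`).

```python
import time


def run_step_with_retries(step, max_attempts=3, base_delay=0.1, on_failure=None):
    """Run one pipeline step with exponential-backoff retries.

    Fault-tolerance sketch: if every attempt fails, `on_failure` (a
    hypothetical alert hook, e.g. an SNS publish) receives the final
    exception before it is re-raised to the orchestrator.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception as exc:
            if attempt == max_attempts:
                if on_failure is not None:
                    on_failure(exc)
                raise
            # Exponential backoff: base_delay, 2x, 4x, ...
            time.sleep(base_delay * 2 ** (attempt - 1))
```

Managed services implement the same idea declaratively: Step Functions `Retry`/`Catch` fields and AWS Glue job retry settings replace this hand-rolled loop in production workflows.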
Task 1.4: Apply programming concepts
Skill 1.4.1: Optimize code to reduce runtime for data ingestion and transformation.
Skill 1.4.2: Configure Lambda functions to meet concurrency and performance needs.
Skill 1.4.3: Use programming languages and frameworks for data engineering (for example, Python, SQL, Scala, R, Java, Bash, PowerShell).
Skill 1.4.4: Use software engineering best practices for data engineering (for example, version control, testing, logging, monitoring).
Skill 1.4.5: Use infrastructure as code (IaC) to deploy data engineering solutions.
Skill 1.4.6: Use the AWS Serverless Application Model (AWS SAM) to package and deploy serverless data pipelines (for example, Lambda functions, Step Functions, DynamoDB tables).
Skill 1.4.7: Use and mount storage volumes from within Lambda functions.
Skill 1.4.8: Use IaC for repeatable resource deployment (for example, AWS CloudFormation and AWS Cloud Development Kit [AWS CDK]).
Skill 1.4.9: Describe continuous integration and continuous delivery (CI/CD) for the implementation, testing, and deployment of data pipelines.
Skill 1.4.10: Define distributed computing.
Skill 1.4.11: Describe data structures and algorithms (for example, graph data structures and tree data structures).
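Skill 1.4.11's graph data structures show up directly in orchestration: a pipeline is a directed acyclic graph of tasks, and a topological sort gives a valid run order. A sketch using Kahn's algorithm — the `deps` shape (each task mapped to the tasks it depends on, with every task present as a key) is an assumption for this example.

```python
from collections import deque


def topological_order(deps):
    """Return one valid run order for a pipeline DAG (Kahn's algorithm).

    `deps` maps each task to the list of tasks it depends on; every
    task must appear as a key (a hypothetical input shape for this
    sketch). Raises ValueError if the graph contains a cycle.
    """
    indegree = {task: len(parents) for task, parents in deps.items()}
    children = {task: [] for task in deps}
    for task, parents in deps.items():
        for parent in parents:
            children[parent].append(task)

    ready = deque(sorted(t for t, d in indegree.items() if d == 0))
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for child in children[task]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)

    if len(order) != len(deps):
        raise ValueError("dependency cycle detected")
    return order
```

Orchestrators such as Apache Airflow and AWS Step Functions perform this kind of dependency resolution internally when scheduling DAG tasks.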