Content Domain 1: Data Engineering
Tasks
Task 1.1: Create data repositories for ML
Identify data sources (for example, content and location, primary sources such as user data).
Determine storage mediums (for example, databases, Amazon S3, Amazon Elastic File System [Amazon EFS], Amazon Elastic Block Store [Amazon EBS]).
Task 1.2: Identify and implement a data ingestion solution
Identify data job styles and job types (for example, batch load, streaming).
Orchestrate data ingestion pipelines (batch-based ML workloads and streaming-based ML workloads):
Amazon Kinesis
Amazon Data Firehose
Amazon EMR
AWS Glue
Amazon Managed Service for Apache Flink
Schedule jobs.
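The difference between the batch and streaming job styles named in this task can be sketched in plain Python. This is only an illustration under stated assumptions: the function names `batch_load` and `stream_load`, the doubling transform, and the window size are hypothetical and do not correspond to any AWS API.

```python
from itertools import islice


def batch_load(records):
    """Batch style: the whole dataset is available and processed in one pass."""
    return [r * 2 for r in records]


def stream_load(source, window=2):
    """Streaming style: consume records in small windows as they arrive."""
    it = iter(source)
    while True:
        chunk = list(islice(it, window))  # take up to `window` records
        if not chunk:
            break
        yield [r * 2 for r in chunk]


data = [1, 2, 3, 4, 5]  # hypothetical input records
batch_result = batch_load(data)
stream_result = list(stream_load(data, window=2))
print(batch_result)   # [2, 4, 6, 8, 10]
print(stream_result)  # [[2, 4], [6, 8], [10]]
```

Both styles apply the same transform. What differs is when results become available: once at the end for the batch job, and incrementally per window for the streaming job.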
Task 1.3: Identify and implement a data transformation solution
Transform data in transit by using extract, transform, and load (ETL) tools (for example, AWS Glue, Amazon EMR, AWS Batch).
Handle ML-specific data by using MapReduce (for example, Apache Hadoop, Apache Spark, Apache Hive).
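The MapReduce pattern referenced in this task can be sketched as a small single-process example. It is a minimal illustration, assuming a hypothetical list of labeled records. The helper names `map_phase`, `shuffle`, and `reduce_phase` are made up for this sketch and are not part of Hadoop, Spark, or Hive, which distribute the same phases across a cluster.

```python
from collections import defaultdict
from functools import reduce


def map_phase(records):
    """Map: emit a (key, 1) pair for each record's label."""
    for rec in records:
        yield rec["label"], 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the values for each key."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in grouped.items()}


# Hypothetical training records; only the label field matters here.
records = [{"label": "spam"}, {"label": "ham"}, {"label": "spam"}]
label_counts = reduce_phase(shuffle(map_phase(records)))
print(label_counts)  # {'spam': 2, 'ham': 1}
```

Counting labels this way is a common first check for class imbalance before training.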