Content Domain 1: Data Preparation for Machine Learning (ML) - AWS Certification

Content Domain 1: Data Preparation for Machine Learning (ML)

Task 1.1: Ingest and store data

Knowledge of:

  • Data formats and ingestion mechanisms (for example, validated and non-validated formats, Apache Parquet, JSON, CSV, Apache ORC, Apache Avro, RecordIO)

  • How to use the core AWS data sources (for example, Amazon S3, Amazon Elastic File System [Amazon EFS], Amazon FSx for NetApp ONTAP)

  • How to use AWS streaming data sources to ingest data (for example, Amazon Kinesis, Apache Flink, Apache Kafka)

  • AWS storage options, including use cases and tradeoffs

Skills in:

  • Extracting data from storage (for example, Amazon S3, Amazon Elastic Block Store [Amazon EBS], Amazon EFS, Amazon RDS, Amazon DynamoDB) by using relevant AWS service options (for example, Amazon S3 Transfer Acceleration, Amazon EBS Provisioned IOPS)

  • Choosing appropriate data formats (for example, Parquet, JSON, CSV, ORC) based on data access patterns

  • Ingesting data into Amazon SageMaker Data Wrangler and SageMaker Feature Store

  • Merging data from multiple sources (for example, by using programming techniques, AWS Glue, Apache Spark)

  • Troubleshooting and debugging data ingestion and storage issues that involve capacity and scalability

  • Making initial storage decisions based on cost, performance, and data structure

Task 1.2: Transform data and perform feature engineering

Knowledge of:

  • Data cleaning and transformation techniques (for example, detecting and treating outliers, imputing missing data, combining, deduplication)

  • Feature engineering techniques (for example, data scaling and standardization, feature splitting, binning, log transformation, normalization)

  • Encoding techniques (for example, one-hot encoding, binary encoding, label encoding, tokenization)

  • Tools to explore, visualize, or transform data and features (for example, SageMaker Data Wrangler, AWS Glue, AWS Glue DataBrew)

  • Services that transform streaming data (for example, AWS Lambda, Spark)

  • Data annotation and labeling services that create high-quality labeled datasets

Skills in:

  • Transforming data by using AWS tools (for example, AWS Glue, DataBrew, Spark running on Amazon EMR, SageMaker Data Wrangler)

  • Creating and managing features by using AWS tools (for example, SageMaker Feature Store)

  • Validating and labeling data by using AWS services (for example, SageMaker Ground Truth, Amazon Mechanical Turk)

Task 1.3: Ensure data integrity and prepare data for modeling

Knowledge of:

  • Pre-training bias metrics for numeric, text, and image data (for example, class imbalance [CI], difference in proportions of labels [DPL])

  • Strategies to address CI in numeric, text, and image datasets (for example, synthetic data generation, resampling)

  • Techniques to encrypt data

  • Data classification, anonymization, and masking

  • Implications of compliance requirements (for example, personally identifiable information [PII], protected health information [PHI], data residency)

Skills in:

  • Validating data quality (for example, by using DataBrew and AWS Glue Data Quality)

  • Identifying and mitigating sources of bias in data (for example, selection bias, measurement bias) by using AWS tools (for example, SageMaker Clarify)

  • Preparing data to reduce prediction bias (for example, by using dataset splitting, shuffling, and augmentation)

  • Configuring data to load into the model training resource (for example, Amazon EFS, Amazon FSx)