MLPERF03-BP01 Use a modern data architecture - Machine Learning Lens


Get the best insights from exponentially growing data using a modern data architecture. This architecture enables movement of data between a data lake and purpose-built stores including a data warehouse, relational databases, non-relational databases, ML and big data processing, and log analytics. A data lake provides a single place to run analytics across mixed data structures collected from disparate sources. Purpose-built analytics services provide the speed required for specific use cases like real-time dashboards and log analytics.

Desired outcome: You implement a modern data architecture that enables seamless data movement between storage systems, provides unified governance, and supports diverse ML workloads. This architecture accelerates data preparation, improves data quality, and enables efficient feature engineering for machine learning models.

Common anti-patterns:

  • Creating isolated data silos where different teams manage separate data stores without coordination.

  • Building data architecture without establishing proper governance, access controls, or compliance-aligned frameworks.

  • Using one-size-fits-all storage solutions without considering specific workload requirements.

  • Relying on manual processes for data movement, transformation, and quality checks.

  • Neglecting to implement proper data cataloging and discovery mechanisms.

Benefits of establishing this best practice:

  • Unified data governance and access control across data stores.

  • Reduced data preparation time through integrated data services.

  • Improved data quality and consistency for ML model training.

  • Enhanced scalability for growing data volumes and ML workloads.

  • Better cost optimization through purpose-built storage solutions.

Level of risk exposed if this best practice is not established: Medium

Implementation guidance

When building machine learning solutions, you need a modern data architecture that can handle diverse data types, support various ML workloads, and provide unified governance across data stores. This architecture enables efficient data movement between storage systems while maintaining security, quality, and cost optimization.

Avoid creating isolated data silos where different teams manage separate data stores without coordination. Many organizations struggle with inconsistent data governance policies across different storage systems, lack unified access controls for ML teams, fail to implement proper data cataloging and discovery mechanisms, and neglect to optimize storage costs based on access patterns. These issues create bottlenecks in ML workflows and increase operational complexity.

For example, in a retail ML use case, you might need to combine customer transaction data from a data warehouse, real-time clickstream data from streaming services, and product catalog information from operational databases. A modern data architecture enables seamless access to these data sources while maintaining consistent security policies and enabling efficient feature engineering for recommendation models.

Different ML workloads require different data access patterns. Batch training jobs benefit from optimized data lake storage with efficient querying capabilities, while real-time inference requires low-latency access to feature stores and streaming data pipelines. Custom data processing workflows can be developed when standard ETL processes don't adequately support your specific ML requirements.

Continuous monitoring of data quality and pipeline performance is crucial for maintaining reliable ML systems. Setting up automated data quality checks and pipeline monitoring allows for early detection of data issues that could impact model performance.

Implementation steps

  1. Design your data lake foundation. Begin by establishing a centralized data lake using Amazon S3 as the primary storage layer. Organize data using a logical structure that supports both current and future ML use cases, such as organizing by business domain, data source, and processing stage. Implement data partitioning strategies based on common query patterns to optimize performance and reduce costs. For example, partition time-series data by date and customer data by region to enable efficient querying for ML feature extraction.
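As an illustration of the partitioning described in this step, the sketch below builds Hive-style partitioned S3 keys for time-series data; the bucket, domain, and file names are hypothetical, and query engines such as Athena and AWS Glue can prune partitions on these prefixes.

```python
from datetime import date

def partition_key(domain: str, source: str, stage: str,
                  event_date: date, filename: str) -> str:
    """Build a Hive-style partitioned S3 key.

    Layout: <domain>/<source>/<stage>/year=YYYY/month=MM/day=DD/<filename>
    """
    return (f"{domain}/{source}/{stage}/"
            f"year={event_date.year}/month={event_date.month:02d}/"
            f"day={event_date.day:02d}/{filename}")

key = partition_key("retail", "pos", "curated", date(2024, 3, 7), "txn.parquet")
print(key)  # retail/pos/curated/year=2024/month=03/day=07/txn.parquet
# Upload with boto3 (assumed bucket name):
# boto3.client("s3").put_object(Bucket="my-data-lake", Key=key, Body=data)
```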

  2. Implement unified data governance. Use AWS Lake Formation to build a scalable and secure data lake with centralized governance. Establish consistent security policies, access controls, and audit trails across data stores. Apply fine-grained permissions that enable self-service access for ML practitioners while maintaining security and addressing compliance requirements. Create data stewardship roles and processes to improve ongoing data quality and governance.
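A minimal sketch of such a fine-grained permission is shown below: it builds the arguments for the Lake Formation GrantPermissions API to give one role column-level SELECT access. The role, database, table, and column names are hypothetical.

```python
def select_grant(principal_arn: str, database: str, table: str,
                 columns: list) -> dict:
    """Build kwargs for lakeformation.grant_permissions granting
    column-level SELECT -- fine-grained, self-service access."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": columns,
            }
        },
        "Permissions": ["SELECT"],
    }

grant = select_grant("arn:aws:iam::111122223333:role/MLPractitioner",
                     "retail", "transactions", ["customer_id", "amount"])
# boto3.client("lakeformation").grant_permissions(**grant)
```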

  3. Integrate purpose-built analytics services. Build a high-speed analytics layer with purpose-built services selected for your specific ML workload requirements, such as Amazon Athena for interactive SQL queries, Amazon EMR for big data processing, Amazon Redshift for data warehousing, and Amazon OpenSearch Service for log analytics and search.
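As one example of a purpose-built service in this layer, the sketch below builds the arguments for Athena's StartQueryExecution API to run interactive SQL over the data lake; the database, table, and results bucket are assumed names, and the WHERE clause relies on the partition layout chosen for the lake.

```python
def athena_query(sql: str, database: str, results_s3: str) -> dict:
    """Build kwargs for athena.start_query_execution: interactive SQL
    over the data lake, with results written to an S3 location."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": results_s3},
    }

q = athena_query(
    "SELECT customer_id, SUM(amount) AS total FROM transactions "
    "WHERE year = '2024' GROUP BY customer_id",  # prunes on partitions
    "retail", "s3://my-query-results/")          # assumed names
# boto3.client("athena").start_query_execution(**q)
```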

  4. Enable seamless data integration. Use AWS Glue to integrate data across services and data stores. Implement automated data cataloging to maintain metadata and enable data discovery across your organization. Create ETL pipelines that prepare data for ML workloads while maintaining data lineage and enabling reproducibility. Design workflows that can handle both batch and streaming data processing requirements.
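The transform below is a plain-Python sketch of one step such an ETL pipeline might apply before writing to the curated zone: casting types, dropping invalid rows, and attaching lineage metadata. In practice this logic would run inside an AWS Glue job; the field names are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional

def prepare_record(raw: dict, source: str) -> Optional[dict]:
    """Cast types, reject invalid rows, and attach lineage metadata."""
    try:
        record = {
            "customer_id": str(raw["customer_id"]),
            "amount": float(raw["amount"]),
        }
    except (KeyError, TypeError, ValueError):
        return None  # route to a quarantine prefix in a real pipeline
    record["_lineage"] = {
        "source": source,
        "processed_at": datetime.now(timezone.utc).isoformat(),
    }
    return record

print(prepare_record({"customer_id": 42, "amount": "19.90"}, "pos-feed"))
print(prepare_record({"customer_id": 42}, "pos-feed"))  # None: no amount
```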

  5. Optimize for ML workloads. Design data pipelines that support both batch and real-time ML training scenarios. Implement feature stores using services like Amazon SageMaker AI Feature Store to manage and share ML features across teams and models. Create standardized feature engineering processes that can be reused across different ML projects and provide consistent data transformations.
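As a sketch of writing into a feature store, the helper below converts a feature dictionary into the record shape expected by the SageMaker Feature Store PutRecord runtime API, which represents every value as a string; the feature group and feature names are hypothetical.

```python
def to_feature_record(features: dict) -> list:
    """Convert a feature dict to the Record shape used by the
    sagemaker-featurestore-runtime put_record API."""
    return [{"FeatureName": name, "ValueAsString": str(value)}
            for name, value in features.items()]

record = to_feature_record({
    "customer_id": "42",
    "avg_basket": 19.9,
    "event_time": "2024-03-07T12:00:00Z",
})
# boto3.client("sagemaker-featurestore-runtime").put_record(
#     FeatureGroupName="customer-features", Record=record)
```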

  6. Establish data quality monitoring. Implement automated data quality checks and monitoring to maintain data reliability for ML models. Use AWS Glue DataBrew for data profiling and quality assessment. Set up automated alerts for data quality issues such as missing values, schema changes, or statistical anomalies that could impact ML model performance.
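The sketch below shows two checks of the kind described in this step, detecting a dropped column (a schema change) and an excessive null rate. DataBrew provides richer profiling; this is only a minimal illustration with hypothetical column names and thresholds.

```python
def quality_checks(rows, required_columns, max_null_rate=0.05):
    """Return alert messages for missing columns and high null rates."""
    alerts = []
    for col in required_columns:
        if all(col not in row for row in rows):
            alerts.append(f"schema: required column '{col}' is missing")
            continue
        null_rate = sum(1 for row in rows if row.get(col) is None) / len(rows)
        if null_rate > max_null_rate:
            alerts.append(f"quality: '{col}' null rate {null_rate:.0%} "
                          f"exceeds {max_null_rate:.0%}")
    return alerts

rows = [{"customer_id": 1, "amount": None},
        {"customer_id": 2, "amount": 5.0}]
print(quality_checks(rows, ["customer_id", "amount", "region"]))
# flags the 50% null rate on 'amount' and the missing 'region' column
```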

  7. Implement cost optimization strategies. Use appropriate storage classes in Amazon S3 based on data access patterns. Implement lifecycle policies to automatically transition data to lower-cost storage tiers such as S3 Standard-Infrequent Access or Amazon S3 Glacier for archival data. Monitor and optimize query performance to control compute costs, and use reserved capacity where appropriate for predictable workloads.
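One way such a lifecycle policy could look is sketched below: a rule for the S3 PutBucketLifecycleConfiguration API that tiers a prefix down to Standard-IA and then to Glacier. The prefix and day thresholds are illustrative assumptions.

```python
def lifecycle_rule(prefix: str, ia_days: int = 30,
                   glacier_days: int = 365) -> dict:
    """Build one rule for s3.put_bucket_lifecycle_configuration that
    transitions objects under a prefix to cheaper storage tiers."""
    return {
        "ID": f"tier-{prefix.strip('/').replace('/', '-')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": ia_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_days, "StorageClass": "GLACIER"},
        ],
    }

rule = lifecycle_rule("retail/pos/raw/")
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-data-lake", LifecycleConfiguration={"Rules": [rule]})
```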

  8. Enable real-time data processing. For ML use cases requiring real-time inference, implement streaming data pipelines using Amazon Kinesis and AWS Lambda to process data as it arrives and update feature stores in near real-time. Design architectures that can handle varying data volumes and provide consistent low-latency access to features for real-time ML predictions.
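A minimal sketch of the Lambda side of such a pipeline is shown below: a handler that decodes Kinesis records (which arrive base64-encoded) and turns each into a feature update. The payload fields are hypothetical, and a real handler would write the updates to the feature store.

```python
import base64
import json

def handler(event, context=None):
    """Lambda-style handler for a Kinesis event: decode each record
    and build a feature update for downstream ingestion."""
    updates = []
    for rec in event["Records"]:
        payload = json.loads(base64.b64decode(rec["kinesis"]["data"]))
        updates.append({
            "customer_id": payload["customer_id"],
            "last_event": payload["event_type"],
        })
    return {"updates": updates}

# Simulate one incoming Kinesis record:
data = base64.b64encode(
    json.dumps({"customer_id": "42", "event_type": "click"}).encode()
).decode()
print(handler({"Records": [{"kinesis": {"data": data}}]}))
```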

  9. Implement data lineage and versioning. Establish comprehensive data lineage tracking to understand data flow from source to ML models. Use versioning for both datasets and feature definitions to enable reproducible ML experiments and model rollbacks when necessary. This is crucial for regulatory compliance and for debugging ML model issues.
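As a sketch of dataset versioning, the helper below derives a deterministic version identifier from the dataset contents and the feature definitions, so an experiment can pin exactly what it trained on; it is an illustration of the idea, not a full lineage system.

```python
import hashlib
import json

def dataset_version(records: list, feature_def: dict) -> str:
    """Deterministic version id over dataset contents plus feature
    definitions: the same inputs always yield the same version."""
    blob = json.dumps({"data": records, "features": feature_def},
                      sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode()).hexdigest()[:12]

v1 = dataset_version([{"customer_id": "42", "amount": 19.9}],
                     {"avg_basket": "float"})
print(v1)  # stable 12-hex-character version id
```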

  10. Create self-service data access. Build data catalogs and discovery tools that enable ML practitioners to find and access relevant data without requiring deep technical knowledge of the underlying storage systems. Implement standardized APIs and interfaces that abstract the complexity of the data architecture while providing the flexibility needed for diverse ML workloads.
