

AI/ML Starter Package

The AI/ML Starter Package establishes a comprehensive environment for developing, training, and deploying machine learning models at scale.

This implementation demonstrates AWS best practices for creating an enterprise-grade data science platform. It combines data lake capabilities with SageMaker Studio to provide data scientists with the tools they need while maintaining appropriate governance and security controls.

This architecture is particularly effective when:

  1. You need to enable data science teams to rapidly develop and deploy ML models.

  2. Your organization requires governed access to data and model resources.

Deploy this package when you need a scalable, secure foundation that supports your organization’s machine learning and AI initiatives.

The package is organized into three domains that work together to form a complete data science platform:

Shared Domain Components

  • IAM Roles - Secure access controls for data science teams

  • Data Lake - S3 buckets for storing training data and model artifacts

  • Glue Data Catalog - KMS-encrypted metadata management

  • Lake Formation - Fine-grained access control for data assets

  • Athena Workgroups - SQL-based data exploration capabilities

  • Audit Components - CloudTrail integration for comprehensive governance

DataOps Domain Components

  • DataOps Projects - Shared resources for data engineering workflows

  • Glue Crawlers - Automated metadata discovery and schema management

DataScience Domain Components

  • SageMaker Studio - Fully managed development environment for ML

  • Team Workspaces - Isolated environments for data science teams

  • Jupyter Notebooks - Pre-configured templates for common ML tasks

  • Model Registry - Version control for ML models

  • Training Pipelines - Automated workflows for model development

  • Deployment Infrastructure - Endpoints for model serving

This package accelerates your AI/ML initiatives by providing a ready-to-use environment with AWS best practices built in. It’s ideal for organizations looking to establish or enhance their machine learning capabilities with a secure, scalable foundation.

Deployment Instructions

Step-by-step guide for deploying the AI/ML Starter Package

You can deploy the AI/ML Starter Package using one of two methods:

  1. CloudFormation Installer Method (recommended for most users)

  2. Manual CLI Deploy Method (for advanced customization)

Method 1: CloudFormation Installer Method

Prerequisites

Before deploying using the CloudFormation installer, ensure you have:

  1. An AWS account with permissions to create the required resources

  2. A GitHub CodeConnect connection (if using GitHub as source) or an S3 bucket with the solution code (if using S3 as source)

  3. A VPC with at least one subnet (required for SageMaker Studio)

Deployment Steps

Step 1: Launch the CloudFormation stack

Step 2: Configure the stack

  1. Assign a unique name to your stack (e.g., mdaa-aiml-<accountname>-<region>).

  2. Under Parameters:

    • Enter your organization name in the OrgName field

    • Select github as the Source (default)

    • Verify the Repository Owner is aws and Repository Name is modern-data-architecture-accelerator

    • Enter your GitHub CodeConnect Connection ARN

    • Select basic data science as the Sample Name

    • Provide Subnet ID and VPC ID (required for SageMaker Studio)

  3. Choose Next, review the settings, and acknowledge that the template might create IAM resources

  4. Choose Submit to deploy the stack
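If you prefer to script the launch, the console steps above map roughly to an aws cloudformation create-stack call. This is a sketch only: the template URL and every parameter key name below are assumptions for illustration and must be confirmed against the actual installer template.

```shell
# Hypothetical sketch of launching the installer stack from the CLI.
# TEMPLATE_URL, CODECONNECT_ARN, and all ParameterKey names are
# assumptions; confirm them against the real installer template.
launch_mdaa_installer() {
  aws cloudformation create-stack \
    --stack-name "mdaa-aiml-myaccount-us-east-1" \
    --template-url "$TEMPLATE_URL" \
    --capabilities CAPABILITY_NAMED_IAM \
    --parameters \
      ParameterKey=OrgName,ParameterValue="my-org" \
      ParameterKey=Source,ParameterValue="github" \
      ParameterKey=ConnectionArn,ParameterValue="$CODECONNECT_ARN" \
      ParameterKey=VpcId,ParameterValue="vpc-0123456789abcdef0" \
      ParameterKey=SubnetId,ParameterValue="subnet-0123456789abcdef0"
}
```

Set TEMPLATE_URL and CODECONNECT_ARN in your shell, then call launch_mdaa_installer once the parameter keys have been verified.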

Step 3: Monitor the deployment

  1. Navigate to the AWS CodePipeline console to monitor the deployment progress

  2. The pipeline will show a status of either In Progress or Complete

  3. The deployment typically takes about 30-45 minutes for the AI/ML configuration

Step 4: Verify deployment

Check that all CloudFormation stacks have completed successfully and that the installer pipeline shows a Complete status.
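Stack and pipeline status can also be checked from the CLI. The pipeline name below is a placeholder; substitute the name the installer actually created (visible in the CodePipeline console).

```shell
# Hypothetical helper: summarize the installer pipeline's stage status.
# PIPELINE_NAME is a placeholder; set it to the pipeline created by the installer.
check_pipeline() {
  aws codepipeline get-pipeline-state \
    --name "$PIPELINE_NAME" \
    --query "stageStates[].{stage:stageName,status:latestExecution.status}" \
    --output table
}

# List stacks that reached a healthy terminal state.
check_stacks() {
  aws cloudformation list-stacks \
    --stack-status-filter CREATE_COMPLETE UPDATE_COMPLETE \
    --query "StackSummaries[].StackName" \
    --output text
}
```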

Method 2: Manual CLI Deploy Method

Prerequisites

Before deploying the AI/ML Starter Package using the CLI method, ensure you have:

  1. AWS CLI configured with appropriate credentials

  2. Node.js 16.x or later installed

  3. AWS CDK installed (npm install -g aws-cdk)

  4. CDK bootstrapped in your target account and region

  5. A VPC with at least one subnet (required for SageMaker Studio)
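A quick preflight check for the tooling prerequisites above can be run from any shell; it only reports what is installed and makes no AWS calls.

```shell
# Preflight check for the CLI prerequisites above; safe to run anywhere.
# Reports the installed version of each required tool, or NOT FOUND.
for tool in aws node cdk; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: $("$tool" --version 2>&1 | head -n 1)"
  else
    echo "$tool: NOT FOUND"
  fi
done
```

If cdk is missing, install it with npm install -g aws-cdk as noted above, then re-run the check.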

Deployment Steps

Step 1: Clone the MDAA repository

git clone https://github.com/aws/modern-data-architecture-accelerator.git && cd modern-data-architecture-accelerator

Step 2: Configure your deployment

  • Copy the sample configuration files:

cp -r sample_configs/basic_datascience_platform my_aiml_config
cd my_aiml_config

  • Edit the mdaa.yaml file to set your organization name, data science team name, and VPC/subnet information:

organization: <your-org-name>
context:
  vpc_id: <your vpc id>
  subnet_id: <your subnet id>
  datascience_team_name: <your datascience team name>

Step 3: Deploy the solution

  • Ensure you are authenticated to your target AWS account.

  • Optionally, run the following command to understand what stacks will be deployed:

../bin/mdaa ls

  • Optionally, run the following command to review the produced templates:

../bin/mdaa synth

  • Run the following command to deploy all modules:

../bin/mdaa deploy

Step 4: Verify deployment

  • Check the AWS CloudFormation console to ensure all stacks have been created successfully

  • Verify that the SageMaker Studio Domain, IAM roles, and other resources have been created
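The same verification can be done from the CLI. These helpers assume credentials for the target account are configured; they only read status and create nothing.

```shell
# Hypothetical verification helpers for Step 4.
# Assumes AWS credentials for the target account/region are configured.
verify_sagemaker_domain() {
  aws sagemaker list-domains \
    --query "Domains[].{name:DomainName,status:Status}" \
    --output table
}

verify_stacks() {
  aws cloudformation describe-stacks \
    --query "Stacks[].{name:StackName,status:StackStatus}" \
    --output table
}
```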

Usage Instructions

How to effectively use the AI/ML Starter Package after deployment

Once the MDAA deployment is complete, follow these steps to interact with the AI/ML platform:

Initial Setup and Data Access

  1. Create sample data for testing

    • Check the DATASETS.md file in the sample_configs directory for instructions on creating a sample_data folder

    • Alternatively, prepare your own data files for upload

  2. Assume the shared-roles-data-admin role

    • This role is configured with AssumeRole trust to the local account by default

    • It has write access to the data lake

  3. Upload sample data to the transformed bucket

    • Upload the sample_data folder and contents to the transformed bucket

  4. Run the Glue Crawler

    • In the AWS Glue Console, trigger/run the Glue Crawler

    • Once successful, view the Crawler’s CloudWatch logs to observe that tables were created
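The setup steps above can be sketched as CLI calls. The role ARN, bucket name, and crawler name below are placeholders for illustration; the actual MDAA-generated names in your account will differ.

```shell
# Placeholders -- substitute the MDAA-generated names from your deployment.
DATA_ADMIN_ROLE_ARN="arn:aws:iam::123456789012:role/shared-roles-data-admin"
TRANSFORMED_BUCKET="my-org-transformed-bucket"
CRAWLER_NAME="my-mdaa-crawler"

# Step 2: assume the data-admin role and export its temporary credentials.
assume_data_admin() {
  creds=$(aws sts assume-role \
    --role-arn "$DATA_ADMIN_ROLE_ARN" \
    --role-session-name data-admin-session \
    --query "Credentials.[AccessKeyId,SecretAccessKey,SessionToken]" \
    --output text)
  read -r AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN <<< "$creds"
  export AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN
}

# Step 3: upload the sample data to the transformed bucket.
upload_sample_data() {
  aws s3 cp sample_data "s3://${TRANSFORMED_BUCKET}/sample_data" --recursive
}

# Step 4: start the Glue Crawler and poll until it returns to READY.
run_crawler() {
  aws glue start-crawler --name "$CRAWLER_NAME"
  until [ "$(aws glue get-crawler --name "$CRAWLER_NAME" \
      --query "Crawler.State" --output text)" = "READY" ]; do
    sleep 30
  done
}
```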

Using SageMaker Studio

  1. Assume the data-scientist role

    • This role is configured with AssumeRole trust to the local account by default

    • Important: The role session name must match the userid specified in the datascience-team.yaml configuration

  2. Access SageMaker Studio

    • Navigate to the Amazon SageMaker console

    • Go to the Domain section and find the deployed SageMaker Studio Domain

    • Launch the user profile matching your role session name/userid

    • SageMaker Studio should launch

  3. Work with data in Athena

    • In the Athena Query Editor, select the MDAA-deployed Workgroup from the dropdown list

    • The tables created by the crawler should be available for query under the MDAA-created Database

    • Run queries to explore and analyze your data

  4. Develop ML models

    • Use the pre-configured Jupyter notebooks in SageMaker Studio

    • Access your data through the Athena integration

    • Train models using SageMaker’s built-in algorithms or custom code

    • Deploy models to SageMaker endpoints for inference
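The data-scientist workflow above can be sketched from the CLI as well. The role ARN, userid, workgroup, and database names are illustrative placeholders; the key detail from the source is that the role session name must equal the userid in datascience-team.yaml.

```shell
# Placeholders -- use the values from your datascience-team.yaml and deployment.
DATA_SCIENTIST_ROLE_ARN="arn:aws:iam::123456789012:role/data-scientist"
USERID="jane-doe"               # must match userid in datascience-team.yaml
WORKGROUP="my-mdaa-workgroup"
DATABASE="my_mdaa_database"

# Assume the data-scientist role; the session name must equal the userid
# so the matching SageMaker Studio user profile can be launched.
assume_data_scientist() {
  aws sts assume-role \
    --role-arn "$DATA_SCIENTIST_ROLE_ARN" \
    --role-session-name "$USERID"
}

# Run an exploratory query in the MDAA Athena workgroup and print results.
# Table name is hypothetical; use a table created by the crawler.
run_athena_query() {
  qid=$(aws athena start-query-execution \
    --work-group "$WORKGROUP" \
    --query-string "SELECT * FROM ${DATABASE}.my_table LIMIT 10" \
    --output text --query "QueryExecutionId")
  while :; do
    state=$(aws athena get-query-execution --query-execution-id "$qid" \
      --query "QueryExecution.Status.State" --output text)
    [ "$state" = "QUEUED" ] || [ "$state" = "RUNNING" ] || break
    sleep 2
  done
  aws athena get-query-results --query-execution-id "$qid" --output table
}
```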

For more detailed information about the configuration files and their purposes, refer to the README.md file in the sample_configs/basic_datascience_platform directory of the MDAA repository.