

AI/ML Starter Package

The AI/ML Starter Package establishes a comprehensive environment for developing, training, and deploying machine learning models at scale.

This implementation demonstrates AWS best practices for creating an enterprise-grade data science platform. It combines data lake capabilities with SageMaker Studio to provide data scientists with the tools they need while maintaining appropriate governance and security controls.

This architecture is particularly effective when:

  1. You need to enable data science teams to rapidly develop and deploy ML models.

  2. Your organization requires governed access to data and model resources.

Deploy this package when you need a scalable, secure foundation that supports your organization’s machine learning and AI initiatives.

The package is organized into three domains that work together to form a complete data science platform:

Shared Domain Components

  • IAM Roles - Secure access controls for data science teams

  • Data Lake - S3 buckets for storing training data and model artifacts

  • Glue Data Catalog - KMS-encrypted metadata management

  • Lake Formation - Fine-grained access control for data assets

  • Athena Workgroups - SQL-based data exploration capabilities

  • Audit Components - CloudTrail integration for comprehensive governance

DataOps Domain Components

  • DataOps Projects - Shared resources for data engineering workflows

  • Glue Crawlers - Automated metadata discovery and schema management

DataScience Domain Components

  • SageMaker Studio - Fully managed development environment for ML

  • Team Workspaces - Isolated environments for data science teams

  • Jupyter Notebooks - Pre-configured templates for common ML tasks

  • Model Registry - Version control for ML models

  • Training Pipelines - Automated workflows for model development

  • Deployment Infrastructure - Endpoints for model serving

This package accelerates your AI/ML initiatives by providing a ready-to-use environment with AWS best practices built in. It’s ideal for organizations looking to establish or enhance their machine learning capabilities with a secure, scalable foundation.

Deployment Instructions

Step-by-step guide for deploying the AI/ML Starter Package

You can deploy the AI/ML Starter Package using one of two methods:

  1. CloudFormation Installer Method (recommended for most users)

  2. Manual CLI Deploy Method (for advanced customization)

Method 1: CloudFormation Installer Method

Prerequisites

Before deploying using the CloudFormation installer, ensure you have:

  1. An AWS account with permissions to create the required resources

  2. A GitHub CodeConnect connection (if using GitHub as source) or an S3 bucket with the solution code (if using S3 as source)

  3. A VPC with at least one subnet (required for SageMaker Studio)

Deployment Steps

Step 1: Launch the CloudFormation stack

Step 2: Configure the stack

  1. Assign a unique name to your stack (e.g., mdaa-aiml-<accountname>-<region>).

  2. Under Parameters:

    • Enter your organization name in the OrgName field

    • Select github as the Source (default)

    • Verify the Repository Owner is aws and Repository Name is modern-data-architecture-accelerator

    • Enter your GitHub CodeConnect Connection ARN

    • Select basic data science as the Sample Name

    • Provide Subnet ID and VPC ID (required for SageMaker Studio)

  3. Choose Next, review the settings, and acknowledge that the template might create IAM resources

  4. Choose Submit to deploy the stack
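If you prefer to script the launch, the console steps above map roughly to an aws cloudformation create-stack call. This is a sketch only: the template URL and every parameter key name below are assumptions for illustration and must be confirmed against the actual installer template.

```shell
# Hypothetical sketch of launching the installer stack from the CLI.
# TEMPLATE_URL, CODECONNECT_ARN, and all ParameterKey names are
# assumptions; confirm them against the real installer template.
launch_mdaa_installer() {
  aws cloudformation create-stack \
    --stack-name "mdaa-aiml-myaccount-us-east-1" \
    --template-url "$TEMPLATE_URL" \
    --capabilities CAPABILITY_NAMED_IAM \
    --parameters \
      ParameterKey=OrgName,ParameterValue="my-org" \
      ParameterKey=Source,ParameterValue="github" \
      ParameterKey=ConnectionArn,ParameterValue="$CODECONNECT_ARN" \
      ParameterKey=VpcId,ParameterValue="vpc-0123456789abcdef0" \
      ParameterKey=SubnetId,ParameterValue="subnet-0123456789abcdef0"
}
```

Set TEMPLATE_URL and CODECONNECT_ARN in your shell, then call launch_mdaa_installer once the parameter keys have been verified.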

Step 3: Monitor the deployment

  1. Navigate to the AWS CodePipeline console to monitor the deployment progress

  2. The pipeline will show a status of either In Progress or Complete

  3. The deployment typically takes about 30-45 minutes for the AI/ML configuration

Step 4: Verify deployment

Check that all CloudFormation stacks have completed successfully and that the installer pipeline shows a Complete status.
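Stack and pipeline status can also be checked from the CLI. The pipeline name below is a placeholder; substitute the name the installer actually created (visible in the CodePipeline console).

```shell
# Hypothetical helper: summarize the installer pipeline's stage status.
# PIPELINE_NAME is a placeholder; set it to the pipeline created by the installer.
check_pipeline() {
  aws codepipeline get-pipeline-state \
    --name "$PIPELINE_NAME" \
    --query "stageStates[].{stage:stageName,status:latestExecution.status}" \
    --output table
}

# List stacks that reached a healthy terminal state.
check_stacks() {
  aws cloudformation list-stacks \
    --stack-status-filter CREATE_COMPLETE UPDATE_COMPLETE \
    --query "StackSummaries[].StackName" \
    --output text
}
```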

Method 2: Manual CLI Deploy Method

Prerequisites

Before deploying the AI/ML Starter Package using the CLI method, ensure you have:

  1. AWS CLI configured with appropriate credentials

  2. Node.js 16.x or later installed

  3. AWS CDK installed (npm install -g aws-cdk)

  4. CDK bootstrapped in your target account and region

  5. A VPC with at least one subnet (required for SageMaker Studio)
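A quick preflight check for the tooling prerequisites above can be run from any shell; it only reports what is installed and makes no AWS calls.

```shell
# Preflight check for the CLI prerequisites above; safe to run anywhere.
# Reports the installed version of each required tool, or NOT FOUND.
for tool in aws node cdk; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: $("$tool" --version 2>&1 | head -n 1)"
  else
    echo "$tool: NOT FOUND"
  fi
done
```

If cdk is missing, install it with npm install -g aws-cdk as noted above, then re-run the check.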

Deployment Steps

Step 1: Clone the MDAA repository

git clone https://github.com/aws/modern-data-architecture-accelerator.git && cd modern-data-architecture-accelerator

Step 2: Configure your deployment

  • Copy the sample configuration files:

cp -r sample_configs/basic_datascience_platform my_aiml_config
cd my_aiml_config

  • Edit the mdaa.yaml file to set your organization name, data science team name, and VPC/subnet information:

organization: <your-org-name>
context:
  vpc_id: <your vpc id>
  subnet_id: <your subnet id>
  datascience_team_name: <your datascience team name>

Step 3: Deploy the solution

  • Ensure you are authenticated to your target AWS account.

  • Optionally, run the following command to understand what stacks will be deployed:

../bin/mdaa ls

  • Optionally, run the following command to review the produced templates:

../bin/mdaa synth

  • Run the following command to deploy all modules:

../bin/mdaa deploy

Step 4: Verify deployment

  • Check the AWS CloudFormation console to ensure all stacks have been created successfully

  • Verify that the SageMaker Studio Domain, IAM roles, and other resources have been created
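The same verification can be done from the CLI. These helpers assume credentials for the target account are configured; they only read status and create nothing.

```shell
# Hypothetical verification helpers for Step 4.
# Assumes AWS credentials for the target account/region are configured.
verify_sagemaker_domain() {
  aws sagemaker list-domains \
    --query "Domains[].{name:DomainName,status:Status}" \
    --output table
}

verify_stacks() {
  aws cloudformation describe-stacks \
    --query "Stacks[].{name:StackName,status:StackStatus}" \
    --output table
}
```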

Usage Instructions

How to effectively use the AI/ML Starter Package after deployment

Once the MDAA deployment is complete, follow these steps to interact with the AI/ML platform:

Initial Setup and Data Access

  1. Create sample data for testing

    • Check the DATASETS.md file in the sample_configs directory for instructions on creating a sample_data folder

    • Alternatively, prepare your own data files for upload

  2. Assume the shared-roles-data-admin role

    • This role is configured with AssumeRole trust to the local account by default

    • It has write access to the data lake

  3. Upload sample data to the transformed bucket

    • Upload the sample_data folder and contents to the transformed bucket

  4. Run the Glue Crawler

    • In the AWS Glue Console, trigger/run the Glue Crawler

    • Once successful, view the Crawler’s CloudWatch logs to observe that tables were created
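The setup steps above can be sketched as CLI calls. The role ARN, bucket name, and crawler name below are placeholders for illustration; the actual MDAA-generated names in your account will differ.

```shell
# Placeholders -- substitute the MDAA-generated names from your deployment.
DATA_ADMIN_ROLE_ARN="arn:aws:iam::123456789012:role/shared-roles-data-admin"
TRANSFORMED_BUCKET="my-org-transformed-bucket"
CRAWLER_NAME="my-mdaa-crawler"

# Step 2: assume the data-admin role and export its temporary credentials.
assume_data_admin() {
  creds=$(aws sts assume-role \
    --role-arn "$DATA_ADMIN_ROLE_ARN" \
    --role-session-name data-admin-session \
    --query "Credentials.[AccessKeyId,SecretAccessKey,SessionToken]" \
    --output text)
  read -r AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN <<< "$creds"
  export AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN
}

# Step 3: upload the sample data to the transformed bucket.
upload_sample_data() {
  aws s3 cp sample_data "s3://${TRANSFORMED_BUCKET}/sample_data" --recursive
}

# Step 4: start the Glue Crawler and poll until it returns to READY.
run_crawler() {
  aws glue start-crawler --name "$CRAWLER_NAME"
  until [ "$(aws glue get-crawler --name "$CRAWLER_NAME" \
      --query "Crawler.State" --output text)" = "READY" ]; do
    sleep 30
  done
}
```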

Using SageMaker Studio

  1. Assume the data-scientist role

    • This role is configured with AssumeRole trust to the local account by default

    • Important: The role session name must match the userid specified in the datascience-team.yaml configuration

  2. Access SageMaker Studio

    • Navigate to the Amazon SageMaker console

    • Go to the Domain section and find the deployed SageMaker Studio Domain

    • Launch the user profile matching your role session name/userid

    • SageMaker Studio should launch

  3. Work with data in Athena

    • In the Athena Query Editor, select the MDAA-deployed Workgroup from the dropdown list

    • The tables created by the crawler should be available for query under the MDAA-created Database

    • Run queries to explore and analyze your data

  4. Develop ML models

    • Use the pre-configured Jupyter notebooks in SageMaker Studio

    • Access your data through the Athena integration

    • Train models using SageMaker’s built-in algorithms or custom code

    • Deploy models to SageMaker endpoints for inference
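The data-scientist workflow above can be sketched from the CLI as well. The role ARN, userid, workgroup, and database names are illustrative placeholders; the key detail from the source is that the role session name must equal the userid in datascience-team.yaml.

```shell
# Placeholders -- use the values from your datascience-team.yaml and deployment.
DATA_SCIENTIST_ROLE_ARN="arn:aws:iam::123456789012:role/data-scientist"
USERID="jane-doe"               # must match userid in datascience-team.yaml
WORKGROUP="my-mdaa-workgroup"
DATABASE="my_mdaa_database"

# Assume the data-scientist role; the session name must equal the userid
# so the matching SageMaker Studio user profile can be launched.
assume_data_scientist() {
  aws sts assume-role \
    --role-arn "$DATA_SCIENTIST_ROLE_ARN" \
    --role-session-name "$USERID"
}

# Run an exploratory query in the MDAA Athena workgroup and print results.
# Table name is hypothetical; use a table created by the crawler.
run_athena_query() {
  qid=$(aws athena start-query-execution \
    --work-group "$WORKGROUP" \
    --query-string "SELECT * FROM ${DATABASE}.my_table LIMIT 10" \
    --output text --query "QueryExecutionId")
  while :; do
    state=$(aws athena get-query-execution --query-execution-id "$qid" \
      --query "QueryExecution.Status.State" --output text)
    [ "$state" = "QUEUED" ] || [ "$state" = "RUNNING" ] || break
    sleep 2
  done
  aws athena get-query-results --query-execution-id "$qid" --output table
}
```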

For more detailed information about the configuration files and their purposes, refer to the README.md file in the sample_configs/basic_datascience_platform directory of the MDAA repository.