Let's make it happen
Ready to deploy? Review the sample code on GitHub for detailed deployment instructions to deploy as-is or customize to fit your needs.
These technical details include an architecture diagram that shows the key components of this solution and how they interact, step by step.
Everything you need to launch this Guidance in your account is right here.
The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.
To support scalable simulation and key performance indicator (KPI) calculation models, use Amazon EKS and Amazon QuickSight.
Resources are deployed in a virtual private cloud (VPC), which provides a logically isolated network. You can grant access to these resources using AWS Identity and Access Management (IAM) roles that follow the principle of least privilege, meaning each role carries only the minimum permissions required to complete a task.
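As a rough illustration of what least privilege can look like in practice, the sketch below builds an IAM policy document that allows read-only access to a single prefix in one Amazon S3 bucket. The use of S3, the bucket name, the prefix, and the function name are assumptions made for this example and do not come from the Guidance itself; adapt the actions and resources to whatever your workload actually needs.

```python
import json


def least_privilege_read_policy(bucket_name: str, prefix: str) -> dict:
    """Build a policy document granting read-only access to one S3 prefix.

    Hypothetical helper: bucket_name and prefix are placeholders, not values
    taken from the Guidance.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Allow listing, but only for keys under the given prefix.
                "Sid": "ListOnlyThePrefix",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket_name}"],
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Allow reading objects under the prefix; no write or delete.
                "Sid": "ReadObjectsUnderThePrefix",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket_name}/{prefix}/*"],
            },
        ],
    }


if __name__ == "__main__":
    policy = least_privilege_read_policy("example-training-data", "processed")
    print(json.dumps(policy, indent=2))
```

The policy names specific actions and resources instead of wildcards, so a role that carries it can read the processed dataset and nothing else.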
Kubeflow on AWS supports data pipeline orchestration.
If you have on-premises restrictions or existing Kubernetes investments, you can use Amazon EKS and Kubeflow on AWS to implement an ML pipeline for distributed training or use a fully managed SageMaker solution for production-scale training infrastructure. These two options help you scale to meet workload requirements of the training environment.
We selected resource sizes and types based on resource characteristics and past workloads so you only pay for resources matched to your needs.
SageMaker is designed to handle training clusters that scale up as needed and shut down automatically when jobs are complete. SageMaker also reduces the infrastructure and operational overhead typically required to train deep learning models on hundreds of GPUs. Amazon Elastic File System (Amazon EFS) integration with the training clusters and the development environment allows you to share your code and processed training dataset, so you don’t have to rebuild the container image and reload large datasets after every code change.
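To make the EFS integration more concrete, here is a minimal sketch of how a shared file system might be described as an input channel for a SageMaker training job. The dictionary follows the `InputDataConfig` / `FileSystemDataSource` shape of the SageMaker `CreateTrainingJob` request as we understand it; verify the field names against the current API reference before relying on it. The helper name, file system ID, and directory path are hypothetical placeholders.

```python
def efs_training_channel(
    file_system_id: str,
    directory_path: str,
    channel_name: str = "training",
    read_only: bool = True,
) -> dict:
    """Describe an EFS directory as one training input channel.

    Hypothetical helper: it only builds the request fragment and does not
    call any AWS API.
    """
    return {
        "ChannelName": channel_name,
        "DataSource": {
            "FileSystemDataSource": {
                "FileSystemId": file_system_id,
                "FileSystemType": "EFS",
                # "ro" mounts the dataset read-only; "rw" allows writes.
                "FileSystemAccessMode": "ro" if read_only else "rw",
                "DirectoryPath": directory_path,
            }
        },
    }


if __name__ == "__main__":
    # Placeholder ID and path, for illustration only.
    channel = efs_training_channel("fs-0123456789abcdef0", "/datasets/processed")
    print(channel)
```

Because the job mounts the file system instead of copying data into the container, the same processed dataset can be reused across code changes. The training job would also need a VPC configuration whose subnets and security groups can reach the file system.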
Build flexible and scalable distributed training architectures using Kubeflow on AWS and Amazon SageMaker
This blog post demonstrates how Kubeflow on AWS (an AWS-specific distribution of Kubeflow), used with AWS Deep Learning Containers and Amazon EFS, simplifies collaboration and provides flexibility in training deep learning models at scale on both Amazon EKS and Amazon SageMaker using a hybrid architecture approach.