Architecture overview
This section provides an overview of the architecture of this solution.
Architecture diagram
Deploying this solution with the default parameters deploys the following components in your AWS account.
Architectural components
- A user accesses the DeepRacer on AWS user interface through an Amazon CloudFront distribution, which delivers static web assets from the UI assets bucket and video streams from simulations.
- The user interface assets are hosted in an Amazon S3 bucket that stores the static web assets comprising the user interface.
- An Amazon Cognito user pool manages users and user group membership.
- An Amazon Cognito identity pool manages federation and rule-based role mapping for users.
- AWS IAM roles define permissions and level of access for each user group in the system, used for access control and authorization.
- AWS Lambda registration hooks execute pre- and post-registration actions including assigning new users as racers, handling initial admin profile creation, and more.
- AWS WAF provides intelligent protection for the API against common attack vectors and allows customers to define custom rules based on individual use cases and usage patterns.
- Amazon API Gateway routes API requests to their appropriate handler using a defined Smithy model.
- A single Amazon DynamoDB table is responsible for storing and managing profiles, training jobs, models, evaluation jobs, submissions, and leaderboards.
- AWS Lambda functions are triggered in response to requests routed from the API and are responsible for CRUD operations, dispatching training/evaluation jobs, and more.
- A global settings handler (AWS Lambda function) reads and writes application-level settings to the configuration.
- An AWS AppConfig hosted configuration stores application-level settings, such as usage quotas.
- Model export handlers (AWS Lambda functions) retrieve the asset URL and package assets for use in exporting models from the system.
- An Amazon SQS dead-letter queue catches failed export jobs from the asset packaging function.
- A virtual model bucket stores exported models and provides access to them via pre-signed URL.
- A model import handler (AWS Lambda function) receives requests to import a model onto the system and creates a new import job.
- A model import queue (Amazon SQS) receives jobs from the model import function and holds them until they are accepted by the dispatcher; a dead-letter queue (DLQ) handles failed jobs.
- A failed request handler (AWS Lambda function) manages failed requests and updates their status to reflect their current state.
- An import dispatching function takes a job from the queue and dispatches it to the workflow.
- A reward function validator (AWS Lambda function) checks the reward function and validates/sanitizes the customer-provided code before it is saved to the system.
- An imported model validator function checks and validates the imported model before it is saved to the system.
- An imported model assets handler (AWS Lambda function) brings in model assets from the upload bucket.
- An import completion handler (AWS Lambda function) handles status updates when a job is completed successfully.
- An upload bucket (Amazon S3) stores uploaded (but not yet imported) assets from the user.
- An Amazon SQS FIFO queue receives requests for training and evaluation jobs and stores them in FIFO order.
- A job dispatcher function picks a job off the top of the FIFO queue and dispatches it to the workflow.
- Workflow functions handle setting up the job, setting status, and other workflow tasks.
- Amazon SageMaker AI training jobs perform the actual training and evaluation of the model using the reward function and hyperparameters provided.
- Amazon Kinesis Video Streams handles presenting the simulation video to the user from the training job.
- A user data bucket stores all user data including trained models, evaluation results, and other assets generated during the DeepRacer workflow.
Functional components
This solution implements a serverless, microservices-based architecture that enables users to train and evaluate reinforcement learning models for autonomous racing. The architecture is organized around several key functional areas that work together to provide a complete reinforcement learning education platform.
User Interface and Authentication
Users access DeepRacer on AWS through a web-based console delivered via Amazon CloudFront, which provides fast, global distribution of the user interface assets. These static web assets are hosted in Amazon S3, ensuring reliable and scalable content delivery to users worldwide. Amazon Cognito manages user authentication and authorization, handling user registration, login, and session management.
When new users register, the system automatically creates user profiles and establishes proper permissions, ensuring a seamless onboarding experience. This authentication layer secures access to the platform while enabling users to maintain their own private workspace for models, training data, and race submissions.
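The registration step described above could be implemented as an Amazon Cognito post-confirmation Lambda trigger. The following is a minimal sketch under stated assumptions, not the solution's actual code: the group name `racers`, the `build_handler` factory, and the `FakeCognito` stand-in are hypothetical. It relies on the `admin_add_user_to_group` operation of the boto3 `cognito-idp` client and on the `userPoolId` and `userName` fields that Cognito includes in trigger events.

```python
from typing import Any, Callable, Dict, List

RACER_GROUP = "racers"  # hypothetical group name, not taken from the solution


def build_handler(cognito_client: Any) -> Callable[[Dict[str, Any], Any], Dict[str, Any]]:
    """Return a post-confirmation trigger that adds each new user to RACER_GROUP.

    cognito_client is expected to behave like boto3.client("cognito-idp").
    """

    def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
        # Cognito trigger events carry the pool ID and user name at the top level.
        cognito_client.admin_add_user_to_group(
            UserPoolId=event["userPoolId"],
            Username=event["userName"],
            GroupName=RACER_GROUP,
        )
        # Cognito expects the trigger to return the event object.
        return event

    return handler


class FakeCognito:
    """Stand-in client that records calls, so the sketch runs without AWS access."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, str]] = []

    def admin_add_user_to_group(self, **kwargs: str) -> None:
        self.calls.append(kwargs)
```

In a deployed function, the handler would be created once at module load, for example `handler = build_handler(boto3.client("cognito-idp"))`. Injecting the client keeps the registration logic testable without network access.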
API Layer and Request Processing
All user interactions with the system flow through Amazon API Gateway, which serves as the central entry point for backend operations. API Gateway routes each request to the appropriate AWS Lambda function based on the endpoint accessed, providing a clean separation between the user interface and backend processing logic. AWS WAF protects the API layer from common security threats such as bot attacks, DDoS attempts, and malicious traffic patterns.
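To illustrate what a handler behind one endpoint might look like, here is a hedged sketch of a Lambda function that answers a request such as `GET /models/{modelId}`. It assumes the Lambda proxy integration event and response format for REST APIs; the route, the `MODELS` dictionary, and the function name are hypothetical and do not come from the solution's Smithy model.

```python
import json
from typing import Any, Dict

# Hypothetical in-memory store standing in for the solution's database.
MODELS = {"m-1": {"modelId": "m-1", "name": "first-model"}}


def get_model_handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Handle GET /models/{modelId} using the Lambda proxy integration format."""
    # pathParameters is null (None) when the route has no path parameters.
    path_params = event.get("pathParameters") or {}
    model = MODELS.get(path_params.get("modelId"))
    if model is None:
        return {"statusCode": 404, "body": json.dumps({"message": "Model not found"})}
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(model),
    }
```

Keeping one small handler per operation mirrors the separation between routing and backend logic described above.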
Data Management
The solution uses a combination of Amazon DynamoDB and Amazon S3 to handle different types of data storage needs. DynamoDB serves as the primary database for structured data including user profiles, model metadata, training job status, leaderboards, and race submissions. Amazon S3 handles file storage for larger assets such as trained model files, training logs, evaluation videos, and other user-generated content.
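Storing several entity types in one DynamoDB table typically relies on composite keys. The sketch below shows one possible layout and emulates a `Query` with a `begins_with` condition in plain Python. The attribute names `pk` and `sk`, the key prefixes, and the helper functions are illustrative assumptions; the solution's real key schema is not described here.

```python
from typing import Dict, List

# Attribute names "pk"/"sk" and the key prefixes are illustrative assumptions.


def profile_item(profile_id: str, alias: str) -> Dict[str, str]:
    return {"pk": f"PROFILE#{profile_id}", "sk": "PROFILE", "alias": alias}


def model_item(profile_id: str, model_id: str, name: str) -> Dict[str, str]:
    return {"pk": f"PROFILE#{profile_id}", "sk": f"MODEL#{model_id}", "name": name}


def training_job_item(profile_id: str, model_id: str, status: str) -> Dict[str, str]:
    return {"pk": f"PROFILE#{profile_id}", "sk": f"TRAINING#{model_id}", "status": status}


def query(table: List[Dict[str, str]], pk: str, sk_prefix: str = "") -> List[Dict[str, str]]:
    """Emulate a DynamoDB Query: exact partition key, begins_with on the sort key."""
    return [item for item in table if item["pk"] == pk and item["sk"].startswith(sk_prefix)]


table = [
    profile_item("p1", "speedy"),
    model_item("p1", "m1", "first-model"),
    model_item("p1", "m2", "second-model"),
    training_job_item("p1", "m1", "COMPLETED"),
    model_item("p2", "m3", "other-users-model"),
]

# All models owned by profile p1, without touching other entity types or users.
p1_models = query(table, "PROFILE#p1", "MODEL#")
```

With this kind of layout, everything belonging to one user shares a partition key, so a single query can fetch one entity type or the user's whole workspace.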
Model Training and Evaluation Engine
The core reinforcement learning functionality is centered around Amazon SageMaker AI training jobs, which provide the compute resources for running reinforcement learning training and evaluation jobs. When users initiate training jobs, the requests are queued in Amazon SQS to manage demand and ensure fair resource allocation. AWS Step Functions orchestrates the workflow of preparing training environments, monitoring job progress, and handling completion tasks. The system pulls a containerized simulation environment, which comprises the DeepRacer virtual simulator built on robotics simulation technology, from Amazon ECR.
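As an illustration of the queuing step, the sketch below builds the arguments for an Amazon SQS `send_message` call against a FIFO queue. `MessageGroupId` and `MessageDeduplicationId` are real SQS FIFO parameters; the message fields, the single `jobs` message group, and the content-hash deduplication strategy are assumptions made for this example only.

```python
import hashlib
import json
from typing import Dict

JOB_TYPES = ("training", "evaluation")


def build_job_message(queue_url: str, job_type: str, profile_id: str, model_id: str) -> Dict[str, str]:
    """Build keyword arguments for sqs.send_message on a FIFO queue."""
    if job_type not in JOB_TYPES:
        raise ValueError(f"unknown job type: {job_type}")
    # sort_keys makes the body, and therefore the deduplication ID, deterministic.
    body = json.dumps(
        {"jobType": job_type, "profileId": profile_id, "modelId": model_id},
        sort_keys=True,
    )
    return {
        "QueueUrl": queue_url,
        "MessageBody": body,
        # Assumption: one message group, so all jobs are delivered in strict order.
        "MessageGroupId": "jobs",
        # Assumption: identical requests sent within the SQS deduplication
        # interval are treated as duplicates and dropped.
        "MessageDeduplicationId": hashlib.sha256(body.encode("utf-8")).hexdigest(),
    }
```

With boto3 available, the message could be sent using `boto3.client("sqs").send_message(**build_job_message(...))`. A single message group gives strict ordering at the cost of serialized consumption; grouping by user would trade global ordering for parallelism.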
Real-time Simulation Streaming
During model training and evaluation, Amazon Kinesis Video Streams captures video from the simulation environment and streams it in real time to the console. This allows users to watch their models learn and perform, providing immediate visual feedback on training progress and model behavior. The streaming capability delivers an engaging, visual experience that helps users understand how their models are developing and performing on the virtual race track.
Security and Validation
Before any user-provided code executes in the system, it passes through validation functions. These functions examine reward functions and imported models for security issues, ensuring that malicious or harmful code cannot compromise the system. The validation functions operate within isolated network environments that prevent external communication, providing an additional security boundary.
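One common way to screen submitted Python code without running it is to inspect its syntax tree. The sketch below uses the standard `ast` module to check that a submission defines `reward_function` with a single parameter and to flag disallowed imports and calls. The allowlist, the list of forbidden calls, and the function name `validate_reward_function` are assumptions for illustration; the checks the solution actually performs are not documented here.

```python
import ast
from typing import List

ALLOWED_IMPORTS = {"math", "random"}  # hypothetical allowlist
FORBIDDEN_CALLS = {"eval", "exec", "open", "compile", "__import__"}


def validate_reward_function(source: str) -> List[str]:
    """Return the problems found in the submitted code (an empty list means it passed)."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg} (line {exc.lineno})"]

    problems: List[str] = []
    entry_points = [
        node for node in tree.body
        if isinstance(node, ast.FunctionDef) and node.name == "reward_function"
    ]
    if not entry_points:
        problems.append("missing top-level function reward_function(params)")
    elif len(entry_points[0].args.args) != 1:
        problems.append("reward_function must take exactly one argument")

    for node in ast.walk(tree):
        # Collect the top-level package name of every import statement.
        if isinstance(node, ast.Import):
            modules = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            modules = [(node.module or "").split(".")[0]]
        else:
            modules = []
        for module in modules:
            if module not in ALLOWED_IMPORTS:
                problems.append(f"import of '{module}' is not allowed")
        # Flag direct calls to dangerous built-ins such as eval() or open().
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in FORBIDDEN_CALLS):
            problems.append(f"call to '{node.func.id}' is not allowed")
    return problems
```

Static checks of this kind are easy to bypass on their own, which is why they are best combined with runtime isolation such as the network-isolated environments described above.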
Monitoring and Operations
Amazon CloudWatch provides comprehensive monitoring and logging across all system components, collecting metrics, logs, and performance data from Lambda functions, SageMaker instances, API Gateway, and other services. This enables cloud administrators to understand system performance, troubleshoot issues, and optimize resource usage.
AWS services
| AWS service | Function | Description |
|---|---|---|
| Amazon API Gateway | Core | Hosts REST API endpoints in the solution. |
| AWS CloudFormation | Core | Used to deploy the solution. |
| Amazon CloudFront | Core | Serves the web content hosted in Amazon S3. |
| Amazon Cognito | Core | Handles user management and authentication for the API. |
| Amazon DynamoDB | Core | Stores all user data related to user profiles, models, leaderboards, and submissions in a single table. |
| Amazon ECR | Core | Stores the Simulation Application (SimApp) image as a public container image, which is used by SageMaker instances to run the DeepRacer simulation application. |
| Amazon Kinesis Video Streams | Core | Streams videos from SageMaker AI training jobs to the user console, providing real-time visual feedback of model performance. |
| Amazon S3 | Core | Hosts static web assets for the user console and stores user-generated artifacts such as model files, training logs, and evaluation videos. |
| Amazon SageMaker AI | Core | Runs the Simulation Application (SimApp) for training and evaluating DeepRacer models. |
| Amazon SQS | Core | Provides a first-in-first-out job queue that holds simulation jobs before they are forwarded to the job dispatcher. |
| AWS Lambda | Core | Powers various functions including API request handling, model validation, reward function validation, job dispatching, and workflow management. |
| AWS Step Functions | Core | Manages workflow functions that orchestrate training and evaluation jobs on SageMaker instances. |
| AWS WAF | Core | Provides system protection against bot spam, DDoS attacks, credential stuffing, and other common attack vectors. |
| Amazon CloudWatch | Core | Provides monitoring and logging capabilities for all components of the DeepRacer on AWS solution. |
| AWS IAM | Core | Manages access control and permissions for various components of the DeepRacer on AWS solution. |
| Amazon VPC | Optional | Can be used to provide network isolation for SageMaker AI training jobs for enhanced security. |