Architecture overview - DeepRacer on AWS

This section provides an overview of the architecture of this solution.

Architecture diagram

Deploying this solution with the default parameters creates the following components in your AWS account.

[Figure: DeepRacer on AWS architecture diagram]

Architectural components

  1. A user accesses the DeepRacer on AWS user interface through an Amazon CloudFront distribution, which delivers static web assets from the UI assets bucket and video streams from simulations.

  2. An Amazon S3 bucket hosts the static web assets that make up the user interface.

  3. An Amazon Cognito user pool manages users and user group membership.

  4. An Amazon Cognito identity pool manages federation and rule-based role mapping for users.

  5. AWS IAM roles define the permissions and level of access for each user group in the system and are used for access control and authorization.

  6. AWS Lambda registration hooks execute pre- and post-registration actions including assigning new users as racers, handling initial admin profile creation, and more.

  7. AWS WAF provides intelligent protection for the API against common attack vectors and allows customers to define custom rules based on individual use cases and usage patterns.

  8. Amazon API Gateway routes API requests to their appropriate handler using a defined Smithy model.

  9. A single Amazon DynamoDB table stores and manages profiles, training jobs, models, evaluation jobs, submissions, and leaderboards.

  10. AWS Lambda functions are triggered in response to requests routed from the API and are responsible for CRUD operations, dispatching training/evaluation jobs, and more.

  11. A global settings handler (AWS Lambda function) reads and writes application-level settings to the configuration.

  12. An AWS AppConfig hosted configuration stores application-level settings, such as usage quotas.

  13. Model export handlers (AWS Lambda functions) retrieve the asset URL and package assets for use in exporting models from the system.

  14. An Amazon SQS dead-letter queue catches failed export jobs from the asset packaging function.

  15. A virtual model bucket stores exported models and provides access to them through presigned URLs.

  16. A model import handler (AWS Lambda function) receives requests to import a model onto the system and creates a new import job.

  17. A model import queue (Amazon SQS) receives jobs from the model import function and holds them until they are accepted by the dispatcher; a dead-letter queue (DLQ) captures failed jobs.

  18. A failed request handler (AWS Lambda function) manages failed requests and updates their status to reflect their current state.

  19. An import dispatching function takes a job from the queue and dispatches it to the workflow.

  20. A reward function validator (AWS Lambda function) checks the reward function and validates/sanitizes the customer-provided code before it is saved to the system.

  21. An imported model validator function checks and validates the imported model before it is saved to the system.

  22. An imported model assets handler (AWS Lambda function) brings in model assets from the upload bucket.

  23. An import completion handler (AWS Lambda function) handles status updates when a job is completed successfully.

  24. An upload bucket (Amazon S3) stores uploaded (but not yet imported) assets from the user.

  25. An Amazon SQS FIFO queue receives requests for training and evaluation jobs and stores them in FIFO order.

  26. A job dispatcher function takes the next job from the front of the FIFO queue and dispatches it to the workflow.

  27. Workflow functions handle setting up the job, setting status, and other workflow tasks.

  28. Amazon SageMaker AI training jobs perform the actual training and evaluation of the model using the reward function and hyperparameters provided.

  29. Amazon Kinesis Video Streams streams the simulation video from the training job to the user.

  30. A user data bucket stores all user data including trained models, evaluation results, and other assets generated during the DeepRacer workflow.
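
Steps 14 and 17 above rely on Amazon SQS dead-letter queues. As a minimal sketch of how such a pairing is commonly configured, the Python below builds the `RedrivePolicy` queue attribute, which tells SQS to move a message to a DLQ after a number of failed receives. The helper name, the placeholder ARN, and the receive count of 3 are illustrative assumptions and are not taken from this solution.

```python
import json


def build_redrive_attributes(dlq_arn: str, max_receive_count: int = 3) -> dict:
    """Return SQS queue attributes that route repeatedly failing messages to a DLQ.

    Hypothetical helper: the solution's actual queue settings are not documented here.
    """
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    # SQS expects RedrivePolicy as a JSON string, not as a nested object.
    policy = {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": max_receive_count}
    return {"RedrivePolicy": json.dumps(policy)}


# Placeholder ARN for illustration only.
attributes = build_redrive_attributes(
    "arn:aws:sqs:us-east-1:123456789012:example-import-dlq"
)
```

A dictionary of this shape could be passed as the `Attributes` argument when creating or updating the source queue, for example with the boto3 SQS client's `create_queue` or `set_queue_attributes` calls.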

Functional components

This solution implements a serverless, microservices-based architecture that enables users to train and evaluate reinforcement learning models for autonomous racing. The architecture is organized around several key functional areas that work together to provide a complete reinforcement learning education platform.

User Interface and Authentication

Users access DeepRacer on AWS through a web-based console delivered via Amazon CloudFront, which provides fast, global distribution of the user interface assets. These static web assets are hosted in Amazon S3, ensuring reliable and scalable content delivery to users worldwide. Amazon Cognito manages user authentication and authorization, handling user registration, login, and session management.

When new users register, the system automatically creates user profiles and establishes proper permissions, ensuring a seamless onboarding experience. This authentication layer secures access to the platform while enabling users to maintain their own private workspace for models, training data, and race submissions.
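
The registration step described above could look roughly like the following Amazon Cognito post-confirmation trigger, sketched in Python. It assumes a group named `racers` and passes the Cognito client in as a parameter so the example runs without AWS access; both choices are assumptions made for illustration, and the solution's real hook may differ. `RecordingCognitoClient` is a stand-in for a boto3 `cognito-idp` client, whose `admin_add_user_to_group` call takes the same keyword arguments.

```python
class RecordingCognitoClient:
    """Test double that records calls instead of contacting Amazon Cognito."""

    def __init__(self):
        self.calls = []

    def admin_add_user_to_group(self, **kwargs):
        self.calls.append(kwargs)


def post_confirmation_handler(event: dict, cognito_client, group_name: str = "racers") -> dict:
    """Add a newly confirmed user to a default group (the group name is an assumption)."""
    cognito_client.admin_add_user_to_group(
        UserPoolId=event["userPoolId"],
        Username=event["userName"],
        GroupName=group_name,
    )
    # Cognito triggers must return the event object so the sign-up flow can continue.
    return event


# Trimmed-down example of a post-confirmation trigger event.
sample_event = {
    "userPoolId": "us-east-1_EXAMPLE",
    "userName": "new-racer",
    "triggerSource": "PostConfirmation_ConfirmSignUp",
    "request": {"userAttributes": {"email": "racer@example.com"}},
}
client = RecordingCognitoClient()
result = post_confirmation_handler(sample_event, client)
```

In a deployed function the entry point would take `(event, context)` and create the client itself; injecting it here only keeps the sketch testable.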

API Layer and Request Processing

All user interactions with the system flow through Amazon API Gateway, which serves as the central entry point for backend operations. The API Gateway routes requests to appropriate AWS Lambda functions based on the endpoint accessed, providing a clean separation between the user interface and backend processing logic. AWS WAF protects the API layer from common security threats such as bot attacks, DDoS attempts, and malicious traffic patterns.
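
To make the request path concrete, here is a small Python sketch of a Lambda handler behind API Gateway. It assumes Lambda proxy integration and a hypothetical `GET /models/{modelId}` route; the route, the field names, and the `load_model` callback are invented for illustration, since the solution's actual API is defined by its Smithy model.

```python
import json


def get_model_handler(event: dict, load_model) -> dict:
    """Handle a hypothetical GET /models/{modelId} request from API Gateway."""
    model_id = (event.get("pathParameters") or {}).get("modelId")
    if not model_id:
        return _response(400, {"message": "modelId is required"})
    model = load_model(model_id)
    if model is None:
        return _response(404, {"message": f"model {model_id} not found"})
    return _response(200, model)


def _response(status_code: int, body: dict) -> dict:
    # Lambda proxy integration expects statusCode, headers, and a string body.
    return {
        "statusCode": status_code,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(body),
    }


# An in-memory dictionary stands in for the data store.
models = {"m-1": {"modelId": "m-1", "status": "READY"}}
found = get_model_handler({"pathParameters": {"modelId": "m-1"}}, models.get)
missing = get_model_handler({"pathParameters": {"modelId": "m-2"}}, models.get)
```

Keeping the handler free of routing logic mirrors the separation described above: API Gateway selects the function, and the function only validates input and returns a response.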

Data Management

The solution uses a combination of Amazon DynamoDB and Amazon S3 to handle different types of data storage needs. DynamoDB serves as the primary database for structured data including user profiles, model metadata, training job status, leaderboards, and race submissions. Amazon S3 handles file storage for larger assets such as trained model files, training logs, evaluation videos, and other user-generated content.
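
Because the solution keeps its structured data in a single DynamoDB table, different entity types have to share one key space. The Python sketch below shows one common way to do that with composite partition and sort keys. The `pk`/`sk` attribute names and the `PROFILE#`/`MODEL#` prefixes are assumptions chosen for illustration, not the solution's actual schema, and the query helper is an in-memory stand-in for a DynamoDB `Query` with a `begins_with` sort-key condition.

```python
def profile_key(profile_id: str) -> dict:
    """Key for a profile item (hypothetical key layout)."""
    return {"pk": f"PROFILE#{profile_id}", "sk": "PROFILE"}


def model_key(profile_id: str, model_id: str) -> dict:
    """Key for a model item stored under its owner's partition."""
    return {"pk": f"PROFILE#{profile_id}", "sk": f"MODEL#{model_id}"}


def query_by_prefix(items: list, pk: str, sk_prefix: str) -> list:
    """Return items in one partition whose sort key starts with sk_prefix, in sort-key order."""
    matches = [i for i in items if i["pk"] == pk and i["sk"].startswith(sk_prefix)]
    return sorted(matches, key=lambda i: i["sk"])


# A list of dictionaries stands in for the table.
table = [
    {**profile_key("p-1"), "alias": "speedy"},
    {**model_key("p-1", "m-2"), "status": "TRAINING"},
    {**model_key("p-1", "m-1"), "status": "READY"},
    {**model_key("p-2", "m-9"), "status": "READY"},
]
models_for_p1 = query_by_prefix(table, "PROFILE#p-1", "MODEL#")
```

With a layout like this, one query against a single partition returns all of a user's models without touching other users' items.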

Model Training and Evaluation Engine

The core reinforcement learning functionality is centered around Amazon SageMaker AI training jobs, which provide the compute resources for running reinforcement learning training and evaluation. When users initiate training jobs, the requests are queued in Amazon SQS to manage demand and ensure fair resource allocation. AWS Step Functions orchestrates the workflow of preparing training environments, monitoring job progress, and handling completion tasks. The system pulls a containerized simulation environment from Amazon ECR, which comprises the DeepRacer virtual simulator built on robotics simulation technology.
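
The queueing step can be sketched as follows. This Python helper builds the parameters for an SQS `send_message` call against a FIFO queue; the message body fields, the single `simulation-jobs` message group, and the use of the job ID for deduplication are assumptions made for illustration, not the solution's actual message format.

```python
import json


def build_job_message(queue_url: str, job_id: str, job_type: str, payload: dict) -> dict:
    """Build send_message parameters for a training or evaluation job (hypothetical format)."""
    if job_type not in ("training", "evaluation"):
        raise ValueError(f"unsupported job type: {job_type}")
    body = {"jobId": job_id, "jobType": job_type, **payload}
    return {
        "QueueUrl": queue_url,
        "MessageBody": json.dumps(body, sort_keys=True),
        # One shared group keeps every job in a single, strictly ordered sequence.
        "MessageGroupId": "simulation-jobs",
        # Reusing the job ID lets SQS discard a repeated submission of the same job
        # that arrives within the queue's deduplication interval.
        "MessageDeduplicationId": job_id,
    }


# Placeholder queue URL and payload for illustration only.
message = build_job_message(
    "https://sqs.us-east-1.amazonaws.com/123456789012/example-jobs.fifo",
    job_id="job-42",
    job_type="training",
    payload={"modelId": "m-1"},
)
```

The resulting dictionary could be passed to a boto3 SQS client as `send_message(**message)`. Using one message group trades throughput for strict first-in-first-out ordering across all jobs.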

Real-time Simulation Streaming

During model training and evaluation, Amazon Kinesis Video Streams captures video from the simulation environment and streams it in real time to the console. This allows users to watch their models learn and perform, providing immediate visual feedback on training progress and model behavior on the virtual race track.

Security and Validation

Before any user-provided code executes in the system, it passes through validation functions. These examine reward functions and imported models for security issues, ensuring that malicious or harmful code cannot compromise the system. The functions operate within isolated network environments that prevent external communication, providing an additional security boundary.
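
As a rough illustration of what static validation of a reward function can involve, the Python sketch below parses the submitted source without executing it, rejects imports outside an allow-list, and checks that a top-level `reward_function` is defined. The allow-list and the error messages are assumptions; the checks performed by the solution's validator are not documented here and are likely more extensive.

```python
import ast

# Hypothetical allow-list; the solution's real policy may differ.
ALLOWED_IMPORTS = {"math", "random"}


def validate_reward_function(source: str) -> list:
    """Return a list of problems found in the source; an empty list means it passed."""
    try:
        tree = ast.parse(source)  # parses only, never executes the code
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg} (line {exc.lineno})"]

    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] not in ALLOWED_IMPORTS:
                    problems.append(f"import of '{alias.name}' is not allowed")
        elif isinstance(node, ast.ImportFrom):
            # Relative imports have no module name and are rejected as well.
            module = (node.module or "").split(".")[0]
            if module not in ALLOWED_IMPORTS:
                problems.append(f"import from '{node.module}' is not allowed")

    has_entry_point = any(
        isinstance(node, ast.FunctionDef) and node.name == "reward_function"
        for node in tree.body
    )
    if not has_entry_point:
        problems.append("missing top-level function 'reward_function'")
    return problems


good_source = "import math\ndef reward_function(params):\n    return 1.0\n"
bad_source = "import os\ndef reward_function(params):\n    return 1.0\n"
good_result = validate_reward_function(good_source)
bad_result = validate_reward_function(bad_source)
```

Static checks like these reduce risk but are not a sandbox on their own, which is why running the validators in an isolated network environment, as described above, matters as a second boundary.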

Monitoring and Operations

Amazon CloudWatch provides comprehensive monitoring and logging across all system components, collecting metrics, logs, and performance data from Lambda functions, SageMaker instances, API Gateway, and other services. This enables cloud administrators to understand system performance, troubleshoot issues, and optimize resource usage.

AWS services

AWS service | Function | Description
--- | --- | ---
Amazon API Gateway | Core | Hosts REST API endpoints in the solution.
AWS CloudFormation | Core | Used to deploy the solution.
Amazon CloudFront | Core | Serves the web content hosted in Amazon S3.
Amazon Cognito | Core | Handles user management and authentication for the API.
Amazon DynamoDB | Core | Stores all user data related to user profiles, models, leaderboards, and submissions in a single table.
Amazon Elastic Container Registry | Core | Stores the Simulation Application (SimApp) image as a public container image, which is used by SageMaker instances to run the DeepRacer simulation application.
Amazon Kinesis Video Streams | Core | Streams videos from SageMaker AI training jobs to the user console, providing real-time visual feedback of model performance.
Amazon S3 | Core | Hosts static web assets for the user console and stores user-generated artifacts such as model files, training logs, and evaluation videos.
Amazon SageMaker | Core | Runs the Simulation Application (SimApp) for training and evaluating DeepRacer models.
Amazon SQS | Core | Provides a first-in-first-out job queue that holds simulation jobs before they are forwarded to the job dispatcher.
AWS Lambda | Core | Powers various functions including API request handling, model validation, reward function validation, job dispatching, and workflow management.
AWS Step Functions | Core | Manages workflow functions that orchestrate training and evaluation jobs on SageMaker instances.
AWS WAF | Core | Provides system protection against bot spam, DDoS attacks, credential stuffing, and other common attack vectors.
Amazon CloudWatch | Core | Provides monitoring and logging capabilities for all components of the DeepRacer on AWS solution.
AWS Identity and Access Management (IAM) | Core | Manages access control and permissions for various components of the DeepRacer on AWS solution.
Amazon Virtual Private Cloud (VPC) | Optional | Can be used to provide network isolation for SageMaker AI training jobs for enhanced security.