Aggregate data in Amazon DynamoDB for ML forecasting in Athena
Sachin Doshi and Peter Molnar, Amazon Web Services
Summary
This pattern shows you how to build complex aggregations of Internet of Things (IoT) data in an Amazon DynamoDB table by using Amazon Athena. You also learn how to enrich the data with machine learning (ML) inference by using Amazon SageMaker AI and how to query geospatial data by using Athena. You can use this pattern as the basis for creating an ML forecasting solution that meets your organization's requirements.
For demonstration purposes, this pattern uses an example scenario of a business that's operating a scooter rideshare and wants to predict the optimal number of scooters that must be deployed for customers in different urban neighborhoods. The business uses a pre-trained ML model that predicts customer demand for the next hour based on the past four hours. The scenario uses a public dataset from the Office of Civic Innovation & Technology
Prerequisites and limitations
- An active AWS account 
- Permissions to create an AWS CloudFormation stack with AWS Identity and Access Management (IAM) roles for the following: - Amazon Simple Storage Service (Amazon S3) bucket 
- Athena 
- DynamoDB 
- SageMaker AI 
- AWS Lambda 
 
Architecture
Technology stack
- Amazon Quick Suite 
- Amazon S3 
- Athena 
- DynamoDB 
- Lambda 
- SageMaker AI 
Target architecture
The following diagram shows an architecture for building complex aggregations of data in DynamoDB by using the querying capabilities of Athena, a Lambda function, Amazon S3 storage, a SageMaker AI endpoint, and a Quick Suite dashboard.

The diagram shows the following workflow:
- A DynamoDB table ingests IoT data that's transmitted from a fleet of scooters. 
- A Lambda function loads the DynamoDB table with the ingested data. 
- An Athena query creates a new DynamoDB table for the geospatial data that represents the urban neighborhoods. 
- The query location is saved in an S3 bucket. 
- An Athena function queries the ML inference from the SageMaker AI endpoint that hosts the pre-trained ML model. 
- Athena queries data directly from the DynamoDB tables and aggregates the data for analysis. 
- A user views the output of the analyzed data in a Quick Suite dashboard. 
Tools
AWS services
- Amazon Athena is an interactive query service that helps you analyze data directly in Amazon S3 by using standard SQL. 
- Amazon DynamoDB is a fully managed NoSQL database service that provides fast, predictable, and scalable performance. 
- Amazon SageMaker AI is a managed ML service that helps you build and train ML models and then deploy them into a production-ready hosted environment. 
- Amazon Simple Storage Service (Amazon S3) is a cloud-based object storage service that helps you store, protect, and retrieve any amount of data. 
- Amazon Quick Suite is a cloud-scale business intelligence (BI) service that helps you visualize, analyze, and report your data in a single dashboard. 
- AWS Lambda is a compute service that helps you run code without needing to provision or manage servers. It runs your code only when needed and scales automatically, so you pay only for the compute time that you use. 
Code repository
The code for this pattern is available in the GitHub Use ML predictions over Amazon DynamoDB data with Amazon Athena ML
- A DynamoDB table 
- A Lambda function to load the table with relevant data 
- A SageMaker AI endpoint for inference requests, with the pre-trained XGBoost model that's stored in Amazon S3 
- An Athena workgroup named - V2EngineWorkGroup
- Named Athena queries to look up the geospatial shapefiles and predict scooter demand 
- A prebuilt Amazon Athena DynamoDB connector that enables Athena to communicate with DynamoDB and uses AWS Serverless Application Model (AWS SAM) - to build the application in reference to the DynamoDB connector 
Epics
| Task | Description | Skills required | 
|---|---|---|
| Download the dataset and resources. | 
 | App developer, Data scientist | 
| Task | Description | Skills required | 
|---|---|---|
| Create a CloudFormation stack. | 
 NoteIt can take 15–20 minutes for the CloudFormation stack to create these resources. | AWS DevOps | 
| Verify the CloudFormation deployment. | To verify that the sample data from the CloudFormation template is loaded into DynamoDB, do the following: 
 | App developer | 
| Task | Description | Skills required | 
|---|---|---|
| Create an Athena table with geospatial data. | To load the geolocation files into Athena, do the following: 
 The query creates a new table for the geospatial data that represents the urban neighborhoods. The data table is created from GIS shapefiles. The  For the Python code to process shapefiles and produce this table, see Geo-Spatial processing of GIS shapefiles with Amazon Athena | Data engineer | 
| Task | Description | Skills required | 
|---|---|---|
| Declare a function in Athena to query SageMaker AI. | 
 | Data scientist, Data engineer | 
| Predict demand for scooters by neighborhood from the aggregated DynamoDB data. | Now you can use Athena to query transactional data directly from DynamoDB, and then aggregate the data for analysis and forecasting. This isn’t easily achieved by directly querying a DynamoDB NoSQL database. 
 The SQL statement does the following: 
 For information about using SQL to aggregate DynamoDB data and SageMaker AI inference data in Athena, see athena_long.sql | App developer, Data scientist | 
| Verify the output. | The output table includes the neighborhood, longitude, and latitude of the centroid of the neighborhood. It also includes the number of vehicles that are predicted for the next hour. The query produces the predictions for a selected point in time. You can make predictions for any other time by changing the expression  If you have a real-time data feed in your DynamoDB table, change the timestamp to  | App developer, Data scientist | 
| Task | Description | Skills required | 
|---|---|---|
| Delete resources. | 
 | App developer, AWS DevOps | 
Related resources
- Amazon Athena Query Federation SDK - (GitHub) 
- Querying geospatial data (AWS documentation) 
- Use ML predictions over Amazon DynamoDB data with Amazon Athena ML - (AWS Big Data Blog) 
- Amazon ElastiCache (Redis OSS) - (AWS documentation) 
- Amazon Neptune - (AWS documentation)