Guidance for Aircraft Predictive Maintenance on AWS

Overview

This Guidance shows how machine learning (ML) models can be applied to Internet of Things (IoT) sensor data to predict component or system failures before they happen and recommend appropriate maintenance steps. Aerospace manufacturing, aircraft operations, and other manufacturing and industrial domains use IoT devices to identify patterns in sensor output data and anticipate the maintenance needed to prevent system failures and downtime. This Guidance helps you use that data to reduce unplanned downtime of manufacturing lines, aircraft, and other systems.

How it works

This architecture diagram shows how to reduce unscheduled maintenance delays and flight cancellations by using ML and data from IoT devices to predict maintenance needs.

Step 1
Source data comes from multiple sources. The aircraft generates flight logs, which are transmitted wirelessly through the Aircraft Communications Addressing and Reporting System (ACARS) or recorded by a Quick Access Recorder (QAR). Maintenance, Repair, and Overhaul (MRO) facilities generate maintenance records. Airlines broadcast delay and cancellation notices as flight ops events.
Step 2
Raw data is ingested into an Amazon Simple Storage Service (Amazon S3) bucket through a variety of channels, depending on source type. Amazon Kinesis Data Streams manages event streams, and AWS DataSync or AWS Glue manages bulk transfers from databases and file storage.
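As a rough illustration of the event-stream path in Step 2, the sketch below sends one flight ops event to a Kinesis data stream. The stream name, the event fields, and the choice of tail number as the partition key are assumptions made for this example and are not specified by the Guidance. The client is passed in (it is expected to be a boto3 Kinesis client) so the function can be exercised with a stub.

```python
import json


def publish_flight_ops_event(kinesis_client, stream_name, event):
    """Send one flight ops event to a Kinesis data stream.

    `kinesis_client` is expected to be a boto3 Kinesis client
    (boto3.client("kinesis")). `stream_name` and the event fields
    are hypothetical.
    """
    return kinesis_client.put_record(
        StreamName=stream_name,
        Data=json.dumps(event, sort_keys=True).encode("utf-8"),
        # Assumption: partitioning by tail number keeps the events of one
        # aircraft ordered within a single shard.
        PartitionKey=event["tail_number"],
    )
```

With real credentials, a call might look like `publish_flight_ops_event(boto3.client("kinesis"), "flight-ops-events", {"tail_number": "N12345", "event_type": "DELAY"})`, where the stream name is again a placeholder.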
Step 3
AWS Glue transforms the raw data and writes the results to a second S3 bucket, redacting sensitive fields from QAR records and normalizing values across the other data sources.
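The per-record logic of such a transform could look something like the sketch below. It is plain Python rather than an actual AWS Glue job script, and the field names (`crew_id`, `pilot_name`, `altitude_ft`) and the feet-to-meters conversion are hypothetical examples of "sensitive fields" and "normalized values", not fields named by the Guidance.

```python
SENSITIVE_FIELDS = {"crew_id", "pilot_name"}  # hypothetical field names


def transform_qar_record(raw):
    """Redact sensitive fields and normalize values in one QAR record."""
    # Drop any field flagged as sensitive.
    record = {k: v for k, v in raw.items() if k not in SENSITIVE_FIELDS}
    # Normalize the aircraft identifier to a consistent form.
    record["tail_number"] = str(record["tail_number"]).strip().upper()
    # Assumption: altitude arrives in feet and is stored in meters.
    if "altitude_ft" in record:
        record["altitude_m"] = round(float(record.pop("altitude_ft")) * 0.3048, 1)
    return record
```

In a real Glue job the same steps would more likely be expressed as Spark transformations over the whole dataset; the function only shows what happens to a single record.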
Step 4
AWS Lambda records each normalized flight and maintenance event into an Amazon Aurora database. These records capture usage and failure history per tail number and serviceable component.
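To make the idea of usage and failure history per tail number and component concrete, here is a minimal sketch that uses Python's built-in `sqlite3` module as a local stand-in for Aurora. The table name, columns, and event types are assumptions for illustration; the Guidance does not define a schema.

```python
import sqlite3

# Hypothetical schema; the Guidance does not specify one.
SCHEMA = """
CREATE TABLE IF NOT EXISTS component_events (
    tail_number  TEXT NOT NULL,
    component_id TEXT NOT NULL,
    event_type   TEXT NOT NULL,
    flight_hours REAL NOT NULL DEFAULT 0,
    event_time   TEXT NOT NULL
)
"""


def record_event(conn, event):
    """Insert one normalized flight or maintenance event."""
    conn.execute(
        "INSERT INTO component_events VALUES (?, ?, ?, ?, ?)",
        (
            event["tail_number"],
            event["component_id"],
            event["event_type"],  # assumed values: 'FLIGHT' or 'FAILURE'
            event.get("flight_hours", 0.0),
            event["event_time"],
        ),
    )
    conn.commit()


def usage_history(conn, tail_number, component_id):
    """Return (total flight hours, failure count) for one component."""
    row = conn.execute(
        "SELECT COALESCE(SUM(flight_hours), 0), "
        "COALESCE(SUM(event_type = 'FAILURE'), 0) "
        "FROM component_events WHERE tail_number = ? AND component_id = ?",
        (tail_number, component_id),
    ).fetchone()
    return row[0], row[1]
```

Against Aurora the same statements would run through a MySQL- or PostgreSQL-compatible driver, whose parameter placeholder syntax differs from the `?` used by `sqlite3`.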
Step 5
Once a data baseline is established, Lambda queues up an ML training job in Amazon SageMaker, training unique prediction models per serviceable component.
Step 6
Once prediction models are established, Lambda invokes those models to predict time to next service based on each new record. SageMaker stores predicted results in a separate S3 bucket, and AWS Glue updates the corresponding component records in Aurora.
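One way the Lambda invocation in Step 6 might look is sketched below, assuming each component's model is hosted on its own SageMaker real-time endpoint. The endpoint naming scheme, the CSV feature format, and the single-number response are all assumptions; the Guidance does not describe the model interface. The client is passed in and is expected to be a boto3 `sagemaker-runtime` client.

```python
def predict_hours_to_service(runtime_client, component_id, features):
    """Ask a per-component model for the predicted time to next service.

    `runtime_client` is expected to be boto3.client("sagemaker-runtime").
    """
    endpoint_name = f"pdm-{component_id}"  # hypothetical naming scheme
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        # Assumption: the model accepts one CSV row of numeric features.
        Body=",".join(str(x) for x in features),
    )
    # Assumption: the model returns a single numeric value as text.
    return float(response["Body"].read().decode("utf-8").strip())
```

The sketch stops at returning the prediction; writing results to Amazon S3 and updating Aurora, as described in the step, are left out.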
Step 7
MRO technicians access aircraft and component service history and maintenance recommendations through a cloud-hosted web portal. Amazon API Gateway provides API-level access for maintenance apps, and Amazon QuickSight provides pre-built dashboards.
Step 8
Amazon Simple Notification Service (Amazon SNS) sends priority service requests to technicians by text message or email, as identified directly from source messages (for example, ACARS alarm messages).
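A simple version of this routing could be sketched as follows. The keyword list used to decide whether a source message is a priority, the message fields, and the subject line are hypothetical; the client is passed in and is expected to be a boto3 SNS client whose topic has the technicians' SMS or email subscriptions.

```python
PRIORITY_KEYWORDS = ("ALARM", "WARNING")  # hypothetical priority markers


def send_priority_request(sns_client, topic_arn, message):
    """Publish a service request if the source message looks urgent.

    Returns True when a notification was published, False otherwise.
    `sns_client` is expected to be boto3.client("sns").
    """
    text = message["text"].upper()
    if not any(keyword in text for keyword in PRIORITY_KEYWORDS):
        return False
    sns_client.publish(
        TopicArn=topic_arn,
        Subject=f"Priority service request: {message['tail_number']}",
        Message=message["text"],
    )
    return True
```

Keyword matching is only a placeholder for whatever rule identifies alarm messages in the real ACARS feed.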
Step 9
Amazon Athena runs analytical SQL queries directly against the maintenance data lake to provide insights for data scientists.
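For example, a data scientist might count failures per component with a query like the one in this sketch. The table name `maintenance_events`, its columns, and the database and output location are hypothetical; the client is passed in and is expected to be a boto3 Athena client.

```python
def start_failure_count_query(athena_client, database, output_location):
    """Start an Athena query that counts failures per component.

    Returns the query execution ID. `athena_client` is expected to be
    boto3.client("athena"); table and column names are hypothetical.
    """
    query = (
        "SELECT component_id, COUNT(*) AS failures "
        "FROM maintenance_events "
        "WHERE event_type = 'FAILURE' "
        "GROUP BY component_id "
        "ORDER BY failures DESC"
    )
    response = athena_client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        # Athena writes query results to this S3 location.
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

Athena queries run asynchronously, so a caller would then poll the execution status and fetch results using the returned ID.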

Well-Architected Pillars

The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.

Operational Excellence

Amazon CloudWatch maintains telemetry on the running system to alert on conditions such as failed AWS Glue extract, transform, and load (ETL) jobs (indicating formatting errors in aircraft data) or error codes returned by API Gateway (indicating configuration problems with the maintenance application or website). Aurora is configured to generate automatic backups of aircraft and prediction data and can rapidly restore those backups. Automated telemetry and alarms help identify when the system is not meeting desired business outcomes and help surface underlying issues before the customer detects or reports them. Errors can be detected and reported both in external customer systems (such as the ACARS system or QAR processing) and in services in the AWS Cloud. Automated database backup and restoration allows for quicker recovery to normal status in the event of a failure or disruption.
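As one possible shape for such an alarm, the sketch below builds the parameters for a CloudWatch alarm on API Gateway server-side errors. The alarm name, threshold, evaluation period, and notification topic are assumptions chosen for illustration, not values from the Guidance.

```python
def api_error_alarm_params(api_name, alarm_topic_arn, threshold=5):
    """Build put_metric_alarm parameters for API Gateway 5XX errors.

    The threshold and period are illustrative defaults, not recommendations.
    """
    return {
        "AlarmName": f"{api_name}-5xx-errors",
        "Namespace": "AWS/ApiGateway",
        "MetricName": "5XXError",
        "Dimensions": [{"Name": "ApiName", "Value": api_name}],
        "Statistic": "Sum",
        "Period": 300,  # seconds
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # Treat periods with no traffic as healthy rather than alarming.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [alarm_topic_arn],
    }
```

The resulting dictionary could be passed to a boto3 CloudWatch client as `cloudwatch.put_metric_alarm(**params)`.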

Read the Operational Excellence whitepaper

Security

Amazon S3, AWS Glue, and Kinesis Data Streams enforce Transport Layer Security (TLS) for encryption of all customer data (such as aircraft, flight ops, and maintenance data) ingested to the cloud. Amazon S3 and Aurora, where all customer data is stored, enforce encryption on all data in storage. Customer data is therefore encrypted at all times, whether in transit or at rest, which helps ensure that sensitive flight operations and aircraft repair data is visible only to authorized users. AWS Glue is configured to remove privacy-regulated data from the dataset upon ingestion. API Gateway enforces user access control by requiring an authentication token provided by AWS IAM Identity Center, which manages user credentials and roles. User authentication functions help ensure that user credentials are securely managed and rotated, with users allocated to groups with specific access rights according to job role (such as mechanic, supervisor, data scientist, or admin), following the principle of least privilege. Group- and role-based access management helps ensure that user access rights are securely and consistently managed at scale across all organizations.

Read the Security whitepaper

Reliability

Amazon S3 and Aurora provide a high degree of data durability through multi-Availability Zone data replication, in addition to automated backup and restoration of data. Data durability helps ensure that all data required to make maintenance predictions is available and can be restored in the event of a failure. Lambda, AWS Glue, SageMaker, and API Gateway are fully managed services with automated scaling of resources. Loss of an Availability Zone or database replica will not take down the predictive maintenance system; these services automatically divert requests from failed resources to healthy ones. The managed services provide automated failover without user intervention and without additional cost. Kinesis Data Streams automatically scales data ingestion and throttles throughput to match downstream processing rates. The autoscaling of compute resources and auto-throttling of data streams help ensure that the system can adapt reliably to traffic bursts related to events such as higher flight volume or uploads of large maintenance record batches.

Read the Reliability whitepaper

Performance Efficiency

SageMaker and Aurora report utilization metrics to CloudWatch, allowing you to monitor historical utilization of computing resources. CloudWatch alarms can be configured to invoke scale-in or scale-out operations in Aurora and SageMaker to match changing demand. For example, if an alarm signals low utilization of database instances, the system could automatically remove a database replica, or the operator could select a smaller database instance type. CloudWatch instrumentation provides real-time visibility into changes in system utilization, giving deeper insight into whether computing resources are right-sized for the predictive maintenance application. Based on this information, you can adapt computing resources, such as allocating larger or smaller instance types for the SageMaker prediction inference endpoint or an Amazon Redshift data warehouse for maintenance analytics.

Read the Performance Efficiency whitepaper

Cost Optimization

Amazon S3 provides automated lifecycle management of data, moving infrequently accessed data to lower-cost Amazon S3 Glacier storage tiers. This can significantly reduce the cost of retaining legacy flight and component records that may be outdated but are still relevant for infrequent reports or model training. The automated tiering or retiring of older data reduces storage costs while maintaining a long service history for making accurate maintenance predictions. Additionally, Lambda and AWS Glue provide serverless computing and data transformation that automatically scale resources up or down to match real-time demand signals; you pay only for the actual computing time used for maintenance predictions. The fully managed, serverless computing resources help avoid wasted spend by automatically scaling resources based on real-time demand. This is important because system utilization will be inherently cyclic: data from flight ops, ACARS, and QAR systems will peak during the daytime or peak travel seasons and wane at night or during off-peak seasons.
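A lifecycle rule of this kind could be expressed roughly as in the sketch below, which builds a configuration that moves objects under a given prefix to an S3 Glacier storage class after a number of days. The prefix, rule ID, and the 365-day default are assumptions for illustration; appropriate retention periods depend on your reporting and model-training needs.

```python
def archive_lifecycle_config(prefix, days_to_glacier=365):
    """Build an S3 lifecycle configuration that archives older records.

    `prefix` and `days_to_glacier` are illustrative; choose values that
    match how often legacy records are actually read.
    """
    return {
        "Rules": [
            {
                "ID": f"archive-{prefix.strip('/')}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": days_to_glacier, "StorageClass": "GLACIER"}
                ],
            }
        ]
    }
```

The configuration could then be applied with a boto3 S3 client via `s3.put_bucket_lifecycle_configuration(Bucket=bucket_name, LifecycleConfiguration=config)`, where `bucket_name` is your own bucket.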

Read the Cost Optimization whitepaper

Sustainability

Aurora and Athena both support compression of underlying data sources. Compression of system data (such as maintenance logs or flight records) significantly reduces the data storage requirements of the predictive maintenance system, reducing the system’s environmental impact.

Read the Sustainability whitepaper