Flink application troubleshooting
Problem: Flink application not running (dead app)
A Flink application shows status READY instead of RUNNING, or has stopped processing records. CloudWatch alarms fire for downtime or idle processing.
Diagnosis
-
Check the application status:
# List all CMS Flink applications and their status STAGE=dev aws kinesisanalyticsv2 list-applications \ --query "ApplicationSummaries[?contains(ApplicationName, 'cms-$STAGE-flink')].[ApplicationName,ApplicationStatus]" \ --output table -
Check CloudWatch alarms:
aws cloudwatch describe-alarms \ --alarm-name-prefix "cms-dev-flink" \ --query "MetricAlarms[?StateValue=='ALARM'].[AlarmName,AlarmDescription]" \ --output table -
Check the application logs for errors:
APP=cms-dev-flink-trip-processor aws logs tail /aws/kinesis-analytics/$APP --since 30m --follow
Resolution
If the app is in READY state (stopped):
APP=cms-dev-flink-trip-processor aws kinesisanalyticsv2 start-application --application-name $APP
If the app is stuck in UPDATING state:
# Force stop the application APP=cms-dev-flink-trip-processor aws kinesisanalyticsv2 stop-application --application-name $APP --force # Wait for READY status, then restart
If the app crashes immediately after starting:
This typically indicates a configuration error (wrong bootstrap servers, missing IAM permissions, or invalid JAR). Check the CloudWatch logs for the root cause, then:
# Delete and recreate the application with fresh configuration APP=cms-dev-flink-trip-processor TIMESTAMP=$(aws kinesisanalyticsv2 describe-application \ --application-name $APP \ --query "ApplicationDetail.CreateTimestamp" --output text) aws kinesisanalyticsv2 delete-application \ --application-name $APP \ --create-timestamp $TIMESTAMP # Redeploy via CDK cd deployment make phase4
Batch restart all Flink applications:
The solution includes a helper script to stop all applications, wait for READY status, and reconfigure:
cd deployment ./update-all-flink-apps.sh
Or to check and fix stuck applications:
cd deployment ./fix-flink-apps.sh
Problem: Flink processor Kafka SSL connection errors
Flink processors log SSL handshake failures or connection reset errors when communicating with MSK. This can occur after MSK broker maintenance, certificate rotation, or extended idle periods.
Resolution
All processors include KafkaConfig.withReconnect() settings that handle most transient SSL issues automatically. The reconnect configuration applies:
-
reconnect.backoff.ms=1000— Initial reconnect delay -
reconnect.backoff.max.ms=10000— Maximum reconnect delay -
connections.max.idle.ms=540000— Close idle connections after 9 minutes -
metadata.max.age.ms=300000— Refresh broker metadata every 5 minutes -
request.timeout.ms=30000— Request timeout -
session.timeout.ms=45000— Consumer session timeout
If errors persist after the automatic reconnect:
-
Delete and recreate the affected Flink application. Deleting and recreating has a higher success rate than updating in place:
APP=cms-dev-flink-fw-telemetry-processor TIMESTAMP=$(aws kinesisanalyticsv2 describe-application \ --application-name $APP \ --query "ApplicationDetail.CreateTimestamp" --output text) aws kinesisanalyticsv2 delete-application \ --application-name $APP --create-timestamp $TIMESTAMP # Redeploy via CDK cd deployment && make phase4 -
After recreation, start the application with
SKIP_RESTORE_FROM_SNAPSHOTto avoid restoring stale state that may contain cached SSL sessions.
Problem: Flink application running but processing zero records
The application shows RUNNING status but the numRecordsInPerSecond metric is zero. The idle processing CloudWatch alarm fires.
Diagnosis
-
Verify the upstream Kafka topic has data:
# Check topic offsets (requires kafka-tools or MSK client) # Look at the IoT Rule metrics to confirm messages are reaching MSK aws cloudwatch get-metric-statistics \ --namespace "AWS/IoT" \ --metric-name "RuleMessageThrottled" \ --dimensions Name=RuleName,Value=cms_dev_iot_msk_rule \ --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%S) \ --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \ --period 300 --statistics Sum -
Check if the Flink consumer group is lagging or stuck at an old offset.
-
Verify the Flink application’s runtime properties have the correct Kafka topic name and bootstrap servers.
Resolution
-
If no data is reaching MSK: Check the IoT Rules and VPC Destination configuration (see Problem: IoT Rule not routing messages to MSK).
-
If data is in MSK but Flink is not consuming: The consumer group may be stuck. Delete and recreate the Flink application to reset the consumer offset to
latest. -
If the SimulatorPreprocessor is not running: Downstream processors will not receive data because MQTT Direct telemetry requires decompression before processing.
Problem: Flink checkpoint failures
Checkpoints fail with timeout errors, causing the application to restart repeatedly.
Resolution
-
Check if DynamoDB writes are being throttled:
aws cloudwatch get-metric-statistics \ --namespace "AWS/DynamoDB" \ --metric-name "ThrottledRequests" \ --dimensions Name=TableName,Value=cms-dev-storage-trips \ --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%S) \ --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \ --period 300 --statistics Sum -
If DynamoDB is throttled, the tables use on-demand billing and should auto-scale. Wait a few minutes for capacity to adjust.
-
If Redis writes are timing out, check ElastiCache connectivity from the Flink VPC configuration.
-
Increase the checkpoint interval if the application is processing high volumes:
Update the Flink application runtime properties to set
checkpointing.interval=120000(2 minutes instead of the default 60 seconds).
Problem: FleetWise Edge agent not collecting signals
The FWE agent starts and checks in but does not collect or upload telemetry data. This typically occurs when the agent has not received a valid collection scheme or decoder manifest.
Resolution
-
Verify the CampaignSyncProcessor Flink application is in RUNNING state:
aws kinesisanalyticsv2 describe-application \ --application-name cms-dev-flink-campaign-sync-processor \ --query "ApplicationDetail.ApplicationStatus" -
Check that the agent’s checkin is reaching the
fw-checkinKafka topic by reviewing the CampaignSyncProcessor CloudWatch logs:aws logs tail /aws/kinesis-analytics/cms-dev-flink-campaign-sync-processor --follow -
Verify that active campaigns exist in DynamoDB for the vehicle’s VIN:
aws dynamodb query --table-name cms-dev-campaigns \ --index-name targetArn-index \ --key-condition-expression "targetArn = :arn" \ --expression-attribute-values '{":arn": {"S": "vehicle:YOUR_VIN"}}' -
Confirm the campaign includes
signalsToCollectwith valid signal IDs from the decoder manifest. -
Check the FWE agent container logs for scheme receipt:
# If using the simulation service curl http://localhost:5001/api/agent/logs/YOUR_VIN # Or directly via Docker docker logs fwe-YOUR_VIN --tail 100
Problem: FleetWise telemetry not appearing in DynamoDB
The FWE agent is collecting signals (visible in agent logs) but telemetry records do not appear in the DynamoDB telemetry table.
Resolution
-
Verify the IoT Rule
fw_dev_iot_msk_ruleis routing messages to thefw-telemetry-rawKafka topic. Check IoT Core rule metrics in CloudWatch for error counts:aws cloudwatch get-metric-statistics \ --namespace "AWS/IoT" \ --metric-name "Failure" \ --dimensions Name=RuleName,Value=fw_dev_iot_msk_rule \ --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%S) \ --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \ --period 300 --statistics Sum -
Verify the FWTelemetryProcessor Flink application is RUNNING and not showing SSL or connection errors:
aws logs tail /aws/kinesis-analytics/cms-dev-flink-fw-telemetry-processor --follow -
If the processor shows stale SSL connection errors, delete and recreate the Flink application with a fresh JAR (see Problem: Flink processor Kafka SSL connection errors).
-
Verify the decoder manifest exists in DynamoDB with signal mappings:
aws dynamodb scan --table-name cms-dev-decoder-manifest --max-items 5 -
Check that the FWTelemetryProcessor output topic
cms-telemetry-preprocessedis being consumed by downstream processors. If the SimulatorPreprocessor is also writing to this topic, verify there are no consumer group conflicts.
Problem: Campaign not syncing to FWE agent
The CampaignSyncProcessor is running but the agent does not receive collection schemes.
Resolution
-
Check the CampaignSyncProcessor logs for IoT Core MQTT publish errors:
aws logs tail /aws/kinesis-analytics/cms-dev-flink-campaign-sync-processor \ --filter-pattern "ERROR" --since 30m -
Verify the Flink application’s IAM role has
iot:Publishpermission on the topiccms/fleetwise/vehicles/+/collection_schemes_and_decoder_manifests. -
Verify the agent is subscribed to the correct MQTT topic. The agent subscribes to
cms/fleetwise/vehicles/{vin}/collection_schemes_and_decoder_manifestswhere{vin}must match the VIN in the campaign’s target. -
Check campaign status in DynamoDB. Campaigns in
SUSPENDEDstate will cause the processor to push an empty collection scheme, clearing the agent’s active collection.