
Flink application troubleshooting

Problem: Flink application not running (dead app)

A Flink application shows status READY instead of RUNNING, or has stopped processing records. CloudWatch alarms fire for downtime or idle processing.

Diagnosis

Check the application status:


# List all CMS Flink applications and their status
STAGE=dev
aws kinesisanalyticsv2 list-applications \
  --query "ApplicationSummaries[?contains(ApplicationName, 'cms-$STAGE-flink')].[ApplicationName,ApplicationStatus]" \
  --output table
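
To narrow the table above to applications that need attention, the same call can be filtered down to anything not in RUNNING state. This is a minimal sketch: the function name `flink_apps_not_running` is hypothetical, and it assumes the `cms-<stage>-flink` naming shown in the command above.

```shell
# Hypothetical helper: print "name<TAB>status" for every CMS Flink application
# whose status is not RUNNING. Assumes the cms-<stage>-flink naming convention.
flink_apps_not_running() {
  local stage="${1:-dev}"
  aws kinesisanalyticsv2 list-applications \
    --query "ApplicationSummaries[?contains(ApplicationName, 'cms-${stage}-flink')].[ApplicationName,ApplicationStatus]" \
    --output text |
    awk -F'\t' 'NF >= 2 && $2 != "RUNNING" { print $1 "\t" $2 }'
}
```

Running `flink_apps_not_running dev` prints one line per application that is stopped, updating, or otherwise not running, and prints nothing when every application is healthy.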

Check CloudWatch alarms:


aws cloudwatch describe-alarms \
  --alarm-name-prefix "cms-dev-flink" \
  --query "MetricAlarms[?StateValue=='ALARM'].[AlarmName,AlarmDescription]" \
  --output table

Check the application logs for errors:


APP=cms-dev-flink-trip-processor
aws logs tail /aws/kinesis-analytics/$APP --since 30m --follow

Resolution

If the app is in READY state (stopped):


APP=cms-dev-flink-trip-processor
aws kinesisanalyticsv2 start-application --application-name $APP

If the app is stuck in UPDATING state:


# Force stop the application
APP=cms-dev-flink-trip-processor
aws kinesisanalyticsv2 stop-application --application-name $APP --force
# Wait for READY status, then restart
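
The "wait for READY, then restart" step can be scripted with a small polling loop. The sketch below is one way to do it, not part of the solution itself: the function name `wait_for_flink_status`, the 10-second poll interval, and the 600-second default timeout are all assumptions.

```shell
# Hypothetical helper: poll until the application reaches the wanted status.
# Arguments: application name, target status, timeout in seconds (default 600).
wait_for_flink_status() {
  local app="$1" want="$2" timeout="${3:-600}" interval=10 waited=0 status
  while :; do
    status=$(aws kinesisanalyticsv2 describe-application \
      --application-name "$app" \
      --query "ApplicationDetail.ApplicationStatus" --output text)
    [ "$status" = "$want" ] && return 0
    if [ "$waited" -ge "$timeout" ]; then
      echo "Timed out waiting for $app to reach $want (last status: $status)" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}
```

For example, `wait_for_flink_status "$APP" READY && aws kinesisanalyticsv2 start-application --application-name "$APP"` restarts the application only after the force stop has completed.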

If the app crashes immediately after starting:

This typically indicates a configuration error (wrong bootstrap servers, missing IAM permissions, or invalid JAR). Check the CloudWatch logs for the root cause, then:


# Delete and recreate the application with fresh configuration
APP=cms-dev-flink-trip-processor
# delete-application requires the application's exact creation timestamp
TIMESTAMP=$(aws kinesisanalyticsv2 describe-application \
  --application-name "$APP" \
  --query "ApplicationDetail.CreateTimestamp" --output text)
aws kinesisanalyticsv2 delete-application \
  --application-name "$APP" \
  --create-timestamp "$TIMESTAMP"

# Redeploy via CDK
cd deployment
make phase4

Batch restart all Flink applications:

The solution includes a helper script to stop all applications, wait for READY status, and reconfigure:


cd deployment
./update-all-flink-apps.sh

Or to check and fix stuck applications:


cd deployment
./fix-flink-apps.sh

Problem: Flink processor Kafka SSL connection errors

Flink processors log SSL handshake failures or connection reset errors when communicating with MSK. This can occur after MSK broker maintenance, certificate rotation, or extended idle periods.

Resolution

All processors include KafkaConfig.withReconnect() settings that handle most transient SSL issues automatically. The reconnect configuration applies:

reconnect.backoff.ms=1000 — Initial reconnect delay
reconnect.backoff.max.ms=10000 — Maximum reconnect delay
connections.max.idle.ms=540000 — Close idle connections after 9 minutes
metadata.max.age.ms=300000 — Refresh broker metadata every 5 minutes
request.timeout.ms=30000 — Request timeout
session.timeout.ms=45000 — Consumer session timeout

If errors persist after the automatic reconnect:

Delete and recreate the affected Flink application. Deleting and recreating has a higher success rate than updating in place:


APP=cms-dev-flink-fw-telemetry-processor
TIMESTAMP=$(aws kinesisanalyticsv2 describe-application \
  --application-name $APP \
  --query "ApplicationDetail.CreateTimestamp" --output text)
aws kinesisanalyticsv2 delete-application \
  --application-name $APP --create-timestamp $TIMESTAMP
# Redeploy via CDK
cd deployment && make phase4

After recreation, start the application with SKIP_RESTORE_FROM_SNAPSHOT to avoid restoring stale state that may contain cached SSL sessions.
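
With the AWS CLI, the restore type is passed through the run configuration of `start-application`. The following is a minimal sketch; the function name `start_flink_skip_restore` is hypothetical, and it assumes the application has already been redeployed.

```shell
# Hypothetical helper: start a Flink application without restoring from a snapshot.
start_flink_skip_restore() {
  local app="$1"
  aws kinesisanalyticsv2 start-application \
    --application-name "$app" \
    --run-configuration '{"ApplicationRestoreConfiguration":{"ApplicationRestoreType":"SKIP_RESTORE_FROM_SNAPSHOT"}}'
}
```

For example: `start_flink_skip_restore cms-dev-flink-fw-telemetry-processor`.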

Problem: Flink application running but processing zero records

The application shows RUNNING status but the numRecordsInPerSecond metric is zero. The idle processing CloudWatch alarm fires.

Diagnosis

Verify the upstream Kafka topic has data:


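# Check topic offsets (requires kafka-tools or MSK client)
# Look at the IoT Rule metrics to confirm messages are reaching MSK.
# RuleMessageThrottled counts messages the rules engine throttled; a nonzero Sum
# means some messages never reached MSK.
# date -v-1H is BSD/macOS syntax; on GNU/Linux use: date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S
aws cloudwatch get-metric-statistics \
  --namespace "AWS/IoT" \
  --metric-name "RuleMessageThrottled" \
  --dimensions Name=RuleName,Value=cms_dev_iot_msk_rule \
  --start-time "$(date -u -v-1H +%Y-%m-%dT%H:%M:%S)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
  --period 300 --statistics Sum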

Check if the Flink consumer group is lagging or stuck at an old offset.
Verify the Flink application’s runtime properties have the correct Kafka topic name and bootstrap servers.
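
The runtime properties can be read back from the service to compare the configured topic name and bootstrap servers against the MSK cluster. This is a sketch only: the function name is hypothetical, and the property group IDs and key names in the output depend on how each processor was deployed, which this section does not specify.

```shell
# Hypothetical helper: print the runtime property groups of a Flink application as JSON.
show_flink_runtime_properties() {
  local app="$1"
  aws kinesisanalyticsv2 describe-application \
    --application-name "$app" \
    --query "ApplicationDetail.ApplicationConfigurationDescription.EnvironmentPropertyDescriptions.PropertyGroupDescriptions" \
    --output json
}
```

For example: `show_flink_runtime_properties cms-dev-flink-trip-processor`.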

Resolution

If no data is reaching MSK: Check the IoT Rules and VPC Destination configuration (see Problem: IoT Rule not routing messages to MSK).
If data is in MSK but Flink is not consuming: The consumer group may be stuck. Delete and recreate the Flink application to reset the consumer offset to latest.
If the SimulatorPreprocessor is not running: Start it first. Downstream processors will not receive data without it, because MQTT Direct telemetry requires decompression before processing.

Problem: Flink checkpoint failures

Checkpoints fail with timeout errors, causing the application to restart repeatedly.

Resolution

Check if DynamoDB writes are being throttled:


# date -v-1H is BSD/macOS syntax; on GNU/Linux use: date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S
aws cloudwatch get-metric-statistics \
  --namespace "AWS/DynamoDB" \
  --metric-name "ThrottledRequests" \
  --dimensions Name=TableName,Value=cms-dev-storage-trips \
  --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 --statistics Sum

If DynamoDB writes are throttled: the tables use on-demand billing, so capacity should scale automatically. Wait a few minutes for it to adjust.
If Redis writes are timing out, check ElastiCache connectivity from the Flink VPC configuration.
Increase the checkpoint interval if the application is processing high volumes:

Update the Flink application runtime properties to set checkpointing.interval=120000 (2 minutes instead of the default 60 seconds).
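
After changing the interval, the application's checkpoint metrics in CloudWatch can confirm whether failures have stopped. The sketch below assumes the standard `AWS/KinesisAnalytics` namespace with the `Application` dimension; the function names are hypothetical, and the `date` fallback is there so the same command works on both macOS and Linux.

```shell
# Portable "one hour ago" timestamp: try BSD/macOS date first, then GNU date.
one_hour_ago() {
  date -u -v-1H +%Y-%m-%dT%H:%M:%S 2>/dev/null ||
    date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S
}

# Hypothetical helper: fetch a checkpoint metric for the last hour.
# Second argument defaults to numberOfFailedCheckpoints; lastCheckpointDuration
# is another useful choice.
flink_checkpoint_metric() {
  local app="$1" metric="${2:-numberOfFailedCheckpoints}"
  aws cloudwatch get-metric-statistics \
    --namespace "AWS/KinesisAnalytics" \
    --metric-name "$metric" \
    --dimensions "Name=Application,Value=$app" \
    --start-time "$(one_hour_ago)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
    --period 300 --statistics Maximum
}
```

For example, `flink_checkpoint_metric cms-dev-flink-trip-processor` shows the failed checkpoint count, and `flink_checkpoint_metric cms-dev-flink-trip-processor lastCheckpointDuration` shows how long recent checkpoints took.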

Problem: FleetWise Edge agent not collecting signals

The FWE agent starts and checks in but does not collect or upload telemetry data. This typically occurs when the agent has not received a valid collection scheme or decoder manifest.

Resolution

Verify the CampaignSyncProcessor Flink application is in RUNNING state:


aws kinesisanalyticsv2 describe-application \
  --application-name cms-dev-flink-campaign-sync-processor \
  --query "ApplicationDetail.ApplicationStatus"

Check that the agent’s check-in is reaching the fw-checkin Kafka topic by reviewing the CampaignSyncProcessor CloudWatch logs:
```
aws logs tail /aws/kinesis-analytics/cms-dev-flink-campaign-sync-processor --follow
```

Verify that active campaigns exist in DynamoDB for the vehicle’s VIN:


aws dynamodb query --table-name cms-dev-campaigns \
  --index-name targetArn-index \
  --key-condition-expression "targetArn = :arn" \
  --expression-attribute-values '{":arn": {"S": "vehicle:YOUR_VIN"}}'

Confirm the campaign includes signalsToCollect with valid signal IDs from the decoder manifest.

Check the FWE agent container logs for scheme receipt:


# If using the simulation service
curl http://localhost:5001/api/agent/logs/YOUR_VIN
# Or directly via Docker
docker logs fwe-YOUR_VIN --tail 100

Problem: FleetWise telemetry not appearing in DynamoDB

The FWE agent is collecting signals (visible in agent logs) but telemetry records do not appear in the DynamoDB telemetry table.

Resolution

Verify the IoT Rule fw_dev_iot_msk_rule is routing messages to the fw-telemetry-raw Kafka topic. Check IoT Core rule metrics in CloudWatch for error counts:


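# date -v-1H is BSD/macOS syntax; on GNU/Linux use: date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S
# If no datapoints come back, note that IoT rule action metrics may also carry an ActionType dimension
aws cloudwatch get-metric-statistics \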
  --namespace "AWS/IoT" \
  --metric-name "Failure" \
  --dimensions Name=RuleName,Value=fw_dev_iot_msk_rule \
  --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 --statistics Sum

Verify the FWTelemetryProcessor Flink application is RUNNING and not showing SSL or connection errors:
```
aws logs tail /aws/kinesis-analytics/cms-dev-flink-fw-telemetry-processor --follow
```
If the processor shows stale SSL connection errors, delete and recreate the Flink application with a fresh JAR (see Problem: Flink processor Kafka SSL connection errors).

Verify the decoder manifest exists in DynamoDB with signal mappings:


aws dynamodb scan --table-name cms-dev-decoder-manifest --max-items 5

Check that the FWTelemetryProcessor output topic cms-telemetry-preprocessed is being consumed by downstream processors. If the SimulatorPreprocessor is also writing to this topic, verify there are no consumer group conflicts.

Problem: Campaign not syncing to FWE agent

The CampaignSyncProcessor is running but the agent does not receive collection schemes.

Resolution

Check the CampaignSyncProcessor logs for IoT Core MQTT publish errors:


aws logs tail /aws/kinesis-analytics/cms-dev-flink-campaign-sync-processor \
  --filter-pattern "ERROR" --since 30m

Verify the Flink application’s IAM role has iot:Publish permission on the topic cms/fleetwise/vehicles/+/collection_schemes_and_decoder_manifests.
Verify the agent is subscribed to the correct MQTT topic. The agent subscribes to cms/fleetwise/vehicles/{vin}/collection_schemes_and_decoder_manifests where {vin} must match the VIN in the campaign’s target.
Check campaign status in DynamoDB. Campaigns in SUSPENDED state will cause the processor to push an empty collection scheme, clearing the agent’s active collection.
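
The IAM permission check can be done without reading the policy by hand, using the IAM policy simulator against the application's execution role. This is a sketch under assumptions: both function names are hypothetical, and the region, account ID, and VIN must be supplied by the caller. Note that IAM policies treat the MQTT wildcards `+` and `#` as literal characters, so a policy meant to cover all vehicles would normally use `*` in the VIN position.

```shell
# Build the IoT topic ARN for one vehicle's collection-scheme topic.
fwe_scheme_topic_arn() {
  local region="$1" account="$2" vin="$3"
  echo "arn:aws:iot:${region}:${account}:topic/cms/fleetwise/vehicles/${vin}/collection_schemes_and_decoder_manifests"
}

# Ask IAM whether the Flink application's execution role may publish to that topic.
# Prints the action name and the decision (allowed, implicitDeny, or explicitDeny).
check_flink_iot_publish() {
  local app="$1" topic_arn="$2" role_arn
  role_arn=$(aws kinesisanalyticsv2 describe-application \
    --application-name "$app" \
    --query "ApplicationDetail.ServiceExecutionRole" --output text)
  aws iam simulate-principal-policy \
    --policy-source-arn "$role_arn" \
    --action-names iot:Publish \
    --resource-arns "$topic_arn" \
    --query "EvaluationResults[].[EvalActionName,EvalDecision]" \
    --output text
}
```

For example: `check_flink_iot_publish cms-dev-flink-campaign-sync-processor "$(fwe_scheme_topic_arn us-east-1 123456789012 YOUR_VIN)"`, where the region and account ID are placeholders.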
