This whitepaper is for historical reference only. Some content might be outdated and some links might not be available.
Streaming data
Amazon Managed Streaming for Apache Kafka and Amazon Kinesis: Introduction
Amazon Managed Streaming for Apache Kafka
Amazon Data Firehose
Architecture overview
Often, use cases require the source data to be made available
in the data lake for analytics immediately. Getting the data
from the SaaS application in near-real-time lays the
foundation for an event-driven architecture. Start with how
Apache Kafka helps with this architecture pattern: Apache
Kafka has a tool called Kafka Connect, which streams data
between Kafka and other data systems.
Amazon MSK, together with Amazon MSK Connect, eliminates the
operational overhead, including the provisioning,
configuration, and maintenance of highly available Apache
Kafka and Kafka Connect clusters.
Once the data is streamed into Kafka, you can either send it
directly to Amazon S3 or use the Kinesis-Kafka Connector.
Streaming-based data ingestion pattern
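To make the Kafka-to-S3 step more concrete, the following is a minimal sketch of a Kafka Connect connector definition built in Python. It assumes the Confluent S3 sink connector, which this whitepaper does not name; the function name, connector name, topic, bucket, and Region are hypothetical placeholders. The resulting JSON is the kind of payload that a Kafka Connect deployment accepts as a connector definition, and the exact properties you need may differ in your environment.

```python
import json


def build_s3_sink_connector(name, topic, bucket, region, flush_size=1000):
    """Return a Kafka Connect connector definition as a plain dict.

    Assumption: the Confluent S3 sink connector is installed on the
    Kafka Connect workers. All argument values are illustrative.
    """
    if flush_size < 1:
        raise ValueError("flush_size must be a positive number of records")
    return {
        "name": name,
        "config": {
            # Sink connector that writes topic records to S3 objects.
            "connector.class": "io.confluent.connect.s3.S3SinkConnector",
            "tasks.max": "1",
            "topics": topic,
            "s3.bucket.name": bucket,
            "s3.region": region,
            "storage.class": "io.confluent.connect.s3.storage.S3Storage",
            # Write each record as JSON; other formats are possible.
            "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
            # Number of records written per S3 object.
            "flush.size": str(flush_size),
        },
    }


if __name__ == "__main__":
    connector = build_s3_sink_connector(
        name="saas-events-s3-sink",      # hypothetical connector name
        topic="saas-events",             # hypothetical topic
        bucket="example-data-lake",      # hypothetical bucket
        region="us-east-1",
        flush_size=500,
    )
    print(json.dumps(connector, indent=2))
```

A smaller `flush.size` makes data visible in Amazon S3 sooner but produces more, smaller objects, so the value is a trade-off between latency and object count.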
Usage patterns
This pattern is handy when you want to stream data changes from SaaS applications into your Amazon S3 data lake in near-real-time.
Some use cases are as follows:
- Real-time analytics dashboards.
- External, API-driven applications.
Considerations
The architecture pattern above adds all records to Amazon S3
as new data. To handle updates and deletes in the data lake,
you may have to use AWS Glue streaming with the Apache Hudi
connector.
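The difference between append-only storage and upsert handling can be illustrated with a small, self-contained sketch. It is plain Python, not the Apache Hudi or AWS Glue API, and the record fields (`id`, `op`, `ts`, `data`) are hypothetical. It shows why appended change records alone do not give the current state of a row, and what collapsing them by key involves.

```python
def latest_state(change_records):
    """Collapse an append-only list of change records into current rows.

    Each record is a dict with hypothetical fields:
      id   - primary key of the source row
      op   - "insert", "update", or "delete"
      ts   - event timestamp used for ordering
      data - row contents (ignored for deletes)
    """
    current = {}
    # Process in timestamp order; sorted() is stable, so records with
    # equal timestamps keep their arrival order.
    for record in sorted(change_records, key=lambda r: r["ts"]):
        key = record["id"]
        if record["op"] == "delete":
            current.pop(key, None)   # drop the row if it exists
        else:
            current[key] = record["data"]  # insert or overwrite
    return current


if __name__ == "__main__":
    appended = [
        {"id": 1, "op": "insert", "ts": 1, "data": {"status": "new"}},
        {"id": 2, "op": "insert", "ts": 2, "data": {"status": "new"}},
        {"id": 1, "op": "update", "ts": 3, "data": {"status": "paid"}},
        {"id": 2, "op": "delete", "ts": 4, "data": None},
    ]
    # Four objects were appended, but only one row is current.
    print(latest_state(appended))
```

In the append-only layout, a query over the raw records would see four entries, including a stale version of row 1 and a deleted row 2. A table format with upsert support performs this kind of reconciliation for you at write or read time.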