Migrate Apache Cassandra workloads to Amazon Keyspaces by using AWS Glue
Nikolai Kolesnikov, Karthiga Priya Chandran, and Samir Patel, Amazon Web Services
Summary
This pattern shows you how to migrate your existing Apache Cassandra workloads to Amazon Keyspaces (for Apache Cassandra) by using CQLReplicator on AWS Glue. You can use CQLReplicator on AWS Glue to reduce the replication lag during the migration to a matter of minutes. You also learn how to use an Amazon Simple Storage Service (Amazon S3) bucket to store the data required for the migration, including Apache Parquet files, configuration files, and scripts.
Prerequisites and limitations
Prerequisites
- Cassandra cluster with a source table 
- Target table in Amazon Keyspaces to replicate the workload 
- S3 bucket to store intermediate Parquet files that contain incremental data changes 
- S3 bucket to store job configuration files and scripts 
Limitations
- CQLReplicator on AWS Glue requires some time to provision Data Processing Units (DPUs) for the Cassandra workloads. The replication lag between the Cassandra cluster and the target keyspace and table in Amazon Keyspaces is likely to last for only a matter of minutes. 
Architecture
Source technology stack
- Apache Cassandra 
- DataStax Server 
- ScyllaDB 
Target technology stack
- Amazon Keyspaces 
Migration architecture
The following diagram shows an example architecture where a Cassandra cluster is hosted on EC2 instances and spread across three Availability Zones. The Cassandra nodes are hosted in private subnets.

The diagram shows the following workflow:
- A custom service role provides access to Amazon Keyspaces and the S3 bucket. 
- An AWS Glue job reads the job configuration and scripts in the S3 bucket. 
- The AWS Glue job connects through port 9042 to read data from the Cassandra cluster. 
- The AWS Glue job connects through port 9142 to write data to Amazon Keyspaces. 
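Before you start the AWS Glue job, it can help to confirm that both ports in the preceding workflow are reachable from the network where the job runs. The following is a minimal sketch that uses only the Python standard library. The function name `port_is_open` and the host values in the usage comments are placeholders for illustration and are not part of CQLReplicator. A successful TCP connection shows only network reachability; it doesn't verify TLS or authentication.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Hypothetical usage; replace the hosts with your own endpoints.
# port_is_open("10.0.1.15", 9042)                          # Cassandra node (placeholder IP)
# port_is_open("cassandra.us-east-1.amazonaws.com", 9142)  # Amazon Keyspaces endpoint
```

Run the check from a host in the same subnets and security group that the AWS Glue connection uses, because reachability from your workstation might differ.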
Tools
AWS services and tools
- AWS Command Line Interface (AWS CLI) is an open-source tool that helps you interact with AWS services through commands in your command-line shell. 
- AWS CloudShell is a browser-based shell that you can use to manage AWS services by using the AWS Command Line Interface (AWS CLI) and a range of preinstalled development tools. 
- AWS Glue is a fully managed ETL service that helps you reliably categorize, clean, enrich, and move data between data stores and data streams. 
- Amazon Keyspaces (for Apache Cassandra) is a managed database service that helps you migrate, run, and scale your Cassandra workloads in the AWS Cloud. 
Code
The code for this pattern is available in the GitHub CQLReplicator repository.
Best practices
- To determine the necessary AWS Glue resources for the migration, estimate the number of rows in the source Cassandra table. For example, plan for 250 K rows per 0.25 DPU (2 vCPUs, 4 GB of memory) with 84 GB of disk. 
- Pre-warm Amazon Keyspaces tables before running CQLReplicator. For example, eight CQLReplicator tiles (AWS Glue jobs) can write up to 22 K WCUs per second, so the target should be pre-warmed up to 25-30 K WCUs per second. 
- To enable communication between AWS Glue components, use a self-referencing inbound rule for all TCP ports in your security group. 
- Use the incremental traffic strategy to distribute the migration workload over time. 
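The row-based sizing guidance in the preceding list can be sketched as a small calculation. This is a rough illustration only: it assumes that the example ratio of 250 K rows per 0.25 DPU scales linearly with the row count, which might not hold for your row sizes or cluster. The names `estimate_dpus`, `ROWS_PER_STEP`, and `DPU_STEP` are hypothetical.

```python
import math

# Example ratio taken from the best practices: 250 K rows per 0.25 DPU.
ROWS_PER_STEP = 250_000
DPU_STEP = 0.25


def estimate_dpus(row_count: int) -> float:
    """Return a rough DPU estimate, rounded up to the next 0.25 DPU step.

    Assumption: the example ratio scales linearly with the row count.
    """
    if row_count < 0:
        raise ValueError("row_count must not be negative")
    steps = math.ceil(row_count / ROWS_PER_STEP)
    return steps * DPU_STEP


print(estimate_dpus(1_000_000))  # 1.0
print(estimate_dpus(300_000))    # 0.5
```

Treat the result as a starting point and adjust it after you observe the actual throughput of a test run.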
Epics
| Task | Description | Skills required | 
|---|---|---|
| Create a target keyspace and table. | 
 | App owner, AWS administrator, DBA, App developer | 
| Configure the Cassandra driver to connect to Cassandra. | Use the following configuration script: 
 Note: The preceding script uses the Spark Cassandra Connector. For more information, see the reference configuration for Cassandra. | DBA | 
| Configure the Cassandra driver to connect to Amazon Keyspaces. | Use the following configuration script: 
 Note: The preceding script uses the Spark Cassandra Connector. For more information, see the reference configuration for Cassandra. | DBA | 
| Create an IAM role for the AWS Glue job. | Create a new AWS service role named  Note: The  | AWS DevOps | 
| Download CQLReplicator in AWS CloudShell. | Download the project to your home folder by running the following command: 
 | |
| Modify the reference configuration files. | Copy  | AWS DevOps | 
| Initiate the migration process. | The following command initializes the CQLReplicator environment. Initialization involves copying .jar artifacts and creating an AWS Glue connector, an S3 bucket, an AWS Glue job, the  
 The script includes the following parameters: 
 | AWS DevOps | 
| Validate the deployment. | After you run the previous command, the AWS account should contain the following: 
 | AWS DevOps | 
| Task | Description | Skills required | 
|---|---|---|
| Start the migration process. | To operate CQLReplicator on AWS Glue, you need to use the  To replicate the workload from the Cassandra cluster to Amazon Keyspaces, run the following command: 
 Your source keyspace and table are  To replicate updates, add  | AWS DevOps | 
| Task | Description | Skills required | 
|---|---|---|
| Validate migrated Cassandra rows during the historical migration phase. | To obtain the number of rows replicated during the backfilling phase, run the following command: 
 | AWS DevOps | 
| Task | Description | Skills required | 
|---|---|---|
| Use the  | To stop the migration process gracefully, run the following command: 
 To stop the migration process immediately, use the AWS Glue console. | AWS DevOps | 
| Task | Description | Skills required | 
|---|---|---|
| Delete the deployed resources. | The following command will delete the AWS Glue job, connector, S3 bucket, and Keyspaces table  
 | AWS DevOps | 
Troubleshooting
| Issue | Solution | 
|---|---|
| AWS Glue jobs failed and returned an Out of Memory (OOM) error. | 
 | 
Related resources
Additional information
Migration considerations
You can use AWS Glue to migrate your Cassandra workload to Amazon Keyspaces while keeping your Cassandra source databases completely functional during the migration process. After the replication is complete, you can cut over your applications to Amazon Keyspaces with minimal replication lag (a matter of minutes) between the Cassandra cluster and Amazon Keyspaces. To maintain data consistency, you can also use a similar pipeline to replicate the data back to the Cassandra cluster from Amazon Keyspaces.
Write unit calculations
As an example, consider that you intend to write 250,000,000 rows with a row size of 1 KiB during one hour. The total number of Amazon Keyspaces write capacity units (WCUs) that you require is based on this calculation:
(number of rows / (60 min × 60 s)) × 1 WCU per row = (250,000,000 / 3,600 s) × 1 WCU = 69,444 WCUs per second required
69,444 WCUs per second is the sustained rate needed to complete the writes in 1 hour, and you should add some cushion for overhead. For example, 69,444 × 1.10 = 76,388 WCUs includes 10 percent overhead.
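The same arithmetic can be sketched as a small helper so that you can try other row counts, time windows, and overhead margins. This is an illustrative sketch, not part of CQLReplicator: the function name `required_wcus` and its parameters are hypothetical, it assumes 1 WCU per write of up to 1 KiB, and it truncates to whole units to match the figures in the text. For a one-hour window, a base rate of 69,444 WCUs per second corresponds to 250,000,000 rows of up to 1 KiB each.

```python
def required_wcus(rows: int, window_seconds: int,
                  row_size_bytes: int = 1024,
                  overhead_percent: int = 0) -> int:
    """Return the provisioned WCUs per second needed to write `rows` rows
    within `window_seconds`, plus an optional overhead margin.

    Assumption: each write consumes 1 WCU per 1 KiB of row size, rounded up.
    Results are truncated to whole units, as in the worked example.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    # Ceiling division: a 1,025-byte row costs 2 WCUs.
    wcus_per_row = max(1, -(-row_size_bytes // 1024))
    base = rows * wcus_per_row // window_seconds
    # Integer arithmetic avoids floating-point rounding surprises.
    return base * (100 + overhead_percent) // 100


print(required_wcus(250_000_000, 60 * 60))                       # 69444
print(required_wcus(250_000_000, 60 * 60, overhead_percent=10))  # 76388
```

If you would rather err on the side of extra capacity, round up instead of truncating.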
Create a keyspace by using CQL
To create a keyspace by using CQL, run the following commands:
    CREATE KEYSPACE target_keyspace
      WITH replication = {'class': 'SingleRegionStrategy'};

    CREATE TABLE target_keyspace.target_table (
      userid uuid,
      level text,
      gameid int,
      description text,
      nickname text,
      zip text,
      email text,
      updatetime text,
      PRIMARY KEY (userid, level, gameid)
    ) WITH default_time_to_live = 0
      AND CUSTOM_PROPERTIES = {
        'capacity_mode': {
          'throughput_mode': 'PROVISIONED',
          'write_capacity_units': 76388,
          'read_capacity_units': 3612
        }
      }
      AND CLUSTERING ORDER BY (level ASC, gameid ASC);