

# Migrating an RDS for MySQL database to an S3 data lake
<a name="mysql-s3datalake"></a>

A data lake is a system architecture that enables you to store data in a centralized repository, allowing for categorization, cataloging, security, and analysis by a diverse range of users and tools. In a data lake, you can analyze structured, semi-structured, and unstructured data, as well as transform these raw data assets as necessary.

Thousands of customers are building data lakes on AWS, using the cloud-scale storage provided by Amazon S3. The transformation capabilities of services such as AWS Glue and Amazon EMR, together with the analytic capabilities of services such as Amazon Athena, Amazon Redshift, and Amazon SageMaker, enable you to use data lakes easily and cost efficiently.

When building a data lake, a common concern is how to hydrate it: populating data from upstream systems, and keeping the lake up to date as the source data grows and changes. Traditionally, customers have relied on SQL-level solutions to extract changed records from source systems, for example by filtering on "last updated" timestamps, or by performing periodic full refreshes. Both approaches have drawbacks: "last updated" filters rely on the timestamps being accurately populated, and full refreshes have performance and timeliness considerations.

A different approach is to use a database replication service like [AWS Database Migration Service (AWS DMS)](https://docs.aws.amazon.com/dms/latest/userguide/CHAP_GettingStarted.html). AWS DMS captures source data changes from the database transaction logs and logically replicates them on the target (Change Data Capture, CDC). It can also perform a "full-load" to populate the data lake with an initial snapshot of your source data. Then, as changes occur on the source, AWS DMS finds and applies those changes to your data lake, ensuring your data is consistent.

In this document, we describe the process of setting up an AWS data lake using source data from an Amazon RDS for MySQL database. We will host the lake on Amazon S3 and use AWS DMS to hydrate the data. After describing some prerequisites, we will walk through the steps to set up AWS DMS, connect to the source database, and discuss considerations to keep in mind when using AWS DMS.

**Topics**
+ [Solution overview](#mysql-s3datalake.solutionoverview)
+ [Use case](#mysql-s3datalake.usecase)
+ [Limitations](#mysql-s3datalake.limitations)
+ [Choosing an instance class and storage size](mysql-s3datalake.choosinginstanceandstorage.md)
+ [Step-By-Step Migration](mysql-s3datalake.stepbystep.md)

## Solution overview
<a name="mysql-s3datalake.solutionoverview"></a>

The following diagram displays a high-level architecture of the solution, where we use AWS DMS to move data from two MySQL databases hosted on Amazon RDS to Amazon S3.

![\[Solution overview\]](http://docs.aws.amazon.com/dms/latest/sbs/images/mysql2s3datalake_overview.png)


This walkthrough assumes that the source data is sharded over two MySQL instances with identical schemas. The only difference from having a single source instance is that you create an additional endpoint and task, so this walkthrough also applies when the source is a single instance. The schema and table structures used in this walkthrough are explained in further detail in the Use case section.

In this walkthrough, you will set up the following resources in AWS DMS:
+  **Replication Instance** — An AWS managed instance that hosts the AWS DMS engine. You control the type and size of the instance based on your workload.
+  **Source Endpoint** — A resource that provides connection details, data store type, and credentials to connect to a source database. For this use case, we will configure the source endpoint to point to the Amazon RDS for MySQL database.
+  **Target Endpoint** — AWS DMS supports several target systems including Amazon RDS, Amazon Aurora, Amazon Redshift, Amazon Kinesis Data Streams, Amazon S3, and more. For this use case, we will configure Amazon S3 as the target endpoint.
+  **Replication Task** — A resource that runs on the replication instance and connects to endpoints to replicate data from the source to the target.

## Use case
<a name="mysql-s3datalake.usecase"></a>

The source MySQL engine version that we use in this walkthrough is 8.0.31. AWS DMS supports Amazon RDS for MySQL 5.6 or higher as a source. There are three tables under the `dms_sample` schema in the two MySQL databases, with a total size of about **220 GiB**. We assume that the data changes by tens of GiB per day and that both instances hold a similar amount of data. The primary key of the `posts` and `post_history` tables is the combination of `id` and `creation_date`, and both tables are partitioned into 180 partitions on the `creation_date` column. The `votes` table is not partitioned, and its `id` column is the primary key.

![\[Schema\]](http://docs.aws.amazon.com/dms/latest/sbs/images/mysql2s3datalake_schema.png)


## Limitations
<a name="mysql-s3datalake.limitations"></a>

As a managed service, AWS DMS lets you start a migration in a few steps. However, there are some limitations and restrictions depending on the type of source and target endpoints.

There are some data types that are not supported with MySQL as a source. Before you start your migration, it’s a good idea to find out whether your database contains any unsupported data types. Premigration assessments can help you find unsupported data in your source database. For information about the data types supported for MySQL sources, see [Data types](https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MySQL.html#CHAP_Source.MySQL.DataTypes).

For other MySQL source or Amazon S3 target endpoint limitations, see the following documentation:
+ [Limitations on using a MySQL database as a source for AWS DMS](https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MySQL.html#CHAP_Source.MySQL.Limitations)
+ [Limitations to using Amazon S3 as a target](https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.S3.html#CHAP_Target.S3.Limitations)

# Choosing an instance class and storage size
<a name="mysql-s3datalake.choosinginstanceandstorage"></a>

Before you start migrating your database, you need to consider your source, target, and replication instance resources such as CPU, memory, disk space, and network bandwidth/latency. How much workload will be placed on the source database, how to determine the sizing of the replication instance, and what instance class should be used for the target database are common questions when starting a migration.

There is no single answer to these questions, because the optimal configuration varies depending on the amount of data in your source database, your workload, your AWS DMS task configuration, and the number of tasks running concurrently. One of the benefits of using AWS is the ability to flexibly and easily resize resources as needed. You can change your replication instance class or target database instance class in place with a few clicks, and the change takes effect in minutes. Test your migrations using a larger instance class first, then check the resource usage reported by CloudWatch metrics and resize if necessary.

In this walkthrough, we will use the following instance classes.

## Source database
<a name="mysql-s3datalake.choosinginstanceandstorage.sourcedatabase"></a>
+  **Instance class**: db.c5.4xlarge
+  **Allocated storage size**: 1024 GiB
+  **Storage type**: io1
+  **IOPS**: 20000

Full load typically requires more resources from the source database than CDC, because full load transfers data from the source using the number of parallel threads that you specify in the task settings. The default parallelism is 8, which means that data from the source tables is transferred to the target through the replication instance in 8 parallel threads. In this walkthrough, we allocate the resources described above to use the maximum possible parallelism of **49** threads for an AWS DMS task.

Note that this setting yields **250-300 MiB/s** of read throughput on the source database. If you want to reduce the workload on the source database, you can lower the parallelism, as described in a later section.

## Replication instance
<a name="mysql-s3datalake.choosinginstanceandstorage.replicationinstance"></a>
+  **Instance class**: dms.c5.9xlarge
+  **Allocated storage size**: 100 GiB

Because we are performing a heterogeneous migration and using the parallel full load option with a maximum of 49 parallel threads, we start with the relatively large compute-optimized instance class **dms.c5.9xlarge** for the replication instance. This instance class has enough performance to migrate the source data to S3 in 49 parallel threads in our use case. You can also use a smaller instance class if you reduce the number of threads. We’ll discuss this in a later section.

When Amazon S3 is the target, storage throughput is the primary factor when determining the full-load performance. This is because when AWS DMS outputs a CSV or a Parquet file to Amazon S3, AWS DMS first writes the file to storage on your replication instance, and then AWS DMS uploads the file to the Amazon S3 bucket.

AWS DMS supports GP2 EBS storage for replication instances. IOPS for GP2 EBS storage depends on the storage size and increases at a rate of 3 IOPS/GiB. This is also the rate at which EBS burst credits are replenished. A single GP2 volume can perform up to 3,000 IOPS as long as it has burst credits, but once the credits run out, performance drops to the baseline of 3 IOPS/GiB. For example, a 100 GiB volume provides a baseline of 300 IOPS.

In this scenario, we allocate 100 GiB of storage, which sustains the temporary maximum throughput for about 20-30 minutes. This is enough for this workload. Find the optimal disk size for your workload by running a test task. The storage size can be changed online, even while the task is running; however, storage performance may be temporarily degraded during the change. Also, you can increase the storage size, but you cannot decrease it without recreating the replication instance.

# Step-By-Step Migration
<a name="mysql-s3datalake.stepbystep"></a>

The following steps provide instructions for migrating Amazon RDS for MySQL databases to an Amazon S3 data lake.

**Topics**
+ [Step 0: Configure the source Amazon RDS for MySQL database](#mysql-s3datalake.stepbystep.0)
+ [Step 1: Create a replication instance](#mysql-s3datalake.stepbystep.1)
+ [Step 2: Create an AWS DMS source endpoint](#mysql-s3datalake.stepbystep.2)
+ [Step 3: Configure a target Amazon S3 bucket](#mysql-s3datalake.stepbystep.3)
+ [Step 4: Create an AWS DMS Task](#mysql-s3datalake.stepbystep.5)
+ [Step 5: Run and monitor your AWS DMS Task](#mysql-s3datalake.stepbystep.6)
+ [Step 6: Monitor your migration](#mysql-s3datalake.stepbystep.7)
+ [Conclusion](#mysql-s3datalake.conclusion)

## Step 0: Configure the source Amazon RDS for MySQL database
<a name="mysql-s3datalake.stepbystep.0"></a>

Before setting up AWS DMS resources, you need to configure your Amazon RDS for MySQL database instances as a source for AWS DMS.

### Amazon RDS Backup configuration
<a name="mysql-s3datalake.stepbystep.0.1"></a>

Your Amazon RDS for MySQL instance must have **Automatic Backups** turned on to use CDC. Otherwise, binary logging will not be enabled at the MySQL level. Enabling automatic backups enables binary logging for the database instance. The backup retention period can be any value from one to 35 days. One day is enough for this walkthrough.

### Binary logging configuration
<a name="mysql-s3datalake.stepbystep.0.2"></a>

To use AWS DMS CDC, the following parameters must be set correctly in the parameter group attached to your database instances.
+  `binlog_format: "ROW"` 
+  `binlog_row_image: "Full"` 
+  `binlog_checksum: "NONE"` 

The default `binlog_format` is "MIXED". AWS DMS requires the "ROW" format with full before and after images of all columns. We also recommend setting `binlog_checksum` to "NONE".
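
If you manage these parameters through a custom DB parameter group, you can also set them with an SDK. The following is a minimal sketch using the AWS SDK for Python (boto3); the parameter group name is a placeholder for the group attached to your source instances.

```
import boto3

rds = boto3.client("rds")

# Set the binary logging parameters that AWS DMS CDC requires in the custom
# parameter group attached to the source instances (name is a placeholder).
rds.modify_db_parameter_group(
    DBParameterGroupName="datalake-source-mysql80",
    Parameters=[
        {"ParameterName": "binlog_format", "ParameterValue": "ROW", "ApplyMethod": "immediate"},
        {"ParameterName": "binlog_row_image", "ParameterValue": "Full", "ApplyMethod": "immediate"},
        {"ParameterName": "binlog_checksum", "ParameterValue": "NONE", "ApplyMethod": "immediate"},
    ],
)
```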

### Binary logging retention hours
<a name="mysql-s3datalake.stepbystep.0.3"></a>

 AWS DMS requires binary logs to be local to the Amazon RDS for MySQL database instance. To ensure that binary logs are available to AWS DMS, increase the length of time that the logs remain available on the database instance host. For example, to increase log retention to 24 hours, run the following command. 24 hours is enough for this walkthrough.

```
call mysql.rds_set_configuration('binlog retention hours', 24);
```

### VPC, Subnet and Network ACL configuration
<a name="mysql-s3datalake.stepbystep.0.4"></a>

In this walkthrough, the database instance and the replication instance are placed in the same VPC and the same subnet, so all you need to do is configure security groups, network ACLs, and route tables so that your Amazon RDS for MySQL database instance and AWS DMS replication instance can communicate within the same subnet. If you have source databases in a different subnet, VPC, or different location outside AWS, you need to configure your network to allow communication between your Amazon RDS for MySQL database instance and your AWS DMS replication instance.

### Inbound connection rule
<a name="mysql-s3datalake.stepbystep.0.5"></a>

To ensure that AWS DMS can access your database server, you need to make changes to the relevant security groups and network access control lists. AWS DMS only requires access to the MySQL database listener port (the default is 3306). The connection is always initiated from the AWS DMS replication instance to MySQL. Therefore, add a rule allowing connections from the replication instance to the ingress rules of the security group attached to the database instance. We recommend that you add all of the replication subnet group ranges to the ingress rule, because replication instances are a managed service and the IP address of a replication instance may change automatically.
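
As a sketch, the following boto3 call adds such an ingress rule; the security group ID and the CIDR ranges of the replication subnet group are placeholders.

```
import boto3

ec2 = boto3.client("ec2")

# Allow MySQL (port 3306) connections from the subnets used by the DMS
# replication subnet group to reach the source database instance.
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # security group attached to the RDS instance (placeholder)
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 3306,
            "ToPort": 3306,
            "IpRanges": [
                {"CidrIp": "10.0.1.0/24", "Description": "DMS subnet 1 (example)"},
                {"CidrIp": "10.0.2.0/24", "Description": "DMS subnet 2 (example)"},
            ],
        }
    ],
)
```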

You have now completed all necessary setup for your Amazon RDS for MySQL database instance. Next, create a replication instance.

## Step 1: Create a replication instance
<a name="mysql-s3datalake.stepbystep.1"></a>

To create an AWS DMS replication instance, do the following:

1. Sign in to the AWS Management Console, and open the [AWS DMS console](https://console.aws.amazon.com/dms/v2).

1. If you are signed in as an AWS Identity and Access Management (IAM) user, you must have the appropriate permissions to access AWS DMS. For more information about the permissions required, see IAM permissions.

1. On the Welcome page, choose **Create replication instance** to start a database migration.

1. On the **Create replication instance** page, specify your replication instance information.


| For this parameter | Do this | 
| --- |--- |
|   **Name**   |  Enter `s3-datalake-migration-ri`. If you are using multiple replication servers or sharing an account, choose a name that helps you quickly differentiate between the different servers.  | 
|   **Description**   |  Enter `Migrate MySQL to S3 data lake`.  | 
|   **Instance class**   |  Choose `dms.c5.9xlarge`. Each size and type of instance class has increasing CPU, memory, and I/O capacity.  | 
|   **Engine version**   |  Leave the default value, which is the latest stable version of the AWS DMS replication engine.  | 
|   **Allocated storage (GiB)**   |  Choose `100 GiB`.  | 
|   **VPC**   |  Choose the virtual private cloud (VPC) in which your replication instance will launch. Select the same VPC in which your source is placed.  | 
|   **Multi AZ**   |  In this scenario, choose **No**. If you choose **Yes**, AWS DMS creates a second replication server in a different Availability Zone for failover if there is a problem with the primary replication server.  | 
|   **Publicly accessible**   |  Choose **Yes**. If either your source or target database resides outside of the VPC in which your replication server resides, you must make your replication instance publicly accessible.  | 

The replication instance usually becomes available about ten minutes or more after creation starts. You can create the endpoints while the replication instance is still in the `creating` status, but you cannot run a connection test until the instance is in the `available` status.
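
For reference, the following boto3 sketch creates an equivalent replication instance programmatically; the subnet group and security group identifiers are assumptions for illustration.

```
import boto3

dms = boto3.client("dms")

# Create the replication instance with the values from the preceding table.
response = dms.create_replication_instance(
    ReplicationInstanceIdentifier="s3-datalake-migration-ri",
    ReplicationInstanceClass="dms.c5.9xlarge",
    AllocatedStorage=100,
    MultiAZ=False,
    PubliclyAccessible=True,
    ReplicationSubnetGroupIdentifier="datalake-migration-subnet-group",  # placeholder
    VpcSecurityGroupIds=["sg-0123456789abcdef0"],                        # placeholder
)
print(response["ReplicationInstance"]["ReplicationInstanceArn"])
```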

## Step 2: Create an AWS DMS source endpoint
<a name="mysql-s3datalake.stepbystep.2"></a>

To create a source endpoint, do the following:

1. Open the AWS DMS console at [https://console.aws.amazon.com/dms/v2/](https://console.aws.amazon.com/dms/v2/).

1. Choose **Endpoints**.

1. Choose **Create endpoint**.

1. On the **Create endpoint** page, enter the following information.

 **Source endpoint 1:** 


| For this parameter | Do this | 
| --- |--- |
|   **Endpoint type**   |  Choose **Source endpoint**, **Select RDS DB instance**, and choose the `datalake-source-db1` RDS instance.  | 
|   **Endpoint identifier**   |  Enter **mysql-dms-s3-source-1**.  | 
|   **Source engine**   |  Choose **MySQL**.  | 
|   **Access to endpoint database**   |  Choose **Provide access information manually**.  | 
|   **Server name**   |  Enter the Amazon RDS database server name.  | 
|   **Port**   |  Enter **3306**.  | 
|   **Secure Socket Layer (SSL) mode**   |  Choose **none**.  | 
|   **User name**   |  Enter **dms_user**.  | 
|   **Password**   |  Enter the password that you created for the `dms_user` user.  | 
 **Source endpoint 2:** 


| For this parameter | Do this | 
| --- |--- |
|   **Endpoint type**   |  Choose **Source endpoint**, **Select RDS DB instance**, and choose the `datalake-source-db2` RDS instance.  | 
|   **Endpoint identifier**   |  Enter **mysql-dms-s3-source-2**.  | 
|   **Source engine**   |  Choose **MySQL**.  | 
|   **Access to endpoint database**   |  Choose **Provide access information manually**.  | 
|   **Server name**   |  Enter the Amazon RDS database server name.  | 
|   **Port**   |  Enter **3306**.  | 
|   **Secure Socket Layer (SSL) mode**   |  Choose **none**.  | 
|   **User name**   |  Enter **dms_user**.  | 
|   **Password**   |  Enter the password that you created for the `dms_user` user.  | 
You can test the connection before you finish creating the endpoint. **Test connection** attempts to connect from the replication instance to the source database and verifies that the replication instance can connect to MySQL with the settings provided. If the connection test succeeds, go to the next step; otherwise, check that the values you set for the endpoint are correct. If they are correct, check that the network between the source and the replication instance is configured correctly.
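
If you prefer to script endpoint creation, the following boto3 sketch creates source endpoint 1 and tests connectivity (repeat with the corresponding values for source endpoint 2); the server name, password, and replication instance ARN are placeholders.

```
import boto3

dms = boto3.client("dms")

# Source endpoint for the first sharded MySQL instance.
source1 = dms.create_endpoint(
    EndpointIdentifier="mysql-dms-s3-source-1",
    EndpointType="source",
    EngineName="mysql",
    ServerName="datalake-source-db1.<xxxxxxxx>.<region>.rds.amazonaws.com",  # placeholder
    Port=3306,
    Username="dms_user",
    Password="<PASSWORD>",
    SslMode="none",
)
source1_arn = source1["Endpoint"]["EndpointArn"]

# Optionally verify connectivity from the replication instance.
dms.test_connection(
    ReplicationInstanceArn="<REPLICATION_INSTANCE_ARN>",
    EndpointArn=source1_arn,
)
```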

## Step 3: Configure a target Amazon S3 bucket
<a name="mysql-s3datalake.stepbystep.3"></a>

To create the Amazon S3 bucket, do the following:

1. Open the Amazon S3 console at [https://s3.console.aws.amazon.com/s3/home](https://s3.console.aws.amazon.com/s3/home).

1. Choose **Create bucket**.

1. For **Bucket name**, enter *<your-bucket-name>*. Note: The bucket name needs to be globally unique.

1. For **AWS Region**, choose the Region that hosts your AWS DMS replication instance.

1. Leave the default values in the other fields and choose **Create bucket**.

To use Amazon S3 as an AWS Database Migration Service (AWS DMS) target endpoint, create an IAM role with write and delete access to the S3 bucket, and then add AWS DMS (dms.amazonaws.com) as a trusted entity in this IAM role. The following are the minimum required trust policy and permissions policy. For more information, see [Prerequisites for using Amazon S3 as a target](https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.S3.html#CHAP_Target.S3.Prerequisites).

Assume role policy:

```
{
    "Version": "2012-10-17",		 	 	 
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "dms.amazonaws.com",
                ]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
```

Policy:

```
{
    "Version": "2012-10-17",		 	 	 
    "Statement": [
        {
            "Action": [
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:PutObjectTagging"
            ],
            "Resource": [
                "arn:aws:s3:::mysql2s3walkthough/*"
            ],
            "Effect": "Allow"
        },
        {
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::mysql2s3walkthough",
            "Effect": "Allow"
        }
    ]
}
```
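
As a sketch, you could create this role and attach the inline policy with boto3 as follows; the role name, policy name, and bucket name follow the examples above but are otherwise your choice.

```
import json
import boto3

iam = boto3.client("iam")

role_name = "mysql2s3-walkthrough-dms-s3-target-access-role"
bucket = "mysql2s3walkthough"  # replace with your bucket name

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "dms.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

access_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:DeleteObject", "s3:PutObjectTagging"],
            "Resource": f"arn:aws:s3:::{bucket}/*",
        },
        {"Effect": "Allow", "Action": "s3:ListBucket", "Resource": f"arn:aws:s3:::{bucket}"},
    ],
}

# Create the role that AWS DMS assumes to write to the target bucket,
# then attach the bucket permissions as an inline policy.
role = iam.create_role(RoleName=role_name, AssumeRolePolicyDocument=json.dumps(trust_policy))
iam.put_role_policy(
    RoleName=role_name,
    PolicyName="dms-s3-target-access",
    PolicyDocument=json.dumps(access_policy),
)
print(role["Role"]["Arn"])
```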

To create a target endpoint, do the following:

1. Open the AWS DMS console at [https://console.aws.amazon.com/dms/v2/](https://console.aws.amazon.com/dms/v2/).

1. Choose **Endpoints**, and then choose **Create endpoint**.

1. On the **Create endpoint** page, enter the following information.


| For this parameter | Do this | 
| --- |--- |
|   **Endpoint type**   |  Choose **Target endpoint**, and turn off **Select RDS DB instance**.  | 
|   **Endpoint identifier**   |  Enter **mysql-dms-s3-target**.  | 
|   **Target engine**   |  Choose **Amazon S3**.  | 
|   **Service access role ARN**   |  Enter the ARN of the IAM role that can access your Amazon S3 data lake.  | 
|   **Bucket name**   |  Enter **<your-bucket-name>**.  | 

Expand the **Endpoint settings** section, choose **Wizard**, and then choose **Add new setting** to add the settings described in the following sections.

When using AWS DMS to migrate data to an Amazon Simple Storage Service (Amazon S3) data lake, you can change the default task behavior, such as file formats, partitioning, and file sizing. This minimizes post-migration processing and helps downstream applications consume data efficiently. You can customize task behavior using endpoint settings and extra connection attributes (ECAs). Most of the Amazon S3 endpoint settings and ECA settings overlap, except for a few parameters. In this walkthrough, we will configure Amazon S3 endpoint settings.

### Choose file format (`dataFormat`)
<a name="mysql-s3datalake.stepbystep.3.1"></a>

 AWS DMS supports CSV and Parquet formats for outputting data to an S3 target. Each file format has its own benefits. Choose the right file format depending on your consumption pattern. Apache Parquet is an open-source file format that stores data in a columnar layout, built to support efficient compression and encoding schemes that provide storage savings and performance benefits. CSV files are helpful when you plan to keep data in a human-readable format, or to share or transfer Amazon S3 files to other downstream systems for further processing. In this scenario, we will use the CSV format.

### Date based partitioning (`DatePartitionEnabled`)
<a name="mysql-s3datalake.stepbystep.3.2"></a>

In addition to using optimized file formats like Parquet, another common approach for further optimization is to partition the data. AWS DMS supports date-based folder partitioning based on transaction commit dates. The data is stored in different folders based on a timestamp, which has the following benefits:
+ Better management of your S3 objects.
+ Limiting the size of each S3 folder.
+ Optimizing data lake queries or other subsequent operations.

For example, with date-based partitioning enabled, full load files are written directly under each table folder, and CDC files are written under date-based subfolders:

```
dms_sample/post_history/LOAD00000001.csv
dms_sample/post_history/LOAD00000002.csv
...
dms_sample/posts/LOAD00000001.csv
dms_sample/posts/LOAD00000002.csv
dms_sample/posts/LOAD00000003.csv
...
...
dms_sample/posts/2022/5/21/20220521-145815742.csv
dms_sample/posts/2022/5/21/20220521-145918391.csv
```

### Determine file size
<a name="mysql-s3datalake.stepbystep.3.3"></a>

By default, an AWS DMS task writes captured data to an Amazon S3 bucket either if the file size reaches 32 MB or if the previous file write was more than 60 seconds ago. These settings ensure that the data capture latency is low. However, this approach creates a large number of small files in the target Amazon S3 bucket. This value can be changed with `CdcMaxBatchInterval` in the S3 target endpoint settings.

However, we need to optimize this behavior for cost and performance. When you use distributed processing frameworks such as Amazon Athena, AWS Glue, or Amazon EMR, it is recommended to avoid having many small files (less than 64 MB). Small files tend to cause operational overhead in various distributed processing frameworks. Because we plan to use Amazon Athena to query data from our Amazon S3 bucket, we need to make sure our target file size is at least 64 MB.

In this scenario, we’ll use the following endpoint settings: `MaxFileSize=64000`, `CdcMaxBatchInterval=3600` and `CdcMinFileSize=64000`. These settings ensure that AWS DMS does not write the file until its size reaches 64 MB or if the last file write was more than an hour ago.

### Serialize ongoing replication events
<a name="mysql-s3datalake.stepbystep.3.4"></a>

A common challenge when using Amazon S3 as a target is identifying the ongoing replication event sequence when multiple records are updated at the same time on the source database. AWS DMS provides two options to help serialize such events for Amazon S3. You can use the `TimeStampColumnName` endpoint setting or use transformation rules to include an LSN column. Here, we will discuss the first option. For more information about the second option, see [Step 6: Create an AWS DMS Task](https://docs.aws.amazon.com/dms/latest/sbs/chap-rdssqlserver2s3datalake.steps.createtask.html).

### Use the TimeStampColumnName endpoint setting
<a name="mysql-s3datalake.stepbystep.3.5"></a>

The `TimeStampColumnName` setting adds an additional `STRING` column to the target file created by AWS DMS. During ongoing replication, the column value represents the commit timestamp of the event in the source database. For the full load phase, the column values represent the timestamp of the data transfer to Amazon S3. The default format is `yyyy-MM-dd HH:mm:ss.SSSSSS`. This format provides microsecond precision, but the actual precision also depends on the source database transaction log timestamp precision.

### Include full load operation field
<a name="mysql-s3datalake.stepbystep.3.6"></a>

All files created during ongoing replication have the first column marked with `I`, `U`, or `D`. These symbols represent the DML operation on the source and stand for **Insert**, **Update**, or **Delete**. For full load files, you can add this column by configuring the following endpoint setting.

```
includeOpForFullLoad=true
```

This ensures that all full load files are marked with an `I` operation.

When you use this approach, new subscribers can consume the entire data set or prepare a fresh copy in case of any downstream processing issues.

 AWS DMS outputs an extra column (`Op`) where each record has one of the DML flags (I: Insert, U: Update, or D: Delete) in addition to the existing columns in the source tables, indicating which operation generated the change at that time.

In the following example, a source table has a structure similar to the following:


|  id  |  name  |  age  |  year  | 
| --- |--- |--- |--- |
|  1  |  Scott  |  36  |  1986  | 
|  2  |  Mike  |  27  |  1995  | 
|  3  |  Bob  |  42  |  1980  | 

For this example, we insert a record into this table such as the following:

```
INSERT INTO dms_example.users (id, name, age, year) VALUES (4, 'Kate', 23, 1999);
```

The generated record will look similar to the following:

```
I, 4, Kate, 23, 1999
```

To handle this changed data, you need to take the operation flag into consideration when querying the file output in the S3 bucket. Alternatively, you can process those files using AWS Glue and store the output in another S3 bucket, which can then be queried using Amazon Athena.

There are several possible methods, depending on the software stack that you use. The last section in this document references related documentation.
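
As a minimal local sketch (not a replacement for AWS Glue or Amazon Athena), the following Python snippet shows how the `Op` flag can be applied to derive the current state of a table from full load and CDC CSV files; the file names and column order are assumptions based on the example above.

```
import csv

# DMS CSV output for one table, listed oldest first: the full load file(s),
# then the CDC files. File names here are placeholders.
files = ["LOAD00000001.csv", "20220521-145815742.csv"]

current = {}  # primary key -> latest column values

for path in files:
    with open(path, newline="") as f:
        for op, row_id, name, age, year in csv.reader(f):
            if op in ("I", "U"):   # insert or update: keep the newest row image
                current[row_id] = (name, age, year)
            elif op == "D":        # delete: drop the row
                current.pop(row_id, None)

print(current)
```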

In this scenario, we’ll use the following settings:

 **Endpoint 1:** 

```
{
  "ServiceAccessRoleArn": "arn:aws:iam::<ACCOUNT_ID>:role/mysql2s3-walkthrough-dms-s3-target-access-role",
  "CsvRowDelimiter": "\\n",
  "CsvDelimiter": ",",
  "BucketName": "<S3_BUCKET_NAME>",
  "BucketFolder": "endpoint1",
  "CompressionType": "NONE",
  "DataFormat": "CSV",
  "EnableStatistics": true,
  "DatePartitionEnabled": true,
  "MaxFileSize": 64000,
  "CdcMaxBatchInterval": 3600,
  "CdcMinFileSize": 64000,
  "IncludeOpForFullLoad": true
}
```

 **Endpoint 2:** 

```
{
  "ServiceAccessRoleArn": "arn:aws:iam::<ACCOUNT_ID>:role/mysql2s3-walkthrough-dms-s3-target-access-role",
  "CsvRowDelimiter": "\\n",
  "CsvDelimiter": ",",
  "BucketName": "<S3_BUCKET_NAME>",
  "BucketFolder": "endpoint2",
  "CompressionType": "NONE",
  "DataFormat": "CSV",
  "EnableStatistics": true,
  "DatePartitionEnabled": true,
  "MaxFileSize": 64000,
  "CdcMaxBatchInterval": 3600,
  "CdcMinFileSize": 64000,
  "IncludeOpForFullLoad": true
}
```

By using this configuration, data on two sharded database instances will be migrated to different bucket folders in the same bucket.
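
Scripted with boto3, creating the first target endpoint might look like the following sketch; create the second endpoint the same way with a different identifier and `BucketFolder` set to `endpoint2`. The endpoint identifier, account ID, and bucket name are placeholders, and the enum values use the SDK's lowercase form.

```
import boto3

dms = boto3.client("dms")

# S3 target endpoint for the first shard; the settings mirror the JSON above.
target1 = dms.create_endpoint(
    EndpointIdentifier="mysql-dms-s3-target-1",
    EndpointType="target",
    EngineName="s3",
    S3Settings={
        "ServiceAccessRoleArn": "arn:aws:iam::<ACCOUNT_ID>:role/mysql2s3-walkthrough-dms-s3-target-access-role",
        "BucketName": "<S3_BUCKET_NAME>",
        "BucketFolder": "endpoint1",
        "CsvRowDelimiter": "\n",
        "CsvDelimiter": ",",
        "CompressionType": "none",
        "DataFormat": "csv",
        "EnableStatistics": True,
        "DatePartitionEnabled": True,
        "MaxFileSize": 64000,
        "CdcMaxBatchInterval": 3600,
        "CdcMinFileSize": 64000,
        "IncludeOpForFullLoad": True,
    },
)
print(target1["Endpoint"]["EndpointArn"])
```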

## Step 4: Create an AWS DMS Task
<a name="mysql-s3datalake.stepbystep.5"></a>

After you configure the replication instance and endpoints, the next step is creating the AWS DMS task. In this scenario, we will create a task that performs both full-load and CDC. To create a database migration task, do the following:

1. Open the AWS DMS console at [https://console.aws.amazon.com/dms/v2/](https://console.aws.amazon.com/dms/v2/).

1. Select **Database migration tasks**, and then choose **Create task**.

1. On the **Create database migration task** page, enter the following information.

 **Replication task 1:** 


| For this parameter | Do this | 
| --- |--- |
|   **Task identifier**   |  Enter **mysql-dms-s3-task-1**.  | 
|   **Replication instance**   |  Choose **s3-datalake-migration-ri** (the value that you configured in Step 1).  | 
|   **Source database endpoint**   |  Choose **mysql-dms-s3-source-1** (the value that you configured in Step 2).  | 
|   **Target database endpoint**   |  Choose **mysql-dms-s3-target** (the value that you configured in Step 3).  | 
|   **Migration type**   |  Choose **Migrate existing data and replicate ongoing changes**.  | 
|   **Editing mode**   |  Choose **Wizard**.  | 
|   **Custom CDC stop mode for source transactions**   |  Choose **Disable custom CDC stop mode**.  | 
|   **Target table preparation mode**   |  Choose **Drop and create**.  | 
|   **Stop task after full load completes**   |  Choose **Don’t stop**.  | 
|   **Include LOB columns in replication**   |  Choose **Limited LOB mode**.  | 
|   **Maximum LOB size (KB)**   |  Enter **1024**.  | 
|   **Enable validation**   |  Turn off, because Amazon S3 does not support validation.  | 
|   **Enable CloudWatch logs**   |  Turn on.  | 

 **Replication task 2:** 


| For this parameter | Do this | 
| --- |--- |
|   **Task identifier**   |  Enter **mysql-dms-s3-task-2**.  | 
|   **Replication instance**   |  Choose **s3-datalake-migration-ri** (the value that you configured in Step 1).  | 
|   **Source database endpoint**   |  Choose **mysql-dms-s3-source-2** (the value that you configured in Step 2).  | 
|   **Target database endpoint**   |  Choose **mysql-dms-s3-target** (the value that you configured in Step 3).  | 
|   **Migration type**   |  Choose **Migrate existing data and replicate ongoing changes**.  | 
|   **Editing mode**   |  Choose **Wizard**.  | 
|   **Custom CDC stop mode for source transactions**   |  Choose **Disable custom CDC stop mode**.  | 
|   **Target table preparation mode**   |  Choose **Drop and create**.  | 
|   **Stop task after full load completes**   |  Choose **Don’t stop**.  | 
|   **Include LOB columns in replication**   |  Choose **Limited LOB mode**.  | 
|   **Maximum LOB size (KB)**   |  Enter **1024**.  | 
|   **Enable validation**   |  Turn off, because Amazon S3 does not support validation.  | 
|   **Enable CloudWatch logs**   |  Turn on.  | 

Table mappings:

```
{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": 1,
      "rule-name": "1",
      "object-locator": {
        "schema-name": "dms_sample",
        "table-name": "%"
      },
      "rule-action": "include"
    },
    {
      "rule-type": "table-settings",
      "rule-id": 2,
      "rule-name": "2",
      "object-locator": {
        "schema-name": "dms_sample",
        "table-name": "post_history"
      },
      "parallel-load": {
        "type": "partitions-auto"
      }
    },
    {
      "rule-type": "table-settings",
      "rule-id": 3,
      "rule-name": "3",
      "object-locator": {
        "schema-name": "dms_sample",
        "table-name": "posts"
      },
      "parallel-load": {
        "type": "partitions-auto"
      }
    },
    {
      "rule-type": "table-settings",
      "rule-id": 4,
      "rule-name": "3",
      "object-locator": {
        "schema-name": "dms_sample",
        "table-name": "votes"
      },
      "parallel-load": {
        "type": "partitions-auto"
      }
    }
  ]
}
```

Task settings:

```
{
  "TargetMetadata": {
    "SupportLobs": true,
    "LimitedSizeLobMode": true,
    "LobMaxSize": 1024,
  },
  "FullLoadSettings": {
    "TargetTablePrepMode": "TRUNCATE_BEFORE_LOAD",
    "MaxFullLoadSubTasks": 49,
    "CommitRate": 50000
  },
  "Logging": {
    "EnableLogging": true
  }
}
```
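
Equivalently, the following boto3 sketch creates the first task; the second task differs only in the task identifier and source endpoint ARN. The ARNs are placeholders, and the table mappings and task settings are the JSON documents shown above, loaded here from local files for illustration.

```
import boto3

dms = boto3.client("dms")

# The table mappings and task settings JSON documents shown above,
# saved locally as files for this example.
with open("table-mappings.json") as f:
    table_mappings = f.read()
with open("task-settings.json") as f:
    task_settings = f.read()

task1 = dms.create_replication_task(
    ReplicationTaskIdentifier="mysql-dms-s3-task-1",
    SourceEndpointArn="<SOURCE_ENDPOINT_1_ARN>",
    TargetEndpointArn="<TARGET_ENDPOINT_ARN>",
    ReplicationInstanceArn="<REPLICATION_INSTANCE_ARN>",
    MigrationType="full-load-and-cdc",
    TableMappings=table_mappings,
    ReplicationTaskSettings=task_settings,
)
print(task1["ReplicationTask"]["ReplicationTaskArn"])
```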

## Step 5: Run and monitor your AWS DMS Task
<a name="mysql-s3datalake.stepbystep.6"></a>

After you create your AWS Database Migration Service (AWS DMS) tasks, start your replication tasks. To start an AWS DMS task, do the following:

1. Open the AWS DMS console at [https://console.aws.amazon.com/dms/v2/](https://console.aws.amazon.com/dms/v2/).

1. Choose **Database migration tasks**.

1. On the **Database migration tasks** page, select your replication task.

1. Choose **Actions**, and then choose **Restart/Resume**.
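
Through the SDK, starting a task is a single call per task; the task ARN is a placeholder.

```
import boto3

dms = boto3.client("dms")

# Start the full-load-and-CDC task. Use "resume-processing" to resume a task
# that was started previously instead of reloading it from scratch.
dms.start_replication_task(
    ReplicationTaskArn="<REPLICATION_TASK_ARN>",
    StartReplicationTaskType="start-replication",
)
```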

## Step 6: Monitor your migration
<a name="mysql-s3datalake.stepbystep.7"></a>

### Task status and Table statistics
<a name="mysql-s3datalake.stepbystep.7.1"></a>

After you start the task, the full load operation starts loading tables. Your replication task status will be **Running** until the full load completes. When the AWS DMS task finishes the full load, the task status changes to **Load complete, replication ongoing**.

You can see the table load completion status in the **Table statistics** section and the corresponding target files in the Amazon S3 bucket. You can check the progress of replication on the **Table statistics** tab. AWS DMS first does full-load on each table. Meanwhile, the task status is **Running**, and at least one of the tables' Load states is **“Before Load”** or **“Full load”**. Tables that have been loaded are displayed as **“Table completed”**. When all tables have been fully loaded, the task status becomes **“Load completed, replication ongoing”**. The task continues to capture source changes and apply them to the target.

In this scenario, the full-load phase typically completes in about 20 minutes. If you don’t use **partitions-auto** for table mapping, the same full-load phase takes about an hour. Parallel full load can significantly improve full load performance.
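
You can also poll the task status and table statistics programmatically. The following is a sketch; the task ARN is a placeholder.

```
import boto3

dms = boto3.client("dms")
task_arn = "<REPLICATION_TASK_ARN>"

# Overall task status, for example "running".
tasks = dms.describe_replication_tasks(
    Filters=[{"Name": "replication-task-arn", "Values": [task_arn]}]
)
print(tasks["ReplicationTasks"][0]["Status"])

# Per-table load state and row counts, as shown on the Table statistics tab.
stats = dms.describe_table_statistics(ReplicationTaskArn=task_arn)
for table in stats["TableStatistics"]:
    print(table["SchemaName"], table["TableName"], table["TableState"], table["FullLoadRows"])
```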

### CloudWatch metrics
<a name="mysql-s3datalake.stepbystep.7.2"></a>

The AWS DMS console shows CloudWatch statistics for each task. To see metrics, select the replication task and then select the **CloudWatch metrics** tab.

Task metrics are divided into statistics between the replication host and the source endpoint, and statistics between the replication host and the target endpoint. You can determine the total statistic for a task by adding two related statistics together. For example, you can determine the total latency, or replica lag, for a task by combining the **CDCLatencySource** and **CDCLatencyTarget** values.

 **CDCLatencySource** is the gap, in seconds, between the last event captured from the source endpoint and current system time stamp of the AWS DMS instance. CDCLatencySource represents the latency between source and replication instance. High CDCLatencySource means the process of capturing changes from source is delayed. To identify latency in an ongoing replication, you can view this metric together with CDCLatencyTarget. If both CDCLatencySource and CDCLatencyTarget are high, investigate CDCLatencySource first.

 **CDCLatencyTarget** is the gap, in seconds, between the first event timestamp waiting to commit on the target and the current timestamp of the AWS DMS instance. Target latency is the difference between the replication instance server time and the oldest unconfirmed event ID forwarded to a target component. In other words, target latency is the timestamp difference between the replication instance and the oldest event applied but not yet confirmed by the target endpoint. When CDCLatencyTarget is high, the process of applying change events to the target is delayed.
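
To retrieve these metrics outside of the console, you can query CloudWatch directly. The following sketch assumes the `AWS/DMS` namespace with replication instance and task dimensions; the identifier values are placeholders.

```
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")

# Average CDCLatencySource over the last hour. The task dimension value is the
# task's resource ID (the trailing part of the task ARN); both values below
# are placeholders.
now = datetime.datetime.utcnow()
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/DMS",
    MetricName="CDCLatencySource",
    Dimensions=[
        {"Name": "ReplicationInstanceIdentifier", "Value": "s3-datalake-migration-ri"},
        {"Name": "ReplicationTaskIdentifier", "Value": "<TASK_RESOURCE_ID>"},
    ],
    StartTime=now - datetime.timedelta(hours=1),
    EndTime=now,
    Period=300,
    Statistics=["Average"],
)
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])
```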

These metrics are useful for knowing what state your tasks are in.

## Conclusion
<a name="mysql-s3datalake.conclusion"></a>

In this walkthrough, we covered most of the prerequisites that help you avoid configuration-related errors. You can get started on your own migrations using the following documentation.
+  [Getting started with Database Migration Service](https://docs.aws.amazon.com/dms/latest/userguide/CHAP_GettingStarted.html).
+  [Using a MySQL-compatible database as a source for AWS DMS](https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MySQL.html) 
+  [Using Amazon S3 as a target](https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.S3.html) 

If you observe issues when running your task, see [Troubleshooting migration tasks](https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Troubleshooting.html) and [Best practices](https://docs.aws.amazon.com/dms/latest/userguide/CHAP_BestPractices.html) in the AWS DMS public documentation, or reach out to AWS Support for further assistance.