

# Troubleshooting latency issues in AWS Database Migration Service
<a name="CHAP_Troubleshooting_Latency"></a>

This section provides an overview of the common causes of AWS DMS task latency during the ongoing replication phase (CDC). AWS DMS replicates data asynchronously. Latency is the delay between when a change is committed on the source and when the change is replicated to the target. Latency can be caused by misconfiguration of replication components, such as the following: 
+ Source endpoint or data source
+ Target endpoint or data source
+ Replication instances
+ The network between these components

We recommend that you use a test migration as a proof of concept to gather information about your replication. You can then use this information for tuning your replication configuration to minimize latency. For information about running a proof of concept migration, see [Running a proof of concept](CHAP_BestPractices.md#CHAP_BestPractices.RunPOC).

**Topics**
+ [Types of CDC latency](#CHAP_Troubleshooting_Latency_Types)
+ [Common causes of CDC latency](#CHAP_Troubleshooting_Latency_Causes)
+ [Troubleshooting latency issues](CHAP_Troubleshooting_Latency_Troubleshooting.md)

## Types of CDC latency
<a name="CHAP_Troubleshooting_Latency_Types"></a>

This section describes the types of replication latency that can occur during CDC.

### Source latency
<a name="CHAP_Troubleshooting_Latency_Types_Source"></a>

The delay, in seconds, between the commit time of the last event captured from the source endpoint, and the current system timestamp of the replication instance. You can monitor the latency between the data source and your replication instance using the `CDCLatencySource` CloudWatch metric. A high `CDCLatencySource` metric indicates that the process of capturing changes from the source is delayed. For example, if your application commits an insert to the source at 10:00, and AWS DMS consumes the change at 10:02, the `CDCLatencySource` metric is 120 seconds. 

For information about CloudWatch metrics for AWS DMS, see [Replication task metrics](CHAP_Monitoring.md#CHAP_Monitoring.Metrics.Task).

### Target latency
<a name="CHAP_Troubleshooting_Latency_Types_Target"></a>

The delay, in seconds, between the commit time on the source of the first event waiting to commit to the target, and the current timestamp of the DMS replication instance. You can monitor the latency between commits on the data source and your data target using the `CDCLatencyTarget` CloudWatch metric. This means that `CDCLatencyTarget` includes any delays in reading from the source. As a result, `CDCLatencyTarget` is always greater than or equal to `CDCLatencySource`.

For example, if your application commits an insert to the source at 10:00, and AWS DMS consumes it at 10:02 and writes it to the target at 10:05, the `CDCLatencyTarget` metric is 300 seconds.
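You can retrieve both latency metrics programmatically. The following sketch builds the parameters for a CloudWatch `GetMetricStatistics` call (the task and instance identifiers are placeholders; querying requires AWS credentials and boto3):

```python
import datetime

def dms_latency_query(metric, task_id, instance_id, minutes=60):
    """Build GetMetricStatistics parameters for a DMS latency metric,
    such as CDCLatencySource or CDCLatencyTarget."""
    end = datetime.datetime.now(datetime.timezone.utc)
    return {
        "Namespace": "AWS/DMS",
        "MetricName": metric,
        "Dimensions": [
            {"Name": "ReplicationTaskIdentifier", "Value": task_id},
            {"Name": "ReplicationInstanceIdentifier", "Value": instance_id},
        ],
        "StartTime": end - datetime.timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 60,                 # one data point per minute
        "Statistics": ["Maximum"],    # worst-case latency per period
    }

# Usage (commented out; requires AWS credentials):
# import boto3
# cw = boto3.client("cloudwatch")
# stats = cw.get_metric_statistics(**dms_latency_query(
#     "CDCLatencySource", "my-task-id", "my-instance-id"))
```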

## Common causes of CDC latency
<a name="CHAP_Troubleshooting_Latency_Causes"></a>

This section describes common causes of latency that your replication may experience during CDC.

**Topics**
+ [Endpoint resources](#CHAP_Troubleshooting_Latency_Causes_Endpoint)
+ [Replication instance resources](#CHAP_Troubleshooting_Latency_Causes_Replication_Instance)
+ [Network speed and bandwidth](#CHAP_Troubleshooting_Latency_Causes_Replication_Network)
+ [DMS configuration](#CHAP_Troubleshooting_Latency_Causes_Replication_DMS_Config)
+ [Replication scenarios](#CHAP_Troubleshooting_Latency_Causes_Replication_Scenarios)

### Endpoint resources
<a name="CHAP_Troubleshooting_Latency_Causes_Endpoint"></a>

The following factors significantly affect replication performance and latency:
+ Source and target database configurations
+ Instance size
+ Under-provisioned or misconfigured source or target data stores

To identify latency caused by endpoint issues for AWS-hosted sources and targets, monitor the following CloudWatch metrics:
+ `FreeMemory`
+ `CPUUtilization`
+ Throughput and I/O metrics, such as `WriteIOPS`, `WriteThroughput`, or `ReadLatency`
+ Transaction volume metrics, such as `CDCIncomingChanges`

For information about monitoring CloudWatch metrics, see [AWS Database Migration Service metrics](CHAP_Monitoring.md#CHAP_Monitoring.Metrics).

### Replication instance resources
<a name="CHAP_Troubleshooting_Latency_Causes_Replication_Instance"></a>

Replication instance resources are critical for replication, and you should make sure that there are no resource bottlenecks, as they can lead to both source and target latency.

To identify resource bottlenecks for your replication instance, verify the following:
+ Critical CloudWatch metrics, such as CPU utilization, memory, IOPS, and storage, are not experiencing spikes or consistently high values.
+ Your replication instance is sized appropriately for your workload. For information about determining the correct size of a replication instance, see [Selecting the best size for a replication instance](CHAP_BestPractices.SizingReplicationInstance.md).

### Network speed and bandwidth
<a name="CHAP_Troubleshooting_Latency_Causes_Replication_Network"></a>

Network bandwidth is a factor that affects data transmission. To analyze the network performance of your replication, do one of the following:
+ Check the `ReadThroughput` and `WriteThroughput` metrics at the instance level. For information about monitoring CloudWatch metrics, see [AWS Database Migration Service metrics](CHAP_Monitoring.md#CHAP_Monitoring.Metrics).
+ Use the AWS DMS Diagnostic Support AMI. If the Diagnostic Support AMI is not available in your region, you can download it from any supported region and copy it to your region to perform your network analysis. For information about the Diagnostic Support AMI, see [Working with the AWS DMS diagnostic support AMI](CHAP_SupportAmi.md).

CDC in AWS DMS is single-threaded to ensure data consistency. As a result, you can determine the data volume your network can support by calculating your single-threaded data transfer rate. For example, if your task connects to its source over a 100 Mbps (megabits per second) network, your replication has a theoretical maximum bandwidth allocation of 12.5 MBps (megabytes per second), or 45 GB per hour. If the rate of transaction log generation on the source is greater than 45 GB per hour, the task accumulates CDC latency. These rates are theoretical maximums; other factors, such as network traffic and resource overhead on the source and target, reduce the actual available bandwidth.
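The arithmetic above can be sketched as a quick check. This is a rough estimate that ignores protocol overhead and competing traffic:

```python
def max_transfer_gb_per_hour(network_mbps: float) -> float:
    """Theoretical single-threaded ceiling for a network link, in GB/hour.

    Convert megabits/s to megabytes/s (divide by 8), then scale
    seconds to hours and MB to GB.
    """
    mbytes_per_sec = network_mbps / 8      # 100 Mbps -> 12.5 MBps
    return mbytes_per_sec * 3600 / 1000    # MB/hour -> GB/hour

# If the source generates transaction logs faster than this ceiling,
# CDC latency accumulates.
print(max_transfer_gb_per_hour(100))  # 45.0
```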

### DMS configuration
<a name="CHAP_Troubleshooting_Latency_Causes_Replication_DMS_Config"></a>

This section contains recommended replication configurations that can help reduce latency.
+ **Endpoint settings**: Your source and target endpoint settings can cause your replication instance to suffer poor performance. Endpoint settings that turn on resource-intensive features can degrade performance. For example, for an Oracle endpoint, disabling LogMiner and using Binary Reader improves performance, because LogMiner is resource-intensive. The following endpoint setting improves performance for an Oracle endpoint:

  ```
  useLogminerReader=N;useBfile=Y
  ```

  For more information about endpoint settings, see the documentation for your source and target endpoint engine in the [Working with AWS DMS endpoints](CHAP_Endpoints.md) topic.
+ **Task settings**: Some task settings for your particular replication scenario can cause your replication instance to suffer poor performance. For example, AWS DMS uses transactional apply mode by default (`BatchApplyEnabled=false`) for CDC for all endpoints except for Amazon Redshift. However, for sources with a large number of changes, setting `BatchApplyEnabled` to `true` may improve performance.

  For more information about task settings, see [Specifying task settings for AWS Database Migration Service tasks](CHAP_Tasks.CustomizingTasks.TaskSettings.md).
+ **Start position of a CDC-only task**: Starting a CDC-only task from a position or timestamp in the past causes the task to start with high CDC source latency. Depending on the volume of changes on the source, this latency can take time to subside. 
+ **LOB settings**: Large Object data types can hinder replication performance due to the way AWS DMS replicates large binary data. For more information, see the following topics:
  + [Setting LOB support for source databases in an AWS DMS task](CHAP_Tasks.LOBSupport.md)
  + [Migrating large binary objects (LOBs)](CHAP_BestPractices.md#CHAP_BestPractices.LOBS).
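For example, the `BatchApplyEnabled` setting described above lives in the `TargetMetadata` section of the task settings JSON. The following is a minimal fragment; your task settings contain many other fields:

```
{
  "TargetMetadata": {
    "BatchApplyEnabled": true
  }
}
```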

### Replication scenarios
<a name="CHAP_Troubleshooting_Latency_Causes_Replication_Scenarios"></a>

This section describes specific replication scenarios and how they may affect latency.

**Topics**
+ [Stopping a task for an extended period of time](#CHAP_Troubleshooting_Latency_Causes_Replication_Scenarios_Stoptask)
+ [Cached changes](#CHAP_Troubleshooting_Latency_Causes_Replication_Scenarios_Cachedchanges)
+ [Cross-region replication](#CHAP_Troubleshooting_Latency_Causes_Replication_Scenarios_Crossregion)

#### Stopping a task for an extended period of time
<a name="CHAP_Troubleshooting_Latency_Causes_Replication_Scenarios_Stoptask"></a>

When you stop a task, AWS DMS saves the position of the last transaction log that was read from the source. When you resume the task, DMS tries to continue reading from the same transaction log position. Resuming a task after several hours or days causes CDC source latency to increase until DMS finishes consuming the transaction backlog.

#### Cached changes
<a name="CHAP_Troubleshooting_Latency_Causes_Replication_Scenarios_Cachedchanges"></a>

**Cached changes** are changes that your application writes to the data source while AWS DMS runs the full-load replication phase. DMS doesn't apply these changes until the full-load phase completes and the CDC phase starts. For a source with a large number of transactions, cached changes take longer to apply, so source latency increases when the CDC phase starts. We recommend that you run the full-load phase when transaction volumes are low to minimize the number of cached changes.

#### Cross-region replication
<a name="CHAP_Troubleshooting_Latency_Causes_Replication_Scenarios_Crossregion"></a>

Locating your DMS endpoints or your replication instance in different AWS Regions increases network latency, which in turn increases replication latency. For best performance, locate your source endpoint, target endpoint, and replication instance in the same AWS Region.

# Troubleshooting latency issues
<a name="CHAP_Troubleshooting_Latency_Troubleshooting"></a>

This section contains troubleshooting steps for replication latency.

To troubleshoot latency, do the following:
+ First, determine the type and amount of latency for the task. Check the task's Table Statistics section from the DMS console or CLI. If the counters are changing, then data transmission is in progress. Check the `CDCLatencySource` and `CDCLatencyTarget` metrics together to determine if there's a bottleneck during CDC.
+ If high `CDCLatencySource` or `CDCLatencyTarget` metrics indicate a bottleneck in your replication, check the following:
  + If `CDCLatencySource` is high and `CDCLatencyTarget` is equal to `CDCLatencySource`, this indicates that there is a bottleneck in your source endpoint, and AWS DMS is writing data to the target smoothly. See [Troubleshooting source latency issues](CHAP_Troubleshooting_Latency_Source.md) following.
  + If `CDCLatencySource` is low and `CDCLatencyTarget` is high, this indicates that there is a bottleneck in your target endpoint, and AWS DMS is reading data from the source smoothly. See [Troubleshooting target latency issues](CHAP_Troubleshooting_Latency_Target.md) following.
  + If `CDCLatencySource` is high and `CDCLatencyTarget` is significantly higher than `CDCLatencySource`, this indicates bottlenecks on both source reads and target writes. Investigate source latency first, and then investigate target latency.
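The decision rules above can be sketched as a small helper. The 60-second cutoff for "high" latency is an assumption; tune it for your latency tolerance:

```python
def diagnose_cdc_bottleneck(latency_source: float,
                            latency_target: float,
                            high: float = 60.0) -> str:
    """Classify a CDC bottleneck from the two latency metrics (seconds).

    `high` is an assumed cutoff for what counts as high latency.
    """
    if latency_source < high and latency_target < high:
        return "no significant latency"
    if latency_source >= high and latency_target - latency_source < high:
        # Target keeps pace with the source reader: source-side bottleneck.
        return "source bottleneck"
    if latency_source < high:
        # Source reads are fine but the target lags: target-side bottleneck.
        return "target bottleneck"
    # The source reader lags, and the target lags well behind it.
    return "source and target bottlenecks"

print(diagnose_cdc_bottleneck(300, 310))  # source bottleneck
print(diagnose_cdc_bottleneck(5, 400))    # target bottleneck
```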

For information about monitoring DMS task metrics, see [Monitoring AWS DMS tasks](CHAP_Monitoring.md). 

# Troubleshooting source latency issues
<a name="CHAP_Troubleshooting_Latency_Source"></a>

The following topics describe replication scenarios specific to source endpoint types.

**Topics**
+ [Oracle Endpoint Troubleshooting](CHAP_Troubleshooting_Latency_Source_Oracle.md)
+ [MySQL Endpoint Troubleshooting](CHAP_Troubleshooting_Latency_Source_MySQL.md)
+ [PostgreSQL Endpoint Troubleshooting](CHAP_Troubleshooting_Latency_Source_PostgreSQL.md)
+ [SQL Server Endpoint Troubleshooting](CHAP_Troubleshooting_Latency_Source_SQLServer.md)

# Oracle Endpoint Troubleshooting
<a name="CHAP_Troubleshooting_Latency_Source_Oracle"></a>

This section contains replication scenarios specific to Oracle.

## Source reading paused
<a name="CHAP_Troubleshooting_Latency_Source_Oracle_Sourcereadingpaused"></a>

AWS DMS pauses reading from an Oracle source in the following scenarios. This behavior is by design. You can investigate the causes for this using the task log. Look for messages similar to the following in the task log. For information about working with the task log, see [Viewing and managing AWS DMS task logs](CHAP_Monitoring.md#CHAP_Monitoring.ManagingLogs).
+ **SORTER message**: This indicates that DMS is caching transactions on the replication instance. For more information, see [SORTER message in task log](CHAP_Troubleshooting_Latency_Target.md#CHAP_Troubleshooting_Latency_Target_Sorter) following.
+ **Debug task logs**: If DMS interrupts the read process, your task repeatedly writes the following message to the debug task logs, without a change to the context field or timestamp:
  + **Binary reader**: 

    ```
    [SOURCE_CAPTURE  ]T:  Produce CTI event: 
    context '00000020.f23ec6e5.00000002.000a.00.0000:190805.3477731.16' 
    xid [00000000001e0018] timestamp '2021-07-19 06:57:55' 
    thread 2  (oradcdc_oralog.c:817)
    ```
  + **Logminer**: 

    ```
    [SOURCE_CAPTURE  ]T:  Produce INSERT event: 
    object id 1309826 context '000000000F2CECAA010000010005A8F500000275016C0000000000000F2CEC58' 
    xid [000014e06411d996] timestamp '2021-08-12 09:20:32' thread 1  (oracdc_reader.c:2269)
    ```
+ AWS DMS logs the following message for every new redo or archived log operation.

  ```
  00007298: 2021-08-13T22:00:34 [SOURCE_CAPTURE ]I: Start processing archived Redo log sequence 14850 thread 2 name XXXXX/XXXXX/ARCHIVELOG/2021_08_14/thread_2_seq_14850.22977.1080547209 (oradcdc_redo.c:754)
  ```

  If the source has new redo or archived log operations, and AWS DMS is not writing these messages to the log, this means that the task is not processing events.

## High redo generation
<a name="CHAP_Troubleshooting_Latency_Source_Oracle_Highredo"></a>

If your task is processing redo or archived logs, but the source latency remains high, try to identify the redo log generation rate and generation patterns. If you have a high level of redo log generation, this increases source latency, because your task reads all of the redo and archive logs in order to fetch changes related to the replicated tables. 

To determine the redo generation rate, use the following queries.
+ Per-day redo generation rate:

  ```
  select trunc(COMPLETION_TIME,'DD') Day, thread#, 
  round(sum(BLOCKS*BLOCK_SIZE)/1024/1024/1024) GB,
  count(*) Archives_Generated from v$archived_log 
  where completion_time > sysdate- 1
  group by trunc(COMPLETION_TIME,'DD'),thread# order by 1;
  ```
+ Per-hour redo generation rate:

  ```
  Alter session set nls_date_format = 'DD-MON-YYYY HH24:MI:SS';
  select trunc(COMPLETION_TIME,'HH') Hour,thread# , 
  round(sum(BLOCKS*BLOCK_SIZE)/1024/1024) "REDO PER HOUR (MB)",
  count(*) Archives from v$archived_log 
  where completion_time > sysdate- 1
  group by trunc(COMPLETION_TIME,'HH'),thread#  order by 1 ;
  ```

To troubleshoot latency in this scenario, check the following:
+ Check the network bandwidth and single-thread performance of your replication to ensure that your underlying network can support the source redo generation rate. For information about how network bandwidth can affect replication performance, see [Network speed and bandwidth](CHAP_Troubleshooting_Latency.md#CHAP_Troubleshooting_Latency_Causes_Replication_Network) prior.
+ Check if you set up supplemental logging correctly. Avoid extra logging on the source, such as enabling logging on all columns of a table. For information about setting up supplemental logging, see [Setting up supplemental logging](CHAP_Source.Oracle.md#CHAP_Source.Oracle.Self-Managed.Configuration.SupplementalLogging). 
+ Verify that you are using the correct API to read the redo or archived logs. You can use either Oracle LogMiner or AWS DMS Binary Reader. While LogMiner reads the online redo logs and archived redo log files, Binary Reader reads and parses the raw redo log files directly. As a result, Binary Reader is more performant. We recommend that you use Binary Reader if your redo log generation is more than 10 GB per hour. For more information, see [Using Oracle LogMiner or AWS DMS Binary Reader for CDC](CHAP_Source.Oracle.md#CHAP_Source.Oracle.CDC).
+ Check if you set `ArchivedLogsOnly` to `Y`. If this endpoint setting is set, AWS DMS reads from the archived redo logs. This increases source latency, because AWS DMS waits for the online redo log to be archived before reading. For more information, see [ArchivedLogsOnly](https://docs.aws.amazon.com/dms/latest/APIReference/API_OracleSettings.html#DMS-Type-OracleSettings-ArchivedLogsOnly).
+ If your Oracle source uses Automatic Storage Management (ASM), see [Storing REDO on Oracle ASM when using Oracle as a source for AWS DMS](CHAP_Source.Oracle.md#CHAP_Source.Oracle.REDOonASM) for information about how to properly configure your data store. You may also be able to optimize reading performance further by using the `asmUsePLSQLArray` extra connection attribute (ECA). For information about using `asmUsePLSQLArray`, see [Endpoint settings when using Oracle as a source for AWS DMS](CHAP_Source.Oracle.md#CHAP_Source.Oracle.ConnectionAttrib).
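For example, combining Binary Reader with the `asmUsePLSQLArray` attribute might look like the following endpoint setting string. This is a sketch; validate each attribute against your source configuration:

```
useLogminerReader=N;useBfile=Y;asmUsePLSQLArray=TRUE
```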

# MySQL Endpoint Troubleshooting
<a name="CHAP_Troubleshooting_Latency_Source_MySQL"></a>

This section contains replication scenarios specific to MySQL. AWS DMS scans the MySQL binary log periodically to replicate changes. This process can increase latency in the following scenarios:

**Topics**
+ [Long-running transaction on source](#CHAP_Troubleshooting_Latency_Source_MySQL_Longrunning)
+ [High workload on source](#CHAP_Troubleshooting_Latency_Source_MySQL_Highworkload)
+ [Binary log contention](#CHAP_Troubleshooting_Latency_Source_MySQL_Binarylog)

## Long-running transaction on source
<a name="CHAP_Troubleshooting_Latency_Source_MySQL_Longrunning"></a>

Since MySQL only writes committed transactions to the binary log, long-running transactions cause latency spikes proportional to the query run time.

To identify long-running transactions, use the following query, or use the slow query log:

```
SHOW FULL PROCESSLIST;
```

For information about using the slow query log, see [The Slow Query Log](https://dev.mysql.com/doc/refman/5.7/en/slow-query-log.html) in the [MySQL documentation](https://dev.mysql.com/doc/).

To avoid latency spikes from long-running transactions, restructure your source transactions to either reduce the query run time or increase your commit frequency.

## High workload on source
<a name="CHAP_Troubleshooting_Latency_Source_MySQL_Highworkload"></a>

Because DMS CDC is single-threaded, a large number of transactions can increase source latency. To identify if source latency is due to a heavy workload, compare the number and size of the binary logs generated during the latency period to the logs generated before the latency. To check the binary logs and the DMS CDC thread status, use the following queries:

```
SHOW BINARY LOGS;
SHOW PROCESSLIST;
```

For more information about CDC binary log dump thread states, see [Replication Source Thread States](https://dev.mysql.com/doc/refman/8.0/en/source-thread-states.html).

You can determine the latency by comparing the latest binary log position generated on the source with the event DMS is currently processing. To identify the latest binary log on the source, do the following:
+ Enable debug logging on the `SOURCE_CAPTURE` component.
+ Retrieve the binary log file name and position that DMS is currently processing from the task debug logs.
+ Use the following query to identify the latest binary log on the source:

  ```
  SHOW MASTER STATUS;
  ```

To further optimize performance, tune the `EventsPollInterval`. By default, DMS polls the binary log every 5 seconds, but you may improve performance by reducing this value. For more information about the `EventsPollInterval` setting, see [Endpoint settings when using MySQL as a source for AWS DMS](CHAP_Source.MySQL.md#CHAP_Source.MySQL.ConnectionAttrib).
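For example, to poll the binary log every second instead of the default five, you might use the following endpoint setting. The value here is illustrative; shorter intervals increase load on the source, so test before using in production:

```
eventsPollInterval=1
```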

## Binary log contention
<a name="CHAP_Troubleshooting_Latency_Source_MySQL_Binarylog"></a>

When migrating multiple tables with a large amount of data, we recommend splitting tables into separate tasks for MySQL 5.7.2 or later. Starting with MySQL 5.7.2, the dump thread no longer locks the binary log whenever it reads an event, which reduces lock contention and improves throughput. This means that multiple dump threads can read the binary log file concurrently, and that dump threads can read the binary log while clients write to it. For more information about dump threads, see [Replication Threads](https://dev.mysql.com/doc/refman/8.0/en/replication-threads.html) and the [MySQL 5.7.2 release notes](https://dev.mysql.com/doc/relnotes/mysql/5.7/en/news-5-7-2.html).

For MySQL source versions prior to 5.7.2, try consolidating tasks with CDC components to improve replication performance.

# PostgreSQL Endpoint Troubleshooting
<a name="CHAP_Troubleshooting_Latency_Source_PostgreSQL"></a>

This section contains replication scenarios specific to PostgreSQL.

**Topics**
+ [Long-running transaction on source](#CHAP_Troubleshooting_Latency_Source_PostgreSQL_Longrunning)
+ [High workload on source](#CHAP_Troubleshooting_Latency_Source_PostgreSQL_Highworkload)
+ [High network throughput](#CHAP_Troubleshooting_Latency_Source_PostgreSQL_Highnetwork)
+ [Spill files in Aurora PostgreSQL](#CHAP_Troubleshooting_Latency_Source_PostgreSQL_Spill)

## Long-running transaction on source
<a name="CHAP_Troubleshooting_Latency_Source_PostgreSQL_Longrunning"></a>

When there are long-running transactions in the source database, such as a few thousand inserts in a single transaction, the DMS CDC event and transaction counters do not increase until the transaction is complete. This delay can cause latency issues that you can measure using the `CDCLatencyTarget` metric.

To review long-running transactions, do one of the following:
+ Use the `pg_replication_slots` view. If the `restart_lsn` value isn't updating, it is likely that PostgreSQL is unable to release Write Ahead Logs (WALs) due to long-running active transactions. For information about the `pg_replication_slots` view, see [pg_replication_slots](https://www.postgresql.org/docs/15/view-pg-replication-slots.html) in the [PostgreSQL 15.4 Documentation](https://www.postgresql.org/docs/15/).
+ Use the following query to return a list of all active queries in the database, along with related information: 

  ```
  SELECT pid, age(clock_timestamp(), query_start), usename, query 
  FROM pg_stat_activity WHERE query != '<IDLE>' 
  AND query NOT ILIKE '%pg_stat_activity%'
  ORDER BY query_start desc;
  ```

  In the query results, the `age` field shows the active duration of each query, which you can use to identify long-running queries.

## High workload on source
<a name="CHAP_Troubleshooting_Latency_Source_PostgreSQL_Highworkload"></a>

If your source PostgreSQL has a high workload, check the following to reduce latency:
+ You may experience high latency when using the `test_decoding` plugin while migrating a subset of tables from the source database with a high transactions per second (TPS) value. This is because the `test_decoding` plugin sends all database changes to the replication instance which DMS then filters, based on the task’s table mapping. Events for tables that aren’t part of the task’s table mapping can increase source latency.
+ Check TPS throughput using one of the following methods.
  + For Aurora PostgreSQL sources, use the `CommitThroughput` CloudWatch metric.
  + For PostgreSQL running on Amazon RDS or on-premises, use the following query with a psql client version 11 or higher (press **enter** during the query to advance the results):

    ```
    SELECT SUM(xact_commit)::numeric as temp_num_tx_ini FROM pg_stat_database; \gset
    select pg_sleep(60);
    SELECT SUM(xact_commit)::numeric as temp_num_tx_final FROM pg_stat_database; \gset
    select (:temp_num_tx_final - :temp_num_tx_ini)/ 60.0 as "Transactions Per Second";
    ```
+ To reduce latency when using the `test_decoding` plugin, consider using the `pglogical` plugin instead. Unlike the `test_decoding` plugin, the `pglogical` plugin filters write ahead log (WAL) changes at the source, and only sends relevant changes to the replication instance. For information about using the `pglogical` plugin with AWS DMS, see [Configuring the pglogical plugin](CHAP_Source.PostgreSQL.md#CHAP_Source.PostgreSQL.Security.Pglogical).
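For example, you can select the plugin with the `pluginName` endpoint setting. This is a sketch; the pglogical extension must already be installed and configured on the source:

```
pluginName=pglogical
```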

## High network throughput
<a name="CHAP_Troubleshooting_Latency_Source_PostgreSQL_Highnetwork"></a>

Your replication may have high network bandwidth use when using the `test_decoding` plugin, especially during high-volume transactions. This is because the `test_decoding` plugin processes changes, and converts them into a human-readable format that is larger than the original binary format.

To improve performance, consider using the `pglogical` plugin instead, which is a binary plugin. Unlike the `test_decoding` plugin, the `pglogical` plugin generates binary format output, resulting in compressed write ahead log (WAL) stream changes.

## Spill files in Aurora PostgreSQL
<a name="CHAP_Troubleshooting_Latency_Source_PostgreSQL_Spill"></a>

In PostgreSQL version 13 and higher, the `logical_decoding_work_mem` parameter determines the memory allocation for decoding and streaming. For more information about the `logical_decoding_work_mem` parameter, see [Resource Consumption in PostgreSQL](https://www.postgresql.org/docs/13/runtime-config-resource.html#GUC-LOGICAL-DECODING-WORK-MEM) in the [PostgreSQL 13.13 Documentation](https://www.postgresql.org/docs/13/).

Logical replication accumulates changes for all transactions in memory until those transactions commit. If the amount of data stored across all transactions exceeds the amount specified by the database parameter `logical_decoding_work_mem`, then DMS spills the transaction data to disk to release memory for new decoding data.

Long-running transactions, or many subtransactions, may result in DMS consuming increased logical decoding memory. This increased memory use results in DMS creating spill files on disk, which causes high source latency during replication.

To reduce the impact of an increase in the source workload, do the following:
+ Reduce long-running transactions.
+ Reduce the number of subtransactions.
+ Avoid performing operations that generate a large burst of log records, such as deleting or updating an entire table in a single transaction. Perform operations in smaller batches instead.

You can use the following CloudWatch metrics to monitor the workload on the source:
+ `TransactionLogsDiskUsage`: The number of bytes currently occupied by the logical WAL. This value increases monotonically if logical replication slots are unable to keep up with the pace of new writes, or if any long-running transactions prevent garbage collection of older files.
+ `ReplicationSlotDiskUsage`: The amount of disk space the logical replication slots currently use.

You can reduce source latency by tuning the `logical_decoding_work_mem` parameter. The default value for this parameter is 64 MB. This parameter limits the amount of memory used by each logical streaming replication connection. We recommend setting the `logical_decoding_work_mem` value significantly higher than the `work_mem` value to reduce the amount of decoded changes that DMS writes to disk.

We recommend that you periodically check for spill files, particularly during periods of heavy migration activity or latency. If DMS is creating a significant number of spill files, this means that logical decoding isn't operating efficiently, which can increase latency. To mitigate this, increase the `logical_decoding_work_mem` parameter value. 

You can check the current transaction overflow with the `aurora_stat_file` function. For more information, see [Adjusting working memory for logical decoding](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraPostgreSQL.BestPractices.Tuning-memory-parameters.html#AuroraPostgreSQL.BestPractices.Tuning-memory-parameters.logical-decoding-work-mem) in the *Amazon Aurora User Guide*.



# SQL Server Endpoint Troubleshooting
<a name="CHAP_Troubleshooting_Latency_Source_SQLServer"></a>

This section contains replication scenarios specific to SQL Server. To determine what changes to replicate from SQL Server, AWS DMS reads the transaction logs and runs periodic scans on the source database. Replication latency usually results from SQL Server throttling these scans because of resource constraints. It can also result from a significant increase in the number of events written to the transaction log in a short time. 

**Topics**
+ [Index rebuilds](#CHAP_Troubleshooting_Latency_Source_SQLServer_Indexrebuilds)
+ [Large transactions](#CHAP_Troubleshooting_Latency_Source_SQLServer_Largetransactions)
+ [Misconfigured MS-CDC polling interval for Amazon RDS SQL Server](#CHAP_Troubleshooting_Latency_Source_SQLServer_MisconfiguredCDC)
+ [Multiple CDC tasks replicating from the same source database](#CHAP_Troubleshooting_Latency_Source_SQLServer_MultipleCDC)
+ [Transaction log backup processing for RDS for SQL Server](#CHAP_Troubleshooting_Latency_Source_SQLServer_backup)

## Index rebuilds
<a name="CHAP_Troubleshooting_Latency_Source_SQLServer_Indexrebuilds"></a>

When SQL Server rebuilds a large index, it uses a single transaction. This generates a lot of events, and can use up a large amount of log space if SQL Server rebuilds several indexes at once. When this happens, you can expect brief replication latency spikes. If your SQL Server source has sustained latency spikes, check the following:
+ First, check the time period of the latency spikes using either the `CDCLatencySource` and `CDCLatencyTarget` CloudWatch metrics, or by checking Throughput Monitoring messages in the task logs. For information about CloudWatch metrics for AWS DMS, see [Replication task metrics](CHAP_Monitoring.md#CHAP_Monitoring.Metrics.Task). 
+ Check if the size of the active transaction logs or log backups increased during the latency spike. Also check if a maintenance job or a rebuild ran during that time. For information about checking transaction log size, see [Monitor log space use](https://learn.microsoft.com/en-us/sql/relational-databases/logs/manage-the-size-of-the-transaction-log-file?view=sql-server-ver16#MonitorSpaceUse) in the [SQL Server technical documentation](https://learn.microsoft.com/en-us/sql/sql-server/?view=sql-server-ver16).
+ Verify that your maintenance plan follows SQL Server best practices. For information about SQL Server maintenance best practices, see [Index maintenance strategy](https://learn.microsoft.com/en-us/sql/relational-databases/indexes/reorganize-and-rebuild-indexes?view=sql-server-ver16#index-maintenance-strategy) in the [SQL Server technical documentation](https://learn.microsoft.com/en-us/sql/sql-server/?view=sql-server-ver16).

To fix latency issues during index rebuilds, try the following:
+ Use the `BULK_LOGGED` recovery model for offline rebuilds to reduce the events a task has to process.
+ If possible, stop the task during index rebuilds. Or, try to schedule index rebuilds during non-peak hours to mitigate the impact of a latency spike.
+ Try to identify resource bottlenecks that are slowing DMS reads, such as disk latency or I/O throughput, and address them.

## Large transactions
<a name="CHAP_Troubleshooting_Latency_Source_SQLServer_Largetransactions"></a>

Transactions with a lot of events, or long-running transactions, cause the transaction log to grow. This causes DMS reads to take longer, resulting in latency. This is similar to the effect index rebuilds have on replication performance.

You may have difficulty identifying this issue if you're not familiar with the typical workload on the source database. To troubleshoot this issue, do the following:
+ First, identify the time that latency spiked using either the `CDCLatencySource` and `CDCLatencyTarget` CloudWatch metrics, or by checking Throughput Monitoring messages in the task logs.
+ Check if there are any long-running queries on the source database during the latency spike. For information about long-running queries, see [Troubleshoot slow-running queries in SQL Server](https://learn.microsoft.com/en-us/troubleshoot/sql/database-engine/performance/troubleshoot-slow-running-queries) in the [SQL Server technical documentation](https://learn.microsoft.com/en-us/sql/sql-server/?view=sql-server-ver16).
+ Check if the size of the active transaction logs or the log backups has increased. For more information, see [Monitor log space use](https://learn.microsoft.com/en-us/sql/relational-databases/logs/manage-the-size-of-the-transaction-log-file?view=sql-server-ver16#MonitorSpaceUse) in the [SQL Server technical documentation](https://learn.microsoft.com/en-us/sql/sql-server/?view=sql-server-ver16).
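To make the long-running-query check concrete, here is a hedged Python sketch that flags open transactions older than a cutoff. The `(name, begin_time)` row shape and the one-hour cutoff are illustrative assumptions; you might populate the rows from a source-side query against a view such as SQL Server's `sys.dm_tran_active_transactions`.

```python
from datetime import datetime, timedelta

def long_running(transactions, now, cutoff=timedelta(hours=1)):
    """Return (tx_name, age) for transactions open longer than the cutoff.
    Long-running transactions keep the log growing and slow DMS reads."""
    flagged = []
    for name, begin_time in transactions:
        age = now - begin_time
        if age >= cutoff:
            flagged.append((name, age))
    return flagged

now = datetime(2024, 1, 1, 12, 0)
txns = [
    ("nightly_batch", datetime(2024, 1, 1, 8, 0)),   # open for 4 hours
    ("quick_update", datetime(2024, 1, 1, 11, 55)),  # open for 5 minutes
]
print(long_running(txns, now))
```

Transactions that this flags during a latency spike are candidates for restructuring on the application side.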

To fix this issue, do one of the following:
+ The best fix is to restructure your transactions on the application side so that they complete quickly. 
+ If you can't restructure your transactions, a short-term workaround is to check for resource bottlenecks such as disk waits or CPU contention. If you find bottlenecks in your source database, you can reduce latency by increasing the disk, CPU, and memory resources for your source database. This reduces contention for system resources, allowing DMS queries to complete faster.

## Misconfigured MS-CDC polling interval for Amazon RDS SQL Server
<a name="CHAP_Troubleshooting_Latency_Source_SQLServer_MisconfiguredCDC"></a>

A misconfigured polling interval setting on Amazon RDS instances can cause the transaction log to grow. This is because replication prevents log truncation. While tasks that are running might continue replicating with minimal latency, stopping and resuming tasks, or starting CDC-only tasks, can cause task failures. These are due to timeouts while scanning the large transaction log.

To troubleshoot a misconfigured polling interval, do the following:
+ Check if the active transaction log size is increasing, and if log usage is close to 100 percent. For more information, see [Monitor log space use](https://learn.microsoft.com/en-us/sql/relational-databases/logs/manage-the-size-of-the-transaction-log-file?view=sql-server-ver16#MonitorSpaceUse) in the [SQL Server technical documentation](https://learn.microsoft.com/en-us/sql/sql-server/?view=sql-server-ver16).
+ Check if log truncation is delayed with a `log_reuse_wait_desc` value of `REPLICATION`. For more information, see [The Transaction Log (SQL Server)](https://learn.microsoft.com/en-us/sql/relational-databases/logs/the-transaction-log-sql-server?view=sql-server-ver16#FactorsThatDelayTruncation) in the [SQL Server technical documentation](https://learn.microsoft.com/en-us/sql/sql-server/?view=sql-server-ver16).
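The two checks above can be combined in a small sketch. It assumes you have separately queried `sys.databases` for `(name, log_reuse_wait_desc)` rows and collected log-usage percentages; the 90 percent threshold is an example value, not a DMS setting.

```python
def truncation_delayed(databases, usage_pct, usage_threshold=90):
    """Flag databases where log truncation is blocked by replication and
    the log is nearly full -- the symptom of a misconfigured polling
    interval described above."""
    flagged = []
    for name, wait_desc in databases:
        if wait_desc == "REPLICATION" and usage_pct.get(name, 0) >= usage_threshold:
            flagged.append(name)
    return flagged

dbs = [("sales", "REPLICATION"), ("hr", "NOTHING")]
usage = {"sales": 97, "hr": 12}
print(truncation_delayed(dbs, usage))  # ['sales']
```

A database that this flags is a signal to tune the MS-CDC polling interval.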

If you find issues with any of the items in the previous list, tune the MS-CDC polling interval. For information about tuning the polling interval, see [Recommended settings when using RDS for SQL Server as a source for AWS DMS](CHAP_Source.SQLServer.CDC.md#CHAP_Source.SQLServer.Configuration.Settings). 

## Multiple CDC tasks replicating from the same source database
<a name="CHAP_Troubleshooting_Latency_Source_SQLServer_MultipleCDC"></a>

During the full load phase, we recommend splitting tables across tasks to improve performance, to separate dependent tables logically, and to mitigate the impact of a task failure. However, during the CDC phase, we recommend consolidating tasks to minimize DMS scans. During the CDC phase, each DMS task scans the transaction logs for new events several times a minute. Since each task runs independently, every task scans each transaction log individually. This increases disk and CPU usage on the source SQL Server database. As a result, a large number of tasks running in parallel can cause SQL Server to throttle DMS reads, leading to increased latency.

You may have difficulty identifying this issue if multiple tasks start gradually. The most common symptom of this issue is that scans for most tasks start to take longer, leading to higher latency for those tasks. SQL Server prioritizes a few of the task scans, so a few tasks show normal latency. To troubleshoot this issue, check the `CDCLatencySource` metric for all of your tasks. If some of the tasks have an increasing `CDCLatencySource` while a few tasks have a low `CDCLatencySource`, it is likely that SQL Server is throttling DMS reads for some of your tasks.

If SQL Server is throttling your task reads during CDC, consolidate your tasks to minimize the number of DMS scans. The maximum number of tasks that can connect to your source database without creating contention depends on factors such as the source database capacity, the rate of transaction log growth, or the number of tables. To determine the ideal number of tasks for your replication scenario, test replication in a test environment similar to your production environment.
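The symptom described above, where some tasks climb while a few stay low, can be screened for with a sketch like the following. It assumes you have exported recent `CDCLatencySource` samples per task; the 60-second "low" cutoff is an illustrative assumption.

```python
def likely_throttled(task_latencies, low_seconds=60):
    """Split tasks into 'rising' (latency strictly increasing) and 'low'
    (latest sample under low_seconds). Rising tasks alongside low ones
    suggests SQL Server is throttling some DMS log scans."""
    rising, low = [], []
    for task, samples in task_latencies.items():
        if all(b > a for a, b in zip(samples, samples[1:])):
            rising.append(task)
        elif samples[-1] < low_seconds:
            low.append(task)
    return rising, low

latencies = {
    "task-a": [120, 400, 900],  # steadily climbing
    "task-b": [30, 25, 20],     # healthy
    "task-c": [15, 10, 12],     # healthy
}
rising, low = likely_throttled(latencies)
print(rising, low)  # ['task-a'] ['task-b', 'task-c']
```

If a few tasks land in `rising` while the rest stay `low`, that matches the throttling pattern, and consolidating tasks is the recommended response.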

## Transaction log backup processing for RDS for SQL Server
<a name="CHAP_Troubleshooting_Latency_Source_SQLServer_backup"></a>

AWS DMS 3.5.3 and above support replicating from RDS for SQL Server log backups. Replicating events from the backup logs on RDS instances is slower than replicating events from the active transaction log. This is because DMS requests access to backups serially to ensure that it maintains the transaction sequence, and to minimize the risk of the Amazon RDS instance storage filling up. Moreover, at the Amazon RDS end, the time taken to make the backups available to DMS varies depending on the size of the log backup, and the load on the RDS for SQL Server instance.

Because of these constraints, we recommend that you set the ECA `ActivateSafeguard` to `true`. This ensures that transactions are not backed up while the DMS task is reading from the active transaction log. This setting also prevents Amazon RDS from archiving transactions in the active log while DMS is reading transactions from the backup, thereby eliminating the possibility that DMS can't catch up to the active log. Note that this may cause the active log to grow while the task is catching up. Make sure that your instance has enough storage to keep it from running out of space.

For a CDC-only task replicating from RDS for SQL Server sources, use a native CDC start position rather than a native CDC start time if possible. This is because DMS relies on system tables to identify the starting point for a native start position, rather than scanning individual log backups as it does when you specify a native start time.

# Troubleshooting target latency issues
<a name="CHAP_Troubleshooting_Latency_Target"></a>

This section contains scenarios that can contribute to target latency.

**Topics**
+ [Indexing issues](#CHAP_Troubleshooting_Latency_Target_Indexing)
+ [SORTER message in task log](#CHAP_Troubleshooting_Latency_Target_Sorter)
+ [Database locking](#CHAP_Troubleshooting_Latency_Target_Locking)
+ [Slow LOB lookups](#CHAP_Troubleshooting_Latency_Target_LOB)
+ [Multi-AZ, audit logging and backups](#CHAP_Troubleshooting_Latency_Target_MultiAZ)

## Indexing issues
<a name="CHAP_Troubleshooting_Latency_Target_Indexing"></a>

During the CDC phase, AWS DMS replicates changes on the source by executing DML statements (insert, update, and delete) on the target. For heterogeneous migrations using DMS, differences in index optimizations on the source and target can cause writes to the target to take longer. This results in target latency and performance issues.

To troubleshoot indexing issues, do the following. The procedures for these steps vary for different database engines. 
+ Monitor the query time for your target database. Comparing the query execution time on the target and source can indicate which indexes need optimization.
+ Enable logging for slow-running queries.

To fix indexing issues for long-running replications, do the following:
+ Tune the indexes on your source and target databases so that the query execution time is similar on the source and the target.
+ Compare the secondary indexes used in DML queries for the source and the target. Make sure that DML performance on the target is comparable to or better than the source DML performance.

Note that the procedure for optimizing indexes is specific to your database engine. There is no DMS feature for tuning source and target indexes.
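To make the source-versus-target comparison concrete, a sketch like the following can flag DML statements that regressed on the target. The `(query, source_ms, target_ms)` row shape and the 1.5x ratio are illustrative assumptions, not DMS features; you would collect the timings using your database engine's own tooling.

```python
def slower_on_target(timings, ratio=1.5):
    """Return queries whose target execution time exceeds the source
    time by more than `ratio` -- candidates for target index tuning."""
    return [q for q, src_ms, tgt_ms in timings if tgt_ms > src_ms * ratio]

timings = [
    ("UPDATE orders SET ...", 12.0, 95.0),   # much slower on target
    ("INSERT INTO audit ...", 3.0, 4.0),     # comparable
]
print(slower_on_target(timings))
```

Queries that this flags point to the target indexes that need tuning first.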

## SORTER message in task log
<a name="CHAP_Troubleshooting_Latency_Target_Sorter"></a>

If a target endpoint can't keep up with the volume of changes that AWS DMS writes to it, the task caches the changes on the replication instance. If the cache grows larger than an internal threshold, the task stops reading further changes from the source. DMS does this to prevent the replication instance from running out of storage, and to prevent the task from getting stuck while reading a large volume of pending events.

To troubleshoot this issue, check the CloudWatch logs for a message similar to either of the following:

```
[SORTER ]I: Reading from source is paused. Total disk usage exceeded the limit 90% (sorter_transaction.c:110)
[SORTER ]I: Reading from source is paused. Total storage used by swap files exceeded the limit 1048576000 bytes  (sorter_transaction.c:110)
```

If your logs contain a message similar to the first message, disable any trace logging for the task, and increase the replication instance storage. For information about increasing replication instance storage, see [Modifying a replication instance](CHAP_ReplicationInstance.Modifying.md).

If your logs contain a message similar to the second message, do the following:
+ Move tables with numerous transactions or long-running DML operations to a separate task, if they don't have any dependencies on other tables in the task.
+ Increase the `MemoryLimitTotal` and `MemoryKeepTime` settings to hold the transaction for a longer duration in memory. This won't help if the latency is sustained, but it can help keep latency down during short bursts of transactional volume. For information about these task settings, see [Change processing tuning settings](CHAP_Tasks.CustomizingTasks.TaskSettings.ChangeProcessingTuning.md).
+ Evaluate whether you can use batch apply for your transactions by setting `BatchApplyEnabled` to `true`. For information about the `BatchApplyEnabled` setting, see [Target metadata task settings](CHAP_Tasks.CustomizingTasks.TaskSettings.TargetMetadata.md).
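When scanning many log files, a small parser can tell the two SORTER messages apart and map each to its remediation. The message text is taken from the examples above; the substrings the parser keys on are a best-effort assumption about the stable parts of the messages.

```python
import re

def classify_sorter(line):
    """Map a SORTER pause message to the limit that was hit:
    'disk' -> increase replication instance storage;
    'swap' -> split tables, tune memory settings, or try batch apply."""
    if "[SORTER" not in line or "Reading from source is paused" not in line:
        return None
    if re.search(r"Total disk usage exceeded", line):
        return "disk"
    if re.search(r"Total storage used by swap files exceeded", line):
        return "swap"
    return None

line = ("[SORTER ]I: Reading from source is paused. "
        "Total disk usage exceeded the limit 90% (sorter_transaction.c:110)")
print(classify_sorter(line))  # disk
```

A `disk` result points to the storage fix in the first case above, and a `swap` result points to the task-splitting and memory-tuning options in the second.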

## Database locking
<a name="CHAP_Troubleshooting_Latency_Target_Locking"></a>

If an application accesses a database that AWS DMS is using as a replication target, the application may lock a table that DMS is trying to access. This creates lock contention. Because DMS writes changes to the target database in the order they occurred on the source, delays in writing to one table due to lock contention delay writes to all tables.

To troubleshoot this issue, query the target database to check whether lock contention is blocking DMS write transactions. If the target database is blocking DMS write transactions, do one or more of the following:
+ Restructure your queries to commit changes more frequently.
+ Modify your lock timeout settings.
+ Partition your tables to minimize lock contention.

Note that the procedure for resolving lock contention is specific to your database engine. There is no DMS feature for tuning lock contention.

## Slow LOB lookups
<a name="CHAP_Troubleshooting_Latency_Target_LOB"></a>

When AWS DMS replicates a large object (LOB) column, it performs a lookup on the source just before writing changes to the target. This lookup normally doesn't cause any latency on the target, but if the source database delays the lookup due to locking, target latency may spike. 

This issue is normally difficult to diagnose. To troubleshoot this issue, enable detailed debugging on the task logs, and compare the timestamps of the DMS LOB lookup calls. For information about enabling detailed debugging, see [Viewing and managing AWS DMS task logs](CHAP_Monitoring.md#CHAP_Monitoring.ManagingLogs).
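One way to compare the timestamps is to pair each lookup's start and end entries and compute durations, as in this sketch. The event shape here is purely illustrative, since the detailed-debug log format isn't reproduced in this section; you would adapt the parsing to the timestamps you actually see in the task logs.

```python
from datetime import datetime

def lookup_durations(events):
    """Pair LOB lookup start/end timestamps per source row key and return
    durations, to spot lookups delayed by locking on the source."""
    starts, durations = {}, {}
    for ts, kind, key in events:
        if kind == "start":
            starts[key] = ts
        elif kind == "end" and key in starts:
            durations[key] = ts - starts.pop(key)
    return durations

events = [
    (datetime(2024, 1, 1, 10, 0, 0), "start", "row-17"),
    (datetime(2024, 1, 1, 10, 0, 9), "end", "row-17"),  # 9 seconds: suspicious
]
print(lookup_durations(events))
```

Lookups with outlier durations point to source-side locking rather than a target problem.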

To fix this issue, try the following:
+ Improve the SELECT query performance on the source database.
+ Tune the DMS LOB settings. For information about tuning the LOB settings, see [Migrating large binary objects (LOBs)](CHAP_BestPractices.md#CHAP_BestPractices.LOBS).

## Multi-AZ, audit logging and backups
<a name="CHAP_Troubleshooting_Latency_Target_MultiAZ"></a>

For Amazon RDS targets, target latency can increase during the following:
+ Backups
+ After enabling multiple availability zones (multi-AZ)
+ After enabling database logging, such as audit or slow query logs

These issues are normally difficult to diagnose. To troubleshoot these issues, monitor latency for periodic spikes during Amazon RDS maintenance windows or periods of heavy database load.

To fix these issues, try the following:
+ If possible, for a short-term migration, disable multi-AZ, backups, or logging.
+ Reschedule your maintenance windows for periods of low activity.