DR Orchestrator Framework overview
DR Orchestrator Framework provides a one-click solution to orchestrate and automate
cross-Region DR for AWS databases. It uses AWS Step Functions and AWS Lambda to perform the required steps during the
failover and failback. The Step Functions state machines provide the basis for decision making
within the orchestrator design. The API operations for performing failover or failback actions
are coded into Lambda functions that are called from within the state machine. The Lambda
functions run AWS SDK for Python (Boto3).
DR Orchestrator Framework contains two main state machines that correspond to the failover and failback phases.
For Amazon RDS, the failover phase promotes a cross-Region RDS read replica into a standalone DB instance. For Amazon Aurora, when the primary Region is down during a rare, unexpected outage, its writer node isn't available. Replication between the writer node and the secondary clusters stops. You must detach the secondary cluster from the global database and promote it as a standalone cluster. Applications can connect and send write traffic to the standalone cluster. You can use this same process to switch over the primary DB cluster of the global database to the secondary Regions. Use this approach for controlled scenarios such as the following:
- Operational maintenance
- Planned operational procedures
- Promotion of an Amazon ElastiCache (Redis OSS) secondary cluster as your new primary cluster
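The promotion step for Amazon RDS and Amazon Aurora can be sketched with two Boto3 RDS client operations, `promote_read_replica` and `remove_from_global_cluster`. This is a minimal illustration, not the framework's own Lambda code: the function name `promote_secondary`, the `engine` labels, and the argument names are assumptions made for this example. The client is passed in by the caller so that it can be created in the secondary Region.

```python
def promote_secondary(rds_client, engine, identifier, global_cluster_id=None):
    """Promote a secondary database to a standalone database.

    rds_client: a Boto3 RDS client for the secondary Region, for example
                boto3.client("rds", region_name="us-west-2").
    engine:     "rds" or "aurora" (hypothetical labels used only in this sketch).
    identifier: the read replica's DB instance identifier for "rds", or the
                secondary cluster's ARN for "aurora".
    """
    if engine == "rds":
        # Promote the cross-Region read replica into a standalone DB instance.
        return rds_client.promote_read_replica(DBInstanceIdentifier=identifier)
    if engine == "aurora":
        if not global_cluster_id:
            raise ValueError("global_cluster_id is required for Aurora")
        # Detach the secondary cluster from the global database. The detached
        # cluster becomes a standalone cluster that can accept write traffic.
        return rds_client.remove_from_global_cluster(
            GlobalClusterIdentifier=global_cluster_id,
            DbClusterIdentifier=identifier,
        )
    raise ValueError(f"Unsupported engine: {engine}")
```

Passing the client as an argument keeps the Region choice outside the function and makes the logic easy to exercise with a stub client.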
The failback phase establishes live replication of data between a live primary Region and a new secondary Region.
It's critical to understand that DR Orchestrator applies to databases only. All the applications that reference these databases and are in the same Region might need a separate, tandem failover solution. After the databases fail over to the secondary Region, the applications need to be updated to connect to the new database instances, which will serve as the data source.
The failover process
To perform a failover, run the DR Orchestrator FAILOVER state machine. At
this stage, a secondary database is already present in the secondary Region, either as a read
replica (Amazon RDS) or as a secondary cluster (Amazon Aurora). When you run the DR Orchestrator
FAILOVER state machine, it promotes the secondary database to become the
primary.
DR Orchestrator FAILOVER architecture
The following diagram shows the concepts of the failover process for Amazon Aurora when using DR Orchestrator. Amazon RDS and Amazon ElastiCache use the same workflow but with different state machines and Lambda functions.
- The DR Orchestrator FAILOVER state machine reads the input JSON parameters.
- Based on the resourceType parameter, the state machine calls other state machines: Promote RDS Read Replica, Failover Aurora Cluster, or Failover ElastiCache. If more than one resource is passed in the input, these state machines run in parallel.
- The Failover Aurora Cluster state machine calls Lambda functions in each of the following three steps.
- The Resolve imports Lambda function resolves "!Import <export-variable-name>" with the actual values from the App-Stack AWS CloudFormation template.
- The Failover Aurora Cluster Lambda function promotes the read replica as a standalone DB instance.
- The Check Failover Status Lambda function checks the status of the promoted DB instance. After the status is AVAILABLE, the Lambda function sends a success token back to the calling state machine and completes.
- You can redirect your applications to the standalone database in the DR Region (us-west-2), which is now the primary database.
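The import-resolution step in the list above can be sketched as follows. This is an assumption-laden illustration rather than the framework's actual Resolve imports function: it assumes placeholders use the `!Import <name>` form shown in the input JSON, and that the values come from CloudFormation exports read through the Boto3 CloudFormation client's `list_exports` operation. The helper names `load_exports` and `resolve_imports` are hypothetical.

```python
IMPORT_PREFIX = "!Import "


def load_exports(cfn_client):
    """Return all CloudFormation exports in the client's Region as {name: value}.

    cfn_client: a Boto3 CloudFormation client, for example
                boto3.client("cloudformation", region_name="us-west-2").
    """
    exports = {}
    kwargs = {}
    while True:
        page = cfn_client.list_exports(**kwargs)
        for export in page.get("Exports", []):
            exports[export["Name"]] = export["Value"]
        token = page.get("NextToken")
        if not token:
            return exports
        # list_exports is paginated; request the next page.
        kwargs = {"NextToken": token}


def resolve_imports(parameters, exports):
    """Replace every "!Import <name>" value with the matching export value."""
    resolved = {}
    for key, value in parameters.items():
        if isinstance(value, str) and value.startswith(IMPORT_PREFIX):
            export_name = value[len(IMPORT_PREFIX):].strip()
            if export_name not in exports:
                raise KeyError(f"No CloudFormation export named {export_name!r}")
            resolved[key] = exports[export_name]
        else:
            # Literal values pass through unchanged.
            resolved[key] = value
    return resolved
```

Failing on an unknown export name, rather than passing the placeholder through, stops the workflow before an API call is made with an unresolved identifier.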
The failback process
After your former primary Region (us-east-1) is up again, you can fail back
to it, so that the database in us-east-1 becomes the primary again. To start the
failback, run the DR Orchestrator FAILBACK state machine. As the name indicates,
this state machine starts replicating changes in your new primary Region
(us-west-2) back to the former primary Region (us-east-1), which
acts as the current secondary.
After replication is established between the two Regions, you can initiate the failback.
To fail back and return to your original primary Region (us-east-1), run the
DR Orchestrator FAILOVER state machine in the current secondary Region
(us-east-1) to promote it to the primary Region.
DR Orchestrator FAILBACK architecture
The following diagram shows the concepts of the failback process for Amazon Aurora when using DR Orchestrator.
- Before beginning failback, take a manual DB snapshot to use when performing root cause analysis (RCA). Also, disable DeletionProtection for the Aurora cluster in the previous primary Region (us-east-1).
- The DR Orchestrator FAILBACK state machine reads the input JSON parameters.
- Based on the resourceType, the DR Orchestrator FAILBACK state machine calls the Create Aurora Secondary DB Cluster state machine.
- The Create Aurora Secondary DB Cluster state machine calls Lambda functions in each of the following five steps.
- The Resolve import Lambda function resolves "!Import <export-variable-name>" with the actual values from the App-Stack CloudFormation template.
- The Delete DB Instance Lambda function deletes the former primary instance.
- The Check DB instance status Lambda function checks the Delete DB Instance status until the DB is deleted.
- The Create Read Replica Lambda function creates a read replica in the secondary Region from the DB instance that's in the new primary Region.
- The Check DB instance status Lambda function checks the read replica DB instance status. When the status is AVAILABLE, the Lambda function sends a success token back to the calling state machine, which then completes.
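The status-check step at the end of the list above can be sketched as a polling loop that reports back to Step Functions through a task token. This is one possible shape under stated assumptions, not the framework's implementation: it assumes the state machine uses the Step Functions callback pattern and passes a task token to the function. `describe_db_instances`, `send_task_success`, and `send_task_failure` are existing Boto3 operations; the function name, the polling defaults, and the output format are hypothetical.

```python
import json
import time


def wait_until_available(rds_client, sfn_client, db_instance_id, task_token,
                         poll_seconds=30, max_attempts=20, sleep=time.sleep):
    """Poll a DB instance and report the result to the calling state machine.

    Returns True if the instance became available, False on timeout.
    """
    for _ in range(max_attempts):
        response = rds_client.describe_db_instances(
            DBInstanceIdentifier=db_instance_id
        )
        status = response["DBInstances"][0]["DBInstanceStatus"]
        # Compare case-insensitively so "available" and "AVAILABLE" both match.
        if status.lower() == "available":
            sfn_client.send_task_success(
                taskToken=task_token,
                output=json.dumps(
                    {"DBInstanceIdentifier": db_instance_id, "status": status}
                ),
            )
            return True
        sleep(poll_seconds)
    # Give up and let the state machine handle the failure.
    sfn_client.send_task_failure(
        taskToken=task_token,
        error="Timeout",
        cause=f"{db_instance_id} was not available after {max_attempts} checks",
    )
    return False
```

Because a Lambda invocation has a bounded run time, a real workflow might instead check the status once per invocation and let the state machine wait and retry. The loop here only illustrates the success-token handoff.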
DR Orchestrator FAILOVER
Use the DR Orchestrator FAILOVER state machine during a DR event when the
primary Region (us-east-1) is down, or during planned events such as operational
maintenance.
The state machine can be called to fail over a single database or multiple databases in parallel.
The state machine accepts parameters in the JSON format as shown in the following code:
{
  "StatePayload": [
    {
      "layer": 1,
      "resources": [
        {
          "resourceType": "PromoteRDSReadReplica",
          "resourceName": "Promote RDS MySQL Read Replica",
          "parameters": {
            "RDSInstanceIdentifier": "!Import rds-mysql-instance-identifier",
            "TargetClusterIdentifier": "!Import rds-mysql-instance-global-arn"
          }
        },
        {
          "resourceType": "FailoverElastiCacheCluster",
          "resourceName": "Failover ElastiCache Cluster",
          "parameters": {
            "GlobalReplicationGroupId": "!Import demo-redis-cluster-global-replication-group-id",
            "TargetRegion": "!Import demo-redis-cluster-target-region",
            "TargetReplicationGroupId": "!Import demo-redis-cluster-target-replication-group-id"
          }
        }
      ]
    }
  ]
}
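One way to run the state machine with a payload like this is the Boto3 Step Functions `start_execution` operation. The following is a minimal sketch. The function name `start_failover` is hypothetical, and the state machine ARN must come from your own deployment because this document doesn't show it.

```python
import json


def start_failover(sfn_client, state_machine_arn, payload, execution_name=None):
    """Start a state machine execution and return its execution ARN.

    sfn_client: a Boto3 Step Functions client for the Region where the
                state machine is deployed, for example
                boto3.client("stepfunctions", region_name="us-west-2").
    payload:    the input as a Python dict with a "StatePayload" key.
    """
    kwargs = {
        "stateMachineArn": state_machine_arn,
        # Step Functions expects the input as a JSON string.
        "input": json.dumps(payload),
    }
    if execution_name:
        # The name is optional; Step Functions generates one if it's omitted.
        kwargs["name"] = execution_name
    response = sfn_client.start_execution(**kwargs)
    return response["executionArn"]
```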
Parameter details
The following table shows the parameters used by the DR Orchestrator
FAILOVER state machine.
| Parameter name | Description | Expected values |
|---|---|---|
| layer (required: number) | The processing sequence. All the resources defined in layer 1 must be run before the layer 2 resources are run. | 1 or 2, and so on |
| resources (required: array of dictionaries) | All the resources within a single layer run in parallel. | |
| resourceType (required: string) | The type of the resource, used to identify which workflow to run | PromoteRDSReadReplica or FailoverElastiCacheCluster |
| resourceName (optional: string) | To identify which application portfolio these resources belong to | Promote RDS for MySQL Read Replica |
| parameters (required: dictionary) | The parameters required to fail over or fail back the AWS database | |
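The layer semantics in the table can be sketched as a small planning function: layers are ordered by number, and the resources inside one layer form a group that can run in parallel. This is an illustration of the ordering rule only. The function name `plan_layers` and its validation rules are assumptions, and the real state machine performs the scheduling itself.

```python
REQUIRED_RESOURCE_FIELDS = ("resourceType", "parameters")


def plan_layers(payload):
    """Return [(layer_number, [resourceType, ...]), ...] in execution order.

    Layers run one after another in ascending order. The resource types
    listed for one layer belong to the same parallel group.
    """
    plan = []
    for entry in sorted(payload["StatePayload"], key=lambda e: e["layer"]):
        resource_types = []
        for resource in entry["resources"]:
            for field in REQUIRED_RESOURCE_FIELDS:
                if field not in resource:
                    raise ValueError(
                        f"Layer {entry['layer']}: resource is missing {field!r}"
                    )
            resource_types.append(resource["resourceType"])
        plan.append((entry["layer"], resource_types))
    return plan
```

For example, a payload with a layer 2 entry listed before a layer 1 entry still yields layer 1 first in the returned plan.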
DR Orchestrator FAILBACK
Use the DR Orchestrator FAILBACK state machine after the DR event, when the
former primary Region (us-east-1) is up. You can create the read
replica for Amazon RDS in the former primary Region from the new primary Region
(us-west-2) to be compliant with your DR strategy. Because this is a planned
event, you can schedule this activity over the weekend or during off-peak business hours with
an estimated downtime.
The state machine accepts parameters in the JSON format as shown in the following code:
{
  "StatePayload": [
    {
      "layer": 1,
      "resources": [
        {
          "resourceType": "CreateRDSReadReplica",
          "resourceName": "Create RDS for MySQL Read Replica",
          "parameters": {
            "RDSInstanceIdentifier": "!Import rds-mysql-instance-identifier",
            "TargetClusterIdentifier": "!Import rds-mysql-instance-global-arn",
            "SourceRDSInstanceIdentifier": "!Import rds-mysql-instance-source-identifier",
            "SourceRegion": "!Import rds-mysql-instance-SourceRegion",
            "MultiAZ": "!Import rds-mysql-instance-MultiAZ",
            "DBInstanceClass": "!Import rds-mysql-instance-DBInstanceClass",
            "DBSubnetGroup": "!Import rds-mysql-instance-DBSubnetGroup",
            "DBSecurityGroup": "!Import rds-mysql-instance-DBSecurityGroup",
            "KmsKeyId": "!Import rds-mysql-instance-KmsKeyId",
            "BackupRetentionPeriod": "7",
            "MonitoringInterval": "60",
            "StorageEncrypted": "True",
            "EnableIAMDatabaseAuthentication": "True",
            "DeletionProtection": "True",
            "CopyTagsToSnapshot": "True",
            "AutoMinorVersionUpgrade": "True",
            "Port": "!Import rds-mysql-instance-DBPortNumber",
            "MonitoringRoleArn": "!Import rds-mysql-instance-RDSMonitoringRole"
          }
        }
      ]
    }
  ]
}
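Every value in this payload is a string, including flags such as "True" and numbers such as "7". A function that consumes these parameters would probably need to convert them to typed values before calling an API. This document doesn't show how the framework does that, so the following is only a sketch: the key groupings and the function names are assumptions, and the input is assumed to have its `!Import` placeholders already resolved.

```python
# Hypothetical groupings, based only on the keys in the sample payload.
BOOLEAN_KEYS = {
    "MultiAZ", "StorageEncrypted", "EnableIAMDatabaseAuthentication",
    "DeletionProtection", "CopyTagsToSnapshot", "AutoMinorVersionUpgrade",
}
INTEGER_KEYS = {"BackupRetentionPeriod", "MonitoringInterval", "Port"}


def parse_bool(value):
    """Convert "True"/"False" (any case) to a bool; reject anything else."""
    if isinstance(value, bool):
        return value
    text = str(value).strip().lower()
    if text == "true":
        return True
    if text == "false":
        return False
    raise ValueError(f"Not a boolean value: {value!r}")


def coerce_parameters(parameters):
    """Return a copy of parameters with known keys converted to bool or int."""
    typed = {}
    for key, value in parameters.items():
        if key in BOOLEAN_KEYS:
            typed[key] = parse_bool(value)
        elif key in INTEGER_KEYS:
            # int() raises ValueError for non-numeric text such as an
            # unresolved "!Import ..." placeholder.
            typed[key] = int(value)
        else:
            typed[key] = value
    return typed
```

Converting by key name, rather than guessing from the value, avoids turning an identifier that happens to contain only digits into a number.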