TELCOREL02-BP02 Implement full mesh between control plane and user plane functions
Implement a resilient architecture in which every control plane node can manage any user plane function in a mesh configuration. For each user plane function, only one control plane node is active at a time. Size each control plane node with enough spare capacity to take over additional user plane functions when needed. This helps ensure that if one or more control plane nodes fail, the remaining nodes can manage the user plane functions without service interruption. The design combines high-capacity centralized control components with distributed user plane functions.
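The takeover behavior described above can be illustrated with a small, purely local sketch. It assumes each user plane function (UPF) has exactly one active control plane node and that capacity is counted as a number of UPFs per node. The function name, node names, and the "most free headroom" selection rule are hypothetical choices for illustration, not part of any AWS service or vendor product.

```python
from typing import Dict, Set


def reassign_on_failure(
    active: Dict[str, str],
    capacity: Dict[str, int],
    failed: Set[str],
) -> Dict[str, str]:
    """Return a new UPF -> control plane node mapping after `failed` nodes are lost.

    active:   current mapping of UPF name to its single active control plane node
    capacity: assumed maximum number of UPFs each control plane node can manage
    failed:   control plane nodes that are no longer available
    """
    survivors = sorted(node for node in capacity if node not in failed)

    # Count how many user plane functions each surviving node already controls.
    load = {node: 0 for node in survivors}
    for node in active.values():
        if node in load:
            load[node] += 1

    result = dict(active)
    for upf in sorted(active):
        if active[upf] not in failed:
            continue  # the active control plane node for this UPF is still healthy
        if not survivors:
            raise RuntimeError("no surviving control plane node")
        # Choose the survivor with the most free headroom (ties: lowest name).
        target = max(survivors, key=lambda node: capacity[node] - load[node])
        if capacity[target] - load[target] <= 0:
            raise RuntimeError(f"no spare capacity left to take over {upf}")
        result[upf] = target
        load[target] += 1
    return result


if __name__ == "__main__":
    mapping = {"upf-1": "cp-a", "upf-2": "cp-a", "upf-3": "cp-b", "upf-4": "cp-c"}
    limits = {"cp-a": 4, "cp-b": 4, "cp-c": 4}
    print(reassign_on_failure(mapping, limits, failed={"cp-a"}))
```

Because every surviving node is a valid takeover candidate for every UPF, the sketch only works if the underlying connectivity really is a full mesh. The `RuntimeError` path corresponds to the case where control plane nodes were not sized with enough spare capacity.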
Desired outcome:
- Enhanced system resilience through redundant connectivity.
- Elimination of single points of failure in the control plane.
- Seamless failover during node failures.
- Maintained service continuity.
Level of risk exposed if this best practice is not established: Low
Implementation guidance
A full mesh connectivity design between the control plane and user plane components is recommended for a highly available telecom network. Key steps are to implement diverse routing paths and to continuously monitor the status of those paths. Clear failover triggers and thresholds, combined with automated recovery procedures, allow the network to respond to and recover from failures quickly. Thorough documentation of these processes, including integration with incident management and troubleshooting guides, further improves the reliability of the overall system. Comprehensive monitoring and observability provide visibility into the network's performance and health, so that potential issues can be identified and mitigated proactively.
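As an illustration of what a "clear failover trigger and threshold" can look like, the following sketch evaluates an "M out of N datapoints" rule locally. The parameter names `datapoints_to_alarm` and `evaluation_periods` are modeled on the Amazon CloudWatch alarm settings of the same meaning, but the code makes no AWS calls. The metric (heartbeat latency in milliseconds) and the threshold values are assumptions chosen for the example.

```python
from typing import Sequence


def should_trigger_failover(
    samples: Sequence[float],
    threshold: float,
    datapoints_to_alarm: int,
    evaluation_periods: int,
) -> bool:
    """Return True when at least `datapoints_to_alarm` of the most recent
    `evaluation_periods` samples exceed `threshold`.

    With fewer samples than `evaluation_periods`, this sketch does not
    trigger; treating insufficient data as healthy is an assumption.
    """
    if not 1 <= datapoints_to_alarm <= evaluation_periods:
        raise ValueError("need 1 <= datapoints_to_alarm <= evaluation_periods")
    if len(samples) < evaluation_periods:
        return False
    window = samples[-evaluation_periods:]
    breaches = sum(1 for value in window if value > threshold)
    return breaches >= datapoints_to_alarm


if __name__ == "__main__":
    # Hypothetical heartbeat latency samples (ms) from one control plane node.
    latency_ms = [20, 25, 400, 450, 30, 500]
    print(should_trigger_failover(latency_ms, threshold=200,
                                  datapoints_to_alarm=3, evaluation_periods=5))
```

Requiring several breaching datapoints inside a window, rather than reacting to a single sample, is one way to keep a transient spike from triggering an unnecessary failover.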
Implementation steps
- Design mesh topology:
  - Use AWS Transit Gateway as a hub (star topology) that connects your control plane and user plane network functions, so that each control plane node can reach and manage any user plane node when needed.
  - Validate capacity planning and failure domain assumptions using Amazon CloudWatch and AWS Auto Scaling.
- Implement connectivity paths:
  - Deploy control plane and user plane instances across multiple AWS Availability Zones and Regions where possible.
  - Configure diverse routing paths using Amazon VPC route tables and AWS Transit Gateway route tables.
  - Use AWS Direct Connect for physical path separation and high-performance connectivity to on-premises systems, including radio access network (RAN) functions.
- Configure routing policies:
  - Implement routing policies using AWS Transit Gateway route tables and AWS Lambda-based custom routing logic to restore service when a control plane node fails.
- Define failover triggers:
  - Use Amazon CloudWatch alarms and metrics to define the conditions and thresholds for triggering automated failover processes.
  - Provide manual override capabilities through AWS Lambda functions or Amazon API Gateway.
- Document procedures:
  - Store detailed failover and recovery procedures in AWS Systems Manager for secure access and versioning.
  - Integrate the documentation with your incident management processes, and provide troubleshooting guides and escalation procedures.
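The capacity planning step listed above can be sanity-checked with a simple aggregate headroom calculation before any deployment. The sketch below assumes capacity and demand are expressed in the same unit (for example, sessions) and considers the worst case, where the largest control plane nodes are the ones that fail. The function names and numbers are hypothetical. The check is aggregate only and does not model per-node placement limits or failure domains such as Availability Zones.

```python
from typing import Sequence


def survives_failures(
    capacities: Sequence[int],
    required: int,
    max_failures: int,
) -> bool:
    """Return True if the remaining control plane nodes can still carry
    `required` load after the `max_failures` largest nodes are lost."""
    if max_failures < 0:
        raise ValueError("max_failures must be non-negative")
    keep = max(len(capacities) - max_failures, 0)
    # Sorting ascending and keeping the first `keep` entries drops the largest nodes.
    remaining = sorted(capacities)[:keep]
    return sum(remaining) >= required


def max_tolerable_failures(capacities: Sequence[int], required: int) -> int:
    """Return the largest number of simultaneous node failures that still
    leaves enough aggregate capacity, or -1 if capacity is insufficient
    even with every node healthy."""
    tolerated = -1
    for failures in range(len(capacities) + 1):
        if not survives_failures(capacities, required, failures):
            break
        tolerated = failures
    return tolerated


if __name__ == "__main__":
    # Three hypothetical control plane nodes, each sized for 100 sessions.
    print(max_tolerable_failures([100, 100, 100], required=180))
```

A result lower than the number of failures the design is meant to tolerate suggests that the control plane nodes need more headroom, or that more nodes are required.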
Resources
Key AWS services: