Operational excellence pillar
Operational excellence (OE) represents a dedication to crafting high-quality software solutions that consistently meet and exceed user expectations. The operational excellence pillar of the AWS Well-Architected Framework encompasses proven strategies for effective team organization, robust workload design, efficient large-scale operations, and seamless adaptation to changing requirements over time. By adhering to these principles, organizations can ensure that their systems remain resilient, performant, and aligned with evolving business needs.
Key focus areas for applying this pillar to your WorkSpaces Applications streaming environment:
-
Monitoring and observability
-
Automation and DevOps
-
Operational procedures and documentation
-
Support and incident management
Organize teams around business outcomes
Create a cloud-aligned operating model with strong leadership commitment, where business goals and key performance indicators (KPIs) drive organizational transformation through optimized people, processes, and technology.
-
Team structure. Establish dedicated teams that align with application streaming outcomes. For example:
-
Image management team is responsible for application packaging and image optimization.
-
Fleet operations team manages capacity, performance, and scaling.
-
User experience team handles end-user support and satisfaction.
-
-
KPIs and metrics. Define and track business-aligned metrics such as:
-
Application availability rates
-
Time to deploy new applications
-
Cost per application streaming hour
-
-
Operating model. Create clear processes for:
-
Application onboarding and updates
-
Fleet capacity management
-
User access provisioning
-
Incident response and resolution
-
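As a rough illustration of how two of the KPIs above could be computed from a reporting period, the following is a minimal sketch. The `FleetPeriod` record and its field names are hypothetical; the input values would come from your own usage reports and cost data.

```python
from dataclasses import dataclass


@dataclass
class FleetPeriod:
    """Hypothetical summary of one reporting period for a fleet."""
    total_minutes: int        # length of the reporting period
    unavailable_minutes: int  # minutes in which users could not start sessions
    streaming_hours: float    # total user streaming hours in the period
    total_cost_usd: float     # fleet cost attributed to the period


def availability_rate(period: FleetPeriod) -> float:
    """Return availability as a percentage of the reporting period."""
    if period.total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    available = period.total_minutes - period.unavailable_minutes
    return 100.0 * available / period.total_minutes


def cost_per_streaming_hour(period: FleetPeriod) -> float:
    """Return cost per streaming hour, or 0.0 when nothing was streamed."""
    if period.streaming_hours <= 0:
        return 0.0
    return period.total_cost_usd / period.streaming_hours


if __name__ == "__main__":
    # Illustrative numbers only: a 30-day period with 108 minutes of downtime.
    june = FleetPeriod(43200, 108, 5000.0, 1250.0)
    print(f"availability: {availability_rate(june):.2f}%")
    print(f"cost per streaming hour: ${cost_per_streaming_hour(june):.2f}")
```

Keeping the calculation in one place makes it easier to report the same KPI definitions across the image management, fleet operations, and user experience teams.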
Implement observability for actionable insights
Implement comprehensive monitoring and observability to track KPIs and workload health. This principle enables data-driven decisions and proactive improvements across performance, reliability, and cost.
-
Implement performance monitoring. Configure Amazon CloudWatch to:
-
Ensure sufficient capacity to meet demand. For example, you can use the following metrics:
-
AvailableCapacity to monitor available streaming instances
-
InUseCapacity to track currently used instances
-
CapacityUtilization to monitor the percentage of fleet usage
-
-
Monitor user experience and performance.
-
Identify and address service issues promptly.
-
-
Track and analyze WorkSpaces Applications usage reports.
-
Capture and analyze application logs. For more information, see the AWS blog posts Using Kinesis Agent for Linux to stream application logs in WorkSpaces Applications and Using Kinesis Agent for Microsoft Windows to store WorkSpaces Applications Windows event logs.
-
Monitor WorkSpaces Applications metrics and events through chat notifications. For more information, see the AWS blog post Monitor and automate AWS end user computing (EUC) with AWS Chatbot.
-
Enable proactive session management through visual cues. For more information, see the AWS blog post Display session expiration and a countdown timer in Amazon WorkSpaces Applications.
-
Create visualizations for usage patterns and trends. For more information, see the AWS blog post Ingest and visualize Amazon WorkSpaces Applications usage reports in Amazon OpenSearch Service.
-
Use the EUC toolkit to monitor active sessions, track fleet inventory, and generate session reports (CSV export). For more information, see the AWS blog post Use the EUC Toolkit to manage Amazon WorkSpaces Applications and Amazon WorkSpaces.
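The capacity monitoring described above could be expressed as a CloudWatch alarm definition. The sketch below only builds the alarm parameters; it assumes that fleet metrics are published in the `AWS/AppStream` namespace with a `Fleet` dimension (the naming used before the service was renamed), so verify both against the current CloudWatch documentation. The alarm name pattern, the 80 percent default threshold, and the evaluation settings are illustrative choices.

```python
def capacity_alarm_params(fleet_name: str, threshold_percent: float = 80.0) -> dict:
    """Build keyword arguments for a CloudWatch alarm on CapacityUtilization."""
    if not 0 < threshold_percent <= 100:
        raise ValueError("threshold_percent must be in (0, 100]")
    return {
        # Hypothetical naming scheme for the alarm.
        "AlarmName": f"{fleet_name}-capacity-utilization-high",
        # Assumed namespace and dimension for fleet metrics.
        "Namespace": "AWS/AppStream",
        "MetricName": "CapacityUtilization",
        "Dimensions": [{"Name": "Fleet", "Value": fleet_name}],
        "Statistic": "Average",
        "Period": 300,            # seconds per datapoint
        "EvaluationPeriods": 3,   # alarm after three consecutive breaches
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "missing",
    }


if __name__ == "__main__":
    # "sample-fleet" is a placeholder fleet name.
    for key, value in capacity_alarm_params("sample-fleet").items():
        print(f"{key}: {value}")
```

With boto3 installed and credentials configured, the dictionary could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`. Building the parameters separately keeps the thresholds reviewable and testable without calling AWS.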
Safely automate where possible
Apply infrastructure as code (IaC) principles to automate all aspects of your workload operations. Use guardrails to help ensure safe and consistent execution while reducing manual intervention.
-
Automate the creation and configuration of WorkSpaces Applications images by using the Image Assistant CLI. For more information, see Create your Amazon WorkSpaces Applications image programmatically by using the Image Assistant CLI operations in the WorkSpaces Applications documentation.
-
Application installation: Use the Image Assistant CLI to automate the installation of applications during image creation.
-
Image creation: Programmatically create WorkSpaces Applications images by using the Image Assistant CLI commands.
-
Configuration management: Automate the configuration of default application settings and launch parameters.
-
-
Automate the customization of WorkSpaces Applications images. For more information, see the AWS blog post Automatically create customized WorkSpaces Applications Windows images.
-
Apply IaC to deploy the infrastructure and application components for WorkSpaces Applications. For more information, see the AWS blog post Automation of infrastructure and application deployment for Amazon WorkSpaces Applications with Terraform.
-
Implement automated processes for fleet management, including:
-
Fleet scaling based on demand. Configure automatic scaling policies to adjust fleet capacity automatically based on utilization metrics. For more information, see the AWS blog post Use AWS Lambda to adjust scaling steps and thresholds for Amazon WorkSpaces Applications.
-
Base image updates. Benefit from automatic updates to the WorkSpaces Applications base image that's provided by AWS.
-
Capacity optimization. Set up automated scaling thresholds to optimize resource usage based on demand patterns.
-
-
Configure guardrails to automate safety controls:
-
Maximum fleet size limits. Set upper bounds on fleet capacity to prevent over-provisioning.
-
Scaling policy configuration. Implement step scaling or target tracking scaling policies with appropriate thresholds.
-
Service quotas. Use AWS service quotas as built-in limits to prevent excessive resource allocation.
-
Scale-in protection. Configure scale-in protection to prevent the removal of active instances during scaling events.
-
-
Perform testing and validation, including image builder, fleet, and integration testing.
-
Image builder testing:
-
Test applications directly in the image builder interface.
-
Verify application launch and functionality.
-
Test user settings and configurations.
-
Validate application compatibility.
-
-
Fleet testing:
-
Test streaming sessions from different client devices.
-
Verify user entitlements and access.
-
Validate application performance.
-
Test user experience for elements and operations such as the clipboard, file transfer, and printing.
-
-
Integration testing:
-
Test Active Directory or SAML 2.0-based authentication.
-
Test home folders and persistent storage.
-
Test application entitlements.
-
Test USB device redirection (if configured).
-
-
-
Use the WorkSpaces Applications applications manager to automate application packaging and deployment. For more information, see the AWS blog post Streamline application onboarding with applications manager for Amazon WorkSpaces Applications.
-
Automate the deployment of new application versions by using continuous integration and continuous delivery (CI/CD) pipelines. For more information, see the AWS blog post Screening Eagle: Optimize CI/CD and end user experience in Amazon WorkSpaces Applications.
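The scaling guardrails described above (a maximum fleet size plus a target tracking policy) might be captured as request parameters for Application Auto Scaling. This is a sketch under stated assumptions: it assumes the `appstream` service namespace, the `appstream:fleet:DesiredCapacity` scalable dimension, and the `AppStreamAverageCapacityUtilization` predefined metric still apply to WorkSpaces Applications fleets. The function name, policy name, and cooldown values are illustrative.

```python
def build_scaling_requests(fleet_name: str, min_capacity: int,
                           max_capacity: int, target_utilization: float):
    """Return (scalable_target, scaling_policy) parameter dictionaries."""
    if min_capacity < 0 or max_capacity < min_capacity:
        raise ValueError("require 0 <= min_capacity <= max_capacity")
    if not 0 < target_utilization < 100:
        raise ValueError("target_utilization must be between 0 and 100")

    # Fields shared by both requests (assumed identifiers, see lead-in).
    common = {
        "ServiceNamespace": "appstream",
        "ResourceId": f"fleet/{fleet_name}",
        "ScalableDimension": "appstream:fleet:DesiredCapacity",
    }
    # Guardrail: MaxCapacity is the upper bound that scaling cannot exceed.
    scalable_target = {**common,
                       "MinCapacity": min_capacity,
                       "MaxCapacity": max_capacity}
    scaling_policy = {
        **common,
        "PolicyName": f"{fleet_name}-target-tracking",  # hypothetical name
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_utilization,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "AppStreamAverageCapacityUtilization",
            },
            "ScaleOutCooldown": 300,  # illustrative cooldowns, in seconds
            "ScaleInCooldown": 900,
        },
    }
    return scalable_target, scaling_policy


if __name__ == "__main__":
    target, policy = build_scaling_requests("sample-fleet", 2, 20, 75.0)
    print(target)
    print(policy)
```

With boto3, the two dictionaries could be passed to the `application-autoscaling` client as `register_scalable_target(**target)` and `put_scaling_policy(**policy)`. Validating the bounds before any API call is one way to keep the guardrail under version control alongside the rest of the IaC.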
Make frequent, small, reversible changes
Build loosely coupled, scalable workloads that enable frequent, small-scale automated deployments with minimal risk and easy rollback capabilities.
-
For image updates, use versioned image creation and incremental updates.
-
Versioned image creation:
-
Create new images for each set of changes by using an image builder.
-
Maintain multiple image versions to support rollback scenarios.
-
Use AWS tagging strategies to track image versions and attributes.
-
-
Incremental updates:
-
Make small, incremental changes to applications or configurations.
-
Test updates thoroughly in the image builder before you create a new image.
-
Document all the changes that you made in each new image version.
-
-
-
For controlled fleet updates:
-
Create new fleets with updated images for testing.
-
Modify existing fleet attributes without disrupting active sessions.
-
-
Establish change management procedures for documentation, testing protocols, approval workflows, and monitoring processes.
-
Documentation:
-
Maintain detailed change logs for all image and fleet updates.
-
Document testing procedures and results for each change.
-
Use AWS CloudTrail to track and audit configuration changes.
-
-
Testing protocols:
-
Establish a comprehensive testing process for all changes.
-
Include application functionality, performance, and user experience tests.
-
Conduct testing in the image builder before you create new images.
-
Perform additional testing on non-production fleets before full deployment.
-
-
Approval workflows:
-
Implement an approval process for changes to production environments.
-
Define criteria for changes that require approval versus standard updates.
-
Establish roles and responsibilities for change approval.
-
-
Monitoring and validation:
-
Use Amazon CloudWatch to monitor fleet and application performance after changes.
-
Set up alerts for key metrics to quickly identify issues after updates.
-
Conduct post-implementation reviews to validate change success and gather learnings.
-
-
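To illustrate how versioned images support rollback, the sketch below picks the most recent older image that is still usable. The record layout (`name`, `version`, `state`) is hypothetical and stands in for whatever your tagging strategy records; the `AVAILABLE` state string is an assumption about how usable images are reported.

```python
from typing import Optional


def parse_version(tag: str) -> tuple:
    """Turn a dotted version tag such as '1.10.0' into a comparable tuple."""
    return tuple(int(part) for part in tag.split("."))


def rollback_target(images: list, current_version: str) -> Optional[dict]:
    """Return the newest usable image older than current_version, if any."""
    current_key = parse_version(current_version)
    older = [
        image for image in images
        if image["state"] == "AVAILABLE"
        and parse_version(image["version"]) < current_key
    ]
    # Compare numerically so that 1.10.0 sorts after 1.9.1.
    return max(older, key=lambda image: parse_version(image["version"]),
               default=None)


if __name__ == "__main__":
    # Placeholder inventory; real data would come from image tags.
    inventory = [
        {"name": "app-1.2.0", "version": "1.2.0", "state": "AVAILABLE"},
        {"name": "app-1.9.1", "version": "1.9.1", "state": "FAILED"},
        {"name": "app-1.10.0", "version": "1.10.0", "state": "AVAILABLE"},
    ]
    print(rollback_target(inventory, "1.10.0"))
```

Parsing the version into integers avoids the string-comparison trap where "1.9.1" would sort after "1.10.0", which matters once image versions reach double digits.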
Refine operations procedures frequently
Continuously improve operational procedures through regular reviews, updates, and team engagement to keep all stakeholders informed and aligned with best practices.
-
Documentation management. Maintain current, version-controlled documentation of WorkSpaces Applications procedures in a central location to ensure operational consistency and knowledge sharing across teams.
-
Required documentation: Maintain up-to-date documentation for critical WorkSpaces Applications operations for image creation and management, fleet operations, and troubleshooting.
-
Operational reviews: Monitor and review key operational aspects, including performance metrics and incident management.
-
-
Continuous improvement. Systematically enhance WorkSpaces Applications operations by incorporating AWS service updates, operational metrics, and learned best practices into standard procedures.
-
Service updates: Monitor WorkSpaces Applications release notes for new features, service improvements, security updates, and Regional availability.
-
Best practices: Review and incorporate AWS Well-Architected Framework updates, WorkSpaces Applications best practices, AWS reference architectures, and AWS security recommendations.
-
Knowledge management: Maintain and update standard operating procedures, runbooks, troubleshooting guides, and user support documentation.
-
Anticipate failure
Conduct failure scenario testing regularly to understand risks, validate response procedures, and improve team readiness for handling real incidents.
-
Failure testing. Regularly simulate and test for failures such as fleet capacity exhaustion, application launch failures, and network connectivity issues.
-
Fleet capacity exhaustion:
-
Monitor and test fleet scaling behavior when approaching capacity limits.
-
Configure CloudWatch alarms for CapacityUtilization and AvailableCapacity metrics.
-
Implement procedures for handling capacity constraints during peak usage.
-
-
Application launch failures:
-
Test application launch behavior on streaming instances.
-
Validate application access and performance across different fleet configurations.
-
-
Network connectivity issues:
-
Test streaming session performance across different network conditions.
-
Monitor StreamingSessionLatency for connection quality issues.
-
Ensure proper configuration of VPC settings and security groups.
-
-
-
Recovery procedures. Develop and test procedures for:
-
Fleet failover between AWS Availability Zones. In addition, document procedures for scaling fleet capacity, managing fleet updates, and responding to instance health issues.
-
User data management:
-
Configure and test application settings persistence and storage solutions for home folders in Amazon Simple Storage Service (Amazon S3) for Windows fleets and shared file systems in Amazon Elastic File System (Amazon EFS) for Linux fleets.
-
Validate data synchronization between sessions.
-
-
Service continuity. Maintain procedures for creating new fleet instances, managing image updates, and handling session disconnections.
-
-
Risk management. Identify and mitigate:
-
Capacity constraints by setting appropriate fleet minimum capacity, configuring automatic scaling policies based on demand patterns, and monitoring fleet utilization trends by using CloudWatch metrics such as CapacityUtilization, InUseCapacity, and AvailableCapacity.
-
Performance bottlenecks by tracking key metrics such as StreamingSessionLatency and configuring the appropriate CloudWatch alarms.
-
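Before running a live failure test, it can help to reason about capacity exhaustion on paper. The following is a deliberately simplified tabletop model, not a description of how the service scales: it ignores cooldowns and instance provisioning time, and every name and parameter in it is hypothetical. It replays a demand curve against a step scale-out rule and reports how many session requests would go unserved in each interval.

```python
def simulate_demand_spike(demand, initial_capacity, max_capacity,
                          scale_out_step, scale_out_threshold=75.0):
    """Replay per-interval session demand against a simple step-scaling rule.

    Returns one record per interval with the capacity in effect, the
    sessions in use, the utilization percentage, and unserved requests.
    """
    capacity = initial_capacity
    timeline = []
    for requested in demand:
        in_use = min(requested, capacity)
        unserved = requested - in_use
        utilization = 100.0 * in_use / capacity if capacity else 0.0
        timeline.append({
            "capacity": capacity,
            "in_use": in_use,
            "utilization": utilization,
            "unserved": unserved,
        })
        # Scale out for the next interval, capped by the fleet maximum.
        if utilization > scale_out_threshold:
            capacity = min(capacity + scale_out_step, max_capacity)
    return timeline


if __name__ == "__main__":
    # Illustrative spike: demand jumps from 9 to 14 sessions in one interval.
    for step in simulate_demand_spike([5, 9, 14, 14], initial_capacity=10,
                                      max_capacity=14, scale_out_step=2):
        print(step)
```

In this model, any interval with a nonzero `unserved` count marks a point where the scale-out step or the minimum capacity is too small for the spike, which is the kind of finding a real failure test should then confirm or refute.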
Learn from all operational events and metrics
Foster a culture of continuous improvement by sharing lessons learned from operational events and failures across the organization. Emphasize their impact on business outcomes.
-
Event analysis. Document and analyze service interruptions, performance degradation, user complaints, and capacity issues.
-
Metrics review. Analyze usage patterns, performance trends, cost metrics, and user satisfaction data on a regular basis.
-
Knowledge sharing. Establish processes for team learning sessions, best practice documentation, cross-team knowledge transfer, and incident retrospectives.
Use managed services
Minimize operational overhead by using AWS managed services and building standardized procedures around them. Integrate with the following AWS managed services:
-
AWS Systems Manager for automation
-
Amazon CloudWatch for monitoring
-
AWS Identity and Access Management (IAM) for access control
-
Amazon S3 for user storage for Windows fleets
-
Amazon EFS for user storage for Linux fleets
-
AWS Directory Service for user authentication