

# Stage 4: Operate
<a name="stage-4"></a>

You've built a resilient application and tested it. Now the daily reality is keeping it running. But in a startup, you can't watch everything, and you shouldn't try to. The key is to stay alert to what matters without drowning in metrics or overburdening your team.

Start with the customer perspective. [Amazon CloudWatch Synthetics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html) canaries act as automated customers. They continuously test critical user journeys. Have them log in, simulate purchases using test accounts, or access key features, especially during your busiest hours. This helps you understand the customer experience and helps you catch issues before real users do. When a canary fails, you know immediately that something's wrong from a customer's perspective.
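To be notified when a canary fails, you can put a CloudWatch alarm on the `SuccessPercent` metric that Synthetics publishes for each canary. The following is a minimal sketch, not a definitive setup. The canary name, the Amazon SNS topic ARN, the five-minute period, and the choice to treat missing data as a failure are all assumptions to adapt to your environment.

```python
def canary_alarm_params(canary_name: str, sns_topic_arn: str) -> dict:
    """Build put_metric_alarm arguments for one canary's success rate.

    canary_name and sns_topic_arn are hypothetical values supplied by you.
    """
    return {
        "AlarmName": f"{canary_name}-failing",
        "AlarmDescription": f"Canary {canary_name} is failing its user journey.",
        "Namespace": "CloudWatchSynthetics",
        "MetricName": "SuccessPercent",
        "Dimensions": [{"Name": "CanaryName", "Value": canary_name}],
        "Statistic": "Average",
        "Period": 300,                 # assumption: evaluate every 5 minutes
        "EvaluationPeriods": 1,
        "Threshold": 100.0,            # alarm on any failed run in the period
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "breaching",  # assumption: a silent canary is a problem
        "AlarmActions": [sns_topic_arn],
    }


def create_canary_alarm(canary_name: str, sns_topic_arn: str) -> None:
    """Create or update the alarm. Requires boto3 and AWS credentials."""
    import boto3  # imported here so the builder above works without boto3

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(**canary_alarm_params(canary_name, sns_topic_arn))
```

Keeping the parameter builder separate from the AWS call lets you review or unit test the alarm definition without touching a real account.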

Build on this foundation with focused monitoring of the supporting infrastructure. What signals tell you there's trouble? [Amazon CloudWatch](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html) helps you build dashboards that track these signs. Don't just monitor technical metrics; tie them to business impact. For example, high CPU usage matters mainly because it can degrade the customer experience that you're tracking with canaries.

As a practical approach, map your monitoring to your customer journeys. If you're running a software as a service (SaaS) platform, you likely care about API response times, authentication success rates, and core feature availability. Set up alerts that tell you when these metrics drift from their normal range. However, be selective. Every alert should demand action. If your team starts ignoring alerts because "it's probably nothing," you've set too many alerts or are tracking the wrong metrics.
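One way to keep an alert actionable is to require a sustained breach before it fires and to put the runbook link in the alarm description. The following sketch builds such a definition for API latency. It assumes that your API runs on Amazon API Gateway (REST API), which publishes a `Latency` metric in milliseconds under the `AWS/ApiGateway` namespace. The API name, threshold, runbook URL, and the "3 out of 5 minutes" rule are hypothetical values for illustration.

```python
def latency_alarm_params(
    api_name: str, threshold_ms: int, runbook_url: str, sns_topic_arn: str
) -> dict:
    """Build put_metric_alarm arguments for sustained high p99 latency."""
    return {
        "AlarmName": f"{api_name}-p99-latency-high",
        # The description travels with every notification, so link the runbook here.
        "AlarmDescription": (
            f"p99 latency above {threshold_ms} ms. Runbook: {runbook_url}"
        ),
        "Namespace": "AWS/ApiGateway",
        "MetricName": "Latency",
        "Dimensions": [{"Name": "ApiName", "Value": api_name}],
        "ExtendedStatistic": "p99",
        "Period": 60,
        # Fire only if 3 of the last 5 one-minute datapoints breach,
        # so a single slow minute doesn't page anyone.
        "EvaluationPeriods": 5,
        "DatapointsToAlarm": 3,
        "Threshold": float(threshold_ms),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # assumption: no traffic is not an incident
        "AlarmActions": [sns_topic_arn],
    }
```

You can pass the resulting dictionary to the CloudWatch `put_metric_alarm` API, for example through the boto3 CloudWatch client. If an alarm defined this way still fires without anyone acting on it, that's a signal to raise the threshold or remove the alarm.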

Route these alerts through tools that your team already uses. If your engineers live in a particular messaging application, send alerts there. The goal is quick awareness without creating a new process. When an alert fires, your team should know exactly what it means and what to do about it.
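As an illustration, the following sketch turns a CloudWatch alarm notification, as delivered through Amazon SNS, into a short chat message. It relies on the `AlarmName`, `AlarmDescription`, `NewStateValue`, and `NewStateReason` fields of the alarm's JSON message. The `{"text": ...}` output shape is an assumption that matches simple incoming webhooks in some chat tools; check what your messaging application expects.

```python
import json


def format_alarm_for_chat(sns_message: str) -> dict:
    """Convert a CloudWatch alarm notification (JSON string) to a chat payload."""
    alarm = json.loads(sns_message)
    state = alarm.get("NewStateValue", "UNKNOWN")
    # Make the state obvious at a glance in a busy channel.
    prefix = {"ALARM": "[ALARM]", "OK": "[RESOLVED]"}.get(state, f"[{state}]")
    lines = [
        f"{prefix} {alarm.get('AlarmName', 'unnamed alarm')}",
        # The description is where the runbook link should live.
        alarm.get("AlarmDescription") or "No description set.",
        f"Reason: {alarm.get('NewStateReason', 'n/a')}",
    ]
    return {"text": "\n".join(lines)}
```

A function like this could run in a small AWS Lambda function that is subscribed to the alert topic and posts the payload to your chat tool's webhook. Keeping the message to three lines (what fired, what to do, why it fired) supports the goal of quick awareness.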

Keep your operational documentation lean and practical. Store runbooks with your code in version control, but remember that they're not novels. When something breaks, your team needs clear, actionable steps. Each alert should link to a corresponding runbook, and each runbook should answer three questions: 
+ What broke?
+ Why does it matter?
+ How do I fix it?
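
Because runbooks live in version control, a small check can confirm that each one answers the three questions. The following is a minimal sketch that assumes runbooks are Markdown files in which each question appears as a heading. That layout and the function name are hypothetical conventions, not a requirement.

```python
import re

REQUIRED_SECTIONS = ("What broke?", "Why does it matter?", "How do I fix it?")


def missing_sections(runbook_markdown: str) -> list[str]:
    """Return the required questions that have no matching Markdown heading."""
    headings = {
        match.group(1).strip().lower()
        for match in re.finditer(
            r"^#+[ \t]+(.*)$", runbook_markdown, flags=re.MULTILINE
        )
    }
    return [s for s in REQUIRED_SECTIONS if s.lower() not in headings]
```

You could run a check like this in your build pipeline over every runbook file and fail the build when the returned list isn't empty.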

Implement a simple incident management process. You don't need complex frameworks, just clear definitions of what constitutes an incident and who to call when things escalate. Keep incident logs, because reviewing them reveals recurring failures and shows where to improve your application's resilience.

The key is finding the sweet spot between vigilance and overhead. Use AWS tools to automate what you can, focus on monitoring metrics that impact customers, and keep your processes light enough to evolve as you grow.

The next chapter explores how to foster a resilience mindset without sacrificing the speed and innovation that make startups special. At the end of the day, resilience is as much about people as it is about technology.