# Security Incident Response in AMS

Security is the top priority at AWS Managed Services (AMS). AMS deploys resources and controls in your accounts to manage them. AWS has a shared responsibility model: AWS manages the security of the cloud, and you are responsible for security in the cloud. AMS protects your data and assets and helps keep your AWS infrastructure secure by using security controls and active monitoring for security issues. These capabilities help you establish a security baseline for applications running in the AWS Cloud. AMS collaborates with you through Security Incident Response to assess the effect, and then carry out containment and remediations based on best practice recommendations.

When a deviation from the baseline occurs, such as by a misconfiguration or a change in external factors, you need to respond and investigate. To successfully do so, you need to understand the basic concepts of Security Incident Response within your AMS environment. You must also understand the requirements to prepare, educate, and train cloud teams before security issues occur. It is important to know the controls and capabilities that you can use, prepare response plans for common security issues such as a user account compromise or a misuse of privileged accounts, and identify remediation methods that use automation to improve response speed and consistency. Additionally, you need to understand your compliance and regulatory requirements as they relate to building a Security Incident Response program to fulfill those requirements.

Security Incident Response can be complex, but by implementing an iterative approach you can simplify the process and allow the incident response team to keep asset stakeholders satisfied by providing early and continuous detection and response. In this guide, we provide you with the methodology that AMS uses for incident response, the AMS responsibility matrix (RACI), how you can be prepared for a security event, how to engage AMS during security incidents, and some of the incident response runbooks that AMS uses.

# How AMS Security Incident Response works

AWS Managed Services aligns to the NIST 800-61 [Computer Security Incident Handling Guide](https://csrc.nist.gov/publications/detail/sp/800-61/rev-2/final) for Security Incident Response. By aligning to this industry standard, we provide a consistent approach to security event management and adhere to best practices in securing and responding to security incidents in your cloud.

![\[Incident response lifecycle\]](http://docs.aws.amazon.com/managedservices/latest/accelerate-guide/images/sec-inc-response-1.png)


**Incident response lifecycle**

When a detection generates a security alert, or when you request security assistance, the AWS Managed Services Operations team makes sure that there is a timely investigation. The team runs automations to collect data, triages and analyzes the event, informs you of the analysis, performs investigation and any containment activities, and then performs post-event analysis.

The data collection, triage, analysis, and containment activities performed during the incident response vary depending on the type of security event being investigated. Example Security Incident Response workflows for select scenarios are at the end of this document.

During incidents, AMS determines the correct course of action dynamically, which might result in documented steps being re-ordered or bypassed as appropriate to make sure that the right outcome occurs.

# Prepare

As the threat landscape evolves, AMS continues to expand detection and response capabilities. As new detections are added, AMS incorporates the alerts from these new detections into the detection and response platform. AMS security responders are trained to investigate and partner with you throughout the Security Incident Response lifecycle.

Because of this partnership approach, it's important that your security and application teams are prepared to engage with AMS to handle security events as these events occur. This documentation explains what to expect during a security event and helps you prepare for rapid response when a security incident occurs.

This documentation uses the NIST 800-61 definition of an **event** as any observable occurrence in a system or network and an **incident** as a violation or imminent threat of violation of policies, acceptable use policies, or standard security practices.

## Preparation checklist

Work through the following checklist with your AMS cloud service delivery manager (CSDM) and AMS cloud architect (CA):
+ Understand what workloads are running in which accounts.
+ Understand what internal teams are responsible for the various workloads and tag them appropriately in the workloads.
+ Maintain contact details internally for other teams who might be required during a security event investigation and for containment decisions.
+ Confirm that security contacts are up to date and added to all managed AWS accounts. The contacts are managed on a per account basis.
+ Know how to raise a security incident to AMS, and be familiar with the severity levels and expected response times.
+ Make sure that when security notifications are received, they are routed to the appropriate people and systems such as pagers or your security operations center.
+ Understand what log sources are available to you, where these are stored in your accounts and who has access to them.
+ Understand how to use CloudWatch Logs Insights to query logs during investigations.
+ Understand the containment options available to you by resource (EC2, IAM, S3, and so on) and the consequences for your workload availability when a resource is in containment.
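
To practice the log-query item in the checklist above, the following minimal sketch builds a CloudWatch Logs Insights query string for root user events. It assumes that your CloudTrail events are delivered to a CloudWatch Logs log group, which might not match your account setup. The function name `root_activity_query` and the selected fields are illustrative and are not part of any AMS tooling.

```python
def root_activity_query(limit: int = 50) -> str:
    """Build a Logs Insights query that lists recent root user CloudTrail events.

    Assumption: the target log group contains CloudTrail records in JSON form,
    so nested fields such as userIdentity.type can be referenced directly.
    """
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    stages = [
        "fields @timestamp, eventName, eventSource, sourceIPAddress, userAgent",
        'filter userIdentity.type = "Root"',
        "sort @timestamp desc",
        f"limit {limit}",
    ]
    # Logs Insights chains query commands with the pipe character.
    return "\n| ".join(stages)


if __name__ == "__main__":
    print(root_activity_query(20))
```

You can paste the printed query into the Logs Insights console against the log group that holds your CloudTrail events, then adjust the fields to suit your investigation.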

# Detect

During the management of your AWS accounts, AMS monitors for anomalies in user behavior, account activities and potential security events using data collected from detection sources and controls including but not limited to Amazon CloudWatch, Amazon GuardDuty, VPC Flow Logs, Amazon Macie, AWS Config and Amazon internal Threat Intelligence feeds.

AMS uses both native AWS services and other detection technologies to respond to security events created by:
+ Config Conformance Finding Types
+ GuardDuty Finding Types
+ Macie Finding Types
+ Amazon Route 53 Resolver DNS Firewall Events
+ AMS Security events (CloudWatch alarms)

Additional findings are added as services, products, and threat ecosystems evolve.

## Report security events to AMS

Raise an incident through the AMS Support Portal or Support Center to notify AMS of a security incident or to request investigations.

# Analyze

After a security event is identified and reported, the next step is to analyze whether the reported event is a false positive or a real incident. AMS uses automation and manual investigative techniques to handle security events. The analysis includes investigation of logs from different detection sources, such as network traffic logs, host logs, CloudTrail events, and AWS service logs. The analysis also uses correlation to look for patterns that indicate anomalous behavior.

Your partnership is required to understand context specific to the account environment and to establish what is normal for your account and workloads. This helps AMS identify anomalies faster and accelerates incident response.

## Handle communications from AMS about security events

AMS keeps you informed during the investigation by engaging your security contacts through an incident ticket. Your AMS cloud service delivery manager (CSDM) and AMS cloud architect (CA) are the points of contact for any communication during an active security investigation.

Communication includes automated notifications when a security alert is generated, communication after event analysis, call bridges, and the ongoing delivery of artifacts (such as log files and snapshots of infected resources) and investigation results during the security event.

Standard fields included in AMS security alert notifications are listed below. These fields provide you with information so that you can route events to the appropriate teams within your organization for remediation.
+ Finding Type
+ Finding Identifier (Where relevant)
+ Finding Severity
+ Finding Description
+ Finding created Date & Time
+ AWS Account Id
+ Region (Where relevant)
+ AWS Resources (IAM user/role/policy, EC2, S3, EKS)

Additional fields are provided depending on the Finding Type, for example EKS Findings include Pod, Container, and Cluster details.
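
To illustrate how the notification fields listed above can drive routing inside your organization, here is a minimal sketch. The `SecurityAlert` class, the `ACCOUNT_OWNERS` table, and the queue names are hypothetical; AMS does not publish this structure, and the real notification format might differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SecurityAlert:
    """Hypothetical container for the standard fields of an AMS security alert."""
    finding_type: str
    severity: str
    description: str
    created_at: str
    account_id: str
    finding_id: Optional[str] = None   # present where relevant
    region: Optional[str] = None       # present where relevant
    resources: List[str] = field(default_factory=list)


# Hypothetical mapping from AWS account ID to the team that owns its workloads.
ACCOUNT_OWNERS = {"111122223333": "payments-oncall"}
DEFAULT_QUEUE = "security-operations-center"


def route_alert(alert: SecurityAlert) -> str:
    """Return the team queue for an alert, falling back to the central queue."""
    return ACCOUNT_OWNERS.get(alert.account_id, DEFAULT_QUEUE)


if __name__ == "__main__":
    alert = SecurityAlert(
        finding_type="UnauthorizedAccess:EC2/SSHBruteForce",
        severity="Low",
        description="Example finding for illustration only",
        created_at="2024-05-01T09:00:00+00:00",
        account_id="111122223333",
        resources=["EC2"],
    )
    print(route_alert(alert))
```

Keeping the owner table alongside your workload tags makes it easier to keep routing current as accounts change hands.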

# Contain

AMS's approach to containment is partnership with you. You understand your business and the workload impacts that might occur from containment activities, such as network isolation, IAM user or role de-provisioning, instance re-building, and so forth.

An essential part of containment is decision-making. For example, you might need to decide whether to shut down a system, isolate a resource from the network, turn off access, or end sessions. These decisions are easier to make if there are predetermined strategies and procedures to contain the incident. AMS provides the containment strategy and then implements the solution after you have considered the risk involved with implementing the containment actions.

There are different containment options depending on the resources under analysis. AMS expects multiple types of containment to be deployed simultaneously during an incident investigation. Examples include:
+ Apply protection rules to block unauthorized traffic (Security group, NACL, WAF Rules, SCP rules, Deny listing, setting signature action to quarantine or block)
+ Resource Isolation
+ Network Isolation
+ Disabling IAM users, roles and policies
+ Modifying/Reducing IAM user, role privilege
+ Terminating / Suspending / Deleting compute resources
+ Restricting public access from affected resource
+ Rotating access keys, API keys, and passwords
+ Scrubbing disclosed credentials and sensitive information

AMS encourages you to consider which containment strategies for each major incident type are within your risk appetite, and to clearly document the criteria to help with decision-making in the event of an incident. Criteria to determine the appropriate strategy include:
+ Potential damage to resources
+ Preservation of evidence
+ Service unavailability (for example, network connectivity, services provided to external parties)
+ Time and resources needed to implement the strategy
+ Effectiveness of the strategy (For example, partial containment, full containment)
+ Permanence of the solution (For example, one-way door vs two-way door decisions)
+ Duration of the solution (For example, emergency workaround to be removed in four hours, temporary workaround to be removed in two weeks, permanent solution).
+ Apply security controls that you can turn on to lower the risk and allow time to define and implement a more effective containment. 

The speed of containment is critical. AMS advises a staged approach that combines short-term and long-term strategies to achieve efficient and effective containment.

Use this guide to consider your containment strategy that involves different techniques based on the resource type.
+ Containment Strategy
  + Can AMS identify the scope of the security incident?
    + If yes, identify all the resources (users, systems, resources).
    + If no, investigate in parallel with executing the next step on identified resources.
  + Can the resource be isolated?
    + If yes, then proceed to isolate the affected resources.
    + If no, then work with system owners and managers to determine further actions necessary to contain the problem.
  + Are all affected resources isolated from non-affected resources?
    + If yes, then continue to the next step.
    + If no, then continue to isolate affected resources until short-term containment is accomplished to prevent the incident from escalating further.
+ System Backup
  + Were backup copies of affected systems created for further analysis?
  + Are the forensic copies encrypted and stored in a secure location?
    + If yes, then continue to the next step.
    + If no, encrypt the forensic images, then store them in a secure location to prevent accidental usage, damage, and tampering.
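
The decision guide above can be sketched as a small function that turns the answers to each question into the next actions. This is an illustration only: the `ContainmentState` fields and the wording of the returned steps are assumptions, and the "create backup copies" step for a missing backup is inferred because the guide does not state that branch explicitly.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ContainmentState:
    """Answers to the questions in the containment guide (hypothetical model)."""
    scope_identified: bool
    can_isolate: bool
    all_isolated: bool
    backups_created: bool
    copies_secured: bool  # forensic copies encrypted and stored securely


def next_steps(state: ContainmentState) -> List[str]:
    """Return the outstanding actions implied by the current answers."""
    steps = []
    if state.scope_identified:
        steps.append("identify all affected resources")
    else:
        steps.append("keep investigating scope in parallel with containment")

    if not state.can_isolate:
        steps.append("work with system owners and managers on further actions")
    elif not state.all_isolated:
        steps.append("continue isolating affected resources")

    if not state.backups_created:
        # Inferred branch: the guide asks the question without a "no" path.
        steps.append("create backup copies of affected systems")
    elif not state.copies_secured:
        steps.append("encrypt forensic copies and store them securely")
    return steps


if __name__ == "__main__":
    state = ContainmentState(True, True, False, True, False)
    for step in next_steps(state):
        print("-", step)
```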

# Eradicate

After an incident is contained, eradication might be necessary to eliminate sources of threat altogether to secure the system before you proceed to the next recovery stage. Eradication steps might include deleting malware and removing compromised user accounts, as well as identifying and mitigating all vulnerabilities that were exploited. During eradication, it's important to identify all affected accounts, resources, and instances within the environment so that they can be remediated. 

It's a best practice that eradication and recovery is done in a phased approach so that remediation steps are prioritized. For large-scale incidents, recovery might take months. The intent of the early phases must be to increase the overall security with relatively quick (days to weeks) high value changes to prevent future incidents. The later phases must focus on longer-term changes (for example, infrastructure changes) and ongoing work to keep the enterprise as secure as possible.

For some incidents, eradication is either not necessary or is performed during recovery. 

Consider the following:
+ Can the system be re-imaged and then hardened with patches or other countermeasures to prevent or reduce the risk of attacks?
+ Are all malware and other artifacts left behind by the attackers removed and the affected systems hardened against further attacks?

# Recover

AMS partners with you to restore systems to normal operation, confirm that the systems are functioning normally, and (as applicable) remediate vulnerabilities to prevent similar incidents.

 Consider the following:
+ Are the affected system(s) patched and hardened against the recent attack and possible future attacks?
+ What day and time is feasible to restore the affected systems back into production?
+ What tools will you use to test, monitor, and verify that the systems that you restore to production aren't vulnerable to the initial attack techniques?

# Post Incident Report

After an event, AMS runs an investigation review process for all security incidents. AMS also initiates a correction of error (COE) process to address security incidents caused by a system or procedural miss that has room for improvement. AMS partners with you to continuously improve the security investigation experience. The COE process helps AMS identify the contributing factors of customer-impacting events and connects those factors to next action items that can prevent similar events from recurring, or that help reduce the duration or level of impact.

 The investigation review process for security incidents addresses the following items to identify opportunities for improvement:
+ What was the elapsed time from the beginning of the incident to incident discovery, to the initial impact assessment, and to each stage of the incident handling process (for example, containment, recovery)?
+ How long did it take the incident response team to respond to the initial report of the incident?
+ How long did it take to do an initial impact analysis?
+ Was this preventable and how? Is there a tool or process that could have prevented this?
+ Could we have detected this sooner and how?
+ What could have made the investigation go faster?
+ Were the documented Incident Response Procedures followed? Were they adequate?
+ Was the information sharing with other stakeholders done in a timely manner? How could it be improved?
+ Was the collaboration with other teams (AWS Security, account teams, AWS Development team, and customer security teams) effective? If not, what could be improved?
+ What preparation steps were missing that might have helped, such as escalation matrices, RACIs, and shared responsibility models? Is there a need to update any runbooks?
+ What was the difference between the initial impact assessment and the final impact assessment? What can we do to improve accuracy of assessments earlier in the incident response?
+ What are the Action Items from the Lessons Learned?
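
The timing questions in the list above reduce to elapsed time between recorded stages. The following minimal sketch computes those intervals from ISO 8601 timestamps. The stage names in `STAGES` and the sample timeline are made up for illustration and do not reflect an AMS metric definition.

```python
from datetime import datetime, timedelta
from typing import Dict

# Hypothetical stage names, in the order they are expected to occur.
STAGES = ["started", "discovered", "impact_assessed", "contained", "recovered"]


def stage_durations(timestamps: Dict[str, str]) -> Dict[str, timedelta]:
    """Return the elapsed time between each pair of consecutive recorded stages."""
    recorded = [
        (name, datetime.fromisoformat(timestamps[name]))
        for name in STAGES
        if name in timestamps
    ]
    durations = {}
    for (prev_name, prev_time), (name, time) in zip(recorded, recorded[1:]):
        if time < prev_time:
            raise ValueError(f"{name} is recorded before {prev_name}")
        durations[f"{prev_name}->{name}"] = time - prev_time
    return durations


if __name__ == "__main__":
    timeline = {
        "started": "2024-05-01T09:00:00+00:00",
        "discovered": "2024-05-01T09:45:00+00:00",
        "impact_assessed": "2024-05-01T11:00:00+00:00",
        "contained": "2024-05-01T13:30:00+00:00",
    }
    for stage, elapsed in stage_durations(timeline).items():
        print(stage, elapsed)
```

Stages that have no timestamp are skipped, so the same function works for incidents that are still open.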

# Security Incident Response Runbooks in AMS

This section contains two runbooks:
+ [Response to root user activity](sir-root-user.md)
+ [Response to malware events](sir-malware.md)

# Response to root user activity

The [root user](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_root-user.html) is the superuser within your AWS account. Note that AMS monitors root usage. It's a best practice to use the root user only for the few tasks that require it, such as to change your account settings, activate AWS Identity and Access Management (IAM) access to billing and cost management, change your root password, and turn on multi-factor authentication (MFA). For more information, see [Tasks that require root user credentials](https://docs.aws.amazon.com/general/latest/gr/root-vs-iam.html#aws_tasks-that-require-root).

For more information on how to inform AMS of planned root usage, see [When and how to use the root account in AMS](https://docs.aws.amazon.com/managedservices/latest/userguide/how-when-to-use-root.html).

When root user activity is detected, either failed login attempts that might indicate a brute force attack or activity in the account after a successful login, an event is generated and an incident is sent to your defined security contacts.

AWS Managed Services Operations investigates unplanned root user activity, performs data collection, triage, and analysis, and performs containment activities at your direction, followed by post-event analysis.

If you have the AMS Advanced operating model, you receive additional communications from your AMS CSDM and AMS Ops engineers that confirm unplanned root user activity, because AMS is responsible for securing root user credentials. AMS investigates root user activity until you confirm a path forward.
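
The two detection cases described above (failed sign-in attempts and activity after a successful sign-in) can be told apart in CloudTrail records. The following is a minimal sketch for your own internal checks, not a description of how AMS detection works. It relies on the CloudTrail fields `userIdentity.type`, `eventName`, and `responseElements.ConsoleLogin`; the category labels and sample records are made up.

```python
from collections import Counter
from typing import Iterable


def classify_root_event(record: dict) -> str:
    """Label a CloudTrail record by the kind of root user activity it shows."""
    identity = record.get("userIdentity") or {}
    if identity.get("type") != "Root":
        return "not-root"
    if record.get("eventName") == "ConsoleLogin":
        outcome = (record.get("responseElements") or {}).get("ConsoleLogin")
        if outcome == "Failure":
            return "failed-login"
        if outcome == "Success":
            return "successful-login"
    # Any other root event is activity performed with root credentials.
    return "root-activity"


def summarize(records: Iterable[dict]) -> Counter:
    """Count records per category to show the overall pattern at a glance."""
    return Counter(classify_root_event(r) for r in records)


if __name__ == "__main__":
    sample = [
        {"userIdentity": {"type": "Root"}, "eventName": "ConsoleLogin",
         "responseElements": {"ConsoleLogin": "Failure"}},
        {"userIdentity": {"type": "Root"}, "eventName": "ConsoleLogin",
         "responseElements": {"ConsoleLogin": "Success"}},
        {"userIdentity": {"type": "Root"}, "eventName": "CreateAccessKey"},
        {"userIdentity": {"type": "IAMUser"}, "eventName": "ListBuckets"},
    ]
    print(dict(summarize(sample)))
```

A run of `failed-login` results with no success might point to a brute force attempt, while `root-activity` after a `successful-login` is the case that needs confirmation from whoever holds the root credentials.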

## Prepare

Advise AMS of any planned use of the root user by submitting an AMS service request with the dates and times of the planned event to prevent unnecessary incident response activities.

Periodically conduct GameDays with AMS to validate that AMS's customer incident response processes, people, and systems are current, and to build muscle memory with responsible individuals to achieve faster incident response.
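
If you keep your own record of the planned root usage windows that you submit to AMS, a check like the following can help your team answer quickly whether an observed event was planned. This is a hypothetical sketch for internal use; the function name and the window format are assumptions, and it does not describe how AMS processes service requests.

```python
from datetime import datetime
from typing import List, Tuple


def is_planned(event_time: str, planned_windows: List[Tuple[str, str]]) -> bool:
    """Return True if event_time falls inside any (start, end) window.

    All values are ISO 8601 strings. They should all carry a UTC offset
    (or all omit it), because Python cannot compare offset-aware and
    naive datetimes.
    """
    when = datetime.fromisoformat(event_time)
    return any(
        datetime.fromisoformat(start) <= when <= datetime.fromisoformat(end)
        for start, end in planned_windows
    )


if __name__ == "__main__":
    windows = [("2024-05-01T09:00:00+00:00", "2024-05-01T11:00:00+00:00")]
    print(is_planned("2024-05-01T10:15:00+00:00", windows))
    print(is_planned("2024-05-02T03:00:00+00:00", windows))
```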

## Phase A: Detect

AMS monitors for root activity in the accounts through detection sources including GuardDuty and AMS monitoring.

If you have AMS Accelerate, respond to the incident to request an investigation of unexpected root user activity. When this occurs, AMS Operations initiates the Compromised Account runbook.

If you have AMS Advanced, respond to the incident or inform your CSDM of any planned root user activity to end an active Account Compromise investigation.

## Phase B: Analyze

AMS performs a thorough investigation of the root user events when it's determined that the activity isn't authorized. Using both automation and the AMS security response team, AMS analyzes logs and events for anomalies and unexpected root user behavior. AMS provides logs to you to help determine whether the activity is unknown, is an authorized root user event, or requires further investigation.

Some examples of the information provided during the investigation to support internal checks include:
+ Account information: What account was the root account used on?
+ E-mail address for root user: Each root user is associated with an e-mail address from your organization
+ Authentication details: Where and when did the root user access your environment from?
+ Activity records: What did the user do when logged in as root? These records are in the form of CloudWatch events. Understanding how to read these logs aids in investigation.

It's a best practice that you are prepared to receive the analysis information and have a plan for how to reach authorized points of contact for accounts within your organization. Because root users aren't named as individuals, determining who has access to the root e-mail address used for the account within your organization helps to quickly route questions internally. 

## Phase C: Contain and Eradicate

AMS partners with your security teams to perform containment at the direction of your authorized Customer Security contacts. Containment options include: 
+ Rotating appropriate credentials and keys.
+ Terminating active sessions to accounts and resources.
+ Eradicating resources created.

During the containment activities, AMS works closely with your security team to make sure that any disruption to your workloads is minimized and that the root credentials are appropriately secured.

After the containment plan is completed, you work with AMS Operations team for any recovery actions as required.

## Post Incident Report

As required, AMS initiates the investigation review process to identify any lessons learned. As part of completing a COE, AMS communicates any relevant findings to affected customers to help them improve their incident response process.

AMS documents all final details of the investigation, collects appropriate metrics, and then reports the incident to any AMS internal teams that require information, including your assigned CSDM and CA.

# Response to malware events

Amazon EC2 instances are used to host a variety of workloads, including third-party software and custom-developed software deployed by application teams within organizations. AMS provides images that are patched and maintained on an ongoing basis, and encourages you to deploy your workloads on them.

During the operation of instances, AMS monitors for anomalies in behavior or activity through a variety of security detection controls, including Amazon GuardDuty, Network Traffic, and Amazon internal Threat Intelligence feeds.

AMS also monitors GuardDuty Malware Findings. These are available on both AMS Advanced and AMS Accelerate, if enabled. See [Malware Protection in Amazon GuardDuty](https://docs.aws.amazon.com/guardduty/latest/ug/malware-protection.html) for more information.

**Note**  
If you opted for [Bring Your Own EPS](https://docs.aws.amazon.com/managedservices/latest/userguide/ams-byoeps.html), then the process for incident response differs from what's outlined on this page. For more information, see the referenced documentation.

When malware is detected, an incident is created and you are notified of the event. This notification is followed by any remediation activities that occurred. AMS Operations investigates, performs data collection, triage and analysis, and then performs containment activities at your direction, followed by post event analysis.

## Phase A: Detect

AMS monitors for events on instances with GuardDuty. AMS determines the appropriate enrichment and triage activities to help you make containment or risk acceptance decisions based on the finding or alert type.

Data collection is performed based on the finding type. Data collection involves querying multiple data sources both inside and outside of the affected account to build a picture of the activity observed or the configurations of concern.

AMS performs correlation of the finding with any other alarms and alerts or telemetry from any impacted accounts or AMS threat intelligence platforms.

## Phase B: Analyze

After data is collected, it's analyzed to identify any activity or indicators of concern. During this phase of the investigation, AMS partners with you to integrate business and domain knowledge of the instances and workloads to help understand what's expected and what's out of the ordinary.

Some examples of the information provided during the investigation to support internal checks include:
+ Account Information: What account was the malware activity observed on?
+ Instance Details: What instance(s) are implicated with the malware events?
+ Event timestamp: When did the alert trigger?
+ Workload Information: What is running on the instance? 
+ Malware details, if relevant: Families of malware and Open Source information about the malware.
+ Users or Role Details: What users or roles are affected by and involved in the activity?
+ Activity Records: What activities are recorded on the instance? These are in the form of CloudWatch events and system events from the instance. Understanding how to read these logs aids you in the investigation.
+ Network Activity: What endpoints are connecting to the instance, what is the instance connecting to, and what does the traffic analysis show?
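
Several of the items listed above (account, instance, timestamp) can be read directly from a GuardDuty finding. The following minimal sketch pulls them into a flat summary. The key names follow the shape of the GuardDuty `GetFindings` API response as an assumption about your input; findings delivered through other channels might use different casing. The sample values are made up.

```python
def summarize_finding(finding: dict) -> dict:
    """Extract the investigation basics from a GuardDuty finding dictionary."""
    resource = finding.get("Resource") or {}
    instance = resource.get("InstanceDetails") or {}
    service = finding.get("Service") or {}
    return {
        "account": finding.get("AccountId"),
        "region": finding.get("Region"),
        "finding_type": finding.get("Type"),
        "severity": finding.get("Severity"),
        "instance_id": instance.get("InstanceId"),
        "first_seen": service.get("EventFirstSeen"),
    }


if __name__ == "__main__":
    # Illustrative finding only; not real account data.
    sample = {
        "AccountId": "111122223333",
        "Region": "us-east-1",
        "Type": "Execution:EC2/MaliciousFile",
        "Severity": 8,
        "Resource": {
            "ResourceType": "Instance",
            "InstanceDetails": {"InstanceId": "i-0123456789abcdef0"},
        },
        "Service": {"EventFirstSeen": "2024-05-01T09:00:00Z"},
    }
    for key, value in summarize_finding(sample).items():
        print(f"{key}: {value}")
```

Missing keys produce `None` rather than an error, which keeps the summary usable for finding types that don't involve an instance.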

It's a best practice to be prepared to receive investigation information, and have a plan about how to contact the appropriate points of contact for accounts, instances and workloads within your organization. Understanding your network topology and expected connection can help accelerate impact analysis. Knowledge of planned penetration testing in the environment and recent deployments performed by application owners can also speed up the investigation.

If you determine that the activity is planned and authorized, then the incident is updated and the investigation ends. If compromise is confirmed, then you and AMS determine the appropriate containment plan.

## Phase C: Contain and Eradicate

AMS partners with you to determine appropriate containment activities based on the data collected and information known. Containment options include but are not limited to:
+ Preserving data through snapshots
+ Modifying network rules to limit traffic in or out of instances
+ Modifying SCP, IAM user and role policies to limit access
+ Terminating, Suspending or Turning off Instances
+ Terminating any persistent connections
+ Rotating appropriate credentials/keys
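
When several of the containment options above are selected together, the order in which they run can matter, for example taking a snapshot before an instance is stopped so that evidence is preserved. The following sketch shows one way to express such an ordering. The action names and the priority values are assumptions made for illustration; they are not an AMS-prescribed sequence, and the right order depends on your incident.

```python
from typing import Iterable, List

# Hypothetical priorities: lower numbers run first. Evidence preservation
# comes before restrictive or destructive actions in this sketch.
PRIORITY = {
    "snapshot": 0,
    "restrict_network": 1,
    "restrict_iam": 1,
    "terminate_connections": 2,
    "rotate_credentials": 2,
    "stop_instance": 3,
}


def order_actions(selected: Iterable[str]) -> List[str]:
    """Sort the selected containment actions by priority, keeping ties stable."""
    selected = list(selected)
    unknown = [action for action in selected if action not in PRIORITY]
    if unknown:
        raise ValueError(f"unknown containment actions: {unknown}")
    return sorted(selected, key=PRIORITY.__getitem__)


if __name__ == "__main__":
    plan = order_actions(["stop_instance", "rotate_credentials", "snapshot"])
    print(plan)
```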

If you opt to perform eradication activity against the instance, then AMS supports you in achieving this. Options include, but are not limited to:
+ Removing any unwanted software
+ Rebuilding the instance from a clean fully patched image and redeploying applications and configuration
+ Restoring the instance from a previous backup
+ Deploying applications and services on to another instance within your account that might be suitable to host the workloads.

It's important to determine how the malware was delivered and run on the instance before restoration of service to make sure that any additional controls are applied to prevent reoccurrence of the malware on the instance. AMS provides additional insights or information to your forensics partners or teams as necessary to support forensics.

At this point, you work with AMS Operations for the recovery activities. AMS works closely with you to minimize disruption to the workloads and secure the instances.

## Post Incident Report

As required, AMS initiates the investigation review process to identify lessons learned. As part of completing a COE, AMS communicates relevant findings to you to help you improve your incident response process.

AMS documents the final details of the investigation, collects appropriate metrics, and reports the incident to AMS internal teams that require information, including your assigned CSDM and CA.