

# Data strategy framework
<a name="framework"></a>

The data strategy framework presented in this guide is based on the following tenets of modern data and analytics architecture:

1. Use an **integrated, cost-effective, and scalable storage layer**, so every data producer and consumer has the technical capabilities to interact with data.

1. **Security is mandatory**. Apply data privacy rules, protect data with encryption, enable auditing, and automate compliance.

1. **Govern the data to share** it across the company. Provide a unique data catalog and a business glossary so users can find and use the data they need.

1. Select the **right service for the right job.** Consider functionality, scalability, data latency, the effort required to run the service, resilience, integration, and automation when you choose a component.

1. Use **artificial intelligence (AI) and machine learning (ML)**.

1. Provide **data literacy** and tools with **abstractions for business people**.

1. **Test the hypotheses** of your data initiatives and **measure their results**.

The data framework uses the approach of [working back from the customer](https://docs.aws.amazon.com/whitepapers/latest/building-cloud-operating-model/step-1.-work-backwards-from-the-customer.html). This method, which is used at Amazon and AWS, follows five steps:

1. Interview users in your company's business areas. Select business problems and opportunities that could be addressed by data initiatives.

1. Define expected business outcomes within the business areas.

1. Prioritize initiatives that have the highest business impact.

1. Identify data sharing and technical capabilities to achieve business outcomes, and group them in enablement projects.

1. Identify roles and responsibilities to enable data-driven initiatives, and discuss multidisciplinary team building.

The following sections discuss the main stages of this process:
+ [Business discovery](business-discovery.md)
+ [Assessing data availability](data-availability.md)
+ [Technical assessment](technical-assessment.md)
+ [Aligning stories with business goals](align-stories-goals.md)

# Business discovery
<a name="business-discovery"></a>

To perform business interviews effectively, it is important to understand, at a high level, your company's goals that depend on data. For example, these goals might include:
+ Improving business agility
+ Enabling advanced innovation
+ Becoming customer-centric
+ Increasing market share
+ Reaching global markets
+ Launching a new customer platform  

After you align on your company's goals, you should talk to team members in business areas. At a minimum, focus on areas that impact your company's main goals, but if you have a chance, talk to team members in every business area.

In this discovery conversation, you want to learn the goals of each business area or business unit (BU), the metrics they use to measure their area, and how data usage can affect their goals. Here are some examples of questions you might ask:
+ What are your BU's main goals?
+ How will your BU contribute to achieving the company's goals?
+ What are the key projects in your BU?
+ How does each project depend on data?

It is important to gain visibility into key projects, their timeline, how they depend on data, and how they align with, or support, the company's goals. Examples of projects include:
+ Improvements to customer experience through consistent omnichannel interaction, and building awareness of the latest customer actions and issues
+ Creating a recommendation engine based on customer behavior to increase conversion rate and engagement
+ Faster risk calculation for online financial products, so credit approvals don't take so long that the customer turns to another financial institution
+ Better sales forecast accuracy to reduce supply loss
+ Reducing fraud loss by optimizing fraud detection in real time

# Assessing data availability for business
<a name="data-availability"></a>

Use follow-up questions such as the following to understand the gaps between the current state of data availability and what the BU wants to achieve:
+ How does data support your projects and your current business goals?
+ Is it challenging to obtain the right data to use and make decisions?
+ How automated is the process to obtain the data? What are the manual steps involved, if any?
+ When data becomes available, can your team understand and work with it, or do you have to translate the data to your business domain?
+ Do you receive data on a timely basis to support your business decisions?
  + How would getting data faster improve your business? To drive improvements, how fast should data become available?
+ Are your decision-makers missing any data?
  + If yes, which data is missing?
  + What would be the advantage of having this data?
  + How are your main projects affected by the missing data?
+ Do you have any challenges associated with compliance regulations such as General Data Protection Regulation (GDPR) or other standards?
+ Does your BU have data products available that enable applications to take action?
+ Is your area able to deliver machine learning models to improve your business? If not, do other BUs support your business in this area?
+ Are you aware of any data inside the company that is currently not available to your BU but would support your projects or drive improvements in your area?
  + If so, which datasets?
+ Can you rely on the quality of the data available to your area?
  + Does your team perform your own data cleansing process before you use the data?
  + Does your team perform your own quality process before you use the data?
  + When your team works on data availability and produces new data products for analysis, enrichment, and an aggregated vision, can they share these products with other BUs in your company?

# Technical assessment
<a name="technical-assessment"></a>

A technical assessment is important because it gives you a map of the technical capabilities that your company currently has in place. The assessment covers data governance, data ingestion, data transformation, data sharing, the machine learning (ML) platform, processes, and automation.

Here are examples of questions you can ask during the technical assessment, by team. You can add questions based on your context.

## Data engineering team
<a name="data-engineering"></a>
+ What are the current challenges associated with ingesting data for your team? 
+ Are there any external or internal data sources your team needs that aren't available for ingestion? Why aren't they available?
+ What types of data sources do you ingest data from (for example, MySQL databases, Salesforce API, files received, website navigation data)?
+ How long does it take to ingest data from a new data source?
+ Are the processes of ingesting data from a new source automated?
+ How easy is it for a development team to publish transactional data for analytics from their application?
+ Do you have tools for full loads or incremental loads (in batches or micro-batches) from your data source?
+ Do you have change data capture (CDC) tools for continuous loads from your databases?
+ Do you have data streaming options for data ingestion?
+ How do you perform data transformation for batch and real-time data?
+ How do you manage the orchestration of data transformation workflows?
+ Which activities do you perform most frequently: data discovery and cataloging, data ingestion, data transformation, helping business analysts, helping data scientists, data governance, training teams and users?
+ When a dataset is created, how is it classified for data privacy? How do you clean it to make it meaningful for your internal consumers?
+ Are data governance and data stewardship centralized or decentralized?
+ How do you enforce data governance? Do you have an automated process?
+ Who is the data owner and steward in each phase of the pipeline: data ingestion, data processing, data sharing, and data usage? Is there a data domain concept for determining owners and stewards?
+ What are the main challenges in sharing datasets within the organization with access control?
+ Do you use infrastructure as code (IaC) to deploy and manage data pipelines?
+ Do you have a data lake strategy? 
  + Is your data lake distributed or centralized across the organization? 
+ How is your data catalog organized? Is it companywide or per area?
+ Do you have a data lakehouse approach in place?
+ Do you use or plan to use data mesh concepts?
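The questions about cataloging and privacy classification can be made concrete with a small sketch. The following Python example is illustrative only: the dataset, column names, and privacy levels are hypothetical and don't reflect any particular catalog product. It shows the kind of per-column privacy classification a data catalog entry might record when a dataset is created, which supports downstream masking and access-control decisions:

```python
from dataclasses import dataclass, field

# Hypothetical privacy levels; real taxonomies vary by company and regulation.
PUBLIC, INTERNAL, PII = "public", "internal", "pii"

@dataclass
class DatasetEntry:
    """A minimal catalog entry that records a privacy class per column."""
    name: str
    owner: str
    columns: dict = field(default_factory=dict)  # column name -> privacy class

    def classify(self, column: str, level: str) -> None:
        self.columns[column] = level

    def pii_columns(self) -> list:
        """Columns that need masking or restricted access before sharing."""
        return [c for c, lvl in self.columns.items() if lvl == PII]

# Register a dataset and classify its columns at creation time.
orders = DatasetEntry(name="orders", owner="sales-data-team")
orders.classify("order_id", INTERNAL)
orders.classify("customer_email", PII)
orders.classify("total_amount", INTERNAL)

print(orders.pii_columns())  # -> ['customer_email']
```

In practice, this metadata would live in a shared catalog service rather than in application code, but the shape of the record is the same: every dataset carries an owner and a privacy classification from the moment it's created.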

You can complement these questions with the [AWS Well-Architected Framework Data Analytics Lens](https://docs.aws.amazon.com/wellarchitected/latest/analytics-lens/analytics-lens.html).

## Business analysis team
<a name="business-analysis"></a>
+ How would you describe the following characteristics of the data that's available for your work:
  + Cleanliness
  + Quality
  + Classification
  + Metadata
  + Business meaning
+ Does your team participate in business glossary definitions of datasets in your domain?
+ What is the impact of not having the data you need to perform your job at the time you need it?
+ Do you have any examples of scenarios where you don't have access to data or it takes too long to obtain the data? How long does it take to obtain the data you need?
+ How often do you use a smaller dataset than you needed because of technical issues or processing time?
+ Do you have a sandbox environment with the scale and tools that you need?
+ Can you perform A/B testing to validate hypotheses?
+ Are you missing any tools you need to perform your job?
  + Which types of tools?
  + Why aren't they available?
+ Are there any important activities that you don't have time to perform?
+ Which activities consume your time the most?
+ How are your business views refreshed?
  + Are they scheduled and managed automatically?
+ In which scenarios would you need data that's fresher than the data you get?
+ How do you share analyses? Which tools and processes do you use for sharing?
+ Do you often create new data products and make them available to other teams?
  + What is your process for sharing data products with other business areas or across the company?

## Data science teams (to determine model deployment)
<a name="data-science"></a>
+ How would you describe the following characteristics of the data that's available for your work:
  + Cleanliness
  + Quality
  + Classification
  + Metadata
  + Meaning
+ Do you have any automated tools for training, testing, and deploying machine learning (ML) models?
+ Do you have machine size options for performing each step in the creation and deployment of an ML model?
+ How are the ML models put in production?
+ What are the steps to deploy a new model? How automated are they?
+ Do you have the components to train, test, and deploy ML models for batch and real-time data? 
+ Can you use and process a dataset that's large enough to represent the data you need to create the model?
+ How do you monitor your models and take actions to retrain them?
+ How do you measure the impact of the models on your business?
+ Can you perform A/B testing to validate hypotheses for business teams?

For additional questions, see the [AWS Well-Architected Framework Machine Learning Lens](https://docs.aws.amazon.com/wellarchitected/latest/machine-learning-lens/machine-learning-lens.html).

# Aligning stories with business goals
<a name="align-stories-goals"></a>

After you perform business and technical assessments, we recommend that you create a diagram that includes a set of stories for each level of data usage maturity. This visualization makes it easy to align your data usage with your company's business goals. For example, a near real-time fraud detection business outcome requires a near real-time actions capability story.  

The stories are technical capabilities, data sharing mechanisms, people, and processes that are required to achieve the business goals. You write the business outcomes on the right side of the diagram based on your business discovery interviews, and fill the status of each story based on technical assessments. You can then select the stories your company should work on, and create a roadmap.  

The following diagram shows whether each story is required, based on business outcomes. It also shows the current status of each story based on information that you collected in technical assessments. The diagram is usually followed by a report that explains each status in detail.

![Visualizing enablement stories for each data maturity phase](http://docs.aws.amazon.com/prescriptive-guidance/latest/strategy-aws-data/images/enablement-stories.png)


You work back from the right side (*Business outcomes*) to the left side to enable the stories. For example, to enable a story in the third stage (*Insights and reports*), you have to enable its dependencies in the second stage (*Data lake*) and first stage (*Data foundation*).

Based on the assessment and the requirements for business outcomes, each story is classified as green, yellow, gray, or red.
+ Green means that the story is in place and can scale to deliver the business outcomes. For example, in the diagram, the CDC ingestion story in the first (*Data foundation*) stage is green, which means that the company has the tools and process to accomplish the story for the data source they have. The *Better customer experience* business outcome requires ingesting relevant customer data and enriching it with other data inside the company, to better understand the customer and provide personalization.
+ Yellow means that the capability or process exists, but it is not fully functional or will not support the scale that the business outcome requires. For example, in the diagram, the *Centralized data catalog* story in the second (*Data lake*) stage is yellow. This indicates that the company has a central data catalog, but the catalog isn't fully populated with the metadata required by the other stages, or it's used by only a few business areas. This classification impacts the data sharing capabilities in the next (*Insights and reports*) stage.
+ Gray means that the story isn't required.
+ Red means that the story is required by business outcomes but hasn't been implemented. For example, in the diagram, the *Data sharing* story in the *Insights and reports* stage is red. Creating a comprehensive machine learning model for customer recommendations requires grouping datasets, which requires data sharing capabilities. However, this story hasn't been implemented. In this example, data sharing also requires capabilities in the *Data lake* stage to be fully functional, at least for the datasets that are part of the models, but *Data stewardship* hasn't been implemented either.
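One way to make the working-back traversal concrete is a small sketch. In the following Python example, the story names, statuses, and dependencies are illustrative, loosely modeled on the example diagram; your own diagram comes from your assessments. The function walks back from a story through its dependencies in earlier stages and lists everything that isn't yet green:

```python
GREEN, YELLOW, RED, GRAY = "green", "yellow", "red", "gray"

# Each story maps to (status, dependencies in earlier stages).
stories = {
    "CDC ingestion":            (GREEN,  []),
    "Centralized data catalog": (YELLOW, ["CDC ingestion"]),
    "Data stewardship":         (RED,    []),
    "Data sharing":             (RED,    ["Centralized data catalog",
                                          "Data stewardship"]),
}

def blockers(story: str) -> list:
    """Return the story and its dependencies that are not yet green."""
    status, deps = stories[story]
    found = [] if status == GREEN else [story]
    for dep in deps:
        found.extend(blockers(dep))
    return found

# Working back from a business outcome that requires the Data sharing story:
print(blockers("Data sharing"))
# -> ['Data sharing', 'Centralized data catalog', 'Data stewardship']
```

The output is, in effect, the roadmap for that business outcome: every yellow or red story on the dependency path must be brought to green before the outcome can be delivered.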

The story *Data privacy, protection, and compliance* (in the *Data lake* stage) is always required, and it becomes more relevant as new data protection regulations take effect worldwide. For example, the [General Data Protection Regulation (GDPR)](https://gdpr.eu/what-is-gdpr/) in the European Union was followed in the US by the [Virginia Consumer Data Protection Act (CDPA)](https://law.lis.virginia.gov/vacodefull/title59.1/chapter53/) and the [California Consumer Privacy Act (CCPA)](https://oag.ca.gov/privacy/ccpa). Similar laws are already in place in Latin American countries, such as the [Lei Geral de Proteção a Dados Pessoais (LGPD)](https://www.serpro.gov.br/privacidade-protecao-dados) in Brazil, [Mexican data protection](https://www.dataguidance.com/notes/mexico-data-protection-overview) in Mexico, data protection laws in Colombia, [Law 29733](https://www.leyes.congreso.gob.pe/Documentos/Leyes/29733.pdf) in Peru, and [Argentina's Personal Data Protection laws](http://servicios.infoleg.gob.ar/infolegInternet/anexos/320000-324999/323901/norma.htm).