

# AWS models in Clean Rooms ML
<a name="aws-models"></a>

AWS Clean Rooms ML provides a privacy-preserving method for two parties to identify similar users in their data without the need to share their data with each other. The first party brings the training data to AWS Clean Rooms so that they can create and configure a lookalike model and associate it with a collaboration. Then, seed data is brought to the collaboration to create a lookalike segment that resembles the training data.

For a more detailed explanation of how this works, see [Cross-account jobs](ml-behaviors.md#ml-behaviors-cross-account-jobs).

The following topics provide information on how to create and configure AWS models in Clean Rooms ML.

**Topics**
+ [Privacy protections of AWS Clean Rooms ML](ml-privacy.md)
+ [Training data requirements for Clean Rooms ML](ml-training-data-requirements.md)
+ [Seed data requirements for Clean Rooms ML](ml-seed-data-requirements.md)
+ [AWS Clean Rooms ML model evaluation metrics](ml-metrics.md)

# Privacy protections of AWS Clean Rooms ML
<a name="ml-privacy"></a>

Clean Rooms ML is designed to reduce the risk of *membership inference attacks* where the training data provider can learn who is in the seed data and the seed data provider can learn who is in the training data. Several steps are taken to prevent this attack.

First, seed data providers don't directly observe the Clean Rooms ML output, and training data providers can never observe the seed data. Seed data providers can choose to include the seed data in the output segment.

Next, the lookalike model is created from a random sample of the training data. This sample includes a significant number of users that don't match the seed audience. This process makes it harder to determine whether a user was not in the data, which is another avenue for membership inference.

Further, multiple seed users are used for every parameter of the seed-specific lookalike model training. This limits how much the model can overfit, and thus how much can be inferred about a user. As a result, we recommend a minimum seed size of 500 users.

Finally, user-level metrics are never provided to training data providers, which eliminates another avenue for a membership inference attack.

# Training data requirements for Clean Rooms ML
<a name="ml-training-data-requirements"></a>

To successfully create a lookalike model, your training data must meet the following requirements:
+ The training data must be in Parquet, CSV, or JSON format.
**Note**  
Zstandard (ZSTD) compressed Parquet data is not supported.
+ Your training data must be cataloged in AWS Glue. For more information, see [Getting started with the AWS Glue Data Catalog](https://docs.aws.amazon.com//glue/latest/dg/start-data-catalog.html) in the AWS Glue Developer Guide. We recommend using AWS Glue crawlers to create your tables because the schema is inferred automatically.
+ The Amazon S3 bucket that contains the training data and seed data must be in the same AWS Region as your other Clean Rooms ML resources.
+ The training data must contain at least 100,000 unique user IDs with at least two item interactions each.
+ The training data must contain at least 1 million records.
+ The schema specified in the [CreateTrainingDataset](https://docs.aws.amazon.com/cleanrooms-ml/latest/APIReference/API_CreateTrainingDataset.html) action must align with the schema defined when the AWS Glue table was created.
+ The required fields must be provided as defined in the [CreateTrainingDataset](https://docs.aws.amazon.com/cleanrooms-ml/latest/APIReference/API_CreateTrainingDataset.html) action.
+ Optionally, you can provide up to 10 total categorical or numerical features.

Here is an example of a valid training data set in CSV format:

```
USER_ID,ITEM_ID,TIMESTAMP,EVENT_TYPE(CATEGORICAL FEATURE),EVENT_VALUE (NUMERICAL FEATURE)
196,242,881250949,click,15
186,302,891717742,click,13
22,377,878887116,click,10
244,51,880606923,click,20
166,346,886397596,click,10
```
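Before cataloging your data in AWS Glue, you may want to verify the record and user minimums above locally. The following is a minimal illustrative sketch (the helper name and simplified column headers are assumptions; Clean Rooms ML performs its own validation when the training dataset is created):

```python
import csv
import io
from collections import Counter

# Thresholds taken from the requirements listed above.
MIN_UNIQUE_USERS = 100_000
MIN_INTERACTIONS_PER_USER = 2
MIN_RECORDS = 1_000_000

def check_training_data(csv_text):
    """Hypothetical pre-flight check of CSV training data against the
    documented minimums. Counts total records and users with at least
    two item interactions."""
    reader = csv.DictReader(io.StringIO(csv_text))
    interactions = Counter()
    records = 0
    for row in reader:
        records += 1
        interactions[row["USER_ID"]] += 1
    qualified = sum(
        1 for n in interactions.values() if n >= MIN_INTERACTIONS_PER_USER
    )
    return {
        "enough_records": records >= MIN_RECORDS,
        "enough_qualified_users": qualified >= MIN_UNIQUE_USERS,
    }

sample = """USER_ID,ITEM_ID,TIMESTAMP,EVENT_TYPE,EVENT_VALUE
196,242,881250949,click,15
196,302,891717742,click,13
22,377,878887116,click,10
"""
result = check_training_data(sample)
# A three-row sample is far below both minimums, so both checks report False.
```

In practice you would stream the file from Amazon S3 rather than hold it in memory, but the counting logic is the same.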

# Seed data requirements for Clean Rooms ML
<a name="ml-seed-data-requirements"></a>

The seed data for a lookalike model can either come directly from an Amazon S3 bucket or from the results of an SQL query. 

Seed data that's provided directly must meet the following requirements:
+ The seed data must be in JSON Lines format, with one user ID per line.
+ The seed size should be between 25 and 500,000 unique user IDs.
+ The minimum number of seed users must match the minimum matching seed size value that was specified when you created the configured audience model.

The following is an example of valid seed data in JSON Lines format:

```
{"user_id": "abc"}
{"user_id": "def"}
{"user_id": "ghijkl"}
{"user_id": "123"}
{"user_id": "456"}
{"user_id": "7890"}
```

# AWS Clean Rooms ML model evaluation metrics
<a name="ml-metrics"></a>

Clean Rooms ML computes the *recall* and *relevance score* to determine how well your model performs. Recall measures the similarity between the lookalike segment and the training data. The relevance score is used to decide how large the audience should be, not whether the model is well-performing.

*Recall* is an unbiased measure of how similar the lookalike segment is to the training data. Recall is the percentage of the most similar users (by default, the most similar 20%) from a sample of the training data that are included in the seed audience by the audience generation job. Values range from 0 to 1; larger values indicate a better audience. A recall value approximately equal to the maximum bin percentage indicates that the audience model is equivalent to random selection.

We consider this a better evaluation metric than accuracy, precision, and F1 scores because Clean Rooms ML doesn't have accurately labeled true negative users when building its model.
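Conceptually, recall is the fraction of a held-out set of most-similar training users that the generated segment captured. The following sketch illustrates the arithmetic only; the user IDs are invented and the service computes this metric for you:

```python
def recall(most_similar_users, lookalike_segment):
    """Fraction of the most similar training-data users (by default the
    top 20%) that ended up in the generated lookalike segment. A value
    near the bin percentage (0.2 for the default) is close to random."""
    top = set(most_similar_users)
    return len(top & set(lookalike_segment)) / len(top)

# Illustrative IDs: 3 of the 4 most similar users were captured.
score = recall(["u1", "u2", "u3", "u4"], ["u1", "u2", "u4", "u9"])
# -> 0.75
```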

Segment-level *relevance score* is a measure of similarity with values ranging from -1 (least similar) to 1 (most similar). Clean Rooms ML computes a set of relevance scores for various segment sizes to help you determine the best segment size for your data. Relevance scores decrease monotonically as the segment size increases, because larger segments are less similar to the seed data. When the segment-level relevance score reaches 0, the model predicts that all users in the lookalike segment are from the same distribution as the seed data. Increasing the output size beyond that point is likely to include users in the lookalike segment that aren't from the same distribution as the seed data.
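One way to apply this guidance is to pick the largest segment size whose relevance score is still non-negative. A minimal sketch, assuming the (size, score) pairs below stand in for the computed relevance scores:

```python
def largest_nonnegative_segment(scored_sizes):
    """Return the largest segment size whose relevance score is >= 0,
    or None if every score is negative. Assumes scores decrease
    monotonically with size, as described above."""
    eligible = [size for size, score in scored_sizes if score >= 0]
    return max(eligible) if eligible else None

# Illustrative scores: relevance drops below 0 past 100,000 users.
best = largest_nonnegative_segment(
    [(10_000, 0.8), (50_000, 0.3), (100_000, 0.0), (200_000, -0.2)]
)
# -> 100000
```

If every score is negative, that corresponds to the "all negative scores" case discussed below, where narrowing the filters or widening the market is recommended.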

Relevance scores are normalized within a single campaign and should not be used to compare across campaigns. Relevance scores shouldn't be used as the sole evidence for any business outcome, because outcomes are affected by multiple complex factors in addition to relevance, such as inventory quality, inventory type, timing of advertising, and so on.

Relevance scores should not be used to judge the quality of the seed data, but rather to decide whether the seed can be broadened or narrowed. Consider the following examples:
+ All positive scores – This indicates that there are more output users that are predicted as similar than are included in the lookalike segment. This is common for seed data that's part of a large market, such as everybody who has bought toothpaste in the past month. We recommend looking at smaller seed data, such as everybody who has bought toothpaste more than once in the past month.
+ All negative scores or negative for your desired lookalike segment size – This indicates that Clean Rooms ML predicts there aren't enough similar users in the desired lookalike segment size. This can be because the seed data is too specific or the market is too small. We recommend either applying fewer filters to the seed data or widening the market. For example, if the original seed data was customers that bought a stroller and car seat, you could expand the market to customers that bought multiple baby products.

Training data providers determine whether the relevance scores are exposed and the bucket bins where relevance scores are computed.