

# Analysis templates in AWS Clean Rooms
<a name="create-analysis-template"></a>

Analysis templates work with the [Custom analysis rule in AWS Clean Rooms](analysis-rules-custom.md). With an analysis template, you can define parameters to help you reuse the same query. AWS Clean Rooms supports a subset of parameterization with literal values. 

Analysis templates are collaboration-specific. For each collaboration, members can only see the analysis templates in that collaboration. If you plan to use differential privacy in a collaboration, you should make sure that your analysis templates are compatible with the [general-purpose query structure](analysis-rules-custom.md#dp-query-structure-syntax) of AWS Clean Rooms Differential Privacy.

You can create an analysis template in two ways: using SQL code or using Python code for Spark.

**Topics**
+ [SQL analysis templates](sql-analysis-templates.md)
+ [PySpark analysis templates](pyspark-analysis-templates.md)
+ [Troubleshooting PySpark analysis templates](troubleshooting-pyspark-analysis-templates.md)

# SQL analysis templates
<a name="sql-analysis-templates"></a>

SQL analysis templates enable you to query and analyze data across different datasets within a collaboration. You can use these templates to perform various types of analysis, such as identifying audience overlaps and calculating aggregated metrics.

With SQL analysis templates, you can:
+ Write standard SQL queries
+ Add parameters to make your queries dynamic
+ Control access to specific columns and tables
+ Set aggregation requirements for sensitive data
+ Define input data for the generation of privacy-enhanced synthetic datasets for custom machine learning (ML) models

**Topics**
+ [Creating a SQL analysis template](create-sql-analysis-template.md)
+ [Reviewing a SQL analysis template](review-analysis-template.md)

# Creating a SQL analysis template
<a name="create-sql-analysis-template"></a>

**Prerequisites**

 Before you create a SQL analysis template, you must have:
+ An active AWS Clean Rooms collaboration
+ Access to at least one configured table in the collaboration

  For information about configuring tables in AWS Clean Rooms, see [Creating a configured table in AWS Clean Rooms](create-configured-table.md).
+ Permissions to create analysis templates
+ Basic knowledge of SQL query syntax

The following procedure describes the process of creating a SQL analysis template using the [AWS Clean Rooms console](https://console.aws.amazon.com/cleanrooms/home).

For information about how to create a SQL analysis template using the AWS SDKs, see the [AWS Clean Rooms API Reference](https://docs.aws.amazon.com/clean-rooms/latest/apireference/Welcome.html).
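As a sketch of the SDK path, the following shows the shape of a `CreateAnalysisTemplate` request for a parameterized SQL template. The membership ID, query text, and parameter values here are illustrative assumptions, not real values; consult the API reference for the authoritative field list.

```python
# Hypothetical CreateAnalysisTemplate request payload. The membership ID,
# query, and parameter below are placeholders for illustration only.
request = {
    "membershipIdentifier": "11111111-2222-3333-4444-555555555555",  # assumed
    "name": "overlap-by-date",
    "format": "SQL",
    "source": {
        # :date_period is a template parameter, as in the console example
        "text": "SELECT COUNT(*) FROM table1 "
                "WHERE table1.date + :date_period > table1.date",
    },
    "analysisParameters": [
        {"name": "date_period", "type": "INTEGER", "defaultValue": "30"},
    ],
}

# With AWS credentials configured, this payload could be passed to the
# Clean Rooms client, for example:
#   import boto3
#   boto3.client("cleanrooms").create_analysis_template(**request)
```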

**To create a SQL analysis template**

1. Sign in to the AWS Management Console and open the [AWS Clean Rooms console](https://console.aws.amazon.com/cleanrooms/home) with the AWS account that will function as the collaboration creator.

1. In the left navigation pane, choose **Collaborations**.

1. Choose the collaboration.

1. On the **Templates** tab, go to the **Analysis templates created by you** section.

1. Choose **Create analysis template**.

1. On the **Create analysis template** page, for **Details**, 

   1. Enter a **Name** for the analysis template.

   1. (Optional) Enter a **Description**.

   1. For **Format**, leave the **SQL** option selected.

1. For **Tables**, view the configured tables associated with the collaboration.

1. For **Definition**,

   1. Enter the definition for the analysis template.

   1. Choose **Import from** to import a definition.

   1. (*Optional*) Specify a parameter in the SQL editor by entering a colon (`:`) in front of the parameter name.

      For example: 

      `WHERE table1.date + :date_period > table1.date`

1. If you added parameters previously, under **Parameters – optional**, for each **Parameter name**, choose the **Type** and **Default value** (optional).

1. For **Synthetic data**, if you want to generate synthetic data for model training, select the **Require analysis template output to be synthetic** checkbox.

   For more information, see [Privacy-enhanced synthetic dataset generation](synthetic-data-generation.md).

   1. For **Column classification**, choose a **Column** from the dropdown list. At least five columns are required.

      1. Choose a **Classification** from the dropdown list. This identifies the data type for each column.

         Classification types include:
         + **Numerical** – Continuous numerical values such as measurements or counts
         + **Categorical** – Discrete values or categories such as labels or types

      1. To remove a column, select **Remove**.

      1. To add another column, select **Add another column**. Choose the **Column** and **Classification** from the dropdown lists.

      1. For **Predictive value**, choose a **Column** from the dropdown list. This is the column the custom model uses for prediction after it's trained on the synthetic dataset.

   1. **Advanced settings** allow you to set the **Privacy level** and **Privacy threshold**. Adjust the settings to fit your needs.

      1. For **Privacy level**, enter an epsilon value to determine how much noise the synthetic model adds to protect privacy in your generated dataset. The value must be between 0.0001 and 10.
        + Lower values add more noise, providing stronger privacy protection but potentially reducing utility for the downstream custom model trained on this data.
        + Higher values add less noise, providing more accuracy but potentially reducing privacy protection.

      1. For **Privacy threshold**, enter the highest allowed probability that a membership inference attack could identify members of the original dataset. The value must be between 50.0 and 100.
        + A score of 50% indicates that a membership inference attack can't distinguish members from non-members better than a random guess.
        + For no privacy limit, enter 100%.

        The optimal value depends on your specific use case and privacy requirements. If the privacy threshold is exceeded, the ML input channel creation fails, and you can't use the synthetic dataset to train a model.
**Warning**  
Synthetic data generation protects against inferring whether specific individuals are present in the original dataset and against learning attributes of those individuals. However, it doesn't prevent literal values from the original dataset, including personally identifiable information (PII), from appearing in the synthetic dataset.  
We recommend avoiding values in the input dataset that are associated with only one data subject because these may re-identify a data subject. For example, if only one user lives in a zip code, the presence of that zip code in the synthetic dataset would confirm that user was in the original dataset. Techniques like truncating high-precision values or replacing uncommon categories with *other* can be used to mitigate this risk. These transformations can be part of the query used to create the ML input channel.
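The mitigations mentioned in the warning, truncating high-precision values and bucketing rare categories into *other*, can be sketched in plain Python. The helper and field names below are illustrative only; in practice these transformations would be part of the query that creates the ML input channel.

```python
from collections import Counter

def mitigate(rows, value_key, category_key, min_count=2, precision=0):
    """Illustrative pre-processing sketch (not an AWS Clean Rooms API):
    round high-precision values and replace categories that appear fewer
    than min_count times with 'other'."""
    counts = Counter(row[category_key] for row in rows)
    out = []
    for row in rows:
        row = dict(row)
        row[value_key] = round(row[value_key], precision)  # truncate precision
        if counts[row[category_key]] < min_count:          # rare category
            row[category_key] = "other"
        out.append(row)
    return out

sample = [
    {"salary": 101234.57, "zip": "98101"},
    {"salary": 99678.12,  "zip": "98101"},
    {"salary": 250000.99, "zip": "00001"},  # unique zip: re-identification risk
]
cleaned = mitigate(sample, "salary", "zip")
```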

1. If you want to enable **Tags** for the resource, choose **Add new tag** and then enter the **Key** and **Value** pair.

1. Choose **Create**.

1. You are now ready to inform your collaboration members that they can [Review an analysis template](review-analysis-template.md). (This step is optional if you only want to query your own data.)

# Reviewing a SQL analysis template
<a name="review-analysis-template"></a>

After a collaboration member has created a SQL analysis template, you can review and approve it. After the analysis template is approved, it can be used in a query in AWS Clean Rooms.

**Note**  
When you bring your analysis code into a collaboration, be aware of the following:   
AWS Clean Rooms does not validate or guarantee the behavior of the analysis code.   
If you need to ensure certain behavior, review the code of your collaboration partner directly or work with a trusted third-party auditor to review it.
In the shared security model:  
You (the customer) are responsible for the security of the code running in the environment.
AWS Clean Rooms is responsible for the security of the environment, ensuring that:  
only the approved code runs,  
only specified configured tables are accessible, and  
the only output destination is the result receiver's S3 bucket.

**To review a SQL analysis template using the AWS Clean Rooms console**

1. Sign in to the AWS Management Console and open the [AWS Clean Rooms console](https://console.aws.amazon.com/cleanrooms/home) with the AWS account that will function as the collaboration creator.

1. In the left navigation pane, choose **Collaborations**.

1. Choose the collaboration.

1. On the **Templates** tab, go to the **Analysis templates created by other members** section.

1. Choose the analysis template that has the **Can run status** of **No – requires your review**.

1. Choose **Review**.

1. Review the analysis rule **Overview**, **Definition**, and **Parameters** (if any). 

1. Review the configured tables listed under **Tables referenced in definition**. 

   The **Status** next to each table will read **Template not allowed**.

1. Choose a table.    

You are now ready to query the configured table using a SQL analysis template. For more information, see [Running SQL queries](running-sql-queries.md).

# PySpark analysis templates
<a name="pyspark-analysis-templates"></a>

PySpark analysis templates require a Python user script and an optional virtual environment to use custom and open-source libraries. These files are called artifacts. 

Before you create an analysis template, you first create the artifacts and then store the artifacts in an Amazon S3 bucket. AWS Clean Rooms uses these artifacts when running analysis jobs. AWS Clean Rooms only accesses the artifacts when running a job. 

Before running any code on a PySpark analysis template, AWS Clean Rooms validates artifacts by: 
+ Checking the specific S3 object version used when creating the template
+ Verifying the SHA-256 hash of the artifact 
+ Failing any job where artifacts have been modified or removed

**Note**  
The maximum size of all combined artifacts for a given PySpark analysis template in AWS Clean Rooms is 1 GB.
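You can perform the same SHA-256 check locally before uploading an artifact; a minimal sketch (the throwaway file below stands in for a real user script):

```python
import hashlib
import os
import tempfile

def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 hex digest of an artifact, streaming in chunks
    so files near the 1 GB limit don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demo on a throwaway file standing in for a real user script artifact.
fd, path = tempfile.mkstemp(suffix=".py")
with os.fdopen(fd, "w") as f:
    f.write("def entrypoint(context):\n    return {'results': {}}\n")
digest = sha256_of(path)
```

Comparing this digest with the checksum Amazon S3 reports for the uploaded object confirms the artifact wasn't modified in transit.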

## Security for PySpark analysis templates
<a name="pyspark-analysis-templates-security"></a>

To preserve a secure compute environment, AWS Clean Rooms uses a two-tier compute architecture to isolate user code from system operations. This architecture is based on Amazon EMR Serverless Fine Grained Access Control technology, also known as Membrane. For more information, see [Membrane – Safe and performant data access controls in Apache Spark in the presence of imperative code](https://www.amazon.science/publications/membrane-safe-and-performant-data-access-controls-in-apache-spark-in-the-presence-of-imperative-code).

The compute environment components are divided into a separate user space and system space. The user space executes the PySpark code in the PySpark analysis template. AWS Clean Rooms uses the system space to enable the job to run, including using customer-provided service roles to read data for the job and implementing the column allowlist. As a result of this architecture, customer PySpark code that affects the system space, which can include a small number of Spark SQL and PySpark DataFrame APIs, is blocked.

## PySpark limitations in AWS Clean Rooms
<a name="pyspark-limitations"></a>

When customers submit an approved PySpark analysis template, AWS Clean Rooms runs it in its own secure compute environment, which no customer can access. The environment implements a compute architecture with a separate user space and system space to preserve security isolation. For more information, see [Security for PySpark analysis templates](#pyspark-analysis-templates-security).

Consider the following limitations before you use PySpark in AWS Clean Rooms. 

**Limitations**
+ Only DataFrame outputs are supported
+ Single Spark session per job execution

**Unsupported features**
+ **Data management**
  + Iceberg table formats
  + LakeFormation managed tables
  + Resilient distributed datasets (RDD)
  + Spark streaming
  + Access control for nested columns
+ **Custom functions and extensions**
  + User-defined table functions (UDTFs)
  + HiveUDFs
  + Custom classes in user-defined functions
  + Custom data sources
  + Additional JAR files for:
    + Spark extensions
    + Connectors
    + Metastore configurations
+ **Monitoring and analysis**
  + Spark logging
  + Spark UI
  + `ANALYZE TABLE` commands

**Important**  
These limitations are in place to maintain the security isolation between user and system spaces.  
All restrictions apply regardless of collaboration configuration.  
Future updates may add support for additional features based on security evaluations.

## Best practices
<a name="python-best-practices"></a>

We recommend the following best practices when creating PySpark analysis templates.
+ Design your analysis templates with the [PySpark limitations in AWS Clean Rooms](#pyspark-limitations) in mind.
+ Test your code in a development environment first.
+ Use supported DataFrame operations exclusively.
+ Plan your output structure to work with DataFrame limitations.

We recommend the following best practices for managing artifacts.
+ Keep all PySpark analysis template artifacts in a dedicated S3 bucket or prefix.
+ Use clear version naming for different artifact versions.
+ Create new analysis templates when artifact updates are needed.
+ Maintain an inventory of which templates use which artifact versions.

For more information about how to write Spark code, see the following: 
+ [Apache Spark Examples](https://spark.apache.org/examples.html)
+ [Write a Spark application](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-application.html) in the *Amazon EMR Release Guide*
+ [Tutorial: Writing an AWS Glue for Spark script](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-intro-tutorial.html) in the *AWS Glue User Guide*

The following topics explain how to create Python user scripts and libraries before creating and reviewing the analysis template. 

**Topics**
+ [Security for PySpark analysis templates](#pyspark-analysis-templates-security)
+ [PySpark limitations in AWS Clean Rooms](#pyspark-limitations)
+ [Best practices](#python-best-practices)
+ [Creating a user script](create-user-script.md)
+ [Working with parameters in PySpark analysis templates](pyspark-parameter-handling.md)
+ [Creating a virtual environment (optional)](create-virtual-environment.md)
+ [Storing a user script and virtual environment in S3](store-artifacts-in-s3.md)
+ [Creating a PySpark analysis template](create-pyspark-analysis-template.md)
+ [Reviewing a PySpark analysis template](review-pyspark-analysis-template.md)

# Creating a user script
<a name="create-user-script"></a>

The user script must contain an entrypoint function (in other words, a handler). You can name your user script file with any valid Python filename.

The following procedure describes how to create a user script to define the core functionality of your PySpark analysis.

**Prerequisites**
+ PySpark 1.0 (corresponds to Python 3.11 and Spark 3.5.3)
+ Datasets in Amazon S3 can only be read as configured table associations in the Spark session you define. 
+ Your code can't directly call Amazon S3 and AWS Glue
+ Your code can’t make network calls

**To create a user script**

1. Open a text editor or Integrated Development Environment (IDE) of your choice.

   You can use any text editor or IDE (such as Visual Studio Code, PyCharm, or Notepad++) that supports Python files.

1. Create a new Python file with a name of your choice (for example, **my_analysis.py**).

1. Define an entrypoint function that accepts a context object parameter.

   ```
   def entrypoint(context):
   ```

   The `context` object parameter is a dictionary that provides access to essential Spark components, referenced tables, and analysis parameters. It contains:
   + Spark session access via `context['sparkSession']`
   + Referenced tables via `context['referencedTables']`
   + Analysis parameters via `context['analysisParameters']` (if parameters are defined in the template)

1. Define the results of the entrypoint function: 

   ```
   return results
   ```

   The entrypoint function must return an object containing a `results` dictionary that maps output names to DataFrames.
**Note**  
AWS Clean Rooms automatically writes the DataFrame objects to the S3 bucket of the result receiver.

1. You are now ready to: 

   1. Store this user script in S3. For more information, see [Storing a user script and virtual environment in S3](store-artifacts-in-s3.md).

   1. Create the optional virtual environment to support any additional libraries required by your user script. For more information, see [Creating a virtual environment (optional)](create-virtual-environment.md).
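Before storing the script in S3, you can sanity-check the shape of the value your entrypoint returns. The following helper is an illustrative local testing aid, not part of the AWS Clean Rooms API; it treats DataFrame values opaquely and only validates the structure:

```python
def check_results_shape(returned):
    """Validate that an entrypoint returned {'results': {name: DataFrame}}.
    Values aren't inspected, so any object can stand in for a DataFrame."""
    if not isinstance(returned, dict) or set(returned) != {"results"}:
        raise ValueError("entrypoint must return a dict with a single 'results' key")
    results = returned["results"]
    if not isinstance(results, dict) or not results:
        raise ValueError("'results' must be a non-empty dict of name -> DataFrame")
    for name in results:
        if not isinstance(name, str) or not name:
            raise ValueError(f"output name {name!r} must be a non-empty string")
    return True

# Quick local check with a placeholder standing in for a DataFrame.
ok = check_results_shape({"results": {"output1": object()}})
```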

**Example 1**  

```
# File name: my_analysis.py

def entrypoint(context):
    try:
        # Access Spark session
        spark = context['sparkSession']

        # Access input tables
        input_table1 = context['referencedTables']['table1_name']
        input_table2 = context['referencedTables']['table2_name']

        # Example data processing operations
        output_df1 = input_table1.select("column1", "column2")
        output_df2 = input_table2.join(input_table1, "join_key")
        output_df3 = input_table1.groupBy("category").count()
    
        # Return results - each key creates a separate output folder
        return {
            "results": {
                "output1": output_df1,        # Creates output1/ folder
                "output2": output_df2,        # Creates output2/ folder
                "analysis_summary": output_df3 # Creates analysis_summary/ folder
            }
        }
   
    except Exception as e:
        print(f"Error in main function: {str(e)}")
        raise e
```
The folder structure of this example is as follows:   

```
analysis_results/
│
├── output1/             # Basic selected columns
│   ├── part-00000.parquet
│   └── _SUCCESS
│
├── output2/             # Joined data
│   ├── part-00000.parquet
│   └── _SUCCESS
│
└── analysis_summary/    # Aggregated results
    ├── part-00000.parquet
    └── _SUCCESS
```

**Example 2**  

```
def entrypoint(context):
    try:
        # Get DataFrames from context
        emp_df = context['referencedTables']['employees']
        dept_df = context['referencedTables']['departments']

        # Apply Transformations
        emp_dept_df = emp_df.join(
            dept_df,
            on="dept_id",
            how="left"
        ).select(
            "emp_id",
            "name",
            "salary",
            "dept_name"
        )

        # Return Dataframes
        return {
            "results": {
                "outputTable": emp_dept_df
            }
        }

    except Exception as e:
        print(f"Error in entrypoint function: {str(e)}")
        raise e
```

# Working with parameters in PySpark analysis templates
<a name="pyspark-parameter-handling"></a>

Parameters increase the flexibility of your PySpark analysis templates by allowing different values to be provided at job submission time. Parameters are accessible through the context object passed to your entrypoint function.

**Note**  
Parameters are user-provided strings that can contain arbitrary content.  
Review the code to ensure parameters are handled safely to prevent unexpected behavior in your analysis.
Design parameter handling to work safely regardless of what parameter values are provided at submission time.

## Accessing parameters
<a name="accessing-parameters"></a>

Parameters are available in the `context['analysisParameters']` dictionary. All parameter values are strings.

**Example Accessing parameters safely**  

```
def entrypoint(context):
    # Access parameters from context
    parameters = context['analysisParameters']
    threshold = parameters['threshold']
    table_name = parameters['table_name']
    
    # Continue with analysis using parameters
    spark = context['sparkSession']
    input_df = context['referencedTables'][table_name]
    
    # Convert threshold value
    threshold_val = int(threshold)
    
    # Use parameter in DataFrame operation
    filtered_df = input_df.filter(input_df.amount > threshold_val)
    
    return {
        "results": {
            "output": filtered_df
        }
    }
```

## Parameter security best practices
<a name="parameter-security-best-practices"></a>

**Warning**  
Parameters are user-provided strings that can contain arbitrary content. You must handle parameters safely to prevent security vulnerabilities in your analysis code.

**Unsafe parameter handling patterns to avoid:**
+ **Executing parameters as code** – Never use `eval()` or `exec()` on parameter values

  ```
  # UNSAFE - Don't do this
  eval(parameters['expression'])  # Can execute arbitrary code
  ```
+ **SQL string interpolation** – Never concatenate parameters directly into SQL strings

  ```
  # UNSAFE - Don't do this
  sql = f"SELECT * FROM table WHERE column = '{parameters['value']}'"  # SQL injection risk
  ```
+ **Unsafe file path operations** – Never use parameters directly in file system operations without validation

  ```
  # UNSAFE - Don't do this
  file_path = f"/data/{parameters['filename']}"  # Path traversal risk
  ```

**Safe parameter handling patterns:**
+ **Use parameters in DataFrame operations** – Spark DataFrames handle parameter values safely

  ```
  # SAFE - Use parameters in DataFrame operations
  threshold = int(parameters['threshold'])
  filtered_df = input_df.filter(input_df.value > threshold)
  ```
+ **Validate parameter values** – Check that parameters meet expected formats before use

  ```
  # SAFE - Validate parameters before use
  def validate_date(date_str):
      try:
          from datetime import datetime
          datetime.strptime(date_str, '%Y-%m-%d')
          return True
      except ValueError:
          return False
  
  date_param = parameters['date_filter'] or '2024-01-01'
  if not validate_date(date_param):
      raise ValueError(f"Invalid date format: {date_param}")
  ```
+ **Use allowlists for parameter values** – When possible, validate parameters against known good values

  ```
  # SAFE - Use allowlists
  allowed_columns = ['column1', 'column2', 'column3']
  column_param = parameters['column_name']
  if column_param not in allowed_columns:
      raise ValueError(f"Invalid column: {column_param}")
  ```
+ **Type conversion with error handling** – Convert string parameters to expected types safely

  ```
  # SAFE - Convert with error handling
  try:
      batch_size = int(parameters['batch_size'] or '1000')
      if batch_size <= 0 or batch_size > 10000:
          raise ValueError("Batch size must be between 1 and 10000")
  except ValueError as e:
      print(f"Invalid parameter: {e}")
      raise
  ```

**Important**  
Remember that parameters bypass code review when job runners provide different values. Design your parameter handling to work safely regardless of what parameter values are provided.

## Complete parameter example
<a name="parameter-examples"></a>

**Example Using parameters safely in a PySpark script**  

```
def entrypoint(context):
    try:
        # Access Spark session and tables
        spark = context['sparkSession']
        input_table = context['referencedTables']['sales_data']
        
        # Access parameters - fail fast if analysisParameters missing
        parameters = context['analysisParameters']
        
        # Validate and convert numeric parameter (handles empty strings with default)
        try:
            threshold = int(parameters['threshold'] or '100')
            if threshold <= 0:
                raise ValueError("Threshold must be positive")
        except (ValueError, TypeError) as e:
            print(f"Invalid threshold parameter: {e}")
            raise
        
        # Validate date parameter (handles empty strings with default)
        date_filter = parameters['start_date'] or '2024-01-01'
        from datetime import datetime
        try:
            datetime.strptime(date_filter, '%Y-%m-%d')
        except ValueError:
            raise ValueError(f"Invalid date format: {date_filter}")
        
        # Use parameters safely in DataFrame operations
        filtered_df = input_table.filter(
            (input_table.amount > threshold) &
            (input_table.date >= date_filter)
        )
        
        result_df = filtered_df.groupBy("category").agg(
            {"amount": "sum"}
        )
        
        return {
            "results": {
                "filtered_results": result_df
            }
        }
    
    except Exception as e:
        print(f"Error in analysis: {str(e)}")
        raise
```

# Creating a virtual environment (optional)
<a name="create-virtual-environment"></a>

If you have any additional libraries required by your user script, you have the option to create a virtual environment to store those libraries. If you don't need additional libraries, you can skip this step.

When working with libraries that have native extensions, keep in mind that PySpark in AWS Clean Rooms operates on Linux with ARM64 architecture.

The following procedure demonstrates how to create a virtual environment using a basic CLI command.

**To create a virtual environment**

1. Open a terminal or command prompt.

1. Add the following content:

   ```
   # create and activate a python virtual environment
   python3 -m venv pyspark_venv
   source pyspark_venv/bin/activate
   
   # install the python packages
   pip3 install pandas # add packages here
   
   # package the virtual environment into an archive
   pip3 install venv-pack
   venv-pack -f -o pyspark_venv.tar.gz
   
   # optionally, remove the virtual environment directory
   deactivate
   rm -fr pyspark_venv
   ```

1. You are now ready to store this virtual environment in S3. For more information, see [Storing a user script and virtual environment in S3](store-artifacts-in-s3.md).
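Before uploading, you can confirm what went into the packed archive with the standard library's `tarfile` module. The sketch below builds a tiny stand-in archive so it is self-contained; in practice you would point `list_archive` at your real `pyspark_venv.tar.gz`:

```python
import os
import tarfile
import tempfile

def list_archive(path):
    """Return the member names of a packed virtual environment archive."""
    with tarfile.open(path, "r:gz") as tar:
        return tar.getnames()

# Build a tiny stand-in archive for illustration.
tmpdir = tempfile.mkdtemp()
bindir = os.path.join(tmpdir, "bin")
os.makedirs(bindir)
open(os.path.join(bindir, "activate"), "w").close()
archive = os.path.join(tmpdir, "pyspark_venv.tar.gz")
with tarfile.open(archive, "w:gz") as tar:
    tar.add(bindir, arcname="bin")

members = list_archive(archive)
```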

For more information about working with Docker and Amazon ECR, see the [Amazon ECR User Guide](https://docs.aws.amazon.com/AmazonECR/latest/userguide/).

# Storing a user script and virtual environment in S3
<a name="store-artifacts-in-s3"></a>

The following procedure explains how to store a user script and optional virtual environment in Amazon S3. Complete this step before creating a PySpark analysis template. 

**Important**  
Do not modify or remove artifacts (user scripts or virtual environments) after creating an analysis template.  
Doing so will:  
Cause all future analysis jobs using this template to fail.
Require creation of a new analysis template with new artifacts.
Not affect previously completed analysis jobs.

**Prerequisites**
+ An AWS account with appropriate permissions
+ A user script file (such as `my_analysis.py`)
+ (Optional, if one exists) A virtual environment package (`.tar.gz` file) 
+ Access to create or modify IAM roles

------
#### [ Console ]

**To store a user script and virtual environment in S3 using the console:**

1. Sign in to the AWS Management Console and open the Amazon S3 console at [https://console.aws.amazon.com/s3/](https://console.aws.amazon.com/s3/).

1. Create a new S3 bucket or use an existing one.

1. Enable versioning for the bucket.

   1. Select your bucket.

   1. Choose **Properties**.

   1. In the **Bucket Versioning** section, choose **Edit**.

   1. Select **Enable** and save changes.

1. Upload your artifacts and enable SHA-256 hash. 

   1. Navigate to your bucket.

   1. Choose **Upload**.

   1. Choose **Add files** and add your user script file.

   1. (Optional, if one exists) Add your **.tar.gz** file.

   1. Expand **Properties**.

   1. Under **Checksums**, for **Checksum function**, select **SHA256**.

   1. Choose **Upload**.

1. You are now ready to create a PySpark analysis template.

------
#### [ CLI ]

**To store the user script and virtual environment in S3 using the AWS CLI:**

1. Run the following command:

   ```
   aws s3 cp --checksum-algorithm sha256 pyspark_venv.tar.gz s3://ARTIFACT-BUCKET/EXAMPLE-PREFIX/
   ```

1. You are now ready to create a PySpark analysis template.

------

**Note**  
If you need to update a script or virtual environment:  
Upload the new version as a separate object.
Create a new analysis template using the new artifacts.
Deprecate the old template.
Keep the original artifacts in S3 if the old template might still be needed.

# Creating a PySpark analysis template
<a name="create-pyspark-analysis-template"></a>

**Note**  
Parameters are user-provided strings that can contain arbitrary content.  
Review the code to ensure parameters are handled safely to prevent unexpected behavior in your analysis.
Design parameter handling to work safely regardless of what parameter values are provided at submission time.

**Prerequisites**

 Before you create a PySpark analysis template, you must have:
+ A membership in an active AWS Clean Rooms collaboration
+ Access to at least one configured table in the active collaboration
+ Permissions to create analysis templates
+ A Python user script and a virtual environment created and stored in S3
  + The S3 bucket must have versioning enabled. For more information, see [Using versioning in S3 buckets](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Versioning.html)
  + The S3 bucket must be able to calculate SHA-256 checksums for uploaded artifacts. For more information, see [Using checksums](https://docs.aws.amazon.com/AmazonS3/latest/userguide/checking-object-integrity.html)
+ Permissions to read code from an S3 bucket

  For information about creating the required service role, see [Create a service role to read code from an S3 bucket (PySpark analysis template role)](setting-up-roles.md#create-role-pyspark-analysis-template).

The following procedure describes the process of creating a PySpark analysis template using the [AWS Clean Rooms console](https://console.aws.amazon.com/cleanrooms/home). It assumes that you have already created a user script and virtual environment files and stored your user script and virtual environment files in an Amazon S3 bucket.

**Note**  
The member who creates the PySpark analysis template must also be the member who receives results.

For information about how to create a PySpark analysis template using the AWS SDKs, see the [AWS Clean Rooms API Reference](https://docs.aws.amazon.com/clean-rooms/latest/apireference/Welcome.html).

**To create a PySpark analysis template**

1. Sign in to the AWS Management Console and open the [AWS Clean Rooms console](https://console.aws.amazon.com/cleanrooms/home) with the AWS account that will function as the collaboration creator.

1. In the left navigation pane, choose **Collaborations**.

1. Choose the collaboration.

1. On the **Templates** tab, go to the **Analysis templates created by you** section.

1. Choose **Create analysis template**.

1. On the **Create analysis template** page, for **Details**, 

   1. Enter a **Name** for the analysis template.

   1. (Optional) Enter a **Description**.

   1. For **Format**, choose the **PySpark** option.

1. For **Definition**,

   1. Review the **Prerequisites** and ensure each prerequisite is met before continuing.

   1. For **Entry point file**, enter the S3 bucket or choose **Browse S3**.

   1. (Optional) For **Libraries file**, enter the S3 bucket or choose **Browse S3**.

1. For **Parameters – optional**, if you want to add parameters to make your analysis template reusable:

   1. Choose **Add parameter**.

   1. Enter a **Parameter name**.

      Parameter names must start with a letter or underscore, followed by alphanumeric characters or underscores.

   1. For **Type**, **STRING** is automatically selected as the only supported type for PySpark analysis templates.

   1. (Optional) Enter a **Default value** for the parameter.

      If you provide a default value, job runners can use this value when running jobs without explicitly providing a parameter value.

   1. To add more parameters, choose **Add another parameter** and repeat the previous steps.
**Note**  
You can define up to 50 parameters per PySpark analysis template. Each parameter value can be up to 1,000 characters.

1. For **Tables referenced in the definition**, 
   + If all tables referenced in the definition have been associated to the collaboration:
     + Leave the **All tables referenced in the definition have been associated to the collaboration** checkbox selected.
     + Under **Tables associated to the collaboration**, choose all associated tables that are referenced in the definition. 
   + If all tables referenced in the definition haven't been associated to the collaboration:
     + Clear the **All tables referenced in the definition have been associated to the collaboration** checkbox.
     + Under **Tables associated to the collaboration**, choose all associated tables that are referenced in the definition.
     + Under **Tables that will be associated later**, enter a table name. 
     + Choose **List another table** to list another table.

1. For **Error message configuration**, choose one of the following:
   + **Basic error messages** – returns basic error messages without exposing underlying data. Recommended for production workloads.
   + **Detailed error messages** – returns detailed error messages for faster troubleshooting. Recommended in development and testing environments. May expose sensitive data, including personally identifiable information (PII).
**Note**  
When using **Detailed error messages**, all data provider members must approve this setting for the template.

1. Specify the **Service access** permissions by selecting an **Existing service role name** from the dropdown list.

   1. The list of roles is displayed if you have permissions to list roles.

      If you don't have permissions to list roles, you can enter the Amazon Resource Name (ARN) of the role that you want to use.

   1. View the service role by choosing the **View in IAM** external link.

      If there are no existing service roles, the option to **Use an existing service role** is unavailable.

      By default, AWS Clean Rooms doesn't attempt to update the existing role policy to add necessary permissions. 
**Note**  
AWS Clean Rooms requires permissions to query according to the analysis rules. For more information about permissions for AWS Clean Rooms, see [AWS managed policies for AWS Clean Rooms](security-iam-awsmanpol.md).
If the role doesn’t have sufficient permissions for AWS Clean Rooms, you receive an error message stating that the role doesn't have sufficient permissions for AWS Clean Rooms. The role policy must be added before proceeding.
If you can’t modify the role policy, you receive an error message stating that AWS Clean Rooms couldn't find the policy for the service role.

1. If you want to enable **Tags** for the analysis template resource, choose **Add new tag** and then enter the **Key** and **Value** pair.

1. Choose **Create**.

1. You are now ready to inform your collaboration member that they can [Review an analysis template](review-analysis-template.md). (This step is optional if you want to run jobs on your own data.)

**Important**  
Don't modify or remove artifacts (user scripts or virtual environments) after creating an analysis template. Doing so will:  
+ Cause all future analysis jobs using this template to fail.
+ Require creation of a new analysis template with new artifacts.
+ Not affect previously completed analysis jobs.
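The parameter naming and size limits described in the procedure above can be checked locally before you create the template. The following is a minimal sketch whose regex and limits mirror the documented constraints; it is an illustrative helper, not part of any AWS SDK:

```python
import re

# Constraints from the procedure above: a parameter name starts with a
# letter or underscore followed by alphanumeric characters or underscores;
# at most 50 parameters per template; each value at most 1,000 characters.
PARAM_NAME_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")
MAX_PARAMETERS = 50
MAX_VALUE_LENGTH = 1000

def validate_parameters(params):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    if len(params) > MAX_PARAMETERS:
        problems.append(f"too many parameters: {len(params)} > {MAX_PARAMETERS}")
    for name, value in params.items():
        if not PARAM_NAME_RE.match(name):
            problems.append(f"invalid parameter name: {name!r}")
        if value is not None and len(str(value)) > MAX_VALUE_LENGTH:
            problems.append(
                f"default value for {name!r} exceeds {MAX_VALUE_LENGTH} characters"
            )
    return problems
```

Running this check in CI before calling the AWS Clean Rooms API can catch naming mistakes earlier than a failed console submission.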

# Reviewing a PySpark analysis template
<a name="review-pyspark-analysis-template"></a>

When another member creates an analysis template in your collaboration, you must review and approve it before it can be used. 

The following procedure shows you how to review a PySpark analysis template, including its rules, parameters, and referenced tables. As a collaboration member, you'll assess whether the template aligns with your data sharing agreements and security requirements.

After the analysis template is approved, it can be used in a job in AWS Clean Rooms.

**Note**  
When you bring your analysis code into a collaboration, be aware of the following:  
+ AWS Clean Rooms doesn't validate or guarantee the behavior of the analysis code. If you need to ensure certain behavior, review the code of your collaboration partner directly or work with a trusted third-party auditor to review it.
+ AWS Clean Rooms guarantees that the SHA-256 hashes of the code listed in the PySpark analysis template match the code running in the PySpark analysis environment.
+ AWS Clean Rooms doesn't perform any auditing or security analysis of additional libraries you bring into the environment.
+ In the shared security model:
  + You (the customer) are responsible for the security of the code running in the environment and for setting the appropriate error message configuration for the environment.
  + AWS Clean Rooms is responsible for the security of the environment, ensuring that only the approved code runs, only the specified configured tables are accessible, and the only output destination is the result receiver's S3 bucket.

AWS Clean Rooms generates SHA-256 hashes of the user script and virtual environment for your review. However, the actual user script and libraries aren't directly accessible within AWS Clean Rooms. 

To validate that the shared user script and libraries are the same as those referenced in the analysis template, you can create a SHA-256 hash of the shared files and compare it to the analysis template hash created by AWS Clean Rooms. The hashes of the code that was run also appear in the job logs.
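If you prefer to script the hash comparison rather than use command-line tools, Python's standard library is sufficient. The following is a minimal sketch; the file name in the comment is a placeholder:

```python
import hashlib

def sha256_of_file(path):
    """Compute the SHA-256 hex digest of a file, reading in chunks so
    large virtual environment archives don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Compare the result against the hash shown on the analysis template,
# for example (placeholder file name):
#   sha256_of_file("my_analysis.py") == hash_from_analysis_template
```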

**Prerequisites**
+ Linux/Unix operating system or Windows Subsystem for Linux (WSL)
+ User script file you want to hash
  + Request that the analysis template creator share the file through a secure channel.
+ The analysis template hash created by AWS Clean Rooms

**To review a PySpark analysis template using the AWS Clean Rooms console**

1. Sign in to the AWS Management Console and open the [AWS Clean Rooms console](https://console.aws.amazon.com/cleanrooms/home) with the AWS account that will function as the collaboration creator.

1. In the left navigation pane, choose **Collaborations**.

1. Choose the collaboration.

1. On the **Templates** tab, go to the **Analysis templates created by other members** section.

1. Choose the analysis template that has the **Can run status** of **No - requires your review**.

1. Choose **Review**.

1. Review the analysis rule **Overview**, **Definition**, and **Parameters** (if any). 
**Note**  
Parameters allow analysis runners to submit different values at submission time. If an analysis template supports parameters, review how the parameter values are used in the code of your collaboration partner to ensure they meet your requirements.

1. Validate that the shared user script and libraries are the same as those referenced in the analysis template.

   1. Create a SHA-256 hash of the files shared and compare it to the analysis template hash created by AWS Clean Rooms. 

      You can generate a hash by navigating to the directory containing your user script file and then running the following command: 

      ```
      sha256sum your_script_filename.py
      ```

      Example output:

      ```
      e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 my_analysis.py
      ```

   1. Alternatively, you can use Amazon S3 checksum features. For more information, see [Checking object integrity](https://docs.aws.amazon.com/AmazonS3/latest/userguide/checking-object-integrity.html) in the *Amazon S3 User Guide*.

   1. Another alternative is to view the hashes of the executed code in the job logs.

1. Review the configured tables listed under **Tables referenced in definition**. 

   The **Status** next to each table will read **Template not allowed**.

1. Choose a table.

   1. To approve the analysis template, choose **Allow template on table**. Confirm your approval by choosing **Allow**.

   1. To decline approval, choose **Disallow**.

If you have chosen to approve the analysis template, the member who can run jobs can now run a PySpark job on a configured table using a PySpark analysis template. For more information, see [Running PySpark jobs](run-jobs.md).

# Troubleshooting PySpark analysis templates
<a name="troubleshooting-pyspark-analysis-templates"></a>

When running jobs using PySpark analysis templates, you might encounter failures during job initialization or execution. These failures typically relate to script configuration, data access permissions, or environment setup.

For more information about PySpark limitations, see [PySpark limitations in AWS Clean Rooms](pyspark-analysis-templates.md#pyspark-limitations).

**Topics**
+ [Troubleshooting your code](#troubleshoot-your-code)
+ [Analysis template job doesn't start](#troubleshooting_analysis_template_job_fails_to_start)
+ [Analysis template job starts but fails during processing](#analysis-template-job-failes-to-run)
+ [Virtual environment setup fails](#virtual-environment-setup-fails)

## Troubleshooting your code
<a name="troubleshoot-your-code"></a>

To help you develop and troubleshoot your code, we suggest you simulate AWS Clean Rooms in your own AWS account by enabling **Detailed error messages** and running jobs using your own test data. 

You can also simulate PySpark in AWS Clean Rooms by using Amazon EMR Serverless with the following steps. The simulation has minor differences from PySpark in AWS Clean Rooms, but it largely mirrors how your code runs.

**To simulate PySpark in AWS Clean Rooms in EMR Serverless**

1. Create a dataset in Amazon S3, catalog it in the AWS Glue Data Catalog, and set up Lake Formation permissions.

1. Register the S3 location with Lake Formation using a custom role.

1. Create an Amazon EMR Studio instance if you don’t already have one (Amazon EMR Studio is needed to use Amazon EMR Serverless).

1. Create an EMR Serverless app
   + Select release version emr-7.7.0.
   + Select ARM64 architecture.
   + Opt for **Use custom settings**.
   + Disable preinitialized capacity.
   + If you plan to do interactive work, select **Interactive endpoint** > **Enable endpoint for EMR studio**.
   + Select **Additional configurations** > **Use Lake Formation for fine-grained access control**.
   + Create the application.

1. Use EMR Serverless either through EMR Studio notebooks or the `StartJobRun` API.
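The `StartJobRun` call in the last step can be scripted with the AWS SDK for Python. The sketch below only builds the job driver; the packed-environment Spark configuration follows the common EMR Serverless pattern for shipping a virtual environment, and the application ID, role ARN, and S3 URIs in the commented-out call are placeholders you must replace:

```python
def build_job_driver(script_s3_uri, venv_s3_uri):
    """Build a sparkSubmit job driver that ships a packed virtual
    environment (for example, built with venv-pack) alongside the
    user script."""
    return {
        "sparkSubmit": {
            "entryPoint": script_s3_uri,
            "sparkSubmitParameters": (
                f"--conf spark.archives={venv_s3_uri}#environment "
                "--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON="
                "./environment/bin/python"
            ),
        }
    }

# To actually submit the job (requires boto3, AWS credentials, and the
# application and role created in the earlier steps; IDs are placeholders):
#
#   import boto3
#   emr = boto3.client("emr-serverless")
#   response = emr.start_job_run(
#       applicationId="00example1234567",
#       executionRoleArn="arn:aws:iam::111122223333:role/EmrServerlessJobRole",
#       jobDriver=build_job_driver(
#           "s3://amzn-s3-demo-bucket/scripts/my_analysis.py",
#           "s3://amzn-s3-demo-bucket/envs/pyspark_venv.tar.gz",
#       ),
#   )
```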

## Analysis template job doesn't start
<a name="troubleshooting_analysis_template_job_fails_to_start"></a>

### Common causes
<a name="troubleshooting_common_cause"></a>

Analysis template jobs can fail immediately at startup due to three main configuration issues: 
+ Incorrect script naming that doesn't match the required format 
+ Missing or incorrectly formatted entrypoint function in the user script 
+ Incompatible Python version in the virtual environment

### Resolution
<a name="troubleshooting_resolution"></a>

**To resolve:**

1. Verify your user script: 

   1. Check that your user script has a valid Python filename.

      Valid Python filenames use lowercase letters, underscores to separate words, and the `.py` extension.

1. Verify the entrypoint function. If your user script doesn't have an entrypoint function, add one. 

   1. Open your user script.

   1. Add this entrypoint function:

      ```
      def entrypoint(context):
          # Your analysis code here
      ```

   1. Ensure the function name is spelled exactly as `entrypoint`. 

   1. Verify the function accepts the `context` parameter.

1. Check Python version compatibility: 

   1. Verify your virtual environment uses Python 3.9 or 3.11.

   1. To check your version, run: `python --version` 

   1. If needed, update your virtual environment: 

      ```
      conda create -n analysis-env python=3.9
      conda activate analysis-env
      ```

### Prevention
<a name="troubleshooting_prevention"></a>
+ Use the provided analysis template starter code that includes the correct file structure.
+ Set up a dedicated virtual environment with Python 3.9 or 3.11 for all analysis templates. 
+ Test your analysis template locally using the template validation tool before submitting jobs. 
+ Implement CI/CD checks to verify script naming and entrypoint function requirements.
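The last prevention item, a CI check for script naming and the entrypoint function, can be sketched with Python's `ast` module. This is an illustrative validator based on the rules in this section, not an official AWS Clean Rooms tool:

```python
import ast
import re
from pathlib import Path

# Lowercase letters, underscores, and a .py extension, per the rules above.
FILENAME_RE = re.compile(r"^[a-z_][a-z0-9_]*\.py$")

def check_user_script(path):
    """Return a list of problems with a user script; an empty list means
    it passes the naming and entrypoint checks described above."""
    problems = []
    name = Path(path).name
    if not FILENAME_RE.match(name):
        problems.append(f"invalid filename: {name}")
    tree = ast.parse(Path(path).read_text())
    entry = next(
        (n for n in tree.body
         if isinstance(n, ast.FunctionDef) and n.name == "entrypoint"),
        None,
    )
    if entry is None:
        problems.append("no top-level entrypoint() function found")
    elif [a.arg for a in entry.args.args] != ["context"]:
        problems.append("entrypoint() must accept a single 'context' parameter")
    return problems
```

Wiring this into a pre-commit hook or CI job catches the three startup failure causes before any job is submitted.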

## Analysis template job starts but fails during processing
<a name="analysis-template-job-failes-to-run"></a>

### Common causes
<a name="troubleshooting_common_cause"></a>

Analysis jobs can fail during execution for these security and formatting reasons:
+ Unauthorized direct access attempts to AWS services like Amazon S3 or AWS Glue 
+ Returning output in incorrect formats that don't match required DataFrame specifications 
+ Blocked network calls due to security restrictions in the execution environment 

### Resolution
<a name="troubleshooting_resolution"></a>

**To resolve**

1. Remove direct AWS service access: 

   1. Search your code for direct AWS service imports and calls. 

   1. Replace direct S3 access with provided Spark session methods. 

   1. Use only pre-configured tables through the collaboration interface. 

1. Format outputs correctly: 

   1. Verify all outputs are Spark DataFrames. 

   1. Update your return statement to match this format: 

      ```
      return {
          "results": {
              "output1": dataframe1
          }
      }
      ```

   1. Remove any non-DataFrame return objects. 

1. Remove network calls: 

   1. Identify and remove any external API calls. 

   1. Remove any urllib, requests, or similar network libraries. 

   1. Remove any socket connections or HTTP client code. 

### Prevention
<a name="troubleshooting_prevention"></a>
+ Use the provided code linter to check for unauthorized AWS imports and network calls. 
+ Test jobs in the development environment where security restrictions match production. 
+ Follow the output schema validation process before deploying jobs. 
+ Review the security guidelines for approved service access patterns.
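A minimal version of such a linter can be built with Python's `ast` module. The blocked-module list below is an assumption drawn from the causes described in this section, not an official list:

```python
import ast

# Modules that suggest direct AWS service access or network calls, which
# the execution environment blocks (illustrative list, not exhaustive).
BLOCKED_MODULES = {"boto3", "botocore", "requests", "urllib", "urllib3",
                   "http", "socket"}

def find_blocked_imports(source):
    """Return the blocked modules imported by the given source code."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            if name.split(".")[0] in BLOCKED_MODULES:
                found.append(name)
    return found
```

Note that a static check like this cannot catch dynamic imports or indirect network access through third-party libraries, so it complements rather than replaces testing in a restricted environment.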

## Virtual environment setup fails
<a name="virtual-environment-setup-fails"></a>

### Common causes
<a name="virtual-env-fails-cause"></a>

Virtual environment configuration failures commonly occur due to: 
+ Mismatched CPU architecture between development and execution environments 
+ Python code formatting issues that prevent proper environment initialization 
+ Incorrect base image configuration in container settings 

### Resolution
<a name="virtual-env-fails-resolution"></a>

**To resolve**

1. Configure the correct architecture: 

   1. Check your current architecture with `uname -m`.

   1. Update your Dockerfile to specify ARM64: 

      ```
      FROM --platform=linux/arm64 public.ecr.aws/amazonlinux/amazonlinux:2023-minimal
      ```

   1. Rebuild your container with `docker build --platform=linux/arm64 .`

1. Fix Python indentation: 

   1. Run a Python code formatter like `black` on your code files. 

   1. Verify consistent use of spaces or tabs (not both). 

   1. Check indentation of all code blocks: 

      ```
      def my_function():
          if condition:
              do_something()
          return result
      ```

   1. Use an IDE with Python indentation highlighting. 

1. Validate environment configuration: 

   1. Run `python -m py_compile your_script.py` to check for syntax errors. 

   1. Test the environment locally before deployment. 

   1. Verify all dependencies are listed in `requirements.txt`. 

### Prevention
<a name="virtual-env-fails-prevention"></a>
+ Use Visual Studio Code or PyCharm with Python formatting plugins 
+ Configure pre-commit hooks to run code formatters automatically 
+ Build and test environments locally using the provided ARM64 base image 
+ Implement automated code style checking in your CI/CD pipeline 