

# Custom models in Clean Rooms ML
<a name="custom-models"></a>

With Clean Rooms ML, members of a collaboration can use a Dockerized custom model algorithm that is stored in Amazon ECR to jointly analyze their data. To do this, the *model provider* must create an image and store it in Amazon ECR. Follow the steps in the [Amazon Elastic Container Registry User Guide](https://docs.aws.amazon.com/AmazonECR/latest/userguide/) to create a private repository that will contain the custom ML model.

Any member of a collaboration can be the *model provider*, provided they have the correct permissions. All members of a collaboration can contribute data to the model. For the purpose of this guide, members contributing data are referred to as *data providers*. The member who creates the collaboration is the *collaboration creator*, and this member can be either the *model provider*, one of the *data providers*, or both.

The following topics describe the information necessary to create a custom ML model.

**Topics**
+ [Custom ML modeling prerequisites](custom-model-prerequisites.md)
+ [Model authoring guidelines for the training container](custom-model-guidelines.md)
+ [Model authoring guidelines for the inference container](inference-model-guidelines.md)
+ [Receiving model logs and metrics](custom-model-logs.md)

# Custom ML modeling prerequisites
<a name="custom-model-prerequisites"></a>

Before you can perform custom ML modeling, you should consider the following:
+ Determine whether both model training and inference on the trained model will be performed in the collaboration.
+ Determine the role that each collaboration member will perform and assign them the appropriate abilities.
  + Assign the `CAN_QUERY` ability to the member who will train the model and run inference on the trained model.
  + Assign the `CAN_RECEIVE_RESULTS` ability to at least one member of the collaboration.
  + Assign the `CAN_RECEIVE_MODEL_OUTPUT` or `CAN_RECEIVE_INFERENCE_OUTPUT` ability to the member that will receive trained model exports or inference output, respectively. You can assign both abilities if your use case requires them.
+ Determine the maximum size of the trained model artifacts or inference results that you will allow to be exported.
+ We recommend that all users have the `CleanroomsFullAccess` and `CleanroomsMLFullAccess` policies attached to their role. Using custom ML models requires using both the AWS Clean Rooms and AWS Clean Rooms ML SDKs.
+ Consider the following information about IAM roles.
  + All data providers must have a service access role that allows AWS Clean Rooms to read data from their AWS Glue catalogs and tables, and the underlying Amazon S3 locations. These roles are similar to those required for SQL querying. This allows you to use the `CreateConfiguredTableAssociation` action. For more information, see [Create a service role to create a configured table association](ml-roles.md#ml-roles-custom-configure-table). 
  + All members that want to receive metrics must have a service access role that allows them to write CloudWatch metrics and logs. This role is used by Clean Rooms ML to write all model metrics and logs to the member's AWS account during model training and inference. We also provide privacy controls to determine which members have access to the metrics and logs. This allows you to use the `CreateMLConfiguration` action. For more information, see [Create a service role for custom ML modeling - ML Configuration](ml-roles.md#ml-roles-custom-configure). 

    The member receiving results must provide a service access role with permissions to write to their Amazon S3 bucket. This role allows Clean Rooms ML to export results (trained model artifacts or inference results) to an Amazon S3 bucket. This allows you to use the `CreateMLConfiguration` action. For more information, see [Create a service role for custom ML modeling - ML Configuration](ml-roles.md#ml-roles-custom-configure). 
  + The model provider must provide a service access role with permissions to read their Amazon ECR repository and image. This allows you to use the `CreateConfiguredModelAlgorithm` action. For more information, see [Create a service role to provide a custom ML model](ml-roles.md#ml-roles-custom-model-provider). 
  + The member that creates the `MLInputChannel` to generate datasets for training or inference must provide a service access role that allows Clean Rooms ML to execute an SQL query in AWS Clean Rooms. This allows you to use the `CreateTrainedModel` and `StartTrainedModelInferenceJob` actions. For more information, see [Create a service role to query a dataset](ml-roles.md#ml-roles-custom-query-dataset). 
+ Model authors should follow the [Model authoring guidelines for the training container](custom-model-guidelines.md) and [Model authoring guidelines for the inference container](inference-model-guidelines.md) to ensure model inputs and outputs are configured as expected by AWS Clean Rooms.
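For orientation, a service access role of the kind described above generally carries a trust policy that lets Clean Rooms ML assume it. The following sketch assumes the `cleanrooms-ml.amazonaws.com` service principal; confirm the exact principal and any recommended condition keys in the linked role-creation topics.

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "cleanrooms-ml.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```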

# Model authoring guidelines for the training container
<a name="custom-model-guidelines"></a>

This section details the guidelines that model providers should follow when creating a custom ML model algorithm for Clean Rooms ML.
+ Use the appropriate SageMaker AI training-supported container base image, as described in the [SageMaker AI Developer Guide](https://docs.aws.amazon.com/sagemaker/latest/dg-ecr-paths/sagemaker-algo-docker-registry-paths.html). The following code allows you to pull the supported container base images from public SageMaker AI endpoints.

  ```
  ecr_registry_endpoint="763104351884.dkr.ecr.$REGION.amazonaws.com"
  base_image='pytorch-training:2.3.0-cpu-py311-ubuntu20.04-sagemaker'
  aws ecr get-login-password --region $REGION | docker login --username AWS --password-stdin $ecr_registry_endpoint
  docker pull $ecr_registry_endpoint/$base_image
  ```
+ When authoring the model locally, ensure the following so that you can test your model locally, on a development instance, on SageMaker AI Training in your AWS account, and on Clean Rooms ML.
  + We recommend writing a training script that accesses useful properties about the training environment through various environment variables. Clean Rooms ML uses the following arguments to invoke training on your model code: `SM_MODEL_DIR`, `SM_OUTPUT_DIR`, `SM_CHANNEL_TRAIN`, and `FILE_FORMAT`. These defaults are used by Clean Rooms ML to train your ML model in its own execution environment with the data from all parties.
  + Clean Rooms ML makes your training input channels available via the `/opt/ml/input/data/channel-name` directories in the docker container. Each ML input channel is mapped based on its corresponding `channel_name` provided in the `CreateTrainedModel` request.

    ```
    import argparse
    import os
    
    # Data, model, and output directories
    parser = argparse.ArgumentParser()
    parser.add_argument('--model_dir', type=str, default=os.environ.get('SM_MODEL_DIR', "/opt/ml/model"))
    parser.add_argument('--output_dir', type=str, default=os.environ.get('SM_OUTPUT_DIR', "/opt/ml/output/data"))
    parser.add_argument('--train_dir', type=str, default=os.environ.get('SM_CHANNEL_TRAIN', "/opt/ml/input/data/train"))
    parser.add_argument('--train_file_format', type=str, default=os.environ.get('FILE_FORMAT', "csv"))
    ```
  + Ensure that you are able to generate a synthetic or test dataset, based on the schema of the collaborators' data, to use in your model code.
  + Ensure that you can run a SageMaker AI training job in your own AWS account before you associate the model algorithm with an AWS Clean Rooms collaboration.

    The following code contains a sample Dockerfile that is compatible with local testing, SageMaker AI Training environment testing, and Clean Rooms ML.

    ```
    FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.3.0-cpu-py311-ubuntu20.04-sagemaker
    LABEL maintainer=$author_name
    
    ENV PYTHONDONTWRITEBYTECODE=1 \
        PYTHONUNBUFFERED=1 \
        LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/usr/local/lib"
    
    ENV PATH="/opt/ml/code:${PATH}"
    
    # this environment variable is used by the SageMaker PyTorch container to determine our user code directory
    ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code
    
    # copy the training script inside the container
    COPY train.py /opt/ml/code/train.py
    # define train.py as the script entry point
    ENV SAGEMAKER_PROGRAM train.py
    ENTRYPOINT ["python", "/opt/ml/code/train.py"]
    ```
+ To best monitor container failures, we recommend exporting logs and writing failure reasons to the `/opt/ml/output/failure` file. In a `GetTrainedModel` response, Clean Rooms ML returns the first 1024 characters from this file under `StatusDetails`. 
+ After you have completed any model changes and you are ready to test it in the SageMaker AI environment, run the following commands in the order provided.

  ```
  export ACCOUNT_ID=xxx
  export REPO_NAME=xxx
  export REPO_TAG=xxx
  export REGION=xxx
  
  docker build -t $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO_NAME:$REPO_TAG .
  
  # Sign in to AWS account $ACCOUNT_ID (for example, by running aws configure)
  # Check the account and make sure it is the correct role/credentials
  aws sts get-caller-identity
  aws ecr create-repository --repository-name $REPO_NAME --region $REGION
  aws ecr describe-repositories --repository-names $REPO_NAME --region $REGION
  
  # Authenticate Docker
  aws ecr get-login-password --region $REGION | docker login --username AWS --password-stdin $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com
  
  # Push the image to Amazon ECR
  docker push $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO_NAME:$REPO_TAG
  
  # Create SageMaker training job
  # Configure the training_job.json with
  # 1. TrainingImage
  # 2. Input DataConfig
  # 3. Output DataConfig
  aws sagemaker create-training-job --cli-input-json file://training_job.json --region $REGION
  ```

  After the SageMaker AI job is complete and you are satisfied with your model algorithm, you can register the Amazon ECR image with AWS Clean Rooms ML. Use the `CreateConfiguredModelAlgorithm` action to register the model algorithm and the `CreateConfiguredModelAlgorithmAssociation` action to associate it with a collaboration.
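To make the training-container conventions above concrete, the following is a minimal, hypothetical training routine in the style of the `train.py` referenced in the sample Dockerfile. It only counts input rows and saves that count as its "model" artifact; the function name, artifact layout, and printed metric line are illustrative sketches, not part of any Clean Rooms ML API.

```python
import csv
import json
import os

def train(train_dir, model_dir, file_format="csv"):
    """Placeholder training routine: counts the rows across all files in
    the train channel and saves that count as the model artifact."""
    rows = 0
    for name in sorted(os.listdir(train_dir)):
        if name.endswith("." + file_format):
            with open(os.path.join(train_dir, name), newline="") as f:
                rows += sum(1 for _ in csv.reader(f))

    # Write the "model" artifact to the model directory (SM_MODEL_DIR),
    # which Clean Rooms ML can export for the receiving member.
    os.makedirs(model_dir, exist_ok=True)
    with open(os.path.join(model_dir, "model.json"), "w") as f:
        json.dump({"rows_seen": rows}, f)

    # Emit a log line that a custom-metric regex could be written to capture.
    print(f"train_rows={rows}")
    return rows
```

In a real container, the entry-point script would call this function with the argparse defaults shown earlier (`SM_CHANNEL_TRAIN`, `SM_MODEL_DIR`, and so on).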

# Model authoring guidelines for the inference container
<a name="inference-model-guidelines"></a>

This section details the guidelines that model providers should follow when creating an inference algorithm for Clean Rooms ML.
+ Use the appropriate SageMaker AI inference-supported container base image, as described in the [SageMaker AI Developer Guide](https://docs.aws.amazon.com/sagemaker/latest/dg-ecr-paths/sagemaker-algo-docker-registry-paths.html). The following code allows you to pull the supported container base images from public SageMaker AI endpoints.

  ```
  ecr_registry_endpoint="763104351884.dkr.ecr.$REGION.amazonaws.com"
  base_image='pytorch-inference:2.3.0-cpu-py311-ubuntu20.04-sagemaker'
  aws ecr get-login-password --region $REGION | docker login --username AWS --password-stdin $ecr_registry_endpoint
  docker pull $ecr_registry_endpoint/$base_image
  ```
+ When authoring the model locally, ensure the following so that you can test your model locally, on a development instance, on SageMaker AI Batch Transform in your AWS account, and on Clean Rooms ML.
  + Clean Rooms ML makes your trained model artifacts available for use by your inference code via the `/opt/ml/model` directory in the Docker container.
  + Clean Rooms ML splits input by line, uses a `MultiRecord` batch strategy, and adds a newline character at the end of every transformed record.
  + Ensure that you are able to generate a synthetic or test inference dataset, based on the schema of the collaborators' data, to use in your model code.
  + Ensure that you can run a SageMaker AI batch transform job in your own AWS account before you associate the model algorithm with an AWS Clean Rooms collaboration.

    The following code contains a sample Dockerfile that is compatible with local testing, SageMaker AI transform environment testing, and Clean Rooms ML.

    ```
    FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:1.12.1-cpu-py38-ubuntu20.04-sagemaker
    
    ENV PYTHONUNBUFFERED=1
    
    COPY serve.py /opt/ml/code/serve.py
    COPY inference_handler.py /opt/ml/code/inference_handler.py
    COPY handler_service.py /opt/ml/code/handler_service.py
    COPY model.py /opt/ml/code/model.py
    
    RUN chmod +x /opt/ml/code/serve.py
    
    ENTRYPOINT ["/opt/ml/code/serve.py"]
    ```
+ After you have completed any model changes and you are ready to test it in the SageMaker AI environment, run the following commands in the order provided.

  ```
  export ACCOUNT_ID=xxx
  export REPO_NAME=xxx
  export REPO_TAG=xxx
  export REGION=xxx
  
  docker build -t $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO_NAME:$REPO_TAG .
  
  # Sign in to AWS account $ACCOUNT_ID (for example, by running aws configure)
  # Check the account and make sure it is the correct role/credentials
  aws sts get-caller-identity
  aws ecr create-repository --repository-name $REPO_NAME --region $REGION
  aws ecr describe-repositories --repository-names $REPO_NAME --region $REGION
  
  # Authenticate Docker
  aws ecr get-login-password --region $REGION | docker login --username AWS --password-stdin $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com
  
  # Push the image to Amazon ECR
  docker push $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO_NAME:$REPO_TAG
  
  # Create SageMaker model
  # Configure the create_model.json with
  # 1. Primary container:
      # a. ModelDataUrl - S3 URI of the model.tar.gz from your training job
  aws sagemaker create-model --cli-input-json file://create_model.json --region $REGION
  
  # Create SageMaker transform job
  # Configure the transform_job.json with
  # 1. Model created in the step above 
  # 2. MultiRecord batch strategy
  # 3. Line SplitType for TransformInput
  # 4. AssembleWith Line for TransformOutput
  aws sagemaker create-transform-job --cli-input-json file://transform_job.json --region $REGION
  ```

  After the SageMaker AI job is complete and you are satisfied with your batch transform, you can register the Amazon ECR image with AWS Clean Rooms ML. Use the `CreateConfiguredModelAlgorithm` action to register the model algorithm and the `CreateConfiguredModelAlgorithmAssociation` action to associate it with a collaboration.
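The line-splitting and `MultiRecord` behavior described above means your inference code receives a payload containing several newline-separated records and must return one output line per record. The following is a minimal, hypothetical transform function (a sketch of the record handling only, not the actual SageMaker serving stack); the "model" that sums the fields is purely illustrative.

```python
def transform_batch(payload: str) -> str:
    """Score a MultiRecord payload: one CSV record per input line,
    one prediction per output line, newline-terminated."""
    out_lines = []
    for line in payload.splitlines():
        fields = line.split(",")
        # Hypothetical "model": predict the sum of the numeric fields.
        prediction = sum(float(x) for x in fields)
        out_lines.append(str(prediction))
    # Output is assembled with a newline after every transformed record.
    return "\n".join(out_lines) + "\n"
```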

# Receiving model logs and metrics
<a name="custom-model-logs"></a>

To receive logs and metrics from custom model training or inference, members must have [created an ML Configuration](https://docs.aws.amazon.com/clean-rooms/latest/userguide/create-custom-ml-collaboration.html) with a valid role that provides the necessary CloudWatch permissions (see [Create a service role for custom ML modeling - ML Configuration](https://docs.aws.amazon.com/clean-rooms/latest/userguide/ml-roles.html#ml-roles-custom-configure)).

**System metrics**

System metrics for both training and inference, such as CPU and memory utilization, are published to all members in the collaboration with valid ML Configurations. These metrics can be viewed as the job progresses via CloudWatch Metrics in the `/aws/cleanroomsml/TrainedModels` or `/aws/cleanroomsml/TrainedModelInferenceJobs` namespaces, respectively.

**Model logs**

Access to the model logs is provided by the privacy configuration policy of each configured model algorithm. The model author sets the privacy configuration policy when associating a configured model algorithm (either via the console or the `CreateConfiguredModelAlgorithmAssociation` API) to a collaboration. Setting the privacy configuration policy controls which members can receive the model logs.

Additionally, the model author can set a filter pattern in the privacy configuration policy to filter log events. All logs that a model container sends to `stdout` or `stderr` and that match the filter pattern (if set), are sent to Amazon CloudWatch Logs. Model logs are available in CloudWatch log groups `/aws/cleanroomsml/TrainedModels` or `/aws/cleanroomsml/TrainedModelInferenceJobs`, respectively.

**Custom defined metrics**

When you configure a model algorithm (either via the console or the `CreateConfiguredModelAlgorithm` API), the model author can provide specific metric names and regex statements to search for in the output logs. These can be viewed as the job progresses via CloudWatch Metrics in the `/aws/cleanroomsml/TrainedModels` namespace. When associating a configured model algorithm, the model author can set an optional noise level in the metrics privacy configuration to avoid outputting raw data while still providing visibility into custom metric trends. If a noise level is set, the metrics are published at the end of the job rather than in real time.
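As a local illustration of how such a metric definition behaves, a metric with a capture-group regex matches numeric values in the container's log output. The metric name `train_rows` and the regex below are hypothetical examples, not values required by Clean Rooms ML.

```python
import re

# Hypothetical custom metric definition, in the spirit of the name/regex
# pairs supplied when configuring a model algorithm.
metric_regex = re.compile(r"train_rows=([0-9.]+)")

def extract_metric(log_line: str):
    """Return the captured metric value from a log line, or None if the
    line does not match the metric's regex."""
    match = metric_regex.search(log_line)
    return float(match.group(1)) if match else None
```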