

# Collecting data from custom sources in Security Lake
<a name="custom-sources"></a>

Amazon Security Lake can collect logs and events from third-party custom sources. A Security Lake custom source is a third-party service that sends security logs and events to Security Lake. Before sending data, the custom source must convert its logs and events to the Open Cybersecurity Schema Framework (OCSF) and meet the Security Lake source requirements, including those for partitioning, Apache Parquet file format, and object size and rate.

For each custom source, Security Lake handles the following:
+ Provides a unique prefix for the source in your Amazon S3 bucket.
+ Creates a role in AWS Identity and Access Management (IAM) that permits a custom source to write data to the data lake. The permissions boundary for this role is set by an AWS managed policy called [`AmazonSecurityLakePermissionsBoundary`](security-iam-awsmanpol.md#security-iam-awsmanpol-AmazonSecurityLakePermissionsBoundary).
+ Creates an AWS Lake Formation table to organize objects that the source writes to Security Lake.
+ Sets up an AWS Glue crawler to partition your source data. The crawler populates the AWS Glue Data Catalog with the table. It also automatically discovers new source data and extracts schema definitions.

**Note**  
You can add a maximum of 50 custom log sources in an account.

To add a custom source to Security Lake, the source must meet the following requirements. Failing to meet these requirements can degrade performance and affect analytics use cases such as querying.
+ **Destination** – The custom source must be able to write data to Security Lake as a set of S3 objects underneath the prefix assigned to the source. For sources that contain multiple categories of data, you should deliver each unique [Open Cybersecurity Schema Framework (OCSF) event class](https://schema.ocsf.io/classes?extensions=) as a separate source. Security Lake creates an IAM role that permits the custom source to write to the specified location in your S3 bucket.
+ **Format** – Each S3 object that's collected from the custom source should be formatted as an Apache Parquet file.
+ **Schema** – The same OCSF event class should apply to each record within a Parquet-formatted object. Security Lake supports versions 1.x and 2.x of Parquet. Data page size should be limited to 1 MB (uncompressed). Row group size should be no larger than 256 MB (compressed). For compression within the Parquet object, zstandard is preferred.
+ **Partitioning** – Objects must be partitioned by source location, AWS Region, AWS account, and event day. Objects should be prefixed with `source-location/region=region/accountId=accountID/eventDay=yyyyMMdd/`.
+ **Object size and rate** – Send objects to Security Lake at intervals between 5 minutes and 1 day. You can send objects more frequently than every 5 minutes if they are larger than 256 MB. These object size and rate requirements optimize Security Lake for query performance; not following them can degrade your Security Lake performance.
+ **Sorting** – Within each Parquet-formatted object, records should be ordered by time to reduce the cost of querying data.
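To make the partitioning and sorting requirements concrete, here is a minimal Python sketch. It is illustrative only and not part of any Security Lake SDK; the source location, record shape, and `time` field are assumptions for the example. It orders a batch of records by event time and builds the S3 key prefix for a record's partition:

```python
from datetime import datetime, timezone

def partition_prefix(source_location: str, region: str,
                     account_id: str, event_time: datetime) -> str:
    """Build the required S3 key prefix for one record's partition."""
    event_day = event_time.astimezone(timezone.utc).strftime("%Y%m%d")
    return (f"{source_location}/region={region}/"
            f"accountId={account_id}/eventDay={event_day}/")

def sort_batch(records):
    """Order records by event time, as required within each Parquet object."""
    return sorted(records, key=lambda r: r["time"])

# Example: two records destined for the same partition.
recs = [
    {"time": datetime(2023, 4, 28, 12, 5, tzinfo=timezone.utc)},
    {"time": datetime(2023, 4, 28, 9, 30, tzinfo=timezone.utc)},
]
ordered = sort_batch(recs)
prefix = partition_prefix("ext/example-source", "us-west-2",
                          "123456789012", ordered[0]["time"])
```

A writer would then serialize the sorted batch as a Parquet object under `prefix`; the Parquet encoding itself is outside this sketch.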

**Note**  
Use the [OCSF Validation tool](https://github.com/aws-samples/amazon-security-lake-ocsf-validation) to verify that the custom source is compatible with the OCSF schema. For custom sources, Security Lake supports OCSF version 1.3 and earlier.

## Partitioning requirements for ingesting custom sources in Security Lake
<a name="custom-sources-best-practices"></a>

To facilitate efficient data processing and querying, you must meet the following partitioning and object size and rate requirements when adding a custom source to Security Lake:

**Partitioning**  
Objects should be partitioned by source location, AWS Region, AWS account, and date.  
+ The partition data path is formatted as

   `/ext/custom-source-name/region=region/accountId=accountID/eventDay=YYYYMMDD`.

  A sample partition with example bucket name is `aws-security-data-lake-us-west-2-lake-uid/ext/custom-source-name/region=us-west-2/accountId=123456789012/eventDay=20230428/`.

The following list describes the parameters used in the S3 path partition:
+ *Bucket name* – The name of the Amazon S3 bucket in which Security Lake stores your custom source data.
+ `source-location` – Prefix for the custom source in your S3 bucket. Security Lake stores all S3 objects for a given source under this prefix, and the prefix is unique to the given source.
+ `region` – The AWS Region to which the data is uploaded. For example, use `us-east-1` for data uploaded to your Security Lake bucket in the US East (N. Virginia) Region.
+ `accountId` – The AWS account ID that the records in the source partition pertain to. For records pertaining to accounts outside of AWS, we recommend using a string such as `external` or `external_externalAccountId`. By adopting this naming convention, you keep external account IDs unambiguous so that they don't conflict with AWS account IDs or with external account IDs maintained by other identity management systems.
+ `eventDay` – The UTC timestamp of the record, truncated to the day and formatted as an eight-character string (`YYYYMMDD`). If records specify a different time zone in the event timestamp, you must convert the timestamp to UTC for this partition key.
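The `eventDay` conversion can be sketched in Python. This is an illustrative helper, not a Security Lake API; it shows how a record stamped in a non-UTC time zone can land on a different UTC day:

```python
from datetime import datetime, timezone, timedelta

def event_day(ts: datetime) -> str:
    """Truncate a record timestamp to its UTC day, as YYYYMMDD."""
    if ts.tzinfo is None:
        raise ValueError("record timestamps must be timezone-aware")
    return ts.astimezone(timezone.utc).strftime("%Y%m%d")

# 21:00 at UTC-05:00 is 02:00 the next day in UTC, so the partition
# key shifts to the following day.
local = datetime(2023, 4, 28, 21, 0, tzinfo=timezone(timedelta(hours=-5)))
day = event_day(local)  # -> "20230429"
```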

## Prerequisites to adding a custom source in Security Lake
<a name="iam-roles-custom-sources"></a>

When adding a custom source, Security Lake creates an IAM role that permits the source to write data to the correct location in the data lake. The name of the role follows the format `AmazonSecurityLake-Provider-{name of the custom source}-{region}`, where `region` is the AWS Region in which you're adding the custom source. Security Lake attaches a policy to the role that permits access to the data lake. If you've encrypted the data lake with a customer managed AWS KMS key, Security Lake also attaches a policy with `kms:Decrypt` and `kms:GenerateDataKey` permissions to the role. The permissions boundary for this role is set by an AWS managed policy called [`AmazonSecurityLakePermissionsBoundary`](security-iam-awsmanpol.md#security-iam-awsmanpol-AmazonSecurityLakePermissionsBoundary).

**Topics**
+ [Verify permissions](#add-custom-sources-permissions)
+ [Create IAM role to permit write access to Security Lake bucket location (API and AWS CLI-only step)](#iam-roles-glue-crawler)

### Verify permissions
<a name="add-custom-sources-permissions"></a>

Before adding a custom source, verify that you have the permissions to perform the following actions.

To verify your permissions, use IAM to review the IAM policies that are attached to your IAM identity. Then, compare the information in those policies to the following list of actions that you must be allowed to perform to add a custom source. 
+ `glue:CreateCrawler`
+ `glue:CreateDatabase`
+ `glue:CreateTable`
+ `glue:StopCrawlerSchedule`
+ `iam:GetRole`
+ `iam:PutRolePolicy`
+ `iam:DeleteRolePolicy`
+ `iam:PassRole`
+ `lakeformation:RegisterResource`
+ `lakeformation:GrantPermissions`
+ `s3:ListBucket`
+ `s3:PutObject`

These actions allow you to collect logs and events from a custom source, send them to the correct AWS Glue database and table, and store them in Amazon S3.

If you use an AWS KMS key for server-side encryption of your data lake, you also need permission for `kms:CreateGrant`, `kms:DescribeKey`, and `kms:GenerateDataKey`.

**Important**  
If you plan to use the Security Lake console to add a custom source, you can skip the next step and proceed to [Adding a custom source in Security Lake](adding-custom-sources.md). The Security Lake console offers a streamlined process for getting started, and creates all necessary IAM roles or uses existing roles on your behalf.  
If you plan to use the Security Lake API or the AWS CLI to add a custom source, continue with the next step to create an IAM role that permits write access to the Security Lake bucket location.

### Create IAM role to permit write access to Security Lake bucket location (API and AWS CLI-only step)
<a name="iam-roles-glue-crawler"></a>

If you're using the Security Lake API or the AWS CLI to add a custom source, create this IAM role to grant AWS Glue permission to crawl your custom source data and identify partitions in the data. These partitions are necessary to organize your data and to create and update tables in the Data Catalog.

After creating this IAM role, you will need the Amazon Resource Name (ARN) of the role in order to add a custom source.

You must attach the AWS managed policy `arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole` to the role.

To grant the necessary permissions, you must also create and embed the following inline policy in the role to permit the AWS Glue crawler to read data files from the custom source and create and update tables in the AWS Glue Data Catalog.

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "S3WriteRead",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::amzn-s3-demo-bucket/*"
            ]
        }
    ]
}
```

------

Attach the following trust policy to the role so that AWS Glue can assume it:

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "glue.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
```

------

If the S3 bucket in the Region where you're adding the custom source is encrypted with a customer-managed AWS KMS key, you must also attach the following policy to the role and to your KMS key policy:

```
{
    "Effect": "Allow",
    "Action": [
        "kms:GenerateDataKey",
        "kms:Decrypt"
    ],
    "Condition": {
        "StringLike": {
            "kms:EncryptionContext:aws:s3:arn": [
                "arn:aws:s3:::{{name of S3 bucket created by Security Lake}}"
            ]
        }
    },
    "Resource": [
        "{{ARN of customer managed key}}"
    ]
}
```

# Adding a custom source in Security Lake
<a name="adding-custom-sources"></a>

After creating the IAM role to invoke the AWS Glue crawler, follow these steps to add a custom source in Security Lake.

------
#### [ Console ]

1. Open the Security Lake console at [https://console.aws.amazon.com/securitylake/](https://console.aws.amazon.com/securitylake/).

1. By using the AWS Region selector in the upper-right corner of the page, select the Region where you want to create the custom source.

1. Choose **Custom sources** in the navigation pane, and then choose **Create custom source**.

1. In the **Custom source details** section, enter a globally unique name for your custom source. Then, select an OCSF event class that describes the type of data that the custom source will send to Security Lake.

1. For **AWS account with permission to write data**, enter the **AWS account ID** and **External ID** of the custom source that will write logs and events to the data lake.

1. For **Service Access**, create and use a new service role or use an existing service role that gives Security Lake permission to invoke AWS Glue.

1. Choose **Create**.

------
#### [ API ]

To add a custom source programmatically, use the [CreateCustomLogSource](https://docs.aws.amazon.com/security-lake/latest/APIReference/API_CreateCustomLogSource.html) operation of the Security Lake API. Use the operation in the AWS Region where you want to create the custom source. If you're using the AWS Command Line Interface (AWS CLI), run the [create-custom-log-source](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/securitylake/create-custom-log-source.html) command.

In your request, use the supported parameters to specify configuration settings for the custom source:
+ `sourceName` – Specify a name for the source. The name must be a Regionally unique value.
+ `eventClasses` – Specify one or more OCSF event classes to describe the type of data that the source will send to Security Lake. For a list of OCSF event classes supported as a source in Security Lake, see [Open Cybersecurity Schema Framework (OCSF)](https://schema.ocsf.io/classes?extensions).
+ `sourceVersion` – Optionally, specify a value to limit log collection to a specific version of custom source data.
+ `crawlerConfiguration` – Specify the Amazon Resource Name (ARN) of the IAM role that you created to invoke the AWS Glue crawler. For the detailed steps to create an IAM role, see [Prerequisites to adding a custom source](https://docs.aws.amazon.com/security-lake/latest/userguide/custom-sources.html#iam-roles-glue-crawler).
+ `providerIdentity` – Specify the AWS identity and external ID that the source will use to write logs and events to the data lake.
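As a sketch, the same request could be assembled with the AWS SDK for Python (boto3). The role ARN, principal, and external ID below are placeholders, and the final client call is shown commented out because it requires live AWS credentials:

```python
# Sketch of a CreateCustomLogSource request. All identifiers are
# placeholders; substitute your own role ARN, principal, and external ID.
request = {
    "sourceName": "EXAMPLE_CUSTOM_SOURCE",
    "eventClasses": ["DNS_ACTIVITY", "NETWORK_ACTIVITY"],
    "configuration": {
        "crawlerConfiguration": {
            "roleArn": "arn:aws:iam::123456789012:role/service-role/RoleName"
        },
        "providerIdentity": {
            "externalId": "ExternalId",
            "principal": "123456789012",
        },
    },
}

# import boto3
# client = boto3.client("securitylake", region_name="ap-southeast-2")
# response = client.create_custom_log_source(**request)
```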

The following example adds a custom source as a log source for the designated log provider account in the designated Region. This example is formatted for Linux, macOS, or Unix, and it uses the backslash (\) line-continuation character to improve readability.

```
$ aws securitylake create-custom-log-source \
--source-name EXAMPLE_CUSTOM_SOURCE \
--event-classes '["DNS_ACTIVITY", "NETWORK_ACTIVITY"]' \
--configuration 'crawlerConfiguration={roleArn=arn:aws:iam::XXX:role/service-role/RoleName},providerIdentity={externalId=ExternalId,principal=principal}' \
--region ap-southeast-2
```

------

## Keeping custom source data updated in AWS Glue
<a name="maintain-glue-schema"></a>

After you add a custom source in Security Lake, Security Lake creates an AWS Glue crawler. The crawler connects to your custom source, determines the data structures, and populates the AWS Glue Data Catalog with tables.

We recommend manually running the crawler to keep your custom source schema up to date and maintain query functionality in Athena and other querying services. Specifically, you should run the crawler if either of the following changes occur in your input data set for a custom source:
+ The data set has one or more new top-level columns.
+ The data set has one or more new fields in a column with a `struct` datatype.
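The two conditions above amount to a schema diff. The following illustrative sketch checks whether a new input schema warrants rerunning the crawler; the dict-based schema representation is an assumption for the example, not an AWS Glue format:

```python
def needs_crawl(old: dict, new: dict) -> bool:
    """Return True if the new schema adds top-level columns or new
    fields inside any struct-typed column."""
    # Condition 1: new top-level columns.
    if set(new) - set(old):
        return True
    # Condition 2: new fields within struct columns present in both schemas.
    for col, typ in new.items():
        if isinstance(typ, dict) and isinstance(old.get(col), dict):
            if set(typ) - set(old[col]):
                return True
    return False

old = {"time": "timestamp", "src": {"ip": "string"}}
new = {"time": "timestamp", "src": {"ip": "string", "port": "int"}}
# needs_crawl(old, new) is True: `src` gained a `port` field.
```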

For instructions on running a crawler, see [Scheduling an AWS Glue crawler](https://docs.aws.amazon.com/glue/latest/dg/schedule-crawler.html) in the *AWS Glue Developer Guide*.

Security Lake can't delete or update existing crawlers in your account. If you delete a custom source, we recommend deleting the associated crawler if you plan to create a custom source with the same name in the future.

## Supported OCSF event classes
<a name="ocsf-eventclass"></a>

The Open Cybersecurity Schema Framework (OCSF) event class describes the type of data that the custom source will send to Security Lake. The supported event classes are:

```
public enum OcsfEventClass {
    ACCOUNT_CHANGE,
    API_ACTIVITY,
    APPLICATION_LIFECYCLE,
    AUTHENTICATION,
    AUTHORIZE_SESSION,
    COMPLIANCE_FINDING,
    DATASTORE_ACTIVITY,
    DEVICE_CONFIG_STATE,
    DEVICE_CONFIG_STATE_CHANGE,
    DEVICE_INVENTORY_INFO,
    DHCP_ACTIVITY,
    DNS_ACTIVITY,
    DETECTION_FINDING,
    EMAIL_ACTIVITY,
    EMAIL_FILE_ACTIVITY,
    EMAIL_URL_ACTIVITY,
    ENTITY_MANAGEMENT,
    FILE_HOSTING_ACTIVITY,
    FILE_SYSTEM_ACTIVITY,
    FTP_ACTIVITY,
    GROUP_MANAGEMENT,
    HTTP_ACTIVITY,
    INCIDENT_FINDING,
    KERNEL_ACTIVITY,
    KERNEL_EXTENSION,
    MEMORY_ACTIVITY,
    MODULE_ACTIVITY,
    NETWORK_ACTIVITY,
    NETWORK_FILE_ACTIVITY,
    NTP_ACTIVITY,
    PATCH_STATE,
    PROCESS_ACTIVITY,
    RDP_ACTIVITY,
    REGISTRY_KEY_ACTIVITY,
    REGISTRY_VALUE_ACTIVITY,
    SCHEDULED_JOB_ACTIVITY,
    SCAN_ACTIVITY,
    SECURITY_FINDING,
    SMB_ACTIVITY,
    SSH_ACTIVITY,
    USER_ACCESS,
    USER_INVENTORY,
    VULNERABILITY_FINDING,
    WEB_RESOURCE_ACCESS_ACTIVITY,
    WEB_RESOURCES_ACTIVITY,
    WINDOWS_RESOURCE_ACTIVITY,
    // 1.3 OCSF event classes
    ADMIN_GROUP_QUERY,
    DATA_SECURITY_FINDING,
    EVENT_LOG_ACTIVITY,
    FILE_QUERY,
    FILE_REMEDIATION_ACTIVITY,
    FOLDER_QUERY,
    JOB_QUERY,
    KERNEL_OBJECT_QUERY,
    MODULE_QUERY,
    NETWORK_CONNECTION_QUERY,
    NETWORK_REMEDIATION_ACTIVITY,
    NETWORKS_QUERY,
    PERIPHERAL_DEVICE_QUERY,
    PROCESS_QUERY,
    PROCESS_REMEDIATION_ACTIVITY,
    REMEDIATION_ACTIVITY,
    SERVICE_QUERY,
    SOFTWARE_INVENTORY_INFO,
    TUNNEL_ACTIVITY,
    USER_QUERY,
    USER_SESSION_QUERY,
    // 1.3 OCSF event classes (Win extension)
    PREFETCH_QUERY,
    REGISTRY_KEY_QUERY,
    REGISTRY_VALUE_QUERY,
    WINDOWS_SERVICE_ACTIVITY
}
```

# Deleting a custom source from Security Lake
<a name="delete-custom-source"></a>

Delete a custom source to stop sending its data to Security Lake. When you remove the source, Security Lake stops collecting data from it in the specified Regions and accounts, and subscribers can no longer consume new data from the source. However, subscribers can still consume the data that Security Lake collected from the source before removal. You can use these instructions only to remove a custom source. For information about removing a natively supported AWS service, see [Collecting data from AWS services in Security Lake](custom-sources.md).

When you delete a custom source, you must also disable the corresponding integration at the source itself, outside of the Security Lake console. If you don't disable the integration, the source may continue to send logs to Amazon S3.

------
#### [ Console ]

1. Open the Security Lake console at [https://console.aws.amazon.com/securitylake/](https://console.aws.amazon.com/securitylake/).

1. By using the AWS Region selector in the upper-right corner of the page, select the Region that you want to remove the custom source from.

1. In the navigation pane, choose **Custom sources**.

1. Select the custom source that you want to remove.

1. Choose **Deregister custom source** and then choose **Delete** to confirm the action.

------
#### [ API ]

To delete a custom source programmatically, use the [DeleteCustomLogSource](https://docs.aws.amazon.com/security-lake/latest/APIReference/API_DeleteCustomLogSource.html) operation of the Security Lake API. If you're using the AWS Command Line Interface (AWS CLI), run the [delete-custom-log-source](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/securitylake/delete-custom-log-source.html) command. Use the operation in the AWS Region where you want to delete the custom source.

In your request, use the `sourceName` parameter to specify the name of the custom source to delete. Or specify the name of the custom source and use the `sourceVersion` parameter to limit the scope of the deletion to only a specific version of data from the custom source.

The following example deletes a custom log source from Security Lake.

This example is formatted for Linux, macOS, or Unix, and it uses the backslash (\) line-continuation character to improve readability.

```
$ aws securitylake delete-custom-log-source \
--source-name EXAMPLE_CUSTOM_SOURCE
```

------