

# Overview of Amazon OpenSearch Ingestion
<a name="ingestion"></a>

Amazon OpenSearch Ingestion is a fully managed, serverless data collector that streams real-time logs, metrics, and trace data to Amazon OpenSearch Service domains and OpenSearch Serverless collections.

With OpenSearch Ingestion, you no longer need third-party tools like Logstash or Jaeger to ingest data. You configure your data producers to send data to OpenSearch Ingestion, and it automatically delivers it to your specified domain or collection. You can also transform data before delivery.

Because OpenSearch Ingestion is serverless, you don’t have to manage infrastructure, patch software, or scale clusters manually. You can provision ingestion pipelines directly in the AWS Management Console, and OpenSearch Ingestion handles the rest.

As a component of Amazon OpenSearch Service, OpenSearch Ingestion is powered by Data Prepper, an open-source data collector that filters, enriches, transforms, normalizes, and aggregates data for downstream analysis and visualization.

![\[OpenSearch Ingestion pipelines showing data flow from sources to Amazon OpenSearch Service domains.\]](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/Ingestion.png)


# Key concepts in Amazon OpenSearch Ingestion
<a name="ingestion-process"></a>

Before you start using OpenSearch Ingestion, it's helpful to understand these key concepts.

**Pipeline**  
From an OpenSearch Ingestion perspective, a *pipeline* refers to a single provisioned data collector that you create within OpenSearch Service. You can think of it as the entire YAML configuration file, which includes one or more sub-pipelines. For steps to create an ingestion pipeline, see [Creating pipelines](creating-pipeline.md#create-pipeline).

**Sub-pipeline**  
You define sub-pipelines *within* a YAML configuration file. Each sub-pipeline is a combination of a source, a buffer, zero or more processors, and one or more sinks. You can define multiple sub-pipelines in a single YAML file, each with unique sources, processors, and sinks. To aid in monitoring with CloudWatch and other services, we recommend that you specify a pipeline name that's distinct from all of its sub-pipelines.  
You can string multiple sub-pipelines together within a single YAML file, so that the source for one sub-pipeline is another sub-pipeline, and its sink is a third sub-pipeline. For an example, see [Using an OpenSearch Ingestion pipeline with OpenTelemetry Collector](configure-client-otel.md).

**Source**  
The input component of a sub-pipeline. It defines the mechanism through which a pipeline consumes records. The source can consume events either by receiving them over HTTPS, or by reading from external endpoints such as Amazon S3. There are two types of sources: *push-based* and *pull-based*. Push-based sources, such as [HTTP](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/http-source/) and [OTel logs](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/otel-logs-source/), stream records to ingestion endpoints. Pull-based sources, such as [OTel trace](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/otel-trace/) and [S3](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/s3/), pull data from the source.

**Processors**  
Intermediate processing units that can filter, transform, and enrich records into a desired format before publishing them to the sink. The processor is an optional component of a pipeline. If you don't define a processor, records are published in the format defined in the source. You can have more than one processor. A pipeline runs processors in the order that you define them.

**Sink**  
The output component of a sub-pipeline. It defines one or more destinations that a sub-pipeline publishes records to. OpenSearch Ingestion supports OpenSearch Service domains as sinks. It also supports sub-pipelines as sinks. This means that you can string together multiple sub-pipelines within a single OpenSearch Ingestion pipeline (YAML file). Self-managed OpenSearch clusters aren't supported as sinks.

**Buffer**  
The part of a sub-pipeline that acts as the layer between the source and the sink. You can't manually configure a buffer within your pipeline. OpenSearch Ingestion uses a default buffer configuration.

**Route**  
The part of a sub-pipeline that allows pipeline authors to send only the events that match certain conditions to different sinks.

A valid sub-pipeline definition must contain a source and a sink. For more information about each of these pipeline elements, see the [configuration reference](pipeline-config-reference.md#ingestion-parameters). 
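Putting these concepts together, the following is a minimal sketch of a pipeline configuration with two chained sub-pipelines. The sub-pipeline names, HTTP path, domain endpoint, index name, and role ARN are illustrative placeholders, not values from this guide:

```yaml
version: "2"
entry-pipeline:
  source:
    # Push-based source: clients POST log events to this path.
    http:
      path: "/entry-pipeline/logs"
  sink:
    # The sink of this sub-pipeline is another sub-pipeline.
    - pipeline:
        name: "log-pipeline"
log-pipeline:
  source:
    pipeline:
      name: "entry-pipeline"
  processor:
    # Optional: parse raw log lines into structured fields.
    - grok:
        match:
          log: ["%{COMMONAPACHELOG}"]
  sink:
    - opensearch:
        hosts: ["https://search-my-domain.us-east-1.es.amazonaws.com"]
        index: "application-logs"
        aws:
          sts_role_arn: "arn:aws:iam::111122223333:role/pipeline-role"
          region: "us-east-1"
```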

## Benefits of Amazon OpenSearch Ingestion
<a name="ingestion-benefits"></a>

OpenSearch Ingestion has the following main benefits:
+ Eliminates the need for you to manually manage a self-provisioned pipeline.
+ Automatically scales your pipelines based on capacity limits that you define.
+ Keeps your pipeline up to date with security and bug patches.
+ Provides the option to connect pipelines to your virtual private cloud (VPC) for an added layer of security.
+ Allows you to stop and start pipelines in order to control costs.
+ Provides pipeline configuration blueprints for popular use cases to help you get up and running faster.
+ Allows you to interact programmatically with your pipelines through the various AWS SDKs and the OpenSearch Ingestion API.
+ Supports performance monitoring in Amazon CloudWatch and error logging in CloudWatch Logs.

# Limitations of Amazon OpenSearch Ingestion
<a name="ingestion-limitations"></a>

OpenSearch Ingestion has the following limitations:
+ You can only ingest data into domains running OpenSearch 1.0 or later, or Elasticsearch 6.8 or later. If you're using the [OTel trace](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/otel-trace/) source, we recommend using Elasticsearch 7.9 or later so that you can use the [OpenSearch Dashboards plugin](https://opensearch.org/docs/latest/observability-plugin/trace/ta-dashboards/).
+ If a pipeline is writing to an OpenSearch Service domain that's within a VPC, the pipeline must be created in the same AWS Region as the domain.
+ You can only configure a single data source within a pipeline definition.
+ You can't specify [self-managed OpenSearch clusters](https://opensearch.org/docs/latest/about/#clusters-and-nodes) as sinks.
+ You can't specify a [custom endpoint](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/customendpoint.html) as a sink. You can still write to a domain that has custom endpoints enabled, but you must specify its standard endpoint.
+ You can't specify resources within [opt-in Regions](https://docs.aws.amazon.com/controltower/latest/userguide/opt-in-region-considerations.html) as sources or sinks.
+ There are some constraints on the parameters that you can include in a pipeline configuration. For more information, see [Configuration requirements and constraints](pipeline-config-reference.md#ingestion-parameters).

## Supported Data Prepper versions
<a name="ingestion-supported-versions"></a>

OpenSearch Ingestion currently supports the following major versions of Data Prepper:
+ 2.x

When you create a pipeline using the code editor, use the required `version` option to specify the major version of Data Prepper to use. For example, `version: "2"`. OpenSearch Ingestion retrieves the latest supported *minor* version of that major version and provisions the pipeline with that version.
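For example, the first line of a configuration that targets Data Prepper 2.x:

```yaml
# Pin the pipeline to the latest supported 2.x minor version of Data Prepper.
version: "2"
```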

If you don't use the code editor to create your pipeline, OpenSearch Ingestion automatically provisions your pipeline with the latest supported version.

Currently, OpenSearch Ingestion provisions pipelines with version 2.7 of Data Prepper. For information, see the [2.7 release notes](https://github.com/opensearch-project/data-prepper/releases/tag/2.7.0). Not every minor version of a particular major version is supported by OpenSearch Ingestion.

When you update a pipeline's configuration, if there's support for a new minor version of Data Prepper, OpenSearch Ingestion automatically upgrades the pipeline to the latest supported minor version of the major version that's specified in the pipeline configuration. For example, you might have `version: "2"` in your pipeline configuration, and OpenSearch Ingestion initially provisioned the pipeline with version 2.6.0. When support for version 2.7.0 is added, and you make a change to the pipeline configuration, OpenSearch Ingestion upgrades the pipeline to version 2.7.0. This process keeps your pipeline up to date with the latest bug fixes and performance improvements. OpenSearch Ingestion can't update the major version of your pipeline unless you manually change the `version` option within the pipeline configuration. For more information, see [Updating Amazon OpenSearch Ingestion pipelines](update-pipeline.md).

# Scaling pipelines in Amazon OpenSearch Ingestion
<a name="ingestion-scaling"></a>

OpenSearch Ingestion automatically scales pipeline capacity based on your specified minimum and maximum Ingestion OpenSearch Compute Units (Ingestion OCUs). This eliminates the need for manual provisioning and management.

Each Ingestion OCU is a combination of approximately 8 GiB of memory and 2 vCPUs. You can specify the minimum and maximum OCU values for a pipeline, and OpenSearch Ingestion automatically scales your pipeline capacity based on these limits.

 You specify the following values when you create a pipeline:
+ **Minimum capacity** – The pipeline can reduce capacity down to this number of Ingestion OCUs. The specified minimum capacity is also the starting capacity for a pipeline.
+ **Maximum capacity** – The pipeline can increase capacity up to this number of Ingestion OCUs.

![\[Edit capacity interface for pipeline capacity with min and max OCU settings.\]](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/pipeline-scaling.png)


Make sure that the maximum capacity for a pipeline is high enough to handle spikes in workload, and the minimum capacity is low enough to minimize costs when the pipeline isn't busy. Based on your settings, OpenSearch Ingestion automatically scales the number of Ingestion OCUs for your pipeline to process the ingest workload. At any specific time, you're charged only for the Ingestion OCUs that are being actively used by your pipeline.

The capacity allocated to your OpenSearch Ingestion pipeline scales up and down based on the processing requirements of your pipeline and the load generated by your client application. When capacity is constrained, OpenSearch Ingestion scales up by allocating more Ingestion OCUs. When your pipeline is processing smaller workloads, or not processing data at all, it can scale down to the minimum configured Ingestion OCUs.

You can specify a minimum of 1 Ingestion OCU, a maximum of 96 Ingestion OCUs for stateless pipelines, and a maximum of 48 Ingestion OCUs for stateful pipelines. We recommend a minimum of at least 2 Ingestion OCUs for push-based sources. When persistent buffering is enabled, you can specify a minimum of 2 and maximum of 384 Ingestion OCUs.

Given a standard log pipeline with a single source, a simple grok pattern, and a sink, each compute unit can support up to 2 MiB per second. For more complex log pipelines with multiple processors, each compute unit might support less ingest load. OpenSearch Ingestion scales the pipeline based on its configured capacity limits and its current resource utilization.

To ensure high availability, Ingestion OCUs are distributed across Availability Zones (AZs). The number of AZs depends on the minimum capacity that you specify.

For example, if you specify a minimum of 2 compute units, the Ingestion OCUs that are in use at any given time are evenly distributed across 2 AZs. If you specify a minimum of 3 or more compute units, the Ingestion OCUs are evenly distributed across 3 AZs. We recommend that you provision *at least two* Ingestion OCUs to ensure 99.9% availability for your ingest pipelines.

You're not billed for Ingestion OCUs when a pipeline is in the `Create failed`, `Creating`, `Deleting`, or `Stopped` state.

For instructions to configure and retrieve capacity settings for a pipeline, see [Creating pipelines](creating-pipeline.md#create-pipeline).

## OpenSearch Ingestion pricing
<a name="ingestion-pricing"></a>

At any specific time, you only pay for the number of Ingestion OCUs that are allocated to a pipeline, regardless of whether there's data flowing through the pipeline. OpenSearch Ingestion immediately accommodates your workloads by scaling pipeline capacity up or down based on usage.

For full pricing details, see [Amazon OpenSearch Service pricing](https://aws.amazon.com/opensearch-service/pricing/).

## Supported AWS Regions
<a name="osis-regions"></a>

OpenSearch Ingestion is available in a subset of AWS Regions that OpenSearch Service is available in. For a list of supported Regions, see [Amazon OpenSearch Service endpoints and quotas](https://docs.aws.amazon.com/general/latest/gr/opensearch-service.html) in the *AWS General Reference*.

# Setting up roles and users in Amazon OpenSearch Ingestion
<a name="pipeline-security-overview"></a>

Amazon OpenSearch Ingestion uses a variety of permissions models and IAM roles to allow source applications to write to pipelines, and to allow pipelines to write to sinks. Before you can start ingesting data, you need to create one or more IAM roles with specific permissions based on your use case.

At minimum, the following roles are required to set up a successful pipeline.


| Name | Description | 
| --- | --- | 
| [**Pipeline role**](#pipeline-security-sink) |  The pipeline role provides the required permissions for a pipeline to read from the source and write to the domain or collection sink. You can manually create the pipeline role, or you can have OpenSearch Ingestion create it for you.  | 
| [**Ingestion role**](#pipeline-security-same-account) |  The ingestion role contains the `osis:Ingest` permission for the pipeline resource. This permission allows push-based sources to ingest data into a pipeline.  | 

The following image demonstrates a typical pipeline setup, where a data source such as Amazon S3 or Fluent Bit is writing to a pipeline in a different account. In this case, the client needs to assume the ingestion role in order to access the pipeline. For more information, see [Cross-account ingestion](#pipeline-security-different-account).

![\[Cross-account data ingestion pipeline showing client application, roles, and OpenSearch sink.\]](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/pipeline-security.png)


For a simple setup guide, see [Tutorial: Ingesting data into a domain using Amazon OpenSearch Ingestion](osis-get-started.md).

**Topics**
+ [Pipeline role](#pipeline-security-sink)
+ [Ingestion role](#pipeline-security-same-account)
+ [Cross-account ingestion](#pipeline-security-different-account)

## Pipeline role
<a name="pipeline-security-sink"></a>

A pipeline needs certain permissions to read from its source and write to its sink. These permissions depend on the client application or AWS service that is writing to the pipeline, and whether the sink is an OpenSearch Service domain, an OpenSearch Serverless collection, or Amazon S3. In addition, a pipeline might need permissions to physically *pull* data from the source application (if the source is a pull-based plugin), and permissions to write to an S3 dead letter queue, if enabled.

When you create a pipeline, you have the option of specifying an existing IAM role that you manually created, or having OpenSearch Ingestion automatically create the pipeline role based on the source and the sink that you selected. The following image shows how to specify the pipeline role in the AWS Management Console.

![\[Pipeline role selection interface with options to create new or use existing IAM role.\]](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/pipeline-role.png)


**Topics**
+ [Automating pipeline role creation](#pipeline-role-auto-create)
+ [Manually creating the pipeline role](#pipeline-role-manual-create)

### Automating pipeline role creation
<a name="pipeline-role-auto-create"></a>

You can choose to have OpenSearch Ingestion create the pipeline role for you. It automatically identifies which permissions the role requires based on the configured source and sinks. It creates an IAM role with the prefix `OpenSearchIngestion-`, and with the suffix that you enter. For example, if you enter `PipelineRole` as the suffix, OpenSearch Ingestion creates a role named `OpenSearchIngestion-PipelineRole`.

Automatically creating the pipeline role simplifies setup and reduces the likelihood of configuration errors. Because OpenSearch Ingestion assigns the required permissions for you, the correct policies are applied consistently across pipeline deployments, without the risk of a manual misconfiguration.

OpenSearch Ingestion can automatically create the pipeline role only when you use the AWS Management Console. If you're using the AWS CLI, the OpenSearch Ingestion API, or one of the SDKs, you must specify a manually created pipeline role.

To have OpenSearch Ingestion create the role for you, select **Create and use a new service role**.

**Important**  
You still need to manually modify the domain or collection access policy to grant access to the pipeline role. For domains that use fine-grained access control, you must also map the pipeline role to a backend role. You can perform these steps before or after you create the pipeline.   
For instructions, see the following topics:  
[Configure data access for the domain](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/pipeline-domain-access.html#pipeline-access-domain)
[Configure data and network access for the collection](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/pipeline-domain-access.html#pipeline-collection-access)

### Manually creating the pipeline role
<a name="pipeline-role-manual-create"></a>

You might prefer to manually create the pipeline role if you need more control over permissions to meet specific security or compliance requirements. Manual creation allows you to tailor roles to fit existing infrastructure or access management strategies. You might also choose manual setup to integrate the role with other AWS services or ensure it aligns with your unique operational needs.

To choose a manually created pipeline role, select **Use an existing IAM role** and choose an existing role. The role must have all permissions needed to receive data from the selected source and write to the selected sink. The following sections outline how to manually create a pipeline role.

**Topics**
+ [Permissions to read from a source](#pipeline-security--source)
+ [Permissions to write to a domain sink](#pipeline-security-domain-sink)
+ [Permissions to write to a collection sink](#pipeline-security--collection-sink)
+ [Permissions to write to Amazon S3 or a dead-letter queue](#pipeline-security-dlq)

#### Permissions to read from a source
<a name="pipeline-security--source"></a>

An OpenSearch Ingestion pipeline needs permission to read and receive data from the specified source. For example, for an Amazon DynamoDB source, it needs permissions such as `dynamodb:DescribeTable` and `dynamodb:DescribeStream`. For sample pipeline role access policies for common sources, such as Amazon S3, Fluent Bit, and the OpenTelemetry Collector, see [Integrating Amazon OpenSearch Ingestion pipelines with other services and applications](configure-client.md).
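As an illustration, a pipeline role policy for an S3 source that reads objects based on Amazon SQS event notifications might look like the following sketch. The bucket name, queue name, and account ID are placeholders:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadFromS3Source",
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::my-source-bucket/*"
    },
    {
      "Sid": "ReceiveS3EventNotifications",
      "Effect": "Allow",
      "Action": [
        "sqs:ReceiveMessage",
        "sqs:DeleteMessage"
      ],
      "Resource": "arn:aws:sqs:us-east-1:111122223333:my-notification-queue"
    }
  ]
}
```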

#### Permissions to write to a domain sink
<a name="pipeline-security-domain-sink"></a>

An OpenSearch Ingestion pipeline needs permission to write to an OpenSearch Service domain that is configured as its sink. These permissions include the ability to describe the domain and send HTTP requests to it. These permissions are the same for public and VPC domains. For instructions to create a pipeline role and specify it in the domain access policy, see [Allowing pipelines to access domains](pipeline-domain-access.md).

#### Permissions to write to a collection sink
<a name="pipeline-security--collection-sink"></a>

An OpenSearch Ingestion pipeline needs permission to write to an OpenSearch Serverless collection that is configured as its sink. These permissions include the ability to describe the collection and send HTTP requests to it.

First, make sure your pipeline role access policy grants the required permissions. Then, include this role in a data access policy and provide it permissions to create indexes, update indexes, describe indexes, and write documents within the collection. For instructions to complete each of these steps, see [Allowing pipelines to access collections](pipeline-collection-access.md).
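A data access policy that grants those index permissions to the pipeline role might look like the following sketch. The collection name and role ARN are placeholders:

```json
[
  {
    "Rules": [
      {
        "ResourceType": "index",
        "Resource": ["index/ingestion-collection/*"],
        "Permission": [
          "aoss:CreateIndex",
          "aoss:DescribeIndex",
          "aoss:UpdateIndex",
          "aoss:WriteDocument"
        ]
      }
    ],
    "Principal": ["arn:aws:iam::111122223333:role/pipeline-role"]
  }
]
```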

#### Permissions to write to Amazon S3 or a dead-letter queue
<a name="pipeline-security-dlq"></a>

If you specify Amazon S3 as a sink destination for your pipeline, or if you enable a [dead-letter queue](https://opensearch.org/docs/latest/data-prepper/pipelines/dlq/) (DLQ), the pipeline role must grant access to the S3 bucket that you specify as the destination.

Attach a separate permissions policy to the pipeline role that provides DLQ access. At minimum, the role must be granted the `s3:PutObject` action on the bucket resource:

------
#### [ JSON ]

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "WriteToS3DLQ",
      "Effect": "Allow",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::my-dlq-bucket/*"
    }
  ]
}
```

------
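In the pipeline configuration itself, you enable the DLQ on the sink. A sketch, assuming a bucket named `my-dlq-bucket` and a pipeline role granted `s3:PutObject` on it:

```yaml
  sink:
    - opensearch:
        # ...domain or collection configuration...
        dlq:
          s3:
            # Failed events are written to this bucket as objects.
            bucket: "my-dlq-bucket"
            region: "us-east-1"
            sts_role_arn: "arn:aws:iam::111122223333:role/pipeline-role"
```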

## Ingestion role
<a name="pipeline-security-same-account"></a>

The ingestion role is an IAM role that allows external services to securely interact with and send data to an OpenSearch Ingestion pipeline. For push-based sources, such as Amazon Security Lake, this role must grant permissions to push data into the pipeline, including `osis:Ingest`. For pull-based sources, like Amazon S3, the role must enable OpenSearch Ingestion to assume it and access the data with the necessary permissions.

**Topics**
+ [Ingestion role for push-based sources](#ingestion-role-push-based)
+ [Ingestion role for pull-based sources](#ingestion-role-pull-based)
+ [Cross-account ingestion](#pipeline-security-different-account)

### Ingestion role for push-based sources
<a name="ingestion-role-push-based"></a>

For push-based sources, data is sent or pushed to the ingestion pipeline from another service, such as Amazon Security Lake or Amazon DynamoDB. In this scenario, the ingestion role needs, at minimum, the `osis:Ingest` permission to interact with the pipeline.

The following IAM access policy demonstrates how to grant this permission to the ingestion role:

------
#### [ JSON ]

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "osis:Ingest"
      ],
      "Resource": "arn:aws:osis:us-east-1:111122223333:pipeline/pipeline-name/*"
    }
  ]
}
```

------

### Ingestion role for pull-based sources
<a name="ingestion-role-pull-based"></a>

For pull-based sources, the OpenSearch Ingestion pipeline actively pulls or fetches data from an external source, such as Amazon S3. In this case, the pipeline must assume an IAM pipeline role that grants the necessary permissions to access the data source. In these scenarios, the *ingestion role* is synonymous with the *pipeline role*.

The role must include a trust relationship that allows OpenSearch Ingestion to assume it, and permissions specific to the data source. For more information, see [Permissions to read from a source](#pipeline-security--source).
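For example, an S3 source sketch in which the pipeline assumes the role to read objects announced through SQS event notifications. The queue URL, Region, and role ARN are placeholders:

```yaml
  source:
    s3:
      notification_type: "sqs"
      codec:
        newline:
      sqs:
        queue_url: "https://sqs.us-east-1.amazonaws.com/111122223333/my-notification-queue"
      aws:
        region: "us-east-1"
        # OpenSearch Ingestion assumes this role to read from S3 and SQS.
        sts_role_arn: "arn:aws:iam::111122223333:role/pipeline-role"
```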

### Cross-account ingestion
<a name="pipeline-security-different-account"></a>

You might need to ingest data into a pipeline from a different AWS account, such as an application account. To configure cross-account ingestion, define an ingestion role within the same account as the pipeline and establish a trust relationship between the ingestion role and the application account:

------
#### [ JSON ]

```
{
  "Version": "2012-10-17",
  "Statement": [{
     "Effect": "Allow",
     "Principal": {
       "AWS": "arn:aws:iam::444455556666:root"
      },
     "Action": "sts:AssumeRole"
  }]
}
```

------

Then, configure your application to assume the ingestion role. The application account must grant the application role [AssumeRole](https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html) permissions for the ingestion role in the pipeline account.
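A hedged sketch of that identity-based policy, attached to the application role in the application account. The ingestion role name is a placeholder, and `111122223333` represents the pipeline account:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AssumeIngestionRole",
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Resource": "arn:aws:iam::111122223333:role/ingestion-role"
    }
  ]
}
```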

For detailed steps and example IAM policies, see [Providing cross-account ingestion access](configure-client.md#configure-client-cross-account).

# Granting Amazon OpenSearch Ingestion pipelines access to domains
<a name="pipeline-domain-access"></a>

An Amazon OpenSearch Ingestion pipeline needs permission to write to the OpenSearch Service domain that is configured as its sink. To provide access, you configure an AWS Identity and Access Management (IAM) role with a restrictive permissions policy that limits access to the domain that a pipeline is sending data to. For example, you might want to limit an ingestion pipeline to only the domain and indexes that are required to support its use case.

**Important**  
You can choose to manually create the pipeline role, or you can have OpenSearch Ingestion create it for you during pipeline creation. If you choose automatic role creation, OpenSearch Ingestion adds all required permissions to the pipeline role access policy based on the source and sink that you choose. It creates a pipeline role in IAM with the prefix `OpenSearchIngestion-` and the suffix that you enter. For more information, see [Pipeline role](pipeline-security-overview.md#pipeline-security-sink).  
If you have OpenSearch Ingestion create the pipeline role for you, you still need to include the role in the domain access policy and map it to a backend role (if the domain uses fine-grained access control), either before or after you create the pipeline. See step 2 for instructions. 

**Topics**
+ [Step 1: Create the pipeline role](#pipeline-access-configure)
+ [Step 2: Configure data access for the domain](#pipeline-access-domain)

## Step 1: Create the pipeline role
<a name="pipeline-access-configure"></a>

The pipeline role must have an attached permissions policy that allows it to send data to the domain sink. It must also have a trust relationship that allows OpenSearch Ingestion to assume the role. For instructions on how to attach a policy to a role, see [Adding IAM identity permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html#add-policies-console) in the *IAM User Guide*.

The following sample policy demonstrates the [least privilege](https://docs.aws.amazon.com/IAM/latest/UserGuide/best-practices.html#grant-least-privilege) that you can provide in a pipeline role for it to write to a single domain:

------
#### [ JSON ]

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "es:DescribeDomain",
            "Resource": "arn:aws:es:*:111122223333:domain/*"
        },
        {
            "Effect": "Allow",
            "Action": "es:ESHttp*",
            "Resource": "arn:aws:es:*:111122223333:domain/domain-name/*"
        }
    ]
}
```

------

If you plan to reuse the role to write to multiple domains, you can broaden the policy by replacing the domain name with a wildcard character (`*`).

The role must have the following [trust relationship](https://docs.aws.amazon.com/IAM/latest/UserGuide/roles-managingrole-editing-console.html#roles-managingrole_edit-trust-policy), which allows OpenSearch Ingestion to assume the pipeline role:

------
#### [ JSON ]

```
{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Effect":"Allow",
         "Principal":{
            "Service":"osis-pipelines.amazonaws.com"
         },
         "Action":"sts:AssumeRole"
      }
   ]
}
```

------

## Step 2: Configure data access for the domain
<a name="pipeline-access-domain"></a>

In order for a pipeline to write data to a domain, the domain must have a [domain-level access policy](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ac.html#ac-types-resource) that allows the pipeline role to access it.

The following sample domain access policy allows the pipeline role named `pipeline-role` to write data to the domain named `ingestion-domain`:

------
#### [ JSON ]

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::111122223333:role/pipeline-role"
            },
            "Action": [
                "es:DescribeDomain",
                "es:ESHttp*"
            ],
            "Resource": "arn:aws:es:us-east-1:111122223333:domain/domain-name/*"
        }
    ]
}
```

------

### Map the pipeline role (only for domains that use fine-grained access control)
<a name="pipeline-access-domain-fgac"></a>

If your domain uses [fine-grained access control](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/fgac.html) for authentication, there are extra steps you need to take to provide your pipeline access to a domain. The steps differ depending on your domain configuration:
+ **Scenario 1: Different master role and pipeline role** – If you're using an IAM Amazon Resource Name (ARN) as the master user and it's *different* than the pipeline role, you need to map the pipeline role to the OpenSearch `all_access` backend role. This adds the pipeline role as an additional master user. For more information, see [Additional master users](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/fgac.html#fgac-more-masters).
+ **Scenario 2: Master user in the internal user database** – If your domain uses a master user in the internal user database and HTTP basic authentication for OpenSearch Dashboards, you can't pass the master username and password directly into the pipeline configuration. Instead, map the pipeline role to the OpenSearch `all_access` backend role. This adds the pipeline role as an additional master user. For more information, see [Additional master users](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/fgac.html#fgac-more-masters).
+ **Scenario 3: Same master role and pipeline role (uncommon)** – If you're using an IAM ARN as the master user, and it's the same ARN that you're using as the pipeline role, you don't need to take any further action. The pipeline has the required permissions to write to the domain. This scenario is uncommon because most environments use an administrator role or some other role as the master role.

The following image shows how to map the pipeline role to a backend role:

![\[Backend roles section showing an AWS IAM role ARN for a pipeline role with a Remove option.\]](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/ingestion-fgac.png)


# Granting Amazon OpenSearch Ingestion pipelines access to collections
<a name="pipeline-collection-access"></a>

An Amazon OpenSearch Ingestion pipeline can write to an OpenSearch Serverless public collection or VPC collection. To provide access to the collection, you configure an AWS Identity and Access Management (IAM) pipeline role with a permissions policy that grants access to the collection. The pipeline assumes this role in order to sign requests to the OpenSearch Serverless collection sink.

**Important**  
You can choose to manually create the pipeline role, or you can have OpenSearch Ingestion create it for you during pipeline creation. If you choose automatic role creation, OpenSearch Ingestion adds all required permissions to the pipeline role access policy based on the source and sink that you choose. It creates a pipeline role in IAM with the prefix `OpenSearchIngestion-` and the suffix that you enter. For more information, see [Pipeline role](pipeline-security-overview.md#pipeline-security-sink).  
If you have OpenSearch Ingestion create the pipeline role for you, you still need to include the role in the collection's data access policy, either before or after you create the pipeline. See step 2 for instructions. 

During pipeline creation, OpenSearch Ingestion creates an AWS PrivateLink connection between the pipeline and the OpenSearch Serverless collection. All traffic from the pipeline goes through this VPC endpoint and is routed to the collection. In order to reach the collection, the endpoint must be granted access to the collection through a network access policy.

![\[OpenSearch Ingestion pipeline connecting to OpenSearch Serverless collection via PrivateLink VPC endpoint.\]](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/osis-aoss-permissions.png)


**Topics**
+ [Step 1: Create the pipeline role](#pipeline-collection-access-configure)
+ [Step 2: Configure data and network access for the collection](#pipeline-access-collection)

## Step 1: Create the pipeline role
<a name="pipeline-collection-access-configure"></a>

The pipeline role must have an attached permissions policy that allows it to send data to the collection sink. It must also have a trust relationship that allows OpenSearch Ingestion to assume the role. For instructions on how to attach a policy to a role, see [Adding IAM identity permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html#add-policies-console) in the *IAM User Guide*.

The following sample policy demonstrates the [least privilege](https://docs.aws.amazon.com/IAM/latest/UserGuide/best-practices.html#grant-least-privilege) permissions that a pipeline role needs in order to write to collections:

------
#### [ JSON ]

****  

```
{
    "Version":"2012-10-17",
    "Statement": [
        {
            "Sid": "Statement1",
            "Effect": "Allow",
            "Action": [
                "aoss:APIAccessAll",
                "aoss:BatchGetCollection",
                "aoss:CreateSecurityPolicy",
                "aoss:GetSecurityPolicy",
                "aoss:UpdateSecurityPolicy"
            ],
            "Resource": "*"
        }
    ]
}
```

------

The role must have the following [trust relationship](https://docs.aws.amazon.com/IAM/latest/UserGuide/roles-managingrole-editing-console.html#roles-managingrole_edit-trust-policy), which allows OpenSearch Ingestion to assume it:

------
#### [ JSON ]

****  

```
{
    "Version":"2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "osis-pipelines.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
```

------

## Step 2: Configure data and network access for the collection
<a name="pipeline-access-collection"></a>

Create an OpenSearch Serverless collection with the following settings. For instructions to create a collection, see [Creating collections](serverless-create.md).

### Data access policy
<a name="pipeline-data-access"></a>

Create a [data access policy](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-data-access.html) for the collection that grants the required permissions to the pipeline role. For example:

```
[
  {
    "Rules": [
      {
        "Resource": [
          "index/collection-name/*"
        ],
        "Permission": [
          "aoss:CreateIndex",
          "aoss:UpdateIndex",
          "aoss:DescribeIndex",
          "aoss:WriteDocument"
        ],
        "ResourceType": "index"
      }
    ],
    "Principal": [
      "arn:aws:iam::account-id:role/pipeline-role"
    ],
    "Description": "Pipeline role access"
  }
]
```

**Note**  
In the `Principal` element, specify the Amazon Resource Name (ARN) of the pipeline role.

### Network access policy
<a name="pipeline-network-access"></a>

Each collection that you create in OpenSearch Serverless has at least one network access policy associated with it. Network access policies determine whether the collection is accessible over the internet from public networks, or whether it must be accessed privately. For more information about network policies, see [Network access for Amazon OpenSearch Serverless](serverless-network.md).

Within a network access policy, you can only specify OpenSearch Serverless-managed VPC endpoints. For more information, see [Data plane access through AWS PrivateLink](serverless-vpc.md). However, in order for the pipeline to write to the collection, the policy must also grant access to the VPC endpoint that OpenSearch Ingestion automatically creates between the pipeline and the collection. Therefore, if you choose an OpenSearch Serverless collection as the destination sink for a pipeline, you must enter the name of the associated network policy in the **Network policy name** field.

During pipeline creation, OpenSearch Ingestion checks for the existence of the specified network policy. If it doesn't exist, OpenSearch Ingestion creates it. If it does exist, OpenSearch Ingestion updates it by adding a new rule to it. The rule grants access to the VPC endpoint that connects the pipeline and the collection. 

For example, in the following policy, `vpce-0c510712627e27269` is the ID of the VPC endpoint that OpenSearch Ingestion creates between the pipeline and the collection:

```
{
   "Rules":[
      {
         "Resource":[
            "collection/my-collection"
         ],
         "ResourceType":"collection"
      }
   ],
   "SourceVPCEs":[
      "vpce-0c510712627e27269"
   ],
   "Description":"Created by Data Prepper"
}
```

In the console, any rules that OpenSearch Ingestion adds to your network policies are named **Created by Data Prepper**:

![\[Configuration details for OpenSearch endpoint access, including VPC endpoint and resources.\]](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/osis-aoss-network.png)


**Note**  
In general, a rule that specifies public access for a collection overrides a rule that specifies private access. Therefore, if the policy already had *public* access configured, this new rule that OpenSearch Ingestion adds doesn't actually change the behavior of the policy. For more information, see [Policy precedence](serverless-network.md#serverless-network-precedence).

If you stop or delete the pipeline, OpenSearch Ingestion deletes the VPC endpoint between the pipeline and the collection. It also modifies the network policy to remove the VPC endpoint from the list of allowed endpoints. If you restart the pipeline, OpenSearch Ingestion recreates the VPC endpoint and updates the network policy with the new endpoint ID.

# Getting started with Amazon OpenSearch Ingestion
<a name="osis-getting-started-tutorials"></a>

Amazon OpenSearch Ingestion supports ingesting data into managed OpenSearch Service domains and OpenSearch Serverless collections. The following tutorials walk you through the basic steps to get a pipeline up and running.

The first tutorial shows you how to use Amazon OpenSearch Ingestion to configure a simple pipeline and ingest data into an Amazon OpenSearch Service domain.

The second tutorial shows you how to use Amazon OpenSearch Ingestion to configure a simple pipeline and ingest data into an Amazon OpenSearch Serverless collection.

**Note**  
Pipeline creation will fail if you don't set up the correct permissions. See [Setting up roles and users in Amazon OpenSearch Ingestion](pipeline-security-overview.md) for a better understanding of the required roles before you create a pipeline.

**Topics**
+ [Tutorial: Ingesting data into a domain using Amazon OpenSearch Ingestion](osis-get-started.md)
+ [Tutorial: Ingesting data into a collection using Amazon OpenSearch Ingestion](osis-serverless-get-started.md)

# Tutorial: Ingesting data into a domain using Amazon OpenSearch Ingestion
<a name="osis-get-started"></a>

This tutorial shows you how to use Amazon OpenSearch Ingestion to configure a simple pipeline and ingest data into an Amazon OpenSearch Service domain. A *pipeline* is a resource that OpenSearch Ingestion provisions and manages. You can use a pipeline to filter, enrich, transform, normalize, and aggregate data for downstream analytics and visualization in OpenSearch Service.

This tutorial walks you through the basic steps to get a pipeline up and running quickly. For more comprehensive instructions, see [Creating pipelines](creating-pipeline.md#create-pipeline).

You'll complete the following steps in this tutorial:

1. [Create the pipeline role](#osis-get-started-role).

1. [Create a domain](#osis-get-started-access).

1. [Create a pipeline](#osis-get-started-pipeline).

1. [Ingest some sample data](#osis-get-started-ingest).

Within the tutorial, you'll create the following resources:
+ A role named `PipelineRole` that the pipeline assumes in order to write to the domain
+ A domain named `ingestion-domain` that the pipeline writes to
+ A pipeline named `ingestion-pipeline`

## Required permissions
<a name="osis-get-started-permissions"></a>

To complete this tutorial, your user or role must have an attached [identity-based policy](security-iam-serverless.md#security-iam-serverless-id-based-policies) with the following minimum permissions. These permissions allow you to create a pipeline role and attach a policy (`iam:Create*` and `iam:Attach*`), create or modify a domain (`es:*`), and work with pipelines (`osis:*`).

------
#### [ JSON ]

****  

```
{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Effect":"Allow",
         "Resource":"*",
         "Action":[
            "osis:*",
            "iam:Create*",
            "iam:Attach*",
            "es:*"
         ]
      },
      {
         "Resource":[
            "arn:aws:iam::111122223333:role/OpenSearchIngestion-PipelineRole"
         ],
         "Effect":"Allow",
         "Action":[
            "iam:CreateRole",
            "iam:AttachRolePolicy",
            "iam:PassRole"
         ]
      }
   ]
}
```

------

## Step 1: Create the pipeline role
<a name="osis-get-started-role"></a>

First, create a role that the pipeline will assume in order to access the OpenSearch Service domain sink. You'll include this role within the pipeline configuration later in this tutorial.

**To create the pipeline role**

1. Open the AWS Identity and Access Management console at [https://console.aws.amazon.com/iamv2/](https://console.aws.amazon.com/iamv2/).

1. Choose **Policies**, and then choose **Create policy**.

1. In this tutorial, you'll ingest data into a domain called `ingestion-domain`, which you'll create in the next step. Select **JSON** and paste the following policy into the editor. Replace `111122223333` with your AWS account ID, and modify the Region if necessary.

------
#### [ JSON ]

****  

   ```
   {
       "Version":"2012-10-17",
       "Statement": [
           {
               "Effect": "Allow",
               "Action": "es:DescribeDomain",
               "Resource": "arn:aws:es:us-east-1:111122223333:domain/ingestion-domain"
           },
           {
               "Effect": "Allow",
               "Action": "es:ESHttp*",
               "Resource": "arn:aws:es:us-east-1:111122223333:domain/ingestion-domain/*"
           }
       ]
   }
   ```

------

   If you want to write data to an *existing* domain, replace `ingestion-domain` with the name of your domain.
**Note**  
For simplicity in this tutorial, we use a broad access policy. In production environments, however, we recommend that you apply a more restrictive access policy to your pipeline role. For an example policy that provides the minimum required permissions, see [Granting Amazon OpenSearch Ingestion pipelines access to domains](pipeline-domain-access.md).

1. Choose **Next**, choose **Next**, and name your policy **pipeline-policy**.

1. Choose **Create policy**.

1. Next, create a role and attach the policy to it. Choose **Roles**, and then choose **Create role**.

1. Choose **Custom trust policy** and paste the following policy into the editor:

------
#### [ JSON ]

****  

   ```
   {
      "Version":"2012-10-17",
      "Statement":[
         {
            "Effect":"Allow",
            "Principal":{
               "Service":"osis-pipelines.amazonaws.com"
            },
            "Action":"sts:AssumeRole"
         }
      ]
   }
   ```

------

1. Choose **Next**. Then search for and select **pipeline-policy** (which you just created).

1. Choose **Next** and name the role **PipelineRole**.

1. Choose **Create role**.

Remember the Amazon Resource Name (ARN) of the role (for example, `arn:aws:iam::your-account-id:role/PipelineRole`). You'll need it when you create your pipeline.
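If you script role creation instead of using the console, it can be useful to sanity-check the trust policy before calling `iam:CreateRole`. The following is a minimal sketch with the standard library; the policy text mirrors the trust policy above, and the `allows_osis_assume` helper is an illustrative name, not an AWS API:

```python
# Sanity-check that a trust policy allows the OpenSearch Ingestion service
# principal to assume the role. The policy document mirrors the tutorial's.
import json

trust_policy = json.loads("""
{
   "Version": "2012-10-17",
   "Statement": [
      {
         "Effect": "Allow",
         "Principal": {"Service": "osis-pipelines.amazonaws.com"},
         "Action": "sts:AssumeRole"
      }
   ]
}
""")

def allows_osis_assume(policy: dict) -> bool:
    """Return True if any statement lets the osis-pipelines service assume the role."""
    for stmt in policy.get("Statement", []):
        principal = stmt.get("Principal", {}).get("Service", "")
        services = [principal] if isinstance(principal, str) else principal
        if (stmt.get("Effect") == "Allow"
                and stmt.get("Action") == "sts:AssumeRole"
                and any("osis-pipelines" in s for s in services)):
            return True
    return False

print(allows_osis_assume(trust_policy))  # prints True
```

A check like this catches the most common cut-and-paste mistakes (wrong service principal, `Deny` effect, or a missing `sts:AssumeRole` action) before IAM rejects the pipeline at creation time.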

## Step 2: Create a domain
<a name="osis-get-started-access"></a>

Next, create a domain named `ingestion-domain` to ingest data into.

Navigate to the Amazon OpenSearch Service console at [https://console.aws.amazon.com/aos/home](https://console.aws.amazon.com/aos/home) and [create a domain](createupdatedomains.md) that meets the following requirements:
+ Is running OpenSearch 1.0 or later, or Elasticsearch 7.4 or later
+ Uses public access
+ Does not use fine-grained access control

**Note**  
These requirements are meant to ensure simplicity in this tutorial. In production environments, you can configure a domain with VPC access and/or use fine-grained access control. To use fine-grained access control, see [Map the pipeline role](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/pipeline-domain-access.html#pipeline-access-domain).

The domain must have an access policy that grants permission to the `OpenSearchIngestion-PipelineRole` IAM role, which OpenSearch Ingestion will create for you when you create the pipeline in the next step. The pipeline will assume this role in order to send data to the domain sink.

Make sure that the domain has the following domain-level access policy, which grants the pipeline role access to the domain. Replace the Region and account ID with your own:

------
#### [ JSON ]

****  

```
{
  "Version":"2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::111122223333:role/OpenSearchIngestion-PipelineRole"
      },
      "Action": "es:*",
      "Resource": "arn:aws:es:us-east-1:111122223333:domain/ingestion-domain/*"
    }
  ]
}
```

------

For more information about creating domain-level access policies, see [Resource-based policies](ac.md#ac-types-resource).

If you already have a domain created, modify its existing access policy to provide the above permissions to `OpenSearchIngestion-PipelineRole`.

## Step 3: Create a pipeline
<a name="osis-get-started-pipeline"></a>

Now that you have a domain, you can create a pipeline.

**To create a pipeline**

1. Within the Amazon OpenSearch Service console, choose **Pipelines** from the left navigation pane.

1. Choose **Create pipeline**.

1. Select the **Blank** pipeline, then choose **Select blueprint**.

1. In this tutorial, we'll create a simple pipeline that uses the [HTTP source](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/http-source/) plugin. The plugin accepts log data in a JSON array format. We'll specify a single OpenSearch Service domain as the sink, and ingest all data into the `application_logs` index.

   In the **Source** menu, choose **HTTP**. For the **Path**, enter **/logs**.

1. For simplicity in this tutorial, we'll configure public access for the pipeline. For **Source network options**, choose **Public access**. For information about configuring VPC access, see [Configuring VPC access for Amazon OpenSearch Ingestion pipelines](pipeline-security.md).

1. Choose **Next**.

1. For **Processor**, enter **Date** and choose **Add**.

1. Enable **From time received**. Leave all other settings as their defaults.

1. Choose **Next**.

1. Configure sink details. For **OpenSearch resource type**, choose **Managed cluster**. Then choose the OpenSearch Service domain that you created in the previous section.

   For **Index name**, enter **application_logs**. OpenSearch Ingestion automatically creates this index in the domain if it doesn't already exist.

1. Choose **Next**.

1. Name the pipeline **ingestion-pipeline**. Leave the capacity settings as their defaults.

1. For **Pipeline role**, select **Create and use a new service role**. The pipeline role provides the required permissions for a pipeline to write to the domain sink and read from pull-based sources. By selecting this option, you allow OpenSearch Ingestion to create the role for you, rather than manually creating it in IAM. For more information, see [Setting up roles and users in Amazon OpenSearch Ingestion](pipeline-security-overview.md).

1. For **Service role name suffix**, enter **PipelineRole**. In IAM, the role will have the format `arn:aws:iam::your-account-id:role/OpenSearchIngestion-PipelineRole`.

1. Choose **Next**. Review your pipeline configuration and choose **Create pipeline**. The pipeline takes 5–10 minutes to become active.
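For reference, the console choices above correspond to a pipeline configuration along the lines of the following YAML sketch. This is not the exact file the console generates: the domain endpoint and account ID are placeholders, and the option names follow the Data Prepper HTTP source, date processor, and OpenSearch sink plugins, so verify them against your pipeline's generated configuration.

```yaml
version: "2"
log-pipeline:
  source:
    http:
      path: "/logs"
  processor:
    - date:
        from_time_received: true
        destination: "@timestamp"
  sink:
    - opensearch:
        hosts: ["https://search-ingestion-domain.us-east-1.es.amazonaws.com"]
        index: "application_logs"
        aws:
          sts_role_arn: "arn:aws:iam::111122223333:role/OpenSearchIngestion-PipelineRole"
          region: "us-east-1"
```

You can view and edit the full configuration for a pipeline on its details page in the console after creation.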

## Step 4: Ingest some sample data
<a name="osis-get-started-ingest"></a>

When the pipeline status is `Active`, you can start ingesting data into it. You must sign all HTTP requests to the pipeline using [Signature Version 4](https://docs.aws.amazon.com/general/latest/gr/signature-version-4.html). Use an HTTP tool such as [Postman](https://www.getpostman.com/) or [awscurl](https://github.com/okigan/awscurl) to send some data to the pipeline. As with indexing data directly to a domain, ingesting data into a pipeline always requires either an IAM role or an [IAM access key and secret key](https://docs.aws.amazon.com/powershell/latest/userguide/pstools-appendix-sign-up.html). 

**Note**  
The principal signing the request must have the `osis:Ingest` IAM permission.

First, get the ingestion URL from the **Pipeline settings** page:

![\[Pipeline settings page showing ingestion URL and other configuration details.\]](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/pipeline-endpoint.png)


Then, ingest some sample data. The following request uses [awscurl](https://github.com/okigan/awscurl) to send a single log event to the pipeline:

```
awscurl --service osis --region us-east-1 \
    -X POST \
    -H "Content-Type: application/json" \
    -d '[{"time":"2014-08-11T11:40:13+00:00","remote_addr":"122.226.223.69","status":"404","request":"GET http://www.k2proxy.com//hello.html HTTP/1.1","http_user_agent":"Mozilla/4.0 (compatible; WOW64; SLCC2;)"}]' \
    https://pipeline-endpoint.us-east-1.osis.amazonaws.com/logs
```

You should see a `200 OK` response. If you get an authentication error, it might be because you're ingesting data from a separate account than the pipeline is in. See [Fixing permissions issues](#osis-get-started-troubleshoot).
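awscurl handles Signature Version 4 for you. If you're writing your own client, the signing-key derivation step can be sketched with the Python standard library. The secret key and date below are illustrative placeholders, and the full protocol also requires building a canonical request and a string to sign:

```python
# A stdlib sketch of the SigV4 signing-key derivation that awscurl performs
# for you. The secret key and date are placeholders, not real credentials.
import hashlib
import hmac

def derive_signing_key(secret_key: str, date: str, region: str, service: str) -> bytes:
    """Derive the Signature Version 4 signing key (a chain of HMAC-SHA256 steps)."""
    k_date = hmac.new(("AWS4" + secret_key).encode(), date.encode(), hashlib.sha256).digest()
    k_region = hmac.new(k_date, region.encode(), hashlib.sha256).digest()
    k_service = hmac.new(k_region, service.encode(), hashlib.sha256).digest()
    return hmac.new(k_service, b"aws4_request", hashlib.sha256).digest()

# Requests to OpenSearch Ingestion pipelines are signed with service name "osis".
key = derive_signing_key("EXAMPLEKEY", "20140811", "us-east-1", "osis")
print(len(key))  # prints 32 (a 256-bit HMAC-SHA256 key)
```

In practice, use an SDK or a tool like awscurl rather than implementing the signing process yourself.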

Now, query the `application_logs` index to ensure that your log entry was successfully ingested:

```
awscurl --service es --region us-east-1 \
     -X GET \
     https://search-ingestion-domain.us-east-1.es.amazonaws.com/application_logs/_search | json_pp
```

**Sample response**:

```
{
   "took":984,
   "timed_out":false,
   "_shards":{
      "total":1,
      "successful":1,
      "skipped":0,
      "failed":0
   },
   "hits":{
      "total":{
         "value":1,
         "relation":"eq"
      },
      "max_score":1.0,
      "hits":[
         {
            "_index":"application_logs",
            "_type":"_doc",
            "_id":"z6VY_IMBRpceX-DU6V4O",
            "_score":1.0,
            "_source":{
               "time":"2014-08-11T11:40:13+00:00",
               "remote_addr":"122.226.223.69",
               "status":"404",
               "request":"GET http://www.k2proxy.com//hello.html HTTP/1.1",
               "http_user_agent":"Mozilla/4.0 (compatible; WOW64; SLCC2;)",
               "@timestamp":"2022-10-21T21:00:25.502Z"
            }
         }
      ]
   }
}
```

## Fixing permissions issues
<a name="osis-get-started-troubleshoot"></a>

If you followed the steps in the tutorial and you still see authentication errors when you try to ingest data, it might be because the role that is writing to a pipeline is in a different AWS account than the pipeline itself. In this case, you need to create and [assume a role](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use.html) that specifically enables you to ingest data. For instructions, see [Providing cross-account ingestion access](configure-client.md#configure-client-cross-account).

## Related resources
<a name="osis-get-started-next"></a>

This tutorial presented a simple use case of ingesting a single document over HTTP. In production scenarios, you'll configure your client applications (such as Fluent Bit, Kubernetes, or the OpenTelemetry Collector) to send data to one or more pipelines. Your pipelines will likely be more complex than the simple example in this tutorial.

To get started configuring your clients and ingesting data, see the following resources:
+ [Creating and managing pipelines](creating-pipeline.md#create-pipeline)
+ [Configuring your clients to send data to OpenSearch Ingestion](configure-client.md)
+ [Data Prepper documentation](https://opensearch.org/docs/latest/clients/data-prepper/index/)

# Tutorial: Ingesting data into a collection using Amazon OpenSearch Ingestion
<a name="osis-serverless-get-started"></a>

This tutorial shows you how to use Amazon OpenSearch Ingestion to configure a simple pipeline and ingest data into an Amazon OpenSearch Serverless collection. A *pipeline* is a resource that OpenSearch Ingestion provisions and manages. You can use a pipeline to filter, enrich, transform, normalize, and aggregate data for downstream analytics and visualization in OpenSearch Service.

For a tutorial that demonstrates how to ingest data into a provisioned OpenSearch Service *domain*, see [Tutorial: Ingesting data into a domain using Amazon OpenSearch Ingestion](osis-get-started.md).

You'll complete the following steps in this tutorial:

1. [Create a collection](#osis-serverless-get-started-access).

1. [Create a pipeline](#osis-serverless-get-started-pipeline).

1. [Ingest some sample data](#osis-serverless-get-started-ingest).

Within the tutorial, you'll create the following resources:
+ A collection named `ingestion-collection` that the pipeline will write to
+ A pipeline named `ingestion-pipeline-serverless`

## Required permissions
<a name="osis-serverless-get-started-permissions"></a>

To complete this tutorial, your user or role must have an attached [identity-based policy](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/security-iam-serverless.html#security-iam-serverless-id-based-policies) with the following minimum permissions. These permissions allow you to create a pipeline role and attach a policy (`iam:Create*` and `iam:Attach*`), create or modify a collection (`aoss:*`), and work with pipelines (`osis:*`).

In addition, several IAM permissions are required in order to automatically create the pipeline role and pass it to OpenSearch Ingestion so that it can write data to the collection.

------
#### [ JSON ]

****  

```
{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Effect":"Allow",
         "Resource":"*",
         "Action":[
            "osis:*",
            "iam:Create*",
            "iam:Attach*",
            "aoss:*"
         ]
      },
      {
         "Resource":[
            "arn:aws:iam::111122223333:role/OpenSearchIngestion-PipelineRole"
         ],
         "Effect":"Allow",
         "Action":[
            "iam:CreateRole",
            "iam:AttachRolePolicy",
            "iam:PassRole"
         ]
      }
   ]
}
```

------

## Step 1: Create a collection
<a name="osis-serverless-get-started-access"></a>

First, create a collection to ingest data into. We'll name the collection `ingestion-collection`.

1. Navigate to the Amazon OpenSearch Service console at [https://console.aws.amazon.com/aos/home](https://console.aws.amazon.com/aos/home).

1. Choose **Collections** from the left navigation and choose **Create collection**.

1. Name the collection **ingestion-collection**.

1. For **Security**, choose **Standard create**.

1. Under **Network access settings**, change the access type to **Public**.

1. Keep all other settings as their defaults and choose **Next**.

1. Now, configure a data access policy for the collection. Deselect **Automatically match access policy settings**.

1. For **Definition method**, choose **JSON** and paste the following policy into the editor. This policy does two things:
   + Allows the pipeline role to write to the collection.
   + Allows you to *read* from the collection. Later, after you ingest some sample data into the pipeline, you'll query the collection to ensure that the data was successfully ingested and written to the index.

     ```
     [
       {
         "Rules": [
           {
             "Resource": [
               "index/ingestion-collection/*"
             ],
             "Permission": [
               "aoss:CreateIndex",
               "aoss:UpdateIndex",
               "aoss:DescribeIndex",
               "aoss:ReadDocument",
               "aoss:WriteDocument"
             ],
             "ResourceType": "index"
           }
         ],
         "Principal": [
           "arn:aws:iam::your-account-id:role/OpenSearchIngestion-PipelineRole",
           "arn:aws:iam::your-account-id:role/Admin"
         ],
         "Description": "Rule 1"
       }
     ]
     ```

1. Modify the `Principal` elements to include your AWS account ID. For the second principal, specify a user or role that you can use to query the collection later.

1. Choose **Next**. Name the access policy **pipeline-collection-access** and choose **Next** again.

1. Review your collection configuration and choose **Submit**.

## Step 2: Create a pipeline
<a name="osis-serverless-get-started-pipeline"></a>

Now that you have a collection, you can create a pipeline.

**To create a pipeline**

1. Within the Amazon OpenSearch Service console, choose **Pipelines** from the left navigation pane.

1. Choose **Create pipeline**.

1. Select the **Blank** pipeline, then choose **Select blueprint**.

1. In this tutorial, we'll create a simple pipeline that uses the [HTTP source](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/http-source/) plugin. The plugin accepts log data in a JSON array format. We'll specify a single OpenSearch Serverless collection as the sink, and ingest all data into the `my_logs` index.

   In the **Source** menu, choose **HTTP**. For the **Path**, enter **/logs**.

1. For simplicity in this tutorial, we'll configure public access for the pipeline. For **Source network options**, choose **Public access**. For information about configuring VPC access, see [Configuring VPC access for Amazon OpenSearch Ingestion pipelines](pipeline-security.md).

1. Choose **Next**.

1. For **Processor**, enter **Date** and choose **Add**.

1. Enable **From time received**. Leave all other settings as their defaults.

1. Choose **Next**.

1. Configure sink details. For **OpenSearch resource type**, choose **Collection (Serverless)**. Then choose the OpenSearch Service collection that you created in the previous section.

   Leave the network policy name as the default. For **Index name**, enter **my_logs**. OpenSearch Ingestion automatically creates this index in the collection if it doesn't already exist.

1. Choose **Next**.

1. Name the pipeline **ingestion-pipeline-serverless**. Leave the capacity settings as their defaults.

1. For **Pipeline role**, select **Create and use a new service role**. The pipeline role provides the required permissions for a pipeline to write to the collection sink and read from pull-based sources. By selecting this option, you allow OpenSearch Ingestion to create the role for you, rather than manually creating it in IAM. For more information, see [Setting up roles and users in Amazon OpenSearch Ingestion](pipeline-security-overview.md).

1. For **Service role name suffix**, enter **PipelineRole**. In IAM, the role will have the format `arn:aws:iam::your-account-id:role/OpenSearchIngestion-PipelineRole`.

1. Choose **Next**. Review your pipeline configuration and choose **Create pipeline**. The pipeline takes 5–10 minutes to become active.
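For reference, the console choices above correspond to a pipeline configuration along the lines of the following YAML sketch. The collection endpoint, account ID, and network policy name are placeholders, and the `serverless` and `serverless_options` settings follow the OpenSearch Ingestion sink options for collections, so treat the exact names as assumptions to verify against your pipeline's generated configuration.

```yaml
version: "2"
log-pipeline:
  source:
    http:
      path: "/logs"
  processor:
    - date:
        from_time_received: true
        destination: "@timestamp"
  sink:
    - opensearch:
        hosts: ["https://collection-id.us-east-1.aoss.amazonaws.com"]
        index: "my_logs"
        aws:
          sts_role_arn: "arn:aws:iam::111122223333:role/OpenSearchIngestion-PipelineRole"
          region: "us-east-1"
          serverless: true
          serverless_options:
            network_policy_name: "network-policy-name"
```

The main difference from the domain sink configuration is the collection endpoint and the serverless settings, which tell the pipeline to sign requests for OpenSearch Serverless and to manage the named network policy.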

## Step 3: Ingest some sample data
<a name="osis-serverless-get-started-ingest"></a>

When the pipeline status is `Active`, you can start ingesting data into it. You must sign all HTTP requests to the pipeline using [Signature Version 4](https://docs.aws.amazon.com/general/latest/gr/signature-version-4.html). Use an HTTP tool such as [Postman](https://www.getpostman.com/) or [awscurl](https://github.com/okigan/awscurl) to send some data to the pipeline. As with indexing data directly to a collection, ingesting data into a pipeline always requires either an IAM role or an [IAM access key and secret key](https://docs.aws.amazon.com/powershell/latest/userguide/pstools-appendix-sign-up.html). 

**Note**  
The principal signing the request must have the `osis:Ingest` IAM permission.

First, get the ingestion URL from the **Pipeline settings** page:

![\[Pipeline settings page showing ingestion URL and other configuration details.\]](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/pipeline-endpoint.png)


Then, send some sample data to the ingestion path. The following sample request uses [awscurl](https://github.com/okigan/awscurl) to send a single log event to the pipeline:

```
awscurl --service osis --region us-east-1 \
    -X POST \
    -H "Content-Type: application/json" \
    -d '[{"time":"2014-08-11T11:40:13+00:00","remote_addr":"122.226.223.69","status":"404","request":"GET http://www.k2proxy.com//hello.html HTTP/1.1","http_user_agent":"Mozilla/4.0 (compatible; WOW64; SLCC2;)"}]' \
    https://pipeline-endpoint.us-east-1.osis.amazonaws.com/logs
```

You should see a `200 OK` response.

Now, query the `my_logs` index to ensure that the log entry was successfully ingested:

```
awscurl --service aoss --region us-east-1 \
     -X GET \
     https://collection-id.us-east-1.aoss.amazonaws.com/my_logs/_search | json_pp
```

**Sample response**:

```
{
   "took":348,
   "timed_out":false,
   "_shards":{
      "total":0,
      "successful":0,
      "skipped":0,
      "failed":0
   },
   "hits":{
      "total":{
         "value":1,
         "relation":"eq"
      },
      "max_score":1.0,
      "hits":[
         {
            "_index":"my_logs",
            "_id":"1%3A0%3ARJgDvIcBTy5m12xrKE-y",
            "_score":1.0,
            "_source":{
               "time":"2014-08-11T11:40:13+00:00",
               "remote_addr":"122.226.223.69",
               "status":"404",
               "request":"GET http://www.k2proxy.com//hello.html HTTP/1.1",
               "http_user_agent":"Mozilla/4.0 (compatible; WOW64; SLCC2;)",
               "@timestamp":"2023-04-26T05:22:16.204Z"
            }
         }
      ]
   }
}
```

## Related resources
<a name="osis-serverless-get-started-next"></a>

This tutorial presented a simple use case of ingesting a single document over HTTP. In production scenarios, you'll configure your client applications (such as Fluent Bit, Kubernetes, or the OpenTelemetry Collector) to send data to one or more pipelines. Your pipelines will likely be more complex than the simple example in this tutorial.

To get started configuring your clients and ingesting data, see the following resources:
+ [Creating and managing pipelines](creating-pipeline.md#create-pipeline)
+ [Configuring your clients to send data to OpenSearch Ingestion](configure-client.md)
+ [Data Prepper documentation](https://opensearch.org/docs/latest/clients/data-prepper/index/)

# Overview of pipeline features in Amazon OpenSearch Ingestion
<a name="osis-features-overview"></a>

Amazon OpenSearch Ingestion provisions *pipelines*, which consist of a source, a buffer, zero or more processors, and one or more sinks. Ingestion pipelines are powered by Data Prepper as the data engine. For an overview of the various components of a pipeline, see [Key concepts in Amazon OpenSearch Ingestion](ingestion-process.md).

The following sections provide an overview of some of the most commonly used features in Amazon OpenSearch Ingestion.

**Note**  
This is not an exhaustive list of features that are available for pipelines. For comprehensive documentation of all available pipeline functionality, see the [Data Prepper documentation](https://opensearch.org/docs/latest/data-prepper/pipelines/pipelines/). Note that OpenSearch Ingestion places some constraints on the plugins and options that you can use. For more information, see [Supported plugins and options for Amazon OpenSearch Ingestion pipelines](pipeline-config-reference.md).

**Topics**
+ [Persistent buffering](#persistent-buffering)
+ [Splitting](#osis-features-splitting)
+ [Chaining](#osis-features-chaining)
+ [Dead-letter queues](#osis-features-dlq)
+ [Index management](#osis-features-index-management)
+ [End-to-end acknowledgement](#osis-features-e2e)
+ [Source back pressure](#osis-features-backpressure)

## Persistent buffering
<a name="persistent-buffering"></a>

A persistent buffer stores your data in a disk-based buffer across multiple Availability Zones to enhance data durability. You can use persistent buffering to ingest data from all supported push-based sources without setting up a standalone buffer. These sources include HTTP and OpenTelemetry for logs, traces, and metrics. To enable persistent buffering, choose **Enable persistent buffer** when you create or update a pipeline. For more information, see [Creating Amazon OpenSearch Ingestion pipelines](creating-pipeline.md). 

OpenSearch Ingestion dynamically determines the number of OCUs to use for persistent buffering, factoring in the data source, streaming transformations, and sink destination. Because it allocates some OCUs to buffering, you might need to increase the minimum and maximum OCUs to maintain the same ingestion throughput. Pipelines retain data in the buffer for up to 72 hours.

If you enable persistent buffering for a pipeline, the default maximum request payload sizes are as follows:
+ **HTTP sources** – 10 MB
+ **OpenTelemetry sources** – 4 MB

For HTTP sources, you can increase the maximum payload size to 20 MB. The request payload size includes the entire HTTP request, which typically contains multiple events. Each event can't exceed 3.5 MB.
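When batching events client-side, it can help to enforce these limits before sending. The following Python sketch is a hypothetical helper with the default limits hard-coded; it only sums serialized event sizes, so it approximates the payload limit (which applies to the entire HTTP request, including headers):

```python
MAX_PAYLOAD_BYTES = 10 * 1024 * 1024        # default HTTP source payload limit
MAX_EVENT_BYTES = int(3.5 * 1024 * 1024)    # per-event limit

def batch_events(events: list[bytes]) -> list[list[bytes]]:
    """Group serialized events into batches that respect the payload limit.

    Raises ValueError for any single event over the per-event limit.
    """
    batches, current, size = [], [], 0
    for event in events:
        if len(event) > MAX_EVENT_BYTES:
            raise ValueError("event exceeds the 3.5 MB per-event limit")
        # Start a new batch when adding this event would exceed the payload limit
        if size + len(event) > MAX_PAYLOAD_BYTES and current:
            batches.append(current)
            current, size = [], 0
        current.append(event)
        size += len(event)
    if current:
        batches.append(current)
    return batches
```

A client would then send each batch as one request to the ingestion endpoint.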

Pipelines with persistent buffering split the configured pipeline units between compute and buffer units. If a pipeline uses a CPU-intensive processor like grok, key-value, or split string, it allocates the units in a 1:1 buffer-to-compute ratio. Otherwise, it allocates them in a 3:1 buffer-to-compute ratio; when the split isn't even, the extra unit goes to compute.

For example:
+ Pipeline with grok and 2 max units – 1 compute unit and 1 buffer unit
+ Pipeline with grok and 5 max units – 3 compute units and 2 buffer units
+ Pipeline with no processors and 2 max units – 1 compute unit and 1 buffer unit
+ Pipeline with no processors and 4 max units – 1 compute unit and 3 buffer units
+ Pipeline with no processors and 5 max units – 2 compute units and 3 buffer units
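The allocation described above can be sketched in a few lines of Python. This is an illustration with assumed rounding behavior, not the service's actual algorithm:

```python
from math import ceil

def split_units(max_units: int, cpu_intensive: bool) -> tuple[int, int]:
    """Sketch of the documented compute/buffer split (assumed heuristic).

    Returns (compute_units, buffer_units).
    """
    # 1:1 buffer-to-compute for CPU-intensive processors, otherwise 3:1;
    # rounding favors compute, with a minimum of one compute unit.
    ratio = 2 if cpu_intensive else 4
    compute = max(1, ceil(max_units / ratio))
    return compute, max_units - compute
```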

By default, pipelines use an AWS owned key to encrypt buffer data. These pipelines don't require any additional permissions for the pipeline role.

Alternatively, you can specify a customer managed key and add the following IAM permissions to the pipeline role:


```
{
    "Version":"2012-10-17",		 	 	 
    "Statement": [
        {
            "Sid": "KeyAccess",
            "Effect": "Allow",
            "Action": [
              "kms:Decrypt",
              "kms:GenerateDataKeyWithoutPlaintext"
            ],
            "Resource": "arn:aws:kms:us-east-1:111122223333:key/ASIAIOSFODNN7EXAMPLE"
        }
    ]
}
```


For more information, see [Customer managed keys](https://docs.aws.amazon.com/kms/latest/developerguide/concepts.html#customer-cmk) in the *AWS Key Management Service Developer Guide*. 

**Note**  
If you disable persistent buffering, your pipeline starts running entirely on in-memory buffering.

## Splitting
<a name="osis-features-splitting"></a>

You can configure an OpenSearch Ingestion pipeline to *split* incoming events into multiple sub-pipelines, allowing you to perform different types of processing on the same incoming event.

The following example pipeline splits incoming events into two sub-pipelines. Each sub-pipeline uses its own processor to enrich and manipulate the data, and then sends the data to different OpenSearch indexes.

```
version: "2"
log-pipeline:
  source:
    http:
    ...
  sink:
    - pipeline:
        name: "logs_enriched_one_pipeline"
    - pipeline:
        name: "logs_enriched_two_pipeline"

logs_enriched_one_pipeline:
  source:
    pipeline:
      name: "log-pipeline"
  processor:
   ...
  sink:
    - opensearch:
        # Provide a domain or collection endpoint
        # Enable the 'serverless' flag if the sink is an OpenSearch Serverless collection
        aws:
          ...
        index: "enriched_one_logs"

logs_enriched_two_pipeline:
  source:
    pipeline:
      name: "log-pipeline"
  processor:
   ...
  sink:
    - opensearch:
        # Provide a domain or collection endpoint
        # Enable the 'serverless' flag if the sink is an OpenSearch Serverless collection
        aws:
          ...
          index: "enriched_two_logs"
```

## Chaining
<a name="osis-features-chaining"></a>

You can *chain* multiple sub-pipelines together in order to perform data processing and enrichment in chunks. In other words, you can enrich an incoming event with certain processing capabilities in one sub-pipeline, then send it to another sub-pipeline for additional enrichment with a different processor, and finally send it to its OpenSearch sink.

In the following example, the `log-pipeline` sub-pipeline enriches an incoming log event with a set of processors, then sends the event to an OpenSearch index named `enriched_logs`. The pipeline sends the same event to the `log_advanced_pipeline` sub-pipeline, which processes it and sends it to a different OpenSearch index named `enriched_advanced_logs`.

```
version: "2"
log-pipeline:
  source:
    http:
    ...
  processor:
    ...
  sink:
    - opensearch:
        # Provide a domain or collection endpoint
        # Enable the 'serverless' flag if the sink is an OpenSearch Serverless collection
        aws:
          ...
          index: "enriched_logs"
    - pipeline:
        name: "log_advanced_pipeline"

log_advanced_pipeline:
  source:
    pipeline:
      name: "log-pipeline"
  processor:
   ...
  sink:
    - opensearch:
        # Provide a domain or collection endpoint
        # Enable the 'serverless' flag if the sink is an OpenSearch Serverless collection
        aws:
          ...
          index: "enriched_advanced_logs"
```

## Dead-letter queues
<a name="osis-features-dlq"></a>

Dead-letter queues (DLQs) are destinations for events that a pipeline fails to write to a sink. In OpenSearch Ingestion, you must specify an Amazon S3 bucket with appropriate write permissions to use as the DLQ. You can add a DLQ configuration to every sink within a pipeline. When a pipeline encounters write errors, it creates DLQ objects in the configured S3 bucket. DLQ objects exist within a JSON file as an array of failed events.

A pipeline writes events to the DLQ when either of the following conditions is met:
+ The **Max retries** count for the OpenSearch sink has been exhausted. OpenSearch Ingestion requires a minimum of 16 for this setting.
+ The sink is rejecting events due to an error condition.

### Configuration
<a name="osis-features-dlq-config"></a>

To configure a dead-letter queue for a sub-pipeline, choose **Enable S3 DLQ** when you configure your sink destination. Then, specify the required settings for the queue. For more information, see [Configuration](https://opensearch.org/docs/latest/data-prepper/pipelines/dlq/#configuration) in the Data Prepper DLQ documentation.

Files written to this S3 DLQ have the following naming pattern:

```
dlq-v${version}-${pipelineName}-${pluginId}-${timestampIso8601}-${uniqueId}
```
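Operations tooling can split a DLQ object name back into its components. The following Python sketch is a hypothetical helper; it assumes the plugin ID contains no hyphens and the unique ID is a standard UUID:

```python
import re

# Hypothetical parser for DLQ object names of the form:
# dlq-v${version}-${pipelineName}-${pluginId}-${timestampIso8601}-${uniqueId}
DLQ_NAME = re.compile(
    r"dlq-v(?P<version>\d+)"
    r"-(?P<pipeline>.+)"
    r"-(?P<plugin>[a-z0-9_]+)"
    r"-(?P<timestamp>\d{4}-\d{2}-\d{2}T[\d:.]+Z)"
    r"-(?P<unique_id>[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})"
)

def parse_dlq_name(name: str) -> dict:
    """Split a DLQ object name into its named components."""
    match = DLQ_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a DLQ object name: {name}")
    return match.groupdict()
```

For example, a script that groups failed-event files by pipeline could call `parse_dlq_name` on each S3 key.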

For instructions to manually configure the pipeline role to allow access to the S3 bucket that the DLQ writes to, see [Permissions to write to Amazon S3 or a dead-letter queue](pipeline-security-overview.md#pipeline-security-dlq).

### Example
<a name="osis-features-dlq-example"></a>

Consider the following example DLQ file:

```
dlq-v2-apache-log-pipeline-opensearch-2023-04-05T15:26:19.152938Z-e7eb675a-f558-4048-8566-dac15a4f8343
```

Here's an example of data that failed to be written to the sink, and is sent to the DLQ S3 bucket for further analysis:

```
Record_0	
pluginId            "opensearch"
pluginName          "opensearch"
pipelineName        "apache-log-pipeline"
failedData	
index		  "logs"
indexId		 null
status		  0
message		"Number of retries reached the limit of max retries (configured value 15)"
document	
log		    "sample log"
timestamp	    "2023-04-14T10:36:01.070Z"

Record_1	
pluginId            "opensearch"
pluginName          "opensearch"
pipelineName        "apache-log-pipeline"
failedData	
index               "logs"
indexId		 null
status		  0
message		"Number of retries reached the limit of max retries (configured value 15)"
document	
log                 "another sample log"
timestamp           "2023-04-14T10:36:01.071Z"
```

## Index management
<a name="osis-features-index-management"></a>

Amazon OpenSearch Ingestion has many index management capabilities, including the following.

### Creating indexes
<a name="osis-features-index-management-create"></a>

You can specify an index name in a pipeline sink and OpenSearch Ingestion creates the index when it provisions the pipeline. If an index already exists, the pipeline uses it to index incoming events. If you stop and restart a pipeline, or if you update its YAML configuration, the pipeline attempts to create new indexes if they don't already exist. A pipeline can never delete an index.

The following example sinks create two indexes when the pipeline is provisioned:

```
sink:
  - opensearch:
      index: apache_logs
  - opensearch:
      index: nginx_logs
```

### Generating index names and patterns
<a name="osis-features-index-management-patterns"></a>

You can generate dynamic index names by using variables from the fields of incoming events. In the sink configuration, use the format `string${}` to signal string interpolation, and use a JSON pointer to extract fields from events. The `index_type` option accepts `custom` or `management_disabled`. Because `index_type` defaults to `custom` for OpenSearch domains and `management_disabled` for OpenSearch Serverless collections, you can leave it unset.

For example, the following pipeline selects the `metadataType` field from incoming events to generate index names.

```
pipeline:
  ...
  sink:
    opensearch:
      index: "metadata-${metadataType}"
```

The following configurations generate a new index every day or every hour, respectively.

```
pipeline:
  ...
  sink:
    opensearch:
      index: "metadata-${metadataType}-%{yyyy.MM.dd}"

pipeline:
  ...
  sink:
    opensearch:
      index: "metadata-${metadataType}-%{yyyy.MM.dd.HH}"
```

The index name can also be a plain string with a date-time pattern as a suffix, such as `my-index-%{yyyy.MM.dd}`. When the sink sends data to OpenSearch, it replaces the date-time pattern with UTC time and creates a new index for each day, such as `my-index-2022.01.25`. For more information, see the [DateTimeFormatter](https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html) class.

This index name can also be a formatted string (with or without a date-time pattern suffix), such as `my-${index}-name`. When the sink sends data to OpenSearch, it replaces the `"${index}"` portion with the value in the event being processed. If the format is `"${index1/index2/index3}"`, it replaces the field `index1/index2/index3` with its value in the event.
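Taken together, the two substitutions can be approximated in a few lines. The following Python sketch illustrates the behavior under assumed semantics; it is not the sink's actual implementation:

```python
import re
from datetime import datetime, timezone

# Illustrative only: resolves "${field}" references from an event and a
# Java-style "%{...}" date-time pattern against UTC time.
_JAVA_TO_STRFTIME = {"yyyy": "%Y", "MM": "%m", "dd": "%d", "HH": "%H"}

def resolve_index(pattern, event, now=None):
    now = now or datetime.now(timezone.utc)

    def field(match):
        # Walk a JSON-pointer-like path such as "info/id" into the event
        value = event
        for part in match.group(1).split("/"):
            value = value[part]
        return str(value)

    name = re.sub(r"\$\{([^}]+)\}", field, pattern)

    def stamp(match):
        fmt = match.group(1)
        for java, strf in _JAVA_TO_STRFTIME.items():
            fmt = fmt.replace(java, strf)
        return now.strftime(fmt)

    return re.sub(r"%\{([^}]+)\}", stamp, name)
```

For instance, `resolve_index("metadata-${metadataType}-%{yyyy.MM.dd}", event)` combines a field reference and a daily date suffix in one index name.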

### Generating document IDs
<a name="osis-features-index-management-ids"></a>

A pipeline can generate a document ID while indexing documents to OpenSearch. It can infer these document IDs from the fields within incoming events.

This example uses the `uuid` field from an incoming event to generate a document ID.

```
pipeline:
  ...
  sink:
    opensearch:
      index_type: custom
      index: "metadata-${metadataType}-%{yyyy.MM.dd}" 
      "document_id": "uuid"
```

In the following example, the [Add entries](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/add-entries/) processor merges the fields `uuid` and `other_field` from the incoming event to generate a document ID.

The `create` action ensures that documents with identical IDs aren't overwritten. The pipeline drops duplicate documents without a retry or a DLQ event. This is the expected behavior for pipeline authors who use this action, because the goal is to avoid updating existing documents.

```
pipeline:
  ...
  processor:
   - add_entries:
      entries:
        - key: "my_doc_id_field"
          format: "${uuid}-${other_field}"
  sink:
    - opensearch:
       ...
       action: "create"
       document_id: "my_doc_id"
```

You might want to set an event's document ID to a field from a sub-object. In the following example, the OpenSearch sink plugin uses the sub-object `info/id` to generate a document ID.

```
sink:
  - opensearch:
       ...
       document_id: info/id
```

Given the following event, the pipeline will generate a document with the `_id` field set to `json001`:

```
{
   "fieldA":"arbitrary value",
   "info":{
      "id":"json001",
      "fieldA":"xyz",
      "fieldB":"def"
   }
}
```

### Generating routing IDs
<a name="osis-features-index-management-routing-ids"></a>

You can use the `routing_field` option within the OpenSearch sink plugin to set the value of a document routing property (`_routing`) to a value from an incoming event.

Routing supports JSON pointer syntax, so nested fields are also available, not just top-level fields.

```
sink:
  - opensearch:
       ...
       routing_field: metadata/id
       document_id: id
```

Given the following event, the plugin generates a document with the `_routing` field set to `abcd`:

```
{
   "id":"123",
   "metadata":{
      "id":"abcd",
      "fieldA":"valueA"
   },
   "fieldB":"valueB"
}
```

For instructions to create index templates that pipelines can use during index creation, see [Index templates](https://opensearch.org/docs/latest/im-plugin/index-templates/).

## End-to-end acknowledgement
<a name="osis-features-e2e"></a>

OpenSearch Ingestion ensures the durability and reliability of data by tracking its delivery from source to sinks in stateless pipelines using *end-to-end acknowledgement*.

**Note**  
Currently, only the [S3 source](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/s3/) plugin supports end-to-end acknowledgement.

With end-to-end acknowledgement, the pipeline source plugin creates an *acknowledgement set* to monitor a batch of events. It receives a positive acknowledgement when those events are successfully sent to their sinks, or a negative acknowledgement when any of the events could not be sent to their sinks.

In the event of a failure or crash of a pipeline component, or if a source fails to receive an acknowledgement, the source times out and takes necessary actions such as retrying or logging the failure. If the pipeline has multiple sinks or multiple sub-pipelines configured, event-level acknowledgements are sent only after the event is sent to *all* sinks in *all* sub-pipelines. If a sink has a DLQ configured, end-to-end acknowledgements also track events written to the DLQ.

To enable end-to-end acknowledgement, expand **Additional options** in the Amazon S3 source configuration and choose **Enable end-to-end message acknowledgement**.

## Source back pressure
<a name="osis-features-backpressure"></a>

A pipeline can experience back pressure when it's busy processing data, or if its sinks are temporarily down or slow to ingest data. OpenSearch Ingestion has different ways of handling back pressure depending on the source plugin that a pipeline is using.

### HTTP source
<a name="osis-features-backpressure-http"></a>

Pipelines that use the [HTTP source](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/http-source/) plugin handle back pressure differently depending on which pipeline component is congested:
+ **Buffers** – When buffers are full, the pipeline starts returning HTTP status code 408 (`Request Timeout`) to the source endpoint. As buffers are freed up, the pipeline starts processing HTTP events again.
+ **Source threads** – When all HTTP source threads are busy executing requests and the unprocessed request queue size has exceeded the maximum allowed number of requests, the pipeline starts returning HTTP status code 429 (`Too Many Requests`) to the source endpoint. When the request queue drops below the maximum allowed queue size, the pipeline starts processing requests again.

### OTel source
<a name="osis-features-backpressure-otel"></a>

When buffers are full for pipelines that use OpenTelemetry sources ([OTel logs](https://github.com/opensearch-project/data-prepper/tree/main/data-prepper-plugins/otel-logs-source), [OTel metrics](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/otel-metrics-source/), and [OTel trace](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/otel-trace/)), the pipeline starts returning HTTP status code 408 (`Request Timeout`) to the source endpoint. As buffers are freed up, the pipeline starts processing events again.

### S3 source
<a name="osis-features-backpressure-s3"></a>

When buffers are full for pipelines with an [S3](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/s3/) source, the pipelines stop processing SQS notifications. As the buffers are freed up, the pipelines start processing notifications again. 

If a sink is down or unable to ingest data and end-to-end acknowledgement is enabled for the source, the pipeline stops processing SQS notifications until it receives a successful acknowledgement from all sinks.

# Creating Amazon OpenSearch Ingestion pipelines
<a name="creating-pipeline"></a>

A *pipeline* is the mechanism that Amazon OpenSearch Ingestion uses to move data from its *source* (where the data comes from) to its *sink* (where the data goes). In OpenSearch Ingestion, the sink is an Amazon OpenSearch Service domain or OpenSearch Serverless collection, while your data can come from sources like Amazon S3 or clients like Fluent Bit and the OpenTelemetry Collector.

For more information, see [Pipelines](https://opensearch.org/docs/latest/clients/data-prepper/pipelines/) in the OpenSearch documentation.

**Topics**
+ [Prerequisites and required IAM role](#manage-pipeline-prerequisites)
+ [Required IAM permissions](#create-pipeline-permissions)
+ [Specifying the pipeline version](#pipeline-version)
+ [Specifying the ingestion path](#pipeline-path)
+ [Creating pipelines](#create-pipeline)
+ [Tracking the status of pipeline creation](#get-pipeline-progress)
+ [Working with blueprints](pipeline-blueprint.md)

## Prerequisites and required IAM role
<a name="manage-pipeline-prerequisites"></a>

To create an OpenSearch Ingestion pipeline, you must have the following resources:
+ An IAM role, called the *pipeline role*, that OpenSearch Ingestion assumes in order to write to the sink. You can create this role ahead of time, or you can have OpenSearch Ingestion create it automatically while you're creating the pipeline.
+ An OpenSearch Service domain or OpenSearch Serverless collection to act as the sink. If you're writing to a domain, it must be running OpenSearch 1.0 or later, or Elasticsearch 7.4 or later. The sink must have an access policy that grants the appropriate permissions to your IAM pipeline role.

For instructions to create these resources, see the following topics:
+ [Granting Amazon OpenSearch Ingestion pipelines access to domains](pipeline-domain-access.md)
+ [Granting Amazon OpenSearch Ingestion pipelines access to collections](pipeline-collection-access.md)

**Note**  
If you're writing to a domain that uses fine-grained access control, there are extra steps you need to complete. See [Map the pipeline role (only for domains that use fine-grained access control)](pipeline-domain-access.md#pipeline-access-domain-fgac).

## Required IAM permissions
<a name="create-pipeline-permissions"></a>

OpenSearch Ingestion uses the following IAM permissions for creating pipelines:
+ `osis:CreatePipeline` – Create a pipeline.
+ `osis:ValidatePipeline` – Check whether a pipeline configuration is valid.
+ `iam:CreateRole` and `iam:AttachRolePolicy` – Have OpenSearch Ingestion automatically create the pipeline role for you.
+ `iam:PassRole` – Pass the pipeline role to OpenSearch Ingestion so that it can write data to the domain. This permission must be on the [pipeline role resource](pipeline-domain-access.md#pipeline-access-configure), or simply `*` if you plan to use different roles in each pipeline.

For example, the following policy grants permission to create a pipeline:


```
{
   "Version":"2012-10-17",		 	 	 
   "Statement":[
      {
         "Effect":"Allow",
         "Resource":"*",
         "Action":[
            "osis:CreatePipeline",
            "osis:ListPipelineBlueprints",
            "osis:ValidatePipeline"
         ]
      },
      {
         "Resource":[
            "arn:aws:iam::111122223333:role/pipeline-role"
         ],
         "Effect":"Allow",
         "Action":[
            "iam:CreateRole",
            "iam:AttachRolePolicy",
            "iam:PassRole"
         ]
      }
   ]
}
```


OpenSearch Ingestion also includes a permission called `osis:Ingest`, which is required in order to send signed requests to the pipeline using [Signature Version 4](https://docs.aws.amazon.com/general/latest/gr/signature-version-4.html). For more information, see [Creating an ingestion role](configure-client.md#configure-client-auth).

**Note**  
In addition, the first user to create a pipeline in an account must have permissions for the `iam:CreateServiceLinkedRole` action. For more information, see [pipeline role resource](pipeline-security.md#pipeline-vpc-slr).

For more information about each permission, see [Actions, resources, and condition keys for OpenSearch Ingestion](https://docs.aws.amazon.com/service-authorization/latest/reference/list_opensearchingestionservice.html) in the *Service Authorization Reference*.

## Specifying the pipeline version
<a name="pipeline-version"></a>

When you create a pipeline using the configuration editor, you must specify the major [version of Data Prepper](https://github.com/opensearch-project/data-prepper/releases) that the pipeline will run. To specify the version, include the `version` option in your pipeline configuration:

```
version: "2"
log-pipeline:
  source:
    ...
```

When you choose **Create**, OpenSearch Ingestion determines the latest available *minor* version of the major version that you specify, and provisions the pipeline with that version. For example, if you specify `version: "2"`, and the latest supported version of Data Prepper is 2.1.1, OpenSearch Ingestion provisions your pipeline with version 2.1.1. We don't publicly display the minor version that your pipeline is running.
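This selection rule amounts to choosing the highest available minor and patch release within the requested major version. A minimal Python sketch of that rule, using hypothetical version tuples:

```python
def resolve_version(requested_major: int, available: list[tuple[int, int, int]]):
    """Pick the latest available (major, minor, patch) for the requested
    major version -- a sketch of the documented selection rule."""
    candidates = [v for v in available if v[0] == requested_major]
    return max(candidates) if candidates else None
```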

In order to upgrade your pipeline when a new major version of Data Prepper is available, edit the pipeline configuration and specify the new version. You can't downgrade a pipeline to an earlier version.

**Note**  
OpenSearch Ingestion doesn't support new versions of Data Prepper as soon as they're released. There is some lag between when a new version is publicly available and when it's supported in OpenSearch Ingestion. In addition, OpenSearch Ingestion might explicitly not support certain major or minor versions altogether. For a comprehensive list, see [Supported Data Prepper versions](ingestion.md#ingestion-supported-versions).

Any time you make a change to your pipeline that initiates a blue/green deployment, OpenSearch Ingestion can upgrade it to the latest minor version of the major version that's currently configured for the pipeline. For more information, see [Blue/green deployments for pipeline updates](update-pipeline.md#pipeline-bg). OpenSearch Ingestion can't change the major version of your pipeline unless you explicitly update the `version` option within the pipeline configuration.

## Specifying the ingestion path
<a name="pipeline-path"></a>

For pull-based sources like [OTel trace](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/otel-trace/) and [OTel metrics](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/otel-metrics-source/), OpenSearch Ingestion requires the additional `path` option in your source configuration. The path is a string such as `/log/ingest`, which represents the URI path for ingestion. This path defines the URI that you use to send data to the pipeline. 

For example, say you specify the following path for a pipeline with an HTTP source:

![\[Input field for specifying the path for ingestion, with an example path entered.\]](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/ingestion-path.png)


When you [ingest data](configure-client.md) into the pipeline, you must specify the following endpoint in your client configuration: `https://pipeline-name-abc123.us-west-2.osis.amazonaws.com/my/test_path`.

The path must start with a slash (/) and can contain the special characters '-', '_', '.', and '/', as well as the `${pipelineName}` placeholder. If you use `${pipelineName}` (such as `/${pipelineName}/test_path`), OpenSearch Ingestion replaces the variable with the name of the associated sub-pipeline.
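As a quick sanity check before creating a pipeline, these path rules can be approximated with a regular expression. This is a hypothetical helper with an assumed character set; the service performs its own authoritative validation:

```python
import re

def looks_like_valid_path(path: str) -> bool:
    """Rough client-side check of an ingestion path (assumed rules)."""
    # Must start with '/'; allow letters, digits, '-', '_', '.', '/',
    # and the ${pipelineName} placeholder (substituted before checking).
    substituted = path.replace("${pipelineName}", "name")
    return bool(re.fullmatch(r"/[A-Za-z0-9._/-]*", substituted))
```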

## Creating pipelines
<a name="create-pipeline"></a>

This section describes how to create OpenSearch Ingestion pipelines using the OpenSearch Service console and the AWS CLI.

### Console
<a name="create-pipeline-console"></a>

To create a pipeline, sign in to the Amazon OpenSearch Service console at [https://console.aws.amazon.com/aos/osis/home](https://console.aws.amazon.com/aos/osis/home#osis/ingestion-pipelines) and choose **Create pipeline**. 

Either select a blank pipeline, or choose a configuration blueprint. Blueprints include a preconfigured pipeline for a variety of common use cases. For more information, see [Working with blueprints](pipeline-blueprint.md).

Choose **Select blueprint**.

#### Configure source
<a name="create-pipeline-console-source"></a>

1. If you're starting from a blank pipeline, select a source from the dropdown menu. Available sources might include other AWS services, OpenTelemetry, or HTTP. For more information, see [Integrating Amazon OpenSearch Ingestion pipelines with other services and applications](configure-client.md).

1. Depending on which source you choose, configure additional settings for the source. For example, to use Amazon S3 as a source, you must specify the URL of the Amazon SQS queue from which the pipeline receives messages. For a list of supported source plugins and links to their documentation, see [Supported plugins and options for Amazon OpenSearch Ingestion pipelines](pipeline-config-reference.md).

1. For some sources, you must specify **Source network options**. Choose either **VPC access** or **Public access**. If you choose **Public access**, skip to the next step. If you choose **VPC access**, configure the following settings:    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/creating-pipeline.html)

   For more information, see [Configuring VPC access for Amazon OpenSearch Ingestion pipelines](pipeline-security.md).

1. Choose **Next**.

#### Configure processor
<a name="create-pipeline-console-processor"></a>

Add one or more processors to your pipeline. Processors are components within a sub-pipeline that let you filter, transform, and enrich events before publishing records to the domain or collection sink. For a list of supported processors and links to their documentation, see [Supported plugins and options for Amazon OpenSearch Ingestion pipelines](pipeline-config-reference.md).

You can choose **Actions** and add the following:
+ **Conditional routing** – Routes events to different sinks based on specific conditions. For more information, see [Conditional routing](https://opensearch.org/docs/latest/data-prepper/pipelines/pipelines/#conditional-routing).
+ **Sub-pipeline** – Each sub-pipeline is a combination of a single source, zero or more processors, and a single sink. Only one sub-pipeline can have an external source. All others must have sources that are other sub-pipelines within the overall pipeline configuration. A single pipeline configuration can contain 1-10 sub-pipelines.

Choose **Next**.

#### Configure sink
<a name="create-pipeline-console-sink"></a>

Select the destination where the pipeline publishes records. Every sub-pipeline must contain at least one sink. You can add a maximum of 10 sinks to a pipeline.

For OpenSearch sinks, configure the following fields:


| Setting | Description | 
| --- | --- | 
| Network policy name (Serverless sinks only) |  If you selected an OpenSearch Serverless collection, enter a **Network policy name**. OpenSearch Ingestion either creates the policy if it doesn't exist, or updates it with a rule that grants access to the VPC endpoint connecting the pipeline and the collection. For more information, see [Granting Amazon OpenSearch Ingestion pipelines access to collections](pipeline-collection-access.md).  | 
| Index name |  The name of the index where the pipeline sends data. OpenSearch Ingestion creates this index if it doesn't already exist.  | 
| Index mapping options |  Choose how the pipeline stores and indexes documents and their fields into the OpenSearch sink. If you select **Dynamic mapping**, OpenSearch adds fields automatically when you index a document. If you select **Customize mapping**, enter an index mapping template. For more information, see [Index templates](https://opensearch.org/docs/latest/im-plugin/index-templates/).  | 
| Enable DLQ |  Configure an Amazon S3 dead-letter queue (DLQ) for the pipeline. For more information, see [Dead-letter queues](osis-features-overview.md#osis-features-dlq).  | 
| Additional settings |  Configure advanced options for the OpenSearch sink. For more information, see [Configuration options](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sinks/opensearch/#configuration-options) in the Data Prepper documentation.  | 

To add an Amazon S3 sink, choose **Add sink** and **Amazon S3**. For more information, see [Amazon S3 as a destination](configure-client-s3.md#s3-destination).

Choose **Next**.

#### Configure pipeline
<a name="create-console-pipeline"></a>

Configure the following additional pipeline settings:


| Setting | Description | 
| --- | --- | 
| Pipeline name |  A unique name for the pipeline.  | 
| Persistent buffer |  A persistent buffer stores your data in a disk-based buffer across multiple Availability Zones. For more information, see [Persistent buffering](osis-features-overview.md#persistent-buffering).  If you enable persistent buffering, select the AWS Key Management Service key to encrypt the buffer data.   | 
| Pipeline capacity |  The minimum and maximum pipeline capacity, in Ingestion OpenSearch Compute Units (OCUs). For more information, see [Scaling pipelines in Amazon OpenSearch Ingestion](ingestion-scaling.md).  | 
| Pipeline role |  The IAM role that provides the required permissions for the pipeline to write to the sink and read from pull-based sources. You can create the role yourself, or have OpenSearch Ingestion create it for you based on your selected use case.  For more information, see [Setting up roles and users in Amazon OpenSearch Ingestion](pipeline-security-overview.md).  | 
| Tags |  Add one or more tags to your pipeline. For more information, see [Tagging Amazon OpenSearch Ingestion pipelines](tag-pipeline.md).  | 
| Log publishing options | Enable pipeline log publishing to Amazon CloudWatch Logs. We recommend that you enable log publishing so that you can more easily troubleshoot pipeline issues. For more information, see [Monitoring pipeline logs](monitoring-pipeline-logs.md). | 

Choose **Next**, then review your pipeline configuration and choose **Create pipeline**.

OpenSearch Ingestion runs an asynchronous process to build the pipeline. Once the pipeline status is `Active`, you can start ingesting data.

### AWS CLI
<a name="create-pipeline-cli"></a>

The [create-pipeline](https://docs.aws.amazon.com/cli/latest/reference/osis/create-pipeline.html) command accepts the pipeline configuration as a string or within a .yaml or .json file. If you provide the configuration as a string, each new line must be escaped with `\n`. For example, `"log-pipeline:\n source:\n http:\n processor:\n - grok:\n ...`
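If you generate the escaped string programmatically, a small helper can produce it from a multi-line configuration. This is a sketch, not part of the AWS CLI; the helper name and the abbreviated YAML are illustrative:

```python
def escape_pipeline_config(config: str) -> str:
    """Escape newlines and double quotes for use in a quoted CLI argument."""
    return config.rstrip("\n").replace('"', '\\"').replace("\n", "\\n")

# Abbreviated, illustrative pipeline configuration (not a complete pipeline).
yaml_config = """log-pipeline:
  source:
    http:
  processor:
    - grok:
"""

print(escape_pipeline_config(yaml_config))
# log-pipeline:\n  source:\n    http:\n  processor:\n    - grok:
```

You can then paste the printed string directly into the `--pipeline-configuration-body` argument.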

The following sample command creates a pipeline with the following configuration:
+ Minimum of 4 Ingestion OCUs, maximum of 10 Ingestion OCUs
+ Provisioned within a virtual private cloud (VPC)
+ Log publishing enabled

```
aws osis create-pipeline \
  --pipeline-name my-pipeline \
  --min-units 4 \
  --max-units 10 \
  --log-publishing-options  IsLoggingEnabled=true,CloudWatchLogDestination={LogGroup="MyLogGroup"} \
  --vpc-options SecurityGroupIds={sg-12345678,sg-9012345},SubnetIds=subnet-1212234567834asdf \
  --pipeline-configuration-body "file://pipeline-config.yaml" \
  --pipeline-role-arn  arn:aws:iam::1234456789012:role/pipeline-role
```

OpenSearch Ingestion runs an asynchronous process to build the pipeline. Once the pipeline status is `Active`, you can start ingesting data. To check the status of the pipeline, use the [GetPipeline](https://docs.aws.amazon.com/opensearch-service/latest/APIReference/API_osis_GetPipeline.html) command.
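For scripted workflows, you might poll until the pipeline becomes active. The following sketch separates the polling logic from the AWS call; the `get_status` callable is a stand-in for reading `Pipeline.Status` from a `get-pipeline` response, and the terminal status strings are assumptions:

```python
import time

def wait_for_active(get_status, poll_seconds=30, max_attempts=40):
    """Poll until get_status() returns ACTIVE, failing fast on terminal states."""
    for _ in range(max_attempts):
        status = get_status()
        if status == "ACTIVE":
            return True
        if status in ("CREATE_FAILED", "UPDATE_FAILED"):
            raise RuntimeError(f"pipeline entered terminal state: {status}")
        time.sleep(poll_seconds)
    return False

# Simulated status sequence for illustration; in practice, get_status would
# call the OpenSearch Ingestion API.
statuses = iter(["CREATING", "CREATING", "ACTIVE"])
print(wait_for_active(lambda: next(statuses), poll_seconds=0))  # True
```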

### OpenSearch Ingestion API
<a name="create-pipeline-api"></a>

To create an OpenSearch Ingestion pipeline using the OpenSearch Ingestion API, call the [CreatePipeline](https://docs.aws.amazon.com/opensearch-service/latest/APIReference/API_osis_CreatePipeline.html) operation.

After your pipeline is successfully created, you can configure your client and start ingesting data into your OpenSearch Service domain. For more information, see [Integrating Amazon OpenSearch Ingestion pipelines with other services and applications](configure-client.md).

## Tracking the status of pipeline creation
<a name="get-pipeline-progress"></a>

You can track the status of a pipeline as OpenSearch Ingestion provisions it and prepares it to ingest data.

### Console
<a name="get-pipeline-progress-console"></a>

After you initially create a pipeline, it goes through multiple stages as OpenSearch Ingestion prepares it to ingest data. To view the various stages of pipeline creation, choose the pipeline name to see its **Pipeline settings** page. Under **Status**, choose **View details**.

A pipeline goes through the following stages before it's available to ingest data:
+ **Validation** – Validating pipeline configuration. When this stage is complete, all validations have succeeded.
+ **Create environment** – Preparing and provisioning resources. When this stage is complete, the new pipeline environment has been created.
+ **Deploy pipeline** – Deploying the pipeline. When this stage is complete, the pipeline has been successfully deployed.
+ **Check pipeline health** – Checking the health of the pipeline. When this stage is complete, all health checks have passed.
+ **Enable traffic** – Enabling the pipeline to ingest data. When this stage is complete, you can start ingesting data into the pipeline.

### CLI
<a name="get-pipeline-progress-cli"></a>

Use the [get-pipeline-change-progress](https://docs.aws.amazon.com/cli/latest/reference/osis/get-pipeline-change-progress.html) command to check the status of a pipeline. The following AWS CLI request checks the status of a pipeline named `my-pipeline`:

```
aws osis get-pipeline-change-progress \
    --pipeline-name my-pipeline
```

**Response**:

```
{
   "ChangeProgressStatuses": {
      "ChangeProgressStages": [ 
         { 
            "Description": "Validating pipeline configuration",
            "LastUpdated": 1.671055851E9,
            "Name": "VALIDATION",
            "Status": "PENDING"
         }
      ],
      "StartTime": 1.671055851E9,
      "Status": "PROCESSING",
      "TotalNumberOfStages": 5
   }
}
```
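To track progress in a script, you can count completed stages in the response. This sketch assumes completed stages report a `Status` of `COMPLETED`, which isn't shown in the sample above:

```python
def summarize_progress(response):
    """Summarize a get-pipeline-change-progress response body."""
    progress = response["ChangeProgressStatuses"]
    completed = sum(
        1 for stage in progress["ChangeProgressStages"]
        if stage["Status"] == "COMPLETED"
    )
    return f"{progress['Status']}: {completed}/{progress['TotalNumberOfStages']} stages completed"

# The same shape as the sample response above, trimmed to the relevant fields.
response = {
    "ChangeProgressStatuses": {
        "ChangeProgressStages": [{"Name": "VALIDATION", "Status": "PENDING"}],
        "Status": "PROCESSING",
        "TotalNumberOfStages": 5,
    }
}
print(summarize_progress(response))  # PROCESSING: 0/5 stages completed
```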

### OpenSearch Ingestion API
<a name="get-pipeline-progress-api"></a>

To track the status of pipeline creation using the OpenSearch Ingestion API, call the [GetPipelineChangeProgress](https://docs.aws.amazon.com/opensearch-service/latest/APIReference/API_osis_GetPipelineChangeProgress.html) operation.

# Working with blueprints
<a name="pipeline-blueprint"></a>

Rather than creating a pipeline definition from scratch, you can use *configuration blueprints*, which are preconfigured templates for common ingestion scenarios such as Trace Analytics or Apache logs. Blueprints help you provision pipelines without having to author the entire configuration yourself.

## Console
<a name="pipeline-blueprint-console"></a>

**To use a pipeline blueprint**

1. Sign in to the Amazon OpenSearch Service console at [https://console.aws.amazon.com/aos/osis/home](https://console.aws.amazon.com/aos/osis/home#osis/ingestion-pipelines). You'll be on the Pipelines page.

1. Choose **Create pipeline**.

1. Select a blueprint from the list of use cases, then choose **Select blueprint**. The pipeline configuration populates with a sub-pipeline for the use case you selected. 

   The pipeline blueprint isn't valid as-is. You need to specify additional settings depending on the selected source.

## CLI
<a name="pipeline-blueprint-cli"></a>

To get a list of all available blueprints using the AWS CLI, send a [list-pipeline-blueprints](https://docs.aws.amazon.com/cli/latest/reference/osis/list-pipeline-blueprints.html) request.

```
aws osis list-pipeline-blueprints 
```

The request returns a list of all available blueprints.

To get more detailed information about a specific blueprint, use the [get-pipeline-blueprint](https://docs.aws.amazon.com/cli/latest/reference/osis/get-pipeline-blueprint.html) command:

```
aws osis get-pipeline-blueprint --blueprint-name AWS-ApacheLogPipeline
```

This request returns the contents of the Apache log pipeline blueprint:

```
{
   "Blueprint":{
      "PipelineConfigurationBody":"###\n  # Limitations: https://docs.aws.amazon.com/opensearch-service/latest/ingestion/ingestion.html#ingestion-limitations\n###\n###\n  # apache-log-pipeline:\n    # This pipeline receives logs via http (e.g. FluentBit), extracts important values from the logs by matching\n    # the value in the 'log' key against the grok common Apache log pattern. The grokked logs are then sent\n    # to OpenSearch to an index named 'logs'\n###\n\nversion: \"2\"\napache-log-pipeline:\n  source:\n    http:\n      # Provide the path for ingestion. ${pipelineName} will be replaced with pipeline name configured for this pipeline.\n      # In this case it would be \"/apache-log-pipeline/logs\". This will be the FluentBit output URI value.\n      path: \"/${pipelineName}/logs\"\n  processor:\n    - grok:\n        match:\n          log: [ \"%{COMMONAPACHELOG_DATATYPED}\" ]\n  sink:\n    - opensearch:\n        # Provide an AWS OpenSearch Service domain endpoint\n        # hosts: [ \"https://search-mydomain-1a2a3a4a5a6a7a8a9a0a9a8a7a.us-east-1.es.amazonaws.com\" ]\n        aws:\n          # Provide the region of the domain.\n          # region: \"us-east-1\"\n          # Enable the 'serverless' flag if the sink is an Amazon OpenSearch Serverless collection\n          # serverless: true\n        index: \"logs\"\n        # Enable the S3 DLQ to capture any failed requests in an S3 bucket\n        # dlq:\n          # s3:\n            # Provide an S3 bucket\n            # bucket: \"your-dlq-bucket-name\"\n            # Provide a key path prefix for the failed requests\n            # key_path_prefix: \"${pipelineName}/logs/dlq\"\n            # Provide the region of the bucket.\n            # region: \"us-east-1\"\n            # Provide a Role ARN with access to the bucket. This role should have a trust relationship with osis-pipelines.amazonaws.com\n"
      ,"BlueprintName":"AWS-ApacheLogPipeline"
   }
}
```
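Because `PipelineConfigurationBody` is returned as a single escaped string, you may want to render it as readable YAML. A sketch using an abbreviated, hypothetical response body:

```python
import json

# Abbreviated stand-in for a get-pipeline-blueprint response; the real
# PipelineConfigurationBody is much longer.
raw = ('{"Blueprint":{"PipelineConfigurationBody":'
       '"version: \\"2\\"\\napache-log-pipeline:\\n  source:\\n    http:",'
       '"BlueprintName":"AWS-ApacheLogPipeline"}}')

blueprint = json.loads(raw)["Blueprint"]
print(blueprint["BlueprintName"])            # AWS-ApacheLogPipeline
print(blueprint["PipelineConfigurationBody"])  # prints the body as multi-line YAML
```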

## OpenSearch Ingestion API
<a name="pipeline-blueprint-api"></a>

To get information about pipeline blueprints using the OpenSearch Ingestion API, use the [ListPipelineBlueprints](https://docs.aws.amazon.com/opensearch-service/latest/APIReference/API_osis_ListPipelineBlueprints.html) and [GetPipelineBlueprint](https://docs.aws.amazon.com/opensearch-service/latest/APIReference/API_osis_GetPipelineBlueprint.html) operations.

# Viewing Amazon OpenSearch Ingestion pipelines
<a name="list-pipeline"></a>

You can view the details about an Amazon OpenSearch Ingestion pipeline using the AWS Management Console, the AWS CLI, or the OpenSearch Ingestion API.

## Console
<a name="list-pipeline-console"></a>

**To view a pipeline**

1. Sign in to the Amazon OpenSearch Service console at [https://console.aws.amazon.com/aos/osis/home](https://console.aws.amazon.com/aos/osis/home#osis/ingestion-pipelines). You'll be on the Pipelines page.

1. (Optional) To view pipelines with a particular status, choose **Any status** and select a status to filter by.

   A pipeline can have the following statuses:
   + `Active` – The pipeline is active and ready to ingest data.
   + `Creating` – The pipeline is being created.
   + `Updating` – The pipeline is being updated.
   + `Deleting` – The pipeline is being deleted.
   + `Create failed` – The pipeline could not be created.
   + `Update failed` – The pipeline could not be updated.
   + `Stop failed` – The pipeline could not be stopped.
   + `Start failed` – The pipeline could not be started.
   + `Stopping` – The pipeline is being stopped.
   + `Stopped` – The pipeline is stopped and can be restarted at any time.
   + `Starting` – The pipeline is starting.

You're not billed for Ingestion OCUs when a pipeline is in the `Create failed`, `Creating`, `Deleting`, and `Stopped` states.
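In a cost-monitoring script, that billing rule can be expressed as a simple predicate over the console statuses (a sketch; the status strings match the console labels above):

```python
# Console statuses for which Ingestion OCUs are not billed, per the note above.
UNBILLED_STATUSES = {"Create failed", "Creating", "Deleting", "Stopped"}

def is_billed(status: str) -> bool:
    """Return True if a pipeline in this console status accrues OCU charges."""
    return status not in UNBILLED_STATUSES

print(is_billed("Active"))   # True
print(is_billed("Stopped"))  # False
```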

## CLI
<a name="list-pipeline-cli"></a>

To view pipelines using the AWS CLI, send a [list-pipelines](https://docs.aws.amazon.com/cli/latest/reference/osis/list-pipelines.html) request:

```
aws osis list-pipelines  
```

The request returns a list of all existing pipelines:

```
{
    "NextToken": null,
    "Pipelines": [
        {
            "CreatedAt": 1.671055851E9,
            "LastUpdatedAt": 1.671055851E9,
            "MaxUnits": 4,
            "MinUnits": 2,
            "PipelineArn": "arn:aws:osis:us-west-2:123456789012:pipeline/log-pipeline",
            "PipelineName": "log-pipeline",
            "Status": "ACTIVE",
            "StatusReason": {
                "Description": "The pipeline is ready to ingest data."
            }
        },
        {
            "CreatedAt": 1.671055851E9,
            "LastUpdatedAt": 1.671055851E9,
            "MaxUnits": 8,
            "MinUnits": 2,
            "PipelineArn": "arn:aws:osis:us-west-2:123456789012:pipeline/another-pipeline",
            "PipelineName": "another-pipeline",
            "Status": "CREATING",
            "StatusReason": {
                "Description": "The pipeline is being created. It is not able to ingest data."
            }
        }
    ]
}
```

To get information about a single pipeline, use the [get-pipeline](https://docs.aws.amazon.com/cli/latest/reference/osis/get-pipeline.html) command:

```
aws osis get-pipeline --pipeline-name "my-pipeline"
```

The request returns configuration information for the specified pipeline:

```
{
    "Pipeline": {
        "PipelineName": "my-pipeline",
        "PipelineArn": "arn:aws:osis:us-east-1:123456789012:pipeline/my-pipeline",
        "MinUnits": 9,
        "MaxUnits": 10,
        "Status": "ACTIVE",
        "StatusReason": {
            "Description": "The pipeline is ready to ingest data."
        },
        "PipelineConfigurationBody": "log-pipeline:\n source:\n http:\n processor:\n - grok:\n match:\n log: [ '%{COMMONAPACHELOG}' ]\n - date:\n from_time_received: true\n destination: \"@timestamp\"\n  sink:\n - opensearch:\n hosts: [ \"https://search-mdp-performance-test-duxkb4qnycd63rpy6svmvyvfpi.us-east-1.es.amazonaws.com\" ]\n index: \"apache_logs\"\n aws_sts_role_arn: \"arn:aws:iam::123456789012:role/my-domain-role\"\n  aws_region: \"us-east-1\"\n  aws_sigv4: true",
        "CreatedAt": "2022-10-01T15:28:05+00:00",
        "LastUpdatedAt": "2022-10-21T21:41:08+00:00",
        "IngestEndpointUrls": [
            "my-pipeline-123456789012.us-east-1.osis.amazonaws.com"
        ]
    }
}
```

## OpenSearch Ingestion API
<a name="list-pipelines-api"></a>

To view OpenSearch Ingestion pipelines using the OpenSearch Ingestion API, call the [ListPipelines](https://docs.aws.amazon.com/opensearch-service/latest/APIReference/API_osis_ListPipelines.html) and [GetPipeline](https://docs.aws.amazon.com/opensearch-service/latest/APIReference/API_osis_GetPipeline.html) operations.

# Updating Amazon OpenSearch Ingestion pipelines
<a name="update-pipeline"></a>

You can update Amazon OpenSearch Ingestion pipelines using the AWS Management Console, the AWS CLI, or the OpenSearch Ingestion API. OpenSearch Ingestion initiates a blue/green deployment when you update a pipeline configuration. For more information, see [Blue/green deployments for pipeline updates](#pipeline-bg).

**Topics**
+ [Considerations](#update-pipeline-considerations)
+ [Permissions required](#update-pipeline-permissions)
+ [Updating pipelines](#update-pipeline-steps)
+ [Blue/green deployments for pipeline updates](#pipeline-bg)

## Considerations
<a name="update-pipeline-considerations"></a>

Consider the following when you update a pipeline:
+ You can't update a pipeline's name or network settings.
+ If your pipeline writes to a VPC domain sink, you can't go back and change the sink to a different VPC domain after the pipeline is created. You must delete and recreate the pipeline with the new sink. You can still switch the sink from a VPC domain to a public domain, from a public domain to a VPC domain, or from a public domain to another public domain.
+ You can switch the pipeline sink at any time between a public OpenSearch Service domain and an OpenSearch Serverless collection.
+ When you update the source, processor, or sink configuration for a pipeline, OpenSearch Ingestion initiates a blue/green deployment. For more information, see [Blue/green deployments for pipeline updates](#pipeline-bg).
+ When you update the source, processor, or sink configuration for a pipeline, OpenSearch Ingestion automatically upgrades your pipeline to the latest supported minor version of the major version of Data Prepper that the pipeline is running. This process keeps your pipeline up to date with the latest bug fixes and performance improvements.
+ You can still make updates to your pipeline when it's stopped. 

## Permissions required
<a name="update-pipeline-permissions"></a>

OpenSearch Ingestion uses the following IAM permissions for updating pipelines:
+ `osis:UpdatePipeline` – Update a pipeline.
+ `osis:ValidatePipeline` – Check whether a pipeline configuration is valid.
+ `iam:PassRole` – Pass the pipeline role to OpenSearch Ingestion so that it can write data to the domain. This permission is only required if you're updating the pipeline configuration, not if you're modifying other settings such as log publishing or capacity limits.

For example, the following policy grants permission to update a pipeline:

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Resource": "*",
            "Action": [
                "osis:UpdatePipeline",
                "osis:ValidatePipeline"
            ]
        },
        {
            "Resource": [
                "arn:aws:iam::111122223333:role/pipeline-role"
            ],
            "Effect": "Allow",
            "Action": [
                "iam:PassRole"
            ]
        }
    ]
}
```

------

## Updating pipelines
<a name="update-pipeline-steps"></a>

You can update Amazon OpenSearch Ingestion pipelines using the AWS Management Console, the AWS CLI, or the OpenSearch Ingestion API. 

### Console
<a name="update-pipeline-console"></a>

**To update a pipeline**

1. Sign in to the Amazon OpenSearch Service console at [https://console.aws.amazon.com/aos/osis/home](https://console.aws.amazon.com/aos/osis/home#osis/ingestion-pipelines). You'll be on the Pipelines page.

1. Choose a pipeline to open its settings. Then, choose one of the **Edit** options.

1. When you're done making changes, choose **Save**.

### CLI
<a name="update-pipeline-cli"></a>

To update a pipeline using the AWS CLI, send an [update-pipeline](https://docs.aws.amazon.com/cli/latest/reference/osis/update-pipeline.html) request. The following sample request uploads a new configuration file and updates the minimum and maximum capacity values:

```
aws osis update-pipeline \
  --pipeline-name "my-pipeline" \
  --pipeline-configuration-body "file://new-pipeline-config.yaml" \
  --min-units 11 \
  --max-units 18
```

### OpenSearch Ingestion API
<a name="update-pipeline-api"></a>

To update an OpenSearch Ingestion pipeline using the OpenSearch Ingestion API, call the [UpdatePipeline](https://docs.aws.amazon.com/opensearch-service/latest/APIReference/API_osis_UpdatePipeline.html) operation.

## Blue/green deployments for pipeline updates
<a name="pipeline-bg"></a>

OpenSearch Ingestion initiates a *blue/green* deployment process when you update a pipeline configuration.

Blue/green refers to the practice of creating a new environment for pipeline updates and routing traffic to the new environment after those updates are complete. The practice minimizes downtime and maintains the original environment in the event that deployment to the new environment is unsuccessful. Blue/green deployments themselves don't have any performance impact, but performance might change if your pipeline configuration changes in a way that alters performance.

OpenSearch Ingestion blocks auto-scaling during blue/green deployments. You continue to be charged only for traffic to the old pipeline until it's redirected to the new pipeline. Once traffic has been redirected, you're only charged for the new pipeline. You're never charged for two pipelines simultaneously.

When you update a pipeline's source, processor, or sink configuration, OpenSearch Ingestion can automatically upgrade your pipeline to the latest supported minor version of the major version that the pipeline is running. For example, you might have `version: "2"` in your pipeline configuration, and OpenSearch Ingestion initially provisioned the pipeline with version 2.1.0. When support for version 2.1.1 is added, and you make a change to your pipeline configuration, OpenSearch Ingestion upgrades your pipeline to version 2.1.1.

This process keeps your pipeline up to date with the latest bug fixes and performance improvements. OpenSearch Ingestion can't update the major version of your pipeline unless you manually change the `version` option within the pipeline configuration.
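The version-selection behavior described above can be sketched as follows; the list of supported versions is hypothetical:

```python
def auto_upgrade_target(current, supported):
    """Pick the latest supported version that shares current's major version.

    Mirrors the described behavior: minor and patch upgrades are automatic,
    but the major version never changes.
    """
    major = current.split(".")[0]
    candidates = [v for v in supported if v.split(".")[0] == major]
    return max(candidates, key=lambda v: tuple(int(p) for p in v.split(".")))

# Hypothetical supported versions: the pipeline stays on major version 2.
print(auto_upgrade_target("2.1.0", ["2.0.0", "2.1.0", "2.1.1", "3.0.0"]))  # 2.1.1
```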

# Managing Amazon OpenSearch Ingestion pipeline costs
<a name="pipeline--stop-start"></a>

You can start and stop ingestion pipelines in Amazon OpenSearch Ingestion to control data flow based on your needs. Stopping a pipeline halts data processing while preserving configurations, so you can restart it without reconfiguring it. This can help optimize costs, manage resource usage, or troubleshoot issues. When you stop a pipeline, OpenSearch Ingestion doesn't process incoming data, but previously ingested data remains available in OpenSearch. 

Starting and stopping simplifies the setup and teardown processes for pipelines that you use for development, testing, or similar activities that don't require continuous availability. While your pipeline is stopped, you aren't charged for any Ingestion OCU hours. You can still update stopped pipelines, and they receive automatic minor version updates and security patches. 

For pipelines with pull-based sources, such as Amazon DynamoDB, Amazon S3, and Amazon DocumentDB, stopping and starting a pipeline results in reprocessing all data from the beginning. When you stop a pipeline, any service-managed VPC endpoints created by the pipeline are removed. For pipelines with self-managed VPC endpoints, you must recreate the VPC endpoint in your account when you restart the pipeline. For more information, see [Self-managed VPC endpoints](pipeline-security.md#pipeline-vpc-self-managed).

**Note**  
If your pipeline has excess capacity but needs to remain operational, consider adjusting its maximum capacity limits rather than stopping and restarting it. This can help manage costs while ensuring that the pipeline continues processing data efficiently. For more details, see [Scaling pipelines in Amazon OpenSearch Ingestion](ingestion-scaling.md).

The following topics explain how to start and stop pipelines using the AWS Management Console, AWS CLI, and OpenSearch Ingestion API.

**Topics**
+ [Stopping an Amazon OpenSearch Ingestion pipeline](pipeline--stop.md)
+ [Starting an Amazon OpenSearch Ingestion pipeline](pipeline--start.md)

# Stopping an Amazon OpenSearch Ingestion pipeline
<a name="pipeline--stop"></a>

You can stop an OpenSearch Ingestion pipeline only when it's active. After you stop a pipeline, you can start it again at any time. While your pipeline is stopped, you're not charged for Ingestion OCU hours.

## Console
<a name="stop-pipeline-console"></a>

**To stop a pipeline**

1. Sign in to the Amazon OpenSearch Service console at [https://console.aws.amazon.com/aos/osis/home](https://console.aws.amazon.com/aos/osis/home#osis/ingestion-pipelines). You'll be on the Pipelines page.

1. Choose a pipeline. You can perform the stop operation from this page, or navigate to the details page for the pipeline that you want to stop.

1. For **Actions**, choose **Stop pipeline**.

   If a pipeline can't be stopped and started, the **Stop pipeline** action isn't available.

## AWS CLI
<a name="stop-pipeline-cli"></a>

To stop a pipeline using the AWS CLI, call the [stop-pipeline](https://docs.aws.amazon.com/cli/latest/reference/osis/stop-pipeline.html) command with the following parameters: 
+ `--pipeline-name` – the name of the pipeline. 

**Example**  

```
aws osis stop-pipeline --pipeline-name my-pipeline
```

## OpenSearch Ingestion API
<a name="stop-pipeline-api"></a>

To stop a pipeline using the OpenSearch Ingestion API, call the [StopPipeline](https://docs.aws.amazon.com/opensearch-service/latest/APIReference/API_osis_StopPipeline.html) operation with the following parameter: 
+ `PipelineName` – the name of the pipeline. 

# Starting an Amazon OpenSearch Ingestion pipeline
<a name="pipeline--start"></a>

You can only start an OpenSearch Ingestion pipeline that's in the stopped state. The pipeline keeps its configuration settings, such as capacity limits, network settings, and log publishing options.

Restarting a pipeline usually takes several minutes.

## Console
<a name="start-pipeline-console"></a>

**To start a pipeline**

1. Sign in to the Amazon OpenSearch Service console at [https://console.aws.amazon.com/aos/osis/home](https://console.aws.amazon.com/aos/osis/home#osis/ingestion-pipelines). You'll be on the Pipelines page.

1. Choose a pipeline. You can perform the start operation from this page, or navigate to the details page for the pipeline that you want to start.

1.  For **Actions**, choose **Start pipeline**. 

## AWS CLI
<a name="start-pipeline-cli"></a>

To start a pipeline by using the AWS CLI, call the [start-pipeline](https://docs.aws.amazon.com/cli/latest/reference/osis/start-pipeline.html) command with the following parameters: 
+ `--pipeline-name` – the name of the pipeline.

**Example**  

```
aws osis start-pipeline --pipeline-name my-pipeline
```

## OpenSearch Ingestion API
<a name="start-pipeline-api"></a>

To start an OpenSearch Ingestion pipeline using the OpenSearch Ingestion API, call the [StartPipeline](https://docs.aws.amazon.com/opensearch-service/latest/APIReference/API_osis_StartPipeline.html) operation with the following parameter: 
+ `PipelineName` – the name of the pipeline.

# Deleting Amazon OpenSearch Ingestion pipelines
<a name="delete-pipeline"></a>

You can delete an Amazon OpenSearch Ingestion pipeline using the AWS Management Console, the AWS CLI, or the OpenSearch Ingestion API. You can't delete a pipeline that has a status of `Creating` or `Updating`.

## Console
<a name="delete-pipeline-console"></a>

**To delete a pipeline**

1. Sign in to the Amazon OpenSearch Service console at [https://console.aws.amazon.com/aos/osis/home](https://console.aws.amazon.com/aos/osis/home#osis/ingestion-pipelines). You'll be on the Pipelines page.

1. Select the pipeline that you want to delete and choose **Actions**, **Delete**.

1. Confirm deletion and choose **Delete**.

## CLI
<a name="delete-pipeline-cli"></a>

To delete a pipeline using the AWS CLI, send a [delete-pipeline](https://docs.aws.amazon.com/cli/latest/reference/osis/delete-pipeline.html) request:

```
aws osis delete-pipeline --pipeline-name "my-pipeline"
```

## OpenSearch Ingestion API
<a name="delete-pipeline-api"></a>

To delete an OpenSearch Ingestion pipeline using the OpenSearch Ingestion API, call the [DeletePipeline](https://docs.aws.amazon.com/opensearch-service/latest/APIReference/API_osis_DeletePipeline.html) operation with the following parameter: 
+ `PipelineName` – the name of the pipeline.

# Supported plugins and options for Amazon OpenSearch Ingestion pipelines
<a name="pipeline-config-reference"></a>

Amazon OpenSearch Ingestion supports a subset of sources, processors, and sinks within open source [OpenSearch Data Prepper](https://opensearch.org/docs/latest/data-prepper/). In addition, there are some constraints that OpenSearch Ingestion places on the available options for each supported plugin. The following sections describe the plugins and associated options that OpenSearch Ingestion supports.

**Note**  
OpenSearch Ingestion doesn't support any buffer plugins because it automatically configures a default buffer. You receive a validation error if you include a buffer in your pipeline configuration.

**Topics**
+ [Supported plugins](#ingestion-plugins)
+ [Stateless versus stateful processors](#processor-stateful-stateless)
+ [Configuration requirements and constraints](#ingestion-parameters)

## Supported plugins
<a name="ingestion-plugins"></a>

OpenSearch Ingestion supports the following Data Prepper plugins:

**Sources**:
+ [DocumentDB](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/documentdb/)
+ [DynamoDB](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/dynamo-db/)
+ [HTTP](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/http-source/)
+ [Kafka](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/kafka/)
+ [Kinesis](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/kinesis/)
+ [OpenSearch](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/opensearch/)
+ [OTel logs](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/otel-logs-source/)
+ [OTel metrics](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/otel-metrics-source/)
+ [OTel trace](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/otel-trace/)
+ [S3](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/s3/)

**Processors**:
+ [Add entries](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/add-entries/)
+ [Aggregate](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/aggregate/)
+ [Anomaly detector](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/anomaly-detector/)
+ [AWS Lambda](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/aws-lambda/)
+ [Convert entry type](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/convert-entry-type/)
+ [Copy values](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/copy-values/)
+ [CSV](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/csv/)
+ [Date](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/date/)
+ [Delay](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/delay/)
+ [Decompress](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/decompress/)
+ [Delete entries](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/delete-entries/)
+ [Dissect](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/dissect/)
+ [Drop events](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/drop-events/)
+ [Flatten](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/flatten/)
+ [Geo IP](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/geoip/)
+ [Grok](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/grok/)
+ [Key value](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/key-value/)
+ [List to map](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/list-to-map/)
+ [Lowercase string](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/lowercase-string/)
+ [Map to list](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/map-to-list/)
+ [Mutate event](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/mutate-event/) (series of processors)
+ [Mutate string](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/mutate-string/) (series of processors)
+ [Obfuscate](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/obfuscate/)
+ [OTel metrics](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/otel-metrics/)
+ [OTel trace group](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/otel-trace-group/)
+ [OTel trace](https://docs.opensearch.org/latest/data-prepper/common-use-cases/trace-analytics/)
+ [Parse Ion](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/parse-ion/)
+ [Parse JSON](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/parse-json/)
+ [Parse XML](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/parse-xml/)
+ [Rename keys](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/rename-keys/)
+ [Select entries](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/select-entries/)
+ [Service map](https://docs.opensearch.org/latest/data-prepper/common-use-cases/trace-analytics/)
+ [Split event](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/split-event/)
+ [Split string](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/split-string/)
+ [String converter](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/string-converter/)
+ [Substitute string](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/substitute-string/)
+ [Trace peer forwarder](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/trace-peer-forwarder/)
+ [Translate](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/translate/)
+ [Trim string](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/trim-string/)
+ [Truncate](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/truncate/)
+ [Uppercase string](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/uppercase-string/)
+ [User agent](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/user-agent/)
+ [Write JSON](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/write-json/)

**Sinks**:
+ [OpenSearch](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sinks/opensearch/) (supports OpenSearch Service, OpenSearch Serverless, and Elasticsearch 6.8 or later)
+ [S3](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sinks/s3/)

**Sink codecs**:
+ [Avro](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sinks/s3/#avro-codec)
+ [NDJSON](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sinks/s3/#ndjson-codec)
+ [JSON](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sinks/s3/#json-codec)
+ [Parquet](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sinks/s3/#parquet-codec)

## Stateless versus stateful processors
<a name="processor-stateful-stateless"></a>

*Stateless* processors perform operations like transformations and filtering, while *stateful* processors perform operations like aggregations, which retain the results of previous runs. OpenSearch Ingestion supports the stateful processors [Aggregate](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/aggregate/) and [Service-map](https://docs.opensearch.org/latest/data-prepper/pipelines/configuration/processors/service-map/). All other supported processors are stateless.

For pipelines that contain only stateless processors, the maximum capacity limit is 96 Ingestion OCUs. If a pipeline contains any stateful processors, the maximum capacity limit is 48 Ingestion OCUs. However, if a pipeline has [persistent buffering](osis-features-overview.md#persistent-buffering) enabled, it can have a maximum of 384 Ingestion OCUs with only stateless processors, or 192 Ingestion OCUs if it contains any stateful processors. For more information, see [Scaling pipelines in Amazon OpenSearch Ingestion](ingestion-scaling.md).

End-to-end acknowledgment is only supported for stateless processors. For more information, see [End-to-end acknowledgement](osis-features-overview.md#osis-features-e2e).

## Configuration requirements and constraints
<a name="ingestion-parameters"></a>

Unless otherwise specified below, all options described in the Data Prepper configuration reference for the supported plugins listed above are allowed in OpenSearch Ingestion pipelines. The following sections explain the constraints that OpenSearch Ingestion places on certain plugin options.

**Note**  
OpenSearch Ingestion doesn't support any buffer plugins because it automatically configures a default buffer. You receive a validation error if you include a buffer in your pipeline configuration.

Many options are configured and managed internally by OpenSearch Ingestion, such as `authentication` and `acm_certificate_arn`. Other options, such as `thread_count` and `request_timeout`, have performance impacts if changed manually. Therefore, these values are set internally to ensure optimal performance of your pipelines.

Lastly, some options, such as `ism_policy_file` and `sink_template`, can't be passed to OpenSearch Ingestion because they refer to local files when run in open source Data Prepper. These values aren't supported.

**Topics**
+ [General pipeline options](#ingestion-params-general)
+ [Grok processor](#ingestion-params-grok)
+ [HTTP source](#ingestion-params-http)
+ [OpenSearch sink](#ingestion-params-opensearch)
+ [OTel metrics source, OTel trace source, and OTel logs source](#ingestion-params-otel-source)
+ [OTel trace group processor](#ingestion-params-otel-trace)
+ [OTel trace processor](#ingestion-params-otel-raw)
+ [Service-map processor](#ingestion-params-servicemap)
+ [S3 source](#ingestion-params-s3)

### General pipeline options
<a name="ingestion-params-general"></a>

The following [general pipeline options](https://docs.opensearch.org/latest/data-prepper/pipelines/pipelines/) are set by OpenSearch Ingestion and aren't supported in pipeline configurations:
+ `workers`
+ `delay`

### Grok processor
<a name="ingestion-params-grok"></a>

The following [Grok](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/grok/) processor options aren't supported:
+ `patterns_directories`
+ `patterns_files_glob`
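
Because pattern files aren't available, any custom patterns must be defined inline with the `pattern_definitions` option instead. The following fragment is an illustrative sketch; the key name and custom pattern are hypothetical:

```
  processor:
    - grok:
        match:
          # Parse the 'log' key with a built-in pattern
          log: ['%{COMMONAPACHELOG}']
        # Define custom patterns inline instead of loading them
        # from unsupported pattern files
        pattern_definitions:
          CUSTOM_STATUS: "[A-Z]{2,5}"
```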

### HTTP source
<a name="ingestion-params-http"></a>

The [HTTP](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/http-source/) source plugin has the following requirements and constraints:
+ The `path` option is *required*. The path is a string such as `/log/ingest` that represents the URI path for log ingestion, and defines the URI that you use to send data to the pipeline (for example, `https://log-pipeline.us-west-2.osis.amazonaws.com/log/ingest`). The path must start with a slash (/), and can contain the special characters '-', '_', '.', and '/', as well as the `${pipelineName}` placeholder.
+ The following HTTP source options are set by OpenSearch Ingestion and aren't supported in pipeline configurations:
  + `port`
  + `ssl`
  + `ssl_key_file`
  + `ssl_certificate_file`
  + `aws_region`
  + `authentication`
  + `unauthenticated_health_check`
  + `use_acm_certificate_for_ssl`
  + `thread_count`
  + `request_timeout`
  + `max_connection_count`
  + `max_pending_requests`
  + `health_check_service`
  + `acm_private_key_password`
  + `acm_certificate_timeout_millis`
  + `acm_certificate_arn`
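
For example, a sub-pipeline with an HTTP source might declare only the required `path`; OpenSearch Ingestion sets the remaining options listed above. This is a minimal sketch, and the pipeline and path names are illustrative:

```
log-pipeline:
  source:
    http:
      # Required; appended to the ingestion URL to form the full endpoint
      path: "/log/ingest"
  sink:
    - opensearch:
        # ... sink configuration ...
```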

### OpenSearch sink
<a name="ingestion-params-opensearch"></a>

The [OpenSearch](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sinks/opensearch/) sink plugin has the following requirements and limitations.
+ The `aws` option is *required*, and must contain the following options:
  + `sts_role_arn`
  + `region`
  + `hosts`
  + `serverless` (if the sink is an OpenSearch Serverless collection)
+ The `sts_role_arn` option must point to the same role for each sink within a YAML definition file.
+ The `hosts` option must specify an OpenSearch Service domain endpoint or an OpenSearch Serverless collection endpoint. You can't specify a [custom endpoint](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/customendpoint.html) for a domain; it must be the standard endpoint.
+ If the `hosts` option is a serverless collection endpoint, you must set the `serverless` option to `true`. In addition, if your YAML definition file contains the `index_type` option, it must be set to `management_disabled`; otherwise, validation fails.
+ The following options aren't supported:
  + `username`
  + `password`
  + `cert`
  + `proxy`
  + `dlq_file` - If you want to offload failed events to a dead letter queue (DLQ), you must use the `dlq` option and specify an S3 bucket.
  + `ism_policy_file`
  + `socket_timeout`
  + `template_file`
  + `insecure`
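
A sink definition that follows these constraints might look like the following sketch. The domain endpoint, role ARN, and bucket name are placeholders; note the use of `dlq` (rather than the unsupported `dlq_file`) for failed events:

```
  sink:
    - opensearch:
        hosts: ["https://search-mydomain-1a2a3a4a5a6a7a8a9a0a9a8a7a.us-east-1.es.amazonaws.com"]
        index: "application_logs"
        aws:
          # Must be the same role for every sink in the YAML file
          sts_role_arn: "arn:aws:iam::111122223333:role/pipeline-role"
          region: "us-east-1"
        # Offload failed events to an S3 bucket instead of a local DLQ file
        dlq:
          s3:
            bucket: "my-dlq-bucket"
            region: "us-east-1"
            sts_role_arn: "arn:aws:iam::111122223333:role/pipeline-role"
```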

### OTel metrics source, OTel trace source, and OTel logs source
<a name="ingestion-params-otel-source"></a>

The [OTel metrics](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/otel-metrics-source/) source, [OTel trace](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/otel-trace/) source, and [OTel logs](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/otel-logs-source/) source plugins have the following requirements and limitations:
+ The `path` option is *required*. The path is a string such as `/log/ingest` that represents the URI path for log ingestion, and defines the URI that you use to send data to the pipeline (for example, `https://log-pipeline.us-west-2.osis.amazonaws.com/log/ingest`). The path must start with a slash (/), and can contain the special characters '-', '_', '.', and '/', as well as the `${pipelineName}` placeholder.
+ The following options are set by OpenSearch Ingestion and aren't supported in pipeline configurations:
  + `port`
  + `ssl`
  + `sslKeyFile`
  + `sslKeyCertChainFile`
  + `authentication`
  + `unauthenticated_health_check`
  + `useAcmCertForSSL`
  + `unframed_requests`
  + `proto_reflection_service`
  + `thread_count`
  + `request_timeout`
  + `max_connection_count`
  + `acmPrivateKeyPassword`
  + `acmCertIssueTimeOutMillis`
  + `health_check_service`
  + `acmCertificateArn`
  + `awsRegion`

### OTel trace group processor
<a name="ingestion-params-otel-trace"></a>

The [OTel trace group](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/otel-trace-group/) processor has the following requirements and limitations:
+ The `aws` option is *required*, and must contain the following options:
  + `sts_role_arn`
  + `region`
  + `hosts`
+ The `sts_role_arn` option must specify the same role as the pipeline role that you specify in the OpenSearch sink configuration.
+ The `username`, `password`, `cert`, and `insecure` options aren't supported.
+ The `aws_sigv4` option is required and must be set to `true`.
+ The `serverless` option within the OpenSearch sink plugin isn't supported. The OTel trace group processor doesn't currently work with OpenSearch Serverless collections.
+ The number of `otel_trace_group` processors within the pipeline configuration body can't exceed 8.
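
A conforming processor entry might look like the following sketch, with a placeholder domain endpoint and role ARN:

```
  processor:
    - otel_trace_group:
        hosts: ["https://search-mydomain-1a2a3a4a5a6a7a8a9a0a9a8a7a.us-east-1.es.amazonaws.com"]
        # Required
        aws_sigv4: true
        aws:
          # Must be the same role that the OpenSearch sink uses
          sts_role_arn: "arn:aws:iam::111122223333:role/pipeline-role"
          region: "us-east-1"
```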

### OTel trace processor
<a name="ingestion-params-otel-raw"></a>

The [OTel trace](https://docs.opensearch.org/latest/data-prepper/pipelines/configuration/processors/otel-traces/) processor has the following requirements and limitations:
+ The value of the `trace_flush_interval` option can't exceed 300 seconds.

### Service-map processor
<a name="ingestion-params-servicemap"></a>

The [Service-map](https://docs.opensearch.org/latest/data-prepper/pipelines/configuration/processors/service-map/) processor has the following requirements and limitations:
+ The value of the `window_duration` option can't exceed 300 seconds.

### S3 source
<a name="ingestion-params-s3"></a>

The [S3](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/s3/) source plugin has the following requirements and limitations:
+ The `aws` option is *required*, and must contain `region` and `sts_role_arn` options.
+ The value of the `records_to_accumulate` option can't exceed 200.
+ The value of the `maximum_messages` option can't exceed 10.
+ If specified, the `disable_bucket_ownership_validation` option must be set to `false`.
+ If specified, the `input_serialization` option must be set to `parquet`.
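
For example, an S3 source that stays within these limits might look like the following sketch; the queue URL and role ARN are placeholders:

```
  source:
    s3:
      notification_type: "sqs"
      codec:
        newline:
      sqs:
        queue_url: "https://sqs.us-east-1.amazonaws.com/111122223333/my-queue"
        # Can't exceed 10
        maximum_messages: 10
      # Can't exceed 200
      records_to_accumulate: 100
      aws:
        region: "us-east-1"
        sts_role_arn: "arn:aws:iam::111122223333:role/pipeline-role"
```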

# Integrating Amazon OpenSearch Ingestion pipelines with other services and applications
<a name="configure-client"></a>

To successfully ingest data into an Amazon OpenSearch Ingestion pipeline, you must configure your client application (the *source*) to send data to the pipeline endpoint. Your source might be a client like Fluent Bit or the OpenTelemetry Collector, or a simple S3 bucket. The exact configuration differs for each client.

The important differences during source configuration (compared to sending data directly to an OpenSearch Service domain or OpenSearch Serverless collection) are the AWS service name (`osis`) and the host endpoint, which must be the pipeline endpoint.

## Constructing the ingestion endpoint
<a name="configure-client-endpoint"></a>

To ingest data into a pipeline, send it to the ingestion endpoint. To locate the ingestion URL, navigate to the **Pipeline settings** page and copy the **Ingestion URL**.

![\[Pipeline settings page showing details like status, capacity, and ingestion URL for data input.\]](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/pipeline-endpoint.png)


To construct the full ingestion endpoint for pull-based sources like [OTel trace](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/otel-trace/) and [OTel metrics](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/otel-metrics-source/), add the ingestion path from your pipeline configuration to the ingestion URL.

For example, say that your pipeline configuration has the following ingestion path:

![\[Input field for HTTP source path with example "/my/test_path" entered.\]](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/ingestion-path.png)


The full ingestion endpoint, which you specify in your client configuration, will take the following format: `https://ingestion-pipeline-abcdefg.us-east-1.osis.amazonaws.com/my/test_path`.
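
With a tool that signs requests with Signature Version 4 (for example, the open source `awscurl` utility), you could then send a test event to that endpoint. This command is a sketch; the pipeline endpoint and payload are illustrative:

```
awscurl --service osis --region us-east-1 \
    -X POST \
    -H "Content-Type: application/json" \
    -d '[{"time": "2014-08-11T11:40:13", "log": "test log entry"}]' \
    https://ingestion-pipeline-abcdefg.us-east-1.osis.amazonaws.com/my/test_path
```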

## Creating an ingestion role
<a name="configure-client-auth"></a>

All requests to OpenSearch Ingestion must be signed with [Signature Version 4](https://docs.aws.amazon.com/general/latest/gr/signature-version-4.html). At minimum, the role that signs the request must be granted permission for the `osis:Ingest` action, which allows it to send data to an OpenSearch Ingestion pipeline.

For example, the following AWS Identity and Access Management (IAM) policy allows the corresponding role to send data to a single pipeline:

------
#### [ JSON ]

****  

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "osis:Ingest",
      "Resource": "arn:aws:osis:us-east-1:111122223333:pipeline/pipeline-name"
    }
  ]
}
```

------

**Note**  
To use the role for *all* pipelines, replace the ARN in the `Resource` element with a wildcard (`*`).

### Providing cross-account ingestion access
<a name="configure-client-cross-account"></a>

**Note**  
You can only provide cross-account ingestion access for public pipelines, not VPC pipelines.

You might need to ingest data into a pipeline from a different AWS account, such as an account that houses your source application. If the principal that writes to a pipeline is in a different account than the pipeline itself, you must configure the principal to assume an IAM role in the pipeline account that has permission to ingest data into the pipeline.

**To configure cross-account ingestion permissions**

1. Create the ingestion role with `osis:Ingest` permission (described in the previous section) within the same AWS account as the pipeline. For instructions, see [Creating IAM roles](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create.html).

1. Attach a [trust policy](https://docs.aws.amazon.com/IAM/latest/UserGuide/roles-managingrole-editing-console.html#roles-managingrole_edit-trust-policy) to the ingestion role that allows a principal in another account to assume it:

------
#### [ JSON ]

****  

   ```
   {
     "Version": "2012-10-17",
     "Statement": [{
        "Effect": "Allow",
        "Principal": {
          "AWS": "arn:aws:iam::111122223333:root"
         },
        "Action": "sts:AssumeRole"
     }]
   }
   ```

------

1. In the other account, configure your client application (for example, Fluent Bit) to assume the ingestion role. For this to work, the application account must grant the application user or role permission to assume the ingestion role.

   The following example identity-based policy allows the attached principal to assume `ingestion-role` from the pipeline account:

------
#### [ JSON ]

****  

   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Action": "sts:AssumeRole",
         "Resource": "arn:aws:iam::111122223333:role/ingestion-role"
       }
     ]
   }
   ```

------

The client application can then use the [AssumeRole](https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html) operation to assume `ingestion-role` and ingest data into the associated pipeline.
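
For example, the client can retrieve temporary credentials with the AWS CLI and use them to sign ingestion requests. The role ARN and session name here are illustrative:

```
aws sts assume-role \
    --role-arn arn:aws:iam::111122223333:role/ingestion-role \
    --role-session-name pipeline-ingestion-session
```

The call returns temporary credentials (an access key, a secret key, and a session token) that the client uses to sign requests to the pipeline endpoint.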

# Using an OpenSearch Ingestion pipeline with Atlassian Services
<a name="configure-client-atlassian"></a>

You can use the Atlassian Jira and Confluence source plugins to ingest data from Atlassian services into your OpenSearch Ingestion pipeline. These integrations enable you to create a unified searchable knowledge base by synchronizing complete Jira projects and Confluence spaces, while maintaining real-time relevance through continuous monitoring and automatic synchronization of updates.

------
#### [ Integrating with Jira ]

Transform your Jira experience with powerful contextual search capabilities by integrating your Jira content into OpenSearch. The Data Prepper [Atlassian Jira](https://www.atlassian.com/software/jira) source plugin enables you to create a unified searchable knowledge base by synchronizing complete Jira projects, while maintaining real-time relevance through continuous monitoring and automatic synchronization of updates. This integration allows for data synchronization with flexible filtering options for specific projects, issue types, and status, ensuring that only the information you need is imported. 

To ensure secure and reliable connectivity, the plugin supports multiple authentication methods, including basic API key authentication and OAuth2 authentication, with the added security of managing credentials using a secret stored in AWS Secrets Manager. It also features automatic token renewal for uninterrupted access, ensuring continuous operation. Built on Atlassian's [API version 2](https://developer.atlassian.com/cloud/jira/platform/rest/v2/intro/#version%22%3Eapi-version-2), this integration empowers teams to unlock valuable insights from their Jira data through OpenSearch's advanced search capabilities.

------
#### [ Integrating with Confluence ]

Enhance your team's knowledge management and collaboration capabilities by integrating [Atlassian Confluence](https://www.atlassian.com/software/confluence) content into OpenSearch through Data Prepper's Confluence source plugin. This integration enables you to create a centralized, searchable repository of collective knowledge, improving information discovery and team productivity. By synchronizing Confluence content and continuously monitoring for updates, the plugin ensures that your OpenSearch index remains up-to-date and comprehensive. 

The integration offers flexible filtering options, allowing you to selectively import content from specific spaces or page types, tailoring the synchronized content to your organization's needs. The plugin supports both basic API key and OAuth2 authentication methods, with the option of securely managing credentials through AWS Secrets Manager. The plugin's automatic token renewal feature ensures uninterrupted access and seamless operation. Built on Atlassian's Confluence [API](https://developer.atlassian.com/cloud/confluence/rest/v1/intro/#auth), this integration enables teams to leverage OpenSearch's advanced search capabilities across their Confluence content, enhancing information accessibility and utilization within the organization.

------

**Topics**
+ [Prerequisites](#atlassian-prerequisites)
+ [Configure a pipeline role](#atlassian-pipeline-role)
+ [Jira connector pipeline configuration](#jira-connector-pipeline)
+ [Confluence connector pipeline configuration](#confluence-connector-pipeline)
+ [Data consistency](#data-consistency)
+ [Limitations](#limitations)
+ [Metrics in CloudWatch for Atlassian connectors](#metrics)
+ [Connecting an Amazon OpenSearch Ingestion pipeline to Atlassian Jira or Confluence using OAuth 2.0](configure-client-atlassian-OAuth2-setup.md)

## Prerequisites
<a name="atlassian-prerequisites"></a>

Before you create your OpenSearch Ingestion pipeline, complete the following steps:

1. Prepare credentials for your Jira or Confluence site by choosing one of the following options. OpenSearch Ingestion requires only `ReadOnly` authorization to the content.

   1. **Option 1: API key** – Log in to your Atlassian account and use the information in the following topic to generate your API key:
      + [Manage API tokens for your Atlassian account](https://support.atlassian.com/atlassian-account/docs/manage-api-tokens-for-your-atlassian-account/)

   1. **Option 2: OAuth2** – Log in to your Atlassian account and use the information in [Connecting an Amazon OpenSearch Ingestion pipeline to Atlassian Jira or Confluence using OAuth 2.0](configure-client-atlassian-OAuth2-setup.md).

1. [Create a secret in AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/create_secret.html) to store the credentials created in the previous step. Make the following choices as you follow the procedure: 
   + For **Secret type**, choose **Other type of secret**.
   + For **Key/value pairs**, create the following pairs, depending on your selected authorization type: 

------
#### [ API key ]

   ```
   {
      "username": "user-name-usually-email-id",
      "password": "api-key"
   }
   ```

------
#### [ OAuth 2.0 ]

   ```
   {
      "clientId": "client-id",
      "clientSecret": "client-secret",
      "accessKey": "access-key",
      "refreshKey": "refresh-key"
   }
   ```

------

   After you've created the secret, copy the Amazon Resource Name (ARN) of the secret. You will include it in the pipeline role permissions policy.
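
   If you prefer the AWS CLI, you can create the secret in one step. The secret name and credential values here are placeholders:

   ```
   aws secretsmanager create-secret \
       --name pipeline-atlassian-credentials \
       --secret-string '{"username": "user@example.com", "password": "api-key"}'
   ```

   The command output includes the secret's ARN.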

## Configure a pipeline role
<a name="atlassian-pipeline-role"></a>

The role that you pass in the pipeline must have the following policy attached so that it can read and write the secret that you created in the prerequisites section.

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "SecretReadWrite",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetResourcePolicy",
                "secretsmanager:GetSecretValue",
                "secretsmanager:DescribeSecret",
                "secretsmanager:PutSecretValue",
                "secretsmanager:ListSecretVersionIds"
            ],
            "Resource": "arn:aws:secretsmanager:us-east-1:111122223333:secret:secret-name-random-6-characters"
        }
    ]
}
```

------

The role should also have a policy attached to access and write to your chosen sink. For example, if you choose OpenSearch as your sink, the policy looks similar to the following:

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "OpenSearchWritePolicy",
            "Effect": "Allow",
            "Action": "aoss:*",
            "Resource": "arn:aws:aoss:us-east-1:111122223333:collection/collection-id"
        }
    ]
}
```

------

## Jira connector pipeline configuration
<a name="jira-connector-pipeline"></a>

You can use a preconfigured Atlassian Jira blueprint to create this pipeline. For more information, see [Working with blueprints](pipeline-blueprint.md).

Replace the *placeholder values* with your own information.

```
version: "2"
extension:
  aws:
    secrets:
      jira-account-credentials:
        secret_id: "secret-arn"
        region: "secret-region"
        sts_role_arn: "arn:aws:iam::123456789012:role/Example-Role"
atlassian-jira-pipeline:
  source:
    jira:
      # We currently support only one host URL.
      hosts: ["jira-host-url"]
      acknowledgments: true
      authentication:
        # Provide one of the authentication methods to use. Supported methods are 'basic' and 'oauth2'.
        # For basic authentication, the password is the API key that you generate using your Jira account.
        basic:
          username: ${{aws_secrets:jira-account-credentials:username}}
          password: ${{aws_secrets:jira-account-credentials:password}}
        # For OAuth2-based authentication, we require the following four key values stored in the secret.
        # Follow the Atlassian instructions at the following link to generate these keys:
        # https://developer.atlassian.com/cloud/confluence/oauth-2-3lo-apps/
        # If you are using OAuth2 authentication, we also require write permission to your AWS secret to
        # be able to write the renewed tokens back into the secret.
        # oauth2:
          # client_id: ${{aws_secrets:jira-account-credentials:clientId}}
          # client_secret: ${{aws_secrets:jira-account-credentials:clientSecret}}
          # access_token: ${{aws_secrets:jira-account-credentials:accessToken}}
          # refresh_token: ${{aws_secrets:jira-account-credentials:refreshToken}}
      filter:
        project:
          key:
            include:
              # This is not the project name.
              # It is an alphanumeric project key that you can find under project details in Jira.
              - "project-key"
              - "project-key"
            # exclude:
              # - "project-key"
              # - "project-key"
        issue_type:
          include:
            - "issue-type"
            # - "Story"
            # - "Bug"
            # - "Task"
          # exclude:
            # - "Epic"
        status:
          include:
            - "ticket-status"
            # - "To Do"
            # - "In Progress"
            # - "Done"
          # exclude:
            # - "Backlog"

  sink:
    - opensearch:
        # Provide an Amazon OpenSearch Service domain endpoint
        hosts: [ "https://search-mydomain-1a2a3a4a5a6a7a8a9a0a9a8a7a.us-east-1.es.amazonaws.com" ]
        index: "index_${getMetadata(\"project\")}"
        # Ensure adding unique document id which is the unique ticket id in this case
        document_id: '${/id}'
        aws:
          # Provide a Role ARN with access to the domain. This role should have a trust relationship with osis-pipelines.amazonaws.com
          sts_role_arn: "arn:aws:iam::123456789012:role/Example-Role"
          # Provide the region of the domain.
          region: "us-east-1"
          # Enable the 'serverless' flag if the sink is an Amazon OpenSearch Serverless collection
          serverless: false
          # serverless_options:
            # Specify a name here to create or update network policy for the serverless collection
            # network_policy_name: "network-policy-name"
        # Enable the 'distribution_version' setting if the Amazon OpenSearch Service domain is of version Elasticsearch 6.x
        # distribution_version: "es6"
        # Enable and switch the 'enable_request_compression' flag if the default compression setting is changed in the domain. 
        # See Compressing HTTP requests in Amazon OpenSearch Service
        # enable_request_compression: true/false
        # Optional: Enable the S3 DLQ to capture any failed requests in an S3 bucket. Delete this entire block if you don't want a DLQ.
        dlq:
          s3:
            # Provide an S3 bucket
            bucket: "your-dlq-bucket-name"
            # Provide a key path prefix for the failed requests
            # key_path_prefix: "kinesis-pipeline/logs/dlq"
            # Provide the region of the bucket.
            region: "us-east-1"
            # Provide a Role ARN with access to the bucket. This role should have a trust relationship with osis-pipelines.amazonaws.com
            sts_role_arn: "arn:aws:iam::123456789012:role/Example-Role"
```

Key to attributes in the Jira source:

1. **hosts**: Your Jira cloud or on-premises URL. Generally, it looks like `https://your-domain-name.atlassian.net/`.

1. **acknowledgments**: Set to `true` to guarantee delivery of data all the way to the sink.

1. **authentication**: Describes how you want the pipeline to access your Jira instance. Choose `basic` or `oauth2` and specify the corresponding key attributes, referencing the keys in your AWS secret.

1. **filter**: This section helps you select which portion of your Jira data to extract and synchronize.

   1. **project**: List the project keys that you want to sync in the `include` section. Otherwise, list the projects that you want to exclude under the `exclude` section. Provide only one of the include or exclude options at any given time.

   1. **issue_type**: Specific issue types that you want to sync. Follow the similar `include` or `exclude` pattern that suits your needs. Note that attachments will appear as anchor links to the original attachment, but the attachment content won't be extracted.

   1. **status**: Specific status filter you want to apply for the data extraction query. If you specify `include`, only tickets with those statuses will be synced. If you specify `exclude`, then all tickets except those with the listed excluded statuses will be synced.

## Confluence connector pipeline configuration
<a name="confluence-connector-pipeline"></a>

You can use a preconfigured Atlassian Confluence blueprint to create this pipeline. For more information, see [Working with blueprints](pipeline-blueprint.md).

```
version: "2"
extension:
  aws:
    secrets:
      confluence-account-credentials:
        secret_id: "secret-arn"
        region: "secret-region"
        sts_role_arn: "arn:aws:iam::123456789012:role/Example-Role"
atlassian-confluence-pipeline:
  source:
    confluence:
      # We currently support only one host URL.
      hosts: ["confluence-host-url"]
      acknowledgments: true
      authentication:
        # Provide one of the authentication methods to use. Supported methods are 'basic' and 'oauth2'.
        # For basic authentication, password is the API key that you generate using your Confluence account
        basic:
          username: ${{aws_secrets:confluence-account-credentials:confluenceId}}
          password: ${{aws_secrets:confluence-account-credentials:confluenceCredential}}
        # For OAuth2 based authentication, we require the following 4 key values stored in the secret
        # Follow atlassian instructions at the following link to generate these keys:
        # https://developer.atlassian.com/cloud/confluence/oauth-2-3lo-apps/
        # If you are using OAuth2 authentication, we also require write permission to your AWS secret to
        # be able to write the renewed tokens back into the secret.
        # oauth2:
          # client_id: ${{aws_secrets:confluence-account-credentials:clientId}}
          # client_secret: ${{aws_secrets:confluence-account-credentials:clientSecret}}
          # access_token: ${{aws_secrets:confluence-account-credentials:accessToken}}
          # refresh_token: ${{aws_secrets:confluence-account-credentials:refreshToken}}
      filter:
        space:
          key:
            include:
              # This is not space name.
              # It is a space key that you can find under space details in Confluence.
              - "space key"
              - "space key"
            # exclude:
            #   - "space key"
            #   - "space key"
        page_type:
          include:
            - "content type"
            # - "page"
            # - "blogpost"
            # - "comment"
          # exclude:
          #   - "attachment"

  sink:
    - opensearch:
        # Provide an Amazon OpenSearch Service domain endpoint
        hosts: [ "https://search-mydomain-1a2a3a4a5a6a7a8a9a0a9a8a7a.us-east-1.es.amazonaws.com" ]
        index: "index_${getMetadata(\"space\")}"
        # Ensure that you add a unique document ID, which is the unique page ID in this case.
        document_id: '${/id}'
        aws:
          # Provide the Amazon Resource Name (ARN) for a role with access to the domain. This role should have a trust relationship with osis-pipelines.amazonaws.com.
          sts_role_arn: "arn:aws:iam::123456789012:role/Example-Role"
          # Provide the Region of the domain.
          region: "us-east-1"
          # Enable the 'serverless' flag if the sink is an Amazon OpenSearch Serverless collection
          serverless: false
          # serverless_options:
            # Specify a name here to create or update network policy for the serverless collection.
            # network_policy_name: "network-policy-name"
        # Enable the 'distribution_version' setting if the Amazon OpenSearch Service domain is of version Elasticsearch 6.x
        # distribution_version: "es6"
        # Enable and switch the 'enable_request_compression' flag if the default compression setting is changed in the domain. 
        # For more information, see Compressing HTTP requests in Amazon OpenSearch Service.
        # enable_request_compression: true/false
        # Optional: Enable the S3 DLQ to capture any failed requests in an S3 bucket. Delete this entire block if you don't want a DLQ.
        dlq:
          s3:
            # Provide an S3 bucket
            bucket: "your-dlq-bucket-name"
            # Provide a key path prefix for the failed requests
            # key_path_prefix: "kinesis-pipeline/logs/dlq"
            # Provide the Region of the bucket.
            region: "us-east-1"
            # Provide the Amazon Resource Name (ARN) for a role with access to the bucket. This role should have a trust relationship with osis-pipelines.amazonaws.com
            sts_role_arn: "arn:aws:iam::123456789012:role/Example-Role"
```

Key attributes in the Confluence source:

1. **hosts**: Your Confluence cloud or on-premises URL. Generally, it looks like `https://your-domain-name.atlassian.net/`.

1. **acknowledgments**: Guarantees delivery of data all the way to the sink.

1. **authentication**: Describes how you want the pipeline to access your Confluence instance. Choose `Basic` or `OAuth2` and specify the corresponding key attributes referencing the keys in your AWS secret.

1. **filter**: This section helps you select which portion of your Confluence data to extract and synchronize.

   1. **space**: List the space keys that you want to sync in the `include` section. Otherwise, list the spaces that you want to exclude under the `exclude` section. Provide only one of the include or exclude options at any given time.

   1. **page\_type**: Specific page types (like page, blogpost, or attachment) that you want to sync. Follow the same `include` or `exclude` pattern that suits your needs. Note that attachments appear as anchor links to the original attachment, but the attachment content isn't extracted.

## Data consistency
<a name="data-consistency"></a>

Based on the filters specified in the pipeline YAML, selected projects (or spaces) will be extracted once and fully synced to the target sink. Then continuous change monitoring will capture changes as they occur and update the data in the sink. One exception is that the change monitoring syncs only `create` and `update` actions, not `delete` actions.

## Limitations
<a name="limitations"></a>
+ User delete actions won't be synced. Data that's recorded in the sink remains in the sink. If ID mapping is specified in the sink settings, updates overwrite the existing content with new changes.
+ On-premises instances using older versions of Atlassian software that don't support the following APIs are not compatible with this source:
  + Jira Search API version 3
    + `rest/api/3/search`
    + `rest/api/3/issue`
  + Confluence
    + `wiki/rest/api/content/search`
    + `wiki/rest/api/content`
    + `wiki/rest/api/settings/systemInfo`

## Metrics in CloudWatch for Atlassian connectors
<a name="metrics"></a>

**Type: Jira connector metrics**


| Metric | Type | Description | 
| --- | --- | --- | 
| acknowledgementSetSuccesses.count | Counter | If acknowledgments are enabled, this metric provides the number of tickets successfully synced. | 
| acknowledgementSetFailures.count | Counter | If acknowledgments are enabled, this metric provides the number of tickets that failed to sync. | 
| crawlingTime.avg | Timer | The time it took to crawl through all the new changes. | 
| ticketFetchLatency.avg | Timer | The ticket fetch API latency average. | 
| ticketFetchLatency.max | Timer | The ticket fetch API latency maximum. | 
| ticketsRequested.count | Counter | Number of ticket fetch requests made. | 
| ticketRequestedFailed.count | Counter | Number of ticket fetch requests that failed. | 
| ticketRequestedSuccess.count | Counter | Number of ticket fetch requests that succeeded. | 
| searchCallLatency.avg | Timer | Search API call latency average. | 
| searchCallLatency.max | Timer | Search API call latency maximum. | 
| searchResultsFound.count | Counter | Number of items found in a given search call. | 
| searchRequestFailed.count | Counter | Search API call failures count. | 
| authFailures.count | Counter | Authentication failure count. | 

**Type: Confluence connector metrics**


| Metric | Type | Description | 
| --- | --- | --- | 
| acknowledgementSetSuccesses.count | Counter | If acknowledgments are enabled, this metric provides the number of pages successfully synced. | 
| acknowledgementSetFailures.count | Counter | If acknowledgments are enabled, this metric provides the number of pages that failed to sync. | 
| crawlingTime.avg | Timer | The time it took to crawl through all the new changes. | 
| pageFetchLatency.avg | Timer | Content fetching API latency (average). | 
| pageFetchLatency.max | Timer | Content fetching API latency (maximum). | 
| pagesRequested.count | Counter | Number of invocations of content fetching API. | 
| pageRequestFailed.count | Counter | Number of failed requests of content fetching API. | 
| pageRequestedSuccess.count | Counter | Number of successful requests of content fetching API. | 
| searchCallLatency.avg | Timer | Search API call latency average. | 
| searchCallLatency.max | Timer | Search API call latency max. | 
| searchResultsFound.count | Counter | Number of items found in a given search call. | 
| searchRequestsFailed.count | Counter | Search API call failures count. | 
| authFailures.count | Counter | Authentication failure count. | 

# Connecting an Amazon OpenSearch Ingestion pipeline to Atlassian Jira or Confluence using OAuth 2.0
<a name="configure-client-atlassian-OAuth2-setup"></a>

Use the information in this topic to help you configure and connect an Amazon OpenSearch Ingestion pipeline to a Jira or Confluence account using OAuth 2.0 authentication. Perform this task when you are completing the [Prerequisites](configure-client-atlassian.md#atlassian-prerequisites) for using an OpenSearch Ingestion pipeline with Atlassian services and choose not to use API key credentials.

**Topics**
+ [Create an OAuth 2.0 integration app](#create-OAuth2-integration-app)
+ [Generating and refreshing an Atlassian Developer access token](#generate-and-refresh-jira-access-token)

## Create an OAuth 2.0 integration app
<a name="create-OAuth2-integration-app"></a>

Use the following procedure to help you create an OAuth 2.0 integration app on the Atlassian Developer website.

**To create an OAuth 2.0 integration app**

1. Log in to your Atlassian Developer account at [https://developer.atlassian.com/console/myapps/](https://developer.atlassian.com/console/myapps/).

1. Choose **Create**, **OAuth 2.0 integration**.

1. For **Name**, enter a name to identify the purpose of the app.

1. Select the **I agree to be bound by Atlassian's developer terms** check box, and then choose **Create**.

1. In the left navigation, choose **Authorization**, and then choose **Add**.

1. For **Callback URL**, enter any URL, such as **https://www.amazon.com** or **https://www.example.com**, and then choose **Save changes**.

1. In the left navigation, choose **Permissions**. In the row for the Jira API, choose **Add**, and then choose **Configure**. Select all of the classic read scopes, and then choose **Save**.

1. Choose the **Granular scopes** tab, and then choose **Edit Scopes** to open the **Edit Jira API** dialog box.

1. Select the permissions for the source plugin that you're using:

------
#### [ Jira ]

   ```
   read:audit-log:jira
   read:issue:jira
   read:issue-meta:jira
   read:attachment:jira
   read:comment:jira
   read:comment.property:jira
   read:field:jira
   read:field.default-value:jira
   read:field.option:jira
   read:field-configuration-scheme:jira
   read:field-configuration:jira
   read:issue-link:jira
   read:issue-link-type:jira
   read:issue.remote-link:jira
   read:issue.property:jira
   read:resolution:jira
   read:issue-details:jira
   read:issue-type:jira
   read:issue-worklog:jira
   read:issue-field-values:jira
   read:issue.changelog:jira
   read:issue.transition:jira
   read:issue.vote:jira
   read:jira-expressions:jira
   ```

------
#### [ Confluence ]

   ```
   read:content:confluence
   read:content-details:confluence
   read:space-details:confluence
   read:audit-log:confluence
   read:page:confluence
   read:blogpost:confluence
   read:custom-content:confluence
   read:comment:confluence
   read:space:confluence
   read:space.property:confluence
   read:space.setting:confluence
   read:content.property:confluence
   read:content.metadata:confluence
   read:task:confluence
   read:whiteboard:confluence
   read:app-data:confluence
   manage:confluence-configuration
   ```

------

1. Choose **Save**.

For related information, see [Implementing OAuth 2.0 (3LO)](https://developer.atlassian.com/cloud/oauth/getting-started/implementing-oauth-3lo/) and [Determining the scopes required for an operation](https://developer.atlassian.com/cloud/oauth/getting-started/determining-scopes/) on the Atlassian Developer website.

## Generating and refreshing an Atlassian Developer access token
<a name="generate-and-refresh-jira-access-token"></a>

Use the following procedure to help you generate and refresh an Atlassian Developer access token on the Atlassian Developer website.

**To generate and refresh a Jira access token**

1. Log in to your Atlassian Developer account at [https://developer.atlassian.com/console/myapps/](https://developer.atlassian.com/console/myapps/).

1. Choose the app you created in [Create an OAuth 2.0 integration app](#create-OAuth2-integration-app).

1. In the left navigation, choose **Authorization**.

1. Copy the granular Atlassian API authorization URL value from the bottom of the page and paste it into the text editor of your choice.

   The format of the URL is as follows:

   ```
   https://auth.atlassian.com/authorize?
   audience=api.atlassian.com 
   &client_id=YOUR_CLIENT_ID
   &scope=REQUESTED_SCOPE%20REQUESTED_SCOPE_TWO
   &redirect_uri=https://YOUR_APP_CALLBACK_URL
   &state=YOUR_USER_BOUND_VALUE 
   &response_type=code
   &prompt=consent
   ```

1. For `state=YOUR_USER_BOUND_VALUE`, change the parameter value to anything you choose, such as state="**sample\_text**".

   For more information, see [What is the state parameter used for?](https://developer.atlassian.com/cloud/jira/platform/oauth-2-3lo-apps/#what-is-the-state-parameter-used-for-) on the Atlassian Developer website.

1. Note that the `scope` section lists the granular scopes you selected in an earlier task. For example: `scope=read%3Ajira-work%20read%3Ajira-user%20offline_access`

   `offline_access` indicates that you want to generate a `refresh_token`.

1. Open a web browser window and enter the authorization URL you copied into the browser window's address bar.

1. When the target page opens, verify that the information is correct, and then choose **Accept** to be redirected to your Jira or Confluence homepage.

1. After the homepage has loaded, copy the URL of this page. It contains the authorization code for your application. You use this code to generate your access token. The entire section after `code=` is the authorization code.

1. Use the following cURL command to generate the access token. Replace the *placeholder values* with your own information.
**Tip**  
You can also use a third-party service such as Postman.

   ```
   curl --request POST --url 'https://auth.atlassian.com/oauth/token' \
   --header 'Content-Type: application/json' \
   --data '{"grant_type": "authorization_code",
   "client_id": "YOUR_CLIENT_ID",
   "client_secret": "YOUR_CLIENT_SECRET",
   "code": "AUTHORIZATION_CODE",
   "redirect_uri": "YOUR_CALLBACK_URL"}'
   ```

   The response to this command includes the values for `access_token` and `refresh_token`.
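   Access tokens expire, so the `refresh_token` is later exchanged for a new token pair. A hedged sketch of that call, following the same pattern as the command above per the Atlassian OAuth 2.0 (3LO) refresh flow:

   ```
   curl --request POST --url 'https://auth.atlassian.com/oauth/token' \
   --header 'Content-Type: application/json' \
   --data '{"grant_type": "refresh_token",
   "client_id": "YOUR_CLIENT_ID",
   "client_secret": "YOUR_CLIENT_SECRET",
   "refresh_token": "YOUR_REFRESH_TOKEN"}'
   ```

   Note that OpenSearch Ingestion performs this refresh for you when you store the OAuth2 keys in your AWS secret and grant the pipeline write access to that secret.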

# Using an OpenSearch Ingestion pipeline with Amazon Aurora
<a name="configure-client-aurora"></a>

You can use an OpenSearch Ingestion pipeline with Amazon Aurora to export existing data and stream changes (such as create, update, and delete) to Amazon OpenSearch Service domains and collections. The OpenSearch Ingestion pipeline incorporates change data capture (CDC) infrastructure to provide a high-scale, low-latency way to continuously stream data from Amazon Aurora. Aurora MySQL and Aurora PostgreSQL are supported.

There are two ways that you can use Amazon Aurora as a source to process data: with or without a full initial snapshot. A full initial snapshot is a snapshot of the specified tables that is exported to Amazon S3. From there, an OpenSearch Ingestion pipeline sends it to one index in a domain, or partitions it across multiple indexes in a domain. To keep the data in Amazon Aurora and OpenSearch consistent, the pipeline syncs all of the create, update, and delete events in the Amazon Aurora tables with the documents saved in the OpenSearch index or indexes.

When you use a full initial snapshot, your OpenSearch Ingestion pipeline first ingests the snapshot and then starts reading data from Amazon Aurora change streams. It eventually catches up and maintains near real-time data consistency between Amazon Aurora and OpenSearch. 

You can also use the OpenSearch Ingestion integration with Amazon Aurora to track change data capture and ingest all updates in Aurora to OpenSearch. Choose this option if you already have a full snapshot from some other mechanism, or if you just want to capture all changes to the data in the Amazon Aurora cluster.

When you choose this option, you need to [configure binary logging for Aurora MySQL](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/USER_LogAccess.MySQL.BinaryFormat.html) or [set up logical replication for Aurora PostgreSQL on the cluster](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraPostgreSQL.Replication.Logical.Configure.html). 

**Topics**
+ [Aurora MySQL](aurora-mysql.md)
+ [Aurora PostgreSQL](aurora-PostgreSQL.md)

# Aurora MySQL
<a name="aurora-mysql"></a>

Complete the following steps to configure an OpenSearch Ingestion pipeline with Amazon Aurora for Aurora MySQL.

**Topics**
+ [Aurora MySQL prerequisites](#aurora-mysql-prereqs)
+ [Step 1: Configure the pipeline role](#aurora-mysql-pipeline-role)
+ [Step 2: Create the pipeline](#aurora-mysql-pipeline)
+ [Data consistency](#aurora-mysql-pipeline-consistency)
+ [Mapping data types](#aurora-mysql-pipeline-mapping)
+ [Limitations](#aurora-mysql-pipeline-limitations)
+ [Recommended CloudWatch Alarms](#aurora-mysql-pipeline-metrics)

## Aurora MySQL prerequisites
<a name="aurora-mysql-prereqs"></a>

Before you create your OpenSearch Ingestion pipeline, perform the following steps:

1. [Create a custom Aurora DB cluster parameter group in Amazon Aurora to configure binary logging](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/zero-etl.setting-up.html#zero-etl.parameters).

   ```
   aurora_enhanced_binlog=1
   binlog_backup=0
   binlog_format=ROW
   binlog_replication_globaldb=0
   binlog_row_image=full
   binlog_row_metadata=full
   ```

   Additionally, make sure the `binlog_transaction_compression` parameter is not set to `ON`, and that the `binlog_row_value_options` parameter is not set to `PARTIAL_JSON`.

1. [Select or create an Aurora MySQL DB cluster](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/CHAP_GettingStartedAurora.CreatingConnecting.Aurora.html) and associate the parameter group created in the previous step with the DB cluster.

1. [Configure binary log retention to 24 hours or longer](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/mysql-stored-proc-configuring.html). 
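   For example, assuming you can connect to the cluster with a MySQL client as an admin user, you can set the retention period with the `mysql.rds_set_configuration` stored procedure:

   ```
   -- Retain binary logs for 24 hours so the pipeline can replay changes
   CALL mysql.rds_set_configuration('binlog retention hours', 24);
   -- Verify the current setting
   CALL mysql.rds_show_configuration;
   ```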

1. Set up username and password authentication on your Amazon Aurora cluster using [password management with Aurora and AWS Secrets Manager](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/rds-secrets-manager.html). You can also create a username/password combination by [creating a Secrets Manager secret](https://docs.aws.amazon.com/secretsmanager/latest/userguide/create_secret.html).

1. If you use the full initial snapshot feature, create an AWS KMS key and an IAM role for exporting data from Amazon Aurora to Amazon S3.

   The IAM role should have the following permission policy:

------
#### [ JSON ]

****  

   ```
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Sid": "ExportPolicy",
               "Effect": "Allow",
               "Action": [
                   "s3:PutObject*",
                   "s3:ListBucket",
                   "s3:GetObject*",
                   "s3:DeleteObject*",
                   "s3:GetBucketLocation"
               ],
               "Resource": [
                   "arn:aws:s3:::s3-bucket-used-in-pipeline",
                   "arn:aws:s3:::s3-bucket-used-in-pipeline/*"
               ]
           }
       ]
   }
   ```

------

   The role should also have the following trust relationships:

------
#### [ JSON ]

****  

   ```
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Effect": "Allow",
               "Principal": {
                   "Service": "export.rds.amazonaws.com"
               },
               "Action": "sts:AssumeRole"
           }
       ]
   }
   ```

------

1. Select or create an OpenSearch Service domain or OpenSearch Serverless collection. For more information, see [Creating OpenSearch Service domains](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/createupdatedomains.html#createdomains) and [Creating collections](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-manage.html#serverless-create).

1. Attach a [resource-based policy](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ac.html#ac-types-resource) to your domain or a [data access policy](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-data-access.html) to your collection. These access policies allow OpenSearch Ingestion to write data from your Amazon Aurora DB cluster to your domain or collection.
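   As a sketch, a domain resource-based policy that grants the pipeline role this access might look like the following, where the role and domain ARNs are placeholders:

   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Principal": {
           "AWS": "arn:aws:iam::111122223333:role/pipeline-role"
         },
         "Action": "es:DescribeDomain",
         "Resource": "arn:aws:es:us-east-2:111122223333:domain/domain-name"
       },
       {
         "Effect": "Allow",
         "Principal": {
           "AWS": "arn:aws:iam::111122223333:role/pipeline-role"
         },
         "Action": "es:ESHttp*",
         "Resource": "arn:aws:es:us-east-2:111122223333:domain/domain-name/*"
       }
     ]
   }
   ```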

## Step 1: Configure the pipeline role
<a name="aurora-mysql-pipeline-role"></a>

After you have your Amazon Aurora pipeline prerequisites set up, [configure the pipeline role](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/pipeline-security-overview.html#pipeline-security-sink) to use in your pipeline configuration. Also add the following permissions for the Amazon Aurora source to the role:

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "allowReadingFromS3Buckets",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:DeleteObject",
                "s3:GetBucketLocation",
                "s3:ListBucket",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::s3_bucket",
                "arn:aws:s3:::s3_bucket/*"
            ]
        },
        {
            "Sid": "allowNetworkInterfacesActions",
            "Effect": "Allow",
            "Action": [
                "ec2:AttachNetworkInterface",
                "ec2:CreateNetworkInterface",
                "ec2:CreateNetworkInterfacePermission",
                "ec2:DeleteNetworkInterface",
                "ec2:DeleteNetworkInterfacePermission",
                "ec2:DetachNetworkInterface",
                "ec2:DescribeNetworkInterfaces"
            ],
            "Resource": [
                "arn:aws:ec2:*:111122223333:network-interface/*",
                "arn:aws:ec2:*:111122223333:subnet/*",
                "arn:aws:ec2:*:111122223333:security-group/*"
            ]
        },
        {
            "Sid": "allowDescribeEC2",
            "Effect": "Allow",
            "Action": [
                "ec2:Describe*"
            ],
            "Resource": "*"
        },
        {
            "Sid": "allowTagCreation",
            "Effect": "Allow",
            "Action": [
                "ec2:CreateTags"
            ],
            "Resource": "arn:aws:ec2:*:111122223333:network-interface/*",
            "Condition": {
                "StringEquals": {
                    "aws:RequestTag/OSISManaged": "true"
                }
            }
        },
        {
            "Sid": "AllowDescribeInstances",
            "Effect": "Allow",
            "Action": [
                "rds:DescribeDBInstances"
            ],
            "Resource": [
                "arn:aws:rds:us-east-2:111122223333:db:*"
            ]
        },
        {
            "Sid": "AllowDescribeClusters",
            "Effect": "Allow",
            "Action": [
                "rds:DescribeDBClusters"
            ],
            "Resource": [
                "arn:aws:rds:us-east-2:111122223333:cluster:DB-id"
            ]
        },
        {
            "Sid": "AllowSnapshots",
            "Effect": "Allow",
            "Action": [
                "rds:DescribeDBClusterSnapshots",
                "rds:CreateDBClusterSnapshot",
                "rds:AddTagsToResource"
            ],
            "Resource": [
                "arn:aws:rds:us-east-2:111122223333:cluster:DB-id",
                "arn:aws:rds:us-east-2:111122223333:cluster-snapshot:DB-id*"
            ]
        },
        {
            "Sid": "AllowExport",
            "Effect": "Allow",
            "Action": [
                "rds:StartExportTask"
            ],
            "Resource": [
                "arn:aws:rds:us-east-2:111122223333:cluster:DB-id",
                "arn:aws:rds:us-east-2:111122223333:cluster-snapshot:DB-id*"
            ]
        },
        {
            "Sid": "AllowDescribeExports",
            "Effect": "Allow",
            "Action": [
                "rds:DescribeExportTasks"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:RequestedRegion": "us-east-2",
                    "aws:ResourceAccount": "111122223333"
                }
            }
        },
        {
            "Sid": "AllowAccessToKmsForExport",
            "Effect": "Allow",
            "Action": [
                "kms:Decrypt",
                "kms:Encrypt",
                "kms:DescribeKey",
                "kms:RetireGrant",
                "kms:CreateGrant",
                "kms:ReEncrypt*",
                "kms:GenerateDataKey*"
            ],
            "Resource": [
                "arn:aws:kms:us-east-2:111122223333:key/export-key-id"
            ]
        },
        {
            "Sid": "AllowPassingExportRole",
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": [
                "arn:aws:iam::111122223333:role/export-role"
            ]
        },
        {
            "Sid": "SecretsManagerReadAccess",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetSecretValue"
            ],
            "Resource": [
                "arn:aws:secretsmanager:*:111122223333:secret:*"
            ]
        }
    ]
}
```

------

## Step 2: Create the pipeline
<a name="aurora-mysql-pipeline"></a>

Configure an OpenSearch Ingestion pipeline similar to the following. The example pipeline specifies an Amazon Aurora cluster as the source. 

```
version: "2"
aurora-mysql-pipeline:
  source:
    rds:
      db_identifier: "cluster-id"
      engine: aurora-mysql
      database: "database-name"
      tables:
        include:
          - "table1"
          - "table2"
      s3_bucket: "bucket-name"
      s3_region: "bucket-region"
      s3_prefix: "prefix-name"
      export:
        kms_key_id: "kms-key-id"
        iam_role_arn: "export-role-arn"
      stream: true
      aws:
        sts_role_arn: "arn:aws:iam::account-id:role/pipeline-role"
        region: "us-east-1"
      authentication:
        username: ${{aws_secrets:secret:username}}
        password: ${{aws_secrets:secret:password}}
  sink:
    - opensearch:
        hosts: ["https://search-mydomain.us-east-1.es.amazonaws.com"]
        index: "${getMetadata(\"table_name\")}"
        index_type: custom
        document_id: "${getMetadata(\"primary_key\")}"
        action: "${getMetadata(\"opensearch_action\")}"
        document_version: "${getMetadata(\"document_version\")}"
        document_version_type: "external"
        aws:
          sts_role_arn: "arn:aws:iam::account-id:role/pipeline-role"
          region: "us-east-1"
extension:
  aws:
    secrets:
      secret:
        secret_id: "rds-secret-id"
        region: "us-east-1"
        sts_role_arn: "arn:aws:iam::account-id:role/pipeline-role"
        refresh_interval: PT1H
```

You can use a preconfigured Amazon Aurora blueprint to create this pipeline. For more information, see [Working with blueprints](pipeline-blueprint.md).

To use Amazon Aurora as a source, you need to configure VPC access for the pipeline. The VPC you choose should be the same VPC your Amazon Aurora source uses. Then choose one or more subnets and one or more VPC security groups. Note that the pipeline needs network access to an Aurora MySQL database, so you should also verify that your Aurora cluster is configured with a VPC security group that allows inbound traffic from the pipeline's VPC security group to the database port. For more information, see [Controlling access with security groups](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Overview.RDSSecurityGroups.html).

If you're using the AWS Management Console to create your pipeline, you must also attach your pipeline to your VPC in order to use Amazon Aurora as a source. To do so, find the **Network configuration** section, select the **Attach to VPC** checkbox, and choose your CIDR from one of the provided default options, or select your own. You can use any CIDR from a private address space as defined in the [RFC 1918 Best Current Practice](https://datatracker.ietf.org/doc/html/rfc1918).

To provide a custom CIDR, select **Other** from the dropdown menu. To avoid a collision in IP addresses between OpenSearch Ingestion and Amazon Aurora, ensure that the Amazon Aurora VPC CIDR is different from the CIDR for OpenSearch Ingestion.

For more information, see [Configuring VPC access for a pipeline](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/pipeline-security.html#pipeline-vpc-configure).

## Data consistency
<a name="aurora-mysql-pipeline-consistency"></a>

The pipeline ensures data consistency by continuously polling or receiving changes from the Amazon Aurora cluster and updating the corresponding documents in the OpenSearch index.

OpenSearch Ingestion supports end-to-end acknowledgement to ensure data durability. When a pipeline reads snapshots or streams, it dynamically creates partitions for parallel processing. The pipeline marks a partition as complete when it receives an acknowledgement after ingesting all records in the OpenSearch domain or collection. If you want to ingest into an OpenSearch Serverless search collection, you can generate a document ID in the pipeline. If you want to ingest into an OpenSearch Serverless time series collection, note that the pipeline doesn't generate a document ID, so you must omit `document_id: "${getMetadata(\"primary_key\")}"` in your pipeline sink configuration. 

An OpenSearch Ingestion pipeline also maps incoming event actions into corresponding bulk indexing actions to help ingest documents. This keeps data consistent, so that every data change in Amazon Aurora is reconciled with the corresponding document changes in OpenSearch.

## Mapping data types
<a name="aurora-mysql-pipeline-mapping"></a>

An OpenSearch Ingestion pipeline maps MySQL data types to representations that are suitable for OpenSearch Service domains or collections to consume. If no mapping template is defined in OpenSearch, OpenSearch automatically determines field types with [dynamic mapping](https://opensearch.org/docs/latest/field-types/#dynamic-mapping) based on the first document sent. You can also explicitly define the field types that work best for you in OpenSearch through a mapping template. 

The table below lists MySQL data types and corresponding OpenSearch field types. The *Default OpenSearch Field Type* column shows the corresponding field type in OpenSearch if no explicit mapping is defined. In this case, OpenSearch automatically determines field types with dynamic mapping. The *Recommended OpenSearch Field Type* column is the corresponding field type that is recommended to explicitly specify in a mapping template. These field types are more closely aligned with the data types in MySQL and can usually enable better search features available in OpenSearch.


| MySQL Data Type | Default OpenSearch Field Type | Recommended OpenSearch Field Type | 
| --- | --- | --- | 
| BIGINT | long | long | 
| BIGINT UNSIGNED | long | unsigned long | 
| BIT | long | byte, short, integer, or long depending on number of bits | 
| DECIMAL | text | double or keyword | 
| DOUBLE | float | double | 
| FLOAT | float | float | 
| INT | long | integer | 
| INT UNSIGNED | long | long | 
| MEDIUMINT | long | integer | 
| MEDIUMINT UNSIGNED | long | integer | 
| NUMERIC | text | double or keyword | 
| SMALLINT | long | short | 
| SMALLINT UNSIGNED | long | integer | 
| TINYINT | long | byte | 
| TINYINT UNSIGNED | long | short | 
| BINARY | text | binary | 
| BLOB | text | binary | 
| CHAR | text | text | 
| ENUM | text | keyword | 
| LONGBLOB | text | binary | 
| LONGTEXT | text | text | 
| MEDIUMBLOB | text | binary | 
| MEDIUMTEXT | text | text | 
| SET | text | keyword | 
| TEXT | text | text | 
| TINYBLOB | text | binary | 
| TINYTEXT | text | text | 
| VARBINARY | text | binary | 
| VARCHAR | text | text | 
| DATE | long (in epoch milliseconds) | date | 
| DATETIME | long (in epoch milliseconds) | date | 
| TIME | long (in epoch milliseconds) | date | 
| TIMESTAMP | long (in epoch milliseconds) | date | 
| YEAR | long (in epoch milliseconds) | date | 
| GEOMETRY | text (in WKT format) | geo_shape | 
| GEOMETRYCOLLECTION | text (in WKT format) | geo_shape | 
| LINESTRING | text (in WKT format) | geo_shape | 
| MULTILINESTRING | text (in WKT format) | geo_shape | 
| MULTIPOINT | text (in WKT format) | geo_shape | 
| MULTIPOLYGON | text (in WKT format) | geo_shape | 
| POINT | text (in WKT format) | geo_point or geo_shape | 
| POLYGON | text (in WKT format) | geo_shape | 
| JSON | text | object | 

We recommend that you configure a dead-letter queue (DLQ) in your OpenSearch Ingestion pipeline. If you've configured the queue, OpenSearch Service sends to it all documents that fail to ingest due to dynamic mapping failures.
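A DLQ is attached to the `opensearch` sink in the pipeline YAML. The following is a minimal sketch; the bucket name, prefix, Region, and role ARN are placeholders for this example:

```
sink:
  - opensearch:
      hosts: ["https://search-mydomain.us-east-1.es.amazonaws.com"]
      index: "my-index"
      dlq:
        s3:
          # Failed documents are written to this bucket as JSON objects
          bucket: "amzn-s3-demo-bucket"
          key_path_prefix: "dlq-files/"
          region: "us-east-1"
          sts_role_arn: "arn:aws:iam::111122223333:role/pipeline-role"
```

The pipeline role must be able to write objects to the DLQ bucket.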

If automatic mappings fail, you can use `template_type` and `template_content` in your pipeline configuration to define explicit mapping rules. Alternatively, you can create mapping templates directly in your search domain or collection before you start the pipeline.
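For example, a sink can carry an inline index template that pins the recommended field types from the table above. The index name and column names in this sketch are hypothetical:

```
sink:
  - opensearch:
      hosts: ["https://search-mydomain.us-east-1.es.amazonaws.com"]
      index: "orders"
      template_type: index-template
      template_content: |
        {
          "template": {
            "mappings": {
              "properties": {
                "price":      { "type": "double" },
                "created_at": { "type": "date" },
                "region":     { "type": "geo_shape" }
              }
            }
          }
        }
```

With this template in place, a MySQL DECIMAL column such as `price` is indexed as `double` instead of the dynamically mapped `text`.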

## Limitations
<a name="aurora-mysql-pipeline-limitations"></a>

Consider the following limitations when you set up an OpenSearch Ingestion pipeline for Aurora MySQL:
+ The integration supports only one MySQL database per pipeline.
+ The integration does not currently support cross-Region data ingestion; your Amazon Aurora cluster and OpenSearch domain must be in the same AWS Region.
+ The integration does not currently support cross-account data ingestion; your Amazon Aurora cluster and OpenSearch Ingestion pipeline must be in the same AWS account.
+ Ensure that the Amazon Aurora cluster has authentication enabled through AWS Secrets Manager, which is the only supported authentication mechanism.
+ You can't update an existing pipeline configuration to ingest data from a different database or table. To update the database or table name of a pipeline, stop the pipeline and restart it with an updated configuration, or create a new pipeline.
+ Data Definition Language (DDL) statements are generally not supported. Data consistency will not be maintained if:
  + Primary keys are changed (add/delete/rename).
  + Tables are dropped/truncated.
  + Column names or data types are changed.
+ If the MySQL tables to sync don't have primary keys defined, data consistency isn't guaranteed. You need to explicitly define the `document_id` option in the OpenSearch sink configuration to be able to sync updates and deletes to OpenSearch.
+ Foreign key references with cascading delete actions are not supported and can result in data inconsistency between Aurora MySQL and OpenSearch.
+ Supported versions: Aurora MySQL version 3.05.2 and higher.

## Recommended CloudWatch Alarms
<a name="aurora-mysql-pipeline-metrics"></a>

The following CloudWatch metrics are recommended for monitoring the performance of your ingestion pipeline. These metrics can help you identify the amount of data processed from exports, the number of events processed from streams, the errors in processing exports and stream events, and the number of documents written to the destination. You can set up CloudWatch alarms to perform an action when one of these metrics exceeds a specified value for a specified amount of time.


| Metric | Description | 
| --- | --- | 
| pipeline-name.rds.credentialsChanged | This metric indicates how often AWS secrets are rotated. | 
| pipeline-name.rds.executorRefreshErrors | This metric indicates failures to refresh AWS secrets. | 
| pipeline-name.rds.exportRecordsTotal | This metric indicates the number of records exported from Amazon Aurora. | 
| pipeline-name.rds.exportRecordsProcessed | This metric indicates the number of records processed by the OpenSearch Ingestion pipeline. | 
| pipeline-name.rds.exportRecordProcessingErrors | This metric indicates the number of processing errors in an OpenSearch Ingestion pipeline while reading the data from an Amazon Aurora cluster. | 
| pipeline-name.rds.exportRecordsSuccessTotal | This metric indicates the total number of export records processed successfully. | 
| pipeline-name.rds.exportRecordsFailedTotal | This metric indicates the total number of export records that failed to process. | 
| pipeline-name.rds.bytesReceived | This metric indicates the total number of bytes received by an OpenSearch Ingestion pipeline. | 
| pipeline-name.rds.bytesProcessed | This metric indicates the total number of bytes processed by an OpenSearch Ingestion pipeline. | 
| pipeline-name.rds.streamRecordsSuccessTotal | This metric indicates the number of records successfully processed from the stream. | 
| pipeline-name.rds.streamRecordsFailedTotal | This metric indicates the total number of records that failed to process from the stream. | 
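The metrics above can back concrete alarms. The following CloudFormation sketch alarms on failed stream records; the pipeline name, threshold, and SNS topic are placeholders, and it assumes these pipeline metrics are published in the `AWS/OSIS` namespace (verify the namespace and full metric name for your pipeline in the CloudWatch console):

```
Resources:
  StreamFailuresAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: my-pipeline-stream-failures
      Namespace: AWS/OSIS
      # Metric names are prefixed with the pipeline name
      MetricName: my-pipeline.rds.streamRecordsFailedTotal
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 1
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      AlarmActions:
        - arn:aws:sns:us-east-1:111122223333:alerts-topic
```

A `Sum` over five minutes greater than zero means at least one stream record failed to process during that window.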

# Aurora PostgreSQL
<a name="aurora-PostgreSQL"></a>

Complete the following steps to configure an OpenSearch Ingestion pipeline with Amazon Aurora for Aurora PostgreSQL.

**Topics**
+ [Aurora PostgreSQL prerequisites](#aurora-PostgreSQL-prereqs)
+ [Step 1: Configure the pipeline role](#aurora-mysql-pipeline-role)
+ [Step 2: Create the pipeline](#aurora-PostgreSQL-pipeline)
+ [Data consistency](#aurora-mysql-pipeline-consistency)
+ [Mapping data types](#aurora-PostgreSQL-pipeline-mapping)
+ [Limitations](#aurora-PostgreSQL-pipeline-limitations)
+ [Recommended CloudWatch Alarms](#aurora-mysql-pipeline-metrics)

## Aurora PostgreSQL prerequisites
<a name="aurora-PostgreSQL-prereqs"></a>

Before you create your OpenSearch Ingestion pipeline, perform the following steps:

1. [Create a custom DB cluster parameter group](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/CHAP_GettingStartedAurora.CreatingConnecting.Aurora.html) in Amazon Aurora to configure logical replication.

   ```
   rds.logical_replication=1
   aurora.enhanced_logical_replication=1
   aurora.logical_replication_backup=0
   aurora.logical_replication_globaldb=0
   ```

1. [Select or create an Aurora PostgreSQL DB cluster](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/CHAP_GettingStartedAurora.CreatingConnecting.Aurora.html) and associate the parameter group created in step 1 with the DB cluster.

1. Set up username and password authentication on your Amazon Aurora cluster using [password management with Aurora and AWS Secrets Manager](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/rds-secrets-manager.html). You can also create a username/password combination by [creating a Secrets Manager secret](https://docs.aws.amazon.com/secretsmanager/latest/userguide/create_secret.html).

1. If you use the full initial snapshot feature, create an AWS KMS key and an IAM role for exporting data from Amazon Aurora to Amazon S3.

   The IAM role should have the following permission policy:

------
#### [ JSON ]

****  

   ```
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Sid": "ExportPolicy",
               "Effect": "Allow",
               "Action": [
                   "s3:PutObject*",
                   "s3:ListBucket",
                   "s3:GetObject*",
                   "s3:DeleteObject*",
                   "s3:GetBucketLocation"
               ],
               "Resource": [
                   "arn:aws:s3:::s3-bucket-used-in-pipeline",
                   "arn:aws:s3:::s3-bucket-used-in-pipeline/*"
               ]
           }
       ]
   }
   ```

------

   The role should also have the following trust relationship:

------
#### [ JSON ]

****  

   ```
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Effect": "Allow",
               "Principal": {
                   "Service": "export.rds.amazonaws.com"
               },
               "Action": "sts:AssumeRole"
           }
       ]
   }
   ```

------

1. Select or create an OpenSearch Service domain or OpenSearch Serverless collection. For more information, see [Creating OpenSearch Service domains](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/createupdatedomains.html#createdomains) and [Creating collections](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-manage.html#serverless-create).

1. Attach a [resource-based policy](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ac.html#ac-types-resource) to your domain or a [data access policy](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-data-access.html) to your collection. These access policies allow OpenSearch Ingestion to write data from your Amazon Aurora DB cluster to your domain or collection.
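The custom DB cluster parameter group from step 1 can also be sketched in CloudFormation. The resource name, description, and parameter group family below are assumptions for this example (pick the family that matches your Aurora PostgreSQL version):

```
Resources:
  LogicalReplicationParams:
    Type: AWS::RDS::DBClusterParameterGroup
    Properties:
      Description: Enable logical replication for OpenSearch Ingestion
      Family: aurora-postgresql16
      Parameters:
        rds.logical_replication: 1
        aurora.enhanced_logical_replication: 1
        aurora.logical_replication_backup: 0
        aurora.logical_replication_globaldb: 0
```

After you associate the parameter group with the cluster, reboot the writer instance so that the static `rds.logical_replication` parameter takes effect.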

## Step 1: Configure the pipeline role
<a name="aurora-mysql-pipeline-role"></a>

After you have your Amazon Aurora pipeline prerequisites set up, [configure the pipeline role](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/pipeline-security-overview.html#pipeline-security-sink) to use in your pipeline configuration. Also add the following permissions for Amazon Aurora source to the role:

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "allowReadingFromS3Buckets",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:DeleteObject",
                "s3:GetBucketLocation",
                "s3:ListBucket",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::s3_bucket",
                "arn:aws:s3:::s3_bucket/*"
            ]
        },
        {
            "Sid": "allowNetworkInterfacesActions",
            "Effect": "Allow",
            "Action": [
                "ec2:AttachNetworkInterface",
                "ec2:CreateNetworkInterface",
                "ec2:CreateNetworkInterfacePermission",
                "ec2:DeleteNetworkInterface",
                "ec2:DeleteNetworkInterfacePermission",
                "ec2:DetachNetworkInterface",
                "ec2:DescribeNetworkInterfaces"
            ],
            "Resource": [
                "arn:aws:ec2:*:111122223333:network-interface/*",
                "arn:aws:ec2:*:111122223333:subnet/*",
                "arn:aws:ec2:*:111122223333:security-group/*"
            ]
        },
        {
            "Sid": "allowDescribeEC2",
            "Effect": "Allow",
            "Action": [
                "ec2:Describe*"
            ],
            "Resource": "*"
        },
        {
            "Sid": "allowTagCreation",
            "Effect": "Allow",
            "Action": [
                "ec2:CreateTags"
            ],
            "Resource": "arn:aws:ec2:*:111122223333:network-interface/*",
            "Condition": {
                "StringEquals": {
                    "aws:RequestTag/OSISManaged": "true"
                }
            }
        },
        {
            "Sid": "AllowDescribeInstances",
            "Effect": "Allow",
            "Action": [
                "rds:DescribeDBInstances"
            ],
            "Resource": [
                "arn:aws:rds:us-east-2:111122223333:db:*"
            ]
        },
        {
            "Sid": "AllowDescribeClusters",
            "Effect": "Allow",
            "Action": [
                "rds:DescribeDBClusters"
            ],
            "Resource": [
                "arn:aws:rds:us-east-2:111122223333:cluster:DB-id"
            ]
        },
        {
            "Sid": "AllowSnapshots",
            "Effect": "Allow",
            "Action": [
                "rds:DescribeDBClusterSnapshots",
                "rds:CreateDBClusterSnapshot",
                "rds:AddTagsToResource"
            ],
            "Resource": [
                "arn:aws:rds:us-east-2:111122223333:cluster:DB-id",
                "arn:aws:rds:us-east-2:111122223333:cluster-snapshot:DB-id*"
            ]
        },
        {
            "Sid": "AllowExport",
            "Effect": "Allow",
            "Action": [
                "rds:StartExportTask"
            ],
            "Resource": [
                "arn:aws:rds:us-east-2:111122223333:cluster:DB-id",
                "arn:aws:rds:us-east-2:111122223333:cluster-snapshot:DB-id*"
            ]
        },
        {
            "Sid": "AllowDescribeExports",
            "Effect": "Allow",
            "Action": [
                "rds:DescribeExportTasks"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:RequestedRegion": "us-east-2",
                    "aws:ResourceAccount": "111122223333"
                }
            }
        },
        {
            "Sid": "AllowAccessToKmsForExport",
            "Effect": "Allow",
            "Action": [
                "kms:Decrypt",
                "kms:Encrypt",
                "kms:DescribeKey",
                "kms:RetireGrant",
                "kms:CreateGrant",
                "kms:ReEncrypt*",
                "kms:GenerateDataKey*"
            ],
            "Resource": [
                "arn:aws:kms:us-east-2:111122223333:key/export-key-id"
            ]
        },
        {
            "Sid": "AllowPassingExportRole",
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": [
                "arn:aws:iam::111122223333:role/export-role"
            ]
        },
        {
            "Sid": "SecretsManagerReadAccess",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetSecretValue"
            ],
            "Resource": [
                "arn:aws:secretsmanager:*:111122223333:secret:*"
            ]
        }
    ]
}
```

------

## Step 2: Create the pipeline
<a name="aurora-PostgreSQL-pipeline"></a>

Configure an OpenSearch Ingestion pipeline like the following, which specifies an Aurora PostgreSQL cluster as the source.

```
version: "2"
aurora-postgres-pipeline:
  source:
    rds:
      db_identifier: "cluster-id"
      engine: aurora-postgresql
      database: "database-name"
      tables:
        include:
          - "schema1.table1"
          - "schema2.table2"
      s3_bucket: "bucket-name"
      s3_region: "bucket-region"
      s3_prefix: "prefix-name"
      export:
        kms_key_id: "kms-key-id"
        iam_role_arn: "export-role-arn"
      stream: true
      aws:
        sts_role_arn: "arn:aws:iam::account-id:role/pipeline-role"
        region: "us-east-1"
      authentication:
        username: ${{aws_secrets:secret:username}}
        password: ${{aws_secrets:secret:password}}
  sink:
    - opensearch:
        hosts: ["https://search-mydomain.us-east-1.es.amazonaws.com"]
        index: "${getMetadata(\"table_name\")}"
        index_type: custom
        document_id: "${getMetadata(\"primary_key\")}"
        action: "${getMetadata(\"opensearch_action\")}"
        document_version: "${getMetadata(\"document_version\")}"
        document_version_type: "external"
        aws:
          sts_role_arn: "arn:aws:iam::account-id:role/pipeline-role"
          region: "us-east-1"
extension:
  aws:
    secrets:
      secret:
        secret_id: "rds-secret-id"
        region: "us-east-1"
        sts_role_arn: "arn:aws:iam::account-id:role/pipeline-role"
        refresh_interval: PT1H
```

**Note**  
You can use a preconfigured Amazon Aurora blueprint to create this pipeline. For more information, see [Working with blueprints](pipeline-blueprint.md).

To use Amazon Aurora as a source, you need to configure VPC access for the pipeline. The VPC you choose should be the same VPC your Amazon Aurora source uses. Then choose one or more subnets and one or more VPC security groups. Note that the pipeline needs network access to the Aurora PostgreSQL database, so you should also verify that your Aurora cluster is configured with a VPC security group that allows inbound traffic from the pipeline's VPC security group to the database port. For more information, see [Controlling access with security groups](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Overview.RDSSecurityGroups.html).

If you're using the AWS Management Console to create your pipeline, you must also attach your pipeline to your VPC in order to use Amazon Aurora as a source. To do this, find the **Network configuration** section, choose **Attach to VPC**, and choose your CIDR from one of the provided default options, or select your own. You can use any CIDR from a private address space as defined in the [RFC 1918 Best Current Practice](https://datatracker.ietf.org/doc/html/rfc1918).

To provide a custom CIDR, select Other from the dropdown menu. To avoid a collision in IP addresses between OpenSearch Ingestion and Amazon Aurora, ensure that the Amazon Aurora VPC CIDR is different from the CIDR for OpenSearch Ingestion.

For more information, see [Configuring VPC access for a pipeline](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/pipeline-security.html#pipeline-vpc-configure).
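If you create the pipeline programmatically instead of in the console, the VPC attachment can be sketched with the `VpcOptions` property of an `AWS::OSIS::Pipeline` CloudFormation resource. The subnet ID, security group ID, and capacity values below are placeholders:

```
Resources:
  AuroraPipeline:
    Type: AWS::OSIS::Pipeline
    Properties:
      PipelineName: aurora-postgres-pipeline
      MinUnits: 1
      MaxUnits: 4
      PipelineConfigurationBody: |
        # Pipeline YAML from Step 2 goes here
        version: "2"
      VpcOptions:
        SubnetIds:
          - subnet-0123456789abcdef0
        SecurityGroupIds:
          - sg-0123456789abcdef0
```

The subnets and security group should belong to the same VPC as your Aurora cluster.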

## Data consistency
<a name="aurora-mysql-pipeline-consistency"></a>

The pipeline ensures data consistency by continuously polling or receiving changes from the Amazon Aurora cluster and updating the corresponding documents in the OpenSearch index.

OpenSearch Ingestion supports end-to-end acknowledgement to ensure data durability. When a pipeline reads snapshots or streams, it dynamically creates partitions for parallel processing. The pipeline marks a partition as complete when it receives an acknowledgement after ingesting all records into the OpenSearch domain or collection. If you want to ingest into an OpenSearch Serverless search collection, you can generate a document ID in the pipeline. If you want to ingest into an OpenSearch Serverless time series collection, note that the pipeline doesn't generate a document ID, so you must omit `document_id: "${getMetadata(\"primary_key\")}"` in your pipeline sink configuration.

An OpenSearch Ingestion pipeline also maps incoming event actions into corresponding bulk indexing actions to help ingest documents. This keeps data consistent, so that every data change in Amazon Aurora is reconciled with the corresponding document changes in OpenSearch.

## Mapping data types
<a name="aurora-PostgreSQL-pipeline-mapping"></a>

An OpenSearch Ingestion pipeline maps Aurora PostgreSQL data types to representations that OpenSearch Service domains or collections can consume. If no mapping template is defined in OpenSearch, OpenSearch automatically determines field types with [dynamic mapping](https://opensearch.org/docs/latest/field-types/#dynamic-mapping) based on the first document it receives. You can also explicitly define the field types that work best for you in OpenSearch through a mapping template.

The table below lists Aurora PostgreSQL data types and the corresponding OpenSearch field types. The *Default OpenSearch Field Type* column shows the field type that OpenSearch assigns if no explicit mapping is defined, in which case OpenSearch determines the field type with dynamic mapping. The *Recommended OpenSearch Field Type* column shows the field type that we recommend you explicitly specify in a mapping template. These field types align more closely with the Aurora PostgreSQL data types and usually enable better search features in OpenSearch.


| Aurora PostgreSQL Data Type | Default OpenSearch Field Type | Recommended OpenSearch Field Type | 
| --- | --- | --- | 
| smallint | long | short | 
| integer | long | integer | 
| bigint | long | long | 
| decimal | text | double or keyword | 
| numeric[ (p, s) ] | text | double or keyword | 
| real | float | float | 
| double precision | float | double | 
| smallserial | long | short | 
| serial | long | integer | 
| bigserial | long | long | 
| money | object | object | 
| character varying(n) | text | text | 
| varchar(n) | text | text | 
| character(n) | text | text | 
| char(n) | text | text | 
| bpchar(n) | text | text | 
| bpchar | text | text | 
| text | text | text | 
| enum | text | text | 
| bytea | text | binary | 
| timestamp [ (p) ] [ without time zone ] | long (in epoch milliseconds) | date | 
| timestamp [ (p) ] with time zone | long (in epoch milliseconds) | date | 
| date | long (in epoch milliseconds) | date | 
| time [ (p) ] [ without time zone ] | long (in epoch milliseconds) | date | 
| time [ (p) ] with time zone | long (in epoch milliseconds) | date | 
| interval [ fields ] [ (p) ] | text (ISO8601 format) | text | 
| boolean | boolean | boolean | 
| point | text (in WKT format) | geo_shape | 
| line | text (in WKT format) | geo_shape | 
| lseg | text (in WKT format) | geo_shape | 
| box | text (in WKT format) | geo_shape | 
| path | text (in WKT format) | geo_shape | 
| polygon | text (in WKT format) | geo_shape | 
| circle | object | object | 
| cidr | text | text | 
| inet | text | text | 
| macaddr | text | text | 
| macaddr8 | text | text | 
| bit(n) | long | byte, short, integer, or long (depending on number of bits) | 
| bit varying(n) | long | byte, short, integer, or long (depending on number of bits) | 
| json | object | object | 
| jsonb | object | object | 
| jsonpath | text | text | 

We recommend that you configure a dead-letter queue (DLQ) in your OpenSearch Ingestion pipeline. If you've configured the queue, OpenSearch Service sends to it all documents that fail to ingest due to dynamic mapping failures.

If automatic mappings fail, you can use `template_type` and `template_content` in your pipeline configuration to define explicit mapping rules. Alternatively, you can create mapping templates directly in your search domain or collection before you start the pipeline.

## Limitations
<a name="aurora-PostgreSQL-pipeline-limitations"></a>

Consider the following limitations when you set up an OpenSearch Ingestion pipeline for Aurora PostgreSQL:
+ The integration supports only one Aurora PostgreSQL database per pipeline.
+ The integration does not currently support cross-Region data ingestion; your Amazon Aurora cluster and OpenSearch domain must be in the same AWS Region.
+ The integration does not currently support cross-account data ingestion; your Amazon Aurora cluster and OpenSearch Ingestion pipeline must be in the same AWS account.
+ Ensure that the Amazon Aurora cluster has authentication enabled using AWS Secrets Manager, which is the only supported authentication mechanism.
+ You can't update an existing pipeline configuration to ingest data from a different database or table. To update the database or table name of a pipeline, stop the pipeline and restart it with an updated configuration, or create a new pipeline.
+ Data Definition Language (DDL) statements are generally not supported. Data consistency will not be maintained if:
  + Primary keys are changed (add/delete/rename).
  + Tables are dropped/truncated.
  + Column names or data types are changed.
+ If the Aurora PostgreSQL tables to sync don't have primary keys defined, data consistency isn't guaranteed. You need to explicitly define the `document_id` option in the OpenSearch sink configuration to be able to sync updates and deletes to OpenSearch.
+ Supported versions: Aurora PostgreSQL version 16.4 and higher.

## Recommended CloudWatch Alarms
<a name="aurora-mysql-pipeline-metrics"></a>

The following CloudWatch metrics are recommended for monitoring the performance of your ingestion pipeline. These metrics can help you identify the amount of data processed from exports, the number of events processed from streams, the errors in processing exports and stream events, and the number of documents written to the destination. You can set up CloudWatch alarms to perform an action when one of these metrics exceeds a specified value for a specified amount of time.


| Metric | Description | 
| --- | --- | 
| pipeline-name.rds.credentialsChanged | This metric indicates how often AWS secrets are rotated. | 
| pipeline-name.rds.executorRefreshErrors | This metric indicates failures to refresh AWS secrets. | 
| pipeline-name.rds.exportRecordsTotal | This metric indicates the number of records exported from Amazon Aurora. | 
| pipeline-name.rds.exportRecordsProcessed | This metric indicates the number of records processed by the OpenSearch Ingestion pipeline. | 
| pipeline-name.rds.exportRecordProcessingErrors | This metric indicates the number of processing errors in an OpenSearch Ingestion pipeline while reading the data from an Amazon Aurora cluster. | 
| pipeline-name.rds.exportRecordsSuccessTotal | This metric indicates the total number of export records processed successfully. | 
| pipeline-name.rds.exportRecordsFailedTotal | This metric indicates the total number of export records that failed to process. | 
| pipeline-name.rds.bytesReceived | This metric indicates the total number of bytes received by an OpenSearch Ingestion pipeline. | 
| pipeline-name.rds.bytesProcessed | This metric indicates the total number of bytes processed by an OpenSearch Ingestion pipeline. | 
| pipeline-name.rds.streamRecordsSuccessTotal | This metric indicates the number of records successfully processed from the stream. | 
| pipeline-name.rds.streamRecordsFailedTotal | This metric indicates the total number of records that failed to process from the stream. | 

# Using an OpenSearch Ingestion pipeline with Amazon DynamoDB
<a name="configure-client-ddb"></a>

You can use the [DynamoDB](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/dynamo-db/) plugin to stream table events, such as creates, updates, and deletes, to Amazon OpenSearch Service domains and Amazon OpenSearch Serverless collections. The pipeline uses change data capture (CDC) for high-scale, low-latency streaming.

You can process DynamoDB data with or without a full initial snapshot. 
+ **With a full snapshot** – DynamoDB uses [point-in-time recovery](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/PointInTimeRecovery.html) (PITR) to create a backup and uploads it to Amazon S3. OpenSearch Ingestion then indexes the snapshot in one or multiple OpenSearch indexes. To maintain consistency, the pipeline synchronizes all DynamoDB changes with OpenSearch. This option requires you to enable both PITR and [DynamoDB Streams](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.CoreComponents.html#HowItWorks.CoreComponents.Streams).
+ **Without a snapshot** – OpenSearch Ingestion streams only new DynamoDB events. Choose this option if you already have a snapshot or need real-time streaming without historical data. This option requires you to enable only DynamoDB Streams.

For more information, see [DynamoDB zero-ETL integration with Amazon OpenSearch Service](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/OpenSearchIngestionForDynamoDB.html) in the *Amazon DynamoDB Developer Guide*.

**Topics**
+ [Prerequisites](#s3-prereqs)
+ [Step 1: Configure the pipeline role](#ddb-pipeline-role)
+ [Step 2: Create the pipeline](#ddb-pipeline)
+ [Data consistency](#ddb-pipeline-consistency)
+ [Mapping data types](#ddb-pipeline-mapping)
+ [Limitations](#ddb-pipeline-limitations)
+ [Recommended CloudWatch Alarms for DynamoDB](#ddb-pipeline-metrics)

## Prerequisites
<a name="s3-prereqs"></a>

To set up your pipeline, you must have a DynamoDB table with DynamoDB Streams enabled. Your stream should use the `NEW_IMAGE` stream view type. However, OpenSearch Ingestion pipelines can also stream events with `NEW_AND_OLD_IMAGES` if this stream view type fits your use case.

If you're using snapshots, you must also enable point-in-time recovery on your table. For more information, see [Creating a table](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/WorkingWithTables.Basics.html#WorkingWithTables.Basics.CreateTable), [Enabling point-in-time recovery](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/PointInTimeRecovery_Howitworks.html#howitworks_enabling), and [Enabling a stream](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html#Streams.Enabling) in the *Amazon DynamoDB Developer Guide*.
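A table that meets both prerequisites can be sketched in CloudFormation as follows; the table name, key schema, and billing mode are placeholders for this example:

```
Resources:
  SourceTable:
    Type: AWS::DynamoDB::Table
    Properties:
      TableName: my-table
      BillingMode: PAY_PER_REQUEST
      AttributeDefinitions:
        - AttributeName: pk
          AttributeType: S
      KeySchema:
        - AttributeName: pk
          KeyType: HASH
      # Required for streaming table events to the pipeline
      StreamSpecification:
        StreamViewType: NEW_IMAGE
      # Required only if the pipeline takes a full initial snapshot
      PointInTimeRecoverySpecification:
        PointInTimeRecoveryEnabled: true
```

Omit `PointInTimeRecoverySpecification` if you stream new events only, without a snapshot.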

## Step 1: Configure the pipeline role
<a name="ddb-pipeline-role"></a>

After you have your DynamoDB table set up, [set up the pipeline role](pipeline-security-overview.md#pipeline-security-sink) that you want to use in your pipeline configuration, and add the following DynamoDB permissions in the role:

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "allowRunExportJob",
            "Effect": "Allow",
            "Action": [
                "dynamodb:DescribeTable",
                "dynamodb:DescribeContinuousBackups",
                "dynamodb:ExportTableToPointInTime"
            ],
            "Resource": [
                "arn:aws:dynamodb:us-east-1:111122223333:table/my-table"
            ]
        },
        {
            "Sid": "allowCheckExportjob",
            "Effect": "Allow",
            "Action": [
                "dynamodb:DescribeExport"
            ],
            "Resource": [
                "arn:aws:dynamodb:us-east-1:111122223333:table/my-table/export/*"
            ]
        },
        {
            "Sid": "allowReadFromStream",
            "Effect": "Allow",
            "Action": [
                "dynamodb:DescribeStream",
                "dynamodb:GetRecords",
                "dynamodb:GetShardIterator"
            ],
            "Resource": [
                "arn:aws:dynamodb:us-east-1:111122223333:table/my-table/stream/*"
            ]
        },
        {
            "Sid": "allowReadAndWriteToS3ForExport",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:AbortMultipartUpload",
                "s3:PutObject",
                "s3:PutObjectAcl"
            ],
            "Resource": [
                "arn:aws:s3:::amzn-s3-demo-bucket/export-folder/*"
            ]
        }
    ]
}
```

------

You can also use an AWS KMS customer managed key to encrypt the export data files. To decrypt the exported objects, specify `s3_sse_kms_key_id` for the key ID in the export configuration of the pipeline with the following format: `arn:aws:kms:region:account-id:key/my-key-id`. The following policy includes the required permissions for using a customer managed key:

```
{
    "Sid": "allowUseOfCustomManagedKey",
    "Effect": "Allow",
    "Action": [
        "kms:GenerateDataKey",
        "kms:Decrypt"
    ],
    "Resource": "arn:aws:kms:region:account-id:key/my-key-id"
}
```
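In the pipeline itself, the key is referenced with `s3_sse_kms_key_id` in the export configuration of the `dynamodb` source. The following sketch uses placeholder values:

```
source:
  dynamodb:
    tables:
    - table_arn: "arn:aws:dynamodb:us-east-1:111122223333:table/my-table"
      export:
        s3_bucket: "amzn-s3-demo-bucket"
        s3_prefix: "export-folder/"
        # Customer managed key used to decrypt the exported data files
        s3_sse_kms_key_id: "arn:aws:kms:us-east-1:111122223333:key/my-key-id"
```

The pipeline role needs the `kms:GenerateDataKey` and `kms:Decrypt` permissions shown above on this key.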

## Step 2: Create the pipeline
<a name="ddb-pipeline"></a>

You can then configure an OpenSearch Ingestion pipeline like the following, which specifies DynamoDB as the source. This sample pipeline ingests data from `table-a` with the PITR snapshot, followed by events from DynamoDB Streams. A start position of `LATEST` indicates that the pipeline should read the latest data from DynamoDB Streams.

```
version: "2"
cdc-pipeline:
  source:
    dynamodb:
      tables:
      - table_arn: "arn:aws:dynamodb:region:account-id:table/table-a"  
        export:
          s3_bucket: "my-bucket"
          s3_prefix: "export/"
        stream:
          start_position: "LATEST"
      aws:
        region: "us-east-1"
  sink:
  - opensearch:
      hosts: ["https://search-mydomain.region.es.amazonaws.com"]
      index: "${getMetadata(\"table-name\")}"
      index_type: custom
      normalize_index: true
      document_id: "${getMetadata(\"primary_key\")}"
      action: "${getMetadata(\"opensearch_action\")}"
      document_version: "${getMetadata(\"document_version\")}"
      document_version_type: "external"
```

You can use a preconfigured DynamoDB blueprint to create this pipeline. For more information, see [Working with blueprints](pipeline-blueprint.md).

## Data consistency
<a name="ddb-pipeline-consistency"></a>

OpenSearch Ingestion supports end-to-end acknowledgement to ensure data durability. When a pipeline reads snapshots or streams, it dynamically creates partitions for parallel processing. The pipeline marks a partition as complete when it receives an acknowledgement after ingesting all records into the OpenSearch domain or collection.

If you want to ingest into an OpenSearch Serverless *search* collection, you can generate a document ID in the pipeline. If you want to ingest into an OpenSearch Serverless *time series* collection, note that the pipeline doesn't generate a document ID.

An OpenSearch Ingestion pipeline also maps incoming event actions into corresponding bulk indexing actions to help ingest documents. This keeps data consistent, so that every data change in DynamoDB is reconciled with the corresponding document changes in OpenSearch.
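The event-to-action mapping can be sketched as follows. This is an illustrative function, not the pipeline's actual source code: DynamoDB stream records carry an `eventName` of `INSERT`, `MODIFY`, or `REMOVE`, which correspond to OpenSearch bulk `index` and `delete` actions.

```python
def to_bulk_action(event_name: str) -> str:
    """Map a DynamoDB stream eventName to an OpenSearch bulk action.

    Illustrative sketch only: INSERT and MODIFY both result in indexing
    (overwriting) the document; REMOVE deletes it.
    """
    mapping = {
        "INSERT": "index",   # new item -> index a new document
        "MODIFY": "index",   # updated item -> reindex the document
        "REMOVE": "delete",  # deleted item -> delete the document
    }
    try:
        return mapping[event_name]
    except KeyError:
        raise ValueError(f"Unexpected eventName: {event_name}")
```

Combined with `document_version` and `document_version_type: "external"` in the sink, this ensures that out-of-order events don't overwrite newer document versions.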

## Mapping data types
<a name="ddb-pipeline-mapping"></a>

OpenSearch Service dynamically maps each DynamoDB data type in an incoming document to the corresponding OpenSearch data type. The following table shows how OpenSearch Service automatically maps various data types.


| Data type | OpenSearch | DynamoDB | 
| --- | --- | --- | 
| Number |  OpenSearch automatically maps numeric data. If the number is a whole number, OpenSearch maps it as a long value. If the number is fractional, then OpenSearch maps it as a float value. OpenSearch dynamically maps various attributes based on the first sent document. If you have a mix of data types for the same attribute in DynamoDB, such as both a whole number and a fractional number, mapping might fail.  For example, if your first document has an attribute that is a whole number, and a later document has that same attribute as a fractional number, OpenSearch fails to ingest the second document. In these cases, you should provide an explicit mapping template, such as the following: <pre>{<br /> "template": {<br />  "mappings": {<br />   "properties": {<br />    "MixedNumberAttribute": {<br />     "type": "float"<br />    }<br />   }<br />  }<br /> }<br />}</pre> If you need double precision, use string-type field mapping. There is no equivalent numeric type that supports 38 digits of precision in OpenSearch.  |  DynamoDB supports [numbers](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.NamingRulesDataTypes.html#HowItWorks.DataTypes.Number).  | 
| Number set | OpenSearch automatically maps a number set into an array of either long values or float values. As with the scalar numbers, this depends on whether the first number ingested is a whole number or a fractional number. You can provide mappings for number sets the same way that you map scalar numbers. |  DynamoDB supports types that represent [sets of numbers](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.NamingRulesDataTypes.html#HowItWorks.DataTypes.SetTypes).  | 
| String |  OpenSearch automatically maps string values as text. In some situations, such as enumerated values, you can map to the keyword type. The following example shows how to map a DynamoDB attribute named `PartType` to an OpenSearch keyword. <pre>{<br /> "template": {<br />  "mappings": {<br />   "properties": {<br />    "PartType": {<br />     "type": "keyword"<br />    }<br />   }<br />  }<br /> }<br />}</pre>  |  DynamoDB supports [strings](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.NamingRulesDataTypes.html#HowItWorks.DataTypes.String).  | 
| String set |  OpenSearch automatically maps a string set into an array of strings. You can provide mappings for string sets the same way that you map scalar strings.  | DynamoDB supports types that represent [sets of strings](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.NamingRulesDataTypes.html#HowItWorks.DataTypes.SetTypes). | 
| Binary |  OpenSearch automatically maps binary data as text. You can provide a mapping to write these as binary fields in OpenSearch. The following example shows how to map a DynamoDB attribute named `ImageData` to an OpenSearch binary field. <pre>{<br /> "template": {<br />  "mappings": {<br />   "properties": {<br />    "ImageData": {<br />     "type": "binary"<br />    }<br />   }<br />  }<br /> }<br />}</pre>  | DynamoDB supports [binary type attributes](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.NamingRulesDataTypes.html#HowItWorks.DataTypes.Binary). | 
| Binary set |  OpenSearch automatically maps a binary set into an array of binary data as text. You can provide mappings for binary sets the same way that you map scalar binary data.  | DynamoDB supports types that represent [sets of binary values](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.NamingRulesDataTypes.html#HowItWorks.DataTypes.SetTypes). | 
| Boolean |  OpenSearch maps a DynamoDB Boolean type into an OpenSearch Boolean type.  |  DynamoDB supports [Boolean type attributes](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.NamingRulesDataTypes.html#HowItWorks.DataTypes.Boolean).  | 
| Null |  OpenSearch can ingest documents with the DynamoDB null type. It saves the value as a null value in the document. There is no mapping for this type, and this field is not indexed or searchable. If the same attribute name is used for a null type and then later changes to a different type, such as string, OpenSearch creates a dynamic mapping for the first non-null value. Subsequent values can still be DynamoDB null values.  | DynamoDB supports [null type attributes](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.NamingRulesDataTypes.html#HowItWorks.DataTypes.Null). | 
| Map |  OpenSearch maps DynamoDB map attributes to nested fields. The same mappings apply within a nested field. The following example maps a string in a nested field to a keyword type in OpenSearch: <pre>{<br /> "template": {<br />  "mappings": {<br />   "properties": {<br />    "AdditionalDescriptions": {<br />     "properties": {<br />      "PartType": {<br />       "type": "keyword"<br />      }<br />     }<br />    }<br />   }<br />  }<br /> }<br />}</pre>  | DynamoDB supports [map type attributes](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.NamingRulesDataTypes.html#HowItWorks.DataTypes.Document.Map). | 
| List |  OpenSearch provides different results for DynamoDB lists, depending on what is in the list. When a list contains all of the same type of scalar types (for example, a list of all strings), then OpenSearch ingests the list as an array of that type. This works for string, number, Boolean, and null types. The restrictions for each of these types are the same as restrictions for a scalar of that type. You can also provide mappings for lists of maps by using the same mapping as you would use for a map. You can't provide a list of mixed types.   |  DynamoDB supports [list type attributes](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.NamingRulesDataTypes.html#HowItWorks.DataTypes.Document.List).  | 
| Set |  OpenSearch provides different results for DynamoDB sets depending on what is in the set. When a set contains all of the same type of scalar types (for example, a set of all strings), then OpenSearch ingests the set as an array of that type. This works for string, number, Boolean, and null types. The restrictions for each of these types are the same as the restrictions for a scalar of that type. You can also provide mappings for sets of maps by using the same mapping as you would use for a map. You can't provide a set of mixed types.   | DynamoDB supports types that represent [sets](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.NamingRulesDataTypes.html#HowItWorks.DataTypes.SetTypes). | 

We recommend that you configure a dead-letter queue (DLQ) in your OpenSearch Ingestion pipeline. If you configure one, OpenSearch Service sends documents that fail to ingest due to dynamic mapping failures to the queue.
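The DLQ is configured on the OpenSearch sink. The following sketch shows the shape of the configuration; the bucket name, key prefix, hosts, index name, and role ARN are placeholders that you replace with your own values:

```
sink:
  - opensearch:
      hosts: ["https://search-mydomain.region.es.amazonaws.com"]
      index: "my-index"
      dlq:
        s3:
          bucket: "my-dlq-bucket"
          key_path_prefix: "dynamodb-pipeline/dlq"
          region: "us-east-1"
          sts_role_arn: "arn:aws:iam::111122223333:role/pipeline-role"
```

The pipeline role must have `s3:PutObject` permission on the DLQ bucket for failed documents to be written there.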

If automatic mapping fails, you can use `template_type` and `template_content` in your pipeline configuration to define explicit mapping rules. Alternatively, you can create mapping templates directly in your search domain or collection before you start the pipeline.
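For example, the explicit `float` mapping for a mixed-number attribute from the table above can be supplied inline in the sink. This is a sketch; the hosts and index name are placeholders:

```
sink:
  - opensearch:
      hosts: ["https://search-mydomain.region.es.amazonaws.com"]
      index: "my-index"
      template_type: index-template
      template_content: |
        {
          "template": {
            "mappings": {
              "properties": {
                "MixedNumberAttribute": { "type": "float" }
              }
            }
          }
        }
```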

## Limitations
<a name="ddb-pipeline-limitations"></a>

Consider the following limitations when you set up an OpenSearch Ingestion pipeline for DynamoDB:
+ The OpenSearch Ingestion integration with DynamoDB currently doesn't support cross-Region ingestion. Your DynamoDB table and OpenSearch Ingestion pipeline must be in the same AWS Region.
+ Your DynamoDB table and OpenSearch Ingestion pipeline must be in the same AWS account.
+ An OpenSearch Ingestion pipeline supports only one DynamoDB table as its source. 
+ DynamoDB Streams stores data for only up to 24 hours. If ingestion from an initial snapshot of a large table takes 24 hours or more, some initial data loss occurs. To mitigate this data loss, estimate the size of the table and configure appropriate compute capacity (OCUs) for the OpenSearch Ingestion pipeline. 

## Recommended CloudWatch alarms for DynamoDB
<a name="ddb-pipeline-metrics"></a>

The following CloudWatch metrics are recommended for monitoring the performance of your ingestion pipeline. These metrics can help you identify the amount of data processed from exports, the number of events processed from streams, the errors in processing exports and stream events, and the number of documents written to the destination. You can set up CloudWatch alarms to perform an action when one of these metrics exceeds a specified value for a specified amount of time.


| Metric | Description | 
| --- | --- | 
| dynamodb-pipeline.BlockingBuffer.bufferUsage.value |  Indicates how much of the buffer is being utilized.  | 
|  dynamodb-pipeline.dynamodb.activeExportS3ObjectConsumers.value  |  Shows the total number of OCUs that are actively processing Amazon S3 objects for the export.  | 
|  dynamodb-pipeline.dynamodb.bytesProcessed.count  |  Count of bytes processed from DynamoDB source.  | 
|  dynamodb-pipeline.dynamodb.changeEventsProcessed.count  |  Number of change events processed from DynamoDB stream.  | 
|  dynamodb-pipeline.dynamodb.changeEventsProcessingErrors.count  |  Number of errors from change events processed from DynamoDB.  | 
|  dynamodb-pipeline.dynamodb.exportJobFailure.count  | Number of export job submission attempts that have failed. | 
|  dynamodb-pipeline.dynamodb.exportJobSuccess.count  | Number of export jobs that have been submitted successfully. | 
|  dynamodb-pipeline.dynamodb.exportRecordsProcessed.count  |  Total number of records processed from the export.  | 
|  dynamodb-pipeline.dynamodb.exportRecordsTotal.count  |  Total number of records exported from DynamoDB, essential for tracking data export volumes.  | 
|  dynamodb-pipeline.dynamodb.exportS3ObjectsProcessed.count  | Total number of export data files that have been processed successfully from Amazon S3. | 
|  dynamodb-pipeline.opensearch.bulkBadRequestErrors.count  | Count of errors during bulk requests due to a malformed request. | 
|  dynamodb-pipeline.opensearch.bulkRequestLatency.avg  | Average latency for bulk write requests made to OpenSearch. | 
|  dynamodb-pipeline.opensearch.bulkRequestNotFoundErrors.count  | Number of bulk requests that failed because the target data could not be found. | 
|  dynamodb-pipeline.opensearch.bulkRequestNumberOfRetries.count  | Number of retries by OpenSearch Ingestion pipelines to write to the OpenSearch cluster. | 
|  dynamodb-pipeline.opensearch.bulkRequestSizeBytes.sum  | Total size in bytes of all bulk requests made to OpenSearch. | 
|  dynamodb-pipeline.opensearch.documentErrors.count  | Number of errors when sending documents to OpenSearch. The documents causing the errors will be sent to the DLQ. | 
|  dynamodb-pipeline.opensearch.documentsSuccess.count  | Number of documents successfully written to an OpenSearch cluster or collection. | 
|  dynamodb-pipeline.opensearch.documentsSuccessFirstAttempt.count  | Number of documents successfully indexed in OpenSearch on the first attempt. | 
|  dynamodb-pipeline.opensearch.documentsVersionConflictErrors.count  | Count of errors due to version conflicts in documents during processing. | 
|  dynamodb-pipeline.opensearch.PipelineLatency.avg  | Average latency of the OpenSearch Ingestion pipeline to process the data by reading from the source and writing to the destination. | 
|  dynamodb-pipeline.opensearch.PipelineLatency.max  | Maximum latency of the OpenSearch Ingestion pipeline to process the data by reading from the source and writing to the destination. | 
|  dynamodb-pipeline.opensearch.recordsIn.count  | Count of records successfully ingested into OpenSearch. This metric is essential for tracking the volume of data being processed and stored. | 
|  dynamodb-pipeline.opensearch.s3.dlqS3RecordsFailed.count  | Number of records that failed to write to DLQ. | 
|  dynamodb-pipeline.opensearch.s3.dlqS3RecordsSuccess.count  | Number of records that are written to the DLQ. | 
|  dynamodb-pipeline.opensearch.s3.dlqS3RequestLatency.count  | Count of latency measurements for requests to the Amazon S3 dead-letter queue. | 
|  dynamodb-pipeline.opensearch.s3.dlqS3RequestLatency.sum  | Total latency for all requests to the Amazon S3 dead-letter queue. | 
|  dynamodb-pipeline.opensearch.s3.dlqS3RequestSizeBytes.sum  | Total size in bytes of all requests made to the Amazon S3 dead-letter queue. | 
|  dynamodb-pipeline.recordsProcessed.count  | Total number of records processed in the pipeline, a key metric for overall throughput. | 
|  dynamodb.changeEventsProcessed.count  | No records are being gathered from DynamoDB streams. This could be due to no activity on the table, an export being in progress, or an issue accessing the DynamoDB streams. | 
|  dynamodb.exportJobFailure.count  | The attempt to trigger an export to Amazon S3 failed. | 
|  dynamodb-pipeline.opensearch.bulkRequestInvalidInputErrors.count  | Count of bulk request errors in OpenSearch due to invalid input, crucial for monitoring data quality and operational issues. | 
|  opensearch.EndToEndLatency.avg  | The end-to-end latency is higher than desired for reading from DynamoDB streams. This could be due to an underscaled OpenSearch cluster or a maximum pipeline OCU capacity that is too low for the WCU throughput on the DynamoDB table. This end-to-end latency will be high after an export and should decrease over time as the pipeline catches up to the latest DynamoDB streams. | 
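As an illustrative sketch, an alarm on the document error metric could be created with the AWS CLI. The pipeline name, SNS topic ARN, period, and threshold below are assumptions that you replace with your own values:

```
aws cloudwatch put-metric-alarm \
  --alarm-name dynamodb-pipeline-document-errors \
  --namespace "AWS/OSIS" \
  --metric-name "dynamodb-pipeline.opensearch.documentErrors.count" \
  --dimensions Name=PipelineName,Value=my-pipeline \
  --statistic Sum \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 0 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:111122223333:my-alerts-topic
```

This alarms whenever any document fails to ingest in a five-minute window; tune the threshold and period to match your tolerance for transient errors.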

# Using an OpenSearch Ingestion pipeline with Amazon DocumentDB
<a name="configure-client-docdb"></a>

You can use the [DocumentDB](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/documentdb/) plugin to stream document changes, such as creates, updates, and deletes, to Amazon OpenSearch Service. The pipeline supports change data capture (CDC), if available, or API polling for high-scale, low-latency streaming.

You can process data with or without a full initial snapshot. A full snapshot captures an entire Amazon DocumentDB collection and uploads it to Amazon S3. The pipeline then sends the data to one or more OpenSearch indexes. After it ingests the snapshot, the pipeline synchronizes ongoing changes to maintain consistency and eventually catches up to near real-time updates.

If you already have a full snapshot from another source, or only need to process new events, you can stream without a snapshot. In this case, the pipeline reads directly from Amazon DocumentDB change streams without an initial bulk load.

If you enable streaming, you must [enable a change stream](https://docs.aws.amazon.com/documentdb/latest/developerguide/change_streams.html#change_streams-enabling) on your Amazon DocumentDB collection. However, if you only perform a full load or export, you don’t need a change stream.
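Enabling a change stream is done from a mongo shell connected to your cluster. A sketch, assuming a database named `dbname` and a collection named `collection`:

```
db.adminCommand({
  modifyChangeStreams: 1,
  database: "dbname",
  collection: "collection",
  enable: true
});
```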

## Prerequisites
<a name="s3-prereqs"></a>

Before you create your OpenSearch Ingestion pipeline, perform the following steps:

1. Create an Amazon DocumentDB cluster with permission to read data by following the steps in [Create an Amazon DocumentDB cluster](https://docs.aws.amazon.com/documentdb/latest/developerguide/get-started-guide.html#cloud9-cluster) in the *Amazon DocumentDB Developer Guide*. If you use CDC infrastructure, configure your Amazon DocumentDB cluster to publish change streams. 

1. Enable TLS on your Amazon DocumentDB cluster.

1. Choose a VPC CIDR from a private address space to use with OpenSearch Ingestion.

1. Set up authentication on your Amazon DocumentDB cluster with AWS Secrets Manager. Enable secrets rotation by following the steps in [Automatically rotating passwords for Amazon DocumentDB](https://docs.aws.amazon.com/documentdb/latest/developerguide/security.managing-users.html#security.managing-users-rotating-passwords). For more information, see [Database access using Role-Based Access Control](https://docs.aws.amazon.com/documentdb/latest/developerguide/role_based_access_control.html) and [Security in Amazon DocumentDB](https://docs.aws.amazon.com/documentdb/latest/developerguide/security.html).

1. If you use a change stream to subscribe to data changes on your Amazon DocumentDB collection, avoid data loss by extending the retention period to up to 7 days using the `change_stream_log_retention_duration` parameter. By default, change stream events are stored for 3 hours after the event is recorded, which isn't enough time for large collections. To modify the change stream retention period, see [Modifying the change stream log retention duration](https://docs.aws.amazon.com/documentdb/latest/developerguide/change_streams.html#change_streams-modifying_log_retention).

1. Create an OpenSearch Service domain or OpenSearch Serverless collection. For more information, see [Creating OpenSearch Service domains](createupdatedomains.md#createdomains) and [Creating collections](serverless-create.md).

1. Attach a [resource-based policy](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ac.html#ac-types-resource) to your domain or a [data access policy](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-data-access.html) to your collection. These access policies allow OpenSearch Ingestion to write data from your Amazon DocumentDB cluster to your domain or collection. 

   The following sample domain access policy allows the pipeline role, which you create in the next step, to write data to a domain. Make sure that you update the `resource` with your own ARN. 

------
#### [ JSON ]


   ```
   {
     "Version":"2012-10-17",		 	 	 
     "Statement": [
       {
         "Effect": "Allow",
         "Principal": {
           "AWS": "arn:aws:iam::444455556666:role/pipeline-role"
         },
         "Action": [
           "es:DescribeDomain",
           "es:ESHttp*"
         ],
         "Resource": [
           "arn:aws:es:us-east-1:111122223333:domain/domain-name"
         ]
       }
     ]
   }
   ```

------

   To create an IAM role with the correct permissions to write data to the collection or domain, see [Setting up roles and users in Amazon OpenSearch Ingestion](pipeline-security-overview.md).
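The change stream retention period from the prerequisites can be extended with the AWS CLI. A sketch, assuming a custom cluster parameter group named `docdb-params` is already attached to your cluster (604800 seconds is 7 days):

```
aws docdb modify-db-cluster-parameter-group \
  --db-cluster-parameter-group-name docdb-params \
  --parameters "ParameterName=change_stream_log_retention_duration,ParameterValue=604800,ApplyMethod=immediate"
```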

## Step 1: Configure the pipeline role
<a name="docdb-pipeline-role"></a>

After you have your Amazon DocumentDB pipeline prerequisites set up, [configure the pipeline role](pipeline-security-overview.md#pipeline-security-sink) that you want to use in your pipeline configuration, and add the following Amazon DocumentDB permissions in the role:

------
#### [ JSON ]


```
{
    "Version":"2012-10-17",		 	 	 
    "Statement": [
        {
            "Sid": "allowS3ListObjectAccess",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::s3-bucket"
            ],
            "Condition": {
                "StringLike": {
                    "s3:prefix": "s3-prefix/*"
                }
            }
        },
        {
            "Sid": "allowReadAndWriteToS3ForExportStream",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::s3-bucket/s3-prefix/*"
            ]
        },
        {
            "Sid": "SecretsManagerReadAccess",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetSecretValue"
            ],
            "Resource": [
                "arn:aws:secretsmanager:us-east-1:111122223333:secret:secret-name"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "ec2:AttachNetworkInterface",
                "ec2:CreateNetworkInterface",
                "ec2:CreateNetworkInterfacePermission",
                "ec2:DeleteNetworkInterface",
                "ec2:DeleteNetworkInterfacePermission",
                "ec2:DetachNetworkInterface",
                "ec2:DescribeNetworkInterfaces"
            ],
            "Resource": [
                "arn:aws:ec2:*:111122223333:network-interface/*",
                "arn:aws:ec2:*:111122223333:subnet/*",
                "arn:aws:ec2:*:111122223333:security-group/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeDhcpOptions",
                "ec2:DescribeRouteTables",
                "ec2:DescribeSecurityGroups",
                "ec2:DescribeSubnets",
                "ec2:DescribeVpcs",
                "ec2:Describe*"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ec2:CreateTags"
            ],
            "Resource": "arn:aws:ec2:*:*:network-interface/*",
            "Condition": {
                "StringEquals": {
                    "aws:RequestTag/OSISManaged": "true"
                }
            }
        }
    ]
}
```

------

You must provide the above Amazon EC2 permissions on the IAM role that you use to create the OpenSearch Ingestion pipeline because the pipeline uses these permissions to create and delete a network interface in your VPC. The pipeline can only access the Amazon DocumentDB cluster through this network interface.

## Step 2: Create the pipeline
<a name="docdb-pipeline"></a>

You can then configure an OpenSearch Ingestion pipeline like the following, which specifies Amazon DocumentDB as the source. Note that to populate the index name, the `getMetadata` function uses `documentdb_collection` as a metadata key. If you want to use a different index name without the `getMetadata` method, you can use the configuration `index: "my_index_name"`.

```
version: "2"
documentdb-pipeline:
  source:
    documentdb:
      acknowledgments: true
      host: "https://docdb-cluster-id.us-east-1.docdb.amazonaws.com"
      port: 27017
      authentication:
        username: ${aws_secrets:secret:username}
        password: ${aws_secrets:secret:password}
      aws:
      s3_bucket: "bucket-name"
      s3_region: "bucket-region" 
      s3_prefix: "path" #optional path for storing the temporary data
      collections:
        - collection: "dbname.collection"
          export: true
          stream: true
  sink:
  - opensearch:
      hosts: ["https://search-mydomain.us-east-1.es.amazonaws.com"]
      index: "${getMetadata(\"documentdb_collection\")}"
      index_type: custom
      document_id: "${getMetadata(\"primary_key\")}"
      action: "${getMetadata(\"opensearch_action\")}"
      document_version: "${getMetadata(\"document_version\")}"
      document_version_type: "external"
extension:
  aws:
    secrets:
      secret:
        secret_id: "my-docdb-secret"
        region: "us-east-1"
        refresh_interval: PT1H
```

You can use a preconfigured Amazon DocumentDB blueprint to create this pipeline. For more information, see [Working with blueprints](pipeline-blueprint.md).

If you're using the AWS Management Console to create your pipeline, you must also attach your pipeline to your VPC in order to use Amazon DocumentDB as a source. To do so, find the **Source network options** section, select the **Attach to VPC** checkbox, and choose your CIDR from one of the provided default options. You can use any CIDR from a private address space as defined in the [RFC 1918 Best Current Practice](https://datatracker.ietf.org/doc/html/rfc1918).

To provide a custom CIDR, select **Other** from the dropdown menu. To avoid a collision in IP addresses between OpenSearch Ingestion and Amazon DocumentDB, ensure that the Amazon DocumentDB VPC CIDR is different from the CIDR for OpenSearch Ingestion.
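The two requirements above — a private (RFC 1918) range for the pipeline and no overlap with the Amazon DocumentDB VPC — can be checked programmatically before you create the pipeline. A minimal sketch using Python's `ipaddress` module; the function name and sample CIDRs are illustrative:

```python
import ipaddress


def validate_ingestion_cidr(ingestion_cidr: str, docdb_vpc_cidr: str) -> None:
    """Raise ValueError if the pipeline CIDR is not private or
    overlaps the Amazon DocumentDB VPC CIDR."""
    ingestion = ipaddress.ip_network(ingestion_cidr)
    docdb = ipaddress.ip_network(docdb_vpc_cidr)
    if not ingestion.is_private:
        raise ValueError(f"{ingestion_cidr} is not in a private address space")
    if ingestion.overlaps(docdb):
        raise ValueError(
            f"{ingestion_cidr} overlaps the DocumentDB VPC CIDR {docdb_vpc_cidr}"
        )


# OK: 172.16.0.0/16 is RFC 1918 and disjoint from 10.0.0.0/16
validate_ingestion_cidr("172.16.0.0/16", "10.0.0.0/16")
```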

For more information, see [Configuring VPC access for a pipeline](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/pipeline-security.html#pipeline-vpc-configure).

## Data consistency
<a name="docdb-pipeline-consistency"></a>

The pipeline ensures data consistency by continuously polling or receiving changes from the Amazon DocumentDB cluster and updating the corresponding documents in the OpenSearch index.

OpenSearch Ingestion supports end-to-end acknowledgement to ensure data durability. When a pipeline reads snapshots or streams, it dynamically creates partitions for parallel processing. The pipeline marks a partition as complete when it receives an acknowledgement after ingesting all records in the OpenSearch domain or collection. 

If you want to ingest into an OpenSearch Serverless *search* collection, you can generate a document ID in the pipeline. If you want to ingest into an OpenSearch Serverless *time series* collection, note that the pipeline doesn't generate a document ID, so you must omit `document_id: "${getMetadata(\"primary_key\")}"` in your pipeline sink configuration. 

An OpenSearch Ingestion pipeline also maps incoming event actions into corresponding bulk indexing actions to help ingest documents. This keeps data consistent, so that every data change in Amazon DocumentDB is reconciled with the corresponding document changes in OpenSearch.

## Mapping data types
<a name="docdb-pipeline-mapping"></a>

OpenSearch Service dynamically maps each Amazon DocumentDB data type in an incoming document to the corresponding OpenSearch data type. The following table shows how OpenSearch Service automatically maps various data types.


| Data type | OpenSearch | Amazon DocumentDB | 
| --- | --- | --- | 
| Integer |  OpenSearch automatically maps Amazon DocumentDB integer values to OpenSearch integers. OpenSearch dynamically maps the field based on the first sent document. If you have a mix of data types for the same attribute in Amazon DocumentDB, automatic mapping might fail.  For example, if your first document has an attribute that is a long, and a later document has that same attribute as an integer, OpenSearch fails to ingest the second document. In these cases, you should provide an explicit mapping template that chooses the most flexible number type, such as the following: <pre>{<br /> "template": {<br />  "mappings": {<br />   "properties": {<br />    "MixedNumberField": {<br />     "type": "float"<br />    }<br />   }<br />  }<br /> }<br />}</pre>  |  Amazon DocumentDB supports [integers](https://docs.aws.amazon.com/documentdb/latest/developerguide/mongo-apis.html#mongo-apis-data-types).  | 
| Long |  OpenSearch automatically maps Amazon DocumentDB long values to OpenSearch longs. OpenSearch dynamically maps the field based on the first sent document. If you have a mix of data types for the same attribute in Amazon DocumentDB, automatic mapping might fail.  For example, if your first document has an attribute that is a long, and a later document has that same attribute as an integer, OpenSearch fails to ingest the second document. In these cases, you should provide an explicit mapping template that chooses the most flexible number type, such as the following: <pre>{<br /> "template": {<br />  "mappings": {<br />   "properties": {<br />    "MixedNumberField": {<br />     "type": "float"<br />    }<br />   }<br />  }<br /> }<br />}</pre>  |  Amazon DocumentDB supports [longs](https://docs.aws.amazon.com/documentdb/latest/developerguide/mongo-apis.html#mongo-apis-data-types).  | 
| String |  OpenSearch automatically maps string values as text. In some situations, such as enumerated values, you can map to the keyword type. The following example shows how to map an Amazon DocumentDB attribute named `PartType` to an OpenSearch keyword. <pre>{<br /> "template": {<br />  "mappings": {<br />   "properties": {<br />    "PartType": {<br />     "type": "keyword"<br />    }<br />   }<br />  }<br /> }<br />}</pre>  |  Amazon DocumentDB supports [strings](https://docs.aws.amazon.com/documentdb/latest/developerguide/mongo-apis.html#mongo-apis-data-types).  | 
| Double |  OpenSearch automatically maps Amazon DocumentDB double values to OpenSearch doubles. OpenSearch dynamically maps the field based on the first sent document. If you have a mix of data types for the same attribute in Amazon DocumentDB, automatic mapping might fail.  For example, if your first document has an attribute that is a long, and a later document has that same attribute as an integer, OpenSearch fails to ingest the second document. In these cases, you should provide an explicit mapping template that chooses the most flexible number type, such as the following: <pre>{<br /> "template": {<br />  "mappings": {<br />   "properties": {<br />    "MixedNumberField": {<br />     "type": "float"<br />    }<br />   }<br />  }<br /> }<br />}</pre>  | Amazon DocumentDB supports [doubles](https://docs.aws.amazon.com/documentdb/latest/developerguide/mongo-apis.html#mongo-apis-data-types). | 
| Date |  By default, date maps to an integer in OpenSearch. You can define a custom mapping template to map a date to an OpenSearch date. <pre>{<br /> "template": {<br />  "mappings": {<br />   "properties": {<br />    "myDateField": {<br />     "type": "date",<br />     "format": "epoch_second"<br />    }<br />   }<br />  }<br /> }<br />}</pre>  | Amazon DocumentDB supports [dates](https://docs.aws.amazon.com/documentdb/latest/developerguide/mongo-apis.html#mongo-apis-data-types). | 
| Timestamp |  By default, timestamp maps to an integer in OpenSearch. You can define a custom mapping template to map a date to an OpenSearch date. <pre>{<br /> "template": {<br />  "mappings": {<br />   "properties": {<br />    "myTimestampField": {<br />     "type": "date",<br />     "format": "epoch_second"<br />    }<br />   }<br />  }<br /> }<br />}</pre>  | Amazon DocumentDB supports [timestamps](https://docs.aws.amazon.com/documentdb/latest/developerguide/mongo-apis.html#mongo-apis-data-types). | 
| Boolean |  OpenSearch maps an Amazon DocumentDB Boolean type into an OpenSearch Boolean type.  |  Amazon DocumentDB supports [Boolean type attributes](https://docs.aws.amazon.com/documentdb/latest/developerguide/mongo-apis.html#mongo-apis-data-types).  | 
| Decimal |  By default, OpenSearch maps Amazon DocumentDB decimal values as text. You can define a custom mapping template to map a decimal to an OpenSearch double: <pre>{<br /> "template": {<br />  "mappings": {<br />   "properties": {<br />    "myDecimalField": {<br />     "type": "double"<br />    }<br />   }<br />  }<br /> }<br />}</pre> With this custom mapping, you can query and aggregate the field with double-level precision. The original value retains the full precision in the `_source` property of the OpenSearch document.  | Amazon DocumentDB supports [decimals](https://docs.aws.amazon.com/documentdb/latest/developerguide/mongo-apis.html#mongo-apis-data-types). | 
| Regular Expression | The regex type creates nested fields. These include `<myFieldName>.pattern` and `<myFieldName>.options`. |  Amazon DocumentDB supports [regular expressions](https://docs.aws.amazon.com/documentdb/latest/developerguide/mongo-apis.html#mongo-apis-data-types).  | 
| Binary Data |  OpenSearch automatically maps Amazon DocumentDB binary data to OpenSearch text. You can provide a mapping to write these as binary fields in OpenSearch. The following example shows how to map an Amazon DocumentDB field named `imageData` to an OpenSearch binary field. <pre>{<br /> "template": {<br />  "mappings": {<br />   "properties": {<br />    "imageData": {<br />     "type": "binary"<br />    }<br />   }<br />  }<br /> }<br />}</pre>  | Amazon DocumentDB supports [binary data fields](https://docs.aws.amazon.com/documentdb/latest/developerguide/mongo-apis.html#mongo-apis-data-types). | 
| ObjectId | Fields with a type of objectId map to OpenSearch text fields. The value will be the string representation of the objectId.  | Amazon DocumentDB supports [objectIds](https://docs.aws.amazon.com/documentdb/latest/developerguide/mongo-apis.html#mongo-apis-data-types). | 
| Null |  OpenSearch can ingest documents with the Amazon DocumentDB null type. It saves the value as a null value in the document. There is no mapping for this type, and this field is not indexed or searchable. If the same attribute name is used for a null type and later changes to a different type, such as string, OpenSearch creates a dynamic mapping for the first non-null value. Subsequent values can still be Amazon DocumentDB null values.  | Amazon DocumentDB supports [null type fields](https://docs.aws.amazon.com/documentdb/latest/developerguide/mongo-apis.html#mongo-apis-data-types). | 
| Undefined |  OpenSearch can ingest documents with the Amazon DocumentDB undefined type. It saves the value as a null value in the document. There is no mapping for this type, and this field is not indexed or searchable. If the same field name is used for an undefined type and later changes to a different type, such as string, OpenSearch creates a dynamic mapping for the first non-undefined value. Subsequent values can still be Amazon DocumentDB undefined values.  | Amazon DocumentDB supports [undefined type fields](https://docs.aws.amazon.com/documentdb/latest/developerguide/mongo-apis.html#mongo-apis-data-types). | 
| MinKey |  OpenSearch can ingest documents with the Amazon DocumentDB minKey type. It saves the value as a null value in the document. There is no mapping for this type, and this field is not indexed or searchable. If the same field name is used for a minKey type and later changes to a different type, such as string, OpenSearch creates a dynamic mapping for the first non-minKey value. Subsequent values can still be Amazon DocumentDB minKey values.  | Amazon DocumentDB supports [minKey type fields](https://docs.aws.amazon.com/documentdb/latest/developerguide/mongo-apis.html#mongo-apis-data-types). | 
| MaxKey |  OpenSearch can ingest documents with the Amazon DocumentDB maxKey type. It saves the value as a null value in the document. There is no mapping for this type, and this field is not indexed or searchable. If the same field name is used for a maxKey type and later changes to a different type, such as string, OpenSearch creates a dynamic mapping for the first non-maxKey value. Subsequent values can still be Amazon DocumentDB maxKey values.  | Amazon DocumentDB supports [maxKey type fields](https://docs.aws.amazon.com/documentdb/latest/developerguide/mongo-apis.html#mongo-apis-data-types). | 

We recommend that you configure a dead-letter queue (DLQ) in your OpenSearch Ingestion pipeline. If you configure one, OpenSearch Service sends to the queue all documents that fail to be ingested due to dynamic mapping failures. 
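A DLQ is declared on the `opensearch` sink in the pipeline configuration. The following is a minimal sketch; the bucket name, key prefix, and role ARN are placeholders that you replace with your own values:

```
sink:
  - opensearch:
      hosts: ["https://search-mydomain.us-east-1.es.amazonaws.com"]
      index: "my-index"
      dlq:
        s3:
          # Failed documents are written as objects under this prefix
          bucket: "my-dlq-bucket"
          key_path_prefix: "dlq-files/"
          region: "us-east-1"
          sts_role_arn: "arn:aws:iam::111122223333:role/pipeline-role"
```

The pipeline role must have permission to write objects to the specified bucket.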

If automatic mapping fails, you can use `template_type` and `template_content` in your pipeline configuration to define explicit mapping rules. Alternatively, you can create mapping templates directly in your search domain or collection before you start the pipeline. 
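For example, a sink can declare the date mapping from the preceding table inline. This is a sketch; the index name and host are placeholders:

```
sink:
  - opensearch:
      hosts: ["https://search-mydomain.us-east-1.es.amazonaws.com"]
      index: "my-index"
      template_type: "index-template"
      # Explicit mapping applied when the index is created
      template_content: |
        {
          "template": {
            "mappings": {
              "properties": {
                "myDateField": { "type": "date", "format": "epoch_second" }
              }
            }
          }
        }
      aws:
        region: "us-east-1"
```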

## Limitations
<a name="docdb-pipeline-limitations"></a>

Consider the following limitations when you set up an OpenSearch Ingestion pipeline for Amazon DocumentDB:
+ The OpenSearch Ingestion integration with Amazon DocumentDB currently doesn't support cross-Region ingestion. Your Amazon DocumentDB cluster and OpenSearch Ingestion pipeline must be in the same AWS Region.
+ The OpenSearch Ingestion integration with Amazon DocumentDB currently doesn't support cross-account ingestion. Your Amazon DocumentDB cluster and OpenSearch Ingestion pipeline must be in the same AWS account.
+ An OpenSearch Ingestion pipeline supports only one Amazon DocumentDB cluster as its source. 
+ The OpenSearch Ingestion integration with Amazon DocumentDB specifically supports Amazon DocumentDB instance-based clusters. It doesn't support Amazon DocumentDB elastic clusters.
+ The OpenSearch Ingestion integration only supports AWS Secrets Manager as an authentication mechanism for your Amazon DocumentDB cluster.
+ You can't update the existing pipeline configuration to ingest data from a different database or collection. Instead, you must create a new pipeline. 

## Recommended CloudWatch alarms
<a name="cloudwatch-metrics-docdb"></a>

For the best performance, we recommend that you use the following CloudWatch alarms when you create an OpenSearch Ingestion pipeline to access an Amazon DocumentDB cluster as a source.


| CloudWatch Alarm | Description | 
| --- | --- | 
| <pipeline-name>.documentdb.credentialsChanged | This metric indicates how often AWS secrets are rotated.  | 
| <pipeline-name>.documentdb.executorRefreshErrors | This metric indicates failures to refresh AWS secrets.  | 
| <pipeline-name>.documentdb.exportRecordsTotal |  This metric indicates the number of records exported from Amazon DocumentDB.  | 
| <pipeline-name>.documentdb.exportRecordsProcessed | This metric indicates the number of records processed by the OpenSearch Ingestion pipeline.  | 
| <pipeline-name>.documentdb.exportRecordProcessingErrors |  This metric indicates the number of processing errors in an OpenSearch Ingestion pipeline while reading data from an Amazon DocumentDB cluster.  | 
| <pipeline-name>.documentdb.exportRecordsSuccessTotal |  This metric indicates the total number of export records processed successfully.  | 
| <pipeline-name>.documentdb.exportRecordsFailedTotal |  This metric indicates the total number of export records that failed to process.  | 
| <pipeline-name>.documentdb.bytesReceived |  This metric indicates the total number of bytes received by an OpenSearch Ingestion pipeline.  | 
| <pipeline-name>.documentdb.bytesProcessed |  This metric indicates the total number of bytes processed by an OpenSearch Ingestion pipeline.  | 
| <pipeline-name>.documentdb.exportPartitionQueryTotal |  This metric indicates the total number of export partition queries.  | 
| <pipeline-name>.documentdb.streamRecordsSuccessTotal |  This metric indicates the number of records successfully processed from the stream.  | 
| <pipeline-name>.documentdb.streamRecordsFailedTotal |  This metric indicates the total number of records that failed to process from the stream.  | 

# Using an OpenSearch Ingestion pipeline with Confluent Cloud Kafka
<a name="configure-client-confluent-kafka"></a>

You can use an OpenSearch Ingestion pipeline to stream data from Confluent Cloud Kafka clusters to Amazon OpenSearch Service domains and OpenSearch Serverless collections. OpenSearch Ingestion supports both public and private network configurations for the streaming of data from Confluent Cloud Kafka clusters to domains or collections managed by OpenSearch Service or OpenSearch Serverless. 

## Connectivity to Confluent Cloud public Kafka clusters
<a name="confluent-cloud-kafka-public"></a>

You can use OpenSearch Ingestion pipelines to migrate data from a Confluent Cloud Kafka cluster with a public configuration, which means that the domain DNS name can be publicly resolved. To do so, set up an OpenSearch Ingestion pipeline with Confluent Cloud public Kafka cluster as the source and OpenSearch Service or OpenSearch Serverless as the destination. This processes your streaming data from a self-managed source cluster to an AWS-managed destination domain or collection. 

### Prerequisites
<a name="confluent-cloud-kafka-public-prereqs"></a>

Before you create your OpenSearch Ingestion pipeline, perform the following steps:

1. Create a Confluent Cloud Kafka cluster to act as a source. The cluster should contain the data that you want to ingest into OpenSearch Service.

1. Create an OpenSearch Service domain or OpenSearch Serverless collection where you want to migrate data to. For more information, see [Creating OpenSearch Service domains](createupdatedomains.md#createdomains) and [Creating collections](serverless-create.md).

1. Set up authentication on your Confluent Cloud Kafka cluster with AWS Secrets Manager. Enable secrets rotation by following the steps in [Rotate AWS Secrets Manager secrets](https://docs.aws.amazon.com/secretsmanager/latest/userguide/rotating-secrets.html).

1. Attach a [resource-based policy](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ac.html#ac-types-resource) to your domain or a [data access policy](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-data-access.html) to your collection. These access policies allow OpenSearch Ingestion to write data from your self-managed cluster to your domain or collection. 

   The following sample domain access policy allows the pipeline role, which you create in the next step, to write data to a domain. Make sure that you update the `resource` with your own ARN. 

------
#### [ JSON ]

****  

   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Principal": {
           "AWS": "arn:aws:iam::444455556666:role/pipeline-role"
         },
         "Action": [
           "es:DescribeDomain",
           "es:ESHttp*"
         ],
         "Resource": [
           "arn:aws:es:us-east-1:111122223333:domain/domain-name"
         ]
       }
     ]
   }
   ```

------

   To create an IAM role with the correct permissions to write data to the collection or domain, see [Setting up roles and users in Amazon OpenSearch Ingestion](pipeline-security-overview.md).

### Step 1: Configure the pipeline role
<a name="confluent-cloud-kafka-public-pipeline-role"></a>

After you have your Confluent Cloud Kafka cluster pipeline prerequisites set up, [configure the pipeline role](pipeline-security-overview.md#pipeline-security-sink) that you want to use in your pipeline configuration, and add permission to write to an OpenSearch Service domain or OpenSearch Serverless collection, as well as permission to read secrets from Secrets Manager.

The following permission is needed to manage the network interface: 

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:AttachNetworkInterface",
                "ec2:CreateNetworkInterface",
                "ec2:CreateNetworkInterfacePermission",
                "ec2:DeleteNetworkInterface",
                "ec2:DeleteNetworkInterfacePermission",
                "ec2:DetachNetworkInterface",
                "ec2:DescribeNetworkInterfaces"
            ],
            "Resource": [
                "arn:aws:ec2:us-east-1:111122223333:network-interface/*",
                "arn:aws:ec2:us-east-1:111122223333:subnet/*",
                "arn:aws:ec2:us-east-1:111122223333:security-group/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeDhcpOptions",
                "ec2:DescribeRouteTables",
                "ec2:DescribeSecurityGroups",
                "ec2:DescribeSubnets",
                "ec2:DescribeVpcs",
                "ec2:Describe*"
            ],
            "Resource": "arn:aws:ec2:us-east-1:111122223333:subnet/*"
        },
        { 
            "Effect": "Allow",
            "Action": [ "ec2:CreateTags" ],
            "Resource": "arn:aws:ec2:us-east-1:111122223333:network-interface/*",
            "Condition": { 
               "StringEquals": { "aws:RequestTag/OSISManaged": "true" } 
            } 
        }
    ]
}
```

------

The following permission is needed to read secrets from AWS Secrets Manager:

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "SecretsManagerReadAccess",
            "Effect": "Allow",
            "Action": ["secretsmanager:GetSecretValue"],
            "Resource": ["arn:aws:secretsmanager:us-east-1:111122223333:secret:secret-name"]
        }
    ]
}
```

------

The following domain access policy allows the pipeline role to write to an Amazon OpenSearch Service domain:

```
{
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::account-id:role/pipeline-role"
      },
      "Action": ["es:DescribeDomain", "es:ESHttp*"],
      "Resource": "arn:aws:es:region:account-id:domain/domain-name/*"
    }
  ]
}
```

### Step 2: Create the pipeline
<a name="confluent-cloud-kafka-public-pipeline"></a>

You can then configure an OpenSearch Ingestion pipeline like the following, which specifies your Confluent Cloud Kafka cluster as the source. 

You can specify multiple OpenSearch Service domains as destinations for your data. This capability enables conditional routing or replication of incoming data into multiple OpenSearch Service domains.
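Conditional routing is expressed with a `route` list in the sub-pipeline and a `routes` list on each sink. The following is a sketch; the `loglevel` field, conditions, index names, and domain endpoints are illustrative placeholders:

```
route:
  # Each entry pairs a route name with a conditional expression on the event
  - error-logs: '/loglevel == "ERROR"'
  - other-logs: '/loglevel != "ERROR"'
sink:
  - opensearch:
      hosts: ["https://search-errors-domain.us-east-1.es.amazonaws.com"]
      index: "error-logs"
      routes: ["error-logs"]
      aws:
        region: "us-east-1"
  - opensearch:
      hosts: ["https://search-all-domain.us-east-1.es.amazonaws.com"]
      index: "all-logs"
      routes: ["other-logs"]
      aws:
        region: "us-east-1"
```

A sink with no `routes` entry receives all events, so you can combine routed and replicated destinations in one pipeline.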

You can also migrate data from a source Confluent Kafka cluster to an OpenSearch Serverless VPC collection. Ensure you provide a network access policy within the pipeline configuration. You can use a Confluent schema registry to define a Confluent schema.

```
version: "2"
kafka-pipeline:
  source:
    kafka:
      encryption:
        type: "ssl"
      topics:
        - name: "topic-name"
          group_id: "group-id"
      bootstrap_servers:
        - "bootstrap-server.us-east-1.aws.private.confluent.cloud:9092"
      authentication:
        sasl:
          plain:
            username: ${aws_secrets:confluent-kafka-secret:username}
            password: ${aws_secrets:confluent-kafka-secret:password}
      schema:
        type: confluent
        registry_url: https://my-registry.us-east-1.aws.confluent.cloud
        api_key: "${{aws_secrets:schema-secret:schema_registry_api_key}}"
        api_secret: "${{aws_secrets:schema-secret:schema_registry_api_secret}}"
        basic_auth_credentials_source: "USER_INFO"
  sink:
  - opensearch:
      hosts: ["https://search-mydomain.us-east-1.es.amazonaws.com"]
      aws:
          region: "us-east-1"
extension:
  aws:
    secrets:
      confluent-kafka-secret:
        secret_id: "my-kafka-secret"
        region: "us-east-1"
      schema-secret:
        secret_id: "my-self-managed-kafka-schema"
        region: "us-east-1"
```

You can use a preconfigured blueprint to create this pipeline. For more information, see [Working with blueprints](pipeline-blueprint.md).

### Connectivity to Confluent Cloud Kafka clusters in a VPC
<a name="confluent-cloud-kafka-private"></a>

You can also use OpenSearch Ingestion pipelines to migrate data from a Confluent Cloud Kafka cluster running in a VPC. To do so, set up an OpenSearch Ingestion pipeline with a Confluent Cloud Kafka cluster as a source and OpenSearch Service or OpenSearch Serverless as the destination. This processes your streaming data from a Confluent Cloud Kafka source cluster to an AWS-managed destination domain or collection. 

OpenSearch Ingestion supports Confluent Cloud Kafka clusters configured in any of the network modes that Confluent supports. The following network configurations are supported as a source in OpenSearch Ingestion:
+ AWS VPC peering
+ AWS PrivateLink for dedicated clusters
+ AWS PrivateLink for Enterprise clusters
+ AWS Transit Gateway

#### Prerequisites
<a name="confluent-cloud-kafka-private-prereqs"></a>

Before you create your OpenSearch Ingestion pipeline, perform the following steps:

1. Create a Confluent Cloud Kafka cluster with a VPC network configuration that contains the data you want to ingest into OpenSearch Service. 

1. Create an OpenSearch Service domain or OpenSearch Serverless collection where you want to migrate data to. For more information, see [Creating OpenSearch Service domains](createupdatedomains.md#createdomains) and [Creating collections](serverless-create.md).

1. Set up authentication on your Confluent Cloud Kafka cluster with AWS Secrets Manager. Enable secrets rotation by following the steps in [Rotate AWS Secrets Manager secrets](https://docs.aws.amazon.com/secretsmanager/latest/userguide/rotating-secrets.html).

1. Obtain the ID of the VPC that has access to the Confluent Cloud Kafka cluster. Choose the VPC CIDR to be used by OpenSearch Ingestion.
**Note**  
If you're using the AWS Management Console to create your pipeline, you must also attach your OpenSearch Ingestion pipeline to your VPC in order to use a Confluent Cloud Kafka cluster. To do so, find the **Network configuration** section, select the **Attach to VPC** checkbox, and choose your CIDR from one of the provided default options, or select your own. You can use any CIDR from a private address space as defined in the [RFC 1918 Best Current Practice](https://datatracker.ietf.org/doc/html/rfc1918).  
To provide a custom CIDR, select **Other** from the dropdown menu. To avoid a collision in IP addresses between OpenSearch Ingestion and self-managed OpenSearch, ensure that the self-managed OpenSearch VPC CIDR is different from the CIDR for OpenSearch Ingestion.

1. Attach a [resource-based policy](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ac.html#ac-types-resource) to your domain or a [data access policy](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-data-access.html) to your collection. These access policies allow OpenSearch Ingestion to write data from your self-managed cluster to your domain or collection. 
**Note**  
If you are using AWS PrivateLink to connect to your Confluent Cloud Kafka cluster, you must configure [VPC DHCP options](https://docs.aws.amazon.com/vpc/latest/userguide/VPC_DHCP_Options.html). *DNS hostnames* and *DNS resolution* must be enabled.  
Specifically, use the following option set values:  
**Enterprise clusters:**  

   ```
   domain-name: aws.private.confluent.cloud
   domain-name-servers: AmazonProvidedDNS
   ```
**Dedicated clusters:**  

   ```
   domain-name: aws.confluent.cloud
   domain-name-servers: AmazonProvidedDNS
   ```
This change ensures that DNS resolution for the Confluent PrivateLink endpoint works correctly within the VPC.

   The following sample domain access policy allows the pipeline role, which you create in the next step, to write data to a domain. Make sure that you update the `resource` with your own ARN. 

------
#### [ JSON ]

****  

   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Principal": {
           "AWS": "arn:aws:iam::444455556666:role/pipeline-role"
         },
         "Action": [
           "es:DescribeDomain",
           "es:ESHttp*"
         ],
         "Resource": [
           "arn:aws:es:us-east-1:111122223333:domain/domain-name"
         ]
       }
     ]
   }
   ```

------

   To create an IAM role with the correct permissions to write data to the collection or domain, see [Setting up roles and users in Amazon OpenSearch Ingestion](pipeline-security-overview.md).

#### Step 1: Configure the pipeline role
<a name="confluent-cloud-kafka-private-pipeline-role"></a>

After you have your pipeline prerequisites set up, [configure the pipeline role](pipeline-security-overview.md#pipeline-security-sink) that you want to use in your pipeline configuration, and add the following permissions in the role:

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "SecretsManagerReadAccess",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetSecretValue"
            ],
            "Resource": ["arn:aws:secretsmanager:us-east-1:111122223333:secret:secret-name"]
        },
        {
            "Effect": "Allow",
            "Action": [
                "ec2:AttachNetworkInterface",
                "ec2:CreateNetworkInterface",
                "ec2:CreateNetworkInterfacePermission",
                "ec2:DeleteNetworkInterface",
                "ec2:DeleteNetworkInterfacePermission",
                "ec2:DetachNetworkInterface",
                "ec2:DescribeNetworkInterfaces"
            ],
            "Resource": [
                "arn:aws:ec2:*:*:network-interface/*",
                "arn:aws:ec2:*:*:subnet/*",
                "arn:aws:ec2:*:*:security-group/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeDhcpOptions",
                "ec2:DescribeRouteTables",
                "ec2:DescribeSecurityGroups",
                "ec2:DescribeSubnets",
                "ec2:DescribeVpcs",
                "ec2:Describe*"
            ],
            "Resource": "*"
        },
        { 
            "Effect": "Allow",
            "Action": [ 
                "ec2:CreateTags"
            ],
            "Resource": "arn:aws:ec2:*:*:network-interface/*",
            "Condition": { 
               "StringEquals": 
                    {
                        "aws:RequestTag/OSISManaged": "true"
                    } 
            } 
        }
    ]
}
```

------

You must provide the above Amazon EC2 permissions on the IAM role that you use to create the OpenSearch Ingestion pipeline because the pipeline uses these permissions to create and delete a network interface in your VPC. The pipeline can only access the Kafka cluster through this network interface.

#### Step 2: Create the pipeline
<a name="self-managed-kafka-private-pipeline"></a>

You can then configure an OpenSearch Ingestion pipeline like the following, which specifies Kafka as the source.

You can specify multiple OpenSearch Service domains as destinations for your data. This capability enables conditional routing or replication of incoming data into multiple OpenSearch Service domains.

You can also migrate data from a source Confluent Kafka cluster to an OpenSearch Serverless VPC collection. Ensure you provide a network access policy within the pipeline configuration. You can use a Confluent schema registry to define a Confluent schema.

```
version: "2"
kafka-pipeline:
  source:
    kafka:
      encryption:
        type: "ssl"
      topics:
        - name: "topic-name"
          group_id: "group-id"
      bootstrap_servers:
        - "bootstrap-server.us-east-1.aws.private.confluent.cloud:9092"
      authentication:
        sasl:
          plain:
            username: ${aws_secrets:confluent-kafka-secret:username}
            password: ${aws_secrets:confluent-kafka-secret:password}
      schema:
        type: confluent
        registry_url: https://my-registry.us-east-1.aws.confluent.cloud
        api_key: "${{aws_secrets:schema-secret:schema_registry_api_key}}"
        api_secret: "${{aws_secrets:schema-secret:schema_registry_api_secret}}"
        basic_auth_credentials_source: "USER_INFO"
  sink:
  - opensearch:
      hosts: ["https://search-mydomain.us-east-1.es.amazonaws.com"]
      aws:
          region: "us-east-1"
      index: "confluent-index"
extension:
  aws:
    secrets:
      confluent-kafka-secret:
        secret_id: "my-kafka-secret"
        region: "us-east-1"
      schema-secret:
        secret_id: "my-self-managed-kafka-schema"
        region: "us-east-2"
```

# Using an OpenSearch Ingestion pipeline with Amazon Managed Streaming for Apache Kafka
<a name="configure-client-msk"></a>

You can use the [Kafka plugin](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/kafka/) to ingest data from [Amazon Managed Streaming for Apache Kafka](https://docs.aws.amazon.com/msk/latest/developerguide/) (Amazon MSK) into your OpenSearch Ingestion pipeline. With Amazon MSK, you can build and run applications that use Apache Kafka to process streaming data. OpenSearch Ingestion uses AWS PrivateLink to connect to Amazon MSK. You can ingest data from both Amazon MSK and Amazon MSK Serverless clusters. The only difference between the two processes is the prerequisite steps you must take before you set up your pipeline.

**Topics**
+ [Provisioned Amazon MSK prerequisites](#msk-prereqs)
+ [Amazon MSK Serverless prerequisites](#msk-serverless-prereqs)
+ [Step 1: Configure a pipeline role](#msk-pipeline-role)
+ [Step 2: Create the pipeline](#msk-pipeline)
+ [Step 3: (Optional) Use the AWS Glue Schema Registry](#msk-glue)
+ [Step 4: (Optional) Configure recommended compute units (OCUs) for the Amazon MSK pipeline](#msk-ocu)

## Provisioned Amazon MSK prerequisites
<a name="msk-prereqs"></a>

Before you create your OpenSearch Ingestion pipeline, perform the following steps:

1. Create an Amazon MSK provisioned cluster by following the steps in [Creating a cluster](https://docs.aws.amazon.com/msk/latest/developerguide/msk-create-cluster.html#create-cluster-console) in the *Amazon Managed Streaming for Apache Kafka Developer Guide*. For **Broker type**, choose any option except for `t3` types, as these aren't supported by OpenSearch Ingestion.

1. After the cluster has an **Active** status, follow the steps in [Turn on multi-VPC connectivity](https://docs.aws.amazon.com/msk/latest/developerguide/aws-access-mult-vpc.html#mvpc-cluster-owner-action-turn-on).

1. Follow the steps in [Attach a cluster policy to the MSK cluster](https://docs.aws.amazon.com/msk/latest/developerguide/aws-access-mult-vpc.html#mvpc-cluster-owner-action-policy) to attach one of the following policies, depending on whether your cluster and pipeline are in the same AWS account. This policy allows OpenSearch Ingestion to create an AWS PrivateLink connection to your Amazon MSK cluster and read data from Kafka topics. Make sure that you update the `resource` with your own ARN. 

   The following policy applies when your cluster and pipeline are in the same AWS account:

------
#### [ JSON ]

****  

   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Principal": {
           "Service": "osis.amazonaws.com"
         },
         "Action": [
           "kafka:CreateVpcConnection",
           "kafka:DescribeClusterV2"
         ],
         "Resource": "arn:aws:kafka:us-east-1:111122223333:cluster/cluster-name/cluster-id"
       },
       {
         "Effect": "Allow",
         "Principal": {
           "Service": "osis-pipelines.amazonaws.com"
         },
         "Action": [
           "kafka:CreateVpcConnection",
           "kafka:GetBootstrapBrokers",
           "kafka:DescribeClusterV2"
         ],
         "Resource": "arn:aws:kafka:us-east-1:111122223333:cluster/cluster-name/cluster-id"
       }
     ]
   }
   ```

------

   If your Amazon MSK cluster is in a different AWS account than your pipeline, attach the following policy instead. Note that cross-account access is only possible with provisioned Amazon MSK clusters and not Amazon MSK Serverless clusters. The ARN for the AWS `principal` should be the ARN for the same pipeline role that you provide to your pipeline configuration:

------
#### [ JSON ]

****  

   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Principal": {
           "Service": "osis.amazonaws.com"
         },
         "Action": [
           "kafka:CreateVpcConnection",
           "kafka:DescribeClusterV2"
         ],
         "Resource": "arn:aws:kafka:us-east-1:111122223333:cluster/cluster-name/cluster-id"
       },
       {
         "Effect": "Allow",
         "Principal": {
           "Service": "osis-pipelines.amazonaws.com"
         },
         "Action": [
           "kafka:CreateVpcConnection",
           "kafka:GetBootstrapBrokers",
           "kafka:DescribeClusterV2"
         ],
         "Resource": "arn:aws:kafka:us-east-1:111122223333:cluster/cluster-name/cluster-id"
       },
       {
         "Effect": "Allow",
         "Principal": {
           "AWS": "arn:aws:iam::444455556666:role/pipeline-role"
         },
         "Action": [
           "kafka-cluster:*",
           "kafka:*"
         ],
         "Resource": [
           "arn:aws:kafka:us-east-1:111122223333:cluster/cluster-name/cluster-id",
           "arn:aws:kafka:us-east-1:111122223333:topic/cluster-name/cluster-id/*",
           "arn:aws:kafka:us-east-1:111122223333:group/cluster-name/*"
         ]
       }
     ]
   }
   ```

------

1. Create a Kafka topic by following the steps in [Create a topic](https://docs.aws.amazon.com/msk/latest/developerguide/create-topic.html). Make sure that `BootstrapServerString` is one of the private endpoint (single-VPC) bootstrap URLs. The value for `--replication-factor` should be `2` or `3`, based on the number of zones your Amazon MSK cluster has. The value for `--partitions` should be at least `10`.
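The topic-creation step above can be sketched with the Apache Kafka client tools. This is a command template, not a runnable example: the bootstrap string, client properties file, and topic name are placeholders, and the counts follow the guidance in the step above:

```
# Placeholder values; use one of your private (single-VPC) bootstrap URLs
# and replication/partition counts appropriate for your cluster.
bin/kafka-topics.sh --create \
  --bootstrap-server "$BootstrapServerString" \
  --command-config client.properties \
  --replication-factor 3 \
  --partitions 10 \
  --topic ingestion-topic
```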

1. Produce and consume data by following the steps in [Produce and consume data](https://docs.aws.amazon.com/msk/latest/developerguide/produce-consume.html). Again, make sure that `BootstrapServerString` is one of your private endpoint (single-VPC) bootstrap URLs.

## Amazon MSK Serverless prerequisites
<a name="msk-serverless-prereqs"></a>

Before you create your OpenSearch Ingestion pipeline, perform the following steps:

1. Create an Amazon MSK Serverless cluster by following the steps in [Create an MSK Serverless cluster](https://docs.aws.amazon.com/msk/latest/developerguide/create-serverless-cluster.html#) in the *Amazon Managed Streaming for Apache Kafka Developer Guide*.

1. After the cluster has an **Active** status, follow the steps in [Attach a cluster policy to the MSK cluster](https://docs.aws.amazon.com/msk/latest/developerguide/aws-access-mult-vpc.html#mvpc-cluster-owner-action-policy) to attach the following policy. Make sure that you update the `resource` with your own ARN. 

------
#### [ JSON ]

****  

   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Principal": {
           "Service": "osis.amazonaws.com"
         },
         "Action": [
           "kafka:CreateVpcConnection",
           "kafka:DescribeClusterV2"
         ],
         "Resource": "arn:aws:kafka:us-east-1:111122223333:cluster/cluster-name/cluster-id"
       },
       {
         "Effect": "Allow",
         "Principal": {
           "Service": "osis-pipelines.amazonaws.com"
         },
         "Action": [
           "kafka:CreateVpcConnection",
           "kafka:GetBootstrapBrokers",
           "kafka:DescribeClusterV2"
         ],
         "Resource": "arn:aws:kafka:us-east-1:111122223333:cluster/cluster-name/cluster-id"
       }
     ]
   }
   ```

------

   This policy allows OpenSearch Ingestion to create an AWS PrivateLink connection to your Amazon MSK Serverless cluster and read data from Kafka topics. Your cluster and pipeline must be in the same AWS account, because Amazon MSK Serverless doesn't support cross-account access.

1. Create a Kafka topic by following the steps in [Create a topic](https://docs.aws.amazon.com/msk/latest/developerguide/msk-serverless-create-topic.html). Make sure that `BootstrapServerString` is one of your Simple Authentication and Security Layer (SASL) IAM bootstrap URLs. The value for `--replication-factor` should be `2` or `3`, based on the number of zones your Amazon MSK Serverless cluster has. The value for `--partitions` should be at least `10`.

1. Produce and consume data by following the steps in [Produce and consume data](https://docs.aws.amazon.com/msk/latest/developerguide/msk-serverless-produce-consume.html). Again, make sure that `BootstrapServerString` is one of your Simple Authentication and Security Layer (SASL) IAM bootstrap URLs.
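You can also script the cluster policy attachment from step 2. The following sketch uses the Kafka `PutClusterPolicy` API through boto3; the ARN is a placeholder, the policy is trimmed to the pipeline service principal, and the standard `osis-pipelines.amazonaws.com` principal is assumed. The API call is skipped gracefully when boto3 or credentials aren't available.

```
import json

# Placeholder ARN and account ID; substitute your own values.
cluster_arn = "arn:aws:kafka:us-east-1:111122223333:cluster/cluster-name/cluster-id"
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "osis-pipelines.amazonaws.com"},
            "Action": [
                "kafka:CreateVpcConnection",
                "kafka:GetBootstrapBrokers",
                "kafka:DescribeClusterV2",
            ],
            "Resource": cluster_arn,
        }
    ],
}

try:
    import boto3
    # PutClusterPolicy attaches (or replaces) the cluster-level resource policy.
    boto3.client("kafka").put_cluster_policy(
        ClusterArn=cluster_arn, Policy=json.dumps(policy)
    )
except Exception as exc:  # no boto3 or no credentials in this environment
    print(f"Skipped API call: {exc}")
```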

## Step 1: Configure a pipeline role
<a name="msk-pipeline-role"></a>

After you have your Amazon MSK provisioned or serverless cluster set up, add the following Kafka permissions to the pipeline role that you want to use in your pipeline configuration:

------
#### [ JSON ]

****  

```
{
    "Version":"2012-10-17",		 	 	 
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "kafka-cluster:Connect",
                "kafka-cluster:AlterCluster",
                "kafka-cluster:DescribeCluster",
                "kafka:DescribeClusterV2",
                "kafka:GetBootstrapBrokers"
            ],
            "Resource": [
                "arn:aws:kafka:us-east-1:111122223333:cluster/cluster-name/cluster-id"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "kafka-cluster:*Topic*",
                "kafka-cluster:ReadData"
            ],
            "Resource": [
                "arn:aws:kafka:us-east-1:111122223333:topic/cluster-name/cluster-id/topic-name"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "kafka-cluster:AlterGroup",
                "kafka-cluster:DescribeGroup"
            ],
            "Resource": [
                "arn:aws:kafka:us-east-1:111122223333:group/cluster-name/*"
            ]
        }
    ]
}
```

------
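Attaching these permissions to an existing pipeline role can be scripted with the IAM `PutRolePolicy` API. A minimal sketch follows; the role and policy names are hypothetical, and the statement is abbreviated for brevity (use the full policy shown above in practice).

```
import json

# Hypothetical role and policy names; the statement is trimmed for brevity.
role_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "kafka-cluster:Connect",
                "kafka-cluster:DescribeCluster",
                "kafka-cluster:*Topic*",
                "kafka-cluster:ReadData",
            ],
            "Resource": "arn:aws:kafka:us-east-1:111122223333:*/cluster-name/*",
        }
    ],
}

try:
    import boto3
    # Embeds the policy inline on the role used in the pipeline configuration.
    boto3.client("iam").put_role_policy(
        RoleName="my-osis-pipeline-role",   # hypothetical
        PolicyName="msk-source-access",     # hypothetical
        PolicyDocument=json.dumps(role_policy),
    )
except Exception as exc:  # no boto3 or no credentials in this environment
    print(f"Skipped API call: {exc}")
```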

## Step 2: Create the pipeline
<a name="msk-pipeline"></a>

You can then configure an OpenSearch Ingestion pipeline like the following, which specifies Kafka as the source:

```
version: "2"
log-pipeline:
  source:
    kafka:
      acknowledgements: true
      topics:
      - name: "topic-name"
        group_id: "group-id"
      aws:
        msk:
          arn: "arn:aws:kafka:region:account-id:cluster/cluster-name/cluster-id"
        region: "us-west-2"
  processor:
  - grok:
      match:
        message:
        - "%{COMMONAPACHELOG}"
  - date:
      destination: "@timestamp"
      from_time_received: true
  sink:
  - opensearch:
      hosts: ["https://search-domain-endpoint.us-east-1.es.amazonaws.com"]
      index: "index_name"
      aws_region: "region"
      aws_sigv4: true
```

You can use a preconfigured Amazon MSK blueprint to create this pipeline. For more information, see [Working with blueprints](pipeline-blueprint.md).
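You can also create the pipeline programmatically with the OpenSearch Ingestion (`osis`) `CreatePipeline` API. The sketch below uses a minimal stand-in configuration and hypothetical pipeline name and capacity values; in practice, pass your full Step 2 YAML.

```
# Minimal stand-in configuration; in practice, read your full Step 2 YAML from disk.
pipeline_body = """\
version: "2"
log-pipeline:
  source:
    kafka:
      topics:
        - name: "topic-name"
          group_id: "group-id"
      aws:
        msk:
          arn: "arn:aws:kafka:region:account-id:cluster/cluster-name/cluster-id"
        region: "us-west-2"
  sink:
    - opensearch:
        hosts: ["https://search-domain-endpoint.us-east-1.es.amazonaws.com"]
        index: "index_name"
        aws_sigv4: true
"""

try:
    import boto3
    boto3.client("osis").create_pipeline(
        PipelineName="msk-log-pipeline",   # hypothetical
        MinUnits=1,
        MaxUnits=4,
        PipelineConfigurationBody=pipeline_body,
    )
except Exception as exc:  # no boto3 or no credentials in this environment
    print(f"Skipped API call: {exc}")
```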

## Step 3: (Optional) Use the AWS Glue Schema Registry
<a name="msk-glue"></a>

When you use OpenSearch Ingestion with Amazon MSK, you can use the AVRO data format for schemas hosted in the AWS Glue Schema Registry. With the [AWS Glue Schema Registry](https://docs.aws.amazon.com/glue/latest/dg/schema-registry.html), you can centrally discover, control, and evolve data stream schemas. 

To use this option, enable the schema `type` in your pipeline configuration:

```
schema:
  type: "aws_glue"
```

You must also provide AWS Glue with read access permissions in your pipeline role. You can use the AWS managed policy called [AWSGlueSchemaRegistryReadonlyAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AWSGlueSchemaRegistryReadonlyAccess.html). Additionally, your registry must be in the same AWS account and Region as your OpenSearch Ingestion pipeline.
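In a full configuration, the `schema` block sits alongside the other options of the `kafka` source. A sketch with placeholder names:

```
log-pipeline:
  source:
    kafka:
      topics:
        - name: "topic-name"
          group_id: "group-id"
      schema:
        type: "aws_glue"
      aws:
        msk:
          arn: "arn:aws:kafka:region:account-id:cluster/cluster-name/cluster-id"
        region: "us-east-1"
```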

## Step 4: (Optional) Configure recommended compute units (OCUs) for the Amazon MSK pipeline
<a name="msk-ocu"></a>

Each compute unit has one consumer per topic. Brokers balance partitions among these consumers for a given topic. However, when the number of partitions is greater than the number of consumers, Amazon MSK hosts multiple partitions on every consumer. OpenSearch Ingestion has built-in auto scaling to scale up or down based on CPU usage or number of pending records in the pipeline. 

For optimal performance, distribute your partitions across many compute units for parallel processing. If a topic has a large number of partitions (for example, more than 96, which is the maximum number of OCUs per pipeline), we recommend that you configure the pipeline with 1–96 OCUs, because it automatically scales as needed. If a topic has fewer than 96 partitions, set the maximum number of compute units equal to the number of partitions. 

When a pipeline has more than one topic, choose the topic with the highest number of partitions as the reference for configuring the maximum number of compute units. By adding another pipeline with a new set of OCUs to the same topic and consumer group, you can scale throughput almost linearly.
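The sizing guidance above can be sketched as a small helper: cap the maximum OCUs at the partition count of the busiest topic, up to the 96-OCU pipeline limit.

```
# Sketch of the OCU sizing guidance above (not an official formula):
# cap the maximum OCUs at the partition count of the busiest topic,
# up to the 96-OCU per-pipeline limit.
def recommended_max_ocus(partition_counts):
    """partition_counts: partitions per topic consumed by the pipeline."""
    busiest = max(partition_counts)
    return min(busiest, 96)

print(recommended_max_ocus([12, 48]))  # busiest topic has 48 partitions -> 48
print(recommended_max_ocus([200]))     # more partitions than the limit -> 96
```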

# Using an OpenSearch Ingestion pipeline with Amazon RDS
<a name="configure-client-rds"></a>

You can use an OpenSearch Ingestion pipeline with Amazon RDS to export existing data and stream changes (such as create, update, and delete) to Amazon OpenSearch Service domains and collections. The OpenSearch Ingestion pipeline incorporates change data capture (CDC) infrastructure to provide a high-scale, low-latency way to continuously stream data from Amazon RDS. RDS for MySQL and RDS for PostgreSQL are supported.

There are two ways that you can use Amazon RDS as a source to process data—with or without a full initial snapshot. A full initial snapshot is a snapshot of the specified tables that is exported to Amazon S3. From there, an OpenSearch Ingestion pipeline sends it to one index in a domain, or partitions it to multiple indexes in a domain. To keep the data in Amazon RDS and OpenSearch consistent, the pipeline syncs all of the create, update, and delete events in the tables in Amazon RDS instances with the documents saved in the OpenSearch index or indexes.

When you use a full initial snapshot, your OpenSearch Ingestion pipeline first ingests the snapshot and then starts reading data from Amazon RDS change streams. It eventually catches up and maintains near real-time data consistency between Amazon RDS and OpenSearch. 

You can also use the OpenSearch Ingestion integration with Amazon RDS to track change data capture and ingest all updates in an Amazon RDS instance to OpenSearch. Choose this option if you already have a full snapshot from some other mechanism, or if you just want to capture all changes to the data in an Amazon RDS instance. 

When you choose this option, you need to [configure Amazon RDS for MySQL binary logging](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_LogAccess.MySQL.BinaryFormat.html) or [set up logical replication for an Amazon RDS for PostgreSQL DB instance](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Appendix.PostgreSQL.CommonDBATasks.pglogical.setup-replication.html). 

**Topics**
+ [RDS for MySQL](rds-mysql.md)
+ [RDS for PostgreSQL](rds-PostgreSQL.md)

# RDS for MySQL
<a name="rds-mysql"></a>

Complete the following steps to configure an OpenSearch Ingestion pipeline with Amazon RDS for MySQL.

**Topics**
+ [RDS for MySQL prerequisites](#rds-mysql-prereqs)
+ [Step 1: Configure the pipeline role](#rds-mysql-pipeline-role)
+ [Step 2: Create the pipeline](#rds-mysql-pipeline)
+ [Data consistency](#rds-mysql-pipeline-consistency)
+ [Mapping data types](#rds-mysql-pipeline-mapping)
+ [Limitations](#rds-mysql-pipeline-limitations)
+ [Recommended CloudWatch Alarms](#aurora-mysql-pipeline-metrics)

## RDS for MySQL prerequisites
<a name="rds-mysql-prereqs"></a>

Before you create your OpenSearch Ingestion pipeline, perform the following steps:

1. Create a custom DB parameter group in Amazon RDS to configure binary logging and set the following parameters.

   ```
   binlog_format=ROW
   binlog_row_image=full
   binlog_row_metadata=FULL
   ```

   Additionally, make sure the `binlog_row_value_options` parameter is not set to `PARTIAL_JSON`.

   For more information, see [Configuring RDS for MySQL binary logging](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_LogAccess.MySQL.BinaryFormat.html).

1. [Select or create an RDS for MySQL DB instance](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_CreateDBInstance.html) and associate the parameter group created in the previous step with the DB instance.

1. Verify that automated backups are enabled on the database. For more information, see [Enabling automated backups](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_WorkingWithAutomatedBackups.Enabling.html). 

1. Configure binary log retention with enough time for replication to occur, for example 24 hours. For more information, see [Setting and showing binary log configuration](https://docs.aws.amazon.com//AmazonRDS/latest/UserGuide/mysql-stored-proc-configuring.html) in the *Amazon RDS User Guide*.

1. Set up username and password authentication on your Amazon RDS instance using [password management with Amazon RDS and AWS Secrets Manager](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/rds-secrets-manager.html). You can also create a username/password combination by [creating a Secrets Manager secret](https://docs.aws.amazon.com/secretsmanager/latest/userguide/create_secret.html).

1. If you use the full initial snapshot feature, create an AWS KMS key and an IAM role for exporting data from Amazon RDS to Amazon S3.

   The IAM role should have the following permission policy:

------
#### [ JSON ]

****  

   ```
   {
       "Version":"2012-10-17",		 	 	 
       "Statement": [
           {
               "Sid": "ExportPolicy",
               "Effect": "Allow",
               "Action": [
                   "s3:PutObject*",
                   "s3:ListBucket",
                   "s3:GetObject*",
                   "s3:DeleteObject*",
                   "s3:GetBucketLocation"
               ],
               "Resource": [
                   "arn:aws:s3:::s3-bucket-used-in-pipeline",
                   "arn:aws:s3:::s3-bucket-used-in-pipeline/*"
               ]
           }
       ]
   }
   ```

------

   The role should also have the following trust relationships:

------
#### [ JSON ]

****  

   ```
   {
       "Version":"2012-10-17",		 	 	 
       "Statement": [
           {
               "Effect": "Allow",
               "Principal": {
                   "Service": "export.rds.amazonaws.com"
               },
               "Action": "sts:AssumeRole"
           }
       ]
   }
   ```

------

1. Select or create an OpenSearch Service domain or OpenSearch Serverless collection. For more information, see [Creating OpenSearch Service domains](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/createupdatedomains.html#createdomains) and [Creating collections](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-manage.html#serverless-create).

1. Attach a [resource-based policy](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ac.html#ac-types-resource) to your domain or a [data access policy](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-data-access.html) to your collection. These access policies allow OpenSearch Ingestion to write data from your Amazon RDS DB instance to your domain or collection.
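Creating and populating the custom parameter group from step 1 can also be scripted with the RDS API. The sketch below uses a hypothetical group name; choose the parameter group family that matches your engine version, then associate the group with your DB instance (you may need to reboot the instance for the parameters to take effect).

```
# Hypothetical parameter group name; pick the family matching your engine version.
binlog_parameters = [
    {"ParameterName": "binlog_format", "ParameterValue": "ROW", "ApplyMethod": "immediate"},
    {"ParameterName": "binlog_row_image", "ParameterValue": "full", "ApplyMethod": "immediate"},
    {"ParameterName": "binlog_row_metadata", "ParameterValue": "FULL", "ApplyMethod": "immediate"},
]

try:
    import boto3
    rds = boto3.client("rds")
    rds.create_db_parameter_group(
        DBParameterGroupName="osis-binlog-mysql8",   # hypothetical
        DBParameterGroupFamily="mysql8.0",
        Description="Binary logging settings for OpenSearch Ingestion CDC",
    )
    rds.modify_db_parameter_group(
        DBParameterGroupName="osis-binlog-mysql8",
        Parameters=binlog_parameters,
    )
except Exception as exc:  # no boto3 or no credentials in this environment
    print(f"Skipped API calls: {exc}")
```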

## Step 1: Configure the pipeline role
<a name="rds-mysql-pipeline-role"></a>

After you have your Amazon RDS pipeline prerequisites set up, [configure the pipeline role](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/pipeline-security-overview.html#pipeline-security-sink) to use in your pipeline configuration. Also add the following permissions for the Amazon RDS source to the role:

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "allowReadingFromS3Buckets",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:DeleteObject",
                "s3:GetBucketLocation",
                "s3:ListBucket",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::s3_bucket",
                "arn:aws:s3:::s3_bucket/*"
            ]
        },
        {
            "Sid": "allowNetworkInterfacesActions",
            "Effect": "Allow",
            "Action": [
                "ec2:AttachNetworkInterface",
                "ec2:CreateNetworkInterface",
                "ec2:CreateNetworkInterfacePermission",
                "ec2:DeleteNetworkInterface",
                "ec2:DeleteNetworkInterfacePermission",
                "ec2:DetachNetworkInterface",
                "ec2:DescribeNetworkInterfaces"
            ],
            "Resource": [
                "arn:aws:ec2:*:111122223333:network-interface/*",
                "arn:aws:ec2:*:111122223333:subnet/*",
                "arn:aws:ec2:*:111122223333:security-group/*"
            ]
        },
        {
            "Sid": "allowDescribeEC2",
            "Effect": "Allow",
            "Action": [
                "ec2:Describe*"
            ],
            "Resource": "*"
        },
        {
            "Sid": "allowTagCreation",
            "Effect": "Allow",
            "Action": [
                "ec2:CreateTags"
            ],
            "Resource": "arn:aws:ec2:*:111122223333:network-interface/*",
            "Condition": {
                "StringEquals": {
                    "aws:RequestTag/OSISManaged": "true"
                }
            }
        },
        {
            "Sid": "AllowDescribeInstances",
            "Effect": "Allow",
            "Action": [
                "rds:DescribeDBInstances"
            ],
            "Resource": [
                "arn:aws:rds:us-east-2:111122223333:db:*"
            ]
        },
        {
            "Sid": "AllowSnapshots",
            "Effect": "Allow",
            "Action": [
                "rds:DescribeDBSnapshots",
                "rds:CreateDBSnapshot",
                "rds:AddTagsToResource"
            ],
            "Resource": [
                "arn:aws:rds:us-east-2:111122223333:db:DB-id",
                "arn:aws:rds:us-east-2:111122223333:snapshot:DB-id*"
            ]
        },
        {
            "Sid": "AllowExport",
            "Effect": "Allow",
            "Action": [
                "rds:StartExportTask"
            ],
            "Resource": [
                "arn:aws:rds:us-east-2:111122223333:snapshot:DB-id*"
            ]
        },
        {
            "Sid": "AllowDescribeExports",
            "Effect": "Allow",
            "Action": [
                "rds:DescribeExportTasks"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:RequestedRegion": "us-east-2",
                    "aws:ResourceAccount": "111122223333"
                }
            }
        },
        {
            "Sid": "AllowAccessToKmsForExport",
            "Effect": "Allow",
            "Action": [
                "kms:Decrypt",
                "kms:Encrypt",
                "kms:DescribeKey",
                "kms:RetireGrant",
                "kms:CreateGrant",
                "kms:ReEncrypt*",
                "kms:GenerateDataKey*"
            ],
            "Resource": [
                "arn:aws:kms:us-east-2:111122223333:key/export-key-id"
            ]
        },
        {
            "Sid": "AllowPassingExportRole",
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": [
                "arn:aws:iam::111122223333:role/export-role"
            ]
        },
        {
            "Sid": "SecretsManagerReadAccess",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetSecretValue"
            ],
            "Resource": [
                "arn:aws:secretsmanager:*:111122223333:secret:*"
            ]
        }
    ]
}
```

------

## Step 2: Create the pipeline
<a name="rds-mysql-pipeline"></a>

Configure an OpenSearch Ingestion pipeline similar to the following. The example pipeline specifies an Amazon RDS instance as the source. 

```
version: "2"
rds-mysql-pipeline:
  source:
    rds:
      db_identifier: "instance-id"
      engine: mysql
      database: "database-name"
      tables:
        include:
          - "table1"
          - "table2"
      s3_bucket: "bucket-name"
      s3_region: "bucket-region"
      s3_prefix: "prefix-name"
      export:
        kms_key_id: "kms-key-id"
        iam_role_arn: "export-role-arn"
      stream: true
      aws:
        sts_role_arn: "arn:aws:iam::account-id:role/pipeline-role"
        region: "us-east-1"
      authentication:
        username: ${{aws_secrets:secret:username}}
        password: ${{aws_secrets:secret:password}}
  sink:
    - opensearch:
        hosts: ["https://search-mydomain.us-east-1.es.amazonaws.com"]
        index: "${getMetadata(\"table_name\")}"
        index_type: custom
        document_id: "${getMetadata(\"primary_key\")}"
        action: "${getMetadata(\"opensearch_action\")}"
        document_version: "${getMetadata(\"document_version\")}"
        document_version_type: "external"
        aws:
          sts_role_arn: "arn:aws:iam::account-id:role/pipeline-role"
          region: "us-east-1"
extension:
  aws:
    secrets:
      secret:
        secret_id: "rds-secret-id"
        region: "us-east-1"
        sts_role_arn: "arn:aws:iam::account-id:role/pipeline-role"
        refresh_interval: PT1H
```

You can use a preconfigured Amazon RDS blueprint to create this pipeline. For more information, see [Working with blueprints](pipeline-blueprint.md).

To use Amazon RDS as a source, you need to configure VPC access for the pipeline. The VPC you choose should be the same VPC that your Amazon RDS source uses. Then choose one or more subnets and one or more VPC security groups. Note that the pipeline needs network access to the RDS for MySQL database, so you should also verify that your DB instance is configured with a VPC security group that allows inbound traffic from the pipeline's VPC security group to the database port. For more information, see [Controlling access with security groups](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Overview.RDSSecurityGroups.html).

If you're using the AWS Management Console to create your pipeline, you must also attach your pipeline to your VPC in order to use Amazon RDS as a source. To do this, find the **Network configuration** section, choose **Attach to VPC**, and choose your CIDR from one of the provided default options, or select your own. You can use any CIDR from a private address space as defined in the [RFC 1918 Best Current Practice](https://datatracker.ietf.org/doc/html/rfc1918).

To provide a custom CIDR, select **Other** from the dropdown menu. To avoid a collision in IP addresses between OpenSearch Ingestion and Amazon RDS, ensure that the Amazon RDS VPC CIDR is different from the CIDR for OpenSearch Ingestion.

For more information, see [Configuring VPC access for a pipeline](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/pipeline-security.html#pipeline-vpc-configure).
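When you create the pipeline programmatically, the VPC attachment is passed as `VpcOptions` on the `CreatePipeline` call. A sketch with hypothetical subnet, security group, and pipeline names (use subnets and security groups from the same VPC as your DB instance):

```
# Hypothetical IDs; use subnets and security groups from your DB instance's VPC.
vpc_options = {
    "SubnetIds": ["subnet-0123456789abcdef0"],
    "SecurityGroupIds": ["sg-0123456789abcdef0"],
}

try:
    import boto3
    boto3.client("osis").create_pipeline(
        PipelineName="rds-mysql-pipeline",   # hypothetical
        MinUnits=1,
        MaxUnits=4,
        # Your Step 2 YAML, saved locally as pipeline.yaml.
        PipelineConfigurationBody=open("pipeline.yaml").read(),
        VpcOptions=vpc_options,
    )
except Exception as exc:  # no boto3, no credentials, or no pipeline.yaml here
    print(f"Skipped API call: {exc}")
```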

## Data consistency
<a name="rds-mysql-pipeline-consistency"></a>

The pipeline ensures data consistency by continuously polling or receiving changes from the Amazon RDS instance and updating the corresponding documents in the OpenSearch index.

OpenSearch Ingestion supports end-to-end acknowledgement to ensure data durability. When a pipeline reads snapshots or streams, it dynamically creates partitions for parallel processing. The pipeline marks a partition as complete when it receives an acknowledgement after ingesting all records in the OpenSearch domain or collection. If you want to ingest into an OpenSearch Serverless search collection, you can generate a document ID in the pipeline. If you want to ingest into an OpenSearch Serverless time series collection, note that the pipeline doesn't generate a document ID, so you must omit `document_id: "${getMetadata(\"primary_key\")}"` in your pipeline sink configuration. 

An OpenSearch Ingestion pipeline also maps incoming event actions into corresponding bulk indexing actions to help ingest documents. This keeps data consistent, so that every data change in Amazon RDS is reconciled with the corresponding document changes in OpenSearch.
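As an illustration of that mapping (this is a sketch, not the pipeline's internal code), create and update events typically become bulk `index` actions, while delete events become bulk `delete` actions:

```
# Illustrative sketch: how CDC event types typically translate into the
# OpenSearch bulk actions that the sink's "opensearch_action" metadata carries.
def to_bulk_action(cdc_event_type):
    mapping = {
        "insert": "index",
        "update": "index",   # full-row images let updates be re-indexed
        "delete": "delete",
    }
    return mapping[cdc_event_type]

print(to_bulk_action("insert"))  # index
print(to_bulk_action("delete"))  # delete
```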

## Mapping data types
<a name="rds-mysql-pipeline-mapping"></a>

An OpenSearch Ingestion pipeline maps MySQL data types to representations that are suitable for OpenSearch Service domains or collections to consume. If no mapping template is defined in OpenSearch, OpenSearch automatically determines field types with [dynamic mapping](https://docs.opensearch.org/latest/field-types/#dynamic-mapping) based on the first sent document. You can also explicitly define the field types that work best for you in OpenSearch through a mapping template. 

The table below lists MySQL data types and corresponding OpenSearch field types. The *Default OpenSearch Field Type* column shows the field type that OpenSearch assigns through dynamic mapping if no explicit mapping is defined. The *Recommended OpenSearch Field Type* column shows the field type that we recommend you explicitly specify in a mapping template. These field types align more closely with the MySQL data types and usually enable better search features in OpenSearch.


| MySQL Data Type | Default OpenSearch Field Type | Recommended OpenSearch Field Type | 
| --- | --- | --- | 
| BIGINT | long | long | 
| BIGINT UNSIGNED | long | unsigned long | 
| BIT | long | byte, short, integer, or long depending on number of bits | 
| DECIMAL | text | double or keyword | 
| DOUBLE | float | double | 
| FLOAT | float | float | 
| INT | long | integer | 
| INT UNSIGNED | long | long | 
| MEDIUMINT | long | integer | 
| MEDIUMINT UNSIGNED | long | integer | 
| NUMERIC | text | double or keyword | 
| SMALLINT | long | short | 
| SMALLINT UNSIGNED | long | integer | 
| TINYINT | long | byte | 
| TINYINT UNSIGNED | long | short | 
| BINARY | text | binary | 
| BLOB | text | binary | 
| CHAR | text | text | 
| ENUM | text | keyword | 
| LONGBLOB | text | binary | 
| LONGTEXT | text | text | 
| MEDIUMBLOB | text | binary | 
| MEDIUMTEXT | text | text | 
| SET | text | keyword | 
| TEXT | text | text | 
| TINYBLOB | text | binary | 
| TINYTEXT | text | text | 
| VARBINARY | text | binary | 
| VARCHAR | text | text | 
| DATE | long (in epoch milliseconds) | date | 
| DATETIME | long (in epoch milliseconds) | date | 
| TIME | long (in epoch milliseconds) | date | 
| TIMESTAMP | long (in epoch milliseconds) | date | 
| YEAR | long (in epoch milliseconds) | date | 
| GEOMETRY | text (in WKT format) | geo_shape | 
| GEOMETRYCOLLECTION | text (in WKT format) | geo_shape | 
| LINESTRING | text (in WKT format) | geo_shape | 
| MULTILINESTRING | text (in WKT format) | geo_shape | 
| MULTIPOINT | text (in WKT format) | geo_shape | 
| MULTIPOLYGON | text (in WKT format) | geo_shape | 
| POINT | text (in WKT format) | geo_point or geo_shape | 
| POLYGON | text (in WKT format) | geo_shape | 
| JSON | text | object | 

We recommend that you configure a dead-letter queue (DLQ) in your OpenSearch Ingestion pipeline. If you configure one, OpenSearch Service sends all documents that fail to ingest because of dynamic mapping failures to the queue.

If automatic mappings fail, you can use `template_type` and `template_content` in your pipeline configuration to define explicit mapping rules. Alternatively, you can create mapping templates directly in your search domain or collection before you start the pipeline.
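For example, an `opensearch` sink that pins explicit field types with `template_content` and routes mapping failures to an S3-backed DLQ might look like the following sketch (the bucket name and field names are hypothetical):

```
sink:
  - opensearch:
      hosts: ["https://search-mydomain.us-east-1.es.amazonaws.com"]
      index: "${getMetadata(\"table_name\")}"
      template_type: index-template
      template_content: |
        {
          "template": {
            "mappings": {
              "properties": {
                "price": { "type": "double" },
                "created_at": { "type": "date" }
              }
            }
          }
        }
      dlq:
        s3:
          bucket: "dlq-bucket-name"
          region: "us-east-1"
          sts_role_arn: "arn:aws:iam::account-id:role/pipeline-role"
```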

## Limitations
<a name="rds-mysql-pipeline-limitations"></a>

Consider the following limitations when you set up an OpenSearch Ingestion pipeline for RDS for MySQL:
+ The integration only supports one MySQL database per pipeline.
+ The integration does not currently support cross-region data ingestion; your Amazon RDS instance and OpenSearch domain must be in the same AWS Region.
+ The integration does not currently support cross-account data ingestion; your Amazon RDS instance and OpenSearch Ingestion pipeline must be in the same AWS account. 
+ Ensure that the Amazon RDS instance has authentication enabled using Secrets Manager, which is the only supported authentication mechanism.
+ You can't update an existing pipeline configuration to ingest data from a different database or table. To change the database or table name for a pipeline, you have to create a new pipeline.
+ Data Definition Language (DDL) statements are generally not supported. Data consistency will not be maintained if:
  + Primary keys are changed (add/delete/rename).
  + Tables are dropped/truncated.
  + Column names or data types are changed.
+ If the MySQL tables to sync don't have primary keys defined, data consistency is not guaranteed. You must configure the `document_id` option in the OpenSearch sink correctly to sync updates and deletes to OpenSearch.
+ Foreign key references with cascading delete actions are not supported and can result in data inconsistency between RDS for MySQL and OpenSearch.
+ Amazon RDS Multi-AZ DB clusters are not supported.
+ Supported versions: MySQL version 8.0 and higher.

## Recommended CloudWatch Alarms
<a name="aurora-mysql-pipeline-metrics"></a>

The following CloudWatch metrics are recommended for monitoring the performance of your ingestion pipeline. These metrics can help you identify the amount of data processed from exports, the number of events processed from streams, errors in processing exports and stream events, and the number of documents written to the destination. You can set up CloudWatch alarms to perform an action when one of these metrics exceeds a specified value for a specified amount of time.


| Metric | Description | 
| --- | --- | 
| pipeline-name.rds.credentialsChanged | This metric indicates how often AWS secrets are rotated. | 
| pipeline-name.rds.executorRefreshErrors | This metric indicates failures to refresh AWS secrets. | 
| pipeline-name.rds.exportRecordsTotal | This metric indicates the number of records exported from Amazon RDS. | 
| pipeline-name.rds.exportRecordsProcessed | This metric indicates the number of records processed by the OpenSearch Ingestion pipeline. | 
| pipeline-name.rds.exportRecordProcessingErrors | This metric indicates the number of processing errors in an OpenSearch Ingestion pipeline while reading the data from an Amazon RDS instance. | 
| pipeline-name.rds.exportRecordsSuccessTotal | This metric indicates the total number of export records processed successfully. | 
| pipeline-name.rds.exportRecordsFailedTotal | This metric indicates the total number of export records that failed to process. | 
| pipeline-name.rds.bytesReceived | This metric indicates the total number of bytes received by an OpenSearch Ingestion pipeline. | 
| pipeline-name.rds.bytesProcessed | This metric indicates the total number of bytes processed by an OpenSearch Ingestion pipeline. | 
| pipeline-name.rds.streamRecordsSuccessTotal | This metric indicates the number of records successfully processed from the stream. | 
| pipeline-name.rds.streamRecordsFailedTotal | This metric indicates the total number of records that failed to process from the stream. | 
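An alarm on one of these metrics can be created with the CloudWatch `PutMetricAlarm` API. The sketch below alarms when any stream records fail within a five-minute window; the pipeline name, SNS topic, and threshold are hypothetical, and it assumes OpenSearch Ingestion publishes pipeline metrics under the `AWS/OSIS` namespace.

```
# Hypothetical pipeline name; the metric name follows the
# pipeline-name.rds.<metric> pattern shown in the table above.
pipeline_name = "rds-mysql-pipeline"
metric_name = f"{pipeline_name}.rds.streamRecordsFailedTotal"

try:
    import boto3
    boto3.client("cloudwatch").put_metric_alarm(
        AlarmName=f"{pipeline_name}-stream-failures",
        Namespace="AWS/OSIS",
        MetricName=metric_name,
        Statistic="Sum",
        Period=300,
        EvaluationPeriods=1,
        Threshold=0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-2:111122223333:alerts"],  # hypothetical
    )
except Exception as exc:  # no boto3 or no credentials in this environment
    print(f"Skipped API call: {exc}")
```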

# RDS for PostgreSQL
<a name="rds-PostgreSQL"></a>

Complete the following steps to configure an OpenSearch Ingestion pipeline with Amazon RDS for PostgreSQL.

**Topics**
+ [RDS for PostgreSQL prerequisites](#rds-PostgreSQL-prereqs)
+ [Step 1: Configure the pipeline role](#rds-mysql-pipeline-role)
+ [Step 2: Create the pipeline](#rds-PostgreSQL-pipeline)
+ [Data consistency](#rds-mysql-pipeline-consistency)
+ [Mapping data types](#rds-PostgreSQL-pipeline-mapping)
+ [Limitations](#rds-PostgreSQL-pipeline-limitations)
+ [Recommended CloudWatch Alarms](#aurora-mysql-pipeline-metrics)

## RDS for PostgreSQL prerequisites
<a name="rds-PostgreSQL-prereqs"></a>

Before you create your OpenSearch Ingestion pipeline, perform the following steps:

1. [Create a custom DB parameter group](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/zero-etl.setting-up.html#zero-etl.parameters) in Amazon RDS to configure logical replication.

   ```
   rds.logical_replication=1
   ```

   For more information, see [Performing logical replication for Amazon RDS for PostgreSQL](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/PostgreSQL.Concepts.General.FeatureSupport.LogicalReplication.html).

1. [Select or create an RDS for PostgreSQL DB instance](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_GettingStarted.CreatingConnecting.PostgreSQL.html) and associate the parameter group created in step 1 with the DB instance.

1. Set up username and password authentication on your Amazon RDS instance using [password management with Amazon RDS and AWS Secrets Manager](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/rds-secrets-manager.html). You can also create a username/password combination by [creating a Secrets Manager secret](https://docs.aws.amazon.com/secretsmanager/latest/userguide/create_secret.html).

1. If you use the full initial snapshot feature, create an AWS KMS key and an IAM role for exporting data from Amazon RDS to Amazon S3.

   The IAM role should have the following permission policy:

------
#### [ JSON ]

****  

   ```
   {
       "Version":"2012-10-17",		 	 	 
       "Statement": [
           {
               "Sid": "ExportPolicy",
               "Effect": "Allow",
               "Action": [
                   "s3:PutObject*",
                   "s3:ListBucket",
                   "s3:GetObject*",
                   "s3:DeleteObject*",
                   "s3:GetBucketLocation"
               ],
               "Resource": [
                   "arn:aws:s3:::s3-bucket-used-in-pipeline",
                   "arn:aws:s3:::s3-bucket-used-in-pipeline/*"
               ]
           }
       ]
   }
   ```

------

   The role should also have the following trust relationships:

------
#### [ JSON ]

****  

   ```
   {
       "Version":"2012-10-17",		 	 	 
       "Statement": [
           {
               "Effect": "Allow",
               "Principal": {
                   "Service": "export.rds.amazonaws.com"
               },
               "Action": "sts:AssumeRole"
           }
       ]
   }
   ```

------

1. Select or create an OpenSearch Service domain or OpenSearch Serverless collection. For more information, see [Creating OpenSearch Service domains](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/createupdatedomains.html#createdomains) and [Creating collections](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-manage.html#serverless-create).

1. Attach a [resource-based policy](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ac.html#ac-types-resource) to your domain or a [data access policy](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-data-access.html) to your collection. These access policies allow OpenSearch Ingestion to write data from your Amazon RDS DB instance to your domain or collection.

## Step 1: Configure the pipeline role
<a name="rds-mysql-pipeline-role"></a>

After you have your Amazon RDS pipeline prerequisites set up, [configure the pipeline role](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/pipeline-security-overview.html#pipeline-security-sink) to use in your pipeline configuration. Also add the following permissions for the Amazon RDS source to the role:

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "allowReadingFromS3Buckets",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:DeleteObject",
                "s3:GetBucketLocation",
                "s3:ListBucket",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::s3_bucket",
                "arn:aws:s3:::s3_bucket/*"
            ]
        },
        {
            "Sid": "allowNetworkInterfacesActions",
            "Effect": "Allow",
            "Action": [
                "ec2:AttachNetworkInterface",
                "ec2:CreateNetworkInterface",
                "ec2:CreateNetworkInterfacePermission",
                "ec2:DeleteNetworkInterface",
                "ec2:DeleteNetworkInterfacePermission",
                "ec2:DetachNetworkInterface",
                "ec2:DescribeNetworkInterfaces"
            ],
            "Resource": [
                "arn:aws:ec2:*:111122223333:network-interface/*",
                "arn:aws:ec2:*:111122223333:subnet/*",
                "arn:aws:ec2:*:111122223333:security-group/*"
            ]
        },
        {
            "Sid": "allowDescribeEC2",
            "Effect": "Allow",
            "Action": [
                "ec2:Describe*"
            ],
            "Resource": "*"
        },
        {
            "Sid": "allowTagCreation",
            "Effect": "Allow",
            "Action": [
                "ec2:CreateTags"
            ],
            "Resource": "arn:aws:ec2:*:111122223333:network-interface/*",
            "Condition": {
                "StringEquals": {
                    "aws:RequestTag/OSISManaged": "true"
                }
            }
        },
        {
            "Sid": "AllowDescribeInstances",
            "Effect": "Allow",
            "Action": [
                "rds:DescribeDBInstances"
            ],
            "Resource": [
                "arn:aws:rds:us-east-2:111122223333:db:*"
            ]
        },
        {
            "Sid": "AllowSnapshots",
            "Effect": "Allow",
            "Action": [
                "rds:DescribeDBSnapshots",
                "rds:CreateDBSnapshot",
                "rds:AddTagsToResource"
            ],
            "Resource": [
                "arn:aws:rds:us-east-2:111122223333:db:DB-id",
                "arn:aws:rds:us-east-2:111122223333:snapshot:DB-id*"
            ]
        },
        {
            "Sid": "AllowExport",
            "Effect": "Allow",
            "Action": [
                "rds:StartExportTask"
            ],
            "Resource": [
                "arn:aws:rds:us-east-2:111122223333:snapshot:DB-id*"
            ]
        },
        {
            "Sid": "AllowDescribeExports",
            "Effect": "Allow",
            "Action": [
                "rds:DescribeExportTasks"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:RequestedRegion": "us-east-2",
                    "aws:ResourceAccount": "111122223333"
                }
            }
        },
        {
            "Sid": "AllowAccessToKmsForExport",
            "Effect": "Allow",
            "Action": [
                "kms:Decrypt",
                "kms:Encrypt",
                "kms:DescribeKey",
                "kms:RetireGrant",
                "kms:CreateGrant",
                "kms:ReEncrypt*",
                "kms:GenerateDataKey*"
            ],
            "Resource": [
                "arn:aws:kms:us-east-2:111122223333:key/export-key-id"
            ]
        },
        {
            "Sid": "AllowPassingExportRole",
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": [
                "arn:aws:iam::111122223333:role/export-role"
            ]
        },
        {
            "Sid": "SecretsManagerReadAccess",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetSecretValue"
            ],
            "Resource": [
                "arn:aws:secretsmanager:*:111122223333:secret:*"
            ]
        }
    ]
}
```

------

## Step 2: Create the pipeline
<a name="rds-PostgreSQL-pipeline"></a>

Configure an OpenSearch Ingestion pipeline like the following, which specifies an RDS for PostgreSQL instance as the source. 

```
version: "2"
rds-postgres-pipeline:
  source:
    rds:
      db_identifier: "instance-id"
      engine: postgresql
      database: "database-name"
      tables:
        include:
          - "schema1.table1"
          - "schema2.table2"
      s3_bucket: "bucket-name"
      s3_region: "bucket-region"
      s3_prefix: "prefix-name"
      export:
        kms_key_id: "kms-key-id"
        iam_role_arn: "export-role-arn"
      stream: true
      aws:
        sts_role_arn: "arn:aws:iam::account-id:role/pipeline-role"
        region: "us-east-1"
      authentication:
        username: ${{aws_secrets:secret:username}}
        password: ${{aws_secrets:secret:password}}
  sink:
    - opensearch:
        hosts: ["https://search-mydomain.us-east-1.es.amazonaws.com"]
        index: "${getMetadata(\"table_name\")}"
        index_type: custom
        document_id: "${getMetadata(\"primary_key\")}"
        action: "${getMetadata(\"opensearch_action\")}"
        document_version: "${getMetadata(\"document_version\")}"
        document_version_type: "external"
        aws:
          sts_role_arn: "arn:aws:iam::account-id:role/pipeline-role"
          region: "us-east-1"
extension:
  aws:
    secrets:
      secret:
        secret_id: "rds-secret-id"
        region: "us-east-1"
        sts_role_arn: "arn:aws:iam::account-id:role/pipeline-role"
        refresh_interval: PT1H
```

**Note**  
You can use a preconfigured Amazon RDS blueprint to create this pipeline. For more information, see [Working with blueprints](pipeline-blueprint.md).

To use Amazon RDS as a source, you need to configure VPC access for the pipeline. The VPC you choose should be the same VPC your Amazon RDS source uses. Then choose one or more subnets and one or more VPC security groups. The pipeline needs network access to the database, so verify that your DB instance is configured with a VPC security group that allows inbound traffic from the pipeline's VPC security group to the database port. For more information, see [Controlling access with security groups](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Overview.RDSSecurityGroups.html).

If you're using the AWS Management Console to create your pipeline, you must also attach your pipeline to your VPC in order to use Amazon RDS as a source. To do this, find the **Network configuration** section, choose **Attach to VPC**, and choose a CIDR from one of the provided default options, or select your own. You can use any CIDR from a private address space as defined in the [RFC 1918 Best Current Practice](https://datatracker.ietf.org/doc/html/rfc1918).

To provide a custom CIDR, select **Other** from the dropdown menu. To avoid a collision in IP addresses between OpenSearch Ingestion and Amazon RDS, ensure that the Amazon RDS VPC CIDR is different from the CIDR for OpenSearch Ingestion.
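As a quick sanity check, the private-address requirement can be verified programmatically. The following is a minimal sketch (not part of any AWS tooling) that tests whether a candidate CIDR falls within the RFC 1918 private ranges:

```python
import ipaddress

# The three RFC 1918 private address blocks.
RFC1918_BLOCKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]

def is_private_cidr(cidr: str) -> bool:
    """Return True if the CIDR is fully contained in a private range."""
    network = ipaddress.ip_network(cidr)
    return any(network.subnet_of(block) for block in RFC1918_BLOCKS)

print(is_private_cidr("192.168.10.0/24"))  # True
print(is_private_cidr("203.0.113.0/24"))   # False
```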

For more information, see [Configuring VPC access for a pipeline](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/pipeline-security.html#pipeline-vpc-configure).

## Data consistency
<a name="rds-mysql-pipeline-consistency"></a>

The pipeline ensures data consistency by continuously polling or receiving changes from the Amazon RDS instance and updating the corresponding documents in the OpenSearch index.

OpenSearch Ingestion supports end-to-end acknowledgement to ensure data durability. When a pipeline reads snapshots or streams, it dynamically creates partitions for parallel processing. The pipeline marks a partition as complete when it receives an acknowledgement after ingesting all records in the OpenSearch domain or collection. If you want to ingest into an OpenSearch Serverless search collection, you can generate a document ID in the pipeline. If you want to ingest into an OpenSearch Serverless time series collection, note that the pipeline doesn't generate a document ID, so you must omit `document_id: "${getMetadata(\"primary_key\")}"` in your pipeline sink configuration. 

An OpenSearch Ingestion pipeline also maps incoming event actions into corresponding bulk indexing actions to help ingest documents. This keeps data consistent, so that every data change in Amazon RDS is reconciled with the corresponding document changes in OpenSearch.
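As an illustration of this reconciliation, the sketch below (hypothetical, not the pipeline's actual implementation) shows how change-event types might map to OpenSearch bulk actions:

```python
# Illustrative mapping: database change events to OpenSearch bulk actions,
# so that inserts, updates, and deletes stay reconciled with the index.
EVENT_TO_BULK_ACTION = {
    "insert": "index",   # create the document
    "update": "index",   # overwrite the existing document
    "delete": "delete",  # remove the document
}

def to_bulk_action(event_type: str) -> str:
    try:
        return EVENT_TO_BULK_ACTION[event_type]
    except KeyError:
        raise ValueError(f"unsupported event type: {event_type}")

print(to_bulk_action("update"))  # index
```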

## Mapping data types
<a name="rds-PostgreSQL-pipeline-mapping"></a>

An OpenSearch Ingestion pipeline maps PostgreSQL data types to representations that are suitable for OpenSearch Service domains or collections to consume. If no mapping template is defined in OpenSearch, OpenSearch automatically determines field types with a [dynamic mapping](https://docs.opensearch.org/latest/field-types/#dynamic-mapping) based on the first document sent. You can also explicitly define the field types that work best for you in OpenSearch through a mapping template. 

The table below lists RDS for PostgreSQL data types and corresponding OpenSearch field types. The *Default OpenSearch Field Type* column shows the corresponding field type in OpenSearch if no explicit mapping is defined. In this case, OpenSearch automatically determines field types with dynamic mapping. The *Recommended OpenSearch Field Type* column is the corresponding recommended field type to explicitly specify in a mapping template. These field types are more closely aligned with the data types in RDS for PostgreSQL and can usually enable better search features available in OpenSearch.


| RDS for PostgreSQL Data Type | Default OpenSearch Field Type | Recommended OpenSearch Field Type | 
| --- | --- | --- | 
| smallint | long | short | 
| integer | long | integer | 
| bigint | long | long | 
| decimal | text | double or keyword | 
| numeric[ (p, s) ] | text | double or keyword | 
| real | float | float | 
| double precision | float | double | 
| smallserial | long | short | 
| serial | long | integer | 
| bigserial | long | long | 
| money | object | object | 
| character varying(n) | text | text | 
| varchar(n) | text | text | 
| character(n) | text | text | 
| char(n) | text | text | 
| bpchar(n) | text | text | 
| bpchar | text | text | 
| text | text | text | 
| enum | text | text | 
| bytea | text | binary | 
| timestamp [ (p) ] [ without time zone ] | long (in epoch milliseconds) | date | 
| timestamp [ (p) ] with time zone | long (in epoch milliseconds) | date | 
| date | long (in epoch milliseconds) | date | 
| time [ (p) ] [ without time zone ] | long (in epoch milliseconds) | date | 
| time [ (p) ] with time zone | long (in epoch milliseconds) | date | 
| interval [ fields ] [ (p) ] | text (ISO8601 format) | text | 
| boolean | boolean | boolean | 
| point | text (in WKT format) | geo_shape | 
| line | text (in WKT format) | geo_shape | 
| lseg | text (in WKT format) | geo_shape | 
| box | text (in WKT format) | geo_shape | 
| path | text (in WKT format) | geo_shape | 
| polygon | text (in WKT format) | geo_shape | 
| circle | object | object | 
| cidr | text | text | 
| inet | text | text | 
| macaddr | text | text | 
| macaddr8 | text | text | 
| bit(n) | long | byte, short, integer, or long (depending on number of bits) | 
| bit varying(n) | long | byte, short, integer, or long (depending on number of bits) | 
| json | object | object | 
| jsonb | object | object | 
| jsonpath | text | text | 
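
To see what the default `long` representation of a timestamp encodes, the following sketch converts an epoch-milliseconds value back to a readable UTC timestamp:

```python
from datetime import datetime, timezone

# Under dynamic mapping, PostgreSQL timestamp values arrive as epoch
# milliseconds (a long); converting one by hand shows what the raw
# value encodes, and why an explicit date mapping is usually preferable.
epoch_ms = 1_700_000_000_000
dt = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
print(dt.isoformat())  # 2023-11-14T22:13:20+00:00
```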

We recommend that you configure a dead-letter queue (DLQ) in your OpenSearch Ingestion pipeline. If you configure a DLQ, OpenSearch Service sends to it all documents that fail to be ingested because of dynamic mapping failures.

If automatic mapping fails, you can use `template_type` and `template_content` in your pipeline configuration to define explicit mapping rules. Alternatively, you can create mapping templates directly in your search domain or collection before you start the pipeline.
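As an illustration, the following sketch builds a mapping template body with the recommended field types from the table above; the field names (`created_at`, `price`, `location`) are hypothetical placeholders for your own columns:

```python
import json

# Sketch of an explicit mapping template that overrides dynamic mapping
# with recommended OpenSearch field types. Field names are hypothetical.
template_content = {
    "template": {
        "mappings": {
            "properties": {
                "created_at": {"type": "date"},     # timestamp column
                "price": {"type": "double"},        # numeric column
                "location": {"type": "geo_shape"},  # point/polygon column
            }
        }
    }
}
print(json.dumps(template_content, indent=2))
```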

## Limitations
<a name="rds-PostgreSQL-pipeline-limitations"></a>

Consider the following limitations when you set up an OpenSearch Ingestion pipeline for RDS for PostgreSQL:
+ The integration only supports one PostgreSQL database per pipeline.
+ The integration does not currently support cross-region data ingestion; your Amazon RDS instance and OpenSearch domain must be in the same AWS Region.
+ The integration does not currently support cross-account data ingestion; your Amazon RDS instance and OpenSearch Ingestion pipeline must be in the same AWS account. 
+ Ensure that the Amazon RDS instance has authentication enabled using AWS Secrets Manager, which is the only supported authentication mechanism.
+ You can't update an existing pipeline configuration to ingest data from a different database or table. To update the database or table name of a pipeline, you must stop the pipeline and restart it with an updated configuration, or create a new pipeline.
+ Data Definition Language (DDL) statements are generally not supported. Data consistency will not be maintained if:
  + Primary keys are changed (add/delete/rename).
  + Tables are dropped/truncated.
  + Column names or data types are changed.
+ If the PostgreSQL tables to sync don't have primary keys defined, data consistency isn't guaranteed. You must properly define a custom `document_id` option in the OpenSearch sink configuration to be able to sync updates and deletes to OpenSearch.
+ RDS multi-AZ DB clusters are not supported.
+ Supported versions: PostgreSQL 16 and higher.

## Recommended CloudWatch alarms
<a name="aurora-mysql-pipeline-metrics"></a>

The following CloudWatch metrics are recommended for monitoring the performance of your ingestion pipeline. These metrics can help you identify the amount of data processed from exports, the number of events processed from streams, the errors in processing exports and stream events, and the number of documents written to the destination. You can set up CloudWatch alarms to perform an action when one of these metrics exceeds a specified value for a specified amount of time.


| Metric | Description | 
| --- | --- | 
| pipeline-name.rds.credentialsChanged | This metric indicates how often AWS secrets are rotated. | 
| pipeline-name.rds.executorRefreshErrors | This metric indicates failures to refresh AWS secrets. | 
| pipeline-name.rds.exportRecordsTotal | This metric indicates the number of records exported from Amazon RDS. | 
| pipeline-name.rds.exportRecordsProcessed | This metric indicates the number of records processed by the OpenSearch Ingestion pipeline. | 
| pipeline-name.rds.exportRecordProcessingErrors | This metric indicates the number of processing errors in an OpenSearch Ingestion pipeline while reading the data from an Amazon RDS instance. | 
| pipeline-name.rds.exportRecordsSuccessTotal | This metric indicates the total number of export records processed successfully. | 
| pipeline-name.rds.exportRecordsFailedTotal | This metric indicates the total number of export records that failed to process. | 
| pipeline-name.rds.bytesReceived | This metric indicates the total number of bytes received by an OpenSearch Ingestion pipeline. | 
| pipeline-name.rds.bytesProcessed | This metric indicates the total number of bytes processed by an OpenSearch Ingestion pipeline. | 
| pipeline-name.rds.streamRecordsSuccessTotal | This metric indicates the number of records successfully processed from the stream. | 
| pipeline-name.rds.streamRecordsFailedTotal | This metric indicates the total number of records that failed to process from the stream. | 

# Using an OpenSearch Ingestion pipeline with Amazon S3
<a name="configure-client-s3"></a>

With OpenSearch Ingestion, you can use Amazon S3 as a source or as a destination. When you use Amazon S3 as a source, you send data to an OpenSearch Ingestion pipeline. When you use Amazon S3 as a destination, you write data from an OpenSearch Ingestion pipeline to one or more S3 buckets.

**Topics**
+ [Amazon S3 as a source](#s3-source)
+ [Amazon S3 as a destination](#s3-destination)
+ [Amazon S3 cross account as a source](#fdsf)

## Amazon S3 as a source
<a name="s3-source"></a>

There are two ways that you can use Amazon S3 as a source to process data—with *S3-SQS processing* and with *scheduled scans*. 

Use S3-SQS processing when you require near real-time scanning of files after they're written to S3. You can configure Amazon S3 buckets to raise an event any time an object is stored or modified within the bucket. Use a one-time or recurring scheduled scan to batch process data in an S3 bucket. 
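To illustrate the S3-SQS flow, the following sketch parses the bucket name and object key out of an S3 event notification payload like the one the SQS queue delivers to the pipeline (the bucket and key values here are hypothetical):

```python
import json

# Minimal sketch: pull the bucket name and object key out of an S3 event
# notification message, the shape that SQS delivers for new objects.
message = json.dumps({
    "Records": [{
        "s3": {
            "bucket": {"name": "my-log-bucket"},
            "object": {"key": "logs/2024/app.log"},
        }
    }]
})

record = json.loads(message)["Records"][0]["s3"]
print(record["bucket"]["name"], record["object"]["key"])
```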

**Topics**
+ [Prerequisites](#s3-prereqs)
+ [Step 1: Configure the pipeline role](#s3-pipeline-role)
+ [Step 2: Create the pipeline](#s3-pipeline)

### Prerequisites
<a name="s3-prereqs"></a>

To use Amazon S3 as the source for an OpenSearch Ingestion pipeline, for either a scheduled scan or S3-SQS processing, first [create an S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html).

**Note**  
If the S3 bucket used as a source in the OpenSearch Ingestion pipeline is in a different AWS account, you also need to enable cross-account read permissions on the bucket. This allows the pipeline to read and process the data. To enable cross-account permissions, see [Bucket owner granting cross-account bucket permissions](https://docs.aws.amazon.com/AmazonS3/latest/userguide/example-walkthroughs-managing-access-example2.html) in the *Amazon S3 User Guide*.  
If your S3 buckets are in multiple accounts, use a `bucket_owners` map. For an example, see [Cross-account S3 access](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/s3/#cross-account-s3-access) in the OpenSearch documentation.

To set up S3-SQS processing, you also need to perform the following steps:

1. [Create an Amazon SQS queue](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/step-create-queue.html).

1. [Enable event notifications](https://docs.aws.amazon.com/AmazonS3/latest/userguide/enable-event-notifications.html) on the S3 bucket with the SQS queue as a destination.

### Step 1: Configure the pipeline role
<a name="s3-pipeline-role"></a>

Unlike other source plugins that *push* data to a pipeline, the [S3 source plugin](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/s3/) has a read-based architecture in which the pipeline *pulls* data from the source. 

Therefore, in order for a pipeline to read from S3, you must specify a role within the pipeline's S3 source configuration that has access to both the S3 bucket and the Amazon SQS queue. The pipeline will assume this role in order to read data from the queue.

**Note**  
The role that you specify within the S3 source configuration must be the [pipeline role](). Therefore, your pipeline role must contain two separate permissions policies—one to write to a sink, and one to pull from the S3 source. You must use the same `sts_role_arn` in all pipeline components.

The following sample policy shows the required permissions for using S3 as a source:

------
#### [ JSON ]

****  

```
{
  "Version":"2012-10-17",		 	 	 
  "Statement": [
    {
      "Effect": "Allow",
      "Action":[
          "s3:ListBucket",
          "s3:GetBucketLocation",
          "s3:GetObject"
       ],
      "Resource": "arn:aws:s3:::amzn-s3-demo-bucket/*"
    },
    {
       "Effect":"Allow",
       "Action":"s3:ListAllMyBuckets",
       "Resource":"arn:aws:s3:::*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "sqs:DeleteMessage",
        "sqs:ReceiveMessage",
        "sqs:ChangeMessageVisibility"
      ],
      "Resource": "arn:aws:sqs:us-east-1:111122223333:MyS3EventSqsQueue"
    }
  ]
}
```

------

 You must attach these permissions to the IAM role that you specify in the `sts_role_arn` option within the S3 source plugin configuration:

```
version: "2"
source:
  s3:
    ...
    aws:
      ...
processor:
  ...
sink:
  - opensearch:
      ...
```

### Step 2: Create the pipeline
<a name="s3-pipeline"></a>

After you've set up your permissions, you can configure an OpenSearch Ingestion pipeline depending on your Amazon S3 use case.

#### S3-SQS processing
<a name="s3-sqs-processing"></a>

To set up S3-SQS processing, configure your pipeline to specify S3 as the source and set up Amazon SQS notifications:

```
version: "2"
s3-pipeline:
  source:
    s3:
      notification_type: "sqs"
      codec:
        newline: null
      sqs:
        queue_url: "https://sqs.us-east-1amazonaws.com/account-id/ingestion-queue"
      compression: "none"
      aws:
        region: "region"
  processor:
  - grok:
      match:
        message:
        - "%{COMMONAPACHELOG}"
  - date:
      destination: "@timestamp"
      from_time_received: true
  sink:
  - opensearch:
      hosts: ["https://search-domain-endpoint.us-east-1es.amazonaws.com"]
      index: "index-name"
      aws:
        region: "region"
```

If you observe low CPU utilization while processing small files on Amazon S3, consider increasing the throughput by modifying the value of the `workers` option. For more information, see the [S3 plugin configuration options](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/s3/#configuration).

#### Scheduled scan
<a name="s3-scheduled-scan"></a>

To set up a scheduled scan, configure your pipeline with a schedule at the scan level that applies to all your S3 buckets, or at the bucket level. A bucket-level schedule or a scan-interval configuration always overwrites a scan-level configuration. 

You can configure scheduled scans with either a *one-time scan*, which is ideal for data migration, or a *recurring scan*, which is ideal for batch processing. 

To configure your pipeline to read from Amazon S3, use the preconfigured Amazon S3 blueprints. You can edit the `scan` portion of your pipeline configuration to meet your scheduling needs. For more information, see [Working with blueprints](pipeline-blueprint.md).

**One-time scan**

A one-time scheduled scan runs once. In your pipeline configuration, you can use a `start_time` and `end_time` to specify when you want the objects in the bucket to be scanned. Alternatively, you can use `range` to specify the interval of time relative to current time that you want the objects in the bucket to be scanned. 

For example, a range set to `PT4H` scans all files created in the last four hours. To configure a one-time scan to run a second time, you must stop and restart the pipeline. If you don't have a range configured, you must also update the start and end times.
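Range values such as `PT4H` are ISO 8601 durations. The following sketch (a simplified parser that handles hours, minutes, and seconds only) shows how such a value translates into a time window relative to now:

```python
import re
from datetime import datetime, timedelta, timezone

# Simplified ISO 8601 duration parser (H/M/S only) to show what a range
# such as PT4H means: objects created in the last four hours.
def parse_simple_duration(value: str) -> timedelta:
    match = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?", value)
    if not match:
        raise ValueError(f"unsupported duration: {value}")
    hours, minutes, seconds = (int(g or 0) for g in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)

# Objects created after this cutoff would fall inside the scan window.
cutoff = datetime.now(timezone.utc) - parse_simple_duration("PT4H")
print(parse_simple_duration("PT4H"))  # 4:00:00
```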

The following configuration sets up a one-time scan for all buckets and all objects in those buckets:

```
version: "2"
log-pipeline:
  source:
    s3:
      codec:
        csv:
      compression: "none"
      aws:
        region: "region"
      acknowledgments: true
      scan:
        buckets:
          - bucket:
              name: my-bucket
              filter:
                include_prefix:
                  - Objects1/
                exclude_suffix:
                  - .jpeg
                  - .png
          - bucket:
              name: my-bucket-2
              filter:
                include_prefix:
                  - Objects2/
                exclude_suffix:
                  - .jpeg
                  - .png
      delete_s3_objects_on_read: false
  processor:
    - date:
        destination: "@timestamp"
        from_time_received: true
  sink:
    - opensearch:
        hosts: ["https://search-domain-endpoint.us-east-1es.amazonaws.com"]
        index: "index-name"
        aws:
          region: "region"
        dlq:
          s3:
            bucket: "dlq-bucket"
            region: "us-east-1"
```

The following configuration sets up a one-time scan for all buckets during a specified time window. This means that the pipeline processes only those objects with creation times that fall within this window.

```
scan:
  start_time: 2023-01-21T18:00:00.000Z
  end_time: 2023-04-21T18:00:00.000Z
  buckets:
    - bucket:
        name: my-bucket-1
        filter:
          include_prefix:
            - Objects1/
          exclude_suffix:
            - .jpeg
            - .png
    - bucket:
        name: my-bucket-2
        filter:
          include_prefix:
            - Objects2/
          exclude_suffix:
            - .jpeg
            - .png
```

The following configuration sets up a one-time scan at both the scan level and the bucket level. Start and end times at the bucket level override start and end times at the scan level. 

```
scan:
  start_time: 2023-01-21T18:00:00.000Z
  end_time: 2023-04-21T18:00:00.000Z
  buckets:
    - bucket:
        start_time: 2023-01-21T18:00:00.000Z
        end_time: 2023-04-21T18:00:00.000Z
        name: my-bucket-1
        filter:
          include_prefix:
            - Objects1/
          exclude_suffix:
            - .jpeg
            - .png
    - bucket:
        start_time: 2023-01-21T18:00:00.000Z
        end_time: 2023-04-21T18:00:00.000Z
        name: my-bucket-2
        filter:
          include_prefix:
            - Objects2/
          exclude_suffix:
            - .jpeg
            - .png
```

Stopping a pipeline removes its record of which objects it has already scanned. If you stop a one-time scan pipeline, it rescans all objects after it's started again, even objects that were already scanned. If you need to stop a one-time scan pipeline, we recommend that you change your time window before starting the pipeline again.

If you need to filter objects by start time and end time, stopping and starting your pipeline is the only option. If you don't need to filter by start time and end time, you can filter objects by name instead, which doesn't require you to stop and start your pipeline. To do this, use `include_prefix` and `exclude_suffix`.

**Recurring scan**

A recurring scheduled scan runs a scan of your specified S3 buckets at regular, scheduled intervals. You can configure these intervals only at the scan level, because individual bucket-level configurations aren't supported. 

In your pipeline configuration, the `interval` specifies the frequency of the recurring scan, and can be between 30 seconds and 365 days. The first of these scans always occurs when you create the pipeline. The `count` defines the total number of scan instances.
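As an illustration of the schedule semantics, the sketch below computes the run times implied by an interval of `PT12H` and a count of 4, with the first run at pipeline creation:

```python
from datetime import datetime, timedelta

# Sketch: the run times implied by a recurring scan with interval PT12H
# and count 4. The first scan happens at pipeline creation time.
def scan_times(created_at: datetime, interval: timedelta, count: int):
    return [created_at + interval * i for i in range(count)]

runs = scan_times(datetime(2024, 1, 1, 0, 0), timedelta(hours=12), 4)
for run in runs:
    print(run.isoformat())
```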

The following configuration sets up a recurring scan, with a delay of 12 hours between the scans:

```
scan:
  scheduling:
    interval: PT12H
    count: 4
  buckets:
    - bucket:
        name: my-bucket-1
        filter:
          include_prefix:
            - Objects1/
          exclude_suffix:
            - .jpeg
            - .png
    - bucket:
        name: my-bucket-2
        filter:
          include_prefix:
            - Objects2/
          exclude_suffix:
            - .jpeg
            - .png
```

## Amazon S3 as a destination
<a name="s3-destination"></a>

To write data from an OpenSearch Ingestion pipeline to an S3 bucket, use the preconfigured S3 blueprint to create a pipeline with an [S3 sink](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sinks/s3/). This pipeline routes selective data to an OpenSearch sink and simultaneously sends all data for archival in S3. For more information, see [Working with blueprints](pipeline-blueprint.md).

When you create your S3 sink, you can specify your preferred formatting from a variety of [sink codecs](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sinks/s3/#codec). For example, if you want to write data in columnar format, choose the Parquet or Avro codec. If you prefer a row-based format, choose JSON or NDJSON. To write data to S3 in a specified schema, you can also define an inline schema within sink codecs using the [Avro format](https://avro.apache.org/docs/current/specification/#schema-declaration). 

The following example defines an inline schema in an S3 sink:

```
- s3:
    codec:
      parquet:
        schema: >
          {
            "type": "record",
            "namespace": "org.vpcFlowLog.examples",
            "name": "VpcFlowLog",
            "fields": [
              { "name": "version", "type": "string" },
              { "name": "srcport", "type": "int" },
              { "name": "dstport", "type": "int" },
              { "name": "start", "type": "int" },
              { "name": "end", "type": "int" },
              { "name": "protocol", "type": "int" },
              { "name": "packets", "type": "int" },
              { "name": "bytes", "type": "int" },
              { "name": "action", "type": "string" },
              { "name": "logStatus", "type": "string" }
            ]
          }
```

When you define this schema, specify a superset of all keys that might be present in the different types of events that your pipeline delivers to a sink. 

For example, if an event has the possibility of a key missing, add that key in your schema with a `null` value. Null value declarations allow the schema to process non-uniform data (where some events have these keys and others don't). When incoming events do have these keys present, their values are written to sinks. 

This schema definition acts as a filter that only allows defined keys to be sent to sinks, and drops undefined keys from incoming events. 
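The filtering behavior described above can be sketched as follows; the field names are a hypothetical subset of the VPC flow log schema shown earlier:

```python
# Sketch: apply a schema as a filter -- keep only defined fields and fill
# missing ones with null -- mirroring how a superset schema handles
# non-uniform events.
SCHEMA_FIELDS = ["version", "srcport", "dstport", "action"]

def conform(event: dict) -> dict:
    """Keep defined fields, drop undefined keys, null out missing keys."""
    return {field: event.get(field) for field in SCHEMA_FIELDS}

event = {"version": "2", "srcport": 443, "unexpected": "dropped"}
print(conform(event))
# {'version': '2', 'srcport': 443, 'dstport': None, 'action': None}
```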

You can also use `include_keys` and `exclude_keys` in your sink to filter data that's routed to other sinks. These two filters are mutually exclusive, so you can only use one at a time in your schema. Additionally, you can't use them within user-defined schemas. 

To create pipelines with such filters, use the preconfigured sink filter blueprint. For more information, see [Working with blueprints](pipeline-blueprint.md).

## Amazon S3 cross account as a source
<a name="fdsf"></a>

You can grant access across accounts with Amazon S3 so that OpenSearch Ingestion pipelines can access S3 buckets in another account as a source. To enable cross-account access, see [Bucket owner granting cross-account bucket permissions](https://docs.aws.amazon.com/AmazonS3/latest/userguide/example-walkthroughs-managing-access-example2.html) in the *Amazon S3 User Guide*. After you have granted access, ensure that your pipeline role has the required permissions.

Then, you can create a pipeline using `bucket_owners` to enable cross-account access to an Amazon S3 bucket as a source:

```
s3-pipeline:
  source:
    s3:
      notification_type: "sqs"
      codec:
        csv:
          delimiter: ","
          quote_character: "\""
          detect_header: True
      sqs:
        queue_url: "https://sqs.ap-northeast-1.amazonaws.com/401447383613/test-s3-queue"
      bucket_owners:
        my-bucket-01: 123456789012
        my-bucket-02: 999999999999
      compression: "gzip"
```
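Conceptually, the `bucket_owners` map works like Amazon S3's expected-bucket-owner check: each read is validated against the owning account registered for that bucket. A minimal sketch of that lookup, using the example account IDs from the configuration above:

```python
# Sketch: resolve the expected owner account for each bucket from a
# bucket_owners map. Account IDs match the example configuration.
bucket_owners = {
    "my-bucket-01": "123456789012",
    "my-bucket-02": "999999999999",
}

def expected_owner(bucket: str) -> str:
    owner = bucket_owners.get(bucket)
    if owner is None:
        raise KeyError(f"no owner registered for bucket: {bucket}")
    return owner

print(expected_owner("my-bucket-01"))  # 123456789012
```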

# Using an OpenSearch Ingestion pipeline with Amazon Security Lake
<a name="configure-client-security-lake"></a>

You can use the [S3 source plugin](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/s3/) to ingest data from [Amazon Security Lake](https://docs.aws.amazon.com/security-lake/latest/userguide/what-is-security-lake.html) into your OpenSearch Ingestion pipeline. Security Lake automatically centralizes security data from AWS environments, on-premises environments, and SaaS providers into a purpose-built data lake. You can create a subscription that replicates data from Security Lake to your OpenSearch Ingestion pipeline, which then writes it to your OpenSearch Service domain or OpenSearch Serverless collection.

To configure your pipeline to read from Security Lake, use the preconfigured Security Lake blueprint. The blueprint includes a default configuration for ingesting Open Cybersecurity Schema Framework (OCSF) parquet files from Security Lake. For more information, see [Working with blueprints](pipeline-blueprint.md).

**Topics**
+ [Using an OpenSearch Ingestion pipeline with Amazon Security Lake as a source](configure-client-source-security-lake.md)
+ [Using an OpenSearch Ingestion pipeline with Amazon Security Lake as a sink](configure-client-sink-security-lake.md)

# Using an OpenSearch Ingestion pipeline with Amazon Security Lake as a source
<a name="configure-client-source-security-lake"></a>

You can use the Amazon S3 source plugin within your OpenSearch Ingestion pipeline to ingest data from Amazon Security Lake. Security Lake automatically centralizes security data from AWS environments, on-premises systems, and SaaS providers into a purpose-built data lake.

Amazon Security Lake has the following metadata attributes within a pipeline:
+ `bucket_name`: The name of the Amazon S3 bucket created by Security Lake for storing security data.
+ `path_prefix`: The custom source name defined in the Security Lake IAM role policy.
+ `region`: The AWS Region where the Security Lake S3 bucket is located.
+ `accountID`: The AWS account ID in which Security Lake is enabled.
+ `sts_role_arn`: The ARN of the IAM role intended for use with Security Lake.

## Prerequisites
<a name="sl-prereqs"></a>

Before you create your OpenSearch Ingestion pipeline, perform the following steps:
+ [Enable Security Lake](https://docs.aws.amazon.com/security-lake/latest/userguide/getting-started.html#enable-service).
+ [Create a subscriber](https://docs.aws.amazon.com/security-lake/latest/userguide/subscriber-data-access.html#create-subscriber-data-access) in Security Lake.
  + Choose the sources that you want to ingest into your pipeline.
  + For **Subscriber credentials**, add the ID of the AWS account where you intend to create the pipeline. For the external ID, specify `OpenSearchIngestion-{accountid}`.
  + For **Data access method**, choose **S3**.
  + For **Notification details**, choose **SQS queue**.

When you create a subscriber, Security Lake automatically creates two inline permissions policies—one for S3 and one for SQS. The policy names take the following format: `AmazonSecurityLake-amzn-s3-demo-bucket-S3` and `AmazonSecurityLake-AWSDemo-SQS`. To allow your pipeline to access the subscriber sources, you must attach the required permissions to your pipeline role.

## Configure the pipeline role
<a name="sl-pipeline-role"></a>

Create a new permissions policy in IAM that combines only the required permissions from the two policies that Security Lake automatically created. The following example policy shows the least privilege required for an OpenSearch Ingestion pipeline to read data from multiple Security Lake sources:

------
#### [ JSON ]


```
{
   "Version":"2012-10-17",		 	 	 
   "Statement":[
      {
         "Effect":"Allow",
         "Action":[
            "s3:GetObject"
         ],
         "Resource":[
            "arn:aws:s3:::aws-security-data-lake-us-east-1-abcde/aws/LAMBDA_EXECUTION/1.0/*",
            "arn:aws:s3:::aws-security-data-lake-us-east-1-abcde/aws/S3_DATA/1.0/*",
            "arn:aws:s3:::aws-security-data-lake-us-east-1-abcde/aws/VPC_FLOW/1.0/*",
            "arn:aws:s3:::aws-security-data-lake-us-east-1-abcde/aws/ROUTE53/1.0/*",
            "arn:aws:s3:::aws-security-data-lake-us-east-1-abcde/aws/SH_FINDINGS/1.0/*"
         ]
      },
      {
         "Effect":"Allow",
         "Action":[
            "sqs:ReceiveMessage",
            "sqs:DeleteMessage"
         ],
         "Resource":[
            "arn:aws:sqs:us-east-1:111122223333:AmazonSecurityLake-abcde-Main-Queue"
         ]
      }
   ]
}
```

------

**Important**  
Security Lake doesn’t manage the pipeline role policy for you. Because Security Lake creates a separate partition for each log source, you must manually add or remove permissions in the pipeline role whenever you add or remove sources from your Security Lake subscription.
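The per-source resource ARNs in the policy follow a predictable pattern, so keeping the policy in sync can be scripted. The following is a minimal sketch; the helper name, bucket, and role names are hypothetical:

```python
import json

# Hypothetical helper: rebuild the S3 statement of the pipeline-role policy
# when you add or remove Security Lake sources. Security Lake partitions
# each source under aws/SOURCE/1.0/ in the data lake bucket.
def build_s3_statement(bucket, sources):
    return {
        "Effect": "Allow",
        "Action": ["s3:GetObject"],
        "Resource": [
            "arn:aws:s3:::{}/aws/{}/1.0/*".format(bucket, source)
            for source in sources
        ],
    }

statement = build_s3_statement(
    "aws-security-data-lake-us-east-1-abcde",
    ["VPC_FLOW", "ROUTE53", "SH_FINDINGS"],
)
print(json.dumps(statement, indent=2))

# You could then push the updated policy with boto3, for example:
# boto3.client("iam").put_role_policy(
#     RoleName="pipeline-role",
#     PolicyName="SecurityLakeRead",
#     PolicyDocument=json.dumps(
#         {"Version": "2012-10-17", "Statement": [statement]}))
```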

Attach these permissions to the IAM role that you specify in the `sts_role_arn` option, under `aws` in the S3 source plugin configuration:

```
version: "2"
source:
  s3:
    ...
    sqs:
      queue_url: "https://sqs.us-east-1.amazonaws.com/account-id/AmazonSecurityLake-amzn-s3-demo-bucket-Main-Queue"
    aws:
      ...
processor:
  ...
sink:
  - opensearch:
      ...
```

## Create the pipeline
<a name="sl-pipeline"></a>

After you add the permissions to the pipeline role, use the preconfigured Security Lake blueprint to create the pipeline. For more information, see [Working with blueprints](pipeline-blueprint.md).

You must specify the `queue_url` option within the `s3` source configuration, which is the Amazon SQS queue URL to read from. To format the URL, locate the **Subscription endpoint** in the subscriber configuration and change `arn:aws:` to `https://`. For example, `https://sqs.us-east-1.amazonaws.com/account-id/AmazonSecurityLake-AWSDemo-Main-Queue`.
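The subscription endpoint ARN has the form `arn:aws:sqs:region:account-id:queue-name`, while the queue URL follows `https://sqs.region.amazonaws.com/account-id/queue-name`. A small helper (hypothetical; the queue name is a placeholder) can perform the conversion:

```python
# Convert a Security Lake subscription endpoint ARN
# (arn:aws:sqs:region:account-id:queue-name) into the queue URL
# format that the queue_url option expects.
def sqs_arn_to_queue_url(arn):
    _, partition, service, region, account, name = arn.split(":", 5)
    return "https://{}.{}.amazonaws.com/{}/{}".format(
        service, region, account, name)

url = sqs_arn_to_queue_url(
    "arn:aws:sqs:us-east-1:111122223333:AmazonSecurityLake-AWSDemo-Main-Queue"
)
print(url)
# → https://sqs.us-east-1.amazonaws.com/111122223333/AmazonSecurityLake-AWSDemo-Main-Queue
```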

The `sts_role_arn` that you specify within the S3 source configuration must be the ARN of the pipeline role.

# Using an OpenSearch Ingestion pipeline with Amazon Security Lake as a sink
<a name="configure-client-sink-security-lake"></a>

Use the Amazon S3 sink plugin in OpenSearch Ingestion to send data from any supported source to Amazon Security Lake. Security Lake collects and stores security data from AWS, on-premises environments, and SaaS providers in a dedicated data lake.

To configure your pipeline to write log data to Security Lake, use the preconfigured **Firewall Traffic logs** blueprint. The blueprint includes a default configuration for retrieving raw security logs or other data stored in an Amazon S3 bucket, processing the records, and normalizing them. It then maps the data to Open Cybersecurity Schema Framework (OCSF) and sends the transformed OCSF-compliant data to Security Lake.

The pipeline has the following metadata attributes:
+ `bucket_name`: The name of the Amazon S3 bucket created by Security Lake for storing security data.
+ `path_prefix`: The custom source name defined in the Security Lake IAM role policy.
+ `region`: The AWS Region where the Security Lake S3 bucket is located.
+ `accountID`: The AWS account ID in which Security Lake is enabled.
+ `sts_role_arn`: The ARN of the IAM role intended for use with Security Lake.

## Prerequisites
<a name="configure-clients-lambda-prereqs"></a>

Before you create a pipeline to send data to Security Lake, perform the following steps:
+ **Enable and configure Amazon Security Lake**: Set up Amazon Security Lake to centralize security data from various sources. For instructions, see [Enabling Security Lake using the console](https://docs.aws.amazon.com/security-lake/latest/userguide/get-started-console.html).

  When you select a source, choose **Ingest specific AWS sources** and select one or more log and event sources that you want to ingest.
+ **Set up permissions**: Configure the pipeline role with the required permissions to write data to Security Lake. For more information, see [Pipeline role](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/pipeline-security-overview.html#pipeline-security-sink).

### Create the pipeline
<a name="create-opensearch-ingestion-pipeline"></a>

Use the preconfigured Security Lake blueprint to create the pipeline. For more information, see [Using blueprints to create a pipeline](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/pipeline-blueprint.html). 

# Using an OpenSearch Ingestion pipeline with Fluent Bit
<a name="configure-client-fluentbit"></a>

This sample [Fluent Bit configuration file](https://docs.fluentbit.io/manual/pipeline/outputs/http) sends log data from Fluent Bit to an OpenSearch Ingestion pipeline. For more information about ingesting log data, see [Log Analytics](https://github.com/opensearch-project/data-prepper/blob/main/docs/log_analytics.md) in the Data Prepper documentation.

Note the following:
+ The `host` value must be your pipeline endpoint. For example, `pipeline-endpoint.us-east-1.osis.amazonaws.com`.
+ The `aws_service` value must be `osis`.
+ The `aws_role_arn` value is the ARN of the AWS IAM role for the client to assume and use for Signature Version 4 authentication.

```
[INPUT]
  name                  tail
  refresh_interval      5
  path                  /var/log/test.log
  read_from_head        true

[OUTPUT]
  Name http
  Match *
  Host pipeline-endpoint.us-east-1.osis.amazonaws.com
  Port 443
  URI /log/ingest
  Format json
  aws_auth true
  aws_region region
  aws_service osis
  aws_role_arn arn:aws:iam::account-id:role/ingestion-role
  Log_Level trace
  tls On
```

You can then configure an OpenSearch Ingestion pipeline like the following, which has HTTP as the source:

```
version: "2"
unaggregated-log-pipeline:
  source:
    http:
      path: "/log/ingest"
  processor:
    - grok:
        match:
          log:
            - "%{TIMESTAMP_ISO8601:timestamp} %{NOTSPACE:network_node} %{NOTSPACE:network_host} %{IPORHOST:source_ip}:%{NUMBER:source_port:int} -> %{IPORHOST:destination_ip}:%{NUMBER:destination_port:int} %{GREEDYDATA:details}"
    - grok:
        match:
          details:
            - "'%{NOTSPACE:http_method} %{NOTSPACE:http_uri}' %{NOTSPACE:protocol}"
            - "TLS%{NOTSPACE:tls_version} %{GREEDYDATA:encryption}"
            - "%{NUMBER:status_code:int} %{NUMBER:response_size:int}"
    - delete_entries:
        with_keys: ["details", "log"]

  sink:
    - opensearch:
        hosts: ["https://search-domain-endpoint.us-east-1es.amazonaws.com"]
        index: "index_name"
        index_type: custom
        bulk_size: 20
        aws:
          region: "region"
```
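The `aws_auth` settings in the Fluent Bit output above cause each request to be signed with Signature Version 4. For clients without built-in SigV4 support, the signing flow can be sketched with only the Python standard library. This is a simplified illustration: real requests typically also sign the session token and content headers, which SDK signers such as botocore's handle for you.

```python
import datetime
import hashlib
import hmac

def _sign(key, msg):
    return hmac.new(key, msg.encode(), hashlib.sha256).digest()

def sigv4_headers(method, host, path, body, region, service,
                  access_key, secret_key, now=None):
    """Build simplified SigV4 headers for a signed HTTP request."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    date_stamp = now.strftime("%Y%m%d")
    payload_hash = hashlib.sha256(body.encode()).hexdigest()
    # Canonical headers end with a newline per the SigV4 specification
    canonical_headers = "host:{}\nx-amz-date:{}\n".format(host, amz_date)
    signed_headers = "host;x-amz-date"
    canonical_request = "\n".join(
        [method, path, "", canonical_headers, signed_headers, payload_hash])
    scope = "{}/{}/{}/aws4_request".format(date_stamp, region, service)
    string_to_sign = "\n".join(
        ["AWS4-HMAC-SHA256", amz_date, scope,
         hashlib.sha256(canonical_request.encode()).hexdigest()])
    # Derive the signing key: date -> region -> service -> aws4_request
    key = _sign(("AWS4" + secret_key).encode(), date_stamp)
    for part in (region, service, "aws4_request"):
        key = _sign(key, part)
    signature = hmac.new(key, string_to_sign.encode(),
                         hashlib.sha256).hexdigest()
    return {
        "x-amz-date": amz_date,
        "Authorization": (
            "AWS4-HMAC-SHA256 Credential={}/{}, "
            "SignedHeaders={}, Signature={}".format(
                access_key, scope, signed_headers, signature)),
    }
```

For the Fluent Bit configuration above, `service` would be `osis` and `host` your pipeline endpoint.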

# Using an OpenSearch Ingestion pipeline with Fluentd
<a name="configure-client-fluentd"></a>

Fluentd is an open-source data collection ecosystem that provides SDKs for many languages and related sub-projects such as Fluent Bit. This sample [Fluentd configuration file](https://docs.fluentd.org/output/http#example-configuration) sends log data from Fluentd to an OpenSearch Ingestion pipeline. For more information about ingesting log data, see [Log Analytics](https://github.com/opensearch-project/data-prepper/blob/main/docs/log_analytics.md) in the Data Prepper documentation.

Note the following:
+ The `endpoint` value must be your pipeline endpoint. For example, `pipeline-endpoint.us-east-1.osis.amazonaws.com/apache-log-pipeline/logs`.
+ The `aws_service` value must be `osis`.
+ The `aws_role_arn` value is the ARN of the AWS IAM role for the client to assume and use for Signature Version 4 authentication.

```
<source>
  @type tail
  path logs/sample.log
  path_key log
  tag apache
  <parse>
    @type none
  </parse>
</source>

<filter apache>
  @type record_transformer
  <record>
    log ${record["message"]}
  </record>
</filter>

<filter apache>
  @type record_transformer
  remove_keys message
</filter>

<match apache>
  @type http
  endpoint pipeline-endpoint.us-east-1.osis.amazonaws.com/apache-log-pipeline/logs
  json_array true

  <auth>
    method aws_sigv4
    aws_service osis
    aws_region region
    aws_role_arn arn:aws:iam::account-id:role/ingestion-role
  </auth>

  <format>
    @type json
  </format>

  <buffer>
    flush_interval 1s
  </buffer>
</match>
```

You can then configure an OpenSearch Ingestion pipeline like the following, which has HTTP as the source:

```
version: "2"
apache-log-pipeline:
  source:
    http:
      path: "/${pipelineName}/logs"
  processor:
    - grok:
        match:
          log:
            - "%{TIMESTAMP_ISO8601:timestamp} %{NOTSPACE:network_node} %{NOTSPACE:network_host} %{IPORHOST:source_ip}:%{NUMBER:source_port:int} -> %{IPORHOST:destination_ip}:%{NUMBER:destination_port:int} %{GREEDYDATA:details}"
  sink:
    - opensearch:
        hosts: ["https://search-domain-endpoint.us-east-1es.amazonaws.com"]
        index: "index_name"
        aws_region: "region"
        aws_sigv4: true
```

# Using an OpenSearch Ingestion pipeline with machine learning offline batch inference
<a name="configure-clients-ml-commons-batch"></a>

Amazon OpenSearch Ingestion (OSI) pipelines support machine learning (ML) offline batch inference processing to efficiently enrich large volumes of data at low cost. Use offline batch inference whenever you have large datasets that can be processed asynchronously. Offline batch inference works with Amazon Bedrock and SageMaker models. This feature is available in all AWS Regions that support OpenSearch Ingestion with OpenSearch Service 2.17 and later domains.

**Note**  
For real-time inference processing, use [Amazon OpenSearch Service ML connectors for third-party platforms](ml-external-connector.md).

Offline batch inference processing uses a feature of OpenSearch called ML Commons. *ML Commons* provides ML algorithms through transport and REST API calls. Those calls choose the right nodes and resources for each ML request and monitor ML tasks to ensure uptime. In this way, ML Commons lets you use existing open-source ML algorithms and reduces the effort required to develop new ML features. For more information about ML Commons, see [Machine learning](https://docs.opensearch.org/latest/ml-commons-plugin/) in the OpenSearch.org documentation.

## How it works
<a name="configure-clients-ml-commons-batch-how"></a>

You can create an offline batch inference pipeline on OpenSearch Ingestion by [adding a machine learning inference processor](https://docs.opensearch.org/latest/ingest-pipelines/processors/ml-inference/) to a pipeline. This processor enables your pipeline to connect to AI services like SageMaker to run batch inference jobs. You can configure your processor to connect to your desired AI service through the AI connectors (with [batch_predict](https://docs.opensearch.org/latest/ml-commons-plugin/api/model-apis/batch-predict/) support) running on your target domain.

OpenSearch Ingestion uses the `ml_inference` processor with ML Commons to create offline batch inference jobs. ML Commons then uses the [batch_predict](https://docs.opensearch.org/latest/ml-commons-plugin/api/model-apis/batch-predict/) API, which performs inference on large datasets in an offline asynchronous mode using a model deployed on external model servers in Amazon Bedrock, Amazon SageMaker, Cohere, and OpenAI. The following diagram shows an OpenSearch Ingestion pipeline that orchestrates multiple components to perform this process end to end:

![\[Three-pipeline architecture of batch AI inference processing.\]](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/images/ml_processor.png)


The pipeline components work as follows:

**Pipeline 1 (Data preparation and transformation):**
+ Source: Data is scanned from an external source that OpenSearch Ingestion supports.
+ Data processors: The raw data is processed and transformed into the correct format for batch inference on the integrated AI service.
+ S3 (Sink): The processed data is staged in an Amazon S3 bucket, ready to serve as input for batch inference jobs on the integrated AI service.

**Pipeline 2 (Trigger ML batch inference):**
+ Source: Automated S3 event detection of new files that Pipeline 1 writes.
+ `ml_inference` processor: Generates ML inferences through an asynchronous batch job. It connects to AI services through the configured AI connector that's running on your target domain.
+ Task ID: Each batch job is associated with a task ID in ml-commons for tracking and management.
+ OpenSearch ML Commons: Hosts the model for real-time neural search, manages the connectors to remote AI servers, and serves the APIs for batch inference and job management.
+ AI services: OpenSearch ML Commons interacts with AI services like Amazon Bedrock and Amazon SageMaker to perform batch inference on the data, producing predictions or insights. The results are saved asynchronously to a separate S3 file.

**Pipeline 3 (Bulk ingestion):**
+ S3 (source): The results of the batch jobs are stored in S3, which is the source of this pipeline.
+ Data transformation processors: Further processing and transformation are applied to the batch inference output before ingestion. This ensures the data is mapped correctly in the OpenSearch index.
+ OpenSearch index (Sink): The processed results are indexed into OpenSearch for storage, search, and further analysis.

**Note**  
The process described by Pipeline 1 is optional. If you prefer, you can skip that process and simply upload your prepared data to the staging S3 bucket to create batch jobs.

## About the ml_inference processor
<a name="configure-clients-ml-commons-batch-inference-processor"></a>

OpenSearch Ingestion uses a specialized integration between the S3 Scan source and the ML inference processor for batch processing. The S3 Scan source operates in metadata-only mode to efficiently collect S3 file information without reading the actual file contents. The `ml_inference` processor uses the S3 file URLs to coordinate with ML Commons for batch processing. This design optimizes the batch inference workflow by minimizing unnecessary data transfer during the scanning phase. You define the `ml_inference` processor using parameters, as in the following example:

```
processor:
    - ml_inference:
        # The endpoint URL of your OpenSearch domain
        host: "https://AWStest-offlinebatch-123456789abcdefg.us-west-2.es.amazonaws.com"
        
        # Type of inference operation:
        # - batch_predict: for batch processing
        # - predict: for real-time inference
        action_type: "batch_predict"
        
        # Remote ML model service provider (Amazon Bedrock or SageMaker)
        service_name: "bedrock"
        
        # Unique identifier for the ML model
        model_id: "AWSTestModelID123456789abcde"
        
        # S3 path where batch inference results will be stored
        output_path: "s3://amzn-s3-demo-bucket/"
      
        # Supports ISO_8601 notation strings like PT20.345S or PT15M
        # These settings control how long to keep your inputs in the processor for retry on throttling errors
        retry_time_window: "PT9M"
        
        # AWS configuration settings
        aws:
            # AWS Region of the ML service that runs the batch job
            region: "us-west-2"
            # ARN of the IAM role that the pipeline assumes to call the ML service
            sts_role_arn: "arn:aws:iam::account_id:role/Admin"
        
        # Dead-letter queue settings for storing errors
        dlq:
          s3:
            region: us-west-2
            bucket: batch-inference-dlq
            key_path_prefix: bedrock-dlq
            sts_role_arn: arn:aws:iam::account_id:role/OSI-invoke-ml
            
        # Conditional expression that determines when to trigger the processor
        # In this case, only process when bucket matches "amzn-s3-demo-bucket"
        ml_when: /bucket == "amzn-s3-demo-bucket"
```

### Ingestion performance improvements using the ml_inference processor
<a name="configure-clients-ml-commons-batch-ingestion-performance"></a>

The OpenSearch Ingestion `ml_inference` processor significantly enhances data ingestion performance for ML-enabled search. The processor is ideally suited for use cases requiring machine learning model-generated data, including semantic search, multimodal search, document enrichment, and query understanding. In semantic search, the processor can accelerate the creation and ingestion of large-volume, high-dimensional vectors by an order of magnitude.

The processor's offline batch inference capability offers distinct advantages over real-time model invocation. While real-time processing requires a live model server with capacity limitations, batch inference dynamically scales compute resources on demand and processes data in parallel. For example, when the OpenSearch Ingestion pipeline receives one billion source data requests, it creates 100 S3 files for ML batch inference input. The `ml_inference` processor then initiates a SageMaker batch job using 100 `ml.m4.xlarge` Amazon Elastic Compute Cloud (Amazon EC2) instances, completing the vectorization of one billion requests in 14 hours—a task that would be virtually impossible to accomplish in real-time mode.
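As a rough sanity check of that example's arithmetic (the instance count and duration are taken from the paragraph above):

```python
total_requests = 1_000_000_000   # one billion source data requests
instances = 100                  # ml.m4.xlarge instances in the batch job
hours = 14                       # reported completion time

per_instance = total_requests / instances
rate = per_instance / (hours * 3600)   # requests per second per instance
print("each instance sustains about {:.0f} requests/sec".format(rate))
```

That is roughly 200 embedding requests per second per instance, sustained for 14 hours, which illustrates why the parallel batch mode scales where a single real-time endpoint would not.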

## Configure the ml_inference processor to ingest data requests for semantic search
<a name="configure-clients-ml-commons-configuring"></a>

The following procedures walk you through the process of setting up and configuring the OpenSearch Ingestion `ml_inference` processor to ingest one billion data requests for semantic search using a text embedding model.

**Topics**
+ [Step 1: Create connectors and register models in OpenSearch](#configure-clients-ml-commons-configuring-create-connectors)
+ [Step 2: Create an OpenSearch Ingestion pipeline for ML offline batch inference](#configure-clients-ml-commons-configuring-pipeline)
+ [Step 3: Prepare your data for ingestion](#configure-clients-ml-commons-configuring-data)
+ [Step 4: Monitor the batch inference job](#configure-clients-ml-commons-configuring-monitor)
+ [Step 5: Run search](#configure-clients-ml-commons-configuring-semantic-search)

### Step 1: Create connectors and register models in OpenSearch
<a name="configure-clients-ml-commons-configuring-create-connectors"></a>

For the following procedure, use the ML Commons [batch_inference_sagemaker_connector_blueprint](https://github.com/opensearch-project/ml-commons/blob/main/docs/remote_inference_blueprints/batch_inference_sagemaker_connector_blueprint.md) to create a connector and model in Amazon SageMaker. If you prefer to use OpenSearch CloudFormation integration templates, see [(Alternative procedure) Step 1: Create connectors and models using a CloudFormation integration template](#configure-clients-ml-commons-configuring-create-connectors-alternative) later in this section.

**To create connectors and register models in OpenSearch**

1. Create a Deep Java Library (DJL) ML model in SageMaker for batch transform. To view other DJL models, see [semantic_search_with_CFN_template_for_Sagemaker](https://github.com/opensearch-project/ml-commons/blob/main/docs/tutorials/aws/semantic_search_with_CFN_template_for_Sagemaker.md) on GitHub:

   ```
   POST https://api.sagemaker.us-east-1.amazonaws.com/CreateModel
   {
      "ExecutionRoleArn": "arn:aws:iam::123456789012:role/aos_ml_invoke_sagemaker",
      "ModelName": "DJL-Text-Embedding-Model-imageforjsonlines",
      "PrimaryContainer": { 
         "Environment": { 
            "SERVING_LOAD_MODELS" : "djl://ai.djl.huggingface.pytorch/sentence-transformers/all-MiniLM-L6-v2" 
         },
         "Image": "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.29.0-cpu-full"
      }
   }
   ```

1. Create a connector with `batch_predict` as the new `action` type in the `actions` field:

   ```
   POST /_plugins/_ml/connectors/_create
   {
     "name": "DJL Sagemaker Connector: all-MiniLM-L6-v2",
     "version": "1",
     "description": "The connector to sagemaker embedding model all-MiniLM-L6-v2",
     "protocol": "aws_sigv4",
     "credential": {
     "roleArn": "arn:aws:iam::111122223333:role/SageMakerRole"
   },
     "parameters": {
       "region": "us-east-1",
       "service_name": "sagemaker",
       "DataProcessing": {
         "InputFilter": "$.text",
         "JoinSource": "Input",
         "OutputFilter": "$"
       },
       "MaxConcurrentTransforms": 100,
       "ModelName": "DJL-Text-Embedding-Model-imageforjsonlines",
       "TransformInput": {
         "ContentType": "application/json",
         "DataSource": {
           "S3DataSource": {
             "S3DataType": "S3Prefix",
             "S3Uri": "s3://offlinebatch/msmarcotests/"
           }
         },
         "SplitType": "Line"
       },
       "TransformJobName": "djl-batch-transform-1-billion",
       "TransformOutput": {
         "AssembleWith": "Line",
         "Accept": "application/json",
         "S3OutputPath": "s3://offlinebatch/msmarcotestsoutputs/"
       },
       "TransformResources": {
         "InstanceCount": 100,
         "InstanceType": "ml.m4.xlarge"
       },
       "BatchStrategy": "SingleRecord"
     },
     "actions": [
       {
         "action_type": "predict",
         "method": "POST",
         "headers": {
           "content-type": "application/json"
         },
         "url": "https://runtime.sagemaker.us-east-1.amazonaws.com/endpoints/OpenSearch-sagemaker-060124023703/invocations",
         "request_body": "${parameters.input}",
         "pre_process_function": "connector.pre_process.default.embedding",
         "post_process_function": "connector.post_process.default.embedding"
       },
       {
         "action_type": "batch_predict",
         "method": "POST",
         "headers": {
           "content-type": "application/json"
         },
         "url": "https://api.sagemaker.us-east-1.amazonaws.com/CreateTransformJob",
         "request_body": """{ "BatchStrategy": "${parameters.BatchStrategy}", "ModelName": "${parameters.ModelName}", "DataProcessing" : ${parameters.DataProcessing}, "MaxConcurrentTransforms": ${parameters.MaxConcurrentTransforms}, "TransformInput": ${parameters.TransformInput}, "TransformJobName" : "${parameters.TransformJobName}", "TransformOutput" : ${parameters.TransformOutput}, "TransformResources" : ${parameters.TransformResources}}"""
       },
       {
         "action_type": "batch_predict_status",
         "method": "GET",
         "headers": {
           "content-type": "application/json"
         },
         "url": "https://api.sagemaker.us-east-1.amazonaws.com/DescribeTransformJob",
         "request_body": """{ "TransformJobName" : "${parameters.TransformJobName}"}"""
       },
       {
         "action_type": "cancel_batch_predict",
         "method": "POST",
         "headers": {
           "content-type": "application/json"
         },
         "url": "https://api.sagemaker.us-east-1.amazonaws.com/StopTransformJob",
         "request_body": """{ "TransformJobName" : "${parameters.TransformJobName}"}"""
       }
     ]
   }
   ```

1. Use the returned connector ID to register the SageMaker model:

   ```
   POST /_plugins/_ml/models/_register
   {
       "name": "SageMaker model for batch",
       "function_name": "remote",
       "description": "test model",
       "connector_id": "example123456789-abcde"
   }
   ```

1. Invoke the model with the `batch_predict` action type:

   ```
   POST /_plugins/_ml/models/teHr3JABBiEvs-eod7sn/_batch_predict
   {
     "parameters": {
       "TransformJobName": "SM-offline-batch-transform"
     }
   }
   ```

   The response contains a task ID for the batch job:

   ```
   {
    "task_id": "exampleIDabdcefd_1234567",
    "status": "CREATED"
   }
   ```

1. Check the batch job status by calling the Get Task API using the task ID:

   ```
   GET /_plugins/_ml/tasks/exampleIDabdcefd_1234567
   ```

   The response contains the task status:

   ```
   {
     "model_id": "nyWbv5EB_tT1A82ZCu-e",
     "task_type": "BATCH_PREDICTION",
     "function_name": "REMOTE",
     "state": "RUNNING",
     "input_type": "REMOTE",
     "worker_node": [
       "WDZnIMcbTrGtnR4Lq9jPDw"
     ],
     "create_time": 1725496527958,
     "last_update_time": 1725496527958,
     "is_async": false,
     "remote_job": {
       "TransformResources": {
         "InstanceCount": 1,
         "InstanceType": "ml.c5.xlarge"
       },
       "ModelName": "DJL-Text-Embedding-Model-imageforjsonlines",
       "TransformOutput": {
         "Accept": "application/json",
         "AssembleWith": "Line",
         "KmsKeyId": "",
         "S3OutputPath": "s3://offlinebatch/output"
       },
       "CreationTime": 1725496531.935,
       "TransformInput": {
         "CompressionType": "None",
         "ContentType": "application/json",
         "DataSource": {
           "S3DataSource": {
             "S3DataType": "S3Prefix",
             "S3Uri": "s3://offlinebatch/sagemaker_djl_batch_input.json"
           }
         },
         "SplitType": "Line"
       },
       "TransformJobArn": "arn:aws:sagemaker:us-east-1:111122223333:transform-job/SM-offline-batch-transform15",
       "TransformJobStatus": "InProgress",
       "BatchStrategy": "SingleRecord",
       "TransformJobName": "SM-offline-batch-transform15",
       "DataProcessing": {
         "InputFilter": "$.content",
         "JoinSource": "Input",
         "OutputFilter": "$"
       }
     }
   }
   ```
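OpenSearch Ingestion polls job status for you when you use the `ml_inference` processor. If you invoke `_batch_predict` directly as shown above, a client typically polls the Get Task API until the job reaches a terminal state. The following is a minimal sketch; `fetch_task` is a stand-in for an HTTP GET against `/_plugins/_ml/tasks/<task_id>`:

```python
import time

def wait_for_batch_job(fetch_task, task_id,
                       poll_seconds=30, timeout_seconds=3600):
    """Poll a task until SageMaker reports a terminal TransformJobStatus."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        task = fetch_task(task_id)   # e.g. GET /_plugins/_ml/tasks/<task_id>
        status = task.get("remote_job", {}).get("TransformJobStatus")
        # SageMaker transform jobs end in Completed, Failed, or Stopped
        if status in ("Completed", "Failed", "Stopped"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("batch job for task {} did not finish".format(task_id))
```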

#### (Alternative procedure) Step 1: Create connectors and models using a CloudFormation integration template
<a name="configure-clients-ml-commons-configuring-create-connectors-alternative"></a>

If you prefer, you can use AWS CloudFormation to automatically create all required Amazon SageMaker connectors and models for ML inference. This approach simplifies setup by using a preconfigured template available in the Amazon OpenSearch Service console. For more information, see [Using CloudFormation to set up remote inference for semantic search](cfn-template.md).

**To deploy a CloudFormation stack that creates all the required SageMaker connectors and models**

1. Open the Amazon OpenSearch Service console.

1. In the navigation pane, choose **Integrations**.

1. In the Search field, enter **SageMaker**, and then choose **Integration with text embedding models through Amazon SageMaker**.

1. Choose **Configure domain** and then choose **Configure VPC domain** or **Configure public domain**.

1. Enter information in the template fields. For **Enable Offline Batch Inference**, choose **true** to provision resources for offline batch processing.

1. Choose **Create** to create the CloudFormation stack.

1. After the stack is created, open the **Outputs** tab in the CloudFormation console. Locate the **connector_id** and **model_id** values. You will need them later when you configure the pipeline.

### Step 2: Create an OpenSearch Ingestion pipeline for ML offline batch inference
<a name="configure-clients-ml-commons-configuring-pipeline"></a>

Use the following sample to create an OpenSearch Ingestion pipeline for ML offline batch inference. For more information about creating a pipeline for OpenSearch Ingestion, see [Creating Amazon OpenSearch Ingestion pipelines](creating-pipeline.md).

**Before you begin**

In the following sample, you specify an IAM role ARN for the `sts_role_arn` parameter. Use the following procedure to verify that this role is mapped, as a backend role, to an OpenSearch role with ML Commons access.

1. Navigate to the OpenSearch Dashboards plugin for your OpenSearch Service domain. You can find the dashboards endpoint on your domain dashboard on the OpenSearch Service console.

1. From the main menu, choose **Security**, **Roles**, and then select the **ml_full_access** role.

1. Choose **Mapped users**, **Manage mapping**. 

1. Under **Backend roles**, enter the ARN of the pipeline role that needs permission to call your domain. Here is an example: arn:aws:iam::*111122223333*:role/*pipeline-role*

1. Select **Map** and confirm that the role appears under **Mapped users**.
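If you prefer the REST API to the console for this mapping, the OpenSearch Security plugin's roles mapping API applies the same change. Note that `PUT` replaces the existing mapping for the role, so include any backend roles that you want to keep (the ARN here is a placeholder):

```
PUT _plugins/_security/api/rolesmapping/ml_full_access
{
  "backend_roles": ["arn:aws:iam::111122223333:role/pipeline-role"]
}
```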

**Sample to create an OpenSearch Ingestion pipeline for ML offline batch inference**

```
version: '2'
extension:
  osis_configuration_metadata:
    builder_type: visual
sagemaker-batch-job-pipeline:
  source:
    s3:
      acknowledgments: true
      delete_s3_objects_on_read: false
      scan:
        buckets:
          - bucket:
              name: name
              data_selection: metadata_only
              filter:
                include_prefix:
                  - sagemaker/sagemaker_djl_batch_input
                exclude_suffix:
                  - .manifest
          - bucket:
              name: name
              data_selection: data_only
              filter:
                include_prefix:
                  - sagemaker/output/
        scheduling:
          interval: PT6M
      aws:
        region: name
      default_bucket_owner: account_ID
      codec:
        ndjson:
          include_empty_objects: false
      compression: none
      workers: '1'
  processor:
    - ml_inference:
        host: "https://search-AWStest-offlinebatch-123456789abcdef.us-west-2.es.amazonaws.com"
        aws_sigv4: true
        action_type: "batch_predict"
        service_name: "sagemaker"
        model_id: "model_ID"
        output_path: "s3://AWStest-offlinebatch/sagemaker/output"
        aws:
          region: "us-west-2"
          sts_role_arn: "arn:aws:iam::account_ID:role/Admin"
        ml_when: /bucket == "AWStest-offlinebatch"
        dlq:
          s3:
            region: us-west-2
            bucket: batch-inference-dlq
            key_path_prefix: bedrock-dlq
            sts_role_arn: arn:aws:iam::account_ID:role/OSI-invoke-ml
    - copy_values:
        entries:
          - from_key: /text
            to_key: chapter
          - from_key: /SageMakerOutput
            to_key: chapter_embedding
    - delete_entries:
        with_keys:
          - text
          - SageMakerOutput
  sink:
    - opensearch:
        hosts: ["https://search-AWStest-offlinebatch-123456789abcdef.us-west-2.es.amazonaws.com"]
        aws:
          serverless: false
          region: us-west-2
        routes:
          - ml-ingest-route
        index_type: custom
        index: test-nlp-index
  routes:
    - ml-ingest-route: /chapter != null and /title != null
```

### Step 3: Prepare your data for ingestion
<a name="configure-clients-ml-commons-configuring-data"></a>

To prepare your data for ML offline batch inference processing, either prepare the data yourself using your own tools and processes, or use [OpenSearch Data Prepper](https://docs.opensearch.org/latest/data-prepper/getting-started/). Verify that the data is organized into the correct format, either by using a pipeline to consume the data from your data source or by creating a machine learning dataset.

The following example uses the [MS MARCO](https://microsoft.github.io/msmarco/Datasets.html) dataset, which includes a collection of real user queries for natural language processing tasks. The dataset is structured in JSONL format, where each line represents a request sent to the ML embedding model:

```
{"_id": "1185869", "text": ")what was the immediate impact of the Paris Peace Treaties of 1947?", "metadata": ["world war 2"]}
{"_id": "1185868", "text": "_________ justice is designed to repair the harm to victim, the community and the offender caused by the offender criminal act. question 19 options:", "metadata": ["law"]}
{"_id": "597651", "text": "what is amber", "metadata": ["nothing"]}
{"_id": "403613", "text": "is autoimmune hepatitis a bile acid synthesis disorder", "metadata": ["self immune"]}
...
```

To test using the MS MARCO dataset, imagine a scenario where you construct one billion input requests distributed across 100 files, each containing 10 million requests. The files would be stored in Amazon S3 with the prefix s3://offlinebatch/sagemaker/sagemaker\_djl\_batch\_input/. The OpenSearch Ingestion pipeline would scan these 100 files simultaneously and initiate a SageMaker batch job with 100 workers for parallel processing, enabling efficient vectorization and ingestion of the one billion documents into OpenSearch.

In production environments, you can use an OpenSearch Ingestion pipeline to generate S3 files for batch inference input. The pipeline supports various [data sources](https://docs.opensearch.org/latest/data-prepper/pipelines/configuration/sources/sources/) and operates on a schedule to continuously transform source data into S3 files. These files are then automatically processed by AI servers through scheduled offline batch jobs, ensuring continuous data processing and ingestion.
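The file-splitting step described above can be sketched in a few lines of Python. This is a minimal local sketch, assuming newline-delimited JSON input; the function name, shard size, and file naming are illustrative, and uploading the shards to the S3 input prefix is left out:

```python
from pathlib import Path

def shard_jsonl(src: Path, out_dir: Path, lines_per_shard: int) -> list[Path]:
    """Split a JSONL file into fixed-size shard files suitable as batch-input objects."""
    out_dir.mkdir(parents=True, exist_ok=True)
    shards: list[Path] = []
    buf: list[str] = []

    def flush() -> None:
        # Write the accumulated lines as the next numbered shard file.
        shard = out_dir / f"input-{len(shards):05d}.jsonl"
        shard.write_text("".join(buf))
        shards.append(shard)
        buf.clear()

    with src.open() as f:
        for line in f:
            if line.strip():
                buf.append(line)
            if len(buf) == lines_per_shard:
                flush()
    if buf:  # final partial shard
        flush()
    return shards
```

With `lines_per_shard=10_000_000` and one billion requests, this yields the 100 input files described above, and each shard becomes one unit of parallel work for the batch job.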

### Step 4: Monitor the batch inference job
<a name="configure-clients-ml-commons-configuring-monitor"></a>

You can monitor batch inference jobs using the SageMaker console or the AWS CLI. You can also use the ML Commons task APIs to monitor batch jobs. For example, the following request searches for tasks in the `RUNNING` state:

```
GET /_plugins/_ml/tasks/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "state": "RUNNING"
          }
        }
      ]
    }
  },
  "_source": ["model_id", "state", "task_type", "create_time", "last_update_time"]
}
```

The API returns a list of active batch job tasks:

```
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 0.0,
    "hits": [
      {
        "_index": ".plugins-ml-task",
        "_id": "nyWbv5EB_tT1A82ZCu-e",
        "_score": 0.0,
        "_source": {
          "model_id": "nyWbv5EB_tT1A82ZCu-e",
          "state": "RUNNING",
          "task_type": "BATCH_PREDICTION",
          "create_time": 1725496527958,
          "last_update_time": 1725496527958
        }
      },
      {
        "_index": ".plugins-ml-task",
        "_id": "miKbv5EB_tT1A82ZCu-f",
        "_score": 0.0,
        "_source": {
          "model_id": "miKbv5EB_tT1A82ZCu-f",
          "state": "RUNNING",
          "task_type": "BATCH_PREDICTION",
          "create_time": 1725496528123,
          "last_update_time": 1725496528123
        }
      },
      {
        "_index": ".plugins-ml-task",
        "_id": "kiLbv5EB_tT1A82ZCu-g",
        "_score": 0.0,
        "_source": {
          "model_id": "kiLbv5EB_tT1A82ZCu-g",
          "state": "RUNNING",
          "task_type": "BATCH_PREDICTION",
          "create_time": 1725496529456,
          "last_update_time": 1725496529456
        }
      }
    ]
  }
}
```

### Step 5: Run search
<a name="configure-clients-ml-commons-configuring-semantic-search"></a>

After monitoring the batch inference job and confirming it completed, you can run various types of AI searches, including semantic, hybrid, conversational (with RAG), neural sparse, and multimodal. For more information about AI searches supported by OpenSearch Service, see [AI search](https://docs.opensearch.org/latest/vector-search/ai-search/index/). 

To search raw vectors, use the `knn` query type, provide the `vector` array as input, and specify the `k` number of returned results:

```
GET /my-raw-vector-index/_search
{
  "query": {
    "knn": {
      "my_vector": {
        "vector": [0.1, 0.2, 0.3],
        "k": 2
      }
    }
  }
}
```

To run an AI-powered search, use the `neural` query type. Specify the `query_text` input, the `model_id` of the embedding model you configured in the OpenSearch Ingestion pipeline, and the `k` number of returned results. To exclude embeddings from search results, specify the name of the embedding field in the `_source.excludes` parameter:

```
GET /my-ai-search-index/_search
{
  "_source": {
    "excludes": [
      "output_embedding"
    ]
  },
  "query": {
    "neural": {
      "output_embedding": {
        "query_text": "What is AI search?",
        "model_id": "mBGzipQB2gmRjlv_dOoB",
        "k": 2
      }
    }
  }
}
```

# Using an OpenSearch Ingestion pipeline with OpenTelemetry Collector
<a name="configure-client-otel"></a>

You can use the [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/) to ingest logs, traces, and metrics into OpenSearch Ingestion pipelines. A single pipeline can be used to ingest all logs, traces, and metrics to different indices on a domain or collection. You can also use pipelines to ingest only logs, traces, or metrics individually. 

**Topics**
+ [Prerequisites](#otel-prereqs)
+ [Step 1: Configure the pipeline role](#otel-pipeline-role)
+ [Step 2: Create the pipeline](#create-otel-pipeline)
+ [Cross-account Connectivity](#x-account-connectivity)
+ [Limitations](#otel-limitations)
+ [Recommended CloudWatch Alarms for OpenTelemetry sources](#otel-pipeline-metrics)

## Prerequisites
<a name="otel-prereqs"></a>

While setting up the [OpenTelemetry configuration file](https://opentelemetry.io/docs/collector/configuration/), you must configure the following in order for ingestion to occur: 
+ The ingestion role needs the `osis:Ingest` permission to interact with the pipeline. For more information, see [Ingestion role](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/pipeline-security-overview.html#pipeline-security-same-account). 
+ The endpoint value must include your pipeline endpoint. For example, `https://pipeline-endpoint.us-east-1.osis.amazonaws.com`.
+ The service value must be `osis`.
+ The compression option for the OTLP/HTTP Exporter must match the compression option on the pipeline's selected source.

```
extensions:
    sigv4auth:
        region: "region"
        service: "osis"

exporters:
    otlphttp:
        logs_endpoint: "https://pipeline-endpoint.us-east-1.osis.amazonaws.com/v1/logs"
        metrics_endpoint: "https://pipeline-endpoint.us-east-1.osis.amazonaws.com/v1/metrics"
        traces_endpoint: "https://pipeline-endpoint.us-east-1.osis.amazonaws.com/v1/traces"
        auth:
            authenticator: sigv4auth
        compression: none

receivers:
    jaeger:
        protocols:
            grpc:

service:
    extensions: [sigv4auth]
    pipelines:
        traces:
            receivers: [jaeger]
            exporters: [otlphttp]
```

## Step 1: Configure the pipeline role
<a name="otel-pipeline-role"></a>

After setting up the OpenTelemetry Collector configuration, [set up the pipeline role](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/pipeline-security-overview.html#pipeline-security-sink) that you want to use in your pipeline configuration. The OTLP source doesn't require any source-specific permissions; the pipeline role only needs permissions that grant the pipeline access to the OpenSearch domain or collection.

## Step 2: Create the pipeline
<a name="create-otel-pipeline"></a>

 You can then configure an OpenSearch Ingestion pipeline like the following, which specifies OTLP as the source. You can also configure OpenTelemetry logs, metrics, and traces as individual sources. 

OTLP source pipeline configuration:

```
version: 2
otlp-pipeline:
    source:
        otlp:
            logs_path: /otlp-pipeline/v1/logs
            traces_path: /otlp-pipeline/v1/traces
            metrics_path: /otlp-pipeline/v1/metrics
    sink:
        - opensearch:
            hosts: ["https://search-mydomain.region.es.amazonaws.com"]
            index: "ss4o_metrics-otel-%{yyyy.MM.dd}"
            index_type: custom
            aws:
                region: "region"
```

OpenTelemetry Logs pipeline configuration:

```
version: 2
otel-logs-pipeline:
  source:
    otel_logs_source:
        path: /otel-logs-pipeline/v1/logs
  sink:
    - opensearch:
        hosts: ["https://search-mydomain.region.es.amazonaws.com"]
        index: "ss4o_logs-otel-%{yyyy.MM.dd}"
        index_type: custom
        aws:
            region: "region"
```

OpenTelemetry Metrics pipeline configuration:

```
version: 2
otel-metrics-pipeline:
  source:
    otel_metrics_source:
        path: /otel-metrics-pipeline/v1/metrics
  sink:
    - opensearch:
        hosts: ["https://search-mydomain.region.es.amazonaws.com"]
        index: "ss4o_metrics-otel-%{yyyy.MM.dd}"
        index_type: custom
        aws:
            region: "region"
```

OpenTelemetry Traces pipeline configuration:

```
version: 2
otel-trace-pipeline:
  source:
    otel_trace_source:
        path: /otel-trace-pipeline/v1/traces
  sink:
    - opensearch:
        hosts: ["https://search-mydomain.region.es.amazonaws.com"]
        index: "ss4o_traces-otel-%{yyyy.MM.dd}"
        index_type: custom
        aws:
            region: "region"
```

You can use a preconfigured blueprint to create this pipeline. For more information, see [Working with blueprints](pipeline-blueprint.md). 

## Cross-account Connectivity
<a name="x-account-connectivity"></a>

 OpenSearch Ingestion pipelines with OpenTelemetry sources have cross-account ingestion capability. Amazon OpenSearch Ingestion enables you to share pipelines across AWS accounts from a virtual private cloud (VPC) to a pipeline endpoint in a separate VPC. For more information, see [Configuring OpenSearch Ingestion pipelines for cross-account ingestion](cross-account-pipelines.md). 

## Limitations
<a name="otel-limitations"></a>

The OpenSearch Ingestion pipeline can't receive requests larger than 20 MB. You can configure the maximum request size with the `max_request_length` option on the pipeline source. This option defaults to 10 MB, and 20 MB is the maximum allowed value.
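For example, to raise the limit to the 20 MB maximum on an OTLP source, you could set the option in the source configuration. This is a sketch; the pipeline name is illustrative:

```
version: 2
otlp-pipeline:
    source:
        otlp:
            max_request_length: 20mb
```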

## Recommended CloudWatch Alarms for OpenTelemetry sources
<a name="otel-pipeline-metrics"></a>

The following CloudWatch metrics are recommended for monitoring the performance of your ingestion pipeline. These metrics can help you identify the amount of data processed from exports, the number of events processed from streams, errors in processing exports and stream events, and the number of documents written to the destination. You can set up CloudWatch alarms to perform an action when one of these metrics exceeds a specified value for a specified amount of time.

The CloudWatch metrics for the OTLP source are formatted as `{pipeline-name}.otlp.{logs | traces | metrics}.{metric-name}`. For example, `otel-pipeline.otlp.metrics.requestTimeouts.count`.

If you use an individual OpenTelemetry source, the metrics are formatted as `{pipeline-name}.{source-name}.{metric-name}`. For example, `trace-pipeline.otel_trace_source.requestTimeouts.count`.

All three OpenTelemetry data types emit the same metrics. For brevity, the following table lists the metrics only for OTLP source log data.


| Metric | Description | 
| --- | --- | 
| `otel-pipeline.BlockingBuffer.bufferUsage.value` | Indicates how much of the buffer is being utilized. | 
| `otel-pipeline.otlp.logs.requestTimeouts.count` | The number of requests that timed out. | 
| `otel-pipeline.otlp.logs.requestsReceived.count` | The number of requests received from the OpenTelemetry Collector. | 
| `otel-pipeline.otlp.logs.badRequests.count` | The number of malformed requests received from the OpenTelemetry Collector. | 
| `otel-pipeline.otlp.logs.requestsTooLarge.count` | The number of received requests that exceed the 20 MB maximum. | 
| `otel-pipeline.otlp.logs.internalServerError.count` | The number of requests that failed with an HTTP 500 error. | 
| `otel-pipeline.opensearch.bulkBadRequestErrors.count` | Count of errors during bulk requests due to malformed requests. | 
| `otel-pipeline.opensearch.bulkRequestLatency.avg` | Average latency for bulk write requests made to OpenSearch. | 
| `otel-pipeline.opensearch.bulkRequestNotFoundErrors.count` | Number of bulk requests that failed because the target data could not be found. | 
| `otel-pipeline.opensearch.bulkRequestNumberOfRetries.count` | Number of retries by OpenSearch Ingestion pipelines to write to the OpenSearch cluster. | 
| `otel-pipeline.opensearch.bulkRequestSizeBytes.sum` | Total size in bytes of all bulk requests made to OpenSearch. | 
| `otel-pipeline.opensearch.documentErrors.count` | Number of errors when sending documents to OpenSearch. The documents causing the errors will be sent to the DLQ. | 
| `otel-pipeline.opensearch.documentsSuccess.count` | Number of documents successfully written to an OpenSearch cluster or collection. | 
| `otel-pipeline.opensearch.documentsSuccessFirstAttempt.count` | Number of documents successfully indexed in OpenSearch on the first attempt. | 
| `otel-pipeline.opensearch.documentsVersionConflictErrors.count` | Count of errors due to version conflicts in documents during processing. | 
| `otel-pipeline.opensearch.PipelineLatency.avg` | Average latency for the OpenSearch Ingestion pipeline to process data, from reading from the source to writing to the destination. | 
| `otel-pipeline.opensearch.PipelineLatency.max` | Maximum latency for the OpenSearch Ingestion pipeline to process data, from reading from the source to writing to the destination. | 
| `otel-pipeline.opensearch.recordsIn.count` | Count of records successfully ingested into OpenSearch. This metric is essential for tracking the volume of data being processed and stored. | 
| `otel-pipeline.opensearch.s3.dlqS3RecordsFailed.count` | Number of records that failed to write to the DLQ. | 
| `otel-pipeline.opensearch.s3.dlqS3RecordsSuccess.count` | Number of records successfully written to the DLQ. | 
| `otel-pipeline.opensearch.s3.dlqS3RequestLatency.count` | Count of latency measurements for requests to the Amazon S3 dead-letter queue. | 
| `otel-pipeline.opensearch.s3.dlqS3RequestLatency.sum` | Total latency for all requests to the Amazon S3 dead-letter queue. | 
| `otel-pipeline.opensearch.s3.dlqS3RequestSizeBytes.sum` | Total size in bytes of all requests made to the Amazon S3 dead-letter queue. | 
| `otel-pipeline.recordsProcessed.count` | Total number of records processed in the pipeline, a key metric for overall throughput. | 
| `otel-pipeline.opensearch.bulkRequestInvalidInputErrors.count` | Count of bulk request errors in OpenSearch due to invalid input, crucial for monitoring data quality and operational issues. | 

# Using an OpenSearch Ingestion pipeline with Amazon Managed Service for Prometheus
<a name="configure-client-prometheus"></a>

You can use Amazon Managed Service for Prometheus as a destination for your OpenSearch Ingestion pipeline to store metrics in time series format. The Prometheus sink allows you to send OpenTelemetry metrics or other time series data from your pipeline to an Amazon Managed Service for Prometheus workspace for monitoring, alerting, and analysis.

The `prometheus` sink plugin enables OpenSearch Ingestion pipelines to write metrics data to Amazon Managed Service for Prometheus workspaces using the Prometheus remote write protocol. This integration allows you to:
+ Store time series metrics data in Amazon Managed Service for Prometheus
+ Monitor and alert on metrics using Amazon Managed Service for Prometheus and Amazon Managed Grafana
+ Route metrics to multiple destinations simultaneously (for example, OpenSearch and Amazon Managed Service for Prometheus)
+ Process OpenTelemetry metrics from external agents or generate metrics within the pipeline

**Topics**
+ [Prerequisites](#prometheus-prereqs)
+ [Step 1: Configure the pipeline role](#prometheus-pipeline-role)
+ [Step 2: Create the pipeline](#prometheus-pipeline)
+ [Monitoring and troubleshooting](#prometheus-monitoring)
+ [Limitations](#prometheus-limitations)
+ [Best practices](#prometheus-best-practices)

## Prerequisites
<a name="prometheus-prereqs"></a>

Before you configure the Prometheus sink, ensure you have the following:
+ **Amazon Managed Service for Prometheus workspace**: Create a workspace in the same AWS account and AWS Region as your OpenSearch Ingestion pipeline. For instructions, see [Creating a workspace](https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-onboard-create-workspace.html) in the *Amazon Managed Service for Prometheus User Guide*.
+ **IAM permissions**: Configure an IAM role with permissions to write to Amazon Managed Service for Prometheus. For more information, see [Step 1: Configure the pipeline role](#prometheus-pipeline-role).

**Note**  
Amazon Managed Service for Prometheus workspaces must use AWS service-managed AWS KMS keys. Customer-managed AWS KMS keys are not currently supported for Amazon Managed Service for Prometheus sinks in OpenSearch Ingestion pipelines.

## Step 1: Configure the pipeline role
<a name="prometheus-pipeline-role"></a>

The Prometheus sink automatically inherits the [pipeline role's](pipeline-security-overview.md#pipeline-security-sink) IAM permissions for authentication, so no additional role configuration (like `sts_role_arn`) is required in the sink settings.

The following sample policy shows the required permissions for using Amazon Managed Service for Prometheus as a sink:

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AMPRemoteWrite",
      "Effect": "Allow",
      "Action": [
        "aps:RemoteWrite"
      ],
      "Resource": "arn:aws:aps:region:account-id:workspace/workspace-id"
    }
  ]
}
```

Replace the following placeholders:
+ `region`: Your AWS Region (for example, `us-east-1`)
+ `account-id`: Your AWS account ID
+ `workspace-id`: Your Amazon Managed Service for Prometheus workspace ID

You must attach these permissions to your pipeline role.

Ensure your pipeline role has a trust relationship that allows OpenSearch Ingestion to assume it:

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "osis-pipelines.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```

## Step 2: Create the pipeline
<a name="prometheus-pipeline"></a>

After you've set up your permissions, you can configure an OpenSearch Ingestion pipeline to use Amazon Managed Service for Prometheus as a sink.

### Basic configuration
<a name="prometheus-basic-config"></a>

The following example shows a minimal Prometheus sink configuration:

```
version: "2"
sink:
  - prometheus:
      url: "https://aps-workspaces.region.amazonaws.com/workspaces/workspace-id/api/v1/remote_write"
      aws:
        region: "region"
```

You must specify the `url` option within the `prometheus` sink configuration, which is the Amazon Managed Service for Prometheus remote write endpoint. To format the URL, locate your workspace ID in the Amazon Managed Service for Prometheus console and construct the URL as follows: `https://aps-workspaces.region.amazonaws.com/workspaces/workspace-id/api/v1/remote_write`.

### Configuration options
<a name="prometheus-config-options"></a>

Use the following options to configure batching and flushing behavior for the Prometheus sink:


**Prometheus sink configuration options**  

| Option | Required | Type | Description | 
| --- | --- | --- | --- | 
| `max_events` | No | Integer | The maximum number of events to accumulate before flushing to Prometheus. Default is 1000. | 
| `max_request_size` | No | Byte Count | The maximum size of the request payload before flushing. Default is 1mb. | 
| `flush_interval` | No | Duration | The maximum amount of time to wait before flushing events. Default is 10s. Maximum allowed value is 60s. | 
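Putting these options together, a sink with explicit batching might look like the following sketch; the workspace URL and the chosen values are placeholders:

```
sink:
  - prometheus:
      url: "https://aps-workspaces.region.amazonaws.com/workspaces/workspace-id/api/v1/remote_write"
      max_events: 500
      max_request_size: 1mb
      flush_interval: 30s
      aws:
        region: "region"
```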

### Example pipelines
<a name="prometheus-example-pipelines"></a>

**Example 1: OpenTelemetry metrics to Amazon Managed Service for Prometheus**

This pipeline receives OpenTelemetry metrics from an external agent and writes them to Amazon Managed Service for Prometheus:

```
version: "2"
source:
  otel_metrics_source:
    path: "/v1/metrics"
    output_format: otel

sink:
  - prometheus:
      url: "https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-a1b2c3d4-5678-90ab-cdef-EXAMPLE11111/api/v1/remote_write"
      aws:
        region: "us-east-1"
```

**Example 2: Dual sink - OpenSearch and Amazon Managed Service for Prometheus**

This pipeline routes metrics to both OpenSearch and Amazon Managed Service for Prometheus:

```
version: "2"
source:
  otel_metrics_source:
    path: "/v1/metrics"
    output_format: otel

sink:
  - opensearch:
      hosts:
        - "https://search-domain-endpoint.us-east-1.es.amazonaws.com"
      index: "metrics-%{yyyy.MM.dd}"
      aws:
        region: "us-east-1"
        sts_role_arn: "arn:aws:iam::123456789012:role/OSI-Pipeline-Role"

  - prometheus:
      url: "https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-a1b2c3d4-5678-90ab-cdef-EXAMPLE11111/api/v1/remote_write"
      aws:
        region: "us-east-1"
```

**Example 3: Metrics with filtering**

This pipeline filters metrics before sending to Amazon Managed Service for Prometheus:

```
version: "2"
source:
  otel_metrics_source:
    path: "/v1/metrics"
    output_format: otel

processor:
  - drop_events:
      drop_when: '/name != "http.server.duration" and /name != "http.client.duration"'

sink:
  - prometheus:
      url: "https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-a1b2c3d4-5678-90ab-cdef-EXAMPLE11111/api/v1/remote_write"
      aws:
        region: "us-east-1"
```

You can use a preconfigured Amazon Managed Service for Prometheus blueprint to create these pipelines. For more information, see [Working with blueprints](pipeline-blueprint.md).

### Creating a pipeline with Amazon Managed Service for Prometheus sink
<a name="prometheus-create-pipeline"></a>

#### Using the AWS Console
<a name="prometheus-console"></a>

1. Navigate to the OpenSearch Service console.

1. Choose **Pipelines** under **Ingestion**.

1. Choose **Create pipeline**.

1. Select **Build using blueprint** and choose the **OpenTelemetry metrics to Amazon Prometheus** blueprint.

1. Configure the pipeline:
   + Enter your Amazon Managed Service for Prometheus workspace ID
   + Specify the pipeline role ARN
   + Configure source and processor settings as needed

1. Review and create the pipeline.

#### Using the AWS CLI
<a name="prometheus-cli"></a>

Create a pipeline configuration file (for example, `amp-pipeline.yaml`) with your desired configuration, then run:

```
aws osis create-pipeline \
  --pipeline-name my-amp-pipeline \
  --min-units 2 \
  --max-units 4 \
  --pipeline-configuration-body file://amp-pipeline.yaml
```

#### Using AWS CloudFormation
<a name="prometheus-cfn"></a>

```
Resources:
  MyAMPPipeline:
    Type: AWS::OSIS::Pipeline
    Properties:
      PipelineName: my-amp-pipeline
      MinUnits: 2
      MaxUnits: 4
      PipelineConfigurationBody: |
        version: "2"
        source:
          otel_metrics_source:
            path: "/v1/metrics"
            output_format: otel
        sink:
          - prometheus:
              url: "https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-a1b2c3d4-5678-90ab-cdef-EXAMPLE11111/api/v1/remote_write"
              aws:
                region: "us-east-1"
```

## Monitoring and troubleshooting
<a name="prometheus-monitoring"></a>

### CloudWatch metrics
<a name="prometheus-cloudwatch-metrics"></a>

Monitor your pipeline's performance using CloudWatch metrics:
+ `DocumentsWritten`: Number of metrics successfully written to Amazon Managed Service for Prometheus
+ `DocumentsWriteFailed`: Number of metrics that failed to write
+ `RequestLatency`: Latency of remote write requests

### Common issues
<a name="prometheus-troubleshooting"></a>

**Issue**: Pipeline fails to write to Amazon Managed Service for Prometheus

**Solutions**:
+ Verify the workspace ID and region in the URL are correct
+ Ensure the pipeline role has `aps:RemoteWrite` permission
+ Check that the workspace uses service-managed AWS KMS keys
+ Verify the pipeline and workspace are in the same AWS account

**Issue**: Authentication errors

**Solutions**:
+ Verify the trust relationship allows `osis-pipelines.amazonaws.com` to assume the pipeline role
+ Ensure the pipeline role has the required `aps:RemoteWrite` permission

**Issue**: High latency or throttling

**Solutions**:
+ Increase pipeline capacity units
+ Implement batching in the processor
+ Review Amazon Managed Service for Prometheus service quotas

## Limitations
<a name="prometheus-limitations"></a>

Consider the following limitations when you set up an OpenSearch Ingestion pipeline for Amazon Managed Service for Prometheus:
+ Amazon Managed Service for Prometheus workspaces must use AWS service-managed AWS KMS keys. Customer-managed AWS KMS keys are not currently supported.
+ The pipeline and Amazon Managed Service for Prometheus workspace must be in the same AWS account.

## Best practices
<a name="prometheus-best-practices"></a>
+ **Use the same IAM role**: The Prometheus sink automatically uses the pipeline role. If other sinks are used, ensure the `sts_role_arn` is the same as the pipeline role
+ **Monitor metrics**: Set up CloudWatch alarms for failed writes and high latency
+ **Implement filtering**: Use processors to filter unnecessary metrics before sending to Amazon Managed Service for Prometheus
+ **Right-size capacity**: Start with minimum capacity and scale based on metrics volume
+ **Use blueprints**: Leverage pre-configured blueprints for common use cases

# Using an OpenSearch Ingestion pipeline with Kafka
<a name="configure-client-self-managed-kafka"></a>

You can use the [Kafka](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/kafka/) plugin to stream data from self-managed Kafka clusters to Amazon OpenSearch Service domains and OpenSearch Serverless collections. OpenSearch Ingestion supports connections from Kafka clusters configured with either public or private (VPC) networking. This topic outlines the prerequisites and steps to set up an ingestion pipeline, including configuring network settings and authentication methods such as mutual TLS (mTLS), SASL/SCRAM, or IAM.

## Migrating data from public Kafka clusters
<a name="self-managaged-kafka-public"></a>

You can use OpenSearch Ingestion pipelines to migrate data from a public self-managed Kafka cluster, which means that the cluster's DNS name can be publicly resolved. To do so, set up an OpenSearch Ingestion pipeline with self-managed Kafka as the source and OpenSearch Service or OpenSearch Serverless as the destination. This processes your streaming data from a self-managed source cluster to an AWS-managed destination domain or collection. 

### Prerequisites
<a name="self-managaged-kafka-public-prereqs"></a>

Before you create your OpenSearch Ingestion pipeline, perform the following steps:

1. Create a self-managed Kafka cluster with a public network configuration. The cluster should contain the data you want to ingest into OpenSearch Service.

1. Create an OpenSearch Service domain or OpenSearch Serverless collection where you want to migrate data to. For more information, see [Creating OpenSearch Service domains](createupdatedomains.md#createdomains) and [Creating collections](serverless-create.md).

1. Set up authentication on your self-managed cluster with AWS Secrets Manager. Enable secrets rotation by following the steps in [Rotate AWS Secrets Manager secrets](https://docs.aws.amazon.com/secretsmanager/latest/userguide/rotating-secrets.html).

1. Attach a [resource-based policy](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ac.html#ac-types-resource) to your domain or a [data access policy](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-data-access.html) to your collection. These access policies allow OpenSearch Ingestion to write data from your self-managed cluster to your domain or collection. 

   The following sample domain access policy allows the pipeline role, which you create in the next step, to write data to a domain. Make sure that you update the `resource` with your own ARN. 


   ```
   {
     "Version":"2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Principal": {
           "AWS": "arn:aws:iam::444455556666:role/pipeline-role"
         },
         "Action": [
           "es:DescribeDomain",
           "es:ESHttp*"
         ],
         "Resource": [
           "arn:aws:es:us-east-1:111122223333:domain/domain-name"
         ]
       }
     ]
   }
   ```


   To create an IAM role with the correct permissions to write data to the collection or domain, see [Setting up roles and users in Amazon OpenSearch Ingestion](pipeline-security-overview.md).

### Step 1: Configure the pipeline role
<a name="self-managed-kafka-public-pipeline-role"></a>

After you have your Kafka pipeline prerequisites set up, [configure the pipeline role](pipeline-security-overview.md#pipeline-security-sink) that you want to use in your pipeline configuration, and add permission to write to an OpenSearch Service domain or OpenSearch Serverless collection, as well as permission to read secrets from Secrets Manager.

### Step 2: Create the pipeline
<a name="self-managed-kafka-public-pipeline"></a>

You can then configure an OpenSearch Ingestion pipeline like the following, which specifies Kafka as the source. 

You can specify multiple OpenSearch Service domains as destinations for your data. This capability enables conditional routing or replication of incoming data into multiple OpenSearch Service domains.

You can also migrate data from a source Confluent Kafka cluster to an OpenSearch Serverless VPC collection. Ensure that you provide a network access policy within the pipeline configuration. You can use a Confluent schema registry to define the schema of your records.

```
version: "2"
kafka-pipeline:
  source:
    kafka:
      encryption:
        type: "ssl"
      topics:
        - name: "topic-name"
          group_id: "group-id"
      bootstrap_servers:
        - "bootstrap-server.us-east-1.aws.private.confluent.cloud:9092"
      authentication:
        sasl:
          plain:
            username: ${aws_secrets:confluent-kafka-secret:username}
            password: ${aws_secrets:confluent-kafka-secret:password}
      schema:
        type: confluent
        registry_url: https://my-registry.us-east-1.aws.confluent.cloud
        api_key: "${{aws_secrets:schema-secret:schema_registry_api_key}}"
        api_secret: "${{aws_secrets:schema-secret:schema_registry_api_secret}}"
        basic_auth_credentials_source: "USER_INFO"
  sink:
  - opensearch:
      hosts: ["https://search-mydomain.us-east-1.es.amazonaws.com"]
      aws:
          region: "us-east-1"
      index: "confluent-index"
extension:
  aws:
    secrets:
      confluent-kafka-secret:
        secret_id: "my-kafka-secret"
        region: "us-east-1"
      schema-secret:
        secret_id: "my-self-managed-kafka-schema"
        region: "us-east-1"
```

You can use a preconfigured blueprint to create this pipeline. For more information, see [Working with blueprints](pipeline-blueprint.md).
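The conditional routing capability described earlier can be sketched with Data Prepper's `route` option, where each sink subscribes to one or more named routes. The route names, the `loglevel` field, and the domain endpoints and indexes below are hypothetical; adapt them to your data.

```
kafka-pipeline:
  source:
    kafka:
      # ... Kafka source configuration as shown above ...
  route:
    - error-logs: '/loglevel == "ERROR"'
    - info-logs: '/loglevel == "INFO"'
  sink:
    - opensearch:
        hosts: ["https://search-errors-domain.us-east-1.es.amazonaws.com"]
        index: "error-index"
        routes: ["error-logs"]
        aws:
          region: "us-east-1"
    - opensearch:
        hosts: ["https://search-info-domain.us-east-1.es.amazonaws.com"]
        index: "info-index"
        routes: ["info-logs"]
        aws:
          region: "us-east-1"
```

Events that match a route's condition expression are delivered only to the sinks that list that route, so each domain receives just the subset of data you direct to it.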

### Migrating data from Kafka clusters in a VPC
<a name="self-managaged-kafka-private"></a>

You can also use OpenSearch Ingestion pipelines to migrate data from a self-managed Kafka cluster running in a VPC. To do so, set up an OpenSearch Ingestion pipeline with self-managed Kafka as the source and OpenSearch Service or OpenSearch Serverless as the destination. This streams your data from a self-managed source cluster to an AWS-managed destination domain or collection.

#### Prerequisites
<a name="self-managaged-kafka-private-prereqs"></a>

Before you create your OpenSearch Ingestion pipeline, perform the following steps:

1. Create a self-managed Kafka cluster with a VPC network configuration that contains the data you want to ingest into OpenSearch Service. 

1. Create an OpenSearch Service domain or OpenSearch Serverless collection where you want to migrate data to. For more information, see [Creating OpenSearch Service domains](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/createupdatedomains.html#createdomains) and [Creating collections](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-manage.html#serverless-create).

1. Set up authentication on your self-managed cluster with AWS Secrets Manager. Enable secrets rotation by following the steps in [Rotate AWS Secrets Manager secrets](https://docs.aws.amazon.com/secretsmanager/latest/userguide/rotating-secrets.html).

1. Obtain the ID of the VPC that has access to your self-managed Kafka cluster. Choose the VPC CIDR to be used by OpenSearch Ingestion.
**Note**  
If you're using the AWS Management Console to create your pipeline, you must also attach your OpenSearch Ingestion pipeline to your VPC in order to use self-managed Kafka. To do so, find the **Network configuration** section, select the **Attach to VPC** checkbox, and choose your CIDR from one of the provided default options, or select your own. You can use any CIDR from a private address space as defined in the [RFC 1918 Best Current Practice](https://datatracker.ietf.org/doc/html/rfc1918).  
To provide a custom CIDR, select **Other** from the dropdown menu. To avoid a collision in IP addresses between OpenSearch Ingestion and your self-managed Kafka cluster, ensure that the Kafka VPC CIDR is different from the CIDR for OpenSearch Ingestion.

1. Attach a [resource-based policy](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ac.html#ac-types-resource) to your domain or a [data access policy](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-data-access.html) to your collection. These access policies allow OpenSearch Ingestion to write data from your self-managed cluster to your domain or collection. 

   The following sample domain access policy allows the pipeline role, which you create in the next step, to write data to a domain. Make sure that you replace the `Resource` value with your own domain ARN.

------
#### [ JSON ]

****  

   ```
   {
     "Version":"2012-10-17",		 	 	 
     "Statement": [
       {
         "Effect": "Allow",
         "Principal": {
           "AWS": "arn:aws:iam::444455556666:role/pipeline-role"
         },
         "Action": [
           "es:DescribeDomain",
           "es:ESHttp*"
         ],
         "Resource": [
           "arn:aws:es:us-east-1:111122223333:domain/domain-name"
         ]
       }
     ]
   }
   ```

------

   To create an IAM role with the correct permissions to write data to the collection or domain, see [Setting up roles and users in Amazon OpenSearch Ingestion](pipeline-security-overview.md).

#### Step 1: Configure the pipeline role
<a name="self-managed-kafka-private-pipeline-role"></a>

After you have your pipeline prerequisites set up, [configure the pipeline role](pipeline-security-overview.md#pipeline-security-sink) that you want to use in your pipeline configuration, and add the following permissions in the role:

------
#### [ JSON ]

****  

```
{
    "Version":"2012-10-17",		 	 	 
    "Statement": [
        {
            "Sid": "SecretsManagerReadAccess",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetSecretValue"
            ],
            "Resource": ["arn:aws:secretsmanager:us-east-1:111122223333:secret:secret-name"]
        },
        {
            "Effect": "Allow",
            "Action": [
                "ec2:AttachNetworkInterface",
                "ec2:CreateNetworkInterface",
                "ec2:CreateNetworkInterfacePermission",
                "ec2:DeleteNetworkInterface",
                "ec2:DeleteNetworkInterfacePermission",
                "ec2:DetachNetworkInterface",
                "ec2:DescribeNetworkInterfaces"
            ],
            "Resource": [
                "arn:aws:ec2:*:*:network-interface/*",
                "arn:aws:ec2:*:*:subnet/*",
                "arn:aws:ec2:*:*:security-group/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeDhcpOptions",
                "ec2:DescribeRouteTables",
                "ec2:DescribeSecurityGroups",
                "ec2:DescribeSubnets",
                "ec2:DescribeVpcs",
                "ec2:Describe*"
            ],
            "Resource": "*"
        },
        { 
            "Effect": "Allow",
            "Action": [ 
                "ec2:CreateTags"
            ],
            "Resource": "arn:aws:ec2:*:*:network-interface/*",
            "Condition": { 
               "StringEquals": 
                    {
                        "aws:RequestTag/OSISManaged": "true"
                    } 
            } 
        }
    ]
}
```

------

You must provide the above Amazon EC2 permissions on the IAM role that you use to create the OpenSearch Ingestion pipeline because the pipeline uses these permissions to create and delete a network interface in your VPC. The pipeline can only access the Kafka cluster through this network interface.

#### Step 2: Create the pipeline
<a name="self-managed-kafka-private-pipeline"></a>

You can then configure an OpenSearch Ingestion pipeline like the following, which specifies Kafka as the source.

You can specify multiple OpenSearch Service domains as destinations for your data. This capability enables conditional routing or replication of incoming data into multiple OpenSearch Service domains.

You can also migrate data from a source Confluent Kafka cluster to an OpenSearch Serverless VPC collection. Ensure you provide a network access policy within the pipeline configuration. You can use a Confluent schema registry to define a Confluent schema.

```
 version: "2"
kafka-pipeline:
  source:
    kafka:
      encryption:
        type: "ssl"
      topics:
        - name: "topic-name"
          group_id: "group-id"
      bootstrap_servers:
        - "bootstrap-server.us-east-1.aws.private.confluent.cloud:9092"
      authentication:
        sasl:
          plain:
            username: ${aws_secrets:confluent-kafka-secret:username}
            password: ${aws_secrets:confluent-kafka-secret:password}
      schema:
        type: confluent
        registry_url: https://my-registry.us-east-1.aws.confluent.cloud
        api_key: "${{aws_secrets:schema-secret:schema_registry_api_key}}"
        api_secret: "${{aws_secrets:schema-secret:schema_registry_api_secret}}"
        basic_auth_credentials_source: "USER_INFO"
  sink:
  - opensearch:
      hosts: ["https://search-mydomain.us-east-1.es.amazonaws.com"]
      aws:
          region: "us-east-1"
      index: "confluent-index"
extension:
  aws:
    secrets:
      confluent-kafka-secret:
        secret_id: "my-kafka-secret"
        region: "us-east-1"
      schema-secret:
        secret_id: "my-self-managed-kafka-schema"
        region: "us-east-1"
```

You can use a preconfigured blueprint to create this pipeline. For more information, see [Working with blueprints](pipeline-blueprint.md).

# Migrating data from self-managed OpenSearch clusters using Amazon OpenSearch Ingestion
<a name="configure-client-self-managed-opensearch"></a>

You can use an Amazon OpenSearch Ingestion pipeline with self-managed OpenSearch or Elasticsearch to migrate data to Amazon OpenSearch Service domains and OpenSearch Serverless collections. OpenSearch Ingestion supports both public and private network configurations for the migration of data from self-managed OpenSearch and Elasticsearch. 

## Migrating from public OpenSearch clusters
<a name="self-managaged-opensearch-public"></a>

You can use OpenSearch Ingestion pipelines to migrate data from a self-managed OpenSearch or Elasticsearch cluster with a public configuration, which means that the domain DNS name can be publicly resolved. To do so, set up an OpenSearch Ingestion pipeline with self-managed OpenSearch or Elasticsearch as the source and OpenSearch Service or OpenSearch Serverless as the destination. This effectively migrates your data from a self-managed source cluster to an AWS-managed destination domain or collection.

### Prerequisites
<a name="self-managaged-opensearch-public-prereqs"></a>

Before you create your OpenSearch Ingestion pipeline, perform the following steps:

1. Create a self-managed OpenSearch or Elasticsearch cluster that contains the data you want to migrate and configure a public DNS name. For instructions, see [Create a cluster](https://opensearch.org/docs/latest/tuning-your-cluster/) in the OpenSearch documentation.

1. Create an OpenSearch Service domain or OpenSearch Serverless collection where you want to migrate data to. For more information, see [Creating OpenSearch Service domains](createupdatedomains.md#createdomains) and [Creating collections](serverless-create.md).

1. Set up authentication on your self-managed cluster with AWS Secrets Manager. Enable secrets rotation by following the steps in [Rotate AWS Secrets Manager secrets](https://docs.aws.amazon.com/secretsmanager/latest/userguide/rotating-secrets.html).

1. Attach a [resource-based policy](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ac.html#ac-types-resource) to your domain or a [data access policy](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-data-access.html) to your collection. These access policies allow OpenSearch Ingestion to write data from your self-managed cluster to your domain or collection. 

   The following sample domain access policy allows the pipeline role, which you create in the next step, to write data to a domain. Make sure that you replace the `Resource` value with your own domain ARN.

------
#### [ JSON ]

****  

   ```
   {
     "Version":"2012-10-17",		 	 	 
     "Statement": [
       {
         "Effect": "Allow",
         "Principal": {
           "AWS": "arn:aws:iam::444455556666:role/pipeline-role"
         },
         "Action": [
           "es:DescribeDomain",
           "es:ESHttp*"
         ],
         "Resource": [
           "arn:aws:es:us-east-1:111122223333:domain/domain-name"
         ]
       }
     ]
   }
   ```

------

   To create an IAM role with the correct permissions to write data to the collection or domain, see [Setting up roles and users in Amazon OpenSearch Ingestion](pipeline-security-overview.md).
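For an OpenSearch Serverless collection, you attach a data access policy instead of a domain access policy. The following sketch grants the pipeline role write access to indexes in a collection; the collection name and role ARN are placeholders, and the exact permissions you need depend on your use case.

```
[
  {
    "Rules": [
      {
        "ResourceType": "index",
        "Resource": ["index/collection-name/*"],
        "Permission": [
          "aoss:CreateIndex",
          "aoss:UpdateIndex",
          "aoss:DescribeIndex",
          "aoss:WriteDocument"
        ]
      }
    ],
    "Principal": ["arn:aws:iam::111122223333:role/pipeline-role"]
  }
]
```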

### Step 1: Configure the pipeline role
<a name="self-managed-opensearch-public-pipeline-role"></a>

After you have your OpenSearch pipeline prerequisites set up, [configure the pipeline role](pipeline-security-overview.md#pipeline-security-sink) that you want to use in your pipeline configuration, and add permission to write to an OpenSearch Service domain or OpenSearch Serverless collection, as well as permission to read secrets from Secrets Manager.

### Step 2: Create the pipeline
<a name="self-managed-opensearch-public-pipeline"></a>

You can then configure an OpenSearch Ingestion pipeline like the following, which specifies OpenSearch as the source. 

You can specify multiple OpenSearch Service domains as destinations for your data. This capability enables conditional routing or replication of incoming data into multiple OpenSearch Service domains.

You can also migrate data from a source OpenSearch or Elasticsearch cluster to an OpenSearch Serverless VPC collection. Ensure that you provide a network access policy within the pipeline configuration.

```
version: "2"
opensearch-migration-pipeline:
  source:
    opensearch:
      acknowledgments: true
      host: [ "https://my-self-managed-cluster-name:9200" ]
      indices:
        include:
          - index_name_regex: "include-.*"
        exclude:
          - index_name_regex: '\..*'
      authentication:
        username: ${aws_secrets:secret:username}
        password: ${aws_secrets:secret:password}
      scheduling:
        interval: "PT2H"
        index_read_count: 3
        start_time: "2023-06-02T22:01:30.00Z"
  sink:
  - opensearch:
      hosts: ["https://search-mydomain.us-east-1.es.amazonaws.com"]
      aws:
          region: "us-east-1"
          #Uncomment the following lines if your destination is an OpenSearch Serverless collection
          #serverless: true
          # serverless_options:
          #     network_policy_name: "network-policy-name"
      index: "${getMetadata(\"opensearch-index\")}"
      document_id: "${getMetadata(\"opensearch-document_id\")}"
      enable_request_compression: true
      dlq:
        s3:
          bucket: "bucket-name"
          key_path_prefix: "apache-log-pipeline/logs/dlq"
          region: "us-east-1"
extension:
  aws:
    secrets:
      secret:
        secret_id: "my-opensearch-secret"
        region: "us-east-1"
        refresh_interval: PT1H
```

You can use a preconfigured blueprint to create this pipeline. For more information, see [Working with blueprints](pipeline-blueprint.md).

## Migrating data from OpenSearch clusters in a VPC
<a name="self-managaged-opensearch-private"></a>

You can also use OpenSearch Ingestion pipelines to migrate data from a self-managed OpenSearch or Elasticsearch cluster running in a VPC. To do so, set up an OpenSearch Ingestion pipeline with self-managed OpenSearch or Elasticsearch as the source and OpenSearch Service or OpenSearch Serverless as the destination. This effectively migrates your data from a self-managed source cluster to an AWS-managed destination domain or collection.

### Prerequisites
<a name="self-managaged-opensearch-private-prereqs"></a>

Before you create your OpenSearch Ingestion pipeline, perform the following steps:

1. Create a self-managed OpenSearch or Elasticsearch cluster with a VPC network configuration that contains the data you want to migrate. 

1. Create an OpenSearch Service domain or OpenSearch Serverless collection where you want to migrate data to. For more information, see [Creating OpenSearch Service domains](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/createupdatedomains.html#createdomains) and [Creating collections](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-manage.html#serverless-create).

1. Set up authentication on your self-managed cluster with AWS Secrets Manager. Enable secrets rotation by following the steps in [Rotate AWS Secrets Manager secrets](https://docs.aws.amazon.com/secretsmanager/latest/userguide/rotating-secrets.html).

1. Obtain the ID of the VPC that has access to the self-managed OpenSearch or Elasticsearch cluster. Choose the VPC CIDR to be used by OpenSearch Ingestion.
**Note**  
If you're using the AWS Management Console to create your pipeline, you must also attach your OpenSearch Ingestion pipeline to your VPC in order to use self-managed OpenSearch or Elasticsearch. To do so, find the **Source network options** section, select the **Attach to VPC** checkbox, and choose your CIDR from one of the provided default options. You can use any CIDR from a private address space as defined in the [RFC 1918 Best Current Practice](https://datatracker.ietf.org/doc/html/rfc1918).  
To provide a custom CIDR, select **Other** from the dropdown menu. To avoid a collision in IP addresses between OpenSearch Ingestion and self-managed OpenSearch, ensure that the self-managed OpenSearch VPC CIDR is different from the CIDR for OpenSearch Ingestion. 

1. Attach a [resource-based policy](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ac.html#ac-types-resource) to your domain or a [data access policy](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-data-access.html) to your collection. These access policies allow OpenSearch Ingestion to write data from your self-managed cluster to your domain or collection. 

   The following sample domain access policy allows the pipeline role, which you create in the next step, to write data to a domain. Make sure that you replace the `Resource` value with your own domain ARN.

------
#### [ JSON ]

****  

   ```
   {
     "Version":"2012-10-17",		 	 	 
     "Statement": [
       {
         "Effect": "Allow",
         "Principal": {
           "AWS": "arn:aws:iam::444455556666:role/pipeline-role"
         },
         "Action": [
           "es:DescribeDomain",
           "es:ESHttp*"
         ],
         "Resource": [
           "arn:aws:es:us-east-1:111122223333:domain/example.com"
         ]
       }
     ]
   }
   ```

------

   To create an IAM role with the correct permissions to write data to the collection or domain, see [Setting up roles and users in Amazon OpenSearch Ingestion](pipeline-security-overview.md).

### Step 1: Configure the pipeline role
<a name="self-managed-opensearch-private-pipeline-role"></a>

After you have your pipeline prerequisites set up, [configure the pipeline role](pipeline-security-overview.md#pipeline-security-sink) that you want to use in your pipeline configuration, and add the following permissions in the role:

------
#### [ JSON ]

****  

```
{
    "Version":"2012-10-17",		 	 	 
    "Statement": [
        {
            "Sid": "SecretsManagerReadAccess",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetSecretValue"
            ],
            "Resource": ["arn:aws:secretsmanager:us-east-1:111122223333:secret:secret-name"]
        },
        {
            "Effect": "Allow",
            "Action": [
                "ec2:AttachNetworkInterface",
                "ec2:CreateNetworkInterface",
                "ec2:CreateNetworkInterfacePermission",
                "ec2:DeleteNetworkInterface",
                "ec2:DeleteNetworkInterfacePermission",
                "ec2:DetachNetworkInterface",
                "ec2:DescribeNetworkInterfaces"
            ],
            "Resource": [
                "arn:aws:ec2:*:*:network-interface/*",
                "arn:aws:ec2:*:*:subnet/*",
                "arn:aws:ec2:*:*:security-group/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeDhcpOptions",
                "ec2:DescribeRouteTables",
                "ec2:DescribeSecurityGroups",
                "ec2:DescribeSubnets",
                "ec2:DescribeVpcs",
                "ec2:Describe*"
            ],
            "Resource": "*"
        },
        { 
            "Effect": "Allow",
            "Action": [ 
                "ec2:CreateTags"
            ],
            "Resource": "arn:aws:ec2:*:*:network-interface/*",
            "Condition": { 
               "StringEquals": 
                    {
                        "aws:RequestTag/OSISManaged": "true"
                    } 
            } 
        }
    ]
}
```

------

You must provide the above Amazon EC2 permissions on the IAM role that you use to create the OpenSearch Ingestion pipeline because the pipeline uses these permissions to create and delete a network interface in your VPC. The pipeline can only access the OpenSearch cluster through this network interface.

### Step 2: Create the pipeline
<a name="self-managed-opensearch-private-pipeline"></a>

You can then configure an OpenSearch Ingestion pipeline like the following, which specifies OpenSearch as the source. 

You can specify multiple OpenSearch Service domains as destinations for your data. This capability enables conditional routing or replication of incoming data into multiple OpenSearch Service domains.

You can also migrate data from a source OpenSearch or Elasticsearch cluster to an OpenSearch Serverless VPC collection. Ensure that you provide a network access policy within the pipeline configuration.

```
version: "2"
opensearch-migration-pipeline:
  source:
    opensearch:
      acknowledgments: true
      host: [ "https://my-self-managed-cluster-name:9200" ]
      indices:
        include:
          - index_name_regex: "include-.*"
        exclude:
          - index_name_regex: '\..*'
      authentication:
        username: ${aws_secrets:secret:username}
        password: ${aws_secrets:secret:password}
      scheduling:
        interval: "PT2H"
        index_read_count: 3
        start_time: "2023-06-02T22:01:30.00Z"
  sink:
  - opensearch:
      hosts: ["https://search-mydomain.us-east-1.es.amazonaws.com"]
      aws:
          region: "us-east-1"
          #Uncomment the following lines if your destination is an OpenSearch Serverless collection
          #serverless: true
          # serverless_options:
          #     network_policy_name: "network-policy-name"
      index: "${getMetadata(\"opensearch-index\")}"
      document_id: "${getMetadata(\"opensearch-document_id\")}"
      enable_request_compression: true
      dlq:
        s3:
          bucket: "bucket-name"
          key_path_prefix: "apache-log-pipeline/logs/dlq"
          region: "us-east-1"
extension:
  aws:
    secrets:
      secret:
        secret_id: "my-opensearch-secret"
        region: "us-east-1"
        refresh_interval: PT1H
```

You can use a preconfigured blueprint to create this pipeline. For more information, see [Working with blueprints](pipeline-blueprint.md).

# Use an OpenSearch Ingestion pipeline with Amazon Kinesis Data Streams
<a name="configure-client-kinesis"></a>

Use an OpenSearch Ingestion pipeline with Amazon Kinesis Data Streams to ingest stream records from multiple streams into Amazon OpenSearch Service domains and collections. The OpenSearch Ingestion pipeline incorporates the streaming ingestion infrastructure to provide a high-scale, low-latency way to continuously ingest stream records from Kinesis.

**Topics**
+ [Amazon Kinesis Data Streams as a source](#confluent-cloud-kinesis)
+ [Amazon Kinesis Data Streams cross account as a source](#kinesis-cross-account-source)

## Amazon Kinesis Data Streams as a source
<a name="confluent-cloud-kinesis"></a>

The following procedure shows you how to set up an OpenSearch Ingestion pipeline that uses Amazon Kinesis Data Streams as the data source. This section covers the necessary prerequisites, such as creating an OpenSearch Service domain or an OpenSearch Serverless collection, and walks through the steps to configure the pipeline role and create the pipeline.

### Prerequisites
<a name="s3-prereqs"></a>

To set up your pipeline, you need one or more active Kinesis data streams. These streams must be either receiving records or ready to receive records from other sources. For more information, see [Getting started with OpenSearch Ingestion](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/osis-getting-started-tutorials.html).
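If your streams aren't receiving data yet, you can push test records with an AWS SDK. The following Python sketch (assuming `boto3` is installed and AWS credentials are configured; the stream name is a placeholder) batches JSON events into Kinesis record entries and sends them with `put_records`:

```python
import json
import uuid


def make_records(events):
    """Build Kinesis record entries from a list of JSON-serializable events.

    Each record gets a random partition key so that records spread
    across the stream's shards.
    """
    return [
        {
            "Data": (json.dumps(event) + "\n").encode("utf-8"),
            "PartitionKey": str(uuid.uuid4()),
        }
        for event in events
    ]


def send(stream_name, events, region="us-east-1"):
    """Send a batch of events to a Kinesis data stream.

    Requires boto3 and valid AWS credentials; the stream must exist.
    """
    import boto3  # imported here so make_records stays usable without boto3

    client = boto3.client("kinesis", region_name=region)
    return client.put_records(StreamName=stream_name, Records=make_records(events))
```

You might call `send("stream-name", [{"message": "hello"}])` to verify that the stream accepts records before wiring up the pipeline.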

**To set up your pipeline**

1. 

**Create an OpenSearch Service domain or an OpenSearch Serverless collection**

   To create a domain or a collection, see [Getting started with OpenSearch Ingestion](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/osis-getting-started-tutorials.html).

   To create an IAM role with the correct permissions to write data to the collection or domain, see [Resource-based policies](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ac.html#ac-types-resource).

1. 

**Configure the pipeline role with permissions**

   [Set up the pipeline role](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/pipeline-security-overview.html#pipeline-security-sink) that you want to use in your pipeline configuration and add the following permissions to it. Replace the *placeholder values* with your own information.

------
#### [ JSON ]

****  

   ```
   {
       "Version":"2012-10-17",		 	 	 
       "Statement": [
           {
               "Sid": "allowReadFromStream",
               "Effect": "Allow",
               "Action": [
                   "kinesis:DescribeStream",
                   "kinesis:DescribeStreamConsumer",
                   "kinesis:DescribeStreamSummary",
                   "kinesis:GetRecords",
                   "kinesis:GetShardIterator",
                   "kinesis:ListShards",
                   "kinesis:ListStreams",
                   "kinesis:ListStreamConsumers",
                   "kinesis:RegisterStreamConsumer",
                   "kinesis:SubscribeToShard"
               ],
               "Resource": [
                   "arn:aws:kinesis:us-east-1:111122223333:stream/stream-name"
               ]
           }
       ]
   }
   ```

------

   If server-side encryption is enabled on the streams, the following AWS KMS policy allows the pipeline to decrypt the records. Replace the *placeholder values* with your own information.

------
#### [ JSON ]

****  

   ```
   {
       "Version":"2012-10-17",		 	 	 
       "Statement": [
           {
               "Sid": "allowDecryptionOfCustomManagedKey",
               "Effect": "Allow",
               "Action": [
                   "kms:Decrypt",
                   "kms:GenerateDataKey"
               ],
               "Resource": "arn:aws:kms:us-east-1:111122223333:key/key-id"
           }
       ]
   }
   ```

------

   In order for a pipeline to write data to a domain, the domain must have a [domain-level access policy](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ac.html#ac-types-resource) that allows the `sts_role_arn` pipeline role to access it.

   The following example is a domain access policy that allows the pipeline role created in the previous step (`pipeline-role`) to write data to the `ingestion-domain` domain. Replace the *placeholder values* with your own information.

   ```
   {
     "Statement": [
       {
         "Effect": "Allow",
         "Principal": {
           "AWS": "arn:aws:iam::your-account-id:role/pipeline-role"
         },
         "Action": ["es:DescribeDomain", "es:ESHttp*"],
         "Resource": "arn:aws:es:AWS Region:account-id:domain/domain-name/*"
       }
     ]
   }
   ```

1. 

**Create the pipeline**

   Configure an OpenSearch Ingestion pipeline, specifying `kinesis_data_streams` as the source. A ready-made blueprint is available in the OpenSearch Ingestion console for creating such a pipeline. (Optional) To create the pipeline using the AWS CLI, you can use a blueprint named **`AWS-KinesisDataStreamsPipeline`**. Replace the *placeholder values* with your own information.

   ```
   version: "2"
   kinesis-pipeline:
     source:
       kinesis_data_streams:
         acknowledgments: true
         codec:
           # Based on whether kinesis records are aggregated or not, you could choose json, newline or ndjson codec for processing the records.
           # JSON codec supports parsing nested CloudWatch Events into individual log entries that will be written as documents into OpenSearch.
           # json:
             # key_name: "logEvents"
             # These keys contain the metadata sent by CloudWatch Subscription Filters
             # in addition to the individual log events:
             # include_keys: [ 'owner', 'logGroup', 'logStream' ]
           newline:
         streams:
           - stream_name: "stream name"
             # Enable this if ingestion should start from the start of the stream.
             # initial_position: "EARLIEST"
             # checkpoint_interval: "PT5M"
             # Compression will always be gzip for CloudWatch, but will vary for other sources:
             # compression: "gzip"
           - stream_name: "stream name"
             # Enable this if ingestion should start from the start of the stream.
             # initial_position: "EARLIEST"
             # checkpoint_interval: "PT5M"
             # Compression will always be gzip for CloudWatch, but will vary for other sources:
             # compression: "gzip"
   
           # buffer_timeout: "1s"
           # records_to_accumulate: 100
           # Change the consumer strategy to "polling". Default consumer strategy will use enhanced "fan-out" supported by KDS.
           # consumer_strategy: "polling"
           # if consumer strategy is set to "polling", enable the polling config below.
           # polling:
             # max_polling_records: 100
             # idle_time_between_reads: "250ms"
         aws:
           # Provide the Role ARN with access to Amazon Kinesis Data Streams. This role should have a trust relationship with osis-pipelines.amazonaws.com
           sts_role_arn: "arn:aws:iam::111122223333:role/Example-Role"
           # Provide the AWS Region of the Data Stream.
           region: "us-east-1"
   
     sink:
       - opensearch:
           # Provide an Amazon OpenSearch Service domain endpoint
           hosts: [ "https://search-mydomain-1a2a3a4a5a6a7a8a9a0a9a8a7a.us-east-1.es.amazonaws.com" ]
           index: "index_${getMetadata(\"stream_name\")}"
            # Create a unique document ID as a combination of the available metadata attributes.
           document_id: "${getMetadata(\"partition_key\")}_${getMetadata(\"sequence_number\")}_${getMetadata(\"sub_sequence_number\")}"
           aws:
             # Provide a Role ARN with access to the domain. This role should have a trust relationship with osis-pipelines.amazonaws.com
             sts_role_arn: "arn:aws:iam::111122223333:role/Example-Role"
             # Provide the AWS Region of the domain.
             region: "us-east-1"
             # Enable the 'serverless' flag if the sink is an Amazon OpenSearch Serverless collection
             serverless: false
             # serverless_options:
               # Specify a name here to create or update network policy for the serverless collection
               # network_policy_name: "network-policy-name"
            # Enable the 'distribution_version' setting if the OpenSearch Service domain is of version Elasticsearch 6.x
           # distribution_version: "es6"
           # Enable and switch the 'enable_request_compression' flag if the default compression setting is changed in the domain. See https://docs.aws.amazon.com/opensearch-service/latest/developerguide/gzip.html
           # enable_request_compression: true/false
           # Optional: Enable the S3 DLQ to capture any failed requests in an S3 bucket. Delete this entire block if you don't want a DLQ.
           dlq:
             s3:
               # Provide an S3 bucket
               bucket: "your-dlq-bucket-name"
               # Provide a key path prefix for the failed requests
               # key_path_prefix: "kinesis-pipeline/logs/dlq"
               # Provide the region of the bucket.
               region: "us-east-1"
               # Provide a Role ARN with access to the bucket. This role should have a trust relationship with osis-pipelines.amazonaws.com
               sts_role_arn: "arn:aws:iam::111122223333:role/Example-Role"
   ```

**Configuration options**  
For Kinesis configuration options, see [Configuration options](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/kinesis/#configuration-options) in the *OpenSearch* documentation.

**Available metadata attributes**
   + **stream_name** – Name of the Kinesis data stream from which the record was ingested
   + **partition_key** – Partition key of the Kinesis data stream record being ingested
   + **sequence_number** – Sequence number of the Kinesis data stream record being ingested
   + **sub_sequence_number** – Sub-sequence number of the Kinesis data stream record being ingested
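As an illustration (not Data Prepper internals), the following Python sketch shows how the sink expressions in the pipeline above combine these metadata attributes into an index name and a unique document ID. The sample values are hypothetical.

```python
# Illustrative only: mirrors how the sink expressions combine Kinesis
# metadata into an index name and a unique document ID.
def build_index_name(metadata: dict) -> str:
    # Mirrors: index: "index_${getMetadata(\"stream_name\")}"
    return f"index_{metadata['stream_name']}"

def build_document_id(metadata: dict) -> str:
    # Mirrors: document_id: "${getMetadata(\"partition_key\")}_..."
    return "{partition_key}_{sequence_number}_{sub_sequence_number}".format(**metadata)

# Hypothetical record metadata for one ingested Kinesis record.
record_metadata = {
    "stream_name": "orders-stream",
    "partition_key": "customer-42",
    "sequence_number": "49590338271490256608559692538361571095921575989136588898",
    "sub_sequence_number": "0",
}

print(build_index_name(record_metadata))  # index_orders-stream
print(build_document_id(record_metadata))
```

Combining the partition key, sequence number, and sub-sequence number this way keeps document IDs unique per record, so retried writes overwrite rather than duplicate documents.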

1. 

**(Optional) Configure recommended compute units (OCUs) for the Kinesis Data Streams pipeline**

   An OpenSearch Ingestion Kinesis Data Streams source pipeline can also ingest records from more than one stream. For faster ingestion, we recommend that you add one additional compute unit for each stream that you add.

### Data consistency
<a name="confluent-cloud-kinesis-private"></a>

OpenSearch Ingestion supports end-to-end acknowledgement to ensure data durability. When the pipeline reads records from Kinesis, it dynamically distributes the work of reading based on the shards associated with the streams. The pipeline automatically checkpoints a stream when it receives an acknowledgement that all records have been ingested into the OpenSearch domain or collection, which avoids duplicate processing of stream records.

To create the index based on the stream name, define the index in the opensearch sink section as `"index_${getMetadata(\"stream_name\")}"`.

## Amazon Kinesis Data Streams cross account as a source
<a name="kinesis-cross-account-source"></a>

You can grant access across accounts with Amazon Kinesis Data Streams so that OpenSearch Ingestion pipelines can use a Kinesis data stream in another account as a source. Complete the following steps to enable cross-account access:

**Configure cross-account access**

1. 

**Set resource policy in the account that has the Kinesis stream**

   Replace the *placeholder values* with your own information.

------
#### [ JSON ]

****  

   ```
   {
       "Version":"2012-10-17",		 	 	 
       "Statement": [
           {
               "Sid": "StreamReadStatementID",
               "Effect": "Allow",
               "Principal": {
                   "AWS": "arn:aws:iam::111122223333:role/Pipeline-Role"
               },
               "Action": [
                   "kinesis:DescribeStreamSummary",
                   "kinesis:GetRecords",
                   "kinesis:GetShardIterator",
                   "kinesis:ListShards"
               ],
               "Resource": "arn:aws:kinesis:us-east-1:444455556666:stream/stream-name"
           },
           {
               "Sid": "StreamEFOReadStatementID",
               "Effect": "Allow",
               "Principal": {
                   "AWS": "arn:aws:iam::111122223333:role/Pipeline-Role"
               },
               "Action": [
                   "kinesis:DescribeStreamSummary",
                   "kinesis:ListShards"
               ],
               "Resource": "arn:aws:kinesis:us-east-1:444455556666:stream/stream-name/consumer/consumer-name"
           }
       ]
   }
   ```

------

1. 

**(Optional) Set up consumer and consumer resource policy**

   This step is optional and is required only if you plan to use the enhanced fan-out consumer strategy for reading stream records. For more information, see [Develop enhanced fan-out consumers with dedicated throughput](https://docs.aws.amazon.com/streams/latest/dev/enhanced-consumers.html).

   1. 

**Set up consumer**

      To reuse an existing consumer, skip this step. For more information, see [RegisterStreamConsumer](https://docs.aws.amazon.com/kinesis/latest/APIReference/API_RegisterStreamConsumer.html) in the *Amazon Kinesis Data Streams API Reference*.

      In the following example CLI command, replace the *placeholder values* with your own information.  
**Example CLI command**  

      ```
      aws kinesis register-stream-consumer \
      --stream-arn "arn:aws:kinesis:AWS Region:account-id:stream/stream-name" \
      --consumer-name consumer-name
      ```

   1. 

**Set up consumer resource policy**

      In the following statement, replace the *placeholder values* with your own information.

------
#### [ JSON ]

****  

      ```
      {
          "Version":"2012-10-17",		 	 	 
          "Statement": [
              {
                  "Sid": "ConsumerEFOReadStatementID",
                  "Effect": "Allow",
                  "Principal": {
                      "AWS": "arn:aws:iam::111122223333:role/Pipeline-Role"
                  },
                  "Action": [
                      "kinesis:DescribeStreamConsumer",
                      "kinesis:SubscribeToShard"
                  ],
                  "Resource": "arn:aws:kinesis:us-east-1:444455556666:stream/stream-1/consumer/consumer-name"
              }
          ]
      }
      ```

------

1. 

**Pipeline Configuration**

   For cross-account ingestion, add the following attributes under `kinesis_data_streams` for each stream:
   + `stream_arn` – The ARN of the stream in the account where the stream exists.
   + `consumer_arn` – Optional. Required if you use the default enhanced fan-out consumer strategy; specify the actual consumer ARN in this field. Replace the *placeholder values* with your own information.

   ```
   version: "2"
   kinesis-pipeline:
     source:
       kinesis_data_streams:
         acknowledgments: true
         codec:
           newline:
         streams:
           - stream_arn: "arn:aws:kinesis:region:stream-account-id:stream/stream-name"
             consumer_arn: "consumer arn"
             # Enable this if ingestion should start from the start of the stream.
             # initial_position: "EARLIEST"
             # checkpoint_interval: "PT5M"
           - stream_arn: "arn:aws:kinesis:region:stream-account-id:stream/stream-name"
             consumer_arn: "consumer arn"
             # initial_position: "EARLIEST"

         # buffer_timeout: "1s"
         # records_to_accumulate: 100
         # Change the consumer strategy to "polling". The default consumer strategy uses the enhanced "fan-out" supported by Kinesis Data Streams.
         # consumer_strategy: "polling"
         # If the consumer strategy is set to "polling", enable the polling config below.
         # polling:
           # max_polling_records: 100
           # idle_time_between_reads: "250ms"
         aws:
           # Provide the Role ARN with access to Kinesis. This role should have a trust relationship with osis-pipelines.amazonaws.com
           sts_role_arn: "arn:aws:iam::111122223333:role/Example-Role"
           # Provide the AWS Region of the data stream.
           region: "us-east-1"

     sink:
       - opensearch:
           # Provide an Amazon OpenSearch Service domain endpoint
           hosts: [ "https://search-mydomain-1a2a3a4a5a6a7a8a9a0a9a8a7a.us-east-1.es.amazonaws.com" ]
           index: "index_${getMetadata(\"stream_name\")}"
           # Mapping for the document ID based on the partition key, shard sequence number, and subsequence number metadata attributes
           document_id: "${getMetadata(\"partition_key\")}_${getMetadata(\"sequence_number\")}_${getMetadata(\"sub_sequence_number\")}"
           aws:
             # Provide a Role ARN with access to the domain. This role should have a trust relationship with osis-pipelines.amazonaws.com
             sts_role_arn: "arn:aws:iam::111122223333:role/Example-Role"
             # Provide the AWS Region of the domain.
             region: "us-east-1"
             # Enable the 'serverless' flag if the sink is an Amazon OpenSearch Serverless collection
             serverless: false
             # serverless_options:
               # Specify a name here to create or update network policy for the serverless collection
               # network_policy_name: "network-policy-name"
           # Enable the 'distribution_version' setting if the OpenSearch Service domain is of version Elasticsearch 6.x
           # distribution_version: "es6"
           # Enable and switch the 'enable_request_compression' flag if the default compression setting is changed in the domain. See https://docs.aws.amazon.com/opensearch-service/latest/developerguide/gzip.html
           # enable_request_compression: true/false
           # Optional: Enable the S3 DLQ to capture any failed requests in an S3 bucket. Delete this entire block if you don't want a DLQ.
           dlq:
             s3:
               # Provide an Amazon S3 bucket
               bucket: "your-dlq-bucket-name"
               # Provide a key path prefix for the failed requests
               # key_path_prefix: "kinesis-pipeline/logs/dlq"
               # Provide the AWS Region of the bucket.
               region: "us-east-1"
               # Provide a Role ARN with access to the bucket. This role should have a trust relationship with osis-pipelines.amazonaws.com
               sts_role_arn: "arn:aws:iam::111122223333:role/Example-Role"
   ```

1. 

**Configure the OpenSearch Ingestion pipeline role for Kinesis Data Streams**

   1. 

**IAM Policy**

      Add the following policy to the pipeline role. Replace the *placeholder values* with your own information.

------
#### [ JSON ]

****  

      ```
      {
          "Version":"2012-10-17",		 	 	 
          "Statement": [
              {
                  "Sid": "allowReadFromConsumer",
                  "Effect": "Allow",
                  "Action": [
                      "kinesis:DescribeStreamConsumer",
                      "kinesis:SubscribeToShard"
                  ],
                  "Resource": [
                      "arn:aws:kinesis:us-east-1:444455556666:stream/stream-name/consumer/*"
                  ]
              },
              {
                  "Sid": "allowReadFromStream",
                  "Effect": "Allow",
                  "Action": [
                      "kinesis:DescribeStream",
                      "kinesis:DescribeStreamSummary",
                      "kinesis:GetRecords",
                      "kinesis:GetShardIterator",
                      "kinesis:ListShards",
                      "kinesis:ListStreams",
                      "kinesis:ListStreamConsumers",
                      "kinesis:RegisterStreamConsumer"
                  ],
                  "Resource": [
                      "arn:aws:kinesis:us-east-1:444455556666:stream/stream-name"
                  ]
              }
          ]
      }
      ```

------

   1. 

**Trust Policy**

      To ingest data from the stream account, you must establish a trust relationship between the pipeline ingestion role and the stream account. Add the following trust policy to the pipeline role. Replace the *placeholder values* with your own information.

------
#### [ JSON ]

****  

      ```
      {
        "Version":"2012-10-17",		 	 	 
        "Statement": [{
           "Effect": "Allow",
           "Principal": {
             "AWS": "arn:aws:iam::111122223333:root"
            },
           "Action": "sts:AssumeRole"
        }]
      }
      ```

------

## Next steps
<a name="configure-client-next"></a>

After you export your data to a pipeline, you can [query it](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/searching.html) from the OpenSearch Service domain that is configured as a sink for the pipeline. The following resources can help you get started:
+ [Observability in Amazon OpenSearch Service](observability.md)
+ [Discover Traces](observability-analyze-traces.md)

# Using an OpenSearch Ingestion pipeline with AWS Lambda
<a name="configure-client-lambda"></a>

Use the [AWS Lambda processor](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/aws-lambda/) to enrich data from any source or destination supported by OpenSearch Ingestion using custom code. With the Lambda processor, you can apply your own data transformations or enrichments and then return the processed events back to your pipeline for further processing. This processor enables customized data processing and gives you full control over how data is manipulated before it moves through the pipeline.

**Note**  
The payload size limit for a single event processed by a Lambda processor is 5 MB. Additionally, the Lambda processor only supports responses in JSON array format.

## Prerequisites
<a name="configure-clients-lambda-prereqs"></a>

Before you create a pipeline with a Lambda processor, create the following resources:
+ An AWS Lambda function that enriches and transforms your source data. For instructions, see [Create your first Lambda function](https://docs.aws.amazon.com/lambda/latest/dg/getting-started.html).
+ An OpenSearch Service domain or OpenSearch Serverless collection that will be the pipeline sink. For more information, see [Creating OpenSearch Service domains](createupdatedomains.md#createdomains) and [Creating collections](serverless-create.md).
+ A pipeline role that includes permissions to write to the domain or collection sink. For more information, see [Pipeline role](pipeline-security-overview.md#pipeline-security-sink).

  The pipeline role also needs an attached permissions policy that allows it to invoke the Lambda function specified in the pipeline configuration. For example: 

------
#### [ JSON ]

****  

  ```
  {
      "Version":"2012-10-17",		 	 	 
      "Statement": [
          {
              "Sid": "allowinvokeFunction",
              "Effect": "Allow",
              "Action": [
                  "lambda:invokeFunction",
                  "lambda:InvokeAsync",
                  "lambda:ListFunctions"
              ],
              "Resource": "arn:aws:lambda:us-east-1:111122223333:function:function-name"
              
          }
      ]
  }
  ```

------

## Create a pipeline
<a name="configure-clients-security-lake-pipeline-role"></a>

To use AWS Lambda as a processor, configure an OpenSearch Ingestion pipeline and specify `aws_lambda` as a processor. You can also use the **AWS Lambda custom enrichment** blueprint to create the pipeline. For more information, see [Working with blueprints](pipeline-blueprint.md).

The following example pipeline receives data from an HTTP source, enriches it using a date processor and the AWS Lambda processor, and ingests the processed data to an OpenSearch domain.

```
version: "2"
lambda-processor-pipeline:
  source:
    http:
      path: "/${pipelineName}/logs"
  processor:
    - date:
        destination: "@timestamp"
        from_time_received: true
    - aws_lambda:
        function_name: "my-lambda-function"
        tags_on_failure: ["lambda_failure"]
        batch:
          key_name: "events"
        aws:
          region: "region"
  sink:
    - opensearch:
        hosts: [ "https://search-mydomain.us-east-1.es.amazonaws.com" ]
        index: "table-index"
        aws:
          region: "region"
          serverless: false
```

The following example AWS Lambda function transforms incoming data by adding a new key-value pair (`"transformed": "true"`) to each element in the provided array of events, and then sends back the modified version.

```
import json

def lambda_handler(event, context):
    input_array = event.get('events', [])
    output = []
    for item in input_array:
        # Mark each event as transformed.
        item["transformed"] = "true"
        output.append(item)

    # The Lambda processor expects the response as a JSON array.
    return output
```
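Because the processor sends events under the configured `key_name` and expects a JSON array in return, you can exercise a handler of this shape locally before deploying it. The following sketch inlines an equivalent handler and invokes it with a hypothetical sample batch; no AWS resources are involved.

```python
# Local test harness for a Lambda handler of the shape shown above.
# The sample events are hypothetical; nothing here calls AWS.
def lambda_handler(event, context):
    output = []
    for item in event.get('events', []):
        item["transformed"] = "true"
        output.append(item)
    return output

# Simulate the batch the pipeline sends when batch.key_name is "events".
sample_event = {"events": [{"message": "log line 1"}, {"message": "log line 2"}]}
result = lambda_handler(sample_event, context=None)

print(len(result))               # 2
print(result[0]["transformed"])  # true
```

Running the handler this way makes it easy to confirm that the response is a flat JSON array before you wire the function into a pipeline.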

## Batching
<a name="configure-clients-lambda-batching"></a>

Pipelines send batched events to the Lambda processor, which dynamically adjusts the batch size to keep it below the 5 MB limit.

The following example shows a batch configuration in the pipeline and the corresponding event key in the Lambda handler:

```
batch:
    key_name: "events"

input_array = event.get('events', [])
```

**Note**  
When you create a pipeline, make sure the `key_name` option in the Lambda processor configuration matches the event key in the Lambda handler.
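As a conceptual sketch (not the pipeline's actual implementation), the following Python example shows how events can be chunked so that each serialized batch stays under a size limit. The 5 MB default mirrors the processor limit noted above; the sample events are hypothetical.

```python
import json

def chunk_events(events, max_bytes=5 * 1024 * 1024):
    """Yield lists of events whose total serialized size stays under max_bytes."""
    batch, batch_size = [], 0
    for event in events:
        size = len(json.dumps(event).encode("utf-8"))
        if batch and batch_size + size > max_bytes:
            # Current batch would exceed the limit; emit it and start a new one.
            yield batch
            batch, batch_size = [], 0
        batch.append(event)
        batch_size += size
    if batch:
        yield batch

# With a small limit, ten ~115-byte events split into batches of two.
events = [{"message": "x" * 100} for _ in range(10)]
batches = list(chunk_events(events, max_bytes=300))
print(len(batches))  # 5
```

The same idea applies at pipeline scale: instead of failing a whole payload when it exceeds the limit, the events are split into multiple Lambda invocations.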

## Conditional filtering
<a name="configure-clients-lambda-conditional-filtering"></a>

Conditional filtering allows you to control when your AWS Lambda processor invokes the Lambda function based on specific conditions in event data. This is particularly useful when you want to selectively process certain types of events while ignoring others.

The following example configuration uses conditional filtering:

```
processors:
  - aws_lambda:
      function_name: "my-lambda-function"
      aws:
        region: "region"
      lambda_when: "/sourceIp == 10.10.10.10"
```
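To illustrate the idea (this is a conceptual sketch, not Data Prepper's expression engine), the following Python example shows how a `lambda_when`-style predicate gates which events are sent to the Lambda function:

```python
# Conceptual sketch of lambda_when-style gating: only events whose
# sourceIp matches the condition are sent to the Lambda function.
def should_invoke(event, expected_ip="10.10.10.10"):
    # Mirrors the condition: /sourceIp == 10.10.10.10
    return event.get("sourceIp") == expected_ip

events = [
    {"sourceIp": "10.10.10.10", "message": "matches; sent to Lambda"},
    {"sourceIp": "192.0.2.1", "message": "skipped; passes through unchanged"},
]

to_lambda = [e for e in events if should_invoke(e)]
print(len(to_lambda))  # 1
```

Events that don't match the condition aren't dropped; they simply continue through the pipeline without invoking the Lambda function.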

# Migrating data between domains and collections using Amazon OpenSearch Ingestion
<a name="creating-opensearch-service-pipeline"></a>

You can use OpenSearch Ingestion pipelines to migrate data between Amazon OpenSearch Service domains or OpenSearch Serverless VPC collections. To do so, you set up a pipeline in which you configure one domain or collection as the source, and another domain or collection as the sink. This effectively migrates your data from one domain or collection to the other.

To migrate data, you must have the following resources:
+ A source OpenSearch Service domain or OpenSearch Serverless VPC collection. This domain or collection contains the data that you want to migrate. If you're using a domain, it must be running OpenSearch version 1.0 or later, or Elasticsearch version 7.4 or later. The domain must also have an access policy that grants the appropriate permissions to your pipeline role.
+ A separate domain or VPC collection that you want to migrate your data to. This domain or collection will act as the pipeline *sink*.
+ A pipeline role that OpenSearch Ingestion will use to read and write to your collection or domain. You include the Amazon Resource Name (ARN) of this role in your pipeline configuration. For more information, see the following resources:
  + [Granting Amazon OpenSearch Ingestion pipelines access to domains](pipeline-domain-access.md)
  + [Granting Amazon OpenSearch Ingestion pipelines access to collections](pipeline-collection-access.md)

**Topics**
+ [Limitations](#Limitations-domain-collection)
+ [OpenSearch Service as a source](#opensearch-source)
+ [Specifying multiple OpenSearch Service domain sinks](#multiple-domains)
+ [Migrating data to an OpenSearch Serverless VPC collection](#pipeline-collection)

## Limitations
<a name="Limitations-domain-collection"></a>

The following limitations apply when you designate OpenSearch Service domains or OpenSearch Serverless collections as sinks:
+ A pipeline can't write to more than one VPC domain.
+ You can only migrate data to or from OpenSearch Serverless collections that use VPC access. Public collections aren't supported.
+ You can't specify a combination of VPC and public domains in a single pipeline configuration.
+ You can have a maximum of 20 non-pipeline sinks within a single pipeline configuration.
+ You can specify sinks from a maximum of three different AWS Regions in a single pipeline configuration.
+ A pipeline with multiple sinks might experience a reduction in processing speed over time if any of the sinks are down for too long, or are not provisioned with enough capacity to receive incoming data.

## OpenSearch Service as a source
<a name="opensearch-source"></a>

The domain or collection that you specify as the source is where the data is migrated *from*. 

### Creating a pipeline role in IAM
<a name="source-IAM"></a>

To create your OpenSearch Ingestion pipeline, you must first create a pipeline role to grant read and write access between domains or collections. To do this, perform the following steps:

1. Create a new permissions policy in IAM to attach to the pipeline role. Make sure you allow permissions to read from the source and write to the sink. For more information on setting IAM pipeline permissions for OpenSearch Service domains, see [Granting Amazon OpenSearch Ingestion pipelines access to domains](pipeline-domain-access.md) and [Granting Amazon OpenSearch Ingestion pipelines access to collections](pipeline-collection-access.md).

1. Specify the following permissions within the pipeline role to read from the source:

------
#### [ JSON ]

****  

   ```
   {
      "Version":"2012-10-17",		 	 	 
      "Statement":[
         {
            "Effect":"Allow",
            "Action":"es:ESHttpGet",
            "Resource":[
               "arn:aws:es:us-east-1:111122223333:domain/domain-name/",
               "arn:aws:es:us-east-1:111122223333:domain/domain-name/_cat/indices",
               "arn:aws:es:us-east-1:111122223333:domain/domain-name/_search",
               "arn:aws:es:us-east-1:111122223333:domain/domain-name/_search/scroll",
               "arn:aws:es:us-east-1:111122223333:domain/domain-name/*/_search"
            ]
         },
         {
            "Effect":"Allow",
            "Action":"es:ESHttpPost",
            "Resource":[
               "arn:aws:es:us-east-1:111122223333:domain/domain-name/*/_search/point_in_time",
               "arn:aws:es:us-east-1:111122223333:domain/domain-name/*/_search/scroll"
            ]
         },
         {
            "Effect":"Allow",
            "Action":"es:ESHttpDelete",
            "Resource":[
               "arn:aws:es:us-east-1:111122223333:domain/domain-name/_search/point_in_time",
               "arn:aws:es:us-east-1:111122223333:domain/domain-name/_search/scroll"
            ]
         }
      ]
   }
   ```

------

### Creating a pipeline
<a name="create"></a>

After you attach the policy to the pipeline role, use the **AWSOpenSearchDataMigrationPipeline** migration blueprint to create the pipeline. This blueprint includes a default configuration for migrating data between OpenSearch Service domains or collections. For more information, see [Working with blueprints](pipeline-blueprint.md). 

**Note**  
OpenSearch Ingestion uses your source domain version and distribution to determine what mechanism to use for migration. Some versions support the `point_in_time` option. OpenSearch Serverless uses the `search_after` option because it doesn't support `point_in_time` or `scroll`.

New indexes might be created, or existing documents might be updated, while the migration is in progress. Because of this, you might need to perform a single scan or multiple scans of your domain's index data to pick up new or updated documents. 

Specify the number of scans to run by configuring the `index_read_count` and `interval` in the pipeline configuration. The following example shows how to perform multiple scans:

```
scheduling:
    interval: "PT2H"
    index_read_count: 3
    start_time: "2023-06-02T22:01:30.00Z"
```
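The `interval` value is an ISO 8601 duration, and `index_read_count` is the number of scan passes. As an illustration, this Python sketch computes the scan start times implied by the example configuration above, assuming evenly spaced passes:

```python
from datetime import datetime, timedelta

# Illustrative only: the scan start times implied by
# interval: "PT2H", index_read_count: 3, start_time: "2023-06-02T22:01:30.00Z"
start_time = datetime.fromisoformat("2023-06-02T22:01:30+00:00")
interval = timedelta(hours=2)  # PT2H
index_read_count = 3

scan_times = [start_time + i * interval for i in range(index_read_count)]
for t in scan_times:
    print(t.isoformat())
```

With this configuration, the pipeline would scan the source three times, two hours apart, picking up documents created or updated between passes.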

OpenSearch Ingestion uses the following configuration to ensure that your data is written to the same index and maintains the same document ID:

```
index: "${getMetadata(\"opensearch-index\")}"
document_id: "${getMetadata(\"opensearch-document_id\")}"
```

## Specifying multiple OpenSearch Service domain sinks
<a name="multiple-domains"></a>

You can specify multiple public OpenSearch Service domains as destinations for your data. You can use this capability to perform conditional routing or replicate incoming data into multiple OpenSearch Service domains. You can specify up to 10 different public OpenSearch Service domains as sinks.

In the following example, incoming data is conditionally routed to different OpenSearch Service domains:

```
...
  route:
    - 2xx_status: "/response >= 200 and /response < 300"
    - 5xx_status: "/response >= 500 and /response < 600"
  sink:
    - opensearch:
        hosts: [ "https://search-response-2xx.region.es.amazonaws.com" ]
        aws:
          region: "us-east-1"
        index: "response-2xx"
        routes:
          - 2xx_status
    - opensearch:
        hosts: [ "https://search-response-5xx.region.es.amazonaws.com" ]
        aws:
          region: "us-east-1"
        index: "response-5xx"
        routes:
          - 5xx_status
```
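As a conceptual illustration (not the pipeline's expression engine), this Python sketch evaluates the two route conditions above against sample events to show which sink each event would reach:

```python
def route_for(event):
    """Mirror the 2xx_status / 5xx_status route conditions above."""
    status = event.get("response")
    if status is not None and 200 <= status < 300:
        return "response-2xx"
    if status is not None and 500 <= status < 600:
        return "response-5xx"
    return None  # unmatched events reach neither sink

print(route_for({"response": 204}))  # response-2xx
print(route_for({"response": 503}))  # response-5xx
print(route_for({"response": 404}))  # None
```

Note that an event matching neither route is written to neither sink; add a catch-all route if you need to capture everything.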

## Migrating data to an OpenSearch Serverless VPC collection
<a name="pipeline-collection"></a>

You can use OpenSearch Ingestion to migrate data from a source OpenSearch Service domain or OpenSearch Serverless collection to a VPC collection sink. You must provide a network access policy within the pipeline configuration. For more information about data ingestion into OpenSearch Serverless VPC collections, see [Tutorial: Ingesting data into a collection using Amazon OpenSearch Ingestion](osis-serverless-get-started.md).

**To migrate data to a VPC collection**

1. Create an OpenSearch Serverless collection. For instructions, see [Tutorial: Ingesting data into a collection using Amazon OpenSearch Ingestion](osis-serverless-get-started.md).

1. Create a network policy for the collection that specifies VPC access to both the collection endpoint and the Dashboards endpoint. For instructions, see [Network access for Amazon OpenSearch Serverless](serverless-network.md). 

1. Create the pipeline role if you don't already have one. For instructions, see [Pipeline role](pipeline-security-overview.md#pipeline-security-sink).

1. Create the pipeline. For instructions, see [Working with blueprints](pipeline-blueprint.md).

# Using the AWS SDKs to interact with Amazon OpenSearch Ingestion
<a name="osis-sdk"></a>

This section includes an example of how to use the AWS SDKs to interact with Amazon OpenSearch Ingestion. The code example demonstrates how to create a domain and a pipeline, and then ingest data into the pipeline.

**Topics**
+ [Python](#osis-sdk-python)

## Python
<a name="osis-sdk-python"></a>

The following sample script uses the [AWS SDK for Python (Boto3)](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/osis.html) to create an IAM pipeline role, a domain to write data to, and a pipeline to ingest data through. It then ingests a sample log file into the pipeline using the [requests](https://pypi.org/project/requests/) HTTP library.

To install the required dependencies, run the following commands:

```
pip install boto3
pip install botocore
pip install requests
pip install requests-auth-aws-sigv4
```

Within the script, replace all instances of `account-id` with your AWS account ID.

```
import boto3
import botocore
from botocore.config import Config
import requests
from requests_auth_aws_sigv4 import AWSSigV4
import time

# Build the clients using the default credential configuration.
# You can use the CLI and run 'aws configure' to set the access key, secret
# key, and default region.
my_config = Config(region_name='us-east-1')

opensearch = boto3.client('opensearch', config=my_config)
iam = boto3.client('iam', config=my_config)
osis = boto3.client('osis', config=my_config)

domainName = 'test-domain'  # The name of the domain
pipelineName = 'test-pipeline' # The name of the pipeline

def createPipelineRole(iam, domainName):
    """Creates the pipeline role"""
    response = iam.create_policy(
        PolicyName='pipeline-policy',
        PolicyDocument=f'{{\"Version\":\"2012-10-17\",\"Statement\":[{{\"Effect\":\"Allow\",\"Action\":\"es:DescribeDomain\",\"Resource\":\"arn:aws:es:us-east-1:account-id:domain\/{domainName}\"}},{{\"Effect\":\"Allow\",\"Action\":\"es:ESHttp*\",\"Resource\":\"arn:aws:es:us-east-1:account-id:domain\/{domainName}\/*\"}}]}}'
    )
    policyarn = response['Policy']['Arn']

    response = iam.create_role(
        RoleName='PipelineRole',
        AssumeRolePolicyDocument='{\"Version\":\"2012-10-17\",\"Statement\":[{\"Effect\":\"Allow\",\"Principal\":{\"Service\":\"osis-pipelines.amazonaws.com\"},\"Action\":\"sts:AssumeRole\"}]}'
    )
    rolename=response['Role']['RoleName']

    response = iam.attach_role_policy(
        RoleName=rolename,
        PolicyArn=policyarn
    )

    print('Creating pipeline role...')
    time.sleep(10)
    print('Role created: ' + rolename)
        
def createDomain(opensearch, domainName):
    """Creates a domain to ingest data into"""
    response = opensearch.create_domain(
        DomainName=domainName,
        EngineVersion='OpenSearch_2.3',
        ClusterConfig={
            'InstanceType': 't2.small.search',
            'InstanceCount': 5,
            'DedicatedMasterEnabled': True,
            'DedicatedMasterType': 't2.small.search',
            'DedicatedMasterCount': 3
        },
        # Many instance types require EBS storage.
        EBSOptions={
            'EBSEnabled': True,
            'VolumeType': 'gp2',
            'VolumeSize': 10
        },
        AccessPolicies=f'{{\"Version\":\"2012-10-17\",\"Statement\":[{{\"Effect\":\"Allow\",\"Principal\":{{\"AWS\":\"arn:aws:iam::account-id:role\/PipelineRole\"}},\"Action\":\"es:*\",\"Resource\":\"arn:aws:es:us-east-1:account-id:domain\/{domainName}\/*\"}}]}}',
        NodeToNodeEncryptionOptions={
            'Enabled': True
        }
    )
    return(response)

def waitForDomainProcessing(opensearch, domainName):
    """Waits for the domain to be active"""
    try:
        response = opensearch.describe_domain(
            DomainName=domainName
        )
        # Every 60 seconds, check whether the domain is processing.
        while 'Endpoint' not in response['DomainStatus']:
            print('Creating domain...')
            time.sleep(60)
            response = opensearch.describe_domain(
                DomainName=domainName)

        # Once we exit the loop, the domain is ready for ingestion.
        endpoint = response['DomainStatus']['Endpoint']
        print('Domain endpoint ready to receive data: ' + endpoint)
        createPipeline(osis, endpoint)

    except botocore.exceptions.ClientError as error:
        if error.response['Error']['Code'] == 'ResourceNotFoundException':
            print('Domain not found.')
        else:
            raise error

def createPipeline(osis, endpoint):
    """Creates a pipeline using the domain and pipeline role"""
    try:
        definition = f'version: \"2\"\nlog-pipeline:\n  source:\n    http:\n      path: \"/${{pipelineName}}/logs\"\n  processor:\n    - date:\n        from_time_received: true\n        destination: \"@timestamp\"\n  sink:\n    - opensearch:\n        hosts: [ \"https://{endpoint}\" ]\n        index: \"application_logs\"\n        aws:\n          region: \"us-east-1\"'
        response = osis.create_pipeline(
            PipelineName=pipelineName,
            MinUnits=4,
            MaxUnits=9,
            PipelineConfigurationBody=definition,
            PipelineRoleArn="arn:aws:iam::account-id:role/PipelineRole"
        )

        response = osis.get_pipeline(
                PipelineName=pipelineName
        )
    
        # Every 30 seconds, check whether the pipeline is active.
        while response['Pipeline']['Status'] == 'CREATING':
            print('Creating pipeline...')
            time.sleep(30)
            response = osis.get_pipeline(
                PipelineName=pipelineName)

        # Once we exit the loop, the pipeline is ready for ingestion.
        ingestionEndpoint = response['Pipeline']['IngestEndpointUrls'][0]
        print('Pipeline ready to ingest data at endpoint: ' + ingestionEndpoint)
        ingestData(ingestionEndpoint)
    
    except botocore.exceptions.ClientError as error:
        if error.response['Error']['Code'] == 'ResourceAlreadyExistsException':
            print('Pipeline already exists.')
            response = osis.get_pipeline(
                PipelineName=pipelineName
            )
            ingestionEndpoint = response['Pipeline']['IngestEndpointUrls'][0]
            ingestData(ingestionEndpoint)
        else:
            raise error
    

def ingestData(ingestionEndpoint):
    """Ingests a sample log file into the pipeline"""
    endpoint = 'https://' + ingestionEndpoint
    r = requests.request(
        'POST', f'{endpoint}/log-pipeline/logs',
        data='[{"time":"2014-08-11T11:40:13+00:00","remote_addr":"122.226.223.69","status":"404","request":"GET http://www.k2proxy.com//hello.html HTTP/1.1","http_user_agent":"Mozilla/4.0 (compatible; WOW64; SLCC2;)"}]',
        auth=AWSSigV4('osis'))
    print('Ingesting sample log file into pipeline')
    print('Response: ' + r.text)

def main():
    createPipelineRole(iam, domainName)
    createDomain(opensearch, domainName)
    waitForDomainProcessing(opensearch, domainName)

if __name__ == "__main__":
    main()
```

# Security in Amazon OpenSearch Ingestion
<a name="pipeline-security-model"></a>

Cloud security at AWS is the highest priority. As an AWS customer, you benefit from a data center and network architecture that is built to meet the requirements of the most security-sensitive organizations.

Security is a shared responsibility between AWS and you. The [shared responsibility model](https://aws.amazon.com/compliance/shared-responsibility-model/) describes this as security *of* the cloud and security *in* the cloud:
+ **Security of the cloud** – AWS is responsible for protecting the infrastructure that runs AWS services in the AWS Cloud. AWS also provides you with services that you can use securely. Third-party auditors regularly test and verify the effectiveness of our security as part of the [AWS compliance programs](https://aws.amazon.com/compliance/programs/).
+ **Security in the cloud** – Your responsibility is determined by the AWS service that you use. You are also responsible for other factors including the sensitivity of your data, your company’s requirements, and applicable laws and regulations. 

This documentation helps you understand how to apply the shared responsibility model when using OpenSearch Ingestion. The following topics show you how to configure OpenSearch Ingestion to meet your security and compliance objectives. You also learn how to use other AWS services that help you to monitor and secure your OpenSearch Ingestion resources. 

**Topics**
+ [Configuring VPC access for Amazon OpenSearch Ingestion pipelines](pipeline-security.md)
+ [Configuring OpenSearch Ingestion pipelines for cross-account ingestion](cross-account-pipelines.md)
+ [Identity and Access Management for Amazon OpenSearch Ingestion](security-iam-ingestion.md)
+ [Logging Amazon OpenSearch Ingestion API calls using AWS CloudTrail](osis-logging-using-cloudtrail.md)
+ [Amazon OpenSearch Ingestion and interface endpoints API (AWS PrivateLink)](osis-access-apis-using-privatelink.md)

# Configuring VPC access for Amazon OpenSearch Ingestion pipelines
<a name="pipeline-security"></a>

You can access your Amazon OpenSearch Ingestion pipelines using an interface VPC endpoint. A VPC is a virtual network that's dedicated to your AWS account. It's logically isolated from other virtual networks in the AWS Cloud. Accessing a pipeline through a VPC endpoint enables secure communication between OpenSearch Ingestion and other services within the VPC without the need for an internet gateway, NAT device, or VPN connection. All traffic remains securely within the AWS Cloud.

OpenSearch Ingestion establishes this private connection by creating an *interface endpoint*, powered by AWS PrivateLink. We create an endpoint network interface in each subnet that you specify during pipeline creation. These are requester-managed network interfaces that serve as the entry point for traffic destined for the OpenSearch Ingestion pipeline. You can also choose to create and manage the interface endpoints yourself. 

Using a VPC allows you to enforce data flow through your OpenSearch Ingestion pipelines within the boundaries of the VPC, rather than over the public internet. Pipelines that aren't within a VPC send and receive data over public-facing endpoints and the internet.

A pipeline with VPC access can write to public or VPC OpenSearch Service domains, and to public or VPC OpenSearch Serverless collections. 

**Topics**
+ [Considerations](#pipeline-vpc-considerations)
+ [Limitations](#pipeline-vpc-limitations)
+ [Prerequisites](#pipeline-vpc-prereqs)
+ [Configuring VPC access for a pipeline](#pipeline-vpc-configure)
+ [Self-managed VPC endpoints](#pipeline-vpc-self-managed)
+ [Service-linked role for VPC access](#pipeline-vpc-slr)

## Considerations
<a name="pipeline-vpc-considerations"></a>

Consider the following when you configure VPC access for a pipeline.
+ A pipeline doesn't need to be in the same VPC as its sink. You also don't need to establish a connection between the two VPCs. OpenSearch Ingestion takes care of connecting them for you.
+ You can only specify one VPC for your pipeline.
+ Unlike public pipelines, a VPC pipeline must be in the same AWS Region as the domain or collection sink that it writes to. To write across Regions, you can configure an S3 source for the pipeline.
+ You can choose to deploy a pipeline into one, two, or three subnets of your VPC. The subnets are distributed across the same Availability Zones that your OpenSearch Compute Units (OCUs) are deployed in.
+ If you only deploy a pipeline in one subnet and the Availability Zone goes down, you won't be able to ingest data. To ensure high availability, we recommend that you configure pipelines with two or three subnets.
+ Specifying a security group is optional. If you don't provide a security group, OpenSearch Ingestion uses the default security group that is specified in the VPC.

## Limitations
<a name="pipeline-vpc-limitations"></a>

Pipelines with VPC access have the following limitations.
+ You can't change a pipeline's network configuration after you create it. If you launch a pipeline within a VPC, you can't later change it to a public endpoint, and vice versa.
+ You can either launch your pipeline with an interface VPC endpoint or a public endpoint, but you can't do both. You must choose one or the other when you create a pipeline.
+ After you provision a pipeline with VPC access, you can't move it to a different VPC, and you can't change its subnets or security group settings.
+ If your pipeline writes to a domain or collection sink that uses VPC access, you can't change the sink (VPC or public) after the pipeline is created. You must delete and recreate the pipeline with a new sink. However, you can still switch from a public sink to a sink with VPC access.
+ You can't provide [cross-account ingestion access](configure-client.md#configure-client-cross-account) to VPC pipelines.

## Prerequisites
<a name="pipeline-vpc-prereqs"></a>

Before you can provision a pipeline with VPC access, you must do the following:
+ **Create a VPC**

  To create your VPC, you can use the Amazon VPC console, the AWS CLI, or one of the AWS SDKs. For more information, see [Working with VPCs](https://docs.aws.amazon.com/vpc/latest/userguide/working-with-vpcs.html) in the *Amazon VPC User Guide*. If you already have a VPC, you can skip this step.
+ **Reserve IP addresses**

  OpenSearch Ingestion places an *elastic network interface* in each subnet that you specify during pipeline creation. Each network interface is associated with an IP address. You must reserve one IP address per subnet for the network interfaces.

## Configuring VPC access for a pipeline
<a name="pipeline-vpc-configure"></a>

You can enable VPC access for a pipeline within the OpenSearch Service console or using the AWS CLI.

### Console
<a name="pipeline-vpc-configure-console"></a>

You configure VPC access during [pipeline creation](creating-pipeline.md#create-pipeline). Under **Source network options**, choose **VPC access** and configure the following settings:


| Setting | Description | 
| --- | --- | 
| Endpoint management |  Choose whether you want to create your VPC endpoints yourself, or have OpenSearch Ingestion create them for you.  | 
| VPC |  Choose the ID of the virtual private cloud (VPC) that you want to use. The VPC and pipeline must be in the same AWS Region.  | 
| Subnets |  Choose one or more subnets. OpenSearch Service places a VPC endpoint and elastic network interfaces in the subnets.  | 
| Security groups |  Choose one or more VPC security groups that allow your application to reach the OpenSearch Ingestion pipeline on the ports (80 or 443) and protocols (HTTP or HTTPS) that the pipeline exposes.  | 
| VPC attachment options |  If your source requires cross-VPC communication, such as Amazon DocumentDB, self-managed OpenSearch, or Confluent Kafka, OpenSearch Ingestion creates Elastic Network Interfaces (ENIs) in the subnets that you specify in order to connect to these sources. OpenSearch Ingestion uses ENIs in each Availability Zone to reach the specified sources. The **Attach to VPC** option connects the OpenSearch Ingestion data plane VPC to your specified VPC. Select a CIDR reservation for the managed VPC to deploy the network interface.  | 

### CLI
<a name="pipeline-vpc-configure-cli"></a>

To configure VPC access using the AWS CLI, specify the `--vpc-options` parameter:

```
aws osis create-pipeline \
  --pipeline-name vpc-pipeline \
  --min-units 4 \
  --max-units 10 \
  --vpc-options SecurityGroupIds={sg-12345678,sg-9012345},SubnetIds=subnet-1212234567834asdf \
  --pipeline-configuration-body "file://pipeline-config.yaml"
```
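If you prefer the AWS SDK for Python (Boto3), which the earlier tutorial script uses, the same call can be sketched as follows. The configuration body and all resource IDs are placeholders, and the request is only assembled here rather than sent:

```python
# Sketch (Boto3 equivalent of the CLI call above): assemble the arguments
# for osis.create_pipeline() with VPC options. The IDs and the pipeline
# configuration body are placeholders.
pipeline_config = 'version: "2"\n# ... sub-pipeline definition goes here ...'
request = {
    "PipelineName": "vpc-pipeline",
    "MinUnits": 4,
    "MaxUnits": 10,
    "PipelineConfigurationBody": pipeline_config,
    "VpcOptions": {
        "SecurityGroupIds": ["sg-12345678", "sg-9012345"],
        "SubnetIds": ["subnet-1212234567834asdf"],
    },
}
# To send it: boto3.client("osis").create_pipeline(**request)
```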

## Self-managed VPC endpoints
<a name="pipeline-vpc-self-managed"></a>

When you create a pipeline, you can use endpoint management to create a pipeline with self-managed endpoints or service-managed endpoints. Endpoint management is optional, and defaults to endpoints managed by OpenSearch Ingestion. 

To create a pipeline with a self-managed VPC endpoint in the AWS Management Console, see [Creating pipelines with the OpenSearch Service console](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/creating-pipeline.html#create-pipeline-console). To create a pipeline with a self-managed VPC endpoint in the AWS CLI, you can use the `--vpc-options` parameter in the [create-pipeline](https://docs.aws.amazon.com/cli/latest/reference/osis/create-pipeline.html) command:

```
--vpc-options SubnetIds=subnet-abcdef01234567890,VpcEndpointManagement=CUSTOMER
```

You can create an endpoint to your pipeline yourself when you specify your endpoint service. To find your endpoint service, use the [get-pipeline](https://docs.aws.amazon.com/cli/latest/reference/osis/get-pipeline.html) command, which returns a response similar to the following:

```
"vpcEndpointService" : "com.amazonaws.osis.us-east-1.pipeline-id-1234567890abcdef1234567890",
"vpcEndpoints" : [ 
  {
    "vpcId" : "vpc-1234567890abcdef0",
    "vpcOptions" : {
      "subnetIds" : [ "subnet-abcdef01234567890", "subnet-021345abcdef6789" ],
      "vpcEndpointManagement" : "CUSTOMER"
    }
  }
]
```

Use the `vpcEndpointService` from the response to create a VPC endpoint with the AWS Management Console or AWS CLI.
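As a sketch, assuming you use Boto3, the endpoint itself can be created through the Amazon EC2 `create_vpc_endpoint` API with the service name from the `get-pipeline` response. All IDs below are placeholders, and the request is only assembled here rather than sent:

```python
# Sketch: arguments for ec2.create_vpc_endpoint(), using the vpcEndpointService
# value returned by get-pipeline. The VPC, subnet, and service IDs are placeholders.
endpoint_request = {
    "VpcEndpointType": "Interface",
    "ServiceName": "com.amazonaws.osis.us-east-1.pipeline-id-1234567890abcdef1234567890",
    "VpcId": "vpc-1234567890abcdef0",
    "SubnetIds": ["subnet-abcdef01234567890", "subnet-021345abcdef6789"],
}
# To create it: boto3.client("ec2").create_vpc_endpoint(**endpoint_request)
```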

If you use self-managed VPC endpoints, you must enable the DNS attributes `enableDnsSupport` and `enableDnsHostnames` in your VPC. Note that if you have a pipeline with a self-managed endpoint that you [stop and restart](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/pipeline--stop-start.html), you must recreate the VPC endpoint in your account.
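The two DNS attributes can be enabled with the Amazon EC2 `modify_vpc_attribute` API, which accepts only one attribute per call. This is a sketch with a placeholder VPC ID; the calls are assembled here rather than sent:

```python
# Sketch: enable enableDnsSupport and enableDnsHostnames on the VPC.
# ModifyVpcAttribute takes one attribute per call, so two calls are needed.
vpc_id = "vpc-1234567890abcdef0"  # placeholder
dns_attribute_calls = [
    {"VpcId": vpc_id, "EnableDnsSupport": {"Value": True}},
    {"VpcId": vpc_id, "EnableDnsHostnames": {"Value": True}},
]
# To apply them:
#   ec2 = boto3.client("ec2")
#   for kwargs in dns_attribute_calls:
#       ec2.modify_vpc_attribute(**kwargs)
```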

## Service-linked role for VPC access
<a name="pipeline-vpc-slr"></a>

A [service-linked role](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_terms-and-concepts.html#iam-term-service-linked-role) is a unique type of IAM role that delegates permissions to a service so that it can create and manage resources on your behalf. If you choose a service-managed VPC endpoint, OpenSearch Ingestion requires a service-linked role called **AWSServiceRoleForAmazonOpenSearchIngestionService** to access your VPC, create the pipeline endpoint, and place network interfaces in a subnet of your VPC. 

If you choose a self-managed VPC endpoint, OpenSearch Ingestion requires a service-linked role called **AWSServiceRoleForOpensearchIngestionSelfManagedVpce**. For more information on these roles, their permissions, and how to delete them, see [Using service-linked roles to create OpenSearch Ingestion pipelines](slr-osis.md).

OpenSearch Ingestion automatically creates the role when you create an ingestion pipeline. For this automatic creation to succeed, the user creating the first pipeline in an account must have permissions for the `iam:CreateServiceLinkedRole` action. To learn more, see [Service-linked role permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/using-service-linked-roles.html#service-linked-role-permissions) in the *IAM User Guide*. You can view the role in the AWS Identity and Access Management (IAM) console after it's created.

# Configuring OpenSearch Ingestion pipelines for cross-account ingestion
<a name="cross-account-pipelines"></a>

For push-based sources such as HTTP and OTel, Amazon OpenSearch Ingestion enables you to share pipelines across AWS accounts from a virtual private cloud (VPC) to a pipeline endpoint in a separate VPC. For example, teams that share analytics with other teams in their organization can use this feature to streamline the sharing of log analytics.

This section uses the following terminology:
+ **Pipeline owner**—The account that owns and manages the OpenSearch Ingestion pipeline. Only one account can own a pipeline.
+ **Connecting account**—An account that connects to and uses a shared pipeline. Multiple accounts can connect to the same pipeline.

To configure VPCs to share OpenSearch Ingestion pipelines across AWS accounts, complete the following tasks, as described here:
+ (Pipeline owner) [Grant connecting accounts access to a pipeline](#cross-account-pipelines-setting-up-grant-access)
+ (Connecting account) [Create a pipeline endpoint connection for each connecting VPC](#cross-account-pipelines-setting-up-create-pipeline-endpoints)

## Before you begin
<a name="cross-account-pipelines-before-you-begin"></a>

Before you configure VPCs to share OpenSearch Ingestion pipelines across AWS accounts, complete the following tasks:



| Task | Details | 
| --- | --- | 
|  Create one or more OpenSearch Ingestion pipelines  |  Set the minimum OpenSearch Compute Units (OCUs) to 2 or higher. For more information, see [Creating Amazon OpenSearch Ingestion pipelines](creating-pipeline.md). For information about updating a pipeline, see [Updating Amazon OpenSearch Ingestion pipelines](update-pipeline.md).  | 
|  Create one or more VPCs for OpenSearch Ingestion  |  To enable cross-account pipeline sharing, any VPC involved with the pipeline or its pipeline endpoints must be configured with the required DNS attributes. [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/opensearch-service/latest/developerguide/cross-account-pipelines.html) For more information, see [DNS attributes for your VPC](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-dns.html) in the *Amazon VPC User Guide*.  | 

## Grant connecting accounts access to a pipeline
<a name="cross-account-pipelines-setting-up-grant-access"></a>

The procedures in this section describe how to use the OpenSearch Service console and the AWS CLI to set up cross-account pipeline access by creating a resource policy. A *resource policy* enables a pipeline owner to specify other accounts that can access a pipeline. Once created, pipeline policies exist for as long as the pipeline exists or until the policy is deleted.

**Note**  
Resource policies don't replace standard OpenSearch Ingestion authorization using [IAM permissions](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/creating-pipeline.html#create-pipeline-permissions). Resource policies are an added authorization mechanism for enabling cross-account pipeline access.

**Topics**
+ [Grant connecting accounts access to a pipeline (console)](#cross-account-pipelines-setting-up-grant-access-console)
+ [Grant connecting accounts access to a pipeline (CLI)](#cross-account-pipelines-setting-up-grant-access-cli)

### Grant connecting accounts access to a pipeline (console)
<a name="cross-account-pipelines-setting-up-grant-access-console"></a>

Use the following procedure to grant connecting accounts access to a pipeline by using the Amazon OpenSearch Service console.

**To create a pipeline endpoint connection**

1. In the Amazon OpenSearch Service console, in the navigation pane, expand **Ingestion**, and then select **Pipelines**.

1. In the **Pipelines** section, choose the name of a pipeline that you want to grant access for a connecting account.

1. Choose the **VPC endpoints** tab.

1. In the **Authorized principals** section, choose **Authorize account**.

1. In the **AWS account ID** field, enter the 12-digit account ID, and then select **Authorize**.

### Grant connecting accounts access to a pipeline (CLI)
<a name="cross-account-pipelines-setting-up-grant-access-cli"></a>

Use the following procedure to grant connecting accounts access to a pipeline by using the AWS CLI.

**To grant connecting accounts access to the pipeline**

1. Update to the latest version of the AWS CLI (version 2.0). For more information, see [Installing or updating to the latest version of the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html).

1. Open the CLI in the account and AWS Region with the pipeline you want to share.

1. Run the following command to create a resource policy for the pipeline. This policy gives the `osis:CreatePipelineEndpoint` permission on the pipeline. The policy includes a parameter where you can list AWS account IDs to allow.
**Note**  
In the following command, you must provide only the 12-digit account ID; an ARN will not work. You must also provide the Amazon Resource Name (ARN) of the pipeline in the `resource-arn` CLI parameter and in the policy JSON under `Resource`, as shown.

   ```
   aws --region region osis put-resource-policy \
     --resource-arn arn:aws:osis:region:pipeline-owner-account-ID:pipeline/pipeline-name \
     --policy 'IAM-policy'
   ```

   Use a policy like the following for *IAM-policy*:

------
#### [ JSON ]


   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Sid": "AllowAccess",
         "Effect": "Allow",
         "Principal": {
           "AWS": [
             "111122223333",
             "444455556666"
           ]
         },
         "Action": "osis:CreatePipelineEndpoint",
         "Resource": "arn:aws:osis:us-east-1:123456789012:pipeline/pipeline-name"
       }
     ]
   }
   ```

------
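The policy can also be built programmatically before passing it to `put-resource-policy`. This Boto3-oriented sketch uses the same placeholder account IDs and pipeline ARN as above:

```python
import json

def build_pipeline_access_policy(pipeline_arn, account_ids):
    """Build a resource policy that lets the listed accounts create
    endpoints for the pipeline. Account IDs here are placeholders."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowAccess",
            "Effect": "Allow",
            "Principal": {"AWS": account_ids},
            "Action": "osis:CreatePipelineEndpoint",
            "Resource": pipeline_arn,
        }],
    })

policy = build_pipeline_access_policy(
    "arn:aws:osis:us-east-1:123456789012:pipeline/pipeline-name",
    ["111122223333", "444455556666"],
)
# To attach it:
#   boto3.client("osis").put_resource_policy(
#       ResourceArn="arn:aws:osis:us-east-1:123456789012:pipeline/pipeline-name",
#       Policy=policy)
```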

## Create a pipeline endpoint connection for each connecting VPC
<a name="cross-account-pipelines-setting-up-create-pipeline-endpoints"></a>

After the pipeline owner grants access to a pipeline in their VPC using the previous procedure, a user in the connecting account creates a pipeline endpoint in their VPC. This section includes procedures for creating endpoints by using the OpenSearch Service console and the AWS CLI. When you create an endpoint, OpenSearch Ingestion performs the following actions:
+ Creates the [AWSServiceRoleForAmazonOpenSearchIngestionService](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/slr-osis.html) service-linked role in your account, if it doesn't already exist. This role gives the user in the connecting account permission to call the [CreatePipelineEndpoint](https://docs.aws.amazon.com/opensearch-service/latest/APIReference/API_osis_CreatePipelineEndpoint.html) API action.
+ Creates the pipeline endpoint.
+ Configures the pipeline endpoint to ingest data from the shared pipeline in the pipeline owner VPC.

**Topics**
+ [Creating a pipeline endpoint connection (console)](#cross-account-pipelines-setting-up-create-pipeline-endpoints-console)
+ [Creating a pipeline endpoint connection (CLI)](#cross-account-pipelines-setting-up-create-pipeline-endpoints-cli)

### Creating a pipeline endpoint connection (console)
<a name="cross-account-pipelines-setting-up-create-pipeline-endpoints-console"></a>

Use the following procedure to create a pipeline endpoint connection by using the OpenSearch Service console.

**To create a pipeline endpoint connection**

1. In the Amazon OpenSearch Service console, in the navigation pane, expand **Ingestion**, and then select **VPC endpoints**.

1. In the **VPC endpoints** page, choose **Create**.

1. For **Pipeline location**, choose an option. If you choose **Current account**, choose the pipeline from the list. If you choose **Cross-account**, specify the pipeline ARN in the field. The pipeline owner must have granted access to the pipeline, as described in [Grant connecting accounts access to a pipeline](#cross-account-pipelines-setting-up-grant-access). 

1. In the **VPC settings** section, for **VPC**, choose a VPC from the list.

1. For **Subnets**, choose a subnet.

1. For **Security groups**, choose a group.

1. Choose **Create endpoint**.

Wait for the status of the endpoint you created to transition to `ACTIVE`. Once the endpoint is `ACTIVE`, you will see a new field named `ingestEndpointUrl`. Use this endpoint to access the pipeline and ingest data using a client like Fluent Bit. For more information, see [Using an OpenSearch Ingestion pipeline with Fluent Bit](configure-client-fluentbit.md).

**Note**  
The `ingestEndpointUrl` is the same URL for all connecting accounts.

### Creating a pipeline endpoint connection (CLI)
<a name="cross-account-pipelines-setting-up-create-pipeline-endpoints-cli"></a>

Use the following procedure to create a pipeline endpoint connection by using the AWS CLI.

**To create a pipeline endpoint connection**

1. If you haven't already, update to the latest version of the AWS CLI (version 2.0). For more information, see [Installing or updating to the latest version of the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html).

1. Open the CLI in the connecting account in the AWS Region with the shared pipeline.

1. Run the following command to create a pipeline endpoint.
**Note**  
You must provide at least one subnet and one security group for the connecting account's VPC. The security group must allow traffic on port 443 from clients in the connecting account's VPC.

   ```
   aws osis --region region create-pipeline-endpoint \
     --pipeline-arn arn:aws:osis:region:pipeline-owner-account-ID:pipeline/shared-pipeline-name \
     --vpc-options SecurityGroupIds={sg-security-group-ID-1,sg-security-group-ID-2},SubnetIds=subnet-subnet-ID
   ```

1. Run the following command to list endpoints in the Region specified in the previous command:

   ```
   aws osis --region region list-pipeline-endpoints
   ```

Wait for the status of the endpoint you created to transition to `ACTIVE`. Once the endpoint is `ACTIVE`, you will see a new field named `ingestEndpointUrl`. Use this endpoint to access the pipeline and ingest data using a client like Fluent Bit. For more information, see [Using an OpenSearch Ingestion pipeline with Fluent Bit](configure-client-fluentbit.md).

**Note**  
The `ingestEndpointUrl` is the same URL for all connecting accounts.
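Once endpoints exist, a short helper can pick out the ingest URLs of the `ACTIVE` ones from a `list-pipeline-endpoints`-style response. The field names below are illustrative, not the exact API response shape:

```python
def active_ingest_urls(endpoints):
    """Return the ingest URL of each ACTIVE pipeline endpoint.
    Field names are illustrative, not the exact API response shape."""
    return [e["IngestEndpointUrl"] for e in endpoints
            if e.get("Status") == "ACTIVE" and "IngestEndpointUrl" in e]

sample = [
    {"EndpointId": "pe-1234", "Status": "CREATING"},
    {"EndpointId": "pe-5678", "Status": "ACTIVE",
     "IngestEndpointUrl": "pipeline-endpoint.example.osis.us-east-1.on.aws"},
]
print(active_ingest_urls(sample))
```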

## Removing pipeline endpoints
<a name="cross-account-pipelines-remove"></a>

If you no longer want to provide access to a shared pipeline, you can remove a pipeline endpoint using one of the following methods:
+ Delete the pipeline endpoint (connecting account).
+ Revoke the pipeline endpoint (pipeline owner).

Use the following procedure to delete a pipeline endpoint in a connecting account.

**To delete a pipeline endpoint (connecting account)**

1. Open the CLI in the connecting account in the AWS Region with the shared pipeline.

1. Run the following command to list pipeline endpoints in the Region:

   ```
   aws osis --region region list-pipeline-endpoints
   ```

   Make a note of the endpoint ID you want to delete.

1. Run the following command to delete the pipeline endpoint:

   ```
   aws osis --region region delete-pipeline-endpoint \
     --endpoint-id 'ID'
   ```

As the pipeline owner of the shared pipeline, use the following procedure to revoke a pipeline endpoint.

**To revoke a pipeline endpoint (pipeline owner)**

1. Open the CLI in the pipeline owner account in the AWS Region with the shared pipeline.

1. Run the following command to list pipeline endpoint connections in the Region:

   ```
   aws osis --region region list-pipeline-endpoint-connections
   ```

   Make a note of the endpoint ID you want to revoke.

1. Run the following command to revoke the pipeline endpoint connection:

   ```
   aws osis --region region revoke-pipeline-endpoint-connections \
     --pipeline-arn pipeline-arn --endpoint-ids ID
   ```

   The command supports specifying only one endpoint ID.

# Identity and Access Management for Amazon OpenSearch Ingestion
<a name="security-iam-ingestion"></a>

AWS Identity and Access Management (IAM) is an AWS service that helps an administrator securely control access to AWS resources. IAM administrators control who can be *authenticated* (signed in) and *authorized* (have permissions) to use OpenSearch Ingestion resources. IAM is an AWS service that you can use with no additional charge.

**Topics**
+ [Identity-based policies for OpenSearch Ingestion](#security-iam-ingestion-id-based-policies)
+ [Policy actions for OpenSearch Ingestion](#security-iam-ingestion-id-based-policies-actions)
+ [Policy resources for OpenSearch Ingestion](#security-iam-ingestion-id-based-policies-resources)
+ [Policy condition keys for Amazon OpenSearch Ingestion](#security_iam_ingestion-conditionkeys)
+ [ABAC with OpenSearch Ingestion](#security_iam_ingestion-with-iam-tags)
+ [Using temporary credentials with OpenSearch Ingestion](#security_iam_ingestion-tempcreds)
+ [Service-linked roles for OpenSearch Ingestion](#security_iam_ingestion-slr)
+ [Identity-based policy examples for OpenSearch Ingestion](#security_iam_ingestion_id-based-policy-examples)

## Identity-based policies for OpenSearch Ingestion
<a name="security-iam-ingestion-id-based-policies"></a>

**Supports identity-based policies:** Yes

Identity-based policies are JSON permissions policy documents that you can attach to an identity, such as an IAM user, group of users, or role. These policies control what actions users and roles can perform, on which resources, and under what conditions. To learn how to create an identity-based policy, see [Define custom IAM permissions with customer managed policies](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_create.html) in the *IAM User Guide*.

With IAM identity-based policies, you can specify allowed or denied actions and resources as well as the conditions under which actions are allowed or denied. To learn about all of the elements that you can use in a JSON policy, see [IAM JSON policy elements reference](https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_elements.html) in the *IAM User Guide*.

### Identity-based policy examples for OpenSearch Ingestion
<a name="osis-security_iam_id-based-policy-examples"></a>

To view examples of OpenSearch Ingestion identity-based policies, see [Identity-based policy examples for OpenSearch Ingestion](#security_iam_ingestion_id-based-policy-examples).

## Policy actions for OpenSearch Ingestion
<a name="security-iam-ingestion-id-based-policies-actions"></a>

**Supports policy actions:** Yes

The `Action` element of a JSON policy describes the actions that you can use to allow or deny access in a policy. Policy actions usually have the same name as the associated AWS API operation. There are some exceptions, such as *permission-only actions* that don't have a matching API operation. There are also some operations that require multiple actions in a policy. These additional actions are called *dependent actions*.

Include actions in a policy to grant permissions to perform the associated operation.

Policy actions in OpenSearch Ingestion use the following prefix before the action:

```
osis
```

To specify multiple actions in a single statement, separate them with commas.

```
"Action": [
      "osis:action1",
      "osis:action2"
         ]
```

You can specify multiple actions using wildcard characters (\*). For example, to specify all actions that begin with the word `List`, include the following action:

```
"Action": "osis:List*"
```

To view examples of OpenSearch Ingestion identity-based policies, see [Identity-based policy examples for OpenSearch Ingestion](#security_iam_ingestion_id-based-policy-examples).

## Policy resources for OpenSearch Ingestion
<a name="security-iam-ingestion-id-based-policies-resources"></a>

**Supports policy resources:** Yes

Administrators can use AWS JSON policies to specify who has access to what. That is, which **principal** can perform **actions** on what **resources**, and under what **conditions**.

The `Resource` JSON policy element specifies the object or objects to which the action applies. As a best practice, specify a resource using its [Amazon Resource Name (ARN)](https://docs.aws.amazon.com/IAM/latest/UserGuide/reference-arns.html). For actions that don't support resource-level permissions, use a wildcard (\*) to indicate that the statement applies to all resources.

```
"Resource": "*"
```

## Policy condition keys for Amazon OpenSearch Ingestion
<a name="security_iam_ingestion-conditionkeys"></a>

**Supports service-specific policy condition keys:** No 

Administrators can use AWS JSON policies to specify who has access to what. That is, which **principal** can perform **actions** on what **resources**, and under what **conditions**.

The `Condition` element specifies when statements execute based on defined criteria. You can create conditional expressions that use [condition operators](https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_elements_condition_operators.html), such as equals or less than, to match the condition in the policy with values in the request. To see all AWS global condition keys, see [AWS global condition context keys](https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_condition-keys.html) in the *IAM User Guide*.

To see a list of OpenSearch Ingestion condition keys, see [Condition keys for Amazon OpenSearch Ingestion](https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazonopensearchingestion.html#amazonopensearchingestion-policy-keys) in the *Service Authorization Reference*. To learn with which actions and resources you can use a condition key, see [Actions defined by Amazon OpenSearch Ingestion](https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazonopensearchingestion.html#amazonopensearchingestion-actions-as-permissions).
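
As an illustration, the following policy sketch uses the global `aws:SourceIp` condition key to allow ingestion into a pipeline only from a specific network range. The account ID, pipeline ARN, and CIDR block are hypothetical placeholders.

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "osis:Ingest",
            "Resource": "arn:aws:osis:us-east-1:123456789012:pipeline/my-pipeline",
            "Condition": {
                "IpAddress": {
                    "aws:SourceIp": "203.0.113.0/24"
                }
            }
        }
    ]
}
```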

## ABAC with OpenSearch Ingestion
<a name="security_iam_ingestion-with-iam-tags"></a>

**Supports ABAC (tags in policies):** Yes

Attribute-based access control (ABAC) is an authorization strategy that defines permissions based on attributes called tags. You can attach tags to IAM entities and AWS resources, then design ABAC policies to allow operations when the principal's tag matches the tag on the resource.

To control access based on tags, you provide tag information in the [condition element](https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_elements_condition.html) of a policy using the `aws:ResourceTag/key-name`, `aws:RequestTag/key-name`, or `aws:TagKeys` condition keys.

If a service supports all three condition keys for every resource type, then the value is **Yes** for the service. If a service supports all three condition keys for only some resource types, then the value is **Partial**.

For more information about ABAC, see [Define permissions with ABAC authorization](https://docs.aws.amazon.com/IAM/latest/UserGuide/introduction_attribute-based-access-control.html) in the *IAM User Guide*. To view a tutorial with steps for setting up ABAC, see [Use attribute-based access control (ABAC)](https://docs.aws.amazon.com/IAM/latest/UserGuide/tutorial_attribute-based-access-control.html) in the *IAM User Guide*.

For more information about tagging OpenSearch Ingestion resources, see [Tagging Amazon OpenSearch Ingestion pipelines](tag-pipeline.md).
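
For example, the following policy sketch uses the `aws:ResourceTag` condition key to allow read access only to pipelines tagged with `environment=production`. The tag key, tag value, and account ID are illustrative assumptions.

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "osis:GetPipeline",
                "osis:ListTagsForResource"
            ],
            "Resource": "arn:aws:osis:us-east-1:123456789012:pipeline/*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/environment": "production"
                }
            }
        }
    ]
}
```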

## Using temporary credentials with OpenSearch Ingestion
<a name="security_iam_ingestion-tempcreds"></a>

**Supports temporary credentials:** Yes

Temporary credentials provide short-term access to AWS resources and are automatically created when you use federation or switch roles. AWS recommends that you dynamically generate temporary credentials instead of using long-term access keys. For more information, see [Temporary security credentials in IAM](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp.html) and [AWS services that work with IAM](https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_aws-services-that-work-with-iam.html) in the *IAM User Guide*.

## Service-linked roles for OpenSearch Ingestion
<a name="security_iam_ingestion-slr"></a>

**Supports service-linked roles:** Yes

A service-linked role is a type of service role that is linked to an AWS service. The service can assume the role to perform an action on your behalf. Service-linked roles appear in your AWS account and are owned by the service. An IAM administrator can view, but not edit, the permissions for service-linked roles.

OpenSearch Ingestion uses a service-linked role called `AWSServiceRoleForAmazonOpenSearchIngestionService`. The service-linked role named `AWSServiceRoleForOpensearchIngestionSelfManagedVpce` is also available for pipelines with self-managed VPC endpoints. For details about creating and managing OpenSearch Ingestion service-linked roles, see [Using service-linked roles to create OpenSearch Ingestion pipelines](slr-osis.md).

## Identity-based policy examples for OpenSearch Ingestion
<a name="security_iam_ingestion_id-based-policy-examples"></a>

By default, users and roles don't have permission to create or modify OpenSearch Ingestion resources. To grant users permission to perform actions on the resources that they need, an IAM administrator can create IAM policies.

To learn how to create an IAM identity-based policy by using these example JSON policy documents, see [Create IAM policies (console)](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_create-console.html) in the *IAM User Guide*.

For details about actions and resource types defined by Amazon OpenSearch Ingestion, including the format of the ARNs for each of the resource types, see [Actions, resources, and condition keys for Amazon OpenSearch Ingestion](https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazonopensearchingestion.html) in the *Service Authorization Reference*.

**Topics**
+ [Policy best practices](#security_iam_ingestion-policy-best-practices)
+ [Using OpenSearch Ingestion in the console](#security_iam_ingestion_id-based-policy-examples-console)
+ [Administering OpenSearch Ingestion pipelines](#security_iam_id-based-policy-examples-pipeline-admin)
+ [Ingesting data into an OpenSearch Ingestion pipeline](#security_iam_id-based-policy-examples-ingest-data)

### Policy best practices
<a name="security_iam_ingestion-policy-best-practices"></a>

Identity-based policies determine whether someone can create, access, or delete OpenSearch Ingestion resources in your account. These actions can incur costs for your AWS account. When you create or edit identity-based policies, follow these guidelines and recommendations:
+ **Get started with AWS managed policies and move toward least-privilege permissions** – To get started granting permissions to your users and workloads, use the *AWS managed policies* that grant permissions for many common use cases. They are available in your AWS account. We recommend that you reduce permissions further by defining AWS customer managed policies that are specific to your use cases. For more information, see [AWS managed policies](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_managed-vs-inline.html#aws-managed-policies) or [AWS managed policies for job functions](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_job-functions.html) in the *IAM User Guide*.
+ **Apply least-privilege permissions** – When you set permissions with IAM policies, grant only the permissions required to perform a task. You do this by defining the actions that can be taken on specific resources under specific conditions, also known as *least-privilege permissions*. For more information about using IAM to apply permissions, see [ Policies and permissions in IAM](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies.html) in the *IAM User Guide*.
+ **Use conditions in IAM policies to further restrict access** – You can add a condition to your policies to limit access to actions and resources. For example, you can write a policy condition to specify that all requests must be sent using SSL. You can also use conditions to grant access to service actions if they are used through a specific AWS service, such as CloudFormation. For more information, see [ IAM JSON policy elements: Condition](https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_elements_condition.html) in the *IAM User Guide*.
+ **Use IAM Access Analyzer to validate your IAM policies to ensure secure and functional permissions** – IAM Access Analyzer validates new and existing policies so that the policies adhere to the IAM policy language (JSON) and IAM best practices. IAM Access Analyzer provides more than 100 policy checks and actionable recommendations to help you author secure and functional policies. For more information, see [Validate policies with IAM Access Analyzer](https://docs.aws.amazon.com/IAM/latest/UserGuide/access-analyzer-policy-validation.html) in the *IAM User Guide*.
+ **Require multi-factor authentication (MFA)** – If you have a scenario that requires IAM users or a root user in your AWS account, turn on MFA for additional security. To require MFA when API operations are called, add MFA conditions to your policies. For more information, see [ Secure API access with MFA](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_mfa_configure-api-require.html) in the *IAM User Guide*.

For more information about best practices in IAM, see [Security best practices in IAM](https://docs.aws.amazon.com/IAM/latest/UserGuide/best-practices.html) in the *IAM User Guide*.

### Using OpenSearch Ingestion in the console
<a name="security_iam_ingestion_id-based-policy-examples-console"></a>

To access OpenSearch Ingestion within the OpenSearch Service console, you must have a minimum set of permissions. These permissions must allow you to list and view details about the OpenSearch Ingestion resources in your AWS account. If you create an identity-based policy that is more restrictive than the minimum required permissions, the console won't function as intended for entities (such as IAM roles) with that policy.

You don't need to allow minimum console permissions for users that are making calls only to the AWS CLI or the AWS API. Instead, allow access to only the actions that match the API operation that you're trying to perform.

The following policy allows a user to access OpenSearch Ingestion within the OpenSearch Service console:

------
#### [ JSON ]

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Resource": "*",
            "Effect": "Allow",
            "Action": [
                "osis:ListPipelines",
                "osis:GetPipeline",
                "osis:ListPipelineBlueprints",
                "osis:GetPipelineBlueprint",
                "osis:GetPipelineChangeProgress"
            ]
        }
    ]
}
```

------

Alternately, you can use the [AmazonOpenSearchIngestionReadOnlyAccess](ac-managed.md#AmazonOpenSearchIngestionReadOnlyAccess) AWS managed policy, which grants read-only access to all OpenSearch Ingestion resources for an AWS account.

### Administering OpenSearch Ingestion pipelines
<a name="security_iam_id-based-policy-examples-pipeline-admin"></a>

This policy is an example of a "pipeline admin" policy that allows a user to manage and administer Amazon OpenSearch Ingestion pipelines. The user can create, view, and delete pipelines.

------
#### [ JSON ]

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Resource": "arn:aws:osis:us-east-1:111122223333:pipeline/*",
            "Action": [
                "osis:CreatePipeline",
                "osis:DeletePipeline",
                "osis:UpdatePipeline",
                "osis:ValidatePipeline",
                "osis:StartPipeline",
                "osis:StopPipeline"
            ],
            "Effect": "Allow"
        },
        {
            "Resource": "*",
            "Action": [
                "osis:ListPipelines",
                "osis:GetPipeline",
                "osis:ListPipelineBlueprints",
                "osis:GetPipelineBlueprint",
                "osis:GetPipelineChangeProgress"
            ],
            "Effect": "Allow"
        }
    ]
}
```

------

### Ingesting data into an OpenSearch Ingestion pipeline
<a name="security_iam_id-based-policy-examples-ingest-data"></a>

This example policy allows a user or other entity to ingest data into an Amazon OpenSearch Ingestion pipeline in their account. The user can't modify the pipelines.

------
#### [ JSON ]

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Resource": "arn:aws:osis:us-east-1:123456789012:pipeline/*",
            "Action": [
                "osis:Ingest"
            ],
            "Effect": "Allow"
        }
    ]
}
```

------
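
With the `osis:Ingest` permission in place, a client sends signed HTTP requests to the pipeline's ingest endpoint. The following sketch uses the third-party `awscurl` tool to sign the request with Signature Version 4; the endpoint URL, path, and payload are hypothetical and depend on your pipeline configuration.

```
awscurl --service osis --region us-east-1 \
    -X POST \
    -H "Content-Type: application/json" \
    -d '[{"time": "2014-08-11T11:40:13+00:00", "log": "test log entry"}]' \
    https://my-pipeline-abcdefg1234.us-east-1.osis.amazonaws.com/test/logs
```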

# Logging Amazon OpenSearch Ingestion API calls using AWS CloudTrail
<a name="osis-logging-using-cloudtrail"></a>

Amazon OpenSearch Ingestion is integrated with AWS CloudTrail, a service that provides a record of actions taken by a user, role, or an AWS service in OpenSearch Ingestion. 

CloudTrail captures all API calls for OpenSearch Ingestion as events. The calls captured include calls from the OpenSearch Ingestion section of the OpenSearch Service console and code calls to the OpenSearch Ingestion API operations.

If you create a trail, you can enable continuous delivery of CloudTrail events to an Amazon S3 bucket, including events for OpenSearch Ingestion. If you don't configure a trail, you can still view the most recent events in the CloudTrail console in **Event history**. 

Using the information collected by CloudTrail, you can determine the request that was made to OpenSearch Ingestion, the IP address from which the request was made, who made the request, when it was made, and additional details.

To learn more about CloudTrail, see the [AWS CloudTrail User Guide](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-user-guide.html).

## OpenSearch Ingestion information in CloudTrail
<a name="osisosis-info-in-cloudtrail"></a>

CloudTrail is enabled on your AWS account when you create the account. When activity occurs in OpenSearch Ingestion, that activity is recorded in a CloudTrail event along with other AWS service events in **Event history**. You can view, search, and download recent events in your AWS account. For more information, see [Viewing events with CloudTrail Event history](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/view-cloudtrail-events.html).

For an ongoing record of events in your AWS account, including events for OpenSearch Ingestion, create a trail. A *trail* enables CloudTrail to deliver log files to an Amazon S3 bucket. By default, when you create a trail in the console, the trail applies to all AWS Regions. 

The trail logs events from all Regions in the AWS partition and delivers the log files to the Amazon S3 bucket that you specify. Additionally, you can configure other AWS services to further analyze and act upon the event data collected in CloudTrail logs. For more information, see the following:
+ [Overview for creating a trail](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-create-and-update-a-trail.html)
+ [CloudTrail supported services and integrations](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-aws-service-specific-topics.html)
+ [Configuring Amazon SNS notifications for CloudTrail](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/configure-sns-notifications-for-cloudtrail.html)
+ [Receiving CloudTrail log files from multiple regions](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/receive-cloudtrail-log-files-from-multiple-regions.html) and [Receiving CloudTrail log files from multiple accounts](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-receive-logs-from-multiple-accounts.html)

All OpenSearch Ingestion actions are logged by CloudTrail and are documented in the [OpenSearch Ingestion API reference](https://docs.aws.amazon.com/opensearch-service/latest/APIReference/API_Operations_Amazon_OpenSearch_Ingestion.html). For example, calls to the `CreatePipeline`, `ListPipelines`, and `DeletePipeline` actions generate entries in the CloudTrail log files.

Every event or log entry contains information about who generated the request. The identity information helps you determine:
+ Whether the request was made with root or AWS Identity and Access Management (IAM) user credentials.
+ Whether the request was made with temporary security credentials for a role or federated user.
+ Whether the request was made by another AWS service.

For more information, see the [CloudTrail userIdentity element](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-event-reference-user-identity.html).

## Understanding OpenSearch Ingestion log file entries
<a name="understanding-osis-entries"></a>

A trail is a configuration that enables delivery of events as log files to an Amazon S3 bucket that you specify. CloudTrail log files contain one or more log entries. 

An event represents a single request from any source. It includes information about the requested action, the date and time of the action, request parameters, and so on. CloudTrail log files aren't an ordered stack trace of the public API calls, so they don't appear in any specific order. 

The following example shows a CloudTrail log entry that demonstrates the `UpdatePipeline` action.

```
{
    "eventVersion": "1.08",
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "AIDACKCEVSQ6C2EXAMPLE",
        "arn":"arn:aws:iam::123456789012:user/test-user",
        "accountId": "123456789012",
        "accessKeyId": "access-key",
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "AIDACKCEVSQ6C2EXAMPLE",
                "arn": "arn:aws:iam::123456789012:role/Admin",
                "accountId": "123456789012",
                "userName": "Admin"
            },
            "webIdFederationData": {},
            "attributes": {
                "creationDate": "2023-04-21T16:48:33Z",
                "mfaAuthenticated": "false"
            }
        }
    },
    "eventTime": "2023-04-21T16:49:22Z",
    "eventSource": "osis.amazonaws.com",
    "eventName": "UpdatePipeline",
    "awsRegion": "us-west-2",
    "sourceIPAddress": "123.456.789.012",
    "userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36",
    "requestParameters": {
        "pipelineName": "my-pipeline",
        "pipelineConfigurationBody": "version: \"2\"\nlog-pipeline:\n  source:\n    http:\n        path: \"/test/logs\"\n  processor:\n    - grok:\n        match:\n          log: [ '%{COMMONAPACHELOG}' ]\n    - date:\n        from_time_received: true\n        destination: \"@timestamp\"\n  sink:\n    - opensearch:\n        hosts: [ \"https://search-b5zd22mwxhggheqpj5ftslgyle.us-west-2.es.amazonaws.com\" ]\n        index: \"apache_logs2\"\n        aws_sts_role_arn: \"arn:aws:iam::709387180454:role/canary-bootstrap-OsisRole-J1BARLD26QKN\"\n        aws_region: \"us-west-2\"\n        aws_sigv4: true\n"
    },
    "responseElements": {
        "pipeline": {
            "pipelineName": "my-pipeline",
            "pipelineArn": "arn:aws:osis:us-west-2:123456789012:pipeline/my-pipeline",
            "minUnits": 1,
            "maxUnits": 1,
            "status": "UPDATING",
            "statusReason": {
                "description": "An update was triggered for the pipeline. It is still available to ingest data."
            },
            "pipelineConfigurationBody": "version: \"2\"\nlog-pipeline:\n  source:\n    http:\n        path: \"/test/logs\"\n  processor:\n    - grok:\n        match:\n          log: [ '%{COMMONAPACHELOG}' ]\n    - date:\n        from_time_received: true\n        destination: \"@timestamp\"\n  sink:\n    - opensearch:\n        hosts: [ \"https://search-b5zd22mwxhggheqpj5ftslgyle.us-west-2.es.amazonaws.com\" ]\n        index: \"apache_logs2\"\n        aws_sts_role_arn: \"arn:aws:iam::709387180454:role/canary-bootstrap-OsisRole-J1BARLD26QKN\"\n        aws_region: \"us-west-2\"\n        aws_sigv4: true\n",
            "createdAt": "Mar 29, 2023 1:03:44 PM",
            "lastUpdatedAt": "Apr 21, 2023 9:49:21 AM",
            "ingestEndpointUrls": [
                "my-pipeline-tu33ldsgdltgv7x7tjqiudvf7m.us-west-2.osis.amazonaws.com"
            ]
        }
    },
    "requestID": "12345678-1234-1234-1234-987654321098",
    "eventID": "12345678-1234-1234-1234-987654321098",
    "readOnly": false,
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "709387180454",
    "eventCategory": "Management",
    "tlsDetails": {
        "tlsVersion": "TLSv1.2",
        "cipherSuite": "ECDHE-RSA-AES128-GCM-SHA256",
        "clientProvidedHostHeader": "osis.us-west-2.amazonaws.com"
    },
    "sessionCredentialFromConsole": "true"
}
```

# Amazon OpenSearch Ingestion and interface endpoints API (AWS PrivateLink)
<a name="osis-access-apis-using-privatelink"></a>

You can establish a private connection between your VPC and OpenSearch Ingestion API endpoints by creating an *interface VPC endpoint*. Interface endpoints are powered by [AWS PrivateLink](https://aws.amazon.com/privatelink). 

AWS PrivateLink enables you to privately access OpenSearch Ingestion API operations without an internet gateway, NAT device, VPN connection, or Direct Connect connection. Resources in your VPC don't need public IP addresses to communicate with OpenSearch Ingestion API endpoints to create, modify, or delete pipelines. Traffic between your VPC and OpenSearch Ingestion doesn't leave the Amazon network. 

**Note**  
This topic covers VPC endpoints for accessing the OpenSearch Ingestion *API*, which allows you to manage pipelines (create, update, delete) from within your VPC. This is different from configuring VPC access *for pipelines themselves*, which controls how data is ingested into pipelines from sources within your VPC. For information about configuring VPC access for pipelines, see [Configuring VPC access for Amazon OpenSearch Ingestion pipelines](pipeline-security.md).

Each interface endpoint is represented by one or more elastic network interfaces in your subnets. For more information on elastic network interfaces, see [Elastic network interfaces](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html) in the *Amazon EC2 User Guide.* 

For more information about VPC endpoints, see [Interface VPC endpoints (AWS PrivateLink)](https://docs.aws.amazon.com/vpc/latest/userguide/vpce-interface.html) in the *Amazon VPC User Guide*. For more information about OpenSearch Ingestion API operations, see the [OpenSearch Ingestion API reference](https://docs.aws.amazon.com/opensearch-service/latest/APIReference/API_Operations_Amazon_OpenSearch_Ingestion.html).

## Considerations for VPC endpoints
<a name="vpc-endpoint-considerations"></a>

Before you set up an interface VPC endpoint for OpenSearch Ingestion API endpoints, ensure that you review [Interface endpoint properties and limitations](https://docs.aws.amazon.com/vpc/latest/userguide/vpce-interface.html#vpce-interface-limitations) in the *Amazon VPC User Guide*. 

All OpenSearch Ingestion API operations relevant to managing OpenSearch Ingestion resources are available from your VPC using AWS PrivateLink.

VPC endpoint policies are supported for OpenSearch Ingestion API endpoints. By default, full access to OpenSearch Ingestion API operations is allowed through the endpoint. For more information, see [Controlling access to services with VPC endpoints](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-endpoints-access.html) in the *Amazon VPC User Guide*.

## Availability
<a name="osis-vpc-interface-endpoints-availability"></a>

The OpenSearch Ingestion API currently supports VPC endpoints in all AWS Regions where OpenSearch Ingestion is available.

At this time, FIPS endpoints are not supported.

## Creating an interface VPC endpoint for OpenSearch Ingestion API
<a name="vpc-endpoint-create"></a>

You can create a VPC endpoint for the OpenSearch Ingestion API using either the Amazon VPC console or the AWS Command Line Interface (AWS CLI). For more information, see [Creating an interface endpoint](https://docs.aws.amazon.com/vpc/latest/userguide/vpce-interface.html#create-interface-endpoint) in the *Amazon VPC User Guide*.

Create a VPC endpoint for OpenSearch Ingestion API using the service name `com.amazonaws.region.osis`.

If you enable private DNS for the endpoint, you can make API requests to OpenSearch Ingestion with the VPC endpoint using its default DNS name for the AWS Region, for example `osis.us-east-1.amazonaws.com`.

For more information, see [Accessing a service through an interface endpoint](https://docs.aws.amazon.com/vpc/latest/userguide/vpce-interface.html#access-service-though-endpoint) in the *Amazon VPC User Guide*.
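
As a sketch, the following AWS CLI command creates such an interface endpoint with private DNS enabled. The VPC, subnet, and security group IDs are placeholders, and the service name assumes the `us-east-1` Region.

```
aws ec2 create-vpc-endpoint \
    --vpc-id vpc-0123456789abcdef0 \
    --vpc-endpoint-type Interface \
    --service-name com.amazonaws.us-east-1.osis \
    --subnet-ids subnet-0123456789abcdef0 \
    --security-group-ids sg-0123456789abcdef0 \
    --private-dns-enabled
```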

## Creating a VPC endpoint policy for OpenSearch Ingestion API
<a name="vpc-endpoint-policy"></a>

You can attach an endpoint policy to your VPC endpoint that controls access to OpenSearch Ingestion API. The policy specifies the following information:
+ The principal that can perform actions.
+ The actions that can be performed.
+ The resources on which actions can be performed.

For more information, see [Controlling access to services with VPC endpoints](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-endpoints-access.html) in the *Amazon VPC User Guide*. 

**Example: VPC endpoint policy for OpenSearch Ingestion API actions**  
The following is an example of an endpoint policy for OpenSearch Ingestion API. When attached to an endpoint, this policy grants access to the listed OpenSearch Ingestion API actions for all principals on all resources.

```
{
   "Statement":[
      {
         "Principal":"*",
         "Effect":"Allow",
         "Action":[
            "osis:CreatePipeline",
            "osis:UpdatePipeline",
            "osis:DeletePipeline"
         ],
         "Resource":"*"
      }
   ]
}
```

**Example: VPC endpoint policy that denies all access from a specified AWS account**  
The following VPC endpoint policy denies AWS account `123456789012` all access to resources using the endpoint. The policy allows all actions from other accounts.

```
{
  "Statement": [
    {
      "Action": "*",
      "Effect": "Allow",
      "Resource": "*",
      "Principal": "*"
    },
    {
      "Action": "*",
      "Effect": "Deny",
      "Resource": "*",
      "Principal": { "AWS": [ "123456789012" ] }
     }
   ]
}
```

# Tagging Amazon OpenSearch Ingestion pipelines
<a name="tag-pipeline"></a>

Tags let you assign arbitrary information to an Amazon OpenSearch Ingestion pipeline so you can categorize and filter on that information. A *tag* is a metadata label that you assign or that AWS assigns to an AWS resource. Each tag consists of a *key* and a *value*. For tags that you assign, you define the key and value. For example, you might define the key as `stage` and the value for one resource as `test`.

Tags help you do the following:
+ Identify and organize your AWS resources. Many AWS services support tagging, so you can assign the same tag to resources from different services to indicate that the resources are related. For example, you could assign the same tag to an OpenSearch Ingestion pipeline that you assign to an Amazon OpenSearch Service domain.
+ Track your AWS costs. You activate these tags on the AWS Billing and Cost Management dashboard. AWS uses the tags to categorize your costs and deliver a monthly cost allocation report to you. For more information, see [Use Cost Allocation Tags](https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/cost-alloc-tags.html) in the [AWS Billing User Guide](https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/).
+ Restrict access to pipelines using attribute based access control. For more information, see [Controlling access based on tag keys](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_tags.html#access_tags_control-tag-keys) in the IAM User Guide.

In OpenSearch Ingestion, the primary resource is a pipeline. You can use the OpenSearch Service console, the AWS CLI, OpenSearch Ingestion APIs, or the AWS SDKs to add, manage, and remove tags from a pipeline.
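
For example, the following identity-based policy sketch uses the `aws:RequestTag` condition key to allow tagging a pipeline only when the `environment` tag is set to `test` or `production`. The tag key, tag values, and account ID are illustrative assumptions.

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "osis:TagResource",
            "Resource": "arn:aws:osis:us-east-1:123456789012:pipeline/*",
            "Condition": {
                "StringEquals": {
                    "aws:RequestTag/environment": ["test", "production"]
                }
            }
        }
    ]
}
```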

**Topics**
+ [Permissions required](#pipeline-tag-permissions)
+ [Working with tags (console)](#tag-pipeline-console)
+ [Working with tags (AWS CLI)](#tag-pipeline-cli)

## Permissions required
<a name="pipeline-tag-permissions"></a>

OpenSearch Ingestion uses the following AWS Identity and Access Management (IAM) permissions for tagging pipelines:
+ `osis:TagResource`
+ `osis:ListTagsForResource`
+ `osis:UntagResource`

For more information about each permission, see [Actions, resources, and condition keys for OpenSearch Ingestion](https://docs.aws.amazon.com/service-authorization/latest/reference/list_opensearchingestionservice.html) in the *Service Authorization Reference*.

## Working with tags (console)
<a name="tag-pipeline-console"></a>

The console is the simplest way to tag a pipeline.

**To create a tag**

1. Sign in to the Amazon OpenSearch Service console at [https://console.aws.amazon.com/aos/osis/home](https://console.aws.amazon.com/aos/osis/home#osis/ingestion-pipelines). You'll be on the Pipelines page.

1. Select the pipeline you want to add tags to and go to the **Tags** tab.

1. Choose **Manage** and **Add new tag**.

1. Enter a tag key and an optional value.

1. Choose **Save**.

To delete a tag, follow the same steps and choose **Remove** on the **Manage tags** page.

For more information about using the console to work with tags, see [Tag Editor](https://docs.aws.amazon.com/awsconsolehelpdocs/latest/gsg/tag-editor.html) in the *AWS Management Console Getting Started Guide*.

## Working with tags (AWS CLI)
<a name="tag-pipeline-cli"></a>

To tag a pipeline using the AWS CLI, send a `TagResource` request: 

```
aws osis tag-resource
  --arn arn:aws:osis:us-east-1:123456789012:pipeline/my-pipeline 
  --tags Key=service,Value=osis Key=source,Value=otel
```

Remove tags from a pipeline using the `UntagResource` command:

```
aws osis untag-resource
  --arn arn:aws:osis:us-east-1:123456789012:pipeline/my-pipeline
  --tag-keys service
```

View the existing tags for a pipeline with the `ListTagsForResource` command:

```
aws osis list-tags-for-resource
  --arn arn:aws:osis:us-east-1:123456789012:pipeline/my-pipeline
```

# Logging and monitoring Amazon OpenSearch Ingestion with Amazon CloudWatch
<a name="monitoring-pipelines"></a>

Amazon OpenSearch Ingestion publishes metrics and logs to Amazon CloudWatch.

**Topics**
+ [Monitoring pipeline logs](monitoring-pipeline-logs.md)
+ [Monitoring pipeline metrics](monitoring-pipeline-metrics.md)

# Monitoring pipeline logs
<a name="monitoring-pipeline-logs"></a>

You can enable logging for Amazon OpenSearch Ingestion pipelines to expose error and warning messages raised during pipeline operations and ingestion activity. OpenSearch Ingestion publishes all logs to *Amazon CloudWatch Logs*. CloudWatch Logs can monitor information in the log files and notify you when certain thresholds are met. You can also archive your log data in highly durable storage. For more information, see the [Amazon CloudWatch Logs User Guide](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/).

Logs from OpenSearch Ingestion might indicate failed processing of requests, authentication errors from the source to the sink, and other warnings that can be helpful for troubleshooting. For its logs, OpenSearch Ingestion uses the log levels of `INFO`, `WARN`, `ERROR`, and `FATAL`. We recommend enabling log publishing for all pipelines.
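
After you enable log publishing (described later in this section), you can scan for the more severe levels with a CloudWatch Logs Insights query against the pipeline's log group. For example (an illustrative query; `@timestamp` and `@message` are standard Logs Insights fields):

```
fields @timestamp, @message
| filter @message like /ERROR|FATAL/
| sort @timestamp desc
| limit 20
```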

## Permissions required
<a name="monitoring-pipeline-logs-permissions"></a>

To enable OpenSearch Ingestion to send logs to CloudWatch Logs, you must be signed in as a user who has certain IAM permissions.

You need the following CloudWatch Logs permissions to create and update log delivery resources:

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Resource": "*",
            "Action": [
                "logs:CreateLogDelivery",
                "logs:PutResourcePolicy",
                "logs:UpdateLogDelivery",
                "logs:DeleteLogDelivery",
                "logs:DescribeResourcePolicies",
                "logs:GetLogDelivery",
                "logs:ListLogDeliveries"
            ]
        }
    ]
}
```


## Enabling log publishing
<a name="monitoring-pipeline-logs-enable"></a>

You can enable log publishing on existing pipelines, or while creating a pipeline. For steps to enable log publishing during pipeline creation, see [Creating pipelines](creating-pipeline.md#create-pipeline).

### Console
<a name="monitoring-pipeline-logs-enable-console"></a>

**To enable log publishing on an existing pipeline**

1. Sign in to the Amazon OpenSearch Service console at [https://console.aws.amazon.com/aos/osis/home](https://console.aws.amazon.com/aos/osis/home#osis/ingestion-pipelines). You'll be on the Pipelines page.

1. Open the pipeline that you want to enable logs for, then choose **Actions**, **Edit log publishing options**.

1. Enable **Publish to CloudWatch Logs**.

1. Either create a new log group or select an existing one. We recommend that you format the name as a path, such as `/aws/vendedlogs/OpenSearchIngestion/pipeline-name/audit-logs`. This format makes it easier to apply a CloudWatch access policy that grants permissions to all log groups under a specific path such as `/aws/vendedlogs/OpenSearchIngestion`.
**Important**  
You must include the prefix `vendedlogs` in the log group name; otherwise, creation fails.

1. Choose **Save**.

### CLI
<a name="monitoring-pipeline-logs-enable-cli"></a>

To enable log publishing using the AWS CLI, send the following request:

```
aws osis update-pipeline \
  --pipeline-name my-pipeline \
  --log-publishing-options  IsLoggingEnabled=true,CloudWatchLogDestination={LogGroup="/aws/vendedlogs/OpenSearchIngestion/pipeline-name"}
```
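
Because log group creation fails without the `vendedlogs` prefix, it can be worth validating a proposed name before you send the request. A minimal sketch (the checks are illustrative, not the service's exact validation rules):

```python
def check_log_group_name(name):
    """Flag problems with a proposed OpenSearch Ingestion log group
    name, based on the guidance above."""
    problems = []
    if "vendedlogs" not in name:
        problems.append("missing required 'vendedlogs' component")
    if not name.startswith("/aws/vendedlogs/"):
        problems.append("recommended path format is /aws/vendedlogs/...")
    return problems
```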

# Monitoring pipeline metrics
<a name="monitoring-pipeline-metrics"></a>

You can monitor Amazon OpenSearch Ingestion pipelines using Amazon CloudWatch, which collects raw data and processes it into readable, near real-time metrics. These statistics are kept for 15 months, so that you can access historical information and gain a better perspective on how your pipelines are performing. You can also set alarms that watch for certain thresholds, and send notifications or take actions when those thresholds are met. For more information, see the [Amazon CloudWatch User Guide](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/).

The OpenSearch Ingestion console displays a series of charts based on the raw data from CloudWatch on the **Performance** tab for each pipeline.

OpenSearch Ingestion reports metrics from most [supported plugins](pipeline-config-reference.md#ingestion-plugins). If certain plugins don't have their own table below, it means that they don't report any plugin-specific metrics. Pipeline metrics are published in the `AWS/OSIS` namespace.

**Topics**
+ [Common metrics](#common-metrics)
+ [Buffer metrics](#buffer-metrics)
+ [Signature V4 metrics](#sigv4-metrics)
+ [Bounded blocking buffer metrics](#blockingbuffer-metrics)
+ [Otel trace source metrics](#oteltrace-metrics)
+ [Otel metrics source metrics](#otelmetrics-metrics)
+ [Http metrics](#http-metrics)
+ [S3 metrics](#s3-metrics)
+ [Aggregate metrics](#aggregate-metrics)
+ [Date metrics](#date-metrics)
+ [Lambda metrics](#lambda-metrics)
+ [Grok metrics](#grok-metrics)
+ [Otel trace raw metrics](#oteltrace-raw-metrics)
+ [Otel trace group metrics](#oteltracegroup-metrics)
+ [Service map stateful metrics](#servicemapstateful-metrics)
+ [OpenSearch metrics](#opensearch-metrics)
+ [System and metering metrics](#systemmetering-metrics)

## Common metrics
<a name="common-metrics"></a>

The following metrics are common to all processors and sinks.

Each metric is prefixed by the sub-pipeline name and plugin name, in the format <*sub-pipeline-name*>.<*plugin*>.<*metric-name*>. For example, the full name of the `recordsIn.count` metric for a sub-pipeline named `my-pipeline` and the [date](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/date/) processor would be `my-pipeline.date.recordsIn.count`.
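
The naming scheme can be expressed as a small helper (illustrative only; the service composes these names for you):

```python
def full_metric_name(sub_pipeline, plugin, metric_suffix):
    """Compose a CloudWatch metric name in the form
    <sub-pipeline-name>.<plugin>.<metric-suffix>."""
    return f"{sub_pipeline}.{plugin}.{metric_suffix}"

# Matches the example above:
print(full_metric_name("my-pipeline", "date", "recordsIn.count"))
# my-pipeline.date.recordsIn.count
```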


| Metric suffix | Description | 
| --- | --- | 
| recordsIn.count |  The ingress of records to a pipeline component. This metric applies to processors and sinks. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| recordsOut.count |  The egress of records from a pipeline component. This metric applies to processors and sources. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| timeElapsed.count |  A count of data points recorded during execution of a pipeline component. This metric applies to processors and sinks. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| timeElapsed.sum |  The total time elapsed during execution of a pipeline component. This metric applies to processors and sinks, in milliseconds. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| timeElapsed.max |  The maximum time elapsed during execution of a pipeline component. This metric applies to processors and sinks, in milliseconds. **Relevant statistics**: Max **Dimension**: `PipelineName`  | 

## Buffer metrics
<a name="buffer-metrics"></a>

The following metrics apply to the default [Bounded blocking](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/buffers/bounded-blocking/) buffer that OpenSearch Ingestion automatically configures for all pipelines.

Each metric is prefixed by the sub-pipeline name and buffer name, in the format <*sub-pipeline-name*>.<*buffer-name*>.<*metric-name*>. For example, the full name of the `recordsWritten.count` metric for a sub-pipeline named `my-pipeline` would be `my-pipeline.BlockingBuffer.recordsWritten.count`.


| Metric suffix | Description | 
| --- | --- | 
| recordsWritten.count |  The number of records written to a buffer. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| recordsRead.count |  The number of records read from a buffer. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| recordsInFlight.value |  The number of unchecked records read from a buffer. **Relevant statistics**: Average **Dimension**: `PipelineName`  | 
| recordsInBuffer.value |  The number of records currently in a buffer. **Relevant statistics**: Average **Dimension**: `PipelineName`  | 
| recordsProcessed.count |  The number of records read from a buffer and processed by a pipeline. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| recordsWriteFailed.count |  The number of records that the pipeline failed to write to the sink. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| writeTimeElapsed.count |  A count of data points recorded while writing to a buffer. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| writeTimeElapsed.sum |  The total time elapsed while writing to a buffer, in milliseconds. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| writeTimeElapsed.max |  The maximum time elapsed while writing to a buffer, in milliseconds. **Relevant statistics**: Max **Dimension**: `PipelineName`  | 
| writeTimeouts.count |  The count of write timeouts to a buffer. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| readTimeElapsed.count |  A count of data points recorded while reading from a buffer. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| readTimeElapsed.sum |  The total time elapsed while reading from a buffer, in milliseconds. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| readTimeElapsed.max |  The maximum time elapsed while reading from a buffer, in milliseconds. **Relevant statistics**: Max **Dimension**: `PipelineName`  | 
| checkpointTimeElapsed.count |  A count of data points recorded while checkpointing. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| checkpointTimeElapsed.sum |  The total time elapsed while checkpointing, in milliseconds. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| checkpointTimeElapsed.max |  The maximum time elapsed while checkpointing, in milliseconds. **Relevant statistics**: Max **Dimension**: `PipelineName`  | 

## Signature V4 metrics
<a name="sigv4-metrics"></a>

The following metrics apply to the ingestion endpoint for a pipeline and are associated with the source plugins (`http`, `otel_trace`, and `otel_metrics`). All requests to the ingestion endpoint must be signed using [Signature Version 4](https://docs.aws.amazon.com/general/latest/gr/signature-version-4.html). These metrics can help you identify authorization issues when connecting to your pipeline, or confirm that you're successfully authenticating.

Each metric is prefixed by the sub-pipeline name and `osis_sigv4_auth`. For example, `sub_pipeline_name.osis_sigv4_auth.httpAuthSuccess.count`.


| Metric suffix | Description | 
| --- | --- | 
| httpAuthSuccess.count |  The number of successful Signature V4 requests to the pipeline. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| httpAuthFailure.count |  The number of failed Signature V4 requests to the pipeline. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| httpAuthServerError.count |  The number of Signature V4 requests to the pipeline that returned server errors. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 

## Bounded blocking buffer metrics
<a name="blockingbuffer-metrics"></a>

The following metrics apply to the [bounded blocking](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/buffers/bounded-blocking/) buffer. Each metric is prefixed by the sub-pipeline name and `BlockingBuffer`. For example, `sub_pipeline_name.BlockingBuffer.bufferUsage.value`.


| Metric suffix | Description | 
| --- | --- | 
| bufferUsage.value |  Percent usage of the `buffer_size` based on the number of records in the buffer. `buffer_size` represents the maximum number of records written into the buffer as well as in-flight records that have not been checked. **Relevant statistics**: Average **Dimension**: `PipelineName`  | 
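
The description above implies roughly the following calculation (a sketch of the metric's definition, not the plugin's actual code):

```python
def buffer_usage_percent(records_in_buffer, records_in_flight, buffer_size):
    """Sketch of bufferUsage.value: the percent of buffer_size consumed
    by buffered records plus in-flight records that have not been
    checked."""
    return 100.0 * (records_in_buffer + records_in_flight) / buffer_size

# e.g. 40 buffered + 10 in-flight records against a buffer_size of 100
print(buffer_usage_percent(40, 10, 100))
# 50.0
```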

## Otel trace source metrics
<a name="oteltrace-metrics"></a>

The following metrics apply to the [OTel trace](https://docs.opensearch.org/latest/data-prepper/pipelines/configuration/sources/otel-trace-source/) source. Each metric is prefixed by the sub-pipeline name and `otel_trace_source`. For example, `sub_pipeline_name.otel_trace_source.requestTimeouts.count`.


| Metric suffix | Description | 
| --- | --- | 
| requestTimeouts.count |  The number of requests that timed out. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| requestsReceived.count |  The number of requests received by the plugin. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| successRequests.count |  The number of requests that were successfully processed by the plugin. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| badRequests.count |  The number of requests with an invalid format that were processed by the plugin. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| requestsTooLarge.count |  The number of requests whose span count exceeds the buffer capacity. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| internalServerError.count |  The number of requests processed by the plugin with a custom exception type. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| requestProcessDuration.count |  A count of data points recorded while processing requests by the plugin. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| requestProcessDuration.sum |  The total latency of requests processed by the plugin, in milliseconds. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| requestProcessDuration.max |  The maximum latency of requests processed by the plugin, in milliseconds. **Relevant statistics**: Max **Dimension**: `PipelineName`  | 
| payloadSize.count |  A count of the distribution of payload sizes of incoming requests, in bytes. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| payloadSize.sum |  The total distribution of the payload sizes of incoming requests, in bytes. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| payloadSize.max |  The maximum distribution of payload sizes of incoming requests, in bytes. **Relevant statistics**: Max **Dimension**: `PipelineName`  | 

## Otel metrics source metrics
<a name="otelmetrics-metrics"></a>

The following metrics apply to the [OTel metrics](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/otel-metrics-source/) source. Each metric is prefixed by the sub-pipeline name and `otel_metrics_source`. For example, `sub_pipeline_name.otel_metrics_source.requestTimeouts.count`.


| Metric suffix | Description | 
| --- | --- | 
| requestTimeouts.count |  The total number of requests to the plugin that time out. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| requestsReceived.count |  The total number of requests received by the plugin. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| successRequests.count |  The number of requests successfully processed (200 response status code) by the plugin. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| requestProcessDuration.count |  A count of the latency of requests processed by the plugin, in seconds. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| requestProcessDuration.sum |  The total latency of requests processed by the plugin, in milliseconds. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| requestProcessDuration.max |  The maximum latency of requests processed by the plugin, in milliseconds. **Relevant statistics**: Max **Dimension**: `PipelineName`  | 
| payloadSize.count |  A count of the distribution of payload sizes of incoming requests, in bytes. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| payloadSize.sum |  The total distribution of the payload sizes of incoming requests, in bytes. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| payloadSize.max |  The maximum distribution of payload sizes of incoming requests, in bytes. **Relevant statistics**: Max **Dimension**: `PipelineName`  | 

## Http metrics
<a name="http-metrics"></a>

The following metrics apply to the [HTTP](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/http-source/) source. Each metric is prefixed by the sub-pipeline name and `http`. For example, `sub_pipeline_name.http.requestsReceived.count`.


| Metric suffix | Description | 
| --- | --- | 
| requestsReceived.count |  The number of requests received by the `/log/ingest` endpoint. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| requestsRejected.count |  The number of requests rejected (429 response status code) by the plugin. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| successRequests.count |  The number of requests successfully processed (200 response status code) by the plugin. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| badRequests.count |  The number of requests with invalid content type or format (400 response status code) processed by the plugin. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| requestTimeouts.count |  The number of requests that time out in the HTTP source server (415 response status code). **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| requestsTooLarge.count |  The number of requests whose event payload exceeds the buffer capacity (413 response status code). **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| internalServerError.count |  The number of requests processed by the plugin with a custom exception type (500 response status code). **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| requestProcessDuration.count |  A count of the latency of requests processed by the plugin, in seconds. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| requestProcessDuration.sum |  The total latency of requests processed by the plugin, in milliseconds. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| requestProcessDuration.max |  The maximum latency of requests processed by the plugin, in milliseconds. **Relevant statistics**: Max **Dimension**: `PipelineName`  | 
| payloadSize.count |  A count of the distribution of payload sizes of incoming requests, in bytes. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| payloadSize.sum |  The total distribution of the payload sizes of incoming requests, in bytes. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| payloadSize.max |  The maximum distribution of payload sizes of incoming requests, in bytes. **Relevant statistics**: Max **Dimension**: `PipelineName`  | 

## S3 metrics
<a name="s3-metrics"></a>

The following metrics apply to the [S3](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sources/s3/) source. Each metric is prefixed by the sub-pipeline name and `s3`. For example, `sub_pipeline_name.s3.s3ObjectsFailed.count`.


| Metric suffix | Description | 
| --- | --- | 
| s3ObjectsFailed.count |  The total number of S3 objects that the plugin failed to read. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| s3ObjectsNotFound.count |  The number of S3 objects that the plugin failed to read due to a `Not Found` error from S3. These objects also count toward the `s3ObjectsFailed` metric. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| s3ObjectsAccessDenied.count |  The number of S3 objects that the plugin failed to read due to an `Access Denied` or `Forbidden` error from S3. These objects also count toward the `s3ObjectsFailed` metric. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| s3ObjectReadTimeElapsed.count |  A count of data points recorded while the plugin performs a GET request for an S3 object, parses it, and writes events to the buffer. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| s3ObjectReadTimeElapsed.sum |  The total amount of time that the plugin takes to perform a GET request for an S3 object, parse it, and write events to the buffer, in milliseconds. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| s3ObjectReadTimeElapsed.max |  The maximum amount of time that the plugin takes to perform a GET request for an S3 object, parse it, and write events to the buffer, in milliseconds. **Relevant statistics**: Max **Dimension**: `PipelineName`  | 
| s3ObjectSizeBytes.count |  The count of the distribution of S3 object sizes, in bytes. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| s3ObjectSizeBytes.sum |  The total distribution of S3 object sizes, in bytes. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| s3ObjectSizeBytes.max |  The maximum distribution of S3 object sizes, in bytes. **Relevant statistics**: Max **Dimension**: `PipelineName`  | 
| s3ObjectProcessedBytes.count |  The count of the distribution of S3 objects processed by the plugin, in bytes. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| s3ObjectProcessedBytes.sum |  The total distribution of S3 objects processed by the plugin, in bytes. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| s3ObjectProcessedBytes.max |  The maximum distribution of S3 objects processed by the plugin, in bytes. **Relevant statistics**: Max **Dimension**: `PipelineName`  | 
| s3ObjectsEvents.count |  The count of the distribution of S3 events received by the plugin. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| s3ObjectsEvents.sum |  The total distribution of S3 events received by the plugin.  **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| s3ObjectsEvents.max |  The maximum distribution of S3 events received by the plugin. **Relevant statistics**: Max **Dimension**: `PipelineName`  | 
| sqsMessageDelay.count |  A count of data points recorded for the time between when S3 records an event for the creation of an object and when the object is fully parsed. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| sqsMessageDelay.sum |  The total amount of time between when S3 records an event time for the creation of an object to when it's fully parsed, in milliseconds. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| sqsMessageDelay.max |  The maximum amount of time between when S3 records an event time for the creation of an object to when it's fully parsed, in milliseconds. **Relevant statistics**: Max **Dimension**: `PipelineName`  | 
| s3ObjectsSucceeded.count |  The number of S3 objects that the plugin successfully read. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| sqsMessagesReceived.count |  The number of Amazon SQS messages received from the queue by the plugin. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| sqsMessagesDeleted.count |  The number of Amazon SQS messages deleted from the queue by the plugin. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| sqsMessagesFailed.count |  The number of Amazon SQS messages that the plugin failed to parse. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 

## Aggregate metrics
<a name="aggregate-metrics"></a>

The following metrics apply to the [Aggregate](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/aggregate/) processor. Each metric is prefixed by the sub-pipeline name and `aggregate`. For example, `sub_pipeline_name.aggregate.actionHandleEventsOut.count`.


| Metric suffix | Description | 
| --- | --- | 
| actionHandleEventsOut.count |  The number of events that have been returned from the `handleEvent` call to the configured action. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| actionHandleEventsDropped.count |  The number of events that have not been returned from the `handleEvent` call to the configured action. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| actionHandleEventsProcessingErrors.count |  The number of calls made to `handleEvent` for the configured action that resulted in an error. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| actionConcludeGroupEventsOut.count |  The number of events that have been returned from the `concludeGroup` call to the configured action. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| actionConcludeGroupEventsDropped.count |  The number of events that have not been returned from the `concludeGroup` call to the configured action. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| actionConcludeGroupEventsProcessingErrors.count |  The number of calls made to `concludeGroup` for the configured action that resulted in an error. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| currentAggregateGroups.value |  The current number of groups. This gauge decreases when groups are concluded, and increases when an event initiates the creation of a new group. **Relevant statistics**: Average **Dimension**: `PipelineName`  | 

## Date metrics
<a name="date-metrics"></a>

The following metrics apply to the [Date](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/date/) processor. Each metric is prefixed by the sub-pipeline name and `date`. For example, `sub_pipeline_name.date.dateProcessingMatchSuccess.count`.


| Metric suffix | Description | 
| --- | --- | 
| dateProcessingMatchSuccess.count |  The number of records that match at least one of the patterns specified in the `match` configuration option. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| dateProcessingMatchFailure.count |  The number of records that didn't match any of the patterns specified in the `match` configuration option. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 

## Lambda metrics
<a name="lambda-metrics"></a>

The following metrics apply to the [AWS Lambda](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/aws-lambda/) processor. Each metric is prefixed by the sub-pipeline name and `lambda`. For example, `sub_pipeline_name.lambda.recordsSuccessfullySentToLambda.count`.


| Metric suffix | Description | 
| --- | --- | 
| recordsSuccessfullySentToLambda.count |  The number of records successfully processed by the Lambda function. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| recordsFailedToSendToLambda.count |  The number of records that failed to be sent to the Lambda function. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| lambdaFunctionLatency.avg / lambdaFunctionLatency.max |  The latency of Lambda function invocations. **Relevant statistics**: Average and Maximum **Dimension**: `PipelineName`  | 
| numberOfRequestsSucceeded.count |  The total number of successful Lambda invocation requests. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| numberOfRequestsFailed.count |  The total number of failed Lambda invocation requests. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| requestPayloadSize.avg |  The size of request payloads sent to Lambda. **Relevant statistics**: Average **Dimension**: `PipelineName`  | 
| responsePayloadSize.avg |  The size of response payloads received from Lambda. **Relevant statistics**: Average **Dimension**: `PipelineName`  | 

## Grok metrics
<a name="grok-metrics"></a>

The following metrics apply to the [Grok](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/processors/grok/) processor. Each metric is prefixed by the sub-pipeline name and `grok`. For example, `sub_pipeline_name.grok.grokProcessingMatch.count`.


| Metric suffix | Description | 
| --- | --- | 
| grokProcessingMatch.count |  The number of records that found at least one pattern match from the `match` configuration option. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| grokProcessingMismatch.count |  The number of records that didn't match any of the patterns specified in the `match` configuration option. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| grokProcessingErrors.count |  The number of record processing errors. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| grokProcessingTimeouts.count |  The number of records that timed out while matching. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| grokProcessingTime.count |  A count of data points recorded while an individual record matched against patterns from the `match` configuration option. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| grokProcessingTime.sum |  The total amount of time that each individual record takes to match against patterns from the `match` configuration option, in milliseconds. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| grokProcessingTime.max |  The maximum amount of time that each individual record takes to match against patterns from the `match` configuration option, in milliseconds. **Relevant statistics**: Max **Dimension**: `PipelineName`  | 

## Otel trace raw metrics
<a name="oteltrace-raw-metrics"></a>

The following metrics apply to the [OTel trace raw](https://docs.opensearch.org/latest/data-prepper/pipelines/configuration/processors/otel-traces/) processor. Each metric is prefixed by the sub-pipeline name and `otel_trace_raw`. For example, `sub_pipeline_name.otel_trace_raw.traceGroupCacheCount.value`.


| Metric suffix | Description | 
| --- | --- | 
| traceGroupCacheCount.value |  The number of trace groups in the trace group cache. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| spanSetCount.value |  The number of span sets in the span set collection. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 

## Otel trace group metrics
<a name="oteltracegroup-metrics"></a>

The following metrics apply to the [OTel trace group](https://github.com/opensearch-project/data-prepper/tree/main/data-prepper-plugins/otel-trace-group-processor) processor. Each metric is prefixed by the sub-pipeline name and `otel_trace_group`. For example, `sub_pipeline_name.otel_trace_group.recordsInMissingTraceGroup.count`.


| Metric suffix | Description | 
| --- | --- | 
| recordsInMissingTraceGroup.count |  The number of ingress records missing trace group fields. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| recordsOutFixedTraceGroup.count |  The number of egress records with trace group fields that were filled successfully. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| recordsOutMissingTraceGroup.count |  The number of egress records missing trace group fields. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 

## Service map stateful metrics
<a name="servicemapstateful-metrics"></a>

The following metrics apply to the [Service-map stateful](https://docs.opensearch.org/latest/data-prepper/common-use-cases/trace-analytics/) processor. Each metric is prefixed by the sub-pipeline name and `service-map-stateful`. For example, `sub_pipeline_name.service-map-stateful.spansDbSize.value`.


| Metric suffix | Description | 
| --- | --- | 
| spansDbSize.value |  The in-memory byte sizes of spans in MapDB across the current and previous window durations. **Relevant statistics**: Average **Dimension**: `PipelineName`  | 
| traceGroupDbSize.value |  The in-memory byte sizes of trace groups in MapDB across the current and previous window durations. **Relevant statistics**: Average **Dimension**: `PipelineName`  | 
| spansDbCount.value |  The count of spans in MapDB across the current and previous window durations. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| traceGroupDbCount.value |  The count of trace groups in MapDB across the current and previous window durations. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| relationshipCount.value |  The count of relationships stored across the current and previous window durations. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 

## OpenSearch metrics
<a name="opensearch-metrics"></a>

The following metrics apply to the [OpenSearch](https://opensearch.org/docs/latest/data-prepper/pipelines/configuration/sinks/opensearch/) sink. Each metric is prefixed by the sub-pipeline name and `opensearch`. For example, `sub_pipeline_name.opensearch.bulkRequestErrors.count`.


| Metric suffix | Description | 
| --- | --- | 
| bulkRequestErrors.count |  The total number of errors encountered while sending bulk requests. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| documentsSuccess.count |  The number of documents successfully sent to the OpenSearch Service by bulk request, including retries. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| documentsSuccessFirstAttempt.count |  The number of documents successfully sent to OpenSearch Service by bulk request on the first attempt. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| documentErrors.count |  The number of documents that failed to be sent by bulk requests. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| bulkRequestFailed.count |  The number of bulk requests that failed. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| bulkRequestNumberOfRetries.count |  The number of retries of failed bulk requests. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| bulkBadRequestErrors.count |  The number of `Bad Request` errors encountered while sending bulk requests. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| bulkRequestNotAllowedErrors.count |  The number of `Request Not Allowed` errors encountered while sending bulk requests. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| bulkRequestInvalidInputErrors.count |  The number of `Invalid Input` errors encountered while sending bulk requests. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| bulkRequestNotFoundErrors.count |  The number of `Request Not Found` errors encountered while sending bulk requests. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| bulkRequestTimeoutErrors.count |  The number of `Request Timeout` errors encountered while sending bulk requests. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| bulkRequestServerErrors.count |  The number of `Server Error` errors encountered while sending bulk requests. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| bulkRequestSizeBytes.count |  The number of payload-size samples recorded for bulk requests. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| bulkRequestSizeBytes.sum |  The total payload size of bulk requests, in bytes. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| bulkRequestSizeBytes.max |  The maximum payload size of a single bulk request, in bytes. **Relevant statistics**: Max **Dimension**: `PipelineName`  | 
| bulkRequestLatency.count |  A count of data points recorded while requests are sent to the plugin, including retries. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| bulkRequestLatency.sum |  The total latency of requests sent to the plugin, including retries, in milliseconds. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| bulkRequestLatency.max |  The maximum latency of requests sent to the plugin, including retries, in milliseconds. **Relevant statistics**: Max **Dimension**: `PipelineName`  | 
| s3.dlqS3RecordsSuccess.count |  The number of records successfully sent to the S3 dead letter queue. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| s3.dlqS3RecordsFailed.count |  The number of records that failed to be sent to the S3 dead letter queue. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| s3.dlqS3RequestSuccess.count |  The number of successful requests to the S3 dead letter queue. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| s3.dlqS3RequestFailed.count |  The number of failed requests to the S3 dead letter queue. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| s3.dlqS3RequestLatency.count |  A count of data points recorded while requests are sent to the S3 dead letter queue, including retries. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| s3.dlqS3RequestLatency.sum |  The total latency of requests sent to the S3 dead letter queue, including retries, in milliseconds. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| s3.dlqS3RequestLatency.max |  The maximum latency of requests sent to the S3 dead letter queue, including retries, in milliseconds. **Relevant statistics**: Max **Dimension**: `PipelineName`  | 
| s3.dlqS3RequestSizeBytes.count |  The number of payload-size samples recorded for requests to the S3 dead letter queue. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| s3.dlqS3RequestSizeBytes.sum |  The total payload size of requests to the S3 dead letter queue, in bytes. **Relevant statistics**: Sum **Dimension**: `PipelineName`  | 
| s3.dlqS3RequestSizeBytes.max |  The maximum payload size of a single request to the S3 dead letter queue, in bytes. **Relevant statistics**: Max **Dimension**: `PipelineName`  | 

## System and metering metrics
<a name="systemmetering-metrics"></a>

The following metrics apply to the overall OpenSearch Ingestion system. These metrics aren't prefixed by anything.


| Metric | Description | 
| --- | --- | 
| system.cpu.usage.value |  The percentage of available CPU usage for all data nodes. **Relevant statistics**: Average **Dimension**: `PipelineName`, `area`, `id`  | 
| system.cpu.count.value |  The number of processors available to the pipeline. **Relevant statistics**: Average **Dimension**: `PipelineName`, `area`, `id`  | 
| jvm.memory.max.value |  The maximum amount of memory that can be used for memory management, in bytes. **Relevant statistics**: Average **Dimension**: `PipelineName`, `area`, `id`  | 
| jvm.memory.used.value |  The total amount of memory used, in bytes. **Relevant statistics**: Average **Dimension**: `PipelineName`, `area`, `id`  | 
| jvm.memory.committed.value |  The amount of memory that is committed for use by the Java virtual machine (JVM), in bytes. **Relevant statistics**: Average **Dimension**: `PipelineName`, `area`, `id`  | 
| computeUnits |  The number of Ingestion OpenSearch Compute Units (Ingestion OCUs) in use by a pipeline. **Relevant statistics**: Max, Sum, Average **Dimension**: `PipelineName`  | 
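
The `computeUnits` metric above is the basis for capacity monitoring. As a sketch, the following builds a `GetMetricStatistics` query for it with boto3, assuming the `AWS/OSIS` CloudWatch namespace and a placeholder pipeline name:

```python
from datetime import datetime, timedelta, timezone

def compute_units_query(pipeline_name, hours=1):
    """Parameters for get_metric_statistics over the last `hours` of the
    computeUnits metric, in 5-minute datapoints."""
    now = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/OSIS",    # assumed OpenSearch Ingestion namespace
        "MetricName": "computeUnits",
        "Dimensions": [{"Name": "PipelineName", "Value": pipeline_name}],
        "StartTime": now - timedelta(hours=hours),
        "EndTime": now,
        "Period": 300,              # 5-minute datapoints
        "Statistics": ["Average", "Maximum"],
    }

def fetch_compute_units(pipeline_name):
    """Run the query; requires credentials with cloudwatch:GetMetricStatistics."""
    import boto3
    resp = boto3.client("cloudwatch").get_metric_statistics(**compute_units_query(pipeline_name))
    return sorted(resp["Datapoints"], key=lambda p: p["Timestamp"])
```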

# Best practices for Amazon OpenSearch Ingestion
<a name="osis-best-practices"></a>

This topic provides best practices for creating and managing Amazon OpenSearch Ingestion pipelines, along with general guidelines that apply to many use cases. Each workload is unique, with its own characteristics, so no generic recommendation is exactly right for every use case.

**Topics**
+ [General best practices](#osis-best-practices-general)
+ [Recommended CloudWatch alarms](#osis-cloudwatch-alarms)

## General best practices
<a name="osis-best-practices-general"></a>

The following general best practices apply to creating and managing pipelines.
+ To ensure high availability, configure VPC pipelines with two or three subnets. If you deploy a pipeline in only one subnet and that Availability Zone goes down, you can't ingest data.
+ Within each pipeline, we recommend limiting the number of sub-pipelines to 5 or fewer.
+ If you're using the S3 source plugin, use evenly sized S3 files for optimal performance.
+ If you're using the S3 source plugin, add 30 seconds of additional visibility timeout for every 0.25 GB of file size in the S3 bucket for optimal performance.
+ Include a [dead-letter queue](https://opensearch.org/docs/latest/data-prepper/pipelines/dlq/) (DLQ) in your pipeline configuration so that you can offload failed events and make them accessible for analysis. If your sinks reject data due to incorrect mappings or other issues, you can route the data to the DLQ in order to troubleshoot and fix the issue.
+ If you're using the aggregate processor in a pipeline, we recommend setting `local_mode: true` for optimal pipeline performance.
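
The DLQ recommendation above can be sketched in the pipeline YAML. This example follows the Data Prepper S3 DLQ configuration; the domain endpoint, index name, bucket, prefix, Region, and role ARN are all placeholders:

```yaml
sink:
  - opensearch:
      hosts: ["https://search-my-domain.us-east-1.es.amazonaws.com"]
      index: "application-logs"
      # Events that the sink rejects are written here instead of being dropped.
      dlq:
        s3:
          bucket: "my-dlq-bucket"
          key_path_prefix: "dlq-files/"
          region: "us-east-1"
          sts_role_arn: "arn:aws:iam::123456789012:role/pipeline-role"
```

Failed events land in the bucket as JSON objects that you can inspect to troubleshoot mapping or permission issues.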

## Recommended CloudWatch alarms
<a name="osis-cloudwatch-alarms"></a>

CloudWatch alarms perform an action when a CloudWatch metric exceeds a specified value for some amount of time. For example, you might want AWS to email you if your cluster health status is `red` for longer than one minute. This section includes some recommended alarms for Amazon OpenSearch Ingestion and how to respond to them.

For more information about configuring alarms, see [Creating Amazon CloudWatch Alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html) in the *Amazon CloudWatch User Guide*.
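
As an illustration, the first recommended alarm in the table that follows (`computeUnits` at the configured maximum for three 15-minute periods) can be created with the AWS SDK. This is a sketch, assuming the `AWS/OSIS` CloudWatch namespace; the pipeline name, `maxUnits` value, and SNS topic ARN are placeholders:

```python
def compute_units_alarm_params(pipeline_name, max_units, sns_topic_arn):
    """Parameters for a CloudWatch alarm that fires when the computeUnits
    Maximum stays at the pipeline's configured maxUnits for 3 consecutive
    15-minute periods."""
    return {
        "AlarmName": f"{pipeline_name}-compute-units-at-max",
        "AlarmDescription": "Pipeline at maximum capacity; consider raising maxUnits.",
        "Namespace": "AWS/OSIS",          # assumed OpenSearch Ingestion namespace
        "MetricName": "computeUnits",
        "Dimensions": [{"Name": "PipelineName", "Value": pipeline_name}],
        "Statistic": "Maximum",
        "Period": 900,                    # 15 minutes
        "EvaluationPeriods": 3,           # 3 consecutive periods
        "Threshold": float(max_units),
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],  # e.g. an SNS topic that emails you
    }

def create_compute_units_alarm(pipeline_name, max_units, sns_topic_arn):
    """Create the alarm; requires credentials with cloudwatch:PutMetricAlarm."""
    import boto3
    boto3.client("cloudwatch").put_metric_alarm(
        **compute_units_alarm_params(pipeline_name, max_units, sns_topic_arn)
    )
```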


| Alarm | Issue | 
| --- | --- | 
|  `computeUnits` maximum is = the configured `maxUnits` for 15 minutes, 3 consecutive times  | The pipeline has reached maximum capacity and might require a `maxUnits` update. Increase the maximum capacity of your pipeline. | 
|  `opensearch.documentErrors.count` sum is = `{sub_pipeline_name}.opensearch.recordsIn.count` sum for 1 minute, 1 consecutive time  | The pipeline is unable to write to the OpenSearch sink. Check the pipeline permissions and confirm that the domain or collection is healthy. You can also check the dead letter queue (DLQ) for failed events, if it's configured. | 
|  `bulkRequestLatency.max` max is >= *x* for 1 minute, 1 consecutive time  | The pipeline is experiencing high latency sending data to the OpenSearch sink. This is likely because the sink is undersized or uses a poor sharding strategy, causing it to fall behind. Sustained high latency can impact pipeline performance and will likely lead to backpressure on the clients. | 
|  `httpAuthFailure.count` sum >= 1 for 1 minute, 1 consecutive time  | Ingestion requests are not being authenticated. Confirm that all clients have Signature Version 4 authentication enabled correctly. | 
|  `system.cpu.usage.value` average >= 80% for 15 minutes, 3 consecutive times  | Sustained high CPU usage can be problematic. Consider increasing the maximum capacity for the pipeline. | 
|  `bufferUsage.value` average >= 80% for 15 minutes, 3 consecutive times  | Sustained high buffer usage can be problematic. Consider increasing the maximum capacity for the pipeline. | 

### Other alarms you might consider
<a name="osis-cw-alarms-additional"></a>

Consider configuring the following alarms depending on which Amazon OpenSearch Ingestion features you regularly use. 


| Alarm | Issue | 
| --- | --- | 
|  `dynamodb.exportJobFailure.count` sum >= 1  | The attempt to trigger an export to Amazon S3 failed. | 
|  `opensearch.EndtoEndLatency.avg` average > X for 15 minutes, 4 consecutive times  | The EndtoEndLatency is higher than desired for reading from DynamoDB streams. This could be caused by an underscaled OpenSearch cluster or a maximum pipeline OCU capacity that is too low for the WCU throughput on the DynamoDB table. EndtoEndLatency will be higher after an export but should decrease over time as it catches up to the latest DynamoDB streams. | 
|  `dynamodb.changeEventsProcessed.count` sum == 0 for X minutes  | No records are being gathered from DynamoDB streams. This could be caused by no activity on the table or by an issue accessing DynamoDB streams. | 
|  `opensearch.s3.dlqS3RecordsSuccess.count` sum >= `opensearch.documentSuccess.count` sum for 1 minute, 1 consecutive time  | More records are being sent to the DLQ than to the OpenSearch sink. Review the OpenSearch sink plugin metrics to investigate and determine the root cause. | 
|  `grok.grokProcessingTimeouts.count` sum = recordsIn.count sum for 1 minute, 5 consecutive times  | All data is timing out while the Grok processor is trying to pattern match. This is likely impacting performance and slowing your pipeline down. Consider adjusting your patterns to reduce timeouts.  | 
|  `grok.grokProcessingErrors.count` sum is >= 1 for 1 minute, 1 consecutive time  | The Grok processor is failing to match patterns to the data in the pipeline, resulting in errors. Review your data and Grok plugin configurations to ensure the pattern matching is expected. | 
|  `grok.grokProcessingMismatch.count` sum = recordsIn.count sum for 1 minute, 5 consecutive times  | The Grok processor is unable to match patterns to the data in the pipeline. Review your data and Grok plugin configurations to ensure the pattern matching is expected. | 
|  `date.dateProcessingMatchFailure.count` sum = recordsIn.count sum for 1 minute, 5 consecutive times  | The Date processor is unable to match any patterns to the data in the pipeline. Review your data and Date plugin configurations to ensure the pattern is expected. | 
|  `s3.s3ObjectsFailed.count` sum >= 1 for 1 minute, 1 consecutive time  | This issue is occurring either because the S3 object doesn't exist or because the pipeline has insufficient privileges. Review the `s3ObjectsNotFound.count` and `s3ObjectsAccessDenied.count` metrics to determine the root cause. Confirm that the S3 object exists, or update the permissions. | 
|  `s3.sqsMessagesFailed.count` sum >= 1 for 1 minute, 1 consecutive time  | The S3 plugin failed to process an Amazon SQS message. If you have a DLQ enabled on your SQS queue, review the failed message. The queue might be receiving invalid data that the pipeline is attempting to process. | 
|  `http.badRequests.count` sum >= 1 for 1 minute, 1 consecutive time  | The client is sending a bad request. Confirm that all clients are sending the proper payload. | 
|  `http.requestsTooLarge.count` sum >= 1 for 1 minute, 1 consecutive time  | Requests from the HTTP source plugin contain too much data, which is exceeding the buffer capacity. Adjust the batch size for your clients. | 
|  `http.internalServerError.count` sum >= 1 for 1 minute, 1 consecutive time  | The HTTP source plugin is having trouble receiving events. | 
|  `http.requestTimeouts.count` sum >= 1 for 1 minute, 1 consecutive time  | Source timeouts are likely the result of the pipeline being underprovisioned. Consider increasing the pipeline `maxUnits` to handle additional workload. | 
|  `otel_trace.badRequests.count` sum >= 1 for 1 minute, 1 consecutive time  | The client is sending a bad request. Confirm that all clients are sending the proper payload. | 
|  `otel_trace.requestsTooLarge.count` sum >= 1 for 1 minute, 1 consecutive time  | Requests from the Otel Trace source plugin contain too much data, which is exceeding the buffer capacity. Adjust the batch size for your clients. | 
|  `otel_trace.internalServerError.count` sum >= 1 for 1 minute, 1 consecutive time  | The Otel Trace source plugin is having trouble receiving events. | 
|  `otel_trace.requestTimeouts.count` sum >= 1 for 1 minute, 1 consecutive time  | Source timeouts are likely the result of the pipeline being underprovisioned. Consider increasing the pipeline `maxUnits` to handle additional workload. | 
|  `otel_metrics.requestTimeouts.count` sum >= 1 for 1 minute, 1 consecutive time  | Source timeouts are likely the result of the pipeline being underprovisioned. Consider increasing the pipeline `maxUnits` to handle additional workload. | 