

# Directly querying Amazon S3 data in OpenSearch Service
<a name="direct-query-s3-overview"></a>

This section will walk you through the process of creating and configuring a data source integration in Amazon OpenSearch Service, enabling you to efficiently query and analyze your data stored in Amazon S3.

In the following pages, you'll learn how to set up an Amazon S3 direct-query data source, navigate the necessary prerequisites, and follow step-by-step procedures using both the AWS Management Console and the OpenSearch Service API. It also covers important next steps, including mapping AWS Glue Data Catalog roles and configuring access controls in OpenSearch Dashboards.

**Topics**
+ [Creating an Amazon S3 data source integration in OpenSearch Service](direct-query-s3-creating.md)
+ [Configuring and querying an S3 data source in OpenSearch Dashboards](direct-query-s3-configure.md)
+ [Pricing](#direct-query-s3-pricing)
+ [Limitations](#direct-query-s3-limitations)
+ [Recommendations](#direct-query-s3-recommendations)
+ [Quotas](#direct-query-s3-quotas)
+ [Supported AWS Regions](#direct-query-s3-regions)

# Creating an Amazon S3 data source integration in OpenSearch Service
<a name="direct-query-s3-creating"></a>

You can create a new Amazon S3 direct-query data source for OpenSearch Service through the AWS Management Console or the API. Each new data source uses the AWS Glue Data Catalog to manage tables that represent Amazon S3 buckets. 

**Topics**
+ [Prerequisites](#direct-query-s3-prereq)
+ [Procedure](#direct-query-s3-create)
+ [Next steps](#direct-query-s3-next-steps)
+ [Map the AWS Glue Data Catalog role](#direct-query-s3-permissions)
+ [Additional resources](#direct-query-s3-additional-resources)

## Prerequisites
<a name="direct-query-s3-prereq"></a>

Before you get started, make sure that you have reviewed the following documentation:
+ [Limitations](direct-query-s3-overview.md#direct-query-s3-limitations)
+ [Recommendations](direct-query-s3-overview.md#direct-query-s3-recommendations)
+ [Quotas](direct-query-s3-overview.md#direct-query-s3-quotas)

Before you can create a data source, you must have the following resources in your AWS account:
+ **An OpenSearch domain with version 2.13 or later.** This is the foundation for setting up the direct query integration. For instructions on setting this up, see [Creating OpenSearch Service domains](createupdatedomains.md#createdomains).
+ **One or more S3 buckets.** You’ll need to specify the buckets containing data that you want to query, and a bucket to store your query checkpoints in. For instructions on creating an S3 bucket, see [Creating a bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html) in the Amazon S3 user guide.
+ **(Optional) One or more AWS Glue tables. **Querying data on Amazon S3 requires that you have tables setup in AWS Glue Data Catalog to point to the S3 data. You must create the tables using OpenSearch Query Workbench. Existing Hive tables are not compatible. 

  If this is the first time you're setting up an Amazon S3 data source, you must create an admin data source to configure all of your AWS Glue Data Catalog tables. You can do this by installing OpenSearch out-of-the-box integrations or by using OpenSearch Query Workbench to create custom SQL tables for advanced use cases. For examples on creating tables for VPC, CloudTrail, and AWS WAF logs, see the documentation on GitHub for [VPC](https://github.com/opensearch-project/opensearch-catalog/blob/main/integrations/observability/amazon_vpc_flow/assets/create_table_vpc_schema-1.0.0.sql), [CloudTrail](https://github.com/opensearch-project/opensearch-catalog/blob/main/integrations/observability/aws_cloudtrail/assets/create_table_cloud-trail-records-1.0.0.sql), and [AWS WAF](https://github.com/opensearch-project/opensearch-catalog/blob/main/integrations/observability/aws_waf/assets/create_table-1.0.0.sql). After you create your tables, you can create new Amazon S3 data sources and restrict access to limited tables.
+ **(Optional) A manually created IAM role. **You can use this role to manage access to your data source. Alternatively, you can have OpenSearch Service create a role for you automatically with the required permissions. If you choose to use a manually created IAM role, follow the guidance in [Required permissions for manually created IAM roles](#direct-query-s3-additional-resources-required-permissions).

## Procedure
<a name="direct-query-s3-create"></a>

You can set up a direct-query data source on a domain with the AWS Management Console or the OpenSearch Service API.

### To set up a data source using the AWS Management Console
<a name="creating-direct-query-s3-console-create"></a>

1. Navigate to the Amazon OpenSearch Service console at [https://console.aws.amazon.com/aos/](https://console.aws.amazon.com/aos/).

1. In the left navigation pane, choose **Domains**. 

1. Select the domain that you want to set up a new data source for. This opens the domain details page. 

1. Choose the **Connections** tab below the general domain details and find the **Direct query** section.

1. Choose **Configure data source**.

1. Enter a name and an optional description for your new data source. 

1. Choose **Amazon S3 with AWS Glue Data Catalog**. 

1. Under **IAM permission access settings**, choose how to manage access.

   1. If you want to automatically create a role for this data source, follow these steps:

      1. Select **Create a new role**.

      1. Enter a name for the IAM role.

      1. Select one or more S3 buckets that contain data you want to query.

      1. Select an checkpoint S3 bucket to store query checkpoints in.

      1. Select one or more AWS Glue databases or tables to define which data can be queried. If tables haven't been created yet, provide access to the default database.

   1. If you want to use an existing role that you manage yourself, follow these steps:

      1. Select **Use an existing role.**

      1. Select an existing role from the drop-down menu.
**Note**  
When using your own role, you must ensure it has all necessary permissions by attaching required policies from the IAM console. For more information, refer to the sample policy in [Required permissions for manually created IAM roles](#direct-query-s3-additional-resources-required-permissions).

1. Choose **Configure**. This opens the data source details screen with an OpenSearch Dashboards URL. You can navigate to this URL to complete the next steps.

### OpenSearch Service API
<a name="creating-direct-query-s3-api-create"></a>

Use the [AddDataSource](https://docs.aws.amazon.com/opensearch-service/latest/APIReference/API_AddDataSource.html) API operation to create a new data source in your domain.

```
POST https://es.region.amazonaws.com/2021-01-01/opensearch/domain/domain-name/dataSource

{
   "DataSourceType": {
        "S3GlueDataCatalog": {
            "RoleArn": "arn:aws:iam::account-id:role/role-name"
        }
    }
   "Description": "data-source-description",
   "Name": "my-data-source"
}
```

## Next steps
<a name="direct-query-s3-next-steps"></a>

### Visit OpenSearch Dashboards
<a name="direct-query-s3-next-steps-dashboard"></a>

After you create a data source, OpenSearch Service provides you with an OpenSearch Dashboards link. You can use this to configure access control, define tables, install out-of-the-box integrations, and query your data.

For more information, see [Configuring and querying an S3 data source in OpenSearch Dashboards](direct-query-s3-configure.md).

## Map the AWS Glue Data Catalog role
<a name="direct-query-s3-permissions"></a>

If you have enabled [fine-grained access control](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/fgac.html) after creating a data source, you must map non-admin users to an IAM role with AWS Glue Data Catalog access in order to run direct queries. To manually create a back-end `glue_access` role that you can map to the IAM role, perform the following steps:

**Note**  
Indexes are used for any queries against the data source. A user with read access to the request index for a given data source can read *all* queries against that data source. A user with read access to the result index can read results for *all* queries against that data source.

1. From the main menu in OpenSearch Dashboards, choose **Security**, **Roles**, and **Create roles**.

1. Name the role **glue\$1access**.

1. For **Cluster permissions**, select `indices:data/write/bulk*`, `indices:data/read/scroll`, `indices:data/read/scroll/clear`.

1. For **Index**, enter the following indexes you want to grant the user with the role access to:
   + `.query_execution_request_<name of data source>`
   + `query_execution_result_<name of data source>`
   + `.async-query-scheduler`
   + `flint_*`

1. For **Index permissions**, select `indices_all`. 

1. Choose **Create**.

1. Choose **Mapped users**, **Manage mapping**. 

1. Under **Backend roles**, add the ARN of the AWS Glue role that needs permission to call your domain.

   ```
   arn:aws:iam::account-id:role/role-name
   ```

1. Select **Map** and confirm the role shows up under **Mapped users**.

For more information on mapping roles, see [Mapping roles to users](fgac.md#fgac-mapping).

## Additional resources
<a name="direct-query-s3-additional-resources"></a>

### Required permissions for manually created IAM roles
<a name="direct-query-s3-additional-resources-required-permissions"></a>

 When creating a data source for your domain, you choose an IAM role to manage access to your data. You have two options:

1. Create a new IAM role automatically

1. Use an existing IAM role that you created manually

If you use a manually created role, you need to attach the correct permissions to the role. The permissions must allow access to the specific data source, and allow OpenSearch Service to assume the role. This is required so that the OpenSearch Service can securely access and interact with your data. 

The following sample policy demonstrates the least-privilege permissions required to create and manage a data source. If you have broader permissions, such as `s3:*` or the `AdminstratorAccess` policy, these permissions encompasses the least-privilege permissions in the sample policy.

In the following sample policy, replace the *placeholder text *with your own information.

------
#### [ JSON ]

****  

```
{
   "Version":"2012-10-17",		 	 	 
   "Statement":[
      {
         "Sid":"HttpActionsForOpenSearchDomain",
         "Effect":"Allow",
         "Action":"es:ESHttp*",
"Resource":"arn:aws:es:us-east-1:111122223333:domain/example.com/*"
      },
      {
         "Sid":"AmazonOpenSearchS3GlueDirectQueryReadAllS3Buckets",
         "Effect":"Allow",
         "Action":[
            "s3:GetObject",
            "s3:GetObjectVersion",
            "s3:ListBucket"
         ],
         "Condition":{
            "StringEquals":{
               "aws:ResourceAccount":"111122223333"
            }
         },
         "Resource":"*"
      },
      {
         "Sid":"AmazonOpenSearchDirectQueryGlueCreateAccess",
         "Effect":"Allow",
         "Action":[
            "glue:CreateDatabase",
            "glue:CreatePartition",
            "glue:CreateTable",
            "glue:BatchCreatePartition"
         ],
         "Resource":"*"
      },
      {
         "Sid":"AmazonOpenSearchS3GlueDirectQueryModifyAllGlueResources",
         "Effect":"Allow",
         "Action":[
            "glue:DeleteDatabase",
            "glue:DeletePartition",
            "glue:DeleteTable",
            "glue:GetDatabase",
            "glue:GetDatabases",
            "glue:GetPartition",
            "glue:GetPartitions",
            "glue:GetTable",
            "glue:GetTableVersions",
            "glue:GetTables",
            "glue:UpdateDatabase",
            "glue:UpdatePartition",
            "glue:UpdateTable",
            "glue:BatchGetPartition",
            "glue:BatchDeletePartition",
            "glue:BatchDeleteTable"
         ],
         "Resource":[
            "arn:aws:glue:us-east-1:111122223333:table/*",
            "arn:aws:glue:us-east-1:111122223333:database/*",
            "arn:aws:glue:us-east-1:111122223333:catalog",
            "arn:aws:es:us-east-1:111122223333:domain/domain_name"
         ],
         "Condition":{
            "StringEquals":{
               "aws:ResourceAccount":"111122223333"
            }
         }
      },
      {
         "Sid":"ReadAndWriteActionsForS3CheckpointBucket",
         "Effect":"Allow",
         "Action":[
            "s3:ListMultipartUploadParts",
            "s3:DeleteObject",
            "s3:GetObject",
            "s3:PutObject",
            "s3:GetBucketLocation",
            "s3:ListBucket"
         ],
         "Condition":{
            "StringEquals":{
               "aws:ResourceAccount":"111122223333"
            }
         },
         "Resource":[
            "arn:aws:s3:::amzn-s3-demo-bucket",
            "arn:aws:s3:::amzn-s3-demo-bucket/*"
         ]
      }
   ]
}
```

------

To support Amazon S3 buckets in different accounts, you will need to include a condition to the Amazon S3 policy and add the appropriate account. 

In the following sample condition, replace the *placeholder text *with your own information.

```
"Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "{{accountId}}"
                }
```

The role must also have the following trust policy, which specifies the target ID.

------
#### [ JSON ]

****  

```
{
    "Version":"2012-10-17",		 	 	 
    "Statement":[
       {
          "Effect":"Allow",
          "Principal":{
             "Service": "directquery.opensearchservice.amazonaws.com"
          },
          "Action":"sts:AssumeRole"
       }
     ]
}
```

------

For instructions to create the role, see [Creating a role using custom trust policies](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-custom.html).

If you have fine-grained access control enabled in OpenSearch Service, a new OpenSearch fine-grained access control role will automatically be created for your data source. The name of the new fine-grained access control role will be `AWSOpenSearchDirectQuery <name of data source>`.

By default, the role has access to direct query data source indexes only. Although you can configure the role to limit or grant access to your data source, it is recommended you not adjust the access of this role. **If you delete the data source, this role will be deleted**. This will remove access for any other users if they are mapped to the role.

# Configuring and querying an S3 data source in OpenSearch Dashboards
<a name="direct-query-s3-configure"></a>

Now that you've created your data source, you can configure security settings, define your Amazon S3 tables, or set up accelerated data indexing. This section walks you through various use cases with your data source in OpenSearch Dashboards before you query your data.

To configure the following sections, you must first navigate to your data source in OpenSearch Dashboards. In the left-hand navigation, under **Management**, choose **Data sources**. Under **Manage data sources**, select the name of the data source that you created in the console. 

## Create Spark Tables using Query Workbench
<a name="direct-query-s3-configure-tables"></a>

Direct queries from OpenSearch Service to Amazon S3 use Spark tables within the AWS Glue Data Catalog. You can create tables from within the Query Workbench without having to leave OpenSearch Dashboards. 

To manage existing databases and tables in your data source, or to create new tables that you want to use direct queries on, choose **Query Workbench** from the left navigation and select the Amazon S3 data source from the data source drop down.

To set up a table for VPC Flow logs stored in S3 in Parquet format, run the following query: 

```
CREATE TABLE 
datasourcename.gluedatabasename.vpclogstable (version INT, account_id STRING, interface_id STRING, 
srcaddr STRING, dstaddr STRING, srcport INT, dstport INT, protocol INT, packets BIGINT, 
bytes BIGINT, start BIGINT, end BIGINT, action STRING, log_status STRING, 
`aws-account-id` STRING, `aws-service` STRING, `aws-region` STRING, year STRING, 
month STRING, day STRING, hour STRING) 

USING parquet PARTITIONED BY (aws-account-id, aws-service, aws-region, year, month, 
day, hour) 

LOCATION "s3://accountnum-vpcflow/AWSLogs"
```

After creating the table, run the following query to ensure that it's compatible with direct queries:

```
MSCK REPAIR TABLE  datasourcename.databasename.vpclogstable
```

## Setup integrations for popular AWS log types
<a name="direct-query-s3-setup-integration"></a>

You can integrate AWS log types stored in Amazon S3 with OpenSearch Service. Use OpenSearch Dashboards to install integrations that create AWS Glue Data Catalog tables, saved queries, and dashboards. These integrations use indexed views to keep dashboards updated.

For instructions to install an integration, see [Installing an integration asset](https://opensearch.org/docs/latest/integrations/#installing-an-integration-asset) in the OpenSearch documentation.

When you select an integration, make sure it has the `S3 Glue` tag. 

When you set up the integration, specify **S3 Connection** for the connection type. Then, select the data source for the integration, the Amazon S3 location of the data, the checkpoint to manage acceleration indexing, and the assets required for your use case.

**Note**  
Make sure the S3 bucket for your checkpoint has write permissions for the checkpoint location. Without these permissions, the integration's accelerations will fail.

## Set up access control
<a name="direct-query-s3-configure-ac"></a>

On the details page for your data source, find the **Access controls** section and choose **Edit**. If the domain has fine-grained access control enabled, choose **Restricted** and select which roles you want to provide with access to the new data source. You can also choose **Admin only** if you only want the administrator to have access to the data source.

**Important**  
Indexes are used for any queries against the data source. A user with read access to the request index for a given data source can read *all* queries against that data source. A user with read access to the result index can read results for *all* queries against that data source.

## Querying S3 data in OpenSearch Discover
<a name="direct-querying-s3-query"></a>

After you set up your tables and configure your desired optional query acceleration, you can start analyzing your data. To query your data, select your data source from the drop-down menu. If you're using Amazon S3 and OpenSearch Dashboards, go to Discover and select the data source name. 

If you're using a skipping index or haven't created an index, you can use SQL or PPL to query your data. If you've configured a materialized view or a covering index, you already have an index and can use Dashboards Query Language (DQL) throughout Dashboards. You can also use PPL with the Observability plugin, and SQL with the Query Workbench plugin. Currently, only the Observability and Query Workbench plugins support PPL and SQL. For querying data using the OpenSearch Service API, refer to the [async API documentation](https://github.com/opensearch-project/sql/blob/main/docs/user/interfaces/asyncqueryinterface.rst).

**Note**  
Not all SQL and PPL statements, commands and functions are supported. For a list of supported commands, see [Supported SQL and PPL commands](direct-query-supported-commands.md).  
If you’ve created a materialized view or covering index, you can use DQL to query your data given that you’ve indexed it within.

## Troubleshooting
<a name="s3-troubleshooting"></a>

There might be instances when results don’t return as expected. If you experience any issues, make sure that you're following the [Recommendations](direct-query-s3-overview.md#direct-query-s3-recommendations).

## Pricing
<a name="direct-query-s3-pricing"></a>

Amazon OpenSearch Service offers OpenSearch Compute Unit (OCU) pricing for Amazon S3 direct queries. As you run direct queries, you incur charges for OCUs per hour, listed as DirectQuery OCU usage type on your bill. You will also incur separate charges from Amazon S3 for data storage.

Direct queries are of two types—interactive and indexed view queries.
+ *Interactive queries* are used to populate the data selector and perform analytics on your data in Amazon S3. When you run a new query from Discover, OpenSearch Service starts a new session that lasts for a minimum of three minutes. OpenSearch Service keeps this session active to ensure that subsequent queries run quickly.
+ *Indexed view queries* use compute to maintain indexed views in OpenSearch Service. These queries usually take longer because they ingest a varying amount of data into a named index. For Amazon S3 data sources, the indexed data is stored in a domain based on an instance type purchased.

For more information, see the Direct Query and Serverless sections within [Amazon OpenSearch Service Pricing](https://aws.amazon.com/opensearch-service/pricing/).

## Limitations
<a name="direct-query-s3-limitations"></a>

The following limitations apply to direct queries in Amazon S3:
+ Direct query for S3 is only available on OpenSearch Service domains running OpenSearch version 2.13 or later, and requires access to AWS Glue Data Catalog. Existing AWS Glue Data Catalog tables must be recreated using SQL in OpenSearch Query Workbench.
+ Direct query for S3 requires you to specify a checkpoint bucket on Amazon S3. This bucket maintains the state of your indexed views, including the last refresh time and the most recently ingested data.
+ Your OpenSearch domain and AWS Glue Data Catalog must be in the same AWS account. Your S3 bucket can be in a different account (requires condition to be added to your IAM policy), but must be in the same AWS Region as your domain.
+ OpenSearch Service direct queries with S3 only support Spark tables generated from Query Workbench. Tables generated within AWS Glue Data Catalog or Athena are not supported by Spark streaming, which is needed to maintain indexed views.
+ OpenSearch instance types have networked payload limitations of either 10 MiB or 100 MiB, depending on the specific instance type you choose. 
+ Some data types aren't supported. Supported data types are limited to Parquet, CSV, and JSON. 
+ If the structure of your data changes over time, you will need to update your indexed views or out-of-the-box integrations to account for the data structure changes. 
+ AWS CloudFormation templates aren't supported yet.
+ OpenSearch SQL and OpenSearch PPL statements have different limitations when working with OpenSearch indexes compared to using direct query. Direct query supports advanced commands such as JOINs, subqueries, and lookups, while support for these commands on OpenSearch indexes is limited or nonexistent. For more information, see [Supported SQL and PPL commands](direct-query-supported-commands.md).

## Recommendations
<a name="direct-query-s3-recommendations"></a>

We recommend the following when using direct queries in Amazon S3:
+ Ingest data into Amazon S3 using partition formats of year, month, day, hour to speed up queries.
+ When you build skipping indexes, use Bloom filters for fields with high cardinality and min/max indexes for fields with large value ranges. For high-cardinality fields, consider using a value-based approach to improve query efficiency.
+ Use Index State Management to maintain storage for materialized views and covering indexes.
+ Use the `COALESCE SQL` function to handle missing columns and ensure results are returned.
+ Use limits on your queries to make sure you aren't pulling too much data back.

## Quotas
<a name="direct-query-s3-quotas"></a>

Each time you initiate a query to an Amazon S3 data source, OpenSearch Service opens a *session* and keeps it alive for at least three minutes. This reduces query latency by removing session start-up time in subsequent queries.


| Description | Maximum | Can override | 
| --- | --- | --- | 
| Connections per domain | 10 | Yes | 
| Data sources per domain | 20 | Yes | 
| Indexes per domain | 5 | Yes | 
| Concurrent sessions per data source | 10 | Yes | 
| Maximum OCU per query | 60 | Yes | 
| Maximum query execution time (minutes) | 30 | Yes | 
| Maximum OCUs per acceleration | 20 | Yes | 
| Maximum ephemeral storage | 20 | Yes | 

## Supported AWS Regions
<a name="direct-query-s3-regions"></a>

The following AWS Regions are supported for direct queries in Amazon S3:
+ Asia Pacific (Hong Kong)
+ Asia Pacific (Mumbai)
+ Asia Pacific (Seoul) 
+ Asia Pacific (Singapore)
+ Asia Pacific (Sydney)
+ Asia Pacific (Tokyo)
+ Canada (Central)
+ Europe (Frankfurt)
+  Europe (Ireland)
+ Europe (Stockholm)
+ US East (N. Virginia)
+ US East (Ohio)
+ US West (Oregon)