

# neptune.read()
<a name="access-graph-opencypher-21-extensions-s3-read"></a>

 Neptune supports a `CALL` procedure, `neptune.read`, that reads data from Amazon S3 and then runs an openCypher query (read, insert, or update) using that data. The procedure yields each row in the file as a declared result variable, `row`. It uses the IAM credentials of the caller to access the data in Amazon S3. See [Managing permissions for neptune.read()](access-graph-opencypher-21-extensions-s3-read-permissions.md) to set up the permissions. The Amazon S3 bucket must be in the same AWS Region as the instance; cross-Region reads are currently not supported. 

 **Syntax** 

```
CALL neptune.read(
  {
    source: "string",
    format: "parquet/csv",
    concurrency: 10
  }
)
YIELD row
...
```

**Inputs**
+  **source** (required) - Amazon S3 URI to a **single** object. An Amazon S3 prefix pointing to multiple objects is not supported. 
+  **format** (required) - `parquet` and `csv` are supported. 
  +  More details on the supported Parquet format can be found in [Supported Parquet column types](access-graph-opencypher-21-extensions-s3-read-parquet.md#access-graph-opencypher-21-extensions-s3-read-parquet-column-types). 
  +  For more information on the supported csv format, see [Gremlin load data format](bulk-load-tutorial-format-gremlin.md). 
+  **concurrency** (optional) - Type: integer, 0 or greater. Default: 0. Specifies the number of threads used to read the file. A value of 0 uses the maximum number of threads the instance allows. For Parquet files, setting this to the number of row groups is recommended. 

**Outputs**

 `neptune.read()` returns: 
+  **row** - Type: Map 
  +  Each row in the file, where the keys are the column names and the values are the data found in each column. 
  +  You can access each column's data using property access syntax (`row.col`). 

## Best practices for neptune.read()
<a name="access-graph-opencypher-21-extensions-s3-read-best-practices"></a>

Neptune S3 read operations can be memory-intensive. Use instance types that are well suited for production workloads, as outlined in [Choosing instance types for Amazon Neptune](instance-types.md).

Memory usage and performance of `neptune.read()` requests are affected by a variety of factors like file size, number of columns, number of rows, and file format. Depending on structure, small files (e.g., CSV files 100MB or under, Parquet files 20MB or under) may work reliably on most production-suited instance types, whereas larger files may require substantial memory that smaller instance types cannot provide.

When testing this feature, it is recommended to start with small files and scale gradually to ensure your read workload can be accommodated by your instance size. If you notice `neptune.read()` requests leading to out-of-memory exceptions or instance restarts, consider splitting your files into smaller chunks, reducing file complexity, or upgrading to larger instance types.
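If you do split files into smaller chunks, the splitting itself happens outside Neptune. The following Python sketch (standard library only; file and directory names are hypothetical) splits a large CSV into fixed-size chunks while repeating the header row in each chunk:

```python
import csv
import os

def split_csv(path, rows_per_chunk, out_dir):
    """Split a CSV file into chunks, repeating the header in each chunk."""
    os.makedirs(out_dir, exist_ok=True)
    chunk_paths = []
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        chunk, index = [], 0
        for row in reader:
            chunk.append(row)
            if len(chunk) == rows_per_chunk:
                chunk_paths.append(_write_chunk(out_dir, index, header, chunk))
                chunk, index = [], index + 1
        if chunk:  # write any leftover rows as a final, smaller chunk
            chunk_paths.append(_write_chunk(out_dir, index, header, chunk))
    return chunk_paths

def _write_chunk(out_dir, index, header, rows):
    path = os.path.join(out_dir, f"chunk-{index:04d}.csv")
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return path
```

Each chunk can then be uploaded to Amazon S3 and read with its own `neptune.read()` call.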

# Query examples using Parquet
<a name="access-graph-opencypher-21-extensions-s3-read-parquet"></a>

The following example query returns the number of rows in a given Parquet file:

```
CALL neptune.read(
  {
    source: "<s3 path>",
    format: "parquet"
  }
)
YIELD row
RETURN count(row)
```

You can run the query example using the `execute-open-cypher-query` operation in the AWS CLI by executing the following code:

```
aws neptunedata execute-open-cypher-query \
--open-cypher-query "CALL neptune.read({source: '<s3 path>', format: 'parquet'}) YIELD row RETURN count(row)" \
--endpoint-url https://my-cluster-name.cluster-abcdefgh1234.us-east-1.neptune.amazonaws.com:8182
```

A query can be flexible in what it does with rows read from a Parquet file. For example, the following query creates a node with a field being set to data found in the Parquet file:

```
CALL neptune.read(
  {
    source: "<s3 path>",
    format: "parquet"
  }
)
YIELD row
CREATE (n {someField: row.someCol}) 
RETURN n
```

**Warning**  
It is not considered good practice to use a clause that produces a large result set, such as `MATCH (n)`, before a `CALL` clause. This leads to a long-running query because of the cross product between the incoming solutions from prior clauses and the rows read by `neptune.read()`. It is recommended to start the query with `CALL neptune.read()`.

## Supported Parquet column types
<a name="access-graph-opencypher-21-extensions-s3-read-parquet-column-types"></a>

**Parquet Data Types:**
+ NULL
+ BOOLEAN
+ FLOAT
+ DOUBLE
+ STRING
+ INTEGER (signed and unsigned): INT8, INT16, INT32, INT64, UINT8, UINT16, UINT32, UINT64
+ MAP: Only one level is supported; nested maps are not.
+ LIST: Only one level is supported; nested lists are not.

**Neptune-specific data types:**

Unlike the property column headers of the CSV format, the property column headers of the Parquet format need only the property names; there is no need to include type names or cardinality.

There are, however, some special column types in the Parquet format that require annotation in the metadata: the Any, Date, dateTime, and Geometry types. The following object is an example of the required metadata annotation for files containing columns of these special types:

```
"metadata": {
    "anyTypeColumns": ["UserCol1"],
    "dateTypeColumns": ["UserCol2"],
    "dateTimeTypeColumns": ["UserCol3"],
    "geometryTypeColumns": ["UserCol4"]
}
```

Below are details on the expected payload associated with these types:
+ An Any column type is supported in the user columns. The Any type is syntactic sugar for all of the other supported types, and is useful when a single user column contains values of multiple types. The payload of an Any value is a semicolon-separated list of JSON strings, for example `{"value": "10", "type": "Int"};{"value": "1.0", "type": "Float"}`, where each JSON string has a `value` field and a `type` field. The cardinality of an Any column is set, meaning that the column can accept multiple values. 
  + Neptune supports the following types in an Any type: Bool (or Boolean), Byte, Short, Int, Long, UnsignedByte, UnsignedShort, UnsignedInt, UnsignedLong, Float, Double, Date, dateTime, String, and Geometry.
  + The Vector type is not supported in an Any type.
  + Nested Any types are not supported, for example `{"value": {"value": "10", "type": "Int"}, "type": "Any"}`.
+ Date and dateTime column types are supported in the user columns. The payload of these columns must be provided as strings following the XSD format or one of the following formats: 
  + yyyy-MM-dd
  + yyyy-MM-ddTHH:mm
  + yyyy-MM-ddTHH:mm:ss
  + yyyy-MM-ddTHH:mm:ssZ
  + yyyy-MM-ddTHH:mm:ss.SSSZ
  + yyyy-MM-ddTHH:mm:ss(+|-)hhmm
  + yyyy-MM-ddTHH:mm:ss.SSS(+|-)hhmm
+ A Geometry column type is supported in the user columns. The payload of these columns must contain only Geometry primitives of type Point, provided as strings in Well-Known Text (WKT) format. For example, `POINT (30 10)` is a valid Geometry value.
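Some of the payload conventions above can be sketched in code. For example, the Any-type payload is a semicolon-separated list of JSON objects, each carrying a `value` field and a `type` field. A minimal Python sketch of parsing such a payload (the function name is hypothetical, and the naive split assumes no semicolons appear inside the values):

```python
import json

def parse_any_payload(payload):
    """Parse an Any-type payload: JSON objects joined by semicolons,
    each with a "value" field and a "type" field."""
    values = []
    for part in payload.split(";"):  # assumes values contain no semicolons
        obj = json.loads(part)
        values.append((obj["type"], obj["value"]))
    return values
```

For example, `parse_any_payload('{"value": "10", "type": "Int"};{"value": "1.0", "type": "Float"}')` yields `[("Int", "10"), ("Float", "1.0")]`.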

## Sample Parquet output
<a name="sample-parquet-output"></a>

Given a Parquet file like this:

```
<s3 path>

Parquet Type:
    int8     int16       int32             int64              float      double    string
+--------+---------+-------------+----------------------+------------+------------+----------+
|   Byte |   Short |       Int   |                Long  |     Float  |    Double  | String   |
|--------+---------+-------------+----------------------+------------+------------+----------|
|   -128 |  -32768 | -2147483648 | -9223372036854775808 |    1.23456 |    1.23457 | first    |
|    127 |   32767 |  2147483647 |  9223372036854775807 |  nan       |  nan       | second   |
|      0 |       0 |           0 |                    0 | -inf       | -inf       | third    |
|      0 |       0 |           0 |                    0 |  inf       |  inf       | fourth   |
+--------+---------+-------------+----------------------+------------+------------+----------+
```

Here is an example of the output returned by neptune.read using the following query:

```
aws neptunedata execute-open-cypher-query \
--open-cypher-query "CALL neptune.read({source: '<s3 path>', format: 'parquet'}) YIELD row RETURN row" \
--endpoint-url https://my-cluster-name.cluster-abcdefgh1234.us-east-1.neptune.amazonaws.com:8182
```

```
{
 "results": [{
 "row": {
 "Float": 1.23456,
 "Byte": -128,
 "Int": -2147483648,
 "Long": -9223372036854775808,
 "String": "first",
 "Short": -32768,
 "Double": 1.2345678899999999
 }
 }, {
 "row": {
 "Float": "NaN",
 "Byte": 127,
 "Int": 2147483647,
 "Long": 9223372036854775807,
 "String": "second",
 "Short": 32767,
 "Double": "NaN"
 }
 }, {
 "row": {
 "Float": "-INF",
 "Byte": 0,
 "Int": 0,
 "Long": 0,
 "String": "third",
 "Short": 0,
 "Double": "-INF"
 }
 }, {
 "row": {
 "Float": "INF",
 "Byte": 0,
 "Int": 0,
 "Long": 0,
 "String": "fourth",
 "Short": 0,
 "Double": "INF"
 }
 }]
}
```

Currently, there is no way to set a node or edge label from a data field in a Parquet file. It is recommended that you partition the work into multiple queries, one for each label/type.

```
CALL neptune.read({source: '<s3 path>', format: 'parquet'})
 YIELD row 
WHERE row.`~label` = 'airport'
CREATE (n:airport)

CALL neptune.read({source: '<s3 path>', format: 'parquet'})
YIELD row 
WHERE row.`~label` = 'country'
CREATE (n:country)
```

# Query examples using CSV
<a name="access-graph-opencypher-21-extensions-s3-read-csv"></a>

In this example, the query returns the number of rows in a given CSV file:

```
CALL neptune.read(
  {
    source: "<s3 path>",
    format: "csv"
  }
)
YIELD row
RETURN count(row)
```

You can run the query example using the execute-open-cypher-query operation in the AWS CLI by executing the following code:

```
aws neptunedata execute-open-cypher-query \
--open-cypher-query "CALL neptune.read({source: '<s3 path>', format: 'csv'}) YIELD row RETURN count(row)" \
--endpoint-url https://my-cluster-name.cluster-abcdefgh1234.us-east-1.neptune.amazonaws.com:8182
```

A query can be flexible in what it does with rows read from a CSV file. For instance, the following query creates a node with a field set to data from a CSV file:

```
CALL neptune.read(
  {
    source: "<s3 path>",
    format: "csv"
  }
)
YIELD row
CREATE (n {someField: row.someCol}) 
RETURN n
```

**Warning**  
It is not considered good practice to use a clause that produces a large result set, such as `MATCH (n)`, before a `CALL` clause. This leads to a long-running query because of the cross product between the incoming solutions from prior clauses and the rows read by `neptune.read()`. It is recommended to start the query with `CALL neptune.read()`.

## Property column headers
<a name="property-column-headers"></a>

You can specify the data type for a property column by appending a colon (`:`) and the type name to the property name, using the following syntax. Type names are not case sensitive. If a colon appears within a property name, it must be escaped by preceding it with a backslash: `\:`.

```
propertyname:type
```

**Note**  
Space, comma, carriage return, and newline characters are not allowed in column headers, so property names cannot include these characters.

You can specify a column for an array type by adding `[]` to the type:

```
propertyname:type[]
```

Edge properties can only have a single value; an error occurs if an array type is specified or a second value is supplied. The following example shows the column header for a property named `age` of type `Int`:

```
age:Int
```

Every row in the file must have an integer in that position or be left empty. Arrays of strings are allowed, but strings in an array cannot include the semicolon (`;`) character unless it is escaped with a backslash (`\;`).
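The header rules above can be sketched in code. The following Python helper (illustrative, not part of Neptune) splits a `propertyname:type` header on the last unescaped colon, honors the `\:` escape, and detects the `[]` array suffix; type names are lowercased since they are not case sensitive:

```python
import re

def parse_column_header(header):
    """Split 'propertyname:type' into (name, type, is_array),
    honoring backslash-escaped colons in the property name."""
    # Split on colons that are not preceded by a backslash.
    parts = re.split(r"(?<!\\):", header)
    if len(parts) < 2:  # no type annotation present
        return header.replace("\\:", ":"), None, False
    name = ":".join(parts[:-1]).replace("\\:", ":")
    type_name = parts[-1]
    is_array = type_name.endswith("[]")
    if is_array:
        type_name = type_name[:-2]
    return name, type_name.lower(), is_array
```

For example, `parse_column_header("age:Int")` yields `("age", "int", False)`, and `parse_column_header("tags:String[]")` yields `("tags", "string", True)`.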

## Supported CSV column types
<a name="supported-csv-column-types"></a>
+ **BOOL (or BOOLEAN)** - Allowed values: true, false. Indicates a Boolean field. Any value other than true will be treated as false.
+ **FLOAT** - Range: 32-bit IEEE 754 floating point including Infinity, INF, -Infinity, -INF and NaN (not-a-number).
+ **DOUBLE** - Range: 64-bit IEEE 754 floating point including Infinity, INF, -Infinity, -INF and NaN (not-a-number).
+ **STRING** - 
  + Quotation marks are optional. Commas, newline, and carriage return characters are automatically escaped if they are included in a string surrounded by double quotation marks ("). Example: "Hello, World".
  + To include quotation marks in a quoted string, you can escape the quotation mark by using two in a row: Example: "Hello ""World""".
  + Arrays of strings are allowed, but strings in an array cannot include the semicolon (`;`) character unless it is escaped using a backslash (`\;`).
  + If you want to surround strings in an array with quotation marks, you must surround the whole array with one set of quotation marks. Example: "String one; String 2; String 3".
+ **DATE, DATETIME** - The datetime values can be provided in either the XSD format or one of the following formats: 
  + yyyy-MM-dd
  + yyyy-MM-ddTHH:mm
  + yyyy-MM-ddTHH:mm:ss
  + yyyy-MM-ddTHH:mm:ssZ
  + yyyy-MM-ddTHH:mm:ss.SSSZ
  + yyyy-MM-ddTHH:mm:ss(+|-)hhmm
  + yyyy-MM-ddTHH:mm:ss.SSS(+|-)hhmm
+ **SIGNED INTEGER** - 
  + Byte: -128 to 127
  + Short: -32768 to 32767
  + Int: -2^31 to 2^31-1
  + Long: -2^63 to 2^63-1
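The DATE/DATETIME formats listed above correspond to `strptime`-style patterns. A hedged Python sketch (the pattern list is an illustrative mapping, not Neptune's actual parser) that tries each accepted format in turn:

```python
from datetime import datetime

# Candidate strptime patterns for the accepted date/dateTime formats.
PATTERNS = [
    "%Y-%m-%d",
    "%Y-%m-%dT%H:%M",
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%dT%H:%M:%SZ",
    "%Y-%m-%dT%H:%M:%S.%fZ",
    "%Y-%m-%dT%H:%M:%S%z",
    "%Y-%m-%dT%H:%M:%S.%f%z",
]

def parse_neptune_datetime(text):
    """Return the first successful parse of a date/dateTime value."""
    for pattern in PATTERNS:
        try:
            return datetime.strptime(text, pattern)
        except ValueError:
            continue
    raise ValueError(f"unrecognized date/dateTime value: {text!r}")
```

For example, `parse_neptune_datetime("2024-01-31T10:30:00+0500")` produces a timezone-aware `datetime` via the `(+|-)hhmm` offset form.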

**Neptune-specific column types:**
+ An Any column type is supported in the user columns. The Any type is syntactic sugar for all of the other supported types, and is useful when a single user column contains values of multiple types. The payload of an Any value is a semicolon-separated list of JSON strings, for example `{"value": "10", "type": "Int"};{"value": "1.0", "type": "Float"}`, where each JSON string has a `value` field and a `type` field. The column header of an Any column is `propertyname:Any`. The cardinality of an Any column is set, meaning that the column can accept multiple values. 
  + Neptune supports the following types in an Any type: Bool (or Boolean), Byte, Short, Int, Long, UnsignedByte, UnsignedShort, UnsignedInt, UnsignedLong, Float, Double, Date, dateTime, String, and Geometry.
  + The Vector type is not supported in an Any type.
  + Nested Any types are not supported, for example `{"value": {"value": "10", "type": "Int"}, "type": "Any"}`.
+ A Geometry column type is supported in the user columns. The payload of these columns must contain only Geometry primitives of type Point, provided as strings in Well-Known Text (WKT) format. For example, `POINT (30 10)` is a valid Geometry value.
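A minimal Python sketch of validating the WKT `POINT` payload described above (the regex and function name are illustrative, not Neptune's actual validator):

```python
import re

# Matches 'POINT (x y)' with optional whitespace and signed decimal numbers.
_POINT_RE = re.compile(
    r"^POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)$"
)

def parse_wkt_point(text):
    """Parse a WKT 'POINT (x y)' string into a (float, float) pair."""
    match = _POINT_RE.match(text.strip())
    if not match:
        raise ValueError(f"not a WKT Point: {text!r}")
    return float(match.group(1)), float(match.group(2))
```

For example, `parse_wkt_point("POINT (30 10)")` returns `(30.0, 10.0)`, while non-Point primitives such as `LINESTRING` are rejected.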

## Sample CSV output
<a name="sample-csv-output"></a>

Given the following CSV file:

```
<s3 path>
colA:byte,colB:short,colC:int,colD:long,colE:float,colF:double,colG:string
-128,-32768,-2147483648,-9223372036854775808,1.23456,1.23457,first
127,32767,2147483647,9223372036854775807,nan,nan,second
0,0,0,0,-inf,-inf,third
0,0,0,0,inf,inf,fourth
```

This example shows the output returned by neptune.read using the following query:

```
aws neptunedata execute-open-cypher-query \
--open-cypher-query "CALL neptune.read({source: '<s3 path>', format: 'csv'}) YIELD row RETURN row" \
--endpoint-url https://my-cluster-name.cluster-abcdefgh1234.us-east-1.neptune.amazonaws.com:8182
```

```
{
  "results": [{
      "row": {
        "colD": -9223372036854775808,
        "colC": -2147483648,
        "colE": 1.23456,
        "colB": -32768,
        "colF": 1.2345699999999999,
        "colG": "first",
        "colA": -128
      }
    }, {
      "row": {
        "colD": 9223372036854775807,
        "colC": 2147483647,
        "colE": "NaN",
        "colB": 32767,
        "colF": "NaN",
        "colG": "second",
        "colA": 127
      }
    }, {
      "row": {
        "colD": 0,
        "colC": 0,
        "colE": "-INF",
        "colB": 0,
        "colF": "-INF",
        "colG": "third",
        "colA": 0
      }
    }, {
      "row": {
        "colD": 0,
        "colC": 0,
        "colE": "INF",
        "colB": 0,
        "colF": "INF",
        "colG": "fourth",
        "colA": 0
      }
    }]
}
```
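As the sample output above shows, non-finite float values are rendered as the strings `"NaN"`, `"INF"`, and `"-INF"`. A small Python sketch of that convention (the helper is illustrative, not Neptune's actual serializer):

```python
import math

def serialize_float(x):
    """Render a float the way the sample output does:
    non-finite values become the strings "NaN", "INF", or "-INF"."""
    if math.isnan(x):
        return "NaN"
    if math.isinf(x):
        return "INF" if x > 0 else "-INF"
    return x
```

Finite values pass through unchanged, which is why `colE` appears as `1.23456` in the first row but `"NaN"` in the second.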

Currently, there is no way to set a node or edge label from a data field in a CSV file. It is recommended that you partition the work into multiple queries, one for each label/type.

```
CALL neptune.read({source: '<s3 path>', format: 'csv'})
 YIELD row 
WHERE row.`~label` = 'airport'
CREATE (n:airport)

CALL neptune.read({source: '<s3 path>', format: 'csv'})
YIELD row 
WHERE row.`~label` = 'country'
CREATE (n:country)
```

# Managing permissions for neptune.read()
<a name="access-graph-opencypher-21-extensions-s3-read-permissions"></a>

## Required IAM Policies
<a name="access-graph-opencypher-21-extensions-s3-read-permissions-iam"></a>

To execute openCypher queries that use `neptune.read()`, you must have the appropriate permissions to access data in your Neptune database. Read-only queries require the `ReadDataViaQuery` action. Queries that modify data require `WriteDataViaQuery` for insertions or `DeleteDataViaQuery` for deletions. The example below grants all three actions on the specified cluster.

Additionally, you need permissions to access the S3 bucket containing your data files. The NeptuneS3Access policy statement grants the required S3 permissions:
+ **`s3:ListBucket`**: Required to verify bucket existence and list contents.
+ **`s3:GetObject`**: Required to access the specified object so its content can be read for integration into openCypher queries.

If your S3 bucket uses server-side encryption with AWS KMS, you must also grant KMS permissions. The NeptuneS3KMSAccess policy statement allows Neptune to decrypt data and generate data keys when accessing encrypted S3 objects. The condition restricts KMS operations to requests originating from S3 and RDS services in your region.
+ **`kms:Decrypt`**: Required to perform decryption of the encrypted object so its data can be read by Neptune.
+ **`kms:GenerateDataKey`**: Also required by the S3 API used to retrieve the objects to be read.

```
{
  "Sid": "NeptuneQueryAccess",
  "Effect": "Allow",
  "Action": [
      "neptune-db:ReadDataViaQuery",
      "neptune-db:WriteDataViaQuery",
      "neptune-db:DeleteDataViaQuery"
  ],
  "Resource": "arn:aws:neptune-db:<REGION>:<AWS_ACCOUNT_ID>:<CLUSTER_RESOURCE_ID>/*"
},
{
  "Sid": "NeptuneS3Access",
  "Effect": "Allow",
  "Action": [
      "s3:ListBucket",
      "s3:GetObject"
  ],
  "Resource": [
      "arn:aws:s3:::neptune-read-bucket",
      "arn:aws:s3:::neptune-read-bucket/*"
  ]
},
{
  "Sid": "NeptuneS3KMSAccess",
  "Effect": "Allow",
  "Action": [
      "kms:Decrypt",
      "kms:GenerateDataKey"
  ],
  "Resource": "arn:aws:kms:<REGION>:<AWS_ACCOUNT_ID>:key/<KEY_ID>",
  "Condition": {
      "StringEquals": {
        "kms:ViaService": [
            "s3.<REGION>.amazonaws.com",
            "rds.<REGION>.amazonaws.com"
        ]
      }
  }
}
```
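Note that the statements above are fragments. In an actual IAM policy they sit inside a standard policy document; the following sketch shows the enclosing structure (the statement bodies are abbreviated with `...`, and placeholders such as `<REGION>` must be filled in):

```
{
  "Version": "2012-10-17",
  "Statement": [
    { "Sid": "NeptuneQueryAccess", ... },
    { "Sid": "NeptuneS3Access", ... },
    { "Sid": "NeptuneS3KMSAccess", ... }
  ]
}
```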

## Important prerequisites
<a name="access-graph-opencypher-21-extensions-s3-read-permissions-prerequisites"></a>

These permissions and prerequisites ensure secure and reliable integration of S3 data into openCypher queries, while maintaining proper access controls and data protection measures.
+ **IAM authentication**: This feature is only supported for Neptune clusters with IAM authentication enabled. See [Securing your Amazon Neptune database](security.md) for detailed instructions on how to create and connect to IAM authentication-enabled clusters.
+ **VPC endpoint**:
  + A Gateway-type VPC endpoint for Amazon S3 is required to allow Neptune to communicate with Amazon S3.
  + To use custom AWS KMS encryption in the query, an Interface-type VPC endpoint for AWS KMS is required to allow Neptune to communicate with AWS KMS.
  + For detailed instructions on how to configure these endpoints, see [Creating the Amazon S3 VPC Endpoint](bulk-load-tutorial-IAM.md).