

# neptune.read()


 Neptune supports a `CALL` procedure, `neptune.read`, that reads data from Amazon S3 and then runs an openCypher query (read, insert, or update) using that data. The procedure yields each row in the file as the declared result variable `row`. It uses the IAM credentials of the caller to access the data in Amazon S3; see [Create your IAM role for Amazon S3 access](https://docs.aws.amazon.com//neptune-analytics/latest/userguide/bulk-import-create-from-s3.html#create-iam-role-for-s3-access) to set up the permissions. The Amazon S3 bucket must be in the same AWS Region as the Neptune Analytics graph. Currently, cross-Region reads are not supported. 

 **Syntax** 

```
CALL neptune.read(
  {
    source: "string",
    format: "parquet/csv",
    concurrency: 10
  }
)
YIELD row
...
```

**Inputs**
+  **source** (required) - Amazon S3 URI of a **single** object. An Amazon S3 prefix that matches multiple objects is not supported. 
+  **format** (required) - `parquet` and `csv` are supported. 
  +  More details on the supported Parquet format can be found in [Supported Parquet column types](parquet-column-types.md). 
  +  For more information on the supported csv format, see [Gremlin load data format](https://docs.aws.amazon.com//neptune/latest/userguide/bulk-load-tutorial-format-gremlin.html). 
+  **concurrency** (optional) - Type: integer, 0 or greater. Default: 0. Specifies the number of threads used to read the file. If the value is 0, the maximum number of threads allowed by the resource is used. For Parquet, it is recommended to set this to the number of row groups in the file. 

**Outputs**

 The `neptune.read` procedure returns: 
+  **row** - Type: Map 
  +  Each row in the file, where the keys are the column names and the values are the data found in each column. 
  +  You can access each column's data using property access syntax (`row.col`). 

# Query examples using Parquet


 The following example query returns the number of rows in a given Parquet file: 

```
CALL neptune.read(
  {
    source: "<s3 path>",
    format: "parquet"
  }
)
YIELD row
RETURN count(row)
```

 You can run the query example using the `execute-query` operation in the AWS CLI by executing the following code: 

```
aws neptune-graph execute-query \
  --graph-identifier ${graphIdentifier} \
  --query-string 'CALL neptune.read({source: "<s3 path>", 
    format: "parquet"}) YIELD row RETURN count(row)' \
  --language open_cypher \
  /tmp/out.txt
```

 A query can be flexible in what it does with rows read from a Parquet file. For example, the following query creates a node with a field being set to data found in the Parquet file: 

```
CALL neptune.read(
  {
    source: "<s3 path>",
    format: "parquet"
  }
)
YIELD row
CREATE (n {someField: row.someCol}) 
RETURN n
```

**Warning**  
 It is not considered good practice to use a clause that produces a large result set, such as `MATCH (n)`, before a `CALL` clause. Doing so leads to a long-running query because of the cross product between the incoming solutions from prior clauses and the rows read by `neptune.read`. It is recommended to start the query with `CALL neptune.read`. 

# Supported Parquet column types


**Parquet data types:**
+  NULL 
+  BOOLEAN 
+  FLOAT 
+  DOUBLE 
+  STRING 
+  SIGNED INTEGER: INT8, INT16, INT32, INT64 
+  MAP: Only one level is supported; nested maps are not. 
+  LIST: Only one level is supported; nested lists are not. 

**Neptune-specific:**
+  A column type `Any` is supported in user columns. The `Any` type is syntactic sugar for all of the other supported types, and is useful when a column contains values of multiple types. The payload of an `Any` value is a list of JSON strings, such as `"{""value"": ""10"", ""type"": ""Int""};{""value"": ""1.0"", ""type"": ""Float""}"`, where each individual JSON string has a `value` field and a `type` field. The column header of an `Any` column is `propertyname:Any`. The cardinality of an `Any` column is `set`, meaning that the column can accept multiple values. 
  +  Neptune Analytics supports the following types in an `Any` column: `Bool` (or `Boolean`), `Byte`, `Short`, `Int`, `Long`, `UnsignedByte`, `UnsignedShort`, `UnsignedInt`, `UnsignedLong`, `Float`, `Double`, `Date`, `dateTime`, and `String`. 
  +  The `Vector` type is not supported within `Any`. 
  +  Nested `Any` values are not supported. For example: `"{""value"": "{""value"": ""10"", ""type"": ""Int""}", ""type"": ""Any""}"`. 
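The `Any` payload is ordinary CSV-style quoting around `;`-separated JSON objects. As a minimal sketch (not part of Neptune tooling, and assuming no literal `;` appears inside the values), such a cell could be decoded in Python like this:

```python
import json

def parse_any_cell(cell: str) -> list:
    """Decode an Any-typed cell: once CSV quoting is removed, the payload
    is a ';'-separated list of JSON objects, each carrying a 'value' and a
    'type' field. Naive split: assumes no literal ';' inside the values."""
    return [json.loads(part) for part in cell.split(";")]

# The payload as it looks after the CSV layer has stripped the outer quotes
# and collapsed the doubled quotation marks:
cell = '{"value": "10", "type": "Int"};{"value": "1.0", "type": "Float"}'
values = parse_any_cell(cell)
print(values[0]["type"], values[1]["value"])  # Int 1.0
```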

# Sample Parquet output


 Given a Parquet file like this: 

```
<s3 path>

Parquet Type:
    int8     int16       int32             int64              float      double    string
+--------+---------+-------------+----------------------+------------+------------+----------+
|   Byte |   Short |       Int   |                Long  |     Float  |    Double  | String   |
|--------+---------+-------------+----------------------+------------+------------+----------|
|   -128 |  -32768 | -2147483648 | -9223372036854775808 |    1.23456 |    1.23457 | first    |
|    127 |   32767 |  2147483647 |  9223372036854775807 |  nan       |  nan       | second   |
|      0 |       0 |           0 |                    0 | -inf       | -inf       | third    |
|      0 |       0 |           0 |                    0 |  inf       |  inf       | fourth   |
+--------+---------+-------------+----------------------+------------+------------+----------+
```

 Here is an example of the output returned by `neptune.read` using the following query: 

```
aws neptune-graph execute-query \
  --graph-identifier ${graphIdentifier} \
  --query-string "CALL neptune.read({source: '<s3 path>', format: 'parquet'}) YIELD row RETURN row" \
  --language open_cypher \
  /tmp/out.txt

cat /tmp/out.txt

{
  "results": [{
      "row": {
        "Float": 1.23456,
        "Byte": -128,
        "Int": -2147483648,
        "Long": -9223372036854775808,
        "String": "first",
        "Short": -32768,
        "Double": 1.2345678899999999
      }
    }, {
      "row": {
        "Float": "NaN",
        "Byte": 127,
        "Int": 2147483647,
        "Long": 9223372036854775807,
        "String": "second",
        "Short": 32767,
        "Double": "NaN"
      }
    }, {
      "row": {
        "Float": "-INF",
        "Byte": 0,
        "Int": 0,
        "Long": 0,
        "String": "third",
        "Short": 0,
        "Double": "-INF"
      }
    }, {
      "row": {
        "Float": "INF",
        "Byte": 0,
        "Int": 0,
        "Long": 0,
        "String": "fourth",
        "Short": 0,
        "Double": "INF"
      }
    }]
}
```

 Currently, there is no way to set a node or edge label from a data field in a Parquet file. It is recommended that you partition the work into multiple queries, one for each label/type: 

```
CALL neptune.read({source: '<s3 path>', format: 'parquet'})
YIELD row
WHERE row.`~label` = 'airport'
CREATE (n:airport)

CALL neptune.read({source: '<s3 path>', format: 'parquet'})
YIELD row
WHERE row.`~label` = 'country'
CREATE (n:country)
```

# Query examples using CSV


 In this example, the query returns the number of rows in a given CSV file: 

```
CALL neptune.read(
  {
    source: "<s3 path>",
    format: "csv"
  }
)
YIELD row
RETURN count(row)
```

 You can run the query using the `execute-query` operation in the AWS CLI: 

```
aws neptune-graph execute-query \
  --graph-identifier ${graphIdentifier} \
  --query-string 'CALL neptune.read({source: "<s3 path>", 
    format: "csv"}) YIELD row RETURN count(row)' \
  --language open_cypher \
  /tmp/out.txt
```

 A query can be flexible in what it does with rows read from a CSV file. For instance, the following query creates a node with a field set to data from a CSV file: 

```
CALL neptune.read(
  {
    source: "<s3 path>",
    format: "csv"
  }
)
YIELD row
CREATE (n {someField: row.someCol}) 
RETURN n
```

**Warning**  
 It is not considered good practice to use a clause that produces a large result set, such as `MATCH (n)`, before a `CALL` clause. Doing so leads to a long-running query because of the cross product between the incoming solutions from prior clauses and the rows read by `neptune.read`. It is recommended to start the query with `CALL neptune.read`. 

# Property column headers


 You can specify a type for a property column by using the following syntax. The type names are not case sensitive. If a colon appears within a property name, it must be escaped by preceding it with a backslash (`\:`). 

```
propertyname:type
```

**Note**  
 Space, comma, carriage return, and newline characters are not allowed in column headers, so property names cannot include these characters.   
 You can specify a column for an array type by adding `[]` to the type:   

  ```
  propertyname:type[]
  ```
 Edge properties can only have a single value; an error occurs if an array type is specified or a second value is supplied. The following example shows the column header for a property named `age` of type `Int`.   

  ```
  age:Int
  ```
 Every row in the file would then be required to have an integer in that position, or leave it empty. Arrays of strings are allowed, but a string in an array cannot include the semicolon (`;`) character unless it is escaped with a backslash (`\;`). 
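Putting these header rules together, a small helper (a hypothetical sketch, not part of any Neptune tooling) could build column headers, escaping colons in the property name and appending `[]` for array columns:

```python
def column_header(name: str, col_type: str, array: bool = False) -> str:
    """Build a 'propertyname:type' CSV column header.
    Escapes ':' inside the property name and appends '[]' for arrays."""
    header = name.replace(":", "\\:") + ":" + col_type
    return header + "[]" if array else header

print(column_header("age", "Int"))              # age:Int
print(column_header("aliases", "String", True)) # aliases:String[]
print(column_header("ns:prop", "Long"))         # ns\:prop:Long
```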

# Supported CSV column types

+  Bool (or Boolean) - Allowed values: `true`, `false`. Indicates a Boolean field. Any value other than `true` is treated as `false`. 
+  FLOAT - Range: 32-bit IEEE 754 floating point, including Infinity, INF, -Infinity, -INF, and NaN (not-a-number). 
+  DOUBLE - Range: 64-bit IEEE 754 floating point, including Infinity, INF, -Infinity, -INF, and NaN (not-a-number). 
+  STRING - 
  +  Quotation marks are optional. Comma, newline, and carriage return characters are automatically escaped if they appear in a string surrounded by double quotation marks (`"`). Example: `"Hello, World"`. 
  +  To include quotation marks in a quoted string, you can escape the quotation mark by using two in a row: Example: `"Hello ""World"""`. 
  +  Arrays of strings are allowed, but strings in an array cannot include the semicolon (`;`) character unless it is escaped using a backslash (`\;`). 
  +  If you want to surround strings in an array with quotation marks, you must surround the whole array with one set of quotation marks. Example: `"String one; String 2; String 3"`. 
+  Datetime - The datetime values can be provided in either the XSD format, or one of the following formats: 
  +  yyyy-MM-dd 
  +  yyyy-MM-ddTHH:mm 
  +  yyyy-MM-ddTHH:mm:ss 
  +  yyyy-MM-ddTHH:mm:ssZ 
  +  yyyy-MM-ddTHH:mm:ss.SSSZ 
  +  yyyy-MM-ddTHH:mm:ss[+|-]hhmm 
  +  yyyy-MM-ddTHH:mm:ss.SSS[+|-]hhmm 
+  SIGNED INTEGER - 
  +  Byte: -128 to 127 
  +  Short: -32768 to 32767 
  +  Int: -2^31 to 2^31-1 
  +  Long: -2^63 to 2^63-1 
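As an illustration of the string rules above (a hypothetical helper, assuming the quoting and escaping described in this section), strings and string arrays could be serialized into CSV cells like this:

```python
def quote_csv_string(s: str) -> str:
    """Wrap a string in double quotes, doubling any embedded quotes,
    so commas, newlines, and carriage returns are safe."""
    return '"' + s.replace('"', '""') + '"'

def array_cell(items) -> str:
    """Join array items with ';', escaping literal ';' with a backslash,
    and quote the whole array once."""
    return quote_csv_string(";".join(s.replace(";", "\\;") for s in items))

print(quote_csv_string('Hello, "World"'))  # "Hello, ""World"""
print(array_cell(["String one", "String 2", "String 3"]))
```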

**Neptune-specific:**
+  A column type `Any` is supported in user columns. The `Any` type is syntactic sugar for all of the other supported types, and is useful when a column contains values of multiple types. The payload of an `Any` value is a list of JSON strings, such as `"{""value"": ""10"", ""type"": ""Int""};{""value"": ""1.0"", ""type"": ""Float""}"`, where each individual JSON string has a `value` field and a `type` field. The column header of an `Any` column is `propertyname:Any`. The cardinality of an `Any` column is `set`, meaning that the column can accept multiple values. 
  +  Neptune Analytics supports the following types in an `Any` column: `Bool` (or `Boolean`), `Byte`, `Short`, `Int`, `Long`, `UnsignedByte`, `UnsignedShort`, `UnsignedInt`, `UnsignedLong`, `Float`, `Double`, `Date`, `dateTime`, and `String`. 
  +  The `Vector` type is not supported within `Any`. 
  +  Nested `Any` values are not supported. For example: `"{""value"": "{""value"": ""10"", ""type"": ""Int""}", ""type"": ""Any""}"`. 
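For completeness, here is a hypothetical sketch (not part of Neptune) of how an `Any` cell in the shape shown above could be produced; the quote-doubling is ordinary CSV quoting:

```python
import json

def any_cell(values) -> str:
    """Serialize (value, type) pairs into an Any-typed CSV cell:
    ';'-separated JSON objects, with embedded quotes doubled and the
    whole payload quoted for CSV."""
    payload = ";".join(json.dumps({"value": str(v), "type": t}) for v, t in values)
    return '"' + payload.replace('"', '""') + '"'

print(any_cell([(10, "Int"), (1.0, "Float")]))
# "{""value"": ""10"", ""type"": ""Int""};{""value"": ""1.0"", ""type"": ""Float""}"
```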

# Sample CSV output


 Given the following CSV file: 

```
<s3 path>
colA:byte,colB:short,colC:int,colD:long,colE:float,colF:double,colG:string
-128,-32768,-2147483648,-9223372036854775808,1.23456,1.23457,first
127,32767,2147483647,9223372036854775807,nan,nan,second
0,0,0,0,-inf,-inf,third
0,0,0,0,inf,inf,fourth
```

 This example shows the output returned by `neptune.read` using the following query: 

```
aws neptune-graph execute-query \
  --graph-identifier ${graphIdentifier} \
  --query-string "CALL neptune.read({source: '<s3 path>', format: 'csv'}) YIELD row RETURN row" \
  --language open_cypher \
  /tmp/out.txt

cat /tmp/out.txt
{
  "results": [{
      "row": {
        "colD": -9223372036854775808,
        "colC": -2147483648,
        "colE": 1.23456,
        "colB": -32768,
        "colF": 1.2345699999999999,
        "colG": "first",
        "colA": -128
      }
    }, {
      "row": {
        "colD": 9223372036854775807,
        "colC": 2147483647,
        "colE": "NaN",
        "colB": 32767,
        "colF": "NaN",
        "colG": "second",
        "colA": 127
      }
    }, {
      "row": {
        "colD": 0,
        "colC": 0,
        "colE": "-INF",
        "colB": 0,
        "colF": "-INF",
        "colG": "third",
        "colA": 0
      }
    }, {
      "row": {
        "colD": 0,
        "colC": 0,
        "colE": "INF",
        "colB": 0,
        "colF": "INF",
        "colG": "fourth",
        "colA": 0
      }
    }]
}
```

 Currently, there is no way to set a node or edge label from a data field in a CSV file. It is recommended that you partition the work into multiple queries, one for each label/type: 

```
CALL neptune.read({source: '<s3 path>', format: 'csv'})
YIELD row
WHERE row.`~label` = 'airport'
CREATE (n:airport)

CALL neptune.read({source: '<s3 path>', format: 'csv'})
YIELD row 
WHERE row.`~label` = 'country'
CREATE (n:country)
```