

# Data format for loading from Amazon S3 into Neptune Analytics
<a name="loading-data-formats"></a>

 Neptune Analytics, just like [Neptune Database](https://docs.aws.amazon.com//neptune/latest/userguide/bulk-load-tutorial-format.html), supports four formats for loading data: 
+  [RDF (ntriples)](https://docs.aws.amazon.com//neptune/latest/userguide/bulk-load-tutorial-format-rdf.html), which is a line-based format for triples. See [Using RDF data](using-rdf-data.md) for more information on how this data is handled. 
+  [csv](https://docs.aws.amazon.com//neptune/latest/userguide/bulk-load-tutorial-format-gremlin.html) and [opencypher](https://docs.aws.amazon.com//neptune/latest/userguide/bulk-load-tutorial-format-opencypher.html), which are csv-based formats with schema restrictions. A csv file must contain a header row followed by the column values; the remaining rows are interpreted based on the corresponding header columns. The header can contain predefined system column names and user-defined column names annotated with predefined datatypes and cardinality. 
+  [Parquet](https://docs.aws.amazon.com//neptune-analytics/latest/userguide/using-Parquet-data.html), which is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides high performance compression and encoding schemes to handle complex data in bulk. The data for each column in a Parquet file is stored together. 

 It's possible to combine CSV, RDF and Parquet data in the same graph, for example by first loading CSV data and enriching it with RDF data. 

# Using CSV data
<a name="using-CSV-data"></a>

 Neptune Analytics, like [Neptune Database](https://docs.aws.amazon.com//neptune/latest/userguide/bulk-load-tutorial-format.html), supports two csv formats for loading graph data: [csv](https://docs.aws.amazon.com//neptune/latest/userguide/bulk-load-tutorial-format-gremlin.html) and [opencypher](https://docs.aws.amazon.com//neptune/latest/userguide/bulk-load-tutorial-format-opencypher.html). Both are csv-based formats with a specified schema. A csv file must contain a header row followed by the column values; the remaining rows are interpreted based on the corresponding header columns. The header can contain predefined system column names and user-defined column names, annotated with predefined datatypes and cardinality. 
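 For illustration, here is a minimal, hypothetical pair of files in the `csv` (Gremlin) format; the ids, labels, and property names are invented for this example. A vertex file: 

```
~id,~label,name:String,age:Int
v1,Person,Alice,30
v2,Person,Bob,25
```

 and an edge file: 

```
~from,~to,~label,since:Date
v1,v2,knows,2020-01-01
```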

## Behavioral differences from Neptune csv (opencypher) format
<a name="using-CSV-data-differences"></a>

**Edge files**:
+  The `~id` (`:ID`) column in `edge` (`relationship`) files in `CSV` (`opencypher`) format is not supported. It is ignored if provided in any of the `edge` (`relationship`) files. 

**Vertex files**:
+  Only explicitly provided labels are associated with the vertices. If the label provided is empty, the vertex is added without a label. If a row contains just the vertex id without any labels or properties then the row is ignored, and no vertex is added. For more information about vertices, see [vertices](query-openCypher-data-model.md#query-openCypher-data-model-vertices). 

**Edge or vertex files**:
+  Unlike Neptune Database, a vertex identifier can appear just in edge files. Neptune Analytics allows loading just the edge data from files in Amazon S3, and running an algorithm over the data without needing to provide any additional vertex information. The edges are created between vertices with the given identifiers, and the vertices have no labels or properties unless any are provided in the vertex files. For more information on vertices and what they are, see [vertices](query-openCypher-data-model.md#query-openCypher-data-model-vertices). 
+  Unlike Neptune Database, Neptune Analytics doesn't convert the `Date` type into `Datetime` type. 
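 As a sketch of edge-only loading, a single file like the following (with hypothetical ids and label) is sufficient on its own; the vertices `v1`, `v2`, and `v3` are created implicitly, without labels or properties: 

```
~from,~to,~label
v1,v2,connected
v2,v3,connected
```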

## Supported column types
<a name="using-CSV-data-supported-types"></a>

### Date and Datetime
<a name="using-CSV-data-date-datetime"></a>

 The `Date` column type is supported. The following date formats are supported: `yyyy-MM-dd`, `yyyy-MM-dd[+|-]hhmm`. To include time along with date, use the `Datetime` column type instead. 

 The datetime values can either be provided in the [XSD format](https://www.w3.org/TR/xmlschema-2/) or one of the following formats: 
+ `yyyy-MM-dd`
+ `yyyy-MM-ddTHH:mm`
+ `yyyy-MM-ddTHH:mm:ss`
+ `yyyy-MM-ddTHH:mm:ssZ`
+ `yyyy-MM-ddTHH:mm:ss.SSSZ`
+ `yyyy-MM-ddTHH:mm:ss[+|-]hhmm`
+ `yyyy-MM-ddTHH:mm:ss.SSS[+|-]hhmm`
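 The list above can be mirrored with `strptime` patterns. The sketch below is an illustrative client-side validator, not the loader's actual parser; note that Python's `%f` accepts one to six fractional digits, which is slightly looser than the `.SSS` form, and `%z` (on Python 3.7+) accepts both `Z` and `[+|-]hhmm` offsets. 

```python
from datetime import datetime

# strptime equivalents of the documented Datetime formats (illustrative only).
PATTERNS = [
    "%Y-%m-%d",
    "%Y-%m-%dT%H:%M",
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%dT%H:%M:%S%z",      # covers the trailing Z and [+|-]hhmm forms
    "%Y-%m-%dT%H:%M:%S.%f%z",   # fractional seconds with an offset
]

def parse_neptune_datetime(value: str):
    """Return a datetime if value matches one of the documented formats, else None."""
    for pattern in PATTERNS:
        try:
            return datetime.strptime(value, pattern)
        except ValueError:
            continue
    return None
```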

### Vector
<a name="using-CSV-data-vector"></a>

 A new column type, `Vector`, is supported for associating embeddings with vertices. Because Neptune Analytics currently supports only one index type, the property name for embeddings is fixed to `embedding`. If the element type of the embeddings is not floating point (FP32), it is cast to FP32. Embeddings in the `csv` files are optional even when the vector index is enabled, so not every node needs to be associated with an embedding. To set up a vector index for the graph, choose the vector `dimension`, which specifies the number of dimensions of the vectors in the index. Unlike other properties, changes to vector embeddings are non-atomic and non-isolated (see [Vector index transaction support](vector-index.md#vector-index-transaction-support)); that is, they become durable and visible to other queries immediately upon write. 

**Important**  
 The `dimension` must match the dimension of the embeddings in the vertex files. 

 For more details on loading embeddings, see [vector-index](https://docs.aws.amazon.com//neptune-analytics/latest/userguide/vector-index.html). 

### Any type
<a name="using-CSV-data-any-type"></a>

 A column type `Any` is supported in user columns. The `Any` type is syntactic sugar covering all of the other supported types, and is useful when a single user column contains values of multiple types. The payload of an `Any` value is a semicolon-separated list of JSON strings, such as `"{""value"": ""10"", ""type"": ""Int""};{""value"": ""1.0"", ""type"": ""Float""}"`, where each individual JSON string has a `value` field and a `type` field. The column header of an `Any` column is `propertyname:Any`. The cardinality of an `Any` column is `set`, meaning that the column can accept multiple values. 

 Neptune Analytics supports the following types in an `Any` type: `Bool` (or `Boolean`), `Byte`, `Short`, `Int`, `Long`, `UnsignedByte`, `UnsignedShort`, `UnsignedInt`, `UnsignedLong`, `Float`, `Double`, `Date`, `dateTime`, and `String`. 

**Any type limitations**
+  `Vector` type is not supported in `Any` type. 
+  Nested `Any` type is not supported. For example, `"{""value"": "{""value"": ""10"", ""type"": ""Int""}", ""type"": ""Any""}"`. 
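 As an illustration of the payload shape (not the loader's implementation), the following sketch splits an `Any` cell, as it appears after ordinary CSV unescaping of the doubled quotes, into its value/type pairs. It assumes no semicolons occur inside the values. 

```python
import json

def parse_any_values(cell: str):
    """Split an Any-type cell into (value, type) pairs.

    The cell is a semicolon-separated list of JSON objects, each with a
    "value" field and a "type" field. Simplified: assumes values contain
    no semicolons.
    """
    return [(obj["value"], obj["type"])
            for obj in (json.loads(part) for part in cell.split(";"))]
```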

## Limitations and unsupported features
<a name="using-CSV-data-limitations"></a>
+  Multi-line string values are not supported. Import behavior is undefined if the dataset contains multi-line string values. 
+  Quoted string values must not have a leading space between the delimiter and the quotes. For example, the line `abc, "def"` is interpreted as a line with two fields: because of the leading space, the second field is not treated as quoted, so its quotes are stored as-is and the value is 6 characters long (the leading space plus `"def"`). The line `abc,"def"`, by contrast, is interpreted as two fields with string values `abc` and `def`. 
+  `Gzip` files are not supported. 
+  Float and double values in scientific notation are currently not supported. However, `Infinity`, `INF`, `-Infinity`, `-INF`, and `NaN` (`Not-a-number`) values are supported. 
+  The maximum length of the strings supported is limited to 1,048,062 bytes. The limit is lower for strings with unicode characters since some unicode characters are represented using multiple bytes. 
+  The `allowEmptyStrings` parameter is not supported. Empty string values ("") are not treated as null or missing value, and are stored as a property value. 
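 Python's standard `csv` module happens to follow the same quoting convention, which makes the leading-space rule easy to demonstrate: 

```python
import csv
import io

# A quote is only special at the very start of a field; after a leading
# space the quotes become literal characters of the value.
rows = list(csv.reader(io.StringIO('abc, "def"\nabc,"def"\n')))

print(rows[0])  # ['abc', ' "def"']  (unquoted field, 6 characters)
print(rows[1])  # ['abc', 'def']     (quoted field, quotes stripped)
```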

# Using Parquet data
<a name="using-Parquet-data"></a>

 Neptune Analytics supports importing data using the Parquet format. A Parquet file must contain a header row and the column values; the remaining rows are interpreted based on the corresponding header columns. The header should contain predefined system column names and/or user-defined column names. In addition to the header row and column values, a Parquet file also carries metadata, stored inline with the file, which is used when reading and decoding the data. 

**Note**  
 Compression for Parquet format is not supported at this time. 

## System column headers
<a name="using-Parquet-data-system-column-headers"></a>

 The required and allowed system column headers are different for vertex files and edge files. Each system column can appear only once in a header. All labels are case sensitive. 

**Note**  
 The `~id` column in `edge` (`relationship`) files in `Parquet` format is not supported. It is ignored if provided in any of the `edge` (`relationship`) files. 

**Vertex headers**
+  `~id` - Required. An `id` for the vertex. 
+  `~label` - Optional. List of labels for the vertex. Each label is a string. Multiple labels can either be semicolon (`;`) separated, or a list of strings. 

**Edge headers**
+  `~from` - Required. The vertex `id` of the **from** vertex. 
+  `~to` - Required. The vertex `id` of the **to** vertex. 
+  `~label` - Optional. A label for the edge. The label is a string value. 

## Property column headers
<a name="using-Parquet-data-property-column-headers"></a>

 Unlike the property column headers of the CSV format, the property column headers of the Parquet format only need to contain the property names; there is no need to include type names or cardinality. 

 There are, however, some special column types in the Parquet format that require annotation in the metadata: the `Any`, `Date`, and `dateTime` types. For more details on these types, see [using CSV data](https://docs.aws.amazon.com//neptune-analytics/latest/userguide/using-CSV-data.html). The following object is an example of metadata that annotates an `Any` type column, a `Date` type column, and a `dateTime` type column: 

```
"metadata": {
    "anyTypeColumns": ["UserCol1"],
    "dateTypeColumns": ["UserCol2"],
    "dateTimeTypeColumns": ["UserCol3"]
}
```

**Note**  
 Space, comma, carriage return and newline characters are not allowed in the column headers, so property names cannot include these characters. 

**Warning**  
 Without the annotation in the metadata for the special column types, the values of these special columns will be stored as strings instead of the intended types. 

# Using RDF data
<a name="using-rdf-data"></a>

 Neptune Analytics supports importing RDF data using the n-triples format. The handling of RDF values is described below, including how RDF data is interpreted as LPG concepts and can be queried using openCypher. 

## Handling of RDF values
<a name="rdf-handling"></a>

 The handling of RDF-specific values that don't have a direct equivalent in LPG is described here. 

### IRIs
<a name="rdf-handling-iri"></a>

 Values of type IRI, like `<http://example.com/Alice>` , are stored as such. IRIs and Strings are distinct data types. 

 Calling the openCypher function `TOSTRING()` on an IRI returns a string containing the IRI wrapped inside `<>`. For example, if `x` is the IRI `<http://example.com/Alice>`, then `TOSTRING(x)` returns `"<http://example.com/Alice>"`. When serializing openCypher query results in JSON format, IRI values are included as strings in this same format. 

### Language-tagged literals
<a name="rdf-handling-language-tagged-literals"></a>

 Values like `"Hallo"@de` are treated as follows: 
+  When used as input for openCypher string functions, like `trim()`, a language-tagged string is treated as a simple string; so `trim("Hallo"@de)` is equivalent to `trim("Hallo")`. 
+  When used in comparison operations, like `x = y` or `x <> y` or `x < y` or `ORDER BY`, a language-tagged literal is “greater than” (and thus “not equal to”) the corresponding simple string: `"Hallo" < "Hallo"@de`. 

 Calling a function such as `TOSTRING()` on a language-tagged literal returns that literal as a string without the language tag. For example, if `x` is the value `"Hallo"@de`, then `TOSTRING(x)` returns `"Hallo"`. When serializing openCypher query results in JSON format, language-tagged literals are likewise serialized as strings without an associated language tag. 

### Blank nodes
<a name="rdf-handling-blank-nodes"></a>

 Blank nodes in n-triples data files are replaced with globally unique IRIs at import time. 

 Loading RDF datasets that contain blank nodes is supported, but those blank nodes are represented as IRIs in the graph. When loading ntriples files, the parameter `blankNodeHandling` needs to be specified with the value `convertToIri`. 

 The generated IRI for a blank node has the format: `<http://aws.amazon.com/neptune/vocab/v01/BNode/scope#id>` 

 In these IRIs, `scope` is a unique identifier for the blank node scope, and `id` is the blank node identifier in the file. For example, for a blank node `_:b123`, the generated IRI could be `<http://aws.amazon.com/neptune/vocab/v01/BNode/737c0b5386448f78#b123>`. 

 The **blank node scope** (e.g. 737c0b5386448f78) is generated by Neptune Analytics and designates one file within one load operation. This means that when two different ntriples files reference the same blank node identifier, like `_:b123`, there will be two IRIs generated, namely one for each file. All references to `_:b123` within the first file will end up as references to the first IRI, like `<http://aws.amazon.com/neptune/vocab/v01/BNode/1001#b123>`, and all references within the second file will end up referring to another IRI, like `<http://aws.amazon.com/neptune/vocab/v01/BNode/1002#b123>`. 
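 The IRI construction can be sketched as simple string formatting; the scope value is generated by the service per file and per load, so the one used in the test below is purely illustrative: 

```python
def blank_node_iri(scope: str, bnode_id: str) -> str:
    """Build the IRI Neptune Analytics substitutes for a blank node.

    `scope` is a per-file, per-load identifier generated by the service;
    `bnode_id` is the blank node identifier from the file (without the
    leading `_:`).
    """
    return f"<http://aws.amazon.com/neptune/vocab/v01/BNode/{scope}#{bnode_id}>"
```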

## Referencing IRIs in queries
<a name="rdf-handling-iri-ref"></a>

 There are two ways to reference an IRI in an openCypher query: 
+  Wrap the full IRI inside `<` and `>`. Depending on where in the query the IRI is referenced, it is provided either as a string, such as `"<http://example.com/Alice>"` (when the IRI is the value of the `~id` property), or in backticks, such as `` `<http://example.com/Alice>` `` (when the IRI is a label or property key). 

  ```
  CREATE (:`<http://xmlns.com/foaf/0.1/Person>` {`~id`: "<http://example.com/Alice>"})
  ```
+  Define a PREFIX at the start of the query, and inside the query reference an IRI using `prefix::suffix`. For example, after `PREFIX ex: <http://example.com/>`, the reference `ex::Alice` refers to the full IRI `<http://example.com/Alice>`. 

  ```
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  PREFIX ex: <http://example.com/>
  CREATE (: foaf::Person {`~id`: ex::Alice})
  ```

 Additional query examples below show the use of both full IRIs and the prefix syntax. 
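 The PREFIX mechanism amounts to simple string substitution. This sketch, an illustration rather than the query engine's parser, expands a `prefix::suffix` reference into a full wrapped IRI: 

```python
def expand_prefixed_iri(prefixes: dict, ref: str) -> str:
    """Expand a prefix::suffix reference using declared PREFIX mappings.

    Simplified sketch: assumes `ref` always has the prefix::suffix shape
    and that the prefix has been declared.
    """
    prefix, suffix = ref.split("::", 1)
    return f"<{prefixes[prefix]}{suffix}>"
```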

## Mapping RDF triples to LPG concepts
<a name="rdf-mapping-triples"></a>

 There are three rules that define how RDF triples correspond to LPG concepts: 

```
Case     RDF triple                   ⇆    LPG concept
-----------------------------------------------------------------
Case #1  { <iri> rdf:type <iri> }     ⇆    vertex with id + label
Case #2  { <iri> <iri> "literal"}     ⇆    vertex property
Case #3  { <iri> <iri> <iri> }        ⇆    edge with label
```

**Case #1: Vertex with id and label**

 A triple like: 

```
<http://example.com/Alice> rdf:type <http://xmlns.com/foaf/0.1/Person>
```

 is equivalent to creating the vertex in openCypher like: 

```
CREATE (:`<http://xmlns.com/foaf/0.1/Person>` {`~id`: "<http://example.com/Alice>"})
```

 In this example, the vertex label `<http://xmlns.com/foaf/0.1/Person>` is interpreted and stored as an IRI. 

**Note**  
 The back quote syntax (`` ` ``) is part of openCypher and allows inserting characters that normally cannot be used in labels. Using this mechanism, it’s possible to include complete IRIs in a query. 

 Using `PREFIX`, the same `CREATE` query could look like: 

```
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX ex: <http://example.com/>
CREATE (: foaf::Person {`~id`: ex::Alice})
```

 To match the newly created vertex based on its id: 

```
MATCH (v {`~id`: "<http://example.com/Alice>"}) RETURN v
```

 or equivalently: 

```
PREFIX ex: <http://example.com/>
MATCH (v {`~id`: ex::Alice}) RETURN v
```

 To find vertices with that RDF Class/LPG Label: 

```
MATCH (v:`<http://xmlns.com/foaf/0.1/Person>`) RETURN v
```

 or equivalently: 

```
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
MATCH (v : foaf::Person) RETURN v
```

**Case #2: Vertex property**

 A triple like: 

```
<http://example.com/Alice> <http://xmlns.com/foaf/0.1/name> "Alice Smith"
```

 is equivalent to defining with openCypher a node with a given `~id` and property, where both the `~id` and the property key are IRIs: 

```
CREATE ({`~id`: "<http://example.com/Alice>",
        `<http://xmlns.com/foaf/0.1/name>`: "Alice Smith" })
```

 or equivalently: 

```
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX ex: <http://example.com/>
CREATE ({`~id`: ex::Alice, foaf::name: "Alice Smith" })
```

 To match the vertex with that property: 

```
MATCH (v {`<http://xmlns.com/foaf/0.1/name>`: "Alice Smith"}) RETURN v
```

 or equivalently: 

```
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
MATCH (v { foaf::name : "Alice Smith"}) RETURN v
```

**Case #3: Edge**

 A triple like: 

```
<http://example.com/Alice> <http://example.com/knows> <http://example.com/Bob>
```

 is equivalent to defining with openCypher an edge like this, where the edge label and vertex ids are all IRIs: 

```
CREATE ({`~id`: "<http://example.com/Alice>"})
          -[:`<http://example.com/knows>`]->({`~id`: "<http://example.com/Bob>"})
```

 or equivalently: 

```
PREFIX ex: <http://example.com/>
CREATE ({`~id`: ex::Alice })-[: ex::knows ]->({`~id`: ex::Bob })
```

 To match the edges with that label: 

```
MATCH (v)-[:`<http://example.com/knows>`]->(w) RETURN v, w
```

 or equivalently: 

```
PREFIX ex: <http://example.com/>
MATCH (v)-[: ex::knows ]->(w) RETURN v, w
```

## Query Examples
<a name="rdf-query-examples"></a>

**Matching language-tagged literals**

 If this triple was loaded from a dataset: 

```
<http://example.com/German> <http://example.com/greeting> "Hallo"@de
```

 then it will **not** be matched by this query: 

```
MATCH (n) WHERE n.`<http://example.com/greeting>` = "Hallo" RETURN n
```

 because the language-tagged literal `"Hallo"@de` and the plain string `"Hallo"` are not equal. For more information, see [Language-tagged literals](using-rdf-data.md#rdf-handling-language-tagged-literals). The query can use `TOSTRING()` in order to find the match: 

```
MATCH (n) WHERE TOSTRING(n.`<http://example.com/greeting>`) = "Hallo" RETURN n
```