

# Understanding how Gremlin queries work in Neptune
<a name="gremlin-explain-background"></a>

To take full advantage of the Gremlin `explain` and `profile` reports in Amazon Neptune, it is helpful to understand some background information about Gremlin queries.

**Topics**
+ [Gremlin statements in Neptune](gremlin-explain-background-statements.md)
+ [How Neptune processes Gremlin queries using statement indexes](gremlin-explain-background-indexing-examples.md)
+ [How Gremlin queries are processed in Neptune](gremlin-explain-background-querying.md)

# Gremlin statements in Neptune
<a name="gremlin-explain-background-statements"></a>

Property graph data in Amazon Neptune is composed of four-position (quad) statements. Each of these statements represents an individual atomic unit of property graph data. For more information, see [Neptune Graph Data Model](feature-overview-data-model.md). Similar to the Resource Description Framework (RDF) data model, these four positions are as follows:
+ `subject (S)`
+ `predicate (P)`
+ `object (O)`
+ `graph (G)`

Each statement is an assertion about one or more resources. For example, a statement can assert the existence of a relationship between two resources, or it can attach a property (key-value pair) to some resource.

You can think of the predicate as the verb of the statement, describing the type of relationship or property. The object is the target of the relationship, or the value of the property. The graph position is optional and can be used in many different ways. For the Neptune property graph (PG) data, it is either unused (null graph) or it is used to represent the identifier for an edge. A set of statements with shared resource identifiers creates a graph.

There are three classes of statements in the Neptune property graph data model:

**Topics**
+ [Vertex Label Statements](#gremlin-explain-background-vertex-labels)
+ [Edge Statements](#gremlin-explain-background-edge-statements)
+ [Property Statements](#gremlin-explain-background-property-statements)

## Gremlin Vertex Label Statements
<a name="gremlin-explain-background-vertex-labels"></a>

Vertex label statements in Neptune serve two purposes:
+ They track the labels for a vertex.
+ The presence of at least one of these statements is what implies the existence of a particular vertex in the graph.

The subject of these statements is a vertex identifier, and the object is a label, both of which are specified by the user. You use a special fixed predicate for these statements, displayed as `<~label>`, and a default graph identifier (the null graph), displayed as `<~>`.

For example, consider the following `addV` traversal.

```
g.addV("Person").property(id, "v1")
```

This traversal results in the following statement being added to the graph.

```
StatementEvent[Added(<v1> <~label> <Person> <~>) .]
```

## Gremlin Edge Statements
<a name="gremlin-explain-background-edge-statements"></a>

A Gremlin edge statement is what implies the existence of an edge between two vertices in a graph in Neptune. The subject (S) of an edge statement is the source `from` vertex. The predicate (P) is a user-supplied edge label. The object (O) is the target `to` vertex. The graph (G) is a user-supplied edge identifier.

For example, consider the following `addE` traversal.

```
g.addE("knows").from(V("v1")).to(V("v2")).property(id, "e1")
```

The traversal results in the following statement being added to the graph.

```
StatementEvent[Added(<v1> <knows> <v2> <e1>) .]
```

## Gremlin Property Statements
<a name="gremlin-explain-background-property-statements"></a>

A Gremlin property statement in Neptune asserts an individual property value for a vertex or edge. The subject is a user-supplied vertex or edge identifier. The predicate is the property name (key), and the object is the individual property value. The graph (G) is again the default graph identifier, the null graph, displayed as `<~>`.

Consider the following vertex property example.

```
g.V("v1").property("name", "John")
```

This statement results in the following.

```
StatementEvent[Added(<v1> <name> "John" <~>) .]
```

Property statements differ from others in that their object is a primitive value (a `string`, `date`, `byte`, `short`, `int`, `long`, `float`, or `double`). Their object is not a resource identifier that could be used as the subject of another assertion.

For multi-properties, each individual property value in the set receives its own statement.

```
g.V("v1").property(set, "phone", "956-424-2563").property(set, "phone", "956-354-3692 (tel:9563543692)")
```

This results in the following.

```
StatementEvent[Added(<v1> <phone> "956-424-2563" <~>) .]
StatementEvent[Added(<v1> <phone> "956-354-3692" <~>) .]
```

Edge properties are handled similarly to vertex properties, but use the edge identifier in the (S) position. For example, adding a property to an edge:

```
g.E("e1").property("weight", 0.8)
```

This results in the following statement being added to the graph.

```
StatementEvent[Added(<e1> <weight> 0.8 <~>) .]
```

# How Neptune processes Gremlin queries using statement indexes
<a name="gremlin-explain-background-indexing-examples"></a>

Statements are accessed in Amazon Neptune by way of three statement indexes, as detailed in [How Statements Are Indexed in Neptune](feature-overview-storage-indexing.md). Neptune extracts a statement *pattern* from a Gremlin query in which some positions are known, and the rest are left for discovery by index search.

Neptune assumes that the size of the property graph schema is not large. This means that the number of distinct edge labels and property names is fairly low, resulting in a low total number of distinct predicates. Neptune tracks distinct predicates in a separate index. It uses this cache of predicates to do a union scan of `{ all P x POGS }` rather than use an OSGP index. Avoiding the need for a reverse traversal OSGP index saves both storage space and load throughput.

The Neptune Gremlin Explain/Profile API lets you obtain the predicate count in your graph. You can then determine whether your application invalidates the Neptune assumption that your property graph schema is small.

The following examples help illustrate how Neptune uses indexes to process Gremlin queries.

**Question: What are the labels of vertex `v1`?**

```
  Gremlin code:      g.V('v1').label()
  Pattern:           (<v1>, <~label>, ?, ?)
  Known positions:   SP
  Lookup positions:  OG
  Index:             SPOG
  Key range:         <v1>:<~label>:*
```

**Question: What are the 'knows' out-edges of vertex `v1`?**

```
  Gremlin code:      g.V('v1').out('knows')
  Pattern:           (<v1>, <knows>, ?, ?)
  Known positions:   SP
  Lookup positions:  OG
  Index:             SPOG
  Key range:         <v1>:<knows>:*
```

**Question: Which vertices have a `Person` vertex label?**

```
  Gremlin code:      g.V().hasLabel('Person')
  Pattern:           (?, <~label>, <Person>, <~>)
  Known positions:   POG
  Lookup positions:  S
  Index:             POGS
  Key range:         <~label>:<Person>:<~>:*
```

**Question: What are the from/to vertices of a given edge `e1`?**

```
  Gremlin code:      g.E('e1').bothV()
  Pattern:           (?, ?, ?, <e1>)
  Known positions:   G
  Lookup positions:  SPO
  Index:             GPSO
  Key range:         <e1>:*
```

One statement index that Neptune does **not** have is a reverse traversal OSGP index. This index could be used to gather all incoming edges across all edge labels, as in the following example.

**Question: What are the incoming adjacent vertices `v1`?**

```
  Gremlin code:      g.V('v1').in()
  Pattern:           (?, ?, <v1>, ?)
  Known positions:   O
  Lookup positions:  SPG
  Index:             OSGP  // <-- Index does not exist
```

# How Gremlin queries are processed in Neptune
<a name="gremlin-explain-background-querying"></a>

In Amazon Neptune, more complex traversals can be represented by a series of patterns that create a relation based on the definition of named variables that can be shared across patterns to create joins. This is shown in the following example.

**Question: What is the two-hop neighborhood of vertex `v1`?**

```
  Gremlin code:      g.V(‘v1’).out('knows').out('knows').path()
  Pattern:           (?1=<v1>, <knows>, ?2, ?) X Pattern(?2, <knows>, ?3, ?)

  The pattern produces a three-column relation (?1, ?2, ?3) like this:
                     ?1     ?2     ?3
                     ================
                     v1     v2     v3
                     v1     v2     v4
                     v1     v5     v6
```

By sharing the `?2` variable across the two patterns (at the O position in the first pattern and the S position of the second pattern), you create a join from the first hop neighbors to the second hop neighbors. Each Neptune solution has bindings for the three named variables, which can be used to re-create a [TinkerPop Traverser](http://tinkerpop.apache.org/docs/current/reference/#_the_traverser) (including path information).

```
```

The first step in Gremlin query processing is to parse the query into a TinkerPop [Traversal](http://tinkerpop.apache.org/docs/current/reference/#traversal) object, composed of a series of TinkerPop [steps](http://tinkerpop.apache.org/docs/current/reference/#graph-traversal-steps). These steps, which are part of the open-source [Apache TinkerPop project](http://tinkerpop.apache.org/), are both the logical and physical operators that compose a Gremlin traversal in the reference implementation. They are both used to represent the model of the query. They are executable operators that can produce solutions according to the semantics of the operator that they represent. For example, `.V()` is both represented and executed by the TinkerPop [GraphStep](http://tinkerpop.apache.org/docs/current/reference/#graph-step).

Because these off-the-shelf TinkerPop steps are executable, such a TinkerPop Traversal can execute any Gremlin query and produce the correct answer. However, when executed against a large graph, TinkerPop steps can sometimes be very inefficient and slow. Instead of using them, Neptune tries to convert the traversal into a declarative form composed of groups of patterns, as described previously.

Neptune doesn't currently support all Gremlin operators (steps) in its native query engine. So it tries to collapse as many steps as possible down into a single `NeptuneGraphQueryStep`, which contains the declarative logical query plan for all the steps that have been converted. Ideally, all steps are converted. But when a step is encountered that can't be converted, Neptune breaks out of native execution and defers all query execution from that point forward to the TinkerPop steps. It doesn't try to weave in and out of native execution.

After the steps are translated into a logical query plan, Neptune runs a series of query optimizers that rewrite the query plan based on static analysis and estimated cardinalities. These optimizers do things like reorder operators based on range counts, prune unnecessary or redundant operators, rearrange filters, push operators into different groups, and so on.

After an optimized query plan is produced, Neptune creates a pipeline of physical operators that do the work of executing the query. This includes reading data from the statement indices, performing joins of various types, filtering, ordering, and so on. The pipeline produces a solution stream that is then converted back into a stream of TinkerPop Traverser objects.

## Serialization of query results
<a name="gremlin-explain-background-querying-serialization"></a>

Amazon Neptune currently relies on the TinkerPop response message serializers to convert query results (TinkerPop Traversers) into the serialized data to be sent over the wire back to the client. These serialization formats tend to be quite verbose.

For example, to serialize the result of a vertex query such as `g.V().limit(1)`, the Neptune query engine must perform a single search to produce the query result. However, the `GraphSON` serializer would perform a large number of additional searches to package the vertex into the serialization format. It would have to perform one search to get the label, one to get the property keys, and one search per property key for the vertex to get all the values for each key.

Some of the serialization formats are more efficient, but all require additional searches. Additionally, the TinkerPop serializers don't try to avoid duplicated searches, often resulting in many searches being repeated unnecessarily.

This makes it very important to write your queries so that they ask specifically just for the information they need. For example, `g.V().limit(1).id()` would return just the vertex ID and eliminate all the additional serializer searches. The [Gremlin `profile` API in Neptune](gremlin-profile-api.md) allows you to see how many search calls are made during query execution and during serialization.