

# openCypher query best practices
<a name="best-practices-content"></a>

## Use the SET clause to remove multiple properties at once
<a name="best-practices-content-1"></a>

 When using the openCypher language, REMOVE removes properties from an entity. In Neptune Analytics, each property being removed requires a separate operation, which adds query latency. Instead, use SET with a map that sets the property values to `null`, which in Neptune Analytics is equivalent to removing the properties. This performs significantly better when multiple properties must be removed from a single entity. 

Use:

```
WITH {prop1: null, prop2: null, prop3: null} as propertiesToRemove 
MATCH (n) 
SET n += propertiesToRemove
```

Instead of:

```
MATCH (n) 
REMOVE n.prop1, n.prop2, n.prop3
```
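The null map can also be built on the client and passed as a query parameter. A minimal Python sketch, assuming your client serializes Python `None` to openCypher `null` (the property names and the `propertiesToRemove` parameter name are illustrative):

```python
# Build a {property_name: None} map to pass as the query parameter.
# None serializes to openCypher null, which removes the property on SET.
props_to_remove = ["prop1", "prop2", "prop3"]
parameters = {"propertiesToRemove": {p: None for p in props_to_remove}}

query = """
MATCH (n)
SET n += $propertiesToRemove
"""
```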

## Use parameterized queries
<a name="best-practices-content-2"></a>

 Always use parameterized queries when querying with openCypher. The query engine can take advantage of repeated parameterized queries through features like the query plan cache: repeated invocations of the same parameterized structure with different parameter values can reuse the cached plan. Note that the plan generated for a parameterized query is cached and reused only when the query completes within 100ms and the parameter types are NUMBER, BOOLEAN, or STRING. 

Use:

```
MATCH (n:foo) WHERE id(n) = $id RETURN n
```

With parameters:

```
parameters={"id": "first"}
parameters={"id": "second"}
parameters={"id": "third"}
```

Instead of:

```
MATCH (n:foo) WHERE id(n) = "first" RETURN n
MATCH (n:foo) WHERE id(n) = "second" RETURN n
MATCH (n:foo) WHERE id(n) = "third" RETURN n
```

 You can determine whether a query is using a cached plan by observing the `plan cache hits:` value in the output of the [openCypher explain endpoint](https://docs.aws.amazon.com//neptune-analytics/latest/userguide/query-explain.html). 
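The key point is that the query text stays byte-for-byte identical across invocations while only the parameter values change. A minimal Python sketch of how a client might prepare such requests (the request shape and field names here are illustrative, not a specific Neptune client API):

```python
# One fixed query string; only the parameter values change between calls,
# so the engine can reuse the cached plan (subject to the 100 ms and
# NUMBER/BOOLEAN/STRING parameter-type limits described above).
QUERY = "MATCH (n:foo) WHERE id(n) = $id RETURN n"

def build_requests(ids):
    # Each request reuses the identical query text with a different $id value.
    return [{"queryString": QUERY, "parameters": {"id": i}} for i in ids]

requests = build_requests(["first", "second", "third"])
```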

## Use flattened maps instead of nested maps in UNWIND clause
<a name="best-practices-content-3"></a>

 Deeply nested structures can restrict the ability of the query engine to generate an optimal query plan. The engine does, however, generate optimal plans for the following defined patterns: 
+  Scenario 1: UNWIND with a list of cypher literals (NUMBER, STRING, and BOOLEAN). 
+  Scenario 2: UNWIND with a list of flattened maps that contain only cypher literals (NUMBER, STRING, BOOLEAN) as values. 

 When writing a query containing an UNWIND clause, use the recommendations above to improve performance. 

Scenario 1 example:

```
UNWIND $ids as x
MATCH(t:ticket {`~id`: x})
```

With parameters:

```
parameters={
  "ids": [1, 2, 3]
}
```

 An example for Scenario 2 is to generate a list of nodes to CREATE or MERGE. Instead of issuing multiple statements, use the following pattern to define the properties as a set of flattened maps: 

```
UNWIND $props as p
CREATE(t:ticket {title: p.title, severity:p.severity})
```

With parameters:

```
parameters={
  "props": [
    {"title": "food poisoning", "severity": "2"},
    {"title": "Simone is in office", "severity": "3"}
  ]
}
```

Instead of nested node objects like:

```
UNWIND $nodes as n
CREATE(t:ticket n.properties)
```

With parameters:

```
parameters={
  "nodes": [
    {"id": "ticket1", "properties": {"title": "food poisoning", "severity": "2"}},
    {"id": "ticket2", "properties": {"title": "Simone is in office", "severity": "3"}}
  ]
}
```
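If your application data arrives in the nested shape shown above, you can flatten it on the client before sending the query. A small Python sketch, assuming the nested structure from the example (the variable names are illustrative):

```python
# Flatten nested node objects into a list of flat property maps so the
# UNWIND clause only sees literal values (NUMBER, STRING, BOOLEAN).
nodes = [
    {"id": "ticket1", "properties": {"title": "food poisoning", "severity": "2"}},
    {"id": "ticket2", "properties": {"title": "Simone is in office", "severity": "3"}},
]

parameters = {
    "props": [
        {"title": n["properties"]["title"], "severity": n["properties"]["severity"]}
        for n in nodes
    ]
}
```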

## Place more restrictive nodes on the left side in Variable-Length Path (VLP) expressions
<a name="best-practices-content-4"></a>

 In Variable-Length Path (VLP) queries, the query engine optimizes the evaluation by choosing to start the traversal on the left or right side of the expression. The decision is based on the cardinality of the patterns on the left and right side. Cardinality is the number of nodes matching the specified pattern. 
+  If the right pattern has a cardinality of one, then the right side will be the starting point. 
+  If both the left and the right side have a cardinality of one, the expansion is checked on both sides and evaluation starts on the side with the smaller expansion. Expansion is the number of outgoing or incoming edges for the node on the left and the node on the right side of the VLP expression. This part of the optimization is used only if the VLP relationship is unidirectional and the relationship type is provided. 
+  Otherwise, the left side will be the starting point. 

 For a chain of VLP expressions, this optimization can only be applied to the first expression. The other VLPs are evaluated starting with the left side. As an example, let the cardinality of (a) be one, and the cardinality of (c) be greater than one. 
+  `(a)-[*1..]->(c)`: Evaluation starts with (a). 
+  `(c)-[*1..]->(a)`: Evaluation starts with (a). 
+  `(a)-[*1..]-(c)`: Evaluation starts with (a). 
+  `(c)-[*1..]-(a)`: Evaluation starts with (a). 

 Now let (a) and (b) both have a cardinality of one, with two incoming and three outgoing edges for (a), and four incoming and five outgoing edges for (b). 
+  `(a)-[*1..]->(b)`: Evaluation starts with (a) as the outgoing edges of (a) are less than the incoming edges of (b). 
+  `(a)<-[*1..]-(b)`: Evaluation starts with (a) as the incoming edges of (a) are less than the outgoing edges of (b). 

 As a general rule, place the more restrictive pattern on the left side of a VLP expression. 

## Avoid redundant node label checks by using granular relationship names
<a name="best-practices-content-5"></a>

 When optimizing for performance, using relationship labels that are exclusive to node patterns allows the removal of label filtering on nodes. Consider a graph model where the relationship `likes` is only used to define a relationship between two `person` nodes. We could write the following query to find this pattern: 

```
MATCH (n:person)-[:likes]->(m:person)
RETURN n, m
```

 The `person` label check on n and m is redundant, as we defined the relationship to only appear when both are of the type `person`. To optimize on performance, we can write the query as follows: 

```
MATCH (n)-[:likes]->(m)
RETURN n, m
```

 This pattern can also apply when properties are exclusive to a single node label. Assume that only `person` nodes have the property `email`, therefore verifying the node label matches `person` is redundant. Writing this query as: 

```
MATCH (n:person)
WHERE n.email = 'xxx@gmail.com'
RETURN n
```

 Is less efficient than writing this query as: 

```
MATCH (n)
WHERE n.email = 'xxx@gmail.com'
RETURN n
```

 You should only adopt this pattern when performance is important and you have checks in your modeling process to ensure these edge labels are not reused for patterns involving other node labels. If you later introduce an `email` property on another node label such as `company`, then the results will differ between these two versions of the query. 

## Specify edge labels where possible
<a name="best-practices-content-6"></a>

 It is recommended to provide an edge label where possible when specifying an edge in a pattern. Consider the following example query, which is used to link all of the people living in a city with all of the people who visited that city. 

```
MATCH (person)-->(city {country: "US"})-->(anotherPerson)
RETURN person, anotherPerson
```

 If your graph model links people to nodes other than cities using multiple edge labels, then without an edge label in the pattern Neptune must evaluate additional paths that are later discarded. In the query above, because no edge label was given, the engine does more work first and then filters out values to obtain the correct result. A better version of the above query is: 

```
MATCH (person)-[:livesIn]->(city {country: "US"})-[:visitedBy]->(anotherPerson)
RETURN person, anotherPerson
```

 This not only helps in evaluation, but enables the query planner to create better plans. You could even combine this best practice with redundant node label checks to remove the city label check and write the query as: 

```
MATCH (person)-[:livesIn]->({country: "US"})-[:visitedBy]->(anotherPerson)
RETURN person, anotherPerson
```

## Avoid using the WITH clause when possible
<a name="best-practices-content-7"></a>

 The WITH clause in openCypher acts as a boundary: everything before it executes first, and the resulting values are passed to the remaining portions of the query. The WITH clause is needed when you require interim aggregation or want to limit the number of results, but otherwise you should try to avoid it. The general guidance is to remove simple WITH clauses (those without aggregation, ORDER BY, or LIMIT) so the query planner can work on the entire query and create a globally optimal plan. As an example, assume you wrote a query to return all people living in `India`: 

```
MATCH (person)-[:lives_in]->(city)
WITH person, city
MATCH (city)-[:part_of]->(country {name: 'India'})
RETURN collect(person) AS result
```

 In the version above, the WITH clause prevents the planner from placing the more restrictive pattern `(city)-[:part_of]->(country {name: 'India'})` before the less restrictive `(person)-[:lives_in]->(city)`. This makes the plan sub-optimal. The optimization is to remove the WITH clause and let the planner compute the best plan: 

```
MATCH (person)-[:lives_in]->(city)
MATCH (city)-[:part_of]->(country {name: 'India'})
RETURN collect(person) AS result
```

## Place restrictive filters as early in the query as possible
<a name="best-practices-content-8"></a>

 In all scenarios, early placement of filters in the query helps in reducing the intermediate solutions a query plan must consider. This means less memory and fewer compute resources are needed to execute the query. 

 The following example helps you understand these impacts. Suppose you write a query to return all of the people who live in `India`. One version of the query could be: 

```
MATCH (n)-[:lives_in]->(city)-[:part_of]->(country)
WITH country, collect(n.firstName + " "  + n.lastName) AS result
WHERE country.name = 'India'
RETURN result
```

 The above version of the query is not the most optimal way to achieve this use case. Because the filter `country.name = 'India'` appears late in the query, the engine first collects all persons and where they live, groups them by country, and only then filters for the group where `country.name = 'India'`. The optimal way is to filter for only the people living in `India` first, and then perform the collect aggregation: 

```
MATCH (n)-[:lives_in]->(city)-[:part_of]->(country)
WHERE country.name = 'India'
RETURN collect(n.firstName + " "  + n.lastName) AS result
```

 A general rule is to place a filter as soon as possible after the variable is introduced. 

## Explicitly check whether properties exist
<a name="best-practices-content-9"></a>

 Based on openCypher semantics, when a property is accessed it is equivalent to an optional join and must retain all rows even if the property does not exist. If you know based on your graph schema that a particular property will always exist for that entity, explicitly checking that property for existence allows the query engine to create optimal plans and improve performance. 

 Consider a graph model where nodes of type `person` always have a property `name`. Instead of doing this: 

```
MATCH (n:person)
RETURN n.name
```

 Explicitly verify the property existence in the query with an IS NOT NULL check: 

```
MATCH (n:person)
WHERE n.name IS NOT NULL
RETURN n.name
```

## Do not use named path (unless it is required)
<a name="best-practices-content-10"></a>

 A named path in a query always comes at an additional cost, which can add penalties in terms of higher latency and memory usage. Consider the following query: 

```
MATCH p = (n)-[:commentedOn]->(m)
WITH p, m, n, n.score + m.score as total
WHERE total > 100 
MATCH (m)-[:commentedOn]->(o)
WITH DISTINCT p, m, n, o
RETURN p, m.name, n.name, o.name
```

 In the above query, assuming we only want the properties of the nodes, the use of the named path `p` is unnecessary. Because the path variable is carried along, the aggregation operation using DISTINCT becomes expensive both in terms of time and memory usage. A more optimized version of the above query is: 

```
MATCH (n)-[:commentedOn]->(m)
WITH m, n, n.score + m.score as total
WHERE total > 100 
MATCH (m)-[:commentedOn]->(o)
WITH DISTINCT m, n, o
RETURN m.name, n.name, o.name
```

## Avoid COLLECT(DISTINCT())
<a name="best-practices-content-11"></a>

 COLLECT(DISTINCT()) is used to form a list containing distinct values. COLLECT is an aggregation function, and grouping is done based on the additional keys projected in the same statement. When DISTINCT is used inside COLLECT, the input is split into multiple chunks, where each chunk denotes one group for reduction, and performance degrades as the number of groups increases. In Neptune Analytics, it is much more efficient to apply DISTINCT before forming the list, which allows grouping to be done directly on the grouping keys for the whole chunk. 

 Consider the following query: 

```
MATCH (n:Person)-[:commented_on]->(p:Post)
WITH n, collect(distinct(p.post_id)) as post_list
RETURN n, post_list
```

 A more optimal way of writing this query is: 

```
MATCH (n:Person)-[:commented_on]->(p:Post)
WITH DISTINCT n, p.post_id as postId
WITH n, collect(postId) as post_list
RETURN n, post_list
```

## Prefer the properties function over individual property lookup when retrieving all property values
<a name="best-practices-content-12"></a>

 The `properties()` function is used to return a map containing all properties for an entity, and is much more efficient than returning properties individually. 

 Assuming your `Person` nodes contain 5 properties, `firstName`, `lastName`, `age`, `dept`, and `company`, the following query would be preferred: 

```
MATCH (n:Person)
WHERE n.dept = 'AWS'
RETURN properties(n) as personDetails
```

 Rather than using: 

```
MATCH (n:Person)
WHERE n.dept = 'AWS'
RETURN n.firstName, n.lastName, n.age, n.dept, n.company
    
=== OR ===
    
MATCH (n:Person)
WHERE n.dept = 'AWS'
RETURN {firstName: n.firstName, lastName: n.lastName, age: n.age, 
department: n.dept, company: n.company} as personDetails
```

## Perform static computations outside of the query
<a name="best-practices-content-13"></a>

 It is recommended to resolve static computations (simple mathematical or string operations) on the client side. Consider this example, where you want to find all people who are at most one year older than the author: 

```
MATCH (m:Message)-[:HAS_CREATOR]->(p:person)
WHERE p.age <= ($age + 1)
RETURN m
```

 Here, `$age` is injected into the query via parameters and then added to a fixed value, and the result is compared with `p.age`. A better approach is to do the addition on the client side and pass the calculated value as a parameter, `$ageplusone`. This helps the query engine create optimized plans, and avoids the static computation for each incoming row. Following this guideline, a more efficient version of the query is: 

```
MATCH (m:Message)-[:HAS_CREATOR]->(p:person)
WHERE p.age <= $ageplusone
RETURN m
```
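The client-side computation above can be sketched in a few lines of Python (the variable names are illustrative):

```python
# Compute the static expression once on the client, then pass the result
# as a parameter instead of evaluating ($age + 1) for each incoming row.
age = 25
parameters = {"ageplusone": age + 1}

query = """
MATCH (m:Message)-[:HAS_CREATOR]->(p:person)
WHERE p.age <= $ageplusone
RETURN m
"""
```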

## Batch inputs using UNWIND instead of individual statements
<a name="best-practices-content-14"></a>

 Whenever the same query needs to be executed for different inputs, it is much more performant to run the query for a batch of inputs instead of executing one query per input. 

 If you want to merge on a set of nodes, one option is to run a merge query per input: 

```
MERGE (n:Person {`~id`: $id})
SET n.name = $name, n.age = $age, n.employer = $employer
```

 With parameters: 

```
params = {id: '1', name: 'john', age: 25, employer: 'Amazon'}
```

 The above query needs to be executed once per input. While this approach works, it may require many queries to be executed for a large set of inputs. In this scenario, batching helps reduce the number of queries executed on the server and improves overall throughput. 

 Use the following pattern: 

```
UNWIND $persons as person
MERGE (n:Person {`~id`: person.id})
SET n += person
```

 With parameters: 

```
params = {persons: [{id: '1', name: 'john', age: 25, employer: 'Amazon'}, 
{id: '2', name: 'jack', age: 28, employer: 'Amazon'},
{id: '3', name: 'alice', age: 24, employer: 'Amazon'}...]}
```

 Experimentation with different batch sizes is recommended to determine what works best for your workload. 
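A simple way to produce those batches on the client is to chunk the full input list; each chunk then becomes the `$persons` parameter of one query execution. A Python sketch, assuming a configurable batch size (the size of 200 is only an example):

```python
def batches(items, batch_size):
    # Split the full input list into fixed-size chunks; each chunk becomes
    # the $persons parameter of one UNWIND ... MERGE query execution.
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

persons = [{"id": str(i), "name": f"user{i}"} for i in range(1000)]
requests = [{"persons": chunk} for chunk in batches(persons, 200)]
```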

## Prefer using custom IDs for nodes
<a name="best-practices-content-15"></a>

 Neptune Analytics allows users to explicitly assign IDs to nodes. To be useful, the ID must be globally unique in the dataset and deterministic. A deterministic ID can be used as a lookup or filtering mechanism just like a property; however, using an ID is much more optimized from a query execution perspective than using properties. There are several benefits to using custom IDs: 
+  Properties can be null for an existing entity, but the ID must exist. This allows the query engine to use an optimized join during execution. 
+  When concurrent mutation queries are executed, the chance of [concurrent modification exceptions](https://docs.aws.amazon.com//neptune/latest/userguide/transactions-exceptions.html) (CMEs) is reduced significantly when IDs are used to access nodes, because fewer locks are taken on IDs than on properties due to their enforced uniqueness. 
+  Using IDs avoids creating duplicate data, as Neptune enforces uniqueness on IDs but not on properties. 

 The following query example uses a custom ID: 

**Note**  
 The property `~id` is used to specify the ID, whereas `id` is just stored as any other property. 

```
CREATE (n:Person {`~id`: '1', name: 'alice'})
```

 Without using a custom ID: 

```
CREATE (n:Person {id: '1', name: 'alice'})
```

 If using the latter mechanism, there is no uniqueness enforcement and you could later execute the query: 

```
CREATE (n:Person {id: '1', name: 'john'})
```

 This creates a second node with `id=1` named `john`. You would now have two nodes with `id=1`, each with a different name (alice and john). 
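One common way to obtain a deterministic, globally unique ID is to derive it on the client from a natural key. A Python sketch, assuming an email address serves as the natural key (the `person-` prefix, the hash truncation, and the choice of key are all illustrative, not prescribed by Neptune):

```python
import hashlib

def person_id(email):
    # Derive a deterministic node ID from a natural key: the same input
    # always yields the same ID, so repeated loads hit the same node.
    return "person-" + hashlib.sha256(email.encode("utf-8")).hexdigest()[:16]
```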

## Avoid doing `~id` computations in the query
<a name="best-practices-content-16"></a>

 When using custom IDs in the queries, always perform static computations outside the queries and provide these values in the parameters. When static values are provided, the engine is better able to optimize lookups and avoid scanning and filtering these values. 

 If you want to create edges between nodes that already exist in the database, one option could be: 

```
UNWIND $sections as section
MATCH (s:Section {`~id`: 'Sec-' + section.id})
MERGE (s)-[:IS_PART_OF]->(g:Group {`~id`: 'g1'})
```

 With parameters: 

```
parameters={sections: [{id: '1'}, {id: '2'}]}
```

 In the query above, the ID of the section is computed inside the query. Because the computation is dynamic, the engine cannot statically inline the IDs, so it scans all section nodes and then post-filters for the required ones. This can be costly if there are many section nodes in the database. 

 A better way is to prepend `Sec-` to the IDs on the client side before passing them in: 

```
UNWIND $sections as section
MATCH (s:Section {`~id`: section.id})
MERGE (s)-[:IS_PART_OF]->(g:Group {`~id`: 'g1'})
```

 With parameters: 

```
parameters={sections: [{id: 'Sec-1'}, {id: 'Sec-2'}]}
```
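Building those prefixed parameters on the client is a one-liner. A Python sketch, assuming the raw section IDs are available as strings (the variable names are illustrative):

```python
# Prepend the static 'Sec-' prefix on the client so the query receives
# final IDs and the engine can perform direct ID lookups instead of scans.
section_ids = ["1", "2"]
parameters = {"sections": [{"id": "Sec-" + s} for s in section_ids]}
```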