Join hints - AWS Clean Rooms

Join hints

Join hints suggest join strategies for query execution. The syntax, arguments, and some examples come from the Apache Spark SQL Reference for more information

BROADCAST

Suggests that AWS Clean Rooms use broadcast join. The join side with the hint will be broadcast regardless of autoBroadcastJoinThreshold. If both sides of the join have the broadcast hints, the one with the smaller size (based on stats) will be broadcast.

Aliases: BROADCASTJOIN, MAPJOIN

Parameters: Table identifiers (optional)

Examples:

-- Broadcast a specific table SELECT /*+ BROADCAST(students) */ e.name, s.course FROM employees e JOIN students s ON e.id = s.id; -- Broadcast multiple tables SELECT /*+ BROADCASTJOIN(s, d) */ * FROM employees e JOIN students s ON e.id = s.id JOIN departments d ON e.dept_id = d.id;

MERGE

Suggests that AWS Clean Rooms use shuffle sort merge join.

Aliases: SHUFFLE_MERGE, MERGEJOIN

Parameters: Table identifiers (optional)

Examples:

-- Use merge join for a specific table SELECT /*+ MERGE(employees) */ * FROM employees e JOIN students s ON e.id = s.id; -- Use merge join for multiple tables SELECT /*+ MERGEJOIN(e, s, d) */ * FROM employees e JOIN students s ON e.id = s.id JOIN departments d ON e.dept_id = d.id;

SHUFFLE_HASH

Suggests that AWS Clean Rooms use shuffle hash join. If both sides have the shuffle hash hints, the query optimizer chooses the smaller side (based on stats) as the build side.

Parameters: Table identifiers (optional)

Examples:

-- Use shuffle hash join SELECT /*+ SHUFFLE_HASH(students) */ * FROM employees e JOIN students s ON e.id = s.id;

SHUFFLE_REPLICATE_NL

Suggests that AWS Clean Rooms use shuffle-and-replicate nested loop join.

Parameters: Table identifiers (optional)

Examples:

-- Use shuffle-replicate nested loop join SELECT /*+ SHUFFLE_REPLICATE_NL(students) */ * FROM employees e JOIN students s ON e.id = s.id;

Troubleshooting Hints in Spark SQL

The following table shows common scenarios where hints are not applied in SparkSQL. For additional information, see Considerations and limitations.

Use Case Example Query
Table reference not found
SELECT /*+ BROADCAST(fake_table) */ * FROM employees e INNER JOIN students s ON e.eid = s.sid;
Table not participating in join operation
SELECT /*+ BROADCAST(s) */ * FROM students s WHERE s.age > 25;
Table reference in nested subquery
SELECT /*+ BROADCAST(s) */ * FROM employees e INNER JOIN (SELECT * FROM students s WHERE s.age > 20) sub ON e.eid = sub.sid;
Column name instead of table reference
SELECT /*+ BROADCAST(e.eid) */ * FROM employees e INNER JOIN students s ON e.eid = s.sid;
Hint without required parameters
SELECT /*+ BROADCAST */ * FROM employees e INNER JOIN students s ON e.eid = s.sid;
Base table name instead of table alias
SELECT /*+ BROADCAST(employees) */ * FROM employees e INNER JOIN students s ON e.eid = s.sid;