Querying configured tables using the SQL code editor

As a member who can query, you can build a query manually by writing SQL code in the SQL code editor. The SQL code editor is located in the Analysis section of the Analysis tab in the AWS Clean Rooms console.

The SQL code editor is displayed by default. If you want to use the analysis builder to build queries, see Querying with the analysis builder.

Important

If you start writing a SQL query in the code editor and then turn on the Analysis builder UI, your query isn't saved.

AWS Clean Rooms supports many SQL commands, functions, and conditions. For more information, see the AWS Clean Rooms SQL Reference.

Tip

If scheduled maintenance occurs while a query is running, the query is terminated and rolled back. You must restart the query.

To query configured tables using the SQL code editor
  1. Sign in to the AWS Management Console and open the AWS Clean Rooms console at https://console.aws.amazon.com/cleanrooms.

  2. In the left navigation pane, choose Collaborations.

  3. Choose the collaboration that has Your member abilities status of Run queries.

  4. On the Analysis tab, under Tables, view the list of tables and their associated analysis rule type (Aggregation analysis rule, List analysis rule, or Custom analysis rule).

    Note

    If you don’t see the tables that you expect in the list, they might not have been associated to the collaboration yet, or an analysis rule might not be configured for them.

  5. (Optional) To view the table's schema and analysis rule controls, expand the table by selecting the plus sign icon (+).

  6. Under the Analysis section, for Analysis mode, select Write SQL code.

    Note

    The Analysis section displays only if the member who can receive results and the member who is responsible for paying for query compute costs have joined the collaboration as active members.

  7. Build the query by typing it into the SQL code editor.

    For more information about supported SQL commands and functions, see the AWS Clean Rooms SQL Reference.

    You can also use the following options to build your query. A sample query sketch appears at the end of this step.

    Use an example query

    To use an example query

    1. Select the three vertical dots next to the table.

    2. Under Insert in editor, choose Example query.

      Note

      Inserting an Example query appends it to the query already in the editor.

      The query example appears. All of the tables listed under Tables are included in the query.

    3. Edit the placeholder values in the query.

    Insert column names or functions

    To insert a column name or function

    1. Select the three vertical dots next to a column.

    2. Under Insert in editor, choose Column name.

    3. To manually insert a function that is permitted on a column:

      1. Select the three vertical dots next to a column.

      2. Select Insert in editor.

      3. Select the name of the permitted function (such as INNER JOIN, SUM, SUM DISTINCT, or COUNT).

    4. Press Ctrl + Space to view the table schemas in the code editor.

      Note

      Members who can query can view and use the partition columns in each configured table association. Ensure the partition column is labeled as a partition column in the AWS Glue table underlying the configured table.

    5. Edit the placeholder values in the query.
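
    The following is a minimal sketch of the kind of query you might type or adapt from an example query. The table names (consumer_purchases, advertiser_audience) and column names are hypothetical placeholders; replace them with the configured tables, columns, and functions that your collaboration's analysis rules allow.

        SELECT a.segment,
               COUNT(DISTINCT p.user_id) AS matched_users,
               SUM(p.purchase_amount) AS total_spend
        FROM consumer_purchases p
        INNER JOIN advertiser_audience a
            ON p.user_id = a.user_id
        GROUP BY a.segment

    Queries that don't comply with each table's analysis rule are rejected.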

  8. (Spark analytics engine only) Specify the supported Worker type and the Number of workers.

    You can choose the instance type and number of instances (workers) to run your SQL queries.

    For CR.1X, you can select from a minimum of 4 workers up to 128 workers.

    For CR.4X, you can select from a minimum of 4 workers up to 32 workers.

    Use the following table to determine the type and number of workers that you need for your use case. A worked example follows the note after the table.

    Worker type     | vCPU | Memory (GB) | Storage (GB) | Number of workers | Total Clean Rooms Processing Units (CRPU)
    CR.1X (default) | 4    | 30          | 100          | 4                 | 8
    CR.1X (default) | 4    | 30          | 100          | 128               | 256
    CR.4X           | 16   | 120         | 400          | 4                 | 32
    CR.4X           | 16   | 120         | 400          | 32                | 256
    Note

    Different worker types and number of workers have associated costs. To learn more about the pricing, see AWS Clean Rooms pricing.
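
    Based on the values in the preceding table, each CR.1X worker corresponds to 2 CRPU and each CR.4X worker corresponds to 8 CRPU. The following worked example is illustrative only; the worker counts are hypothetical choices, and AWS Clean Rooms pricing remains the authoritative source for costs.

        16 CR.1X workers x 2 CRPU per worker = 32 CRPU
         4 CR.4X workers x 8 CRPU per worker = 32 CRPU

    In this sketch, both configurations provide the same total CRPU, but the CR.4X configuration concentrates that capacity into fewer workers, each with more vCPU and memory.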

  9. For Send results to, specify who can receive results.

    Note

    To receive results, the collaboration member must be configured as a result receiver and must be an active participant in the collaboration (Status: Active).

  10. (Member who can query only) The Use your default result settings checkbox is selected by default. Keep this selected if you want to keep your default result settings.

    If you want to specify different result settings for this query, clear the Use your default result settings checkbox, and then specify the following.

    1. Result format (CSV or PARQUET)

    2. Result files (Single or Multiple)

    3. Results destination in Amazon S3

    Each member who can receive results can specify a different Result format, Result files, and Results destination in Amazon S3.
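
    For example, a hypothetical receiving member might use the following settings (the bucket name and prefix are placeholders, not values from this guide):

        Result format: PARQUET
        Result files: Multiple
        Results destination in Amazon S3: s3://amzn-s3-demo-bucket/cleanrooms/results/

    With these settings, the query output is written as multiple Parquet files under the specified prefix instead of a single CSV file.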

  11. (Spark analytics engine only) To specify Spark properties:

    1. Expand Spark properties.

    2. Choose Add Spark properties.

    3. On the Spark properties dialog box, choose a Property name from the dropdown list and enter a Value.

    The following table provides a definition for each property. An example of overriding a few properties appears after the table.

    For more information about Spark properties, see Spark Properties in the Apache Spark documentation.

    Property Name Description Default Value

    spark.task.maxFailures

    Controls how many consecutive times a task can fail before the job fails. Requires a value greater than or equal to 1. The number of allowed retries equals this value minus 1. The failure count resets if any attempt succeeds. Failures across different tasks don't accumulate toward this limit.

    4

    spark.sql.files.maxPartitionBytes

    Sets the maximum number of bytes to pack into a single partition when reading from file-based sources such as Parquet, JSON, and ORC.

    128MB

    spark.hadoop.fs.s3.maxRetries

    Sets the maximum number of retry attempts for Amazon S3 file operations.

    spark.network.timeout

    Sets the default timeout for all network interactions. Overrides the following timeout settings if they aren't configured:

    • spark.storage.blockManagerHeartbeatTimeoutMs

    • spark.shuffle.io.connectionTimeout

    • spark.rpc.askTimeout

    • spark.rpc.lookupTimeout

    TBD

    spark.rdd.compress

    Specifies whether to compress serialized RDD partitions using spark.io.compression.codec. Applies to StorageLevel.MEMORY_ONLY_SER in Java and Scala, or StorageLevel.MEMORY_ONLY in Python. Reduces storage space but requires additional CPU processing time.

    FALSE

    spark.shuffle.spill.compress

    Specifies whether to compress shuffle spill data using spark.io.compression.codec.

    TRUE

    spark.sql.adaptive.advisoryPartitionSizeInBytes

    Sets the target size in bytes for shuffle partitions during adaptive optimization when spark.sql.adaptive.enabled is true. Controls partition size when coalescing small partitions or splitting skewed partitions.

    (value of spark.sql.adaptive.shuffle.targetPostShuffleInputSize)

    spark.sql.adaptive.autoBroadcastJoinThreshold

    Sets the maximum table size in bytes for broadcasting to worker nodes during joins. Applies only in adaptive framework. Uses the same default value as spark.sql.autoBroadcastJoinThreshold. Set to -1 to disable broadcasting.

    (none)

    spark.sql.adaptive.coalescePartitions.enabled

    Specifies whether to coalesce contiguous shuffle partitions based on spark.sql.adaptive.advisoryPartitionSizeInBytes to optimize task size. Requires spark.sql.adaptive.enabled to be true.

    TRUE

    spark.sql.adaptive.coalescePartitions.initialPartitionNum

    Defines the initial number of shuffle partitions before coalescing. Requires both spark.sql.adaptive.enabled and spark.sql.adaptive.coalescePartitions.enabled to be true. Defaults to the value of spark.sql.shuffle.partitions.

    (none)

    spark.sql.adaptive.coalescePartitions.minPartitionSize

    Sets the minimum size for coalesced shuffle partitions to prevent partitions from becoming too small during adaptive optimization.

    1 MB

    spark.sql.adaptive.coalescePartitions.parallelismFirst

    Specifies whether to calculate partition sizes based on cluster parallelism instead of spark.sql.adaptive.advisoryPartitionSizeInBytes during partition coalescing. Generates smaller partition sizes than the configured target size to maximize parallelism. We recommend setting this to false on busy clusters to improve resource utilization by preventing excessive small tasks.

    TRUE

    spark.sql.adaptive.enabled

    Specifies whether to enable adaptive query execution to re-optimize query plans during query execution, based on accurate runtime statistics.

    TRUE

    spark.sql.adaptive.forceOptimizeSkewedJoin

    Specifies whether to force enable OptimizeSkewedJoin even if it introduces extra shuffle.

    FALSE

    spark.sql.adaptive.localShuffleReader.enabled

    Specifies whether to use local shuffle readers when shuffle partitioning isn't required, such as after converting from sort-merge joins to broadcast-hash joins. Requires spark.sql.adaptive.enabled to be true.

    TRUE

    spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold

    Sets the maximum partition size in bytes for building local hash maps. Prioritizes shuffled hash joins over sort-merge joins when:

    • This value equals or exceeds spark.sql.adaptive.advisoryPartitionSizeInBytes

    • All partition sizes are within this limit

    Overrides spark.sql.join.preferSortMergeJoin setting.

    0 bytes

    spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled

    Specifies whether to optimize skewed shuffle partitions by splitting them into smaller partitions based on spark.sql.adaptive.advisoryPartitionSizeInBytes. Requires spark.sql.adaptive.enabled to be true.

    TRUE

    spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor

    Defines the size threshold factor for merging partitions during splitting. Partitions smaller than this factor multiplied by spark.sql.adaptive.advisoryPartitionSizeInBytes are merged.

    0.2

    spark.sql.adaptive.skewJoin.enabled

    Specifies whether to handle data skew in shuffled joins by splitting and optionally replicating skewed partitions. Applies to sort-merge and shuffled hash joins. Requires spark.sql.adaptive.enabled to be true.

    TRUE

    spark.sql.adaptive.skewJoin.skewedPartitionFactor

    Sets the size factor used to identify partition skew. A partition is skewed when its size exceeds both:

    • This factor multiplied by the median partition size

    • The value of spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes

    5

    spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes

    Sets the size threshold in bytes for identifying skewed partitions. A partition is skewed when its size exceeds both:

    • This threshold

    • The median partition size multiplied by spark.sql.adaptive.skewJoin.skewedPartitionFactor

    We recommend setting this value larger than spark.sql.adaptive.advisoryPartitionSizeInBytes.

    256MB

    spark.sql.autoBroadcastJoinThreshold

    Sets the maximum table size in bytes for broadcasting to worker nodes during joins. Set to -1 to disable broadcasting.

    10MB

    spark.sql.broadcastTimeout

    Controls the timeout period in seconds for the broadcast operations during broadcast joins.

    300 seconds

    spark.sql.cbo.enabled

    Specifies whether to enable cost-based optimization (CBO) for plan statistics estimation.

    FALSE

    spark.sql.cbo.joinReorder.dp.star.filter

    Specifies whether to apply star-join filter heuristics during cost-based join enumeration.

    FALSE

    spark.sql.cbo.joinReorder.dp.threshold

    Sets the maximum number of joined nodes allowed in the dynamic programming algorithm.

    12

    spark.sql.cbo.joinReorder.enabled

    Specifies whether to enable join reordering in cost-based optimization (CBO).

    FALSE

    spark.sql.cbo.planStats.enabled

    Specifies whether to fetch row counts and column statistics from the catalog during logical plan generation.

    FALSE

    spark.sql.cbo.starSchemaDetection

    Specifies whether to enable join reordering based on star schema detection.

    FALSE

    spark.sql.crossJoin.enabled

    Specifies whether to allow cross joins (Cartesian products) without requiring explicit CROSS JOIN syntax.

    TRUE

    spark.sql.files.maxPartitionNum

    Sets the target maximum number of split file partitions for file-based sources (Parquet, JSON, and ORC). Rescales partitions when the initial count exceeds this value. This is a suggested target, not a guaranteed limit.

    (none)

    spark.sql.files.maxRecordsPerFile

    Sets the maximum number of records to write to a single file. No limit applies when set to zero or a negative value.

    0

    spark.sql.files.minPartitionNum

    Sets the target minimum number of split file partitions for file-based sources (Parquet, JSON, and ORC). Defaults to spark.sql.leafNodeDefaultParallelism. This is a suggested target, not a guaranteed limit.

    (none)

    spark.sql.inMemoryColumnarStorage.batchSize

    Controls the batch size for columnar caching. Increasing the size improves memory utilization and compression but increases the risk of out-of-memory errors.

    10000

    spark.sql.inMemoryColumnarStorage.compressed

    Specifies whether to automatically select compression codecs for columns based on data statistics.

    TRUE

    spark.sql.inMemoryColumnarStorage.enableVectorizedReader

    Specifies whether to enable vectorized reading for columnar caching.

    TRUE

    spark.sql.legacy.allowHashOnMapType

    Specifies whether to allow hash operations on map type data structures. This legacy setting maintains compatibility with older Spark versions' map type handling.

    spark.sql.legacy.allowNegativeScaleOfDecimal

    Specifies whether to allow negative scale values in decimal type definitions. This legacy setting maintains compatibility with older Spark versions that supported negative decimal scales.

    spark.sql.legacy.castComplexTypesToString.enabled

    Specifies whether to enable legacy behavior for casting complex types to strings. Maintains compatibility with older Spark versions' type conversion rules.

    spark.sql.legacy.charVarcharAsString

    Specifies whether to treat CHAR and VARCHAR types as STRING types. This legacy setting provides compatibility with older Spark versions' string type handling.

    spark.sql.legacy.createEmptyCollectionUsingStringType

    Specifies whether to create empty collections using string type elements. This legacy setting maintains compatibility with older Spark versions' collection initialization behavior.

    spark.sql.legacy.exponentLiteralAsDecimal.enabled

    Specifies whether to interpret exponential literals as decimal types. This legacy setting maintains compatibility with older Spark versions' numeric literal handling.

    spark.sql.legacy.json.allowEmptyString.enabled

    Specifies whether to allow empty strings in JSON processing. This legacy setting maintains compatibility with older Spark versions' JSON parsing behavior.

    spark.sql.legacy.parquet.int96RebaseModeInRead

    Specifies whether to use legacy INT96 timestamp rebase mode when reading Parquet files. This legacy setting maintains compatibility with older Spark versions' timestamp handling.

    spark.sql.legacy.timeParserPolicy

    Controls the time parsing behavior for backwards compatibility. This legacy setting determines how timestamps and dates are parsed from strings.

    spark.sql.legacy.typeCoercion.datetimeToString.enabled

    Specifies whether to enable legacy type coercion behavior when converting datetime values to strings. Maintains compatibility with older Spark versions' datetime conversion rules.

    spark.sql.maxSinglePartitionBytes

    Sets the maximum partition size in bytes. The planner introduces shuffle operations for larger partitions to improve parallelism.

    128m

    spark.sql.metadataCacheTTLSeconds

    Controls the time-to-live (TTL) for metadata caches. Applies to partition file metadata and session catalog caches. Requires:

    • A positive value greater than zero

    • spark.sql.catalogImplementation set to hive

    • spark.sql.hive.filesourcePartitionFileCacheSize greater than zero

    • spark.sql.hive.manageFilesourcePartitions set to true

    -1000ms

    spark.sql.optimizer.collapseProjectAlwaysInline

    Specifies whether to collapse adjacent projections and inline expressions, even when it causes duplication.

    FALSE

    spark.sql.optimizer.dynamicPartitionPruning.enabled

    Specifies whether to generate predicates for partition columns used as join keys.

    TRUE

    spark.sql.optimizer.enableCsvExpressionOptimization

    Specifies whether to optimize CSV expressions in the SQL optimizer by pruning unnecessary columns from from_csv operations.

    TRUE

    spark.sql.optimizer.enableJsonExpressionOptimization

    Specifies whether to optimize JSON expressions in the SQL optimizer by:

    • Pruning unnecessary columns from from_json operations

    • Simplifying from_json and to_json combinations

    • Optimizing named_struct operations

    TRUE

    spark.sql.optimizer.excludedRules

    Defines optimizer rules to disable, identified by comma-separated rule names. Some rules cannot be disabled as they are required for correctness. The optimizer logs which rules are successfully disabled.

    (none)

    spark.sql.optimizer.runtime.bloomFilter.applicationSideScanSizeThreshold

    Sets the minimum aggregated scan size in bytes required to inject a Bloom filter on the application side.

    10GB

    spark.sql.optimizer.runtime.bloomFilter.creationSideThreshold

    Defines the maximum size threshold for injecting a Bloom filter on the creation side.

    10MB

    spark.sql.optimizer.runtime.bloomFilter.enabled

    Specifies whether to insert a Bloom filter to reduce shuffle data when one side of a shuffle join has a selective predicate.

    TRUE

    spark.sql.optimizer.runtime.bloomFilter.expectedNumItems

    Defines the default number of expected items in the runtime Bloom filter.

    1000000

    spark.sql.optimizer.runtime.bloomFilter.maxNumBits

    Sets the maximum number of bits allowed in the runtime Bloom filter.

    67108864

    spark.sql.optimizer.runtime.bloomFilter.maxNumItems

    Sets the maximum number of expected items allowed in the runtime Bloom filter.

    4000000

    spark.sql.optimizer.runtime.bloomFilter.number.threshold

    Limits the maximum number of non-DPP runtime filters allowed per query to prevent out-of-memory errors in the driver.

    10

    spark.sql.optimizer.runtime.bloomFilter.numBits

    Defines the default number of bits used in the runtime Bloom filter.

    8388608

    spark.sql.optimizer.runtime.rowlevelOperationGroupFilter.enabled

    Specifies whether to enable runtime group filtering for row-level operations. Allows data sources to:

    • Prune entire groups of data (such as files or partitions) using data source filters

    • Execute runtime queries to identify matching records

    • Discard unnecessary groups to avoid expensive rewrites

    Limitations:

    • Not all expressions can convert to data source filters

    • Some expressions require Spark evaluation (such as subqueries)

    TRUE

    spark.sql.optimizer.runtimeFilter.semiJoinReduction.enabled

    Specifies whether to insert a semi-join to reduce shuffle data when one side of a shuffle join has a selective predicate.

    FALSE

    spark.sql.parquet.aggregatePushdown

    Specifies whether to push down aggregates to Parquet for optimization. Supports:

    • MIN and MAX for boolean, integer, float, and date types

    • COUNT for all data types

    Throws an exception if statistics are missing from any Parquet file footer.

    FALSE

    spark.sql.parquet.columnarReaderBatchSize

    Controls the number of rows in each Parquet vectorized reader batch. Choose a value that balances performance overhead and memory usage to prevent out-of-memory errors.

    4096

    spark.sql.session.timeZone

    Defines the session time zone for handling timestamps in string literals and Java object conversion. Accepts:

    • Region-based IDs in area/city format (such as America/Los_Angeles)

    • Zone offsets in (+/-)HH, (+/-)HH:mm, or (+/-)HH:mm:ss format (such as -08 or +01:00)

    • UTC or Z as aliases for +00:00

    (value of local timezone)

    spark.sql.shuffle.partitions

    Sets the default number of partitions for data shuffling during joins or aggregations. Cannot be modified between structured streaming query restarts from the same checkpoint location.

    200

    spark.sql.shuffledHashJoinFactor

    Defines the multiplication factor used to determine shuffle hash join eligibility. A shuffle hash join is selected when the small-side data size multiplied by this factor is less than the large-side data size.

    3

    spark.sql.sources.parallelPartitionDiscovery.threshold

    Sets the maximum number of paths for driver-side file listing with file-based sources (Parquet, JSON, and ORC). When exceeded during partition discovery, files are listed using a separate Spark distributed job.

    32

    spark.sql.statistics.histogram.enabled

    Specifies whether to generate equi-height histograms during column statistics computation to improve estimation accuracy. Requires an additional table scan beyond the one needed for basic column statistics.

    FALSE

    Source: https://spark.apache.org/docs/latest/configuration.html#spark-properties
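
    As an illustration only, a query that shuffles a large amount of data might override a few of the defaults listed above. The property names come from the preceding table; the values shown are hypothetical and should be tuned for your own workload:

        spark.sql.shuffle.partitions    400
        spark.sql.adaptive.enabled      TRUE
        spark.sql.session.timeZone      America/Los_Angeles

    Add each pair in the Spark properties dialog box by choosing the Property name from the dropdown list and entering the Value.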

  12. Choose Run.

    Note

    You can't run the query if the member who can receive results hasn’t configured the query results settings.

  13. View the Results.

    For more information, see Receiving and using analysis results.

  14. Continue to adjust parameters and run your query again, or choose the + button to start a new query in a new tab.

Note

AWS Clean Rooms aims to provide clear error messaging. If an error message doesn't have enough details to help you troubleshoot, contact the account team. Provide them with a description of how the error occurred and the error message (including any identifiers). For more information, see Troubleshooting AWS Clean Rooms.