Evaluate data quality transform - Amazon SageMaker Unified Studio

Evaluate data quality transform

The Evaluate Data Quality transform validates data against a set of rules as it flows through your Visual ETL job. You define the rules in DQDL (Data Quality Definition Language), a domain-specific language for data quality rules that provides 31 built-in rule types.

To add an Evaluate Data Quality transform
  1. On the Visual ETL canvas, choose the plus icon to open the Add nodes panel.

  2. Under the Transforms tab, search for Evaluate Data Quality.

  3. Select the node to add it to the canvas, then connect it to your source.

    Figure: The Amazon SageMaker Unified Studio UI showing the Evaluate Data Quality node added to the Visual ETL canvas.

To configure the rule set
  1. Select the node to open the configuration panel.

  2. For Ruleset name (optional), enter a name that identifies the evaluation context of this node. The name defaults to the node name. If your job has multiple Evaluate Data Quality nodes, give each one a unique name so that you can tell their results apart later.

  3. Expand Rule Types Reference to see the 31 available rule types. For details on each rule type and its syntax, see DQDL rule types in the AWS Glue documentation.

  4. For Ruleset, define rules using DQDL syntax. The editor provides autocomplete for rule types and column names from the input schema. Use Ctrl+Tab to trigger column suggestions.

    Figure: The Amazon SageMaker Unified Studio UI showing the ruleset configuration for the Evaluate Data Quality transform.

The following example shows a rule set with two rules:

Rules = [ColumnExists "phone", ColumnLength "account_length" > 10]
Note

The rule set cannot be empty. You must define at least one rule.
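
If you assemble rule sets programmatically before pasting them into the Ruleset editor, the following minimal Python sketch shows one way to do it. The `build_ruleset` helper is hypothetical and is not part of any AWS SDK; it only joins rule strings into the `Rules = [...]` form shown in the example above and rejects an empty list, mirroring the constraint in the note.

```python
def build_ruleset(rules: list[str]) -> str:
    """Join individual DQDL rule strings into a single `Rules = [...]` rule set.

    Hypothetical helper for illustration; it does not validate DQDL syntax.
    """
    # Drop blank entries so stray empty strings do not leave gaps in the list.
    cleaned = [rule.strip() for rule in rules if rule.strip()]
    if not cleaned:
        # Mirrors the documented constraint: at least one rule is required.
        raise ValueError("The rule set cannot be empty; define at least one rule.")
    return "Rules = [" + ", ".join(cleaned) + "]"


# The two rules from the example rule set.
ruleset = build_ruleset([
    'ColumnExists "phone"',
    'ColumnLength "account_length" > 10',
])
print(ruleset)
```

The resulting string matches the two-rule example and can be pasted into the Ruleset field. The helper does not check whether each entry is a valid DQDL rule; the editor remains the place to confirm that.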

Select an output to add it as a child node that downstream transforms can read from. You can add multiple output nodes and route them independently. The following table describes the available outputs.

| # | Output | Description |
|---|--------|-------------|
| 1 | Original data | Outputs original data. This option is ideal if you want to stop the job when quality issues are detected. |
| 2 | Evaluation results | Outputs configured rules and their pass or fail status. This option is useful if you want to take a custom action on the results. |
| 3 | Row level results | Outputs original data with additional columns depicting the rule result for each row. This option is best for row-specific manipulation based on result. |
| 4 | Row level results - Failed rows | Outputs original data with additional columns depicting the rule result for each row, filtering for only rows that failed the data quality evaluation checks. |
| 5 | Row level results - Passed rows | Outputs original data with additional columns depicting the rule result for each row, filtering for only rows that passed the data quality evaluation checks. |
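
To make the relationship between the row-level outputs concrete, here is a conceptual Python sketch. It is not how the transform is implemented, and the result column names (`rules_passed`, `rules_failed`) are placeholders chosen for this sketch, not the columns the service actually emits. It treats `ColumnLength "account_length" > 10` as a per-row check on string length and then derives the failed-rows and passed-rows views from the annotated rows.

```python
from typing import Callable

Row = dict[str, object]
Check = Callable[[Row], bool]

# Hypothetical result column names, chosen for this sketch only.
PASSED_COL = "rules_passed"
FAILED_COL = "rules_failed"


def add_row_level_results(rows: list[Row], checks: dict[str, Check]) -> list[Row]:
    """Return copies of the rows with per-row rule results appended."""
    annotated = []
    for row in rows:
        passed = [name for name, check in checks.items() if check(row)]
        failed = [name for name in checks if name not in passed]
        # Copy the row so the original data is left untouched.
        annotated.append({**row, PASSED_COL: passed, FAILED_COL: failed})
    return annotated


# Row-level stand-in for the rule: ColumnLength "account_length" > 10
checks = {
    'ColumnLength "account_length" > 10':
        lambda row: len(str(row["account_length"])) > 10,
}

# Made-up sample rows for illustration.
original_data = [
    {"phone": "555-0100", "account_length": "12345678901"},  # 11 characters
    {"phone": "555-0101", "account_length": "123"},          # 3 characters
]

# "Row level results": every row, plus the result columns.
row_level_results = add_row_level_results(original_data, checks)
# "Failed rows" and "Passed rows": the same rows, filtered by result.
failed_rows = [row for row in row_level_results if row[FAILED_COL]]
passed_rows = [row for row in row_level_results if not row[FAILED_COL]]
```

In this sketch the failed and passed views partition the row-level results, which is why you can route them to different downstream nodes independently.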

Additional options

The Evaluate Data Quality node includes options for publishing results and controlling job behavior on failure.

| # | Option | Description |
|---|--------|-------------|
| 1 | Publish results to Amazon CloudWatch | Send evaluation metrics to CloudWatch for monitoring and alerting. |
| 2 | Publish data quality evaluation results to S3 | Write detailed results to an S3 folder. Choose Browse S3 to select the target location. |
| 3 | Stop job on rule set failure | Halt the job if any rule fails, preventing bad data from flowing downstream. |
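
The effect of Stop job on rule set failure can be pictured with the following Python sketch. It is an illustration only: the `rule` and `outcome` keys, the `RulesetFailedError` class, and the `enforce_ruleset` function are all hypothetical and do not reflect the service's actual result schema or internals. The sketch raises an error as soon as any rule in a set of evaluation results has not passed, so nothing after the check would run.

```python
class RulesetFailedError(RuntimeError):
    """Raised when at least one rule in the rule set did not pass."""


def enforce_ruleset(evaluation_results: list[dict[str, str]]) -> None:
    """Raise if any rule failed; otherwise return so processing can continue."""
    failed = [r["rule"] for r in evaluation_results if r["outcome"] != "Passed"]
    if failed:
        raise RulesetFailedError("Rules failed: " + "; ".join(failed))


# Hypothetical evaluation results for the example rule set.
evaluation_results = [
    {"rule": 'ColumnExists "phone"', "outcome": "Passed"},
    {"rule": 'ColumnLength "account_length" > 10', "outcome": "Failed"},
]

try:
    enforce_ruleset(evaluation_results)
    job_stopped = False
except RulesetFailedError as err:
    # In this sketch, a single failed rule is enough to stop processing.
    job_stopped = True
    stop_reason = str(err)
```

If you need a softer policy, such as tolerating failures in some rules, leave the option off and act on the Evaluation results output in a downstream node instead.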

Viewing results after job runs

After the job completes, results are available on the Data quality tab of the data processing job detail page. For details, see Monitor data quality in data processing jobs.