
Evaluate data quality transform

The Evaluate Data Quality transform validates data against a set of rules as it flows through your Visual ETL job. You define the rules in DQDL (Data Quality Definition Language), a domain-specific language that provides 31 built-in rule types.

To add an Evaluate Data Quality transform

On the Visual ETL canvas, choose the plus icon to open the Add nodes panel.
Under the Transforms tab, search for Evaluate Data Quality.
Select the node to add it to the canvas, then connect it to your source.

To configure the rule set

Select the node to open the configuration panel.
For Ruleset name (optional), enter a name that identifies the evaluation context of this node. The name defaults to the node name. If your job has multiple Evaluate Data Quality nodes, give each a unique name so you can identify results later.
Expand Rule Types Reference to see the 31 available rule types. For details on each rule type and its syntax, see DQDL rule types in the AWS Glue documentation.
For Ruleset, define rules using DQDL syntax. The editor provides autocomplete for rule types and column names from the input schema. Use Ctrl+Tab to trigger column suggestions.

The following example shows a rule set with two rules:


Rules = [ColumnExists "phone", ColumnLength "account_length" > 10]

Note

The rule set cannot be empty. You must define at least one rule.
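
If you assemble rule sets outside the editor, for example to paste into the Ruleset field, a small helper can enforce the same non-empty constraint before you reach the console. The following is a minimal sketch in plain Python; `build_ruleset` is a hypothetical helper name, not part of any AWS SDK, and it only joins rule strings without validating DQDL syntax.

```python
def build_ruleset(rules):
    """Join DQDL rule strings into a single `Rules = [...]` rule set.

    Hypothetical helper for illustration; it does not check DQDL syntax.
    """
    # Drop blank entries so whitespace-only strings do not count as rules.
    cleaned = [rule.strip() for rule in rules if rule.strip()]
    if not cleaned:
        # Mirror the transform's constraint: at least one rule is required.
        raise ValueError("A rule set must contain at least one rule.")
    return "Rules = [" + ", ".join(cleaned) + "]"


ruleset = build_ruleset([
    'ColumnExists "phone"',
    'ColumnLength "account_length" > 10',
])
print(ruleset)
```

With these two inputs, the helper produces the same rule set as the example above.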

Select an output to add it as a child node that downstream transforms can read from. You can add multiple output nodes and route them independently. The following table describes the available outputs.

#	Output	Description
1	Original data	Outputs original data. This option is ideal if you want to stop the job when quality issues are detected.
2	Evaluation results	Outputs configured rules and their pass or fail status. This option is useful if you want to take a custom action on the results.
3	Row level results	Outputs original data with additional columns showing the rule result for each row. This option is best for row-specific manipulation based on the result.
4	Row level results - Failed rows	Outputs original data with additional columns depicting the rule result for each row, filtering for only rows that failed the data quality evaluation checks.
5	Row level results - Passed rows	Outputs original data with additional columns depicting the rule result for each row, filtering for only rows that passed the data quality evaluation checks.
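
To illustrate how the three row-level outputs relate to each other, the sketch below splits a small set of row-level results into passed and failed rows using plain Python. The column name `DataQualityEvaluationResult` and its `Passed`/`Failed` values are assumptions modeled on AWS Glue Data Quality row-level output; confirm the actual column names in your own output node. The sample rows and the `split_rows` helper are hypothetical.

```python
# Assumed name of the per-row result column (check your actual output schema).
RESULT_COLUMN = "DataQualityEvaluationResult"


def split_rows(rows):
    """Split row-level results into (passed, failed) lists."""
    passed = [row for row in rows if row.get(RESULT_COLUMN) == "Passed"]
    failed = [row for row in rows if row.get(RESULT_COLUMN) == "Failed"]
    return passed, failed


# Hypothetical row-level results: original columns plus the result column.
rows = [
    {"phone": "555-0100", "account_length": "12345678901", RESULT_COLUMN: "Passed"},
    {"phone": "555-0101", "account_length": "123", RESULT_COLUMN: "Failed"},
    {"phone": "555-0102", "account_length": "98765432109", RESULT_COLUMN: "Passed"},
]

passed_rows, failed_rows = split_rows(rows)
print(len(passed_rows), "passed;", len(failed_rows), "failed")
```

In the visual editor, choosing the Failed rows or Passed rows output performs this filtering for you, so a downstream node only needs custom logic like this when it reads the unfiltered Row level results output.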

Additional options

The Evaluate Data Quality node includes options for publishing results and controlling job behavior on failure.

#	Option	Description
1	Publish results to Amazon CloudWatch	Send evaluation metrics to CloudWatch for monitoring and alerting.
2	Publish data quality evaluation results to S3	Write detailed results to an S3 folder. Choose Browse S3 to select the target location.
3	Stop job on rule set failure	Halt the job if any rule fails, preventing bad data from flowing downstream.
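
The stop-on-failure behavior can be pictured as a guard that inspects the evaluation results and raises an error when any rule has failed. The following plain-Python sketch shows that idea only; it is not the service's implementation. The `Rule` and `Outcome` field names are assumptions modeled on AWS Glue Data Quality rule outcomes, and `stop_on_failure`, the node name, and the sample results are hypothetical.

```python
def stop_on_failure(results, node_name):
    """Raise an error if any rule in the evaluation results failed."""
    # Collect the text of every rule whose outcome is "Failed" (assumed value).
    failed_rules = [r["Rule"] for r in results if r["Outcome"] == "Failed"]
    if failed_rules:
        raise RuntimeError(
            f"Data quality rules failed for node {node_name}: "
            + "; ".join(failed_rules)
        )


# Hypothetical evaluation results for the two-rule example rule set.
results = [
    {"Rule": 'ColumnExists "phone"', "Outcome": "Passed"},
    {"Rule": 'ColumnLength "account_length" > 10', "Outcome": "Failed"},
]

try:
    stop_on_failure(results, "EvaluateDataQuality_node1")
except RuntimeError as err:
    # In a real job the error would propagate and halt the run.
    print(err)
```

Raising instead of returning a flag means downstream steps never run on data that failed the checks, which matches the intent of the Stop job on rule set failure option.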

Viewing results after job runs

After the job completes, results are available on the Data quality tab of the data processing job detail page. For details, see Monitor data quality in data processing jobs.
