

本文為英文版的機器翻譯版本，如內容有任何歧義或不一致之處，概以英文版為準。

# Glue Studio 筆記本中 ETL AWS 任務的資料品質
<a name="data-quality-gs-studio-notebooks"></a>

在本教學課程中，您將了解如何在 AWS Glue Studio 筆記本中使用 AWS Glue Data Quality 進行擷取、轉換和載入 (ETL) 任務。

您可使用 AWS Glue Studio 中的筆記本來編輯任務指令碼並檢視輸出，而不需執行完整任務。您還可以新增 Markdown，並將筆記本儲存為 `.ipynb` 檔案和任務指令碼。請注意，您可以直接啟動筆記本，而無需在本機安裝軟體或管理伺服器。當您完成程式碼的撰寫時，您可以使用 AWS Glue Studio 輕鬆將筆記本轉換為 AWS Glue 任務。

您在此範例中使用的資料集包含從兩個 *Data.CMS.gov* 資料集下載的美國聯邦醫療保險 (Medicare) 供應商付款資料：「住院患者預期付款系統供應商前 100 大診斷相關群組摘要 - FY2011」和「住院患者費用資料 FY 2011」。

下載資料之後，我們修改了資料集，以在檔案結尾處引入幾個錯誤記錄。上述經修改的檔案位於公有 Amazon S3 儲存貯體，位置在 `s3://awsglue-datasets/examples/medicare/Medicare_Hospital_Provider.csv`。

## 先決條件
<a name="w2aac49c54c14"></a>
+  具有 Amazon S3 許可的 AWS Glue 角色，可寫入目的地 Amazon S3 儲存貯體 
+  新的筆記本 (請參閱 [AWS Glue Studio 中的筆記本入門](https://docs.aws.amazon.com/glue/latest/ug/notebook-getting-started.html)) 

## 在 AWS Glue Studio 中建立 ETL 任務
<a name="data-quality-notebooks-example"></a>

**建立 ETL 任務**

1.  將工作階段版本變更為 AWS Glue 3.0。

    為此，請使用以下魔術命令移除所有樣板程式碼儲存格並執行儲存格。請注意，建立新筆記本時，第一個儲存格中會自動提供此樣板程式碼。

   ```
   %glue_version 3.0
   ```

1.  複製以下程式碼並在儲存格中執行它。

   ```
   import sys
   from awsglue.transforms import *
   from awsglue.utils import getResolvedOptions
   from pyspark.context import SparkContext
   from awsglue.context import GlueContext
   from awsglue.job import Job
   
   sc = SparkContext.getOrCreate()
   glueContext = GlueContext(sc)
   spark = glueContext.spark_session
   job = Job(glueContext)
   ```

1.  在下一個儲存格中，匯入評估 AWS Glue Data Quality 的 `EvaluateDataQuality` 類別。

   ```
   from awsgluedq.transforms import EvaluateDataQuality
   ```

1. 在下一個儲存格中，使用存放在公有 Amazon S3 儲存貯體中的 .csv 檔案讀入來源資料。

   ```
   medicare = spark.read.format(
   "csv").option(
   "header", "true").option(
   "inferSchema", "true").load(
   's3://awsglue-datasets/examples/medicare/Medicare_Hospital_Provider.csv')
   medicare.printSchema()
   ```

1.  將資料轉換為 AWS Glue DynamicFrame。

   ```
   from awsglue.dynamicframe import DynamicFrame
   medicare_dyf = DynamicFrame.fromDF(medicare,glueContext,"medicare_dyf")
   ```

1.  使用資料品質定義語言 (DQDL) 建立規則集。

   ```
   EvaluateDataQuality_ruleset = """
       Rules = [
           ColumnExists "Provider Id",
           IsComplete "Provider Id",
           ColumnValues  " Total Discharges " > 15
   ]
       ]
   """
   ```

1.  根據規則集驗證資料集。

   ```
   EvaluateDataQualityMultiframe = EvaluateDataQuality().process_rows(
       frame=medicare_dyf,
       ruleset=EvaluateDataQuality_ruleset,
       publishing_options={
           "dataQualityEvaluationContext": "EvaluateDataQualityMultiframe",
           "enableDataQualityCloudWatchMetrics": False,
           "enableDataQualityResultsPublishing": False,
       },
       additional_options={"performanceTuning.caching": "CACHE_NOTHING"},
   )
   ```

1.  檢閱結果。

   ```
   ruleOutcomes = SelectFromCollection.apply(
       dfc=EvaluateDataQualityMultiframe,
       key="ruleOutcomes",
       transformation_ctx="ruleOutcomes",
   )
   
   ruleOutcomes.toDF().show(truncate=False)
   ```

    輸出：

   ```
   --------------------------------------+-------+-----------------------------------------------------+-------------------------------------------+
   |Rule                                  |Outcome|FailureReason                                        |EvaluatedMetrics                           |
   +--------------------------------------+-------+-----------------------------------------------------+-------------------------------------------+
   |ColumnExists "Provider Id"            |Passed |null                                                 |{}                                         |
   |IsComplete "Provider Id"              |Passed |null                                                 |{Column.Provider Id.Completeness -> 1.0}   |
   |ColumnValues " Total Discharges " > 15|Failed |Value: 11.0 does not meet the constraint requirement!|{Column. Total Discharges .Minimum -> 11.0}|
   +--------------------------------------+-------+-----------------------------------------------------+-------------------------------------------+
   ```

1.  篩選傳遞的資料列，並從資料品質資料列層級結果檢閱失敗的資料列。

   ```
   owLevelOutcomes = SelectFromCollection.apply(
   dfc=EvaluateDataQualityMultiframe,
   key="rowLevelOutcomes",
   transformation_ctx="rowLevelOutcomes",
   )
   
   rowLevelOutcomes_df = rowLevelOutcomes.toDF() # Convert Glue DynamicFrame to SparkSQL DataFrame
   rowLevelOutcomes_df_passed = rowLevelOutcomes_df.filter(rowLevelOutcomes_df.DataQualityEvaluationResult == "Passed") # Filter only the Passed records.
   rowLevelOutcomes_df.filter(rowLevelOutcomes_df.DataQualityEvaluationResult == "Failed").show(5, truncate=False) # Review the Failed records
   ```

    輸出：

   ```
   +----------------------------------------+-----------+-------------------------------------+--------------------------+-------------+--------------+-----------------+------------------------------------+------------------+-------------------------+------------------------+-------------------------+--------------------------+----------------------------------------+----------------------------+---------------------------+
   |DRG Definition                          |Provider Id|Provider Name                        |Provider Street Address   |Provider City|Provider State|Provider Zip Code|Hospital Referral Region Description| Total Discharges | Average Covered Charges | Average Total Payments |Average Medicare Payments|DataQualityRulesPass      |DataQualityRulesFail                    |DataQualityRulesSkip        |DataQualityEvaluationResult|
   +----------------------------------------+-----------+-------------------------------------+--------------------------+-------------+--------------+-----------------+------------------------------------+------------------+-------------------------+------------------------+-------------------------+--------------------------+----------------------------------------+----------------------------+---------------------------+
   |039 - EXTRACRANIAL PROCEDURES W/O CC/MCC|10005      |MARSHALL MEDICAL CENTER SOUTH        |2505 U S HIGHWAY 431 NORTH|BOAZ         |AL            |35957            |AL - Birmingham                     |14                |$15131.85                |$5787.57                |$4976.71                 |[IsComplete "Provider Id"]|[ColumnValues " Total Discharges " > 15]|[ColumnExists "Provider Id"]|Failed                     |
   |039 - EXTRACRANIAL PROCEDURES W/O CC/MCC|10046      |RIVERVIEW REGIONAL MEDICAL CENTER    |600 SOUTH THIRD STREET    |GADSDEN      |AL            |35901            |AL - Birmingham                     |14                |$67327.92                |$5461.57                |$4493.57                 |[IsComplete "Provider Id"]|[ColumnValues " Total Discharges " > 15]|[ColumnExists "Provider Id"]|Failed                     |
   |039 - EXTRACRANIAL PROCEDURES W/O CC/MCC|10083      |SOUTH BALDWIN REGIONAL MEDICAL CENTER|1613 NORTH MCKENZIE STREET|FOLEY        |AL            |36535            |AL - Mobile                         |15                |$25411.33                |$5282.93                |$4383.73                 |[IsComplete "Provider Id"]|[ColumnValues " Total Discharges " > 15]|[ColumnExists "Provider Id"]|Failed                     |
   |039 - EXTRACRANIAL PROCEDURES W/O CC/MCC|30002      |BANNER GOOD SAMARITAN MEDICAL CENTER |1111 EAST MCDOWELL ROAD   |PHOENIX      |AZ            |85006            |AZ - Phoenix                        |11                |$34803.81                |$7768.90                |$6951.45                 |[IsComplete "Provider Id"]|[ColumnValues " Total Discharges " > 15]|[ColumnExists "Provider Id"]|Failed                     |
   |039 - EXTRACRANIAL PROCEDURES W/O CC/MCC|30010      |CARONDELET ST  MARYS HOSPITAL        |1601 WEST ST MARY'S ROAD  |TUCSON       |AZ            |85745            |AZ - Tucson                         |12                |$35968.50                |$6506.50                |$5379.83                 |[IsComplete "Provider Id"]|[ColumnValues " Total Discharges " > 15]|[ColumnExists "Provider Id"]|Failed                     |
   +----------------------------------------+-----------+-------------------------------------+--------------------------+-------------+--------------+-----------------+------------------------------------+------------------+-------------------------+------------------------+-------------------------+--------------------------+----------------------------------------+----------------------------+---------------------------+
   only showing top 5 rows
   ```

    請注意，Glue Data Quality 新增了四個新資料欄 AWS (DataQualityRulesPass、DataQualityRulesFail、DataQualityRulesSkip 和 DataQualityEvaluationResult)。這指示通過的記錄、失敗的記錄、為資料列層級評估跳過的規則，以及整體資料列層級結果。

1.  將輸出寫入 Amazon S3 儲存貯體，以分析資料並視覺化結果。

   ```
   #Write the Passed records to the destination. 
   
   glueContext.write_dynamic_frame.from_options(
          frame = rowLevelOutcomes_df_passed,
          connection_type = "s3",
          connection_options = {"path": "s3://glue-sample-target/output-dir/medicare_parquet"},
          format = "parquet")
   ```