# Program AWS Glue ETL scripts in PySpark

You can find Python code examples and utilities for AWS Glue in the [AWS Glue samples repository](https://github.com/awslabs/aws-glue-samples) on the GitHub website.

## Using Python with AWS Glue

AWS Glue supports an extension of the PySpark Python dialect for scripting extract, transform, and load (ETL) jobs. This section describes how to use Python in ETL scripts and with the AWS Glue API.

+ [Setting up to use Python with AWS Glue](aws-glue-programming-python-setup.md)
+ [Calling AWS Glue APIs in Python](aws-glue-programming-python-calling.md)
+ [Using Python libraries with AWS Glue](aws-glue-programming-python-libraries.md)
+ [AWS Glue Python code samples](aws-glue-programming-python-samples.md)

## AWS Glue PySpark extensions

AWS Glue has created the following extensions to the PySpark Python dialect.

+ [Accessing parameters using `getResolvedOptions`](aws-glue-api-crawler-pyspark-extensions-get-resolved-options.md)
+ [PySpark extension types](aws-glue-api-crawler-pyspark-extensions-types.md)
+ [DynamicFrame class](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md)
+ [DynamicFrameCollection class](aws-glue-api-crawler-pyspark-extensions-dynamic-frame-collection.md)
+ [DynamicFrameWriter class](aws-glue-api-crawler-pyspark-extensions-dynamic-frame-writer.md)
+ [DynamicFrameReader class](aws-glue-api-crawler-pyspark-extensions-dynamic-frame-reader.md)
+ [GlueContext class](aws-glue-api-crawler-pyspark-extensions-glue-context.md)

## AWS Glue PySpark transforms

AWS Glue has created the following transform classes to use in PySpark ETL operations.

+ [GlueTransform base class](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md)
+ [ApplyMapping class](aws-glue-api-crawler-pyspark-transforms-ApplyMapping.md)
+ [DropFields class](aws-glue-api-crawler-pyspark-transforms-DropFields.md)
+ [DropNullFields class](aws-glue-api-crawler-pyspark-transforms-DropNullFields.md)
+ [ErrorsAsDynamicFrame class](aws-glue-api-crawler-pyspark-transforms-ErrorsAsDynamicFrame.md)
+ [FillMissingValues class](aws-glue-api-crawler-pyspark-transforms-fillmissingvalues.md)
+ [Filter class](aws-glue-api-crawler-pyspark-transforms-filter.md)
+ [FindIncrementalMatches class](aws-glue-api-crawler-pyspark-transforms-findincrementalmatches.md)
+ [FindMatches
class](aws-glue-api-crawler-pyspark-transforms-findmatches.md)
+ [FlatMap class](aws-glue-api-crawler-pyspark-transforms-flat-map.md)
+ [Join class](aws-glue-api-crawler-pyspark-transforms-join.md)
+ [Map class](aws-glue-api-crawler-pyspark-transforms-map.md)
+ [MapToCollection class](aws-glue-api-crawler-pyspark-transforms-MapToCollection.md)
+ [mergeDynamicFrame](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-merge)
+ [Relationalize class](aws-glue-api-crawler-pyspark-transforms-Relationalize.md)
+ [RenameField class](aws-glue-api-crawler-pyspark-transforms-RenameField.md)
+ [ResolveChoice class](aws-glue-api-crawler-pyspark-transforms-ResolveChoice.md)
+ [SelectFields class](aws-glue-api-crawler-pyspark-transforms-SelectFields.md)
+ [SelectFromCollection class](aws-glue-api-crawler-pyspark-transforms-SelectFromCollection.md)
+ [Spigot class](aws-glue-api-crawler-pyspark-transforms-spigot.md)
+ [SplitFields class](aws-glue-api-crawler-pyspark-transforms-SplitFields.md)
+ [SplitRows class](aws-glue-api-crawler-pyspark-transforms-SplitRows.md)
+ [Unbox class](aws-glue-api-crawler-pyspark-transforms-Unbox.md)
+ [UnnestFrame class](aws-glue-api-crawler-pyspark-transforms-UnnestFrame.md)

# Setting up to use Python with AWS Glue

Use Python to develop your ETL scripts for Spark jobs. The Python versions supported for ETL jobs depend on the AWS Glue version of the job. For more information about AWS Glue versions, see the [Glue version job property](add-job.md#glue-version-table).

**To set up your system for using Python with AWS Glue**

Follow these steps to install Python and to be able to invoke the AWS Glue APIs.

1. If you don't already have Python installed, download and install it from the [Python.org download page](https://www.python.org/downloads/).

1. Install the AWS Command Line Interface (AWS CLI) as documented in the [AWS CLI documentation](https://docs.aws.amazon.com/cli/latest/userguide/installing.html).

   The AWS CLI is not directly necessary for using Python. However, installing and configuring it is a convenient way to set up AWS with your account credentials and verify that they work.

1. 
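Install the AWS SDK for Python (Boto 3), as documented in the [Boto3 Quickstart](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html).

   Boto 3 resource APIs are not yet available for AWS Glue. Currently, only the Boto 3 client APIs can be used.

   For more information about Boto 3, see [AWS SDK for Python (Boto3) Getting Started](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html).

You can find Python code examples and utilities for AWS Glue in the [AWS Glue samples repository](https://github.com/awslabs/aws-glue-samples) on the GitHub website.

# Calling AWS Glue APIs in Python

Note that Boto 3 resource APIs are not yet available for AWS Glue. Currently, only the Boto 3 client APIs can be used.

## AWS Glue API names in Python

AWS Glue API names in Java and other programming languages are generally CamelCased. However, when called from Python, these generic names are changed to lowercase, with the parts of the name separated by underscore characters, to make them more "Pythonic". In the [AWS Glue API](aws-glue-api.md) reference documentation, these Pythonic names are listed in parentheses after the generic CamelCased names.

However, although the AWS Glue API names themselves are transformed to lowercase, their parameter names remain capitalized. It is important to remember this, because parameters should be passed by name when calling AWS Glue APIs, as described in the following section.

## Passing and accessing Python parameters in AWS Glue

In Python calls to AWS Glue APIs, it's best to pass parameters explicitly by name. For example:

```
job = glue.create_job(Name='sample', Role='Glue_DefaultRole',
                      Command={'Name': 'glueetl',
                               'ScriptLocation': 's3://my_script_bucket/scripts/my_etl_script.py'})
```

It is helpful to understand that Python creates a dictionary of the name/value tuples that you specify as arguments to an ETL script in a [Job structure](aws-glue-api-jobs-job.md#aws-glue-api-jobs-job-Job) or [JobRun structure](aws-glue-api-jobs-runs.md#aws-glue-api-jobs-runs-JobRun). Boto 3 then passes them to AWS Glue in JSON format by way of a REST API call. This means that you cannot rely on the order of the arguments when you access them in your script.

For example, suppose that you're starting a `JobRun` in a Python Lambda handler function, and you want to specify several parameters. Your code might look something like the following:

```
from datetime import datetime, timedelta

import boto3

client = boto3.client('glue')

def lambda_handler(event, context):
    # Partition values for the previous hour
    last_hour_date_time = datetime.now() - timedelta(hours=1)
    day_partition_value = last_hour_date_time.strftime("%Y-%m-%d")
    hour_partition_value = last_hour_date_time.strftime("%-H")

    response = client.start_job_run(
        JobName='my_test_Job',
        Arguments={
            '--day_partition_key': 'partition_0',
            '--hour_partition_key': 'partition_1',
            '--day_partition_value': day_partition_value,
            '--hour_partition_value': hour_partition_value
        }
    )
```

To access these parameters reliably in your ETL script, specify them by name using the AWS Glue `getResolvedOptions` function, and then access them from the resulting dictionary: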
```
import sys
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv,
                          ['JOB_NAME',
                           'day_partition_key',
                           'hour_partition_key',
                           'day_partition_value',
                           'hour_partition_value'])
print("The day partition key is: ", args['day_partition_key'])
print("and the day partition value is: ", args['day_partition_value'])
```

If you want to pass an argument that is a nested JSON string, you must encode the parameter string before starting the job run, and then decode it before referencing it in your job script. This preserves the parameter value as it gets passed to your AWS Glue ETL job. For example, consider the following argument string:

```
glue_client.start_job_run(JobName = "gluejobname", Arguments={
    "--my_curly_braces_string": '{"a": {"b": {"c": [{"d": {"e": 42}}]}}}'
})
```

To pass this parameter correctly, you should encode the argument as a Base64-encoded string.

```
import base64
...
sample_string = '{"a": {"b": {"c": [{"d": {"e": 42}}]}}}'
sample_string_bytes = sample_string.encode("ascii")

base64_bytes = base64.b64encode(sample_string_bytes)
base64_string = base64_bytes.decode("ascii")
...
# Job arguments must be strings, so pass the decoded Base64 text, not the bytes object
glue_client.start_job_run(JobName = "gluejobname", Arguments={
    "--my_curly_braces_string": base64_string})
...
# In the job script, decode the argument before using it
sample_string_bytes = base64.b64decode(base64_string)
sample_string = sample_string_bytes.decode("ascii")
print(f"Decoded string: {sample_string}")
...
```

## Example: Create and run a job

The following example shows how to call the AWS Glue APIs using Python to create and run an ETL job.

**To create and run a job**

1. Create an instance of the AWS Glue client:

   ```
   import boto3
   glue = boto3.client(service_name='glue', region_name='us-east-1',
                       endpoint_url='https://glue.us-east-1.amazonaws.com')
   ```

1. Create a job. You must use `glueetl` as the name for the ETL command, as shown in the following code:

   ```
   myJob = glue.create_job(Name='sample', Role='Glue_DefaultRole',
                           Command={'Name': 'glueetl',
                                    'ScriptLocation': 's3://my_script_bucket/scripts/my_etl_script.py'})
   ```

1. Start a new run of the job that you created in the previous step:

   ```
   myNewJobRun = glue.start_job_run(JobName=myJob['Name'])
   ```

1. Get the job status:

   ```
   status = glue.get_job_run(JobName=myJob['Name'], RunId=myNewJobRun['JobRunId'])
   ```

1. 
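Print the current state of the job run:

   ```
   print(status['JobRun']['JobRunState'])
   ```

# Using Python libraries with AWS Glue

You can install additional Python modules and libraries for use with AWS Glue ETL. For AWS Glue 2.0 and above, AWS Glue uses the Python Package Installer (pip3) to install additional modules used by AWS Glue ETL. AWS Glue provides multiple options to bring additional Python modules into your AWS Glue job environment. You can use the `--additional-python-modules` parameter to bring in new modules using a zip file containing bundled Python wheels (also known as a "zip of wheels", available for AWS Glue 5.0 and above), individual Python wheel files, a requirements file (requirements.txt, available for AWS Glue 5.0 and above), or a comma-separated list of Python modules. The parameter can also be used to change the version of Python modules provided in the AWS Glue environment (see [Python modules already provided in AWS Glue](#glue-modules-provided) for more details).

**Topics**
+ [Installing additional Python modules with pip in AWS Glue 2.0 or later](#addl-python-modules-support)
+ [Including Python files with PySpark native features](#extra-py-files-support)
+ [Programming scripts that use visual transforms](#aws-glue-programming-with-cvt)
+ [Zipping libraries for inclusion](#aws-glue-programming-python-libraries-zipping)
+ [Loading Python libraries in AWS Glue Studio notebooks](#aws-glue-programming-python-libraries-notebooks)
+ [Loading Python libraries in a development endpoint in AWS Glue 0.9/1.0](#aws-glue-programming-python-libraries-dev-endpoint)
+ [Using Python libraries in a job or JobRun](#aws-glue-programming-python-libraries-job)
+ [Proactively analyzing Python dependencies](#aws-glue-programming-analyzing-python-dependencies)
+ [Python modules already provided in AWS Glue](#glue-modules-provided)
+ [Appendix A: Creating a zip of wheels artifact](#glue-python-library-zip-of-wheels-appendix)
+ [Appendix B: AWS Glue environment details](#glue-python-libraries-environment-details)

## Installing additional Python modules with pip in AWS Glue 2.0 or later

AWS Glue uses the Python Package Installer (pip3) to install additional modules to be used by AWS Glue ETL. You can use the `--additional-python-modules` parameter with a comma-separated list of Python modules to add a new module or change the version of an existing module. You can install built wheel artifacts, either as a zip of wheels or as standalone wheel files, by uploading the file to Amazon S3 and then including the path to the Amazon S3 object in your list of modules. For more information about setting job parameters, see [Using job parameters in AWS Glue jobs](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html).

You can pass additional options to pip3 with the `--python-modules-installer-option` parameter. For example, you could pass `--only-binary` to force pip to install only prebuilt artifacts for the packages specified by `--additional-python-modules`. For more examples, see [Building Python modules from a wheel for Spark ETL workloads using AWS Glue 2.0.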
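](https://aws.amazon.com/blogs/big-data/building-python-modules-from-a-wheel-for-spark-etl-workloads-using-aws-glue-2-0/)

### Best practices for Python dependency management

For production workloads, AWS Glue recommends packaging all your Python dependencies as wheel files in a single zip artifact. This approach provides:
+ **Deterministic execution**: Exact control over which package versions are installed
+ **Reliability**: No dependency on external package repositories during job execution
+ **Performance**: A single download operation instead of multiple network calls
+ **Offline installation**: Works in private VPC environments without internet access

#### Important considerations

Under the [AWS shared responsibility model](https://aws.amazon.com/compliance/shared-responsibility-model/), you are responsible for managing additional Python modules, libraries, and their dependencies. This includes:
+ **Security updates**: Regularly updating packages to address security vulnerabilities
+ **Version compatibility**: Ensuring that packages are compatible with your AWS Glue version
+ **Testing**: Validating that your packaged dependencies work correctly in the Glue environment

If you have minimal dependencies, you can consider using individual wheel files instead.

### (Recommended) Installing additional Python libraries in AWS Glue 5.0 or above using a zip of wheels

AWS Glue 5.0 and above supports packaging multiple wheel files into a single zip artifact of bundled Python wheels, for more reliable and deterministic dependency management. To use this approach, create a zip file with the `.gluewheels.zip` suffix that contains all your wheel dependencies and their transitive dependencies, upload it to Amazon S3, and reference it using the `--additional-python-modules` parameter. Make sure to add `--no-index` to the `--python-modules-installer-option` job parameter. With this configuration, the zip of wheels essentially acts as a local index from which pip resolves dependencies at run time. This removes the dependency on external package repositories such as PyPI during job execution, providing greater stability and consistency for production workloads. For example:

```
--additional-python-modules s3://amzn-s3-demo-bucket/path/to/zip-of-wheels-1.0.0.gluewheels.zip
--python-modules-installer-option --no-index
```

For instructions on how to create a zip of wheels, see [Appendix A: Creating a zip of wheels artifact](#glue-python-library-zip-of-wheels-appendix).

### Installing additional Python libraries using wheel files

AWS Glue supports installing custom Python packages using wheel (.whl) files stored in Amazon S3. To include wheel files in your AWS Glue jobs, provide a comma-separated list of the wheel files stored in Amazon S3 to the `--additional-python-modules` job parameter. For example:

```
--additional-python-modules s3://amzn-s3-demo-bucket/path/to/package-1.0.0-py3-none-any.whl,s3://your-bucket/path/to/another-package-2.1.0-cp311-cp311-linux_x86_64.whl
```

This approach also works when you need custom distributions, or packages with native dependencies that are precompiled for the correct operating system. For more examples, see [Building Python modules from a wheel for Spark ETL workloads using AWS Glue 2.0](https://aws.amazon.com/blogs/big-data/building-python-modules-from-a-wheel-for-spark-etl-workloads-using-aws-glue-2-0/).

### Installing additional Python libraries in AWS Glue 5.0 or above using requirements.txt

In AWS Glue 5.0+, you can provide the de facto standard `requirements.txt` to manage Python library dependencies. To do that, provide the following two job parameters:
+ Key: `--python-modules-installer-option` Value: `-r`
+ 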
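Key: `--additional-python-modules` Value: `s3://path_to_requirements.txt`

AWS Glue 5.0 nodes initially load the Python libraries specified in `requirements.txt`. The following is a sample requirements.txt:

```
awswrangler==3.9.1
elasticsearch==8.15.1
PyAthena==3.9.0
PyMySQL==1.1.1
PyYAML==6.0.2
pyodbc==5.2.0
pyorc==0.9.0
redshift-connector==2.1.3
scipy==1.14.1
scikit-learn==1.5.2
SQLAlchemy==2.0.36
```

**Important**
Use this option with caution, especially in production workloads. Pulling dependencies from PyPI at run time is highly risky because you cannot be sure which artifact pip resolves to. Using unpinned library versions is especially risky because pip pulls the latest version of each Python module, which can introduce breaking changes or bring in an incompatible module. This can cause the job to fail because the Python installation fails in the AWS Glue job environment. Although pinning library versions increases stability, pip resolution is still not fully deterministic, so similar issues can arise. As a best practice, AWS Glue recommends using frozen artifacts such as a zip of wheels or individual wheel files (see [(Recommended) Installing additional Python libraries in AWS Glue 5.0 or above using a zip of wheels](#glue-python-library-installing-zip-of-wheels) for more details).

**Important**
If you don't pin the versions of your transitive dependencies, a primary dependency may pull in incompatible transitive dependency versions. As a best practice, all library versions should be pinned for increased consistency in AWS Glue jobs. Even better, AWS Glue recommends packaging your dependencies into a zip of wheels to ensure maximum consistency and reliability for your production workloads.

### Setting additional Python libraries directly as a comma-separated list

To update or add a Python module, AWS Glue allows passing the `--additional-python-modules` parameter with a comma-separated list of Python modules as its value. For example, to update or add the scikit-learn module, use the following key/value: `"--additional-python-modules", "scikit-learn==0.21.3"`. You have two options for configuring the Python modules directly.
+ **Pinned Python modules**

  `"--additional-python-modules", "scikit-learn==0.21.3,ephem==4.1.6"`
+ **Unpinned Python modules (not recommended for production workloads)**

  `"--additional-python-modules", "scikit-learn>=0.20.0,ephem>=4.0.0"`

  or

  `"--additional-python-modules", "scikit-learn,ephem"`

**Important**
Use this option with caution, especially in production workloads. Pulling dependencies from PyPI at run time is highly risky because you cannot be sure which artifact pip resolves to. Using unpinned library versions is especially risky because pip pulls the latest version of each Python module, which can introduce breaking changes or bring in an incompatible module. This can cause the job to fail because the Python installation fails in the AWS Glue job environment. Although pinning library versions increases stability, pip resolution is still not fully deterministic, so similar issues can arise. As a best practice, AWS Glue recommends using frozen artifacts such as a zip of wheels or individual wheel files (see [(Recommended) Installing additional Python libraries in AWS Glue 5.0 or above using a zip of wheels](#glue-python-library-installing-zip-of-wheels) for more details).

**Important**
If you don't pin the versions of your transitive dependencies, a primary dependency may pull in incompatible transitive dependency versions. As a best practice, all library versions should be pinned for increased consistency in AWS Glue jobs. Even better, AWS Glue recommends packaging your dependencies into a zip of wheels to ensure maximum consistency and reliability for your production workloads.

## Including Python files with PySpark native features

AWS Glue uses PySpark to include Python files in AWS Glue ETL jobs. When it is available, you will want to use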
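`--additional-python-modules` to manage your dependencies. You can use the `--extra-py-files` job parameter to include Python files. Dependencies must be hosted in Amazon S3, and the argument value should be a comma-delimited list of Amazon S3 paths with no spaces. This functionality behaves like the Python dependency management you would use with Spark. For more information about Python dependency management in Spark, see the [Using PySpark Native Features](https://spark.apache.org/docs/latest/api/python/tutorial/python_packaging.html#using-pyspark-native-features) page in the Apache Spark documentation. `--extra-py-files` is useful when your additional code is not packaged, or when you are migrating a Spark program with an existing toolchain for managing dependencies. For your dependency tooling to be maintainable, you must bundle your dependencies before submitting.

## Programming scripts that use visual transforms

When you create an AWS Glue job using the AWS Glue Studio visual interface, you can transform your data with managed data transform nodes and custom visual transforms. For more information about managed data transform nodes, see [Transform data with AWS Glue managed transforms](edit-jobs-transforms.md). For more information about custom visual transforms, see [Transform data with custom visual transforms](custom-visual-transform.md). Scripts using visual transforms can be generated only when your job **Language** is set to use Python.

When you generate an AWS Glue job using visual transforms, AWS Glue Studio includes these transforms in the runtime environment using the `--extra-py-files` parameter in the job configuration. For more information about job parameters, see [Using job parameters in AWS Glue jobs](aws-glue-programming-etl-glue-arguments.md). When you make changes to a generated script or runtime environment, you need to preserve this job configuration for your script to run successfully.

## Zipping libraries for inclusion

Unless a library is contained in a single `.py` file, it should be packaged in a `.zip` archive. The package directory should be at the root of the archive, and must contain an `__init__.py` file for the package. Python will then be able to import the package in the normal way.

If your library consists only of a single Python module in one `.py` file, you do not need to place it in a `.zip` file.

## Loading Python libraries in AWS Glue Studio notebooks

To specify Python libraries in AWS Glue Studio notebooks, see [Installing additional Python modules](https://docs.aws.amazon.com/glue/latest/dg/manage-notebook-sessions.html#specify-default-modules).

## Loading Python libraries in a development endpoint in AWS Glue 0.9/1.0

If you are using different library sets for different ETL scripts, you can either set up a separate development endpoint for each set, or overwrite the library `.zip` files that your development endpoint loads every time you switch scripts.

You can use the console to specify one or more library .zip files for a development endpoint when you create it. After assigning a name and an IAM role, choose **Script Libraries and job parameters (optional)** and enter the full Amazon S3 path to your library `.zip` file in the **Python library path** box. For example:

```
s3://bucket/prefix/site-packages.zip
```

You can also specify multiple full paths to files, separating them with commas but no spaces, like this:

```
s3://bucket/prefix/lib_A.zip,s3://bucket_B/prefix/lib_X.zip
```

If you update these `.zip` files later, you can use the console to re-import them into your development endpoint. Navigate to the development endpoint in question, and choose **Update ETL libraries** from the **Action** menu.

In a similar way, you can specify library files using the AWS Glue APIs. When you call the [CreateDevEndpoint action (Python: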
create_dev_endpoint)](aws-glue-api-dev-endpoint.md#aws-glue-api-dev-endpoint-CreateDevEndpoint) to create a development endpoint, you can specify one or more full paths to libraries in the `ExtraPythonLibsS3Path` parameter, in a call that looks like this:

```
dep = glue.create_dev_endpoint(
             EndpointName="testDevEndpoint",
             RoleArn="arn:aws:iam::123456789012",
             # SecurityGroupIds takes a list of security group ID strings
             SecurityGroupIds=["sg-7f5ad1ff"],
             SubnetId="subnet-c12fdba4",
             PublicKey="ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCtp04H/y...",
             NumberOfNodes=3,
             ExtraPythonLibsS3Path="s3://bucket/prefix/lib_A.zip,s3://bucket_B/prefix/lib_X.zip")
```

When you update a development endpoint, you can also update the libraries it loads by using a [DevEndpointCustomLibraries](aws-glue-api-dev-endpoint.md#aws-glue-api-dev-endpoint-DevEndpointCustomLibraries) object and setting the `UpdateEtlLibraries` parameter to `True` when calling [UpdateDevEndpoint (update_dev_endpoint)](aws-glue-api-dev-endpoint.md#aws-glue-api-dev-endpoint-UpdateDevEndpoint).

## Using Python libraries in a job or JobRun

When you create a new job on the console, you can specify one or more library .zip files by choosing **Script Libraries and job parameters (optional)** and entering the full Amazon S3 library paths in the same way you would when creating a development endpoint:

```
s3://bucket/prefix/lib_A.zip,s3://bucket_B/prefix/lib_X.zip
```

If you are calling [CreateJob (create_job)](aws-glue-api-jobs-job.md#aws-glue-api-jobs-job-CreateJob), you can specify one or more full paths to default libraries using the `--extra-py-files` default parameter, like this:

```
job = glue.create_job(Name='sampleJob',
                      Role='Glue_DefaultRole',
                      Command={'Name': 'glueetl',
                               'ScriptLocation': 's3://my_script_bucket/scripts/my_etl_script.py'},
                      DefaultArguments={'--extra-py-files': 's3://bucket/prefix/lib_A.zip,s3://bucket_B/prefix/lib_X.zip'})
```

Then, when you start a JobRun, you can override the default library setting with a different one:

```
runId = glue.start_job_run(JobName='sampleJob',
                           Arguments={'--extra-py-files': 's3://bucket/prefix/lib_B.zip'})
```

## Proactively analyzing Python dependencies

To proactively identify potential dependency issues before deploying to AWS Glue, you can use a dependency analysis tool to validate your Python packages against your target AWS Glue environment.

AWS provides an open-source Python dependency analyzer tool designed specifically for AWS Glue environments. The tool is available in the AWS Glue samples repository and can be run locally to validate your dependencies before deployment.

This analysis helps ensure that your dependencies follow the recommended practice of pinning all library versions for consistent production deployments. For more details, see the tool's [README](https://github.com/aws-samples/aws-glue-samples/tree/master/utilities/glue_python_dependency_analyzer).

### Using the AWS Glue dependency analyzer

The AWS Glue
Python dependency analyzer helps identify unpinned dependencies and version conflicts by simulating a pip installation with platform-specific constraints that match your target AWS Glue environment.

```
# Analyze a single Glue job
python glue_dependency_analyzer.py -j my-glue-job

# Analyze multiple jobs with specific AWS configuration
python glue_dependency_analyzer.py -j job1 -j job2 --aws-profile production --aws-region us-west-2
```

The tool flags:
+ Unpinned dependencies that could install different versions across job runs
+ Version conflicts between packages
+ Dependencies that are not available for your target AWS Glue environment

## Using Amazon Q Developer to analyze and fix job failures caused by Python dependencies

Amazon Q Developer is a generative artificial intelligence (AI) powered conversational assistant that can help you understand, build, extend, and operate AWS applications. You can download it by following the instructions in the Amazon Q getting started guide.

Amazon Q Developer can be used to analyze and fix job failures caused by Python dependencies. We suggest using the following prompt, replacing the job placeholder `<Job Name>` with the name of your Glue job.

```
I have an AWS Glue job named <Job Name> that has failed due to Python module installation conflicts. Please assist in diagnosing and resolving this issue using the following systematic approach. Proceed once sufficient information is available.

Objective: Implement a fix that addresses the root cause module while minimizing disruption to the existing working environment.

Step 1: Root Cause Analysis
• Retrieve the most recent failed job run ID for the specified Glue job
• Extract error logs from CloudWatch Logs using the job run ID as a log stream prefix
• Analyze the logs to identify:
  • The recently added or modified Python module that triggered the dependency conflict
  • The specific dependency chain causing the installation failure
  • Version compatibility conflicts between required and existing modules

Step 2: Baseline Configuration Identification
• Locate the last successful job run ID prior to the dependency failure
• Document the Python module versions that were functioning correctly in that baseline run
• Establish the compatible version constraints for conflicting dependencies

Step 3: Targeted Resolution Implementation
• Apply pinning by updating the job's additional_python_modules parameter
• Pin only the root cause module and its directly conflicting dependencies to compatible versions, and do not remove python modules unless necessary
• Preserve flexibility for
non-conflicting modules by avoiding unnecessary version constraints
• Deploy the configuration changes with minimal changes to the existing configuration and execute a validation test run. Do not change the Glue versions.

Implementation Example:
Scenario: Recently added pandas==2.0.0 to additional_python_modules
Error: numpy version conflict (pandas 2.0.0 requires numpy>=1.21, but existing job code requires numpy<1.20)
Resolution: Update additional_python_modules to "pandas==1.5.3,numpy==1.19.5"
Rationale: Use pandas 1.5.3 (compatible with numpy 1.19.5) and pin numpy to last known working version

Expected Outcome: Restore job functionality with minimal configuration changes while maintaining system stability.
```

The prompt instructs Q to do the following:

1. Retrieve the most recent failed job run ID
1. Find the associated logs and details
1. Find a successful job run to detect any changed Python packages
1. Make any configuration fixes and trigger another test run

## Python modules already provided in AWS Glue

To change the version of these provided modules, provide new versions with the `--additional-python-modules` job parameter.

------
#### [ AWS Glue version 5.1 ]

AWS Glue version 5.1 includes the following Python modules out of the box:
+ aiobotocore==2.25.1
+ aiohappyeyeballs==2.6.1
+ aiohttp==3.13.2
+ aioitertools==0.12.0
+ aiosignal==1.4.0
+ appdirs==1.4.4
+ attrs==25.4.0
+ boto3==1.40.61
+ botocore==1.40.61
+ certifi==2025.10.5
+ charset-normalizer==3.4.4
+ choreographer==1.2.0
+ contourpy==1.3.3
+ cycler==0.12.1
+ distlib==0.4.0
+ filelock==3.20.0
+ fonttools==4.60.1
+ frozenlist==1.8.0
+ fsspec==2025.10.0
+ idna==3.11
+ iniconfig==2.3.0
+ jmespath==1.0.1
+ kaleido==1.2.0
+ kiwisolver==1.4.9
+ logistro==2.0.1
+ matplotlib==3.10.7
+ multidict==6.7.0
+ narwhals==2.10.2
+ numpy==2.3.4
+ orjson==3.11.4
+ packaging==25.0
+ pandas==2.3.3
+ pillow==12.0.0
+ pip==24.0
+ platformdirs==4.5.0
+ plotly==6.4.0
+ pluggy==1.6.0
+ propcache==0.4.1
+ pyarrow==22.0.0
+ Pygments==2.19.2
+ pyparsing==3.2.5
+ pytest-timeout==2.4.0
+ pytest==8.4.2
+ python-dateutil==2.9.0.post0
+ pytz==2025.2
+ requests==2.32.5
+ s3fs==2025.10.0
+ s3transfer==0.14.0
+ seaborn==0.13.2
+ setuptools==79.0.1
+ simplejson==3.20.2
+ six==1.17.0
+ tenacity==9.1.2
+ typing_extensions==4.15.0
+ 
tzdata==2025.2
+ urllib3==2.5.0
+ uv==0.9.7
+ virtualenv==20.35.4
+ wrapt==1.17.3
+ yarl==1.22.0

------
#### [ AWS Glue version 5.0 ]

AWS Glue version 5.0 includes the following Python modules out of the box:
+ aiobotocore==2.13.1
+ aiohappyeyeballs==2.3.5
+ aiohttp==3.10.1
+ aioitertools==0.11.0
+ aiosignal==1.3.1
+ appdirs==1.4.4
+ attrs==24.2.0
+ boto3==1.34.131
+ botocore==1.34.131
+ certifi==2024.7.4
+ charset-normalizer==3.3.2
+ contourpy==1.2.1
+ cycler==0.12.1
+ fonttools==4.53.1
+ frozenlist==1.4.1
+ fsspec==2024.6.1
+ idna==2.10
+ jmespath==0.10.0
+ kaleido==0.2.1
+ kiwisolver==1.4.5
+ matplotlib==3.9.0
+ multidict==6.0.5
+ numpy==1.26.4
+ packaging==24.1
+ pandas==2.2.2
+ pillow==10.4.0
+ pip==23.0.1
+ plotly==5.23.0
+ pyarrow==17.0.0
+ pyparsing==3.1.2
+ python-dateutil==2.9.0.post0
+ pytz==2024.1
+ requests==2.32.2
+ s3fs==2024.6.1
+ s3transfer==0.10.2
+ seaborn==0.13.2
+ setuptools==59.6.0
+ six==1.16.0
+ tenacity==9.0.0
+ tzdata==2024.1
+ urllib3==1.25.10
+ virtualenv==20.4.0
+ wrapt==1.16.0
+ yarl==1.9.4

------
#### [ AWS Glue version 4.0 ]

AWS Glue version 4.0 includes the following Python modules out of the box:
+ aiobotocore==2.4.1
+ aiohttp==3.8.3
+ aioitertools==0.11.0
+ aiosignal==1.3.1
+ async-timeout==4.0.2
+ asynctest==0.13.0
+ attrs==22.2.0
+ avro-python3==1.10.2
+ boto3==1.24.70
+ botocore==1.27.59
+ certifi==2021.5.30
+ chardet==3.0.4
+ charset-normalizer==2.1.1
+ click==8.1.3
+ cycler==0.10.0
+ Cython==0.29.32
+ fsspec==2021.8.1
+ idna==2.10
+ importlib-metadata==5.0.0
+ jmespath==0.10.0
+ joblib==1.0.1
+ kaleido==0.2.1
+ kiwisolver==1.4.4
+ matplotlib==3.4.3
+ mpmath==1.2.1
+ multidict==6.0.4
+ nltk==3.7
+ numpy==1.23.5
+ packaging==23.0
+ pandas==1.5.1
+ patsy==0.5.1
+ Pillow==9.4.0
+ pip==23.0.1
+ plotly==5.16.0
+ pmdarima==2.0.1
+ ptvsd==4.3.2
+ pyarrow==10.0.0
+ pydevd==2.5.0
+ pyhocon==0.3.58
+ PyMySQL==1.0.2
+ pyparsing==2.4.7
+ python-dateutil==2.8.2
+ pytz==2021.1
+ PyYAML==6.0.1
+ regex==2022.10.31
+ requests==2.23.0
+ s3fs==2022.11.0
+ s3transfer==0.6.0
+ scikit-learn==1.1.3
+ scipy==1.9.3
+ 
setuptools==49.1.3
+ six==1.16.0
+ statsmodels==0.13.5
+ subprocess32==3.5.4
+ sympy==1.8
+ tbats==1.1.0
+ threadpoolctl==3.1.0
+ tqdm==4.64.1
+ typing_extensions==4.4.0
+ urllib3==1.25.11
+ wheel==0.37.0
+ wrapt==1.14.1
+ yarl==1.8.2
+ zipp==3.10.0

------
#### [ AWS Glue version 3.0 ]

AWS Glue version 3.0 includes the following Python modules out of the box:
+ aiobotocore==1.4.2
+ aiohttp==3.8.3
+ aioitertools==0.11.0
+ aiosignal==1.3.1
+ async-timeout==4.0.2
+ asynctest==0.13.0
+ attrs==22.2.0
+ avro-python3==1.10.2
+ boto3==1.18.50
+ botocore==1.21.50
+ certifi==2021.5.30
+ chardet==3.0.4
+ charset-normalizer==2.1.1
+ click==8.1.3
+ cycler==0.10.0
+ Cython==0.29.4
+ docutils==0.17.1
+ enum34==1.1.10
+ frozenlist==1.3.3
+ fsspec==2021.8.1
+ idna==2.10
+ importlib-metadata==6.0.0
+ jmespath==0.10.0
+ joblib==1.0.1
+ kiwisolver==1.3.2
+ matplotlib==3.4.3
+ mpmath==1.2.1
+ multidict==6.0.4
+ nltk==3.6.3
+ numpy==1.19.5
+ packaging==23.0
+ pandas==1.3.2
+ patsy==0.5.1
+ Pillow==9.4.0
+ pip==23.0
+ pmdarima==1.8.2
+ ptvsd==4.3.2
+ pyarrow==5.0.0
+ pydevd==2.5.0
+ pyhocon==0.3.58
+ PyMySQL==1.0.2
+ pyparsing==2.4.7
+ python-dateutil==2.8.2
+ pytz==2021.1
+ PyYAML==5.4.1
+ regex==2022.10.31
+ requests==2.23.0
+ s3fs==2021.8.1
+ s3transfer==0.5.0
+ scikit-learn==0.24.2
+ scipy==1.7.1
+ six==1.16.0
+ Spark==1.0
+ statsmodels==0.12.2
+ subprocess32==3.5.4
+ sympy==1.8
+ tbats==1.1.0
+ threadpoolctl==3.1.0
+ tqdm==4.64.1
+ typing_extensions==4.4.0
+ urllib3==1.25.11
+ wheel==0.37.0
+ wrapt==1.14.1
+ yarl==1.8.2
+ zipp==3.12.0

------
#### [ AWS Glue version 2.0 ]

AWS Glue version 2.0 includes the following Python modules out of the box:
+ avro-python3==1.10.0
+ awscli==1.27.60
+ boto3==1.12.4
+ botocore==1.15.4
+ certifi==2019.11.28
+ chardet==3.0.4
+ click==8.1.3
+ colorama==0.4.4
+ cycler==0.10.0
+ Cython==0.29.15
+ docutils==0.15.2
+ enum34==1.1.9
+ fsspec==0.6.2
+ idna==2.9
+ importlib-metadata==6.0.0
+ jmespath==0.9.4
+ joblib==0.14.1
+ kiwisolver==1.1.0
+ matplotlib==3.1.3
+ mpmath==1.1.0
+ nltk==3.5
+ numpy==1.18.1
+ pandas==1.0.1
+ 
patsy==0.5.1
+ pmdarima==1.5.3
+ ptvsd==4.3.2
+ pyarrow==0.16.0
+ pyasn1==0.4.8
+ pydevd==1.9.0
+ pyhocon==0.3.54
+ PyMySQL==0.9.3
+ pyparsing==2.4.6
+ python-dateutil==2.8.1
+ pytz==2019.3
+ PyYAML==5.3.1
+ regex==2022.10.31
+ requests==2.23.0
+ rsa==4.7.2
+ s3fs==0.4.0
+ s3transfer==0.3.3
+ scikit-learn==0.22.1
+ scipy==1.4.1
+ setuptools==45.2.0
+ six==1.14.0
+ Spark==1.0
+ statsmodels==0.11.1
+ subprocess32==3.5.4
+ sympy==1.5.1
+ tbats==1.0.9
+ tqdm==4.64.1
+ typing-extensions==4.4.0
+ urllib3==1.25.8
+ wheel==0.35.1
+ zipp==3.12.0

------

## Appendix A: Creating a zip of wheels artifact

This appendix shows by example how to create a zip of wheels artifact. The example downloads the packages `cryptography` and `scipy` into a zip of wheels and copies the zip of wheels to an Amazon S3 location.

1. You must run the commands that create the zip of wheels in an Amazon Linux environment similar to the Glue environment. See [Appendix B: AWS Glue environment details](#glue-python-libraries-environment-details). Glue 5.1 uses AL2023 with Python version 3.11. Create a Dockerfile that builds this environment:

   ```
   FROM --platform=linux/amd64 public.ecr.aws/amazonlinux/amazonlinux:2023-minimal

   # Install Python 3.11, pip, and zip utility
   RUN dnf install -y python3.11 pip zip && \
       dnf clean all

   WORKDIR /build
   ```

1. Create a requirements.txt file

   ```
   cryptography
   scipy
   ```

1. Build and start the Docker container

   ```
   # Build docker image
   docker build --platform linux/amd64 -t glue-wheel-builder .

   # Spin up container
   docker run --platform linux/amd64 -v $(pwd)/requirements.txt:/input/requirements.txt:ro -v $(pwd):/output -it glue-wheel-builder bash
   ```

1. Run the following commands inside the Docker container

   ```
   # Create a directory for the wheels
   mkdir wheels

   # Copy requirements.txt into wheels directory
   cp /input/requirements.txt wheels/

   # Download the wheels with the correct platform and Python version
   pip3 download \
       -r wheels/requirements.txt \
       --dest wheels/ \
       --platform manylinux2014_x86_64 \
       --python-version 311 \
       --only-binary=:all:

   # Package the wheels into a zip archive with the .gluewheels.zip suffix
   zip -r mylibraries-1.0.0.gluewheels.zip wheels/

   # Copy zip to output
   cp mylibraries-1.0.0.gluewheels.zip /output/

   # Exit the container
   exit
   ```

1. 
Upload the zip of wheels to an Amazon S3 location

   ```
   aws s3 cp mylibraries-1.0.0.gluewheels.zip s3://amzn-s3-demo-bucket/example-prefix/
   ```

1. Optional cleanup

   ```
   rm mylibraries-1.0.0.gluewheels.zip
   rm Dockerfile
   rm requirements.txt
   ```

1. Run the Glue job with the following job arguments:

   ```
   --additional-python-modules s3://amzn-s3-demo-bucket/example-prefix/mylibraries-1.0.0.gluewheels.zip
   --python-modules-installer-option --no-index
   ```

## Appendix B: AWS Glue environment details

**Glue version compatibility and installation methods**

| AWS Glue version | Python version | Base image | glibc version | Compatible platform tags |
| --- | --- | --- | --- | --- |
| 5.1 | 3.11 | [Amazon Linux 2023 (AL2023)](https://aws.amazon.com/linux/amazon-linux-2023/) | 2.34 | `manylinux_2_34_x86_64`, `manylinux_2_28_x86_64`, `manylinux2014_x86_64` |
| 5.0 | 3.11 | [Amazon Linux 2023 (AL2023)](https://aws.amazon.com/linux/amazon-linux-2023/) | 2.34 | `manylinux_2_34_x86_64`, `manylinux_2_28_x86_64`, `manylinux2014_x86_64` |
| 4.0 | 3.10 | [Amazon Linux 2 (AL2)](https://aws.amazon.com/amazon-linux-2/) | 2.26 | `manylinux2014_x86_64` |
| 3.0 | 3.7 | [Amazon Linux 2 (AL2)](https://aws.amazon.com/amazon-linux-2/) | 2.26 | `manylinux2014_x86_64` |
| 2.0 | 3.7 | [Amazon Linux AMI (AL1)](https://aws.amazon.com/amazon-linux-ami/) | 2.17 | `manylinux2014_x86_64` |

Under the [AWS shared responsibility model](https://aws.amazon.com/compliance/shared-responsibility-model/), you are responsible for managing the additional Python modules, libraries, and their dependencies that you use with your AWS Glue ETL jobs. This includes applying updates and security patches.

AWS Glue does not support compiling native code in the job environment. However, AWS Glue jobs run in an Amazon-managed Linux environment. You may be able to provide your native dependencies in compiled form through a Python wheel file. Refer to the table above for AWS Glue version compatibility details.

**Important**
Using incompatible dependencies can cause runtime issues, particularly for libraries with native extensions that must match the architecture and system libraries of the target environment. Each AWS Glue version runs on a specific Python version with pre-installed libraries and system configurations.

# AWS Glue Python code samples
+ [Code example: Joining and relationalizing data](aws-glue-programming-python-samples-legislators.md)
+ [Code example: Data preparation using ResolveChoice, Lambda, and ApplyMapping](aws-glue-programming-python-samples-medicaid.md)

# Code example: Joining and relationalizing data

This example uses a dataset that was downloaded from [http://everypolitician.org/](http://everypolitician.org/) to Amazon Simple Storage Service (Amazon S3), in the `sample-dataset`
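bucket: `s3://awsglue-datasets/examples/us-legislators/all`. The dataset contains data in JSON format about United States legislators and the seats that they have held in the US House of Representatives and Senate. It has been modified slightly for purposes of this tutorial and made available in a public Amazon S3 bucket.

You can find the source code for this example in the `join_and_relationalize.py` file in the [AWS Glue samples repository](https://github.com/awslabs/aws-glue-samples) on the GitHub website.

Using this data, this tutorial shows you how to do the following:
+ Use an AWS Glue crawler to classify objects that are stored in a public Amazon S3 bucket and save their schemas into the AWS Glue Data Catalog.
+ Examine the table metadata and schemas that result from the crawl.
+ Write a Python extract, transform, and load (ETL) script that uses the metadata in the Data Catalog to do the following:
  + Join the data in the different source files together into a single data table (that is, denormalize the data).
  + Filter the joined table into separate tables by type of legislator.
  + Write out the resulting data to separate Apache Parquet files for later analysis.

The preferred way to debug Python or PySpark scripts while running on AWS is to use [Notebooks on AWS Glue Studio](https://docs.aws.amazon.com/glue/latest/ug/notebooks-chapter.html).

## Step 1: Crawl the data in the Amazon S3 bucket

1. Sign in to the AWS Management Console, and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. Following the steps in [Configuring a crawler](define-crawler.md), create a new crawler that can crawl the `s3://awsglue-datasets/examples/us-legislators/all` dataset into a database named `legislators` in the AWS Glue Data Catalog. The example data is already in this public Amazon S3 bucket.

1. 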
Run the new crawler, and then check the `legislators` database.

   The crawler creates the following metadata tables:
   + `persons_json`
   + `memberships_json`
   + `organizations_json`
   + `events_json`
   + `areas_json`
   + `countries_r_json`

   This is a semi-normalized collection of tables containing legislators and their histories.

## Step 2: Add boilerplate script to the development endpoint notebook

Paste the following boilerplate script into the development endpoint notebook to import the AWS Glue libraries that you need, and set up a single `GlueContext`:

```
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

glueContext = GlueContext(SparkContext.getOrCreate())
```

## Step 3: Examine the schemas from the data in the Data Catalog

Next, you can easily create a DynamicFrame from the AWS Glue Data Catalog and examine the schemas of the data. For example, to see the schema of the `persons_json` table, add the following in your notebook:

```
persons = glueContext.create_dynamic_frame.from_catalog(
             database="legislators",
             table_name="persons_json")
print("Count: ", persons.count())
persons.printSchema()
```

Here's the output from the print calls:

```
Count:  1961
root
|-- family_name: string
|-- name: string
|-- links: array
|    |-- element: struct
|    |    |-- note: string
|    |    |-- url: string
|-- gender: string
|-- image: string
|-- identifiers: array
|    |-- element: struct
|    |    |-- scheme: string
|    |    |-- identifier: string
|-- other_names: array
|    |-- element: struct
|    |    |-- note: string
|    |    |-- name: string
|    |    |-- lang: string
|-- sort_name: string
|-- images: array
|    |-- element: struct
|    |    |-- url: string
|-- given_name: string
|-- birth_date: string
|-- id: string
|-- contact_details: array
|    |-- element: struct
|    |    |-- type: string
|    |    |-- value: string
|-- death_date: string
```

Each person in the table is a member of some US congressional body.

To view the schema of the `memberships_json` table, type the following:

```
memberships = glueContext.create_dynamic_frame.from_catalog(
                 database="legislators",
                 table_name="memberships_json")
print("Count: ", memberships.count())
memberships.printSchema()
```

The output is as follows:

```
Count:  10439
root
|-- area_id: string
|-- on_behalf_of_id: string
|-- organization_id: string
|-- role: string
|-- person_id: string
|-- legislative_period_id: string
|-- start_date: string
|-- end_date: string
```

The `organizations` are parties and the two chambers of Congress, the Senate and the House of Representatives. To view the schema of the
`organizations_json` table, type the following:

```
orgs = glueContext.create_dynamic_frame.from_catalog(
    database="legislators",
    table_name="organizations_json")
print("Count: ", orgs.count())
orgs.printSchema()
```

The output is as follows:

```
Count: 13
root
|-- classification: string
|-- links: array
|    |-- element: struct
|    |    |-- note: string
|    |    |-- url: string
|-- image: string
|-- identifiers: array
|    |-- element: struct
|    |    |-- scheme: string
|    |    |-- identifier: string
|-- other_names: array
|    |-- element: struct
|    |    |-- lang: string
|    |    |-- note: string
|    |    |-- name: string
|-- id: string
|-- name: string
|-- seats: int
|-- type: string
```

## Step 4: Filter the data

Next, keep only the fields that you want, and rename `id` to `org_id` and `name` to `org_name`. The dataset is small enough that you can view the whole thing.

`toDF()` converts a `DynamicFrame` to an Apache Spark `DataFrame`, so you can apply the transforms that already exist in Apache Spark SQL:

```
orgs = orgs.drop_fields(['other_names',
                         'identifiers']).rename_field(
                             'id', 'org_id').rename_field(
                                 'name', 'org_name')
orgs.toDF().show()
```

The following shows the output:

```
+--------------+--------------------+--------------------+--------------------+-----+-----------+--------------------+
|classification| org_id| org_name| links|seats| type| image|
+--------------+--------------------+--------------------+--------------------+-----+-----------+--------------------+
| party| party/al| AL| null| null| null| null|
| party| party/democrat| Democrat|[[website,http://...| null| null|https://upload.wi...|
| party|party/democrat-li...| Democrat-Liberal|[[website,http://...| null| null| null|
| legislature|d56acebe-8fdc-47b...|House of Represen...| null| 435|lower house| null|
| party| party/independent| Independent| null| null| null| null|
| party|party/new_progres...| New Progressive|[[website,http://...| null| null|https://upload.wi...|
| party|party/popular_dem...| Popular Democrat|[[website,http://...| null| null| null|
| party| party/republican| Republican|[[website,http://...| null| null|https://upload.wi...|
| party|party/republican-...|Republican-Conser...|[[website,http://...| null| null| null|
| party|
party/democrat| Democrat|[[website,http://...| null| null|https://upload.wi...|
| party| party/independent| Independent| null| null| null| null|
| party| party/republican| Republican|[[website,http://...| null| null|https://upload.wi...|
| legislature|8fa6c3d2-71dc-478...| Senate| null| 100|upper house| null|
+--------------+--------------------+--------------------+--------------------+-----+-----------+--------------------+
```

Type the following to view the `organizations` that appear in `memberships`:

```
memberships.select_fields(['organization_id']).toDF().distinct().show()
```

The following shows the output:

```
+--------------------+
| organization_id|
+--------------------+
|d56acebe-8fdc-47b...|
|8fa6c3d2-71dc-478...|
+--------------------+
```

## Step 5: Put it all together

Now, use AWS Glue to join these relational tables and create one full history table of legislator `memberships` and their corresponding `organizations`.

1. First, join `persons` and `memberships` on `id` and `person_id`.
1. Next, join the result with `orgs` on `org_id` and `organization_id`.
1. Then, drop the redundant fields, `person_id` and `org_id`.

You can do all these operations in one (extended) line of code:

```
l_history = Join.apply(orgs,
                       Join.apply(persons, memberships, 'id', 'person_id'),
                       'org_id', 'organization_id').drop_fields(['person_id', 'org_id'])
print("Count: ", l_history.count())
l_history.printSchema()
```

The output is as follows:

```
Count: 10439
root
|-- role: string
|-- seats: int
|-- org_name: string
|-- links: array
|    |-- element: struct
|    |    |-- note: string
|    |    |-- url: string
|-- type: string
|-- sort_name: string
|-- area_id: string
|-- images: array
|    |-- element: struct
|    |    |-- url: string
|-- on_behalf_of_id: string
|-- other_names: array
|    |-- element: struct
|    |    |-- note: string
|    |    |-- name: string
|    |    |-- lang: string
|-- contact_details: array
|    |-- element: struct
|    |    |-- type: string
|    |    |-- value: string
|-- name: string
|-- birth_date: string
|-- organization_id: string
|-- gender: string
|-- classification: string
|-- death_date: string
|-- legislative_period_id: string
|-- identifiers: array
|    |-- element: struct
|    |    |-- scheme: string
|    |    |-- identifier: string
|-- image: string
|--
given_name: string
|-- family_name: string
|-- id: string
|-- start_date: string
|-- end_date: string
```

You now have the final table that you can use for analysis. You can write it out in a compact, efficient format for analytics (namely Parquet) that you can run SQL over in AWS Glue, Amazon Athena, or Amazon Redshift Spectrum.

The following call writes the table across multiple files to support fast parallel reads when doing analysis later:

```
glueContext.write_dynamic_frame.from_options(
    frame = l_history,
    connection_type = "s3",
    connection_options = {"path": "s3://glue-sample-target/output-dir/legislator_history"},
    format = "parquet")
```

To put all the history data into a single file, you must convert it to a data frame, repartition it, and write it out:

```
s_history = l_history.toDF().repartition(1)
s_history.write.parquet('s3://glue-sample-target/output-dir/legislator_single')
```

Or, if you want to separate it by the Senate and the House:

```
l_history.toDF().write.parquet('s3://glue-sample-target/output-dir/legislator_part',
                               partitionBy=['org_name'])
```

## Step 6: Transform the data for relational databases

AWS Glue makes it easy to write data to relational databases such as Amazon Redshift, even when the data is semi-structured. It offers a transform, `relationalize`, which flattens `DynamicFrames` no matter how complex the objects in the frame might be.

Using the `l_history` `DynamicFrame` from this example, pass the name of a root table (`hist_root`) and a temporary working path to `relationalize`. This returns a `DynamicFrameCollection`. You can then list the names of the `DynamicFrames` in that collection:

```
dfc = l_history.relationalize("hist_root", "s3://glue-sample-target/temp-dir/")
dfc.keys()
```

The following is the output of the `keys` call:

```
[u'hist_root', u'hist_root_contact_details', u'hist_root_links', u'hist_root_other_names', u'hist_root_images', u'hist_root_identifiers']
```

`Relationalize` broke the history table out into six new tables: a root table that contains a record for each object in the `DynamicFrame`, and auxiliary tables for the arrays. Array handling in relational databases is often suboptimal, especially as those arrays become large. Separating the arrays into different tables makes the queries go much faster.

Next, look at the separation by examining `contact_details`:

```
l_history.select_fields('contact_details').printSchema()
dfc.select('hist_root_contact_details').toDF().where("id = 10 or id = 75").orderBy(['id','index']).show()
```

The following is the output of the `show` call:

```
root
|-- contact_details: array
|    |-- element: struct
|    |    |-- type: string
|    |    |-- value: string
+---+-----+------------------------+-------------------------+
| id|index|contact_details.val.type|contact_details.val.value|
+---+-----+------------------------+-------------------------+
| 10| 0| fax| |
| 10| 1| | 202-225-1314|
| 10|
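2| phone| |
| 10| 3| | 202-225-3772|
| 10| 4| twitter| |
| 10| 5| | MikeRossUpdates|
| 75| 0| fax| |
| 75| 1| | 202-225-7856|
| 75| 2| phone| |
| 75| 3| | 202-225-2711|
| 75| 4| twitter| |
| 75| 5| | SenCapito|
+---+-----+------------------------+-------------------------+
```

The `contact_details` field was an array of structs in the original `DynamicFrame`. Each element of those arrays is a separate row in the auxiliary table, indexed by `index`. The `id` here is a foreign key into the `hist_root` table, where the matching key is `contact_details`:

```
dfc.select('hist_root').toDF().where(
    "contact_details = 10 or contact_details = 75").select(
       ['id', 'given_name', 'family_name', 'contact_details']).show()
```

The following is the output:

```
+--------------------+----------+-----------+---------------+
| id|given_name|family_name|contact_details|
+--------------------+----------+-----------+---------------+
|f4fc30ee-7b42-432...| Mike| Ross| 10|
|e3c60f34-7d1b-4c0...| Shelley| Capito| 75|
+--------------------+----------+-----------+---------------+
```

Notice that these commands use `toDF()` and then a `where` expression to filter for the rows that you want to see.

So, joining the `hist_root` table with the auxiliary tables lets you do the following:

+ Load data into databases without array support.
+ Query each individual item in an array using SQL.

Safely store and access your Amazon Redshift credentials with an AWS Glue connection. For information about how to create your own connection, see [Connecting to data](glue-connections.md).

You are now ready to write your data to a connection by cycling through the `DynamicFrames` one at a time:

```
for df_name in dfc.keys():
    m_df = dfc.select(df_name)
    print("Writing to table: ", df_name)
    glueContext.write_dynamic_frame.from_jdbc_conf(
        frame = m_df,
        # connection settings here
    )
```

Your connection settings will differ based on your type of relational database:

+ For instructions on writing to Amazon Redshift, see [Redshift connections](aws-glue-programming-etl-connect-redshift-home.md).
+ For other databases, see [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md).

## Conclusion

Overall, AWS Glue is very flexible. It lets you accomplish, in a few lines of code, what would normally take days to write. You can find the entire source-to-target ETL script in the Python file `join_and_relationalize.py` in the [AWS Glue samples](https://github.com/awslabs/aws-glue-samples) on GitHub.

# Code example: Data preparation using ResolveChoice, Lambda, and ApplyMapping

The dataset used in this example consists of Medicare Provider payment data that was downloaded from two [Data.CMS.gov](https://data.cms.gov) data sets: "Inpatient Prospective Payment System Provider Summary for the Top 100 Diagnosis-Related Groups - FY2011" and "Inpatient Charge Data FY 2011". After downloading the data, we modified the dataset to introduce a couple of erroneous records at the end of the file. This modified file is located in a public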
Amazon S3 bucket at `s3://awsglue-datasets/examples/medicare/Medicare_Hospital_Provider.csv`.

You can find the source code for this example in the `data_cleaning_and_lambda.py` file in the [AWS Glue examples](https://github.com/awslabs/aws-glue-samples) GitHub repository.

The preferred way to debug Python or PySpark scripts while running on AWS is to use [Notebooks on AWS Glue Studio](https://docs.aws.amazon.com/glue/latest/ug/notebooks-chapter.html).

## Step 1: Crawl the data in the Amazon S3 bucket

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).
1. Following the process described in [Configuring a crawler](define-crawler.md), create a new crawler that can crawl the `s3://awsglue-datasets/examples/medicare/Medicare_Hospital_Provider.csv` file and place the resulting metadata into a database named `payments` in the AWS Glue Data Catalog.
1. Run the new crawler, and then check the `payments` database. After reading the beginning of the file to determine its format and delimiter, the crawler should have created a metadata table named `medicare` in the database.

   The schema of the new `medicare` table is as follows:

   ```
   Column name                            Data type
   ==================================================
   drg definition                         string
   provider id                            bigint
   provider name                          string
   provider street address                string
   provider city                          string
   provider state                         string
   provider zip code                      bigint
   hospital referral region description   string
   total discharges                       bigint
   average covered charges                string
   average total payments                 string
   average medicare payments              string
   ```

## Step 2: Add boilerplate script to the development endpoint notebook

Paste the following boilerplate script into the development endpoint notebook to import the AWS Glue libraries that you need, and set up a single `GlueContext`:

```
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

glueContext = GlueContext(SparkContext.getOrCreate())
```

## Step 3: Compare different schema parsings

Next, you can see whether the schema recognized by an Apache Spark `DataFrame` is the same as the one that your AWS Glue crawler recorded. Run this code:

```
# Notebooks usually predefine `spark`; otherwise take the session from the GlueContext.
spark = glueContext.spark_session

medicare = spark.read.format(
   "com.databricks.spark.csv").option(
   "header", "true").option(
   "inferSchema", "true").load(
   's3://awsglue-datasets/examples/medicare/Medicare_Hospital_Provider.csv')
medicare.printSchema()
```

Here's the output from the `printSchema` call:

```
root
|-- DRG Definition: string (nullable = true)
|-- Provider Id: string (nullable = true)
|--
Provider Name: string (nullable = true)
|-- Provider Street Address: string (nullable = true)
|-- Provider City: string (nullable = true)
|-- Provider State: string (nullable = true)
|-- Provider Zip Code: integer (nullable = true)
|-- Hospital Referral Region Description: string (nullable = true)
|-- Total Discharges : integer (nullable = true)
|-- Average Covered Charges : string (nullable = true)
|-- Average Total Payments : string (nullable = true)
|-- Average Medicare Payments: string (nullable = true)
```

Next, look at the schema that an AWS Glue `DynamicFrame` generates:

```
medicare_dynamicframe = glueContext.create_dynamic_frame.from_catalog(
       database = "payments",
       table_name = "medicare")
medicare_dynamicframe.printSchema()
```

The output from `printSchema` is as follows:

```
root
|-- drg definition: string
|-- provider id: choice
|    |-- long
|    |-- string
|-- provider name: string
|-- provider street address: string
|-- provider city: string
|-- provider state: string
|-- provider zip code: long
|-- hospital referral region description: string
|-- total discharges: long
|-- average covered charges: string
|-- average total payments: string
|-- average medicare payments: string
```

The `DynamicFrame` generates a schema in which `provider id` could be either a `long` or a `string` type. The `DataFrame` schema lists `Provider Id` as a `string` type, and the Data Catalog lists `provider id` as a `bigint` type.

Which one is correct? There are two records at the end of the file (out of 160,000 records) with `string` values in that column. These are the erroneous records that were introduced to illustrate the problem.

To address this kind of problem, the AWS Glue `DynamicFrame` introduces the concept of a *choice* type. In this case, the `DynamicFrame` shows that both `long` and `string` values can appear in that column. The AWS Glue crawler missed the `string` values because it considered only the first 2 MB of the data. The Apache Spark `DataFrame` considered the whole dataset, but it was forced to assign the most general type to the column, namely `string`. In fact, Spark often falls back to the most general type when it encounters complex types or variations that it does not recognize.

To query the `provider id` column, resolve the choice type first. You can use the `resolveChoice` transform method of your `DynamicFrame` to convert the `string` values to `long` values with the `cast:long` option:

```
medicare_res = medicare_dynamicframe.resolveChoice(specs = [('provider id','cast:long')])
medicare_res.printSchema()
```

The `printSchema` output is now:

```
root
|-- drg definition: string
|-- provider id: long
|-- provider name: string
|--
provider street address: string
|-- provider city: string
|-- provider state: string
|-- provider zip code: long
|-- hospital referral region description: string
|-- total discharges: long
|-- average covered charges: string
|-- average total payments: string
|-- average medicare payments: string
```

Where a `string` value cannot be converted, AWS Glue inserts a `null`.

Another option is to convert the choice type to a `struct`, which keeps values of both types.

Next, look at the rows that were anomalous. The column name contains a space, so it is quoted with backticks; single quotes would turn it into a string literal that is never null:

```
medicare_res.toDF().where("`provider id` is NULL").show()
```

You see the following:

```
+--------------------+-----------+---------------+-----------------------+-------------+--------------+-----------------+------------------------------------+----------------+-----------------------+----------------------+-------------------------+
| drg definition|provider id| provider name|provider street address|provider city|provider state|provider zip code|hospital referral region description|total discharges|average covered charges|average total payments|average medicare payments|
+--------------------+-----------+---------------+-----------------------+-------------+--------------+-----------------+------------------------------------+----------------+-----------------------+----------------------+-------------------------+
|948 - SIGNS & SYM...| null| INC| 1050 DIVISION ST| MAUSTON| WI| 53948| WI - Madison| 12| $11961.41| $4619.00| $3775.33|
|948 - SIGNS & SYM...| null| INC- ST JOSEPH| 5000 W CHAMBERS ST| MILWAUKEE| WI| 53210| WI - Milwaukee| 14| $10514.28| $5562.50| $4522.78|
+--------------------+-----------+---------------+-----------------------+-------------+--------------+-----------------+------------------------------------+----------------+-----------------------+----------------------+-------------------------+
```

Now remove the two malformed records, as follows:

```
medicare_dataframe = medicare_res.toDF()
medicare_dataframe = medicare_dataframe.where("`provider id` is NOT NULL")
```

## Step 4: Map the data and use Apache Spark Lambda functions

AWS Glue does not yet directly support Lambda functions, also known as user-defined functions. But you can always convert back and forth between an Apache Spark `DataFrame` and a
`DynamicFrame` to take advantage of Spark functionality in addition to the special features of `DynamicFrames`.

Next, turn the payment information into numbers, so that analytic engines such as Amazon Redshift or Amazon Athena can process it faster:

```
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Strip the leading "$" from each payment string.
chop_f = udf(lambda x: x[1:], StringType())
medicare_dataframe = (
    medicare_dataframe
    .withColumn("ACC", chop_f(medicare_dataframe["average covered charges"]))
    .withColumn("ATP", chop_f(medicare_dataframe["average total payments"]))
    .withColumn("AMP", chop_f(medicare_dataframe["average medicare payments"]))
)
medicare_dataframe.select(['ACC', 'ATP', 'AMP']).show()
```

The output from the `show` call is as follows:

```
+--------+-------+-------+
| ACC| ATP| AMP|
+--------+-------+-------+
|32963.07|5777.24|4763.73|
|15131.85|5787.57|4976.71|
|37560.37|5434.95|4453.79|
|13998.28|5417.56|4129.16|
|31633.27|5658.33|4851.44|
|16920.79|6653.80|5374.14|
|11977.13|5834.74|4761.41|
|35841.09|8031.12|5858.50|
|28523.39|6113.38|5228.40|
|75233.38|5541.05|4386.94|
|67327.92|5461.57|4493.57|
|39607.28|5356.28|4408.20|
|22862.23|5374.65|4186.02|
|31110.85|5366.23|4376.23|
|25411.33|5282.93|4383.73|
| 9234.51|5676.55|4509.11|
|15895.85|5930.11|3972.85|
|19721.16|6192.54|5179.38|
|10710.88|4968.00|3898.88|
|51343.75|5996.00|4962.45|
+--------+-------+-------+
only showing top 20 rows
```

These are all still strings in the data. We can use the `apply_mapping` transform method to drop, rename, cast, and nest the data so that other data programming languages and systems can easily access it:

```
from awsglue.dynamicframe import DynamicFrame

medicare_tmp_dyf = DynamicFrame.fromDF(medicare_dataframe, glueContext, "nested")
medicare_nest_dyf = medicare_tmp_dyf.apply_mapping([('drg definition', 'string', 'drg', 'string'),
                 ('provider id', 'long', 'provider.id', 'long'),
                 ('provider name', 'string', 'provider.name', 'string'),
                 ('provider city', 'string', 'provider.city', 'string'),
                 ('provider state', 'string', 'provider.state', 'string'),
                 ('provider zip code', 'long', 'provider.zip', 'long'),
                 ('hospital referral region description', 'string','rr', 'string'),
                 ('ACC', 'string', 'charges.covered', 'double'),
                 ('ATP', 'string', 'charges.total_pay',
                  'double'),
                 ('AMP', 'string', 'charges.medicare_pay', 'double')])
medicare_nest_dyf.printSchema()
```

The `printSchema` output is as follows:

```
root
|-- drg: string
|-- provider: struct
|    |-- id: long
|    |-- name: string
|    |-- city: string
|    |-- state: string
|    |-- zip: long
|-- rr: string
|-- charges: struct
|    |-- covered: double
|    |-- total_pay: double
|    |-- medicare_pay: double
```

Turning the data back into a Spark `DataFrame`, you can show what it looks like now:

```
medicare_nest_dyf.toDF().show()
```

The output is as follows:

```
+--------------------+--------------------+---------------+--------------------+
| drg| provider| rr| charges|
+--------------------+--------------------+---------------+--------------------+
|039 - EXTRACRANIA...|[10001,SOUTHEAST ...| AL - Dothan|[32963.07,5777.24...|
|039 - EXTRACRANIA...|[10005,MARSHALL M...|AL - Birmingham|[15131.85,5787.57...|
|039 - EXTRACRANIA...|[10006,ELIZA COFF...|AL - Birmingham|[37560.37,5434.95...|
|039 - EXTRACRANIA...|[10011,ST VINCENT...|AL - Birmingham|[13998.28,5417.56...|
|039 - EXTRACRANIA...|[10016,SHELBY BAP...|AL - Birmingham|[31633.27,5658.33...|
|039 - EXTRACRANIA...|[10023,BAPTIST ME...|AL - Montgomery|[16920.79,6653.8,...|
|039 - EXTRACRANIA...|[10029,EAST ALABA...|AL - Birmingham|[11977.13,5834.74...|
|039 - EXTRACRANIA...|[10033,UNIVERSITY...|AL - Birmingham|[35841.09,8031.12...|
|039 - EXTRACRANIA...|[10039,HUNTSVILLE...|AL - Huntsville|[28523.39,6113.38...|
|039 - EXTRACRANIA...|[10040,GADSDEN RE...|AL - Birmingham|[75233.38,5541.05...|
|039 - EXTRACRANIA...|[10046,RIVERVIEW ...|AL - Birmingham|[67327.92,5461.57...|
|039 - EXTRACRANIA...|[10055,FLOWERS HO...| AL - Dothan|[39607.28,5356.28...|
|039 - EXTRACRANIA...|[10056,ST VINCENT...|AL - Birmingham|[22862.23,5374.65...|
|039 - EXTRACRANIA...|[10078,NORTHEAST ...|AL - Birmingham|[31110.85,5366.23...|
|039 - EXTRACRANIA...|[10083,SOUTH BALD...| AL - Mobile|[25411.33,5282.93...|
|039 - EXTRACRANIA...|[10085,DECATUR GE...|AL - Huntsville|[9234.51,5676.55,...|
|039 - EXTRACRANIA...|[10090,PROVIDENCE...| AL -
Mobile|[15895.85,5930.11...|
|039 - EXTRACRANIA...|[10092,D C H REGI...|AL - Tuscaloosa|[19721.16,6192.54...|
|039 - EXTRACRANIA...|[10100,THOMAS HOS...| AL - Mobile|[10710.88,4968.0,...|
|039 - EXTRACRANIA...|[10103,BAPTIST ME...|AL - Birmingham|[51343.75,5996.0,...|
+--------------------+--------------------+---------------+--------------------+
only showing top 20 rows
```

## Step 5: Write the data to Apache Parquet

AWS Glue makes it easy to write the data in a format, such as Apache Parquet, that relational databases can consume efficiently:

```
glueContext.write_dynamic_frame.from_options(
       frame = medicare_nest_dyf,
       connection_type = "s3",
       connection_options = {"path": "s3://glue-sample-target/output-dir/medicare_parquet"},
       format = "parquet")
```

# AWS Glue PySpark extensions reference

AWS Glue has created the following extensions to the PySpark Python dialect.

+ [Accessing parameters using `getResolvedOptions`](aws-glue-api-crawler-pyspark-extensions-get-resolved-options.md)
+ [PySpark extension types](aws-glue-api-crawler-pyspark-extensions-types.md)
+ [DynamicFrame class](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md)
+ [DynamicFrameCollection class](aws-glue-api-crawler-pyspark-extensions-dynamic-frame-collection.md)
+ [DynamicFrameWriter class](aws-glue-api-crawler-pyspark-extensions-dynamic-frame-writer.md)
+ [DynamicFrameReader class](aws-glue-api-crawler-pyspark-extensions-dynamic-frame-reader.md)
+ [GlueContext class](aws-glue-api-crawler-pyspark-extensions-glue-context.md)

# Accessing parameters using `getResolvedOptions`

The AWS Glue `getResolvedOptions(args, options)` utility function gives you access to the arguments that are passed to your script when you run a job. To use this function, start by importing it from the AWS Glue `utils` module, along with the `sys` module:

```
import sys
from awsglue.utils import getResolvedOptions
```

**`getResolvedOptions(args, options)`**

+ `args` – The list of arguments contained in `sys.argv`.
+ `options` – A Python array of the argument names that you want to retrieve.

**Example Retrieving arguments passed to a JobRun**

Suppose that you created a JobRun in a script, perhaps within a Lambda function:

```
response = client.start_job_run(
             JobName = 'my_test_Job',
             Arguments = {
               '--day_partition_key':   'partition_0',
               '--hour_partition_key':  'partition_1',
               '--day_partition_value':  day_partition_value,
               '--hour_partition_value': hour_partition_value } )
```

To retrieve the arguments that are passed, you can use the
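`getResolvedOptions` function as follows:

```
import sys
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv,
                          ['JOB_NAME',
                           'day_partition_key',
                           'hour_partition_key',
                           'day_partition_value',
                           'hour_partition_value'])
print("The day-partition key is: ", args['day_partition_key'])
print("and the day-partition value is: ", args['day_partition_value'])
```

Note that each argument is defined with two leading hyphens and then referenced in the script without the hyphens. Argument names use only underscores, not hyphens. Your arguments need to follow this convention to be resolved.

# PySpark extension types

The types that are used by the AWS Glue PySpark extensions.

## DataType

The base class for the other AWS Glue types.

**`__init__(properties={})`**

+ `properties` – Properties of the data type (optional).

**`typeName(cls)`**

Returns the type of the AWS Glue type class (that is, the class name with "Type" removed from the end).

+ `cls` – An AWS Glue class instance derived from `DataType`.

`jsonValue( )`

Returns a JSON object that contains the data type and properties of the class:

```
{
  "dataType": typeName,
  "properties": properties
}
```

## AtomicType and simple derivatives

Inherits from and extends the [DataType](#aws-glue-api-crawler-pyspark-extensions-types-awsglue-datatype) class, and serves as the base class for all the AWS Glue atomic data types.

**`fromJsonValue(cls, json_value)`**

Initializes a class instance with values from a JSON object.

+ `cls` – An AWS Glue type class instance to initialize.
+ `json_value` – The JSON object to load key-value pairs from.

The following types are simple derivatives of the [AtomicType](#aws-glue-api-crawler-pyspark-extensions-types-awsglue-atomictype) class:

+ `BinaryType` – Binary data.
+ `BooleanType` – Boolean values.
+ `ByteType` – A byte value.
+ `DateType` – A datetime value.
+ `DoubleType` – A floating-point double value.
+ `IntegerType` – An integer value.
+ `LongType` – A long integer value.
+ `NullType` – A null value.
+ `ShortType` – A short integer value.
+ `StringType` – A text string.
+ `TimestampType` – A timestamp value (typically in seconds from 1/1/1970).
+ `UnknownType` – A value of unidentified type.

## DecimalType(AtomicType)

Inherits from and extends the [AtomicType](#aws-glue-api-crawler-pyspark-extensions-types-awsglue-atomictype) class to represent a decimal number (a number expressed in decimal digits, as opposed to binary base-2 numbers).

**`__init__(precision=10, scale=2, properties={})`**

+ `precision` – The number of digits in the decimal number (optional; the default is 10).
+ `scale` – The number of digits to the right of the decimal point (optional; the default is 2).
+ `properties` – The properties of the decimal number (optional).

## EnumType(AtomicType)

Inherits from and extends the [AtomicType](#aws-glue-api-crawler-pyspark-extensions-types-awsglue-atomictype) class to represent an enumeration of valid options.

**`__init__(options)`**

+ `options` – A list of the options being enumerated.

## Collection types

+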
[ArrayType(DataType)](#aws-glue-api-crawler-pyspark-extensions-types-awsglue-arraytype)
+ [ChoiceType(DataType)](#aws-glue-api-crawler-pyspark-extensions-types-awsglue-choicetype)
+ [MapType(DataType)](#aws-glue-api-crawler-pyspark-extensions-types-awsglue-maptype)
+ [Field(Object)](#aws-glue-api-crawler-pyspark-extensions-types-awsglue-field)
+ [StructType(DataType)](#aws-glue-api-crawler-pyspark-extensions-types-awsglue-structtype)
+ [EntityType(DataType)](#aws-glue-api-crawler-pyspark-extensions-types-awsglue-entitytype)

## ArrayType(DataType)

**`__init__(elementType=UnknownType(), properties={})`**

+ `elementType` – The type of elements in the array (optional; the default is UnknownType).
+ `properties` – Properties of the array (optional).

## ChoiceType(DataType)

**`__init__(choices=[], properties={})`**

+ `choices` – A list of possible choices (optional).
+ `properties` – Properties of these choices (optional).

**`add(new_choice)`**

Adds a new choice to the list of possible choices.

+ `new_choice` – The choice to add to the list of possible choices.

**`merge(new_choices)`**

Merges a list of new choices with the existing list of choices.

+ `new_choices` – A list of new choices to merge with the existing choices.

## MapType(DataType)

**`__init__(valueType=UnknownType, properties={})`**

+ `valueType` – The type of values in the map (optional; the default is UnknownType).
+ `properties` – Properties of the map (optional).

## Field(Object)

Creates a field object out of an object that derives from [DataType](#aws-glue-api-crawler-pyspark-extensions-types-awsglue-datatype).

**`__init__(name, dataType, properties={})`**

+ `name` – The name to be assigned to the field.
+ `dataType` – The object to create a field from.
+ `properties` – Properties of the field (optional).

## StructType(DataType)

Defines a data structure (`struct`).

**`__init__(fields=[], properties={})`**

+ `fields` – A list of the fields (of type `Field`) to include in the structure (optional).
+ `properties` – Properties of the structure (optional).

**`add(field)`**

+ `field` – An object of type `Field` to add to the structure.

**`hasField(field)`**

Returns `True` if this structure has a field of the same name, or `False` if not.

+ `field` – A field name, or an object of type `Field` whose name is used.

**`getField(field)`**

+ `field` – A field name, or an object of type `Field` whose name is used. If the structure has a field of the same name, it is returned.

## EntityType(DataType)

`__init__(entity, base_type, properties)`

This class is not yet implemented.

## Other types

+ [DataSource(object)](#aws-glue-api-crawler-pyspark-extensions-types-awsglue-data-source)
+ [DataSink(object)](#aws-glue-api-crawler-pyspark-extensions-types-awsglue-data-sink)
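Before moving on to the data source and sink types, the name-based lookup rules described for `StructType` (`add`, `hasField`, `getField`) can be pictured with a small pure-Python sketch. The classes `SketchField` and `SketchStruct` are hypothetical stand-ins invented for this illustration; they are not the `awsglue` classes, and returning `None` for a missing field is an assumption of the sketch rather than documented behavior.

```python
# Hypothetical stand-ins for illustration only; these are NOT the awsglue classes.
class SketchField:
    """A named field that wraps a data type label."""

    def __init__(self, name, data_type, properties=None):
        self.name = name
        self.data_type = data_type
        self.properties = properties or {}


class SketchStruct:
    """Keeps fields in insertion order and looks them up by name."""

    def __init__(self, fields=None):
        self._fields = {}
        for field in fields or []:
            self.add(field)

    @staticmethod
    def _name_of(field):
        # Accept either a plain name or a SketchField whose name is used.
        return field.name if isinstance(field, SketchField) else field

    def add(self, field):
        self._fields[field.name] = field

    def hasField(self, field):
        return self._name_of(field) in self._fields

    def getField(self, field):
        # Assumption of this sketch: a missing field yields None.
        return self._fields.get(self._name_of(field))


person = SketchStruct([SketchField("name", "string")])
person.add(SketchField("age", "int"))

print(person.hasField("age"))                       # True
print(person.hasField(SketchField("age", "long")))  # True: only the name is compared
print(person.hasField("email"))                     # False
print(person.getField("name").data_type)            # string
```

The point of the sketch is that both lookups compare names only, so a `Field` object with a different data type still matches an existing field of the same name.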
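## DataSource(object)

**`__init__(j_source, sql_ctx, name)`**

+ `j_source` – The data source.
+ `sql_ctx` – The SQL context.
+ `name` – The data source name.

**`setFormat(format, **options)`**

+ `format` – The format to set for the data source.
+ `options` – A collection of options to set for the data source. For more information about format options, see [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md).

`getFrame()`

Returns a `DynamicFrame` for the data source.

## DataSink(object)

**`__init__(j_sink, sql_ctx)`**

+ `j_sink` – The sink to create.
+ `sql_ctx` – The SQL context for the data sink.

**`setFormat(format, **options)`**

+ `format` – The format to set for the data sink.
+ `options` – A collection of options to set for the data sink. For more information about format options, see [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md).

**`setAccumulableSize(size)`**

+ `size` – The accumulable size to set, in bytes.

**`writeFrame(dynamic_frame, info="")`**

+ `dynamic_frame` – The `DynamicFrame` to write.
+ `info` – Information about the `DynamicFrame` (optional).

**`write(dynamic_frame_or_dfc, info="")`**

Writes a `DynamicFrame` or a `DynamicFrameCollection`.

+ `dynamic_frame_or_dfc` – Either a `DynamicFrame` object or a `DynamicFrameCollection` object to be written.
+ `info` – Information about the `DynamicFrame` or `DynamicFrames` to be written (optional).

# DynamicFrame class

One of the major abstractions in Apache Spark is the SparkSQL `DataFrame`, which is similar to the `DataFrame` construct found in R and pandas. A `DataFrame` is similar to a table and supports functional-style (map/reduce/filter/etc.) operations and SQL operations (select, project, aggregate).

`DataFrames` are powerful and widely used, but they have limitations with respect to extract, transform, and load (ETL) operations. Most significantly, they require a schema to be specified before any data is loaded. SparkSQL addresses this by making two passes over the data: the first to infer the schema, and the second to load the data. However, this inference is limited and doesn't address the realities of messy data. For example, the same field might be of a different type in different records. Apache Spark often gives up and reports the type as `string` using the original field text. This might not be correct, and you might want finer control over how schema discrepancies are resolved. And for large datasets, an additional pass over the source data might be prohibitively expensive.

To address these limitations, AWS Glue introduces the `DynamicFrame`. A `DynamicFrame` is similar to a `DataFrame`, except that each record is self-describing, so no schema is required initially. Instead, AWS Glue computes a schema on the fly when required, and explicitly encodes schema inconsistencies using a choice (or union) type. You can resolve these inconsistencies to make your datasets compatible with data stores that require a fixed schema.

Similarly, a `DynamicRecord` represents a logical record within a `DynamicFrame`. It is like a row in a Spark `DataFrame`, except that it is self-describing and can be used for data that does not conform to a fixed schema. When using AWS Glue with PySpark, you do not typically manipulate independent `DynamicRecords`. Rather, you transform the dataset together through its `DynamicFrame`.

You can convert `DynamicFrames` to and from `DataFrames` after you resolve any schema inconsistencies.

## — construction —

+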
[\_\_init\_\_](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-__init__)
+ [fromDF](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-fromDF)
+ [toDF](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-toDF)

## \_\_init\_\_

**`__init__(jdf, glue_ctx, name)`**

+ `jdf` – A reference to the data frame in the Java Virtual Machine (JVM).
+ `glue_ctx` – A [GlueContext class](aws-glue-api-crawler-pyspark-extensions-glue-context.md) object.
+ `name` – An optional name string, empty by default.

## fromDF

**`fromDF(dataframe, glue_ctx, name)`**

Converts a `DataFrame` to a `DynamicFrame` by converting `DataFrame` fields to `DynamicRecord` fields. Returns the new `DynamicFrame`.

A `DynamicRecord` represents a logical record in a `DynamicFrame`. It is similar to a row in a Spark `DataFrame`, except that it is self-describing and can be used for data that does not conform to a fixed schema.

This function expects columns with duplicated names in your `DataFrame` to have already been resolved.

+ `dataframe` – The Apache Spark SQL `DataFrame` to convert (required).
+ `glue_ctx` – The [GlueContext class](aws-glue-api-crawler-pyspark-extensions-glue-context.md) object that specifies the context for this transform (required).
+ `name` – The name of the resulting `DynamicFrame` (optional since AWS Glue 3.0).

## toDF

**`toDF(options)`**

Converts a `DynamicFrame` to an Apache Spark `DataFrame` by converting `DynamicRecords` into `DataFrame` fields. Returns the new `DataFrame`.

A `DynamicRecord` represents a logical record in a `DynamicFrame`. It is similar to a row in a Spark `DataFrame`, except that it is self-describing and can be used for data that does not conform to a fixed schema.

+ `options` – A list of `ResolveOption` objects that specify how to resolve choice types during the conversion. This parameter handles schema inconsistencies, not format options such as CSV parsing.

For CSV parsing and other format options, specify them in the `from_options` method when you create the DynamicFrame, not in the `toDF` method. Here is an example of the correct way to handle CSV format options:

```
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)

# Correct: Specify format options in from_options
csv_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/path/to/csv/"]},
    format="csv",
    format_options={
        "withHeader": True,
        "separator": ",",
        "inferSchema": True
    }
)

# Convert to DataFrame (no format options needed here)
csv_df = csv_dyf.toDF()
```

The `options` parameter in `toDF` is specifically for resolving choice types. Specify the target type if you choose the `Project` or `Cast` action type. Examples include the following.

```
>>>toDF([ResolveOption("a.b.c", "KeepAsStruct")])
>>>toDF([ResolveOption("a.b.c", "Project", DoubleType())])
```

## — information —

+ [count](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-count)
+ [schema](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-schema)
+ [printSchema](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-printSchema)
+ [show](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-show)
+ [repartition](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-repartition)
+ [coalesce](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-coalesce)

## count

`count( )` – Returns the number of rows in the underlying `DataFrame`.

## schema

`schema( )` – Returns the schema of this `DynamicFrame`, or if that is not available, the schema of the underlying `DataFrame`.

For more information about the `DynamicFrame` types that make up this schema, see [PySpark extension types](aws-glue-api-crawler-pyspark-extensions-types.md).

## printSchema

`printSchema( )` – Prints the schema of the underlying `DataFrame`.

## show

`show(num_rows)` – Prints a specified number of rows from the underlying `DataFrame`.

## repartition

`repartition(numPartitions)` – Returns a new `DynamicFrame` with `numPartitions` partitions.

## coalesce

`coalesce(numPartitions)` – Returns a new `DynamicFrame` with `numPartitions` partitions.

## — transforms —

+ [apply\_mapping](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-apply_mapping)
+ [drop\_fields](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-drop_fields)
+ [filter](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-filter)
+ [join](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-join)
+ [map](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-map)
+ [mergeDynamicFrame](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-merge)
+ [relationalize](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-relationalize)
+ [rename\_field](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-rename_field)
+ [resolveChoice](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-resolveChoice)
+ [select\_fields](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-select_fields)
+
[spigot](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-spigot)
+ [split\_fields](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-split_fields)
+ [split\_rows](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-split_rows)
+ [unbox](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-unbox)
+ [union](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-union)
+ [unnest](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-unnest)
+ [unnest\_ddb\_json](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-unnest_ddb_json)
+ [write](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-write)

## apply\_mapping

**`apply_mapping(mappings, transformation_ctx="", info="", stageThreshold=0, totalThreshold=0)`**

Applies a declarative mapping to a `DynamicFrame` and returns a new `DynamicFrame` with those mappings applied to the fields that you specify. Unspecified fields are omitted from the new `DynamicFrame`.

+ `mappings` – A list of mapping tuples (required). Each consists of: (source column, source type, target column, target type).

  If the source column has a dot "`.`" in the name, you must place backticks (`` ` ``) around it. For example, to map `this.old.name` (string) to `thisNewName`, you would use the following tuple:

  ```
  ("`this.old.name`", "string", "thisNewName", "string")
  ```
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string to be associated with error reporting for this transformation (optional).
+ `stageThreshold` – The number of errors encountered during this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.
+ `totalThreshold` – The number of errors encountered up to and including this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.

### Example: Use apply\_mapping to rename fields and change field types

The following code example shows how to use the `apply_mapping` method to rename selected fields and change field types.

**Note**
To access the dataset that is used in this example, see [Code example: Joining and relationalizing data](aws-glue-programming-python-samples-legislators.md) and follow the instructions in [Step 1: Crawl the data in the Amazon S3 bucket](aws-glue-programming-python-samples-legislators.md#aws-glue-programming-python-samples-legislators-crawling).

```
# Example: Use apply_mapping to reshape source data into
# the desired column names and types as a new DynamicFrame

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Create GlueContext
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Create a DynamicFrame and view its schema
persons = glueContext.create_dynamic_frame.from_catalog(
    database="legislators",
    table_name="persons_json"
)
print("Schema for the persons DynamicFrame:")
persons.printSchema()

# Select and rename fields, change field type
print("Schema for the persons_mapped DynamicFrame, created with apply_mapping:")
persons_mapped = persons.apply_mapping(
    [
        ("family_name", "String", "last_name", "String"),
        ("name", "String", "first_name", "String"),
        ("birth_date", "String", "date_of_birth", "Date"),
    ]
)
persons_mapped.printSchema()
```

#### Output

```
Schema for the persons DynamicFrame:
root
|-- family_name: string
|-- name: string
|-- links: array
|    |-- element: struct
|    |    |-- note: string
|    |    |-- url: string
|-- gender: string
|-- image: string
|-- identifiers: array
|    |-- element: struct
|    |    |-- scheme: string
|    |    |-- identifier: string
|-- other_names: array
|    |-- element: struct
|    |    |-- lang: string
|    |    |-- note: string
|    |    |-- name: string
|-- sort_name: string
|-- images: array
|    |-- element: struct
|    |    |-- url: string
|-- given_name: string
|-- birth_date: string
|-- id: string
|-- contact_details: array
|    |-- element: struct
|    |    |-- type: string
|    |    |-- value: string
|-- death_date: string

Schema for the persons_mapped DynamicFrame, created with apply_mapping:
root
|-- last_name: string
|-- first_name: string
|-- date_of_birth: date
```

## drop\_fields

**`drop_fields(paths, transformation_ctx="", info="", stageThreshold=0, totalThreshold=0)`**

Calls the [FlatMap class](aws-glue-api-crawler-pyspark-transforms-flat-map.md) transform to remove fields from a `DynamicFrame`. Returns a new `DynamicFrame` with the specified fields dropped.

+ `paths` – A list of strings. Each contains the full path to a field node that you want to drop. You can use dot notation to specify nested fields. For example, if field `first` is a child of field `name` in the tree, you specify `"name.first"` for the path.

  If a field node has a literal `.` in the name, you must enclose the name in backticks (`` ` ``).
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string to be associated with error reporting for this transformation (optional).
+ `stageThreshold` – The number of errors encountered during this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.
+ `totalThreshold` – The number of errors encountered up to and including this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.

### Example: Use drop\_fields to remove fields from a `DynamicFrame`

This code example uses the `drop_fields` method to remove selected top-level and nested fields from a `DynamicFrame`.

**Example dataset**
The example uses the following dataset, which is represented by the `EXAMPLE-FRIENDS-DATA` table in the code:

```
{"name": "Sally", "age": 23, "location": {"state": "WY", "county": "Fremont"}, "friends": []}
{"name": "Varun", "age": 34, "location": {"state": "NE", "county": "Douglas"}, "friends": [{"name": "Arjun", "age": 3}]}
{"name": "George", "age": 52, "location": {"state": "NY"}, "friends": [{"name": "Fred"}, {"name": "Amy", "age": 15}]}
{"name": "Haruki", "age": 21, "location": {"state": "AK", "county": "Denali"}}
{"name": "Sheila", "age": 63, "friends": [{"name": "Nancy", "age": 22}]}
```

**Example code**

```
# Example: Use drop_fields to remove top-level and nested fields from a DynamicFrame.
# Replace MY-EXAMPLE-DATABASE with your Glue Data Catalog database name.
# Replace EXAMPLE-FRIENDS-DATA with your table name.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Create GlueContext
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Create a DynamicFrame from Glue Data Catalog
glue_source_database = "MY-EXAMPLE-DATABASE"
glue_source_table = "EXAMPLE-FRIENDS-DATA"

friends = glueContext.create_dynamic_frame.from_catalog(
    database=glue_source_database, table_name=glue_source_table
)
print("Schema for friends DynamicFrame before calling drop_fields:")
friends.printSchema()

# Remove location.county, remove friends.age, remove age
friends = friends.drop_fields(paths=["age", "location.county", "friends.age"])
print("Schema for friends DynamicFrame after removing age, county, and friend age:")
friends.printSchema()
```

#### Output

```
Schema for friends DynamicFrame before calling drop_fields:
root
|-- name: string
|-- age: int
|-- location: struct
|    |-- state: string
|    |-- county: string
|-- friends: array
|    |-- element: struct
|    |    |-- name: string
|    |    |-- age: int

Schema for friends DynamicFrame after removing age, county, and friend age:
root
|-- name: string
|-- location: struct
|    |-- state: string
|-- friends: array
|    |-- element: struct
|    |    |-- name: string
```

## filter
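**`filter(f, transformation_ctx="", info="", stageThreshold=0, totalThreshold=0)`**

Returns a new `DynamicFrame` that contains all `DynamicRecords` within the input `DynamicFrame` that satisfy the specified predicate function `f`.
+ `f` – The predicate function to apply to the `DynamicFrame` (required). The function must take a `DynamicRecord` as an argument and return True if the `DynamicRecord` meets the filter requirements, or False if it does not.

  A `DynamicRecord` represents a logical record in a `DynamicFrame`. It is similar to a row in a Spark `DataFrame`, except that it is self-describing and can be used for data that does not conform to a fixed schema.
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string to be associated with error reporting for this transformation (optional).
+ `stageThreshold` – The number of errors encountered during this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.
+ `totalThreshold` – The number of errors encountered up to and including this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.

### Example: Use filter to get a filtered selection of records

This example uses the `filter` method to create a new `DynamicFrame` that contains a filtered selection of the records in another `DynamicFrame`.

Like the `map` method, `filter` takes a function as an argument, and that function is applied to each record in the original `DynamicFrame`. The function takes a record as input and returns a Boolean value. If the return value is true, the record is included in the resulting `DynamicFrame`. If it is false, the record is left out.

**Note**
To access the dataset that is used in this example, see [Code example: Data preparation using ResolveChoice, Lambda, and ApplyMapping](aws-glue-programming-python-samples-medicaid.md) and follow the instructions in [Step 1: Crawl the data in the Amazon S3 bucket](aws-glue-programming-python-samples-medicaid.md#aws-glue-programming-python-samples-medicaid-crawling).

```
# Example: Use filter to create a new DynamicFrame
# with a filtered selection of records

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Create GlueContext
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Create a DynamicFrame from the sample CSV file in Amazon S3
medicare = glueContext.create_dynamic_frame.from_options(
    "s3",
    {
        "paths": [
            "s3://awsglue-datasets/examples/medicare/Medicare_Hospital_Provider.csv"
        ]
    },
    "csv",
    {"withHeader": True},
)

# Create filtered DynamicFrame with custom lambda
# to filter records by Provider State and Provider City
sac_or_mon = medicare.filter(
    f=lambda x: x["Provider State"] in ["CA", "AL"]
    and x["Provider City"] in ["SACRAMENTO", "MONTGOMERY"]
)

# Compare record counts
print("Unfiltered record count: ", medicare.count())
print("Filtered record count: ", sac_or_mon.count())
```

#### Output

```
Unfiltered record count:  163065
Filtered record count:  564
```

## join

**`join(paths1, paths2, frame2, transformation_ctx="", info="", stageThreshold=0, totalThreshold=0)`**

Performs an equality join with another `DynamicFrame` and returns the resulting `DynamicFrame`.
+ `paths1` – A list of the keys in this frame to join.
+ `paths2` – A list of the keys in the other frame to join.
+ `frame2` – The other `DynamicFrame` to join.
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string to be associated with error reporting for this transformation (optional).
+ `stageThreshold` – The number of errors encountered during this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.
+ `totalThreshold` – The number of errors encountered up to and including this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.

### Example: Use join to combine `DynamicFrames`

This example uses the `join` method to perform a join on three `DynamicFrames`. AWS Glue performs the join based on the field keys that you provide. The resulting `DynamicFrame` contains rows from the two original frames where the specified keys match.

Note that the `join` transform keeps all fields intact. This means that the fields that you specify to match appear in the resulting `DynamicFrame`, even if they are redundant and contain the same keys. In this example, we use `drop_fields` to remove these redundant keys after the join.

**Note**
To access the dataset that is used in this example, see [Code example: Joining and relationalizing data](aws-glue-programming-python-samples-legislators.md) and follow the instructions in [Step 1: Crawl the data in the Amazon S3 bucket](aws-glue-programming-python-samples-legislators.md#aws-glue-programming-python-samples-legislators-crawling).

```
# Example: Use join to combine data from three DynamicFrames

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Create GlueContext
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Load DynamicFrames from Glue Data Catalog
persons = glueContext.create_dynamic_frame.from_catalog(
    database="legislators", table_name="persons_json"
)
memberships = glueContext.create_dynamic_frame.from_catalog(
    database="legislators", table_name="memberships_json"
)
orgs = glueContext.create_dynamic_frame.from_catalog(
    database="legislators", table_name="organizations_json"
)
print("Schema for the persons DynamicFrame:")
persons.printSchema()
print("Schema for the memberships DynamicFrame:")
memberships.printSchema()
print("Schema for the orgs DynamicFrame:")
orgs.printSchema()

# Join persons and memberships by ID
persons_memberships = persons.join(
    paths1=["id"], paths2=["person_id"], frame2=memberships
)

# Rename and drop fields from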
# orgs to prevent field name collisions with persons_memberships
orgs = (
    orgs.drop_fields(["other_names", "identifiers"])
    .rename_field("id", "org_id")
    .rename_field("name", "org_name")
)

# Create final join of all three DynamicFrames
legislators_combined = orgs.join(
    paths1=["org_id"], paths2=["organization_id"], frame2=persons_memberships
).drop_fields(["person_id", "org_id"])

# Inspect the schema for the joined data
print("Schema for the new legislators_combined DynamicFrame:")
legislators_combined.printSchema()
```

#### Output

```
Schema for the persons DynamicFrame:
root
|-- family_name: string
|-- name: string
|-- links: array
|    |-- element: struct
|    |    |-- note: string
|    |    |-- url: string
|-- gender: string
|-- image: string
|-- identifiers: array
|    |-- element: struct
|    |    |-- scheme: string
|    |    |-- identifier: string
|-- other_names: array
|    |-- element: struct
|    |    |-- lang: string
|    |    |-- note: string
|    |    |-- name: string
|-- sort_name: string
|-- images: array
|    |-- element: struct
|    |    |-- url: string
|-- given_name: string
|-- birth_date: string
|-- id: string
|-- contact_details: array
|    |-- element: struct
|    |    |-- type: string
|    |    |-- value: string
|-- death_date: string

Schema for the memberships DynamicFrame:
root
|-- area_id: string
|-- on_behalf_of_id: string
|-- organization_id: string
|-- role: string
|-- person_id: string
|-- legislative_period_id: string
|-- start_date: string
|-- end_date: string

Schema for the orgs DynamicFrame:
root
|-- identifiers: array
|    |-- element: struct
|    |    |-- scheme: string
|    |    |-- identifier: string
|-- other_names: array
|    |-- element: struct
|    |    |-- lang: string
|    |    |-- note: string
|    |    |-- name: string
|-- id: string
|-- classification: string
|-- name: string
|-- links: array
|    |-- element: struct
|    |    |-- note: string
|    |    |-- url: string
|-- image: string
|-- seats: int
|-- type: string

Schema for the new legislators_combined DynamicFrame:
root
|-- role: string
|-- seats: int
|-- org_name: string
|-- links: array
|    |-- element: struct
|    |    |-- note: string
|    |    |-- url: string
|-- type: string
|-- sort_name: string
|-- area_id: string
|-- images: array
|    |-- element: struct
|    |    |-- url: string
|-- on_behalf_of_id: string
|-- other_names: array
|    |-- element: struct
|    |    |-- note: string
|    |    |-- name: string
|    |    |-- lang: string
|-- contact_details: array
|    |-- element: struct
|    |    |-- type: string
|    |    |-- value: string
|-- name: string
|-- birth_date: string
|-- organization_id: string
|-- gender: string
|-- classification: string
|-- legislative_period_id: string
|-- identifiers: array
|    |-- element: struct
|    |    |-- scheme: string
|    |    |-- identifier: string
|-- image: string
|-- given_name: string
|-- start_date: string
|-- family_name: string
|-- id: string
|-- death_date: string
|-- end_date: string
```

## map

**`map(f, transformation_ctx="", info="", stageThreshold=0, totalThreshold=0)`**

Returns a new `DynamicFrame` that results from applying the specified mapping function to all records in the original `DynamicFrame`.
+ `f` – The mapping function to apply to all records in the `DynamicFrame` (required). The function must take a `DynamicRecord` as an argument and return a new `DynamicRecord`.

  A `DynamicRecord` represents a logical record in a `DynamicFrame`. It is similar to a row in an Apache Spark `DataFrame`, except that it is self-describing and can be used for data that does not conform to a fixed schema.
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string that is associated with errors in the transformation (optional).
+ `stageThreshold` – The maximum number of errors that can occur in the transformation before it errors out (optional). The default is zero.
+ `totalThreshold` – The maximum number of errors that can occur overall before processing errors out (optional). The default is zero.

### Example: Use map to apply a function to every record in a `DynamicFrame`

This example shows how to use the `map` method to apply a function to every record of a `DynamicFrame`. Specifically, it applies a function called `MergeAddress` to each record in order to merge several address fields into a single `struct` type.

**Note**
To access the dataset that is used in this example, see [Code example: Data preparation using ResolveChoice, Lambda, and ApplyMapping](aws-glue-programming-python-samples-medicaid.md) and follow the instructions in [Step 1: Crawl the data in the Amazon S3 bucket](aws-glue-programming-python-samples-medicaid.md#aws-glue-programming-python-samples-medicaid-crawling).

```
# Example: Use map to combine fields in all records
# of a DynamicFrame

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Create GlueContext
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Create a DynamicFrame and view its schema
medicare = glueContext.create_dynamic_frame.from_options(
    "s3",
    {"paths": ["s3://awsglue-datasets/examples/medicare/Medicare_Hospital_Provider.csv"]},
    "csv",
    {"withHeader": True})
print("Schema for medicare DynamicFrame:")
medicare.printSchema()

# Define a function to supply to the map transform
# that merges address fields into a single field
def MergeAddress(rec):
    rec["Address"] = {}
    rec["Address"]["Street"] = rec["Provider Street Address"]
    rec["Address"]["City"] = rec["Provider City"]
    rec["Address"]["State"] = rec["Provider State"]
    rec["Address"]["Zip.Code"] = rec["Provider Zip Code"]
    rec["Address"]["Array"] = [rec["Provider Street Address"], rec["Provider City"], rec["Provider State"], rec["Provider Zip Code"]]
    del rec["Provider Street Address"]
    del rec["Provider City"]
    del rec["Provider State"]
    del rec["Provider Zip Code"]
    return rec

# Use map to apply MergeAddress to every record
mapped_medicare = medicare.map(f = MergeAddress)
print("Schema for mapped_medicare DynamicFrame:")
mapped_medicare.printSchema()
```

#### Output

```
Schema for medicare DynamicFrame:
root
|-- DRG Definition: string
|-- Provider Id: string
|-- Provider Name: string
|-- Provider Street Address: string
|-- Provider City: string
|-- Provider State: string
|-- Provider Zip Code: string
|-- Hospital Referral Region Description: string
|-- Total Discharges: string
|-- Average Covered Charges: string
|-- Average Total Payments: string
|-- Average Medicare Payments: string

Schema for mapped_medicare DynamicFrame:
root
|-- Average Total Payments: string
|-- Average Covered Charges: string
|-- DRG Definition: string
|-- Average Medicare Payments: string
|-- Hospital Referral Region Description: string
|-- Address: struct
|    |-- Zip.Code: string
|    |-- City: string
|    |-- Array: array
|    |    |-- element: string
|    |-- State: string
|    |-- Street: string
|-- Provider Id: string
|-- Total Discharges: string
|-- Provider Name: string
```

## mergeDynamicFrame

**`mergeDynamicFrame(stage_dynamic_frame, primary_keys, transformation_ctx = "", options = {}, info = "", stageThreshold = 0, totalThreshold = 0)`**

Merges this `DynamicFrame` with a staging `DynamicFrame`, using the specified primary keys to identify records. Duplicate records (records with the same primary keys) are not deduplicated. If there is no matching record in the staging frame, all records (including duplicates) are retained from the source. If the staging frame has matching records, the records from the staging frame overwrite the records in the source in AWS Glue.
+ `stage_dynamic_frame` – The staging `DynamicFrame` to merge.
+ `primary_keys` – The list of primary key fields used to match records from the source and staging dynamic frames.
+ `transformation_ctx` – A unique string that is used to retrieve metadata about the current transformation (optional).
+ `options` – A string of JSON name-value pairs that provide additional information for this transformation. This argument is not currently used.
+ `info` – A `String`. Any string to be associated with errors in this transformation.
+ `stageThreshold` – A `Long`. The number of errors in the given transformation at which processing needs to error out.
+ `totalThreshold` – A `Long`. The total number of errors up to and including this transformation at which processing needs to error out.

This method returns a new `DynamicFrame` that is obtained by merging this `DynamicFrame` with the staging `DynamicFrame`.

The returned `DynamicFrame` contains record A in these cases:
+ If `A` exists in both the source frame and the staging frame, then `A` in the staging frame is returned.
+ If `A` is in the source table and `A.primaryKeys` is not in the `stagingDynamicFrame`, `A` is not updated in the staging table.

The source frame and staging frame do not need to have the same schema.

### Example: Use mergeDynamicFrame to merge two `DynamicFrames` based on a primary key

The following code example shows how to use the `mergeDynamicFrame` method to merge a `DynamicFrame` with a "staging" `DynamicFrame`, based on the primary key `id`.

**Example dataset**

The example uses two `DynamicFrames` from a `DynamicFrameCollection` called `split_rows_collection`. The following is the list of keys in `split_rows_collection`.

```
dict_keys(['high', 'low'])
```

**Example code**

```
# Example: Use mergeDynamicFrame to merge DynamicFrames
# based on a set of specified primary keys

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import SelectFromCollection

# Inspect the original DynamicFrames
frame_low = SelectFromCollection.apply(dfc=split_rows_collection, key="low")
print("Inspect the DynamicFrame that contains rows where ID < 10")
frame_low.toDF().show()

frame_high = SelectFromCollection.apply(dfc=split_rows_collection, key="high")
print("Inspect the DynamicFrame that contains rows where ID > 10")
frame_high.toDF().show()

# Merge the DynamicFrames based on the "id" primary key
merged_high_low = frame_high.mergeDynamicFrame(
    stage_dynamic_frame=frame_low,
    primary_keys=["id"]
)

# View the results where the ID is 1 or 20
print("Inspect the merged DynamicFrame that contains the combined rows")
merged_high_low.toDF().where("id = 1 or id = 20").orderBy("id").show()
```

#### Output

```
Inspect the DynamicFrame that contains rows where ID < 10
+---+-----+------------------------+-------------------------+
| id|index|contact_details.val.type|contact_details.val.value|
+---+-----+------------------------+-------------------------+
|  1|    0|                     fax|             202-225-3307|
|  1|    1|                   phone|             202-225-5731|
|  2|    0|                     fax|             202-225-3307|
|  2|    1|                   phone|             202-225-5731|
|  3|    0|                     fax|             202-225-3307|
|  3|    1|                   phone|             202-225-5731|
|  4|    0|                     fax|             202-225-3307|
|  4|    1|                   phone|             202-225-5731|
|  5|    0|                     fax|             202-225-3307|
|  5|    1|                   phone|             202-225-5731|
|  6|    0|                     fax|             202-225-3307|
|  6|    1|                   phone|             202-225-5731|
|  7|    0|                     fax|             202-225-3307|
|  7|    1|                   phone|             202-225-5731|
|  8|    0|                     fax|             202-225-3307|
|  8|    1|                   phone|             202-225-5731|
|  9|    0|                     fax|             202-225-3307|
|  9|    1|                   phone|             202-225-5731|
| 10|    0|                     fax|             202-225-6328|
| 10|    1|                   phone|             202-225-4576|
+---+-----+------------------------+-------------------------+
only showing top 20 rows

Inspect the DynamicFrame that contains rows where ID > 10
+---+-----+------------------------+-------------------------+
| id|index|contact_details.val.type|contact_details.val.value|
+---+-----+------------------------+-------------------------+
| 11|    0|                     fax|             202-225-6328|
| 11|    1|                   phone|             202-225-4576|
| 11|    2|                 twitter|           RepTrentFranks|
| 12|    0|                     fax|             202-225-6328|
| 12|    1|                   phone|             202-225-4576|
| 12|    2|                 twitter|           RepTrentFranks|
| 13|    0|                     fax|             202-225-6328|
| 13|    1|                   phone|             202-225-4576|
| 13|    2|                 twitter|           RepTrentFranks|
| 14|    0|                     fax|             202-225-6328|
| 14|    1|                   phone|             202-225-4576|
| 14|    2|                 twitter|           RepTrentFranks|
| 15|    0|                     fax|             202-225-6328|
| 15|    1|                   phone|             202-225-4576|
| 15|    2|                 twitter|           RepTrentFranks|
| 16|    0|                     fax|             202-225-6328|
| 16|    1|                   phone|             202-225-4576|
| 16|    2|                 twitter|           RepTrentFranks|
| 17|    0|                     fax|             202-225-6328|
| 17|    1|                   phone|             202-225-4576|
+---+-----+------------------------+-------------------------+
only showing top 20 rows

Inspect the merged DynamicFrame that contains the combined rows
+---+-----+------------------------+-------------------------+
| id|index|contact_details.val.type|contact_details.val.value|
+---+-----+------------------------+-------------------------+
|  1|    0|                     fax|             202-225-3307|
|  1|    1|                   phone|             202-225-5731|
| 20|    0|                     fax|             202-225-5604|
| 20|    1|                   phone|             202-225-6536|
| 20|    2|                 twitter|                USRepLong|
+---+-----+------------------------+-------------------------+
```

## relationalize

**`relationalize(root_table_name, staging_path, options, transformation_ctx="", info="", stageThreshold=0, totalThreshold=0)`**

Converts a `DynamicFrame` into a form that fits within a relational database. Relationalizing a `DynamicFrame` is especially useful when you want to move data from a NoSQL environment such as DynamoDB into a relational database such as MySQL.

The transform generates a list of frames by unnesting nested columns and pivoting array columns. You can join the pivoted array columns to the root table by using the join key that is generated during the unnest phase.
+ `root_table_name` – The name of the root table.
+ `staging_path` – The path where the method can store partitions of pivoted tables in CSV format (optional). Pivoted tables are read back from this path.
+ `options` – A dictionary of optional parameters.
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string to be associated with error reporting for this transformation (optional).
+ `stageThreshold` – The number of errors encountered during this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.
+ `totalThreshold` – The number of errors encountered up to and including this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.

### Example: Use relationalize to flatten a nested schema in a `DynamicFrame`

This code example uses the `relationalize` method to flatten a nested schema into a form that fits into a relational database.

**Example dataset**

The example uses a `DynamicFrame` called `legislators_combined` with the following schema. `legislators_combined` has multiple nested fields, such as `links`, `images`, and `contact_details`, which the `relationalize` transform flattens.

```
root
|-- role: string
|-- seats: int
|-- org_name: string
|-- links: array
|    |-- element: struct
|    |    |-- note: string
|    |    |-- url: string
|-- type: string
|-- sort_name: string
|-- area_id: string
|-- images: array
|    |-- element: struct
|    |    |-- url: string
|-- on_behalf_of_id: string
|-- other_names: array
|    |-- element: struct
|    |    |-- note: string
|    |    |-- name: string
|    |    |-- lang: string
|-- contact_details: array
|    |-- element: struct
|    |    |-- type: string
|    |    |-- value: string
|-- name: string
|-- birth_date: string
|-- organization_id: string
|-- gender: string
|-- classification: string
|-- legislative_period_id: string
|-- identifiers: array
|    |-- element: struct
|    |    |-- scheme: string
|    |    |-- identifier: string
|-- image: string
|-- given_name: string
|-- start_date: string
|-- family_name: string
|-- id: string
|-- death_date: string
|-- end_date: string
```

**Example code**

```
# Example: Use relationalize to flatten
# a nested schema into a format that fits
# into a relational database.
# Replace DOC-EXAMPLE-BUCKET/tmpDir with your own location.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Create GlueContext
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Apply relationalize and inspect new tables
legislators_relationalized = legislators_combined.relationalize(
    "l_root", "s3://DOC-EXAMPLE-BUCKET/tmpDir"
)
legislators_relationalized.keys()

# Compare the schema of the contact_details
# nested field to the new relationalized table that
# represents it
legislators_combined.select_fields("contact_details").printSchema()
legislators_relationalized.select("l_root_contact_details").toDF().where(
    "id = 10 or id = 75"
).orderBy(["id", "index"]).show()
```

#### Output

The following output lets you compare the schema of the nested field called `contact_details` to the table that the `relationalize` transform created. Notice that the table records link back to the main table through a foreign key called `id` and an `index` column that represents the position in the array.

```
dict_keys(['l_root', 'l_root_images', 'l_root_links', 'l_root_other_names', 'l_root_contact_details', 'l_root_identifiers'])

root
|-- contact_details: array
|    |-- element: struct
|    |    |-- type: string
|    |    |-- value: string

+---+-----+------------------------+-------------------------+
| id|index|contact_details.val.type|contact_details.val.value|
+---+-----+------------------------+-------------------------+
| 10|    0|                     fax|             202-225-4160|
| 10|    1|                   phone|             202-225-3436|
| 75|    0|                     fax|             202-225-6791|
| 75|    1|                   phone|             202-225-2861|
| 75|    2|                 twitter|               RepSamFarr|
+---+-----+------------------------+-------------------------+
```

## rename_field
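**`rename_field(oldName, newName, transformation_ctx="", info="", stageThreshold=0, totalThreshold=0)`**

Renames a field in this `DynamicFrame` and returns a new `DynamicFrame` with the field renamed.
+ `oldName` – The full path to the node that you want to rename.

  If the old name has dots in it, `RenameField` does not work unless you place backticks (`` ` ``) around it. For example, to replace `this.old.name` with `thisNewName`, call `rename_field` as follows.

  ```
  newDyF = oldDyF.rename_field("`this.old.name`", "thisNewName")
  ```
+ `newName` – The new name, as a full path.
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string to be associated with error reporting for this transformation (optional).
+ `stageThreshold` – The number of errors encountered during this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.
+ `totalThreshold` – The number of errors encountered up to and including this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.

### Example: Use rename_field to rename fields in a `DynamicFrame`

This code example uses the `rename_field` method to rename fields in a `DynamicFrame`. Notice that the example uses method chaining to rename multiple fields at the same time.

**Note**
To access the dataset that is used in this example, see [Code example: Joining and relationalizing data](aws-glue-programming-python-samples-legislators.md) and follow the instructions in [Step 1: Crawl the data in the Amazon S3 bucket](aws-glue-programming-python-samples-legislators.md#aws-glue-programming-python-samples-legislators-crawling).

**Example code**

```
# Example: Use rename_field to rename fields
# in a DynamicFrame

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Create GlueContext
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Inspect the original orgs schema
orgs = glueContext.create_dynamic_frame.from_catalog(
    database="legislators", table_name="organizations_json"
)
print("Original orgs schema: ")
orgs.printSchema()

# Rename fields and view the new schema
orgs = orgs.rename_field("id", "org_id").rename_field("name", "org_name")
print("New orgs schema with renamed fields: ")
orgs.printSchema()
```

#### Output

```
Original orgs schema:
root
|-- identifiers: array
|    |-- element: struct
|    |    |-- scheme: string
|    |    |-- identifier: string
|-- other_names: array
|    |-- element: struct
|    |    |-- lang: string
|    |    |-- note: string
|    |    |-- name: string
|-- id: string
|-- classification: string
|-- name: string
|-- links: array
|    |-- element: struct
|    |    |-- note: string
|    |    |-- url: string
|-- image: string
|-- seats: int
|-- type: string

New orgs schema with renamed fields:
root
|-- identifiers: array
|    |-- element: struct
|    |    |-- scheme: string
|    |    |-- identifier: string
|-- other_names: array
|    |-- element: struct
|    |    |-- lang: string
|    |    |-- note: string
|    |    |-- name: string
|-- classification: string
|-- org_id: string
|-- org_name: string
|-- links: array
|    |-- element: struct
|    |    |-- note: string
|    |    |-- url: string
|-- image: string
|-- seats: int
|-- type: string
```

## resolveChoice

**`resolveChoice(specs = None, choice = "" , database = None , table_name = None , transformation_ctx="", info="", stageThreshold=0, totalThreshold=0, catalog_id = None)`**

Resolves a choice type within this `DynamicFrame` and returns the new `DynamicFrame`.
+ `specs` – A list of specific ambiguities to resolve, each in the form of a tuple: `(field_path, action)`.

  There are two ways to use `resolveChoice`. The first is to use the `specs` argument to specify a sequence of specific fields and how to resolve them. The other is to use the `choice` argument to specify a single resolution for all `ChoiceTypes`.

  Values for `specs` are specified as tuples made up of `(field_path, action)` pairs. The `field_path` value identifies a specific ambiguous element, and the `action` value identifies the corresponding resolution. The possible actions are:
  + `cast:type` – Attempts to cast all values to the specified type. For example: `cast:int`.
  + `make_cols` – Converts each distinct type to a column with the name `columnName_type`. It resolves a potential ambiguity by flattening the data. For example, if `columnA` could be an `int` or a `string`, the resolution produces two columns named `columnA_int` and `columnA_string` in the resulting `DynamicFrame`.
  + `make_struct` – Resolves a potential ambiguity by using a `struct` to represent the data. For example, if data in a column could be an `int` or a `string`, the `make_struct` action produces a column of structures in the resulting `DynamicFrame`. Each structure contains both an `int` and a `string`.
  + `project:type` – Resolves a potential ambiguity by projecting all the data to one of the possible data types. For example, if data in a column could be an `int` or a `string`, using a `project:string` action produces a column in the resulting `DynamicFrame` where all the `int` values have been converted to strings.

  If the `field_path` identifies an array, place empty square brackets after the name of the array to avoid ambiguity. For example, suppose that you are working with data structured as follows:

  ```
  "myList": [
    { "price": 100.00 },
    { "price": "$100.00" }
  ]
  ```

  You can select the numeric rather than the string version of the price by setting the `field_path` to `"myList[].price"` and the `action` to `"cast:double"`.

  **Note**
  You can use only one of the `specs` and `choice` parameters. If the `specs` parameter is not `None`, then the `choice` parameter must be an empty string. Conversely, if `choice` is not an empty string, then the `specs` parameter must be `None`.
+ `choice` – Specifies a single resolution for all `ChoiceTypes`. You can use this mode when the complete list of `ChoiceTypes` is unknown before runtime. In addition to the actions listed previously for `specs`, this argument also supports the following action:
  + `match_catalog` – Attempts to cast each `ChoiceType` to the corresponding type in the specified Data Catalog table.
+ `database` – The Data Catalog database to use with the `match_catalog` action.
+ `table_name` – The Data Catalog table to use with the `match_catalog` action.
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string to be associated with error reporting for this transformation (optional).
+ `stageThreshold` – The number of errors encountered during this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.
+ `totalThreshold` – The number of errors encountered up to and including this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.
+ `catalog_id` – The catalog ID of the Data Catalog being accessed (the account ID of the Data Catalog). When set to `None` (the default value), it uses the catalog ID of the calling account.

### Example: Use resolveChoice to handle a column that contains multiple types

This code example uses the `resolveChoice` method to specify how to handle a `DynamicFrame` column that contains values of multiple types. The example demonstrates two common ways to handle a column with different types:
+ Cast the column to a single data type.
+ Retain all types in separate columns.

**Example dataset**

**Note**
To access the dataset that is used in this example, see [Code example: Data preparation using ResolveChoice, Lambda, and ApplyMapping](aws-glue-programming-python-samples-medicaid.md) and follow the instructions in [Step 1: Crawl the data in the Amazon S3 bucket](aws-glue-programming-python-samples-medicaid.md#aws-glue-programming-python-samples-medicaid-crawling).

The example uses a `DynamicFrame` called `medicare` with the following schema:

```
root
|-- drg definition: string
|-- provider id: choice
|    |-- long
|    |-- string
|-- provider name: string
|-- provider street address: string
|-- provider city: string
|-- provider state: string
|-- provider zip code: long
|-- hospital referral region description: string
|-- total discharges: long
|-- average covered charges: string
|-- average total payments: string
|-- average medicare payments: string
```

**Example code**

```
# Example: Use resolveChoice to handle
# a column that contains multiple types

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Create GlueContext
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Load the input data and inspect the "provider id" column
medicare = glueContext.create_dynamic_frame.from_catalog(
    database="payments", table_name="medicare_hospital_provider_csv"
)
print("Inspect the 'provider id' column:")
medicare.toDF().select("provider id").show()

# Cast provider id to type long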
medicare_resolved_long = medicare.resolveChoice(specs=[("provider id", "cast:long")])
print("Schema after casting 'provider id' to type long:")
medicare_resolved_long.printSchema()
medicare_resolved_long.toDF().select("provider id").show()

# Create separate columns
# for each provider id type
medicare_resolved_cols = medicare.resolveChoice(choice="make_cols")
print("Schema after creating separate columns for each type:")
medicare_resolved_cols.printSchema()
medicare_resolved_cols.toDF().select("provider id_long", "provider id_string").show()
```

#### Output

```
Inspect the 'provider id' column:
+-----------+
|provider id|
+-----------+
|   [10001,]|
|   [10005,]|
|   [10006,]|
|   [10011,]|
|   [10016,]|
|   [10023,]|
|   [10029,]|
|   [10033,]|
|   [10039,]|
|   [10040,]|
|   [10046,]|
|   [10055,]|
|   [10056,]|
|   [10078,]|
|   [10083,]|
|   [10085,]|
|   [10090,]|
|   [10092,]|
|   [10100,]|
|   [10103,]|
+-----------+
only showing top 20 rows

Schema after casting 'provider id' to type long:
root
|-- drg definition: string
|-- provider id: long
|-- provider name: string
|-- provider street address: string
|-- provider city: string
|-- provider state: string
|-- provider zip code: long
|-- hospital referral region description: string
|-- total discharges: long
|-- average covered charges: string
|-- average total payments: string
|-- average medicare payments: string

+-----------+
|provider id|
+-----------+
|      10001|
|      10005|
|      10006|
|      10011|
|      10016|
|      10023|
|      10029|
|      10033|
|      10039|
|      10040|
|      10046|
|      10055|
|      10056|
|      10078|
|      10083|
|      10085|
|      10090|
|      10092|
|      10100|
|      10103|
+-----------+
only showing top 20 rows

Schema after creating separate columns for each type:
root
|-- drg definition: string
|-- provider id_string: string
|-- provider id_long: long
|-- provider name: string
|-- provider street address: string
|-- provider city: string
|-- provider state: string
|-- provider zip code: long
|-- hospital referral region description: string
|-- total discharges: long
|-- average covered charges: string
|-- average total payments: string
|-- average medicare payments: string

+----------------+------------------+
|provider id_long|provider id_string|
+----------------+------------------+
|           10001|              null|
|           10005|              null|
|           10006|              null|
|           10011|              null|
|           10016|              null|
|           10023|              null|
|           10029|              null|
|           10033|              null|
|           10039|              null|
|           10040|              null|
|           10046|              null|
|           10055|              null|
|           10056|              null|
|           10078|              null|
|           10083|              null|
|           10085|              null|
|           10090|              null|
|           10092|              null|
|           10100|              null|
|           10103|              null|
+----------------+------------------+
only showing top 20 rows
```

## select_fields

**`select_fields(paths, transformation_ctx="", info="", stageThreshold=0, totalThreshold=0)`**

Returns a new `DynamicFrame` that contains the selected fields.
+ `paths` – A list of strings. Each string is a path to a top-level node that you want to select.
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string to be associated with error reporting for this transformation (optional).
+ `stageThreshold` – The number of errors encountered during this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.
+ `totalThreshold` – The number of errors encountered up to and including this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.

### Example: Use select_fields to create a new `DynamicFrame` with chosen fields

The following code example shows how to use the `select_fields` method to create a new `DynamicFrame` with a chosen list of fields from an existing `DynamicFrame`.

**Note**
To access the dataset that is used in this example, see [Code example: Joining and relationalizing data](aws-glue-programming-python-samples-legislators.md) and follow the instructions in [Step 1: Crawl the data in the Amazon S3 bucket](aws-glue-programming-python-samples-legislators.md#aws-glue-programming-python-samples-legislators-crawling).

```
# Example: Use select_fields to select specific fields from a DynamicFrame

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Create GlueContext
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Create a DynamicFrame and view its schema
persons = glueContext.create_dynamic_frame.from_catalog(
    database="legislators", table_name="persons_json"
)
print("Schema for the persons DynamicFrame:")
persons.printSchema()

# Create a new DynamicFrame with chosen fields
names = persons.select_fields(paths=["family_name", "given_name"])
print("Schema for the names DynamicFrame, created with select_fields:")
names.printSchema()
names.toDF().show()
```

#### Output

```
Schema for the persons DynamicFrame:
root
|-- family_name: string
|-- name: string
|-- links: array
|    |-- element: struct
|    |    |-- note: string
|    |    |-- url: string
|-- gender: string
|-- image: string
|-- identifiers: array
|    |-- element: struct
|    |    |-- scheme: string
|    |    |-- identifier: string
|-- other_names: array
|    |-- element: struct
|    |    |-- lang: string
|    |    |-- note: string
|    |    |-- name: string
|-- sort_name: string
|-- images: array
|    |-- element: struct
|    |    |-- url: string
|-- given_name: string
|-- birth_date: string
|-- id: string
|-- contact_details: array
|    |-- element: struct
|    |    |-- type: string
|    |    |-- value: string
|-- death_date: string

Schema for the names DynamicFrame, created with select_fields:
root
|-- family_name: string
|-- given_name: string

+-----------+----------+
|family_name|given_name|
+-----------+----------+
|    Collins|   Michael|
|   Huizenga|      Bill|
|    Clawson|    Curtis|
|    Solomon|    Gerald|
|     Rigell|    Edward|
|      Crapo|   Michael|
|      Hutto|      Earl|
|      Ertel|     Allen|
|     Minish|    Joseph|
|    Andrews|    Robert|
|     Walden|      Greg|
|      Kazen|   Abraham|
|     Turner|   Michael|
|      Kolbe|     James|
|  Lowenthal|      Alan|
|    Capuano|   Michael|
|   Schrader|      Kurt|
|     Nadler|   Jerrold|
|     Graves|       Tom|
|   McMillan|      John|
+-----------+----------+
only showing top 20 rows
```

## simplify_ddb_json

**`simplify_ddb_json(): DynamicFrame`**

Simplifies nested columns in a `DynamicFrame` that are specifically in the DynamoDB JSON structure, and returns a new simplified `DynamicFrame`. If a List type contains multiple types or a Map type, the elements in the List are not simplified. Note that this is a specific type of transform that behaves differently from the regular `unnest` transform, and it requires the data to already be in the DynamoDB JSON structure. For more information, see [DynamoDB JSON](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DataExport.Output.html#DataExport.Output.Data).

For example, the schema produced by reading an export with the DynamoDB JSON structure might look like the following:

```
root
|-- Item: struct
|    |-- parentMap: struct
|    |    |-- M: struct
|    |    |    |-- childMap: struct
|    |    |    |    |-- M: struct
|    |    |    |    |    |-- appName: struct
|    |    |    |    |    |    |-- S: string
|    |    |    |    |    |-- packageName: struct
|    |    |    |    |    |    |-- S: string
|    |    |    |    |    |-- updatedAt: struct
|    |    |    |    |    |    |-- N: string
|    |-- strings: struct
|    |    |-- SS: array
|    |    |    |-- element: string
|    |-- numbers: struct
|    |    |-- NS: array
|    |    |    |-- element: string
|    |-- binaries: struct
|    |    |-- BS: array
|    |    |    |-- element: string
|    |-- isDDBJson: struct
|    |    |-- BOOL: boolean
|    |-- nullValue: struct
|    |    |-- NULL: boolean
```

The `simplify_ddb_json()` transform converts this to:

```
root
|-- parentMap: struct
|    |-- childMap: struct
|    |    |-- appName: string
|    |    |-- packageName: string
|    |    |-- updatedAt: string
|-- strings: array
|    |-- element: string
|-- numbers: array
|    |-- element: string
|-- binaries: array
|    |-- element: string
|-- isDDBJson: boolean
|-- nullValue: null
```

### Example: Use simplify_ddb_json to invoke a DynamoDB JSON simplify

This code example uses the AWS Glue DynamoDB export connector to read data, invokes a DynamoDB JSON simplify with the `simplify_ddb_json` method, and prints the number of partitions.

**Example code**

```
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)

# The empty strings below are placeholders: supply your own values.
dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type = "dynamodb",
    connection_options = {
        'dynamodb.export': 'ddb',
        'dynamodb.tableArn': '',
        'dynamodb.s3.bucket': '',
        'dynamodb.s3.prefix': '',
        'dynamodb.s3.bucketOwner': ''
    }
)

simplified = dynamicFrame.simplify_ddb_json()
print(simplified.getNumPartitions())
```

## spigot

**`spigot(path, options={})`**

Writes sample records to a specified destination to help you verify the transformations performed by your job.
+ `path` – The path of the destination to write to (required).
+ `options` – Key-value pairs that specify options (optional). The `"topk"` option specifies that the first `k` records should be written. The `"prob"` option specifies the probability (as a decimal) of choosing any given record. You can use it to select the records to write.
+ `transformation_ctx` – A unique string that is used to identify state information (optional).

### Example: Use spigot to write sample fields from a `DynamicFrame` to Amazon S3

This code example uses the `spigot` method to write sample records to an Amazon S3 bucket after applying the `select_fields` transform.

**Example dataset**

**Note**
To access the dataset that is used in this example, see [Code example: Joining and relationalizing data](aws-glue-programming-python-samples-legislators.md) and follow the instructions in [Step 1: Crawl the data in the Amazon S3 bucket](aws-glue-programming-python-samples-legislators.md#aws-glue-programming-python-samples-legislators-crawling).

The example uses a `DynamicFrame` called `persons` with the following schema:

```
root
|-- family_name: string
|-- name: string
|-- links: array
|    |-- element: struct
|    |    |-- note: string
|    |    |-- url: string
|-- gender: string
|-- image: string
|-- identifiers: array
|    |-- element: struct
|    |    |-- scheme: string
|    |    |-- identifier: string
|-- other_names: array
|    |-- element: struct
|    |    |-- lang: string
|    |    |-- note: string
|    |    |-- name: string
|-- sort_name: string
|-- images: array
|    |-- element: struct
|    |    |-- url: string
|-- given_name: string
|-- birth_date: string
|-- id: string
|-- contact_details: array
|    |-- element: struct
|    |    |-- type: string
|    |    |-- value: string
|-- death_date: string
```

**Example code**

```
# Example: Use spigot to write sample records
# to a destination during a transformation.
# Replace DOC-EXAMPLE-BUCKET with your own location.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Create GlueContext
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Load table data into a DynamicFrame
persons = glueContext.create_dynamic_frame.from_catalog(
    database="legislators", table_name="persons_json"
)

# Perform the select_fields on the DynamicFrame
persons = persons.select_fields(paths=["family_name", "given_name", "birth_date"])

# Use spigot to write a sample of the transformed data
# (the first 10 records)
spigot_output = persons.spigot(
    path="s3://DOC-EXAMPLE-BUCKET",
    options={"topk": 10}
)
```

#### Output

The following is an example of the data that `spigot` writes to Amazon S3. Because the example code specified `options={"topk": 10}`, the sample data contains the first 10 records.

```
{"family_name":"Collins","given_name":"Michael","birth_date":"1944-10-15"}
{"family_name":"Huizenga","given_name":"Bill","birth_date":"1969-01-31"}
{"family_name":"Clawson","given_name":"Curtis","birth_date":"1959-09-28"}
{"family_name":"Solomon","given_name":"Gerald","birth_date":"1930-08-14"}
{"family_name":"Rigell","given_name":"Edward","birth_date":"1960-05-28"}
{"family_name":"Crapo","given_name":"Michael","birth_date":"1951-05-20"}
{"family_name":"Hutto","given_name":"Earl","birth_date":"1926-05-12"}
{"family_name":"Ertel","given_name":"Allen","birth_date":"1937-11-07"}
{"family_name":"Minish","given_name":"Joseph","birth_date":"1916-09-01"}
{"family_name":"Andrews","given_name":"Robert","birth_date":"1957-08-04"}
```

## split_fields

**`split_fields(paths, name1, name2, transformation_ctx="", info="", stageThreshold=0, totalThreshold=0)`**

Returns a new `DynamicFrameCollection` that contains two `DynamicFrames`. The first `DynamicFrame` contains all the nodes that have been split off, and the second contains the nodes that remain.
+ `paths` – A list of strings, each of which is a full path to a node that you want to split into a new `DynamicFrame`.
+ `name1` – A name string for the `DynamicFrame` that is split off.
+ `name2` – A name string for the `DynamicFrame` that remains after the specified nodes have been split off.
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string to be associated with error reporting for this transformation (optional).
+ `stageThreshold` – The number of errors encountered during this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.
+ `totalThreshold` – The number of errors encountered up to and including this transformation at which the process should error out
(optional). The default is zero, which indicates that the process should not error out.

### Example: Use split_fields to split selected fields into a separate `DynamicFrame`

This code example uses the `split_fields` method to split a list of specified fields into a separate `DynamicFrame`.

**Example dataset**

The example uses a `DynamicFrame` called `l_root_contact_details` that is from a collection named `legislators_relationalized`.

`l_root_contact_details` has the following schema and entries.

```
root
|-- id: long
|-- index: int
|-- contact_details.val.type: string
|-- contact_details.val.value: string

+---+-----+------------------------+-------------------------+
| id|index|contact_details.val.type|contact_details.val.value|
+---+-----+------------------------+-------------------------+
|  1|    0|                   phone|             202-225-5265|
|  1|    1|                 twitter|              kathyhochul|
|  2|    0|                   phone|             202-225-3252|
|  2|    1|                 twitter|            repjackyrosen|
|  3|    0|                     fax|             202-225-1314|
|  3|    1|                   phone|             202-225-3772|
...
```

**Example code**

```
# Example: Use split_fields to split selected
# fields into a separate DynamicFrame

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Create GlueContext
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Load the input DynamicFrame and inspect its schema
frame_to_split = legislators_relationalized.select("l_root_contact_details")
print("Inspect the input DynamicFrame's schema:")
frame_to_split.printSchema()

# Split id and index fields into a separate DynamicFrame
split_fields_collection = frame_to_split.split_fields(["id", "index"], "left", "right")

# Inspect the resulting DynamicFrames
print("Inspect the schemas of the DynamicFrames created with split_fields:")
split_fields_collection.select("left").printSchema()
split_fields_collection.select("right").printSchema()
```

#### Output

```
Inspect the input DynamicFrame's schema:
root
|-- id: long
|-- index: int
|-- contact_details.val.type: string
|-- contact_details.val.value: string

Inspect the schemas of the DynamicFrames created with split_fields:
root
|-- id: long
|-- index: int

root
|-- contact_details.val.type: string
|-- contact_details.val.value: string
```

## split_rows

**`split_rows(comparison_dict, name1, name2, transformation_ctx="", info="", stageThreshold=0, totalThreshold=0)`**

Splits one or more rows in a `DynamicFrame` off into a new `DynamicFrame`.

The method returns a new `DynamicFrameCollection` that contains two `DynamicFrames`. The first `DynamicFrame` contains all the rows that have been split off, and the second contains the rows that remain.
+ `comparison_dict` – A dictionary where the key is a path to a column, and the value is another dictionary that maps comparison operators to the values that the column values are compared against. For example, `{"age": {">": 10, "<": 20}}` splits off all rows whose value in the age column is greater than 10 and less than 20.
+ `name1` – A name string for the `DynamicFrame` that is split off.
+ `name2` – A name string for the `DynamicFrame` that remains after the specified nodes have been split off.
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string to be associated with error reporting for this transformation (optional).
+ `stageThreshold` – The number of errors encountered during this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.
+ `totalThreshold` – The number of errors encountered up to and including this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.

### Example: Use split_rows to split rows in a `DynamicFrame`

This code example uses the `split_rows` method to split rows in a `DynamicFrame` based on the `id` field value.

**Example dataset**

The example uses a `DynamicFrame` called `l_root_contact_details` that is selected from a collection named `legislators_relationalized`.

`l_root_contact_details` has the following schema and entries.

```
root
|-- id: long
|-- index: int
|-- contact_details.val.type: string
|-- contact_details.val.value: string

+---+-----+------------------------+-------------------------+
| id|index|contact_details.val.type|contact_details.val.value|
+---+-----+------------------------+-------------------------+
|  1|    0|                   phone|             202-225-5265|
|  1|    1|                 twitter|              kathyhochul|
|  2|    0|                   phone|             202-225-3252|
|  2|    1|                 twitter|            repjackyrosen|
|  3|    0|                     fax|             202-225-1314|
|  3|    1|                   phone|             202-225-3772|
|  3|    2|                 twitter|          MikeRossUpdates|
|  4|    0|                     fax|             202-225-1314|
|  4|    1|                   phone|             202-225-3772|
|  4|    2|                 twitter|          MikeRossUpdates|
|  5|    0|                     fax|             202-225-1314|
|  5|    1|                   phone|             202-225-3772|
|  5|    2|                 twitter|          MikeRossUpdates|
|  6|    0|                     fax|             202-225-1314|
|  6|    1|                   phone|             202-225-3772|
|  6|    2|                 twitter|          MikeRossUpdates|
|  7|    0|                     fax|             202-225-1314|
|  7|    1|                   phone|             202-225-3772|
|  7|    2|                 twitter|          MikeRossUpdates|
|  8|    0|                     fax|             202-225-1314|
+---+-----+------------------------+-------------------------+
```

**Example code**

```
# Example: Use split_rows to split up
# rows in a DynamicFrame based on value

from pyspark.context import SparkContext
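# --- Illustrative aside (not part of the original AWS example) ---
# A minimal plain-Python sketch of how a comparison_dict such as
# {"id": {">": 10}} can be read: a row is split off only if every
# listed comparison holds. The helper name row_matches and the
# operator table are assumptions made for illustration, limited to the
# two operators the text mentions; this is not how AWS Glue
# implements split_rows internally.
import operator

_SKETCH_OPS = {">": operator.gt, "<": operator.lt}

def row_matches(row, comparison_dict):
    """Return True if the row satisfies every comparison in comparison_dict."""
    return all(
        _SKETCH_OPS[op](row[field], value)
        for field, comparisons in comparison_dict.items()
        for op, value in comparisons.items()
    )

# Under this reading, an id of 11 satisfies {"id": {">": 10}} and an
# id of 10 does not, so a row with id 10 stays in the remaining frame.
print(row_matches({"id": 11}, {"id": {">": 10}}))
print(row_matches({"id": 10}, {"id": {">": 10}}))
# --- End of illustrative aside ---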
from awsglue.context import GlueContext # Create GlueContext sc = SparkContext.getOrCreate() glueContext = GlueContext(sc) # Retrieve the DynamicFrame to split frame_to_split = legislators_relationalized.select("l_root_contact_details") # Split up rows by ID split_rows_collection = frame_to_split.split_rows({"id": {">": 10}}, "high", "low") # Inspect the resulting DynamicFrames print("Inspect the DynamicFrame that contains IDs < 10") split_rows_collection.select("low").toDF().show() print("Inspect the DynamicFrame that contains IDs > 10") split_rows_collection.select("high").toDF().show() ``` #### Output ``` Inspect the DynamicFrame that contains IDs < 10 +---+-----+------------------------+-------------------------+ | id|index|contact_details.val.type|contact_details.val.value| +---+-----+------------------------+-------------------------+ | 1| 0| phone| 202-225-5265| | 1| 1| twitter| kathyhochul| | 2| 0| phone| 202-225-3252| | 2| 1| twitter| repjackyrosen| | 3| 0| fax| 202-225-1314| | 3| 1| phone| 202-225-3772| | 3| 2| twitter| MikeRossUpdates| | 4| 0| fax| 202-225-1314| | 4| 1| phone| 202-225-3772| | 4| 2| twitter| MikeRossUpdates| | 5| 0| fax| 202-225-1314| | 5| 1| phone| 202-225-3772| | 5| 2| twitter| MikeRossUpdates| | 6| 0| fax| 202-225-1314| | 6| 1| phone| 202-225-3772| | 6| 2| twitter| MikeRossUpdates| | 7| 0| fax| 202-225-1314| | 7| 1| phone| 202-225-3772| | 7| 2| twitter| MikeRossUpdates| | 8| 0| fax| 202-225-1314| +---+-----+------------------------+-------------------------+ only showing top 20 rows Inspect the DynamicFrame that contains IDs > 10 +---+-----+------------------------+-------------------------+ | id|index|contact_details.val.type|contact_details.val.value| +---+-----+------------------------+-------------------------+ | 11| 0| phone| 202-225-5476| | 11| 1| twitter| RepDavidYoung| | 12| 0| phone| 202-225-4035| | 12| 1| twitter| RepStephMurphy| | 13| 0| fax| 202-226-0774| | 13| 1| phone| 202-225-6335| | 14| 0| fax| 202-226-0774| | 14| 1| 
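phone| 202-225-6335|
| 15| 0| fax| 202-226-0774|
| 15| 1| phone| 202-225-6335|
| 16| 0| fax| 202-226-0774|
| 16| 1| phone| 202-225-6335|
| 17| 0| fax| 202-226-0774|
| 17| 1| phone| 202-225-6335|
| 18| 0| fax| 202-226-0774|
| 18| 1| phone| 202-225-6335|
| 19| 0| fax| 202-226-0774|
| 19| 1| phone| 202-225-6335|
| 20| 0| fax| 202-226-0774|
| 20| 1| phone| 202-225-6335|
+---+-----+------------------------+-------------------------+
only showing top 20 rows
```

## unbox

**`unbox(path, format, transformation_ctx="", info="", stageThreshold=0, totalThreshold=0, **options)`**

Unboxes (reformats) a string field in a `DynamicFrame` and returns a new `DynamicFrame` that contains the unboxed `DynamicRecords`.

A `DynamicRecord` represents a logical record in a `DynamicFrame`. It is similar to a row in an Apache Spark `DataFrame`, except that it is self-describing and can be used for data that doesn't conform to a fixed schema.
+ `path` – A full path to the string node that you want to unbox.
+ `format` – A format specification (optional). You use this for an Amazon S3 or AWS Glue connection that supports multiple formats. For the formats that are supported, see [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md).
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string to be associated with error reporting for this transformation (optional).
+ `stageThreshold` – The number of errors encountered during this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.
+ `totalThreshold` – The number of errors encountered up to and including this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.
+ `options` – One or more of the following:
  + `separator` – A string that contains the separator character.
  + `escaper` – A string that contains the escape character.
  + `skipFirst` – A Boolean value that indicates whether to skip the first instance.
  + `withSchema` – A string that contains a JSON representation of the node's schema. The format of a schema's JSON representation is defined by the output of `StructType.json()`.
  + `withHeader` – A Boolean value that indicates whether a header is included.

### Example: Use unbox to unbox a string field into a struct

This code example uses the `unbox` method to *unbox*, or reformat, a string field in a `DynamicFrame` into a field of type struct.

**Example dataset**

The example uses a `DynamicFrame` called `mapped_with_string` with the following schema and entries.

Notice the field named `AddressString`. This is the field that the example unboxes into a struct.

```
root
|-- Average Total Payments: string
|-- AddressString: string
|-- Average Covered Charges: string
|-- DRG Definition: string
|-- Average Medicare Payments: string
|-- Hospital Referral Region Description: string
|-- Address: struct
|    |-- Zip.Code: string
|    |-- City: string
|    |-- Array: array
|    |    |-- element: string
|    |-- State: string
|    |-- Street: string
|-- Provider Id: string
|-- Total Discharges: string
|--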
Provider Name: string +----------------------+--------------------+-----------------------+--------------------+-------------------------+------------------------------------+--------------------+-----------+----------------+--------------------+ |Average Total Payments| AddressString|Average Covered Charges| DRG Definition|Average Medicare Payments|Hospital Referral Region Description| Address|Provider Id|Total Discharges| Provider Name| +----------------------+--------------------+-----------------------+--------------------+-------------------------+------------------------------------+--------------------+-----------+----------------+--------------------+ | $5777.24|{"Street": "1108 ...| $32963.07|039 - EXTRACRANIA...| $4763.73| AL - Dothan|[36301, DOTHAN, [...| 10001| 91|SOUTHEAST ALABAMA...| | $5787.57|{"Street": "2505 ...| $15131.85|039 - EXTRACRANIA...| $4976.71| AL - Birmingham|[35957, BOAZ, [25...| 10005| 14|MARSHALL MEDICAL ...| | $5434.95|{"Street": "205 M...| $37560.37|039 - EXTRACRANIA...| $4453.79| AL - Birmingham|[35631, FLORENCE,...| 10006| 24|ELIZA COFFEE MEMO...| | $5417.56|{"Street": "50 ME...| $13998.28|039 - EXTRACRANIA...| $4129.16| AL - Birmingham|[35235, BIRMINGHA...| 10011| 25| ST VINCENT'S EAST| ... 
```

**Example code**

```
# Example: Use unbox to unbox a string field
# into a struct in a DynamicFrame

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Create GlueContext
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

unboxed = mapped_with_string.unbox("AddressString", "json")

unboxed.printSchema()
unboxed.toDF().show()
```

#### Output

```
root
|-- Average Total Payments: string
|-- AddressString: struct
|    |-- Street: string
|    |-- City: string
|    |-- State: string
|    |-- Zip.Code: string
|    |-- Array: array
|    |    |-- element: string
|-- Average Covered Charges: string
|-- DRG Definition: string
|-- Average Medicare Payments: string
|-- Hospital Referral Region Description: string
|-- Address: struct
|    |-- Zip.Code: string
|    |-- City: string
|    |-- Array: array
|    |    |-- element: string
|    |-- State: string
|    |-- Street: string
|-- Provider Id: string
|-- Total Discharges: string
|-- Provider Name: string

+----------------------+--------------------+-----------------------+--------------------+-------------------------+------------------------------------+--------------------+-----------+----------------+--------------------+
|Average Total Payments| AddressString|Average Covered Charges| DRG Definition|Average Medicare Payments|Hospital Referral Region Description| Address|Provider Id|Total Discharges| Provider Name|
+----------------------+--------------------+-----------------------+--------------------+-------------------------+------------------------------------+--------------------+-----------+----------------+--------------------+
| $5777.24|[1108 ROSS CLARK ...| $32963.07|039 - EXTRACRANIA...| $4763.73| AL - Dothan|[36301, DOTHAN, [...| 10001| 91|SOUTHEAST ALABAMA...|
| $5787.57|[2505 U S HIGHWAY...| $15131.85|039 - EXTRACRANIA...| $4976.71| AL - Birmingham|[35957, BOAZ, [25...| 10005| 14|MARSHALL MEDICAL ...|
| $5434.95|[205 MARENGO STRE...| $37560.37|039 - EXTRACRANIA...| $4453.79| AL - Birmingham|[35631, FLORENCE,...|
10006| 24|ELIZA COFFEE MEMO...| | $5417.56|[50 MEDICAL PARK ...| $13998.28|039 - EXTRACRANIA...| $4129.16| AL - Birmingham|[35235, BIRMINGHA...| 10011| 25| ST VINCENT'S EAST| | $5658.33|[1000 FIRST STREE...| $31633.27|039 - EXTRACRANIA...| $4851.44| AL - Birmingham|[35007, ALABASTER...| 10016| 18|SHELBY BAPTIST ME...| | $6653.80|[2105 EAST SOUTH ...| $16920.79|039 - EXTRACRANIA...| $5374.14| AL - Montgomery|[36116, MONTGOMER...| 10023| 67|BAPTIST MEDICAL C...| | $5834.74|[2000 PEPPERELL P...| $11977.13|039 - EXTRACRANIA...| $4761.41| AL - Birmingham|[36801, OPELIKA, ...| 10029| 51|EAST ALABAMA MEDI...| | $8031.12|[619 SOUTH 19TH S...| $35841.09|039 - EXTRACRANIA...| $5858.50| AL - Birmingham|[35233, BIRMINGHA...| 10033| 32|UNIVERSITY OF ALA...| | $6113.38|[101 SIVLEY RD, H...| $28523.39|039 - EXTRACRANIA...| $5228.40| AL - Huntsville|[35801, HUNTSVILL...| 10039| 135| HUNTSVILLE HOSPITAL| | $5541.05|[1007 GOODYEAR AV...| $75233.38|039 - EXTRACRANIA...| $4386.94| AL - Birmingham|[35903, GADSDEN, ...| 10040| 34|GADSDEN REGIONAL ...| | $5461.57|[600 SOUTH THIRD ...| $67327.92|039 - EXTRACRANIA...| $4493.57| AL - Birmingham|[35901, GADSDEN, ...| 10046| 14|RIVERVIEW REGIONA...| | $5356.28|[4370 WEST MAIN S...| $39607.28|039 - EXTRACRANIA...| $4408.20| AL - Dothan|[36305, DOTHAN, [...| 10055| 45| FLOWERS HOSPITAL| | $5374.65|[810 ST VINCENT'S...| $22862.23|039 - EXTRACRANIA...| $4186.02| AL - Birmingham|[35205, BIRMINGHA...| 10056| 43|ST VINCENT'S BIRM...| | $5366.23|[400 EAST 10TH ST...| $31110.85|039 - EXTRACRANIA...| $4376.23| AL - Birmingham|[36207, ANNISTON,...| 10078| 21|NORTHEAST ALABAMA...| | $5282.93|[1613 NORTH MCKEN...| $25411.33|039 - EXTRACRANIA...| $4383.73| AL - Mobile|[36535, FOLEY, [1...| 10083| 15|SOUTH BALDWIN REG...| | $5676.55|[1201 7TH STREET ...| $9234.51|039 - EXTRACRANIA...| $4509.11| AL - Huntsville|[35609, DECATUR, ...| 10085| 27|DECATUR GENERAL H...| | $5930.11|[6801 AIRPORT BOU...| $15895.85|039 - EXTRACRANIA...| $3972.85| AL - Mobile|[36608, 
MOBILE, [...| 10090| 27| PROVIDENCE HOSPITAL|
| $6192.54|[809 UNIVERSITY B...| $19721.16|039 - EXTRACRANIA...| $5179.38| AL - Tuscaloosa|[35401, TUSCALOOS...| 10092| 31|D C H REGIONAL ME...|
| $4968.00|[750 MORPHY AVENU...| $10710.88|039 - EXTRACRANIA...| $3898.88| AL - Mobile|[36532, FAIRHOPE,...| 10100| 18| THOMAS HOSPITAL|
| $5996.00|[701 PRINCETON AV...| $51343.75|039 - EXTRACRANIA...| $4962.45| AL - Birmingham|[35211, BIRMINGHA...| 10103| 33|BAPTIST MEDICAL C...|
+----------------------+--------------------+-----------------------+--------------------+-------------------------+------------------------------------+--------------------+-----------+----------------+--------------------+
only showing top 20 rows
```

## union

**`union(frame1, frame2, transformation_ctx = "", info = "", stageThreshold = 0, totalThreshold = 0)`**

Unions two DynamicFrames. Returns a DynamicFrame that contains all records from both input DynamicFrames. This transform may return different results from the union of two DataFrames with equivalent data. If you need the Spark DataFrame union behavior, consider using `toDF`.
+ `frame1` – The first DynamicFrame to union.
+ `frame2` – The second DynamicFrame to union.
+ `transformation_ctx` – (optional) A unique string that is used to identify stats and state information.
+ `info` – (optional) Any string to be associated with errors in the transformation.
+ `stageThreshold` – (optional) The maximum number of errors in the transformation before processing errors out.
+ `totalThreshold` – (optional) The maximum number of errors in total before processing errors out.

## unnest

**`unnest(transformation_ctx="", info="", stageThreshold=0, totalThreshold=0)`**

Unnests nested objects in a `DynamicFrame`, which makes them top-level objects, and returns a new unnested `DynamicFrame`.
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string to be associated with error reporting for this transformation (optional).
+ `stageThreshold` – The number of errors encountered during this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.
+ `totalThreshold` – The number of errors encountered up to and including this transformation at which the process should error out (optional). The default is zero, which indicates that the process should not error out.

### Example: Use unnest to turn nested fields into top-level fields

This code example uses the `unnest` method to flatten all of the nested fields in a `DynamicFrame` into top-level fields.

**Example dataset**

The example uses a `DynamicFrame` called `mapped_medicare` with the following schema. Notice that the `Address` field is the only field that contains nested data.

```
root
|-- Average Total Payments: string
|-- Average Covered Charges: string
|-- DRG Definition: string
|-- Average Medicare Payments: string
|-- Hospital Referral Region Description: string
|-- Address: struct
|    |-- Zip.Code: string
|    |-- City: string
|    |-- Array: array
|    |    |-- element: string
|    |-- State: string
|    |-- Street: string
|-- Provider Id: string
|-- Total Discharges: string
|-- Provider Name: string
```

**Example code**

```
# Example: Use unnest to unnest nested
# objects in a DynamicFrame

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Create GlueContext
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Unnest all nested fields
unnested = mapped_medicare.unnest()
unnested.printSchema()
```

#### Output

```
root
|-- Average Total Payments: string
|-- Average Covered Charges: string
|-- DRG Definition: string
|-- Average Medicare Payments: string
|-- Hospital Referral Region Description: string
|-- Address.Zip.Code: string
|-- Address.City: string
|-- Address.Array: array
|    |-- element: string
|-- Address.State: string
|-- Address.Street: string
|-- Provider Id: string
|-- Total Discharges: string
|-- Provider Name: string
```

## unnest_ddb_json

Unnests nested columns in a `DynamicFrame` that are specifically in the DynamoDB JSON structure, and returns a new unnested `DynamicFrame`. Columns that are an array of struct types are not unnested. Note that this is a specific type of unnesting transform that behaves differently from the regular `unnest` transform and requires the data to already be in the DynamoDB JSON structure. For more information, see [DynamoDB JSON](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DataExport.Output.html#DataExport.Output.Data).

**`unnest_ddb_json(transformation_ctx="", info="", stageThreshold=0, totalThreshold=0)`**
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string to be associated with error reporting for this transformation (optional).
+ `stageThreshold` – The number of errors encountered during this transformation at which the process should error out (optional: zero by default, indicating that the process should not error out).
+ `totalThreshold` – The number of errors encountered up to and including this transformation at which the process should error out (optional: zero by default, indicating that the process should not error out).

For example, the schema of an export that is read with the DynamoDB JSON structure might look like the following:

```
root
|-- Item: struct
|    |-- ColA: struct
|    |    |-- S: string
|    |-- ColB: struct
|    |    |-- S: string
|    |-- ColC: struct
|    |    |-- N: string
|    |-- ColD: struct
|    |    |-- L: array
|    |    |    |-- element: null
```

The `unnest_ddb_json()` transform would convert this to:

```
root
|-- ColA: string
|-- ColB: string
|-- ColC: string
|-- ColD: array
|    |-- element: null
```

The following code example shows how to use the AWS Glue
DynamoDB export connector, invoke a DynamoDB JSON unnest, and print the number of partitions:

```
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

dynamicFrame = glue_context.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.export": "ddb",
        "dynamodb.tableArn": "",
        "dynamodb.s3.bucket": "",
        "dynamodb.s3.prefix": "",
        "dynamodb.s3.bucketOwner": "",
    }
)

unnested = dynamicFrame.unnest_ddb_json()
print(unnested.getNumPartitions())

job.commit()
```

## write

**`write(connection_type, connection_options, format, format_options, accumulator_size)`**

Gets a [DataSink(object)](aws-glue-api-crawler-pyspark-extensions-types.md#aws-glue-api-crawler-pyspark-extensions-types-awsglue-data-sink) of the specified connection type from the [GlueContext class](aws-glue-api-crawler-pyspark-extensions-glue-context.md) of this `DynamicFrame`, and uses it to format and write the contents of this `DynamicFrame`. Returns the new `DynamicFrame` formatted and written as specified.
+ `connection_type` – The connection type to use. Valid values include `s3`, `mysql`, `postgresql`, `redshift`, `sqlserver`, and `oracle`.
+ `connection_options` – The connection options to use (optional). For a `connection_type` of `s3`, an Amazon S3 path is defined.

  ```
  connection_options = {"path": "s3://aws-glue-target/temp"}
  ```

  For JDBC connections, several properties must be defined. Note that the database name must be part of the URL. It can optionally be included in the connection options.

  **Warning**
  Storing passwords in your script is not recommended. Consider using `boto3` to retrieve them from AWS Secrets Manager or the AWS Glue Data Catalog.

  ```
  connection_options = {"url": "jdbc-url/database", "user": "username", "password": passwordVariable,"dbtable": "table-name", "redshiftTmpDir": "s3-tempdir-path"}
  ```
+ `format` – A format specification (optional). This is used for an Amazon Simple Storage Service (Amazon S3) or an AWS Glue connection that supports multiple formats. See [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md) for the formats that are supported.
+ `format_options` – Format options for the specified format. See [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md) for the formats that are supported.
+ `accumulator_size` – The accumulable size to use, in bytes (optional).

## — errors —
+ [assertErrorThreshold](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-assertErrorThreshold)
+ [errorsAsDynamicFrame](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-errorsAsDynamicFrame)
+ [errorsCount](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-errorsCount)
+ [stageErrorsCount](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-stageErrorsCount)

## assertErrorThreshold

`assertErrorThreshold( )` – An assert for errors in the transformations that created this `DynamicFrame`. Returns an `Exception` from the underlying `DataFrame`.

## errorsAsDynamicFrame

`errorsAsDynamicFrame( )` – Returns a `DynamicFrame` that has error records nested inside.

### Example: Use errorsAsDynamicFrame to view error records

The following code example shows how to use the `errorsAsDynamicFrame` method to view an error record for a `DynamicFrame`.

**Example dataset**

The example uses the following dataset, which you can upload to Amazon S3 as JSON. Notice that the second record is malformed. Malformed data typically breaks file parsing when you use SparkSQL. However, `DynamicFrame` recognizes malformation issues and turns malformed lines into error records that you can handle individually.

```
{"id": 1, "name": "george", "surname": "washington", "height": 178}
{"id": 2, "name": "benjamin", "surname": "franklin",
{"id": 3, "name": "alexander", "surname": "hamilton", "height": 171}
{"id": 4, "name": "john", "surname": "jay", "height": 190}
```

**Example code**

```
# Example: Use errorsAsDynamicFrame to view error records.
# Replace s3://DOC-EXAMPLE-S3-BUCKET/error_data.json with your location.
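# --- Illustrative aside (not part of the original AWS example) ---
# A minimal sketch, using only the Python standard library, that checks
# each line of the sample dataset with json.loads to show which record
# is malformed before you upload the file. The names SAMPLE_LINES and
# malformed_line_numbers are hypothetical. This local, line-by-line
# check is independent of how AWS Glue detects error records, so its
# result can differ from what the DynamicFrame reports.
import json

SAMPLE_LINES = [
    '{"id": 1, "name": "george", "surname": "washington", "height": 178}',
    '{"id": 2, "name": "benjamin", "surname": "franklin",',
    '{"id": 3, "name": "alexander", "surname": "hamilton", "height": 171}',
    '{"id": 4, "name": "john", "surname": "jay", "height": 190}',
]

def malformed_line_numbers(lines):
    """Return the 1-based numbers of the lines that are not valid JSON."""
    bad = []
    for number, line in enumerate(lines, start=1):
        try:
            json.loads(line)
        except json.JSONDecodeError:
            bad.append(number)
    return bad

print("Malformed lines in the sample:", malformed_line_numbers(SAMPLE_LINES))
# --- End of illustrative aside ---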
from pyspark.context import SparkContext from awsglue.context import GlueContext # Create GlueContext sc = SparkContext.getOrCreate() glueContext = GlueContext(sc) # Create errors DynamicFrame, view schema errors = glueContext.create_dynamic_frame.from_options( "s3", {"paths": ["s3://DOC-EXAMPLE-S3-BUCKET/error_data.json"]}, "json" ) print("Schema of errors DynamicFrame:") errors.printSchema() # Show that errors only contains valid entries from the dataset print("errors contains only valid records from the input dataset (2 of 4 records)") errors.toDF().show() # View errors print("Errors count:", str(errors.errorsCount())) print("Errors:") errors.errorsAsDynamicFrame().toDF().show() # View error fields and error data error_record = errors.errorsAsDynamicFrame().toDF().head() error_fields = error_record["error"] print("Error fields: ") print(error_fields.asDict().keys()) print("\nError record data:") for key in error_fields.asDict().keys(): print("\n", key, ": ", str(error_fields[key])) ``` #### Output ``` Schema of errors DynamicFrame: root |-- id: int |-- name: string |-- surname: string |-- height: int errors contains only valid records from the input dataset (2 of 4 records) +---+------+----------+------+ | id| name| surname|height| +---+------+----------+------+ | 1|george|washington| 178| | 4| john| jay| 190| +---+------+----------+------+ Errors count: 1 Errors: +--------------------+ | error| +--------------------+ |[[ File "/tmp/20...| +--------------------+ Error fields: dict_keys(['callsite', 'msg', 'stackTrace', 'input', 'bytesread', 'source', 'dynamicRecord']) Error record data: callsite : Row(site=' File "/tmp/2060612586885849088", line 549, in \n sys.exit(main())\n File "/tmp/2060612586885849088", line 523, in main\n response = handler(content)\n File "/tmp/2060612586885849088", line 197, in execute_request\n result = node.execute()\n File "/tmp/2060612586885849088", line 103, in execute\n exec(code, global_dict)\n File "", line 10, in \n File 
"/opt/amazon/lib/python3.6/site-packages/awsglue/dynamicframe.py", line 625, in from_options\n format_options, transformation_ctx, push_down_predicate, **kwargs)\n File "/opt/amazon/lib/python3.6/site-packages/awsglue/context.py", line 233, in create_dynamic_frame_from_options\n source.setFormat(format, **format_options)\n', info='') msg : error in jackson reader stackTrace : com.fasterxml.jackson.core.JsonParseException: Unexpected character ('{' (code 123)): was expecting either valid name character (for unquoted name) or double-quote (for quoted) to start field name at [Source: com.amazonaws.services.glue.readers.BufferedStream@73492578; line: 3, column: 2] at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1581) at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:533) at com.fasterxml.jackson.core.base.ParserMinimalBase._reportUnexpectedChar(ParserMinimalBase.java:462) at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._handleOddName(UTF8StreamJsonParser.java:2012) at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._parseName(UTF8StreamJsonParser.java:1650) at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextToken(UTF8StreamJsonParser.java:740) at com.amazonaws.services.glue.readers.JacksonReader$$anonfun$hasNextGoodToken$1.apply(JacksonReader.scala:57) at com.amazonaws.services.glue.readers.JacksonReader$$anonfun$hasNextGoodToken$1.apply(JacksonReader.scala:57) at scala.collection.Iterator$$anon$9.next(Iterator.scala:162) at scala.collection.Iterator$$anon$16.hasNext(Iterator.scala:599) at scala.collection.Iterator$$anon$16.hasNext(Iterator.scala:598) at scala.collection.Iterator$class.foreach(Iterator.scala:891) at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) at com.amazonaws.services.glue.readers.JacksonReader$$anonfun$1.apply(JacksonReader.scala:120) at com.amazonaws.services.glue.readers.JacksonReader$$anonfun$1.apply(JacksonReader.scala:116) at 
com.amazonaws.services.glue.DynamicRecordBuilder.handleErr(DynamicRecordBuilder.scala:209) at com.amazonaws.services.glue.DynamicRecordBuilder.handleErrorWithException(DynamicRecordBuilder.scala:202) at com.amazonaws.services.glue.readers.JacksonReader.nextFailSafe(JacksonReader.scala:116) at com.amazonaws.services.glue.readers.JacksonReader.next(JacksonReader.scala:109) at com.amazonaws.services.glue.readers.JSONReader.next(JSONReader.scala:247) at com.amazonaws.services.glue.hadoop.TapeHadoopRecordReaderSplittable.nextKeyValue(TapeHadoopRecordReaderSplittable.scala:103) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:230) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:121) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) input : bytesread : 252 source : dynamicRecord : Row(id=2, name='benjamin', surname='franklin') ``` ## 全方位 DynamicFrame 範例下列範例示範在基本 Glue 目錄案例之外建立和使用 DynamicFrames 的各種方法。 ### 使用 SQL SELECT 查詢從 PostgreSQL 載入此範例說明如何使用自訂 SQL SELECT 查詢從 PostgreSQL 資料庫載入資料： ``` from awsglue.context import GlueContext from pyspark.context import SparkContext sc = SparkContext() glueContext = GlueContext(sc) # Load specific data from PostgreSQL with custom query postgres_dyf = glueContext.create_dynamic_frame.from_options( connection_type="postgresql", connection_options={ "url": "jdbc:postgresql://your-postgres-host:5432/your-database", "user": "your-username", "password": "your-password", "dbtable": "(SELECT customer_id, customer_name, email FROM customers WHERE active = true) AS filtered_customers" } ) ``` ### 載入特定資料欄以避免完整資料表掃描此範例示範如何僅從大型資料庫資料表載入特定資料欄： ``` from awsglue.context import GlueContext from pyspark.context import SparkContext sc = SparkContext() glueContext = GlueContext(sc) # Load only specific columns from a large table selected_columns_dyf = glueContext.create_dynamic_frame.from_options( connection_type="mysql", connection_options={ "url": "jdbc:mysql://your-mysql-host:3306/your-database", "user": "your-username", "password": "your-password", "dbtable": "(SELECT order_id, customer_id FROM large_orders_table) AS selected_data" } ) # Alternative approach using column selection 
# in query
efficient_load_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="postgresql",
    connection_options={
        "url": "jdbc:postgresql://your-postgres-host:5432/your-database",
        "user": "your-username",
        "password": "your-password",
        "query": "SELECT product_id, product_name FROM products WHERE category = 'electronics'"
    }
)
```

### Row-level filtering with JDBC connections

This example demonstrates how to use row-level filtering to load only specific rows from a database table:

```
from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)

# Load filtered rows using WHERE clause
filtered_rows_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="postgresql",
    connection_options={
        "url": "jdbc:postgresql://your-postgres-host:5432/your-database",
        "user": "your-username",
        "password": "your-password",
        "dbtable": "(SELECT * FROM transactions WHERE transaction_date >= '2024-01-01' AND amount > 100) AS recent_large_transactions"
    }
)

# Using partitionColumn for parallel loading with filtering
partitioned_load_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="mysql",
    connection_options={
        "url": "jdbc:mysql://your-mysql-host:3306/your-database",
        "user": "your-username",
        "password": "your-password",
        "dbtable": "sales_data",
        "partitionColumn": "sale_date",
        "lowerBound": "2024-01-01",
        "upperBound": "2024-12-31",
        "numPartitions": "10"
    }
)
```

### Creating a DynamicFrame from in-memory Python data

This example demonstrates how to create a DynamicFrame from Python lists, tuples, or dictionaries:

```
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from pyspark.sql import Row

sc = SparkContext()
glueContext = GlueContext(sc)

# Method 1: From list of tuples
data_tuples = [
    ("John", "Doe", 30, "Engineer"),
    ("Jane", "Smith", 25, "Designer"),
    ("Bob", "Johnson", 35, "Manager")
]

# Convert to RDD of Rows
rdd = sc.parallelize([Row(first_name=row[0], last_name=row[1], age=row[2], job=row[3]) for row in data_tuples])
df = glueContext.spark_session.createDataFrame(rdd)
dyf_from_tuples = DynamicFrame.fromDF(df, glueContext, "employees_from_tuples")

# Method 2: From list of dictionaries
data_dicts = [
    {"product_id": 1, "product_name": "Laptop", "price": 999.99, "category": "Electronics"},
    {"product_id": 2, "product_name": "Book", "price": 19.99, "category": "Education"},
    {"product_id": 3, "product_name": "Chair", "price": 149.99, "category": "Furniture"}
]

df_from_dicts = glueContext.spark_session.createDataFrame(data_dicts)
dyf_from_dicts = DynamicFrame.fromDF(df_from_dicts, glueContext, "products_from_dicts")

# Method 3: From nested data structures
nested_data = [
    {
        "customer_id": 1,
        "customer_info": {
            "name": "Alice Brown",
            "email": "alice@example.com"
        },
        "orders": [
            {"order_id": 101, "amount": 250.00},
            {"order_id": 102, "amount": 175.50}
        ]
    }
]

df_nested = glueContext.spark_session.createDataFrame(nested_data)
dyf_nested = DynamicFrame.fromDF(df_nested, glueContext, "customers_with_orders")
```

### Performance optimization for large datasets

When working with large datasets, consider the following performance optimization techniques:

```
# These snippets reuse the glueContext created in the examples above.

# Use partitioning for parallel reads
large_table_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="postgresql",
    connection_options={
        "url": "jdbc:postgresql://your-postgres-host:5432/your-database",
        "user": "your-username",
        "password": "your-password",
        "dbtable": "large_table",
        "partitionColumn": "id",
        "lowerBound": "1",
        "upperBound": "1000000",
        "numPartitions": "20"
    }
)

# Use pushdown predicates to filter at source
filtered_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="mysql",
    connection_options={
        "url": "jdbc:mysql://your-mysql-host:3306/your-database",
        "user": "your-username",
        "password": "your-password",
        "dbtable": "transactions"
    },
    push_down_predicate="transaction_date >= '2024-01-01'"
)
```

## errorsCount

`errorsCount( )` – Returns the total number of errors in a `DynamicFrame`.

## stageErrorsCount

`stageErrorsCount` – Returns the number of errors that occurred in the process of generating this `DynamicFrame`.

# DynamicFrameCollection class

A `DynamicFrameCollection` is a dictionary of [DynamicFrame class](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md) objects, in which the keys are the names of the `DynamicFrames` and the values are the `DynamicFrame` objects.

## \_\_init\_\_

**`__init__(dynamic_frames, glue_ctx)`**

+ `dynamic_frames` – A dictionary of [DynamicFrame class](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md) objects.
+ `glue_ctx` – A [GlueContext class](aws-glue-api-crawler-pyspark-extensions-glue-context.md) object.

## Keys

`keys( )` – Returns a list of the keys in this collection, which generally consists of the names of the corresponding `DynamicFrame` values.

## Values

`values(key)` – Returns a list of the `DynamicFrame` values in this collection.

## Select

**`select(key)`**

Returns the `DynamicFrame` that corresponds to the specified key (which is generally the name of the `DynamicFrame`).

+ `key` – A key in the `DynamicFrameCollection`, which usually represents the name of a `DynamicFrame`.

## Map

**`map(callable, transformation_ctx="")`**

Uses a passed-in function to create and return a new `DynamicFrameCollection` based on the `DynamicFrames` in this collection.

+ `callable` – A function that takes a `DynamicFrame` and the specified transformation context as parameters and returns a `DynamicFrame`.
+ `transformation_ctx` – A transformation context to be used by the callable (optional).

## Flatmap

**`flatmap(f, transformation_ctx="")`**

Uses a passed-in function to create and return a new `DynamicFrameCollection` based on the `DynamicFrames` in this collection.

+ `f` – A function that takes a `DynamicFrame` as a parameter and returns a `DynamicFrame` or `DynamicFrameCollection`.
+ `transformation_ctx` – A transformation context to be used by the function (optional).

# DynamicFrameWriter class

## Methods

+ [\_\_init\_\_](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-writer-__init__)
+ [from\_options](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-writer-from_options)
+ [from\_catalog](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-writer-from_catalog)
+ [from\_jdbc\_conf](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-writer-from_jdbc_conf)

## \_\_init\_\_

**`__init__(glue_context)`**

+ `glue_context` – The [GlueContext class](aws-glue-api-crawler-pyspark-extensions-glue-context.md) to use.

## from\_options

**`from_options(frame, connection_type, connection_options={}, format=None, format_options={}, transformation_ctx="")`**

Writes a `DynamicFrame` using the specified connection and format.

+ `frame` – The `DynamicFrame` to write.
+ `connection_type` – The connection type. Valid values include `s3`, `mysql`, `postgresql`, `redshift`, `sqlserver`, and `oracle`.
+ `connection_options` – Connection options, such as path and database table (optional). For a `connection_type` of `s3`, an Amazon S3 path is defined.

  ```
  connection_options = {"path": "s3://aws-glue-target/temp"}
  ```

  For JDBC connections, several properties must be defined. Note that the database name must be part of the URL. It can optionally be included in the connection options.

  **Warning**
  Storing passwords in your script is not recommended. Consider using `boto3` to retrieve them from AWS Secrets Manager or the AWS Glue Data Catalog.

  ```
  connection_options = {"url": "jdbc-url/database", "user": "username", "password": passwordVariable,"dbtable": "table-name", "redshiftTmpDir": "s3-tempdir-path"}
  ```

  The `dbtable` property is the name of the JDBC table. For JDBC data stores that support schemas within a database, specify `schema.table-name`. If a schema is not provided, then the default "public" schema is used.

  For more information, see [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md).
+ `format` – A format specification (optional). This is used for an Amazon Simple Storage Service (Amazon S3) or an AWS Glue connection that supports multiple formats. See [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md) for the formats that are supported.
+ `format_options` – Format options for the specified format. See [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md) for the formats that are supported.
+ `transformation_ctx` – A transformation context to use (optional).

## from\_catalog

**`from_catalog(frame, name_space, table_name, redshift_tmp_dir="", transformation_ctx="")`**

Writes a `DynamicFrame` using the specified catalog database and table name.

+ `frame` – The `DynamicFrame` to write.
+ `name_space` – The database to use.
+ `table_name` – The `table_name` to use.
+ `redshift_tmp_dir` – An Amazon Redshift temporary directory to use (optional).
+ `transformation_ctx` – A transformation context to use (optional).
+ `additional_options` – Additional options provided to AWS Glue.

  To write to Lake Formation governed tables, you can use these additional options:
  + `transactionId` – (String) The transaction ID at which to do the write to the governed table. This transaction must not already be committed or aborted, or the write will fail.
  + `callDeleteObjectsOnCancel` – (Boolean, optional) If set to `true` (default), AWS Glue automatically calls the `DeleteObjectsOnCancel` API after the object is written to Amazon S3. For more information, see [DeleteObjectsOnCancel](https://docs.aws.amazon.com/lake-formation/latest/dg/aws-lake-formation-api-transactions-api.html#aws-lake-formation-api-transactions-api-DeleteObjectsOnCancel) in the *AWS Lake Formation Developer Guide*.

**Example: Writing to a governed table in Lake Formation**

```
txId = glueContext.start_transaction(read_only=False)
glueContext.write_dynamic_frame.from_catalog(
    frame=dyf,
    database = db,
    table_name = tbl,
    transformation_ctx = "datasource0",
    additional_options={"transactionId":txId})
...
glueContext.commit_transaction(txId)
```

## from\_jdbc\_conf

**`from_jdbc_conf(frame, catalog_connection, connection_options={}, redshift_tmp_dir = "", transformation_ctx="")`**

Writes a `DynamicFrame` using the specified JDBC connection information.

+ `frame` – The `DynamicFrame` to write.
+ `catalog_connection` – A catalog connection to use.
+ `connection_options` – Connection options, such as path and database table (optional).
+ `redshift_tmp_dir` – An Amazon Redshift temporary directory to use (optional).
+ `transformation_ctx` – A transformation context to use (optional).

## Example for write\_dynamic\_frame

This example writes the output locally by using a `connection_type` of S3 with a POSIX path argument in `connection_options`, which allows writing to local storage.

```
glueContext.write_dynamic_frame.from_options(
    frame = dyf_splitFields,
    connection_options = {'path': '/home/glue/GlueLocalOutput/'},
    connection_type = 's3',
    format = 'json')
```

# DynamicFrameReader class

## Methods

+ [\_\_init\_\_](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-reader-__init__)
+ [from\_rdd](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-reader-from_rdd)
+ [from\_options](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-reader-from_options)
+ [from\_catalog](#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-reader-from_catalog)

## \_\_init\_\_

**`__init__(glue_context)`**

+ `glue_context` – The [GlueContext class](aws-glue-api-crawler-pyspark-extensions-glue-context.md) to use.

## from\_rdd

**`from_rdd(data, name, schema=None, sampleRatio=None)`**

Reads a `DynamicFrame` from a Resilient Distributed Dataset (RDD).

+ `data` – The dataset to read from.
+ `name` – The name to read from.
+ `schema` – The schema to read (optional).
+ `sampleRatio` – The sample ratio (optional).

## from\_options

**`from_options(connection_type, connection_options={}, format=None, format_options={}, transformation_ctx="")`**

Reads a `DynamicFrame` using the specified connection and format.

+ `connection_type` – The connection type. Valid values include `s3`, `mysql`, `postgresql`, `redshift`, `sqlserver`, `oracle`, `dynamodb`, and `snowflake`.
+ `connection_options` – Connection options, such as path and database table (optional). For more information, see [Connection types and options for ETL in AWS Glue for Spark](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html). For a `connection_type` of `s3`, Amazon S3 paths are defined in an array.

  ```
  connection_options = {"paths": [ "s3://amzn-s3-demo-bucket/object_a", "s3://amzn-s3-demo-bucket/object_b"]}
  ```

  For JDBC connections, several properties must be defined. Note that the database name must be part of the URL. It can optionally be included in the connection options.

  **Warning**
  Storing passwords in your script is not recommended. Consider using `boto3` to retrieve them from AWS Secrets Manager or the AWS Glue Data Catalog.

  ```
  connection_options = {"url": "jdbc-url/database", "user": "username", "password": passwordVariable,"dbtable": "table-name", "redshiftTmpDir": "s3-tempdir-path"}
  ```

  For a JDBC connection that performs parallel reads, you can set the `hashfield` option. For example:

  ```
  connection_options = {"url": "jdbc-url/database", "user": "username", "password": passwordVariable,"dbtable": "table-name", "redshiftTmpDir": "s3-tempdir-path" , "hashfield": "month"}
  ```

  For more information, see [Reading from JDBC tables in parallel](run-jdbc-parallel-read-job.md).
+ `format` – A format specification (optional). This is used for an Amazon Simple Storage Service (Amazon S3) or an AWS Glue connection that supports multiple formats. See [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md) for the formats that are supported.
+ `format_options` – Format options for the specified format. See [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md) for the formats that are supported.
+ `transformation_ctx` – The transformation context to use (optional).
+ `push_down_predicate` – Filters partitions without having to list and read all the files in your dataset. For more information, see [Pre-filtering using pushdown predicates](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html#aws-glue-programming-etl-partitions-pushdowns).

## from\_catalog

**`from_catalog(database, table_name, redshift_tmp_dir="", transformation_ctx="", push_down_predicate="", additional_options={})`**

Reads a `DynamicFrame` using the specified catalog namespace and table name.

+ `database` – The database to read from.
+ `table_name` – The name of the table to read from.
+ `redshift_tmp_dir` – An Amazon Redshift temporary directory to use (optional if not reading data from Redshift).
+ `transformation_ctx` – The transformation context to use (optional).
+ `push_down_predicate` – Filters partitions without having to list and read all the files in your dataset. For more information, see [Pre-filtering using pushdown predicates](aws-glue-programming-etl-partitions.md#aws-glue-programming-etl-partitions-pushdowns).
+ `additional_options` – Additional options provided to AWS Glue.
  + To use a JDBC connection that performs parallel reads, you can set the `hashfield`, `hashexpression`, or `hashpartitions` options. For example:

    ```
    additional_options = {"hashfield": "month"}
    ```

    For more information, see [Reading from JDBC tables in parallel](run-jdbc-parallel-read-job.md).
  + To pass a catalog expression that filters on the index columns, use the `catalogPartitionPredicate` option.

    `catalogPartitionPredicate` – You can pass a catalog expression to filter based on the index columns. This pushes the filtering down to the server side. For more information, see [AWS Glue partition indexes](https://docs.aws.amazon.com/glue/latest/dg/partition-indexes.html). Note that `push_down_predicate` and `catalogPartitionPredicate` use different syntaxes. The former uses Spark SQL standard syntax and the latter uses the JSQL parser.

    For more information, see [Managing partitions for ETL output in AWS Glue](aws-glue-programming-etl-partitions.md).

# GlueContext class

Wraps the Apache Spark [SparkContext](https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html) object, and thereby provides mechanisms for interacting with the Apache Spark platform.

## \_\_init\_\_

**`__init__(sparkContext)`**

+ `sparkContext` – The Apache Spark context to use.

## Creating

+ [\_\_init\_\_](#aws-glue-api-crawler-pyspark-extensions-glue-context-__init__)
+ [getSource](#aws-glue-api-crawler-pyspark-extensions-glue-context-get-source)
+ [create\_dynamic\_frame\_from\_rdd](#aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_rdd)
+ [create\_dynamic\_frame\_from\_catalog](#aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_catalog)
+ [create\_dynamic\_frame\_from\_options](#aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_options)
+ [create\_sample\_dynamic\_frame\_from\_catalog](#aws-glue-api-crawler-pyspark-extensions-glue-context-create-sample-dynamic-frame-from-catalog)
+ [create\_sample\_dynamic\_frame\_from\_options](#aws-glue-api-crawler-pyspark-extensions-glue-context-create-sample-dynamic-frame-from-options)
+ [add\_ingestion\_time\_columns](#aws-glue-api-crawler-pyspark-extensions-glue-context-add-ingestion-time-columns)
+ [create\_data\_frame\_from\_catalog](#aws-glue-api-crawler-pyspark-extensions-glue-context-create-dataframe-from-catalog)
+ [create\_data\_frame\_from\_options](#aws-glue-api-crawler-pyspark-extensions-glue-context-create-dataframe-from-options)
+ [forEachBatch](#aws-glue-api-crawler-pyspark-extensions-glue-context-forEachBatch)

## getSource

**`getSource(connection_type, transformation_ctx = "", **options)`**

Creates a `DataSource` object that can be used to read `DynamicFrames` from external sources.

+ `connection_type` – The connection type to use, such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, and JDBC. Valid values include `s3`, `mysql`, `postgresql`, `redshift`, `sqlserver`, `oracle`, and `dynamodb`.
+ `transformation_ctx` – The transformation context to use (optional).
+ `options` – A collection of optional name-value pairs. For more information, see [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md).

The following is an example of using `getSource`:

```
>>> data_source = context.getSource("file", paths=["/in/path"])
>>> data_source.setFormat("json")
>>> myFrame = data_source.getFrame()
```

## create\_dynamic\_frame\_from\_rdd

**`create_dynamic_frame_from_rdd(data, name, schema=None, sample_ratio=None, transformation_ctx="")`**

Returns a `DynamicFrame` that is created from an Apache Spark Resilient Distributed Dataset (RDD).

+ `data` – The data source to use.
+ `name` – The name of the data to use.
+ `schema` – The schema to use (optional).
+ `sample_ratio` – The sample ratio to use (optional).
+ `transformation_ctx` – The transformation context to use (optional).

## create\_dynamic\_frame\_from\_catalog

**`create_dynamic_frame_from_catalog(database, table_name, redshift_tmp_dir, transformation_ctx = "", push_down_predicate= "", additional_options = {}, catalog_id = None)`**

Returns a `DynamicFrame` that is created using a Data Catalog database and table name. When using this method, you provide `format_options` through table properties on the specified AWS Glue Data Catalog table and other options through the `additional_options` argument.

+ `database` – The database to read from.
+ `table_name` – The name of the table to read from.
+ `redshift_tmp_dir` – An Amazon Redshift temporary directory to use (optional).
+ `transformation_ctx` – The transformation context to use (optional).
+ `push_down_predicate` – Filters partitions without having to list and read all the files in your dataset. For supported sources and limitations, see [Optimizing reads with pushdown in AWS Glue ETL](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-pushdown.html). For more information, see [Pre-filtering using pushdown predicates](aws-glue-programming-etl-partitions.md#aws-glue-programming-etl-partitions-pushdowns).
+ `additional_options` – A collection of optional name-value pairs. The possible options include those listed in [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md), except for `endpointUrl`, `streamName`, `bootstrap.servers`, `security.protocol`, `topicName`, `classification`, and `delimiter`. Another supported option is `catalogPartitionPredicate`:

  `catalogPartitionPredicate` – You can pass a catalog expression to filter based on the index columns. This pushes the filtering down to the server side. For more information, see [AWS Glue partition indexes](https://docs.aws.amazon.com/glue/latest/dg/partition-indexes.html). Note that `push_down_predicate` and `catalogPartitionPredicate` use different syntaxes. The former uses Spark SQL standard syntax and the latter uses the JSQL parser.
+ `catalog_id` – The catalog ID (account ID) of the Data Catalog being accessed. When `None`, the default account ID of the caller is used.

## create\_dynamic\_frame\_from\_options

**`create_dynamic_frame_from_options(connection_type, connection_options={}, format=None, format_options={}, transformation_ctx = "")`**

Returns a `DynamicFrame` created with the specified connection and format.

+ `connection_type` – The connection type, such as Amazon S3, Amazon Redshift, and JDBC. Valid values include `s3`, `mysql`, `postgresql`, `redshift`, `sqlserver`, `oracle`, and `dynamodb`.
+ `connection_options` – Connection options, such as paths and database table (optional). For a `connection_type` of `s3`, a list of Amazon S3 paths is defined.

  ```
  connection_options = {"paths": ["s3://aws-glue-target/temp"]}
  ```

  For JDBC connections, several properties must be defined. Note that the database name must be part of the URL. It can optionally be included in the connection options.

  **Warning**
  Storing passwords in your script is not recommended. Consider using `boto3` to retrieve them from AWS Secrets Manager or the AWS Glue Data Catalog.

  ```
  connection_options = {"url": "jdbc-url/database", "user": "username", "password": passwordVariable,"dbtable": "table-name", "redshiftTmpDir": "s3-tempdir-path"}
  ```

  The `dbtable` property is the name of the JDBC table. For JDBC data stores that support schemas within a database, specify `schema.table-name`. If a schema is not provided, then the default "public" schema is used.

  For more information, see [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md).
+ `format` – A format specification. This is used for an Amazon S3 or an AWS Glue connection that supports multiple formats. See [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md) for the formats that are supported.
+ `format_options` – Format options for the specified format. See [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md) for the formats that are supported.
+ `transformation_ctx` – The transformation context to use (optional).
+ `push_down_predicate` – Filters partitions without having to list and read all the files in your dataset. For supported sources and limitations, see [Optimizing reads with pushdown in AWS Glue ETL](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-pushdown.html). For more information, see [Pre-filtering using pushdown predicates](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html#aws-glue-programming-etl-partitions-pushdowns).

## create\_sample\_dynamic\_frame\_from\_catalog

**`create_sample_dynamic_frame_from_catalog(database, table_name, num, redshift_tmp_dir, transformation_ctx = "", push_down_predicate= "", additional_options = {}, sample_options = {}, catalog_id = None)`**

Returns a sample `DynamicFrame` that is created using a Data Catalog database and table name. The `DynamicFrame` contains only the first `num` records from a data source.

+ `database` – The database to read from.
+ `table_name` – The name of the table to read from.
+ `num` – The maximum number of records in the returned sample dynamic frame.
+ `redshift_tmp_dir` – An Amazon Redshift temporary directory to use (optional).
+ `transformation_ctx` – The transformation context to use (optional).
+ `push_down_predicate` – Filters partitions without having to list and read all the files in your dataset. For more information, see [Pre-filtering using pushdown predicates](aws-glue-programming-etl-partitions.md#aws-glue-programming-etl-partitions-pushdowns).
+ `additional_options` – A collection of optional name-value pairs. The possible options include those listed in [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md), except for `endpointUrl`, `streamName`, `bootstrap.servers`, `security.protocol`, `topicName`, `classification`, and `delimiter`.
+ `sample_options` – Parameters to control sampling behavior (optional). Currently available parameters for Amazon S3 sources:
  + `maxSamplePartitions` – The maximum number of partitions the sampling will read. The default value is 10.
  + `maxSampleFilesPerPartition` – The maximum number of files the sampling will read in one partition. The default value is 10.

  These parameters help to reduce the time consumed by file listing. For example, suppose the dataset has 1000 partitions, and each partition has 10 files. If you set `maxSamplePartitions` = 10 and `maxSampleFilesPerPartition` = 10, then instead of listing all 10,000 files, the sampling lists and reads only the first 10 partitions with the first 10 files in each (10\*10 = 100 files in total).
+ `catalog_id` – The catalog ID of the Data Catalog being accessed (the account ID of the Data Catalog). Set to `None` by default. `None` defaults to the catalog ID of the calling account in the service.

## create\_sample\_dynamic\_frame\_from\_options

**`create_sample_dynamic_frame_from_options(connection_type, connection_options={}, num, sample_options={}, format=None, format_options={}, transformation_ctx = "")`**

Returns a sample `DynamicFrame` created with the specified connection and format. The `DynamicFrame` contains only the first `num` records from a data source.

+ `connection_type` – The connection type, such as Amazon S3, Amazon Redshift, and JDBC. Valid values include `s3`, `mysql`, `postgresql`, `redshift`, `sqlserver`, `oracle`, and `dynamodb`.
+ `connection_options` – Connection options, such as paths and database table (optional). For more information, see [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md).
+ `num` – The maximum number of records in the returned sample dynamic frame.
+ `sample_options` – Parameters to control sampling behavior (optional). Currently available parameters for Amazon S3 sources:
  + `maxSamplePartitions` – The maximum number of partitions the sampling will read. The default value is 10.
  + `maxSampleFilesPerPartition` – The maximum number of files the sampling will read in one partition. The default value is 10.

  These parameters help to reduce the time consumed by file listing. For example, suppose the dataset has 1000 partitions, and each partition has 10 files. If you set `maxSamplePartitions` = 10 and `maxSampleFilesPerPartition` = 10, then instead of listing all 10,000 files, the sampling lists and reads only the first 10 partitions with the first 10 files in each (10\*10 = 100 files in total).
+ `format` – A format specification. This is used for an Amazon S3 or an AWS Glue connection that supports multiple formats. See [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md) for the formats that are supported.
+ `format_options` – Format options for the specified format. See [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md) for the formats that are supported.
+ `transformation_ctx` – The transformation context to use (optional).
+ `push_down_predicate` – Filters partitions without having to list and read all the files in your dataset. For more information, see [Pre-filtering using pushdown predicates](aws-glue-programming-etl-partitions.md#aws-glue-programming-etl-partitions-pushdowns).

## add\_ingestion\_time\_columns

**`add_ingestion_time_columns(dataFrame, timeGranularity = "")`**

Appends ingestion time columns such as `ingest_year`, `ingest_month`, `ingest_day`, `ingest_hour`, and `ingest_minute` to the input `DataFrame`. AWS Glue automatically adds this function to the script it generates when you specify a Data Catalog table with Amazon S3 as the target. This function automatically updates the partition with ingestion time columns on the output table. This allows the output data to be automatically partitioned on ingestion time without requiring explicit ingestion time columns in the input data.

+ `dataFrame` – The `dataFrame` to append the ingestion time columns to.
+ `timeGranularity` – The granularity of the time columns. Valid values are "`day`", "`hour`", and "`minute`". For example, if "`hour`" is passed in to the function, the original `dataFrame` has the "`ingest_year`", "`ingest_month`", "`ingest_day`", and "`ingest_hour`" time columns appended.

Returns the data frame after appending the time granularity columns.

Example:

```
# fromDF also takes the GlueContext and a name for the resulting DynamicFrame
dynamic_frame = DynamicFrame.fromDF(glueContext.add_ingestion_time_columns(dataFrame, "hour"), glueContext, "dynamic_frame")
```

## create\_data\_frame\_from\_catalog

**`create_data_frame_from_catalog(database, table_name, transformation_ctx = "", additional_options = {})`**

Returns a `DataFrame` that is created using information from a Data Catalog table.

+ `database` – The Data Catalog database to read from.
+ `table_name` – The name of the Data Catalog table to read from.
+ `transformation_ctx` – The transformation context to use (optional).
+ `additional_options` – A collection of optional name-value pairs. The possible options include those listed in [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md) for streaming sources, such as `startingPosition`, `maxFetchTimeInMs`, and `startingOffsets`.
  + `useSparkDataSource` – When set to true, forces AWS Glue to use the native Spark Data Source API to read the table. The Spark Data Source API supports the following formats: AVRO, binary, CSV, JSON, ORC, Parquet, and text. In a Data Catalog table, you specify the format using the `classification` property. To learn more about the Spark Data Source API, see the official [Apache Spark documentation](https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html).

    Using `create_data_frame_from_catalog` with `useSparkDataSource` has the following benefits:
    + Directly returns a `DataFrame` and provides an alternative to `create_dynamic_frame.from_catalog().toDF()`.
    + Supports AWS Lake Formation table-level permission control for native formats.
    + Supports reading data lake formats without AWS Lake Formation table-level permission control. For more information, see [Using data lake frameworks with AWS Glue ETL jobs](aws-glue-programming-etl-datalake-native-frameworks.md).

    When you enable `useSparkDataSource`, you can also add any of the [Spark Data Source options](https://spark.apache.org/docs/latest/sql-data-sources.html) in `additional_options` as needed. AWS Glue passes these options directly to the Spark reader.
  + `useCatalogSchema` – When set to true, AWS Glue applies the Data Catalog schema to the resulting `DataFrame`. Otherwise, the reader infers the schema from the data. When you enable `useCatalogSchema`, you must also set `useSparkDataSource` to true.

**Limitations**

Consider the following limitations when you use the `useSparkDataSource` option:

+ When you use `useSparkDataSource`, AWS Glue creates a new `DataFrame` in a separate Spark session that is different from the original Spark session.
+ Spark DataFrame partition filtering doesn't work with the following AWS Glue features.
  + [Job bookmarks](monitor-continuations.md)
  + [Excluding Amazon S3 storage classes](aws-glue-programming-etl-storage-classes.md#aws-glue-programming-etl-storage-classes-dynamic-frame)
  + [Catalog partition predicates](aws-glue-programming-etl-partitions.md#aws-glue-programming-etl-partitions-cat-predicates)

  To use partition filtering with these features, you can use the AWS Glue pushdown predicate. For more information, see [Pre-filtering using pushdown predicates](aws-glue-programming-etl-partitions.md#aws-glue-programming-etl-partitions-pushdowns). Filtering on non-partitioned columns is not affected.

  The following example script demonstrates an incorrect way to perform partition filtering with the `excludeStorageClasses` option.

```
# Incorrect partition filtering using Spark filter with
excludeStorageClasses read_df = glueContext.create_data_frame.from_catalog( database=database_name, table_name=table_name, additional_options = { "useSparkDataSource": True, "excludeStorageClasses" : ["GLACIER", "DEEP_ARCHIVE"] } ) // Suppose year and month are partition keys. // Filtering on year and month won't work, the filtered_df will still // contain data with other year/month values. filtered_df = read_df.filter("year == '2017 and month == '04' and 'state == 'CA'") ``` 下列範例指令碼示範使用 `excludeStorageClasses` 選項，利用下推述詞來執行分割區篩選的正確方法。 ``` // Correct partition filtering using the AWS Glue pushdown predicate // with excludeStorageClasses read_df = glueContext.create_data_frame.from_catalog( database=database_name, table_name=table_name, // Use AWS Glue pushdown predicate to perform partition filtering push_down_predicate = "(year=='2017' and month=='04')" additional_options = { "useSparkDataSource": True, "excludeStorageClasses" : ["GLACIER", "DEEP_ARCHIVE"] } ) // Use Spark filter only on non-partitioned columns filtered_df = read_df.filter("state == 'CA'") ``` **範例：使用 Spark Data Source 讀取器來建立 CSV 資料表** ``` // Read a CSV table with '\t' as separator read_df = glueContext.create_data_frame.from_catalog( database=, table_name=, additional_options = {"useSparkDataSource": True, "sep": '\t'} ) ``` ## create\$1data\$1frame\$1from\$1options **`create_data_frame_from_options(connection_type, connection_options={}, format=None, format_options={}, transformation_ctx = "")`** 此 API 現已棄用。請改用 `getSource()` API。傳回使用指定的連線和格式建立的 `DataFrame`。這個函數只能用於 AWS Glue 串流來源。 + `connection_type` - 串流連線類型。有效值包括 `kinesis` 與 `kafka`。 + `connection_options`— 連線選項，這些選項對於 Kinesis 和 Kafka 而言是不同的。您可以在 [AWS Glue for Spark 中 ETL 的連線類型和選項](aws-glue-programming-etl-connect.md) 中找到每個串流資料來源的所有連線選項清單。請注意串流連線選項的下列不同處： + Kinesis 串流來源需要 `streamARN`、`startingPosition`、`inferSchema` 以及 `classification`。 + Kafka 串流來源需要 `connectionName`、`topicName`、`startingOffsets`、`inferSchema` 以及 `classification`。 + `format` – 
格式規格。這是用於 Amazon S3 或支援多種格式的 AWS Glue 連線。如需有關支援格式的資訊，請參閱 [AWS Glue for Spark 中的輸入與輸出的資料格式選項](aws-glue-programming-etl-format.md)。 + `format_options` – 指定格式的格式選項。如需支援格式選項的詳細資訊，請參閱 [AWS Glue for Spark 中的輸入與輸出的資料格式選項](aws-glue-programming-etl-format.md)。 + `transformation_ctx` – 欲使用的轉換細節 (選用)。 Amazon Kinesis 串流來源範例： ``` kinesis_options = { "streamARN": "arn:aws:kinesis:us-east-2:777788889999:stream/fromOptionsStream", "startingPosition": "TRIM_HORIZON", "inferSchema": "true", "classification": "json" } data_frame_datasource0 = glueContext.create_data_frame.from_options(connection_type="kinesis", connection_options=kinesis_options) ``` Kafka 串流來源範例： ``` kafka_options = { "connectionName": "ConfluentKafka", "topicName": "kafka-auth-topic", "startingOffsets": "earliest", "inferSchema": "true", "classification": "json" } data_frame_datasource0 = glueContext.create_data_frame.from_options(connection_type="kafka", connection_options=kafka_options) ``` ## forEachBatch **`forEachBatch(frame, batch_function, options)`** 將傳入的 `batch_function` 套用至從串流來源讀取的每個微批次。 + `frame` – 包含目前微批次的 DataFrame。 + `batch_function` – 將套用至每個微批次的函數。 + `options` – 索引鍵/值配對的集合，其中包含如何處理微批次的相關資訊。下列選項是必要的： + `windowSize` – 處理每個批次的時間量。 + `checkpointLocation` - 串流 ETL 任務的檢查點儲存位置。 + `batchMaxRetries` – 如果失敗，可重試批次的次數上限。預設值為 3。此選項僅在 Glue 2.0 及以上版本上才可設定。 **範例**： ``` glueContext.forEachBatch( frame = data_frame_datasource0, batch_function = processBatch, options = { "windowSize": "100 seconds", "checkpointLocation": "s3://kafka-auth-dataplane/confluent-test/output/checkpoint/" } ) def processBatch(data_frame, batchId): if (data_frame.count() > 0): datasource0 = DynamicFrame.fromDF( glueContext.add_ingestion_time_columns(data_frame, "hour"), glueContext, "from_data_frame" ) additionalOptions_datasink1 = {"enableUpdateCatalog": True} additionalOptions_datasink1["partitionKeys"] = ["ingest_yr", "ingest_mo", "ingest_day"] datasink1 = glueContext.write_dynamic_frame.from_catalog( frame = datasource0, database = 
"tempdb", table_name = "kafka-auth-table-output", transformation_ctx = "datasink1", additional_options = additionalOptions_datasink1 ) ``` ## 在 Amazon S3 中使用資料集 + [purge\$1table](#aws-glue-api-crawler-pyspark-extensions-glue-context-purge_table) + [purge\$1s3\$1path](#aws-glue-api-crawler-pyspark-extensions-glue-context-purge_s3_path) + [transition\$1table](#aws-glue-api-crawler-pyspark-extensions-glue-context-transition_table) + [transition\$1s3\$1path](#aws-glue-api-crawler-pyspark-extensions-glue-context-transition_s3_path) ## purge\$1table **`purge_table(catalog_id=None, database="", table_name="", options={}, transformation_ctx="")`** 從 Amazon S3 中刪除指定目錄資料庫和資料表的檔案。如果刪除分割區中的所有檔案，該分割區也會從目錄中刪除。對於向 Lake Formation 註冊的資料表，我們不支援 purge\$1table 動作。如果您希望能夠復原已刪除的物件，您可以在 Amazon S3 儲存貯體上開啟[物件版本控制](https://docs.aws.amazon.com/AmazonS3/latest/dev/ObjectVersioning.html)。從未啟用物件版本控制的儲存貯體中刪除物件時，無法復原物件。如需如何復原已啟用版本控制之儲存貯體中已刪除物件的詳細資訊，請參閱 AWS 支援知識中心的[如何擷取已刪除的 Amazon S3 物件？](https://aws.amazon.com/premiumsupport/knowledge-center/s3-undelete-configuration/)。 + `catalog_id` – 要存取之 Data Catalog 的目錄 ID ( Data Catalog 的帳戶 ID)。依預設設定為 `None`。`None` 預設為服務中呼叫帳戶的目錄 ID。 + `database` – 所要使用的資料庫。 + `table_name` - 要使用的資料表名稱。 + `options` - 篩選要刪除之檔案和用於產生資訊清單檔案的選項。 + `retentionPeriod` - 指定保留檔案的期間 (以小時為單位)。比保留期間新的檔案都會予以保留。依預設設定為 168 小時 (7 天)。 + `partitionPredicate` - 滿足此述詞的分割區會被刪除。這些分割區中仍在保留期間內的檔案不會被刪除。設定為 `""` – 預設為空值。 + `excludeStorageClasses` - 不會刪除 `excludeStorageClasses` 集合中具有儲存體方案的檔案。預設為 `Set()` – 空集合。 + `manifestFilePath` - 產生資訊清單檔案的選用路徑。所有已成功清除的檔案都會記錄在 `Success.csv` 中，失敗的則記錄在 `Failed.csv` 中 + `transformation_ctx` – 欲使用的轉換細節 (選用)。用於資訊清單檔案的路徑。 **Example** ``` glueContext.purge_table("database", "table", {"partitionPredicate": "(month=='march')", "retentionPeriod": 1, "excludeStorageClasses": ["STANDARD_IA"], "manifestFilePath": "s3://bucketmanifest/"}) ``` ## purge\$1s3\$1path **`purge_s3_path(s3_path, options={}, transformation_ctx="")`** 以遞迴方式刪除指定 Amazon S3 路徑中的檔案。如果您希望能夠復原已刪除的物件，您可以在 Amazon 
S3 儲存貯體上開啟[物件版本控制](https://docs.aws.amazon.com/AmazonS3/latest/dev/ObjectVersioning.html)。從未開啟物件版本控制的儲存貯體中刪除物件時，無法復原物件。如需如何使用版本控制復原儲存貯體中已刪除物件的詳細資訊，請參閱支援知識中心中的[如何擷取已刪除的 Amazon S3 物件？](https://aws.amazon.com/premiumsupport/knowledge-center/s3-undelete-configuration/)。 + `s3_path` - 要刪除之檔案的 Amazon S3 路徑，格式為 `s3:////` + `options` - 篩選要刪除之檔案和用於產生資訊清單檔案的選項。 + `retentionPeriod` - 指定保留檔案的期間 (以小時為單位)。比保留期間新的檔案都會予以保留。依預設設定為 168 小時 (7 天)。 + `excludeStorageClasses` - 不會刪除 `excludeStorageClasses` 集合中具有儲存體方案的檔案。預設為 `Set()` – 空集合。 + `manifestFilePath` - 產生資訊清單檔案的選用路徑。所有已成功清除的檔案都會記錄在 `Success.csv` 中，失敗的則記錄在 `Failed.csv` 中 + `transformation_ctx` – 欲使用的轉換細節 (選用)。用於資訊清單檔案的路徑。 **Example** ``` glueContext.purge_s3_path("s3://bucket/path/", {"retentionPeriod": 1, "excludeStorageClasses": ["STANDARD_IA"], "manifestFilePath": "s3://bucketmanifest/"}) ``` ## transition\$1table **`transition_table(database, table_name, transition_to, options={}, transformation_ctx="", catalog_id=None)`** 針對指定之目錄的資料庫和資料表，轉換儲存在 Amazon S3 上之檔案的儲存體方案。您可以在任意兩個儲存體方案之間轉換。對於 `GLACIER` 和 `DEEP_ARCHIVE` 儲存體方案，您可以轉換到這些方案。但是，您可以使用 `S3 RESTORE` 從 `GLACIER` 和 `DEEP_ARCHIVE` 儲存體方案轉換。如果您執行的 AWS Glue ETL 任務會從 Amazon S3 讀取檔案或分割區，則您可排除部分 Amazon S3 儲存類別類型。如需詳細資訊，請參閱[排除 Amazon S3 儲存體方案](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-storage-classes.html)。 + `database` – 所要使用的資料庫。 + `table_name` - 要使用的資料表名稱。 + `transition_to` – 要轉移的 [Amazon S3 儲存方案](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/model/StorageClass.html)。 + `options` - 篩選要刪除之檔案和用於產生資訊清單檔案的選項。 + `retentionPeriod` - 指定保留檔案的期間 (以小時為單位)。比保留期間新的檔案都會予以保留。依預設設定為 168 小時 (7 天)。 + `partitionPredicate` - 滿足此述詞的分割區會被轉換。這些分割區中仍在保留期間內的檔案不會被轉換。設定為 `""` – 預設為空值。 + `excludeStorageClasses` - 不會轉換 `excludeStorageClasses` 集合中具有儲存體方案的檔案。預設為 `Set()` – 空集合。 + `manifestFilePath` - 產生資訊清單檔案的選用路徑。所有已成功轉換的檔案都會記錄在 `Success.csv` 中，失敗的則記錄在 `Failed.csv` 中 + `accountId` – 要執行轉移轉換的 Amazon Web Services 帳戶 ID。對於這種轉換是強制性的。 + `roleArn` – AWS 
執行轉換的角色。對於這種轉換是強制性的。 + `transformation_ctx` – 欲使用的轉換細節 (選用)。用於資訊清單檔案的路徑。 + `catalog_id` – 要存取之 Data Catalog 的目錄 ID ( Data Catalog 的帳戶 ID)。依預設設定為 `None`。`None` 預設為服務中呼叫帳戶的目錄 ID。 **Example** ``` glueContext.transition_table("database", "table", "STANDARD_IA", {"retentionPeriod": 1, "excludeStorageClasses": ["STANDARD_IA"], "manifestFilePath": "s3://bucketmanifest/", "accountId": "12345678901", "roleArn": "arn:aws:iam::123456789012:user/example-username"}) ``` ## transition\$1s3\$1path **`transition_s3_path(s3_path, transition_to, options={}, transformation_ctx="")`** 以遞迴方式轉換指定 Amazon S3 路徑中檔案的儲存體方案。您可以在任意兩個儲存體方案之間轉換。對於 `GLACIER` 和 `DEEP_ARCHIVE` 儲存體方案，您可以轉換到這些方案。但是，您可以使用 `S3 RESTORE` 從 `GLACIER` 和 `DEEP_ARCHIVE` 儲存體方案轉換。如果您執行的 AWS Glue ETL 任務會從 Amazon S3 讀取檔案或分割區，則您可排除部分 Amazon S3 儲存類別類型。如需詳細資訊，請參閱[排除 Amazon S3 儲存體方案](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-storage-classes.html)。 + `s3_path` - 要以格式 `s3:////` 轉換之檔案的 Amazon S3 路徑。 + `transition_to` – 要轉移的 [Amazon S3 儲存方案](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/model/StorageClass.html)。 + `options` - 篩選要刪除之檔案和用於產生資訊清單檔案的選項。 + `retentionPeriod` - 指定保留檔案的期間 (以小時為單位)。比保留期間新的檔案都會予以保留。依預設設定為 168 小時 (7 天)。 + `partitionPredicate` - 滿足此述詞的分割區會被轉換。這些分割區中仍在保留期間內的檔案不會被轉換。設定為 `""` – 預設為空值。 + `excludeStorageClasses` - 不會轉換 `excludeStorageClasses` 集合中具有儲存體方案的檔案。預設為 `Set()` – 空集合。 + `manifestFilePath` - 產生資訊清單檔案的選用路徑。所有已成功轉換的檔案都會記錄在 `Success.csv` 中，失敗的則記錄在 `Failed.csv` 中 + `accountId` – 要執行轉移轉換的 Amazon Web Services 帳戶 ID。對於這種轉換是強制性的。 + `roleArn` – AWS 執行轉換的角色。對於這種轉換是強制性的。 + `transformation_ctx` – 欲使用的轉換細節 (選用)。用於資訊清單檔案的路徑。 **Example** ``` glueContext.transition_s3_path("s3://bucket/prefix/", "STANDARD_IA", {"retentionPeriod": 1, "excludeStorageClasses": ["STANDARD_IA"], "manifestFilePath": "s3://bucketmanifest/", "accountId": "12345678901", "roleArn": "arn:aws:iam::123456789012:user/example-username"}) ``` ## 擷取 + 
  [extract\_jdbc\_conf](#aws-glue-api-crawler-pyspark-extensions-glue-context-extract_jdbc_conf)

## extract\_jdbc\_conf

**`extract_jdbc_conf(connection_name, catalog_id = None)`**

Returns a `dict` whose keys hold the configuration properties from the AWS Glue connection object in the Data Catalog.

+ `user` – The database user name.
+ `password` – The database password.
+ `vendor` – Specifies a vendor (`mysql`, `postgresql`, `oracle`, `sqlserver`, and so on).
+ `enforceSSL` – A Boolean string indicating whether a secure connection is required.
+ `customJDBCCert` – Use a specific client certificate from the indicated Amazon S3 path.
+ `skipCustomJDBCCertValidation` – A Boolean string indicating whether the `customJDBCCert` must be validated by a CA.
+ `customJDBCCertString` – Additional information about the custom certificate, specific to the driver type.
+ `url` – (Deprecated) The JDBC URL with only the protocol, server, and port.
+ `fullUrl` – The JDBC URL as entered when the connection was created (available in AWS Glue version 3.0 or later).

Example of retrieving JDBC configurations:

```
jdbc_conf = glueContext.extract_jdbc_conf(connection_name="your_glue_connection_name")
print(jdbc_conf)
>>> {'enforceSSL': 'false', 'skipCustomJDBCCertValidation': 'false', 'url': 'jdbc:mysql://myserver:3306', 'fullUrl': 'jdbc:mysql://myserver:3306/mydb', 'customJDBCCertString': '', 'user': 'admin', 'customJDBCCert': '', 'password': '1234', 'vendor': 'mysql'}
```

## Transactions

+ [start\_transaction](#aws-glue-api-pyspark-extensions-glue-context-start-transaction)
+ [commit\_transaction](#aws-glue-api-pyspark-extensions-glue-context-commit-transaction)
+ [cancel\_transaction](#aws-glue-api-pyspark-extensions-glue-cancel-transaction)

## start\_transaction

**`start_transaction(read_only)`**

Starts a new transaction. Internally calls the Lake Formation [startTransaction](https://docs.aws.amazon.com/lake-formation/latest/dg/aws-lake-formation-api-aws-lake-formation-api-transactions.html#aws-lake-formation-api-aws-lake-formation-api-transactions-StartTransaction) API.

+ `read_only` – (Boolean) Indicates whether this transaction should be read only, or read and write. Writes made using a read-only transaction ID are rejected. Read-only transactions do not need to be committed.

Returns the transaction ID.

## commit\_transaction

**`commit_transaction(transaction_id, wait_for_commit = True)`**

Attempts to commit the specified transaction. `commit_transaction` may return before the transaction has finished committing. Internally calls the Lake Formation
[commitTransaction](https://docs.aws.amazon.com/lake-formation/latest/dg/aws-lake-formation-api-aws-lake-formation-api-transactions.html#aws-lake-formation-api-aws-lake-formation-api-transactions-CommitTransaction) API.

+ `transaction_id` – (String) The transaction to commit.
+ `wait_for_commit` – (Boolean) Determines whether `commit_transaction` returns immediately. The default value is true. If false, `commit_transaction` polls and waits until the transaction is committed. The wait time is limited to 1 minute, using exponential backoff with a maximum of 6 retry attempts.

Returns a Boolean that indicates whether the commit completed.

## cancel\_transaction

**`cancel_transaction(transaction_id)`**

Attempts to cancel the specified transaction. Returns a `TransactionCommittedException` exception if the transaction was previously committed. Internally calls the Lake Formation [CancelTransaction](https://docs.aws.amazon.com/lake-formation/latest/dg/aws-lake-formation-api-aws-lake-formation-api-transactions.html#aws-lake-formation-api-aws-lake-formation-api-transactions-CancelTransaction) API.

+ `transaction_id` – (String) The transaction to cancel.

## Writing

+ [getSink](#aws-glue-api-crawler-pyspark-extensions-glue-context-get-sink)
+ [write\_dynamic\_frame\_from\_options](#aws-glue-api-crawler-pyspark-extensions-glue-context-write_dynamic_frame_from_options)
+ [write\_from\_options](#aws-glue-api-crawler-pyspark-extensions-glue-context-write_from_options)
+ [write\_dynamic\_frame\_from\_catalog](#aws-glue-api-crawler-pyspark-extensions-glue-context-write_dynamic_frame_from_catalog)
+ [write\_data\_frame\_from\_catalog](#aws-glue-api-crawler-pyspark-extensions-glue-context-write_data_frame_from_catalog)
+ [write\_dynamic\_frame\_from\_jdbc\_conf](#aws-glue-api-crawler-pyspark-extensions-glue-context-write_dynamic_frame_from_jdbc_conf)
+ [write\_from\_jdbc\_conf](#aws-glue-api-crawler-pyspark-extensions-glue-context-write_from_jdbc_conf)

## getSink

**`getSink(connection_type, format = None, transformation_ctx = "", **options)`**

Gets a `DataSink` object that can be used to write `DynamicFrames` to external sources. Check the SparkSQL `format` first to be sure that you get the expected sink.

+ `connection_type` – The connection type to use, such as Amazon S3, Amazon Redshift, and JDBC. Valid values include `s3`, `mysql`, `postgresql`, `redshift`, `sqlserver`, `oracle`, `kinesis`, and `kafka`.
+ `format` – The SparkSQL format to use
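(optional).
+ `transformation_ctx` – The transformation context to use (optional).
+ `options` – A collection of name-value pairs used to specify the connection options. Some of the possible values are:
    + `user` and `password`: For authorization.
    + `url`: The endpoint for the data store.
    + `dbtable`: The name of the target table.
    + `bulkSize`: The degree of parallelism for insert operations.

The options that you can specify depend on the connection type. For additional values and examples, see [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md).

Example:

```
>>> data_sink = context.getSink("s3")
>>> data_sink.setFormat("json")
>>> data_sink.writeFrame(myFrame)
```

## write\_dynamic\_frame\_from\_options

**`write_dynamic_frame_from_options(frame, connection_type, connection_options={}, format=None, format_options={}, transformation_ctx = "")`**

Writes and returns a `DynamicFrame` using the specified connection and format.

+ `frame` – The `DynamicFrame` to write.
+ `connection_type` – The connection type, such as Amazon S3, Amazon Redshift, and JDBC. Valid values include `s3`, `mysql`, `postgresql`, `redshift`, `sqlserver`, `oracle`, `kinesis`, and `kafka`.
+ `connection_options` – Connection options, such as path and database table (optional). For a `connection_type` of `s3`, an Amazon S3 path is defined.

  ```
  connection_options = {"path": "s3://aws-glue-target/temp"}
  ```

  For JDBC connections, several properties must be defined. Note that the database name must be part of the URL. It can optionally be included in the connection options.

  **Warning**

  Storing passwords in your script is not recommended. Consider using `boto3` to retrieve them from AWS Secrets Manager or the AWS Glue Data Catalog.

  ```
  connection_options = {"url": "jdbc-url/database", "user": "username", "password": passwordVariable,"dbtable": "table-name", "redshiftTmpDir": "s3-tempdir-path"}
  ```

  The `dbtable` property is the name of the JDBC table. For JDBC data stores that support schemas within a database, specify `schema.table-name`. If a schema is not provided, the default "public" schema is used.

  For more information, see [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md).
+ `format` – A format specification. This is used for an Amazon S3 or an AWS Glue connection that supports multiple formats. For the supported formats, see [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md).
+ `format_options` – Format options for the specified format. For the supported formats, see [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md).
+ `transformation_ctx` – The transformation context to use (optional).

## write\_from\_options

**`write_from_options(frame_or_dfc, connection_type, connection_options={}, format={}, format_options={}, transformation_ctx = "")`**

Writes and returns a `DynamicFrame` or `DynamicFrameCollection` that is created with the specified connection and format information.

+ `frame_or_dfc`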
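– The `DynamicFrame` or `DynamicFrameCollection` to write.
+ `connection_type` – The connection type, such as Amazon S3, Amazon Redshift, and JDBC. Valid values include `s3`, `mysql`, `postgresql`, `redshift`, `sqlserver`, and `oracle`.
+ `connection_options` – Connection options, such as path and database table (optional). For a `connection_type` of `s3`, an Amazon S3 path is defined.

  ```
  connection_options = {"path": "s3://aws-glue-target/temp"}
  ```

  For JDBC connections, several properties must be defined. Note that the database name must be part of the URL. It can optionally be included in the connection options.

  **Warning**

  Storing passwords in your script is not recommended. Consider using `boto3` to retrieve them from AWS Secrets Manager or the AWS Glue Data Catalog.

  ```
  connection_options = {"url": "jdbc-url/database", "user": "username", "password": passwordVariable,"dbtable": "table-name", "redshiftTmpDir": "s3-tempdir-path"}
  ```

  The `dbtable` property is the name of the JDBC table. For JDBC data stores that support schemas within a database, specify `schema.table-name`. If a schema is not provided, the default "public" schema is used.

  For more information, see [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md).
+ `format` – A format specification. This is used for an Amazon S3 or an AWS Glue connection that supports multiple formats. For the supported formats, see [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md).
+ `format_options` – Format options for the specified format. For the supported formats, see [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md).
+ `transformation_ctx` – The transformation context to use (optional).

## write\_dynamic\_frame\_from\_catalog

**`write_dynamic_frame_from_catalog(frame, database, table_name, redshift_tmp_dir, transformation_ctx = "", additional_options = {}, catalog_id = None)`**

Writes and returns a `DynamicFrame` using information from a Data Catalog database and table.

+ `frame` – The `DynamicFrame` to write.
+ `database` – The Data Catalog database that contains the table.
+ `table_name` – The name of the Data Catalog table associated with the target.
+ `redshift_tmp_dir` – The Amazon Redshift temporary directory to use (optional).
+ `transformation_ctx` – The transformation context to use (optional).
+ `additional_options` – A collection of optional name-value pairs.
+ `catalog_id` – The catalog ID (account ID) of the Data Catalog being accessed. When `None`, the default account ID of the caller is used.

## write\_data\_frame\_from\_catalog

**`write_data_frame_from_catalog(frame, database, table_name, redshift_tmp_dir, transformation_ctx = "", additional_options = {}, catalog_id = None)`**

Writes and returns a `DataFrame` using information from a Data Catalog database and table. This method supports writing to data lake formats (Hudi, Iceberg, and Delta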
Lake). For more information, see [Using data lake frameworks with AWS Glue ETL jobs](aws-glue-programming-etl-datalake-native-frameworks.md).

+ `frame` – The `DataFrame` to write.
+ `database` – The Data Catalog database that contains the table.
+ `table_name` – The name of the Data Catalog table associated with the target.
+ `redshift_tmp_dir` – The Amazon Redshift temporary directory to use (optional).
+ `transformation_ctx` – The transformation context to use (optional).
+ `additional_options` – A collection of optional name-value pairs.
    + `useSparkDataSink` – When set to true, forces AWS Glue to use the native Spark Data Sink API to write to the table. When you enable this option, you can add any [Spark Data Source options](https://spark.apache.org/docs/latest/sql-data-sources.html) to `additional_options` as needed. AWS Glue passes these options directly to the Spark writer.
+ `catalog_id` – The catalog ID (account ID) of the Data Catalog being accessed. If you don't specify a value, the default account ID of the caller is used.

**Limitations**

Consider the following limitations when you use the `useSparkDataSink` option:

+ The [`enableUpdateCatalog`](update-from-job.md) option isn't supported when you use the `useSparkDataSink` option.

**Example: Writing to a Hudi table using the Spark Data Source writer**

```
hudi_options = {
    'useSparkDataSink': True,
    'hoodie.table.name': <table_name>,
    'hoodie.datasource.write.storage.type': 'COPY_ON_WRITE',
    'hoodie.datasource.write.recordkey.field': 'product_id',
    'hoodie.datasource.write.table.name': <table_name>,
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'updated_at',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.upsert.shuffle.parallelism': 2,
    'hoodie.insert.shuffle.parallelism': 2,
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.database': <database_name>,
    'hoodie.datasource.hive_sync.table': <table_name>,
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.hive_sync.mode': 'hms'}

glueContext.write_data_frame.from_catalog(
    frame = <data_frame>,
    database = <database_name>,
    table_name = <table_name>,
    additional_options = hudi_options
)
```

## write\_dynamic\_frame\_from\_jdbc\_conf

**`write_dynamic_frame_from_jdbc_conf(frame, catalog_connection, connection_options={}, redshift_tmp_dir = "", transformation_ctx = "", catalog_id = None)`**

Writes and returns a `DynamicFrame` using the specified JDBC connection information.

+ `frame` – The `DynamicFrame` to write.
+ `catalog_connection` – The catalog connection to use.
+ `connection_options` – Connection options, such as path and database table (optional). For more information, see [AWS
Glue for Spark connection types and options for ETL](aws-glue-programming-etl-connect.md).
+ `redshift_tmp_dir` – The Amazon Redshift temporary directory to use (optional).
+ `transformation_ctx` – The transformation context to use (optional).
+ `catalog_id` – The catalog ID (account ID) of the Data Catalog being accessed. When `None`, the default account ID of the caller is used.

## write\_from\_jdbc\_conf

**`write_from_jdbc_conf(frame_or_dfc, catalog_connection, connection_options={}, redshift_tmp_dir = "", transformation_ctx = "", catalog_id = None)`**

Writes and returns a `DynamicFrame` or `DynamicFrameCollection` using the specified JDBC connection information.

+ `frame_or_dfc` – The `DynamicFrame` or `DynamicFrameCollection` to write.
+ `catalog_connection` – The catalog connection to use.
+ `connection_options` – Connection options, such as path and database table (optional). For more information, see [Connection types and options for ETL in AWS Glue for Spark](aws-glue-programming-etl-connect.md).
+ `redshift_tmp_dir` – The Amazon Redshift temporary directory to use (optional).
+ `transformation_ctx` – The transformation context to use (optional).
+ `catalog_id` – The catalog ID (account ID) of the Data Catalog being accessed. When `None`, the default account ID of the caller is used.

# AWS Glue PySpark transforms reference

AWS Glue provides the following built-in transforms that you can use in PySpark ETL operations. Your data passes from transform to transform in a data structure called a *DynamicFrame*, which is an extension of an Apache Spark SQL `DataFrame`. The `DynamicFrame` contains your data, and you reference its schema to process your data.

Most of these transforms also exist as methods of the `DynamicFrame` class. For more information, see [DynamicFrame transforms](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-_transforms).

+ [GlueTransform base class](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md)
+ [ApplyMapping class](aws-glue-api-crawler-pyspark-transforms-ApplyMapping.md)
+ [DropFields class](aws-glue-api-crawler-pyspark-transforms-DropFields.md)
+ [DropNullFields class](aws-glue-api-crawler-pyspark-transforms-DropNullFields.md)
+ [ErrorsAsDynamicFrame class](aws-glue-api-crawler-pyspark-transforms-ErrorsAsDynamicFrame.md)
+ [EvaluateDataQuality class](aws-glue-api-crawler-pyspark-transforms-EvaluateDataQuality.md)
+ [FillMissingValues class](aws-glue-api-crawler-pyspark-transforms-fillmissingvalues.md)
+ [Filter class](aws-glue-api-crawler-pyspark-transforms-filter.md)
+ [FindIncrementalMatches class](aws-glue-api-crawler-pyspark-transforms-findincrementalmatches.md)
+ [FindMatches class](aws-glue-api-crawler-pyspark-transforms-findmatches.md)
+ [FlatMap class](aws-glue-api-crawler-pyspark-transforms-flat-map.md)
+ [Join class](aws-glue-api-crawler-pyspark-transforms-join.md)
+ [Map class](aws-glue-api-crawler-pyspark-transforms-map.md)
+ [MapToCollection class](aws-glue-api-crawler-pyspark-transforms-MapToCollection.md)
+ [mergeDynamicFrame](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-merge)
+ [Relationalize class](aws-glue-api-crawler-pyspark-transforms-Relationalize.md)
+ [RenameField class](aws-glue-api-crawler-pyspark-transforms-RenameField.md)
+ [ResolveChoice class](aws-glue-api-crawler-pyspark-transforms-ResolveChoice.md)
+ [SelectFields class](aws-glue-api-crawler-pyspark-transforms-SelectFields.md)
+ [SelectFromCollection class](aws-glue-api-crawler-pyspark-transforms-SelectFromCollection.md)
+ [Simplify\_ddb\_json class](aws-glue-api-crawler-pyspark-transforms-simplify-ddb-json.md)
+ [Spigot class](aws-glue-api-crawler-pyspark-transforms-spigot.md)
+ [SplitFields class](aws-glue-api-crawler-pyspark-transforms-SplitFields.md)
+ [SplitRows class](aws-glue-api-crawler-pyspark-transforms-SplitRows.md)
+ [Unbox class](aws-glue-api-crawler-pyspark-transforms-Unbox.md)
+ [UnnestFrame class](aws-glue-api-crawler-pyspark-transforms-UnnestFrame.md)

# GlueTransform base class

The base class that all the `awsglue.transforms` classes inherit from.

The classes all define a `__call__` method. They either override the `GlueTransform` class methods listed in the following sections, or they are called using the class name by default.

## Methods

+ [apply(cls, \*args, \*\*kwargs)](#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply)
+ [name(cls)](#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name)
+ [describeArgs(cls)](#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs)
+ [describeReturn(cls)](#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn)
+ [describeTransform(cls)](#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform)
+
  [describeErrors(cls)](#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors)
+ [describe(cls)](#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe)

## apply(cls, \*args, \*\*kwargs)

Applies the transform by calling the transform class, and returns the result.

+ `cls` – The `self` class object.

## name(cls)

Returns the name of the derived transform class.

+ `cls` – The `self` class object.

## describeArgs(cls)

+ `cls` – The `self` class object.

Returns a list of dictionaries, each corresponding to a named argument, in the following format:

```
[
  {
    "name": "(name of argument)",
    "type": "(type of argument)",
    "description": "(description of argument)",
    "optional": "(Boolean, True if the argument is optional)",
    "defaultValue": "(String; the default value, or None)"
  },
...
]
```

Raises a `NotImplementedError` exception when called in a derived transform where it is not implemented.

## describeReturn(cls)

+ `cls` – The `self` class object.

Returns a dictionary with information about the return type, in the following format:

```
{
  "type": "(return type)",
  "description": "(description of output)"
}
```

Raises a `NotImplementedError` exception when called in a derived transform where it is not implemented.

## describeTransform(cls)

Returns a string describing the transform.

+ `cls` – The `self` class object.

Raises a `NotImplementedError` exception when called in a derived transform where it is not implemented.

## describeErrors(cls)

+ `cls` – The `self` class object.

Returns a list of dictionaries, each describing a possible exception thrown by this transform, in the following format:

```
[
  {
    "type": "(type of error)",
    "description": "(description of error)"
  },
...
]
```

## describe(cls)

+ `cls` – The `self` class object.

Returns an object with the following format:

```
{
  "transform" : {
    "name" : cls.name( ),
    "args" : cls.describeArgs( ),
    "returns" : cls.describeReturn( ),
    "raises" : cls.describeErrors( ),
    "location" : "internal"
  }
}
```

# ApplyMapping class

Applies a mapping in a `DynamicFrame`.

## Example

We recommend that you use the [`DynamicFrame.apply_mapping()`](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-apply_mapping) method to apply a mapping in a `DynamicFrame`. To view a code example, see [Example: Use apply\_mapping to rename fields and change field types](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#pyspark-apply_mapping-example).

## Methods

+ [\_\_call\_\_](#aws-glue-api-crawler-pyspark-transforms-ApplyMapping-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-ApplyMapping-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-ApplyMapping-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-ApplyMapping-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-ApplyMapping-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-ApplyMapping-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-ApplyMapping-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-ApplyMapping-describe)

## \_\_call\_\_(frame, mappings, transformation\_ctx = "", info = "", stageThreshold = 0, totalThreshold = 0)

Applies a declarative mapping to a specified `DynamicFrame`.

+ `frame` – The `DynamicFrame` in which to apply the mapping (required).
+ `mappings` – A list of mapping tuples (required). Each tuple consists of: (source column, source type, target column, target type).

  If the source column has a dot "`.`" in its name, you must place backticks (`` ` ``) around it. For example, to map `this.old.name` (string) to `thisNewName`, you would use the following tuple:

  ```
  ("`this.old.name`", "string", "thisNewName", "string")
  ```
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string associated with errors in the transformation (optional).
+ `stageThreshold` – The maximum number of errors that can occur in the transformation before it errors out (optional). The default is zero.
+ `totalThreshold` – The maximum number of errors that can occur overall before processing errors out (optional). The default is zero.

Returns only the fields of the `DynamicFrame` that are specified in the "mapping" tuples.

## apply(cls, \*args, \*\*kwargs)

Inherited from `GlueTransform`
[apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# DropFields class

Drops fields within a `DynamicFrame`.

## Example

We recommend that you use the [`DynamicFrame.drop_fields()`](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-drop_fields) method to drop fields from a `DynamicFrame`. To view a code example, see [Example: Use drop\_fields to remove fields from a `DynamicFrame`](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#pyspark-drop_fields-example).

## Methods

+ [\_\_call\_\_](#aws-glue-api-crawler-pyspark-transforms-DropFields-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-DropFields-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-DropFields-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-DropFields-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-DropFields-describeReturn)
+
  [describeTransform](#aws-glue-api-crawler-pyspark-transforms-DropFields-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-DropFields-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-DropFields-describe)

## \_\_call\_\_(frame, paths, transformation\_ctx = "", info = "", stageThreshold = 0, totalThreshold = 0)

Drops nodes within a `DynamicFrame`.

+ `frame` – The `DynamicFrame` in which to drop the nodes (required).
+ `paths` – A list of full paths to the nodes to drop (required).
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string associated with errors in the transformation (optional).
+ `stageThreshold` – The maximum number of errors that can occur in the transformation before it errors out (optional). The default is zero.
+ `totalThreshold` – The maximum number of errors that can occur overall before processing errors out (optional). The default is zero.

Returns a new `DynamicFrame` without the specified fields.

## apply(cls, \*args, \*\*kwargs)

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# DropNullFields class

Drops from a `DynamicFrame` all null
fields whose type is `NullType`. These are fields with missing or null values in every record in the `DynamicFrame` dataset.

## Example

This example uses `DropNullFields` to create a new `DynamicFrame` where fields of type `NullType` have been dropped. To demonstrate `DropNullFields`, we add a new column named `empty_column` with type null to the already-loaded `persons` dataset.

**Note**

To access the dataset that is used in this example, see [Code example: Joining and relationalizing data](aws-glue-programming-python-samples-legislators.md) and follow the instructions in [Step 1: Crawl the data in the Amazon S3 bucket](aws-glue-programming-python-samples-legislators.md#aws-glue-programming-python-samples-legislators-crawling).

```
# Example: Use DropNullFields to create a new DynamicFrame without NullType fields

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from pyspark.sql.functions import lit
from pyspark.sql.types import NullType
from awsglue.dynamicframe import DynamicFrame
from awsglue.transforms import DropNullFields

# Create GlueContext
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Create DynamicFrame
persons = glueContext.create_dynamic_frame.from_catalog(
    database="legislators", table_name="persons_json"
)
print("Schema for the persons DynamicFrame:")
persons.printSchema()

# Add new column "empty_column" with NullType
persons_with_nulls = persons.toDF().withColumn("empty_column", lit(None).cast(NullType()))
persons_with_nulls_dyf = DynamicFrame.fromDF(persons_with_nulls, glueContext, "persons_with_nulls")
print("Schema for the persons_with_nulls_dyf DynamicFrame:")
persons_with_nulls_dyf.printSchema()

# Remove the NullType field
persons_no_nulls = DropNullFields.apply(persons_with_nulls_dyf)
print("Schema for the persons_no_nulls DynamicFrame:")
persons_no_nulls.printSchema()
```

### Output

```
Schema for the persons DynamicFrame:
root
|-- family_name: string
|-- name: string
|-- links: array
|    |-- element: struct
|    |    |-- note: string
|    |    |-- url: string
|-- gender: string
|-- image: string
|-- identifiers: array
|    |-- element: struct
|    |    |-- scheme: string
|    |    |-- identifier: string
|-- other_names: array
|    |-- element: struct
|    |    |-- lang: string
|    |    |-- note: string
|    |    |-- name: string
|-- sort_name: string
|-- images: array
|    |-- element: struct
|    |    |-- url: string
|-- given_name: string
|-- birth_date: string
|-- id: string
|-- contact_details: array
|    |-- element: struct
|    |    |-- type: string
|    |    |-- value: string
|-- death_date: string

Schema for the persons_with_nulls_dyf DynamicFrame:
root
|-- family_name: string
|-- name: string
|-- links: array
|    |-- element: struct
|    |    |-- note: string
|    |    |-- url: string
|-- gender: string
|-- image: string
|-- identifiers: array
|    |-- element: struct
|    |    |-- scheme: string
|    |    |-- identifier: string
|-- other_names: array
|    |-- element: struct
|    |    |-- lang: string
|    |    |-- note: string
|    |    |-- name: string
|-- sort_name: string
|-- images: array
|    |-- element: struct
|    |    |-- url: string
|-- given_name: string
|-- birth_date: string
|-- id: string
|-- contact_details: array
|    |-- element: struct
|    |    |-- type: string
|    |    |-- value: string
|-- death_date: string
|-- empty_column: null

null_fields ['empty_column']
Schema for the persons_no_nulls DynamicFrame:
root
|-- family_name: string
|-- name: string
|-- links: array
|    |-- element: struct
|    |    |-- note: string
|    |    |-- url: string
|-- gender: string
|-- image: string
|-- identifiers: array
|    |-- element: struct
|    |    |-- scheme: string
|    |    |-- identifier: string
|-- other_names: array
|    |-- element: struct
|    |    |-- lang: string
|    |    |-- note: string
|    |    |-- name: string
|-- sort_name: string
|-- images: array
|    |-- element: struct
|    |    |-- url: string
|-- given_name: string
|-- birth_date: string
|-- id: string
|-- contact_details: array
|    |-- element: struct
|    |    |-- type: string
|    |    |-- value: string
|-- death_date: string
```

## Methods

+ [\_\_call\_\_](#aws-glue-api-crawler-pyspark-transforms-DropNullFields-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-DropNullFields-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-DropNullFields-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-DropNullFields-describeArgs)
+
  [describeReturn](#aws-glue-api-crawler-pyspark-transforms-DropNullFields-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-DropNullFields-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-DropNullFields-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-DropNullFields-describe)

## \_\_call\_\_(frame, transformation\_ctx = "", info = "", stageThreshold = 0, totalThreshold = 0)

Drops all null fields in a `DynamicFrame` whose type is `NullType`. These are fields with missing or null values in every record in the `DynamicFrame` dataset.

+ `frame` – The `DynamicFrame` in which to drop null fields (required).
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string associated with errors in the transformation (optional).
+ `stageThreshold` – The maximum number of errors that can occur in the transformation before it errors out (optional). The default is zero.
+ `totalThreshold` – The maximum number of errors that can occur overall before processing errors out (optional). The default is zero.

Returns a new `DynamicFrame` with no null fields.

## apply(cls, \*args, \*\*kwargs)

+ `cls` – cls

## name(cls)

+ `cls` – cls

## describeArgs(cls)

+ `cls` – cls

## describeReturn(cls)

+ `cls` – cls

## describeTransform(cls)

+ `cls` – cls

## describeErrors(cls)

+ `cls` – cls

## describe(cls)

+ `cls` – cls

# ErrorsAsDynamicFrame class

Returns a `DynamicFrame` that contains nested error records for the errors that occurred while the source `DynamicFrame` was being created.

## Example

We recommend that you use the [`DynamicFrame.errorsAsDynamicFrame()`](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-errorsAsDynamicFrame) method to retrieve and view error records. To view a code example, see [Example: Use errorsAsDynamicFrame to view error records](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#pyspark-errorsAsDynamicFrame-example).

## Methods

+ [\_\_call\_\_](#aws-glue-api-crawler-pyspark-transforms-ErrorsAsDynamicFrame-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-ErrorsAsDynamicFrame-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-ErrorsAsDynamicFrame-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-ErrorsAsDynamicFrame-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-ErrorsAsDynamicFrame-describeReturn)
+
  [describeTransform](#aws-glue-api-crawler-pyspark-transforms-ErrorsAsDynamicFrame-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-ErrorsAsDynamicFrame-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-ErrorsAsDynamicFrame-describe)

## \_\_call\_\_(frame)

Returns a `DynamicFrame` that contains nested error records relating to the source `DynamicFrame`.

+ `frame` – The source `DynamicFrame` (required).

## apply(cls, \*args, \*\*kwargs)

+ `cls` – cls

## name(cls)

+ `cls` – cls

## describeArgs(cls)

+ `cls` – cls

## describeReturn(cls)

+ `cls` – cls

## describeTransform(cls)

+ `cls` – cls

## describeErrors(cls)

+ `cls` – cls

## describe(cls)

+ `cls` – cls

# EvaluateDataQuality class

Evaluates a data quality ruleset against a `DynamicFrame` and returns a new `DynamicFrame` with the results of the evaluation.

## Example

The following example code demonstrates how to evaluate data quality for a `DynamicFrame` and then view the data quality results.

```
from awsglue.transforms import *
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsgluedq.transforms import EvaluateDataQuality

# Create Glue context
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Define DynamicFrame
legislatorsAreas = glueContext.create_dynamic_frame.from_catalog(
    database="legislators", table_name="areas_json")

# Create data quality ruleset
ruleset = """Rules = [ColumnExists "id", IsComplete "id"]"""

# Evaluate data quality
dqResults = EvaluateDataQuality.apply(
    frame=legislatorsAreas,
    ruleset=ruleset,
    publishing_options={
        "dataQualityEvaluationContext": "legislatorsAreas",
        "enableDataQualityCloudWatchMetrics": True,
        "enableDataQualityResultsPublishing": True,
        "resultsS3Prefix": "amzn-s3-demo-bucket1",
    },
)

# Inspect data quality results
dqResults.printSchema()
dqResults.toDF().show()
```

### Output

```
root
|-- Rule: string
|-- Outcome: string
|-- FailureReason: string
|-- EvaluatedMetrics: map
|    |-- keyType: string
|    |-- valueType: double

+-----------------------+-------+-------------+---------------------------------------+
|Rule                   |Outcome|FailureReason|EvaluatedMetrics                       |
+-----------------------+-------+-------------+---------------------------------------+
|ColumnExists "id"      |Passed |null         |{}                                     |
|IsComplete "id"        |Passed |null         |{Column.first_name.Completeness -> 1.0}|
+-----------------------+-------+-------------+---------------------------------------+
```

## Methods

+ [\_\_call\_\_](#aws-glue-api-crawler-pyspark-transforms-EvaluateDataQuality-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-EvaluateDataQuality-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-EvaluateDataQuality-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-EvaluateDataQuality-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-EvaluateDataQuality-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-EvaluateDataQuality-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-EvaluateDataQuality-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-EvaluateDataQuality-describe)

## \_\_call\_\_(frame, ruleset, publishing\_options = \{\})

+ `frame` – The `DynamicFrame` whose data quality you want to evaluate.
+ `ruleset` – A Data Quality Definition Language (DQDL) ruleset in string format. To learn more about DQDL, see the [Data Quality Definition Language (DQDL) reference](dqdl.md) guide.
+ `publishing_options` – A dictionary that specifies the following options for publishing evaluation results and metrics:
    + `dataQualityEvaluationContext` – A string that specifies the namespace under which AWS Glue should publish Amazon CloudWatch metrics and the data quality results. The aggregated metrics appear in CloudWatch, while the full results appear in the AWS Glue Studio interface.
        + Required: No
        + Default value: `default_context`
    + `enableDataQualityCloudWatchMetrics` – Specifies whether the results of the data quality evaluation should be published to CloudWatch. You specify the namespace for the metrics with the `dataQualityEvaluationContext` option.
        + Required: No
        + Default value: False
    + `enableDataQualityResultsPublishing` – Specifies whether the data quality results should be visible on the **Data Quality** tab in the AWS Glue Studio interface.
        + Required: No
        + Default value: True
    + `resultsS3Prefix` – Specifies the Amazon S3 location where AWS Glue can write the data quality evaluation results.
        + Required: No
        + Default value: "" (an empty string)

## apply(cls, \*args, \*\*kwargs)

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)

Inherited from `GlueTransform`
[name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# FillMissingValues class

The `FillMissingValues` class locates null values and empty strings in a specified `DynamicFrame` and uses machine learning methods, such as linear regression and random forest, to predict the missing values. The ETL job uses the values in the input dataset to train the machine learning model, which then predicts what the missing values should be.

**Tip**

If you use incremental datasets, each incremental set is used as the training data for the machine learning model, so the results might be less accurate.

To import:

```
from awsglueml.transforms import FillMissingValues
```

## Methods

+ [apply](#aws-glue-api-crawler-pyspark-transforms-fillmissingvalues-apply)

## apply(frame, missing\_values\_column, output\_column ="", transformation\_ctx ="", info ="", stageThreshold = 0, totalThreshold = 0)

Fills a dynamic frame's missing values in a specified column and returns a new frame with the estimates in a new column. For rows without missing values, the specified column's value is copied to the new column.

+ `frame` – The `DynamicFrame` in which to fill missing values. Required.
+ `missing_values_column` – The column that contains missing values (`null` values and empty strings). Required.
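+ `output_column` – The name of the new column that will contain estimated values for all rows whose value was missing. Optional; the default is the name of `missing_values_column` suffixed by `"_filled"`.
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string associated with errors in the transformation (optional).
+ `stageThreshold` – The maximum number of errors that can occur in the transformation before it errors out (optional; the default is zero).
+ `totalThreshold` – The maximum number of errors that can occur overall before processing errors out (optional; the default is zero).

Returns a new `DynamicFrame` with one additional column that contains estimates for rows with missing values and the present value for other rows.

# Filter class

Builds a new `DynamicFrame` that contains the records from the input `DynamicFrame` that satisfy a specified predicate function.

## Example

We recommend that you use the [`DynamicFrame.filter()`](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-filter) method to filter records in a `DynamicFrame`. To view a code example, see [Example: Use filter to get a filtered selection of fields](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#pyspark-filter-example).

## Methods

+ [\_\_call\_\_](#aws-glue-api-crawler-pyspark-transforms-filter-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-filter-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-filter-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-filter-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-filter-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-filter-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-filter-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-filter-describe)

## \_\_call\_\_(frame, f, transformation\_ctx="", info="", stageThreshold=0, totalThreshold=0)

Returns a new `DynamicFrame` that is built by selecting the records from the input `DynamicFrame` that satisfy a specified predicate function.

+ `frame` – The source `DynamicFrame` to apply the specified filter function to (required).
+ `f` – The predicate function to apply to each `DynamicRecord` in the `DynamicFrame`. The function must take a `DynamicRecord` as its argument and return True if the `DynamicRecord` meets the filter requirements, or False if it does not (required).

  A `DynamicRecord` represents a logical record in a `DynamicFrame`. It is similar to a row in a Spark `DataFrame`, except that it is self-describing and can be used for data that does not conform to a fixed schema.
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string associated with errors in the transformation (optional).
+ `stageThreshold` – The maximum number of errors that can occur in the transformation before it errors out (optional). The default is zero.
+ `totalThreshold` – The maximum number of errors that can occur overall before processing errors out (optional). The default is zero.

## apply(cls, \*args, \*\*kwargs)

Inherited from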
`GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# FindIncrementalMatches class

Identifies matching records in the existing and incremental `DynamicFrame` and creates a new `DynamicFrame` with a unique identifier assigned to each group of matching records.

To import:

```
from awsglueml.transforms import FindIncrementalMatches
```

## Methods

+ [apply](#aws-glue-api-crawler-pyspark-transforms-findincrementalmatches-apply)

## apply(existingFrame, incrementalFrame, transformId, transformation\_ctx = "", info = "", stageThreshold = 0, totalThreshold = 0, enforcedMatches = None, computeMatchConfidenceScores = 0)

Identifies matching records in the input `DynamicFrame` and creates a new `DynamicFrame` with a unique identifier assigned to each group of matching records.

+ `existingFrame` – The existing, pre-matched `DynamicFrame` to apply the FindIncrementalMatches transform to. Required.
+ `incrementalFrame` – The incremental `DynamicFrame` to apply the FindIncrementalMatches transform to, matching against `existingFrame`. Required.
+ `transformId` – The unique ID associated with the FindIncrementalMatches transform to apply to records in the `DynamicFrames`. Required.
+ `transformation_ctx` – A unique string that is used to identify stats and state information. Optional.
+ `info` – A string associated with errors in the transformation. Optional.
+ `stageThreshold` – The maximum number of errors that can occur in the transformation before it errors out. Optional. The default is zero.
+ `totalThreshold` – The maximum number of errors that can occur overall before processing errors out. Optional. The default is zero.
+ `enforcedMatches` – The `DynamicFrame` used to enforce matches. Optional. The default is None.
+ `computeMatchConfidenceScores` – A Boolean value that indicates whether to compute a confidence score for each group of matching records. Optional. The default is false.

Returns a new `DynamicFrame` with a unique identifier assigned to each group of matching records.

# FindMatches class

Identifies matching records in the input `DynamicFrame` and creates a new `DynamicFrame` with a unique identifier assigned to each group of matching records.

To import:

```
from awsglueml.transforms import FindMatches
```

## Methods

+ [apply](#aws-glue-api-crawler-pyspark-transforms-findmatches-apply)

## apply(frame, transformId, transformation\_ctx = "", info = "", stageThreshold = 0, totalThreshold = 0, enforcedMatches = None, computeMatchConfidenceScores = 0)

Identifies matching records in the input `DynamicFrame` and creates a new `DynamicFrame` with a unique identifier assigned to each group of matching records.

+ `frame` – The `DynamicFrame` to apply the FindMatches transform to. Required.
+ `transformId` – The unique ID associated with the FindMatches transform to apply to records in the `DynamicFrame`. Required.
+ `transformation_ctx` – A unique string that is used to identify stats and state information. Optional.
+ `info` – A string associated with errors in the transformation. Optional.
+ `stageThreshold` – The maximum number of errors that can occur in the transformation before it errors out. Optional. The default is zero.
+ `totalThreshold` – The maximum number of errors that can occur overall before processing errors out. Optional. The default is zero.
+ `enforcedMatches` – The `DynamicFrame` used to enforce matches. Optional. The default is None.
+ `computeMatchConfidenceScores` – A Boolean value that indicates whether to compute a confidence score for each group of matching records. Optional. The default is false.

Returns a new `DynamicFrame` with a unique identifier assigned to each group of matching records.

# FlatMap class

Applies a transform to each `DynamicFrame` in a collection. The results are not flattened into a single `DynamicFrame`, but are preserved as a collection.

## Examples for FlatMap

The following example snippet demonstrates how to use the `ResolveChoice` transform on a collection of dynamic frames when it is applied through `FlatMap`. The input data is a JSON file located at the placeholder Amazon S3 address `s3://bucket/path-for-data/sample.json` and contains the following data.

### Example JSON data

```
[{
    "firstname": "Arnav",
    "lastname": "Desai",
    "address": {
        "street": "6 Anyroad Avenue",
        "city": "London",
        "state": "England",
        "country": "UK"
    },
    "phone": 17235550101,
    "affiliations": [
        "General Anonymous Example Products",
        "Example Independent Research",
        "Government Department of Examples"
    ]
},
{
    "firstname": "Mary",
    "lastname": "Major",
    "address": {
        "street": "7821 Spot Place",
        "city": "Centerville",
        "state": "OK",
        "country": "US"
    },
    "phone": 19185550023,
    "affiliations": [
        "Example Dot Com",
        "Example Independent Research",
        "Example.io"
    ]
},
{
    "firstname": "Paulo",
    "lastname": "Santos",
    "address": {
        "street": "123 Maple Street",
        "city": "London",
        "state": "Ontario",
        "country": "CA"
    },
    "phone": 12175550181,
    "affiliations": [
        "General Anonymous Example Products",
        "Example Dot Com"
    ]
}]
```

**Example: Apply `ResolveChoice` to a `DynamicFrameCollection` and display the output.**

```
# Imports and GlueContext setup needed to run the example
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import FlatMap, ResolveChoice

glueContext = GlueContext(SparkContext.getOrCreate())

# Read DynamicFrame
datasource = glueContext.create_dynamic_frame_from_options("s3", connection_options = {"paths":["s3://bucket/path/to/file/mysamplejson.json"]}, format="json")
datasource.printSchema()
datasource.show()

## Split to create a DynamicFrameCollection
split_frame=datasource.split_fields(["firstname","lastname","address"],"personal_info","business_info")
split_frame.keys()
print("---")

## Use FlatMap to run ResolveChoice
kwargs = {"choice": "cast:string"}
flat = FlatMap.apply(split_frame, ResolveChoice, frame_name="frame", transformation_ctx='tcx', **kwargs)
flat.keys()

## Select one of the DynamicFrames
personal_info = flat.select("personal_info")
personal_info.printSchema()
personal_info.show()
print("---")

business_info = flat.select("business_info")
business_info.printSchema()
business_info.show()
```

When calling `FlatMap.apply`, the `frame_name` parameter **must** be `"frame"`. No other value is currently accepted.

### Example output

```
root
|-- firstname: string
|-- lastname: string
|-- address: struct
|    |-- street: string
|    |-- city: string
|    |-- state: string
|    |-- country: string
|-- phone: long
|-- affiliations: array
|    |-- element: string
---
{
    "firstname": "Mary",
    "lastname": "Major",
    "address": {
        "street": "7821 Spot Place",
        "city": "Centerville",
        "state": "OK",
        "country": "US"
    },
    "phone": 19185550023,
    "affiliations": [
        "Example Dot Com",
        "Example Independent Research",
        "Example.io"
    ]
}

{
    "firstname": "Paulo",
    "lastname": "Santos",
    "address": {
        "street": "123 Maple Street",
        "city": "London",
        "state": "Ontario",
        "country": "CA"
    },
    "phone": 12175550181,
    "affiliations": [
        "General Anonymous Example Products",
        "Example Dot Com"
    ]
}
---
root
|-- firstname: string
|-- lastname: string
|-- address: struct
|    |-- street: string
|    |-- city: string
|    |-- state: string
|    |-- country: string

{
    "firstname": "Mary",
    "lastname": "Major",
    "address": {
        "street": "7821 Spot Place",
        "city": "Centerville",
        "state": "OK",
        "country": "US"
    }
}

{
    "firstname": "Paulo",
    "lastname": "Santos",
    "address": {
        "street": "123 Maple Street",
        "city": "London",
        "state": "Ontario",
        "country": "CA"
    }
}
---
root
|-- phone: long
|-- affiliations: array
|    |-- element: string

{
    "phone": 19185550023,
    "affiliations": [
        "Example Dot Com",
        "Example Independent Research",
        "Example.io"
    ]
}

{
    "phone": 12175550181,
    "affiliations": [
        "General Anonymous Example Products",
        "Example Dot Com"
    ]
}
```

## Methods

+ [\_\_call\_\_](#aws-glue-api-crawler-pyspark-transforms-flat-map-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-flat-map-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-flat-map-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-flat-map-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-flat-map-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-flat-map-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-flat-map-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-flat-map-describe)

## \_\_call\_\_(dfc, BaseTransform, frame\_name, transformation\_ctx = "", \*\*base\_kwargs)

Applies a transform to each `DynamicFrame` in a collection and flattens the results.

+ `dfc` – The `DynamicFrameCollection` over which to flatmap (required).
+ `BaseTransform` – A transform derived from `GlueTransform` to apply to each member of the collection (required).
+ `frame_name` – The argument name to pass the elements of the collection to (required).
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `base_kwargs` – Arguments to pass to the base transform (required).

Returns a new `DynamicFrameCollection` created by applying the transform to each `DynamicFrame` in the source `DynamicFrameCollection`.

## apply(cls, \*args, \*\*kwargs)

Inherited from `GlueTransform`
[apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# Join class

Performs an equality join on two `DynamicFrames`.

## Example

We recommend that you use [`DynamicFrame.join()`](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-join) to join `DynamicFrames`. To view a code example, see [Example: Use join to combine `DynamicFrames`](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#pyspark-join-example).

## Methods

+ [\_\_call\_\_](#aws-glue-api-crawler-pyspark-transforms-join-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-join-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-join-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-join-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-join-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-join-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-join-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-join-describe)

## \_\_call\_\_(frame1, frame2, keys1, keys2, transformation\_ctx = "")

Performs an equality join on two `DynamicFrames`.

+ `frame1` – The first `DynamicFrame` to join (required).
+ `frame2` – The second `DynamicFrame` to join (required).
+ `keys1` – The keys to join on for the first frame (required).
+ `keys2` – The keys to join on for the second frame (required).
+ `transformation_ctx` – A unique string that is used to identify state information (optional).

Returns a new `DynamicFrame` created by joining the two `DynamicFrames`.

## apply(cls, \*args, \*\*kwargs)

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# Map class

Builds a new `DynamicFrame` by applying a function to all records in the input `DynamicFrame`.

## Example

We recommend that you use the [`DynamicFrame.map()`](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-map) method to apply a function to all records in a `DynamicFrame`. To view a code example, see [Example: Use map to apply a function to every record in a `DynamicFrame`](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#pyspark-map-example).

## Methods

+ [\_\_call\_\_](#aws-glue-api-crawler-pyspark-transforms-map-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-map-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-map-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-map-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-map-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-map-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-map-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-map-describe)

## \_\_call\_\_(frame, f, transformation\_ctx="", info="", stageThreshold=0, totalThreshold=0)

Returns a new `DynamicFrame` that results from applying the specified function to all `DynamicRecords` in the original `DynamicFrame`.

+ `frame` – The original `DynamicFrame` to apply the mapping function to (required).
+ `f` – The function to apply to all `DynamicRecords` in the `DynamicFrame`. The function must take a `DynamicRecord` as an argument and return a new `DynamicRecord` produced by the mapping (required).

  A `DynamicRecord` represents a logical record in a `DynamicFrame`. It is similar to a row in an Apache Spark `DataFrame`, except that it is self-describing and can be used for data that does not conform to a fixed schema.
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string associated with errors in the transformation (optional).
+ `stageThreshold` – The maximum number of errors that can occur in the transformation before it errors out (optional). The default is zero.
+ `totalThreshold` – The maximum number of errors that can occur overall before processing errors out (optional). The default is zero.

Returns a new `DynamicFrame` that results from applying the specified function to all `DynamicRecords` in the original `DynamicFrame`.

## apply(cls, \*args, \*\*kwargs)

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)

Inherited from `GlueTransform`
[describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# MapToCollection class

Applies a transform to each `DynamicFrame` in the specified `DynamicFrameCollection`.

## Methods

+ [\_\_call\_\_](#aws-glue-api-crawler-pyspark-transforms-MapToCollection-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-MapToCollection-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-MapToCollection-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-MapToCollection-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-MapToCollection-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-MapToCollection-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-MapToCollection-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-MapToCollection-describe)

## \_\_call\_\_(dfc, BaseTransform, frame\_name, transformation\_ctx = "", \*\*base\_kwargs)

Applies a transform function to each `DynamicFrame` in the specified `DynamicFrameCollection`.

+ `dfc` – The `DynamicFrameCollection` to apply the transform function to (required).
+ `callable` – A callable transform function to apply to each member of the collection (required).
+ `transformation_ctx` – A unique string that is used to identify state information (optional).

Returns a new `DynamicFrameCollection` created by applying the transform to each `DynamicFrame` in the source `DynamicFrameCollection`.

## apply(cls, \*args, \*\*kwargs)

Inherited from `GlueTransform`
[apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# Relationalize class

Flattens a nested schema in a `DynamicFrame` and pivots out array columns from the flattened frame.

## Example

We recommend that you use the [`DynamicFrame.relationalize()`](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-relationalize) method to relationalize a `DynamicFrame`. To view a code example, see [Example: Use relationalize to flatten a nested schema in a `DynamicFrame`](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#pyspark-relationalize-example).

## Methods

+ [\_\_call\_\_](#aws-glue-api-crawler-pyspark-transforms-Relationalize-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-Relationalize-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-Relationalize-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-Relationalize-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-Relationalize-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-Relationalize-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-Relationalize-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-Relationalize-describe)

## \_\_call\_\_(frame, staging\_path=None, name='roottable', options=None, transformation\_ctx = "", info = "", stageThreshold = 0, totalThreshold = 0)

Relationalizes a `DynamicFrame` and produces a list of frames that are generated by unnesting nested columns and pivoting array columns. You can join a pivoted array column to the root table by using the join key that is generated in the unnest phase.

+ `frame` – The `DynamicFrame` to relationalize (required).
+ `staging_path` – The path where the method can store partitions of pivoted tables in CSV format (optional). Pivoted tables are read back from this path.
+ `name` – The name of the root table (optional).
+ `options` – A dictionary of optional parameters. Currently unused.
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string associated with errors in the transformation (optional).
+ `stageThreshold` – The maximum number of errors that can occur in the transformation before it errors out (optional). The default is zero.
+ `totalThreshold` – The maximum number of errors that can occur overall before processing errors out (optional). The default is zero.

## apply(cls, \*args, \*\*kwargs)

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)

Inherited from `GlueTransform`
[describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# RenameField class

Renames a node within a `DynamicFrame`.

## Example

We recommend that you use the [`DynamicFrame.rename_field()`](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-rename_field) method to rename a field in a `DynamicFrame`. To view a code example, see [Example: Use rename\_field to rename fields in a `DynamicFrame`](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#pyspark-rename_field-example).

## Methods

+ [\_\_call\_\_](#aws-glue-api-crawler-pyspark-transforms-RenameField-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-RenameField-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-RenameField-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-RenameField-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-RenameField-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-RenameField-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-RenameField-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-RenameField-describe)

## \_\_call\_\_(frame, old\_name, new\_name, transformation\_ctx = "", info = "", stageThreshold = 0, totalThreshold = 0)

Renames a node within a `DynamicFrame`.

+ `frame` – The `DynamicFrame` in which to rename a node (required).
+ `old_name` – The full path to the node to rename (required).

  If the old name contains dots, RenameField does not work unless you enclose the name in backticks (`` ` ``). For example, to replace `this.old.name` with `thisNewName`, you would call RenameField as follows:

  ```
  newDyF = RenameField.apply(oldDyF, "`this.old.name`", "thisNewName")
  ```
+ `new_name` – The new name, including the full path (required).
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string associated with errors in the transformation (optional).
+ `stageThreshold` – The maximum number of errors that can occur in the transformation before it errors out (optional). The default is zero.
+ `totalThreshold` – The maximum number of errors that can occur overall before processing errors out (optional). The default is zero.

## apply(cls, \*args, \*\*kwargs)

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# ResolveChoice class

Resolves a choice type within a `DynamicFrame`.

## Example

We recommend that you use the [`DynamicFrame.resolveChoice()`](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-resolveChoice) method to handle fields that contain multiple types in a `DynamicFrame`. To view a code example, see [Example: Use resolveChoice to handle a column that contains multiple types](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#pyspark-resolveChoice-example).

## Methods

+ [\_\_call\_\_](#aws-glue-api-crawler-pyspark-transforms-ResolveChoice-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-ResolveChoice-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-ResolveChoice-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-ResolveChoice-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-ResolveChoice-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-ResolveChoice-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-ResolveChoice-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-ResolveChoice-describe)

## \_\_call\_\_(frame, specs = None, choice = "", transformation\_ctx = "", info = "", stageThreshold = 0, totalThreshold = 0)

Provides information for resolving ambiguous types within a `DynamicFrame`. It returns the resulting `DynamicFrame`.

+ `frame` – The `DynamicFrame` in which to resolve the choice type (required).
+ `specs` – A list of specific ambiguities to resolve, each in the form of a tuple: `(path, action)`. The `path` value identifies a specific ambiguous element, and the `action` value identifies the corresponding resolution.

  You can use only one of the `specs` and `choice` parameters. If the `specs` parameter is not `None`, then the `choice` parameter must be an empty string. Conversely, if `choice` is not an empty string, then the `specs` parameter must be `None`. If neither parameter is provided, AWS Glue tries to parse the schema and use it to resolve ambiguities.

  You can specify one of the following resolution strategies in the `action` portion of a `specs` tuple:
    + `cast` – Lets you specify a type to cast to (for example, `cast:int`).
    + `make_cols` – Resolves a potential ambiguity by flattening the data. For example, if `columnA` could be an `int` or a `string`, the resolution is to produce two columns named `columnA_int` and `columnA_string` in the resulting `DynamicFrame`.
    + `make_struct` – Resolves a potential ambiguity by using a struct to represent the data. For example, if data in a column could be an `int` or a `string`, using the `make_struct` action produces a column of structures in the resulting `DynamicFrame`, with each structure containing both an `int` and a `string`.
    + `project` – Resolves a potential ambiguity by retaining only values of a specified type in the resulting `DynamicFrame`. For example, if data in a `ChoiceType` column could be an `int` or a `string`, specifying a `project:string` action drops values that are not of type `string` from the resulting `DynamicFrame`.

  If the `path` identifies an array, place empty square brackets after the name of the array to avoid ambiguity. For example, suppose that you are working with data structured as follows:

  ```
  "myList": [
    { "price": 100.00 },
    { "price": "$100.00" }
  ]
  ```

  You can select the numeric rather than the string version of the price by setting the `path` to `"myList[].price"` and the `action` to `"cast:double"`.
+ `choice` – The default resolution action if the `specs` parameter is `None`. If the `specs` parameter is not `None`, then this must be an empty string and cannot be set to anything else.

  In addition to the `specs` actions described above, this argument also supports the following action:
    + `MATCH_CATALOG` – Attempts to cast each `ChoiceType` to the corresponding type in the specified Data Catalog table.
+ `database` – The AWS Glue Data Catalog database to use with the `MATCH_CATALOG` choice (required for `MATCH_CATALOG`).
+ `table_name` – The AWS Glue Data Catalog table name to use with the `MATCH_CATALOG` action (required for `MATCH_CATALOG`).
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string associated with errors in the transformation (optional).
+ `stageThreshold` – The maximum number of errors that can occur in the transformation before it errors out (optional). The default is zero.
+ `totalThreshold` – The maximum number of errors that can occur overall before processing errors out (optional). The default is zero.

## apply(cls, \*args, \*\*kwargs)

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# SelectFields class

The `SelectFields` class creates a new `DynamicFrame` from an existing `DynamicFrame` and keeps only the fields that you specify. `SelectFields` provides functionality similar to a SQL `SELECT` statement.

## Example

We recommend that you use the [`DynamicFrame.select_fields()`](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-select_fields) method to select fields from a `DynamicFrame`. To view a code example, see [Example: Use select\_fields to create a new `DynamicFrame` with chosen fields](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#pyspark-select_fields-example).

## Methods

+ [\_\_call\_\_](#aws-glue-api-crawler-pyspark-transforms-SelectFields-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-SelectFields-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-SelectFields-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-SelectFields-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-SelectFields-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-SelectFields-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-SelectFields-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-SelectFields-describe)

## \_\_call\_\_(frame, paths, transformation\_ctx = "", info = "", stageThreshold = 0, totalThreshold = 0)

Gets fields (nodes) in a `DynamicFrame`.

+ `frame` – The `DynamicFrame` in which to select fields (required).
+ `paths` – A list of full paths to the fields to select (required).
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string associated with errors in the transformation (optional).
+ `stageThreshold` – The maximum number of errors that can occur in the transformation before it errors out (optional). The default is zero.
+ `totalThreshold` – The maximum number of errors that can occur overall before processing errors out (optional). The default is zero.

Returns a new `DynamicFrame` that contains only the specified fields.

## apply(cls, \*args, \*\*kwargs)

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)

Inherited from `GlueTransform`
[describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# SelectFromCollection class

Selects one `DynamicFrame` in a `DynamicFrameCollection`.

## Example

This example uses `SelectFromCollection` to select a `DynamicFrame` from a `DynamicFrameCollection`.

**Example dataset**

The example selects two `DynamicFrames` from a `DynamicFrameCollection` called `split_rows_collection`. The following is the list of keys in `split_rows_collection`.

```
dict_keys(['high', 'low'])
```

**Example code**

```
# Example: Use SelectFromCollection to select
# DynamicFrames from a DynamicFrameCollection

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import SelectFromCollection

# Create GlueContext
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# split_rows_collection is the existing DynamicFrameCollection
# described under "Example dataset"; it is not created here.

# Select frames and inspect entries
frame_low = SelectFromCollection.apply(dfc=split_rows_collection, key="low")
frame_low.toDF().show()

frame_high = SelectFromCollection.apply(dfc=split_rows_collection, key="high")
frame_high.toDF().show()
```

### Output

```
+---+-----+------------------------+-------------------------+
| id|index|contact_details.val.type|contact_details.val.value|
+---+-----+------------------------+-------------------------+
|  1|    0|                     fax|             202-225-3307|
|  1|    1|                   phone|             202-225-5731|
|  2|    0|                     fax|             202-225-3307|
|  2|    1|                   phone|             202-225-5731|
|  3|    0|                     fax|             202-225-3307|
|  3|    1|                   phone|             202-225-5731|
|  4|    0|                     fax|             202-225-3307|
|  4|    1|                   phone|             202-225-5731|
|  5|    0|                     fax|             202-225-3307|
|  5|    1|                   phone|             202-225-5731|
|  6|    0|                     fax|             202-225-3307|
|  6|    1|                   phone|             202-225-5731|
|  7|    0|                     fax|             202-225-3307|
|  7|    1|                   phone|             202-225-5731|
|  8|    0|                     fax|             202-225-3307|
|  8|    1|                   phone|             202-225-5731|
|  9|    0|                     fax|             202-225-3307|
|  9|    1|                   phone|             202-225-5731|
| 10|    0|                     fax|             202-225-6328|
| 10|    1|                   phone|             202-225-4576|
+---+-----+------------------------+-------------------------+
only showing top 20 rows

+---+-----+------------------------+-------------------------+
| id|index|contact_details.val.type|contact_details.val.value|
+---+-----+------------------------+-------------------------+
| 11|    0|                     fax|             202-225-6328|
| 11|    1|                   phone|             202-225-4576|
| 11|    2|                 twitter|           RepTrentFranks|
| 12|    0|                     fax|             202-225-6328|
| 12|    1|                   phone|             202-225-4576|
| 12|    2|                 twitter|           RepTrentFranks|
| 13|    0|                     fax|             202-225-6328|
| 13|    1|                   phone|             202-225-4576|
| 13|    2|                 twitter|           RepTrentFranks|
| 14|    0|                     fax|             202-225-6328|
| 14|    1|                   phone|             202-225-4576|
| 14|    2|                 twitter|           RepTrentFranks|
| 15|    0|                     fax|             202-225-6328|
| 15|    1|                   phone|             202-225-4576|
| 15|    2|                 twitter|           RepTrentFranks|
| 16|    0|                     fax|             202-225-6328|
| 16|    1|                   phone|             202-225-4576|
| 16|    2|                 twitter|           RepTrentFranks|
| 17|    0|                     fax|             202-225-6328|
| 17|    1|                   phone|             202-225-4576|
+---+-----+------------------------+-------------------------+
only showing top 20 rows
```

## Methods

+ [\_\_call\_\_](#aws-glue-api-crawler-pyspark-transforms-SelectFromCollection-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-SelectFromCollection-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-SelectFromCollection-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-SelectFromCollection-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-SelectFromCollection-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-SelectFromCollection-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-SelectFromCollection-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-SelectFromCollection-describe)

## \_\_call\_\_(dfc, key, transformation\_ctx = "")

Gets one `DynamicFrame` from a `DynamicFrameCollection`.

+ `dfc` – The `DynamicFrameCollection` from which the `DynamicFrame` should be selected (required).
+ `key` – The key of the `DynamicFrame` to select (required).
+ `transformation_ctx` – A unique string that is used to identify state information (optional).

## apply(cls, \*args, \*\*kwargs)

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# Simplify\_ddb\_json class

Simplifies nested columns in a `DynamicFrame` that are specifically in the DynamoDB JSON structure, and returns a new simplified `DynamicFrame`.

## Example

We recommend that you use the `DynamicFrame.simplify_ddb_json()` method to simplify nested columns in a `DynamicFrame` that are specifically in the DynamoDB JSON structure. To view a code example, see [Example: Use simplify\_ddb\_json to invoke a DynamoDB JSON simplify](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#pyspark-simplify-ddb-json-example).

# Spigot class

Writes sample records to a specified destination to help you verify the transformations performed by your AWS Glue job.

## Example

We recommend that you use the [`DynamicFrame.spigot()`](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-spigot) method to write a subset of records from a `DynamicFrame` to a specified destination. To view a code example, see [Example: Use spigot to write sample fields from a `DynamicFrame` to Amazon S3](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#pyspark-spigot-example).

## Methods

+ [\_\_call\_\_](#aws-glue-api-crawler-pyspark-transforms-spigot-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-spigot-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-spigot-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-spigot-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-spigot-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-spigot-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-spigot-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-spigot-describe)

## \_\_call\_\_(frame, path, options, transformation\_ctx = "")

Writes sample records to a specified destination during a transformation.

+ `frame` – The `DynamicFrame` to spigot (required).
+ `path` – The path of the destination to write to (required).
+ `options` – JSON key-value pairs that specify options (optional). The `"topk"` option specifies that the first *k* records should be written. The `"prob"` option specifies the probability (as a decimal) of picking any given record. You can use it to select which records to write.
+ `transformation_ctx` – A unique string that is used to identify state information (optional).

## apply(cls, \*args, \*\*kwargs)

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)

Inherited from `GlueTransform`
[describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# SplitFields class

Splits a `DynamicFrame` into two new ones, by specified fields.

## Example

We recommend that you use the [`DynamicFrame.split_fields()`](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-split_fields) method to split fields in a `DynamicFrame`. To view a code example, see [Example: Use split\_fields to split selected fields into a separate `DynamicFrame`](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#pyspark-split_fields-example).

## Methods

+ [\_\_call\_\_](#aws-glue-api-crawler-pyspark-transforms-SplitFields-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-SplitFields-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-SplitFields-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-SplitFields-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-SplitFields-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-SplitFields-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-SplitFields-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-SplitFields-describe)

## \_\_call\_\_(frame, paths, name1 = None, name2 = None, transformation\_ctx = "", info = "", stageThreshold = 0, totalThreshold = 0)

Splits one or more fields in a `DynamicFrame` off into a new `DynamicFrame`, and creates another new `DynamicFrame` that contains the remaining fields.

+ `frame` – The source `DynamicFrame` to split into two new ones (required).
+ `paths` – A list of full paths to the fields to be split (required).
+ `name1` – The name to assign to the `DynamicFrame` that contains the fields to be split off (optional). If no name is supplied, the name of the source frame is used with "1" appended.
+ `name2` – The name to assign to the `DynamicFrame` that contains the fields that remain after the specified fields are split off (optional). If no name is supplied, the name of the source frame is used with "2" appended.
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string associated with errors in the transformation (optional).
+ `stageThreshold` – The maximum number of errors that can occur in the transformation before it errors out (optional). The default is zero.
+ `totalThreshold` – The maximum number of errors that can occur overall before processing errors out (optional). The default is zero.
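The splitting and default-naming behavior described in the parameter list can be modeled without a Glue runtime. The following is a minimal sketch in plain Python, not the AWS Glue implementation. `split_fields_sketch`, its `frame_name` argument, and the sample `people` records are hypothetical names introduced only for illustration. The sketch treats a frame as a list of dictionaries and supports only top-level field names, whereas real paths can point to nested fields.

```python
from typing import Any, Dict, List, Optional

Record = Dict[str, Any]


def split_fields_sketch(
    records: List[Record],
    paths: List[str],
    name1: Optional[str] = None,
    name2: Optional[str] = None,
    frame_name: str = "frame",
) -> Dict[str, List[Record]]:
    """Hypothetical model of the SplitFields semantics on a list of dicts.

    Assumption: only top-level field names are handled in this sketch.
    """
    # Default names: the source frame's name with "1" or "2" appended.
    name1 = name1 if name1 is not None else frame_name + "1"
    name2 = name2 if name2 is not None else frame_name + "2"

    selected = set(paths)
    # First result keeps only the fields listed in `paths`.
    split_off = [
        {key: value for key, value in rec.items() if key in selected}
        for rec in records
    ]
    # Second result keeps every remaining field.
    remaining = [
        {key: value for key, value in rec.items() if key not in selected}
        for rec in records
    ]
    return {name1: split_off, name2: remaining}


# Hypothetical sample data standing in for a DynamicFrame named "people".
people = [
    {"id": 1, "name": "Ana", "city": "Lima"},
    {"id": 2, "name": "Bo", "city": "Oslo"},
]

result = split_fields_sketch(people, paths=["id", "name"], frame_name="people")
print(sorted(result.keys()))
print(result["people1"][0])
print(result["people2"][0])
```

In a Glue job, the two resulting frames are held in a `DynamicFrameCollection`, and each can be picked out by key with `SelectFromCollection`, as in the `SelectFromCollection` example earlier in this guide.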
## apply(cls, \*args, \*\*kwargs)

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# SplitRows class

Creates a `DynamicFrameCollection` that contains two `DynamicFrames`. One `DynamicFrame` contains only the specified rows to be split off, and the other contains all remaining rows.

## Example

We recommend that you use the [`DynamicFrame.split_rows()`](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-split_rows) method to split rows in a `DynamicFrame`. To view a code example, see [Example: Use split\_rows to split rows in a `DynamicFrame`](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#pyspark-split_rows-example).

## Methods

+ [\_\_call\_\_](#aws-glue-api-crawler-pyspark-transforms-SplitRows-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-SplitRows-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-SplitRows-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-SplitRows-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-SplitRows-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-SplitRows-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-SplitRows-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-SplitRows-describe)

## \_\_call\_\_(frame, comparison\_dict, name1="frame1", name2="frame2", transformation\_ctx = "", info = None, stageThreshold = 0, totalThreshold = 0)

Splits one or more rows in a `DynamicFrame` off into a new `DynamicFrame`.

+ `frame` – The source `DynamicFrame` to split into two new ones (required).
+ `comparison_dict` – A dictionary in which each key is the full path to a column, and each value is another dictionary that maps comparison operators to the values that the column values are compared against. For example, `{"age": {">": 10, "<": 20}}` splits rows in which the value of "age" is between 10 and 20 from rows in which "age" is outside that range (required).
+ `name1` – The name to assign to the `DynamicFrame` that contains the rows to be split off (optional).
+ `name2` – The name to assign to the `DynamicFrame` that contains the rows that remain after the specified rows are split off (optional).
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string associated with errors in the transformation (optional).
+ `stageThreshold` – The maximum number of errors that can occur in the transformation before it errors out (optional). The default is zero.
+ `totalThreshold` – The maximum number of errors that can occur overall before processing errors out (optional). The default is zero.

## apply(cls, \*args, \*\*kwargs)

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)

Inherited from `GlueTransform`
[describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# Unbox class

Unboxes (reformats) a string field in a `DynamicFrame`.

## Example

We recommend that you use the [`DynamicFrame.unbox()`](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-unbox) method to unbox a field in a `DynamicFrame`. To view a code example, see [Example: Use unbox to unbox a string field into a struct](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#pyspark-unbox-example).

## Methods

+ [\_\_call\_\_](#aws-glue-api-crawler-pyspark-transforms-Unbox-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-Unbox-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-Unbox-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-Unbox-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-Unbox-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-Unbox-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-Unbox-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-Unbox-describe)

## \_\_call\_\_(frame, path, format, transformation\_ctx = "", info="", stageThreshold=0, totalThreshold=0, \*\*options)

Unboxes a string field in a `DynamicFrame`.

+ `frame` – The `DynamicFrame` that contains the field to unbox (required).
+ `path` – The full path to the `StringNode` to unbox (required).
+ `format` – A format specification (optional). You use this for an Amazon S3 or AWS Glue connection that supports multiple formats. For the supported formats, see [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md).
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string associated with errors in the transformation
(optional).
+ `stageThreshold` – The maximum number of errors that can occur in the transformation before it errors out (optional). The default is zero.
+ `totalThreshold` – The maximum number of errors that can occur overall before processing errors out (optional). The default is zero.
+ `separator` – A separator token (optional).
+ `escaper` – An escape token (optional).
+ `skipFirst` – `True` if the first line of data should be skipped, or `False` if it should not be skipped (optional).
+ `withSchema` – A string that contains the schema of the data to be unboxed (optional). This string should always be created using `StructType.json`.
+ `withHeader` – `True` if the data being unboxed includes a header, or `False` if not (optional).

## apply(cls, \*args, \*\*kwargs)

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# UnnestFrame class

Unnests a `DynamicFrame`, flattens nested objects to top-level elements, and generates join keys for array objects.

## Example

We recommend that you use the [`DynamicFrame.unnest()`](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-unnest) method to flatten nested structures in a `DynamicFrame`. To view a code example, see [Example: Use unnest to turn nested fields into top-level fields](aws-glue-api-crawler-pyspark-extensions-dynamic-frame.md#pyspark-unnest-example).

## Methods

+ [\_\_call\_\_](#aws-glue-api-crawler-pyspark-transforms-UnnestFrame-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-UnnestFrame-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-UnnestFrame-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-UnnestFrame-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-UnnestFrame-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-UnnestFrame-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-UnnestFrame-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-UnnestFrame-describe)

## \_\_call\_\_(frame, transformation\_ctx = "", info="", stageThreshold=0, totalThreshold=0)

Unnests a `DynamicFrame`, flattens nested objects to top-level elements, and generates join keys for array objects.

+ `frame` – The `DynamicFrame` to unnest (required).
+ `transformation_ctx` – A unique string that is used to identify state information (optional).
+ `info` – A string associated with errors in the transformation (optional).
+ `stageThreshold` – The maximum number of errors that can occur in the transformation before it errors out (optional). The default is zero.
+ `totalThreshold` – The maximum number of errors that can occur overall before processing errors out (optional). The default is zero.

## apply(cls, \*args, \*\*kwargs)

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)

Inherited from `GlueTransform`
[describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# FlagDuplicatesInColumn class

The `FlagDuplicatesInColumn` transform returns a new column with a specified value in each row that indicates whether the value in the row's source column matches a value in an earlier row of the source column. When matches are found, they are flagged as duplicates. The initial occurrence is not flagged, because it doesn't match an earlier row.

## Example

```
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from awsgluedi.transforms import *

sc = SparkContext()
spark = SparkSession(sc)

# ${BUCKET} is a placeholder for your own S3 bucket name
datasource1 = spark.read.json("s3://${BUCKET}/json/zips/raw/data")

try:
    # Flag rows whose "city" value already appeared in an earlier row
    df_output = column.FlagDuplicatesInColumn.apply(
        data_frame=datasource1,
        spark_context=sc,
        source_column="city",
        target_column="flag_col",
        true_string="True",
        false_string="False"
    )
except:
    print("Unexpected Error happened ")
    raise
```

## Output

The `FlagDuplicatesInColumn` transform adds a new column, `flag_col`, to the `df_output` DataFrame. This column contains a string value that indicates whether the corresponding row has a duplicate value in the `city` column. If a row's `city` value matches a value in an earlier row, `flag_col` contains the `true_string` value "True". Otherwise, including the first occurrence of a value, `flag_col` contains the `false_string` value "False".

The resulting `df_output` DataFrame contains all columns from the original `datasource1` DataFrame, plus the additional `flag_col` column that indicates duplicate `city` values.

## Methods

+ [\_\_call\_\_](#aws-glue-api-pyspark-transforms-FlagDuplicatesInColumn-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-FlagDuplicatesInColumn-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-FlagDuplicatesInColumn-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-FlagDuplicatesInColumn-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-FlagDuplicatesInColumn-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-FlagDuplicatesInColumn-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-FlagDuplicatesInColumn-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-FlagDuplicatesInColumn-describe)

## \_\_call\_\_(spark\_context, data\_frame, source\_column, target\_column, true\_string=DEFAULT\_TRUE\_STRING, false\_string=DEFAULT\_FALSE\_STRING)

The `FlagDuplicatesInColumn` transform returns a new column with a specified value in each row that indicates whether the value in the row's source column matches a value in an earlier row of the source column. When matches are found, they are flagged as duplicates. The initial occurrence is not flagged, because it doesn't match an earlier row.

+ `source_column` – The name of the source column.
+ `target_column` – The name of the target column.
+ `true_string` – The string to insert in the target column when a source column value duplicates an earlier value in that column.
+ `false_string` – The string to insert in the target column when a source column value is distinct from earlier values in that column.

## apply(cls, \*args, \*\*kwargs)

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)

Inherited from `GlueTransform`
[describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# FormatPhoneNumber class

The `FormatPhoneNumber` transform returns a column in which a phone number string is converted into a formatted value.

## Example

```
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from awsgluedi.transforms import *

sc = SparkContext()
spark = SparkSession(sc)

# Two spellings of the same US phone number
input_df = spark.createDataFrame(
    [
        ("408-341-5669",),
        ("4083415669",)
    ],
    ["phone"],
)

try:
    df_output = column_formatting.FormatPhoneNumber.apply(
        data_frame=input_df,
        spark_context=sc,
        source_column="phone",
        default_region="US"
    )
    df_output.show()
except:
    print("Unexpected Error happened ")
    raise
```

## Output

The output is:

```
+---------------+
|          phone|
+---------------+
| (408) 341-5669|
| (408) 341-5669|
+---------------+
```

The `FormatPhoneNumber` transform takes `source_column` as `"phone"` and `default_region` as `"US"`. It formats both phone numbers to the standard US format `(408) 341-5669`, regardless of their initial format.

## Methods

+ [\_\_call\_\_](#aws-glue-api-pyspark-transforms-FormatPhoneNumber-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-FormatPhoneNumber-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-FormatPhoneNumber-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-FormatPhoneNumber-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-FormatPhoneNumber-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-FormatPhoneNumber-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-FormatPhoneNumber-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-FormatPhoneNumber-describe)

## \_\_call\_\_(spark\_context, data\_frame, source\_column, phone\_number\_format=None, default\_region=None, default\_region\_column=None)

The `FormatPhoneNumber` transform returns a column in which a phone number string is converted into a formatted value.

+ `source_column` – The name of an existing column.
+ `phone_number_format` –
The format to convert the phone number to. If no format is specified, the default is `E.164`, an internationally recognized standard phone number format. Valid values include the following:
  + E164 (omit the period after E)
+ `default_region` – A valid region code that consists of two or three uppercase letters and specifies the region for the phone number when the number itself has no country code. At most one of `default_region` or `default_region_column` can be provided.
+ `default_region_column` – The name of a column of the advanced data type `Country`. The region code from the specified column is used to determine the country code for the phone number when the number itself has no country code. At most one of `default_region` or `default_region_column` can be provided.

## apply(cls, \*args, \*\*kwargs)

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# FormatCase class

The `FormatCase` transform changes each string in a column to the specified case type.

## Example

```
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from awsgluedi.transforms import *

sc = SparkContext()
spark = SparkSession(sc)

# ${BUCKET} is a placeholder for your own S3 bucket name
datasource1 = spark.read.json("s3://${BUCKET}/json/zips/raw/data")

try:
    # Convert every value in the "city" column to lowercase
    df_output = data_cleaning.FormatCase.apply(
        data_frame=datasource1,
        spark_context=sc,
        source_column="city",
        case_type="LOWER"
    )
except:
    print("Unexpected Error happened ")
    raise
```

## Output

The `FormatCase` transform converts the values in the `city` column to lowercase, based on the `case_type="LOWER"` parameter. The resulting `df_output` DataFrame contains all columns from the original `datasource1` DataFrame, but with the `city` column values in lowercase.

## Methods

+ [\_\_call\_\_](#aws-glue-api-pyspark-transforms-FormatCase-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-FormatCase-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-FormatCase-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-FormatCase-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-FormatCase-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-FormatCase-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-FormatCase-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-FormatCase-describe)

## \_\_call\_\_(spark\_context, data\_frame, source\_column, case\_type)

The `FormatCase` transform changes each string in a column to the specified case type.

+ `source_column` – The name of an existing column.
+ `case_type` – The supported case types are `CAPITAL`, `LOWER`, `UPPER`, and `SENTENCE`.

## apply(cls, \*args, \*\*kwargs)

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)

Inherited from `GlueTransform`
[describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# FillWithMode class

The `FillWithMode` transform fills missing values in a column with the mode (the most frequently occurring value) of that column. You can also specify tie-breaker logic for cases in which several values occur equally often. For example, consider the following values: `1 2 2 3 3 4`

A modeType of `MINIMUM` causes `FillWithMode` to return 2 as the mode value. If modeType is `MAXIMUM`, the mode is 3. For `AVERAGE`, the mode is 2.5.

## Example

```
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from awsgluedi.transforms import *

sc = SparkContext()
spark = SparkSession(sc)

# source_column_1 contains two null values to be filled
input_df = spark.createDataFrame(
    [
        (105.111, 13.12),
        (1055.123, 13.12),
        (None, 13.12),
        (13.12, 13.12),
        (None, 13.12),
    ],
    ["source_column_1", "source_column_2"],
)

try:
    df_output = data_quality.FillWithMode.apply(
        data_frame=input_df,
        spark_context=sc,
        source_column="source_column_1",
        mode_type="MAXIMUM"
    )
    df_output.show()
except:
    print("Unexpected Error happened ")
    raise
```

## Output

The output of the given code is:

```
+---------------+---------------+
|source_column_1|source_column_2|
+---------------+---------------+
|        105.111|          13.12|
|       1055.123|          13.12|
|       1055.123|          13.12|
|          13.12|          13.12|
|       1055.123|          13.12|
+---------------+---------------+
```

The `FillWithMode` transform (`data_quality.FillWithMode` in the example) is applied to the `input_df` DataFrame. It replaces the `null` values in the `source_column_1` column with the mode of the non-null values in that column. Each non-null value in `source_column_1` occurs exactly once, so all of them are tied, and `mode_type="MAXIMUM"` resolves the tie by choosing the largest value, `1055.123`. Therefore, the `null` values in `source_column_1` are replaced by `1055.123` in the output DataFrame `df_output`.

## Methods

+ [\_\_call\_\_](#aws-glue-api-pyspark-transforms-FillWithMode-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-FillWithMode-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-FillWithMode-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-FillWithMode-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-FillWithMode-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-FillWithMode-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-FillWithMode-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-FillWithMode-describe)

## \_\_call\_\_(spark\_context, data\_frame, source\_column, mode\_type)

The `FillWithMode` transform fills missing values in a column with the mode of that column.

+ `source_column` – The name of an existing column.
+ `mode_type` – How to resolve tie values in the data. This value must be one of `MINIMUM`, `NONE`, `AVERAGE`, or `MAXIMUM`.

## apply(cls, \*args, \*\*kwargs)

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# FlagDuplicateRows class

The `FlagDuplicateRows` transform returns a new column with a specified value in each row that indicates whether that row is an exact match of an earlier row in the dataset. When matches are found, they are flagged as duplicates. The initial occurrence is not flagged, because it doesn't match an earlier row.

## Example

```
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from awsgluedi.transforms import *

sc = SparkContext()
spark = SparkSession(sc)

input_df = spark.createDataFrame(
    [
        (105.111, 13.12),
        (13.12, 13.12),
        (None, 13.12),
        (13.12, 13.12),
        (None, 13.12),
    ],
    ["source_column_1", "source_column_2"],
)

try:
    df_output = data_quality.FlagDuplicateRows.apply(
        data_frame=input_df,
        spark_context=sc,
        target_column="flag_row",
        true_string="True",
        false_string="False",
        target_index=1
    )
except:
    print("Unexpected Error happened ")
    raise
```

## Output

The output is a PySpark DataFrame with an additional column, `flag_row`, that indicates whether each row is a duplicate. The resulting `df_output` DataFrame contains the following rows:

```
+---------------+---------------+--------+
|source_column_1|source_column_2|flag_row|
+---------------+---------------+--------+
|        105.111|          13.12|   False|
|          13.12|          13.12|    True|
|           null|          13.12|    True|
|          13.12|          13.12|    True|
|           null|          13.12|    True|
+---------------+---------------+--------+
```

The `flag_row` column indicates whether a row is a duplicate. `true_string` is set to "True", and `false_string` is set to "False". `target_index` is set to 1, which means that the `flag_row` column is inserted at the second position (index 1) in the output DataFrame.

## Methods

+ [\_\_call\_\_](#aws-glue-api-pyspark-transforms-FlagDuplicateRows-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-FlagDuplicateRows-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-FlagDuplicateRows-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-FlagDuplicateRows-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-FlagDuplicateRows-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-FlagDuplicateRows-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-FlagDuplicateRows-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-FlagDuplicateRows-describe)

## \_\_call\_\_(spark\_context, data\_frame, target\_column, true\_string=DEFAULT\_TRUE\_STRING, false\_string=DEFAULT\_FALSE\_STRING, target\_index=None)

The `FlagDuplicateRows` transform returns a new column with a specified value in each row that indicates whether that row is an exact match of an earlier row in the dataset. When matches are found, they are flagged as duplicates. The initial occurrence is not flagged, because it doesn't match an earlier row.

+ `true_string` – The value to insert when a row matches an earlier row.
+ `false_string` – The value to insert when a row is unique.
+ `target_column` – The name of the new column that is inserted in the dataset.
+ `target_index` – The index at which the new column is inserted in the output DataFrame; in the example, index 1 is the second position.

## apply(cls, \*args, \*\*kwargs)

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# RemoveDuplicates class

The `RemoveDuplicates` transform deletes an entire row if a duplicate value is encountered in a selected source column.

## Example

```
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from awsgluedi.transforms import *

sc = SparkContext()
spark = SparkSession(sc)

input_df = spark.createDataFrame(
    [
        (105.111, 13.12),
        (13.12, 13.12),
        (None, 13.12),
        (13.12, 13.12),
        (None, 13.12),
    ],
    ["source_column_1", "source_column_2"],
)

try:
    df_output = data_quality.RemoveDuplicates.apply(
        data_frame=input_df,
        spark_context=sc,
        source_column="source_column_1"
    )
except:
    print("Unexpected Error happened ")
    raise
```

## Output

The output is a PySpark DataFrame with duplicates removed based on the `source_column_1` column. The resulting `df_output` DataFrame contains the following rows:

```
+---------------+---------------+
|source_column_1|source_column_2|
+---------------+---------------+
|        105.111|          13.12|
|          13.12|          13.12|
|           null|          13.12|
+---------------+---------------+
```

Note that the rows with `source_column_1` values of `13.12` and `null` appear only once in the output DataFrame, because duplicates were removed based on the `source_column_1` column.

## Methods

+ [\_\_call\_\_](#aws-glue-api-pyspark-transforms-RemoveDuplicates-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-RemoveDuplicates-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-RemoveDuplicates-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-RemoveDuplicates-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-RemoveDuplicates-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-RemoveDuplicates-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-RemoveDuplicates-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-RemoveDuplicates-describe)

## \_\_call\_\_(spark\_context, data\_frame, source\_column)

The `RemoveDuplicates` transform deletes an entire row if a duplicate value is encountered in a selected source column.

+ `source_column` – The name of an existing column.

## apply(cls, \*args, \*\*kwargs)

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)

Inherited from `GlueTransform`
[describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# MonthName class

The `MonthName` transform creates a new column containing the name of the month, from a string that represents a date.

## Example

```
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from awsgluedi.transforms import *

sc = SparkContext()
spark = SparkSession(sc)
# Use the legacy time parser policy for date parsing.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

input_df = spark.createDataFrame(
    [
        ("20-2018-12",),
        ("2018-20-12",),
        ("20182012",),
        ("12202018",),
        ("20122018",),
        ("20-12-2018",),
        ("12/20/2018",),
        ("02/02/02",),
        ("02 02 2009",),
        ("02/02/2009",),
        ("August/02/2009",),
        ("02/june/2009",),
        ("02/2020/june",),
        ("2013-02-21 06:35:45.658505",),
        ("August 02 2009",),
        ("2013/02/21",),
        (None,),
    ],
    ["column_1"],
)

try:
    df_output = datetime_functions.MonthName.apply(
        data_frame=input_df,
        spark_context=sc,
        source_column="column_1",
        target_column="target_column"
    )
    df_output.show()
except Exception:
    print("Unexpected Error happened ")
    raise
```

## Output

The output is:

```
+--------------------------+-------------+
|                  column_1|target_column|
+--------------------------+-------------+
|                20-2018-12|     December|
|                2018-20-12|         null|
|                  20182012|         null|
|                  12202018|         null|
|                  20122018|         null|
|                20-12-2018|     December|
|                12/20/2018|     December|
|                  02/02/02|     February|
|                02 02 2009|     February|
|                02/02/2009|     February|
|            August/02/2009|       August|
|              02/june/2009|         null|
|              02/2020/june|         null|
|2013-02-21 06:35:45.658505|     February|
|            August 02 2009|       August|
|                2013/02/21|     February|
|                      null|         null|
+--------------------------+-------------+
```

The `MonthName` transform takes `source_column` as `"column_1"` and `target_column` as `"target_column"`. It tries to extract the month name from the date/time strings in the `"column_1"` column and places it in the `"target_column"` column. If the date/time string is in an unrecognized format or can't be parsed, the `"target_column"` value is set to `null`. The transform successfully extracts the month name from various date/time formats, such as "20-12-2018", "12/20/2018", "02/02/2009", "2013-02-21 06:35:45.658505", and "August 02 2009".

## Methods
+ [\_\_call\_\_](#aws-glue-api-pyspark-transforms-MonthName-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-MonthName-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-MonthName-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-MonthName-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-MonthName-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-MonthName-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-MonthName-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-MonthName-describe)

## \_\_call\_\_(spark\_context, data\_frame, target\_column, source\_column=None, value=None)

The `MonthName` transform creates a new column containing the name of the month, from a string that represents a date.
+ `source_column` – The name of an existing column.
+ `value` – A character string to evaluate.
+ `target_column` – The name of the newly created column.

## apply(cls, \*args, \*\*kwargs)

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)

Inherited from `GlueTransform`
[describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# IsEven class

The `IsEven` transform returns a Boolean value in a new column that indicates whether the source column or value is even. If the source column or value is a decimal, the result is false.

## Example

```
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from awsgluedi.transforms import *

sc = SparkContext()
spark = SparkSession(sc)

input_df = spark.createDataFrame(
    [(5,), (0,), (-1,), (2,), (None,)],
    ["source_column"],
)

try:
    df_output = math_functions.IsEven.apply(
        data_frame=input_df,
        spark_context=sc,
        source_column="source_column",
        target_column="target_column",
        value=None,
        true_string="Even",
        false_string="Not even",
    )
    df_output.show()
except Exception:
    print("Unexpected Error happened ")
    raise
```

## Output

The output is:

```
+-------------+-------------+
|source_column|target_column|
+-------------+-------------+
|            5|     Not even|
|            0|         Even|
|           -1|     Not even|
|            2|         Even|
|         null|         null|
+-------------+-------------+
```

The `IsEven` transform takes `source_column` as "source_column" and `target_column` as "target_column". It checks whether each value in `"source_column"` is even. For the even values (0 and 2), it sets the `"target_column"` value to the `true_string` "Even". For the odd values (5 and -1), it sets the `"target_column"` value to the `false_string` "Not even". For a `null` value in `"source_column"`, the `"target_column"` value is set to `null`.

## Methods
+ [\_\_call\_\_](#aws-glue-api-pyspark-transforms-IsEven-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-IsEven-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-IsEven-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-IsEven-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-IsEven-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-IsEven-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-IsEven-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-IsEven-describe)

## \_\_call\_\_(spark\_context, data\_frame, target\_column, source\_column=None, true\_string=DEFAULT\_TRUE\_STRING, false\_string=DEFAULT\_FALSE\_STRING, value=None)

The `IsEven` transform returns a Boolean value in a new column that indicates whether the source column or value is even. If the source column or value is a decimal, the result is false.
+ `source_column` – The name of an existing column.
+ `target_column` – The name of the new column to create.
+ `true_string` – A string that indicates that the value is even.
+ `false_string` – A string that indicates that the value is not even.

## apply(cls, \*args, \*\*kwargs)

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)

Inherited from `GlueTransform`
[describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# CryptographicHash class

The `CryptographicHash` transform applies an algorithm to hash the values in a column.

## Example

```
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from awsgluedi.transforms import *

# Placeholder for the Secrets Manager secret to use.
secret = "${SECRET}"

sc = SparkContext()
spark = SparkSession(sc)

input_df = spark.createDataFrame(
    [
        (1, "1234560000"),
        (2, "1234560001"),
        (3, "1234560002"),
        (4, "1234560003"),
        (5, "1234560004"),
        (6, "1234560005"),
        (7, "1234560006"),
        (8, "1234560007"),
        (9, "1234560008"),
        (10, "1234560009"),
    ],
    ["id", "phone"],
)

try:
    df_output = pii.CryptographicHash.apply(
        data_frame=input_df,
        spark_context=sc,
        source_columns=["id", "phone"],
        secret_id=secret,
        algorithm="HMAC_SHA256",
        output_format="BASE64",
    )
    df_output.show()
except Exception:
    print("Unexpected Error happened ")
    raise
```

## Output

The output is:

```
+---+------------+-------------------+-------------------+
| id|   phone    |     id_hashed     |   phone_hashed    |
+---+------------+-------------------+-------------------+
|  1| 1234560000 | QUI1zXTJiXmfIb... | juDBAmiRnnO3g...  |
|  2| 1234560001 | ZAUWiZ3dVTzCo...  | vC8lgUqBVDMNQ...  |
|  3| 1234560002 | ZP4VvZWkqYifu...  | Kl3QAkgswYpzB...  |
|  4| 1234560003 | 3u8vO3wQ8EQfj...  | CPBzK1P8PZZkV...  |
|  5| 1234560004 | eWkQJk4zAOIzx...  | aLf7+mHcXqbLs...  |
|  6| 1234560005 | xtI9fZCJZCvsa...  | dy2DFgdYWmr0p...  |
|  7| 1234560006 | iW9hew7jnHuOf...  | wwfGMCOEv6oOv...  |
|  8| 1234560007 | H9V1pqvgkFhfS...  | g9WKhagIXy9ht...  |
|  9| 1234560008 | xDhEuHaxAUbU5...  | b3uQLKPY+Q5vU...  |
| 10| 1234560009 | GRN6nFXkxk349...  | VJdsKt8VbxBbt...  |
+---+------------+-------------------+-------------------+
```

The transform computes cryptographic hashes of the values in the `id` and `phone` columns using the specified algorithm and secret key, and encodes the hashes in Base64 format. The resulting `df_output` DataFrame contains all columns from the original `input_df` DataFrame, plus the additional `id_hashed` and `phone_hashed` columns with the computed hashes.

## Methods
+ [\_\_call\_\_](#aws-glue-api-pyspark-transforms-CryptographicHash-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-CryptographicHash-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-CryptographicHash-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-CryptographicHash-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-CryptographicHash-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-CryptographicHash-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-CryptographicHash-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-CryptographicHash-describe)

## \_\_call\_\_(spark\_context, data\_frame, source\_columns, secret\_id, algorithm=None, secret\_version=None, create\_secret\_if\_missing=False, output\_format=None, entity\_type\_filter=None)

The `CryptographicHash` transform applies an algorithm to hash the values in a column.
+ `source_columns` – An array of existing columns.
+ `secret_id` – The ARN of the Secrets Manager secret key. This is the key used in the hash-based message authentication code (HMAC) prefix algorithm to hash the source columns.
+ `secret_version` – Optional. Defaults to the latest secret version.
+ `entity_type_filter` – Optional array of entity types. Can be used to hash only detected PII in free-text columns.
+ `create_secret_if_missing` – Optional Boolean. If true, the transform attempts to create the secret on behalf of the caller.
+ `algorithm` – The algorithm used to hash your data. Valid enum values: MD5, SHA1, SHA256, SHA512, HMAC\_MD5, HMAC\_SHA1, HMAC\_SHA256, HMAC\_SHA512.

## apply(cls, \*args, \*\*kwargs)

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)

Inherited from `GlueTransform`
[describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# Decrypt class

The `Decrypt` transform decrypts data inside AWS Glue. Your data can also be decrypted outside AWS Glue by using the AWS Encryption SDK. If the provided KMS key ARN doesn't match the key that was used to encrypt the column, the decrypt operation fails.

## Example

```
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from awsgluedi.transforms import *

# Placeholder for the KMS key ARN to use.
kms = "${KMS}"

sc = SparkContext()
spark = SparkSession(sc)

input_df = spark.createDataFrame(
    [
        (1, "1234560000"),
        (2, "1234560001"),
        (3, "1234560002"),
        (4, "1234560003"),
        (5, "1234560004"),
        (6, "1234560005"),
        (7, "1234560006"),
        (8, "1234560007"),
        (9, "1234560008"),
        (10, "1234560009"),
    ],
    ["id", "phone"],
)

try:
    # Encrypt the phone column first, then decrypt it with the same key.
    df_encrypt = pii.Encrypt.apply(
        data_frame=input_df,
        spark_context=sc,
        source_columns=["phone"],
        kms_key_arn=kms
    )
    df_decrypt = pii.Decrypt.apply(
        data_frame=df_encrypt,
        spark_context=sc,
        source_columns=["phone"],
        kms_key_arn=kms
    )
    df_decrypt.show()
except Exception:
    print("Unexpected Error happened ")
    raise
```

## Output

The output is a PySpark DataFrame with the original `id` column and the decrypted `phone` column:

```
+---+------------+
| id|       phone|
+---+------------+
|  1|  1234560000|
|  2|  1234560001|
|  3|  1234560002|
|  4|  1234560003|
|  5|  1234560004|
|  6|  1234560005|
|  7|  1234560006|
|  8|  1234560007|
|  9|  1234560008|
| 10|  1234560009|
+---+------------+
```

The `Encrypt` transform takes `source_columns` as `["phone"]` and `kms_key_arn` as the value of `${KMS}`. It encrypts the values in the `phone` column using the specified KMS key. The encrypted DataFrame `df_encrypt` is then passed to the `Decrypt` transform from the `pii` module, which takes `source_columns` as `["phone"]` and `kms_key_arn` as the value of `${KMS}`. The transform decrypts the encrypted values in the `phone` column using the same KMS key. The resulting `df_decrypt` DataFrame contains the original `id` column and the decrypted `phone` column.

## Methods
+ [\_\_call\_\_](#aws-glue-api-pyspark-transforms-Decrypt-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-Decrypt-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-Decrypt-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-Decrypt-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-Decrypt-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-Decrypt-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-Decrypt-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-Decrypt-describe)

## \_\_call\_\_(spark\_context, data\_frame, source\_columns, kms\_key\_arn)

The `Decrypt` transform decrypts data inside AWS Glue. Your data can also be decrypted outside AWS Glue by using the AWS Encryption SDK. If the provided KMS key ARN doesn't match the key that was used to encrypt the column, the decrypt operation fails.
+ `source_columns` – An array of existing columns.
+ `kms_key_arn` – The key ARN of the AWS Key Management Service key to use to decrypt the source columns.

## apply(cls, \*args, \*\*kwargs)

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)

Inherited from `GlueTransform`
[describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# Encrypt class

The `Encrypt` transform encrypts source columns using an AWS Key Management Service key. The `Encrypt` transform can encrypt up to 128 MiB per cell. It attempts to preserve the format on decryption. To preserve the data type, the data type metadata must serialize to less than 1 KB. Otherwise, you must set the `preserve_data_type` parameter to false. The data type metadata is stored in plaintext in the encryption context.

## Example

```
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from awsgluedi.transforms import *

# Placeholder for the KMS key ARN to use.
kms = "${KMS}"

sc = SparkContext()
spark = SparkSession(sc)

input_df = spark.createDataFrame(
    [
        (1, "1234560000"),
        (2, "1234560001"),
        (3, "1234560002"),
        (4, "1234560003"),
        (5, "1234560004"),
        (6, "1234560005"),
        (7, "1234560006"),
        (8, "1234560007"),
        (9, "1234560008"),
        (10, "1234560009"),
    ],
    ["id", "phone"],
)

try:
    df_encrypt = pii.Encrypt.apply(
        data_frame=input_df,
        spark_context=sc,
        source_columns=["phone"],
        kms_key_arn=kms
    )
except Exception:
    print("Unexpected Error happened ")
    raise
```

## Output

The output is a PySpark DataFrame with the original `id` column and an additional column that contains the encrypted values of the `phone` column.

```
+---+------------+-------------------------+
| id|       phone|     phone_encrypted     |
+---+------------+-------------------------+
|  1|  1234560000| EncryptedData1234...abc |
|  2|  1234560001| EncryptedData5678...def |
|  3|  1234560002| EncryptedData9012...ghi |
|  4|  1234560003| EncryptedData3456...jkl |
|  5|  1234560004| EncryptedData7890...mno |
|  6|  1234560005| EncryptedData1234...pqr |
|  7|  1234560006| EncryptedData5678...stu |
|  8|  1234560007| EncryptedData9012...vwx |
|  9|  1234560008| EncryptedData3456...yz0 |
| 10|  1234560009| EncryptedData7890...123 |
+---+------------+-------------------------+
```

The `Encrypt` transform takes `source_columns` as `["phone"]` and `kms_key_arn` as the value of `${KMS}`. It encrypts the values in the `phone` column using the specified KMS key. The resulting `df_encrypt` DataFrame contains the original `id` column, the original `phone` column, and an additional column named `phone_encrypted` that contains the encrypted values of the `phone` column.

## Methods
+ [\_\_call\_\_](#aws-glue-api-pyspark-transforms-Encrypt-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-Encrypt-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-Encrypt-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-Encrypt-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-Encrypt-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-Encrypt-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-Encrypt-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-Encrypt-describe)

## \_\_call\_\_(spark\_context, data\_frame, source\_columns, kms\_key\_arn, entity\_type\_filter=None, preserve\_data\_type=None)

The `Encrypt` transform encrypts source columns using an AWS Key Management Service key.
+ `source_columns` – An array of existing columns.
+ `kms_key_arn` – The key ARN of the AWS Key Management Service key to use to encrypt the source columns.
+ `entity_type_filter` – Optional array of entity types. Can be used to encrypt only detected PII in free-text columns.
+ `preserve_data_type` – Optional Boolean. Defaults to true. If false, the data type is not stored.

## apply(cls, \*args, \*\*kwargs)

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)

Inherited from `GlueTransform`
[describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# IntToIp class

The `IntToIp` transform converts the integer value of a source column or other value to the corresponding IPv4 value, and returns the result in a new target column.

## Example

```
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from awsgluedi.transforms import *

sc = SparkContext()
spark = SparkSession(sc)

# Includes values outside the unsigned 32-bit range (4294967296 and -1).
input_df = spark.createDataFrame(
    [
        (3221225473,),
        (0,),
        (1,),
        (100,),
        (168430090,),
        (4294967295,),
        (4294967294,),
        (4294967296,),
        (-1,),
        (None,),
    ],
    ["source_column_int"],
)

try:
    df_output = web_functions.IntToIp.apply(
        data_frame=input_df,
        spark_context=sc,
        source_column="source_column_int",
        target_column="target_column",
        value=None
    )
    df_output.show()
except Exception:
    print("Unexpected Error happened ")
    raise
```

## Output

The output is:

```
+-----------------+---------------+
|source_column_int|  target_column|
+-----------------+---------------+
|       3221225473|      192.0.0.1|
|                0|        0.0.0.0|
|                1|        0.0.0.1|
|              100|      0.0.0.100|
|        168430090|    10.10.10.10|
|       4294967295|255.255.255.255|
|       4294967294|255.255.255.254|
|       4294967296|           null|
|               -1|           null|
|             null|           null|
+-----------------+---------------+
```

The `IntToIp.apply` transform takes `source_column` as `"source_column_int"` and `target_column` as `"target_column"`. It converts the integer values in the `source_column_int` column to their corresponding IPv4 address representation and stores the result in the `target_column` column. For valid integer values within the IPv4 address range (0 through 4294967295), the transform converts them to their IPv4 address representation (for example, 192.0.0.1, 0.0.0.0, 10.10.10.10, 255.255.255.255). For integer values outside the valid range (for example, 4294967296 and -1), the `target_column` value is set to `null`. For `null` values in the `source_column_int` column, the `target_column` value is also set to `null`.

## Methods
+ [\_\_call\_\_](#aws-glue-api-pyspark-transforms-IntToIp-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-IntToIp-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-IntToIp-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-IntToIp-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-IntToIp-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-IntToIp-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-IntToIp-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-IntToIp-describe)

## \_\_call\_\_(spark\_context, data\_frame, target\_column, source\_column=None, value=None)

The `IntToIp` transform converts the integer value of a source column or other value to the corresponding IPv4 value, and returns the result in a new target column.
+ `source_column` – The name of an existing column.
+ `value` – A character string to evaluate.
+ `target_column` – The name of the new column to create.

## apply(cls, \*args, \*\*kwargs)

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)

Inherited from `GlueTransform`
[describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

# IpToInt class

The `IpToInt` transform converts the Internet Protocol version 4 (IPv4) value of a source column or other value to the corresponding integer value, and returns the result in a new target column.

## Example

For AWS Glue 4.0 and later, create or update the job with the argument `key: --enable-glue-di-transforms, value: true`.

```
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from awsgluedi.transforms import *

sc = SparkContext()
spark = SparkSession(sc)

# Mixes valid IPv4 strings with URLs, non-IP strings, and nulls.
input_df = spark.createDataFrame(
    [
        ("192.0.0.1",),
        ("10.10.10.10",),
        ("1.2.3.4",),
        ("1.2.3.6",),
        ("http://12.13.14.15",),
        ("https://16.17.18.19",),
        ("1.2.3.4",),
        (None,),
        ("abc",),
        ("abc.abc.abc.abc",),
        ("321.123.123.123",),
        ("244.4.4.4",),
        ("255.255.255.255",),
    ],
    ["source_column_ip"],
)

df_output = web_functions.IpToInt.apply(
    data_frame=input_df,
    spark_context=sc,
    source_column="source_column_ip",
    target_column="target_column",
    value=None
)
df_output.show()
```

## Output

The output is:

```
+-------------------+-------------+
|   source_column_ip|target_column|
+-------------------+-------------+
|          192.0.0.1|   3221225473|
|        10.10.10.10|    168430090|
|            1.2.3.4|     16909060|
|            1.2.3.6|     16909062|
| http://12.13.14.15|         null|
|https://16.17.18.19|         null|
|            1.2.3.4|     16909060|
|               null|         null|
|                abc|         null|
|    abc.abc.abc.abc|         null|
|    321.123.123.123|         null|
|          244.4.4.4|   4093903876|
|    255.255.255.255|   4294967295|
+-------------------+-------------+
```

The `IpToInt` transform takes `source_column` as `"source_column_ip"` and `target_column` as `"target_column"`. It converts the valid IPv4 address strings in the `source_column_ip` column to their corresponding 32-bit integer representation and stores the result in the `target_column` column. For valid IPv4 address strings (for example, "192.0.0.1", "10.10.10.10", "1.2.3.4"), the transform converts them to their integer representation (for example, 3221225473, 168430090, 16909060). For strings that are not valid IPv4 addresses (for example, URLs, non-IP strings such as "abc", and invalid IP formats such as "abc.abc.abc.abc"), the `target_column` value is set to `null`. For `null` values in the `source_column_ip` column, the `target_column` value is also set to `null`.

## Methods
+ [\_\_call\_\_](#aws-glue-api-pyspark-transforms-IpToInt-__call__)
+ [apply](#aws-glue-api-crawler-pyspark-transforms-IpToInt-apply)
+ [name](#aws-glue-api-crawler-pyspark-transforms-IpToInt-name)
+ [describeArgs](#aws-glue-api-crawler-pyspark-transforms-IpToInt-describeArgs)
+ [describeReturn](#aws-glue-api-crawler-pyspark-transforms-IpToInt-describeReturn)
+ [describeTransform](#aws-glue-api-crawler-pyspark-transforms-IpToInt-describeTransform)
+ [describeErrors](#aws-glue-api-crawler-pyspark-transforms-IpToInt-describeErrors)
+ [describe](#aws-glue-api-crawler-pyspark-transforms-IpToInt-describe)

## \_\_call\_\_(spark\_context, data\_frame, target\_column, source\_column=None, value=None)

The `IpToInt` transform converts the Internet Protocol version 4 (IPv4) value of a source column or other value to the corresponding integer value, and returns the result in a new target column.
+ `source_column` – The name of an existing column.
+ `value` – A character string to evaluate.
+ `target_column` – The name of the new column to create.

## apply(cls, \*args, \*\*kwargs)

Inherited from `GlueTransform` [apply](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-apply).

## name(cls)

Inherited from `GlueTransform` [name](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-name).

## describeArgs(cls)

Inherited from `GlueTransform` [describeArgs](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeArgs).

## describeReturn(cls)

Inherited from `GlueTransform` [describeReturn](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeReturn).

## describeTransform(cls)

Inherited from `GlueTransform` [describeTransform](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeTransform).

## describeErrors(cls)

Inherited from `GlueTransform` [describeErrors](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describeErrors).

## describe(cls)

Inherited from `GlueTransform` [describe](aws-glue-api-crawler-pyspark-transforms-GlueTransform.md#aws-glue-api-crawler-pyspark-transforms-GlueTransform-describe).

## Data integration transforms

For AWS Glue 4.0 and later, create or update the job with the argument `key: --enable-glue-di-transforms, value: true`.

Example job script:

```
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from awsgluedi.transforms import *

sc = SparkContext()
spark = SparkSession(sc)

input_df = spark.createDataFrame(
    [(5,), (0,), (-1,), (2,), (None,)],
    ["source_column"],
)

try:
    df_output = math_functions.IsEven.apply(
        data_frame=input_df,
        spark_context=sc,
        source_column="source_column",
        target_column="target_column",
        value=None,
        true_string="Even",
        false_string="Not even",
    )
    df_output.show()
except Exception:
    print("Unexpected Error happened ")
    raise
```

Example session using notebooks:

```
%idle_timeout 2880
%glue_version 4.0
%worker_type G.1X
%number_of_workers 5
%region eu-west-1
```

```
%%configure
{
    "--enable-glue-di-transforms": "true"
}
```

```
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from awsgluedi.transforms import *

sc = SparkContext()
spark = SparkSession(sc)

input_df = spark.createDataFrame(
    [(5,), (0,), (-1,), (2,), (None,)],
    ["source_column"],
)

try:
    df_output = math_functions.IsEven.apply(
        data_frame=input_df,
        spark_context=sc,
        source_column="source_column",
        target_column="target_column",
        value=None,
        true_string="Even",
        false_string="Not even",
    )
    df_output.show()
except Exception:
    print("Unexpected Error happened ")
    raise
```

Example session using the AWS CLI:

```
aws glue create-session --default-arguments "--enable-glue-di-transforms=true"
```

DI transforms:
+ [FlagDuplicatesInColumn class](aws-glue-api-pyspark-transforms-FlagDuplicatesInColumn.md)
+ [FormatPhoneNumber class](aws-glue-api-pyspark-transforms-FormatPhoneNumber.md)
+ [FormatCase class](aws-glue-api-pyspark-transforms-FormatCase.md)
+ [FillWithMode class](aws-glue-api-pyspark-transforms-FillWithMode.md)
+ [FlagDuplicateRows class](aws-glue-api-pyspark-transforms-FlagDuplicateRows.md)
+ [RemoveDuplicates class](aws-glue-api-pyspark-transforms-RemoveDuplicates.md)
+ [MonthName class](aws-glue-api-pyspark-transforms-MonthName.md)
+ [IsEven class](aws-glue-api-pyspark-transforms-IsEven.md)
+ [CryptographicHash class](aws-glue-api-pyspark-transforms-CryptographicHash.md)
+ [Decrypt class](aws-glue-api-pyspark-transforms-Decrypt.md)
+ [Encrypt class](aws-glue-api-pyspark-transforms-Encrypt.md)
+ [IntToIp class](aws-glue-api-pyspark-transforms-IntToIp.md)
+ [IpToInt class](aws-glue-api-pyspark-transforms-IpToInt.md)

### Maven: Bundle the plugin with your Spark applications

While developing your Spark applications locally, you can bundle the transforms dependency with your Spark applications and Spark distributions (version 3.3) by adding the plugin dependency to your Maven `pom.xml`.

```
<repositories>
    ...
    <repository>
        <id>aws-glue-etl-artifacts</id>
        <url>https://aws-glue-etl-artifacts.s3.amazonaws.com/release/</url>
    </repository>
</repositories>
...
<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>AWSGlueTransforms</artifactId>
    <version>4.0.0</version>
</dependency>
```

Alternatively, you can download the binaries directly from the AWS Glue Maven artifacts and include them in your Spark application as follows.

```
#!/bin/bash
sudo wget -v https://aws-glue-etl-artifacts.s3.amazonaws.com/release/com/amazonaws/AWSGlueTransforms/4.0.0/AWSGlueTransforms-4.0.0.jar -P /usr/lib/spark/jars/
```
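To sanity-check values locally before running a job, the integer/IPv4 mapping that the `IntToIp` and `IpToInt` transforms describe can be reproduced with the Python standard library. The following is a minimal sketch under stated assumptions, not the AWS Glue implementation: the helper names `int_to_ipv4` and `ipv4_to_int` are hypothetical, and returning `None` for out-of-range or unparseable input is chosen only to mirror the `null` results described for those transforms.

```python
import ipaddress
from typing import Optional


def int_to_ipv4(value: Optional[int]) -> Optional[str]:
    """Hypothetical helper: dotted-quad text for an unsigned 32-bit integer, else None."""
    if value is None or isinstance(value, bool) or not isinstance(value, int):
        return None
    # Only 0 through 4294967295 maps to an IPv4 address.
    if not 0 <= value <= 0xFFFFFFFF:
        return None
    return str(ipaddress.IPv4Address(value))


def ipv4_to_int(text: Optional[str]) -> Optional[int]:
    """Hypothetical helper: 32-bit integer for a dotted-quad IPv4 string, else None."""
    if text is None:
        return None
    try:
        return int(ipaddress.IPv4Address(text))
    except ipaddress.AddressValueError:
        # URLs, non-numeric text, and octets above 255 are not valid IPv4 addresses.
        return None


print(int_to_ipv4(3221225473))            # 192.0.0.1
print(int_to_ipv4(4294967296))            # None (outside the 32-bit range)
print(ipv4_to_int("1.2.3.4"))             # 16909060
print(ipv4_to_int("http://12.13.14.15"))  # None (not a bare IPv4 address)
```

Each octet contributes `octet * 256**position`, so `1.2.3.4` is `1*16777216 + 2*65536 + 3*256 + 4 = 16909060`. Comparing a few values this way is a quick check on a job's `target_column` results.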