Converting ETL processes to AWS Glue in AWS Schema Conversion Tool

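In the following sections, you can find a description of a conversion that calls AWS Glue API operations in Python. For more information, see Program AWS Glue ETL scripts in Python in the AWS Glue Developer Guide.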

Topics

Step 1: Create a database
Step 2: Create a connection
Step 3: Create an AWS Glue crawler

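Step 1: Create a database

The first step is to create a new database in the AWS Glue Data Catalog by using the AWS SDK API. When you define a table in the Data Catalog, you add it to a database. A database is used to organize the tables in AWS Glue.

The following example demonstrates the create_database method of the Python API for AWS Glue.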


response = client.create_database(
    DatabaseInput={
        'Name': 'database_name',
        'Description': 'description',
        'LocationUri': 'string',
        'Parameters': {         
            'parameter-name': 'parameter value'
        }
    }
)
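
The call above assumes that a `client` object already exists and that the database name is not yet taken. The following is a minimal sketch of both pieces, assuming boto3 is installed and AWS credentials are configured. The helper names `make_glue_client` and `ensure_database` and the default Region are hypothetical and are not part of AWS SCT or the AWS SDK.

```python
def make_glue_client(region_name="us-west-2"):
    """Return a boto3 AWS Glue client (the Region is an assumed default)."""
    import boto3  # imported lazily so ensure_database can be exercised without AWS access
    return boto3.client("glue", region_name=region_name)


def ensure_database(client, name, description=""):
    """Create the database unless it already exists.

    Returns True if the database was created, False if it was already there.
    """
    try:
        client.create_database(
            DatabaseInput={"Name": name, "Description": description}
        )
        return True
    except client.exceptions.AlreadyExistsException:
        # AWS Glue reports an existing database with AlreadyExistsException.
        return False
```

A typical call would be `ensure_database(make_glue_client(), 'database_name')`.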

If you are using Amazon Redshift, the database name is formed as follows.


{redshift_cluster_name}_{redshift_database_name}_{redshift_schema_name}

The full name of the Amazon Redshift cluster for this example is as follows.


rsdbb03.apq1mpqso.us-west-2.redshift.amazonaws.com

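The following shows an example of a well-formed database name. In this case, rsdbb03 is the cluster name, which is the first part of the full name of the cluster endpoint. The database is named dev and the schema is ora_glue.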


rsdbb03_dev_ora_glue
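
The naming rule above can be sketched as a small helper. The function name `glue_database_name` is hypothetical; the only logic it applies is the rule stated in this section, which takes the first dot-separated part of the cluster endpoint as the cluster name.

```python
def glue_database_name(cluster_endpoint, database, schema):
    """Build {redshift_cluster_name}_{redshift_database_name}_{redshift_schema_name}."""
    # The cluster name is the first dot-separated part of the endpoint.
    cluster_name = cluster_endpoint.split(".", 1)[0]
    return f"{cluster_name}_{database}_{schema}"


print(glue_database_name(
    "rsdbb03.apq1mpqso.us-west-2.redshift.amazonaws.com", "dev", "ora_glue"
))
# prints rsdbb03_dev_ora_glue
```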

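Step 2: Create a connection

Create a new connection in the Data Catalog by using the AWS SDK API.

The following example demonstrates the create_connection method of the Python API for AWS Glue.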


response = client.create_connection(
    ConnectionInput={
        'Name': 'Redshift_abcde03.aabbcc112233.us-west-2.redshift.amazonaws.com_dev',
        'Description': 'Created from SCT',
        'ConnectionType': 'JDBC',
        'ConnectionProperties': {
            'JDBC_CONNECTION_URL': 'jdbc:redshift://aabbcc03.aabbcc112233.us-west-2.redshift.amazonaws.com:5439/dev',
            'USERNAME': 'user_name',
            'PASSWORD': 'password'
        },
        'PhysicalConnectionRequirements': {
            'AvailabilityZone': 'us-west-2c',
            'SubnetId': 'subnet-a1b23c45',
            'SecurityGroupIdList': [
                'sg-000a2b3c', 'sg-1a230b4c', 'sg-aba12c3d', 'sg-1abb2345'
            ]
        }
    }
)

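The parameters used in create_connection are as follows:

Name (UTF-8 string): Required. For Amazon Redshift, the connection name is formed as follows: Redshift_<Endpoint-name>_<redshift-database-name>, for example: Redshift_abcde03_dev
Description (UTF-8 string): A description of the connection.
ConnectionType (UTF-8 string): Required. The type of connection. Currently, only JDBC is supported; SFTP is not supported.
ConnectionProperties (dict): Required. A list of key-value pairs used as parameters for this connection, including the JDBC connection URL, the user name, and the password.
PhysicalConnectionRequirements (dict): Physical connection requirements, which include the following:
- SubnetId (UTF-8 string): The ID of the subnet used by the connection.
- SecurityGroupIdList (list): The security group ID list used by the connection.
- AvailabilityZone (UTF-8 string): Required. The Availability Zone that contains the endpoint. This parameter is deprecated.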
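
As an illustration of how these parameters fit together, the following sketch assembles a ConnectionInput dictionary for Amazon Redshift. The helper name `redshift_connection_input` is hypothetical. The sketch uses the short name form Redshift_<Endpoint-name>_<redshift-database-name> from the list above, and it treats AvailabilityZone as optional because the list marks it as deprecated; both choices are assumptions, not behavior confirmed by AWS SCT.

```python
def redshift_connection_input(endpoint, database, username, password,
                              subnet_id, security_group_ids,
                              port=5439, availability_zone=None):
    """Assemble the ConnectionInput argument for create_connection."""
    # Endpoint name: the first dot-separated part of the cluster endpoint.
    endpoint_name = endpoint.split(".", 1)[0]
    physical = {
        "SubnetId": subnet_id,
        "SecurityGroupIdList": list(security_group_ids),
    }
    if availability_zone is not None:
        # Assumption: only sent when the caller supplies it (marked deprecated above).
        physical["AvailabilityZone"] = availability_zone
    return {
        "Name": f"Redshift_{endpoint_name}_{database}",
        "Description": "Created from SCT",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": f"jdbc:redshift://{endpoint}:{port}/{database}",
            "USERNAME": username,
            "PASSWORD": password,
        },
        "PhysicalConnectionRequirements": physical,
    }
```

The result can be passed directly as `client.create_connection(ConnectionInput=redshift_connection_input(...))`. In practice the password would come from a secret store rather than from source code.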

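Step 3: Create an AWS Glue crawler

Next, you create an AWS Glue crawler to populate the AWS Glue catalog. For more information, see Cataloging tables with a crawler in the AWS Glue Developer Guide.

The first step in adding a crawler is to create a new database in the Data Catalog by using the AWS SDK API. Before you begin, make sure to first delete any previous version of the crawler by using the delete_crawler operation.

When you create your crawler, a few considerations apply:

For the crawler name, use the format <redshift_node_name>_<redshift_database_name>_<redshift_schema_name>, for example: abcde03_dev_ora_glue
Use an IAM role that already exists. For more information about creating IAM roles, see Creating IAM roles in the IAM User Guide.
Use the name of the database that you created in the previous steps.
Use the ConnectionName parameter, which is required.
For the path parameter, use the path to the JDBC target, for example: dev/ora_glue/%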
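
The two naming conventions in this list can be expressed as short helpers. The function names `crawler_name` and `jdbc_include_path` are hypothetical and only restate the formats given above.

```python
def crawler_name(node_name, database, schema):
    """Build <redshift_node_name>_<redshift_database_name>_<redshift_schema_name>."""
    return f"{node_name}_{database}_{schema}"


def jdbc_include_path(database, schema):
    """Build the JDBC include path for every table in one schema."""
    # "%" is the wildcard that matches all tables in the schema.
    return f"{database}/{schema}/%"


print(crawler_name("abcde03", "dev", "ora_glue"))   # prints abcde03_dev_ora_glue
print(jdbc_include_path("dev", "ora_glue"))         # prints dev/ora_glue/%
```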

The following example deletes an existing crawler and then creates a new one by using the Python API for AWS Glue.


response = client.delete_crawler(
    Name='crawler_name'
)

response = client.create_crawler(
    Name='crawler_name',
    Role='IAM_role',
    DatabaseName='database_name',
    Description='string',
    Targets={
        'S3Targets': [
            {
                'Path': 'string',
                'Exclusions': [
                    'string',
                ]
            },
        ],
        'JdbcTargets': [
            {
                'ConnectionName': 'ConnectionName',
                'Path': 'Include_path',
                'Exclusions': [
                    'string',
                ]
            },
        ]
    },
    Schedule='string',
    Classifiers=[
        'string',
    ],
    TablePrefix='string',
    SchemaChangePolicy={
        'UpdateBehavior': 'UPDATE_IN_DATABASE',  # or 'LOG'
        'DeleteBehavior': 'DEPRECATE_IN_DATABASE'  # or 'LOG' or 'DELETE_FROM_DATABASE'
    },
    Configuration='string'
)
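
When no previous crawler exists, delete_crawler fails with EntityNotFoundException, so the delete-then-create sequence is easier to rerun if that error is tolerated. The following is a minimal sketch under that assumption. The helper name `recreate_crawler` and the chosen SchemaChangePolicy values are illustrative only, and the sketch sets just a JDBC target.

```python
def recreate_crawler(client, name, role, database_name,
                     connection_name, include_path):
    """Delete any previous crawler with this name, then create a new one."""
    try:
        client.delete_crawler(Name=name)
    except client.exceptions.EntityNotFoundException:
        # No earlier version of the crawler exists, so there is nothing to delete.
        pass
    return client.create_crawler(
        Name=name,
        Role=role,
        DatabaseName=database_name,
        Targets={
            "JdbcTargets": [
                {"ConnectionName": connection_name, "Path": include_path}
            ]
        },
        # Illustrative policy choice; adjust to your own requirements.
        SchemaChangePolicy={
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "DEPRECATE_IN_DATABASE",
        },
    )
```

Other delete_crawler errors, such as an error raised while the crawler is still running, are deliberately left to propagate to the caller.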

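Create and then run a crawler that connects to one or more data stores, determines the data structures, and writes tables into the Data Catalog. You can run your crawler on a schedule, or start it on demand as shown following.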


response = client.start_crawler(
    Name='string'
)
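
start_crawler returns as soon as the run has been requested, so a script that needs the resulting tables has to wait for the crawler to finish. The following sketch polls get_crawler until the crawler state returns to READY. The helper name `run_crawler_and_wait`, the polling interval, and the timeout are assumptions, and the sketch also assumes that the state has already left READY by the time start_crawler returns.

```python
import time


def run_crawler_and_wait(client, name, poll_seconds=15,
                         timeout_seconds=1800, sleep=time.sleep):
    """Start the crawler and poll until its state is READY again.

    Returns True if the crawler finished within the timeout, False otherwise.
    """
    client.start_crawler(Name=name)
    waited = 0
    while waited < timeout_seconds:
        # Crawler states are READY, RUNNING, and STOPPING.
        state = client.get_crawler(Name=name)["Crawler"]["State"]
        if state == "READY":
            return True
        sleep(poll_seconds)
        waited += poll_seconds
    return False
```

The `sleep` parameter exists only so that the wait can be replaced in tests; normal callers can leave it at its default.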

This example uses Amazon Redshift as the target. After the crawler runs, Amazon Redshift data types map to AWS Glue data types in the following way.

Amazon Redshift data type	AWS Glue data type
smallint	smallint
integer	int
bigint	bigint
decimal	decimal(18,0)
decimal(p,s)	decimal(p,s)
real	double
double precision	double
boolean	boolean
char	string
varchar	string
varchar(n)	string
date	date
timestamp	timestamp
timestamptz	timestamp
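
For scripts that need to check crawler output against the source schema, the table above can be encoded as a lookup. This is a sketch only: the names `REDSHIFT_TO_GLUE` and `glue_type` are hypothetical, and the function covers just the types listed in the table, raising KeyError for anything else.

```python
import re

# Direct transcription of the mapping table above.
REDSHIFT_TO_GLUE = {
    "smallint": "smallint",
    "integer": "int",
    "bigint": "bigint",
    "decimal": "decimal(18,0)",
    "real": "double",
    "double precision": "double",
    "boolean": "boolean",
    "char": "string",
    "varchar": "string",
    "date": "date",
    "timestamp": "timestamp",
    "timestamptz": "timestamp",
}


def glue_type(redshift_type):
    """Return the AWS Glue type for a Redshift type listed in the table."""
    # Normalize case and collapse repeated whitespace.
    t = " ".join(redshift_type.strip().lower().split())
    match = re.fullmatch(r"(decimal|varchar)\s*\((.*)\)", t)
    if match:
        base, args = match.groups()
        if base == "decimal":
            # decimal(p,s) keeps its precision and scale.
            return f"decimal({args.replace(' ', '')})"
        # varchar(n) loses its length and becomes string.
        return "string"
    return REDSHIFT_TO_GLUE[t]
```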
