Reading archived S3 objects with Standard storage class - AWS Prescriptive Guidance

Using AWS Glue

The data offloaded from MySQL to Amazon S3 retains the rigid structure and consistency typical of a relational database management system (RDBMS).

An AWS Glue crawler scans the S3 objects, infers the data types, and creates the table metadata as an external table DDL. When you configure the crawler, use Amazon S3 as the source, and specify the S3 prefix where the data files are created. In the configuration, include the following:

  • Crawler run options

  • Optional table prefix preference

  • Target database for creating the table

  • IAM roles with required permissions
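As a sketch, the configuration above could also be created programmatically with the Boto3 Glue client. The crawler name, database name, table prefix, S3 path, and IAM role ARN below are placeholders, not values from this guide:

```python
# Sketch: create an AWS Glue crawler over the S3 prefix that holds the
# offloaded data files. All names, paths, and ARNs are placeholders.
crawler_config = {
    "Name": "mysql-offload-crawler",                            # placeholder crawler name
    "Role": "arn:aws:iam::111122223333:role/GlueCrawlerRole",   # IAM role with required permissions
    "DatabaseName": "mysql_offload_db",                         # target database for the tables
    "TablePrefix": "offload_",                                  # optional table prefix preference
    "Targets": {
        "S3Targets": [
            # S3 prefix location where the data files are created
            {"Path": "s3://example-bucket/mysql-offload/"}
        ]
    },
}

def create_and_start_crawler(config):
    """Create the crawler and run it once. Requires AWS credentials."""
    import boto3

    glue = boto3.client("glue")
    glue.create_crawler(**config)
    glue.start_crawler(Name=config["Name"])
```

Calling `create_and_start_crawler(crawler_config)` in an environment with AWS credentials would create and invoke the crawler; the configuration dictionary mirrors the list of settings above.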

After you run the crawler, it scans the data to infer the schema and stores it in the AWS Glue Data Catalog as AWS Glue tables. AWS Glue tables are essentially external tables that can be queried with SQL statements like normal database tables by using analytical services such as Amazon Athena, Amazon Redshift Spectrum, and Apache Hive on Amazon EMR. For more information about the crawler, see the AWS Glue documentation.

For .csv files with a column header, the resulting table column names reflect the same field names. The data types are inferred from the values in the data object.

For Parquet files, the schema is embedded in the data itself, and the resulting table reflects the same field names and data types.

Alternatively, you can run a DDL statement manually in Athena to create the table definition with the required column names and data types. This creates the table definition in the Data Catalog. For more information about creating Athena tables, see the Amazon Athena documentation.
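A minimal sketch of the manual approach, submitted through the Boto3 Athena client. The database, table, columns, S3 location, and results bucket are illustrative, not part of this guide:

```python
# Sketch: define the external table manually with a DDL statement instead of
# running the crawler. Every name and path below is an illustrative placeholder.
CREATE_TABLE_DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS mysql_offload_db.orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_date  DATE,
    total       DECIMAL(10,2)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://example-bucket/mysql-offload/orders/'
TBLPROPERTIES ('skip.header.line.count' = '1')
"""

def run_ddl(ddl, output_location="s3://example-bucket/athena-results/"):
    """Submit the DDL to Athena. Requires AWS credentials."""
    import boto3

    athena = boto3.client("athena")
    return athena.start_query_execution(
        QueryString=ddl,
        ResultConfiguration={"OutputLocation": output_location},
    )
```

The `skip.header.line.count` table property tells Athena to ignore the CSV header row, so the header names in the file don't appear as a data row.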

Note: If the header row is missing from the CSV file, the crawler generates generic field names such as c_0, c_1, c_2, and so on.

Using Amazon S3 Select

You can use Amazon S3 Select to read S3 objects programmatically by using SQL expressions. You can invoke the API operation by using the AWS CLI command select-object-content, or by using an SDK such as Boto3 and calling the select_object_content operation from Python.

The API operation accepts a SQL statement as a parameter and can read files of type CSV, JSON, and Parquet. The output can be redirected to output files.

The operation is invoked per S3 object. To process multiple files, invoke it for each object in turn.
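A hedged Boto3 sketch of that pattern: list the objects under a prefix and run the same S3 Select expression against each one. The bucket name, prefix, and query are placeholders; the small `collect_records` helper just stitches the streamed result payloads back together:

```python
def collect_records(event_stream):
    """Concatenate the Records payloads from a select_object_content event stream."""
    chunks = []
    for event in event_stream:
        if "Records" in event:           # other event types (Stats, End, ...) are skipped
            chunks.append(event["Records"]["Payload"])
    return b"".join(chunks)

def select_from_prefix(bucket, prefix, expression):
    """Run an S3 Select expression against every object under a prefix.

    Requires AWS credentials; the bucket/prefix values passed in are
    placeholders chosen by the caller.
    """
    import boto3

    s3 = boto3.client("s3")
    results = {}
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            response = s3.select_object_content(
                Bucket=bucket,
                Key=obj["Key"],
                Expression=expression,    # e.g. "SELECT * FROM s3object s"
                ExpressionType="SQL",
                InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
                OutputSerialization={"CSV": {}},
            )
            # The response payload is an event stream, not a plain body.
            results[obj["Key"]] = collect_records(response["Payload"])
    return results
```

The results could then be written to local output files, one per object, mirroring the output-redirection behavior described above.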

For more information about running the operations by using AWS CLI, see the AWS CLI documentation. For more information about running S3 Select by using the Python SDK Boto3, see the Boto3 documentation.