將 JSON 文件剖析為 SUPER 欄使用 COPY 在 Amazon Redshift 中載入 JSON 資料

將半結構化資料載入 Amazon Redshift

使用 SUPER 資料類型在 Amazon Redshift 中保留和查詢階層式和一般資料。Amazon Redshift 引入了以 JSON 格式剖析資料並將其轉換為 SUPER 表示的 json_parse 函數。Amazon Redshift 還支援使用 COPY 命令載入 SUPER 欄。支援的檔案格式包括 JSON、Avro、文字、逗號分隔值 (CSV) 格式、Parquet 和 ORC。

如需下列範例中所用資料表的詳細資訊，請參閱SUPER 範例資料集。

如需 json_parse 函數的詳細資訊，請參閱JSON_PARSE 函數。

SUPER 資料類型的預設編碼是 ZSTD。

將 JSON 文件剖析為 SUPER 欄

您可以使用 json_parse 函數將 JSON 資料插入或更新到 SUPER 欄中。該函數剖析 JSON 格式的資料，並將其轉換為 SUPER 資料類型，您可以在 INSERT 或 UPDATE 陳述式中使用。

下列範例將 JSON 資料插入到 SUPER 欄。如果查詢中缺少 json_parse 函數，Amazon Redshift 會將該值視為單一字串，而不是必須剖析的 JSON 格式字串。

如果您更新 SUPER 資料欄，Amazon Redshift 會要求將完整的文件傳遞至欄值。Amazon Redshift 不支援部分更新。


INSERT INTO region_nations VALUES(0,
   'lar deposits. blithely final packages cajole. regular waters are final requests. regular accounts are according to',
   'AFRICA',
   JSON_PARSE('{"r_nations":[
      {"n_comment":" haggle. carefully final deposits detect slyly agai",
         "n_nationkey":0,
         "n_name":"ALGERIA"
      },
      {"n_comment":"ven packages wake quickly. regu",
         "n_nationkey":5,
         "n_name":"ETHIOPIA"
      },
      {"n_comment":" pending excuses haggle furiously deposits. pending, express pinto beans wake fluffily past t",
         "n_nationkey":14,
         "n_name":"KENYA"
      },
      {"n_comment":"rns. blithely bold courts among the closely regular packages use furiously bold platelets?",
         "n_nationkey":15,
         "n_name":"MOROCCO"
      },
      {"n_comment":"s. ironic, unusual asymptotes wake blithely r",
         "n_nationkey":16,
         "n_name":"MOZAMBIQUE"
      }
   ]
}'));

使用 COPY 在 Amazon Redshift 中載入 SUPER 欄

在下列章節中，您可以了解如何使用 COPY 命令將 JSON 資料載入 Amazon Redshift 的不同方式。

從 JSON 和 Avro 複製資料

透過在 Amazon Redshift 中使用半結構化資料支援，您可以載入 JSON 文件，而不必將其 JSON 結構的屬性分解成多個欄。

Amazon Redshift 提供了兩種方法來使用 COPY 擷取 JSON 文件，即使 JSON 結構完全或部分未知：

使用 noshred 選項將從 JSON 文件衍生的資料儲存到單一 SUPER 資料欄中。當結構描述不知道或預期會改變時，此方法非常有用。因此，這種方法可以更容易地將整個元組儲存在單一 SUPER 欄中。
使用 auto 或 jsonpaths 選項將 JSON 文件分解成多個 Amazon Redshift 欄。屬性可以是 Amazon Redshift 純量或 SUPER 值。

您可以將這些選項與 JSON 或 Avro 格式搭配使用。

分解前的 JSON 物件大小上限為 4 MB。

將 JSON 文件複製到單一 SUPER 資料欄

若要將 JSON 文件複製到單一 SUPER 資料欄中，請使用單一 SUPER 資料欄建立資料表。


CREATE TABLE region_nations_noshred (rdata SUPER);

將 Amazon S3 的資料複製到單一 SUPER 資料欄。若要將 JSON 來源資料內嵌到單一 SUPER 資料欄中，請在 FORMAT JSON 子句中指定 noshred 選項。


COPY region_nations_noshred FROM 's3://redshift-downloads/semistructured/tpch-nested/data/json/region_nation'
REGION 'us-east-1' IAM_ROLE 'arn:aws:iam::xxxxxxxxxxxx:role/Redshift-S3'
FORMAT JSON 'noshred';

COPY 成功擷取 JSON 之後，您的資料表會有一個 rdata SUPER 資料欄，其中包含整個 JSON 物件的資料。擷取的資料會維護 JSON 階層的所有屬性。不過，分葉會轉換成 Amazon Redshift 純量類型，以便有效率地處理查詢。

使用下列查詢擷取原始 JSON 字串。


SELECT rdata FROM region_nations_noshred;

當 Amazon Redshift 產生 SUPER 資料欄時，它可以透過 JSON 序列化使用 JDBC 做為字串進行存取。如需詳細資訊，請參閱序列化複雜的巢狀 JSON。

將 JSON 文件複製到多個 SUPER 資料欄

您可以將 JSON 文件分解為多個欄，這些欄可以是 SUPER 資料欄或 Amazon Redshift 純量類型。Amazon Redshift 傳播 JSON 物件的不同部分到不同的欄。


CREATE TABLE region_nations
(
 r_regionkey smallint
 ,r_name varchar
 ,r_comment varchar
 ,r_nations super
 );

若要將上一個範例的資料複製到資料表中，請在 FORMAT JSON 子句中指定 AUTO 選項，將 JSON 值分割為多個資料欄。COPY 將頂層 JSON 屬性與欄名稱比對，並允許以 SUPER 值的形式擷取巢狀值，例如 JSON 陣列和物件。


COPY region_nations FROM 's3://redshift-downloads/semistructured/tpch-nested/data/json/region_nation'
REGION 'us-east-1' IAM_ROLE 'arn:aws:iam::xxxxxxxxxxxx:role/Redshift-S3'
FORMAT JSON 'auto';

當 JSON 屬性名稱混合使用大寫和小寫時，請在 FORMAT JSON 子句中指定 auto ignorecase 選項。如需 COPY 命令的相關資訊，請參閱使用 'auto ignorecase' 選項從 JSON 資料載入。

在某些情況下，欄名稱和 JSON 屬性不相符，或者要載入的屬性巢狀超過層級深度。如果是這樣，請使用 jsonpaths 檔案手動將 JSON 屬性對應至 Amazon Redshift 欄。


CREATE TABLE nations
(
 regionkey smallint
 ,name varchar
 ,comment super
 ,nations super
 );

假設您要將資料載入到欄名不符合 JSON 屬性的表格中。在下列範例中的 nations 表就是這種表格。您可以建立 jsonpaths 檔案，依據屬性在 jsonpaths 陣列中的位置，將屬性路徑對應至表格欄。


{"jsonpaths": [
       "$.r_regionkey",
       "$.r_name",
       "$.r_comment",
       "$.r_nations
    ]
}

jsonpaths 檔案的位置被用作 FORMAT JSON 的引數。


COPY nations FROM 's3://redshift-downloads/semistructured/tpch-nested/data/json/region_nation'
REGION 'us-east-1' IAM_ROLE 'arn:aws:iam::xxxxxxxxxxxx:role/Redshift-S3'
FORMAT JSON 's3://redshift-downloads/semistructured/tpch-nested/data/jsonpaths/nations_jsonpaths.json';

使用下列查詢來存取顯示資料分佈到多個欄的資料表。SUPER 資料欄使用 JSON 格式進行列印。


SELECT r_regionkey,r_name,r_comment,r_nations[0].n_nationkey FROM region_nations ORDER BY 1,2,3 LIMIT 1;

Jsonpath 黨案將 JSON 文件中的欄位映射到表格欄。您可以擷取其他欄，例如分散索引鍵和排序索引鍵，同時仍將完整的文件載入為 SUPER 欄。以下查詢將完整文件載入到 nations 欄。name 欄是排序索引鍵，而 regionkey 欄是分散索引鍵。


CREATE TABLE nations_sorted (
    regionkey smallint,
    name varchar,
    nations super
) DISTKEY(regionkey) SORTKEY(name);

根 jsonpath "$" 映射到文件的根目錄，如下所示：


{"jsonpaths": [
       "$.r_regionkey",
       "$.r_name",
       "$"
    ]
}

jsonpaths 文件的位置被用作 FORMAT JSON 的引數。


COPY nations_sorted FROM 's3://redshift-downloads/semistructured/tpch-nested/data/json/region_nation'
REGION 'us-east-1' IAM_ROLE 'arn:aws:iam::xxxxxxxxxxxx:role/Redshift-S3'
FORMAT JSON 's3://redshift-downloads/semistructured/tpch-nested/data/jsonpaths/nations_sorted_jsonpaths.json';

從文字和 CSV 複製資料

Amazon Redshift 將文字和 CSV 格式的 SUPER 欄表示為序列化 JSON。SUPER 欄需要有效的 JSON 格式，才能以正確的類型資訊載入。取消物件、陣列、數字、布林值和 null 值的引號。將字串值換成雙引號。SUPER 資料欄會針對文字和 CSV 格式使用標準逸出規則。針對 CSV，會根據 CSV 標準逸出分隔符號。對於文字，如果選擇的分隔符號也可能出現在 SUPER 欄位中，請在 COPY 和 UNLOAD 期間使用 ESCAPE 選項。


COPY region_nations FROM 's3://redshift-downloads/semistructured/tpch-nested/data/csv/region_nation'
REGION 'us-east-1' IAM_ROLE 'arn:aws:iam::xxxxxxxxxxxx:role/Redshift-S3'
FORMAT CSV;


COPY region_nations FROM 's3://redshift-downloads/semistructured/tpch-nested/data/text/region_nation'
REGION 'us-east-1' IAM_ROLE 'arn:aws:iam::xxxxxxxxxxxx:role/Redshift-S3'
DELIMITER ','
ESCAPE;

從單欄式 Parquet 和 ORC 複製資料

如果您的半結構化或巢狀資料已以 Apache Parquet 或 Apache ORC 格式提供，您可以使用 COPY 命令將資料擷取到 Amazon Redshift 中。

Amazon Redshift 資料表結構應符合 Parquet 或 ORC 檔案的欄數和欄資料類型。藉由在 COPY 命令中指定 SERIALIZETOJSON，您可以將與資料表中的 SUPER 欄對齊的檔案中的任何欄類型載入為 SUPER。這包括結構和陣列類型。


COPY region_nations FROM 's3://redshift-downloads/semistructured/tpch-nested/data/parquet/region_nation'
REGION 'us-east-1' IAM_ROLE 'arn:aws:iam::xxxxxxxxxxxx:role/Redshift-S3'
FORMAT PARQUET SERIALIZETOJSON;

以下範例使用 ORC 格式。


COPY region_nations FROM 's3://redshift-downloads/semistructured/tpch-nested/data/orc/region_nation'
IAM_ROLE 'arn:aws:iam::xxxxxxxxxxxx:role/Redshift-S3'
FORMAT ORC SERIALIZETOJSON;

當日期或時間資料類型的屬性在 ORC 中時，Amazon Redshift 在 SUPER 中對它們進行編碼時會將它們轉換為 varchar。

您的瀏覽器已停用或無法使用 Javascript。

您必須啟用 Javascript，才能使用 AWS 文件。請參閱您的瀏覽器說明頁以取得說明。

文件慣用形式

SUPER 範例資料集

卸載半結構化資料