This content is a machine translation of the English version. If there is any ambiguity or inconsistency, the English version prevails.

# Enriching your documents during ingestion
<a name="custom-document-enrichment"></a>

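**Note**  
Feature support varies by index type and by the search API being used. To see whether this feature is supported for the index type and search API you're using, see [Index types](https://docs.aws.amazon.com/kendra/latest/dg/hiw-index-types.html).

You can alter your content and document metadata fields or attributes during the document ingestion process. With the Amazon Kendra *Custom Document Enrichment* feature, you can create, modify, or delete document attributes and content when you ingest your documents into Amazon Kendra. This means you can manipulate and ingest your data as you need.

This feature gives you control over how your documents are treated and ingested into Amazon Kendra. For example, you can scrub personally identifiable information from the document metadata while ingesting your documents into Amazon Kendra.

Another way to use this feature is to invoke a Lambda function in AWS Lambda to run Optical Character Recognition (OCR) on images, translate text, and perform other tasks that prepare the data for search or analysis. For example, you can invoke a function to run OCR on images. The function can interpret text from images and treat each image as a textual document. A company that receives mailed customer surveys and stores these surveys as images can ingest these images as textual documents into Amazon Kendra. The company can then search for valuable customer survey information in Amazon Kendra.

You can use basic operations as a first pass over your data, and then use a Lambda function to apply more complex operations. For example, you can use a basic operation to remove all values in the document metadata field 'Customer\_ID', and then apply a Lambda function to extract text from images of text in the documents.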

## How Custom Document Enrichment works
<a name="how-custom-document-enrichment-works"></a>

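The overall process of Custom Document Enrichment is as follows:

1. You configure Custom Document Enrichment when you create or update your data source, or when you index your documents directly into Amazon Kendra.

1. Amazon Kendra applies inline configurations or basic logic to alter your data. For more information, see [Basic operations to change metadata](#basic-data-maniplation).

1. If you choose to configure advanced data manipulation, Amazon Kendra can apply it to your original, raw documents or to the structured, parsed documents. For more information, see [Lambda functions: extract and change metadata or content](#advanced-data-manipulation).

1. Your altered documents are ingested into Amazon Kendra.

At any point in this process, if your configuration is not valid, Amazon Kendra throws an error.

When you call the [CreateDataSource](https://docs.aws.amazon.com/kendra/latest/APIReference/API_CreateDataSource.html), [UpdateDataSource](https://docs.aws.amazon.com/kendra/latest/APIReference/API_UpdateDataSource.html), or [BatchPutDocument](https://docs.aws.amazon.com/kendra/latest/APIReference/API_BatchPutDocument.html) API, you provide your Custom Document Enrichment configuration. If you call `BatchPutDocument`, you must configure Custom Document Enrichment with each request. If you use the console, select your index and then select **Document enrichments** to configure Custom Document Enrichment.

If you use **Document enrichments** in the console, you can choose to configure only basic operations, only Lambda functions, or both, just as you can with the API. To skip basic operations and configure only Lambda functions, select **Next** in the console steps. You can also choose whether each function applies to the original (pre-extraction) or the structured (post-extraction) data. You can save your configuration only by completing all the steps in the console. If you don't complete all the steps, your document configuration is not saved.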

## Basic operations to change metadata
<a name="basic-data-maniplation"></a>

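You can manipulate your document fields and content using basic logic. This includes removing values in a field, modifying values in a field using a condition, or creating a field. For advanced manipulations that go beyond what basic logic can do, invoke a Lambda function. For more information, see [Lambda functions: extract and change metadata or content](#advanced-data-manipulation).

To apply basic logic, you specify the target field that you want to manipulate using the [DocumentAttributeTarget](https://docs.aws.amazon.com/kendra/latest/APIReference/API_DocumentAttributeTarget.html) object. You provide the attribute key. For example, the key 'Department' is a field or attribute that holds all the department names associated with the documents. You can also specify a value to use in the target field if a certain condition is met. You set the condition using the [DocumentAttributeCondition](https://docs.aws.amazon.com/kendra/latest/APIReference/API_DocumentAttributeCondition.html) object. For example, if the 'Source\_URI' field contains 'financial' in its URI value, then prefill the target field 'Department' with the target value 'Finance' for the document. You can also delete the values of the target document attribute.

To apply basic logic using the console, select your index and then select **Document enrichments** in the navigation menu. Go to **Configure basic operations** to apply basic manipulations to your document fields and content.

The following is an example of using basic logic to remove all customer identification numbers in the document field called 'Customer\_ID'.

**Example 1: Removing customer identification numbers associated with the documents**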

Data before the basic manipulation is applied.


| **Document\_ID** | **Body\_Text** | **Customer\_ID** | 
| --- | --- | --- | 
| 1 | Lorem Ipsum. | CID1234 | 
| 2 | Lorem Ipsum. | CID1235 | 
| 3 | Lorem Ipsum. | CID1236 | 

Data after the basic manipulation is applied.


| **Document\_ID** | **Body\_Text** | **Customer\_ID** | 
| --- | --- | --- | 
| 1 | Lorem Ipsum. |   | 
| 2 | Lorem Ipsum. |   | 
| 3 | Lorem Ipsum. |   | 

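The following is an example of using basic logic to create a field called 'Department' and prefill this field with department names based on information from the 'Source\_URI' field. This uses the condition that if the 'Source\_URI' field contains 'financial' in its URI value, then prefill the target field 'Department' with the target value 'Finance' for the document.

**Example 2: Creating the 'Department' field and prefilling it with department names associated with the documents using a condition.**

Data before the basic manipulation is applied.


| **Document\_ID** | **Body\_Text** | **Source\_URI** | 
| --- | --- | --- | 
| 1 | Lorem Ipsum. | financial/1 | 
| 2 | Lorem Ipsum. | financial/2 | 
| 3 | Lorem Ipsum. | financial/3 | 

Data after the basic manipulation is applied.


| **Document\_ID** | **Body\_Text** | **Source\_URI** | **Department** | 
| --- | --- | --- | --- | 
| 1 | Lorem Ipsum. | financial/1 | Finance | 
| 2 | Lorem Ipsum. | financial/2 | Finance | 
| 3 | Lorem Ipsum. | financial/3 | Finance | 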
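
As a minimal sketch, the condition in Example 2 could be written as an inline configuration like the one below. The field names follow the `DocumentAttributeCondition` and `DocumentAttributeTarget` API objects. The `preview_inline_configuration` helper is hypothetical: it only imitates the `Contains` rule locally so you can sanity-check a configuration, and it is not how Amazon Kendra evaluates rules.

```python
# Inline configuration for Example 2: if Source_URI contains "financial",
# set the Department field to "Finance".
inline_configuration = {
    "Condition": {
        "ConditionDocumentAttributeKey": "Source_URI",
        "Operator": "Contains",
        "ConditionOnValue": {"StringValue": "financial"},
    },
    "Target": {
        "TargetDocumentAttributeKey": "Department",
        "TargetDocumentAttributeValueDeletion": False,
        "TargetDocumentAttributeValue": {"StringValue": "Finance"},
    },
}

# This dictionary is the shape passed as CustomDocumentEnrichmentConfiguration.
custom_document_enrichment_configuration = {
    "InlineConfigurations": [inline_configuration]
}


def preview_inline_configuration(attributes, config):
    """Hypothetical local preview of one rule (assumption, not Kendra's logic).

    Supports only the "Contains" operator on string values.
    """
    condition = config["Condition"]
    target = config["Target"]
    current = attributes.get(condition["ConditionDocumentAttributeKey"], "")
    wanted = condition["ConditionOnValue"]["StringValue"]
    updated = dict(attributes)  # leave the caller's dictionary untouched
    if condition["Operator"] == "Contains" and wanted in current:
        key = target["TargetDocumentAttributeKey"]
        updated[key] = target["TargetDocumentAttributeValue"]["StringValue"]
    return updated


document = {"Document_ID": "1", "Source_URI": "financial/1"}
print(preview_inline_configuration(document, inline_configuration))
```

Keeping the rule as plain data means the same dictionary can be reused for `create_data_source`, `update_data_source`, or `batch_put_document`.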

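**Note**  
Amazon Kendra cannot create a target document field if it has not already been created as an index field. After you create your index field, you can create a document field using `DocumentAttributeTarget`. Amazon Kendra then maps your newly created document metadata field to your index field.

The following code is an example of configuring basic data manipulation to remove customer identification numbers associated with the documents.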

------
#### [ Console ]

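**To configure basic data manipulation to remove customer identification numbers**

1. In the left navigation pane, under **Indexes**, select **Document enrichments** and then select **Add document enrichment**.

1. On the **Configure basic operations** page, choose from the dropdown the data source whose document fields and content you want to alter. Then choose the document field name 'Customer\_ID' from the dropdown, choose the index field name 'Customer\_ID' from the dropdown, and choose the target action **Delete** from the dropdown. Then select **Add basic operation**.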

------
#### [ CLI ]

**To configure basic data manipulation to remove customer identification numbers**

```
aws kendra create-data-source \
 --name {{data-source-name}} \
 --index-id {{index-id}} \
 --role-arn {{arn:aws:iam::account-id:role/role-name}} \
 --type S3 \
 --configuration '{"S3Configuration":{"BucketName":"{{S3-bucket-name}}"}}' \
 --custom-document-enrichment-configuration '{"InlineConfigurations":[{"Target":{"TargetDocumentAttributeKey":"Customer_ID", "TargetDocumentAttributeValueDeletion": true}}]}'
```

------
#### [ Python ]

**To configure basic data manipulation to remove customer identification numbers**

```
import boto3
from botocore.exceptions import ClientError
import pprint
import time

kendra = boto3.client("kendra")

print("Create a data source with customizations")

# Provide the name of the data source
name = "data-source-name"
# Provide the index ID for the data source
index_id = "index-id"
# Provide the IAM role ARN required for data sources
role_arn = "arn:aws:iam::${account-id}:role/${role-name}"
# Provide the data source connection information
data_source_type = "S3"
S3_bucket_name = "S3-bucket-name"
# Configure the data source with Custom Document Enrichment
configuration = {"S3Configuration":
        {
            "BucketName": S3_bucket_name
        }
    }
custom_document_enrichment_configuration = {"InlineConfigurations":[
        {
            "Target":{"TargetDocumentAttributeKey":"Customer_ID",
                       "TargetDocumentAttributeValueDeletion": True}
        }]
    }

try:
    data_source_response = kendra.create_data_source(
        Name = name,
        IndexId = index_id,
        RoleArn = role_arn,
        Type = data_source_type,
        Configuration = configuration,
        CustomDocumentEnrichmentConfiguration = custom_document_enrichment_configuration
    )

    pprint.pprint(data_source_response)

    data_source_id = data_source_response["Id"]

    print("Wait for Amazon Kendra to create the data source with your customizations.")

    while True:
        # Get the details of the data source, such as the status
        data_source_description = kendra.describe_data_source(
            Id = data_source_id,
            IndexId = index_id
        )
        status = data_source_description["Status"]
        print(" Creating data source. Status: "+status)
        time.sleep(60)
        if status != "CREATING":
            break

    print("Synchronize the data source.")

    sync_response = kendra.start_data_source_sync_job(
        Id = data_source_id,
        IndexId = index_id
    )

    pprint.pprint(sync_response)

    print("Wait for the data source to sync with the index.")

    while True:

        jobs = kendra.list_data_source_sync_jobs(
            Id= data_source_id,
            IndexId= index_id
        )

        # For this example, there should be one job
        status = jobs["History"][0]["Status"]

        print(" Syncing data source. Status: "+status)
        time.sleep(60)
        if status != "SYNCING":
            break

except ClientError as e:
    print("%s" % e)

print("Program ends.")
```

------
#### [ Java ]

**To configure basic data manipulation to remove customer identification numbers**

```
package com.amazonaws.kendra;

import java.util.Arrays;
import java.util.concurrent.TimeUnit;
import software.amazon.awssdk.services.kendra.KendraClient;
import software.amazon.awssdk.services.kendra.model.CreateDataSourceRequest;
import software.amazon.awssdk.services.kendra.model.CreateDataSourceResponse;
import software.amazon.awssdk.services.kendra.model.CustomDocumentEnrichmentConfiguration;
import software.amazon.awssdk.services.kendra.model.DataSourceConfiguration;
import software.amazon.awssdk.services.kendra.model.DataSourceStatus;
import software.amazon.awssdk.services.kendra.model.DataSourceSyncJob;
import software.amazon.awssdk.services.kendra.model.DataSourceSyncJobStatus;
import software.amazon.awssdk.services.kendra.model.DataSourceType;
import software.amazon.awssdk.services.kendra.model.DescribeDataSourceRequest;
import software.amazon.awssdk.services.kendra.model.DescribeDataSourceResponse;
import software.amazon.awssdk.services.kendra.model.DocumentAttributeTarget;
import software.amazon.awssdk.services.kendra.model.InlineCustomDocumentEnrichmentConfiguration;
import software.amazon.awssdk.services.kendra.model.ListDataSourceSyncJobsRequest;
import software.amazon.awssdk.services.kendra.model.ListDataSourceSyncJobsResponse;
import software.amazon.awssdk.services.kendra.model.S3DataSourceConfiguration;
import software.amazon.awssdk.services.kendra.model.StartDataSourceSyncJobRequest;
import software.amazon.awssdk.services.kendra.model.StartDataSourceSyncJobResponse;

public class CreateDataSourceWithCustomizationsExample {

    public static void main(String[] args) throws InterruptedException {
        System.out.println("Create a data source with customizations");
        
        String dataSourceName = "data-source-name";
        String indexId = "index-id";
        String dataSourceRoleArn = "arn:aws:iam::account-id:role/role-name";
        String s3BucketName = "S3-bucket-name";

        KendraClient kendra = KendraClient.builder().build();
        
        CreateDataSourceRequest createDataSourceRequest = CreateDataSourceRequest
            .builder()
            .name(dataSourceName)
            .indexId(indexId)
            .roleArn(dataSourceRoleArn)
            .type(DataSourceType.S3)
            .configuration(
                DataSourceConfiguration
                    .builder()
                    .s3Configuration(
                        S3DataSourceConfiguration
                            .builder()
                            .bucketName(s3BucketName)
                            .build()
                    ).build()
            )
            .customDocumentEnrichmentConfiguration(
                CustomDocumentEnrichmentConfiguration
                    .builder()
                    .inlineConfigurations(Arrays.asList(
                        InlineCustomDocumentEnrichmentConfiguration
                            .builder()
                            .target(
                                DocumentAttributeTarget
                                    .builder()
                                    .targetDocumentAttributeKey("Customer_ID")
                                    .targetDocumentAttributeValueDeletion(true)
                                    .build())
                            .build()
                    )).build()
            ).build();
        
        CreateDataSourceResponse createDataSourceResponse = kendra.createDataSource(createDataSourceRequest);
        System.out.println(String.format("Response of creating data source: %s", createDataSourceResponse));

        String dataSourceId = createDataSourceResponse.id();
        System.out.println(String.format("Waiting for Kendra to create the data source %s", dataSourceId));
        DescribeDataSourceRequest describeDataSourceRequest = DescribeDataSourceRequest
            .builder()
            .indexId(indexId)
            .id(dataSourceId)
            .build();

        while (true) {
            DescribeDataSourceResponse describeDataSourceResponse = kendra.describeDataSource(describeDataSourceRequest);

            DataSourceStatus status = describeDataSourceResponse.status();
            System.out.println(String.format("Creating data source. Status: %s", status));
            TimeUnit.SECONDS.sleep(60);
            if (status != DataSourceStatus.CREATING) {
                break;
            }
        }

        System.out.println(String.format("Synchronize the data source %s", dataSourceId));
        StartDataSourceSyncJobRequest startDataSourceSyncJobRequest = StartDataSourceSyncJobRequest
            .builder()
            .indexId(indexId)
            .id(dataSourceId)
            .build();
        StartDataSourceSyncJobResponse startDataSourceSyncJobResponse = kendra.startDataSourceSyncJob(startDataSourceSyncJobRequest);
        System.out.println(String.format("Waiting for the data source to sync with the index %s for execution ID %s", indexId, startDataSourceSyncJobResponse.executionId()));

        // For this example, there should be one job
        ListDataSourceSyncJobsRequest listDataSourceSyncJobsRequest = ListDataSourceSyncJobsRequest
            .builder()
            .indexId(indexId)
            .id(dataSourceId)
            .build();

        while (true) {
            ListDataSourceSyncJobsResponse listDataSourceSyncJobsResponse = kendra.listDataSourceSyncJobs(listDataSourceSyncJobsRequest);
            DataSourceSyncJob job = listDataSourceSyncJobsResponse.history().get(0);
            System.out.println(String.format("Syncing data source. Status: %s", job.status()));

            TimeUnit.SECONDS.sleep(60);
            if (job.status() != DataSourceSyncJobStatus.SYNCING) {
                break;
            }

        }

        System.out.println("Data source creation with customizations is complete");
    }
}
```

------

## Lambda functions: extract and change metadata or content
<a name="advanced-data-manipulation"></a>

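You can manipulate your document fields and content using Lambda functions. This is useful if you want to go beyond basic logic and apply advanced data manipulation. For example, you can use Optical Character Recognition (OCR), which interprets text from images and treats each image as a textual document. Or, you can retrieve the current date-time in a certain time zone and insert it where a date field has an empty value.

You can apply basic logic first and then use a Lambda function to further manipulate your data, or vice versa. You can also choose to apply only a Lambda function.

Amazon Kendra can invoke a Lambda function to apply advanced data manipulation during the ingestion process as part of your [CustomDocumentEnrichmentConfiguration](https://docs.aws.amazon.com/kendra/latest/APIReference/API_CustomDocumentEnrichmentConfiguration.html). You specify a role that includes permission to run the Lambda function and to access the Amazon S3 bucket that stores the output of your data manipulations. For more information, see [IAM access roles](https://docs.aws.amazon.com/kendra/latest/dg/iam-roles.html).

Amazon Kendra can apply a Lambda function to your original, raw documents or to the structured, parsed documents. You can configure a Lambda function that takes your original or raw data and applies your data manipulations using [PreExtractionHookConfiguration](https://docs.aws.amazon.com/kendra/latest/APIReference/API_CustomDocumentEnrichmentConfiguration.html). You can also configure a Lambda function that takes your structured documents and applies your data manipulations using [PostExtractionHookConfiguration](https://docs.aws.amazon.com/kendra/latest/APIReference/API_CustomDocumentEnrichmentConfiguration.html). Amazon Kendra extracts the document metadata and text to structure your documents. Your Lambda functions must follow the mandatory request and response structures. For more information, see [Data contracts for Lambda functions](#cde-data-contracts-lambda).

To configure a Lambda function in the console, select your index and then select **Document enrichments** in the navigation menu. Go to **Configure Lambda functions** to configure a Lambda function.

You can configure only one Lambda function for `PreExtractionHookConfiguration` and only one Lambda function for `PostExtractionHookConfiguration`. However, your Lambda function can invoke other functions that it requires. You can configure both `PreExtractionHookConfiguration` and `PostExtractionHookConfiguration`, or only one of them. Your Lambda function for `PreExtractionHookConfiguration` must not exceed a run time of 5 minutes, and your Lambda function for `PostExtractionHookConfiguration` must not exceed a run time of 1 minute. With Custom Document Enrichment configured, ingesting your documents into Amazon Kendra naturally takes longer than it would without it.

You can configure Amazon Kendra to invoke a Lambda function only if a condition is met. For example, you can specify a condition that if a date-time value is empty, then Amazon Kendra should invoke a function that inserts the current date-time.

The following is an example of using a Lambda function to run OCR to interpret text from images, and store this text in a field called 'Document\_Image\_Text'.

**Example 1: Extracting text from images to create textual documents**

Data before the advanced manipulation is applied.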


| **Document\_ID** | **Document\_Image** | 
| --- | --- | 
| 1 | image\_1.png | 
| 2 | image\_2.png | 
| 3 | image\_3.png | 

Data after the advanced manipulation is applied.


| **Document\_ID** | **Document\_Image** | **Document\_Image\_Text** | 
| --- | --- | --- | 
| 1 | image\_1.png | Mailed survey response | 
| 2 | image\_2.png | Mailed survey response | 
| 3 | image\_3.png | Mailed survey response | 

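The following is an example of using a Lambda function to insert the current date-time for empty date values. This uses the condition that if a date field value is 'null', then replace it with the current date-time.

**Example 2: Replacing empty values in the Last\_Updated field with the current date-time.**

Data before the advanced manipulation is applied.


| **Document\_ID** | **Body\_Text** | **Last\_Updated** | 
| --- | --- | --- | 
| 1 | Lorem Ipsum. | January 1, 2020 | 
| 2 | Lorem Ipsum. |   | 
| 3 | Lorem Ipsum. | July 1, 2020 | 

Data after the advanced manipulation is applied.


| **Document\_ID** | **Body\_Text** | **Last\_Updated** | 
| --- | --- | --- | 
| 1 | Lorem Ipsum. | January 1, 2020 | 
| 2 | Lorem Ipsum. | December 1, 2021 | 
| 3 | Lorem Ipsum. | July 1, 2020 | 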
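
A Lambda function for Example 2 might look like the following sketch. It assumes a post-extraction hook, an index field named `Last_Updated`, and that an ISO 8601 UTC timestamp is an acceptable `dateValue` string. It also assumes that returning the incoming `s3ObjectKey` unchanged is acceptable when only metadata changes. Verify all of these against your own index before relying on them.

```python
from datetime import datetime, timezone


def lambda_handler(event, context):
    # Attributes arrive as a list of {"name": ..., "value": {...}} entries.
    attributes = event.get("metadata", {}).get("attributes", [])
    last_updated = next(
        (a for a in attributes if a.get("name") == "Last_Updated"), None
    )
    has_date = bool(
        last_updated and last_updated.get("value", {}).get("dateValue")
    )

    metadata_updates = []
    if not has_date:
        # Assumption: an ISO 8601 timestamp in UTC is a valid dateValue.
        now = datetime.now(timezone.utc).isoformat(timespec="seconds")
        metadata_updates.append(
            {"name": "Last_Updated", "value": {"dateValue": now}}
        )

    return {
        "version": "v0",
        # Assumption: the document body is unchanged, so the key is reused.
        "s3ObjectKey": event.get("s3ObjectKey"),
        "metadataUpdates": metadata_updates,
    }
```

The function leaves documents that already carry a date untouched, which mirrors rows 1 and 3 in the table above.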

The following code is an example of configuring a Lambda function for advanced data manipulation on the raw, original data.

------
#### [ Console ]

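**To configure a Lambda function for advanced data manipulation on the raw, original data**

1. In the left navigation pane, under **Indexes**, select **Document enrichments** and then select **Add document enrichment**.

1. On the **Configure Lambda functions** page, in the **Lambda for pre-extraction** section, select your Lambda function ARN and your Amazon S3 bucket from the dropdowns. Add your IAM access role by selecting the option to create a new role from the dropdown. This creates the Amazon Kendra permissions required to create the document enrichment.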

------
#### [ CLI ]

**To configure a Lambda function for advanced data manipulation on the raw, original data**

```
aws kendra create-data-source \
 --name {{data-source-name}} \
 --index-id {{index-id}} \
 --role-arn {{arn:aws:iam::account-id:role/role-name}} \
 --type S3 \
 --configuration '{"S3Configuration":{"BucketName":"{{S3-bucket-name}}"}}' \
 --custom-document-enrichment-configuration '{"PreExtractionHookConfiguration":{"LambdaArn":"{{arn:aws:lambda:region:account-id:function:function-name}}", "S3Bucket":"{{S3-bucket-name}}"}, "RoleArn": "{{arn:aws:iam::account-id:role/cde-role-name}}"}'
```

------
#### [ Python ]

**To configure a Lambda function for advanced data manipulation on the raw, original data**

```
import boto3
from botocore.exceptions import ClientError
import pprint
import time

kendra = boto3.client("kendra")

print("Create a data source with customizations.")

# Provide the name of the data source
name = "data-source-name"
# Provide the index ID for the data source
index_id = "index-id"
# Provide the IAM role ARN required for data sources
role_arn = "arn:aws:iam::${account-id}:role/${role-name}"
# Provide the data source connection information
data_source_type = "S3"
S3_bucket_name = "S3-bucket-name"
# Configure the data source with Custom Document Enrichment
configuration = {"S3Configuration":
        {
            "BucketName": S3_bucket_name
        }
    }
custom_document_enrichment_configuration = {"PreExtractionHookConfiguration":
        {
            "LambdaArn":"arn:aws:lambda:region:account-id:function:function-name",
            "S3Bucket":"S3-bucket-name"
        },
    "RoleArn":"arn:aws:iam::account-id:role/cde-role-name"
    }

try:
    data_source_response = kendra.create_data_source(
        Name = name,
        IndexId = index_id,
        RoleArn = role_arn,
        Type = data_source_type,
        Configuration = configuration,
        CustomDocumentEnrichmentConfiguration = custom_document_enrichment_configuration
    )

    pprint.pprint(data_source_response)

    data_source_id = data_source_response["Id"]

    print("Wait for Amazon Kendra to create the data source with your customizations.")

    while True:
        # Get the details of the data source, such as the status
        data_source_description = kendra.describe_data_source(
            Id = data_source_id,
            IndexId = index_id
        )
        status = data_source_description["Status"]
        print(" Creating data source. Status: "+status)
        time.sleep(60)
        if status != "CREATING":
            break

    print("Synchronize the data source.")

    sync_response = kendra.start_data_source_sync_job(
        Id = data_source_id,
        IndexId = index_id
    )

    pprint.pprint(sync_response)

    print("Wait for the data source to sync with the index.")

    while True:

        jobs = kendra.list_data_source_sync_jobs(
            Id = data_source_id,
            IndexId = index_id
        )

        # For this example, there should be one job
        status = jobs["History"][0]["Status"]

        print(" Syncing data source. Status: "+status)
        time.sleep(60)
        if status != "SYNCING":
            break

except ClientError as e:
    print("%s" % e)

print("Program ends.")
```

------
#### [ Java ]

**To configure a Lambda function for advanced data manipulation on the raw, original data**

```
package com.amazonaws.kendra;

import java.util.concurrent.TimeUnit;
import software.amazon.awssdk.services.kendra.KendraClient;
import software.amazon.awssdk.services.kendra.model.CreateDataSourceRequest;
import software.amazon.awssdk.services.kendra.model.CreateDataSourceResponse;
import software.amazon.awssdk.services.kendra.model.CustomDocumentEnrichmentConfiguration;
import software.amazon.awssdk.services.kendra.model.DataSourceConfiguration;
import software.amazon.awssdk.services.kendra.model.DataSourceStatus;
import software.amazon.awssdk.services.kendra.model.DataSourceSyncJob;
import software.amazon.awssdk.services.kendra.model.DataSourceSyncJobStatus;
import software.amazon.awssdk.services.kendra.model.DataSourceType;
import software.amazon.awssdk.services.kendra.model.DescribeDataSourceRequest;
import software.amazon.awssdk.services.kendra.model.DescribeDataSourceResponse;
import software.amazon.awssdk.services.kendra.model.HookConfiguration;
import software.amazon.awssdk.services.kendra.model.ListDataSourceSyncJobsRequest;
import software.amazon.awssdk.services.kendra.model.ListDataSourceSyncJobsResponse;
import software.amazon.awssdk.services.kendra.model.S3DataSourceConfiguration;
import software.amazon.awssdk.services.kendra.model.StartDataSourceSyncJobRequest;
import software.amazon.awssdk.services.kendra.model.StartDataSourceSyncJobResponse;


public class CreateDataSourceWithCustomizationsExample {

    public static void main(String[] args) throws InterruptedException {
        System.out.println("Create a data source with customizations");
        
        String dataSourceName = "data-source-name";
        String indexId = "index-id";
        String dataSourceRoleArn = "arn:aws:iam::account-id:role/role-name";
        String s3BucketName = "S3-bucket-name";

        KendraClient kendra = KendraClient.builder().build();
        
        CreateDataSourceRequest createDataSourceRequest = CreateDataSourceRequest
            .builder()
            .name(dataSourceName)
            .indexId(indexId)
            .roleArn(dataSourceRoleArn)
            .type(DataSourceType.S3)
            .configuration(
                DataSourceConfiguration
                    .builder()
                    .s3Configuration(
                        S3DataSourceConfiguration
                            .builder()
                            .bucketName(s3BucketName)
                            .build()
                    ).build()
            )
            .customDocumentEnrichmentConfiguration(
                CustomDocumentEnrichmentConfiguration
                    .builder()
                    .preExtractionHookConfiguration(
                        HookConfiguration
                            .builder()
                            .lambdaArn("arn:aws:lambda:region:account-id:function:function-name")
                            .s3Bucket("S3-bucket-name")
                            .build())
                    .roleArn("arn:aws:iam::account-id:role/cde-role-name")
                    .build()
            ).build();
        
        CreateDataSourceResponse createDataSourceResponse = kendra.createDataSource(createDataSourceRequest);
        System.out.println(String.format("Response of creating data source: %s", createDataSourceResponse));

        String dataSourceId = createDataSourceResponse.id();
        System.out.println(String.format("Waiting for Kendra to create the data source %s", dataSourceId));
        DescribeDataSourceRequest describeDataSourceRequest = DescribeDataSourceRequest
            .builder()
            .indexId(indexId)
            .id(dataSourceId)
            .build();

        while (true) {
            DescribeDataSourceResponse describeDataSourceResponse = kendra.describeDataSource(describeDataSourceRequest);

            DataSourceStatus status = describeDataSourceResponse.status();
            System.out.println(String.format("Creating data source. Status: %s", status));
            TimeUnit.SECONDS.sleep(60);
            if (status != DataSourceStatus.CREATING) {
                break;
            }
        }

        System.out.println(String.format("Synchronize the data source %s", dataSourceId));
        StartDataSourceSyncJobRequest startDataSourceSyncJobRequest = StartDataSourceSyncJobRequest
            .builder()
            .indexId(indexId)
            .id(dataSourceId)
            .build();
        StartDataSourceSyncJobResponse startDataSourceSyncJobResponse = kendra.startDataSourceSyncJob(startDataSourceSyncJobRequest);
        System.out.println(String.format("Waiting for the data source to sync with the index %s for execution ID %s", indexId, startDataSourceSyncJobResponse.executionId()));

        // For this example, there should be one job
        ListDataSourceSyncJobsRequest listDataSourceSyncJobsRequest = ListDataSourceSyncJobsRequest
            .builder()
            .indexId(indexId)
            .id(dataSourceId)
            .build();

        while (true) {
            ListDataSourceSyncJobsResponse listDataSourceSyncJobsResponse = kendra.listDataSourceSyncJobs(listDataSourceSyncJobsRequest);
            DataSourceSyncJob job = listDataSourceSyncJobsResponse.history().get(0);
            System.out.println(String.format("Syncing data source. Status: %s", job.status()));

            TimeUnit.SECONDS.sleep(60);
            if (job.status() != DataSourceSyncJobStatus.SYNCING) {
                break;
            }

        }

        System.out.println("Data source creation with customizations is complete");
    }
}
```

------

## Data contracts for Lambda functions
<a name="cde-data-contracts-lambda"></a>

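Your Lambda functions for advanced data manipulation interact with Amazon Kendra data contracts. The contracts are the mandatory request and response structures of your Lambda functions. If your Lambda functions do not follow these structures, Amazon Kendra throws an error.

Your Lambda function for `PreExtractionHookConfiguration` should expect the following request structure: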

```
{
    "version": <str>,
    "dataBlobStringEncodedInBase64": <str>, //In the case of a data blob
    "s3Bucket": <str>, //In the case of an S3 bucket
    "s3ObjectKey": <str>, //In the case of an S3 bucket
    "metadata": <Metadata>
}
```

The `metadata` structure, which includes the `CustomDocumentAttribute` structure, is as follows:

```
{
    "attributes": [<CustomDocumentAttribute>]
}

CustomDocumentAttribute
{
    "name": <str>,
    "value": <CustomDocumentAttributeValue>
}

CustomDocumentAttributeValue
{
    "stringValue": <str>,
    "integerValue": <int>,
    "longValue": <long>,
    "stringListValue": list<str>,
    "dateValue": <str>
}
```

Your Lambda function for `PreExtractionHookConfiguration` must adhere to the following response structure:

```
{
    "version": <str>,
    "dataBlobStringEncodedInBase64": <str>, //In the case of a data blob
    "s3ObjectKey": <str>, //In the case of an S3 bucket
    "metadataUpdates": [<CustomDocumentAttribute>]
}
```

Your Lambda function for `PostExtractionHookConfiguration` should expect the following request structure:

```
{
    "version": <str>,
    "s3Bucket": <str>,
    "s3ObjectKey": <str>,
    "metadata": <Metadata>
}
```

Your Lambda function for `PostExtractionHookConfiguration` must adhere to the following response structure:

```
PostExtractionHookConfiguration Lambda Response
{
    "version": <str>,
    "s3ObjectKey": <str>,
    "metadataUpdates": [<CustomDocumentAttribute>]
}
```
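
Before deploying a function, it can help to check its return value against the response structures above. The following is a rough sketch of such a check. The `check_hook_response` helper is hypothetical and is not part of any AWS SDK. It assumes that a pre-extraction response carries either `s3ObjectKey` or `dataBlobStringEncodedInBase64`, and that a post-extraction response always carries `s3ObjectKey`.

```python
def check_hook_response(response, pre_extraction):
    """Return a list of problems; an empty list means the shape looks right.

    Hypothetical helper for local testing only. It checks field presence
    and basic types, not whether Amazon Kendra accepts the values.
    """
    problems = []
    if not isinstance(response.get("version"), str):
        problems.append("version must be a string")

    updates = response.get("metadataUpdates")
    if not isinstance(updates, list):
        problems.append("metadataUpdates must be a list")
    else:
        for item in updates:
            if not isinstance(item, dict) or "name" not in item or "value" not in item:
                problems.append("each metadata update needs name and value")
                break

    has_key = isinstance(response.get("s3ObjectKey"), str)
    has_blob = isinstance(response.get("dataBlobStringEncodedInBase64"), str)
    if pre_extraction:
        # Pre-extraction content comes back as an S3 key or a data blob.
        if not (has_key or has_blob):
            problems.append("expected s3ObjectKey or dataBlobStringEncodedInBase64")
    elif not has_key:
        # Post-extraction content is always referenced by an S3 key.
        problems.append("expected s3ObjectKey")
    return problems


sample = {"version": "v0", "s3ObjectKey": "updated_document", "metadataUpdates": []}
print(check_hook_response(sample, pre_extraction=False))
```

A check like this fits naturally in a unit test that calls your handler with a sample event.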

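Your altered document is uploaded to your Amazon S3 bucket. The altered document must follow the format shown in [Structured document format](#structured-document-format).

### Structured document format
<a name="structured-document-format"></a>

Amazon Kendra uploads your structured document to the given Amazon S3 bucket. The structured document follows this format: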

```
Kendra document

{
   "textContent": <TextContent>
}

TextContent
{
  "documentBodyText": <str>
}
```

### Example of a Lambda function that adheres to data contracts
<a name="example-lambda-function-advanced-manipulation"></a>

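The following Python code is an example of a Lambda function that applies advanced manipulation of the metadata fields `_authors` and `_document_title`, and of the body content, on the raw or original documents.

**In the case of the body content residing in an Amazon S3 bucket**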

```
import json
import boto3
     
s3 = boto3.client("s3")

# Lambda function for advanced data manipulation    
def lambda_handler(event, context):
    # Get the value of the "s3Bucket" key from the given event input
    s3_bucket = event.get("s3Bucket")
    # Get the value of the "s3ObjectKey" key from the given event input
    s3_object_key = event.get("s3ObjectKey")
    
    content_object_before_CDE = s3.get_object(Bucket = s3_bucket, Key = s3_object_key)
    content_before_CDE = content_object_before_CDE["Body"].read().decode("utf-8")
    content_after_CDE = "CDEInvolved " + content_before_CDE
    
    # Get the value of "metadata" key name or item from the given event input
    metadata = event.get("metadata")
    # Get the document "attributes" from the metadata 
    document_attributes = metadata.get("attributes")
    
    s3.put_object(Bucket = s3_bucket, Key = "dummy_updated_kendra_document", Body=json.dumps(content_after_CDE))
    return {
        "version": "v0",
        "s3ObjectKey": "dummy_updated_kendra_document",
        "metadataUpdates": [
            {"name":"_document_title", "value":{"stringValue":"title_from_pre_extraction_lambda"}},
            {"name":"_authors", "value":{"stringListValue":["author1", "author2"]}}
        ]
    }
```

**In the case of the body content residing in a data blob**

```
import json
import boto3
import base64

# Lambda function for advanced data manipulation
def lambda_handler(event, context):
    
    # Get the value of "dataBlobStringEncodedInBase64" key name or item from the given event input 
    data_blob_string_encoded_in_base64 = event.get("dataBlobStringEncodedInBase64")
    # Decode the data blob string in UTF-8
    data_blob_string = base64.b64decode(data_blob_string_encoded_in_base64).decode("utf-8")
    # Get the value of "metadata" key name or item from the given event input    
    metadata = event.get("metadata")
    # Get the document "attributes" from the metadata
    document_attributes = metadata.get("attributes")
    
    new_data_blob = "This should be the modified data in the document by pre processing lambda ".encode("utf-8")
    return {
        "version": "v0",
        "dataBlobStringEncodedInBase64": base64.b64encode(new_data_blob).decode("utf-8"),
        "metadataUpdates": [
            {"name":"_document_title", "value":{"stringValue":"title_from_pre_extraction_lambda"}},
            {"name":"_authors", "value":{"stringListValue":["author1", "author2"]}}
        ]
    }
```

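The following Python code is an example of a Lambda function that applies advanced manipulation of the metadata fields `_authors` and `_document_title`, and of the body content, on the structured or parsed documents.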

```
import json
import boto3
import time

s3 = boto3.client("s3")

# Lambda function for advanced data manipulation
def lambda_handler(event, context):
    
    # Get the value of the "s3Bucket" key from the given event input
    s3_bucket = event.get("s3Bucket")
    # Get the value of the "s3ObjectKey" key from the given event input
    s3_key = event.get("s3ObjectKey")
    # Get the value of "metadata" key name or item from the given event input
    metadata = event.get("metadata")
    # Get the document "attributes" from the metadata 
    document_attributes = metadata.get("attributes")
    
    kendra_document_object = s3.get_object(Bucket = s3_bucket, Key = s3_key)
    kendra_document_string = kendra_document_object['Body'].read().decode('utf-8')
    kendra_document = json.loads(kendra_document_string)
    kendra_document["textContent"]["documentBodyText"] = "Changing document body to a short sentence."
    
    s3.put_object(Bucket = s3_bucket, Key = "dummy_updated_kendra_document", Body=json.dumps(kendra_document))

    return {
        "version" : "v0",
        "s3ObjectKey": "dummy_updated_kendra_document",
        "metadataUpdates": [
            {"name": "_document_title", "value":{"stringValue": "title_from_post_extraction_lambda"}},
            {"name": "_authors", "value":{"stringListValue":["author1", "author2"]}}
        ]
    }
```