# Customize ingestion for a data source
<a name="kb-data-source-customize-ingestion"></a>

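You can customize vector ingestion when you connect a data source in the AWS Management Console, or by modifying the value of the `vectorIngestionConfiguration` field when you send a [CreateDataSource](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent_CreateDataSource.html) request.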

Select a topic to learn how to include the configurations for customizing ingestion when you connect to a data source:

**Topics**
+ [Choose the tool to use for parsing](#kb-data-source-customize-parsing)
+ [Choose a chunking strategy](#kb-data-source-customize-chunking)
+ [Use a Lambda function during ingestion](#kb-data-source-customize-lambda)

## Choose the tool to use for parsing
<a name="kb-data-source-customize-parsing"></a>

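You can customize how the documents in your data are parsed. To learn about the options for parsing data in Amazon Bedrock Knowledge Bases, see [Parsing options for your data source](kb-advanced-parsing.md).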

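**Warning**  
You can't change the parsing strategy after you connect to the data source. To use a different parsing strategy, add a new data source.  
You can't add an S3 location for storing multimodal data (including images, figures, charts, and tables) after you create a knowledge base. To include multimodal data and use a parser that supports it, you must create a new knowledge base.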

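The steps for choosing a parsing strategy depend on whether you use the AWS Management Console or the Amazon Bedrock API, and on the parsing method you choose. If you choose a parsing method that supports multimodal data, you must specify an S3 URI in which to store the multimodal data extracted from your documents. This data can be returned in knowledge base queries.
+ In the AWS Management Console, do the following:

  1. Select the parsing strategy when you connect to a data source while setting up a knowledge base, or when you add a new data source to an existing knowledge base.

  1. (If you choose Amazon Bedrock Data Automation or a foundation model as your parsing strategy) When you select an embeddings model and configure your vector store, specify an S3 URI in the **Multimodal storage destination** section in which to store the multimodal data extracted from your documents. At this step you can also choose to encrypt your S3 data with a customer managed key.
+ In the Amazon Bedrock API, do the following:

  1. (If you plan to use Amazon Bedrock Data Automation or a foundation model as your parsing strategy) Include a [SupplementalDataStorageLocation](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent_SupplementalDataStorageLocation.html) in the [VectorKnowledgeBaseConfiguration](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent_VectorKnowledgeBaseConfiguration.html) of a [CreateKnowledgeBase](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent_CreateKnowledgeBase.html) request.

  1. Include a [ParsingConfiguration](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent_ParsingConfiguration.html) in the `parsingConfiguration` field of the [VectorIngestionConfiguration](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent_VectorIngestionConfiguration.html) in the [CreateDataSource](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent_CreateDataSource.html) request.
**Note**  
If you omit this configuration, Amazon Bedrock Knowledge Bases uses the Amazon Bedrock default parser.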
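
The API steps above can be sketched as a small helper that assembles the value of the `vectorIngestionConfiguration` field. This is a minimal sketch under stated assumptions: the function name `build_vector_ingestion_configuration` is hypothetical, and `MULTIMODAL` is assumed to be an accepted value for `parsingModality`. The resulting dictionary is meant to be supplied as the `vectorIngestionConfiguration` of a `CreateDataSource` request.

```python
from typing import Any, Dict, Optional


def build_vector_ingestion_configuration(
    parsing_configuration: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Assemble a vectorIngestionConfiguration dictionary (hypothetical helper).

    When no parsing configuration is given, the parsingConfiguration field is
    left out entirely, which selects the Amazon Bedrock default parser.
    """
    config: Dict[str, Any] = {}
    if parsing_configuration is not None:
        config["parsingConfiguration"] = parsing_configuration
    return config


# Default parser: the parsingConfiguration field is omitted.
default_config = build_vector_ingestion_configuration()

# Amazon Bedrock Data Automation parser.
# "MULTIMODAL" is an assumed value for parsingModality.
bda_config = build_vector_ingestion_configuration(
    {
        "parsingStrategy": "BEDROCK_DATA_AUTOMATION",
        "bedrockDataAutomationConfiguration": {"parsingModality": "MULTIMODAL"},
    }
)
```

Leaving the field out, rather than sending an empty object, mirrors the note above: an omitted configuration falls back to the default parser.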

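For more details about how to specify a parsing strategy in the API, see the section that corresponds to the parsing strategy that you want to use: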

### Amazon Bedrock default parser
<a name="w2aac32c10c23c15c17c11c13b1"></a>

To use the default parser, don't include a `parsingConfiguration` field in the `VectorIngestionConfiguration`.

### Amazon Bedrock Data Automation parser (preview)
<a name="w2aac32c10c23c15c17c11c13b3"></a>

To use the Amazon Bedrock Data Automation parser, specify `BEDROCK_DATA_AUTOMATION` in the `parsingStrategy` field of the `ParsingConfiguration` and include a [BedrockDataAutomationConfiguration](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent_BedrockDataAutomationConfiguration.html) in the `bedrockDataAutomationConfiguration` field, in the following format:

```
{
    "parsingStrategy": "BEDROCK_DATA_AUTOMATION",
    "bedrockDataAutomationConfiguration": {
        "parsingModality": "string"
    }
}
```

### Foundation model
<a name="w2aac32c10c23c15c17c11c13b5"></a>

To use a foundation model as the parser, specify `BEDROCK_FOUNDATION_MODEL` in the `parsingStrategy` field of the `ParsingConfiguration` and include a [BedrockFoundationModelConfiguration](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent_BedrockFoundationModelConfiguration.html) in the `bedrockFoundationModelConfiguration` field, in the following format:

```
{
    "parsingStrategy": "BEDROCK_FOUNDATION_MODEL",
    "bedrockFoundationModelConfiguration": {
        "modelArn": "string",
        "parsingModality": "string",
        "parsingPrompt": {
            "parsingPromptText": "string"
        }
    }
}
```
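
As an illustration of the format above, the following sketch builds a foundation model parsing configuration as a Python dictionary. The helper name is hypothetical, the model ARN is a placeholder rather than a real model, and treating `parsingModality` and `parsingPrompt` as optional is an assumption to verify against the API reference.

```python
from typing import Any, Dict, Optional


def foundation_model_parsing_configuration(
    model_arn: str,
    parsing_prompt_text: Optional[str] = None,
    parsing_modality: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a ParsingConfiguration that uses a foundation model (hypothetical helper)."""
    fm_config: Dict[str, Any] = {"modelArn": model_arn}
    # Assumption: these two fields may be left out when not needed.
    if parsing_modality is not None:
        fm_config["parsingModality"] = parsing_modality
    if parsing_prompt_text is not None:
        fm_config["parsingPrompt"] = {"parsingPromptText": parsing_prompt_text}
    return {
        "parsingStrategy": "BEDROCK_FOUNDATION_MODEL",
        "bedrockFoundationModelConfiguration": fm_config,
    }


# Placeholder ARN for illustration only.
parsing = foundation_model_parsing_configuration(
    model_arn="arn:aws:bedrock:us-east-1::foundation-model/example-model-id",
    parsing_prompt_text="Transcribe the text and describe any tables or figures.",
)
```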

## Choose a chunking strategy
<a name="kb-data-source-customize-chunking"></a>

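You can customize how the documents in your data are chunked for storage and retrieval. To learn about the options for chunking data in Amazon Bedrock Knowledge Bases, see [How content chunking works for knowledge bases](kb-chunking.md).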

**Warning**  
You can't change the chunking strategy after you connect to the data source.

In the AWS Management Console, you choose the chunking strategy when you connect to a data source. With the Amazon Bedrock API, you include a [ChunkingConfiguration](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent_ChunkingConfiguration.html) in the `chunkingConfiguration` field of the [VectorIngestionConfiguration](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent_VectorIngestionConfiguration.html).

**Note**  
If you omit this configuration, Amazon Bedrock splits your content into chunks of approximately 300 tokens while preserving sentence boundaries.

See the section that corresponds to the chunking strategy that you want to use:

### No chunking
<a name="w2aac32c10c23c15c17c13c13b1"></a>

To treat each document in your data source as a single source chunk, specify `NONE` in the `chunkingStrategy` field of the `ChunkingConfiguration`, in the following format:

```
{
    "chunkingStrategy": "NONE"
}
```

### Fixed-size chunking
<a name="w2aac32c10c23c15c17c13c13b3"></a>

To divide each document in your data source into chunks of approximately the same size, specify `FIXED_SIZE` in the `chunkingStrategy` field of the `ChunkingConfiguration` and include a [FixedSizeChunkingConfiguration](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent_FixedSizeChunkingConfiguration.html) in the `fixedSizeChunkingConfiguration` field, in the following format:

```
{
    "chunkingStrategy": "FIXED_SIZE",
    "fixedSizeChunkingConfiguration": {
        "maxTokens": number,
        "overlapPercentage": number
    }
}
```
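
A concrete instance of the format above might be built as follows. This is only a sketch: the helper name is hypothetical, the values 300 and 20 are illustrative, and the bounds checked here (at least one token, an overlap between 1 and 99 percent) are assumptions that should be confirmed against the API reference.

```python
from typing import Any, Dict


def fixed_size_chunking_configuration(
    max_tokens: int, overlap_percentage: int
) -> Dict[str, Any]:
    """Build a fixed-size ChunkingConfiguration (hypothetical helper)."""
    # Assumed bounds; they are not stated in the surrounding text.
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    if not 1 <= overlap_percentage <= 99:
        raise ValueError("overlap_percentage must be between 1 and 99")
    return {
        "chunkingStrategy": "FIXED_SIZE",
        "fixedSizeChunkingConfiguration": {
            "maxTokens": max_tokens,
            "overlapPercentage": overlap_percentage,
        },
    }


# Illustrative values: chunks of up to 300 tokens with 20 percent overlap.
chunking = fixed_size_chunking_configuration(max_tokens=300, overlap_percentage=20)
```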

### Hierarchical chunking
<a name="w2aac32c10c23c15c17c13c13b5"></a>

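To divide each document in your data source into two levels, where the second level contains smaller chunks derived from the first level, specify `HIERARCHICAL` in the `chunkingStrategy` field of the `ChunkingConfiguration` and include the `hierarchicalChunkingConfiguration` field, in the following format: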

```
{
    "chunkingStrategy": "HIERARCHICAL",
    "hierarchicalChunkingConfiguration": {
        "levelConfigurations": [{
            "maxTokens": number
        }],
        "overlapTokens": number
    }
}
```
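
Because the format above shows only one entry in `levelConfigurations`, a sketch with both levels filled in may help. The helper name and the token values are hypothetical, and the check that the second level is smaller than the first is an assumption drawn from the description of the second level as containing smaller chunks.

```python
from typing import Any, Dict


def hierarchical_chunking_configuration(
    parent_max_tokens: int, child_max_tokens: int, overlap_tokens: int
) -> Dict[str, Any]:
    """Build a two-level hierarchical ChunkingConfiguration (hypothetical helper)."""
    # Assumption: the second level must hold smaller chunks than the first.
    if child_max_tokens >= parent_max_tokens:
        raise ValueError("child_max_tokens must be smaller than parent_max_tokens")
    return {
        "chunkingStrategy": "HIERARCHICAL",
        "hierarchicalChunkingConfiguration": {
            # One entry per level: first level, then second level.
            "levelConfigurations": [
                {"maxTokens": parent_max_tokens},
                {"maxTokens": child_max_tokens},
            ],
            "overlapTokens": overlap_tokens,
        },
    }


# Illustrative values only.
chunking = hierarchical_chunking_configuration(
    parent_max_tokens=1500, child_max_tokens=300, overlap_tokens=60
)
```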

### Semantic chunking
<a name="w2aac32c10c23c15c17c13c13b7"></a>

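To divide each document in your data source into chunks that prioritize semantic meaning over syntactic structure, specify `SEMANTIC` in the `chunkingStrategy` field of the `ChunkingConfiguration` and include the `semanticChunkingConfiguration` field, in the following format: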

```
{
    "chunkingStrategy": "SEMANTIC",
    "semanticChunkingConfiguration": {
        "breakpointPercentileThreshold": number,
        "bufferSize": number,
        "maxTokens": number
    }
}
```
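
The following sketch fills in the format above with sample numbers. The helper name and the values are hypothetical, and the validated ranges (a percentile threshold from 50 to 99 and a buffer size of 0 or 1) are assumptions to check against the API reference before relying on them.

```python
from typing import Any, Dict


def semantic_chunking_configuration(
    breakpoint_percentile_threshold: int, buffer_size: int, max_tokens: int
) -> Dict[str, Any]:
    """Build a semantic ChunkingConfiguration (hypothetical helper)."""
    # Assumed ranges; they are not stated in the surrounding text.
    if not 50 <= breakpoint_percentile_threshold <= 99:
        raise ValueError("breakpoint_percentile_threshold must be between 50 and 99")
    if buffer_size not in (0, 1):
        raise ValueError("buffer_size must be 0 or 1")
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    return {
        "chunkingStrategy": "SEMANTIC",
        "semanticChunkingConfiguration": {
            "breakpointPercentileThreshold": breakpoint_percentile_threshold,
            "bufferSize": buffer_size,
            "maxTokens": max_tokens,
        },
    }


# Illustrative values only.
chunking = semantic_chunking_configuration(
    breakpoint_percentile_threshold=95, buffer_size=0, max_tokens=300
)
```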

## Use a Lambda function during ingestion
<a name="kb-data-source-customize-lambda"></a>

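You can use a Lambda function to post-process how the source chunks from your data are written to the vector store, in the following ways:
+ Include chunking logic that provides a custom chunking strategy.
+ Include logic that specifies chunk-level metadata.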

To learn how to write a custom Lambda function for ingestion, see [Use a custom transformation Lambda function to define how your data is ingested](kb-custom-transformation.md). In the AWS Management Console, you choose the Lambda function when you connect to a data source. With the Amazon Bedrock API, you include a [CustomTransformationConfiguration](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent_CustomTransformationConfiguration.html) in the `customTransformationConfiguration` field of the [VectorIngestionConfiguration](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_agent_VectorIngestionConfiguration.html) and specify the ARN of the Lambda function, in the following format:

```
{
    "transformations": [{
        "transformationFunction": {
            "transformationLambdaConfiguration": {
                "lambdaArn": "string"
            }
        },
        "stepToApply": "POST_CHUNKING"
    }],
    "intermediateStorage": {
        "s3Location": {
            "uri": "string"
        }
    }
}
```

You can also specify the S3 location in which to store the output after the Lambda function is applied.

You can include the `chunkingConfiguration` field to apply the Lambda function after applying one of the chunking options that Amazon Bedrock offers.
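
To show how the Lambda transformation and a built-in chunking option fit together, the following sketch assembles both into one ingestion configuration. The helper name, the ARN, and the S3 URI are hypothetical placeholders, and the key `customTransformationConfiguration` is written in the same camelCase as the other request fields, which should be checked against the API reference.

```python
from typing import Any, Dict, Optional


def lambda_ingestion_configuration(
    lambda_arn: str,
    intermediate_s3_uri: str,
    chunking_configuration: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build an ingestion configuration with a Lambda step (hypothetical helper)."""
    config: Dict[str, Any] = {
        "customTransformationConfiguration": {
            "transformations": [
                {
                    "transformationFunction": {
                        "transformationLambdaConfiguration": {"lambdaArn": lambda_arn}
                    },
                    "stepToApply": "POST_CHUNKING",
                }
            ],
            # S3 location that stores the output of the Lambda function.
            "intermediateStorage": {"s3Location": {"uri": intermediate_s3_uri}},
        }
    }
    # Optional: apply a built-in chunking option before the Lambda function runs.
    if chunking_configuration is not None:
        config["chunkingConfiguration"] = chunking_configuration
    return config


# Placeholder ARN and bucket for illustration only.
ingestion = lambda_ingestion_configuration(
    lambda_arn="arn:aws:lambda:us-east-1:111122223333:function:example-transform",
    intermediate_s3_uri="s3://amzn-s3-demo-bucket/intermediate/",
    chunking_configuration={"chunkingStrategy": "NONE"},
)
```

Passing `{"chunkingStrategy": "NONE"}` here leaves each document as a single chunk, so any splitting is left to the Lambda function.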