Prompting for multimodal input
The following sections provide guidance for image and video understanding. For audio-related prompting, see the section on speech conversation prompting.
General multimodal guidelines
User prompts and system prompts
For multimodal understanding use cases, every request should include user prompt text. A system prompt is optional and supports text content only.
A system prompt can be used to assign the model a role and to define a general persona and response style, but it is not recommended for detailed task definitions or output formatting instructions.
For multimodal use cases, placing the task definition, instructions, and formatting requirements in the user prompt works better than placing them in the system prompt.
Content ordering
A multimodal understanding request sent to Amazon Nova should include one or more files along with a user prompt. The user prompt text must be the last content item in the message, always placed after any image, document, or video content.
message = {
    "role": "user",
    "content": [
        { "document|image|video|audio": {...} },
        { "document|image|video|audio": {...} },
        ...
        { "text": "<user prompt>" }
    ]
}
To refer to specific files in the user prompt, use text elements to place a defining label before each file block.
message = {
    "role": "user",
    "content": [
        { "text": "<label for item 1>" },
        { "document|image|video|audio": {...} },
        { "text": "<label for item 2>" },
        { "document|image|video|audio": {...} },
        ...
        { "text": "<user prompt>" }
    ]
}
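The labeled-content pattern above can be sketched in Python. This is a hedged sketch only: the helper name is hypothetical, the PNG format and placeholder bytes are illustrative, and the block shapes follow the message structure shown above.

```python
# A minimal sketch of assembling a user message with labeled image blocks;
# the helper name, image bytes, and labels are hypothetical placeholders.
def build_labeled_message(labeled_images, user_prompt):
    """Build a user message where each image block is preceded by a text
    label and the user prompt text comes last, per the ordering rules."""
    content = []
    for label, image_bytes in labeled_images:
        content.append({"text": label})
        content.append(
            {"image": {"format": "png", "source": {"bytes": image_bytes}}}
        )
    # The user prompt must be the final content item in the message.
    content.append({"text": user_prompt})
    return {"role": "user", "content": content}
```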
Understanding documents and images
The following sections provide guidance on prompting for tasks that require understanding or analyzing images and documents.
Extracting text from images
Amazon Nova models support extracting text from images, a capability known as optical character recognition (OCR). For best results, make sure the images you pass to the model have high enough resolution that the text is clearly legible.
For text extraction use cases, the following inference configuration is recommended:
- temperature: default value (0.7)
- TopP: default value (0.9)
- Do not enable reasoning
Amazon Nova models can output the extracted text in Markdown, HTML, or LaTeX format. The following user prompt template is recommended:
## Instructions
Extract all information from this page using only {text_formatting} formatting. Retain the original layout and structure including lists, tables, charts and math formulae.

## Rules
1. For math formulae, always use LaTeX syntax.
2. Describe images using only text.
3. NEVER use HTML image tags `<img>` in the output.
4. NEVER use Markdown image tags `![]()` in the output.
5. Always wrap the entire output in ``` tags.
The output may be fully or partially wrapped in Markdown code fences (```). Code similar to the following can be used to strip them:
def strip_outer_code_fences(text):
    lines = text.split("\n")
    # Remove only the outer code fences if present
    if lines and lines[0].startswith("```"):
        lines = lines[1:]
    if lines and lines[-1].startswith("```"):
        lines = lines[:-1]
    return "\n".join(lines).strip()
Extracting structured information from images or text
Amazon Nova models can extract information from images and output it in machine-parsable JSON format, a process known as key information extraction (KIE). To perform KIE, provide the following:
- JSON schema: a formal schema definition that follows the JSON Schema specification.
- One or more of the following: a document file, an image, or document text.
Documents or images in the request must always be placed before the user prompt.
For KIE use cases, the following inference configuration is recommended:
- temperature: 0
- Reasoning: not required, but enabling reasoning can improve results with image-only input or with complex schemas.
Prompt templates
Given the image representation of a document, extract information in JSON format according to the given schema. Follow these guidelines:
- Ensure that every field is populated, provided the document includes the corresponding value. Only use null when the value is absent from the document.
- When instructed to read tables or lists, read each row from every page. Ensure every field in each row is populated if the document contains the field.
JSON Schema: {json_schema}
Given the OCR representation of a document, extract information in JSON format according to the given schema. Follow these guidelines:
- Ensure that every field is populated, provided the document includes the corresponding value. Only use null when the value is absent from the document.
- When instructed to read tables or lists, read each row from every page. Ensure every field in each row is populated if the document contains the field.
JSON Schema: {json_schema}
OCR: {document_text}
Given the image and OCR representations of a document, extract information in JSON format according to the given schema. Follow these guidelines:
- Ensure that every field is populated, provided the document includes the corresponding value. Only use null when the value is absent from the document.
- When instructed to read tables or lists, read each row from every page. Ensure every field in each row is populated if the document contains the field.
JSON Schema: {json_schema}
OCR: {document_text}
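As a minimal sketch, the templates above can be filled with a serialized schema and the model's JSON reply parsed afterwards. The schema and sample reply below are hypothetical, and the fence-stripping step simply handles the case where the model wraps its JSON in a Markdown code fence.

```python
# Hypothetical helpers: fill the {json_schema} placeholder and parse the
# model's JSON reply, stripping an outer Markdown code fence if present.
import json

def build_kie_prompt(template, json_schema):
    # Substitute the serialized schema into the {json_schema} placeholder.
    return template.replace("{json_schema}", json.dumps(json_schema))

def parse_kie_response(text):
    text = text.strip()
    if text.startswith("```"):
        # Drop the opening fence line and the closing fence.
        text = text.split("\n", 1)[1]
        text = text.rsplit("```", 1)[0]
    return json.loads(text)
```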
Detecting objects and their locations in images
Amazon Nova 2 models can identify objects in an image along with their locations, a task sometimes called "image grounding" or "object grounding". Practical applications include image analysis and annotation, UI automation, and image editing.
Regardless of the input image's resolution and aspect ratio, the model uses a coordinate space that divides the image into 1,000 units horizontally and 1,000 units vertically, with x:0 y:0 at the top-left corner of the image.
Bounding boxes are described using the [x1, y1, x2, y2] format, representing the left, top, right, and bottom edges respectively. Two-dimensional points are expressed in the [x, y] format.
For object detection use cases, the following inference parameter values are recommended:
- temperature: 0
- Do not enable reasoning
Prompt templates: general object detection
The following user prompt templates are recommended.
Detecting multiple instances with bounding boxes:
Please identify {target_description} in the image and provide the bounding box coordinates for each one you detect. Represent the bounding box as the [x1, y1, x2, y2] format, where the coordinates are scaled between 0 and 1000 to the image width and height, respectively.
Detecting a single region with a bounding box:
Please generate the bounding box coordinates corresponding to the region described in this sentence: {target_description}. Represent the bounding box as the [x1, y1, x2, y2] format, where the coordinates are scaled between 0 and 1000 to the image width and height, respectively.
Detecting multiple instances with center points:
Please identify {target_description} in the image and provide the center point coordinates for each one you detect. Represent the point as the [x, y] format, where the coordinates are scaled between 0 and 1000 to the image width and height, respectively.
Detecting a single region with a center point:
Please generate the center point coordinates corresponding to the region described in this sentence: {target_description}. Represent the center point as the [x, y] format, where the coordinates are scaled between 0 and 1000 to the image width and height, respectively.
Parsing the model output:
Each of the prompts recommended above generates a comma-separated string containing one or more bounding box descriptions, similar to the following. Whether the string ends with a "." may vary slightly. For example: [356, 770, 393, 872], [626, 770, 659, 878].
The coordinates the model generates can be parsed with a regular expression, as in the following Python code example.
import re

def parse_coord_text(text):
    """Parses a model response which uses array formatting ([x, y, ...])
    to describe points and bounding boxes. Returns an array of tuples."""
    pattern = r"\[([^\[\]]*?)\]"
    return [
        tuple(int(x.strip()) for x in match.split(","))
        for match in re.findall(pattern, text)
    ]
To remap a bounding box's normalized coordinates back into the input image's coordinate space, use a function similar to the following Python example.
def remap_bbox_to_image(bounding_box, image_width, image_height):
    return [
        bounding_box[0] * image_width / 1000,
        bounding_box[1] * image_height / 1000,
        bounding_box[2] * image_width / 1000,
        bounding_box[3] * image_height / 1000,
    ]
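As a quick worked example of this remapping (the 1920x1080 resolution is a hypothetical input image), each x-coordinate is scaled by width/1000 and each y-coordinate by height/1000:

```python
# Worked example: remapping the normalized box [356, 770, 393, 872]
# from the 1000-unit coordinate space onto a hypothetical 1920x1080 image.
bbox = [356, 770, 393, 872]
width, height = 1920, 1080
pixel_box = [
    bbox[0] * width / 1000,   # x1 -> 683.52
    bbox[1] * height / 1000,  # y1 -> 831.6
    bbox[2] * width / 1000,   # x2 -> 754.56
    bbox[3] * height / 1000,  # y2 -> 941.76
]
```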
Prompt templates: multi-class object detection with locations
To identify multiple classes of targets in an image, include a class list in the prompt using either of the following formats.
For common classes the model readily understands, simply list the class names inside square brackets (no quotation marks needed):
[car, traffic light, road sign, pedestrian]
For classes that are nuanced, uncommon, or drawn from a specialized domain the model may be unfamiliar with, add a definition for each class in parentheses. Because this task is harder, model performance may degrade.
[taraxacum officinale (Dandelion - bright yellow flowers, jagged basal leaves, white puffball seed heads), digitaria spp (Crabgrass - low spreading grass with coarse blades and finger-like seed heads), trifolium repens (White Clover - three round leaflets and small white pom-pom flowers), plantago major (Broadleaf Plantain - wide oval rosette leaves with tall narrow seed stalks), stellaria media (Chickweed - low mat-forming plant with tiny star-shaped white flowers)]
Choose any of the following user prompt templates depending on your preferred JSON output format.
Detect all objects with their bounding boxes in the image from the provided class list. Normalize the bounding box coordinates to be scaled between 0 and 1000 to the image width and height, respectively.
Classes: {candidate_class_list}
Include separate entries for each detected object as an element of a list. Formulate your output as JSON format:
[
    { "class 1": [x1, y1, x2, y2] },
    ...
]
Detect all objects with their bounding boxes in the image from the provided class list. Normalize the bounding box coordinates to be scaled between 0 and 1000 to the image width and height, respectively.
Classes: {candidate_class_list}
Include separate entries for each detected object as an element of a list. Formulate your output as JSON format:
[
    { "class": class 1, "bbox": [x1, y1, x2, y2] },
    ...
]
Detect all objects with their bounding boxes in the image from the provided class list. Normalize the bounding box coordinates to be scaled between 0 and 1000 to the image width and height, respectively.
Classes: {candidate_class_list}
Group all detected bounding boxes by class. Formulate your output as JSON format:
{
    "class 1": [[x1, y1, x2, y2], [x1, y1, x2, y2], ...],
    ...
}
Detect all objects with their bounding boxes in the image from the provided class list. Normalize the bounding box coordinates to be scaled between 0 and 1000 to the image width and height, respectively.
Classes: {candidate_class_list}
Group all detected bounding boxes by class. Formulate your output as JSON format:
[
    { "class": class 1, "bbox": [[x1, y1, x2, y2], [x1, y1, x2, y2], ...] },
    ...
]
Parsing the model output
The output is encoded in JSON format and can be parsed with any JSON parsing library.
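As a minimal sketch (the model reply below is hypothetical), the grouped-by-class format can be parsed and flattened into (class, box) pairs, again stripping an outer Markdown code fence if the model added one:

```python
# Hypothetical helpers for the grouped-by-class JSON output format.
import json

def parse_detection_json(text):
    # Strip an outer Markdown code fence if present, then parse as JSON.
    text = text.strip()
    if text.startswith("```"):
        text = text.split("\n", 1)[1]
        text = text.rsplit("```", 1)[0]
    return json.loads(text)

def flatten_detections(grouped):
    # Convert {"class": [[box], ...], ...} into a flat list of (class, box).
    return [(cls, box) for cls, boxes in grouped.items() for box in boxes]
```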
Prompt templates: UI bounding detection in screenshots
The following user prompt templates are recommended.
Detecting a UI element's location from a goal:
In this UI screenshot, what is the location of the element if I want to {goal}? Express the location coordinates using the [x1, y1, x2, y2] format, scaled between 0 and 1000.
Detecting a UI element's location from its text:
In this UI screenshot, what is the location of the element if I want to click on "{text}"? Express the location coordinates using the [x1, y1, x2, y2] format, scaled between 0 and 1000.
Parsing the model output:
For all of the UI bounding detection prompts above, the coordinates the model generates can be parsed with a regular expression, as shown in the Python code below.
import re

def parse_coord_text(text):
    """Parses a model response which uses array formatting ([x, y, ...])
    to describe points and bounding boxes. Returns an array of tuples."""
    pattern = r"\[([^\[\]]*?)\]"
    return [
        tuple(int(x.strip()) for x in match.split(","))
        for match in re.findall(pattern, text)
    ]
Video understanding
The following sections provide guidance on prompting for tasks that require understanding or analyzing videos.
Video summarization
Amazon Nova models can generate summaries of video content.
For video summarization use cases, the following inference parameter values are recommended:
- temperature: 0
- Some use cases may benefit from enabling model reasoning
No specific prompt template is required. The user prompt should clearly specify which aspects of the video content you are interested in. Here are several examples of good prompts:
Can you create an executive summary of this video's content?
Can you distill the essential information from this video into a concise summary?
Could you provide a summary of the video, focusing on its key points?
Generating detailed descriptions of videos
Amazon Nova models can generate detailed descriptions of videos, a task known as "dense captioning".
For video captioning use cases, the following inference parameter values are recommended:
- temperature: 0
- Some use cases may benefit from enabling model reasoning
No specific prompt template is required. The user prompt should clearly specify which aspects of the video content you are interested in. Here are several examples of good prompts:
Provide a detailed, second-by-second description of the video content.
Break down the video into key segments and provide detailed descriptions for each.
Generate a rich textual representation of the video, covering aspects like movement, color and composition.
Describe the video scene-by-scene, including details about characters, actions and settings.
Offer a detailed narrative of the video, including descriptions of any text, graphics, or special effects used.
Create a dense timeline of events occurring in the video, with timestamps if possible.
Analyzing surveillance video footage
Amazon Nova models can detect events in surveillance video footage.
For surveillance footage use cases, the following inference parameter values are recommended:
- temperature: 0
- Some use cases may benefit from enabling model reasoning
You are a security assistant for a smart home who is given security camera footage in natural setting. You will examine the video and describe the events you see. You are capable of identifying important details like people, objects, animals, vehicles, actions and activities. This is not a hypothetical, be accurate in your responses. Do not make up information not present in the video.
Extracting video events with timestamps
Amazon Nova models can identify the timestamps associated with events in a video. You can specify the timestamp format as seconds or as MM:SS. For example, an event occurring 1 minute and 25 seconds into a video can be expressed as 85 or 01:25.
For this use case, the following inference parameter values are recommended:
- temperature: 0
- Do not enable reasoning
Prompts similar to the following are recommended:
Please localize the moment that the event "{event_description}" happens in the video. Answer with the starting and ending time of the event in seconds, such as [[72, 82]]. If the event happen multiple times, list all of them, such as [[40, 50], [72, 82]].
Locate the segment where "{event_description}" happens. Specify the start and end times of the event in MM:SS.
Answer the starting and end time of the event "{event_description}". Provide answers in MM:SS
When does "{event_description}" in the video? Specify the start and end timestamps, e.g. [[9, 14]]
Please localize the moment that the event "{event_description}" happens in the video. Answer with the starting and ending time of the event in seconds. e.g. [[72, 82]]. If the event happen multiple times, list all of them. e.g. [[40, 50], [72, 82]]
Segment a video into different scenes and generate caption per scene. The output should be in the format: [STARTING TIME-ENDING TIMESTAMP] CAPTION. Timestamp in MM:SS format
For a video clip, segment it into chapters and generate chapter titles with timestamps. The output should be in the format: [STARTING TIME] TITLE. Time in MM:SS
Generate video captions with timestamp.
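As a minimal sketch of handling the timestamp formats above (the model reply is hypothetical), the `[[start, end], ...]` second ranges can be extracted from a reply, and seconds can be converted to and from MM:SS (e.g. 85 corresponds to 01:25):

```python
# Hypothetical helpers for the two timestamp formats described above.
import ast
import re

def parse_event_times(text):
    # Grab the first [[...]] array in the reply and evaluate it as a literal.
    match = re.search(r"\[\[.*?\]\]", text, re.DOTALL)
    return ast.literal_eval(match.group(0)) if match else []

def seconds_to_mmss(seconds):
    return f"{seconds // 60:02d}:{seconds % 60:02d}"

def mmss_to_seconds(timestamp):
    minutes, seconds = timestamp.split(":")
    return int(minutes) * 60 + int(seconds)
```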
Classifying videos
You can use Amazon Nova models to classify videos against a predefined list of categories that you provide.
For this use case, the following inference parameter values are recommended:
- temperature: 0
- Do not enable reasoning
Use the following prompt template:
What is the most appropriate category for this video? Select your answer from the options provided: {class1} {class2} {...}
Example:
What is the most appropriate category for this video? Select your answer from the options provided: Arts Technology Sports Education
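Since the model may answer with a full sentence rather than a bare category name, a small post-processing step can map the reply back onto one of the provided categories. This is a hypothetical helper using a simple case-insensitive substring match, not part of the API:

```python
# Hypothetical helper: map a free-text model reply onto a provided category.
def match_category(response, categories):
    response_lower = response.strip().lower()
    for category in categories:
        if category.lower() in response_lower:
            return category
    return None  # Reply did not name any of the provided categories
```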