Prompting multimodal inputs

The following sections provide guidance for image and video understanding. For audio-related prompts, see the Speech conversation prompts section.

General multimodal guidelines

User prompts and system prompts

For multimodal understanding use cases, every request should include user prompt text. System prompts, which can contain only text, are optional.

System prompts can be used to specify a persona for the model and to define its general personality and response style, but they should not be used for detailed task definitions or output formatting instructions.

Put task definitions, instructions, and formatting details in the user prompt, where they have a stronger effect than in the system prompt for multimodal use cases.

Content order

A multimodal understanding request sent to Amazon Nova should contain one or more files and a user prompt. The user text prompt should be the last item in the message, always after the image, document, or video content.


message = {
  "role": "user",
  "content": [
    { "document|image|video|audio": {...} },
    { "document|image|video|audio": {...} },
    ...
    { "text": "<user prompt>" }
  ]
}
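
As a minimal sketch of this ordering rule, the hypothetical helper below (build_user_message is not part of any SDK) appends the text prompt after all file blocks. The example image block uses a Bedrock Converse-style layout, which is an assumption here; substitute whatever block shape your API expects.

```python
def build_user_message(file_blocks, user_prompt):
    """Return a user message with every file block first and the text prompt last.

    `file_blocks` is a list of already-built content blocks (dicts keyed by
    "document", "image", "video", or "audio"). This hypothetical helper only
    enforces the ordering; it does not validate the blocks themselves.
    """
    content = list(file_blocks)
    # The text prompt is always the final item in the message.
    content.append({"text": user_prompt})
    return {"role": "user", "content": content}


# Assumed block layout (Converse-style); the bytes are a placeholder.
image_block = {"image": {"format": "png", "source": {"bytes": b"<png bytes>"}}}
message = build_user_message([image_block], "Describe this image.")
```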

If you want to refer to specific files from within the user prompt, use "text" elements to define a label that precedes each file block.


message = {
  "role": "user",
  "content": [
    { "text": "<label for item 1>" },
    { "document|image|video|audio": {...} },
    { "text": "<label for item 2>" },
    { "document|image|video|audio": {...} },
    ...
    { "text": "<user prompt>" }
  ]
}
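
The labeled layout can be sketched the same way. The helper name and the placeholder blocks below are hypothetical; only the interleaving of label, file, and final prompt comes from the structure shown above.

```python
def build_labeled_message(labeled_blocks, user_prompt):
    """Interleave a text label before each file block, then add the prompt.

    `labeled_blocks` is a list of (label, file_block) pairs. Hypothetical
    helper for illustration; the file blocks are passed through unchanged.
    """
    content = []
    for label, block in labeled_blocks:
        content.append({"text": label})  # the label precedes its file
        content.append(block)
    content.append({"text": user_prompt})  # the prompt stays last
    return {"role": "user", "content": content}


# Placeholder file blocks; real ones carry the file format and data.
labeled = build_labeled_message(
    [("Image 1:", {"image": {}}), ("Image 2:", {"image": {}})],
    "What differs between Image 1 and Image 2?",
)
```

The prompt can then refer to "Image 1" and "Image 2" by the same labels.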

Document and image understanding

The following sections provide guidance on how to craft prompts for tasks that require understanding or analyzing images and documents.

Extracting text from images

Amazon Nova models can extract text from images, a capability referred to as Optical Character Recognition (OCR). For best results, make sure the input image has a high enough resolution that the text characters are easy to discern.

For text extraction use cases, we recommend the following inference configuration:

Temperature: default (0.7)
TopP: default (0.9)
Do not enable reasoning

Amazon Nova models can extract text to Markdown, HTML, or LaTeX format. The following user prompt template is recommended:


## Instructions
Extract all information from this page using only {text_formatting} formatting. Retain the original layout and structure including lists, tables, charts and math formulae. 

## Rules
1. For math formulae, always use LaTeX syntax. 
2. Describe images using only text.
3. NEVER use HTML image tags `<img>` in the output.
4. NEVER use Markdown image tags `![]()` in the output.
5. Always wrap the entire output in ``` tags.

The output is wrapped in full or partial Markdown code fences (```). You can strip the code fences using code similar to the following:


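def strip_outer_code_fences(text):
    """Remove the outer Markdown code fence from a model response, if present."""
    # Trim first so a trailing newline does not hide the closing fence
    lines = text.strip().split("\n")
    # Remove only the outer code fences if present
    if lines and lines[0].startswith("```"):
        lines = lines[1:]
        # The closing fence may be missing when the fence is partial
        if lines and lines[-1].startswith("```"):
            lines = lines[:-1]
    return "\n".join(lines).strip()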

Extracting structured information from images or text

Amazon Nova models can extract information from images into machine-parseable JSON, a process referred to as Key Information Extraction (KIE). To perform KIE, provide the following:

JSON schema. A formal schema definition that follows the JSON Schema specification.
One or more of the following: a document or image file, or the document text

The document or image must always be placed before your user prompt in the request.

For KIE use cases, we recommend the following inference configuration:

Temperature: 0
Reasoning: Reasoning is not required but can improve results when image-only input or complex schemas are used.

Prompt templates


Given the image representation of a document, extract information in JSON format according to the given schema.
     
Follow these guidelines:
- Ensure that every field is populated, provided the document includes the corresponding value. Only use null when the value is absent from the document.
- When instructed to read tables or lists, read each row from every page. Ensure every field in each row is populated if the document contains the field.

JSON Schema:
{json_schema}


Given the OCR representation of a document, extract information in JSON format according to the given schema.

Follow these guidelines:
- Ensure that every field is populated, provided the document includes the corresponding value. Only use null when the value is absent from the document.
- When instructed to read tables or lists, read each row from every page. Ensure every field in each row is populated if the document contains the field.

JSON Schema:
{json_schema}

OCR:
{document_text}


Given the image and OCR representations of a document, extract information in JSON format according to the given schema.
       
Follow these guidelines:
- Ensure that every field is populated, provided the document includes the corresponding value. Only use null when the value is absent from the document.
- When instructed to read tables or lists, read each row from every page. Ensure every field in each row is populated if the document contains the field.

JSON Schema:
{json_schema}

OCR:
{document_text}
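
The three templates differ only in their opening sentence and in whether an OCR section is appended, so they can be assembled from one function. The sketch below is illustrative: build_kie_prompt, its parameters, and the invoice schema are assumptions, not part of the documented guidance.

```python
import json

_GUIDELINES = """Follow these guidelines:
- Ensure that every field is populated, provided the document includes the corresponding value. Only use null when the value is absent from the document.
- When instructed to read tables or lists, read each row from every page. Ensure every field in each row is populated if the document contains the field."""


def build_kie_prompt(json_schema, document_text=None, has_image=True):
    """Fill the KIE template that matches the available inputs (hypothetical helper)."""
    if has_image and document_text:
        source = "image and OCR representations"
    elif has_image:
        source = "image representation"
    elif document_text:
        source = "OCR representation"
    else:
        raise ValueError("provide an image, document text, or both")

    parts = [
        f"Given the {source} of a document, extract information in JSON "
        "format according to the given schema.",
        _GUIDELINES,
        "JSON Schema:\n" + json.dumps(json_schema, indent=2),
    ]
    if document_text:
        parts.append("OCR:\n" + document_text)
    return "\n\n".join(parts)


# Illustrative schema only; replace it with your own JSON Schema.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": ["string", "null"]},
        "total": {"type": ["number", "null"]},
    },
}
image_only_prompt = build_kie_prompt(invoice_schema)
text_only_prompt = build_kie_prompt(invoice_schema, "Invoice 42 Total 10.00", has_image=False)
```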

Detecting objects and their positions in images

Amazon Nova 2 models can identify objects and their positions within images, a task sometimes referred to as image grounding or object localization. Practical applications include image analysis and tagging, user interface automation, image editing, and more.

Regardless of the input image's resolution and aspect ratio, the model uses a coordinate space that divides the image into 1,000 units horizontally and 1,000 units vertically, with the location x: 0, y: 0 at the top left of the image.

Bounding boxes are described using the format [x1, y1, x2, y2], representing left, top, right, and bottom respectively. Two-dimensional coordinates are represented using the format [x, y].

For object detection use cases, we recommend the following inference parameter values:

Temperature: 0
Do not enable reasoning

Prompt templates: general object detection

We recommend the following user prompt templates.

Detecting multiple instances with bounding boxes:


Please identify {target_description} in the image and provide the bounding box coordinates for each one you detect. Represent the bounding box as the [x1, y1, x2, y2] format, where the coordinates are scaled between 0 and 1000 to the image width and height, respectively.

Detecting a single region with a bounding box:


Please generate the bounding box coordinates corresponding to the region described in this sentence: {target_description}. Represent the bounding box as the [x1, y1, x2, y2] format, where the coordinates are scaled between 0 and 1000 to the image width and height, respectively.

Detecting multiple instances with center points:


Please identify {target_description} in the image and provide the center point coordinates for each one you detect. Represent the point as the [x, y] format, where the coordinates are scaled between 0 and 1000 to the image width and height, respectively.

Detecting a single region with a center point:


Please generate the center point coordinates corresponding to the region described in this sentence: {target_description}. Represent the center point as the [x, y] format, where the coordinates are scaled between 0 and 1000 to the image width and height, respectively.

Parsing the model output:

Each of the recommended prompts above produces a comma-separated string containing one or more bounding box descriptions in a form similar to the following. The model may or may not include a "." at the end of the string. For example: [356, 770, 393, 872], [626, 770, 659, 878].

You can parse the coordinate information generated by the model using a regular expression, as shown in the following Python code example.


import re


def parse_coord_text(text):
    """Parses a model response which uses array formatting ([x, y, ...])
    to describe points and bounding boxes. Returns an array of tuples."""
    pattern = r"\[([^\[\]]*?)\]"
    return [
        tuple(int(x.strip()) for x in match.split(","))
        for match in re.findall(pattern, text)
    ]

To remap normalized bounding box coordinates to the input image's coordinate space, you can use a function similar to the following Python example.


def remap_bbox_to_image(bounding_box, image_width, image_height):
    return [
        bounding_box[0] * image_width / 1000,
        bounding_box[1] * image_height / 1000,
        bounding_box[2] * image_width / 1000,
        bounding_box[3] * image_height / 1000,
    ]
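
To make the arithmetic concrete, the following self-contained sketch combines the two steps above on the sample response string. The function name parse_and_remap and the 2000 x 1000 pixel image size are assumptions chosen for illustration.

```python
import re


def parse_and_remap(text, image_width, image_height):
    """Parse [x1, y1, x2, y2] boxes from a response and scale them to pixels."""
    boxes = []
    for match in re.findall(r"\[([^\[\]]*?)\]", text):
        values = [int(v.strip()) for v in match.split(",")]
        if len(values) != 4:
            continue  # skip [x, y] points or other non-box entries
        x1, y1, x2, y2 = values
        # Normalized units run from 0 to 1000 on both axes.
        boxes.append((
            x1 * image_width / 1000,
            y1 * image_height / 1000,
            x2 * image_width / 1000,
            y2 * image_height / 1000,
        ))
    return boxes


response = "[356, 770, 393, 872], [626, 770, 659, 878]."
pixel_boxes = parse_and_remap(response, image_width=2000, image_height=1000)
# pixel_boxes[0] == (712.0, 770.0, 786.0, 872.0)
```

Because the image is twice as wide as it is tall, x values are scaled by 2 while y values are unchanged.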

Prompt templates: detecting multiple object classes with positions

When you want to identify multiple classes of items in an image, you can include a list of classes in your prompt using one of the following formatting approaches.

For commonly understood classes that the model is likely to know well, list the class names (without quotation marks) inside square brackets:


[car, traffic light, road sign, pedestrian]

For classes that are nuanced, uncommon, or come from specialized domains that the model may not be familiar with, include a definition for each class in parentheses. Because this task is challenging, expect lower model performance.


[taraxacum officinale (Dandelion - bright yellow flowers, jagged basal leaves, white puffball seed heads), digitaria spp (Crabgrass - low spreading grass with coarse blades and finger-like seed heads), trifolium repens (White Clover - three round leaflets and small white pom-pom flowers), plantago major (Broadleaf Plantain - wide oval rosette leaves with tall narrow seed stalks), stellaria media (Chickweed - low mat-forming plant with tiny star-shaped white flowers)]

Use one of the following user prompt templates, depending on the JSON output format you prefer.


Detect all objects with their bounding boxes in the image from the provided class list. Normalize the bounding box coordinates to be scaled between 0 and 1000 to the image width and height, respectively.

Classes: {candidate_class_list}

Include separate entries for each detected object as an element of a list. 

Formulate your output as JSON format:
[
  {
    "class 1": [x1, y1, x2, y2]
  },
  ...
]


Detect all objects with their bounding boxes in the image from the provided class list. Normalize the bounding box coordinates to be scaled between 0 and 1000 to the image width and height, respectively.

Classes: {candidate_class_list}

Include separate entries for each detected object as an element of a list.

Formulate your output as JSON format:
[
    {
        "class": class 1,
        "bbox": [x1, y1, x2, y2]
    },
    ...
]


Detect all objects with their bounding boxes in the image from the provided class list. Normalize the bounding box coordinates to be scaled between 0 and 1000 to the image width and height, respectively.

Classes: {candidate_class_list}

Group all detected bounding boxes by class.

Formulate your output as JSON format:
{
    "class 1": [[x1, y1, x2, y2], [x1, y1, x2, y2], ...],
    ...
}


Detect all objects with their bounding boxes in the image from the provided class list. Normalize the bounding box coordinates to be scaled between 0 and 1000 to the image width and height, respectively.

Classes: {candidate_class_list}

Group all detected bounding boxes by class.

Formulate your output as JSON format:
[
    {
        "class": class 1,
        "bbox": [[x1, y1, x2, y2], [x1, y1, x2, y2], ...]
    },
    ...
]

Parsing the model output

The output is encoded as JSON, which can be parsed with any JSON parsing library.
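
Since the four templates yield differently shaped JSON, downstream code may want one common representation. The sketch below, using a hypothetical flatten_detections helper, converts any of the four shapes into a flat list of (class, box) pairs. It assumes the response is bare JSON with no surrounding code fences or commentary.

```python
import json


def flatten_detections(raw_json):
    """Convert any of the four template output shapes to [(class, box), ...]."""
    data = json.loads(raw_json)
    detections = []

    def add(label, boxes):
        # A single box is a flat list of numbers; wrap it so both shapes match.
        if boxes and not isinstance(boxes[0], list):
            boxes = [boxes]
        for box in boxes:
            detections.append((label, tuple(box)))

    if isinstance(data, dict):
        # Grouped by class: {"class 1": [[...], [...]], ...}
        for label, boxes in data.items():
            add(label, boxes)
    else:
        for entry in data:
            if "class" in entry and "bbox" in entry:
                # {"class": ..., "bbox": [...]} with one box or a list of boxes
                add(entry["class"], entry["bbox"])
            else:
                # {"class 1": [x1, y1, x2, y2]}
                for label, boxes in entry.items():
                    add(label, boxes)
    return detections


flat = flatten_detections('{"car": [[1, 2, 3, 4], [5, 6, 7, 8]]}')
# flat == [("car", (1, 2, 3, 4)), ("car", (5, 6, 7, 8))]
```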

Prompt templates: screenshot UI bounds detection

We recommend the following user prompt templates.

Detecting the position of a UI element based on a goal:


In this UI screenshot, what is the location of the element if I want to {goal}? Express the location coordinates using the [x1, y1, x2, y2] format, scaled between 0 and 1000.

Detecting the position of a UI element based on its text:


In this UI screenshot, what is the location of the element if I want to click on "{text}"? Express the location coordinates using the [x1, y1, x2, y2] format, scaled between 0 and 1000.

Parsing the model output:

For each of the UI bounds detection prompts above, you can parse the coordinate information generated by the model using a regular expression, as shown in the Python code example below.


import re


def parse_coord_text(text):
    """Parses a model response which uses array formatting ([x, y, ...]) 
    to describe points and bounding boxes. Returns an array of tuples."""
    pattern = r"\[([^\[\]]*?)\]"
    return [
        tuple(int(x.strip()) for x in match.split(","))
        for match in re.findall(pattern, text)
    ]
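
For UI automation, a parsed box usually has to become a single click position in screen pixels. One possible approach, sketched here with a hypothetical helper and an assumed 1920 x 1080 screenshot, is to take the center of the box and scale it from the 0 to 1000 coordinate space.

```python
def bbox_center_in_pixels(bbox, screen_width, screen_height):
    """Return the pixel center of a normalized [x1, y1, x2, y2] box."""
    x1, y1, x2, y2 = bbox
    # Average the edges in normalized units, then scale to pixels.
    center_x = (x1 + x2) / 2 * screen_width / 1000
    center_y = (y1 + y2) / 2 * screen_height / 1000
    return round(center_x), round(center_y)


click_x, click_y = bbox_center_in_pixels((100, 200, 300, 400), 1920, 1080)
# (click_x, click_y) == (384, 324)
```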

Video understanding

The following sections provide guidance on how to craft prompts for tasks that require understanding or analyzing videos.

Summarizing videos

Amazon Nova models can generate summaries of video content.

For video summarization use cases, we recommend the following inference parameter values:

Temperature: 0
Some use cases may benefit from enabling model reasoning

No specific prompt template is required. Your user prompt should clearly specify the aspects of the video that you care about. The following are some examples of effective prompts:


Can you create an executive summary of this video's content?


Can you distill the essential information from this video into a concise summary?


Could you provide a summary of the video, focusing on its key points?

Generating detailed captions for videos

Amazon Nova models can generate detailed captions for videos, a task referred to as dense captioning.

For video captioning use cases, we recommend the following inference parameter values:

Temperature: 0
Some use cases may benefit from enabling model reasoning

No specific prompt template is required. Your user prompt should clearly specify the aspects of the video that you care about. The following are some examples of effective prompts:


Provide a detailed, second-by-second description of the video content.


Break down the video into key segments and provide detailed descriptions for each.


Generate a rich textual representation of the video, covering aspects like movement, color and composition.


Describe the video scene-by-scene, including details about characters, actions and settings.


Offer a detailed narrative of the video, including descriptions of any text, graphics, or special effects used.


Create a dense timeline of events occurring in the video, with timestamps if possible.

Analyzing security video footage

Amazon Nova models can detect events in security footage.

For security footage use cases, we recommend the following inference parameter values:

Temperature: 0
Some use cases may benefit from enabling model reasoning


You are a security assistant for a smart home who is given security camera footage in natural setting. You will examine the video and describe the events you see. You are capable of identifying important details like people, objects, animals, vehicles, actions and activities. This is not a hypothetical, be accurate in your responses. Do not make up information not present in the video.

Extracting video events with timestamps

Amazon Nova models can identify timestamps associated with events in a video. You can request that timestamps be formatted in seconds or in MM:SS format. For example, an event that occurs 1 minute 25 seconds into the video can be represented as 85 or 01:25.
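
If your application needs to move between the two representations, the conversion is simple arithmetic. The helper names below are illustrative, not part of any SDK.

```python
def seconds_to_mmss(total_seconds):
    """Format a number of seconds as MM:SS, e.g. 85 becomes "01:25"."""
    minutes, seconds = divmod(int(total_seconds), 60)
    return f"{minutes:02d}:{seconds:02d}"


def mmss_to_seconds(timestamp):
    """Convert an MM:SS string back to whole seconds."""
    minutes, seconds = timestamp.strip().split(":")
    return int(minutes) * 60 + int(seconds)


formatted = seconds_to_mmss(85)        # "01:25"
restored = mmss_to_seconds("01:25")    # 85
```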

For this use case, we recommend the following inference parameter values:

Temperature: 0
Do not use reasoning

We recommend that you use prompts similar to the following:


Please localize the moment that the event "{event_description}" happens in the video. Answer with the starting and ending time of the event in seconds, such as [[72, 82]]. If the event happens multiple times, list all of them, such as [[40, 50], [72, 82]].


Locate the segment where "{event_description}" happens. Specify the start and end times of the event in MM:SS.


Answer the starting and end time of the event "{event_description}". Provide answers in MM:SS


When does "{event_description}" happen in the video? Specify the start and end timestamps, e.g. [[9, 14]]


Please localize the moment that the event "{event_description}" happens in the video. Answer with the starting and ending time of the event in seconds. e.g. [[72, 82]]. If the event happens multiple times, list all of them. e.g. [[40, 50], [72, 82]]


Segment a video into different scenes and generate caption per scene. The output should be in the format: [STARTING TIME-ENDING TIMESTAMP] CAPTION. Timestamp in MM:SS format


For a video clip, segment it into chapters and generate chapter titles with timestamps. The output should be in the format: [STARTING TIME] TITLE. Time in MM:SS


Generate video captions with timestamp.
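
For the prompts that request times in seconds, the response can be parsed with a regular expression. This sketch assumes the model follows the [[start, end], ...] shape shown in the prompt examples; parse_time_ranges is a hypothetical helper.

```python
import re

# Matches one [start, end] pair of non-negative numbers.
_RANGE_PATTERN = re.compile(r"\[\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*\]")


def parse_time_ranges(text):
    """Return (start_seconds, end_seconds) pairs found in a model response."""
    return [
        (float(start), float(end))
        for start, end in _RANGE_PATTERN.findall(text)
    ]


ranges = parse_time_ranges("[[40, 50], [72, 82]]")
# ranges == [(40.0, 50.0), (72.0, 82.0)]
```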

Classifying videos

You can use Amazon Nova models to classify videos based on a predefined list of classes.

For this use case, we recommend the following inference parameter values:

Temperature: 0
Do not use reasoning

Use the following prompt template:


What is the most appropriate category for this video? Select your answer from the options provided:
{class1}
{class2}
{...}

Example:


What is the most appropriate category for this video? Select your answer from the options provided:
Arts
Technology
Sports
Education
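
A prompt like the example above can be generated from a list of class names. The helper below is a minimal sketch with a hypothetical name; it places one class per line, as in the template.

```python
def build_classification_prompt(classes):
    """Fill the video classification template with one class per line."""
    if not classes:
        raise ValueError("at least one class is required")
    header = (
        "What is the most appropriate category for this video? "
        "Select your answer from the options provided:"
    )
    return "\n".join([header, *classes])


prompt = build_classification_prompt(["Arts", "Technology", "Sports", "Education"])
```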
