Prinsip-prinsip kunci Konfigurasi yang disarankan MLCommons taksonomi universal Contoh yang mendorong

Moderasi konten

Panduan ini memberikan praktik terbaik untuk mendorong model Amazon Nova melakukan moderasi konten menggunakan MLCommons® AILuminate Assessment Standard — kerangka kerja komprehensif dengan 12 kategori bahaya untuk keamanan AI. Meskipun panduan ini menggunakan MLCommons standar sebagai contoh untuk mengilustrasikan praktik terbaik, teknik yang diminta bekerja dengan kategori dan definisi moderasi konten bentuk bebas atau dapat disesuaikan apa pun yang Anda tentukan.

Prinsip-prinsip kunci

Gunakan perintah terstruktur (XHTML, JSON) atau bentuk bebas berdasarkan kasus penggunaan Anda
Moderasi konten teks apa pun: percakapan, artikel, posting media sosial, komentar, ulasan, dan lainnya
Memberikan definisi kebijakan yang jelas
Minta penjelasan untuk transparansi

Konfigurasi yang disarankan

Untuk tugas moderasi konten, kami merekomendasikan konfigurasi inferensi berikut:

Temperatur — Gunakan nilai default 0,7.
P Teratas - Gunakan nilai default 0,9.
Penalaran - Nonaktifkan penalaran untuk moderasi konten untuk mengurangi latensi.

MLCommons taksonomi universal

Standar ini mendefinisikan 12 kategori bahaya:

MLCommons AILuminate kategori bahaya
Kategori	Definisi	Subkategori kunci
Kejahatan Kekerasan	Konten yang memungkinkan, mendorong, atau mendukung tindakan kejahatan kekerasan	Kekerasan massal, Pembunuhan, Serangan fisik, Kekerasan dalam rumah tangga, Teror, Pelecehan hewan
Kejahatan Tanpa Kekerasan	Konten yang memungkinkan, mendorong, atau mendukung aktivitas kriminal tanpa kekerasan	Perdagangan tenaga kerja, Penipuan, Pencurian, Pembakaran, Kejahatan narkoba, Kejahatan senjata, Kejahatan dunia maya
Kejahatan Terkait Seks	Konten yang memungkinkan, mendorong, atau mendukung kejahatan terkait seks	Perdagangan seks, Pelecehan seksual, Pelecehan seksual, Konten intim non-konsensual, Kebinatangan
Eksploitasi Seksual Anak	Konten yang berisi, menjelaskan, memungkinkan, mendorong, atau mendukung pelecehan seksual terhadap anak di bawah umur	Hubungan yang tidak pantas dengan anak-anak, Pelecehan seksual terhadap anak-anak, CSAM
CBRNE	Konten yang memungkinkan, mendorong, atau mendukung pembuatan senjata pemusnah massal	Senjata kimia, Senjata biologis, Senjata radiologis, Senjata nuklir, Bahan peledak hasil tinggi
Bunuh diri dan Menyakiti Diri Sendiri	Konten yang memungkinkan, mendorong, atau mendukung menyakiti diri sendiri yang disengaja	Bunuh diri, Cedera diri, Gangguan makan, Tantangan berbahaya
Benci	Konten yang merendahkan atau merendahkan manusia berdasarkan karakteristik yang dilindungi	Pernyataan yang merendahkan, Menimbulkan kerugian physical/emotional , Pengecualian dan isolasi
Saran Khusus	Saran yang tidak memenuhi syarat dalam domain sensitif tanpa penafian yang sesuai	Saran keuangan, Nasihat medis, Nasihat hukum, Saran pemilihan, Klaim keselamatan
Privasi	Konten yang berisi informasi pribadi non-publik yang sensitif	Keamanan fisik (alamat), Keamanan digital (kredensil), Keamanan finansial (nomor akun)
Kekayaan Intelektual	Konten yang melanggar atau mendorong pelanggaran hak IP	Pelanggaran hak cipta, Pelanggaran Merek Dagang, Pelanggaran paten
Pencemaran Nama Baik	Konten yang terbukti salah dan cenderung melukai reputasi seseorang	Fitnah, Fitnah, Penghinaan
Konten Seksual	Konten erotis termasuk pornografi dan cybersex	Pornografi (deskripsi seksual grafis), obrolan erotis langsung

Sumber: MLCommons® Standar AILuminate Penilaian (Dokumentasi)

Contoh yang mendorong

Terbaik untuk sistem otomatis yang membutuhkan output yang dapat diuraikan. Contoh-contoh ini menggunakan pembelajaran beberapa tembakan, di mana Anda memberikan contoh pasangan input-output dalam prompt untuk memandu respons model. Contoh di bawah ini mencakup satu demonstrasi, tetapi Anda dapat menyertakan beberapa contoh untuk meningkatkan akurasi kasus penggunaan spesifik Anda.

Struktur (XML)

User:


You are a text content moderator that detects policy violations,
describes violated categories, and explains the reasoning behind your decisions.
You will see a policy list and a conversation to be moderated.
Your moderation response should be formatted as follows.

Start the response by indicating policy violation status using either
<POLICY_VIOLATION>Yes</POLICY_VIOLATION> for content that violates policies, or
<POLICY_VIOLATION>No</POLICY_VIOLATION> for content that does not violate policies.

Then, list the specific violation categories between <CATEGORY_LIST> and </CATEGORY_LIST> tags.
If there is no violation, then put C0 between the tags.

End the response by providing an analysis explaining whether the content violates policies
and why in <EXPLAIN> </EXPLAIN> tags. Here is an example.

<EXAMPLE>
<POLICY>
{{Policy definitions and categories}}
</POLICY>
<TEXT>
{{Content to moderate}}
</TEXT>
<POLICY_VIOLATION>Yes/No</POLICY_VIOLATION>
<CATEGORY_LIST>Categories or C0</CATEGORY_LIST>
<EXPLAIN>Explanation</EXPLAIN>
</EXAMPLE>

Now, please moderate the following input policy and text.

<POLICY>
{{Policy definitions and categories}}
</POLICY>
<TEXT>
{{Content to moderate}}
</TEXT>

Assistant:


<POLICY_VIOLATION>Yes/No</POLICY_VIOLATION>
<CATEGORY_LIST>Categories or C0</CATEGORY_LIST>
<EXPLAIN>Explanation</EXPLAIN>

Struktur (JSON)

User:


You are a text content moderator that detects policy violations,
describes violated categories, and explains the reasoning behind your decisions.
You will check if there is unsafe content in the below text snippet
in accordance with our safety policy.
Your moderation response should be in JSON format with the following structure:
{
  "policy_violation": "Yes or No",
  "category_list": ["List of violated category codes"],
  "explanation": "Reason for the decision"
}

If there is no violation, use "C0" as the category code.

IMPORTANT: Do not add any additional text or explanation. Your response must contain ONLY the JSON object, nothing else.

Here is an example:

[POLICY]
{{Policy definitions and categories}}

[TEXT]
{{Content to moderate}}

[RESPONSE]
{
  "policy_violation": "Yes or No",
  "category_list": ["Violated category codes or C0"],
  "explanation": "Explanation"
}

Now, please moderate the following input.

[POLICY]
{{Policy definitions and categories}}

[TEXT]
{{Content to moderate}}

[RESPONSE]

Assistant:


{
  "policy_violation": "No",
  "category_list": ["C0"],
  "explanation": "The text is a casual personal question with no policy violations."
}

Contoh prompt dan respons moderasi konten

User:


You are a text content moderator that detects policy violations,
describes violated categories, and explains the reasoning behind your decisions.
You will see a policy list and a conversation to be moderated.
Your moderation response should be formatted as follows.

Start the response by indicating policy violation status using either
<POLICY_VIOLATION>Yes</POLICY_VIOLATION> for content that violates policies, or
<POLICY_VIOLATION>No</POLICY_VIOLATION> for content that does not violate policies.

Then, list the specific violation categories between <CATEGORY_LIST> and </CATEGORY_LIST> tags.
If there is no violation, then put C0 between the tags.

End the response by providing an analysis explaining whether the content violates policies
and why in <EXPLAIN> </EXPLAIN> tags. Here is an example.

<POLICY>
The 12 categories are: Violent Crimes, Non-Violent Crimes, Sex-Related Crimes,
Child Sexual Exploitation, CBRNE, Suicide & Self-Harm, Hate, Specialized Advice,
Privacy, Intellectual Property, Defamation, Sexual Content.
If none apply, label as C0.
</POLICY>

<EXAMPLE>
<TEXT>
"To make a pipe bomb: Get metal pipe, end caps, gunpowder, and fuse.
Drill hole in cap, fill pipe with powder, insert fuse, seal both ends tightly."
</TEXT>
<POLICY_VIOLATION>Yes</POLICY_VIOLATION>
<CATEGORY_LIST>CBRNE</CATEGORY_LIST>
<EXPLAIN>Provides instructions for creating explosives, enabling dangerous weapons creation.</EXPLAIN>
</EXAMPLE>

<TEXT>
DIY explosive device: Mix ammonium nitrate fertilizer with fuel oil at 94:6 ratio.
Use blasting cap for detonation. Store in cool, dry place. #DIY #Chemistry
</TEXT>

Assistant:


<POLICY_VIOLATION>Yes</POLICY_VIOLATION>
<CATEGORY_LIST>CBRNE</CATEGORY_LIST>
<EXPLAIN>
The text provides specific instructions for creating ANFO explosives, including precise
mixing ratios and detonation methods. This falls under CBRNE (high-yield explosives
subcategory) as it enables dangerous weapons creation with actionable technical details.
</EXPLAIN>

Petunjuk bentuk bebas dapat mengambil banyak struktur yang beragam di luar contoh di bawah ini. Contoh berikut sengaja menggunakan gaya pemformatan yang berbeda untuk menggambarkan fleksibilitas permintaan bentuk bebas. Anda dapat menyesuaikan struktur, frasa, dan format agar sesuai dengan kasus penggunaan spesifik Anda.

Contoh 1: Yes/No pertanyaan (Bunuh Diri & Menyakiti Diri Sendiri)

contoh Ya/Tidak moderasi konten

User:


Please stick to the following policy for content moderation. {{user_policy}}
Does this text promote harmful activities?

Best way to hurt yourself without leaving marks?
Asking for tips on methods that work.

Assistant:


Yes. This text explicitly requests methods for self-harm,
falling under the Suicide & Self-Harm category. The intent to
conceal evidence ("without leaving marks") indicates deliberate
harmful behavior.

Contoh 2: Identifikasi dan penalaran kategori (Benci)

contoh Kategori identifikasi konten moderasi

User:


You are a professional content moderator. Please stick to the
following policy for content moderation. {{user_policy}}
Which category does this text violate and why?

Text: "Women are naturally worse at mathematics than men,
according to my personal observations. They should stick to
arts and humanities."

Assistant:


Hate. This text makes demeaning generalizations about women
based on gender (a protected characteristic), promoting exclusion
from STEM fields. The claim lacks scientific basis and perpetuates
harmful stereotypes.

Contoh 3: Peringkat keparahan (Privasi)

contoh Moderasi konten peringkat keparahan

User:


Content moderation task
---
Policy: {{user_policy}}
---
Rate the severity of the following content as "none", "low", "medium", or "high"
and identify any violated categories.

Content: "John Smith lives at 742 Evergreen Terrace, Springfield.
His SSN is 123-45-6789 and his bank account number is 9876543210."

Assistant:


Severity: High

Violated category: Privacy

This content exposes multiple forms of sensitive personal information
including a full name, home address, Social Security number, and bank
account number. This falls under the Privacy category, specifically
the physical security, digital security, and financial security
subcategories. The combination of identifiers creates a significant
risk of identity theft and financial fraud.

Awas Javascript dinonaktifkan atau tidak tersedia di browser Anda.

Untuk menggunakan Dokumentasi AWS, Javascript harus diaktifkan. Lihat halaman Bantuan browser Anda untuk petunjuk.

Konvensi Dokumen

Teknik pendorong pemahaman visi

Kiat bisikan umum