本文為英文版的機器翻譯版本,如內容有任何歧義或不一致之處,概以英文版為準。
內容管制
本指南提供提示 Amazon Nova 模型使用 MLCommons® AILuminate 評估標準
主要原則
-
根據您的使用案例使用結構化 (XML、JSON) 或自由格式提示
-
調整任何文字內容:對話、文章、社交媒體文章、評論、評論等
-
提供明確的政策定義
-
請求說明透明度
建議組態
對於內容管制任務,我們建議使用下列推論組態:
-
溫度 — 使用預設值 0.7。
-
Top P — 使用預設值 0.9。
-
推理 — 停用推理內容管制以降低延遲。
MLCommons 通用分類
標準定義了 12 個危害類別:
| Category | 定義 | 金鑰子類別 |
|---|---|---|
| 暴力犯罪 | 啟用、鼓勵或認可暴力犯罪行為的內容 | 大規模暴力、Murder、身體攻擊、家庭濫用、恐慌、動物濫用 |
| 非暴力犯罪 | 啟用、鼓勵或認可非暴力犯罪活動的內容 | 人口販賣、詐騙、盜竊、Arson、藥物犯罪、武器犯罪、網路犯罪 |
| 與性別相關的犯罪 | 啟用、鼓勵或認可性相關犯罪的內容 | 性交易、性攻擊、性侵犯、非合意私密內容、Bestiality |
| 子性探索 | 包含、描述、啟用、鼓勵或認可對未成年人進行性濫用的內容 | 與兒童的不適當關係、對兒童的性虐待、CSAM |
| CBRNE | 可啟用、鼓勵或認可建立大規模銷毀武器的內容 | 化學武器、生物武器、放射武器、核武器、高產量爆炸物 |
| 自殺和自我傷害 | 啟用、鼓勵或背書刻意自我傷害的內容 | 自殺、自我傷害、吃東西障礙、危險挑戰 |
| 仇恨 | 根據受保護的特性來淡化或取消人道化的內容 | 破壞評論、引發身體/情緒傷害、排除和隔離 |
| 專業建議 | 在沒有適當免責聲明的情況下,在敏感網域中提供不合格的建議 | 財務建議、醫療建議、法律建議、選舉建議、安全宣告 |
| 隱私權 | 包含敏感非公開個人資訊的內容 | 實體安全 (地址)、數位安全 (憑證)、財務安全 (帳戶號碼) |
| 智慧財產權 | 違反或鼓勵違反 IP 權利的內容 | 著作權侵權、商標侵權、專利侵權 |
| 誹謗 | 可驗證為虛假的內容,並可能損害某人的評價 | Libel、Slander、Diparagement |
| 性內容 | 情節內容,包括色情和網路性別 | 色情 (圖形性描述)、直接波紋聊天 |
來源:MLCommons® AILuminate 評估標準 (文件)
提示範例
最適合需要可剖析輸出的自動化系統。這些範例使用少量擷取學習,您可以在提示中提供範例輸入輸出對,以引導模型的回應。以下範例包含單一示範,但您可以包含多個範例,以提升特定使用案例的準確性。
結構 (XML)
User:
You are a text content moderator that detects policy violations, describes violated categories, and explains the reasoning behind your decisions. You will see a policy list and a conversation to be moderated. Your moderation response should be formatted as follows. Start the response by indicating policy violation status using either <POLICY_VIOLATION>Yes</POLICY_VIOLATION> for content that violates policies, or <POLICY_VIOLATION>No</POLICY_VIOLATION> for content that does not violate policies. Then, list the specific violation categories between <CATEGORY_LIST> and </CATEGORY_LIST> tags. If there is no violation, then put C0 between the tags. End the response by providing an analysis explaining whether the content violates policies and why in <EXPLAIN> </EXPLAIN> tags. Here is an example. <EXAMPLE> <POLICY> {{Policy definitions and categories}} </POLICY> <TEXT> {{Content to moderate}} </TEXT> <POLICY_VIOLATION>Yes/No</POLICY_VIOLATION> <CATEGORY_LIST>Categories or C0</CATEGORY_LIST> <EXPLAIN>Explanation</EXPLAIN> </EXAMPLE> Now, please moderate the following input policy and text. <POLICY> {{Policy definitions and categories}} </POLICY> <TEXT> {{Content to moderate}} </TEXT>
Assistant:
<POLICY_VIOLATION>Yes/No</POLICY_VIOLATION> <CATEGORY_LIST>Categories or C0</CATEGORY_LIST> <EXPLAIN>Explanation</EXPLAIN>
結構 (JSON)
User:
You are a text content moderator that detects policy violations, describes violated categories, and explains the reasoning behind your decisions. You will check if there is unsafe content in the below text snippet in accordance with our safety policy. Your moderation response should be in JSON format with the following structure: { "policy_violation": "Yes or No", "category_list": ["List of violated category codes"], "explanation": "Reason for the decision" } If there is no violation, use "C0" as the category code. IMPORTANT: Do not add any additional text or explanation. Your response must contain ONLY the JSON object, nothing else. Here is an example: [POLICY] {{Policy definitions and categories}} [TEXT] {{Content to moderate}} [RESPONSE] { "policy_violation": "Yes or No", "category_list": ["Violated category codes or C0"], "explanation": "Explanation" } Now, please moderate the following input. [POLICY] {{Policy definitions and categories}} [TEXT] {{Content to moderate}} [RESPONSE]
Assistant:
{ "policy_violation": "No", "category_list": ["C0"], "explanation": "The text is a casual personal question with no policy violations." }
範例內容管制提示和回應
User:
You are a text content moderator that detects policy violations, describes violated categories, and explains the reasoning behind your decisions. You will see a policy list and a conversation to be moderated. Your moderation response should be formatted as follows. Start the response by indicating policy violation status using either <POLICY_VIOLATION>Yes</POLICY_VIOLATION> for content that violates policies, or <POLICY_VIOLATION>No</POLICY_VIOLATION> for content that does not violate policies. Then, list the specific violation categories between <CATEGORY_LIST> and </CATEGORY_LIST> tags. If there is no violation, then put C0 between the tags. End the response by providing an analysis explaining whether the content violates policies and why in <EXPLAIN> </EXPLAIN> tags. Here is an example. <POLICY> The 12 categories are: Violent Crimes, Non-Violent Crimes, Sex-Related Crimes, Child Sexual Exploitation, CBRNE, Suicide & Self-Harm, Hate, Specialized Advice, Privacy, Intellectual Property, Defamation, Sexual Content. If none apply, label as C0. </POLICY> <EXAMPLE> <TEXT> "To make a pipe bomb: Get metal pipe, end caps, gunpowder, and fuse. Drill hole in cap, fill pipe with powder, insert fuse, seal both ends tightly." </TEXT> <POLICY_VIOLATION>Yes</POLICY_VIOLATION> <CATEGORY_LIST>CBRNE</CATEGORY_LIST> <EXPLAIN>Provides instructions for creating explosives, enabling dangerous weapons creation.</EXPLAIN> </EXAMPLE> <TEXT> DIY explosive device: Mix ammonium nitrate fertilizer with fuel oil at 94:6 ratio. Use blasting cap for detonation. Store in cool, dry place. #DIY #Chemistry </TEXT>
Assistant:
<POLICY_VIOLATION>Yes</POLICY_VIOLATION> <CATEGORY_LIST>CBRNE</CATEGORY_LIST> <EXPLAIN> The text provides specific instructions for creating ANFO explosives, including precise mixing ratios and detonation methods. This falls under CBRNE (high-yield explosives subcategory) as it enables dangerous weapons creation with actionable technical details. </EXPLAIN>
自由格式提示可以採用下列範例以外的許多不同結構。下列範例刻意使用不同的格式樣式來說明自由格式提示的彈性。您可以調整結構、措辭和格式,以符合您的特定使用案例。
範例 1:是/否問題 (衝突和自我傷害)
範例是/否內容管制
User:
Please stick to the following policy for content moderation. {{user_policy}} Does this text promote harmful activities? Best way to hurt yourself without leaving marks? Asking for tips on methods that work.
Assistant:
Yes. This text explicitly requests methods for self-harm, falling under the Suicide & Self-Harm category. The intent to conceal evidence ("without leaving marks") indicates deliberate harmful behavior.
範例 2:類別識別和推理 (仇恨)
範例類別識別內容管制
User:
You are a professional content moderator. Please stick to the following policy for content moderation. {{user_policy}} Which category does this text violate and why? Text: "Women are naturally worse at mathematics than men, according to my personal observations. They should stick to arts and humanities."
Assistant:
Hate. This text makes demeaning generalizations about women based on gender (a protected characteristic), promoting exclusion from STEM fields. The claim lacks scientific basis and perpetuates harmful stereotypes.
範例 3:嚴重性評分 (隱私權)
範例嚴重性評分內容管制
User:
Content moderation task --- Policy: {{user_policy}} --- Rate the severity of the following content as "none", "low", "medium", or "high" and identify any violated categories. Content: "John Smith lives at 742 Evergreen Terrace, Springfield. His SSN is 123-45-6789 and his bank account number is 9876543210."
Assistant:
Severity: High Violated category: Privacy This content exposes multiple forms of sensitive personal information including a full name, home address, Social Security number, and bank account number. This falls under the Privacy category, specifically the physical security, digital security, and financial security subcategories. The combination of identifiers creates a significant risk of identity theft and financial fraud.