Evaluation recipe examples

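These recipes let you evaluate the fundamental capabilities of Amazon Nova models across a comprehensive suite of text-only benchmarks. They are provided in the format xxx_general_text_benchmark_eval.yaml.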

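These recipes let you bring your own dataset for benchmarking and compare model outputs against reference answers using different types of metrics. They are provided in the format xxx_bring_your_own_dataset_eval.yaml.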

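The following are the requirements for bringing your own dataset:

File format requirements
- You must include a single gen_qa.jsonl file that contains the evaluation examples.
- Your dataset must be uploaded to an S3 location that the SageMaker training job can access.
- The file must follow the schema format required for a general Q&A dataset.
Schema format requirements - Each line in the JSONL file must be a JSON object with the following fields:
- query: (Required) String containing the question or instruction that needs an answer.
- response: (Required) String containing the expected model output.
- system: (Optional) String containing the system prompt that sets the behavior, role, or personality of the AI model before it processes the query.
- metadata: (Optional) String containing metadata associated with the entry, used for tagging purposes.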

The following are example entries for a bring-your-own dataset:


{
   "system":"You are a english major with top marks in class who likes to give minimal word responses: ",
   "query":"What is the symbol that ends the sentence as a question",
   "response":"?"
}
{
   "system":"You are a pattern analysis specialist that provides succinct answers: ",
   "query":"What is the next number in this series? 1, 2, 4, 8, 16, ?",
   "response":"32"
}
{
   "system":"You have great attention to detail that follows instructions accurately: ",
   "query":"Repeat only the last two words of the following: I ate a hamburger today and it was kind of dry",
   "response":"of dry"
}

To use your custom dataset, modify your evaluation recipe with the following required fields. Do not change any of the values:


evaluation:
  task: gen_qa
  strategy: gen_qa
  metric: all

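The following limitations apply:

Only one JSONL file is allowed per evaluation.
The file must strictly follow the defined schema.
Context length limit: For each sample in the dataset, the context length (including system + query prompts) should be less than 3.5k.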
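As a rough pre-upload check, the schema rules above can be sketched in Python. This is an illustrative validator, not part of the recipe tooling: the function names are hypothetical, rejecting unknown fields is an assumption about what "strictly follow the defined schema" means, and the context length limit is not checked because the text does not say how length is measured.

```python
import json

REQUIRED_FIELDS = ("query", "response")   # must be present, as strings
OPTIONAL_FIELDS = ("system", "metadata")  # may be omitted


def validate_gen_qa_line(line: str) -> list[str]:
    """Return the problems found in one JSONL line (empty list if it passes)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["line is not a JSON object"]

    problems = []
    for field in REQUIRED_FIELDS:
        if not isinstance(record.get(field), str):
            problems.append(f"missing or non-string required field: {field}")
    for field in OPTIONAL_FIELDS:
        if field in record and not isinstance(record[field], str):
            problems.append(f"optional field is not a string: {field}")
    # Assumption: fields outside the documented schema are treated as errors.
    unknown = set(record) - set(REQUIRED_FIELDS) - set(OPTIONAL_FIELDS)
    if unknown:
        problems.append(f"unexpected fields: {sorted(unknown)}")
    return problems


def validate_gen_qa_file(path: str) -> dict[int, list[str]]:
    """Map 1-based line numbers to their problems; an empty dict means no problems."""
    report = {}
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            problems = validate_gen_qa_line(line)
            if problems:
                report[number] = problems
    return report
```

Running validate_gen_qa_file("gen_qa.jsonl") before uploading to S3 would surface malformed lines early; note that blank lines are reported as invalid JSON.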

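Amazon Nova LLM-as-a-judge is a model evaluation feature that lets customers compare the quality of responses from one model against a baseline model's responses on a custom dataset. It takes a dataset with prompts, baseline responses, and challenger responses, then uses a Nova Judge model to provide a win-rate metric based on Bradley-Terry probability through pairwise comparisons.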
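For orientation, the textbook two-model Bradley-Terry formulation can be written as follows. This is the standard model, not a description of how Nova Judge computes its metric; in particular, how ties and inference errors enter the calculation is not specified here. The symbols p_A and p_B (latent strengths) and n_A and n_B (pairwise comparisons won, ties excluded) are introduced only for illustration.

```latex
P(B \succ A) = \frac{p_B}{p_A + p_B},
\qquad
\hat{P}(B \succ A) = \frac{n_B}{n_A + n_B}
```

The second expression is the maximum-likelihood estimate for the two-model case: with only two contestants, the estimated win probability reduces to the observed share of decided comparisons.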

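The recipes are provided in the format xxx_llm_judge_eval.yaml.

The following are the LLM-as-a-judge requirements:

File format requirements
- Include a single llm_judge.jsonl file that contains the evaluation examples. The file name must be llm_judge.jsonl.
- Your dataset must be uploaded to an S3 location that the SageMaker HyperPod RIG can access.
- The file must follow the schema format required for the llm_judge.jsonl dataset.
- The input dataset should ensure that all records are under 12k context length.
Schema format requirements - Each line in the JSONL file must be a JSON object with the following fields:
- prompt: (Required) String containing the prompt for the generated response.
- response_A: String containing the baseline response.
- response_B: String containing the alternative response that is compared with the baseline response.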

The following are example entries for LLM-as-a-judge:


{
"prompt": "What is the most effective way to combat climate change?",
"response_A": "The most effective way to combat climate change is through a combination of transitioning to renewable energy sources and implementing strict carbon pricing policies. This creates economic incentives for businesses to reduce emissions while promoting clean energy adoption.",
"response_B": "We should focus on renewable energy. Solar and wind power are good. People should drive electric cars. Companies need to pollute less."
}
{
"prompt": "Explain how a computer's CPU works",
"response_A": "CPU is like brain of computer. It does math and makes computer work fast. Has lots of tiny parts inside.",
"response_B": "A CPU (Central Processing Unit) functions through a fetch-execute cycle, where instructions are retrieved from memory, decoded, and executed through its arithmetic logic unit (ALU). It coordinates with cache memory and registers to process data efficiently using binary operations."
}
{
"prompt": "How does photosynthesis work?",
"response_A": "Plants do photosynthesis to make food. They use sunlight and water. It happens in leaves.",
"response_B": "Photosynthesis is a complex biochemical process where plants convert light energy into chemical energy. They utilize chlorophyll to absorb sunlight, combining CO2 and water to produce glucose and oxygen through a series of chemical reactions in chloroplasts."
}

To use your custom dataset, modify your evaluation recipe with the following required fields. Do not change any of the values:


evaluation:
  task: llm_judge
  strategy: judge
  metric: all

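The following limitations apply:

Only one JSONL file is allowed per evaluation.
The file must strictly follow the defined schema.
The Amazon Nova Judge model is the same across all model family specifications (that is, Lite, Micro, and Pro).
Custom judge models are not currently supported.
Context length limit: For each sample in the dataset, the context length (including system + query prompts) should be less than 7k.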
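One way to assemble the dataset is to pair each prompt with the baseline and challenger outputs you already have and write one object per line. The sketch below assumes the three inputs are parallel Python lists; build_judge_records and write_jsonl are hypothetical helper names, not part of any Nova tooling.

```python
import json


def build_judge_records(prompts, baseline_responses, challenger_responses):
    """Pair each prompt with its baseline (response_A) and challenger (response_B) output."""
    if not (len(prompts) == len(baseline_responses) == len(challenger_responses)):
        raise ValueError("prompts and both response lists must have the same length")
    return [
        {"prompt": prompt, "response_A": baseline, "response_B": challenger}
        for prompt, baseline, challenger in zip(
            prompts, baseline_responses, challenger_responses
        )
    ]


def write_jsonl(records, path="llm_judge.jsonl"):
    """Write one compact JSON object per line, as the JSONL format requires."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            # json.dumps escapes embedded newlines, so each record stays on one line.
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
```

The default file name matches the required llm_judge.jsonl; uploading the file to S3 is left out of the sketch.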

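Nova LLM Judge for multi-modal (image), or Nova MM_LLM Judge for short, is a model evaluation feature that lets you compare the quality of responses from one model against a baseline model's responses using a custom dataset. It accepts a dataset containing prompts, baseline responses, challenger responses, and images as Base64-encoded strings, then uses a Nova Judge model to provide a win-rate metric based on Bradley-Terry probability through pairwise comparisons. Recipe format: xxx_mm_llm_judge_eval.yaml.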

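Nova LLM dataset requirements

File format:

A single mm_llm_judge.jsonl file containing the evaluation examples. The file name must be exactly llm_judge.jsonl.
You must upload your dataset to an S3 location that the SageMaker training job can access.
The file must follow the schema format required for the mm_llm_judge dataset.
The input dataset should ensure that all records are under 12k context length, excluding the image attribute.

Schema format - Each line in the .jsonl file must be a JSON object with the following fields.

Required fields.

prompt: String containing the prompt for the generated response.

images: Array containing a list of objects with a data attribute (the value is a Base64-encoded image string).

response_A: String containing the baseline response.

response_B: String containing the alternative response that is compared with the baseline response.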

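Example entries

For readability, the following example includes new lines and indentation, but in the actual dataset, each record should be on a single line.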


{
  "prompt": "what is in the image?",
  "images": [
    {
      "data": "data:image/jpeg;Base64,/9j/2wBDAAQDAwQDAwQEAwQFBAQFBgo..."
    }
  ],
  "response_A": "a dog.",
  "response_B": "a cat."
}
{
  "prompt": "how many animals in each of the images?",
  "images": [
    {
      "data": "data:image/jpeg;Base64,/9j/2wBDAAQDAwQDAwQEAwQFBAQFBgo..."
    },
    {
      "data": "data:image/jpeg;Base64,/DKEafe3gihn..."
    }
  ],
  "response_A": "The first image contains one cat and the second image contains one dog",
  "response_B": "The first image has one animal and the second has one animal"
}
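A record like the ones above could be produced with a short Python sketch. The helper names are hypothetical, and the data URI prefix uses the standard lowercase "base64" spelling, which is an assumption; check it against what the evaluation job accepts.

```python
import base64
import json


def image_to_data_uri(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw image bytes as a Base64 data URI string."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    # Assumption: standard data URI spelling ("base64", lowercase).
    return f"data:{mime_type};base64,{encoded}"


def build_mm_record(prompt, image_blobs, response_a, response_b) -> str:
    """Return one single-line JSON record for the multi-modal judge dataset."""
    record = {
        "prompt": prompt,
        "images": [{"data": image_to_data_uri(blob)} for blob in image_blobs],
        "response_A": response_a,
        "response_B": response_b,
    }
    # json.dumps without indent emits the record on a single line.
    return json.dumps(record)
```

Each returned string is one line of the .jsonl file; image bytes would typically come from reading a local file in binary mode, since Amazon S3 image URIs are not supported.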

To use your custom dataset, modify your evaluation recipe with the following required fields. Do not change any of the values:


evaluation:
  task: mm_llm_judge
  strategy: judge
  metric: all

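Limitations

Only one .jsonl file is allowed per evaluation.
The file must strictly follow the defined schema.
The Nova MM Judge model supports image references only.
The Nova MM Judge model is the same across Amazon Nova Lite specifications.
Custom judge models are not currently supported.
Amazon S3 image URIs are not supported.
The input dataset should ensure that all records are under 12k context length, excluding the image attribute.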

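Rubric Judge is an enhanced LLM-as-a-judge evaluation model built on Nova 2.0 Lite. Unlike the original judge model, which only provides preference verdicts (A>B, B>A, or tie), Rubric Judge dynamically generates custom evaluation criteria tailored to each prompt and assigns granular scores across multiple dimensions.

Key capabilities:

Dynamic criteria generation: automatically creates relevant evaluation dimensions based on the input prompt
Weighted scoring: assigns an importance weight to each criterion to reflect its relative significance
Granular assessment: provides detailed scores for each criterion on a binary (true/false) or scale (1-5) basis
Quality metrics: computes continuous quality scores (0-1 scale) that quantify the magnitude of the difference between responses

Example criterion generated by the model: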


price_validation:
  description: "The response includes validation to ensure price is a positive value."
  type: "scale"
  weight: 0.3

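The model evaluates both responses against all generated criteria, then uses these criterion-level scores to inform its final preference decision.

Recipe configuration

Rubric Judge recipe

Set task: rubric_llm_judge in your recipe to enable Rubric Judge: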


run:
  name: nova-eval-job-name                              # [MODIFIABLE] Unique identifier for your evaluation job
  model_type: amazon.nova-2-lite-v1:0:256k              # [FIXED] Rubric Judge model type
  model_name_or_path: "nova-lite-2/prod"                # [FIXED] Path to model checkpoint or identifier
  replicas: 1                                           # [MODIFIABLE] Number of replicas for SageMaker Training job
  data_s3_path: ""                                      # [FIXED] Leave empty for SageMaker Training job
  output_s3_path: ""                                    # [FIXED] Leave empty for SageMaker Training job

evaluation:
  task: rubric_llm_judge                                # [FIXED] Evaluation task - enables Rubric Judge
  strategy: judge                                       # [FIXED] Evaluation strategy
  metric: all                                           # [FIXED] Metric calculation method

inference:
  max_new_tokens: 12000                                 # [MODIFIABLE] Maximum tokens to generate
  top_k: -1                                             # [MODIFIABLE] Top-k sampling parameter
  top_p: 1.0                                            # [MODIFIABLE] Nucleus sampling parameter
  temperature: 0                                        # [MODIFIABLE] Sampling temperature (0 = deterministic)

Original LLM-as-a-judge recipe (for comparison)

The original judge model uses task: llm_judge:


run:
  name: eval-job-name                                   # [MODIFIABLE] Unique identifier for your evaluation job
  model_type: amazon.nova-micro-v1:0:128k               # [FIXED] Model type
  model_name_or_path: "nova-micro/prod"                 # [FIXED] Path to model checkpoint or identifier
  replicas: 1                                           # [MODIFIABLE] Number of replicas for SageMaker Training job
  data_s3_path: ""                                      # [FIXED] Leave empty for SageMaker Training job
  output_s3_path: ""                                    # [FIXED] Leave empty for SageMaker Training job

evaluation:
  task: llm_judge                                       # [FIXED] Original judge task
  strategy: judge                                       # [FIXED] Evaluation strategy
  metric: all                                           # [FIXED] Metric calculation method

inference:
  max_new_tokens: 12000                                 # [MODIFIABLE] Maximum tokens to generate
  top_k: -1                                             # [MODIFIABLE] Top-k sampling parameter
  top_p: 1.0                                            # [MODIFIABLE] Nucleus sampling parameter
  temperature: 0                                        # [MODIFIABLE] Sampling temperature (0 = deterministic)

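Input dataset format

The input dataset format is identical to that of the original judge model:

Required fields:

prompt: String containing the input prompt and instructions
response_A: String containing the baseline model output
response_B: String containing the customized model output

Example dataset (JSONL format):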


{"prompt": "What is the most effective way to combat climate change?", "response_A": "The most effective way to combat climate change is through a combination of transitioning to renewable energy sources and implementing strict carbon pricing policies. This creates economic incentives for businesses to reduce emissions while promoting clean energy adoption.", "response_B": "We should focus on renewable energy. Solar and wind power are good. People should drive electric cars. Companies need to pollute less."}
{"prompt": "Explain how a computer's CPU works", "response_A": "CPU is like brain of computer. It does math and makes computer work fast. Has lots of tiny parts inside.", "response_B": "A CPU (Central Processing Unit) functions through a fetch-execute cycle, where instructions are retrieved from memory, decoded, and executed through its arithmetic logic unit (ALU). It coordinates with cache memory and registers to process data efficiently using binary operations."}
{"prompt": "How does photosynthesis work?", "response_A": "Plants do photosynthesis to make food. They use sunlight and water. It happens in leaves.", "response_B": "Photosynthesis is a complex biochemical process where plants convert light energy into chemical energy. They utilize chlorophyll to absorb sunlight, combining CO2 and water to produce glucose and oxygen through a series of chemical reactions in chloroplasts."}

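Format requirements:

Each entry must be a single-line JSON object
Separate entries with newlines
Follow the exact field naming shown in the examples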
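If your records start out pretty-printed across several lines, they need to be collapsed to one object per line. A minimal sketch using the standard library's json.JSONDecoder.raw_decode follows; the function name to_jsonl is hypothetical.

```python
import json


def to_jsonl(text: str) -> str:
    """Convert concatenated (possibly pretty-printed) JSON objects to JSONL text."""
    decoder = json.JSONDecoder()
    lines = []
    position = 0
    while position < len(text):
        # Skip whitespace between objects; raw_decode rejects leading whitespace.
        if text[position].isspace():
            position += 1
            continue
        # raw_decode returns the parsed value and the index where it ended.
        record, position = decoder.raw_decode(text, position)
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines) + "\n" if lines else ""
```

Malformed input raises json.JSONDecodeError at the offending position, which makes the sketch usable as a quick syntax check as well.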

Evaluation output

Output structure

Rubric Judge produces enhanced evaluation metrics compared with the original judge model:


{
  "config_general": {
    "lighteval_sha": "string",
    "num_fewshot_seeds": "int",
    "max_samples": "int | null",
    "job_id": "int",
    "start_time": "float",
    "end_time": "float",
    "total_evaluation_time_secondes": "string",
    "model_name": "string",
    "model_sha": "string",
    "model_dtype": "string | null",
    "model_size": "string"
  },
  "results": {
    "custom|rubric_llm_judge_judge|0": {
      "a_scores": "float",
      "a_scores_stderr": "float",
      "b_scores": "float",
      "b_scores_stderr": "float",
      "ties": "float",
      "ties_stderr": "float",
      "inference_error": "float",
      "inference_error_stderr": "float",
      "score": "float",
      "score_stderr": "float",
      "weighted_score_A": "float",
      "weighted_score_A_stderr": "float",
      "weighted_score_B": "float",
      "weighted_score_B_stderr": "float",
      "score_margin": "float",
      "score_margin_stderr": "float",
      "winrate": "float",
      "lower_rate": "float",
      "upper_rate": "float"
    }
  },
  "versions": {
    "custom|rubric_llm_judge_judge|0": "int"
  }
}

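New metrics in Rubric Judge

The following six metrics are unique to Rubric Judge and provide a granular quality assessment:

Metric	Description
weighted_score_A	Average normalized quality score of response_A across all model-generated evaluation criteria. Scores are weighted by criterion importance and normalized to a 0-1 scale (higher = better quality)
weighted_score_A_stderr	Standard error of the mean for weighted_score_A, indicating statistical uncertainty
weighted_score_B	Average normalized quality score of response_B across all model-generated evaluation criteria. Scores are weighted by criterion importance and normalized to a 0-1 scale (higher = better quality)
weighted_score_B_stderr	Standard error of the mean for weighted_score_B, indicating statistical uncertainty
score_margin	Difference between the weighted scores (computed as weighted_score_A - weighted_score_B). Range: -1.0 to 1.0. Positive = response_A is better; negative = response_B is better; near zero = similar quality
score_margin_stderr	Standard error of the mean for score_margin, indicating uncertainty in the quality difference measurement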
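To pull just these six values out of a parsed output file, something like the following could be used. It assumes the output has the structure shown earlier and that real outputs hold numeric values where the structure shows type names; rubric_metrics is a hypothetical helper name.

```python
RUBRIC_ONLY_METRICS = (
    "weighted_score_A",
    "weighted_score_A_stderr",
    "weighted_score_B",
    "weighted_score_B_stderr",
    "score_margin",
    "score_margin_stderr",
)


def rubric_metrics(output: dict, task_key: str = "custom|rubric_llm_judge_judge|0") -> dict:
    """Pick the six Rubric Judge metrics out of a parsed evaluation output."""
    results = output["results"][task_key]
    # A KeyError here means the output did not come from a Rubric Judge run.
    return {name: results[name] for name in RUBRIC_ONLY_METRICS}
```

The output dictionary would typically come from json.load on the results file written by the evaluation job.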

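Understanding the weighted score metrics

Purpose: Weighted scores provide a continuous quality measurement that complements the binary preference verdict, giving deeper insight into model performance.

Key differences from the original judge:

Original judge: outputs only discrete preferences (A>B, B>A, A=B)
Rubric Judge: outputs both preferences and continuous quality scores (0-1 scale) based on custom criteria

Interpreting score_margin:

score_margin = -0.128: response_B scored 12.8 percentage points higher than response_A
|score_margin| < 0.1: narrow quality difference (close decision)
|score_margin| > 0.2: clear quality difference (confident decision)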
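These thresholds can be applied mechanically, for example when scanning many evaluation runs. The sketch below uses hypothetical function names; the label for the 0.1 to 0.2 band is an assumption, since the text does not name that range.

```python
def describe_margin(score_margin: float) -> str:
    """Label a score_margin using the thresholds listed above."""
    size = abs(score_margin)
    if size < 0.1:
        return "close decision"
    if size > 0.2:
        return "confident decision"
    # Assumption: the 0.1-0.2 band is not labelled in the text.
    return "moderate difference"


def better_response(score_margin: float) -> str:
    """score_margin is weighted_score_A - weighted_score_B, so its sign picks the side."""
    if score_margin > 0:
        return "response_A"
    if score_margin < 0:
        return "response_B"
    return "tie"
```

For the value in the first bullet, better_response(-0.128) gives "response_B" and describe_margin(-0.128) falls in the unlabelled middle band.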

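Use cases:

Model improvement: identify specific areas where your model underperforms
Quality quantification: measure the magnitude of performance gaps, not just win/loss rates
Confidence assessment: distinguish close decisions from clear quality differences

Important

The final verdict is still based on the judge model's explicit preference label, which preserves holistic reasoning and ensures proper position-bias mitigation through forward/backward evaluation. The weighted scores serve as an observability tool, not as a replacement for the primary verdict.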

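Calculation methodology

Weighted scores are computed through the following procedure:

Extract criteria data: parse the judge's YAML output to extract criterion scores and weights
Normalize scores:
- Scale-type criteria (1-5): normalize to 0-1 by computing (score - 1) / 4
- Binary criteria (true/false): convert to 1.0/0.0
Apply weights: multiply each normalized score by its criterion weight
Aggregate: sum all weighted scores for each response
Compute margin: calculate score_margin = weighted_score_A - weighted_score_B

Example: If response_A has a weighted sum of 0.65 and response_B has a weighted sum of 0.78, score_margin is -0.13, meaning response_B scores 13 percentage points higher in quality across all weighted criteria.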
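The procedure can be sketched in a few lines of Python. This is an illustration of the steps as described, not the judge's actual implementation: the criteria are assumed to be already parsed into dictionaries with "type", "score", and "weight" keys, and the type name "binary" is an assumption (only "scale" appears in the example criterion).

```python
def normalize(score, criterion_type: str) -> float:
    """Map a raw criterion score onto the 0-1 range."""
    if criterion_type == "scale":   # 1-5 scale
        return (score - 1) / 4
    if criterion_type == "binary":  # true/false (type name is an assumption)
        return 1.0 if score else 0.0
    raise ValueError(f"unknown criterion type: {criterion_type}")


def weighted_score(criteria) -> float:
    """Sum of normalized scores multiplied by their criterion weights."""
    return sum(normalize(c["score"], c["type"]) * c["weight"] for c in criteria)


def score_margin(criteria_a, criteria_b) -> float:
    """weighted_score_A - weighted_score_B; negative values favor response_B."""
    return weighted_score(criteria_a) - weighted_score(criteria_b)
```

The result stays within the 0-1 range only if the weights for a response sum to 1, which the sketch does not enforce.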

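Reasoning model support

Reasoning model support enables evaluation with reasoning-capable Nova models, which perform explicit internal reasoning before generating the final response. This feature uses API-level control through the reasoning_effort parameter to dynamically enable or disable reasoning, which can improve response quality on complex analytical tasks.

Supported models:

amazon.nova-2-lite-v1:0:256k

Recipe configuration

Add the reasoning_effort parameter to the inference section of your recipe to enable reasoning: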


run:
  name: eval-job-name                                    # [MODIFIABLE] Unique identifier for your evaluation job
  model_type: amazon.nova-2-lite-v1:0:256k               # [FIXED] Must be a reasoning-supported model
  model_name_or_path: nova-lite-2/prod                   # [FIXED] Path to model checkpoint or identifier
  replicas: 1                                            # [MODIFIABLE] Number of replicas for SageMaker Training job
  data_s3_path: ""                                       # [MODIFIABLE] Leave empty for SageMaker Training job; optional for  job
  output_s3_path: ""                                     # [MODIFIABLE] Output path for  job (not compatible with SageMaker Training jobs)

evaluation:
  task: mmlu                                             # [MODIFIABLE] Evaluation task
  strategy: generate                                     # [MODIFIABLE] Evaluation strategy
  metric: all                                            # [MODIFIABLE] Metric calculation method

inference:
  reasoning_effort: high                                 # [MODIFIABLE] Enables reasoning mode; options: low/medium/high or null to disable
  max_new_tokens: 200                                    # [MODIFIABLE] Maximum tokens to generate
  top_k: 50                                              # [MODIFIABLE] Top-k sampling parameter
  top_p: 1.0                                             # [MODIFIABLE] Nucleus sampling parameter
  temperature: 0                                         # [MODIFIABLE] Sampling temperature (0 = deterministic)

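Using the reasoning_effort parameter

The reasoning_effort parameter controls the reasoning behavior of reasoning-capable models.

Prerequisites:

Model compatibility: set reasoning_effort only when model_type specifies a reasoning-capable model (currently amazon.nova-2-lite-v1:0:256k)
Error handling: using reasoning_effort with an unsupported model fails with ConfigValidationError: "Reasoning mode is enabled but model '{model_type}' does not support reasoning. Please use a reasoning-capable model or disable reasoning mode."

Available options:

Option	Behavior	Token limit	Use case
null (default)	Disables reasoning mode	N/A	Standard evaluation without reasoning overhead
low	Enables reasoning with constraints	4,000 tokens for internal reasoning	Scenarios that need concise reasoning; optimized for speed and cost
high	Enables reasoning without constraints	No token limit on internal reasoning	Complex problems that need extensive analysis and step-by-step reasoning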
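The compatibility rule can be checked locally before submitting a job. The sketch below mirrors the documented error message but is not the service's validator: ConfigValidationError is a local stand-in class, check_reasoning_config is a hypothetical name, and the accepted levels (low, medium, high) are taken from the comment in the recipe above.

```python
REASONING_MODELS = {"amazon.nova-2-lite-v1:0:256k"}  # currently the only supported model
REASONING_LEVELS = {"low", "medium", "high"}         # per the recipe comment


class ConfigValidationError(ValueError):
    """Local stand-in for the error named in the text; not the service's own class."""


def check_reasoning_config(model_type: str, reasoning_effort=None) -> None:
    """Raise ConfigValidationError if the recipe combines reasoning with an unsupported model."""
    if reasoning_effort is None:
        return  # reasoning disabled: any model type is acceptable
    if reasoning_effort not in REASONING_LEVELS:
        raise ConfigValidationError(f"Unknown reasoning_effort: {reasoning_effort!r}")
    if model_type not in REASONING_MODELS:
        raise ConfigValidationError(
            f"Reasoning mode is enabled but model '{model_type}' does not support "
            "reasoning. Please use a reasoning-capable model or disable reasoning mode."
        )
```

Calling check_reasoning_config with the model_type and inference.reasoning_effort values from a recipe catches the mismatch without waiting for the job to fail.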

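When to enable reasoning

Use reasoning mode (low, medium, or high) for:

Complex problem-solving tasks (math, logic puzzles, coding)
Multi-step analytical questions that require intermediate reasoning
Tasks where detailed explanations or step-by-step thinking improve accuracy
Situations where response quality takes priority over speed

Use non-reasoning mode (omit the parameter) for:

Simple Q&A or factual lookups
Creative writing tasks
Cases where faster response time is critical
Performance benchmarks that should exclude reasoning overhead
Cost optimization when reasoning does not improve task performance

Troubleshooting

Error: "Reasoning mode is enabled but model does not support reasoning"

Cause: The reasoning_effort parameter is set to a non-null value, but the specified model_type does not support reasoning.

Resolution:

Verify that your model type is amazon.nova-2-lite-v1:0:256k
If you are using a different model, switch to a reasoning-capable model or remove the reasoning_effort parameter from your recipe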
