
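Evaluating your SageMaker AI-trained model

The purpose of the evaluation process is to assess a trained model's performance against benchmarks or a custom dataset. The process typically involves creating an evaluation recipe that points to the trained model, specifying evaluation datasets and metrics, and submitting a separate job that runs the evaluation against standard benchmarks or custom data. The evaluation process outputs performance metrics, which are stored in your Amazon S3 bucket.

Note

The evaluation process described in this topic is an offline process. The model is tested against fixed benchmarks with predefined answers, rather than being assessed in real time or through live user interactions. For real-time evaluation, you can test the model after it is deployed to Amazon Bedrock by calling the Amazon Bedrock runtime APIs.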

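Prerequisites

Before you start an evaluation training job, make sure that you have the following.

A SageMaker AI-trained Amazon Nova model whose performance you want to evaluate.
A base Amazon Nova recipe for evaluation. For more information, see Getting Amazon Nova recipes.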

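Available benchmark tasks

A sample code package is available that demonstrates how to calculate benchmark metrics using the SageMaker model evaluation feature for Amazon Nova. To access the code package, see sample-Nova-lighteval-custom-task.

The following industry-standard benchmarks are supported. You can specify them in the eval_task parameter.

Available benchmarks for model evaluation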

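Benchmark	Modality	Description	Metrics	Strategy	Subtask available
mmlu	Text	Multi-task Language Understanding – Tests knowledge across 57 subjects.	accuracy	zs_cot	Yes
mmlu_pro	Text	MMLU – Professional Subset – Focuses on professional domains such as law, medicine, accounting, and engineering.	accuracy	zs_cot	No
bbh	Text	Advanced Reasoning Tasks – A collection of challenging problems that test higher-level cognitive and problem-solving skills.	accuracy	fs_cot	Yes
gpqa	Text	General Physics Question Answering – Assesses comprehension of physics concepts and related problem-solving abilities.	accuracy	zs_cot	No
math	Text	Mathematical Problem Solving – Measures mathematical reasoning across topics including algebra, calculus, and word problems.	exact_match	zs_cot	Yes
strong_reject	Text	Quality-Control Task – Tests the model's ability to detect and reject inappropriate, harmful, or incorrect content.	deflection	zs	Yes
ifeval	Text	Instruction-Following Evaluation – Gauges how accurately a model follows given instructions and completes tasks to specification.	accuracy	zs	No
gen_qa	Text	Custom Dataset Evaluation – Lets you bring your own dataset for benchmarking, comparing model outputs to reference answers with metrics such as ROUGE and BLEU.	all	gen_qa	No
mmmu	Multi-modal	Massive Multidiscipline Multimodal Understanding (MMMU) – College-level benchmark comprising multiple-choice and open-ended questions from 30 disciplines.	accuracy	zs_cot	Yes
llm_judge	Text	LLM-as-a-Judge Preference Comparison – Uses a Nova Judge model to determine the preference between paired responses (B compared with A) for your prompts, calculating the probability of B being preferred over A.	all	judge	No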

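Evaluation specific configurations

Below is a breakdown of the key components in the recipe and guidance on how to modify them for your use cases.

Understanding and modifying your recipes

General run configuration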


run:
  name: eval_job_name 
  model_type: amazon.nova-micro-v1:0:128k 
  model_name_or_path: nova-micro/prod 
  replicas: 1 
  data_s3_path: ""

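name: A descriptive name for your evaluation job.
model_type: Specifies the Nova model variant to use. Do not manually modify this field. Options include:
- amazon.nova-micro-v1:0:128k
- amazon.nova-lite-v1:0:300k
- amazon.nova-pro-v1:0:300k
model_name_or_path: The path to the base model or the S3 path to the post-trained checkpoint. Options include:
- nova-micro/prod
- nova-lite/prod
- nova-pro/prod
- S3 path to the post-trained checkpoint (s3:customer-escrow-111122223333-smtj-<unique_id>/<training_run_name>)
  
  Note
  Evaluate a post-trained model
  To evaluate a post-trained model after a Nova SFT training job, follow these steps after the training job completes successfully. At the end of the training logs, you see a log message stating that training is complete. You can also find a manifest.json file in your output bucket that contains the location of your checkpoint. This file is located inside the output.tar.gz file at your output S3 location. To proceed with evaluation, use this checkpoint by setting its location as the value of run.model_name_or_path in your recipe configuration.
replicas: The number of compute instances to use for distributed training. Set to 1, because multi-node is not supported.
data_s3_path: The Amazon S3 path of the input dataset. This field is required but should always be left empty.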

Evaluation configuration


evaluation:
  task: mmlu 
  strategy: zs_cot 
  subtask: abstract_algebra
  metric: accuracy

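task: Specifies the evaluation benchmark or task to use. Supported tasks include:
- mmlu
- mmlu_pro
- bbh
- gpqa
- math
- strong_reject
- gen_qa
- ifeval
- mmmu
- llm_judge
strategy: Defines the evaluation approach.
- zs_cot: Zero-shot Chain of Thought - An approach to prompting large language models that encourages step-by-step reasoning without requiring explicit examples.
- fs_cot: Few-shot Chain of Thought - An approach that provides a few examples of step-by-step reasoning before asking the model to solve a new problem.
- zs: Zero-shot - An approach to solving a problem without any prior training examples.
- gen_qa: The strategy specific to bring your own dataset.
- judge: The strategy specific to Nova LLM as a Judge.
subtask: Optional. A specific component of the evaluation task. For a complete list of available subtasks, see Available subtasks.
- Check the supported subtasks in Available benchmark tasks.
- Remove this field if the benchmark has no subtasks.
metric: The evaluation metric to use.
- accuracy: Percentage of correct answers.
- exact_match: For the math benchmark, returns the rate at which the predicted strings exactly match their references.
- deflection: For the strong_reject benchmark, returns the relative deflection to the base model and the difference in significance metrics.
- all:
  
  For gen_qa, the bring your own dataset benchmark, returns the following metrics:
  - rouge1: Measures the overlap of unigrams (single words) between the generated and reference text.
  - rouge2: Measures the overlap of bigrams (two consecutive words) between the generated and reference text.
  - rougeL: Measures the longest common subsequence between the texts, allowing gaps in the matching.
  - exact_match: Binary score (0 or 1) indicating whether the generated text matches the reference text exactly, character by character.
  - quasi_exact_match: Similar to exact match but more lenient, typically ignoring differences in case, punctuation, and white space.
  - f1_score: Harmonic mean of precision and recall, measuring word overlap between the predicted and reference answers.
  - f1_score_quasi: Similar to f1_score but with more lenient matching, using normalized text comparison that ignores minor differences.
  - bleu: Measures the precision of n-gram matches between the generated and reference text, commonly used in translation evaluation.
  For llm_judge, the bring your own dataset benchmark, returns the following metrics:
  - a_scores: Number of wins for response_A across the forward and backward evaluation passes.
  - a_scores_stderr: Standard error of the response_A scores across pairwise judgments.
  - b_scores: Number of wins for response_B across the forward and backward evaluation passes.
  - b_scores_stderr: Standard error of the response_B scores across pairwise judgments.
  - ties: Number of judgments in which response_A and response_B are evaluated as equal.
  - ties_stderr: Standard error of ties across pairwise judgments.
  - inference_error: Count of judgments that could not be properly evaluated.
  - score: Aggregate score based on wins from both the forward and backward passes for response_B.
  - score_stderr: Standard error of the aggregate score across pairwise judgments.
  - inference_error_stderr: Standard error of inference errors across judgments.
  - winrate: The probability that response_B is preferred over response_A, calculated using Bradley-Terry probability.
  - lower_rate: Lower bound (2.5th percentile) of the estimated win rate from bootstrap sampling.
  - upper_rate: Upper bound (97.5th percentile) of the estimated win rate from bootstrap sampling.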
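
The pairing of task, strategy, metric, and subtask support can be checked before a job is submitted. The following is a minimal sketch, assuming the evaluation block has already been loaded into a Python dictionary. The helper name check_evaluation_block is hypothetical, and the table it encodes is copied from the benchmark table on this page. Treating each benchmark as accepting only the listed strategy and metric is an assumption.

```python
# Benchmark table from this page: task -> (strategy, metric, subtasks available).
BENCHMARKS = {
    "mmlu": ("zs_cot", "accuracy", True),
    "mmlu_pro": ("zs_cot", "accuracy", False),
    "bbh": ("fs_cot", "accuracy", True),
    "gpqa": ("zs_cot", "accuracy", False),
    "math": ("zs_cot", "exact_match", True),
    "strong_reject": ("zs", "deflection", True),
    "ifeval": ("zs", "accuracy", False),
    "gen_qa": ("gen_qa", "all", False),
    "mmmu": ("zs_cot", "accuracy", True),
    "llm_judge": ("judge", "all", False),
}


def check_evaluation_block(evaluation):
    """Return a list of problems found in an `evaluation` mapping (hypothetical helper)."""
    task = evaluation.get("task")
    if task not in BENCHMARKS:
        return [f"unknown task: {task!r}"]
    strategy, metric, has_subtasks = BENCHMARKS[task]
    problems = []
    if evaluation.get("strategy") != strategy:
        problems.append(f"{task} expects strategy {strategy!r}")
    if evaluation.get("metric") != metric:
        problems.append(f"{task} expects metric {metric!r}")
    # Benchmarks without subtasks must not carry a subtask field at all.
    if "subtask" in evaluation and not has_subtasks:
        problems.append(f"{task} has no subtasks; remove the subtask field")
    return problems


example = {
    "task": "mmlu",
    "strategy": "zs_cot",
    "subtask": "abstract_algebra",
    "metric": "accuracy",
}
print(check_evaluation_block(example))  # prints []
```

The check does not verify that a subtask name is valid for its benchmark. The lists under Available subtasks could be added for that.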

Inference configuration (optional)


inference:
  max_new_tokens: 2048 
  top_k: -1 
  top_p: 1.0 
  temperature: 0

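max_new_tokens: Maximum number of tokens to generate. Must be an integer. (Not available for LLM Judge)
top_k: Number of highest-probability tokens to consider. Must be an integer.
top_p: Cumulative probability threshold for token sampling. Must be a float between 0.0 and 1.0.
temperature: Randomness in token selection (higher = more random). Keep at 0 to make the results deterministic. Float type, with a minimum value of 0.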
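
The type and range rules above can be expressed as a small pre-flight check. This is only a sketch: check_inference_block is a hypothetical helper, and it accepts an integer such as 0 for temperature and top_p because the sample recipe writes temperature: 0.

```python
def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def check_inference_block(inference):
    """Return a list of problems in an `inference` mapping (hypothetical helper)."""
    problems = []
    for key in ("max_new_tokens", "top_k"):
        value = inference.get(key)
        if value is not None and (isinstance(value, bool) or not isinstance(value, int)):
            problems.append(f"{key} must be an integer")
    top_p = inference.get("top_p")
    if top_p is not None and not (_is_number(top_p) and 0.0 <= top_p <= 1.0):
        problems.append("top_p must be a number between 0.0 and 1.0")
    temperature = inference.get("temperature")
    if temperature is not None and not (_is_number(temperature) and temperature >= 0):
        problems.append("temperature must be a number >= 0")
    return problems


defaults = {"max_new_tokens": 2048, "top_k": -1, "top_p": 1.0, "temperature": 0}
print(check_inference_block(defaults))  # prints []
```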

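Evaluation recipe examples

Amazon Nova provides four different types of evaluation recipes. All recipes are available in the Amazon SageMaker HyperPod recipes GitHub repository.

Evaluation recipes

The general text benchmark recipes enable you to evaluate the fundamental capabilities of Amazon Nova models across a comprehensive suite of text-only benchmarks.

Recipe format: xxx_general_text_benchmark_eval.yaml.

The general multi-modal benchmark recipes enable you to evaluate the fundamental capabilities of Amazon Nova models across a comprehensive suite of multi-modal benchmarks.

Recipe format: xxx_general_multi_modal_benchmark_eval.yaml.

Multi-modal benchmark requirements

Model support - Only the nova-lite and nova-pro base models and their post-trained variants are supported.

The bring your own dataset recipes enable you to bring your own dataset for benchmarking and to compare model outputs with reference answers using different types of metrics.

Recipe format: xxx_bring_your_own_dataset_eval.yaml.

Bring your own dataset requirements

File format:

A single gen_qa.jsonl file that contains the evaluation examples. The file name must be exactly gen_qa.jsonl.
You must upload your dataset to an S3 location that the SageMaker training job can access.
The file must follow the required schema format for the general Q&A dataset.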

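Schema format - Each line in the .jsonl file must be a JSON object with the following fields.

Required fields.

query: String containing the question or instruction that needs an answer.

response: String containing the expected model output.

Optional fields.

system: String containing the system prompt that sets the behavior, role, or personality of the AI model before it processes the query.

Example entries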


{
   "system":"You are an English major with top marks in class who likes to give minimal word responses: ",
   "query":"What is the symbol that ends the sentence as a question",
   "response":"?"
}
{
   "system":"You are a pattern analysis specialist who provides succinct answers: ",
   "query":"What is the next number in this series? 1, 2, 4, 8, 16, ?",
   "response":"32"
}
{
   "system":"You have great attention to detail and follow instructions accurately: ",
   "query":"Repeat only the last two words of the following: I ate a hamburger today and it was kind of dry",
   "response":"of dry"
}

To use your custom dataset, modify your evaluation recipe to include the following required fields. Do not change any of their values:


evaluation:
  task: gen_qa 
  strategy: gen_qa 
  metric: all

Limitations

Only one .jsonl file is allowed per evaluation.
The file must strictly follow the defined schema.
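
Because the file must strictly follow the schema, it can help to check it locally before uploading. The following is a minimal sketch that uses only the Python standard library. The function name validate_gen_qa_lines is hypothetical, and the check covers only the fields described above (query and response required, system optional, all strings).

```python
import json

REQUIRED_FIELDS = ("query", "response")
OPTIONAL_FIELDS = ("system",)


def validate_gen_qa_lines(lines):
    """Yield (line_number, message) for every line that breaks the schema."""
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as error:
            yield number, f"not valid JSON: {error.msg}"
            continue
        if not isinstance(record, dict):
            yield number, "line is not a JSON object"
            continue
        for field in REQUIRED_FIELDS:
            if not isinstance(record.get(field), str):
                yield number, f"missing or non-string required field: {field}"
        for field in OPTIONAL_FIELDS:
            if field in record and not isinstance(record[field], str):
                yield number, f"optional field must be a string: {field}"


sample = [
    '{"query": "What is the next number in this series? 1, 2, 4, 8, 16, ?", "response": "32"}',
    '{"query": "This record has no response"}',
]
print(list(validate_gen_qa_lines(sample)))
# prints [(2, 'missing or non-string required field: response')]
```

To check a real file, pass an open file handle instead of the sample list, for example open("gen_qa.jsonl", encoding="utf-8"), because iterating over a file yields one line at a time.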

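Nova LLM Judge is a model evaluation feature that lets you compare the quality of responses from one model against a baseline model's responses using a custom dataset. It accepts a dataset containing prompts, baseline responses, and challenger responses, then uses a Nova Judge model to provide a win rate metric based on Bradley-Terry probability through pairwise comparisons.

Recipe format: xxx_llm_judge_eval.yaml.

Nova LLM Judge dataset requirements

File format:

A single llm_judge.jsonl file that contains the evaluation examples. The file name must be exactly llm_judge.jsonl.
You must upload your dataset to an S3 location that the SageMaker training job can access.
The file must follow the required schema format for the llm_judge dataset.
The input dataset should ensure that all records are under 12k context length.

Schema format - Each line in the .jsonl file must be a JSON object with the following fields.

Required fields.

prompt: String containing the prompt for the generated responses.

response_A: String containing the baseline response.

response_B: String containing the alternative response to be compared with the baseline response.

Example entries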


{
"prompt": "What is the most effective way to combat climate change?",
"response_A": "The most effective way to combat climate change is through a combination of transitioning to renewable energy sources and implementing strict carbon pricing policies. This creates economic incentives for businesses to reduce emissions while promoting clean energy adoption.",
"response_B": "We should focus on renewable energy. Solar and wind power are good. People should drive electric cars. Companies need to pollute less."
}
{
"prompt": "Explain how a computer's CPU works",
"response_A": "CPU is like brain of computer. It does math and makes computer work fast. Has lots of tiny parts inside.",
"response_B": "A CPU (Central Processing Unit) functions through a fetch-execute cycle, where instructions are retrieved from memory, decoded, and executed through its arithmetic logic unit (ALU). It coordinates with cache memory and registers to process data efficiently using binary operations."
}
{
"prompt": "How does photosynthesis work?",
"response_A": "Plants do photosynthesis to make food. They use sunlight and water. It happens in leaves.",
"response_B": "Photosynthesis is a complex biochemical process where plants convert light energy into chemical energy. They utilize chlorophyll to absorb sunlight, combining CO2 and water to produce glucose and oxygen through a series of chemical reactions in chloroplasts."
}

To use your custom dataset, modify your evaluation recipe to include the following required fields. Do not change any of their values:


evaluation:
  task: llm_judge
  strategy: judge
  metric: all

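Limitations

Only one .jsonl file is allowed per evaluation.
The file must strictly follow the defined schema.
Nova Judge models are the same across the micro, lite, and pro specifications.
Custom judge models are not currently supported.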
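
The example entries are shown across several lines for readability, but the schema requires one JSON object per line. The following sketch shows one way to serialize in-memory records into that shape. The function name to_llm_judge_jsonl is hypothetical, and the 12k context-length limit is not checked because this page does not define how it is measured.

```python
import json

JUDGE_FIELDS = ("prompt", "response_A", "response_B")


def to_llm_judge_jsonl(records):
    """Serialize records to JSONL text: one compact JSON object per line."""
    lines = []
    for index, record in enumerate(records):
        missing = [f for f in JUDGE_FIELDS if not isinstance(record.get(f), str)]
        if missing:
            raise ValueError(f"record {index} is missing string fields: {missing}")
        # Keep only the schema fields. json.dumps escapes embedded newlines,
        # so each record stays on a single line.
        lines.append(json.dumps({f: record[f] for f in JUDGE_FIELDS}))
    return "".join(line + "\n" for line in lines)


records = [
    {
        "prompt": "How does photosynthesis work?",
        "response_A": "Plants do photosynthesis to make food.",
        "response_B": "Plants convert light energy\ninto chemical energy.",
    }
]
text = to_llm_judge_jsonl(records)
print(len(text.splitlines()))  # prints 1
```

The returned text can then be written to a file named llm_judge.jsonl and uploaded to the S3 location that the job reads from.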

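Run evaluation training jobs

Use the following sample Jupyter notebook to start a training job. For more information, see Use a SageMaker AI estimator to run a training job.

Reference tables

Before running the notebook, refer to the following reference tables to select the image URI and instance configuration.

Selecting the image URI

Recipe	Image URI
Evaluation image URI	`708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-evaluation-repo:SM-TJ-Eval-latest`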

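Selecting the instance type and count

Model	Job type	Instance type	Recommended instance count	Allowed instance count
Amazon Nova Micro	Evaluation (SFT/DPO)	g5.12xlarge	1	1
Amazon Nova Lite	Evaluation (SFT/DPO)	g5.12xlarge	1	1
Amazon Nova Pro	Evaluation (SFT/DPO)	p5.48xlarge	1	1

Sample notebook

The following sample notebook demonstrates how to run an evaluation training job.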


# install python SDK
!pip install sagemaker
 
import os
import boto3
import sagemaker
from sagemaker.inputs import TrainingInput
from sagemaker.pytorch import PyTorch

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Download recipe from https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection/recipes/evaluation/nova to local
# Assume the file name be `recipe.yaml`

# Populate parameters
# input_s3_uri = "s3://<path>/input/" # (Optional) Only used for multi-modal dataset or bring your own dataset s3 location
output_s3_uri= "s3://<path>/output/" # Output data s3 location, a zip containing metrics json and tensorboard metrics files will be stored to this location
instance_type = "instance_type"  # for example, ml.g5.12xlarge (see the instance table above)
job_name = "your job name"
recipe_path = "recipe path" # ./recipe.yaml as example
image_uri = "708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-evaluation-repo:SM-TJ-Eval-latest" # Do not change

# (Optional) To bring your own dataset and LLM judge for evaluation
# evalInput = TrainingInput(
# s3_data=input_s3_uri,
# distribution='FullyReplicated',
# s3_data_type='S3Prefix'
#)

estimator = PyTorch(
    output_path=output_s3_uri,
    base_job_name=job_name,
    role=role,
    instance_type=instance_type,
    training_recipe=recipe_path,
    sagemaker_session=sagemaker_session,
    image_uri = image_uri
)
estimator.fit()

# If input dataset exist, pass in inputs
# estimator.fit(inputs={"train": evalInput})

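Assess and analyze evaluation results

After your evaluation job completes successfully, use the following steps to assess and analyze the results.

Understand the output location structure. Results are stored as a compressed file in the Amazon S3 output location that you specified: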


s3://your-bucket/output/benchmark-name/
└── job_name/
    └── output/
        └── output.tar.gz

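Download the output.tar.gz file from your bucket and extract its contents to reveal the following structure, which exists for all benchmarks except strong_reject.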


run_name/
├── eval_results/
│   ├── results_[timestamp].json
│   └── details/
│       └── model/
│           └── <execution-date-time>/
│               └── details_<task_name>_#_<datetime>.parquet
└── tensorboard_results/
    └── eval/
        └── events.out.tfevents.[timestamp]

results_[timestamp].json - Output metrics JSON file
details_<task_name>_#_<datetime>.parquet - Inference output file
events.out.tfevents.[timestamp] - TensorBoard output file
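
After output.tar.gz has been downloaded, the metrics file can be located and loaded with the Python standard library. This is a sketch rather than an official utility: load_results is a hypothetical helper, it assumes only the folder layout shown above, and it makes no assumption about the keys inside the results JSON. Extract only archives that come from your own jobs.

```python
import json
import tarfile
from pathlib import Path


def load_results(archive_path, extract_dir):
    """Extract output.tar.gz and return {results file name: parsed JSON}."""
    extract_dir = Path(extract_dir)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(extract_dir)
    results = {}
    # Search recursively so that the top-level run_name folder can be anything,
    # and keep only files that sit directly inside an eval_results folder.
    for path in sorted(extract_dir.rglob("results_*.json")):
        if path.parent.name == "eval_results":
            results[path.name] = json.loads(path.read_text(encoding="utf-8"))
    return results
```

For example, load_results("output.tar.gz", "extracted") returns a dictionary keyed by file name, which can then be printed or compared across runs.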

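View the results in TensorBoard. To visualize your evaluation metrics:
1. Upload the extracted folder to an S3 bucket
2. Navigate to SageMaker TensorBoard
3. Select your "S3 folders"
4. Add the S3 folder path
5. Wait for synchronization to complete
Analyze the inference outputs. All evaluation tasks except llm_judge have the following fields for analysis in the inference output.
- full_prompt - The full user prompt sent to the model used for the evaluation task.
- gold - The field that contains the correct answer(s) as specified by the dataset.
- metrics - The field that contains the metrics evaluated on the individual inference. Values that require aggregation do not have a value on the individual inference outputs.
- predictions - The field that contains a list of the model's outputs for the given prompt.
By looking at these fields, you can determine the cause of metric differences and understand the behavior of the customized model.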

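For llm_judge, the inference output file contains the following fields under the metrics field for each pair of evaluations.
- forward_output - The judge's raw preferences when evaluating in order (response_A, response_B).
- backward_output - The judge's raw preferences when evaluating in reverse order (response_B, response_A).
- Pairwise metrics - Metrics calculated for each pair of forward and backward evaluations, including a_scores, b_scores, ties, inference-score, and score.
  
  Note
  Aggregate metrics such as winrate are only available in the summary results files, not for individual judgments.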
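
To build intuition for how pairwise judgments roll up into an aggregate win rate with a bootstrap interval, consider the following illustrative sketch. It is not the service's implementation. It assumes that each judgment has been reduced to "A", "B", or "tie", that ties are ignored, and that 0.5 is returned when no judgment is decisive. With only two models, the maximum-likelihood Bradley-Terry preference probability reduces to the share of decisive judgments won by B. The function names are hypothetical.

```python
import random


def winrate(outcomes):
    """Share of decisive judgments won by B; outcomes are "A", "B", or "tie"."""
    wins_a = outcomes.count("A")
    wins_b = outcomes.count("B")
    decisive = wins_a + wins_b
    # Assumption: ties are ignored, and 0.5 is returned when nothing is decisive.
    return wins_b / decisive if decisive else 0.5


def bootstrap_interval(outcomes, samples=1000, seed=0):
    """Percentile bootstrap (2.5th and 97.5th percentiles) of the win rate."""
    rng = random.Random(seed)
    # Resample the judgments with replacement and recompute the win rate each time.
    rates = sorted(
        winrate(rng.choices(outcomes, k=len(outcomes))) for _ in range(samples)
    )
    lower = rates[int(0.025 * (samples - 1))]
    upper = rates[int(0.975 * (samples - 1))]
    return lower, upper


judgments = ["B", "B", "B", "A", "tie"]
print(winrate(judgments))  # prints 0.75
```

Calling bootstrap_interval(judgments) returns a (lower, upper) pair that plays the same role as lower_rate and upper_rate in the summary results.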

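Evaluation best practices and troubleshooting

Best practices

The following are some best practices for the evaluation process.

Keep your output paths organized by model and benchmark type.
Maintain consistent naming conventions for easy tracking.
Save the extracted results in a secure location.
Monitor the TensorBoard sync status to confirm that data loads successfully.

Troubleshooting

You can use the CloudWatch log group /aws/sagemaker/TrainingJobs to find training job error logs.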

CUDA out of memory error

Issue:

When running model evaluation, you receive the following error:


torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate X MiB. 
GPU 0 has a total capacity of Y GiB of which Z MiB is free.

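Cause:

This error occurs when you attempt to load a model that requires more GPU memory than your current instance type provides.

Solution:

Choose an instance type with more GPU memory. For example, if you use G5.12xlarge (96 GiB GPU memory), upgrade to G5.48xlarge (192 GiB GPU memory).

Prevention:

Before running model evaluation, do the following.

Estimate your model's memory requirements.
Make sure that your selected instance type has sufficient GPU memory.
Account for the memory overhead required for model loading and inference.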
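
A back-of-the-envelope estimate is often enough for the first of these checks. The sketch below multiplies a parameter count by the bytes per parameter and a guessed overhead factor. Every number in it is an assumption: this page does not state parameter counts for Amazon Nova models, the 7-billion figure is purely hypothetical, 2 bytes per parameter assumes 16-bit weights, and the 1.2 overhead factor is a placeholder to adjust.

```python
GIB = 1024 ** 3


def estimate_gpu_memory_gib(num_parameters, bytes_per_parameter=2, overhead=1.2):
    """Rough weight memory in GiB, scaled by a guessed overhead factor."""
    return num_parameters * bytes_per_parameter * overhead / GIB


def fits(num_parameters, instance_gpu_memory_gib, **kwargs):
    """True if the rough estimate is within the instance's total GPU memory."""
    return estimate_gpu_memory_gib(num_parameters, **kwargs) <= instance_gpu_memory_gib


# Hypothetical 7-billion-parameter model on an instance with 96 GiB of GPU memory.
print(round(estimate_gpu_memory_gib(7e9), 1))  # prints 15.6
print(fits(7e9, 96))  # prints True
```

The estimate ignores how memory is split across the GPUs on an instance, so treat a narrow pass as a warning rather than a guarantee.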

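Available subtasks

The following lists the available subtasks for model evaluation across multiple domains, including MMLU (Massive Multitask Language Understanding), BBH (Big Bench Hard), MATH, and MMMU (Massive Multi-discipline Multimodal Understanding). These subtasks let you assess your model's performance on specific capabilities and knowledge areas.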

MMLU


MMLU_SUBTASKS = [
    "abstract_algebra",
    "anatomy",
    "astronomy",
    "business_ethics",
    "clinical_knowledge",
    "college_biology",
    "college_chemistry",
    "college_computer_science",
    "college_mathematics",
    "college_medicine",
    "college_physics",
    "computer_security",
    "conceptual_physics",
    "econometrics",
    "electrical_engineering",
    "elementary_mathematics",
    "formal_logic",
    "global_facts",
    "high_school_biology",
    "high_school_chemistry",
    "high_school_computer_science",
    "high_school_european_history",
    "high_school_geography",
    "high_school_government_and_politics",
    "high_school_macroeconomics",
    "high_school_mathematics",
    "high_school_microeconomics",
    "high_school_physics",
    "high_school_psychology",
    "high_school_statistics",
    "high_school_us_history",
    "high_school_world_history",
    "human_aging",
    "human_sexuality",
    "international_law",
    "jurisprudence",
    "logical_fallacies",
    "machine_learning",
    "management",
    "marketing",
    "medical_genetics",
    "miscellaneous",
    "moral_disputes",
    "moral_scenarios",
    "nutrition",
    "philosophy",
    "prehistory",
    "professional_accounting",
    "professional_law",
    "professional_medicine",
    "professional_psychology",
    "public_relations",
    "security_studies",
    "sociology",
    "us_foreign_policy",
    "virology",
    "world_religions"
]

BBH


BBH_SUBTASKS = [
    "boolean_expressions",
    "causal_judgement",
    "date_understanding",
    "disambiguation_qa",
    "dyck_languages",
    "formal_fallacies",
    "geometric_shapes",
    "hyperbaton",
    "logical_deduction_five_objects",
    "logical_deduction_seven_objects",
    "logical_deduction_three_objects",
    "movie_recommendation",
    "multistep_arithmetic_two",
    "navigate",
    "object_counting",
    "penguins_in_a_table",
    "reasoning_about_colored_objects",
    "ruin_names",
    "salient_translation_error_detection",
    "snarks",
    "sports_understanding",
    "temporal_sequences",
    "tracking_shuffled_objects_five_objects",
    "tracking_shuffled_objects_seven_objects",
    "tracking_shuffled_objects_three_objects",
    "web_of_lies",
    "word_sorting"
]

Math


MATH_SUBTASKS = [
    "algebra",
    "counting_and_probability",
    "geometry",
    "intermediate_algebra",
    "number_theory",
    "prealgebra",
    "precalculus",
]

MMMU


MMMU_SUBTASKS = [
    "Accounting",
    "Agriculture",
    "Architecture_and_Engineering",
    "Art",
    "Art_Theory",
    "Basic_Medical_Science",
    "Biology",
    "Chemistry",
    "Clinical_Medicine",
    "Computer_Science",
    "Design",
    "Diagnostics_and_Laboratory_Medicine",
    "Economics",
    "Electronics",
    "Energy_and_Power",
    "Finance",
    "Geography",
    "History",
    "Literature",
    "Manage",
    "Marketing",
    "Materials",
    "Math",
    "Mechanical_Engineering",
    "Music",
    "Pharmacy",
    "Physics",
    "Psychology",
    "Public_Health",
    "Sociology",
]
