

This is a machine-translated version of the English original. Where any ambiguity or inconsistency exists between this translation and the English version, the English version prevails.

# Customize models with Amazon SageMaker AI
<a name="customize-model"></a>

Amazon SageMaker AI model customization is a capability that transforms the traditionally complex and time-consuming process of customizing AI models from a months-long effort into a streamlined workflow that can be completed in days. It addresses a key challenge for AI developers who need to customize foundation models with proprietary data to build highly differentiated customer experiences. This SageMaker AI guide provides detailed customization documentation, including step-by-step guides and advanced configuration options. For a brief overview of Nova model customization, see [Customize and fine-tune with SageMaker](https://docs.aws.amazon.com/nova/latest/userguide/customize-fine-tune-sagemaker.html) in the *Amazon Nova User Guide*.

The capability includes a new guided user interface that understands natural-language requirements, along with a complete suite of advanced model customization techniques, all built on serverless infrastructure that eliminates the operational overhead of managing compute resources. Whether you are building a legal research application, enhancing a customer service chatbot, or developing domain-specific AI agents, it accelerates the path from proof of concept to production deployment.

Model customization features powered by Amazon Bedrock Evaluations may securely transfer data across AWS Region geographies for processing. For more information, see the [Amazon Bedrock Evaluations documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/evaluation-judge.html).

## Key concepts
<a name="model-customize-concepts"></a>

**Serverless training**

Fully managed compute infrastructure that removes all infrastructure complexity, letting you focus solely on model development. This includes automatic provisioning of GPU instances (P5, P4de, P4d, G5) based on model size and training requirements, pre-optimized training recipes that incorporate best practices for each customization technique, live monitoring with real-time metrics and logs accessible through the UI, and automatic resource cleanup after training completes to optimize costs.

**Model customization techniques**

A complete set of advanced methods, including supervised fine-tuning (SFT), direct preference optimization (DPO), reinforcement learning with verifiable rewards (RLVR), and reinforcement learning with AI feedback (RLAIF).

**Custom models**

Specialized versions of foundation models that have been trained on your own data to fit specific use cases, producing AI models that retain the general capabilities of the original foundation model while adding domain-specific knowledge, terminology, style, or behavior tailored to your needs.

**AI model customization assets**

The resources and artifacts used to train, refine, and evaluate custom models during model customization. These assets include **datasets**, which are collections of training examples (prompt-response pairs, domain-specific text, or labeled data) used to fine-tune a foundation model to learn specific behaviors, knowledge, or styles, and **evaluators**, which are mechanisms for evaluating and improving model performance through either a ***reward function*** (code-based logic that scores model outputs against specific criteria, used for RLVR training and custom-scorer evaluation) or a ***reward prompt*** (natural-language instructions that guide an LLM judge in assessing the quality of model responses, used for RLAIF training and LLM-as-a-judge evaluation).

**Model package groups**

A collection container that tracks all logged models from training jobs, providing a central location for model versions and their lineage.

**Logged models**

The output that SageMaker AI creates when you run a serverless training job. This can be a fine-tuned model (successful job), a checkpoint (failed job with checkpointing), or associated metadata (failed job without checkpoints).

**Registered models**

Logged models that have been marked for formal tracking and management, enabling full lineage and lifecycle management.

**Lineage**

Automatically captured relationships between training jobs, input datasets, output models, evaluation jobs, and deployments across SageMaker AI and Amazon Bedrock.

**Cross-account sharing**

The ability to share models, datasets, and evaluators across AWS accounts by using AWS Resource Access Manager (RAM), while maintaining full lineage visibility.
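The dataset concept above can be made concrete with a small sketch. An SFT dataset is a collection of prompt-response pairs, commonly stored as one JSON object per line (JSONL). The field names `prompt` and `response` and the file layout here are illustrative assumptions, not the authoritative schema; use the data input format reference in the Studio UI for the exact format that each customization technique expects.

```python
import json

# Hypothetical SFT training examples for a legal research use case.
# Field names ("prompt", "response") are assumptions for illustration.
examples = [
    {"prompt": "Summarize the indemnification clause.",
     "response": "The supplier indemnifies the buyer against third-party IP claims."},
    {"prompt": "Which court has jurisdiction under section 12?",
     "response": "The courts of the State of New York."},
]

# Write one JSON object per line (JSONL), the common layout for
# fine-tuning datasets uploaded to Amazon S3.
with open("sft-dataset.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

# Each line parses independently as its own JSON object.
with open("sft-dataset.jsonl") as f:
    lines = [json.loads(line) for line in f]
print(len(lines))  # 2
```

For preference-based techniques such as DPO, each record would typically carry a preferred and a rejected response instead of a single reference response.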

# Customize Amazon Nova models on Amazon SageMaker AI
<a name="nova-model"></a>

This topic has moved. For the latest information, see Customize Amazon Nova models on Amazon SageMaker AI in the [Amazon Nova 1.0 User Guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-model.html) or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-model.html).

# Amazon Nova recipes
<a name="nova-model-recipes"></a>

This topic has moved. For the latest information, see Amazon Nova recipes in the [Amazon Nova 1.0 User Guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-model-recipes.html) or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-model-recipes.html).

# Amazon Nova customization on SageMaker training jobs
<a name="nova-model-training-job"></a>

This topic has moved. For the latest information, see Amazon Nova customization on SageMaker training jobs in the [Amazon Nova 1.0 User Guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-model-training-job.html) or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-model-training-job.html).

**Topics**
+ [Amazon Nova distillation](nova-distillation.md)
+ [Nova customization SDK](nova-customization-sdk.md)
+ [Fine-tune Amazon Nova models with SageMaker training jobs](nova-fine-tuning-training-job.md)
+ [Monitor the progress of iterations](nova-model-monitor.md)
+ [Evaluate your SageMaker AI-trained model](nova-model-evaluation.md)

# Amazon Nova distillation
<a name="nova-distillation"></a>

This topic has moved. For the latest information, see Amazon Nova distillation in the [Amazon Nova 1.0 User Guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-distillation.html).

# Nova customization SDK
<a name="nova-customization-sdk"></a>

The Amazon Nova Customization SDK is a comprehensive Python SDK for customizing Amazon Nova models across the entire lifecycle, from training and evaluation to deployment and inference.

This topic has moved. For the latest information, see Nova Customization SDK in the [Amazon Nova 1.0 User Guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-customization-sdk.html) or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-customization-sdk.html).

# Fine-tune Amazon Nova models with SageMaker training jobs
<a name="nova-fine-tuning-training-job"></a>

This topic has moved. For the latest information, see Fine-tune Amazon Nova models with SageMaker training jobs in the [Amazon Nova 1.0 User Guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-fine-tune-1.html) or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/smtj-training.html).

# Monitor the progress of iterations
<a name="nova-model-monitor"></a>

This topic has moved. For the latest information, see Monitor the progress of iterations in the [Amazon Nova 1.0 User Guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-model-monitor.html) or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-model-monitor.html).

# Evaluate your SageMaker AI-trained model
<a name="nova-model-evaluation"></a>

This topic has moved. For the latest information, see Evaluate your SageMaker AI-trained model in the [Amazon Nova 1.0 User Guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-model-evaluation.html) or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-model-evaluation.html).

# Amazon Nova customization on Amazon SageMaker HyperPod
<a name="nova-hp"></a>

This topic has moved. For the latest information, see Amazon Nova customization on Amazon SageMaker HyperPod in the [Amazon Nova 1.0 User Guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-hp.html) or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-hp.html).

# Nova customization SDK
<a name="nova-hp-customization-sdk"></a>

This topic has moved. For the latest information, see Nova Customization SDK in the [Amazon Nova 1.0 User Guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-hp-customization-sdk.html) or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-hp-customization-sdk.html).

# SageMaker HyperPod essential commands guide
<a name="nova-hp-essential-commands-guide"></a>

This topic has moved. For the latest information, see the SageMaker HyperPod essential commands guide in the [Amazon Nova 1.0 User Guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-hp-essential-commands-guide.html) or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-hp-essential-commands-guide.html).

# Create a HyperPod EKS cluster with a restricted instance group (RIG)
<a name="nova-hp-cluster"></a>

This topic has moved. For the latest information, see Create a SageMaker HyperPod EKS cluster with a RIG in the [Amazon Nova 1.0 User Guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-hp-cluster.html) or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-hp-cluster.html).

# Nova Forge access and setup for SageMaker AI HyperPod
<a name="nova-forge-hp-access"></a>

This topic has moved. For the latest information, see Nova Forge access and setup for SageMaker AI HyperPod in the [Amazon Nova 1.0 User Guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-forge-hp-access.html) or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-forge-hp-access.html).

# Training Amazon Nova models
<a name="nova-hp-training"></a>

This topic has moved. For the latest information, see Training Amazon Nova models in the [Amazon Nova 1.0 User Guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-hp-training.html) or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-hp-training.html).

# Fine-tune Amazon Nova models on Amazon SageMaker HyperPod
<a name="nova-hp-fine-tune"></a>

This topic has moved. For the latest information, see Fine-tune Amazon Nova models on Amazon SageMaker HyperPod in the [Amazon Nova 1.0 User Guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-hp-fine-tune.html).

# Evaluate your trained model
<a name="nova-hp-evaluate"></a>

This topic has moved. For the latest information, see Evaluate your trained model in the [Amazon Nova 1.0 User Guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-hp-evaluate.html) or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-hp-evaluate.html).

# Iterative training
<a name="nova-iterative-training"></a>

Iterative training lets you improve model performance through multiple training cycles, building on previous checkpoints to systematically address failure modes and adapt to evolving requirements.

This topic has moved. For the latest information, see Iterative training in the [Amazon Nova 1.0 User Guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-iterative-training.html) or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-iterative-training.html).

# Amazon Bedrock inference
<a name="nova-model-bedrock-inference"></a>

This topic has moved. For the latest information, see Amazon Bedrock inference in the [Amazon Nova 1.0 User Guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-model-bedrock-inference.html) or the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-model-bedrock-inference.html).

# Limitations for customizing Amazon Nova models
<a name="nova-model-limitations"></a>

This topic has moved. For the latest information, see Limitations for customizing Amazon Nova models in the [Amazon Nova 2.0 User Guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-model-limitations.html).

# Open-weight model customization
<a name="model-customize-open-weight"></a>

This section walks you through the process of getting started with open-weight model customization.

**Topics**
+ [Prerequisites](model-customize-open-weight-prereq.md)
+ [Create assets for model customization in the UI](model-customize-open-weight-create-assets-ui.md)
+ [AI model customization job submission](model-customize-open-weight-job.md)
+ [Model evaluation job submission](model-customize-open-weight-evaluation.md)
+ [Model deployment](model-customize-open-weight-deployment.md)
+ [Sample datasets and evaluators](model-customize-open-weight-samples.md)

# Prerequisites
<a name="model-customize-open-weight-prereq"></a>

Before you begin, complete the following prerequisites:
+ Onboard to a SageMaker AI domain with Studio access. If you don't have permission to set Studio as the domain's default experience, contact your administrator. For more information, see [Amazon SageMaker AI domain overview](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-studio-onboard.html).
+ Update the AWS CLI by following the steps in [Installing the current AWS CLI version](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv1.html#install-tool-bundled).
+ From your local machine, run `aws configure` and provide your AWS credentials. For information about AWS credentials, see [Understanding and getting your AWS credentials](https://docs.aws.amazon.com/IAM/latest/UserGuide/security-creds.html).

## Required IAM permissions
<a name="model-customize-open-weight-iam"></a>

SageMaker AI model customization requires that the appropriate permissions be added to the SageMaker AI domain execution role. To do this, you can create an inline IAM permissions policy and attach it to the IAM role. For more information about adding policies, see [Adding and removing IAM identity permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html) in the *AWS Identity and Access Management User Guide*.

```
{
    "Version": "2012-10-17",		 	 	 
    "Statement": [
        {
            "Sid": "AllowNonAdminStudioActions",
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreatePresignedDomainUrl",
                "sagemaker:DescribeDomain",
                "sagemaker:DescribeUserProfile",
                "sagemaker:DescribeSpace",
                "sagemaker:ListSpaces",
                "sagemaker:DescribeApp",
                "sagemaker:ListApps"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:*:domain/*",
                "arn:aws:sagemaker:*:*:user-profile/*",
                "arn:aws:sagemaker:*:*:app/*",
                "arn:aws:sagemaker:*:*:space/*"
             ]
        },
        {
            "Sid": "LambdaListPermissions",
            "Effect": "Allow",
            "Action": [
                "lambda:ListFunctions"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Sid": "LambdaPermissionsForRewardFunction",
            "Effect": "Allow",
            "Action": [
                "lambda:CreateFunction",
                "lambda:DeleteFunction",
                "lambda:InvokeFunction",
                "lambda:GetFunction"
            ],
            "Resource": [
                "arn:aws:lambda:*:*:function:*SageMaker*",
                "arn:aws:lambda:*:*:function:*sagemaker*",
                "arn:aws:lambda:*:*:function:*Sagemaker*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "LambdaLayerForAWSSDK",
            "Effect": "Allow",
            "Action": [
                "lambda:GetLayerVersion"
            ],
            "Resource": [
                "arn:aws:lambda:*:336392948345:layer:AWSSDK*"
            ]
        },
        {
            "Sid": "SageMakerPublicHubPermissions",
            "Effect": "Allow",
            "Action": [
                "sagemaker:ListHubContents"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:aws:hub/SageMakerPublicHub"
            ]
        },
        {
            "Sid": "SageMakerHubPermissions",
            "Effect": "Allow",
            "Action": [
                "sagemaker:ListHubs",
                "sagemaker:ListHubContents",
                "sagemaker:DescribeHubContent",
                "sagemaker:DeleteHubContent",
                "sagemaker:ListHubContentVersions",
                "sagemaker:Search"
            ],
            "Resource": [
                "*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "JumpStartAccess",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::jumpstart*"
            ]
        },
        {
            "Sid": "ListMLFlowOperations",
            "Effect": "Allow",
            "Action": [
                "sagemaker:ListMlflowApps",
                "sagemaker:ListMlflowTrackingServers"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Sid": "MLFlowAccess",
            "Effect": "Allow",
            "Action": [
                "sagemaker:UpdateMlflowApp",
                "sagemaker:DescribeMlflowApp",
                "sagemaker:CreatePresignedMlflowAppUrl",
                "sagemaker:CallMlflowAppApi",
                "sagemaker-mlflow:*"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:*:mlflow-app/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "BYODataSetS3Access",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::*SageMaker*",
                "arn:aws:s3:::*Sagemaker*",
                "arn:aws:s3:::*sagemaker*"
            ]
        },
        {
            "Sid": "AllowHubPermissions",
            "Effect": "Allow",
            "Action": [
                "sagemaker:ImportHubContent"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:*:hub/*",
                "arn:aws:sagemaker:*:*:hub-content/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "PassRoleForSageMaker",
            "Effect": "Allow",
            "Action": [
                "iam:PassRole"
            ],
            "Resource": [
                "arn:aws:iam::*:role/service-role/AmazonSageMaker-ExecutionRole-*"
            ],
            "Condition": {
                "StringEquals": {
                    "iam:PassedToService": "sagemaker.amazonaws.com",
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "PassRoleForAWSLambda",
            "Effect": "Allow",
            "Action": [
                "iam:PassRole"
            ],
            "Resource": [
                "arn:aws:iam::*:role/service-role/AmazonSageMaker-ExecutionRole-*"
            ],
            "Condition": {
                "StringEquals": {
                    "iam:PassedToService": "lambda.amazonaws.com",
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "PassRoleForBedrock",
            "Effect": "Allow",
            "Action": [
                "iam:PassRole"
            ],
            "Resource": [
                "arn:aws:iam::*:role/service-role/AmazonSageMaker-ExecutionRole-*"
            ],
            "Condition": {
                "StringEquals": {
                    "iam:PassedToService": "bedrock.amazonaws.com",
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "TrainingJobRun",
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateTrainingJob",
                "sagemaker:DescribeTrainingJob",
                "sagemaker:ListTrainingJobs"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:*:training-job/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "ModelPackageAccess",
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateModelPackage",
                "sagemaker:DescribeModelPackage",
                "sagemaker:ListModelPackages",
                "sagemaker:CreateModelPackageGroup",
                "sagemaker:DescribeModelPackageGroup",
                "sagemaker:ListModelPackageGroups",
                "sagemaker:CreateModel"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:*:model-package-group/*",
                "arn:aws:sagemaker:*:*:model-package/*",
                "arn:aws:sagemaker:*:*:model/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "TagsPermission",
            "Effect": "Allow",
            "Action": [
                "sagemaker:AddTags",
                "sagemaker:ListTags"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:*:model-package-group/*",
                "arn:aws:sagemaker:*:*:model-package/*",
                "arn:aws:sagemaker:*:*:hub/*",
                "arn:aws:sagemaker:*:*:hub-content/*",
                "arn:aws:sagemaker:*:*:training-job/*",
                "arn:aws:sagemaker:*:*:model/*",
                "arn:aws:sagemaker:*:*:endpoint/*",
                "arn:aws:sagemaker:*:*:endpoint-config/*",
                "arn:aws:sagemaker:*:*:pipeline/*",
                "arn:aws:sagemaker:*:*:inference-component/*",
                "arn:aws:sagemaker:*:*:action/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "LogAccess",
            "Effect": "Allow",
            "Action": [
                "logs:DescribeLogGroups",
                "logs:DescribeLogStreams",
                "logs:GetLogEvents"
            ],
            "Resource": [
                "arn:aws:logs:*:*:log-group*",
                "arn:aws:logs:*:*:log-group:/aws/sagemaker/TrainingJobs:log-stream:*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "BedrockDeploy",
            "Effect": "Allow",
            "Action": [
                "bedrock:CreateModelImportJob"
            ],
            "Resource": [
                "arn:aws:bedrock:*:*:*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "BedrockOperations",
            "Effect": "Allow",
            "Action": [
                "bedrock:GetModelImportJob",
                "bedrock:GetImportedModel",
                "bedrock:ListProvisionedModelThroughputs",
                "bedrock:ListCustomModelDeployments",
                "bedrock:ListCustomModels",
                "bedrock:ListModelImportJobs",
                "bedrock:GetEvaluationJob",
                "bedrock:CreateEvaluationJob", 
                "bedrock:InvokeModel"
            ],
            "Resource": [
                "arn:aws:bedrock:*:*:evaluation-job/*",
                "arn:aws:bedrock:*:*:imported-model/*",
                "arn:aws:bedrock:*:*:model-import-job/*",
                "arn:aws:bedrock:*:*:foundation-model/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "BedrockFoundationModelOperations",
            "Effect": "Allow",
            "Action": [
                "bedrock:GetFoundationModelAvailability",
                "bedrock:ListFoundationModels"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Sid": "SageMakerPipelinesAndLineage",
            "Effect": "Allow",
            "Action": [
                "sagemaker:ListActions",
                "sagemaker:ListArtifacts",
                "sagemaker:QueryLineage",
                "sagemaker:ListAssociations",
                "sagemaker:AddAssociation",
                "sagemaker:DescribeAction",
                "sagemaker:AddAssociation",
                "sagemaker:CreateAction",
                "sagemaker:CreateContext",
                "sagemaker:DescribeTrialComponent"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:*:artifact/*",
                "arn:aws:sagemaker:*:*:action/*",
                "arn:aws:sagemaker:*:*:context/*",
                "arn:aws:sagemaker:*:*:action/*",
                "arn:aws:sagemaker:*:*:model-package/*",
                "arn:aws:sagemaker:*:*:context/*",
                "arn:aws:sagemaker:*:*:pipeline/*",
                "arn:aws:sagemaker:*:*:experiment-trial-component/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "ListOperations",
            "Effect": "Allow",
            "Action": [
                "sagemaker:ListInferenceComponents",
                "sagemaker:ListWorkforces"
            ],
            "Resource": [
                "*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "SageMakerInference",
            "Effect": "Allow",
            "Action": [
                "sagemaker:DescribeInferenceComponent",
                "sagemaker:CreateEndpoint",
                "sagemaker:CreateEndpointConfig",
                "sagemaker:DescribeEndpoint",
                "sagemaker:DescribeEndpointConfig",
                "sagemaker:ListEndpoints"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:*:inference-component/*",
                "arn:aws:sagemaker:*:*:endpoint/*",
                "arn:aws:sagemaker:*:*:endpoint-config/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "SageMakerPipelines",
            "Effect": "Allow",
            "Action": [
                "sagemaker:DescribePipelineExecution",
                "sagemaker:ListPipelineExecutions",
                "sagemaker:ListPipelineExecutionSteps",
                "sagemaker:CreatePipeline",
                "sagemaker:UpdatePipeline",
                "sagemaker:StartPipelineExecution"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:*:pipeline/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        }
    ]
}
```

If you have already attached the [AmazonSageMakerFullAccessPolicy](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerFullAccess.html) to the execution role, you can add this reduced policy instead:

```
{
    "Version": "2012-10-17",		 	 	 
    "Statement": [
        {
            "Sid": "LambdaListPermissions",
            "Effect": "Allow",
            "Action": [
                "lambda:ListFunctions"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Sid": "LambdaPermissionsForRewardFunction",
            "Effect": "Allow",
            "Action": [
                "lambda:CreateFunction",
                "lambda:DeleteFunction",
                "lambda:InvokeFunction",
                "lambda:GetFunction"
            ],
            "Resource": [
                "arn:aws:lambda:*:*:function:*SageMaker*",
                "arn:aws:lambda:*:*:function:*sagemaker*",
                "arn:aws:lambda:*:*:function:*Sagemaker*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "LambdaLayerForAWSSDK",
            "Effect": "Allow",
            "Action": [
                "lambda:GetLayerVersion"
            ],
            "Resource": [
                "arn:aws:lambda:*:336392948345:layer:AWSSDK*"
            ]
        },
        {
            "Sid": "S3Access",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::*SageMaker*",
                "arn:aws:s3:::*Sagemaker*",
                "arn:aws:s3:::*sagemaker*",
                "arn:aws:s3:::jumpstart*"
            ]
        },
        {
            "Sid": "PassRoleForSageMakerAndLambdaAndBedrock",
            "Effect": "Allow",
            "Action": [
                "iam:PassRole"
            ],
            "Resource": [
                "arn:aws:iam::*:role/service-role/AmazonSageMaker-ExecutionRole-*"
            ],
            "Condition": { 
                "StringEquals": { 
                    "iam:PassedToService": [ 
                        "lambda.amazonaws.com", 
                        "bedrock.amazonaws.com"
                     ],
                     "aws:ResourceAccount": "${aws:PrincipalAccount}" 
                 } 
            }
        },
        {
            "Sid": "BedrockDeploy",
            "Effect": "Allow",
            "Action": [
                "bedrock:CreateModelImportJob"
            ],
            "Resource": [
                "*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "BedrockOperations",
            "Effect": "Allow",
            "Action": [
                "bedrock:GetModelImportJob",
                "bedrock:GetImportedModel",
                "bedrock:ListProvisionedModelThroughputs",
                "bedrock:ListCustomModelDeployments",
                "bedrock:ListCustomModels",
                "bedrock:ListModelImportJobs",
                "bedrock:GetEvaluationJob",
                "bedrock:CreateEvaluationJob",
                "bedrock:InvokeModel"
            ],
            "Resource": [
                "arn:aws:bedrock:*:*:evaluation-job/*",
                "arn:aws:bedrock:*:*:imported-model/*",
                "arn:aws:bedrock:*:*:model-import-job/*",
                "arn:aws:bedrock:*:*:foundation-model/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "BedrockFoundationModelOperations",
            "Effect": "Allow",
            "Action": [
                "bedrock:GetFoundationModelAvailability",
                "bedrock:ListFoundationModels"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}
```

Then, choose **Edit trust policy**, replace the trust policy with the following policy, and choose **Update policy**.

```
{
    "Version": "2012-10-17",		 	 	                    
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                 "Service": "lambda.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        },
        {
            "Effect": "Allow",
            "Principal": {
                   "Service": "sagemaker.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        },
        {
            "Effect": "Allow",
            "Principal": {
                  "Service": "bedrock.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
```

# Create assets for model customization in the UI
<a name="model-customize-open-weight-create-assets-ui"></a>

You can create and manage the dataset and evaluator assets to use for model customization in the UI.

## Assets
<a name="model-customize-open-weight-assets"></a>

In the Amazon SageMaker Studio UI, select **Assets** in the left panel, then select **Datasets**.

![\[Image showing model customization access.\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/screenshot-open-model-16.png)


Choose **Upload dataset** to add the datasets that you will use in your model customization jobs. By choosing **Required data input format**, you can access a reference for the dataset formats to use.

![\[Image showing model customization access.\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/screenshot-open-model-15.png)


## Evaluators
<a name="model-customize-open-weight-evaluators"></a>

You can also add **reward functions** and **reward prompts** for reinforcement learning customization jobs.

![\[Image showing model customization access.\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/screenshot-open-model-14.png)


The UI also provides guidance on the format required for reward functions and reward prompts.

![\[Image showing model customization access.\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/screenshot-open-model-13.png)
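As a minimal sketch of a code-based reward function for RLVR, the following shows an AWS Lambda handler that deterministically scores a model output against a verifiable reference. The event and response field names (`completion`, `ground_truth`, `score`) are illustrative assumptions, not the service's actual contract; follow the format guidance shown in the Studio UI.

```python
# Hypothetical reward function packaged as an AWS Lambda handler.
# The field names below are assumptions for illustration only.
def lambda_handler(event, context):
    completion = event.get("completion", "")
    ground_truth = event.get("ground_truth", "")

    # Verifiable criterion: exact match earns full credit, a completion
    # that merely contains the expected answer earns partial credit.
    if completion.strip() == ground_truth.strip():
        score = 1.0
    elif ground_truth.strip() and ground_truth.strip() in completion:
        score = 0.5
    else:
        score = 0.0

    return {"score": score}
```

Because RLVR scores outputs against verifiable criteria, deterministic logic like this exact-match check is typical; note that the IAM policy earlier in this section grants the Lambda permissions (such as `lambda:CreateFunction`) only on function names containing `SageMaker`.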


## Assets for model customization with the AWS SDK
<a name="model-customize-open-weight-create-assets-sdk"></a>

You can also create assets by using the SageMaker AI Python SDK. See the following example code snippets:

```
from pprint import pprint

from sagemaker.ai_registry.air_constants import REWARD_FUNCTION, REWARD_PROMPT
from sagemaker.ai_registry.dataset import DataSet, CustomizationTechnique
from sagemaker.ai_registry.evaluator import Evaluator

# Creating a dataset example
dataset = DataSet.create(
            name="sdkv3-gen-ds2",
            source="s3://sample-test-bucket/datasets/training-data/jamjee-sft-ds1.jsonl", # or use local filepath as source.
            customization_technique=CustomizationTechnique.SFT
        )

# Refreshes status from hub
dataset.refresh()
pprint(dataset.__dict__)

# Creating an evaluator. Method : Lambda
evaluator = Evaluator.create(
                name="sdk-new-rf11",
                source="arn:aws:lambda:us-west-2:<>:function:<function-name>",
                type=REWARD_FUNCTION
        )

# Creating an evaluator. Method : Bring your own code
evaluator = Evaluator.create(
                name = "eval-lambda-test",
                source="/path_to_local/eval_lambda_1.py",
                type = REWARD_FUNCTION
        )

# Optional wait, by default we have wait = True during create call.
evaluator.wait()

evaluator.refresh()
pprint(evaluator)
```

# AI model customization job submission
<a name="model-customize-open-weight-job"></a>

You can access the SageMaker AI model customization features from the Models page in the left panel of Amazon SageMaker Studio. There you can also find the Assets page, where you can create and manage model customization datasets and evaluators.

![\[Image showing model customization access.\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/screenshot-open-model-12.png)


To begin submitting a model customization job, select the Models option to access the JumpStart foundation models tab:

![\[Image showing how to choose a base model.\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/screenshot-open-model-11.png)


You can click Customize model directly in a model card, or search the hub for any model that you are interested in customizing.

![\[Image showing the model card and how to choose a model to customize.\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/screenshot-open-model-10.png)


When you click a model card, you can access the model details page. Click Customize model, then select Customize with UI to launch a customization job, such as setting up an RLVR job.

![\[Image showing how to launch a customization job.\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/screenshot-open-model-9.png)


You can then enter a name for your custom model, select the model customization technique to use, and configure the job hyperparameters:

![\[Image showing model customization technique selection.\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/screenshot-open-model-8.png)


![\[Image showing model customization technique selection.\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/screenshot-open-model-7.png)


## Submit an AI model customization job with the SDK
<a name="model-customize-open-weight-job-sdk"></a>

You can also submit model customization jobs using the SageMaker AI Python SDK:

```
# Submit a DPO model customization job

from sagemaker.modules.train.dpo_trainer import DPOTrainer
from sagemaker.modules.train.common import TrainingType

trainer = DPOTrainer(
    model=BASE_MODEL,
    training_type=TrainingType.LORA,
    model_package_group_name=MODEL_PACKAGE_GROUP_NAME,
    training_dataset=TRAINING_DATASET,
    s3_output_path=S3_OUTPUT_PATH,
    sagemaker_session=sagemaker_session,
    role=ROLE_ARN
)
```

## Monitor your customization job
<a name="model-customize-open-weight-monitor"></a>

After you submit a job, you are immediately redirected to the model customization training jobs page.

![\[Image showing model customization technique selection.\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/screenshot-open-model-6.png)


When the job completes, you can click the **Go to custom model** button in the upper right to go to the custom model details page.

![\[Image showing model customization technique selection.\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/screenshot-open-model-5.png)


From the custom model details page, you can work further with your custom model in the following ways:

1. Review information about performance, the location of generated artifacts, training configuration hyperparameters, and training logs.

1. Launch an evaluation job, or continue customization with a different dataset.

1. Deploy the model using a SageMaker AI inference endpoint or Amazon Bedrock Custom Model Import.  
![\[Image showing model customization technique selection.\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/screenshot-open-model-4.png)

# Model evaluation job submission
<a name="model-customize-open-weight-evaluation"></a>

This section describes open-weight custom model evaluation. It walks you through the evaluation job submission process to get you started, and provides additional resources for more advanced evaluation job submission use cases.

**Topics**
+ [Get started](model-customize-evaluation-getting-started.md)
+ [Evaluation types and job submission](model-customize-evaluation-types.md)
+ [Evaluation metrics formats](model-customize-evaluation-metrics-formats.md)
+ [Supported dataset formats for Bring-Your-Own-Dataset (BYOD) jobs](model-customize-evaluation-dataset-formats.md)
+ [Evaluation with preset and custom scorers](model-customize-evaluation-preset-custom-scorers.md)

# Get started
<a name="model-customize-evaluation-getting-started"></a>

## Submit your evaluation job through SageMaker Studio
<a name="model-customize-evaluation-studio"></a>

### Step 1: Navigate to evaluation from your model card
<a name="model-customize-evaluation-studio-step1"></a>

After customizing your model, navigate to the evaluation page from the model card.

For information on open-weight custom model training, see [https://docs.aws.amazon.com/sagemaker/latest/dg/model-customize-open-weight-job.html](https://docs.aws.amazon.com/sagemaker/latest/dg/model-customize-open-weight-job.html).

SageMaker visualizes your custom model on the My models tab:

![\[Registered model card page\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/getting-started-registered-model-card.png)


Choose View latest version, then choose Evaluate:

![\[Model customization page\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/getting-started-evaluate-from-model-card.png)


### Step 2: Submit your evaluation job
<a name="model-customize-evaluation-studio-step2"></a>

Choose the Submit button to submit your evaluation job. This submits a minimal MMLU benchmark job.

For information on supported evaluation job types, see [Evaluation types and job submission](model-customize-evaluation-types.md).

![\[Evaluation job submission page\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/getting-started-benchmark-submission.png)


### Step 3: Track your evaluation job progress
<a name="model-customize-evaluation-studio-step3"></a>

The Evaluation steps tab tracks your evaluation job progress:

![\[Your evaluation job progress\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/getting-started-benchmark-tracking.png)


### Step 4: View your evaluation job results
<a name="model-customize-evaluation-studio-step4"></a>

Your evaluation job results are visualized in the Evaluation results tab:

![\[Your evaluation job metrics\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/getting-started-benchmark-results.png)


### Step 5: View your completed evaluations
<a name="model-customize-evaluation-studio-step5"></a>

Your completed evaluation jobs appear under Evaluations on the model card:

![\[Your completed evaluation jobs\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/getting-started-benchmark-completed-model-card.png)


## Submit your evaluation job through the SageMaker Python SDK
<a name="model-customize-evaluation-sdk"></a>

### Step 1: Create your BenchMarkEvaluator
<a name="model-customize-evaluation-sdk-step1"></a>

Pass your registered trained model, an AWS S3 output location, and an MLflow resource ARN to `BenchMarkEvaluator`, then initialize it.

```
from sagemaker.train.evaluate import BenchMarkEvaluator, Benchmark  
  
evaluator = BenchMarkEvaluator(  
    benchmark=Benchmark.MMLU,  
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",  
    s3_output_path="s3://<bucket-name>/<prefix>/eval/",  
    mlflow_resource_arn="arn:aws:sagemaker:<region>:<account-id>:mlflow-tracking-server/<tracking-server-name>",  
    evaluate_base_model=False  
)
```

### Step 2: Submit your evaluation job
<a name="model-customize-evaluation-sdk-step2"></a>

Call the `evaluate()` method to submit your evaluation job.

```
execution = evaluator.evaluate()
```

### Step 3: Track your evaluation job progress
<a name="model-customize-evaluation-sdk-step3"></a>

Call the execution's `wait()` method to get live updates on your evaluation job's progress.

```
execution.wait(target_status="Succeeded", poll=5, timeout=3600)
```

### Step 4: View your evaluation job results
<a name="model-customize-evaluation-sdk-step4"></a>

Call the `show_results()` method to display your evaluation job results.

```
execution.show_results()
```

# Evaluation types and job submission
<a name="model-customize-evaluation-types"></a>

## Benchmarking with standardized datasets
<a name="model-customize-evaluation-benchmarking"></a>

Use the benchmark evaluation type to evaluate your model's quality across standardized benchmark datasets, including popular datasets such as MMLU and BBH.


| Benchmark | Custom dataset supported | Modality | Description | Metric | Strategy | Subtasks available | 
| --- | --- | --- | --- | --- | --- | --- | 
| mmlu | No | Text | Massive multitask language understanding – tests knowledge across 57 subjects. | correctness | zs_cot | Yes | 
| mmlu_pro | No | Text | MMLU – professional subset – focuses on professional domains such as law, medicine, accounting, and engineering. | correctness | zs_cot | No | 
| bbh | No | Text | Advanced reasoning tasks – a series of challenging problems that test higher-order cognitive and problem-solving skills. | correctness | fs_cot | Yes | 
| gpqa | No | Text | General physics question answering – evaluates understanding of physics concepts and the ability to solve related problems. | correctness | zs_cot | No | 
| math | No | Text | Mathematical problem solving – measures mathematical reasoning across topics such as algebra, calculus, and word problems. | exact_match | zs_cot | Yes | 
| strong_reject | No | Text | Quality-control task – tests the model's ability to detect and reject inappropriate, harmful, or incorrect content. | deflection | zs | Yes | 
| ifeval | No | Text | Instruction-following evaluation – measures how accurately the model follows given instructions and completes tasks to specification. | correctness | zs | No | 

For more information on BYOD formats, see [Supported dataset formats for Bring-Your-Own-Dataset (BYOD) jobs](model-customize-evaluation-dataset-formats.md).

### Available subtasks
<a name="model-customize-evaluation-benchmarking-subtasks"></a>

The following lists the available subtasks for model evaluation across multiple domains, including MMLU (massive multitask language understanding), BBH (Big-Bench Hard), StrongReject, and MATH. These subtasks let you evaluate model performance on specific capabilities and knowledge areas.

**MMLU subtasks**

```
MMLU_SUBTASKS = [
    "abstract_algebra",
    "anatomy",
    "astronomy",
    "business_ethics",
    "clinical_knowledge",
    "college_biology",
    "college_chemistry",
    "college_computer_science",
    "college_mathematics",
    "college_medicine",
    "college_physics",
    "computer_security",
    "conceptual_physics",
    "econometrics",
    "electrical_engineering",
    "elementary_mathematics",
    "formal_logic",
    "global_facts",
    "high_school_biology",
    "high_school_chemistry",
    "high_school_computer_science",
    "high_school_european_history",
    "high_school_geography",
    "high_school_government_and_politics",
    "high_school_macroeconomics",
    "high_school_mathematics",
    "high_school_microeconomics",
    "high_school_physics",
    "high_school_psychology",
    "high_school_statistics",
    "high_school_us_history",
    "high_school_world_history",
    "human_aging",
    "human_sexuality",
    "international_law",
    "jurisprudence",
    "logical_fallacies",
    "machine_learning",
    "management",
    "marketing",
    "medical_genetics",
    "miscellaneous",
    "moral_disputes",
    "moral_scenarios",
    "nutrition",
    "philosophy",
    "prehistory",
    "professional_accounting",
    "professional_law",
    "professional_medicine",
    "professional_psychology",
    "public_relations",
    "security_studies",
    "sociology",
    "us_foreign_policy",
    "virology",
    "world_religions"
]
```

**BBH subtasks**

```
BBH_SUBTASKS = [
    "boolean_expressions",
    "causal_judgement",
    "date_understanding",
    "disambiguation_qa",
    "dyck_languages",
    "formal_fallacies",
    "geometric_shapes",
    "hyperbaton",
    "logical_deduction_five_objects",
    "logical_deduction_seven_objects",
    "logical_deduction_three_objects",
    "movie_recommendation",
    "multistep_arithmetic_two",
    "navigate",
    "object_counting",
    "penguins_in_a_table",
    "reasoning_about_colored_objects",
    "ruin_names",
    "salient_translation_error_detection",
    "snarks",
    "sports_understanding",
    "temporal_sequences",
    "tracking_shuffled_objects_five_objects",
    "tracking_shuffled_objects_seven_objects",
    "tracking_shuffled_objects_three_objects",
    "web_of_lies",
    "word_sorting"
]
```

**MATH subtasks**

```
MATH_SUBTASKS = [
    "algebra", 
    "counting_and_probability", 
    "geometry",
    "intermediate_algebra", 
    "number_theory", 
    "prealgebra", 
    "precalculus"
]
```

**StrongReject subtasks**

```
STRONG_REJECT_SUBTASKS = [
    "gcg_transfer_harmbench",
    "gcg_transfer_universal_attacks",
    "combination_3",
    "combination_2",
    "few_shot_json",
    "dev_mode_v2",
    "dev_mode_with_rant",
    "wikipedia_with_title",
    "distractors",
    "wikipedia",
    "style_injection_json",
    "style_injection_short",
    "refusal_suppression",
    "prefix_injection",
    "distractors_negated",
    "poems",
    "base64",
    "base64_raw",
    "base64_input_only",
    "base64_output_only",
    "evil_confidant",
    "aim",
    "rot_13",
    "disemvowel",
    "auto_obfuscation",
    "auto_payload_splitting",
    "pair",
    "pap_authority_endorsement",
    "pap_evidence_based_persuasion",
    "pap_expert_endorsement",
    "pap_logical_appeal",
    "pap_misrepresentation"
]
```

### Submit a benchmark job
<a name="model-customize-evaluation-benchmarking-submit"></a>

------
#### [ SageMaker Studio ]

![\[Minimal configuration for benchmarking through SageMaker Studio\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/benchmark-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
from sagemaker.train.evaluate import get_benchmarks
from sagemaker.train.evaluate import BenchMarkEvaluator

Benchmark = get_benchmarks()

# Create evaluator with MMLU benchmark
evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MMLU,
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)

execution = evaluator.evaluate()
```

For more information on submitting evaluation jobs through the SageMaker Python SDK, see [https://sagemaker.readthedocs.io/en/stable/model_customization/evaluation.html](https://sagemaker.readthedocs.io/en/stable/model_customization/evaluation.html).

------

## Large language model as a judge (LLMAJ) evaluation
<a name="model-customize-evaluation-llmaj"></a>

Use LLM-as-a-judge (LLMAJ) evaluation to grade the target model's responses with another frontier model. You can start an evaluation job that uses an AWS Bedrock model as the judge by calling the `create_evaluation_job` API.

For more information on supported judge models, see [https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html](https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html).

You can define evaluations with two different metrics formats:
+ **Built-in metrics:** Use AWS Bedrock built-in metrics to analyze the quality of your model's inference responses. For more information, see [https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-type-judge-prompt.html](https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-type-judge-prompt.html).
+ **Custom metrics:** Define your own custom metrics in the Bedrock Evaluations custom metric format to analyze the quality of your model's inference responses with your own instructions. For more information, see [https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-custom-metrics-prompt-formats.html](https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-custom-metrics-prompt-formats.html).

### Submit a built-in metrics LLMAJ job
<a name="model-customize-evaluation-llmaj-builtin"></a>

------
#### [ SageMaker Studio ]

![\[Minimal configuration for LLMAJ benchmarking through SageMaker Studio\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/llmaj-as-judge-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
from sagemaker.train.evaluate import LLMAsJudgeEvaluator

evaluator = LLMAsJudgeEvaluator(
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    evaluator_model="<bedrock-judge-model-id>",
    dataset="s3://<bucket-name>/<prefix>/<dataset-file>.jsonl",
    builtin_metrics=["<builtin-metric-1>", "<builtin-metric-2>"],
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)

execution = evaluator.evaluate()
```

For more information on submitting evaluation jobs through the SageMaker Python SDK, see [https://sagemaker.readthedocs.io/en/stable/model_customization/evaluation.html](https://sagemaker.readthedocs.io/en/stable/model_customization/evaluation.html).

------

### Submit a custom metrics LLMAJ job
<a name="model-customize-evaluation-llmaj-custom"></a>

Define your custom metric(s):

```
{
    "customMetricDefinition": {
        "name": "PositiveSentiment",
        "instructions": (
            "You are an expert evaluator. Your task is to assess if the sentiment of the response is positive. "
            "Rate the response based on whether it conveys positive sentiment, helpfulness, and constructive tone.\n\n"
            "Consider the following:\n"
            "- Does the response have a positive, encouraging tone?\n"
            "- Is the response helpful and constructive?\n"
            "- Does it avoid negative language or criticism?\n\n"
            "Rate on this scale:\n"
            "- Good: Response has positive sentiment\n"
            "- Poor: Response lacks positive sentiment\n\n"
            "Here is the actual task:\n"
            "Prompt: {{prompt}}\n"
            "Response: {{prediction}}"
        ),
        "ratingScale": [
            {"definition": "Good", "value": {"floatValue": 1}},
            {"definition": "Poor", "value": {"floatValue": 0}}
        ]
    }
}
```

For more information, see [https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-custom-metrics-prompt-formats.html](https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-custom-metrics-prompt-formats.html).

------
#### [ SageMaker Studio ]

![\[Upload custom metrics through Custom metrics > Add custom metrics\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/custom-llmaj-metrics-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
evaluator = LLMAsJudgeEvaluator(
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    evaluator_model="<bedrock-judge-model-id>",
    dataset="s3://<bucket-name>/<prefix>/<dataset-file>.jsonl",
    custom_metrics={
        "customMetricDefinition": {
            "name": "PositiveSentiment",
            "instructions": (
                "You are an expert evaluator. Your task is to assess if the sentiment of the response is positive. "
                "Rate the response based on whether it conveys positive sentiment, helpfulness, and constructive tone.\n\n"
                "Consider the following:\n"
                "- Does the response have a positive, encouraging tone?\n"
                "- Is the response helpful and constructive?\n"
                "- Does it avoid negative language or criticism?\n\n"
                "Rate on this scale:\n"
                "- Good: Response has positive sentiment\n"
                "- Poor: Response lacks positive sentiment\n\n"
                "Here is the actual task:\n"
                "Prompt: {{prompt}}\n"
                "Response: {{prediction}}"
            ),
            "ratingScale": [
                {"definition": "Good", "value": {"floatValue": 1}},
                {"definition": "Poor", "value": {"floatValue": 0}}
            ]
        }
    },
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)
```

------

## Custom scorers
<a name="model-customize-evaluation-custom-scorers"></a>

Define your own custom scorer function to start an evaluation job. Two built-in scorers are provided: Prime Math and Prime Code. You can also use your own scorer function: paste your scorer function code directly, or bring your own Lambda function definition by using its associated ARN. By default, both scorer types produce evaluation results that include standard metrics such as F1 score, ROUGE, and BLEU.

For more information on the built-in and custom scorers and their respective requirements/contracts, see [Evaluation with preset and custom scorers](model-customize-evaluation-preset-custom-scorers.md).

### Register your dataset
<a name="model-customize-evaluation-custom-scorers-register-dataset"></a>

To use your own dataset with a custom scorer, register it as a SageMaker hub content dataset.

------
#### [ SageMaker Studio ]

In Studio, upload your dataset using the dedicated Datasets page.

![\[A registered evaluation dataset in SageMaker Studio\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/dataset-registration-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

In the SageMaker Python SDK, register your dataset as follows.

```
from sagemaker.ai_registry.dataset import DataSet

dataset = DataSet.create(
    name="your-bring-your-own-dataset",
    source="s3://<bucket-name>/<prefix>/<dataset-file>.jsonl"
)
dataset.refresh()
```

------

### Submit a built-in scorer job
<a name="model-customize-evaluation-custom-scorers-builtin"></a>

------
#### [ SageMaker Studio ]

![\[Select a built-in custom scorer: code execution or math answers\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/builtin-scorer-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
from sagemaker.train.evaluate import CustomScorerEvaluator
from sagemaker.train.evaluate import get_builtin_metrics

BuiltInMetric = get_builtin_metrics()

evaluator_builtin = CustomScorerEvaluator(
    evaluator=BuiltInMetric.PRIME_MATH,
    dataset="arn:aws:sagemaker:<region>:<account-id>:hub-content/<hub-content-id>/DataSet/your-bring-your-own-dataset/<version>",
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)

execution = evaluator_builtin.evaluate()
```

Select a built-in scorer from `BuiltInMetric.PRIME_MATH` or `BuiltInMetric.PRIME_CODE`.

------

### Submit a custom scorer job
<a name="model-customize-evaluation-custom-scorers-custom"></a>

Define your custom reward function. For more information, see [Custom scorers (bring your own metrics)](model-customize-evaluation-preset-custom-scorers.md#model-customize-evaluation-custom-scorers-byom).

**Register your custom reward function**

------
#### [ SageMaker Studio ]

![\[Navigate to SageMaker Studio > Assets > Evaluators > Create evaluator > Create reward function\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/custom-scorer-submission-sagemaker-studio.png)


![\[Submit a custom scorer evaluation job that references a registered reward function under Custom scorer > Custom metrics\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/custom-scorer-benchmark-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
from sagemaker.ai_registry.evaluator import Evaluator
from sagemaker.ai_registry.air_constants import REWARD_FUNCTION

evaluator = Evaluator.create(
    name = "your-reward-function-name",
    source="/path_to_local/custom_lambda_function.py",
    type = REWARD_FUNCTION
)
```

```
evaluator = CustomScorerEvaluator(
    evaluator=evaluator,
    dataset="s3://<bucket-name>/<prefix>/<dataset-file>.jsonl",
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)

execution = evaluator.evaluate()
```

------

# Evaluation metrics formats
<a name="model-customize-evaluation-metrics-formats"></a>

Evaluate your model's quality across these metrics formats:
+ Model evaluation summary
+ MLflow
+ TensorBoard

## Model evaluation summary
<a name="model-customize-evaluation-metrics-summary"></a>

When you submit an evaluation job, you specify an AWS S3 output location. SageMaker automatically uploads an evaluation summary .json file to that location. The benchmark summary S3 path is as follows:

```
s3://<your-provide-s3-location>/<training-job-name>/output/output/<evaluation-job-name>/eval_results/
```

**Pass the AWS S3 location**

------
#### [ SageMaker Studio ]

![\[Pass in the output artifact location (AWS S3 URI)\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/s3-output-path-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MMLU,
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/eval/",
    evaluate_base_model=False
)

execution = evaluator.evaluate()
```

------

Read the `.json` file directly from the AWS S3 location, or view it automatically visualized in the UI:

```
{
  "results": {
    "custom|gen_qa_gen_qa|0": {
      "rouge1": 0.9152812653966208,
      "rouge1_stderr": 0.003536439199232507,
      "rouge2": 0.774569918517409,
      "rouge2_stderr": 0.006368825746765958,
      "rougeL": 0.9111255645823356,
      "rougeL_stderr": 0.003603841524881021,
      "em": 0.6562150055991042,
      "em_stderr": 0.007948251702846893,
      "qem": 0.7522396416573348,
      "qem_stderr": 0.007224355240883467,
      "f1": 0.8428757602152095,
      "f1_stderr": 0.005186300690881584,
      "f1_score_quasi": 0.9156170336744968,
      "f1_score_quasi_stderr": 0.003667700152375464,
      "bleu": 100.00000000000004,
      "bleu_stderr": 1.464411857851008
    },
    "all": {
      "rouge1": 0.9152812653966208,
      "rouge1_stderr": 0.003536439199232507,
      "rouge2": 0.774569918517409,
      "rouge2_stderr": 0.006368825746765958,
      "rougeL": 0.9111255645823356,
      "rougeL_stderr": 0.003603841524881021,
      "em": 0.6562150055991042,
      "em_stderr": 0.007948251702846893,
      "qem": 0.7522396416573348,
      "qem_stderr": 0.007224355240883467,
      "f1": 0.8428757602152095,
      "f1_stderr": 0.005186300690881584,
      "f1_score_quasi": 0.9156170336744968,
      "f1_score_quasi_stderr": 0.003667700152375464,
      "bleu": 100.00000000000004,
      "bleu_stderr": 1.464411857851008
    }
  }
}
```
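After downloading the summary file from the S3 path above (for example with the AWS CLI or boto3), the metrics can be read with ordinary JSON tooling. A minimal sketch, using an abbreviated, illustrative copy of the summary shown above:

```python
import json

# The evaluation summary .json file has the shape shown above; this sketch
# parses a trimmed-down copy of it (the metric values here are illustrative).
summary_text = """
{
  "results": {
    "custom|gen_qa_gen_qa|0": {"rouge1": 0.915, "f1": 0.843, "bleu": 100.0},
    "all": {"rouge1": 0.915, "f1": 0.843, "bleu": 100.0}
  }
}
"""

summary = json.loads(summary_text)

# The "all" entry aggregates metrics across tasks; the other keys are per-task.
aggregate = summary["results"]["all"]
print(f"f1={aggregate['f1']}, rouge1={aggregate['rouge1']}")
```

The same per-metric keys (`rouge1`, `f1`, `bleu`, and their `*_stderr` companions) appear under both the per-task entries and `"all"`, so one parsing path handles both.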

![\[Example custom gen-qa benchmark performance metrics visualized in SageMaker Studio\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/gen-qa-metrics-visualization-sagemaker-studio.png)


## MLflow logging
<a name="model-customize-evaluation-metrics-mlflow"></a>

**Provide your SageMaker MLflow resource ARN**

SageMaker Studio uses the default MLflow app provisioned on each Studio domain, and uses that default MLflow app's associated ARN in evaluation job submissions.

You can also submit an evaluation job and explicitly provide an MLflow resource ARN to stream metrics to the associated tracking server/app for real-time analysis.

**SageMaker Python SDK**

```
evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MMLU,
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/eval/",
    mlflow_resource_arn="arn:aws:sagemaker:<region>:<account-id>:mlflow-tracking-server/<tracking-server-name>",
    evaluate_base_model=False
)

execution = evaluator.evaluate()
```

Model-level and system-level metrics visualizations:

![\[Example model-level error and accuracy for an MMLU benchmark job\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/model-metrics-mlflow.png)


![\[Example built-in metrics for an LLMAJ benchmark job\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/llmaj-metrics-mlflow.png)


![\[Example system-level metrics for an MMLU benchmark job\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/system-metrics-mlflow.png)


## TensorBoard
<a name="model-customize-evaluation-metrics-tensorboard"></a>

Submit an evaluation job with an AWS S3 output location. SageMaker automatically uploads the TensorBoard files to that location.

SageMaker uploads the TensorBoard files to the following AWS S3 location:

```
s3://<your-provide-s3-location>/<training-job-name>/output/output/<evaluation-job-name>/tensorboard_results/eval/
```

**Pass the AWS S3 location as follows**

------
#### [ SageMaker Studio ]

![\[Pass in the output artifact location (AWS S3 URI)\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/s3-output-path-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MMLU,
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/eval/",
    evaluate_base_model=False
)

execution = evaluator.evaluate()
```

------

**Example model-level metrics**

![\[SageMaker TensorBoard displaying benchmark job results\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/metrics-in-tensorboard.png)


# Supported dataset formats for Bring-Your-Own-Dataset (BYOD) jobs
<a name="model-customize-evaluation-dataset-formats"></a>

The custom scorer and LLM-as-a-judge evaluation types require a custom dataset JSONL file located in AWS S3. You must provide the file as a JSON Lines file that adheres to one of the supported formats below. The examples in this document are expanded for clarity.

Each format has its own nuances, but all require at least a user prompt.


**Required fields**  

| Field | Required | 
| --- | --- | 
| User prompt | Yes | 
| System prompt | No | 
| Ground truth | Custom scorers only | 
| Category | No | 

**1. OpenAI format**

```
{
    "messages": [
        {
            "role": "system",    # System prompt (looks for system role)
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",       # Query (looks for user role)
            "content": "Hello!"
        },
        {
            "role": "assistant",  # Ground truth (looks for assistant role)
            "content": "Hello to you!"
        }
    ]
}
```
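A dataset in this format is a JSON Lines file: one compact JSON object per line, with no surrounding array. As an illustrative sketch (the records below are hypothetical), such a file's contents can be produced and round-tripped with the standard `json` module:

```python
import json

# Hypothetical records in the OpenAI "messages" format described above.
records = [
    {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello!"},
            {"role": "assistant", "content": "Hello to you!"},  # ground truth
        ]
    },
    {
        "messages": [
            {"role": "user", "content": "What symbol ends a question?"},
            {"role": "assistant", "content": "?"},
        ]
    },
]

# JSON Lines: one compact JSON object per line, no surrounding array.
jsonl = "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

# Each line parses back to exactly one record.
parsed = [json.loads(line) for line in jsonl.splitlines()]
assert parsed == records
```

Write the resulting string to a `.jsonl` file and upload it to S3 to use it in an evaluation job.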

**2. SageMaker evaluation**

```
{
   "system":"You are an English major with top marks in class who likes to give minimal word responses: ",
   "query":"What is the symbol that ends the sentence as a question",
   "response":"?", # Ground truth
   "category": "Grammar"
}
```

**3. HuggingFace prompt-completion**

Both the standard and conversational formats are supported.

```
# Standard

{
    "prompt" : "What is the symbol that ends the sentence as a question", # Query
    "completion" : "?" # Ground truth
}

# Conversational
{
    "prompt": [
        {
            "role": "user", # Query (looks for user role)
            "content": "What is the symbol that ends the sentence as a question"
        }
    ],
    "completion": [
        {
            "role": "assistant", # Ground truth (looks for assistant role)
            "content": "?"
        }
    ]
}
```

**4. HuggingFace preference**

Both the standard format (strings) and the conversational format (arrays of messages) are supported.

```
# Standard: {"prompt": "text", "chosen": "text", "rejected": "text"}
{
     "prompt" : "The sky is", # Query
     "chosen" : "blue", # Ground truth
     "rejected" : "green"
}

# Conversational:
{
    "prompt": [
        {
            "role": "user", # Query (looks for user role)
            "content": "What color is the sky?"
        }
    ],
    "chosen": [
        {
            "role": "assistant", # Ground truth (looks for assistant role)
            "content": "It is blue."
        }
    ],
    "rejected": [
        {
            "role": "assistant",
            "content": "It is green."
        }
    ]
}
```

**5. Verl format**

The Verl format (both the current and legacy formats) is supported for reinforcement learning use cases. Verl documentation for reference: [https://verl.readthedocs.io/en/latest/preparation/prepare_data.html](https://verl.readthedocs.io/en/latest/preparation/prepare_data.html)

Users of the VERL format typically do not provide ground truth responses. If you want to provide one, use either the `extra_info.answer` field or the `reward_model.ground_truth` field; `extra_info` takes precedence.

SageMaker preserves the following VERL-specific fields as metadata if present:
+ `id`
+ `data_source`
+ `ability`
+ `reward_model`
+ `extra_info`
+ `attributes`
+ `difficulty`

```
# Newest VERL format where `prompt` is an array of messages.
{
  "data_source": "openai/gsm8k",
  "prompt": [
    {
      "content": "You are a helpful math tutor who explains solutions to questions step-by-step.",
      "role": "system"
    },
    {
      "content": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? Let's think step by step and output the final answer after \"####\".",
      "role": "user"
    }
  ],
  "ability": "math",
  "extra_info": {
    "answer": "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72",
    "index": 0,
    "question": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?",
    "split": "train"
  },
  "reward_model": {
    "ground_truth": "72" # Ignored in favor of extra_info.answer
  }
}

# Legacy VERL format where `prompt` is a string. Also supported.
{
  "data_source": "openai/gsm8k",
  "prompt": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? Let's think step by step and output the final answer after \"####\".",
  "extra_info": {
    "answer": "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72",
    "index": 0,
    "question": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?",
    "split": "train"
  }
}
```
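The ground-truth precedence rule described above can be sketched as a small helper (illustrative only; `resolve_ground_truth` is not part of any SageMaker API):

```python
def resolve_ground_truth(record: dict):
    """Return the ground truth for a VERL record, or None if absent.

    Sketch of the precedence rule: extra_info.answer wins over
    reward_model.ground_truth when both are present.
    """
    answer = record.get("extra_info", {}).get("answer")
    if answer is not None:
        return answer
    return record.get("reward_model", {}).get("ground_truth")

# extra_info.answer takes precedence when both fields are present.
both = {
    "extra_info": {"answer": "#### 72"},
    "reward_model": {"ground_truth": "72"},
}
assert resolve_ground_truth(both) == "#### 72"

# Falls back to reward_model.ground_truth when extra_info.answer is absent.
only_rm = {"reward_model": {"ground_truth": "72"}}
assert resolve_ground_truth(only_rm) == "72"

# No ground truth at all is also valid for VERL datasets.
assert resolve_ground_truth({"prompt": "2+2?"}) is None
```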

# Evaluation with preset and custom scorers
<a name="model-customize-evaluation-preset-custom-scorers"></a>

When you use the custom scorer evaluation type, SageMaker Evaluation supports two built-in scorers (also known as "reward functions"), Prime Math and Prime Code, sourced from the [volcengine/verl](https://github.com/volcengine/verl) RL training library, or your own custom scorer implemented as a Lambda function.

## Built-in scorers
<a name="model-customize-evaluation-builtin-scorers"></a>

**Prime Math**

The Prime Math scorer expects entries in your custom JSONL dataset that contain a math problem as the prompt/query and the correct answer as the ground truth. The dataset can be in any of the supported formats mentioned in [Supported dataset formats for Bring-Your-Own-Dataset (BYOD) jobs](model-customize-evaluation-dataset-formats.md).

Example dataset entry (expanded for clarity):

```
{
    "system":"You are a math expert: ",
    "query":"How many vertical asymptotes does the graph of $y=\\frac{2}{x^2+x-6}$ have?",
    "response":"2" # Ground truth aka correct answer
}
```

**Prime Code**

The Prime Code scorer expects entries in your custom JSONL dataset that contain a coding problem and test cases specified in the `metadata` field. Construct the test cases with the expected function name, sample inputs, and expected outputs for each entry.

Example dataset entry (expanded for clarity):

```
{
    "system":"\\nWhen tackling complex reasoning tasks, you have access to the following actions. Use them as needed to progress through your thought process.\\n\\n[ASSESS]\\n\\n[ADVANCE]\\n\\n[VERIFY]\\n\\n[SIMPLIFY]\\n\\n[SYNTHESIZE]\\n\\n[PIVOT]\\n\\n[OUTPUT]\\n\\nYou should strictly follow the format below:\\n\\n[ACTION NAME]\\n\\n# Your action step 1\\n\\n# Your action step 2\\n\\n# Your action step 3\\n\\n...\\n\\nNext action: [NEXT ACTION NAME]\\n\\n",
    "query":"A number N is called a factorial number if it is the factorial of a positive integer. For example, the first few factorial numbers are 1, 2, 6, 24, 120,\\nGiven a number N, the task is to return the list/vector of the factorial numbers smaller than or equal to N.\\nExample 1:\\nInput: N = 3\\nOutput: 1 2\\nExplanation: The first factorial number is \\n1 which is less than equal to N. The second \\nnumber is 2 which is less than equal to N,\\nbut the third factorial number is 6 which \\nis greater than N. So we print only 1 and 2.\\nExample 2:\\nInput: N = 6\\nOutput: 1 2 6\\nExplanation: The first three factorial \\nnumbers are less than equal to N but \\nthe fourth factorial number 24 is \\ngreater than N. So we print only first \\nthree factorial numbers.\\nYour Task:  \\nYou don't need to read input or print anything. Your task is to complete the function factorialNumbers() which takes an integer N as an input parameter and return the list/vector of the factorial numbers smaller than or equal to N.\\nExpected Time Complexity: O(K), Where K is the number of factorial numbers.\\nExpected Auxiliary Space: O(1)\\nConstraints:\\n1<=N<=10^{18}\\n\\nWrite Python code to solve the problem. Present the code in \\n```python\\nYour code\\n```\\nat the end.",
    "response": "", # Dummy string for ground truth. Provide a value if you want NLP metrics like ROUGE, BLEU, and F1.
    ### Define test cases in metadata field
    "metadata": {
        "fn_name": "factorialNumbers",
        "inputs": ["5"],
        "outputs": ["[1, 2]"]
    }
}
```
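The `metadata` contract above can be illustrated locally: a scorer calls the named function with each input and compares the result to the expected output. This is a sketch of the contract, not the actual Prime Code implementation; the candidate solution and helper names below are hypothetical.

```python
from ast import literal_eval

# Test cases as carried in the dataset entry's metadata field.
metadata = {
    "fn_name": "factorialNumbers",
    "inputs": ["5"],
    "outputs": ["[1, 2]"],
}

# A candidate solution, as a model might produce it.
def factorialNumbers(n: int) -> list:
    result, fact, i = [], 1, 1
    while fact <= n:
        result.append(fact)
        i += 1
        fact *= i
    return result

def passes_test_cases(fn, metadata: dict) -> bool:
    """Return True if fn produces the expected output for every test case."""
    return all(
        fn(literal_eval(raw_in)) == literal_eval(raw_out)
        for raw_in, raw_out in zip(metadata["inputs"], metadata["outputs"])
    )

assert passes_test_cases(factorialNumbers, metadata)
```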

## Custom scorers (bring your own metrics)
<a name="model-customize-evaluation-custom-scorers-byom"></a>

Fully customize the model evaluation workflow with your own post-processing logic, letting you compute custom metrics according to your requirements. You must implement your custom scorer as an AWS Lambda function that accepts model responses and returns reward scores.

### Example Lambda input payload
<a name="model-customize-evaluation-custom-scorers-lambda-input"></a>

Your custom AWS Lambda function should expect input in OpenAI format. Example:

```
{
    "id": "123",
    "messages": [
        {
            "role": "user",
            "content": "Do you have a dedicated security team?"
        },
        {
            "role": "assistant",
            "content": "As an AI developed by Amazon, I do not have a dedicated security team..."
        }
    ],
    "reference_answer": {
        "compliant": "No",
        "explanation": "As an AI developed by Company, I do not have a traditional security team..."
    }
}
```

### Lambda 輸出承載範例
<a name="model-customize-evaluation-custom-scorers-lambda-output"></a>

SageMaker 評估容器預期您的 Lambda 回應遵循以下格式：

```
{
    "id": str,                              # Same id as input sample
    "aggregate_reward_score": float,        # Overall score for the sample
    "metrics_list": [                       # OPTIONAL: Component scores
        {
            "name": str,                    # Name of the component score
            "value": float,                 # Value of the component score
            "type": str                     # "Reward" or "Metric"
        }
    ]
}
```

### 自訂 Lambda 定義
<a name="model-customize-evaluation-custom-scorers-lambda-definition"></a>

在 https：//[https://docs.aws.amazon.com/sagemaker/latest/dg/nova-implementing-reward-functions.html\$1nova-reward-llm-judge-example](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-implementing-reward-functions.html#nova-reward-llm-judge-example) 尋找具有範例輸入和預期輸出的完全實作自訂評分器範例

使用下列骨架做為您自己的函數的起點。

```
def lambda_handler(event, context):
    return lambda_grader(event)

def lambda_grader(samples: list[dict]) -> list[dict]:
    """
    Args:
        Samples: List of dictionaries in OpenAI format
            
        Example input:
        {
            "id": "123",
            "messages": [
                {
                    "role": "user",
                    "content": "Do you have a dedicated security team?"
                },
                {
                    "role": "assistant",
                    "content": "As an AI developed by Company, I do not have a dedicated security team..."
                }
            ],
            # This section is the same as your training dataset
            "reference_answer": {
                "compliant": "No",
                "explanation": "As an AI developed by Company, I do not have a traditional security team..."
            }
        }
        
    Returns:
        List of dictionaries with reward scores:
        {
            "id": str,                              # Same id as input sample
            "aggregate_reward_score": float,        # Overall score for the sample
            "metrics_list": [                       # OPTIONAL: Component scores
                {
                    "name": str,                    # Name of the component score
                    "value": float,                 # Value of the component score
                    "type": str                     # "Reward" or "Metric"
                }
            ]
        }
    """
```

### 輸入和輸出欄位
<a name="model-customize-evaluation-custom-scorers-fields"></a>

**輸入欄位**


| 欄位 | Description | 其他備註 | 
| --- | --- | --- | 
| id | 範例的唯一識別符 | 在輸出中回呼。字串格式 | 
| messages | 以 OpenAI 格式排序的聊天歷史記錄 | 訊息物件陣列 | 
| messages【】.role | 訊息的發言者 | 常見值："user"、"assistant"、"system" | 
| messages【】.content | 訊息的文字內容 | 純文字的字串 | 
| 中繼資料 | 協助分級的自由格式資訊 | 物件；從訓練資料傳遞的選用欄位 | 

**輸出欄位**


**輸出欄位**  

| 欄位 | Description | 其他備註 | 
| --- | --- | --- | 
| id | 與輸入範例相同的識別符 | 必須符合輸入 | 
| aggregate\$1reward\$1score | 範例的整體分數 | 浮動 （例如 0.0–1.0 或任務定義範圍） | 
| metrics\$1list | 組成彙總的元件分數 | 指標物件陣列 | 

### 所需的許可
<a name="model-customize-evaluation-custom-scorers-permissions"></a>

確定您用來執行評估的 SageMaker 執行角色具有 AWS Lambda 許可。

```
{
    "Version": "2012-10-17",		 	 	 
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "lambda:InvokeFunction"
            ],
            "Resource": "arn:aws:lambda:region:account-id:function:function-name"
        }
    ]
}
```

確保您的 AWS Lambda 函數的執行角色具有基本的 Lambda 執行許可，以及任何下游 AWS 呼叫可能需要的其他許可。

```
{
  "Version": "2012-10-17",		 	 	 
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*"
    }
  ]
}
```

# 模型部署
<a name="model-customize-open-weight-deployment"></a>

從自訂模型詳細資訊頁面，您也可以使用 SageMaker AI 推論端點或 Amazon Bedrock 部署自訂模型。

![\[包含模型自訂技術選擇的影像。\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/screenshot-open-model-1.png)


# 範例資料集和評估器
<a name="model-customize-open-weight-samples"></a>

## 受管微調 (SFT)
<a name="model-customize-open-weight-samples-sft"></a>
+ 名稱：TAT-QA
+ 授權：CC-BY-4.0
+ 連結：https：//[https://huggingface.co/datasets/next-tat/TAT-QA](https://huggingface.co/datasets/next-tat/TAT-QA)
+ 預先處理 - 格式化

**一個範例 **

```
{
    "prompt": "Given a table and relevant text descriptions, answer the following question.\n\nTable:\n<table border=\"1\" class=\"dataframe\">\n  <tbody>\n    <tr>\n      <td></td>\n      <td>2019</td>\n      <td>2018</td>\n    </tr>\n    <tr>\n      <td></td>\n      <td>$'000</td>\n      <td>$'000</td>\n    </tr>\n    <tr>\n      <td>Revenue from external customers</td>\n      <td></td>\n      <td></td>\n    </tr>\n    <tr>\n      <td>Australia</td>\n      <td>144,621</td>\n      <td>129,431</td>\n    </tr>\n    <tr>\n      <td>New Zealand</td>\n      <td>13,036</td>\n      <td>8,912</td>\n    </tr>\n    <tr>\n      <td>Total</td>\n      <td>157,657</td>\n      <td>138,343</td>\n    </tr>\n  </tbody>\n</table>\n\nParagraphs:\n    4. SEGMENT INFORMATION\n\n    During the 2019 and 2018 financial years, the Group operated wholly within one business segment being the operation and management of storage centres in Australia and New Zealand.\n\n    The Managing Director is the Group\u2019s chief operating decision maker and monitors the operating results on a portfolio wide basis. Monthly management reports are evaluated based upon the overall performance of NSR consistent with the presentation within the consolidated financial statements. The Group\u2019s financing (including finance costs and finance income) are managed on a Group basis and not allocated to operating segments.\n\n    The operating results presented in the statement of profit or loss represent the same segment information as reported in internal management information.\n\n    The revenue information above excludes interest income and is based on the location of storage centres.\n\nQuestion: How many business segments were present in 2019 and 2018?\n\nAnswer:",
    "completion": "one",
    "data_idx": "2951"
}
```

## 直接偏好最佳化 (DPO)
<a name="model-customize-open-weight-samples-dpo"></a>
+ 名稱：Ultrafeedback
+ 授權：MIT
+ 連結：https：//[https://huggingface.co/datasets/openbmb/UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback)
+ 預先處理步驟：格式化和子取樣。

**一個範例**

```
{
    "source": "evol_instruct",
    "prompt": "Can you write a C++ program that prompts the user to enter the name of a country and checks if it borders the Mediterranean Sea? Here's some starter code to help you out:\n#include <iostream>\n#include <string>\nusing namespace std;\nint main() {\n    string country;\n    // prompt user for input\n    cout << \"Enter the name of a country: \";\n    cin >> country;\n    // check if country borders the Mediterranean Sea\n    // [C++ code]\n    return 0;\n}",
    "chosen": "Here's a C++ program that prompts the user to enter the name of a country and checks if it borders the Mediterranean Sea:\n\n#include <iostream>\n#include <string>\n#include <set>\n#include <map>\n#include <algorithm>\n\nusing namespace std;\n\nint main() {\n    // store countries and their bordering seas in a map\n    map<string, set<string>> countries;\n    countries[\"Algeria\"] = {\"Mediterranean Sea\", \"North African Coast\"};\n    countries[\"France\"] = {\"Mediterranean Sea\", \"English Channel\"};\n    countries[\"Italy\"] = {\"Mediterranean Sea\", \"Adriatic Sea\"};\n    countries[\"Malta\"] = {\"Mediterranean Sea\", \"North African Coast\"};\n    countries[\"Spain\"] = {\"Mediterranean Sea\", \"Atlantic Ocean\"};\n    countries[\"Tunisia\"] = {\"Mediterranean Sea\", \"North African Coast\"};\n\n    string country;\n    cout << \"Enter the name of a country: \";\n    cin >> country;\n\n    // check if country is in the map\n    if (countries.find(country)!= countries.end()) {\n        // check if country borders the Mediterranean Sea\n        set<string> borders = countries.at(country);\n        if (borders.find(\"Mediterranean Sea\")!= borders.end()) {\n            cout << country << \" borders the Mediterranean Sea.\" << endl;\n        } else {\n            cout << country << \" does not border the Mediterranean Sea.\" << endl;\n        }\n    } else {\n        cout << country << \" is not a recognized country.\" << endl;\n    }\n\n    return 0;\n}\n\nIn this program, we first define a map called countries that stores each country's name as a key and a set of its surrounding seas as the value. We use a set to store each sea name so that we can easily check if a country borders a certain sea.\n\nWe then present the user with a prompt to enter the name of a country. We use a while loop to ensure that the user enters a valid country name. 
We check if the country is in the map, and if so, we retrieve its bordering seas from the map.\n\nFinally, we check if the Mediterranean Sea is among the set of borders. If it is, we inform the user that the country borders the Mediterranean Sea. If not, we inform the user that the country does not border the Mediterranean Sea.",
    "chosen-rating": 5.0,
    "chosen-model": "starchat",
    "rejected": "Sure, here is the program using the C++11 algorithm \"cds::algorithm::GreaterEqual\":\n#include <iostream>\n#include <string>\n#include <algorithm>\n#include <vector>\n#include <cctype>\n\nusing namespace std;\n\nint main() {\n    string country;\n    cout << \"Enter the name of a country: \";\n    cin >> country;\n    std::vector<string> vec;\n    vec.push_back(country);\n    size_t index = std::find_if(vec.begin(), vec.end(), [](const string& s) {\n        return std::any_of(s.begin(), s.end(), [](const char& c) {\n            return c == '}}';\n    });\n    if (index != vec.end()) {\n        if ((*index)[0] == 'M') {\n            cout << \"Country is bordered by the Mediterranean Sea.\" << endl;\n        } else {\n            cout << \"Country does not border the Mediterranean Sea.\" << endl;\n        }\n    } else {\n        cout << \"Country is not found.\" << endl;\n    }\n    return 0;\n}",
    "rejected-rating": 1.25,
    "rejected-model": "pythia-12b"
}
```

## 從 AI 意見回饋 (RLAIF) 強化學習 - 成對判斷
<a name="model-customize-open-weight-samples-rlaif"></a>

**輸入資料集**

來源資料集：https：//[https://github.com/WeOpenML/PandaLM](https://github.com/WeOpenML/PandaLM)

**一個範例**

```
{
    "data_source": "WeOpenML/PandaLM",
    "prompt": [
        {
            "role": "user",
            "content": "Below are two responses for a given task. The task is defined by the Instruction with an Input that provides further context. Evaluate the responses and generate a reference answer for the task.\n\n
            ### Instruction:\nCompare the given products.\n\n### Input:\niPhone 11 and Google Pixel 4\n\n
            ### Response 1:\nThe iPhone 11 has a larger screen size and a longer battery life than the Google Pixel 4.\n\n
            ### Response 2:\nThe iPhone 11 and Google Pixel 4 are both flagship smartphones released in 2018. The iPhone 11 has a 6.1-inch LCD display, while the Google Pixel 4 has a 5.7-inch OLED display. The iPhone 11 has an A13 Bionic chipset, while the Google Pixel 4 has a Qualcomm Snapdragon 845 chipset. The iPhone 11 has a dual-camera system, while the Google Pixel 4 has a single camera system. The iPhone 11 has a longer battery life, while the Google Pixel 4 has a faster processor.\n\n### Evaluation:\n"
        }
    ],
    "ability": "pairwise-judging",
    "reward_model": {
        "style": "llmj",
        "ground_truth": "2\n\n### Reason: Response 2 provides a more detailed and comprehensive comparison of the two products, including their specifications and features. Response 1 only mentions two aspects of the products and does not provide as much information.\n\n### Reference: The iPhone 11 and Google Pixel 4 are both flagship smartphones released in 2018. The iPhone 11 has a 6.1-inch LCD display, while the Google Pixel 4 has a 5.7-inch OLED display. The iPhone 11 has an A13 Bionic chipset, while the Google Pixel 4 has a Qualcomm Snapdragon 845 chipset. The iPhone 11 has a dual-camera system, while the Google Pixel 4 has a single camera system. The iPhone 11 has a longer battery life, while the Google Pixel 4 has a faster processor."
    },
    "extra_info": {
        "split": "train",
        "index": 0,
        "raw_output_sequence": "2\n\n### Reason: Response 2 provides a more detailed and comprehensive comparison of the two products, including their specifications and features. Response 1 only mentions two aspects of the products and does not provide as much information.\n\n### Reference: The iPhone 11 and Google Pixel 4 are both flagship smartphones released in 2018. The iPhone 11 has a 6.1-inch LCD display, while the Google Pixel 4 has a 5.7-inch OLED display. The iPhone 11 has an A13 Bionic chipset, while the Google Pixel 4 has a Qualcomm Snapdragon 845 chipset. The iPhone 11 has a dual-camera system, while the Google Pixel 4 has a single camera system. The iPhone 11 has a longer battery life, while the Google Pixel 4 has a faster processor.\n",
        "llmj": {
            "question": "Below are two responses for a given task. The task is defined by the Instruction with an Input that provides further context. Evaluate the responses and generate a reference answer for the task.\n\n### Instruction:\nCompare the given products.\n\n### Input:\niPhone 11 and Google Pixel 4\n\n### Response 1:\nThe iPhone 11 has a larger screen size and a longer battery life than the Google Pixel 4.\n\n### Response 2:\nThe iPhone 11 and Google Pixel 4 are both flagship smartphones released in 2018. The iPhone 11 has a 6.1-inch LCD display, while the Google Pixel 4 has a 5.7-inch OLED display. The iPhone 11 has an A13 Bionic chipset, while the Google Pixel 4 has a Qualcomm Snapdragon 845 chipset. The iPhone 11 has a dual-camera system, while the Google Pixel 4 has a single camera system. The iPhone 11 has a longer battery life, while the Google Pixel 4 has a faster processor.\n\n### Evaluation:\n",
            "ground_truth": "2\n\n### Reason: Response 2 provides a more detailed and comprehensive comparison of the two products, including their specifications and features. Response 1 only mentions two aspects of the products and does not provide as much information.\n\n### Reference: The iPhone 11 and Google Pixel 4 are both flagship smartphones released in 2018. The iPhone 11 has a 6.1-inch LCD display, while the Google Pixel 4 has a 5.7-inch OLED display. The iPhone 11 has an A13 Bionic chipset, while the Google Pixel 4 has a Qualcomm Snapdragon 845 chipset. The iPhone 11 has a dual-camera system, while the Google Pixel 4 has a single camera system. The iPhone 11 has a longer battery life, while the Google Pixel 4 has a faster processor.",
            "document_in_context": null
        },
        "sample_size": 1980
    }
}
```

## RLAIF - 思維鏈
<a name="model-customize-open-weight-samples-rlaif2"></a>

**輸入資料集 **

來源資料：https：//[https://huggingface.co/datasets/thesven/gsm8k-reasoning/tree/main/data](https://huggingface.co/datasets/thesven/gsm8k-reasoning/tree/main/data)

**一個範例**

```
{
    "data_source": "openai/gsm8k",
    "prompt": [
        {
            "role": "system",
            "content": "\nYou are an AI assistant that uses a Chain of Thought (CoT) approach with reflection to answer queries.\n\nFollow these steps:\n- Think through the problem step by step within the ‹thinking> tags.\n- Reflect on your thinking\n\nto check for any errors or improvements within the ‹reflection› tags.\n- Make any necessary adjustments based on your reflection.\n- Provide your final, concise answer within the ‹output> tags.\n\nImportant: The <thinking> and ‹reflection› sections are for your internal reasoning process only.\nDo not include any part of the final answer in these sections.\nThe actual response to the query must be entirely contained within the ‹output› tags.\n\nUse the following format for your response:\n<thinking>\n[Your initial thought process goes here]\n</thinking›\n\n<reasoning>\n[Your step-by-step reasoning goes here. This is your internal thought process, not the final answer. You can create as many reasoning steps as necessary in your process.]\n</reasoning>\n\n‹reflection>\n[Your reflection on your reasoning, checking for errors or improvements. You can create as many reflection steps as necessary in your process.]\n</ reflection>\n\n<adjustment>\n[Any adjustments to your thinking based on your reflection]\n</adjustment>\n\n<output>\n[Your final, concise answer to the query. This is the only part that will be shown to the user.]\n</output>\n"
        },
        {
            "role": "user",
            "content": "A craft store makes a third of its sales in the fabric section, a quarter of its sales in the jewelry section, and the rest in the stationery section. They made 36 sales today. How many sales were in the stationery section?"
        }
    ],
    "ability": "chain-of-thought",
    "reward_model": {
        "style": "llmj-cot",
        "ground_truth": "Thus, there were 36 - 12 - 9 = <<36-12-9=15>>15 sales in the stationery section."
    },
    "extra_info": {
        "split": "train",
        "index": 0,
        "question": "A craft store makes a third of its sales in the fabric section, a quarter of its sales in the jewelry section, and the rest in the stationery section. They made 36 sales today. How many sales were in the stationery section?",
        "short_answer": "15",
        "model_output": "<thinking>\nTo find the number of sales in the stationery section, I need to first calculate the number of sales in the fabric and jewelry sections. The store makes a third of its sales in the fabric section, which is 1/3 of 36 sales. I can calculate this as 36 * 1/3 = 12 sales in the fabric section. The store makes a quarter of its sales in the jewelry section, which is 1/4 of 36 sales. I can calculate this as 36 * 1/4 = 9 sales in the jewelry section. Since the rest of the sales are in the stationery section, I can subtract the sales from the other two sections from the total number of sales to find the number of sales in the stationery section.\n</thinking>\n\n<reasoning>\n1. Calculate the number of sales in the fabric section: 36 * 1/3 = 12\n2. Calculate the number of sales in the jewelry section: 36 * 1/4 = 9\n3. Calculate the number of sales in the stationery section: 36 - 12 - 9 = 15\n</reasoning>\n\n‹reflection>\nUpon reviewing my reasoning, I noticed that my calculations seem correct, but I should ensure that I understand the question correctly. The question asks for the number of sales in the stationery section, and I believe I have correctly calculated that. However, I should double-check my math to make sure it's accurate.\n</reflection>\n\n<adjustment>\nUpon re-examining my math, I am confident that my calculations are correct. The fabric section makes up 1/3 of the sales, the jewelry section makes up 1/4, and the rest is the stationery section. My math checks out, and I am confident in my answer.\n</adjustment>\n\n<output>\n15\n</output>"
    }
}
```

## RLAIF - 宗教
<a name="model-customize-open-weight-samples-rlaif3"></a>

**輸入資料集**

來源：https：//[https://huggingface.co/datasets/rajpurkar/squad\$1v2/blob/main/squad\$1v2/train-00000-of-00001.parquet](https://huggingface.co/datasets/rajpurkar/squad_v2/blob/main/squad_v2/train-00000-of-00001.parquet)

**一個範例**

```
{
    "data_source": "squad_v2",
    "prompt": [
        {
            "role": "system",
            "content": "You are a helpful assistant that answers questions based on the provided context. Only use information from the context."
        },
        {
            "role": "user",
            "content": "Context: Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles \"Crazy in Love\" and \"Baby Boy\".\n\nQuestion: When did Beyonce start becoming popular?"
        }
    ],
    "ability": "faithfulness",
    "reward_model": {
        "style": "llmj-faithfulness",
        "ground_truth": "Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles \"Crazy in Love\" and \"Baby Boy\"."
    },
    "extra_info": {
        "question": "When did Beyonce start becoming popular?",
        "split": "train",
        "index": 0
    }
}
```

## RLAIF - 摘要
<a name="model-customize-open-weight-samples-rlaif4"></a>

**輸入資料集**

來源：清理的 gsm8k 資料集 https：//[https://huggingface.co/datasets/thesven/gsm8k-reasoning/tree/main/data](https://huggingface.co/datasets/thesven/gsm8k-reasoning/tree/main/data)

**一個範例**

```
{
    "data_source": "cnn_dailymail",
    "prompt": [
        {
            "role": "system",
            "content": "You are a helpful assistant that creates concise, accurate summaries of news articles. Focus on the key facts and main points."
        },
        {
            "role": "user",
            "content": "Summarize the following article:\n\nLONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. Daniel Radcliffe as Harry Potter in \"Harry Potter and the Order of the Phoenix\" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. \"I don't plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar,\" he told an Australian interviewer earlier this month. \"I don't think I'll be particularly extravagant. \"The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs.\" At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film \"Hostel: Part II,\" currently six places below his number one movie on the UK box office chart. Details of how he'll mark his landmark birthday are under wraps. His agent and publicist had no comment on his plans. \"I'll definitely have some sort of party,\" he said in an interview. \"Hopefully none of you will be reading about it.\" Radcliffe's earnings from the first five Potter films have been held in a trust fund which he has not been able to touch. Despite his growing fame and riches, the actor says he is keeping his feet firmly on the ground. \"People are always looking to say 'kid star goes off the rails,'\" he told reporters last month. \"But I try very hard not to go that way because it would be too easy for them.\" His latest outing as the boy wizard in \"Harry Potter and the Order of the Phoenix\" is breaking records on both sides of the Atlantic and he will reprise the role in the last two films.  Watch I-Reporter give her review of Potter's latest » . There is life beyond Potter, however. 
The Londoner has filmed a TV movie called \"My Boy Jack,\" about author Rudyard Kipling and his son, due for release later this year. He will also appear in \"December Boys,\" an Australian film about four boys who escape an orphanage. Earlier this year, he made his stage debut playing a tortured teenager in Peter Shaffer's \"Equus.\" Meanwhile, he is braced for even closer media scrutiny now that he's legally an adult: \"I just think I'm going to be more sort of fair game,\" he told Reuters. E-mail to a friend . Copyright 2007 Reuters. All rights reserved.This material may not be published, broadcast, rewritten, or redistributed."
        }
    ],
    "ability": "summarization",
    "reward_model": {
        "style": "llmj-summarization",
        "ground_truth": "Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday .\nYoung actor says he has no plans to fritter his cash away .\nRadcliffe's earnings from first five Potter films have been held in trust fund ."
    },
    "extra_info": {
        "question": "LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. Daniel Radcliffe as Harry Potter in \"Harry Potter and the Order of the Phoenix\" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. \"I don't plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar,\" he told an Australian interviewer earlier this month. \"I don't think I'll be particularly extravagant. \"The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs.\" At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film \"Hostel: Part II,\" currently six places below his number one movie on the UK box office chart. Details of how he'll mark his landmark birthday are under wraps. His agent and publicist had no comment on his plans. \"I'll definitely have some sort of party,\" he said in an interview. \"Hopefully none of you will be reading about it.\" Radcliffe's earnings from the first five Potter films have been held in a trust fund which he has not been able to touch. Despite his growing fame and riches, the actor says he is keeping his feet firmly on the ground. \"People are always looking to say 'kid star goes off the rails,'\" he told reporters last month. \"But I try very hard not to go that way because it would be too easy for them.\" His latest outing as the boy wizard in \"Harry Potter and the Order of the Phoenix\" is breaking records on both sides of the Atlantic and he will reprise the role in the last two films.  Watch I-Reporter give her review of Potter's latest » . There is life beyond Potter, however. 
The Londoner has filmed a TV movie called \"My Boy Jack,\" about author Rudyard Kipling and his son, due for release later this year. He will also appear in \"December Boys,\" an Australian film about four boys who escape an orphanage. Earlier this year, he made his stage debut playing a tortured teenager in Peter Shaffer's \"Equus.\" Meanwhile, he is braced for even closer media scrutiny now that he's legally an adult: \"I just think I'm going to be more sort of fair game,\" he told Reuters. E-mail to a friend . Copyright 2007 Reuters. All rights reserved.This material may not be published, broadcast, rewritten, or redistributed.",
        "split": "train",
        "index": 0,
        "source_id": "42c027e4ff9730fbb3de84c1af0d2c50"
    }
}
```

## RLAIF - 自訂提示
<a name="model-customize-open-weight-samples-rlaif5"></a>

在此範例中，我們使用 [RLAIF - 思維鏈](#model-customize-open-weight-samples-rlaif2)來討論自訂 jinja 提示如何取代其中一個預設提示。

**以下是 CoT 的自訂提示範例：**

```
You are an expert logical reasoning evaluator specializing in Chain-of-Thought (CoT) analysis. 

Given: A problem prompt and a model's reasoning-based response. 

Goal: Assess the quality and structure of logical reasoning, especially for specialized domains (law, medicine, finance, etc.).

Scoring rubric (start at 0.0, then add or subtract):

Core Components:

Structural Completeness (0.3 max)
- Clear problem statement: +0.05
- Defined variables/terminology: +0.05
- Organized given information: +0.05
- Explicit proof target: +0.05
- Step-by-step reasoning: +0.05
- Clear conclusion: +0.05

Logical Quality (0.4 max)
- Valid logical flow: +0.1
- Proper use of if-then relationships: +0.1
- Correct application of domain principles: +0.1
- No logical fallacies: +0.1

Technical Accuracy (0.3 max)
- Correct use of domain terminology: +0.1
- Accurate application of domain rules: +0.1
- Proper citation of relevant principles: +0.1

Critical Deductions:
A. Invalid logical leap: -0.3
B. Missing critical steps: -0.2
C. Incorrect domain application: -0.2
D. Unclear/ambiguous reasoning: -0.1

Additional Instructions:
- Verify domain-specific terminology and principles
- Check for logical consistency throughout
- Ensure conclusions follow from premises
- Flag potential domain-specific compliance issues
- Consider regulatory/professional standards where applicable

Return EXACTLY this JSON (no extra text):
{
    "score": <numerical score 0.0-1.0>,
    "component_scores": {
        "structural_completeness": <score>,
        "logical_quality": <score>,
        "technical_accuracy": <score>
    },
    "steps_present": {
        "problem_statement": <true/false>,
        "variable_definitions": <true/false>,
        "given_information": <true/false>,
        "proof_target": <true/false>,
        "step_reasoning": <true/false>,
        "conclusion": <true/false>
    },
    "reasoning": "<explain scoring decisions and identify any logical gaps>",
    "domain_flags": ["<any domain-specific concerns or compliance issues>"]
}

### (Prompt field from dataset)
Problem Prompt: {{ prompt }}

Model's Response: {{ model_output }}

### Ground truth (if applicable):
{{ ground_truth }}
```

## 從可驗證獎勵 (RLVR) 的強化學習 - 完全相符
<a name="model-customize-open-weight-samples-RLVR"></a>

**輸入資料集**

來源：https：//[https://huggingface.co/datasets/openai/gsm8k](https://huggingface.co/datasets/openai/gsm8k)

**範例**

```
{
  "data_source": "openai/gsm8k",
  "prompt": [
    {
      "content": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? Let\'s think step by step and output the final answer after \\"####\\".",
      "role": "user"
    }
  ],
  "ability": "math",
  "reward_model": {
    "ground_truth": "72",
    "style": "rule"
  },
  "extra_info": {
    "answer": "Natalia sold 48\\/2 = <<48\\/2=24>>24 clips in May.\\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\\n#### 72",
    "index": 0,
    "question": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?",
    "split": "train"
  }
}
```

## RLVR - 程式碼執行
<a name="model-customize-open-weight-samples-RLVR2"></a>

**輸入資料集**

來源：https：//[https://huggingface.co/datasets/open-r1/codeforces](https://huggingface.co/datasets/open-r1/codeforces)

**範例**

```
{
  "data_source": "codeforces",
  "prompt": [
    {
      "content": "\nWhen tackling complex reasoning tasks, you have access to the following actions. Use them as needed to progress through your thought process.\n\n[ASSESS]\n\n[ADVANCE]\n\n[VERIFY]\n\n[SIMPLIFY]\n\n[SYNTHESIZE]\n\n[PIVOT]\n\n[OUTPUT]\n\nYou should strictly follow the format below:\n\n[ACTION NAME]\n\n# Your action step 1\n\n# Your action step 2\n\n# Your action step 3\n\n...\n\nNext action: [NEXT ACTION NAME]\n\n",
      "role": "system"
    },
    {
      "content": "Title: Zebras\n\nTime Limit: None seconds\n\nMemory Limit: None megabytes\n\nProblem Description:\nOleg writes down the history of the days he lived. For each day he decides if it was good or bad. Oleg calls a non-empty sequence of days a zebra, if it starts with a bad day, ends with a bad day, and good and bad days are alternating in it. Let us denote bad days as 0 and good days as 1. Then, for example, sequences of days 0, 010, 01010 are zebras, while sequences 1, 0110, 0101 are not.\n\nOleg tells you the story of days he lived in chronological order in form of string consisting of 0 and 1. Now you are interested if it is possible to divide Oleg's life history into several subsequences, each of which is a zebra, and the way it can be done. Each day must belong to exactly one of the subsequences. For each of the subsequences, days forming it must be ordered chronologically. Note that subsequence does not have to be a group of consecutive days.\n\nInput Specification:\nIn the only line of input data there is a non-empty string *s* consisting of characters 0 and 1, which describes the history of Oleg's life. Its length (denoted as |*s*|) does not exceed 200<=000 characters.\n\nOutput Specification:\nIf there is a way to divide history into zebra subsequences, in the first line of output you should print an integer *k* (1<=\u2264<=*k*<=\u2264<=|*s*|), the resulting number of subsequences. In the *i*-th of following *k* lines first print the integer *l**i* (1<=\u2264<=*l**i*<=\u2264<=|*s*|), which is the length of the *i*-th subsequence, and then *l**i* indices of days forming the subsequence. Indices must follow in ascending order. Days are numbered starting from 1. Each index from 1 to *n* must belong to exactly one subsequence. If there is no way to divide day history into zebra subsequences, print -1.\n\nSubsequences may be printed in any order. If there are several solutions, you may print any of them. 
You do not have to minimize nor maximize the value of *k*.\n\nDemo Input:\n['0010100\\n', '111\\n']\n\nDemo Output:\n['3\\n3 1 3 4\\n3 2 5 6\\n1 7\\n', '-1\\n']\n\nNote:\nnone\n\nWrite Python code to solve the problem. Present the code in \n```python\nYour code\n```\nat the end.",
      "role": "user"
    }
  ],
  "ability": "code",
  "reward_model": {
    "ground_truth": "{\"inputs\": [\"0010100\", \"111\", \"0\", \"1\", \"0101010101\", \"010100001\", \"000111000\", \"0101001000\", \"0000001000\", \"0101\", \"000101110\", \"010101010\", \"0101001010\", \"0100101100\", \"0110100000\", \"0000000000\", \"1111111111\", \"0010101100\", \"1010000\", \"0001110\", \"0000000000011001100011110101000101000010010111000100110110000011010011110110001100100001001001010010\", \"01010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010\", \"0010011100000000\"], \"outputs\": [\"3\\n1 1\\n5 2 3 4 5 6\\n1 7\", \"-1\", \"1\\n1 1\", \"-1\", \"-1\", \"-1\", \"3\\n3 1 6 7\\n3 2 5 8\\n3 3 4 9\", \"4\\n5 1 2 3 4 5\\n3 6 7 8\\n1 9\\n1 10\", \"8\\n1 1\\n1 2\\n1 3\\n1 4\\n1 5\\n3 6 7 8\\n1 9\\n1 10\", \"-1\", \"-1\", \"1\\n9 1 2 3 4 5 6 7 8 9\", \"2\\n5 1 2 3 4 5\\n5 6 7 8 9 10\", \"2\\n5 1 2 3 8 9\\n5 4 5 6 7 10\", \"-1\", \"10\\n1 1\\n1 2\\n1 3\\n1 4\\n1 5\\n1 6\\n1 7\\n1 8\\n1 9\\n1 10\", \"-1\", \"2\\n3 1 8 9\\n7 2 3 4 5 6 7 10\", \"-1\", \"-1\", \"22\\n1 1\\n1 2\\n1 3\\n1 4\\n1 5\\n1 6\\n1 7\\n1 8\\n7 9 24 25 26 27 28 29\\n7 10 13 14 17 18 23 30\\n11 11 12 15 16 19 22 31 32 33 34 35\\n3 20 21 36\\n3 37 46 47\\n9 38 39 40 45 48 57 58 75 76\\n17 41 42 43 44 49 50 51 54 55 56 59 72 73 74 77 80 81\\n9 52 53 60 71 78 79 82 83 84\\n7 61 64 65 66 67 70 85\\n5 62 63 68 69 86\\n3 87 88 89\\n3 90 91 92\\n5 93 94 95 96 97\\n3 98 99 100\", \"1\\n245 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 
133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 ...\", \"8\\n3 1 8 9\\n5 2 3 4 7 10\\n3 5 6 11\\n1 12\\n1 13\\n1 14\\n1 15\\n1 16\"]}",
    "style": "rule"
  },
  "extra_info": {
    "index": 49,
    "split": "train"
  }
}
```

**獎勵函數**

獎勵函數：https：//[https://github.com/volcengine/verl/tree/main/verl/utils/reward\$1score/prime\$1code](https://github.com/volcengine/verl/tree/main/verl/utils/reward_score/prime_code)

## RLVR - 數學答案
<a name="model-customize-open-weight-samples-RLVR3"></a>

**輸入資料集**

來源：清理的 gsm8k 資料集 [https://huggingface.co/datasets/thesven/gsm8k-reasoning/tree/main/data](https://huggingface.co/datasets/thesven/gsm8k-reasoning/tree/main/data)：//

**範例**

```
[
    {
        "content": "\nYou are an AI assistant that uses a Chain of Thought (CoT) approach with reflection to answer queries...",
        "role": "system"
    },
    {
        "content": "A craft store makes a third of its sales in the fabric section, a quarter of its sales in the jewelry section, and the rest in the stationery section. They made 36 sales today. How many sales were in the stationery section?",
        "role": "user"
    },
    {
        "content": "\nTo find the number of sales in the stationery section, I need to first calculate the number of sales in the fabric and jewelry sections...\n\n\n<reasoning>\n1. Calculate the number of sales in the fabric section: 36 * 1/3 = 12\n2. Calculate the number of sales in the jewelry section: 36 * 1/4 = 9\n3. Calculate the number of sales in the stationery section: 36 - 12 - 9 = 15\n</reasoning>\n\n<reflection>\nUpon reviewing my reasoning, I noticed that my calculations seem correct...\n</reflection>\n\n<adjustment>\nUpon re-examining my math, I am confident that my calculations are correct...\n</adjustment>\n\n<output>\n15\n</output>",
        "role": "assistant"
    }
]
```

**獎勵計算**

獎勵函數：https：//[https://github.com/volcengine/verl/blob/main/verl/utils/reward\$1score/gsm8k.py](https://github.com/volcengine/verl/blob/main/verl/utils/reward_score/gsm8k.py)

## RLVR - 自訂 Lambda
<a name="model-customize-open-weight-samples-RLVR4"></a>

**輸入資料集**

來源：清理的 gsm8k 資料集 [https://huggingface.co/datasets/thesven/gsm8k-reasoning/tree/main/data](https://huggingface.co/datasets/thesven/gsm8k-reasoning/tree/main/data)：//

**範例**

```
[
  {
    "content": "\nYou are an AI assistant that uses a Chain of Thought (CoT) approach with reflection to answer queries.\n\nFollow these steps:\n- Think through the problem step by step within the ‹thinking> tags.\n- Reflect on your thinking\n\nto check for any errors or improvements within the ‹reflection› tags.\n- Make any necessary adjustments based on your reflection.\n- Provide your final, concise answer within the ‹output> tags.\n\nImportant: The <thinking> and ‹reflection› sections are for your internal reasoning process only.\nDo not include any part of the final answer in these sections.\nThe actual response to the query must be entirely contained within the ‹output› tags.\n\nUse the following format for your response:\n<thinking>\n[Your initial thought process goes here]\n</thinking›\n\n<reasoning>\n[Your step-by-step reasoning goes here. This is your internal thought process, not the final answer. You can create as many reasoning steps as necessary in your process.]\n</reasoning>\n\n‹reflection>\n[Your reflection on your reasoning, checking for errors or improvements. You can create as many reflection steps as necessary in your process.]\n</ reflection>\n\n<adjustment>\n[Any adjustments to your thinking based on your reflection]\n</adjustment>\n\n<output>\n[Your final, concise answer to the query. This is the only part that will be shown to the user.]\n</output>\n",
    "role": "system"
  },
  {
    "content": "A craft store makes a third of its sales in the fabric section, a quarter of its sales in the jewelry section, and the rest in the stationery section. They made 36 sales today. How many sales were in the stationery section?",
    "role": "user"
  },
  {
    "content": "<thinking>\nTo find the number of sales in the stationery section, I need to first calculate the number of sales in the fabric and jewelry sections. The store makes a third of its sales in the fabric section, which is 1/3 of 36 sales. I can calculate this as 36 * 1/3 = 12 sales in the fabric section. The store makes a quarter of its sales in the jewelry section, which is 1/4 of 36 sales. I can calculate this as 36 * 1/4 = 9 sales in the jewelry section. Since the rest of the sales are in the stationery section, I can subtract the sales from the other two sections from the total number of sales to find the number of sales in the stationery section.\n</thinking>\n\n<reasoning>\n1. Calculate the number of sales in the fabric section: 36 * 1/3 = 12\n2. Calculate the number of sales in the jewelry section: 36 * 1/4 = 9\n3. Calculate the number of sales in the stationery section: 36 - 12 - 9 = 15\n</reasoning>\n\n‹reflection>\nUpon reviewing my reasoning, I noticed that my calculations seem correct, but I should ensure that I understand the question correctly. The question asks for the number of sales in the stationery section, and I believe I have correctly calculated that. However, I should double-check my math to make sure it's accurate.\n</reflection>\n\n<adjustment>\nUpon re-examining my math, I am confident that my calculations are correct. The fabric section makes up 1/3 of the sales, the jewelry section makes up 1/4, and the rest is the stationery section. My math checks out, and I am confident in my answer.\n</adjustment>\n\n<output>\n15\n</output>",
    "role": "assistant"
  }
]
```

**獎勵計算範例**

```
# RLVR Evaluator for OSS

# lambda_grader.py
import json
import re
import uuid
from typing import Any, Dict, List
 
def custom_reward(assistant_answer: str, ground_truth: str) -> float:
    """
    Add custom reward computation here
 
    Example:-
    Reward = fraction of ground-truth words that are correct
    in the correct position.
 
    Example:
      gt:   "the cat sat"
      ans:  "the dog sat"
 
      word-by-word:
        "the" == "the"  -> correct
        "dog" != "cat"  -> wrong
        "sat" == "sat"  -> correct
 
      correct = 2 out of 3 -> reward = 2/3 ≈ 0.67
    """
    ans_words = assistant_answer.strip().lower().split()
    gt_words = ground_truth.strip().lower().split()
 
    if not gt_words:
        return 0.0
 
    correct = 0
    for aw, gw in zip(ans_words, gt_words):
        if aw == gw:
            correct += 1
 
    return correct / len(gt_words)
 
 
# Lambda utility functions
def _ok(body: Any, code: int = 200) -> Dict[str, Any]:
    return {
        "statusCode": code,
        "headers": {
            "content-type": "application/json",
            "access-control-allow-origin": "*",
            "access-control-allow-methods": "POST,OPTIONS",
            "access-control-allow-headers": "content-type",
        },
        "body": json.dumps(body),
    }
 
def _assistant_text(sample: Dict[str, Any]) -> str:
    """Extract assistant text from sample messages."""
    for m in reversed(sample.get("messages", [])):
        if m.get("role") == "assistant":
            return (m.get("content") or "").strip()
    return ""
 
def _sample_id(sample: Dict[str, Any]) -> str:
    """Generate or extract sample ID."""
    if isinstance(sample.get("id"), str) and sample["id"]:
        return sample["id"]
 
    return str(uuid.uuid4())
 
def _ground_truth(sample: Dict[str, Any]) -> str:
    """Extract ground truth from sample or metadata if available"""
 
    if isinstance(sample.get("reference_answer"), str) and sample["reference_answer"]:
        return sample["reference_answer"].strip()
 
    md = sample.get("metadata") or {}
    gt = md.get("reference_answer", None) or md.get("ground_truth", None)
    if gt is None:
        return ""
    return str(gt).strip()
 
 
def _score_and_metrics(sample: Dict[str, Any]) -> Dict[str, Any]:
    sid = _sample_id(sample)
    solution_text = _assistant_text(sample)
 
    # Extract ground truth
    gt = _ground_truth(sample)
 
    metrics_list: List[Dict[str, Any]] = []
 
    # Custom rlvr scoring
    if solution_text and gt:
        
        # Compute score
        reward_score = custom_reward(
            assistant_answer=solution_text,
            ground_truth=gt
        )
        
        # Add detailed metrics
        metrics_list.append({
            "name": "custom_reward_score", 
            "value": float(reward_score), 
            "type": "Reward"
        })
       
        # The aggregate reward score is the custom reward score
        aggregate_score = float(reward_score)
        
    else:
        # No solution text or ground truth - default to 0
        aggregate_score = 0.0
        metrics_list.append({
            "name": "default_zero", 
            "value": 0.0, 
            "type": "Reward"
        })
    print("detected score", {
        "id": sid,
        "aggregate_reward_score": float(aggregate_score),
        "metrics_list": metrics_list,
    })
    return {
        "id": sid,
        "aggregate_reward_score": float(aggregate_score),
        "metrics_list": metrics_list,
    }
 
def lambda_handler(event, context):
    """AWS Lambda handler for custom reward lambda grading."""
    # CORS preflight
    if event.get("requestContext", {}).get("http", {}).get("method") == "OPTIONS":
        return _ok({"ok": True})
 
    # Body may be a JSON string (API GW/Function URL) or already a dict (Invoke)
    raw = event.get("body") or "{}"
    try:
        body = json.loads(raw) if isinstance(raw, str) else raw
    except Exception as e:
        return _ok({"error": f"invalid JSON body: {e}"}, 400)
 
    # Accept top-level list, {"batch":[...]}, or single sample object
    if isinstance(body, dict) and isinstance(body.get("batch"), list):
        samples = body["batch"]
    else:
        return _ok({
            "error": "Send a sample object, or {'batch':[...]} , or a top-level list of samples."
        }, 400)
 
    try:
        results = [_score_and_metrics(s) for s in samples]
    except Exception as e:
        return _ok({"error": f"Custom scoring failed: {e}"}, 500)
 
    return _ok(results)
```

**獎勵函數程式碼範例**

```
# RLVR Evaluator for OSS
# lambda_grader.py

import json
import re
import uuid from typing 
import Any, Dict, List
 
def custom_reward(assistant_answer: str, ground_truth: str) -> float:
    """
    Add custom reward computation here
 
    Example:-
    Reward = fraction of ground-truth words that are correct
    in the correct position.
 
    Example:
      gt:   "the cat sat"
      ans:  "the dog sat"
 
      word-by-word:
        "the" == "the"  -> correct
        "dog" != "cat"  -> wrong
        "sat" == "sat"  -> correct
 
      correct = 2 out of 3 -> reward = 2/3 ≈ 0.67
    """
    ans_words = assistant_answer.strip().lower().split()
    gt_words = ground_truth.strip().lower().split()
 
    if not gt_words:
        return 0.0
 
    correct = 0
    for aw, gw in zip(ans_words, gt_words):
        if aw == gw:
            correct += 1
 
    return correct / len(gt_words)
 
 
# Lambda utility functions
def _ok(body: Any, code: int = 200) -> Dict[str, Any]:
    return {
        "statusCode": code,
        "headers": {
            "content-type": "application/json",
            "access-control-allow-origin": "*",
            "access-control-allow-methods": "POST,OPTIONS",
            "access-control-allow-headers": "content-type",
        },
        "body": json.dumps(body),
    }
 
def _assistant_text(sample: Dict[str, Any]) -> str:
    """Extract assistant text from sample messages."""
    for m in reversed(sample.get("messages", [])):
        if m.get("role") == "assistant":
            return (m.get("content") or "").strip()
    return ""
 
def _sample_id(sample: Dict[str, Any]) -> str:
    """Generate or extract sample ID."""
    if isinstance(sample.get("id"), str) and sample["id"]:
        return sample["id"]
 
    return str(uuid.uuid4())
 
def _ground_truth(sample: Dict[str, Any]) -> str:
    """Extract ground truth from sample or metadata if available"""
 
    if isinstance(sample.get("reference_answer"), str) and sample["reference_answer"]:
        return sample["reference_answer"].strip()
 
    md = sample.get("metadata") or {}
    gt = md.get("reference_answer", None) or md.get("ground_truth", None)
    if gt is None:
        return ""
    return str(gt).strip()
 
 
def _score_and_metrics(sample: Dict[str, Any]) -> Dict[str, Any]:
    sid = _sample_id(sample)
    solution_text = _assistant_text(sample)
 
    # Extract ground truth
    gt = _ground_truth(sample)
 
    metrics_list: List[Dict[str, Any]] = []
 
    # Custom rlvr scoring
    if solution_text and gt:
        
        # Compute score
        reward_score = custom_reward(
            assistant_answer=solution_text,
            ground_truth=gt
        )
        
        # Add detailed metrics
        metrics_list.append({
            "name": "custom_reward_score", 
            "value": float(reward_score), 
            "type": "Reward"
        })
       
        # The aggregate reward score is the custom reward score
        aggregate_score = float(reward_score)
        
    else:
        # No solution text or ground truth - default to 0
        aggregate_score = 0.0
        metrics_list.append({
            "name": "default_zero", 
            "value": 0.0, 
            "type": "Reward"
        })
    print("detected score", {
        "id": sid,
        "aggregate_reward_score": float(aggregate_score),
        "metrics_list": metrics_list,
    })
    return {
        "id": sid,
        "aggregate_reward_score": float(aggregate_score),
        "metrics_list": metrics_list,
    }
 
def lambda_handler(event, context):
    """AWS Lambda handler for custom reward lambda grading."""
    # CORS preflight
    if event.get("requestContext", {}).get("http", {}).get("method") == "OPTIONS":
        return _ok({"ok": True})
 
    # Body may be a JSON string (API GW/Function URL) or already a dict (Invoke)
    raw = event.get("body") or "{}"
    try:
        body = json.loads(raw) if isinstance(raw, str) else raw
    except Exception as e:
        return _ok({"error": f"invalid JSON body: {e}"}, 400)
 
    # Accept top-level list, {"batch":[...]}, or single sample object
    if isinstance(body, dict) and isinstance(body.get("batch"), list):
        samples = body["batch"]
    else:
        return _ok({
            "error": "Send a sample object, or {'batch':[...]} , or a top-level list of samples."
        }, 400)
 
    try:
        results = [_score_and_metrics(s) for s in samples]
    except Exception as e:
        return _ok({"error": f"Custom scoring failed: {e}"}, 500)
 
    return _ok(results)
```

**獎勵提示範例**

```
You are an expert RAG response evaluator specializing in faithfulness and relevance assessment.
Given: Context documents, a question, and response statements.
Goal: Evaluate both statement-level faithfulness and overall response relevance to the question.

Scoring rubric (start at 0.0, then add or subtract):

Core Components:

Faithfulness Assessment (0.6 max)
Per statement evaluation:
- Direct support in context: +0.2
- Accurate inference from context: +0.2
- No contradictions with context: +0.2
Deductions:
- Hallucination: -0.3
- Misrepresentation of context: -0.2
- Unsupported inference: -0.1

Question Relevance (0.4 max)
- Direct answer to question: +0.2
- Appropriate scope/detail: +0.1
- Proper context usage: +0.1
Deductions:
- Off-topic content: -0.2
- Implicit/meta responses: -0.2
- Missing key information: -0.1

Critical Flags:
A. Complete hallucination
B. Context misalignment
C. Question misinterpretation
D. Implicit-only responses

Additional Instructions:
- Evaluate each statement independently
- Check for direct textual support
- Verify logical inferences
- Assess answer completeness
- Flag any unsupported claims

Return EXACTLY this JSON (no extra text):
{
    "statements_evaluation": [
        {
            "statement": "<statement_text>",
            "verdict": <0 or 1>,
            "reason": "<detailed explanation>",
            "context_support": "<relevant context quote or 'None'>"
        }
    ],
    "overall_assessment": {
        "question_addressed": <0 or 1>,
        "reasoning": "<explanation>",
        "faithfulness_score": <0.0-1.0>,
        "relevance_score": <0.0-1.0>
    },
    "flags": ["<any critical issues>"]
}

## Current Evaluation Task

### Context
{{ ground_truth }}

### Question
{{ extra_info.question }}

### Model's Response
{{ model_output }}
```

# 版本備註
<a name="model-customize-release-note"></a>

**SageMaker AI 模型自訂映像**

**支援計劃**
+ 主要版本：下次主要版本後 12 個月
+ 次要版本：下一個次要版本後 6 個月
+ 修補程式版本：不保證支援 （升級至最新的修補程式）

以下是適用於 Amazon EKS (EKS) 和 SageMaker AI 訓練任務 (SMTJ) 的基礎深度學習容器版本備註：


****  

| 版本 | Type | 服務 | 影像 URL | 
| --- | --- | --- | --- | 
|  1.0.0  | CUDA | EKS |  `652744875666.dkr.ecr.amazonaws.com/hyperpod-model-customization:verl-eks-v1.0.0`  | 
|  1.0.0  | CUDA | SMTJ |  `652744875666.dkr.ecr.amazonaws.com/hyperpod-model-customization:verl-smtj-v1.0.0`  | 
|  1.0.0  | CUDA | SMJT |  `652744875666.dkr.ecr.amazonaws.com/hyperpod-model-customization:v1-v1.0.0`  | 
|  1.0.0  | CUDA | SMTJ |  `652744875666.dkr.ecr.amazonaws.com/hyperpod-model-customization:llama-90b-v1.0.0`  | 

**AWS 區域 支援**


****  

| 區域 | Code | 無伺服器 SMTJ 支援 | 
| --- | --- | --- | 
| 亞太區域 (孟買) | ap-south-1 | 編號 | 
| 亞太區域 (新加坡) | ap-southeast-1 | 編號 | 
| 亞太地區 (悉尼) | ap-southeast-2 | 編號 | 
| 亞太區域 (東京) | ap-northeast-1 | 是 | 
| 歐洲 (法蘭克福) | eu-central-1 | 編號 | 
| 歐洲 (愛爾蘭) | eu-west-1 | 是 | 
| Europe (Stockholm) | eu-north-1 | 編號 | 
| 南美洲 (聖保羅) | sa-east-1 | 編號 | 
| 美國東部 (維吉尼亞北部) | us-east-1 | 是 | 
| 美國東部 (俄亥俄) | us-east-2 | 編號 | 
| 美國西部 (加利佛尼亞北部) | us-west-1 | 編號 | 
| 美國西部 (奧勒岡) | us-west-2 | 是 | 