


# Evaluation types and job submission
<a name="model-customize-evaluation-types"></a>

## Benchmarking with standard datasets
<a name="model-customize-evaluation-benchmarking"></a>

Use the Benchmark evaluation type to evaluate your model's quality across standard benchmark datasets, including popular datasets such as MMLU and BBH.


| Benchmark | Custom dataset supported | Modality | Description | Metrics | Strategy | Subtasks available | 
| --- | --- | --- | --- | --- | --- | --- | 
| mmlu | No | Text | Multitask Language Understanding - Tests knowledge across 57 subjects. | accuracy | zs_cot | Yes | 
| mmlu_pro | No | Text | MMLU - Professional subset - Focuses on professional domains such as law, medicine, accounting, and engineering. | accuracy | zs_cot | No | 
| bbh | No | Text | Advanced reasoning tasks - A collection of challenging problems that test higher-order cognitive and problem-solving skills. | accuracy | fs_cot | Yes | 
| gpqa | No | Text | General Physics Question Answering - Assesses understanding of physics concepts and related problem-solving ability. | accuracy | zs_cot | No | 
| math | No | Text | Mathematical problem solving - Measures mathematical reasoning across topics including algebra, calculus, and word problems. | exact_match | zs_cot | Yes | 
| strong_reject | No | Text | Quality-control task - Tests the model's ability to detect and reject inappropriate, harmful, or incorrect content. | deflection | zs | Yes | 
| ifeval | No | Text | Instruction-following evaluation - Measures how accurately the model follows given instructions and completes tasks to specification. | accuracy | zs | No | 

For more information about the BYOD format, see [Supported Dataset Formats for Bring-Your-Own-Dataset (BYOD) Tasks](model-customize-evaluation-dataset-formats.md).
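
For illustration only, a BYOD record is typically one JSON object per line of a JSONL file. The field names below (`prompt`, `ground_truth`) are a hypothetical sketch; the authoritative schema is in the linked formats guide.

```
{"prompt": "What is the capital of France?", "ground_truth": "Paris"}
```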

### Available Subtasks
<a name="model-customize-evaluation-benchmarking-subtasks"></a>

The following lists the subtasks available for model evaluation across several domains, including MMLU (Massive Multitask Language Understanding), BBH (Big Bench Hard), MATH, and StrongReject. These subtasks let you assess your model's performance on specific capabilities and knowledge areas.

**MMLU subtasks**

```
MMLU_SUBTASKS = [
    "abstract_algebra",
    "anatomy",
    "astronomy",
    "business_ethics",
    "clinical_knowledge",
    "college_biology",
    "college_chemistry",
    "college_computer_science",
    "college_mathematics",
    "college_medicine",
    "college_physics",
    "computer_security",
    "conceptual_physics",
    "econometrics",
    "electrical_engineering",
    "elementary_mathematics",
    "formal_logic",
    "global_facts",
    "high_school_biology",
    "high_school_chemistry",
    "high_school_computer_science",
    "high_school_european_history",
    "high_school_geography",
    "high_school_government_and_politics",
    "high_school_macroeconomics",
    "high_school_mathematics",
    "high_school_microeconomics",
    "high_school_physics",
    "high_school_psychology",
    "high_school_statistics",
    "high_school_us_history",
    "high_school_world_history",
    "human_aging",
    "human_sexuality",
    "international_law",
    "jurisprudence",
    "logical_fallacies",
    "machine_learning",
    "management",
    "marketing",
    "medical_genetics",
    "miscellaneous",
    "moral_disputes",
    "moral_scenarios",
    "nutrition",
    "philosophy",
    "prehistory",
    "professional_accounting",
    "professional_law",
    "professional_medicine",
    "professional_psychology",
    "public_relations",
    "security_studies",
    "sociology",
    "us_foreign_policy",
    "virology",
    "world_religions"
]
```

**BBH subtasks**

```
BBH_SUBTASKS = [
    "boolean_expressions",
    "causal_judgement",
    "date_understanding",
    "disambiguation_qa",
    "dyck_languages",
    "formal_fallacies",
    "geometric_shapes",
    "hyperbaton",
    "logical_deduction_five_objects",
    "logical_deduction_seven_objects",
    "logical_deduction_three_objects",
    "movie_recommendation",
    "multistep_arithmetic_two",
    "navigate",
    "object_counting",
    "penguins_in_a_table",
    "reasoning_about_colored_objects",
    "ruin_names",
    "salient_translation_error_detection",
    "snarks",
    "sports_understanding",
    "temporal_sequences",
    "tracking_shuffled_objects_five_objects",
    "tracking_shuffled_objects_seven_objects",
    "tracking_shuffled_objects_three_objects",
    "web_of_lies",
    "word_sorting"
]
```

**MATH subtasks**

```
MATH_SUBTASKS = [
    "algebra", 
    "counting_and_probability", 
    "geometry",
    "intermediate_algebra", 
    "number_theory", 
    "prealgebra", 
    "precalculus"
]
```

**StrongReject subtasks**

```
STRONG_REJECT_SUBTASKS = [
    "gcg_transfer_harmbench", 
    "gcg_transfer_universal_attacks",
    "combination_3", 
    "combination_2", 
    "few_shot_json", 
    "dev_mode_v2",
    "dev_mode_with_rant",
    "wikipedia_with_title", 
    "distractors",
    "wikipedia",
     "style_injection_json", 
    "style_injection_short",
    "refusal_suppression", 
    "prefix_injection", 
    "distractors_negated",
    "poems", 
    "base64", 
    "base64_raw", "
    base64_input_only",
    "base64_output_only", 
    "evil_confidant", 
    "aim", 
    "rot_13",
    "disemvowel", 
    "auto_obfuscation", 
    "auto_payload_splitting", 
    "pair",
    "pap_authority_endorsement", 
    "pap_evidence_based_persuasion",
    "pap_expert_endorsement", 
    "pap_logical_appeal", 
    "pap_misrepresentation"
]
```

### Submit your benchmark job
<a name="model-customize-evaluation-benchmarking-submit"></a>

------
#### [ SageMaker Studio ]

![\[Minimal configuration for benchmarking through SageMaker Studio\]](http://docs.aws.amazon.com/id_id/sagemaker/latest/dg/images/benchmark-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
from sagemaker.train.evaluate import get_benchmarks
from sagemaker.train.evaluate import BenchMarkEvaluator

Benchmark = get_benchmarks()

# Create evaluator with MMLU benchmark
evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MMLU,
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)

execution = evaluator.evaluate()
```
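
Where a benchmark exposes subtasks (see the lists above), you may be able to restrict a run to specific ones. The sketch below assumes a `subtasks` parameter; that parameter name is an assumption rather than something this guide confirms, so check the SDK reference before relying on it.

```
# Hypothetical: 'subtasks' is an assumed parameter name, not confirmed here.
evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MMLU,
    subtasks=["college_mathematics", "machine_learning"],
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)
```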

For more information about submitting evaluation jobs through the SageMaker Python SDK, see: [https://sagemaker.readthedocs.io/en/stable/model_customization/evaluation.html](https://sagemaker.readthedocs.io/en/stable/model_customization/evaluation.html)

------

## Large Language Model as a Judge (LLMAJ) evaluation
<a name="model-customize-evaluation-llmaj"></a>

Use LLM-as-a-judge (LLMAJ) evaluation to leverage another frontier model to grade your target model's responses. You can use Amazon Bedrock models as the judge by calling the `create_evaluation_job` API to launch an evaluation job.
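
As a rough sketch of the direct API path, the boto3 call looks like the following; the `evaluationConfig` and `inferenceConfig` payloads are abbreviated here because their full schemas (judge model, metrics, datasets) are defined in the Amazon Bedrock API reference.

```
import boto3

bedrock = boto3.client("bedrock")

# Minimal sketch; evaluationConfig and inferenceConfig are abbreviated --
# see the Bedrock API reference for their complete schemas.
response = bedrock.create_evaluation_job(
    jobName="my-llmaj-evaluation",
    roleArn="arn:aws:iam::<account-id>:role/<role-name>",
    evaluationConfig={...},   # judge model, metrics, and dataset settings
    inferenceConfig={...},    # the target model to evaluate
    outputDataConfig={"s3Uri": "s3://<bucket-name>/<prefix>/"},
)
```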

For more information about supported judge models, see: [https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html](https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html)

You can use two different metric formats to define the evaluation:
+ **Built-in metrics:** Use Amazon Bedrock's built-in metrics to analyze the quality of your model's inference responses. For more information, see: [https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-type-judge-prompt.html](https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-type-judge-prompt.html)
+ **Custom metrics:** Define your own metrics in the Amazon Bedrock evaluation custom metric format to analyze the quality of your model's inference responses using your own instructions. For more information, see: [https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-custom-metrics-prompt-formats.html](https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-custom-metrics-prompt-formats.html)

### Submit a built-in metrics LLMAJ job
<a name="model-customize-evaluation-llmaj-builtin"></a>

------
#### [ SageMaker Studio ]

![\[Minimal configuration for LLMAJ benchmarking through SageMaker Studio\]](http://docs.aws.amazon.com/id_id/sagemaker/latest/dg/images/llmaj-as-judge-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
from sagemaker.train.evaluate import LLMAsJudgeEvaluator

evaluator = LLMAsJudgeEvaluator(
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    evaluator_model="<bedrock-judge-model-id>",
    dataset="s3://<bucket-name>/<prefix>/<dataset-file>.jsonl",
    builtin_metrics=["<builtin-metric-1>", "<builtin-metric-2>"],
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)

execution = evaluator.evaluate()
```

For more information about submitting evaluation jobs through the SageMaker Python SDK, see: [https://sagemaker.readthedocs.io/en/stable/model_customization/evaluation.html](https://sagemaker.readthedocs.io/en/stable/model_customization/evaluation.html)

------

### Submit a custom metrics LLMAJ job
<a name="model-customize-evaluation-llmaj-custom"></a>

Define your custom metric:

```
{
    "customMetricDefinition": {
        "name": "PositiveSentiment",
        "instructions": (
            "You are an expert evaluator. Your task is to assess if the sentiment of the response is positive. "
            "Rate the response based on whether it conveys positive sentiment, helpfulness, and constructive tone.\n\n"
            "Consider the following:\n"
            "- Does the response have a positive, encouraging tone?\n"
            "- Is the response helpful and constructive?\n"
            "- Does it avoid negative language or criticism?\n\n"
            "Rate on this scale:\n"
            "- Good: Response has positive sentiment\n"
            "- Poor: Response lacks positive sentiment\n\n"
            "Here is the actual task:\n"
            "Prompt: {{prompt}}\n"
            "Response: {{prediction}}"
        ),
        "ratingScale": [
            {"definition": "Good", "value": {"floatValue": 1}},
            {"definition": "Poor", "value": {"floatValue": 0}}
        ]
    }
}
```
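
In these instructions, `{{prompt}}` and `{{prediction}}` are template variables that the evaluation service fills in at run time with the dataset prompt and the candidate model's response, respectively.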

For more information, see: [https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-custom-metrics-prompt-formats.html](https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-custom-metrics-prompt-formats.html)

------
#### [ SageMaker Studio ]

![\[Upload custom metrics through Custom metrics > Add custom metrics\]](http://docs.aws.amazon.com/id_id/sagemaker/latest/dg/images/custom-llmaj-metrics-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
evaluator = LLMAsJudgeEvaluator(
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    evaluator_model="<bedrock-judge-model-id>",
    dataset="s3://<bucket-name>/<prefix>/<dataset-file>.jsonl",
    custom_metrics={
        "customMetricDefinition": {
            "name": "PositiveSentiment",
            "instructions": (
                "You are an expert evaluator. Your task is to assess if the sentiment of the response is positive. "
                "Rate the response based on whether it conveys positive sentiment, helpfulness, and constructive tone.\n\n"
                "Consider the following:\n"
                "- Does the response have a positive, encouraging tone?\n"
                "- Is the response helpful and constructive?\n"
                "- Does it avoid negative language or criticism?\n\n"
                "Rate on this scale:\n"
                "- Good: Response has positive sentiment\n"
                "- Poor: Response lacks positive sentiment\n\n"
                "Here is the actual task:\n"
                "Prompt: {{prompt}}\n"
                "Response: {{prediction}}"
            ),
            "ratingScale": [
                {"definition": "Good", "value": {"floatValue": 1}},
                {"definition": "Poor", "value": {"floatValue": 0}}
            ]
        }
    },
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)

execution = evaluator.evaluate()
```

------

## Custom Scorers
<a name="model-customize-evaluation-custom-scorers"></a>

Define your own custom scorer functions to launch evaluation jobs. The system provides two built-in scorers: Prime math and Prime code. You can also bring your own scorer function, either by pasting the scorer function code directly or by bringing your own Lambda function definition referenced through its ARN. By default, both scorer types produce evaluation results that include standard metrics such as F1, ROUGE, and BLEU scores.

For more information about built-in and custom scorers and their respective requirements and contracts, see [Evaluate with Preset and Custom Scorers](model-customize-evaluation-preset-custom-scorers.md).

### Register your dataset
<a name="model-customize-evaluation-custom-scorers-register-dataset"></a>

Bring your own dataset for custom scorers by registering it as a SageMaker Hub content dataset.

------
#### [ SageMaker Studio ]

In Studio, upload your dataset using the dedicated Datasets page.

![\[Evaluation dataset registered in SageMaker Studio\]](http://docs.aws.amazon.com/id_id/sagemaker/latest/dg/images/dataset-registration-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

In the SageMaker Python SDK, register your dataset with the `DataSet` class:

```
from sagemaker.ai_registry.dataset import DataSet

dataset = DataSet.create(
    name="your-bring-your-own-dataset",
    source="s3://<bucket-name>/<prefix>/<dataset-file>.jsonl"
)
dataset.refresh()
```
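
Once registered, the dataset is referenced by its Hub content ARN in evaluator calls (as in the built-in scorer example below). As a hypothetical sketch, assuming the `DataSet` object exposes that ARN as an attribute (the attribute name is an assumption; check the SDK reference):

```
# Hypothetical attribute; the actual name may differ in the SDK.
print(dataset.arn)
```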

------

### Submit a built-in scorer job
<a name="model-customize-evaluation-custom-scorers-builtin"></a>

------
#### [ SageMaker Studio ]

![\[Choose from Code execution or Math answers for Built-in custom scoring\]](http://docs.aws.amazon.com/id_id/sagemaker/latest/dg/images/builtin-scorer-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
from sagemaker.train.evaluate import CustomScorerEvaluator
from sagemaker.train.evaluate import get_builtin_metrics

BuiltInMetric = get_builtin_metrics()

evaluator_builtin = CustomScorerEvaluator(
    evaluator=BuiltInMetric.PRIME_MATH,
    dataset="arn:aws:sagemaker:<region>:<account-id>:hub-content/<hub-content-id>/DataSet/your-bring-your-own-dataset/<version>",
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)

execution = evaluator_builtin.evaluate()
```

Choose between `BuiltInMetric.PRIME_MATH` and `BuiltInMetric.PRIME_CODE` for built-in scorers.

------

### Submit a custom scorer job
<a name="model-customize-evaluation-custom-scorers-custom"></a>

Define a custom reward function. For more information, see [Custom Scorers (Bring Your Own Metric)](model-customize-evaluation-preset-custom-scorers.md#model-customize-evaluation-custom-scorers-byom). A hypothetical sketch of such a function is shown below.
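
The following is a minimal sketch of what `custom_lambda_function.py` might contain, assuming a Lambda-style handler that receives a model prediction and a reference answer and returns a numeric score. The event field names here are assumptions; the authoritative function contract is in the linked guide.

```
# custom_lambda_function.py - hypothetical sketch; the real handler
# signature and event shape are defined by the custom scorer contract.
def lambda_handler(event, context):
    """Score one model response against its reference answer."""
    prediction = event.get("prediction", "")
    reference = event.get("ground_truth", "")

    # Toy reward: exact match after whitespace normalization.
    score = 1.0 if prediction.strip() == reference.strip() else 0.0
    return {"score": score}
```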

**Register the custom reward function**

------
#### [ SageMaker Studio ]

![\[Navigate to SageMaker Studio > Assets > Evaluators > Create evaluator > Create reward function\]](http://docs.aws.amazon.com/id_id/sagemaker/latest/dg/images/custom-scorer-submission-sagemaker-studio.png)


![\[Submit a Custom Scorer evaluation job referencing the registered reward function under Custom scorer > Custom metrics\]](http://docs.aws.amazon.com/id_id/sagemaker/latest/dg/images/custom-scorer-benchmark-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
from sagemaker.ai_registry.evaluator import Evaluator
from sagemaker.ai_registry.air_constants import REWARD_FUNCTION

# Register the local reward function file as a reusable evaluator
evaluator = Evaluator.create(
    name="your-reward-function-name",
    source="/path_to_local/custom_lambda_function.py",
    type=REWARD_FUNCTION
)
```

```
from sagemaker.train.evaluate import CustomScorerEvaluator

# Pass the registered reward function as the evaluator
custom_scorer_evaluator = CustomScorerEvaluator(
    evaluator=evaluator,
    dataset="s3://<bucket-name>/<prefix>/<dataset-file>.jsonl",
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)

execution = custom_scorer_evaluator.evaluate()
```

------