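Evaluation recipe examples

These recipes let you evaluate the fundamental capabilities of Amazon Nova models across a comprehensive suite of text-only benchmarks. They are provided in the format xxx_general_text_benchmark_eval.yaml.

These recipes let you bring your own dataset for benchmarking and compare model outputs to reference answers using different types of metrics. They are provided in the format xxx_bring_your_own_dataset_eval.yaml.

The bring your own dataset requirements are as follows.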

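File format requirements
- You must include a single gen_qa.jsonl file containing the evaluation examples.
- You must upload the dataset to an S3 location that the SageMaker training job can access.
- The file must follow the schema format required for a general Q&A dataset.
Schema format requirements - Each line in the JSONL file must be a JSON object with the following fields:
- query: (Required) String containing the question or instruction that needs an answer
- response: (Required) String containing the expected model output
- system: (Optional) String containing the system prompt that sets the behavior, role, or personality of the AI model before it processes the query
- metadata: (Optional) String containing metadata associated with the entry, for tagging purposes

The following are example bring your own dataset entries.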


{
   "system":"You are a english major with top marks in class who likes to give minimal word responses: ",
   "query":"What is the symbol that ends the sentence as a question",
   "response":"?"
}
{
   "system":"You are a pattern analysis specialist that provides succinct answers: ",
   "query":"What is the next number in this series? 1, 2, 4, 8, 16, ?",
   "response":"32"
}
{
   "system":"You have great attention to detail that follows instructions accurately: ",
   "query":"Repeat only the last two words of the following: I ate a hamburger today and it was kind of dry",
   "response":"of dry"
}
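The schema above can be checked before upload with a short script. The following is a minimal sketch, not part of the evaluation tooling: the function names are hypothetical, and it only checks the field rules listed above (it does not check context length or reject extra fields).

```python
import json

REQUIRED_FIELDS = ("query", "response")
OPTIONAL_FIELDS = ("system", "metadata")


def validate_gen_qa_line(line: str) -> dict:
    """Parse one JSONL line and check it against the field rules listed above."""
    record = json.loads(line)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(record, dict):
        raise ValueError("each line must be a JSON object")
    for field in REQUIRED_FIELDS:
        if not isinstance(record.get(field), str):
            raise ValueError(f"missing or non-string required field: {field}")
    for field in OPTIONAL_FIELDS:
        if field in record and not isinstance(record[field], str):
            raise ValueError(f"optional field must be a string: {field}")
    return record


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as one compact JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n"


if __name__ == "__main__":
    sample = {
        "system": "You are a pattern analysis specialist that provides succinct answers: ",
        "query": "What is the next number in this series? 1, 2, 4, 8, 16, ?",
        "response": "32",
    }
    text = to_jsonl([sample])
    for line in text.splitlines():
        validate_gen_qa_line(line)
    print(text, end="")
```

Writing the file through json.dumps keeps each record on a single line, which is what the JSONL format expects.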

To use your custom dataset, modify your evaluation recipe to include the following required fields. Do not change any of the content.


evaluation:
  task: gen_qa
  strategy: gen_qa
  metric: all

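The following limitations apply:

Only one JSONL file is allowed per evaluation.
The file must strictly follow the defined schema.
Context length limit: For each sample in the dataset, the context length (including the system and query prompts) must be less than 3.5k.

Amazon Nova LLM-as-a-Judge is a model evaluation feature that lets customers compare the quality of responses from one model against a baseline model's responses on a custom dataset. It takes a dataset containing prompts, baseline responses, and challenger responses, and uses a Nova Judge model to provide a win-rate metric based on Bradley-Terry probability through pairwise comparisons.

The recipes are provided in the format xxx_llm_judge_eval.yaml.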

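The LLM-as-a-Judge requirements are as follows.

File format requirements
- Include a single llm_judge.jsonl file containing the evaluation examples. The file name must be llm_judge.jsonl.
- You must upload the dataset to an S3 location that SageMaker HyperPod RIG can access.
- The file must follow the schema format required for the llm_judge.jsonl dataset.
- In the input dataset, all records must be under 12k context length.
Schema format requirements - Each line in the JSONL file must be a JSON object with the following fields:
- prompt: (Required) String containing the prompt for the generated response.
- response_A: String containing the baseline response.
- response_B: String containing the alternative response to be compared with the baseline response.

The following are example LLM-as-a-Judge entries.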


{
"prompt": "What is the most effective way to combat climate change?",
"response_A": "The most effective way to combat climate change is through a combination of transitioning to renewable energy sources and implementing strict carbon pricing policies. This creates economic incentives for businesses to reduce emissions while promoting clean energy adoption.",
"response_B": "We should focus on renewable energy. Solar and wind power are good. People should drive electric cars. Companies need to pollute less."
}
{
"prompt": "Explain how a computer's CPU works",
"response_A": "CPU is like brain of computer. It does math and makes computer work fast. Has lots of tiny parts inside.",
"response_B": "A CPU (Central Processing Unit) functions through a fetch-execute cycle, where instructions are retrieved from memory, decoded, and executed through its arithmetic logic unit (ALU). It coordinates with cache memory and registers to process data efficiently using binary operations."
}
{
"prompt": "How does photosynthesis work?",
"response_A": "Plants do photosynthesis to make food. They use sunlight and water. It happens in leaves.",
"response_B": "Photosynthesis is a complex biochemical process where plants convert light energy into chemical energy. They utilize chlorophyll to absorb sunlight, combining CO2 and water to produce glucose and oxygen through a series of chemical reactions in chloroplasts."
}
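The entries above are shown across several lines for readability. As a rough pre-upload check of the three fields described above, the sketch below scans the lines of a JSONL file and reports any that are not a single JSON object with string prompt, response_A, and response_B fields. The function name is hypothetical and the check is illustrative only; it does not measure context length.

```python
import json

JUDGE_FIELDS = ("prompt", "response_A", "response_B")


def find_problems(lines):
    """Return (line_number, message) pairs for lines that break the schema."""
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, f"not valid JSON: {err.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        for field in JUDGE_FIELDS:
            if not isinstance(record.get(field), str):
                problems.append((number, f"missing or non-string field: {field}"))
    return problems


if __name__ == "__main__":
    lines = [
        '{"prompt": "How does photosynthesis work?", "response_A": "a", "response_B": "b"}',
        '{"prompt": "Explain how a CPU works", "response_A": "a"}',
    ]
    for number, message in find_problems(lines):
        print(f"line {number}: {message}")
```

A pretty-printed record split over several lines is reported as invalid JSON, which is the intended behavior for a JSONL file.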

To use your custom dataset, modify your evaluation recipe to include the following required fields. Do not change any of the content.


evaluation:
  task: llm_judge
  strategy: judge
  metric: all

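The following limitations apply:

Only one JSONL file is allowed per evaluation.
The file must strictly follow the defined schema.
The Amazon Nova Judge model is the same across all model family specifications (that is, Lite, Micro, and Pro).
Custom judge models are not currently supported.
Context length limit: For each sample in the dataset, the context length (including the system and query prompts) must be less than 7k.

Nova LLM Judge for multimodal (image), called Nova MM_LLM Judge, is a model evaluation feature that uses a custom dataset to compare the quality of responses from one model against a baseline model's responses. It accepts a dataset containing prompts, baseline responses, challenger responses, and images in the form of Base64-encoded strings, and uses a Nova Judge model to provide a win-rate metric based on Bradley-Terry probability through pairwise comparisons. Recipe format: xxx_mm_llm_judge_eval.yaml.

Nova LLM dataset requirements

File format:

A single mm_llm_judge.jsonl file containing the evaluation examples. The file name must be exactly llm_judge.jsonl.
You must upload the dataset to an S3 location that SageMaker training jobs can access.
The file must follow the schema format required for the mm_llm_judge dataset.
In the input dataset, all records must be under 12k context length, excluding the image attribute.

Schema format - Each line in the .jsonl file must be a JSON object with the following fields.

Required fields:

prompt: String containing the prompt for the generated response.

images: Array containing a list of objects, each with a data attribute (a Base64-encoded image string).

response_A: String containing the baseline response.

response_B: String containing the alternative response to be compared with the baseline response.

Example input

For readability, the following example includes new lines and indentation, but in the actual dataset each record must be on a single line.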


{
  "prompt": "what is in the image?",
  "images": [
    {
      "data": "data:image/jpeg;Base64,/9j/2wBDAAQDAwQDAwQEAwQFBAQFBgo..."
    }
  ],
  "response_A": "a dog.",
  "response_B": "a cat."
}
{
  "prompt": "how many animals in each of the images?",
  "images": [
    {
      "data": "data:image/jpeg;Base64,/9j/2wBDAAQDAwQDAwQEAwQFBAQFBgo..."
    },
    {
      "data": "data:image/jpeg;Base64,/DKEafe3gihn..."
    }
  ],
  "response_A": "The first image contains one cat and the second image contains one dog",
  "response_B": "The first image has one animal and the second has one animal"
}
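To show how a record like the ones above might be assembled, here is a minimal sketch that Base64-encodes raw image bytes and emits one single-line JSON record. The function names are hypothetical. The default prefix simply copies the "data:image/jpeg;Base64," string from the example above and assumes JPEG input; adjust it if your images or the service expect a different prefix.

```python
import base64
import json


def image_to_data_string(image_bytes: bytes, prefix: str = "data:image/jpeg;Base64,") -> str:
    """Base64-encode raw image bytes and prepend the prefix used in the example above."""
    return prefix + base64.b64encode(image_bytes).decode("ascii")


def build_mm_record(prompt: str, images: list[bytes], response_a: str, response_b: str) -> str:
    """Return one single-line JSON record with the four required fields."""
    record = {
        "prompt": prompt,
        "images": [{"data": image_to_data_string(raw)} for raw in images],
        "response_A": response_a,
        "response_B": response_b,
    }
    # json.dumps without an indent argument never emits newlines,
    # so the result is safe to write as one JSONL line.
    return json.dumps(record)


if __name__ == "__main__":
    # Placeholder bytes (the first three bytes of a JPEG), not a real image.
    fake_jpeg = b"\xff\xd8\xff"
    print(build_mm_record("what is in the image?", [fake_jpeg], "a dog.", "a cat."))
```

In practice the bytes would come from reading each image file in binary mode before calling build_mm_record.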

To use your custom dataset, modify your evaluation recipe to include the following required fields. Do not change any of the content.


evaluation:
  task: mm_llm_judge
  strategy: judge
  metric: all

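Limitations

Only one .jsonl file is allowed per evaluation.
The file must strictly follow the defined schema.
Nova MM Judge models support image references only.
Nova MM Judge models are the same across Amazon Nova Lite specifications.
(Custom delimiters are not currently supported.)
Amazon S3 image URIs are not supported.
In the input dataset, all records must be under 12k context length, excluding the image attribute.

Rubric Judge is an enhanced LLM-as-a-Judge evaluation model built on Nova 2.0 Lite. Unlike the original judge model, which provides only preference verdicts (A>B, B>A, or tie), Rubric Judge dynamically generates custom evaluation criteria tailored to each prompt and assigns granular scores across multiple dimensions.

Key capabilities:

Dynamic criteria generation: Automatically creates relevant evaluation dimensions based on the input prompt
Weighted scoring: Assigns an importance weight to each criterion to reflect its relative significance
Granular assessment: Provides a detailed score for each criterion on a binary (true/false) or scale (1 to 5) basis
Quality metrics: Calculates continuous quality scores (on a 0 to 1 scale) that quantify the magnitude of the difference between responses

Example criterion generated by the model: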


price_validation:
  description: "The response includes validation to ensure price is a positive value."
  type: "scale"
  weight: 0.3

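The model evaluates both responses against all generated criteria, then uses these criterion-level scores to inform its final preference verdict.

Recipe configuration

Rubric Judge recipe

Enable Rubric Judge by setting task: rubric_llm_judge in your recipe.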


run:
  name: nova-eval-job-name                              # [MODIFIABLE] Unique identifier for your evaluation job
  model_type: amazon.nova-2-lite-v1:0:256k              # [FIXED] Rubric Judge model type
  model_name_or_path: "nova-lite-2/prod"                # [FIXED] Path to model checkpoint or identifier
  replicas: 1                                           # [MODIFIABLE] Number of replicas for SageMaker Training job
  data_s3_path: ""                                      # [FIXED] Leave empty for SageMaker Training job
  output_s3_path: ""                                    # [FIXED] Leave empty for SageMaker Training job

evaluation:
  task: rubric_llm_judge                                # [FIXED] Evaluation task - enables Rubric Judge
  strategy: judge                                       # [FIXED] Evaluation strategy
  metric: all                                           # [FIXED] Metric calculation method

inference:
  max_new_tokens: 12000                                 # [MODIFIABLE] Maximum tokens to generate
  top_k: -1                                             # [MODIFIABLE] Top-k sampling parameter
  top_p: 1.0                                            # [MODIFIABLE] Nucleus sampling parameter
  temperature: 0                                        # [MODIFIABLE] Sampling temperature (0 = deterministic)

Original LLM-as-a-Judge recipe (for comparison)

The original judge model uses task: llm_judge.


run:
  name: eval-job-name                                   # [MODIFIABLE] Unique identifier for your evaluation job
  model_type: amazon.nova-micro-v1:0:128k               # [FIXED] Model type
  model_name_or_path: "nova-micro/prod"                 # [FIXED] Path to model checkpoint or identifier
  replicas: 1                                           # [MODIFIABLE] Number of replicas for SageMaker Training job
  data_s3_path: ""                                      # [FIXED] Leave empty for SageMaker Training job
  output_s3_path: ""                                    # [FIXED] Leave empty for SageMaker Training job

evaluation:
  task: llm_judge                                       # [FIXED] Original judge task
  strategy: judge                                       # [FIXED] Evaluation strategy
  metric: all                                           # [FIXED] Metric calculation method

inference:
  max_new_tokens: 12000                                 # [MODIFIABLE] Maximum tokens to generate
  top_k: -1                                             # [MODIFIABLE] Top-k sampling parameter
  top_p: 1.0                                            # [MODIFIABLE] Nucleus sampling parameter
  temperature: 0                                        # [MODIFIABLE] Sampling temperature (0 = deterministic)

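Input dataset format

The input dataset format is identical to that of the original judge model.

Required fields:

prompt: String containing the input prompt and instructions
response_A: String containing the baseline model output
response_B: String containing the customized model output

Example dataset (JSONL format):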


{"prompt": "What is the most effective way to combat climate change?", "response_A": "The most effective way to combat climate change is through a combination of transitioning to renewable energy sources and implementing strict carbon pricing policies. This creates economic incentives for businesses to reduce emissions while promoting clean energy adoption.", "response_B": "We should focus on renewable energy. Solar and wind power are good. People should drive electric cars. Companies need to pollute less."}
{"prompt": "Explain how a computer's CPU works", "response_A": "CPU is like brain of computer. It does math and makes computer work fast. Has lots of tiny parts inside.", "response_B": "A CPU (Central Processing Unit) functions through a fetch-execute cycle, where instructions are retrieved from memory, decoded, and executed through its arithmetic logic unit (ALU). It coordinates with cache memory and registers to process data efficiently using binary operations."}
{"prompt": "How does photosynthesis work?", "response_A": "Plants do photosynthesis to make food. They use sunlight and water. It happens in leaves.", "response_B": "Photosynthesis is a complex biochemical process where plants convert light energy into chemical energy. They utilize chlorophyll to absorb sunlight, combining CO2 and water to produce glucose and oxygen through a series of chemical reactions in chloroplasts."}

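Formatting requirements

Each entry must be a single-line JSON object
Separate entries with newlines
Follow the exact field naming shown in the examples

Evaluation output

Output structure

Rubric Judge produces enhanced evaluation metrics compared to the original judge model.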


{
  "config_general": {
    "lighteval_sha": "string",
    "num_fewshot_seeds": "int",
    "max_samples": "int | null",
    "job_id": "int",
    "start_time": "float",
    "end_time": "float",
    "total_evaluation_time_secondes": "string",
    "model_name": "string",
    "model_sha": "string",
    "model_dtype": "string | null",
    "model_size": "string"
  },
  "results": {
    "custom|rubric_llm_judge_judge|0": {
      "a_scores": "float",
      "a_scores_stderr": "float",
      "b_scores": "float",
      "b_scores_stderr": "float",
      "ties": "float",
      "ties_stderr": "float",
      "inference_error": "float",
      "inference_error_stderr": "float",
      "score": "float",
      "score_stderr": "float",
      "weighted_score_A": "float",
      "weighted_score_A_stderr": "float",
      "weighted_score_B": "float",
      "weighted_score_B_stderr": "float",
      "score_margin": "float",
      "score_margin_stderr": "float",
      "winrate": "float",
      "lower_rate": "float",
      "upper_rate": "float"
    }
  },
  "versions": {
    "custom|rubric_llm_judge_judge|0": "int"
  }
}
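As an illustration of reading this structure, the sketch below pulls a few metrics out of a results document shaped like the one above. The function name is hypothetical, the task key is taken from the structure shown, and the numbers in the usage example are made-up placeholders rather than real evaluation results.

```python
import json

# Metrics picked for this illustration; any key from the "results" object works.
SELECTED_METRICS = ("weighted_score_A", "weighted_score_B", "score_margin", "winrate")


def summarize_results(results_json: str,
                      task_key: str = "custom|rubric_llm_judge_judge|0") -> dict:
    """Return the selected metrics for one task from a results JSON string."""
    task_results = json.loads(results_json)["results"][task_key]
    return {name: task_results[name] for name in SELECTED_METRICS}


if __name__ == "__main__":
    # Placeholder values for demonstration only.
    example = json.dumps({
        "results": {
            "custom|rubric_llm_judge_judge|0": {
                "weighted_score_A": 0.65,
                "weighted_score_B": 0.78,
                "score_margin": -0.13,
                "winrate": 0.6,
            }
        }
    })
    for name, value in summarize_results(example).items():
        print(f"{name}: {value}")
```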

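New metrics in Rubric Judge

The following six metrics are unique to Rubric Judge and provide granular quality assessment.

Metric	Description
weighted_score_A	Average normalized quality score for response_A across all model-generated evaluation criteria. Scores are weighted by criterion importance and normalized to a 0 to 1 scale (higher = better quality).
weighted_score_A_stderr	Standard error of the mean for weighted_score_A, indicating statistical uncertainty
weighted_score_B	Average normalized quality score for response_B across all model-generated evaluation criteria. Scores are weighted by criterion importance and normalized to a 0 to 1 scale (higher = better quality).
weighted_score_B_stderr	Standard error of the mean for weighted_score_B, indicating statistical uncertainty
score_margin	Difference between the weighted scores, calculated as weighted_score_A - weighted_score_B. Range: -1.0 to 1.0. Positive = response_A is better, negative = response_B is better, near zero = similar quality
score_margin_stderr	Standard error of the mean for score_margin, indicating uncertainty in the quality-difference measurement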

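Understanding the weighted score metrics

Purpose: Weighted scores provide continuous quality measurements that complement the binary preference verdicts, giving deeper insight into model performance.

Key differences from the original judge:

Original judge: Outputs only discrete preferences (A>B, B>A, A=B)
Rubric Judge: Outputs both preferences and continuous quality scores (on a 0 to 1 scale) based on custom criteria

Interpreting score_margin:

score_margin = -0.128: response_B scored 12.8 percentage points higher than response_A
|score_margin| < 0.1: Narrow quality difference (close decision)
|score_margin| > 0.2: Clear quality difference (confident decision)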
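The thresholds above can be turned into a small helper for reading results. This is only a sketch: the function name and the wording of the labels are hypothetical, and the band between 0.1 and 0.2 is not described in the text above, so it is reported as such.

```python
def describe_margin(score_margin: float) -> str:
    """Describe a score_margin value using the thresholds listed above."""
    magnitude = abs(score_margin)
    if magnitude < 0.1:
        strength = "narrow quality difference (close decision)"
    elif magnitude > 0.2:
        strength = "clear quality difference (confident decision)"
    else:
        # The text above gives no label for 0.1 <= |margin| <= 0.2.
        strength = "intermediate difference (band not described in the text)"

    if score_margin > 0:
        leader = "response_A scored higher"
    elif score_margin < 0:
        leader = "response_B scored higher"
    else:
        leader = "both responses scored the same"
    return f"{leader}; {strength}"


if __name__ == "__main__":
    for margin in (-0.128, 0.05, 0.31):
        print(margin, "->", describe_margin(margin))
```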

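Use cases:

Model improvement: Identify specific areas where your model underperforms
Quality quantification: Measure the size of performance gaps alongside win rates
Confidence assessment: Distinguish close decisions from clear quality differences

Important

The final verdict is still based on the judge model's explicit preference label, which preserves holistic reasoning and ensures proper position-bias mitigation through forward/backward evaluation. Weighted scores serve as an observability tool, not as a replacement for the primary verdict.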

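Calculation methodology

Weighted scores are computed through the following process:

Criterion data extraction: Parse the judge's YAML output to extract criterion scores and weights
Score normalization:
- Scale-type criteria (1 to 5): Normalize to 0 to 1 by computing (score - 1) / 4
- Binary criteria (true/false): Convert to 1.0/0.0
Weight application: Multiply each normalized score by its criterion weight
Aggregation: Sum all weighted scores for each response
Margin calculation: Compute score_margin = weighted_score_A - weighted_score_B

Example: If response_A has a weighted sum of 0.65 and response_B has a weighted sum of 0.78, score_margin is -0.13, meaning response_B scores 13 percentage points higher on the 0 to 1 scale across all weighted criteria.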
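The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the service's implementation: the criterion dictionaries, the "binary" type name, and the scores are hypothetical, and the weights are assumed to sum to 1 so that the result stays on the 0 to 1 scale.

```python
def normalize(criterion_type: str, score) -> float:
    """Map a raw criterion score onto the 0 to 1 range."""
    if criterion_type == "scale":      # raw score from 1 to 5
        return (score - 1) / 4
    if criterion_type == "binary":     # hypothetical type name for true/false criteria
        return 1.0 if score else 0.0
    raise ValueError(f"unknown criterion type: {criterion_type}")


def weighted_score(criteria: list[dict]) -> float:
    """Sum of weight * normalized score over all criteria for one response."""
    return sum(c["weight"] * normalize(c["type"], c["score"]) for c in criteria)


if __name__ == "__main__":
    # Hypothetical criterion scores for two responses; weights sum to 1.
    response_a = [
        {"type": "scale", "weight": 0.3, "score": 4},
        {"type": "binary", "weight": 0.5, "score": True},
        {"type": "scale", "weight": 0.2, "score": 1},
    ]
    response_b = [
        {"type": "scale", "weight": 0.3, "score": 5},
        {"type": "binary", "weight": 0.5, "score": False},
        {"type": "scale", "weight": 0.2, "score": 3},
    ]
    score_a = weighted_score(response_a)
    score_b = weighted_score(response_b)
    score_margin = score_a - score_b
    print(f"A={score_a:.3f} B={score_b:.3f} margin={score_margin:.3f}")
```

With these made-up inputs, response_A totals 0.3 × 0.75 + 0.5 × 1.0 + 0.2 × 0.0 = 0.725 and response_B totals 0.3 × 1.0 + 0.5 × 0.0 + 0.2 × 0.5 = 0.4, so the margin is positive and favors response_A.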

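Reasoning model support

Reasoning model support enables evaluation with reasoning-capable Nova models that perform explicit internal reasoning before generating a final response. This feature uses API-level control through the reasoning_effort parameter to dynamically enable or disable reasoning, which can improve response quality for complex analytical tasks.

Supported models:

amazon.nova-2-lite-v1:0:256k

Recipe configuration

Enable reasoning by adding the reasoning_effort parameter to the inference section of your recipe.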


run:
  name: eval-job-name                                    # [MODIFIABLE] Unique identifier for your evaluation job
  model_type: amazon.nova-2-lite-v1:0:256k               # [FIXED] Must be a reasoning-supported model
  model_name_or_path: nova-lite-2/prod                   # [FIXED] Path to model checkpoint or identifier
  replicas: 1                                            # [MODIFIABLE] Number of replicas for SageMaker Training job
  data_s3_path: ""                                       # [MODIFIABLE] Leave empty for SageMaker Training job; optional for  job
  output_s3_path: ""                                     # [MODIFIABLE] Output path for  job (not compatible with SageMaker Training jobs)

evaluation:
  task: mmlu                                             # [MODIFIABLE] Evaluation task
  strategy: generate                                     # [MODIFIABLE] Evaluation strategy
  metric: all                                            # [MODIFIABLE] Metric calculation method

inference:
  reasoning_effort: high                                 # [MODIFIABLE] Enables reasoning mode; options: low/medium/high or null to disable
  max_new_tokens: 200                                    # [MODIFIABLE] Maximum tokens to generate
  top_k: 50                                              # [MODIFIABLE] Top-k sampling parameter
  top_p: 1.0                                             # [MODIFIABLE] Nucleus sampling parameter
  temperature: 0                                         # [MODIFIABLE] Sampling temperature (0 = deterministic)

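Using the reasoning_effort parameter

The reasoning_effort parameter controls the reasoning behavior of reasoning-capable models.

Prerequisites:

Model compatibility: Set reasoning_effort only when model_type specifies a reasoning-capable model (currently amazon.nova-2-lite-v1:0:256k)
Error handling: Using reasoning_effort with an unsupported model fails with ConfigValidationError: "Reasoning mode is enabled but model '{model_type}' does not support reasoning. Please use a reasoning-capable model or disable reasoning mode."

Available options:

Option	Behavior	Token limit	Use case
null (default)	Disables reasoning mode	N/A	Standard evaluation without reasoning overhead
low	Enables reasoning with constraints	4,000 tokens for internal reasoning	Scenarios that need concise reasoning; optimized for speed and cost
high	Enables reasoning without constraints	No token limit on internal reasoning	Complex problems that need extensive analysis and step-by-step reasoning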
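The compatibility rule above can be mirrored in a local pre-flight check before a job is submitted. The sketch below is an assumption-laden illustration, not the service's validation code: ConfigValidationError is defined here as a local stand-in class, the function name is hypothetical, and the allowed values are taken from the options named in this section and in the recipe comment (low, medium, high, or null).

```python
# Reasoning-capable model types named in this section.
REASONING_MODELS = {"amazon.nova-2-lite-v1:0:256k"}
# None stands for a null or omitted reasoning_effort value.
ALLOWED_EFFORTS = {None, "low", "medium", "high"}


class ConfigValidationError(ValueError):
    """Local stand-in for the error named in the text above."""


def check_reasoning_config(model_type: str, reasoning_effort=None) -> None:
    """Raise ConfigValidationError if the recipe settings are inconsistent."""
    if reasoning_effort not in ALLOWED_EFFORTS:
        raise ConfigValidationError(f"unknown reasoning_effort: {reasoning_effort!r}")
    if reasoning_effort is not None and model_type not in REASONING_MODELS:
        raise ConfigValidationError(
            f"Reasoning mode is enabled but model '{model_type}' does not support "
            "reasoning. Please use a reasoning-capable model or disable reasoning mode."
        )


if __name__ == "__main__":
    check_reasoning_config("amazon.nova-2-lite-v1:0:256k", "high")  # passes
    check_reasoning_config("amazon.nova-micro-v1:0:128k", None)     # passes
    try:
        check_reasoning_config("amazon.nova-micro-v1:0:128k", "low")
    except ConfigValidationError as err:
        print(err)
```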

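When to enable reasoning

Use reasoning mode (low, medium, or high) for:

Complex problem-solving tasks (mathematics, logic puzzles, coding)
Multi-step analytical questions that require intermediate reasoning
Tasks where detailed explanations or step-by-step thinking improve accuracy
Scenarios where response quality takes priority over speed

Use non-reasoning mode (omit the parameter) for:

Simple Q&A or factual queries
Creative writing tasks
Cases where faster response times are critical
Performance benchmarking where reasoning overhead should be excluded
Cost optimization when reasoning does not improve task performance

Troubleshooting

Error: 'Reasoning mode is enabled but model does not support reasoning'

Cause: The reasoning_effort parameter is set to a non-null value, but the specified model_type does not support reasoning.

Resolution:

Verify that the model type is amazon.nova-2-lite-v1:0:256k
If you are using a different model, either switch to a reasoning-capable model or remove the reasoning_effort parameter from your recipe.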