Rubric Based Judge

Overview

Rubric Judge is an enhanced LLM-as-a-judge evaluation model built on Nova 2.0 Lite. Unlike the original judge model, which only provides preference verdicts (A>B, B>A, or tie), Rubric Judge dynamically generates custom evaluation criteria tailored to each prompt and assigns granular scores across multiple dimensions.

Key capabilities

  • Dynamic criteria generation – Automatically creates relevant evaluation dimensions based on the input prompt

  • Weighted scoring – Assigns an importance weight to each criterion to reflect its relative significance

  • Granular assessment – Provides detailed scores on a binary (true/false) or scale (1-5) basis for each criterion

  • Quality metrics – Calculates continuous quality scores (0-1 scale) that quantify the magnitude of differences between responses

Example criterion generated by the model

price_validation:
  description: "The response includes validation to ensure price is a positive value."
  type: "scale"
  weight: 0.3

The model evaluates both responses against all generated criteria, then uses these criterion-level scores to inform its final preference decision.

Recipe configuration

Rubric Judge recipe

Enable Rubric Judge by setting task: rubric_llm_judge in your recipe:

run:
  name: nova-eval-job-name # [MODIFIABLE] Unique identifier for your evaluation job
  model_type: amazon.nova-2-lite-v1:0:256k # [FIXED] Rubric Judge model type
  model_name_or_path: "nova-lite-2/prod" # [FIXED] Path to model checkpoint or identifier
  replicas: 1 # [MODIFIABLE] Number of replicas for SageMaker Training job
  data_s3_path: "" # [FIXED] Leave empty for SageMaker Training job
  output_s3_path: "" # [FIXED] Leave empty for SageMaker Training job

evaluation:
  task: rubric_llm_judge # [FIXED] Evaluation task - enables Rubric Judge
  strategy: judge # [FIXED] Evaluation strategy
  metric: all # [FIXED] Metric calculation method

inference:
  max_new_tokens: 12000 # [MODIFIABLE] Maximum tokens to generate
  top_k: -1 # [MODIFIABLE] Top-k sampling parameter
  top_p: 1.0 # [MODIFIABLE] Nucleus sampling parameter
  temperature: 0 # [MODIFIABLE] Sampling temperature (0 = deterministic)
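
If you generate recipes programmatically (for example, to produce variants with different inference settings), the minimal sketch below builds the same configuration as a Python dictionary and writes it to a YAML file with PyYAML. The output file name is a placeholder; the recipe content mirrors the example above.

# Minimal sketch: build the Rubric Judge recipe as a dict and write it to YAML.
# The job name and output file name are placeholders for your own setup.
import yaml  # PyYAML

recipe = {
    "run": {
        "name": "nova-eval-job-name",                   # [MODIFIABLE] unique job identifier
        "model_type": "amazon.nova-2-lite-v1:0:256k",   # [FIXED] Rubric Judge model type
        "model_name_or_path": "nova-lite-2/prod",       # [FIXED] model checkpoint or identifier
        "replicas": 1,                                  # [MODIFIABLE] number of replicas
        "data_s3_path": "",                             # [FIXED] leave empty
        "output_s3_path": "",                           # [FIXED] leave empty
    },
    "evaluation": {
        "task": "rubric_llm_judge",                     # [FIXED] enables Rubric Judge
        "strategy": "judge",                            # [FIXED]
        "metric": "all",                                # [FIXED]
    },
    "inference": {
        "max_new_tokens": 12000,                        # [MODIFIABLE]
        "top_k": -1,                                    # [MODIFIABLE]
        "top_p": 1.0,                                   # [MODIFIABLE]
        "temperature": 0,                               # [MODIFIABLE] 0 = deterministic
    },
}

# Preserve the key order shown in the documentation example.
with open("rubric_judge_recipe.yaml", "w") as f:
    yaml.safe_dump(recipe, f, sort_keys=False)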

Original LLM as a Judge recipe (for comparison)

The original judge model uses task: llm_judge:

run:
  name: eval-job-name # [MODIFIABLE] Unique identifier for your evaluation job
  model_type: amazon.nova-micro-v1:0:128k # [FIXED] Model type
  model_name_or_path: "nova-micro/prod" # [FIXED] Path to model checkpoint or identifier
  replicas: 1 # [MODIFIABLE] Number of replicas for SageMaker Training job
  data_s3_path: "" # [FIXED] Leave empty for SageMaker Training job
  output_s3_path: "" # [FIXED] Leave empty for SageMaker Training job

evaluation:
  task: llm_judge # [FIXED] Original judge task
  strategy: judge # [FIXED] Evaluation strategy
  metric: all # [FIXED] Metric calculation method

inference:
  max_new_tokens: 12000 # [MODIFIABLE] Maximum tokens to generate
  top_k: -1 # [MODIFIABLE] Top-k sampling parameter
  top_p: 1.0 # [MODIFIABLE] Nucleus sampling parameter
  temperature: 0 # [MODIFIABLE] Sampling temperature (0 = deterministic)

Input dataset format

The input dataset format is identical to the original judge model:

Required fields

  • prompt – String containing the input prompt and instructions

  • response_A – String containing the baseline model output

  • response_B – String containing the customized model output

Example dataset (JSONL format)

{"prompt": "What is the most effective way to combat climate change?", "response_A": "The most effective way to combat climate change is through a combination of transitioning to renewable energy sources and implementing strict carbon pricing policies. This creates economic incentives for businesses to reduce emissions while promoting clean energy adoption.", "response_B": "We should focus on renewable energy. Solar and wind power are good. People should drive electric cars. Companies need to pollute less."} {"prompt": "Explain how a computer's CPU works", "response_A": "CPU is like brain of computer. It does math and makes computer work fast. Has lots of tiny parts inside.", "response_B": "A CPU (Central Processing Unit) functions through a fetch-execute cycle, where instructions are retrieved from memory, decoded, and executed through its arithmetic logic unit (ALU). It coordinates with cache memory and registers to process data efficiently using binary operations."} {"prompt": "How does photosynthesis work?", "response_A": "Plants do photosynthesis to make food. They use sunlight and water. It happens in leaves.", "response_B": "Photosynthesis is a complex biochemical process where plants convert light energy into chemical energy. They utilize chlorophyll to absorb sunlight, combining CO2 and water to produce glucose and oxygen through a series of chemical reactions in chloroplasts."}

Format requirements

  • Each entry must be a single-line JSON object

  • Separate entries with newlines

  • Follow the exact field naming as shown in examples
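
As a convenience, the minimal sketch below writes a dataset that satisfies these requirements and checks the required fields before writing; the output file name and the placeholder record are illustrative.

# Minimal sketch: write a Rubric Judge input dataset in JSONL format.
import json

REQUIRED_FIELDS = ("prompt", "response_A", "response_B")

# Placeholder record; response_A is the baseline output, response_B the customized output.
records = [
    {
        "prompt": "What is the most effective way to combat climate change?",
        "response_A": "Baseline model output goes here.",
        "response_B": "Customized model output goes here.",
    },
]

# Each entry must be a single-line JSON object, with entries separated by newlines.
with open("judge_dataset.jsonl", "w") as f:
    for record in records:
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            raise ValueError(f"Record is missing required fields: {missing}")
        f.write(json.dumps(record) + "\n")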

Evaluation output

Output structure

Rubric Judge produces enhanced evaluation metrics compared to the original judge model:

{ "config_general": { "lighteval_sha": "string", "num_fewshot_seeds": "int", "max_samples": "int | null", "job_id": "int", "start_time": "float", "end_time": "float", "total_evaluation_time_secondes": "string", "model_name": "string", "model_sha": "string", "model_dtype": "string | null", "model_size": "string" }, "results": { "custom|rubric_llm_judge_judge|0": { "a_scores": "float", "a_scores_stderr": "float", "b_scores": "float", "b_scores_stderr": "float", "ties": "float", "ties_stderr": "float", "inference_error": "float", "inference_error_stderr": "float", "score": "float", "score_stderr": "float", "weighted_score_A": "float", "weighted_score_A_stderr": "float", "weighted_score_B": "float", "weighted_score_B_stderr": "float", "score_margin": "float", "score_margin_stderr": "float", "winrate": "float", "lower_rate": "float", "upper_rate": "float" } }, "versions": { "custom|rubric_llm_judge_judge|0": "int" } }

New metrics in Rubric Judge

The following six metrics are unique to Rubric Judge and provide granular quality assessment:

  • weighted_score_A – Average normalized quality score for response_A across all model-generated evaluation criteria. Scores are weighted by criterion importance and normalized to a 0-1 scale (higher = better quality)

  • weighted_score_A_stderr – Standard error of the mean for weighted_score_A, indicating statistical uncertainty

  • weighted_score_B – Average normalized quality score for response_B across all model-generated evaluation criteria. Scores are weighted by criterion importance and normalized to a 0-1 scale (higher = better quality)

  • weighted_score_B_stderr – Standard error of the mean for weighted_score_B, indicating statistical uncertainty

  • score_margin – Difference between weighted scores (calculated as weighted_score_A - weighted_score_B). Range: -1.0 to 1.0. Positive = response_A is better; negative = response_B is better; near zero = similar quality

  • score_margin_stderr – Standard error of the mean for score_margin, indicating uncertainty in the quality difference measurement

Understanding weighted score metrics

Purpose: Weighted scores provide continuous quality measurements that complement binary preference verdicts, enabling deeper insights into model performance.

Key differences from original judge

  • Original judge – Only outputs discrete preferences (A>B, B>A, A=B)

  • Rubric Judge – Outputs both preferences AND continuous quality scores (0-1 scale) based on custom criteria

Interpreting score_margin

  • score_margin = -0.128: response_B scored 12.8 percentage points higher than response_A

  • |score_margin| < 0.1: Narrow quality difference (close decision)

  • |score_margin| > 0.2: Clear quality difference (confident decision)
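
If you want to apply these rules of thumb in code, a minimal sketch follows; the helper name and the intermediate "moderate" band are illustrative additions, not part of the evaluation output schema.

# Illustrative helper: classify a score_margin value using the thresholds above.
def interpret_margin(score_margin: float) -> str:
    if score_margin > 0:
        leader = "response_A"
    elif score_margin < 0:
        leader = "response_B"
    else:
        leader = "neither response"
    gap = abs(score_margin)
    if gap < 0.1:
        band = "narrow quality difference (close decision)"
    elif gap > 0.2:
        band = "clear quality difference (confident decision)"
    else:
        band = "moderate quality difference"
    return f"{leader} leads; {band}"

print(interpret_margin(-0.128))  # response_B leads; moderate quality difference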

Use cases

  • Model improvement – Identify specific areas where your model underperforms

  • Quality quantification – Measure the magnitude of performance gaps, not just win/loss ratios

  • Confidence assessment – Distinguish between close decisions and clear quality differences

Important

Final verdicts are still based on the judge model's explicit preference labels to preserve holistic reasoning and ensure proper position bias mitigation through forward/backward evaluation. Weighted scores serve as observability tools, not as replacements for the primary verdict.

Calculation methodology

Weighted scores are computed through the following process:

  • Extract criterion data – Parse the judge's YAML output to extract criterion scores and weights

  • Normalize scores:

    • Scale-type criteria (1-5): Normalize to 0-1 by calculating (score - 1) / 4

    • Binary criteria (true/false): Convert to 1.0/0.0

  • Apply weights – Multiply each normalized score by its criterion weight

  • Aggregate – Sum all weighted scores for each response

  • Calculate margin – Compute score_margin = weighted_score_A - weighted_score_B

Example: If response_A has a weighted sum of 0.65 and response_B has 0.78, the score_margin would be -0.13, indicating response_B is 13 percentage points higher in quality across all weighted criteria.
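
For reference, a minimal Python sketch of this calculation, assuming the criterion types, weights, and per-response scores have already been parsed from the judge's YAML output; the rubric and scores below are hypothetical.

# Minimal sketch: compute weighted scores and score_margin from criterion-level scores.
def normalize(score, criterion_type):
    # Scale-type criteria (1-5) map to 0-1 via (score - 1) / 4; binary criteria map to 1.0/0.0.
    if criterion_type == "scale":
        return (score - 1) / 4
    return 1.0 if score else 0.0

def weighted_score(criteria, scores):
    # Multiply each normalized score by its criterion weight, then sum across criteria.
    return sum(
        spec["weight"] * normalize(scores[name], spec["type"])
        for name, spec in criteria.items()
    )

# Hypothetical rubric and scores, standing in for the judge's parsed YAML output.
criteria = {
    "price_validation": {"type": "scale", "weight": 0.3},
    "error_handling": {"type": "binary", "weight": 0.2},
    "code_clarity": {"type": "scale", "weight": 0.5},
}
scores_a = {"price_validation": 4, "error_handling": True, "code_clarity": 3}
scores_b = {"price_validation": 5, "error_handling": False, "code_clarity": 5}

weighted_score_A = weighted_score(criteria, scores_a)
weighted_score_B = weighted_score(criteria, scores_b)
score_margin = weighted_score_A - weighted_score_B

# For these hypothetical inputs: 0.675, 0.8, -0.125 (response_B is ahead).
print(round(weighted_score_A, 3), round(weighted_score_B, 3), round(score_margin, 3))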