프리셋 및 사용자 지정 득점자를 사용하여 평가

사용자 지정 득점자 평가 유형을 사용하는 경우 SageMaker Evaluation은 volcengine/verl RL 훈련 라이브러리에서 가져온 두 개의 기본 제공 득점자('보상 함수'라고도 함) 프라임 수학 및 프라임 코드 또는 Lambda 함수로 구현된 사용자 지정 득점자를 지원합니다.

기본 제공 Scorer

프라임 수학

프라임 수학 채점자는 수학 질문을 프롬프트/쿼리로 포함하고 올바른 답변을 실측 정보로 포함하는 사용자 지정 JSONL 항목 데이터 세트를 기대합니다. 데이터 세트는에 언급된 지원되는 형식 중 하나일 수 있습니다BYOD(Bring-Your-Own-Dataset) 작업에 지원되는 데이터 세트 형식.

예제 데이터 세트 항목(명확성을 위해 확장됨):


{
    "system":"You are a math expert: ",
    "query":"How many vertical asymptotes does the graph of $y=\\frac{2}{x^2+x-6}$ have?",
    "response":"2" # Ground truth aka correct answer
}

프라임 코드

프라임 코드 득점자는 metadata 필드에 지정된 코딩 문제 및 테스트 사례를 포함하는 항목의 사용자 지정 JSONL 데이터 세트를 기대합니다. 테스트 케이스를 각 항목의 예상 함수 이름, 샘플 입력 및 예상 출력으로 구성합니다.

예제 데이터 세트 항목(명확성을 위해 확장됨):


{
    "system":"\\nWhen tackling complex reasoning tasks, you have access to the following actions. Use them as needed to progress through your thought process.\\n\\n[ASSESS]\\n\\n[ADVANCE]\\n\\n[VERIFY]\\n\\n[SIMPLIFY]\\n\\n[SYNTHESIZE]\\n\\n[PIVOT]\\n\\n[OUTPUT]\\n\\nYou should strictly follow the format below:\\n\\n[ACTION NAME]\\n\\n# Your action step 1\\n\\n# Your action step 2\\n\\n# Your action step 3\\n\\n...\\n\\nNext action: [NEXT ACTION NAME]\\n\\n",
    "query":"A number N is called a factorial number if it is the factorial of a positive integer. For example, the first few factorial numbers are 1, 2, 6, 24, 120,\\nGiven a number N, the task is to return the list/vector of the factorial numbers smaller than or equal to N.\\nExample 1:\\nInput: N = 3\\nOutput: 1 2\\nExplanation: The first factorial number is \\n1 which is less than equal to N. The second \\nnumber is 2 which is less than equal to N,\\nbut the third factorial number is 6 which \\nis greater than N. So we print only 1 and 2.\\nExample 2:\\nInput: N = 6\\nOutput: 1 2 6\\nExplanation: The first three factorial \\nnumbers are less than equal to N but \\nthe fourth factorial number 24 is \\ngreater than N. So we print only first \\nthree factorial numbers.\\nYour Task:  \\nYou don't need to read input or print anything. Your task is to complete the function factorialNumbers() which takes an integer N as an input parameter and return the list/vector of the factorial numbers smaller than or equal to N.\\nExpected Time Complexity: O(K), Where K is the number of factorial numbers.\\nExpected Auxiliary Space: O(1)\\nConstraints:\\n1<=N<=10^{18}\\n\\nWrite Python code to solve the problem. Present the code in \\n```python\\nYour code\\n```\\nat the end.",
    "response": "", # Dummy string for ground truth. Provide a value if you want NLP metrics like ROUGE, BLEU, and F1.
    ### Define test cases in metadata field
    "metadata": {
        "fn_name": "factorialNumbers",
        "inputs": ["5"],
        "outputs": ["[1, 2]"]
    }
}

사용자 지정 득점자(자체 지표 가져오기)

필요에 맞는 사용자 지정 지표를 계산할 수 있는 사용자 지정 사후 처리 로직을 사용하여 모델 평가 워크플로를 완전히 사용자 지정할 수 있습니다. 사용자 지정 득점자를 모델 응답을 수락하고 보상 점수를 반환하는 AWS Lambda 함수로 구현해야 합니다.

샘플 Lambda 입력 페이로드

사용자 지정 AWS Lambda는 OpenAI 형식의 입력을 예상합니다. 예제:


{
    "id": "123",
    "messages": [
        {
            "role": "user",
            "content": "Do you have a dedicated security team?"
        },
        {
            "role": "assistant",
            "content": "As an AI developed by Amazon, I do not have a dedicated security team..."
        }
    ],
    "reference_answer": {
        "compliant": "No",
        "explanation": "As an AI developed by Company, I do not have a traditional security team..."
    }
}

샘플 Lambda 출력 페이로드

SageMaker 평가 컨테이너는 Lambda 응답이 다음 형식을 따를 것으로 예상합니다.


{
    "id": str,                              # Same id as input sample
    "aggregate_reward_score": float,        # Overall score for the sample
    "metrics_list": [                       # OPTIONAL: Component scores
        {
            "name": str,                    # Name of the component score
            "value": float,                 # Value of the component score
            "type": str                     # "Reward" or "Metric"
        }
    ]
}

사용자 지정 Lambda 정의

샘플 입력 및 예상 출력이 포함된 완전히 구현된 사용자 지정 득점자의 예는 https://docs.aws.amazon.com/sagemaker/latest/dg/nova-implementing-reward-functions.html#nova-reward-llm-judge-example 확인할 수 있습니다.

다음 스켈레톤을 자체 함수의 시작점으로 사용합니다.


def lambda_handler(event, context):
    return lambda_grader(event)

def lambda_grader(samples: list[dict]) -> list[dict]:
    """
    Args:
        Samples: List of dictionaries in OpenAI format
            
        Example input:
        {
            "id": "123",
            "messages": [
                {
                    "role": "user",
                    "content": "Do you have a dedicated security team?"
                },
                {
                    "role": "assistant",
                    "content": "As an AI developed by Company, I do not have a dedicated security team..."
                }
            ],
            # This section is the same as your training dataset
            "reference_answer": {
                "compliant": "No",
                "explanation": "As an AI developed by Company, I do not have a traditional security team..."
            }
        }
        
    Returns:
        List of dictionaries with reward scores:
        {
            "id": str,                              # Same id as input sample
            "aggregate_reward_score": float,        # Overall score for the sample
            "metrics_list": [                       # OPTIONAL: Component scores
                {
                    "name": str,                    # Name of the component score
                    "value": float,                 # Value of the component score
                    "type": str                     # "Reward" or "Metric"
                }
            ]
        }
    """

입력 및 출력 필드

입력 필드

필드	설명	추가 참고 사항
id	샘플의 고유 식별자	출력에서 다시 반향됩니다. 문자열 형식
messages	정렬된 채팅 기록(OpenAI 형식)	메시지 객체 배열
messages[].role	메시지의 화자	일반적인 값: "user", "assistant", "system"
messages[].content	메시지의 텍스트 콘텐츠	일반 문자열
metadata	등급 지정에 도움이 되는 자유 양식 정보	객체, 훈련 데이터에서 전달되는 선택적 필드

출력 필드

출력 필드
필드	설명	추가 참고 사항
id	입력 샘플과 동일한 식별자	입력과 일치해야 함
aggregate_reward_score	샘플의 전체 점수	부동 소수점(예: 0.0~1.0 또는 태스크 정의 범위)
metrics_list	집계를 구성하는 구성 요소 점수	지표 객체의 배열

필수 권한

평가를 실행하는 데 사용하는 SageMaker 실행 역할에 AWS Lambda 권한이 있는지 확인합니다.


{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "lambda:InvokeFunction"
            ],
            "Resource": "arn:aws:lambda:region:account-id:function:function-name"
        }
    ]
}

AWS Lambda 함수의 실행 역할에 기본 Lambda 실행 권한과 다운스트림 AWS 호출에 필요할 수 있는 추가 권한이 있는지 확인합니다.


{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*"
    }
  ]
}

javascript가 브라우저에서 비활성화되거나 사용이 불가합니다.

AWS 설명서를 사용하려면 Javascript가 활성화되어야 합니다. 지침을 보려면 브라우저의 도움말 페이지를 참조하십시오.

문서 규칙

BYOD(Bring-Your-Own-Dataset) 작업에 지원되는 데이터 세트 형식

모델 배포