


# Open weight model customization
<a name="model-customize-open-weight"></a>

This section walks you through getting started with customizing open weight models.

**Topics**
+ [Prerequisites](model-customize-open-weight-prereq.md)
+ [Create assets for model customization in the UI](model-customize-open-weight-create-assets-ui.md)
+ [Submitting model customization jobs](model-customize-open-weight-job.md)
+ [Submitting a model evaluation job](model-customize-open-weight-evaluation.md)
+ [Model deployment](model-customize-open-weight-deployment.md)
+ [Sample datasets and evaluators](model-customize-open-weight-samples.md)

# Prerequisites
<a name="model-customize-open-weight-prereq"></a>

Before you begin, complete the following prerequisites:
+ Onboard to a SageMaker AI domain with access to Studio. If you don't have permissions to set up Studio as the default experience for your domain, contact your administrator. For more information, see the [Amazon SageMaker AI domain overview](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-studio-onboard.html).
+ Update the AWS CLI by following the steps in [Installing the current AWS CLI version](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv1.html#install-tool-bundled).
+ On your local machine, run `aws configure` and provide your AWS credentials. For information about AWS credentials, see [Understanding and getting your AWS credentials](https://docs.aws.amazon.com/IAM/latest/UserGuide/security-creds.html).

## Required IAM permissions
<a name="model-customize-open-weight-iam"></a>

SageMaker AI model customization requires that you add the appropriate permissions to your SageMaker AI domain execution role. To do this, you can create an inline IAM permissions policy and attach it to the IAM role. For information about adding policies, see [Adding and removing IAM identity permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html) in the *AWS Identity and Access Management User Guide*.

```
{
    "Version": "2012-10-17",		 	 	 
    "Statement": [
        {
            "Sid": "AllowNonAdminStudioActions",
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreatePresignedDomainUrl",
                "sagemaker:DescribeDomain",
                "sagemaker:DescribeUserProfile",
                "sagemaker:DescribeSpace",
                "sagemaker:ListSpaces",
                "sagemaker:DescribeApp",
                "sagemaker:ListApps"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:*:domain/*",
                "arn:aws:sagemaker:*:*:user-profile/*",
                "arn:aws:sagemaker:*:*:app/*",
                "arn:aws:sagemaker:*:*:space/*"
             ]
        },
        {
            "Sid": "LambdaListPermissions",
            "Effect": "Allow",
            "Action": [
                "lambda:ListFunctions"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Sid": "LambdaPermissionsForRewardFunction",
            "Effect": "Allow",
            "Action": [
                "lambda:CreateFunction",
                "lambda:DeleteFunction",
                "lambda:InvokeFunction",
                "lambda:GetFunction"
            ],
            "Resource": [
                "arn:aws:lambda:*:*:function:*SageMaker*",
                "arn:aws:lambda:*:*:function:*sagemaker*",
                "arn:aws:lambda:*:*:function:*Sagemaker*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "LambdaLayerForAWSSDK",
            "Effect": "Allow",
            "Action": [
                "lambda:GetLayerVersion"
            ],
            "Resource": [
                "arn:aws:lambda:*:336392948345:layer:AWSSDK*"
            ]
        },
        {
            "Sid": "SageMakerPublicHubPermissions",
            "Effect": "Allow",
            "Action": [
                "sagemaker:ListHubContents"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:aws:hub/SageMakerPublicHub"
            ]
        },
        {
            "Sid": "SageMakerHubPermissions",
            "Effect": "Allow",
            "Action": [
                "sagemaker:ListHubs",
                "sagemaker:ListHubContents",
                "sagemaker:DescribeHubContent",
                "sagemaker:DeleteHubContent",
                "sagemaker:ListHubContentVersions",
                "sagemaker:Search"
            ],
            "Resource": [
                "*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "JumpStartAccess",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::jumpstart*"
            ]
        },
        {
            "Sid": "ListMLFlowOperations",
            "Effect": "Allow",
            "Action": [
                "sagemaker:ListMlflowApps",
                "sagemaker:ListMlflowTrackingServers"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Sid": "MLFlowAccess",
            "Effect": "Allow",
            "Action": [
                "sagemaker:UpdateMlflowApp",
                "sagemaker:DescribeMlflowApp",
                "sagemaker:CreatePresignedMlflowAppUrl",
                "sagemaker:CallMlflowAppApi",
                "sagemaker-mlflow:*"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:*:mlflow-app/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "BYODataSetS3Access",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::*SageMaker*",
                "arn:aws:s3:::*Sagemaker*",
                "arn:aws:s3:::*sagemaker*"
            ]
        },
        {
            "Sid": "AllowHubPermissions",
            "Effect": "Allow",
            "Action": [
                "sagemaker:ImportHubContent"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:*:hub/*",
                "arn:aws:sagemaker:*:*:hub-content/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "PassRoleForSageMaker",
            "Effect": "Allow",
            "Action": [
                "iam:PassRole"
            ],
            "Resource": [
                "arn:aws:iam::*:role/service-role/AmazonSageMaker-ExecutionRole-*"
            ],
            "Condition": {
                "StringEquals": {
                    "iam:PassedToService": "sagemaker.amazonaws.com",
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "PassRoleForAWSLambda",
            "Effect": "Allow",
            "Action": [
                "iam:PassRole"
            ],
            "Resource": [
                "arn:aws:iam::*:role/service-role/AmazonSageMaker-ExecutionRole-*"
            ],
            "Condition": {
                "StringEquals": {
                    "iam:PassedToService": "lambda.amazonaws.com",
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "PassRoleForBedrock",
            "Effect": "Allow",
            "Action": [
                "iam:PassRole"
            ],
            "Resource": [
                "arn:aws:iam::*:role/service-role/AmazonSageMaker-ExecutionRole-*"
            ],
            "Condition": {
                "StringEquals": {
                    "iam:PassedToService": "bedrock.amazonaws.com",
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "TrainingJobRun",
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateTrainingJob",
                "sagemaker:DescribeTrainingJob",
                "sagemaker:ListTrainingJobs"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:*:training-job/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "ModelPackageAccess",
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateModelPackage",
                "sagemaker:DescribeModelPackage",
                "sagemaker:ListModelPackages",
                "sagemaker:CreateModelPackageGroup",
                "sagemaker:DescribeModelPackageGroup",
                "sagemaker:ListModelPackageGroups",
                "sagemaker:CreateModel"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:*:model-package-group/*",
                "arn:aws:sagemaker:*:*:model-package/*",
                "arn:aws:sagemaker:*:*:model/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "TagsPermission",
            "Effect": "Allow",
            "Action": [
                "sagemaker:AddTags",
                "sagemaker:ListTags"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:*:model-package-group/*",
                "arn:aws:sagemaker:*:*:model-package/*",
                "arn:aws:sagemaker:*:*:hub/*",
                "arn:aws:sagemaker:*:*:hub-content/*",
                "arn:aws:sagemaker:*:*:training-job/*",
                "arn:aws:sagemaker:*:*:model/*",
                "arn:aws:sagemaker:*:*:endpoint/*",
                "arn:aws:sagemaker:*:*:endpoint-config/*",
                "arn:aws:sagemaker:*:*:pipeline/*",
                "arn:aws:sagemaker:*:*:inference-component/*",
                "arn:aws:sagemaker:*:*:action/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "LogAccess",
            "Effect": "Allow",
            "Action": [
                "logs:DescribeLogGroups",
                "logs:DescribeLogStreams",
                "logs:GetLogEvents"
            ],
            "Resource": [
                "arn:aws:logs:*:*:log-group*",
                "arn:aws:logs:*:*:log-group:/aws/sagemaker/TrainingJobs:log-stream:*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "BedrockDeploy",
            "Effect": "Allow",
            "Action": [
                "bedrock:CreateModelImportJob"
            ],
            "Resource": [
                "arn:aws:bedrock:*:*:*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "BedrockOperations",
            "Effect": "Allow",
            "Action": [
                "bedrock:GetModelImportJob",
                "bedrock:GetImportedModel",
                "bedrock:ListProvisionedModelThroughputs",
                "bedrock:ListCustomModelDeployments",
                "bedrock:ListCustomModels",
                "bedrock:ListModelImportJobs",
                "bedrock:GetEvaluationJob",
                "bedrock:CreateEvaluationJob", 
                "bedrock:InvokeModel"
            ],
            "Resource": [
                "arn:aws:bedrock:*:*:evaluation-job/*",
                "arn:aws:bedrock:*:*:imported-model/*",
                "arn:aws:bedrock:*:*:model-import-job/*",
                "arn:aws:bedrock:*:*:foundation-model/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "BedrockFoundationModelOperations",
            "Effect": "Allow",
            "Action": [
                "bedrock:GetFoundationModelAvailability",
                "bedrock:ListFoundationModels"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Sid": "SageMakerPipelinesAndLineage",
            "Effect": "Allow",
            "Action": [
                "sagemaker:ListActions",
                "sagemaker:ListArtifacts",
                "sagemaker:QueryLineage",
                "sagemaker:ListAssociations",
                "sagemaker:AddAssociation",
                "sagemaker:DescribeAction",
                "sagemaker:AddAssociation",
                "sagemaker:CreateAction",
                "sagemaker:CreateContext",
                "sagemaker:DescribeTrialComponent"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:*:artifact/*",
                "arn:aws:sagemaker:*:*:action/*",
                "arn:aws:sagemaker:*:*:context/*",
                "arn:aws:sagemaker:*:*:action/*",
                "arn:aws:sagemaker:*:*:model-package/*",
                "arn:aws:sagemaker:*:*:context/*",
                "arn:aws:sagemaker:*:*:pipeline/*",
                "arn:aws:sagemaker:*:*:experiment-trial-component/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "ListOperations",
            "Effect": "Allow",
            "Action": [
                "sagemaker:ListInferenceComponents",
                "sagemaker:ListWorkforces"
            ],
            "Resource": [
                "*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "SageMakerInference",
            "Effect": "Allow",
            "Action": [
                "sagemaker:DescribeInferenceComponent",
                "sagemaker:CreateEndpoint",
                "sagemaker:CreateEndpointConfig",
                "sagemaker:DescribeEndpoint",
                "sagemaker:DescribeEndpointConfig",
                "sagemaker:ListEndpoints"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:*:inference-component/*",
                "arn:aws:sagemaker:*:*:endpoint/*",
                "arn:aws:sagemaker:*:*:endpoint-config/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "SageMakerPipelines",
            "Effect": "Allow",
            "Action": [
                "sagemaker:DescribePipelineExecution",
                "sagemaker:ListPipelineExecutions",
                "sagemaker:ListPipelineExecutionSteps",
                "sagemaker:CreatePipeline",
                "sagemaker:UpdatePipeline",
                "sagemaker:StartPipelineExecution"
            ],
            "Resource": [
                "arn:aws:sagemaker:*:*:pipeline/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        }
    ]
}
```
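
If you prefer to script this step instead of using the console, the following is a minimal boto3 sketch that attaches the policy above as an inline policy. The role name, policy name, and local file name are placeholder assumptions you must replace.

```
import json

import boto3

iam = boto3.client("iam")

# Load the permissions policy shown above (saved locally as a JSON file).
with open("model-customization-permissions.json") as f:
    policy_document = json.load(f)

# Attach it as an inline policy on your SageMaker AI domain execution role.
# Replace the role name placeholder with your own execution role.
iam.put_role_policy(
    RoleName="AmazonSageMaker-ExecutionRole-XXXXXXXXXXXX",
    PolicyName="ModelCustomizationPermissions",
    PolicyDocument=json.dumps(policy_document),
)
```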

If you have attached the [AmazonSageMakerFullAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerFullAccess.html) policy to your execution role, you can add this reduced policy instead:

```
{
    "Version": "2012-10-17",		 	 	 
    "Statement": [
        {
            "Sid": "LambdaListPermissions",
            "Effect": "Allow",
            "Action": [
                "lambda:ListFunctions"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Sid": "LambdaPermissionsForRewardFunction",
            "Effect": "Allow",
            "Action": [
                "lambda:CreateFunction",
                "lambda:DeleteFunction",
                "lambda:InvokeFunction",
                "lambda:GetFunction"
            ],
            "Resource": [
                "arn:aws:lambda:*:*:function:*SageMaker*",
                "arn:aws:lambda:*:*:function:*sagemaker*",
                "arn:aws:lambda:*:*:function:*Sagemaker*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "LambdaLayerForAWSSDK",
            "Effect": "Allow",
            "Action": [
                "lambda:GetLayerVersion"
            ],
            "Resource": [
                "arn:aws:lambda:*:336392948345:layer:AWSSDK*"
            ]
        },
        {
            "Sid": "S3Access",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::*SageMaker*",
                "arn:aws:s3:::*Sagemaker*",
                "arn:aws:s3:::*sagemaker*",
                "arn:aws:s3:::jumpstart*"
            ]
        },
        {
            "Sid": "PassRoleForSageMakerAndLambdaAndBedrock",
            "Effect": "Allow",
            "Action": [
                "iam:PassRole"
            ],
            "Resource": [
                "arn:aws:iam::*:role/service-role/AmazonSageMaker-ExecutionRole-*"
            ],
            "Condition": { 
                "StringEquals": { 
                    "iam:PassedToService": [ 
                        "lambda.amazonaws.com", 
                        "bedrock.amazonaws.com"
                     ],
                     "aws:ResourceAccount": "${aws:PrincipalAccount}" 
                 } 
            }
        },
        {
            "Sid": "BedrockDeploy",
            "Effect": "Allow",
            "Action": [
                "bedrock:CreateModelImportJob"
            ],
            "Resource": [
                "*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "BedrockOperations",
            "Effect": "Allow",
            "Action": [
                "bedrock:GetModelImportJob",
                "bedrock:GetImportedModel",
                "bedrock:ListProvisionedModelThroughputs",
                "bedrock:ListCustomModelDeployments",
                "bedrock:ListCustomModels",
                "bedrock:ListModelImportJobs",
                "bedrock:GetEvaluationJob",
                "bedrock:CreateEvaluationJob",
                "bedrock:InvokeModel"
            ],
            "Resource": [
                "arn:aws:bedrock:*:*:evaluation-job/*",
                "arn:aws:bedrock:*:*:imported-model/*",
                "arn:aws:bedrock:*:*:model-import-job/*",
                "arn:aws:bedrock:*:*:foundation-model/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "${aws:PrincipalAccount}"
                }
            }
        },
        {
            "Sid": "BedrockFoundationModelOperations",
            "Effect": "Allow",
            "Action": [
                "bedrock:GetFoundationModelAvailability",
                "bedrock:ListFoundationModels"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}
```

Next, choose **Edit trust policy**, replace the trust policy with the following policy, and then choose **Update policy**.

```
{
    "Version": "2012-10-17",		 	 	                    
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "lambda.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        },
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "sagemaker.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        },
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "bedrock.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
```
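
You can also apply the trust policy programmatically. The following is a minimal boto3 sketch under the same placeholder assumptions as above; note that it overwrites the role's existing trust policy with the one shown here.

```
import json

import boto3

iam = boto3.client("iam")

# The trust policy shown above: allow SageMaker AI, Lambda, and Bedrock
# to assume the execution role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": service},
            "Action": "sts:AssumeRole",
        }
        for service in (
            "lambda.amazonaws.com",
            "sagemaker.amazonaws.com",
            "bedrock.amazonaws.com",
        )
    ],
}

# Replace the role name placeholder with your own execution role.
iam.update_assume_role_policy(
    RoleName="AmazonSageMaker-ExecutionRole-XXXXXXXXXXXX",
    PolicyDocument=json.dumps(trust_policy),
)
```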

# Create assets for model customization in the UI
<a name="model-customize-open-weight-create-assets-ui"></a>

You can create and manage the dataset and evaluator assets that you can use for model customization in the UI.

## Assets
<a name="model-customize-open-weight-assets"></a>

Choose **Assets** in the left pane of the Amazon SageMaker Studio UI, and then choose **Datasets**.

![\[An image showing access to model customization.\]](http://docs.aws.amazon.com/es_es/sagemaker/latest/dg/images/screenshot-open-model-16.png)


Choose **Upload dataset** to add the dataset that you will use in your model customization jobs. When you choose the **required data input format**, you can access a reference for the dataset format that you want to use.

![\[An image showing access to model customization.\]](http://docs.aws.amazon.com/es_es/sagemaker/latest/dg/images/screenshot-open-model-15.png)


## Evaluators
<a name="model-customize-open-weight-evaluators"></a>

You can also add **reward functions** and **reward prompts** for your reinforcement learning customization jobs.

![\[An image showing access to model customization.\]](http://docs.aws.amazon.com/es_es/sagemaker/latest/dg/images/screenshot-open-model-14.png)


The UI also provides guidance on the required format for your reward function or reward prompt.

![\[An image showing access to model customization.\]](http://docs.aws.amazon.com/es_es/sagemaker/latest/dg/images/screenshot-open-model-13.png)


## Assets for model customization using the AWS SDK
<a name="model-customize-open-weight-create-assets-sdk"></a>

You can also use the SageMaker AI Python SDK to create assets. See the following example code snippet:

```
from pprint import pprint

from sagemaker.ai_registry.air_constants import REWARD_FUNCTION, REWARD_PROMPT
from sagemaker.ai_registry.dataset import DataSet, CustomizationTechnique
from sagemaker.ai_registry.evaluator import Evaluator

# Creating a dataset example
dataset = DataSet.create(
            name="sdkv3-gen-ds2",
            source="s3://sample-test-bucket/datasets/training-data/jamjee-sft-ds1.jsonl", # or use local filepath as source.
            customization_technique=CustomizationTechnique.SFT
        )

# Refreshes status from hub
dataset.refresh()
pprint(dataset.__dict__)

# Creating an evaluator. Method : Lambda
evaluator = Evaluator.create(
                name = "sdk-new-rf11",
                source="arn:aws:lambda:us-west-2:<>:function:<function-name>8",
                type=REWARD_FUNCTION
        )

# Creating an evaluator. Method : Bring your own code
evaluator = Evaluator.create(
                name = "eval-lambda-test",
                source="/path_to_local/eval_lambda_1.py",
                type = REWARD_FUNCTION
        )

# Optional wait, by default we have wait = True during create call.
evaluator.wait()

evaluator.refresh()
pprint(evaluator)
```

# Submitting model customization jobs
<a name="model-customize-open-weight-job"></a>

You can access the SageMaker AI model customization capability from the Models page in the left pane of Amazon SageMaker Studio. There you can also find the Assets page, where you can create and manage your model customization datasets and evaluators.

![\[An image showing access to model customization.\]](http://docs.aws.amazon.com/es_es/sagemaker/latest/dg/images/screenshot-open-model-12.png)


To start submitting a model customization job, choose the Models option to access the JumpStart foundation models tab:

![\[An image showing how to choose the base model.\]](http://docs.aws.amazon.com/es_es/sagemaker/latest/dg/images/screenshot-open-model-11.png)


You can choose Customize model directly on the model card, or you can search for any Meta model that you are interested in customizing.

![\[An image showing the model card and how to choose the model to customize.\]](http://docs.aws.amazon.com/es_es/sagemaker/latest/dg/images/screenshot-open-model-10.png)


When you choose the model card, you can access the model details page and start the customization job by choosing Customize model and then selecting Customize with UI to begin configuring your RLVR job.

![\[An image showing how to start the customization job.\]](http://docs.aws.amazon.com/es_es/sagemaker/latest/dg/images/screenshot-open-model-9.png)


Next, you can enter the custom model name, select the model customization technique that you want to use, and configure the job hyperparameters:

![\[An image showing the model customization technique selection.\]](http://docs.aws.amazon.com/es_es/sagemaker/latest/dg/images/screenshot-open-model-8.png)


![\[An image showing the model customization technique selection.\]](http://docs.aws.amazon.com/es_es/sagemaker/latest/dg/images/screenshot-open-model-7.png)


## Submitting model customization jobs using the SDK
<a name="model-customize-open-weight-job-sdk"></a>

You can also use the SageMaker AI Python SDK to submit a model customization job:

```
# Submit a DPO model customization job

from sagemaker.modules.train.dpo_trainer import DPOTrainer
from sagemaker.modules.train.common import TrainingType

trainer = DPOTrainer(
    model=BASE_MODEL,
    training_type=TrainingType.LORA,
    model_package_group_name=MODEL_PACKAGE_GROUP_NAME,
    training_dataset=TRAINING_DATASET,
    s3_output_path=S3_OUTPUT_PATH,
    sagemaker_session=sagemaker_session,
    role=ROLE_ARN
)
```
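
After you construct the trainer, start the job. The following is a minimal sketch that assumes the trainer exposes a `train()` method, consistent with other SageMaker `modules.train` trainer classes; check the SDK reference for the exact entry point.

```
# Start the DPO customization job. Assumes a train() entry point,
# as with other SageMaker modules.train trainer classes.
trainer.train()
```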

## Monitor your customization job
<a name="model-customize-open-weight-monitor"></a>

Immediately after you submit your job, you are redirected to your model customization training job page.

![\[An image showing the model customization training job page.\]](http://docs.aws.amazon.com/es_es/sagemaker/latest/dg/images/screenshot-open-model-6.png)


After the job completes, you can go to the custom model details page by choosing the **Go to custom model** button in the upper-right corner.

![\[An image showing the Go to custom model button.\]](http://docs.aws.amazon.com/es_es/sagemaker/latest/dg/images/screenshot-open-model-5.png)


From the custom model details page, you can continue working with your custom model in the following ways:

1. Review performance information, the location of the generated artifacts, the training configuration hyperparameters, and the training logs.

1. Start an evaluation job with a different dataset (continued customization).

1. Deploy the model using SageMaker AI inference endpoints or Amazon Bedrock Custom Model Import.  
![\[An image showing the custom model details page.\]](http://docs.aws.amazon.com/es_es/sagemaker/latest/dg/images/screenshot-open-model-4.png)

# Submitting a model evaluation job
<a name="model-customize-open-weight-evaluation"></a>

This section describes evaluation for customized open weight models. It helps you get started with a walkthrough of the evaluation job submission process. Additional resources are provided for more advanced evaluation job submission use cases.

**Topics**
+ [Getting started](model-customize-evaluation-getting-started.md)
+ [Evaluation types and job submission](model-customize-evaluation-types.md)
+ [Evaluation metrics formats](model-customize-evaluation-metrics-formats.md)
+ [Supported dataset formats for Bring-Your-Own-Dataset (BYOD) tasks](model-customize-evaluation-dataset-formats.md)
+ [Evaluate with preset and custom scorers](model-customize-evaluation-preset-custom-scorers.md)

# Getting started
<a name="model-customize-evaluation-getting-started"></a>

## Submit an evaluation job through SageMaker Studio
<a name="model-customize-evaluation-studio"></a>

### Step 1: Navigate to evaluation from your model card
<a name="model-customize-evaluation-studio-step1"></a>

After you customize your model, navigate to the evaluation page from your model card.

For information about training custom open weight models, see [https://docs.aws.amazon.com/sagemaker/latest/dg/model-customize-open-weight-job.html](https://docs.aws.amazon.com/sagemaker/latest/dg/model-customize-open-weight-job.html).

SageMaker displays your custom model on the My models tab:

![\[Registered model card page\]](http://docs.aws.amazon.com/es_es/sagemaker/latest/dg/images/getting-started-registered-model-card.png)


Choose View latest version, and then choose Evaluate:

![\[Model customization page\]](http://docs.aws.amazon.com/es_es/sagemaker/latest/dg/images/getting-started-evaluate-from-model-card.png)


### Step 2: Submit your evaluation job
<a name="model-customize-evaluation-studio-step2"></a>

Choose the Submit button to submit your evaluation job. This submits a minimal MMLU benchmark job.

For information about the supported evaluation job types, see [Evaluation types and job submission](model-customize-evaluation-types.md).

![\[Evaluation job submission page\]](http://docs.aws.amazon.com/es_es/sagemaker/latest/dg/images/getting-started-benchmark-submission.png)


### Step 3: Track the progress of your evaluation job
<a name="model-customize-evaluation-studio-step3"></a>

The progress of your evaluation job is logged on the Evaluation steps tab:

![\[The progress of your evaluation job\]](http://docs.aws.amazon.com/es_es/sagemaker/latest/dg/images/getting-started-benchmark-tracking.png)


### Step 4: View the results of your evaluation job
<a name="model-customize-evaluation-studio-step4"></a>

The results of your evaluation job are visualized on the Evaluation results tab:

![\[Your evaluation job metrics\]](http://docs.aws.amazon.com/es_es/sagemaker/latest/dg/images/getting-started-benchmark-results.png)


### Step 5: View your completed evaluations
<a name="model-customize-evaluation-studio-step5"></a>

The completed evaluation job is displayed under Evaluations on your model card:

![\[Your completed evaluation jobs\]](http://docs.aws.amazon.com/es_es/sagemaker/latest/dg/images/getting-started-benchmark-completed-model-card.png)


## Submit your evaluation job through the SageMaker Python SDK
<a name="model-customize-evaluation-sdk"></a>

### Step 1: Create your BenchMarkEvaluator
<a name="model-customize-evaluation-sdk-step1"></a>

Pass in the registered trained model, the Amazon S3 output location, and the MLflow resource ARN, and then initialize the `BenchMarkEvaluator`.

```
from sagemaker.train.evaluate import BenchMarkEvaluator, Benchmark  
  
evaluator = BenchMarkEvaluator(  
    benchmark=Benchmark.MMLU,  
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",  
    s3_output_path="s3://<bucket-name>/<prefix>/eval/",  
    mlflow_resource_arn="arn:aws:sagemaker:<region>:<account-id>:mlflow-tracking-server/<tracking-server-name>",  
    evaluate_base_model=False  
)
```

### Step 2: Submit your evaluation job
<a name="model-customize-evaluation-sdk-step2"></a>

Use the `evaluate()` method to submit the evaluation job.

```
execution = evaluator.evaluate()
```

### Step 3: Track the progress of your evaluation job
<a name="model-customize-evaluation-sdk-step3"></a>

Use the execution's `wait()` method to get real-time updates on the progress of the evaluation job.

```
execution.wait(target_status="Succeeded", poll=5, timeout=3600)
```

### Step 4: View the results of your evaluation job
<a name="model-customize-evaluation-sdk-step4"></a>

Use the `show_results()` method to display the results of your evaluation job.

```
execution.show_results()
```

# Evaluation types and job submission
<a name="model-customize-evaluation-types"></a>

## Benchmarking with standardized datasets
<a name="model-customize-evaluation-benchmarking"></a>

Use the benchmarking evaluation type to evaluate your model's quality on standardized benchmark datasets, including popular datasets such as MMLU and BBH.


| Benchmark | Custom datasets supported | Modalities | Description | Metrics | Strategy | Subtasks available | 
| --- | --- | --- | --- | --- | --- | --- | 
| mmlu | No | Text | Multitask language understanding: evaluates knowledge across 57 subjects. | accuracy | zs_cot | Yes | 
| mmlu_pro | No | Text | MMLU (professional subset): focuses on professional domains such as law, medicine, accounting, and engineering. | accuracy | zs_cot | No | 
| bbh | No | Text | Advanced reasoning tasks: a set of challenging problems that test higher-level cognitive and problem-solving skills. | accuracy | fs_cot | Yes | 
| gpqa | No | Text | General physics question answering: evaluates understanding of physics concepts and related problem-solving skills. | accuracy | zs_cot | No | 
| math | No | Text | Mathematical problem solving: measures mathematical reasoning across topics such as algebra, calculus, and word problems. | exact_match | zs_cot | Yes | 
| strong_reject | No | Text | QA task: tests the model's ability to detect and reject inappropriate, harmful, or incorrect content. | deflection | zs | Yes | 
| ifeval | No | Text | Instruction-following evaluation: measures how accurately a model follows given instructions and completes tasks to specification. | accuracy | zs | No | 

For more information about BYOD formats, see [Supported dataset formats for Bring-Your-Own-Dataset (BYOD) tasks](model-customize-evaluation-dataset-formats.md).

### Available subtasks
<a name="model-customize-evaluation-benchmarking-subtasks"></a>

The following lists the subtasks available for model evaluation across several domains, including MMLU (Massive Multitask Language Understanding), BBH (Big Bench Hard), MATH, and StrongReject. These subtasks let you evaluate model performance on specific capabilities and knowledge areas. A sketch of passing a subtask selection follows the lists below.

**MMLU subtasks**

```
MMLU_SUBTASKS = [
    "abstract_algebra",
    "anatomy",
    "astronomy",
    "business_ethics",
    "clinical_knowledge",
    "college_biology",
    "college_chemistry",
    "college_computer_science",
    "college_mathematics",
    "college_medicine",
    "college_physics",
    "computer_security",
    "conceptual_physics",
    "econometrics",
    "electrical_engineering",
    "elementary_mathematics",
    "formal_logic",
    "global_facts",
    "high_school_biology",
    "high_school_chemistry",
    "high_school_computer_science",
    "high_school_european_history",
    "high_school_geography",
    "high_school_government_and_politics",
    "high_school_macroeconomics",
    "high_school_mathematics",
    "high_school_microeconomics",
    "high_school_physics",
    "high_school_psychology",
    "high_school_statistics",
    "high_school_us_history",
    "high_school_world_history",
    "human_aging",
    "human_sexuality",
    "international_law",
    "jurisprudence",
    "logical_fallacies",
    "machine_learning",
    "management",
    "marketing",
    "medical_genetics",
    "miscellaneous",
    "moral_disputes",
    "moral_scenarios",
    "nutrition",
    "philosophy",
    "prehistory",
    "professional_accounting",
    "professional_law",
    "professional_medicine",
    "professional_psychology",
    "public_relations",
    "security_studies",
    "sociology",
    "us_foreign_policy",
    "virology",
    "world_religions"
]
```

**BBH subtasks**

```
BBH_SUBTASKS = [
    "boolean_expressions",
    "causal_judgement",
    "date_understanding",
    "disambiguation_qa",
    "dyck_languages",
    "formal_fallacies",
    "geometric_shapes",
    "hyperbaton",
    "logical_deduction_five_objects",
    "logical_deduction_seven_objects",
    "logical_deduction_three_objects",
    "movie_recommendation",
    "multistep_arithmetic_two",
    "navigate",
    "object_counting",
    "penguins_in_a_table",
    "reasoning_about_colored_objects",
    "ruin_names",
    "salient_translation_error_detection",
    "snarks",
    "sports_understanding",
    "temporal_sequences",
    "tracking_shuffled_objects_five_objects",
    "tracking_shuffled_objects_seven_objects",
    "tracking_shuffled_objects_three_objects",
    "web_of_lies",
    "word_sorting"
]
```

**MATH subtasks**

```
MATH_SUBTASKS = [
    "algebra", 
    "counting_and_probability", 
    "geometry",
    "intermediate_algebra", 
    "number_theory", 
    "prealgebra", 
    "precalculus"
]
```

**StrongReject subtasks**

```
STRONG_REJECT_SUBTASKS = [
    "gcg_transfer_harmbench", 
    "gcg_transfer_universal_attacks",
    "combination_3", 
    "combination_2", 
    "few_shot_json", 
    "dev_mode_v2",
    "dev_mode_with_rant",
    "wikipedia_with_title", 
    "distractors",
    "wikipedia",
     "style_injection_json", 
    "style_injection_short",
    "refusal_suppression", 
    "prefix_injection", 
    "distractors_negated",
    "poems", 
    "base64", 
    "base64_raw", "
    base64_input_only",
    "base64_output_only", 
    "evil_confidant", 
    "aim", 
    "rot_13",
    "disemvowel", 
    "auto_obfuscation", 
    "auto_payload_splitting", 
    "pair",
    "pap_authority_endorsement", 
    "pap_evidence_based_persuasion",
    "pap_expert_endorsement", 
    "pap_logical_appeal", 
    "pap_misrepresentation"
]
```
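
For illustration, the following is a minimal sketch of scoping an MMLU benchmark evaluation to specific subtasks from the lists above. The `subtasks` parameter name is an assumption, not a confirmed SDK argument; check the SageMaker Python SDK reference for the actual API.

```
from sagemaker.train.evaluate import BenchMarkEvaluator, Benchmark

# Hypothetical subtask selection -- the `subtasks` parameter name is an
# assumption for illustration; consult the SDK reference for the real API.
evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MMLU,
    subtasks=["college_mathematics", "formal_logic"],
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/eval/",
    evaluate_base_model=False
)
```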

### Submit your benchmark job
<a name="model-customize-evaluation-benchmarking-submit"></a>

------
#### [ SageMaker Studio ]

![\[A minimal configuration for benchmarking through SageMaker Studio\]](http://docs.aws.amazon.com/es_es/sagemaker/latest/dg/images/benchmark-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
from sagemaker.train.evaluate import get_benchmarks
from sagemaker.train.evaluate import BenchMarkEvaluator

Benchmark = get_benchmarks()

# Create evaluator with MMLU benchmark
evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MMLU,
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)

execution = evaluator.evaluate()
```

For more information about submitting evaluation jobs through the SageMaker Python SDK, see [https://sagemaker.readthedocs.io/en/stable/model_customization/evaluation.html](https://sagemaker.readthedocs.io/en/stable/model_customization/evaluation.html).

------

## LLM-as-a-judge (LLMAJ) evaluation
<a name="model-customize-evaluation-llmaj"></a>

Use LLM-as-a-judge (LLMAJ) evaluation to leverage another frontier model to grade the responses of your target model. You can use Amazon Bedrock models as judges by calling the `create_evaluation_job` API to start the evaluation job.

For more information about supported judge models, see [https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html](https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html).

You can use two different metric formats to define the evaluation:
+ **Built-in metrics:** leverage Amazon Bedrock built-in metrics to analyze the quality of your model's inference responses. For more information, see [https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-type-judge-prompt.html](https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-type-judge-prompt.html)
+ **Custom metrics:** define your own custom metrics in the Bedrock Evaluation custom metric format to analyze the quality of your model's inference responses according to your own instructions. For more information, see [https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-custom-metrics-prompt-formats.html](https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-custom-metrics-prompt-formats.html)

### Submit an LLMAJ job with built-in metrics
<a name="model-customize-evaluation-llmaj-builtin"></a>

------
#### [ SageMaker Studio ]

![\[A minimal configuration for LLMAJ benchmarking through SageMaker Studio\]](http://docs.aws.amazon.com/es_es/sagemaker/latest/dg/images/llmaj-as-judge-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
from sagemaker.train.evaluate import LLMAsJudgeEvaluator

evaluator = LLMAsJudgeEvaluator(
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    evaluator_model="<bedrock-judge-model-id>",
    dataset="s3://<bucket-name>/<prefix>/<dataset-file>.jsonl",
    builtin_metrics=["<builtin-metric-1>", "<builtin-metric-2>"],
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)

execution = evaluator.evaluate()
```

For more information about submitting evaluation jobs through the SageMaker Python SDK, see [https://sagemaker.readthedocs.io/en/stable/model_customization/evaluation.html](https://sagemaker.readthedocs.io/en/stable/model_customization/evaluation.html).

------

### Submit an LLMAJ job with custom metrics
<a name="model-customize-evaluation-llmaj-custom"></a>

Define your custom metric(s):

```
{
    "customMetricDefinition": {
        "name": "PositiveSentiment",
        "instructions": (
            "You are an expert evaluator. Your task is to assess if the sentiment of the response is positive. "
            "Rate the response based on whether it conveys positive sentiment, helpfulness, and constructive tone.\n\n"
            "Consider the following:\n"
            "- Does the response have a positive, encouraging tone?\n"
            "- Is the response helpful and constructive?\n"
            "- Does it avoid negative language or criticism?\n\n"
            "Rate on this scale:\n"
            "- Good: Response has positive sentiment\n"
            "- Poor: Response lacks positive sentiment\n\n"
            "Here is the actual task:\n"
            "Prompt: {{prompt}}\n"
            "Response: {{prediction}}"
        ),
        "ratingScale": [
            {"definition": "Good", "value": {"floatValue": 1}},
            {"definition": "Poor", "value": {"floatValue": 0}}
        ]
    }
}
```

For more information, see [https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-custom-metrics-prompt-formats.html](https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-custom-metrics-prompt-formats.html).

------
#### [ SageMaker Studio ]

![\[Upload the custom metric under Custom metrics > Add custom metrics\]](http://docs.aws.amazon.com/es_es/sagemaker/latest/dg/images/custom-llmaj-metrics-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
from sagemaker.train.evaluate import LLMAsJudgeEvaluator

evaluator = LLMAsJudgeEvaluator(
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    evaluator_model="<bedrock-judge-model-id>",
    dataset="s3://<bucket-name>/<prefix>/<dataset-file>.jsonl",
    custom_metrics={
        "customMetricDefinition": {
            "name": "PositiveSentiment",
            "instructions": (
                "You are an expert evaluator. Your task is to assess if the sentiment of the response is positive. "
                "Rate the response based on whether it conveys positive sentiment, helpfulness, and constructive tone.\n\n"
                "Consider the following:\n"
                "- Does the response have a positive, encouraging tone?\n"
                "- Is the response helpful and constructive?\n"
                "- Does it avoid negative language or criticism?\n\n"
                "Rate on this scale:\n"
                "- Good: Response has positive sentiment\n"
                "- Poor: Response lacks positive sentiment\n\n"
                "Here is the actual task:\n"
                "Prompt: {{prompt}}\n"
                "Response: {{prediction}}"
            ),
            "ratingScale": [
                {"definition": "Good", "value": {"floatValue": 1}},
                {"definition": "Poor", "value": {"floatValue": 0}}
            ]
        }
    },
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)
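
# Submit the evaluation job, as in the built-in metrics example above.
execution = evaluator.evaluate()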
```

------

## Custom scorers
<a name="model-customize-evaluation-custom-scorers"></a>

Define your own custom scoring function to start an evaluation job. The system includes two built-in scorers: Prime math and Prime code. You can also bring your own scorer function. You can either copy the scoring function code directly or create your own Lambda function definition using the associated ARN. By default, both scorer types produce evaluation results that include standard metrics such as F1 score, ROUGE, and BLEU.

For more information about built-in and custom scorers and their respective requirements or contracts, see [Evaluate with preset and custom scorers](model-customize-evaluation-preset-custom-scorers.md).

### Register your dataset
<a name="model-customize-evaluation-custom-scorers-register-dataset"></a>

Create your own dataset for a custom scorer by registering it as a SageMaker Hub content dataset.

------
#### [ SageMaker Studio ]

In Studio, upload your dataset through the dedicated Datasets page.

![\[Evaluation dataset registered in SageMaker Studio\]](http://docs.aws.amazon.com/es_es/sagemaker/latest/dg/images/dataset-registration-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

In the SageMaker Python SDK, register your dataset with the `DataSet` class:

```
from sagemaker.ai_registry.dataset import DataSet

dataset = DataSet.create(
    name="your-bring-your-own-dataset",
    source="s3://<bucket-name>/<prefix>/<dataset-file>.jsonl"
)
dataset.refresh()
```

------

### Submit a built-in scorer job
<a name="model-customize-evaluation-custom-scorers-builtin"></a>

------
#### [ SageMaker Studio ]

![\[Select between code executions or math answers for a built-in custom scorer\]](http://docs.aws.amazon.com/es_es/sagemaker/latest/dg/images/builtin-scorer-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
from sagemaker.train.evaluate import CustomScorerEvaluator
from sagemaker.train.evaluate import get_builtin_metrics

BuiltInMetric = get_builtin_metrics()

evaluator_builtin = CustomScorerEvaluator(
    evaluator=BuiltInMetric.PRIME_MATH,
    dataset="arn:aws:sagemaker:<region>:<account-id>:hub-content/<hub-content-id>/DataSet/your-bring-your-own-dataset/<version>",
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)

execution = evaluator_builtin.evaluate()
```

Choose either the `BuiltInMetric.PRIME_MATH` or the `BuiltInMetric.PRIME_CODE` built-in scorer option.

------

### Submit a custom scorer job
<a name="model-customize-evaluation-custom-scorers-custom"></a>

Define a custom reward function. For more information, see [Custom scorers (bring your own metrics)](model-customize-evaluation-preset-custom-scorers.md#model-customize-evaluation-custom-scorers-byom). A minimal sketch of such a function follows.
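
The sketch below illustrates what a Lambda-based reward function file might contain. The event fields (`response`, `ground_truth`) and the returned `score` key are assumptions for illustration only; the linked page defines the actual contract.

```
# custom_lambda_function.py -- hypothetical reward function sketch.
# The event fields and return shape below are assumptions; see the
# custom scorer contract for the exact interface.

def lambda_handler(event, context):
    response = event.get("response", "")
    ground_truth = event.get("ground_truth", "")

    # Toy scoring rule: full reward for an exact match with the ground truth.
    score = 1.0 if response.strip() == ground_truth.strip() else 0.0

    return {"score": score}
```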

**Register the custom reward function**

------
#### [ SageMaker Studio ]

![\[Go to SageMaker Studio > Assets > Evaluator > Create evaluator > Create reward function\]](http://docs.aws.amazon.com/es_es/sagemaker/latest/dg/images/custom-scorer-submission-sagemaker-studio.png)


![\[Submit the Custom Scorer evaluation job by referencing the registered reward function under Custom Scorer > Custom metrics\]](http://docs.aws.amazon.com/es_es/sagemaker/latest/dg/images/custom-scorer-benchmark-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
from sagemaker.ai_registry.evaluator import Evaluator
from sagemaker.ai_registry.air_constants import REWARD_FUNCTION

evaluator = Evaluator.create(
    name = "your-reward-function-name",
    source="/path_to_local/custom_lambda_function.py",
    type = REWARD_FUNCTION
)
```

```
from sagemaker.train.evaluate import CustomScorerEvaluator

# Reference the registered reward function as the evaluator.
custom_scorer = CustomScorerEvaluator(
    evaluator=evaluator,
    dataset="s3://<bucket-name>/<prefix>/<dataset-file>.jsonl",
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/",
    evaluate_base_model=False
)

execution = custom_scorer.evaluate()
```

------

# Evaluation metrics formats
<a name="model-customize-evaluation-metrics-formats"></a>

Evaluate model quality in the following metrics formats:
+ Model evaluation summary
+ MLflow
+ TensorBoard

## Model evaluation summary
<a name="model-customize-evaluation-metrics-summary"></a>

When you submit your evaluation job, you specify an Amazon S3 output location. SageMaker automatically uploads the .json file containing the evaluation summary to that location. The S3 path of the benchmark summary is as follows:

```
s3://<your-provide-s3-location>/<training-job-name>/output/output/<evaluation-job-name>/eval_results/
```

**Pass the Amazon S3 location**

------
#### [ SageMaker Studio ]

![\[Pass the output artifact location (Amazon S3 URI)\]](http://docs.aws.amazon.com/es_es/sagemaker/latest/dg/images/s3-output-path-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MMLU,
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/eval/",
    evaluate_base_model=False
)

execution = evaluator.evaluate()
```

------

Read the `.json` file directly from the Amazon S3 location, or view it automatically visualized in the UI:

```
{
  "results": {
    "custom|gen_qa_gen_qa|0": {
      "rouge1": 0.9152812653966208,
      "rouge1_stderr": 0.003536439199232507,
      "rouge2": 0.774569918517409,
      "rouge2_stderr": 0.006368825746765958,
      "rougeL": 0.9111255645823356,
      "rougeL_stderr": 0.003603841524881021,
      "em": 0.6562150055991042,
      "em_stderr": 0.007948251702846893,
      "qem": 0.7522396416573348,
      "qem_stderr": 0.007224355240883467,
      "f1": 0.8428757602152095,
      "f1_stderr": 0.005186300690881584,
      "f1_score_quasi": 0.9156170336744968,
      "f1_score_quasi_stderr": 0.003667700152375464,
      "bleu": 100.00000000000004,
      "bleu_stderr": 1.464411857851008
    },
    "all": {
      "rouge1": 0.9152812653966208,
      "rouge1_stderr": 0.003536439199232507,
      "rouge2": 0.774569918517409,
      "rouge2_stderr": 0.006368825746765958,
      "rougeL": 0.9111255645823356,
      "rougeL_stderr": 0.003603841524881021,
      "em": 0.6562150055991042,
      "em_stderr": 0.007948251702846893,
      "qem": 0.7522396416573348,
      "qem_stderr": 0.007224355240883467,
      "f1": 0.8428757602152095,
      "f1_stderr": 0.005186300690881584,
      "f1_score_quasi": 0.9156170336744968,
      "f1_score_quasi_stderr": 0.003667700152375464,
      "bleu": 100.00000000000004,
      "bleu_stderr": 1.464411857851008
    }
  }
}
```
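
To inspect the summary programmatically rather than in the UI, the following is a minimal boto3 sketch; the bucket, key, and results file name are placeholders you must replace with your actual `eval_results` path.

```
import json

import boto3

s3 = boto3.client("s3")

# Placeholder bucket and key -- replace with your eval_results path.
obj = s3.get_object(
    Bucket="<bucket-name>",
    Key="<prefix>/eval/<training-job-name>/output/output/<evaluation-job-name>/eval_results/<results-file>.json",
)
results = json.loads(obj["Body"].read())

# For example, print the aggregate ROUGE-1 score.
print(results["results"]["all"]["rouge1"])
```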

![\[Example performance metrics for a custom gen QA benchmark visualized in SageMaker Studio\]](http://docs.aws.amazon.com/es_es/sagemaker/latest/dg/images/gen-qa-metrics-visualization-sagemaker-studio.png)


## MLflow logging
<a name="model-customize-evaluation-metrics-mlflow"></a>

**Provide the ARN of your SageMaker MLflow resource**

SageMaker Studio uses the default MLflow app that is provisioned in each Studio domain when you use the model customization capability for the first time. SageMaker Studio uses the default ARN associated with the MLflow app when submitting evaluation jobs.

You can also submit your evaluation job and explicitly provide an MLflow resource ARN to stream the metrics to that associated tracking server/app for real-time analysis.

**SageMaker Python SDK**

```
evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MMLU,
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/eval/",
    mlflow_resource_arn="arn:aws:sagemaker:<region>:<account-id>:mlflow-tracking-server/<tracking-server-name>",
    evaluate_base_model=False
)

execution = evaluator.evaluate()
```

Visualize model-level and system-level metrics:

![\[Example model-level error and accuracy for an MMLU benchmarking task\]](http://docs.aws.amazon.com/es_es/sagemaker/latest/dg/images/model-metrics-mlflow.png)


![\[Ejemplo de métricas integradas para la tarea de evaluación comparativa de LLMAJ\]](http://docs.aws.amazon.com/es_es/sagemaker/latest/dg/images/llmaj-metrics-mlflow.png)


![\[Ejemplos de métricas a nivel de sistema para una tarea de evaluación comparativa de MMLU\]](http://docs.aws.amazon.com/es_es/sagemaker/latest/dg/images/system-metrics-mlflow.png)


## TensorBoard
<a name="model-customize-evaluation-metrics-tensorboard"></a>

Envíe su trabajo de evaluación con una ubicación de salida de AWS S3. SageMaker carga automáticamente un TensorBoard archivo en la ubicación.

SageMaker carga el TensorBoard archivo a AWS S3 en la siguiente ubicación:

```
s3://<your-provide-s3-location>/<training-job-name>/output/output/<evaluation-job-name>/tensorboard_results/eval/
```

**Pase la ubicación AWS S3 de la siguiente manera**

------
#### [ SageMaker Studio ]

![\[Pase a la ubicación del artefacto de salida (URI AWS S3)\]](http://docs.aws.amazon.com/es_es/sagemaker/latest/dg/images/s3-output-path-submission-sagemaker-studio.png)


------
#### [ SageMaker Python SDK ]

```
evaluator = BenchMarkEvaluator(
    benchmark=Benchmark.MMLU,
    model="arn:aws:sagemaker:<region>:<account-id>:model-package/<model-package-name>/<version>",
    s3_output_path="s3://<bucket-name>/<prefix>/eval/",
    evaluate_base_model=False
)

execution = evaluator.evaluate()
```

------

**Ejemplo de métricas a nivel de modelo**

![\[SageMaker TensorBoard mostrar los resultados de un trabajo de evaluación comparativa\]](http://docs.aws.amazon.com/es_es/sagemaker/latest/dg/images/metrics-in-tensorboard.png)


# Formatos de conjuntos de datos compatibles para Bring-Your-Own-Dataset tareas (BYOD)
<a name="model-customize-evaluation-dataset-formats"></a>

El marcador personalizado y los tipos de LLM-as-judge evaluación requieren un archivo JSONL de conjunto de datos personalizado ubicado en S3. AWS Debe proporcionar el archivo como un archivo de líneas JSON que cumpla con uno de los siguientes formatos compatibles. Los ejemplos de este documento se han ampliado para mayor claridad.

Cada formato tiene sus propios matices, pero como mínimo todos requieren un mensaje de usuario.


**Campos obligatorios**  

| Campo | Obligatorio | 
| --- | --- | 
| Mensaje de usuario | Sí | 
| Petición del sistema | No | 
| Verdad fundamental | Solo para Custom Scorer | 
| Categoría | No | 

**1. Formato OpenAI**

```
{
    "messages": [
        {
            "role": "system",    # System prompt (looks for system role)
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",       # Query (looks for user role)
            "content": "Hello!"
        },
        {
            "role": "assistant",  # Ground truth (looks for assistant role)
            "content": "Hello to you!"
        }
    ]
}
```

**2. SageMaker **Evaluación

```
{
   "system":"You are an English major with top marks in class who likes to give minimal word responses: ",
   "query":"What is the symbol that ends the sentence as a question",
   "response":"?", # Ground truth
   "category": "Grammar"
}
```

**3. HuggingFace Pronta finalización**

Se admiten los formatos estándar y conversacional.

```
# Standard

{
    "prompt" : "What is the symbol that ends the sentence as a question", # Query
    "completion" : "?" # Ground truth
}

# Conversational
{
    "prompt": [
        {
            "role": "user", # Query (looks for user role)
            "content": "What is the symbol that ends the sentence as a question"
        }
    ],
    "completion": [
        {
            "role": "assistant", # Ground truth (looks for assistant role)
            "content": "?"
        }
    ]
}
```

**4. HuggingFace Preferencia**

Support tanto para el formato estándar (cadena) como para el formato conversacional (matriz de mensajes).

```
# Standard: {"prompt": "text", "chosen": "text", "rejected": "text"}
{
     "prompt" : "The sky is", # Query
     "chosen" : "blue", # Ground truth
     "rejected" : "green"
}

# Conversational:
{
    "prompt": [
        {
            "role": "user", # Query (looks for user role)
            "content": "What color is the sky?"
        }
    ],
    "chosen": [
        {
            "role": "assistant", # Ground truth (looks for assistant role)
            "content": "It is blue."
        }
    ],
    "rejected": [
        {
            "role": "assistant",
            "content": "It is green."
        }
    ]
}
```

**5. Formato Verl**

El formato Verl (tanto los formatos actuales como los heredados) es compatible con los casos de uso del aprendizaje por refuerzo. Documentos de Verl como referencia: [https://verl.readthedocs.io/en/latest/preparation/prepare\$1data.html](https://verl.readthedocs.io/en/latest/preparation/prepare_data.html)

Por lo general, los usuarios del formato VERL no proporcionan una respuesta basada en la verdad. Si desea proporcionar una de todas formas, utilice cualquiera de los campos `extra_info.answer` o`reward_model.ground_truth`; `extra_info` tiene prioridad.

SageMaker conserva los siguientes campos específicos de Verl como metadatos, si están presentes:
+ `id`
+ `data_source`
+ `ability`
+ `reward_model`
+ `extra_info`
+ `attributes`
+ `difficulty`

```
# Newest VERL format where `prompt` is an array of messages.
{
  "data_source": "openai/gsm8k",
  "prompt": [
    {
      "content": "You are a helpful math tutor who explains solutions to questions step-by-step.",
      "role": "system"
    },
    {
      "content": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? Let's think step by step and output the final answer after \"####\".",
      "role": "user"
    }
  ],
  "ability": "math",
  "extra_info": {
    "answer": "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72",
    "index": 0,
    "question": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?",
    "split": "train"
  },
  "reward_model": {
    "ground_truth": "72" # Ignored in favor of extra_info.answer
  }
}

# Legacy VERL format where `prompt` is a string. Also supported.
{
  "data_source": "openai/gsm8k",
  "prompt": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? Let's think step by step and output the final answer after \"####\".",
  "extra_info": {
    "answer": "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72",
    "index": 0,
    "question": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?",
    "split": "train"
  }
}
```

# Evalúe con puntadores preestablecidos y personalizados
<a name="model-customize-evaluation-preset-custom-scorers"></a>

Cuando se utiliza el tipo de evaluación Custom Scorer, SageMaker Evaluation admite dos puntuadores integrados (también denominados «funciones de recompensa») Prime Math y Prime Code tomados de la biblioteca de formación de [volcengine/verl RL](https://github.com/volcengine/verl), o su propio marcador personalizado implementado como una función Lambda.

## Puntuadores integrados
<a name="model-customize-evaluation-builtin-scorers"></a>

**Prime Math**

El mejor anotador de matemáticas espera un conjunto de datos JSONL personalizado de entradas que contenga una pregunta matemática como la prompt/query respuesta correcta como verdad fundamental. El conjunto de datos puede tener cualquiera de los formatos compatibles que se mencionan en. [Formatos de conjuntos de datos compatibles para Bring-Your-Own-Dataset tareas (BYOD)](model-customize-evaluation-dataset-formats.md)

Ejemplo de entrada en un conjunto de datos (ampliado para mayor claridad):

```
{
    "system":"You are a math expert: ",
    "query":"How many vertical asymptotes does the graph of $y=\\frac{2}{x^2+x-6}$ have?",
    "response":"2" # Ground truth aka correct answer
}
```

**Código principal**

El evaluador de código principal espera un conjunto de datos JSONL personalizado compuesto por entradas que contengan un problema de codificación y casos de prueba especificados en el campo. `metadata` Estructura los casos de prueba con el nombre de función esperado para cada entrada, las entradas de muestra y las salidas esperadas.

Ejemplo de entrada en un conjunto de datos (ampliado para mayor claridad):

```
{
    "system":"\\nWhen tackling complex reasoning tasks, you have access to the following actions. Use them as needed to progress through your thought process.\\n\\n[ASSESS]\\n\\n[ADVANCE]\\n\\n[VERIFY]\\n\\n[SIMPLIFY]\\n\\n[SYNTHESIZE]\\n\\n[PIVOT]\\n\\n[OUTPUT]\\n\\nYou should strictly follow the format below:\\n\\n[ACTION NAME]\\n\\n# Your action step 1\\n\\n# Your action step 2\\n\\n# Your action step 3\\n\\n...\\n\\nNext action: [NEXT ACTION NAME]\\n\\n",
    "query":"A number N is called a factorial number if it is the factorial of a positive integer. For example, the first few factorial numbers are 1, 2, 6, 24, 120,\\nGiven a number N, the task is to return the list/vector of the factorial numbers smaller than or equal to N.\\nExample 1:\\nInput: N = 3\\nOutput: 1 2\\nExplanation: The first factorial number is \\n1 which is less than equal to N. The second \\nnumber is 2 which is less than equal to N,\\nbut the third factorial number is 6 which \\nis greater than N. So we print only 1 and 2.\\nExample 2:\\nInput: N = 6\\nOutput: 1 2 6\\nExplanation: The first three factorial \\nnumbers are less than equal to N but \\nthe fourth factorial number 24 is \\ngreater than N. So we print only first \\nthree factorial numbers.\\nYour Task:  \\nYou don't need to read input or print anything. Your task is to complete the function factorialNumbers() which takes an integer N as an input parameter and return the list/vector of the factorial numbers smaller than or equal to N.\\nExpected Time Complexity: O(K), Where K is the number of factorial numbers.\\nExpected Auxiliary Space: O(1)\\nConstraints:\\n1<=N<=10^{18}\\n\\nWrite Python code to solve the problem. Present the code in \\n```python\\nYour code\\n```\\nat the end.",
    "response": "", # Dummy string for ground truth. Provide a value if you want NLP metrics like ROUGE, BLEU, and F1.
    ### Define test cases in metadata field
    "metadata": {
        "fn_name": "factorialNumbers",
        "inputs": ["5"],
        "outputs": ["[1, 2]"]
    }
}
```

## Puntuadores personalizados (traiga sus propias métricas)
<a name="model-customize-evaluation-custom-scorers-byom"></a>

Personalice completamente el flujo de trabajo de evaluación de sus modelos con una lógica de posprocesamiento personalizada que le permita calcular métricas personalizadas adaptadas a sus necesidades. Debe implementar su marcador personalizado como una función AWS Lambda que acepte las respuestas del modelo y devuelva las puntuaciones de recompensa.

### Ejemplo de carga útil de entrada Lambda
<a name="model-customize-evaluation-custom-scorers-lambda-input"></a>

Su AWS Lambda personalizada espera entradas en el formato OpenAI. Ejemplo:

```
{
    "id": "123",
    "messages": [
        {
            "role": "user",
            "content": "Do you have a dedicated security team?"
        },
        {
            "role": "assistant",
            "content": "As an AI developed by Amazon, I do not have a dedicated security team..."
        }
    ],
    "reference_answer": {
        "compliant": "No",
        "explanation": "As an AI developed by Company, I do not have a traditional security team..."
    }
}
```

### Ejemplo de carga útil de salida Lambda
<a name="model-customize-evaluation-custom-scorers-lambda-output"></a>

El contenedor SageMaker de evaluación espera que sus respuestas de Lambda sigan este formato:

```
{
    "id": str,                              # Same id as input sample
    "aggregate_reward_score": float,        # Overall score for the sample
    "metrics_list": [                       # OPTIONAL: Component scores
        {
            "name": str,                    # Name of the component score
            "value": float,                 # Value of the component score
            "type": str                     # "Reward" or "Metric"
        }
    ]
}
```

### Definición de Lambda personalizada
<a name="model-customize-evaluation-custom-scorers-lambda-definition"></a>

[Encuentre un ejemplo de un marcador personalizado completamente implementado con una entrada de muestra y la salida esperada en: https://docs.aws.amazon.com/sagemaker/ latest/dg/nova - .html\$1 -example implementing-reward-functions nova-reward-llm-judge](https://docs.aws.amazon.com/sagemaker/latest/dg/nova-implementing-reward-functions.html#nova-reward-llm-judge-example)

Usa el siguiente esquema como punto de partida para tu propia función.

```
def lambda_handler(event, context):
    return lambda_grader(event)

def lambda_grader(samples: list[dict]) -> list[dict]:
    """
    Args:
        Samples: List of dictionaries in OpenAI format
            
        Example input:
        {
            "id": "123",
            "messages": [
                {
                    "role": "user",
                    "content": "Do you have a dedicated security team?"
                },
                {
                    "role": "assistant",
                    "content": "As an AI developed by Company, I do not have a dedicated security team..."
                }
            ],
            # This section is the same as your training dataset
            "reference_answer": {
                "compliant": "No",
                "explanation": "As an AI developed by Company, I do not have a traditional security team..."
            }
        }
        
    Returns:
        List of dictionaries with reward scores:
        {
            "id": str,                              # Same id as input sample
            "aggregate_reward_score": float,        # Overall score for the sample
            "metrics_list": [                       # OPTIONAL: Component scores
                {
                    "name": str,                    # Name of the component score
                    "value": float,                 # Value of the component score
                    "type": str                     # "Reward" or "Metric"
                }
            ]
        }
    """
```

### Campos de entrada y salida
<a name="model-customize-evaluation-custom-scorers-fields"></a>

**Campos de entrada**


| Campo | Description (Descripción) | Notas adicionales | 
| --- | --- | --- | 
| id | Identificador único de la muestra | El eco volvió a aparecer en la salida. Formato de cadena | 
| Mensajes | Historial de chat ordenado en formato OpenAI | Matriz de objetos de mensajes | 
| mensajes [] .role | Orador del mensaje | Valores comunes: «usuario», «asistente», «sistema» | 
| mensajes [] .contenido | Contenido textual del mensaje | Cadena sin formato | 
| metadatos | Información en formato libre para facilitar la calificación | Objeto; campos opcionales transferidos de los datos de entrenamiento | 

**Campos de salida**


**Campos de salida**  

| Campo | Description (Descripción) | Notas adicionales | 
| --- | --- | --- | 
| id | El mismo identificador que la muestra de entrada | Debe coincidir con la entrada | 
| agregate\$1reward\$1score | Puntuación general de la muestra | Flotación (p. ej., 0,0—1,0 o rango definido por la tarea) | 
| Lista\$1de\$1métricas | Puntuaciones de los componentes que componen el agregado | Matriz de objetos métricos | 

### Permisos necesarios
<a name="model-customize-evaluation-custom-scorers-permissions"></a>

Asegúrese de que la función de SageMaker ejecución que utiliza para ejecutar la evaluación tenga AWS permisos Lambda.

```
{
    "Version": "2012-10-17",		 	 	 
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "lambda:InvokeFunction"
            ],
            "Resource": "arn:aws:lambda:region:account-id:function:function-name"
        }
    ]
}
```

Asegúrese AWS de que la función de ejecución de la función Lambda tenga los permisos básicos de ejecución de Lambda, así como los permisos adicionales que pueda necesitar para cualquier llamada descendente. AWS 

```
{
  "Version": "2012-10-17",		 	 	 
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*"
    }
  ]
}
```

# Implementación de modelos
<a name="model-customize-open-weight-deployment"></a>

Desde la página de detalles de los modelos personalizados, también puede implementar su modelo personalizado mediante puntos de conexión de SageMaker IA Inference o Amazon Bedrock.

![\[Una imagen que contiene una selección de técnicas de personalización de modelos.\]](http://docs.aws.amazon.com/es_es/sagemaker/latest/dg/images/screenshot-open-model-1.png)


# Ejemplos de conjuntos de datos y evaluadores
<a name="model-customize-open-weight-samples"></a>

## Ajuste fino supervisado (SFT)
<a name="model-customize-open-weight-samples-sft"></a>
+ Nombre: TAT-QA
+ Licencia: CC-BY-4.0
+ [Enlace: https://huggingface. co/datasets/next-tat/TAT-QA](https://huggingface.co/datasets/next-tat/TAT-QA)
+ Preprocesamiento - Formateo

**Una muestra**

```
{
    "prompt": "Given a table and relevant text descriptions, answer the following question.\n\nTable:\n<table border=\"1\" class=\"dataframe\">\n  <tbody>\n    <tr>\n      <td></td>\n      <td>2019</td>\n      <td>2018</td>\n    </tr>\n    <tr>\n      <td></td>\n      <td>$'000</td>\n      <td>$'000</td>\n    </tr>\n    <tr>\n      <td>Revenue from external customers</td>\n      <td></td>\n      <td></td>\n    </tr>\n    <tr>\n      <td>Australia</td>\n      <td>144,621</td>\n      <td>129,431</td>\n    </tr>\n    <tr>\n      <td>New Zealand</td>\n      <td>13,036</td>\n      <td>8,912</td>\n    </tr>\n    <tr>\n      <td>Total</td>\n      <td>157,657</td>\n      <td>138,343</td>\n    </tr>\n  </tbody>\n</table>\n\nParagraphs:\n    4. SEGMENT INFORMATION\n\n    During the 2019 and 2018 financial years, the Group operated wholly within one business segment being the operation and management of storage centres in Australia and New Zealand.\n\n    The Managing Director is the Group\u2019s chief operating decision maker and monitors the operating results on a portfolio wide basis. Monthly management reports are evaluated based upon the overall performance of NSR consistent with the presentation within the consolidated financial statements. The Group\u2019s financing (including finance costs and finance income) are managed on a Group basis and not allocated to operating segments.\n\n    The operating results presented in the statement of profit or loss represent the same segment information as reported in internal management information.\n\n    The revenue information above excludes interest income and is based on the location of storage centres.\n\nQuestion: How many business segments were present in 2019 and 2018?\n\nAnswer:",
    "completion": "one",
    "data_idx": "2951"
}
```

## Optimización de preferencias directas (DPO)
<a name="model-customize-open-weight-samples-dpo"></a>
+ Nombre: Ultrafeedback
+ Licencia: MIT
+ Enlace: [https://huggingface. co/datasets/openbmb/UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback)
+ Pasos de preprocesamiento: formateo y submuestreo.

**Una muestra**

```
{
    "source": "evol_instruct",
    "prompt": "Can you write a C++ program that prompts the user to enter the name of a country and checks if it borders the Mediterranean Sea? Here's some starter code to help you out:\n#include <iostream>\n#include <string>\nusing namespace std;\nint main() {\n    string country;\n    // prompt user for input\n    cout << \"Enter the name of a country: \";\n    cin >> country;\n    // check if country borders the Mediterranean Sea\n    // [C++ code]\n    return 0;\n}",
    "chosen": "Here's a C++ program that prompts the user to enter the name of a country and checks if it borders the Mediterranean Sea:\n\n#include <iostream>\n#include <string>\n#include <set>\n#include <map>\n#include <algorithm>\n\nusing namespace std;\n\nint main() {\n    // store countries and their bordering seas in a map\n    map<string, set<string>> countries;\n    countries[\"Algeria\"] = {\"Mediterranean Sea\", \"North African Coast\"};\n    countries[\"France\"] = {\"Mediterranean Sea\", \"English Channel\"};\n    countries[\"Italy\"] = {\"Mediterranean Sea\", \"Adriatic Sea\"};\n    countries[\"Malta\"] = {\"Mediterranean Sea\", \"North African Coast\"};\n    countries[\"Spain\"] = {\"Mediterranean Sea\", \"Atlantic Ocean\"};\n    countries[\"Tunisia\"] = {\"Mediterranean Sea\", \"North African Coast\"};\n\n    string country;\n    cout << \"Enter the name of a country: \";\n    cin >> country;\n\n    // check if country is in the map\n    if (countries.find(country)!= countries.end()) {\n        // check if country borders the Mediterranean Sea\n        set<string> borders = countries.at(country);\n        if (borders.find(\"Mediterranean Sea\")!= borders.end()) {\n            cout << country << \" borders the Mediterranean Sea.\" << endl;\n        } else {\n            cout << country << \" does not border the Mediterranean Sea.\" << endl;\n        }\n    } else {\n        cout << country << \" is not a recognized country.\" << endl;\n    }\n\n    return 0;\n}\n\nIn this program, we first define a map called countries that stores each country's name as a key and a set of its surrounding seas as the value. We use a set to store each sea name so that we can easily check if a country borders a certain sea.\n\nWe then present the user with a prompt to enter the name of a country. We use a while loop to ensure that the user enters a valid country name. We check if the country is in the map, and if so, we retrieve its bordering seas from the map.\n\nFinally, we check if the Mediterranean Sea is among the set of borders. If it is, we inform the user that the country borders the Mediterranean Sea. If not, we inform the user that the country does not border the Mediterranean Sea.",
    "chosen-rating": 5.0,
    "chosen-model": "starchat",
    "rejected": "Sure, here is the program using the C++11 algorithm \"cds::algorithm::GreaterEqual\":\n#include <iostream>\n#include <string>\n#include <algorithm>\n#include <vector>\n#include <cctype>\n\nusing namespace std;\n\nint main() {\n    string country;\n    cout << \"Enter the name of a country: \";\n    cin >> country;\n    std::vector<string> vec;\n    vec.push_back(country);\n    size_t index = std::find_if(vec.begin(), vec.end(), [](const string& s) {\n        return std::any_of(s.begin(), s.end(), [](const char& c) {\n            return c == '}}';\n    });\n    if (index != vec.end()) {\n        if ((*index)[0] == 'M') {\n            cout << \"Country is bordered by the Mediterranean Sea.\" << endl;\n        } else {\n            cout << \"Country does not border the Mediterranean Sea.\" << endl;\n        }\n    } else {\n        cout << \"Country is not found.\" << endl;\n    }\n    return 0;\n}",
    "rejected-rating": 1.25,
    "rejected-model": "pythia-12b"
}
```

## Aprendizaje reforzado a partir de la retroalimentación de la IA (RLAIF): evaluación por parejas
<a name="model-customize-open-weight-samples-rlaif"></a>

**Conjunto de datos de entrada**

Conjunto de datos fuente: [https://github.com/WeOpenML/PanDalm](https://github.com/WeOpenML/PandaLM)

**Una muestra**

```
{
    "data_source": "WeOpenML/PandaLM",
    "prompt": [
        {
            "role": "user",
            "content": "Below are two responses for a given task. The task is defined by the Instruction with an Input that provides further context. Evaluate the responses and generate a reference answer for the task.\n\n
            ### Instruction:\nCompare the given products.\n\n### Input:\niPhone 11 and Google Pixel 4\n\n
            ### Response 1:\nThe iPhone 11 has a larger screen size and a longer battery life than the Google Pixel 4.\n\n
            ### Response 2:\nThe iPhone 11 and Google Pixel 4 are both flagship smartphones released in 2018. The iPhone 11 has a 6.1-inch LCD display, while the Google Pixel 4 has a 5.7-inch OLED display. The iPhone 11 has an A13 Bionic chipset, while the Google Pixel 4 has a Qualcomm Snapdragon 845 chipset. The iPhone 11 has a dual-camera system, while the Google Pixel 4 has a single camera system. The iPhone 11 has a longer battery life, while the Google Pixel 4 has a faster processor.\n\n### Evaluation:\n"
        }
    ],
    "ability": "pairwise-judging",
    "reward_model": {
        "style": "llmj",
        "ground_truth": "2\n\n### Reason: Response 2 provides a more detailed and comprehensive comparison of the two products, including their specifications and features. Response 1 only mentions two aspects of the products and does not provide as much information.\n\n### Reference: The iPhone 11 and Google Pixel 4 are both flagship smartphones released in 2018. The iPhone 11 has a 6.1-inch LCD display, while the Google Pixel 4 has a 5.7-inch OLED display. The iPhone 11 has an A13 Bionic chipset, while the Google Pixel 4 has a Qualcomm Snapdragon 845 chipset. The iPhone 11 has a dual-camera system, while the Google Pixel 4 has a single camera system. The iPhone 11 has a longer battery life, while the Google Pixel 4 has a faster processor."
    },
    "extra_info": {
        "split": "train",
        "index": 0,
        "raw_output_sequence": "2\n\n### Reason: Response 2 provides a more detailed and comprehensive comparison of the two products, including their specifications and features. Response 1 only mentions two aspects of the products and does not provide as much information.\n\n### Reference: The iPhone 11 and Google Pixel 4 are both flagship smartphones released in 2018. The iPhone 11 has a 6.1-inch LCD display, while the Google Pixel 4 has a 5.7-inch OLED display. The iPhone 11 has an A13 Bionic chipset, while the Google Pixel 4 has a Qualcomm Snapdragon 845 chipset. The iPhone 11 has a dual-camera system, while the Google Pixel 4 has a single camera system. The iPhone 11 has a longer battery life, while the Google Pixel 4 has a faster processor.\n",
        "llmj": {
            "question": "Below are two responses for a given task. The task is defined by the Instruction with an Input that provides further context. Evaluate the responses and generate a reference answer for the task.\n\n### Instruction:\nCompare the given products.\n\n### Input:\niPhone 11 and Google Pixel 4\n\n### Response 1:\nThe iPhone 11 has a larger screen size and a longer battery life than the Google Pixel 4.\n\n### Response 2:\nThe iPhone 11 and Google Pixel 4 are both flagship smartphones released in 2018. The iPhone 11 has a 6.1-inch LCD display, while the Google Pixel 4 has a 5.7-inch OLED display. The iPhone 11 has an A13 Bionic chipset, while the Google Pixel 4 has a Qualcomm Snapdragon 845 chipset. The iPhone 11 has a dual-camera system, while the Google Pixel 4 has a single camera system. The iPhone 11 has a longer battery life, while the Google Pixel 4 has a faster processor.\n\n### Evaluation:\n",
            "ground_truth": "2\n\n### Reason: Response 2 provides a more detailed and comprehensive comparison of the two products, including their specifications and features. Response 1 only mentions two aspects of the products and does not provide as much information.\n\n### Reference: The iPhone 11 and Google Pixel 4 are both flagship smartphones released in 2018. The iPhone 11 has a 6.1-inch LCD display, while the Google Pixel 4 has a 5.7-inch OLED display. The iPhone 11 has an A13 Bionic chipset, while the Google Pixel 4 has a Qualcomm Snapdragon 845 chipset. The iPhone 11 has a dual-camera system, while the Google Pixel 4 has a single camera system. The iPhone 11 has a longer battery life, while the Google Pixel 4 has a faster processor.",
            "document_in_context": null
        },
        "sample_size": 1980
    }
}
```

## RLAIF: Cadena de pensamiento
<a name="model-customize-open-weight-samples-rlaif2"></a>

**Conjunto de datos de entrada**

Datos de origen: [https://huggingface. co/datasets/thesven/gsm8k-reasoning/tree/main/data](https://huggingface.co/datasets/thesven/gsm8k-reasoning/tree/main/data)

**Una muestra**

```
{
    "data_source": "openai/gsm8k",
    "prompt": [
        {
            "role": "system",
            "content": "\nYou are an AI assistant that uses a Chain of Thought (CoT) approach with reflection to answer queries.\n\nFollow these steps:\n- Think through the problem step by step within the ‹thinking> tags.\n- Reflect on your thinking\n\nto check for any errors or improvements within the ‹reflection› tags.\n- Make any necessary adjustments based on your reflection.\n- Provide your final, concise answer within the ‹output> tags.\n\nImportant: The <thinking> and ‹reflection› sections are for your internal reasoning process only.\nDo not include any part of the final answer in these sections.\nThe actual response to the query must be entirely contained within the ‹output› tags.\n\nUse the following format for your response:\n<thinking>\n[Your initial thought process goes here]\n</thinking›\n\n<reasoning>\n[Your step-by-step reasoning goes here. This is your internal thought process, not the final answer. You can create as many reasoning steps as necessary in your process.]\n</reasoning>\n\n‹reflection>\n[Your reflection on your reasoning, checking for errors or improvements. You can create as many reflection steps as necessary in your process.]\n</ reflection>\n\n<adjustment>\n[Any adjustments to your thinking based on your reflection]\n</adjustment>\n\n<output>\n[Your final, concise answer to the query. This is the only part that will be shown to the user.]\n</output>\n"
        },
        {
            "role": "user",
            "content": "A craft store makes a third of its sales in the fabric section, a quarter of its sales in the jewelry section, and the rest in the stationery section. They made 36 sales today. How many sales were in the stationery section?"
        }
    ],
    "ability": "chain-of-thought",
    "reward_model": {
        "style": "llmj-cot",
        "ground_truth": "Thus, there were 36 - 12 - 9 = <<36-12-9=15>>15 sales in the stationery section."
    },
    "extra_info": {
        "split": "train",
        "index": 0,
        "question": "A craft store makes a third of its sales in the fabric section, a quarter of its sales in the jewelry section, and the rest in the stationery section. They made 36 sales today. How many sales were in the stationery section?",
        "short_answer": "15",
        "model_output": "<thinking>\nTo find the number of sales in the stationery section, I need to first calculate the number of sales in the fabric and jewelry sections. The store makes a third of its sales in the fabric section, which is 1/3 of 36 sales. I can calculate this as 36 * 1/3 = 12 sales in the fabric section. The store makes a quarter of its sales in the jewelry section, which is 1/4 of 36 sales. I can calculate this as 36 * 1/4 = 9 sales in the jewelry section. Since the rest of the sales are in the stationery section, I can subtract the sales from the other two sections from the total number of sales to find the number of sales in the stationery section.\n</thinking>\n\n<reasoning>\n1. Calculate the number of sales in the fabric section: 36 * 1/3 = 12\n2. Calculate the number of sales in the jewelry section: 36 * 1/4 = 9\n3. Calculate the number of sales in the stationery section: 36 - 12 - 9 = 15\n</reasoning>\n\n‹reflection>\nUpon reviewing my reasoning, I noticed that my calculations seem correct, but I should ensure that I understand the question correctly. The question asks for the number of sales in the stationery section, and I believe I have correctly calculated that. However, I should double-check my math to make sure it's accurate.\n</reflection>\n\n<adjustment>\nUpon re-examining my math, I am confident that my calculations are correct. The fabric section makes up 1/3 of the sales, the jewelry section makes up 1/4, and the rest is the stationery section. My math checks out, and I am confident in my answer.\n</adjustment>\n\n<output>\n15\n</output>"
    }
}
```

## RELAIF - Fidelidad
<a name="model-customize-open-weight-samples-rlaif3"></a>

**Conjunto de datos de entrada**

Fuente: [https://huggingface. co/datasets/rajpurkar/squad\$1v2/blob/main/squad\$1v2/train-00000-de-00001](https://huggingface.co/datasets/rajpurkar/squad_v2/blob/main/squad_v2/train-00000-of-00001.parquet). parquet

**Una muestra**

```
{
    "data_source": "squad_v2",
    "prompt": [
        {
            "role": "system",
            "content": "You are a helpful assistant that answers questions based on the provided context. Only use information from the context."
        },
        {
            "role": "user",
            "content": "Context: Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles \"Crazy in Love\" and \"Baby Boy\".\n\nQuestion: When did Beyonce start becoming popular?"
        }
    ],
    "ability": "faithfulness",
    "reward_model": {
        "style": "llmj-faithfulness",
        "ground_truth": "Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles \"Crazy in Love\" and \"Baby Boy\"."
    },
    "extra_info": {
        "question": "When did Beyonce start becoming popular?",
        "split": "train",
        "index": 0
    }
}
```

## RLAIF: resumen
<a name="model-customize-open-weight-samples-rlaif4"></a>

**Conjunto de datos de entrada**

[Fuente: conjunto de datos gsm8k limpio https://huggingface. co/datasets/thesven/gsm8k-reasoning/tree/main/data](https://huggingface.co/datasets/thesven/gsm8k-reasoning/tree/main/data)

**Una muestra**

```
{
    "data_source": "cnn_dailymail",
    "prompt": [
        {
            "role": "system",
            "content": "You are a helpful assistant that creates concise, accurate summaries of news articles. Focus on the key facts and main points."
        },
        {
            "role": "user",
            "content": "Summarize the following article:\n\nLONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. Daniel Radcliffe as Harry Potter in \"Harry Potter and the Order of the Phoenix\" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. \"I don't plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar,\" he told an Australian interviewer earlier this month. \"I don't think I'll be particularly extravagant. \"The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs.\" At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film \"Hostel: Part II,\" currently six places below his number one movie on the UK box office chart. Details of how he'll mark his landmark birthday are under wraps. His agent and publicist had no comment on his plans. \"I'll definitely have some sort of party,\" he said in an interview. \"Hopefully none of you will be reading about it.\" Radcliffe's earnings from the first five Potter films have been held in a trust fund which he has not been able to touch. Despite his growing fame and riches, the actor says he is keeping his feet firmly on the ground. \"People are always looking to say 'kid star goes off the rails,'\" he told reporters last month. \"But I try very hard not to go that way because it would be too easy for them.\" His latest outing as the boy wizard in \"Harry Potter and the Order of the Phoenix\" is breaking records on both sides of the Atlantic and he will reprise the role in the last two films.  Watch I-Reporter give her review of Potter's latest » . There is life beyond Potter, however. The Londoner has filmed a TV movie called \"My Boy Jack,\" about author Rudyard Kipling and his son, due for release later this year. He will also appear in \"December Boys,\" an Australian film about four boys who escape an orphanage. Earlier this year, he made his stage debut playing a tortured teenager in Peter Shaffer's \"Equus.\" Meanwhile, he is braced for even closer media scrutiny now that he's legally an adult: \"I just think I'm going to be more sort of fair game,\" he told Reuters. E-mail to a friend . Copyright 2007 Reuters. All rights reserved.This material may not be published, broadcast, rewritten, or redistributed."
        }
    ],
    "ability": "summarization",
    "reward_model": {
        "style": "llmj-summarization",
        "ground_truth": "Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday .\nYoung actor says he has no plans to fritter his cash away .\nRadcliffe's earnings from first five Potter films have been held in trust fund ."
    },
    "extra_info": {
        "question": "LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. Daniel Radcliffe as Harry Potter in \"Harry Potter and the Order of the Phoenix\" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. \"I don't plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar,\" he told an Australian interviewer earlier this month. \"I don't think I'll be particularly extravagant. \"The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs.\" At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film \"Hostel: Part II,\" currently six places below his number one movie on the UK box office chart. Details of how he'll mark his landmark birthday are under wraps. His agent and publicist had no comment on his plans. \"I'll definitely have some sort of party,\" he said in an interview. \"Hopefully none of you will be reading about it.\" Radcliffe's earnings from the first five Potter films have been held in a trust fund which he has not been able to touch. Despite his growing fame and riches, the actor says he is keeping his feet firmly on the ground. \"People are always looking to say 'kid star goes off the rails,'\" he told reporters last month. \"But I try very hard not to go that way because it would be too easy for them.\" His latest outing as the boy wizard in \"Harry Potter and the Order of the Phoenix\" is breaking records on both sides of the Atlantic and he will reprise the role in the last two films.  Watch I-Reporter give her review of Potter's latest » . There is life beyond Potter, however. The Londoner has filmed a TV movie called \"My Boy Jack,\" about author Rudyard Kipling and his son, due for release later this year. He will also appear in \"December Boys,\" an Australian film about four boys who escape an orphanage. Earlier this year, he made his stage debut playing a tortured teenager in Peter Shaffer's \"Equus.\" Meanwhile, he is braced for even closer media scrutiny now that he's legally an adult: \"I just think I'm going to be more sort of fair game,\" he told Reuters. E-mail to a friend . Copyright 2007 Reuters. All rights reserved.This material may not be published, broadcast, rewritten, or redistributed.",
        "split": "train",
        "index": 0,
        "source_id": "42c027e4ff9730fbb3de84c1af0d2c50"
    }
}
```

## RLAIF: mensaje personalizado
<a name="model-customize-open-weight-samples-rlaif5"></a>

En este ejemplo, solemos analizar [RLAIF: Cadena de pensamiento](#model-customize-open-weight-samples-rlaif2) cómo un indicador jinja personalizado puede reemplazar uno de los mensajes preestablecidos.

**A continuación se muestra un ejemplo de un mensaje personalizado para CoT:**

```
You are an expert logical reasoning evaluator specializing in Chain-of-Thought (CoT) analysis. 

Given: A problem prompt and a model's reasoning-based response. 

Goal: Assess the quality and structure of logical reasoning, especially for specialized domains (law, medicine, finance, etc.).

Scoring rubric (start at 0.0, then add or subtract):

Core Components:

Structural Completeness (0.3 max)
- Clear problem statement: +0.05
- Defined variables/terminology: +0.05
- Organized given information: +0.05
- Explicit proof target: +0.05
- Step-by-step reasoning: +0.05
- Clear conclusion: +0.05

Logical Quality (0.4 max)
- Valid logical flow: +0.1
- Proper use of if-then relationships: +0.1
- Correct application of domain principles: +0.1
- No logical fallacies: +0.1

Technical Accuracy (0.3 max)
- Correct use of domain terminology: +0.1
- Accurate application of domain rules: +0.1
- Proper citation of relevant principles: +0.1

Critical Deductions:
A. Invalid logical leap: -0.3
B. Missing critical steps: -0.2
C. Incorrect domain application: -0.2
D. Unclear/ambiguous reasoning: -0.1

Additional Instructions:
- Verify domain-specific terminology and principles
- Check for logical consistency throughout
- Ensure conclusions follow from premises
- Flag potential domain-specific compliance issues
- Consider regulatory/professional standards where applicable

Return EXACTLY this JSON (no extra text):
{
    "score": <numerical score 0.0-1.0>,
    "component_scores": {
        "structural_completeness": <score>,
        "logical_quality": <score>,
        "technical_accuracy": <score>
    },
    "steps_present": {
        "problem_statement": <true/false>,
        "variable_definitions": <true/false>,
        "given_information": <true/false>,
        "proof_target": <true/false>,
        "step_reasoning": <true/false>,
        "conclusion": <true/false>
    },
    "reasoning": "<explain scoring decisions and identify any logical gaps>",
    "domain_flags": ["<any domain-specific concerns or compliance issues>"]
}

### (Prompt field from dataset)
Problem Prompt: {{ prompt }}

Model's Response: {{ model_output }}

### Ground truth (if applicable):
{{ ground_truth }}
```

## Aprendizaje reforzado a partir de recompensas verificables (RLVR): Exact Match
<a name="model-customize-open-weight-samples-RLVR"></a>

**Conjunto de datos de entrada**

Fuente: [https://huggingface. co/datasets/openai/gsm8k](https://huggingface.co/datasets/openai/gsm8k)

**Muestra**

```
{
  "data_source": "openai/gsm8k",
  "prompt": [
    {
      "content": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? Let\'s think step by step and output the final answer after \\"####\\".",
      "role": "user"
    }
  ],
  "ability": "math",
  "reward_model": {
    "ground_truth": "72",
    "style": "rule"
  },
  "extra_info": {
    "answer": "Natalia sold 48\\/2 = <<48\\/2=24>>24 clips in May.\\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\\n#### 72",
    "index": 0,
    "question": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?",
    "split": "train"
  }
}
```

## RLVR: ejecución de código
<a name="model-customize-open-weight-samples-RLVR2"></a>

**Conjunto de datos de entrada**

Fuente: [https://huggingface. co/datasets/open-r1/codeforces](https://huggingface.co/datasets/open-r1/codeforces)

**Muestra**

```
{
  "data_source": "codeforces",
  "prompt": [
    {
      "content": "\nWhen tackling complex reasoning tasks, you have access to the following actions. Use them as needed to progress through your thought process.\n\n[ASSESS]\n\n[ADVANCE]\n\n[VERIFY]\n\n[SIMPLIFY]\n\n[SYNTHESIZE]\n\n[PIVOT]\n\n[OUTPUT]\n\nYou should strictly follow the format below:\n\n[ACTION NAME]\n\n# Your action step 1\n\n# Your action step 2\n\n# Your action step 3\n\n...\n\nNext action: [NEXT ACTION NAME]\n\n",
      "role": "system"
    },
    {
      "content": "Title: Zebras\n\nTime Limit: None seconds\n\nMemory Limit: None megabytes\n\nProblem Description:\nOleg writes down the history of the days he lived. For each day he decides if it was good or bad. Oleg calls a non-empty sequence of days a zebra, if it starts with a bad day, ends with a bad day, and good and bad days are alternating in it. Let us denote bad days as 0 and good days as 1. Then, for example, sequences of days 0, 010, 01010 are zebras, while sequences 1, 0110, 0101 are not.\n\nOleg tells you the story of days he lived in chronological order in form of string consisting of 0 and 1. Now you are interested if it is possible to divide Oleg's life history into several subsequences, each of which is a zebra, and the way it can be done. Each day must belong to exactly one of the subsequences. For each of the subsequences, days forming it must be ordered chronologically. Note that subsequence does not have to be a group of consecutive days.\n\nInput Specification:\nIn the only line of input data there is a non-empty string *s* consisting of characters 0 and 1, which describes the history of Oleg's life. Its length (denoted as |*s*|) does not exceed 200<=000 characters.\n\nOutput Specification:\nIf there is a way to divide history into zebra subsequences, in the first line of output you should print an integer *k* (1<=\u2264<=*k*<=\u2264<=|*s*|), the resulting number of subsequences. In the *i*-th of following *k* lines first print the integer *l**i* (1<=\u2264<=*l**i*<=\u2264<=|*s*|), which is the length of the *i*-th subsequence, and then *l**i* indices of days forming the subsequence. Indices must follow in ascending order. Days are numbered starting from 1. Each index from 1 to *n* must belong to exactly one subsequence. If there is no way to divide day history into zebra subsequences, print -1.\n\nSubsequences may be printed in any order. If there are several solutions, you may print any of them. You do not have to minimize nor maximize the value of *k*.\n\nDemo Input:\n['0010100\\n', '111\\n']\n\nDemo Output:\n['3\\n3 1 3 4\\n3 2 5 6\\n1 7\\n', '-1\\n']\n\nNote:\nnone\n\nWrite Python code to solve the problem. Present the code in \n```python\nYour code\n```\nat the end.",
      "role": "user"
    }
  ],
  "ability": "code",
  "reward_model": {
    "ground_truth": "{\"inputs\": [\"0010100\", \"111\", \"0\", \"1\", \"0101010101\", \"010100001\", \"000111000\", \"0101001000\", \"0000001000\", \"0101\", \"000101110\", \"010101010\", \"0101001010\", \"0100101100\", \"0110100000\", \"0000000000\", \"1111111111\", \"0010101100\", \"1010000\", \"0001110\", \"0000000000011001100011110101000101000010010111000100110110000011010011110110001100100001001001010010\", \"01010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010\", \"0010011100000000\"], \"outputs\": [\"3\\n1 1\\n5 2 3 4 5 6\\n1 7\", \"-1\", \"1\\n1 1\", \"-1\", \"-1\", \"-1\", \"3\\n3 1 6 7\\n3 2 5 8\\n3 3 4 9\", \"4\\n5 1 2 3 4 5\\n3 6 7 8\\n1 9\\n1 10\", \"8\\n1 1\\n1 2\\n1 3\\n1 4\\n1 5\\n3 6 7 8\\n1 9\\n1 10\", \"-1\", \"-1\", \"1\\n9 1 2 3 4 5 6 7 8 9\", \"2\\n5 1 2 3 4 5\\n5 6 7 8 9 10\", \"2\\n5 1 2 3 8 9\\n5 4 5 6 7 10\", \"-1\", \"10\\n1 1\\n1 2\\n1 3\\n1 4\\n1 5\\n1 6\\n1 7\\n1 8\\n1 9\\n1 10\", \"-1\", \"2\\n3 1 8 9\\n7 2 3 4 5 6 7 10\", \"-1\", \"-1\", \"22\\n1 1\\n1 2\\n1 3\\n1 4\\n1 5\\n1 6\\n1 7\\n1 8\\n7 9 24 25 26 27 28 29\\n7 10 13 14 17 18 23 30\\n11 11 12 15 16 19 22 31 32 33 34 35\\n3 20 21 36\\n3 37 46 47\\n9 38 39 40 45 48 57 58 75 76\\n17 41 42 43 44 49 50 51 54 55 56 59 72 73 74 77 80 81\\n9 52 53 60 71 78 79 82 83 84\\n7 61 64 65 66 67 70 85\\n5 62 63 68 69 86\\n3 87 88 89\\n3 90 91 92\\n5 93 94 95 96 97\\n3 98 99 100\", \"1\\n245 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 ...\", \"8\\n3 1 8 9\\n5 2 3 4 7 10\\n3 5 6 11\\n1 12\\n1 13\\n1 14\\n1 15\\n1 16\"]}",
    "style": "rule"
  },
  "extra_info": {
    "index": 49,
    "split": "train"
  }
}
```

**Función de recompensa**

Función de recompensa: [https://github.com/volcengine/verl/tree/main/verl/utils/reward\$1score/prime\$1code](https://github.com/volcengine/verl/tree/main/verl/utils/reward_score/prime_code)

## RLVR - Respuesta matemática
<a name="model-customize-open-weight-samples-RLVR3"></a>

**Conjunto de datos de entrada**

[Fuente: conjunto de datos gsm8k limpio https://huggingface. co/datasets/thesven/gsm8k-reasoning/tree/main/data](https://huggingface.co/datasets/thesven/gsm8k-reasoning/tree/main/data)

**Muestra**

```
[
    {
        "content": "\nYou are an AI assistant that uses a Chain of Thought (CoT) approach with reflection to answer queries...",
        "role": "system"
    },
    {
        "content": "A craft store makes a third of its sales in the fabric section, a quarter of its sales in the jewelry section, and the rest in the stationery section. They made 36 sales today. How many sales were in the stationery section?",
        "role": "user"
    },
    {
        "content": "\nTo find the number of sales in the stationery section, I need to first calculate the number of sales in the fabric and jewelry sections...\n\n\n<reasoning>\n1. Calculate the number of sales in the fabric section: 36 * 1/3 = 12\n2. Calculate the number of sales in the jewelry section: 36 * 1/4 = 9\n3. Calculate the number of sales in the stationery section: 36 - 12 - 9 = 15\n</reasoning>\n\n<reflection>\nUpon reviewing my reasoning, I noticed that my calculations seem correct...\n</reflection>\n\n<adjustment>\nUpon re-examining my math, I am confident that my calculations are correct...\n</adjustment>\n\n<output>\n15\n</output>",
        "role": "assistant"
    }
]
```

**Cálculo de recompensas**

Función de recompensa: [https://github.com/volcengine/verl/blob/main/verl/utils/reward\$1score/gsm8k.py](https://github.com/volcengine/verl/blob/main/verl/utils/reward_score/gsm8k.py)

## RLVR - Lambda personalizada
<a name="model-customize-open-weight-samples-RLVR4"></a>

**Conjunto de datos de entrada**

[Fuente: conjunto de datos gsm8k limpio https://huggingface. co/datasets/thesven/gsm8k-reasoning/tree/main/data](https://huggingface.co/datasets/thesven/gsm8k-reasoning/tree/main/data)

**Muestra**

```
[
  {
    "content": "\nYou are an AI assistant that uses a Chain of Thought (CoT) approach with reflection to answer queries.\n\nFollow these steps:\n- Think through the problem step by step within the ‹thinking> tags.\n- Reflect on your thinking\n\nto check for any errors or improvements within the ‹reflection› tags.\n- Make any necessary adjustments based on your reflection.\n- Provide your final, concise answer within the ‹output> tags.\n\nImportant: The <thinking> and ‹reflection› sections are for your internal reasoning process only.\nDo not include any part of the final answer in these sections.\nThe actual response to the query must be entirely contained within the ‹output› tags.\n\nUse the following format for your response:\n<thinking>\n[Your initial thought process goes here]\n</thinking›\n\n<reasoning>\n[Your step-by-step reasoning goes here. This is your internal thought process, not the final answer. You can create as many reasoning steps as necessary in your process.]\n</reasoning>\n\n‹reflection>\n[Your reflection on your reasoning, checking for errors or improvements. You can create as many reflection steps as necessary in your process.]\n</ reflection>\n\n<adjustment>\n[Any adjustments to your thinking based on your reflection]\n</adjustment>\n\n<output>\n[Your final, concise answer to the query. This is the only part that will be shown to the user.]\n</output>\n",
    "role": "system"
  },
  {
    "content": "A craft store makes a third of its sales in the fabric section, a quarter of its sales in the jewelry section, and the rest in the stationery section. They made 36 sales today. How many sales were in the stationery section?",
    "role": "user"
  },
  {
    "content": "<thinking>\nTo find the number of sales in the stationery section, I need to first calculate the number of sales in the fabric and jewelry sections. The store makes a third of its sales in the fabric section, which is 1/3 of 36 sales. I can calculate this as 36 * 1/3 = 12 sales in the fabric section. The store makes a quarter of its sales in the jewelry section, which is 1/4 of 36 sales. I can calculate this as 36 * 1/4 = 9 sales in the jewelry section. Since the rest of the sales are in the stationery section, I can subtract the sales from the other two sections from the total number of sales to find the number of sales in the stationery section.\n</thinking>\n\n<reasoning>\n1. Calculate the number of sales in the fabric section: 36 * 1/3 = 12\n2. Calculate the number of sales in the jewelry section: 36 * 1/4 = 9\n3. Calculate the number of sales in the stationery section: 36 - 12 - 9 = 15\n</reasoning>\n\n‹reflection>\nUpon reviewing my reasoning, I noticed that my calculations seem correct, but I should ensure that I understand the question correctly. The question asks for the number of sales in the stationery section, and I believe I have correctly calculated that. However, I should double-check my math to make sure it's accurate.\n</reflection>\n\n<adjustment>\nUpon re-examining my math, I am confident that my calculations are correct. The fabric section makes up 1/3 of the sales, the jewelry section makes up 1/4, and the rest is the stationery section. My math checks out, and I am confident in my answer.\n</adjustment>\n\n<output>\n15\n</output>",
    "role": "assistant"
  }
]
```

**Ejemplo de cálculo de recompensas**

```
# RLVR Evaluator for OSS

# lambda_grader.py
import json
import re
import uuid
from typing import Any, Dict, List
 
def custom_reward(assistant_answer: str, ground_truth: str) -> float:
    """
    Add custom reward computation here
 
    Example:-
    Reward = fraction of ground-truth words that are correct
    in the correct position.
 
    Example:
      gt:   "the cat sat"
      ans:  "the dog sat"
 
      word-by-word:
        "the" == "the"  -> correct
        "dog" != "cat"  -> wrong
        "sat" == "sat"  -> correct
 
      correct = 2 out of 3 -> reward = 2/3 ≈ 0.67
    """
    ans_words = assistant_answer.strip().lower().split()
    gt_words = ground_truth.strip().lower().split()
 
    if not gt_words:
        return 0.0
 
    correct = 0
    for aw, gw in zip(ans_words, gt_words):
        if aw == gw:
            correct += 1
 
    return correct / len(gt_words)
 
 
# Lambda utility functions
def _ok(body: Any, code: int = 200) -> Dict[str, Any]:
    return {
        "statusCode": code,
        "headers": {
            "content-type": "application/json",
            "access-control-allow-origin": "*",
            "access-control-allow-methods": "POST,OPTIONS",
            "access-control-allow-headers": "content-type",
        },
        "body": json.dumps(body),
    }
 
def _assistant_text(sample: Dict[str, Any]) -> str:
    """Extract assistant text from sample messages."""
    for m in reversed(sample.get("messages", [])):
        if m.get("role") == "assistant":
            return (m.get("content") or "").strip()
    return ""
 
def _sample_id(sample: Dict[str, Any]) -> str:
    """Generate or extract sample ID."""
    if isinstance(sample.get("id"), str) and sample["id"]:
        return sample["id"]
 
    return str(uuid.uuid4())
 
def _ground_truth(sample: Dict[str, Any]) -> str:
    """Extract ground truth from sample or metadata if available"""
 
    if isinstance(sample.get("reference_answer"), str) and sample["reference_answer"]:
        return sample["reference_answer"].strip()
 
    md = sample.get("metadata") or {}
    gt = md.get("reference_answer", None) or md.get("ground_truth", None)
    if gt is None:
        return ""
    return str(gt).strip()
 
 
def _score_and_metrics(sample: Dict[str, Any]) -> Dict[str, Any]:
    sid = _sample_id(sample)
    solution_text = _assistant_text(sample)
 
    # Extract ground truth
    gt = _ground_truth(sample)
 
    metrics_list: List[Dict[str, Any]] = []
 
    # Custom rlvr scoring
    if solution_text and gt:
        
        # Compute score
        reward_score = custom_reward(
            assistant_answer=solution_text,
            ground_truth=gt
        )
        
        # Add detailed metrics
        metrics_list.append({
            "name": "custom_reward_score", 
            "value": float(reward_score), 
            "type": "Reward"
        })
       
        # The aggregate reward score is the custom reward score
        aggregate_score = float(reward_score)
        
    else:
        # No solution text or ground truth - default to 0
        aggregate_score = 0.0
        metrics_list.append({
            "name": "default_zero", 
            "value": 0.0, 
            "type": "Reward"
        })
    print("detected score", {
        "id": sid,
        "aggregate_reward_score": float(aggregate_score),
        "metrics_list": metrics_list,
    })
    return {
        "id": sid,
        "aggregate_reward_score": float(aggregate_score),
        "metrics_list": metrics_list,
    }
 
def lambda_handler(event, context):
    """AWS Lambda handler for custom reward lambda grading."""
    # CORS preflight
    if event.get("requestContext", {}).get("http", {}).get("method") == "OPTIONS":
        return _ok({"ok": True})
 
    # Body may be a JSON string (API GW/Function URL) or already a dict (Invoke)
    raw = event.get("body") or "{}"
    try:
        body = json.loads(raw) if isinstance(raw, str) else raw
    except Exception as e:
        return _ok({"error": f"invalid JSON body: {e}"}, 400)
 
    # Accept top-level list, {"batch":[...]}, or single sample object
    if isinstance(body, dict) and isinstance(body.get("batch"), list):
        samples = body["batch"]
    else:
        return _ok({
            "error": "Send a sample object, or {'batch':[...]} , or a top-level list of samples."
        }, 400)
 
    try:
        results = [_score_and_metrics(s) for s in samples]
    except Exception as e:
        return _ok({"error": f"Custom scoring failed: {e}"}, 500)
 
    return _ok(results)
```

**Ejemplo de código de función de recompensa**

```
# RLVR Evaluator for OSS
# lambda_grader.py

import json
import re
import uuid from typing 
import Any, Dict, List
 
def custom_reward(assistant_answer: str, ground_truth: str) -> float:
    """
    Add custom reward computation here
 
    Example:-
    Reward = fraction of ground-truth words that are correct
    in the correct position.
 
    Example:
      gt:   "the cat sat"
      ans:  "the dog sat"
 
      word-by-word:
        "the" == "the"  -> correct
        "dog" != "cat"  -> wrong
        "sat" == "sat"  -> correct
 
      correct = 2 out of 3 -> reward = 2/3 ≈ 0.67
    """
    ans_words = assistant_answer.strip().lower().split()
    gt_words = ground_truth.strip().lower().split()
 
    if not gt_words:
        return 0.0
 
    correct = 0
    for aw, gw in zip(ans_words, gt_words):
        if aw == gw:
            correct += 1
 
    return correct / len(gt_words)
 
 
# Lambda utility functions
def _ok(body: Any, code: int = 200) -> Dict[str, Any]:
    return {
        "statusCode": code,
        "headers": {
            "content-type": "application/json",
            "access-control-allow-origin": "*",
            "access-control-allow-methods": "POST,OPTIONS",
            "access-control-allow-headers": "content-type",
        },
        "body": json.dumps(body),
    }
 
def _assistant_text(sample: Dict[str, Any]) -> str:
    """Extract assistant text from sample messages."""
    for m in reversed(sample.get("messages", [])):
        if m.get("role") == "assistant":
            return (m.get("content") or "").strip()
    return ""
 
def _sample_id(sample: Dict[str, Any]) -> str:
    """Generate or extract sample ID."""
    if isinstance(sample.get("id"), str) and sample["id"]:
        return sample["id"]
 
    return str(uuid.uuid4())
 
def _ground_truth(sample: Dict[str, Any]) -> str:
    """Extract ground truth from sample or metadata if available"""
 
    if isinstance(sample.get("reference_answer"), str) and sample["reference_answer"]:
        return sample["reference_answer"].strip()
 
    md = sample.get("metadata") or {}
    gt = md.get("reference_answer", None) or md.get("ground_truth", None)
    if gt is None:
        return ""
    return str(gt).strip()
 
 
def _score_and_metrics(sample: Dict[str, Any]) -> Dict[str, Any]:
    sid = _sample_id(sample)
    solution_text = _assistant_text(sample)
 
    # Extract ground truth
    gt = _ground_truth(sample)
 
    metrics_list: List[Dict[str, Any]] = []
 
    # Custom rlvr scoring
    if solution_text and gt:
        
        # Compute score
        reward_score = custom_reward(
            assistant_answer=solution_text,
            ground_truth=gt
        )
        
        # Add detailed metrics
        metrics_list.append({
            "name": "custom_reward_score", 
            "value": float(reward_score), 
            "type": "Reward"
        })
       
        # The aggregate reward score is the custom reward score
        aggregate_score = float(reward_score)
        
    else:
        # No solution text or ground truth - default to 0
        aggregate_score = 0.0
        metrics_list.append({
            "name": "default_zero", 
            "value": 0.0, 
            "type": "Reward"
        })
    print("detected score", {
        "id": sid,
        "aggregate_reward_score": float(aggregate_score),
        "metrics_list": metrics_list,
    })
    return {
        "id": sid,
        "aggregate_reward_score": float(aggregate_score),
        "metrics_list": metrics_list,
    }
 
def lambda_handler(event, context):
    """AWS Lambda handler for custom reward lambda grading."""
    # CORS preflight
    if event.get("requestContext", {}).get("http", {}).get("method") == "OPTIONS":
        return _ok({"ok": True})
 
    # Body may be a JSON string (API GW/Function URL) or already a dict (Invoke)
    raw = event.get("body") or "{}"
    try:
        body = json.loads(raw) if isinstance(raw, str) else raw
    except Exception as e:
        return _ok({"error": f"invalid JSON body: {e}"}, 400)
 
    # Accept top-level list, {"batch":[...]}, or single sample object
    if isinstance(body, dict) and isinstance(body.get("batch"), list):
        samples = body["batch"]
    else:
        return _ok({
            "error": "Send a sample object, or {'batch':[...]} , or a top-level list of samples."
        }, 400)
 
    try:
        results = [_score_and_metrics(s) for s in samples]
    except Exception as e:
        return _ok({"error": f"Custom scoring failed: {e}"}, 500)
 
    return _ok(results)
```

**Ejemplo de aviso de recompensa**

```
You are an expert RAG response evaluator specializing in faithfulness and relevance assessment.
Given: Context documents, a question, and response statements.
Goal: Evaluate both statement-level faithfulness and overall response relevance to the question.

Scoring rubric (start at 0.0, then add or subtract):

Core Components:

Faithfulness Assessment (0.6 max)
Per statement evaluation:
- Direct support in context: +0.2
- Accurate inference from context: +0.2
- No contradictions with context: +0.2
Deductions:
- Hallucination: -0.3
- Misrepresentation of context: -0.2
- Unsupported inference: -0.1

Question Relevance (0.4 max)
- Direct answer to question: +0.2
- Appropriate scope/detail: +0.1
- Proper context usage: +0.1
Deductions:
- Off-topic content: -0.2
- Implicit/meta responses: -0.2
- Missing key information: -0.1

Critical Flags:
A. Complete hallucination
B. Context misalignment
C. Question misinterpretation
D. Implicit-only responses

Additional Instructions:
- Evaluate each statement independently
- Check for direct textual support
- Verify logical inferences
- Assess answer completeness
- Flag any unsupported claims

Return EXACTLY this JSON (no extra text):
{
    "statements_evaluation": [
        {
            "statement": "<statement_text>",
            "verdict": <0 or 1>,
            "reason": "<detailed explanation>",
            "context_support": "<relevant context quote or 'None'>"
        }
    ],
    "overall_assessment": {
        "question_addressed": <0 or 1>,
        "reasoning": "<explanation>",
        "faithfulness_score": <0.0-1.0>,
        "relevance_score": <0.0-1.0>
    },
    "flags": ["<any critical issues>"]
}

## Current Evaluation Task

### Context
{{ ground_truth }}

### Question
{{ extra_info.question }}

### Model's Response
{{ model_output }}
```