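Understand the results of a human evaluation job
When you created a model evaluation job that uses human workers, you selected one or more metric types. When members of the work team evaluate a response in the worker portal, their responses are saved in the humanAnswers JSON object. How those responses are stored depends on the metric type selected when the job was created.
The following sections explain these differences and provide examples.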
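JSON output reference
When a model evaluation job completes, the results are saved in Amazon S3 as a JSON file. The JSON object contains three high-level nodes: humanEvaluationResult, inputRecord, and modelResponses. The humanEvaluationResult key contains the responses from the work team assigned to the model evaluation job. The inputRecord key contains the prompt that was provided to the models when the model evaluation job was created. The modelResponses key contains the models' responses to the prompt.
The following table summarizes the key-value pairs found in the JSON output of a model evaluation job.
The subsequent sections provide more detail about each key-value pair.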
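| Parameter | Example | Description |
|---|---|---|
| flowDefinitionArn | arn:aws:sagemaker:us-west-2: | The ARN of the human review workflow (flow definition) that created the human loop. |
| humanAnswers | A list of JSON objects specific to the evaluation metrics selected. To learn more, see Key-value pairs found under humanAnswers. | A list of JSON objects that contain worker responses. |
| humanLoopName | system-generated-hash | A system-generated 40-character hexadecimal string. |
| inputRecord | | A JSON object that contains an input prompt from the input dataset. |
| modelResponses | | The individual responses from the models. |
| inputContent | | The human loop input content required to start the human loop in your Amazon S3 bucket. |
| modelResponseIdMap | | Describes how each model's full name maps to the IDs used in humanAnswers. |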
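As a minimal sketch of how the three high-level nodes could be pulled apart once the output file has been downloaded, the following Python snippet parses a record and checks that each node is present. The trimmed record, the model names `model-a` and `model-b`, and the helper name `split_record` are hypothetical and used only for illustration.

```python
import json

# A trimmed, hypothetical record shaped like the output described above.
record_text = """
{
  "humanEvaluationResult": {
    "flowDefinitionArn": "arn:aws:sagemaker:us-west-2:111122223333:flow-definition/flow-definition-name",
    "humanAnswers": [],
    "humanLoopName": "system-generated-hash",
    "inputContent": {},
    "modelResponseIdMap": {"0": "model-a", "1": "model-b"}
  },
  "inputRecord": {"prompt": {"text": "Who invented the airplane?"}},
  "modelResponses": [
    {"modelIdentifier": "model-a", "text": "The Wright brothers."}
  ]
}
"""

EXPECTED_TOP_LEVEL = ("humanEvaluationResult", "inputRecord", "modelResponses")


def split_record(text):
    """Parse one output record and return its three high-level nodes."""
    record = json.loads(text)
    missing = [key for key in EXPECTED_TOP_LEVEL if key not in record]
    if missing:
        raise KeyError(f"record is missing top-level keys: {missing}")
    return tuple(record[key] for key in EXPECTED_TOP_LEVEL)


human_eval, input_record, model_responses = split_record(record_text)
print(sorted(human_eval))              # keys found under humanEvaluationResult
print(input_record["prompt"]["text"])  # the prompt sent to the models
print(len(model_responses))            # number of model responses in the record
```

Failing fast on a missing node is a design choice for this sketch; a production reader might prefer to log and skip incomplete records.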
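Key-value pairs found under humanEvaluationResult
The following key-value pairs are found under humanEvaluationResult in the output of your model evaluation job.
For the key-value pairs associated with humanAnswers, see Key-value pairs found under humanAnswers.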
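flowDefinitionArn
- The ARN of the flow definition used to complete the model evaluation job.
- Example: arn:aws:sagemaker:us-west-2:111122223333:flow-definition/flow-definition-name
humanLoopName
- A system-generated 40-character hexadecimal string.
inputContent
- Describes the metric types and the instructions you provided to workers in the worker portal.
  - additionalDataS3Uri: The location in Amazon S3 where the worker instructions are saved.
  - instructions: The instructions you provided to workers in the worker portal.
  - evaluationMetrics: The name of each metric and its description. The metricType key identifies the tool provided to workers to evaluate the models' responses.
modelResponseIdMap
- Identifies the full names of the selected models and how worker choices map to those models in the humanAnswers key-value pairs.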
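The lookups described above can be sketched in a few lines of Python. The values are copied from the sample output later on this page; the helper name `resolve_model` is hypothetical, and treating IDs as strings is an assumption based on that sample.

```python
def resolve_model(model_response_id_map, model_response_id):
    """Return the full model name for an ID used in humanAnswers."""
    # Assumption: IDs are stored as strings ("0", "1"), as in the sample output.
    return model_response_id_map[str(model_response_id)]


model_response_id_map = {
    "0": "sm-margaret-meta-textgeneration-llama-2-7b-1711485008-0612",
    "1": "jumpstart-dft-hf-llm-mistral-7b-ins-20240327-043352",
}

evaluation_metrics = [
    {"metricName": "Fluency", "metricType": "ComparisonChoice"},
    {"metricName": "Correctness", "metricType": "IndividualLikertScale"},
]

# Index the tool (metricType) shown to workers by metric name.
metric_types = {m["metricName"]: m["metricType"] for m in evaluation_metrics}

# One worker's comparisonChoice entry, resolved to a model name.
choice = {"metricName": "Fluency", "result": {"modelResponseId": "1"}}
chosen = resolve_model(model_response_id_map, choice["result"]["modelResponseId"])
print(metric_types[choice["metricName"]], "->", chosen)
```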
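Key-value pairs found under inputRecord
The following entries describe the inputRecord key-value pairs.
prompt
- The text of the prompt sent to the model.
category
- An optional category that classifies the prompt. It is visible to workers in the worker portal during the model evaluation.
- Example: "American cities"
referenceResponse
- An optional field from the input JSON that specifies the ground truth you want workers to reference during the evaluation.
responses
- An optional field from the input JSON that contains responses from other models.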
An example JSON input record. The same modelIdentifier must be specified for all responses.
{
  "prompt": { "text": "Who invented the airplane?" },
  "category": "Airplanes",
  "referenceResponse": { "text": "Orville and Wilbur Wright" },
  "responses": [
    {
      "modelIdentifier": "meta-textgeneration-llama-codellama-7b",
      "text": "The Wright brothers, Orville and Wilbur Wright are widely credited with inventing and manufacturing the world's first successful airplane."
    }
  ]
}
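Before submitting a dataset, a quick local check of each input record can catch obvious mistakes. The sketch below is not an official validator. It assumes that prompt.text is required (the field list above marks only the other fields as optional) and reads the modelIdentifier note as applying to the responses within a single record. The helper name `inspect_input_record` is hypothetical.

```python
import json

OPTIONAL_FIELDS = ("category", "referenceResponse", "responses")


def inspect_input_record(text):
    """Return (optional fields present, list of problems) for one input record."""
    record = json.loads(text)
    problems = []

    # Assumption: every record needs a non-empty prompt.text.
    prompt = record.get("prompt")
    if not isinstance(prompt, dict) or not prompt.get("text"):
        problems.append("prompt.text is missing or empty")

    # Assumption: all responses in a record share one modelIdentifier.
    identifiers = {r.get("modelIdentifier") for r in record.get("responses", [])}
    if len(identifiers) > 1:
        names = sorted(str(i) for i in identifiers)
        problems.append(f"responses use more than one modelIdentifier: {names}")

    present = [field for field in OPTIONAL_FIELDS if field in record]
    return present, problems


example = """
{
  "prompt": {"text": "Who invented the airplane?"},
  "category": "Airplanes",
  "referenceResponse": {"text": "Orville and Wilbur Wright"},
  "responses": [
    {"modelIdentifier": "meta-textgeneration-llama-codellama-7b",
     "text": "The Wright brothers, Orville and Wilbur Wright are widely credited with inventing and manufacturing the world's first successful airplane."}
  ]
}
"""

present, problems = inspect_input_record(example)
print(present)
print(problems)
```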
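Key-value pairs found under modelResponses
An array of key-value pairs that contains the responses from the models and identifies which model provided each response.
text
- The model's response to the prompt.
modelIdentifier
- The name of the model.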
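Key-value pairs found under humanAnswers
An array of key-value pairs that contains the responses from the models and how the workers evaluated them.
acceptanceTime
- The time when the worker accepted the task in the worker portal.
submissionTime
- The time when the worker submitted their response.
timeSpentInSeconds
- How long the worker spent completing the task.
workerId
- The ID of the worker who completed the task.
workerMetadata
- Metadata about which work team was assigned to this model evaluation job.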
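The timing fields can be cross-checked against each other. In the sample output on this page, the gap between acceptanceTime and submissionTime matches timeSpentInSeconds for both workers, although nothing here guarantees that relationship in general. The sketch below assumes the timestamps are UTC with millisecond precision and a trailing Z, as in the sample. The helper names are hypothetical.

```python
from datetime import datetime, timezone

# Assumption: timestamps look like 2024-06-07T22:31:57.066Z (UTC).
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def parse_timestamp(value):
    """Parse a timestamp string into a timezone-aware datetime."""
    return datetime.strptime(value, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)


def elapsed_seconds(answer):
    """Seconds between a worker accepting and submitting a task."""
    accepted = parse_timestamp(answer["acceptanceTime"])
    submitted = parse_timestamp(answer["submissionTime"])
    return (submitted - accepted).total_seconds()


# Timing fields of the first worker in the sample output.
answer = {
    "acceptanceTime": "2024-06-07T22:31:57.066Z",
    "submissionTime": "2024-06-07T22:32:19.640Z",
    "timeSpentInSeconds": 22.574,
    "workerId": "ead1ba56c1278175",
}

difference = elapsed_seconds(answer)
print(answer["workerId"], difference, answer["timeSpentInSeconds"])
```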
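Format of the answerContent JSON array
The structure of an answer depends on the evaluation metrics selected when the model evaluation job was created. Each worker response, or answer, is recorded in a new JSON object.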
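answerContent
- evaluationResults contains the worker's responses.
  - When Choice buttons is selected, the results from each worker are in "evaluationResults": "comparisonChoice".
    - metricName: The name of the metric.
    - result: A JSON object that indicates which model the worker selected, using either 0 or 1. To see which value a model maps to, see modelResponseIdMap.
  - When Likert scale, comparison is selected, the results from each worker are in "evaluationResults": "comparisonLikertScale".
    - metricName: The name of the metric.
    - leftModelResponseId: Indicates which modelResponseIdMap entry was shown on the left side of the worker portal.
    - rightModelResponseId: Indicates which modelResponseIdMap entry was shown on the right side of the worker portal.
    - result: A JSON object that indicates which model the worker selected, using either 0 or 1. To see which value a model maps to, see modelResponseIdMap.
  - When Ordinal rank is selected, the results from each worker are in "evaluationResults": "comparisonRank".
    - metricName: The name of the metric.
    - result: An array of JSON objects. For each model (modelResponseIdMap), the worker provides a rank.
      "result": [{ "modelResponseId": "0", "rank": 1 }, { "modelResponseId": "1", "rank": 1 }]
  - When Likert scale, evaluation of a single model response is selected, the results from a worker are saved in "evaluationResults": "individualLikertScale". This is a JSON array that contains the scores for the metricName specified when the job was created.
    - metricName: The name of the metric.
    - modelResponseId: The model that was scored. To see which value a model maps to, see modelResponseIdMap.
    - result: A key-value pair that indicates the Likert scale value selected by the worker.
  - When Thumbs up/down is selected, the results from a worker are saved as a JSON array in "evaluationResults": "thumbsUpDown".
    - metricName: The name of the metric.
    - result: Either true or false, as it relates to the metricName. When a worker chooses thumbs up, "result" : true.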
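Because each metric type stores its result in a different shape, code that reads evaluationResults usually needs one branch per type. The following Python sketch reduces one worker's evaluationResults to a plain dictionary. The field names follow the list above and the sample output on this page; the function name `summarize_answer` and the shape of the returned summary are hypothetical choices.

```python
def summarize_answer(evaluation_results):
    """Reduce one worker's evaluationResults to a summary keyed by metric type."""
    summary = {}

    # comparisonChoice: result is an object naming the selected model.
    for entry in evaluation_results.get("comparisonChoice", []):
        chosen = entry["result"]["modelResponseId"]
        summary.setdefault("comparisonChoice", {})[entry["metricName"]] = chosen

    # comparisonLikertScale: keep the left/right placement next to the result.
    for entry in evaluation_results.get("comparisonLikertScale", []):
        summary.setdefault("comparisonLikertScale", {})[entry["metricName"]] = {
            "left": entry["leftModelResponseId"],
            "right": entry["rightModelResponseId"],
            "result": entry["result"],
        }

    # comparisonRank: result is a list of {modelResponseId, rank} objects.
    for entry in evaluation_results.get("comparisonRank", []):
        ranks = {item["modelResponseId"]: item["rank"] for item in entry["result"]}
        summary.setdefault("comparisonRank", {})[entry["metricName"]] = ranks

    # individualLikertScale and thumbsUpDown: one entry per metric and model.
    for metric_type in ("individualLikertScale", "thumbsUpDown"):
        for entry in evaluation_results.get(metric_type, []):
            per_metric = summary.setdefault(metric_type, {})
            per_model = per_metric.setdefault(entry["metricName"], {})
            per_model[entry["modelResponseId"]] = entry["result"]

    return summary


# evaluationResults of the first worker in the sample output (trimmed).
first_worker = {
    "comparisonChoice": [{"metricName": "Fluency", "result": {"modelResponseId": "0"}}],
    "comparisonLikertScale": [
        {"leftModelResponseId": "0", "metricName": "Coherence", "result": 1, "rightModelResponseId": "1"}
    ],
    "comparisonRank": [
        {"metricName": "Toxicity",
         "result": [{"modelResponseId": "0", "rank": 1}, {"modelResponseId": "1", "rank": 1}]}
    ],
    "individualLikertScale": [
        {"metricName": "Correctness", "modelResponseId": "0", "result": 2},
        {"metricName": "Correctness", "modelResponseId": "1", "result": 3},
    ],
    "thumbsUpDown": [
        {"metricName": "Accuracy", "modelResponseId": "0", "result": True},
        {"metricName": "Accuracy", "modelResponseId": "1", "result": True},
    ],
}

summary = summarize_answer(first_worker)
for metric_type, values in summary.items():
    print(metric_type, values)
```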
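Example output from a model evaluation job
The following JSON object is an example model evaluation job output that is saved in Amazon S3. To learn more about each key-value pair, see the JSON output reference.
For clarity, this job contains responses from only two workers. Some key-value pairs may also have been truncated for readability.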
{
  "humanEvaluationResult": {
    "flowDefinitionArn": "arn:aws:sagemaker:us-west-2:111122223333:flow-definition/flow-definition-name",
    "humanAnswers": [
      {
        "acceptanceTime": "2024-06-07T22:31:57.066Z",
        "answerContent": {
          "evaluationResults": {
            "comparisonChoice": [
              { "metricName": "Fluency", "result": { "modelResponseId": "0" } }
            ],
            "comparisonLikertScale": [
              { "leftModelResponseId": "0", "metricName": "Coherence", "result": 1, "rightModelResponseId": "1" }
            ],
            "comparisonRank": [
              {
                "metricName": "Toxicity",
                "result": [
                  { "modelResponseId": "0", "rank": 1 },
                  { "modelResponseId": "1", "rank": 1 }
                ]
              }
            ],
            "individualLikertScale": [
              { "metricName": "Correctness", "modelResponseId": "0", "result": 2 },
              { "metricName": "Correctness", "modelResponseId": "1", "result": 3 },
              { "metricName": "Completeness", "modelResponseId": "0", "result": 1 },
              { "metricName": "Completeness", "modelResponseId": "1", "result": 4 }
            ],
            "thumbsUpDown": [
              { "metricName": "Accuracy", "modelResponseId": "0", "result": true },
              { "metricName": "Accuracy", "modelResponseId": "1", "result": true }
            ]
          }
        },
        "submissionTime": "2024-06-07T22:32:19.640Z",
        "timeSpentInSeconds": 22.574,
        "workerId": "ead1ba56c1278175",
        "workerMetadata": {
          "identityData": {
            "identityProviderType": "Cognito",
            "issuer": "https://cognito-idp.us-west-2.amazonaws.com/us-west-2_WxGLvNMy4",
            "sub": "cd2848f5-6105-4f72-b44e-68f9cb79ba07"
          }
        }
      },
      {
        "acceptanceTime": "2024-06-07T22:32:19.721Z",
        "answerContent": {
          "evaluationResults": {
            "comparisonChoice": [
              { "metricName": "Fluency", "result": { "modelResponseId": "1" } }
            ],
            "comparisonLikertScale": [
              { "leftModelResponseId": "0", "metricName": "Coherence", "result": 1, "rightModelResponseId": "1" }
            ],
            "comparisonRank": [
              {
                "metricName": "Toxicity",
                "result": [
                  { "modelResponseId": "0", "rank": 2 },
                  { "modelResponseId": "1", "rank": 1 }
                ]
              }
            ],
            "individualLikertScale": [
              { "metricName": "Correctness", "modelResponseId": "0", "result": 3 },
              { "metricName": "Correctness", "modelResponseId": "1", "result": 4 },
              { "metricName": "Completeness", "modelResponseId": "0", "result": 1 },
              { "metricName": "Completeness", "modelResponseId": "1", "result": 5 }
            ],
            "thumbsUpDown": [
              { "metricName": "Accuracy", "modelResponseId": "0", "result": true },
              { "metricName": "Accuracy", "modelResponseId": "1", "result": false }
            ]
          }
        },
        "submissionTime": "2024-06-07T22:32:57.918Z",
        "timeSpentInSeconds": 38.197,
        "workerId": "bad258db224c3db6",
        "workerMetadata": {
          "identityData": {
            "identityProviderType": "Cognito",
            "issuer": "https://cognito-idp.us-west-2.amazonaws.com/us-west-2_WxGLvNMy4",
            "sub": "84d5194a-3eed-4ecc-926d-4b9e1b724094"
          }
        }
      }
    ],
    "humanLoopName": "a757 11d3e75a 8d41f35b9873d 253f5b7bce0256e",
    "inputContent": {
      "additionalDataS3Uri": "s3://mgrt-test-us-west-2/test-2-workers-2-model/datasets/custom_dataset/0/task-input-additional-data.json",
      "instructions": "worker instructions provided by the model evaluation job administrator",
      "evaluationMetrics": [
        { "metricName": "Fluency", "metricType": "ComparisonChoice", "description": "Measures the linguistic quality of a generated text." },
        { "metricName": "Coherence", "metricType": "ComparisonLikertScale", "description": "Measures the organization and structure of a generated text." },
        { "metricName": "Toxicity", "metricType": "ComparisonRank", "description": "Measures the harmfulness of a generated text." },
        { "metricName": "Accuracy", "metricType": "ThumbsUpDown", "description": "Indicates the accuracy of a generated text." },
        { "metricName": "Correctness", "metricType": "IndividualLikertScale", "description": "Measures a generated answer's satisfaction in the context of the question." },
        { "metricName": "Completeness", "metricType": "IndividualLikertScale", "description": "Measures a generated answer's inclusion of all relevant information." }
      ],
      "disableRandomization": "true"
    },
    "modelResponseIdMap": {
      "0": "sm-margaret-meta-textgeneration-llama-2-7b-1711485008-0612",
      "1": "jumpstart-dft-hf-llm-mistral-7b-ins-20240327-043352"
    }
  },
  "inputRecord": {
    "prompt": { "text": "What is high intensity interval training?" },
    "category": "Fitness",
    "referenceResponse": { "text": "High-Intensity Interval Training (HIIT)" }
  },
  "modelResponses": [
    {
      "text": "High Intensity Interval Training (HIIT) is a form of exercise that alternates between periods of high intensity work and low intensity recovery.HIIT is an excellent way to increase your fitness and improve your health, but it can be difficult to get started.In this article, we will",
      "modelIdentifier": "sm-margaret-meta-textgeneration-llama-2-7b-1711485008-0612"
    },
    {
      "text": "High intensity interval training is a form of exercise consisting of short bursts of maximum effort followed by periods of rest. The theory behind HIIT is that it can often be more effective at improving cardiovascular and metabolic health than longer, lower intensity workouts.The work intervals can range in length depending on the specific type of exercise, but are typically between 20 and 90 seconds. The recovery periods are generally longer, lasting between 1 and 5 minutes. This pattern is then repeated for multiple sets.\n\nSince the work intervals are high intensity, they require more effort from your body and therefore result in a greater calorie burn. The body also continues to burn calories at an increased rate after the workout due to an effect called excess post exercise oxygen consumption (EPOC), also know as the afterburn effect.\n\nHIIT is a versatile form of training that can be adapted to different fitness levels and can be performed using a variety of exercises including cycling, running, bodyweight movements, and even swimming. It can be done in as little as 20 minutes once or twice a week, making it an efficient option for busy individuals.\n\nWhat are the benefits of high intensity interval training",
      "modelIdentifier": "jumpstart-dft-hf-llm-mistral-7b-ins-20240327-043352"
    }
  ]
}
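Once the output is parsed, per-worker answers can be combined into per-model scores. The sketch below averages the individualLikertScale values from the two workers in the example output, keyed by metric and full model name. The function name `mean_likert_scores` is hypothetical, and a plain arithmetic mean is just one possible way to aggregate.

```python
from collections import defaultdict


def mean_likert_scores(human_answers, model_response_id_map):
    """Average individualLikertScale results per (metricName, model name)."""
    scores = defaultdict(list)
    for answer in human_answers:
        results = answer["answerContent"]["evaluationResults"]
        for entry in results.get("individualLikertScale", []):
            model_name = model_response_id_map[entry["modelResponseId"]]
            scores[(entry["metricName"], model_name)].append(entry["result"])
    return {key: sum(values) / len(values) for key, values in scores.items()}


def likert(metric, model_id, result):
    """Build one individualLikertScale entry (helper for the sample data)."""
    return {"metricName": metric, "modelResponseId": model_id, "result": result}


MODEL_0 = "sm-margaret-meta-textgeneration-llama-2-7b-1711485008-0612"
MODEL_1 = "jumpstart-dft-hf-llm-mistral-7b-ins-20240327-043352"
model_response_id_map = {"0": MODEL_0, "1": MODEL_1}

# individualLikertScale values copied from the two workers in the example output.
human_answers = [
    {"answerContent": {"evaluationResults": {"individualLikertScale": [
        likert("Correctness", "0", 2), likert("Correctness", "1", 3),
        likert("Completeness", "0", 1), likert("Completeness", "1", 4),
    ]}}},
    {"answerContent": {"evaluationResults": {"individualLikertScale": [
        likert("Correctness", "0", 3), likert("Correctness", "1", 4),
        likert("Completeness", "0", 1), likert("Completeness", "1", 5),
    ]}}},
]

means = mean_likert_scores(human_answers, model_response_id_map)
for (metric, model), value in sorted(means.items()):
    print(metric, model, value)
```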