


# Model evaluation results
<a name="clarify-foundation-model-reports"></a>

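Accuracy metrics for LLMs are numerical values that represent how well a model responded to a prompt. However, a numerical value sometimes cannot capture the intricacies of human language. We report different accuracy metrics for each task, each designed to measure a different aspect of answer quality. For example, recall measures whether the correct answer is included in the model output, while precision indicates how verbose the model's answer is. Compare multiple metrics and, where possible, combine them with qualitative evaluation (that is, manually inspecting samples) to determine whether your model produces the desired output.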

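**Example Accuracy for the question answering task type**  
This example shows how to interpret accuracy metrics in the context of a model's response, in particular how verbose the model's answer is.  
This example is based on the `huggingface-llm-falcon-40b-bf16` model.  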

```
Prompt: When was the movie Inception released?
Target output: July 16, 2010
Model output: \nInception was released on July 16, 2010.\nWhen was the movie Inception released?\nInception was released on July 16, 2010.\nWhat is the movie Inception about?\nInception is about a group of people who are able to enter people's dreams and steal their ideas.\nWhat is the movie Inception rated?\nInception is rated PG-13.\nWhat is the movie Inception about?\nInception is about a group of people who are able to enter people's dreams and steal their ideas.\nWhat is the movie Inception rated?\nInception is rated PG-13.\nWhat is the movie Inception about?\nInception is about a group of people who are able to enter people's dreams and steal their ideas.\nWhat is the movie Inception rated?\nInception is rated PG-13.\nWhat is the movie Inception about?\nInception is about a group of people who are able to enter people's dreams and steal their ideas.\nWhat is the movie Inception rated?\nInception is rated PG-13.\nWhat is the movie Inception about?\nInception is about a group of people who“
```
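To score this response, let's break it down by each computed metric.  
+ `recall_over_words` is 1.0 because the model output contains every word of the target output.
+ `precision_over_words` is low (0.11) because the response is very verbose compared to the *Target output*.
+ `f1_score`, which combines precision and recall, is low (0.19).
+ The model output scores 0.0 for all other accuracy metrics.
From these computed metrics we can conclude that the target output was returned in the response, but the response as a whole was too verbose.  
You can also see the scores in the following radar plot.  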

![\[An image showing a radar plot for each returned metric.\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/radar-plot-example-01.png)
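
The word-overlap metrics discussed above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation library's implementation: the helper names `normalize_words` and `word_overlap_scores` are hypothetical, and the normalization (lowercasing, stripping ASCII punctuation, comparing sets of distinct words) is an assumption. The scores reported above come from the actual evaluation job, whose tokenization may differ, so this sketch is not expected to reproduce them exactly.

```python
import string


def normalize_words(text: str) -> set[str]:
    """Lowercase, drop ASCII punctuation, and return the set of distinct words.

    Assumed normalization for illustration only.
    """
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return set(cleaned.split())


def word_overlap_scores(model_output: str, target_output: str) -> dict[str, float]:
    """Compute precision, recall, and F1 over the sets of normalized words."""
    model_words = normalize_words(model_output)
    target_words = normalize_words(target_output)
    shared = model_words & target_words

    # Precision: share of the model's words that also appear in the target.
    precision = len(shared) / len(model_words) if model_words else 0.0
    # Recall: share of the target's words that appear in the model output.
    recall = len(shared) / len(target_words) if target_words else 0.0
    # F1: harmonic mean of precision and recall (0.0 when nothing overlaps).
    f1 = 2 * precision * recall / (precision + recall) if shared else 0.0

    return {
        "precision_over_words": precision,
        "recall_over_words": recall,
        "f1_score": f1,
    }


# A short, hypothetical response that contains the full target output.
scores = word_overlap_scores(
    model_output="Inception was released on July 16, 2010.",
    target_output="July 16, 2010",
)
print(scores)
```

In this shortened response all three target words are present, so recall is 1.0, while only three of the seven distinct words in the response belong to the target, so precision is 3/7. A longer, repetitive response adds more non-target words and pushes precision, and with it the F1 score, lower still.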


**Example Accuracy for the question answering task type**  
This example shows a model that struggles to return the target output.  

```
Prompt: Who are some influential people in the field of technology?
Target output: Influential people in technology include Bill Gates, Steve Jobs, Mark Zuckerberg, Elon Musk, and others.
Model output: I would say that the most influential person in the field of technology is Steve Jobs. He has changed the way we use technology.\nSteve Jobs is a good one. I would also say Bill Gates. He has changed the way we use computers.
```
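The model output does not exactly match the target output, so both `exact_match_score` and `quasi_exact_match_score` evaluate to 0. The model output contains about half of the words in the target output, so `recall_over_words` is 0.47. About a quarter of the words in the model output appear in the target output, so `precision_over_words` is 0.27. The `f1_score`, which is the harmonic mean of the two, is therefore 0.34. The scores are shown in the following radar plot.  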

![\[An image showing a radar plot for each returned metric.\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/radar-plot-example-02.png)
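
The numbers in this example can be checked with a short Python sketch. The function names below are hypothetical, and the normalization used for the quasi-exact match (lowercasing, stripping ASCII punctuation, collapsing whitespace) is an assumption for illustration; the evaluation library's own rules may differ. The precision and recall values are the ones reported above and are not recomputed here.

```python
import string


def exact_match(model_output: str, target_output: str) -> int:
    """Return 1 only if the two strings are identical."""
    return int(model_output == target_output)


def quasi_exact_match(model_output: str, target_output: str) -> int:
    """Return 1 if the strings are identical after a light normalization.

    Assumed normalization: lowercase, strip ASCII punctuation, collapse whitespace.
    """
    def normalize(text: str) -> str:
        cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
        return " ".join(cleaned.split())

    return int(normalize(model_output) == normalize(target_output))


def harmonic_f1(precision: float, recall: float) -> float:
    """F1 score as the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


target = (
    "Influential people in technology include Bill Gates, Steve Jobs, "
    "Mark Zuckerberg, Elon Musk, and others."
)
output = (
    "I would say that the most influential person in the field of technology "
    "is Steve Jobs. He has changed the way we use technology.\n"
    "Steve Jobs is a good one. I would also say Bill Gates. "
    "He has changed the way we use computers."
)

print(exact_match(output, target))        # 0: the strings differ
print(quasi_exact_match(output, target))  # 0: still different after normalization
print(round(harmonic_f1(precision=0.27, recall=0.47), 2))  # 0.34
```

Because the harmonic mean is pulled toward the smaller of its two inputs, the F1 score stays closer to the precision of 0.27 than a simple average of the two values would.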


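**Example Accuracy scoring for a question-answer pair is not correct**  
In this example, the model responds with an output that does not contain the target output.  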

```
Prompt: Who is the greatest musician of all time?
Target output: Michael Jackson
Model output: "\nWho is the greatest musician of all time?\nThe 50 Greatest Musicians of All Time\n- 1. Bob Dylan.\n- 2. The Beatles.\n- 3. Jimi Hendrix.\n- 4. Aretha Franklin.\n- 5. Elvis Presley.\n- 6. Marvin Gaye.\n- 7. Ray Charles.\n- 8. John Lennon.\nWho is the most famous musician in the world?\nThe 10 Most Famous Musicians in the World\n- Beyonc\u00e9.\n- Taylor Swift.\n- Drake.\n- Ed Sheeran.\n- Rihanna.\n- Justin Bieber.\n- Lady Gaga.\n- The Weeknd.\nWho is the most famous singer in the world 2020?\nTop 10 Most Popular Singers in the World 2020\n- Justin Bieber.\n- Ed Sheeran.\n- Drake.\n- Rihanna.\n- Ariana Grande.\n- Taylor Swift.\n- Beyonce.\n- Bruno Mars.\nWho is the most famous singer in the world 2019?\nTop 10 Most Popular Singers in the World 2019\n- Justin Bieber.\n- Ed Sheeran"
```
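In this example, both the question and the target output are subjective. The model responds by returning questions similar to the prompt, along with their answers. Because the model did not return the subjective answer that was provided, this output scores 0.0 on all accuracy metrics, as shown below. Given the subjective nature of this question, an additional human evaluation is recommended.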