Customize your workflow using the fmeval library
You can customize your model evaluation to evaluate a model that is not a
JumpStart or Amazon Bedrock model, or to use a custom workflow for evaluation. If you use your own
model, you have to create a custom ModelRunner. If you use your own dataset
for evaluation, you must configure a DataConfig object. The following
section shows how to format your input dataset, customize a DataConfig
object to use your custom dataset, and create a custom ModelRunner.
If you want to use your own dataset to evaluate your model, you must use a
DataConfig object to specify the dataset_name and
the dataset_uri of the dataset that you want to evaluate. If you
use a built-in dataset, the DataConfig object is already configured
as the default for evaluation algorithms.
You can use one custom dataset each time you use the evaluate
function. You can invoke evaluate any number of times to use any
number of datasets that you want, as in the following sketch.
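For example, the following minimal sketch (not taken from this documentation) runs the same evaluation once per dataset. The QAAccuracy algorithm is an illustrative choice, the dataset names and paths are hypothetical, and model stands for a ModelRunner such as the one created later in this section:

from fmeval.constants import MIME_TYPE_JSONLINES
from fmeval.data_loaders.data_config import DataConfig
from fmeval.eval_algorithms.qa_accuracy import QAAccuracy, QAAccuracyConfig

# Two hypothetical custom datasets, each with "question" and "answer" columns.
dataset_configs = [
    DataConfig(
        dataset_name=name,
        dataset_uri=f"{name}.jsonl",
        dataset_mime_type=MIME_TYPE_JSONLINES,
        model_input_location="question",
        target_output_location="answer",
    )
    for name in ("dataset_a", "dataset_b")
]

eval_algo = QAAccuracy(QAAccuracyConfig())  # illustrative choice of algorithm

# `model` is a ModelRunner, such as the one created later in this section.
for dataset_config in dataset_configs:
    eval_output = eval_algo.evaluate(model=model, dataset_config=dataset_config, save=True)
    print(eval_output)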
Configure a custom dataset with your model request specified in the question column and the target answer specified in the answer column, as follows:
from fmeval.data_loaders.data_config import DataConfig
from fmeval.constants import MIME_TYPE_JSONLINES

config = DataConfig(
    dataset_name="tiny_dataset",
    dataset_uri="tiny_dataset.jsonl",
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="question",
    target_output_location="answer",
)
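With this configuration, each line of the JSON Lines file is a standalone JSON object whose keys match the column names in DataConfig. The following two records are illustrative; they are not part of any actual tiny_dataset.jsonl:

{"question": "What is the capital of France?", "answer": "Paris"}
{"question": "Who wrote Hamlet?", "answer": "William Shakespeare"}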
The DataConfig class contains the following parameters:
- dataset_name – The name of the dataset that you want to use to evaluate your LLM.
- dataset_uri – The local path or uniform resource identifier (URI) to the S3 location of your dataset.
- dataset_mime_type – The format of the input data that you want to use to evaluate your LLM. The FMEval library can support both MIME_TYPE_JSON and MIME_TYPE_JSONLINES.
- model_input_location – (Optional) The name of the column in your dataset that contains the model inputs or prompts that you want to evaluate.

  Use a model_input_location that specifies the name of your column. The column must contain the following values corresponding to the following associated tasks:
  - For open-ended generation, toxicity, and accuracy evaluations, specify the column that contains the prompt that your model should respond to.
  - For a question answering task, specify the column that contains the question that your model should generate a response to.
  - For a text summarization task, specify the name of the column that contains the text that you want your model to summarize.
  - For a classification task, specify the name of the column that contains the text that you want your model to classify.
  - For factual knowledge evaluations, specify the name of the column that contains the question that you want the model to predict the answer to.
  - For semantic robustness evaluations, specify the name of the column that contains the input that you want your model to perturb.
  - For prompt stereotyping evaluations, use sent_more_input_location and sent_less_input_location instead of model_input_location, as shown in the following parameters.
- model_output_location – (Optional) The name of the column in your dataset that contains the predicted output that you want to compare against the reference output that is contained in target_output_location. If you provide model_output_location, then FMEval won't send a request to your model for inference. Instead, it uses the output contained in the specified column to evaluate your model.
- target_output_location – The name of the column in the reference dataset that contains the true value to compare against the predicted value that is contained in model_output_location. Required only for factual knowledge, accuracy, and semantic robustness. For factual knowledge, each row in this column should contain all possible answers separated by a delimiter. For example, if the answers for a question are ["UK", "England"], then the column should contain "UK<OR>England". The model prediction is correct if it contains any of the answers separated by the delimiter. A sketch that uses this delimiter follows this list.
- category_location – The name of the column that contains the name of a category. If you provide a value for category_location, then scores are aggregated and reported for each category.
- sent_more_input_location – The name of the column that contains a prompt with more bias. Required only for prompt stereotyping. Avoid unconscious bias. For bias examples, see the CrowS-Pairs dataset.
- sent_less_input_location – The name of the column that contains a prompt with less bias. Required only for prompt stereotyping. Avoid unconscious bias. For bias examples, see the CrowS-Pairs dataset.
- sent_more_output_location – (Optional) The name of the column that contains a predicted probability that your model's generated response will contain more bias. This parameter is only used in prompt stereotyping tasks.
- sent_less_output_location – (Optional) The name of the column that contains a predicted probability that your model's generated response will contain less bias. This parameter is only used in prompt stereotyping tasks.
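As an illustration of the target_output_location delimiter and the category_location parameter described above, the following sketch configures a hypothetical factual knowledge dataset. The dataset name, file path, and column names are invented for this example:

from fmeval.constants import MIME_TYPE_JSONLINES
from fmeval.data_loaders.data_config import DataConfig

fk_config = DataConfig(
    dataset_name="capitals_dataset",        # hypothetical dataset name
    dataset_uri="capitals_dataset.jsonl",   # hypothetical local path
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="question",
    # Each row lists every acceptable answer, separated by the <OR> delimiter,
    # for example "UK<OR>England".
    target_output_location="answers",
    # Scores are aggregated and reported per value of this column.
    category_location="topic",
)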
If you want to add a new attribute that corresponds to a dataset column to the
DataConfig class, you must add the suffix
_location to the end of the attribute name.
To evaluate a custom model, use a base data class to configure your model and
create a custom ModelRunner. Then, you can use this
ModelRunner to evaluate any language model. Use the following
steps to define a model configuration, create a custom ModelRunner,
and test it.
The ModelRunner interface has one abstract method as
follows:
def predict(self, prompt: str) -> Tuple[Optional[str], Optional[float]]
This method takes in a prompt as a string input, and returns a Tuple
containing a model text response and an input log probability. Every
ModelRunner must implement a predict
method.
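Before the full Hugging Face walkthrough below, the following minimal sketch shows the shape that any custom ModelRunner can take. It is not part of the library documentation: the HTTP endpoint, its URL, and its response fields are hypothetical placeholders, and the only firm requirement is that predict returns a tuple of the generated text and an optional log probability:

from typing import Optional, Tuple

import requests
from fmeval.model_runners.model_runner import ModelRunner

class MyEndpointModelRunner(ModelRunner):
    """Sketch of a ModelRunner that calls a hypothetical text-generation endpoint."""

    def __init__(self, endpoint_url: str):
        self.endpoint_url = endpoint_url

    def predict(self, prompt: str) -> Tuple[Optional[str], Optional[float]]:
        # Send the prompt to the hypothetical endpoint and parse its JSON response.
        response = requests.post(self.endpoint_url, json={"prompt": prompt}, timeout=60)
        response.raise_for_status()
        body = response.json()
        # Return the generated text; return None for the log probability because
        # this hypothetical endpoint does not expose one.
        return body.get("generated_text"), None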
Create a custom ModelRunner
- Define a model configuration.

  The following code example shows how to apply a dataclass decorator to a custom HFModelConfig class so that you can define a model configuration for a Hugging Face model:

  from dataclasses import dataclass

  @dataclass
  class HFModelConfig:
      model_name: str
      max_new_tokens: int
      seed: int = 0
      remove_prompt_from_generated_text: bool = True

  In the previous code example, the following applies:
  - The parameter max_new_tokens is used to limit the length of the response by limiting the number of tokens returned by an LLM. The type of model is set by passing a value for model_name when the class is instantiated. In this example, the model name is set to gpt2, as shown at the end of this section. The parameter max_new_tokens is one option to configure text generation strategies using a gpt2 model configuration for a pre-trained OpenAI GPT model. See AutoConfig for other model types.
  - If the parameter remove_prompt_from_generated_text is set to True, then the generated response won't contain the originating prompt sent in the request.

  For other text generation parameters, see the Hugging Face documentation for GenerationConfig; a brief sketch follows this list.
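  The following brief sketch is not part of the original example; it shows how max_new_tokens fits into Hugging Face generation settings more generally. The model name and parameter values are illustrative:

  from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

  tokenizer = AutoTokenizer.from_pretrained("gpt2")
  model = AutoModelForCausalLM.from_pretrained("gpt2")

  # max_new_tokens can be passed directly to generate(), as in the ModelRunner
  # below, or bundled with other settings in a GenerationConfig object.
  generation_config = GenerationConfig(max_new_tokens=32, do_sample=False)
  inputs = tokenizer("The fmeval library", return_tensors="pt")
  outputs = model.generate(**inputs, generation_config=generation_config)
  print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])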
- Create a custom ModelRunner and implement a predict method.

  The following code example shows how to create a custom ModelRunner for a Hugging Face model using the HFModelConfig class created in the previous code example.

  import warnings
  from typing import Tuple, Optional

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  from fmeval.model_runners.model_runner import ModelRunner

  class HuggingFaceCausalLLMModelRunner(ModelRunner):
      def __init__(self, model_config: HFModelConfig):
          self.config = model_config
          self.model = AutoModelForCausalLM.from_pretrained(self.config.model_name)
          self.tokenizer = AutoTokenizer.from_pretrained(self.config.model_name)

      def predict(self, prompt: str) -> Tuple[Optional[str], Optional[float]]:
          input_ids = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
          generations = self.model.generate(
              **input_ids,
              max_new_tokens=self.config.max_new_tokens,
              pad_token_id=self.tokenizer.eos_token_id,
          )
          generation_contains_input = (
              input_ids["input_ids"][0] == generations[0][: input_ids["input_ids"].shape[1]]
          ).all()
          if self.config.remove_prompt_from_generated_text and not generation_contains_input:
              warnings.warn(
                  "Your model does not return the prompt as part of its generations. "
                  "`remove_prompt_from_generated_text` does nothing."
              )
          if self.config.remove_prompt_from_generated_text and generation_contains_input:
              output = self.tokenizer.batch_decode(generations[:, input_ids["input_ids"].shape[1] :])[0]
          else:
              output = self.tokenizer.batch_decode(generations, skip_special_tokens=True)[0]

          with torch.inference_mode():
              input_ids = self.tokenizer(self.tokenizer.bos_token + prompt, return_tensors="pt")["input_ids"]
              model_output = self.model(input_ids, labels=input_ids)
              probability = -model_output[0].item()

          return output, probability

  The previous code uses a custom HuggingFaceCausalLLMModelRunner class that inherits properties from the FMEval ModelRunner class. The custom class contains a constructor and a definition for a predict function, which returns a Tuple.

  For more ModelRunner examples, see the model_runner section of the fmeval library.

  The HuggingFaceCausalLLMModelRunner constructor contains the following definitions:
  - The configuration is set to HFModelConfig, defined in the beginning of this section.
  - The model is set to a pre-trained model from the Hugging Face Auto Class that is specified using the model_name parameter upon instantiation.
  - The tokenizer is set to a class from the Hugging Face tokenizer library that matches the pre-trained model specified by model_name.
  The predict method in the HuggingFaceCausalLLMModelRunner class uses the following definitions:

  - input_ids – A variable that contains the input for your model. The input is generated as follows.
    - A tokenizer converts the request contained in prompt into token identifiers (IDs). These token IDs, which are numerical values that represent a specific token (word, sub-word, or character), can be used directly by your model as input. The token IDs are returned as PyTorch tensor objects, as specified by return_tensors="pt". For other return tensor types, see the Hugging Face documentation for apply_chat_template.
    - Token IDs are sent to the device where the model is located so that they can be used by the model.
  - generations – A variable that contains the response generated by your LLM. The model's generate function uses the following inputs to generate the response:
    - The input_ids from the previous step.
    - The parameter max_new_tokens specified in HFModelConfig.
    - A pad_token_id adds an end of sentence (eos) token to the response. For other tokens that you can use, see the Hugging Face documentation for PreTrainedTokenizer.
  - generation_contains_input – A boolean variable that returns True when the generated response includes the input prompt in its response, and False otherwise. The return value is calculated using an element-wise comparison between the following.
    - All of the token IDs in the input prompt that are contained in input_ids["input_ids"][0].
    - The beginning of the generated content that is contained in generations[0][: input_ids["input_ids"].shape[1]].

  The predict method issues a warning if you set remove_prompt_from_generated_text in your configuration but the generated response doesn't contain the input prompt.

  The output from the predict method contains a string returned by the batch_decode method, which converts the token IDs returned in the response into human-readable text. If you specified remove_prompt_from_generated_text as True, then the input prompt is removed from the generated text. If you specified remove_prompt_from_generated_text as False, the generated text is returned without any special tokens that you included in the dictionary special_token_dict, as specified by skip_special_tokens=True.
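  The following standalone sketch is not part of the original example; it walks through the same steps outside the class: tokenizing a prompt, generating a continuation, and decoding only the newly generated tokens. The model name and prompt are illustrative:

  from transformers import AutoModelForCausalLM, AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("gpt2")
  model = AutoModelForCausalLM.from_pretrained("gpt2")

  # Convert the prompt into token IDs, returned as PyTorch tensors.
  input_ids = tokenizer("London is the capital of", return_tensors="pt")

  # Generate a continuation; for gpt2 the output begins with the prompt tokens.
  generations = model.generate(
      **input_ids,
      max_new_tokens=8,
      pad_token_id=tokenizer.eos_token_id,
  )

  # Slice off the prompt tokens and decode only the newly generated ones.
  new_tokens = generations[:, input_ids["input_ids"].shape[1]:]
  print(tokenizer.batch_decode(new_tokens, skip_special_tokens=True)[0])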
- Test your ModelRunner. Send a sample request to your model.

  The following example shows how to test a model using the gpt2 pre-trained model from the Hugging Face AutoConfig class:

  hf_config = HFModelConfig(model_name="gpt2", max_new_tokens=32)
  model = HuggingFaceCausalLLMModelRunner(model_config=hf_config)

  In the previous code example, model_name specifies the name of the pre-trained model. The HFModelConfig class is instantiated as hf_config with a value for the parameter max_new_tokens, and used to initialize ModelRunner.

  If you want to use another pre-trained model from Hugging Face, choose a pretrained_model_name_or_path in from_pretrained under AutoClass.

  Lastly, test your ModelRunner. Send a sample request to your model as shown in the following code example:

  model_output = model.predict("London is the capital of?")[0]
  print(model_output)
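  To score a single response like this one, you can pass it to an evaluation algorithm's evaluate_sample method. The following sketch is illustrative and not part of the original example; it assumes the FactualKnowledge algorithm and uses the <OR> delimiter described earlier in this section:

  from fmeval.eval_algorithms.factual_knowledge import FactualKnowledge, FactualKnowledgeConfig

  eval_algo = FactualKnowledge(FactualKnowledgeConfig(target_output_delimiter="<OR>"))
  # The prediction counts as correct if it contains any of the delimited answers.
  scores = eval_algo.evaluate_sample(
      target_output="England<OR>UK",
      model_output=model_output,  # the response generated by model.predict above
  )
  print(scores)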