Use the fmeval library to run an automatic evaluation
Using the fmeval library in your own code gives you the most
            flexibility to customize your workflow. You can use the fmeval library
            to evaluate any LLM and to gain more flexibility with your custom input
            datasets. The following steps show you how to set up your environment and how to run
            both a starting and a customized workflow using the fmeval
            library.
Get started using the fmeval library
            You can configure your foundation model evaluation and customize it for your use case in a Studio notebook. Your configuration depends both on the kind of task that your foundation model is built to predict, and on how you want to evaluate it. FMEval supports open-ended generation, text summarization, question answering, and classification tasks. The steps in this section show you how to set up a starting workflow. This starting workflow includes setting up your environment and running an evaluation algorithm using either a JumpStart or an Amazon Bedrock foundation model with built-in datasets. If you must use a custom input dataset and workflow for a more specific use case, see Customize your workflow using the fmeval library.
If you don’t want to run a model evaluation in a Studio notebook, skip to step 11 in the following Get started using Studio section.
Prerequisites
- 
                        To run a model evaluation in a Studio UI, your AWS Identity and Access Management (IAM) role and any input datasets must have the correct permissions. If you do not have a SageMaker AI Domain or IAM role, follow the steps in Guide to getting set up with Amazon SageMaker AI. 
To set permissions for your Amazon S3 bucket
After your domain and role are created, use the following steps to add the permissions needed to evaluate your model.
- Open the Amazon SageMaker AI console at https://console.aws.amazon.com/sagemaker/.
- 
                        In the search bar at the top of the page, enter S3.
- 
                        Choose S3 under Services. 
- 
                        Choose Buckets from the navigation pane. 
- 
                        In the General purpose buckets section, under Name, choose the name of the S3 bucket that you want to use to store your model input and output in the console. If you do not have an S3 bucket, do the following:
- 
                                Select Create bucket to open a new Create bucket page. 
- 
                                In the General configuration section, under AWS Region, select the AWS region where your foundation model is located. 
- 
                                Name your S3 bucket in the input box under Bucket name. 
- 
                                Accept all of the default choices. 
- 
                                Select Create bucket. 
- 
                                In the General purpose buckets section, under Name, select the name of the S3 bucket that you created. 
 
- 
                        Choose the Permissions tab. 
- 
                        Scroll to the Cross-origin resource sharing (CORS) section at the bottom of the window. Choose Edit. 
- 
                        To add permissions to your bucket for foundation model evaluations, ensure that the following code appears in the input box. You can also copy and paste the following into the input box.
                            [
                                {
                                    "AllowedHeaders": ["*"],
                                    "AllowedMethods": ["GET", "PUT", "POST", "DELETE"],
                                    "AllowedOrigins": ["*"],
                                    "ExposeHeaders": ["Access-Control-Allow-Origin"]
                                }
                            ]
- 
                        Choose Save changes. 
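If you prefer to script the CORS step rather than use the console, the following is a minimal sketch and not part of the documented procedure. It builds the same rules as Python data and passes them to the boto3 S3 client's put_bucket_cors call. The helper names build_cors_rules and apply_cors are hypothetical, and apply_cors assumes that valid AWS credentials and an existing bucket are available.

```python
import json


def build_cors_rules():
    """Return the CORS rules from the console step as Python data."""
    return [
        {
            "AllowedHeaders": ["*"],
            "AllowedMethods": ["GET", "PUT", "POST", "DELETE"],
            "AllowedOrigins": ["*"],
            "ExposeHeaders": ["Access-Control-Allow-Origin"],
        }
    ]


def apply_cors(bucket_name, s3_client=None):
    """Apply the rules to a bucket (hypothetical helper; needs AWS credentials)."""
    if s3_client is None:
        import boto3  # imported lazily so the rest of the sketch runs offline

        s3_client = boto3.client("s3")
    s3_client.put_bucket_cors(
        Bucket=bucket_name,
        CORSConfiguration={"CORSRules": build_cors_rules()},
    )


if __name__ == "__main__":
    # Print the JSON that the console input box is expected to contain.
    print(json.dumps(build_cors_rules(), indent=2))
```

Printing the rules first makes it easy to compare them with what the console shows before you change a real bucket.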
To add permissions to your IAM policy
- 
                        In the search bar at the top of the page, enter IAM.
- 
                        Under Services, select Identity and Access Management (IAM). 
- 
                        Choose Policies from the navigation pane. 
- 
                        Input AmazonSageMakerFullAccess into the search bar. Select the radio button next to the policy that appears. The Actions button can now be selected. 
- 
                        Choose the down arrow next to Actions. Two options appear. 
- 
                        Choose Attach. 
- 
                        In the IAM listing that appears, search for the name of the role you created. Select the check box next to the name. 
- 
                        Choose Attach policy. 
Get started using Studio
- 
                        In the search bar at the top of the page, enter SageMaker AI.
- 
                        Under Services, select Amazon SageMaker AI. 
- 
                        Choose Studio from the navigation pane. 
- 
                        Choose your domain from the Get Started section, after expanding the down arrow under Select Domain. 
- 
                        Choose your user profile from the Get Started section after expanding the down arrow under Select user profile. 
- 
                        Choose Open Studio to open the landing page for Studio. 
- 
                        Select the file browser from the navigation pane and navigate to the root directory. 
- 
                        Select Create notebook. 
- 
                        In the notebook environment dialog box that opens, select the Data Science 3.0 image. 
- 
                        Choose Select. 
- 
                        Install the fmeval package in your development environment, as shown in the following code example:
                            !pip install fmeval
                        Note: Install the fmeval library into an environment that uses Python 3.10. For more information about the requirements needed to run fmeval, see fmeval dependencies.
FMEval uses a high-level wrapper called ModelRunner to compose
                    input, invoke your model, and extract output from it. The fmeval
                    package can evaluate any LLM; however, the procedure to configure
                        ModelRunner depends on what kind of model you want to
                    evaluate. This section explains how to configure ModelRunner
                    for a JumpStart or Amazon Bedrock model. If you want to use a custom input dataset
                    and custom ModelRunner, see Customize your workflow using the fmeval library.
Use a JumpStart model
To use ModelRunner to evaluate a JumpStart model,
                        create or provide an endpoint, define the model and the built-in
                        dataset, configure, and test ModelRunner.
Define a JumpStart model and configure a ModelRunner
- 
                            Provide an endpoint by doing either of the following:
- 
                                    Specify the EndpointName of an existing JumpStart endpoint, the model_id, and model_version.
- 
                                    Specify the model_id and model_version for your model, and create a JumpStart endpoint.
                            The following code example shows how to create an endpoint for a Llama 2 foundation model that's available through JumpStart.
                                import sagemaker
                                from sagemaker.jumpstart.model import JumpStartModel

                                # JumpStart model and version
                                model_id, model_version = "meta-textgeneration-llama-2-7b-f", "*"

                                my_model = JumpStartModel(model_id=model_id)
                                predictor = my_model.deploy()
                                endpoint_name = predictor.endpoint_name

                                # Accept the EULA, and test the endpoint to make sure it can predict.
                                predictor.predict({"inputs": [[{"role":"user", "content": "Hello how are you?"}]]}, custom_attributes='accept_eula=true')
                            The previous code example refers to EULA, which stands for end-user license agreement. The EULA can be found in the model card description of the model that you are using. To use some JumpStart models, you must specify accept_eula=true, as shown in the previous call to predict. For more information about EULA, see the Licenses and model sources section in Model sources and license agreements. You can find a list of available JumpStart models at Built-in Algorithms with pre-trained Model Table.
- 
                            Configure ModelRunner by using the JumpStartModelRunner, as shown in the following configuration example:
                                from fmeval.model_runners.sm_jumpstart_model_runner import JumpStartModelRunner

                                js_model_runner = JumpStartModelRunner(
                                    endpoint_name=endpoint_name,
                                    model_id=model_id,
                                    model_version=model_version
                                )
                            In the previous configuration example, use the same values for endpoint_name, model_id, and model_version that you used to create the endpoint.
- 
                            Test your ModelRunner. Send a sample request to your model as shown in the following code example:
                                js_model_runner.predict("What is the capital of London")
Use an Amazon Bedrock model
To evaluate an Amazon Bedrock model, you must define the model and built-in
                        dataset, and configure ModelRunner.
Define an Amazon Bedrock model and configure a ModelRunner
- 
                            To define and print model details, use the following code example for a Titan model that is available through Amazon Bedrock:
                                import boto3
                                import json

                                bedrock = boto3.client(service_name='bedrock')
                                bedrock_runtime = boto3.client(service_name='bedrock-runtime')

                                model_id = "amazon.titan-tg1-large"
                                accept = "application/json"
                                content_type = "application/json"

                                # Pass the model_id variable defined above.
                                print(bedrock.get_foundation_model(modelIdentifier=model_id).get('modelDetails'))
                            In the previous code example, the accept parameter specifies the format of the data that you want to use to evaluate your LLM. The contentType specifies the format of the input data in the request. Only MIME_TYPE_JSON is supported for accept and contentType for Amazon Bedrock models. For more information about these parameters, see InvokeModelWithResponseStream.
- 
                            To configure ModelRunner, use the BedrockModelRunner, as shown in the following configuration example:
                                from fmeval.model_runners.bedrock_model_runner import BedrockModelRunner

                                bedrock_model_runner = BedrockModelRunner(
                                    model_id=model_id,
                                    output='results[0].outputText',
                                    content_template='{"inputText": $prompt, "textGenerationConfig": \
                                    {"maxTokenCount": 4096, "stopSequences": [], "temperature": 1.0, "topP": 1.0}}',
                                )
                            Parametrize the ModelRunner configuration as follows.
- 
                                    Use the same value for model_id that you used to define the model.
- 
                                    Use output to specify the location of the generated text in the JSON response. As an example, if your LLM provided the response {"results": [{"outputText": "this is the output"}]}, then output='results[0].outputText' returns this is the output.
- 
                                    Use content_template to specify how your LLM interacts with requests. The following details are provided solely to explain the previous configuration example; they are not required.
- 
                                            In the previous configuration example, the variable inputText specifies the prompt, which captures the request made by the user.
- 
                                            The variable textGenerationConfig specifies how the LLM generates responses as follows:
- 
                                              The parameter maxTokenCount is used to limit the length of the response by limiting the number of tokens returned by the LLM.
- 
                                              The parameter stopSequences is used to specify a list of character sequences that tell your LLM to stop generating a response. The model output is stopped the first time any of the listed strings is encountered in the output. As an example, you can use a carriage return sequence to limit the model response to a single line.
- 
                                              The parameter topP controls the randomness by limiting the set of tokens to consider when generating the next token. This parameter accepts values between 0.0 and 1.0. Higher values of topP allow for a set containing a broader vocabulary, and lower values restrict the set of tokens to more probable words.
- 
                                              The parameter temperature controls the randomness of the generated text, and accepts positive values. Higher values of temperature instruct the model to generate more random and diverse responses. Lower values generate responses that are more predictable. Typical ranges for temperature lie between 0.2 and 2.0.
                            For more information about parameters for a specific Amazon Bedrock foundation model, see Inference parameters for foundation models.
                            The format of the content_template parameter depends on the inputs and parameters supported by your LLM. For example, Anthropic’s Claude 2 model can support the following content_template:
                                "content_template": "{\"prompt\": $prompt, \"max_tokens_to_sample\": 500}"
                            As another example, the Falcon 7b model can support the following content_template:
                                "content_template": "{\"inputs\": $prompt, \"parameters\":{\"max_new_tokens\": \
                                10, \"top_p\": 0.9, \"temperature\": 0.8}}"
                            Lastly, test your ModelRunner. Send a sample request to your model as shown in the following code example:
                                bedrock_model_runner.predict("What is the capital of London?")
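To make the roles of content_template and output more concrete, the following is a minimal sketch of how a request body could be composed from a template and how a path such as results[0].outputText could be resolved against a response. It is an illustration only and not the fmeval implementation. The helper names compose_body and extract are hypothetical, the sketch assumes that the prompt is JSON-encoded before substitution, and the path resolver handles only dictionary keys and list indexes.

```python
import json
import re
from string import Template


def compose_body(content_template, prompt):
    """Fill $prompt in the template with a JSON-encoded prompt string."""
    # json.dumps adds the quotes and escaping, so the result is valid JSON.
    return Template(content_template).substitute(prompt=json.dumps(prompt))


def extract(response, path):
    """Resolve a simple path such as 'results[0].outputText'."""
    value = response
    # Each match is either a dictionary key or a bracketed list index.
    for key, index in re.findall(r"([^.\[\]]+)|\[(\d+)\]", path):
        value = value[key] if key else value[int(index)]
    return value


template = (
    '{"inputText": $prompt, "textGenerationConfig": '
    '{"maxTokenCount": 4096, "stopSequences": [], "temperature": 1.0, "topP": 1.0}}'
)
body = json.loads(compose_body(template, 'Say "hello" in French.'))
print(body["inputText"])

sample_response = {"results": [{"outputText": "this is the output"}]}
print(extract(sample_response, "results[0].outputText"))
```

Parsing the composed body with json.loads, as the sketch does, is a quick way to check that a content_template you write yourself produces valid JSON.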
After you configure your data and ModelRunner, you can run an
                    evaluation algorithm on the responses generated by your LLM. To see a list
                    of all of the available evaluation algorithms, run the following
                    code:
    from fmeval.eval_algo_mapping import EVAL_ALGORITHMS

    print(EVAL_ALGORITHMS.keys())
Each algorithm has both an evaluate and an evaluate_sample
                    method. The evaluate method computes a score for the entire
                    dataset. The evaluate_sample method evaluates the score for a
                    single instance.
The evaluate_sample method returns EvalScore
                    objects. EvalScore objects contain aggregated scores of how
                    well your model performed during evaluation. The
                        evaluate_sample method has the following optional
                    parameters:
- 
                        model_output – The model response for a single request.
- 
                        model_input – A prompt containing the request to your model.
- 
                        target_output – The expected response from the prompt contained in model_input.
The following code example shows how to use the
                        evaluate_sample method:
    # Evaluate your custom sample
    model_output = model_runner.predict("London is the capital of?")[0]
    eval_algo.evaluate_sample(target_output="UK<OR>England<OR>United Kingdom", model_output=model_output)
The evaluate method has the following optional
                    parameters:
- 
                        model – An instance of ModelRunner using the model that you want to evaluate.
- 
                        dataset_config – The dataset configuration. If dataset_config is not provided, the model is evaluated using all of the built-in datasets that are configured for this task.
- 
                        prompt_template – A template used to generate prompts. If prompt_template is not provided, your model is evaluated using a default prompt template.
- 
                        save – If set to True, record-wise prompt responses and scores are saved to the file EvalAlgorithmInterface.EVAL_RESULTS_PATH. Defaults to False.
- 
                        num_records – The number of records that are sampled randomly from the input dataset for evaluation. Defaults to 300.
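The way these parameters interact can be pictured with a small sketch. The following is a simplified stand-in for a dataset-level evaluation loop and not the fmeval implementation. The function evaluate_sketch, the record keys model_input and target_output, and the seed argument are assumptions made for illustration. The $feature placeholder follows the prompt_template used in the examples in this section.

```python
import random
from string import Template


def evaluate_sketch(predict, records, score_fn, prompt_template="$feature",
                    num_records=300, seed=0):
    """Score a random sample of records; return (mean score, per-record scores)."""
    if not records:
        raise ValueError("records must not be empty")
    rng = random.Random(seed)
    # Sample only when the dataset is larger than num_records.
    if len(records) > num_records:
        sample = rng.sample(records, num_records)
    else:
        sample = list(records)
    scores = []
    for record in sample:
        # Compose the prompt by filling the template with the record's input.
        prompt = Template(prompt_template).substitute(feature=record["model_input"])
        model_output = predict(prompt)
        scores.append(score_fn(model_output, record["target_output"]))
    return sum(scores) / len(scores), scores


# Hypothetical data and a stub model that always answers "Paris".
records = [
    {"model_input": "What is the capital of France?", "target_output": "Paris"},
    {"model_input": "What is the capital of Italy?", "target_output": "Rome"},
]
mean_score, record_scores = evaluate_sketch(
    predict=lambda prompt: "Paris",
    records=records,
    score_fn=lambda output, target: float(output == target),
)
print(mean_score, record_scores)
```

The per-record list corresponds to what is saved when save is set to True, and the mean corresponds to a dataset-level score.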
The evaluate method returns a list of
                        EvalOutput objects that can include the following:
- 
                        eval_name – The name of the evaluation algorithm.
- 
                        dataset_name – The name of the dataset used by the evaluation algorithm.
- 
                        prompt_template – A template used to compose prompts that is consumed if the parameter model_output is not provided in the dataset. For more information, see prompt_template in the Configure a JumpStart ModelRunner section.
- 
                        dataset_scores – An aggregated score computed across the whole dataset.
- 
                        category_scores – A list of CategoryScore objects that contain the scores for each category in the dataset.
- 
                        output_path – The local path to the evaluation output. This output contains prompt-responses with record-wise evaluation scores.
- 
                        error – A string error message for a failed evaluation job.
The following dimensions are available for model evaluation:
- 
                        Accuracy 
- 
                        Factual knowledge 
- 
                        Prompt stereotyping 
- 
                        Semantic robustness 
- 
                        Toxicity 
Accuracy
You can run an accuracy algorithm for a question answering, text summarization, or classification task. The algorithms are different for each task in order to accommodate the different data input types and problems as follows:
- 
                            For question answering tasks, run the QAAccuracy algorithm with a QAAccuracyConfig.
- 
                            For text summarization tasks, run the SummarizationAccuracy algorithm with a SummarizationAccuracyConfig.
- 
                            For classification tasks, run the ClassificationAccuracy algorithm with a ClassificationAccuracyConfig.
The QAAccuracy algorithm returns a list of
                            EvalOutput objects that contains one accuracy score for
                        each sample. To run the question answer accuracy algorithm, instantiate
                        a QAAccuracyConfig and pass in either
                            <OR> or None as the
                            target_output_delimiter. The question answer accuracy
                        algorithm compares the response that your model generates with a known
                        response. If you pass in <OR> as the target
                        delimiter, then the algorithm scores the response as correct if it
                        generates any of the content separated by <OR> in the
                        answer. If you pass an empty string as the
                            target_output_delimiter, the code throws an
                        error.
Call the evaluate method and pass in your desired
                        parameters as shown in the following code example:
    from fmeval.eval import get_eval_algorithm
    from fmeval.eval_algorithms.qa_accuracy import QAAccuracy, QAAccuracyConfig

    eval_algo = QAAccuracy(QAAccuracyConfig(target_output_delimiter="<OR>"))
    eval_output = eval_algo.evaluate(model=model_runner, dataset_config=config, prompt_template="$feature", save=True)
The SummarizationAccuracy algorithm returns a list of
                            EvalOutput objects that contain scores for ROUGE-N, METEOR, and BERTScore. To run the text summarization accuracy algorithm, instantiate a SummarizationAccuracyConfig and pass in the
                        following:
- 
                            Specify the type of ROUGE metric you want to use in your evaluation in rouge_type. You can choose rouge1, rouge2, or rougeL. These metrics compare generated summaries to reference summaries. ROUGE-1 compares the generated summaries and reference summaries using overlapping unigrams (sequences of one item such as “the”, “is”). ROUGE-2 compares the generated and reference summaries using bigrams (sequences of two items such as “the large”, “is home”). ROUGE-L compares the longest matching sequence of words. For more information about ROUGE, see ROUGE: A Package for Automatic Evaluation of Summaries.
- 
                            Set use_stemmer_for_rouge to True or False. A stemmer removes affixes from words before comparing them. For example, a stemmer removes the affixes from “swimming” and “swims” so that they are both “swim” after stemming.
- 
                            Set model_type_for_bertscore to the model that you want to use to calculate a BERTScore. You can choose ROBERTA_MODEL or the more advanced MICROSOFT_DEBERTA_MODEL.
Lastly, call the evaluate method and pass in your desired
                        parameters as shown in the following code example:
    from fmeval.eval import get_eval_algorithm
    from fmeval.eval_algorithms.summarization_accuracy import SummarizationAccuracy, SummarizationAccuracyConfig

    eval_algo = SummarizationAccuracy(SummarizationAccuracyConfig(rouge_type="rouge1", model_type_for_bertscore="MICROSOFT_DEBERTA_MODEL"))
    eval_output = eval_algo.evaluate(model=model_runner, dataset_config=config, prompt_template="$feature", save=True)
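To illustrate what the ROUGE variants described above measure, the following is a minimal sketch of ROUGE-1, ROUGE-2, and ROUGE-L computed with plain Python. It is not the scorer that fmeval uses. It assumes lowercase whitespace tokenization and no stemming, and the function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n):
    """Return (precision, recall, f1) from clipped n-gram overlap."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # Counter & keeps the minimum counts
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(candidate, reference):
    """Return (precision, recall, f1) based on the longest common subsequence."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(c, r)
    precision, recall = lcs / max(len(c), 1), lcs / max(len(r), 1)
    f1 = 0.0 if lcs == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


generated = "the cat sat on the mat"
reference = "the cat is on the mat"
print(rouge_n(generated, reference, 1))
print(rouge_n(generated, reference, 2))
print(rouge_l(generated, reference))
```

In this example the two sentences share five of six unigrams but only three of five bigrams, which shows why ROUGE-2 is the stricter of the two n-gram metrics.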
The ClassificationAccuracy algorithm returns a list of
                            EvalOutput objects that contain the classification
                        accuracy, precision, recall, and balanced accuracy scores for each
                        sample. For more information about these scores, see the Classification section in Using prompt datasets and available evaluation dimensions in model evaluation jobs . To run
                        the classification accuracy algorithm, instantiate a
                            ClassificationAccuracyConfig and pass in an averaging
                        strategy to multiclass_average_strategy. You can choose
                            micro, macro, samples,
                            weighted, or binary. The default value is
                            micro. Then, pass in a list containing the names of the
                        columns that contain the true labels for your classification categories
                        to valid_labels. Lastly, call the evaluate method and pass
                        in your desired parameters as shown in the following code
                        example:
    from fmeval.eval import get_eval_algorithm
    from fmeval.eval_algorithms.classification_accuracy import ClassificationAccuracy, ClassificationAccuracyConfig

    eval_algo = ClassificationAccuracy(ClassificationAccuracyConfig(multiclass_average_strategy="samples", valid_labels=["animal_type", "plant_type", "fungi_type"]))
    eval_output = eval_algo.evaluate(model=model_runner, dataset_config=config, prompt_template="$feature", save=True)
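The scores named above, and the difference between the micro and macro averaging strategies, can be sketched in a few lines of plain Python. This is an illustration under simple assumptions and not the fmeval implementation: labels are compared as exact strings, a ratio with a zero denominator is reported as 0.0, and the function name classification_scores is hypothetical.

```python
from collections import Counter


def _ratio(numerator, denominator):
    return numerator / denominator if denominator else 0.0


def classification_scores(y_true, y_pred):
    """Compute accuracy, micro and macro precision and recall, and balanced accuracy."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label was wrong
            fn[truth] += 1  # true label was missed
    labels = sorted(set(y_true) | set(y_pred))
    precision = {l: _ratio(tp[l], tp[l] + fp[l]) for l in labels}
    recall = {l: _ratio(tp[l], tp[l] + fn[l]) for l in labels}
    true_labels = sorted(set(y_true))
    total_tp = sum(tp.values())
    return {
        "accuracy": _ratio(total_tp, len(y_true)),
        # Micro averaging pools the counts over all labels.
        "precision_micro": _ratio(total_tp, total_tp + sum(fp.values())),
        "recall_micro": _ratio(total_tp, total_tp + sum(fn.values())),
        # Macro averaging gives every label the same weight.
        "precision_macro": _ratio(sum(precision.values()), len(labels)),
        "recall_macro": _ratio(sum(recall.values()), len(labels)),
        # Balanced accuracy is the mean recall over the labels that occur in y_true.
        "balanced_accuracy": _ratio(sum(recall[l] for l in true_labels), len(true_labels)),
    }


y_true = ["cat", "cat", "dog", "dog", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "cat"]
scores = classification_scores(y_true, y_pred)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Because the rare label bird is never predicted correctly, the macro and balanced scores in this example are lower than the plain accuracy.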
Factual knowledge
You can run the factual knowledge algorithm for open-ended generation.
                        To run the factual knowledge algorithm, instantiate a
                            FactualKnowledgeConfig and optionally pass a delimiter
                        string (by default, this is <OR>). The factual
                        knowledge algorithm compares the response that your model generates with
                        a known response. The algorithm scores the response as correct if it
                        generates any of the content separated by the delimiter in the answer.
                        If you pass None as the
                            target_output_delimiter, then the model must generate
                        the same response as the answer to be scored as correct. Lastly, call
                        the evaluate method and pass in your desired
                        parameters.
Factual knowledge returns a list of EvalScore objects.
                        These contain aggregated scores on how well your model is able to encode
                        factual knowledge as described in the Foundation
                            model evaluation overview section. The scores range
                        between 0 and 1, with lower scores
                        corresponding to less knowledge of real-world facts.
The following code example shows how to evaluate your LLM using the factual knowledge algorithm:
    from fmeval.eval import get_eval_algorithm
    from fmeval.eval_algorithms.factual_knowledge import FactualKnowledge, FactualKnowledgeConfig

    eval_algo = FactualKnowledge(FactualKnowledgeConfig())
    eval_output = eval_algo.evaluate(model=model_runner, dataset_config=config, prompt_template="$feature", save=True)
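The delimiter behavior described above can be sketched as follows. This is a simplified illustration of the scoring rule and not the fmeval implementation. The function name factual_knowledge_score is hypothetical, and the sketch assumes case-insensitive comparison and that a delimiter of None requires the whole response to equal the whole answer.

```python
def factual_knowledge_score(model_output, target_output, target_output_delimiter="<OR>"):
    """Return 1 if the response matches an accepted answer, otherwise 0."""
    output = model_output.strip().lower()
    if target_output_delimiter is None:
        # No delimiter: the response must be the same as the answer.
        return int(output == target_output.strip().lower())
    # With a delimiter: any one of the accepted answers may appear in the response.
    targets = [t.strip().lower() for t in target_output.split(target_output_delimiter)]
    return int(any(t and t in output for t in targets))


target = "UK<OR>England<OR>United Kingdom"
print(factual_knowledge_score("London is the capital of England.", target))
print(factual_knowledge_score("London is the capital of France.", target))
print(factual_knowledge_score("England", target, target_output_delimiter=None))
```

The last call shows why the delimiter matters: without it, the target string is treated as a single answer, so a response of England no longer matches.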
Prompt stereotyping
You can run the prompt stereotyping algorithm for open-ended
                        generation. To run the prompt stereotyping algorithm, your
                            DataConfig must identify the columns in your input
                        dataset that contain a less stereotypical sentence in
                            sent_less_input_location and a more stereotypical
                        sentence in sent_more_input_location. For more information
                        about DataConfig, see the previous section 2. Configure ModelRunner. Next,
                        call the evaluate method and pass in your desired
                        parameters.
Prompt stereotyping returns a list of EvalOutput objects
                        that contain a score for each input record and overall scores for each
                        type of bias. The scores are calculated by comparing the probability of
                        the more and less stereotypical sentences. The overall score reports how
                        often the model preferred the stereotypical sentence in that the model
                        assigns a higher probability to the more stereotypical compared to the
                        less stereotypical sentence. A score of 0.5 indicates that
                        your model is unbiased, or that it prefers more and less stereotypical
                        sentences at equal rates. A score of greater than 0.5
                        indicates that your model is likely to generate a response that is more
                        stereotypical. Scores less than 0.5 indicate that your
                        model is likely to generate a response that is less
                        stereotypical.
The following code example shows how to evaluate your LLM using the prompt stereotyping algorithm:
    from fmeval.eval import get_eval_algorithm
    from fmeval.eval_algorithms.prompt_stereotyping import PromptStereotyping

    eval_algo = PromptStereotyping()
    eval_output = eval_algo.evaluate(model=model_runner, dataset_config=config, prompt_template="$feature", save=True)
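The scoring rule described above, where the overall score is the share of sentence pairs for which the model assigns a higher probability to the more stereotypical sentence, can be sketched as follows. This is an illustration and not the fmeval implementation. The function names and the (category, sent_more_log_prob, sent_less_log_prob) record layout are assumptions, and the log probabilities are made-up values.

```python
from collections import defaultdict


def prefers_stereotype(sent_more_log_prob, sent_less_log_prob):
    """True if the model finds the more stereotypical sentence more likely."""
    return sent_more_log_prob > sent_less_log_prob


def stereotyping_scores(records):
    """Return (overall score, score per bias category) for the given records."""
    flags_by_category = defaultdict(list)
    for category, more, less in records:
        flags_by_category[category].append(prefers_stereotype(more, less))
    all_flags = [flag for flags in flags_by_category.values() for flag in flags]
    if not all_flags:
        raise ValueError("records must not be empty")
    overall = sum(all_flags) / len(all_flags)
    per_category = {c: sum(f) / len(f) for c, f in flags_by_category.items()}
    return overall, per_category


# Hypothetical log probabilities for four sentence pairs.
records = [
    ("age", -10.0, -12.0),
    ("age", -8.0, -7.5),
    ("gender", -20.0, -25.0),
    ("gender", -5.0, -5.1),
]
overall, per_category = stereotyping_scores(records)
print(overall, per_category)
```

An overall value of 0.5 would mean that the model prefers each kind of sentence equally often, which matches the interpretation of the score given above.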
Semantic robustness
You can run a semantic robustness algorithm for any FMEval task; however, your model should be deterministic. A deterministic model is one that always generates the same output for the same input. You can typically achieve determinism by setting a random seed in the decoding process. The algorithms are different for each task in order to accommodate the different data input types and problems as follows:
- 
                            For open-ended generation, question answering, or classification, run the GeneralSemanticRobustness algorithm with a GeneralSemanticRobustnessConfig.
- 
                            For text summarization, run the SummarizationAccuracySemanticRobustness algorithm with a SummarizationAccuracySemanticRobustnessConfig.
The GeneralSemanticRobustness algorithm returns a list of
                            EvalScore objects that contain accuracy with values
                        between 0 and 1 quantifying the difference
                        between the perturbed and unperturbed model outputs. To run the general
                        semantic robustness algorithm, instantiate a
                            GeneralSemanticRobustnessConfig and pass in a
                            perturbation_type. You can choose one of the following
                        for perturbation_type:
- 
                            Butterfinger – A perturbation that mimics spelling mistakes using character swaps based on keyboard distance. Input a probability that a given character is perturbed. Butterfinger is the default value for perturbation_type.
- 
                            RandomUpperCase – A perturbation that changes a fraction of characters to uppercase. Input a decimal from 0 to 1.
- 
                            WhitespaceAddRemove – A perturbation that adds and removes white space characters. Input a probability that a white space character is added in front of a non-white space character, and a probability that an existing white space character is removed.
You can also specify the following parameters:
- 
                            num_perturbations – The number of perturbations for each sample to introduce into the generated text. The default is 5.
- 
                            butter_finger_perturbation_prob – The probability that a character is perturbed. Used only when perturbation_type is Butterfinger. The default is 0.1.
- 
                            random_uppercase_corrupt_proportion – The fraction of characters to be changed to uppercase. Used only when perturbation_type is RandomUpperCase. The default is 0.1.
- 
                            whitespace_add_prob – Given a non-white space, the probability of adding a white space in front of it. Used only when perturbation_type is WhitespaceAddRemove. The default is 0.05.
- 
                            whitespace_remove_prob – Given a white space, the probability of removing it from a sample. Used only when perturbation_type is WhitespaceAddRemove. The default is 0.1.
Lastly, call the evaluate method and pass in your desired
                        parameters as shown in the following code example:
    from fmeval.eval import get_eval_algorithm
    from fmeval.eval_algorithms.general_semantic_robustness import GeneralSemanticRobustness, GeneralSemanticRobustnessConfig

    eval_algo = GeneralSemanticRobustness(GeneralSemanticRobustnessConfig(perturbation_type="RandomUpperCase", num_perturbations=2, random_uppercase_corrupt_proportion=0.3))
    eval_output = eval_algo.evaluate(model=model_runner, dataset_config=config, prompt_template="$feature", save=True)
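To give a feel for what two of the perturbation types do to a prompt, the following is a minimal sketch of RandomUpperCase-style and WhitespaceAddRemove-style perturbations. It is not the fmeval implementation. The function names and the seed argument are assumptions, and a fixed seed is used so that the same input always yields the same perturbed text.

```python
import random


def random_upper_case(text, corrupt_proportion=0.1, seed=0):
    """Uppercase a fixed fraction of character positions chosen at random."""
    rng = random.Random(seed)
    count = int(round(len(text) * corrupt_proportion))
    positions = set(rng.sample(range(len(text)), count))
    return "".join(ch.upper() if i in positions else ch for i, ch in enumerate(text))


def whitespace_add_remove(text, add_prob=0.05, remove_prob=0.1, seed=0):
    """Randomly drop white space and insert spaces before other characters."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch.isspace():
            if rng.random() < remove_prob:
                continue  # remove this white space character
            out.append(ch)
        else:
            if rng.random() < add_prob:
                out.append(" ")  # add a white space in front of this character
            out.append(ch)
    return "".join(out)


prompt = "london is the capital of"
for trial in range(2):
    # A different seed per trial gives a different perturbed version.
    print(random_upper_case(prompt, corrupt_proportion=0.3, seed=trial))
    print(whitespace_add_remove(prompt, add_prob=0.05, remove_prob=0.1, seed=trial))
```

A robustness evaluation would then send each perturbed prompt to the model and compare the response with the response to the unperturbed prompt.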
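The SummarizationAccuracySemanticRobustness algorithm
                        returns a list of EvalScore objects that contain the
                        difference (or delta) between the ROUGE-N, METEOR, and BERTScore values computed for the unperturbed and perturbed inputs. To run the summarization accuracy semantic robustness algorithm, instantiate a SummarizationAccuracySemanticRobustnessConfig and pass
                        in a perturbation_type.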
You can choose one of the following for
                        perturbation_type:
- 
                            Butterfinger – A perturbation that mimics spelling mistakes using character swaps based on keyboard distance. Input a probability that a given character is perturbed. Butterfinger is the default value for perturbation_type.
- 
                            RandomUpperCase – A perturbation that changes a fraction of characters to uppercase. Input a decimal from 0 to 1.
- 
                            WhitespaceAddRemove – A perturbation that adds and removes white space characters. Input a probability that a white space character is added in front of a non-white space character, and a probability that an existing white space character is removed.
You can also specify the following parameters:
- 
                            num_perturbations – The number of perturbations for each sample to introduce into the generated text. The default is 5.
- 
                            butter_finger_perturbation_prob – The probability that a character is perturbed. Used only when perturbation_type is Butterfinger. The default is 0.1.
- 
                            random_uppercase_corrupt_proportion – The fraction of characters to be changed to uppercase. Used only when perturbation_type is RandomUpperCase. The default is 0.1.
- 
                            whitespace_add_prob – Given a non-white space, the probability of adding a white space in front of it. Used only when perturbation_type is WhitespaceAddRemove. The default is 0.05.
- 
                            whitespace_remove_prob – Given a white space, the probability of removing it from a sample. Used only when perturbation_type is WhitespaceAddRemove. The default is 0.1.
- 
                            rouge_type – The type of ROUGE metric that you want to use in your evaluation. These metrics compare generated summaries to reference summaries. You can choose rouge1, rouge2, or rougeL. ROUGE-1 compares the generated summaries and reference summaries using overlapping unigrams (sequences of one item such as “the”, “is”). ROUGE-2 compares the generated and reference summaries using bigrams (sequences of two items such as “the large”, “is home”). ROUGE-L compares the longest matching sequence of words. For more information about ROUGE, see ROUGE: A Package for Automatic Evaluation of Summaries.
- 
                            Set use_stemmer_for_rouge to True or False. A stemmer removes affixes from words before comparing them. For example, a stemmer removes the affixes from “swimming” and “swims” so that they are both “swim” after stemming.
- 
                            Set model_type_for_bertscore to the model that you want to use to calculate a BERTScore. You can choose ROBERTA_MODEL or the more advanced MICROSOFT_DEBERTA_MODEL.
Call the evaluate method and pass in your desired parameters as shown in the following code example:
    from fmeval.eval import get_eval_algorithm
    from fmeval.eval_algorithms.summarization_accuracy_semantic_robustness import SummarizationAccuracySemanticRobustness, SummarizationAccuracySemanticRobustnessConfig

    eval_algo = SummarizationAccuracySemanticRobustness(SummarizationAccuracySemanticRobustnessConfig(perturbation_type="Butterfinger", num_perturbations=3, butter_finger_perturbation_prob=0.2))
    eval_output = eval_algo.evaluate(model=model_runner, dataset_config=config, prompt_template="$feature", save=True)
Toxicity
You can run a toxicity algorithm for open-ended generation, text summarization, or question answering. There are three distinct classes depending on the task.
- 
                            For open-ended generation, run the Toxicity algorithm with a ToxicityConfig.
- 
                            For summarization, use the class SummarizationToxicity.
- 
                            For question answering, use the class QAToxicity.
The toxicity algorithm returns a list of one or more
                            EvalScore objects (depending on the toxicity detector)
                        that contain scores between 0 and 1. To run
                        the toxicity algorithm, instantiate a ToxicityConfig and
                        pass in the toxicity model to use to evaluate your model in
                            model_type. You can choose the following for
                            model_type:
- 
                            `detoxify` for UnitaryAI Detoxify-unbiased, a multilabel text classifier trained on Toxic Comment Classification Challenge and Jigsaw Unintended Bias in Toxicity Classification. The model provides 7 scores for the following classes: toxicity, severe toxicity, obscenity, threat, insult, sexual explicitness, and identity attack. The following is example output from the detoxify model:
                                EvalScore(name='toxicity', value=0.01936926692724228),
                                EvalScore(name='severe_toxicity', value=3.3755677577573806e-06),
                                EvalScore(name='obscene', value=0.00022437423467636108),
                                EvalScore(name='identity_attack', value=0.0006707844440825284),
                                EvalScore(name='insult', value=0.005559926386922598),
                                EvalScore(name='threat', value=0.00016682750720065087),
                                EvalScore(name='sexual_explicit', value=4.828436431125738e-05)
- 
                            `toxigen` for Toxigen-roberta, a binary RoBERTa-based text classifier fine-tuned on the ToxiGen dataset, which contains sentences with subtle and implicit toxicity pertaining to 13 minority groups.
Lastly, call the evaluate method and pass in your desired
                        parameters as shown in the following code example.
    from fmeval.eval import get_eval_algorithm
    from fmeval.eval_algorithms.toxicity import Toxicity, ToxicityConfig

    eval_algo = Toxicity(ToxicityConfig(model_type="detoxify"))
    eval_output = eval_algo.evaluate(model=model_runner, dataset_config=config, prompt_template="$feature", save=True)