Create an AutoML job to fine-tune text generation models using the API
Large language models (LLMs) excel in multiple generative tasks, including text generation, summarization, completion, question answering, and more. Their performance can be attributed to their significant size and extensive training on diverse datasets and various tasks. However, specific domains, such as healthcare and financial services, may require customized fine-tuning to adapt to unique data and use cases. By tailoring their training to their particular domain, LLMs can improve their performance and provide more accurate outputs for targeted applications.
Autopilot can fine-tune a selection of pre-trained generative text models. In particular, Autopilot supports instruction-based fine-tuning of a selection of general-purpose large language models (LLMs) powered by JumpStart.
Note
The text generation models that support fine-tuning in Autopilot are currently accessible exclusively in Regions supported by SageMaker Canvas. See the documentation of SageMaker Canvas for the full list of its supported Regions.
Fine-tuning a pre-trained model requires a specific dataset of clear instructions that guide the model on how to generate output or behave for that task. The model learns from the dataset, adjusting its parameters to conform to the provided instructions. Instruction-based fine-tuning involves using labeled examples formatted as prompt-response pairs and phrased as instructions. For more information about fine-tuning, see Fine-tune a foundation model.
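As a rough illustration of prompt-response pairs phrased as instructions, the following Python sketch writes a small CSV file. The column names input and output and the sample rows are assumptions made for this example; check Dataset file types and input data format for the exact requirements before preparing a real dataset.

```python
import csv
import io

# Hypothetical prompt-response pairs, phrased as instructions.
EXAMPLES = [
    ("Summarize: The patient reports mild headaches for three days.",
     "The patient has had mild headaches for three days."),
    ("Classify the sentiment: The quarterly results exceeded expectations.",
     "Positive"),
]


def write_instruction_csv(pairs, handle, prompt_column="input", response_column="output"):
    """Write prompt-response pairs as a two-column CSV with a header row.

    The column names are assumptions for illustration only.
    """
    writer = csv.writer(handle)
    writer.writerow([prompt_column, response_column])
    for prompt, response in pairs:
        # Reject empty prompts or responses instead of writing unusable rows.
        if not prompt.strip() or not response.strip():
            raise ValueError("Each example needs a non-empty prompt and response.")
        writer.writerow([prompt, response])


# Write to an in-memory buffer; pass an open file handle to write to disk.
buffer = io.StringIO()
write_instruction_csv(EXAMPLES, buffer)
```

The csv module quotes fields that contain commas or line breaks, so free-form prompts remain in a single column.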
The following guidelines outline the process of creating an Amazon SageMaker Autopilot job as a pilot experiment to fine-tune text generation LLMs using the SageMaker API Reference.
Note
Tasks such as text and image classification,
time-series forecasting, and fine-tuning of large language models are exclusively available
through version 2 of the AutoML REST API.
If your language of choice is Python, you can refer to AWS SDK for Python (Boto3).
Users who prefer the convenience of a user interface can use Amazon SageMaker Canvas to access pre-trained models and generative AI foundation models, or create custom models tailored for specific text, image classification, forecasting needs, or generative AI.
To create an Autopilot experiment programmatically for fine-tuning an LLM, you can call the
CreateAutoMLJobV2 API in any language supported by Amazon SageMaker Autopilot or the
AWS CLI.
For information about how this API action translates into a function in the language of your
choice, see the
See Also section of CreateAutoMLJobV2 and choose an SDK. As an example,
for Python users, see the full request syntax of create_auto_ml_job_v2 in AWS SDK for Python (Boto3).
Note
Autopilot fine-tunes large language models without requiring multiple candidates to be
trained and evaluated. Instead, using your dataset, Autopilot directly fine-tunes your target
model to enhance a default objective metric, the cross-entropy loss. Fine-tuning language
models in Autopilot does not require setting the AutoMLJobObjective field.
Once your LLM is fine-tuned, you can evaluate its performance by accessing various ROUGE
scores through the BestCandidate when making a DescribeAutoMLJobV2 API call. The model also provides information about its
training and validation loss as well as perplexity. For a comprehensive list of metrics for
evaluating the quality of the text generated by the fine-tuned models, see Metrics for fine-tuning large language
models in Autopilot.
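The following Python sketch shows one way to read the metrics of the BestCandidate from a DescribeAutoMLJobV2 response. The helper names are hypothetical, and the sample response is a made-up fragment with illustrative metric names and values, not real output.

```python
def candidate_metrics(describe_response):
    """Return {metric name: value} for the best candidate of an AutoML job."""
    best = describe_response.get("BestCandidate", {})
    metrics = best.get("CandidateProperties", {}).get("CandidateMetrics", [])
    return {metric["MetricName"]: metric["Value"] for metric in metrics}


def fetch_metrics(job_name):
    """Call DescribeAutoMLJobV2 and extract the metrics (requires AWS credentials)."""
    import boto3  # imported lazily so candidate_metrics works without the SDK

    client = boto3.client("sagemaker")
    response = client.describe_auto_ml_job_v2(AutoMLJobName=job_name)
    return candidate_metrics(response)


# Made-up response fragment used only to exercise the helper.
sample_response = {
    "BestCandidate": {
        "CandidateProperties": {
            "CandidateMetrics": [
                {"MetricName": "Rouge1", "Value": 0.41},
                {"MetricName": "Perplexity", "Value": 3.2},
            ]
        }
    }
}
sample_metrics = candidate_metrics(sample_response)
```

A job that is still running may not have a BestCandidate yet, in which case the helper returns an empty dictionary.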
Prerequisites
Before using Autopilot to create a fine-tuning experiment in SageMaker AI, make sure to take the following steps:
-
(Optional) Choose the pre-trained model you want to fine-tune.
For the list of pre-trained models available for fine-tuning in Amazon SageMaker Autopilot, see Supported large language models for fine-tuning. The selection of a model is not mandatory; if no model is specified, Autopilot automatically defaults to the model Falcon7BInstruct.
-
Create a dataset of instructions. See Dataset file types and input data format to learn about the format requirements for your instruction-based dataset.
-
Place your dataset in an Amazon S3 bucket.
-
Grant full access to the Amazon S3 bucket containing your input data for the SageMaker AI execution role used to run your experiment.
-
For information on retrieving your SageMaker AI execution role, see Get your execution role.
-
For information on granting your SageMaker AI execution role permissions to access one or more specific buckets in Amazon S3, see Add Additional Amazon S3 Permissions to a SageMaker AI Execution Role in Create execution role.
-
Additionally, you should provide your execution role with the necessary permissions to access the default storage Amazon S3 bucket used by JumpStart. This access is required for storing and retrieving pre-trained model artifacts in JumpStart. To grant access to this Amazon S3 bucket, you must create a new inline custom policy on your execution role.
Here's an example policy that you can use in your JSON editor when configuring AutoML fine-tuning jobs in us-west-2. JumpStart's bucket names follow a predetermined pattern that depends on the AWS Region, so you must adjust the name of the bucket accordingly.
{
    "Sid": "Statement1",
    "Effect": "Allow",
    "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket"
    ],
    "Resource": [
        "arn:aws:s3:::jumpstart-cache-prod-us-west-2",
        "arn:aws:s3:::jumpstart-cache-prod-us-west-2/*"
    ]
}
Once this is done, you can use the ARN of this execution role in Autopilot API requests.
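If you prefer to attach the inline policy programmatically, the following Python sketch builds the policy document for a given Region and attaches it with the IAM PutRolePolicy action. It assumes that the bucket name follows the pattern jumpstart-cache-prod-<region> seen in the us-west-2 example; verify the bucket name for your Region. The function names and the policy name are hypothetical.

```python
import json


def jumpstart_cache_policy(region):
    """Build an inline policy document for the JumpStart cache bucket.

    Assumption: the bucket is named jumpstart-cache-prod-<region>, as in
    the us-west-2 example. Verify the name for your Region.
    """
    bucket_arn = f"arn:aws:s3:::jumpstart-cache-prod-{region}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "Statement1",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
            }
        ],
    }


def attach_policy(role_name, region, policy_name="AutopilotJumpStartCacheAccess"):
    """Attach the policy inline to the execution role (requires IAM permissions)."""
    import boto3  # imported lazily so the builder above works without the SDK

    iam = boto3.client("iam")
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,  # hypothetical policy name
        PolicyDocument=json.dumps(jumpstart_cache_policy(region)),
    )


policy = jumpstart_cache_policy("us-west-2")
```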
Required parameters
When calling CreateAutoMLJobV2 to create an Autopilot experiment for LLM fine-tuning, you
must provide the following values:
-
An AutoMLJobName to specify the name of your job. The name should be of type string, and should have a minimum length of 1 character and a maximum length of 32 characters.
-
At least one AutoMLJobChannel of the training type within the AutoMLJobInputDataConfig. This channel specifies the name of the Amazon S3 bucket where your fine-tuning dataset is located. You have the option to define a validation channel. If no validation channel is provided, and a ValidationFraction is configured in the AutoMLDataSplitConfig, this fraction is used to randomly divide the training dataset into training and validation sets. Additionally, you can specify the type of content (CSV or Parquet files) for the dataset.
-
An AutoMLProblemTypeConfig of type TextGenerationJobConfig to configure the settings of your training job.
In particular, you can specify the name of the base model to fine-tune in the BaseModelName field. For the list of pre-trained models available for fine-tuning in Amazon SageMaker Autopilot, see Supported large language models for fine-tuning.
-
An OutputDataConfig to specify the Amazon S3 output path to store the artifacts of your AutoML job.
-
A RoleArn to specify the ARN of the role used to access your data.
The following is an example of the full request format used when making an API call to
CreateAutoMLJobV2 for fine-tuning a Falcon7BInstruct model.
{
    "AutoMLJobName": "<job_name>",
    "AutoMLJobInputDataConfig": [
        {
            "ChannelType": "training",
            "CompressionType": "None",
            "ContentType": "text/csv",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://<bucket_name>/<input_data>.csv"
                }
            }
        }
    ],
    "OutputDataConfig": {
        "S3OutputPath": "s3://<bucket_name>/output",
        "KmsKeyId": "arn:aws:kms:<region>:<account_id>:key/<key_value>"
    },
    "RoleArn": "arn:aws:iam::<account_id>:role/<sagemaker_execution_role_name>",
    "AutoMLProblemTypeConfig": {
        "TextGenerationJobConfig": {
            "BaseModelName": "Falcon7BInstruct"
        }
    }
}
All other parameters are optional.
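For Python users, a minimal sketch of the same call with AWS SDK for Python (Boto3) might look like the following. The helper names, bucket name, account ID, and role name are placeholders invented for this example, and the optional KmsKeyId is omitted.

```python
def build_fine_tuning_request(job_name, train_s3_uri, output_s3_path, role_arn,
                              base_model_name="Falcon7BInstruct"):
    """Assemble the required CreateAutoMLJobV2 parameters for LLM fine-tuning."""
    # The job name must be between 1 and 32 characters long.
    if not 1 <= len(job_name) <= 32:
        raise ValueError("AutoMLJobName must be 1 to 32 characters long.")
    return {
        "AutoMLJobName": job_name,
        "AutoMLJobInputDataConfig": [
            {
                "ChannelType": "training",
                "CompressionType": "None",
                "ContentType": "text/csv",
                "DataSource": {
                    "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": train_s3_uri}
                },
            }
        ],
        "OutputDataConfig": {"S3OutputPath": output_s3_path},
        "RoleArn": role_arn,
        "AutoMLProblemTypeConfig": {
            "TextGenerationJobConfig": {"BaseModelName": base_model_name}
        },
    }


def submit(request):
    """Start the job and return its ARN (requires AWS credentials)."""
    import boto3  # imported lazily so the builder above works without the SDK

    client = boto3.client("sagemaker")
    return client.create_auto_ml_job_v2(**request)["AutoMLJobArn"]


# Placeholder values; replace them with your own resources.
request = build_fine_tuning_request(
    job_name="llm-fine-tuning-demo",
    train_s3_uri="s3://amzn-s3-demo-bucket/train.csv",
    output_s3_path="s3://amzn-s3-demo-bucket/output",
    role_arn="arn:aws:iam::111122223333:role/MySageMakerExecutionRole",
)
```

Keeping the request construction separate from the API call makes it possible to inspect or log the request before anything is submitted.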
Optional parameters
The following sections provide details of some optional parameters that you can pass to your fine-tuning AutoML job.
You can provide your own validation dataset and custom data split ratio, or let Autopilot split the dataset automatically.
Each AutoMLJobChannel object (see the required parameter AutoMLJobInputDataConfig) has a ChannelType, which can be set to
either training or validation. This value specifies how the data
is used when building a machine learning model.
At least one data source must be provided and a maximum of two data sources is allowed: one for training data and one for validation data. How you split the data into training and validation datasets depends on whether you have one or two data sources.
-
If you only have one data source, the ChannelType is set to training by default and must have this value.
  -
  If the ValidationFraction value in AutoMLDataSplitConfig is not set, 0.2 (20%) of the data from this source is used for validation by default.
  -
  If the ValidationFraction is set to a value between 0 and 1, the dataset is split based on the value specified, where the value specifies the fraction of the dataset used for validation.
-
If you have two data sources, the ChannelType of one of the AutoMLJobChannel objects must be set to training, the default value. The ChannelType of the other data source must be set to validation. The two data sources must have the same format, either CSV or Parquet, and the same schema. You must not set the value for the ValidationFraction in this case because all of the data from each source is used for either training or validation. Setting this value causes an error.
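The rules above can be sketched in Python as a helper that builds the input channels and, for a single data source, the optional split configuration. The function names are hypothetical. The sketch assumes the split is passed in the DataSplitConfig request parameter and treats the bounds 0 and 1 as exclusive, which is an assumption.

```python
def channel(channel_type, s3_uri, content_type="text/csv"):
    """Build one AutoMLJobChannel entry for AutoMLJobInputDataConfig."""
    return {
        "ChannelType": channel_type,
        "ContentType": content_type,
        "DataSource": {
            "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": s3_uri}
        },
    }


def input_and_split_config(train_uri, validation_uri=None, validation_fraction=None):
    """Return the request parameters that describe the data sources and the split."""
    channels = [channel("training", train_uri)]
    params = {"AutoMLJobInputDataConfig": channels}
    if validation_uri is not None:
        # Two data sources: a validation fraction must not be set.
        if validation_fraction is not None:
            raise ValueError("Do not set ValidationFraction with two data sources.")
        channels.append(channel("validation", validation_uri))
    elif validation_fraction is not None:
        # One data source: the fraction must lie between 0 and 1
        # (exclusive bounds are an assumption of this sketch).
        if not 0 < validation_fraction < 1:
            raise ValueError("ValidationFraction must be between 0 and 1.")
        params["DataSplitConfig"] = {"ValidationFraction": validation_fraction}
    return params


single_source = input_and_split_config("s3://amzn-s3-demo-bucket/train.csv",
                                       validation_fraction=0.1)
two_sources = input_and_split_config("s3://amzn-s3-demo-bucket/train.csv",
                                     validation_uri="s3://amzn-s3-demo-bucket/val.csv")
```

When neither a validation source nor a fraction is given, the helper leaves DataSplitConfig out so that the default split applies.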
With Autopilot, you can automatically deploy your fine-tuned model to an endpoint. To
enable automatic deployment for your fine-tuned model, include a ModelDeployConfig in the AutoML job request. This allows the
deployment of your fine-tuned model to a SageMaker AI endpoint. Below are the available
configurations for customization.
-
To let Autopilot generate the endpoint name, set AutoGenerateEndpointName to True.
-
To provide your own name for the endpoint, set AutoGenerateEndpointName to False and provide a name of your choice in EndpointName.
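As a small sketch, the two options can be expressed as a Python helper that returns the value to pass as ModelDeployConfig in the job request. The helper name and the endpoint name are hypothetical.

```python
def model_deploy_config(endpoint_name=None):
    """Build a ModelDeployConfig value for the AutoML job request."""
    if endpoint_name is None:
        # Let Autopilot generate the endpoint name.
        return {"AutoGenerateEndpointName": True}
    # Use the caller's endpoint name instead of a generated one.
    return {"AutoGenerateEndpointName": False, "EndpointName": endpoint_name}


generated = model_deploy_config()
named = model_deploy_config("my-fine-tuned-llm")  # hypothetical endpoint name
```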
For models requiring the acceptance of an end-user license agreement before
fine-tuning, you can accept the EULA by setting the AcceptEula attribute of
the ModelAccessConfig to True in TextGenerationJobConfig when configuring your AutoMLProblemTypeConfig.
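A minimal Python sketch of this setting follows. The helper name is hypothetical and the base model name is a placeholder; accept the EULA only after you have reviewed the license of the model you choose.

```python
def text_generation_config(base_model_name, accept_eula=False):
    """Build an AutoMLProblemTypeConfig value for LLM fine-tuning."""
    config = {"BaseModelName": base_model_name}
    if accept_eula:
        # Needed only for models that require accepting a license agreement.
        config["ModelAccessConfig"] = {"AcceptEula": True}
    return {"TextGenerationJobConfig": config}


# "<base_model_name>" is a placeholder for a model that requires a EULA.
problem_type_config = text_generation_config("<base_model_name>", accept_eula=True)
```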
You can optimize the learning process of your text generation model by setting
hyperparameter values in the TextGenerationHyperParameters attribute of
TextGenerationJobConfig when configuring your AutoMLProblemTypeConfig.
Autopilot lets you set four hyperparameters that are common to all models.
-
epochCount: Its value should be a string containing an integer value within the range of 1 to 10.
-
batchSize: Its value should be a string containing an integer value within the range of 1 to 64.
-
learningRate: Its value should be a string containing a floating-point value within the range of 0 to 1.
-
learningRateWarmupSteps: Its value should be a string containing an integer value within the range of 0 to 250.
For more details on each hyperparameter, see Hyperparameters for optimizing the learning process of your text generation models.
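Because every hyperparameter value is passed as a string, it can help to validate numeric values before converting them. The following Python sketch checks the ranges listed above and returns a dictionary suitable for TextGenerationHyperParameters. The helper name is hypothetical, and treating the range bounds as inclusive is an assumption.

```python
from decimal import Decimal

# (expected kind, minimum, maximum) for each hyperparameter listed above.
# Treating the bounds as inclusive is an assumption of this sketch.
HYPERPARAMETER_RANGES = {
    "epochCount": (int, 1, 10),
    "batchSize": (int, 1, 64),
    "learningRate": (float, 0, 1),
    "learningRateWarmupSteps": (int, 0, 250),
}


def text_generation_hyperparameters(**values):
    """Validate numeric hyperparameters and return them as strings."""
    result = {}
    for name, value in values.items():
        if name not in HYPERPARAMETER_RANGES:
            raise KeyError(f"Unknown hyperparameter: {name}")
        kind, low, high = HYPERPARAMETER_RANGES[name]
        allowed = (int,) if kind is int else (int, float)
        if isinstance(value, bool) or not isinstance(value, allowed):
            raise TypeError(f"{name} must be a number of kind {kind.__name__}")
        if not low <= value <= high:
            raise ValueError(f"{name} must be within the range of {low} to {high}")
        # Format without scientific notation (0.000001 rather than 1e-06).
        result[name] = format(Decimal(str(value)), "f")
    return result


hyperparameters = text_generation_hyperparameters(
    epochCount=5, learningRate=0.000001, batchSize=32, learningRateWarmupSteps=10
)
```

Avoiding scientific notation is a cautious choice made here so that small learning rates are written in plain decimal form.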
The following JSON example shows a TextGenerationHyperParameters field
passed to the TextGenerationJobConfig where all four hyperparameters are
configured.
"AutoMLProblemTypeConfig": {
    "TextGenerationJobConfig": {
        "BaseModelName": "Falcon7B",
        "TextGenerationHyperParameters": {
            "epochCount": "5",
            "learningRate": "0.000001",
            "batchSize": "32",
            "learningRateWarmupSteps": "10"
        }
    }
}