Monitoring Progress Across Iterations
You can track metrics via MLflow.
Create an MLflow app
Using Studio UI: If you create a training job through the Studio UI, a default MLflow app is created automatically and selected by default under Advanced Options.
Using CLI: If you use the CLI, you must create an MLflow app and pass it as an input to the training job API request.
mlflow_app_name="<enter your MLflow app name>"
role_arn="<enter your role ARN>"
bucket_name="<enter your bucket name>"
region="<enter your region>"

# Capture only the app ARN from the JSON response; the response field
# name is assumed here, so adjust --query if yours differs.
mlflow_app_arn=$(aws sagemaker create-mlflow-app \
  --name "$mlflow_app_name" \
  --artifact-store-uri "s3://$bucket_name" \
  --role-arn "$role_arn" \
  --region "$region" \
  --query "Arn" --output text)
Access the MLflow app
Using CLI: Create a pre-signed URL to access the MLflow app UI:
aws sagemaker create-presigned-mlflow-app-url \
  --arn "$mlflow_app_arn" \
  --region "$region" \
  --output text
Using Studio UI: The Studio UI displays key metrics stored in MLflow and provides a link to the MLflow app UI.
Key metrics to track
Monitor these metrics across iterations to assess improvement and track job progress:
For SFT
- Training loss curves
- Number of samples consumed and time to process samples
- Performance accuracy on held-out test sets
- Format compliance (e.g., valid JSON output rate)
- Perplexity on domain-specific evaluation data
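As a starting point, the sketch below pulls the training-loss history for the most recent run. It is a minimal sketch, assuming the sagemaker-mlflow plugin is installed (pip install mlflow sagemaker-mlflow) so the MLflow app ARN can be passed as the tracking URI; the experiment name and the metric key train_loss are placeholders for whatever your training job actually logs.

# Minimal sketch: fetch the SFT training-loss curve from MLflow.
# The tracking URI, experiment name, and "train_loss" metric key are
# assumptions; replace them with the values your job actually uses.
import mlflow
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri("<your MLflow app ARN>")

# Most recent run in the experiment your training job logs to.
runs = mlflow.search_runs(
    experiment_names=["<your experiment name>"],
    order_by=["start_time DESC"],
    max_results=1,
)
run_id = runs.loc[0, "run_id"]

# Full step-by-step history of the training loss.
client = MlflowClient()
for metric in client.get_metric_history(run_id, "train_loss"):
    print(f"step={metric.step} loss={metric.value:.4f}")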
For RFT
- Average reward scores over training
- Reward distribution (percentage of high-reward responses)
- Validation reward trends (watch for over-fitting)
- Task-specific success rates (e.g., code execution pass rate, math problem accuracy)
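For the validation-reward trend in particular, a simple over-fitting check is to compare the most recent window of steps against the window before it: rising training reward paired with falling validation reward is a warning sign. The sketch below does this; the metric keys train_reward_mean and val_reward_mean and the run ID placeholder are assumptions.

# Hedged over-fitting check on RFT reward metrics. The metric keys are
# assumed names; run_id can be obtained as in the previous snippet.
import mlflow
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri("<your MLflow app ARN>")
client = MlflowClient()
run_id = "<your run ID>"

def window_mean(history, start, end):
    # Mean metric value over history[start:end]; NaN for an empty window.
    values = [m.value for m in history[start:end]]
    return sum(values) / len(values) if values else float("nan")

train = client.get_metric_history(run_id, "train_reward_mean")
val = client.get_metric_history(run_id, "val_reward_mean")

# Compare the last quarter of logged steps against the quarter before it.
k = max(1, min(len(train), len(val)) // 4)
train_up = window_mean(train, -k, None) > window_mean(train, -2 * k, -k)
val_down = window_mean(val, -k, None) < window_mean(val, -2 * k, -k)

if train_up and val_down:
    print("Training reward rising while validation reward falls: possible over-fitting.")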
General
- Benchmark performance deltas between iterations
- Human evaluation scores on representative samples
- Production metrics (if deploying iteratively)
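For benchmark deltas, one approach is to log each iteration as its own MLflow run with a summary metric, then diff the runs in start-time order. The sketch below assumes a metric key of benchmark_accuracy, which is a placeholder for your own benchmark metric.

# Sketch: benchmark deltas between iterations, assuming one MLflow run
# per iteration and an assumed summary metric key "benchmark_accuracy".
import mlflow

mlflow.set_tracking_uri("<your MLflow app ARN>")
runs = mlflow.search_runs(
    experiment_names=["<your experiment name>"],
    order_by=["start_time ASC"],
)

scores = runs["metrics.benchmark_accuracy"].dropna().tolist()
for i, score in enumerate(scores):
    delta = score - scores[i - 1] if i else 0.0
    print(f"iteration {i}: {score:.4f} (delta {delta:+.4f})")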
Determining when to stop
Stop iterating when:
- Performance plateaus: Additional training no longer meaningfully improves target metrics
- Technique switching stops helping: If one technique plateaus, try switching (e.g., SFT → RFT → SFT) to break through performance ceilings; stop when switching no longer yields gains
- Target metrics achieved: Your success criteria are met
- Regression detected: New iterations degrade performance (see rollback procedures below)
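These rules can be folded into a simple check over per-iteration scores, for example the list produced by the benchmark-delta sketch above. The thresholds and target below are illustrative placeholders, not tuned recommendations.

# Sketch of the stopping rules, applied to per-iteration benchmark scores
# (e.g., the "scores" list from the previous snippet). Thresholds are
# illustrative placeholders.
PLATEAU_EPS = 0.002    # smallest improvement still considered meaningful
REGRESSION_EPS = 0.01  # drop large enough to count as a regression

def should_stop(scores, target=None):
    """Return (stop, reason) for a list of per-iteration scores."""
    if target is not None and scores and scores[-1] >= target:
        return True, "target metrics achieved"
    if len(scores) >= 2 and scores[-1] < scores[-2] - REGRESSION_EPS:
        return True, "regression detected: roll back and investigate"
    if len(scores) >= 3 and scores[-1] - scores[-3] < PLATEAU_EPS:
        return True, "plateau: consider switching techniques before stopping"
    return False, "keep iterating"

print(should_stop([0.71, 0.74, 0.746, 0.747], target=0.80))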
For detailed evaluation procedures, refer to the Evaluation section.