Accelerate MLOps with Backstage and self-service Amazon SageMaker AI templates - AWS Prescriptive Guidance


Ashish Bhatt, Shashank Hirematt, and Shivanshu Suryakar, Amazon Web Services

Summary

Organizations that use machine learning operations (MLOps) systems face significant challenges in scaling, standardizing, and securing their ML infrastructure. This pattern introduces a transformative approach that combines Backstage, an open source developer portal, with Amazon SageMaker AI and hardened infrastructure as code (IaC) modules to improve how your data science teams can develop, deploy, and manage ML workflows.

The IaC modules for this pattern are provided in the GitHub AWS AIOps modules repository. These modules offer pre-built templates for setting up ML infrastructure and creating consistent ML environments. However, data scientists often struggle to use these templates directly because they require infrastructure expertise. Adding a developer portal such as Backstage creates a user-friendly way for data scientists to deploy standardized ML environments without needing to understand the underlying infrastructure details.

By using Backstage as a self-service platform and integrating preconfigured SageMaker AI templates, you can:

  • Accelerate time to value for your ML initiatives.

  • Help enforce consistent security and governance.

  • Provide data scientists with standardized, compliant environments.

  • Reduce operational overhead and infrastructure complexity.

This pattern provides a solution that addresses the critical challenges of MLOps and also provides a scalable, repeatable framework that enables innovation while maintaining organizational standards.

Target audience

This pattern is intended for a broad audience involved in ML, cloud architecture, and platform engineering within an organization. This includes:

  • ML engineers who want to standardize and automate ML workflow deployments.

  • Data scientists who want self-service access to preconfigured and compliant ML environments.

  • Platform engineers who are responsible for building and maintaining internal developer platforms and shared infrastructure.

  • Cloud architects who design scalable, secure, and cost-effective cloud solutions for MLOps.

  • DevOps engineers who are interested in extending continuous integration and continuous delivery (CI/CD) practices to ML infrastructure provisioning and workflows.

  • Technical leads and managers who oversee ML initiatives and want to improve team productivity, governance, and time to market.

For more information about MLOps challenges, SageMaker AI MLOps modules, and how the solution provided by this pattern can address the needs of your ML teams, see the Additional information section.

Prerequisites and limitations

Prerequisites

Limitations

  • Limited template coverage. Currently, the solution supports only SageMaker AI-related AIOps modules from the broader AIOps solution. Other modules, such as Ray on Amazon Elastic Kubernetes Service (Amazon EKS), MLflow, Apache Airflow, and fine-tuning for Amazon Bedrock, are not yet available as Backstage templates.

  • Non-configurable default settings. Templates use fixed default configurations from the AIOps SageMaker modules with no customization. You cannot modify instance types, storage sizes, networking configurations, or security policies through the Backstage interface, which limits flexibility for specific use cases.

  • AWS-only support. The platform is designed exclusively for AWS deployments and doesn't support multicloud scenarios. Organizations that use cloud services outside the AWS Cloud cannot use these templates for their ML infrastructure needs.

  • Manual credential management. You must manually provide your AWS credentials for each deployment. This solution doesn’t provide integration with corporate identity providers, AWS IAM Identity Center, or automated credential rotation.

  • Limited lifecycle management. The templates lack comprehensive resource lifecycle management features such as automated cleanup policies, cost optimization recommendations, and infrastructure drift detection. You must manually manage and monitor deployed resources after creation.

Architecture

The following diagram shows the solution architecture for a unified developer portal that standardizes and accelerates ML infrastructure deployment with SageMaker AI across environments.

Architecture for unified developer portal with Backstage, CNOE, GitHub Actions, and Seed-Farmer.

In this architecture:

  1. AWS application modernization blueprints provision the infrastructure setup with an Amazon EKS cluster as a base for the Cloud Native Operational Excellence (CNOE) framework. This comprehensive solution addresses complex cloud-native infrastructure management challenges by providing a scalable internal developer platform (IDP). The blueprints offer a structured approach to setting up a robust, flexible infrastructure that can adapt to your evolving organizational needs.

  2. The CNOE open source framework consolidates DevOps tools and solves ecosystem fragmentation through a unified platform engineering approach. By bringing together disparate tools and technologies, it simplifies the complex landscape of cloud-native development, so your teams can focus on innovation instead of toolchain management. The framework provides a standardized methodology for selecting, integrating, and managing development tools.

  3. With CNOE, Backstage is deployed as an out-of-the-box solution within the Amazon EKS cluster. Backstage is bundled with robust authentication through Keycloak and comprehensive deployment workflows through Argo CD. This integrated platform creates a centralized environment for managing development processes and provides a single place for teams to access, deploy, and monitor their infrastructure and applications across multiple environments.

  4. A GitHub repository contains preconfigured AIOps software templates that cover the entire SageMaker AI lifecycle. These templates address critical ML infrastructure needs, including SageMaker Studio provisioning, model training, inference pipelines, and model monitoring. These templates help you accelerate your ML initiatives and ensure consistency across different projects and teams.

  5. GitHub Actions implements an automated workflow that dynamically triggers resource provisioning through the Seed-Farmer utility. This approach integrates the Backstage catalog with the AIOps modules repository and creates a streamlined infrastructure deployment process. The automation reduces manual intervention, minimizes human error, and ensures rapid, consistent infrastructure creation across different environments.

  6. The AWS CDK helps you define and provision infrastructure as code, and ensures repeatable, secure, and compliant resource deployment across specified AWS accounts. This approach provides maximum governance with minimal manual intervention, so you can create standardized infrastructure templates that can be easily replicated, version-controlled, and audited.
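The automation in step 5 can be sketched as a dispatch-style workflow that installs and runs the Seed-Farmer CLI. The workflow name, inputs, and secret names below are illustrative assumptions; the actual workflow is defined in the pattern's repositories.

```yaml
# Hypothetical GitHub Actions workflow sketch: triggered with a manifest path,
# it installs the Seed-Farmer CLI and applies the AIOps module manifest.
name: deploy-sagemaker-module
on:
  workflow_dispatch:
    inputs:
      manifest:
        description: Path to the Seed-Farmer deployment manifest
        required: true
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install Seed-Farmer
        run: pip install seed-farmer
      - name: Deploy modules
        run: seedfarmer apply ${{ github.event.inputs.manifest }}
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          AWS_SESSION_TOKEN: ${{ secrets.AWS_SESSION_TOKEN }}
```

The `seedfarmer apply` command reads a deployment manifest and provisions the modules it lists through the AWS CDK.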

Tools

AWS services

Other tools

  • Backstage is an open source framework that helps you build internal developer portals.

  • GitHub Actions is a CI/CD platform that automates software development workflows, including tasks such as building, testing, and deploying code.

Code repositories

This pattern uses code and templates from the following GitHub repositories:

Implementation

This implementation uses a production-grade deployment pattern for Backstage from the Modern engineering on AWS repository. This approach significantly simplifies the setup process while incorporating AWS best practices for security and scalability.

The Epics section of this pattern outlines the implementation approach. For detailed, step-by-step deployment instructions, see the comprehensive deployment guide available in the AIOps internal developer platform (IDP) with Backstage repository. The implementation includes:

  • Initial Backstage platform deployment

  • Integration of SageMaker software templates with Backstage

  • Consuming and maintaining Backstage templates

The deployment guide also includes guidance for ongoing maintenance, troubleshooting, and platform scaling.

Best practices

Follow these best practices to help ensure security, governance, and operational excellence in your MLOps infrastructure implementations.

Template management

  • Never make breaking changes to live templates.

  • Always test updates thoroughly before production deployment.

  • Maintain clear and well-documented template versions.

Security

  • Pin GitHub Actions to specific commit SHA (secure hash algorithm) values to help prevent supply chain attacks.

  • Use least privilege IAM roles with granular permissions.

  • Store sensitive credentials in GitHub Secrets and AWS Secrets Manager.

  • Never hardcode credentials in templates.
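As an example of the pinning practice, a workflow step can reference an action by its full-length commit SHA instead of a mutable tag. The SHA below is a placeholder for illustration, not a real commit:

```yaml
steps:
  # Pinned to an immutable full-length commit SHA; the trailing comment
  # records the tag it corresponds to. (Placeholder SHA for illustration.)
  - uses: actions/checkout@0123456789abcdef0123456789abcdef01234567  # v4
  # Avoid mutable references such as: actions/checkout@v4
```

Tags can be moved to point at different commits, but a full commit SHA is immutable, so a pinned step always runs the code you reviewed.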

Governance and tracking

  • Implement comprehensive resource tagging standards.

  • Enable precise cost tracking and compliance monitoring.

  • Maintain clear audit trails for infrastructure changes.

This guide provides a strong foundation for implementing these best practices by using Backstage, SageMaker AI, and IaC modules.

Epics

Task | Description | Skills required

Deploy Backstage.

This step uses the blueprints in the Modern engineering on AWS repository to build a robust, scalable infrastructure that integrates multiple AWS services to create a centralized IDP for ML workflows. Follow the instructions in the Backstage deployment section of the deployment guide to clone the repository, install dependencies, bootstrap the AWS CDK, configure environment variables, and deploy the Backstage platform.

The infrastructure uses Amazon EKS as a container orchestration platform for deploying IDP components. The Amazon EKS architecture includes secure networking configurations to establish strict network isolation and control access patterns. The platform integrates with authentication mechanisms to help secure user access across services and environments.

Platform engineer

Set up your SageMaker AI templates.

This step uses the scripts in the GitHub AIOps internal developer platform (IDP) with Backstage repository. Follow the instructions in the SageMaker template setup section of the deployment guide to clone the repository, set up prerequisites, and run the setup script.

This process creates a repository that contains the SageMaker AI templates that are required for integration with Backstage.

Platform engineer

Integrate the SageMaker AI templates with Backstage.

Follow the instructions in the SageMaker templates integration section of the deployment guide to register your SageMaker AI templates.

This step integrates the AIOps modules (the SageMaker AI templates from the previous step) into your Backstage deployment so that data scientists can provision ML infrastructure through self-service.

Platform engineer

Use the SageMaker AI templates from Backstage.

Follow the instructions in the Using SageMaker templates section of the deployment guide to access the Backstage portal and create the ML environment in SageMaker Studio.

In the Backstage portal, you can select from available SageMaker AI templates, including options for SageMaker Studio environments, SageMaker notebooks, custom SageMaker project templates, and model deployment pipelines. After you provide configuration parameters, the platform creates dedicated repositories automatically and provisions AWS resources through GitHub Actions and Seed-Farmer. You can monitor progress through GitHub Actions logs and the Backstage component catalog.

Data scientist, Data engineer, Developer
Task | Description | Skills required

Update SageMaker AI templates.

To update a SageMaker AI template in Backstage, follow these steps.

  1. Modify the template content:

    1. Make the necessary changes in the template.yaml file or in the files in the skeleton/ directory.

    2. Test any new parameters, actions, or file structures locally or in a development environment.

  2. Test changes:

    1. Use the Backstage UI or CLI (@backstage/create-app) to scaffold a test component by using the updated template.

    2. Validate that all steps run successfully and that the generated code meets your expectations.

  3. Commit and push changes:

    1. Push the changes to the Git repository where the template is stored.

      If the template is registered in a specific branch (for example, main), the updates will be reflected automatically.

    2. If you’re using versioning (see next step), make sure that the correct version or tag is updated.

Platform engineer

Create and manage multiple versions of a template.

For breaking changes or upgrades, you might want to create multiple versions of a SageMaker AI template.

  1. Use Git tags or branches for each version; for example:

    git checkout -b v2.0.0
    git push origin v2.0.0
  2. (Optional but recommended) Register each version separately.

    In Backstage, you can register different versions of a template as separate entities in the catalog, where each entity points to a specific branch or tag. For example (for a .yaml file):

    metadata:
      name: node-service-template-v2
      description: Node.js service template - Version 2
    spec:
      type: template
      lifecycle: experimental
      version: '2.0.0'
  3. Communicate changes clearly by maintaining a CHANGELOG.md file in the template repository. In this file, document which features or changes were introduced in each version of the template.

  4. Deprecate older versions of the template, if necessary:

    1. Mark it as deprecated in the template description.

    2. Remove the version from the catalog if it’s no longer needed.

Platform engineer
Task | Description | Skills required

Expand template coverage beyond SageMaker AI.

The current solution implements only SageMaker AI-related AIOps templates. You can extend the ML environment by adding AIOps modules and integrating custom software templates for additional AWS services and applications. You can create these by using the template designer interface in Backstage, by implementing custom scaffolder actions, or by maintaining template repositories with standard metadata. The platform supports template versioning, cross-team sharing, and validation workflows for consistency. For more information, see the Backstage documentation.

You can also implement template inheritance patterns to create specialized versions of base templates. This extensibility enables you to manage diverse AWS resources and applications beyond SageMaker AI while preserving the simplified developer experience and maintaining your organization’s standards.
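A custom software template registers with the Backstage scaffolder through a template.yaml entity. The sketch below shows the general shape; the template name, owner, and repository URL are illustrative assumptions, while `fetch:template` and `publish:github` are standard built-in scaffolder actions.

```yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: example-aws-service-template   # hypothetical name
  description: Example template for an additional AWS service
spec:
  owner: platform-team                 # hypothetical owner
  type: service
  parameters:
    - title: Service details
      required: [componentName]
      properties:
        componentName:
          type: string
          description: Unique name for the new component
  steps:
    - id: fetch
      name: Fetch skeleton
      action: fetch:template
      input:
        url: ./skeleton
        values:
          componentName: ${{ parameters.componentName }}
    - id: publish
      name: Publish to GitHub
      action: publish:github
      input:
        repoUrl: github.com?owner=my-org&repo=${{ parameters.componentName }}
```

After you register this entity in the catalog, it appears in the scaffolder UI alongside the SageMaker AI templates.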

Platform engineer

Use dynamic parameter injection.

The current templates use default configurations without customization, and run the Seed-Farmer CLI to deploy resources with default variables. You can extend the default configuration by using dynamic parameter injection for module-specific configurations.
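For instance, a Seed-Farmer module manifest accepts per-module parameters, which a Backstage template could populate from user input instead of relying on fixed defaults. The module name, path, and parameter names below are illustrative assumptions, not values from the AIOps repository:

```yaml
# Hypothetical Seed-Farmer module manifest entry with injected parameters.
name: sagemaker-studio                  # illustrative module name
path: modules/sagemaker/sagemaker-studio
parameters:
  - name: instance-type                 # hypothetical parameter
    value: ml.t3.medium
  - name: vpc-id
    valueFrom:
      moduleMetadata:                   # reference output of another module
        group: networking
        name: basic-networking
        key: VpcId
```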

Platform engineer

Enhance security and compliance.

To enhance security in the creation of AWS resources, you can enable role-based access control (RBAC) integration with single sign-on (SSO), SAML, OpenID Connect (OIDC), and policy as code enforcement.

Platform engineer

Add automated resource cleanup.

You can enable automated cleanup policies and add infrastructure drift detection and remediation.

Platform engineer
Task | Description | Skills required

Remove the Backstage infrastructure and SageMaker AI resources.

When you’ve finished using your ML environment, follow the instructions in the Cleanup and resource management section of the deployment guide to remove the Backstage infrastructure and to delete the SageMaker AI resources in your ML environment.

Platform engineer

Troubleshooting

Issue | Solution

AWS CDK bootstrap failures

Verify AWS credentials and Region configuration.

Amazon EKS cluster access issues

Check kubectl configuration and IAM permissions.

Application Load Balancer connectivity issues

Make sure that security groups allow inbound traffic on ports 80 and 443.

GitHub integration issues

Verify GitHub token permissions and organization access.

SageMaker AI deployment failures

Check AWS service quotas and IAM permissions.

Related resources

Additional information

Business challenges

Organizations that embark on or scale their MLOps initiatives frequently encounter these business and technical challenges:

  • Inconsistent environments. The lack of standardized development and deployment environments makes collaboration difficult and increases deployment risks.

  • Manual provisioning overhead. Manually setting up an ML infrastructure with SageMaker Studio, Amazon Simple Storage Service (Amazon S3) buckets, IAM roles, and CI/CD pipelines is time-consuming and error-prone, and diverts data scientists from their core task of model development.

  • Lack of discoverability and reuse. The lack of a centralized catalog makes it difficult to find existing ML models, datasets, and pipelines. This leads to redundant work and missed opportunities for reuse.

  • Complex governance and compliance. Ensuring that ML projects adhere to organizational security policies, data privacy regulations, and compliance standards such as Health Insurance Portability and Accountability Act (HIPAA) and General Data Protection Regulation (GDPR) can be challenging without automated guardrails.

  • Slow time to value. The cumulative effect of these challenges results in protracted ML project lifecycles and delays the realization of business value from ML investments.

  • Security risks. Inconsistent configurations and manual processes can introduce security vulnerabilities that make it difficult to enforce least privilege and network isolation.

These issues prolong development cycles, increase operational overhead, and introduce security risks. The iterative nature of ML requires repeatable workflows and efficient collaboration.

Gartner predicts that by 2026, 80% of software engineering organizations will have platform teams. (See Platform Engineering Empowers Developers to be Better, Faster, Happier on the Gartner website.) This prediction highlights how an IDP can accelerate software delivery. As an IDP, Backstage helps restore order to complex infrastructure so that teams can deliver high-quality code rapidly and safely. Integrating Backstage with hardened AIOps modules helps you shift from reactive troubleshooting to proactive prevention.

MLOps SageMaker modules

The AIOps modules in the GitHub repository used for this pattern provide a valuable foundation for standardizing MLOps on AWS through reusable and hardened IaC. These modules encapsulate best practices for provisioning SageMaker projects, pipelines, and associated networking and storage resources, with the goal of reducing complexity and accelerating the setup of ML environments. You can use these templates for various MLOps use cases to establish consistent and secure deployment patterns that foster a more governed and efficient approach to ML workflows.

Using the AIOps modules directly often requires platform teams to deploy and manage these IaC templates, which can present challenges for data scientists who want self-service access. Discovering and understanding the available templates, configuring the necessary parameters, and triggering their deployment might require navigating AWS service consoles or directly interacting with IaC tools. This can create friction, increase cognitive load for data scientists who prefer to focus on ML tasks, and potentially lead to inconsistent parameterization or deviations from organizational standards if these templates aren’t managed through a centralized and user-friendly interface. Integrating these powerful AIOps modules with an IDP such as Backstage helps address these challenges by providing a streamlined, self-service experience, enhanced discoverability, and stronger governance controls for using these standardized MLOps building blocks.

Backstage as IDP

An internal developer platform (IDP) is a self-service layer built by platform teams to simplify and standardize how developers build, deploy, and manage applications. It abstracts infrastructure complexity and provides developers with easy access to tools, environments, and services through a unified interface.

The primary goal of an IDP is to enhance developer experience and productivity by:

  • Enabling self-service for tasks such as service creation and deployment.

  • Promoting consistency and compliance through standard templates.

  • Integrating tools across the development lifecycle (CI/CD, monitoring, and documentation).

Backstage is an open source developer portal that was created by Spotify and is now part of the Cloud Native Computing Foundation (CNCF). It helps organizations build their own IDP by providing a centralized, extensible platform to manage software components, tools, and documentation. With Backstage, developers can:

  • Discover and manage all internal services through a software catalog.

  • Create new projects by using predefined templates through the scaffolder plugin.

  • Access integrated tooling such as CI/CD pipelines, Kubernetes dashboards, and monitoring systems from one location.

  • Maintain consistent, markdown-based documentation through TechDocs.

FAQ

What's the difference between using this Backstage template versus deploying SageMaker Studio manually through the SageMaker console?

The Backstage template provides several advantages over manual AWS console deployment, including standardized configurations that follow organizational best practices, automated IaC deployment using Seed-Farmer and the AWS CDK, built-in security policies and compliance measures, and integration with your organization's developer workflows through GitHub. The template also creates reproducible deployments with version control, which make it easier to replicate environments across different stages (development, staging, production) and maintain consistency across teams. Additionally, the template includes automated cleanup capabilities and integrates with your organization's identity management system through Backstage. Manual deployment through the console requires deep AWS expertise and doesn’t provide version control or the same level of standardization and governance that the template offers. For these reasons, console deployments are more suitable for one-off experiments than production ML environments.

What is Seed-Farmer and why does this solution use it?

Seed-Farmer is an AWS deployment orchestration tool that manages infrastructure modules by using the AWS CDK. This pattern uses Seed-Farmer because it provides standardized, reusable infrastructure components that are specifically designed for AI/ML workloads, handles complex dependencies between AWS services automatically, and ensures consistent deployments across different environments.

Do I need to install the AWS CLI to use these templates?

No, you don't have to install the AWS CLI on your computer. The templates run entirely through GitHub Actions in the cloud. You provide your AWS credentials (access key, secret key, and session token) through the Backstage interface, and the deployment happens automatically in the GitHub Actions environment.

How long does it take to deploy a SageMaker Studio environment?

A typical SageMaker Studio deployment takes 15-25 minutes to complete. This includes AWS CDK bootstrapping (2-3 minutes), Seed-Farmer toolchain setup (3-5 minutes), and resource creation (10-15 minutes). The exact time depends on your AWS Region and the complexity of your networking setup.

Can I deploy multiple SageMaker environments in the same AWS account?

Yes, you can. Each deployment creates resources with unique names based on the component name you provide in the template. However, be aware of AWS service quotas: Each account can have a limited number of SageMaker domains per Region, so check your quotas before you create multiple environments.