

# Unified storage in Amazon SageMaker Unified Studio
<a name="storage"></a>

Amazon SageMaker Unified Studio provides flexible file storage options to support your analytics, AI and ML workflows.

Amazon SageMaker Unified Studio brings together the functionality and tools from existing AWS Analytics and AI/ML services into a single data and AI development environment. As you work with tools like JupyterLab, SQL Editor, Visual ETL Builder, or capabilities from Amazon Bedrock inside Amazon SageMaker Unified Studio, you'll create and manage files that represent your work.

## S3 storage
<a name="s3-storage"></a>

Amazon Simple Storage Service (S3) storage is the default option for storage of project files in Amazon SageMaker Unified Studio.

With S3 storage, you can share files by dragging them between local and shared folders. The file explorer provides a consistent interface across all tools and displays both local and shared directories in a single view. You can create, edit, delete, upload, and download files directly through the interface, with optional auto-save to prevent data loss.

S3 storage provides basic file versioning capabilities when enabled by your administrator. This option is available in all AWS regions where Amazon SageMaker Unified Studio is supported, making it ideal for teams working across different geographic locations.
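If versioning is enabled and you have permission to call Amazon S3 directly, you can inspect a file's versions programmatically. The following is a minimal sketch, not the way Amazon SageMaker Unified Studio itself surfaces version history. It assumes a client that behaves like `boto3.client("s3")`; the function name `list_file_versions` is hypothetical, and it reads only the first page of results.

```python
from typing import Any


def list_file_versions(s3_client: Any, bucket: str, key: str) -> list[dict]:
    """Return the versions of one object, newest first.

    `s3_client` is assumed to behave like boto3.client("s3").
    Only the first page of results is read (no pagination).
    """
    response = s3_client.list_object_versions(Bucket=bucket, Prefix=key)
    # Prefix matching can also return other keys that start with `key`,
    # so keep only exact matches.
    versions = [v for v in response.get("Versions", []) if v["Key"] == key]
    return sorted(versions, key=lambda v: v["LastModified"], reverse=True)
```

With boto3 installed and credentials configured, a call might look like `list_file_versions(boto3.client("s3"), "amzn-s3-demo-bucket", "shared/analysis.sql")`. The bucket name and key here are placeholders, not values defined by this guide.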

For more information on configuring S3 storage, see [Configuring project storage options](https://docs.aws.amazon.com/sagemaker-unified-studio/latest/adminguide/configuring-project-storage.html).

### Key benefits of S3 storage
<a name="s3-storage-benefits"></a>
+ Simple file management
+ Easy file sharing with drag-and-drop between folders
+ Availability in all AWS Regions where Amazon SageMaker Unified Studio is supported

## Git-based storage
<a name="git-based-storage"></a>

For projects requiring advanced version control, Amazon SageMaker Unified Studio allows you to connect your project to a Git repository where all project members can access, store, and collaborate on files. This option provides full version control capabilities including comprehensive commit history, branching, and merging.

When you choose Git-based storage, you'll need to specify a repository and branch during project creation. Once the project is created, you'll be able to see the files that were created during repository bootstrapping directly from the project's home page.

**Important**  
When you connect a project to a third-party Git repository, all users who can sign in to any domain in the account have read and write access to all repositories on that connection. This access is not limited to the project or domain where the connection was created. To enforce isolation between repositories, use separate AWS accounts. Do not store sensitive information in connected repositories unless all users in the account are authorized to access it.

With Git-based storage, you'll have access to full Git semantics regardless of whether you're using space-based tools like JupyterLab or web-based tools like SQL Query Editor. This provides a consistent experience for team members accustomed to working with Git.

### Key benefits of Git-based storage
<a name="git-storage-benefits"></a>
+ Full version control with commit history, branching, and merging
+ Collaboration features like pull requests and code reviews
+ Cross-project sharing by allowing multiple projects to use the same repository
+ Integration with existing development workflows

## How storage works in different tools
<a name="storage-in-tools"></a>

Amazon SageMaker Unified Studio provides a consistent storage experience across different tools while optimizing for each tool's specific requirements.

### Web-based tools
<a name="web-based-tools"></a>

When using web-based tools such as Query Editor and Visual ETL, you'll interact with files through a unified File Explorer interface. This explorer displays your shared directory and allows you to navigate and manage shared files seamlessly.

You can perform various file operations directly from the File Explorer:
+ Create, edit, and delete files and folders
+ Upload and download files to/from shared storage
+ Access version history (when available)
+ Edit files directly within the source

All web-based tools offer optional auto-save functionality, which can be enabled to automatically save your changes as you work. This feature helps prevent data loss if you navigate away from the page or experience connectivity issues.

### Space-based tools
<a name="space-based-tools"></a>

Space-based tools like JupyterLab and Code Editor provide access to two types of storage spaces to support both individual work and team collaboration.

#### Local storage (local folder)
<a name="local-storage"></a>

Local storage uses dedicated Amazon EBS storage, which performs well for frequent file operations within your workspace. It serves as your personal workspace, and the files in it are private to your space.

Within your local storage, you can create and manage subfolders to organize your files effectively. This helps you maintain a structured workspace for different aspects of your work.

When you save files to your local storage, they follow a "last write wins" principle: new changes overwrite previous versions, and no version history is kept.

Your local folder:
+ Includes this root folder and any subfolders (except shared)
+ Serves as your private workspace within each project
+ Allows you to work on files privately
+ Is ideal for frequent file access and modification
+ Is visible only in this space
+ Remains isolated from other project members, creating a secure environment for experimentation and development

#### Shared storage (shared folder)
<a name="shared-storage"></a>

Shared storage is implemented in Amazon S3 or a Git repository and is accessible from all Amazon SageMaker Unified Studio tools. Project members can create and manage subfolders within the shared storage to help organize artifacts effectively.

By default, all project members have read, write, update, and delete access to files within the shared storage. This central repository allows team members to access common resources, share completed work, and maintain project artifacts in a single location.

Shared storage operates on a "last write wins" principle, so coordinate with team members when you work on the same files to avoid overwriting each other's changes.
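One way to reduce accidental overwrites from your own scripts is to check that a file has not changed since you read it. The following is a minimal sketch of that idea, not a feature of Amazon SageMaker Unified Studio. The names `file_digest`, `save_if_unchanged`, and `StaleFileError` are hypothetical.

```python
import hashlib
from pathlib import Path


class StaleFileError(RuntimeError):
    """Raised when the file changed after it was last read."""


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of the file's current contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def save_if_unchanged(path: Path, new_text: str, digest_when_opened: str) -> None:
    """Write new_text only if the file still matches the digest taken on open.

    This narrows, but does not eliminate, the window for lost updates:
    another writer can still save between the check and the write.
    """
    if file_digest(path) != digest_when_opened:
        raise StaleFileError(f"{path} changed since it was opened; reload before saving")
    path.write_text(new_text)
```

Record `file_digest(path)` when you open a file and pass it to `save_if_unchanged` when you save. Because the check and the write are not atomic, this complements coordination with your team rather than replacing it.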

The shared folder (Git and non-Git):
+ Contains files visible to all project members
+ Functions as a collaborative workspace accessible to all project members
+ Is accessible across all your tools
+ Updates immediately when any member adds or modifies files
+ Operates on a "last write wins" mechanism, so team members should coordinate when working on the same files
+ Is not well suited for heavy file read/write workloads, because the folder is backed by remote Amazon S3 storage and frequent Amazon S3 access can incur additional costs
+ Can lose changes if two people modify the same file at the same time

You can copy files between these locations as needed, allowing you to optimize your workflow based on performance requirements and collaboration needs. For example, copy files from shared storage to local storage for ML tasks requiring low latency.
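From a notebook or script in a space-based tool, such a copy could look like the following sketch. It assumes that the shared folder is reachable as an ordinary directory from your space; the helper name `copy_to_local` and the example paths are hypothetical.

```python
import shutil
from pathlib import Path


def copy_to_local(shared_file: Path, local_dir: Path) -> Path:
    """Copy one file from the shared folder into a local working folder."""
    local_dir.mkdir(parents=True, exist_ok=True)
    target = local_dir / shared_file.name
    shutil.copy2(shared_file, target)  # copy2 also preserves file timestamps
    return target
```

For example, `copy_to_local(Path("shared/data/train.csv"), Path("data"))` would place a private copy in a local `data` folder. The copy does not stay in sync with the shared original, so copy results back when they are ready to share.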

# Managing storage resources
<a name="managing-storage"></a>

## Storage organization
<a name="storage-organization"></a>

Creating logical subfolders helps organize your work, grouping related files together for easier navigation. Establishing and following consistent naming conventions for files and folders helps team members understand the purpose and content of resources.

Regularly moving completed work from local to shared storage ensures team access to important files. Removing unnecessary files periodically helps conserve storage space and maintain a clean, efficient working environment.
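To find candidates for cleanup, you could list files that have not been modified recently. The following sketch only reports files and never deletes anything. The function name `find_stale_files` and the age threshold are hypothetical choices, and modification times in shared storage may not always be reliable.

```python
import time
from pathlib import Path
from typing import Optional


def find_stale_files(root: Path, max_age_days: float, now: Optional[float] = None) -> list[Path]:
    """List files under root that were not modified in the last max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    return sorted(
        p for p in root.rglob("*") if p.is_file() and p.stat().st_mtime < cutoff
    )
```

Review the returned list before you remove anything, because an old modification time does not by itself mean a file is no longer needed.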

## Working with files across multiple tools
<a name="working-across-tools"></a>

Within Amazon SageMaker Unified Studio, a single file may be accessible via multiple tools. If you have the same file open simultaneously in different tools, changes made in one tool may overwrite changes made in another if not properly saved. You'll need to explicitly reopen files to see changes made through different tools.

When executing files from local storage, any resulting output files or artifacts automatically default to the same local storage location. For example, running a notebook that generates additional files will store these in local storage unless otherwise specified.

## Transitioning between storage types
<a name="transitioning-storage"></a>

Currently, you cannot convert a project from S3 storage to Git-based storage or vice versa. If you need to change storage types, you must create a new project with the desired storage configuration, copy your files to the new project, and update any references or dependencies.
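The copy step might be scripted as in the following sketch. It assumes that both the old project's files and the new project's folder are reachable as directories from the same machine, for example after you download the old files. The function name `copy_project_files` is hypothetical, and skipping `.git` and `.ipynb_checkpoints` is a design choice for this example, not a requirement.

```python
import shutil
from pathlib import Path


def copy_project_files(old_root: Path, new_root: Path) -> None:
    """Copy files from the old project's folder into the new project's folder."""
    shutil.copytree(
        old_root,
        new_root,
        # Skip Git metadata and notebook checkpoints (an assumption; adjust as needed).
        ignore=shutil.ignore_patterns(".git", ".ipynb_checkpoints"),
        dirs_exist_ok=True,  # the new project's folder may already exist
    )
```

Files that already exist in the destination are overwritten. Updating references and dependencies that point to the old project remains a manual step.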

This limitation is important to consider when initially setting up your project, as the storage decision has long-term implications for your workflow.

# Limitations
<a name="limitations"></a>
+ In JupyterLab and Code Editor, files larger than 15 MB cannot be uploaded directly to the shared folder. To upload a large file, first upload it to any other folder (such as your local storage), then copy or move it to the shared folder.
+ When you upload files with the `putObject` API to folder paths that do not yet exist in shared storage, the folders that are created implicitly may display an incorrect timestamp of January 1, 1970 (the Unix epoch) in JupyterLab's file browser. In Code Editor, the file metadata shows the same January 1, 1970 timestamp.
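The large-file workaround in the first limitation can be scripted once the file is in your local storage. The following sketch assumes that the shared folder is reachable as an ordinary directory and that 1 MB means 1024 * 1024 bytes. The names `needs_staging` and `move_to_shared` are hypothetical.

```python
import shutil
from pathlib import Path

# Assumption: the 15 MB limit is measured in units of 1024 * 1024 bytes.
DIRECT_UPLOAD_LIMIT_BYTES = 15 * 1024 * 1024


def needs_staging(path: Path) -> bool:
    """Return True if the file is too large to upload directly to the shared folder."""
    return path.stat().st_size > DIRECT_UPLOAD_LIMIT_BYTES


def move_to_shared(staged_file: Path, shared_dir: Path) -> Path:
    """Move a file that was already uploaded to local storage into the shared folder."""
    shared_dir.mkdir(parents=True, exist_ok=True)
    target = shared_dir / staged_file.name
    shutil.move(str(staged_file), str(target))
    return target
```

Check `needs_staging` before you upload. If it returns `True`, upload the file to local storage first, then call `move_to_shared` with the staged file and the destination inside the shared folder.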