

# Customization of a SageMaker notebook instance using an LCC script
<a name="notebook-lifecycle-config"></a>

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

A *lifecycle configuration* (LCC) provides shell scripts that run only when you create the notebook instance or whenever you start one. When you create a notebook instance, you can create a new LCC or attach an LCC that you already have. Lifecycle configuration scripts are useful for the following use cases:
+ Installing packages or sample notebooks on a notebook instance
+ Configuring networking and security for a notebook instance
+ Using a shell script to customize a notebook instance

You can also use a lifecycle configuration script to access AWS services from your notebook. For example, you can create a script that lets you use your notebook to control other AWS resources, such as an Amazon EMR instance.

We maintain a public repository of notebook lifecycle configuration scripts that address common use cases for customizing notebook instances at [https://github.com/aws-samples/amazon-sagemaker-notebook-instance-lifecycle-config-samples](https://github.com/aws-samples/amazon-sagemaker-notebook-instance-lifecycle-config-samples).

**Note**  
Each script has a limit of 16384 characters.  
The value of the `$PATH` environment variable that is available to both scripts is `/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/sbin:/sbin:/bin`. The working directory, which is the value of the `$PWD` environment variable, is `/`.  
View CloudWatch Logs for notebook instance lifecycle configurations in log group `/aws/sagemaker/NotebookInstances` in log stream `[notebook-instance-name]/[LifecycleConfigHook]`.  
Scripts cannot run for longer than 5 minutes. If a script runs for longer than 5 minutes, it fails and the notebook instance is not created or started. To help decrease the run time of scripts, try the following:
+ Cut down on the number of steps. For example, limit the conda environments in which you install large packages.
+ Run tasks in parallel processes.
+ Use the `nohup` command in your script.
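As a sketch of the `nohup` tip, a lifecycle script can launch a long-running step in the background and return immediately. In this example, `sleep 2` stands in for a slow install command, and the log path is illustrative:

```shell
#!/bin/bash
# Start the long-running step in the background with nohup so the lifecycle
# script itself finishes well under the 5-minute limit. "sleep 2" stands in
# for a slow install command, and the log path is illustrative.
nohup sh -c 'sleep 2 && echo done > /tmp/lcc-install.log' >/dev/null 2>&1 &
echo "Background install started; the lifecycle script can exit now."
```

Because the background process is detached from the lifecycle hook, check its log file afterward to confirm that the install completed.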

You can see a list of notebook instance lifecycle configurations you previously created by choosing **Lifecycle configuration** in the SageMaker AI console. You can attach a notebook instance LCC when you create a new notebook instance. For more information about creating a notebook instance, see [Create an Amazon SageMaker notebook instance](howitworks-create-ws.md).
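If you prefer working programmatically, the same list is available through the AWS CLI (this sketch assumes the CLI is configured with credentials that allow `sagemaker:ListNotebookInstanceLifecycleConfigs`):

```shell
# List existing notebook instance lifecycle configurations, newest first.
aws sagemaker list-notebook-instance-lifecycle-configs \
    --sort-by CreationTime --sort-order Descending
```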

# Create a lifecycle configuration script
<a name="notebook-lifecycle-config-create"></a>

The following procedure shows how to create a lifecycle configuration script for use with an Amazon SageMaker notebook instance. For more information about creating a notebook instance, see [Create an Amazon SageMaker notebook instance](howitworks-create-ws.md).

**To create a lifecycle configuration**

1. Open the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Lifecycle configurations**. 

1. From the **Lifecycle configurations** page, choose the **Notebook Instance** tab.

1. Choose **Create configuration**.

1. For **Name**, type a name using alphanumeric characters and "-", but no spaces. The name can have a maximum of 63 characters.

1. (Optional) To create a script that runs when you create the notebook and every time you start it, choose **Start notebook**.

1. In the **Start notebook** editor, type the script.

1. (Optional) To create a script that runs only once, when you create the notebook, choose **Create notebook**.

1. In the **Create notebook** editor, type the script.

1. Choose **Create configuration**.
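The console steps above can also be sketched with the AWS CLI. The configuration name and script file names here are illustrative, and `base64 -w0` is the GNU coreutils form (on macOS, use `base64 -b 0`):

```shell
# Create a lifecycle configuration from local script files (names are
# illustrative). The API expects the script content base64-encoded.
aws sagemaker create-notebook-instance-lifecycle-config \
    --notebook-instance-lifecycle-config-name my-lcc \
    --on-create Content="$(base64 -w0 on-create.sh)" \
    --on-start Content="$(base64 -w0 on-start.sh)"
```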

## Lifecycle Configuration Best Practices
<a name="nbi-lifecycle-config-bp"></a>

The following are best practices for using lifecycle configurations:

**Important**  
We do not recommend storing sensitive information in your lifecycle configuration script.

**Important**  
Lifecycle configuration scripts run with root access and the notebook instance's IAM execution role privileges, regardless of the root access setting for notebook users. Principals with permissions to create or modify lifecycle configurations and update notebook instances can execute code with the execution role's credentials. See [Control root access to a SageMaker notebook instance](nbi-root-access.md) for more information.
+ Lifecycle configurations run as the `root` user. If your script makes any changes within the `/home/ec2-user/SageMaker` directory (for example, installing a package with `pip`), use the command `sudo -u ec2-user` to run as the `ec2-user` user. This is the same user that Amazon SageMaker AI runs as.
+ SageMaker AI notebook instances use `conda` environments to implement different kernels for Jupyter notebooks. If you want to install packages that are available to one or more notebook kernels, enclose the commands to install the packages with `conda` environment commands that activate the conda environment that contains the kernel where you want to install the packages.

  For example, if you want to install a package only for the `python3` environment, use the following code:

  ```
  #!/bin/bash
  sudo -u ec2-user -i <<EOF
  
  # This will affect only the Jupyter kernel called "conda_python3".
  source activate python3
  
  # Replace myPackage with the name of the package you want to install.
  pip install myPackage
  # You can also perform "conda install" here as well.
  
  source deactivate
  
  EOF
  ```

  If you want to install a package in all conda environments in the notebook instance, use the following code:

  ```
  #!/bin/bash
  sudo -u ec2-user -i <<EOF
  
  # Note that "base" is a special environment name; include it here as well.
  for env in base /home/ec2-user/anaconda3/envs/*; do
      env_name=$(basename "$env")

      # Installing packages in the Jupyter system environment can affect the stability of your
      # SageMaker notebook instance. You can remove this check if you'd like to install Jupyter extensions, etc.
      if [ "$env_name" = 'JupyterSystemEnv' ]; then
        continue
      fi

      source /home/ec2-user/anaconda3/bin/activate "$env_name"
  
      # Replace myPackage with the name of the package you want to install.
      pip install --upgrade --quiet myPackage
      # You can also perform "conda install" here as well.
  
      source /home/ec2-user/anaconda3/bin/deactivate
  done
  
  EOF
  ```
+ You must store all conda environments in the default environments folder (`/home/ec2-user/anaconda3/envs`).

**Important**  
When you create or change a script, we recommend that you use a text editor that provides Unix-style line breaks, such as the text editor available in the console when you create a notebook. Copying text from a non-Linux operating system might introduce incompatible line breaks and result in an unexpected error.

# External library and kernel installation
<a name="nbi-add-external"></a>

**Important**  
Currently, all packages in notebook instance environments are licensed for use with Amazon SageMaker AI and do not require additional commercial licenses. However, this might be subject to change in the future, and we recommend reviewing the licensing terms regularly for any updates.

Amazon SageMaker notebook instances come with multiple environments already installed. These environments contain Jupyter kernels and Python packages including scikit-learn, pandas, NumPy, TensorFlow, and MXNet. These environments, along with all files in the `sample-notebooks` folder, are refreshed when you stop and start a notebook instance. You can also install your own environments that contain your choice of packages and kernels.

The different Jupyter kernels in Amazon SageMaker notebook instances are separate conda environments. For information about conda environments, see [Managing environments](https://conda.io/docs/user-guide/tasks/manage-environments.html) in the *Conda* documentation.

Install custom environments and kernels on the notebook instance's Amazon EBS volume. This ensures that they persist when you stop and restart the notebook instance, and that any external libraries you install are not updated by SageMaker AI. To do that, use a lifecycle configuration that includes both a script that runs when you create the notebook instance (`on-create`) and a script that runs each time you restart the notebook instance (`on-start`). For more information about using notebook instance lifecycle configurations, see [Customization of a SageMaker notebook instance using an LCC script](notebook-lifecycle-config.md). There is a GitHub repository that contains sample lifecycle configuration scripts at [SageMaker AI Notebook Instance Lifecycle Config Samples](https://github.com/aws-samples/amazon-sagemaker-notebook-instance-lifecycle-config-samples).

The examples at [https://github.com/aws-samples/amazon-sagemaker-notebook-instance-lifecycle-config-samples/blob/master/scripts/persistent-conda-ebs/on-create.sh](https://github.com/aws-samples/amazon-sagemaker-notebook-instance-lifecycle-config-samples/blob/master/scripts/persistent-conda-ebs/on-create.sh) and [https://github.com/aws-samples/amazon-sagemaker-notebook-instance-lifecycle-config-samples/blob/master/scripts/persistent-conda-ebs/on-start.sh](https://github.com/aws-samples/amazon-sagemaker-notebook-instance-lifecycle-config-samples/blob/master/scripts/persistent-conda-ebs/on-start.sh) show the best practice for installing environments and kernels on a notebook instance. The `on-create` script installs the `ipykernel` library to create custom environments as Jupyter kernels, then uses `pip install` and `conda install` to install libraries. You can adapt the script to create custom environments and install libraries that you want. SageMaker AI does not update these libraries when you stop and restart the notebook instance, so you can ensure that your custom environment has specific versions of libraries that you want. The `on-start` script installs any custom environments that you create as Jupyter kernels, so that they appear in the dropdown list in the Jupyter **New** menu.
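A minimal sketch of the `on-create` half of that pattern follows; the linked samples are the authoritative versions. The environment name `custom_env` and the Python version are illustrative, and the sketch assumes conda is on the path:

```shell
#!/bin/bash
# on-create (sketch): build a custom environment once and register it as a
# Jupyter kernel. "custom_env" and the Python version are illustrative.
sudo -u ec2-user -i <<'EOF'
conda create --yes --name custom_env python=3.10 ipykernel
source activate custom_env
# Register the environment so it appears in the Jupyter New menu.
python -m ipykernel install --user --name custom_env --display-name "conda_custom_env"
source deactivate
EOF
```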

## Package installation tools
<a name="nbi-add-external-tools"></a>

SageMaker notebooks support the following package installation tools:
+ conda install
+ pip install

You can install packages using the following methods:
+ Lifecycle configuration scripts.

  For example scripts, see [SageMaker AI Notebook Instance Lifecycle Config Samples](https://github.com/aws-samples/amazon-sagemaker-notebook-instance-lifecycle-config-samples). For more information on lifecycle configuration, see [Customize a Notebook Instance Using a Lifecycle Configuration Script](https://docs.aws.amazon.com/sagemaker/latest/dg/notebook-lifecycle-config.html).
+ Notebooks – The following commands are supported.
  + `%conda install`
  + `%pip install`
+ The Jupyter terminal – You can install packages using pip and conda directly.

From within a notebook, you can use the system command syntax (lines starting with `!`) to install packages, for example, `!pip install` and `!conda install`. More recently, new commands have been added to IPython: `%pip` and `%conda`. These commands are the recommended way to install packages from a notebook, as they correctly account for the active environment or interpreter. For more information, see [Add %pip and %conda magic functions](https://github.com/ipython/ipython/pull/11524).

### Conda
<a name="nbi-add-external-tools-conda"></a>

Conda is an open source package and environment management system that can install packages and their dependencies. SageMaker AI supports using Conda with either of the two main channels: the default channel and the conda-forge channel. For more information, see [Conda channels](https://docs.conda.io/projects/conda/en/latest/user-guide/concepts/channels.html). The conda-forge channel is a community channel where contributors can upload packages.

**Note**  
Due to how Conda resolves the dependency graph, installing packages from conda-forge can take significantly longer (in the worst cases, upwards of 10 minutes).

The Deep Learning AMI comes with many conda environments and many packages preinstalled. Due to the number of packages preinstalled, finding a set of packages that are guaranteed to be compatible is difficult. You may see a warning "The environment is inconsistent, please check the package plan carefully". Despite this warning, SageMaker AI ensures that all the SageMaker AI provided environments are correct. SageMaker AI cannot guarantee that any user installed packages will function correctly.

**Note**  
Users of SageMaker AI, AWS Deep Learning AMIs and Amazon EMR can access the commercial Anaconda repository without taking a commercial license through February 1, 2024 when using Anaconda in those services. For any usage of the commercial Anaconda repository after February 1, 2024, customers are responsible for determining their own Anaconda license requirements.

Conda has two methods for activating environments: conda activate/deactivate, and source activate/deactivate. For more information, see [Should I use 'conda activate' or 'source activate' in Linux](https://stackoverflow.com/questions/49600611/python-anaconda-should-i-use-conda-activate-or-source-activate-in-linux).

SageMaker AI supports moving Conda environments onto the Amazon EBS volume, which is persisted when the instance is stopped. The environments aren't persisted when the environments are installed to the root volume, which is the default behavior. For an example lifecycle script, see [persistent-conda-ebs](https://github.com/aws-samples/amazon-sagemaker-notebook-instance-lifecycle-config-samples/tree/master/scripts/persistent-conda-ebs).

**Supported conda operations (see note at the bottom of this topic)**
+ conda install of a package in a single environment
+ conda install of a package in all environments
+ conda install of an R package in the R environment
+ Installing a package from the main conda repository
+ Installing a package from conda-forge
+ Changing the Conda install location to use EBS
+ Supporting both conda activate and source activate

### Pip
<a name="nbi-add-external-tools-pip"></a>

Pip is the de facto tool for installing and managing Python packages. Pip searches for packages on the Python Package Index (PyPI) by default. Unlike Conda, pip doesn't have built in environment support, and is not as thorough as Conda when it comes to packages with native/system library dependencies. Pip can be used to install packages in Conda environments.

You can use alternative package repositories with pip instead of the PyPI. For an example lifecycle script, see [on-start.sh](https://github.com/aws-samples/amazon-sagemaker-notebook-instance-lifecycle-config-samples/blob/master/scripts/add-pypi-repository/on-start.sh).
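One way to do this is to write a `pip.conf` that points at your repository. In this sketch the index URL is hypothetical, and the file is written to a scratch directory; on a notebook instance it would typically live at `/home/ec2-user/.pip/pip.conf`:

```shell
#!/bin/bash
# Sketch: point pip at an alternative package index via pip.conf.
# The index URL is hypothetical; replace it with your repository.
PIP_CONF_DIR=/tmp/pip-demo
mkdir -p "$PIP_CONF_DIR"
cat > "$PIP_CONF_DIR/pip.conf" <<'EOF'
[global]
index-url = https://pypi.example.com/simple
EOF
cat "$PIP_CONF_DIR/pip.conf"
```

You can also override the index for a single command with `pip install --index-url <url> <package>` instead of writing a configuration file.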

**Supported pip operations (see note at the bottom of this topic)**
+ Using pip to install a package without an active conda environment (install packages system wide)
+ Using pip to install a package in a conda environment
+ Using pip to install a package in all conda environments
+ Changing the pip install location to use EBS
+ Using an alternative repository to install packages with pip

### Unsupported
<a name="nbi-add-external-tools-misc"></a>

SageMaker AI aims to support as many package installation operations as possible. However, if the packages were installed by SageMaker AI or DLAMI, and you use the following operations on these packages, it might make your notebook instance unstable:
+ Uninstalling
+ Downgrading
+ Upgrading

We do not provide support for installing packages via yum install or installing R packages from CRAN.

Due to potential issues with network conditions or configurations, or the availability of Conda or PyPI, we cannot guarantee that packages will install in a fixed or deterministic amount of time.

**Note**  
We cannot guarantee that a package installation will be successful. Attempting to install a package in an environment with incompatible dependencies can result in a failure. In such a case, you should contact the library maintainer to see if it is possible to update the package dependencies. Alternatively, you can attempt to modify the environment in such a way as to allow the installation. However, this modification will likely mean removing or updating existing packages, which means we can no longer guarantee the stability of the environment.

# Notebook Instance Software Updates
<a name="nbi-software-updates"></a>

Amazon SageMaker AI periodically tests and releases software that is installed on notebook instances. This includes:
+ Kernel updates
+ Security patches
+ AWS SDK updates
+ [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) updates
+ Open source software updates

To ensure that you have the most recent software updates, stop and restart your notebook instance, either in the SageMaker AI console or by calling [`StopNotebookInstance`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_StopNotebookInstance.html).

You can also manually update software installed on your notebook instance while it is running by using update commands in a terminal or in a notebook.
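For example, from the Jupyter terminal you might run update commands like the following. The package names are illustrative; run them as `ec2-user` with the target conda environment active:

```shell
# Update individual packages in the currently active environment (sketch).
pip install --upgrade sagemaker
# Or, with conda:
conda update --yes numpy
```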

**Note**  
Updating kernels and some packages might depend on whether root access is enabled for the notebook instance. For more information, see [Control root access to a SageMaker notebook instance](nbi-root-access.md).

You can check the [Personal Health Dashboard](https://aws.amazon.com/premiumsupport/technology/personal-health-dashboard/) or the security bulletin at [Security Bulletins](https://aws.amazon.com/security/security-bulletins/) for updates.

# Control an Amazon EMR Spark Instance Using a Notebook
<a name="nbi-lifecycle-config-emr"></a>


You can use a notebook instance created with a custom lifecycle configuration script to access AWS services from your notebook. For example, you can create a script that lets you use your notebook with Sparkmagic to control other AWS resources, such as an Amazon EMR instance. You can then use the Amazon EMR instance to process your data instead of running the data analysis on your notebook. This allows you to create a smaller notebook instance because you won't use the instance to process data. This is helpful when you have large datasets that would require a large notebook instance to process the data.

The process requires three procedures using the Amazon SageMaker AI console:
+ Create the Amazon EMR Spark instance
+ Create the Jupyter Notebook
+ Test the notebook-to-Amazon EMR connection

**To create an Amazon EMR Spark instance that can be controlled from a notebook using Sparkmagic**

1. Open the Amazon EMR console at [https://console.aws.amazon.com/elasticmapreduce/](https://console.aws.amazon.com/elasticmapreduce/).

1. In the navigation pane, choose **Create cluster**.

1. On the **Create Cluster - Quick Options** page, under **Software configuration**, choose **Spark: Spark 2.4.4 on Hadoop 2.8.5 YARN with Ganglia 3.7.2 and Zeppelin 0.8.2**.

1. Set additional parameters on the page and then choose **Create cluster**.

1. On the **Cluster** page, choose the cluster name that you created. Note the **Master Public DNS**, the **EMR master's security group**, and the VPC name and subnet ID where the EMR cluster was created. You will use these values when you create a notebook.

**To create a notebook that uses Sparkmagic to control an Amazon EMR Spark instance**

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. In the navigation pane, under **Notebook instances**, choose **Create notebook**.

1. Enter the notebook instance name and choose the instance type.

1. Choose **Additional configuration**, then, under **Lifecycle configuration**, choose **Create a new lifecycle configuration**.

1. Add the following code to the lifecycle configuration script:

   ```
   # OVERVIEW
   # This script connects an Amazon EMR cluster to an Amazon SageMaker notebook instance that uses Sparkmagic.
   #
   # Note that this script will fail if the Amazon EMR cluster's master node IP address is not reachable.
   #   1. Ensure that the EMR master node IP is resolvable from the notebook instance.
   #      One way to accomplish this is to have the notebook instance and the Amazon EMR cluster in the same subnet.
   #   2. Ensure the EMR master node security group provides inbound access from the notebook instance security group.
   #       Type        - Protocol - Port - Source
   #       Custom TCP  - TCP      - 8998 - $NOTEBOOK_SECURITY_GROUP
   #   3. Ensure the notebook instance has internet connectivity to fetch the SparkMagic example config.
   #
   # https://aws.amazon.com/blogs/machine-learning/build-amazon-sagemaker-notebooks-backed-by-spark-in-amazon-emr/
   
   # PARAMETERS
   EMR_MASTER_IP=your.emr.master.ip
   
   
   cd /home/ec2-user/.sparkmagic
   
   echo "Fetching Sparkmagic example config from GitHub..."
   wget https://raw.githubusercontent.com/jupyter-incubator/sparkmagic/master/sparkmagic/example_config.json
   
   echo "Replacing EMR master node IP in Sparkmagic config..."
   sed -i -- "s/localhost/$EMR_MASTER_IP/g" example_config.json
   mv example_config.json config.json
   
   echo "Sending a sample request to Livy..."
   curl "$EMR_MASTER_IP:8998/sessions"
   ```

1. In the `PARAMETERS` section of the script, replace `your.emr.master.ip` with the Master Public DNS name for the Amazon EMR instance.

1. Choose **Create configuration**.

1. On the **Create notebook** page, choose **Network - optional**.

1. Choose the VPC and subnet where the Amazon EMR instance is located.

1. Choose the security group used by the Amazon EMR master node.

1. Choose **Create notebook instance**.

While the notebook instance is being created, the status is **Pending**. After the instance has been created and the lifecycle configuration script has successfully run, the status is **InService**.

**Note**  
If the notebook instance can't connect to the Amazon EMR instance, SageMaker AI can't create the notebook instance. The connection can fail if the Amazon EMR instance and notebook are not in the same VPC and subnet, if the Amazon EMR master security group is not used by the notebook, or if the Master Public DNS name in the script is incorrect. 

**To test the connection between the Amazon EMR instance and the notebook**

1.  When the status of the notebook is **InService**, choose **Open Jupyter** to open the notebook.

1. Choose **New**, then choose **Sparkmagic (PySpark)**.

1. In the code cell, enter **%%info** and then run the cell.

   The output should be similar to the following:

   ```
   Current session configs: {'driverMemory': '1000M', 'executorCores': 2, 'kind': 'pyspark'}
                       No active sessions.
   ```