

# Data Processing with Framework Processors
<a name="processing-job-frameworks"></a>

A `FrameworkProcessor` runs Processing jobs with a specified machine learning framework, providing you with an Amazon SageMaker AI–managed container for that framework. `FrameworkProcessor` provides prebuilt containers for the following machine learning frameworks: Hugging Face, MXNet, PyTorch, TensorFlow, and XGBoost.

The `FrameworkProcessor` class also lets you customize the container configuration. It supports specifying a source directory (`source_dir`) for your processing scripts and dependencies, which gives the processor access to multiple scripts in a directory instead of a single script. `FrameworkProcessor` also supports including a `requirements.txt` file in `source_dir` that lists the Python libraries to install in the container.
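For example, a `source_dir` you pass to a framework processor might be laid out as follows (the file names here are illustrative):

```
scripts/
├── processing-script.py    # entry point passed as the code argument
├── utils.py                # helper module imported by the entry point
└── requirements.txt        # extra Python libraries to install
```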

For more information on the `FrameworkProcessor` class and its methods and parameters, see [FrameworkProcessor](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.FrameworkProcessor) in the *Amazon SageMaker Python SDK*.

To see examples of using a `FrameworkProcessor` for each of the supported machine learning frameworks, see the following topics.

**Topics**
+ [Hugging Face Framework Processor](processing-job-frameworks-hugging-face.md)
+ [MXNet Framework Processor](processing-job-frameworks-mxnet.md)
+ [PyTorch Framework Processor](processing-job-frameworks-pytorch.md)
+ [TensorFlow Framework Processor](processing-job-frameworks-tensorflow.md)
+ [XGBoost Framework Processor](processing-job-frameworks-xgboost.md)

# Hugging Face Framework Processor
<a name="processing-job-frameworks-hugging-face"></a>

Hugging Face is an open-source provider of natural language processing (NLP) models. The `HuggingFaceProcessor` in the Amazon SageMaker Python SDK lets you run processing jobs with Hugging Face scripts. The `HuggingFaceProcessor` runs your job in an Amazon-built Docker container with a managed Hugging Face environment, so you don't need to bring your own container.

The following code example shows how you can use the `HuggingFaceProcessor` to run your Processing job using a Docker image provided and maintained by SageMaker AI. When you run the job, you can specify a directory containing your scripts and dependencies in the `source_dir` argument, and you can include a `requirements.txt` file inside `source_dir` that lists the dependencies for your processing scripts. SageMaker Processing installs the dependencies in `requirements.txt` in the container for you.

```
from sagemaker.huggingface import HuggingFaceProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker import get_execution_role

#Initialize the HuggingFaceProcessor
hfp = HuggingFaceProcessor(
    role=get_execution_role(), 
    instance_count=1,
    instance_type='ml.g4dn.xlarge',
    transformers_version='4.4.2',
    pytorch_version='1.6.0', 
    base_job_name='frameworkprocessor-hf'
)

#Run the processing job
hfp.run(
    code='processing-script.py',
    source_dir='scripts',
    inputs=[
        ProcessingInput(
            input_name='data',
            source=f's3://{BUCKET}/{S3_INPUT_PATH}',
            destination='/opt/ml/processing/input/data/'
        )
    ],
    outputs=[
        ProcessingOutput(output_name='train', source='/opt/ml/processing/output/train/', destination=f's3://{BUCKET}/{S3_OUTPUT_PATH}'),
        ProcessingOutput(output_name='test', source='/opt/ml/processing/output/test/', destination=f's3://{BUCKET}/{S3_OUTPUT_PATH}'),
        ProcessingOutput(output_name='val', source='/opt/ml/processing/output/val/', destination=f's3://{BUCKET}/{S3_OUTPUT_PATH}')
    ]
)
```
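For illustration, here is a minimal sketch of what `scripts/processing-script.py` could contain. The input and output paths match the `ProcessingInput` destination and `ProcessingOutput` sources in the example above; the line-based train/test/val split is a hypothetical placeholder for your own processing logic.

```
import os

def split_records(records, train=0.8, test=0.1):
    # Split a list of records into train/test/val portions.
    n = len(records)
    n_train = int(n * train)
    n_test = int(n * test)
    return (records[:n_train],
            records[n_train:n_train + n_test],
            records[n_train + n_test:])

def main(input_dir='/opt/ml/processing/input/data',
         output_dir='/opt/ml/processing/output'):
    # Read every line from every file that SageMaker downloaded
    # from the ProcessingInput source into input_dir.
    records = []
    for name in sorted(os.listdir(input_dir)):
        with open(os.path.join(input_dir, name)) as f:
            records.extend(line.rstrip('\n') for line in f)

    # Write each split where the matching ProcessingOutput expects it;
    # SageMaker uploads these directories to S3 when the job finishes.
    for subdir, rows in zip(('train', 'test', 'val'), split_records(records)):
        os.makedirs(os.path.join(output_dir, subdir), exist_ok=True)
        with open(os.path.join(output_dir, subdir, 'data.csv'), 'w') as f:
            f.write('\n'.join(rows))

# Only run with the container defaults when the container paths exist.
if os.path.isdir('/opt/ml/processing/input/data'):
    main()
```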

If you have a `requirements.txt` file, it should be a list of libraries that you want to install in the container. The path for `source_dir` can be a relative path, an absolute path, or an Amazon S3 URI. If you use an Amazon S3 URI, it must point to a tar.gz file. You can have multiple scripts in the directory that you specify for `source_dir`. To learn more about the `HuggingFaceProcessor` class, see [Hugging Face Estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/sagemaker.huggingface.html) in the *Amazon SageMaker Python SDK*.
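If you store `source_dir` in Amazon S3, the tar.gz archive must contain your scripts at its root. One way to build and upload such an archive (the bucket and key names below are placeholders) is:

```
# Stage an example scripts directory. In practice, this already exists.
mkdir -p scripts
printf 'print("processing")\n' > scripts/processing-script.py
printf 'pandas\nscikit-learn\n' > scripts/requirements.txt

# Archive the *contents* of scripts/ (-C changes into the directory first),
# so that processing-script.py sits at the root of the tarball.
tar -czf sourcedir.tar.gz -C scripts .

# Upload the archive, then pass its S3 URI as source_dir, for example:
#   aws s3 cp sourcedir.tar.gz s3://<BUCKET>/code/sourcedir.tar.gz
#   source_dir='s3://<BUCKET>/code/sourcedir.tar.gz'
```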

# MXNet Framework Processor
<a name="processing-job-frameworks-mxnet"></a>

Apache MXNet is an open-source deep learning framework commonly used for training and deploying neural networks. The `MXNetProcessor` in the Amazon SageMaker Python SDK lets you run processing jobs with MXNet scripts. The `MXNetProcessor` runs your job in an Amazon-built Docker container with a managed MXNet environment, so you don't need to bring your own container.

The following code example shows how you can use the `MXNetProcessor` to run your Processing job using a Docker image provided and maintained by SageMaker AI. When you run the job, you can specify a directory containing your scripts and dependencies in the `source_dir` argument, and you can include a `requirements.txt` file inside `source_dir` that lists the dependencies for your processing scripts. SageMaker Processing installs the dependencies in `requirements.txt` in the container for you.

```
from sagemaker.mxnet import MXNetProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker import get_execution_role

#Initialize the MXNetProcessor
mxp = MXNetProcessor(
    framework_version='1.8.0',
    py_version='py37',
    role=get_execution_role(), 
    instance_count=1,
    instance_type='ml.c5.xlarge',
    base_job_name='frameworkprocessor-mxnet'
)

#Run the processing job
mxp.run(
    code='processing-script.py',
    source_dir='scripts',
    inputs=[
        ProcessingInput(
            input_name='data',
            source=f's3://{BUCKET}/{S3_INPUT_PATH}',
            destination='/opt/ml/processing/input/data/'
        )
    ],
    outputs=[
        ProcessingOutput(
            output_name='processed_data',
            source='/opt/ml/processing/output/',
            destination=f's3://{BUCKET}/{S3_OUTPUT_PATH}'
        )
    ]
)
```

If you have a `requirements.txt` file, it should be a list of libraries that you want to install in the container. The path for `source_dir` can be a relative path, an absolute path, or an Amazon S3 URI. If you use an Amazon S3 URI, it must point to a tar.gz file. You can have multiple scripts in the directory that you specify for `source_dir`. To learn more about the `MXNetProcessor` class, see [MXNet Estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/mxnet/sagemaker.mxnet.html#mxnet-estimator) in the *Amazon SageMaker Python SDK*.

# PyTorch Framework Processor
<a name="processing-job-frameworks-pytorch"></a>

PyTorch is an open-source machine learning framework. The `PyTorchProcessor` in the Amazon SageMaker Python SDK lets you run processing jobs with PyTorch scripts. The `PyTorchProcessor` runs your job in an Amazon-built Docker container with a managed PyTorch environment, so you don't need to bring your own container.

The following code example shows how you can use the `PyTorchProcessor` to run your Processing job using a Docker image provided and maintained by SageMaker AI. When you run the job, you can specify a directory containing your scripts and dependencies in the `source_dir` argument, and you can include a `requirements.txt` file inside `source_dir` that lists the dependencies for your processing scripts. SageMaker Processing installs the dependencies in `requirements.txt` in the container for you.

For the PyTorch versions supported by SageMaker AI, see the available [Deep Learning Containers images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md).

```
from sagemaker.pytorch.processing import PyTorchProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker import get_execution_role

#Initialize the PyTorchProcessor
pytorch_processor = PyTorchProcessor(
    framework_version='1.8',
    role=get_execution_role(),
    instance_type='ml.m5.xlarge',
    instance_count=1,
    base_job_name='frameworkprocessor-PT'
)

#Run the processing job
pytorch_processor.run(
    code='processing-script.py',
    source_dir='scripts',
    inputs=[
        ProcessingInput(
            input_name='data',
            source=f's3://{BUCKET}/{S3_INPUT_PATH}',
            destination='/opt/ml/processing/input'
        )
    ],
    outputs=[
        ProcessingOutput(output_name='data_structured', source='/opt/ml/processing/tmp/data_structured', destination=f's3://{BUCKET}/{S3_OUTPUT_PATH}'),
        ProcessingOutput(output_name='train', source='/opt/ml/processing/output/train', destination=f's3://{BUCKET}/{S3_OUTPUT_PATH}'),
        ProcessingOutput(output_name='validation', source='/opt/ml/processing/output/val', destination=f's3://{BUCKET}/{S3_OUTPUT_PATH}'),
        ProcessingOutput(output_name='test', source='/opt/ml/processing/output/test', destination=f's3://{BUCKET}/{S3_OUTPUT_PATH}'),
        ProcessingOutput(output_name='logs', source='/opt/ml/processing/logs', destination=f's3://{BUCKET}/{S3_OUTPUT_PATH}')
    ]
)
```

If you have a `requirements.txt` file, it should be a list of libraries that you want to install in the container. The path for `source_dir` can be a relative path, an absolute path, or an Amazon S3 URI. If you use an Amazon S3 URI, it must point to a tar.gz file. You can have multiple scripts in the directory that you specify for `source_dir`. To learn more about the `PyTorchProcessor` class, see [PyTorch Estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html) in the *Amazon SageMaker Python SDK*.

# TensorFlow Framework Processor
<a name="processing-job-frameworks-tensorflow"></a>

TensorFlow is an open-source machine learning and artificial intelligence library. The `TensorFlowProcessor` in the Amazon SageMaker Python SDK lets you run processing jobs with TensorFlow scripts. The `TensorFlowProcessor` runs your job in an Amazon-built Docker container with a managed TensorFlow environment, so you don't need to bring your own container.

The following code example shows how you can use the `TensorFlowProcessor` to run your Processing job using a Docker image provided and maintained by SageMaker AI. When you run the job, you can specify a directory containing your scripts and dependencies in the `source_dir` argument, and you can include a `requirements.txt` file inside `source_dir` that lists the dependencies for your processing scripts. SageMaker Processing installs the dependencies in `requirements.txt` in the container for you.

```
from sagemaker.tensorflow import TensorFlowProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker import get_execution_role

#Initialize the TensorFlowProcessor
tp = TensorFlowProcessor(
    framework_version='2.3',
    role=get_execution_role(),
    instance_type='ml.m5.xlarge',
    instance_count=1,
    base_job_name='frameworkprocessor-TF',
    py_version='py37'
)

#Run the processing job
tp.run(
    code='processing-script.py',
    source_dir='scripts',
    inputs=[
        ProcessingInput(
            input_name='data',
            source=f's3://{BUCKET}/{S3_INPUT_PATH}',
            destination='/opt/ml/processing/input/data'
        ),
        ProcessingInput(
            input_name='model',
            source=f's3://{BUCKET}/{S3_PATH_TO_MODEL}',
            destination='/opt/ml/processing/input/model'
        )
    ],
    outputs=[
        ProcessingOutput(
            output_name='predictions',
            source='/opt/ml/processing/output',
            destination=f's3://{BUCKET}/{S3_OUTPUT_PATH}'
        )
    ]
)
```

If you have a `requirements.txt` file, it should be a list of libraries that you want to install in the container. The path for `source_dir` can be a relative path, an absolute path, or an Amazon S3 URI. If you use an Amazon S3 URI, it must point to a tar.gz file. You can have multiple scripts in the directory that you specify for `source_dir`. To learn more about the `TensorFlowProcessor` class, see [TensorFlow Estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/sagemaker.tensorflow.html#tensorflow-estimator) in the *Amazon SageMaker Python SDK*.

# XGBoost Framework Processor
<a name="processing-job-frameworks-xgboost"></a>

XGBoost is an open-source machine learning framework. The `XGBoostProcessor` in the Amazon SageMaker Python SDK lets you run processing jobs with XGBoost scripts. The `XGBoostProcessor` runs your job in an Amazon-built Docker container with a managed XGBoost environment, so you don't need to bring your own container.

The following code example shows how you can use the `XGBoostProcessor` to run your Processing job using a Docker image provided and maintained by SageMaker AI. When you run the job, you can specify a directory containing your scripts and dependencies in the `source_dir` argument, and you can include a `requirements.txt` file inside `source_dir` that lists the dependencies for your processing scripts. SageMaker Processing installs the dependencies in `requirements.txt` in the container for you.

```
from sagemaker.xgboost import XGBoostProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker import get_execution_role

#Initialize the XGBoostProcessor
xgb = XGBoostProcessor(
    framework_version='1.2-2',
    role=get_execution_role(),
    instance_type='ml.m5.xlarge',
    instance_count=1,
    base_job_name='frameworkprocessor-XGB',
)

#Run the processing job
xgb.run(
    code='processing-script.py',
    source_dir='scripts',
    inputs=[
        ProcessingInput(
            input_name='data',
            source=f's3://{BUCKET}/{S3_INPUT_PATH}',
            destination='/opt/ml/processing/input/data'
        )
    ],
    outputs=[
        ProcessingOutput(
            output_name='processed_data',
            source='/opt/ml/processing/output/',
            destination=f's3://{BUCKET}/{S3_OUTPUT_PATH}'
        )
    ]
)
```

If you have a `requirements.txt` file, it should be a list of libraries that you want to install in the container. The path for `source_dir` can be a relative path, an absolute path, or an Amazon S3 URI. If you use an Amazon S3 URI, it must point to a tar.gz file. You can have multiple scripts in the directory that you specify for `source_dir`. To learn more about the `XGBoostProcessor` class, see [XGBoost Estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/xgboost/xgboost.html) in the *Amazon SageMaker Python SDK*.