

# Built-in SageMaker AI Algorithms for Text Data
<a name="algorithms-text"></a>

SageMaker AI provides algorithms that are tailored to the analysis of textual documents used in natural language processing, document classification or summarization, topic modeling or classification, and language transcription or translation.
+ [BlazingText algorithm](blazingtext.md)—a highly optimized implementation of the Word2vec and text classification algorithms that scales easily to large datasets. It is useful for many downstream natural language processing (NLP) tasks.
+ [Latent Dirichlet Allocation (LDA) Algorithm](lda.md)—an algorithm suitable for determining topics in a set of documents. It is an *unsupervised algorithm*, which means that it doesn't use example data with answers during training.
+ [Neural Topic Model (NTM) Algorithm](ntm.md)—another unsupervised technique for determining topics in a set of documents, using a neural network approach.
+ [Object2Vec Algorithm](object2vec.md)—a general-purpose neural embedding algorithm that can be used for recommendation systems, document classification, and sentence embeddings.
+ [Sequence-to-Sequence Algorithm](seq-2-seq.md)—a supervised algorithm commonly used for neural machine translation. 
+ [Text Classification - TensorFlow](text-classification-tensorflow.md)—a supervised algorithm that supports transfer learning with available pretrained models for text classification. 


| Algorithm name | Channel name | Training input mode | File type | Instance class | Parallelizable | 
| --- | --- | --- | --- | --- | --- | 
| BlazingText | train | File or Pipe | Text file (one sentence per line with space-separated tokens)  | GPU (single instance only) or CPU | No | 
| LDA | train and (optionally) test | File or Pipe | recordIO-protobuf or CSV | CPU (single instance only) | No | 
| Neural Topic Model | train and (optionally) validation, test, or both | File or Pipe | recordIO-protobuf or CSV | GPU or CPU | Yes | 
| Object2Vec | train and (optionally) validation, test, or both | File | JSON Lines  | GPU or CPU (single instance only) | No | 
| Seq2Seq Modeling | train, validation, and vocab | File | recordIO-protobuf | GPU (single instance only) | No | 
| Text Classification - TensorFlow | training and validation | File | CSV | CPU or GPU | Yes (only across multiple GPUs on a single instance) | 

# BlazingText algorithm
<a name="blazingtext"></a>

The Amazon SageMaker AI BlazingText algorithm provides highly optimized implementations of the Word2vec and text classification algorithms. The Word2vec algorithm is useful for many downstream natural language processing (NLP) tasks, such as sentiment analysis, named entity recognition, machine translation, etc. Text classification is an important task for applications that perform web searches, information retrieval, ranking, and document classification.

The Word2vec algorithm maps words to high-quality distributed vectors. The resulting vector representation of a word is called a *word embedding*. Words that are semantically similar correspond to vectors that are close together. That way, word embeddings capture the semantic relationships between words. 

Many natural language processing (NLP) applications learn word embeddings by training on large collections of documents. These pretrained vector representations provide information about semantics and word distributions that typically improves the generalizability of other models that are later trained on a more limited amount of data. Most implementations of the Word2vec algorithm are not optimized for multi-core CPU architectures. This makes it difficult to scale to large datasets. 

With the BlazingText algorithm, you can scale to large datasets easily. Similar to Word2vec, it provides the Skip-gram and continuous bag-of-words (CBOW) training architectures. BlazingText's implementation of the supervised multi-class, multi-label text classification algorithm extends the fastText text classifier to use GPU acceleration with custom [CUDA ](https://docs.nvidia.com/cuda/index.html) kernels. You can train a model on more than a billion words in a couple of minutes using a multi-core CPU or a GPU. And, you achieve performance on par with the state-of-the-art deep learning text classification algorithms.

The BlazingText algorithm is not parallelizable. For more information on parameters related to training, see [Docker Registry Paths for SageMaker Built-in Algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html).

The SageMaker AI BlazingText algorithm provides the following features:
+ Accelerated training of the fastText text classifier on multi-core CPUs or a GPU and Word2Vec on GPUs using highly optimized CUDA kernels. For more information, see [BlazingText: Scaling and Accelerating Word2Vec using Multiple GPUs](https://dl.acm.org/citation.cfm?doid=3146347.3146354).
+ [Enriched Word Vectors with Subword Information](https://arxiv.org/abs/1607.04606) by learning vector representations for character n-grams. This approach enables BlazingText to generate meaningful vectors for out-of-vocabulary (OOV) words by representing their vectors as the sum of the character n-gram (subword) vectors.
+ A `batch_skipgram` mode for the Word2Vec algorithm that allows faster training and distributed computation across multiple CPU nodes. The `batch_skipgram` mode does mini-batching using the Negative Sample Sharing strategy to convert level-1 BLAS operations into level-3 BLAS operations. This efficiently leverages the multiply-add instructions of modern architectures. For more information, see [Parallelizing Word2Vec in Shared and Distributed Memory](https://arxiv.org/pdf/1604.04661.pdf).

To summarize, BlazingText supports the following modes on different instance types:


| Modes |  Word2Vec (Unsupervised Learning)  |  Text Classification (Supervised Learning)  | 
| --- | --- | --- | 
|  Single CPU instance  |  `cbow`, `skipgram`, `batch_skipgram`  |  `supervised`  | 
|  Single GPU instance (with 1 or more GPUs)  |  `cbow`, `skipgram`  |  `supervised` with one GPU  | 
|  Multiple CPU instances  |  `batch_skipgram`  | None | 

For more information about the mathematics behind BlazingText, see [BlazingText: Scaling and Accelerating Word2Vec using Multiple GPUs](https://dl.acm.org/citation.cfm?doid=3146347.3146354).

**Topics**
+ [Input/Output Interface for the BlazingText Algorithm](#bt-inputoutput)
+ [EC2 Instance Recommendation for the BlazingText Algorithm](#blazingtext-instances)
+ [BlazingText Sample Notebooks](#blazingtext-sample-notebooks)
+ [BlazingText Hyperparameters](blazingtext_hyperparameters.md)
+ [Tune a BlazingText Model](blazingtext-tuning.md)

## Input/Output Interface for the BlazingText Algorithm
<a name="bt-inputoutput"></a>

The BlazingText algorithm expects a single preprocessed text file with space-separated tokens. Each line in the file should contain a single sentence. If you need to train on multiple text files, concatenate them into one file and upload the file in the respective channel.

### Training and Validation Data Format
<a name="blazingtext-data-formats"></a>

#### Training and Validation Data Format for the Word2Vec Algorithm
<a name="blazingtext-data-formats-word2vec"></a>

For Word2Vec training, upload the file under the *train* channel. No other channels are supported. The file should contain a training sentence per line.

#### Training and Validation Data Format for the Text Classification Algorithm
<a name="blazingtext-data-formats-text-class"></a>

For supervised mode, you can train with file mode or with the augmented manifest text format.

##### Train with File Mode
<a name="blazingtext-data-formats-text-class-file-mode"></a>

For `supervised` mode, the training/validation file should contain a training sentence per line along with its labels. Labels are words that are prefixed by the string `__label__`. Here is an example of a training/validation file:

```
__label__4  linux ready for prime time , intel says , despite all the linux hype , the open-source movement has yet to make a huge splash in the desktop market . that may be about to change , thanks to chipmaking giant intel corp .

__label__2  bowled by the slower one again , kolkata , november 14 the past caught up with sourav ganguly as the indian skippers return to international cricket was short lived .
```

**Note**  
The order of labels within the sentence doesn't matter. 

Upload the training file under the train channel, and optionally upload the validation file under the validation channel.
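As a sketch, a file in this format can be produced from labeled pairs with a few lines of Python. The example pairs and the output filename are hypothetical placeholders:

```python
# Hypothetical labeled pairs; tokens are assumed to be preprocessed
# (lowercased, space-separated).
pairs = [
    (4, "linux ready for prime time , intel says"),
    (2, "bowled by the slower one again , kolkata"),
]

# Each line: a __label__<id> prefix, then the sentence.
train_lines = [f"__label__{label} {sentence}" for label, sentence in pairs]

with open("blazingtext.train", "w") as f:
    f.write("\n".join(train_lines) + "\n")
```

The resulting file is then uploaded to the train channel as described above.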

##### Train with Augmented Manifest Text Format
<a name="blazingtext-data-formats-text-class-augmented-manifest"></a>

Supervised mode for CPU instances also supports the augmented manifest format, which enables you to do training in pipe mode without needing to create RecordIO files. To use this format, generate an S3 manifest file that contains the list of sentences and their corresponding labels. The manifest file must be in [JSON Lines](http://jsonlines.org/) format, in which each line represents one sample. Specify sentences with the `source` key and labels with the `label` key. Both the `source` and `label` keys must be listed under the `AttributeNames` parameter value in the request.

```
{"source":"linux ready for prime time , intel says , despite all the linux hype", "label":1}
{"source":"bowled by the slower one again , kolkata , november 14 the past caught up with sourav ganguly", "label":2}
```

Multi-label training is also supported by specifying a JSON array of labels.

```
{"source":"linux ready for prime time , intel says , despite all the linux hype", "label": [1, 3]}
{"source":"bowled by the slower one again , kolkata , november 14 the past caught up with sourav ganguly", "label": [2, 4, 5]}
```

For more information on augmented manifest files, see [Augmented Manifest Files for Training Jobs](augmented-manifest.md).
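A manifest like the one above can be generated with a few lines of Python. This is a minimal sketch; the example sentences, labels, and the *train.manifest* filename are hypothetical placeholders:

```python
import json

# Hypothetical labeled examples: (sentence, label) pairs with preprocessed,
# space-separated tokens.
examples = [
    ("linux ready for prime time , intel says", 1),
    ("bowled by the slower one again , kolkata", 2),
]

def to_manifest_lines(pairs):
    """Serialize (sentence, label) pairs as augmented manifest JSON Lines."""
    return [json.dumps({"source": text, "label": label}) for text, label in pairs]

lines = to_manifest_lines(examples)
with open("train.manifest", "w") as f:
    f.write("\n".join(lines) + "\n")
```

After uploading the manifest to S3, remember to list both `"source"` and `"label"` in the `AttributeNames` parameter of the training request.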

### Model Artifacts and Inference
<a name="blazingtext-artifacts-inference"></a>

#### Model Artifacts for the Word2Vec Algorithm
<a name="blazingtext--artifacts-inference-word2vec"></a>

For Word2Vec training, the model artifacts consist of *vectors.txt*, which contains the word-to-vector mapping, and *vectors.bin*, a binary used by BlazingText for hosting, inference, or both. *vectors.txt* stores the vectors in a format that is compatible with other tools such as Gensim and spaCy. For example, a Gensim user can run the following commands to load the *vectors.txt* file:

```
from gensim.models import KeyedVectors
word_vectors = KeyedVectors.load_word2vec_format('vectors.txt', binary=False)
word_vectors.most_similar(positive=['woman', 'king'], negative=['man'])
word_vectors.doesnt_match("breakfast cereal dinner lunch".split())
```

If the `evaluation` parameter is set to `True`, an additional file, *eval.json*, is created. This file contains the similarity evaluation results (using Spearman’s rank correlation coefficients) on the WS-353 dataset. The number of words from the WS-353 dataset that are absent from the training corpus is also reported.

For inference requests, the model accepts a JSON file containing a list of strings and returns a list of vectors. If a word is not found in the vocabulary, inference returns a vector of zeros. If `subwords` is set to `True` during training, the model can generate vectors for out-of-vocabulary (OOV) words.

##### Sample JSON Request
<a name="word2vec-json-request"></a>

Mime-type: `application/json`

```
{
"instances": ["word1", "word2", "word3"]
}
```
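A request of this shape can be built and sent with the `boto3` SageMaker runtime client. The sketch below only constructs and validates the body; the endpoint name in the commented call is a hypothetical placeholder:

```python
import json

def word_vector_request(words):
    """Build the JSON body that a BlazingText Word2Vec endpoint expects."""
    return json.dumps({"instances": list(words)})

body = word_vector_request(["word1", "word2", "word3"])

# The body is then sent to a deployed endpoint, for example:
#   runtime = boto3.client("sagemaker-runtime")
#   response = runtime.invoke_endpoint(
#       EndpointName="my-blazingtext-endpoint",  # hypothetical name
#       ContentType="application/json",
#       Body=body,
#   )
#   vectors = json.loads(response["Body"].read())
```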

#### Model Artifacts for the Text Classification Algorithm
<a name="blazingtext-artifacts-inference-text-class"></a>

Training with supervised outputs creates a *model.bin* file that can be consumed by BlazingText hosting. For inference, the BlazingText model accepts a JSON file containing a list of sentences and returns a list of corresponding predicted labels and probability scores. Each sentence is expected to be a string with space-separated tokens, words, or both.

##### Sample JSON Request
<a name="text-class-json-request"></a>

Mime-type: `application/json`

```
{
 "instances": ["the movie was excellent", "i did not like the plot ."]
}
```

By default, the server returns only one prediction, the one with the highest probability. To retrieve the top *k* predictions, set `k` in the configuration, as follows:

```
{
 "instances": ["the movie was excellent", "i did not like the plot ."],
 "configuration": {"k": 2}
}
```

For BlazingText, the `content-type` and `accept` parameters must be equal. For batch transform, they both need to be `application/jsonlines`. If they differ, the `Accept` field is ignored. The input format follows:

```
content-type: application/jsonlines

{"source": "source_0"}
{"source": "source_1"}
```

To pass the value of *k* for top-*k* predictions, include it on each line:

```
{"source": "source_0", "k": 2}
{"source": "source_1", "k": 3}
```

The format for output follows:

```
accept: application/jsonlines

{"prob": [prob_1], "label": ["__label__1"]}
{"prob": [prob_1], "label": ["__label__1"]}
```

If you pass a value of *k* greater than 1, the response takes this format:

```
{"prob": [prob_1, prob_2], "label": ["__label__1", "__label__2"]}
{"prob": [prob_1, prob_2], "label": ["__label__1", "__label__2"]}
```

For both supervised (text classification) and unsupervised (Word2Vec) modes, the binaries (`.bin` files) produced by BlazingText and fastText are interchangeable: you can use binaries produced by BlazingText with fastText, and you can host model binaries created with fastText using BlazingText.

Here is an example of how to use a model generated with BlazingText with fastText:

```
#Download the model artifact from S3
aws s3 cp s3://<YOUR_S3_BUCKET>/<PREFIX>/model.tar.gz model.tar.gz

#Unzip the model archive
tar -xzf model.tar.gz

#Use the model archive with fastText
fasttext predict ./model.bin test.txt
```

However, the binaries are only supported when training on CPU or a single GPU; training on multiple GPUs does not produce binaries.

## EC2 Instance Recommendation for the BlazingText Algorithm
<a name="blazingtext-instances"></a>

For `cbow` and `skipgram` modes, BlazingText supports single CPU and single GPU instances. Both modes support learning subword embeddings. To achieve the highest speed without compromising accuracy, we recommend that you use an ml.p3.2xlarge instance. 

For `batch_skipgram` mode, BlazingText supports single or multiple CPU instances. When training on multiple instances, set the value of the `S3DataDistributionType` field of the [S3DataSource](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_S3DataSource.html) object that you pass to [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) to `FullyReplicated`. BlazingText takes care of distributing data across machines.
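As a sketch, the train channel in the `InputDataConfig` of a `CreateTrainingJob` request might look like the following; the S3 URI is a hypothetical placeholder:

```python
# One entry of InputDataConfig for a multi-instance batch_skipgram training job.
train_channel = {
    "ChannelName": "train",
    "ContentType": "text/plain",
    "InputMode": "File",
    "DataSource": {
        "S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://my-bucket/blazingtext/train",  # placeholder
            # Replicate the full dataset to every instance; BlazingText
            # distributes the work across machines itself.
            "S3DataDistributionType": "FullyReplicated",
        }
    },
}
```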

For the supervised text classification mode, a C5 instance is recommended if the training dataset is less than 2 GB. For larger datasets, use an instance with a single GPU. BlazingText supports P2, P3, G4dn, and G5 instances for training and inference.

## BlazingText Sample Notebooks
<a name="blazingtext-sample-notebooks"></a>

For a sample notebook that trains and deploys the SageMaker AI BlazingText algorithm to generate word vectors, see [Learning Word2Vec Word Representations using BlazingText](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/blazingtext_word2vec_text8/blazingtext_word2vec_text8.html). For instructions on creating and accessing Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After creating and opening a notebook instance, choose the **SageMaker AI Examples** tab to see a list of all the SageMaker AI examples. The example notebooks that use BlazingText are located in the **Introduction to Amazon algorithms** section. To open a notebook, choose its **Use** tab, then choose **Create copy**.

# BlazingText Hyperparameters
<a name="blazingtext_hyperparameters"></a>

When you start a training job with a `CreateTrainingJob` request, you specify a training algorithm. You can also specify algorithm-specific hyperparameters as string-to-string maps. The hyperparameters for the BlazingText algorithm depend on which mode you use: Word2Vec (unsupervised) or Text Classification (supervised).
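With the low-level API, the hyperparameters are passed as plain strings. A minimal sketch for Word2Vec mode follows; the values are illustrative, not tuned recommendations:

```python
# Hyperparameters are passed to CreateTrainingJob as a string-to-string map,
# so numeric values must be serialized as strings.
hyperparameters = {
    "mode": "batch_skipgram",
    "epochs": "5",
    "learning_rate": "0.05",
    "vector_dim": "100",
    "window_size": "5",
}
```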

## Word2Vec Hyperparameters
<a name="blazingtext_hyperparameters_word2vec"></a>

The following table lists the hyperparameters for the BlazingText Word2Vec training algorithm provided by Amazon SageMaker AI.


| Parameter Name | Description | 
| --- | --- | 
| mode |  The Word2vec architecture used for training. **Required** Valid values: `batch_skipgram`, `skipgram`, or `cbow`  | 
| batch_size |  The size of each batch when `mode` is set to `batch_skipgram`. Set to a number between 10 and 20. **Optional** Valid values: Positive integer Default value: 11  | 
| buckets |  The number of hash buckets to use for subwords. **Optional** Valid values: Positive integer Default value: 2000000  | 
| epochs |  The number of complete passes through the training data. **Optional** Valid values: Positive integer Default value: 5  | 
| evaluation |  Whether the trained model is evaluated using the [WordSimilarity-353 Test](http://www.gabrilovich.com/resources/data/wordsim353/wordsim353.html). **Optional** Valid values: (Boolean) `True` or `False` Default value: `True`  | 
| learning_rate |  The step size used for parameter updates. **Optional** Valid values: Positive float Default value: 0.05  | 
| min_char |  The minimum number of characters to use for subwords/character n-grams. **Optional** Valid values: Positive integer Default value: 3  | 
| min_count |  Words that appear fewer than `min_count` times are discarded. **Optional** Valid values: Non-negative integer Default value: 5  | 
| max_char |  The maximum number of characters to use for subwords/character n-grams. **Optional** Valid values: Positive integer Default value: 6  | 
| negative_samples |  The number of negative samples for the negative sample sharing strategy. **Optional** Valid values: Positive integer Default value: 5  | 
| sampling_threshold |  The threshold for the occurrence of words. Words that appear with higher frequency in the training data are randomly down-sampled. **Optional** Valid values: Positive fraction. The recommended range is (0, 1e-3] Default value: 0.0001  | 
| subwords |  Whether to learn subword embeddings or not. **Optional** Valid values: (Boolean) `True` or `False` Default value: `False`  | 
| vector_dim |  The dimension of the word vectors that the algorithm learns. **Optional** Valid values: Positive integer Default value: 100  | 
| window_size |  The size of the context window. The context window is the number of words surrounding the target word used for training. **Optional** Valid values: Positive integer Default value: 5  | 

## Text Classification Hyperparameters
<a name="blazingtext_hyperparameters_text_class"></a>

The following table lists the hyperparameters for the Text Classification training algorithm provided by Amazon SageMaker AI.

**Note**  
Although some of the parameters are common between the Text Classification and Word2Vec modes, they might have different meanings depending on the context.


| Parameter Name | Description | 
| --- | --- | 
| mode |  The training mode. **Required** Valid values: `supervised`  | 
| buckets |  The number of hash buckets to use for word n-grams. **Optional** Valid values: Positive integer Default value: 2000000  | 
| early_stopping |  Whether to stop training if validation accuracy doesn't improve after a `patience` number of epochs. Note that a validation channel is required if early stopping is used. **Optional** Valid values: (Boolean) `True` or `False` Default value: `False`  | 
| epochs |  The maximum number of complete passes through the training data. **Optional** Valid values: Positive integer Default value: 5  | 
| learning_rate |  The step size used for parameter updates. **Optional** Valid values: Positive float Default value: 0.05  | 
| min_count |  Words that appear fewer than `min_count` times are discarded. **Optional** Valid values: Non-negative integer Default value: 5  | 
| min_epochs |  The minimum number of epochs to train before early stopping logic is invoked. **Optional** Valid values: Positive integer Default value: 5  | 
| patience |  The number of epochs to wait before applying early stopping when no progress is made on the validation set. Used only when `early_stopping` is `True`. **Optional** Valid values: Positive integer Default value: 4  | 
| vector_dim |  The dimension of the embedding layer. **Optional** Valid values: Positive integer Default value: 100  | 
| word_ngrams |  The number of word n-gram features to use. **Optional** Valid values: Positive integer Default value: 2  | 

# Tune a BlazingText Model
<a name="blazingtext-tuning"></a>

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the objective metric.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics Computed by the BlazingText Algorithm
<a name="blazingtext-metrics"></a>

The BlazingText Word2Vec algorithm (`skipgram`, `cbow`, and `batch_skipgram` modes) reports on a single metric during training: `train:mean_rho`. This metric is computed on [WS-353 word similarity datasets](https://aclweb.org/aclwiki/WordSimilarity-353_Test_Collection_(State_of_the_art)). When tuning the hyperparameter values for the Word2Vec algorithm, use this metric as the objective.

The BlazingText Text Classification algorithm (`supervised` mode) also reports a single metric during training: `validation:accuracy`. When tuning the hyperparameter values for the text classification algorithm, use this metric as the objective.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| train:mean_rho |  The mean rho (Spearman's rank correlation coefficient) on [WS-353 word similarity datasets](http://alfonseca.org/pubs/ws353simrel.tar.gz)  |  Maximize  | 
| validation:accuracy |  The classification accuracy on the user-specified validation dataset  |  Maximize  | 

## Tunable BlazingText Hyperparameters
<a name="blazingtext-tunable-hyperparameters"></a>

### Tunable Hyperparameters for the Word2Vec Algorithm
<a name="blazingtext-tunable-hyperparameters-word2vec"></a>

Tune an Amazon SageMaker AI BlazingText Word2Vec model with the following hyperparameters. The hyperparameters that have the greatest impact on Word2Vec objective metrics are: `mode`, `learning_rate`, `window_size`, `vector_dim`, and `negative_samples`.


| Parameter Name | Parameter Type | Recommended Ranges or Values | 
| --- | --- | --- | 
| batch_size |  `IntegerParameterRange`  |  [8-32]  | 
| epochs |  `IntegerParameterRange`  |  [5-15]  | 
| learning_rate |  `ContinuousParameterRange`  |  MinValue: 0.005, MaxValue: 0.01  | 
| min_count |  `IntegerParameterRange`  |  [0-100]  | 
| mode |  `CategoricalParameterRange`  |  [`'batch_skipgram'`, `'skipgram'`, `'cbow'`]  | 
| negative_samples |  `IntegerParameterRange`  |  [5-25]  | 
| sampling_threshold |  `ContinuousParameterRange`  |  MinValue: 0.0001, MaxValue: 0.001  | 
| vector_dim |  `IntegerParameterRange`  |  [32-300]  | 
| window_size |  `IntegerParameterRange`  |  [1-10]  | 
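Expressed in the shape that a `CreateHyperParameterTuningJob` request expects, a subset of the recommended Word2Vec ranges above might look like this sketch (the choice of parameters is illustrative):

```python
# ParameterRanges for a hyperparameter tuning job over Word2Vec mode.
# All bounds are serialized as strings, as the API requires.
parameter_ranges = {
    "CategoricalParameterRanges": [
        {"Name": "mode", "Values": ["batch_skipgram", "skipgram", "cbow"]},
    ],
    "ContinuousParameterRanges": [
        {"Name": "learning_rate", "MinValue": "0.005", "MaxValue": "0.01"},
    ],
    "IntegerParameterRanges": [
        {"Name": "window_size", "MinValue": "1", "MaxValue": "10"},
        {"Name": "vector_dim", "MinValue": "32", "MaxValue": "300"},
        {"Name": "negative_samples", "MinValue": "5", "MaxValue": "25"},
    ],
}

# The matching objective metric for Word2Vec is train:mean_rho, maximized.
objective = {"Type": "Maximize", "MetricName": "train:mean_rho"}
```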

### Tunable Hyperparameters for the Text Classification Algorithm
<a name="blazingtext-tunable-hyperparameters-text_class"></a>

Tune an Amazon SageMaker AI BlazingText text classification model with the following hyperparameters.


| Parameter Name | Parameter Type | Recommended Ranges or Values | 
| --- | --- | --- | 
| buckets |  `IntegerParameterRange`  |  [1000000-10000000]  | 
| epochs |  `IntegerParameterRange`  |  [5-15]  | 
| learning_rate |  `ContinuousParameterRange`  |  MinValue: 0.005, MaxValue: 0.01  | 
| min_count |  `IntegerParameterRange`  |  [0-100]  | 
| vector_dim |  `IntegerParameterRange`  |  [32-300]  | 
| word_ngrams |  `IntegerParameterRange`  |  [1-3]  | 

# Latent Dirichlet Allocation (LDA) Algorithm
<a name="lda"></a>

The Amazon SageMaker AI Latent Dirichlet Allocation (LDA) algorithm is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of distinct categories. LDA is most commonly used to discover a user-specified number of topics shared by documents within a text corpus. Here each observation is a document, the features are the presence (or occurrence count) of each word, and the categories are the topics. Since the method is unsupervised, the topics are not specified up front, and are not guaranteed to align with how a human may naturally categorize documents. The topics are learned as a probability distribution over the words that occur in each document. Each document, in turn, is described as a mixture of topics.

The exact content of two documents with similar topic mixtures will not be the same. But overall, you would expect these documents to use a shared subset of words more frequently than a document from a different topic mixture would. This allows LDA to discover these word groups and use them to form topics. As an extremely simple example, given a set of documents where the only words that occur within them are *eat*, *sleep*, *play*, *meow*, and *bark*, LDA might produce topics like the following:


| **Topic** | *eat* | *sleep*  | *play* | *meow* | *bark* | 
| --- | --- | --- | --- | --- | --- | 
| Topic 1  | 0.1  | 0.3  | 0.2  | 0.4  | 0.0  | 
| Topic 2  | 0.2  | 0.1 | 0.4  | 0.0  | 0.3  | 

You can infer that documents that are more likely to fall into Topic 1 are about cats (who are more likely to *meow* and *sleep*), and documents that fall into Topic 2 are about dogs (who prefer to *play* and *bark*). These topics can be found even though the words dog and cat never appear in any of the texts. 
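To make the intuition concrete, the toy table above can be used to score a small bag of words against each topic. This is only an illustration; real LDA scores documents against mixtures of topics, not single topics:

```python
# Toy topic-word probabilities from the table above.
vocab = ["eat", "sleep", "play", "meow", "bark"]
topics = {
    "topic_1": [0.1, 0.3, 0.2, 0.4, 0.0],  # cat-like
    "topic_2": [0.2, 0.1, 0.4, 0.0, 0.3],  # dog-like
}

def bag_likelihood(words, topic_probs):
    """Likelihood of a bag of words under a single topic."""
    p = 1.0
    for word in words:
        p *= topic_probs[vocab.index(word)]
    return p

doc = ["meow", "sleep", "eat"]
# topic_1: 0.4 * 0.3 * 0.1 = 0.012; topic_2 assigns meow probability 0, so 0.0
```

The cat-like topic explains the document far better, matching the intuition described above.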

**Topics**
+ [Choosing between Latent Dirichlet Allocation (LDA) and Neural Topic Model (NTM)](#lda-or-ntm)
+ [Input/Output Interface for the LDA Algorithm](#lda-inputoutput)
+ [EC2 Instance Recommendation for the LDA Algorithm](#lda-instances)
+ [LDA Sample Notebooks](#LDA-sample-notebooks)
+ [How LDA Works](lda-how-it-works.md)
+ [LDA Hyperparameters](lda_hyperparameters.md)
+ [Tune an LDA Model](lda-tuning.md)

## Choosing between Latent Dirichlet Allocation (LDA) and Neural Topic Model (NTM)
<a name="lda-or-ntm"></a>

Topic models are commonly used to produce topics from corpuses that (1) coherently encapsulate semantic meaning and (2) describe documents well. As such, topic models aim to minimize perplexity and maximize topic coherence. 

Perplexity is an intrinsic language modeling evaluation metric that measures the inverse of the geometric mean per-word likelihood in your test data. A lower perplexity score indicates better generalization performance. Research has shown that per-word likelihood often does not align with human judgment and can be entirely uncorrelated with it, so topic coherence was introduced. Each topic inferred by your model consists of words, and topic coherence is computed over the top N words for that particular topic. It is often defined as the average or median of the pairwise word-similarity scores of the words in that topic, for example, Pointwise Mutual Information (PMI). A promising model generates coherent topics, that is, topics with high topic coherence scores. 

While the objective is to train a topic model that minimizes perplexity and maximizes topic coherence, there is often a tradeoff between the two with both LDA and NTM. Recent research by Amazon (Ding et al., 2018) has shown that NTM is promising for achieving high topic coherence, but LDA trained with collapsed Gibbs sampling achieves better perplexity. From a practicality standpoint regarding hardware and compute power, SageMaker NTM is more flexible than LDA and can scale better, because NTM can run on CPU and GPU and can be parallelized across multiple GPU instances, whereas LDA supports only single-instance CPU training. 


## Input/Output Interface for the LDA Algorithm
<a name="lda-inputoutput"></a>

LDA expects data to be provided on the train channel, and optionally supports a test channel, which is scored by the final model. LDA supports both `recordIO-wrapped-protobuf` (dense and sparse) and `CSV` file formats. For `CSV`, the data must be dense and have dimension equal to *number of records* × *vocabulary size*. LDA can be trained in File or Pipe mode when using `recordIO-wrapped-protobuf`, but only in File mode for the `CSV` format.

For inference, `text/csv`, `application/json`, and `application/x-recordio-protobuf` content types are supported. Sparse data can also be passed for `application/json` and `application/x-recordio-protobuf`. LDA inference returns `application/json` or `application/x-recordio-protobuf` *predictions*, which include the `topic_mixture` vector for each observation.

See the [LDA Sample Notebooks](#LDA-sample-notebooks) for more detail on training and inference formats.

## EC2 Instance Recommendation for the LDA Algorithm
<a name="lda-instances"></a>

LDA currently only supports single-instance CPU training. CPU instances are recommended for hosting/inference.

## LDA Sample Notebooks
<a name="LDA-sample-notebooks"></a>

For a sample notebook that shows how to train the SageMaker AI Latent Dirichlet Allocation algorithm on a dataset and then deploy the trained model to perform inferences about the topic mixtures in input documents, see [An Introduction to SageMaker AI LDA](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/lda_topic_modeling/LDA-Introduction.html). For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). Once you have created a notebook instance and opened it, select the **SageMaker AI Examples** tab to see a list of all the SageMaker AI samples. The topic modeling example notebooks that use the LDA algorithm are located in the **Introduction to Amazon algorithms** section. To open a notebook, click its **Use** tab and select **Create copy**.

# How LDA Works
<a name="lda-how-it-works"></a>

Amazon SageMaker AI LDA is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of different categories. These categories are themselves a probability distribution over the features. LDA is a generative probability model, which means it attempts to provide a model for the distribution of outputs and inputs based on latent variables. This is opposed to discriminative models, which attempt to learn how inputs map to outputs.

You can use LDA for a variety of tasks, from clustering customers based on product purchases to automatic harmonic analysis in music. However, it is most commonly associated with topic modeling in text corpora. Observations are referred to as documents. The feature set is referred to as vocabulary. A feature is referred to as a word. And the resulting categories are referred to as topics.

**Note**  
Lemmatization significantly increases algorithm performance and accuracy. Consider pre-processing any input text data. For more information, see [Stemming and lemmatization](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html).

An LDA model is defined by two parameters:
+ α—A prior estimate of topic probability (in other words, the average frequency with which each topic occurs within a given document). 
+ β—A collection of k topics, where each topic is given a probability distribution over the vocabulary used in the document corpus, also called a "topic-word distribution."

LDA is a "bag-of-words" model, which means that the order of words does not matter. LDA is a generative model where each document is generated word-by-word by choosing a topic mixture θ ∼ Dirichlet(α). 

For each word in the document: 
+ Choose a topic z ∼ Multinomial(θ). 
+ Select the corresponding topic-word distribution β_z. 
+ Draw a word w ∼ Multinomial(β_z). 
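The generative process above can be sketched in a few lines of NumPy. This is an illustrative simulation only (the topic count, vocabulary size, and document length are made up), not the SageMaker AI implementation:

```
import numpy as np

rng = np.random.default_rng(0)
k, vocab_size, doc_len = 3, 8, 20            # hypothetical: topics, vocabulary, words per document

alpha = np.full(k, 0.5)                      # symmetric Dirichlet prior over topics
beta = rng.dirichlet(np.ones(vocab_size), size=k)  # k topic-word distributions

theta = rng.dirichlet(alpha)                 # per-document topic mixture: theta ~ Dirichlet(alpha)
document = []
for _ in range(doc_len):
    z = rng.choice(k, p=theta)               # choose a topic: z ~ Multinomial(theta)
    w = rng.choice(vocab_size, p=beta[z])    # draw a word: w ~ Multinomial(beta_z)
    document.append(int(w))
```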

When training the model, the goal is to find parameters α and β, which maximize the probability that the text corpus is generated by the model.

The most popular methods for estimating the LDA model use Gibbs sampling or Expectation Maximization (EM) techniques. The Amazon SageMaker AI LDA uses tensor spectral decomposition. This provides several advantages:
+  **Theoretical guarantees on results**. The standard EM-method is guaranteed to converge only to local optima, which are often of poor quality. 
+  **Embarrassingly parallelizable**. The work can be trivially divided over input documents in both training and inference. The EM-method and Gibbs Sampling approaches can be parallelized, but not as easily. 
+  **Fast**. Although the EM-method has low iteration cost it is prone to slow convergence rates. Gibbs Sampling is also subject to slow convergence rates and also requires a large number of samples. 

At a high-level, the tensor decomposition algorithm follows this process:

1.  The goal is to calculate the spectral decomposition of a **V** x **V** x **V** tensor, which summarizes the moments of the documents in our corpus. **V** is vocabulary size (in other words, the number of distinct words in all of the documents). The spectral components of this tensor are the LDA parameters α and β, which maximize the overall likelihood of the document corpus. However, because vocabulary size tends to be large, this **V** x **V** x **V** tensor is prohibitively large to store in memory. 

1.  Instead, it uses a **V** x **V** moment matrix, which is the two-dimensional analog of the tensor from step 1, to find a whitening matrix of dimension **V** x **k**. This matrix can be used to convert the **V** x **V** moment matrix into a **k** x **k** identity matrix. **k** is the number of topics in the model. 

1.  This same whitening matrix can then be used to find a smaller **k** x **k** x **k** tensor. When spectrally decomposed, this tensor has components that have a simple relationship with the components of the **V** x **V** x **V** tensor. 

1.  *Alternating Least Squares* is used to decompose the smaller **k** x **k** x **k** tensor. This provides a substantial improvement in memory consumption and speed. The parameters α and β can be found by “unwhitening” these outputs in the spectral decomposition. 

After the LDA model’s parameters have been found, you can find the topic mixtures for each document. You use stochastic gradient descent to maximize the likelihood function of observing a given topic mixture corresponding to these data.

Topic quality can be improved by increasing the number of topics to look for in training and then filtering out poor quality ones. This is in fact done automatically in SageMaker AI LDA: 25% more topics are computed and only the ones with largest associated Dirichlet priors are returned. To perform further topic filtering and analysis, you can increase the topic count and modify the resulting LDA model as follows:

```
import mxnet as mx

# Load the saved LDA parameters as MXNet NDArrays.
alpha, beta = mx.ndarray.load('model.tar.gz')
# modify alpha and beta
mx.nd.save('new_model.tar.gz', [new_alpha, new_beta])
# upload to S3 and create a new SageMaker model using the console
```

For more information about algorithms for LDA and the SageMaker AI implementation, see the following:
+ Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M Kakade, and Matus Telgarsky. *Tensor Decompositions for Learning Latent Variable Models*, Journal of Machine Learning Research, 15:2773–2832, 2014.
+  David M Blei, Andrew Y Ng, and Michael I Jordan. *Latent Dirichlet Allocation*. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.
+  Thomas L Griffiths and Mark Steyvers. *Finding Scientific Topics*. Proceedings of the National Academy of Sciences, 101(suppl 1):5228–5235, 2004. 
+  Tamara G Kolda and Brett W Bader. *Tensor Decompositions and Applications*. SIAM Review, 51(3):455–500, 2009. 

# LDA Hyperparameters
<a name="lda_hyperparameters"></a>

In the `CreateTrainingJob` request, you specify the training algorithm. You can also specify algorithm-specific hyperparameters as string-to-string maps. The following table lists the hyperparameters for the LDA training algorithm provided by Amazon SageMaker AI. For more information, see [How LDA Works](lda-how-it-works.md).


| Parameter Name | Description | 
| --- | --- | 
| num\_topics |  The number of topics for LDA to find within the data. **Required** Valid values: positive integer  | 
| feature\_dim |  The size of the vocabulary of the input document corpus. **Required** Valid values: positive integer  | 
| mini\_batch\_size |  The total number of documents in the input document corpus. **Required** Valid values: positive integer  | 
| alpha0 |  Initial guess for the concentration parameter: the sum of the elements of the Dirichlet prior. Small values are more likely to generate sparse topic mixtures and large values (greater than 1.0) produce more uniform mixtures.  **Optional** Valid values: Positive float Default value: 1.0  | 
| max\_restarts |  The number of restarts to perform during the Alternating Least Squares (ALS) spectral decomposition phase of the algorithm. Can be used to find better quality local minima at the expense of additional computation, but typically should not be adjusted.  **Optional** Valid values: Positive integer Default value: 10  | 
| max\_iterations |  The maximum number of iterations to perform during the ALS phase of the algorithm. Can be used to find better quality minima at the expense of additional computation, but typically should not be adjusted.  **Optional** Valid values: Positive integer Default value: 1000  | 
| tol |  Target error tolerance for the ALS phase of the algorithm. Can be used to find better quality minima at the expense of additional computation, but typically should not be adjusted.  **Optional** Valid values: Positive float Default value: 1e-8  | 
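Because `CreateTrainingJob` takes hyperparameters as a string-to-string map, every value in the map must be passed as a string, even the numeric ones. A minimal sketch (the specific values below are illustrative, not recommendations):

```
# Hyperparameters for the LDA algorithm as the string-to-string map that
# CreateTrainingJob expects. All values are strings, including numbers.
lda_hyperparameters = {
    "num_topics": "20",          # topics to find
    "feature_dim": "5000",       # vocabulary size
    "mini_batch_size": "10000",  # total documents in the corpus
    "alpha0": "1.0",             # Dirichlet concentration parameter
}
```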

# Tune an LDA Model
<a name="lda-tuning"></a>

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the objective metric.

LDA is an unsupervised topic modeling algorithm that attempts to describe a set of observations (documents) as a mixture of different categories (topics). The “per-word log-likelihood” (PWLL) metric measures the likelihood that a learned set of topics (an LDA model) accurately describes a test document dataset. Larger values of PWLL indicate that the test data is more likely to be described by the LDA model.
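As a concrete illustration of the PWLL metric, the sketch below scores one hypothetical test document under a learned model: each word's probability is the topic mixture θ weighted over the topic-word distributions β, and PWLL is the mean log of those probabilities (all values are made up for illustration):

```
import numpy as np

# Hypothetical learned model: 2 topics over a 3-word vocabulary.
beta = np.array([[0.6, 0.3, 0.1],    # topic 0 topic-word distribution
                 [0.1, 0.2, 0.7]])   # topic 1 topic-word distribution
theta = np.array([0.25, 0.75])       # document's topic mixture
words = [2, 2, 0, 1]                 # word ids observed in the test document

word_probs = theta @ beta            # p(w) = sum_k theta_k * beta[k, w]
pwll = np.mean(np.log(word_probs[words]))  # per-word log-likelihood
```

A better-fitting model assigns higher probability to the observed words, which raises PWLL toward zero.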

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics Computed by the LDA Algorithm
<a name="lda-metrics"></a>

The LDA algorithm reports on a single metric during training: `test:pwll`. When tuning a model, choose this metric as the objective metric.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| test:pwll | Per-word log-likelihood on the test dataset. The likelihood that the test dataset is accurately described by the learned LDA model. | Maximize | 

## Tunable LDA Hyperparameters
<a name="lda-tunable-hyperparameters"></a>

You can tune the following hyperparameters for the LDA algorithm. Both hyperparameters, `alpha0` and `num_topics`, can affect the LDA objective metric (`test:pwll`). If you don't already know the optimal values for these hyperparameters, which maximize per-word log-likelihood and produce an accurate LDA model, automatic model tuning can help find them.


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| alpha0 | ContinuousParameterRanges | MinValue: 0.1, MaxValue: 10 | 
| num\_topics | IntegerParameterRanges | MinValue: 1, MaxValue: 150 | 

# Neural Topic Model (NTM) Algorithm
<a name="ntm"></a>

Amazon SageMaker AI NTM is an unsupervised learning algorithm that is used to organize a corpus of documents into *topics* that contain word groupings based on their statistical distribution. Documents that contain frequent occurrences of words such as "bike", "car", "train", "mileage", and "speed" are likely to share a topic on "transportation" for example. Topic modeling can be used to classify or summarize documents based on the topics detected or to retrieve information or recommend content based on topic similarities. The topics from documents that NTM learns are characterized as a *latent representation* because the topics are inferred from the observed word distributions in the corpus. The semantics of topics are usually inferred by examining the top ranking words they contain. Because the method is unsupervised, only the number of topics, not the topics themselves, are prespecified. In addition, the topics are not guaranteed to align with how a human might naturally categorize documents.

Topic modeling provides a way to visualize the contents of a large document corpus in terms of the learned topics. Documents relevant to each topic might be indexed or searched for based on their soft topic labels. The latent representations of documents might also be used to find similar documents in the topic space. You can also use the latent representations of documents that the topic model learns for input to another supervised algorithm such as a document classifier. Because the latent representations of documents are expected to capture the semantics of the underlying documents, algorithms based in part on these representations are expected to perform better than those based on lexical features alone.

Although you can use both the Amazon SageMaker AI NTM and LDA algorithms for topic modeling, they are distinct algorithms and can be expected to produce different results on the same input data.

For more information on the mathematics behind NTM, see [Neural Variational Inference for Text Processing](https://arxiv.org/pdf/1511.06038.pdf).

**Topics**
+ [Input/Output Interface for the NTM Algorithm](#NTM-inputoutput)
+ [EC2 Instance Recommendation for the NTM Algorithm](#NTM-instances)
+ [NTM Sample Notebooks](#NTM-sample-notebooks)
+ [NTM Hyperparameters](ntm_hyperparameters.md)
+ [Tune an NTM Model](ntm-tuning.md)
+ [NTM Response Formats](ntm-in-formats.md)

## Input/Output Interface for the NTM Algorithm
<a name="NTM-inputoutput"></a>

Amazon SageMaker AI Neural Topic Model supports four data channels: train, validation, test, and auxiliary. The validation, test, and auxiliary data channels are optional. If you specify any of these optional channels, set the value of the `S3DataDistributionType` parameter for them to `FullyReplicated`. If you provide validation data, the loss on this data is logged at every epoch, and the model stops training as soon as it detects that the validation loss is not improving. If you don't provide validation data, the algorithm stops early based on the training data, but this can be less efficient. If you provide test data, the algorithm reports the test loss from the final model. 

The train, validation, and test data channels for NTM support both `recordIO-wrapped-protobuf` (dense and sparse) and `CSV` file formats. For `CSV` format, each row must be represented densely with zero counts for words not present in the corresponding document, and have dimension equal to: (number of records) × (vocabulary size). You can use either File mode or Pipe mode to train models on data that is formatted as `recordIO-wrapped-protobuf` or as `CSV`. The auxiliary channel is used to supply a text file that contains vocabulary. By supplying the vocabulary file, users are able to see the top words for each of the topics printed in the log instead of their integer IDs. Having the vocabulary file also allows NTM to compute the Word Embedding Topic Coherence (WETC) scores, a new metric displayed in the log that captures similarity among the top words in each topic effectively. The `ContentType` for the auxiliary channel is `text/plain`, with each line containing a single word, in the order corresponding to the integer IDs provided in the data. The vocabulary file must be named `vocab.txt` and currently only UTF-8 encoding is supported. 
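The dense CSV requirement means every vocabulary position must appear in each row, with zero counts for absent words. A minimal sketch of the conversion from a sparse per-document word-count mapping (the vocabulary size and counts are illustrative):

```
# Convert one document's sparse word counts into the dense CSV row
# that the NTM train channel expects (hypothetical vocabulary of 10 words).
vocab_size = 10
sparse_counts = {2: 3, 7: 1}    # word id -> count for this document

dense_row = [0] * vocab_size    # zero count for every absent word
for word_id, count in sparse_counts.items():
    dense_row[word_id] = count

csv_line = ",".join(str(c) for c in dense_row)
```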

For inference, `text/csv`, `application/json`, `application/jsonlines`, and `application/x-recordio-protobuf` content types are supported. Sparse data can also be passed for `application/json` and `application/x-recordio-protobuf`. NTM inference returns `application/json` or `application/x-recordio-protobuf` *predictions*, which include the `topic_weights` vector for each observation.

See the [blog post](https://aws.amazon.com/blogs/machine-learning/amazon-sagemaker-neural-topic-model-now-supports-auxiliary-vocabulary-channel-new-topic-evaluation-metrics-and-training-subsampling/) for more details on using the auxiliary channel and the WETC scores. For more information on how to compute the WETC score, see [Coherence-Aware Neural Topic Modeling](https://arxiv.org/pdf/1809.02687.pdf). We used the pairwise WETC described in this paper for the Amazon SageMaker AI Neural Topic Model.

For more information on input and output file formats, see [NTM Response Formats](ntm-in-formats.md) for inference and the [NTM Sample Notebooks](#NTM-sample-notebooks).

## EC2 Instance Recommendation for the NTM Algorithm
<a name="NTM-instances"></a>

NTM training supports both GPU and CPU instance types. We recommend GPU instances, but for certain workloads, CPU instances may result in lower training costs. CPU instances should be sufficient for inference. NTM supports the P2, P3, G4dn, and G5 GPU instance families for training and inference.

## NTM Sample Notebooks
<a name="NTM-sample-notebooks"></a>

For a sample notebook that uses the SageMaker AI NTM algorithm to uncover topics in documents from a synthetic data source where the topic distributions are known, see [Introduction to Basic Functionality of NTM](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/ntm_synthetic/ntm_synthetic.html). For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). Once you have created a notebook instance and opened it, select the **SageMaker AI Examples** tab to see a list of all the SageMaker AI samples. The topic modeling example notebooks using the NTM algorithms are located in the **Introduction to Amazon algorithms** section. To open a notebook, click on its **Use** tab and select **Create copy**.

# NTM Hyperparameters
<a name="ntm_hyperparameters"></a>

The following table lists the hyperparameters that you can set for the Amazon SageMaker AI Neural Topic Model (NTM) algorithm.


| Parameter Name | Description | 
| --- | --- | 
|  `feature_dim`  |  The vocabulary size of the dataset. **Required** Valid values: Positive integer (min: 1, max: 1,000,000)  | 
| num\_topics |  The number of required topics. **Required** Valid values: Positive integer (min: 2, max: 1000)  | 
| batch\_norm |  Whether to use batch normalization during training. **Optional** Valid values: *true* or *false* Default value: *false*  | 
| clip\_gradient |  The maximum magnitude for each gradient component. **Optional** Valid values: Float (min: 1e-3) Default value: Infinity  | 
| encoder\_layers |  The number of layers in the encoder and the output size of each layer. When set to *auto*, the algorithm uses two layers of sizes 3 x `num_topics` and 2 x `num_topics` respectively.  **Optional** Valid values: Comma-separated list of positive integers or *auto* Default value: *auto*  | 
| encoder\_layers\_activation |  The activation function to use in the encoder layers. **Optional** Valid values:  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/ntm_hyperparameters.html) Default value: `sigmoid`  | 
| epochs |  The maximum number of passes over the training data. **Optional** Valid values: Positive integer (min: 1) Default value: 50  | 
| learning\_rate |  The learning rate for the optimizer. **Optional** Valid values: Float (min: 1e-6, max: 1.0) Default value: 0.001  | 
| mini\_batch\_size |  The number of examples in each mini batch. **Optional** Valid values: Positive integer (min: 1, max: 10000) Default value: 256  | 
| num\_patience\_epochs |  The number of successive epochs over which the early stopping criterion is evaluated. Early stopping is triggered when the change in the loss function drops below the specified `tolerance` within the last `num_patience_epochs` number of epochs. To disable early stopping, set `num_patience_epochs` to a value larger than `epochs`. **Optional** Valid values: Positive integer (min: 1) Default value: 3  | 
| optimizer |  The optimizer to use for training. **Optional** Valid values: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/ntm_hyperparameters.html) Default value: `adadelta`  | 
| rescale\_gradient |  The rescale factor for gradient. **Optional** Valid values: Float (min: 1e-3, max: 1.0) Default value: 1.0  | 
| sub\_sample |  The fraction of the training data to sample for training per epoch. **Optional** Valid values: Float (min: 0.0, max: 1.0) Default value: 1.0  | 
| tolerance |  The maximum relative change in the loss function. Early stopping is triggered when the change in the loss function drops below this value within the last `num_patience_epochs` number of epochs. **Optional** Valid values: Float (min: 1e-6, max: 0.1) Default value: 0.001  | 
| weight\_decay |   The weight decay coefficient. Adds L2 regularization. **Optional** Valid values: Float (min: 0.0, max: 1.0) Default value: 0.0  | 

# Tune an NTM Model
<a name="ntm-tuning"></a>

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the objective metric.

Amazon SageMaker AI NTM is an unsupervised learning algorithm that learns latent representations of large collections of discrete data, such as a corpus of documents. Latent representations use inferred variables that are not directly measured to model the observations in a dataset. Automatic model tuning on NTM helps you find the model that minimizes loss over the training or validation data. *Training loss* measures how well the model fits the training data. *Validation loss* measures how well the model can generalize to data that it is not trained on. Low training loss indicates that a model is a good fit to the training data. Low validation loss indicates that a model has not overfit the training data and so should be able to model successfully documents on which it has not been trained. Usually, it's preferable to have both losses be small. However, minimizing training loss too much might result in overfitting and increase validation loss, which would reduce the generality of the model. 

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics Computed by the NTM Algorithm
<a name="ntm-metrics"></a>

The NTM algorithm reports a single metric that is computed during training: `validation:total_loss`. The total loss is the sum of the reconstruction loss and Kullback-Leibler divergence. When tuning hyperparameter values, choose this metric as the objective.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| validation:total\_loss |  Total loss on the validation set  |  Minimize  | 

## Tunable NTM Hyperparameters
<a name="ntm-tunable-hyperparameters"></a>

You can tune the following hyperparameters for the NTM algorithm. Usually setting low `mini_batch_size` and small `learning_rate` values results in lower validation losses, although it might take longer to train. Low validation losses don't necessarily produce more coherent topics as interpreted by humans. The effect of other hyperparameters on training and validation loss can vary from dataset to dataset. To see which values are compatible, see [NTM Hyperparameters](ntm_hyperparameters.md).


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| encoder\_layers\_activation |  CategoricalParameterRanges  |  ['sigmoid', 'tanh', 'relu']  | 
| learning\_rate |  ContinuousParameterRange  |  MinValue: 1e-4, MaxValue: 0.1  | 
| mini\_batch\_size |  IntegerParameterRanges  |  MinValue: 16, MaxValue: 2048  | 
| optimizer |  CategoricalParameterRanges  |  ['sgd', 'adam', 'adadelta']  | 
| rescale\_gradient |  ContinuousParameterRange  |  MinValue: 0.1, MaxValue: 1.0  | 
| weight\_decay |  ContinuousParameterRange  |  MinValue: 0.0, MaxValue: 1.0  | 

# NTM Response Formats
<a name="ntm-in-formats"></a>

All Amazon SageMaker AI built-in algorithms adhere to the common input inference format described in [Common Data Formats - Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html). This topic contains a list of the available output formats for the SageMaker AI NTM algorithm.

## JSON Response Format
<a name="ntm-json"></a>

```
{
    "predictions":    [
        {"topic_weights": [0.02, 0.1, 0,...]},
        {"topic_weights": [0.25, 0.067, 0,...]}
    ]
}
```

## JSONLINES Response Format
<a name="ntm-jsonlines"></a>

```
{"topic_weights": [0.02, 0.1, 0,...]}
{"topic_weights": [0.25, 0.067, 0,...]}
```
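A JSON Lines response like the one above can be consumed line by line with the standard library; a minimal sketch (the sample body and weight values are illustrative):

```
import json

# Hypothetical JSONLINES response body: one JSON object per line.
response_body = (
    '{"topic_weights": [0.02, 0.1, 0.88]}\n'
    '{"topic_weights": [0.25, 0.067, 0.683]}'
)

# Parse each line independently to recover one topic-weight vector per record.
weights = [json.loads(line)["topic_weights"] for line in response_body.splitlines()]
```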

## RECORDIO Response Format
<a name="ntm-recordio"></a>

```
[
    Record = {
        features = {},
        label = {
            'topic_weights': {
                keys: [],
                values: [0.25, 0.067, 0, ...]  # float32
            }
        }
    },
    Record = {
        features = {},
        label = {
            'topic_weights': {
                keys: [],
                values: [0.25, 0.067, 0, ...]  # float32
            }
        }
    }  
]
```

# Object2Vec Algorithm
<a name="object2vec"></a>

The Amazon SageMaker AI Object2Vec algorithm is a general-purpose neural embedding algorithm that is highly customizable. It can learn low-dimensional dense embeddings of high-dimensional objects. The embeddings are learned in a way that preserves the semantics of the relationship between pairs of objects in the original space in the embedding space. You can use the learned embeddings to efficiently compute nearest neighbors of objects and to visualize natural clusters of related objects in low-dimensional space, for example. You can also use the embeddings as features of the corresponding objects in downstream supervised tasks, such as classification or regression. 

Object2Vec generalizes the well-known Word2Vec embedding technique for words that is optimized in the SageMaker AI [BlazingText algorithm](blazingtext.md). For a blog post that discusses how to apply Object2Vec to some practical use cases, see [Introduction to Amazon SageMaker AI Object2Vec](https://aws.amazon.com/blogs/machine-learning/introduction-to-amazon-sagemaker-object2vec/). 

**Topics**
+ [I/O Interface for the Object2Vec Algorithm](#object2vec-inputoutput)
+ [EC2 Instance Recommendation for the Object2Vec Algorithm](#object2vec--instances)
+ [Object2Vec Sample Notebooks](#object2vec-sample-notebooks)
+ [How Object2Vec Works](object2vec-howitworks.md)
+ [Object2Vec Hyperparameters](object2vec-hyperparameters.md)
+ [Tune an Object2Vec Model](object2vec-tuning.md)
+ [Data Formats for Object2Vec Training](object2vec-training-formats.md)
+ [Data Formats for Object2Vec Inference](object2vec-inference-formats.md)
+ [Encoder Embeddings for Object2Vec](object2vec-encoder-embeddings.md)

## I/O Interface for the Object2Vec Algorithm
<a name="object2vec-inputoutput"></a>

You can use Object2Vec on many input data types, including the following examples.


| Input Data Type | Example | 
| --- | --- | 
|  Sentence-sentence pairs  | "A soccer game with multiple males playing." and "Some men are playing a sport." | 
|  Labels-sequence pairs  | The genre tags of the movie "Titanic", such as "Romance" and "Drama", and its short description: "James Cameron's Titanic is an epic, action-packed romance set against the ill-fated maiden voyage of the R.M.S. Titanic. She was the most luxurious liner of her era, a ship of dreams, which ultimately carried over 1,500 people to their death in the ice cold waters of the North Atlantic in the early hours of April 15, 1912." | 
|  Customer-customer pairs  |  The customer ID of Jane and customer ID of Jackie.  | 
|  Product-product pairs  |  The product ID of football and product ID of basketball.  | 
|  Item review user-item pairs  |  A user's ID and the items she has bought, such as apple, pear, and orange.  | 

To transform the input data into the supported formats, you must preprocess it. Currently, Object2Vec natively supports two types of input: 
+ A discrete token, which is represented as a list of a single `integer-id`. For example, `[10]`.
+ A sequence of discrete tokens, which is represented as a list of `integer-ids`. For example, `[0,12,10,13]`.

The object in each pair can be asymmetric. For example, the pairs can be (token, sequence) or (token, token) or (sequence, sequence). For token inputs, the algorithm supports simple embeddings as compatible encoders. For sequences of token vectors, the algorithm supports the following as encoders:
+  Average-pooled embeddings
+  Hierarchical convolutional neural networks (CNNs)
+  Multi-layer bidirectional long short-term memory networks (BiLSTMs) 

The input label for each pair can be one of the following:
+ A categorical label that expresses the relationship between the objects in the pair 
+ A score that expresses the strength of the similarity between the two objects 

For categorical labels used in classification, the algorithm supports the cross-entropy loss function. For ratings/score-based labels used in regression, the algorithm supports the mean squared error (MSE) loss function. Specify these loss functions with the `output_layer` hyperparameter when you create the model training job.

## EC2 Instance Recommendation for the Object2Vec Algorithm
<a name="object2vec--instances"></a>

The type of Amazon Elastic Compute Cloud (Amazon EC2) instance that you use depends on whether you are training or running inference. 

When training a model using the Object2Vec algorithm on a CPU, start with an ml.m5.2xlarge instance. For training on a GPU, start with an ml.p2.xlarge instance. If the training takes too long on this instance, you can use a larger instance. Currently, the Object2Vec algorithm can train only on a single machine. However, it does offer support for multiple GPUs. Object2Vec supports P2, P3, G4dn, and G5 GPU instance families for training and inference.

For inference with a trained Object2Vec model that has a deep neural network, we recommend using an ml.p3.2xlarge GPU instance. Due to GPU memory scarcity, you can set the `INFERENCE_PREFERRED_MODE` environment variable to specify whether the [GPU optimization: Classification or Regression](object2vec-inference-formats.md#object2vec-inference-gpu-optimize-classification) or [GPU optimization: Encoder Embeddings](object2vec-encoder-embeddings.md#object2vec-inference-gpu-optimize-encoder-embeddings) inference network is loaded into the GPU.

## Object2Vec Sample Notebooks
<a name="object2vec-sample-notebooks"></a>
+ [Using Object2Vec to Encode Sentences into Fixed Length Embeddings](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/object2vec_sentence_similarity/object2vec_sentence_similarity.html)

# How Object2Vec Works
<a name="object2vec-howitworks"></a>

When using the Amazon SageMaker AI Object2Vec algorithm, you follow the standard workflow: process the data, train the model, and produce inferences. 

**Topics**
+ [Step 1: Process Data](#object2vec-step-1-data-preprocessing)
+ [Step 2: Train a Model](#object2vec-step-2-training-model)
+ [Step 3: Produce Inferences](#object2vec-step-3-inference)

## Step 1: Process Data
<a name="object2vec-step-1-data-preprocessing"></a>

During preprocessing, convert the data to the [JSON Lines](http://jsonlines.org/) text file format specified in [Data Formats for Object2Vec Training](object2vec-training-formats.md). To get the highest accuracy during training, also randomly shuffle the data before feeding it into the model. How you generate random permutations depends on the language. For Python, you could use `np.random.shuffle`; for Unix, `shuf`.
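The shuffling step can be sketched as follows. The records here are hypothetical examples in the `label`/`in0`/`in1` JSON Lines shape; the stdlib `random` module stands in for any shuffle utility:

```
import random

# Hypothetical JSON Lines training records (label plus the two input objects).
lines = [
    '{"label": 1, "in0": [1, 2], "in1": [3, 4]}\n',
    '{"label": 0, "in0": [5], "in1": [6, 7]}\n',
    '{"label": 1, "in0": [8, 9], "in1": [10]}\n',
]

random.seed(42)        # seed only so the sketch is reproducible
random.shuffle(lines)  # shuffle whole records in place before writing the training file
```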

## Step 2: Train a Model
<a name="object2vec-step-2-training-model"></a>

The SageMaker AI Object2Vec algorithm has the following main components:
+ **Two input channels** – The input channels take a pair of objects of the same or different types as inputs, and pass them to independent and customizable encoders.
+ **Two encoders** – The two encoders, enc0 and enc1, convert each object into a fixed-length embedding vector. The encoded embeddings of the objects in the pair are then passed into a comparator.
+ **A comparator** – The comparator compares the embeddings in different ways and outputs scores that indicate the strength of the relationship between the paired objects. For example, in the output score for a sentence pair, 1 indicates a strong relationship and 0 represents a weak relationship. 

During training, the algorithm accepts pairs of objects and their relationship labels or scores as inputs. The objects in each pair can be of different types, as described earlier. If the inputs to both encoders are composed of the same token-level units, you can use a shared token embedding layer by setting the `tied_token_embedding_weight` hyperparameter to `True` when you create the training job. This is possible, for example, when comparing sentences that both have word token-level units. To generate negative samples at a specified rate, set the `negative_sampling_rate` hyperparameter to the desired ratio of negative to positive samples. This hyperparameter expedites learning how to discriminate between the positive samples observed in the training data and the negative samples that are not likely to be observed. 

Pairs of objects are passed through independent, customizable encoders that are compatible with the input types of the corresponding objects. The encoders convert each object in a pair into a fixed-length embedding vector of equal length. The pair of vectors is passed to a comparator operator, which assembles the vectors into a single vector using the value specified in the `comparator_list` hyperparameter. The assembled vector then passes through a multilayer perceptron (MLP) layer, which produces an output that the loss function compares with the labels that you provided. This comparison evaluates the strength of the relationship between the objects in the pair as predicted by the model. The following figure shows this workflow.

![\[Architecture of the Object2Vec Algorithm from Data Inputs to Scores\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/object2vec-training-image.png)
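The comparator step in this workflow can be sketched in plain Python (an illustration of the documented operators, not the actual implementation):

```python
def comparator(enc0_vec, enc1_vec, comparator_list=("hadamard", "concat", "abs_diff")):
    """Assemble one vector from two equal-length encodings.

    Illustrative sketch only; mirrors the documented operators:
      hadamard  -> element-wise product (same length as one encoding)
      concat    -> the two encodings appended (twice the length)
      abs_diff  -> element-wise absolute difference
    """
    assert len(enc0_vec) == len(enc1_vec), "encodings must have the same dimension"
    ops = {
        "hadamard": lambda a, b: [x * y for x, y in zip(a, b)],
        "concat": lambda a, b: list(a) + list(b),
        "abs_diff": lambda a, b: [abs(x - y) for x, y in zip(a, b)],
    }
    out = []
    for name in comparator_list:  # subvectors are concatenated in listed order
        out.extend(ops[name](enc0_vec, enc1_vec))
    return out

v = comparator([1.0, 2.0], [3.0, 4.0])
# hadamard -> [3.0, 8.0]; concat -> [1.0, 2.0, 3.0, 4.0]; abs_diff -> [2.0, 2.0]
```

In the real algorithm, the assembled vector is then consumed by the MLP layer described above.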


## Step 3: Produce Inferences
<a name="object2vec-step-3-inference"></a>

After the model is trained, you can use it to perform two types of inference:
+ To convert singleton input objects into fixed-length embeddings using the corresponding encoder
+ To predict the relationship label or score between a pair of input objects

The inference server automatically determines which type of inference is requested based on the input data. To get the embeddings as output, provide only one input. To predict the relationship label or score, provide both inputs in the pair.

# Object2Vec Hyperparameters
<a name="object2vec-hyperparameters"></a>

In the `CreateTrainingJob` request, you specify the training algorithm. You can also specify algorithm-specific hyperparameters as string-to-string maps. The following table lists the hyperparameters for the Object2Vec training algorithm.


| Parameter Name | Description | 
| --- | --- | 
| `enc0_max_seq_len` |  The maximum sequence length for the enc0 encoder. **Required** Valid values: 1 ≤ integer ≤ 5000  | 
| `enc0_vocab_size` |  The vocabulary size of enc0 tokens. **Required** Valid values: 2 ≤ integer ≤ 3000000  | 
| `bucket_width` |  The allowed difference between data sequence length when bucketing is enabled. To enable bucketing, specify a non-zero value for this parameter. **Optional** Valid values: 0 ≤ integer ≤ 100 Default value: 0 (no bucketing)  | 
| `comparator_list` |  A list used to customize the way in which two embeddings are compared. The Object2Vec comparator operator layer takes the encodings from both encoders as inputs and outputs a single vector. This vector is a concatenation of subvectors. The string values passed to `comparator_list` and the order in which they are passed determine how these subvectors are assembled. For example, if `comparator_list="hadamard, concat"`, then the comparator operator constructs the vector by concatenating the Hadamard product of the two encodings and the concatenation of the two encodings. If, on the other hand, `comparator_list="hadamard"`, then the comparator operator constructs the vector as the Hadamard product of the two encodings only.  **Optional** Valid values: A string that contains any combination of the names of the three binary operators: `hadamard`, `concat`, or `abs_diff`. The Object2Vec algorithm currently requires that the two vector encodings have the same dimension. These operators produce the subvectors as follows: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/object2vec-hyperparameters.html) Default value: `"hadamard, concat, abs_diff"`  | 
| `dropout` |  The dropout probability for network layers. *Dropout* is a form of regularization used in neural networks that reduces overfitting by trimming codependent neurons. **Optional** Valid values: 0.0 ≤ float ≤ 1.0 Default value: 0.0  | 
| `early_stopping_patience` |  The number of consecutive epochs without improvement allowed before early stopping is applied. Improvement is defined with the `early_stopping_tolerance` hyperparameter. **Optional** Valid values: 1 ≤ integer ≤ 5 Default value: 3  | 
| `early_stopping_tolerance` |  The reduction in the loss function that an algorithm must achieve between consecutive epochs to avoid early stopping after the number of consecutive epochs specified in the `early_stopping_patience` hyperparameter concludes. **Optional** Valid values: 0.000001 ≤ float ≤ 0.1 Default value: 0.01  | 
| `enc_dim` |  The dimension of the output of the embedding layer. **Optional** Valid values: 4 ≤ integer ≤ 10000 Default value: 4096  | 
| `enc0_network` |  The network model for the enc0 encoder. **Optional** Valid values: `hcnn`, `bilstm`, or `pooled_embedding` [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/object2vec-hyperparameters.html) Default value: `hcnn`  | 
| `enc0_cnn_filter_width` |  The filter width of the convolutional neural network (CNN) enc0 encoder. **Conditional** Valid values: 1 ≤ integer ≤ 9 Default value: 3  | 
| `enc0_freeze_pretrained_embedding` |  Whether to freeze enc0 pretrained embedding weights. **Conditional** Valid values: `True` or `False` Default value: `True`  | 
| `enc0_layers` |  The number of layers in the enc0 encoder. **Conditional** Valid values: `auto` or 1 ≤ integer ≤ 4 [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/object2vec-hyperparameters.html) Default value: `auto`  | 
| `enc0_pretrained_embedding_file` |  The filename of the pretrained enc0 token embedding file in the auxiliary data channel. **Conditional** Valid values: String with alphanumeric characters, underscore, or period. `[A-Za-z0-9_.]`  Default value: "" (empty string)  | 
| `enc0_token_embedding_dim` |  The output dimension of the enc0 token embedding layer. **Conditional** Valid values: 2 ≤ integer ≤ 1000 Default value: 300  | 
| `enc0_vocab_file` |  The vocabulary file for mapping pretrained enc0 token embedding vectors to numerical vocabulary IDs. **Conditional** Valid values: String with alphanumeric characters, underscore, or period. `[A-Za-z0-9_.]`  Default value: "" (empty string)  | 
| `enc1_network` |  The network model for the enc1 encoder. If you want the enc1 encoder to use the same network model as enc0, including the hyperparameter values, set the value to `enc0`.   Even when the enc0 and enc1 encoder networks have symmetric architectures, you can't share parameter values for these networks.  **Optional** Valid values: `enc0`, `hcnn`, `bilstm`, or `pooled_embedding` [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/object2vec-hyperparameters.html) Default value: `enc0`  | 
| `enc1_cnn_filter_width` |  The filter width of the CNN enc1 encoder. **Conditional** Valid values: 1 ≤ integer ≤ 9 Default value: 3  | 
| `enc1_freeze_pretrained_embedding` |  Whether to freeze enc1 pretrained embedding weights. **Conditional** Valid values: `True` or `False` Default value: `True`  | 
| `enc1_layers` |  The number of layers in the enc1 encoder. **Conditional** Valid values: `auto` or 1 ≤ integer ≤ 4 [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/object2vec-hyperparameters.html) Default value: `auto`  | 
| `enc1_max_seq_len` |  The maximum sequence length for the enc1 encoder. **Conditional** Valid values: 1 ≤ integer ≤ 5000  | 
| `enc1_pretrained_embedding_file` |  The name of the enc1 pretrained token embedding file in the auxiliary data channel. **Conditional** Valid values: String with alphanumeric characters, underscore, or period. `[A-Za-z0-9_.]`  Default value: "" (empty string)  | 
| `enc1_token_embedding_dim` |  The output dimension of the enc1 token embedding layer. **Conditional** Valid values: 2 ≤ integer ≤ 1000 Default value: 300  | 
| `enc1_vocab_file` |  The vocabulary file for mapping pretrained enc1 token embeddings to vocabulary IDs. **Conditional** Valid values: String with alphanumeric characters, underscore, or period. `[A-Za-z0-9_.]`  Default value: "" (empty string)  | 
| `enc1_vocab_size` |  The vocabulary size of enc1 tokens. **Conditional** Valid values: 2 ≤ integer ≤ 3000000  | 
| `epochs` |  The number of epochs to run for training.  **Optional** Valid values: 1 ≤ integer ≤ 100 Default value: 30  | 
| `learning_rate` |  The learning rate for training. **Optional** Valid values: 1.0E-6 ≤ float ≤ 1.0 Default value: 0.0004  | 
| `mini_batch_size` |  The batch size that the dataset is split into for an `optimizer` during training. **Optional** Valid values: 1 ≤ integer ≤ 10000 Default value: 32  | 
| `mlp_activation` |  The type of activation function for the multilayer perceptron (MLP) layer. **Optional** Valid values: `tanh`, `relu`, or `linear` [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/object2vec-hyperparameters.html) Default value: `linear`  | 
| `mlp_dim` |  The dimension of the output from MLP layers. **Optional** Valid values: 2 ≤ integer ≤ 10000 Default value: 512  | 
| `mlp_layers` |  The number of MLP layers in the network. **Optional** Valid values: 0 ≤ integer ≤ 10 Default value: 2  | 
| `negative_sampling_rate` |  The ratio of negative samples, generated to assist in training the algorithm, to positive samples that are provided by users. Negative samples represent data that is unlikely to occur in reality and is labeled negatively for training. They facilitate training a model to discriminate between the positive samples observed and the negative samples that are not. To specify the ratio of negative samples to positive samples used for training, set the value to a positive integer. For example, if you train the algorithm on input data in which all of the samples are positive and set `negative_sampling_rate` to 2, the Object2Vec algorithm internally generates two negative samples per positive sample. If you don't want to generate or use negative samples during training, set the value to 0.  **Optional** Valid values: 0 ≤ integer Default value: 0 (off)  | 
| `num_classes` |  The number of classes for classification training. Amazon SageMaker AI ignores this hyperparameter for regression problems. **Optional** Valid values: 2 ≤ integer ≤ 30 Default value: 2  | 
| `optimizer` |  The optimizer type. **Optional** Valid values: `adadelta`, `adagrad`, `adam`, `sgd`, or `rmsprop`. [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/object2vec-hyperparameters.html) Default value: `adam`  | 
| `output_layer` |  The type of output layer where you specify that the task is regression or classification. **Optional** Valid values: `softmax` or `mean_squared_error` [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/object2vec-hyperparameters.html) Default value: `softmax`  | 
| `tied_token_embedding_weight` |  Whether to use a shared embedding layer for both encoders. If the inputs to both encoders use the same token-level units, use a shared token embedding layer. For example, for a collection of documents, if one encoder encodes sentences and another encodes whole documents, you can use a shared token embedding layer. That's because both sentences and documents are composed of word tokens from the same vocabulary. **Optional** Valid values: `True` or `False` Default value: `False`  | 
| `token_embedding_storage_type` |  The mode of gradient update used during training: when `dense` mode is used, the optimizer calculates the full gradient matrix for the token embedding layer even if most rows of the gradient are zero-valued. When `row_sparse` mode is used, the optimizer stores only the rows of the gradient that are actually used in the mini-batch. If you want the algorithm to perform lazy gradient updates, which calculate the gradients only in the non-zero rows and which speed up training, specify `row_sparse`. Setting the value to `row_sparse` constrains the values available for other hyperparameters, as follows:  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/object2vec-hyperparameters.html) **Optional** Valid values: `dense` or `row_sparse` Default value: `dense`  | 
| `weight_decay` |  The weight decay parameter used for optimization. **Optional** Valid values: 0 ≤ float ≤ 10000 Default value: 0 (no decay)  | 

# Tune an Object2Vec Model
<a name="object2vec-tuning"></a>

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. For the objective metric, you use one of the metrics that the algorithm computes. Automatic model tuning searches the chosen hyperparameters to find the combination of values that result in the model that optimizes the objective metric.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics Computed by the Object2Vec Algorithm
<a name="object2vec-metrics"></a>

The Object2Vec algorithm has both classification and regression metrics. The `output_layer` type determines which metric you can use for automatic model tuning. 

### Regressor Metrics Computed by the Object2Vec Algorithm
<a name="object2vec-regressor-metrics"></a>

The algorithm reports a mean squared error regressor metric, which is computed during testing and validation. When tuning the model for regression tasks, choose this metric as the objective.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| `test:mean_squared_error` | The mean squared error | Minimize | 
| `validation:mean_squared_error` | The mean squared error | Minimize | 

### Classification Metrics Computed by the Object2Vec Algorithm
<a name="object2vec-classification-metrics"></a>

The Object2Vec algorithm reports accuracy and cross-entropy classification metrics, which are computed during test and validation. When tuning the model for classification tasks, choose one of these as the objective.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| `test:accuracy` | Accuracy | Maximize | 
| `test:cross_entropy` | Cross-entropy | Minimize | 
| `validation:accuracy` | Accuracy | Maximize | 
| `validation:cross_entropy` | Cross-entropy | Minimize | 

## Tunable Object2Vec Hyperparameters
<a name="object2vec-tunable-hyperparameters"></a>

You can tune the following hyperparameters for the Object2Vec algorithm.


| Hyperparameter Name | Hyperparameter Type | Recommended Ranges and Values | 
| --- | --- | --- | 
| `dropout` | ContinuousParameterRange | MinValue: 0.0, MaxValue: 1.0 | 
| `early_stopping_patience` | IntegerParameterRange | MinValue: 1, MaxValue: 5 | 
| `early_stopping_tolerance` | ContinuousParameterRange | MinValue: 0.001, MaxValue: 0.1 | 
| `enc_dim` | IntegerParameterRange | MinValue: 4, MaxValue: 4096 | 
| `enc0_cnn_filter_width` | IntegerParameterRange | MinValue: 1, MaxValue: 5 | 
| `enc0_layers` | IntegerParameterRange | MinValue: 1, MaxValue: 4 | 
| `enc0_token_embedding_dim` | IntegerParameterRange | MinValue: 5, MaxValue: 300 | 
| `enc1_cnn_filter_width` | IntegerParameterRange | MinValue: 1, MaxValue: 5 | 
| `enc1_layers` | IntegerParameterRange | MinValue: 1, MaxValue: 4 | 
| `enc1_token_embedding_dim` | IntegerParameterRange | MinValue: 5, MaxValue: 300 | 
| `epochs` | IntegerParameterRange | MinValue: 4, MaxValue: 20 | 
| `learning_rate` | ContinuousParameterRange | MinValue: 1e-6, MaxValue: 1.0 | 
| `mini_batch_size` | IntegerParameterRange | MinValue: 1, MaxValue: 8192 | 
| `mlp_activation` | CategoricalParameterRanges |  [`tanh`, `relu`, `linear`]  | 
| `mlp_dim` | IntegerParameterRange | MinValue: 16, MaxValue: 1024 | 
| `mlp_layers` | IntegerParameterRange | MinValue: 1, MaxValue: 4 | 
| `optimizer` | CategoricalParameterRanges | [`adagrad`, `adam`, `rmsprop`, `sgd`, `adadelta`] | 
| `weight_decay` | ContinuousParameterRange | MinValue: 0.0, MaxValue: 1.0 | 
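As a sketch of how these ranges map onto a tuning job, the following abbreviated fragment shows how a few of them might appear in the `ParameterRanges` section of a `CreateHyperParameterTuningJob` request (not a complete request; the training job definition and other required fields are omitted, and the specific selections here are only examples):

```
{
  "HyperParameterTuningJobConfig": {
    "Strategy": "Bayesian",
    "HyperParameterTuningJobObjective": {
      "Type": "Minimize",
      "MetricName": "validation:cross_entropy"
    },
    "ParameterRanges": {
      "ContinuousParameterRanges": [
        {"Name": "learning_rate", "MinValue": "1e-6", "MaxValue": "1.0"},
        {"Name": "dropout", "MinValue": "0.0", "MaxValue": "1.0"}
      ],
      "IntegerParameterRanges": [
        {"Name": "mlp_dim", "MinValue": "16", "MaxValue": "1024"}
      ],
      "CategoricalParameterRanges": [
        {"Name": "optimizer", "Values": ["adam", "sgd", "rmsprop"]}
      ]
    },
    "ResourceLimits": {"MaxNumberOfTrainingJobs": 20, "MaxParallelTrainingJobs": 2}
  }
}
```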

# Data Formats for Object2Vec Training
<a name="object2vec-training-formats"></a>

When training with the Object2Vec algorithm, make sure that the input data in your request is in JSON Lines format, where each line represents a single data point.

## Input: JSON Lines Request Format
<a name="object2vec-in-training-data-jsonlines"></a>

Content-type: application/jsonlines

```
{"label": 0, "in0": [6, 17, 606, 19, 53, 67, 52, 12, 5, 10, 15, 10178, 7, 33, 652, 80, 15, 69, 821, 4], "in1": [16, 21, 13, 45, 14, 9, 80, 59, 164, 4]}
{"label": 1, "in0": [22, 1016, 32, 13, 25, 11, 5, 64, 573, 45, 5, 80, 15, 67, 21, 7, 9, 107, 4], "in1": [22, 32, 13, 25, 1016, 573, 3252, 4]}
{"label": 1, "in0": [774, 14, 21, 206], "in1": [21, 366, 125]}
```

The `"in0"` and `"in1"` fields are the inputs for the enc0 and enc1 encoders, respectively. The same format is valid for both classification and regression problems. For regression, the `"label"` field can accept real-valued inputs.
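Records in this format can be produced with a few lines of Python (a sketch; `to_jsonlines` is a hypothetical helper, and the token IDs are assumed to be integer vocabulary indices as in the example above):

```python
import json

def to_jsonlines(pairs):
    """Serialize (label, in0_tokens, in1_tokens) triples to JSON Lines."""
    lines = []
    for label, in0, in1 in pairs:
        lines.append(json.dumps({"label": label, "in0": in0, "in1": in1}))
    return "\n".join(lines) + "\n"

payload = to_jsonlines([
    (0, [6, 17, 606], [16, 21, 13]),
    (1, [774, 14, 21, 206], [21, 366, 125]),
])
```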

# Data Formats for Object2Vec Inference
<a name="object2vec-inference-formats"></a>

This section describes the input request and output response formats for getting scoring inference from the Amazon SageMaker AI Object2Vec model.

## GPU optimization: Classification or Regression
<a name="object2vec-inference-gpu-optimize-classification"></a>

Because GPU memory is limited, you can set the `INFERENCE_PREFERRED_MODE` environment variable to choose whether the classification/regression inference network or the [Output: Encoder Embeddings](object2vec-encoder-embeddings.md#object2vec-out-encoder-embeddings-data) inference network is loaded into the GPU. If the majority of your inference requests are for classification or regression, specify `INFERENCE_PREFERRED_MODE=classification`. The following batch transform example uses four ml.p3.2xlarge instances and optimizes for classification/regression inference:

```
transformer = o2v.transformer(instance_count=4,
                              instance_type="ml.p3.2xlarge",
                              max_concurrent_transforms=2,
                              max_payload=1,  # 1MB
                              strategy='MultiRecord',
                              env={'INFERENCE_PREFERRED_MODE': 'classification'},  # only useful with GPU
                              output_path=output_s3_path)
```

## Input: Classification or Regression Request Format
<a name="object2vec-in-inference-data"></a>

Content-type: application/json

```
{
  "instances" : [
    {"in0": [6, 17, 606, 19, 53, 67, 52, 12, 5, 10, 15, 10178, 7, 33, 652, 80, 15, 69, 821, 4], "in1": [16, 21, 13, 45, 14, 9, 80, 59, 164, 4]},
    {"in0": [22, 1016, 32, 13, 25, 11, 5, 64, 573, 45, 5, 80, 15, 67, 21, 7, 9, 107, 4], "in1": [22, 32, 13, 25, 1016, 573, 3252, 4]},
    {"in0": [774, 14, 21, 206], "in1": [21, 366, 125]}
  ]
}
```

Content-type: application/jsonlines

```
{"in0": [6, 17, 606, 19, 53, 67, 52, 12, 5, 10, 15, 10178, 7, 33, 652, 80, 15, 69, 821, 4], "in1": [16, 21, 13, 45, 14, 9, 80, 59, 164, 4]}
{"in0": [22, 1016, 32, 13, 25, 11, 5, 64, 573, 45, 5, 80, 15, 67, 21, 7, 9, 107, 4], "in1": [22, 32, 13, 25, 1016, 573, 3252, 4]}
{"in0": [774, 14, 21, 206], "in1": [21, 366, 125]}
```

For classification problems, the length of the scores vector corresponds to `num_classes`. For regression problems, the length is 1.

## Output: Classification or Regression Response Format
<a name="object2vec-out-inference-data"></a>

Accept: application/json

```
{
    "predictions": [
        {
            "scores": [
                0.6533935070037842,
                0.07582679390907288,
                0.2707797586917877
            ]
        },
        {
            "scores": [
                0.026291321963071823,
                0.6577019095420837,
                0.31600672006607056
            ]
        }
    ]
}
```

Accept: application/jsonlines

```
{"scores":[0.195667684078216,0.395351558923721,0.408980727195739]}
{"scores":[0.251988261938095,0.258233487606048,0.489778339862823]}
{"scores":[0.280087798833847,0.368331134319305,0.351581096649169]}
```

In both the classification and regression formats, each score applies to an individual label. 
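For example, the `application/jsonlines` response above can be post-processed like this (a sketch; for classification, the predicted class is the index of the highest score, while for regression `"scores"` holds a single real value):

```python
import json

response_lines = '''{"scores":[0.195667684078216,0.395351558923721,0.408980727195739]}
{"scores":[0.251988261938095,0.258233487606048,0.489778339862823]}'''

predictions = []
for line in response_lines.splitlines():
    scores = json.loads(line)["scores"]
    # Predicted class = index of the highest score.
    predictions.append(max(range(len(scores)), key=scores.__getitem__))
```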

# Encoder Embeddings for Object2Vec
<a name="object2vec-encoder-embeddings"></a>

This section describes the input request and output response formats for getting encoder embedding inference from the Amazon SageMaker AI Object2Vec model.

## GPU optimization: Encoder Embeddings
<a name="object2vec-inference-gpu-optimize-encoder-embeddings"></a>

An embedding is a mapping from discrete objects, such as words, to vectors of real numbers.

Because GPU memory is limited, you can set the `INFERENCE_PREFERRED_MODE` environment variable to choose whether the [Data Formats for Object2Vec Inference](object2vec-inference-formats.md) network or the encoder embedding inference network is loaded into the GPU. If the majority of your inference requests are for encoder embeddings, specify `INFERENCE_PREFERRED_MODE=embedding`. The following batch transform example uses four ml.p3.2xlarge instances and optimizes for encoder embedding inference:

```
transformer = o2v.transformer(instance_count=4,
                              instance_type="ml.p3.2xlarge",
                              max_concurrent_transforms=2,
                              max_payload=1,  # 1MB
                              strategy='MultiRecord',
                              env={'INFERENCE_PREFERRED_MODE': 'embedding'},  # only useful with GPU
                              output_path=output_s3_path)
```

## Input: Encoder Embeddings
<a name="object2vec-in-encoder-embeddings-data"></a>

Content-type: application/json; infer_max_seqlens=<FWD-LENGTH>,<BCK-LENGTH>

Where <FWD-LENGTH> and <BCK-LENGTH> are integers in the range [1,5000] and define the maximum sequence lengths for the forward and backward encoder.

```
{
  "instances" : [
    {"in0": [6, 17, 606, 19, 53, 67, 52, 12, 5, 10, 15, 10178, 7, 33, 652, 80, 15, 69, 821, 4]},
    {"in0": [22, 1016, 32, 13, 25, 11, 5, 64, 573, 45, 5, 80, 15, 67, 21, 7, 9, 107, 4]},
    {"in0": [774, 14, 21, 206]}
  ]
}
```

Content-type: application/jsonlines; infer_max_seqlens=<FWD-LENGTH>,<BCK-LENGTH>

Where <FWD-LENGTH> and <BCK-LENGTH> are integers in the range [1,5000] and define the maximum sequence lengths for the forward and backward encoder.

```
{"in0": [6, 17, 606, 19, 53, 67, 52, 12, 5, 10, 15, 10178, 7, 33, 652, 80, 15, 69, 821, 4]}
{"in0": [22, 1016, 32, 13, 25, 11, 5, 64, 573, 45, 5, 80, 15, 67, 21, 7, 9, 107, 4]}
{"in0": [774, 14, 21, 206]}
```

In both of these formats, you specify only one input type: `"in0"` or `"in1"`. The inference service then invokes the corresponding encoder and outputs the embeddings for each of the instances. 
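A request in the `application/json` format above might be assembled as follows (a sketch; `embedding_request` is a hypothetical helper, and the resulting content type and body would then be passed to a SageMaker runtime call such as boto3's `invoke_endpoint`):

```python
import json

def embedding_request(token_id_lists, fwd_len=100, bck_len=100):
    """Build the content type and JSON body for an encoder-embedding call.

    fwd_len/bck_len are the <FWD-LENGTH>/<BCK-LENGTH> maximum sequence
    lengths described above.
    """
    content_type = f"application/json; infer_max_seqlens={fwd_len},{bck_len}"
    body = json.dumps({"instances": [{"in0": ids} for ids in token_id_lists]})
    return content_type, body

content_type, body = embedding_request([[774, 14, 21, 206]])
```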

## Output: Encoder Embeddings
<a name="object2vec-out-encoder-embeddings-data"></a>

Content-type: application/json

```
{
  "predictions": [
    {"embeddings":[0.057368703186511,0.030703511089086,0.099890425801277,0.063688032329082,0.026327300816774,0.003637571120634,0.021305780857801,0.004316598642617,0.0,0.003397724591195,0.0,0.000378780066967,0.0,0.0,0.0,0.007419463712722]},
    {"embeddings":[0.150190666317939,0.05145975202322,0.098204270005226,0.064249359071254,0.056249320507049,0.01513972133398,0.047553978860378,0.0,0.0,0.011533712036907,0.011472506448626,0.010696629062294,0.0,0.0,0.0,0.008508535102009]}
  ]
}
```

Content-type: application/jsonlines

```
{"embeddings":[0.057368703186511,0.030703511089086,0.099890425801277,0.063688032329082,0.026327300816774,0.003637571120634,0.021305780857801,0.004316598642617,0.0,0.003397724591195,0.0,0.000378780066967,0.0,0.0,0.0,0.007419463712722]}
{"embeddings":[0.150190666317939,0.05145975202322,0.098204270005226,0.064249359071254,0.056249320507049,0.01513972133398,0.047553978860378,0.0,0.0,0.011533712036907,0.011472506448626,0.010696629062294,0.0,0.0,0.0,0.008508535102009]}
```

The vector length of the embeddings output by the inference service is equal to the value of one of the following hyperparameters that you specify at training time: `enc0_token_embedding_dim`, `enc1_token_embedding_dim`, or `enc_dim`.

# Sequence-to-Sequence Algorithm
<a name="seq-2-seq"></a>

Amazon SageMaker AI Sequence to Sequence is a supervised learning algorithm where the input is a sequence of tokens (for example, text or audio) and the output generated is another sequence of tokens. Example applications include machine translation (input a sentence in one language and predict that sentence in another language), text summarization (input a longer string of words and predict a shorter string that summarizes it), and speech-to-text (audio clips converted into output sentences in tokens). Recently, problems in this domain have been successfully modeled with deep neural networks, which show a significant performance boost over previous methodologies. Amazon SageMaker AI seq2seq uses Recurrent Neural Network (RNN) and Convolutional Neural Network (CNN) models with attention as encoder-decoder architectures. 

**Topics**
+ [Input/Output Interface for the Sequence-to-Sequence Algorithm](#s2s-inputoutput)
+ [EC2 Instance Recommendation for the Sequence-to-Sequence Algorithm](#s2s-instances)
+ [Sequence-to-Sequence Sample Notebooks](#seq-2-seq-sample-notebooks)
+ [How Sequence-to-Sequence Works](seq-2-seq-howitworks.md)
+ [Sequence-to-Sequence Hyperparameters](seq-2-seq-hyperparameters.md)
+ [Tune a Sequence-to-Sequence Model](seq-2-seq-tuning.md)

## Input/Output Interface for the Sequence-to-Sequence Algorithm
<a name="s2s-inputoutput"></a>

**Training**

SageMaker AI seq2seq expects data in RecordIO-Protobuf format. However, the tokens are expected as integers, not as floating-point values, as is usually the case.

A script to convert data from tokenized text files to the protobuf format is included in [the seq2seq example notebook](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/seq2seq_translation_en-de/SageMaker-Seq2Seq-Translation-English-German.html). In general, it packs the data into 32-bit integer tensors and generates the necessary vocabulary files, which are needed for metric calculation and inference.
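The integer encoding itself can be illustrated with a toy example (a sketch; the real vocabulary files, `vocab.src.json` and `vocab.trg.json`, are generated by the preprocessing script in the notebook, and the vocabulary and unknown-token ID below are hypothetical):

```python
def encode(sentence, vocab, unk_id=1):
    """Map whitespace-separated tokens to integer IDs (toy illustration)."""
    return [vocab.get(token, unk_id) for token in sentence.split()]

# Toy vocabulary; "down" is out of vocabulary and maps to unk_id.
vocab = {"the": 2, "cat": 3, "sat": 4}
ids = encode("the cat sat down", vocab)
```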

After preprocessing is done, the algorithm can be invoked for training. The algorithm expects three channels:
+ `train`: It should contain the training data (for example, the `train.rec` file generated by the preprocessing script).
+ `validation`: It should contain the validation data (for example, the `val.rec` file generated by the preprocessing script).
+ `vocab`: It should contain two vocabulary files (`vocab.src.json` and `vocab.trg.json`). 

If the algorithm doesn't find data in any of these three channels, training results in an error.

**Inference**

For hosted endpoints, inference supports two data formats. To perform inference using space-separated text tokens, use the `application/json` format. Otherwise, use the `recordio-protobuf` format to work with the integer-encoded data. Both modes support batching of input data. The `application/json` format also allows you to visualize the attention matrix.
+ `application/json`: Expects the input in JSON format and returns the output in JSON format. Both content and accept types should be `application/json`. Each sequence is expected to be a string with whitespace-separated tokens. This format is recommended when the number of source sequences in the batch is small. It also supports the following additional configuration options:

  `configuration`: {`attention_matrix`: `true`}: Returns the attention matrix for the particular input sequence.
+ `application/x-recordio-protobuf`: Expects the input in `recordio-protobuf` format and returns the output in `recordio-protobuf` format. Both content and accept types should be `application/x-recordio-protobuf`. For this format, the source sequences must be converted into a list of integers for subsequent protobuf encoding. This format is recommended for bulk inference.

For batch transform, inference supports JSON Lines format. Batch transform expects the input in JSON Lines format and returns the output in JSON Lines format. Both content and accept types should be `application/jsonlines`. The format for input is as follows:

```
content-type: application/jsonlines

{"source": "source_sequence_0"}
{"source": "source_sequence_1"}
```

The format for response is as follows:

```
accept: application/jsonlines

{"target": "predicted_sequence_0"}
{"target": "predicted_sequence_1"}
```

For additional details on how to serialize and deserialize the inputs and outputs to specific formats for inference, see the [Sequence-to-Sequence Sample Notebooks](#seq-2-seq-sample-notebooks).

## EC2 Instance Recommendation for the Sequence-to-Sequence Algorithm
<a name="s2s-instances"></a>

The Amazon SageMaker AI seq2seq algorithm is supported only on GPU instance types and can train only on a single machine. However, you can use instances with multiple GPUs. The seq2seq algorithm supports the P2, P3, G4dn, and G5 GPU instance families.

## Sequence-to-Sequence Sample Notebooks
<a name="seq-2-seq-sample-notebooks"></a>

For a sample notebook that shows how to use the SageMaker AI Sequence to Sequence algorithm to train an English-German translation model, see [Machine Translation English-German Example Using SageMaker AI Seq2Seq](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/seq2seq_translation_en-de/SageMaker-Seq2Seq-Translation-English-German.html). For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, select the **SageMaker AI Examples** tab to see a list of all the SageMaker AI samples. The seq2seq example notebooks are located in the **Introduction to Amazon algorithms** section. To open a notebook, choose its **Use** tab and select **Create copy**.

# How Sequence-to-Sequence Works
<a name="seq-2-seq-howitworks"></a>

Typically, a neural network for sequence-to-sequence modeling consists of a few layers, including: 
+ An **embedding layer**. In this layer, the input matrix of sparsely encoded input tokens (for example, one-hot encoded) is mapped to a dense feature layer. This is required because a high-dimensional feature vector can encode more information about a particular token (a word for text corpora) than a simple one-hot-encoded vector. It is also standard practice to initialize this embedding layer with pretrained word vectors like [FastText](https://fasttext.cc/) or [GloVe](https://nlp.stanford.edu/projects/glove/), or to initialize it randomly and learn the parameters during training. 
+ An **encoder layer**. After the input tokens are mapped into a high-dimensional feature space, the sequence is passed through an encoder layer that compresses all the information from the input embedding layer (of the entire sequence) into a fixed-length feature vector. Typically, an encoder is made of RNN-type networks like long short-term memory (LSTM) or gated recurrent units (GRU). ([Colah's blog](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) explains LSTM in great detail.) 
+ A **decoder layer**. The decoder layer takes the encoded feature vector and produces the output sequence of tokens. This layer is also usually built with RNN architectures (LSTM and GRU). 

The whole model is trained jointly to maximize the probability of the target sequence given the source sequence. This model was first introduced by [Sutskever et al.](https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf) in 2014. 
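The embedding and encoding steps can be illustrated conceptually (a toy sketch, not the SageMaker implementation; mean pooling stands in for the LSTM/GRU encoder, and the embedding table values are made up):

```python
# Toy embedding table: token ID -> dense 3-dimensional vector.
embedding = {
    0: [0.1, 0.0, 0.2],
    1: [0.4, 0.3, 0.0],
    2: [0.0, 0.5, 0.5],
}

def encode_sequence(token_ids):
    """Compress an embedded token sequence into one fixed-length vector.

    A real seq2seq encoder would be an LSTM/GRU; mean pooling is used
    here only to show that the output length is independent of the
    input sequence length.
    """
    vectors = [embedding[t] for t in token_ids]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

feature = encode_sequence([0, 1, 2])
```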

**Attention mechanism**. The disadvantage of an encoder-decoder framework is that model performance decreases as the length of the source sequence increases, because of the limit on how much information the fixed-length encoded feature vector can contain. To tackle this problem, in 2015, Bahdanau et al. proposed the [attention mechanism](https://arxiv.org/pdf/1409.0473.pdf). With an attention mechanism, the decoder tries to find the location in the encoder sequence where the most important information could be located and uses that information and previously decoded words to predict the next token in the sequence. 
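As a sketch of the idea (not the algorithm's exact implementation), dot-product attention fits in a few lines of plain Python. The encoder states and decoder state below are made-up numbers:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical encoder hidden states (one vector per source token) and the
# decoder's current hidden state; values are chosen only for illustration.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
decoder_state = [0.9, 0.1]

# Dot-product attention: score each source position against the decoder state,
# normalize the scores, then form a context vector as the weighted sum of
# encoder states. The decoder uses this context to predict the next token.
scores = [sum(d * e for d, e in zip(decoder_state, s)) for s in encoder_states]
weights = softmax(scores)
context = [sum(w * s[i] for w, s in zip(weights, encoder_states))
           for i in range(len(decoder_state))]
```

Here the first source position scores highest, so it dominates the context vector; the attention weights always sum to 1.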

For more details, see the whitepaper [Effective Approaches to Attention-based Neural Machine Translation](https://arxiv.org/abs/1508.04025) by Luong et al., which explains and simplifies calculations for various attention mechanisms. Additionally, the whitepaper [Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation](https://arxiv.org/abs/1609.08144) by Wu et al. describes Google's architecture for machine translation, which uses skip connections between encoder and decoder layers.

# Sequence-to-Sequence Hyperparameters
<a name="seq-2-seq-hyperparameters"></a>

The following table lists the hyperparameters that you can set when training with the Amazon SageMaker AI Sequence-to-Sequence (seq2seq) algorithm.


| Parameter Name | Description | 
| --- | --- | 
| batch\_size | Mini-batch size for gradient descent. **Optional** Valid values: positive integer Default value: 64 | 
| beam\_size | Length of the beam for beam search. Used during training for computing `bleu` and used during inference. **Optional** Valid values: positive integer Default value: 5 | 
| bleu\_sample\_size | Number of instances to pick from the validation dataset to decode and compute the `bleu` score during training. Set to -1 to use the full validation set (if `bleu` is chosen as `optimized_metric`). **Optional** Valid values: integer Default value: 0 | 
| bucket\_width | Returns (source, target) buckets up to (`max_seq_len_source`, `max_seq_len_target`). The longer side of the data uses steps of `bucket_width`, while the shorter side uses steps scaled down by the average target/source length ratio. If one side reaches its maximum length before the other, the width of the extra buckets on that side is fixed to that side's maximum length. **Optional** Valid values: positive integer Default value: 10 | 
| bucketing\_enabled | Set to `false` to disable bucketing and unroll to the maximum length. **Optional** Valid values: `true` or `false` Default value: `true` | 
| checkpoint\_frequency\_num\_batches | Checkpoint and evaluate every x batches. This checkpointing hyperparameter is passed to the SageMaker AI seq2seq algorithm for early stopping and retrieving the best model. The algorithm's checkpointing runs locally in the algorithm's training container and is not compatible with SageMaker AI checkpointing. The algorithm temporarily saves checkpoints to a local path and stores the best model artifact to the model output path in Amazon S3 after the training job has stopped. **Optional** Valid values: positive integer Default value: 1000 | 
| checkpoint\_threshold | Maximum number of checkpoints the model is allowed to not improve in `optimized_metric` on the validation dataset before training is stopped. This checkpointing hyperparameter is passed to the SageMaker AI seq2seq algorithm for early stopping and retrieving the best model. The algorithm's checkpointing runs locally in the algorithm's training container and is not compatible with SageMaker AI checkpointing. The algorithm temporarily saves checkpoints to a local path and stores the best model artifact to the model output path in Amazon S3 after the training job has stopped. **Optional** Valid values: positive integer Default value: 3 | 
| clip\_gradient | Clip absolute gradient values greater than this value. Set to a negative value to disable. **Optional** Valid values: float Default value: 1 | 
| cnn\_activation\_type | The `cnn` activation type to use. **Optional** Valid values: String. One of `glu`, `relu`, `softrelu`, `sigmoid`, or `tanh`. Default value: `glu` | 
| cnn\_hidden\_dropout | Dropout probability for dropout between convolutional layers. **Optional** Valid values: Float. Range in [0,1]. Default value: 0 | 
| cnn\_kernel\_width\_decoder | Kernel width for the `cnn` decoder. **Optional** Valid values: positive integer Default value: 5 | 
| cnn\_kernel\_width\_encoder | Kernel width for the `cnn` encoder. **Optional** Valid values: positive integer Default value: 3 | 
| cnn\_num\_hidden | Number of `cnn` hidden units for the encoder and decoder. **Optional** Valid values: positive integer Default value: 512 | 
| decoder\_type | Decoder type. **Optional** Valid values: String. Either `rnn` or `cnn`. Default value: `rnn` | 
| embed\_dropout\_source | Dropout probability for source-side embeddings. **Optional** Valid values: Float. Range in [0,1]. Default value: 0 | 
| embed\_dropout\_target | Dropout probability for target-side embeddings. **Optional** Valid values: Float. Range in [0,1]. Default value: 0 | 
| encoder\_type | Encoder type. The `rnn` architecture is based on the attention mechanism by Bahdanau et al., and the `cnn` architecture is based on Gehring et al. **Optional** Valid values: String. Either `rnn` or `cnn`. Default value: `rnn` | 
| fixed\_rate\_lr\_half\_life | Half-life for the learning rate, in terms of the number of checkpoints, for `fixed_rate_*` schedulers. **Optional** Valid values: positive integer Default value: 10 | 
| learning\_rate | Initial learning rate. **Optional** Valid values: float Default value: 0.0003 | 
| loss\_type | Loss function for training. **Optional** Valid values: String. `cross-entropy` Default value: `cross-entropy` | 
| lr\_scheduler\_type | Learning rate scheduler type. `plateau_reduce` reduces the learning rate whenever `optimized_metric` on the validation data plateaus. `inv_t` is inverse time decay: `learning_rate/(1 + decay_rate * t)`. **Optional** Valid values: String. One of `plateau_reduce`, `fixed_rate_inv_t`, or `fixed_rate_inv_sqrt_t`. Default value: `plateau_reduce` | 
| max\_num\_batches | Maximum number of updates/batches to process. Set to -1 for infinite. **Optional** Valid values: integer Default value: -1 | 
| max\_num\_epochs | Maximum number of epochs to pass through the training data before fitting is stopped. If this parameter is passed, training continues for this number of epochs even if validation accuracy is not improving. Ignored if not passed. **Optional** Valid values: positive integer Default value: none | 
| max\_seq\_len\_source | Maximum length of the source sequence. Sequences longer than this length are truncated to this length. **Optional** Valid values: positive integer Default value: 100  | 
| max\_seq\_len\_target | Maximum length of the target sequence. Sequences longer than this length are truncated to this length. **Optional** Valid values: positive integer Default value: 100 | 
| min\_num\_epochs | Minimum number of epochs the training must run before it is stopped via `early_stopping` conditions. **Optional** Valid values: positive integer Default value: 0 | 
| momentum | Momentum constant used for `sgd`. Don't pass this parameter if you are using `adam` or `rmsprop`. **Optional** Valid values: float Default value: none | 
| num\_embed\_source | Embedding size for source tokens. **Optional** Valid values: positive integer Default value: 512 | 
| num\_embed\_target | Embedding size for target tokens. **Optional** Valid values: positive integer Default value: 512 | 
| num\_layers\_decoder | Number of layers for the decoder `rnn` or `cnn`. **Optional** Valid values: positive integer Default value: 1 | 
| num\_layers\_encoder | Number of layers for the encoder `rnn` or `cnn`. **Optional** Valid values: positive integer Default value: 1 | 
| optimized\_metric | Metric to optimize with early stopping. **Optional** Valid values: String. One of `perplexity`, `accuracy`, or `bleu`. Default value: `perplexity` | 
| optimizer\_type | Optimizer to choose from. **Optional** Valid values: String. One of `adam`, `sgd`, or `rmsprop`. Default value: `adam` | 
| plateau\_reduce\_lr\_factor | Factor to multiply the learning rate by (for `plateau_reduce`). **Optional** Valid values: float Default value: 0.5 | 
| plateau\_reduce\_lr\_threshold | For the `plateau_reduce` scheduler, multiply the learning rate by the reduce factor if `optimized_metric` didn't improve for this many checkpoints. **Optional** Valid values: positive integer Default value: 3 | 
| rnn\_attention\_in\_upper\_layers | Pass the attention to the upper layers of the `rnn`, as in the Google NMT paper. Only applicable if more than one layer is used. **Optional** Valid values: boolean (`true` or `false`) Default value: `true` | 
| rnn\_attention\_num\_hidden | Number of hidden units for attention layers. Defaults to `rnn_num_hidden`. **Optional** Valid values: positive integer Default value: `rnn_num_hidden` | 
| rnn\_attention\_type | Attention model for encoders. `mlp` refers to concat, and `bilinear` refers to general, from the Luong et al. paper. **Optional** Valid values: String. One of `dot`, `fixed`, `mlp`, or `bilinear`. Default value: `mlp` | 
| rnn\_cell\_type | Specific type of `rnn` architecture. **Optional** Valid values: String. Either `lstm` or `gru`. Default value: `lstm` | 
| rnn\_decoder\_state\_init | How to initialize `rnn` decoder states from the encoder. **Optional** Valid values: String. One of `last`, `avg`, or `zero`. Default value: `last` | 
| rnn\_first\_residual\_layer | First `rnn` layer to have a residual connection; only applicable if the number of layers in the encoder or decoder is more than 1. **Optional** Valid values: positive integer Default value: 2 | 
| rnn\_num\_hidden | The number of `rnn` hidden units for the encoder and decoder. This must be a multiple of 2 because the algorithm uses bi-directional long short-term memory (LSTM) by default. **Optional** Valid values: positive even integer Default value: 1024 | 
| rnn\_residual\_connections | Add residual connections to the stacked `rnn`. The number of layers should be more than 1. **Optional** Valid values: boolean (`true` or `false`) Default value: `false` | 
| rnn\_decoder\_hidden\_dropout | Dropout probability for the hidden state that combines the context with the `rnn` hidden state in the decoder. **Optional** Valid values: Float. Range in [0,1]. Default value: 0 | 
| training\_metric | Metric to track on validation data during training. **Optional** Valid values: String. Either `perplexity` or `accuracy`. Default value: `perplexity` | 
| weight\_decay | Weight decay constant. **Optional** Valid values: float Default value: 0 | 
| weight\_init\_scale | Weight initialization scale (for `uniform` and `xavier` initialization). **Optional** Valid values: float Default value: 2.34 | 
| weight\_init\_type | Type of weight initialization. **Optional** Valid values: String. Either `uniform` or `xavier`. Default value: `xavier` | 
| xavier\_factor\_type | Xavier factor type. **Optional** Valid values: String. One of `in`, `out`, or `avg`. Default value: `in` | 
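Hyperparameters are passed to the training job as a dictionary of string values (for example, through the `hyperparameters` argument of a SageMaker Estimator). The following plain-Python sketch checks a few of the constraints from the table before launching a job; the helper function is hypothetical, not part of the SDK:

```python
def check_seq2seq_hyperparameters(hp):
    """Sanity-check a few of the constraints stated in the table above."""
    # rnn_num_hidden must be even because of the bi-directional LSTM default.
    if int(hp.get("rnn_num_hidden", 1024)) % 2 != 0:
        raise ValueError("rnn_num_hidden must be a positive even integer")
    # Dropout probabilities must lie in [0, 1].
    for key in ("embed_dropout_source", "embed_dropout_target",
                "cnn_hidden_dropout", "rnn_decoder_hidden_dropout"):
        if not 0.0 <= float(hp.get(key, 0)) <= 1.0:
            raise ValueError(f"{key} must be in [0, 1]")
    # encoder_type accepts only 'rnn' or 'cnn'.
    if hp.get("encoder_type", "rnn") not in ("rnn", "cnn"):
        raise ValueError("encoder_type must be 'rnn' or 'cnn'")
    return hp

validated = check_seq2seq_hyperparameters({
    "batch_size": "64",
    "optimizer_type": "adam",
    "rnn_num_hidden": "512",
    "max_seq_len_source": "100",
})
```

Failing fast on the client side like this avoids paying for a training job that the algorithm would reject at startup.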

# Tune a Sequence-to-Sequence Model
<a name="seq-2-seq-tuning"></a>

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the objective metric.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics Computed by the Sequence-to-Sequence Algorithm
<a name="seq-2-seq-metrics"></a>

The sequence-to-sequence algorithm reports three metrics that are computed during training. Choose one of them as an objective to optimize when tuning the hyperparameter values.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| validation:accuracy |  Accuracy computed on the validation dataset.  |  Maximize  | 
| validation:bleu |  [BLEU](https://en.wikipedia.org/wiki/BLEU) score computed on the validation dataset. Because BLEU computation is expensive, you can choose to compute BLEU on a random subsample of the validation dataset to speed up the overall training process. Use the `bleu_sample_size` parameter to specify the subsample.  |  Maximize  | 
| validation:perplexity |  [Perplexity](https://en.wikipedia.org/wiki/Perplexity) is a loss function computed on the validation dataset. Perplexity measures the cross-entropy between an empirical sample and the distribution predicted by the model, so it provides a measure of how well a model predicts the sample values. Models that are good at predicting a sample have low perplexity.  |  Minimize  | 

## Tunable Sequence-to-Sequence Hyperparameters
<a name="seq-2-seq-tunable-hyperparameters"></a>

You can tune the following hyperparameters for the SageMaker AI Sequence-to-Sequence algorithm. The hyperparameters that have the greatest impact on sequence-to-sequence objective metrics are: `batch_size`, `optimizer_type`, `learning_rate`, `num_layers_encoder`, and `num_layers_decoder`.


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| num\_layers\_encoder |  IntegerParameterRange  |  [1-10]  | 
| num\_layers\_decoder |  IntegerParameterRange  |  [1-10]  | 
| batch\_size |  CategoricalParameterRange  |  [16,32,64,128,256,512,1024,2048]  | 
| optimizer\_type |  CategoricalParameterRange  |  ['adam', 'sgd', 'rmsprop']  | 
| weight\_init\_type |  CategoricalParameterRange  |  ['xavier', 'uniform']  | 
| weight\_init\_scale |  ContinuousParameterRange  |  For the xavier type: MinValue: 2.0, MaxValue: 3.0 For the uniform type: MinValue: -1.0, MaxValue: 1.0  | 
| learning\_rate |  ContinuousParameterRange  |  MinValue: 0.00005, MaxValue: 0.2  | 
| weight\_decay |  ContinuousParameterRange  |  MinValue: 0.0, MaxValue: 0.1  | 
| momentum |  ContinuousParameterRange  |  MinValue: 0.5, MaxValue: 0.9  | 
| clip\_gradient |  ContinuousParameterRange  |  MinValue: 1.0, MaxValue: 5.0  | 
| rnn\_num\_hidden |  CategoricalParameterRange  |  Applicable only to recurrent neural networks (RNNs). [128,256,512,1024,2048]   | 
| cnn\_num\_hidden |  CategoricalParameterRange  |  Applicable only to convolutional neural networks (CNNs). [128,256,512,1024,2048]   | 
| num\_embed\_source |  IntegerParameterRange  |  [256-512]  | 
| num\_embed\_target |  IntegerParameterRange  |  [256-512]  | 
| embed\_dropout\_source |  ContinuousParameterRange  |  MinValue: 0.0, MaxValue: 0.5  | 
| embed\_dropout\_target |  ContinuousParameterRange  |  MinValue: 0.0, MaxValue: 0.5  | 
| rnn\_decoder\_hidden\_dropout |  ContinuousParameterRange  |  MinValue: 0.0, MaxValue: 0.5  | 
| cnn\_hidden\_dropout |  ContinuousParameterRange  |  MinValue: 0.0, MaxValue: 0.5  | 
| lr\_scheduler\_type |  CategoricalParameterRange  |  ['plateau\_reduce', 'fixed\_rate\_inv\_t', 'fixed\_rate\_inv\_sqrt\_t']  | 
| plateau\_reduce\_lr\_factor |  ContinuousParameterRange  |  MinValue: 0.1, MaxValue: 0.5  | 
| plateau\_reduce\_lr\_threshold |  IntegerParameterRange  |  [1-5]  | 
| fixed\_rate\_lr\_half\_life |  IntegerParameterRange  |  [10-30]  | 
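SageMaker AI automatic model tuning performs the search for you. Purely to illustrate what the ranges above describe, the following plain-Python sketch draws random candidate configurations from a few of them (random search is shown for simplicity; the tuner also supports smarter strategies):

```python
import random

random.seed(42)

# A few of the recommended ranges from the table above, as plain Python data.
integer_ranges = {"num_layers_encoder": (1, 10), "num_layers_decoder": (1, 10)}
continuous_ranges = {"learning_rate": (0.00005, 0.2), "weight_decay": (0.0, 0.1)}
categorical_ranges = {
    "batch_size": [16, 32, 64, 128, 256, 512, 1024, 2048],
    "optimizer_type": ["adam", "sgd", "rmsprop"],
}

def sample_configuration():
    """Draw one hyperparameter configuration, as a tuner would per training job."""
    config = {k: random.randint(lo, hi) for k, (lo, hi) in integer_ranges.items()}
    config.update({k: random.uniform(lo, hi) for k, (lo, hi) in continuous_ranges.items()})
    config.update({k: random.choice(vs) for k, vs in categorical_ranges.items()})
    return config

candidates = [sample_configuration() for _ in range(5)]
```

Each candidate corresponds to one training job; the tuner keeps the configuration whose job scores best on the chosen objective metric.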

# Text Classification - TensorFlow
<a name="text-classification-tensorflow"></a>

The Amazon SageMaker AI Text Classification - TensorFlow algorithm is a supervised learning algorithm that supports transfer learning with many pretrained models from the [TensorFlow Hub](https://tfhub.dev/). Use transfer learning to fine-tune one of the available pretrained models on your own dataset, even if a large amount of text data is not available. The text classification algorithm takes a text string as input and outputs a probability for each of the class labels. Training datasets must be in CSV format. This page includes information about Amazon EC2 instance recommendations and sample notebooks for Text Classification - TensorFlow.

**Topics**
+ [How to use the SageMaker AI Text Classification - TensorFlow algorithm](text-classification-tensorflow-how-to-use.md)
+ [Input and output interface for the Text Classification - TensorFlow algorithm](text-classification-tensorflow-inputoutput.md)
+ [Amazon EC2 instance recommendation for the Text Classification - TensorFlow algorithm](#text-classification-tensorflow-instances)
+ [Text Classification - TensorFlow sample notebooks](#text-classification-tensorflow-sample-notebooks)
+ [How Text Classification - TensorFlow Works](text-classification-tensorflow-HowItWorks.md)
+ [TensorFlow Hub Models](text-classification-tensorflow-Models.md)
+ [Text Classification - TensorFlow Hyperparameters](text-classification-tensorflow-Hyperparameter.md)
+ [Tune a Text Classification - TensorFlow model](text-classification-tensorflow-tuning.md)

# How to use the SageMaker AI Text Classification - TensorFlow algorithm
<a name="text-classification-tensorflow-how-to-use"></a>

You can use Text Classification - TensorFlow as an Amazon SageMaker AI built-in algorithm. The following section describes how to use Text Classification - TensorFlow with the SageMaker AI Python SDK. For information on how to use Text Classification - TensorFlow from the Amazon SageMaker Studio Classic UI, see [SageMaker JumpStart pretrained models](studio-jumpstart.md).

The Text Classification - TensorFlow algorithm supports transfer learning using any of the compatible pretrained TensorFlow models. For a list of all available pretrained models, see [TensorFlow Hub Models](text-classification-tensorflow-Models.md). Every pretrained model has a unique `model_id`. The following example uses BERT Base Uncased (`model_id`: `tensorflow-tc-bert-en-uncased-L-12-H-768-A-12-2`) to fine-tune on a custom dataset. The pretrained models are all pre-downloaded from the TensorFlow Hub and stored in Amazon S3 buckets so that training jobs can run in network isolation. Use these pre-generated model training artifacts to construct a SageMaker AI Estimator.

First, retrieve the Docker image URI, training script URI, and pretrained model URI. Then, change the hyperparameters as you see fit. You can see a Python dictionary of all available hyperparameters and their default values with `hyperparameters.retrieve_default`. For more information, see [Text Classification - TensorFlow Hyperparameters](text-classification-tensorflow-Hyperparameter.md). Use these values to construct a SageMaker AI Estimator.

**Note**  
Default hyperparameter values are different for different models. For example, for larger models, the default batch size is smaller. 

This example uses the [SST-2](https://www.tensorflow.org/datasets/catalog/glue#gluesst2) dataset, which contains positive and negative movie reviews. We pre-downloaded the dataset and made it available in Amazon S3. To fine-tune your model, call `.fit` using the Amazon S3 location of your training dataset. Any S3 bucket used in a notebook must be in the same AWS Region as the notebook instance that accesses it.

```
import boto3
import sagemaker
from sagemaker import image_uris, model_uris, script_uris, hyperparameters
from sagemaker.estimator import Estimator

# Set up the session, execution role, and Region used later in this example
aws_role = sagemaker.get_execution_role()
aws_region = boto3.Session().region_name
sess = sagemaker.Session()

model_id, model_version = "tensorflow-tc-bert-en-uncased-L-12-H-768-A-12-2", "*"
training_instance_type = "ml.p3.2xlarge"

# Retrieve the Docker image
train_image_uri = image_uris.retrieve(
    model_id=model_id,
    model_version=model_version,
    image_scope="training",
    instance_type=training_instance_type,
    region=None,
    framework=None,
)

# Retrieve the training script
train_source_uri = script_uris.retrieve(model_id=model_id, model_version=model_version, script_scope="training")

# Retrieve the pretrained model tarball for transfer learning
train_model_uri = model_uris.retrieve(model_id=model_id, model_version=model_version, model_scope="training")

# Retrieve the default hyperparameters for fine-tuning the model
hyperparameters = hyperparameters.retrieve_default(model_id=model_id, model_version=model_version)

# [Optional] Override default hyperparameters with custom values
hyperparameters["epochs"] = "5"

# Sample training data is available in this bucket
training_data_bucket = f"jumpstart-cache-prod-{aws_region}"
training_data_prefix = "training-datasets/SST2/"

training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}"

output_bucket = sess.default_bucket()
output_prefix = "jumpstart-example-tc-training"
s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"

# Create an Estimator instance
tf_tc_estimator = Estimator(
    role=aws_role,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",
    instance_count=1,
    instance_type=training_instance_type,
    max_run=360000,
    hyperparameters=hyperparameters,
    output_path=s3_output_location,
)

# Launch a training job
tf_tc_estimator.fit({"training": training_dataset_s3_path}, logs=True)
```

For more information about how to use the SageMaker Text Classification - TensorFlow algorithm for transfer learning on a custom dataset, see the [Introduction to JumpStart - Text Classification](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/jumpstart_text_classification/Amazon_JumpStart_Text_Classification.ipynb) notebook.

# Input and output interface for the Text Classification - TensorFlow algorithm
<a name="text-classification-tensorflow-inputoutput"></a>

Each of the pretrained models listed in TensorFlow Hub Models can be fine-tuned to any dataset made up of text sentences with any number of classes. The pretrained model attaches a classification layer to the Text Embedding model and initializes the layer parameters to random values. The output dimension of the classification layer is determined based on the number of classes detected in the input data. 

Be mindful of how to format your training data for input to the Text Classification - TensorFlow model.
+ **Training data input format:** A directory containing a `data.csv` file. Each row of the first column should have an integer class label between 0 and the number of classes minus one. Each row of the second column should have the corresponding text data.

The following is an example of an input CSV file. The file should not have a header, and it should be hosted in an Amazon S3 bucket with a path similar to the following: `s3://bucket_name/input_directory/`. Note that the trailing `/` is required.

|   |   | 
| --- | --- | 
| 0 | hide new secretions from the parental units | 
| 0 | contains no wit , only labored gags | 
| 1 | that loves its characters and communicates something rather beautiful about human nature | 
| ... | ... | 
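A minimal sketch of producing a `data.csv` file in this format locally with Python's standard `csv` module, before uploading the containing directory to Amazon S3 (the rows are the sample reviews shown above):

```python
import csv
import os
import tempfile

# Hypothetical labeled examples: (integer class label, text).
rows = [
    (0, "hide new secretions from the parental units"),
    (0, "contains no wit , only labored gags"),
    (1, "that loves its characters and communicates something rather beautiful about human nature"),
]

# The algorithm expects a headerless data.csv inside the input directory.
path = os.path.join(tempfile.mkdtemp(), "data.csv")
with open(path, "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Read it back to confirm the format round-trips.
with open(path, newline="") as f:
    parsed = list(csv.reader(f))
```

The `csv` writer quotes any text field that itself contains a comma, so the label always parses back as the first field.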

## Incremental training
<a name="text-classification-tensorflow-incremental-training"></a>

You can seed the training of a new model with artifacts from a model that you trained previously with SageMaker AI. Incremental training saves training time when you want to train a new model with the same or similar data.

**Note**  
You can only seed a SageMaker AI Text Classification - TensorFlow model with another Text Classification - TensorFlow model trained in SageMaker AI. 

You can use any dataset for incremental training, as long as the set of classes remains the same. The incremental training step is similar to the fine-tuning step, but instead of starting with a pretrained model, you start with an existing fine-tuned model. 

For more information on using incremental training with the SageMaker AI Text Classification - TensorFlow algorithm, see the [Introduction to JumpStart - Text Classification](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/jumpstart_text_classification/Amazon_JumpStart_Text_Classification.ipynb) sample notebook.

## Inference with the Text Classification - TensorFlow algorithm
<a name="text-classification-tensorflow-inference"></a>

You can host the fine-tuned model that results from your Text Classification - TensorFlow training for inference. Any raw text input for inference must be of content type `application/x-text`.

Running inference results in probability values, class labels for all classes, and the predicted label corresponding to the class index with the highest probability encoded in JSON format. The Text Classification - TensorFlow model processes a single string per request and outputs only one line. The following is an example of a JSON format response:

```
accept: application/json;verbose

{"probabilities": [prob_0, prob_1, prob_2, ...],
"labels": [label_0, label_1, label_2, ...],
"predicted_label": predicted_label}
```

If `accept` is set to `application/json`, then the model only outputs probabilities. 
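A sketch of reading a verbose response on the client side; the payload below is a hypothetical two-class example, not output from a real endpoint:

```python
import json

# Hypothetical response body returned with accept: application/json;verbose
body = '{"probabilities": [0.02, 0.98], "labels": [0, 1], "predicted_label": 1}'

response = json.loads(body)
probabilities = response["probabilities"]

# The predicted label corresponds to the class with the highest probability.
best = response["labels"][probabilities.index(max(probabilities))]
assert best == response["predicted_label"]
```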

## Amazon EC2 instance recommendation for the Text Classification - TensorFlow algorithm
<a name="text-classification-tensorflow-instances"></a>

The Text Classification - TensorFlow algorithm supports all CPU and GPU instances for training, including:
+ `ml.p2.xlarge`
+ `ml.p2.16xlarge`
+ `ml.p3.2xlarge`
+ `ml.p3.16xlarge`
+ `ml.g4dn.xlarge`
+ `ml.g4dn.16xlarge`
+ `ml.g5.xlarge`
+ `ml.g5.48xlarge`

We recommend GPU instances with more memory for training with large batch sizes. Both CPU (such as M5) and GPU (P2, P3, G4dn, or G5) instances can be used for inference. For a comprehensive list of SageMaker training and inference instances across AWS Regions, see [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/).

## Text Classification - TensorFlow sample notebooks
<a name="text-classification-tensorflow-sample-notebooks"></a>

For more information about how to use the SageMaker AI Text Classification - TensorFlow algorithm for transfer learning on a custom dataset, see the [Introduction to JumpStart - Text Classification](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/jumpstart_text_classification/Amazon_JumpStart_Text_Classification.ipynb) notebook.

For instructions how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, select the **SageMaker AI Examples** tab to see a list of all the SageMaker AI samples. To open a notebook, choose its **Use** tab and choose **Create copy**.

# How Text Classification - TensorFlow Works
<a name="text-classification-tensorflow-HowItWorks"></a>

The Text Classification - TensorFlow algorithm takes text as input and classifies it into one of the output class labels. Deep learning networks such as [BERT](https://arxiv.org/pdf/1810.04805.pdf) are highly accurate for text classification. Networks of this kind are pretrained on large text corpora; BERT, for example, is pretrained on English Wikipedia and the BooksCorpus dataset. After a network is pretrained on such a corpus, you can fine-tune it on a dataset with a particular focus to perform more specific text classification tasks. The Amazon SageMaker AI Text Classification - TensorFlow algorithm supports transfer learning on many pretrained models that are available in the TensorFlow Hub.

According to the number of class labels in your training data, a text classification layer is attached to the pretrained TensorFlow model of your choice. The classification layer consists of a dropout layer and a dense, fully connected layer with L2 regularization, and it is initialized with random weights. You can change the hyperparameter values for the dropout rate of the dropout layer and the L2 regularization factor for the dense layer.

You can fine-tune either the entire network (including the pretrained model) or only the top classification layer on new training data. With this method of transfer learning, training with smaller datasets is possible.
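The attached classification head can be pictured with the following simplified pure-Python sketch. It omits the L2 penalty and inverted-dropout scaling, and the dimensions and weights are made up:

```python
import math
import random

random.seed(1)

num_classes = 3   # inferred from the class labels detected in the training data
embed_dim = 8     # hypothetical output width of the pretrained text-embedding model

# The attached head: a dense layer whose weights start as random values.
weights = [[random.gauss(0, 0.05) for _ in range(num_classes)]
           for _ in range(embed_dim)]
bias = [0.0] * num_classes

def classify(embedding, dropout_rate=0.2, training=False):
    """Dropout (training only), then a dense layer, then softmax."""
    x = [v if not training or random.random() > dropout_rate else 0.0
         for v in embedding]
    logits = [sum(x[i] * weights[i][j] for i in range(embed_dim)) + bias[j]
              for j in range(num_classes)]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

probs = classify([0.1] * embed_dim)  # one probability per class
```

Fine-tuning only the head updates `weights` and `bias` while the pretrained embedding stays frozen; fine-tuning the entire network updates the embedding model as well.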

# TensorFlow Hub Models
<a name="text-classification-tensorflow-Models"></a>

The following pretrained models are available to use for transfer learning with the Text Classification - TensorFlow algorithm. 

The following models vary significantly in size, number of model parameters, training time, and inference latency for any given dataset. The best model for your use case depends on the complexity of your fine-tuning dataset and any requirements that you have on training time, inference latency, or model accuracy.


| Model Name | `model_id` | Source | 
| --- | --- | --- | 
|  BERT Base Uncased  | `tensorflow-tc-bert-en-uncased-L-12-H-768-A-12-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3) | 
|  BERT Base Cased  | `tensorflow-tc-bert-en-cased-L-12-H-768-A-12-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/bert_en_cased_L-12_H-768_A-12/3) | 
|  BERT Base Multilingual Cased  | `tensorflow-tc-bert-multi-cased-L-12-H-768-A-12-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/3) | 
|  Small BERT L-2\_H-128\_A-2  | `tensorflow-tc-small-bert-bert-en-uncased-L-2-H-128-A-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-128_A-2/1) | 
|  Small BERT L-2\_H-256\_A-4  | `tensorflow-tc-small-bert-bert-en-uncased-L-2-H-256-A-4` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-256_A-4/1) | 
|  Small BERT L-2\_H-512\_A-8  | `tensorflow-tc-small-bert-bert-en-uncased-L-2-H-512-A-8` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-512_A-8/1) | 
|  Small BERT L-2\_H-768\_A-12  | `tensorflow-tc-small-bert-bert-en-uncased-L-2-H-768-A-12` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-768_A-12/1) | 
|  Small BERT L-4\_H-128\_A-2  | `tensorflow-tc-small-bert-bert-en-uncased-L-4-H-128-A-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-128_A-2/1) | 
|  Small BERT L-4\_H-256\_A-4  | `tensorflow-tc-small-bert-bert-en-uncased-L-4-H-256-A-4` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-256_A-4/1) | 
|  Small BERT L-4\_H-512\_A-8  | `tensorflow-tc-small-bert-bert-en-uncased-L-4-H-512-A-8` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1) | 
|  Small BERT L-4\_H-768\_A-12  | `tensorflow-tc-small-bert-bert-en-uncased-L-4-H-768-A-12` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-768_A-12/1) | 
|  Small BERT L-6\_H-128\_A-2  | `tensorflow-tc-small-bert-bert-en-uncased-L-6-H-128-A-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-128_A-2/1) | 
|  Small BERT L-6\_H-256\_A-4  | `tensorflow-tc-small-bert-bert-en-uncased-L-6-H-256-A-4` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-256_A-4/1) | 
|  Small BERT L-6\_H-512\_A-8  | `tensorflow-tc-small-bert-bert-en-uncased-L-6-H-512-A-8` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-512_A-8/1) | 
|  Small BERT L-6\_H-768\_A-12  | `tensorflow-tc-small-bert-bert-en-uncased-L-6-H-768-A-12` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-768_A-12/1) | 
|  Small BERT L-8\_H-128\_A-2  | `tensorflow-tc-small-bert-bert-en-uncased-L-8-H-128-A-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-128_A-2/1) | 
|  Small BERT L-8\_H-256\_A-4  | `tensorflow-tc-small-bert-bert-en-uncased-L-8-H-256-A-4` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-256_A-4/1) | 
|  Small BERT L-8\_H-512\_A-8  | `tensorflow-tc-small-bert-bert-en-uncased-L-8-H-512-A-8` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-512_A-8/1) | 
|  Small BERT L-8\_H-768\_A-12  | `tensorflow-tc-small-bert-bert-en-uncased-L-8-H-768-A-12` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-768_A-12/1) | 
|  Small BERT L-10\_H-128\_A-2  | `tensorflow-tc-small-bert-bert-en-uncased-L-10-H-128-A-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-128_A-2/1) | 
|  Small BERT L-10\_H-256\_A-4  | `tensorflow-tc-small-bert-bert-en-uncased-L-10-H-256-A-4` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-256_A-4/1) | 
|  Small BERT L-10\_H-512\_A-8  | `tensorflow-tc-small-bert-bert-en-uncased-L-10-H-512-A-8` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-512_A-8/1) | 
|  Small BERT L-10\_H-768\_A-12  | `tensorflow-tc-small-bert-bert-en-uncased-L-10-H-768-A-12` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-768_A-12/1) | 
|  Small BERT L-12\_H-128\_A-2  | `tensorflow-tc-small-bert-bert-en-uncased-L-12-H-128-A-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-128_A-2/1) | 
|  Small BERT L-12\_H-256\_A-4  | `tensorflow-tc-small-bert-bert-en-uncased-L-12-H-256-A-4` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-256_A-4/1) | 
|  Small BERT L-12\_H-512\_A-8  | `tensorflow-tc-small-bert-bert-en-uncased-L-12-H-512-A-8` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-512_A-8/1) | 
|  Small BERT L-12\_H-768\_A-12  | `tensorflow-tc-small-bert-bert-en-uncased-L-12-H-768-A-12` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-768_A-12/1) | 
|  BERT Large Uncased  | `tensorflow-tc-bert-en-uncased-L-24-H-1024-A-16-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/3) | 
|  BERT Large Cased  | `tensorflow-tc-bert-en-cased-L-24-H-1024-A-16-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/bert_en_cased_L-24_H-1024_A-16/3) | 
|  BERT Large Uncased Whole Word Masking  | `tensorflow-tc-bert-en-wwm-uncased-L-24-H-1024-A-16-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/bert_en_wwm_uncased_L-24_H-1024_A-16/3) | 
|  BERT Large Cased Whole Word Masking  | `tensorflow-tc-bert-en-wwm-cased-L-24-H-1024-A-16-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/bert_en_wwm_cased_L-24_H-1024_A-16/3) | 
|  ALBERT Base  | `tensorflow-tc-albert-en-base` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/albert_en_base/2) | 
|  ELECTRA Small\+\+  | `tensorflow-tc-electra-small-1` | [TensorFlow Hub link](https://tfhub.dev/google/electra_small/2) | 
|  ELECTRA Base  | `tensorflow-tc-electra-base-1` | [TensorFlow Hub link](https://tfhub.dev/google/electra_base/2) | 
|  BERT Base Wikipedia and BooksCorpus  | `tensorflow-tc-experts-bert-wiki-books-1` | [TensorFlow Hub link](https://tfhub.dev/google/experts/bert/wiki_books/2) | 
|  BERT Base MEDLINE/PubMed  | `tensorflow-tc-experts-bert-pubmed-1` | [TensorFlow Hub link](https://tfhub.dev/google/experts/bert/pubmed/2) | 
|  Talking Heads Base  | `tensorflow-tc-talking-heads-base` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/talkheads_ggelu_bert_en_base/1) | 
|  Talking Heads Large  | `tensorflow-tc-talking-heads-large` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/talkheads_ggelu_bert_en_large/1) | 

# Text Classification - TensorFlow Hyperparameters
<a name="text-classification-tensorflow-Hyperparameter"></a>

Hyperparameters are parameters that are set before a machine learning model begins learning. The following hyperparameters are supported by the Amazon SageMaker AI built-in Text Classification - TensorFlow algorithm. See [Tune a Text Classification - TensorFlow model](text-classification-tensorflow-tuning.md) for information on hyperparameter tuning. 


| Parameter Name | Description | 
| --- | --- | 
| batch\_size |  The batch size for training. For training on instances with multiple GPUs, this batch size is used across the GPUs.  Valid values: positive integer. Default value: `32`.  | 
| beta\_1 |  The beta1 for the `"adam"` and `"adamw"` optimizers. Represents the exponential decay rate for the first moment estimates. Ignored for other optimizers. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.9`.  | 
| beta\_2 |  The beta2 for the `"adam"` and `"adamw"` optimizers. Represents the exponential decay rate for the second moment estimates. Ignored for other optimizers. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.999`.  | 
| dropout\_rate | The dropout rate for the dropout layer in the top classification layer. Used only when `reinitialize_top_layer` is set to `"True"`. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.2`. | 
| early\_stopping |  Set to `"True"` to use early stopping logic during training. If `"False"`, early stopping is not used. Valid values: string, either: (`"True"` or `"False"`). Default value: `"False"`.  | 
| early\_stopping\_min\_delta | The minimum change needed to qualify as an improvement. An absolute change less than the value of `early_stopping_min_delta` does not qualify as improvement. Used only when `early_stopping` is set to `"True"`. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.0`. | 
| early\_stopping\_patience |  The number of epochs to continue training with no improvement. Used only when `early_stopping` is set to `"True"`. Valid values: positive integer. Default value: `5`.  | 
| epochs |  The number of training epochs. Valid values: positive integer. Default value: `10`.  | 
| epsilon |  The epsilon for `"adam"`, `"rmsprop"`, `"adadelta"`, and `"adagrad"` optimizers. Usually set to a small value to avoid division by 0. Ignored for other optimizers. Valid values: float, range: [`0.0`, `1.0`]. Default value: `1e-7`.  | 
| initial\_accumulator\_value |  The starting value for the accumulators, or the per-parameter momentum values, for the `"adagrad"` optimizer. Ignored for other optimizers. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.0001`.  | 
| learning\_rate | The optimizer learning rate. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.001`. | 
| momentum |  The momentum for the `"sgd"` and `"nesterov"` optimizers. Ignored for other optimizers. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.9`.  | 
| optimizer |  The optimizer type. For more information, see [Optimizers](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers) in the TensorFlow documentation. Valid values: string, any of the following: (`"adamw"`, `"adam"`, `"sgd"`, `"nesterov"`, `"rmsprop"`, `"adagrad"`, `"adadelta"`). Default value: `"adam"`.  | 
| regularizers\_l2 |  The L2 regularization factor for the dense layer in the classification layer. Used only when `reinitialize_top_layer` is set to `"True"`. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.0001`.  | 
| reinitialize\_top\_layer |  If set to `"Auto"`, the top classification layer parameters are re-initialized during fine-tuning. For incremental training, top classification layer parameters are not re-initialized unless set to `"True"`. Valid values: string, any of the following: (`"Auto"`, `"True"` or `"False"`). Default value: `"Auto"`.  | 
| rho |  The discounting factor for the gradient of the `"adadelta"` and `"rmsprop"` optimizers. Ignored for other optimizers.  Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.95`.  | 
| train\_only\_on\_top\_layer |  If `"True"`, only the top classification layer parameters are fine-tuned. If `"False"`, all model parameters are fine-tuned. Valid values: string, either: (`"True"` or `"False"`). Default value: `"False"`.  | 
| validation\_split\_ratio |  The fraction of training data to randomly split to create validation data. Only used if validation data is not provided through the `validation` channel. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.2`.  | 
| warmup\_steps\_fraction |  The fraction of the total number of gradient update steps, where the learning rate increases from 0 to the initial learning rate as a warm up. Only used with the `adamw` optimizer. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.1`.  | 
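Hyperparameters are passed to the training job as string values. The following sketch (the dictionary contents and the `validate` helper are illustrative, not part of the SageMaker AI API) shows one way to assemble a hyperparameter dictionary and sanity-check it against the documented ranges before launching a job:

```python
# Hypothetical hyperparameter dictionary for a Text Classification - TensorFlow
# training job. Values are strings, matching how SageMaker AI passes them to
# the training container. Names and defaults follow the table above.
hyperparameters = {
    "batch_size": "32",
    "epochs": "10",
    "learning_rate": "0.001",
    "optimizer": "adam",
    "beta_1": "0.9",
    "beta_2": "0.999",
    "early_stopping": "True",
    "early_stopping_patience": "5",
    "train_only_on_top_layer": "False",
}

VALID_OPTIMIZERS = {"adamw", "adam", "sgd", "nesterov", "rmsprop", "adagrad", "adadelta"}

def validate(hp):
    """Check a few documented constraints before launching a training job."""
    assert int(hp["batch_size"]) > 0, "batch_size must be a positive integer"
    assert 0.0 <= float(hp["learning_rate"]) <= 1.0, "learning_rate must be in [0.0, 1.0]"
    assert hp["optimizer"] in VALID_OPTIMIZERS, "unknown optimizer"
    assert hp["early_stopping"] in {"True", "False"}, "early_stopping is a string flag"
    return hp

validate(hyperparameters)
```

Note that boolean-like hyperparameters such as `early_stopping` are the strings `"True"` and `"False"`, not Python booleans.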

# Tune a Text Classification - TensorFlow model
<a name="text-classification-tensorflow-tuning"></a>

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the chosen hyperparameters to find the combination of values that results in a model that optimizes the objective metric.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics computed by the Text Classification - TensorFlow algorithm
<a name="text-classification-tensorflow-metrics"></a>

Refer to the following chart to find which metrics are computed by the Text Classification - TensorFlow algorithm.


| Metric Name | Description | Optimization Direction | Regex Pattern | 
| --- | --- | --- | --- | 
| validation:accuracy | The ratio of the number of correct predictions to the total number of predictions made. | Maximize | `val_accuracy=([0-9\\.]+)` | 
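The tuner extracts the objective metric by applying this regex to the training log stream. The following sketch shows the extraction in plain Python; the log line is an illustrative example, not verbatim algorithm output (note that the `\\.` in the table is an escaped form of the single `\.` used in the actual pattern):

```python
import re

# Pattern from the metrics table above (written as a Python raw string).
pattern = r"val_accuracy=([0-9\.]+)"

# Hypothetical log line in the shape the regex expects.
log_line = "Epoch 10: val_accuracy=0.9214 val_loss=0.3021"

match = re.search(pattern, log_line)
accuracy = float(match.group(1)) if match else None
```

Any metric you want to tune on must be emitted to the logs in a form the regex can capture.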

## Tunable Text Classification - TensorFlow hyperparameters
<a name="text-classification-tensorflow-tunable-hyperparameters"></a>

Tune a text classification model with the following hyperparameters. The hyperparameters that have the greatest impact on text classification objective metrics are: `batch_size`, `learning_rate`, and `optimizer`. Tune the optimizer-related hyperparameters, such as `momentum`, `regularizers_l2`, `beta_1`, `beta_2`, and `eps` based on the selected `optimizer`. For example, use `beta_1` and `beta_2` only when `adamw` or `adam` is the `optimizer`.

For more information about which hyperparameters are used for each `optimizer`, see [Text Classification - TensorFlow Hyperparameters](text-classification-tensorflow-Hyperparameter.md).


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| batch\_size | IntegerParameterRanges | MinValue: 4, MaxValue: 128 | 
| beta\_1 | ContinuousParameterRanges | MinValue: 1e-6, MaxValue: 0.999 | 
| beta\_2 | ContinuousParameterRanges | MinValue: 1e-6, MaxValue: 0.999 | 
| eps | ContinuousParameterRanges | MinValue: 1e-8, MaxValue: 1.0 | 
| learning\_rate | ContinuousParameterRanges | MinValue: 1e-6, MaxValue: 0.5 | 
| momentum | ContinuousParameterRanges | MinValue: 0.0, MaxValue: 0.999 | 
| optimizer | CategoricalParameterRanges | ['adamw', 'adam', 'sgd', 'rmsprop', 'nesterov', 'adagrad', 'adadelta'] | 
| regularizers\_l2 | ContinuousParameterRanges | MinValue: 0.0, MaxValue: 0.999 | 
| train\_only\_on\_top\_layer | CategoricalParameterRanges | ['True', 'False'] | 
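The parameter types in the table correspond to the `ParameterRanges` structure of a hyperparameter tuning job request. The sketch below (a partial request fragment, not a complete tuning job configuration) shows a subset of the recommended ranges expressed in that shape; numeric values are strings, as the API expects:

```python
# Sketch of a ParameterRanges structure built from the recommended ranges
# above. Only a subset of tunable hyperparameters is shown; the surrounding
# tuning job fields (objective metric, resource limits, etc.) are omitted.
parameter_ranges = {
    "IntegerParameterRanges": [
        {"Name": "batch_size", "MinValue": "4", "MaxValue": "128"},
    ],
    "ContinuousParameterRanges": [
        {"Name": "learning_rate", "MinValue": "1e-6", "MaxValue": "0.5"},
        {"Name": "momentum", "MinValue": "0.0", "MaxValue": "0.999"},
    ],
    "CategoricalParameterRanges": [
        {
            "Name": "optimizer",
            "Values": ["adamw", "adam", "sgd", "rmsprop",
                       "nesterov", "adagrad", "adadelta"],
        },
    ],
}
```

Include only the optimizer-specific ranges (for example `momentum`, `beta_1`, `beta_2`) that apply to the optimizers you allow in the categorical range, since values for other optimizers are ignored during training.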