

# Neural Topic Model (NTM) Algorithm


Amazon SageMaker AI NTM is an unsupervised learning algorithm that is used to organize a corpus of documents into *topics* that contain word groupings based on their statistical distribution. Documents that contain frequent occurrences of words such as "bike", "car", "train", "mileage", and "speed" are likely to share a topic on "transportation" for example. Topic modeling can be used to classify or summarize documents based on the topics detected or to retrieve information or recommend content based on topic similarities. The topics from documents that NTM learns are characterized as a *latent representation* because the topics are inferred from the observed word distributions in the corpus. The semantics of topics are usually inferred by examining the top ranking words they contain. Because the method is unsupervised, only the number of topics, not the topics themselves, are prespecified. In addition, the topics are not guaranteed to align with how a human might naturally categorize documents.

Topic modeling provides a way to visualize the contents of a large document corpus in terms of the learned topics. Documents relevant to each topic might be indexed or searched for based on their soft topic labels. The latent representations of documents might also be used to find similar documents in the topic space. You can also use the latent representations of documents that the topic model learns for input to another supervised algorithm such as a document classifier. Because the latent representations of documents are expected to capture the semantics of the underlying documents, algorithms based in part on these representations are expected to perform better than those based on lexical features alone.

Although you can use both the Amazon SageMaker AI NTM and LDA algorithms for topic modeling, they are distinct algorithms and can be expected to produce different results on the same input data.

For more information on the mathematics behind NTM, see [Neural Variational Inference for Text Processing](https://arxiv.org/pdf/1511.06038.pdf).

**Topics**
+ [Input/Output Interface for the NTM Algorithm](#NTM-inputoutput)
+ [EC2 Instance Recommendation for the NTM Algorithm](#NTM-instances)
+ [NTM Sample Notebooks](#NTM-sample-notebooks)
+ [NTM Hyperparameters](ntm_hyperparameters.md)
+ [Tune an NTM Model](ntm-tuning.md)
+ [NTM Response Formats](ntm-in-formats.md)

## Input/Output Interface for the NTM Algorithm


Amazon SageMaker AI Neural Topic Model supports four data channels: train, validation, test, and auxiliary. The validation, test, and auxiliary data channels are optional. If you specify any of these optional channels, set the value of the `S3DataDistributionType` parameter for them to `FullyReplicated`. If you provide validation data, the loss on this data is logged at every epoch, and the model stops training as soon as it detects that the validation loss is no longer improving. If you don't provide validation data, the algorithm uses the training data to decide when to stop early, which can be less efficient. If you provide test data, the algorithm reports the test loss from the final model.
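To make the channel rules concrete, the following sketch builds an `InputDataConfig` list in the shape a low-level `CreateTrainingJob` request expects. The bucket, prefix, and key names are placeholders; what matters is that the optional validation, test, and auxiliary channels use `FullyReplicated` distribution, as described above.

```python
# Sketch of an InputDataConfig for an NTM training job (placeholder S3 paths).
def ntm_input_config(bucket, prefix):
    def channel(name, key, distribution, content_type):
        return {
            "ChannelName": name,
            "ContentType": content_type,
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": f"s3://{bucket}/{prefix}/{key}",
                    "S3DataDistributionType": distribution,
                }
            },
        }

    return [
        channel("train", "train", "FullyReplicated",
                "application/x-recordio-protobuf"),
        # The three optional channels must be fully replicated.
        channel("validation", "validation", "FullyReplicated",
                "application/x-recordio-protobuf"),
        channel("test", "test", "FullyReplicated",
                "application/x-recordio-protobuf"),
        # The auxiliary channel supplies the plain-text vocabulary file.
        channel("auxiliary", "auxiliary/vocab.txt", "FullyReplicated",
                "text/plain"),
    ]
```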

The train, validation, and test data channels for NTM support both `recordIO-wrapped-protobuf` (dense and sparse) and `CSV` file formats. For `CSV` format, each row must be represented densely, with zero counts for words not present in the corresponding document, so that the data has dimension equal to (number of records) * (vocabulary size). You can use either File mode or Pipe mode to train models on data that is formatted as `recordIO-wrapped-protobuf` or as `CSV`. The auxiliary channel is used to supply a text file that contains the vocabulary. By supplying the vocabulary file, users can see the top words for each topic printed in the log instead of their integer IDs. Having the vocabulary file also allows NTM to compute the Word Embedding Topic Coherence (WETC) scores, a metric displayed in the log that captures similarity among the top words in each topic. The `ContentType` for the auxiliary channel is `text/plain`, with each line containing a single word, in the order corresponding to the integer IDs provided in the data. The vocabulary file must be named `vocab.txt`, and currently only UTF-8 encoding is supported.
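The dense CSV layout can be sketched with a small helper (illustrative only, not part of any SDK): each row carries one count per vocabulary word, zeros included, and `vocab.txt` lists the words one per line in integer-ID order.

```python
# Turn tokenized documents into the dense CSV rows NTM expects,
# plus the vocab.txt contents for the auxiliary channel.
from collections import Counter

def to_dense_csv(documents, vocabulary):
    """Each row has one count per vocabulary word (zeros included)."""
    rows = []
    for tokens in documents:
        counts = Counter(tokens)
        row = [counts.get(word, 0) for word in vocabulary]
        rows.append(",".join(str(c) for c in row))
    return "\n".join(rows)

vocabulary = ["bike", "car", "mileage", "speed", "train"]
docs = [["car", "car", "speed"], ["bike", "mileage"]]
csv_rows = to_dense_csv(docs, vocabulary)   # "0,2,0,1,0\n1,0,1,0,0"
vocab_txt = "\n".join(vocabulary) + "\n"    # one word per line, ID order
```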

For inference, `text/csv`, `application/json`, `application/jsonlines`, and `application/x-recordio-protobuf` content types are supported. Sparse data can also be passed for `application/json` and `application/x-recordio-protobuf`. NTM inference returns `application/json` or `application/x-recordio-protobuf` *predictions*, which include the `topic_weights` vector for each observation.
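For sparse inputs, a JSON request along the lines of the common SageMaker inference format can be assembled as below. This is a sketch: the `keys`/`values`/`shape` layout follows the common data format documentation, and the word IDs, counts, and vocabulary size shown are illustrative.

```python
import json

# Build a sparse application/json inference request: each document is a
# (word IDs, counts) pair, expanded against a vocabulary of vocab_size.
def sparse_request(docs, vocab_size):
    instances = []
    for keys, values in docs:
        instances.append({
            "data": {
                "features": {
                    "keys": keys,       # indices of nonzero counts
                    "shape": [vocab_size],
                    "values": values,   # the counts themselves
                }
            }
        })
    return json.dumps({"instances": instances})

payload = sparse_request([([1, 5, 42], [2.0, 1.0, 3.0])], vocab_size=5000)
```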

See the [blog post](https://aws.amazon.com/blogs/machine-learning/amazon-sagemaker-neural-topic-model-now-supports-auxiliary-vocabulary-channel-new-topic-evaluation-metrics-and-training-subsampling/) for more details on using the auxiliary channel and the WETC scores. For more information on how to compute the WETC score, see [Coherence-Aware Neural Topic Modeling](https://arxiv.org/pdf/1809.02687.pdf). We used the pairwise WETC described in this paper for the Amazon SageMaker AI Neural Topic Model.

For more information on input and output file formats, see [NTM Response Formats](ntm-in-formats.md) for inference and the [NTM Sample Notebooks](#NTM-sample-notebooks).

## EC2 Instance Recommendation for the NTM Algorithm


NTM training supports both GPU and CPU instance types. We recommend GPU instances, but for certain workloads, CPU instances may result in lower training costs. CPU instances should be sufficient for inference. NTM supports the P2, P3, G4dn, and G5 GPU instance families for training and inference.

## NTM Sample Notebooks

For a sample notebook that uses the SageMaker AI NTM algorithm to uncover topics in documents from a synthetic data source where the topic distributions are known, see the [Introduction to Basic Functionality of NTM](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/ntm_synthetic/ntm_synthetic.html). For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, select the **SageMaker AI Examples** tab to see a list of all the SageMaker AI samples. The topic modeling example notebooks that use the NTM algorithm are located in the **Introduction to Amazon algorithms** section. To open a notebook, choose its **Use** tab and select **Create copy**.

# NTM Hyperparameters

The following table lists the hyperparameters that you can set for the Amazon SageMaker AI Neural Topic Model (NTM) algorithm.


| Parameter Name | Description | 
| --- | --- | 
|  `feature_dim`  |  The vocabulary size of the dataset. **Required** Valid values: Positive integer (min: 1, max: 1,000,000)  | 
| `num_topics` |  The number of required topics. **Required** Valid values: Positive integer (min: 2, max: 1000)  | 
| `batch_norm` |  Whether to use batch normalization during training. **Optional** Valid values: *true* or *false* Default value: *false*  | 
| `clip_gradient` |  The maximum magnitude for each gradient component. **Optional** Valid values: Float (min: 1e-3) Default value: Infinity  | 
| `encoder_layers` |  The number of layers in the encoder and the output size of each layer. When set to *auto*, the algorithm uses two layers of sizes 3 x `num_topics` and 2 x `num_topics`, respectively. **Optional** Valid values: Comma-separated list of positive integers or *auto* Default value: *auto*  | 
| `encoder_layers_activation` |  The activation function to use in the encoder layers. **Optional** Valid values: [See the AWS documentation website for more details](http://docs.aws.amazon.com/sagemaker/latest/dg/ntm_hyperparameters.html) Default value: `sigmoid`  | 
| `epochs` |  The maximum number of passes over the training data. **Optional** Valid values: Positive integer (min: 1) Default value: 50  | 
| `learning_rate` |  The learning rate for the optimizer. **Optional** Valid values: Float (min: 1e-6, max: 1.0) Default value: 0.001  | 
| `mini_batch_size` |  The number of examples in each mini-batch. **Optional** Valid values: Positive integer (min: 1, max: 10000) Default value: 256  | 
| `num_patience_epochs` |  The number of successive epochs over which the early stopping criterion is evaluated. Early stopping is triggered when the change in the loss function drops below the specified `tolerance` within the last `num_patience_epochs` epochs. To disable early stopping, set `num_patience_epochs` to a value larger than `epochs`. **Optional** Valid values: Positive integer (min: 1) Default value: 3  | 
| `optimizer` |  The optimizer to use for training. **Optional** Valid values: [See the AWS documentation website for more details](http://docs.aws.amazon.com/sagemaker/latest/dg/ntm_hyperparameters.html) Default value: `adadelta`  | 
| `rescale_gradient` |  The rescale factor for the gradient. **Optional** Valid values: Float (min: 1e-3, max: 1.0) Default value: 1.0  | 
| `sub_sample` |  The fraction of the training data to sample for training per epoch. **Optional** Valid values: Float (min: 0.0, max: 1.0) Default value: 1.0  | 
| `tolerance` |  The maximum relative change in the loss function. Early stopping is triggered when the change in the loss function drops below this value within the last `num_patience_epochs` epochs. **Optional** Valid values: Float (min: 1e-6, max: 0.1) Default value: 0.001  | 
| `weight_decay` |  The weight decay coefficient. Adds L2 regularization. **Optional** Valid values: Float (min: 0.0, max: 1.0) Default value: 0.0  | 
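As a quick illustration, a hyperparameter map for a training job might look like the following. With the low-level API, every value is passed as a string; `feature_dim` and `num_topics` are the only required settings, and the specific values here are examples, not recommendations.

```python
# Example hyperparameters for an NTM training job (all values are strings
# in the low-level API). Only feature_dim and num_topics are required.
hyperparameters = {
    "feature_dim": "5000",       # vocabulary size (required)
    "num_topics": "20",          # number of topics to learn (required)
    "mini_batch_size": "256",
    "epochs": "50",
    "num_patience_epochs": "3",  # early-stopping window
    "tolerance": "0.001",        # early-stopping threshold
    "optimizer": "adadelta",
    "learning_rate": "0.001",
}
```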

# Tune an NTM Model

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the objective metric.

Amazon SageMaker AI NTM is an unsupervised learning algorithm that learns latent representations of large collections of discrete data, such as a corpus of documents. Latent representations use inferred variables that are not directly measured to model the observations in a dataset. Automatic model tuning on NTM helps you find the model that minimizes loss over the training or validation data. *Training loss* measures how well the model fits the training data. *Validation loss* measures how well the model can generalize to data that it is not trained on. Low training loss indicates that a model is a good fit to the training data. Low validation loss indicates that a model has not overfit the training data and so should be able to model documents successfully on which it has not been trained. Usually, it's preferable to have both losses be small. However, minimizing training loss too much might result in overfitting and increase validation loss, which would reduce the generality of the model.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics Computed by the NTM Algorithm

The NTM algorithm reports a single metric that is computed during training: `validation:total_loss`. The total loss is the sum of the reconstruction loss and Kullback-Leibler divergence. When tuning hyperparameter values, choose this metric as the objective.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| `validation:total_loss` |  Total loss on the validation set  |  Minimize  | 

## Tunable NTM Hyperparameters

You can tune the following hyperparameters for the NTM algorithm. Usually setting low `mini_batch_size` and small `learning_rate` values results in lower validation losses, although it might take longer to train. Low validation losses don't necessarily produce more coherent topics as interpreted by humans. The effect of other hyperparameters on training and validation loss can vary from dataset to dataset. To see which values are compatible, see [NTM Hyperparameters](ntm_hyperparameters.md).


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| `encoder_layers_activation` |  CategoricalParameterRanges  |  ['sigmoid', 'tanh', 'relu']  | 
| `learning_rate` |  ContinuousParameterRange  |  MinValue: 1e-4, MaxValue: 0.1  | 
| `mini_batch_size` |  IntegerParameterRanges  |  MinValue: 16, MaxValue: 2048  | 
| `optimizer` |  CategoricalParameterRanges  |  ['sgd', 'adam', 'adadelta']  | 
| `rescale_gradient` |  ContinuousParameterRange  |  MinValue: 0.1, MaxValue: 1.0  | 
| `weight_decay` |  ContinuousParameterRange  |  MinValue: 0.0, MaxValue: 1.0  | 
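The recommended ranges above can be expressed as the `ParameterRanges` structure that a low-level hyperparameter tuning job request expects, paired with `validation:total_loss` as the objective to minimize. This is a sketch of the request fragment, with values as strings per the low-level API.

```python
# Recommended NTM tuning ranges, in the low-level ParameterRanges shape.
parameter_ranges = {
    "ContinuousParameterRanges": [
        {"Name": "learning_rate", "MinValue": "1e-4", "MaxValue": "0.1"},
        {"Name": "rescale_gradient", "MinValue": "0.1", "MaxValue": "1.0"},
        {"Name": "weight_decay", "MinValue": "0.0", "MaxValue": "1.0"},
    ],
    "IntegerParameterRanges": [
        {"Name": "mini_batch_size", "MinValue": "16", "MaxValue": "2048"},
    ],
    "CategoricalParameterRanges": [
        {"Name": "encoder_layers_activation",
         "Values": ["sigmoid", "tanh", "relu"]},
        {"Name": "optimizer", "Values": ["sgd", "adam", "adadelta"]},
    ],
}

# The single metric NTM reports, minimized as the tuning objective.
objective = {"Type": "Minimize", "MetricName": "validation:total_loss"}
```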

# NTM Response Formats

All Amazon SageMaker AI built-in algorithms adhere to the common input inference format described in [Common Data Formats - Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html). This topic contains a list of the available output formats for the SageMaker AI NTM algorithm.

## JSON Response Format


```
{
    "predictions":    [
        {"topic_weights": [0.02, 0.1, 0,...]},
        {"topic_weights": [0.25, 0.067, 0,...]}
    ]
}
```

## JSONLINES Response Format


```
{"topic_weights": [0.02, 0.1, 0,...]}
{"topic_weights": [0.25, 0.067, 0,...]}
```
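A JSON Lines response like the one above can be parsed into one topic-weight vector per input document with a few lines of standard-library Python (the weights shown are illustrative):

```python
import json

# Parse an application/jsonlines NTM response body into a list of
# topic-weight vectors, one per input document.
def parse_jsonlines(body):
    return [json.loads(line)["topic_weights"]
            for line in body.splitlines() if line.strip()]

body = ('{"topic_weights": [0.02, 0.1, 0.88]}\n'
        '{"topic_weights": [0.25, 0.067, 0.683]}')
weights = parse_jsonlines(body)  # one vector per document
```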

## RECORDIO Response Format


```
[
    Record = {
        features = {},
        label = {
            'topic_weights': {
                keys: [],
                values: [0.25, 0.067, 0, ...]  # float32
            }
        }
    },
    Record = {
        features = {},
        label = {
            'topic_weights': {
                keys: [],
                values: [0.25, 0.067, 0, ...]  # float32
            }
        }
    }  
]
```