

# Latent Dirichlet Allocation (LDA) Algorithm
<a name="lda"></a>

The Amazon SageMaker AI Latent Dirichlet Allocation (LDA) algorithm is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of distinct categories. LDA is most commonly used to discover a user-specified number of topics shared by documents within a text corpus. Here each observation is a document, the features are the presence (or occurrence count) of each word, and the categories are the topics. Since the method is unsupervised, the topics are not specified up front, and are not guaranteed to align with how a human may naturally categorize documents. The topics are learned as a probability distribution over the words that occur in each document. Each document, in turn, is described as a mixture of topics.

The exact content of two documents with similar topic mixtures will not be the same. But overall, you would expect these documents to use a shared subset of words more frequently than documents with different topic mixtures. This allows LDA to discover these word groups and use them to form topics. As an extremely simple example, given a set of documents where the only words that occur within them are *eat*, *sleep*, *play*, *meow*, and *bark*, LDA might produce topics like the following:


| **Topic** | *eat* | *sleep*  | *play* | *meow* | *bark* | 
| --- | --- | --- | --- | --- | --- | 
| Topic 1  | 0.1  | 0.3  | 0.2  | 0.4  | 0.0  | 
| Topic 2  | 0.2  | 0.1 | 0.4  | 0.0  | 0.3  | 

You can infer that documents that are more likely to fall into Topic 1 are about cats (who are more likely to *meow* and *sleep*), and documents that fall into Topic 2 are about dogs (who prefer to *play* and *bark*). These topics can be found even though the words dog and cat never appear in any of the texts. 
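To make this concrete, the following sketch scores a small bag of words against each topic. The vocabulary and probabilities are taken from the table above; the sample document and the scoring helper are illustrative, not part of the SageMaker AI API:

```python
import numpy as np

# Topic-word probabilities from the table above
vocab = ["eat", "sleep", "play", "meow", "bark"]
topics = np.array([
    [0.1, 0.3, 0.2, 0.4, 0.0],  # Topic 1 (cat-like)
    [0.2, 0.1, 0.4, 0.0, 0.3],  # Topic 2 (dog-like)
])

def doc_log_likelihood(words, topic):
    """Log-likelihood of a bag of words under a single topic."""
    idx = [vocab.index(w) for w in words]
    # A small epsilon avoids log(0) for words a topic never generates
    return float(np.sum(np.log(topic[idx] + 1e-12)))

doc = ["meow", "sleep", "meow", "eat"]
ll_topic1 = doc_log_likelihood(doc, topics[0])
ll_topic2 = doc_log_likelihood(doc, topics[1])
# The cat-like topic assigns this document the higher likelihood
```

Because Topic 2 assigns *meow* probability 0.0, a document containing *meow* is far more likely under Topic 1.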

**Topics**
+ [Choosing between Latent Dirichlet Allocation (LDA) and Neural Topic Model (NTM)](#lda-or-ntm)
+ [Input/Output Interface for the LDA Algorithm](#lda-inputoutput)
+ [EC2 Instance Recommendation for the LDA Algorithm](#lda-instances)
+ [LDA Sample Notebooks](#LDA-sample-notebooks)
+ [How LDA Works](lda-how-it-works.md)
+ [LDA Hyperparameters](lda_hyperparameters.md)
+ [Tune an LDA Model](lda-tuning.md)

## Choosing between Latent Dirichlet Allocation (LDA) and Neural Topic Model (NTM)
<a name="lda-or-ntm"></a>

Topic models are commonly used to produce topics from corpuses that (1) coherently encapsulate semantic meaning and (2) describe documents well. As such, topic models aim to minimize perplexity and maximize topic coherence. 

Perplexity is an intrinsic language modeling evaluation metric that measures the inverse of the geometric mean per-word likelihood in your test data. A lower perplexity score indicates better generalization performance. Research has shown that per-word likelihood often does not align with human judgment and can be entirely uncorrelated with it; topic coherence was introduced to address this. Each inferred topic from your model consists of words, and topic coherence is computed over the top N words for that particular topic. It is often defined as the average or median of the pairwise word-similarity scores of the words in that topic, e.g., Pointwise Mutual Information (PMI). A promising model generates coherent topics, that is, topics with high topic coherence scores. 
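As an illustration of the idea (not the exact coherence measure any particular library uses), the following sketch computes the average pairwise PMI of a topic's top words over a set of documents, each represented as the set of words it contains. The helper and data are hypothetical:

```python
import math
from itertools import combinations

def pmi_coherence(top_words, docs):
    """Average pairwise PMI of a topic's top-N words over a document set.

    top_words: the top words of one inferred topic.
    docs: each document represented as the set of words it contains.
    (Illustrative helper, not part of the SageMaker AI API.)
    """
    n = len(docs)
    def p(*words):
        # Fraction of documents containing all of the given words
        return sum(1 for d in docs if all(w in d for w in words)) / n
    scores = []
    for w1, w2 in combinations(top_words, 2):
        joint = p(w1, w2)
        # Pairs that never co-occur contribute the minimum score here
        scores.append(math.log(joint / (p(w1) * p(w2))) if joint > 0 else 0.0)
    return sum(scores) / len(scores)

docs = [{"cat", "meow"}, {"cat", "meow", "sleep"}, {"dog", "bark"}]
coherent = pmi_coherence(["cat", "meow"], docs)    # words that co-occur
incoherent = pmi_coherence(["dog", "meow"], docs)  # words that never do
```

Topics whose top words frequently co-occur in documents score higher, matching the intuition that they "hang together" semantically.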

While the objective is to train a topic model that minimizes perplexity and maximizes topic coherence, there is often a tradeoff between the two with both LDA and NTM. Recent research by Amazon (Ding et al., 2018) has shown that NTM is promising for achieving high topic coherence, but LDA trained with collapsed Gibbs sampling achieves better perplexity. From a practical standpoint regarding hardware and compute power, SageMaker NTM is more flexible than LDA and can scale better, because NTM can run on CPU and GPU and can be parallelized across multiple GPU instances, whereas LDA supports only single-instance CPU training. 

## Input/Output Interface for the LDA Algorithm
<a name="lda-inputoutput"></a>

LDA expects data to be provided on the train channel, and optionally supports a test channel, which is scored by the final model. LDA supports both `recordIO-wrapped-protobuf` (dense and sparse) and `CSV` file formats. For `CSV`, the data must be dense and have dimension equal to *number of records* \* *vocabulary size*. LDA can be trained in File or Pipe mode when using `recordIO-wrapped-protobuf`, but only in File mode for the `CSV` format.
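As a minimal sketch of the CSV layout (the count matrix here is made up), each record is one dense row of word counts, with one column per vocabulary word:

```python
import io
import numpy as np

# Hypothetical word-count matrix: 3 documents x 5 vocabulary words.
# Each CSV record is one dense row of counts, so the file has dimension
# number of records x vocabulary size.
counts = np.array([
    [2, 1, 0, 3, 0],
    [0, 0, 4, 0, 2],
    [1, 2, 1, 1, 0],
])

buf = io.StringIO()
np.savetxt(buf, counts, fmt="%d", delimiter=",")
csv_payload = buf.getvalue()  # ready to write out as train channel data
```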

For inference, `text/csv`, `application/json`, and `application/x-recordio-protobuf` content types are supported. Sparse data can also be passed for `application/json` and `application/x-recordio-protobuf`. LDA inference returns `application/json` or `application/x-recordio-protobuf` *predictions*, which include the `topic_mixture` vector for each observation.

See the [LDA Sample Notebooks](#LDA-sample-notebooks) for more detail on training and inference formats.

## EC2 Instance Recommendation for the LDA Algorithm
<a name="lda-instances"></a>

LDA currently only supports single-instance CPU training. CPU instances are recommended for hosting/inference.

## LDA Sample Notebooks
<a name="LDA-sample-notebooks"></a>

For a sample notebook that shows how to train the SageMaker AI Latent Dirichlet Allocation algorithm on a dataset and then deploy the trained model to perform inferences about the topic mixtures in input documents, see [An Introduction to SageMaker AI LDA](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/lda_topic_modeling/LDA-Introduction.html). For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, select the **SageMaker AI Examples** tab to see a list of all the SageMaker AI samples. The topic modeling example notebooks that use the LDA algorithm are located in the **Introduction to Amazon algorithms** section. To open a notebook, choose its **Use** tab and select **Create copy**.

# How LDA Works
<a name="lda-how-it-works"></a>

Amazon SageMaker AI LDA is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of different categories. These categories are themselves a probability distribution over the features. LDA is a generative probabilistic model, which means it attempts to model the joint distribution of inputs and outputs in terms of latent variables. This is in contrast to discriminative models, which attempt to learn only how inputs map to outputs.

You can use LDA for a variety of tasks, from clustering customers based on product purchases to automatic harmonic analysis in music. However, it is most commonly associated with topic modeling in text corpuses. Observations are referred to as documents. The feature set is referred to as vocabulary. A feature is referred to as a word. And the resulting categories are referred to as topics.

**Note**  
Lemmatization significantly increases algorithm performance and accuracy. Consider pre-processing any input text data. For more information, see [Stemming and lemmatization](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html).

An LDA model is defined by two parameters:
+ α—A prior estimate of topic probability (in other words, the average frequency with which each topic occurs within a given document). 
+ β—A collection of *k* topics, where each topic is given a probability distribution over the vocabulary used in the document corpus, also called a "topic-word distribution."

LDA is a "bag-of-words" model, which means that the order of words does not matter. LDA is a generative model where each document is generated word-by-word by choosing a topic mixture θ ∼ Dirichlet(α). 

For each word in the document: 
+ Choose a topic *z* ∼ Multinomial(θ). 
+ Choose the corresponding topic-word distribution *β_z*. 
+ Draw a word *w* ∼ Multinomial(*β_z*). 
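The generative process above can be sketched as follows. This is illustrative only: SageMaker AI LDA estimates α and β from data; it does not expose this sampler, and the β matrix here is made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_document(alpha, beta, num_words):
    """Sample one document from the generative process described above.

    alpha: Dirichlet prior over the k topics.
    beta:  k x V matrix of topic-word distributions.
    Returns a list of word indices into the vocabulary.
    """
    theta = rng.dirichlet(alpha)  # topic mixture for this document
    words = []
    for _ in range(num_words):
        z = rng.choice(len(alpha), p=theta)       # choose a topic
        w = rng.choice(beta.shape[1], p=beta[z])  # draw a word from beta_z
        words.append(int(w))
    return words

beta = np.array([[0.1, 0.3, 0.2, 0.4, 0.0],
                 [0.2, 0.1, 0.4, 0.0, 0.3]])
doc = generate_document(alpha=np.array([0.5, 0.5]), beta=beta, num_words=20)
```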

When training the model, the goal is to find parameters α and β, which maximize the probability that the text corpus is generated by the model.

The most popular methods for estimating the LDA model use Gibbs sampling or Expectation Maximization (EM) techniques. The Amazon SageMaker AI LDA uses tensor spectral decomposition. This provides several advantages:
+  **Theoretical guarantees on results**. The standard EM-method is guaranteed to converge only to local optima, which are often of poor quality. 
+  **Embarrassingly parallelizable**. The work can be trivially divided over input documents in both training and inference. The EM-method and Gibbs Sampling approaches can be parallelized, but not as easily. 
+  **Fast**. Although the EM-method has a low iteration cost, it is prone to slow convergence. Gibbs sampling is also subject to slow convergence and requires a large number of samples. 

At a high-level, the tensor decomposition algorithm follows this process:

1.  The goal is to calculate the spectral decomposition of a **V** x **V** x **V** tensor, which summarizes the moments of the documents in our corpus. **V** is vocabulary size (in other words, the number of distinct words in all of the documents). The spectral components of this tensor are the LDA parameters α and β, which maximize the overall likelihood of the document corpus. However, because vocabulary size tends to be large, this **V** x **V** x **V** tensor is prohibitively large to store in memory. 

1.  Instead, it uses a **V** x **V** moment matrix, which is the two-dimensional analog of the tensor from step 1, to find a whitening matrix of dimension **V** x **k**. This matrix can be used to convert the **V** x **V** moment matrix into a **k** x **k** identity matrix. **k** is the number of topics in the model. 

1.  This same whitening matrix can then be used to find a smaller **k** x **k** x **k** tensor. When spectrally decomposed, this tensor has components that have a simple relationship with the components of the **V** x **V** x **V** tensor. 

1.  *Alternating Least Squares* is used to decompose the smaller **k** x **k** x **k** tensor. This provides a substantial improvement in memory consumption and speed. The parameters α and β can be found by “unwhitening” these outputs in the spectral decomposition. 
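Steps 2 and 3 hinge on the whitening matrix. The following sketch uses a random low-rank matrix as a stand-in for the **V** x **V** moment matrix (the real one is built from document statistics) and shows how a **V** x **k** whitening matrix turns it into a **k** x **k** identity:

```python
import numpy as np

rng = np.random.default_rng(1)

# Random rank-k stand-in for the symmetric V x V moment matrix of step 2
V, k = 6, 2
A = rng.normal(size=(V, k))
M2 = A @ A.T

# Whitening matrix W has shape V x k and satisfies W.T @ M2 @ W = I_k.
# Build it from the k leading eigenpairs of M2.
eigvals, eigvecs = np.linalg.eigh(M2)
top = np.argsort(eigvals)[-k:]
W = eigvecs[:, top] / np.sqrt(eigvals[top])

identity = W.T @ M2 @ W  # approximately the k x k identity matrix
```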

After the LDA model’s parameters have been found, you can find the topic mixtures for each document. You use stochastic gradient descent to maximize the likelihood function of observing a given topic mixture corresponding to these data.

Topic quality can be improved by increasing the number of topics to look for in training and then filtering out poor quality ones. This is in fact done automatically in SageMaker AI LDA: 25% more topics are computed and only the ones with largest associated Dirichlet priors are returned. To perform further topic filtering and analysis, you can increase the topic count and modify the resulting LDA model as follows:

```
import mxnet as mx
alpha, beta = mx.ndarray.load('model.tar.gz')
# modify alpha and beta
mx.ndarray.save('new_model.tar.gz', [new_alpha, new_beta])
# upload to S3 and create a new SageMaker AI model using the console
```

For more information about algorithms for LDA and the SageMaker AI implementation, see the following:
+ Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M Kakade, and Matus Telgarsky. *Tensor Decompositions for Learning Latent Variable Models*, Journal of Machine Learning Research, 15:2773–2832, 2014.
+  David M Blei, Andrew Y Ng, and Michael I Jordan. *Latent Dirichlet Allocation*. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.
+  Thomas L Griffiths and Mark Steyvers. *Finding Scientific Topics*. Proceedings of the National Academy of Sciences, 101(suppl 1):5228–5235, 2004. 
+  Tamara G Kolda and Brett W Bader. *Tensor Decompositions and Applications*. SIAM Review, 51(3):455–500, 2009. 

# LDA Hyperparameters
<a name="lda_hyperparameters"></a>

In the `CreateTrainingJob` request, you specify the training algorithm. You can also specify algorithm-specific hyperparameters as string-to-string maps. The following table lists the hyperparameters for the LDA training algorithm provided by Amazon SageMaker AI. For more information, see [How LDA Works](lda-how-it-works.md).


| Parameter Name | Description | 
| --- | --- | 
| num\_topics |  The number of topics for LDA to find within the data. **Required** Valid values: positive integer  | 
| feature\_dim |  The size of the vocabulary of the input document corpus. **Required** Valid values: positive integer  | 
| mini\_batch\_size |  The total number of documents in the input document corpus. **Required** Valid values: positive integer  | 
| alpha0 |  Initial guess for the concentration parameter: the sum of the elements of the Dirichlet prior. Small values are more likely to generate sparse topic mixtures and large values (greater than 1.0) produce more uniform mixtures.  **Optional** Valid values: Positive float Default value: 1.0  | 
| max\_restarts |  The number of restarts to perform during the Alternating Least Squares (ALS) spectral decomposition phase of the algorithm. Can be used to find better quality local minima at the expense of additional computation, but typically should not be adjusted.  **Optional** Valid values: Positive integer Default value: 10  | 
| max\_iterations |  The maximum number of iterations to perform during the ALS phase of the algorithm. Can be used to find better quality minima at the expense of additional computation, but typically should not be adjusted.  **Optional** Valid values: Positive integer Default value: 1000  | 
| tol |  Target error tolerance for the ALS phase of the algorithm. Can be used to find better quality minima at the expense of additional computation, but typically should not be adjusted.  **Optional** Valid values: Positive float Default value: 1e-8  | 
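For example, in the `CreateTrainingJob` request these hyperparameters are passed as a string-to-string map; the values below are illustrative:

```
"HyperParameters": {
    "num_topics": "20",
    "feature_dim": "5000",
    "mini_batch_size": "25000",
    "alpha0": "1.0"
}
```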

# Tune an LDA Model
<a name="lda-tuning"></a>

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the objective metric.

LDA is an unsupervised topic modeling algorithm that attempts to describe a set of observations (documents) as a mixture of different categories (topics). The “per-word log-likelihood” (PWLL) metric measures the likelihood that a learned set of topics (an LDA model) accurately describes a test document dataset. Larger values of PWLL indicate that the test data is more likely to be described by the LDA model.
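In notation, for a test corpus of M documents containing N words in total, per-word log-likelihood takes the following form (a standard definition; the exact computation in SageMaker AI may differ in detail):

```
PWLL = (1/N) Σ_{d=1..M} log p(w_d | α, β)
```

where p(w_d | α, β) is the likelihood of document w_d under the learned model. Dividing by the total word count N makes the metric comparable across test sets of different sizes.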

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics Computed by the LDA Algorithm
<a name="lda-metrics"></a>

The LDA algorithm reports on a single metric during training: `test:pwll`. When tuning a model, choose this metric as the objective metric.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| test:pwll | Per-word log-likelihood on the test dataset. The likelihood that the test dataset is accurately described by the learned LDA model. | Maximize | 

## Tunable LDA Hyperparameters
<a name="lda-tunable-hyperparameters"></a>

You can tune the following hyperparameters for the LDA algorithm. Both hyperparameters, `alpha0` and `num_topics`, can affect the LDA objective metric (`test:pwll`). If you don't already know the optimal values for these hyperparameters, which maximize per-word log-likelihood and produce an accurate LDA model, automatic model tuning can help find them.


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| alpha0 | ContinuousParameterRanges | MinValue: 0.1, MaxValue: 10 | 
| num\_topics | IntegerParameterRanges | MinValue: 1, MaxValue: 150 | 
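In a `CreateHyperParameterTuningJob` request, these ranges take the following form (values taken from the recommended ranges above):

```
"ParameterRanges": {
    "ContinuousParameterRanges": [
        { "Name": "alpha0", "MinValue": "0.1", "MaxValue": "10" }
    ],
    "IntegerParameterRanges": [
        { "Name": "num_topics", "MinValue": "1", "MaxValue": "150" }
    ]
}
```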