

# BlazingText algorithm

The Amazon SageMaker AI BlazingText algorithm provides highly optimized implementations of the Word2vec and text classification algorithms. The Word2vec algorithm is useful for many downstream natural language processing (NLP) tasks, such as sentiment analysis, named entity recognition, and machine translation. Text classification is an important task for applications that perform web searches, information retrieval, ranking, and document classification.

The Word2vec algorithm maps words to high-quality distributed vectors. The resulting vector representation of a word is called a *word embedding*. Words that are semantically similar correspond to vectors that are close together. That way, word embeddings capture the semantic relationships between words. 

Many natural language processing (NLP) applications learn word embeddings by training on large collections of documents. These pretrained vector representations provide information about semantics and word distributions that typically improves the generalizability of other models that are later trained on a more limited amount of data. Most implementations of the Word2vec algorithm are not optimized for multi-core CPU architectures. This makes it difficult to scale to large datasets. 

With the BlazingText algorithm, you can scale to large datasets easily. Similar to Word2vec, it provides the Skip-gram and continuous bag-of-words (CBOW) training architectures. BlazingText's implementation of the supervised multi-class, multi-label text classification algorithm extends the fastText text classifier to use GPU acceleration with custom [CUDA](https://docs.nvidia.com/cuda/index.html) kernels. You can train a model on more than a billion words in a couple of minutes using a multi-core CPU or a GPU, and achieve performance on par with state-of-the-art deep learning text classification algorithms.

The BlazingText algorithm is not parallelizable. For more information on parameters related to training, see [Docker Registry Paths for SageMaker Built-in Algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html).

The SageMaker AI BlazingText algorithm provides the following features:
+ Accelerated training of the fastText text classifier on multi-core CPUs or a GPU and Word2Vec on GPUs using highly optimized CUDA kernels. For more information, see [BlazingText: Scaling and Accelerating Word2Vec using Multiple GPUs](https://dl.acm.org/citation.cfm?doid=3146347.3146354).
+ [Enriched Word Vectors with Subword Information](https://arxiv.org/abs/1607.04606) by learning vector representations for character n-grams. This approach enables BlazingText to generate meaningful vectors for out-of-vocabulary (OOV) words by representing their vectors as the sum of the character n-gram (subword) vectors.
+ A `batch_skipgram` mode for the Word2Vec algorithm that allows faster training and distributed computation across multiple CPU nodes. The `batch_skipgram` mode does mini-batching using the Negative Sample Sharing strategy to convert level-1 BLAS operations into level-3 BLAS operations. This efficiently leverages the multiply-add instructions of modern architectures. For more information, see [Parallelizing Word2Vec in Shared and Distributed Memory](https://arxiv.org/pdf/1604.04661.pdf).
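The subword idea above can be sketched in a few lines of Python. This is an illustrative sketch, not BlazingText's actual implementation: it extracts boundary-marked character n-grams (using fastText-style defaults of 3 to 6 characters) and sums per-n-gram vectors drawn from a hash-bucket table. The hash function and the toy vector table here are stand-ins; the real algorithm uses its own hashing scheme and trained bucket vectors.

```python
import hashlib

def char_ngrams(word, min_n=3, max_n=6):
    """Boundary-marked character n-grams, fastText-style."""
    marked = f"<{word}>"
    return [marked[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(marked) - n + 1)]

def oov_vector(word, bucket_vectors, buckets=2_000_000, dim=4):
    """Sum the (hashed) subword vectors to embed an out-of-vocabulary word."""
    total = [0.0] * dim
    for gram in char_ngrams(word):
        # Illustrative hash; BlazingText/fastText use a different hashing scheme.
        idx = int(hashlib.md5(gram.encode()).hexdigest(), 16) % buckets
        vec = bucket_vectors.get(idx, [0.0] * dim)
        total = [t + v for t, v in zip(total, vec)]
    return total
```

With real trained bucket vectors, an OOV word (for example, a misspelling) lands near its in-vocabulary neighbors because the two share most of their character n-grams.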

To summarize, the following modes are supported by BlazingText on different instance types:


| Modes |  Word2Vec (Unsupervised Learning)  |  Text Classification (Supervised Learning)  | 
| --- | --- | --- | 
|  Single CPU instance  |  `cbow`, `skipgram`, `batch_skipgram`  |  `supervised`  | 
|  Single GPU instance (with 1 or more GPUs)  |  `cbow`, `skipgram`  |  `supervised` with one GPU  | 
|  Multiple CPU instances  |  `batch_skipgram`  |  None  | 

For more information about the mathematics behind BlazingText, see [BlazingText: Scaling and Accelerating Word2Vec using Multiple GPUs](https://dl.acm.org/citation.cfm?doid=3146347.3146354).

**Topics**
+ [Input/Output Interface for the BlazingText Algorithm](#bt-inputoutput)
+ [EC2 Instance Recommendation for the BlazingText Algorithm](#blazingtext-instances)
+ [BlazingText Sample Notebooks](#blazingtext-sample-notebooks)
+ [BlazingText Hyperparameters](blazingtext_hyperparameters.md)
+ [Tune a BlazingText Model](blazingtext-tuning.md)

## Input/Output Interface for the BlazingText Algorithm


The BlazingText algorithm expects a single preprocessed text file with space-separated tokens. Each line in the file should contain a single sentence. If you need to train on multiple text files, concatenate them into one file and upload the file in the respective channel.
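As a concrete illustration of this input format, the following Python sketch (the function name and the tokenization rule are mine, not part of the algorithm) lowercases text, splits punctuation into separate tokens, and produces one space-separated sentence per line, matching the examples later in this topic:

```python
import re

def to_training_line(sentence):
    """Lowercase, then emit alphanumeric runs and punctuation marks as separate tokens."""
    tokens = re.findall(r"[a-z0-9]+|[^\sa-z0-9]", sentence.lower())
    return " ".join(tokens)

sentences = [
    "Linux ready for prime time, Intel says.",
    "Bowled by the slower one again!",
]
# corpus is the content you would upload as the single training file
corpus = "\n".join(to_training_line(s) for s in sentences) + "\n"
```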

### Training and Validation Data Format


#### Training and Validation Data Format for the Word2Vec Algorithm


For Word2Vec training, upload the file under the *train* channel. No other channels are supported. The file should contain a training sentence per line.

#### Training and Validation Data Format for the Text Classification Algorithm


For supervised mode, you can train with file mode or with the augmented manifest text format.

##### Train with File Mode

For `supervised` mode, the training/validation file should contain a training sentence per line along with the labels. Labels are words that are prefixed by the string `__label__`. Here is an example of a training/validation file:

```
__label__4  linux ready for prime time , intel says , despite all the linux hype , the open-source movement has yet to make a huge splash in the desktop market . that may be about to change , thanks to chipmaking giant intel corp .

__label__2  bowled by the slower one again , kolkata , november 14 the past caught up with sourav ganguly as the indian skippers return to international cricket was short lived .
```

**Note**  
The order of labels within the sentence doesn't matter. 

Upload the training file under the train channel, and optionally upload the validation file under the validation channel.
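To produce lines in this format from labeled data, a small helper like the following can be used (the helper is illustrative and not part of any SageMaker SDK); it prefixes each label with `__label__` and prepends the labels to a pre-tokenized sentence:

```python
def to_supervised_line(labels, sentence):
    """Prepend __label__-prefixed labels to a pre-tokenized sentence."""
    tags = " ".join(f"__label__{label}" for label in labels)
    return f"{tags}  {sentence}"

line = to_supervised_line([4], "linux ready for prime time , intel says")
# line == "__label__4  linux ready for prime time , intel says"
```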

##### Train with Augmented Manifest Text Format

Supervised mode for CPU instances also supports the augmented manifest format, which enables you to train in Pipe mode without creating RecordIO files. With this format, you generate an S3 manifest file that contains the list of sentences and their corresponding labels. The manifest file should be in [JSON Lines](http://jsonlines.org/) format, in which each line represents one sample. Specify sentences with the `source` tag and labels with the `label` tag. Provide both the `source` and `label` tags under the `AttributeNames` parameter value in the request.

```
{"source":"linux ready for prime time , intel says , despite all the linux hype", "label":1}
{"source":"bowled by the slower one again , kolkata , november 14 the past caught up with sourav ganguly", "label":2}
```

Multi-label training is also supported by specifying a JSON array of labels.

```
{"source":"linux ready for prime time , intel says , despite all the linux hype", "label": [1, 3]}
{"source":"bowled by the slower one again , kolkata , november 14 the past caught up with sourav ganguly", "label": [2, 4, 5]}
```

For more information on augmented manifest files, see [Augmented Manifest Files for Training Jobs](augmented-manifest.md).
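A manifest like the ones above can be emitted with a short helper. In this sketch (the function name is mine), single-label samples get a bare integer and multi-label samples get a JSON array, matching the two example formats shown earlier:

```python
import json

def manifest_lines(samples):
    """samples: iterable of (source_text, list_of_labels) pairs -> JSON Lines."""
    lines = []
    for source, labels in samples:
        # A single int for one label, a JSON array for multi-label samples
        label = labels[0] if len(labels) == 1 else list(labels)
        lines.append(json.dumps({"source": source, "label": label}))
    return lines

body = "\n".join(manifest_lines([
    ("linux ready for prime time , intel says", [1]),
    ("bowled by the slower one again , kolkata", [2, 4]),
]))
```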

### Model Artifacts and Inference


#### Model Artifacts for the Word2Vec Algorithm


For Word2Vec training, the model artifacts consist of *vectors.txt*, which contains the word-to-vector mapping, and *vectors.bin*, a binary used by BlazingText for hosting, inference, or both. *vectors.txt* stores the vectors in a format that is compatible with other tools, such as Gensim and spaCy. For example, a Gensim user can run the following commands to load the *vectors.txt* file:

```
from gensim.models import KeyedVectors

# Load the text-format vectors produced by BlazingText
word_vectors = KeyedVectors.load_word2vec_format('vectors.txt', binary=False)

# Query the loaded embeddings
word_vectors.most_similar(positive=['woman', 'king'], negative=['man'])
word_vectors.doesnt_match("breakfast cereal dinner lunch".split())
```

If the `evaluation` hyperparameter is set to `True`, an additional file, *eval.json*, is created. This file contains the similarity evaluation results (using Spearman's rank correlation coefficients) on the WS-353 dataset. The number of words from the WS-353 dataset that are not present in the training corpus is also reported.

For inference requests, the model accepts a JSON file containing a list of strings and returns a list of vectors. If a word is not found in the vocabulary, inference returns a vector of zeros. If `subwords` is set to `True` during training, the model can generate vectors for out-of-vocabulary (OOV) words.

##### Sample JSON Request


MIME type: `application/json`

```
{
"instances": ["word1", "word2", "word3"]
}
```

#### Model Artifacts for the Text Classification Algorithm


Training with supervised outputs creates a *model.bin* file that can be consumed by BlazingText hosting. For inference, the BlazingText model accepts a JSON file containing a list of sentences and returns a list of corresponding predicted labels and probability scores. Each sentence is expected to be a string with space-separated tokens, words, or both.

##### Sample JSON Request


MIME type: `application/json`

```
{
 "instances": ["the movie was excellent", "i did not like the plot ."]
}
```

By default, the server returns only one prediction, the one with the highest probability. To retrieve the top *k* predictions, you can set *k* in the configuration, as follows:

```
{
 "instances": ["the movie was excellent", "i did not like the plot ."],
 "configuration": {"k": 2}
}
```
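Building this request body programmatically is straightforward. The following sketch (the helper name is mine) produces the JSON body shown above; the resulting string could be passed as the payload of an `invoke_endpoint` call against a deployed BlazingText endpoint:

```python
import json

def classify_request(sentences, k=None):
    """JSON body for a real-time text-classification request."""
    body = {"instances": sentences}
    if k is not None:
        # Request the top-k predictions instead of only the most probable one
        body["configuration"] = {"k": k}
    return json.dumps(body)

payload = classify_request(["the movie was excellent",
                            "i did not like the plot ."], k=2)
```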

For BlazingText, the `content-type` and `accept` parameters must be equal. For batch transform, they both need to be `application/jsonlines`. If they differ, the `Accept` field is ignored. The format for input follows:

```
content-type: application/jsonlines

{"source": "source_0"}
{"source": "source_1"}
```

To pass the value of *k* for top-*k* prediction, include it on each line:

```
{"source": "source_0", "k": 2}
{"source": "source_1", "k": 3}
```

The format for output follows:

```
accept: application/jsonlines

{"prob": [prob_1], "label": ["__label__1"]}
{"prob": [prob_1], "label": ["__label__1"]}
```

If you pass a value of *k* greater than 1, each response line contains the top *k* labels and probabilities:

```
{"prob": [prob_1, prob_2], "label": ["__label__1", "__label__2"]}
{"prob": [prob_1, prob_2], "label": ["__label__1", "__label__2"]}
```
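Back on the client side, each output line can be parsed into (label, probability) pairs. A minimal sketch, assuming well-formed JSON Lines output (the function name is mine):

```python
import json

def parse_predictions(jsonlines_body):
    """Pair the parallel `label` and `prob` lists in each output line."""
    results = []
    for line in jsonlines_body.strip().splitlines():
        record = json.loads(line)
        results.append(list(zip(record["label"], record["prob"])))
    return results

# Example: one input sentence, top-2 predictions requested
parse_predictions('{"prob": [0.92, 0.05], "label": ["__label__1", "__label__2"]}')
```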

For both supervised (text classification) and unsupervised (Word2Vec) modes, the binaries (*\*.bin*) produced by BlazingText can be cross-consumed by fastText, and vice versa: you can use binaries produced by BlazingText with fastText, and you can host model binaries created with fastText using BlazingText.

Here is an example of how to use a model generated with BlazingText with fastText:

```
# Download the model artifact from S3
aws s3 cp s3://<YOUR_S3_BUCKET>/<PREFIX>/model.tar.gz model.tar.gz

# Unpack the model archive
tar -xzf model.tar.gz

# Use the unpacked model binary with fastText
fasttext predict ./model.bin test.txt
```

However, the binaries are supported only when training on a CPU or a single GPU; training on multiple GPUs does not produce binaries.

## EC2 Instance Recommendation for the BlazingText Algorithm


For `cbow` and `skipgram` modes, BlazingText supports single CPU and single GPU instances. Both of these modes support learning of subword embeddings. To achieve the highest speed without compromising accuracy, we recommend that you use an ml.p3.2xlarge instance.

For `batch_skipgram` mode, BlazingText supports single or multiple CPU instances. When training on multiple instances, set the value of the `S3DataDistributionType` field of the [S3DataSource](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_S3DataSource.html) object that you pass to [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) to `FullyReplicated`. BlazingText takes care of distributing data across machines.

For the supervised text classification mode, a C5 instance is recommended if the training dataset is less than 2 GB. For larger datasets, use an instance with a single GPU. BlazingText supports P2, P3, G4dn, and G5 instances for training and inference.

## BlazingText Sample Notebooks

For a sample notebook that trains and deploys the SageMaker AI BlazingText algorithm to generate word vectors, see [Learning Word2Vec Word Representations using BlazingText](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/blazingtext_word2vec_text8/blazingtext_word2vec_text8.html). For instructions for creating and accessing Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After creating and opening a notebook instance, choose the **SageMaker AI Examples** tab to see a list of all the SageMaker AI examples. The example notebooks that use BlazingText are located in the **Introduction to Amazon algorithms** section. To open a notebook, choose its **Use** tab, then choose **Create copy**.

# BlazingText Hyperparameters

When you start a training job with a `CreateTrainingJob` request, you specify a training algorithm. You can also specify algorithm-specific hyperparameters as string-to-string maps. The hyperparameters for the BlazingText algorithm depend on which mode you use: Word2Vec (unsupervised) or Text Classification (supervised).

## Word2Vec Hyperparameters

The following table lists the hyperparameters for the BlazingText Word2Vec training algorithm provided by Amazon SageMaker AI.


| Parameter Name | Description | 
| --- | --- | 
| mode |  The Word2vec architecture used for training. **Required** Valid values: `batch_skipgram`, `skipgram`, or `cbow`  | 
| batch\_size |  The size of each batch when `mode` is set to `batch_skipgram`. Set to a number between 10 and 20. **Optional** Valid values: Positive integer Default value: 11  | 
| buckets |  The number of hash buckets to use for subwords. **Optional** Valid values: Positive integer Default value: 2000000  | 
| epochs |  The number of complete passes through the training data. **Optional** Valid values: Positive integer Default value: 5  | 
| evaluation |  Whether the trained model is evaluated using the [WordSimilarity-353 Test](http://www.gabrilovich.com/resources/data/wordsim353/wordsim353.html). **Optional** Valid values: (Boolean) `True` or `False` Default value: `True`  | 
| learning\_rate |  The step size used for parameter updates. **Optional** Valid values: Positive float Default value: 0.05  | 
| min\_char |  The minimum number of characters to use for subwords/character n-grams. **Optional** Valid values: Positive integer Default value: 3  | 
| min\_count |  Words that appear less than `min_count` times are discarded. **Optional** Valid values: Non-negative integer Default value: 5  | 
| max\_char |  The maximum number of characters to use for subwords/character n-grams. **Optional** Valid values: Positive integer Default value: 6  | 
| negative\_samples |  The number of negative samples for the negative sample sharing strategy. **Optional** Valid values: Positive integer Default value: 5  | 
| sampling\_threshold |  The threshold for the occurrence of words. Words that appear with higher frequency in the training data are randomly down-sampled. **Optional** Valid values: Positive fraction. The recommended range is (0, 1e-3]. Default value: 0.0001  | 
| subwords |  Whether to learn subword embeddings or not. **Optional** Valid values: (Boolean) `True` or `False` Default value: `False`  | 
| vector\_dim |  The dimension of the word vectors that the algorithm learns. **Optional** Valid values: Positive integer Default value: 100  | 
| window\_size |  The size of the context window. The context window is the number of words surrounding the target word used for training. **Optional** Valid values: Positive integer Default value: 5  | 
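Because `CreateTrainingJob` takes hyperparameters as a string-to-string map, numeric values are passed as strings. A hypothetical map for `batch_skipgram` training, using several of the parameters above:

```python
# Hypothetical hyperparameter map for a batch_skipgram training job; every
# value is a string, as required by CreateTrainingJob's string-to-string map.
hyperparameters = {
    "mode": "batch_skipgram",
    "epochs": "5",
    "min_count": "5",
    "learning_rate": "0.05",
    "vector_dim": "100",
    "window_size": "5",
    "batch_size": "11",
    "negative_samples": "5",
}
```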

## Text Classification Hyperparameters

The following table lists the hyperparameters for the Text Classification training algorithm provided by Amazon SageMaker AI.

**Note**  
Although some of the parameters are common between the Text Classification and Word2Vec modes, they might have different meanings depending on the context.


| Parameter Name | Description | 
| --- | --- | 
| mode |  The training mode. **Required** Valid values: `supervised`  | 
| buckets |  The number of hash buckets to use for word n-grams. **Optional** Valid values: Positive integer Default value: 2000000  | 
| early\_stopping |  Whether to stop training if validation accuracy doesn't improve after a `patience` number of epochs. Note that a validation channel is required if early stopping is used. **Optional** Valid values: (Boolean) `True` or `False` Default value: `False`  | 
| epochs |  The maximum number of complete passes through the training data. **Optional** Valid values: Positive integer Default value: 5  | 
| learning\_rate |  The step size used for parameter updates. **Optional** Valid values: Positive float Default value: 0.05  | 
| min\_count |  Words that appear less than `min_count` times are discarded. **Optional** Valid values: Non-negative integer Default value: 5  | 
| min\_epochs |  The minimum number of epochs to train before early stopping logic is invoked. **Optional** Valid values: Positive integer Default value: 5  | 
| patience |  The number of epochs to wait before applying early stopping when no progress is made on the validation set. Used only when `early_stopping` is `True`. **Optional** Valid values: Positive integer Default value: 4  | 
| vector\_dim |  The dimension of the embedding layer. **Optional** Valid values: Positive integer Default value: 100  | 
| word\_ngrams |  The number of word n-gram features to use. **Optional** Valid values: Positive integer Default value: 2  | 

# Tune a BlazingText Model

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the objective metric.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics Computed by the BlazingText Algorithm

The BlazingText Word2Vec algorithm (`skipgram`, `cbow`, and `batch_skipgram` modes) reports on a single metric during training: `train:mean_rho`. This metric is computed on [WS-353 word similarity datasets](https://aclweb.org/aclwiki/WordSimilarity-353_Test_Collection_(State_of_the_art)). When tuning the hyperparameter values for the Word2Vec algorithm, use this metric as the objective.

The BlazingText Text Classification algorithm (`supervised` mode) also reports a single metric during training: `validation:accuracy`. When tuning the hyperparameter values for the text classification algorithm, use this metric as the objective.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| train:mean\_rho |  The mean rho (Spearman's rank correlation coefficient) on [WS-353 word similarity datasets](http://alfonseca.org/pubs/ws353simrel.tar.gz)  |  Maximize  | 
| validation:accuracy |  The classification accuracy on the user-specified validation dataset  |  Maximize  | 

## Tunable BlazingText Hyperparameters

### Tunable Hyperparameters for the Word2Vec Algorithm

Tune an Amazon SageMaker AI BlazingText Word2Vec model with the following hyperparameters. The hyperparameters that have the greatest impact on Word2Vec objective metrics are: `mode`, `learning_rate`, `window_size`, `vector_dim`, and `negative_samples`.


| Parameter Name | Parameter Type | Recommended Ranges or Values | 
| --- | --- | --- | 
| batch\_size |  `IntegerParameterRange`  |  [8-32]  | 
| epochs |  `IntegerParameterRange`  |  [5-15]  | 
| learning\_rate |  `ContinuousParameterRange`  |  MinValue: 0.005, MaxValue: 0.01  | 
| min\_count |  `IntegerParameterRange`  |  [0-100]  | 
| mode |  `CategoricalParameterRange`  |  [`'batch_skipgram'`, `'skipgram'`, `'cbow'`]  | 
| negative\_samples |  `IntegerParameterRange`  |  [5-25]  | 
| sampling\_threshold |  `ContinuousParameterRange`  |  MinValue: 0.0001, MaxValue: 0.001  | 
| vector\_dim |  `IntegerParameterRange`  |  [32-300]  | 
| window\_size |  `IntegerParameterRange`  |  [1-10]  | 
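To see how a tuner consumes ranges like these, here is a toy random-search sketch in plain Python. The range encoding and the sampler are illustrative only; SageMaker AI's automatic model tuning uses its own search strategies and parameter-range classes.

```python
import random

# Search space mirroring the recommended ranges in the table above.
SPACE = {
    "learning_rate": ("continuous", 0.005, 0.01),
    "vector_dim": ("integer", 32, 300),
    "window_size": ("integer", 1, 10),
    "negative_samples": ("integer", 5, 25),
    "mode": ("categorical", ["batch_skipgram", "skipgram", "cbow"]),
}

def sample_combination(space, rng=random):
    """Draw one hyperparameter combination from the search space."""
    combo = {}
    for name, spec in space.items():
        kind = spec[0]
        if kind == "continuous":
            combo[name] = rng.uniform(spec[1], spec[2])
        elif kind == "integer":
            combo[name] = rng.randint(spec[1], spec[2])
        else:  # categorical
            combo[name] = rng.choice(spec[1])
    return combo
```

A real tuning job would launch one training job per sampled combination and keep the model that maximizes the objective metric (`train:mean_rho` for Word2Vec).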

### Tunable Hyperparameters for the Text Classification Algorithm

Tune an Amazon SageMaker AI BlazingText text classification model with the following hyperparameters.


| Parameter Name | Parameter Type | Recommended Ranges or Values | 
| --- | --- | --- | 
| buckets |  `IntegerParameterRange`  |  [1000000-10000000]  | 
| epochs |  `IntegerParameterRange`  |  [5-15]  | 
| learning\_rate |  `ContinuousParameterRange`  |  MinValue: 0.005, MaxValue: 0.01  | 
| min\_count |  `IntegerParameterRange`  |  [0-100]  | 
| vector\_dim |  `IntegerParameterRange`  |  [32-300]  | 
| word\_ngrams |  `IntegerParameterRange`  |  [1-3]  | 