

# Speech marks
<a name="speechmarks"></a>

*Speech marks* are metadata that describe the speech that you synthesize, such as where a sentence or word starts and ends in the audio stream. When you request speech marks for your text, Amazon Polly returns this metadata instead of synthesized speech. By using speech marks in conjunction with the synthesized speech audio stream, you can provide your applications with an enhanced visual experience. 

For example, combining the metadata with the audio stream from your text can enable you to synchronize speech with facial animation (lip-syncing) or to highlight written words as they're spoken.

Speech marks are available when using neural, long-form, or standard text-to-speech engines.

**Topics**
+ [Speech mark types](using-speechmarks.md)
+ [Visemes and Amazon Polly](viseme.md)
+ [Speech mark output](output.md)
+ [Requesting speech marks](speechmarksconsole.md)
+ [Speech marks without SSML example](sp-mks-example1.md)
+ [Speech marks with SSML example](sp-mks-example2.md)

# Speech mark types
<a name="using-speechmarks"></a>

You request speech marks using the [SpeechMarkTypes](https://docs.aws.amazon.com/polly/latest/dg/API_StartSpeechSynthesisTask.html#polly-StartSpeechSynthesisTask-request-SpeechMarkTypes) option for either the [SynthesizeSpeech](https://docs.aws.amazon.com/polly/latest/dg/API_SynthesizeSpeech.html) or [StartSpeechSynthesisTask](https://docs.aws.amazon.com/polly/latest/dg/API_StartSpeechSynthesisTask.html) operation. You specify the metadata elements that you want returned for your input text. You can request up to four types of metadata, but you must specify at least one per request. A speech mark request generates no audio output.

In the AWS CLI, for example:

```
--speech-mark-types='["sentence", "word", "viseme", "ssml"]'
```

Amazon Polly generates speech marks using the following elements:
+  **sentence** – Indicates a sentence element in the input text. 
+  **word** – Indicates a word element in the text. 
+  **viseme** – Describes the face and mouth movements corresponding to each phoneme being spoken. For more information, see [Visemes and Amazon Polly](viseme.md). 
+  **ssml** – Describes a `<mark>` element from the SSML input text. For more information, see [Generating speech from SSML documents](ssml.md).

# Visemes and Amazon Polly
<a name="viseme"></a>

A *viseme* represents the position of the face and mouth when saying a word. It is the visual equivalent of a phoneme, which is the basic acoustic unit from which a word is formed. Visemes are the basic visual building blocks of speech.

Each language has a set of visemes that correspond to its specific phonemes. In a language, each phoneme has a corresponding viseme that represents the shape that the mouth makes when forming the sound. However, a viseme can't always be mapped back to a single phoneme, because many phonemes look the same when spoken even though they sound different. For example, in English, the words "pet" and "bet" are acoustically different. However, when observed visually (without sound), they look exactly the same.

The following table shows a partial list of International Phonetic Alphabet (IPA) phonemes and Extended Speech Assessment Methods Phonetic Alphabet (X-SAMPA) symbols, along with their corresponding visemes, for US English voices.

For the complete table and tables for all available languages, see [Languages in Amazon Polly](supported-languages.md).


| IPA | X-SAMPA | Description | Example | Viseme | 
| --- | --- | --- | --- | --- | 
|  **Consonants**  | 
| b | b | Voiced bilabial plosive | **b**ed | p | 
| d | d | Voiced alveolar plosive | **d**ig | t | 
| d͡ʒ | dZ | Voiced postalveolar affricate | **j**ump | S | 
| ð | D | Voiced dental fricative | **th**en | T | 
| f | f | Voiceless labiodental fricative | **f**ive | f | 
| g | g | Voiced velar plosive | **g**ame | k | 
| h | h | Voiceless glottal fricative | **h**ouse | k | 
| ... | ... | ... | ... | ... | 
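
The many-to-one relationship between phonemes and visemes can be sketched with a small lookup built from the partial table above. This is only an illustration: the names `PARTIAL_TABLE`, `PHONEME_TO_VISEME`, and `phonemes_by_viseme` are hypothetical, and the data covers only the rows shown here, not the complete US English table.

```python
# Rows copied from the partial US English table above:
# (X-SAMPA phoneme, example word, viseme).
PARTIAL_TABLE = [
    ("b", "bed", "p"),
    ("d", "dig", "t"),
    ("dZ", "jump", "S"),
    ("D", "then", "T"),
    ("f", "five", "f"),
    ("g", "game", "k"),
    ("h", "house", "k"),
]

# Forward lookup: each phoneme maps to exactly one viseme.
PHONEME_TO_VISEME = {phoneme: viseme for phoneme, _, viseme in PARTIAL_TABLE}


def phonemes_by_viseme(table):
    """Invert the table: one viseme can stand for several phonemes."""
    grouped = {}
    for phoneme, _example, viseme in table:
        grouped.setdefault(viseme, []).append(phoneme)
    return grouped


VISEME_TO_PHONEMES = phonemes_by_viseme(PARTIAL_TABLE)

print(PHONEME_TO_VISEME["dZ"])   # S
print(VISEME_TO_PHONEMES["k"])   # ['g', 'h']
```

In these rows, the phonemes `g` and `h` share the viseme `k`, so the viseme alone does not identify which phoneme was spoken.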

# Speech mark output
<a name="output"></a>

Amazon Polly returns speech mark objects in a line-delimited JSON stream. A speech mark object contains the following fields:
+  **time** – the timestamp in milliseconds from the beginning of the corresponding audio stream
+  **type** – the type of speech mark (sentence, word, viseme, or ssml)
+  **start** – the offset in bytes (not characters) of the start of the object in the input text (not including viseme marks)
+  **end** – the offset in bytes (not characters) of the object's end in the input text (not including viseme marks) 
+  **value** – this varies depending on the type of speech mark
  +  **ssml**: the `name` attribute of the `<mark>` SSML tag
  +  **viseme**: the viseme name
  +  **word** or **sentence**: a substring of the input text, as delimited by the start and end fields

For example, Amazon Polly generates the following `word` speech mark object from the text "Mary had a little lamb":

```
{"time":373,"type":"word","start":5,"end":8,"value":"had"}
```

The described word ("had") begins 373 milliseconds after the audio stream begins, and starts at byte 5 and ends at byte 8 of the input text. 

**Note**  
This metadata is for the `Joanna` voice-id. If you use another voice with the same input text, the metadata might differ.
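
Because `start` and `end` are byte offsets, code that maps a speech mark back to the input text should slice the encoded bytes rather than the character string. The following is a minimal sketch that assumes the input text is encoded as UTF-8; the helper name `slice_by_bytes` is hypothetical.

```python
import json

text = "Mary had a little lamb"
line = '{"time":373,"type":"word","start":5,"end":8,"value":"had"}'


def slice_by_bytes(source_text, mark, encoding="utf-8"):
    """Return the part of source_text covered by a speech mark.

    Assumes the offsets count bytes of the text in the given encoding.
    """
    raw = source_text.encode(encoding)
    return raw[mark["start"]:mark["end"]].decode(encoding)


mark = json.loads(line)
print(slice_by_bytes(text, mark))  # had

# For ASCII-only text, byte and character offsets coincide. They diverge
# as soon as the text contains multi-byte characters:
word = "Caf\u00e9"
print(len(word), len(word.encode("utf-8")))  # 4 5
```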

# Requesting speech marks
<a name="speechmarksconsole"></a>

You can use the console or the `synthesize-speech` command to request speech marks from Amazon Polly. You can then view the metadata or save it to a file.

------
#### [ Console ]

**To generate speech marks on the console**

1. Sign in to the AWS Management Console and open the Amazon Polly console at [https://console.aws.amazon.com/polly/](https://console.aws.amazon.com/polly/).

1. Choose the **Text-to-Speech** tab.

1. Turn on **SSML** to use SSML.

1. Type or paste your text into the input box.

1. For **Language**, choose the language of your text.

1. For **Voice**, choose the voice you want to use.

1. To change text pronunciation, expand **Additional settings**, turn on **Customize pronunciation**, and for **Apply lexicon**, choose the desired lexicon. 

1. To verify the speech, choose **Listen**. 

1. Turn on **Speech file format settings**. 
**Note**  
Downloading MP3, OGG, PCM, Mu-law, or A-law formats will not generate speech marks.

1. For **File Format**, choose **Speech marks**. 

1. For **Speech mark types**, choose the types of speech marks to generate. The option to choose **SSML** metadata is available only when **SSML** is on. For more information on using SSML with Amazon Polly, see [Generating speech from SSML documents](ssml.md). 

1. Choose **Download**. 

------
#### [ AWS CLI ]

In addition to the input text, the following elements are required to return this metadata:
+ `output-format`

  Amazon Polly supports only the JSON format when returning speech marks. 

  ```
  --output-format json
  ```

  If you use an unsupported output format, Amazon Polly throws an exception.
+ `voice-id`

  To ensure that the metadata matches the associated audio stream, specify the same voice that is used to generate the synthesized speech audio stream. The available voices don't have identical speech rates. If you use a voice other than the one used to generate the speech, the metadata will not match the audio stream.

  ```
  --voice-id Joanna
  ```
+ `speech-mark-types`

  Specify the type or types of speech marks you want. You can request any or all of the speech mark types, but must specify at least one type.

  ```
  --speech-mark-types='["sentence", "word", "viseme", "ssml"]'
  ```
+ `text-type`

  Plain text is the default input type for Amazon Polly, so you must specify `--text-type ssml` if you want to return SSML speech marks.
+ `outfile`

  Specify the output file to which the metadata is written.

  ```
  MaryLamb.txt 
  ```

The following AWS CLI example is formatted for Unix, Linux, and macOS. For Windows, replace the backslash (\\) Unix continuation character at the end of each line with a caret (^), and use double quotation marks (") around the input text with single quotation marks (') for interior tags.

```
aws polly synthesize-speech \
  --output-format json \
  --voice-id Voice ID \
  --text 'Input text' \
  --speech-mark-types='["sentence", "word", "viseme"]' \
  outfile
```

------
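
The same request can be made from code through an AWS SDK. The following is a minimal sketch that assumes the AWS SDK for Python (Boto3), whose `synthesize_speech` call returns the speech marks in the `AudioStream` field of the response. The function name `request_speech_marks` and the `StubPollyClient` class are hypothetical; the stub returns a canned line so that the sketch runs without AWS credentials.

```python
import io
import json


def request_speech_marks(polly_client, text, voice_id, mark_types, text_type="text"):
    """Request speech marks and return them as a list of dictionaries.

    polly_client is expected to behave like boto3.client("polly").
    """
    response = polly_client.synthesize_speech(
        OutputFormat="json",          # speech marks are returned only as JSON
        VoiceId=voice_id,             # use the same voice as for the audio
        Text=text,
        TextType=text_type,           # "ssml" is required for ssml marks
        SpeechMarkTypes=mark_types,   # at least one type is required
    )
    body = response["AudioStream"].read().decode("utf-8")
    # The body is line-delimited JSON: one speech mark object per line.
    return [json.loads(line) for line in body.splitlines() if line.strip()]


class StubPollyClient:
    """Hypothetical stand-in that returns a canned response."""

    def synthesize_speech(self, **kwargs):
        self.last_request = kwargs
        payload = b'{"time":6,"type":"word","start":0,"end":4,"value":"Mary"}\n'
        return {"AudioStream": io.BytesIO(payload)}


stub = StubPollyClient()
marks = request_speech_marks(stub, "Mary", "Joanna", ["word"])
print(marks)
```

To call the service itself, pass `boto3.client("polly")` in place of the stub. That requires Boto3 to be installed and AWS credentials to be configured.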

# Speech marks without SSML example
<a name="sp-mks-example1"></a>

The following example shows what the requested metadata looks like for the simple sentence "Mary had a little lamb." For simplicity, we don't include SSML speech marks in this example.

The following AWS CLI example is formatted for Unix, Linux, and macOS. For Windows, replace the backslash (\\) Unix continuation character at the end of each line with a caret (^), and use double quotation marks (") around the input text with single quotation marks (') for interior tags.

```
aws polly synthesize-speech \
  --output-format json \
  --voice-id Joanna \
  --text 'Mary had a little lamb.' \
  --speech-mark-types='["viseme", "word", "sentence"]' \
  MaryLamb.txt
```

When you make this request, Amazon Polly returns the following in the .txt file:

```
{"time":0,"type":"sentence","start":0,"end":23,"value":"Mary had a little lamb."}
{"time":6,"type":"word","start":0,"end":4,"value":"Mary"}
{"time":6,"type":"viseme","value":"p"}
{"time":73,"type":"viseme","value":"E"}
{"time":180,"type":"viseme","value":"r"}
{"time":292,"type":"viseme","value":"i"}
{"time":373,"type":"word","start":5,"end":8,"value":"had"}
{"time":373,"type":"viseme","value":"k"}
{"time":460,"type":"viseme","value":"a"}
{"time":521,"type":"viseme","value":"t"}
{"time":604,"type":"word","start":9,"end":10,"value":"a"}
{"time":604,"type":"viseme","value":"@"}
{"time":643,"type":"word","start":11,"end":17,"value":"little"}
{"time":643,"type":"viseme","value":"t"}
{"time":739,"type":"viseme","value":"i"}
{"time":769,"type":"viseme","value":"t"}
{"time":799,"type":"viseme","value":"t"}
{"time":882,"type":"word","start":18,"end":22,"value":"lamb"}
{"time":882,"type":"viseme","value":"t"}
{"time":964,"type":"viseme","value":"a"}
{"time":1082,"type":"viseme","value":"p"}
```

In this output, each part of the text is broken out in terms of speech marks:
+ The sentence "Mary had a little lamb."
+ Each word in the text: "Mary", "had", "a", "little", and "lamb".
+ The viseme for each sound in the corresponding audio stream: "p", "E", "r", "i", and so on. For more information on visemes, see [Visemes and Amazon Polly](viseme.md).
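
An application might consume this output as follows. This sketch parses the lines shown above, looks up the word being spoken at a given playback time (for example, to highlight it), and groups each viseme under the word it belongs to. The helper names `word_at` and `visemes_by_word` are hypothetical, and the grouping assumes that each word mark precedes its visemes in the stream, as it does in this output.

```python
import bisect
import json

OUTPUT = """
{"time":0,"type":"sentence","start":0,"end":23,"value":"Mary had a little lamb."}
{"time":6,"type":"word","start":0,"end":4,"value":"Mary"}
{"time":6,"type":"viseme","value":"p"}
{"time":73,"type":"viseme","value":"E"}
{"time":180,"type":"viseme","value":"r"}
{"time":292,"type":"viseme","value":"i"}
{"time":373,"type":"word","start":5,"end":8,"value":"had"}
{"time":373,"type":"viseme","value":"k"}
{"time":460,"type":"viseme","value":"a"}
{"time":521,"type":"viseme","value":"t"}
{"time":604,"type":"word","start":9,"end":10,"value":"a"}
{"time":604,"type":"viseme","value":"@"}
{"time":643,"type":"word","start":11,"end":17,"value":"little"}
{"time":643,"type":"viseme","value":"t"}
{"time":739,"type":"viseme","value":"i"}
{"time":769,"type":"viseme","value":"t"}
{"time":799,"type":"viseme","value":"t"}
{"time":882,"type":"word","start":18,"end":22,"value":"lamb"}
{"time":882,"type":"viseme","value":"t"}
{"time":964,"type":"viseme","value":"a"}
{"time":1082,"type":"viseme","value":"p"}
"""

# One JSON object per non-empty line.
marks = [json.loads(line) for line in OUTPUT.splitlines() if line.strip()]

words = [m for m in marks if m["type"] == "word"]
word_times = [m["time"] for m in words]


def word_at(ms):
    """Return the most recently started word at ms, or None before the first."""
    index = bisect.bisect_right(word_times, ms) - 1
    return words[index]["value"] if index >= 0 else None


def visemes_by_word(all_marks):
    """Attach each viseme to the word mark that most recently preceded it."""
    grouped = []
    for mark in all_marks:
        if mark["type"] == "word":
            grouped.append((mark["value"], []))
        elif mark["type"] == "viseme" and grouped:
            grouped[-1][1].append(mark["value"])
    return grouped


print(word_at(400))                # had
print(visemes_by_word(marks)[0])   # ('Mary', ['p', 'E', 'r', 'i'])
```

Speech marks carry only a start time, so `word_at` keeps returning the last word after the audio ends. An application that needs an end time would have to take it from the next mark or from the audio duration.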

# Speech marks with SSML example
<a name="sp-mks-example2"></a>

The process of generating speech marks from SSML-enhanced text is similar to the process when SSML is not present. Use the `synthesize-speech` command, and specify the SSML-enhanced text and the type of speech marks that you want, as shown in the following example. To make the example easier to read, we don't include viseme speech marks, but these could be included as well.

The following AWS CLI example is formatted for Unix, Linux, and macOS. For Windows, replace the backslash (\\) Unix continuation character at the end of each line with a caret (^), and use double quotation marks (") around the input text with single quotation marks (') for interior tags.

```
aws polly synthesize-speech \
  --output-format json \
  --voice-id Joanna \
  --text-type ssml \
  --text '<speak><prosody volume="+20dB">Mary had <break time="300ms"/>a little <mark name="animal"/>lamb</prosody></speak>' \
  --speech-mark-types='["sentence", "word", "ssml"]' \
  output.txt
```

When you make this request, Amazon Polly returns the following in the .txt file:

```
{"time":0,"type":"sentence","start":31,"end":95,"value":"Mary had <break time=\"300ms\"\/>a little <mark name=\"animal\"\/>lamb"}
{"time":6,"type":"word","start":31,"end":35,"value":"Mary"}
{"time":325,"type":"word","start":36,"end":39,"value":"had"}
{"time":897,"type":"word","start":40,"end":61,"value":"<break time=\"300ms\"\/>"}
{"time":1291,"type":"word","start":61,"end":62,"value":"a"}
{"time":1373,"type":"word","start":63,"end":69,"value":"little"}
{"time":1635,"type":"ssml","start":70,"end":91,"value":"animal"}
{"time":1635,"type":"word","start":91,"end":95,"value":"lamb"}
```