Cross-modal input - Amazon Nova


Amazon Nova 2 Sonic now supports cross-modal input, allowing you to send text messages in addition to voice input during a conversation session. While speech remains the primary mode of interaction, text input provides flexibility for scenarios where typing is more convenient or appropriate.

Continuous streaming required: Cross-modal input requires an active streaming session. The session must stream continuously, as a regular voice session does; otherwise, standard session timeouts apply and the connection is terminated.

Cross-modal text input is useful for scenarios such as:

Client-side app integration (web and mobile): Allows users to interact with the application using both text and voice, supporting seamless multimodal experiences.
"Model-start-first" use case: A text message can be sent immediately after the session starts to prompt the model to begin speaking.
Guiding the model during async tool calling: When a toolUse event is triggered and the system begins processing tool calls, the client can send a text message to Nova 2 Sonic to provide a natural response while waiting, such as “Hold on a second while I process your information. In the meantime, is there anything else I can assist with?”
Telephony DTMF integration: A customer uses the phone keypad to enter sensitive information (such as credit card numbers). Note: Amazon Nova Sonic does not process DTMF tones natively. To support DTMF input, your system must detect the tones, convert them to text (such as "1234"), and send the text to Nova 2 Sonic.
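
The DTMF conversion step can be sketched as follows. This is only an illustration, assuming tone detection already happens upstream in your telephony stack and yields one keypad character per press. The function name dtmf_keys_to_text, the 12-key keypad restriction, and the use of "#" as an end-of-entry marker are assumptions, not part of the Amazon Nova API.

```python
def dtmf_keys_to_text(keys: list[str], terminator: str = "#") -> str:
    """Join detected keypad presses into one text string.

    Hypothetical helper: `keys` is assumed to come from an upstream
    DTMF tone detector. Entry stops at the first terminator key.
    """
    valid = set("0123456789*#")  # assumption: standard 12-key phone keypad
    collected = []
    for key in keys:
        if key not in valid:
            raise ValueError(f"not a DTMF key: {key!r}")
        if key == terminator:
            break  # the caller signalled the end of the entry
        collected.append(key)
    return "".join(collected)


if __name__ == "__main__":
    # The resulting string is what you would send as text input.
    print(dtmf_keys_to_text(["1", "2", "3", "4", "#"]))
```

The returned string is the text your system would then send to Nova 2 Sonic as text input.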

Cross-modal input uses a three-event sequence similar to audio input:

Content Start Event: Signals the beginning of text input
Text Input Event: Contains the actual text message
Content End Event: Signals the completion of text input

All three events must use the same promptName and contentName to maintain the sequence. A new UUID should be generated for contentName each time you send text input to ensure proper multi-turn conversation tracking.
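
A minimal sketch of assembling the sequence in Python, assuming the event fields documented on this page. The helper name build_text_input_events is hypothetical, and sending the events over the stream is left to your client code.

```python
import json
import uuid


def build_text_input_events(prompt_name: str, text: str) -> list[dict]:
    """Return the contentStart, textInput, and contentEnd events for one message."""
    # A fresh contentName for every message, shared by all three events.
    content_name = str(uuid.uuid4())
    return [
        {"event": {"contentStart": {
            "promptName": prompt_name,
            "contentName": content_name,
            "role": "USER",
            "type": "TEXT",
            "interactive": True,
            "textInputConfiguration": {"mediaType": "text/plain"},
        }}},
        {"event": {"textInput": {
            "promptName": prompt_name,
            "contentName": content_name,
            "content": text,
        }}},
        {"event": {"contentEnd": {
            "promptName": prompt_name,
            "contentName": content_name,
        }}},
    ]


if __name__ == "__main__":
    # Print each event as the JSON payload it would be sent as.
    for event in build_text_input_events("my-prompt", "Hello"):
        print(json.dumps(event))
```

Calling the helper once per message yields a new contentName each time while promptName stays constant for the session.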

The Content Start Event initiates the text input sequence with configuration details:


{
  "event": {
    "contentStart": {
      "promptName": "<prompt_name>",
      "contentName": "<new_content_name>",
      "role": "USER",
      "type": "TEXT",
      "interactive": true,
      "textInputConfiguration": {
        "mediaType": "text/plain"
      }
    }
  }
}

Key Parameters:

promptName: The name of your conversation prompt (consistent across the session)
contentName: A unique identifier for this text input (generate a new UUID for each message)
role: Set to "USER" to indicate user input
type: Set to "TEXT" for text input
interactive: Set to true to enable interactive mode
mediaType: Set to "text/plain" for plain text content

The Text Input Event contains the actual text message content:


{
  "event": {
    "textInput": {
      "promptName": "<prompt_name>",
      "contentName": "<new_content_name>",
      "content": "<your_text_message>"
    }
  }
}

Key Parameters:

promptName: Must match the value from Content Start Event
contentName: Must match the value from Content Start Event
content: Your text message string

The Content End Event signals the completion of the text input:


{
  "event": {
    "contentEnd": {
      "promptName": "<prompt_name>",
      "contentName": "<new_content_name>"
    }
  }
}

Key Parameters:

promptName: Must match the value from previous events
contentName: Must match the value from previous events
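
Because a mismatched promptName or contentName breaks the sequence, a client-side check before sending can help. The following is a sketch only; validate_text_sequence is a hypothetical helper, not part of any AWS SDK, and it checks just the ordering and matching rules described on this page.

```python
def validate_text_sequence(events: list[dict]) -> None:
    """Raise ValueError unless events form a well-formed text input sequence."""
    expected = ["contentStart", "textInput", "contentEnd"]
    # Each event is assumed to wrap exactly one key under "event".
    kinds = [next(iter(e.get("event", {})), None) for e in events]
    if kinds != expected:
        raise ValueError(f"expected event order {expected}, got {kinds}")
    bodies = [e["event"][kind] for e, kind in zip(events, kinds)]
    for field in ("promptName", "contentName"):
        values = {body.get(field) for body in bodies}
        # All three events must carry the same, non-missing value.
        if len(values) != 1 or None in values:
            raise ValueError(f"{field} must be present and identical in all three events")


if __name__ == "__main__":
    sequence = [
        {"event": {"contentStart": {"promptName": "p", "contentName": "c",
                                    "role": "USER", "type": "TEXT", "interactive": True,
                                    "textInputConfiguration": {"mediaType": "text/plain"}}}},
        {"event": {"textInput": {"promptName": "p", "contentName": "c", "content": "Hello"}}},
        {"event": {"contentEnd": {"promptName": "p", "contentName": "c"}}},
    ]
    validate_text_sequence(sequence)  # raises nothing for a valid sequence
    print("sequence is valid")
```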
