

# Speech and voice agents
<a name="speech-and-voice-agents"></a>

Speech and voice agents interact with users through spoken dialogue. These agents integrate speech recognition, natural-language understanding, and speech synthesis to enable conversational AI across telephony, mobile, web, and embedded platforms.

Voice agents are particularly effective in hands-free, real-time, or accessibility-driven environments. By combining streaming interfaces with LLM-powered reasoning, they facilitate rich, dynamic interactions that feel natural to users.

## Architecture
<a name="architecture-speech-and-voice"></a>

The architecture of a speech and voice agent is shown in the following diagram:

![Speech and voice agents](http://docs.aws.amazon.com/prescriptive-guidance/latest/agentic-ai-patterns/images/speech-and-voice-agents.png)
## Description
<a name="description-speech-and-voice"></a>

1. Receives a voice query
   + The user speaks a request into a phone, microphone, or embedded system.
   + A speech-to-text (STT) module converts the audio to text.

1. Integrates streaming and telephony context
   + The agent uses a streaming interface to manage audio I/O in real time.
   + If it's deployed in a contact center or telecom context, telephony integration handles session routing, dual-tone multi-frequency (DTMF) input, and media transport.

Note: DTMF refers to the tones generated when you press buttons on a telephone keypad. Within streaming and telephony integration for voice agents, DTMF serves as a signaling input mechanism during a phone call, especially in interactive voice response (IVR) systems. DTMF inputs enable the agent to:
+ Recognize menu selections (for example, "Press 1 for billing. Press 2 for support.")
+ Collect numeric inputs (for example, account numbers, PINs, and confirmation numbers)
+ Trigger workflows or state transitions in call flows
+ Fall back from speech to touch-tone input when necessary
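
As a minimal sketch of the call-flow logic above (the menu mapping, state names, and handler are hypothetical examples, not a specific AWS API), a DTMF step might be routed like this:

```python
# Minimal sketch of a DTMF menu handler in an IVR call flow.
# The menu mapping and state names are hypothetical examples.

MENU = {
    "1": "billing",
    "2": "support",
}

def handle_dtmf(digits: str, state: str) -> str:
    """Route a call based on collected DTMF digits.

    A single digit selects a menu branch; a longer sequence is
    treated as numeric input such as an account number or PIN.
    """
    if state == "main_menu" and digits in MENU:
        return MENU[digits]          # menu selection triggers a state transition
    if state == "collect_account" and digits.isdigit():
        return f"lookup:{digits}"    # numeric input, e.g. an account number
    return "fallback_to_speech"      # unrecognized input falls back to speech

print(handle_dtmf("1", "main_menu"))             # billing
print(handle_dtmf("123456", "collect_account"))  # lookup:123456
```

In a real deployment, the telephony layer (for example, Amazon Connect) collects the digits and the agent applies routing logic of this shape.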

1. Reasons through LLM stream context
   + The query is sent to the agent, which passes it, along with any session metadata (for example, caller ID or prior context), to an LLM.
   + The LLM generates a response, possibly using a chain-of-thought strategy or multiturn memory if the interaction is ongoing.

1. Returns a voice response
   + The agent converts its response to speech using text-to-speech (TTS).
   + It returns audio to the user through a voice channel.
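
The steps above can be sketched as one voice turn: audio in, STT, LLM reasoning with session metadata, TTS, audio out. This is a provider-agnostic sketch; the stub STT, LLM, and TTS callables stand in for services such as Amazon Transcribe, Amazon Bedrock, and Amazon Polly.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class VoiceTurn:
    """One round trip: user audio in, agent audio out."""
    stt: Callable[[bytes], str]        # speech-to-text
    llm: Callable[[str, dict], str]    # reasoning over text plus session metadata
    tts: Callable[[str], bytes]        # text-to-speech
    session: dict = field(default_factory=dict)

    def run(self, audio_in: bytes) -> bytes:
        text = self.stt(audio_in)             # step 1: transcribe the query
        reply = self.llm(text, self.session)  # step 3: reason with session context
        self.session.setdefault("history", []).append((text, reply))  # multiturn memory
        return self.tts(reply)                # step 4: synthesize the response

# Stub implementations, for illustration only.
turn = VoiceTurn(
    stt=lambda audio: "what is my balance",
    llm=lambda text, session: f"Answering: {text}",
    tts=lambda reply: reply.encode("utf-8"),
    session={"caller_id": "+15555550100"},
)
audio_out = turn.run(b"\x00fake-pcm")  # b"Answering: what is my balance"
```

In production, each callable wraps a streaming service client, and the session dictionary carries the caller ID and prior turns between invocations.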

## Capabilities
<a name="capabilities-speech-and-voice"></a>
+ Real-time speech understanding and generation
+ Multilingual I/O with STT and TTS support
+ Integration with telephony or streaming APIs
+ Session awareness and memory handoff between turns

## Common use cases
<a name="common-use-cases-speech-and-voice"></a>
+ Conversational IVR systems
+ Virtual receptionists and appointment schedulers
+ Voice-driven helpdesk agents
+ Wearable voice assistants
+ Voice interfaces for smart homes and accessibility tools

## Implementation guidance
<a name="implementation-guidance-speech-and-voice"></a>

You can build this pattern using the following tools and AWS services:
+ Amazon Lex V2 or Amazon Transcribe for STT
+ Amazon Polly for TTS
+ Amazon Chime SDK, Amazon Connect, or Amazon Interactive Video Service (Amazon IVS) for streaming and telephony
+ Amazon Bedrock for reasoning with Anthropic, AI21, or other foundation models
+ AWS Lambda to connect STT, LLM, TTS, and session context
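
A hedged sketch of the Lambda glue layer follows, using the Amazon Bedrock `converse` and Amazon Polly `synthesize_speech` operations. The clients are passed in so the function can be exercised without live AWS credentials; the model ID and voice are illustrative defaults.

```python
import io

def reply_as_speech(bedrock, polly, transcript: str,
                    model_id: str = "anthropic.claude-3-haiku-20240307-v1:0",
                    voice_id: str = "Joanna") -> bytes:
    """Send the transcribed query to an LLM, then synthesize the reply.

    `bedrock` and `polly` are expected to expose the boto3 bedrock-runtime
    and polly client interfaces; injecting them keeps the glue testable.
    """
    # Reason over the transcript with a Bedrock foundation model.
    result = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": transcript}]}],
    )
    reply_text = result["output"]["message"]["content"][0]["text"]

    # Convert the model's reply to audio with Polly.
    speech = polly.synthesize_speech(
        Text=reply_text, OutputFormat="mp3", VoiceId=voice_id,
    )
    return speech["AudioStream"].read()

# Minimal fakes so the sketch runs without AWS credentials.
class FakeBedrock:
    def converse(self, modelId, messages):
        text = messages[0]["content"][0]["text"]
        return {"output": {"message": {"content": [{"text": f"Echo: {text}"}]}}}

class FakePolly:
    def synthesize_speech(self, Text, OutputFormat, VoiceId):
        return {"AudioStream": io.BytesIO(Text.encode("utf-8"))}

audio = reply_as_speech(FakeBedrock(), FakePolly(), "hello")
```

With real boto3 clients (`boto3.client("bedrock-runtime")` and `boto3.client("polly")`), the same function returns playable MP3 bytes for the voice channel.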

(Optional) You can add the following enhancements:
+ Amazon Kendra or Amazon OpenSearch Service for context-aware Retrieval Augmented Generation (RAG)
+ Amazon DynamoDB for session memory
+ Amazon CloudWatch Logs and AWS X-Ray for traceability
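
For the DynamoDB session-memory enhancement, a minimal sketch follows. The table name, key, and attribute layout are illustrative; the code uses the boto3 DynamoDB Table `get_item`/`put_item` shapes and is shown with an in-memory fake so it runs without AWS.

```python
class SessionMemory:
    """Persist per-call conversation state keyed by session ID.

    `table` is expected to expose the boto3 DynamoDB Table interface
    (put_item/get_item); the attribute layout here is illustrative.
    """
    def __init__(self, table):
        self.table = table

    def save_turn(self, session_id: str, user_text: str, agent_text: str) -> None:
        # Read the existing item, append the turn, and write it back.
        item = (self.table.get_item(Key={"session_id": session_id}).get("Item")
                or {"session_id": session_id, "turns": []})
        item["turns"].append({"user": user_text, "agent": agent_text})
        self.table.put_item(Item=item)

    def history(self, session_id: str) -> list:
        item = self.table.get_item(Key={"session_id": session_id}).get("Item")
        return item["turns"] if item else []

# In-memory stand-in for a DynamoDB table, for illustration only.
class FakeTable:
    def __init__(self):
        self.items = {}
    def put_item(self, Item):
        self.items[Item["session_id"]] = Item
    def get_item(self, Key):
        item = self.items.get(Key["session_id"])
        return {"Item": item} if item else {}

memory = SessionMemory(FakeTable())
memory.save_turn("call-1", "what is my balance", "Your balance is $42.")
```

In production, the fake is replaced with `boto3.resource("dynamodb").Table(...)`, and the agent reads the history at the start of each turn to preserve context across invocations.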

## Summary
<a name="summary-speech-and-voice"></a>

Speech and voice agents are intelligent systems that interact through natural conversations. By integrating speech interfaces with LLM reasoning and real-time streaming infrastructure, voice agents enable seamless, accessible, and scalable interactions.