Speech AI for Product Managers: TTS, STT, and Voice Interface Design
TL;DR
Voice AI has crossed a quality threshold where it's now viable for mainstream products — but most teams building voice features underestimate the complexity. TTS and STT are mature technologies with important trade-offs in latency, accuracy, language support, and cost. Voice interface design requires rethinking almost every UX assumption. This guide gives AI PMs the technical and design context to make good voice product decisions.
Speech-to-Text (STT): What PMs Need to Know
Speech-to-text converts audio into text. It sounds simple — and for clean, English, office-environment audio, it mostly is. But real-world conditions introduce complexity that determines whether your voice feature works or fails.
Word Error Rate (WER) is the key accuracy metric
WER measures the fraction of words the model gets wrong: substitutions, deletions, and insertions, divided by the number of words in the reference transcript. Modern STT models achieve 2–5% WER in ideal conditions (English, clear audio, standard accent). With background noise, strong accents, or domain-specific vocabulary, WER can exceed 20%. Evaluate STT against the real audio your product will encounter — not benchmark datasets.
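A minimal sketch of how WER is computed: word-level edit distance against a reference transcript. Production evaluation also normalizes text (casing, punctuation, numerals) before scoring; this version only lowercases and splits on whitespace.

```python
# Minimal WER: Levenshtein distance over word tokens, so substitutions,
# deletions, and insertions all count against the score.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("refill my prescription", "refill my subscription"))  # ~0.33
```

Note how a single substituted word in a three-word utterance already produces 33% WER; short voice commands leave very little room for error.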
Latency modes: batch vs streaming
Batch STT transcribes audio after recording is complete — lower complexity, higher accuracy. Streaming STT provides real-time transcription as audio arrives — enables live captions, voice assistants, and low-latency interactions but at the cost of accuracy and engineering complexity. Choose based on your product's interaction model, not on what's easier to build.
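To make the interaction-model difference concrete, here is a toy sketch with a stubbed recognizer. FakeSTT and its methods are stand-ins, not any real provider's SDK; real streaming APIs are typically WebSocket- or gRPC-based.

```python
import time

class FakeSTT:
    """Stand-in recognizer; real SDKs differ in shape and transport."""

    def transcribe_file(self, audio: bytes) -> str:
        time.sleep(1.0)                       # whole recording processed at once
        return "final transcript"

    def transcribe_stream(self, chunks):
        for i, _chunk in enumerate(chunks):   # hypotheses arrive per chunk...
            time.sleep(0.1)
            yield {"text": f"partial hypothesis {i}", "is_final": False}
        yield {"text": "final transcript", "is_final": True}  # ...then settle

stt = FakeSTT()
print(stt.transcribe_file(b"...whole recording..."))        # batch: one result at the end
for event in stt.transcribe_stream([b"c1", b"c2", b"c3"]):  # streaming: live partials
    print(event)
```

The product consequence: streaming partials can be revised until they are marked final, so any UI or downstream logic has to tolerate text that changes under it.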
Speaker diarization and multi-speaker scenarios
Standard STT produces a single stream of text. If your product needs to attribute speech to individual speakers (meeting transcription, call analysis), you need diarization — a separate model layer that identifies who spoke when. Accuracy degrades significantly with more than 4–5 speakers or overlapping speech.
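A minimal illustration of the extra layer: diarization emits speaker turns with timestamps, and product code joins those against word-level STT timestamps to attribute text. Both inputs below are made-up examples.

```python
# Join diarization output ("who spoke when") with word-level timestamps.
speakers = [  # (speaker_id, start_s, end_s) from a diarization model
    ("S1", 0.0, 2.4), ("S2", 2.4, 5.0),
]
words = [     # (word, start_s) from word-level STT timestamps
    ("hi", 0.1), ("there", 0.6), ("hello", 2.6), ("back", 3.1),
]

def attribute(words, speakers):
    for word, t in words:
        speaker = next((s for s, a, b in speakers if a <= t < b), "unknown")
        yield speaker, word

for speaker, word in attribute(words, speakers):
    print(speaker, word)   # S1 hi / S1 there / S2 hello / S2 back
```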
Domain-specific vocabulary
General STT models struggle with product names, technical jargon, medical terminology, and proper nouns. Most providers offer custom vocabulary or fine-tuning options to improve accuracy on your domain. If your product involves specialized language, evaluate accuracy on domain-specific test sets before choosing a provider.
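In practice this usually means attaching a list of boosted phrases to the transcription request. The payload below is illustrative: every field name is a placeholder to translate into your provider's schema (Google STT calls this speech adaptation, Deepgram calls it keyword boosting), not a real API.

```python
# Illustrative request payload; "boost_phrases" and friends are
# placeholder keys, not any provider's actual parameter names.
request = {
    "audio_url": "https://example.com/call-recording.wav",
    "language": "en-US",
    # Domain terms a general-purpose model is likely to miss:
    "boost_phrases": [
        {"phrase": "Zyrtec", "boost": 10},
        {"phrase": "metoprolol", "boost": 10},
        {"phrase": "prior authorization", "boost": 5},
    ],
}
```

Beware of over-boosting: values that are too aggressive cause the opposite failure, where the model hallucinates boosted terms into unrelated audio, so tune boosts against a domain test set rather than setting them once.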
Text-to-Speech (TTS): Quality, Latency, and Trade-offs
Neural TTS vs older concatenative TTS
Older TTS systems (think early Siri) concatenated recorded speech fragments — recognizable by their robotic rhythm and unnatural transitions. Neural TTS (ElevenLabs, OpenAI TTS, Google WaveNet) generates speech from learned patterns and produces output that is increasingly indistinguishable from human voice. For any user-facing application built in 2026, neural TTS is the baseline.
Latency: TTFB and streaming audio
Time-to-first-byte (TTFB) is the delay between sending a request and the first audio starting to play. For conversational applications, TTFB over 500ms feels unresponsive. Streaming TTS — where audio chunks start playing before the full response is generated — is essential for low-latency voice experiences. Evaluate providers on streaming TTFB, not just total generation time.
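A sketch of how to measure streaming TTFB yourself rather than trusting published numbers. Here stream_tts is whatever provider call returns an iterator of audio chunks; fake_stream is a stub so the harness runs standalone.

```python
import time

def measure_ttfb(stream_tts, text: str) -> float:
    """Time-to-first-chunk, not total generation time."""
    start = time.perf_counter()
    for chunk in stream_tts(text):          # iterate the audio chunk stream
        return time.perf_counter() - start  # stop at the first chunk
    raise RuntimeError("stream produced no audio")

# Stub standing in for a real provider stream; swap in each candidate.
def fake_stream(text):
    time.sleep(0.35)          # simulated time before first chunk
    yield b"\x00" * 3200
    time.sleep(0.10)
    yield b"\x00" * 3200

print(f"TTFB: {measure_ttfb(fake_stream, 'Hello!') * 1000:.0f} ms")  # ~350 ms
```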
Voice cloning and custom voices
Several providers allow voice cloning — creating a custom voice from a few minutes of reference audio. This enables branded voices, character voices, and consistent persona across interactions. Ethical and legal considerations apply: consent requirements, deepfake risks, and impersonation concerns are real product and legal issues.
Prosody and expressiveness
Good TTS sounds natural. Great TTS sounds expressive — it handles questions, emphasis, lists, and emotional tone. Most neural TTS providers have improved dramatically here, but evaluate samples with your actual content type. Technical documentation reads differently from customer service scripts, which read differently from narrative content.
Voice Interface Design Principles
Design for conversation, not commands
Early voice interfaces were command-driven: say the magic phrase, get the response. Users rejected this model. Modern voice UX is conversational: the system should handle natural phrasing, corrections, and interruptions. This requires designing around a dialogue manager, not just an STT→LLM→TTS pipeline.
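A toy sketch of the distinction: state lives in a dialogue manager that survives across turns, so a correction like "actually, the station" resolves against what was said before. The keyword matching below is a crude stand-in for LLM-based intent parsing.

```python
import re

class DialogueManager:
    """Holds conversation state across turns; a stateless pipeline cannot."""

    def __init__(self):
        self.slots = {"destination": None, "time": None}

    def handle(self, utterance: str) -> str:
        text = utterance.lower()
        if text.startswith(("no,", "actually", "i meant")):
            self.slots["destination"] = None           # user is correcting us
        if "airport" in text:
            self.slots["destination"] = "airport"
        if "station" in text:
            self.slots["destination"] = "station"
        if m := re.search(r"\b(\d{1,2}\s?(?:am|pm))\b", text):
            self.slots["time"] = m.group(1)
        missing = [k for k, v in self.slots.items() if v is None]
        if missing:
            return f"What {missing[0]} should I use?"  # ask only for the gap
        return f"Booking a ride to the {self.slots['destination']} at {self.slots['time']}."

dm = DialogueManager()
print(dm.handle("Take me to the airport"))         # What time should I use?
print(dm.handle("Actually, the station, at 5pm"))  # Booking a ride to the station at 5pm.
```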
Handle silence and interruption explicitly
Voice interfaces must decide what to do when the user stops talking: when to start processing, how long to wait for follow-up, and how to handle the user interrupting the TTS output (barge-in). These are explicit product design decisions — getting them wrong makes the experience feel unresponsive, over-eager, or both.
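A toy endpointing sketch to show that these are tunable product parameters, not provider defaults. The 0.8 s silence threshold and frame size are illustrative values, the voice-activity input is stubbed as booleans, and tts_player is a placeholder for your audio output.

```python
SILENCE_ENDPOINT_S = 0.8    # how much silence ends the user's turn
ALLOW_BARGE_IN = True       # whether user speech interrupts TTS playback

def run_turn(vad_frames, frame_s=0.02):
    """vad_frames: iterable of booleans, True = voice detected in that frame."""
    silence, heard_speech = 0.0, False
    for is_voice in vad_frames:
        if is_voice:
            heard_speech, silence = True, 0.0
        elif heard_speech:
            silence += frame_s
            if silence >= SILENCE_ENDPOINT_S:
                return "endpoint: send utterance to STT"
    return "timeout: never heard a complete utterance"

def on_voice_while_speaking(tts_player):
    # Barge-in policy: tts_player is a placeholder for your output device.
    if ALLOW_BARGE_IN:
        tts_player.stop()   # cut playback immediately and start listening

# 10 voiced frames, then 50 silent frames (1.0 s of silence)
print(run_turn([True] * 10 + [False] * 50))   # endpoint: send utterance to STT
```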
Error recovery is a core UX flow
STT errors happen. The product must handle misrecognition gracefully — confirm understanding, offer corrections, and fail softly. 'I didn't catch that' said twice in a row is a product failure. Design specific recovery flows for misrecognition before you think about the happy path.
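One common pattern is confidence-gated recovery, sketched below. Many STT APIs expose a per-utterance confidence score you can gate on; the 0.6/0.9 thresholds here are illustrative and should be tuned on your own transcripts.

```python
def respond(transcript: str, confidence: float, retries: int) -> str:
    if confidence >= 0.9:
        return f"ACT: {transcript}"
    if confidence >= 0.6:
        # Medium confidence: confirm instead of acting or re-prompting.
        return f"CONFIRM: Did you say '{transcript}'?"
    if retries == 0:
        return "REPROMPT: Sorry, could you rephrase that?"
    # Second miss: fail softly to another modality instead of looping.
    return "FALLBACK: offer buttons / keypad / human handoff"

print(respond("pay my balance", 0.95, 0))   # act
print(respond("pay my balance", 0.72, 0))   # confirm
print(respond("???", 0.30, 1))              # fallback, not a third "I didn't catch that"
```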
Accessibility is broader with voice
Voice interfaces open accessibility for users with motor or visual impairments — but introduce barriers for users with speech differences, users in noisy environments, and anyone who prefers not to speak aloud (open offices, public transit). Design voice as an additional modality, not a replacement for visual/text interfaces.
Build Voice AI Products with Confidence
Multimodal AI, voice interface design, and technical PM decisions are part of the AI PM Masterclass curriculum. Taught by a Salesforce Sr. Director PM.
Common Voice AI Product Mistakes
Evaluating STT in ideal conditions only
Demo environments have clean audio, standard accents, and quiet backgrounds. Your users have background noise, varied accents, and phone microphones. Always test STT in conditions that match real usage — ideally by collecting audio from actual users during beta.
Treating voice as a feature add-on
Adding a microphone button to a text interface and calling it a voice feature usually produces a worse text experience with extra steps. Voice-first interactions require rethinking the entire interaction model — confirmation patterns, error handling, and information density all change.
Ignoring language and accent coverage
English-centric STT and TTS models degrade significantly for non-English languages and non-standard English accents. If you have global users, test your voice pipeline against representative audio from each major user segment before committing to a provider.
Not designing for voice-specific failure modes
Voice AI fails in ways text AI doesn't: ambient noise triggers the microphone, TTS mispronounces product names, STT misrecognizes critical numbers. Build explicit monitoring and error handling for voice-specific failure modes — they won't be caught by general AI quality monitoring.
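A sketch of the kind of voice-specific counters worth emitting; the metric names and the 0.6/0.9 thresholds are assumptions to adapt to your monitoring stack.

```python
from collections import Counter

voice_metrics = Counter()

def log_turn(transcript: str, confidence: float, user_interrupted_tts: bool):
    if confidence < 0.6:
        voice_metrics["stt_low_confidence"] += 1
    if any(ch.isdigit() for ch in transcript) and confidence < 0.9:
        voice_metrics["risky_number_transcription"] += 1  # misread digits are costly
    if user_interrupted_tts:
        voice_metrics["tts_interrupted"] += 1             # verbosity/mispronunciation signal
    if not transcript.strip():
        voice_metrics["empty_transcript"] += 1            # likely noise-triggered capture

log_turn("send $500 to account 2291", 0.82, False)
print(voice_metrics)   # Counter({'risky_number_transcription': 1})
```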
Choosing a Voice AI Provider
STT providers: Whisper, Deepgram, AssemblyAI, Google STT
OpenAI Whisper (via API) leads on accuracy for English; Deepgram leads on latency and streaming; AssemblyAI leads on additional features (diarization, sentiment); Google STT leads on language coverage. No single provider wins on all dimensions. Choose based on the trade-off that matters most for your product.
TTS providers: ElevenLabs, OpenAI TTS, Google WaveNet, Amazon Polly
ElevenLabs leads on voice quality and cloning; OpenAI TTS balances quality and simplicity; Google WaveNet and Amazon Polly offer deeper cloud infrastructure integration and more favorable pricing at high volume. For consumer products, voice quality matters most. For high-volume enterprise use, pricing and reliability matter more.
Latency benchmarking for your use case
Published latency numbers are measured in controlled conditions. Build a simple test harness and benchmark each provider with audio representative of your actual input distribution. Include p95 and p99 latency, not just median — tail latency determines whether voice interactions feel responsive.
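A minimal harness along these lines. Here transcribe is whatever provider call you are testing, and the fake below is a stub so the loop runs standalone; statistics.quantiles gives percentile cut points without extra dependencies.

```python
import time
from statistics import quantiles

def benchmark(transcribe, clips, runs_per_clip=5):
    """Clips should come from your real input distribution, not demo audio."""
    latencies = []
    for clip in clips:
        for _ in range(runs_per_clip):
            start = time.perf_counter()
            transcribe(clip)
            latencies.append(time.perf_counter() - start)
    pct = quantiles(latencies, n=100)       # pct[k-1] is the k-th percentile
    return {"p50": pct[49], "p95": pct[94], "p99": pct[98]}

# Stub standing in for a real provider call; swap in each candidate.
def fake_transcribe(clip):
    time.sleep(0.1)

print(benchmark(fake_transcribe, clips=[b"a", b"b"] * 10))
```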
Data privacy and retention policies
Voice data is sensitive. Understand each provider's data retention policy, where audio is processed, and whether it's used for model training. For healthcare, legal, and financial applications, provider data handling may determine regulatory eligibility.
Master Voice and Multimodal AI in the Masterclass
Voice AI, multimodal product design, and technical decision-making are core to the AI PM Masterclass. Taught by a Salesforce Sr. Director PM.