Voice Input
Set up speech-to-text for voice messages to your SuperAgent agents using Deepgram or OpenAI Whisper.
SuperAgent supports voice input, allowing you to speak to your agents instead of typing. Voice messages are transcribed to text using a cloud speech-to-text (STT) provider, then sent as regular text messages to the agent. All voice settings are in Settings > Voice.
Supported providers
| Provider | Model | Latency | Languages | Environment variable |
|---|---|---|---|---|
| Deepgram | Nova 3 | ~200ms (lowest) | 47 | DEEPGRAM_API_KEY |
| OpenAI | GPT-4o Mini Transcribe / Whisper | Moderate | 57 | OPENAI_API_KEY |
A third option, Platform, is available when connected to the SuperAgent platform. It uses Deepgram Nova 3 via your platform connection and requires no separate API key.
Setting up voice input
- Open Settings > Voice.
- Select a Speech-to-Text Provider from the dropdown.
- Enter your API key for the selected provider (not needed for the Platform provider).
- Click Validate & Save. SuperAgent will verify the key against the provider's API before saving.
- Use the Test section to verify your microphone and transcription are working.
Deepgram
Deepgram provides the lowest-latency transcription using their Nova 3 model. SuperAgent connects to Deepgram's WebSocket API for real-time streaming transcription.
API key requirements
Your Deepgram API key must have at least Member-level access to create temporary (ephemeral) tokens. SuperAgent validates this during key setup by:
- Checking that the key can access the Deepgram projects API.
- Verifying the key can create ephemeral tokens via the /v1/auth/grant endpoint.
If your key passes the first check but fails the second, you will see: "API key is valid but lacks permission to create temporary tokens." Upgrade the key's access level in the Deepgram Console.
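The two-step check above can be sketched roughly as follows. This is a minimal illustration, not SuperAgent's actual code: the `Send` transport, the function names, and the "API key is invalid." wording are hypothetical, and the sketch assumes that any non-2xx status counts as a failed check.

```typescript
// Illustrative only: names and structure are assumptions, not SuperAgent's code.
type KeyCheck = { ok: true } | { ok: false; message: string };

// Minimal transport so the sketch does not depend on a specific HTTP client.
// It is expected to send the key as "Authorization: Token <key>" and resolve
// with the HTTP status code of the response.
type Send = (method: "GET" | "POST", url: string, apiKey: string) => Promise<number>;

const DEEPGRAM_API = "https://api.deepgram.com";

const isSuccess = (status: number): boolean => status >= 200 && status < 300;

// Turn the two check results into the outcome shown to the user.
function classifyDeepgramKey(projectsStatus: number, grantStatus: number): KeyCheck {
  if (!isSuccess(projectsStatus)) {
    // Hypothetical wording for a key that fails the first check.
    return { ok: false, message: "API key is invalid." };
  }
  if (!isSuccess(grantStatus)) {
    return {
      ok: false,
      message: "API key is valid but lacks permission to create temporary tokens.",
    };
  }
  return { ok: true };
}

// Run the projects check first, and only then try to mint an ephemeral token.
async function validateDeepgramKey(send: Send, apiKey: string): Promise<KeyCheck> {
  const projectsStatus = await send("GET", `${DEEPGRAM_API}/v1/projects`, apiKey);
  if (!isSuccess(projectsStatus)) return classifyDeepgramKey(projectsStatus, 0);
  const grantStatus = await send("POST", `${DEEPGRAM_API}/v1/auth/grant`, apiKey);
  return classifyDeepgramKey(projectsStatus, grantStatus);
}
```

Keeping the classification separate from the network calls makes the permission message easy to test without contacting Deepgram.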
How it works
When you start a voice recording:
- SuperAgent requests a short-lived ephemeral token from Deepgram (valid for 10 minutes) using your stored API key. This token is passed to the browser -- your long-lived API key never leaves the server.
- The browser opens a WebSocket connection to wss://api.deepgram.com/v1/listen with the ephemeral token.
- Audio from your microphone is streamed in real time (16 kHz, 16-bit linear PCM, mono).
- Deepgram returns interim transcripts (displayed as you speak) and final transcripts (used as the message text).
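As a rough sketch of the browser side of these steps, the helpers below build a streaming URL for the audio format described above and convert Web Audio samples to 16-bit PCM. The function names are hypothetical. The query parameters (`model`, `encoding`, `sample_rate`, `channels`, `interim_results`) are Deepgram streaming options, but the exact set SuperAgent sends is an assumption.

```typescript
// Illustrative sketch; function names are hypothetical.

// Build the streaming URL for 16 kHz, 16-bit linear PCM, mono audio.
function buildListenUrl(language?: string): string {
  const params: Array<[string, string]> = [
    ["model", "nova-3"],
    ["encoding", "linear16"],
    ["sample_rate", "16000"],
    ["channels", "1"],
    ["interim_results", "true"],
  ];
  if (language) params.push(["language", language]);
  const query = params
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(value)}`)
    .join("&");
  return `wss://api.deepgram.com/v1/listen?${query}`;
}

// Convert Web Audio samples (floats in [-1, 1]) to 16-bit signed PCM.
// Out-of-range samples are clamped rather than allowed to wrap around.
function floatTo16BitPcm(samples: Float32Array): Int16Array {
  const pcm = new Int16Array(samples.length);
  samples.forEach((sample, i) => {
    const clamped = Math.max(-1, Math.min(1, sample));
    pcm[i] = clamped < 0 ? Math.round(clamped * 0x8000) : Math.round(clamped * 0x7fff);
  });
  return pcm;
}
```

In a browser, the resulting `Int16Array`'s underlying buffer would typically be sent over the open WebSocket as a binary frame. How the ephemeral token is attached to the connection (for example, as a WebSocket subprotocol) is not shown here.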
Deepgram also supports batch transcription of audio files, which is used when audio data needs to be transcribed server-side rather than via the real-time WebSocket.
OpenAI
OpenAI provides transcription through their Whisper and GPT-4o Mini Transcribe models. SuperAgent uses OpenAI's Realtime API for streaming transcription in the browser.
API key requirements
A standard OpenAI API key is sufficient. SuperAgent validates the key by checking access to the OpenAI models endpoint.
How it works
When you start a voice recording:
- SuperAgent requests a client secret from OpenAI's Realtime API (/v1/realtime/client_secrets) using your stored API key. This short-lived secret is passed to the browser.
- The browser establishes a WebSocket connection to OpenAI's Realtime API using the client secret.
- Audio is streamed and transcribed in real time, similar to Deepgram.
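The server-side request for a client secret might look something like the sketch below. Only the endpoint path comes from the steps above. The helper name, the `HttpRequest` shape, and the session body (a transcription session using `gpt-4o-mini-transcribe`) are assumptions and should be checked against OpenAI's Realtime API reference before use.

```typescript
// Illustrative sketch; the session body is an assumption, not a confirmed payload.
interface HttpRequest {
  method: "POST";
  url: string;
  headers: Record<string, string>;
  body: string;
}

// Describe the request as plain data so any HTTP client can send it.
function buildClientSecretRequest(
  apiKey: string,
  model: string = "gpt-4o-mini-transcribe",
): HttpRequest {
  return {
    method: "POST",
    url: "https://api.openai.com/v1/realtime/client_secrets",
    headers: {
      Authorization: `Bearer ${apiKey}`, // long-lived key, used on the server only
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      // Assumed shape for a transcription-only session.
      session: {
        type: "transcription",
        audio: { input: { transcription: { model } } },
      },
    }),
  };
}
```

Only the short-lived secret from the response would be forwarded to the browser. The stored API key stays in the `Authorization` header of this server-side call.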
OpenAI also supports batch audio file transcription via the Whisper API (/v1/audio/transcriptions), used for server-side transcription of recorded audio.
API key management
API keys for STT providers follow the same pattern as LLM provider keys:
- Settings UI: Enter the key in Settings > Voice. It is stored locally in settings.json with restricted file permissions.
- Environment variables: Set DEEPGRAM_API_KEY or OPENAI_API_KEY before starting SuperAgent. Saved keys take precedence over environment variables.
The current key status is displayed with a badge showing the source. You can remove a saved key to revert to the environment variable, or save a new key to override it.
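The precedence rule can be sketched as a small lookup: a saved key wins, and the environment variable is the fallback. The function and type names below are hypothetical, and treating a blank value as unset is an assumption of this sketch.

```typescript
// Hypothetical helper; names are illustrative, not SuperAgent's actual code.
type KeySource = "settings" | "environment" | "none";

interface ResolvedKey {
  key: string | null;
  source: KeySource; // what a status badge could display
}

// A key saved in Settings > Voice wins over the environment variable.
function resolveApiKey(
  savedKey: string | undefined,
  envKey: string | undefined,
): ResolvedKey {
  const saved = savedKey?.trim();
  if (saved) return { key: saved, source: "settings" };
  const env = envKey?.trim();
  if (env) return { key: env, source: "environment" };
  return { key: null, source: "none" };
}

// Both are set, so the saved key is used.
const exampleResolved = resolveApiKey("dg_saved", "dg_env");
// exampleResolved.source === "settings"
```

Removing the saved key corresponds to calling the function with `savedKey` undefined, at which point the environment variable takes over.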
Voice input in the UI
Once voice input is configured, a microphone button appears in the message composer throughout the app. The workflow is:
- Click the microphone button (or use the keyboard shortcut).
- Grant microphone access if prompted by your browser.
- Speak your message. Interim transcripts appear in real time as you talk.
- Click the button again (or stop speaking) to finish recording.
- The final transcript is placed into the message input, ready to send.
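The interplay between interim and final transcripts in this workflow can be sketched as a small state update. The example assumes Deepgram-style streaming messages (`type: "Results"`, `is_final`, `channel.alternatives[0].transcript`). The state shape and function names are hypothetical, and other providers would need their own message mapping.

```typescript
// Illustrative sketch; assumes Deepgram-style streaming result messages.
interface StreamingResult {
  type: string;
  is_final?: boolean;
  channel?: { alternatives: Array<{ transcript: string }> };
}

interface TranscriptState {
  finalText: string; // committed text, placed into the message input
  interimText: string; // live preview shown while the user is speaking
}

// Fold one provider message into the transcript state.
function applyResult(state: TranscriptState, msg: StreamingResult): TranscriptState {
  if (msg.type !== "Results") return state; // ignore metadata and other events
  const first = msg.channel?.alternatives[0];
  const text = first ? first.transcript.trim() : "";
  if (msg.is_final) {
    // Final segments are appended and replace the current preview.
    const finalText = [state.finalText, text].filter(Boolean).join(" ");
    return { finalText, interimText: "" };
  }
  // Interim segments only update the preview.
  return { finalText: state.finalText, interimText: text };
}

// Text to show in the composer while recording.
function displayText(state: TranscriptState): string {
  return [state.finalText, state.interimText].filter(Boolean).join(" ");
}
```

When recording stops, `finalText` is the value that would be placed into the message input.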
Voice input is available wherever you can type a message to an agent, including the main chat and the agent creation prompt.
Voice Agent
Voice Agent is a more interactive mode in which the agent can also respond with voice. It is available only when the configured STT provider supports it; both Deepgram and OpenAI do. When a Voice Agent session is active, a separate token is minted for the voice agent endpoint.
Troubleshooting
- No microphone button visible: Verify that a provider is selected and its API key is configured in Settings > Voice.
- "API key is valid but lacks permission to create temporary tokens" (Deepgram): Your key needs at least Member-level access. Check the key's permissions in the Deepgram Console.
- "OpenAI API quota exceeded": Check your OpenAI account balance and billing settings at platform.openai.com.
- Transcription is inaccurate: Try speaking more clearly and reducing background noise. Ensure your microphone is working correctly using the test tool in Settings > Voice > Test.
- Connection timeout: The browser waits up to 10 seconds to connect to the STT provider's WebSocket. If this fails, check your network connection and try again.