Voice Input

Enable voice input for your agents. Set up speech-to-text transcription using Deepgram or OpenAI for hands-free interaction.

Gamut supports voice input, allowing you to speak to your agents instead of typing. Voice messages are transcribed to text using a cloud speech-to-text (STT) provider, then sent as regular text messages to the agent. All voice settings are in Settings > Voice.

Supported providers

Provider	Model	Latency	Languages	Environment variable
Deepgram	Nova 3	~200ms (lowest)	47	`DEEPGRAM_API_KEY`
OpenAI	GPT-4o Mini Transcribe / Whisper	Moderate	57	`OPENAI_API_KEY`

A third option, Platform, is available when connected to the Gamut platform. It uses Deepgram Nova 3 via your platform connection and requires no separate API key.

Setting up voice input

Open Settings > Voice.
Select a Speech-to-Text Provider from the dropdown.
Enter your API key for the selected provider (not needed for the Platform provider).
Click Validate & Save. Gamut will verify the key against the provider's API before saving.
Use the Test section to verify your microphone and transcription are working.

Deepgram

Deepgram provides the lowest-latency transcription using their Nova 3 model. Gamut connects to Deepgram's WebSocket API for real-time streaming transcription.

API key requirements

Your Deepgram API key must have at least Member-level access to create temporary (ephemeral) tokens. Gamut validates this during key setup by:

Checking that the key can access the Deepgram projects API.
Verifying the key can create ephemeral tokens via the /v1/auth/grant endpoint.

If your key passes the first check but fails the second, you will see: "API key is valid but lacks permission to create temporary tokens." Upgrade the key's access level in the Deepgram Console.
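The two-step check described above can be sketched roughly as follows. This is an illustration only, not Gamut's actual code: the function name `validate_deepgram_key` and the two injected callables are hypothetical stand-ins for HTTP requests to the Deepgram projects API and the /v1/auth/grant endpoint, and the "invalid key" wording is an assumption.

```python
from typing import Callable, Tuple

# Message quoted in this section for a key that passes step 1 but fails step 2.
PERMISSION_ERROR = "API key is valid but lacks permission to create temporary tokens."


def validate_deepgram_key(
    api_key: str,
    can_access_projects: Callable[[str], bool],
    can_create_token: Callable[[str], bool],
) -> Tuple[bool, str]:
    """Return (ok, message) for a two-step key check.

    Both callables are hypothetical; a real implementation would issue
    HTTP requests to Deepgram instead.
    """
    # Step 1: the key must be able to reach the projects API at all.
    if not can_access_projects(api_key):
        return False, "API key is invalid."  # assumed wording, not from Gamut
    # Step 2: the key must be allowed to mint ephemeral tokens.
    if not can_create_token(api_key):
        return False, PERMISSION_ERROR
    return True, "OK"
```

Passing the checks in as callables keeps the ordering logic testable without network access; the second check only runs once the first has succeeded, which is why the permission message can state that the key itself is valid.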

How it works

When you start a voice recording:

Gamut requests a short-lived ephemeral token from Deepgram (valid for 10 minutes) using your stored API key. Only this token is passed to the browser; your long-lived API key never leaves the server.
The browser opens a WebSocket connection to wss://api.deepgram.com/v1/listen with the ephemeral token.
Audio from your microphone is streamed in real time (16 kHz, 16-bit linear PCM, mono).
Deepgram returns interim transcripts (displayed as you speak) and final transcripts (used as the message text).
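To make the audio format in the steps above concrete, here is a minimal sketch of converting floating-point microphone samples to 16-bit linear PCM and building a streaming URL. The function names are hypothetical, and the query parameters (`model`, `encoding`, `sample_rate`, `channels`, `interim_results`) are assumptions based on Deepgram's streaming options; this page does not state which parameters Gamut actually sends.

```python
import struct
from urllib.parse import urlencode

SAMPLE_RATE_HZ = 16_000   # 16 kHz
BYTES_PER_SAMPLE = 2      # 16-bit linear PCM
CHANNELS = 1              # mono

# Raw audio bandwidth: 16,000 samples/s * 2 bytes * 1 channel = 32,000 bytes/s.
BYTES_PER_SECOND = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS


def float_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to little-endian 16-bit PCM bytes."""
    # Clip out-of-range samples so they cannot overflow a signed 16-bit int.
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack("<%dh" % len(ints), *ints)


def build_listen_url(base="wss://api.deepgram.com/v1/listen"):
    """Build a streaming URL; the query parameters are assumptions."""
    params = {
        "model": "nova-3",
        "encoding": "linear16",
        "sample_rate": SAMPLE_RATE_HZ,
        "channels": CHANNELS,
        "interim_results": "true",
    }
    return base + "?" + urlencode(params)
```

At this format the uncompressed stream is about 32 KB per second, so a 10-second utterance is roughly 320 KB of audio.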

Deepgram also supports batch transcription of audio files. Gamut uses this when audio needs to be transcribed server-side rather than over the real-time WebSocket.

OpenAI

OpenAI provides transcription through their Whisper and GPT-4o Mini Transcribe models. Gamut uses OpenAI's Realtime API for streaming transcription in the browser.

API key requirements

A standard OpenAI API key is sufficient. Gamut validates the key by checking access to the OpenAI models endpoint.

How it works

When you start a voice recording:

Gamut requests a client secret from OpenAI's Realtime API (/v1/realtime/client_secrets) using your stored API key. This short-lived secret is passed to the browser.
The browser establishes a WebSocket connection to OpenAI's Realtime API using the client secret.
Audio is streamed and transcribed in real time, as with Deepgram.

OpenAI also supports batch audio file transcription via the Whisper API (/v1/audio/transcriptions), used for server-side transcription of recorded audio.

API key management

API keys for STT providers follow the same pattern as LLM provider keys:

Settings UI: Enter the key in Settings > Voice. It is stored locally in settings.json with restricted file permissions.
Environment variables: Set DEEPGRAM_API_KEY or OPENAI_API_KEY before starting Gamut. Saved keys take precedence over environment variables.

The current key status is displayed with a badge showing the source. You can remove a saved key to revert to the environment variable, or save a new key to override it.
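The precedence rule above (a saved key wins over an environment variable) can be sketched as a small lookup. The function name `resolve_api_key`, the provider identifiers, and the source labels are hypothetical; only the environment variable names come from this page.

```python
import os
from typing import Mapping, Optional, Tuple

# Environment variable names as listed in the provider table.
ENV_VARS = {"deepgram": "DEEPGRAM_API_KEY", "openai": "OPENAI_API_KEY"}


def resolve_api_key(
    provider: str,
    saved_keys: Mapping[str, str],
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[Optional[str], str]:
    """Return (key, source), where source is 'saved', 'env', or 'none'."""
    env = os.environ if env is None else env
    # A key saved through the settings UI takes precedence.
    saved = saved_keys.get(provider)
    if saved:
        return saved, "saved"
    # Otherwise fall back to the provider's environment variable.
    from_env = env.get(ENV_VARS[provider])
    if from_env:
        return from_env, "env"
    return None, "none"
```

Returning the source alongside the key is what would let a UI show a badge for where the active key comes from; removing the saved entry makes the same call fall through to the environment variable.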

Voice input in the UI

Once voice input is configured, a microphone button appears in the message composer throughout the app. The workflow is:

Click the microphone button (or use the keyboard shortcut).
Grant microphone access if prompted by your browser.
Speak your message. Interim transcripts appear in real time as you talk.
Click the button again (or stop speaking) to finish recording.
The final transcript is placed into the message input, ready to send.

Voice input is available wherever you can type a message to an agent, including the main chat and the agent creation prompt.

Voice Agent

Voice Agent sessions are a more interactive mode in which the agent can also respond with voice. Availability depends on whether the configured STT provider supports it; both Deepgram and OpenAI do. When a Voice Agent session is active, a separate token is minted for the voice agent endpoint.

Troubleshooting

No microphone button visible: Verify that a provider is selected and its API key is configured in Settings > Voice.
"API key is valid but lacks permission to create temporary tokens" (Deepgram): Your key needs at least Member-level access. Check the key's permissions in the Deepgram Console.
"OpenAI API quota exceeded": Check your OpenAI account balance and billing settings at platform.openai.com.
Transcription is inaccurate: Try speaking more clearly and reducing background noise. Ensure your microphone is working correctly using the test tool in Settings > Voice > Test.
Connection timeout: The browser waits up to 10 seconds to connect to the STT provider's WebSocket. If this fails, check your network connection and try again.
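The connection timeout described above follows a common pattern: attempt the connection, and give up with a clear error if it has not completed within a fixed window. Below is a minimal Python sketch of that pattern. Gamut's own implementation runs in the browser and is not shown on this page; `connect_with_timeout` and the `connect` callable are hypothetical.

```python
import asyncio

CONNECT_TIMEOUT_S = 10.0  # matches the 10-second window described above


async def connect_with_timeout(connect, timeout=CONNECT_TIMEOUT_S):
    """Await connect(), raising ConnectionError if it exceeds the timeout.

    `connect` is a hypothetical zero-argument coroutine function that
    opens the WebSocket and returns the connection object.
    """
    try:
        return await asyncio.wait_for(connect(), timeout)
    except asyncio.TimeoutError:
        raise ConnectionError(
            f"STT provider did not respond within {timeout} seconds"
        ) from None
```

A caller would typically surface the `ConnectionError` to the user and let them retry, rather than retrying silently in a loop.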