Voice Activity Detection (VAD)

VAD controls how Echosy detects speech and splits audio into segments. Proper VAD tuning is key to getting clean, well-segmented transcripts. Echosy provides separate VAD settings for recording and dictation, so you can fine-tune each use case independently.

How VAD Works

VAD continuously monitors the audio stream and decides when speech starts and stops. When speech is detected, audio is buffered into a segment. When silence is detected for long enough, the segment is finalized and sent to the ASR model for transcription. The four parameters below control this behavior.
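The flow described above can be sketched as a small frame-by-frame loop. This is only an illustration under stated assumptions, not Echosy's actual implementation: the names `VadParams` and `segment`, the fixed `noise_floor`, the per-frame loudness values, and the 100 ms frame size are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class VadParams:
    """Hypothetical container for the four VAD parameters."""
    silence_ms: int = 800
    min_speech_ms: int = 300
    max_speech_ms: int = 20_000
    speech_ratio: float = 2.5


def _keep(segments, start, end, frame_ms, params):
    # Drop candidates shorter than the minimum speech duration.
    if (end - start) * frame_ms >= params.min_speech_ms:
        segments.append((start * frame_ms, end * frame_ms))


def segment(levels: List[float], noise_floor: float, params: VadParams,
            frame_ms: int = 100) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) segments from per-frame loudness levels."""
    # Assumption: a frame is speech when it is speech_ratio times the noise floor.
    threshold = noise_floor * params.speech_ratio
    segments: List[Tuple[int, int]] = []
    start = None      # frame index where the open segment began
    silence_ms = 0    # trailing silence inside the open segment

    for i, level in enumerate(levels):
        is_speech = level >= threshold
        if start is None:
            if is_speech:
                start, silence_ms = i, 0
            continue
        silence_ms = 0 if is_speech else silence_ms + frame_ms
        end = i + 1
        if silence_ms >= params.silence_ms:
            # Enough silence: finalize, trimming the trailing silent frames.
            _keep(segments, start, end - silence_ms // frame_ms, frame_ms, params)
            start = None
        elif (end - start) * frame_ms >= params.max_speech_ms:
            # Segment reached the maximum length: force a split.
            _keep(segments, start, end, frame_ms, params)
            start = None

    if start is not None:
        # Flush whatever is still buffered when the stream ends.
        _keep(segments, start, len(levels) - silence_ms // frame_ms, frame_ms, params)
    return segments


demo = VadParams(silence_ms=300, min_speech_ms=200, max_speech_ms=1000)
levels = [0, 0, 5, 5, 5, 0, 0, 0, 5, 0, 0, 0, 5, 5, 5, 5]
print(segment(levels, noise_floor=1.0, params=demo))
# prints [(200, 500), (1200, 1600)]; the single 100 ms burst is filtered out
```

In this sketch each finalized tuple stands in for the audio that would be handed to the ASR model.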

Recording VAD Parameters

These settings apply when using the main recording feature (system audio + mic).

| Parameter | Default | Range | Description |
| --- | --- | --- | --- |
| Silence duration | 800 ms | 200–3000 ms | How long silence must last before ending a segment. Lower values create more, shorter segments. Higher values allow for natural pauses within a segment. |
| Min speech duration | 300 ms | 100–2000 ms | Minimum speech length to count as a segment. Filters out very short sounds like coughs, clicks, or background noise. |
| Max speech duration | 20 s | 5–120 s | Maximum segment length before forcing a split. Prevents very long segments that are harder for ASR models to transcribe accurately. |
| Speech ratio | 2.5× | 1.0–10.0× | How much louder speech must be compared to background noise to be detected. Lower values are more sensitive (detect quieter speech), higher values require louder speech. |
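If you keep these settings in a script or configuration file of your own, a range check like the following can catch out-of-range values early. The defaults and ranges come from the table above; the key names and the `validate` helper are hypothetical and are not part of any Echosy API.

```python
from typing import Dict, List, Tuple

# Allowed ranges from the recording table (durations converted to milliseconds).
RECORDING_RANGES: Dict[str, Tuple[float, float]] = {
    "silence_ms": (200, 3000),
    "min_speech_ms": (100, 2000),
    "max_speech_ms": (5_000, 120_000),
    "speech_ratio": (1.0, 10.0),
}

# Recording defaults from the same table.
RECORDING_DEFAULTS: Dict[str, float] = {
    "silence_ms": 800,
    "min_speech_ms": 300,
    "max_speech_ms": 20_000,
    "speech_ratio": 2.5,
}


def validate(settings: Dict[str, float]) -> List[str]:
    """Return one message per setting that falls outside its allowed range."""
    errors = []
    for name, (low, high) in RECORDING_RANGES.items():
        value = settings[name]
        if not low <= value <= high:
            errors.append(f"{name}={value} is outside {low}-{high}")
    return errors


print(validate(RECORDING_DEFAULTS))                         # prints []
print(validate({**RECORDING_DEFAULTS, "silence_ms": 100}))
# prints ['silence_ms=100 is outside 200-3000']
```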

Dictation VAD Parameters

These settings apply when using the dictation feature. Dictation defaults are tuned for faster response: a shorter silence duration, a lower minimum speech duration, and a shorter maximum speech duration, all chosen to capture quick voice input.

| Parameter | Default | Description |
| --- | --- | --- |
| Silence duration | 600 ms | Shorter than recording, for faster segment turnover during real-time typing. |
| Min speech duration | 200 ms | Lower threshold to catch short words and quick dictation. |
| Max speech duration | 15 s | Shorter max to keep segments manageable for quick input. |
| Speech ratio | 2.5× | Same sensitivity as recording mode. |
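To see at a glance where the two default profiles diverge, the values from both tables can be compared side by side. This is a minimal sketch; the dictionary keys and the `differences` helper are hypothetical names chosen for illustration.

```python
# Default values taken from the recording and dictation tables (milliseconds).
RECORDING = {"silence_ms": 800, "min_speech_ms": 300,
             "max_speech_ms": 20_000, "speech_ratio": 2.5}
DICTATION = {"silence_ms": 600, "min_speech_ms": 200,
             "max_speech_ms": 15_000, "speech_ratio": 2.5}


def differences(a: dict, b: dict) -> dict:
    """Map each setting that differs to its (a, b) pair of values."""
    return {key: (a[key], b[key]) for key in a if a[key] != b[key]}


for name, (rec, dic) in differences(RECORDING, DICTATION).items():
    print(f"{name}: recording={rec}, dictation={dic}")
# prints:
# silence_ms: recording=800, dictation=600
# min_speech_ms: recording=300, dictation=200
# max_speech_ms: recording=20000, dictation=15000
```

Because a segment is finalized only after the silence duration has elapsed, the 200 ms difference in silence duration is the part of the response time most directly affected by these defaults.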

Presets

Presets provide optimized VAD configurations for common scenarios. Select a preset to auto-fill all four parameters. The table lists the three timing values for each preset:

| Preset | Silence | Min Speech | Max Speech | Best for |
| --- | --- | --- | --- | --- |
| Meeting | 800 ms | 300 ms | 20 s | Group discussions with natural pauses |
| Subtitle | 500 ms | 200 ms | 10 s | Short, timed segments for subtitles |
| Interview | 1000 ms | 400 ms | 30 s | Longer turns with clear speaker pauses |
| Lecture | 1200 ms | 500 ms | 60 s | Continuous speech with few interruptions |
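The preset table can be modeled as a simple lookup that overwrites the timing values of an existing configuration. This is a sketch only: `VadSettings`, `PRESETS`, and `apply_preset` are hypothetical names, and because the table does not list a speech ratio per preset, the sketch leaves that value unchanged.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class VadSettings:
    """Hypothetical settings record; durations are in milliseconds."""
    silence_ms: int
    min_speech_ms: int
    max_speech_ms: int
    speech_ratio: float = 2.5


# (silence, min speech, max speech) per preset, from the table above.
PRESETS = {
    "Meeting": (800, 300, 20_000),
    "Subtitle": (500, 200, 10_000),
    "Interview": (1000, 400, 30_000),
    "Lecture": (1200, 500, 60_000),
}


def apply_preset(current: VadSettings, name: str) -> VadSettings:
    """Return a copy of `current` with the preset's timing values filled in."""
    silence_ms, min_speech_ms, max_speech_ms = PRESETS[name]
    # Assumption: speech_ratio is carried over, since the table gives no preset value.
    return replace(current, silence_ms=silence_ms,
                   min_speech_ms=min_speech_ms, max_speech_ms=max_speech_ms)


settings = VadSettings(800, 300, 20_000)
print(apply_preset(settings, "Lecture"))
# prints VadSettings(silence_ms=1200, min_speech_ms=500, max_speech_ms=60000, speech_ratio=2.5)
```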

Tuning Tips

  • Segments too short? Increase silence duration so pauses within sentences are not mistaken for segment endings.
  • Segments too long? Decrease max speech duration to force earlier splits, or decrease silence duration so that shorter pauses end a segment.
  • Picking up background noise? Increase speech ratio to require louder speech for detection, or increase min speech duration to filter out short noise bursts.
  • Missing quiet speech? Lower the speech ratio to make detection more sensitive.
  • Dictation feels slow? Lower the silence duration for faster segment finalization.

Recording and dictation VAD are independent: changes to one do not affect the other. This lets you use aggressive (fast) settings for dictation while keeping conservative settings for recording.
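
The effect of silence duration on segment count can be estimated with simple arithmetic. The sketch below is a simplified model, not a description of Echosy's internals: it assumes a recording is a series of speech bursts separated by pauses of known length, and it ignores the min and max speech duration limits. The `count_segments` helper and the pause values are hypothetical.

```python
from typing import Iterable


def count_segments(pauses_ms: Iterable[int], silence_ms: int) -> int:
    """Count segments when every pause of at least silence_ms ends a segment."""
    return 1 + sum(1 for pause in pauses_ms if pause >= silence_ms)


# Hypothetical pauses between five bursts of speech, in milliseconds.
pauses = [400, 700, 900, 1500]

for silence_ms in (600, 800, 1200):
    print(silence_ms, count_segments(pauses, silence_ms))
# prints:
# 600 4
# 800 3
# 1200 2
```

Raising the silence duration merges bursts into fewer, longer segments; lowering it does the opposite, which is the trade-off behind the first two tips.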
