Voice Activity Detection (VAD)
VAD controls how Echosy detects speech and splits audio into segments. Proper VAD tuning is key to getting clean, well-segmented transcripts. Echosy provides separate VAD settings for recording and dictation, so you can fine-tune each use case independently.
How VAD Works
VAD continuously monitors the audio stream and decides when speech starts and stops. When speech is detected, audio is buffered into a segment. When silence is detected for long enough, the segment is finalized and sent to the ASR model for transcription. The four parameters below control this behavior.
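The loop described above can be sketched as a small state machine. This is an illustration only, not Echosy's implementation: the energy-threshold detector, the fixed `noise_floor`, the 20 ms frame size, and the names `VadParams` and `segment` are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class VadParams:
    silence_ms: int = 800        # silence needed to end a segment
    min_speech_ms: int = 300     # segments with less speech are discarded
    max_speech_ms: int = 20_000  # longer segments are force-split
    speech_ratio: float = 2.5    # speech must exceed the noise floor by this factor


def segment(levels: List[float], noise_floor: float, params: VadParams,
            frame_ms: int = 20) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) pairs for segments that would be sent to ASR."""
    segments: List[Tuple[int, int]] = []
    start: Optional[int] = None  # start of the open segment, if any
    speech_ms = 0                # speech accumulated in the open segment
    silence_ms = 0               # trailing silence in the open segment

    for i, level in enumerate(levels):
        frame_end = (i + 1) * frame_ms
        is_speech = level >= noise_floor * params.speech_ratio

        if start is None:
            if is_speech:  # speech onset opens a new segment
                start, speech_ms, silence_ms = i * frame_ms, frame_ms, 0
            continue

        if is_speech:
            speech_ms += frame_ms
            silence_ms = 0
        else:
            silence_ms += frame_ms

        if silence_ms >= params.silence_ms:
            # Enough silence: finalize at the point where speech stopped,
            # unless the segment is too short to count.
            if speech_ms >= params.min_speech_ms:
                segments.append((start, frame_end - silence_ms))
            start = None
        elif frame_end - start >= params.max_speech_ms:
            # Segment reached the maximum length: force a split here.
            segments.append((start, frame_end))
            start = None

    if start is not None and speech_ms >= params.min_speech_ms:
        # Flush a segment that is still open when the audio ends.
        segments.append((start, len(levels) * frame_ms - silence_ms))
    return segments


# 200 ms quiet, 1 s speech, 1 s quiet, 100 ms cough, 1 s quiet
levels = [0.01] * 10 + [0.1] * 50 + [0.01] * 50 + [0.1] * 5 + [0.01] * 50
print(segment(levels, noise_floor=0.01, params=VadParams()))  # [(200, 1200)]
```

In this run the 1 s utterance is kept, while the 100 ms burst is dropped because it is shorter than the minimum speech duration.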
Recording VAD Parameters
These settings apply when using the main recording feature (system audio + mic).
| Parameter | Default | Range | Description |
|---|---|---|---|
| Silence duration | 800 ms | 200–3000 ms | How long silence must last before ending a segment. Lower values create more, shorter segments. Higher values allow for natural pauses within a segment. |
| Min speech duration | 300 ms | 100–2000 ms | Minimum speech length to count as a segment. Filters out very short sounds like coughs, clicks, or background noise. |
| Max speech duration | 20 s | 5–120 s | Maximum segment length before forcing a split. Prevents very long segments that are harder for ASR models to transcribe accurately. |
| Speech ratio | 2.5× | 1.0–10.0× | How much louder speech must be than background noise to be detected. Lower values are more sensitive and detect quieter speech; higher values require louder speech. |
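If you manage these values in a script or config file of your own, the table translates into a small validated structure. The sketch below is hypothetical: `RecordingVad`, `RECORDING_RANGES`, and the field names are invented for illustration and are not an Echosy API. Only the defaults and ranges come from the table.

```python
from dataclasses import dataclass

# Allowed (min, max) per parameter, taken from the table above.
RECORDING_RANGES = {
    "silence_ms": (200, 3000),
    "min_speech_ms": (100, 2000),
    "max_speech_s": (5, 120),
    "speech_ratio": (1.0, 10.0),
}


@dataclass(frozen=True)
class RecordingVad:
    """Hypothetical container for the recording VAD parameters."""
    silence_ms: int = 800
    min_speech_ms: int = 300
    max_speech_s: float = 20
    speech_ratio: float = 2.5

    def __post_init__(self) -> None:
        # Reject any value outside its documented range.
        for name, (low, high) in RECORDING_RANGES.items():
            value = getattr(self, name)
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside {low}-{high}")


settings = RecordingVad(silence_ms=1000, speech_ratio=3.0)
print(settings)
```

Validating at construction time means an out-of-range value such as `silence_ms=100` fails immediately instead of producing surprising segmentation later.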
Dictation VAD Parameters
These settings apply when using the dictation feature. Dictation defaults are tuned for faster response: a shorter silence duration, a lower minimum speech duration, and a shorter maximum segment length, so quick voice input is captured and finalized sooner.
| Parameter | Default | Description |
|---|---|---|
| Silence duration | 600 ms | Shorter than recording (800 ms), for faster segment turnover during real-time typing. |
| Min speech duration | 200 ms | Lower threshold to catch short words and quick dictation. |
| Max speech duration | 15 s | Shorter max to keep segments manageable for quick input. |
| Speech ratio | 2.5× | Same sensitivity as recording mode. |
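To see what the dictation defaults change relative to recording, the two tables can be compared directly. The snippet is a rough sketch with hypothetical names (`RECORDING_DEFAULTS`, `DICTATION_DEFAULTS`, `changed`). It also assumes that a segment cannot be finalized until the silence duration has elapsed after speech stops, which follows from the description of how VAD works.

```python
# Defaults copied from the recording and dictation tables.
RECORDING_DEFAULTS = {"silence_ms": 800, "min_speech_ms": 300,
                      "max_speech_s": 20, "speech_ratio": 2.5}
DICTATION_DEFAULTS = {"silence_ms": 600, "min_speech_ms": 200,
                      "max_speech_s": 15, "speech_ratio": 2.5}


def changed(base: dict, other: dict) -> dict:
    """Map each parameter that differs to its (base, other) values."""
    return {k: (base[k], other[k]) for k in base if base[k] != other[k]}


print(changed(RECORDING_DEFAULTS, DICTATION_DEFAULTS))

# A segment is finalized only after `silence_ms` of silence, so the shorter
# dictation value saves this much waiting after each utterance:
saved_ms = RECORDING_DEFAULTS["silence_ms"] - DICTATION_DEFAULTS["silence_ms"]
print(saved_ms)  # 200
```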
Presets
Presets provide VAD configurations tuned for common scenarios. Select a preset to auto-fill the parameters. The table below lists each preset's silence, min speech, and max speech values:
| Preset | Silence duration | Min speech duration | Max speech duration | Best for |
|---|---|---|---|---|
| Meeting | 800 ms | 300 ms | 20 s | Group discussions with natural pauses |
| Subtitle | 500 ms | 200 ms | 10 s | Short, timed segments for subtitles |
| Interview | 1000 ms | 400 ms | 30 s | Longer turns with clear speaker pauses |
| Lecture | 1200 ms | 500 ms | 60 s | Continuous speech with few interruptions |
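Conceptually, a preset is a named bundle of values that overwrites the current settings. The sketch below models that with hypothetical names (`VadSettings`, `PRESETS`, `apply_preset`) and is not Echosy's code. Because the table lists no speech ratio per preset, the example leaves that value unchanged, which is an assumption.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class VadSettings:
    """Hypothetical settings object; defaults mirror the recording defaults."""
    silence_ms: int = 800
    min_speech_ms: int = 300
    max_speech_s: int = 20
    speech_ratio: float = 2.5


# Values copied from the preset table. Speech ratio is not listed there,
# so this sketch does not touch it (assumption).
PRESETS = {
    "meeting":   {"silence_ms": 800,  "min_speech_ms": 300, "max_speech_s": 20},
    "subtitle":  {"silence_ms": 500,  "min_speech_ms": 200, "max_speech_s": 10},
    "interview": {"silence_ms": 1000, "min_speech_ms": 400, "max_speech_s": 30},
    "lecture":   {"silence_ms": 1200, "min_speech_ms": 500, "max_speech_s": 60},
}


def apply_preset(current: VadSettings, name: str) -> VadSettings:
    """Return a copy of `current` with the preset's values filled in."""
    try:
        values = PRESETS[name.lower()]
    except KeyError:
        raise ValueError(f"unknown preset: {name!r}") from None
    return replace(current, **values)


print(apply_preset(VadSettings(speech_ratio=3.0), "Lecture"))
```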
Tuning Tips
- Segments too short? Increase silence duration so pauses within sentences are not mistaken for segment endings.
- Segments too long? Decrease max speech duration to force earlier splits, or decrease silence duration so that shorter pauses end a segment.
- Picking up background noise? Increase speech ratio to require louder speech for detection, or increase min speech duration to filter out short noise bursts.
- Missing quiet speech? Lower the speech ratio to make detection more sensitive.
- Dictation feels slow? Lower the silence duration for faster segment finalization.
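The effect of silence duration on segment count can be checked with a little arithmetic. The sketch below is a simplification that ignores min and max speech duration: it assumes a segment ends at every pause at least as long as the silence duration. The function name `count_segments` and the pause lengths are made up for the example.

```python
from typing import List


def count_segments(pauses_ms: List[int], silence_ms: int) -> int:
    """Count segments when every pause >= silence_ms ends a segment."""
    return 1 + sum(1 for pause in pauses_ms if pause >= silence_ms)


# Pauses between phrases in one stretch of speech (illustrative values).
pauses = [300, 700, 900, 1500]

for silence_ms in (600, 800, 1200):
    print(silence_ms, count_segments(pauses, silence_ms))
# 600 -> 4 segments, 800 -> 3 segments, 1200 -> 2 segments
```

Raising the silence duration merges phrases separated by shorter pauses, which is why it helps when segments are too short, and lowering it has the opposite effect.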