Voice and Audio System
How Vaani captures, generates, and plays voice with multi-engine reliability.
Overview
Vaani’s audio system has three main components:
Multi-Engine Speech Recognition - Converting your voice to text using multiple engines
Native Text-to-Speech - Converting responses to audio using Indian voices
Audio Playback - Playing music and voice through your speakers
Each component uses a fallback architecture for maximum reliability.
Recognition Architecture (V2):
Primary: Google Speech API (en-IN) - Cloud-based, about 95% accuracy for Indian accents
Backup: Vosk (Indian English model) - Offline, about 85% accuracy, <100ms processing
Fallback: Sphinx - Emergency recognition when others fail
Version: 2.0 (the multi-engine rewrite)
Branch: dev (where active development happens)
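The primary/backup/fallback ordering above can be sketched as a chain that tries each engine in turn. This is a minimal illustration, not Vaani's actual code: `recognize_with_fallback` and the three stand-in engine functions are hypothetical names, and real engines would wrap the Google, Vosk, and Sphinx clients.

```python
from typing import Callable, Optional, Sequence, Tuple

# Each engine is a (name, callable) pair. The callable takes raw audio bytes
# and returns text, returns None/"" when nothing was recognized, or raises.
Engine = Tuple[str, Callable[[bytes], Optional[str]]]


def recognize_with_fallback(
    audio: bytes, engines: Sequence[Engine]
) -> Tuple[Optional[str], Optional[str]]:
    """Try engines in order; return (text, engine_name) from the first success."""
    for name, recognize in engines:
        try:
            text = recognize(audio)
        except Exception as exc:  # e.g. network error or missing model
            print(f"{name} failed: {exc}")
            continue
        if text:  # None or an empty string counts as "no result"
            return text, name
    return None, None


# Hypothetical stand-ins for the real engines, used only for this demo.
def google_engine(audio: bytes) -> str:
    raise ConnectionError("no internet connection")


def vosk_engine(audio: bytes) -> str:
    return "play some music"


def sphinx_engine(audio: bytes) -> str:
    return "play sum music"


ENGINES = [("google", google_engine), ("vosk", vosk_engine), ("sphinx", sphinx_engine)]

if __name__ == "__main__":
    print(recognize_with_fallback(b"\x00\x00", ENGINES))
```

In this demo the cloud engine raises, so the result comes from the offline stand-in; the last engine is only reached if both earlier ones fail or return nothing.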
Speech Recognition (Multi-Engine V2)
How Multi-Engine Recognition Works
Audio Capture
- Records audio from your microphone
- Captures 16-bit PCM audio at a 16kHz sample rate
- Balanced energy threshold: 300 (optimized for Indian accents)
- Optimized timing:
  - Pause threshold: 0.8s (prevents early cutoff)
  - Phrase threshold: 0.3s (quick capture start)
  - Non-speaking duration: 0.6s (balanced end detection)
Processing
- No ambient calibration (removed for speed)
- Wake word detection with fuzzy matching (85% threshold)
- Context-aware wake word requirement (idle vs active)
Multi-Engine Recognition Chain
- Step 1: Try Google Speech API (language: en-IN)
  - Best accuracy (~95% for Indian English)
  - Requires internet connection
  - Typical latency: 1-2 seconds
- Step 2: If Google fails, try Vosk (Indian English model: en-in-0.4)
  - Good accuracy (~85%)
  - Fully offline
  - Ultra-fast: <100ms processing
- Step 3: If Vosk fails, try Sphinx
  - Basic accuracy
  - Emergency fallback
  - Always available
Wake Word Detection
- Fuzzy matching using RapidFuzz (85% threshold)
- Handles accent variations
- Context-aware requirement:
  - Idle: No wake word needed
  - Music/TTS: “Hey Vaani” required
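The wake word rules above can be sketched as follows. To keep the example dependency-free, the standard library's `difflib.SequenceMatcher` is swapped in for RapidFuzz, so scores run from 0 to 1 and the 85% threshold becomes 0.85. The function names, the state strings, and the sliding-window approach are assumptions for illustration, not Vaani's implementation.

```python
from difflib import SequenceMatcher

WAKE_WORD = "hey vaani"
THRESHOLD = 0.85  # mirrors the 85% threshold described above


def wake_word_score(transcript: str, wake_word: str = WAKE_WORD) -> float:
    """Best similarity between the wake word and any same-length word window."""
    words = transcript.lower().split()
    n = len(wake_word.split())
    if len(words) < n:
        windows = [" ".join(words)]
    else:
        windows = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return max(SequenceMatcher(None, wake_word, w).ratio() for w in windows)


def needs_wake_word(state: str) -> bool:
    """Wake word is only required while music or TTS is playing."""
    return state in {"music", "tts"}


def should_handle(transcript: str, state: str) -> bool:
    """Decide whether a transcript should be treated as a command."""
    if not needs_wake_word(state):
        return True
    return wake_word_score(transcript) >= THRESHOLD


if __name__ == "__main__":
    print(should_handle("play some jazz", "idle"))
    print(should_handle("play some jazz", "music"))
    print(should_handle("hey vani play some jazz", "music"))
```

Scoring word windows rather than the whole transcript lets a slightly misheard wake word ("hey vani") match even when a command follows it.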
Microphone Requirements
Standard USB or built-in laptop microphone
- The system checks your config for MICROPHONE_TYPE:
  - internal: Prioritizes “MacBook”, “Built-in” microphones.
  - external: Prioritizes “USB”, “Bluetooth”, “AirPods”.
If your preferred type isn’t found, it falls back to the best available option.
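A minimal sketch of that selection logic, assuming device names are available as a plain list of strings. `pick_microphone` is a hypothetical helper, and "best available" is simplified here to "first device in the list".

```python
from typing import List, Optional

# Keyword lists taken from the MICROPHONE_TYPE description above.
PRIORITY_KEYWORDS = {
    "internal": ["MacBook", "Built-in"],
    "external": ["USB", "Bluetooth", "AirPods"],
}


def pick_microphone(device_names: List[str], mic_type: str) -> Optional[int]:
    """Return the index of the first device matching the preferred type.

    Falls back to the first device when nothing matches, or None when
    there are no devices at all.
    """
    for keyword in PRIORITY_KEYWORDS.get(mic_type, []):  # earlier keywords win
        for index, name in enumerate(device_names):
            if keyword.lower() in name.lower():
                return index
    return 0 if device_names else None


if __name__ == "__main__":
    devices = ["USB Audio Device", "MacBook Pro Microphone"]
    print(pick_microphone(devices, "internal"))
    print(pick_microphone(devices, "external"))
```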
Testing Your Microphone
# Quick test
python3 test_audio_direct.py
# Detailed test
python3 << 'EOF'
from vaani_assistant.voice.speech_recognition import recognize_speech
print("Speak something (5 seconds)...")
text = recognize_speech(timeout=5)
print(f"You said: {text}")
EOF
Microphone Setup by OS
macOS
# Check microphone
system_profiler SPAudioDataType | grep "Microphone"
# In System Settings (System Preferences on older macOS) → Sound → Input:
# Select your microphone
# Ensure input volume is above 50%
Linux
# Check microphone
arecord -l
# Test recording
arecord -D hw:0,0 -d 5 test.wav
aplay test.wav
Windows (WSL)
Audio capture in WSL has limitations. For best experience:
Configure WSL to use Windows audio
Set microphone as default in Windows Sound settings
Grant WSL microphone permission
See Troubleshooting for WSL audio issues.
Microphone Settings in Vaani
# View current settings
python3 << 'EOF'
from vaani_assistant.config import settings
config = settings.config
print(f"Sample rate: {config.get('SPEECH_RECOGNITION_SAMPLE_RATE', 16000)}")
print(f"Language: {config.get('VOICE_LANGUAGE', 'en')}")
EOF
Text-to-Speech (Native Indian Voices)
How TTS Works
Text Input
- Receives response text to speak
- Non-blocking queue-based processing
- Thread-safe state management
Voice Selection
- Prioritizes Indian voices:
  - Veena (Indian English Female) - Primary
  - Rishi (Indian English Male)
  - Lekha (Hindi Female)
  - Samantha (US English) - Fallback
- Automatic selection based on language
- Graceful fallback if preferred voice unavailable
Audio Generation
- Uses macOS native TTS (nsss engine)
- Natural pronunciation for Indian English
- Handles Hindi and other Indian languages
- Queue-based processing (producer-consumer pattern)
Audio Output
- Plays through system audio device
- Automatic volume ducking during music playback
- Thread-safe speaking state management
- Updates context for wake word requirement
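The queue-based, non-blocking design described above follows the usual producer-consumer pattern. The sketch below shows one way it could look: `SpeechQueue` is a hypothetical class, and the `speak` callable stands in for whatever blocking call actually voices the text.

```python
import queue
import threading
from typing import Callable


class SpeechQueue:
    """Callers enqueue text without blocking; one worker thread speaks it in order."""

    def __init__(self, speak: Callable[[str], None]):
        self._speak = speak                 # blocking function that voices one utterance
        self._queue = queue.Queue()         # thread-safe FIFO of pending utterances
        self._speaking = threading.Event()  # thread-safe "currently speaking" flag
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def say(self, text: str) -> None:
        """Producer side: returns immediately."""
        self._queue.put(text)

    @property
    def is_speaking(self) -> bool:
        return self._speaking.is_set()

    def _run(self) -> None:
        """Consumer side: speak queued items until the None sentinel arrives."""
        while True:
            text = self._queue.get()
            if text is None:
                break
            self._speaking.set()
            try:
                self._speak(text)
            finally:
                self._speaking.clear()

    def close(self) -> None:
        """Finish everything already queued, then stop the worker."""
        self._queue.put(None)
        self._worker.join()


if __name__ == "__main__":
    tts = SpeechQueue(lambda text: print(f"[speaking] {text}"))
    tts.say("Hello")
    tts.say("This is Vaani")
    tts.close()
```

Other parts of the system can read `is_speaking` to decide, for example, whether the wake word should currently be required.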
Voice Configuration
Default configuration (config.json):
{
  "PREFERRED_VOICES": ["Veena", "Rishi", "Samantha", "Lekha"],
  "LANGUAGE_VOICE_MAP": {
    "hi": "Veena",
    "en": "Veena",
    "ta": "Veena",
    "te": "Veena"
  }
}
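One plausible way to combine these two settings is to try the language's mapped voice first and then walk PREFERRED_VOICES in order. The sketch below assumes exactly that; `select_voice` and the `installed` list are hypothetical, and Vaani's real lookup may differ.

```python
from typing import Dict, List, Optional

# Same values as the default config.json shown above.
CONFIG: Dict = {
    "PREFERRED_VOICES": ["Veena", "Rishi", "Samantha", "Lekha"],
    "LANGUAGE_VOICE_MAP": {"hi": "Veena", "en": "Veena", "ta": "Veena", "te": "Veena"},
}


def select_voice(language: str, installed: List[str], config: Dict = CONFIG) -> Optional[str]:
    """Pick the mapped voice for the language, else the first preferred voice installed."""
    mapped = config["LANGUAGE_VOICE_MAP"].get(language)
    candidates = ([mapped] if mapped else []) + config["PREFERRED_VOICES"]
    for voice in candidates:
        if voice in installed:
            return voice
    return None  # nothing suitable installed; caller decides what to do


if __name__ == "__main__":
    print(select_voice("en", ["Samantha", "Veena"]))
    print(select_voice("hi", ["Samantha", "Lekha"]))
```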
Testing Voices
# List available Indian voices
say -v '?' | grep -E "(hi_IN|en_IN)"
# Test Veena voice
say -v Veena "Testing Indian English voice"
# Test Lekha voice (Hindi)
say -v Lekha "नमस्ते, मैं वाणी हूँ"
Voice Configuration
Choose different voices by language:
# View available voices
python3 << 'EOF'
from vaani_assistant.config import settings
config = settings.config
language = config.get('VOICE_LANGUAGE', 'en')
print(f"Current language: {language}")
print(f"Available voices: {config.get('VOICE_OPTIONS', {}).get(language, [])}")
EOF
Configure Voice
# Edit .env file
echo "VOICE_LANGUAGE=en" >> .env
echo "TTS_ENGINE_VOICE_ID=female" >> .env
# Then restart Vaani
python3 main.py
Available voice identifiers by language are in Configuration.
TTS Testing
# Test TTS
python3 << 'EOF'
from vaani_assistant.core.tts_engine import text_to_speech
text = "Hello, this is Vaani. Testing text to speech."
audio_file = text_to_speech(text, language="en")
print(f"Generated audio: {audio_file}")
EOF
Audio Playback
Playback Pipeline
Audio File
- Generated TTS audio (WAV, MP3)
- Or music from YouTube
Audio Device Selection
- Detects available audio output devices
- Selects default or configured device
- Falls back to next available
Playback
- Initiates playback through system audio
- Handles volume control
- May reduce music volume if configured
Completion
- Waits for audio to finish
- Cleans up resources
Audio Devices
Check available audio devices:
# List audio devices
python3 << 'EOF'
import sounddevice as sd
devices = sd.query_devices()
for i, device in enumerate(devices):
print(f"{i}: {device['name']}")
EOF
Configuring Audio Output
# Edit .env to specify device
echo "AUDIO_OUTPUT_DEVICE_INDEX=0" >> .env
# Or let Vaani auto-detect
echo "AUDIO_OUTPUT_DEVICE_INDEX=-1" >> .env
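As a rough sketch of how such a setting might be interpreted, the helper below treats -1 (or any out-of-range value) as "auto-detect". `resolve_output_device` and its `system_default` parameter are hypothetical; only the AUDIO_OUTPUT_DEVICE_INDEX name comes from the settings above.

```python
import os
from typing import List, Optional

AUTO_DETECT = -1


def resolve_output_device(
    configured: int, device_names: List[str], system_default: int = 0
) -> Optional[int]:
    """Return a usable device index, or None when no devices exist."""
    if not device_names:
        return None
    if 0 <= configured < len(device_names):
        return configured          # explicit, valid choice
    if 0 <= system_default < len(device_names):
        return system_default      # auto-detect or invalid index: use the default
    return 0                       # last resort: first device


if __name__ == "__main__":
    configured = int(os.environ.get("AUDIO_OUTPUT_DEVICE_INDEX", AUTO_DETECT))
    print(resolve_output_device(configured, ["Speakers", "Headphones"]))
```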
Music Playback
Music playback is a special type of audio:
Music Search - Query YouTube for song/artist
Download - Get audio stream from YouTube
Playback - Play through audio output device
See customization for music-specific settings.
Testing Playback
# Test speaker/headphones
python3 test_audio_verification.py
# Test music playback
python3 test_music.py
Audio Quality
Sample Rate
16kHz (Standard) - Default for Vaani, good for voice
44.1kHz (CD Quality) - Better for music, uses more bandwidth
48kHz (Professional) - Highest quality, rarely needed
Default: 16kHz for speech recognition (sufficient for voice quality)
Bit Depth
16-bit (Standard) - Default, good enough for voice
24-bit (High) - Better dynamic range, rarely needed
Default: 16-bit (standard)
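To make the trade-off concrete, the data rate of uncompressed PCM is sample rate × bytes per sample × channels. A quick calculation (the helper name is just for illustration):

```python
def pcm_bytes_per_second(sample_rate_hz: int, bit_depth: int, channels: int = 1) -> int:
    """Uncompressed PCM data rate in bytes per second."""
    return sample_rate_hz * (bit_depth // 8) * channels


if __name__ == "__main__":
    for rate in (16_000, 44_100, 48_000):
        kib = pcm_bytes_per_second(rate, 16) / 1024
        print(f"{rate} Hz, 16-bit mono: {kib:.1f} KiB/s")
```

At the default 16kHz, 16-bit mono, that is 32,000 bytes per second; 44.1kHz at the same bit depth needs 88,200 bytes per second, roughly 2.8 times as much.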
Noise and Issues
Common audio problems:
Echo (hearing yourself)
→ Check microphone position
→ Reduce speaker volume
→ Use noise cancellation headset
Background noise being amplified
→ Move closer to microphone
→ Check microphone levels
→ Enable noise reduction in settings
Crackling or distortion
→ Lower input/output volume
→ Use higher quality microphone
→ Check audio cable connections
Stuttering playback
→ Close other applications
→ Increase audio buffer size
→ Check CPU usage with: top -l 1 (macOS) or top -bn1 (Linux)
Noise Handling
Noise Reduction
Vaani can reduce background noise before recognition:
# Enable noise reduction in .env
echo "SPEECH_RECOGNITION_NOISE_REDUCTION=true" >> .env
# Restart Vaani
python3 main.py
How It Works
Records baseline of room noise (first 0.5 seconds)
Subtracts baseline from incoming audio
Reduces background noise while preserving voice
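The exact algorithm is not documented here. To illustrate the "measure a baseline, then suppress what looks like it" idea, the sketch below uses a much simpler technique, a time-domain noise gate: it measures the RMS level of the first 0.5 seconds and silences any later frame that is not clearly louder. The function names, frame size, and margin are assumptions for illustration.

```python
import math
from typing import List


def rms(samples: List[int]) -> float:
    """Root-mean-square level of a block of samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def noise_gate(
    samples: List[int],
    sample_rate: int = 16000,
    baseline_seconds: float = 0.5,
    frame_ms: int = 20,
    margin: float = 2.0,
) -> List[int]:
    """Zero out frames whose level is within `margin` times the baseline noise floor."""
    noise_floor = rms(samples[: int(sample_rate * baseline_seconds)])
    frame_len = int(sample_rate * frame_ms / 1000)
    gated: List[int] = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        if rms(frame) > noise_floor * margin:
            gated.extend(frame)               # loud enough: keep as-is
        else:
            gated.extend([0] * len(frame))    # treated as background noise
    return gated


if __name__ == "__main__":
    quiet = [10, -10] * 4000      # 0.5 s of low-level noise
    loud = [1000, -1000] * 160    # one 20 ms frame of "speech"
    out = noise_gate(quiet + loud)
    print(sum(1 for s in out if s != 0), "samples kept")
```

A gate like this also shows why the limitations listed next apply: if the noise level changes after the baseline is taken, or speech is barely louder than the noise, the wrong frames get silenced.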
Limitations
Works best with steady, consistent background noise
Is less effective when the background noise keeps changing
May remove part of your speech when it overlaps with the noise
When to Disable
Disable noise reduction if:
You’re in a quiet environment (unnecessary processing)
Noise reduction is cutting off parts of speech
Performance is a concern
echo "SPEECH_RECOGNITION_NOISE_REDUCTION=false" >> .env
Audio Ducking
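What Is Audio Ducking?
Audio ducking temporarily lowers the volume of one audio source, here the music, so that another, here your voice or Vaani’s response, can be heard clearly.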
Smart Ducking Strategy
Vaani uses a “Smart Ducking” approach to balance music enjoyment and voice interaction:
Wake Word Listening: Music plays at FULL VOLUME. The music is not ducked while Vaani waits for “Hey Vaani”.
Command Listening: Only after the wake word is detected, the music volume drops instantly to 15% (configurable).
Response: Music stays low while Vaani speaks the response.
Resuming: Music volume gracefully fades back up to original level.
Why? Previous assistants ducked music constantly while listening for the wake word, ruining the song. Vaani’s approach keeps the music loud until you actually engage the assistant.
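The four phases above map naturally onto a small state table plus a fade ramp. The sketch below is one way to express that; the state names, `target_volume`, and `fade_ramp` are hypothetical, the duck level is assumed to be a fraction of the original volume, and the 100 ms step size is an arbitrary choice.

```python
from enum import Enum, auto
from typing import List


class State(Enum):
    WAKE_WORD_LISTENING = auto()  # waiting for "Hey Vaani"; music untouched
    COMMAND_LISTENING = auto()    # wake word heard; music ducked
    RESPONDING = auto()           # Vaani is speaking; music stays ducked
    RESUMING = auto()             # response finished; music returns to original


DUCKED_STATES = {State.COMMAND_LISTENING, State.RESPONDING}


def target_volume(state: State, original: float = 1.0, duck_level: float = 0.15) -> float:
    """Music volume the player should aim for in the given state."""
    return original * duck_level if state in DUCKED_STATES else original


def fade_ramp(start: float, end: float, fade_ms: int = 500, step_ms: int = 100) -> List[float]:
    """Evenly spaced volume steps from start to end; the last step equals end."""
    steps = max(1, fade_ms // step_ms)
    ramp = [start + (end - start) * i / steps for i in range(1, steps)]
    return ramp + [end]


if __name__ == "__main__":
    ducked = target_volume(State.RESPONDING)
    restored = target_volume(State.RESUMING)
    for volume in fade_ramp(ducked, restored):
        print(f"set music volume to {volume:.2f}")
```

The drop into the ducked state is applied in one step, matching "drops instantly" above, while the return to the original level uses the ramp.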
Music Mode Sensitivity
When music is playing loudly, Vaani automatically switches to Music Mode:
- Heuristic Filter: It ignores long sentences (likely lyrics) to prevent false wake word triggers.
- Dynamic Threshold: It raises the microphone threshold significantly (to >3000) so it only hears you if you speak over the music.
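Both Music Mode rules can be sketched as a single acceptance check. The constant names and `accept_utterance` are hypothetical, the word-count cutoff is an invented value, and 3000 is used as a representative "raised" threshold.

```python
NORMAL_THRESHOLD = 300   # energy threshold from the capture settings
MUSIC_THRESHOLD = 3000   # raised threshold while music plays (representative value)
MAX_COMMAND_WORDS = 8    # hypothetical cutoff: longer utterances are treated as lyrics


def energy_threshold(music_playing: bool) -> int:
    """Dynamic threshold: much higher while music is playing."""
    return MUSIC_THRESHOLD if music_playing else NORMAL_THRESHOLD


def accept_utterance(transcript: str, energy: float, music_playing: bool) -> bool:
    """Return True if the utterance should be passed on to wake word detection."""
    if energy < energy_threshold(music_playing):
        return False  # too quiet relative to the current threshold
    if music_playing and len(transcript.split()) > MAX_COMMAND_WORDS:
        return False  # heuristic filter: long sentence, probably lyrics
    return True


if __name__ == "__main__":
    print(accept_utterance("hey vaani next song", 3500, music_playing=True))
    print(accept_utterance("hey vaani next song", 1200, music_playing=True))
```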
Configure Ducking
# Enable ducking (default)
echo "AUDIO_DUCKING_ENABLED=true" >> .env
# Set ducking level (0.2 = 20% volume while speaking)
echo "AUDIO_DUCKING_LEVEL=0.2" >> .env
# Set fade time (milliseconds)
echo "AUDIO_DUCKING_FADE_TIME=500" >> .env
Disable Ducking
echo "AUDIO_DUCKING_ENABLED=false" >> .env
Limitations
Only works with audio Vaani controls
System-level music apps won’t be affected
Timing might be slightly off on slower systems
Multi-Language Audio
Language Support
Vaani supports 32 languages for audio:
English (en), Spanish (es), French (fr), German (de),
Italian (it), Portuguese (pt), Russian (ru), Japanese (ja),
Chinese (zh), Korean (ko), Arabic (ar), Hindi (hi),
Turkish (tr), Polish (pl), Dutch (nl), Swedish (sv),
Norwegian (no), Danish (da), Finnish (fi), Czech (cs),
Greek (el), Hebrew (he), Thai (th), Vietnamese (vi),
Indonesian (id), Filipino (fil), Malay (ms), Romanian (ro),
Bulgarian (bg), Hungarian (hu), Croatian (hr), Serbian (sr)
Switching Languages
# Interactive setup
python3 -c "from vaani_assistant.config import global_config; global_config.setup_initial()"
# Or edit .env
echo "VOICE_LANGUAGE=es" >> .env
Language-Specific Voices
Each language has multiple voice options:
# Example: English voices
# female, male, female_uk, male_deep, etc.
echo "VOICE_LANGUAGE=en" >> .env
echo "TTS_ENGINE_VOICE_ID=female" >> .env
See Configuration for complete voice list by language.
Accent Considerations
Speech recognition works with various accents
Specify your language for best accuracy
Some accents may reduce recognition accuracy
Advanced Configuration
Audio Buffering
For systems with audio lag or dropouts:
# Increase buffer size (higher = more latency but fewer dropouts)
echo "AUDIO_BUFFER_SIZE=4096" >> .env
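The latency cost of a larger buffer is easy to estimate. Assuming AUDIO_BUFFER_SIZE is measured in frames (samples per channel), which this document does not confirm, one buffer holds buffer size ÷ sample rate seconds of audio:

```python
def buffer_latency_ms(buffer_frames: int, sample_rate_hz: int) -> float:
    """Duration of audio held in one buffer, in milliseconds."""
    return 1000.0 * buffer_frames / sample_rate_hz


if __name__ == "__main__":
    for frames in (1024, 2048, 4096):
        print(f"{frames} frames at 16 kHz: {buffer_latency_ms(frames, 16_000):.0f} ms")
```

Under that assumption, a 4096-frame buffer at 16kHz adds about 256 ms per buffer, so it is worth raising the value only as far as needed to stop the dropouts.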
Sample Rate Override
# Match your system's default
echo "SPEECH_RECOGNITION_SAMPLE_RATE=44100" >> .env
Device Fallback
If primary audio device fails:
# Try next device automatically
echo "AUDIO_FALLBACK_ENABLED=true" >> .env
Debugging Audio Issues
Enable Audio Logging
LOG_LEVEL=DEBUG python3 main.py 2>&1 | grep -i audio
Check Audio System
# macOS
system_profiler SPAudioDataType
# Linux
aplay -l
pactl list short sinks
# Windows (WSL)
pactl list short sinks
Test Components Independently
# Test microphone
python3 test_audio_direct.py
# Test TTS
python3 << 'EOF'
from vaani_assistant.core.tts_engine import text_to_speech
text_to_speech("Testing text to speech")
EOF
# Test playback
python3 << 'EOF'
from vaani_assistant.core.audio_player import play_audio
play_audio("test_audio.wav")
EOF
Capture Audio for Analysis
# Record what Vaani is hearing
python3 << 'EOF'
import sounddevice as sd
import soundfile as sf
duration = 5 # seconds
print(f"Recording for {duration} seconds...")
recording = sd.rec(int(16000 * duration), samplerate=16000, channels=1)
sd.wait()
sf.write("captured_audio.wav", recording, 16000)
print("Saved to captured_audio.wav")
EOF
See Troubleshooting for common audio problems and solutions.
Technical Details
Audio Processing Pipeline
When you speak to Vaani:
Microphone captures analog sound
Sound card converts to digital (analog-to-digital conversion)
Vaani records 16-bit PCM at 16kHz
Optional noise reduction is applied
Audio is buffered in 1-second chunks
When you stop speaking (silence detected), processing starts
Audio sent to speech recognition engine
Recognized text is returned
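The "silence detected" step can be illustrated with the energy threshold (300) and pause threshold (0.8s) from the capture settings. The sketch below works on a list of per-chunk energy values; `find_end_of_phrase` and the 100 ms chunk length are assumptions chosen to keep the example small, not Vaani's actual buffering.

```python
from typing import List, Optional


def find_end_of_phrase(
    chunk_energies: List[float],
    chunk_ms: int = 100,
    energy_threshold: float = 300,
    pause_ms: int = 800,
) -> Optional[int]:
    """Return the index of the chunk where the phrase is considered finished.

    The phrase ends once speech has started and the following silence has
    lasted at least `pause_ms`. Returns None if that never happens.
    """
    speaking = False
    silent_chunks = 0
    for index, energy in enumerate(chunk_energies):
        if energy >= energy_threshold:
            speaking = True
            silent_chunks = 0          # any speech resets the silence timer
        elif speaking:
            silent_chunks += 1
            if silent_chunks * chunk_ms >= pause_ms:
                return index
    return None


if __name__ == "__main__":
    energies = [50, 50, 900, 1200, 800] + [40] * 10
    print(find_end_of_phrase(energies))
```

Counting whole chunks in integer milliseconds avoids the rounding drift that comes from repeatedly adding fractional seconds.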
TTS Generation Pipeline
When Vaani responds:
Response text is prepared
Selected TTS engine is called
Engine generates audio data
Audio may be cached
Audio output device is selected
Audio is decoded (if necessary)
Volume levels are set
Audio is played through output device
Music Playback Pipeline
When music is requested:
Song name is sent to YouTube search
Top result is selected
Audio stream is downloaded
Stream is decoded in real-time
Audio ducking is applied if enabled
Audio is played through output device
Performance Tuning
For Slow Systems
# Use faster TTS engine
echo "TTS_ENGINE_PRIORITY=standard" >> .env
# Disable noise reduction
echo "SPEECH_RECOGNITION_NOISE_REDUCTION=false" >> .env
# Reduce sample rate (may lower recognition accuracy)
echo "SPEECH_RECOGNITION_SAMPLE_RATE=8000" >> .env
For High Quality
# Use advanced TTS
echo "TTS_ENGINE_PRIORITY=advanced" >> .env
# Enable noise reduction
echo "SPEECH_RECOGNITION_NOISE_REDUCTION=true" >> .env
# Use higher sample rate
echo "SPEECH_RECOGNITION_SAMPLE_RATE=44100" >> .env
Next Steps
See Configuration for all audio settings
Read Troubleshooting for audio problems
Check customization for music options
Review project_structure for code details