Performance Analysis & Optimization#
This section provides a detailed analysis of Vaani’s performance characteristics, focusing on latency, resource utilization, and optimization strategies. Understanding these metrics is crucial for deploying a responsive voice assistant.
Architecture-Level Performance#
Vaani utilizes a hybrid architecture that balances privacy (local processing) and intelligence (cloud processing).
Wake Word Detection (Vosk):
- Mechanism: Continuous listening stream processed by a lightweight Kaldi-based model.
- Latency: < 200ms.
- Resource Usage: Low CPU footprint (~5-10% of a single core on a modern CPU).
Speech Recognition (ASR):
- Offloaded (Google Speech API): Used for complex queries. High accuracy but introduces network latency (200-800ms).
- Local (Vosk): Used for command control and offline mode. Zero network latency but requires ~500MB RAM for the model.
Inference (Gemini):
- The primary bottleneck for complex interactions. Response generation time depends on query complexity and API load, typically ranging from 1s to 3s.
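As a rough worked example, the cloud-path figures above can be added up to estimate a latency budget. This is only a sketch: the stage names are illustrative labels rather than Vaani identifiers, and the sum ignores local compute, TTS synthesis, and audio playback time, none of which are quantified here.

```python
# Latency ranges in seconds, taken from the figures quoted in this section.
# The dictionary keys are illustrative labels, not names from the Vaani codebase.
STAGES = {
    "asr_cloud": (0.2, 0.8),  # Google Speech API network latency
    "inference": (1.0, 3.0),  # Gemini response generation
}


def ttr_budget(stages):
    """Return (best_case, worst_case) total latency in seconds."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


best, worst = ttr_budget(STAGES)
print(f"Cloud path budget: {best:.1f}s to {worst:.1f}s")
```

Under these assumptions, inference accounts for most of the budget at both ends of the range, which is why it is treated as the primary bottleneck.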
Latency Optimization#
Reducing “Time-to-Response” (TTR) is the primary optimization goal.
Network Configuration#
Since intelligence is cloud-backed, network stability is paramount.
- DNS Resolution: Use a fast DNS provider (1.1.1.1 or 8.8.8.8) to reduce API connection setup time.
- Keep-Alive: The requests session in Vaani/intellegence handles connection pooling, so an established connection is reused instead of paying the SSL/TLS handshake cost on every turn.
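The keep-alive behaviour described above can be sketched with a shared `requests.Session`. This is a minimal illustration, not Vaani's actual code: the helper name `build_session`, the `SESSION` constant, and the pool size are assumptions.

```python
import requests
from requests.adapters import HTTPAdapter


def build_session(pool_size: int = 4) -> requests.Session:
    """Create a session whose pooled connections are reused across calls.

    The helper name and pool size are illustrative assumptions.
    """
    session = requests.Session()
    adapter = HTTPAdapter(pool_connections=pool_size, pool_maxsize=pool_size)
    # Route all HTTPS traffic through the pooled adapter.
    session.mount("https://", adapter)
    return session


# Create the session once at startup and reuse it for every turn,
# rather than calling requests.get()/requests.post() directly each time.
SESSION = build_session()
```

Calls made through the same session to the same host can then reuse an open connection, so only the first request pays the full connection setup cost.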
Audio Buffering#
The pyaudio stream buffer size (chunk size) directly impacts input latency.
- Default Config: 4096 frames (~0.09s at 44.1kHz).
- Optimization: Reducing the chunk size to 1024 frames (~0.02s) gives faster wake-word reaction, but the stream must be read about four times as often, which raises CPU overhead.
Strategies:
# Example: Adjusting chunk size in config (if exposed). Set only one value.
CHUNK_SIZE = 1024  # Lower latency, more frequent buffer reads
# CHUNK_SIZE = 4096  # Lower CPU usage, higher latency
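The trade-off between the two chunk sizes follows from simple arithmetic: one buffer holds `chunk_size / sample_rate` seconds of audio. The sketch below assumes the 44.1kHz sample rate quoted above; the function names are illustrative.

```python
SAMPLE_RATE = 44_100  # Hz, the rate quoted in this section


def chunk_latency_ms(chunk_size: int, sample_rate: int = SAMPLE_RATE) -> float:
    """Duration of audio held in one buffer, in milliseconds."""
    return 1000.0 * chunk_size / sample_rate


def reads_per_second(chunk_size: int, sample_rate: int = SAMPLE_RATE) -> float:
    """How often the stream must be read to keep up with the microphone."""
    return sample_rate / chunk_size


for size in (1024, 4096):
    print(
        f"{size} frames: {chunk_latency_ms(size):.1f} ms per buffer, "
        f"{reads_per_second(size):.1f} reads/s"
    )
```

At 44.1kHz this gives roughly 23 ms per buffer for 1024 frames against roughly 93 ms for 4096 frames, at the cost of about four times as many reads per second.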
Resource Utilization#
Memory (RAM)#
Baseline: ~60MB (Python runtime + core libraries).
With Vosk Model: +50MB (Small model) to +1.5GB (Large model).
Recommendation: A Raspberry Pi 4 (2GB+) or standard laptop is sufficient. For constrained environments (Pi Zero), stick to the ‘small’ Vosk models.
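To check these memory figures on a given device, the peak resident set size can be read before and after the model is loaded. This is a sketch that assumes a Unix-like system (the standard-library `resource` module is not available on Windows); the helper name is illustrative and the model-loading step is left as a placeholder.

```python
import resource
import sys


def peak_rss_mb() -> float:
    """Peak resident set size of the current process in MB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


before = peak_rss_mb()
# Placeholder: load the Vosk model here before taking the second reading.
after = peak_rss_mb()
print(f"Peak RSS before: {before:.1f} MB, after: {after:.1f} MB")
```

Because this reports the peak rather than the current value, the difference between the two readings shows how far loading the model pushed memory use above the earlier high-water mark.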
CPU Profiling#
The most CPU-intensive operations are:
1. FFT (Fast Fourier Transform): Run continuously by Vosk on the audio stream.
2. SSL Encryption: During API calls to Google.
3. Audio Decoding: When playing music via VLC.
Multi-threading#
Vaani uses threading to ensure the UI (listening loop) never blocks.
- Main Thread: Handles the event loop and audio input.
- Worker Threads: Handle API requests and TTS synthesis.
- Process Isolation: Used for subprocess calls (e.g., system commands) to avoid Global Interpreter Lock (GIL) contention.
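The main-thread/worker-thread split can be illustrated with a thread pool from the standard library. This is a generic sketch of the pattern, not Vaani's implementation: `slow_api_call` and `run_demo` are hypothetical stand-ins, and the sleep calls simulate network and audio work.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_api_call(query: str) -> str:
    """Hypothetical stand-in for a blocking network request."""
    time.sleep(0.2)
    return f"response to {query!r}"


def run_demo():
    """Keep a simulated listening loop running while a worker handles the request."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        future = pool.submit(slow_api_call, "weather")
        ticks = 0
        # The main thread keeps "reading audio" instead of waiting on the network.
        while not future.done():
            ticks += 1  # stand-in for processing one audio chunk
            time.sleep(0.01)
        return ticks, future.result()


ticks, reply = run_demo()
print(f"Main loop ran {ticks} iterations while waiting; got: {reply}")
```

The loop counter shows that the main thread stayed responsive for the whole duration of the simulated request.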
Benchmarking Methodology#
To objectively measure performance changes:
Record TTR: Measure the delta between End-of-Speech detection and First-Byte-Audio output.
Profile: Use cProfile to identify hot code paths.
python -m cProfile -o profile.stats main.py
# Analyze with snakeviz
snakeviz profile.stats
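For the TTR measurement itself, a small timer built on `time.perf_counter` is one way to record the delta between the two events. The class and method names below are hypothetical; where the two marks are called from depends on how end-of-speech detection and audio output are wired in the application.

```python
import statistics
import time


class TTRTimer:
    """Collects Time-to-Response samples in seconds (illustrative helper)."""

    def __init__(self):
        self._start = None
        self.samples = []

    def mark_end_of_speech(self):
        """Call when end-of-speech is detected."""
        self._start = time.perf_counter()

    def mark_first_audio(self) -> float:
        """Call when the first byte of response audio is played."""
        if self._start is None:
            raise RuntimeError("mark_end_of_speech() was not called first")
        elapsed = time.perf_counter() - self._start
        self.samples.append(elapsed)
        self._start = None
        return elapsed

    def median(self) -> float:
        """Median TTR across all recorded turns."""
        return statistics.median(self.samples)


timer = TTRTimer()
timer.mark_end_of_speech()
time.sleep(0.05)  # stand-in for ASR, inference, and TTS work
elapsed = timer.mark_first_audio()
print(f"TTR: {elapsed * 1000:.0f} ms")
```

Comparing the median over several turns before and after a change is less sensitive to a single slow API response than comparing individual samples.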