Offline Voice in Every Room: Build ESP32 Voice Satellites for Home Assistant
Turn every room into a private, fast, and reliable voice zone. Learn how to deploy low-cost ESP32 voice satellites that connect to Home Assistant’s local speech pipeline—no cloud, no lag, no subscriptions.
- Put private, fast voice control in every room with low-cost ESP32 satellites.
- Use Home Assistant’s local wake word, speech-to-text, and TTS—no cloud needed.
- Get practical wiring, network, and tuning tips to avoid lag and false wakes.
Voice control belongs everywhere you spend time—kitchen, hallway, bedroom—not just where the smart speaker happens to sit. The good news: you don’t need to buy multiple cloud-linked assistants to get that experience. With ESP32-based voice satellites and a fully local speech pipeline in Home Assistant, you can spread reliable, private voice control across your home for a fraction of the cost, and without handing your audio to third parties.
This guide explains the concept of room-level voice satellites, the parts you need, realistic costs, setup steps, and the key tuning tricks that determine whether voice control feels magical or flaky. If you’re comfortable plugging in a board, connecting it to Wi‑Fi, and clicking through a few Home Assistant screens, you’re ready.
What voice satellites are and why local matters
A voice satellite is a tiny always‑listening device that can hear a wake word and forward short speech audio to a local server for recognition. In our case, each satellite is an ESP32 with a microphone and speaker running ESPHome’s voice assistant features. The brain lives on your Home Assistant machine, where the wake word model, speech‑to‑text (STT), natural language understanding (NLU), and text‑to‑speech (TTS) run locally. The result is a room‑aware, private, and quick voice system.
This is fundamentally different from cloud speakers. Instead of streaming audio to a vendor for interpretation, your data stays at home. Latency drops, privacy increases, and you control the upgrade path. You also gain the flexibility to tailor commands to your home—custom intents like ‘tidy mode’ can mean pause the TV, lower blinds to 40%, and bump vacuum suction one level.
Local voice has been turbocharged recently by efficient, open building blocks. Typical stack pieces include:
- Wake word: OpenWakeWord models that run in Home Assistant, or microWakeWord models that run on the ESP32‑S3 itself.
- Speech-to-text: Whisper or faster-whisper, sized to your hardware.
- NLU: Home Assistant’s Assist pipelines or Rhasspy-like intent handling.
- TTS: Piper, a fast, natural-sounding offline voice.
Because processing is local, the system remains usable even if the internet drops. And since satellites are inexpensive, you can cover more rooms for less than the price of one premium speaker.
Hardware you need and suggested builds
You can assemble a satellite from a handful of commodity parts or use a ready-made ESP32 unit with built-in mic and speaker. Below are solid options that balance cost, sound quality, and simplicity.
| Option | Microphone | Speaker/Amplifier | Pros | Est. Cost (USD) | Best for |
|---|---|---|---|---|---|
| ESP32‑S3‑BOX‑3 | Built-in I2S mic array | Built-in speaker & amp | Works out of the box with ESPHome; screen; decent audio | 60–80 | Fast start, living spaces |
| M5Stack Atom Echo | Built-in I2S mic | Small speaker built-in | Tiny footprint; USB power; easy wall mounting | 15–25 | Hallways, bathrooms, bedside |
| Custom ESP32 + INMP441 + MAX98357A | INMP441 I2S mic module | MAX98357A I2S amp + small 4–8Ω speaker | Lowest cost; configurable mic & speaker placement | 12–25 | DIY rooms, hidden installs |
| ESP32 + ReSpeaker 2‑Mic (I2S) | 2‑mic I2S array | External amp/speaker | Better pickup in noisy rooms | 25–45 | Kitchens, workshops |
You’ll power each satellite with a standard 5V USB supply. For wall or ceiling mounting, consider a clean cable path or a recessed outlet. If you prefer a zero‑tools start, an ESP32‑S3‑BOX‑3 on a shelf is hard to beat. DIYers can hide an ESP32 with a small mic behind a grille and run thin speaker wire to a surface‑mount speaker where the sound belongs.
Beyond the satellites, your Home Assistant host should have enough muscle for speech models. A Raspberry Pi 5 can handle small to medium speech pipelines well; a mini PC with an Intel N100 or better feels snappier for larger Whisper models. Keep storage fast and network stable—low latency between satellites and Home Assistant is key.
Basic materials for a custom DIY satellite:
- ESP32 (preferably ESP32‑S3 for stronger I2S and USB power stability)
- INMP441 I2S microphone module
- MAX98357A I2S amplifier + 3W 4Ω speaker
- USB power supply and cable, short jumper wires
Step‑by‑step: from wake word to working commands
Set aside about an afternoon for the first satellite and pipeline; adding more rooms afterward takes minutes.
1) Prepare the local voice pipeline in Home Assistant
Open Home Assistant and install the add‑ons you’ll need. Piper covers TTS, and whisper‑based add‑ons handle STT. Choose a language and voice for Piper that matches your household. In Settings → Voice Assistants → Assist, create a new pipeline: select your preferred wake word model (OpenWakeWord offers lightweight, accurate options), pick STT (Whisper small or medium for speed/accuracy balance), set NLU to Home Assistant’s built‑in intents, and set TTS to Piper. Give the pipeline a short name like ‘Local English’ and make it the default.
Tip: Start with smaller STT models for responsiveness. You can upgrade the model later once you confirm your network and mic placement are solid.
2) Flash your first ESP32 satellite
In ESPHome, add a new device. Many boards have ready‑made voice assistant configurations. If you’re using ESP32‑S3‑BOX‑3 or M5Stack Atom Echo, pick or adapt their templates and enable the voice assistant component. For a custom build, set the I2S pins for the mic (INMP441) and for the speaker amp (MAX98357A), and enable voice assistant features in the YAML. Give the device a helpful name like ‘voice-kitchen’; ESPHome device names should use lowercase letters, digits, and hyphens rather than underscores.
Once flashed, the satellite will appear in Home Assistant’s Integrations. Adopt it and confirm the ESPHome entity named ‘Voice Assistant’ is present. In the device’s settings, set the Assist Pipeline to the one you created. If your satellite has LEDs, enable a brief visual cue on wake to increase confidence that it heard you.
3) Place and calibrate the satellite
Mount or set the device where it can hear you clearly: roughly ear height, away from direct airflow and echo-prone corners. Kitchens benefit from placement away from hoods and noisy appliances. If you have a fan or HVAC register nearby, rotate the mic module so the port is not facing the noise source. Start with a modest wake word sensitivity and increase only if you find yourself repeating the wake word frequently.
Acoustic tips that pay off:
- Aim the speaker toward the area where you stand to reduce volume needed (less echo).
- Use felt bumpers under desktop units to dampen vibration.
- If you have a smart speaker in the same room, disable its wake word or move it to another room to avoid crosstalk.
4) Test round‑trip speed and accuracy
Say the wake word, then a short command you already use in Home Assistant, like ‘Turn on the kitchen lights.’ Measure the time from end of speech to action. A tuned local system typically responds in 600–1,500 ms. If it feels slow, check Wi‑Fi signal strength on the satellite and CPU load on the Home Assistant host. Wired Ethernet for the HA host and a strong access point near the room make a noticeable difference.
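To make the latency check repeatable, you can note a few stopwatch readings and summarize them. The sketch below is illustrative tooling, not part of Home Assistant: the function name and the sample readings are made up, and the 1,500 ms ceiling is simply the upper end of the range quoted above.

```python
from statistics import median

# Upper end of the response range quoted in the text (milliseconds).
TARGET_MAX_MS = 1500


def summarize_latency(samples_ms):
    """Summarize hand-timed delays from end of speech to action, in ms."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    mid = median(samples_ms)
    worst = max(samples_ms)
    if mid <= TARGET_MAX_MS:
        verdict = "ok"
    else:
        verdict = "slow: check Wi-Fi signal and Home Assistant CPU load"
    return {"median_ms": mid, "worst_ms": worst, "verdict": verdict}


# Hypothetical readings from five test commands in one room.
print(summarize_latency([900, 1100, 850, 1700, 1000]))
```

Judging by the median rather than a single reading keeps one slow outlier from sending you off to retune a room that is otherwise fine.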
5) Teach your home to understand your language
Assist can parse many natural phrasings. To go further, define custom intents for routines you trigger often. Map ‘tidy mode’ to a script or scene that lowers blinds, pauses music, and starts the robot vacuum in silent mode. Add synonyms like ‘clean up’ or ‘let’s tidy’ so family members don’t have to memorize exact words. Test each phrase and tweak until the success rate is near 100%.
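Phrase testing can be scripted as text, without speaking each variant aloud. The sketch below posts sentences to Home Assistant's REST conversation endpoint (`/api/conversation/process`) and counts how many were understood. The host name, the token placeholder, and the helper names are assumptions for illustration. Keep in mind that each phrase really runs its command, so test with harmless routines.

```python
import json
import urllib.request

HA_URL = "http://homeassistant.local:8123"  # assumption: default host name and port
TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"      # placeholder: create one in your HA user profile


def ask_assist(text, language="en"):
    """Send one sentence to Assist and return the parsed JSON reply."""
    req = urllib.request.Request(
        f"{HA_URL}/api/conversation/process",
        data=json.dumps({"text": text, "language": language}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def success_rate(phrases, ask=ask_assist):
    """Return (fraction understood, list of phrases that failed)."""
    if not phrases:
        raise ValueError("need at least one phrase")
    failed = []
    for phrase in phrases:
        reply = ask(phrase)
        # Assist sets response_type to "error" when it could not handle the sentence.
        if reply.get("response", {}).get("response_type") == "error":
            failed.append(phrase)
    return 1 - len(failed) / len(phrases), failed
```

A call such as `success_rate(["tidy mode", "clean up", "let's tidy"])` returns the fraction understood plus the phrases that still need a synonym. The `ask` parameter exists so the counting logic can be exercised with a stub instead of a live server.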
6) Add safety and privacy guardrails
Keep all voice processing local by ensuring the pipeline you selected does not call external APIs. On your router, block the satellite’s outbound internet if you prefer a belt‑and‑suspenders approach. Many ESP32 devices also support a hardware mute—wire a small toggle in series with the mic VDD or use a GPIO to disable audio capture in software during quiet hours. Give guests confidence by adding a visible LED that lights only when the device is ‘awake’ and recording a command segment.
7) Repeat for more rooms
Once the first satellite works well, clone its ESPHome configuration, rename the device, and flash additional units. Assign each to the same pipeline or tune per room—e.g., a more aggressive noise profile in the kitchen, a quieter TTS voice in the nursery. You can expand coverage in minutes.
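When cloning configurations, consistent device names make the satellites easier to find later. The helper below is a hypothetical sketch that turns room names into names using only lowercase letters, digits, and hyphens, the character set ESPHome expects for device names. The `voice` prefix and the room list are examples, not requirements.

```python
import re


def device_name(room, prefix="voice"):
    """Build a device name from a room name: lowercase letters, digits, hyphens."""
    # Collapse every run of other characters into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", room.lower()).strip("-")
    if not slug:
        raise ValueError(f"cannot build a name from {room!r}")
    return f"{prefix}-{slug}"


# Example rooms; replace with your own.
for room in ["Kitchen", "Living Room", "Nursery"]:
    print(device_name(room))
```

This prints `voice-kitchen`, `voice-living-room`, and `voice-nursery`, which can be pasted into each cloned configuration.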
Practical automations voice satellites unlock right away:
- ‘Good night’ shuts down lights, locks doors, lowers thermostat, and announces tomorrow’s first calendar item.
- ‘Movie time’ dims the room to 10%, closes blinds to 80%, and switches TV input.
- ‘Hands-free timer’ sets a named kitchen timer and reads it back over the nearest satellite.
Network and reliability tips that most people miss: multicast DNS (mDNS) must work well for ESPHome discovery, while the audio itself travels over a direct connection between each satellite and Home Assistant, so it depends on signal quality and latency rather than on multicast. On managed Wi‑Fi, enable IGMP snooping and disable client isolation on the SSID you use for IoT. If your network supports band steering, lock satellites to 2.4 GHz for consistent range, and keep your Home Assistant host on Ethernet if possible. Assign static IPs to satellites to ease troubleshooting.
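With static IPs assigned, a quick reachability check helps separate network problems from pipeline problems. The sketch below times a TCP connection to each satellite on port 6053, the default port of the ESPHome native API. The function names are hypothetical, the check assumes the native API is enabled on the device, and the example IP addresses in the closing note are placeholders.

```python
import socket
import time

ESPHOME_API_PORT = 6053  # default port of the ESPHome native API


def connect_time_ms(host, port=ESPHOME_API_PORT, timeout=2.0):
    """Return the TCP connect time in ms, or None if the device is unreachable."""
    start = time.perf_counter()
    try:
        # Open and immediately close a connection; no API handshake is attempted.
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000
    except OSError:
        return None


def check_satellites(hosts, port=ESPHOME_API_PORT):
    """Map each host to its connect time in ms, or None when unreachable."""
    return {host: connect_time_ms(host, port) for host in hosts}
```

Run from the Home Assistant host, for example `check_satellites(["192.168.1.51", "192.168.1.52"])`. A `None` entry points at Wi‑Fi coverage, client isolation, or addressing. Connect times that swing widely between runs usually mean a weak signal in that room.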
If you hear frequent false wakes, adjust the wake word sensitivity downward and move the satellite away from reflective surfaces like bare walls or large windows. Soft furnishings, curtains, and bookshelves can dramatically reduce echo.
Cost reality check: a typical home with four rooms can be covered for roughly $80–200 when most rooms use Atom Echo or custom builds, which is often less than a single high-end smart speaker. You’ll pay once, keep your data, and iterate quickly.
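The estimate is easy to rerun for your own mix of builds. The sketch below copies the per-unit price ranges from the hardware table and sums them for a plan. The four-room plan is a hypothetical example, and power supplies and cables are not included.

```python
# Per-unit cost ranges (USD), copied from the hardware table.
UNIT_COST = {
    "ESP32-S3-BOX-3": (60, 80),
    "M5Stack Atom Echo": (15, 25),
    "Custom ESP32 + INMP441 + MAX98357A": (12, 25),
    "ESP32 + ReSpeaker 2-Mic": (25, 45),
}


def home_cost(plan):
    """Sum the (low, high) cost over a plan mapping build option to unit count."""
    low = sum(UNIT_COST[name][0] * count for name, count in plan.items())
    high = sum(UNIT_COST[name][1] * count for name, count in plan.items())
    return low, high


# Hypothetical four-room mix: one BOX-3, two Atom Echos, one custom build.
plan = {
    "ESP32-S3-BOX-3": 1,
    "M5Stack Atom Echo": 2,
    "Custom ESP32 + INMP441 + MAX98357A": 1,
}
print(home_cost(plan))  # (102, 155)
```

Four Atom Echos come to $60–100 by the same arithmetic, while four ESP32‑S3‑BOX‑3 units come to $240–320, so the low totals depend on choosing the cheaper builds for most rooms.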
Accessibility matters too: local voice with clear, low-latency feedback helps users who may find apps difficult. Pair satellites with visible cues—brief LED flashes or small e-ink cards—to reinforce successful recognition without blasting audio.
Scalability and future-proofing: because satellites are modular, you can upgrade parts (better mic, nicer speaker) without retraining the entire house. And as Home Assistant’s Assist and Piper continue to improve, you can swap in better models at the server without touching each satellite.
All speech processing stays on your Home Assistant host. Satellites send short audio to your server over your LAN and do not contact cloud endpoints. You can also firewall them to guarantee no outbound connections.
Voice control keeps working without internet. The wake word, STT, NLU, and TTS are local, so as long as your home network and Home Assistant host are up, voice commands continue to work offline.
You do not necessarily need a powerful server. A Raspberry Pi 5 handles smaller Whisper models well for quick commands. For larger models or many satellites, a small Intel N100/N95 mini PC provides snappier response.
To limit echo and feedback, keep speaker volume only as high as needed, aim speakers toward you, and prefer mic arrays for noisy rooms. Some ESP32 devices support basic echo suppression, but placement and volume control are your best tools.
Satellites can coexist with existing smart speakers. Many people keep general-purpose speakers for music and use satellites for home control. To avoid false wakes, choose different wake words or separate rooms.
Troubleshooting checklist if things feel off: if you hear the wake sound but commands don’t execute, verify the satellite is bound to the correct Assist pipeline. If TTS is silent, confirm the speaker wiring and the selected audio output in ESPHome. If recognition errors increase in one room, move the mic 15–30 cm away from walls and reduce the satellite’s input gain by a notch.
Once you trust the core, polish the experience. Add a soft chime on wake and a brief LED animation on success. Route timers and reminders to the nearest satellite so feedback comes from the room you’re in. For family comfort, publish a short ‘What can I say?’ card on a tablet or a printout in the kitchen—five or six example commands are enough to form the habit.
Finally, remember that the magic of voice is not in complexity but in immediacy. Keep commands short and consistent, place satellites where they can hear you easily, and iterate with your household’s feedback. With a handful of ESP32 satellites and a tuned local pipeline, you’ll get the speed and privacy of an on-premises assistant and the convenience of voice in every room.