Open source ESP32 voice + vision hardware for AI Agent platforms.
Dumb terminal + Agent brain. Compile once, configure via Web page.
What is XiaoXi and why does it exist?
ESP32 only handles audio I/O and WiFi. All intelligence lives on the Agent backend. Compile firmware once, change everything via Web config page.
Change Agent backend address in Web config — no recompilation, no reflashing. Hermes, xiaozhi, or any compatible backend.
Connect to Hermes for tool calling, smart home, calendar, search, MCP tools — everything an AI Agent can do.
Hardware BOM from ¥29 (under $4 USD). Open source firmware, open source hardware. No monthly fees.
Pocket-sized AI voice assistant. Custom PCB designed to fit inside a pen barrel. Also available in desk form.
MIT License. Firmware, hardware schematics, PCB designs, documentation — all open.
| Feature | XiaoZhi | XiaoXi |
|---|---|---|
| Backend | Hardcoded to official | Configurable, switch freely |
| Settings | Recompile + reflash | Web page, instant |
| Switch LLM | Modify firmware | Backend side, ESP32 doesn't know |
| Add Tools | Modify firmware | Backend side, ESP32 doesn't know |
| Setup | PC client required | Built-in Web page |
| HW Cost | ~¥50 | From ¥29 |
ESP32 = dumb terminal. Agent backend = brain. Built on ESP-IDF 5.5, 62 source files, 1.06MB firmware, 74% free memory.
graph LR
subgraph Device["ESP32 Device"]
MIC["Microphone\nI2S INMP441"]
SPK["Speaker\nI2S MAX98357"]
BTN["Button / Wake Word"]
CODEC["Audio Codec\nOpus Encode/Decode"]
WIFI["WiFi Manager"]
WEB["Web Config Page\nAP Hotspot 192.168.4.1"]
CAM["Camera OV2640\nVision versions"]
SENS["Sensor Module\nUltrasonic / mmWave\nLidar (reserved)"]
end
subgraph NET["Network"]
WIFI2["WiFi / Hotspot\nHTTP + WebSocket"]
end
subgraph Backend["Agent Backend - Brain"]
ASR["ASR\nWhisper / SenseVoice"]
LLM["LLM\nDeepSeek / Qwen\nClaude / GPT / Local"]
TTS["TTS\nEdge TTS / GPT-SoVITS\nOpenAI TTS"]
CTX["Context Manager\nMulti-turn Memory"]
TOOLS["Tool Calling / MCP\nWeather Search SmartHome\nCalendar Custom Tools"]
VISION["Vision\nGPT-4o / Qwen-VL"]
ADMIN["Web Admin\nPersona API Key\nVoice History"]
end
MIC --> CODEC
CODEC --> WIFI
BTN --> CODEC
SENS --> CODEC
CAM --> CODEC
WIFI --> WIFI2
WIFI2 --> ASR
ASR --> LLM
LLM --> TTS
LLM --> CTX
LLM --> TOOLS
CAM --> VISION
TTS --> WIFI2
WIFI2 --> CODEC
CODEC --> SPK
┌─────────────────────────────────────────────────────────────────────────────┐
│ 本地AI服务器 (4060 8G) │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ ASR 服务 │ │ Chat 服务 │ │ TTS 服务 │ │ Vision 服务 │ │
│ │ 语音识别 │ │ 对话推理 │ │ 语音合成 │ │ 场景理解 │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │ │ │
│ └────────────────┴────────────────┴────────────────┘ │
│ │ │
│ OpenAI兼容HTTP API │
└────────────────────────────────────────┬────────────────────────────────────┘
│
WiFi 网络
│
┌────────────────────────────────────────┴────────────────────────────────────┐
│ ESP32-S3 哑巴终端 (ESP-IDF 5.5) │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ ESP-SR │ │ I2S 音频 │ │ OV2640 │ │ 传感器模块 │ │
│ │ 唤醒词检测 │ │ 录音/播放 │ │ 摄像头 │ │ 超声波/雷达 │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ WiFi STA/AP │ │ NVS 存储 │ │ HTTP 客户端 │ │ 按钮控制 │ │
│ │ 自动切换 │ │ 配置管理 │ │ 4个API端点 │ │ 交互触发 │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
ESP32 creates WiFi AP. Phone connects → browser 192.168.4.1 → change Agent address, WiFi, volume, device name. No USB needed.
Home: ESP32 → WiFi → LAN → Agent. Outside: ESP32 → Phone hotspot → Internet → Agent. Auto switch.
Upload new firmware via Web page. No USB cable, no compile tools. Just drag & drop the .bin file.
| Pen Basic | Pen Eye | Desk Standard | Desk Eye | Embodied Hexapod Robot | |
|---|---|---|---|---|---|
| Chip | ESP32-C3 | ESP32-CAM | ESP32-S3 | S3 Mini | ESP32-S3 + 本地服务器(4060 8G) |
| Trigger | Button | Button | Wake word + button | Wake word + button | Wake word + button + radar |
| Camera | ❌ | ✅ OV2640 | ❌ | ✅ OV2640 | ✅ OV2640 + RoboBrain VLM |
| Screen | ❌ | ❌ | ✅ OLED | ✅ OLED | ✅ OLED 表情显示 |
| Motion | ❌ | ❌ | ❌ | ❌ | ✅ 12 servos (hexapod) / 2 servos + 2 motors (wheeled) |
| Radar | ❌ | ❌ | ❌ | ❌ | ✅ Ultrasonic/mmWave obstacle avoidance |
| BOM | ~¥29 | ~¥55 | ~¥55 | ~¥63 | ~¥150-250 |
| Price | ¥99-149 | ¥199-299 | ¥199-249 | ¥249-349 | ¥499-999 |
ESP-IDF 5.5 · 62 source files · 1.06MB firmware · 74% free memory
ESP-IDF 5.5 development framework
C++ OOP architecture
62 source files modular design
Firmware: 1.06MB, 74% free memory
WiFi STA/AP auto switch
NVS persistent config
ESP-SR wake word engine
I2S digital audio interface
VAD voice activity detection
Real-time record & playback
Noise suppression
Low-latency audio stream
POST /v1/chat/completions — Chat LLM
POST /v1/audio/transcriptions — ASR
POST /v1/audio/speech — TTS
POST /v1/vision/analyze — Vision
OpenAI compatible HTTP API
Local inference on 4060 8G
OV2640 camera module
MPU6050 IMU inertial sensor
Ultrasonic distance sensor
mmWave radar (reserved)
Lidar (reserved)
RoboBrain VLM vision
LEDC PWM servo driver
DC motor control
JSON action sequences
Obstacle avoidance
Posture control interface
Kinematics solver
HTTP RESTful API
OpenAI compatible format
JSON serialization
WebSocket (planned)
MQTT IoT (planned)
BLE low power (planned)
Phase 1 hearing complete · Phase 2 vision in progress · Phase 3 motion planned
Complete voice interaction loop — from wake word to response
Let XiaoXi 'see' and understand the surrounding environment
Give XiaoXi the ability to act in the physical world
Five versions for different use cases
| Model | Form Factor | Chip | Camera | Motion Control | Use Case |
|---|---|---|---|---|---|
| C3 Pen Edition | Pen portable | ESP32-C3 | — | — | Portable voice assistant |
| S3 Standard | Desktop terminal | ESP32-S3 | ✓ | ✓ | Smart home hub |
| CAM Vision Edition | Camera terminal | ESP32-S3 | ✓ OV2640 | ✓ | Security / Visual AI |
| S3Mini Mini Edition | Mini terminal | ESP32-S3 | — | — | Low-cost voice interaction |
| Embodied Hexapod Robot | Hexapod/Wheeled | ESP32-S3 + 本地服务器(4060 8G) | ✓ OV2640 + RoboBrain VLM | ✓ | 视觉感知 + 运动规划 + 避障导航 + RoboBrain具身智能 |
4 independent OpenAI-compatible HTTP endpoints
Chat inference — OpenAI compatible format
Speech recognition — ASR to text
Speech synthesis — TTS generate audio
Vision analysis — Image scene understanding
Firmware, schematics, PCB files, 3D models
Coming soon — Pre-compiled firmware for each version.
Coming soon — KiCad / Altium source files + Gerber for JLCPCB.
Coming soon — 3D printable enclosure for each version.
Coming soon — Complete bill of materials with purchase links.
What you need to build XiaoXi
ESP-IDF v5.5 — Espressif official SDK
VS Code + ESP-IDF Plugin — Recommended IDE
Python 3.10+ — Build system
ESP32-C3 / S3 — Main controller
INMP441 — I2S digital microphone
MAX98357A — I2S audio amplifier
OV2640 — 2MP camera (vision versions)
Hermes Agent — Recommended, full-featured
xiaozhi-server — Docker self-hosted
xiaozhi.me — Official free service
Any compatible WebSocket Agent backend
Guides, references, and technical docs
Deep dive into xiaozhi-esp32 firmware architecture — audio pipeline, wake word engine, WebSocket protocol, OTA system.
Four versions — BOM cost, pricing, component list, market comparison, data flow.
Quick start guide
graph TD
A["1. Get Hardware"] --> B["2. Flash Firmware"]
B --> C["3. Power On"]
C --> D["4. Connect to AP Hotspot"]
D --> E["5. Configure WiFi + Agent Address"]
E --> F["6. Talk to XiaoXi!"]
style A fill:#1e293b,stroke:#22d3ee,color:#e2e8f0
style B fill:#1e293b,stroke:#22d3ee,color:#e2e8f0
style C fill:#1e293b,stroke:#22d3ee,color:#e2e8f0
style D fill:#1e293b,stroke:#22d3ee,color:#e2e8f0
style E fill:#1e293b,stroke:#22d3ee,color:#e2e8f0
style F fill:#065f46,stroke:#34d399,color:#e2e8f0
Order an ESP32-S3 or ESP32-C3 dev board + INMP441 mic + MAX98357A amp + speaker. Total under ¥50 from Taobao.
Download pre-built .bin from this site (coming soon), or compile from source with ESP-IDF v5.5. Flash via USB.
Power on → connect to XiaoXi AP hotspot → open 192.168.4.1 → set your WiFi and Agent backend address.
Install Hermes Agent on your PC, or use xiaozhi-server Docker, or connect to xiaozhi.me official server.
Press button (pen version) or say wake word (desk version). Ask anything. XiaoXi replies in ~3 seconds.
Change LLM, TTS voice, persona prompt, tools — all from the Agent backend. ESP32 firmware stays the same.
Join us, contribute, or just say hi
xiaozhi-esp32 — Original firmware (27k⭐)
xiaozhi-esp32-server — Backend server (9.7k⭐)
Hermes Agent — AI Agent platform
Looking for:
• Hardware / PCB designers
• ESP32 firmware engineers
• Frontend developers
• Documentation writers