🖊️ XiaoXi

Open source ESP32 voice + vision hardware for AI Agent platforms.
Dumb terminal + Agent brain. Compile once, configure via Web page.

⭐ GitHub Download Documentation

Source Files

1.06MB

Firmware Size

74%

Free Memory

API Endpoints

💡 Project Introduction

What is XiaoXi and why does it exist?

🎯

Dumb Terminal Design

ESP32 only handles audio I/O and WiFi. All intelligence lives on the Agent backend. Compile firmware once, change everything via Web config page.

🔄

Switch Backend in Seconds

Change Agent backend address in Web config — no recompilation, no reflashing. Hermes, xiaozhi, or any compatible backend.

🛠️

Full Agent Capabilities

Connect to Hermes for tool calling, smart home, calendar, search, MCP tools — everything an AI Agent can do.

💰

Ultra Low Cost

Hardware BOM from ¥29 (under $4 USD). Open source firmware, open source hardware. No monthly fees.

📱

Pen-Sized Form Factor

Pocket-sized AI voice assistant. Custom PCB designed to fit inside a pen barrel. Also available in desk form.

🔓

Fully Open Source

MIT License. Firmware, hardware schematics, PCB designs, documentation — all open.

⚡ XiaoXi vs XiaoZhi (Original)

Feature	XiaoZhi	XiaoXi
Backend	Hardcoded to official	Configurable, switch freely
Settings	Recompile + reflash	Web page, instant
Switch LLM	Modify firmware	Backend side, ESP32 doesn't know
Add Tools	Modify firmware	Backend side, ESP32 doesn't know
Setup	PC client required	Built-in Web page
HW Cost	~¥50	From ¥29

🏗️ System Architecture

ESP32 = dumb terminal. Agent backend = brain. Built on ESP-IDF 5.5, 62 source files, 1.06MB firmware, 74% free memory.

graph LR
  subgraph Device["ESP32 Device"]
    MIC["Microphone\nI2S INMP441"]
    SPK["Speaker\nI2S MAX98357"]
    BTN["Button / Wake Word"]
    CODEC["Audio Codec\nOpus Encode/Decode"]
    WIFI["WiFi Manager"]
    WEB["Web Config Page\nAP Hotspot 192.168.4.1"]
    CAM["Camera OV2640\nVision versions"]
    SENS["Sensor Module\nUltrasonic / mmWave\nLidar (reserved)"]
  end

  subgraph NET["Network"]
    WIFI2["WiFi / Hotspot\nHTTP + WebSocket"]
  end

  subgraph Backend["Agent Backend - Brain"]
    ASR["ASR\nWhisper / SenseVoice"]
    LLM["LLM\nDeepSeek / Qwen\nClaude / GPT / Local"]
    TTS["TTS\nEdge TTS / GPT-SoVITS\nOpenAI TTS"]
    CTX["Context Manager\nMulti-turn Memory"]
    TOOLS["Tool Calling / MCP\nWeather Search SmartHome\nCalendar Custom Tools"]
    VISION["Vision\nGPT-4o / Qwen-VL"]
    ADMIN["Web Admin\nPersona API Key\nVoice History"]
  end

  MIC --> CODEC
  CODEC --> WIFI
  BTN --> CODEC
  SENS --> CODEC
  CAM --> CODEC
  WIFI --> WIFI2
  WIFI2 --> ASR
  ASR --> LLM
  LLM --> TTS
  LLM --> CTX
  LLM --> TOOLS
  CAM --> VISION
  TTS --> WIFI2
  WIFI2 --> CODEC
  CODEC --> SPK

📋 Detailed Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                           本地AI服务器 (4060 8G)                              │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐      │
│  │  ASR 服务    │  │  Chat 服务   │  │  TTS 服务    │  │ Vision 服务  │      │
│  │  语音识别    │  │  对话推理    │  │  语音合成    │  │  场景理解    │      │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘      │
│         │                │                │                │               │
│         └────────────────┴────────────────┴────────────────┘               │
│                                    │                                        │
│                           OpenAI兼容HTTP API                                │
└────────────────────────────────────────┬────────────────────────────────────┘
                                         │
                                    WiFi 网络
                                         │
┌────────────────────────────────────────┴────────────────────────────────────┐
│                           ESP32-S3 哑巴终端 (ESP-IDF 5.5)                    │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │  ESP-SR      │  │  I2S 音频     │  │  OV2640      │  │  传感器模块   │  │
│  │  唤醒词检测   │  │  录音/播放    │  │  摄像头      │  │  超声波/雷达  │  │
│  └──────────────┘  └──────────────┘  └──────────────┘  └──────────────┘  │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │  WiFi STA/AP │  │  NVS 存储    │  │  HTTP 客户端  │  │  按钮控制    │  │
│  │  自动切换    │  │  配置管理    │  │  4个API端点   │  │  交互触发    │  │
│  └──────────────┘  └──────────────┘  └──────────────┘  └──────────────┘  │
└─────────────────────────────────────────────────────────────────────────────┘

📱

Web Config Page

ESP32 creates WiFi AP. Phone connects → browser 192.168.4.1 → change Agent address, WiFi, volume, device name. No USB needed.

🔄

Home & Away

Home: ESP32 → WiFi → LAN → Agent. Outside: ESP32 → Phone hotspot → Internet → Agent. Auto switch.

🔄

OTA Updates

Upload new firmware via Web page. No USB cable, no compile tools. Just drag & drop the .bin file.

📦 Five Product Versions

	Pen Basic	Pen Eye	Desk Standard	Desk Eye	Embodied Hexapod Robot
Chip	ESP32-C3	ESP32-CAM	ESP32-S3	S3 Mini	ESP32-S3 + 本地服务器（4060 8G）
Trigger	Button	Button	Wake word + button	Wake word + button	Wake word + button + radar
Camera	❌	✅ OV2640	❌	✅ OV2640	✅ OV2640 + RoboBrain VLM
Screen	❌	❌	✅ OLED	✅ OLED	✅ OLED 表情显示
Motion	❌	❌	❌	❌	✅ 12 servos (hexapod) / 2 servos + 2 motors (wheeled)
Radar	❌	❌	❌	❌	✅ Ultrasonic/mmWave obstacle avoidance
BOM	~¥29	~¥55	~¥55	~¥63	~¥150-250
Price	¥99-149	¥199-299	¥199-249	¥249-349	¥499-999

🔧 Tech Stack

ESP-IDF 5.5 · 62 source files · 1.06MB firmware · 74% free memory

🔧

ESP32 Firmware

ESP-IDF 5.5 development framework
C++ OOP architecture
62 source files modular design
Firmware: 1.06MB, 74% free memory
WiFi STA/AP auto switch
NVS persistent config

🎤

Audio Processing

ESP-SR wake word engine
I2S digital audio interface
VAD voice activity detection
Real-time record & playback
Noise suppression
Low-latency audio stream

🤖

AI Services (4 Endpoints)

POST /v1/chat/completions — Chat LLM
POST /v1/audio/transcriptions — ASR
POST /v1/audio/speech — TTS
POST /v1/vision/analyze — Vision
OpenAI compatible HTTP API
Local inference on 4060 8G

👁️

Sensor Module

OV2640 camera module
MPU6050 IMU inertial sensor
Ultrasonic distance sensor
mmWave radar (reserved)
Lidar (reserved)
RoboBrain VLM vision

🎮

Motion Control

LEDC PWM servo driver
DC motor control
JSON action sequences
Obstacle avoidance
Posture control interface
Kinematics solver

📡

Communication

HTTP RESTful API
OpenAI compatible format
JSON serialization
WebSocket (planned)
MQTT IoT (planned)
BLE low power (planned)

🗺️ Development Roadmap

Phase 1 hearing complete · Phase 2 vision in progress · Phase 3 motion planned

✅ Phase 1 Complete — 100%

Hearing Intelligence

Complete voice interaction loop — from wake word to response

✓ WiFi STA/AP auto switch

✓ NVS config persistence

✓ HTTP client (4 endpoints)

✓ ESP-SR wake word "你好小鑫"

✓ VAD voice activity detection

✓ Button interaction

✓ ASR speech-to-text

✓ Chat LLM inference

✓ TTS text-to-speech

✓ I2S audio playback

🔄 In Progress — API Ready

Visual Perception

Let XiaoXi 'see' and understand the surrounding environment

○ ESP32-CAM photo capture

○ RoboBrain VLM scene understanding

○ Object recognition & description

○ Face detection & recognition

○ Text OCR recognition

○ Visual Q&A interaction

🔮 Planned — API Ready

Motion Control

Give XiaoXi the ability to act in the physical world

○ Servo/motor precision control

○ JSON action sequence orchestration

○ Ultrasonic obstacle avoidance

○ IMU posture awareness

○ Autonomous movement

○ Robot arm operation (future)

📦 Product Line

Five versions for different use cases

Model	Form Factor	Chip	Camera	Motion Control	Use Case
C3 Pen Edition	Pen portable	ESP32-C3	—	—	Portable voice assistant
S3 Standard	Desktop terminal	ESP32-S3	✓	✓	Smart home hub
CAM Vision Edition	Camera terminal	ESP32-S3	✓ OV2640	✓	Security / Visual AI
S3Mini Mini Edition	Mini terminal	ESP32-S3	—	—	Low-cost voice interaction
Embodied Hexapod Robot	Hexapod/Wheeled	ESP32-S3 + 本地服务器（4060 8G）	✓ OV2640 + RoboBrain VLM	✓	视觉感知 + 运动规划 + 避障导航 + RoboBrain具身智能

🔌 API Endpoints

4 independent OpenAI-compatible HTTP endpoints

POST

/v1/chat/completions

Chat inference — OpenAI compatible format

POST

/v1/audio/transcriptions

Speech recognition — ASR to text

POST

/v1/audio/speech

Speech synthesis — TTS generate audio

POST

/v1/vision/analyze

Vision analysis — Image scene understanding

⬇️ Downloads

Firmware, schematics, PCB files, 3D models

📦

Firmware (.bin)

Coming soon — Pre-compiled firmware for each version.

🔧

PCB Schematics

Coming soon — KiCad / Altium source files + Gerber for JLCPCB.

🖨️

3D Models (STL)

Coming soon — 3D printable enclosure for each version.

📋

BOM List

Coming soon — Complete bill of materials with purchase links.

🔧 Parts & Tools

What you need to build XiaoXi

💻

Firmware Development

ESP-IDF v5.5 — Espressif official SDK
VS Code + ESP-IDF Plugin — Recommended IDE
Python 3.10+ — Build system

🔌

Key Components

ESP32-C3 / S3 — Main controller
INMP441 — I2S digital microphone
MAX98357A — I2S audio amplifier
OV2640 — 2MP camera (vision versions)

📡

Agent Backend

Hermes Agent — Recommended, full-featured
xiaozhi-server — Docker self-hosted
xiaozhi.me — Official free service
Any compatible WebSocket Agent backend

📖 Documentation

Guides, references, and technical docs

📄

Firmware Code Analysis

Deep dive into xiaozhi-esp32 firmware architecture — audio pipeline, wake word engine, WebSocket protocol, OTA system.

English · 中文

📦

Product Line Definition

Four versions — BOM cost, pricing, component list, market comparison, data flow.

English · 中文

🏗️

Architecture Diagram

System architecture — ESP32 device layer, network layer, Agent backend layer.

English · 中文 (HTML)

🚀 Getting Started

Quick start guide

graph TD
  A["1. Get Hardware"] --> B["2. Flash Firmware"]
  B --> C["3. Power On"]
  C --> D["4. Connect to AP Hotspot"]
  D --> E["5. Configure WiFi + Agent Address"]
  E --> F["6. Talk to XiaoXi!"]

  style A fill:#1e293b,stroke:#22d3ee,color:#e2e8f0
  style B fill:#1e293b,stroke:#22d3ee,color:#e2e8f0
  style C fill:#1e293b,stroke:#22d3ee,color:#e2e8f0
  style D fill:#1e293b,stroke:#22d3ee,color:#e2e8f0
  style E fill:#1e293b,stroke:#22d3ee,color:#e2e8f0
  style F fill:#065f46,stroke:#34d399,color:#e2e8f0

①

Get Hardware

Order an ESP32-S3 or ESP32-C3 dev board + INMP441 mic + MAX98357A amp + speaker. Total under ¥50 from Taobao.

②

Flash Firmware

Download pre-built .bin from this site (coming soon), or compile from source with ESP-IDF v5.5. Flash via USB.

③

Configure

Power on → connect to XiaoXi AP hotspot → open 192.168.4.1 → set your WiFi and Agent backend address.

④

Set Up Backend

Install Hermes Agent on your PC, or use xiaozhi-server Docker, or connect to xiaozhi.me official server.

⑤

Talk!

Press button (pen version) or say wake word (desk version). Ask anything. XiaoXi replies in ~3 seconds.

⑥

Customize

Change LLM, TTS voice, persona prompt, tools — all from the Agent backend. ESP32 firmware stays the same.

📬 Links & Contact

Join us, contribute, or just say hi

🐙

GitHub Repository

R2129487/hermes-xiaoxi
Star ⭐ · Issues · Pull Requests

🔗

Related Projects

xiaozhi-esp32 — Original firmware (27k⭐)
xiaozhi-esp32-server — Backend server (9.7k⭐)
Hermes Agent — AI Agent platform

🤝

Contributors Welcome

Looking for:
• Hardware / PCB designers
• ESP32 firmware engineers
• Frontend developers
• Documentation writers