🖊️ XiaoXi

Open source ESP32 voice + vision hardware for AI Agent platforms.
Dumb terminal + Agent brain. Compile once, configure via Web page.

62
Source Files
1.06MB
Firmware Size
74%
Free Memory
4
API Endpoints

💡 Project Introduction

What is XiaoXi and why does it exist?

🎯

Dumb Terminal Design

ESP32 only handles audio I/O and WiFi. All intelligence lives on the Agent backend. Compile firmware once, change everything via Web config page.

🔄

Switch Backend in Seconds

Change Agent backend address in Web config — no recompilation, no reflashing. Hermes, xiaozhi, or any compatible backend.

🛠️

Full Agent Capabilities

Connect to Hermes for tool calling, smart home, calendar, search, MCP tools — everything an AI Agent can do.

💰

Ultra Low Cost

Hardware BOM from ¥29 (under $4 USD). Open source firmware, open source hardware. No monthly fees.

📱

Pen-Sized Form Factor

Pocket-sized AI voice assistant. Custom PCB designed to fit inside a pen barrel. Also available in desk form.

🔓

Fully Open Source

MIT License. Firmware, hardware schematics, PCB designs, documentation — all open.

⚡ XiaoXi vs XiaoZhi (Original)

Feature XiaoZhi XiaoXi
BackendHardcoded to officialConfigurable, switch freely
SettingsRecompile + reflashWeb page, instant
Switch LLMModify firmwareBackend side, ESP32 doesn't know
Add ToolsModify firmwareBackend side, ESP32 doesn't know
SetupPC client requiredBuilt-in Web page
HW Cost~¥50From ¥29

🏗️ System Architecture

ESP32 = dumb terminal. Agent backend = brain. Built on ESP-IDF 5.5, 62 source files, 1.06MB firmware, 74% free memory.

graph LR
  subgraph Device["ESP32 Device"]
    MIC["Microphone\nI2S INMP441"]
    SPK["Speaker\nI2S MAX98357"]
    BTN["Button / Wake Word"]
    CODEC["Audio Codec\nOpus Encode/Decode"]
    WIFI["WiFi Manager"]
    WEB["Web Config Page\nAP Hotspot 192.168.4.1"]
    CAM["Camera OV2640\nVision versions"]
    SENS["Sensor Module\nUltrasonic / mmWave\nLidar (reserved)"]
  end

  subgraph NET["Network"]
    WIFI2["WiFi / Hotspot\nHTTP + WebSocket"]
  end

  subgraph Backend["Agent Backend - Brain"]
    ASR["ASR\nWhisper / SenseVoice"]
    LLM["LLM\nDeepSeek / Qwen\nClaude / GPT / Local"]
    TTS["TTS\nEdge TTS / GPT-SoVITS\nOpenAI TTS"]
    CTX["Context Manager\nMulti-turn Memory"]
    TOOLS["Tool Calling / MCP\nWeather Search SmartHome\nCalendar Custom Tools"]
    VISION["Vision\nGPT-4o / Qwen-VL"]
    ADMIN["Web Admin\nPersona API Key\nVoice History"]
  end

  MIC --> CODEC
  CODEC --> WIFI
  BTN --> CODEC
  SENS --> CODEC
  CAM --> CODEC
  WIFI --> WIFI2
  WIFI2 --> ASR
  ASR --> LLM
  LLM --> TTS
  LLM --> CTX
  LLM --> TOOLS
  CAM --> VISION
  TTS --> WIFI2
  WIFI2 --> CODEC
  CODEC --> SPK
      

📋 Detailed Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                           本地AI服务器 (4060 8G)                              │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐      │
│  │  ASR 服务    │  │  Chat 服务   │  │  TTS 服务    │  │ Vision 服务  │      │
│  │  语音识别    │  │  对话推理    │  │  语音合成    │  │  场景理解    │      │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘      │
│         │                │                │                │               │
│         └────────────────┴────────────────┴────────────────┘               │
│                                    │                                        │
│                           OpenAI兼容HTTP API                                │
└────────────────────────────────────────┬────────────────────────────────────┘
                                         │
                                    WiFi 网络
                                         │
┌────────────────────────────────────────┴────────────────────────────────────┐
│                           ESP32-S3 哑巴终端 (ESP-IDF 5.5)                    │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │  ESP-SR      │  │  I2S 音频     │  │  OV2640      │  │  传感器模块   │  │
│  │  唤醒词检测   │  │  录音/播放    │  │  摄像头      │  │  超声波/雷达  │  │
│  └──────────────┘  └──────────────┘  └──────────────┘  └──────────────┘  │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │  WiFi STA/AP │  │  NVS 存储    │  │  HTTP 客户端  │  │  按钮控制    │  │
│  │  自动切换    │  │  配置管理    │  │  4个API端点   │  │  交互触发    │  │
│  └──────────────┘  └──────────────┘  └──────────────┘  └──────────────┘  │
└─────────────────────────────────────────────────────────────────────────────┘
        
📱

Web Config Page

ESP32 creates WiFi AP. Phone connects → browser 192.168.4.1 → change Agent address, WiFi, volume, device name. No USB needed.

🔄

Home & Away

Home: ESP32 → WiFi → LAN → Agent. Outside: ESP32 → Phone hotspot → Internet → Agent. Auto switch.

🔄

OTA Updates

Upload new firmware via Web page. No USB cable, no compile tools. Just drag & drop the .bin file.

📦 Five Product Versions

Pen BasicPen EyeDesk StandardDesk EyeEmbodied Hexapod Robot
ChipESP32-C3ESP32-CAMESP32-S3S3 MiniESP32-S3 + 本地服务器(4060 8G)
TriggerButtonButtonWake word + buttonWake word + buttonWake word + button + radar
Camera✅ OV2640✅ OV2640✅ OV2640 + RoboBrain VLM
Screen✅ OLED✅ OLED✅ OLED 表情显示
Motion✅ 12 servos (hexapod) / 2 servos + 2 motors (wheeled)
Radar✅ Ultrasonic/mmWave obstacle avoidance
BOM~¥29~¥55~¥55~¥63~¥150-250
Price¥99-149¥199-299¥199-249¥249-349¥499-999

🔧 Tech Stack

ESP-IDF 5.5 · 62 source files · 1.06MB firmware · 74% free memory

🔧

ESP32 Firmware

ESP-IDF 5.5 development framework
C++ OOP architecture
62 source files modular design
Firmware: 1.06MB, 74% free memory
WiFi STA/AP auto switch
NVS persistent config

🎤

Audio Processing

ESP-SR wake word engine
I2S digital audio interface
VAD voice activity detection
Real-time record & playback
Noise suppression
Low-latency audio stream

🤖

AI Services (4 Endpoints)

POST /v1/chat/completions — Chat LLM
POST /v1/audio/transcriptions — ASR
POST /v1/audio/speech — TTS
POST /v1/vision/analyze — Vision
OpenAI compatible HTTP API
Local inference on 4060 8G

👁️

Sensor Module

OV2640 camera module
MPU6050 IMU inertial sensor
Ultrasonic distance sensor
mmWave radar (reserved)
Lidar (reserved)
RoboBrain VLM vision

🎮

Motion Control

LEDC PWM servo driver
DC motor control
JSON action sequences
Obstacle avoidance
Posture control interface
Kinematics solver

📡

Communication

HTTP RESTful API
OpenAI compatible format
JSON serialization
WebSocket (planned)
MQTT IoT (planned)
BLE low power (planned)

🗺️ Development Roadmap

Phase 1 hearing complete · Phase 2 vision in progress · Phase 3 motion planned

✅ Phase 1 Complete — 100%

Hearing Intelligence

Complete voice interaction loop — from wake word to response

✓ WiFi STA/AP auto switch
✓ NVS config persistence
✓ HTTP client (4 endpoints)
✓ ESP-SR wake word "你好小鑫"
✓ VAD voice activity detection
✓ Button interaction
✓ ASR speech-to-text
✓ Chat LLM inference
✓ TTS text-to-speech
✓ I2S audio playback
🔄 In Progress — API Ready

Visual Perception

Let XiaoXi 'see' and understand the surrounding environment

○ ESP32-CAM photo capture
○ RoboBrain VLM scene understanding
○ Object recognition & description
○ Face detection & recognition
○ Text OCR recognition
○ Visual Q&A interaction
🔮 Planned — API Ready

Motion Control

Give XiaoXi the ability to act in the physical world

○ Servo/motor precision control
○ JSON action sequence orchestration
○ Ultrasonic obstacle avoidance
○ IMU posture awareness
○ Autonomous movement
○ Robot arm operation (future)

📦 Product Line

Five versions for different use cases

Model Form Factor Chip Camera Motion Control Use Case
C3 Pen Edition Pen portable ESP32-C3 Portable voice assistant
S3 Standard Desktop terminal ESP32-S3 Smart home hub
CAM Vision Edition Camera terminal ESP32-S3 ✓ OV2640 Security / Visual AI
S3Mini Mini Edition Mini terminal ESP32-S3 Low-cost voice interaction
Embodied Hexapod Robot Hexapod/Wheeled ESP32-S3 + 本地服务器(4060 8G) ✓ OV2640 + RoboBrain VLM 视觉感知 + 运动规划 + 避障导航 + RoboBrain具身智能

🔌 API Endpoints

4 independent OpenAI-compatible HTTP endpoints

POST

/v1/chat/completions

Chat inference — OpenAI compatible format

POST

/v1/audio/transcriptions

Speech recognition — ASR to text

POST

/v1/audio/speech

Speech synthesis — TTS generate audio

POST

/v1/vision/analyze

Vision analysis — Image scene understanding

⬇️ Downloads

Firmware, schematics, PCB files, 3D models

📦

Firmware (.bin)

Coming soon — Pre-compiled firmware for each version.

🔧

PCB Schematics

Coming soon — KiCad / Altium source files + Gerber for JLCPCB.

🖨️

3D Models (STL)

Coming soon — 3D printable enclosure for each version.

📋

BOM List

Coming soon — Complete bill of materials with purchase links.

🔧 Parts & Tools

What you need to build XiaoXi

💻

Firmware Development

ESP-IDF v5.5 — Espressif official SDK
VS Code + ESP-IDF Plugin — Recommended IDE
Python 3.10+ — Build system

🔌

Key Components

ESP32-C3 / S3 — Main controller
INMP441 — I2S digital microphone
MAX98357A — I2S audio amplifier
OV2640 — 2MP camera (vision versions)

📡

Agent Backend

Hermes Agent — Recommended, full-featured
xiaozhi-server — Docker self-hosted
xiaozhi.me — Official free service
Any compatible WebSocket Agent backend

📖 Documentation

Guides, references, and technical docs

📄

Firmware Code Analysis

Deep dive into xiaozhi-esp32 firmware architecture — audio pipeline, wake word engine, WebSocket protocol, OTA system.

English · 中文

📦

Product Line Definition

Four versions — BOM cost, pricing, component list, market comparison, data flow.

English · 中文

🏗️

Architecture Diagram

System architecture — ESP32 device layer, network layer, Agent backend layer.

English · 中文 (HTML)

🚀 Getting Started

Quick start guide

graph TD
  A["1. Get Hardware"] --> B["2. Flash Firmware"]
  B --> C["3. Power On"]
  C --> D["4. Connect to AP Hotspot"]
  D --> E["5. Configure WiFi + Agent Address"]
  E --> F["6. Talk to XiaoXi!"]

  style A fill:#1e293b,stroke:#22d3ee,color:#e2e8f0
  style B fill:#1e293b,stroke:#22d3ee,color:#e2e8f0
  style C fill:#1e293b,stroke:#22d3ee,color:#e2e8f0
  style D fill:#1e293b,stroke:#22d3ee,color:#e2e8f0
  style E fill:#1e293b,stroke:#22d3ee,color:#e2e8f0
  style F fill:#065f46,stroke:#34d399,color:#e2e8f0
      

Get Hardware

Order an ESP32-S3 or ESP32-C3 dev board + INMP441 mic + MAX98357A amp + speaker. Total under ¥50 from Taobao.

Flash Firmware

Download pre-built .bin from this site (coming soon), or compile from source with ESP-IDF v5.5. Flash via USB.

Configure

Power on → connect to XiaoXi AP hotspot → open 192.168.4.1 → set your WiFi and Agent backend address.

Set Up Backend

Install Hermes Agent on your PC, or use xiaozhi-server Docker, or connect to xiaozhi.me official server.

Talk!

Press button (pen version) or say wake word (desk version). Ask anything. XiaoXi replies in ~3 seconds.

Customize

Change LLM, TTS voice, persona prompt, tools — all from the Agent backend. ESP32 firmware stays the same.

📬 Links & Contact

Join us, contribute, or just say hi

🐙

GitHub Repository

R2129487/hermes-xiaoxi
Star ⭐ · Issues · Pull Requests

🔗

Related Projects

xiaozhi-esp32Original firmware (27k⭐)
xiaozhi-esp32-serverBackend server (9.7k⭐)
Hermes AgentAI Agent platform

🤝

Contributors Welcome

Looking for:
• Hardware / PCB designers
• ESP32 firmware engineers
• Frontend developers
• Documentation writers