A framework for building a voice-interactive AI robot with StackChan (M5StackChan) and AIAvatarKit.
Turn your StackChan into a cute, smart buddy that talks with you, shows emotions, moves around, and even sees the world around it.
- 🗣️ Voice conversation — Ultra-low-latency streaming for smooth, natural conversations. Push-to-talk is also supported.
- 🧱 Swappable building blocks — STT, LLM, and TTS are all pluggable on the server side, so your robot can keep up with the latest and greatest.
- 🥰 Expressive avatar — Show a character face with multiple expressions, automatic blinking, and mouth animation synced to speech (lip sync).
- 👀 Vision — When the server needs visual context, StackChan snaps a photo and sends it along for multimodal conversations.
- 🦞 AI agent integration — Hook into agent harness systems like OpenClaw to give your robot real-world skills that grow over time.
AIAvatarStackChan runs as a pair:
- AIAvatarKit server
- StackChan firmware built with this framework
The server resources are in examples/server. Start by navigating there.
cd examples/server
Install requirements.
pip install -r requirements.txt
Set your OpenAI API key.
export OPENAI_API_KEY=sk-...
Start the server.
python -m uvicorn run:app --host 0.0.0.0 --port 8000
Note: If you want your robot to speak Japanese, make sure VOICEVOX is running before you start the server.
For server-side details, see uezo/aiavatarkit.
Create a PlatformIO project and copy the example into it.
cp -r examples/stackchan/basic/src /path/to/your/project/
cp examples/stackchan/basic/platformio.ini /path/to/your/project/
Copy one of the sample configs as config.json and fill in your Wi-Fi credentials under wifi_networks.
cp config.sample.ja.json config.json # or config.sample.json for English
Put config.json and at least one avatar image (/avatar/neutral.png) on the SD card, then insert it into your StackChan. You can use the sample images in examples/avatar/stackchan to test it quickly.
If you want to run without an SD card, create Config values in main.cpp and register optional built-in assets through ResourceProvider. SD files still take priority when an SD card is mounted.
#include "BuiltinAvatarImages.h"
static aiavatar::Config config;
static aiavatar::ResourceProvider resources;
static aiavatar::AIAvatar avatar;
void setup() {
strlcpy(config.wifiSsid, "your-ssid", sizeof(config.wifiSsid));
strlcpy(config.wifiPass, "your-pass", sizeof(config.wifiPass));
strlcpy(config.wsHost, "192.168.0.10", sizeof(config.wsHost));
// Optional fallback images generated by examples/tools/embed_assets.py.
resources.setBuiltinAssets(aiavatar::kBuiltinAssets, aiavatar::kBuiltinAssetsCount);
if (resources.beginSD(GPIO_NUM_4)) {
resources.loadConfig(config);
}
avatar.begin(config, resources);
}
To embed avatar images in firmware, generate a user-side asset file:
cd /path/to/your/project/src
/path/to/AIAvatarStackChan/examples/tools/embed_assets.py --input /path/to/AIAvatarStackChan/examples/avatar/stackchan
This writes BuiltinAvatarImages.h and .cpp next to your main.cpp, not into the library source tree. Include the generated header from your app:
#include "BuiltinAvatarImages.h"
resources.setBuiltinAssets(aiavatar::kBuiltinAssets, aiavatar::kBuiltinAssetsCount);
Build and upload the firmware. Once StackChan boots and the 🛜 Wi-Fi icon turns green, you're connected to the server — start chatting!
Here are the default controls built into the firmware.
- 🎙️ Toggles mute/unmute. While muted, long-press the screen to use push-to-talk.
- 🛜 Opens the Wi-Fi network picker and lets you toggle the WebSocket connection on/off.
- 🔈 No visible button, but tapping the lower-left corner of the screen cycles through speaker volume levels.
- 👀 Say something like "look at this" and StackChan will automatically snap a photo, send it to the server, and respond based on what it sees.
- 👋 Pet the top of StackChan's body and it will react with a cute response.
- ☝️ Swipe up on the screen to hide the clock and status icons. Swipe down from the top edge to bring them back.
config.json can define the following fields:
- wifi_networks (array): Wi-Fi profiles shown in the Wi-Fi menu. Up to 5 entries
  - name (string): menu display name
  - ssid (string): Wi-Fi SSID. Entries with an empty SSID are ignored
  - pass (string): Wi-Fi password
  - sleep_wifi_mode (string): optional per-network sleep Wi-Fi behavior. sleep or off
- ws_host (string): AIAvatarKit WebSocket server host
- ws_port (number): AIAvatarKit WebSocket server port
- ws_path (string): AIAvatarKit WebSocket server path
- user_id (string): user ID sent when connecting to the WebSocket server
- channel (string): channel name sent when connecting to the WebSocket server
- timezone (string): TZ string used for NTP time configuration
- mic_sample_rate (number): microphone input sample rate
- mic_magnification (number): microphone input gain setting
- mic_buffer_samples (number): microphone samples sent per frame. 1 to 2048
- vad_threshold_db (number): voice activity detection threshold in dB
- playback_queue_depth (number): audio playback queue depth
- rbuf_samples (number): legacy setting. Used only when playback_queue_depth is not set, converted in 512-sample units
- start_threshold (number): number of queued samples required before playback starts
- drain_timeout_ms (number): playback queue drain timeout in ms
- speaker_volume (number): initial speaker volume. 0 to 255
- audio_normalize_target_peak (number): target playback peak used for automatic audio normalization. 0.0 disables normalization, 1.0 targets full scale
- audio_normalize_max_gain (number): maximum gain multiplier applied by automatic audio normalization. Minimum 1.0
- volume_levels (array): volume values cycled by the volume button. 2 to 8 entries, each 0 to 255
- audio_task_stack_size (number): stack size for audio tasks
- audio_task_core (number): CPU core assigned to audio tasks
- ws_task_stack_size (number): stack size for the WebSocket task
- ws_task_core (number): CPU core assigned to the WebSocket task
- ws_reconnect_interval_ms (number): WebSocket reconnect interval in ms
- mic_tx_slow_backoff_ms (number): delay in ms when microphone transmission is congested
- mic_tx_fail_backoff_ms (number): delay in ms after microphone transmission fails
- keepalive_interval_ms (number): keepalive send interval in ms
- display_rotation (number): display rotation setting
- display_brightness (number): display brightness
- sleep_enabled (boolean): whether to enable idle sleep mode
- sleep_timeout_ms (number): idle time in ms before entering sleep
- sleep_display_brightness (number): display brightness while sleeping. 0 to 255
- sleep_wifi_mode (string): default sleep Wi-Fi behavior. sleep keeps Wi-Fi associated with modem sleep; off powers Wi-Fi down and reconnects on wake
- status_overlay_enabled (boolean): whether to show the status overlay
- vision_preview_duration_ms (number): camera preview duration for vision requests in ms. Default: 2000
- accepted_led_color (array): RGB color for the accepted-state LED. Example: [0, 168, 0]
- tool_led_color (array): RGB color for the tool-running LED. Example: [140, 0, 140]
- ptt_max_seconds (number): maximum Push-to-Talk recording length in seconds
- ptt_min_seconds (number): minimum Push-to-Talk recording length sent to the server
- ptt_hold_threshold_ms (number): hold duration in ms required to start Push-to-Talk
- pitch_home (number): StackChan pitch home angle
- stackchan_auto_angle_sync (boolean): whether to synchronize StackChan posture from the physical servo position
- nade_invoke_prompt (string): prompt sent when StackChan touch/nade is detected
- vision_invoke_prompt (string): prompt sent with camera images
- fast_startup (boolean): whether to start the display, mic, and UI first, then defer Wi-Fi, WebSocket, speaker, and heavy image loading so the device becomes interactive sooner
- debug_log (boolean): whether to output debug logs
If stackchan_auto_angle_sync causes sudden servo jumps on your hardware, set it to false.
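The documented ranges can be checked on a PC before a config is copied to the SD card. The following is a minimal host-side sketch, not part of the framework: ConfigDraft and checkRanges are hypothetical names, the default values are placeholders rather than firmware defaults, and how the firmware itself treats out-of-range values is not covered here.

```cpp
#include <string>
#include <vector>

// Hypothetical mirror of a few config.json fields (placeholder defaults).
struct ConfigDraft {
    int wifiNetworkCount = 1;                     // wifi_networks: up to 5 entries
    int micBufferSamples = 512;                   // mic_buffer_samples: 1 to 2048
    int speakerVolume = 128;                      // speaker_volume: 0 to 255
    float audioNormalizeMaxGain = 2.0f;           // audio_normalize_max_gain: minimum 1.0
    std::vector<int> volumeLevels{64, 128, 255};  // volume_levels: 2 to 8 entries, each 0 to 255
};

static bool inRange(int value, int lo, int hi) {
    return value >= lo && value <= hi;
}

// Returns one message per violated field; an empty result means the draft fits
// the documented ranges.
std::vector<std::string> checkRanges(const ConfigDraft& c) {
    std::vector<std::string> errors;
    if (c.wifiNetworkCount > 5)
        errors.push_back("wifi_networks: up to 5 entries");
    if (!inRange(c.micBufferSamples, 1, 2048))
        errors.push_back("mic_buffer_samples: 1 to 2048");
    if (!inRange(c.speakerVolume, 0, 255))
        errors.push_back("speaker_volume: 0 to 255");
    if (c.audioNormalizeMaxGain < 1.0f)
        errors.push_back("audio_normalize_max_gain: minimum 1.0");

    // volume_levels is reported once, whether the count or an entry is off.
    bool levelsOk = c.volumeLevels.size() >= 2 && c.volumeLevels.size() <= 8;
    for (int level : c.volumeLevels) {
        levelsOk = levelsOk && inRange(level, 0, 255);
    }
    if (!levelsOk)
        errors.push_back("volume_levels: 2 to 8 entries, each 0 to 255");
    return errors;
}
```

Filling a ConfigDraft from your parsed JSON and printing the returned messages gives a quick pre-flight check before flashing or copying files.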
Use callbacks on AIAvatar to add behavior without changing the framework internals.
Public user callbacks:
- avatar.onSpeechDetected(aiavatar::SpeechDetectedCallback cb)
  - Signature: void (*)()
  - Called when local microphone audio crosses the VAD threshold while the mic is unmuted, the WebSocket is connected, and the server is not processing. Calls are throttled to about once every 300 ms.
- avatar.onNade(aiavatar::NadeCallback cb)
  - Signature: void (*)()
  - Called after StackChan touch/nade is detected. The built-in nade prompt is still queued before the user callback runs.
- avatar.onStart(aiavatar::TextCallback cb)
  - Signature: void (*)(const char* text)
  - Called when the server starts a response. text is the request text reported by the server when available.
- avatar.onFinal(aiavatar::FinalTextCallback cb)
  - Signature: void (*)(const char* responseText, const char* voiceText)
  - Called when the server sends final text metadata. responseText is the final response text, and voiceText is the text used for voice output when available.
- avatar.onToolCall(aiavatar::ToolCallCallback cb)
  - Signature: void (*)(const char* toolName)
  - Called when the server reports a tool call. Built-in LED/OpenClaw effects run first, then the user callback runs.
- avatar.onAccepted(aiavatar::SimpleCallback cb)
  - Signature: void (*)()
  - Called when the server accepts a user input/request. Built-in playback interruption and accepted LED effects run first.
- avatar.onOverlay(aiavatar::ScreenOverlayCallback cb)
  - Signature: void (*)(LGFX_Sprite* canvas)
  - Called during display rendering so user code can draw on the shared canvas. Built-in visual effects, status overlay, and OpenClaw overlay are drawn before this callback; system UI is drawn after it.
Example:
static void onToolCall(const char* toolName) {
Serial.printf("tool: %s\n", toolName ? toolName : "");
}
static void drawOverlay(LGFX_Sprite* canvas) {
canvas->setTextColor(TFT_WHITE);
canvas->drawString("custom", 8, 8);
}
void setup() {
// ...
avatar.onToolCall(onToolCall);
avatar.onOverlay(drawOverlay);
avatar.begin(config);
}
Keep callbacks short and non-blocking. Long work should be moved to your own task or handled asynchronously.
WebSocketClient, MotionController, and ScreenRenderer also expose lower-level callback setters, but AIAvatar::begin() uses those internally. User code should prefer the AIAvatar callbacks above so it does not replace framework wiring.
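One common way to keep a callback short is to have it only record the event and let loop() do the slow part. The sketch below shows that pattern in plain C++. ToolCallMailbox is a hypothetical helper, not a framework class, and it holds a single slot: if a second event arrives before the first is taken, the newer name replaces the older one. It also assumes the name buffer is not written while it is being read, which holds when callbacks and loop() run on the same task; that is an assumption, not something stated above.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdio>

// Hypothetical single-slot mailbox: the callback posts, loop() takes.
struct ToolCallMailbox {
    std::atomic<bool> pending{false};
    char toolName[64] = "";

    // Cheap enough for a callback: one bounded copy plus a flag store.
    void post(const char* name) {
        std::snprintf(toolName, sizeof(toolName), "%s", name ? name : "");
        pending.store(true, std::memory_order_release);
    }

    // Call from loop(). Returns true once per posted event and copies the
    // recorded name into out (truncated to outSize - 1 characters).
    bool take(char* out, std::size_t outSize) {
        if (!pending.exchange(false, std::memory_order_acquire)) {
            return false;
        }
        std::snprintf(out, outSize, "%s", toolName);
        return true;
    }
};
```

With a static mailbox instance, the onToolCall callback would call post(toolName), and loop() would call take(...) after avatar.update() and start any slow work from there.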
StackChan can speak in response to device-side events instead of waiting for user speech.
Use this when a local sensor, button, timer, or application event should make StackChan start a response. Internally, this is implemented by building a prompt on the device and sending it to the server as an invoke request. The built-in nade flow uses the same mechanism.
Public invoke APIs:
- avatar.websocket().sendInvoke(const char* text): sends a text-only invoke request.
- avatar.websocket().sendInvokeWithImage(const char* text, const char* imageDataUrl): sends a text prompt with an image file URL or data URL.
- avatar.websocket().sendInvokeWithAudio(const int16_t* pcmData, size_t sampleCount): sends recorded PCM audio as an invoke request.
Example:
static aiavatar::AIAvatar avatar;
static bool sensorWasActive = false;
void loop() {
avatar.update();
bool sensorActive = readYourSensor();
if (sensorActive && !sensorWasActive && avatar.isConnected()) {
struct tm ti;
time_t now = time(nullptr);
localtime_r(&now, &ti);
char prompt[512];
snprintf(prompt, sizeof(prompt),
"$A local sensor was triggered. React with one short phrase.\n\n"
"Current date and time: %04d-%02d-%02d %02d:%02d:%02d",
ti.tm_year + 1900, ti.tm_mon + 1, ti.tm_mday,
ti.tm_hour, ti.tm_min, ti.tm_sec);
avatar.websocket().sendInvoke(prompt);
}
sensorWasActive = sensorActive;
delay(1);
}
Keep invoke triggers edge-based or rate-limited. Sending an invoke every loop iteration will flood the server queue.
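Edge detection and rate limiting can be combined in one small helper. The following is a sketch under stated assumptions: InvokeGate is a hypothetical class, not a framework API, and the clock is passed in as a parameter so the logic can be exercised without hardware.

```cpp
#include <cstdint>

// Hypothetical helper: fires on a rising edge, at most once per minIntervalMs.
class InvokeGate {
public:
    explicit InvokeGate(std::uint32_t minIntervalMs)
        : minIntervalMs_(minIntervalMs) {}

    // Pass the current sensor state and a millisecond clock such as millis().
    // Returns true only when the caller should send an invoke.
    bool shouldFire(bool active, std::uint32_t nowMs) {
        const bool risingEdge = active && !wasActive_;
        wasActive_ = active;
        if (!risingEdge) {
            return false;
        }
        // Unsigned subtraction stays correct across the 32-bit clock wraparound.
        if (hasFired_ && (nowMs - lastFireMs_) < minIntervalMs_) {
            return false;
        }
        hasFired_ = true;
        lastFireMs_ = nowMs;
        return true;
    }

private:
    std::uint32_t minIntervalMs_;
    std::uint32_t lastFireMs_ = 0;
    bool hasFired_ = false;
    bool wasActive_ = false;
};
```

In a loop like the example above, the manual sensorWasActive bookkeeping could be replaced by a static InvokeGate and a check such as gate.shouldFire(readYourSensor(), millis()) before calling avatar.websocket().sendInvoke(prompt). The interval value is yours to choose.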
CoreS3 has no physical front buttons, so the framework provides virtual touch areas.
Default actions:
- Virtual Button A: lower-left touch area, volume cycle
- Virtual Button B: no action
- Virtual Button C: no action
For devices with physical buttons, disable virtual buttons and forward clicks yourself:
avatar.systemUI().setVirtualButtonsEnabled(false);
if (M5.BtnA.wasClicked()) {
avatar.systemUI().runButtonAction(aiavatar::ButtonId::A);
}
Available actions:
- ButtonAction::None
- ButtonAction::VolumeCycle
- ButtonAction::Stop
- ButtonAction::WebSocketToggle
- ButtonAction::MicToggle
Sleep mode reduces idle power use by dimming the display and optionally reducing Wi-Fi power after a period without user activity, speech playback, Push-to-Talk, or server processing.
Enable it in config.json:
{
"sleep_enabled": true,
"sleep_timeout_ms": 60000,
"sleep_display_brightness": 32,
"sleep_wifi_mode": "sleep"
}
Or set it directly in firmware code:
config.sleepEnabled = true;
config.sleepTimeoutMs = 60000;
config.sleepDisplayBrightness = 32;
config.sleepWifiMode = aiavatar::SleepWifiMode::Sleep;
Wi-Fi behavior:
- sleep: enables Wi-Fi modem sleep while keeping the connection state. Wake is usually immediate. If the WebSocket was disconnected while sleeping, the framework reconnects it on wake.
- off: disconnects the WebSocket, powers Wi-Fi off, and reconnects to the Wi-Fi profile that was active when sleep started. This saves more power, but wake takes longer and some mobile tethering access points may stop advertising while the device is disconnected.
You can override Wi-Fi sleep behavior per network:
{
"sleep_wifi_mode": "off",
"wifi_networks": [
{
"name": "Home",
"ssid": "home-ssid",
"pass": "home-password",
"sleep_wifi_mode": "off"
},
{
"name": "Mobile",
"ssid": "phone-tethering",
"pass": "mobile-password",
"sleep_wifi_mode": "sleep"
}
]
}
SleepManager is owned by AIAvatar. Once sleep_enabled is true, avatar.begin(config) initializes sleep handling and avatar.update() manages the sleep timer, display brightness, Wi-Fi mode, and WebSocket reconnects. User code does not need to create a SleepManager.
Note: When fast_startup is enabled, sleep waits until deferred startup is complete. If the device cannot connect to Wi-Fi during startup, deferred startup may remain incomplete and sleep mode will not start. This is intentional for now so sleep does not interrupt partially initialized startup work.
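The per-network override described above can be read as a simple fallback rule: a network entry's own sleep_wifi_mode wins, and the top-level value applies otherwise. The sketch below models that rule in plain C++. WifiSleep, NetworkProfile, and effectiveSleepMode are hypothetical stand-ins for illustration; the framework's own enum is aiavatar::SleepWifiMode and its internal resolution code may differ.

```cpp
// Hypothetical stand-in for the two documented modes, "sleep" and "off".
enum class WifiSleep { Sleep, Off };

// Hypothetical view of one wifi_networks entry.
struct NetworkProfile {
    bool hasSleepOverride = false;                // true when the entry sets sleep_wifi_mode
    WifiSleep sleepOverride = WifiSleep::Sleep;   // only meaningful when hasSleepOverride is true
};

// The per-network value wins; otherwise fall back to the top-level default.
WifiSleep effectiveSleepMode(WifiSleep topLevelDefault, const NetworkProfile& active) {
    return active.hasSleepOverride ? active.sleepOverride : topLevelDefault;
}
```

Applied to the JSON example above, the Mobile profile resolves to sleep even though the top-level default is off, which suits tethering access points that may stop advertising while the device is disconnected.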
Built-in activity sources automatically reset the sleep timer:
- Touch handled by SystemUIController
- Push-to-Talk through avatar.startPushToTalk() / avatar.endPushToTalk()
- Volume, microphone, WebSocket, Wi-Fi menu, and stop actions routed through AIAvatar / SystemUIController
- Server response start, tool calls, vision requests, accepted events, and StackChan nade events
- Speech playback while the speaker is playing, plus playback end
For app-specific input that the framework cannot see, call avatar.resetSleepTimer(reason) before or during the action:
if (M5.BtnA.wasClicked()) {
avatar.resetSleepTimer("button A");
avatar.cycleVolume();
}
For long-running app work, keep the main loop non-blocking and reset the sleep timer while that work is active. Track the task state in your app, then call avatar.resetSleepTimer("your work") from loop() at a modest interval:
uint32_t lastSleepResetMs = 0;
void loop() {
avatar.update();
userApp.update();
if (millis() - lastSleepResetMs >= 1000) {
lastSleepResetMs = millis();
if (userApp.isLongTaskRunning()) {
avatar.resetSleepTimer("long task");
}
}
delay(1);
}
Touch wake behavior is intentionally conservative. If the device is already sleeping, the first screen tap wakes the display and Wi-Fi state, then the current tap event is consumed so it does not also trigger a UI action. If the user keeps holding the screen, the next update cycles can still start touch Push-to-Talk after the hold threshold. Physical buttons or app-specific inputs should call avatar.resetSleepTimer(...) and then run their normal action, so a button press can wake and perform its intended function.
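The timer and wake behavior described in this section can be modeled in a few lines. This is an illustrative sketch only: IdleTimer is a hypothetical class, the real logic lives in SleepManager, and display dimming and Wi-Fi handling are left out. It shows why a reset that wakes the device can be told apart from a reset while awake, which is the distinction behind consuming the first screen tap.

```cpp
#include <cstdint>

// Hypothetical model of an idle sleep timer driven by a millisecond clock.
class IdleTimer {
public:
    explicit IdleTimer(std::uint32_t timeoutMs) : timeoutMs_(timeoutMs) {}

    // Record activity. Returns true when this call woke the timer from sleep,
    // which is the case where a screen tap would be consumed.
    bool reset(std::uint32_t nowMs) {
        lastActivityMs_ = nowMs;
        const bool woke = sleeping_;
        sleeping_ = false;
        return woke;
    }

    // Call once per loop iteration. Enters sleep after timeoutMs_ without
    // activity. Unsigned subtraction tolerates clock wraparound.
    void update(std::uint32_t nowMs) {
        if (!sleeping_ && (nowMs - lastActivityMs_) >= timeoutMs_) {
            sleeping_ = true;
        }
    }

    bool isSleeping() const { return sleeping_; }

private:
    std::uint32_t timeoutMs_;
    std::uint32_t lastActivityMs_ = 0;
    bool sleeping_ = false;
};
```

A touch handler built on this model would skip its UI action when reset(...) returns true, while a physical button handler would ignore the return value and always run its action.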
You can add project-specific behavior directly in main.cpp, or keep it in a separate application module.
examples/stackchan/custom shows the module-based approach. It reads temperature and humidity from Unit ENV III and draws the values on the screen.
The custom behavior is isolated in UserApp.cpp / UserApp.h. main.cpp creates the UserApp instance, calls userApp.begin(avatar) during setup, and calls userApp.update() from the main loop.
Use this example as a starting point for app-specific sensors, overlays, callbacks, and event-driven speech.
graph TD;
MAIN[Main];
CONFIG[Config];
AVATAR[AIAvatar];
USERAPP[UserApp];
MIC[MicrophoneInput];
WS[WebSocketClient];
SPEAKER[SpeakerOutput];
DISPLAY[ScreenRenderer];
FACE[FaceController];
MOTION[MotionController];
LED[LedController];
UI[SystemUIController];
STATUS[StatusOverlay];
EFFECTS[VisualEffects];
SLEEP[SleepManager];
CAMERA[CameraController];
OPENCLAW[OpenClawEffects];
STACKCHAN[StackChanHardware];
CONVERTER[AudioConverter];
HARDWARE[HardwareAdapter];
MAIN --> CONFIG;
MAIN --> AVATAR;
MAIN --> USERAPP;
USERAPP --> AVATAR;
CONFIG --> AVATAR;
AVATAR --> MIC;
AVATAR --> WS;
AVATAR --> SPEAKER;
AVATAR --> DISPLAY;
AVATAR --> FACE;
AVATAR --> MOTION;
AVATAR --> LED;
AVATAR --> UI;
AVATAR --> STATUS;
AVATAR --> EFFECTS;
AVATAR --> SLEEP;
AVATAR --> CAMERA;
AVATAR --> OPENCLAW;
AVATAR --> STACKCHAN;
WS --> CONVERTER;
FACE --> DISPLAY;
MOTION --> HARDWARE;
LED --> HARDWARE;
STACKCHAN --> HARDWARE;
UI --> STATUS;
OPENCLAW --> LED;
OPENCLAW --> DISPLAY;
AIAvatar is the central orchestrator. It owns the audio pipeline, WebSocket connection, display rendering, face control, motion control, LEDs, system UI, and user callbacks.
| Module | Role |
|---|---|
| Config | Holds runtime settings and can load JSON from a stream or bytes. |
| ResourceProvider | Resolves config/assets from SD first, then optional user-registered built-in assets. |
| MicrophoneInput | Captures PCM audio from CoreS3 and queues frames for upload. |
| WebSocketClient | Sends microphone frames and invoke requests, then receives server events and audio chunks. |
| SpeakerOutput | Buffers and plays returned PCM audio. |
| ScreenRenderer | Owns the display canvas and calls overlay rendering hooks. |
| FaceController | Handles expressions, blinking, and lip-sync state. |
| MotionController | Drives StackChan head motion and nade motion sequences. |
| LedController | Drives StackChan LED feedback. |
| SystemUIController | Handles virtual buttons, Wi-Fi selection UI, and built-in button actions. |
| StatusOverlay | Draws connection, Wi-Fi, battery, volume, and microphone status. |
| VisualEffects | Draws transient UI effects such as voice detection. |
| SleepManager | Manages idle sleep, display dimming, Wi-Fi sleep/off behavior, wake handling, and WebSocket reconnect after wake. |
| CameraController | Captures camera images for vision requests. |
| OpenClawEffects | Adds optional OpenClaw-specific LED and screen effects. |
| HardwareAdapter | Abstracts hardware-specific motion and LED operations. |
| StackChanHardware | Implements HardwareAdapter for StackChan hardware. |
| AudioConverter | Optional encoder/decoder hook used by WebSocketClient. |
| UserApp | Project-specific extension point used by the custom example for sensors, overlays, callbacks, and app logic. |
The hardware-dependent pieces are isolated behind HardwareAdapter, so the core voice/avatar flow can still run without StackChan motion, touch, LEDs, or camera.
- StopWatch has no SD card slot, so put Wi-Fi, WebSocket, user ID, volume, and other settings directly in examples/stopwatch/src/main.cpp.
- Register Wi-Fi profiles with addWifiNetwork(index, name, ssid, pass). The first entry is used for the initial connection and all entries appear in the Wi-Fi picker.
- Convert avatar images into firmware assets before building:
  cd examples/stopwatch/src
  ../../tools/embed_assets.py --input ../../avatar/stopwatch
- The generated BuiltinAvatarImages.h / .cpp must stay next to main.cpp, and main.cpp must include BuiltinAvatarImages.h and call resources.setBuiltinAssets(...).
- Keep only the images you need. Firmware-embedded PNGs increase flash usage, so large or unused avatar files should be removed before conversion.
- StopWatch speaker output can be quieter than StackChan. The example applies avatar.speaker().setPcmGain(6.0f) in main.cpp; adjust it to match your TTS voice and hardware.
- Press BtnA to cycle speaker volume. The current volume level is shown briefly on screen.
- Hold BtnB for Push-to-Talk, then release it to send the recorded audio.
- Tap the face area to send a short text invoke. This is rate-limited to avoid accidental repeated taps.
- The screen orientation flips automatically when the device is worn upside down.
- When using the OpenClaw example server, matching OpenClaw response messages trigger a short vibration.
We are aware of the following issues. Contributions are very welcome if you can help fix them.✨🙏✨
- Servo instability: The head may occasionally jerk left or drop downward. This may vary from unit to unit. Setting stackchan_auto_angle_sync to false in config.json can solve it, but motion becomes less responsive.
- Arduino IDE support: We would like to support Arduino IDE and similar environments so beginners (like me) can use this project more easily.
First and foremost, huge respect and heartfelt thanks to all the creators who brought StackChan to life and have built such a wonderful community around it. We also want to thank the M5Stack team for making StackChan more accessible to everyone by turning it into a product.
This project is licensed under the MIT License. See LICENSE for details.