WebSocket Communication Protocol
This document describes the WebSocket communication protocol as inferred from the device-side code, outlining how clients (devices) interact with the server over WebSocket. It is based solely on the provided code and may need to be confirmed or supplemented against the actual server-side implementation during deployment.
1. Overall Process Overview
Device-side Initialization
- On power-up, the device initializes the Application:
  - Initialize the audio codec, display, LED, etc.
  - Connect to the network
  - Create and initialize a WebSocket protocol instance (`WebsocketProtocol`) implementing the `Protocol` interface
- Enter the main loop and wait for events (audio input, audio output, scheduled tasks, etc.)
Establish WebSocket Connection
- When the device needs to start a voice session (e.g., on user wake-up or a manual button press), it calls `OpenAudioChannel()`:
  - Get the WebSocket URL from the build configuration (`CONFIG_WEBSOCKET_URL`)
  - Set several request headers (`Authorization`, `Protocol-Version`, `Device-Id`, `Client-Id`)
  - Call `Connect()` to establish the WebSocket connection with the server
Send Client “hello” Message
- After the connection succeeds, the device sends a JSON message with the following example structure (a sketch of building it appears below):

```json
{
  "type": "hello",
  "version": 1,
  "transport": "websocket",
  "audio_params": {
    "format": "opus",
    "sample_rate": 16000,
    "channels": 1,
    "frame_duration": 60
  }
}
```

- The `"frame_duration"` value corresponds to `OPUS_FRAME_DURATION_MS` (e.g., 60 ms).
Server Replies “hello”
- The device waits for the server to return a JSON message containing `"type": "hello"` and checks whether `"transport": "websocket"` matches.
- If it matches, the server is considered ready and the audio channel is marked as successfully opened.
- If no valid reply arrives within the timeout (10 seconds by default), the connection is considered failed and the network-error callback is triggered (see the sketch below).
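A minimal sketch of this timed wait, assuming the JSON receive path sets a FreeRTOS event-group bit once a valid server hello has been verified (the bit and variable names are illustrative):

```cpp
#include "freertos/FreeRTOS.h"
#include "freertos/event_groups.h"

#define SERVER_HELLO_EVENT (1 << 0)  // illustrative event bit

static EventGroupHandle_t event_group_ = xEventGroupCreate();

// Called from the JSON receive path once a valid server hello is verified.
void OnServerHello() {
    xEventGroupSetBits(event_group_, SERVER_HELLO_EVENT);
}

// Called from OpenAudioChannel() after Connect() and the client hello.
// Returns true if the server hello arrived within the 10-second timeout.
bool WaitForServerHello() {
    EventBits_t bits = xEventGroupWaitBits(
        event_group_, SERVER_HELLO_EVENT,
        pdTRUE,    // clear the bit on exit
        pdFALSE,   // any of the requested bits suffices
        pdMS_TO_TICKS(10000));
    return (bits & SERVER_HELLO_EVENT) != 0;
}
```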
Subsequent Message Interaction
The device and the server exchange two main types of data:
- Binary audio data (Opus encoded)
- Text JSON messages (for transmitting chat status, TTS/STT events, IoT commands, etc.)
In the code, receive callbacks are mainly divided into:
- `OnData(...)`:
  - When `binary` is `true`, the payload is treated as an audio frame; the device decodes it as Opus data.
  - When `binary` is `false`, the payload is treated as JSON text and parsed with cJSON on the device side to drive the corresponding business logic (see the message structures below, and the dispatch sketch that follows).
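A minimal sketch of that dispatch, assuming `on_incoming_audio_` and `on_incoming_json_` are callbacks registered by the application layer (the names are illustrative):

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>
#include "cJSON.h"
#include "esp_log.h"

static const char* TAG = "WS";

// Callbacks registered by the application layer (illustrative names).
std::function<void(std::vector<uint8_t>&&)> on_incoming_audio_;
std::function<void(const cJSON*)> on_incoming_json_;

// Receive callback: binary frames carry Opus audio, text frames carry JSON.
void OnData(const char* data, size_t len, bool binary) {
    if (binary) {
        if (on_incoming_audio_) {
            on_incoming_audio_(std::vector<uint8_t>(data, data + len));
        }
        return;
    }
    cJSON* root = cJSON_Parse(std::string(data, len).c_str());
    if (root == nullptr) {
        ESP_LOGE(TAG, "Invalid JSON, data: %.*s", (int)len, data);
        return;
    }
    if (on_incoming_json_) {
        on_incoming_json_(root);
    }
    cJSON_Delete(root);
}
```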
When the server or the network disconnects, the `OnDisconnected()` callback is triggered:
- The device calls `on_audio_channel_closed_()` and eventually returns to the idle state.
Close WebSocket Connection
- The device actively disconnects by calling `CloseAudioChannel()` when ending a voice session, then returns to the idle state.
- If the server disconnects first, the same callback flow is triggered.
2. Common Request Headers
When establishing the WebSocket connection, the code example sets the following request headers:
- `Authorization`: Carries the access token, formatted as `"Bearer <token>"`
- `Protocol-Version`: Fixed to `"1"` in the example
- `Device-Id`: The MAC address of the device's physical network interface
- `Client-Id`: A device UUID that uniquely identifies the device within the application
These headers are sent with the WebSocket handshake, and the server can use them for verification, authentication, and so on as needed (see the sketch below).
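A minimal sketch of setting these headers before connecting. The `WebSocket` wrapper and its `SetHeader`/`Connect` methods are assumptions about the transport layer, not a confirmed API:

```cpp
#include <string>

// Assumed WebSocket wrapper interface; the real transport API may differ.
class WebSocket {
public:
    void SetHeader(const char* key, const char* value);
    bool Connect(const char* url);
};

bool OpenConnection(WebSocket& ws, const std::string& access_token,
                    const std::string& mac_address, const std::string& uuid) {
    std::string auth = "Bearer " + access_token;
    ws.SetHeader("Authorization", auth.c_str());
    ws.SetHeader("Protocol-Version", "1");
    ws.SetHeader("Device-Id", mac_address.c_str());  // physical NIC MAC address
    ws.SetHeader("Client-Id", uuid.c_str());         // application-level device UUID
    return ws.Connect(CONFIG_WEBSOCKET_URL);  // macro from the build configuration
}
```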
3. JSON Message Structure
WebSocket text frames are transmitted in JSON format. The following are the common `"type"` values and their corresponding business logic. Fields not listed may be optional or implementation-specific details.
3.1 Client→Server
Hello
- Sent by client after successful connection to inform server of basic parameters.
- Example:

```json
{
  "type": "hello",
  "version": 1,
  "transport": "websocket",
  "audio_params": {
    "format": "opus",
    "sample_rate": 16000,
    "channels": 1,
    "frame_duration": 60
  }
}
```
Listen
- Indicates that the client is starting or stopping recording (listening).
- Common fields:
  - `"session_id"`: Session identifier
  - `"type"`: `"listen"`
  - `"state"`: `"start"`, `"stop"`, or `"detect"` (wake-word detection triggered)
  - `"mode"`: `"auto"`, `"manual"`, or `"realtime"`, indicating the recognition mode
- Example: start listening

```json
{
  "session_id": "xxx",
  "type": "listen",
  "state": "start",
  "mode": "manual"
}
```
Abort
- Terminates the current speaking (TTS playback) or the voice channel.
- Example:

```json
{
  "session_id": "xxx",
  "type": "abort",
  "reason": "wake_word_detected"
}
```

- The `reason` value can be `"wake_word_detected"` or another reason.
Wake Word Detected
- Used by the client to inform the server that a wake word has been detected.
- Example:

```json
{
  "session_id": "xxx",
  "type": "listen",
  "state": "detect",
  "text": "hello xiaoming"
}
```
IoT
- Sends the current device's IoT-related information:
  - Descriptors (describing the device's functions, attributes, etc.)
  - States (real-time updates of device status)
- Examples:

```json
{ "session_id": "xxx", "type": "iot", "descriptors": { ... } }
```

or

```json
{ "session_id": "xxx", "type": "iot", "states": { ... } }
```
3.2 Server→Client
Hello
- Handshake confirmation message returned by server.
- Must contain `"type": "hello"` and `"transport": "websocket"`.
- May include `audio_params`, indicating the server's expected audio parameters or a configuration aligned with the client's.
- On successful reception, the client sets an event flag indicating that the WebSocket channel is ready.
STT
{"type": "stt", "text": "..."}- Indicates server has recognized user speech (e.g., speech-to-text result)
- Device may display this text on screen, then proceed to response flow.
LLM
{"type": "llm", "emotion": "happy", "text": "😀"}- Server instructs device to adjust emotion animation / UI expression.
TTS
{"type": "tts", "state": "start"}: Server prepares to send TTS audio, client enters “speaking” playback state.{"type": "tts", "state": "stop"}: Indicates current TTS session ended.{"type": "tts", "state": "sentence_start", "text": "..."}- Have device display current text segment to be played or read on interface (e.g., for user display).
IoT
{"type": "iot", "commands": [ ... ]}- Server sends IoT action commands to device, device parses and executes (e.g., turn on lights, set temperature, etc.).
Audio Data: Binary Frames
- When the server sends audio binary frames (Opus encoded), the client decodes and plays them.
- If the client is in the "listening" (recording) state, received audio frames are ignored or discarded to prevent conflicts (see the guard sketch below).
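A minimal sketch of that guard, using an illustrative `DeviceState` enum matching the states described in section 5:

```cpp
#include <cstdint>
#include <vector>

// Illustrative device states; see section 5 for the transitions.
enum class DeviceState { kIdle, kConnecting, kListening, kSpeaking };
DeviceState device_state_ = DeviceState::kIdle;

// Decode + output path, as described in section 4 (defined elsewhere).
void PlayOpusFrame(const std::vector<uint8_t>& opus);

// Incoming binary frame from the server.
void OnIncomingAudio(const std::vector<uint8_t>& opus) {
    if (device_state_ == DeviceState::kListening) {
        return;  // ignore server audio while recording to avoid conflicts
    }
    if (device_state_ == DeviceState::kSpeaking) {
        PlayOpusFrame(opus);
    }
}
```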
4. Audio Encoding/Decoding
Client Sends Recording Data
- Audio input, after possible echo cancellation, noise reduction, or volume gain, is Opus-encoded into binary frames and sent to the server.
- Each binary frame the encoder produces (N bytes at a time) is sent as a WebSocket binary message (see the sketch below).
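A minimal encode-and-send sketch with libopus, assuming 16 kHz mono PCM input; `SendBinary()` is a hypothetical stand-in for the WebSocket binary-send wrapper:

```cpp
#include <cstddef>
#include <cstdint>
#include <opus.h>

// Hypothetical stand-in for the WebSocket binary-send wrapper.
void SendBinary(const uint8_t* data, size_t len);

constexpr int kSampleRate = 16000;
constexpr int kChannels = 1;
constexpr int kFrameDurationMs = 60;  // OPUS_FRAME_DURATION_MS
constexpr int kFrameSamples = kSampleRate * kFrameDurationMs / 1000;  // 960

OpusEncoder* CreateEncoder() {
    int err = 0;
    return opus_encoder_create(kSampleRate, kChannels,
                               OPUS_APPLICATION_VOIP, &err);
}

// Encode one PCM frame (after AEC / noise reduction / gain) and send it.
void EncodeAndSend(OpusEncoder* encoder, const int16_t* pcm) {
    uint8_t payload[1500];  // generous upper bound for one Opus frame
    opus_int32 n = opus_encode(encoder, pcm, kFrameSamples,
                               payload, sizeof(payload));
    if (n > 0) {
        SendBinary(payload, static_cast<size_t>(n));  // one binary WebSocket frame
    }
}
```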
Client Plays Received Audio
- Binary frames received from the server are likewise treated as Opus data.
- The device decodes them and sends the result to the audio output interface for playback.
- If the server's audio sample rate differs from the device's, the decoded audio is resampled first (see the sketch below).
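A matching decode sketch with libopus, again assuming 16 kHz mono; `PlayPcm()` is a hypothetical audio-output hook, and any resampling would sit between decode and playback:

```cpp
#include <cstddef>
#include <cstdint>
#include <opus.h>

// Hypothetical audio-output hook.
void PlayPcm(const int16_t* pcm, int samples);

constexpr int kSampleRate = 16000;
constexpr int kChannels = 1;
constexpr int kMaxFrameSamples = kSampleRate * 60 / 1000;  // 60 ms -> 960 samples

OpusDecoder* CreateDecoder() {
    int err = 0;
    return opus_decoder_create(kSampleRate, kChannels, &err);
}

// Decode one Opus frame received as a WebSocket binary message and play it.
void DecodeAndPlay(OpusDecoder* decoder, const uint8_t* data, size_t len) {
    int16_t pcm[kMaxFrameSamples * kChannels];
    int samples = opus_decode(decoder, data, static_cast<opus_int32>(len),
                              pcm, kMaxFrameSamples, 0 /* no FEC */);
    if (samples > 0) {
        PlayPcm(pcm, samples);  // resample here if server rate != device rate
    }
}
```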
5. Common State Transitions
The following briefly describes the key device-side state transitions and the WebSocket messages that drive them (a transition sketch follows the list):
Idle → Connecting
- After a user trigger or wake-up, the device calls `OpenAudioChannel()` → establishes the WebSocket connection → sends `"type":"hello"`.
Connecting → Listening
- Once the connection is established, executing `SendStartListening(...)` enters the recording state. The device continuously encodes microphone data and sends it to the server.
Listening → Speaking
- Receiving the server's TTS start message (`{"type":"tts","state":"start"}`) → stop recording and play the received audio.
Speaking → Idle
- Server TTS stop (`{"type":"tts","state":"stop"}`) → audio playback ends. If not continuing with auto-listen, return to Idle; if configured for an auto loop, re-enter Listening.
Listening / Speaking → Idle (on exceptions or active interruption)
- Call `SendAbortSpeaking(...)` or `CloseAudioChannel()` → interrupt the session → close the WebSocket → the state returns to Idle.
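A minimal sketch of the TTS-driven part of these transitions, reusing the illustrative `DeviceState` enum from section 3.2; `StopRecording()`, `StartListening()`, and `auto_listen_` are hypothetical hooks:

```cpp
#include <cstring>
#include "cJSON.h"

enum class DeviceState { kIdle, kConnecting, kListening, kSpeaking };
DeviceState device_state_ = DeviceState::kIdle;

void StopRecording();       // hypothetical audio-pipeline hooks
void StartListening();
bool auto_listen_ = false;  // re-enter Listening after TTS stops?

// Drive the Listening <-> Speaking transitions from server TTS messages.
void HandleTtsMessage(const cJSON* root) {
    const cJSON* state = cJSON_GetObjectItem(root, "state");
    if (!cJSON_IsString(state)) {
        return;
    }
    if (std::strcmp(state->valuestring, "start") == 0) {
        StopRecording();                         // Listening -> Speaking
        device_state_ = DeviceState::kSpeaking;
    } else if (std::strcmp(state->valuestring, "stop") == 0) {
        if (auto_listen_) {
            StartListening();                    // Speaking -> Listening (auto loop)
            device_state_ = DeviceState::kListening;
        } else {
            device_state_ = DeviceState::kIdle;  // Speaking -> Idle
        }
    }
}
```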
6. Error Handling
Connection Failure
- If `Connect(url)` fails, or waiting for the server "hello" message times out, the `on_network_error_()` callback is triggered. The device shows "Cannot connect to service" or a similar error message.
Server Disconnect
- If the WebSocket disconnects abnormally, the `OnDisconnected()` callback fires:
  - The device invokes `on_audio_channel_closed_()`
  - Then switches to Idle or runs other retry logic
7. Other Notes
Authentication
- The device authenticates by setting `Authorization: Bearer <token>`; the server must verify its validity.
- If the token is expired or invalid, the server can refuse the handshake or disconnect later.
Session Control
- Some messages in the code contain a `session_id`, used to distinguish independent conversations or operations; the server can handle different sessions separately as needed. In this WebSocket implementation the field may be sent empty (see the `"session_id": ""` in the example in section 8).
Audio Payload
- The code defaults to Opus format with `sample_rate = 16000`, mono. The frame duration is controlled by `OPUS_FRAME_DURATION_MS`, typically 60 ms, and can be adjusted based on bandwidth or performance (the per-frame sample count follows directly, as computed below).
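For reference, the per-frame sample count under the documented defaults:

```cpp
// Samples per Opus frame = sample_rate * frame_duration_ms / 1000.
constexpr int kSampleRate = 16000;
constexpr int kOpusFrameDurationMs = 60;  // OPUS_FRAME_DURATION_MS
constexpr int kFrameSamples = kSampleRate * kOpusFrameDurationMs / 1000;  // 960 samples
```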
IoT Commands
"type":"iot"messages interface withthing_manageron user side to execute specific commands, varying by device customization. Server needs to ensure sent format remains consistent with client.
Error or Exception JSON
- When a JSON message lacks a necessary field, e.g. `{"type": ...}`, the client logs an error (`ESP_LOGE(TAG, "Missing message type, data: %s", data);`) and executes no business logic (see the sketch below).
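A minimal sketch of this validation, matching the log line quoted above; the surrounding parsing helper is illustrative:

```cpp
#include "cJSON.h"
#include "esp_log.h"

static const char* TAG = "WS";

// Parse an incoming text frame and verify the required "type" field.
void ParseJsonMessage(const char* data) {
    cJSON* root = cJSON_Parse(data);
    if (root == nullptr) {
        ESP_LOGE(TAG, "Invalid JSON, data: %s", data);
        return;
    }
    cJSON* type = cJSON_GetObjectItem(root, "type");
    if (!cJSON_IsString(type)) {
        // Matches the error log quoted in the document.
        ESP_LOGE(TAG, "Missing message type, data: %s", data);
        cJSON_Delete(root);
        return;  // no business logic is executed
    }
    // ... dispatch on type->valuestring ("hello", "stt", "tts", "iot", ...)
    cJSON_Delete(root);
}
```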
8. Message Examples
Below is a typical bidirectional message example (simplified flow illustration):
1. Client → Server (handshake)

```json
{
  "type": "hello",
  "version": 1,
  "transport": "websocket",
  "audio_params": {
    "format": "opus",
    "sample_rate": 16000,
    "channels": 1,
    "frame_duration": 60
  }
}
```

2. Server → Client (handshake response)

```json
{
  "type": "hello",
  "transport": "websocket",
  "audio_params": {
    "sample_rate": 16000
  }
}
```

3. Client → Server (start listening)

```json
{
  "session_id": "",
  "type": "listen",
  "state": "start",
  "mode": "auto"
}
```

The client simultaneously starts sending binary frames (Opus data).

4. Server → Client (ASR result)

```json
{
  "type": "stt",
  "text": "what the user said"
}
```

5. Server → Client (TTS start)

```json
{
  "type": "tts",
  "state": "start"
}
```

The server then sends binary audio frames to the client for playback.

6. Server → Client (TTS end)

```json
{
  "type": "tts",
  "state": "stop"
}
```

The client stops audio playback and, if there are no more commands, returns to the idle state.
9. Summary
This protocol transmits JSON text and binary audio frames over WebSocket, covering audio stream upload, TTS audio playback, speech recognition and state management, IoT command delivery, and more. Its core characteristics:
- Handshake phase: send `"type":"hello"` and wait for the server's reply.
- Audio channel: Opus-encoded binary frames carry the voice stream in both directions.
- JSON messages: the `"type"` field is the core discriminator for the business logic, including TTS, STT, IoT, WakeWord, etc.
- Extensibility: fields can be added to the JSON messages, or extra authentication carried in the headers, as actual needs dictate.
The server and the client need to agree in advance on the field meanings, timing logic, and error-handling rules of each message type to ensure smooth communication. The information above can serve as base documentation for subsequent integration, development, or extension.