WebSocket Communication Protocol
This document describes the WebSocket communication protocol between clients (devices) and servers, as inferred from the client code implementation. It may require confirmation against, or supplementation with, the server-side implementation before actual deployment.
1. Overall Process Overview
Device Initialization
- Device powers on and initializes the `Application`:
  - Initializes audio codec, display, LEDs, etc.
  - Connects to the network
  - Creates and initializes a WebSocket protocol instance (`WebsocketProtocol`) that implements the `Protocol` interface
  - Enters the main loop, waiting for events (audio input, audio output, scheduled tasks, etc.)
Establishing WebSocket Connection
- When the device needs to start a voice session (e.g., wake-word or manual button trigger), it calls `OpenAudioChannel()`:
  - Gets the WebSocket URL from the build configuration (`CONFIG_WEBSOCKET_URL`)
  - Sets several request headers (`Authorization`, `Protocol-Version`, `Device-Id`, `Client-Id`)
  - Calls `Connect()` to establish a WebSocket connection with the server
Sending Client “hello” Message
- After the connection succeeds, the device sends a JSON message with the following structure:

  { "type": "hello", "version": 1, "transport": "websocket", "audio_params": { "format": "opus", "sample_rate": 16000, "channels": 1, "frame_duration": 60 } }

- The value of `"frame_duration"` corresponds to `OPUS_FRAME_DURATION_MS` (e.g., 60 ms).
Server “hello” Response
- The device waits for the server to return a JSON message containing `"type": "hello"` and checks that `"transport": "websocket"` matches.
- If it matches, the server is considered ready and the audio channel is marked as successfully open.
- If no valid reply is received within the timeout (default 10 seconds), the connection is considered failed and the network error callback is triggered.
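The hello wait described above can be sketched with an event flag and a timeout, mirroring the flag the device sets when a valid reply arrives. The class and constant names here (`ServerHelloWaiter`, `HELLO_TIMEOUT_S`) are illustrative, not taken from the code:

```python
import json
import threading

HELLO_TIMEOUT_S = 10  # default timeout described in this section

class ServerHelloWaiter:
    def __init__(self):
        self._ready = threading.Event()

    def on_text_message(self, payload: str) -> None:
        """Called for every text frame; sets the flag on a valid server hello."""
        msg = json.loads(payload)
        if msg.get("type") == "hello" and msg.get("transport") == "websocket":
            self._ready.set()

    def wait_ready(self, timeout: float = HELLO_TIMEOUT_S) -> bool:
        """Returns True if the server hello arrived in time, False on timeout."""
        return self._ready.wait(timeout)
```

On timeout (`False`), the device would invoke its network error callback and abandon the connection.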
Subsequent Message Exchange
Two main types of data can be exchanged between the device and the server:
- Binary audio data (Opus-encoded)
- Text JSON messages (for chat status, TTS/STT events, IoT commands, etc.)

In the code, receive handling centers on the `OnData(...)` callback:
- When `binary` is `true`, the payload is treated as an audio frame and decoded as Opus data.
- When `binary` is `false`, the payload is treated as JSON text, parsed with cJSON on the device side, and processed according to the corresponding business logic (see message structures below).

When the server or network disconnects, the `OnDisconnected()` callback is triggered:
- The device calls `on_audio_channel_closed_()` and eventually returns to the idle state.
Closing the WebSocket Connection
- When the device needs to end a voice session, it calls `CloseAudioChannel()` to actively disconnect and return to the idle state.
- If instead the server actively disconnects, the same callback process is triggered.
2. Common Request Headers
When establishing the WebSocket connection, the code example sets the following request headers:
- `Authorization`: Carries the access token, in the form `"Bearer <token>"`
- `Protocol-Version`: Fixed as `"1"` in the example
- `Device-Id`: The device's physical network interface (MAC) address
- `Client-Id`: The device UUID (uniquely identifies the device within the application)
These headers are sent to the server along with the WebSocket handshake, and the server can perform verification, authentication, etc. as needed.
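A minimal sketch of assembling these handshake headers follows. The helper name and the argument values are illustrative; only the four header names and value formats come from this document:

```python
def build_handshake_headers(token: str, device_mac: str, client_uuid: str) -> dict:
    """Build the request headers sent with the WebSocket handshake."""
    return {
        "Authorization": f"Bearer {token}",   # access token
        "Protocol-Version": "1",              # fixed as "1" in the example
        "Device-Id": device_mac,              # physical NIC MAC address
        "Client-Id": client_uuid,             # application-level device UUID
    }
```

These would be passed to whatever WebSocket client API the platform provides at connect time.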
3. JSON Message Structure
WebSocket text frames are transmitted in JSON format. The following are the common `"type"` fields and their corresponding business logic. Fields not listed here may be optional or implementation-specific details.
3.1 Client → Server
Hello
- Sent by the client after a successful connection to inform the server of basic parameters.
- Example:
{ "type": "hello", "version": 1, "transport": "websocket", "audio_params": { "format": "opus", "sample_rate": 16000, "channels": 1, "frame_duration": 60 } }
Listen
- Indicates that the client is starting or stopping audio capture.
- Common fields:
  - `"session_id"`: Session identifier
  - `"type": "listen"`
  - `"state"`: `"start"`, `"stop"`, or `"detect"` (wake-word detection has been triggered)
  - `"mode"`: `"auto"`, `"manual"`, or `"realtime"`, indicating the recognition mode
- Example (start listening):

  { "session_id": "xxx", "type": "listen", "state": "start", "mode": "manual" }
Abort
- Terminates the current speech (TTS playback) or voice channel.
- Example:

  { "session_id": "xxx", "type": "abort", "reason": "wake_word_detected" }

- `reason` can be `"wake_word_detected"` or other values.
Wake Word Detected
- Used by the client to inform the server that a wake word has been detected.
- Example:
{ "session_id": "xxx", "type": "listen", "state": "detect", "text": "Hello XiaoMing" }
IoT
- Sends IoT-related information about the current device:
  - Descriptors (describing device features, attributes, etc.)
  - States (real-time updates of device status)
- Examples:

  { "session_id": "xxx", "type": "iot", "descriptors": { ... } }

  { "session_id": "xxx", "type": "iot", "states": { ... } }
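The client → server messages above can be sketched as small builder functions. The function names are illustrative; the field names and values come from the examples in this section:

```python
import json
from typing import Optional

def hello_message() -> str:
    """Client handshake message sent after the connection succeeds."""
    return json.dumps({
        "type": "hello",
        "version": 1,
        "transport": "websocket",
        "audio_params": {
            "format": "opus",
            "sample_rate": 16000,
            "channels": 1,
            "frame_duration": 60,
        },
    })

def listen_message(session_id: str, state: str, mode: Optional[str] = None) -> str:
    """state: "start", "stop", or "detect"; mode: "auto", "manual", or "realtime"."""
    msg = {"session_id": session_id, "type": "listen", "state": state}
    if mode is not None:
        msg["mode"] = mode
    return json.dumps(msg)

def abort_message(session_id: str, reason: str = "wake_word_detected") -> str:
    """Terminate the current TTS playback or voice channel."""
    return json.dumps({"session_id": session_id, "type": "abort", "reason": reason})
```

Each string would then be sent as a WebSocket text frame.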
3.2 Server → Client
Hello
- The handshake confirmation message returned by the server.
- Must contain `"type": "hello"` and `"transport": "websocket"`.
- May carry `audio_params`, indicating the server's expected audio parameters or configuration aligned with the client.
- Upon successful reception, the client sets an event flag indicating that the WebSocket channel is ready.
STT
- `{"type": "stt", "text": "..."}`
- Indicates that the server has recognized the user's speech (i.e., speech-to-text results).
- The device may display this text on the screen and then proceed to the answer process.
LLM
- `{"type": "llm", "emotion": "happy", "text": "😀"}`
- The server instructs the device to adjust facial animations / UI expressions.
TTS
- `{"type": "tts", "state": "start"}`: The server is about to send TTS audio; the client enters the "speaking" playback state.
- `{"type": "tts", "state": "stop"}`: Indicates the end of this TTS turn.
- `{"type": "tts", "state": "sentence_start", "text": "..."}`: Makes the device display the current text fragment to be played (e.g., for display to the user).
IoT
- `{"type": "iot", "commands": [ ... ]}`
- The server sends IoT action commands to the device, which the device parses and executes (e.g., turning on lights, setting temperature).
Audio Data: Binary Frames
- When the server sends audio binary frames (Opus encoded), the client decodes and plays them.
- If the client is in a “listening” (recording) state, received audio frames will be ignored or cleared to prevent conflicts.
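The device-side receive path described in this section can be sketched as a single dispatch function: binary frames go to the Opus decoder, text frames are parsed and routed on the `"type"` field. The return labels are purely illustrative; the real firmware calls into its audio, display, and IoT modules instead:

```python
import json

def handle_frame(payload, binary: bool) -> str:
    """Route one WebSocket frame the way the client's OnData(...) callback does."""
    if binary:
        return "opus_audio"          # decode as Opus and queue for playback
    msg = json.loads(payload)
    msg_type = msg.get("type")
    if msg_type == "tts":
        return f"tts_{msg.get('state')}"  # start / stop / sentence_start
    if msg_type == "stt":
        return "stt_text"            # display the recognized text
    if msg_type == "llm":
        return "emotion_update"      # adjust facial animation / UI expression
    if msg_type == "iot":
        return "iot_commands"        # hand off to the thing_manager equivalent
    return "unknown"
```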
4. Audio Encoding and Decoding
Client Sending Recording Data
- Audio input, after possible echo cancellation, noise reduction, or volume gain, is packaged as binary frames through Opus encoding and sent to the server.
- Whatever frame size (N bytes) the encoder produces, each frame is sent to the server as a single WebSocket binary message.
Client Playing Received Audio
- When receiving binary frames from the server, they are likewise considered Opus data.
- The device side will decode them and then pass them to the audio output interface for playback.
- If the server’s audio sampling rate is inconsistent with the device, it will be resampled after decoding.
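As a worked example of the framing arithmetic: at 16 kHz mono with 60 ms frames, each Opus frame covers a fixed number of PCM samples. The helper name is illustrative:

```python
def samples_per_frame(sample_rate_hz: int, frame_duration_ms: int) -> int:
    """PCM samples covered by one audio frame."""
    return sample_rate_hz * frame_duration_ms // 1000

# 16000 Hz * 60 ms -> 960 samples per frame, i.e. 1920 bytes of 16-bit mono
# PCM before Opus compression; the encoded frame size N varies with bitrate.
```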
5. Common State Transitions
The following briefly describes the key state transitions of the device side, corresponding to WebSocket messages:
Idle → Connecting
- After a user trigger or wake word, the device calls `OpenAudioChannel()` → establishes a WebSocket connection → sends `"type":"hello"`.
Connecting → Listening
- Once the connection is established, if `SendStartListening(...)` is then executed, the device enters the recording state and continuously encodes microphone data and sends it to the server.
Listening → Speaking
- Receiving the server's TTS start message (`{"type":"tts","state":"start"}`) → stops recording and plays the received audio.
Speaking → Idle
- Server TTS stop (`{"type":"tts","state":"stop"}`) → audio playback ends. If not continuing to automatic listening, the device returns to Idle; if automatic cycling is configured, it enters Listening again.
Listening / Speaking → Idle (exception or active interruption)
- Calling `SendAbortSpeaking(...)` or `CloseAudioChannel()` → interrupts the session → closes the WebSocket → state returns to Idle.
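The transitions above can be sketched as a small table-driven state machine. The state and event names are paraphrased from this section; the firmware's actual enum names may differ:

```python
# (current_state, event) -> next_state, per section 5
TRANSITIONS = {
    ("idle", "open_audio_channel"): "connecting",
    ("connecting", "start_listening"): "listening",
    ("listening", "tts_start"): "speaking",
    ("speaking", "tts_stop"): "idle",
    ("listening", "abort_or_close"): "idle",
    ("speaking", "abort_or_close"): "idle",
}

def next_state(state: str, event: str) -> str:
    """Apply one event; unknown events leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```

(With automatic cycling configured, `tts_stop` would map back to `listening` instead of `idle`.)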
6. Error Handling
Connection Failure
- If `Connect(url)` fails, or waiting for the server's "hello" message times out, the `on_network_error_()` callback is triggered. The device will show "Unable to connect to service" or a similar error message.
Server Disconnection
- If the WebSocket disconnects abnormally, the `OnDisconnected()` callback is triggered:
  - The device calls `on_audio_channel_closed_()`
  - Switches to Idle or other retry logic.
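The document does not specify what "other retry logic" looks like; a common choice for this kind of reconnect path is capped exponential backoff. The sketch below is a hypothetical schedule, not taken from the code:

```python
def backoff_delays(base_s: float = 1.0, cap_s: float = 30.0, attempts: int = 5):
    """Return a capped exponential-backoff schedule of reconnect delays, in seconds."""
    delay = base_s
    out = []
    for _ in range(attempts):
        out.append(min(delay, cap_s))
        delay *= 2
    return out
```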
7. Other Considerations
Authentication
- The device authenticates by setting `Authorization: Bearer <token>`, which the server must verify for validity.
- If the token is expired or invalid, the server can refuse the handshake or disconnect later.
Session Control
- Some messages contain a `session_id`, used to distinguish independent conversations or operations; the WebSocket protocol itself carries no session semantics, so the server can process different sessions separately as needed.
Audio Payload
- The code defaults to the Opus format with `sample_rate = 16000`, mono. The frame duration is controlled by `OPUS_FRAME_DURATION_MS`, usually 60 ms, and can be adjusted according to bandwidth or performance requirements.
IoT Commands
- Messages with `"type":"iot"` in the client code interface with `thing_manager` to execute specific commands, which vary by device customization. The server needs to ensure that the issued format is consistent with the client.
Error or Malformed JSON
- When JSON is missing required fields, such as `{"type": ...}`, the client logs an error (`ESP_LOGE(TAG, "Missing message type, data: %s", data);`) and does not execute any business logic.
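This malformed-message guard can be sketched as a validation step that runs before any business logic, mirroring the `ESP_LOGE` path above. The function name is illustrative:

```python
import json

def validate_message(payload: str):
    """Return the parsed message dict, or None if it should be dropped."""
    try:
        msg = json.loads(payload)
    except json.JSONDecodeError:
        return None                  # not valid JSON at all
    if not isinstance(msg, dict) or "type" not in msg:
        return None                  # log "Missing message type" and drop
    return msg
```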
8. Message Examples
Below is a typical example of bidirectional messages (simplified flow):
Client → Server (handshake)
{ "type": "hello", "version": 1, "transport": "websocket", "audio_params": { "format": "opus", "sample_rate": 16000, "channels": 1, "frame_duration": 60 } }
Server → Client (handshake response)
{ "type": "hello", "transport": "websocket", "audio_params": { "sample_rate": 16000 } }
Client → Server (start listening)
{ "session_id": "", "type": "listen", "state": "start", "mode": "auto" }
At the same time, the client begins sending binary frames (Opus data).
Server → Client (ASR result)
{ "type": "stt", "text": "User's speech" }
Server → Client (TTS start)
{ "type": "tts", "state": "start" }
The server then sends binary audio frames to the client for playback.
Server → Client (TTS end)
{ "type": "tts", "state": "stop" }
The client stops playing audio and returns to the idle state if there are no further instructions.
9. Summary
This protocol carries JSON text and binary audio frames over WebSocket to implement audio stream upload, TTS audio playback, speech recognition and state management, IoT command distribution, and related functions. Its core characteristics:
- Handshake Phase: Send `"type":"hello"` and wait for the server response.
- Audio Channel: Bidirectional voice streams transmitted as Opus-encoded binary frames.
- JSON Messages: `"type"` is the core field identifying different business logic, including TTS, STT, IoT, WakeWord, etc.
- Extensibility: Fields can be added to JSON messages, or additional authentication added in headers, according to actual requirements.
Server and client need to agree in advance on the meaning of various message fields, sequence logic, and error handling rules to ensure smooth communication. The information above can serve as basic documentation to facilitate subsequent integration, development, or extension.