WebSocket Communication Protocol

The following is a WebSocket communication protocol document based on the code implementation, outlining how clients (devices) interact with the server over WebSocket. It is inferred solely from the provided code and may need to be confirmed against, or supplemented by, the server-side implementation before actual deployment.


1. Overall Process Overview

  1. Device Initialization

    • Device powers on, initializes Application:
      • Initializes audio codec, display, LEDs, etc.
      • Connects to network
      • Creates and initializes a WebSocket protocol instance (WebsocketProtocol) that implements the Protocol interface
    • Enters the main loop waiting for events (audio input, audio output, scheduled tasks, etc.).
  2. Establishing WebSocket Connection

    • When the device needs to start a voice session (e.g., user wake-up, manual button trigger, etc.), it calls OpenAudioChannel():
      • Gets the WebSocket URL from build configuration (CONFIG_WEBSOCKET_URL)
      • Sets several request headers (Authorization, Protocol-Version, Device-Id, Client-Id)
      • Calls Connect() to establish a WebSocket connection with the server
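    • A minimal sketch of this step (the WebSocket wrapper interface SetHeader()/Connect()/SendText() and the member names are assumptions for illustration):

      bool WebsocketProtocol::OpenAudioChannel() {
          std::string url = CONFIG_WEBSOCKET_URL;  // from build configuration
          // Request headers described in section 2
          websocket_->SetHeader("Authorization", ("Bearer " + access_token_).c_str());
          websocket_->SetHeader("Protocol-Version", "1");
          websocket_->SetHeader("Device-Id", mac_address_.c_str());
          websocket_->SetHeader("Client-Id", uuid_.c_str());
          if (!websocket_->Connect(url.c_str())) {
              return false;  // failure triggers the network-error path (section 6)
          }
          SendText(BuildHelloMessage());  // client "hello", see step 3
          return true;
      }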
  3. Sending Client “hello” Message

    • After successful connection, the device sends a JSON message with the following example structure:
    {
      "type": "hello",
      "version": 1,
      "transport": "websocket",
      "audio_params": {
        "format": "opus",
        "sample_rate": 16000,
        "channels": 1,
        "frame_duration": 60
      }
    }
    • The value of "frame_duration" corresponds to OPUS_FRAME_DURATION_MS (e.g., 60ms).
  4. Server “hello” Response

    • The device waits for the server to return a JSON message containing "type": "hello" and checks if "transport": "websocket" matches.
    • If it matches, the server is considered ready, and the audio channel is marked as successfully open.
    • If no correct reply is received within the timeout period (default 10 seconds), the connection is considered failed and a network error callback is triggered.
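    • A sketch of that wait, assuming a FreeRTOS event group inside OpenAudioChannel() (the document only states that an event flag is set; the event-group usage and the SERVER_HELLO_EVENT name are illustrative):

      EventBits_t bits = xEventGroupWaitBits(
          event_group_,               // flag is set by the "hello" handler in OnData()
          SERVER_HELLO_EVENT,         // hypothetical flag name
          pdTRUE, pdFALSE,
          pdMS_TO_TICKS(10000));      // default 10-second timeout
      if (!(bits & SERVER_HELLO_EVENT)) {
          ESP_LOGE(TAG, "Failed to receive server hello");
          if (on_network_error_ != nullptr) on_network_error_("Server timeout");
          return false;
      }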
  5. Subsequent Message Exchange

    • Two main types of data can be sent between the device and the server:

      1. Binary audio data (Opus encoded)
      2. Text JSON messages (for transmitting chat status, TTS/STT events, IoT commands, etc.)
    • In the code, the receive callbacks are mainly divided into:

      • OnData(...):
        • When binary is true, it is considered an audio frame; the device will decode it as Opus data.
        • When binary is false, it is considered JSON text, which needs to be parsed with cJSON on the device side and processed according to the corresponding business logic (see message structure below).
    • When the server or network disconnects, the OnDisconnected() callback is triggered:

      • The device calls on_audio_channel_closed_() and eventually returns to the idle state.
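    • A sketch of the receive dispatch described above (the callback member names on_incoming_audio_ and on_incoming_json_ are assumptions):

      void WebsocketProtocol::OnData(const char* data, size_t len, bool binary) {
          if (binary) {
              // Opus audio frame: hand to the decode/playback path (section 4)
              if (on_incoming_audio_) {
                  on_incoming_audio_(std::vector<uint8_t>(data, data + len));
              }
          } else {
              // JSON text frame: parse with cJSON and route by "type" (section 3)
              cJSON* root = cJSON_Parse(data);
              cJSON* type = cJSON_GetObjectItem(root, "type");
              if (type != nullptr && on_incoming_json_) {
                  on_incoming_json_(root);
              } else {
                  ESP_LOGE(TAG, "Missing message type, data: %s", data);
              }
              cJSON_Delete(root);
          }
      }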
  6. Closing the WebSocket Connection

    • When the device needs to end a voice session, it will call CloseAudioChannel() to actively disconnect and return to the idle state.
    • Or if the server actively disconnects, it will trigger the same callback process.

2. Common Request Headers

When establishing the WebSocket connection, the code example sets the following request headers:

  • Authorization: Used to store the access token, in the form of "Bearer <token>"
  • Protocol-Version: Fixed as "1" in the example
  • Device-Id: The MAC address of the device's physical network interface
  • Client-Id: Device UUID (can uniquely identify the device in the application)

These headers are sent to the server along with the WebSocket handshake, and the server can perform verification, authentication, etc. as needed.
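
On the wire, the handshake therefore resembles the following (the path, host, and all values are placeholders):

  GET /ws HTTP/1.1
  Host: <server-host>
  Upgrade: websocket
  Connection: Upgrade
  Sec-WebSocket-Key: <random key>
  Sec-WebSocket-Version: 13
  Authorization: Bearer <access_token>
  Protocol-Version: 1
  Device-Id: <MAC address, e.g. aa:bb:cc:dd:ee:ff>
  Client-Id: <UUID>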


3. JSON Message Structure

WebSocket text frames are transmitted in JSON format. The following are the common "type" fields and their corresponding business logic. Fields not listed here may be optional or implementation-specific details.

3.1 Client → Server

  1. Hello

    • Sent by the client after a successful connection to inform the server of basic parameters.
    • Example:
      {
        "type": "hello",
        "version": 1,
        "transport": "websocket",
        "audio_params": {
          "format": "opus",
          "sample_rate": 16000,
          "channels": 1,
          "frame_duration": 60
        }
      }
  2. Listen

    • Indicates that the client is starting or stopping audio listening.
    • Common fields:
      • "session_id": Session identifier
      • "type": "listen"
      • "state": "start", "stop", "detect" (wake-up detection has been triggered)
      • "mode": "auto", "manual" or "realtime", indicating the recognition mode.
    • Example: Start listening
      {
        "session_id": "xxx",
        "type": "listen",
        "state": "start",
        "mode": "manual"
      }
  3. Abort

    • Terminates the current speech (TTS playback) or voice channel.
    • Example:
      {
        "session_id": "xxx",
        "type": "abort",
        "reason": "wake_word_detected"
      }
    • The reason value can be "wake_word_detected" or another implementation-defined string.
  4. Wake Word Detected

    • Used by the client to inform the server that a wake word has been detected.
    • Example:
      {
        "session_id": "xxx",
        "type": "listen",
        "state": "detect",
        "text": "Hello XiaoMing"
      }
  5. IoT

    • Sends IoT-related information about the current device:
      • Descriptors (describing device features, attributes, etc.)
      • States (real-time updates of device status)
    • Example:
      {
        "session_id": "xxx",
        "type": "iot",
        "descriptors": { ... }
      }
      or
      {
        "session_id": "xxx",
        "type": "iot",
        "states": { ... }
      }
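
A sketch of how the client might serialize the "listen" start message (the function name, the ListeningMode enum, and its values are assumptions; the JSON fields follow the examples above):

  void WebsocketProtocol::SendStartListening(ListeningMode mode) {
      std::string message = "{\"session_id\":\"" + session_id_ + "\",";
      message += "\"type\":\"listen\",\"state\":\"start\",";
      if (mode == kListeningModeRealtime) {
          message += "\"mode\":\"realtime\"}";
      } else if (mode == kListeningModeAuto) {
          message += "\"mode\":\"auto\"}";
      } else {
          message += "\"mode\":\"manual\"}";
      }
      SendText(message);  // transmitted as a WebSocket text frame
  }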

3.2 Server → Client

  1. Hello

    • The handshake confirmation message returned by the server.
    • Must contain "type": "hello" and "transport": "websocket".
    • May carry audio_params, indicating the server’s expected audio parameters or configuration aligned with the client.
    • Upon successful reception, the client sets an event flag indicating that the WebSocket channel is ready.
  2. STT

    • {"type": "stt", "text": "..."}
    • Indicates that the server has recognized the user's speech (i.e., the speech-to-text result).
    • The device may display this text on the screen and then proceed to the answering process.
  3. LLM

    • {"type": "llm", "emotion": "happy", "text": "😀"}
    • The server instructs the device to adjust facial animations / UI expressions.
  4. TTS

    • {"type": "tts", "state": "start"}: The server is ready to send TTS audio, and the client enters the “speaking” playback state.
    • {"type": "tts", "state": "stop"}: Indicates the end of this TTS.
    • {"type": "tts", "state": "sentence_start", "text": "..."}
      • Instructs the device to display the text fragment currently being played or read (e.g., to show it to the user).
  5. IoT

    • {"type": "iot", "commands": [ ... ]}
    • The server sends IoT action commands to the device, which the device parses and executes (e.g., turning on lights, setting temperature, etc.).
  6. Audio Data: Binary Frames

    • When the server sends audio binary frames (Opus encoded), the client decodes and plays them.
    • If the client is in a “listening” (recording) state, received audio frames will be ignored or cleared to prevent conflicts.
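
A sketch of routing these server messages on the device (the function name is an assumption; handler bodies are abbreviated to comments):

  void Application::OnIncomingJson(const cJSON* root) {
      const cJSON* type = cJSON_GetObjectItem(root, "type");
      if (!cJSON_IsString(type)) return;
      if (strcmp(type->valuestring, "tts") == 0) {
          // "start": enter the speaking state; "stop": playback ends;
          // "sentence_start": show the "text" field on the display
      } else if (strcmp(type->valuestring, "stt") == 0) {
          // display the recognized "text"
      } else if (strcmp(type->valuestring, "llm") == 0) {
          // update the UI expression from "emotion"
      } else if (strcmp(type->valuestring, "iot") == 0) {
          // hand "commands" to thing_manager for execution
      }
  }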

4. Audio Encoding and Decoding

  1. Client Sending Recording Data

    • Audio input, after optional echo cancellation, noise reduction, or volume gain, is Opus-encoded into binary frames and sent to the server.
    • Each encoded frame, whatever its size N in bytes, is sent as a single WebSocket binary message.
  2. Client Playing Received Audio

    • When receiving binary frames from the server, they are likewise considered Opus data.
    • The device side will decode them and then pass them to the audio output interface for playback.
    • If the server's audio sample rate differs from the device's, the audio is resampled after decoding.
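
With the default parameters, the frame geometry works out as below; the decode call is a minimal libopus sketch (decoder creation and error handling are omitted):

  // 16000 Hz * 60 ms / 1000 = 960 samples per frame
  // (mono, 16-bit PCM: 1920 bytes of raw audio per frame before encoding)
  constexpr int kSampleRate      = 16000;
  constexpr int kFrameDurationMs = 60;  // OPUS_FRAME_DURATION_MS
  constexpr int kSamplesPerFrame = kSampleRate * kFrameDurationMs / 1000;  // 960

  // Decoding one received binary frame:
  opus_int16 pcm[kSamplesPerFrame];
  int samples = opus_decode(decoder, payload, payload_len, pcm, kSamplesPerFrame, 0);
  // samples == 960 on success; a negative value is a libopus error code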

5. Common State Transitions

The following briefly describes the key state transitions of the device side, corresponding to WebSocket messages:

  1. Idle → Connecting

    • After a user trigger or wake word, the device calls OpenAudioChannel() → establishes the WebSocket connection → sends "type":"hello".
  2. Connecting → Listening

    • After the connection is established successfully, once SendStartListening(...) is called, the device enters the recording state. It then continuously encodes microphone data and sends it to the server.
  3. Listening → Speaking

    • Receiving a server TTS Start message ({"type":"tts","state":"start"}) → stops recording and plays the received audio.
  4. Speaking → Idle

    • Server TTS stop ({"type":"tts","state":"stop"}) → audio playback ends. If automatic listening is not configured, the device returns to Idle; if automatic cycling is configured, it enters Listening again.
  5. Listening / Speaking → Idle (encountering an exception or active interruption)

    • Calling SendAbortSpeaking(...) or CloseAudioChannel() → interrupts the session → closes the WebSocket → state returns to Idle.
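
The states above can be summarized as a simple enumeration (the type and value names are assumptions based on this document):

  enum class DeviceState {
      kIdle,        // no active session
      kConnecting,  // WebSocket handshake and hello exchange in progress
      kListening,   // recording and streaming Opus frames to the server
      kSpeaking,    // playing TTS audio received from the server
  };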

6. Error Handling

  1. Connection Failure

    • If Connect(url) fails, or waiting for the server's "hello" message times out, the on_network_error_() callback is triggered, and the device displays "Unable to connect to service" or a similar error message.
  2. Server Disconnection

    • If the WebSocket abnormally disconnects, the OnDisconnected() callback is triggered:
      • The device calls back on_audio_channel_closed_()
      • Switches to Idle or other retry logic.
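
A sketch of the disconnect path (the callback names follow this document):

  void WebsocketProtocol::OnDisconnected() {
      if (on_audio_channel_closed_) {
          on_audio_channel_closed_();  // device returns to Idle or runs retry logic
      }
  }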

7. Other Considerations

  1. Authentication

    • The device authenticates by setting Authorization: Bearer <token>; the server must verify that the token is valid.
    • If the token is expired or invalid, the server can refuse the handshake or disconnect later.
  2. Session Control

    • Some messages in the code contain a session_id, used to distinguish independent conversations or operations. The server can handle different sessions separately as needed; in the current WebSocket implementation, session_id may be an empty string.
  3. Audio Payload

    • The code defaults to using Opus format and sets sample_rate = 16000, mono. The frame duration is controlled by OPUS_FRAME_DURATION_MS, usually 60ms. It can be adjusted according to bandwidth or performance requirements.
  4. IoT Commands

    • Messages with "type":"iot" in the client code interface with thing_manager to execute specific commands, which vary by device customization. The server needs to ensure that the issued format is consistent with the client.
  5. Error or Malformed JSON

    • When a JSON message is missing a required field such as "type", the client logs an error (ESP_LOGE(TAG, "Missing message type, data: %s", data);) and does not execute any business logic.

8. Message Examples

Below is a typical example of bidirectional messages (simplified flow):

  1. Client → Server (handshake)

    {
      "type": "hello",
      "version": 1,
      "transport": "websocket",
      "audio_params": {
        "format": "opus",
        "sample_rate": 16000,
        "channels": 1,
        "frame_duration": 60
      }
    }
  2. Server → Client (handshake response)

    {
      "type": "hello",
      "transport": "websocket",
      "audio_params": {
        "sample_rate": 16000
      }
    }
  3. Client → Server (start listening)

    {
      "session_id": "",
      "type": "listen",
      "state": "start",
      "mode": "auto"
    }

    At the same time, the client begins sending binary frames (Opus data).

  4. Server → Client (ASR result)

    {
      "type": "stt",
      "text": "User's speech"
    }
  5. Server → Client (TTS start)

    {
      "type": "tts",
      "state": "start"
    }

    The server then sends binary audio frames to the client for playback.

  6. Server → Client (TTS end)

    {
      "type": "tts",
      "state": "stop"
    }

    The client stops playing audio and returns to the idle state if there are no further instructions.


9. Summary

Over a single WebSocket connection, this protocol transmits JSON text and binary audio frames to implement audio-stream upload, TTS audio playback, speech recognition, state management, and IoT command delivery. Its core characteristics:

  • Handshake Phase: Sending "type":"hello", waiting for server response.
  • Audio Channel: Bidirectional transmission of voice streams using Opus-encoded binary frames.
  • JSON Messages: Using "type" as the core field to identify different business logic, including TTS, STT, IoT, WakeWord, etc.
  • Extensibility: Fields can be added to JSON messages or additional authentication in headers based on actual requirements.

The server and client need to agree in advance on the meaning of message fields, the sequencing logic, and the error-handling rules to ensure smooth communication. The information above can serve as baseline documentation for subsequent integration, development, and extension.