WebSocket Communication Protocol

The following WebSocket communication protocol document is organized from the code implementation and outlines how clients (devices) interact with the server over WebSocket. It is inferred solely from the provided code, so details may need to be confirmed or supplemented against the server-side implementation during actual deployment.


1. Overall Process Overview

  1. Device-side Initialization

    • Device powers on and initializes the Application:
      • Initialize audio codec, display, LED, etc.
      • Connect to network
      • Create and initialize WebSocket protocol instance (WebsocketProtocol) implementing Protocol interface
    • Enter main loop waiting for events (audio input, audio output, scheduled tasks, etc.).
  2. Establish WebSocket Connection

    • When the device needs to start a voice session (e.g., wake word triggered, button pressed manually), it calls OpenAudioChannel():
      • Get WebSocket URL from compile configuration (CONFIG_WEBSOCKET_URL)
      • Set several request headers (Authorization, Protocol-Version, Device-Id, Client-Id)
      • Call Connect() to establish WebSocket connection with server
  3. Send Client “hello” Message

    • After successful connection, device sends a JSON message with example structure:
    {
      "type": "hello",
      "version": 1,
      "transport": "websocket",
      "audio_params": {
        "format": "opus",
        "sample_rate": 16000,
        "channels": 1,
        "frame_duration": 60
      }
    }
    • The "frame_duration" value corresponds to OPUS_FRAME_DURATION_MS (e.g., 60ms).
  4. Server Replies “hello”

    • Device waits for server to return a JSON message containing "type": "hello" and checks if "transport": "websocket" matches.
    • If matched, server is considered ready and audio channel opening is marked successful.
    • If the correct reply is not received within the timeout (default 10 seconds), the connection is considered failed and the network error callback is triggered (see the connection-flow sketch after this list).
  5. Subsequent Message Interaction

    • Device and server can send two main types of data:

      1. Binary audio data (Opus encoded)
      2. Text JSON messages (for transmitting chat status, TTS/STT events, IoT commands, etc.)
    • In the code, receive callbacks are mainly divided into:

      • OnData(...):
        • When binary is true, it’s considered an audio frame; device treats it as Opus data for decoding.
        • When binary is false, it’s considered JSON text, requiring cJSON parsing on device side for corresponding business logic (see message structure below).
    • When server or network disconnects, OnDisconnected() callback is triggered:

      • Device calls on_audio_channel_closed_() and eventually returns to idle state.
  6. Close WebSocket Connection

    • When ending a voice session, the device actively disconnects by calling CloseAudioChannel() and returns to the idle state.
    • Or if server actively disconnects, same callback flow is triggered.
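
As a reference for steps 2-5, the following is a minimal sketch of the connection flow. It is assembled from the description above, not taken from the actual implementation: member names such as web_socket_, event_group_, kServerHelloEvent, and BuildHelloMessage() are assumptions.

    bool WebsocketProtocol::OpenAudioChannel() {
        // Request headers (see section 2) are assumed to be set before Connect().
        if (!web_socket_->Connect(CONFIG_WEBSOCKET_URL)) {
            if (on_network_error_) on_network_error_("Cannot connect to service");
            return false;
        }
        // Step 3: send the client "hello" JSON shown above.
        web_socket_->Send(BuildHelloMessage());
        // Step 4: wait up to 10 seconds for the server "hello".
        EventBits_t bits = xEventGroupWaitBits(event_group_, kServerHelloEvent,
                                               pdTRUE, pdTRUE, pdMS_TO_TICKS(10000));
        if (!(bits & kServerHelloEvent)) {
            if (on_network_error_) on_network_error_("Server hello timeout");
            return false;
        }
        return true;
    }

    // Step 5: receive dispatch; binary frames are Opus audio, text frames are JSON.
    void WebsocketProtocol::OnData(const char* data, size_t len, bool binary) {
        if (binary) {
            if (on_incoming_audio_) on_incoming_audio_(std::string(data, len));
        } else {
            ParseServerJson(data, len);  // see the parsing sketch in section 7
        }
    }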

2. Common Request Headers

When establishing WebSocket connection, the code example sets the following request headers:

  • Authorization: Stores access token, formatted as "Bearer <token>"
  • Protocol-Version: Fixed as "1" in example
  • Device-Id: Device physical network card MAC address
  • Client-Id: Device UUID (can uniquely identify device in application)

These headers are sent to server along with WebSocket handshake, and server can perform verification, authentication, etc., as needed.
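
A hypothetical setup sequence before Connect() might look like this; SetHeader() and the member variables are assumed names for illustration, not the exact API:

    // All four headers ride along with the WebSocket upgrade request.
    web_socket_->SetHeader("Authorization", ("Bearer " + access_token_).c_str());
    web_socket_->SetHeader("Protocol-Version", "1");
    web_socket_->SetHeader("Device-Id", mac_address_.c_str());  // MAC address string
    web_socket_->SetHeader("Client-Id", device_uuid_.c_str());  // UUID string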


3. JSON Message Structure

WebSocket text frames are transmitted in JSON format. The following are common "type" fields and their corresponding business logic. Fields not listed may be optional or implementation-specific details.

3.1 Client→Server

  1. Hello

    • Sent by client after successful connection to inform server of basic parameters.
    • Example:
      {
        "type": "hello",
        "version": 1,
        "transport": "websocket",
        "audio_params": {
          "format": "opus",
          "sample_rate": 16000,
          "channels": 1,
          "frame_duration": 60
        }
      }
  2. Listen

    • Indicates that the client starts or stops listening (recording); a send-side sketch follows this list.
    • Common fields:
      • "session_id": Session identifier
      • "type": "listen"
      • "state": "start", "stop", "detect" (wake detection triggered)
      • "mode": "auto", "manual" or "realtime", indicating recognition mode.
    • Example: Start listening
      {
        "session_id": "xxx",
        "type": "listen",
        "state": "start",
        "mode": "manual"
      }
  3. Abort

    • Terminate current speaking (TTS playback) or voice channel.
    • Example:
      {
        "session_id": "xxx",
        "type": "abort",
        "reason": "wake_word_detected"
      }
    • The "reason" value can be "wake_word_detected", among other values.
  4. Wake Word Detected

    • Used by the client to inform the server that a wake word has been detected.
    • Example:
      {
        "session_id": "xxx",
        "type": "listen",
        "state": "detect",
        "text": "hello xiaoming"
      }
  5. IoT

    • Send current device’s IoT-related information:
      • Descriptors (describing device functions, attributes, etc.)
      • States (real-time updates of device status)
    • Example:
      {
        "session_id": "xxx",
        "type": "iot",
        "descriptors": { ... }
      }
      or
      {
        "session_id": "xxx",
        "type": "iot",
        "states": { ... }
      }
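
As a send-side sketch for these messages, the "listen start" example might be composed with cJSON as follows. The function and enum names (SendText(), ListeningMode) are assumptions for illustration:

    enum class ListeningMode { kAuto, kManual, kRealtime };

    void WebsocketProtocol::SendStartListening(ListeningMode mode) {
        cJSON* root = cJSON_CreateObject();
        cJSON_AddStringToObject(root, "session_id", session_id_.c_str());
        cJSON_AddStringToObject(root, "type", "listen");
        cJSON_AddStringToObject(root, "state", "start");
        cJSON_AddStringToObject(root, "mode",
                                mode == ListeningMode::kManual ? "manual" : "auto");
        char* text = cJSON_PrintUnformatted(root);
        SendText(text);             // sent as a WebSocket text frame
        cJSON_free(text);
        cJSON_Delete(root);
    }

The Abort, Wake Word Detected, and IoT messages follow the same pattern with the fields listed above.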

3.2 Server→Client

  1. Hello

    • Handshake confirmation message returned by server.
    • Must contain "type": "hello" and "transport": "websocket".
    • May include audio_params, indicating server’s expected audio parameters or configuration aligned with client.
    • After successful reception, client sets event flag indicating WebSocket channel is ready.
  2. STT

    • {"type": "stt", "text": "..."}
    • Indicates server has recognized user speech (e.g., speech-to-text result)
    • Device may display this text on screen, then proceed to response flow.
  3. LLM

    • {"type": "llm", "emotion": "happy", "text": "😀"}
    • Server instructs device to adjust emotion animation / UI expression.
  4. TTS

    • {"type": "tts", "state": "start"}: Server prepares to send TTS audio, client enters “speaking” playback state.
    • {"type": "tts", "state": "stop"}: Indicates current TTS session ended.
    • {"type": "tts", "state": "sentence_start", "text": "..."}
      • Instructs the device to display the text segment about to be played on its interface (e.g., so the user can read along).
  5. IoT

    • {"type": "iot", "commands": [ ... ]}
    • Server sends IoT action commands to device, device parses and executes (e.g., turn on lights, set temperature, etc.).
  6. Audio Data: Binary Frames

    • When server sends audio binary frames (Opus encoded), client decodes and plays.
    • If the client is in the “listening” (recording) state, received audio frames are ignored or cleared to prevent conflicts (a dispatch sketch for server messages follows this list).
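
A dispatch sketch for these server messages, assuming the JSON has already been parsed with cJSON; the handler comments map to the items above, and the function name is an assumption:

    void Application::OnIncomingJson(const cJSON* root) {
        const cJSON* type = cJSON_GetObjectItem(root, "type");
        if (!cJSON_IsString(type)) return;  // malformed, see section 7
        if (strcmp(type->valuestring, "tts") == 0) {
            // "start" -> enter speaking state; "stop" -> leave it;
            // "sentence_start" -> display the "text" field on screen
        } else if (strcmp(type->valuestring, "stt") == 0) {
            // show the recognized text, then proceed to the response flow
        } else if (strcmp(type->valuestring, "llm") == 0) {
            // update the emotion animation / UI expression
        } else if (strcmp(type->valuestring, "iot") == 0) {
            // hand the "commands" array to thing_manager for execution
        }
    }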

4. Audio Encoding/Decoding

  1. Client Sends Recording Data

    • Audio input, after possible echo cancellation, noise reduction, or volume gain, is packaged via Opus encoding into binary frames and sent to server.
    • Each encoded frame of N bytes is sent as one WebSocket binary message (a frame-size calculation and encode-loop sketch follow this list).
  2. Client Plays Received Audio

    • When receiving binary frames from server, they are also considered Opus data.
    • Device performs decoding, then sends to audio output interface for playback.
    • If server’s audio sample rate differs from device, resampling is performed after decoding.
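
The frame size follows directly from the parameters: at 16000 Hz mono with a 60 ms frame, each frame carries 16000 × 0.060 = 960 samples (1920 bytes of 16-bit PCM) before encoding. Below is a minimal encode-and-send sketch; the WebSocket wrapper and its Send() signature are assumptions:

    #include <opus.h>

    // Assumed wrapper interface; the real API may differ.
    struct WebSocket { void Send(const void* data, size_t len, bool binary); };

    constexpr int kSampleRate      = 16000;
    constexpr int kFrameDurationMs = 60;   // OPUS_FRAME_DURATION_MS
    constexpr int kFrameSamples    = kSampleRate * kFrameDurationMs / 1000;  // 960

    void SendAudioFrame(OpusEncoder* encoder, WebSocket* ws, const int16_t* pcm) {
        unsigned char payload[1500];       // Opus output size varies per frame
        int n = opus_encode(encoder, pcm, kFrameSamples, payload, sizeof(payload));
        if (n > 0) {
            ws->Send(payload, n, /*binary=*/true);  // one binary WebSocket message
        }
    }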

5. Common State Transitions

The following briefly describes the key device-side state transitions and the WebSocket messages they correspond to (a minimal state-machine sketch follows the list):

  1. Idle → Connecting

    • After user trigger or wake-up, device calls OpenAudioChannel() → establish WebSocket connection → send "type":"hello".
  2. Connecting → Listening

    • After the connection is established, if the device goes on to call SendStartListening(...), it enters the recording state, continuously encoding microphone data and sending it to the server.
  3. Listening → Speaking

    • Receive server TTS Start message ({"type":"tts","state":"start"}) → stop recording and play received audio.
  4. Speaking → Idle

    • Server TTS Stop ({"type":"tts","state":"stop"}) → audio playback ends. If not continuing to auto-listen, return to Idle; if configured for auto-loop, re-enter Listening.
  5. Listening / Speaking → Idle (encountering exceptions or active interruption)

    • Call SendAbortSpeaking(...) or CloseAudioChannel() → interrupt session → close WebSocket → state returns to Idle.
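
A minimal state-machine sketch matching the transitions above; the enum and member names are assumptions:

    enum class DeviceState { kIdle, kConnecting, kListening, kSpeaking };

    void Application::OnTtsState(const std::string& state) {
        if (state == "start") {
            device_state_ = DeviceState::kSpeaking;    // 3: stop recording, play TTS
        } else if (state == "stop") {
            // 4: back to Idle, or re-enter Listening when auto-loop is configured
            device_state_ = auto_listen_ ? DeviceState::kListening
                                         : DeviceState::kIdle;
        }
    }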

6. Error Handling

  1. Connection Failure

    • If Connect(url) fails or the wait for the server “hello” message times out, the on_network_error_() callback is triggered and the device shows “Cannot connect to service” or a similar error message.
  2. Server Disconnect

    • If the WebSocket disconnects abnormally, the OnDisconnected() callback fires (a minimal handler sketch follows this list):
      • The device invokes on_audio_channel_closed_()
      • Switches to Idle or runs other retry logic.
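
A minimal handler sketch for the disconnect path, using the callback names from the text:

    void WebsocketProtocol::OnDisconnected() {
        if (on_audio_channel_closed_) {
            on_audio_channel_closed_();   // device returns to Idle or retries
        }
    }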

7. Other Notes

  1. Authentication

    • The device authenticates by setting Authorization: Bearer <token>; the server must verify its validity.
    • If token expires or is invalid, server can refuse handshake or disconnect subsequently.
  2. Session Control

    • Some messages in the code carry a session_id to distinguish independent conversations or operations. The server can process different sessions separately as needed; the WebSocket layer itself attaches no session semantics.
  3. Audio Payload

    • Code defaults to Opus format with sample_rate = 16000, mono. Frame duration is controlled by OPUS_FRAME_DURATION_MS, typically 60ms. Can be adjusted appropriately based on bandwidth or performance.
  4. IoT Commands

    • "type":"iot" messages interface with thing_manager on user side to execute specific commands, varying by device customization. Server needs to ensure sent format remains consistent with client.
  5. Error or Exception JSON

    • When a JSON message lacks a required field such as "type", the client logs an error (ESP_LOGE(TAG, "Missing message type, data: %s", data);) and executes no business logic (a parsing sketch follows this list).
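
A defensive-parsing sketch for item 5; the log line comes from the text, while the wrapper function and callback are assumptions:

    void WebsocketProtocol::ParseServerJson(const char* data, size_t len) {
        cJSON* root = cJSON_ParseWithLength(data, len);
        if (root == nullptr) {
            return;                        // not valid JSON, ignore silently
        }
        const cJSON* type = cJSON_GetObjectItem(root, "type");
        if (!cJSON_IsString(type)) {
            // assumes data is NUL-terminated, as in the log line above
            ESP_LOGE(TAG, "Missing message type, data: %s", data);
        } else if (on_incoming_json_) {
            on_incoming_json_(root);       // dispatch by "type", see section 3.2
        }
        cJSON_Delete(root);
    }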

8. Message Examples

Below is a typical bidirectional message example (simplified flow illustration):

  1. Client → Server (handshake)

    {
      "type": "hello",
      "version": 1,
      "transport": "websocket",
      "audio_params": {
        "format": "opus",
        "sample_rate": 16000,
        "channels": 1,
        "frame_duration": 60
      }
    }
  2. Server → Client (handshake response)

    {
      "type": "hello",
      "transport": "websocket",
      "audio_params": {
        "sample_rate": 16000
      }
    }
  3. Client → Server (start listening)

    {
      "session_id": "",
      "type": "listen",
      "state": "start",
      "mode": "auto"
    }

    Client simultaneously starts sending binary frames (Opus data).

  4. Server → Client (ASR result)

    {
      "type": "stt",
      "text": "what user said"
    }
  5. Server → Client (TTS start)

    {
      "type": "tts",
      "state": "start"
    }

    Server then sends binary audio frames to client for playback.

  6. Server → Client (TTS end)

    {
      "type": "tts",
      "state": "stop"
    }

    Client stops audio playback, returns to idle state if no more commands.


9. Summary

This protocol transmits JSON text and binary audio frames over WebSocket, completing functions including audio stream upload, TTS audio playback, speech recognition and state management, IoT command delivery, etc. Core characteristics:

  • Handshake Phase: Send "type":"hello" and wait for the server's "hello" reply.
  • Audio Channel: Use Opus-encoded binary frames for bidirectional voice stream transmission.
  • JSON Messages: Use "type" as core field to identify different business logic, including TTS, STT, IoT, WakeWord, etc.
  • Extensibility: Fields can be added to JSON messages or additional authentication in headers based on actual needs.

Server and client need to agree in advance on field meanings, timing logic, and error handling rules for various message types to ensure smooth communication. The above information can serve as basic documentation for subsequent interfacing, development, or extension.