WebSocket Communication Protocol

This document describes the WebSocket communication protocol inferred from the client code, outlining how clients (devices) interact with the server over WebSocket. It is based on the provided implementation and may need to be confirmed or supplemented against the actual server-side implementation at deployment time.


1. Overall Process Overview

  1. Device Initialization

    • Device powers on and initializes the Application:
      • Initializes the audio codec, display, LED, etc.
      • Connects to the network
      • Creates and initializes a WebSocket protocol instance (WebsocketProtocol) that implements the Protocol interface
    • Enters the main loop, waiting for events (audio input, audio output, scheduled tasks, etc.)
  2. Establishing WebSocket Connection

    • When the device needs to start a voice session (e.g., user wake-up, manual button trigger), it calls OpenAudioChannel():
      • Gets the WebSocket URL from the compile-time configuration (CONFIG_WEBSOCKET_URL)
      • Sets several request headers (Authorization, Protocol-Version, Device-Id, Client-Id)
      • Calls Connect() to establish the WebSocket connection with the server
  3. Sending Client “hello” Message

    • After the connection succeeds, the device sends a JSON message with the following structure:
    {
      "type": "hello",
      "version": 1,
      "transport": "websocket",
      "audio_params": {
        "format": "opus",
        "sample_rate": 16000,
        "channels": 1,
        "frame_duration": 60
      }
    }
    • Where "frame_duration" value corresponds to OPUS_FRAME_DURATION_MS (e.g., 60ms)
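
As an illustration only (this helper is not taken from the firmware code; its name and the 60 ms default mirroring OPUS_FRAME_DURATION_MS are assumptions), the client "hello" payload above could be built like this in Python:

  import json

  def build_hello(frame_duration_ms: int = 60) -> str:
      """Build the client "hello" text frame (hypothetical helper)."""
      return json.dumps({
          "type": "hello",
          "version": 1,
          "transport": "websocket",
          "audio_params": {
              "format": "opus",
              "sample_rate": 16000,
              "channels": 1,
              "frame_duration": frame_duration_ms,  # matches OPUS_FRAME_DURATION_MS
          },
      })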

2. Common Request Headers

When establishing the WebSocket connection, the code example sets the following request headers:

  • Authorization: Carries the access token, in the format "Bearer <token>"
  • Protocol-Version: Fixed at "1" in the example
  • Device-Id: The device's physical MAC address
  • Client-Id: The device UUID (uniquely identifies the device within the application)

These headers are sent with the WebSocket handshake so the server can validate and authenticate the device; a handshake sketch follows below.
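
For reference, a test client might pass these headers during the handshake as shown below. This is a minimal sketch, assuming the third-party websocket-client package; the URL, token, MAC address, and UUID are placeholders, not values defined by the protocol:

  import uuid
  import websocket  # pip install websocket-client

  WS_URL = "wss://example.com/ws"    # placeholder; the device uses CONFIG_WEBSOCKET_URL
  ACCESS_TOKEN = "<token>"           # placeholder access token
  DEVICE_MAC = "00:11:22:33:44:55"   # placeholder physical MAC address
  CLIENT_UUID = str(uuid.uuid4())    # placeholder device UUID

  ws = websocket.create_connection(
      WS_URL,
      header=[
          f"Authorization: Bearer {ACCESS_TOKEN}",
          "Protocol-Version: 1",
          f"Device-Id: {DEVICE_MAC}",
          f"Client-Id: {CLIENT_UUID}",
      ],
  )

After the handshake succeeds, the client sends the "hello" text frame from section 1 over this connection with ws.send(...).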

3. JSON Message Structure

WebSocket text frames are transmitted in JSON format. Below are the common "type" fields and their corresponding business logic; fields not listed may be optional or implementation-specific details. Illustrative sketches follow each subsection.

3.1 Client → Server

  1. Hello

    • Sent by the client after the connection is established to inform the server of basic parameters.
    • Example:
      {
        "type": "hello",
        "version": 1,
        "transport": "websocket",
        "audio_params": {
          "format": "opus",
          "sample_rate": 16000,
          "channels": 1,
          "frame_duration": 60
        }
      }
  2. Listen

    • Indicates that the client starts or stops capturing audio.
    • Common fields:
      • "session_id": Session identifier
      • "type": "listen"
      • "state": "start", "stop", "detect" (wake word detected)
      • "mode": "auto", "manual" or "realtime", indicates recognition mode.
    • Example: Start listening
      {
        "session_id": "xxx",
        "type": "listen",
        "state": "start",
        "mode": "manual"
      }
  3. Abort

    • Terminates the current speech output (TTS playback) or the audio channel.
    • Example:
      {
        "session_id": "xxx",
        "type": "abort",
        "reason": "wake_word_detected"
      }
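
As a companion to the examples above, the two client messages could be serialized with small helpers like these; the function names and the session_id handling are assumptions for illustration, not firmware APIs:

  import json

  def build_listen(session_id: str, state: str, mode: str = "manual") -> str:
      """Build a "listen" message; state is "start", "stop", or "detect"."""
      return json.dumps({
          "session_id": session_id,
          "type": "listen",
          "state": state,
          "mode": mode,
      })

  def build_abort(session_id: str, reason: str = "wake_word_detected") -> str:
      """Build an "abort" message that stops the current TTS playback."""
      return json.dumps({
          "session_id": session_id,
          "type": "abort",
          "reason": reason,
      })

For example, build_listen("xxx", "start", "manual") produces the "start listening" message shown earlier.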

3.2 Server → Client

  1. Hello

    • Server’s handshake confirmation message.
    • Must contain "type": "hello" and "transport": "websocket".
    • May include audio_params indicating server’s expected audio parameters.
  2. STT

    • Speech-to-text results: {"type": "stt", "text": "..."}
    • Indicates that the server has recognized the user's speech.
  3. LLM

    • {"type": "llm", "emotion": "happy", "text": "😀"}
    • The server instructs the device to adjust its expression animation/UI.
  4. TTS

    • {"type": "tts", "state": "start"}: Server preparing to send TTS audio
    • {"type": "tts", "state": "stop"}: TTS session ended
    • {"type": "tts", "state": "sentence_start", "text": "..."}: Display text

4. Audio Encoding/Decoding

  1. Client Audio Transmission

    • Audio input is processed with echo cancellation and noise reduction
    • Encoded with the Opus codec
    • Sent to the server as binary WebSocket frames (see the framing sketch after this list)
  2. Client Audio Playback

    • Binary frames received from the server are treated as Opus data
    • Decoded and played through the audio output interface
    • Resampling is performed if necessary
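
The framing on the wire can be sketched as below; Opus encoding and decoding themselves happen in the device's codec pipeline and are omitted here, and the ws object is assumed to be the websocket-client connection opened in section 2:

  def send_audio_frame(ws, encoded_frame: bytes) -> None:
      """Send one Opus-encoded packet as a binary WebSocket frame."""
      ws.send_binary(encoded_frame)

  def receive_frame(ws):
      """Receive one frame: bytes are Opus audio, str is a JSON control message."""
      frame = ws.recv()
      if isinstance(frame, bytes):
          return ("audio", frame)   # hand to the Opus decoder / playback path
      return ("json", frame)        # hand to handle_server_message()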

5. Common State Transitions

  1. Idle → Connecting

    • Triggered by user or wake word
    • OpenAudioChannel() → WebSocket connection → Send "type":"hello"
  2. Connecting → Listening

    • After successful connection
    • SendStartListening(...) → Begin recording
  3. Listening → Speaking

    • Receive TTS Start → Stop recording → Play received audio
  4. Speaking → Idle

    • TTS Stop → End playback → Return to Idle or automatically resume listening (a state-machine sketch follows this list)
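
A minimal state machine capturing these transitions might look like the following; the state names mirror the list above, but the enum and transition table are assumptions, not types from the firmware:

  from enum import Enum, auto

  class DeviceState(Enum):
      IDLE = auto()
      CONNECTING = auto()
      LISTENING = auto()
      SPEAKING = auto()

  # Allowed transitions mirroring the list above (failure paths lead back to IDLE).
  TRANSITIONS = {
      DeviceState.IDLE: {DeviceState.CONNECTING},
      DeviceState.CONNECTING: {DeviceState.LISTENING, DeviceState.IDLE},
      DeviceState.LISTENING: {DeviceState.SPEAKING, DeviceState.IDLE},
      DeviceState.SPEAKING: {DeviceState.IDLE, DeviceState.LISTENING},
  }

  def transition(current: DeviceState, new_state: DeviceState) -> DeviceState:
      """Validate and perform a state change (sketch)."""
      if new_state not in TRANSITIONS[current]:
          raise ValueError(f"invalid transition: {current.name} -> {new_state.name}")
      return new_state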

6. Error Handling

  1. Connection Failure

    • Connect(url) fails, or waiting for the server "hello" times out (see the sketch after this list)
    • Triggers on_network_error_()
  2. Server Disconnection

    • Abnormal WebSocket disconnect
    • Triggers OnDisconnected()
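
These failure paths can be sketched with the websocket-client package as follows; the 10-second timeout and the function name are assumptions, and the raised ConnectionError stands in for the device's on_network_error_() / OnDisconnected() callbacks:

  import json
  import websocket  # pip install websocket-client

  HELLO_TIMEOUT_S = 10  # assumed value; the real timeout is implementation-defined

  def open_audio_channel(url: str, headers: list) -> websocket.WebSocket:
      """Connect and wait for the server "hello"; raise on any failure (sketch)."""
      try:
          ws = websocket.create_connection(url, header=headers, timeout=HELLO_TIMEOUT_S)
          reply = json.loads(ws.recv())  # raises on timeout or abnormal disconnect
      except (websocket.WebSocketException, OSError, ValueError) as err:
          # Device side: on_network_error_() for connect/timeout failures,
          # OnDisconnected() when an established connection drops.
          raise ConnectionError(f"failed to open audio channel: {err}") from err
      if reply.get("type") != "hello" or reply.get("transport") != "websocket":
          ws.close()
          raise ConnectionError("unexpected handshake reply")
      return ws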

7. Additional Notes

  1. Authentication

    • Uses Authorization: Bearer <token>
    • Server validates token during handshake
  2. Session Control

    • The server and client must agree on message fields, timing logic, and error-handling rules
    • This document serves as a foundation for integration