WebSocket Communication Protocol
This document describes the WebSocket communication protocol as implemented in the code, outlining how clients (devices) interact with the server over WebSocket. It is inferred from the provided code and may need confirmation or supplementation against the actual server-side implementation before deployment.
1. Overall Process Overview
Device Initialization
- Device powers on and initializes Application:
  - Initializes audio codec, display, LED, etc.
  - Connects to network
  - Creates and initializes WebSocket protocol instance (WebsocketProtocol) implementing the Protocol interface
- Enters main loop waiting for events (audio input, audio output, scheduling tasks, etc.)
Establishing WebSocket Connection
- When the device needs to start a voice session (e.g., user wake-up, manual button trigger), it calls OpenAudioChannel():
  - Gets WebSocket URL from compilation configuration (CONFIG_WEBSOCKET_URL)
  - Sets several request headers (Authorization, Protocol-Version, Device-Id, Client-Id)
  - Calls Connect() to establish WebSocket connection with server
Sending Client “hello” Message
- After successful connection, device sends a JSON message with structure:
{ "type": "hello", "version": 1, "transport": "websocket", "audio_params": { "format": "opus", "sample_rate": 16000, "channels": 1, "frame_duration": 60 } }
- The "frame_duration" value corresponds to OPUS_FRAME_DURATION_MS (e.g., 60 ms)
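The "hello" payload above can be assembled programmatically. The following is a minimal sketch, not the firmware's own code: the function name build_client_hello and the Python constant standing in for OPUS_FRAME_DURATION_MS are assumptions made for illustration.

```python
import json

# Assumed stand-in for the firmware's OPUS_FRAME_DURATION_MS constant.
OPUS_FRAME_DURATION_MS = 60


def build_client_hello(sample_rate=16000, channels=1,
                       frame_duration_ms=OPUS_FRAME_DURATION_MS):
    """Return the client "hello" message as a JSON string (hypothetical helper)."""
    message = {
        "type": "hello",
        "version": 1,
        "transport": "websocket",
        "audio_params": {
            "format": "opus",
            "sample_rate": sample_rate,
            "channels": channels,
            "frame_duration": frame_duration_ms,
        },
    }
    return json.dumps(message)
```

The resulting string would be sent as a WebSocket text frame right after the connection is established.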
2. Common Request Headers
When establishing a WebSocket connection, the following headers are set in the code example:
- Authorization: For access token, in format "Bearer <token>"
- Protocol-Version: Fixed as "1" in example
- Device-Id: Device physical MAC address
- Client-Id: Device UUID (uniquely identifies device in application)
These headers are sent with the WebSocket handshake for server validation and authentication.
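As a rough illustration, the four handshake headers can be collected in a plain dictionary before being handed to whatever WebSocket client library is in use. The helper name build_handshake_headers and its parameter names are hypothetical; only the header names and the "Bearer <token>" format come from this document.

```python
def build_handshake_headers(token, device_mac, client_uuid, protocol_version="1"):
    """Return the handshake headers described above (hypothetical helper)."""
    return {
        "Authorization": f"Bearer {token}",    # access token in Bearer format
        "Protocol-Version": protocol_version,  # fixed as "1" in the example
        "Device-Id": device_mac,               # device physical MAC address
        "Client-Id": client_uuid,              # device UUID
    }
```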
3. JSON Message Structure
WebSocket text frames are transmitted in JSON format. Below are common "type" fields and their corresponding business logic. Fields not listed may be optional or specific implementation details.
3.1 Client → Server
Hello
- Sent by client after connection to inform server of basic parameters.
- Example:
{ "type": "hello", "version": 1, "transport": "websocket", "audio_params": { "format": "opus", "sample_rate": 16000, "channels": 1, "frame_duration": 60 } }
Listen
- Indicates client starts or stops audio listening.
- Common fields:
  - "session_id": Session identifier
  - "type": "listen"
  - "state": "start", "stop", or "detect" (wake word detected)
  - "mode": "auto", "manual", or "realtime"; indicates recognition mode
- Example: Start listening
{ "session_id": "xxx", "type": "listen", "state": "start", "mode": "manual" }
Abort
- Terminates current speech (TTS playback) or audio channel.
- Example:
{ "session_id": "xxx", "type": "abort", "reason": "wake_word_detected" }
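The Listen and Abort messages can be built with small helpers that reject values outside the documented sets. This is a sketch under assumptions: the helper names are hypothetical, and treating "mode" and "reason" as optional is a guess, since this document does not say which fields are mandatory.

```python
import json

# Values listed in this document for the Listen message.
LISTEN_STATES = {"start", "stop", "detect"}
LISTEN_MODES = {"auto", "manual", "realtime"}


def build_listen(session_id, state, mode=None):
    """Return a "listen" message as a JSON string (hypothetical helper)."""
    if state not in LISTEN_STATES:
        raise ValueError(f"unknown listen state: {state!r}")
    message = {"session_id": session_id, "type": "listen", "state": state}
    if mode is not None:  # assumption: "mode" may be omitted
        if mode not in LISTEN_MODES:
            raise ValueError(f"unknown listen mode: {mode!r}")
        message["mode"] = mode
    return json.dumps(message)


def build_abort(session_id, reason=None):
    """Return an "abort" message as a JSON string (hypothetical helper)."""
    message = {"session_id": session_id, "type": "abort"}
    if reason is not None:  # assumption: "reason" may be omitted
        message["reason"] = reason
    return json.dumps(message)
```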
3.2 Server → Client
Hello
- Server’s handshake confirmation message.
- Must contain "type": "hello" and "transport": "websocket".
- May include audio_params indicating server’s expected audio parameters.
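A client-side check of the server "hello" might look like the sketch below. The function name parse_server_hello and its return convention (None for an invalid message, a possibly empty dictionary of audio parameters otherwise) are assumptions for illustration, not the firmware's behavior.

```python
import json


def parse_server_hello(text):
    """Validate a server "hello" text frame (hypothetical helper).

    Returns the audio_params dict (empty if absent) when the message is a
    valid hello, or None when it is not.
    """
    try:
        message = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(message, dict):
        return None
    # Both fields are required by the protocol description above.
    if message.get("type") != "hello" or message.get("transport") != "websocket":
        return None
    params = message.get("audio_params")
    return params if isinstance(params, dict) else {}
```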
STT
- Speech-to-text results:
  {"type": "stt", "text": "..."}
- Indicates server has recognized user speech.
LLM
{"type": "llm", "emotion": "happy", "text": "😀"}
- Server instructs device to adjust expression animation/UI.
TTS
- {"type": "tts", "state": "start"}: Server preparing to send TTS audio
- {"type": "tts", "state": "stop"}: TTS session ended
- {"type": "tts", "state": "sentence_start", "text": "..."}: Display text
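Incoming text frames are naturally handled by dispatching on the "type" field. The sketch below only maps each server message listed in this section to a short description; the function name and the returned strings are hypothetical, and a real client would trigger UI or audio actions instead.

```python
import json


def describe_server_message(text):
    """Map a server text frame to a short description (hypothetical helper)."""
    message = json.loads(text)
    kind = message.get("type")
    if kind == "stt":
        return f"recognized: {message.get('text', '')}"
    if kind == "llm":
        return f"emotion: {message.get('emotion', '')}"
    if kind == "tts":
        state = message.get("state")
        if state == "sentence_start":
            return f"display: {message.get('text', '')}"
        if state in ("start", "stop"):
            return f"tts {state}"
    # Unknown types or states are ignored in this sketch.
    return "ignored"
```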
4. Audio Encoding/Decoding
Client Audio Transmission
- Audio input processed through echo cancellation, noise reduction
- Encoded using Opus codec
- Sent as binary WebSocket frames
Client Audio Playback
- Binary frames from server treated as Opus data
- Decoded and played through audio output interface
- Resampling performed if necessary
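The audio parameters from the "hello" message fix the number of samples each Opus frame covers before encoding. The arithmetic is sketched below; the function names are hypothetical, and the 16-bit PCM sample width is an assumption that this document does not state.

```python
def samples_per_frame(sample_rate_hz, frame_duration_ms, channels=1):
    """Number of PCM samples covered by one frame, across all channels."""
    return (sample_rate_hz * frame_duration_ms // 1000) * channels


def pcm_bytes_per_frame(sample_rate_hz, frame_duration_ms, channels=1,
                        bytes_per_sample=2):
    """Raw PCM size of one frame, assuming 16-bit samples by default."""
    return samples_per_frame(sample_rate_hz, frame_duration_ms, channels) * bytes_per_sample
```

With the example parameters (16000 Hz, mono, 60 ms), one frame covers 960 samples, or 1920 bytes of raw PCM under the 16-bit assumption.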
5. Common State Transitions
Idle → Connecting
- Triggered by user or wake word
- OpenAudioChannel() → WebSocket connection → Send "type": "hello"
Connecting → Listening
- After successful connection
- SendStartListening(...) → Begin recording
Listening → Speaking
- Receive TTS Start → Stop recording → Play received audio
Speaking → Idle
- TTS Stop → End playback → Return to Idle or auto-listen
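The transitions in this section can be summarized as a small lookup table. This is a simplified sketch: the state and event names are hypothetical labels chosen for illustration, and the optional "auto-listen" path after TTS Stop is not modeled.

```python
# (current state, event) -> next state, following the transitions listed above.
TRANSITIONS = {
    ("idle", "open_audio_channel"): "connecting",
    ("connecting", "connected"): "listening",
    ("listening", "tts_start"): "speaking",
    ("speaking", "tts_stop"): "idle",
}


def next_state(state, event):
    """Return the next state, or stay in the current one for unknown events."""
    return TRANSITIONS.get((state, event), state)
```

Keeping unknown events as no-ops is a design choice of this sketch; a stricter client could raise an error instead.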
6. Error Handling
Connection Failure
- Failed Connect(url) or timeout waiting for “hello”
- Triggers on_network_error_()
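Waiting for the server "hello" with a timeout can be sketched with a thread-safe event. The class name HelloWaiter, the error text, and the callback signature are hypothetical; the callback stands in for on_network_error_(), and this document does not state the timeout value the firmware uses.

```python
import threading


class HelloWaiter:
    """Blocks until the server hello arrives or a timeout expires (sketch)."""

    def __init__(self):
        self._received = threading.Event()

    def on_server_hello(self):
        # Called from the receive path when a valid server hello is parsed.
        self._received.set()

    def wait(self, timeout_s, on_network_error):
        # Event.wait returns True if the event was set before the timeout.
        if self._received.wait(timeout_s):
            return True
        on_network_error("timed out waiting for server hello")
        return False
```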
Server Disconnection
- Abnormal WebSocket disconnect
- Triggers OnDisconnected()
7. Additional Notes
Authentication
- Uses Authorization: Bearer <token>
- Server validates token during handshake
Session Control
- Server and client must agree on message fields, timing logic, and error handling rules
- This document serves as a foundation for integration