Worker Monitoring (Dual-Mode)¶
Version: 2.0.0 (May 2026 — ADR-041: WebSocket Monitoring)
Component: Worker Controller
Source: src/worker-controller/integration/services/cml_websocket_monitor.py, src/worker-controller/application/hosted_services/worker_reconciler.py
Related Documentation
- Worker Lifecycle — reconciliation triggers monitoring
- Worker Discovery — how workers are found
- Idle Detection — consumes activity signals
- Worker Monitoring (Architecture) — system-level overview
- ADR-041 — design decision
1. Overview¶
The Worker Controller implements a dual-mode monitoring system that collects metrics and activity signals from CML workers:
| Mode | Transport | Latency | Use Case |
|---|---|---|---|
| WebSocket (primary) | Persistent wss://<host>/ws/ui |
< 5s | Real-time system stats, lab events, activity |
| REST Polling (fallback) | HTTP GET /api/v0/system_stats |
60s | Degraded connectivity, older CML versions |
The system gracefully degrades: if a WebSocket connection fails, the reconciler automatically falls back to REST polling until the connection is restored.
flowchart TB
subgraph wc["Worker Controller"]
Reconciler["WorkerReconciler<br/>(LeaderElectedHostedService)"]
subgraph ws_mode["Primary: WebSocket Monitoring"]
Registry["CmlWebSocketMonitorRegistry"]
Monitor1["CmlWebSocketMonitor<br/>(Worker A)"]
Monitor2["CmlWebSocketMonitor<br/>(Worker B)"]
Registry --> Monitor1
Registry --> Monitor2
end
subgraph poll_mode["Fallback: REST Polling"]
CmlClient["CmlSystemSpiClient"]
CmlClient -->|"GET /system_stats"| CML_REST
end
Reconciler -->|"ensure_monitoring()"| Registry
Reconciler -->|"when WS disconnected"| CmlClient
end
subgraph cml["CML Worker"]
WS_EP["/ws/ui (WebSocket)"]
CML_REST["/api/v0/system_stats (REST)"]
end
subgraph cpa["Control Plane API"]
Internal["InternalController"]
SSE["SSEEventRelay"]
end
Monitor1 -->|"wss://"| WS_EP
Monitor2 -->|"wss://"| WS_EP
Reconciler -->|"POST /workers/:id/metrics"| Internal
Internal --> SSE
SSE -->|"SSE"| Browser["Browser"]
style ws_mode fill:#E8F5E9,stroke:#2E7D32
style poll_mode fill:#FFF3E0,stroke:#E65100
style wc fill:#E3F2FD,stroke:#1565C0
2. WebSocket Monitoring (Primary)¶
Architecture¶
Each RUNNING worker gets a dedicated CmlWebSocketMonitor instance managed by the CmlWebSocketMonitorRegistry:
- Lifecycle: Created when worker transitions to RUNNING; destroyed on STOPPING/STOPPED/TERMINATED
- Cardinality: One monitor per RUNNING worker (N instances, not a singleton)
- Leader-aware: Only the leader replica runs monitors (reconciler handles this)
Connection Flow¶
sequenceDiagram
participant R as WorkerReconciler
participant Reg as Registry
participant M as CmlWebSocketMonitor
participant CML as CML Worker
R->>Reg: ensure_monitoring(worker_id, host)
Reg->>M: create + start()
M->>CML: WebSocket UPGRADE /ws/ui
CML-->>M: 101 Switching Protocols
M->>CML: {"token": "<jwt>"}
CML-->>M: system_stats (every ~3.3s)
CML-->>M: lab_stats (per running lab)
CML-->>M: state_change (on node events)
CML-->>M: lab_event (on lab transitions)
Note over M: Throttle to report_interval (10s)
M->>R: on_system_stats callback
R->>R: Report metrics to CPA
Authentication¶
CML's /ws/ui endpoint uses message-based authentication:
- Connect without auth (101 handshake succeeds)
- Send
{"token": "<jwt>"}as first message within 10 seconds - Receive events after successful auth
- On auth failure: CLOSE frame with code
3000("Unauthorized")
The JWT is obtained via the existing CmlSystemSpiClient._authenticate() method (same credentials as REST).
Event Types¶
| Event Type | Frequency | Data | Used For |
|---|---|---|---|
system_stats |
~3.3s | CPU, memory, disk, dominfo | Metrics reporting |
lab_stats |
~3.3s per lab | Node/link counts per lab | Lab monitoring |
state_change |
On event | Node/interface transitions | Activity detection |
lab_event |
On event | Lab lifecycle changes | Activity detection |
Throttling¶
To prevent overwhelming the Control Plane API, CmlWebSocketMonitor throttles outbound metrics reporting:
system_statsarrives every ~3.3s from CML- Reports are sent at most every
CML_WEBSOCKET_METRICS_REPORT_INTERVALseconds (default: 10) - Latest stats are always available via
monitor.latest_system_statsproperty
Reconnection¶
On disconnection (network error, CML restart, auth expiry):
- Exponential backoff: 1s → 2s → 4s → ... →
CML_WEBSOCKET_RECONNECT_MAX_INTERVAL(30s) - Re-authenticate on reconnect (token may have expired)
- After
CML_WEBSOCKET_MAX_RECONNECT_ATTEMPTSconsecutive failures → mark as FAILED - Reconciler detects FAILED state → falls back to REST polling
- Next reconciliation cycle retries WebSocket connection
3. REST Polling (Fallback)¶
When WebSocket is unavailable (disabled, disconnected, or CML version too old):
# In WorkerReconciler._handle_running():
if not monitor or not monitor.is_connected:
# Fallback to REST polling
cml_stats = await self._cml_client.get_system_stats(worker.ip_address)
await self._report_metrics(worker.id, cml_stats)
Polling runs at METRICS_POLL_INTERVAL (default: 60s) during reconciliation cycles.
4. Graceful Degradation¶
The system transitions seamlessly between modes:
stateDiagram-v2
[*] --> WebSocket: Worker becomes RUNNING
WebSocket --> Polling: Connection lost (after max retries)
Polling --> WebSocket: Reconnect succeeds
WebSocket --> [*]: Worker leaves RUNNING
Polling --> [*]: Worker leaves RUNNING
state WebSocket {
Connected --> Reconnecting: Disconnect
Reconnecting --> Connected: Success
Reconnecting --> Failed: Max attempts
}
| State | Metrics Source | Activity Source | SSE Event |
|---|---|---|---|
| WS Connected | WebSocket system_stats |
WebSocket state_change/lab_event |
worker.ws.connected |
| WS Reconnecting | REST polling (fallback) | REST polling | — |
| WS Failed | REST polling | REST polling | worker.ws.disconnected |
| WS Disabled | REST polling only | REST polling only | — |
5. OpenTelemetry Metrics¶
The WebSocket monitoring system exports the following OTel metrics:
| Metric | Type | Labels | Description |
|---|---|---|---|
lcm.worker.websocket.connected |
Gauge | worker_id |
1 if connected, 0 otherwise |
lcm.worker.websocket.messages_total |
Counter | worker_id, event_type |
Total messages received |
lcm.worker.websocket.reconnects_total |
Counter | worker_id |
Total reconnection attempts |
lcm.worker.websocket.activity_events_total |
Counter | worker_id |
Activity events for idle detection |
lcm.worker.websocket.last_message_age_seconds |
Gauge | worker_id |
Seconds since last message |
6. Configuration Reference¶
| Variable | Default | Description |
|---|---|---|
CML_WEBSOCKET_ENABLED |
true |
Enable WebSocket monitoring (set false for REST-only) |
CML_WEBSOCKET_METRICS_REPORT_INTERVAL |
10 |
Max frequency of metrics reports to CPA (seconds) |
CML_WEBSOCKET_RECONNECT_MAX_INTERVAL |
30 |
Max reconnect backoff cap (seconds) |
CML_WEBSOCKET_MAX_RECONNECT_ATTEMPTS |
3 |
Consecutive failures before FAILED state |
CML_WEBSOCKET_HEALTH_TIMEOUT |
60 |
No-message timeout before unhealthy |
METRICS_POLL_INTERVAL |
60 |
REST polling interval when WS unavailable (seconds) |
7. Troubleshooting¶
WebSocket Not Connecting¶
- Verify CML version supports
/ws/ui(CML ≥ 2.9) - Check
CML_WORKER_API_USERNAME/CML_WORKER_API_PASSWORDare valid - Confirm network path allows outbound WebSocket (same as REST — port 443)
- Check logs for
CLOSE 3000(auth failure) vs timeout (network issue)
Falling Back to Polling Frequently¶
- Check
CML_WEBSOCKET_HEALTH_TIMEOUT— increase if CML is slow to send - Verify CML worker is not overloaded (high CPU may delay WS frames)
- Check for network instability between worker-controller and CML instances
Metrics Appear Stale¶
- Verify
CML_WEBSOCKET_METRICS_REPORT_INTERVAL(lower = more frequent) - Check
lcm.worker.websocket.last_message_age_secondsgauge in Grafana - If >
CML_WEBSOCKET_HEALTH_TIMEOUT, connection may be zombie — will auto-reconnect