ADR-041: WebSocket-Based CML Worker Monitoring
| Attribute | Value |
|---|---|
| Status | Proposed |
| Date | 2026-05-20 |
| Deciders | Platform Team |
| Related ADRs | ADR-013 (SSE Batching), ADR-015 (CPA No External Calls), ADR-019 (LabRecord Aggregate) |
| Knowledge Refs | AD-WS-MONITOR-001 |
Context
Current Monitoring Architecture
The worker-controller monitors active CML workers using a polling-based approach:
- Metrics Collection Loop (60s interval): For each RUNNING worker, calls:
  - CloudWatch API (`get_metric_statistics`) for EC2-level CPU/network/disk metrics
  - CML REST API (`GET /api/v0/system_stats`) for system resource utilization
  - CML REST API (`GET /api/v0/system_health`) for compute health
  - CML REST API (`GET /api/v0/licensing`) for license status
- Idle Detection (per-reconciliation cycle, ~30s): Fetches:
  - CML REST API (`GET /api/v0/telemetry/events`) for all raw telemetry events
  - Sends events to control-plane-api for idle evaluation
- Data Reporting: Worker-controller reports collected data to control-plane-api via:
  - `POST /api/internal/workers/{id}/metrics` (EC2 + CML system stats)
  - `POST /api/internal/workers/{id}/cml-data` (system info, health, license)
  - `POST /api/internal/workers/{id}/detect-idle` (raw telemetry events for idle evaluation)
- Frontend Delivery: Control-plane-api broadcasts to the frontend via SSE:
  - `worker.metrics.updated` (batched every 1s per ADR-013)
  - `worker.activity.updated`, `worker.paused`, etc.
Problem Statement
This polling approach has several limitations:
| Problem | Impact |
|---|---|
| 60-second latency for resource metrics | Dashboard shows stale data; labs starting/stopping go unnoticed for up to a minute |
| 30-second latency for idle detection | Delayed auto-pause decisions; wasted compute costs |
| N×4 API calls per cycle per RUNNING worker | Significant HTTP overhead; each poll creates a new connection |
| Full telemetry dump on each idle check | /api/v0/telemetry/events returns ALL events (no pagination/filtering); wasteful for repeated checks |
| No lab-level activity signals | Only aggregate CPU/memory visible; can't detect individual lab starts/stops in real-time |
| No node-level state changes | Node QUEUED → STARTED → BOOTED transitions invisible between polls |
CML WebSocket Endpoint Discovery
CML servers expose a WebSocket endpoint at wss://<host>/ws/ui that their own frontend consumes. This endpoint emits push-based, real-time events including:
| Event Type | Data Content | Current Polling Equivalent | WS Frequency |
|---|---|---|---|
| `system_stats` | Full compute stats (CPU load, memory, disk, dominfo) | `GET /api/v0/system_stats` (60s poll) | ~3s interval |
| `lab_stats` | Per-lab node CPU/RAM/disk, link traffic | None (only aggregate available via poll) | ~3s per lab |
| `state_change` | Node/interface lifecycle: QUEUED, STARTED, BOOTED, STOPPED | `GET /api/v0/telemetry/events` (30s poll) | On occurrence |
| `lab_event` | Lab-level state: STARTED, STOPPED, DEFINED | `GET /api/v0/telemetry/events` (30s poll) | On occurrence |
Key observation (validated 2026-05-20): The system_stats event contains the same data as GET /api/v0/system_stats (CPU, memory, disk, dominfo per compute). The event stream averages ~0.9 msg/s, with system_stats every ~3 seconds and lab_stats every ~3 seconds per running lab. For idle detection, the state_change and lab_event events supersede the telemetry events endpoint: they arrive in real time rather than being polled every 30s.
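As a quick consistency check on the observed rate, the per-event intervals can be converted into an expected message rate. The sketch below assumes the setup reported in the Validation section of this ADR (2 running labs, ~3.3 s per event type, no user activity); the function name is illustrative only.

```python
def expected_msg_rate(running_labs: int, interval_s: float = 3.3) -> float:
    """Expected messages/second on an otherwise idle worker.

    One system_stats stream plus one lab_stats stream per running lab,
    each emitting roughly every `interval_s` seconds.
    """
    streams = 1 + running_labs
    return streams / interval_s


rate = expected_msg_rate(running_labs=2)
print(f"{rate:.2f} msg/s")  # 3 streams / 3.3 s
```

The result rounds to the ~0.9 msg/s observed. state_change and lab_event messages arrive on top of this baseline whenever users interact with labs.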
Decision
Replace polling-based CML metrics/activity collection with a persistent WebSocket connection per monitored worker
The worker-controller will maintain a WebSocket connection to each RUNNING worker's wss://<host>/ws/ui endpoint, replacing:
- CML system_stats polling → Replaced by `system_stats` WebSocket events (~3s interval from CML)
- CML telemetry/events polling → Replaced by `state_change` + `lab_event` WebSocket events (real-time push)
- Idle detection telemetry fetch → Replaced by activity inference from WebSocket events (zero additional API calls)
CloudWatch polling and CML system_health/licensing REST calls remain unchanged; these have no WebSocket equivalent.
Architecture: CmlWebSocketMonitor Service
┌────────────────────────────────────────────────────────────────────────┐
│                           Worker Controller                            │
├────────────────────────────────────────────────────────────────────────┤
│                                                                        │
│  WorkerReconciler                                                      │
│  ├── _handle_running()                                                 │
│  │   ├── (existing) Verify EC2 state                                   │
│  │   ├── (existing) Handle on-demand refresh                           │
│  │   ├── (existing) Reconcile license operations                       │
│  │   ├── (NEW) Ensure CmlWebSocketMonitor is connected                 │
│  │   └── (MODIFIED) Idle detection uses WS-derived activity data       │
│  │                                                                     │
│  ├── _handle_stopping() / _handle_stopped()                            │
│  │   └── (NEW) Disconnect CmlWebSocketMonitor                          │
│  │                                                                     │
│  └── _run_metrics_collection_loop()                                    │
│      ├── (RETAINED) CloudWatch EC2 metrics (no WS equivalent)          │
│      ├── (REMOVED) CML system_stats poll → replaced by WS              │
│      ├── (RETAINED) CML system_health poll (no WS equivalent)          │
│      ├── (RETAINED) CML licensing poll (no WS equivalent)              │
│      └── (NEW) Report WS-collected metrics to CPA                      │
│                                                                        │
│  CmlWebSocketMonitor ← NEW SERVICE                                     │
│  ├── Per-worker WebSocket connection lifecycle                         │
│  ├── Event parsing and classification                                  │
│  ├── Metrics aggregation (rate-limit outbound reports)                 │
│  ├── Activity event tracking (replaces telemetry fetch)                │
│  ├── Auto-reconnect with exponential backoff                           │
│  └── Connection health monitoring                                      │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
Event Flow: WebSocket → CPA → SSE → Frontend
CML Worker                     Worker Controller               Control Plane API             Frontend
    │                                  │                               │                         │
    │── ws: system_stats ─────────────▶│                               │                         │
    │                                  │── report_worker_metrics() ───▶│                         │
    │                                  │                               │── SSE: worker.metrics ─▶│
    │                                  │                               │        .updated         │
    │                                  │                               │                         │
    │── ws: state_change(STARTED) ────▶│                               │                         │
    │                                  │── report_worker_activity() ──▶│                         │
    │                                  │                               │── SSE: worker.activity─▶│
    │                                  │                               │        .updated         │
    │                                  │                               │                         │
    │── ws: lab_event(state=STARTED)──▶│                               │                         │
    │                                  │── report_worker_activity() ──▶│                         │
    │                                  │── report_lab_state_change() ─▶│                         │
    │                                  │                               │── SSE: worker.lab ─────▶│
    │                                  │                               │        .state_change    │
    │                                  │                               │                         │
    │── ws: lab_stats ────────────────▶│                               │                         │
    │                                  │── report_lab_metrics() ──────▶│                         │
    │                                  │                               │── SSE: worker.lab ─────▶│
    │                                  │                               │        .stats_updated   │
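The routing in the flow above can be sketched as a small lookup table from WebSocket event type to the report calls it triggers. This is a minimal illustration, not the planned implementation: the report call names are taken from the diagram, while the top-level "type" key used to read the event type is an assumption about the CML message envelope.

```python
from typing import Any

# Event type -> names of the report calls shown in the flow diagram.
ROUTES: dict[str, tuple[str, ...]] = {
    "system_stats": ("report_worker_metrics",),
    "state_change": ("report_worker_activity",),
    "lab_event": ("report_worker_activity", "report_lab_state_change"),
    "lab_stats": ("report_lab_metrics",),
}


def route_event(message: dict[str, Any]) -> tuple[str, ...]:
    """Return the report calls to make for one parsed WS message.

    ASSUMPTION: the event type is carried in a top-level "type" key;
    the real CML envelope may use a different field name.
    Unknown or missing event types yield an empty tuple instead of raising.
    """
    return ROUTES.get(message.get("type", ""), ())


print(route_event({"type": "lab_event"}))
```

Keeping the mapping as data makes it cheap to ignore event types the monitor does not know about, which matters if other CML versions emit additional events.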
New SSE Event Types for Frontend
| SSE Event Type | Source | Data Shape | Purpose |
|---|---|---|---|
| `worker.lab.state_change` | CML WS `state_change` | `{worker_id, lab_id, element_type, element_id, event, data}` | Real-time node/interface lifecycle |
| `worker.lab.stats_updated` | CML WS `lab_stats` | `{worker_id, lab_id, nodes: {...}, links: {...}}` | Per-lab resource utilization |
| `worker.ws.connected` | WS lifecycle | `{worker_id, connected_at}` | WS connection status indicator |
| `worker.ws.disconnected` | WS lifecycle | `{worker_id, disconnected_at, reason}` | WS connection lost |
Existing events enhanced (no breaking changes):
- `worker.metrics.updated`: Now arrives within ~5-10s of CML emitting `system_stats` (was 60s)
- `worker.activity.updated`: Now arrives within milliseconds of lab interaction (was 30s)
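One of the new data shapes from the table above, worker.lab.state_change, can be sketched as a typed payload. The field names come from the table; the field types, the example values, and the `{"type", "data"}` wrapper returned by `to_sse()` are assumptions made for illustration.

```python
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass(frozen=True)
class LabStateChange:
    """Payload for the worker.lab.state_change SSE event."""

    worker_id: str
    lab_id: str
    element_type: str  # e.g. "node" or "interface"
    element_id: str
    event: str  # e.g. "QUEUED", "STARTED", "BOOTED", "STOPPED"
    data: dict[str, Any] = field(default_factory=dict)

    def to_sse(self) -> dict[str, Any]:
        """Wrap the payload with its SSE event type (wrapper shape is assumed)."""
        return {"type": "worker.lab.state_change", "data": asdict(self)}


example = LabStateChange("worker-1", "lab-1", "node", "n0", "BOOTED").to_sse()
print(example["type"], sorted(example["data"]))
```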
Idle Detection: Event-Driven Approach
Before (polling):
- Every 30s: Fetch ALL telemetry events from CML REST API
- Send raw events to CPA for filtering and evaluation
- CPA determines `last_activity_at` from matching event categories
After (WebSocket-driven):
- CmlWebSocketMonitor receives `state_change` and `lab_event` in real-time
- Worker-controller classifies events by activity category (same filter as current):
  - `lab_event` with state `STARTED|STOPPED` → user activity
  - `state_change` with event `QUEUED|STARTED` on `node` element → user activity
- On activity detection → immediately report to CPA via `report_worker_activity(worker_id, activity_events=[...])`
- CPA updates `last_activity_at` and resets idle timer (existing pipeline Steps 2-4 unchanged)
Fallback: If WebSocket disconnects, revert to polling-based telemetry fetch (existing behavior) until reconnection.
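The two classification rules above can be sketched as a single predicate. The states and events are those listed in this section; the message keys ("type", "state", "event", "element_type") are assumptions, since the exact CML message layout is not specified here.

```python
from typing import Any

LAB_ACTIVITY_STATES = {"STARTED", "STOPPED"}
NODE_ACTIVITY_EVENTS = {"QUEUED", "STARTED"}


def is_user_activity(message: dict[str, Any]) -> bool:
    """Return True if a parsed WS message counts as user activity.

    ASSUMPTION: key names are illustrative; the actual CML message
    fields may differ and should be confirmed against captured traffic.
    """
    kind = message.get("type")
    if kind == "lab_event":
        return message.get("state") in LAB_ACTIVITY_STATES
    if kind == "state_change":
        return (
            message.get("element_type") == "node"
            and message.get("event") in NODE_ACTIVITY_EVENTS
        )
    # system_stats, lab_stats, and unknown types never reset the idle timer.
    return False


print(is_user_activity({"type": "lab_event", "state": "STARTED"}))
```

Periodic system_stats and lab_stats messages are deliberately excluded: they arrive every few seconds even on an idle worker and would otherwise keep the idle timer from ever expiring.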
Consequences
Positive
| Benefit | Quantified Impact |
|---|---|
| Near-real-time metrics | 60s → ~3s latency for system stats at the worker-controller (validated) |
| Instant activity detection | 30s → <1s for node start/stop events |
| Reduced API load | Eliminates 2 REST calls per worker per cycle (system_stats + telemetry/events) |
| Lab-level visibility | NEW: Per-lab resource usage visible (not just aggregate) |
| Node lifecycle visibility | NEW: Individual node state transitions in real-time |
| More accurate idle detection | Activity detected instantly: faster auto-pause when truly idle, no false positives from polling gaps |
| Persistent connection | Reuses single TCP connection vs. creating new HTTP connections every 60s |
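To put the reduced API load in concrete terms, the polled calls per worker can be tallied per hour. This is a back-of-the-envelope sketch using the intervals stated in this ADR (60s metrics loop, ~30s idle check); it assumes the WebSocket stays connected and does not count the reporting calls to CPA. The function name is illustrative.

```python
METRICS_INTERVAL_S = 60  # metrics collection loop
IDLE_INTERVAL_S = 30     # reconciliation cycle (approximate)


def polled_calls_per_hour(metrics_calls: int, idle_calls: int) -> int:
    """Polled API calls per worker per hour (CloudWatch + CML REST)."""
    return (
        metrics_calls * 3600 // METRICS_INTERVAL_S
        + idle_calls * 3600 // IDLE_INTERVAL_S
    )


# Before: CloudWatch, system_stats, system_health, licensing + telemetry fetch.
before = polled_calls_per_hour(metrics_calls=4, idle_calls=1)
# After: CloudWatch, system_health, licensing remain; telemetry fetch is gone.
after = polled_calls_per_hour(metrics_calls=3, idle_calls=0)
print(before, after)  # 360 180
```

Under these assumptions the polled call volume per worker is halved, with most of the saving coming from the 30s telemetry fetch.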
Negative
| Trade-off | Mitigation |
|-----------|------------|
| Persistent connection per worker | Bounded by max_workers (10/region default); negligible memory |
| Connection management complexity | Auto-reconnect with exponential backoff; health monitoring; graceful degradation to polling |
| WebSocket library dependency | Add websockets package (mature, well-maintained, already async) |
| Token refresh over WS | Monitor for 401/close events; reconnect with fresh token |
| CML version dependency | /ws/ui endpoint may vary across CML versions; version check on connect |
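The auto-reconnect mitigation can be sketched as a generator of capped exponential delays with jitter. The base delay, cap, and jitter fraction below are placeholder values, not settings decided by this ADR.

```python
import random
from itertools import islice
from typing import Iterator, Optional


def backoff_delays(
    base: float = 1.0,
    cap: float = 60.0,
    jitter: float = 0.1,
    rng: Optional[random.Random] = None,
) -> Iterator[float]:
    """Yield reconnect delays: base, 2*base, 4*base, ... capped at `cap`.

    Each delay is scaled by a random factor in [1 - jitter, 1 + jitter]
    so that many monitors do not reconnect in lockstep.
    """
    rng = rng or random.Random()
    delay = base
    while True:
        yield min(delay, cap) * rng.uniform(1 - jitter, 1 + jitter)
        delay = min(delay * 2, cap)


# With jitter disabled the schedule is deterministic.
first = list(islice(backoff_delays(jitter=0.0), 8))
print(first)  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

A monitor would typically create a fresh generator after each successful connection so that the next outage starts again from the base delay.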
Neutral
- CloudWatch polling unchanged (still 60s; no WS equivalent available)
- CML system_health and licensing REST polls unchanged (infrequent, no WS equivalent)
- CPA idle detection pipeline unchanged (Steps 2-4); only Step 1 (telemetry fetch) replaced with push-based events
- Frontend SSE infrastructure unchanged; only new event types added
Alternatives Considered
1. Reduce Polling Interval (10s instead of 60s)
Rejected: 6× more REST API calls; increases load on CML workers; still not real-time; doesn't provide lab-level granularity.
2. Server-Sent Events from CML (instead of WebSocket)
Rejected: CML doesn't expose an SSE endpoint; the existing frontend uses WebSocket exclusively.
3. WebSocket only for activity detection, keep polling for metrics
Rejected: system_stats arrives via WebSocket at the same interval as the CML frontend refreshes (~3s per the validation results), eliminating the need for a separate poll. Using both would create redundant data paths.
4. Have CPA connect to CML WebSocket directly
Rejected: Violates ADR-015 (CPA shall not make external calls to CML/EC2). Worker-controller owns all external CML communication.
Validation (Completed 2026-05-20)
Exploration script: scripts/explore_cml_ws.py tested against CML worker at 3.91.22.2 (CML v2.9).
Confirmed Results
| Question | Answer |
|---|---|
| Authentication method | Message-based: send {"token": "<jwt>"} as first WS message |
| Handshake auth required? | No; handshake always succeeds (101), auth is post-connect |
| Auth timeout | ~10 seconds (CLOSE 3000 if no valid auth) |
| `system_stats` frequency | Every ~3.3 seconds |
| `lab_stats` frequency | Every ~3.3 seconds per running lab |
| Total message rate | ~0.9 msg/s (with 2 running labs, idle nodes) |
| Event types observed (idle) | system_stats, lab_stats |
| Event types observed (active) | state_change, lab_event (from user's prior capture) |
| Close code on auth failure | 3000 with reason "Unauthorized" |
| Close behavior on token expiry | TBD (requires 24h test or token manipulation) |
| CML version tested | 2.9.x |
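The confirmed authentication flow (token sent as the first message after the handshake, close code 3000 on failure) can be sketched as follows. The helper names are hypothetical, and `ws` stands for any connection object with an async `send(str)` method, such as one from the `websockets` package; a recording stand-in is used here so the sketch runs without a CML server.

```python
import asyncio
import json

AUTH_FAILURE_CLOSE_CODE = 3000  # observed with reason "Unauthorized"


def build_auth_message(jwt: str) -> str:
    """First frame to send once the WebSocket handshake has completed."""
    return json.dumps({"token": jwt})


async def authenticate(ws, jwt: str) -> None:
    """Send the token right away; CML closes after ~10 s without valid auth."""
    await ws.send(build_auth_message(jwt))


def is_auth_failure(close_code: int) -> bool:
    """True if the connection was closed because authentication failed."""
    return close_code == AUTH_FAILURE_CLOSE_CODE


class RecordingSocket:
    """Stand-in for a real connection, used only to demonstrate the call."""

    def __init__(self) -> None:
        self.sent: list[str] = []

    async def send(self, data: str) -> None:
        self.sent.append(data)


sock = RecordingSocket()
asyncio.run(authenticate(sock, "<jwt>"))
print(sock.sent)  # ['{"token": "<jwt>"}']
```

Because the close behavior on token expiry is still TBD, `is_auth_failure` only covers the confirmed case; a reconnect path would likely need to treat other close codes as candidates for fetching a fresh token as well.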
Remaining Validation (Deferred to Phase 1)
- Extended stability test: 30+ minute connection hold
- Token expiry behavior: What close code/reason when JWT expires mid-session
- Minimum CML version: Test against older CML instances if available
Implementation Notes
See ADR-041 Implementation Plan for the detailed, phased implementation plan.
Key Design Decisions Within This ADR
- CmlWebSocketMonitor is a standalone integration service, not embedded in the reconciler. The reconciler manages its lifecycle (connect/disconnect), but the monitor runs independently.
- Activity events are reported immediately (no batching on the worker-controller side). CPA's existing batching (ADR-013) handles frontend delivery rate-limiting.
- Metrics are throttled in the monitor before reporting to CPA: at most one `report_worker_metrics()` call per `metrics_report_interval` seconds (configurable, default 10s), regardless of how frequently CML pushes `system_stats`.
- Graceful degradation: If WebSocket is unavailable (CML version too old, network issues, auth failure), fall back to polling seamlessly. The reconciler checks `monitor.is_connected` and uses the polling path if not.
- Lab-level events are forwarded to CPA for LabRecord state updates (existing `lab.status.updated` SSE event) and as new `worker.lab.stats_updated` events.
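The metrics throttling decision can be sketched as a small gate that keeps the latest sample and lets at most one report through per interval. The class and method names are hypothetical; only the 10s default mirrors `metrics_report_interval` from this ADR. The clock is injectable so the behavior can be checked without waiting.

```python
import time
from typing import Callable, Optional


class MetricsThrottle:
    """Allow at most one report per `interval` seconds, keeping the latest sample."""

    def __init__(
        self,
        interval: float = 10.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.interval = interval
        self._clock = clock
        self._last_sent: Optional[float] = None
        self.latest: Optional[dict] = None

    def offer(self, stats: dict) -> Optional[dict]:
        """Record a system_stats sample; return it if a report is due, else None."""
        self.latest = stats
        now = self._clock()
        if self._last_sent is None or now - self._last_sent >= self.interval:
            self._last_sent = now
            return stats
        return None


# Simulate CML pushing system_stats every ~3.3 s against a 10 s interval.
ticks = iter([0.0, 3.3, 6.6, 9.9, 13.2])
throttle = MetricsThrottle(interval=10.0, clock=lambda: next(ticks))
sent = [throttle.offer({"seq": i}) is not None for i in range(5)]
print(sent)  # [True, False, False, False, True]
```

With a ~3.3s push interval and a 10s throttle, roughly one in four samples is reported, which is consistent with the ~5-10s frontend latency quoted for `worker.metrics.updated`.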
Compatibility Matrix
| CML Version | /ws/ui Available | Behavior |
|---|---|---|
| < 2.7 (estimated) | Unknown | Polling fallback (existing behavior) |
| ≥ 2.9 | Confirmed | WebSocket monitoring active (validated 2026-05-20) |
Note: Minimum version may be lower than 2.9 but untested.
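One way to turn the matrix into a version check on connect is sketched below. The two thresholds come from the table; the treatment of the untested range in between (attempt WebSocket, rely on the polling fallback) is an assumption, as are the function names and mode labels.

```python
def parse_version(version: str) -> tuple[int, int]:
    """Extract (major, minor) from strings such as "2.9" or "2.9.0"."""
    major, minor, *_ = version.split(".")
    return int(major), int(minor)


def monitoring_mode(version: str) -> str:
    """Choose a monitoring mode for a CML version string."""
    major_minor = parse_version(version)
    if major_minor >= (2, 9):
        return "websocket"  # confirmed in validation
    if major_minor < (2, 7):
        return "polling"  # matrix: polling fallback
    # ASSUMPTION: versions between the two rows are untested, so try the
    # WebSocket and let graceful degradation handle a failure.
    return "websocket-with-fallback"


for v in ("2.6.1", "2.8.0", "2.9.0", "2.10.0"):
    print(v, monitoring_mode(v))
```

Comparing integer tuples rather than strings avoids misordering versions such as 2.10 and 2.9.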