ADR-015: Control Plane API Must Not Make External Calls¶
- Status: Implemented ✅
- Date: 2026-02-06 (Expanded: 2026-02-06, Completed: 2026-02-07)
- Deciders: Platform Team
- Related: ADR-001 (API-Centric State Management), ADR-005 (Dual State Store), ADR-014 (Worker Orphan Detection), ADR-016 (License Operations), ADR-017 (Lab Operations)
Context¶
The control-plane-api service was originally designed as both a state store AND an orchestration layer, making direct calls to:
- AWS EC2 - Instance lifecycle (create, start, stop, terminate)
- AWS CloudWatch - Metrics collection (CPU, memory, storage)
- CML REST API - Health checks, license management, lab operations
This created several problems:
- Architectural Violation: Control-plane-api's purpose is state management (per ADR-001), not infrastructure orchestration
- Credential Sprawl: AWS and CML credentials needed in multiple services
- Inconsistent Behavior: External operations could be triggered from multiple paths
- Testing Complexity: AWS and CML had to be mocked in multiple services
- Orphan Detection Failure: Commands couldn't update state when external resources were gone
- Tight Coupling: Control-plane-api had to know EC2/CML implementation details
Decision¶
Control-plane-api MUST NOT make any external API calls.
It ONLY:
- Reads/writes to MongoDB (spec/desired state, document storage)
- Reads/writes to etcd (state for controller watching, per ADR-005)
- Exposes REST API for UI and internal controller services
All external integrations are delegated to specialized controllers.
Responsibility Boundaries¶
| Service | Responsibility | External Calls | Watches |
|---|---|---|---|
| control-plane-api | State storage (MongoDB + etcd), API for UI & controllers | NONE | N/A |
| worker-controller | AWS EC2 lifecycle, CloudWatch metrics, orphan detection | AWS EC2, CloudWatch | etcd |
| lablet-controller | CML API interactions, lab lifecycle, CML health/metrics | CML REST API | etcd |
| resource-scheduler | Resource allocation, scheduling algorithms | NONE | etcd |
Data Flow Pattern¶
            ┌─────────────────────────────────────────┐
            │                   UI                    │
            └────────────────────┬────────────────────┘
                                 │ REST API
                                 ▼
            ┌─────────────────────────────────────────┐
            │            control-plane-api            │
            │    (DB-only: MongoDB + etcd writes)     │
            └──────┬───────────────────────────┬──────┘
                   │                           │
      MongoDB ◄────┘                           └────► etcd
      (specs)                                        (state)
                                                        │
          ┌──────────────────────┬──────────────────────┤
          │                      │                      │
     watch│                 watch│                 watch│
          ▼                      ▼                      ▼
┌───────────────────┐  ┌───────────────────┐  ┌────────────────────┐
│ worker-controller │  │ lablet-controller │  │ resource-scheduler │
└─────────┬─────────┘  └─────────┬─────────┘  └────────────────────┘
          │                      │
          ▼                      ▼
       AWS EC2                CML API
     CloudWatch
Spec vs State Pattern (Kubernetes-like Reconciliation)¶
| Field | Description | Written By | Read By |
|---|---|---|---|
| `desired_status` | Spec: What user wants | control-plane-api | Controllers (via etcd watch) |
| `status` | State: Actual resource state | Controllers (via internal API) | UI, control-plane-api |
Flow:
- User clicks "Stop Worker" in UI
- control-plane-api sets `desired_status=STOPPED` in MongoDB AND etcd
- worker-controller watches etcd, sees `desired_status != status`
- worker-controller calls AWS EC2 to stop instance
- worker-controller updates `status=STOPPED` via control-plane-api internal API
- UI sees reconciled state
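The flow above reduces to comparing the spec with the observed state and picking at most one external action. A minimal sketch of that decision, assuming a hypothetical `WorkerStatus` enum and `plan_action` helper (neither name, nor the action strings, comes from the codebase):

```python
from enum import Enum
from typing import Optional


class WorkerStatus(str, Enum):
    """Hypothetical subset of worker states, for illustration only."""
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    STOPPED = "STOPPED"
    TERMINATED = "TERMINATED"


# Assumed mapping from a desired state to the external action that
# would converge on it; the real controller may differ.
_ACTIONS = {
    WorkerStatus.RUNNING: "start_instance",
    WorkerStatus.STOPPED: "stop_instance",
    WorkerStatus.TERMINATED: "terminate_instance",
}


def plan_action(desired: WorkerStatus, actual: WorkerStatus) -> Optional[str]:
    """Return the action needed to converge, or None if already reconciled."""
    if desired == actual:
        return None
    # A worker that was never provisioned must be created, not started.
    if actual is WorkerStatus.PENDING and desired is WorkerStatus.RUNNING:
        return "provision_instance"
    return _ACTIONS.get(desired)
```

Keeping the decision a pure function of `(desired, actual)` means the reconciler can be exercised without any AWS client in scope.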
Commands/Queries to Remove or Refactor¶
DELETE (Obsolete - functionality moved to controllers)¶
| File | Reason |
|---|---|
| `sync_worker_ec2_status_command.py` | EC2 → worker-controller |
| `bulk_sync_worker_ec2_status_command.py` | EC2 → worker-controller |
| `collect_worker_cloudwatch_metrics_command.py` | CloudWatch → worker-controller |
| `provision_cml_worker_event_handler.py` | EC2 provisioning → worker-controller |
| `sync_worker_cml_data_command.py` | CML API → lablet-controller |
| `bulk_sync_worker_cml_data_command.py` | CML API → lablet-controller |
| `refresh_worker_metrics_command.py` | Orchestrates deleted commands |
| `refresh_worker_labs_command.py` | CML API → lablet-controller |
| `request_worker_data_refresh_command.py` | Orchestrates deleted commands |
REFACTOR (Keep but remove external calls)¶
| File | Change |
|---|---|
| `create_cml_worker_command.py` | DB-only: create with status=PENDING, desired_status=RUNNING |
| `start_cml_worker_command.py` | DB-only: set desired_status=RUNNING |
| `stop_cml_worker_command.py` | DB-only: set desired_status=STOPPED |
| `terminate_cml_worker_command.py` | DB-only: set desired_status=TERMINATED |
| `delete_cml_worker_command.py` | DB-only: set desired_status=TERMINATED (same as terminate) |
| `update_cml_worker_status_command.py` | DB-only: return current status from DB |
| `update_cml_worker_tags_command.py` | DB-only: store tags in MongoDB, worker-controller syncs to EC2 |
| `enable_worker_detailed_monitoring_command.py` | DB-only: store setting, worker-controller applies to EC2 |
| `import_cml_worker_command.py` | DB-only: store import request, worker-controller discovers from EC2 |
| `bulk_import_cml_workers_command.py` | DB-only: store import request, worker-controller discovers from EC2 |
| `register_cml_worker_license_command.py` | DB-only: store license info, lablet-controller registers with CML |
| `deregister_cml_worker_license_command.py` | DB-only: remove license info, lablet-controller deregisters |
| `get_cml_worker_resources_query.py` | DB-only: return cached metrics from MongoDB |
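To illustrate what "DB-only" means for a refactored command, here is a minimal sketch of a stop handler that records intent and nothing else. The `Worker` dataclass, `InMemoryWorkerRepository`, and `StopWorkerHandler` are hypothetical stand-ins, not the actual classes in control-plane-api:

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Worker:
    """Hypothetical worker document; only the two status fields matter here."""
    id: str
    status: str = "PENDING"
    desired_status: str = "RUNNING"


class InMemoryWorkerRepository:
    """Stand-in for the MongoDB-backed repository (an assumption)."""

    def __init__(self) -> None:
        self._workers: Dict[str, Worker] = {}

    def save(self, worker: Worker) -> None:
        self._workers[worker.id] = worker

    def get(self, worker_id: str) -> Worker:
        return self._workers[worker_id]


class StopWorkerHandler:
    """Records intent only; it holds no AWS or CML client."""

    def __init__(self, repository: InMemoryWorkerRepository) -> None:
        self._repository = repository

    def handle(self, worker_id: str) -> Worker:
        worker = self._repository.get(worker_id)
        worker.desired_status = "STOPPED"  # the spec changes
        self._repository.save(worker)      # status is left for the controller
        return worker
```

Because the handler depends only on a repository, a unit test can run it against an in-memory fake and assert that `status` is untouched while `desired_status` changes.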
SERVICES TO REMOVE¶
| Service | Reason |
|---|---|
| `CMLHealthService` | CML API → lablet-controller |
| `AwsEc2Client` registration | EC2 → worker-controller |
| `CMLApiClient` registration | CML API → lablet-controller |
Consequences¶
Positive¶
- Clear Separation of Concerns: control-plane-api is a pure state store
- Single Responsibility: Each controller owns its external integration
- Simpler Testing: control-plane-api tests need no AWS/CML mocks
- Reduced Credential Exposure: AWS creds only in worker-controller, CML creds only in lablet-controller
- Consistent Behavior: All external operations go through dedicated controllers
- Reactive Architecture: Controllers watch etcd (per ADR-005) for efficient state sync
Negative¶
- Migration Effort: Significant refactoring of control-plane-api
- Controller Complexity: Controllers must implement reconciliation loops
- Eventual Consistency: UI sees desired_status immediately; status updates arrive asynchronously
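One way a client could soften the eventual-consistency gap is to show both fields while they disagree. A small sketch, assuming a hypothetical `display_status` helper that is not part of the existing UI:

```python
def display_status(status: str, desired_status: str) -> str:
    """Render a label that makes an in-flight transition visible."""
    if status == desired_status:
        return status
    # The controller has not reconciled yet: show actual and requested state.
    return f"{status} (pending: {desired_status})"
```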
Neutral¶
- Network Hops: Controllers call internal API to update state (correct flow)
- etcd Dependency: Controllers depend on etcd watch (already required by ADR-005)
Implementation Checklist¶
Phase 1: Core Lifecycle Commands ✅¶
- [x] Create `MarkWorkerTerminatedCommand` (database-only)
- [x] Refactor `StartCMLWorkerCommand` to set desired_status only
- [x] Refactor `StopCMLWorkerCommand` to set desired_status only
- [x] Refactor `TerminateCMLWorkerCommand` to set desired_status only
- [x] Refactor `CreateCMLWorkerCommand` to DB-only (no EC2 provisioning)
- [x] Add `desired_status` field to CMLWorkerState
- [x] Remove `SyncWorkerEC2StatusCommand`
- [x] Remove `BulkSyncWorkerEC2StatusCommand`
- [x] Remove `CollectWorkerCloudWatchMetricsCommand`
Phase 2: Complete External Call Removal ✅¶
- [x] Remove `ProvisionCMLWorkerEventHandler` (EC2 provisioning)
- [x] Remove `SyncWorkerCMLDataCommand` (CML API)
- [x] Remove `BulkSyncWorkerCMLDataCommand` (CML API)
- [x] Remove `RefreshWorkerMetricsCommand` (orchestrates deleted commands)
- [x] Remove `RefreshWorkerLabsCommand` (CML API)
- [x] Remove `RequestWorkerDataRefreshCommand` (orchestrates deleted commands)
- [x] Refactor `DeleteCMLWorkerCommand` to DB-only
- [x] Refactor `UpdateCMLWorkerTagsCommand` to DB-only
- [x] Refactor `EnableWorkerDetailedMonitoringCommand` to DB-only
- [x] Remove `ImportCMLWorkerCommand` (no longer needed - workers created via CreateCMLWorkerCommand)
- [x] Remove `BulkImportCMLWorkersCommand` (no longer needed)
- [x] Refactor `RegisterCMLWorkerLicenseCommand` to DB-only (ADR-016: stores intent, worker-controller reconciles)
- [x] Refactor `DeregisterCMLWorkerLicenseCommand` to DB-only (ADR-016: stores intent, worker-controller reconciles)
- [x] Refactor `GetCMLWorkerResourcesQuery` to return cached DB data (returns `CachedResourcesUtilization` from worker state)
Phase 2b: Lab Operations (ADR-017) ✅¶
- [x] Refactor `ControlLabCommand` to DB-only (ADR-017: sets pending_action, lablet-controller reconciles)
- [x] Refactor `DeleteLabCommand` to DB-only (ADR-017: sets pending_action=delete, lablet-controller reconciles)
- [x] Refactor `ImportLabCommand` to DB-only (ADR-017: stores YAML in PendingLabImport, lablet-controller reconciles)
- [x] Refactor `DownloadLabCommand` to BFF via lablet-controller (ADR-017: read-only passthrough via LabletControllerClient)
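The `pending_action` pattern in the items above can be sketched as two small steps: control-plane-api stores the intent, and lablet-controller later reports the outcome and clears it. The `LabRecord` shape, the helper names, and every action name other than `delete` are assumptions made for illustration:

```python
from dataclasses import dataclass
from typing import Optional

# Assumed action vocabulary; only "delete" is named in this ADR.
ALLOWED_ACTIONS = {"start", "stop", "wipe", "delete"}


@dataclass
class LabRecord:
    """Hypothetical lab document carrying the pending_action field."""
    lab_id: str
    state: str = "STOPPED"
    pending_action: Optional[str] = None


def request_lab_action(lab: LabRecord, action: str) -> LabRecord:
    """control-plane-api side: store the intent, touch nothing external."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    lab.pending_action = action
    return lab


def complete_lab_action(lab: LabRecord, new_state: str) -> LabRecord:
    """lablet-controller side: report the outcome and clear the intent."""
    lab.state = new_state
    lab.pending_action = None
    return lab
```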
Phase 3: Dependency Cleanup ✅¶
- [x] Remove `AwsEc2Client` registration from control-plane-api DI
- [x] Remove `CMLApiClient` registration from control-plane-api DI
- [x] Remove `CMLHealthService` from control-plane-api (already removed)
- [x] Remove AWS/CML integration imports from control-plane-api
- [x] Update `__init__.py` files to remove deleted exports
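A cleanup like this can regress silently when someone re-adds an import. One possible guard is a test that parses source files and rejects forbidden modules. The sketch below uses Python's standard `ast` module; the deny-list (`boto3`, `botocore`) is an assumption, since this ADR does not name the libraries behind `AwsEc2Client` or `CMLApiClient`:

```python
import ast
from typing import List, Set

# Hypothetical deny-list; adjust to the real integration packages.
FORBIDDEN_MODULES: Set[str] = {"boto3", "botocore"}


def find_forbidden_imports(source: str, forbidden: Set[str] = FORBIDDEN_MODULES) -> List[str]:
    """Return the forbidden top-level modules imported by the given source."""
    found: List[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Absolute "from x.y import z"; relative imports are skipped.
            names = [node.module]
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root in forbidden:
                found.append(root)
    return found
```

A test suite could call this on every `.py` file under the control-plane-api package and fail if the returned list is non-empty.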
Phase 4: Validate ADR-005 (etcd watch) ✅¶
- [x] Verify control-plane-api writes to etcd on desired_status changes (EtcdStateStore configured, set_instance_state/set_worker_state available)
- [x] Verify worker-controller watches etcd for worker state changes (WorkerReconciler watches /workers/ prefix)
- [x] Verify lablet-controller watches etcd for lab/license state changes (LabletReconciler watches /instances/ prefix)
- [x] Verify resource-scheduler watches etcd for scheduling triggers (uses leader election + polling pattern)
Phase 5: etcd State Publishing for desired_status ✅¶
Implemented reactive reconciliation via etcd watch by publishing desired_status changes:
- [x] Add `WORKER_DESIRED_STATE_KEY` to EtcdStateStore (/workers/{id}/desired_state)
- [x] Add `set_worker_desired_state()` method to EtcdStateStore
- [x] Add `get_worker_desired_state()` method to EtcdStateStore
- [x] Update `delete_worker_state()` to also clean up `desired_state` key
- [x] Create `CMLWorkerDesiredStatusUpdatedEtcdProjector` domain event handler
- [x] Update `CMLWorkerCreatedEtcdProjector` to publish both `status` and `desired_status` on creation
- [x] Export new projector in `application/events/domain/__init__.py`
- [x] Add unit tests for etcd state projectors (`test_etcd_state_projectors.py`)
- [x] Update WorkerReconciler docstrings to document watch pattern
etcd Key Structure:
/lcm/workers/{worker_id}/state # Actual EC2 state (updated by worker-controller)
/lcm/workers/{worker_id}/desired_state # User-requested state (updated by control-plane-api)
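The key layout above can be captured in a few helpers so that writers and watchers agree on the format. This is a sketch only: the function names are hypothetical, and the `/lcm/workers` prefix is taken from the key structure shown here.

```python
from typing import Optional, Tuple

ETCD_PREFIX = "/lcm/workers"  # prefix as shown in the key structure above


def worker_state_key(worker_id: str) -> str:
    """Key holding the actual state, written by worker-controller."""
    return f"{ETCD_PREFIX}/{worker_id}/state"


def worker_desired_state_key(worker_id: str) -> str:
    """Key holding the requested state, written by control-plane-api."""
    return f"{ETCD_PREFIX}/{worker_id}/desired_state"


def parse_worker_key(key: str) -> Optional[Tuple[str, str]]:
    """Split a watched key into (worker_id, kind); None if it does not match."""
    if not key.startswith(ETCD_PREFIX + "/"):
        return None
    parts = key[len(ETCD_PREFIX) + 1:].split("/")
    if len(parts) != 2 or not parts[0] or parts[1] not in ("state", "desired_state"):
        return None
    return parts[0], parts[1]
```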
Flow:
- User clicks "Stop Worker" in UI
- control-plane-api updates `desired_status=STOPPED` in MongoDB
- Domain event `CMLWorkerDesiredStatusUpdatedDomainEvent` is published
- `CMLWorkerDesiredStatusUpdatedEtcdProjector` writes to etcd `/workers/{id}/desired_state`
- worker-controller watching `/workers/` prefix receives event immediately
- worker-controller fetches worker from API, sees `desired_status != status`
- worker-controller calls AWS EC2 to stop instance
- worker-controller updates `status=STOPPED` via control-plane-api
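The projector step in this flow can be sketched with an in-memory stand-in for etcd. `FakeStateStore`, `DesiredStatusUpdated`, and `DesiredStatusProjector` are hypothetical simplifications of `EtcdStateStore`, `CMLWorkerDesiredStatusUpdatedDomainEvent`, and `CMLWorkerDesiredStatusUpdatedEtcdProjector`. Their fields and signatures are assumed, and the naive prefix watch only imitates what a real etcd watch provides:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

WatchCallback = Callable[[str, str], None]


@dataclass(frozen=True)
class DesiredStatusUpdated:
    """Stand-in for the domain event (fields assumed)."""
    worker_id: str
    desired_status: str


class FakeStateStore:
    """In-memory substitute for the etcd-backed store with a naive prefix watch."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}
        self._watchers: List[Tuple[str, WatchCallback]] = []

    def watch_prefix(self, prefix: str, callback: WatchCallback) -> None:
        self._watchers.append((prefix, callback))

    def set_worker_desired_state(self, worker_id: str, value: str) -> None:
        key = f"/workers/{worker_id}/desired_state"
        self.data[key] = value
        # Notify every watcher whose prefix matches the written key.
        for prefix, callback in self._watchers:
            if key.startswith(prefix):
                callback(key, value)


class DesiredStatusProjector:
    """Projects the domain event into the state store; no external calls."""

    def __init__(self, store: FakeStateStore) -> None:
        self._store = store

    def handle(self, event: DesiredStatusUpdated) -> None:
        self._store.set_worker_desired_state(event.worker_id, event.desired_status)
```

In a test, a callback registered on the `/workers/` prefix plays the role of worker-controller and should observe exactly one event per published change.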
References¶
- ADR-001: API-Centric State Management (control-plane-api as state store)
- ADR-005: Dual State Store Architecture (etcd + MongoDB)
- ADR-014: Worker Orphan Detection (triggered this discovery)
- Kubernetes Controller Pattern
- 12-Factor App: Backing Services