ADR-011: Removal of APScheduler and Migration to Controller-Based Job Execution
| Attribute | Value |
|---|---|
| Status | Accepted |
| Date | 2026-01-19 |
| Deciders | Architecture Team |
| Related ADRs | ADR-001, ADR-010 |
Context
The control-plane-api originally used APScheduler for background job execution:
| Job | Purpose | APScheduler Pattern |
|---|---|---|
| AutoImportWorkersJob | Discover EC2 instances | Recurrent (interval-based) |
| WorkerMetricsCollectionJob | Collect worker metrics | Recurrent (interval-based) |
| ActivityDetectionJob | Detect idle workers | Recurrent (interval-based) |
| LabsRefreshJob | Sync lab records | Recurrent (interval-based) |
| LicenseRegistrationJob | Register CML licenses | Scheduled (on-demand) |
| LicenseDeregistrationJob | Deregister licenses | Scheduled (on-demand) |
| OnDemandWorkerDataRefreshJob | Refresh single worker | Scheduled (on-demand) |
With the refactoring to declarative resource management (ADR-010), controllers now use reconciliation loops (HostedService pattern) for continuous operations:
- worker-controller: WorkerReconciler (includes discovery loop per AD-020)
- lablet-controller: LabletReconciler, LabsRefreshService
This creates architectural tension:
- Dual scheduling systems: APScheduler in control-plane-api + HostedService in controllers
- Violation of ADR-001: APScheduler jobs access repositories directly
- Complexity: background_worker.py process runs separately from main API
- Redundancy: Controller reconcilers now handle what APScheduler jobs did
Decision
Remove APScheduler from control-plane-api entirely. All background operations are handled by:
- Reconciliation loops in controllers (worker-controller, lablet-controller)
- On-demand mediator commands triggered by API requests
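As a rough illustration of the first mechanism, a reconciliation loop in the HostedService style can be sketched with plain asyncio. The class name `ReconcileLoop`, its `start`/`stop` methods, and the `interval_s` parameter are assumptions made for this sketch; they are not taken from the controllers' actual code.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class ReconcileLoop:
    """Hypothetical stand-in for a HostedService-style reconciler."""

    def __init__(self, reconcile_once: Callable[[], Awaitable[None]], interval_s: float) -> None:
        self._reconcile_once = reconcile_once
        self._interval_s = interval_s
        self._stopping: Optional[asyncio.Event] = None
        self._task: Optional[asyncio.Task] = None

    async def start(self) -> None:
        # Create the stop event inside the running loop, then schedule the task.
        self._stopping = asyncio.Event()
        self._task = asyncio.create_task(self._run())

    async def stop(self) -> None:
        # Signal the loop to exit and wait for the current pass to finish.
        if self._stopping is not None:
            self._stopping.set()
        if self._task is not None:
            await self._task

    async def _run(self) -> None:
        assert self._stopping is not None
        while not self._stopping.is_set():
            try:
                await self._reconcile_once()
            except Exception as exc:
                # One failed pass must not kill the loop; the next pass retries.
                print(f"reconcile pass failed: {exc!r}")
            try:
                # Sleep for the interval, but wake immediately on stop().
                await asyncio.wait_for(self._stopping.wait(), timeout=self._interval_s)
            except asyncio.TimeoutError:
                pass


async def demo() -> int:
    passes = 0

    async def reconcile_once() -> None:
        nonlocal passes
        passes += 1

    loop = ReconcileLoop(reconcile_once, interval_s=0.01)
    await loop.start()
    await asyncio.sleep(0.05)
    await loop.stop()
    return passes


if __name__ == "__main__":
    print(f"completed passes: {asyncio.run(demo())}")
```

Unlike a separate scheduler process, the loop lives inside the controller's own event loop, so its lifecycle follows the service's startup and shutdown.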
Migration Strategy
| Original Job | Migration Target | Trigger |
|---|---|---|
| AutoImportWorkersJob | WorkerReconciler._run_discovery_loop() (worker-controller) | Leader-elected asyncio task |
| WorkerMetricsCollectionJob | WorkerReconciler._collect_and_report_metrics() | Reconciliation loop |
| ActivityDetectionJob | WorkerReconciler._detect_activity() | Reconciliation loop |
| LabsRefreshJob | LabsRefreshService (lablet-controller) | Continuous (interval-based HostedService) |
| LicenseRegistrationJob | RegisterWorkerLicenseCommand | On-demand via API |
| LicenseDeregistrationJob | DeregisterWorkerLicenseCommand | On-demand via API |
| OnDemandWorkerDataRefreshJob | RefreshWorkerDataCommand → worker-controller | On-demand via API |
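To make the first three rows of the table concrete, the sketch below shows one way the three former jobs could sit inside a single reconciler: metrics and activity detection run per worker on each reconciliation pass, while discovery runs in its own task and only acts when the instance holds leadership. The method names mirror the table; the class name `WorkerReconcilerSketch`, the `is_leader` callable, the `max_passes` parameter, and the method bodies are placeholders assumed for illustration.

```python
import asyncio
from typing import Callable, List


class WorkerReconcilerSketch:
    """Illustrative only: records what would run instead of calling AWS or CML."""

    def __init__(self, is_leader: Callable[[], bool]) -> None:
        self._is_leader = is_leader  # hypothetical leader-election check
        self.events: List[str] = []

    async def reconcile(self, worker_id: str) -> None:
        # One reconciliation pass for one worker covers both former jobs.
        await self._collect_and_report_metrics(worker_id)
        await self._detect_activity(worker_id)

    async def _collect_and_report_metrics(self, worker_id: str) -> None:
        self.events.append(f"metrics:{worker_id}")

    async def _detect_activity(self, worker_id: str) -> None:
        self.events.append(f"activity:{worker_id}")

    async def _run_discovery_loop(self, interval_s: float, max_passes: int) -> None:
        # max_passes bounds the sketch; a real loop would run until shutdown.
        for _ in range(max_passes):
            if self._is_leader():
                # Only the elected leader scans for instances to import.
                self.events.append("discovery")
            await asyncio.sleep(interval_s)


async def demo() -> List[str]:
    reconciler = WorkerReconcilerSketch(is_leader=lambda: True)
    discovery = asyncio.create_task(reconciler._run_discovery_loop(0.0, max_passes=1))
    await reconciler.reconcile("w-1")
    await discovery
    return reconciler.events


if __name__ == "__main__":
    print(asyncio.run(demo()))
```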
On-Demand Pattern
For user-triggered operations (license registration, data refresh):
User → POST /api/workers/{id}/register-license
→ RegisterWorkerLicenseCommand (mediator)
→ Handler calls worker-controller via ControlPlaneApiClient
→ Worker-controller executes license operation
→ Reports result to control-plane-api via internal API
→ Control-plane-api persists state and broadcasts SSE
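The flow above can be approximated in-process as follows. This is a sketch under stated assumptions: the command's `license_token` field, the `FakeWorkerController` and `ControlPlaneState` classes, and the SSE event type string are all hypothetical, and the HTTP hops (handler to worker-controller, worker-controller back to the internal API) are replaced by direct coroutine calls.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable, Dict, List


@dataclass(frozen=True)
class RegisterWorkerLicenseCommand:
    worker_id: str
    license_token: str  # hypothetical field, not confirmed by the ADR


@dataclass
class ControlPlaneState:
    """Stands in for control-plane-api persistence and SSE broadcasting."""

    license_status: Dict[str, str] = field(default_factory=dict)
    sse_events: List[dict] = field(default_factory=list)

    async def report_result(self, result: dict) -> None:
        # Internal API callback: persist first, then broadcast.
        self.license_status[result["worker_id"]] = result["status"]
        self.sse_events.append({"type": "worker.license.updated", **result})


class FakeWorkerController:
    """Stands in for the worker-controller side of the round-trip."""

    def __init__(self, report_result: Callable[[dict], Awaitable[None]]) -> None:
        self._report_result = report_result

    async def register_license(self, worker_id: str, license_token: str) -> None:
        # A real controller would call the CML System API here.
        result = {"worker_id": worker_id, "status": "registered"}
        await self._report_result(result)


class RegisterWorkerLicenseHandler:
    def __init__(self, controller: FakeWorkerController) -> None:
        self._controller = controller

    async def handle(self, command: RegisterWorkerLicenseCommand) -> None:
        # The handler only delegates; it never touches state directly.
        await self._controller.register_license(command.worker_id, command.license_token)


async def run_demo() -> ControlPlaneState:
    state = ControlPlaneState()
    handler = RegisterWorkerLicenseHandler(FakeWorkerController(state.report_result))
    await handler.handle(RegisterWorkerLicenseCommand("w-1", "token-abc"))
    return state


if __name__ == "__main__":
    print(asyncio.run(run_demo()))
```

The point of the shape is that state changes happen only in the control-plane-api callback, which keeps the controller free of direct repository access.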
Rationale
Benefits
- Single scheduling paradigm: All continuous operations use HostedService reconciliation
- ADR-001 compliance: Controllers never access repositories directly
- Simplified deployment: No separate background_worker.py process
- Better observability: All operations flow through reconcilers with consistent logging/tracing
- Cleaner codebase: Removes ~900 lines of APScheduler infrastructure
Trade-offs
- On-demand operations require API round-trip through controller
- Controllers must handle both reconciliation and on-demand requests
Consequences
Files to Delete
| File | Reason |
|---|---|
| control-plane-api/background_worker.py | APScheduler process entry point |
| control-plane-api/application/services/background_scheduler.py | APScheduler wrapper (~900 lines) |
| control-plane-api/application/jobs/auto_import_workers_job.py | Replaced by WorkerReconciler._run_discovery_loop() |
| control-plane-api/application/jobs/worker_metrics_collection_job.py | Replaced by WorkerReconciler |
| control-plane-api/application/jobs/activity_detection_job.py | Replaced by WorkerReconciler |
| control-plane-api/application/jobs/labs_refresh_job.py | Replaced by LabsRefreshService |
| control-plane-api/application/jobs/license_registration_job.py | Replaced by command |
| control-plane-api/application/jobs/license_deregistration_job.py | Replaced by command |
| control-plane-api/application/jobs/on_demand_worker_data_refresh_job.py | Replaced by command |
| control-plane-api/application/jobs/__init__.py | Empty module |
Code Changes Required
- control-plane-api/main.py: Remove BackgroundTaskScheduler.configure() call
- control-plane-api/pyproject.toml: Remove apscheduler dependency
- docker-compose.yml: Remove background-worker service if present
- Create new commands: RegisterWorkerLicenseCommand, DeregisterWorkerLicenseCommand, RefreshWorkerDataCommand
New Commands in control-plane-api
| Command | Purpose | Delegates To |
|---|---|---|
| RegisterWorkerLicenseCommand | Initiate license registration | worker-controller |
| DeregisterWorkerLicenseCommand | Initiate license deregistration | worker-controller |
| RefreshWorkerDataCommand | Trigger on-demand data refresh | worker-controller |
New Endpoints in worker-controller
| Endpoint | Purpose |
|---|---|
| POST /internal/workers/{id}/register-license | Execute license registration |
| POST /internal/workers/{id}/deregister-license | Execute license deregistration |
| POST /internal/workers/{id}/refresh-data | Execute data refresh |
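A command handler in control-plane-api needs to address these three endpoints consistently. The helper below is one possible way to build the requests using only the standard library; the function name `build_internal_request`, the base URL, and the empty JSON body are assumptions for this sketch, and the request is constructed but not sent.

```python
from urllib.parse import quote
from urllib.request import Request

# Actions taken from the endpoint table above.
_INTERNAL_ACTIONS = ("register-license", "deregister-license", "refresh-data")


def build_internal_request(base_url: str, worker_id: str, action: str) -> Request:
    """Build (but do not send) a POST to a worker-controller internal endpoint."""
    if action not in _INTERNAL_ACTIONS:
        raise ValueError(f"unknown internal action: {action!r}")
    # Escape the worker id so it cannot alter the path structure.
    url = f"{base_url.rstrip('/')}/internal/workers/{quote(worker_id, safe='')}/{action}"
    return Request(
        url,
        data=b"{}",  # placeholder body; the real payload is not specified here
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical service address, for illustration only.
    req = build_internal_request("http://worker-controller:8080", "w-1", "refresh-data")
    print(req.get_method(), req.full_url)
```

Restricting `action` to a fixed tuple keeps the caller from reaching endpoints outside the three this ADR introduces.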
Implementation Phases
Phase 1: Create New Commands (control-plane-api)
- [ ] Create RegisterWorkerLicenseCommand and handler
- [ ] Create DeregisterWorkerLicenseCommand and handler
- [ ] Create RefreshWorkerDataCommand and handler
- [ ] Update API endpoints to use new commands
Phase 2: Create New Endpoints (worker-controller)
- [ ] Add POST /internal/workers/{id}/register-license
- [ ] Add POST /internal/workers/{id}/deregister-license
- [ ] Add POST /internal/workers/{id}/refresh-data
- [ ] Implement license registration via CML System API
- [ ] Implement data refresh via EC2 + CloudWatch + CML APIs
Phase 3: Verify Migration (Integration Testing)
- [ ] Test license registration flow end-to-end
- [ ] Test license deregistration flow end-to-end
- [ ] Test on-demand data refresh flow
- [ ] Verify SSE events are broadcast correctly
Phase 4: Cleanup (After Verification)
- [ ] Delete deprecated files (listed above)
- [ ] Remove APScheduler dependency
- [ ] Update documentation
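Before removing the dependency in Phase 4, it may help to confirm that no module still imports APScheduler. The script below is a minimal sketch of such a check; the function name `find_apscheduler_imports` and the choice to scan only `*.py` files for top-of-line import statements are assumptions, and it will not catch dynamic imports.

```python
import re
import sys
from pathlib import Path
from typing import List, Tuple

# Matches "import apscheduler..." and "from apscheduler... import ...".
_IMPORT_RE = re.compile(r"^\s*(?:from|import)\s+apscheduler\b")


def find_apscheduler_imports(root: Path) -> List[Tuple[Path, int]]:
    """Return (file, line number) pairs for leftover APScheduler imports."""
    hits: List[Tuple[Path, int]] = []
    for path in sorted(root.rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if _IMPORT_RE.match(line):
                hits.append((path, lineno))
    return hits


if __name__ == "__main__":
    # Defaults to the service directory named in this ADR.
    root = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("control-plane-api")
    leftovers = find_apscheduler_imports(root)
    for path, lineno in leftovers:
        print(f"{path}:{lineno}: APScheduler import still present")
    sys.exit(1 if leftovers else 0)
```

A non-zero exit code makes the check usable as a gate in CI before the dependency is dropped from pyproject.toml.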