ADR-002: Separate Resource Scheduler Service
| Attribute | Value |
|---|---|
| Status | Accepted |
| Date | 2026-01-15 |
| Deciders | Architecture Team |
| Related ADRs | ADR-001, ADR-006 |
Context
The placement logic that assigns LabletInstances to Workers must evaluate constraints (license affinity, resource requirements, capacity) and select the best placement. We need to decide where this logic lives: inside the API, in a separate service, or in serverless functions.
Options considered:
- Embedded in API - Scheduling logic runs within API request handlers
- Separate service - Dedicated Resource Scheduler service with its own lifecycle
- Serverless functions - Lambda/Cloud Functions triggered on events
Decision
Resource Scheduler runs as a separate microservice, not embedded in the API.
The Resource Scheduler:
- Operates on its own reconciliation loop
- Observes state changes via etcd watches, backed by periodic polling
- Calls Control Plane API to record scheduling decisions
- Can be scaled and deployed independently
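The loop described above can be sketched roughly as follows. This is a minimal illustration, not the service's actual code: `run_scheduler`, `reconcile_once`, and the injected callables are hypothetical names, and the watch is modelled as an in-process queue that some watcher thread would feed, so no particular etcd client API is assumed.

```python
import queue
import threading
from typing import Callable, List, Optional


def reconcile_once(
    list_pending: Callable[[], List[str]],
    list_workers: Callable[[], List[str]],
    choose: Callable[[str, List[str]], Optional[str]],
    record_decision: Callable[[str, str], None],
) -> int:
    """Place every pending instance that has a feasible worker; return the count placed."""
    placed = 0
    workers = list_workers()
    for instance_id in list_pending():
        worker_id = choose(instance_id, workers)
        if worker_id is not None:
            # In the real service this would be a call to the Control Plane API.
            record_decision(instance_id, worker_id)
            placed += 1
    return placed


def run_scheduler(
    events: "queue.Queue[str]",
    reconcile: Callable[[], int],
    stop: threading.Event,
    resync_seconds: float = 30.0,
) -> None:
    """Reconcile whenever a change notification arrives or the resync timer fires."""
    while not stop.is_set():
        try:
            # Block until a watcher pushes a change notification ...
            events.get(timeout=resync_seconds)
        except queue.Empty:
            # ... or fall through to a periodic full reconciliation.
            pass
        reconcile()
```

Because a full reconciliation runs on every wake-up, a missed or duplicated watch event only delays a placement until the next pass; it does not lose one.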
Rationale
Benefits
- Separation of Concerns: Placement algorithm isolated from CRUD operations
- Independent Scaling: Resource Scheduler can scale based on queue depth, not API traffic
- Easier Testing: Scheduling logic testable in isolation
- Upgradability: Algorithm improvements deployed without API changes
- Failure Isolation: Resource Scheduler issues don't impact API availability
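To illustrate the testing benefit, placement can be written as a pure function over plain data, with no API or storage dependencies. The sketch below is an assumption-laden example: the `Worker` and `LabletInstance` fields and the best-fit tie-breaking policy are hypothetical, since this ADR does not define the schemas or the algorithm.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Worker:
    # Hypothetical fields; the ADR does not define the Worker schema.
    id: str
    licenses: frozenset
    free_cpu: int
    free_memory_mb: int


@dataclass(frozen=True)
class LabletInstance:
    # Hypothetical fields; the ADR does not define the LabletInstance schema.
    id: str
    required_license: str
    cpu: int
    memory_mb: int


def pick_worker(instance: LabletInstance, workers: Sequence[Worker]) -> Optional[Worker]:
    """Return a feasible Worker for the instance, or None if nothing fits."""
    feasible = [
        w for w in workers
        if instance.required_license in w.licenses      # license affinity
        and w.free_cpu >= instance.cpu                  # resource requirements
        and w.free_memory_mb >= instance.memory_mb      # capacity
    ]
    if not feasible:
        return None
    # Assumed policy: best fit (least leftover CPU, then memory), id as a
    # deterministic tie-breaker.
    return min(
        feasible,
        key=lambda w: (
            w.free_cpu - instance.cpu,
            w.free_memory_mb - instance.memory_mb,
            w.id,
        ),
    )
```

A unit test can then build a handful of `Worker` values in memory and assert on the result of `pick_worker` without starting the API or etcd.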
Trade-offs
- Additional deployment complexity
- Network latency between Scheduler and API
- Requires coordination mechanism for HA (see ADR-006)
Consequences
Positive
- Clean domain boundaries
- Algorithm can be optimized/replaced independently
- Scheduler failures don't cascade to API
Negative
- Operational overhead of additional service
- Must handle distributed system complexities
Implementation Notes
- Scheduler subscribes to etcd watch for `LabletInstance` changes
- Redundant periodic reconciliation (30s) as fallback
- Placement decision written back via `POST /api/internal/instances/{id}/schedule`
- Metrics exposed for scheduling latency and queue depth
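As a rough sketch of the write-back step, the request for a placement decision could be built with the Python standard library as shown below. Only the method and path come from this ADR; the helper name `build_schedule_request`, the `worker_id` payload field, and the example base URL are assumptions made for illustration.

```python
import json
import urllib.request
from urllib.parse import quote


def build_schedule_request(
    base_url: str, instance_id: str, worker_id: str
) -> urllib.request.Request:
    """Build the POST that records a placement decision (the request is not sent here)."""
    url = (
        f"{base_url.rstrip('/')}/api/internal/instances/"
        f"{quote(instance_id, safe='')}/schedule"
    )
    # Assumed payload shape; the ADR only specifies the path and method.
    body = json.dumps({"worker_id": worker_id}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Example with a made-up base URL and ids.
req = build_schedule_request("http://control-plane.local", "inst-42", "worker-7")
```

The scheduler would pass the resulting object to `urllib.request.urlopen(req, timeout=...)`. A failed call can simply be retried on the next reconciliation pass, since the instance remains unscheduled until the decision is recorded.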