Auto-Scaling Architecture¶
Version: 1.0.0 (February 2026)
Scope: Cross-cutting — spans Resource Scheduler, Control Plane API, and Worker Controller
Phase: Phase 3 - Auto-Scaling
Related Documentation
- Worker Lifecycle — state transitions during scaling
- Worker Discovery — how controllers observe changes
- Idle Detection — activity tracking for scale-down
- Worker Templates — template selection and capacity
1. Overview¶
LCM auto-scaling adjusts the CML worker fleet to match workload demand. It operates in two directions:
| Direction | Trigger | Service Chain | Goal |
|---|---|---|---|
| Scale Up | Insufficient capacity for pending lablet instances | Resource Scheduler → Control Plane API → Worker Controller | Add workers |
| Scale Down | Workers idle beyond threshold | Worker Controller → Control Plane API | Remove workers |
flowchart LR
subgraph up["Scale Up"]
direction TB
RS["Resource Scheduler<br/>PlacementEngine"]
CP1["Control Plane API<br/>RequestScaleUpCommand"]
WC1["Worker Controller<br/>_handle_pending"]
RS -->|"No capacity"| CP1 -->|"Create PENDING worker"| WC1
end
subgraph down["Scale Down"]
direction TB
WC2["Worker Controller<br/>_evaluate_scale_down"]
CP2["Control Plane API<br/>DrainWorkerCommand"]
WC3["Worker Controller<br/>_handle_stopping"]
WC2 -->|"Idle + eligible"| CP2 -->|"DRAINING"| WC3
end
style up fill:#E8F5E9,stroke:#2E7D32
style down fill:#FFEBEE,stroke:#C62828
2. Scale-Up Flow¶
When the Resource Scheduler cannot find a suitable worker for a pending lablet instance, it triggers a scale-up:
sequenceDiagram
autonumber
participant Inst as Lablet Instance<br/>(PENDING_SCHEDULING)
participant RS as Resource Scheduler
participant PE as Placement Engine
participant CP as Control Plane API
participant etcd as etcd
participant WC as Worker Controller
participant EC2 as AWS EC2
Inst->>RS: Observed via watch/polling
RS->>PE: place_instance(instance, workers, templates)
alt Workers available
PE->>PE: Filter → Score → Select
PE-->>RS: PlacementDecision(action=assign, worker_id)
RS->>CP: assign_instance(instance_id, worker_id)
else No workers or no capacity
PE->>PE: _select_template(requirements)
PE-->>RS: PlacementDecision(action=scale_up, template)
RS->>CP: request_scale_up(template, reason)
rect rgb(232, 245, 233)
Note over CP: RequestScaleUpCommand
CP->>CP: Resolve template
CP->>CP: Check scaling constraints
CP->>CP: Create CMLWorker(status=PENDING)
CP->>etcd: PUT /workers/new-id
end
etcd-->>WC: Watch event
rect rgb(227, 242, 253)
Note over WC: _handle_pending
WC->>EC2: launch_instance()
EC2-->>WC: instance_id
WC->>CP: update_status(PROVISIONING)
end
Note over WC: Subsequent cycles:<br/>PROVISIONING → RUNNING
end
Scaling Constraints¶
Before creating a new worker, RequestScaleUpCommand validates:
| Constraint | Check | Default | Behavior on Violation |
|---|---|---|---|
| Max workers per region | `active_count < max_workers_per_region` | 10 | Reject with 409 Conflict |
| Pending workers | Check for existing PENDING workers | — | Log warning (does not block) |
| Constraint check failure | Exception during validation | — | Fail open — allows scale-up |
Fail-Open Design
If the constraint check itself fails (e.g., database error), the scale-up is allowed to proceed. This prevents a cascading failure where infrastructure issues also prevent scaling.
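The constraint check and its fail-open behavior can be sketched as follows. This is a minimal illustration, not the real command: the function name, `ConstraintResult` type, and counters are hypothetical stand-ins for RequestScaleUpCommand's internals.

```python
from dataclasses import dataclass

MAX_WORKERS_PER_REGION = 10  # default, normally read from configuration


@dataclass
class ConstraintResult:
    allowed: bool
    reason: str = ""


def check_scaling_constraints(active_count: int, pending_count: int,
                              max_workers: int = MAX_WORKERS_PER_REGION) -> ConstraintResult:
    """Validate scale-up constraints, failing open on internal errors."""
    try:
        # Hard limit: reject (surfaced as HTTP 409) when the region is at capacity.
        if active_count >= max_workers:
            return ConstraintResult(False, "max_workers_per_region reached")
        # Soft check: existing PENDING workers only produce a warning, never a block.
        if pending_count > 0:
            print(f"warning: {pending_count} PENDING worker(s) already exist")
        return ConstraintResult(True)
    except Exception as exc:
        # Fail open: an error in the check itself must not also prevent scaling.
        print(f"constraint check failed ({exc}); allowing scale-up")
        return ConstraintResult(True, "fail_open")
```

The fail-open branch is the key design point: the `except` swallows infrastructure errors and still returns `allowed=True`, trading a possible extra worker for guaranteed scaling availability.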
3. Placement Algorithm¶
The PlacementEngine implements a Filter → Score → Select pattern inspired by the Kubernetes scheduler:
flowchart TD
Start["place_instance(instance)"] --> HasWorkers{"Any workers<br/>registered?"}
HasWorkers -->|No| ScaleUp1["PlacementDecision<br/>action=scale_up<br/>reason: no workers available"]
HasWorkers -->|Yes| Filter["Filter Phase<br/>(5 predicates)"]
Filter --> HasCandidates{"Any candidates<br/>passed filter?"}
HasCandidates -->|No| ScaleUp2["PlacementDecision<br/>action=scale_up<br/>reason: + rejection_reasons"]
HasCandidates -->|Yes| Score["Score Phase<br/>(utilization + locality)"]
Score --> Select["Select highest score<br/>(bin-packing: most utilized)"]
Select --> Assign["PlacementDecision<br/>action=assign<br/>worker_id=selected"]
style Start fill:#1565C0,color:white
style ScaleUp1 fill:#FF9800,color:white
style ScaleUp2 fill:#FF9800,color:white
style Assign fill:#4CAF50,color:white
Filter Predicates (5 checks)¶
Each worker must pass all predicates to be eligible:
flowchart TD
W["Worker candidate"] --> F1{"Status check<br/>status = RUNNING?"}
F1 -->|Fail| R1["Rejected:<br/>status_not_eligible"]
F1 -->|Pass| F2{"License affinity<br/>matches requirement?"}
F2 -->|Fail| R2["Rejected:<br/>license_affinity"]
F2 -->|Pass| F3{"Resource capacity<br/>CPU, memory, storage?"}
F3 -->|Fail| R3["Rejected:<br/>insufficient_capacity"]
F3 -->|Pass| F4{"AMI requirements<br/>CML version, node defs?"}
F4 -->|Fail| R4["Rejected:<br/>ami"]
F4 -->|Pass| F5{"Port availability<br/>enough free ports?"}
F5 -->|Fail| R5["Rejected:<br/>port_availability"]
F5 -->|Pass| OK["Passes filter"]
style OK fill:#4CAF50,color:white
style R1 fill:#f44336,color:white
style R2 fill:#f44336,color:white
style R3 fill:#f44336,color:white
style R4 fill:#f44336,color:white
style R5 fill:#f44336,color:white
| # | Predicate | Data Source | Logic |
|---|---|---|---|
| 1 | Status | Worker state | Must be RUNNING (not DRAINING, STOPPING, etc.) |
| 2 | License | Instance requirements | If instance specifies license types, worker's license must match |
| 3 | Capacity | etcd (preferred) or worker state | Available CPU ≥ required, memory ≥ required, storage ≥ required |
| 4 | AMI | Instance requirements | CML version in min/max range, required node definitions present |
| 5 | Ports | Worker state + allocated ports | max_ports - allocated_ports ≥ required_ports |
Real-Time Capacity via etcd
The placement engine prefers etcd for capacity data (updated every 30s by workers) over MongoDB state, which may be stale. Capacity keys: /lcm/workers/{id}/capacity.
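The five predicates can be sketched as one function that returns the first rejection reason, mirroring the table above. All field names (`status`, `capacity`, `max_ports`, and so on) are illustrative assumptions about the worker and requirement models, CML versions are modeled as tuples for easy range comparison, and the node-definition part of the AMI check is omitted for brevity.

```python
def passes_filter(worker: dict, req: dict) -> tuple[bool, str]:
    """Apply the five placement predicates in order; return (ok, rejection_reason)."""
    # 1. Status: only RUNNING workers are eligible.
    if worker["status"] != "RUNNING":
        return False, "status_not_eligible"
    # 2. License affinity: enforced only when the instance specifies license types.
    if req.get("license_types") and worker["license"] not in req["license_types"]:
        return False, "license_affinity"
    # 3. Resource capacity: CPU, memory, and storage must all fit
    #    (capacity would preferably come from the fresh etcd snapshot).
    cap = worker["capacity"]
    if (cap["cpu"] < req["cpu"] or cap["memory"] < req["memory"]
            or cap["storage"] < req["storage"]):
        return False, "insufficient_capacity"
    # 4. AMI: CML version must fall inside the required min/max range.
    if not (req["min_cml"] <= worker["cml_version"] <= req["max_cml"]):
        return False, "ami"
    # 5. Ports: enough free ports for the instance.
    if worker["max_ports"] - worker["allocated_ports"] < req["ports"]:
        return False, "port_availability"
    return True, ""
```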
Scoring Formula¶
Workers that pass filtering are scored to select the most utilized (bin-packing strategy):
$$ \text{score} = \frac{\text{cpu\_utilization} + \text{memory\_utilization}}{2} + \text{locality\_bonus} $$
Where:
- $\text{cpu\_utilization} = \frac{\text{allocated\_cpu}}{\text{declared\_cpu}}$
- $\text{memory\_utilization} = \frac{\text{allocated\_memory}}{\text{declared\_memory}}$
- $\text{locality\_bonus} = \min(0.05, \text{instance\_count} \times 0.01)$ — small bonus for co-locating instances
Higher score = preferred — the algorithm packs workloads onto the most utilized workers first, keeping other workers available for larger workloads or scale-down.
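A minimal sketch of the scoring pass, assuming hypothetical worker snapshot fields that follow the formula above:

```python
# Hypothetical worker snapshots; field names mirror the scoring formula.
candidates = [
    {"id": "w-a", "allocated_cpu": 4, "declared_cpu": 16,
     "allocated_memory": 16, "declared_memory": 64, "instance_count": 1},
    {"id": "w-b", "allocated_cpu": 12, "declared_cpu": 16,
     "allocated_memory": 48, "declared_memory": 64, "instance_count": 6},
]

def score_worker(w: dict) -> float:
    cpu_util = w["allocated_cpu"] / w["declared_cpu"]
    mem_util = w["allocated_memory"] / w["declared_memory"]
    locality_bonus = min(0.05, w["instance_count"] * 0.01)  # capped at 0.05
    return (cpu_util + mem_util) / 2 + locality_bonus

# Bin-packing: the most utilized worker wins.
best = max(candidates, key=score_worker)
```

Here `w-b` (75% utilized, bonus capped at 0.05, score 0.80) beats `w-a` (25% utilized, score 0.26), so new work packs onto the busier worker and leaves `w-a` free for large workloads or scale-down.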
4. Template Selection¶
When the placement engine decides to scale up, it selects the optimal worker template:
flowchart TD
Start["_select_template(requirements)"] --> HasTemplates{"Templates<br/>available?"}
HasTemplates -->|No| Hardcoded["Tier 3: Hardcoded fallback"]
HasTemplates -->|Yes| Tier1["Tier 1: Capacity-based<br/>cheapest viable"]
Tier1 --> T1Filter["Filter: enabled templates<br/>where CPU, memory, storage<br/>all >= requirements"]
T1Filter --> T1Sort["Sort by cost_per_hour ASC"]
T1Sort --> T1Check{"Any matches?"}
T1Check -->|Yes| T1Select["Select cheapest viable"]
T1Check -->|No| Tier2["Tier 2: Largest available"]
Tier2 --> T2Filter["Filter: enabled templates"]
T2Filter --> T2Sort["Sort by CPU DESC"]
T2Sort --> T2Check{"Any templates?"}
T2Check -->|Yes| T2Select["Select largest + warning"]
T2Check -->|No| Hardcoded
Hardcoded --> HC{"Required CPU?"}
HC -->|">= 32"| Metal["metal"]
HC -->|">= 16"| Large["large"]
HC -->|">= 4"| Medium["medium"]
HC -->|"< 4"| Small["small"]
style Start fill:#1565C0,color:white
style T1Select fill:#4CAF50,color:white
style T2Select fill:#FF9800,color:white
style Metal fill:#9E9E9E,color:white
style Large fill:#9E9E9E,color:white
style Medium fill:#9E9E9E,color:white
style Small fill:#9E9E9E,color:white
See Worker Templates for template definitions and capacity model.
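The three-tier fallback can be condensed into a short function. This is a sketch under assumed template fields (`enabled`, `cpu`, `memory`, `storage`, `cost_per_hour`); the real `_select_template` also emits the Tier 2 warning through its logger.

```python
def select_template(templates: list[dict], req: dict) -> dict:
    """Three-tier selection: cheapest viable -> largest available -> hardcoded."""
    enabled = [t for t in templates if t["enabled"]]
    # Tier 1: cheapest enabled template that satisfies every resource requirement.
    viable = [t for t in enabled
              if t["cpu"] >= req["cpu"] and t["memory"] >= req["memory"]
              and t["storage"] >= req["storage"]]
    if viable:
        return min(viable, key=lambda t: t["cost_per_hour"])
    # Tier 2: largest enabled template by CPU (selected with a warning).
    if enabled:
        return max(enabled, key=lambda t: t["cpu"])
    # Tier 3: hardcoded fallback keyed on required CPU.
    cpu = req["cpu"]
    if cpu >= 32:
        return {"name": "metal"}
    if cpu >= 16:
        return {"name": "large"}
    if cpu >= 4:
        return {"name": "medium"}
    return {"name": "small"}
```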
5. Scale-Down Flow¶
Scale-down is evaluated during the Worker Controller's _handle_running reconciliation (Step 6), after idle detection has run:
sequenceDiagram
autonumber
participant WC as Worker Controller
participant CP as Control Plane API
participant etcd as etcd
Note over WC: _handle_running Step 6
WC->>WC: _evaluate_scale_down(worker, idle_result)
alt Guard 1: auto_pause_triggered
Note over WC: Skip — auto-pause already acted
else Guard 2: not idle
Note over WC: Skip — worker is active
else Guard 3: not eligible
Note over WC: Skip — not eligible for pause
else Guard 4: running <= min_workers
Note over WC: Skip — at minimum fleet size
else Guard 5: cooldown active
Note over WC: Skip — too soon since last drain
else All guards pass
WC->>CP: drain_worker(worker_id, reason=scale_down)
rect rgb(255, 235, 238)
Note over CP: DrainWorkerCommand
CP->>CP: Validate status = RUNNING
CP->>CP: Set status = DRAINING
CP->>CP: Set desired_status = STOPPED
CP->>etcd: PUT /workers/id (status=DRAINING)
end
WC->>WC: Update last_scale_down_at
WC->>WC: Decrement running_worker_count
WC->>WC: Increment scale_down_count
end
Safety Guards (5 checks, in order)¶
The _evaluate_scale_down method applies five sequential guards before draining:
| # | Guard | Condition to Skip | Audit Label | Purpose |
|---|---|---|---|---|
| 1 | Auto-pause | `auto_pause_triggered` is true | `skipped_auto_pause` | Avoid double-action with auto-pause |
| 2 | Idle check | `is_idle` is false | `skipped_not_idle` | Only drain idle workers |
| 3 | Eligibility | `eligible_for_pause` is false | `skipped_not_eligible` | Respects snooze, per-worker flags |
| 4 | Min workers | `running_count <= min_workers` | `skipped_min_workers` | Maintain minimum fleet |
| 5 | Cooldown | `elapsed < cooldown_seconds` | `skipped_cooldown` | Prevent rapid successive drains |
flowchart TD
Start["_evaluate_scale_down"] --> G1{"auto_pause<br/>triggered?"}
G1 -->|Yes| Skip1["Skip: auto-pause<br/>already handled"]
G1 -->|No| G2{"is_idle?"}
G2 -->|No| Skip2["Skip: not idle"]
G2 -->|Yes| G3{"eligible_for<br/>_pause?"}
G3 -->|No| Skip3["Skip: not eligible"]
G3 -->|Yes| G4{"running_count<br/><= min_workers?"}
G4 -->|Yes| Skip4["Skip: minimum<br/>fleet size"]
G4 -->|No| G5{"cooldown<br/>active?"}
G5 -->|Yes| Skip5["Skip: too soon"]
G5 -->|No| Drain["DRAIN WORKER"]
style Start fill:#1565C0,color:white
style Drain fill:#f44336,color:white
style Skip1 fill:#9E9E9E,color:white
style Skip2 fill:#9E9E9E,color:white
style Skip3 fill:#9E9E9E,color:white
style Skip4 fill:#9E9E9E,color:white
style Skip5 fill:#9E9E9E,color:white
Running Worker Count Tracking¶
After a successful drain, the reconciler decrements _running_worker_count locally. This ensures that if multiple idle workers are evaluated in the same reconciliation cycle, the min_workers guard remains accurate without waiting for the next API refresh.
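The guard chain reduces to a short-circuiting function that returns the audit label of the first failing guard, or a drain decision when all five pass. This is a simplified sketch; the argument shapes (`idle`, `state`, `cfg` dicts) are illustrative, not the controller's real types.

```python
def evaluate_scale_down(idle: dict, state: dict, cfg: dict, now: float) -> str:
    """Return the audit label for the first failing guard, or 'drain'."""
    if idle["auto_pause_triggered"]:      # Guard 1: auto-pause already acted
        return "skipped_auto_pause"
    if not idle["is_idle"]:               # Guard 2: worker is active
        return "skipped_not_idle"
    if not idle["eligible_for_pause"]:    # Guard 3: snooze / per-worker flags
        return "skipped_not_eligible"
    if state["running_count"] <= cfg["min_workers"]:  # Guard 4: minimum fleet
        return "skipped_min_workers"
    if now - state["last_scale_down_at"] < cfg["cooldown_seconds"]:  # Guard 5
        return "skipped_cooldown"
    return "drain"
```

Because the guards short-circuit in order, only the first applicable skip reason is ever recorded, which keeps the audit trail unambiguous.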
6. DRAINING State (ADR-008)¶
The DRAINING status provides a graceful transition between RUNNING and STOPPING:
stateDiagram-v2
direction LR
RUNNING --> DRAINING: DrainWorkerCommand
DRAINING --> STOPPING: Worker Controller reconcile
STOPPING --> STOPPED: EC2 instance stopped
note right of DRAINING
Worker accepts no new<br/>lablet assignments.<br/>Active workloads continue<br/>until completion.
end note
The DrainWorkerCommand handler:
- Validates the worker is in `RUNNING` status (returns 409 Conflict otherwise)
- Sets `status = DRAINING`
- Sets `desired_status = STOPPED`
- Records a scaling audit event
The Worker Controller's _handle_stopping method handles both STOPPING and DRAINING statuses with the same EC2 stop logic.
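The handler's transition logic amounts to a small, guarded state change. A minimal sketch, assuming a plain-dict worker model and raising `ValueError` where the real command returns 409 Conflict:

```python
def drain_worker(worker: dict) -> dict:
    """DrainWorkerCommand sketch: RUNNING -> DRAINING, desired state STOPPED."""
    if worker["status"] != "RUNNING":
        # The real handler surfaces this as HTTP 409 Conflict.
        raise ValueError(f"cannot drain worker in status {worker['status']}")
    worker["status"] = "DRAINING"          # stop accepting new lablet assignments
    worker["desired_status"] = "STOPPED"   # reconciler will stop the EC2 instance
    return worker
```

Splitting `status` from `desired_status` is what makes the drain graceful: the controller sees DRAINING, lets active workloads finish, and only then drives the worker toward STOPPED.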
7. End-to-End Scale-Up Timeline¶
sequenceDiagram
autonumber
participant User as User / System
participant RS as Resource Scheduler
participant CP as Control Plane API
participant WC as Worker Controller
participant EC2 as AWS EC2
participant CML as CML Instance
User->>CP: Create LabletInstance (PENDING)
CP->>CP: Set status = PENDING_SCHEDULING
Note over RS: Watch event or polling cycle
RS->>RS: PlacementEngine: no capacity
RS->>CP: RequestScaleUp(template=large)
CP->>CP: Create worker (PENDING)
Note over WC: Watch event (~0.5s)
WC->>EC2: launch_instance(m5zn.metal)
WC->>CP: Update status (PROVISIONING)
Note over EC2: EC2 instance booting (~2-5 min)
WC->>EC2: get_instance_state()
EC2-->>WC: running + IP
WC->>CP: Update status (RUNNING, ip_address)
Note over RS: Next reconciliation cycle
RS->>RS: PlacementEngine: worker available
RS->>CP: assign_instance(instance, worker)
Note over CML: Lablet Controller provisions lab
8. Observability¶
Scaling Audit Events¶
All scaling decisions are recorded via record_scaling_event() for observability:
| Event | Service | Description |
|---|---|---|
| `scale_up_accepted` | Control Plane API | New worker created from template |
| `scale_up_rejected` | Control Plane API | Constraint violation or error |
| `provisioned` | Worker Controller | EC2 instance launched |
| `scale_down_initiated` | Worker Controller | Drain command sent |
| `scale_down_failed` | Worker Controller | Drain command failed |
| `skipped_auto_pause` | Worker Controller | Scale-down skipped (auto-pause acted) |
| `skipped_not_idle` | Worker Controller | Scale-down skipped (worker active) |
| `skipped_not_eligible` | Worker Controller | Scale-down skipped (not eligible) |
| `skipped_min_workers` | Worker Controller | Scale-down skipped (min fleet) |
| `skipped_cooldown` | Worker Controller | Scale-down skipped (cooldown) |
| `drained` | Worker Controller | Worker successfully drained |
9. Configuration Reference¶
Scale-Up Settings (Control Plane API)¶
| Setting | Env Variable | Default | Description |
|---|---|---|---|
| `max_workers_per_region` | `MAX_WORKERS_PER_REGION` | 10 | Maximum active workers per AWS region |
Scale-Down Settings (Worker Controller)¶
| Setting | Env Variable | Default | Description |
|---|---|---|---|
| `scale_down_enabled` | `SCALE_DOWN_ENABLED` | False | Enable automatic scale-down |
| `min_workers` | `MIN_WORKERS` | 0 | Minimum running workers to maintain |
| `scale_down_cooldown_seconds` | `SCALE_DOWN_COOLDOWN_SECONDS` | 600 | Cooldown between drains (10 min) |
Scheduler Settings (Resource Scheduler)¶
| Setting | Env Variable | Default | Description |
|---|---|---|---|
| `scheduling_interval` | `RECONCILE_INTERVAL` | 30 | Seconds between scheduling cycles |
| `max_retries` | `MAX_RETRIES` | 35 | Max retries for failed placements |
| `scheduling_polling_enabled` | `SCHEDULING_POLLING_ENABLED` | True | Enable polling-based scheduling |
Scale-Down Disabled by Default
`scale_down_enabled` defaults to False. Enable it only after validating idle detection thresholds and `min_workers` settings for your environment.
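The Worker Controller settings above can be loaded from the environment with the documented defaults. A sketch, assuming a hypothetical `load_scale_down_settings` helper (only the env variable names and defaults come from the table):

```python
import os

def _as_bool(value: str) -> bool:
    """Interpret common truthy strings from environment variables."""
    return value.strip().lower() in ("1", "true", "yes", "on")

def load_scale_down_settings(env=None) -> dict:
    """Read Worker Controller scale-down settings with documented defaults."""
    env = os.environ if env is None else env
    return {
        "scale_down_enabled": _as_bool(env.get("SCALE_DOWN_ENABLED", "false")),
        "min_workers": int(env.get("MIN_WORKERS", "0")),
        "cooldown_seconds": int(env.get("SCALE_DOWN_COOLDOWN_SECONDS", "600")),
    }
```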