ADR-043: Startup State Reconciliation and Discovery Separation
| Attribute | Value |
|---|---|
| Status | Accepted |
| Date | 2026-06-04 |
| Deciders | Platform Team |
| Related ADRs | ADR-031 (Checkpoint Pipeline), ADR-014 (Orphan Detection) |
| Knowledge Refs | AD-031, AD-043, AD-021 |
Context
External EC2 State Changes
CML workers run on AWS EC2 instances that can be stopped, terminated, or modified by external means (AWS Console, cost automation, scheduled events, AWS maintenance). The LCM application cannot assume that stored resource states are correct after a restart; it must verify all resources against their actual state on boot.
Current Behavior
The WatchTriggeredHostedService base class (AD-031) already runs a one-time startup
reconciliation sweep on leader election. However, this sweep calls list_resources(), which
filters out terminal statuses (STOPPED, TERMINATED, FAILED, etc.) on the assumption that
terminal resources don't need reconciliation.
This creates gaps when the controller is down:
- A STOPPED worker with desired_status=RUNNING won't be picked up (desired-state divergence missed).
- A STOPPED worker whose EC2 was terminated externally remains stale indefinitely.
- A RUNNING record whose EC2 was stopped is caught (non-terminal → _handle_running detects mismatch), but only because the filter happens to include it.
Design Questions Resolved
- Should reconciliation always verify current state and conditions (e.g. timeline validity)? → Yes. The reconciliation loop is the single authority for state alignment. It must verify the source of truth (EC2 API) rather than trusting stored status, especially at startup.
- Should discovery processes be extended to include state validation? → No. Discovery ("What exists?") and reconciliation ("Is it correct?") remain separate concerns. Discovery feeds data to the reconciler; mixing validation into discovery would violate SRP and create overlapping ownership of state transitions.
Decision
1. Startup Full-Sync for Worker-Controller
Add a startup full-sync mode to the worker-controller reconciler. When the startup
sweep runs, list_resources() returns ALL workers (including STOPPED) so the reconciler
can verify EC2 state and desired-state alignment for every worker.
Mechanism:
- Add a _startup_sweep_active: bool flag to WatchTriggeredHostedService.
- Set it True before the sweep, False after.
- Subclasses can check self._startup_sweep_active in list_resources() to adjust their filtering logic.
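The flag lifecycle above can be sketched as follows. This is a minimal illustration, not the actual base class: the class names ending in Sketch, the Worker dataclass, the TERMINAL_STATUSES set, and the in-memory worker list are assumptions made for the example. Resetting the flag in a finally block is also an assumption; it is one way to make sure the flag is cleared if the sweep raises.

```python
import asyncio
from dataclasses import dataclass

# Assumed subset of terminal statuses, for illustration only.
TERMINAL_STATUSES = {"STOPPED", "TERMINATED", "FAILED"}


@dataclass
class Worker:
    """Hypothetical worker record."""
    id: str
    status: str


class WatchTriggeredHostedServiceSketch:
    """Hypothetical stand-in for WatchTriggeredHostedService."""

    def __init__(self) -> None:
        self._startup_sweep_active: bool = False

    async def list_resources(self) -> list:
        raise NotImplementedError

    async def _reconcile_all(self) -> list:
        # The real method would reconcile each resource; here we only
        # return what would be reconciled.
        return await self.list_resources()

    async def _startup_reconcile_sweep(self) -> list:
        self._startup_sweep_active = True
        try:
            return await self._reconcile_all()
        finally:
            # Always clear the flag so steady-state filtering resumes.
            self._startup_sweep_active = False


class WorkerReconcilerSketch(WatchTriggeredHostedServiceSketch):
    """Hypothetical subclass that broadens its listing during startup."""

    def __init__(self, workers: list) -> None:
        super().__init__()
        self._workers = workers

    async def list_resources(self) -> list:
        if self._startup_sweep_active:
            return list(self._workers)  # startup: every worker
        return [w for w in self._workers if w.status not in TERMINAL_STATUSES]
```

Outside the startup sweep, list_resources() in this sketch keeps filtering terminal workers, so steady-state behavior is unchanged.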
2. STOPPED Worker EC2 Verification
When a STOPPED worker is reconciled during the startup sweep, verify its EC2 instance state:
- If EC2 is terminated → transition to TERMINATED.
- If EC2 is running → transition to RUNNING (unexpected restart detected).
- If desired_status differs from current → initiate appropriate transition.
- If at rest (EC2 stopped, desired=stopped) → no-op (success).
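The four cases above can be read as a small decision function. The sketch below is only an illustration: the Action enum and plan_stopped_worker are hypothetical names, not part of the codebase, and evaluating the cases in the order listed is an assumption. The strings "terminated", "running", and "stopped" are EC2 instance state names; transitional states such as "pending" or "stopping" are not covered by this ADR and simply fall through to the desired-status comparison here.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical outcomes of reconciling a STOPPED worker."""
    MARK_TERMINATED = "mark_terminated"
    MARK_RUNNING = "mark_running"
    START_TRANSITION = "start_transition"
    NOOP = "noop"


def plan_stopped_worker(ec2_state: str, desired_status: str) -> Action:
    """Decide what to do with a worker whose stored status is STOPPED.

    ec2_state is the EC2 instance state name; desired_status is the
    worker's desired_status field (e.g. "RUNNING" or "STOPPED").
    """
    if ec2_state == "terminated":
        return Action.MARK_TERMINATED   # instance is gone
    if ec2_state == "running":
        return Action.MARK_RUNNING      # unexpected restart detected
    if desired_status != "STOPPED":
        return Action.START_TRANSITION  # desired-state divergence
    return Action.NOOP                  # at rest: stopped and meant to be
```

For example, plan_stopped_worker("stopped", "RUNNING") yields Action.START_TRANSITION, which is the desired-state divergence the old filter missed.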
3. Discovery Remains Separate
Discovery processes (worker discovery, lab discovery) continue to focus exclusively on existence/absence detection. They do not perform state validation. They already fire early at startup (5s delay) to provide fresh data for the reconciler.
4. Lablet-Controller Already Covered
The lablet-controller's startup sweep (AD-031) reconciles all non-terminal sessions. Since lablet sessions depend on worker state (worker_ip resolution), the worker-controller's startup sweep MUST complete before the lablet-controller acts on sessions requiring worker connectivity. This is naturally ordered by the initial delay settings (worker-controller: 2s sweep delay; lablet-controller: 2s sweep delay + worker_ip enrichment retries).
Consequences
Positive
- Correctness: All stored states verified against reality on every boot.
- Resilience: External EC2 operations (manual stops, cost automation, maintenance) detected within seconds of leader election.
- Separation preserved: Discovery stays focused on existence; reconciliation owns state alignment.
- Minimal blast radius: Only the startup sweep is broadened; steady-state polling continues to filter terminal workers for efficiency.
Negative
- Startup load: First reconciliation cycle processes more workers (all vs. non-terminal). Mitigated by max_concurrent_reconciles=5 semaphore.
- EC2 API calls at boot: One describe_instances per stopped worker. Acceptable given m5zn.metal instances are few (typically < 20).
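To illustrate how a semaphore bounds the startup load, here is a minimal asyncio sketch. The function names reconcile_all and reconcile_one, and the way the limit is passed in, are assumptions for the example; the actual reconciler may wire max_concurrent_reconciles differently.

```python
import asyncio

MAX_CONCURRENT_RECONCILES = 5  # mirrors max_concurrent_reconciles=5


async def reconcile_all(worker_ids, reconcile_one,
                        limit=MAX_CONCURRENT_RECONCILES):
    """Run reconcile_one for every worker, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(worker_id):
        # Tasks beyond the limit wait here until a slot frees up.
        async with semaphore:
            return await reconcile_one(worker_id)

    # gather returns results in the same order as worker_ids.
    return await asyncio.gather(*(bounded(w) for w in worker_ids))
```

With fewer than 20 workers and a limit of 5, the broadened startup sweep adds only a few extra batches of EC2 calls.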
Neutral
- No API contract changes to Control Plane API.
- No new configuration required (uses existing startup_reconcile_enabled flag).
- Pattern is consistent with how the lablet-controller already operates.
Implementation
Files Modified
| File | Change |
|---|---|
| src/core/lcm_core/infrastructure/hosted_services/watch_triggered_hosted_service.py | Add _startup_sweep_active flag |
| src/worker-controller/application/hosted_services/worker_reconciler.py | Broaden list_resources() during startup; add EC2 verification to _handle_stopped() |
| src/core/tests/test_watch_triggered_hosted_service.py | Test startup flag lifecycle |
| src/worker-controller/tests/ | Test startup full-sync behavior |
Startup Sequence (Updated)
┌─── Leader Election ───────────────────────────────────────────────┐
│ 1. _become_leader()                                               │
│ 2. Start watch loop (etcd prefix monitoring)                      │
│ 3. Start metrics loop (independent, periodic)                     │
│ 4. Start discovery loop (independent, periodic)                   │
│ 5. _startup_reconcile_sweep():                                    │
│    a. Set _startup_sweep_active = True                            │
│    b. _reconcile_all() → list_resources() returns ALL workers     │
│    c. For each worker: verify EC2 state, align desired state      │
│    d. Set _startup_sweep_active = False                           │
│ 6. Resume normal polling/watch (list_resources filters terminals) │
└───────────────────────────────────────────────────────────────────┘