# ADR-006: Resource Scheduler High Availability Coordination
| Attribute | Value |
|---|---|
| Status | Accepted |
| Date | 2026-01-16 |
| Deciders | Architecture Team |
| Related ADRs | ADR-002, ADR-005 |
## Context
With multiple Resource Scheduler replicas for HA, we need to prevent:
- Duplicate scheduling: Two schedulers assigning the same instance
- Lost assignments: An instance falls through the cracks between schedulers
- Conflicting decisions: Schedulers making incompatible placements
Options considered:
- Leader election - Single active resource scheduler at a time
- Optimistic locking - First writer wins via version checks
- Work partitioning - Each scheduler handles a subset of instances
- Distributed lock per instance - Fine-grained locking
## Decision
Use leader election for the Resource Scheduler (single active leader).
Only the leader resource scheduler makes placement decisions; the other replicas are hot standbys that take over if the leader fails.
## Rationale
### Why Leader Election?
- Simplicity: No distributed coordination per instance
- Deterministic: Clear ownership of scheduling responsibility
- Proven pattern: Used by Kubernetes scheduler
- etcd native: Built-in leader election support
### Why Not Optimistic Locking?
- Race conditions under high load
- Wasted work when multiple schedulers compute same placement
- Complexity in retry logic
### Why Not Work Partitioning?
- Partition rebalancing on scheduler failure
- Complexity in partition assignment
- Overkill for expected scale (<1000 instances)
## Consequences
### Positive
- Simple mental model (one scheduler active)
- No per-instance locking overhead
- Fast failover via etcd lease expiration
### Negative
- Single scheduler bottleneck (mitigated by fast placement algorithm)
- Failover latency (etcd lease TTL, typically 10-15 seconds)
## Implementation
### Leader Election with etcd
```python
import asyncio


class SchedulerLeaderElection:
    def __init__(self, etcd_client, instance_id: str):
        self.etcd = etcd_client
        self.instance_id = instance_id
        self.lease_ttl = 15  # seconds
        self.leader_key = "/lcm/scheduler/leader"
        self.lease = None  # set on campaign; refreshed while leader

    async def campaign(self) -> bool:
        """Attempt to become leader."""
        self.lease = await self.etcd.lease(self.lease_ttl)
        try:
            # Create the leader key; the client must issue this as an etcd
            # create-if-missing transaction so the put fails if the key exists
            await self.etcd.put(
                self.leader_key,
                self.instance_id,
                lease=self.lease,
            )
            return True  # We are leader
        except etcd.KeyExistsError:  # exception name depends on the etcd client
            return False  # Someone else is leader

    async def maintain_leadership(self):
        """Keep the lease alive while leader."""
        while True:
            await self.lease.refresh()
            await asyncio.sleep(self.lease_ttl / 3)

    async def watch_leader(self):
        """Watch for leader changes."""
        async for event in self.etcd.watch(self.leader_key):
            if event.type == "DELETE":
                # Leader lost; try to become leader
                await self.campaign()
```
### Scheduler Lifecycle
1. Scheduler starts
2. Attempts leader election (campaign)
3. If leader:
    a. Start scheduling loop
    b. Maintain leadership (lease refresh)
4. If not leader:
    a. Watch leader key
    b. On leader loss, attempt election
5. On shutdown:
    a. Release lease
    b. Another replica becomes leader
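The lifecycle above can be sketched end to end. The `FakeEtcd` stub and its `put_if_absent` method below are stand-ins for a real etcd client and its create-if-missing transaction, used only to make the sketch self-contained:

```python
import asyncio


class FakeEtcd:
    """In-memory stand-in for etcd, for illustration only."""
    def __init__(self):
        self.store = {}

    async def put_if_absent(self, key: str, value: str) -> bool:
        # Models etcd's create-if-missing transaction
        if key in self.store:
            return False
        self.store[key] = value
        return True


async def run_replica(etcd: FakeEtcd, instance_id: str) -> str:
    """Steps 1-4 of the lifecycle: start, campaign, then lead or stand by."""
    if await etcd.put_if_absent("/lcm/scheduler/leader", instance_id):
        return "leader"   # would now start the scheduling loop and lease refresh
    return "standby"      # would now watch the leader key for deletion
```

Running two replicas against the same store shows the invariant: the first campaign wins, every later one becomes a standby.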
### Failover Timeline
- T+0: Leader crashes
- T+0 to T+15: Lease TTL expires (configurable)
- T+15: etcd deletes leader key
- T+15: Standby detects deletion via watch
- T+15 to T+16: Standby campaigns and wins
- T+16: New leader starts scheduling

Total failover: ~15-20 seconds.
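The timeline reduces to a simple worst-case budget. The watch-latency and campaign-time defaults below are illustrative assumptions, not measurements:

```python
def worst_case_failover_s(lease_ttl_s: float,
                          watch_latency_s: float = 0.5,
                          campaign_s: float = 1.0) -> float:
    """Worst case: the leader dies immediately after a lease refresh,
    so the full TTL elapses before etcd deletes the leader key, then
    the standby must notice the deletion and win the campaign."""
    return lease_ttl_s + watch_latency_s + campaign_s
```

With the default 15-second TTL this lands at the low end of the ~15-20 second range quoted above; shrinking the TTL trades faster failover for more sensitivity to network flakiness.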
### Resource Controller HA
Same pattern applies to the Resource Controller:
- Leader election at `/lcm/controller/leader`
- Only the leader runs the reconciliation loop
- Hot standby replicas
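One way to share the mechanism is to make the election key a constructor argument. The `LeaderElection` class and `DictEtcd` stub below are hypothetical, sketching how a single class could serve both the scheduler and the controller:

```python
import asyncio


class DictEtcd:
    """In-memory stub for etcd's create-if-missing transaction."""
    def __init__(self):
        self.kv = {}

    async def put_if_absent(self, key: str, value: str) -> bool:
        if key in self.kv:
            return False
        self.kv[key] = value
        return True


class LeaderElection:
    """Like SchedulerLeaderElection, but the leader key is parameterized."""
    def __init__(self, etcd, leader_key: str, instance_id: str):
        self.etcd = etcd
        self.leader_key = leader_key
        self.instance_id = instance_id

    async def campaign(self) -> bool:
        # Win the election iff no one holds the key yet
        return await self.etcd.put_if_absent(self.leader_key, self.instance_id)
```

The two keys are independent, so a replica can lead the scheduler election while standing by for the controller, or vice versa.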
## Trade-offs vs Scale
| Scale | Recommendation |
|---|---|
| <100 instances | Leader election sufficient |
| 100-1000 instances | Leader election, optimize algorithm |
| >1000 instances | Consider work partitioning |
Given expected scale (<1000 concurrent instances), leader election is appropriate.
## Open Questions
- What lease TTL balances failover speed vs network flakiness?
- Should standby schedulers pre-warm caches (read state)?
- Should we expose leader status in health checks?
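If the last question is answered yes, the health payload could be as small as the sketch below; the field names are illustrative, not a decided schema:

```python
def health_payload(is_leader: bool, instance_id: str) -> dict:
    """Illustrative health-check body exposing the replica's election role."""
    return {
        "status": "ok",
        "role": "leader" if is_leader else "standby",
        "instance": instance_id,
    }
```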