Background Task Scheduling¶
Version: 2.0.0 (March 2026) · Status: Current Implementation
APScheduler Removed (ADR-011)
APScheduler was removed in January 2026 per ADR-011.
All background operations now use reconciliation loops via Neuroglia HostedService pattern.
Overview¶
The Lablet Cloud Manager implements background task scheduling through HostedService reconciliation loops — a pattern inspired by Kubernetes controllers. Each background operation runs as a leader-elected hosted service that:
- Watches etcd for state changes (reactive)
- Periodically polls the Control Plane API (fallback)
- Reconciles desired state vs. actual state
- Reports results back via the Control Plane API
Architecture¶
```mermaid
graph TD
    subgraph "Application Startup"
        A[WebApplicationBuilder] --> B[Register HostedServices]
        B --> C[build_app_with_lifespan]
        C --> D[Auto-start all HostedServices]
    end
    subgraph "HostedService Lifecycle"
        D --> E{Leader Election}
        E -->|Won| F[Run Reconciliation Loop]
        E -->|Lost| G[Watch for Leader Failure]
        G -->|Leader Failed| E
    end
    subgraph "Reconciliation Loop"
        F --> H[list_resources / on_watch_event]
        H --> I[reconcile per resource]
        I --> J{Result}
        J -->|SUCCESS| K[Done]
        J -->|REQUEUE| H
        J -->|FAILED| L[Log + Retry]
    end
```
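The loop in the diagram can be sketched with plain asyncio. Everything below (the `Resource` record, the `ReconcileStatus` enum, the driver) is illustrative, not the lcm-core API: each pass lists resources, reconciles them one by one, and maps outcomes onto SUCCESS, REQUEUE, and FAILED.

```python
import asyncio
from dataclasses import dataclass
from enum import Enum, auto

class ReconcileStatus(Enum):
    SUCCESS = auto()
    REQUEUE = auto()
    FAILED = auto()

@dataclass
class Resource:
    name: str
    desired: str
    actual: str

async def reconcile(resource: Resource) -> ReconcileStatus:
    """Converge one resource a single step toward its desired state."""
    if resource.actual == resource.desired:
        return ReconcileStatus.SUCCESS
    resource.actual = resource.desired  # one convergence step (illustrative)
    return ReconcileStatus.REQUEUE      # verify again on the next pass

async def reconcile_all(resources: list[Resource]) -> dict[str, ReconcileStatus]:
    """One polling pass: reconcile every listed resource, isolating failures."""
    results: dict[str, ReconcileStatus] = {}
    for resource in resources:
        try:
            results[resource.name] = await reconcile(resource)
        except Exception:
            results[resource.name] = ReconcileStatus.FAILED  # log + retry later
    return results

resources = [
    Resource("session-1", desired="RUNNING", actual="PENDING"),
    Resource("session-2", desired="RUNNING", actual="RUNNING"),
]
first = asyncio.run(reconcile_all(resources))
second = asyncio.run(reconcile_all(resources))
```

The REQUEUE branch is what makes the loop level-triggered: a resource that changed this pass is verified again on the next one rather than assumed converged.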
HostedService Hierarchy¶
All background services extend from lcm-core base classes:
```
HostedService (Neuroglia)
└── ReconciliationHostedService[T] (lcm-core)
    ├── LeaderElectedHostedService[T] (lcm-core)
    │   ├── CleanupHostedService (resource-scheduler)
    │   ├── LabRecordReconciler (lablet-controller)
    │   └── ContentSyncService (lablet-controller)
    └── WatchTriggeredHostedService[T] (lcm-core)
        ├── SchedulerHostedService (resource-scheduler)
        ├── LabletReconciler (lablet-controller)
        └── WorkerReconciler (worker-controller)
```
Base Class Features¶
| Base Class | Features |
|---|---|
| `ReconciliationHostedService[T]` | Polling loop, per-resource `reconcile()`, metrics, stats |
| `LeaderElectedHostedService[T]` | + etcd leader election; only the leader runs the loop |
| `WatchTriggeredHostedService[T]` | + etcd watch for reactive triggering, debounce, dual-mode |
Service Inventory¶
Resource Scheduler¶
| Service | Base Class | Purpose | Interval |
|---|---|---|---|
| `SchedulerHostedService` | `WatchTriggeredHostedService` | Schedule PENDING sessions to workers | 30s poll + watch |
| `CleanupHostedService` | `LeaderElectedHostedService` | Remove terminated worker records | 3600s (1 hour) |
Worker Controller¶
| Service | Base Class | Purpose | Interval |
|---|---|---|---|
| `WorkerReconciler` | `WatchTriggeredHostedService` | Reconcile CML workers vs EC2 | 60s poll + watch |
Lablet Controller¶
| Service | Base Class | Purpose | Interval |
|---|---|---|---|
| `LabletReconciler` | `WatchTriggeredHostedService` | Reconcile session lifecycle (pipeline execution) | 30s poll + watch |
| `LabRecordReconciler` | `LeaderElectedHostedService` | Sync lab records from CML | 1800s (30 min) |
| `ContentSyncService` | `LeaderElectedHostedService` | Sync content packages from upstream sources | On-demand |
Configuration Pattern¶
Each hosted service is registered in `main.py` via a `configure()` static method:

```python
# In main.py::create_app()
SchedulerHostedService.configure(builder.services, settings)

# Register as HostedService for automatic lifecycle management
def scheduler_factory(sp) -> HostedService:
    return sp.get_required_service(SchedulerHostedService)

builder.services.add_singleton(HostedService, implementation_factory=scheduler_factory)
```
ReconciliationConfig¶
```python
ReconciliationConfig(
    interval_seconds=30,            # Polling interval
    initial_delay_seconds=5.0,      # Delay before first poll
    polling_enabled=True,           # Enable/disable polling mode
    max_concurrent_reconciles=10,   # Parallel reconciliation limit
    service_name="resource-scheduler",
)
```
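As a rough illustration of how such a config can be modeled, here is a frozen dataclass with the same field names and defaults as the snippet above; the validation in `__post_init__` is an assumption for the sketch, not the lcm-core implementation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReconciliationConfig:
    service_name: str
    interval_seconds: float = 30.0        # Polling interval
    initial_delay_seconds: float = 5.0    # Delay before first poll
    polling_enabled: bool = True          # Enable/disable polling mode
    max_concurrent_reconciles: int = 10   # Parallel reconciliation limit

    def __post_init__(self) -> None:
        # Illustrative guard rails: reject obviously broken settings early.
        if self.interval_seconds <= 0:
            raise ValueError("interval_seconds must be positive")
        if self.max_concurrent_reconciles < 1:
            raise ValueError("max_concurrent_reconciles must be >= 1")

cfg = ReconciliationConfig(service_name="resource-scheduler")
```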
LeaderElectionConfig¶
```python
LeaderElectionConfig(
    etcd_endpoints=["localhost:2379"],
    lease_ttl_seconds=15,
    service_name="resource-scheduler",
)
```
WatchConfig¶
```python
WatchConfig(
    enabled=True,
    prefix="/sessions/",    # etcd key prefix to watch
    debounce_seconds=0.5,   # Debounce rapid events
)
```
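The debounce behavior can be illustrated with a small asyncio coalescer: a burst of watch events within the debounce window collapses into a single reconciliation trigger. The `Debouncer` helper below is hypothetical, not the lcm-core implementation:

```python
import asyncio

class Debouncer:
    """Coalesce bursts of watch events into a single trigger."""

    def __init__(self, debounce_seconds: float):
        self.debounce_seconds = debounce_seconds
        self.triggers = 0
        self._task: asyncio.Task | None = None

    def on_event(self) -> None:
        # Each event (re)starts the debounce timer; only the last one fires.
        if self._task is not None:
            self._task.cancel()
        self._task = asyncio.ensure_future(self._fire_later())

    async def _fire_later(self) -> None:
        await asyncio.sleep(self.debounce_seconds)
        self.triggers += 1  # here: kick the reconciliation loop

async def main() -> int:
    debouncer = Debouncer(debounce_seconds=0.05)
    for _ in range(10):          # burst of 10 rapid etcd events...
        debouncer.on_event()
        await asyncio.sleep(0.01)
    await asyncio.sleep(0.2)     # let the debounce window elapse
    return debouncer.triggers    # ...collapses into a single trigger

triggers = asyncio.run(main())
```

Debouncing matters here because an etcd prefix like `/sessions/` can see many writes per second during bulk operations, and reconciliation passes are comparatively expensive.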
ReconciliationResult¶
Each reconcile() call returns a structured result:
```python
class ReconciliationResult:
    status: ReconciliationStatus          # SUCCESS, REQUEUE, FAILED
    message: str
    exception: Exception | None
    after_seconds: float | None           # Delay before requeue

    @staticmethod
    def success(message: str) -> "ReconciliationResult": ...

    @staticmethod
    def requeue(message: str, after_seconds: float | None = None) -> "ReconciliationResult": ...

    @staticmethod
    def failed(message: str, exception: Exception | None = None) -> "ReconciliationResult": ...
```
Lifecycle Management¶
Startup¶
- `WebApplicationBuilder` registers hosted services as singletons
- `build_app_with_lifespan()` creates an ASGI lifespan that starts all `HostedService` instances
- Each service campaigns for leadership via etcd
- The leader starts its reconciliation loop; standbys watch for leader failure
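The start/stop choreography can be sketched with a plain `asynccontextmanager` lifespan; the stub class below stands in for real `HostedService` instances, and the start/stop method names mirror the text rather than the Neuroglia API:

```python
import asyncio
from contextlib import asynccontextmanager

class StubHostedService:
    """Minimal stand-in for a HostedService with start/stop hooks."""

    def __init__(self, name: str):
        self.name = name
        self.running = False

    async def start_async(self) -> None:
        self.running = True

    async def stop_async(self) -> None:
        self.running = False

@asynccontextmanager
async def lifespan(services):
    # Startup: start every registered hosted service.
    for service in services:
        await service.start_async()
    try:
        yield
    finally:
        # Shutdown: stop services in reverse registration order.
        for service in reversed(services):
            await service.stop_async()

async def main():
    services = [StubHostedService("scheduler"), StubHostedService("cleanup")]
    async with lifespan(services):
        states_while_running = [s.running for s in services]
    states_after_shutdown = [s.running for s in services]
    return states_while_running, states_after_shutdown

running, stopped = asyncio.run(main())
```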
Shutdown¶
- ASGI lifespan shutdown signal received
- Each `HostedService.stop_async()` is called
- Running reconciliations complete gracefully
- etcd leases are revoked (leadership released)
- Clean shutdown completes
Failover¶
- Leader's etcd lease expires (TTL, typically 15s)
- Standbys receive watch notification (key deleted)
- First standby to campaign wins leadership
- New leader starts reconciliation loop
- Failover time: ~TTL/3 to TTL seconds
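The failover window follows directly from the lease figures in the text. With the default TTL of 15 seconds, detection ranges from roughly TTL/3 to the full TTL, depending on when the leader's last successful keepalive landed:

```python
lease_ttl_seconds = 15.0

# Per the text, failover lands between roughly TTL/3 and TTL seconds:
fastest_failover = lease_ttl_seconds / 3   # best case, ~5 s
slowest_failover = lease_ttl_seconds       # worst case, full TTL
```

Shrinking the TTL tightens this window at the cost of more keepalive traffic and a higher risk of spurious failovers under etcd or network latency.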
Observability¶
Each hosted service inherits metrics from the base classes:
| Metric | Description |
|---|---|
| Reconciliation count | Total reconcile() calls |
| Success/failure counts | Per-resource outcomes |
| Reconciliation duration | Time spent in reconcile() |
| Leader status | Is this instance the leader? |
| Watch events processed | etcd watch events handled |
Services add custom metrics via `infrastructure/observability/` (OpenTelemetry).
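As a rough, stdlib-only illustration of the kind of per-service stats the base classes track (the real implementation exports these through OpenTelemetry; the `ReconcileStats` class here is hypothetical):

```python
import time

class ReconcileStats:
    """Track reconcile counts and durations for one hosted service."""

    def __init__(self) -> None:
        self.total = 0
        self.successes = 0
        self.failures = 0
        self.total_duration = 0.0

    def record(self, fn) -> bool:
        # Time a single reconcile() call and record its outcome.
        started = time.monotonic()
        try:
            fn()
            self.successes += 1
            return True
        except Exception:
            self.failures += 1
            return False
        finally:
            self.total += 1
            self.total_duration += time.monotonic() - started

def failing_reconcile() -> None:
    raise RuntimeError("transient error")

stats = ReconcileStats()
stats.record(lambda: None)       # a successful pass
stats.record(failing_reconcile)  # a failing pass
```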
Migration from APScheduler (Historical)¶
Per ADR-011, the original APScheduler-based system was replaced:
| Original (APScheduler) | Current (HostedService) |
|---|---|
| `AutoImportWorkersJob` | `WorkerReconciler._run_discovery_loop()` |
| `WorkerMetricsCollectionJob` | `WorkerReconciler._collect_and_report_metrics()` |
| `ActivityDetectionJob` | `WorkerReconciler._detect_activity()` |
| `LabsRefreshJob` | `LabRecordReconciler` |
| `LicenseRegistrationJob` | `RegisterWorkerLicenseCommand` (on-demand) |
| `OnDemandWorkerDataRefreshJob` | `RefreshWorkerDataCommand` (on-demand) |
Benefits of Migration¶
- Single paradigm: All continuous operations use reconciliation loops
- ADR-001 compliance: Controllers never access repositories directly
- Simplified deployment: No separate background worker process
- Better observability: Consistent logging/tracing through reconcilers
Related Documentation¶
- Resource Scheduler Architecture
- Worker Controller — worker reconciliation
- Lablet Controller — session lifecycle reconciliation
- ADR-011: APScheduler Removal