ADR-023: Content Sync Trigger via Reactive etcd Watch + Opt-In Polling¶
| Attribute | Value |
|---|---|
| Status | Accepted |
| Date | 2026-02-25 |
| Deciders | Architecture Team |
| Related ADRs | ADR-005 (Dual State Store), ADR-015 (CPA No External Calls), ADR-017 (Lab Operations via Lablet-Controller) |
| Extends | ADR-005 (etcd key namespace), ADR-017 (reconciliation pattern) |
| Implementation | Content Synchronization Plan §1.3, §4.2, §4.6, §6.1 |
Context¶
LabletDefinitions reference content packages authored in Mosaic (an external content authoring platform). When a user creates or updates a definition, the content must be downloaded from Mosaic, stored in RustFS (S3-compatible object storage), and metadata extracted for use during LabletSession instantiation.
The system needs a mechanism for the lablet-controller's ContentSyncService to learn that a definition requires content synchronization. Three options were considered:
- Polling only: lablet-controller polls the CPA internal API for definitions with `sync_status=sync_requested`
- Reactive etcd watch: CPA emits a domain event → etcd projector writes a key → lablet-controller watches the prefix and reacts immediately
- CloudEvent webhook: CPA sends a CloudEvent to lablet-controller when sync is requested
Established Pattern¶
The codebase already uses reactive etcd watches as the primary reconciliation trigger for all controller operations:
| Feature | etcd Key Pattern | Writer (CPA Projector) | Watcher (Controller) |
|---|---|---|---|
| Worker desired state | `/lcm/workers/{id}/desired_state` | `CMLWorkerDesiredStatusUpdatedEtcdProjector` | worker-controller |
| Session state | `/lcm/sessions/{id}/state` | `SessionStateEtcdProjector` | lablet-controller |
| Lab pending action | `/lcm/lab_records/{id}/pending_action` | `LabActionRequestedEtcdProjector` | `LabRecordReconciler` |
| Content sync (NEW) | `/lcm/definitions/{id}/content_sync` | `ContentSyncRequestedEtcdProjector` | `ContentSyncService` |
The LabRecordReconciler (ADR-017) is the most directly analogous pattern: a standalone watcher class (not a HostedService) with lifecycle managed by LabletReconciler, using exponential backoff reconnection and PUT-only event handling.
Decision¶
1. Reactive etcd Watch as Primary Trigger¶
When a user requests content synchronization, the CPA follows this reactive flow:
```
User → "Synchronize" → CPA SyncLabletDefinitionCommand
  → sets sync_status="sync_requested" in MongoDB
  → emits LabletDefinitionSyncRequestedDomainEvent
  → ContentSyncRequestedEtcdProjector writes /lcm/definitions/{id}/content_sync
  → lablet-controller ContentSyncService._watch_loop() reacts immediately
  → executes sync pipeline
  → reports results to CPA via internal API
  → CPA emits LabletDefinitionContentSyncedDomainEvent
  → ContentSyncCompletedEtcdProjector DELETES the etcd key (cleanup)
```
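The write-then-delete key lifecycle in this flow can be sketched with an in-memory stand-in for the etcd client. The `InMemoryEtcd` class and the two `project_*` helpers below are illustrative, not the real projector API:

```python
import json

# Key template matching the convention in this ADR
DEFINITION_CONTENT_SYNC_KEY = "/lcm/definitions/{id}/content_sync"


class InMemoryEtcd:
    """Stand-in for the real etcd client, for illustration only."""

    def __init__(self):
        self.store = {}

    def put(self, key, value):
        self.store[key] = value

    def delete(self, key):
        self.store.pop(key, None)


def project_sync_requested(etcd, definition_id, payload):
    # ContentSyncRequestedEtcdProjector: write the trigger key
    etcd.put(DEFINITION_CONTENT_SYNC_KEY.format(id=definition_id),
             json.dumps(payload))


def project_sync_completed(etcd, definition_id):
    # ContentSyncCompletedEtcdProjector: delete the key on completion
    etcd.delete(DEFINITION_CONTENT_SYNC_KEY.format(id=definition_id))


etcd = InMemoryEtcd()
project_sync_requested(etcd, "def-abc-123", {"definition_id": "def-abc-123"})
assert "/lcm/definitions/def-abc-123/content_sync" in etcd.store
project_sync_completed(etcd, "def-abc-123")
assert etcd.store == {}
```

The key's existence is itself the signal: while it is present, a sync is requested or in flight; once the completion projector deletes it, the lifecycle is over.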
2. etcd Key Convention¶
Key: `/lcm/definitions/{id}/content_sync`

Payload (JSON):

```json
{
  "definition_id": "def-abc-123",
  "form_qualified_name": "Exam Associate CCNA v1.1 LAB 1.3a",
  "bucket_name": "exam-associate-ccna-v1.1-lab-1.3a",
  "user_session_package_name": "SVN.zip",
  "requested_by": "user@example.com",
  "requested_at": "2026-02-25T10:00:00Z"
}
```
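On the watcher side, the payload can be deserialized into a typed object before the pipeline runs. The `ContentSyncPayload` dataclass and `from_json` helper here are a hypothetical validation sketch, not part of the codebase:

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ContentSyncPayload:
    definition_id: str
    form_qualified_name: str
    bucket_name: str
    user_session_package_name: str
    requested_by: str
    requested_at: str

    @classmethod
    def from_json(cls, raw: str) -> "ContentSyncPayload":
        # Raises TypeError on missing or unknown fields, so a malformed
        # etcd value is rejected before the sync pipeline starts.
        return cls(**json.loads(raw))


payload = ContentSyncPayload.from_json(
    '{"definition_id": "def-abc-123",'
    ' "form_qualified_name": "Exam Associate CCNA v1.1 LAB 1.3a",'
    ' "bucket_name": "exam-associate-ccna-v1.1-lab-1.3a",'
    ' "user_session_package_name": "SVN.zip",'
    ' "requested_by": "user@example.com",'
    ' "requested_at": "2026-02-25T10:00:00Z"}'
)
```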
3. CPA-Side Components¶
| Component | File | Purpose |
|---|---|---|
| `LabletDefinitionSyncRequestedDomainEvent` | `domain/events/lablet_definition_events.py` | Emitted by `SyncLabletDefinitionCommandHandler` |
| `LabletDefinitionContentSyncedDomainEvent` | `domain/events/lablet_definition_events.py` | Emitted by `RecordContentSyncResultCommandHandler` |
| `EtcdStateStore.DEFINITION_CONTENT_SYNC_KEY` | `integration/services/etcd_state_store.py` | Key template: `/definitions/{id}/content_sync` |
| `EtcdStateStore.set_definition_content_sync()` | `integration/services/etcd_state_store.py` | Writes etcd key with JSON payload |
| `EtcdStateStore.delete_definition_content_sync()` | `integration/services/etcd_state_store.py` | Deletes etcd key (cleanup) |
| `ContentSyncRequestedEtcdProjector` | `application/events/domain/etcd_state_projector.py` | DomainEventHandler → writes key |
| `ContentSyncCompletedEtcdProjector` | `application/events/domain/etcd_state_projector.py` | DomainEventHandler → deletes key |
4. Lablet-Controller ContentSyncService¶
Follows the LabRecordReconciler pattern exactly:
- Singleton (not a HostedService) → lifecycle managed by `LabletReconciler._become_leader()`/`_step_down()`
- `_watch_loop()` with exponential backoff reconnection (1s → 30s max)
- `_handle_watch_event()` reacts to PUT events only
- `_get_watch_prefix()` → `{etcd_key_prefix}/definitions/`
- DI: `configure()` classmethod registers via factory
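The backoff and PUT-only filtering described above can be sketched as follows. This assumes a simple string event type rather than the real etcd watch event objects:

```python
import itertools


def backoff_delays(base=1.0, cap=30.0):
    """Yield reconnection delays: 1s, 2s, 4s, ... capped at 30s."""
    for attempt in itertools.count():
        yield min(cap, base * (2 ** attempt))


def should_handle(event_type: str) -> bool:
    """React to PUT events only; DELETE events (key cleanup after a
    completed sync) are ignored rather than re-triggering work."""
    return event_type == "PUT"


# First seven reconnection delays: growth doubles, then hits the cap.
delays = list(itertools.islice(backoff_delays(), 7))
# delays == [1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

On a successful reconnection, the real loop would reset the generator so the next disconnect starts again at 1s.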
5. Opt-In Polling Fallback¶
Polling is available as an additional consistency measure, disabled by default:
| Setting | Default | Purpose |
|---|---|---|
| `CONTENT_SYNC_WATCH_ENABLED` | `true` | Primary: reactive etcd watch |
| `CONTENT_SYNC_POLL_ENABLED` | `false` | Fallback: periodic poll (opt-in) |
| `CONTENT_SYNC_POLL_INTERVAL` | `300` | Seconds between polls (if enabled) |
When enabled, the poll loop queries `GET /api/internal/lablet-definitions?sync_status=sync_requested` and processes any definitions missed during etcd watch reconnection gaps.
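A single pass of the fallback can be sketched as below. `fetch_pending` stands in for the internal-API query and `run_sync_pipeline` for the real sync work; both names and the stub data are illustrative:

```python
def poll_once(fetch_pending, run_sync_pipeline):
    """One pass of the opt-in fallback: fetch definitions still marked
    sync_requested and hand any the watch missed to the sync pipeline."""
    processed = []
    for definition in fetch_pending():
        run_sync_pipeline(definition)
        processed.append(definition["definition_id"])
    return processed


# Illustrative usage with stubbed dependencies:
pending = [{"definition_id": "def-abc-123"}, {"definition_id": "def-xyz-789"}]
synced = []
processed = poll_once(lambda: pending,
                      lambda d: synced.append(d["definition_id"]))
```

Because the query filters on `sync_status=sync_requested`, a definition already synced via the watch path is simply absent from the next poll, so the two triggers never double-process.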
Rationale¶
Why reactive etcd watch (not polling-only)?¶
- Consistency: All existing controller reconciliation uses etcd watches as primary trigger
- Low latency: Sub-second reaction time vs. polling interval delay
- Proven pattern: `LabRecordReconciler`, worker desired state, and session state all use this approach
- Clean lifecycle: etcd key is created on request and deleted on completion → self-documenting
Why opt-in polling (not mandatory)?¶
- Edge case coverage: etcd watch reconnection gaps could miss events
- Startup catch-up: New leader may have pending sync requests from before election
- Minimal overhead: 300s interval with simple HTTP query when enabled
Why not CloudEvent webhook?¶
- Adds a reverse dependency (CPA → lablet-controller)
- CloudEvents are used for external system integration (LDS, GradingEngine), not internal coordination
- etcd watches are the established internal coordination mechanism
Why not polling-only?¶
- Inconsistent with the architecture โ all other reconciliation uses etcd watches
- Higher latency (minimum one poll interval)
- Unnecessary load from frequent polling
Consequences¶
Positive¶
- Consistent with all existing controller reconciliation patterns
- Immediate reaction to sync requests (sub-second via etcd watch)
- Self-cleaning: etcd key lifecycle mirrors the sync lifecycle
- Polling fallback provides defense-in-depth for data consistency
- Pattern is well-understood by the team (identical to `LabRecordReconciler`)
Negative¶
- Two code paths (watch + optional poll) increase testing surface
- etcd becomes a dependency for content sync flow (already a dependency for all reconciliation)
- Additional etcd key namespace to manage
Risks¶
- etcd unavailability blocks immediate sync (mitigated by opt-in polling fallback)
- Key cleanup failure could leave stale keys (mitigated by ContentSyncCompletedEtcdProjector)