MVP Implementation Plan¶
| Attribute | Value |
|---|---|
| Document Version | 4.2.0 |
| Status | Authoritative |
| Created | 2026-02-08 |
| Last Updated | 2026-02-20 |
| Author | LCM Architecture Team |
| Related | Codebase Discovery Audit, Requirements Spec, ADR-020, ADR-021, ADR-022, Phase 7 Execution Plan |
1. Executive Summary¶
This document is the single authoritative implementation plan for the Lablet Cloud Manager MVP. It is derived from the Codebase Discovery Audit.
Critical Insight: Foundation First¶
Dependency Chain
The MVP cannot proceed to lablet lifecycle (LDS/Grading) without a solid worker management foundation.
Correct dependency order:
Worker Capacity → Resource Scheduling → Auto-Scaling → LDS → Frontend → Session Entity Model → Grading
You cannot schedule lablet sessions without knowing worker capacity. You cannot auto-scale without tracking resource usage. You cannot implement grading without the session entity model (ADR-020/021/022).
Current State Analysis¶
| Component | State | Evidence |
|---|---|---|
| `declared_capacity` on Worker | ✅ Exists | Set from WorkerTemplate |
| `allocated_capacity` on Worker | ✅ Updated | Updated via AllocateCapacity/ReleaseCapacity commands (Phase 1) |
| PlacementEngine capacity check | ✅ Complete | Uses etcd capacity data (Phase 2 enhanced) |
| `ScheduleLabletSessionCommand` | ✅ Fixed | Validates capacity + allocates on schedule (Phase 1) |
| Worker metrics collection | ✅ Works | CloudWatch + CML stats |
| Activity detection | ✅ Works | Idle detection functional |
| Worker provisioning | ✅ Complete | EC2 provisioning via WorkerTemplateService (Phase 3) |
| Auto-scaling triggers | ✅ Complete | Scale-up (RS) + scale-down (WC) with safety guards (Phase 3) |
| LDS integration | 🔄 ~90% | LDS SPI client, session provisioning, mark-ready, session.started handling, archival |
| Grading integration | ⬜ Missing | No client, no collection flow |
MVP Scope¶
| Capability | Requirements | Phase | Current State |
|---|---|---|---|
| Worker Capacity Tracking | FR-2.4.1, FR-2.4.3 | 1 | ✅ Complete |
| Resource-Aware Scheduling | FR-2.3.2a-e | 2 | ✅ Complete |
| Auto-Scaling (Basic) | FR-2.5.1a,c,d; FR-2.5.2a-b | 3 | ✅ Complete |
| LDS Session Provisioning | FR-2.2.5, FR-2.2.6 | 4 | ✅ Complete (staging validated) |
| SSE & Frontend Readiness | FR-2.1, FR-2.2, FR-2.4 | 6 | 🔄 ~85% (G1+G3+G4+G5+F1+F2+F3+F8 done) |
| Session Entity Model | FR-2.2.1, FR-2.2.2, ADR-020/021/022 | 7 | ✅ Complete |
| Grading Integration | FR-2.6.1, FR-2.6.2 | 5 | ⬜ Ready to start (Phase 7 complete) |
Deferred (Post-MVP)¶
| Capability | Requirements | Rationale |
|---|---|---|
| Grading Integration | FR-2.6.1, FR-2.6.2 | Phase 7 complete — ready to start; requires GradingSession, ScoreReport, CloudEventIngestor |
| Warm Pool | FR-2.7.1 | Optimization, not blocking |
| Advanced Auto-Scaling | FR-2.5.1b, FR-2.5.2c-d | Basic scale-up/down sufficient |
| S3/MinIO Artifact Sync | FR-2.1.5 | Manual artifact management acceptable |
Timeline Summary¶
Phase 0: Domain Prerequisites ████████████████████ ✅ COMPLETE (2026-02-08)
Phase 1: Worker Foundation ████████████████████ ✅ COMPLETE (2026-02-09)
Phase 2: Resource Scheduling ████████████████████ ✅ COMPLETE (2026-02-10)
Phase 3: Auto-Scaling ████████████████████ ✅ COMPLETE (2026-02-08)
Phase 4: LDS Integration ████████████████████ ✅ COMPLETE (2026-02-10)
Phase 6: SSE & Frontend █████████████████░░░ 🔄 ~85% (G1+G3+G4+G5+F1+F2+F3+F8 done)
Phase 7: Session Entity Model ████████████████████ ✅ COMPLETE (2026-02-20)
Phase 5: Grading Integration ░░░░░░░░░░░░░░░░░░░░ ⬜ Blocked by Phase 7 → Ready to start
─────────────────────
Execution order: 0→1→2→3→4→6→7→5
Phase 7 COMPLETE — Phase 5 unblocked
2. Phase Dependencies¶
flowchart TD
subgraph P0["Phase 0: Domain Prerequisites ✅"]
P0A[Add READY state to LabletSessionStatus]
P0B[Add form_qualified_name to LabletDefinition]
end
subgraph P1["Phase 1: Worker Foundation ✅"]
P1A[Fix allocated_capacity updates]
P1B[Add UpdateWorkerCapacityCommand]
P1C[Verify metrics collection flow]
end
subgraph P2["Phase 2: Resource Scheduling ✅"]
P2A[Complete PlacementEngine integration]
P2B[Capacity validation in ScheduleCommand]
P2C[Capacity allocation on schedule]
end
subgraph P3["Phase 3: Auto-Scaling ✅"]
P3A[Implement worker provisioning]
P3B[Scale-up trigger logic]
P3C[Scale-down trigger logic]
end
subgraph P4["Phase 4: LDS Integration ✅"]
P4A[LDS SPI Client]
P4B[Session provisioning in LabletReconciler]
P4C[MarkSessionReady + HandleSessionStarted]
end
subgraph P6["Phase 6: SSE & Frontend 🔄"]
P6A[G1: Fix SSE pipeline ✅]
P6B[F2: LDS Session Display ✅]
P6C[F1: Reservation UI ✅]
P6D[F3: Capacity Dashboard ✅]
P6E[G4/G5: Test coverage ✅]
end
subgraph P7["Phase 7: Session Entity Model ✅"]
P7A[LabletSession aggregate + LabletSessionStatus enum]
P7B[UserSession / GradingSession / ScoreReport entities]
P7C[CQRS commands + queries for child entities]
P7D[MongoDB repos for 4 new collections]
P7E[CloudEventIngestor in lablet-controller]
P7F[GradingSPI adapter skeleton]
P7G[Data migration + etcd key migration]
end
subgraph P5["Phase 5: Grading Integration ⬜"]
P5A[Collection service]
P5B[Grading SPI Client integration]
P5C[CloudEvent handlers for grading.completed/failed]
end
P0 --> P1
P1 --> P2
P2 --> P3
P2 --> P4
P3 --> P4
P4 --> P6
P6 --> P7
P7 --> P5
3. Phase 0: Domain Prerequisites¶
Status: ✅ COMPLETE (2026-02-08)
Goal: Prepare domain models for LDS integration (required for Phase 4).
Duration: ~1 week
Requirements: FR-2.2.1, FR-2.2.5h, FR-2.1.6
Phase 0 Complete
All 8 tasks completed. 21 new tests added (210 total domain tests pass).
Key artifacts: LabletSessionStatus.READY, LabletSessionReadyDomainEvent,
form_qualified_name on LabletDefinition.
Breaking change: INSTANTIATING→RUNNING is now invalid; must transition through READY.
Bootstrap prompt: PHASE_0_BOOTSTRAP.md
ADR-020/021 Impact
Phase 0 was implemented using the original LabletInstance naming. Per ADR-020, LabletInstance
is renamed to LabletSession. Per ADR-021, lds_session_id and lds_login_url move from the
session aggregate to the UserSession child entity. Phase 7 implements these migrations.
3.1 Tasks¶
| ID | Task | Service | File(s) | Estimate |
|---|---|---|---|---|
| P0-1 | Add `READY` state to `LabletSessionStatus` | control-plane-api | `domain/enums.py` | 2h |
| P0-2 | Update `LABLET_SESSION_VALID_TRANSITIONS` | control-plane-api | `domain/enums.py` | 1h |
| P0-3 | Add `form_qualified_name` to `LabletDefinition` | control-plane-api | `domain/entities/lablet_definition.py` | 2h |
| P0-4 | Add `lds_session_id`, `lds_login_url` to `LabletSession` (superseded by ADR-021: fields move to UserSession in Phase 7) | control-plane-api | `domain/entities/lablet_session.py` | 2h |
| P0-5 | Add `LabletSessionReadyDomainEvent` | control-plane-api | `domain/events/lablet_session_events.py` | 1h |
| P0-6 | Update `LabletSessionReadModel` | lcm-core | `domain/entities/read_models/lablet_session_read_model.py` | 1h |
| P0-7 | Update `LabletDefinitionReadModel` | lcm-core | `domain/entities/read_models/lablet_definition_read_model.py` | 1h |
| P0-8 | Unit tests for new state transitions | control-plane-api | `tests/domain/` | 3h |
3.2 Specification: READY State¶
Current State Machine:
Updated State Machine:
Transition Table Update:
# domain/enums.py - LABLET_SESSION_VALID_TRANSITIONS
LabletSessionStatus.INSTANTIATING: [
LabletSessionStatus.READY, # NEW: LDS provisioned, awaiting user login
LabletSessionStatus.TERMINATED,
],
LabletSessionStatus.READY: [ # NEW STATE
LabletSessionStatus.RUNNING, # User logged in (CloudEvent from LDS)
LabletSessionStatus.TERMINATED,
],
READY State Semantics:
- CML lab is running on worker
- LDS LabSession is provisioned with device access info
- UserSession entity created with `login_url` (ADR-021)
- Awaiting user login via LDS portal
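A hedged sketch of how the transition table can be enforced; the enum values come from the spec above, but the trimmed enum and the `can_transition` helper are illustrative, not the actual `domain/enums.py` code:

```python
from enum import Enum

class LabletSessionStatus(str, Enum):
    INSTANTIATING = "INSTANTIATING"
    READY = "READY"            # NEW in Phase 0
    RUNNING = "RUNNING"
    TERMINATED = "TERMINATED"

# Trimmed transition table mirroring the spec above
LABLET_SESSION_VALID_TRANSITIONS = {
    LabletSessionStatus.INSTANTIATING: [LabletSessionStatus.READY, LabletSessionStatus.TERMINATED],
    LabletSessionStatus.READY: [LabletSessionStatus.RUNNING, LabletSessionStatus.TERMINATED],
}

def can_transition(current: LabletSessionStatus, target: LabletSessionStatus) -> bool:
    """True only if the transition table allows current -> target."""
    return target in LABLET_SESSION_VALID_TRANSITIONS.get(current, [])
```

Note how the Phase 0 breaking change falls out of the table: `INSTANTIATING → RUNNING` is simply absent, so any direct transition attempt is rejected.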
3.3 Acceptance Criteria¶
- [x] `LabletSessionStatus.READY` exists in enum
- [x] Transition `INSTANTIATING → READY` is valid
- [x] Transition `READY → RUNNING` is valid
- [x] `LabletDefinition.form_qualified_name` persists correctly
- [x] LDS session fields persist correctly (Phase 7 migrates to UserSession per ADR-021)
- [x] Domain events emit correctly
- [x] Read models reflect new attributes
- [x] All unit tests pass (210 tests, 21 new)
4. Phase 1: Worker Foundation¶
Status: ✅ COMPLETE (2026-02-09)
Goal: Ensure worker capacity is accurately tracked so scheduling can make informed decisions.
Duration: ~2 weeks
Requirements: FR-2.4.1, FR-2.4.3
Bootstrap: PHASE_1_BOOTSTRAP.md
Phase 1 Complete
Worker capacity is now accurately tracked through the full lablet session lifecycle.
Key artifacts: AllocateCapacityCommand, ReleaseCapacityCommand, WorkerCapacityPublisher.
ScheduleLabletSessionCommand now validates worker status (RUNNING) and capacity before scheduling,
then allocates capacity via mediator. TerminateLabletSessionCommand releases capacity on termination.
Capacity snapshots are published to etcd at /lcm/workers/{id}/capacity.
Tests: 26 new tests in test_capacity_commands.py (387 total CPA non-integration tests pass).
Integration tests and PlacementEngine enhancement deferred to Phase 2.
Bootstrap prompt: PHASE_1_BOOTSTRAP.md
4.1 Problem Statement¶
The PlacementEngine in resource-scheduler uses declared_capacity and allocated_capacity to filter eligible workers. However:
- `allocated_capacity` is never updated when sessions are scheduled
- `ScheduleLabletSessionCommand` has a TODO: "Check worker status and capacity"
- No command exists to update worker capacity when sessions start/stop
This means PlacementEngine sees stale data and may over-allocate workers.
4.2 Tasks¶
| ID | Task | Service | File(s) | Estimate |
|---|---|---|---|---|
| P1-1 | Create `AllocateCapacityCommand` | control-plane-api | `application/commands/worker/` | 4h |
| P1-2 | Create `ReleaseCapacityCommand` | control-plane-api | `application/commands/worker/` | 4h |
| P1-3 | Update `ScheduleLabletSessionCommand` to call allocate | control-plane-api | `application/commands/lablet_session/` | 3h |
| P1-4 | Update session termination to call release | control-plane-api | `application/commands/lablet_session/` | 3h |
| P1-5 | Add capacity change domain events | control-plane-api | `domain/events/cml_worker.py` | 2h |
| P1-6 | Publish capacity to etcd for scheduler | control-plane-api | `application/services/` | 4h |
| P1-7 | Verify metrics collection end-to-end | worker-controller | `application/hosted_services/worker_reconciler.py` | 4h |
| P1-8 | Integration tests for capacity flow | control-plane-api | `tests/integration/` | 8h |
4.3 Specification: Capacity Allocation Flow¶
sequenceDiagram
participant RS as resource-scheduler
participant CPA as control-plane-api
participant Worker as CMLWorker Aggregate
participant etcd as etcd
RS->>CPA: POST /api/internal/sessions/{id}/schedule (worker_id)
CPA->>Worker: schedule(session_id, worker_id)
Worker->>Worker: validate capacity available
Worker->>Worker: allocate_capacity(requirements)
Worker-->>CPA: LabletSessionScheduledDomainEvent
Worker-->>CPA: WorkerCapacityAllocatedDomainEvent
CPA->>etcd: PUT /workers/{id}/allocated_capacity
CPA-->>RS: 200 OK {status: SCHEDULED}
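The final etcd write in the diagram publishes a capacity snapshot under `/lcm/workers/{id}/capacity` (per the Phase 1 summary). A minimal sketch of the key and payload shape; the JSON field names (`declared`/`allocated`/`available`) are assumptions, not the real schema:

```python
import json

def capacity_key(worker_id: str) -> str:
    """etcd key for a worker's capacity snapshot (path per the Phase 1 summary)."""
    return f"/lcm/workers/{worker_id}/capacity"

def capacity_snapshot(declared: dict[str, int], allocated: dict[str, int]) -> str:
    """Serialize a capacity snapshot; derives 'available' so the scheduler
    never has to recompute it. The JSON shape is an assumption."""
    return json.dumps({
        "declared": declared,
        "allocated": allocated,
        "available": {k: declared[k] - allocated.get(k, 0) for k in declared},
    })
```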
AllocateCapacityCommand:
@dataclass
class AllocateCapacityCommand(Command[OperationResult[dict]]):
    """Allocate capacity on a worker for a lablet session."""

    worker_id: str
    session_id: str
    cpu_cores: int
    memory_gb: int
    storage_gb: int
    node_count: int
ReleaseCapacityCommand:
@dataclass
class ReleaseCapacityCommand(Command[OperationResult[dict]]):
    """Release capacity when session terminates."""

    worker_id: str
    session_id: str
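A toy version of the allocate/release arithmetic these commands drive on the worker aggregate; the class and field names below are illustrative, not the production `CMLWorker` code:

```python
from dataclasses import dataclass

class CapacityError(Exception):
    """Raised when a worker cannot satisfy a session's requirements."""

@dataclass
class Capacity:
    cpu_cores: int
    memory_gb: int

class Worker:
    """Toy aggregate: tracks declared vs. allocated capacity per session."""

    def __init__(self, declared: Capacity) -> None:
        self.declared = declared
        self.allocated = Capacity(0, 0)
        self._sessions: dict[str, Capacity] = {}

    def allocate(self, session_id: str, req: Capacity) -> None:
        # Validate before mutating, mirroring AllocateCapacityCommand semantics
        if (self.allocated.cpu_cores + req.cpu_cores > self.declared.cpu_cores
                or self.allocated.memory_gb + req.memory_gb > self.declared.memory_gb):
            raise CapacityError("insufficient capacity")
        self._sessions[session_id] = req
        self.allocated.cpu_cores += req.cpu_cores
        self.allocated.memory_gb += req.memory_gb

    def release(self, session_id: str) -> None:
        # Idempotent: releasing an unknown session is a no-op
        req = self._sessions.pop(session_id, None)
        if req is not None:
            self.allocated.cpu_cores -= req.cpu_cores
            self.allocated.memory_gb -= req.memory_gb
```

Tracking requirements per session is what makes release idempotent: a duplicate ReleaseCapacity command cannot drive `allocated_capacity` negative.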
4.4 Acceptance Criteria¶
- [x] `AllocateCapacityCommand` updates `allocated_capacity` on worker
- [x] `ReleaseCapacityCommand` decrements `allocated_capacity` on worker
- [x] `ScheduleLabletSessionCommand` validates capacity before scheduling
- [x] `ScheduleLabletSessionCommand` calls `AllocateCapacityCommand` on success
- [x] Session termination triggers `ReleaseCapacityCommand`
- [x] Worker capacity changes are published to etcd
- [x] PlacementEngine sees current capacity data (completed in Phase 2)
- [x] Integration tests verify full flow (completed in Phase 2)
5. Phase 2: Resource Scheduling¶
Status: ✅ COMPLETE (2026-02-10)
Goal: Resource-scheduler makes accurate placement decisions based on real-time capacity.
Duration: ~1.5 weeks
Requirements: FR-2.3.2a-e
5.1 Problem Statement¶
The PlacementEngine logic is mostly implemented but:
- It uses stale capacity data (fixed in Phase 1)
- The `ScheduleLabletSessionCommand` doesn't validate capacity
- No feedback loop exists if scheduling fails
5.2 Tasks¶
| ID | Task | Service | File(s) | Estimate |
|---|---|---|---|---|
| P2-1 | Add capacity validation in PlacementEngine | resource-scheduler | `application/services/placement_engine.py` | 4h |
| P2-2 | Fetch fresh capacity from etcd (not API) | resource-scheduler | `application/hosted_services/scheduler_hosted_service.py` | 4h |
| P2-3 | Handle scheduling failures with retry | resource-scheduler | `application/hosted_services/scheduler_hosted_service.py` | 4h |
| P2-4 | Add scheduling metrics (success/fail/scale-up) | resource-scheduler | `application/services/` | 3h |
| P2-5 | Update scheduler to request scale-up on no capacity | resource-scheduler | `application/hosted_services/scheduler_hosted_service.py` | 4h |
| P2-6 | Integration tests for scheduling decisions | resource-scheduler | `tests/integration/` | 6h |
5.3 Specification: Enhanced PlacementEngine¶
Current _check_resource_capacity (partial):
def _check_resource_capacity(self, worker: dict, definition: dict) -> bool:
    # Uses declared_capacity - allocated_capacity
    # Problem: allocated_capacity is stale
Enhanced flow:
def schedule(self, instance: dict, definition: dict, workers: list) -> SchedulingDecision:
    # 1. Filter by status (RUNNING only)
    # 2. Filter by license affinity
    # 3. Filter by AMI requirements
    # 4. Filter by REAL-TIME capacity (from etcd or fresh API call)
    # 5. Filter by port availability
    # 6. Score by utilization (bin-packing)
    # 7. Select best or request scale-up
5.4 Acceptance Criteria¶
- [x] PlacementEngine uses real-time capacity data (etcd capacity preferred, API fallback)
- [x] Scheduling fails gracefully with clear error message (rejection_summary tracking)
- [x] Failed scheduling triggers requeue with backoff (base class backoff + max retry escalation at 5 failures → 300s)
- [x] Scale-up decision made when no eligible workers (granular rejection categories: status/license/capacity/ami/ports)
- [x] Scheduling metrics are emitted (OTel: decisions, latency, retries, etcd fetches, scale-ups)
- [x] Integration tests cover all decision paths (41 new tests: 17 PlacementEngine + 24 SchedulerHostedService)
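The requeue backoff with max-retry escalation to 300s noted above might be computed like this; the base delay and doubling factor are assumed constants, not the real scheduler configuration:

```python
def requeue_delay(failure_count: int, base: float = 5.0, cap: float = 300.0) -> float:
    """Exponential backoff per consecutive failure, escalating to a flat
    300-second ceiling once 5 failures are reached (constants are assumptions)."""
    if failure_count >= 5:
        return cap
    return min(base * (2 ** failure_count), cap)
```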
6. Phase 3: Auto-Scaling¶
Status: ✅ COMPLETE (2026-02-08)
Goal: Automatically provision/deprovision workers based on demand.
Duration: ~2 weeks
Requirements: FR-2.5.1a,c,d; FR-2.5.2a-b
Bootstrap: PHASE_3_BOOTSTRAP.md (retrospective)
Phase 3 Complete
Full auto-scaling implemented across 3 services. Scale-up: resource-scheduler detects no eligible
workers → selects cheapest viable template → RequestScaleUpCommand creates PENDING worker →
worker-controller provisions EC2 via _handle_pending(). Scale-down: worker-controller evaluates
idle workers (5 safety guards) → DrainWorkerCommand sets DRAINING → stops EC2 when empty.
Key artifacts: RequestScaleUpCommand (CPA), DrainWorkerCommand (CPA), _handle_pending()
provisioning (WC), _evaluate_scale_down() (WC), _select_template_for_requirements() (RS),
WorkerTemplateService (CPA), scaling constraints (min/max workers, cooldowns).
Tests: 44 new tests (14 CPA + 17 RS + 13 WC). Also fixed discovery state sync bug (AD-21).
Bootstrap prompt: PHASE_3_BOOTSTRAP.md
6.1 Problem Statement¶
Currently, worker provisioning is stubbed:
# worker_reconciler.py line ~290
async def _handle_pending(self, worker: CMLWorkerReadModel) -> ReconciliationResult:
    # ...
    # For now, requeue until template system is implemented
    return ReconciliationResult.requeue("Template provisioning not yet implemented")
No automatic scale-up or scale-down exists.
6.2 Tasks¶
| ID | Task | Service | File(s) | Estimate |
|---|---|---|---|---|
| P3-1 | Implement `_handle_pending` with EC2 provisioning | worker-controller | `application/hosted_services/worker_reconciler.py` | 8h |
| P3-2 | Create `RequestScaleUpCommand` | control-plane-api | `application/commands/worker/` | 4h |
| P3-3 | Implement scale-up trigger in scheduler | resource-scheduler | `application/hosted_services/scheduler_hosted_service.py` | 6h |
| P3-4 | Create scale-down detection job | worker-controller | `application/hosted_services/` | 6h |
| P3-5 | Implement worker draining before scale-down | control-plane-api | `domain/entities/cml_worker.py` | 4h |
| P3-6 | Add scaling constraints (min/max workers) | control-plane-api | `application/settings.py` | 2h |
| P3-7 | Add scaling audit log | control-plane-api | `application/services/` | 3h |
| P3-8 | Integration tests for scale-up flow | all | `tests/integration/` | 8h |
| P3-9 | Integration tests for scale-down flow | all | `tests/integration/` | 6h |
6.3 Specification: Scale-Up Flow¶
sequenceDiagram
participant RS as resource-scheduler
participant CPA as control-plane-api
participant WC as worker-controller
participant EC2 as AWS EC2
RS->>RS: No eligible workers for session
RS->>CPA: POST /api/workers/scale-up {template, reason}
CPA->>CPA: Create CMLWorker (status=PENDING)
CPA-->>RS: 202 Accepted {worker_id}
RS->>RS: Requeue session (await worker)
Note over WC: Reconciliation loop
WC->>CPA: GET /api/workers?status=PENDING
WC->>EC2: RunInstances(template config)
EC2-->>WC: ec2_instance_id, pending
WC->>CPA: PATCH /api/workers/{id} {status=PROVISIONING, ec2_id}
Note over WC: Next reconcile cycle
WC->>EC2: DescribeInstances(ec2_instance_id)
EC2-->>WC: running, ip_address
WC->>CPA: PATCH /api/workers/{id} {status=RUNNING, ip}
Note over RS: Next reconcile cycle
RS->>CPA: GET /api/workers?status=RUNNING
RS->>RS: Worker now eligible
RS->>CPA: POST /api/internal/sessions/{id}/schedule
6.4 Specification: Scale-Down Flow¶
sequenceDiagram
participant WC as worker-controller
participant CPA as control-plane-api
participant EC2 as AWS EC2
Note over WC: Idle detection
WC->>CPA: GET /api/workers/{id}
WC->>WC: Check: no sessions, idle > threshold
WC->>CPA: POST /api/workers/{id}/drain
CPA->>CPA: Set status=DRAINING
Note over WC: Wait for drain complete
WC->>CPA: GET /api/workers/{id}/sessions
CPA-->>WC: [] (empty)
WC->>CPA: POST /api/workers/{id}/stop
CPA->>CPA: Set desired_status=stopped
Note over WC: Next reconcile
WC->>EC2: StopInstances(ec2_instance_id)
EC2-->>WC: stopping
WC->>CPA: PATCH /api/workers/{id} {status=STOPPED}
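The idle check at the top of the diagram corresponds to the Phase 3 scale-down safety guards. A hedged sketch of such an evaluation; the exact guard set and the field names on `worker` are assumptions modeled on the Phase 3 summary, not the real `_evaluate_scale_down()`:

```python
from datetime import datetime, timedelta, timezone

def should_scale_down(worker: dict, *, min_workers: int, running_workers: int,
                      idle_threshold: timedelta, cooldown: timedelta,
                      now: datetime) -> bool:
    """Evaluate scale-down safety guards for one worker (illustrative)."""
    if worker["active_sessions"] > 0:
        return False                      # guard: never drain a busy worker
    if running_workers <= min_workers:
        return False                      # guard: respect the fleet floor
    if now - worker["idle_since"] < idle_threshold:
        return False                      # guard: must be idle long enough
    if now - worker["last_scaling_event"] < cooldown:
        return False                      # guard: scaling cooldown
    return True                           # all guards passed -> DrainWorkerCommand
```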
6.5 Acceptance Criteria¶
- [x] Worker provisioning creates real EC2 instances
- [x] Worker template configuration is applied correctly
- [x] Scale-up triggered when no eligible workers for pending session
- [x] Scale-down triggered when worker has no sessions and is idle
- [x] Draining prevents new session scheduling
- [x] Scaling constraints (min/max) are respected
- [x] Scaling decisions are audit-logged (OTel metrics + structured logging)
- [x] Integration tests verify scale-up scenarios (14 tests); scale-down unit-tested (13 tests)
7. Phase 4: LDS Integration¶
Status: ✅ COMPLETE (2026-02-10)
Goal: Provision LDS LabSessions for lablet sessions and handle user login events.
Duration: ~2 weeks
Requirements: FR-2.2.5, FR-2.2.6
Bootstrap: PHASE_4_BOOTSTRAP.md
Phase 4 Complete
LDS integration implemented across lablet-controller and control-plane-api. LDS SPI Client:
636-line REST client with multi-region deployment support, YAML config, data models
(DeviceAccessInfo, SessionPartInfo, LdsSessionInfo, LdsDeploymentConfig). Reconciler LDS Flow:
7-step _provision_lds_session() (get definition → get nodes → create session → build device list →
set devices → get launch URL → call CPA mark-ready), _archive_lds_session() on TERMINATED,
_build_device_access_list() static helper. CPA Commands: MarkSessionReadyCommand (atomic
INSTANTIATING→READY with LDS info, AD-P4-01), HandleSessionStartedCommand (READY→RUNNING on
session.started). Internal API: PUT /api/internal/sessions/{id}/mark-ready, POST /api/internal/
sessions/{id}/transition. CPA Client: mark_session_ready(), notify_session_started() in
ControlPlaneApiClient. Tests: 57 new tests (44 LDS SPI + 5 mark-ready + 8 session-started).
Remaining: Staging validation with live LDS deployment.
ADR-022 Impact
Phase 4 was originally implemented with CloudEvent handling in control-plane-api. Per ADR-022, all CloudEvents (LDS + GradingEngine) are now routed to lablet-controller via CloudEventIngestor. Phase 7 implements this migration.
7.1 Tasks¶
| ID | Task | Service | File(s) | Status | Estimate |
|---|---|---|---|---|---|
| P4-1 | Create LDS SPI client (data models + config) | lablet-controller | `integration/services/lds_spi.py`, `config/lds_deployments.yaml` | ✅ | 4h |
| P4-2 | Implement LDS REST client (multi-region) | lablet-controller | `integration/services/lds_spi.py` | ✅ | 6h |
| P4-3 | Add `MarkSessionReadyCommand` (atomic INSTANTIATING→READY) | control-plane-api | `application/commands/lablet_session/mark_session_ready_command.py` | ✅ | 3h |
| P4-4 | Update `TransitionLabletSessionCommand` for READY | control-plane-api | `application/commands/lablet_session/transition_lablet_session_command.py` | ✅ | 2h |
| P4-5 | Update LabletReconciler: `_provision_lds_session()` | lablet-controller | `application/hosted_services/lablet_reconciler.py` | ✅ | 8h |
| P4-6 | Add internal API endpoints (mark-ready, session-started) | control-plane-api | `api/controllers/internal_controller.py` | ✅ | 4h |
| P4-7 | `HandleSessionStartedCommand` (READY→RUNNING) | control-plane-api | `application/commands/lablet_session/handle_session_started_command.py` | ✅ | 4h |
| P4-8 | Add `_archive_lds_session()` on TERMINATED | lablet-controller | `application/hosted_services/lablet_reconciler.py` | ✅ | 3h |
| P4-9 | Tests (LDS SPI + command handlers) | LC, CPA | `tests/` | ✅ | 8h |
| P4-CPA-Client | Add `mark_session_ready()`, `notify_session_started()` to CPA client | lcm-core | `integration/clients/control_plane_client.py` | ✅ | 2h |
7.2 Specification: LDS Client Interface¶
# lablet-controller/integration/services/lds_spi.py
class LdsSpiClient:
    """LDS (Lab Delivery System) SPI client."""

    async def create_session_with_part(
        self,
        username: str,
        timeslot_start: datetime,
        timeslot_end: datetime,
        form_qualified_name: str,
    ) -> LdsSessionInfo:
        """Create LabSession with a LabSessionPart for the content."""
        ...

    async def set_devices(
        self,
        session_id: str,
        devices: list[DeviceAccessInfo],
    ) -> None:
        """Set device access information for the session."""
        ...

    async def get_session_info(self, session_id: str) -> LdsSessionInfo:
        """Get session details including login_url."""
        ...

    async def archive_session(self, session_id: str) -> None:
        """Archive session (called on TERMINATED)."""
        ...

@dataclass
class DeviceAccessInfo:
    name: str       # Device label from content.xml
    protocol: str   # "ssh", "telnet", "vnc", "web"
    host: str       # Worker IP address
    port: int       # Allocated port
    uri: str        # Connection URI
    username: str   # Device credentials
    password: str   # Device credentials

@dataclass
class LdsSessionInfo:
    session_id: str
    login_url: str
    status: str
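For illustration, mapping CML lab nodes to device entries (step 4 of the reconciler flow, the `_build_device_access_list()` helper) might look like the sketch below. The trimmed dataclass omits credentials, and the sequential port allocation and ssh-only protocol are assumptions, not the real implementation:

```python
from dataclasses import dataclass

@dataclass
class DeviceAccessInfo:
    """Trimmed version of the SPI dataclass (credentials omitted)."""
    name: str
    protocol: str
    host: str
    port: int
    uri: str

def build_device_access_list(nodes: list[dict], worker_ip: str,
                             base_port: int = 9000) -> list[DeviceAccessInfo]:
    """Map CML lab nodes to LDS device entries, one allocated port per node."""
    return [
        DeviceAccessInfo(
            name=node["label"],
            protocol="ssh",
            host=worker_ip,
            port=base_port + offset,
            uri=f"ssh://{worker_ip}:{base_port + offset}",
        )
        for offset, node in enumerate(nodes)
    ]
```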
7.3 Specification: CloudEvent Ingestion (ADR-022)¶
Per ADR-022, all external CloudEvents
are routed to lablet-controller, not control-plane-api. The lablet-controller uses Neuroglia's
CloudEventIngestor with @dispatch handlers:
# lablet-controller/application/events/cloud_event_ingestor.py
class LabletCloudEventIngestor(CloudEventIngestor):
    """Receives and dispatches CloudEvents from LDS and GradingEngine."""

    def __init__(self, service_provider, control_plane_client: ControlPlaneApiClient):
        super().__init__(service_provider)
        self._cpa = control_plane_client

    @dispatch(LdsSessionStartedEvent)
    async def on_lds_session_started(self, event: LdsSessionStartedEvent) -> None:
        """Handle LDS session.started — transition READY → RUNNING."""
        await self._cpa.update_user_session_status(event.lablet_session_id, UserSessionStatus.ACTIVE)
        await self._cpa.transition_session(event.lablet_session_id, LabletSessionStatus.RUNNING)

    @dispatch(LdsSessionEndedEvent)
    async def on_lds_session_ended(self, event: LdsSessionEndedEvent) -> None:
        """Handle LDS session.ended — transition RUNNING → COLLECTING, initiate grading."""
        await self._cpa.update_user_session_status(event.lablet_session_id, UserSessionStatus.ENDED)
        await self._cpa.transition_session(event.lablet_session_id, LabletSessionStatus.COLLECTING)
        # Initiate grading via GradingSPI (Phase 7)

    @dispatch(GradingSessionCompletedEvent)
    async def on_grading_completed(self, event: GradingSessionCompletedEvent) -> None:
        """Handle grading.session.completed — create ScoreReport, transition to STOPPING."""
        await self._cpa.create_score_report(event.lablet_session_id, event.score_data)
        await self._cpa.update_grading_session_status(event.lablet_session_id, GradingStatus.SUBMITTED)
        await self._cpa.transition_session(event.lablet_session_id, LabletSessionStatus.STOPPING)

    @dispatch(GradingSessionFailedEvent)
    async def on_grading_failed(self, event: GradingSessionFailedEvent) -> None:
        """Handle grading.session.failed — mark FAULTED."""
        await self._cpa.update_grading_session_status(
            event.lablet_session_id, GradingStatus.FAULTED, error=event.error_message
        )
CloudEvent endpoint registered at POST /api/events on lablet-controller.
State mutations are proxied to control-plane-api via internal REST calls (ADR-001).
7.4 Acceptance Criteria¶
- [x] LDS client creates sessions with device access info (636-line `LdsSpiClient` with multi-region support)
- [x] LabletReconciler provisions LDS session during INSTANTIATING (7-step `_provision_lds_session()` flow)
- [x] Session transitions to READY after LDS provisioning (atomic `MarkSessionReadyCommand` via AD-P4-01)
- [x] LDS session info stored on UserSession entity (Phase 7 migration — currently on LabletSession via `mark_ready()` domain method)
- [x] CloudEvent `session.started` triggers READY→RUNNING transition (Phase 7 migrates handler to lablet-controller CloudEventIngestor per ADR-022)
- [x] LDS session archived when session reaches TERMINATED (`_archive_lds_session()` with graceful error handling)
- [x] Integration tests verify flow with mock LDS (57 tests: 44 LDS SPI + 5 mark-ready + 8 session-started)
- [x] Staging validation with live LDS deployment (G3: 12/12 checks passed, bug fix: `archive_session()` `json=None` → `json={}`, Docker config added)
8. Phase 5: Grading Integration¶
Status: ⬜ Ready to start (Phase 7 complete)
Goal: Collect device configurations and submit for grading via GradingEngine.
Duration: ~1.5 weeks
Requirements: FR-2.6.1, FR-2.6.2
Dependencies: Phase 7 (Session Entity Model), complete as of 2026-02-20
Phase 7 Prerequisite
Phase 5 requires the following Phase 7 deliverables:
- GradingSession entity (ADR-021) — stores grading state and GradingEngine IDs
- ScoreReport entity (ADR-021) — stores grading results with per-section breakdowns
- CloudEventIngestor (ADR-022) — routes grading.session.completed/failed events to handlers
- GradingSPI adapter — client for GradingEngine REST API
- Internal API endpoints — CRUD for GradingSession and ScoreReport entities
8.1 Tasks¶
| ID | Task | Service | File(s) | Estimate |
|---|---|---|---|---|
| P5-1 | Create collection service | lablet-controller | `application/services/collection_service.py` | 8h |
| P5-2 | Add console command execution to CML SPI | lablet-controller | `integration/services/cml_labs_spi.py` | 4h |
| P5-3 | Add `StartCollectionCommand` | control-plane-api | `application/commands/lablet_session/` | 3h |
| P5-4 | Add `CollectionCompletedCommand` | control-plane-api | `application/commands/lablet_session/` | 3h |
| P5-5 | Wire GradingSPI client (skeleton from Phase 7) | lablet-controller | `integration/services/grading_spi.py` | 4h |
| P5-6 | Update LabletReconciler for COLLECTING state | lablet-controller | `application/hosted_services/lablet_reconciler.py` | 6h |
| P5-7 | Update LabletReconciler for GRADING state | lablet-controller | `application/hosted_services/lablet_reconciler.py` | 4h |
| P5-8 | Wire CloudEvent handlers for grading events (skeleton from Phase 7) | lablet-controller | `application/events/cloud_event_ingestor.py` | 3h |
| P5-9 | Integration tests for grading flow | lablet-controller | `tests/integration/` | 6h |
8.2 Specification: Collection Service¶
# lablet-controller/application/services/collection_service.py
class CollectionService:
    """Collects device configurations from CML lab nodes."""

    async def collect_configs(
        self,
        worker_ip: str,
        lab_id: str,
        commands: list[str] | None = None,
    ) -> CollectionResult:
        """Collect running-config from all nodes in the lab."""
        commands = commands or ["show running-config"]
        nodes = await self._cml.get_lab_nodes(worker_ip, lab_id)
        results = []
        for node in nodes:
            node_outputs = []
            for command in commands:
                output = await self._cml.execute_console_command(
                    worker_ip, lab_id, node.id, command
                )
                node_outputs.append(CommandOutput(command=command, output=output))
            results.append(DeviceCollection(name=node.label, outputs=node_outputs))
        return CollectionResult(devices=results, collected_at=datetime.now(UTC))
8.3 Specification: State Flow¶
stateDiagram-v2
[*] --> RUNNING: User logged in
RUNNING --> COLLECTING: StartCollectionCommand
COLLECTING --> GRADING: Collection complete
GRADING --> STOPPING: Grading complete
STOPPING --> STOPPED: Resources released
STOPPED --> TERMINATED: Cleanup complete
TERMINATED --> [*]
8.4 Acceptance Criteria¶
- [ ] Collection service gathers `show running-config` from all nodes
- [ ] RUNNING→COLLECTING transition works via command
- [ ] Collection results stored/forwarded to grading engine
- [ ] COLLECTING→GRADING transition after collection complete
- [ ] Grading engine submission works
- [ ] CloudEvent `grading.completed` triggers score storage
- [ ] GRADING→STOPPING transition after grading complete
- [ ] Integration tests verify full flow
9. Phase 6: SSE, Frontend & Integration Readiness¶
Status: 🔄 In Progress
Goal: Fix SSE event pipeline, fill integration gaps, and build the frontend UI to MVP readiness.
Duration: ~3–4 weeks
Requirements: FR-2.1 (UI), FR-2.2 (lifecycle visibility), FR-2.4 (capacity dashboard), FR-2.6 (grading display)
Phase 6 Context
A gap analysis between the codebase and the MVP requirements revealed that Phases 0–4 focused on backend domain logic, CQRS commands, and inter-service integration. Phase 6 addresses the remaining SSE pipeline bugs, missing test coverage, and frontend UI gaps needed for a usable MVP. Grading backend (Phase 5) and grading UI (F4) are deferred to Phase 7.
Progress (v4.0.0): 8 of 15 tasks complete (G1, G3, G4, G5, F1, F2, F3, F8). All Medium+ priority frontend and backend tasks done. G4 delivered 53 worker-controller tests, G5 delivered 59 lablet-controller tests. Remaining work is Low-priority polish tasks (F4-F7, F9).
9.1 Sub-Phase A: SSE & Backend Readiness¶
These tasks address broken or missing backend infrastructure required before frontend work.
| ID | Task | Priority | Status | Description |
|---|---|---|---|---|
| G1 | Fix SSE broken for all aggregates | P0 | ✅ | SSE event naming mismatch (hyphens→dots), missing event mappings, legacy SSEService, no initial snapshots. Fixed: 21 backend handlers renamed, 6 frontend event types added, legacy SSEService deleted, global SSE connect in app.js, lablet/definition snapshots added to EventsController. |
| G2 | CloudEvents external naming mismatch | Deferred | ⬜ | External CloudEvent type fields still use hyphens (e.g. lablet-session.status.changed). Not blocking SSE (internal dot notation works). Defer until external CloudEvent consumers exist. |
| G3 | Phase 4 staging validation | P1 | ✅ | Validated LDS integration against live pyLDS backend (12/12 checks passed). Findings: (1) LDS returns HTTP 201 for session creation (SPI client handles correctly via raise_for_status()), (2) Bug fix: archive_session() sent json=None causing HTTP 415 — changed to json={}, (3) Docker networking: created lds_deployments.docker.yaml with lds-backend:4000 base_url, added LDS_VERIFY_SSL + LDS_DEPLOYMENTS_CONFIG_PATH env vars to docker-compose.shared.yml. |
| G4 | Add missing Worker Controller tests | P2 | ✅ | 53 new tests in test_worker_reconciler_g4.py. Covers: all 9 status handlers (PENDING through TERMINATED), EC2 provisioning flow, CML readiness checks, metrics collection, scale-down evaluation (5 safety guards), drain completion, error recovery. Target was +20, delivered +53. |
| G5 | Add missing Lablet Controller tests | P2 | ✅ | 59 new tests in test_lablet_reconciler_g5.py. Covers: all 7 status handlers (SCHEDULED through PENDING_CLEANUP), LDS provisioning 7-step flow, device mapping, definition caching, session archival, reconcile router dispatch. Target was +15, delivered +59. |
| G6 | Worker metrics events disabled | Deferred | ⬜ | WorkerMetricsUpdatedDomainEvent SSE handlers exist but metrics collection jobs are not emitting events. Leave disabled until monitoring dashboard (F3/F9) is prioritized. |
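The G3 bug fix above is easy to reproduce in isolation. The sketch below uses `requests` to build (not send) the two variants; the actual SPI client library and endpoint path are assumptions. It shows why `json=None` triggered HTTP 415: the client treats it as "no body at all" and omits the JSON Content-Type header, while `json={}` sends an empty JSON object with the correct media type.

```python
import requests

# Compare the wire format of the broken and fixed archive calls
# (host and path are hypothetical, for illustration only).
url = "https://lds-backend:4000/sessions/123/archive"
broken = requests.Request("POST", url, json=None).prepare()
fixed = requests.Request("POST", url, json={}).prepare()

# json=None is indistinguishable from "no body": no JSON Content-Type is set,
# so a strict endpoint answers HTTP 415 Unsupported Media Type.
assert broken.body is None
assert "application/json" not in (broken.headers.get("Content-Type") or "")

# json={} produces an empty JSON object with the correct media type.
assert fixed.body in (b"{}", "{}")
assert fixed.headers["Content-Type"] == "application/json"
```

The same distinction applies to httpx-style clients: a `json` keyword left at its `None` default means "no JSON body", not "empty JSON body".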
9.2 Sub-Phase B: Frontend Implementation¶
These tasks address UI gaps identified in the control-plane-api frontend. The CPA frontend is the primary user interface; worker-controller, lablet-controller, and resource-scheduler UIs remain empty scaffolds.
Existing UI Pages (CPA)¶
| Page | Description | State |
|---|---|---|
| OverviewPage | Dashboard with aggregate metric cards, quick actions, recent activity | ✅ Functional |
| WorkersPage | Tabbed: Workers (card/table) + Templates (admin) | ✅ Functional |
| LabletsPage | Tabbed: Sessions (card/table) + Definitions (admin) | ✅ Functional |
| SystemPage | Tabbed: Monitoring (health, SSE) + Settings (admin) | ✅ Functional |
Frontend Gap Tasks¶
| ID | Task | Priority | Status | What Exists | What's Missing |
|---|---|---|---|---|---|
| F1 | Reservation Management UI | Medium | ✅ | ReservationsPage.js: stats cards, reservation lookup by external ID, active/all/timeline tabs, status filtering, search, SSE real-time updates, auto-refresh | Done: Dedicated Reservations page with full reservation lifecycle view, filtering, and search. |
| F2 | LDS Session Display | High | ✅ | Backend domain models have LDS session info (migrating to UserSession entity per ADR-021) | Fixed: DTO mappers now include LDS fields, SSE READY handler added with LDS data, LabletSessionCard has "Open Lab" button + session display. 9 new tests. |
| F3 | Capacity/Utilization Dashboard | Medium | ✅ | CapacityDashboard.js: fleet summary cards, resource allocation progress bars (CPU/Mem/Storage/Nodes), per-worker breakdown table with mini progress bars, color-coded utilization | Done: Cross-fleet capacity dashboard with aggregate metrics and per-worker breakdown. Historical trends deferred to Grafana integration. |
| F4 | Grading Results Display | Low | ⬜ | Grade API call + "Start Grading" button on LabletSessionCard, grading/graded status in badge mappings | No grading results display (score, checks, pass/fail), no GradingPanel. UI can trigger grading but cannot show outcomes. Blocked by Phase 5/7 grading backend. |
| F5 | Notification Center | Low | ⬜ | Toast notifications for SSE events (ephemeral) | No persistent notification center/inbox, no alert history, no configurable thresholds. Toasts sufficient for MVP. |
| F6 | User/RBAC Admin UI | Low | ⬜ | Frontend RBAC enforcement via `permissions.js`, role-based views | No user management page, no role assignment UI, no user profile dropdown. Keycloak admin console suffices for now. |
| F7 | Audit Log Viewer | Low | ⬜ | EventBus has `getEventHistory()` (client-side in-memory only) | No audit log viewer page, no server-side event history browser. Post-MVP feature. |
| F8 | Resource Scheduler UI | Medium | ✅ | SchedulerPage.js + api/scheduler.js: leader election status, scheduling stats, pending placements table, scheduling policy info, admin actions (trigger reconcile, resign leadership) | Done: Full scheduling dashboard with leader status, stats, pending placements, and admin controls. |
| F9 | Multi-Service Observability | Low | ⬜ | SystemPage health checks + SSE status. PrometheusClient + LcmGrafanaPanel scaffolded. | No unified observability across all 4 microservices. Prometheus/Grafana not connected. Worker-controller, lablet-controller, resource-scheduler UIs are empty scaffolds. |
9.3 Recommended Implementation Order¶
9.3.1 G1: Fix SSE pipeline ✅ DONE
9.3.2 F2: LDS Session Display ✅ DONE
9.3.3 G3: Phase 4 staging validation (validates F2 + LDS flow) ✅ DONE
9.3.4 F1: Reservation Management UI ✅ DONE
9.3.5 F3: Capacity Dashboard ✅ DONE
9.3.6 F8: Resource Scheduler UI ✅ DONE
9.3.7 G4: Worker Controller tests (+53) ✅ DONE
9.3.8 G5: Lablet Controller tests (+59) ✅ DONE
9.3.9 F5–F7, F9: Post-MVP polish (deferred)
9.3.10 F4: Grading Results Display (blocked by the Phase 5 grading backend, which follows Phase 7)
9.4 Acceptance Criteria¶
- [x] SSE events flow end-to-end for all aggregates (workers, lablet sessions, lablet definitions, worker templates)
- [x] SSE initial snapshots sent on connect (workers, lablet sessions, lablet definitions)
- [x] LDS session URL displayed on lablet session cards with "Open Lab" button
- [x] DTO mappers include LDS session info in API responses (migrating to UserSession per ADR-021)
- [ ] Reservation filtering available on LabletsPage
- [ ] Fleet-level capacity overview visible on OverviewPage or dedicated dashboard
- [ ] Worker Controller test coverage ≥80%
- [ ] Lablet Controller test coverage ≥80%
- [x] UI builds successfully (`make build-ui` exits 0). StateStore `registerSlice` + reducer-aware `dispatch` added to `lcm_ui` core (AD-9). Parcel cache gotcha documented.
10. Phase 7: Session Entity Model Migration¶
**Status:** ✅ Complete (2026-02-20)
**Goal:** Implement the LabletSession entity model (ADR-020), child entities (ADR-021), CloudEvent ingestion (ADR-022), and GradingSPI skeleton.
**Duration:** ~3–4 weeks
**Requirements:** FR-2.2.1, FR-2.2.2, FR-2.6.1, FR-2.6.2
**Dependencies:** Phase 6 (frontend stabilization) substantially complete
**ADRs:** ADR-020, ADR-021, ADR-022
**📋 Detailed Plan:** phase-7-session-migration.md (codebase audit + 12 sub-phases)
Phase 7 has been extracted to a dedicated execution document
Due to its scope (largest phase in the MVP), Phase 7 has its own detailed plan with:
- Codebase audit — actual entity/command/query/collection inventory across all services
- 12 sub-phases (7A–7L) with dependency graph, per-task estimates, and acceptance criteria
- Migration strategy — big-bang rename, hard etcd cutover, dead code cleanup
- Key decisions (AD-P7-01 through AD-P7-05) — see below
Migration Strategy Decisions¶
| ID | Decision |
|---|---|
| AD-P7-01 | CloudEvent webhook → CPA proxy (no CQRS/Mediator in lablet-controller) |
| AD-P7-02 | Big-bang rename, no backward compatibility |
| AD-P7-03 | Hard etcd cutover /lcm/instances/ → /lcm/sessions/, no dual-write |
| AD-P7-04 | Remove old API endpoints, accept broken frontend during migration |
| AD-P7-05 | Clean up ~2,000+ lines of dead code (Task sample, LabletControllerService, LabsRefreshService, CloudProvider) |
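The AD-P7-03 hard cutover amounts to a one-shot key migration. The sketch below assumes a python-etcd3-style client (`get_prefix`/`put`/`delete_prefix`); the helper names are illustrative, not the actual migration script.

```python
# One-shot cutover for AD-P7-03: copy every key under the old prefix to the
# new prefix, then delete the old tree. Callers must be stopped while this
# runs (big-bang migration, no dual-write, per AD-P7-02/03).
OLD_PREFIX = "/lcm/instances/"
NEW_PREFIX = "/lcm/sessions/"

def rewrite_key(key: str) -> str:
    """Map an old-prefix key to its new-prefix equivalent."""
    if not key.startswith(OLD_PREFIX):
        raise ValueError(f"key outside old prefix: {key}")
    return NEW_PREFIX + key[len(OLD_PREFIX):]

def cutover(client) -> int:
    """Copy all keys old → new, drop the old tree, return the count moved."""
    moved = 0
    for value, meta in client.get_prefix(OLD_PREFIX):
        client.put(rewrite_key(meta.key.decode()), value)
        moved += 1
    client.delete_prefix(OLD_PREFIX)
    return moved
```

Running a verification pass (count and spot-check keys under the new prefix) before deleting the old tree would align with the "data migration corruption" mitigation in the risk register.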
Sub-Phase Overview¶
| Sub-Phase | Scope | Service(s) | Estimate |
|---|---|---|---|
| 7A | lcm-core shared layer renames (enums, read models) | lcm-core | 1 day |
| 7B | Dead code cleanup (~2,000+ lines) | CPA, LC | 1 day |
| 7C | CPA domain layer (LabletSession + 3 child entities) | CPA | 3 days |
| 7D | CPA application layer (commands + queries rewrite) | CPA | 3 days |
| 7E | CPA integration layer (4 new MongoDB repos + etcd) | CPA | 2 days |
| 7F | CPA API layer (controllers + internal endpoints) | CPA | 2 days |
| 7G | lcm-core ControlPlaneApiClient update | lcm-core | 1 day |
| 7H | Controller service updates (reconciler + scheduler) | LC, RS | 3 days |
| 7I | CloudEvent webhook endpoint | LC | 2 days |
| 7J | Frontend updates (API paths + SSE events) | CPA UI | 1.5 days |
| 7K | Cross-service tests & verification | all | 2 days |
| 7L | Documentation updates | docs | 1 day |
Acceptance Criteria (Summary)¶
- [x] Zero `LabletInstance`/`lablet_instance`/`lablet_lab_binding`/`lablet_record_run` references in Python source (docstring exceptions only)
- [x] MongoDB: `lablet_sessions`, `user_sessions`, `grading_sessions`, `score_reports` operational
- [x] MongoDB: `lablet_instances`, `lablet_lab_bindings`, `lablet_record_runs`, `tasks` dropped
- [ ] CloudEvent webhook handles 4 event types via CPA proxy — DEFERRED (AD-P7-06)
- [x] `make lint` + `make test` pass for all services (pre-existing failures only)
- [x] `make build-ui` passes
- [x] All services start and communicate
Full task breakdown, entity schemas, API mappings, and verification checklist: → phase-7-session-migration.md
11. Post-Implementation: Status Document Update¶
**Goal:** Update IMPLEMENTATION_STATUS.md to reflect actual implementation state.
**Duration:** ~2 days
**Trigger:** After Phase 7 completion
Per-Phase Updates
While this section describes a final comprehensive audit, each phase also requires incremental updates to the status document per §12: Mandatory Documentation Maintenance.
11.1 Tasks¶
| ID | Task | Estimate |
|---|---|---|
| PS-1 | Audit all requirement IDs against implementation | 4h |
| PS-2 | Update status matrix with accurate ✅/🔶/⬜ | 2h |
| PS-3 | Update progress percentages | 1h |
| PS-4 | Add test coverage summary | 2h |
| PS-5 | Review with team | 2h |
12. Mandatory: Documentation Maintenance Per Phase¶
Required for Every Phase
Each implementation phase MUST include a documentation maintenance task as its final step. This is not optional — the implementation is not complete until documentation reflects reality.
Per-Phase Documentation Checklist¶
Every phase completion MUST include:
| Task | File | Action |
|---|---|---|
| Update plan status | `docs/implementation/mvp-implementation-plan.md` | Mark phase as ✅ COMPLETE, check acceptance criteria, add completion notes |
| Update status matrix | `docs/implementation/IMPLEMENTATION_STATUS.md` | Update all affected FR rows, bump progress bars, add completion date |
| Bump document versions | Both files above | Increment version, update Last Updated date |
| Store knowledge | Knowledge Manager | `store_decision`, `store_insight`, `update_task` for all significant changes |
| Create next bootstrap | `docs/implementation/PHASE_N+1_BOOTSTRAP.md` | Prepare bootstrap prompt for next phase |
| Update mkdocs.yml | `mkdocs.yml` | Register any new documentation files in navigation |
Bootstrap Prompt Requirement¶
Every PHASE_N_BOOTSTRAP.md document MUST include a final task:
### P{N}-FINAL: Update Implementation Documentation

- Update `docs/implementation/mvp-implementation-plan.md`:
    - Mark Phase {N} as ✅ COMPLETE with date
    - Check all acceptance criteria
    - Add completion notes (key artifacts, test count, breaking changes)
- Update `docs/implementation/IMPLEMENTATION_STATUS.md`:
    - Update all affected FR rows to reflect actual state
    - Update progress bars
    - Bump document version
- Store completion knowledge via Knowledge Manager
- Create `PHASE_{N+1}_BOOTSTRAP.md` for next phase
13. Risk Register¶
| Risk | Impact | Probability | Mitigation |
|---|---|---|---|
| LDS API unavailable | High | Low | Mock client for development |
| Grading Engine API changes | Medium | Low | Version-pin API contract |
| EC2 provisioning failures | Medium | Medium | Retry logic, fallback regions |
| etcd leader election issues | High | Low | Use proven libraries, HA setup |
| CML console collection timeouts | Medium | Medium | Configurable timeout, partial collection |
| CloudEvent delivery failures | Medium | Low | Idempotent handlers, dead letter queue |
| Data migration corruption | High | Low | Verification script, backup before migration, parallel-run old/new |
| API path breaking changes | High | Medium | Versioned API (/api/v1/sessions/), deprecation period for old paths |
| Multi-collection consistency | Medium | Medium | Transactional updates where possible, compensating actions on failure |
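The "idempotent handlers" mitigation for CloudEvent delivery failures relies on the CloudEvents rule that the (source, id) pair uniquely identifies an event, so a consumer can safely drop redeliveries. A minimal sketch follows; the in-memory set and handler name are illustrative, and a real consumer would persist processed IDs.

```python
# Deduplicate CloudEvent deliveries using the spec-mandated unique
# (source, id) pair. The in-memory set is illustrative only; a production
# consumer would persist processed IDs and pair this with a dead letter
# queue for events that repeatedly fail.
processed: set[tuple[str, str]] = set()

def handle_cloudevent(event: dict) -> bool:
    """Apply an event's effect exactly once; redeliveries become no-ops."""
    key = (event["source"], event["id"])
    if key in processed:
        return False  # duplicate delivery, e.g. a retry after a timeout
    # ... dispatch on event["type"] and apply the state transition here ...
    processed.add(key)
    return True
```

Because the handler returns success for duplicates without reapplying the transition, the sender can retry freely after timeouts without risking double state changes.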
14. Success Metrics¶
Phase Completion Criteria¶
| Phase | Key Metric | Target |
|---|---|---|
| Phase 0 | Domain model tests pass | 100% |
| Phase 1 | Capacity tracking accurate | Verified in integration tests |
| Phase 2 | Scheduling respects capacity | No over-allocation |
| Phase 3 | Auto-scaling functional | Workers provision/deprovision |
| Phase 4 | LDS integration working | Sessions created in staging |
| Phase 6 | SSE & Frontend usable | MVP UI functional |
| Phase 7 | Session entity model migrated | ✅ All 4 collections operational, 7I (CloudEventIngestor) deferred to future phase |
| Phase 5 | Grading flow complete | Scores stored correctly |
MVP Readiness Checklist¶
- [ ] All phases complete
- [ ] Integration tests pass
- [ ] Staging environment validated
- [ ] IMPLEMENTATION_STATUS.md updated
- [ ] Runbooks created for operations
- [ ] Monitoring dashboards configured
- [ ] Documentation current
15. Revision History¶
| Version | Date | Author | Changes |
|---|---|---|---|
| 4.2.0 | 2026-02-20 | LCM Architecture Team | Phase 7 COMPLETE. Updated §10 status to ✅ Complete, acceptance criteria checked (6/7 — CloudEvent webhook deferred per AD-P7-06). Timeline progress bar updated to 100%. Phase Dependencies diagram updated (Phase 7 ✅). MVP Scope table updated. Deferred table: Grading Integration unblocked. Success Metrics: Phase 7 row updated. |
| 4.1.0 | 2026-02-18 | LCM Architecture Team | Phase 7 extraction: Extracted detailed Phase 7 plan into dedicated phase-7-session-migration.md with codebase audit (entity/command/query/collection inventory across all services), 12 sub-phases (7A–7L) refined from audit findings, and 5 migration strategy decisions (AD-P7-01 through AD-P7-05). Master plan §10 now contains compact summary + link. Scope expanded from 9 to 12 sub-phases based on audit (added dead code cleanup, separate lcm-core client update, and split test/verification sub-phase). |
| 4.0.0 | 2026-02-18 | LCM Architecture Team | Major update: Aligned entire plan with ADR-020 (LabletInstance → LabletSession), ADR-021 (UserSession/GradingSession/ScoreReport child entities), ADR-022 (CloudEvent ingestion via lablet-controller). Added Phase 7: Session Entity Model Migration (§10) with 9 sub-phases covering domain, application, integration, API, CloudEventIngestor, GradingSPI skeleton, data migration, frontend, and tests. Resequenced Phase 5 (Grading Integration) to depend on Phase 7. Updated Phase Dependencies diagram. Fixed Phase 4 CloudEvent handler spec (§7.3) from control-plane-api to lablet-controller. Updated all phases (0–6) with LabletSession terminology. Added 3 new risks for migration. Updated Success Metrics with Phase 7. |
| 3.0.0 | 2026-02-10 | LCM Architecture Team | Phase 6 at ~85%: G4 (53 worker-controller tests) and G5 (59 lablet-controller tests) complete. Phase 4 marked ✅ COMPLETE (staging validated via G3). Worker controller service type errors fixed (aligned SPI method calls). 8/15 Phase 6 tasks done. Remaining: low-priority polish (F4-F7, F9). |
| 2.9.0 | 2026-02-10 | LCM Architecture Team | G3 Phase 4 staging validation complete: 12/12 live LDS checks passed. Bug fix: archive_session() json=None→json={} (HTTP 415). Docker config: created lds_deployments.docker.yaml (lds-backend:4000), added LDS env vars to docker-compose.shared.yml, added lds-backend dependency. Validation script: scripts/validate_lds_integration.py. Test assertion strengthened for json={} contract. |
| 2.8.0 | 2026-02-09 | LCM Architecture Team | Phase 6 progress: F1 (ReservationsPage ~570 lines), F3 (CapacityDashboard ~320 lines), F8 (SchedulerPage ~430 lines + api/scheduler.js) completed. Navbar converted to dropdowns for Lablets/Workers tabs (AD-16). Section containers added to index.jinja. Phase 6 at ~50% (G1+F1+F2+F3+F8 done). Remaining: G3 staging validation, G4/G5 controller tests, low-priority polish (F4-F7, F9). |
| 2.7.0 | 2026-02-09 | LCM Architecture Team | Phase 6 progress: F2 (LDS Session Display) completed — DTO mappers, SSE READY handler, LabletSessionCard "Open Lab" button, 9 new tests. StateStore registerSlice + reducer-aware dispatch added to lcm_ui core (AD-9) — frontend was blocked by missing slice registration API. Parcel cache gotcha documented; CPA make clean now clears ui/.parcel-cache. Phase 6 at ~20% (G1+F2 done). |
| 2.6.0 | 2026-02-08 | LCM Architecture Team | Added Phase 6: SSE & Frontend Readiness (§9). Gap analysis identified 6 backend gaps (G1–G6) and 9 frontend gaps (F1–F9). G1 (SSE pipeline fix) completed: 21 backend handler renames, 6 frontend event types, legacy SSEService deleted, global SSE connect, initial snapshots. Phase 5 (Grading) deferred to Phase 7 (post-frontend). Renumbered §9–§14→§10–§15. Updated timeline. |
| 2.5.0 | 2026-02-09 | LCM Architecture Team | Phase 4 LDS Integration ~90% complete: LDS SPI client (636 lines, multi-region, YAML config), MarkInstanceReadyCommand (atomic INSTANTIATING→READY, AD-P4-01), HandleSessionStartedCommand (READY→RUNNING), _provision_lds_session() 7-step reconciler flow,_archive_lds_session(), internal API endpoints (mark-ready, session-started), CPA client methods, 57 new tests (44 LDS SPI + 5 mark-ready + 8 session-started). CommandHandlerBase pattern adopted for all new handlers. Remaining: staging validation. |
| 2.4.0 | 2026-02-08 | LCM Architecture Team | Phase 3 marked complete: Scale-up (RequestScaleUpCommand, template selection, EC2 provisioning), Scale-down (5 safety guards, DrainWorkerCommand, idle detection), WorkerTemplateService, scaling constraints, OTel audit metrics, 44 new tests across 3 services. Discovery state sync bug fixed (AD-21). |
| 2.3.0 | 2026-02-10 | LCM Architecture Team | Phase 2 marked complete: etcd real-time capacity in PlacementEngine, retry escalation, OTel scheduling metrics, rejection tracking, 41 new tests. All acceptance criteria met. |
| 2.2.0 | 2026-02-09 | LCM Architecture Team | Phase 1 marked complete: capacity commands, schedule/terminate integration, etcd publishing. PlacementEngine enhancement and integration tests deferred to Phase 2. |
| 2.1.0 | 2026-02-08 | LCM Architecture Team | Phase 0 marked complete, added mandatory doc maintenance section (§10), added status lines per phase, bumped section numbering |
| 2.0.0 | 2026-02-08 | LCM Architecture Team | Complete rewrite with foundation-first approach |
| 1.0.0 | 2026-02-08 | LCM Architecture Team | Initial plan (flawed - assumed worker foundation complete) |