MVP Implementation Plan¶
| Attribute | Value |
|---|---|
| Document Version | 4.2.0 |
| Status | Authoritative |
| Created | 2026-02-08 |
| Last Updated | 2026-02-20 |
| Author | LCM Architecture Team |
| Related | Codebase Discovery Audit, Requirements Spec, ADR-020, ADR-021, ADR-022, Phase 7 Execution Plan |
1. Executive Summary¶
This document is the single authoritative implementation plan for the Lablet Cloud Manager MVP. It is derived from the Codebase Discovery Audit.
Critical Insight: Foundation First¶
Dependency Chain
The MVP cannot proceed to lablet lifecycle (LDS/Grading) without a solid worker management foundation.
Correct dependency order:
Worker Capacity → Resource Scheduling → Auto-Scaling → LDS → Frontend → Session Entity Model → Grading
You cannot schedule lablet sessions without knowing worker capacity. You cannot auto-scale without tracking resource usage. You cannot implement grading without the session entity model (ADR-020/021/022).
Current State Analysis¶
| Component | State | Evidence |
|---|---|---|
| `declared_capacity` on Worker | ✅ Exists | Set from WorkerTemplate |
| `allocated_capacity` on Worker | ✅ Updated | Updated via AllocateCapacity/ReleaseCapacity commands (Phase 1) |
| PlacementEngine capacity check | ✅ Complete | Uses etcd capacity data (Phase 2 enhanced) |
| `ScheduleLabletSessionCommand` | ✅ Fixed | Validates capacity + allocates on schedule (Phase 1) |
| Worker metrics collection | ✅ Works | CloudWatch + CML stats |
| Activity detection | ✅ Works | Idle detection functional |
| Worker provisioning | ✅ Complete | EC2 provisioning via WorkerTemplateService (Phase 3) |
| Auto-scaling triggers | ✅ Complete | Scale-up (RS) + scale-down (WC) with safety guards (Phase 3) |
| LDS integration | 🔄 ~90% | LDS SPI client, session provisioning, mark-ready, session.started handling, archival |
| Grading integration | ⬜ Missing | No client, no collection flow |
MVP Scope¶
| Capability | Requirements | Phase | Current State |
|---|---|---|---|
| Worker Capacity Tracking | FR-2.4.1, FR-2.4.3 | 1 | ✅ Complete |
| Resource-Aware Scheduling | FR-2.3.2a-e | 2 | ✅ Complete |
| Auto-Scaling (Basic) | FR-2.5.1a,c,d; FR-2.5.2a-b | 3 | ✅ Complete |
| LDS Session Provisioning | FR-2.2.5, FR-2.2.6 | 4 | ✅ Complete (staging validated) |
| SSE & Frontend Readiness | FR-2.1, FR-2.2, FR-2.4 | 6 | 🔄 ~85% (G1+G3+G4+G5+F1+F2+F3+F8 done) |
| Session Entity Model | FR-2.2.1, FR-2.2.2, ADR-020/021/022 | 7 | ✅ Complete |
| Grading Integration | FR-2.6.1, FR-2.6.2 | 5 | ⬜ Ready to start (Phase 7 complete) |
Deferred (Post-MVP)¶
| Capability | Requirements | Rationale |
|---|---|---|
| Grading Integration | FR-2.6.1, FR-2.6.2 | Phase 7 complete — ready to start; requires GradingSession, ScoreReport, CloudEventIngestor |
| Warm Pool | FR-2.7.1 | Optimization, not blocking |
| Advanced Auto-Scaling | FR-2.5.1b, FR-2.5.2c-d | Basic scale-up/down sufficient |
| S3/MinIO Artifact Sync | FR-2.1.5 | Manual artifact management acceptable |
Timeline Summary¶
Phase 0: Domain Prerequisites ████████████████████ ✅ COMPLETE (2026-02-08)
Phase 1: Worker Foundation ████████████████████ ✅ COMPLETE (2026-02-09)
Phase 2: Resource Scheduling ████████████████████ ✅ COMPLETE (2026-02-10)
Phase 3: Auto-Scaling ████████████████████ ✅ COMPLETE (2026-02-08)
Phase 4: LDS Integration ████████████████████ ✅ COMPLETE (2026-02-10)
Phase 6: SSE & Frontend █████████████████░░░ 🔄 ~85% (G1+G3+G4+G5+F1+F2+F3+F8 done)
Phase 7: Session Entity Model ████████████████████ ✅ COMPLETE (2026-02-20)
Phase 5: Grading Integration ░░░░░░░░░░░░░░░░░░░░ ⬜ Blocked by Phase 7 → Ready to start
─────────────────────
Execution order: 0→1→2→3→4→6→7→5
Phase 7 COMPLETE — Phase 5 unblocked
2. Phase Dependencies¶
flowchart TD
subgraph P0["Phase 0: Domain Prerequisites ✅"]
P0A[Add READY state to LabletSessionStatus]
P0B[Add form_qualified_name to LabletDefinition]
end
subgraph P1["Phase 1: Worker Foundation ✅"]
P1A[Fix allocated_capacity updates]
P1B[Add UpdateWorkerCapacityCommand]
P1C[Verify metrics collection flow]
end
subgraph P2["Phase 2: Resource Scheduling ✅"]
P2A[Complete PlacementEngine integration]
P2B[Capacity validation in ScheduleCommand]
P2C[Capacity allocation on schedule]
end
subgraph P3["Phase 3: Auto-Scaling ✅"]
P3A[Implement worker provisioning]
P3B[Scale-up trigger logic]
P3C[Scale-down trigger logic]
end
subgraph P4["Phase 4: LDS Integration ✅"]
P4A[LDS SPI Client]
P4B[Session provisioning in LabletReconciler]
P4C[MarkSessionReady + HandleSessionStarted]
end
subgraph P6["Phase 6: SSE & Frontend 🔄"]
P6A[G1: Fix SSE pipeline ✅]
P6B[F2: LDS Session Display ✅]
P6C[F1: Reservation UI ✅]
P6D[F3: Capacity Dashboard ✅]
P6E[G4/G5: Test coverage ✅]
end
subgraph P7["Phase 7: Session Entity Model ✅"]
P7A[LabletSession aggregate + LabletSessionStatus enum]
P7B[UserSession / GradingSession / ScoreReport entities]
P7C[CQRS commands + queries for child entities]
P7D[MongoDB repos for 4 new collections]
P7E[CloudEventIngestor in lablet-controller]
P7F[GradingSPI adapter skeleton]
P7G[Data migration + etcd key migration]
end
subgraph P5["Phase 5: Grading Integration ⬜"]
P5A[Collection service]
P5B[Grading SPI Client integration]
P5C[CloudEvent handlers for grading.completed/failed]
end
P0 --> P1
P1 --> P2
P2 --> P3
P2 --> P4
P3 --> P4
P4 --> P6
P6 --> P7
P7 --> P5
3. Phase 0: Domain Prerequisites¶
Status: ✅ COMPLETE (2026-02-08)
Goal: Prepare domain models for LDS integration (required for Phase 4).
Duration: ~1 week
Requirements: FR-2.2.1, FR-2.2.5h, FR-2.1.6
Phase 0 Complete
All 8 tasks completed. 21 new tests added (210 total domain tests pass).
Key artifacts: LabletSessionStatus.READY, LabletSessionReadyDomainEvent,
form_qualified_name on LabletDefinition.
Breaking change: INSTANTIATING→RUNNING is now invalid; must transition through READY.
Bootstrap prompt: PHASE_0_BOOTSTRAP.md
ADR-020/021 Impact
Phase 0 was implemented using the original LabletInstance naming. Per ADR-020, LabletInstance
is renamed to LabletSession. Per ADR-021, lds_session_id and lds_login_url move from the
session aggregate to the UserSession child entity. Phase 7 implements these migrations.
3.1 Tasks¶
| ID | Task | Service | File(s) | Estimate |
|---|---|---|---|---|
| P0-1 | Add `READY` state to `LabletSessionStatus` | control-plane-api | `domain/enums.py` | 2h |
| P0-2 | Update `LABLET_SESSION_VALID_TRANSITIONS` | control-plane-api | `domain/enums.py` | 1h |
| P0-3 | Add `form_qualified_name` to `LabletDefinition` | control-plane-api | `domain/entities/lablet_definition.py` | 2h |
| P0-4 | Add `lds_session_id`, `lds_login_url` to `LabletSession` (superseded by ADR-021: fields move to UserSession in Phase 7) | control-plane-api | `domain/entities/lablet_session.py` | 2h |
| P0-5 | Add `LabletSessionReadyDomainEvent` | control-plane-api | `domain/events/lablet_session_events.py` | 1h |
| P0-6 | Update `LabletSessionReadModel` | lcm-core | `domain/entities/read_models/lablet_session_read_model.py` | 1h |
| P0-7 | Update `LabletDefinitionReadModel` | lcm-core | `domain/entities/read_models/lablet_definition_read_model.py` | 1h |
| P0-8 | Unit tests for new state transitions | control-plane-api | `tests/domain/` | 3h |
3.2 Specification: READY State¶
Current State Machine:
Updated State Machine:
Transition Table Update:
# domain/enums.py - LABLET_SESSION_VALID_TRANSITIONS
LabletSessionStatus.INSTANTIATING: [
LabletSessionStatus.READY, # NEW: LDS provisioned, awaiting user login
LabletSessionStatus.TERMINATED,
],
LabletSessionStatus.READY: [ # NEW STATE
LabletSessionStatus.RUNNING, # User logged in (CloudEvent from LDS)
LabletSessionStatus.TERMINATED,
],
READY State Semantics:
- CML lab is running on worker
- LDS LabSession is provisioned with device access info
- UserSession entity created with `login_url` (ADR-021)
- Awaiting user login via LDS portal
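A hedged sketch of how the transition table can be enforced; the enum values come from the spec above, but the trimmed enum and the `can_transition` helper are illustrative, not the actual `domain/enums.py` code:

```python
from enum import Enum

class LabletSessionStatus(str, Enum):
    INSTANTIATING = "INSTANTIATING"
    READY = "READY"            # NEW in Phase 0
    RUNNING = "RUNNING"
    TERMINATED = "TERMINATED"

# Trimmed transition table mirroring the spec above
LABLET_SESSION_VALID_TRANSITIONS = {
    LabletSessionStatus.INSTANTIATING: [LabletSessionStatus.READY, LabletSessionStatus.TERMINATED],
    LabletSessionStatus.READY: [LabletSessionStatus.RUNNING, LabletSessionStatus.TERMINATED],
}

def can_transition(current: LabletSessionStatus, target: LabletSessionStatus) -> bool:
    """True only if the transition table allows current -> target."""
    return target in LABLET_SESSION_VALID_TRANSITIONS.get(current, [])
```

Note how the Phase 0 breaking change falls out of the table: `INSTANTIATING → RUNNING` is simply absent, so any direct transition attempt is rejected.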
3.3 Acceptance Criteria¶
- [x] `LabletSessionStatus.READY` exists in enum
- [x] Transition `INSTANTIATING → READY` is valid
- [x] Transition `READY → RUNNING` is valid
- [x] `LabletDefinition.form_qualified_name` persists correctly
- [x] LDS session fields persist correctly (Phase 7 migrates to UserSession per ADR-021)
- [x] Domain events emit correctly
- [x] Read models reflect new attributes
- [x] All unit tests pass (210 tests, 21 new)
4. Phase 1: Worker Foundation¶
Status: ✅ COMPLETE (2026-02-09)
Goal: Ensure worker capacity is accurately tracked so scheduling can make informed decisions.
Duration: ~2 weeks
Requirements: FR-2.4.1, FR-2.4.3
Bootstrap: PHASE_1_BOOTSTRAP.md
Phase 1 Complete
Worker capacity is now accurately tracked through the full lablet session lifecycle.
Key artifacts: AllocateCapacityCommand, ReleaseCapacityCommand, WorkerCapacityPublisher.
ScheduleLabletSessionCommand now validates worker status (RUNNING) and capacity before scheduling,
then allocates capacity via mediator. TerminateLabletSessionCommand releases capacity on termination.
Capacity snapshots are published to etcd at /lcm/workers/{id}/capacity.
Tests: 26 new tests in test_capacity_commands.py (387 total CPA non-integration tests pass).
Integration tests and PlacementEngine enhancement deferred to Phase 2.
Bootstrap prompt: PHASE_1_BOOTSTRAP.md
4.1 Problem Statement¶
The PlacementEngine in resource-scheduler uses declared_capacity and allocated_capacity to filter eligible workers. However:
- `allocated_capacity` is never updated when sessions are scheduled
- `ScheduleLabletSessionCommand` has a TODO: "Check worker status and capacity"
- No command exists to update worker capacity when sessions start/stop
This means PlacementEngine sees stale data and may over-allocate workers.
4.2 Tasks¶
| ID | Task | Service | File(s) | Estimate |
|---|---|---|---|---|
| P1-1 | Create `AllocateCapacityCommand` | control-plane-api | `application/commands/worker/` | 4h |
| P1-2 | Create `ReleaseCapacityCommand` | control-plane-api | `application/commands/worker/` | 4h |
| P1-3 | Update `ScheduleLabletSessionCommand` to call allocate | control-plane-api | `application/commands/lablet_session/` | 3h |
| P1-4 | Update session termination to call release | control-plane-api | `application/commands/lablet_session/` | 3h |
| P1-5 | Add capacity change domain events | control-plane-api | `domain/events/cml_worker.py` | 2h |
| P1-6 | Publish capacity to etcd for scheduler | control-plane-api | `application/services/` | 4h |
| P1-7 | Verify metrics collection end-to-end | worker-controller | `application/hosted_services/worker_reconciler.py` | 4h |
| P1-8 | Integration tests for capacity flow | control-plane-api | `tests/integration/` | 8h |
4.3 Specification: Capacity Allocation Flow¶
sequenceDiagram
participant RS as resource-scheduler
participant CPA as control-plane-api
participant Worker as CMLWorker Aggregate
participant etcd as etcd
RS->>CPA: POST /api/internal/sessions/{id}/schedule (worker_id)
CPA->>Worker: schedule(session_id, worker_id)
Worker->>Worker: validate capacity available
Worker->>Worker: allocate_capacity(requirements)
Worker-->>CPA: LabletSessionScheduledDomainEvent
Worker-->>CPA: WorkerCapacityAllocatedDomainEvent
CPA->>etcd: PUT /workers/{id}/allocated_capacity
CPA-->>RS: 200 OK {status: SCHEDULED}
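The final etcd write in the diagram publishes a capacity snapshot under `/lcm/workers/{id}/capacity` (per the Phase 1 summary). A minimal sketch of the key and payload shape; the JSON field names (`declared`/`allocated`/`available`) are assumptions, not the real schema:

```python
import json

def capacity_key(worker_id: str) -> str:
    """etcd key for a worker's capacity snapshot (path per the Phase 1 summary)."""
    return f"/lcm/workers/{worker_id}/capacity"

def capacity_snapshot(declared: dict[str, int], allocated: dict[str, int]) -> str:
    """Serialize a capacity snapshot; derives 'available' so the scheduler
    never has to recompute it. The JSON shape is an assumption."""
    return json.dumps({
        "declared": declared,
        "allocated": allocated,
        "available": {k: declared[k] - allocated.get(k, 0) for k in declared},
    })
```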
AllocateCapacityCommand:
@dataclass
class AllocateCapacityCommand(Command[OperationResult[dict]]):
    """Allocate capacity on a worker for a lablet session."""

    worker_id: str
    session_id: str
    cpu_cores: int
    memory_gb: int
    storage_gb: int
    node_count: int
ReleaseCapacityCommand:
@dataclass
class ReleaseCapacityCommand(Command[OperationResult[dict]]):
    """Release capacity when session terminates."""

    worker_id: str
    session_id: str
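A toy version of the allocate/release arithmetic these commands drive on the worker aggregate; the class and field names below are illustrative, not the production `CMLWorker` code:

```python
from dataclasses import dataclass

class CapacityError(Exception):
    """Raised when a worker cannot satisfy a session's requirements."""

@dataclass
class Capacity:
    cpu_cores: int
    memory_gb: int

class Worker:
    """Toy aggregate: tracks declared vs. allocated capacity per session."""

    def __init__(self, declared: Capacity) -> None:
        self.declared = declared
        self.allocated = Capacity(0, 0)
        self._sessions: dict[str, Capacity] = {}

    def allocate(self, session_id: str, req: Capacity) -> None:
        # Validate before mutating, mirroring AllocateCapacityCommand semantics
        if (self.allocated.cpu_cores + req.cpu_cores > self.declared.cpu_cores
                or self.allocated.memory_gb + req.memory_gb > self.declared.memory_gb):
            raise CapacityError("insufficient capacity")
        self._sessions[session_id] = req
        self.allocated.cpu_cores += req.cpu_cores
        self.allocated.memory_gb += req.memory_gb

    def release(self, session_id: str) -> None:
        # Idempotent: releasing an unknown session is a no-op
        req = self._sessions.pop(session_id, None)
        if req is not None:
            self.allocated.cpu_cores -= req.cpu_cores
            self.allocated.memory_gb -= req.memory_gb
```

Tracking requirements per session is what makes release idempotent: a duplicate ReleaseCapacity command cannot drive `allocated_capacity` negative.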
4.4 Acceptance Criteria¶
- [x] `AllocateCapacityCommand` updates `allocated_capacity` on worker
- [x] `ReleaseCapacityCommand` decrements `allocated_capacity` on worker
- [x] `ScheduleLabletSessionCommand` validates capacity before scheduling
- [x] `ScheduleLabletSessionCommand` calls `AllocateCapacityCommand` on success
- [x] Session termination triggers `ReleaseCapacityCommand`
- [x] Worker capacity changes are published to etcd
- [x] PlacementEngine sees current capacity data (completed in Phase 2)
- [x] Integration tests verify full flow (completed in Phase 2)
5. Phase 2: Resource Scheduling¶
Status: ✅ COMPLETE (2026-02-10)
Goal: Resource-scheduler makes accurate placement decisions based on real-time capacity.
Duration: ~1.5 weeks
Requirements: FR-2.3.2a-e
5.1 Problem Statement¶
The PlacementEngine logic is mostly implemented but:
- It uses stale capacity data (fixed in Phase 1)
- The `ScheduleLabletSessionCommand` doesn't validate capacity
- No feedback loop exists if scheduling fails
5.2 Tasks¶
| ID | Task | Service | File(s) | Estimate |
|---|---|---|---|---|
| P2-1 | Add capacity validation in PlacementEngine | resource-scheduler | `application/services/placement_engine.py` | 4h |
| P2-2 | Fetch fresh capacity from etcd (not API) | resource-scheduler | `application/hosted_services/scheduler_hosted_service.py` | 4h |
| P2-3 | Handle scheduling failures with retry | resource-scheduler | `application/hosted_services/scheduler_hosted_service.py` | 4h |
| P2-4 | Add scheduling metrics (success/fail/scale-up) | resource-scheduler | `application/services/` | 3h |
| P2-5 | Update scheduler to request scale-up on no capacity | resource-scheduler | `application/hosted_services/scheduler_hosted_service.py` | 4h |
| P2-6 | Integration tests for scheduling decisions | resource-scheduler | `tests/integration/` | 6h |
5.3 Specification: Enhanced PlacementEngine¶
Current _check_resource_capacity (partial):
def _check_resource_capacity(self, worker: dict, definition: dict) -> bool:
    # Uses declared_capacity - allocated_capacity
    # Problem: allocated_capacity is stale
Enhanced flow:
def schedule(self, instance: dict, definition: dict, workers: list) -> SchedulingDecision:
    # 1. Filter by status (RUNNING only)
    # 2. Filter by license affinity
    # 3. Filter by AMI requirements
    # 4. Filter by REAL-TIME capacity (from etcd or fresh API call)
    # 5. Filter by port availability
    # 6. Score by utilization (bin-packing)
    # 7. Select best or request scale-up
5.4 Acceptance Criteria¶
- [x] PlacementEngine uses real-time capacity data (etcd capacity preferred, API fallback)
- [x] Scheduling fails gracefully with clear error message (rejection_summary tracking)
- [x] Failed scheduling triggers requeue with backoff (base class backoff + max retry escalation at 5 failures → 300s)
- [x] Scale-up decision made when no eligible workers (granular rejection categories: status/license/capacity/ami/ports)
- [x] Scheduling metrics are emitted (OTel: decisions, latency, retries, etcd fetches, scale-ups)
- [x] Integration tests cover all decision paths (41 new tests: 17 PlacementEngine + 24 SchedulerHostedService)
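The requeue backoff with max-retry escalation to 300s noted above might be computed like this; the base delay and doubling factor are assumed constants, not the real scheduler configuration:

```python
def requeue_delay(failure_count: int, base: float = 5.0, cap: float = 300.0) -> float:
    """Exponential backoff per consecutive failure, escalating to a flat
    300-second ceiling once 5 failures are reached (constants are assumptions)."""
    if failure_count >= 5:
        return cap
    return min(base * (2 ** failure_count), cap)
```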
6. Phase 3: Auto-Scaling¶
Status: ✅ COMPLETE (2026-02-08)
Goal: Automatically provision/deprovision workers based on demand.
Duration: ~2 weeks
Requirements: FR-2.5.1a,c,d; FR-2.5.2a-b
Bootstrap: PHASE_3_BOOTSTRAP.md (retrospective)
Phase 3 Complete
Full auto-scaling implemented across 3 services. Scale-up: resource-scheduler detects no eligible
workers → selects cheapest viable template → RequestScaleUpCommand creates PENDING worker →
worker-controller provisions EC2 via _handle_pending(). Scale-down: worker-controller evaluates
idle workers (5 safety guards) → DrainWorkerCommand sets DRAINING → stops EC2 when empty.
Key artifacts: RequestScaleUpCommand (CPA), DrainWorkerCommand (CPA), _handle_pending()
provisioning (WC), _evaluate_scale_down() (WC), _select_template_for_requirements() (RS),
WorkerTemplateService (CPA), scaling constraints (min/max workers, cooldowns).
Tests: 44 new tests (14 CPA + 17 RS + 13 WC). Also fixed discovery state sync bug (AD-21).
Bootstrap prompt: PHASE_3_BOOTSTRAP.md
6.1 Problem Statement¶
Currently, worker provisioning is stubbed:
# worker_reconciler.py line ~290
async def _handle_pending(self, worker: CMLWorkerReadModel) -> ReconciliationResult:
    # ...
    # For now, requeue until template system is implemented
    return ReconciliationResult.requeue("Template provisioning not yet implemented")
No automatic scale-up or scale-down exists.
6.2 Tasks¶
| ID | Task | Service | File(s) | Estimate |
|---|---|---|---|---|
| P3-1 | Implement `_handle_pending` with EC2 provisioning | worker-controller | `application/hosted_services/worker_reconciler.py` | 8h |
| P3-2 | Create `RequestScaleUpCommand` | control-plane-api | `application/commands/worker/` | 4h |
| P3-3 | Implement scale-up trigger in scheduler | resource-scheduler | `application/hosted_services/scheduler_hosted_service.py` | 6h |
| P3-4 | Create scale-down detection job | worker-controller | `application/hosted_services/` | 6h |
| P3-5 | Implement worker draining before scale-down | control-plane-api | `domain/entities/cml_worker.py` | 4h |
| P3-6 | Add scaling constraints (min/max workers) | control-plane-api | `application/settings.py` | 2h |
| P3-7 | Add scaling audit log | control-plane-api | `application/services/` | 3h |
| P3-8 | Integration tests for scale-up flow | all | `tests/integration/` | 8h |
| P3-9 | Integration tests for scale-down flow | all | `tests/integration/` | 6h |
6.3 Specification: Scale-Up Flow¶
sequenceDiagram
participant RS as resource-scheduler
participant CPA as control-plane-api
participant WC as worker-controller
participant EC2 as AWS EC2
RS->>RS: No eligible workers for session
RS->>CPA: POST /api/workers/scale-up {template, reason}
CPA->>CPA: Create CMLWorker (status=PENDING)
CPA-->>RS: 202 Accepted {worker_id}
RS->>RS: Requeue session (await worker)
Note over WC: Reconciliation loop
WC->>CPA: GET /api/workers?status=PENDING
WC->>EC2: RunInstances(template config)
EC2-->>WC: ec2_instance_id, pending
WC->>CPA: PATCH /api/workers/{id} {status=PROVISIONING, ec2_id}
Note over WC: Next reconcile cycle
WC->>EC2: DescribeInstances(ec2_instance_id)
EC2-->>WC: running, ip_address
WC->>CPA: PATCH /api/workers/{id} {status=RUNNING, ip}
Note over RS: Next reconcile cycle
RS->>CPA: GET /api/workers?status=RUNNING
RS->>RS: Worker now eligible
RS->>CPA: POST /api/internal/sessions/{id}/schedule
6.4 Specification: Scale-Down Flow¶
sequenceDiagram
participant WC as worker-controller
participant CPA as control-plane-api
participant EC2 as AWS EC2
Note over WC: Idle detection
WC->>CPA: GET /api/workers/{id}
WC->>WC: Check: no sessions, idle > threshold
WC->>CPA: POST /api/workers/{id}/drain
CPA->>CPA: Set status=DRAINING
Note over WC: Wait for drain complete
WC->>CPA: GET /api/workers/{id}/sessions
CPA-->>WC: [] (empty)
WC->>CPA: POST /api/workers/{id}/stop
CPA->>CPA: Set desired_status=stopped
Note over WC: Next reconcile
WC->>EC2: StopInstances(ec2_instance_id)
EC2-->>WC: stopping
WC->>CPA: PATCH /api/workers/{id} {status=STOPPED}
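The idle check at the top of the diagram corresponds to the Phase 3 scale-down safety guards. A hedged sketch of such an evaluation; the exact guard set and the field names on `worker` are assumptions modeled on the Phase 3 summary, not the real `_evaluate_scale_down()`:

```python
from datetime import datetime, timedelta, timezone

def should_scale_down(worker: dict, *, min_workers: int, running_workers: int,
                      idle_threshold: timedelta, cooldown: timedelta,
                      now: datetime) -> bool:
    """Evaluate scale-down safety guards for one worker (illustrative)."""
    if worker["active_sessions"] > 0:
        return False                      # guard: never drain a busy worker
    if running_workers <= min_workers:
        return False                      # guard: respect the fleet floor
    if now - worker["idle_since"] < idle_threshold:
        return False                      # guard: must be idle long enough
    if now - worker["last_scaling_event"] < cooldown:
        return False                      # guard: scaling cooldown
    return True                           # all guards passed -> DrainWorkerCommand
```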
6.5 Acceptance Criteria¶
- [x] Worker provisioning creates real EC2 instances
- [x] Worker template configuration is applied correctly
- [x] Scale-up triggered when no eligible workers for pending session
- [x] Scale-down triggered when worker has no sessions and is idle
- [x] Draining prevents new session scheduling
- [x] Scaling constraints (min/max) are respected
- [x] Scaling decisions are audit-logged (OTel metrics + structured logging)
- [x] Integration tests verify scale-up scenarios (14 tests); scale-down unit-tested (13 tests)
7. Phase 4: LDS Integration¶
Status: ✅ COMPLETE (2026-02-10)
Goal: Provision LDS LabSessions for lablet sessions and handle user login events.
Duration: ~2 weeks
Requirements: FR-2.2.5, FR-2.2.6
Bootstrap: PHASE_4_BOOTSTRAP.md
Phase 4 Complete
LDS integration implemented across lablet-controller and control-plane-api. LDS SPI Client:
636-line REST client with multi-region deployment support, YAML config, data models
(DeviceAccessInfo, SessionPartInfo, LdsSessionInfo, LdsDeploymentConfig). Reconciler LDS Flow:
7-step _provision_lds_session() (get definition → get nodes → create session → build device list →
set devices → get launch URL → call CPA mark-ready), _archive_lds_session() on TERMINATED,
_build_device_access_list() static helper. CPA Commands: MarkSessionReadyCommand (atomic
INSTANTIATING→READY with LDS info, AD-P4-01), HandleSessionStartedCommand (READY→RUNNING on
session.started). Internal API: PUT /api/internal/sessions/{id}/mark-ready, POST /api/internal/
sessions/{id}/transition. CPA Client: mark_session_ready(), notify_session_started() in
ControlPlaneApiClient. Tests: 57 new tests (44 LDS SPI + 5 mark-ready + 8 session-started).
Remaining: Staging validation with live LDS deployment.
ADR-022 Impact
Phase 4 was originally implemented with CloudEvent handling in control-plane-api. Per ADR-022, all CloudEvents (LDS + GradingEngine) are now routed to lablet-controller via CloudEventIngestor. Phase 7 implements this migration.
7.1 Tasks¶
| ID | Task | Service | File(s) | Status | Estimate |
|---|---|---|---|---|---|
| P4-1 | Create LDS SPI client (data models + config) | lablet-controller | `integration/services/lds_spi.py`, `config/lds_deployments.yaml` | ✅ | 4h |
| P4-2 | Implement LDS REST client (multi-region) | lablet-controller | `integration/services/lds_spi.py` | ✅ | 6h |
| P4-3 | Add `MarkSessionReadyCommand` (atomic INSTANTIATING→READY) | control-plane-api | `application/commands/lablet_session/mark_session_ready_command.py` | ✅ | 3h |
| P4-4 | Update `TransitionLabletSessionCommand` for READY | control-plane-api | `application/commands/lablet_session/transition_lablet_session_command.py` | ✅ | 2h |
| P4-5 | Update LabletReconciler: `_provision_lds_session()` | lablet-controller | `application/hosted_services/lablet_reconciler.py` | ✅ | 8h |
| P4-6 | Add internal API endpoints (mark-ready, session-started) | control-plane-api | `api/controllers/internal_controller.py` | ✅ | 4h |
| P4-7 | `HandleSessionStartedCommand` (READY→RUNNING) | control-plane-api | `application/commands/lablet_session/handle_session_started_command.py` | ✅ | 4h |
| P4-8 | Add `_archive_lds_session()` on TERMINATED | lablet-controller | `application/hosted_services/lablet_reconciler.py` | ✅ | 3h |
| P4-9 | Tests (LDS SPI + command handlers) | LC, CPA | `tests/` | ✅ | 8h |
| P4-CPA-Client | Add `mark_session_ready()`, `notify_session_started()` to CPA client | lcm-core | `integration/clients/control_plane_client.py` | ✅ | 2h |
7.2 Specification: LDS Client Interface¶
# lablet-controller/integration/services/lds_spi.py
class LdsSpiClient:
    """LDS (Lab Delivery System) SPI client."""

    async def create_session_with_part(
        self,
        username: str,
        timeslot_start: datetime,
        timeslot_end: datetime,
        form_qualified_name: str,
    ) -> LdsSessionInfo:
        """Create LabSession with a LabSessionPart for the content."""
        ...

    async def set_devices(
        self,
        session_id: str,
        devices: list[DeviceAccessInfo],
    ) -> None:
        """Set device access information for the session."""
        ...

    async def get_session_info(self, session_id: str) -> LdsSessionInfo:
        """Get session details including login_url."""
        ...

    async def archive_session(self, session_id: str) -> None:
        """Archive session (called on TERMINATED)."""
        ...

@dataclass
class DeviceAccessInfo:
    name: str       # Device label from content.xml
    protocol: str   # "ssh", "telnet", "vnc", "web"
    host: str       # Worker IP address
    port: int       # Allocated port
    uri: str        # Connection URI
    username: str   # Device credentials
    password: str   # Device credentials

@dataclass
class LdsSessionInfo:
    session_id: str
    login_url: str
    status: str
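For illustration, mapping CML lab nodes to device entries (step 4 of the reconciler flow, the `_build_device_access_list()` helper) might look like the sketch below. The trimmed dataclass omits credentials, and the sequential port allocation and ssh-only protocol are assumptions, not the real implementation:

```python
from dataclasses import dataclass

@dataclass
class DeviceAccessInfo:
    """Trimmed version of the SPI dataclass (credentials omitted)."""
    name: str
    protocol: str
    host: str
    port: int
    uri: str

def build_device_access_list(nodes: list[dict], worker_ip: str,
                             base_port: int = 9000) -> list[DeviceAccessInfo]:
    """Map CML lab nodes to LDS device entries, one allocated port per node."""
    return [
        DeviceAccessInfo(
            name=node["label"],
            protocol="ssh",
            host=worker_ip,
            port=base_port + offset,
            uri=f"ssh://{worker_ip}:{base_port + offset}",
        )
        for offset, node in enumerate(nodes)
    ]
```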
7.3 Specification: CloudEvent Ingestion (ADR-022)¶
Per ADR-022, all external CloudEvents
are routed to lablet-controller, not control-plane-api. The lablet-controller uses Neuroglia's
CloudEventIngestor with @dispatch handlers:
# lablet-controller/application/events/cloud_event_ingestor.py
class LabletCloudEventIngestor(CloudEventIngestor):
    """Receives and dispatches CloudEvents from LDS and GradingEngine."""

    def __init__(self, service_provider, control_plane_client: ControlPlaneApiClient):
        super().__init__(service_provider)
        self._cpa = control_plane_client

    @dispatch(LdsSessionStartedEvent)
    async def on_lds_session_started(self, event: LdsSessionStartedEvent) -> None:
        """Handle LDS session.started — transition READY → RUNNING."""
        await self._cpa.update_user_session_status(event.lablet_session_id, UserSessionStatus.ACTIVE)
        await self._cpa.transition_session(event.lablet_session_id, LabletSessionStatus.RUNNING)

    @dispatch(LdsSessionEndedEvent)
    async def on_lds_session_ended(self, event: LdsSessionEndedEvent) -> None:
        """Handle LDS session.ended — transition RUNNING → COLLECTING, initiate grading."""
        await self._cpa.update_user_session_status(event.lablet_session_id, UserSessionStatus.ENDED)
        await self._cpa.transition_session(event.lablet_session_id, LabletSessionStatus.COLLECTING)
        # Initiate grading via GradingSPI (Phase 7)

    @dispatch(GradingSessionCompletedEvent)
    async def on_grading_completed(self, event: GradingSessionCompletedEvent) -> None:
        """Handle grading.session.completed — create ScoreReport, transition to STOPPING."""
        await self._cpa.create_score_report(event.lablet_session_id, event.score_data)
        await self._cpa.update_grading_session_status(event.lablet_session_id, GradingStatus.SUBMITTED)
        await self._cpa.transition_session(event.lablet_session_id, LabletSessionStatus.STOPPING)

    @dispatch(GradingSessionFailedEvent)
    async def on_grading_failed(self, event: GradingSessionFailedEvent) -> None:
        """Handle grading.session.failed — mark FAULTED."""
        await self._cpa.update_grading_session_status(
            event.lablet_session_id, GradingStatus.FAULTED, error=event.error_message
        )
CloudEvent endpoint registered at POST /api/events on lablet-controller.
State mutations are proxied to control-plane-api via internal REST calls (ADR-001).
7.4 Acceptance Criteria¶
- [x] LDS client creates sessions with device access info (636-line `LdsSpiClient` with multi-region support)
- [x] LabletReconciler provisions LDS session during INSTANTIATING (7-step `_provision_lds_session()` flow)
- [x] Session transitions to READY after LDS provisioning (atomic `MarkSessionReadyCommand` via AD-P4-01)
- [x] LDS session info stored on UserSession entity (Phase 7 migration — currently on LabletSession via `mark_ready()` domain method)
- [x] CloudEvent `session.started` triggers READY→RUNNING transition (Phase 7 migrates handler to lablet-controller CloudEventIngestor per ADR-022)
- [x] LDS session archived when session reaches TERMINATED (`_archive_lds_session()` with graceful error handling)
- [x] Integration tests verify flow with mock LDS (57 tests: 44 LDS SPI + 5 mark-ready + 8 session-started)
- [x] Staging validation with live LDS deployment (G3: 12/12 checks passed, bug fix: `archive_session()` `json=None` → `json={}`, Docker config added)
8. Phase 5: Grading Integration¶
Status: ⬜ Ready to start (Phase 7 complete)
Goal: Collect device configurations and submit for grading via GradingEngine.
Duration: ~1.5 weeks
Requirements: FR-2.6.1, FR-2.6.2
Dependencies: Phase 7 (Session Entity Model), complete as of 2026-02-20
Phase 7 Prerequisite
Phase 5 requires the following Phase 7 deliverables:
- GradingSession entity (ADR-021) — stores grading state and GradingEngine IDs
- ScoreReport entity (ADR-021) — stores grading results with per-section breakdowns
- CloudEventIngestor (ADR-022) — routes grading.session.completed/failed events to handlers
- GradingSPI adapter — client for GradingEngine REST API
- Internal API endpoints — CRUD for GradingSession and ScoreReport entities
8.1 Tasks¶
| ID | Task | Service | File(s) | Estimate |
|---|---|---|---|---|
| P5-1 | Create collection service | lablet-controller | `application/services/collection_service.py` | 8h |
| P5-2 | Add console command execution to CML SPI | lablet-controller | `integration/services/cml_labs_spi.py` | 4h |
| P5-3 | Add `StartCollectionCommand` | control-plane-api | `application/commands/lablet_session/` | 3h |
| P5-4 | Add `CollectionCompletedCommand` | control-plane-api | `application/commands/lablet_session/` | 3h |
| P5-5 | Wire GradingSPI client (skeleton from Phase 7) | lablet-controller | `integration/services/grading_spi.py` | 4h |
| P5-6 | Update LabletReconciler for COLLECTING state | lablet-controller | `application/hosted_services/lablet_reconciler.py` | 6h |
| P5-7 | Update LabletReconciler for GRADING state | lablet-controller | `application/hosted_services/lablet_reconciler.py` | 4h |
| P5-8 | Wire CloudEvent handlers for grading events (skeleton from Phase 7) | lablet-controller | `application/events/cloud_event_ingestor.py` | 3h |
| P5-9 | Integration tests for grading flow | lablet-controller | `tests/integration/` | 6h |
8.2 Specification: Collection Service¶
# lablet-controller/application/services/collection_service.py
class CollectionService:
    """Collects device configurations from CML lab nodes."""

    async def collect_configs(
        self,
        worker_ip: str,
        lab_id: str,
        commands: list[str] | None = None,
    ) -> CollectionResult:
        """Collect running-config from all nodes in the lab."""
        commands = commands or ["show running-config"]
        nodes = await self._cml.get_lab_nodes(worker_ip, lab_id)
        results = []
        for node in nodes:
            node_outputs = []
            for command in commands:
                output = await self._cml.execute_console_command(
                    worker_ip, lab_id, node.id, command
                )
                node_outputs.append(CommandOutput(command=command, output=output))
            results.append(DeviceCollection(name=node.label, outputs=node_outputs))
        return CollectionResult(devices=results, collected_at=datetime.now(UTC))
8.3 Specification: State Flow¶
stateDiagram-v2
[*] --> RUNNING: User logged in
RUNNING --> COLLECTING: StartCollectionCommand
COLLECTING --> GRADING: Collection complete
GRADING --> STOPPING: Grading complete
STOPPING --> STOPPED: Resources released
STOPPED --> TERMINATED: Cleanup complete
TERMINATED --> [*]
8.4 Acceptance Criteria¶
- [ ] Collection service gathers `show running-config` from all nodes
- [ ] RUNNING→COLLECTING transition works via command
- [ ] Collection results stored/forwarded to grading engine
- [ ] COLLECTING→GRADING transition after collection complete
- [ ] Grading engine submission works
- [ ] CloudEvent `grading.completed` triggers score storage
- [ ] GRADING→STOPPING transition after grading complete
- [ ] Integration tests verify full flow
9. Phase 6: SSE, Frontend & Integration Readiness¶
Status: 🔄 In Progress
Goal: Fix SSE event pipeline, fill integration gaps, and build the frontend UI to MVP readiness.
Duration: ~3–4 weeks
Requirements: FR-2.1 (UI), FR-2.2 (lifecycle visibility), FR-2.4 (capacity dashboard), FR-2.6 (grading display)
Phase 6 Context
A gap analysis between the codebase and the MVP requirements revealed that Phases 0–4 focused on backend domain logic, CQRS commands, and inter-service integration. Phase 6 addresses the remaining SSE pipeline bugs, missing test coverage, and frontend UI gaps needed for a usable MVP. Grading backend (Phase 5) and grading UI (F4) are deferred to Phase 7.
Progress (v4.0.0): 8 of 15 tasks complete (G1, G3, G4, G5, F1, F2, F3, F8). All Medium+ priority frontend and backend tasks done. G4 delivered 53 worker-controller tests, G5 delivered 59 lablet-controller tests. Remaining work is Low-priority polish tasks (F4-F7, F9).
9.1 Sub-Phase A: SSE & Backend Readiness¶
These tasks address broken or missing backend infrastructure required before frontend work.
| ID | Task | Priority | Status | Description |
|---|---|---|---|---|
| G1 | Fix SSE broken for all aggregates | P0 | ✅ | SSE event naming mismatch (hyphens→dots), missing event mappings, legacy SSEService, no initial snapshots. Fixed: 21 backend handlers renamed, 6 frontend event types added, legacy SSEService deleted, global SSE connect in app.js, lablet/definition snapshots added to EventsController. |
| G2 | CloudEvents external naming mismatch | Deferred | ⬜ | External CloudEvent type fields still use hyphens (e.g. lablet-session.status.changed). Not blocking SSE (internal dot notation works). Defer until external CloudEvent consumers exist. |
| G3 | Phase 4 staging validation | P1 | ✅ | Validated LDS integration against live pyLDS backend (12/12 checks passed). Findings: (1) LDS returns HTTP 201 for session creation (SPI client handles correctly via raise_for_status()), (2) Bug fix: archive_session() sent json=None causing HTTP 415 — changed to json={}, (3) Docker networking: created lds_deployments.docker.yaml with lds-backend:4000 base_url, added LDS_VERIFY_SSL + LDS_DEPLOYMENTS_CONFIG_PATH env vars to docker-compose.shared.yml. |
| G4 | Add missing Worker Controller tests | P2 | ✅ | 53 new tests in test_worker_reconciler_g4.py. Covers: all 9 status handlers (PENDING through TERMINATED), EC2 provisioning flow, CML readiness checks, metrics collection, scale-down evaluation (5 safety guards), drain completion, error recovery. Target was +20, delivered +53. |
| G5 | Add missing Lablet Controller tests | P2 | ✅ | 59 new tests in test_lablet_reconciler_g5.py. Covers: all 7 status handlers (SCHEDULED through PENDING_CLEANUP), LDS provisioning 7-step flow, device mapping, definition caching, session archival, reconcile router dispatch. Target was +15, delivered +59. |
| G6 | Worker metrics events disabled | Deferred | ⬜ | WorkerMetricsUpdatedDomainEvent SSE handlers exist but metrics collection jobs are not emitting events. Leave disabled until monitoring dashboard (F3/F9) is prioritized. |
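The G3 bug fix above is easy to reproduce in isolation. The sketch below uses `requests` to build (not send) the two variants; the actual SPI client library and endpoint path are assumptions. It shows why `json=None` triggered HTTP 415: the client treats it as "no body at all" and omits the JSON Content-Type header, while `json={}` sends an empty JSON object with the correct media type.

```python
import requests

# Compare the wire format of the broken and fixed archive calls
# (host and path are hypothetical, for illustration only).
url = "https://lds-backend:4000/sessions/123/archive"
broken = requests.Request("POST", url, json=None).prepare()
fixed = requests.Request("POST", url, json={}).prepare()

# json=None is indistinguishable from "no body": no JSON Content-Type is set,
# so a strict endpoint answers HTTP 415 Unsupported Media Type.
assert broken.body is None
assert "application/json" not in (broken.headers.get("Content-Type") or "")

# json={} produces an empty JSON object with the correct media type.
assert fixed.body in (b"{}", "{}")
assert fixed.headers["Content-Type"] == "application/json"
```

The same distinction applies to httpx-style clients: a `json` keyword left at its `None` default means "no JSON body", not "empty JSON body".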
9.2 Sub-Phase B: Frontend Implementation¶
These tasks address UI gaps identified in the control-plane-api frontend. The CPA frontend is the primary user interface; worker-controller, lablet-controller, and resource-scheduler UIs remain empty scaffolds.
Existing UI Pages (CPA)¶
| Page | Description | State |
|---|---|---|
| OverviewPage | Dashboard with aggregate metric cards, quick actions, recent activity | ✅ Functional |
| WorkersPage | Tabbed: Workers (card/table) + Templates (admin) | ✅ Functional |
| LabletsPage | Tabbed: Sessions (card/table) + Definitions (admin) | ✅ Functional |
| SystemPage | Tabbed: Monitoring (health, SSE) + Settings (admin) | ✅ Functional |
Frontend Gap Tasks¶
| ID | Task | Priority | Status | What Exists | What's Missing |
|---|---|---|---|---|---|
| F1 | Reservation Management UI | Medium | ✅ | ReservationsPage.js: stats cards, reservation lookup by external ID, active/all/timeline tabs, status filtering, search, SSE real-time updates, auto-refresh | Done: Dedicated Reservations page with full reservation lifecycle view, filtering, and search. |
| F2 | LDS Session Display | High | ✅ | Backend domain models have LDS session info (migrating to UserSession entity per ADR-021) | Fixed: DTO mappers now include LDS fields, SSE READY handler added with LDS data, LabletSessionCard has "Open Lab" button + session display. 9 new tests. |
| F3 | Capacity/Utilization Dashboard | Medium | ✅ | CapacityDashboard.js: fleet summary cards, resource allocation progress bars (CPU/Mem/Storage/Nodes), per-worker breakdown table with mini progress bars, color-coded utilization | Done: Cross-fleet capacity dashboard with aggregate metrics and per-worker breakdown. Historical trends deferred to Grafana integration. |
| F4 | Grading Results Display | Low | ⬜ | Grade API call + "Start Grading" button on LabletSessionCard, grading/graded status in badge mappings | No grading results display (score, checks, pass/fail), no GradingPanel. UI can trigger grading but cannot show outcomes. Blocked by Phase 5/7 grading backend. |
| F5 | Notification Center | Low | ⬜ | Toast notifications for SSE events (ephemeral) | No persistent notification center/inbox, no alert history, no configurable thresholds. Toasts sufficient for MVP. |
| F6 | User/RBAC Admin UI | Low | ⬜ | Frontend RBAC enforcement via `permissions.js`, role-based views | No user management page, no role assignment UI, no user profile dropdown. Keycloak admin console suffices for now. |
| F7 | Audit Log Viewer | Low | ⬜ | EventBus has `getEventHistory()` (client-side in-memory only) | No audit log viewer page, no server-side event history browser. Post-MVP feature. |
| F8 | Resource Scheduler UI | Medium | ✅ | SchedulerPage.js + api/scheduler.js: leader election status, scheduling stats, pending placements table, scheduling policy info, admin actions (trigger reconcile, resign leadership) | Done: Full scheduling dashboard with leader status, stats, pending placements, and admin controls. |
| F9 | Multi-Service Observability | Low | ⬜ | SystemPage health checks + SSE status. PrometheusClient + LcmGrafanaPanel scaffolded. | No unified observability across all 4 microservices. Prometheus/Grafana not connected. Worker-controller, lablet-controller, resource-scheduler UIs are empty scaffolds. |
9.3 Recommended Implementation Order¶
9.3.1 G1: Fix SSE pipeline ✅ DONE
9.3.2 F2: LDS Session Display ✅ DONE
9.3.3 G3: Phase 4 staging validation (validates F2 + LDS flow) ✅ DONE
9.3.4 F1: Reservation Management UI ✅ DONE
9.3.5 F3: Capacity Dashboard ✅ DONE
9.3.6 F8: Resource Scheduler UI ✅ DONE
9.3.7 G4: Worker Controller tests (+53) ✅ DONE
9.3.8 G5: Lablet Controller tests (+59) ✅ DONE
9.3.9 F5–F7, F9: Post-MVP polish (deferred)
9.3.10 F4: Grading Results Display (blocked by the Phase 5 grading backend, which follows Phase 7)
9.4 Acceptance Criteria¶
- [x] SSE events flow end-to-end for all aggregates (workers, lablet sessions, lablet definitions, worker templates)
- [x] SSE initial snapshots sent on connect (workers, lablet sessions, lablet definitions)
- [x] LDS session URL displayed on lablet session cards with "Open Lab" button
- [x] DTO mappers include LDS session info in API responses (migrating to UserSession per ADR-021)
- [ ] Reservation filtering available on LabletsPage
- [ ] Fleet-level capacity overview visible on OverviewPage or dedicated dashboard
- [ ] Worker Controller test coverage ≥80%
- [ ] Lablet Controller test coverage ≥80%
- [x] UI builds successfully (`make build-ui` exits 0). StateStore `registerSlice` + reducer-aware `dispatch` added to `lcm_ui` core (AD-9). Parcel cache gotcha documented.
10. Phase 7: Session Entity Model Migration¶
**Status:** ✅ Complete (2026-02-20)
**Goal:** Implement the LabletSession entity model (ADR-020), child entities (ADR-021), CloudEvent ingestion (ADR-022), and GradingSPI skeleton.
**Duration:** ~3–4 weeks
**Requirements:** FR-2.2.1, FR-2.2.2, FR-2.6.1, FR-2.6.2
**Dependencies:** Phase 6 (frontend stabilization) substantially complete
**ADRs:** ADR-020, ADR-021, ADR-022
**📋 Detailed Plan:** phase-7-session-migration.md (codebase audit + 12 sub-phases)
Phase 7 has been extracted to a dedicated execution document
Due to its scope (largest phase in the MVP), Phase 7 has its own detailed plan with:
- Codebase audit — actual entity/command/query/collection inventory across all services
- 12 sub-phases (7A–7L) with dependency graph, per-task estimates, and acceptance criteria
- Migration strategy — big-bang rename, hard etcd cutover, dead code cleanup
- Key decisions (AD-P7-01 through AD-P7-05) — see below
Migration Strategy Decisions¶
| ID | Decision |
|---|---|
| AD-P7-01 | CloudEvent webhook → CPA proxy (no CQRS/Mediator in lablet-controller) |
| AD-P7-02 | Big-bang rename, no backward compatibility |
| AD-P7-03 | Hard etcd cutover /lcm/instances/ → /lcm/sessions/, no dual-write |
| AD-P7-04 | Remove old API endpoints, accept broken frontend during migration |
| AD-P7-05 | Clean up ~2,000+ lines of dead code (Task sample, LabletControllerService, LabsRefreshService, CloudProvider) |
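The AD-P7-03 hard cutover amounts to a one-shot key migration. The sketch below assumes a python-etcd3-style client (`get_prefix`/`put`/`delete_prefix`); the helper names are illustrative, not the actual migration script.

```python
# One-shot cutover for AD-P7-03: copy every key under the old prefix to the
# new prefix, then delete the old tree. Callers must be stopped while this
# runs (big-bang migration, no dual-write, per AD-P7-02/03).
OLD_PREFIX = "/lcm/instances/"
NEW_PREFIX = "/lcm/sessions/"

def rewrite_key(key: str) -> str:
    """Map an old-prefix key to its new-prefix equivalent."""
    if not key.startswith(OLD_PREFIX):
        raise ValueError(f"key outside old prefix: {key}")
    return NEW_PREFIX + key[len(OLD_PREFIX):]

def cutover(client) -> int:
    """Copy all keys old → new, drop the old tree, return the count moved."""
    moved = 0
    for value, meta in client.get_prefix(OLD_PREFIX):
        client.put(rewrite_key(meta.key.decode()), value)
        moved += 1
    client.delete_prefix(OLD_PREFIX)
    return moved
```

Running a verification pass (count and spot-check keys under the new prefix) before deleting the old tree would align with the "data migration corruption" mitigation in the risk register.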
Sub-Phase Overview¶
| Sub-Phase | Scope | Service(s) | Estimate |
|---|---|---|---|
| 7A | lcm-core shared layer renames (enums, read models) | lcm-core | 1 day |
| 7B | Dead code cleanup (~2,000+ lines) | CPA, LC | 1 day |
| 7C | CPA domain layer (LabletSession + 3 child entities) | CPA | 3 days |
| 7D | CPA application layer (commands + queries rewrite) | CPA | 3 days |
| 7E | CPA integration layer (4 new MongoDB repos + etcd) | CPA | 2 days |
| 7F | CPA API layer (controllers + internal endpoints) | CPA | 2 days |
| 7G | lcm-core ControlPlaneApiClient update | lcm-core | 1 day |
| 7H | Controller service updates (reconciler + scheduler) | LC, RS | 3 days |
| 7I | CloudEvent webhook endpoint | LC | 2 days |
| 7J | Frontend updates (API paths + SSE events) | CPA UI | 1.5 days |
| 7K | Cross-service tests & verification | all | 2 days |
| 7L | Documentation updates | docs | 1 day |
Acceptance Criteria (Summary)¶
- [x] Zero `LabletInstance`/`lablet_instance`/`lablet_lab_binding`/`lablet_record_run` references in Python source (docstring exceptions only)
- [x] MongoDB: `lablet_sessions`, `user_sessions`, `grading_sessions`, `score_reports` operational
- [x] MongoDB: `lablet_instances`, `lablet_lab_bindings`, `lablet_record_runs`, `tasks` dropped
- [ ] CloudEvent webhook handles 4 event types via CPA proxy — DEFERRED (AD-P7-06)
- [x] `make lint` + `make test` pass for all services (pre-existing failures only)
- [x] `make build-ui` passes
- [x] All services start and communicate
Full task breakdown, entity schemas, API mappings, and verification checklist: → phase-7-session-migration.md
11. Post-Implementation: Status Document Update¶
**Goal:** Update IMPLEMENTATION_STATUS.md to reflect actual implementation state.
**Duration:** ~2 days
**Trigger:** After Phase 7 completion
Per-Phase Updates
While this section describes a final comprehensive audit, each phase also requires incremental updates to the status document per §12: Mandatory Documentation Maintenance.
11.1 Tasks¶
| ID | Task | Estimate |
|---|---|---|
| PS-1 | Audit all requirement IDs against implementation | 4h |
| PS-2 | Update status matrix with accurate ✅/🔶/⬜ | 2h |
| PS-3 | Update progress percentages | 1h |
| PS-4 | Add test coverage summary | 2h |
| PS-5 | Review with team | 2h |
12. Mandatory: Documentation Maintenance Per Phase¶
Required for Every Phase
Each implementation phase MUST include a documentation maintenance task as its final step. This is not optional — the implementation is not complete until documentation reflects reality.
Per-Phase Documentation Checklist¶
Every phase completion MUST include:
| Task | File | Action |
|---|---|---|
| Update plan status | `docs/implementation/mvp-implementation-plan.md` | Mark phase as ✅ COMPLETE, check acceptance criteria, add completion notes |
| Update status matrix | `docs/implementation/IMPLEMENTATION_STATUS.md` | Update all affected FR rows, bump progress bars, add completion date |
| Bump document versions | Both files above | Increment version, update Last Updated date |
| Store knowledge | Knowledge Manager | `store_decision`, `store_insight`, `update_task` for all significant changes |
| Create next bootstrap | `docs/implementation/PHASE_N+1_BOOTSTRAP.md` | Prepare bootstrap prompt for next phase |
| Update mkdocs.yml | `mkdocs.yml` | Register any new documentation files in navigation |
Bootstrap Prompt Requirement¶
Every PHASE_N_BOOTSTRAP.md document MUST include a final task:
### P{N}-FINAL: Update Implementation Documentation

- Update `docs/implementation/mvp-implementation-plan.md`:
    - Mark Phase {N} as ✅ COMPLETE with date
    - Check all acceptance criteria
    - Add completion notes (key artifacts, test count, breaking changes)
- Update `docs/implementation/IMPLEMENTATION_STATUS.md`:
    - Update all affected FR rows to reflect actual state
    - Update progress bars
    - Bump document version
- Store completion knowledge via Knowledge Manager
- Create `PHASE_{N+1}_BOOTSTRAP.md` for next phase
13. Risk Register¶
| Risk | Impact | Probability | Mitigation |
|---|---|---|---|
| LDS API unavailable | High | Low | Mock client for development |
| Grading Engine API changes | Medium | Low | Version-pin API contract |
| EC2 provisioning failures | Medium | Medium | Retry logic, fallback regions |
| etcd leader election issues | High | Low | Use proven libraries, HA setup |
| CML console collection timeouts | Medium | Medium | Configurable timeout, partial collection |
| CloudEvent delivery failures | Medium | Low | Idempotent handlers, dead letter queue |
| Data migration corruption | High | Low | Verification script, backup before migration, parallel-run old/new |
| API path breaking changes | High | Medium | Versioned API (/api/v1/sessions/), deprecation period for old paths |
| Multi-collection consistency | Medium | Medium | Transactional updates where possible, compensating actions on failure |
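The "idempotent handlers" mitigation for CloudEvent delivery failures relies on the CloudEvents rule that the (source, id) pair uniquely identifies an event, so a consumer can safely drop redeliveries. A minimal sketch follows; the in-memory set and handler name are illustrative, and a real consumer would persist processed IDs.

```python
# Deduplicate CloudEvent deliveries using the spec-mandated unique
# (source, id) pair. The in-memory set is illustrative only; a production
# consumer would persist processed IDs and pair this with a dead letter
# queue for events that repeatedly fail.
processed: set[tuple[str, str]] = set()

def handle_cloudevent(event: dict) -> bool:
    """Apply an event's effect exactly once; redeliveries become no-ops."""
    key = (event["source"], event["id"])
    if key in processed:
        return False  # duplicate delivery, e.g. a retry after a timeout
    # ... dispatch on event["type"] and apply the state transition here ...
    processed.add(key)
    return True
```

Because the handler returns success for duplicates without reapplying the transition, the sender can retry freely after timeouts without risking double state changes.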
14. Success Metrics¶
Phase Completion Criteria¶
| Phase | Key Metric | Target |
|---|---|---|
| Phase 0 | Domain model tests pass | 100% |
| Phase 1 | Capacity tracking accurate | Verified in integration tests |
| Phase 2 | Scheduling respects capacity | No over-allocation |
| Phase 3 | Auto-scaling functional | Workers provision/deprovision |
| Phase 4 | LDS integration working | Sessions created in staging |
| Phase 6 | SSE & Frontend usable | MVP UI functional |
| Phase 7 | Session entity model migrated | ✅ All 4 collections operational, 7I (CloudEventIngestor) deferred to future phase |
| Phase 5 | Grading flow complete | Scores stored correctly |
MVP Readiness Checklist¶
- [ ] All phases complete
- [ ] Integration tests pass
- [ ] Staging environment validated
- [ ] IMPLEMENTATION_STATUS.md updated
- [ ] Runbooks created for operations
- [ ] Monitoring dashboards configured
- [ ] Documentation current
15. Revision History¶
| Version | Date | Author | Changes |
|---|---|---|---|
| 4.2.0 | 2026-02-20 | LCM Architecture Team | Phase 7 COMPLETE. Updated §10 status to ✅ Complete, acceptance criteria checked (6/7 — CloudEvent webhook deferred per AD-P7-06). Timeline progress bar updated to 100%. Phase Dependencies diagram updated (Phase 7 ✅). MVP Scope table updated. Deferred table: Grading Integration unblocked. Success Metrics: Phase 7 row updated. |
| 4.1.0 | 2026-02-18 | LCM Architecture Team | Phase 7 extraction: Extracted detailed Phase 7 plan into dedicated phase-7-session-migration.md with codebase audit (entity/command/query/collection inventory across all services), 12 sub-phases (7A–7L) refined from audit findings, and 5 migration strategy decisions (AD-P7-01 through AD-P7-05). Master plan §10 now contains compact summary + link. Scope expanded from 9 to 12 sub-phases based on audit (added dead code cleanup, separate lcm-core client update, and split test/verification sub-phase). |
| 4.0.0 | 2026-02-18 | LCM Architecture Team | Major update: Aligned entire plan with ADR-020 (LabletInstance → LabletSession), ADR-021 (UserSession/GradingSession/ScoreReport child entities), ADR-022 (CloudEvent ingestion via lablet-controller). Added Phase 7: Session Entity Model Migration (§10) with 9 sub-phases covering domain, application, integration, API, CloudEventIngestor, GradingSPI skeleton, data migration, frontend, and tests. Resequenced Phase 5 (Grading Integration) to depend on Phase 7. Updated Phase Dependencies diagram. Fixed Phase 4 CloudEvent handler spec (§7.3) from control-plane-api to lablet-controller. Updated all phases (0–6) with LabletSession terminology. Added 3 new risks for migration. Updated Success Metrics with Phase 7. |
| 3.0.0 | 2026-02-10 | LCM Architecture Team | Phase 6 at ~85%: G4 (53 worker-controller tests) and G5 (59 lablet-controller tests) complete. Phase 4 marked ✅ COMPLETE (staging validated via G3). Worker controller service type errors fixed (aligned SPI method calls). 8/15 Phase 6 tasks done. Remaining: low-priority polish (F4-F7, F9). |
| 2.9.0 | 2026-02-10 | LCM Architecture Team | G3 Phase 4 staging validation complete: 12/12 live LDS checks passed. Bug fix: archive_session() json=None→json={} (HTTP 415). Docker config: created lds_deployments.docker.yaml (lds-backend:4000), added LDS env vars to docker-compose.shared.yml, added lds-backend dependency. Validation script: scripts/validate_lds_integration.py. Test assertion strengthened for json={} contract. |
| 2.8.0 | 2026-02-09 | LCM Architecture Team | Phase 6 progress: F1 (ReservationsPage ~570 lines), F3 (CapacityDashboard ~320 lines), F8 (SchedulerPage ~430 lines + api/scheduler.js) completed. Navbar converted to dropdowns for Lablets/Workers tabs (AD-16). Section containers added to index.jinja. Phase 6 at ~50% (G1+F1+F2+F3+F8 done). Remaining: G3 staging validation, G4/G5 controller tests, low-priority polish (F4-F7, F9). |
| 2.7.0 | 2026-02-09 | LCM Architecture Team | Phase 6 progress: F2 (LDS Session Display) completed — DTO mappers, SSE READY handler, LabletSessionCard "Open Lab" button, 9 new tests. StateStore registerSlice + reducer-aware dispatch added to lcm_ui core (AD-9) — frontend was blocked by missing slice registration API. Parcel cache gotcha documented; CPA make clean now clears ui/.parcel-cache. Phase 6 at ~20% (G1+F2 done). |
| 2.6.0 | 2026-02-08 | LCM Architecture Team | Added Phase 6: SSE & Frontend Readiness (§9). Gap analysis identified 6 backend gaps (G1–G6) and 9 frontend gaps (F1–F9). G1 (SSE pipeline fix) completed: 21 backend handler renames, 6 frontend event types, legacy SSEService deleted, global SSE connect, initial snapshots. Phase 5 (Grading) deferred to Phase 7 (post-frontend). Renumbered §9–§14→§10–§15. Updated timeline. |
| 2.5.0 | 2026-02-09 | LCM Architecture Team | Phase 4 LDS Integration ~90% complete: LDS SPI client (636 lines, multi-region, YAML config), MarkInstanceReadyCommand (atomic INSTANTIATING→READY, AD-P4-01), HandleSessionStartedCommand (READY→RUNNING), _provision_lds_session() 7-step reconciler flow,_archive_lds_session(), internal API endpoints (mark-ready, session-started), CPA client methods, 57 new tests (44 LDS SPI + 5 mark-ready + 8 session-started). CommandHandlerBase pattern adopted for all new handlers. Remaining: staging validation. |
| 2.4.0 | 2026-02-08 | LCM Architecture Team | Phase 3 marked complete: Scale-up (RequestScaleUpCommand, template selection, EC2 provisioning), Scale-down (5 safety guards, DrainWorkerCommand, idle detection), WorkerTemplateService, scaling constraints, OTel audit metrics, 44 new tests across 3 services. Discovery state sync bug fixed (AD-21). |
| 2.3.0 | 2026-02-10 | LCM Architecture Team | Phase 2 marked complete: etcd real-time capacity in PlacementEngine, retry escalation, OTel scheduling metrics, rejection tracking, 41 new tests. All acceptance criteria met. |
| 2.2.0 | 2026-02-09 | LCM Architecture Team | Phase 1 marked complete: capacity commands, schedule/terminate integration, etcd publishing. PlacementEngine enhancement and integration tests deferred to Phase 2. |
| 2.1.0 | 2026-02-08 | LCM Architecture Team | Phase 0 marked complete, added mandatory doc maintenance section (§10), added status lines per phase, bumped section numbering |
| 2.0.0 | 2026-02-08 | LCM Architecture Team | Complete rewrite with foundation-first approach |
| 1.0.0 | 2026-02-08 | LCM Architecture Team | Initial plan (flawed - assumed worker foundation complete) |