
Lablet Controller Architecture

Version: 1.2.0 (February 2026)
Status: Current Implementation

Related Documentation

For scheduling and placement details, see the Resource Scheduler Architecture.


Revision History

| Version | Date | Changes |
|---------|------|---------|
| 1.2.0 | 2026-02 | Enhanced port allocation (dynamic registry, tag rewriting), LabletDefinition content refresh, LDS deployment selection |
| 1.1.0 | 2026-02 | Added LDS integration (LabDeliverySPI), READY state, etcd watch pattern |
| 1.0.0 | 2026-01 | Initial architecture documentation |

1. Overview

The Lablet Controller is responsible for LabletInstance reconciliation: it manages the workload lifecycle by reconciling desired instance state (spec) against actual CML lab state, AND it provisions the corresponding LabSession in LDS (the Lab Delivery System).

A LabletInstance is a composite entity consisting of:

  • CML Lab: The network topology running on a CML Worker
  • LabSession: The user-facing interface in LDS for interacting with lab devices and viewing tasks

It operates at the Application Layer, talking to:

  • CML Labs SPI - Lab lifecycle (create, start, stop, wipe, delete)
  • LabDelivery SPI - Session lifecycle (create, set devices, archive)

Application Layer Separation

The Lablet Controller manages workloads (CML labs + LDS sessions). It does NOT manage EC2 instances or infrastructure - that is the Worker Controller's responsibility.

2. Reconciliation Pattern

The Lablet Controller follows the Kubernetes Controller Pattern:

┌─────────────────────────────────────────────────────────────────────────────┐
│                LABLET CONTROLLER - RECONCILIATION PATTERN                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   ┌──────────────────┐     ┌──────────────────┐     ┌──────────────────┐   │
│   │       SPEC       │     │     OBSERVE      │     │       ACT        │   │
│   │   (Desired)      │     │    (Actual)      │     │   (Reconcile)    │   │
│   └────────┬─────────┘     └────────┬─────────┘     └────────┬─────────┘   │
│            │                        │                        │              │
│            ▼                        ▼                        ▼              │
│   ┌──────────────────┐     ┌──────────────────┐     ┌──────────────────┐   │
│   │ LabletInstance   │     │ CML Lab State    │     │ • Import lab     │   │
│   │ • state=RUNNING  │     │ • state=DEFINED  │     │ • Start nodes    │   │
│   │ • worker_id=W1   │ ←→  │ • nodes stopped  │  →  │ • Allocate ports │   │
│   │ • ports={...}    │     │ • no ports       │     │ • Update state   │   │
│   └──────────────────┘     └──────────────────┘     └──────────────────┘   │
│                                                                              │
│   Source: MongoDB         Source: CML Labs API       Target: Both          │
│   (via Control Plane)     (direct observation)       (via Control Plane)   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

3. Domain Separation

| Service | Abstraction Layer | SPI (Service Provider Interface) |
|---------|-------------------|----------------------------------|
| Lablet Controller | Application (Workload) | CML Labs SPI (Labs, Nodes, Interfaces, Links API) |
| Lablet Controller | Application (UX) | LabDelivery SPI (Sessions, Devices, Content) |
| Worker Controller | Infrastructure (Compute) | Cloud Provider SPI (EC2, CloudWatch, CML System API) |

Both controllers follow the same reconciliation pattern but at different abstraction layers.

4. Core Responsibilities

flowchart TD
    subgraph Input [Desired State - MongoDB]
        INSTANCES[LabletInstance Specs]
    end

    subgraph LabletController [Lablet Controller]
        LEADER[Leader Election<br/>etcd lease]
        LOOP[Reconciliation Loop]

        subgraph Observe [Observe - CML Labs SPI]
            LABS[Labs API]
            NODES[Nodes API]
            IFACE[Interfaces API]
        end

        subgraph Act [Reconcile Actions]
            IMPORT[Import Topology]
            START[Start Lab]
            STOP[Stop Lab]
            WIPE[Wipe Lab]
            PORTS[Allocate Ports]
            CONFIG[Extract Configs]
        end
    end

    subgraph Output [State Updates]
        STATE[Instance State]
        PORTMAP[Port Mappings]
        NODECONFIG[Node Configurations]
    end

    INSTANCES --> LOOP
    LOOP --> LEADER
    LEADER --> Observe
    Observe --> Act
    Act --> STATE
    Act --> PORTMAP
    Act --> NODECONFIG

5. Reconciliation Examples

| Desired (Spec) | Actual (Observed) | Action |
|----------------|-------------------|--------|
| state=INSTANTIATING | Lab not imported | Import topology, allocate ports |
| state=INSTANTIATING | Lab imported, not started | Start lab, provision LDS session |
| state=INSTANTIATING | Lab started, LDS ready | Transition to READY |
| state=READY | Awaiting user login | No action (event-driven → RUNNING) |
| state=RUNNING | Lab state=STARTED | No action (converged) |
| state=COLLECTING | Collection not started | Extract configs, collect artifacts |
| state=COLLECTING | Collection complete | Transition to GRADING |
| state=STOPPED | Lab state=STARTED | Stop lab nodes |
| state=TERMINATED | Lab exists | Archive LDS session, wipe and delete lab |
| Any | Lab error state | Attempt recovery or mark FAILED |

READY State

The READY → RUNNING transition is event-driven, not performed by the reconciliation loop. LDS emits a session.started CloudEvent when the user logs in; control-plane-api handles it.
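The reconciliation examples in the table amount to a decision function. A minimal sketch is shown below; the action and condition names are hypothetical, not taken from the codebase:

```python
# Sketch of the Section 5 decision table (illustrative names only).
def decide_actions(desired: str, observed: str) -> list[str]:
    """Map (desired instance state, observed lab condition) to reconcile actions."""
    table = {
        ("INSTANTIATING", "lab_not_imported"): ["import_topology", "allocate_ports"],
        ("INSTANTIATING", "imported_not_started"): ["start_lab", "provision_lds_session"],
        ("INSTANTIATING", "started_lds_ready"): ["transition_to_READY"],
        ("READY", "awaiting_user_login"): [],   # event-driven -> RUNNING
        ("RUNNING", "started"): [],             # converged, nothing to do
        ("COLLECTING", "collection_not_started"): ["extract_configs", "collect_artifacts"],
        ("COLLECTING", "collection_complete"): ["transition_to_GRADING"],
        ("STOPPED", "started"): ["stop_lab_nodes"],
        ("TERMINATED", "lab_exists"): ["archive_lds_session", "wipe_lab", "delete_lab"],
    }
    # Any unrecognized combination falls through to error handling.
    return table.get((desired, observed), ["attempt_recovery_or_mark_FAILED"])
```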

6. Layer Architecture

No CQRS Pattern

Lablet-controller uses Reconciliation Loops via HostedServices, NOT CQRS commands/queries. CQRS is implemented only in control-plane-api. Controllers interact with Control Plane API via REST.

lablet-controller/
├── api/                          # HTTP Layer (minimal - health/admin only)
│   └── controllers/
│       ├── health_controller.py  # /health, /ready, /info
│       ├── admin_controller.py   # /admin/trigger-reconcile, /admin/stats
│       └── labs_controller.py    # /labs/{id}/download (BFF for DownloadLabCommand)
├── application/                  # Business Logic Layer
│   ├── hosted_services/          # Reconciliation loops (NOT commands!)
│   │   ├── lablet_reconciler.py  # LeaderElectedHostedService
│   │   └── labs_refresh_service.py  # HostedService
│   ├── services/
│   │   ├── port_allocation_service.py  # Port allocation for instances
│   │   └── lab_observer.py
│   ├── dtos/
│   │   └── reconciliation_result.py
│   └── settings.py
├── integration/                  # SPI Implementations
│   └── services/
│       ├── cml_labs_spi.py       # CMLLabsSPI implementation
│       └── cml_nodes_spi.py      # Node configuration extraction
└── main.py                       # Neuroglia WebApplicationBuilder

CML API Restriction

Lablet-controller uses CML Labs API ONLY (labs, nodes, interfaces, links). It MUST NOT import or call CML System API. System operations are worker-controller's responsibility.

Port Allocation Responsibility

Port allocation for LabletInstances is performed by lablet-controller, not control-plane-api. The controller queries Control Plane API for worker/instance information, allocates ports, and reports allocations back via the internal API.

7. CML Labs SPI

Labs API

class CmlLabsClient:
    """
    CML Lab lifecycle management.

    Base endpoint: /api/v0/labs
    """

    async def list_labs(
        self,
        worker_ip: str,
        auth_token: str
    ) -> list[CmlLabInfo]:
        """List all labs on worker."""

    async def get_lab(
        self,
        worker_ip: str,
        lab_id: str,
        auth_token: str
    ) -> CmlLabDetail:
        """Get detailed lab information."""

    async def import_lab(
        self,
        worker_ip: str,
        topology_yaml: str,
        auth_token: str
    ) -> str:
        """Import YAML topology, return lab_id."""

    async def start_lab(
        self,
        worker_ip: str,
        lab_id: str,
        auth_token: str
    ) -> None:
        """Start all nodes in lab."""

    async def stop_lab(
        self,
        worker_ip: str,
        lab_id: str,
        auth_token: str
    ) -> None:
        """Stop all nodes in lab."""

    async def wipe_lab(
        self,
        worker_ip: str,
        lab_id: str,
        auth_token: str
    ) -> None:
        """Wipe lab state (reset to initial)."""

    async def delete_lab(
        self,
        worker_ip: str,
        lab_id: str,
        auth_token: str
    ) -> None:
        """Delete lab permanently."""

Nodes API

class CmlNodesClient:
    """
    CML Node management.

    Base endpoint: /api/v0/labs/{lab_id}/nodes
    """

    async def list_nodes(
        self,
        worker_ip: str,
        lab_id: str,
        auth_token: str
    ) -> list[CmlNodeInfo]:
        """List all nodes in lab."""

    async def get_node_state(
        self,
        worker_ip: str,
        lab_id: str,
        node_id: str,
        auth_token: str
    ) -> CmlNodeState:
        """Get node state (BOOTED, STOPPED, etc)."""

    async def extract_config(
        self,
        worker_ip: str,
        lab_id: str,
        node_id: str,
        auth_token: str
    ) -> str:
        """Extract running configuration from node."""

Interfaces API

class CmlInterfacesClient:
    """
    CML Interface and console port management.

    Base endpoint: /api/v0/labs/{lab_id}/nodes/{node_id}
    """

    async def get_console_ports(
        self,
        worker_ip: str,
        lab_id: str,
        auth_token: str
    ) -> dict[str, int]:
        """Get console port mappings for all nodes."""

    async def get_vnc_url(
        self,
        worker_ip: str,
        lab_id: str,
        node_id: str,
        auth_token: str
    ) -> str:
        """Get VNC access URL for graphical nodes."""

8. Lab State Machine

CML labs have a complex state machine:

stateDiagram-v2
    [*] --> DEFINED_ON_CORE: Import YAML

    DEFINED_ON_CORE --> STARTED: Start lab
    DEFINED_ON_CORE --> QUEUED: Start lab (queued)

    QUEUED --> STARTED: Resources available

    STARTED --> STOPPED: Stop lab

    STOPPED --> STARTED: Start lab
    STOPPED --> DEFINED_ON_CORE: Wipe lab

    STARTED --> DEFINED_ON_CORE: Wipe lab

    DEFINED_ON_CORE --> [*]: Delete lab
    STOPPED --> [*]: Delete lab

State Mapping

| CML Lab State | LabletInstance State | Description |
|---------------|----------------------|-------------|
| DEFINED_ON_CORE | INSTANTIATING | Lab imported, nodes not started |
| QUEUED | INSTANTIATING | Lab start queued (waiting for resources) |
| STARTED | RUNNING | All nodes booted |
| STOPPED | STOPPED | Nodes stopped, state preserved |
| (deleted) | TERMINATED | Lab removed from CML |

9. Dynamic Port Allocation

The Lablet Controller manages a private port registry per CML worker to enable multiple lablet-instances on a single worker. Ports defined in the LabletDefinition's cml.yaml are dynamically rewritten before the lab is created in CML.

Why Dynamic Port Allocation?

A LabletDefinition contains a static cml.yaml with hardcoded port numbers in device tags. Without dynamic allocation, only ONE instance of that definition could run per worker. By rewriting ports at instantiation time, we enable concurrent instances on the same worker.

9.1 Port Registry Architecture

flowchart TD
    subgraph LabletController [Lablet Controller]
        PR[Port Registry<br/>per Worker]
        REWRITE[Tag Rewriter]
    end

    subgraph ControlPlaneAPI [Control Plane API]
        WORKER_STATE[CML Worker State<br/>allocated_ports: dict]
    end

    subgraph Instance1 [Instance A - Definition netlab]
        YAML1[cml.yaml<br/>tags: serial:10001, vnc:10002]
    end

    subgraph Instance2 [Instance B - Definition netlab]
        YAML2[cml.yaml<br/>tags: serial:10003, vnc:10004]
    end

    PR -->|allocate| Instance1
    PR -->|allocate| Instance2
    REWRITE -->|rewrite tags| YAML1
    REWRITE -->|rewrite tags| YAML2
    PR -->|persist| WORKER_STATE

9.2 Port Tags in cml.yaml

Device nodes in cml.yaml define required ports via tags:

# Example device node with port tags
- id: n5
  label: workstation
  node_definition: ubuntu-desktop-24-04-v2
  image_definition: lablet-desktop-v0.1.1
  tags:
    - serial:5065      # Serial console on port 5065
    - vnc:5066         # VNC access on port 5066
    - pat:5067:22      # PAT: external 5067 → internal 22 (SSH)
  interfaces:
    - id: i0
      label: ens3
      type: physical

Tag Format:

| Tag Pattern | Description | Example |
|-------------|-------------|---------|
| serial:<port> | Serial console access | serial:5065 |
| vnc:<port> | VNC graphical access | vnc:5066 |
| pat:<ext>:<int> | Port Address Translation | pat:5067:22 |
| http:<port> | HTTP/HTTPS web access | http:8080 |

9.3 Dynamic Rewriting Flow

sequenceDiagram
    participant LC as Lablet Controller
    participant PR as Port Registry
    participant CPA as Control Plane API
    participant CML as CML Worker

    Note over LC: Instance enters INSTANTIATING

    LC->>LC: Parse cml.yaml from LabletDefinition
    LC->>LC: Extract port requirements from tags

    LC->>PR: Request port allocation (worker_id, port_count)
    PR->>CPA: GET /workers/{id} (current allocations)
    CPA-->>PR: allocated_ports: {10001-10010: instance-A}

    PR->>PR: Find next available range
    PR-->>LC: Allocated: [10011, 10012, 10013]

    LC->>LC: Rewrite cml.yaml tags with allocated ports
    Note over LC: serial:5065 → serial:10011<br/>vnc:5066 → vnc:10012<br/>pat:5067:22 → pat:10013:22

    LC->>CML: Import rewritten cml.yaml
    CML-->>LC: lab_id

    LC->>CPA: PATCH /workers/{id} (update allocations)
    LC->>CPA: PATCH /instances/{id} (store port_mappings)
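The rewrite step itself can be sketched as a pure transformation over a node's tag list. This is illustrative only; the actual controller rewrites the full cml.yaml:

```python
def rewrite_tags(tags: list[str], port_map: dict[int, int]) -> list[str]:
    """Replace original external ports in tags with allocated ports.

    port_map: {original_port: allocated_port}, e.g. {5065: 10011}.
    Non-port tags (e.g. 'hidden') pass through unchanged, as does the
    internal port in pat:<ext>:<int> tags.
    """
    rewritten = []
    for tag in tags:
        parts = tag.split(":")
        if parts[0] in ("serial", "vnc", "http", "pat") and len(parts) >= 2:
            original = int(parts[1])
            parts[1] = str(port_map.get(original, original))
        rewritten.append(":".join(parts))
    return rewritten
```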

9.4 Lab Naming Convention

CML lab names must be unique per worker. The controller generates names using:

{definition_name}-{definition_id}-{session_id}

Example:

netlab-fundamentals-def123-sess456

This ensures:

  • Multiple instances of the same definition can coexist
  • Labs are traceable back to their definition and session
  • No naming collisions even with concurrent instantiation
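A minimal generator for this convention (a sketch):

```python
def lab_name(definition_name: str, definition_id: str, session_id: str) -> str:
    """Build the CML lab name, unique per worker.

    The session_id component guarantees uniqueness, so concurrent
    instances of the same definition never collide.
    """
    return f"{definition_name}-{definition_id}-{session_id}"
```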

9.5 Port Registry Configuration

| Variable | Description | Default |
|----------|-------------|---------|
| PORT_RANGE_START | Start of allocatable port range | 10000 |
| PORT_RANGE_END | End of allocatable port range | 20000 |
| PORTS_PER_INSTANCE_MAX | Maximum ports per instance | 50 |

Port Range Must Match CML Configuration

The port range configured in LCM must match the CML system's external port range. CML exposes these ports externally; LCM only tracks allocation.
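Within that range, allocation reduces to finding the next contiguous block of free ports on a worker. A sketch, assuming a simple in-memory view of the worker's current allocations:

```python
def allocate_ports(
    allocated: set[int],
    count: int,
    range_start: int = 10000,
    range_end: int = 20000,
) -> list[int]:
    """Find the next `count` contiguous free ports in the configured range."""
    run: list[int] = []
    for port in range(range_start, range_end + 1):
        if port in allocated:
            run = []  # an allocated port breaks the contiguous run
            continue
        run.append(port)
        if len(run) == count:
            return run
    raise RuntimeError("port range exhausted on worker")
```

With ports 10001-10010 already held by another instance (as in the sequence diagram above), the next three-port allocation lands on 10011-10013.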

9.6 Port Allocation Data Model

class PortAllocation:
    """Tracks allocated ports for a lablet instance."""
    instance_id: str
    worker_id: str
    allocated_ports: list[int]           # [10011, 10012, 10013]
    port_mappings: dict[str, PortInfo]   # {"workstation": PortInfo(...)}
    allocated_at: datetime
    released_at: datetime | None


class PortInfo:
    """Port mapping for a single device."""
    device_label: str      # "workstation"
    original_port: int     # 5065 (from definition)
    allocated_port: int    # 10011 (dynamically assigned)
    protocol: str          # "serial", "vnc", "pat", "http"
    internal_port: int | None  # For PAT: 22

9.7 Port Release on Termination

When an instance reaches TERMINATED, ports are released:

async def release_ports(self, instance_id: str, worker_id: str) -> None:
    """
    Release allocated ports back to the pool.

    Called during TERMINATED reconciliation after lab deletion.
    """
    allocation = await self._get_allocation(instance_id)
    await self._port_registry.release(worker_id, allocation.allocated_ports)
    await self._control_plane.update_worker_allocations(worker_id)

9A. LabletDefinition Content Refresh

LabletDefinition content (cml.yaml, device specs) is refreshed on demand by admin action, NOT automatically. This ensures content changes don't impact production unexpectedly.

Separation of Concerns

Content refresh is a separate process from lablet instance instantiation. Instantiation reads from MongoDB only - no runtime dependency on MinIO/S3.

9A.1 Content Storage Strategy

flowchart LR
    subgraph External [External Storage]
        S3[(MinIO/S3<br/>Content Packages)]
    end

    subgraph ControlPlane [Control Plane API]
        REFRESH[Refresh Content<br/>Command]
        DEF[(LabletDefinition<br/>Aggregate)]
    end

    subgraph Runtime [Runtime - No S3 Dependency]
        LC[Lablet Controller]
        INSTANCE[LabletInstance]
    end

    S3 -->|admin triggers refresh| REFRESH
    REFRESH -->|store in MongoDB| DEF
    DEF -->|read at instantiation| LC
    LC -->|create| INSTANCE

Design Rationale:

| Approach | Pros | Cons |
|----------|------|------|
| Store in MongoDB (chosen) | No runtime S3 dependency, fast instantiation, atomic with definition | Larger documents |
| Store in S3, pull per-instance | Smaller MongoDB documents | S3 availability impacts instantiation, slower |

9A.2 Content Package Structure

A LabletDefinition stores the following content (downloaded from S3 on refresh):

class LabletDefinitionContent:
    """Content payload stored in LabletDefinition aggregate."""

    cml_yaml: str                    # Full CML topology YAML
    device_definitions: list[DeviceDefinition]  # Parsed device specs
    grade_xml: str | None            # Grading criteria (optional)
    pod_xml: str | None              # Pod configuration (optional, TBD)
    content_version: str             # Version from S3 metadata
    form_qualified_name: str         # S3 path identifier
    refreshed_at: datetime           # Last refresh timestamp
    refreshed_by: str                # Admin user who triggered refresh


class DeviceDefinition:
    """Device extracted from cml.yaml for LDS mapping."""

    label: str                       # "workstation", "R1", "SW1"
    node_definition: str             # "ubuntu-desktop-24-04-v2"
    image_definition: str            # "lablet-desktop-v0.1.1"
    port_tags: list[PortTag]         # Parsed from tags array
    is_user_visible: bool            # Whether LDS should expose this device
    access_credentials: Credentials | None  # Default username/password

9A.3 Refresh Trigger Flow

Content refresh is triggered via Control Plane API by an admin:

sequenceDiagram
    participant Admin
    participant CPA as Control Plane API
    participant S3 as MinIO/S3
    participant MongoDB

    Admin->>CPA: POST /lablet-definitions/{id}/refresh
    CPA->>S3: GET content package (form_qualified_name)
    S3-->>CPA: content.zip (cml.yaml, grade.xml, etc.)

    CPA->>CPA: Parse and validate content
    CPA->>CPA: Extract device definitions from cml.yaml
    CPA->>CPA: Parse port tags per device

    CPA->>MongoDB: Update LabletDefinition.content
    CPA->>MongoDB: Increment LabletDefinition.version

    CPA-->>Admin: 200 OK {version: "1.2.0", devices: 5}

    Note over CPA: Optionally notify LDS to refresh content cache
    CPA->>CPA: Publish LabletDefinitionRefreshedEvent

No Automatic Refresh

Content refresh is never automatic. Content in S3 may be partially edited or in draft state. Only explicit admin action triggers refresh to production.

9A.4 Device Visibility for LDS

Not all CML nodes are user-visible in LDS. The controller determines visibility:

def determine_device_visibility(node: CmlNode) -> bool:
    """
    Determine if a CML node should be exposed to LDS.

    Rules:
    1. Nodes with 'hidden' tag are NOT visible
    2. Nodes without port tags are NOT visible (no access method)
    3. Infrastructure nodes (e.g., 'external_connector') are NOT visible
    4. All others ARE visible
    """
    if "hidden" in node.tags:
        return False
    if not any(is_port_tag(t) for t in node.tags):
        return False
    if node.node_definition in INFRASTRUCTURE_NODE_TYPES:
        return False
    return True

Example:

| CML Node | Tags | LDS Visible? | Reason |
|----------|------|--------------|--------|
| workstation | serial:5065, vnc:5066 | ✅ Yes | Has port tags |
| R1 | serial:5070 | ✅ Yes | Has port tag |
| external_connector | (none) | ❌ No | Infrastructure node |
| management_server | hidden, serial:5099 | ❌ No | Hidden tag |

9A.5 Refresh Command

@dataclass
class RefreshLabletDefinitionContentCommand(Command[OperationResult[RefreshResultDto]]):
    """Refresh content from S3 for a LabletDefinition."""
    definition_id: str
    force: bool = False  # Refresh even if version unchanged


class RefreshLabletDefinitionContentCommandHandler(CommandHandler):
    async def handle_async(self, request, cancellation_token=None):
        definition = await self._repository.get_by_id_async(request.definition_id)
        if not definition:
            return self.not_found("LabletDefinition", request.definition_id)

        # Download from S3
        content_package = await self._s3_client.download(
            definition.state.form_qualified_name
        )

        # Parse and validate
        parsed = self._content_parser.parse(content_package)

        # Update aggregate
        definition.refresh_content(
            cml_yaml=parsed.cml_yaml,
            device_definitions=parsed.devices,
            grade_xml=parsed.grade_xml,
            content_version=parsed.version,
            refreshed_by=self._current_user.id,
        )

        await self._repository.update_async(definition, cancellation_token)

        return self.ok(RefreshResultDto(
            definition_id=request.definition_id,
            version=parsed.version,
            device_count=len(parsed.devices),
        ))

10. LDS Integration (Lab Delivery System)

The Lablet Controller is the ONLY LCM component that interacts with LDS. It provisions LabSessions during the INSTANTIATING state and archives them on TERMINATED.

LDS State Decoupling

LDS session states (EMPTY, PENDING, PRELAUNCH, RUNNING, PAUSED, USER_FINISHED, ARCHIVED) do NOT map 1:1 to LabletInstance states. States like PRELAUNCH and PAUSED are used for other LDS lab types and are not applicable to Lablets.

10.1 LabletInstance LDS Attributes

Each LabletInstance stores critical LDS integration keys:

class LabletInstanceState:
    """LabletInstance aggregate state (partial)."""

    # ... existing fields ...

    # LDS Integration Keys
    lds_session_id: str | None       # LDS session identifier (set on provisioning)
    lds_base_url: str | None         # LDS deployment URL (selected at instantiation)
    lds_login_url: str | None        # User login URL (from LDS after provisioning)

    # Port Allocation
    port_mappings: dict[str, PortInfo]  # {device_label: PortInfo}

External Key: lds_session_id

The lds_session_id is a critical external key linking the LabletInstance to its LDS session. It enables:

  • CloudEvent correlation (session.started, session.ended)
  • Session archival on termination
  • Response/feedback collection

10.2 LDS Deployment Selection

LCM supports multiple LDS deployments for load distribution and regional affinity:

class LdsDeployment:
    """LDS deployment configuration."""
    id: str                    # "lds-us-west", "lds-eu-central"
    base_url: str              # "https://lds-us-west.example.com"
    region: str                # "us-west-2"
    capacity: int              # Max concurrent sessions
    priority: int              # Selection priority (lower = preferred)
    enabled: bool              # Whether deployment is active

System Configuration:

# config/lds_deployments.yaml
lds_deployments:
  - id: lds-us-west
    base_url: https://lds-us-west.cisco.com
    region: us-west-2
    capacity: 500
    priority: 1
    enabled: true

  - id: lds-eu-central
    base_url: https://lds-eu-central.cisco.com
    region: eu-central-1
    capacity: 300
    priority: 2
    enabled: true

Selection Logic:

class LdsDeploymentSelector:
    """
    Select LDS deployment for a LabletInstance.

    Strategy:
    1. If LabletDefinition has lds_affinity, use that deployment
    2. Otherwise, round-robin across enabled deployments (weighted by priority)
    """

    async def select_deployment(
        self,
        definition: LabletDefinition,
        instance: LabletInstance,
    ) -> LdsDeployment:
        # Check for affinity override
        if definition.state.lds_affinity:
            deployment = self._get_by_id(definition.state.lds_affinity)
            if deployment and deployment.enabled:
                return deployment
            logger.warning(f"LDS affinity {definition.state.lds_affinity} unavailable")

        # Round-robin selection
        return await self._round_robin_select()

    async def _round_robin_select(self) -> LdsDeployment:
        """Select next deployment using weighted round-robin."""
        enabled = [d for d in self._deployments if d.enabled]
        enabled.sort(key=lambda d: d.priority)

        # Simple round-robin over the priority-sorted list
        # (capacity/priority weighting is not applied here)
        selected = enabled[self._counter % len(enabled)]
        self._counter += 1
        return selected

| Variable | Description | Default |
|----------|-------------|---------|
| LDS_DEPLOYMENTS_CONFIG | Path to LDS deployments config | config/lds_deployments.yaml |
| LDS_DEFAULT_TIMEOUT | API timeout (seconds) | 30 |
| LDS_SELECTION_STRATEGY | Selection strategy (round-robin, priority) | round-robin |

10.3 LDS Provisioning Flow

sequenceDiagram
    participant LC as Lablet Controller
    participant PR as Port Registry
    participant CPA as Control Plane API
    participant CML as CML Worker
    participant SEL as LDS Selector
    participant LDS as Lab Delivery System

    Note over LC: Instance enters INSTANTIATING

    %% Port Allocation & Lab Creation
    LC->>CPA: GET /lablet-definitions/{id}
    CPA-->>LC: LabletDefinition (with cached cml.yaml, devices)

    LC->>PR: Allocate ports (worker_id, device_port_tags)
    PR-->>LC: Allocated ports [10011, 10012, 10013]

    LC->>LC: Rewrite cml.yaml with allocated ports
    LC->>LC: Generate lab name: {def_name}-{def_id}-{session_id}

    LC->>CML: Import rewritten cml.yaml
    CML-->>LC: lab_id

    LC->>CML: Start lab
    CML-->>LC: OK

    %% LDS Provisioning
    LC->>SEL: Select LDS deployment (definition.lds_affinity)
    SEL-->>LC: LdsDeployment (lds_base_url)

    LC->>LC: Map visible devices to allocated ports

    LC->>LDS: create_session_with_part(username, timeslot, form_qualified_name)
    LDS-->>LC: session_id

    LC->>LDS: set_devices(session_id, devices[])
    LDS-->>LC: OK

    LC->>LDS: get_session_info(session_id)
    LDS-->>LC: login_url

    %% Store External Keys
    LC->>CPA: PATCH /instances/{id}
    Note over LC: lds_session_id, lds_base_url,<br/>lds_login_url, port_mappings

    Note over LC: Transition to READY (awaiting user login)

READY → RUNNING Transition

The lablet-controller transitions to READY, not RUNNING. RUNNING is triggered by LDS CloudEvent session.started when the user logs in. This event is handled by control-plane-api (not lablet-controller). See ADR-018 Section 7.

10.4 LabDelivery SPI

class LabDeliverySPI(Protocol):
    """
    Abstract interface for Lab Delivery System integration.

    Implementation: LdsApiClient in integration/services/
    """

    async def create_session_with_part(
        self,
        username: str,
        timeslot_start: datetime,
        timeslot_end: datetime,
        form_qualified_name: str,
    ) -> LabSessionInfo:
        """
        Create a LabSession with initial LabSessionPart.

        The form_qualified_name identifies content in S3/MinIO.
        LDS uses this to load tasks and device definitions.
        """
        ...

    async def set_devices(
        self,
        session_id: str,
        devices: list[DeviceAccessInfo],
    ) -> None:
        """
        Provision device access info for the session.

        Each device includes:
        - name: Device label (matches content.xml device_label)
        - protocol: Access protocol (ssh, telnet, vnc, http)
        - host: CML worker IP address
        - port: Allocated external port
        - uri: Full connection URI
        - username/password: Device credentials
        """
        ...

    async def get_session_info(self, session_id: str) -> LabSessionInfo:
        """Get session details including login URL."""
        ...

    async def get_login_url(self, session_id: str) -> str:
        """Get user login URL for the session."""
        ...

    async def archive_session(self, session_id: str) -> None:
        """Archive completed session (on TERMINATED)."""
        ...

    async def refresh_content(self, form_qualified_name: str) -> ContentMetadata:
        """
        Trigger LDS to refresh content from S3/MinIO.

        Called when a LabletDefinition is versioned.
        Synchronous - LDS pulls content package and returns metadata.
        """
        ...

    # Future extensions
    async def collect_responses(self, session_id: str) -> ResponseData:
        """Collect user responses from session."""
        ...

    async def collect_user_feedback_by_session(self, session_id: str) -> FeedbackData:
        """Collect user feedback for specific session."""
        ...

    async def collect_user_feedback_by_form(self, form_qualified_name: str) -> FeedbackData:
        """Collect user feedback for all sessions of a form."""
        ...

10.5 Device Mapping (No S3 Dependency)

The controller maps devices from the cached LabletDefinition to allocated ports:

async def map_devices_to_ports(
    self,
    definition: LabletDefinition,
    allocated_ports: dict[str, int],
    worker_ip: str,
) -> list[DeviceAccessInfo]:
    """
    Map cached device definitions to LDS access info.

    NOTE: Uses cached content from MongoDB - NO S3 access at instantiation.

    1. Get device_definitions from LabletDefinition.content
    2. Filter to user-visible devices only
    3. Get protocol from port_tags
    4. Lookup allocated port for each device
    5. Build DeviceAccessInfo with worker_ip as host
    """

Example Device to LDS Mapping:

Given a device in the cached LabletDefinition:

DeviceDefinition(
    label="workstation",
    node_definition="ubuntu-desktop-24-04-v2",
    port_tags=[
        PortTag(type="serial", port=5065),
        PortTag(type="vnc", port=5066),
    ],
    is_user_visible=True,
)

And allocated ports {"workstation_serial": 10011, "workstation_vnc": 10012}:

Resulting DeviceAccessInfo:

[
    DeviceAccessInfo(
        name="workstation",
        protocol="vnc",         # Primary access method
        host="10.0.1.50",       # Worker IP
        port=10012,             # Dynamically allocated port
        uri="vnc://10.0.1.50:10012",
        username="cisco",
        password="cisco",
    ),
]

10.6 LDS Content Notification

When a LabletDefinition content is refreshed (via admin command), LDS may be notified:

async def on_definition_content_refreshed(
    self,
    definition_id: str,
    form_qualified_name: str,
) -> None:
    """
    Called when LabletDefinition content is refreshed by admin.

    Optionally triggers LDS to refresh its content cache.
    This is separate from the LCM content refresh - LDS maintains its own cache.
    """
    # Notify all configured LDS deployments
    for deployment in self._lds_deployments:
        try:
            await deployment.client.refresh_content(form_qualified_name)
            logger.info(f"LDS {deployment.id} notified of content refresh")
        except Exception as e:
            logger.warning(f"Failed to notify LDS {deployment.id}: {e}")

LDS Content Independence

LDS maintains its own content cache (pulled from S3/MinIO). The notification is optional - LDS can also poll for content updates. LCM's cached content is authoritative for port allocation and device mapping.

10.7 Session Archival

When a LabletInstance reaches TERMINATED, the controller archives the LDS session:

async def on_instance_terminated(
    self,
    instance: LabletInstance,
) -> None:
    """Archive LDS session on instance termination."""
    if instance.lds_session_id:
        await self.lds_spi.archive_session(instance.lds_session_id)
        logger.info(f"LDS session archived: {instance.lds_session_id}")

10.8 Configuration Extraction

The controller can extract running configurations from nodes:

async def extract_all_configs(
    self,
    instance: LabletInstance
) -> dict[str, str]:
    """
    Extract configurations from all nodes in lab.

    Returns: {node_label: config_text}

    Use cases:
    - Backup before termination
    - Grading/assessment
    - Configuration comparison
    """

11. Configuration

Key environment variables:

| Variable | Description | Default |
|----------|-------------|---------|
| ETCD_HOST | etcd server host | localhost |
| ETCD_PORT | etcd server port | 2379 |
| CONTROL_PLANE_API_URL | Control Plane API URL | http://localhost:8080 |
| LABLET_CONTROLLER_INSTANCE_ID | Unique instance ID | Auto-generated |
| LEADER_LEASE_TTL | Leader lease TTL (seconds) | 15 |
| RECONCILE_INTERVAL | Reconciliation interval (seconds) | 30 |
| LAB_START_TIMEOUT | Max time to wait for lab start (seconds) | 300 |
| CONFIG_EXTRACT_ON_STOP | Extract configs before stopping | true |
| LDS_API_URL | Lab Delivery System API URL | http://localhost:8081 |
| LDS_API_TIMEOUT | LDS API timeout (seconds) | 30 |
| S3_ENDPOINT | S3/MinIO endpoint for content | http://localhost:9000 |
| S3_ACCESS_KEY | S3/MinIO access key | - |
| S3_SECRET_KEY | S3/MinIO secret key | - |
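A sketch of reading these variables into a typed settings object. The names mirror the table; the actual service loads configuration via application/settings.py, so this is illustrative only:

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ControllerSettings:
    """Environment-backed settings sketch (subset of the table above)."""
    etcd_host: str = os.environ.get("ETCD_HOST", "localhost")
    etcd_port: int = int(os.environ.get("ETCD_PORT", "2379"))
    leader_lease_ttl: int = int(os.environ.get("LEADER_LEASE_TTL", "15"))
    reconcile_interval: int = int(os.environ.get("RECONCILE_INTERVAL", "30"))
    lab_start_timeout: int = int(os.environ.get("LAB_START_TIMEOUT", "300"))
```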

12. Observability

Metrics Exported

| Metric | Type | Labels |
|--------|------|--------|
| instance_reconciliation_duration_seconds | Histogram | instance_id |
| instance_state_transitions_total | Counter | from_state, to_state |
| lab_start_duration_seconds | Histogram | worker_id |
| lab_nodes_booted | Gauge | instance_id, lab_id |

Health Check

GET /health

Response:
{
    "status": "healthy",
    "is_leader": true,
    "instance_id": "lablet-ctrl-abc123",
    "last_reconciliation": "2026-01-17T10:30:00Z",
    "instances_managed": 12,
    "instances_running": 8
}