
Solution Overview

Operator / Author summary

The platform turns authored lab content into graded delivery sessions for candidates. Everything operable is modelled as a timed resource with a declared desired state and an observed actual state, reconciled continuously. Two responsibilities are kept strictly apart:

  • CPA is the session manager + control plane. It is the only place operators and students enter. It keeps the catalog of definitions, drives each Session (and its Parts) through their lifecycle phases, runs the plumbing (pick a host, find the pod, open ports, register with the delivery system), and hosts the unified resource dashboard.
  • SE is the automation engine. Whenever a phase needs something done to the lab (apply changes post-init, collect device output, grade it, produce a report), CPA hands that work to SE. SE talks to the live devices through ROC (and other adapters) and sends results back.

A session may be single-part (the classic Lablet) or multi-part (e.g. an expert exam with DES / DOO / AI-DOO parts), and each part may run 0..N pods of any PodType on any HostType. The same lifecycle/reconciliation machinery applies at every level of the tree.

If a session is stuck provisioning, look at CPA + the LCM controllers. If grading or a report is wrong, look at SE + the content's grading rules.


Architect detail

Responsibility map

| Concern | Owner | Notes |
|---|---|---|
| Entry point (API + UI) | CPA | Single front door; dual auth (cookie BFF + JWT). |
| Unified resource dashboard | CPA | K8s-style: per-type lists + drill-down, slide-over detail, declarative+imperative actions. SE report UI embedded as an iframe widget. |
| Catalog (SessionDefinition / PartDefinition / PodDefinition) | CPA | SSOT = Mosaic (resolved dynamically). The Form is the synced content unit (ADR-059). |
| Session + SessionPart ordering & gating | CPA | Top-level resource reconcilers; own part sequencing and gating. |
| Native infra steps | CPA → controllers | worker_lab_resolve, pod_locator, ports_alloc, lds_register, mark_ready, archive. |
| PodInstance reconciliation | pod-controller | Per-type manager for pods (any PodType workload). |
| Host reconciliation | host-controller | Per-type manager for hosts (any HostType platform); the generalized CmlWorker, i.e. the substrate a pod binds to. |
| Form sync ingestion | form-controller | Mosaic → RustFS, then fan-out to LDS + SE (ADR-059). |
| Host allocation / timeslots | resource-scheduler | HostType-aware capacity; lead_time drives JIT vs eager. |
| EC2/CML host lifecycle | host-controller | cml_on_aws host adapter: provision/start/stop/terminate. |
| Automation (Job/Scenario/grading/report) | SE | SSOT for content = RustFS. |
| Raw device interaction | ROC | SE delegates collection; ROC owns RADkit. |
| Student access (ports/content view) | LDS | Keeps its own synced content view. |

Two sources of truth, by design: CPA treats Mosaic as the SSOT for what content exists (resolved per form-qualified name); SE treats RustFS as the SSOT for the content bytes it executes. The sync flow is what keeps them consistent.
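As a rough illustration of that sync flow (Mosaic → RustFS, then fan-out to LDS and SE), the sketch below uses in-memory stand-ins. The class names, the `forms/<name>` key layout, and the SHA-256 digest passed to consumers are all assumptions made for this example; they are not the platform's actual interfaces.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class FakeMosaic:
    """Stand-in for Mosaic: maps a form-qualified name to package bytes."""
    packages: dict[str, bytes]

    def download(self, form_name: str) -> bytes:
        return self.packages[form_name]


@dataclass
class FakeObjectStore:
    """Stand-in for RustFS: stores content bytes by key (key layout is hypothetical)."""
    objects: dict[str, bytes] = field(default_factory=dict)

    def put(self, key: str, data: bytes) -> None:
        self.objects[key] = data


@dataclass
class FakeConsumer:
    """Stand-in for LDS or SE: records which digest it was asked to sync."""
    name: str
    synced: dict[str, str] = field(default_factory=dict)

    def sync_content(self, form_name: str, digest: str) -> None:
        self.synced[form_name] = digest


def sync_form(form_name, mosaic, store, consumers):
    """Copy one form from the catalog to the object store, then notify consumers."""
    data = mosaic.download(form_name)
    digest = hashlib.sha256(data).hexdigest()  # assumed consistency marker
    store.put(f"forms/{form_name}", data)
    for consumer in consumers:
        consumer.sync_content(form_name, digest)
    return digest
```

Because every consumer receives the same digest that was computed from the stored bytes, a mismatch between the catalog view and the executed content would be detectable in this toy model. Whether the real flow carries a digest at all is not stated in this document.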

Resource management plane

The control plane is a single reconciliation pattern applied recursively. Every resource (Session, SessionPart, PodInstance, Host/Worker) is a TimedResource with a desired_status, a status, a Timeslot, and a ManagedLifecycle. A generic reconcile framework (observe → diff → act → record) is specialised by a per-type manager for each kind. Intent cascades down the tree (operator/scheduler → Session → Part → Pod); observed status bubbles up.
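A minimal sketch of one observe → diff → act → record pass may help make the pattern concrete. It models only `desired_status` and `status`; the Timeslot and ManagedLifecycle are left out, and the status values (`pending`, `provisioning`, `ready`), the `history` field, and the toy `pod_manager` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class TimedResource:
    """Hypothetical minimal shape: only the two status fields named in the text."""
    name: str
    desired_status: str
    status: str
    history: list[str] = field(default_factory=list)


# A per-type manager is modelled as a function that performs one transition
# towards the target and returns the newly observed status.
Manager = Callable[[TimedResource, str], str]


def reconcile(resource: TimedResource, manager: Manager) -> bool:
    """Run one observe -> diff -> act -> record pass. Returns True if it acted."""
    observed = resource.status                                # observe
    if observed == resource.desired_status:                   # diff
        return False
    new_status = manager(resource, resource.desired_status)   # act
    resource.history.append(f"{observed}->{new_status}")      # record
    resource.status = new_status
    return True


def pod_manager(resource: TimedResource, target: str) -> str:
    """Toy manager: moves one step at a time along a fixed status path."""
    path = ["pending", "provisioning", "ready"]
    idx, goal = path.index(resource.status), path.index(target)
    return path[idx + 1] if goal > idx else path[idx - 1]


pod = TimedResource("pod-1", desired_status="ready", status="pending")
while reconcile(pod, pod_manager):
    pass  # keep reconciling until desired and observed status agree
```

The generic `reconcile` function never needs to know what a pod is; swapping in a different manager function is what specialises it per resource type in this sketch.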

Ownership of reconciliation:

| Level | Reconciled by |
|---|---|
| Session, SessionPart (ordering, gating) | CPA / session-controller |
| Form (content_package sync) | form-controller (ADR-059) |
| PodInstance (workload) | pod-controller |
| Host/Worker (platform substrate) | host-controller (+ resource-scheduler for allocation) |
| Per-part content automation (collect/grade/report) | SE (invoked from a part's workflow phase) |
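The two directions of flow across these levels (intent down, status up) can be sketched on a small Session → Part → Pod tree. The status vocabulary and the "a parent is only as healthy as its worst child" roll-up rule are assumptions chosen for illustration; the actual aggregation rules are defined in resource-model.md.

```python
from dataclasses import dataclass, field

# Hypothetical status vocabulary, ordered from least to most healthy.
SEVERITY = {"failed": 0, "pending": 1, "ready": 2}


@dataclass
class Node:
    """One resource in the Session -> Part -> Pod tree."""
    name: str
    desired_status: str = "pending"
    status: str = "pending"
    children: list["Node"] = field(default_factory=list)


def cascade(node: Node, desired: str) -> None:
    """Push intent down: every descendant receives the same desired status."""
    node.desired_status = desired
    for child in node.children:
        cascade(child, desired)


def bubble(node: Node) -> str:
    """Pull status up: a parent takes the least healthy status among its children."""
    if node.children:
        node.status = min((bubble(c) for c in node.children),
                          key=SEVERITY.__getitem__)
    return node.status
```

A node without children (for example a part running zero pods) simply keeps its own status in this model.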

See resource-model.md for the full tree and reconciliation framework, session-model.md for definitions and session profiles, and ui-resource-dashboard.md for the dashboard.

C4: System context

C4Context
    title System Context - Lablet delivery platform

    Person(operator, "Operator", "Triggers syncs, schedules and supervises sessions")
    Person(student, "Candidate", "Consumes a delivery session")
    Person(author, "Content author", "Publishes lab content")

    System_Boundary(lcm, "Lablet Cloud Manager") {
        System(cpa, "Control-Plane API", "Session manager + single entry point")
        System(se, "Scenario Engine", "Automation engine: collect / evaluate / report")
    }

    System_Ext(mosaic, "Mosaic", "Content authoring & publishing (catalog SSOT)")
    System_Ext(roc, "ROC", "RADkit device interaction service")
    System_Ext(lds, "LDS", "Lab Delivery System (student access)")
    System_Ext(keycloak, "Keycloak", "Identity provider")

    Rel(author, mosaic, "Publishes content")
    Rel(operator, cpa, "Syncs labs, runs sessions", "HTTPS")
    Rel(student, cpa, "Joins session", "HTTPS")
    Rel(cpa, se, "Triggers jobs per phase", "HTTP / CloudEvents")
    Rel(cpa, mosaic, "Resolves & downloads content", "HTTPS")
    Rel(cpa, lds, "Registers delivery", "HTTP")
    Rel(se, roc, "Delegates device collection", "HTTP")
    Rel(cpa, keycloak, "Authn/Authz", "OIDC")

    UpdateLayoutConfig($c4ShapeInRow="2", $c4BoundaryInRow="1")

C4: Container view

As-built names vs ADR-054 Rev 2 target

The container view below shows the as-built services (lablet-controller, worker-controller). The ADR-054 Rev 2 target splits lablet-controller into session-controller, form-controller, and pod-controller, and renames worker-controller to host-controller. The responsibility map above uses the target names.

C4Container
    title Container View - services and stores

    Person(operator, "Operator")

    System_Boundary(lcm, "Lablet Cloud Manager") {
        Container(cpa, "Control-Plane API", "FastAPI + Neuroglia", "Catalog, sessions, phase ordering, UI")
        Container(labletctl, "lablet-controller", "Python service", "Content-sync ingester + lifecycle reconciler")
        Container(sched, "resource-scheduler", "Python service", "Timeslots + worker allocation")
        Container(workerctl, "worker-controller", "Python service", "EC2/CML worker lifecycle")
        Container(se, "scenario-engine", "FastAPI + Neuroglia", "Jobs, scenarios, grading, reports")

        ContainerDb(cpadb, "CPA MongoDB", "MongoDB", "SessionDefinition, Form, Session")
        ContainerDb(sedb, "SE MongoDB", "MongoDB", "Job, JobDefinition, reports")
        ContainerDb(blob, "RustFS / S3", "Object store", "Canonical content bytes (SE SSOT)")
        ContainerDb(etcd, "etcd", "KV / watch", "CPA <-> controller coordination")
    }

    System_Ext(mosaic, "Mosaic", "Catalog SSOT")
    System_Ext(roc, "ROC", "RADkit devices")
    System_Ext(lds, "LDS", "Student access")

    Rel(operator, cpa, "Uses", "HTTPS")
    Rel(cpa, cpadb, "Reads/writes")
    Rel(cpa, etcd, "Writes desired state / watches")
    Rel(labletctl, etcd, "Watches / reconciles")
    Rel(labletctl, mosaic, "Downloads package", "HTTPS")
    Rel(labletctl, blob, "Uploads content", "S3")
    Rel(labletctl, lds, "Triggers content sync", "HTTP")
    Rel(labletctl, se, "Triggers content sync", "HTTP")
    Rel(cpa, sched, "Requests allocation", "HTTP / events")
    Rel(sched, workerctl, "Requests worker ops", "HTTP / events")
    Rel(cpa, se, "Submits jobs", "HTTP")
    Rel(se, sedb, "Reads/writes")
    Rel(se, blob, "Reads content", "S3")
    Rel(se, roc, "Collects device output", "HTTP")
    Rel(se, cpa, "Job results", "CloudEvents")

    UpdateLayoutConfig($c4ShapeInRow="2", $c4BoundaryInRow="1")

Integration style

  • CPA → SE (commands): synchronous HTTP for triggering a job (submit_job, sync_content). CPA never blocks on job completion.
  • SE → CPA (results): asynchronous CloudEvents consumed by CPA's integration-event handlers, which advance the session phase.
  • CPA → controllers: desired-state via etcd; controllers reconcile and report progress back through CloudEvents / phase-progress commands.
  • SE → ROC: synchronous HTTP using the (session/bucket, bulk_cmd_uuid) contract (POST /devices, POST /execute/bulk, GET/DELETE /execute/bulk/{uuid}).
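The command/result split in the first two bullets can be sketched as follows: submitting a job returns immediately, and a separate handler advances the phase when a result event arrives. The `Session` record, the phase names, the `outcome` field, and the payload shape are assumptions for this example. Only the envelope attributes (`specversion`, `id`, `source`, `type`) follow the CloudEvents specification, and the event type string used in the usage lines is made up.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Session:
    """Hypothetical session record: current phase plus in-flight job ids."""
    session_id: str
    phases: list[str]
    phase_index: int = 0
    pending_jobs: set[str] = field(default_factory=set)

    @property
    def phase(self) -> str:
        return self.phases[self.phase_index]


def submit_job(session: Session, outbox: list[dict]) -> str:
    """Command side: record the job and return at once, without waiting for it."""
    job_id = str(uuid.uuid4())
    session.pending_jobs.add(job_id)
    outbox.append({"session_id": session.session_id,
                   "job_id": job_id,
                   "phase": session.phase})
    return job_id


def on_job_result(session: Session, event: dict) -> None:
    """Result side: consume a CloudEvents-style envelope and advance the phase."""
    data = event["data"]
    if data["job_id"] not in session.pending_jobs:
        return  # duplicate or unknown result: ignoring it keeps the handler idempotent
    session.pending_jobs.discard(data["job_id"])
    last = len(session.phases) - 1
    if data["outcome"] == "succeeded" and session.phase_index < last:
        session.phase_index += 1


session = Session("s-1", ["collect", "grade", "report"])
outbox: list[dict] = []
job_id = submit_job(session, outbox)   # returns before the job has run
on_job_result(session, {
    "specversion": "1.0", "id": "evt-1",
    "source": "scenario-engine", "type": "example.job.completed",
    "data": {"job_id": job_id, "outcome": "succeeded"},
})
```

Tracking pending job ids is one way to tolerate redelivered events; the document does not say how CPA's handlers deal with duplicates.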

Why this split

| Decision | Rationale |
|---|---|
| CPA owns Session, SE owns Job | One place to reason about delivery state; SE stays reusable across process types and pod types. |
| Phase ordering native, job bodies content-driven | Infra sequencing rarely changes; automation content changes constantly and is author-owned. |
| SE content SSOT = RustFS, not Mosaic | SE must run deterministically and offline from Mosaic; the sync flow is the only coupling point. |
| ROC kept external | RADkit/device interaction is already a live, separately-operated capability. |
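The "phase ordering native, job bodies content-driven" decision can be illustrated with a small plan builder: the infra sequence is fixed in code, and authored content may only attach jobs to existing steps. The step names come from the responsibility map; treating their listed order as the execution order, the attach-after-a-step mechanism, and the job names are assumptions for this sketch.

```python
# Native step names from the "Native infra steps" row of the responsibility map.
# Using the listed order as the execution order is an assumption.
NATIVE_STEPS = ["worker_lab_resolve", "pod_locator", "ports_alloc",
                "lds_register", "mark_ready", "archive"]


def build_plan(content_jobs: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Interleave the fixed native steps with author-supplied jobs.

    `content_jobs` maps a native step name to the jobs that run right after it.
    Unknown step names are rejected, so content cannot add or reorder infra steps.
    """
    unknown = set(content_jobs) - set(NATIVE_STEPS)
    if unknown:
        raise ValueError(f"content references unknown steps: {sorted(unknown)}")
    plan: list[tuple[str, str]] = []
    for step in NATIVE_STEPS:
        plan.append(("native", step))
        plan.extend(("job", job) for job in content_jobs.get(step, []))
    return plan


# Hypothetical authored content: two jobs hooked after the lab is marked ready.
plan = build_plan({"mark_ready": ["apply_post_init", "collect_and_grade"]})
```

Changing the authored mapping changes which jobs run, but the native entries in the resulting plan always appear in the same order.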