ADR-044: ScenarioEngine — Pod Automation as a Separate Service

Attribute	Value
Status	Proposed (Rev 2 — supersedes Rev 1 in-process design)
Date	2026-06-05
Deciders	LCM architects
Related ADRs	ADR-034, ADR-038, ADR-037
Supersedes	ADR-044 Rev 1 (in-process ScenarioEngine subsystem)
Sprint	K+ (platform architecture)

1. Context

1.1 Current State (Post ADR-038)

ADR-038 introduced the @step_handler registry and decomposed the monolithic reconciler into a package of per-step handler modules. The subsequent refactoring (AD-STEP-001) completed the one-file-per-step convention, yielding 21 individually testable step handler modules under lablet-controller/application/services/step_handlers/.

What works well:

PipelineExecutor provides DAG ordering, skip_when, retry, timeout, progress persistence
PipelineTemplateResolver enables extends/overrides/remove composition
Step handlers are stateless functions registered by name
Pipeline definitions live in LabletDefinition seed YAML

What is limiting:

All step execution occurs inside the lablet-controller reconciler process
Step handlers directly import infrastructure clients (CmlLabsSpiClient, LdsSpiClient)
No separation between "what the scenario needs" (policy) and "how to accomplish it" (mechanics)
The PipelineContext dataclass has grown to 20+ fields (god context)
Future content-defined pipelines (PAv1/) have no injection point
No versioning of step behavior — handler changes affect all definitions simultaneously
Grading, evidence, and post-init steps are stubs with no execution path yet
No multi-adapter support — only CML-on-AWS is implemented; ROC/Proxmox require different execution paths
LCM services are tightly coupled to pod automation — scaling them independently is impossible

1.2 Vision: Separate Concerns at Service Level

The Lablet Cloud Manager (LCM) and the ScenarioEngine (SE) consume the same content package (PAv1/) but at different abstraction levels:

Concern	LCM	SE
Reads from PAv1/	Top-level phase ordering, SE job triggers	Low-level task definitions, adapter calls
Owns	Session lifecycle, resource scheduling, worker assignment	Job execution, report generation, adapter dispatch
Persistence	LabletSession, LabletDefinition, Worker, LabRecord	PodDefinition, Job progress/results
Communication	CQRS commands, etcd watches, CPA REST	Fire-and-forget jobs, CloudEvents, SSE streams

"LCM consumes the top-level orchestration definition while the SE consumes the low-level IO-bound calls to adapters."

1.3 Content Package Structure (PAv1/)

exam-ccnp-test-v1-lab-1.1/
├── images/                  # Node definition images
├── resources/               # Student materials
├── content/                 # LDS content.xml, devices.json
├── mosaic_meta.json         # LDS metadata
├── RCUv1/                   # Retained for backward compat
│   ├── cml.yaml             # CML topology definition
│   ├── pod.xml              # Pod layout for LDS
│   ├── grade.xml            # Grading rules (legacy format)
│   └── devices.json         # Device-to-port mapping
└── PAv1/                    # POD Automation v1
    ├── manifest.yaml        # Pod type, required adapters, version
    ├── lifecycle.yaml       # Phase → task DAG (DSL)
    ├── scenarios/           # Reusable scenario definitions (DSL)
    │   ├── provision.yaml
    │   └── teardown.yaml
    ├── grading/             # Grading configuration
    │   ├── rubric.yaml      # Per-item grading rules
    │   └── evidence_spec.yaml
    ├── restore/             # Restore process (retakes)
    │   └── restore.yaml
    └── reports/             # Phase report templates
        └── grade_report.yaml

1.4 The Execution Question (Revised)

The Rev 1 design kept the ScenarioEngine as an in-process subsystem. That design is now superseded, for the following reasons:

Multi-adapter future — CML-on-AWS, ROC/RADkit, Proxmox, VMware require different adapter implementations that should not bloat the lablet-controller
Content lifecycle independence — PodDefinitions have their own lifecycle (DEFINED → SYNCHRONIZING → READY → EXPIRED → SUPERSEDED) separate from sessions
Scaling independence — SE handles I/O-heavy pod automation that can be scaled independently from LCM's lightweight state management
Reusability — Multiple callers (lablet-controller, manual triggers, CI/CD) can submit jobs to SE without coupling to LCM internals
DSL runtime — A proprietary workflow DSL with jq expressions warrants its own execution environment with dedicated testing and versioning

2. Decision

2.1 Introduce a ScenarioEngine as a Separate Microservice

The ScenarioEngine is a standalone FastAPI microservice that:

Owns pod automation execution (I/O-bound adapter calls)
Exposes a fire-and-forget job API (submit job → get job_id → CloudEvents callback)
Maintains an in-memory scenario registry (Python packages auto-discovered at boot)
Persists PodDefinition entities (content synced on-demand from BlobStorage)
Executes a proprietary DSL (ServerlessWorkflow-inspired, jq expressions)
Supports multiple infrastructure adapters (CML/AWS, ROC/RADkit, Proxmox, VMware)

┌─────────────────────────────────────────────────────────────────────────────┐
│  Content Layer (BlobStorage / S3)                                           │
│  LAB.zip → PAv1/ + RCUv1/ + images/ + resources/                           │
└────────────────┬─────────────────────────────────────┬──────────────────────┘
                 │ top-level phase defs                 │ full content package
                 ▼                                     ▼
┌────────────────────────────────────┐  ┌──────────────────────────────────────┐
│  LCM (Resource Orchestration)      │  │  ScenarioEngine (Pod Automation)     │
│                                    │  │                                      │
│  ┌────────────┐ ┌───────────────┐  │  │  ┌────────────────────────────────┐  │
│  │control-    │ │lablet-        │  │  │  │ DSL Runtime                    │  │
│  │plane-api   │ │controller     │──┼──┼─▶│ • Task executor (DAG)          │  │
│  │            │ │               │  │  │  │ • jq expression evaluator      │  │
│  │• Sessions  │ │• Lifecycle    │  │  │  │ • Retry/timeout/skip_when      │  │
│  │• Defs      │ │  gates        │  │  │  │ • Progress tracking            │  │
│  │• Workers   │ │• Watch loop   │  │  │  └────────────────────────────────┘  │
│  │• Timeslots │ │• State trans  │  │  │                                      │
│  └────────────┘ └───────────────┘  │  │  ┌────────────────────────────────┐  │
│                                    │  │  │ Scenario Registry (in-memory)  │  │
│  ┌────────────┐ ┌───────────────┐  │  │  │ • lab_resolve@v1              │  │
│  │resource-   │ │worker-        │  │  │  │ • lab_start@v1               │  │
│  │scheduler   │ │controller     │  │  │  │ • execute_command@v1          │  │
│  │            │ │               │  │  │  │ • collect_evidence@v1         │  │
│  │• Timeslots │ │• EC2/CML      │  │  │  │ • grade_item@v1              │  │
│  │• Compat    │ │  provisioning │  │  │  └────────────────────────────────┘  │
│  │  check     │ │• Worker state │  │  │                                      │
│  └────────────┘ └───────────────┘  │  │  ┌────────────────────────────────┐  │
│                                    │  │  │ Adapters                       │  │
│  State transitions stay HERE:      │  │  │ • CmlOnAwsAdapter             │  │
│  • mark_ready, archive, schedule   │  │  │ • RocRadkitAdapter            │  │
│  • session status updates          │  │  │ • ProxmoxAdapter (future)     │  │
│  • timeslot management             │  │  │ • VMWareAdapter (future)      │  │
│  • LDS registration                │  │  └────────────────────────────────┘  │
│                                    │  │                                      │
│  Local adapters (packages):        │  │  ┌────────────────────────────────┐  │
│  • CPA REST client                 │  │  │ PodDefinition Store           │  │
│  • etcd client                     │  │  │ • Lifecycle: DEFINED →        │  │
│  • S3 client (metadata only)       │  │  │   SYNCHRONIZING → READY →     │  │
│  • LDS client                      │  │  │   EXPIRED → SUPERSEDED        │  │
└────────────────────────────────────┘  │  │ • Content: topology, devices, │  │
                                        │  │   grading rules, scenarios     │  │
         CloudEvents ◀─────────────────┤  └────────────────────────────────┘  │
         (job.started, step.completed,  │                                      │
          job.completed, job.faulted)   └──────────────────────────────────────┘

2.2 Rationale: Why a Separate Service (Reversing Rev 1)

Factor	In-Process (Rev 1)	Separate Service (Rev 2)
Multi-adapter	Bloats lablet-controller with CML+ROC+Proxmox deps	Each adapter self-contained in SE
Content lifecycle	PodDefinition lifecycle coupled to session lifecycle	Independent PodDefinition aggregate
Scaling	I/O-heavy automation can scale only together with lightweight state management	Scale pod automation independently
Reusability	Only lablet-controller can trigger steps	Any caller (CI/CD, manual, LCM) submits jobs
DSL runtime	Mixed with CQRS/DDD framework	Dedicated execution environment
Testing	Steps tested against full LCM DI container	SE testable with mock adapters only
Deployment	Monolith risk (one crash kills reconciler + automation)	Fault isolation (SE crash ≠ LCM crash)

Key insight (updated): The step handlers are I/O-bound and need multiple infrastructure adapters (CML, ROC, Proxmox). They also need their own content lifecycle (PodDefinition) and a DSL runtime with jq expressions. This breadth of responsibility warrants a dedicated service with its own deployment, scaling, and failure domain.

2.3 ScenarioEngine Service Architecture

scenario-engine/                      # New microservice
├── main.py                           # FastAPI app, DI, lifespan
├── Makefile
├── pyproject.toml
├── Dockerfile
├── api/
│   ├── controllers/
│   │   ├── jobs_controller.py        # POST/GET/DELETE /api/v1/jobs
│   │   ├── content_controller.py     # POST /api/v1/content/sync
│   │   └── scenarios_controller.py   # GET /api/v1/scenarios
│   └── dependencies.py
├── application/
│   ├── commands/
│   │   ├── submit_job_command.py     # SubmitJob + handler
│   │   ├── cancel_job_command.py
│   │   └── sync_content_command.py   # SyncContent + handler
│   ├── queries/
│   │   ├── get_job_query.py
│   │   └── list_scenarios_query.py
│   ├── services/
│   │   ├── dsl_runtime/              # DSL execution engine
│   │   │   ├── executor.py           # Task DAG executor
│   │   │   ├── jq_evaluator.py       # jq expression evaluation
│   │   │   ├── task_dispatcher.py    # Routes task types to handlers
│   │   │   └── data_flow.py          # Input/output/context transforms
│   │   ├── scenario_registry.py      # In-memory registry (@scenario decorator)
│   │   ├── content_ingestion.py      # LAB.zip download, extract, parse
│   │   └── report_generator.py       # Phase report assembly
│   ├── events/
│   │   └── cloud_event_publisher.py  # Emit job lifecycle CloudEvents
│   └── settings.py
├── domain/
│   ├── entities/
│   │   ├── pod_definition.py         # PodDefinition aggregate
│   │   └── job.py                    # Job aggregate (progress, results)
│   ├── value_objects/
│   │   ├── pod_type.py               # PodType enum
│   │   ├── job_status.py             # JobStatus enum
│   │   └── task_result.py            # TaskResult value object
│   ├── events/
│   │   ├── pod_definition_events.py
│   │   └── job_events.py
│   └── repositories/
│       ├── pod_definition_repository.py
│       └── job_repository.py
├── infrastructure/
│   ├── adapters/                     # Infrastructure adapters
│   │   ├── base_adapter.py           # AdapterProtocol interface
│   │   ├── cml_on_aws_adapter.py     # CML + AWS EC2
│   │   ├── roc_radkit_adapter.py     # ROC + RADkit (CCIE DMZ)
│   │   ├── proxmox_adapter.py        # Proxmox VE (future)
│   │   └── vmware_adapter.py         # VMware vSphere (future)
│   └── persistence/
│       └── mongo_pod_definition_repository.py
├── scenarios/                         # Registered scenario packages
│   ├── __init__.py                   # Auto-import for registration
│   ├── lab_resolve.py
│   ├── lab_start.py
│   ├── lab_stop.py
│   ├── execute_command.py
│   ├── collect_evidence.py
│   ├── grade_item.py
│   └── ...
└── tests/

2.4 SE API Contract

Job Submission (Fire-and-Forget)

POST /api/v1/jobs
Content-Type: application/json

{
  "definition_id": "exam-ccnp-test-v1-lab-1.1",
  "version": "1.0.0",
  "session_id": "sess-abc123",
  "phase": "instantiate",
  "worker": {
    "ip": "10.0.1.42",
    "cml_username": "admin",
    "cml_password": "***",
    "region": "us-east-1",
    "adapter": "cml_on_aws"
  },
  "variables": {
    "lab_reuse_enabled": true,
    "port_template": "exam-ccnp-v1"
  },
  "callback_url": "http://lablet-controller:8002/api/cloudevents"
}

Response:

{
  "job_id": "job-7f3a2b",
  "status": "accepted",
  "stream_url": "/api/v1/jobs/job-7f3a2b/stream"
}
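
A minimal sketch of the accept path, assuming an in-memory job store in place of the real Job repository. `InMemoryJobStore`, `JobRecord`, and the list of required fields are hypothetical names chosen for illustration, not part of the SE codebase. It shows the fire-and-forget shape: validate the request, record the job, and return the acknowledgement without waiting for execution.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class JobRecord:
    """In-memory stand-in for the Job aggregate (hypothetical shape)."""
    job_id: str
    definition_id: str
    session_id: str
    phase: str
    status: str = "accepted"


@dataclass
class InMemoryJobStore:
    jobs: dict[str, JobRecord] = field(default_factory=dict)

    def accept(self, request: dict) -> dict:
        """Validate the request, record the job, and return immediately."""
        required = ("definition_id", "session_id", "phase")  # assumed minimum
        missing = [key for key in required if key not in request]
        if missing:
            raise ValueError(f"missing required fields: {missing}")
        job_id = f"job-{uuid.uuid4().hex[:6]}"
        self.jobs[job_id] = JobRecord(
            job_id=job_id,
            definition_id=request["definition_id"],
            session_id=request["session_id"],
            phase=request["phase"],
        )
        # Execution would be scheduled in the background at this point; the
        # caller only receives an acknowledgement and the SSE stream location.
        return {
            "job_id": job_id,
            "status": "accepted",
            "stream_url": f"/api/v1/jobs/{job_id}/stream",
        }
```

The caller keeps only `job_id` and learns the outcome later through the CloudEvent callback or the stream URL.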

CloudEvent Callbacks

{
  "specversion": "1.0",
  "type": "io.lcm.scenario-engine.job.completed",
  "source": "/scenario-engine/jobs/job-7f3a2b",
  "subject": "sess-abc123",
  "data": {
    "job_id": "job-7f3a2b",
    "phase": "instantiate",
    "status": "completed",
    "duration_seconds": 142,
    "results": {
      "lab_resolve": { "lab_id": "abc", "lab_title": "CCNP Lab 1.1" },
      "lab_start": { "converged": true },
      "lds_provision": { "url": "https://lds.example.com/pod/123" }
    },
    "report_url": "/api/v1/jobs/job-7f3a2b/report"
  }
}
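
On the receiving side, a callback like the one above could be routed by its `type` suffix. The following is a sketch only: the handler functions and the `HANDLERS` table are hypothetical, and the prefix is taken from the example event rather than from a confirmed naming scheme.

```python
from typing import Callable

# Prefix taken from the example event above; handler names are hypothetical.
EVENT_PREFIX = "io.lcm.scenario-engine."


def on_job_completed(session_id: str, data: dict) -> str:
    return f"session {session_id}: phase {data['phase']} completed"


def on_job_faulted(session_id: str, data: dict) -> str:
    return f"session {session_id}: job {data['job_id']} faulted: {data.get('error')}"


HANDLERS: dict[str, Callable[[str, dict], str]] = {
    "job.completed": on_job_completed,
    "job.faulted": on_job_faulted,
}


def dispatch_cloud_event(event: dict) -> str:
    """Route a CloudEvent to a handler keyed by its type suffix."""
    if event.get("specversion") != "1.0":
        raise ValueError("unsupported CloudEvents specversion")
    event_type = event["type"]
    if not event_type.startswith(EVENT_PREFIX):
        raise ValueError(f"unexpected event type: {event_type}")
    handler = HANDLERS.get(event_type[len(EVENT_PREFIX):])
    if handler is None:
        # Unknown but well-formed events are ignored rather than rejected.
        return f"ignored {event_type}"
    # The session id travels in the CloudEvent `subject` attribute.
    return handler(event["subject"], event["data"])
```

Keying on `subject` lets the lablet-controller find the affected session without parsing the job payload.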

Content Sync

POST /api/v1/content/sync
Content-Type: application/json

{
  "definition_id": "exam-ccnp-test-v1-lab-1.1",
  "version": "1.0.0",
  "s3_uri": "s3://content-bucket/labs/exam-ccnp-test-v1-lab-1.1/v1.0.0/LAB.zip",
  "pod_type": "cml_on_aws"
}

2.5 PodDefinition Domain Model

# domain/entities/pod_definition.py

class PodDefinitionStatus(str, Enum):
    DEFINED = "defined"             # Created, awaiting content sync
    SYNCHRONIZING = "synchronizing" # Downloading/extracting from S3
    READY = "ready"                 # Content synced, scenarios validated
    EXPIRED = "expired"             # Timeslot ended, no longer usable
    SUPERSEDED = "superseded"       # Newer version available

@dataclass
class PodDefinitionState:
    id: str
    definition_id: str              # Matches LCM's LabletDefinition
    version: str                    # Semantic version of content
    pod_type: PodType               # CML_ON_AWS, ROC_RADKIT, etc.
    status: PodDefinitionStatus
    content_hash: str               # SHA256 of extracted PAv1/
    synced_at: Optional[datetime]
    topology: Optional[TopologySpec]     # Parsed CML topology
    devices: list[DeviceSpec]            # Device-to-port mapping
    grading_rules: list[GradingRule]     # Per-item grading config
    scenarios: list[ScenarioRef]         # Required scenarios for lifecycle
    lifecycle_phases: dict[str, PhaseDefinition]  # Parsed PAv1/lifecycle.yaml
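
The status lifecycle can be guarded with a small transition table. This is a sketch under assumptions: the ADR only gives the linear chain, so the direct READY to SUPERSEDED edge is an illustrative guess, and `ALLOWED_TRANSITIONS` and `transition` are hypothetical names.

```python
from enum import Enum


class PodDefinitionStatus(str, Enum):
    DEFINED = "defined"
    SYNCHRONIZING = "synchronizing"
    READY = "ready"
    EXPIRED = "expired"
    SUPERSEDED = "superseded"


S = PodDefinitionStatus

# Assumed transition table: the linear chain from this ADR, plus one
# illustrative extra edge (READY -> SUPERSEDED) that the ADR does not confirm.
ALLOWED_TRANSITIONS: dict[S, frozenset[S]] = {
    S.DEFINED: frozenset({S.SYNCHRONIZING}),
    S.SYNCHRONIZING: frozenset({S.READY}),
    S.READY: frozenset({S.EXPIRED, S.SUPERSEDED}),
    S.EXPIRED: frozenset({S.SUPERSEDED}),
    S.SUPERSEDED: frozenset(),
}


def transition(current: S, target: S) -> S:
    """Return the new status, or raise if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Keeping the table as data makes it straightforward to adjust once the real transition rules are settled.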

2.6 PodDefinitionRef in LCM

The LCM's LabletDefinition references a PodDefinition via a value object:

# lcm_core/domain/value_objects/pod_definition_ref.py

@dataclass(frozen=True)
class PodDefinitionRef:
    """Reference from LCM's LabletDefinition to SE's PodDefinition."""
    definition_id: str          # e.g. "exam-ccnp-test-v1-lab-1.1"
    version: str                # e.g. "1.0.0"
    pod_type: PodType           # Required infrastructure type
    content_hash: Optional[str] # Set after sync confirmation

2.7 Dual Adapter Selection

Adapter selection is a dual requirement:

Content declares — PAv1/manifest.yaml specifies pod_type: cml_on_aws
Worker matches — The assigned worker's capabilities must include the required pod_type

# PAv1/manifest.yaml
schema_version: "1.0"
pod_type: cml_on_aws
required_adapter_version: ">=1.0.0"
min_resources:
  vcpus: 16
  memory_gb: 64
  storage_gb: 200

The resource-scheduler enforces compatibility at scheduling time:

# resource-scheduler validation
if worker.pod_type != definition.pod_definition_ref.pod_type:
    raise IncompatibleWorkerError(
        f"Worker {worker.id} is {worker.pod_type}, "
        f"but definition requires {definition.pod_definition_ref.pod_type}"
    )
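
A slightly fuller sketch of the same check, assuming the scheduler also compares the manifest's `min_resources` against the worker and that a worker advertises a set of supported pod types. `MinResources`, `WorkerCapabilities`, and `compatibility_problems` are hypothetical names; the ADR does not state that resources are validated at this point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MinResources:
    """Mirrors the min_resources block of PAv1/manifest.yaml."""
    vcpus: int
    memory_gb: int
    storage_gb: int


@dataclass(frozen=True)
class WorkerCapabilities:
    """Hypothetical view of what a worker can host."""
    pod_types: frozenset[str]
    vcpus: int
    memory_gb: int
    storage_gb: int


def compatibility_problems(
    pod_type: str, required: MinResources, worker: WorkerCapabilities
) -> list[str]:
    """Return every reason the worker cannot host the definition."""
    problems: list[str] = []
    if pod_type not in worker.pod_types:
        problems.append(f"pod_type {pod_type} not supported")
    for name in ("vcpus", "memory_gb", "storage_gb"):
        needed, available = getattr(required, name), getattr(worker, name)
        if available < needed:
            problems.append(f"{name}: need {needed}, have {available}")
    return problems
```

Returning the full list rather than raising on the first mismatch gives the scheduler one message that covers every failed requirement.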

2.8 DSL Overview (Proprietary, ServerlessWorkflow-Inspired)

⚠️ Superseded by ADR-057 / ADR-058. The ServerlessWorkflow-inspired task-type list and $context data-flow model described below are no longer canonical. The job-body DSL is now a flat, ordered step DAG of a closed scenarioFunction primitive set (uses/with/capture/when/on_error/stage), with data flowing through the four named scopes of ADR-058 (session.* / content.* / runtime_env.* / vars.*) rather than a single mutable $context. The do / for / fork / switch / try / raise / emit / run / wait task types are dropped (iteration is the deferred for_each, ADR-057 §2.8). See DSL-SPECIFICATION.md for the current grammar. The content below is retained as a historical record of the original design only.

The SE executes a proprietary DSL that borrows syntax and concepts from the ServerlessWorkflow specification.

Expression Language: jq

All runtime expressions use jq syntax with ${} delimiters in strict mode:

# Runtime expression arguments available:
# $context  — workflow context (accumulated state)
# $input    — current task's transformed input
# $output   — current task's raw output (in output.as only)
# $secrets  — secret store (restricted access)
# $task     — current task descriptor
# $workflow — workflow descriptor

Task Types

Type	Purpose	Example
`call`	Invoke a registered scenario	`call: lab_resolve@v1`
`do`	Sequential sub-tasks	`do: [step1, step2]`
`for`	Iterate over collection	`for: { each: item, in: ${ .devices } }`
`fork`	Parallel execution	`fork: { branches: [a, b] }`
`set`	Set context variables	`set: { lab_id: ${ .result.id } }`
`switch`	Conditional branching	`switch: [{ when: ..., then: ... }]`
`try`	Error handling + retry	`try: { ... } catch: { retry: ... }`
`raise`	Signal failure	`raise: { error: "device unreachable" }`
`wait`	Pause execution	`wait: { seconds: 30 }`
`emit`	Publish CloudEvent	`emit: { type: step.completed }`
`run`	Execute shell/script	`run: { shell: "show ip route" }`

Data Flow Model

Raw Input → input.schema (validate) → input.from (transform) → Task Execution
    → output.as (transform) → output.schema (validate) → export.as (update $context)
    → Next Task (transformed output as raw input)
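
The original data-flow model in this section can be illustrated with plain Python callables standing in for jq expressions. This is a sketch, not the SE runtime: `TaskSpec` and `run_task` are hypothetical names, schema validation is omitted, and the transforms are ordinary functions rather than evaluated `${ ... }` expressions.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class TaskSpec:
    """One task with its optional input, output, and export transforms."""
    run: Callable[[Any], Any]
    input_from: Optional[Callable[[Any, dict], Any]] = None   # (raw_input, context)
    output_as: Optional[Callable[[Any, dict], Any]] = None    # (raw_output, context)
    export_as: Optional[Callable[[dict, Any], dict]] = None   # (context, output)


def run_task(task: TaskSpec, raw_input: Any, context: dict) -> tuple[Any, dict]:
    """Apply input.from, execute, apply output.as, then export.as."""
    task_input = task.input_from(raw_input, context) if task.input_from else raw_input
    raw_output = task.run(task_input)
    output = task.output_as(raw_output, context) if task.output_as else raw_output
    # export.as produces the next context; without it the context is unchanged.
    new_context = task.export_as(context, output) if task.export_as else context
    return output, new_context
```

The returned output becomes the raw input of the next task, and the returned context replaces `$context` for every later expression.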

Example: PAv1/lifecycle.yaml

# PAv1/lifecycle.yaml — Content-defined session lifecycle
document:
  dsl: "1.0.0"
  namespace: lcm
  name: exam-ccnp-test-v1-lab-1.1
  version: "1.0.0"

phases:
  instantiate:
    do:
      - resolveTopology:
          call: lab_resolve@v1
          with:
            definition_id: ${ $context.definition_id }
          output:
            as: ${ { lab_id: .lab_id, lab_title: .title } }

      - allocatePorts:
          call: ports_alloc@v1
          input:
            from: ${ { lab_id: $context.lab_id, template: $context.port_template } }
          if: ${ $context.port_template != null }

      - startLab:
          call: lab_start@v1
          with:
            lab_id: ${ $context.lab_id }
          output:
            as: ${ { converged: .converged, nodes: .nodes } }

      - provisionLds:
          call: lds_provision@v1
          with:
            lab_id: ${ $context.lab_id }
            devices: ${ $context.devices }
          if: ${ ($context.devices | length) > 0 }

  collect:
    do:
      - gatherEvidence:
          for:
            each: item
            in: ${ $context.grading_rules }
          do:
            - collectItem:
                call: collect_evidence@v1
                with:
                  item_id: ${ $item.id }
                  device: ${ $item.target_device }
                  command: ${ $item.collect_command }
                output:
                  as: ${ { ($item.id): .output } }
                export:
                  as: ${ $context | .evidence[$item.id] = $output }

  grade:
    do:
      - gradeItems:
          for:
            each: item
            in: ${ $context.grading_rules }
          do:
            - gradeItem:
                call: grade_item@v1
                with:
                  item_id: ${ $item.id }
                  evidence: ${ $context.evidence[$item.id] }
                  expected: ${ $item.expected }
                  scoring: ${ $item.scoring }
                output:
                  as: ${ { score: .score, max: .max, feedback: .feedback } }
                export:
                  as: ${ $context | .scores[$item.id] = $output }

      - generateReport:
          call: generate_phase_report@v1
          with:
            phase: grade
            scores: ${ $context.scores }
            template: "PAv1/reports/grade_report.yaml"

  teardown:
    do:
      - stopLab:
          call: lab_stop@v1
          with:
            lab_id: ${ $context.lab_id }

      - deregisterLds:
          call: lds_deregister@v1
          with:
            lab_id: ${ $context.lab_id }
          if: ${ $context.lds_registered == true }

      - wipeLab:
          call: lab_wipe@v1
          with:
            lab_id: ${ $context.lab_id }

2.9 Scenario Registry (In-Memory)

Scenarios are Python modules auto-discovered at SE boot via a decorator pattern:

# scenarios/lab_resolve.py

from scenario_engine.registry import scenario

@scenario(name="lab_resolve", version="v1")
async def lab_resolve(input: dict, adapter: AdapterProtocol, ctx: ExecutionContext) -> dict:
    """Resolve lab topology from CML worker.

    Input:
      definition_id: str — The content definition to resolve

    Output:
      lab_id: str — Resolved CML lab ID
      title: str — Lab title
      nodes: list[dict] — Node list with state
    """
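    labs = await adapter.list_labs()
    # Reuse an existing lab with the expected title; import the topology otherwise.
    lab = next((candidate for candidate in labs if candidate["title"] == ctx.expected_title), None)
    if lab is None:
        lab = await adapter.import_lab(ctx.topology_file)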
    return {"lab_id": lab["id"], "title": lab["title"], "nodes": lab.get("nodes", [])}

The registry is not persisted: it is the set of @scenario-decorated functions discovered from the scenarios/ package at import time. This is analogous to the existing @step_handler pattern in lablet-controller.

2.10 Content Sync Flow

Content Owner → CPA: POST /api/definitions/{id}/publish
    CPA: Update LabletDefinition (status=SYNCING, pod_definition_ref populated)
    CPA → SE: POST /api/v1/content/sync {definition_id, version, s3_uri, pod_type}
        SE: Create PodDefinition (status=SYNCHRONIZING)
        SE → S3: Download LAB.zip
        SE: Extract PAv1/, parse lifecycle.yaml, validate scenarios
        SE: Parse devices.json, topology, grading rules
        SE: Update PodDefinition (status=READY, content_hash=sha256)
    SE → CPA: CloudEvent io.lcm.scenario-engine.pod-definition.ready
    CPA: RecordContentSyncResult (status=SYNCED, content_hash confirmed)
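
The `content_hash` step can be sketched as a deterministic SHA-256 over the extracted PAv1/ tree. The ADR does not specify how the hash is computed, so the traversal order and length-prefixed framing below are assumptions, and `content_hash` is a hypothetical helper.

```python
import hashlib
from pathlib import Path


def content_hash(root: Path) -> str:
    """Hash every file under root in a stable order (path, then content)."""
    digest = hashlib.sha256()
    # Sort by POSIX-style relative path so the result does not depend on
    # filesystem enumeration order or the host platform's path separator.
    files = sorted(
        (path.relative_to(root).as_posix(), path)
        for path in root.rglob("*")
        if path.is_file()
    )
    for relative, path in files:
        name = relative.encode("utf-8")
        data = path.read_bytes()
        # Length prefixes keep (name, data) boundaries unambiguous.
        digest.update(len(name).to_bytes(8, "big"))
        digest.update(name)
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()
```

Two extractions of the same LAB.zip then yield the same hash, which lets CPA confirm that SE synced the content it expected.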

2.11 Step Handler Migration (Option C)

Stays in LCM (state transitions)	Moves to SE (I/O automation)
`mark_ready_step.py`	`lab_resolve_step.py`
`archive_step.py`	`lab_start_step.py`
`session_status_update`	`lab_stop_step.py`
`timeslot_validation`	`lab_wipe_step.py`
`lds_register_step.py` (state)	`execute_command_on_cml_node_step.py`
	`ports_alloc_step.py`
	`tags_sync_step.py`
	`transfer_file_step.py`
	`collect_evidence` (new)
	`grade_item` (new)

Shared capabilities (in lcm_core):

lcm_core.infrastructure.content_store — S3 pull, zip extraction
lcm_core.domain.dsl — DSL parser, task model
lcm_core.domain.value_objects.pod_type — PodType enum, compat checks

2.12 Lifecycle Phase Mapping (LCM → SE)

LCM Phase	LCM Responsibility	SE Job Phase
INSTANTIATING	Trigger SE job, await CloudEvent	`instantiate`
READY	Mark session ready (after SE confirms)	—
RUNNING	Monitor timeslot, handle extensions	—
COLLECTING	Trigger SE job, await CloudEvent	`collect`
GRADING	Trigger SE job, store results	`grade`
STOPPING	Trigger SE job, await CloudEvent	`teardown`
ARCHIVED	Archive artifacts (LCM local)	`archive`

2.13 Migration Strategy

Phase 1: Foundation (Current sprint + 1)

Create scenario-engine/ service scaffold (FastAPI, Neuroglia, Makefile)
Implement PodDefinition aggregate + MongoDB persistence
Implement content sync endpoint (S3 download, extraction, PAv1/ parsing)
Implement basic job submission API (accept + return job_id)
CloudEvent publisher skeleton
Port lab_resolve and lab_start as first two scenarios

Phase 2: DSL Runtime

Implement DSL executor (task types: call, do, for, set, switch, try)
Implement jq expression evaluator (via pyjq or jq.py binding)
Implement data flow pipeline (input.from → execute → output.as → export.as)
Job progress tracking + SSE stream endpoint
Port remaining I/O step handlers as scenarios

Phase 3: Adapter Abstraction

Define the AdapterProtocol interface (a typing.Protocol, as in section 4.3)
Implement CmlOnAwsAdapter (extract from current CML SPI clients)
Stub RocRadkitAdapter interface
Adapter selection from PodDefinition.pod_type

Phase 4: Advanced Features

Per-item grading engine (for + grade_item@v1)
Phase report generation
Evidence collection subsystem
Restore/retake scenario support
fork task type (parallel execution)
listen task type (convergence callbacks)
Warm-pool pre-instantiation via SE

Phase 5: Multi-Adapter

Implement ROC/RADkit adapter
Proxmox adapter
VMware adapter
Adapter capability negotiation

3. Consequences

3.1 Positive

Clean service boundary: LCM owns resource state, SE owns pod automation execution
Multi-adapter ready: SE encapsulates adapter complexity; adding Proxmox/ROC doesn't touch LCM
Content-driven: Lab authors define lifecycle steps via DSL without platform code changes
Independent scaling: SE scales with pod automation load; LCM scales with session volume
Fault isolation: SE crash doesn't kill LCM reconciler; jobs can be retried
Reusable: CI/CD, manual triggers, and future services can submit SE jobs
Observable: CloudEvents + SSE streams give LCM real-time progress without polling
Testable: SE testable against mock adapters; LCM testable against mock SE API
DSL-first: jq expressions and ServerlessWorkflow-inspired syntax enable powerful content authoring

3.2 Negative

New service deployment: Additional container, health checks, monitoring
Network latency: Job dispatch crosses service boundary (mitigated by fire-and-forget pattern)
Credential management: SE needs its own CML/AWS credentials (via Kubernetes secrets)
Migration cost: I/O step handlers must be ported to SE scenarios incrementally (mitigated because lablet-controller step handlers remain functional during the transition)
Shared code coordination: lcm_core changes affect both LCM and SE deployments

3.3 Neutral

Existing seed YAML format remains valid (PipelineTemplateResolver preserved in LCM)
PAv1/ is additive — RCUv1/ continues to work as-is
Current lablet-controller tests continue to pass throughout the migration
The SE can be developed in parallel with ongoing LCM feature work

4. Implementation Notes

4.1 Shared Package Strategy

lcm_core/                             # Shared between LCM services AND SE
├── domain/
│   ├── value_objects/
│   │   ├── pod_type.py               # PodType enum (CML_ON_AWS, ROC_RADKIT, ...)
│   │   ├── pod_definition_ref.py     # PodDefinitionRef value object
│   │   └── content_manifest.py       # Parsed PAv1/manifest.yaml model
│   └── dsl/                          # DSL model (shared between parser + runtime)
│       ├── __init__.py
│       ├── task_types.py             # CallTask, DoTask, ForTask, SetTask, etc.
│       ├── expressions.py            # RuntimeExpression, JqExpression
│       └── lifecycle_definition.py   # PhaseDefinition, TaskDefinition
├── infrastructure/
│   └── content_store/                # S3 pull + zip extraction
│       ├── __init__.py
│       ├── s3_content_client.py      # Download LAB.zip
│       └── content_extractor.py      # Extract + parse PAv1/ artifacts

4.2 Communication Patterns

LCM → SE:  HTTP POST (fire-and-forget job submission)
SE → LCM:  CloudEvents via HTTP POST to callback_url
SE → Any:  SSE stream at /api/v1/jobs/{id}/stream (optional real-time)

CloudEvent Types:

Type	When	Data
`io.lcm.scenario-engine.job.accepted`	Job queued	`{job_id, phase}`
`io.lcm.scenario-engine.task.started`	Task begins	`{job_id, task_name}`
`io.lcm.scenario-engine.task.completed`	Task succeeds	`{job_id, task_name, output}`
`io.lcm.scenario-engine.task.faulted`	Task fails	`{job_id, task_name, error}`
`io.lcm.scenario-engine.job.completed`	All tasks done	`{job_id, results, report_url}`
`io.lcm.scenario-engine.job.faulted`	Job failed	`{job_id, error, partial_results}`
`io.lcm.scenario-engine.pod-definition.ready`	Content synced	`{definition_id, version, content_hash}`
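
A sketch of how the publisher could build these envelopes. CloudEvents 1.0 requires the `id`, `source`, `specversion`, and `type` attributes, so the helper always sets them; `make_cloud_event` itself is a hypothetical name, and the `source` and `subject` layout follows the callback example in section 2.4.

```python
import json
import uuid
from datetime import datetime, timezone


def make_cloud_event(event_type: str, job_id: str, session_id: str, data: dict) -> dict:
    """Build a structured-mode CloudEvents 1.0 envelope for a job event."""
    return {
        "specversion": "1.0",
        "id": str(uuid.uuid4()),                         # unique per event
        "type": event_type,
        "source": f"/scenario-engine/jobs/{job_id}",
        "subject": session_id,                           # lets LCM find the session
        "time": datetime.now(timezone.utc).isoformat(),  # RFC 3339 timestamp
        "datacontenttype": "application/json",
        "data": {"job_id": job_id, **data},
    }


if __name__ == "__main__":
    event = make_cloud_event(
        "io.lcm.scenario-engine.job.completed",
        "job-7f3a2b",
        "sess-abc123",
        {"phase": "instantiate", "status": "completed"},
    )
    print(json.dumps(event, indent=2))
```

When the envelope is posted to `callback_url` in structured mode, the request would carry `Content-Type: application/cloudevents+json`.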

4.3 Adapter Protocol

# infrastructure/adapters/base_adapter.py

class AdapterProtocol(Protocol):
    """Infrastructure adapter for pod automation.

    Each adapter encapsulates all interactions with a specific
    infrastructure type (CML-on-AWS, ROC/RADkit, Proxmox, VMware).
    """

    @property
    def pod_type(self) -> PodType: ...

    # Lab management
    async def list_labs(self) -> list[LabInfo]: ...
    async def import_lab(self, topology: bytes) -> LabInfo: ...
    async def start_lab(self, lab_id: str) -> LabStatus: ...
    async def stop_lab(self, lab_id: str) -> LabStatus: ...
    async def wipe_lab(self, lab_id: str) -> None: ...
    async def delete_lab(self, lab_id: str) -> None: ...
    async def get_lab_status(self, lab_id: str) -> LabStatus: ...

    # Node operations
    async def execute_command(self, lab_id: str, node: str, cmd: str) -> CommandResult: ...
    async def transfer_file(self, lab_id: str, node: str, src: str, dst: str) -> None: ...

    # Telemetry
    async def get_system_stats(self) -> SystemStats: ...
    async def get_node_telemetry(self, lab_id: str, node: str) -> NodeTelemetry: ...
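
Adapter selection from a pod type could be handled by a small factory registry. This is a sketch under assumptions: `AdapterRegistry` and `StubCmlAdapter` are hypothetical, and the connection keyword arguments are illustrative rather than the real adapter constructor signature.

```python
from enum import Enum
from typing import Callable


class PodType(str, Enum):
    CML_ON_AWS = "cml_on_aws"
    ROC_RADKIT = "roc_radkit"


class AdapterRegistry:
    """Maps a PodType to a factory that builds the matching adapter."""

    def __init__(self) -> None:
        self._factories: dict[PodType, Callable[..., object]] = {}

    def register(self, pod_type: PodType, factory: Callable[..., object]) -> None:
        self._factories[pod_type] = factory

    def create(self, pod_type: PodType, **connection: str) -> object:
        try:
            factory = self._factories[pod_type]
        except KeyError:
            raise LookupError(f"no adapter registered for {pod_type.value}") from None
        return factory(**connection)


class StubCmlAdapter:
    """Placeholder adapter that only records its connection details."""
    pod_type = PodType.CML_ON_AWS

    def __init__(self, ip: str) -> None:
        self.ip = ip
```

A job for a `cml_on_aws` definition would then call `registry.create(PodType.CML_ON_AWS, ip=...)`, while an unregistered pod type fails before any infrastructure call is made.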

4.4 Testing Strategy

Layer	Approach
Scenarios	Unit test each `@scenario` with mock adapter
DSL Runtime	Integration test executor with sample lifecycle.yaml
Adapters	Contract tests against recorded responses (VCR pattern)
API	FastAPI TestClient, mock command/query handlers
Content Sync	Integration test with local S3 (MinIO)
End-to-End	LCM submits job → SE executes → CloudEvent received
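
The scenario-level row can be illustrated with a hand-written fake adapter. This is a sketch only: `FakeAdapter` and `resolve_lab` are hypothetical, labs are modelled as plain dicts, and the scenario logic is a simplified version of the lab_resolve example in section 2.9.

```python
import asyncio


class FakeAdapter:
    """In-memory stand-in that implements only what the scenario calls."""

    def __init__(self, labs: list[dict]) -> None:
        self._labs = list(labs)
        self.imported: list[bytes] = []

    async def list_labs(self) -> list[dict]:
        return list(self._labs)

    async def import_lab(self, topology: bytes) -> dict:
        self.imported.append(topology)
        lab = {"id": f"lab-{len(self._labs) + 1}", "title": "imported lab", "nodes": []}
        self._labs.append(lab)
        return lab


async def resolve_lab(adapter: FakeAdapter, expected_title: str, topology: bytes) -> dict:
    """Reuse a lab with the expected title, or import the topology."""
    labs = await adapter.list_labs()
    lab = next((c for c in labs if c["title"] == expected_title), None)
    if lab is None:
        lab = await adapter.import_lab(topology)
    return {"lab_id": lab["id"], "title": lab["title"], "nodes": lab.get("nodes", [])}


def test_reuses_existing_lab() -> None:
    adapter = FakeAdapter([{"id": "lab-1", "title": "CCNP Lab 1.1", "nodes": []}])
    result = asyncio.run(resolve_lab(adapter, "CCNP Lab 1.1", b"topology"))
    assert result["lab_id"] == "lab-1"
    assert adapter.imported == []
```

Because the fake records every import, a test can assert both the returned result and the side effects on the infrastructure.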

4.5 Backward Compatibility During Migration

lablet-controller keeps existing step_handlers — they continue to work for definitions without PAv1/ content
Dual execution path — lablet-controller delegates to SE only when SE is available and the definition has PAv1/ content; otherwise it executes locally via PipelineExecutor
Feature flag — SE_ENABLED=true enables SE delegation; SE_ENABLED=false forces local execution
Gradual rollover — individual definitions can opt-in to SE by including PAv1/
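
The routing rules above reduce to a small decision function. This is a sketch, assuming the flag is read from the environment as the string "true" or "false"; `se_enabled` and `choose_execution_path` are hypothetical helper names, as are the two return labels.

```python
import os
from typing import Mapping, Optional


def se_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Read the SE_ENABLED feature flag; anything but "true" disables it."""
    source = os.environ if env is None else env
    return source.get("SE_ENABLED", "false").strip().lower() == "true"


def choose_execution_path(*, enabled: bool, se_available: bool, has_pav1: bool) -> str:
    """Delegate to SE only when all three conditions hold."""
    if enabled and se_available and has_pav1:
        return "scenario_engine"
    # Any missing condition falls back to the existing PipelineExecutor.
    return "local_pipeline"
```

Defaulting to local execution keeps definitions without PAv1/ content on the current code path even when the flag is switched on.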

4.6 Deployment

# docker-compose addition
scenario-engine:
  build:
    context: ./src/scenario-engine
  environment:
    - MONGODB_URI=mongodb://mongo:27017/scenario_engine
    - S3_ENDPOINT=http://minio:9000
    - S3_BUCKET=content
    - CLOUD_EVENT_SINK=http://lablet-controller:8002/api/cloudevents
  depends_on:
    - mongo
    - minio
  ports:
    - "8004:8000"

5. References

ADR-034: Pipeline Executor (DAG execution, progress persistence)
ADR-038: Step Handler Registry (decorator-based registration, reconciler decomposition)
AD-STEP-001: One step per file convention
AD-SE-01: PodDefinition entity with lifecycle persistence
AD-SE-02: DSL expression language is jq
AD-SE-03: Step handler split — state transitions in LCM, I/O automation in SE
AD-SE-04: Adapter selection — dual requirement (definition declares + worker matches)
AD-044: LabletSession lifecycle state machine
ServerlessWorkflow DSL Specification

Appendix A: Glossary

Term	Definition
ScenarioEngine (SE)	Separate microservice that executes pod automation jobs
PodDefinition	SE-owned entity representing synced content with lifecycle
Scenario	A registered Python function that performs a specific automation task
Adapter	Infrastructure-specific implementation of pod operations
PodType	Enum declaring infrastructure requirement (CML_ON_AWS, ROC_RADKIT, etc.)
Job	A single execution of a lifecycle phase by the SE
DSL	The proprietary workflow definition language (ServerlessWorkflow-inspired)
PAv1/	POD Automation v1 — content folder containing lifecycle definitions

Appendix B: Decision Log

Code	Decision	Date
AD-SE-01	SE persists PodDefinition; registry is in-memory	2026-06-05
AD-SE-02	Expression language = jq	2026-06-05
AD-SE-03	Option C: state in LCM, I/O in SE, shared LAB.zip handling	2026-06-05
AD-SE-04	Dual adapter selection (content declares + worker matches)	2026-06-05