
ADR-031: Checkpoint-Based Instantiation Pipeline

| Attribute | Value |
| --- | --- |
| Status | Accepted |
| Date | 2026-03-02 |
| Deciders | Architecture Team |
| Related ADRs | ADR-004 (Port Allocation), ADR-017 (Lab Operations via Lablet-Controller), ADR-020 (Session Entity Model), ADR-029 (Port Template Extraction), ADR-030 (Resource Observation), ADR-032 (Port Allocation on LabRecord), ADR-033 (CML Node Tag Sync) |
| Implementation | Instantiation Pipeline Plan |

Context

When a LabletSession transitions from SCHEDULED to INSTANTIATING, the lablet-controller executes a multi-step provisioning sequence: verify content, resolve/import the CML lab, allocate ports, sync CML node tags, bind the lab to the session, start the lab, provision LDS access, and mark the session ready.

The current implementation handles this as a monolithic waterfall inside _handle_instantiating() on the lablet-controller's reconciler. This approach has critical limitations:

  1. No recovery: If the reconciler crashes mid-sequence (e.g., after port allocation but before lab start), the entire sequence restarts from scratch on the next reconciliation cycle — potentially leaking previously allocated ports or creating duplicate resources.
  2. No visibility: Operators have no way to see which step the instantiation is on, what failed, or how long each step took.
  3. No per-step retry: A transient failure in one step (e.g., CML API timeout during lab start) forces the entire sequence to restart, including steps that already completed successfully.
  4. No step-level metrics: There is no data on per-step durations, failure rates, or bottlenecks — essential for capacity planning and troubleshooting.

The drain-loop fix (AD-DRAIN-001) resolved a prerequisite issue where the WatchTriggeredHostedService could skip events due to a race condition. With that fixed, the reconciler can now reliably re-enter _handle_instantiating() on each cycle — making a checkpoint-based approach viable.

Decision

1. DAG-Based Pipeline with Persistent Checkpoints

Replace the monolithic _handle_instantiating() with a Directed Acyclic Graph (DAG) of 9 named steps. Each step declares its prerequisites; next_executable_step() resolves which step to execute next based on dependency completion.

Pipeline steps (DAG order):

content_sync ──────────┐
                       ├──→ lab_resolve ──→ ports_alloc ──→ tags_sync ──→ lab_binding ──→ lab_start ──→ lds_provision ──→ mark_ready
variables (skip) ──────┘
| Step | Requires | Service | Description |
| --- | --- | --- | --- |
| content_sync | (none) | lablet-controller | Verify LabletDefinition content is synced and available |
| variables | (none) | lablet-controller | Variable substitution (placeholder — skipped for now) |
| lab_resolve | content_sync | lablet-controller → CML SPI | Resolve or import lab topology on the target CML worker |
| ports_alloc | lab_resolve | lablet-controller → CPA | Allocate ports from worker range, write to LabRecord.allocated_ports |
| tags_sync | ports_alloc | lablet-controller → CML SPI | Write allocated ports as CML node tags via PATCH API |
| lab_binding | tags_sync | lablet-controller → CPA | Bind LabRecord to session, create LabRunRecord, denormalize ports |
| lab_start | lab_binding | lablet-controller → CML SPI | Start CML lab, poll until BOOTED |
| lds_provision | lab_start | lablet-controller → LDS SPI | Create LDS UserSession, set devices with port mappings |
| mark_ready | lds_provision | lablet-controller → CPA | Transition session INSTANTIATING → READY |

2. InstantiationProgress Value Object

Progress is tracked via an InstantiationProgress value object persisted on the LabletSession aggregate:

@dataclass(frozen=True)
class StepResult:
    step: str                           # Step name (e.g., "content_sync")
    status: str                         # "pending" | "running" | "completed" | "failed" | "skipped"
    started_at: datetime | None
    completed_at: datetime | None
    error_message: str | None
    attempt_count: int                  # Number of attempts (for retry tracking)
    requires: tuple[str, ...]           # Prerequisites (step names)
    output: dict | None                 # Step-specific output data

@dataclass(frozen=True)
class InstantiationProgress:
    steps: tuple[StepResult, ...]
    pipeline_started_at: datetime | None
    pipeline_completed_at: datetime | None

    def next_executable_step(self) -> StepResult | None:
        """Return the next step whose prerequisites are all completed."""
        completed = {s.step for s in self.steps if s.status == "completed"}
        for step in self.steps:
            if step.status in ("pending", "failed"):
                if all(req in completed for req in step.requires):
                    return step
        return None

3. One Step per Reconciliation Cycle

Each reconciliation cycle executes at most one step, then persists the updated progress to the CPA and requeues. This ensures:

  • Crash safety: Progress is checkpointed between steps — a crash never loses more than one step's work
  • Fair scheduling: Other sessions get reconciliation cycles between steps
  • Observable progress: Each step update triggers a domain event and SSE broadcast

4. Step Dispatch in Reconciler

Steps are implemented as methods on the reconciler class (Option A: inline). The dispatcher maps step names to handler methods:

async def _handle_instantiating(self, instance):
    progress = instance.instantiation_progress or self._build_default_progress(instance)

    next_step = progress.next_executable_step()
    if not next_step:
        return ReconciliationResult.success()

    handler = {
        "content_sync": self._step_content_sync,
        "variables": self._step_variables,
        "lab_resolve": self._step_lab_resolve,
        "ports_alloc": self._step_ports_alloc,
        "tags_sync": self._step_tags_sync,
        "lab_binding": self._step_lab_binding,
        "lab_start": self._step_lab_start,
        "lds_provision": self._step_lds_provision,
        "mark_ready": self._step_mark_ready,
    }.get(next_step.step)
    if handler is None:
        # Guard: a progress record may name a step this reconciler does not
        # implement (e.g. written by a newer pipeline definition).
        raise ValueError(f"Unknown pipeline step: {next_step.step}")

    result = await handler(instance, progress)
    await self._api.update_instantiation_progress(instance.id, result)

    if result.status == "failed":
        return ReconciliationResult.requeue(f"Step {result.step} failed")
    return ReconciliationResult.requeue(f"Step {result.step} {result.status}")

5. Early Timeslot Expiry Check

Before executing any pipeline step, the reconciler checks whether the session's timeslot has expired. If expired, the session is transitioned directly to EXPIRED without completing remaining pipeline steps. This prevents wasted work on sessions that will immediately be torn down.

6. Default Pipeline Factory

For sessions that don't have custom pipeline configuration, a default factory generates the standard 9-step DAG:

def default_pipeline() -> InstantiationProgress:
    return InstantiationProgress(steps=(
        StepResult(step="content_sync", requires=(), ...),
        StepResult(step="variables", requires=(), ...),
        StepResult(step="lab_resolve", requires=("content_sync",), ...),
        StepResult(step="ports_alloc", requires=("lab_resolve",), ...),
        StepResult(step="tags_sync", requires=("ports_alloc",), ...),
        StepResult(step="lab_binding", requires=("tags_sync",), ...),
        StepResult(step="lab_start", requires=("lab_binding",), ...),
        StepResult(step="lds_provision", requires=("lab_start",), ...),
        StepResult(step="mark_ready", requires=("lds_provision",), ...),
    ))
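
Putting the factory and the resolver together, a runnable sketch of how the DAG resolves under one-step-per-cycle execution. StepResult is pared down to the fields the resolver needs, and `variables` is marked skipped on its turn (an assumption about how the placeholder step is handled):

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class StepResult:
    step: str
    requires: tuple[str, ...] = ()
    status: str = "pending"

@dataclass(frozen=True)
class InstantiationProgress:
    steps: tuple[StepResult, ...]

    def next_executable_step(self):
        completed = {s.step for s in self.steps if s.status == "completed"}
        for step in self.steps:
            if step.status in ("pending", "failed"):
                if all(req in completed for req in step.requires):
                    return step
        return None

def default_pipeline() -> InstantiationProgress:
    order = [
        ("content_sync", ()),
        ("variables", ()),                  # placeholder root, skipped for now
        ("lab_resolve", ("content_sync",)),
        ("ports_alloc", ("lab_resolve",)),
        ("tags_sync", ("ports_alloc",)),
        ("lab_binding", ("tags_sync",)),
        ("lab_start", ("lab_binding",)),
        ("lds_provision", ("lab_start",)),
        ("mark_ready", ("lds_provision",)),
    ]
    return InstantiationProgress(
        steps=tuple(StepResult(name, reqs) for name, reqs in order))

# Simulate one step per reconciliation cycle until the pipeline drains.
progress = default_pipeline()
executed = []
while (step := progress.next_executable_step()) is not None:
    new_status = "skipped" if step.step == "variables" else "completed"
    progress = replace(progress, steps=tuple(
        replace(s, status=new_status) if s.step == step.step else s
        for s in progress.steps))
    executed.append(step.step)
```

The simulation drains the nine steps in dependency order; a crash at any point simply resumes from whatever steps are already marked completed in the persisted progress.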

Rationale

Why a DAG (not linear ordering)?

Steps have genuine data dependencies: tags_sync needs the output of ports_alloc; lab_binding needs tags written; lab_start needs binding. However, content_sync and variables are independent DAG roots that could execute in parallel. A DAG is the natural representation of these dependencies and enables future parallelism for independent steps.

Why one step per cycle (not run-to-completion)?

  • Crash safety: The worst case is re-executing one step after a crash. Run-to-completion could lose work across multiple steps.
  • Fairness: In a multi-session environment, running all 9 steps in one cycle starves other sessions. One-step-per-cycle ensures round-robin fairness.
  • Observability: Progress updates are persisted and broadcast after each step, giving operators real-time visibility.

Why inline methods (not a pipeline engine)?

  • 9 steps is below the complexity threshold where a dedicated pipeline engine adds value.
  • Inline methods keep the code discoverable — all provisioning logic lives in one class.
  • If pipeline logic grows (custom per-definition pipelines, parallel execution), extraction to a dedicated InstantiationPipeline class is straightforward.
  • Avoids introducing another framework dependency (e.g., Temporal, Prefect) for what is fundamentally a finite, bounded workflow.

Why not a Saga pattern?

Steps don't have meaningful compensating actions. You can't "un-import" a lab or "un-allocate" ports in a way that restores the system to a clean state. The checkpoint approach acknowledges that some steps are idempotent (can be safely retried) and others require manual intervention on failure.

Consequences

Positive

  • Crash recovery: Pipeline resumes from the last completed checkpoint — no wasted work or resource leaks
  • Per-step visibility: Operators can see exactly which step failed, when, and after how many attempts
  • Retry capability: Failed steps can be retried individually without restarting the entire pipeline
  • Metrics: Step-level duration and failure rates enable bottleneck identification and capacity planning
  • Extensibility: New steps can be added to the DAG by declaring prerequisites — no change to the executor logic
  • Fair scheduling: One-step-per-cycle ensures all sessions make progress in a multi-tenant environment

Negative

  • More reconciliation cycles: A 9-step pipeline requires at least 9 reconciliation cycles (vs. 1 for monolithic). Each cycle incurs etcd watch → HTTP POST → MongoDB write overhead.
  • Increased complexity: The reconciler grows with 9 step methods and DAG resolution logic. Mitigated by keeping steps as simple single-responsibility methods.
  • State size: InstantiationProgress adds ~2-3KB per session to the MongoDB document. Negligible at expected scale.

Risks

  • Stale progress: If the CPA is unreachable when the reconciler tries to persist progress, the step result is lost and must be re-executed. Mitigated: steps are designed to be idempotent.
  • DAG cycle detection: A misconfigured prerequisite could create a cycle, causing the pipeline to stall. Mitigated: next_executable_step() returns None if no step is executable (deadlock detection).
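
The stall condition described above can be surfaced explicitly: work remains but no step is executable. A sketch, assuming a hypothetical `is_stalled` helper on top of the same resolver logic:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StepResult:
    step: str
    requires: tuple[str, ...] = ()
    status: str = "pending"

def next_executable_step(steps):
    completed = {s.step for s in steps if s.status == "completed"}
    for step in steps:
        if step.status in ("pending", "failed") and \
                all(r in completed for r in step.requires):
            return step
    return None

def is_stalled(steps) -> bool:
    """Deadlock check: steps remain but none is executable (e.g. a requires cycle)."""
    remaining = any(s.status in ("pending", "failed") for s in steps)
    return remaining and next_executable_step(steps) is None

# A misconfigured cycle: a requires b, b requires a.
cyclic = (StepResult("a", requires=("b",)), StepResult("b", requires=("a",)))
healthy = (StepResult("a"), StepResult("b", requires=("a",)))
```

A reconciler that sees `is_stalled(...)` return True can raise an alert instead of silently requeueing forever.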

Implementation Notes

Cross-Service Changes

| Service | Changes |
| --- | --- |
| control-plane-api | InstantiationProgress VO, UpdateInstantiationProgressCommand, LabletSessionInstantiationProgressUpdatedDomainEvent, new internal API endpoints |
| lcm-core | instantiation_progress field on LabletSessionReadModel, update_instantiation_progress() on ControlPlaneApiClient |
| lablet-controller | DAG-based _handle_instantiating(), 9 step methods, early timeslot expiry check, default pipeline factory |
| resource-scheduler | Scheduler passes allocated_ports={} (port allocation moved to pipeline) |

Migration

  • Existing sessions with instantiation_progress = None are handled gracefully: sessions past INSTANTIATING get all-completed progress; sessions currently INSTANTIATING get a fresh default pipeline.
  • No breaking API changes — new endpoints are additive.
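
The backfill rule for existing sessions can be sketched as a small pure function (the `backfill_progress` name and signature are assumptions; the pared-down StepResult omits fields irrelevant to the migration decision):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StepResult:
    step: str
    status: str = "pending"

STEP_ORDER = ("content_sync", "variables", "lab_resolve", "ports_alloc",
              "tags_sync", "lab_binding", "lab_start", "lds_provision",
              "mark_ready")

def backfill_progress(session_state: str, existing):
    """Migration rule from this ADR: keep any existing progress as-is;
    sessions already past INSTANTIATING get all-completed progress, and
    sessions still INSTANTIATING start a fresh default pipeline."""
    if existing is not None:
        return existing
    status = "pending" if session_state == "INSTANTIATING" else "completed"
    return tuple(StepResult(name, status=status) for name in STEP_ORDER)

fresh = backfill_progress("INSTANTIATING", None)
done = backfill_progress("READY", None)
```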