Worker Templates

Version: 1.0.0 (February 2026)
Scope: Resource Scheduler (template selection) + Control Plane API (template management)
Phase: Phase 3 - Auto-Scaling

1. Overview

Worker templates define predefined EC2 instance configurations for CML workers. They serve as the capacity model for:

- Placement decisions: matching lablet instances to appropriately sized workers
- Scale-up provisioning: selecting which EC2 instance type to launch
- Cost optimization: choosing the cheapest template that satisfies requirements

Templates are managed as configuration (not user-created entities) per ADR-007. They are loaded from YAML on startup and seeded to MongoDB.

2. Template Lifecycle

flowchart TD
    YAML["config/worker_templates.yaml<br/>(5 template definitions)"] --> Load

    subgraph startup["Application Startup"]
        Load["WorkerTemplateService<br/>load_templates_from_yaml()"]
        Validate["Validate & parse<br/>each template definition"]
        Seed["seed_templates_async()<br/>upsert by name"]
        Load --> Validate --> Seed
    end

    Seed --> MongoDB[("MongoDB<br/>worker_templates collection")]

    MongoDB --> Query["Query at runtime"]

    Query --> Placement["PlacementEngine<br/>_select_template()"]
    Query --> ScaleUp["RequestScaleUpCommand<br/>_resolve_template()"]
    Query --> API["API queries<br/>list/get templates"]

    style YAML fill:#FFF9C4,stroke:#F57F17
    style MongoDB fill:#E3F2FD,stroke:#1565C0
    style startup fill:#E8F5E9,stroke:#2E7D32

Startup Seeding (WorkerTemplateSeederHostedService)

The WorkerTemplateSeederHostedService runs during application startup:

1. Reads config/worker_templates.yaml
2. Parses each template definition into a WorkerTemplate aggregate
3. Upserts by name: existing templates are updated and new ones are created
4. Logs the count of seeded templates

Resilient Startup

Template seeding failures are logged but do not fail startup. Templates can be added later via API or by restarting with corrected YAML.
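
The seeding behaviour described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the service's actual code: a plain dict stands in for the MongoDB collection, and `seed_templates` and `seed_on_startup` are hypothetical helper names.

```python
import logging
from typing import Callable

logger = logging.getLogger("worker_template_seeder")


def seed_templates(store: dict[str, dict], templates: list[dict]) -> int:
    """Upsert each template by name and return how many were seeded."""
    for template in templates:
        # Upsert: an existing entry with the same name is overwritten,
        # while an unknown name creates a new entry.
        store[template["name"]] = dict(template)
    logger.info("Seeded %d worker templates", len(templates))
    return len(templates)


def seed_on_startup(store: dict[str, dict], load: Callable[[], list[dict]]) -> int:
    """Run seeding without letting a failure abort application startup."""
    try:
        return seed_templates(store, load())
    except Exception:
        # Log and continue; templates can still be added later.
        logger.exception("Template seeding failed; continuing startup")
        return 0
```

Catching the failure at the outermost seeding call keeps a missing or malformed YAML file from blocking the rest of the application.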

3. Capacity Model

Template Definitions

Template	Instance Type	CPU	Memory	Storage	Max Nodes	Cost/hr (USD)	Use Case
micro	`t3.micro`	2	1 GB	20 GB	2	$0.0104	Testing, development
small	`t3.small`	2	2 GB	50 GB	5	$0.0208	Simple labs (≤5 nodes)
medium	`t3.medium`	2	4 GB	100 GB	15	$0.0416	Moderate workloads
large	`t3.large`	2	8 GB	200 GB	30	$0.0832	Complex labs
metal	`m5zn.metal`	48	192 GB	1000 GB	200	$3.9641	Production, nested virt

Metal Instances

The metal template uses m5zn.metal instances, which cost roughly 48 times as much as large ($3.96/hr vs. $0.08/hr) and are slower to provision (2-5 min boot time). They support the nested virtualization required by certain CML node types.

WorkerCapacity Value Object

The WorkerCapacity dataclass is an immutable value object used to represent and compare capacity:

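from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerCapacity:
    cpu_cores: int  # Logical CPU cores
    memory_gb: int  # Memory in GB
    storage_gb: int  # Root volume in GB
    max_nodes: int | None = None  # Soft limit on CML nodes (optional)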

Key operations:

Method	Purpose
`can_fit(required)`	Check if required capacity fits within this capacity
`subtract(other)`	Calculate remaining capacity (clamps to 0)
`zero()`	Create zero-capacity instance for initialization
`from_dict(data)`	Deserialize from dictionary (CloudEvent-compatible)

4. Template Selection Algorithm

The placement engine uses a 3-tier fallback strategy to select a template for scale-up:

flowchart TD
    Start["_select_template(requirements)"] --> T1

    subgraph tier1["Tier 1: Cost-Optimized"]
        T1["Get enabled templates"] --> T1F["Filter: capacity >= requirements"]
        T1F --> T1S["Sort by cost_per_hour ASC"]
        T1S --> T1C{"Matches?"}
    end

    T1C -->|Yes| Result1["✅ Cheapest viable template"]

    T1C -->|No| T2

    subgraph tier2["Tier 2: Largest Available"]
        T2["Get all enabled templates"] --> T2S["Sort by CPU cores DESC"]
        T2S --> T2C{"Any templates?"}
    end

    T2C -->|Yes| Result2["⚠️ Largest template + warning"]

    T2C -->|No| T3

    subgraph tier3["Tier 3: Hardcoded Fallback"]
        T3{"Required CPU?"} -->|">= 32"| Metal["metal"]
        T3 -->|">= 16"| Large["large"]
        T3 -->|">= 4"| Medium["medium"]
        T3 -->|"< 4"| Small["small"]
    end

    style tier1 fill:#E8F5E9,stroke:#2E7D32
    style tier2 fill:#FFF3E0,stroke:#E65100
    style tier3 fill:#FFEBEE,stroke:#C62828
    style Result1 fill:#4CAF50,color:white
    style Result2 fill:#FF9800,color:white
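
The three tiers in the flowchart can be sketched in a few lines. This is an illustrative approximation, not the PlacementEngine code: `Template` is a hypothetical, simplified stand-in for WorkerTemplate, and only CPU and memory are compared.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:  # Hypothetical simplified stand-in for WorkerTemplate.
    name: str
    cpu_cores: int
    memory_gb: int
    cost_per_hour_usd: float
    enabled: bool = True


def select_template(templates: list[Template], cpu: int, memory_gb: int) -> tuple[str, str]:
    """Return (template_name, tier) for the given requirements."""
    enabled = [t for t in templates if t.enabled]

    # Tier 1: cheapest enabled template that satisfies the requirements.
    viable = [t for t in enabled if t.cpu_cores >= cpu and t.memory_gb >= memory_gb]
    if viable:
        return min(viable, key=lambda t: t.cost_per_hour_usd).name, "cost-optimized"

    # Tier 2: nothing fits, so take the largest enabled template by CPU cores.
    if enabled:
        return max(enabled, key=lambda t: t.cpu_cores).name, "largest-available"

    # Tier 3: no templates at all, so fall back to hardcoded CPU thresholds.
    if cpu >= 32:
        return "metal", "hardcoded"
    if cpu >= 16:
        return "large", "hardcoded"
    if cpu >= 4:
        return "medium", "hardcoded"
    return "small", "hardcoded"
```

Returning the tier alongside the name makes it easy for a caller to log a warning whenever the result did not come from the cost-optimized path.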

TemplateSelection Result

The select_optimal_template_async method returns a TemplateSelection dataclass:

from dataclasses import dataclass


@dataclass
class TemplateSelection:
    template: WorkerTemplate        # Selected template
    match_reason: str               # Why this template was chosen
    excess_capacity: WorkerCapacity  # Unused capacity above requirements
    cost_ranking: int               # 0 = cheapest matching option

Headroom Selection

For workloads expecting spikes, select_template_with_headroom_async adds a capacity buffer:

$$ \text{adjusted\_capacity} = \text{required\_capacity} \times \left(1 + \frac{\text{headroom\%}}{100}\right) $$

Default headroom is 20%, meaning a requirement of 10 CPU cores would search for templates with ≥ 12 cores.
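
As a worked example of the formula, the helper below scales a single integer dimension. The function name `apply_headroom` is hypothetical, and rounding up to the next whole unit is an assumption; the page does not say how fractional results are handled.

```python
def apply_headroom(required: int, headroom_percent: int = 20) -> int:
    """Scale a requirement by (1 + headroom% / 100), rounding up."""
    # Integer ceiling division avoids floating-point rounding surprises.
    return (required * (100 + headroom_percent) + 99) // 100


# 10 cores with the default 20% headroom gives a search target of 12 cores.
adjusted_cpu = apply_headroom(10)
```

With this rounding rule a requirement of 7 cores becomes 9 (7 x 1.2 = 8.4, rounded up), so the buffer is never smaller than the requested percentage.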

5. Domain Model

WorkerTemplate Aggregate

flowchart TD
    subgraph agg["WorkerTemplate (AggregateRoot)"]
        direction TB
        State["WorkerTemplateState"]
        Events["Domain Events"]
    end

    subgraph fields["State Fields"]
        direction TB
        F1["name: str (unique)"]
        F2["instance_type: Ec2InstanceType"]
        F3["capacity: WorkerCapacity"]
        F4["cost_per_hour_usd: float"]
        F5["ami_name_pattern: str"]
        F6["enabled: bool"]
    end

    subgraph events["Events"]
        direction TB
        E1["WorkerTemplateCreatedDomainEvent"]
        E2["WorkerTemplateUpdatedDomainEvent"]
        E3["WorkerTemplateDisabledDomainEvent"]
        E4["WorkerTemplateEnabledDomainEvent"]
    end

    State --> fields
    Events --> events

    style agg fill:#E3F2FD,stroke:#1565C0

Event-Sourced State Transitions

Event	Handler Action	Effect
`Created`	Sets id, name, instance_type, capacity, cost	New template registered
`Updated`	Applies change dict to modified fields	Template configuration changed
`Disabled`	Sets `enabled = False`	Template excluded from selection
`Enabled`	Sets `enabled = True`	Template available for selection
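
The transitions in the table can be illustrated with a small replay loop. This sketch makes several assumptions: events are modelled as `(type, payload)` tuples, the `changes` payload key is hypothetical, and `capacity` is kept as a plain dict for brevity.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class WorkerTemplateState:
    id: str = ""
    name: str = ""
    instance_type: str = ""
    capacity: dict[str, int] = field(default_factory=dict)
    cost_per_hour_usd: float = 0.0
    enabled: bool = True


def apply_event(state: WorkerTemplateState, event_type: str, payload: dict[str, Any]) -> None:
    """Mutate `state` according to a single event."""
    if event_type == "Created":
        for key in ("id", "name", "instance_type", "capacity", "cost_per_hour_usd"):
            setattr(state, key, payload[key])
    elif event_type == "Updated":
        # Only the fields present in the change dict are touched.
        for key, value in payload["changes"].items():
            if not hasattr(state, key):
                raise ValueError(f"Unknown field in update: {key}")
            setattr(state, key, value)
    elif event_type == "Disabled":
        state.enabled = False
    elif event_type == "Enabled":
        state.enabled = True
    else:
        raise ValueError(f"Unknown event type: {event_type}")


def replay(events: list[tuple[str, dict[str, Any]]]) -> WorkerTemplateState:
    """Rebuild the current state by applying events in order."""
    state = WorkerTemplateState()
    for event_type, payload in events:
        apply_event(state, event_type, payload)
    return state
```

Replaying a `Created`, `Updated`, `Disabled` sequence yields a state with the updated fields and `enabled = False`, which is what excludes the template from selection.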

Ec2InstanceType Enum

Instance type mapping from friendly names:

Friendly Name	AWS Instance Type	In Template
`MICRO`	`t3.micro`	micro
`SMALL`	`t3.small`	small
`MEDIUM`	`t3.medium`	medium
`LARGE`	`t3.large`	large
`METAL`	`m5zn.metal`	metal

6. WorkerTemplateService API

Service Methods

Method	Arguments	Returns	Description
`load_templates_from_yaml`	`config_path: str`	`list[WorkerTemplate]`	Parse YAML into entities
`load_templates_from_dict`	`templates_data: list[dict]`	`list[WorkerTemplate]`	Parse dicts (for tests)
`seed_templates_async`	`templates: list`	`int` (count)	Upsert to MongoDB by name
`get_template_by_name_async`	`name: str`	`WorkerTemplate`	Lookup by name
`list_enabled_templates_async`	—	`list[WorkerTemplate]`	All enabled, ordered by cost
`list_all_templates_async`	—	`list[WorkerTemplate]`	All templates
`select_optimal_template_async`	`required_capacity`	`TemplateSelection`	Cheapest viable template
`select_template_with_headroom_async`	`capacity, headroom%`	`TemplateSelection`	With capacity buffer
`find_all_matching_templates_async`	`required_capacity`	`list[TemplateSelection]`	All viable, ranked by cost

Custom Exceptions

Exception	When Raised	Handling
`TemplateLoadError`	YAML file missing, empty, or unparseable	Startup continues with warning
`TemplateValidationError`	Invalid template definition (missing fields, bad types)	Startup fails for that template
`TemplateNotFoundError`	`get_template_by_name_async` with unknown name	404 in API responses
`NoMatchingTemplateError`	No template can satisfy capacity requirements	Triggers Tier 2/3 fallback

7. YAML Configuration Format

Templates are defined in config/worker_templates.yaml:

templates:
  - name: medium                     # Unique identifier
    description: Medium worker...    # Human-readable description
    instance_type: medium            # Friendly name or AWS type
    ami_name_pattern: 'CML-*'        # AMI name filter pattern
    capacity:
      cpu_cores: 2                   # Logical CPU cores
      memory_gb: 4                   # Memory in GB
      storage_gb: 100                # Root volume in GB
      max_nodes: 15                  # Soft limit on CML nodes
    cost_per_hour_usd: 0.0416       # Approximate hourly cost
    enabled: true                    # Available for provisioning
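
A definition like the one above is parsed into a dict before it becomes a template. The sketch below shows one way such a dict could be validated. Which fields are mandatory, the `enabled` default, and the `ValueError` base class are all assumptions; only the exception name comes from this page.

```python
class TemplateValidationError(ValueError):
    """Raised when a template definition is invalid."""


# Assumed required keys, based on the example definition in this section.
REQUIRED_FIELDS = ("name", "instance_type", "capacity", "cost_per_hour_usd")
REQUIRED_CAPACITY_FIELDS = ("cpu_cores", "memory_gb", "storage_gb")


def validate_template(data: dict) -> dict:
    """Return a normalized copy of `data` or raise TemplateValidationError."""
    missing = [key for key in REQUIRED_FIELDS if key not in data]
    if missing:
        raise TemplateValidationError(f"Missing fields: {', '.join(missing)}")

    capacity = data["capacity"]
    if not isinstance(capacity, dict):
        raise TemplateValidationError("capacity must be a mapping")
    for key in REQUIRED_CAPACITY_FIELDS:
        value = capacity.get(key)
        # bool is a subclass of int, so it is rejected explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            raise TemplateValidationError(f"capacity.{key} must be a positive integer")

    cost = data["cost_per_hour_usd"]
    if isinstance(cost, bool) or not isinstance(cost, (int, float)) or cost < 0:
        raise TemplateValidationError("cost_per_hour_usd must be a non-negative number")

    # Assumption: a template without an explicit flag is treated as enabled.
    return {**data, "enabled": data.get("enabled", True)}
```

Validating each definition separately means one bad entry can be reported by name without obscuring problems in the others.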

Instance Type Resolution

The _create_template_from_dict method resolves instance types through a 3-step chain:

1. Friendly name (micro, small, medium, large, metal): mapped to the matching Ec2InstanceType enum member
2. AWS instance type (e.g. t3.small, m5zn.metal): parsed as an enum value
3. Fallback mapping (e.g. m5.xlarge): mapped to the closest enum member
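
The resolution chain above could be implemented roughly as shown below. The enum members mirror the Ec2InstanceType table on this page; `resolve_instance_type` and the `FALLBACK_TYPES` table are hypothetical, and the choice of `LARGE` for `m5.xlarge` is a guess at what "closest" means.

```python
from enum import Enum


class Ec2InstanceType(Enum):
    MICRO = "t3.micro"
    SMALL = "t3.small"
    MEDIUM = "t3.medium"
    LARGE = "t3.large"
    METAL = "m5zn.metal"


# Hypothetical fallback table; the real mapping is not documented here.
FALLBACK_TYPES = {"m5.xlarge": Ec2InstanceType.LARGE}


def resolve_instance_type(value: str) -> Ec2InstanceType:
    """Resolve a friendly name or AWS instance type to an enum member."""
    key = value.strip()

    # Step 1: friendly name, looked up by enum member name.
    try:
        return Ec2InstanceType[key.upper()]
    except KeyError:
        pass

    # Step 2: AWS instance type, looked up by enum value.
    try:
        return Ec2InstanceType(key.lower())
    except ValueError:
        pass

    # Step 3: explicit fallback mapping for types outside the enum.
    if key.lower() in FALLBACK_TYPES:
        return FALLBACK_TYPES[key.lower()]

    raise ValueError(f"Unknown instance type: {value!r}")
```

Trying the lookups from most to least specific lets a YAML file use either `medium` or `t3.medium` for the same template.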

8. Configuration Reference

Setting	Env Variable	Default	Description
`worker_templates_auto_seed`	`WORKER_TEMPLATES_AUTO_SEED`	`True`	Auto-seed templates on startup
`worker_templates_config_path`	`WORKER_TEMPLATES_CONFIG_PATH`	`config/worker_templates.yaml`	Path to YAML file