
ADR-025: Content Metadata Storage in MongoDB

| Attribute | Value |
| --- | --- |
| Status | Accepted |
| Date | 2026-02-25 |
| Deciders | Architecture Team |
| Related ADRs | ADR-024 (Package Storage in RustFS), ADR-005 (Dual State Store) |
| Implementation | Content Synchronization Plan §2 (AD-CS-003), §3 (Phase 1), §4.3 |

Context

When the lablet-controller's ContentSyncService downloads a content package from Mosaic, it extracts several metadata artifacts:

| Artifact | Source File | Used By | When |
| --- | --- | --- | --- |
| CML YAML content | cml.yaml / cml.yml | lablet-controller | Lab import during LabletSession instantiation |
| CML YAML hash | SHA-256 of cml.yaml | CPA | Content change detection between syncs |
| Devices JSON | devices.json | lablet-controller | LDS session creation (device definitions) |
| Grade XML path | grade.xml (relative path) | Grading Engine | Grading ruleset location |
| Upstream version | mosaic_meta.json → Version | CPA/UI | Display to user, change tracking |
| Upstream publish date | mosaic_meta.json → DatePublished | CPA/UI | Freshness indicator |
| Upstream instance name | mosaic_meta.json → InstanceName | CPA | Mosaic instance identification |
| Upstream form ID | mosaic_meta.json → FormId | CPA | Cross-reference with Mosaic |
| Content package hash | SHA-256 of entire zip | CPA | Immutability check, version auto-increment |

Two storage strategies were considered:

  1. MongoDB fields: Store extracted metadata as fields on LabletDefinitionState in MongoDB
  2. S3 sidecar files: Extract files from zip and store alongside the archive in RustFS

Decision

Extract metadata from the downloaded zip during sync, then store it as fields on the LabletDefinitionState aggregate in MongoDB. Do NOT store extracted files in S3.

LabletDefinitionState Fields

```python
class LabletDefinitionState(AggregateState[str]):
    # ... existing fields ...

    # Content metadata (populated by content sync)
    content_package_hash: str | None = None    # SHA-256 of the entire zip archive
    upstream_version: str | None = None        # Mosaic Version
    upstream_date_published: str | None = None # Mosaic DatePublished
    upstream_instance_name: str | None = None  # Mosaic InstanceName
    upstream_form_id: str | None = None        # Mosaic FormId
    grade_xml_path: str | None = None          # Relative path within zip
    cml_yaml_path: str | None = None           # Relative path within zip
    cml_yaml_content: str | None = None        # Full YAML content (for lab import)
    devices_json: str | None = None            # Full JSON content (for LDS)
```

Sync Result Flow

```text
ContentSyncService → extracts metadata from zip
  → POST /api/internal/lablet-definitions/{id}/content-synced
  → RecordContentSyncResultCommandHandler
  → definition.record_content_sync(hash, metadata...)
  → MongoDB (LabletDefinitionState updated)
```

Rationale

Why MongoDB (not S3)?

  • Simpler access pattern: lablet-controller already fetches the full LabletDefinition from CPA API during LabletSession instantiation. Metadata is included in the response, so no additional S3 read is needed.
  • Atomic update: All metadata fields are updated in a single MongoDB write operation via the aggregate's record_content_sync() method.
  • Query support: MongoDB supports filtering by metadata fields (e.g., "find all definitions from Mosaic instance X" or "find definitions with outdated upstream_version").
  • No S3 dependency for reads: LabletSession instantiation only needs metadata, not the full package. Decoupling from S3 for reads reduces failure modes.
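To make the query-support point concrete, the two example queries above translate into filter documents like the following. A tiny in-memory matcher stands in for a live MongoDB here; the filter syntax is standard MongoDB, but the sample documents are fabricated for illustration:

```python
# Filter documents as they would be passed to a pymongo collection.find().
by_instance = {"upstream_instance_name": "mosaic-prod"}
outdated = {"upstream_version": {"$ne": "2.0"}}


def matches(doc: dict, query: dict) -> bool:
    """Minimal matcher supporting equality and $ne, enough for these filters."""
    for key, cond in query.items():
        value = doc.get(key)
        if isinstance(cond, dict) and "$ne" in cond:
            if value == cond["$ne"]:
                return False
        elif value != cond:
            return False
    return True


# Fabricated sample documents standing in for LabletDefinitionState records.
docs = [
    {"id": "a", "upstream_instance_name": "mosaic-prod", "upstream_version": "2.0"},
    {"id": "b", "upstream_instance_name": "mosaic-prod", "upstream_version": "1.4"},
    {"id": "c", "upstream_instance_name": "mosaic-dev", "upstream_version": "2.0"},
]

prod = [d["id"] for d in docs if matches(d, by_instance)]    # ["a", "b"]
stale = [d["id"] for d in docs if matches(d, outdated)]      # ["b"]
```

With the metadata in S3 sidecar files, either query would require listing and reading objects per definition; as MongoDB fields they are a single indexed filter.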

Why not S3 sidecar files?

  • Would require additional S3 GET operations during session instantiation
  • Adds an S3 dependency for metadata reads (currently only needed for package delivery to LDS)
  • Multiple S3 objects increase sync complexity (partial failure, consistency)
  • No consumer needs individual extracted files from S3 (LDS uses the zip, lablet-controller uses metadata from MongoDB)

Content size considerations

  • cml_yaml_content: Typically 5-50 KB (lab topology YAML), well within MongoDB document limits
  • devices_json: Typically 1-10 KB (device definitions), negligible
  • Total metadata footprint per definition: < 100 KB in the worst case

Consequences

Positive

  • Single source of truth for definition + metadata (MongoDB document)
  • No additional S3 reads during session instantiation
  • Rich query capabilities on metadata fields
  • Consistent with existing aggregate state pattern (Neuroglia AggregateState)

Negative

  • MongoDB document size increases per definition (< 100 KB overhead, negligible)
  • Full CML YAML stored in MongoDB (could be large for complex topologies, but still within limits)
  • Metadata is a copy of what's in the zip; potential staleness if the zip is modified outside the system (not a realistic scenario)

Risks

  • Very large CML YAML files (> 1 MB) could affect MongoDB performance (mitigated: typical files are < 50 KB, and even pathological files remain far below MongoDB's 16 MB document limit)