# ADR-025: Content Metadata Storage in MongoDB
| Attribute | Value |
|---|---|
| Status | Accepted |
| Date | 2026-02-25 |
| Deciders | Architecture Team |
| Related ADRs | ADR-024 (Package Storage in RustFS), ADR-005 (Dual State Store) |
| Implementation | Content Synchronization Plan §2 (AD-CS-003), §3 (Phase 1), §4.3 |
## Context
When the lablet-controller's ContentSyncService downloads a content package from Mosaic, it extracts several metadata artifacts:
| Artifact | Source File | Used By | When |
|---|---|---|---|
| CML YAML content | cml.yaml / cml.yml | lablet-controller | Lab import during LabletSession instantiation |
| CML YAML hash | SHA-256 of cml.yaml | CPA | Content change detection between syncs |
| Devices JSON | devices.json | lablet-controller | LDS session creation (device definitions) |
| Grade XML path | grade.xml (relative path) | Grading Engine | Grading ruleset location |
| Upstream version | mosaic_meta.json → Version | CPA/UI | Display to user, change tracking |
| Upstream publish date | mosaic_meta.json → DatePublished | CPA/UI | Freshness indicator |
| Upstream instance name | mosaic_meta.json → InstanceName | CPA | Mosaic instance identification |
| Upstream form ID | mosaic_meta.json → FormId | CPA | Cross-reference with Mosaic |
| Content package hash | SHA-256 of entire zip | CPA | Immutability check, version auto-increment |
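The extraction step above can be sketched as follows. This is an illustrative, self-contained sketch, not the actual ContentSyncService implementation: the function name `extract_package_metadata`, the returned dict keys, and the path-matching rules are assumptions for illustration.

```python
import hashlib
import io
import json
import zipfile


def extract_package_metadata(zip_bytes: bytes) -> dict:
    """Hypothetical sketch of pulling the artifacts above out of a content package."""
    # Hash of the entire zip: used for the immutability check / auto-increment
    meta: dict = {"content_package_hash": hashlib.sha256(zip_bytes).hexdigest()}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        names = zf.namelist()
        # CML YAML: either extension is accepted; keep path, content, and hash
        cml_path = next((n for n in names if n.endswith(("cml.yaml", "cml.yml"))), None)
        if cml_path:
            cml_bytes = zf.read(cml_path)
            meta["cml_yaml_path"] = cml_path
            meta["cml_yaml_content"] = cml_bytes.decode("utf-8")
            meta["cml_yaml_hash"] = hashlib.sha256(cml_bytes).hexdigest()
        # Devices JSON is stored whole; grade.xml is stored as a relative path only
        devices_path = next((n for n in names if n.endswith("devices.json")), None)
        if devices_path:
            meta["devices_json"] = zf.read(devices_path).decode("utf-8")
        meta["grade_xml_path"] = next((n for n in names if n.endswith("grade.xml")), None)
        # Upstream fields come from mosaic_meta.json
        mosaic_path = next((n for n in names if n.endswith("mosaic_meta.json")), None)
        if mosaic_path:
            mosaic = json.loads(zf.read(mosaic_path))
            meta["upstream_version"] = mosaic.get("Version")
            meta["upstream_date_published"] = mosaic.get("DatePublished")
            meta["upstream_instance_name"] = mosaic.get("InstanceName")
            meta["upstream_form_id"] = mosaic.get("FormId")
    return meta
```

The real service additionally handles malformed archives and missing files; the sketch only shows the happy path.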
Two storage strategies were considered:

- MongoDB fields: store extracted metadata as fields on `LabletDefinitionState` in MongoDB
- S3 sidecar files: extract files from the zip and store them alongside the archive in RustFS
## Decision
Extract metadata from the downloaded zip during sync, then store it as fields on the LabletDefinitionState aggregate in MongoDB. Do NOT store extracted files in S3.
### LabletDefinitionState Fields
```python
class LabletDefinitionState(AggregateState[str]):
    # ... existing fields ...

    # Content metadata (populated by content sync)
    content_package_hash: str | None = None      # SHA-256 of the entire zip archive
    upstream_version: str | None = None          # Mosaic Version
    upstream_date_published: str | None = None   # Mosaic DatePublished
    upstream_instance_name: str | None = None    # Mosaic InstanceName
    upstream_form_id: str | None = None          # Mosaic FormId
    grade_xml_path: str | None = None            # Relative path within zip
    cml_yaml_path: str | None = None             # Relative path within zip
    cml_yaml_content: str | None = None          # Full YAML content (for lab import)
    devices_json: str | None = None              # Full JSON content (for LDS)
```
### Sync Result Flow
```
ContentSyncService → extracts metadata from zip
  → POST /api/internal/lablet-definitions/{id}/content-synced
  → RecordContentSyncResultCommandHandler
  → definition.record_content_sync(hash, metadata...)
  → MongoDB (LabletDefinitionState updated)
```
## Rationale
### Why MongoDB (not S3)?
- Simpler access pattern: lablet-controller already fetches the full LabletDefinition from the CPA API during LabletSession instantiation. Metadata is included in that response, so no additional S3 read is needed.
- Atomic update: all metadata fields are updated in a single MongoDB write operation via the aggregate's `record_content_sync()` method.
- Query support: MongoDB supports filtering by metadata fields (e.g., "find all definitions from Mosaic instance X" or "find definitions with an outdated upstream_version").
- No S3 dependency for reads: LabletSession instantiation needs only the metadata, not the full package. Decoupling reads from S3 reduces failure modes.
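The query-support point can be made concrete with plain field filters over the persisted aggregate state. The collection name `lablet_definitions` and the example values are assumptions for illustration:

```python
# Hypothetical pymongo-style filter documents over the persisted state.

# "Find all definitions from Mosaic instance X"
by_instance = {"upstream_instance_name": "mosaic-prod-1"}

# "Find synced definitions whose upstream version lags a known latest version"
outdated = {
    "content_package_hash": {"$ne": None},   # only definitions that have synced
    "upstream_version": {"$ne": "2026.02"},  # example "latest" value
}

# With a live client this would be, e.g.:
#   db.lablet_definitions.find(by_instance)
```

With sidecar files in S3, neither query is possible without listing and reading objects.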
### Why not S3 sidecar files?
- Would require additional S3 GET operations during session instantiation
- Adds an S3 dependency for metadata reads (currently only needed for package delivery to LDS)
- Multiple S3 objects increase sync complexity (partial failure, consistency)
- No consumer needs individual extracted files from S3 (LDS uses the zip, lablet-controller uses metadata from MongoDB)
### Content size considerations
- `cml_yaml_content`: typically 5-50 KB (lab topology YAML), well within MongoDB document limits
- `devices_json`: typically 1-10 KB (device definitions), negligible
- Total metadata footprint per definition: < 100 KB in the worst case
## Consequences
### Positive
- Single source of truth for definition + metadata (MongoDB document)
- No additional S3 reads during session instantiation
- Rich query capabilities on metadata fields
- Consistent with existing aggregate state pattern (Neuroglia AggregateState)
### Negative
- MongoDB document size increases per definition (< 100 KB overhead, negligible)
- Full CML YAML stored in MongoDB (could be large for complex topologies, but still within limits)
- Metadata is a copy of what's in the zip, so there is potential staleness if the zip is modified outside the system (not a realistic scenario)
### Risks
- Very large CML YAML files (> 1 MB) could affect MongoDB performance (mitigated: 16 MB document limit, typical files are < 50 KB)
## Related Documents
- Content Synchronization Implementation Plan
- ADR-024: Content Package Storage in RustFS
- ADR-005: Dual State Store Architecture
- ADR-029: Port Template Extraction from CML YAML, which extends the metadata extracted during content sync to include `port_template` (auto-derived from CML YAML `nodes[].tags`)