# Risk Register
| Attribute | Value |
|---|---|
| Document Version | 0.1.0 |
| Status | Draft |
| Created | 2026-01-16 |
| Parent | Implementation Plan |
## 1. Risk Assessment Matrix
| Likelihood ↓ \ Impact → | Low (1) | Medium (2) | High (3) |
|---|---|---|---|
| High (3) | 3 | 6 | 9 |
| Medium (2) | 2 | 4 | 6 |
| Low (1) | 1 | 2 | 3 |
Risk Score = Likelihood × Impact
- Low (1-2): Monitor, no immediate action
- Medium (3-4): Mitigate, plan contingency
- High (6-9): Critical, requires active mitigation
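The scoring rule and action bands above can be sketched as a small helper (the function name and return shape are illustrative, not part of any project API):

```python
def risk_score(likelihood: int, impact: int) -> tuple[int, str]:
    """Compute Likelihood × Impact and map it to the action band above."""
    score = likelihood * impact
    if score <= 2:
        band = "Low"       # monitor, no immediate action
    elif score <= 4:
        band = "Medium"    # mitigate, plan contingency
    else:
        band = "High"      # critical, requires active mitigation
    return score, band
```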
## 2. Technical Risks
### R-001: etcd Operational Complexity
| Attribute | Value |
|---|---|
| Category | Infrastructure |
| Likelihood | Medium (2) |
| Impact | High (3) |
| Risk Score | 6 (High) |
| Phase | 1-5 |
Description: etcd requires specialized operational knowledge; cluster management, backup/restore, and troubleshooting differ substantially from established MongoDB/Redis practice.
Mitigation Strategies:
- Consider a managed etcd service (AWS EKS control-plane etcd, Azure Cosmos DB etcd API)
- Create comprehensive operational runbooks (Task 5.6)
- Training session for operations team
- Automated backup procedures
- Monitoring dashboards with alerting
Contingency: If etcd proves too complex, evaluate MongoDB Change Streams as fallback (trade-off: less reliable watches).
### R-002: Worker Startup Time (15-20 minutes)
| Attribute | Value |
|---|---|
| Category | Performance |
| Likelihood | High (3) |
| Impact | Medium (2) |
| Risk Score | 6 (High) |
| Phase | 3 |
Description: m5zn.metal EC2 instances take 15-20 minutes to start and initialize CML. This limits elasticity and requires predictive scaling.
Mitigation Strategies:
- Predictive scaling based on scheduled timeslots
- Maintain minimum warm capacity (configurable)
- Pre-scale based on historical patterns
- Consider keeping stopped workers (faster restart, ~5 min)
- Implement warm pool for high-demand definitions
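Predictive scaling amounts to subtracting the startup lead time from each scheduled timeslot and launching the shortfall beyond warm capacity. A minimal sketch, assuming a 20-minute worst-case lead; `launch_deadline` and its parameters are illustrative, not an existing API:

```python
from datetime import datetime, timedelta

# Assumed worst-case startup lead for an m5zn.metal worker (15-20 min range above).
STARTUP_LEAD = timedelta(minutes=20)

def launch_deadline(slot_start: datetime, warm_capacity: int,
                    demand: int) -> tuple[datetime, int]:
    """Return when scaling must begin, and how many extra workers to launch,
    so that `demand` workers are ready when the timeslot starts."""
    extra = max(0, demand - warm_capacity)
    return slot_start - STARTUP_LEAD, extra
```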
Contingency: Accept longer lead times; communicate expected wait times to users.
### R-003: CML API Reliability
| Attribute | Value |
|---|---|
| Category | Integration |
| Likelihood | Medium (2) |
| Impact | Medium (2) |
| Risk Score | 4 (Medium) |
| Phase | 2 |
Description: CML API may timeout or return errors during lab import/start operations. Network issues between CCM and workers can cause failures.
Mitigation Strategies:
- Implement retry logic with exponential backoff
- Circuit breaker pattern for repeated failures
- Timeout configuration per operation type
- Health checks before operations
- Detailed error logging for troubleshooting
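The retry-with-backoff strategy above can be sketched as a generic wrapper; the retriable exception set is a placeholder, since the real CML client surfaces its own error types:

```python
import random
import time

def with_retries(operation, attempts=4, base_delay=0.5,
                 retriable=(TimeoutError, ConnectionError)):
    """Call `operation`, retrying transient failures with
    exponential backoff plus a small random jitter."""
    for attempt in range(attempts):
        try:
            return operation()
        except retriable:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```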
Contingency: Manual recovery procedures documented in runbooks.
### R-004: State Synchronization (etcd ↔ MongoDB)
| Attribute | Value |
|---|---|
| Category | Data Integrity |
| Likelihood | Low (1) |
| Impact | High (3) |
| Risk Score | 3 (Medium) |
| Phase | 1-2 |
Description: Dual storage architecture (etcd for state, MongoDB for specs) introduces potential for inconsistency.
Mitigation Strategies:
- Clear data ownership per ADR-005
- MongoDB as source of truth for aggregates
- etcd only for coordination state
- Reconciliation on startup
- Monitoring for state drift
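Startup reconciliation and drift monitoring can both be driven by a keyed diff against the MongoDB source of truth. An illustrative sketch (`find_drift` is a hypothetical helper, and real state would be fetched from both stores rather than passed as dicts):

```python
def find_drift(mongo_state: dict, etcd_state: dict) -> dict:
    """Return keys whose values disagree between the two stores,
    or that exist on only one side (missing side reported as None)."""
    keys = mongo_state.keys() | etcd_state.keys()
    return {k: (mongo_state.get(k), etcd_state.get(k))
            for k in keys
            if mongo_state.get(k) != etcd_state.get(k)}
```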
Contingency: Reconciliation job to fix inconsistencies; manual intervention procedures.
### R-005: Leader Election Failures
| Attribute | Value |
|---|---|
| Category | High Availability |
| Likelihood | Low (1) |
| Impact | High (3) |
| Risk Score | 3 (Medium) |
| Phase | 2-3 |
Description: Leader election for Scheduler/Controller could fail or cause split-brain scenarios.
Mitigation Strategies:
- Use proven etcd lease mechanism
- Short lease TTL (15s) for fast failover
- Idempotent operations (safe to replay)
- Fencing tokens for operations
- Comprehensive testing of failover scenarios
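Fencing tokens protect against a deposed leader acting on stale authority: every write carries a monotonically increasing token, and anything older than the highest token seen is rejected. A minimal in-memory sketch; in the real system the token would derive from the etcd lease/election revision rather than being tracked locally:

```python
class FencedStore:
    """Toy store that rejects writes carrying a stale fencing token."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token: int, key: str, value) -> None:
        if token < self.highest_token:
            # A newer leader has already written: this caller lost leadership.
            raise PermissionError("stale leader (fencing token too old)")
        self.highest_token = token
        self.data[key] = value
```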
Contingency: Manual leadership override via admin API.
### R-006: Port Allocation Conflicts
| Attribute | Value |
|---|---|
| Category | Concurrency |
| Likelihood | Medium (2) |
| Impact | Medium (2) |
| Risk Score | 4 (Medium) |
| Phase | 1 |
Description: Concurrent port allocations could lead to conflicts if not properly synchronized.
Mitigation Strategies:
- etcd transactions for atomic allocation
- Compare-and-swap semantics
- Bitmap-based allocation for efficiency
- Port validation before use
- Automatic conflict detection and retry
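The bitmap-plus-compare-and-swap idea can be sketched in memory; in production the bitmap and its revision would live in etcd, with the transaction comparing the stored revision, so the `version` counter here merely stands in for that:

```python
class PortAllocator:
    """Sketch of atomic port allocation with compare-and-swap semantics."""

    def __init__(self, base: int, count: int):
        self.base = base
        self.count = count
        self.bitmap = 0      # bit i set => port base+i is allocated
        self.version = 0     # stand-in for the etcd key revision

    def allocate(self) -> int:
        expected = self.version
        for i in range(self.count):
            if not (self.bitmap >> i) & 1:
                # Compare-and-swap: commit only if nothing changed underneath us.
                if self.version != expected:
                    raise RuntimeError("concurrent modification, retry")
                self.bitmap |= 1 << i
                self.version += 1
                return self.base + i
        raise RuntimeError("no free ports")

    def release(self, port: int) -> None:
        self.bitmap &= ~(1 << (port - self.base))
        self.version += 1
```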
Contingency: Force release and re-allocate ports.
### R-007: Lab YAML Format Variations
| Attribute | Value |
|---|---|
| Category | Integration |
| Likelihood | Medium (2) |
| Impact | Medium (2) |
| Risk Score | 4 (Medium) |
| Phase | 2 |
Description: Lab YAML files may have variations in format that break the rewriting logic.
Mitigation Strategies:
- Extensive test fixtures with real lab YAMLs
- Graceful error handling
- YAML validation before and after rewriting
- ruamel.yaml for format preservation
- Logging of rewrite operations
Contingency: Manual YAML fixing; skip problematic sections with warnings.
### R-008: Assessment Platform Integration Failures
| Attribute | Value |
|---|---|
| Category | Integration |
| Likelihood | Medium (2) |
| Impact | Medium (2) |
| Risk Score | 4 (Medium) |
| Phase | 4 |
Description: Assessment Platform may be unavailable or return errors during collection/grading.
Mitigation Strategies:
- Retry logic with exponential backoff
- Circuit breaker pattern
- Async event-based communication
- State machine allows manual intervention
- Timeout handling
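The circuit breaker stops repeated calls to an unhealthy Assessment Platform: after a run of consecutive failures it rejects calls outright for a cooldown period, then lets a single probe through. A minimal sketch (class name and defaults are illustrative):

```python
import time

class CircuitBreaker:
    """After `threshold` consecutive failures, reject calls for
    `cooldown` seconds instead of hammering the downstream service."""

    def __init__(self, threshold: int = 3, cooldown: float = 30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None    # None => circuit closed

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open")
            # Cooldown elapsed: half-open, allow one probe through.
            self.opened_at = None
            self.failures = 0
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```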
Contingency: Manual grading fallback; instance can be terminated without grading.
## 3. Schedule Risks
### R-009: Scope Creep
| Attribute | Value |
|---|---|
| Category | Schedule |
| Likelihood | Medium (2) |
| Impact | Medium (2) |
| Risk Score | 4 (Medium) |
| Phase | All |
Description: Additional requirements or changes during implementation could extend timeline.
Mitigation Strategies:
- Clear phase boundaries and acceptance criteria
- Change control process
- Feature flags for incremental release
- Prioritize P0 requirements
- Regular stakeholder alignment
Contingency: Defer P1/P2 features to future releases.
### R-010: Learning Curve
| Attribute | Value |
|---|---|
| Category | Schedule |
| Likelihood | Medium (2) |
| Impact | Low (1) |
| Risk Score | 2 (Low) |
| Phase | 1-2 |
Description: The team may need time to learn etcd operations and patterns.
Mitigation Strategies:
- Pair programming for knowledge sharing
- Spike tasks for complex areas
- Documentation as you go
- External training if needed
Contingency: Buffer time built into estimates.
## 4. Operational Risks
### R-011: Cost Overruns (AWS)
| Attribute | Value |
|---|---|
| Category | Operations |
| Likelihood | Medium (2) |
| Impact | Medium (2) |
| Risk Score | 4 (Medium) |
| Phase | 3 |
Description: m5zn.metal instances are expensive. Poor scaling decisions could increase costs.
Mitigation Strategies:
- Aggressive scale-down of idle workers
- Cost monitoring and alerting
- Budget limits in AWS
- Reserved instances for baseline
- Spot instances for burst (if viable)
Contingency: Emergency manual scale-down procedure.
### R-012: Data Loss (etcd)
| Attribute | Value |
|---|---|
| Category | Operations |
| Likelihood | Low (1) |
| Impact | High (3) |
| Risk Score | 3 (Medium) |
| Phase | All |
Description: etcd cluster failure could lose coordination state.
Mitigation Strategies:
- 3-node etcd cluster (quorum)
- Regular snapshots (every 30 min)
- Off-site backup storage
- Tested restore procedures
- State reconstruction from MongoDB
Contingency: Restore from backup; reconcile with MongoDB.
## 5. Risk Summary
| Risk ID | Risk | Score | Status |
|---|---|---|---|
| R-001 | etcd Operational Complexity | 6 | Active |
| R-002 | Worker Startup Time | 6 | Active |
| R-003 | CML API Reliability | 4 | Active |
| R-004 | State Synchronization | 3 | Active |
| R-005 | Leader Election Failures | 3 | Active |
| R-006 | Port Allocation Conflicts | 4 | Active |
| R-007 | Lab YAML Format Variations | 4 | Active |
| R-008 | Assessment Integration | 4 | Active |
| R-009 | Scope Creep | 4 | Active |
| R-010 | Learning Curve | 2 | Active |
| R-011 | Cost Overruns | 4 | Active |
| R-012 | Data Loss | 3 | Active |
## 6. Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 0.1.0 | 2026-01-16 | Architecture Team | Initial draft |