# Risk Register
| Attribute | Value |
|---|---|
| Document Version | 0.1.0 |
| Status | Draft |
| Created | 2026-01-16 |
| Parent | Implementation Plan |
## 1. Risk Assessment Matrix
| Likelihood ↓ \ Impact → | Low (1) | Medium (2) | High (3) |
|---|---|---|---|
| High (3) | 3 | 6 | 9 |
| Medium (2) | 2 | 4 | 6 |
| Low (1) | 1 | 2 | 3 |
Risk Score = Likelihood × Impact
- Low (1-2): Monitor, no immediate action
- Medium (3-4): Mitigate, plan contingency
- High (6-9): Critical, requires active mitigation
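The scoring rule and action bands above can be sketched as a small helper (the function name and return shape are illustrative, not part of any project API):

```python
def risk_score(likelihood: int, impact: int) -> tuple[int, str]:
    """Compute Likelihood × Impact and map it to the action band above."""
    score = likelihood * impact
    if score <= 2:
        band = "Low"       # monitor, no immediate action
    elif score <= 4:
        band = "Medium"    # mitigate, plan contingency
    else:
        band = "High"      # critical, requires active mitigation
    return score, band
```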
## 2. Technical Risks
### R-001: etcd Operational Complexity
| Attribute | Value |
|---|---|
| Category | Infrastructure |
| Likelihood | Medium (2) |
| Impact | High (3) |
| Risk Score | 6 (High) |
| Phase | 1-5 |
Description: etcd requires specialized operational knowledge; cluster management, backup/restore, and troubleshooting differ substantially from established MongoDB/Redis practice.
Mitigation Strategies:
- Consider a managed etcd service (AWS EKS control-plane etcd, Azure Cosmos DB etcd API)
- Create comprehensive operational runbooks (Task 5.6)
- Training session for operations team
- Automated backup procedures
- Monitoring dashboards with alerting
Contingency: If etcd proves too complex, evaluate MongoDB Change Streams as fallback (trade-off: less reliable watches).
### R-002: Worker Startup Time (15-20 minutes)
| Attribute | Value |
|---|---|
| Category | Performance |
| Likelihood | High (3) |
| Impact | Medium (2) |
| Risk Score | 6 (High) |
| Phase | 3 |
Description: m5zn.metal EC2 instances take 15-20 minutes to start and initialize CML. This limits elasticity and requires predictive scaling.
Mitigation Strategies:
- Predictive scaling based on scheduled timeslots
- Maintain minimum warm capacity (configurable)
- Pre-scale based on historical patterns
- Consider keeping stopped workers (faster restart, ~5 min)
- Implement warm pool for high-demand definitions
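Predictive scaling amounts to subtracting the startup lead time from each scheduled timeslot and launching the shortfall beyond warm capacity. A minimal sketch, assuming a 20-minute worst-case lead; `launch_deadline` and its parameters are illustrative, not an existing API:

```python
from datetime import datetime, timedelta

# Assumed worst-case startup lead for an m5zn.metal worker (15-20 min range above).
STARTUP_LEAD = timedelta(minutes=20)

def launch_deadline(slot_start: datetime, warm_capacity: int,
                    demand: int) -> tuple[datetime, int]:
    """Return when scaling must begin, and how many extra workers to launch,
    so that `demand` workers are ready when the timeslot starts."""
    extra = max(0, demand - warm_capacity)
    return slot_start - STARTUP_LEAD, extra
```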
Contingency: Accept longer lead times; communicate expected wait times to users.
### R-003: CML API Reliability
| Attribute | Value |
|---|---|
| Category | Integration |
| Likelihood | Medium (2) |
| Impact | Medium (2) |
| Risk Score | 4 (Medium) |
| Phase | 2 |
Description: CML API may timeout or return errors during lab import/start operations. Network issues between CCM and workers can cause failures.
Mitigation Strategies:
- Implement retry logic with exponential backoff
- Circuit breaker pattern for repeated failures
- Timeout configuration per operation type
- Health checks before operations
- Detailed error logging for troubleshooting
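The retry-with-backoff strategy above can be sketched as a generic wrapper; the retriable exception set is a placeholder, since the real CML client surfaces its own error types:

```python
import random
import time

def with_retries(operation, attempts=4, base_delay=0.5,
                 retriable=(TimeoutError, ConnectionError)):
    """Call `operation`, retrying transient failures with
    exponential backoff plus a small random jitter."""
    for attempt in range(attempts):
        try:
            return operation()
        except retriable:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```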
Contingency: Manual recovery procedures documented in runbooks.
### R-004: State Synchronization (etcd ↔ MongoDB)
| Attribute | Value |
|---|---|
| Category | Data Integrity |
| Likelihood | Low (1) |
| Impact | High (3) |
| Risk Score | 3 (Medium) |
| Phase | 1-2 |
Description: Dual storage architecture (etcd for state, MongoDB for specs) introduces potential for inconsistency.
Mitigation Strategies:
- Clear data ownership per ADR-005
- MongoDB as source of truth for aggregates
- etcd only for coordination state
- Reconciliation on startup
- Monitoring for state drift
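Startup reconciliation and drift monitoring can both be driven by a keyed diff against the MongoDB source of truth. An illustrative sketch (`find_drift` is a hypothetical helper, and real state would be fetched from both stores rather than passed as dicts):

```python
def find_drift(mongo_state: dict, etcd_state: dict) -> dict:
    """Return keys whose values disagree between the two stores,
    or that exist on only one side (missing side reported as None)."""
    keys = mongo_state.keys() | etcd_state.keys()
    return {k: (mongo_state.get(k), etcd_state.get(k))
            for k in keys
            if mongo_state.get(k) != etcd_state.get(k)}
```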
Contingency: Reconciliation job to fix inconsistencies; manual intervention procedures.
### R-005: Leader Election Failures
| Attribute | Value |
|---|---|
| Category | High Availability |
| Likelihood | Low (1) |
| Impact | High (3) |
| Risk Score | 3 (Medium) |
| Phase | 2-3 |
Description: Leader election for Scheduler/Controller could fail or cause split-brain scenarios.
Mitigation Strategies:
- Use proven etcd lease mechanism
- Short lease TTL (15s) for fast failover
- Idempotent operations (safe to replay)
- Fencing tokens for operations
- Comprehensive testing of failover scenarios
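Fencing tokens protect against a deposed leader acting on stale authority: every write carries a monotonically increasing token, and anything older than the highest token seen is rejected. A minimal in-memory sketch; in the real system the token would derive from the etcd lease/election revision rather than being tracked locally:

```python
class FencedStore:
    """Toy store that rejects writes carrying a stale fencing token."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token: int, key: str, value) -> None:
        if token < self.highest_token:
            # A newer leader has already written: this caller lost leadership.
            raise PermissionError("stale leader (fencing token too old)")
        self.highest_token = token
        self.data[key] = value
```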
Contingency: Manual leadership override via admin API.
### R-006: Port Allocation Conflicts
| Attribute | Value |
|---|---|
| Category | Concurrency |
| Likelihood | Medium (2) |
| Impact | Medium (2) |
| Risk Score | 4 (Medium) |
| Phase | 1 |
Description: Concurrent port allocations could lead to conflicts if not properly synchronized.
Mitigation Strategies:
- etcd transactions for atomic allocation
- Compare-and-swap semantics
- Bitmap-based allocation for efficiency
- Port validation before use
- Automatic conflict detection and retry
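The bitmap-plus-compare-and-swap idea can be sketched in memory; in production the bitmap and its revision would live in etcd, with the transaction comparing the stored revision, so the `version` counter here merely stands in for that:

```python
class PortAllocator:
    """Sketch of atomic port allocation with compare-and-swap semantics."""

    def __init__(self, base: int, count: int):
        self.base = base
        self.count = count
        self.bitmap = 0      # bit i set => port base+i is allocated
        self.version = 0     # stand-in for the etcd key revision

    def allocate(self) -> int:
        expected = self.version
        for i in range(self.count):
            if not (self.bitmap >> i) & 1:
                # Compare-and-swap: commit only if nothing changed underneath us.
                if self.version != expected:
                    raise RuntimeError("concurrent modification, retry")
                self.bitmap |= 1 << i
                self.version += 1
                return self.base + i
        raise RuntimeError("no free ports")

    def release(self, port: int) -> None:
        self.bitmap &= ~(1 << (port - self.base))
        self.version += 1
```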
Contingency: Force release and re-allocate ports.
### R-007: Lab YAML Format Variations
| Attribute | Value |
|---|---|
| Category | Integration |
| Likelihood | Medium (2) |
| Impact | Medium (2) |
| Risk Score | 4 (Medium) |
| Phase | 2 |
Description: Lab YAML files may have variations in format that break the rewriting logic.
Mitigation Strategies:
- Extensive test fixtures with real lab YAMLs
- Graceful error handling
- YAML validation before and after rewriting
- ruamel.yaml for format preservation
- Logging of rewrite operations
Contingency: Manual YAML fixing; skip problematic sections with warnings.
### R-008: Assessment Platform Integration Failures
| Attribute | Value |
|---|---|
| Category | Integration |
| Likelihood | Medium (2) |
| Impact | Medium (2) |
| Risk Score | 4 (Medium) |
| Phase | 4 |
Description: Assessment Platform may be unavailable or return errors during collection/grading.
Mitigation Strategies:
- Retry logic with exponential backoff
- Circuit breaker pattern
- Async event-based communication
- State machine allows manual intervention
- Timeout handling
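The circuit breaker stops repeated calls to an unhealthy Assessment Platform: after a run of consecutive failures it rejects calls outright for a cooldown period, then lets a single probe through. A minimal sketch (class name and defaults are illustrative):

```python
import time

class CircuitBreaker:
    """After `threshold` consecutive failures, reject calls for
    `cooldown` seconds instead of hammering the downstream service."""

    def __init__(self, threshold: int = 3, cooldown: float = 30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None    # None => circuit closed

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open")
            # Cooldown elapsed: half-open, allow one probe through.
            self.opened_at = None
            self.failures = 0
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```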
Contingency: Manual grading fallback; instance can be terminated without grading.
## 3. Schedule Risks
### R-009: Scope Creep
| Attribute | Value |
|---|---|
| Category | Schedule |
| Likelihood | Medium (2) |
| Impact | Medium (2) |
| Risk Score | 4 (Medium) |
| Phase | All |
Description: Additional requirements or changes during implementation could extend timeline.
Mitigation Strategies:
- Clear phase boundaries and acceptance criteria
- Change control process
- Feature flags for incremental release
- Prioritize P0 requirements
- Regular stakeholder alignment
Contingency: Defer P1/P2 features to future releases.
### R-010: Learning Curve
| Attribute | Value |
|---|---|
| Category | Schedule |
| Likelihood | Medium (2) |
| Impact | Low (1) |
| Risk Score | 2 (Low) |
| Phase | 1-2 |
Description: The team may need time to learn etcd operations and patterns.
Mitigation Strategies:
- Pair programming for knowledge sharing
- Spike tasks for complex areas
- Documentation as you go
- External training if needed
Contingency: Buffer time built into estimates.
## 4. Operational Risks
### R-011: Cost Overruns (AWS)
| Attribute | Value |
|---|---|
| Category | Operations |
| Likelihood | Medium (2) |
| Impact | Medium (2) |
| Risk Score | 4 (Medium) |
| Phase | 3 |
Description: m5zn.metal instances are expensive. Poor scaling decisions could increase costs.
Mitigation Strategies:
- Aggressive scale-down of idle workers
- Cost monitoring and alerting
- Budget limits in AWS
- Reserved instances for baseline
- Spot instances for burst (if viable)
Contingency: Emergency manual scale-down procedure.
### R-012: Data Loss (etcd)
| Attribute | Value |
|---|---|
| Category | Operations |
| Likelihood | Low (1) |
| Impact | High (3) |
| Risk Score | 3 (Medium) |
| Phase | All |
Description: etcd cluster failure could lose coordination state.
Mitigation Strategies:
- 3-node etcd cluster (quorum)
- Regular snapshots (every 30 min)
- Off-site backup storage
- Tested restore procedures
- State reconstruction from MongoDB
Contingency: Restore from backup; reconcile with MongoDB.
## 5. Risk Summary
| Risk ID | Risk | Score | Status |
|---|---|---|---|
| R-001 | etcd Operational Complexity | 6 | Active |
| R-002 | Worker Startup Time | 6 | Active |
| R-003 | CML API Reliability | 4 | Active |
| R-004 | State Synchronization | 3 | Active |
| R-005 | Leader Election Failures | 3 | Active |
| R-006 | Port Allocation Conflicts | 4 | Active |
| R-007 | Lab YAML Format Variations | 4 | Active |
| R-008 | Assessment Integration | 4 | Active |
| R-009 | Scope Creep | 4 | Active |
| R-010 | Learning Curve | 2 | Active |
| R-011 | Cost Overruns | 4 | Active |
| R-012 | Data Loss | 3 | Active |
## 6. Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 0.1.0 | 2026-01-16 | Architecture Team | Initial draft |