Code Review: Application Settings & AWS EC2 API Client¶
Date: November 16, 2025¶
Reviewer: AI Assistant Files Reviewed:
src/application/settings.pysrc/integration/services/aws_ec2_api_client.py
Executive Summary¶
Alignment with Domain Model & Project Objectives: ⚠️ PARTIAL¶
Issues Found:
- Critical: Missing
IolVmInstanceDtodefinition (import error) - Critical: Syntax error in imports (
from integration,excshould befrom integration.exceptions) - High: Settings use legacy "IOL VM" terminology instead of "CML Worker"
- High: Missing CML-specific settings (HTTPS endpoint monitoring, license management, etc.)
- Medium: No settings for idle detection thresholds
- Medium: Settings structure doesn't align with multi-worker management
Code Quality: 🟡 NEEDS IMPROVEMENT¶
Strengths:
- Good use of Pydantic for settings validation
- Comprehensive observability configuration
- Good error handling in AWS client
- Proper logging throughout
Issues:
- Import errors preventing compilation
- Inconsistent naming (IOL VM vs CML Worker)
- Missing type hints in some places
- Code style violations (trailing whitespace,
not inusage)
Detailed Analysis¶
1. Application Settings (application/settings.py)¶
✅ Strengths¶
- Comprehensive Observability Config: Excellent OpenTelemetry integration
- Security: Good separation of secrets (though defaults need changing)
- Environment-Aware: Proper development/production distinction
- CORS & Session Management: Well configured
❌ Issues & Recommendations¶
Issue 1: Legacy Naming - "IOL VM" vs "CML Worker"¶
Severity: High Lines: 119-128
# Current (Legacy):
cmlvm_ami_ids: dict[str, str] = {"us-east-1": "ami-0123456789abcdef0", ...}
cmlvm_ami_name: str = "ec2_iol_image_name"
cmlvm_instance_type: Ec2InstanceType = Ec2InstanceType.SMALL
Recommendation:
# Proposed (Aligned with Domain):
cml_worker_ami_ids: dict[str, str] = {"us-east-1": "ami-0123456789abcdef0", ...}
cml_worker_ami_names: dict[str, str] = {"us-east-1": "CML-2.7.0-Ubuntu-22.04", ...}
cml_worker_default_instance_type: str = "c5.2xlarge"
cml_worker_allowed_instance_types: list[str] = ["t3.xlarge", "c5.2xlarge", "c5.4xlarge"]
Issue 2: Missing CML-Specific Settings¶
Severity: High
The settings don't include configuration for:
- CML HTTPS service monitoring
- CML API endpoints
- CML license management
- Idle detection thresholds
- Auto-shutdown policies
Recommendation:
# CML Service Configuration
cml_https_port: int = 443
cml_api_port: int = 443
cml_health_check_interval_seconds: int = 30
cml_startup_timeout_seconds: int = 600
# CML License Configuration
cml_license_token: str | None = None
cml_license_check_on_startup: bool = True
# Lifecycle Management
cml_worker_idle_threshold_minutes: int = 30
cml_worker_auto_shutdown_enabled: bool = True
cml_worker_telemetry_collection_interval_seconds: int = 60
cml_worker_max_idle_before_warning_minutes: int = 20
Issue 3: AWS Region Handling¶
Severity: Medium Lines: 119
Currently using dict[str, str] for AMI IDs per region, but:
- No default region specified
- No validation of region availability
- No AMI name per region
Recommendation:
# AWS Configuration
aws_default_region: str = "us-east-1"
aws_allowed_regions: list[str] = ["us-east-1", "us-west-2", "eu-west-1"]
# CML Worker AMI Configuration (per region)
cml_worker_ami_config: dict[str, dict[str, str]] = {
"us-east-1": {
"ami_id": "ami-0123456789abcdef0",
"ami_name": "CML-2.7.0-Ubuntu-22.04",
"cml_version": "2.7.0"
},
"us-west-2": {
"ami_id": "ami-0987654321fedcba0",
"ami_name": "CML-2.7.0-Ubuntu-22.04",
"cml_version": "2.7.0"
}
}
Issue 4: Security Groups & Network Settings¶
Severity: Medium Lines: 124-126
Settings assume single security group and subnet. Multi-worker scenarios need flexibility.
Recommendation:
# Network Configuration per Region
cml_worker_network_config: dict[str, dict[str, Any]] = {
"us-east-1": {
"vpc_id": "vpc-0123456789abcdef0",
"subnet_id": "subnet-0123456789abcdef0",
"security_group_ids": ["sg-0123456789abcdef0"],
"assign_public_ip": True
}
}
# SSH Key Configuration
cml_worker_key_pairs: dict[str, str] = {
"us-east-1": "cml-worker-keypair-use1",
"us-west-2": "cml-worker-keypair-usw2"
}
Issue 5: Duplicate Logger Configuration¶
Severity: Low Line: 156-157
logging.getLogger("fastapi").setLevel(logging.WARNING)
logging.getLogger("fastapi").setLevel(logging.WARNING) # Duplicate!
2. AWS EC2 API Client (integration/services/aws_ec2_api_client.py)¶
✅ Strengths¶
- Comprehensive EC2 Operations: Create, terminate, list, get details
- CloudWatch Integration: Resource utilization monitoring
- Flexible Filtering: Good
list_instances()with multiple filters - Error Handling: Try-except blocks with proper logging
- Documentation: Good docstrings for most methods
❌ Critical Issues¶
Issue 1: Import Errors (BLOCKING)¶
Severity: Critical Line: 9, 14
# SYNTAX ERROR:
from integration,exc import IntegrationException # Should be: from integration.exceptions
# MISSING DEFINITION:
from integration.models import IolVmInstanceDto # This class doesn't exist!
Fix Required:
from integration.exceptions import IntegrationException
# Create the DTO or use domain entities directly
from domain.entities.cml_worker import CMLWorker
Issue 2: Missing IolVmInstanceDto Definition¶
Severity: Critical Line: 14, 86, 104
This DTO is referenced but not defined anywhere. Options:
Option A: Create the DTO in integration/models/
from dataclasses import dataclass
from integration.enums import AwsRegion
@dataclass
class CMLWorkerInstanceDto:
"""DTO for EC2 instance creation response."""
id: str
aws_region: AwsRegion
instance_name: str
ami_id: str
ami_name: str | None
instance_type: str
security_group_ids: list[str]
subnet_id: str
public_ip: str | None = None
private_ip: str | None = None
Option B: Return domain entity directly (better for DDD)
def create_instance(...) -> CMLWorker:
# Create CMLWorker domain entity
worker = CMLWorker(
name=instance_name,
aws_region=aws_region.value,
instance_type=instance_type,
ami_id=ami_id,
ami_name=ami_name,
created_by=created_by
)
# ... provision EC2, then assign instance
worker.assign_instance(instance.id, public_ip, private_ip)
return worker
Issue 3: Legacy "IOL VM" Terminology¶
Severity: High Throughout file
Variables use "iolvm" prefix instead of "cml_worker":
iolvm_instance_name→worker_nameorcml_worker_nameiolvm_ami_id→ami_id- Error messages reference "IOLVM" → should be "CML Worker"
🟡 Code Quality Issues¶
Issue 4: Inconsistent not in Usage¶
Severity: Low Lines: 225, 230, 233, 236
# Current (Wrong):
if not "Instances" in reservation:
# Should be:
if "Instances" not in reservation:
Issue 5: Trailing Whitespace¶
Severity: Low Lines: 155, 156, 177, 178, 179, 243, 251-257
Run ruff or configure IDE to remove trailing whitespace.
Issue 6: Unused Variable¶
Severity: Low Line: 295
check_if_instance_exists = self.get_instance_details(...) # Never used!
# Should be:
_ = self.get_instance_details(...) # Validate instance exists
# Or better:
if not self.get_instance_details(aws_region, instance_id):
raise IntegrationException(f"Instance {instance_id} not found")
Issue 7: Missing Type Hints¶
Severity: Low Line: 295
CloudWatch client creation has no type hints.
🔧 Design Recommendations¶
Recommendation 1: Add Start/Stop Methods¶
Missing Operations:
def start_instance(self, aws_region: AwsRegion, instance_id: str) -> bool:
"""Start a stopped EC2 instance."""
ec2_client = boto3.client("ec2", ...)
try:
response = ec2_client.start_instances(InstanceIds=[instance_id])
return True
except (ValueError, ParamValidationError, ClientError) as e:
log.error(f"Error starting instance {instance_id}: {e}")
raise IntegrationException(f"Error starting instance: {e}")
def stop_instance(self, aws_region: AwsRegion, instance_id: str) -> bool:
"""Stop a running EC2 instance."""
ec2_client = boto3.client("ec2", ...)
try:
response = ec2_client.stop_instances(InstanceIds=[instance_id])
return True
except (ValueError, ParamValidationError, ClientError) as e:
log.error(f"Error stopping instance {instance_id}: {e}")
raise IntegrationException(f"Error stopping instance: {e}")
Recommendation 2: Add Instance Status Check¶
For HTTPS Service Monitoring:
def get_instance_status_checks(
self,
aws_region: AwsRegion,
instance_id: str
) -> dict[str, str]:
"""Get instance and system status checks.
Returns:
dict with 'instance_status' and 'system_status' (ok, impaired, etc.)
"""
ec2_client = boto3.client("ec2", ...)
try:
response = ec2_client.describe_instance_status(
InstanceIds=[instance_id],
IncludeAllInstances=True
)
if response['InstanceStatuses']:
status = response['InstanceStatuses'][0]
return {
'instance_status': status['InstanceStatus']['Status'],
'system_status': status['SystemStatus']['Status'],
'instance_state': status['InstanceState']['Name']
}
return {}
except (ValueError, ParamValidationError, ClientError) as e:
log.error(f"Error getting status checks for {instance_id}: {e}")
raise IntegrationException(f"Error getting status checks: {e}")
Recommendation 3: Improve Error Granularity¶
Current: All errors raise generic IntegrationException
Better: Specific exception types
# In integration/exceptions.py
class EC2InstanceNotFoundException(IntegrationException):
"""Raised when EC2 instance is not found."""
pass
class EC2PermissionDeniedException(IntegrationException):
"""Raised when AWS credentials lack permissions."""
pass
class EC2QuotaExceededException(IntegrationException):
"""Raised when AWS resource quota is exceeded."""
pass
# In aws_ec2_api_client.py
except ClientError as e:
error_code = e.response.get('Error', {}).get('Code', '')
if error_code == 'InvalidInstanceID.NotFound':
raise EC2InstanceNotFoundException(f"Instance {instance_id} not found")
elif error_code == 'UnauthorizedOperation':
raise EC2PermissionDeniedException(f"Permission denied: {e}")
else:
raise IntegrationException(f"AWS error: {e}")
Recommendation 4: Add Tag Management¶
For Worker Tracking:
def add_tags(
self,
aws_region: AwsRegion,
instance_id: str,
tags: dict[str, str]
) -> bool:
"""Add or update tags on an EC2 instance."""
ec2_client = boto3.client("ec2", ...)
try:
ec2_client.create_tags(
Resources=[instance_id],
Tags=[{'Key': k, 'Value': v} for k, v in tags.items()]
)
return True
except (ValueError, ParamValidationError, ClientError) as e:
log.error(f"Error adding tags to {instance_id}: {e}")
raise IntegrationException(f"Error adding tags: {e}")
Recommendation 5: Extract Boto3 Client Creation¶
DRY Principle Violation: Boto3 client creation repeated everywhere
def _create_ec2_client(self, region: AwsRegion) -> Any:
"""Create configured boto3 EC2 client."""
return boto3.client(
"ec2",
aws_access_key_id=self.aws_account_credentials.aws_access_key_id,
aws_secret_access_key=self.aws_account_credentials.aws_secret_access_key,
region_name=region.value
)
def _create_ec2_resource(self, region: AwsRegion) -> Any:
"""Create configured boto3 EC2 resource."""
return boto3.resource(
"ec2",
aws_access_key_id=self.aws_account_credentials.aws_access_key_id,
aws_secret_access_key=self.aws_account_credentials.aws_secret_access_key,
region_name=region.value
)
def _create_cloudwatch_client(self, region: AwsRegion) -> Any:
"""Create configured boto3 CloudWatch client."""
return boto3.client(
"cloudwatch",
aws_access_key_id=self.aws_account_credentials.aws_access_key_id,
aws_secret_access_key=self.aws_account_credentials.aws_secret_access_key,
region_name=region.value
)
Alignment with Domain Model¶
Current State: 🔴 MISALIGNED¶
| Domain Concept | Settings | AWS Client |
|---|---|---|
| CMLWorker | ❌ Uses "cmlvm" | ❌ Uses "iolvm" |
| AMI ID + Name | ❌ Only ID | ❌ Not tracked |
| Worker Status | ❌ Missing | ⚠️ Partial (state only) |
| Service Status | ❌ Missing | ❌ Missing |
| Idle Detection | ❌ Missing | ⚠️ Has telemetry |
| License Management | ❌ Missing | ❌ Missing |
| Lifecycle (Start/Stop) | ❌ Missing | ❌ Only create/terminate |
Required Changes for Alignment¶
- Rename all "IOL VM" references to "CML Worker"
- Add AMI name tracking (already in domain, needs settings)
- Add CML-specific settings (HTTPS monitoring, licenses, idle thresholds)
- Add start/stop operations to AWS client
- Create proper DTOs or use domain entities
- Fix import errors immediately
Alignment with Project Objectives¶
From docs/project-objectives.md¶
✅ Implemented¶
- EC2 instance creation
- Instance termination
- Instance details retrieval
- Resource utilization monitoring
⚠️ Partially Implemented¶
- List instances (works, but no filtering by CML-specific criteria)
❌ Not Implemented¶
- Start stopped instances (critical!)
- Stop running instances (critical!)
- HTTPS service availability monitoring
- CML API integration
- License management
- Idle detection automation
- Cross-account AWS support (credentials hardcoded)
Priority Action Items¶
🚨 Immediate (Blocking)¶
- Fix import syntax error (line 9)
- Create or remove
IolVmInstanceDtoreference - Add start_instance() method to AWS client
- Add stop_instance() method to AWS client
🔥 High Priority¶
- Rename all "IOL VM" → "CML Worker" terminology
- Add CML-specific settings (idle thresholds, HTTPS monitoring)
- Add AMI name to settings per region
- Fix code style issues (run
ruff check --fix)
📋 Medium Priority¶
- Create specific exception types
- Extract boto3 client creation methods
- Add instance status check method
- Add tag management methods
- Improve telemetry to include active labs count
🎯 Nice to Have¶
- Add EBS snapshot management
- Add cost tracking per instance
- Add network performance metrics
- Add CloudWatch custom metrics for CML-specific data
Conclusion¶
Overall Assessment: 🟡 NEEDS SIGNIFICANT WORK
The code has a solid foundation but requires immediate fixes and substantial refactoring to align with the CML Worker domain model and project objectives. The critical import errors must be fixed before any other work can proceed.
Estimated Effort:
- Critical fixes: 2-4 hours
- High priority items: 1-2 days
- Medium priority: 2-3 days
- Full alignment: 1 week
Recommendation: Fix critical issues first, then systematically work through high-priority alignment tasks before building the application layer.