Skip to content

Code Review: Application Settings & AWS EC2 API Client

Date: November 16, 2025

Reviewer: AI Assistant Files Reviewed:

  • src/application/settings.py
  • src/integration/services/aws_ec2_api_client.py

Executive Summary

Alignment with Domain Model & Project Objectives: ⚠️ PARTIAL

Issues Found:

  1. Critical: Missing IolVmInstanceDto definition (import error)
  2. Critical: Syntax error in imports (from integration,exc should be from integration.exceptions)
  3. High: Settings use legacy "IOL VM" terminology instead of "CML Worker"
  4. High: Missing CML-specific settings (HTTPS endpoint monitoring, license management, etc.)
  5. Medium: No settings for idle detection thresholds
  6. Medium: Settings structure doesn't align with multi-worker management

Code Quality: 🟡 NEEDS IMPROVEMENT

Strengths:

  • Good use of Pydantic for settings validation
  • Comprehensive observability configuration
  • Good error handling in AWS client
  • Proper logging throughout

Issues:

  • Import errors preventing compilation
  • Inconsistent naming (IOL VM vs CML Worker)
  • Missing type hints in some places
  • Code style violations (trailing whitespace, not in usage)

Detailed Analysis

1. Application Settings (application/settings.py)

✅ Strengths

  1. Comprehensive Observability Config: Excellent OpenTelemetry integration
  2. Security: Good separation of secrets (though defaults need changing)
  3. Environment-Aware: Proper development/production distinction
  4. CORS & Session Management: Well configured

❌ Issues & Recommendations

Issue 1: Legacy Naming - "IOL VM" vs "CML Worker"

Severity: High Lines: 119-128

# Current (Legacy):
cmlvm_ami_ids: dict[str, str] = {"us-east-1": "ami-0123456789abcdef0", ...}
cmlvm_ami_name: str = "ec2_iol_image_name"
cmlvm_instance_type: Ec2InstanceType = Ec2InstanceType.SMALL

Recommendation:

# Proposed (Aligned with Domain):
cml_worker_ami_ids: dict[str, str] = {"us-east-1": "ami-0123456789abcdef0", ...}
cml_worker_ami_names: dict[str, str] = {"us-east-1": "CML-2.7.0-Ubuntu-22.04", ...}
cml_worker_default_instance_type: str = "c5.2xlarge"
cml_worker_allowed_instance_types: list[str] = ["t3.xlarge", "c5.2xlarge", "c5.4xlarge"]
Issue 2: Missing CML-Specific Settings

Severity: High

The settings don't include configuration for:

  • CML HTTPS service monitoring
  • CML API endpoints
  • CML license management
  • Idle detection thresholds
  • Auto-shutdown policies

Recommendation:

# CML Service Configuration
cml_https_port: int = 443
cml_api_port: int = 443
cml_health_check_interval_seconds: int = 30
cml_startup_timeout_seconds: int = 600

# CML License Configuration
cml_license_token: str | None = None
cml_license_check_on_startup: bool = True

# Lifecycle Management
cml_worker_idle_threshold_minutes: int = 30
cml_worker_auto_shutdown_enabled: bool = True
cml_worker_telemetry_collection_interval_seconds: int = 60
cml_worker_max_idle_before_warning_minutes: int = 20
Issue 3: AWS Region Handling

Severity: Medium Lines: 119

Currently using dict[str, str] for AMI IDs per region, but:

  • No default region specified
  • No validation of region availability
  • No AMI name per region

Recommendation:

# AWS Configuration
aws_default_region: str = "us-east-1"
aws_allowed_regions: list[str] = ["us-east-1", "us-west-2", "eu-west-1"]

# CML Worker AMI Configuration (per region)
cml_worker_ami_config: dict[str, dict[str, str]] = {
    "us-east-1": {
        "ami_id": "ami-0123456789abcdef0",
        "ami_name": "CML-2.7.0-Ubuntu-22.04",
        "cml_version": "2.7.0"
    },
    "us-west-2": {
        "ami_id": "ami-0987654321fedcba0",
        "ami_name": "CML-2.7.0-Ubuntu-22.04",
        "cml_version": "2.7.0"
    }
}
Issue 4: Security Groups & Network Settings

Severity: Medium Lines: 124-126

Settings assume single security group and subnet. Multi-worker scenarios need flexibility.

Recommendation:

# Network Configuration per Region
cml_worker_network_config: dict[str, dict[str, Any]] = {
    "us-east-1": {
        "vpc_id": "vpc-0123456789abcdef0",
        "subnet_id": "subnet-0123456789abcdef0",
        "security_group_ids": ["sg-0123456789abcdef0"],
        "assign_public_ip": True
    }
}

# SSH Key Configuration
cml_worker_key_pairs: dict[str, str] = {
    "us-east-1": "cml-worker-keypair-use1",
    "us-west-2": "cml-worker-keypair-usw2"
}
Issue 5: Duplicate Logger Configuration

Severity: Low Line: 156-157

logging.getLogger("fastapi").setLevel(logging.WARNING)
logging.getLogger("fastapi").setLevel(logging.WARNING)  # Duplicate!

2. AWS EC2 API Client (integration/services/aws_ec2_api_client.py)

✅ Strengths

  1. Comprehensive EC2 Operations: Create, terminate, list, get details
  2. CloudWatch Integration: Resource utilization monitoring
  3. Flexible Filtering: Good list_instances() with multiple filters
  4. Error Handling: Try-except blocks with proper logging
  5. Documentation: Good docstrings for most methods

❌ Critical Issues

Issue 1: Import Errors (BLOCKING)

Severity: Critical Line: 9, 14

# SYNTAX ERROR:
from integration,exc import IntegrationException  # Should be: from integration.exceptions

# MISSING DEFINITION:
from integration.models import IolVmInstanceDto  # This class doesn't exist!

Fix Required:

from integration.exceptions import IntegrationException

# Create the DTO or use domain entities directly
from domain.entities.cml_worker import CMLWorker
Issue 2: Missing IolVmInstanceDto Definition

Severity: Critical Line: 14, 86, 104

This DTO is referenced but not defined anywhere. Options:

Option A: Create the DTO in integration/models/

from dataclasses import dataclass
from integration.enums import AwsRegion

@dataclass
class CMLWorkerInstanceDto:
    """DTO for EC2 instance creation response."""
    id: str
    aws_region: AwsRegion
    instance_name: str
    ami_id: str
    ami_name: str | None
    instance_type: str
    security_group_ids: list[str]
    subnet_id: str
    public_ip: str | None = None
    private_ip: str | None = None

Option B: Return domain entity directly (better for DDD)

def create_instance(...) -> CMLWorker:
    # Create CMLWorker domain entity
    worker = CMLWorker(
        name=instance_name,
        aws_region=aws_region.value,
        instance_type=instance_type,
        ami_id=ami_id,
        ami_name=ami_name,
        created_by=created_by
    )
    # ... provision EC2, then assign instance
    worker.assign_instance(instance.id, public_ip, private_ip)
    return worker
Issue 3: Legacy "IOL VM" Terminology

Severity: High Throughout file

Variables use "iolvm" prefix instead of "cml_worker":

  • iolvm_instance_nameworker_name or cml_worker_name
  • iolvm_ami_idami_id
  • Error messages reference "IOLVM" → should be "CML Worker"

🟡 Code Quality Issues

Issue 4: Inconsistent not in Usage

Severity: Low Lines: 225, 230, 233, 236

# Current (Wrong):
if not "Instances" in reservation:

# Should be:
if "Instances" not in reservation:
Issue 5: Trailing Whitespace

Severity: Low Lines: 155, 156, 177, 178, 179, 243, 251-257

Run ruff or configure IDE to remove trailing whitespace.

Issue 6: Unused Variable

Severity: Low Line: 295

check_if_instance_exists = self.get_instance_details(...)  # Never used!

# Should be:
_ = self.get_instance_details(...)  # Validate instance exists
# Or better:
if not self.get_instance_details(aws_region, instance_id):
    raise IntegrationException(f"Instance {instance_id} not found")
Issue 7: Missing Type Hints

Severity: Low Line: 295

CloudWatch client creation has no type hints.

🔧 Design Recommendations

Recommendation 1: Add Start/Stop Methods

Missing Operations:

def start_instance(self, aws_region: AwsRegion, instance_id: str) -> bool:
    """Start a stopped EC2 instance."""
    ec2_client = boto3.client("ec2", ...)
    try:
        response = ec2_client.start_instances(InstanceIds=[instance_id])
        return True
    except (ValueError, ParamValidationError, ClientError) as e:
        log.error(f"Error starting instance {instance_id}: {e}")
        raise IntegrationException(f"Error starting instance: {e}")

def stop_instance(self, aws_region: AwsRegion, instance_id: str) -> bool:
    """Stop a running EC2 instance."""
    ec2_client = boto3.client("ec2", ...)
    try:
        response = ec2_client.stop_instances(InstanceIds=[instance_id])
        return True
    except (ValueError, ParamValidationError, ClientError) as e:
        log.error(f"Error stopping instance {instance_id}: {e}")
        raise IntegrationException(f"Error stopping instance: {e}")
Recommendation 2: Add Instance Status Check

For HTTPS Service Monitoring:

def get_instance_status_checks(
    self,
    aws_region: AwsRegion,
    instance_id: str
) -> dict[str, str]:
    """Get instance and system status checks.

    Returns:
        dict with 'instance_status' and 'system_status' (ok, impaired, etc.)
    """
    ec2_client = boto3.client("ec2", ...)
    try:
        response = ec2_client.describe_instance_status(
            InstanceIds=[instance_id],
            IncludeAllInstances=True
        )
        if response['InstanceStatuses']:
            status = response['InstanceStatuses'][0]
            return {
                'instance_status': status['InstanceStatus']['Status'],
                'system_status': status['SystemStatus']['Status'],
                'instance_state': status['InstanceState']['Name']
            }
        return {}
    except (ValueError, ParamValidationError, ClientError) as e:
        log.error(f"Error getting status checks for {instance_id}: {e}")
        raise IntegrationException(f"Error getting status checks: {e}")
Recommendation 3: Improve Error Granularity

Current: All errors raise generic IntegrationException Better: Specific exception types

# In integration/exceptions.py
class EC2InstanceNotFoundException(IntegrationException):
    """Raised when EC2 instance is not found."""
    pass

class EC2PermissionDeniedException(IntegrationException):
    """Raised when AWS credentials lack permissions."""
    pass

class EC2QuotaExceededException(IntegrationException):
    """Raised when AWS resource quota is exceeded."""
    pass

# In aws_ec2_api_client.py
except ClientError as e:
    error_code = e.response.get('Error', {}).get('Code', '')
    if error_code == 'InvalidInstanceID.NotFound':
        raise EC2InstanceNotFoundException(f"Instance {instance_id} not found")
    elif error_code == 'UnauthorizedOperation':
        raise EC2PermissionDeniedException(f"Permission denied: {e}")
    else:
        raise IntegrationException(f"AWS error: {e}")
Recommendation 4: Add Tag Management

For Worker Tracking:

def add_tags(
    self,
    aws_region: AwsRegion,
    instance_id: str,
    tags: dict[str, str]
) -> bool:
    """Add or update tags on an EC2 instance."""
    ec2_client = boto3.client("ec2", ...)
    try:
        ec2_client.create_tags(
            Resources=[instance_id],
            Tags=[{'Key': k, 'Value': v} for k, v in tags.items()]
        )
        return True
    except (ValueError, ParamValidationError, ClientError) as e:
        log.error(f"Error adding tags to {instance_id}: {e}")
        raise IntegrationException(f"Error adding tags: {e}")
Recommendation 5: Extract Boto3 Client Creation

DRY Principle Violation: Boto3 client creation repeated everywhere

def _create_ec2_client(self, region: AwsRegion) -> Any:
    """Create configured boto3 EC2 client."""
    return boto3.client(
        "ec2",
        aws_access_key_id=self.aws_account_credentials.aws_access_key_id,
        aws_secret_access_key=self.aws_account_credentials.aws_secret_access_key,
        region_name=region.value
    )

def _create_ec2_resource(self, region: AwsRegion) -> Any:
    """Create configured boto3 EC2 resource."""
    return boto3.resource(
        "ec2",
        aws_access_key_id=self.aws_account_credentials.aws_access_key_id,
        aws_secret_access_key=self.aws_account_credentials.aws_secret_access_key,
        region_name=region.value
    )

def _create_cloudwatch_client(self, region: AwsRegion) -> Any:
    """Create configured boto3 CloudWatch client."""
    return boto3.client(
        "cloudwatch",
        aws_access_key_id=self.aws_account_credentials.aws_access_key_id,
        aws_secret_access_key=self.aws_account_credentials.aws_secret_access_key,
        region_name=region.value
    )

Alignment with Domain Model

Current State: 🔴 MISALIGNED

Domain Concept Settings AWS Client
CMLWorker ❌ Uses "cmlvm" ❌ Uses "iolvm"
AMI ID + Name ❌ Only ID ❌ Not tracked
Worker Status ❌ Missing ⚠️ Partial (state only)
Service Status ❌ Missing ❌ Missing
Idle Detection ❌ Missing ⚠️ Has telemetry
License Management ❌ Missing ❌ Missing
Lifecycle (Start/Stop) ❌ Missing ❌ Only create/terminate

Required Changes for Alignment

  1. Rename all "IOL VM" references to "CML Worker"
  2. Add AMI name tracking (already in domain, needs settings)
  3. Add CML-specific settings (HTTPS monitoring, licenses, idle thresholds)
  4. Add start/stop operations to AWS client
  5. Create proper DTOs or use domain entities
  6. Fix import errors immediately

Alignment with Project Objectives

From docs/project-objectives.md

✅ Implemented

  • EC2 instance creation
  • Instance termination
  • Instance details retrieval
  • Resource utilization monitoring

⚠️ Partially Implemented

  • List instances (works, but no filtering by CML-specific criteria)

❌ Not Implemented

  • Start stopped instances (critical!)
  • Stop running instances (critical!)
  • HTTPS service availability monitoring
  • CML API integration
  • License management
  • Idle detection automation
  • Cross-account AWS support (credentials hardcoded)

Priority Action Items

🚨 Immediate (Blocking)

  1. Fix import syntax error (line 9)
  2. Create or remove IolVmInstanceDto reference
  3. Add start_instance() method to AWS client
  4. Add stop_instance() method to AWS client

🔥 High Priority

  1. Rename all "IOL VM" → "CML Worker" terminology
  2. Add CML-specific settings (idle thresholds, HTTPS monitoring)
  3. Add AMI name to settings per region
  4. Fix code style issues (run ruff check --fix)

📋 Medium Priority

  1. Create specific exception types
  2. Extract boto3 client creation methods
  3. Add instance status check method
  4. Add tag management methods
  5. Improve telemetry to include active labs count

🎯 Nice to Have

  1. Add EBS snapshot management
  2. Add cost tracking per instance
  3. Add network performance metrics
  4. Add CloudWatch custom metrics for CML-specific data

Conclusion

Overall Assessment: 🟡 NEEDS SIGNIFICANT WORK

The code has a solid foundation but requires immediate fixes and substantial refactoring to align with the CML Worker domain model and project objectives. The critical import errors must be fixed before any other work can proceed.

Estimated Effort:

  • Critical fixes: 2-4 hours
  • High priority items: 1-2 days
  • Medium priority: 2-3 days
  • Full alignment: 1 week

Recommendation: Fix critical issues first, then systematically work through high-priority alignment tasks before building the application layer.