Observability Overview

This guide provides an overview of observability in the Cml Cloud Manager, covering the three pillars of observability and how they work together to help you understand your system.

What is Observability?

Observability is the ability to understand the internal state of a system by examining its external outputs. Unlike monitoring (which tells you what is broken), observability helps you understand why it's broken.

Observability vs Monitoring

Aspect     | Monitoring             | Observability
-----------|------------------------|-------------------------------
Focus      | Known failures         | Unknown failures
Approach   | Predefined dashboards  | Exploratory analysis
Questions  | "Is the system up?"    | "Why is it behaving this way?"
Data       | Aggregated metrics     | High-cardinality data
Response   | Alert on thresholds    | Investigate root cause

Why Observability Matters

  • Debug Production Issues: Understand complex distributed system behavior
  • Performance Optimization: Identify bottlenecks and optimize slow operations
  • User Experience: Track real user impact of bugs and performance issues
  • Business Insights: Correlate technical metrics with business outcomes
  • Proactive Problem Detection: Catch issues before they impact users

The Three Pillars

The Cml Cloud Manager implements all three pillars of observability using OpenTelemetry:

1. Metrics

What: Numerical measurements aggregated over time.

Examples:

  • Request count per endpoint
  • Task creation rate
  • Database query latency
  • Memory usage

Use Cases:

  • Dashboards and real-time monitoring
  • Alerting on thresholds
  • Capacity planning
  • Performance trending

Learn More: Metrics Guide
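As a conceptual sketch (not the actual Cml Cloud Manager instrumentation), a labeled counter can be modeled with the standard library. Each unique set of label values becomes its own series, which is where the "high-cardinality data" from the comparison table comes from:

```python
from collections import Counter

# Conceptual counter metric: one entry per unique label set.
tasks_created = Counter()

def record_task_created(priority: str) -> None:
    # Increment the series identified by this label set.
    tasks_created[("priority", priority)] += 1

record_task_created("high")
record_task_created("high")
record_task_created("low")

print(tasks_created[("priority", "high")])  # 2
```

Dashboards and alerts then aggregate these series over time windows rather than inspecting individual events.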

2. Traces

What: Request paths through your distributed system showing timing and relationships.

Examples:

  • End-to-end request flow from API → Handler → Repository → Database
  • Service dependencies and call graphs
  • Operation timing and bottlenecks
  • Error propagation paths

Use Cases:

  • Debugging slow requests
  • Understanding system architecture
  • Finding performance bottlenecks
  • Troubleshooting distributed transactions

Learn More: Tracing Guide
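The parent/child structure described above can be sketched with a minimal, hypothetical tracer (not the real OpenTelemetry SDK); nested spans record who called whom and how long each step took:

```python
import time
from contextlib import contextmanager

spans = []   # finished spans, in completion order
_stack = []  # currently open span names

@contextmanager
def span(name):
    # Open a span: remember its parent and start time.
    start = time.perf_counter()
    parent = _stack[-1] if _stack else None
    _stack.append(name)
    try:
        yield
    finally:
        _stack.pop()
        spans.append({
            "name": name,
            "parent": parent,
            "duration_ms": (time.perf_counter() - start) * 1000,
        })

# Mirror the API -> Handler -> Repository flow from the examples above.
with span("POST /api/tasks"):
    with span("handler"):
        with span("repository.insert"):
            time.sleep(0.01)  # simulate a database call

for s in spans:
    print(f'{s["name"]} (parent={s["parent"]}): {s["duration_ms"]:.1f} ms')
```

Because inner spans finish first, the leaf ("repository.insert") appears first and the root ("POST /api/tasks") last, with each outer duration containing its children's.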

3. Logs

What: Timestamped event records with contextual information.

Examples:

  • Application errors and exceptions
  • Business events (task created, user logged in)
  • Debug information
  • Security audit trails

Use Cases:

  • Debugging specific issues
  • Security auditing
  • Compliance and audit trails
  • Root cause analysis

Status: Structured logging is implemented; OpenTelemetry log export is currently disabled (traditional logging is used instead).
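Structured logging along these lines can be sketched with the standard library. The JsonFormatter class and field names below are illustrative, not the project's actual logging setup:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    # Emit one JSON object per event so fields like trace_id are searchable.
    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            **getattr(record, "fields", {}),
        }
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("cml")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# A business event with contextual fields attached.
logger.info("task created", extra={"fields": {"task_id": 42, "trace_id": "abc123"}})
```

Emitting the trace ID in every log line is what makes the log-to-trace correlation in the root cause analysis workflow below possible.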

Documentation

Instrumentation Guides

Tools

  • Grafana: http://localhost:3000 - View traces and metrics
  • Prometheus: http://localhost:9090 - Query metrics directly
  • OTEL Collector: localhost:4317 - Telemetry collection endpoint (OTLP gRPC, not browsable)

Getting Started

1. Start the Stack

# Start all services
make up

2. Generate Telemetry

# Make API requests
curl -X POST http://localhost:8000/api/tasks \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -d '{"title": "Test Task", "priority": "high"}'
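The same request can be built with Python's standard library; YOUR_TOKEN remains a placeholder, exactly as in the curl example:

```python
import json
import urllib.request

# Same POST as the curl command above, stdlib only.
req = urllib.request.Request(
    "http://localhost:8000/api/tasks",
    data=json.dumps({"title": "Test Task", "priority": "high"}).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer YOUR_TOKEN",
    },
    method="POST",
)

# urllib.request.urlopen(req) would send it; skipped here so the sketch
# does not require a running server.
print(req.get_method(), req.full_url)
```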

3. View Data

Traces in Grafana:

  1. Open http://localhost:3000
  2. Navigate to Explore → Tempo
  3. Search for service: cml-cloud-manager

Metrics in Prometheus:

  1. Open http://localhost:9090
  2. Execute query: rate(cml_cloud_manager_tasks_created_total[5m])
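The same query can also be sent through Prometheus's HTTP API (GET /api/v1/query). This sketch only builds the URL, so no running server is needed:

```python
import urllib.parse

# URL-encode the PromQL expression for Prometheus's instant-query endpoint.
query = "rate(cml_cloud_manager_tasks_created_total[5m])"
url = "http://localhost:9090/api/v1/query?" + urllib.parse.urlencode({"query": query})
print(url)
```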

Full Guide: See Getting Started

Common Use Cases

Debug a Slow Request

  1. Find slow traces in Grafana/Tempo (> 500ms)
  2. Identify the slowest span
  3. Check database queries and N+1 patterns
  4. Optimize with indexes, caching, or query improvements
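Steps 2-3 amount to finding where the time is actually spent. A small sketch with illustrative span names and timings computes each span's self time (its duration minus its children's durations) and picks the largest:

```python
# Hypothetical spans from one slow trace (names and timings are made up).
spans = {
    "POST /api/tasks": {"parent": None, "duration_ms": 620.0},
    "handler.create_task": {"parent": "POST /api/tasks", "duration_ms": 610.0},
    "db.execute": {"parent": "handler.create_task", "duration_ms": 590.0},
}

# Self time = own duration minus the time spent in direct children.
self_time = {}
for name, s in spans.items():
    children = sum(c["duration_ms"] for c in spans.values() if c["parent"] == name)
    self_time[name] = s["duration_ms"] - children

slowest = max(self_time, key=self_time.get)
print(slowest, self_time[slowest])  # db.execute 590.0
```

Looking at self time rather than total duration avoids blaming the root span, which always covers the whole request.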

Full Workflow: See Getting Started - Workflow 1

Monitor Task Creation Rate

# View rate in Prometheus or Grafana
rate(cml_cloud_manager_tasks_created_total[5m])

# Alert when the rate drops (Prometheus alerting rules file;
# the group name here is arbitrary)
groups:
  - name: task-alerts
    rules:
      - alert: TaskCreationRateLow
        expr: rate(cml_cloud_manager_tasks_created_total[5m]) < 1

Full Workflow: See Getting Started - Workflow 2
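As a sketch of what rate(...[5m]) computes, here is the per-second increase of a counter between two samples (the sample values are illustrative):

```python
# (unix_seconds, counter_value) samples taken 5 minutes apart.
samples = [(0, 100), (300, 190)]

(t0, v0), (t1, v1) = samples[0], samples[-1]
per_second = (v1 - v0) / (t1 - t0)  # increase per second over the window
print(per_second)  # 0.3
```

The real rate() function also handles counter resets, which this sketch ignores.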

Root Cause Analysis

When a production error occurs:

  1. Check logs for error message and trace ID
  2. Find trace in Grafana/Tempo using trace ID
  3. Analyze trace to see request path
  4. Identify failure point (which span failed)
  5. Check span attributes for context
  6. Correlate with metrics in Prometheus
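Steps 1-2 can be sketched as pulling the trace ID out of a structured log line so it can be pasted into Grafana/Tempo (the log format and ID below are illustrative):

```python
import json

# A hypothetical structured error log line containing a trace ID.
log_line = '{"level": "ERROR", "message": "task creation failed", "trace_id": "4bf92f3577b34da6"}'

event = json.loads(log_line)
trace_id = event["trace_id"]  # paste this into the Tempo trace search
print(trace_id)  # 4bf92f3577b34da6
```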

Full Workflow: See Getting Started - Workflow 3

Technology Stack

The Cml Cloud Manager uses:

Learn More: See Architecture

Real-Time Channel & Observability

The SSE stream (see Real-Time Updates) complements traditional telemetry:

  • Provides immediate surface-level visibility (connection health badge, live worker lifecycle toasts).
  • Can be correlated with traces and metrics (e.g., a worker.status.updated event followed by related spans).
  • Helps distinguish UI latency issues (e.g., missed events due to disconnect) from backend performance problems.

When troubleshooting, always check whether the SSE badge is Live; a disconnected badge can explain stale UI data even when metrics/traces look healthy.
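As a rough illustration of what the SSE stream carries, here is a minimal parse of one event. Only the worker.status.updated event name comes from the example above; the payload fields are hypothetical:

```python
# One raw SSE event: fields are "name: value" lines, events end with a blank line.
raw = (
    "event: worker.status.updated\n"
    'data: {"worker_id": 7, "status": "running"}\n'
    "\n"
)

events = []
for block in raw.strip().split("\n\n"):
    # Split each "name: value" line into a field dict.
    fields = dict(line.split(": ", 1) for line in block.splitlines())
    events.append(fields)

print(events[0]["event"])  # worker.status.updated
```

In practice an EventSource client does this parsing for you; the point is that each event is small, named, and timestamped on arrival, which is what makes correlating it with spans feasible.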

Next Steps

  1. Read Architecture Guide - Understand how components work together
  2. Follow Getting Started - Set up and use observability features
  3. Add Custom Metrics - Instrument your code with metrics
  4. Add Custom Traces - Add spans to track operations
  5. Review Best Practices - Follow observability patterns
  6. Configure for Production - Optimize settings for production

Additional Resources

OpenTelemetry

Observability Concepts

Backend Tools

Framework