Observability Overview
This guide provides an overview of observability in the Cml Cloud Manager, covering the three pillars of observability and how they work together to help you understand your system.
What is Observability?
Observability is the ability to understand the internal state of a system by examining its external outputs. Unlike monitoring (which tells you what is broken), observability helps you understand why it's broken.
Observability vs Monitoring
| Aspect | Monitoring | Observability |
|---|---|---|
| Focus | Known failures | Unknown failures |
| Approach | Predefined dashboards | Exploratory analysis |
| Questions | "Is the system up?" | "Why is it behaving this way?" |
| Data | Aggregated metrics | High-cardinality data |
| Response | Alert on thresholds | Investigate root cause |
Why Observability Matters
- Debug Production Issues: Understand complex distributed system behavior
- Performance Optimization: Identify bottlenecks and optimize slow operations
- User Experience: Track real user impact of bugs and performance issues
- Business Insights: Correlate technical metrics with business outcomes
- Proactive Problem Detection: Catch issues before they impact users
The Three Pillars
The Cml Cloud Manager implements all three pillars of observability using OpenTelemetry:
1. Metrics
What: Numerical measurements aggregated over time.
Examples:
- Request count per endpoint
- Task creation rate
- Database query latency
- Memory usage
Use Cases:
- Dashboards and real-time monitoring
- Alerting on thresholds
- Capacity planning
- Performance trending
Learn More: Metrics Guide
2. Traces
What: Request paths through your distributed system showing timing and relationships.
Examples:
- End-to-end request flow from API → Handler → Repository → Database
- Service dependencies and call graphs
- Operation timing and bottlenecks
- Error propagation paths
Use Cases:
- Debugging slow requests
- Understanding system architecture
- Finding performance bottlenecks
- Troubleshooting distributed transactions
Learn More: Tracing Guide
3. Logs
What: Timestamped event records with contextual information.
Examples:
- Application errors and exceptions
- Business events (task created, user logged in)
- Debug information
- Security audit trails
Use Cases:
- Debugging specific issues
- Security auditing
- Compliance and audit trails
- Root cause analysis
Status: Structured logging is implemented. OpenTelemetry log export is disabled, so logs are emitted through traditional logging.
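The project's actual logging configuration is not shown here, so the following is only a minimal sketch of what structured logging with a trace ID can look like using Python's standard `logging` module. The `JsonFormatter` class, the `build_logger` helper, the logger name, and the example trace ID are all hypothetical.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # trace_id is attached via `extra=`; it is None when not supplied.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


def build_logger(name: str = "cml_cloud_manager.example") -> logging.Logger:
    """Return a logger that writes JSON lines to stderr (hypothetical helper)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = build_logger()
    # Hypothetical trace ID; in practice it would come from the active span.
    log.info("task created", extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"})
```

Emitting the trace ID as a field in every record is what makes it possible to jump from a log line to the matching trace in Grafana/Tempo.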
Quick Links
Documentation
- Architecture - Technical architecture and components
- Getting Started - Quick start guide and examples
- Configuration - Environment variables and configuration options
- Best Practices - Naming conventions, cardinality control, and patterns
- Troubleshooting - Common issues and solutions
Instrumentation Guides
- Metrics Guide - How to add custom metrics
- Tracing Guide - How to add custom spans
Tools
- Grafana: http://localhost:3000 - View traces and metrics
- Prometheus: http://localhost:9090 - Query metrics directly
- OTEL Collector: http://localhost:4317 - Telemetry collection endpoint (4317 is the conventional OTLP gRPC port, so it receives telemetry rather than serving a web UI)
Getting Started
1. Start the Stack
2. Generate Telemetry
# Make API requests
curl -X POST http://localhost:8000/api/tasks \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_TOKEN" \
-d '{"title": "Test Task", "priority": "high"}'
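If you prefer to generate telemetry from a script, the same request can be sketched in Python with the standard library. The URL, headers, and body mirror the curl command above; the `build_create_task_request` helper is a hypothetical name, and `YOUR_TOKEN` remains a placeholder.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/api/tasks"


def build_create_task_request(token: str, title: str, priority: str) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    body = json.dumps({"title": title, "priority": priority}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


if __name__ == "__main__":
    request = build_create_task_request("YOUR_TOKEN", "Test Task", "high")
    # Sending the request requires the stack to be running locally.
    with urllib.request.urlopen(request, timeout=10) as response:
        print(response.status, response.read().decode("utf-8"))
```

Calling this in a loop is one way to produce enough traffic for the rate queries later in this guide to return non-zero values.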
3. View Data
Traces in Grafana:
- Open http://localhost:3000
- Navigate to Explore → Tempo
- Search for service: cml-cloud-manager
Metrics in Prometheus:
- Open http://localhost:9090
- Execute query: rate(cml_cloud_manager_tasks_created_total[5m])
Full Guide: See Getting Started
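The same query can also be run outside the Prometheus UI. The sketch below uses Prometheus's HTTP API (`/api/v1/query`) with the standard library, assuming Prometheus is reachable at the URL listed above. The helper names `build_query_url` and `parse_instant_vector` are hypothetical.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"
QUERY = "rate(cml_cloud_manager_tasks_created_total[5m])"


def build_query_url(base_url: str, promql: str) -> str:
    """Return the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_instant_vector(payload: dict) -> list[tuple[dict, float]]:
    """Return (labels, value) pairs from an instant-query response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    # Each sample value is a [timestamp, "string value"] pair.
    return [(s["metric"], float(s["value"][1])) for s in payload["data"]["result"]]


if __name__ == "__main__":
    url = build_query_url(PROMETHEUS_URL, QUERY)
    with urllib.request.urlopen(url, timeout=10) as resp:
        for labels, value in parse_instant_vector(json.load(resp)):
            print(labels, value)
```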
Common Use Cases
Debug a Slow Request
- Find slow traces in Grafana/Tempo (> 500ms)
- Identify the slowest span
- Check database queries and N+1 patterns
- Optimize with indexes, caching, or query improvements
Full Workflow: See Getting Started - Workflow 1
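To make "identify the slowest span" and "check N+1 patterns" concrete, here is a small sketch that works on span data you have already exported from a trace. The `Span` dataclass is a hypothetical, simplified shape, not Tempo's actual response format. The self-time calculation assumes child spans run sequentially, so it can be misleading for concurrent children.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    duration_ms: float


def self_times(spans: list[Span]) -> dict[str, float]:
    """Duration of each span minus the time spent in its direct children."""
    child_total: Counter = Counter()
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] += span.duration_ms
    return {s.span_id: s.duration_ms - child_total[s.span_id] for s in spans}


def slowest_span(spans: list[Span]) -> Span:
    """Span with the largest self time, i.e. where the time is actually spent."""
    times = self_times(spans)
    return max(spans, key=lambda s: times[s.span_id])


def repeated_children(spans: list[Span], threshold: int = 5) -> dict[tuple[str, str], int]:
    """(parent_id, span name) pairs seen at least `threshold` times: a hint of an N+1 pattern."""
    counts = Counter((s.parent_id, s.name) for s in spans if s.parent_id is not None)
    return {key: n for key, n in counts.items() if n >= threshold}
```

Many identically named database spans under one parent usually mean one query per item, which a single batched query or a cache can replace.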
Monitor Task Creation Rate
# View rate in Prometheus or Grafana
rate(cml_cloud_manager_tasks_created_total[5m])
# Alert when the rate drops below 1 task per second (Prometheus alerting rule fragment)
alert: TaskCreationRateLow
expr: rate(cml_cloud_manager_tasks_created_total[5m]) < 1
Full Workflow: See Getting Started - Workflow 2
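For intuition about what the `rate(...) < 1` expression evaluates, the sketch below computes a simplified per-second rate from raw counter samples. It handles counter resets but, unlike Prometheus's real `rate()`, does not extrapolate to the window boundaries, so treat it as an illustration only. All function names are hypothetical.

```python
Sample = tuple[float, float]  # (unix timestamp in seconds, counter value)


def counter_increase(samples: list[Sample]) -> float:
    """Total increase across the window, treating any drop as a counter reset."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so count `cur` itself.
        total += cur - prev if cur >= prev else cur
    return total


def simple_rate(samples: list[Sample]) -> float:
    """Average per-second increase between the first and last sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples in the window")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must be ordered by increasing timestamp")
    return counter_increase(samples) / elapsed


def task_creation_rate_low(samples: list[Sample], threshold: float = 1.0) -> bool:
    """Mirror of the alert expression: rate(...) < 1."""
    return simple_rate(samples) < threshold
```

Because the rate is per second, a threshold of 1 means fewer than one task created per second on average over the window; adjust it to match your expected traffic.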
Root Cause Analysis
When a production error occurs:
- Check logs for error message and trace ID
- Find trace in Grafana/Tempo using trace ID
- Analyze trace to see request path
- Identify failure point (which span failed)
- Check span attributes for context
- Correlate with metrics in Prometheus
Full Workflow: See Getting Started - Workflow 3
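The first steps of this workflow can be sketched in code: pull the trace ID out of a structured log line, then locate the span where the error most likely originated. This assumes JSON log lines with a `trace_id` field and a simplified, hypothetical `TraceSpan` shape. Neither is confirmed to match the project's actual log or trace formats.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TraceSpan:
    span_id: str
    parent_id: Optional[str]
    name: str
    status: str  # "OK" or "ERROR"
    attributes: dict = field(default_factory=dict)


def trace_id_from_log(line: str) -> Optional[str]:
    """Return the trace ID carried by a JSON log line, if any."""
    return json.loads(line).get("trace_id")


def failure_origin(spans: list[TraceSpan]) -> list[TraceSpan]:
    """Error spans with no erroring child: the deepest point where the failure appears."""
    parents_of_errors = {s.parent_id for s in spans if s.status == "ERROR"}
    return [s for s in spans if s.status == "ERROR" and s.span_id not in parents_of_errors]
```

Errors typically propagate upward through parent spans, so the deepest failing span and its attributes are usually the best place to start reading.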
Technology Stack
The Cml Cloud Manager uses:
- OpenTelemetry - Vendor-neutral instrumentation (CNCF project)
- OTEL Collector - Telemetry aggregation and routing
- Grafana Tempo - Distributed tracing backend
- Prometheus - Time-series metrics database
- Grafana - Unified visualization and dashboards
- Neuroglia Observability - Python framework integration
Learn More: See Architecture
Real-Time Channel & Observability
The SSE stream (see Real-Time Updates) complements traditional telemetry:
- Provides immediate surface-level visibility (connection health badge, live worker lifecycle toasts).
- Can be correlated with traces and metrics (e.g., a worker.status.updated event followed by related spans).
- Helps distinguish UI latency issues (e.g., missed events due to disconnect) from backend performance problems.
When troubleshooting, always check whether the SSE badge is Live; a disconnected badge can explain stale UI data even when metrics/traces look healthy.
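When correlating SSE events with traces, it can help to capture the raw stream and inspect it directly. Below is a minimal sketch of a parser for the standard Server-Sent Events wire format. The `worker.status.updated` event name comes from this guide, while the payload fields in the demo (`worker_id`, `status`) are hypothetical.

```python
import json
from typing import Iterable, Iterator


def _field_value(line: str, prefix: str) -> str:
    """Strip the field prefix and, per the SSE format, one optional leading space."""
    value = line[len(prefix):]
    return value[1:] if value.startswith(" ") else value


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield {"event": ..., "data": ...} for each blank-line-terminated SSE message."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if data:  # a blank line dispatches the buffered message
                yield {"event": event, "data": "\n".join(data)}
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("event:"):
            event = _field_value(line, "event:")
        elif line.startswith("data:"):
            data.append(_field_value(line, "data:"))


if __name__ == "__main__":
    captured = [
        ": keep-alive\n",
        "event: worker.status.updated\n",
        'data: {"worker_id": "w1", "status": "running"}\n',
        "\n",
    ]
    for message in parse_sse(captured):
        print(message["event"], json.loads(message["data"]))
```

Recording the arrival time of each parsed event next to its name gives you a timeline to compare against span timestamps in Grafana/Tempo.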
Next Steps
- Read Architecture Guide - Understand how components work together
- Follow Getting Started - Set up and use observability features
- Add Custom Metrics - Instrument your code with metrics
- Add Custom Traces - Add spans to track operations
- Review Best Practices - Follow observability patterns
- Configure for Production - Optimize settings for production