Getting Started with Observability¶
This guide provides a quick start to using observability features in the Starter App.
Prerequisites¶
- Docker Compose - runs the OTEL Collector, Tempo, Prometheus, and Grafana
- Poetry - Python dependency management
- Running Application - the Starter App must be running
Quick Start¶
1. Start Observability Stack¶
# Start all services including OTEL Collector, Tempo, and Prometheus
make up
# Or start specific services
docker-compose up -d otel-collector tempo prometheus grafana
2. Verify Services¶
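First confirm the containers are up. One way to check, assuming the Docker Compose setup from step 1:

# List the observability services and their ports
docker-compose ps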
Expected Output:
NAME STATUS PORTS
otel-collector Up 0.0.0.0:4317->4317/tcp
tempo Up
prometheus Up 0.0.0.0:9090->9090/tcp
grafana Up 0.0.0.0:3000->3000/tcp
3. Access Web UIs¶
Open the following URLs in your browser:
- Grafana: http://localhost:3000
  - Default credentials: admin/admin
  - Used for viewing traces (Tempo) and metrics (Prometheus)
- Prometheus: http://localhost:9090
  - Direct access to metrics
  - Execute PromQL queries
4. Generate Telemetry¶
Make API requests to generate traces and metrics:
# Create a task (generates trace and metrics)
curl -X POST http://localhost:8000/api/tasks \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_TOKEN" \
-d '{
"title": "Test Task",
"description": "Testing observability",
"priority": "high"
}'
# List tasks
curl -X GET http://localhost:8000/api/tasks \
-H "Authorization: Bearer YOUR_TOKEN"
# Update a task
curl -X PUT http://localhost:8000/api/tasks/{task_id} \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_TOKEN" \
-d '{
"title": "Updated Task",
"status": "in_progress"
}'
Viewing Traces¶
In Grafana (Recommended)¶
1. Open Grafana: http://localhost:3000
2. Navigate to Explore:
   - Click the Explore icon (compass) in the left sidebar
3. Select the Tempo Data Source:
   - In the dropdown at the top, select Tempo
4. Search for Traces:
   - Search tab: search by service name
     - Service Name: starter-app
     - Click Run Query
   - TraceQL tab: use TraceQL queries (see the example below)
5. View Trace Details:
   - Click on any trace in the results
   - Explore the span hierarchy
   - Check span attributes and timing
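In the TraceQL tab, queries like the following can be used; the first returns all traces from the service named in the Search example above, and the second adds an illustrative 100ms duration filter:

{ resource.service.name = "starter-app" }
{ resource.service.name = "starter-app" && duration > 100ms }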
Example: Task Creation Trace¶
When you create a task, you'll see a trace like this:
starter-app: POST /api/tasks (200ms)
├─ create_task_entity (15ms)
│ ├─ span: validate input (3ms)
│ └─ span: create domain object (12ms)
├─ motor_task_repository.add (150ms)
│ ├─ pymongo.insert_one (140ms)
│ │ Attributes:
│ │ ├─ db.system: mongodb
│ │ ├─ db.operation: insert
│ │ └─ db.collection: tasks
│ └─ publish_domain_events (10ms)
└─ record_metrics (5ms)
What to Look For:
- Duration: Total and per-span timing
- Attributes: Task ID, priority, status, user ID
- Events: Milestones within spans
- Errors: Exceptions and error status
Viewing Metrics¶
In Prometheus¶
1. Open Prometheus: http://localhost:9090
2. Execute Queries:
   - Click on the Graph tab
   - Enter a PromQL query
   - Click Execute
Example Queries:
# Task creation rate (per second)
rate(starter_app_tasks_created_total[5m])
# Total tasks created
starter_app_tasks_created_total
# Task processing time (95th percentile)
histogram_quantile(0.95, rate(starter_app_task_processing_time_bucket[5m]))
# Tasks by priority
sum by (priority) (starter_app_tasks_created_total)
In Grafana¶
1. Open Grafana: http://localhost:3000
2. Navigate to Explore:
   - Select the Prometheus data source
3. Build Queries:
   - Use the Metrics Browser or write PromQL
   - Visualize as Table or Graph
4. Create Dashboards (Optional):
   - Click + → Dashboard
   - Add panels with metric queries
   - Save the dashboard
Example Workflows¶
Workflow 1: Debug a Slow Request¶
Scenario: API endpoint is slow
Steps:
1. Identify slow traces in Grafana/Tempo (see the example queries below).
2. Analyze the trace:
   - Find the slowest span
   - Check database queries
   - Look for N+1 queries
3. Check metrics for patterns.
4. Fix the issue:
   - Add database indexes
   - Optimize queries
   - Implement caching
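For step 1, a TraceQL query in Grafana's Tempo data source surfaces slow requests; the 500ms threshold below is an arbitrary starting point. For step 3, the processing-time histogram from the Viewing Metrics section shows whether the latency is a trend or a one-off spike:

{ resource.service.name = "starter-app" && duration > 500ms }

# 95th percentile task processing time over the last 5 minutes
histogram_quantile(0.95, rate(starter_app_task_processing_time_bucket[5m]))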
Workflow 2: Monitor Task Creation¶
Scenario: Track task creation rate
Steps:
1. Query the task creation rate (see the example queries below).
2. Visualize in Grafana:
   - Create a graph panel
   - Group by priority
3. Set up alerts:
   - Create an alert rule
   - Threshold: < 1 (less than 1 task/sec)
   - Notification channel
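For steps 1 and 2, the counters from the Viewing Metrics section can be reused; the grouped query drives the Grafana panel and the alert threshold:

# Task creation rate (per second)
rate(starter_app_tasks_created_total[5m])

# Rate grouped by priority
sum by (priority) (rate(starter_app_tasks_created_total[5m]))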
Workflow 3: Root Cause Analysis¶
Scenario: Production error reported
Steps:
1. Check application logs:
   - Find the error message
   - Copy the trace ID from the log (see the example below)
2. Find the trace in Grafana/Tempo:
   - Paste the trace ID in the search
   - View the full trace
3. Analyze the failure:
   - Identify the failed span
   - Check error attributes
   - Review span events
4. Correlate with metrics.
5. Check context:
   - Span attributes: user ID, task ID, priority
   - Timing: when did it start failing?
   - Pattern: specific user/task type?
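A sketch for step 1, assuming the application logs are available through Docker Compose and error lines carry a trace_id field (the service name app and the field name are assumptions; adjust them to your setup):

# Pull trace IDs out of recent error log lines
docker-compose logs app | grep -i error | grep -o 'trace_id=[0-9a-f]*'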
Workflow 4: Performance Optimization¶
Scenario: Optimize task processing
Steps:
1. Capture baseline metrics (see the queries after this list).
2. Generate test load:
# Create multiple tasks
for i in {1..100}; do
curl -X POST http://localhost:8000/api/tasks \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $TOKEN" \
-d "{\"title\":\"Task $i\",\"priority\":\"high\"}"
done
3. Analyze traces:
   - Find common slow patterns
   - Identify bottlenecks
4. Implement optimizations:
   - Add indexes
   - Use batch operations
   - Optimize queries
5. Measure improvement (re-run the baseline queries and compare).
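For steps 1 and 5, the processing-time histogram from the Viewing Metrics section works as the before/after measurement; run the same queries before the change and again after deploying the optimization:

# Median and tail task processing time
histogram_quantile(0.50, rate(starter_app_task_processing_time_bucket[5m]))
histogram_quantile(0.95, rate(starter_app_task_processing_time_bucket[5m]))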
Next Steps¶
- Metrics Guide: Learn to add custom metrics
- Tracing Guide: Learn to add custom spans
- Configuration: Configure observability settings
- Best Practices: Follow observability best practices
- Troubleshooting: Solve common issues
Tips and Tricks¶
Enable Verbose Logging¶
For debugging observability issues:
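A minimal sketch, assuming the app honors the standard OpenTelemetry SDK environment variables (support varies by SDK; see the Configuration guide for the app's own settings):

# Raise the OTel SDK's internal log level, then restart the application
export OTEL_LOG_LEVEL=debug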
View Raw Telemetry¶
Check OTEL Collector logs:
make logs-otel
# Or filter for specific data
docker-compose logs otel-collector | grep "Trace"
docker-compose logs otel-collector | grep "Metric"
Test Without Backend¶
Use console exporter for development:
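One way to do this, assuming the app honors the standard OTel exporter environment variables (otherwise switch the exporter in the app's observability settings):

# Print traces and metrics to stdout instead of sending them to the OTEL Collector
export OTEL_TRACES_EXPORTER=console
export OTEL_METRICS_EXPORTER=console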
Telemetry will then be printed to the application logs.
Related Documentation¶
- Observability Overview - Concepts and introduction
- Architecture - Technical architecture
- Configuration - Detailed configuration
- Troubleshooting - Common issues