
Getting Started with Observability

This guide provides a quick start to using observability features in the Starter App.


Prerequisites

  1. Docker Compose - Running OTEL Collector, Tempo, and Prometheus
  2. Poetry - Python dependency management
  3. Running Application - Starter App must be running
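
To confirm the prerequisites quickly, the following checks can be run from the project root (the application port 8000 matches the curl examples later in this guide):

# Check tooling versions
docker-compose --version
poetry --version

# Check that the Starter App answers HTTP requests (prints the status code)
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/api/tasks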

Quick Start

1. Start Observability Stack

# Start all services including OTEL Collector, Tempo, and Prometheus
make up

# Or start specific services
docker-compose up -d otel-collector tempo prometheus grafana

2. Verify Services

# Check OTEL Collector logs
make logs-otel

# Check if services are running
docker-compose ps

Expected Output:

NAME                STATUS    PORTS
otel-collector      Up        0.0.0.0:4317->4317/tcp
tempo               Up
prometheus          Up        0.0.0.0:9090->9090/tcp
grafana             Up        0.0.0.0:3000->3000/tcp
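
The backends also expose health endpoints that can be probed directly. Prometheus serves /-/ready and Grafana serves /api/health; the OTEL Collector's OTLP gRPC port can be checked with a simple TCP test (ports assume the defaults shown above):

# Prometheus readiness
curl -s http://localhost:9090/-/ready

# Grafana health
curl -s http://localhost:3000/api/health

# OTLP gRPC port is accepting connections (requires netcat)
nc -z localhost 4317 && echo "otel-collector: port 4317 open"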

3. Access Web UIs

Open the following URLs in your browser:

  • Grafana: http://localhost:3000
    • Default credentials: admin / admin
    • Used for viewing traces (Tempo) and metrics (Prometheus)

  • Prometheus: http://localhost:9090
    • Direct access to metrics
    • Execute PromQL queries
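
If the UIs load but show no data, the Grafana HTTP API offers a quick check that the Tempo and Prometheus data sources are provisioned (this uses the default admin / admin credentials noted above):

# List the data sources configured in Grafana
curl -s -u admin:admin http://localhost:3000/api/datasources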

4. Generate Telemetry

Make API requests to generate traces and metrics:

# Create a task (generates trace and metrics)
curl -X POST http://localhost:8000/api/tasks \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -d '{
    "title": "Test Task",
    "description": "Testing observability",
    "priority": "high"
  }'

# List tasks
curl -X GET http://localhost:8000/api/tasks \
  -H "Authorization: Bearer YOUR_TOKEN"

# Update a task
curl -X PUT http://localhost:8000/api/tasks/{task_id} \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -d '{
    "title": "Updated Task",
    "status": "in_progress"
  }'
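
For the update request, {task_id} must be a real ID. Assuming the create response includes an id field (adjust the jq path to the actual response shape), the ID can be captured and reused like this:

# Create a task and capture its ID for follow-up requests (requires jq)
TASK_ID=$(curl -s -X POST http://localhost:8000/api/tasks \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -d '{"title": "Test Task", "priority": "high"}' | jq -r '.id')

curl -X PUT "http://localhost:8000/api/tasks/$TASK_ID" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -d '{"status": "in_progress"}'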

Viewing Traces

  1. Open Grafana: http://localhost:3000

  2. Navigate to Explore:
    • Click the Explore icon (compass) in the left sidebar

  3. Select the Tempo Data Source:
    • In the dropdown at the top, select Tempo

  4. Search for Traces:
    • Search tab: search by service name
      • Service Name: starter-app
      • Click Run Query
    • TraceQL tab: use a TraceQL query

    { resource.service.name = "starter-app" && span.http.status_code = 200 }

  5. View Trace Details:
    • Click on any trace in the results
    • Explore the span hierarchy
    • Check span attributes and timing

Example: Task Creation Trace

When you create a task, you'll see a trace like this:

starter-app: POST /api/tasks (200ms)
├─ create_task_entity (15ms)
│  ├─ span: validate input (3ms)
│  └─ span: create domain object (12ms)
├─ motor_task_repository.add (150ms)
│  ├─ pymongo.insert_one (140ms)
│  │  Attributes:
│  │  ├─ db.system: mongodb
│  │  ├─ db.operation: insert
│  │  └─ db.collection: tasks
│  └─ publish_domain_events (10ms)
└─ record_metrics (5ms)

What to Look For:

  • Duration: Total and per-span timing
  • Attributes: Task ID, priority, status, user ID
  • Events: Milestones within spans
  • Errors: Exceptions and error status
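
To jump straight to problem traces, TraceQL's intrinsic fields can be combined with resource and span attributes. Assuming the service records the http.status_code attribute used in the search example above, filters like these narrow the results to failures:

# Traces where a span reported an error status
{ resource.service.name = "starter-app" && status = error }

# Traces containing a server error response
{ resource.service.name = "starter-app" && span.http.status_code >= 500 }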

Viewing Metrics

In Prometheus

  1. Open Prometheus: http://localhost:9090

  2. Execute Queries:
    • Click the Graph tab
    • Enter a PromQL query
    • Click Execute

Example Queries:

# Task creation rate (per second)
rate(starter_app_tasks_created_total[5m])

# Total tasks created
starter_app_tasks_created_total

# Task processing time (95th percentile)
histogram_quantile(0.95, rate(starter_app_task_processing_time_bucket[5m]))

# Tasks by priority
sum by (priority) (starter_app_tasks_created_total)
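
The same queries can also be run from the command line through the Prometheus HTTP API, which is convenient for scripted checks:

# Run a PromQL query via the Prometheus HTTP API
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(starter_app_tasks_created_total[5m])'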

In Grafana

  1. Open Grafana: http://localhost:3000

  2. Navigate to Explore:
    • Select the Prometheus data source

  3. Build Queries:
    • Use the Metrics Browser or write PromQL
    • Visualize as Table or Graph

  4. Create Dashboards (Optional):
    • Click + Dashboard
    • Add panels with metric queries
    • Save the dashboard
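
Dashboards can also be created through Grafana's dashboard API, which makes it easy to keep them in version control. A minimal sketch using the default admin / admin credentials (panel definitions omitted):

# Create an empty dashboard via the Grafana API
curl -s -u admin:admin -X POST http://localhost:3000/api/dashboards/db \
  -H "Content-Type: application/json" \
  -d '{
    "dashboard": { "id": null, "title": "Starter App Observability", "panels": [] },
    "overwrite": false
  }'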

Example Workflows

Workflow 1: Debug a Slow Request

Scenario: API endpoint is slow

Steps:

  1. Identify slow traces in Grafana/Tempo:

    { resource.service.name = "starter-app" && duration > 500ms }

  2. Analyze the trace:
    • Find the slowest span
    • Check database queries
    • Look for N+1 queries

  3. Check metrics for patterns:

    # Response time trend
    histogram_quantile(0.95, rate(http_server_duration_bucket[5m]))

  4. Fix the issue:
    • Add database indexes (a sketch follows this list)
    • Optimize queries
    • Implement caching
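
If the trace points at the MongoDB spans (for example pymongo.insert_one or a slow query against the tasks collection), an index is often the first fix. A hypothetical sketch using mongosh; the mongodb service name, starter_app database name, and field names are assumptions to adapt to the actual schema:

# Add an index on fields commonly used to filter tasks (hypothetical names)
docker-compose exec mongodb mongosh starter_app \
  --eval 'db.tasks.createIndex({ user_id: 1, status: 1 })'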

Workflow 2: Monitor Task Creation

Scenario: Track task creation rate

Steps:

  1. Query the task creation rate:

    rate(starter_app_tasks_created_total[5m])

  2. Visualize in Grafana:
    • Create a graph panel
    • Group by priority:

    sum by (priority) (rate(starter_app_tasks_created_total[5m]))

  3. Set up alerts (the matching expression follows this list):
    • Create an alert rule
    • Threshold: < 1 (less than 1 task/sec)
    • Configure a notification channel
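
The alert condition in step 3 corresponds to a PromQL expression that fires when the creation rate drops below one task per second:

# Fires when task creation falls below 1/sec, averaged over 5 minutes
rate(starter_app_tasks_created_total[5m]) < 1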

Workflow 3: Root Cause Analysis

Scenario: Production error reported

Steps:

  1. Check application logs:

    make logs-app

    • Find the error message
    • Copy the trace ID from the log (a grep shortcut follows this list)

  2. Find the trace in Grafana/Tempo:
    • Paste the trace ID into the search
    • View the full trace

  3. Analyze the failure:
    • Identify the failed span
    • Check error attributes
    • Review span events

  4. Correlate with metrics:

    # Error rate spike?
    rate(starter_app_tasks_failed_total[5m])

  5. Check context:
    • Span attributes: user ID, task ID, priority
    • Timing: when did it start failing?
    • Pattern: specific user/task type?
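
To locate the error line and its trace ID faster, the log output from step 1 can be filtered on the command line (if the make target follows the log stream, stop it with Ctrl-C):

# Show only error lines from the application logs
make logs-app 2>&1 | grep -i error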

Workflow 4: Performance Optimization

Scenario: Optimize task processing

Steps:

  1. Record baseline metrics:

    histogram_quantile(0.95, rate(starter_app_task_processing_time_bucket[5m]))

  2. Generate test load:

    # Create multiple tasks
    for i in {1..100}; do
      curl -X POST http://localhost:8000/api/tasks \
        -H "Content-Type: application/json" \
        -H "Authorization: Bearer $TOKEN" \
        -d "{\"title\":\"Task $i\",\"priority\":\"high\"}"
    done

  3. Analyze traces:
    • Find common slow patterns
    • Identify bottlenecks

  4. Implement optimizations:
    • Add indexes
    • Use batch operations
    • Optimize queries

  5. Measure improvement:

    # Before vs after
    histogram_quantile(0.95, rate(starter_app_task_processing_time_bucket[5m]))

Tips and Tricks

Enable Verbose Logging

For debugging observability issues:

# Add to .env
OTEL_LOG_LEVEL=debug

View Raw Telemetry

Check OTEL Collector logs:

make logs-otel

# Or filter for specific data
docker-compose logs otel-collector | grep "Trace"
docker-compose logs otel-collector | grep "Metric"

Test Without Backend

Use console exporter for development:

# In .env
OTEL_TRACES_EXPORTER=console
OTEL_METRICS_EXPORTER=console

Telemetry will be printed to application logs:

make logs-app
