Observability Troubleshooting¶
This guide covers common observability issues and their solutions.
Table of Contents¶
- Traces Not Appearing
- Metrics Not Recording
- High Memory Usage
- Performance Issues
- Configuration Problems
- Data Quality Issues
Traces Not Appearing¶
Symptom¶
Traces not visible in Grafana/Tempo after making API requests.
Diagnosis¶
1. Check if OTEL Collector is running:
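For example, using the Compose service name used throughout this guide:
docker-compose ps otel-collector
# The STATUS column should show "Up"; if the service is missing or has exited, see Solutions below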
2. Check OTEL Collector logs:
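For example:
docker-compose logs otel-collector
# Or filter for problems directly
docker-compose logs otel-collector | grep -i error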
Look for errors like:
- connection refused
- deadline exceeded
- failed to export
3. Check application can reach collector:
# From host
curl http://localhost:4317
# From container
docker-compose exec app curl http://otel-collector:4317
4. Verify environment variables:
docker-compose exec app env | grep OTEL
# Should see:
# OTEL_TRACES_EXPORTER=otlp
# OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
5. Check Tempo is receiving data:
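One quick check, assuming Tempo's HTTP port (3200) is published to the host as in the default setup:
# Tempo readiness endpoint
curl http://localhost:3200/ready
# Check Tempo's logs for ingest errors
docker-compose logs tempo | grep -i error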
Solutions¶
Problem: Collector not running
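Start (or restart) the collector and confirm it stays up:
docker-compose up -d otel-collector
docker-compose ps otel-collector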
Problem: Wrong endpoint
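Point the application at the collector's OTLP endpoint (the value shown in Diagnosis step 4):
# In .env
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317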
Problem: Traces disabled
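Make sure the traces exporter is not set to none:
# In .env
OTEL_TRACES_EXPORTER=otlp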
Problem: Network issue
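Confirm the app container can reach the collector over the Compose network (the same check as Diagnosis step 3), and that both services are defined in the same docker-compose file:
docker-compose exec app curl -v http://otel-collector:4317
docker-compose ps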
Problem: Sampling rate too low
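Raise the sampling ratio while debugging; these are standard OpenTelemetry environment variables:
# In .env: sample 100% of traces
OTEL_TRACES_SAMPLER=traceidratio
OTEL_TRACES_SAMPLER_ARG=1.0
# Or always sample
OTEL_TRACES_SAMPLER=parentbased_always_on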
Verification¶
# Generate traffic
curl -X POST http://localhost:8000/api/tasks \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $TOKEN" \
-d '{"title":"Test","priority":"high"}'
# Check Grafana
open http://localhost:3000
# Navigate to Explore → Tempo → Search for "starter-app"
Metrics Not Recording¶
Symptom¶
Metrics not visible in Prometheus or Grafana.
Diagnosis¶
1. Check meter is initialized:
# In your code
from opentelemetry import metrics
meter = metrics.get_meter(__name__)
print(f"Meter: {meter}") # Should not be None
2. Check metrics are being called:
# Add debug logging
tasks_created.add(1, {"priority": "high"})
logger.debug("Recorded task creation metric")
3. Check environment variable:
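For example, confirm the metrics exporter is enabled inside the app container:
docker-compose exec app env | grep OTEL_METRICS_EXPORTER
# Should see:
# OTEL_METRICS_EXPORTER=otlp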
4. Check the collector is exposing metrics for Prometheus:
# Query the OTEL Collector's Prometheus exporter endpoint (port 8889)
curl http://localhost:8889/metrics
# Should see metrics like:
# starter_app_tasks_created_total{priority="high"} 5
5. Check Prometheus targets:
Open http://localhost:9090/targets
- Should show otel-collector:8889 as UP
Solutions¶
Problem: Meter not initialized
# Ensure Observability is configured
from neuroglia.observability import Observability
builder = WebApplicationBuilder(app_settings=app_settings)
Observability.configure(builder)
Problem: Wrong exporter
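Set the metrics exporter to OTLP so data reaches the collector (console only prints locally and none drops data):
# In .env
OTEL_METRICS_EXPORTER=otlp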
Problem: Metrics not exported
# Check export interval
OTEL_METRIC_EXPORT_INTERVAL=60000 # 60 seconds
# For faster updates in development
OTEL_METRIC_EXPORT_INTERVAL=5000 # 5 seconds
Problem: Prometheus not scraping
# Check deployment/prometheus.yml
scrape_configs:
  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:8889']  # Correct port
Verification¶
# Query Prometheus directly
curl 'http://localhost:9090/api/v1/query?query=starter_app_tasks_created_total'
# Or use Grafana
open http://localhost:3000
# Navigate to Explore → Prometheus → Metrics Browser
High Memory Usage¶
Symptom¶
Application memory grows continuously or spikes.
Diagnosis¶
1. Check span attributes:
Look for large attributes in your code:
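One quick way to review every attribute you set (assuming the application code lives under src/):
grep -rn "set_attribute" src/
# Review any call that passes large strings, serialized objects, or full payloads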
2. Check metric cardinality:
# Check number of unique label combinations
curl http://localhost:8889/metrics | grep starter_app_tasks | wc -l
If > 1000 time series, you have high cardinality.
3. Check for span leaks:
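Spans created with start_span() must be ended explicitly, while spans used as context managers end themselves. A quick way to spot candidates (again assuming a src/ layout):
# Manually started spans: each needs a matching span.end()
grep -rn "start_span(" src/
# Context-managed spans end automatically
grep -rn "start_as_current_span(" src/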
4. Monitor memory:
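For example, watch container memory in real time (the exact container name depends on your Compose project):
docker stats
# Or only the app container
docker stats $(docker-compose ps -q app)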
Solutions¶
Problem: Large span attributes
# ❌ Bad
span.set_attribute("task.description", task.description) # Could be 10KB
# ✅ Good: Truncate
description = task.description[:100] if task.description else ""
span.set_attribute("task.description_preview", description)
Problem: High metric cardinality
# ❌ Bad: Unique IDs as labels
tasks_created.add(1, {"task_id": task_id}) # Millions of values!
# ✅ Good: Categorical labels
tasks_created.add(1, {"priority": "high", "status": "pending"})
Problem: Too many spans
# ❌ Bad: Span per item
for item in items:  # 10,000 items
    with tracer.start_as_current_span("process_item"):
        process(item)
# ✅ Good: One span for batch
with tracer.start_as_current_span("process_items") as span:
    span.set_attribute("items.count", len(items))
    for item in items:
        process(item)
Problem: Batch size too large
# In otel-collector-config.yaml
processors:
  batch:
    timeout: 1s
    send_batch_size: 512  # Reduce from 1024
Problem: Sampling rate too high
# Reduce sampling in production
OTEL_TRACES_SAMPLER=traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1 # 10% instead of 100%
Performance Issues¶
Symptom¶
Application is slower after adding observability.
Diagnosis¶
1. Measure overhead:
import time
# Without instrumentation
start = time.perf_counter()
result = await operation()
baseline = time.perf_counter() - start
# With instrumentation
start = time.perf_counter()
with tracer.start_as_current_span("operation"):
    result = await operation()
instrumented = time.perf_counter() - start
overhead = (instrumented - baseline) / baseline * 100
print(f"Overhead: {overhead:.2f}%")
2. Check export blocking:
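Exports should run on a background thread; look for exporter timeouts or dropped-data warnings in the application logs (exact log wording varies by SDK version, so treat these patterns as a starting point):
docker-compose logs app | grep -iE "timeout|export|dropped"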
3. Profile the application:
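A minimal sketch using the standard library's cProfile, where operation() is a placeholder for whichever synchronous call path you suspect (for a running async service, a sampling profiler is usually more practical):
import cProfile
import pstats
with cProfile.Profile() as profiler:
    operation()  # placeholder for the code under test
stats = pstats.Stats(profiler)
stats.sort_stats("cumulative").print_stats(10)  # top 10 by cumulative time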
Solutions¶
Problem: Synchronous exports (blocking)
OpenTelemetry SDK uses async exports by default. Verify:
# Should use BatchSpanProcessor (async)
from opentelemetry.sdk.trace.export import BatchSpanProcessor
# Not SimpleSpanProcessor (blocking)
Problem: Too many spans
# Reduce span granularity
# Only instrument significant operations (> 10ms)
# ❌ Bad: Span for trivial operation
with tracer.start_as_current_span("get_priority"):  # < 1ms
    priority = task.priority
# ✅ Good: No span for trivial operation
priority = task.priority
Problem: Large attributes
# Limit attribute size
MAX_ATTR_SIZE = 1024 # 1KB
def set_safe_attribute(span, key, value):
    if isinstance(value, str) and len(value) > MAX_ATTR_SIZE:
        value = value[:MAX_ATTR_SIZE] + "..."
    span.set_attribute(key, value)
Problem: High export frequency
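Lengthen the export intervals; both settings below are standard OpenTelemetry SDK environment variables:
# In .env
OTEL_METRIC_EXPORT_INTERVAL=60000  # export metrics every 60s
OTEL_BSP_SCHEDULE_DELAY=5000       # batch span processor flush delay (ms)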
Configuration Problems¶
Environment Variables Not Loaded¶
Symptom: Observability not working despite correct configuration.
Solution:
# Restart application after changing .env
docker-compose restart app
# Verify variables are loaded
docker-compose exec app env | grep OTEL
Wrong Service Name¶
Symptom: Can't find traces/metrics for your service.
Solution:
# Check service name
echo $OTEL_SERVICE_NAME
# Should match what you search for in Grafana
# Update if wrong
OTEL_SERVICE_NAME=starter-app
Collector Configuration Not Applied¶
Symptom: Collector changes not taking effect.
Solution:
# Restart collector after config changes
docker-compose restart otel-collector
# Check for syntax errors
docker-compose logs otel-collector | grep -i error
Data Quality Issues¶
Missing Span Attributes¶
Problem: Attributes not showing in Grafana/Tempo.
Cause: Attributes set after span ends.
Solution:
# ❌ Bad: Attribute set after span ends
with tracer.start_as_current_span("operation") as span:
    result = do_work()
# Span ended here!
span.set_attribute("result", result)  # Too late!
# ✅ Good: Attribute set before span ends
with tracer.start_as_current_span("operation") as span:
    result = do_work()
    span.set_attribute("result", result)  # Within span scope
Broken Trace Context¶
Problem: Spans appear as separate traces instead of one trace.
Cause: Context not propagated.
Solution:
# ❌ Bad: Starting new trace
span = tracer.start_span("operation") # No parent context!
# ✅ Good: Using current context
with tracer.start_as_current_span("operation") as span:
    ...
For async operations:
# ❌ Bad: Context lost in background task
async def handler():
    with tracer.start_as_current_span("parent"):
        asyncio.create_task(background_work())  # Context lost!
# ✅ Good: Pass context explicitly
from opentelemetry import context
async def handler():
    with tracer.start_as_current_span("parent"):
        ctx = context.get_current()
        asyncio.create_task(background_work(ctx))
async def background_work(ctx):
    with tracer.start_as_current_span("background", context=ctx):
        ...
Metrics Not Aggregating¶
Problem: Seeing duplicate metrics instead of aggregated values.
Cause: Inconsistent label names or values.
Solution:
# ❌ Bad: Inconsistent labels
tasks_created.add(1, {"priority": "HIGH"})
tasks_created.add(1, {"priority": "high"}) # Different value!
tasks_created.add(1, {"prio": "high"}) # Different key!
# ✅ Good: Consistent labels
tasks_created.add(1, {"priority": "high"})
tasks_created.add(1, {"priority": "high"}) # Same key and value
Getting Help¶
Debug Mode¶
Enable verbose logging:
# In .env
OTEL_LOG_LEVEL=debug
OTEL_TRACES_EXPORTER=console,otlp # Export to both console and collector
Check application logs:
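For example:
make logs-app
# Or directly
docker-compose logs -f app | grep -i otel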
Collector Debug¶
Enable debug logging in collector:
# otel-collector-config.yaml
exporters:
  logging:
    loglevel: debug  # Change from 'info'
service:
  pipelines:
    traces:
      exporters: [otlp/tempo, logging]  # Add logging exporter
Test Configuration¶
Use minimal config to isolate issues:
# Minimal .env for testing
OTEL_SERVICE_NAME=starter-app
OTEL_TRACES_EXPORTER=console
OTEL_METRICS_EXPORTER=console
# Run app and check logs
make logs-app
Useful Commands¶
# Check all services
docker-compose ps
# View all logs
docker-compose logs -f
# Check specific service
docker-compose logs otel-collector
docker-compose logs tempo
docker-compose logs prometheus
# Restart everything
docker-compose restart
# Clean restart
docker-compose down
docker-compose up -d
Related Documentation¶
- Observability Overview - Concepts and introduction
- Configuration - Configuration options
- Getting Started - Quick start guide
- Architecture - Technical architecture