# OpenTelemetry Integration Guide

Infrastructure setup and deployment guide for production observability.

## Overview
This guide covers the comprehensive OpenTelemetry (OTEL) integration for the Neuroglia framework and Mario's Pizzeria application, providing full observability through distributed tracing, metrics, and structured logging.
## Documentation Map
This guide focuses on infrastructure provisioning and deployment. For a complete observability learning path:
- Start here for infrastructure setup (Docker Compose, Kubernetes)
- Observability Feature Guide - Developer instrumentation and API reference
- Tutorial: Mario's Pizzeria Observability - Step-by-step implementation
- Mario's Pizzeria Sample - Complete working example
### What This Guide Covers

- ✅ Complete observability stack architecture
- ✅ Docker Compose configuration for all components
- ✅ OTEL Collector setup and configuration
- ✅ Grafana, Tempo, Prometheus, and Loki integration
- ✅ Multi-application instrumentation patterns
- ✅ Production deployment considerations
- ✅ Troubleshooting and verification steps
### What This Guide Does NOT Cover
See the Observability Feature Guide for:
- Code instrumentation patterns (controllers, handlers, repositories)
- Choosing metric types (counter, gauge, histogram)
- Tracing decorators and manual instrumentation
- Data flow from application to dashboard
- Layer-specific implementation guidance
- API reference and configuration options
## Observability Pillars

### 1. Distributed Tracing
- Purpose: Track requests across services and layers
- Backend: Tempo (Grafana's distributed tracing system)
- Benefits: Understand request flow, identify bottlenecks, debug distributed systems
### 2. Metrics
- Purpose: Quantitative measurements of application performance
- Backend: Prometheus (time-series database)
- Benefits: Monitor performance trends, set alerts, capacity planning
### 3. Logging
- Purpose: Structured event records with trace correlation
- Backend: Loki (Grafana's log aggregation system)
- Benefits: Debug issues, audit trails, correlated with traces
## Architecture

```text
┌──────────────────────────────────────────────────────────┐
│                   Mario Pizzeria App                     │
│  ┌────────────────────────────────────────────────────┐  │
│  │ OpenTelemetry SDK (Python)                         │  │
│  │  - TracerProvider (traces)                         │  │
│  │  - MeterProvider  (metrics)                        │  │
│  │  - LoggerProvider (logs)                           │  │
│  └──────────────────────┬─────────────────────────────┘  │
│                         │ OTLP/gRPC (4317) or HTTP (4318)│
└─────────────────────────┼────────────────────────────────┘
                          │
                          ▼
┌──────────────────────────────────────────────────────────┐
│           OpenTelemetry Collector (All-in-One)           │
│  ┌────────────────────────────────────────────────────┐  │
│  │ Receivers:  OTLP (gRPC 4317, HTTP 4318)            │  │
│  │ Processors: Batch, Memory Limiter, Resource        │  │
│  │ Exporters:  Tempo, Prometheus, Loki, Console       │  │
│  └────────────────────────────────────────────────────┘  │
└────────┬──────────────────┬──────────────────┬───────────┘
         │                  │                  │
         ▼                  ▼                  ▼
   ┌──────────┐       ┌──────────┐       ┌──────────┐
   │  Tempo   │       │Prometheus│       │   Loki   │
   │ (Traces) │       │(Metrics) │       │  (Logs)  │
   └────┬─────┘       └────┬─────┘       └────┬─────┘
        │                  │                  │
        └──────────────────┼──────────────────┘
                           │
                           ▼
                   ┌─────────────┐
                   │   Grafana   │
                   │ (Dashboard) │
                   └─────────────┘
```
## Components

### OpenTelemetry Collector (All-in-One)

- Image: `otel/opentelemetry-collector-contrib:latest`
- Purpose: Central hub for receiving, processing, and exporting telemetry
- Ports:
    - `4317`: OTLP gRPC receiver
    - `4318`: OTLP HTTP receiver
    - `8888`: Prometheus metrics about the collector itself
    - `13133`: Health check endpoint

### Grafana Tempo

- Image: `grafana/tempo:latest`
- Purpose: Distributed tracing backend
- Ports: `3200` (HTTP API), `9095` (gRPC), `4317` (OTLP gRPC)
- Storage: Local filesystem (configurable to S3, GCS, etc.)

### Prometheus

- Image: `prom/prometheus:latest`
- Purpose: Metrics storage and querying
- Ports: `9090` (Web UI and API)
- Scrape Interval: 15s

### Grafana Loki

- Image: `grafana/loki:latest`
- Purpose: Log aggregation and querying
- Ports: `3100` (HTTP API)
- Storage: Local filesystem

### Grafana

- Image: `grafana/grafana:latest`
- Purpose: Unified dashboard for traces, metrics, and logs
- Ports: `3001` (Web UI)
- Default Credentials: `admin`/`admin` (change on first login)
## Implementation Components

### 1. Framework Module: `neuroglia.observability`

Purpose: Provide reusable OpenTelemetry integration for all Neuroglia applications.

Key Features:

- Automatic instrumentation setup (FastAPI, HTTPX, logging)
- TracerProvider and MeterProvider initialization
- Context propagation configuration
- Resource detection (service name, version, host)
- Configurable exporters (OTLP, Console, Jaeger compatibility)

Public API:

```python
from neuroglia.observability import (
    configure_opentelemetry,
    get_tracer,
    get_meter,
    trace_async,  # Decorator for automatic tracing
    record_metric,
)

# Initialize OTEL (call once at startup)
configure_opentelemetry(
    service_name="mario-pizzeria",
    service_version="1.0.0",
    otlp_endpoint="http://otel-collector:4317",
    enable_console_export=False,
)

# Get tracer for manual instrumentation
tracer = get_tracer(__name__)

# Automatic tracing decorator
@trace_async()
async def process_order(order_id: str):
    # Automatically creates a span
    pass
```
### 2. Tracing Middleware

Layers Instrumented:

- ✅ HTTP Requests (automatic via FastAPI instrumentation)
- ✅ Commands (`CQRSTracingMiddleware`)
- ✅ Queries (`CQRSTracingMiddleware`)
- ✅ Event Handlers (`EventHandlerTracingMiddleware`)
- ✅ Repository Operations (`RepositoryTracingMixin`)
- ✅ External HTTP Calls (automatic via HTTPX instrumentation)

Span Attributes:

- `command.type`: Command class name
- `query.type`: Query class name
- `event.type`: Event class name
- `aggregate.id`: Aggregate identifier
- `repository.operation`: get/save/update/delete
- `http.method`, `http.url`, `http.status_code`
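For illustration, the mapping from a CQRS message to these attributes can be sketched in plain Python. The `build_span_attributes` helper and `PlaceOrderCommand` class below are hypothetical, not framework API; the real middleware sets the same keys on the active span via `span.set_attribute()`:

```python
from dataclasses import dataclass
from typing import Optional


def build_span_attributes(message: object, aggregate_id: Optional[str] = None) -> dict:
    """Derive CQRS span attributes from a message, by naming convention.

    Illustrative helper: classifies *Command / *Query / *Event classes
    into the attribute keys listed above.
    """
    attrs = {}
    name = type(message).__name__
    if name.endswith("Command"):
        attrs["command.type"] = name
    elif name.endswith("Query"):
        attrs["query.type"] = name
    elif name.endswith("Event"):
        attrs["event.type"] = name
    if aggregate_id is not None:
        attrs["aggregate.id"] = aggregate_id
    return attrs


@dataclass
class PlaceOrderCommand:
    order_id: str


attrs = build_span_attributes(PlaceOrderCommand(order_id="order_42"), aggregate_id="order_42")
print(attrs)  # {'command.type': 'PlaceOrderCommand', 'aggregate.id': 'order_42'}
```

Keying attributes off the class name is what makes the traces searchable by `command.type` or `query.type` in Tempo without any per-handler code.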
### 3. Metrics Collection

Business Metrics:

- `mario.orders.created` (counter): Total orders placed
- `mario.orders.completed` (counter): Total orders delivered
- `mario.orders.cancelled` (counter): Total cancelled orders
- `mario.pizzas.ordered` (counter): Total pizzas ordered
- `mario.orders.value` (histogram): Order value distribution

Technical Metrics:

- `neuroglia.command.duration` (histogram): Command execution time
- `neuroglia.query.duration` (histogram): Query execution time
- `neuroglia.event.processing.duration` (histogram): Event handler time
- `neuroglia.repository.operation.duration` (histogram): Repository operation time
- `neuroglia.http.request.duration` (histogram): HTTP request duration

Labels/Attributes:

- `service.name`: "mario-pizzeria"
- `command.type`: Command class name
- `query.type`: Query class name
- `event.type`: Event class name
- `repository.type`: Repository class name
- `status`: "success" | "error"
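A minimal sketch of how one of these duration histograms might be recorded with its labels, including the `status` success/error split. The in-memory `recorded` list is a stand-in for an OTEL instrument; real code would call `meter.create_histogram(...).record(value, attributes)`:

```python
import time
from contextlib import contextmanager

# Stand-in for an OTEL histogram instrument (illustrative only)
recorded = []


@contextmanager
def timed_metric(name: str, **attributes):
    """Time a block and record its duration with the labels above."""
    start = time.perf_counter()
    status = "success"
    try:
        yield
    except Exception:
        status = "error"
        raise
    finally:
        duration_ms = (time.perf_counter() - start) * 1000.0
        recorded.append((name, duration_ms, {**attributes, "status": status}))


with timed_metric(
    "neuroglia.command.duration",
    **{"service.name": "mario-pizzeria", "command.type": "PlaceOrderCommand"},
):
    pass  # handler body goes here

name, duration_ms, attrs = recorded[0]
print(name, attrs["status"])  # neuroglia.command.duration success
```

Recording the duration in a `finally` block guarantees the histogram sees failed executions too, labelled `status="error"`, which is what makes error-rate-by-command-type queries possible.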
### 4. Structured Logging

Features:

- JSON structured logs with trace context
- Automatic `trace_id` and `span_id` injection
- Log level filtering
- OTLP log export to Loki via collector

Log Format:

```json
{
  "timestamp": "2025-10-24T10:15:30.123Z",
  "level": "INFO",
  "message": "Order placed successfully",
  "service.name": "mario-pizzeria",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "order_id": "61a61887-4200-4d0c-85d3-45c2cdd9cc08",
  "customer_id": "cust_123",
  "total_amount": 25.5
}
```
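The trace-context injection can be sketched with stdlib logging alone. The `trace_ctx` contextvar and `extra_fields` attribute below are stand-ins for this illustration; the real formatter reads the ids from the active OpenTelemetry span:

```python
import contextvars
import json
import logging
from datetime import datetime, timezone

# Stand-in for the active span's context (illustrative only)
trace_ctx = contextvars.ContextVar("trace_ctx", default=("", ""))


class JsonTraceFormatter(logging.Formatter):
    """Render log records in the structured format shown above."""

    def format(self, record: logging.LogRecord) -> str:
        trace_id, span_id = trace_ctx.get()
        payload = {
            "timestamp": datetime.now(timezone.utc).isoformat(timespec="milliseconds"),
            "level": record.levelname,
            "message": record.getMessage(),
            "service.name": "mario-pizzeria",
            "trace_id": trace_id,
            "span_id": span_id,
        }
        # Merge structured extras (e.g. order_id, total_amount)
        payload.update(getattr(record, "extra_fields", {}) or {})
        return json.dumps(payload)


trace_ctx.set(("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7"))
record = logging.LogRecord("demo", logging.INFO, __file__, 0,
                           "Order placed successfully", None, None)
record.extra_fields = {"order_id": "order_42", "total_amount": 25.5}
line = JsonTraceFormatter().format(record)
print(line)
```

Because the formatter emits `trace_id` as a top-level JSON field, Loki can index it and Grafana can jump from a log line straight to the matching trace in Tempo.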
## FastAPI Multi-Application Instrumentation

### Critical Configuration for Multi-App Architectures

When building applications with multiple mounted FastAPI apps (a main app plus sub-apps), proper OpenTelemetry instrumentation configuration is crucial to avoid duplicate-metrics warnings and to ensure complete observability coverage.

### The Problem: Duplicate Instrumentation

❌ WRONG - Causes duplicate metric warnings:

```python
# This creates duplicate HTTP metrics instruments
from fastapi import FastAPI

from neuroglia.observability import instrument_fastapi_app

# Main application
app = FastAPI(title="Mario's Pizzeria")

# Sub-applications
api_app = FastAPI(title="API")
ui_app = FastAPI(title="UI")

# ❌ DON'T DO THIS - Causes warnings
instrument_fastapi_app(app, "main-app")
instrument_fastapi_app(api_app, "api-app")  # ⚠️ Duplicate metrics
instrument_fastapi_app(ui_app, "ui-app")    # ⚠️ Duplicate metrics

# Mount sub-apps
app.mount("/api", api_app)
app.mount("/", ui_app)
```

Error messages you'll see:

```text
WARNING An instrument with name http.server.duration, type Histogram...
has been created already.
WARNING An instrument with name http.server.request.size, type Histogram...
has been created already.
```
### ✅ CORRECT - Single Main App Instrumentation

The solution: only instrument the main app that contains the mounted sub-apps.

```python
from fastapi import FastAPI

from neuroglia.observability import configure_opentelemetry, instrument_fastapi_app

# 1. Initialize OpenTelemetry first (once per application)
configure_opentelemetry(
    service_name="mario-pizzeria",
    service_version="1.0.0",
    otlp_endpoint="http://otel-collector:4317",
)

# 2. Create applications
app = FastAPI(title="Mario's Pizzeria")
api_app = FastAPI(title="API")
ui_app = FastAPI(title="UI")

# 3. Define endpoints BEFORE mounting (important for health checks)
@app.get("/health")
async def health_check():
    return {"status": "healthy"}

# 4. Mount sub-applications
app.mount("/api", api_app, name="api")
app.mount("/", ui_app, name="ui")

# 5. ✅ ONLY instrument the main app
instrument_fastapi_app(app, "mario-pizzeria-main")
```
### Complete Coverage Verification

This single instrumentation captures ALL endpoints across all mounted applications.

Example tracked endpoints:

- ✅ `/health` (main app)
- ✅ `/` (UI sub-app root)
- ✅ `/menu` (UI sub-app)
- ✅ `/orders` (UI sub-app)
- ✅ `/api/menu/` (API sub-app)
- ✅ `/api/orders/` (API sub-app)
- ✅ `/api/kitchen/status` (API sub-app)
- ✅ `/api/docs` (API sub-app)
- ✅ `/api/metrics` (API sub-app)

HTTP status codes tracked:

- ✅ 200 OK (successful requests)
- ✅ 307 Temporary Redirect (FastAPI automatic redirects)
- ✅ 404 Not Found (missing endpoints)
- ✅ 401 Unauthorized (auth failures)
- ✅ 500 Internal Server Error (application errors)
### How It Works

1. Request Flow: All HTTP requests reach the main app first
2. Middleware Order: OpenTelemetry middleware intercepts requests before routing
3. Sub-App Processing: Requests are then routed to the appropriate mounted sub-apps
4. Metric Collection: A single point of HTTP metric collection with complete coverage
### Best Practices

- Single Instrumentation Point: Only instrument the main FastAPI app
- Timing Matters: Mount sub-apps before instrumenting the main app
- Health Endpoints: Define main-app endpoints before mounting to avoid 404s
- Service Naming: Use descriptive names for the instrumented app
- Verification: Check the `/metrics` endpoint to confirm all routes are tracked
### Common Pitfalls
- Instrumenting Sub-Apps: Never instrument mounted sub-applications directly
- Order of Operations: Don't instrument before mounting sub-apps
- Missing Routes: Define health/metrics endpoints on main app, not sub-apps
- Duplicate Names: Use unique service names for different instrumentation calls
### Metrics Verification

Verify your instrumentation is working correctly:

```bash
# List all tracked endpoints
curl -s "http://localhost:8080/api/metrics" | \
  grep 'http_target=' | \
  sed 's/.*http_target="\([^"]*\)".*/\1/' | \
  sort | uniq

# Expected output (sorted):
# /
# /api/menu/
# /api/metrics
# /api/orders/
# /health
```
## Integration Checklist

- [ ] Initialize OpenTelemetry once at startup
- [ ] Create all FastAPI apps (main + sub-apps)
- [ ] Define main-app endpoints (health, metrics)
- [ ] Mount all sub-applications to the main app
- [ ] Instrument ONLY the main app
- [ ] Verify no duplicate-metric warnings in logs
- [ ] Confirm all endpoints appear in metrics
- [ ] Test trace propagation across all routes
This configuration ensures complete observability coverage without duplicate instrumentation warnings, providing clean metrics collection across your entire multi-application architecture.
## Key Benefits
### For Development
- Debug Distributed Systems: See exact request flow across layers
- Identify Bottlenecks: Visualize which components are slow
- Understand Dependencies: See how services interact
- Root Cause Analysis: Correlate logs with traces for faster debugging
### For Operations
- Performance Monitoring: Track response times and throughput
- Alerting: Set alerts on SLIs (latency, error rate, saturation)
- Capacity Planning: Understand resource usage trends
- Incident Response: Quickly isolate and diagnose issues
### For Business
- User Experience: Monitor actual user-facing performance
- Feature Usage: Track which features are used most
- Business Metrics: Orders, revenue, conversion rates
- SLA Compliance: Measure and report on service level objectives
## Grafana Dashboards
### 1. Overview Dashboard
- Request rate (requests/sec)
- Error rate (%)
- P50, P95, P99 latency
- Active services
- Top endpoints by traffic
### 2. Traces Dashboard (Tempo)
- Trace search by operation, duration, tags
- Service dependency graph
- Span flamegraphs
- Trace-to-logs correlation
### 3. Metrics Dashboard (Prometheus)
- Command execution time (histogram)
- Query execution time (histogram)
- Event processing time (histogram)
- Repository operation time (histogram)
- Business metrics (orders, pizzas, revenue)
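The P95/P99 latencies on these dashboards are estimated from cumulative histogram buckets. A simplified, stdlib-only sketch of the interpolation PromQL's `histogram_quantile()` performs (the actual PromQL function handles more edge cases):

```python
def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate quantile q from cumulative histogram buckets.

    buckets: (upper_bound, cumulative_count) pairs sorted by bound,
    with the last bound being +Inf. The result is linearly
    interpolated inside the bucket containing the target rank.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # cannot interpolate into the +Inf bucket
            # Linear interpolation within this bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound


# Request durations in seconds: 60 requests <= 0.1s, 90 <= 0.25s, 100 total
buckets = [(0.1, 60.0), (0.25, 90.0), (0.5, 100.0), (float("inf"), 100.0)]
p95 = histogram_quantile(0.95, buckets)
print(round(p95, 3))  # 0.375
```

This is why bucket boundaries matter: a P95 falling inside a wide bucket is only a linear estimate, so histogram buckets should be chosen around your SLO thresholds.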
### 4. Logs Dashboard (Loki)
- Log stream viewer
- Log filtering by trace_id, service, level
- Log rate over time
- Error log aggregation
### 5. Mario's Pizzeria Business Dashboard
- Orders per hour
- Average order value
- Popular pizzas
- Order status distribution
- Delivery time metrics
## Trace Context Propagation

OpenTelemetry uses W3C Trace Context for propagating trace information.

HTTP Headers:

```text
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
tracestate: vendor1=value1,vendor2=value2
```
Propagation Flow:

1. Incoming HTTP request carries a `traceparent` header
2. FastAPI auto-instrumentation extracts the context
3. Context is propagated to commands, queries, and events
4. Context is included in outgoing HTTP calls
5. Context is correlated in logs and metrics
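The `traceparent` header shown above decomposes into four fields: version, trace-id, parent span-id, and flags. A stdlib-only sketch of parsing it (in production the SDK's W3C propagator does this for you):

```python
import re

# W3C Trace Context: version-traceid-spanid-flags, all lowercase hex
TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<span_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)


def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header into its four fields."""
    m = TRACEPARENT_RE.match(header)
    if not m:
        raise ValueError(f"invalid traceparent: {header!r}")
    fields = m.groupdict()
    # Bit 0 of the flags byte is the "sampled" flag
    fields["sampled"] = int(fields["flags"], 16) & 0x01 == 1
    return fields


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx["trace_id"], ctx["sampled"])  # 4bf92f3577b34da6a3ce929d0e0e4736 True
```

The `01` flags byte is what tells downstream services the trace was sampled upstream, so they keep recording spans for the same trace.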
## Security Considerations
- Network Isolation: OTEL collector not exposed to public internet
- Authentication: Grafana requires login (admin/admin default)
- Data Retention: Configure retention policies for traces/logs/metrics
- PII Handling: Avoid logging sensitive customer data
- Resource Limits: Configure memory/CPU limits for collector
## Performance Considerations
- Sampling: Use tail-based sampling for high-volume services
- Batch Processing: Collector batches telemetry before export
- Async Export: Telemetry export is non-blocking
- Resource Detection: Done once at startup
- Memory Limits: Configure collector memory_limiter processor
Typical Overhead:
- Tracing: < 1-2% CPU overhead
- Metrics: < 1% CPU overhead
- Logging: < 5% CPU overhead (structured logging)
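The sampling mentioned above can be illustrated with a simplified, stdlib-only ratio sampler. This mirrors the idea behind OTEL's `TraceIdRatioBased` sampler but is not its exact algorithm:

```python
def ratio_sampled(trace_id_hex: str, ratio: float) -> bool:
    """Keep a deterministic fraction of traces, decided by trace id.

    Because the decision is a pure function of the trace id, every
    service that sees the same trace makes the same decision, so
    traces are either kept whole or dropped whole.
    """
    bound = int(ratio * (1 << 64))
    # Decide using the lower 8 bytes (16 hex chars) of the 16-byte trace id
    return int(trace_id_hex[-16:], 16) < bound


tid = "4bf92f3577b34da6a3ce929d0e0e4736"
print(ratio_sampled(tid, 1.0), ratio_sampled(tid, 0.0))  # True False
```

Head sampling like this is cheap but blind to outcome; the tail-based sampling recommended above instead buffers spans in the collector and keeps whole traces after seeing them (e.g. all errors plus a fraction of successes).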
## Testing OTEL Integration

### Manual Testing

```bash
# 1. Start services
./mario-docker.sh start

# 2. Generate some traffic
curl -X POST http://localhost:8000/api/orders \
  -H "Content-Type: application/json" \
  -d '{
    "customer_id": "cust_123",
    "items": [{"pizza_id": "margherita", "quantity": 2}]
  }'

# 3. Check OTEL collector health
curl http://localhost:13133/

# 4. View Grafana dashboards
open http://localhost:3001
# Login: admin/admin
# Navigate: Explore -> Tempo (traces)
# Navigate: Explore -> Prometheus (metrics)
# Navigate: Explore -> Loki (logs)

# 5. Check collector logs
docker logs mario-pizzeria-otel-collector-1
```
### Verify Trace Flow
- In Application: Check logs for trace_id in output
- In Collector: Check collector logs for received spans
- In Tempo: Search for traces in Grafana Explore
- In Grafana: View trace waterfall and span details
### Verify Metrics Flow
- In Application: Metrics recorded and exported
- In Collector: Metrics forwarded to Prometheus
- In Prometheus: Query metrics with PromQL
- In Grafana: Visualize metrics on dashboards
### Verify Logs Flow
- In Application: Structured logs with trace context
- In Collector: Logs forwarded to Loki
- In Loki: Query logs with LogQL
- In Grafana: View correlated logs with traces
## Related Documentation
### Neuroglia Framework
- Observability Feature Guide - Comprehensive developer guide and API reference
- Tutorial: Mario's Pizzeria Observability - Step-by-step implementation
- Mario's Pizzeria Sample - Complete working example
- CQRS & Mediation - Automatic handler tracing
- Getting Started - Framework setup
### External Resources
- OpenTelemetry Python Documentation
- Grafana Tempo Documentation
- Prometheus Documentation
- Grafana Loki Documentation
- W3C Trace Context Specification
- OTEL Framework Integration Analysis - Internal design notes
## Next Steps
After completing the OTEL integration:
- Baseline Performance: Establish baseline metrics for all operations
- Set SLOs: Define Service Level Objectives (e.g., P95 < 500ms)
- Create Alerts: Configure alerts for SLO violations
- Document Runbooks: Create troubleshooting guides using traces
- Optimize Hot Paths: Use trace data to identify and optimize slow operations
- Custom Dashboards: Build domain-specific dashboards for your team
- Team Training: Train team on using Grafana for debugging and monitoring
Status: Implementation in progress - see TODO list for detailed task breakdown