observability-setup
Sets up production-ready observability stack with structured logging (JSON format), metrics collection (Prometheus), distributed tracing (OpenTelemetry/Jaeger), and alerting (Grafana/PagerDuty). Implements instrumentation for Python/Node.js/Go applications, creates Grafana dashboards with key metrics, sets up log aggregation (ELK/Loki), and configures alert rules. Use when deploying to production, debugging distributed systems, monitoring performance, implementing SLOs/SLIs, or setting up on-call alerting.
Packaged view
This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.
Install command
npx @skill-hub/cli install wenis-rad-observability-setup
Repository
Skill path: skills/observability-setup
Best for
Primary workflow: Run DevOps.
Technical facets: Full Stack, DevOps.
Target audience: everyone.
License: Unknown.
Original source
Catalog source: SkillHub Club.
Repository owner: wenis.
This is a mirrored public skill entry. Review the repository before installing it into production workflows.
What it helps with
- Install observability-setup into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
- Review https://github.com/wenis/rad before adding observability-setup to shared team environments
- Use observability-setup for development workflows
Works across
Favorites: 0.
Sub-skills: 0.
Aggregator: No.
Original source / Raw SKILL.md
---
name: observability-setup
description: Sets up production-ready observability stack with structured logging (JSON format), metrics collection (Prometheus), distributed tracing (OpenTelemetry/Jaeger), and alerting (Grafana/PagerDuty). Implements instrumentation for Python/Node.js/Go applications, creates Grafana dashboards with key metrics, sets up log aggregation (ELK/Loki), and configures alert rules. Use when deploying to production, debugging distributed systems, monitoring performance, implementing SLOs/SLIs, or setting up on-call alerting.
allowed-tools: Read, Write, Edit, Bash
---
# Observability Setup
You implement comprehensive observability (logs, metrics, traces, errors) to ensure production systems are visible, debuggable, and maintainable.
## When to use
- Setting up a new production service
- Investigating production issues
- Implementing SRE best practices
- Adding monitoring to existing services
- Preparing for scale/growth
- Improving incident response time
## The Three Pillars of Observability
### 1. Logs (What happened?)
Timestamped records of events. Use for debugging specific issues and audit trails.
### 2. Metrics (How much/how many?)
Numerical measurements over time. Use for dashboards, alerts, and capacity planning.
### 3. Traces (Where did time go?)
Request flow through distributed systems. Use for performance debugging and understanding dependencies.
**Plus:** Error tracking and alerting for proactive issue detection.
## Tech Stack Recommendations
### All-in-One Solutions (Easiest)
- **Datadog**: Complete platform, great UX, ~$15-$31/host/month
- **New Relic**: Full-stack observability with AI insights, ~$25-$99/user/month
- **Elastic (ELK)**: Self-hosted, powerful but complex
### Open Source Stack (Recommended for Startups)
- **Logs**: Loki (lightweight, Prometheus-style)
- **Metrics**: Prometheus + Grafana
- **Traces**: Tempo or Jaeger
- **Errors**: Sentry (free tier available)
- **Dashboards**: Grafana
**Pros:** Free, flexible, vendor-independent
**Cons:** Requires setup and maintenance
### Minimum Viable Stack
- **Logs**: CloudWatch Logs / Cloud Logging (formerly Stackdriver)
- **Metrics**: CloudWatch / Cloud Monitoring
- **Errors**: Sentry
- **Dashboards**: Grafana Cloud (free tier)
## Quick Start: Python Example
Here's a minimal structured logging setup with Python:
```python
# logging_config.py
import logging

import structlog

def setup_logging():
    logging.basicConfig(format="%(message)s", level=logging.INFO)
    structlog.configure(
        processors=[
            structlog.stdlib.filter_by_level,
            structlog.stdlib.add_logger_name,
            structlog.stdlib.add_log_level,
            structlog.processors.TimeStamper(fmt="iso"),
            structlog.processors.format_exc_info,
            structlog.processors.JSONRenderer(),
        ],
        wrapper_class=structlog.stdlib.BoundLogger,
        context_class=dict,
        logger_factory=structlog.stdlib.LoggerFactory(),
        cache_logger_on_first_use=True,
    )

# Usage (user, request, and e are placeholders from your handler code)
import structlog

logger = structlog.get_logger()

# Structured logging with context
logger.info(
    "user_login",
    user_id=user.id,
    email=user.email,
    ip_address=request.ip,
)

# Bind context for multiple log statements
logger = logger.bind(request_id=request_id, user_id=user_id)
logger.info("processing_payment", amount=100.00, currency="USD")
logger.error("payment_failed", error=str(e))
```
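The best practices below stress request IDs for log correlation. Here is a minimal sketch of how that can be wired into the quick start, assuming `structlog.contextvars.merge_contextvars` is added as the first processor in `setup_logging()` above; the `X-Request-ID` header name and the FastAPI route are illustrative.
```python
# request_context.py (sketch; assumes merge_contextvars in the processor list)
import uuid

import structlog
from fastapi import FastAPI, Request

app = FastAPI()
logger = structlog.get_logger()

@app.middleware("http")
async def bind_request_id(request: Request, call_next):
    # Reuse an upstream ID when present; otherwise mint one.
    request_id = request.headers.get("X-Request-ID", str(uuid.uuid4()))
    structlog.contextvars.clear_contextvars()
    structlog.contextvars.bind_contextvars(request_id=request_id)
    response = await call_next(request)
    response.headers["X-Request-ID"] = request_id
    return response

@app.get("/ping")
async def ping():
    # Every log line emitted during this request carries request_id automatically.
    logger.info("ping_handled")
    return {"ok": True}
```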
## Language-Specific Implementation Guides
For complete implementation examples including logging, metrics, tracing, and error tracking:
- **Python**: See `examples/python-setup.md`
- structlog for logging
- prometheus_client for metrics
- OpenTelemetry for tracing
- Sentry for error tracking
- **Node.js/TypeScript**: See `examples/nodejs-setup.md`
- Pino for logging
- prom-client for metrics
- OpenTelemetry for tracing
- Sentry for error tracking
- **Go**: See `examples/go-setup.md`
- zerolog for logging
- prometheus/client_golang for metrics
- OpenTelemetry for tracing
## Infrastructure Setup
Ready-to-use configuration templates:
- **Complete stack**: `templates/docker-compose.yml`
- Prometheus, Grafana, Loki, Tempo all configured
- Just run `docker-compose up -d`
- **Prometheus config**: `templates/prometheus.yml`
- Scrape configuration for your services
- **Alerts**: `templates/alertmanager.yml`
- Pre-configured alerts for high error rate, latency, downtime
- Slack integration ready
- **Dashboards**: `reference/grafana-dashboards.md`
- Sample dashboard JSON
- Common PromQL queries
## Complete Setup Checklist
### Application Code
- [ ] Add structured logging (JSON format)
- [ ] Instrument metrics (requests, latency, errors)
- [ ] Add distributed tracing
- [ ] Integrate error tracking (Sentry)
- [ ] Add health check endpoint
- [ ] Add metrics endpoint (/metrics)
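For the last two items, a minimal sketch with FastAPI and prometheus_client; the `/healthz` path is illustrative, while `/metrics` follows the Prometheus scraping convention.
```python
# endpoints.py (sketch)
from fastapi import FastAPI
from prometheus_client import make_asgi_app

app = FastAPI()

@app.get("/healthz")
async def healthz():
    # Liveness probe: cheap and dependency-free.
    return {"status": "ok"}

# Expose Prometheus metrics via the client library's ASGI app.
app.mount("/metrics", make_asgi_app())
```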
### Infrastructure
- [ ] Deploy Prometheus for metrics
- [ ] Deploy Grafana for dashboards
- [ ] Deploy Loki for logs (or use CloudWatch)
- [ ] Deploy Tempo/Jaeger for traces
- [ ] Set up AlertManager
- [ ] Configure log aggregation
### Dashboards
- [ ] Request rate
- [ ] Error rate
- [ ] Latency (P50, P95, P99; see the bucket sketch after this list)
- [ ] Resource usage (CPU, memory)
- [ ] Database performance
- [ ] Business metrics (signups, purchases, etc.)
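Latency percentiles in Prometheus are computed at query time from histogram buckets, so choose bucket edges that bracket your SLO. A sketch with prometheus_client; the bucket edges and metric name are illustrative.
```python
# latency_buckets.py (sketch)
from prometheus_client import Histogram

# P50/P95/P99 come from histogram_quantile() over these buckets at query
# time, so place edges around your SLO target (e.g., 1s here).
request_latency_seconds = Histogram(
    "request_latency_seconds",
    "Request latency in seconds",
    buckets=[0.05, 0.1, 0.25, 0.5, 1, 2, 5],
)
```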
### Alerts
- [ ] High error rate (> 5%)
- [ ] High latency (P95 > 1s)
- [ ] Service down
- [ ] High memory usage (> 80%)
- [ ] Database connection pool exhausted
## Best Practices
✅ **DO:**
- Use structured logging (JSON)
- Include request IDs in all logs for correlation
- Sample traces (10-30%) to reduce overhead
- Create dashboards for your SLIs (Service Level Indicators)
- Alert on symptoms, not causes
- Set up on-call rotation
- Review dashboards regularly
- Use consistent metric naming conventions
- Add units to metric names (seconds, bytes, etc.); a naming sketch follows this section
- Set appropriate retention periods
❌ **DON'T:**
- Log sensitive data (passwords, tokens, PII)
- Trace 100% of requests (too expensive)
- Create alerts that fire constantly
- Alert on metrics you can't act on
- Ignore alert fatigue
- Log excessively (DEBUG in production)
- Interpolate values into log messages instead of passing them as structured fields
- Hardcode thresholds without testing
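A short sketch of the naming and unit conventions above; the metric names are illustrative.
```python
# naming.py (sketch)
from prometheus_client import Counter, Histogram

# Good: base unit (seconds) in the name, not milliseconds or "time".
checkout_duration_seconds = Histogram(
    "checkout_duration_seconds", "Checkout flow duration in seconds"
)

# Good: counters end in _total and use bounded label values.
emails_sent_total = Counter("emails_sent_total", "Emails sent", ["template"])

# Avoid: checkout_time (ambiguous unit), emailsSent (inconsistent casing).
```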
## Key Metrics to Track
### Golden Signals (SRE)
1. **Latency**: How long requests take
2. **Traffic**: How much demand on your system
3. **Errors**: Rate of failed requests
4. **Saturation**: How "full" your service is
### Application Metrics
- Request rate (requests/second)
- Error rate (errors/second, percentage)
- Response time (P50, P95, P99)
- Active connections
- Queue depth
- Cache hit rate
### System Metrics
- CPU usage
- Memory usage
- Disk I/O
- Network I/O
- File descriptors
- Thread/goroutine count
### Business Metrics
- User signups
- Successful transactions
- Revenue
- Active users
- Feature usage
## Instructions
1. **Choose your stack**
- All-in-one (Datadog/New Relic) vs open source (Prometheus + Grafana)
- Consider budget, team size, and infrastructure preferences
2. **Implement logging**
- Pick language-specific logging library from examples/
- Configure JSON output for structured logs
- Add request ID middleware for correlation
3. **Add metrics**
- Instrument HTTP handlers (request count, duration, errors)
- Add business metrics (signups, purchases, etc.)
- Expose `/metrics` endpoint for Prometheus
4. **Set up tracing**
- Use OpenTelemetry auto-instrumentation when possible
- Add manual spans for critical paths
- Configure sampling (10-30%); a sampler sketch follows this list
5. **Integrate error tracking**
- Set up Sentry or equivalent
- Configure environment (dev/staging/prod)
- Add user context to errors
6. **Deploy infrastructure**
- Use `templates/docker-compose.yml` for local/small deployments
- Or use cloud-managed services (CloudWatch, Cloud Monitoring)
- Configure retention periods
7. **Create dashboards**
- Start with golden signals (latency, traffic, errors, saturation)
- Add application-specific metrics
- Use `reference/grafana-dashboards.md` for examples
8. **Configure alerts**
- Use `templates/alertmanager.yml` as starting point
- Set thresholds based on your SLOs
- Route to Slack/PagerDuty/email
9. **Test everything**
- Generate load, verify metrics appear
- Trigger errors, verify alerts fire
- Check trace visualization
- Ensure logs are searchable
10. **Document runbooks**
- Create guides for common alerts
- Document dashboard usage
- Train team on debugging with observability tools
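For step 4, a minimal sampler sketch with the OpenTelemetry Python SDK; the 10% ratio is illustrative, so tune it to your traffic and overhead budget.
```python
# sampling.py (sketch)
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample 10% of new traces; honor the parent's decision on propagated
# ones so a distributed trace is never half-recorded across services.
sampler = ParentBased(root=TraceIdRatioBased(0.10))
trace.set_tracer_provider(TracerProvider(sampler=sampler))
```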
## Constraints
- Must not log sensitive data (PII, passwords, API keys, tokens)
- Metrics endpoint must not require auth (for Prometheus scraping)
- Tracing overhead must be < 5% (use sampling, typically 10-30%)
- Logs must be structured (JSON format)
- Must include request IDs for log correlation
- Dashboards must load in < 3 seconds
- Alert fatigue must be avoided (tune thresholds carefully)
- Must comply with data retention policies (GDPR, etc.)
- Metric cardinality must be controlled (avoid unbounded labels; see the sketch below)
- Observability infrastructure must be reliable (monitor the monitors)
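A sketch of the cardinality constraint; the label and metric names are illustrative.
```python
# cardinality.py (sketch)
from prometheus_client import Counter

# Avoid: user_id is unbounded, so it mints one time series per user.
# lookups_total = Counter("lookups_total", "Lookups", ["user_id"])

# Prefer: labels drawn from small fixed sets, e.g. route template and status.
lookups_total = Counter("lookups_total", "Lookups", ["endpoint", "status"])
lookups_total.labels(endpoint="/api/users/{id}", status="200").inc()
```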
---
## Referenced Files
> The following files are referenced in this skill and included for context.
### examples/python-setup.md
```markdown
# Python Observability Setup
Complete implementation guide for observability in Python applications.
## Structured Logging with structlog
```python
# logging_config.py
import logging

import structlog

def setup_logging():
    logging.basicConfig(
        format="%(message)s",
        level=logging.INFO,
    )
    structlog.configure(
        processors=[
            structlog.stdlib.filter_by_level,
            structlog.stdlib.add_logger_name,
            structlog.stdlib.add_log_level,
            structlog.processors.TimeStamper(fmt="iso"),
            structlog.processors.StackInfoRenderer(),
            structlog.processors.format_exc_info,
            structlog.processors.JSONRenderer(),
        ],
        wrapper_class=structlog.stdlib.BoundLogger,
        context_class=dict,
        logger_factory=structlog.stdlib.LoggerFactory(),
        cache_logger_on_first_use=True,
    )

# Usage
import structlog

logger = structlog.get_logger()

# Structured logging
logger.info(
    "user_login",
    user_id=user.id,
    email=user.email,
    ip_address=request.ip,
    user_agent=request.headers.get("User-Agent"),
)

# With context
logger = logger.bind(request_id=request_id, user_id=user_id)
logger.info("processing_payment", amount=100.00, currency="USD")
logger.error("payment_failed", error=str(e), amount=100.00)
```
## Metrics with Prometheus
```python
# metrics.py
import time

from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Define metrics
http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

http_request_duration_seconds = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint']
)

active_users = Gauge(
    'active_users',
    'Number of active users'
)

# Middleware
def track_request(method, endpoint):
    def decorator(func):
        def wrapper(*args, **kwargs):
            start = time.time()
            try:
                result = func(*args, **kwargs)
                status = 200
                return result
            except Exception:
                status = 500
                raise
            finally:
                duration = time.time() - start
                http_requests_total.labels(
                    method=method,
                    endpoint=endpoint,
                    status=status
                ).inc()
                http_request_duration_seconds.labels(
                    method=method,
                    endpoint=endpoint
                ).observe(duration)
        return wrapper
    return decorator

# Usage
@track_request('GET', '/api/users')
def get_users():
    # Handler logic
    return users

# Start metrics server
if __name__ == '__main__':
    start_http_server(8000)  # Metrics on :8000/metrics
```
## Distributed Tracing with OpenTelemetry
```python
# tracing.py
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.trace import Status, StatusCode
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

def setup_tracing(app, service_name="my-service", engine=None):
    # Set up tracer
    trace.set_tracer_provider(TracerProvider())
    tracer = trace.get_tracer(__name__)

    # Export to Tempo/Jaeger
    otlp_exporter = OTLPSpanExporter(
        endpoint="http://localhost:4317",
        insecure=True
    )
    span_processor = BatchSpanProcessor(otlp_exporter)
    trace.get_tracer_provider().add_span_processor(span_processor)

    # Auto-instrument FastAPI
    FastAPIInstrumentor.instrument_app(app)

    # Auto-instrument SQLAlchemy when an engine is supplied
    if engine is not None:
        SQLAlchemyInstrumentor().instrument(engine=engine)

    return tracer

# Manual instrumentation (order and payment_service are placeholders)
tracer = trace.get_tracer(__name__)

@tracer.start_as_current_span("process_order")
def process_order(order_id):
    span = trace.get_current_span()
    span.set_attribute("order.id", order_id)
    span.set_attribute("order.amount", order.amount)
    try:
        result = payment_service.charge(order)
        span.set_attribute("payment.status", "success")
        return result
    except Exception as e:
        span.set_status(Status(StatusCode.ERROR))
        span.record_exception(e)
        raise
```
## Error Tracking with Sentry
```python
# sentry_config.py
import os

import sentry_sdk
from sentry_sdk.integrations.fastapi import FastApiIntegration
from sentry_sdk.integrations.sqlalchemy import SqlalchemyIntegration

def setup_sentry(app):
    sentry_sdk.init(
        dsn=os.environ['SENTRY_DSN'],
        environment=os.environ.get('ENV', 'development'),
        traces_sample_rate=0.1,  # 10% of transactions traced
        profiles_sample_rate=0.1,
        integrations=[
            FastApiIntegration(),
            SqlalchemyIntegration(),
        ],
    )

# Usage - automatic for unhandled exceptions

# Manual error capture
from sentry_sdk import capture_exception, capture_message

try:
    risky_operation()
except Exception as e:
    capture_exception(e)

# Or capture with extra context, inside the except block:
try:
    risky_operation()
except Exception as e:
    sentry_sdk.set_context("order", {
        "id": order.id,
        "amount": order.amount,
    })
    sentry_sdk.set_user({"id": user.id, "email": user.email})
    capture_exception(e)

# Breadcrumbs for debugging
sentry_sdk.add_breadcrumb(
    category='order',
    message='Order validation started',
    level='info',
)
```
## Complete Integration Example
```python
# app.py
import threading

import structlog
from fastapi import FastAPI

from logging_config import setup_logging
from metrics import start_http_server
from sentry_config import setup_sentry
from tracing import setup_tracing

app = FastAPI()

# Initialize observability
setup_logging()
setup_tracing(app, service_name="my-api")
setup_sentry(app)

# Start metrics server in background
threading.Thread(target=lambda: start_http_server(8000), daemon=True).start()

logger = structlog.get_logger()

@app.get("/api/users")
async def get_users():
    logger.info("fetching_users", endpoint="/api/users")
    # Handler logic
    return {"users": []}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8080)
```
## Dependencies
```bash
pip install structlog prometheus-client opentelemetry-api opentelemetry-sdk \
opentelemetry-instrumentation-fastapi opentelemetry-instrumentation-sqlalchemy \
opentelemetry-exporter-otlp sentry-sdk
```
```
### examples/nodejs-setup.md
```markdown
# Node.js/TypeScript Observability Setup
Complete implementation guide for observability in Node.js/TypeScript applications.
## Structured Logging with Pino
```typescript
// logger.ts
import pino from 'pino';

export const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
  timestamp: pino.stdTimeFunctions.isoTime,
  ...(process.env.NODE_ENV === 'production'
    ? {}
    : { transport: { target: 'pino-pretty' } }),
});

// Usage
import { logger } from './logger';

logger.info({
  event: 'user_login',
  userId: user.id,
  email: user.email,
  ipAddress: req.ip,
}, 'User logged in');

logger.error({
  event: 'payment_failed',
  userId: user.id,
  amount: 100.00,
  error: error.message,
  stack: error.stack,
}, 'Payment processing failed');
```
## Metrics with Prometheus
```typescript
// metrics.ts
import promClient from 'prom-client';

// Enable default metrics (CPU, memory, etc.)
promClient.collectDefaultMetrics();

export const httpRequestsTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status'],
});

export const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route'],
  buckets: [0.1, 0.5, 1, 2, 5],
});

export const activeConnections = new promClient.Gauge({
  name: 'active_connections',
  help: 'Number of active connections',
});

// Middleware
export function metricsMiddleware(req, res, next) {
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    httpRequestsTotal.labels(req.method, req.route?.path || req.path, res.statusCode).inc();
    httpRequestDuration.labels(req.method, req.route?.path || req.path).observe(duration);
  });
  next();
}

// Metrics endpoint (app is your Express instance)
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', promClient.register.contentType);
  res.end(await promClient.register.metrics());
});
```
## Distributed Tracing with OpenTelemetry
```typescript
// tracing.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';

export function setupTracing(serviceName: string) {
  const sdk = new NodeSDK({
    serviceName,
    traceExporter: new OTLPTraceExporter({
      url: 'http://localhost:4317',
    }),
    instrumentations: [getNodeAutoInstrumentations()],
  });
  sdk.start();

  process.on('SIGTERM', () => {
    sdk.shutdown().then(() => process.exit(0));
  });
}

// Manual spans
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('my-service');

async function processOrder(orderId: string) {
  return await tracer.startActiveSpan('process_order', async (span) => {
    span.setAttribute('order.id', orderId);
    try {
      const result = await paymentService.charge(order);
      span.setAttribute('payment.status', 'success');
      return result;
    } catch (error) {
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw error;
    } finally {
      span.end();
    }
  });
}
```
## Error Tracking with Sentry
```typescript
// sentry.ts
import * as Sentry from '@sentry/node';
import { ProfilingIntegration } from '@sentry/profiling-node';
import type { Express } from 'express';

export function setupSentry(app: Express) {
  Sentry.init({
    dsn: process.env.SENTRY_DSN,
    environment: process.env.NODE_ENV,
    tracesSampleRate: 0.1,
    profilesSampleRate: 0.1,
    integrations: [
      new Sentry.Integrations.Http({ tracing: true }),
      new Sentry.Integrations.Express({ app }),
      new ProfilingIntegration(),
    ],
  });

  // Request handler must be first
  app.use(Sentry.Handlers.requestHandler());
  app.use(Sentry.Handlers.tracingHandler());
  // Routes...
  // Error handler must be last
  app.use(Sentry.Handlers.errorHandler());
}

// Manual error capture
try {
  await riskyOperation();
} catch (error) {
  Sentry.captureException(error, {
    tags: { section: 'payment' },
    extra: { orderId, amount },
    user: { id: user.id, email: user.email },
  });
}

// Breadcrumbs
Sentry.addBreadcrumb({
  category: 'order',
  message: 'Order validation started',
  level: 'info',
});
```
## Complete Integration Example
```typescript
// app.ts
import express from 'express';
import { setupTracing } from './tracing';
import { setupSentry } from './sentry';
import { logger } from './logger';
import { metricsMiddleware } from './metrics';

const app = express();

// Initialize observability
setupTracing('my-api');
setupSentry(app);

// Add middleware
app.use(metricsMiddleware);

app.get('/api/users', async (req, res) => {
  logger.info({ event: 'fetching_users', endpoint: '/api/users' }, 'Fetching users');
  // Handler logic
  res.json({ users: [] });
});

app.listen(8080, () => {
  logger.info({ event: 'server_started', port: 8080 }, 'Server started');
});
```
## Dependencies
```json
{
  "dependencies": {
    "pino": "^8.0.0",
    "pino-pretty": "^10.0.0",
    "prom-client": "^15.0.0",
    "@opentelemetry/sdk-node": "^0.45.0",
    "@opentelemetry/auto-instrumentations-node": "^0.40.0",
    "@opentelemetry/exporter-trace-otlp-grpc": "^0.45.0",
    "@sentry/node": "^7.0.0",
    "@sentry/profiling-node": "^1.0.0"
  }
}
```
```
### examples/go-setup.md
```markdown
# Go Observability Setup
Complete implementation guide for observability in Go applications.
## Structured Logging with zerolog
```go
// logger.go
package main

import (
    "os"

    "github.com/rs/zerolog"
    "github.com/rs/zerolog/log"
)

func SetupLogger() {
    zerolog.TimeFieldFormat = zerolog.TimeFormatUnix
    if os.Getenv("ENV") == "development" {
        log.Logger = log.Output(zerolog.ConsoleWriter{Out: os.Stderr})
    }
}

// Usage
log.Info().
    Str("event", "user_login").
    Str("user_id", user.ID).
    Str("email", user.Email).
    Str("ip_address", r.RemoteAddr).
    Msg("User logged in")

log.Error().
    Err(err).
    Str("event", "payment_failed").
    Float64("amount", 100.00).
    Msg("Payment processing failed")
```
## Metrics with Prometheus
```go
// metrics.go
package main

import (
    "net/http"
    "strconv"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    httpRequestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total HTTP requests",
        },
        []string{"method", "endpoint", "status"},
    )
    httpRequestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration",
            Buckets: prometheus.DefBuckets,
        },
        []string{"method", "endpoint"},
    )
)

func init() {
    prometheus.MustRegister(httpRequestsTotal)
    prometheus.MustRegister(httpRequestDuration)
}

// responseWriter wraps http.ResponseWriter to capture the status code.
type responseWriter struct {
    http.ResponseWriter
    statusCode int
}

func (w *responseWriter) WriteHeader(code int) {
    w.statusCode = code
    w.ResponseWriter.WriteHeader(code)
}

// Middleware
func metricsMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        // Wrap response writer to capture status
        wrapped := &responseWriter{ResponseWriter: w, statusCode: http.StatusOK}
        next.ServeHTTP(wrapped, r)
        duration := time.Since(start).Seconds()
        httpRequestsTotal.WithLabelValues(r.Method, r.URL.Path, strconv.Itoa(wrapped.statusCode)).Inc()
        httpRequestDuration.WithLabelValues(r.Method, r.URL.Path).Observe(duration)
    })
}

// MetricsHandler serves Prometheus metrics; mount it at /metrics in main.
var MetricsHandler = promhttp.Handler()
```
## Distributed Tracing with OpenTelemetry
```go
// tracing.go
package main

import (
    "context"

    "github.com/rs/zerolog/log"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/resource"
    "go.opentelemetry.io/otel/sdk/trace"
    // Pin the semconv version to the one shipped with your SDK release.
    semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

func SetupTracing(serviceName string) func() {
    ctx := context.Background()
    exporter, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint("localhost:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        log.Fatal().Err(err).Msg("Failed to create exporter")
    }
    tp := trace.NewTracerProvider(
        trace.WithBatcher(exporter),
        trace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String(serviceName),
        )),
    )
    otel.SetTracerProvider(tp)
    return func() {
        _ = tp.Shutdown(ctx)
    }
}

// Usage (order and paymentService are placeholders from your application)
func processOrder(ctx context.Context, orderID string) error {
    tracer := otel.Tracer("my-service")
    ctx, span := tracer.Start(ctx, "process_order")
    defer span.End()

    span.SetAttributes(attribute.String("order.id", orderID))

    // Business logic
    err := paymentService.Charge(ctx, order)
    if err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
        return err
    }
    span.SetAttributes(attribute.String("payment.status", "success"))
    return nil
}
```
```
## Complete Integration Example
```go
// main.go
package main

import (
    "net/http"

    "github.com/rs/zerolog/log"
)

func main() {
    // Initialize observability
    SetupLogger()
    cleanup := SetupTracing("my-api")
    defer cleanup()

    // Routes, plus the Prometheus endpoint from metrics.go
    mux := http.NewServeMux()
    mux.HandleFunc("/api/users", getUsersHandler)
    mux.Handle("/metrics", MetricsHandler)
    handler := metricsMiddleware(mux)

    log.Info().Msg("Starting server on :8080")
    if err := http.ListenAndServe(":8080", handler); err != nil {
        log.Fatal().Err(err).Msg("Server failed")
    }
}

func getUsersHandler(w http.ResponseWriter, r *http.Request) {
    log.Info().Str("event", "fetching_users").Msg("Fetching users")
    // Handler logic
    w.Header().Set("Content-Type", "application/json")
    w.Write([]byte(`{"users": []}`))
}
```
## Dependencies
```bash
go get github.com/rs/zerolog
go get github.com/prometheus/client_golang/prometheus
go get go.opentelemetry.io/otel
go get go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc
```
```
### templates/docker-compose.yml
```yaml
# Complete Observability Stack with Docker Compose
# Includes: Prometheus (metrics), Grafana (dashboards), Loki (logs), Tempo (traces)
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    depends_on:
      - prometheus

  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yml:/etc/loki/local-config.yaml
      - loki-data:/loki

  tempo:
    image: grafana/tempo:latest
    ports:
      - "3200:3200"  # Tempo HTTP
      - "4317:4317"  # OTLP gRPC
      - "4318:4318"  # OTLP HTTP
    volumes:
      - ./tempo-config.yml:/etc/tempo.yaml
      - tempo-data:/tmp/tempo
    command: ["-config.file=/etc/tempo.yaml"]

volumes:
  prometheus-data:
  grafana-data:
  loki-data:
  tempo-data:
```
### templates/prometheus.yml
```yaml
# Prometheus Configuration
# Scrapes metrics from your application and system exporters
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'my-app'
    static_configs:
      - targets: ['app:8000']  # Your app metrics endpoint

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
```
### templates/alertmanager.yml
```yaml
# AlertManager Configuration
# Handles alerts from Prometheus and routes to notification channels
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'slack'

receivers:
  - name: 'slack'
    slack_configs:
      - api_url: 'YOUR_SLACK_WEBHOOK_URL'
        channel: '#alerts'
        title: 'Alert: {{ .CommonLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

# Alert rules: these belong in a separate Prometheus rules file,
# loaded via rule_files in prometheus.yml (not in alertmanager.yml)
groups:
  - name: app_alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          description: 'Error rate is {{ $value }} (threshold: 0.05)'

      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          description: 'P95 latency is {{ $value }}s (threshold: 1s)'
```