observability-setup

Sets up production-ready observability stack with structured logging (JSON format), metrics collection (Prometheus), distributed tracing (OpenTelemetry/Jaeger), and alerting (Grafana/PagerDuty). Implements instrumentation for Python/Node.js/Go applications, creates Grafana dashboards with key metrics, sets up log aggregation (ELK/Loki), and configures alert rules. Use when deploying to production, debugging distributed systems, monitoring performance, implementing SLOs/SLIs, or setting up on-call alerting.

Packaged view

This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.

Stars: 1
Hot score: 77
Updated: March 20, 2026
Overall rating: C (2.8)
Composite score: 2.8
Best-practice grade: B (75.6)

Install command

npx @skill-hub/cli install wenis-rad-observability-setup

Repository

wenis/rad

Skill path: skills/observability-setup

Best for

Primary workflow: Run DevOps.

Technical facets: Full Stack, DevOps.

Target audience: everyone.

License: Unknown.

Original source

Catalog source: SkillHub Club.

Repository owner: wenis.

This is a mirrored public skill entry. Review the repository before installing it into production workflows.

What it helps with

  • Install observability-setup into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
  • Review https://github.com/wenis/rad before adding observability-setup to shared team environments
  • Use observability-setup when standing up logging, metrics, tracing, and alerting for a service

Works across

Claude Code, Codex CLI, Gemini CLI, OpenCode

Favorites: 0.

Sub-skills: 0.

Aggregator: No.

Original source / Raw SKILL.md

---
name: observability-setup
description: Sets up production-ready observability stack with structured logging (JSON format), metrics collection (Prometheus), distributed tracing (OpenTelemetry/Jaeger), and alerting (Grafana/PagerDuty). Implements instrumentation for Python/Node.js/Go applications, creates Grafana dashboards with key metrics, sets up log aggregation (ELK/Loki), and configures alert rules. Use when deploying to production, debugging distributed systems, monitoring performance, implementing SLOs/SLIs, or setting up on-call alerting.
allowed-tools: Read, Write, Edit, Bash
---

# Observability Setup

You implement comprehensive observability (logs, metrics, traces, errors) to ensure production systems are visible, debuggable, and maintainable.

## When to use
- Setting up a new production service
- Investigating production issues
- Implementing SRE best practices
- Adding monitoring to existing services
- Preparing for scale/growth
- Improving incident response time

## The Three Pillars of Observability

### 1. Logs (What happened?)
Timestamped records of events. Use for debugging specific issues and audit trails.

### 2. Metrics (How much/how many?)
Numerical measurements over time. Use for dashboards, alerts, and capacity planning.

### 3. Traces (Where did time go?)
Request flow through distributed systems. Use for performance debugging and understanding dependencies.

**Plus:** Error tracking and alerting for proactive issue detection.
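
All three pillars can be emitted from a single code path. A minimal Python sketch (the `checkout` function and instrument names are illustrative, not part of this skill's templates):

```python
# pillars_demo.py - illustrative sketch; names are hypothetical
import structlog
from prometheus_client import Counter
from opentelemetry import trace

logger = structlog.get_logger()
checkout_total = Counter("checkout_total", "Checkout attempts", ["status"])
tracer = trace.get_tracer("shop")

def checkout(order_id: str) -> None:
    with tracer.start_as_current_span("checkout") as span:     # trace: where did time go?
        span.set_attribute("order.id", order_id)
        checkout_total.labels(status="ok").inc()               # metric: how many?
        logger.info("checkout_completed", order_id=order_id)   # log: what happened?
```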

## Tech Stack Recommendations

### All-in-One Solutions (Easiest)
- **Datadog**: Complete platform, great UX, ~$15-$31/host/month
- **New Relic**: Full-stack observability with AI insights, ~$25-$99/user/month
- **Elastic (ELK)**: Self-hosted, powerful but complex

### Open Source Stack (Recommended for Startups)
- **Logs**: Loki (lightweight, Prometheus-style)
- **Metrics**: Prometheus + Grafana
- **Traces**: Tempo or Jaeger
- **Errors**: Sentry (free tier available)
- **Dashboards**: Grafana

**Pros:** Free, flexible, vendor-independent
**Cons:** Requires setup and maintenance

### Minimum Viable Stack
- **Logs**: CloudWatch Logs / Stackdriver
- **Metrics**: CloudWatch / Cloud Monitoring
- **Errors**: Sentry
- **Dashboards**: Grafana Cloud (free tier)

## Quick Start: Python Example

Here's a minimal structured logging setup with Python:

```python
# logging_config.py
import structlog
import logging

def setup_logging():
    logging.basicConfig(format="%(message)s", level=logging.INFO)

    structlog.configure(
        processors=[
            structlog.stdlib.filter_by_level,
            structlog.stdlib.add_logger_name,
            structlog.stdlib.add_log_level,
            structlog.processors.TimeStamper(fmt="iso"),
            structlog.processors.format_exc_info,
            structlog.processors.JSONRenderer()
        ],
        wrapper_class=structlog.stdlib.BoundLogger,
        context_class=dict,
        logger_factory=structlog.stdlib.LoggerFactory(),
        cache_logger_on_first_use=True,
    )

# Usage
import structlog

logger = structlog.get_logger()

# Structured logging with context
# (user, request, request_id, user_id, and e are assumed to come from the
# surrounding request handler)
logger.info(
    "user_login",
    user_id=user.id,
    email=user.email,
    ip_address=request.ip
)

# Bind context once; it is attached to every subsequent log statement
logger = logger.bind(request_id=request_id, user_id=user_id)
logger.info("processing_payment", amount=100.00, currency="USD")
logger.error("payment_failed", error=str(e))
```

## Language-Specific Implementation Guides

For complete implementation examples including logging, metrics, tracing, and error tracking:

- **Python**: See `examples/python-setup.md`
  - structlog for logging
  - prometheus_client for metrics
  - OpenTelemetry for tracing
  - Sentry for error tracking

- **Node.js/TypeScript**: See `examples/nodejs-setup.md`
  - Pino for logging
  - prom-client for metrics
  - OpenTelemetry for tracing
  - Sentry for error tracking

- **Go**: See `examples/go-setup.md`
  - zerolog for logging
  - prometheus/client_golang for metrics
  - OpenTelemetry for tracing

## Infrastructure Setup

Ready-to-use configuration templates:

- **Complete stack**: `templates/docker-compose.yml`
  - Prometheus, Grafana, Loki, Tempo all configured
  - Just run `docker-compose up -d`

- **Prometheus config**: `templates/prometheus.yml`
  - Scrape configuration for your services

- **Alerts**: `templates/alertmanager.yml`
  - Pre-configured alerts for high error rate, latency, downtime
  - Slack integration ready

- **Dashboards**: `reference/grafana-dashboards.md`
  - Sample dashboard JSON
  - Common PromQL queries

## Complete Setup Checklist

### Application Code
- [ ] Add structured logging (JSON format)
- [ ] Instrument metrics (requests, latency, errors)
- [ ] Add distributed tracing
- [ ] Integrate error tracking (Sentry)
- [ ] Add health check endpoint
- [ ] Add metrics endpoint (/metrics) (see the sketch after this list)
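
A minimal sketch of the last two items, assuming FastAPI and prometheus_client (the `/healthz` path is a common convention, not something this skill mandates):

```python
# health_and_metrics.py - minimal sketch, assuming FastAPI + prometheus_client
from fastapi import FastAPI, Response
from prometheus_client import CONTENT_TYPE_LATEST, generate_latest

app = FastAPI()

@app.get("/healthz")
async def healthz():
    # Liveness probe: 200 means the process can serve requests
    return {"status": "ok"}

@app.get("/metrics")
async def metrics():
    # Exposes every registered prometheus_client instrument in text format
    return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)
```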

### Infrastructure
- [ ] Deploy Prometheus for metrics
- [ ] Deploy Grafana for dashboards
- [ ] Deploy Loki for logs (or use CloudWatch)
- [ ] Deploy Tempo/Jaeger for traces
- [ ] Set up AlertManager
- [ ] Configure log aggregation

### Dashboards
- [ ] Request rate
- [ ] Error rate
- [ ] Latency (P50, P95, P99)
- [ ] Resource usage (CPU, memory)
- [ ] Database performance
- [ ] Business metrics (signups, purchases, etc.)

### Alerts
- [ ] High error rate (> 5%)
- [ ] High latency (P95 > 1s)
- [ ] Service down
- [ ] High memory usage (> 80%)
- [ ] Database connection pool exhausted

## Best Practices

āœ… **DO:**
- Use structured logging (JSON)
- Include request IDs in all logs for correlation
- Sample traces (10-30%) to reduce overhead
- Create dashboards for your SLIs (Service Level Indicators)
- Alert on symptoms, not causes
- Set up on-call rotation
- Review dashboards regularly
- Use consistent metric naming conventions
- Add units to metric names (seconds, bytes, etc.)
- Set appropriate retention periods

āŒ **DON'T:**
- Log sensitive data (passwords, tokens, PII)
- Trace 100% of requests (too expensive)
- Create alerts that fire constantly
- Alert on metrics you can't act on
- Ignore alert fatigue
- Log excessively (DEBUG in production)
- Use string formatting in log messages
- Hardcode thresholds without testing

## Key Metrics to Track

### Golden Signals (SRE)
1. **Latency**: How long requests take
2. **Traffic**: How much demand on your system
3. **Errors**: Rate of failed requests
4. **Saturation**: How "full" your service is

### Application Metrics
- Request rate (requests/second)
- Error rate (errors/second, percentage)
- Response time (P50, P95, P99)
- Active connections
- Queue depth
- Cache hit rate

### System Metrics
- CPU usage
- Memory usage
- Disk I/O
- Network I/O
- File descriptors
- Thread/goroutine count

### Business Metrics
- User signups
- Successful transactions
- Revenue
- Active users
- Feature usage
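
Business metrics are just additional Prometheus instruments. A sketch, with illustrative metric and label names:

```python
# business_metrics.py - sketch; instrument and label names are illustrative
from prometheus_client import Counter

signups_total = Counter("signups_total", "Completed user signups", ["plan"])
purchases_total = Counter("purchases_total", "Successful purchases")

def on_signup(plan: str) -> None:
    # Keep label values low-cardinality (e.g. "free"/"pro"), never user IDs
    signups_total.labels(plan=plan).inc()
```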

## Instructions

1. **Choose your stack**
   - All-in-one (Datadog/New Relic) vs open source (Prometheus + Grafana)
   - Consider budget, team size, and infrastructure preferences

2. **Implement logging**
   - Pick language-specific logging library from examples/
   - Configure JSON output for structured logs
   - Add request ID middleware for correlation (see the middleware sketch after this list)

3. **Add metrics**
   - Instrument HTTP handlers (request count, duration, errors)
   - Add business metrics (signups, purchases, etc.)
   - Expose `/metrics` endpoint for Prometheus

4. **Set up tracing**
   - Use OpenTelemetry auto-instrumentation when possible
   - Add manual spans for critical paths
   - Configure sampling (10-30%)

5. **Integrate error tracking**
   - Set up Sentry or equivalent
   - Configure environment (dev/staging/prod)
   - Add user context to errors

6. **Deploy infrastructure**
   - Use `templates/docker-compose.yml` for local/small deployments
   - Or use cloud-managed services (CloudWatch, Cloud Monitoring)
   - Configure retention periods

7. **Create dashboards**
   - Start with golden signals (latency, traffic, errors, saturation)
   - Add application-specific metrics
   - Use `reference/grafana-dashboards.md` for examples

8. **Configure alerts**
   - Use `templates/alertmanager.yml` as starting point
   - Set thresholds based on your SLOs
   - Route to Slack/PagerDuty/email

9. **Test everything**
   - Generate load, verify metrics appear
   - Trigger errors, verify alerts fire
   - Check trace visualization
   - Ensure logs are searchable

10. **Document runbooks**
    - Create guides for common alerts
    - Document dashboard usage
    - Train team on debugging with observability tools
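
For step 2, a minimal request-ID middleware sketch, assuming FastAPI/Starlette and that `structlog.contextvars.merge_contextvars` has been added to the processor chain from the quick start:

```python
# request_id_middleware.py - minimal sketch, assuming FastAPI/Starlette
import uuid

import structlog
from starlette.middleware.base import BaseHTTPMiddleware

class RequestIDMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        # Reuse an upstream ID if present, otherwise mint one
        request_id = request.headers.get("X-Request-ID") or str(uuid.uuid4())
        # Every log line emitted while handling this request carries request_id
        structlog.contextvars.bind_contextvars(request_id=request_id)
        try:
            response = await call_next(request)
            response.headers["X-Request-ID"] = request_id
            return response
        finally:
            structlog.contextvars.clear_contextvars()

# Usage: app.add_middleware(RequestIDMiddleware)
```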

## Constraints

- Must not log sensitive data (PII, passwords, API keys, tokens)
- Metrics endpoint must not require auth (for Prometheus scraping)
- Tracing overhead must be < 5% (use sampling, typically 10-30%; see the sampler sketch after this list)
- Logs must be structured (JSON format)
- Must include request IDs for log correlation
- Dashboards must load in < 3 seconds
- Alert fatigue must be avoided (tune thresholds carefully)
- Must comply with data retention policies (GDPR, etc.)
- Metric cardinality must be controlled (avoid unbounded labels)
- Observability infrastructure must be reliable (monitor the monitors)
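
For the tracing-overhead constraint, a sketch of a 10% parent-based sampler using the OpenTelemetry Python SDK; treat the ratio as a starting point to tune against your overhead budget:

```python
# sampling.py - minimal sketch using the OpenTelemetry Python SDK
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample 10% of new traces; honor the parent's decision for propagated ones
sampler = ParentBased(TraceIdRatioBased(0.1))
trace.set_tracer_provider(TracerProvider(sampler=sampler))
```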


---

## Referenced Files

> The following files are referenced in this skill and included for context.

### examples/python-setup.md

````markdown
# Python Observability Setup

Complete implementation guide for observability in Python applications.

## Structured Logging with structlog

```python
# logging_config.py
import structlog
import logging

def setup_logging():
    logging.basicConfig(
        format="%(message)s",
        level=logging.INFO,
    )

    structlog.configure(
        processors=[
            structlog.stdlib.filter_by_level,
            structlog.stdlib.add_logger_name,
            structlog.stdlib.add_log_level,
            structlog.processors.TimeStamper(fmt="iso"),
            structlog.processors.StackInfoRenderer(),
            structlog.processors.format_exc_info,
            structlog.processors.JSONRenderer()
        ],
        wrapper_class=structlog.stdlib.BoundLogger,
        context_class=dict,
        logger_factory=structlog.stdlib.LoggerFactory(),
        cache_logger_on_first_use=True,
    )

# Usage
import structlog

logger = structlog.get_logger()

# Structured logging
logger.info(
    "user_login",
    user_id=user.id,
    email=user.email,
    ip_address=request.ip,
    user_agent=request.headers.get("User-Agent")
)

# With context
logger = logger.bind(request_id=request_id, user_id=user_id)
logger.info("processing_payment", amount=100.00, currency="USD")
logger.error("payment_failed", error=str(e), amount=100.00)
```

## Metrics with Prometheus

```python
# metrics.py
import functools
import time

from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Define metrics
http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

http_request_duration_seconds = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint']
)

active_users = Gauge(
    'active_users',
    'Number of active users'
)

# Middleware
def track_request(method, endpoint):
    def decorator(func):
        @functools.wraps(func)  # preserve the wrapped handler's name/docstring
        def wrapper(*args, **kwargs):
            start = time.time()
            try:
                result = func(*args, **kwargs)
                status = 200
                return result
            except Exception as e:
                status = 500
                raise
            finally:
                duration = time.time() - start
                http_requests_total.labels(
                    method=method,
                    endpoint=endpoint,
                    status=status
                ).inc()
                http_request_duration_seconds.labels(
                    method=method,
                    endpoint=endpoint
                ).observe(duration)
        return wrapper
    return decorator

# Usage
@track_request('GET', '/api/users')
def get_users():
    # Handler logic
    return users

# Start metrics server
if __name__ == '__main__':
    start_http_server(8000)  # Metrics on :8000/metrics
```

## Distributed Tracing with OpenTelemetry

```python
# tracing.py
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

def setup_tracing(app, service_name="my-service"):
    # Set up tracer
    trace.set_tracer_provider(TracerProvider())
    tracer = trace.get_tracer(__name__)

    # Export to Tempo/Jaeger
    otlp_exporter = OTLPSpanExporter(
        endpoint="http://localhost:4317",
        insecure=True
    )
    span_processor = BatchSpanProcessor(otlp_exporter)
    trace.get_tracer_provider().add_span_processor(span_processor)

    # Auto-instrument FastAPI
    FastAPIInstrumentor.instrument_app(app)

    # Auto-instrument SQLAlchemy (engine is your SQLAlchemy engine, assumed
    # to be created elsewhere in the application)
    SQLAlchemyInstrumentor().instrument(engine=engine)

    return tracer

# Manual instrumentation
tracer = trace.get_tracer(__name__)

@tracer.start_as_current_span("process_order")
def process_order(order_id):
    span = trace.get_current_span()
    span.set_attribute("order.id", order_id)
    span.set_attribute("order.amount", order.amount)

    try:
        result = payment_service.charge(order)
        span.set_attribute("payment.status", "success")
        return result
    except Exception as e:
        span.set_status(Status(StatusCode.ERROR))
        span.record_exception(e)
        raise
```

## Error Tracking with Sentry

```python
# sentry_config.py
import os

import sentry_sdk
from sentry_sdk.integrations.fastapi import FastApiIntegration
from sentry_sdk.integrations.sqlalchemy import SqlalchemyIntegration

def setup_sentry(app):
    sentry_sdk.init(
        dsn=os.environ['SENTRY_DSN'],
        environment=os.environ.get('ENV', 'development'),
        traces_sample_rate=0.1,  # 10% of transactions traced
        profiles_sample_rate=0.1,
        integrations=[
            FastApiIntegration(),
            SqlalchemyIntegration(),
        ],
    )

# Usage - automatic for unhandled exceptions
# Manual error capture: attach context first so it appears on the event
from sentry_sdk import capture_exception

try:
    risky_operation()
except Exception as e:
    sentry_sdk.set_context("order", {
        "id": order.id,
        "amount": order.amount,
    })
    sentry_sdk.set_user({"id": user.id, "email": user.email})
    capture_exception(e)

# Breadcrumbs for debugging
sentry_sdk.add_breadcrumb(
    category='order',
    message='Order validation started',
    level='info',
)
```

## Complete Integration Example

```python
# app.py
from fastapi import FastAPI
from logging_config import setup_logging
from tracing import setup_tracing
from sentry_config import setup_sentry
from metrics import start_http_server
import structlog

app = FastAPI()

# Initialize observability
setup_logging()
setup_tracing(app, service_name="my-api")
setup_sentry(app)

# Expose Prometheus metrics on :8000
# (start_http_server is non-blocking; it serves from a daemon thread)
start_http_server(8000)

logger = structlog.get_logger()

@app.get("/api/users")
async def get_users():
    logger.info("fetching_users", endpoint="/api/users")
    # Handler logic
    return {"users": []}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8080)
```

## Dependencies

```bash
pip install structlog prometheus-client opentelemetry-api opentelemetry-sdk \
    opentelemetry-instrumentation-fastapi opentelemetry-instrumentation-sqlalchemy \
    opentelemetry-exporter-otlp sentry-sdk
```

````

### examples/nodejs-setup.md

````markdown
# Node.js/TypeScript Observability Setup

Complete implementation guide for observability in Node.js/TypeScript applications.

## Structured Logging with Pino

```typescript
// logger.ts
import pino from 'pino';

export const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
  timestamp: pino.stdTimeFunctions.isoTime,
  ...(process.env.NODE_ENV === 'production'
    ? {}
    : { transport: { target: 'pino-pretty' } }),
});

// Usage
import { logger } from './logger';

logger.info({
  event: 'user_login',
  userId: user.id,
  email: user.email,
  ipAddress: req.ip,
}, 'User logged in');

logger.error({
  event: 'payment_failed',
  userId: user.id,
  amount: 100.00,
  error: error.message,
  stack: error.stack,
}, 'Payment processing failed');
```

## Metrics with Prometheus

```typescript
// metrics.ts
import type { NextFunction, Request, Response } from 'express';
import promClient from 'prom-client';

// Enable default metrics (CPU, memory, etc.)
promClient.collectDefaultMetrics();

export const httpRequestsTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status'],
});

export const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route'],
  buckets: [0.1, 0.5, 1, 2, 5],
});

export const activeConnections = new promClient.Gauge({
  name: 'active_connections',
  help: 'Number of active connections',
});

// Middleware
export function metricsMiddleware(req: Request, res: Response, next: NextFunction) {
  const start = Date.now();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;

    httpRequestsTotal.labels(req.method, req.route?.path || req.path, String(res.statusCode)).inc();
    httpRequestDuration.labels(req.method, req.route?.path || req.path).observe(duration);
  });

  next();
}

// Metrics endpoint (app is your Express application, created elsewhere)
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', promClient.register.contentType);
  res.end(await promClient.register.metrics());
});
```

## Distributed Tracing with OpenTelemetry

```typescript
// tracing.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';

export function setupTracing(serviceName: string) {
  const sdk = new NodeSDK({
    serviceName,
    traceExporter: new OTLPTraceExporter({
      url: 'http://localhost:4317',
    }),
    instrumentations: [getNodeAutoInstrumentations()],
  });

  sdk.start();

  process.on('SIGTERM', () => {
    sdk.shutdown().then(() => process.exit(0));
  });
}

// Manual spans
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('my-service');

async function processOrder(orderId: string) {
  return await tracer.startActiveSpan('process_order', async (span) => {
    span.setAttribute('order.id', orderId);

    try {
      // order is assumed to be loaded from orderId by surrounding code
      const result = await paymentService.charge(order);
      span.setAttribute('payment.status', 'success');
      return result;
    } catch (error) {
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw error;
    } finally {
      span.end();
    }
  });
}
```

## Error Tracking with Sentry

```typescript
// sentry.ts
import type { Express } from 'express';
import * as Sentry from '@sentry/node';
import { ProfilingIntegration } from '@sentry/profiling-node';

export function setupSentry(app: Express) {
  Sentry.init({
    dsn: process.env.SENTRY_DSN,
    environment: process.env.NODE_ENV,
    tracesSampleRate: 0.1,
    profilesSampleRate: 0.1,
    integrations: [
      new Sentry.Integrations.Http({ tracing: true }),
      new Sentry.Integrations.Express({ app }),
      new ProfilingIntegration(),
    ],
  });

  // Request handler must be first
  app.use(Sentry.Handlers.requestHandler());
  app.use(Sentry.Handlers.tracingHandler());

  // Routes...

  // Error handler must be last
  app.use(Sentry.Handlers.errorHandler());
}

// Manual error capture
import * as Sentry from '@sentry/node';

try {
  await riskyOperation();
} catch (error) {
  Sentry.captureException(error, {
    tags: { section: 'payment' },
    extra: { orderId, amount },
    user: { id: user.id, email: user.email },
  });
}

// Breadcrumbs
Sentry.addBreadcrumb({
  category: 'order',
  message: 'Order validation started',
  level: 'info',
});
```

## Complete Integration Example

```typescript
// app.ts
import express from 'express';
import { setupTracing } from './tracing';
import { setupSentry } from './sentry';
import { logger } from './logger';
import { metricsMiddleware } from './metrics';

const app = express();

// Initialize observability
setupTracing('my-api');
setupSentry(app);

// Add middleware
app.use(metricsMiddleware);

app.get('/api/users', async (req, res) => {
  logger.info({ event: 'fetching_users', endpoint: '/api/users' }, 'Fetching users');
  // Handler logic
  res.json({ users: [] });
});

app.listen(8080, () => {
  logger.info({ event: 'server_started', port: 8080 }, 'Server started');
});
```

## Dependencies

```json
{
  "dependencies": {
    "pino": "^8.0.0",
    "pino-pretty": "^10.0.0",
    "prom-client": "^15.0.0",
    "@opentelemetry/sdk-node": "^0.45.0",
    "@opentelemetry/auto-instrumentations-node": "^0.40.0",
    "@opentelemetry/exporter-trace-otlp-grpc": "^0.45.0",
    "@sentry/node": "^7.0.0",
    "@sentry/profiling-node": "^1.0.0"
  }
}
```

````

### examples/go-setup.md

````markdown
# Go Observability Setup

Complete implementation guide for observability in Go applications.

## Structured Logging with zerolog

```go
// logger.go
package main

import (
    "github.com/rs/zerolog"
    "github.com/rs/zerolog/log"
    "os"
)

func SetupLogger() {
    zerolog.TimeFieldFormat = zerolog.TimeFormatUnix

    if os.Getenv("ENV") == "development" {
        log.Logger = log.Output(zerolog.ConsoleWriter{Out: os.Stderr})
    }
}

// Usage
log.Info().
    Str("event", "user_login").
    Str("user_id", user.ID).
    Str("email", user.Email).
    Str("ip_address", r.RemoteAddr).
    Msg("User logged in")

log.Error().
    Err(err).
    Str("event", "payment_failed").
    Float64("amount", 100.00).
    Msg("Payment processing failed")
```

## Metrics with Prometheus

```go
// metrics.go
package main

import (
    "net/http"
    "strconv"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    httpRequestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total HTTP requests",
        },
        []string{"method", "endpoint", "status"},
    )

    httpRequestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration",
            Buckets: prometheus.DefBuckets,
        },
        []string{"method", "endpoint"},
    )
)

func init() {
    prometheus.MustRegister(httpRequestsTotal)
    prometheus.MustRegister(httpRequestDuration)
}

// responseWriter wraps http.ResponseWriter to capture the status code
type responseWriter struct {
    http.ResponseWriter
    statusCode int
}

func (rw *responseWriter) WriteHeader(code int) {
    rw.statusCode = code
    rw.ResponseWriter.WriteHeader(code)
}

// Middleware
func metricsMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()

        // Wrap response writer to capture status
        wrapped := &responseWriter{ResponseWriter: w, statusCode: http.StatusOK}
        next.ServeHTTP(wrapped, r)

        duration := time.Since(start).Seconds()
        httpRequestsTotal.WithLabelValues(r.Method, r.URL.Path, strconv.Itoa(wrapped.statusCode)).Inc()
        httpRequestDuration.WithLabelValues(r.Method, r.URL.Path).Observe(duration)
    })
}

// Metrics endpoint
http.Handle("/metrics", promhttp.Handler())
```

## Distributed Tracing with OpenTelemetry

```go
// tracing.go
package main

import (
    "context"

    "github.com/rs/zerolog/log"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/resource"
    "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.21.0" // match this version to your SDK
)

func SetupTracing(serviceName string) func() {
    ctx := context.Background()

    exporter, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint("localhost:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        log.Fatal().Err(err).Msg("Failed to create exporter")
    }

    tp := trace.NewTracerProvider(
        trace.WithBatcher(exporter),
        trace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String(serviceName),
        )),
    )

    otel.SetTracerProvider(tp)

    return func() {
        _ = tp.Shutdown(ctx)
    }
}

// Usage
func processOrder(ctx context.Context, orderID string) error {
    tracer := otel.Tracer("my-service")
    ctx, span := tracer.Start(ctx, "process_order")
    defer span.End()

    span.SetAttributes(attribute.String("order.id", orderID))

    // Business logic (order and paymentService are assumed to exist in your app)
    err := paymentService.Charge(ctx, order)
    if err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
        return err
    }

    span.SetAttributes(attribute.String("payment.status", "success"))
    return nil
}
```

## Complete Integration Example

```go
// main.go
package main

import (
    "net/http"
    "github.com/rs/zerolog/log"
)

func main() {
    // Initialize observability
    SetupLogger()
    cleanup := SetupTracing("my-api")
    defer cleanup()

    // Add middleware
    mux := http.NewServeMux()
    mux.HandleFunc("/api/users", getUsersHandler)

    handler := metricsMiddleware(mux)

    log.Info().Msg("Starting server on :8080")
    http.ListenAndServe(":8080", handler)
}

func getUsersHandler(w http.ResponseWriter, r *http.Request) {
    log.Info().Str("event", "fetching_users").Msg("Fetching users")

    // Handler logic
    w.Header().Set("Content-Type", "application/json")
    w.Write([]byte(`{"users": []}`))
}
```

## Dependencies

```bash
go get github.com/rs/zerolog
go get github.com/prometheus/client_golang/prometheus
go get go.opentelemetry.io/otel
go get go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc
```

````

### templates/docker-compose.yml

```yaml
# Complete Observability Stack with Docker Compose
# Includes: Prometheus (metrics), Grafana (dashboards), Loki (logs), Tempo (traces)

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    depends_on:
      - prometheus

  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yml:/etc/loki/local-config.yaml
      - loki-data:/loki

  tempo:
    image: grafana/tempo:latest
    ports:
      - "3200:3200"   # Tempo HTTP
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
    volumes:
      - ./tempo-config.yml:/etc/tempo.yaml
      - tempo-data:/tmp/tempo
    command: ["-config.file=/etc/tempo.yaml"]

volumes:
  prometheus-data:
  grafana-data:
  loki-data:
  tempo-data:

```

### templates/prometheus.yml

```yaml
# Prometheus Configuration
# Scrapes metrics from your application and system exporters

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'my-app'
    static_configs:
      - targets: ['app:8000']  # Your app metrics endpoint

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

```

### templates/alertmanager.yml

```yaml
# AlertManager Configuration
# Handles alerts from Prometheus and routes to notification channels

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'slack'

receivers:
  - name: 'slack'
    slack_configs:
      - api_url: 'YOUR_SLACK_WEBHOOK_URL'
        channel: '#alerts'
        title: 'Alert: {{ .CommonLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

# Alert Rules (place in prometheus.yml or separate rules file)
groups:
  - name: app_alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        # Error ratio: 5xx responses as a fraction of all requests (matches the 5% target)
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          description: 'Error ratio is {{ $value | humanizePercentage }} (threshold: 5%)'

      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          description: 'P95 latency is {{ $value }}s (threshold: 1s)'

```
