observability-setup

Sets up production-ready observability stack with structured logging (JSON format), metrics collection (Prometheus), distributed tracing (OpenTelemetry/Jaeger), and alerting (Grafana/PagerDuty). Implements instrumentation for Python/Node.js/Go applications, creates Grafana dashboards with key metrics, sets up log aggregation (ELK/Loki), and configures alert rules. Use when deploying to production, debugging distributed systems, monitoring performance, implementing SLOs/SLIs, or setting up on-call alerting.

Packaged view

This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.

Stars: 1
Hot score: 77
Updated: March 20, 2026
Overall rating: C (2.8)
Composite score: 2.8
Best-practice grade: B (75.6)

Install command

npx @skill-hub/cli install wenis-rad-observability-setup

Repository

wenis/rad

Skill path: skills/observability-setup

Best for

Primary workflow: Run DevOps.

Technical facets: Full Stack, DevOps.

Target audience: everyone.

License: Unknown.

Original source

Catalog source: SkillHub Club.

Repository owner: wenis.

This is a mirrored public skill entry. Review the repository before installing it into production workflows.

What it helps with

  • Install observability-setup into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
  • Review https://github.com/wenis/rad before adding observability-setup to shared team environments
  • Use observability-setup when standing up logging, metrics, tracing, and alerting for a service

Works across

Claude Code, Codex CLI, Gemini CLI, OpenCode

Favorites: 0.

Sub-skills: 0.

Aggregator: No.

Original source / Raw SKILL.md

---
name: observability-setup
description: Sets up production-ready observability stack with structured logging (JSON format), metrics collection (Prometheus), distributed tracing (OpenTelemetry/Jaeger), and alerting (Grafana/PagerDuty). Implements instrumentation for Python/Node.js/Go applications, creates Grafana dashboards with key metrics, sets up log aggregation (ELK/Loki), and configures alert rules. Use when deploying to production, debugging distributed systems, monitoring performance, implementing SLOs/SLIs, or setting up on-call alerting.
allowed-tools: Read, Write, Edit, Bash
---

# Observability Setup

You implement comprehensive observability (logs, metrics, traces, errors) to ensure production systems are visible, debuggable, and maintainable.

## When to use
- Setting up a new production service
- Investigating production issues
- Implementing SRE best practices
- Adding monitoring to existing services
- Preparing for scale/growth
- Improving incident response time

## The Three Pillars of Observability

### 1. Logs (What happened?)
Timestamped records of events. Use for debugging specific issues and audit trails.

### 2. Metrics (How much/how many?)
Numerical measurements over time. Use for dashboards, alerts, and capacity planning.

### 3. Traces (Where did time go?)
Request flow through distributed systems. Use for performance debugging and understanding dependencies.

**Plus:** Error tracking and alerting for proactive issue detection.
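
All three pillars can be emitted from a single code path. A minimal Python sketch (the `checkout` function and instrument names are illustrative, not part of this skill's templates):

```python
# pillars_demo.py - illustrative sketch; names are hypothetical
import structlog
from prometheus_client import Counter
from opentelemetry import trace

logger = structlog.get_logger()
checkout_total = Counter("checkout_total", "Checkout attempts", ["status"])
tracer = trace.get_tracer("shop")

def checkout(order_id: str) -> None:
    with tracer.start_as_current_span("checkout") as span:     # trace: where did time go?
        span.set_attribute("order.id", order_id)
        checkout_total.labels(status="ok").inc()               # metric: how many?
        logger.info("checkout_completed", order_id=order_id)   # log: what happened?
```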

## Tech Stack Recommendations

### All-in-One Solutions (Easiest)
- **Datadog**: Complete platform, great UX, ~$15-$31/host/month
- **New Relic**: Full-stack observability with AI insights, ~$25-$99/user/month
- **Elastic (ELK)**: Self-hosted, powerful but complex

### Open Source Stack (Recommended for Startups)
- **Logs**: Loki (lightweight, Prometheus-style)
- **Metrics**: Prometheus + Grafana
- **Traces**: Tempo or Jaeger
- **Errors**: Sentry (free tier available)
- **Dashboards**: Grafana

**Pros:** Free, flexible, vendor-independent
**Cons:** Requires setup and maintenance

### Minimum Viable Stack
- **Logs**: CloudWatch Logs / Stackdriver
- **Metrics**: CloudWatch / Cloud Monitoring
- **Errors**: Sentry
- **Dashboards**: Grafana Cloud (free tier)

## Quick Start: Python Example

Here's a minimal structured logging setup with Python:

```python
# logging_config.py
import structlog
import logging

def setup_logging():
    logging.basicConfig(format="%(message)s", level=logging.INFO)

    structlog.configure(
        processors=[
            structlog.stdlib.filter_by_level,
            structlog.stdlib.add_logger_name,
            structlog.stdlib.add_log_level,
            structlog.processors.TimeStamper(fmt="iso"),
            structlog.processors.format_exc_info,
            structlog.processors.JSONRenderer()
        ],
        wrapper_class=structlog.stdlib.BoundLogger,
        context_class=dict,
        logger_factory=structlog.stdlib.LoggerFactory(),
        cache_logger_on_first_use=True,
    )

# Usage
import structlog

logger = structlog.get_logger()

# Structured logging with context
# (user, request, request_id, user_id, and e are assumed to come from the
# surrounding request handler)
logger.info(
    "user_login",
    user_id=user.id,
    email=user.email,
    ip_address=request.ip
)

# Bind context once; it is attached to every subsequent log statement
logger = logger.bind(request_id=request_id, user_id=user_id)
logger.info("processing_payment", amount=100.00, currency="USD")
logger.error("payment_failed", error=str(e))
```

## Language-Specific Implementation Guides

For complete implementation examples including logging, metrics, tracing, and error tracking:

- **Python**: See `examples/python-setup.md`
  - structlog for logging
  - prometheus_client for metrics
  - OpenTelemetry for tracing
  - Sentry for error tracking

- **Node.js/TypeScript**: See `examples/nodejs-setup.md`
  - Pino for logging
  - prom-client for metrics
  - OpenTelemetry for tracing
  - Sentry for error tracking

- **Go**: See `examples/go-setup.md`
  - zerolog for logging
  - prometheus/client_golang for metrics
  - OpenTelemetry for tracing

## Infrastructure Setup

Ready-to-use configuration templates:

- **Complete stack**: `templates/docker-compose.yml`
  - Prometheus, Grafana, Loki, Tempo all configured
  - Just run `docker-compose up -d`

- **Prometheus config**: `templates/prometheus.yml`
  - Scrape configuration for your services

- **Alerts**: `templates/alertmanager.yml`
  - Pre-configured alerts for high error rate, latency, downtime
  - Slack integration ready

- **Dashboards**: `reference/grafana-dashboards.md`
  - Sample dashboard JSON
  - Common PromQL queries

## Complete Setup Checklist

### Application Code
- [ ] Add structured logging (JSON format)
- [ ] Instrument metrics (requests, latency, errors)
- [ ] Add distributed tracing
- [ ] Integrate error tracking (Sentry)
- [ ] Add health check endpoint
- [ ] Add metrics endpoint (/metrics) (see the sketch after this list)
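
A minimal sketch of the last two items, assuming FastAPI and prometheus_client (the `/healthz` path is a common convention, not something this skill mandates):

```python
# health_and_metrics.py - minimal sketch, assuming FastAPI + prometheus_client
from fastapi import FastAPI, Response
from prometheus_client import CONTENT_TYPE_LATEST, generate_latest

app = FastAPI()

@app.get("/healthz")
async def healthz():
    # Liveness probe: 200 means the process can serve requests
    return {"status": "ok"}

@app.get("/metrics")
async def metrics():
    # Exposes every registered prometheus_client instrument in text format
    return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)
```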

### Infrastructure
- [ ] Deploy Prometheus for metrics
- [ ] Deploy Grafana for dashboards
- [ ] Deploy Loki for logs (or use CloudWatch)
- [ ] Deploy Tempo/Jaeger for traces
- [ ] Set up AlertManager
- [ ] Configure log aggregation

### Dashboards
- [ ] Request rate
- [ ] Error rate
- [ ] Latency (P50, P95, P99)
- [ ] Resource usage (CPU, memory)
- [ ] Database performance
- [ ] Business metrics (signups, purchases, etc.)

### Alerts
- [ ] High error rate (> 5%)
- [ ] High latency (P95 > 1s)
- [ ] Service down
- [ ] High memory usage (> 80%)
- [ ] Database connection pool exhausted

## Best Practices

āœ… **DO:**
- Use structured logging (JSON)
- Include request IDs in all logs for correlation
- Sample traces (10-30%) to reduce overhead
- Create dashboards for your SLIs (Service Level Indicators)
- Alert on symptoms, not causes
- Set up on-call rotation
- Review dashboards regularly
- Use consistent metric naming conventions
- Add units to metric names (seconds, bytes, etc.)
- Set appropriate retention periods

āŒ **DON'T:**
- Log sensitive data (passwords, tokens, PII)
- Trace 100% of requests (too expensive)
- Create alerts that fire constantly
- Alert on metrics you can't act on
- Ignore alert fatigue
- Log excessively (DEBUG in production)
- Use string formatting in log messages
- Hardcode thresholds without testing

## Key Metrics to Track

### Golden Signals (SRE)
1. **Latency**: How long requests take
2. **Traffic**: How much demand on your system
3. **Errors**: Rate of failed requests
4. **Saturation**: How "full" your service is

### Application Metrics
- Request rate (requests/second)
- Error rate (errors/second, percentage)
- Response time (P50, P95, P99)
- Active connections
- Queue depth
- Cache hit rate

### System Metrics
- CPU usage
- Memory usage
- Disk I/O
- Network I/O
- File descriptors
- Thread/goroutine count

### Business Metrics
- User signups
- Successful transactions
- Revenue
- Active users
- Feature usage
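
Business metrics are just additional Prometheus instruments. A sketch, with illustrative metric and label names:

```python
# business_metrics.py - sketch; instrument and label names are illustrative
from prometheus_client import Counter

signups_total = Counter("signups_total", "Completed user signups", ["plan"])
purchases_total = Counter("purchases_total", "Successful purchases")

def on_signup(plan: str) -> None:
    # Keep label values low-cardinality (e.g. "free"/"pro"), never user IDs
    signups_total.labels(plan=plan).inc()
```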

## Instructions

1. **Choose your stack**
   - All-in-one (Datadog/New Relic) vs open source (Prometheus + Grafana)
   - Consider budget, team size, and infrastructure preferences

2. **Implement logging**
   - Pick language-specific logging library from examples/
   - Configure JSON output for structured logs
   - Add request ID middleware for correlation (see the middleware sketch after this list)

3. **Add metrics**
   - Instrument HTTP handlers (request count, duration, errors)
   - Add business metrics (signups, purchases, etc.)
   - Expose `/metrics` endpoint for Prometheus

4. **Set up tracing**
   - Use OpenTelemetry auto-instrumentation when possible
   - Add manual spans for critical paths
   - Configure sampling (10-30%)

5. **Integrate error tracking**
   - Set up Sentry or equivalent
   - Configure environment (dev/staging/prod)
   - Add user context to errors

6. **Deploy infrastructure**
   - Use `templates/docker-compose.yml` for local/small deployments
   - Or use cloud-managed services (CloudWatch, Cloud Monitoring)
   - Configure retention periods

7. **Create dashboards**
   - Start with golden signals (latency, traffic, errors, saturation)
   - Add application-specific metrics
   - Use `reference/grafana-dashboards.md` for examples

8. **Configure alerts**
   - Use `templates/alertmanager.yml` as starting point
   - Set thresholds based on your SLOs
   - Route to Slack/PagerDuty/email

9. **Test everything**
   - Generate load, verify metrics appear
   - Trigger errors, verify alerts fire
   - Check trace visualization
   - Ensure logs are searchable

10. **Document runbooks**
    - Create guides for common alerts
    - Document dashboard usage
    - Train team on debugging with observability tools
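
For step 2, a minimal request-ID middleware sketch, assuming FastAPI/Starlette and that `structlog.contextvars.merge_contextvars` has been added to the processor chain from the quick start:

```python
# request_id_middleware.py - minimal sketch, assuming FastAPI/Starlette
import uuid

import structlog
from starlette.middleware.base import BaseHTTPMiddleware

class RequestIDMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        # Reuse an upstream ID if present, otherwise mint one
        request_id = request.headers.get("X-Request-ID") or str(uuid.uuid4())
        # Every log line emitted while handling this request carries request_id
        structlog.contextvars.bind_contextvars(request_id=request_id)
        try:
            response = await call_next(request)
            response.headers["X-Request-ID"] = request_id
            return response
        finally:
            structlog.contextvars.clear_contextvars()

# Usage: app.add_middleware(RequestIDMiddleware)
```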

## Constraints

- Must not log sensitive data (PII, passwords, API keys, tokens)
- Metrics endpoint must not require auth (for Prometheus scraping)
- Tracing overhead must be < 5% (use sampling, typically 10-30%; see the sampler sketch after this list)
- Logs must be structured (JSON format)
- Must include request IDs for log correlation
- Dashboards must load in < 3 seconds
- Alert fatigue must be avoided (tune thresholds carefully)
- Must comply with data retention policies (GDPR, etc.)
- Metric cardinality must be controlled (avoid unbounded labels)
- Observability infrastructure must be reliable (monitor the monitors)
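
For the tracing-overhead constraint, a sketch of a 10% parent-based sampler using the OpenTelemetry Python SDK; treat the ratio as a starting point to tune against your overhead budget:

```python
# sampling.py - minimal sketch using the OpenTelemetry Python SDK
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample 10% of new traces; honor the parent's decision for propagated ones
sampler = ParentBased(TraceIdRatioBased(0.1))
trace.set_tracer_provider(TracerProvider(sampler=sampler))
```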


---

## Referenced Files

> The following files are referenced in this skill and included for context.

### examples/python-setup.md

````markdown
# Python Observability Setup

Complete implementation guide for observability in Python applications.

## Structured Logging with structlog

```python
# logging_config.py
import structlog
import logging

def setup_logging():
    logging.basicConfig(
        format="%(message)s",
        level=logging.INFO,
    )

    structlog.configure(
        processors=[
            structlog.stdlib.filter_by_level,
            structlog.stdlib.add_logger_name,
            structlog.stdlib.add_log_level,
            structlog.processors.TimeStamper(fmt="iso"),
            structlog.processors.StackInfoRenderer(),
            structlog.processors.format_exc_info,
            structlog.processors.JSONRenderer()
        ],
        wrapper_class=structlog.stdlib.BoundLogger,
        context_class=dict,
        logger_factory=structlog.stdlib.LoggerFactory(),
        cache_logger_on_first_use=True,
    )

# Usage
import structlog

logger = structlog.get_logger()

# Structured logging
logger.info(
    "user_login",
    user_id=user.id,
    email=user.email,
    ip_address=request.ip,
    user_agent=request.headers.get("User-Agent")
)

# With context
logger = logger.bind(request_id=request_id, user_id=user_id)
logger.info("processing_payment", amount=100.00, currency="USD")
logger.error("payment_failed", error=str(e), amount=100.00)
```

## Metrics with Prometheus

```python
# metrics.py
import functools
import time

from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Define metrics
http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

http_request_duration_seconds = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint']
)

active_users = Gauge(
    'active_users',
    'Number of active users'
)

# Middleware
def track_request(method, endpoint):
    def decorator(func):
        @functools.wraps(func)  # preserve the wrapped handler's name/docstring
        def wrapper(*args, **kwargs):
            start = time.time()
            try:
                result = func(*args, **kwargs)
                status = 200
                return result
            except Exception as e:
                status = 500
                raise
            finally:
                duration = time.time() - start
                http_requests_total.labels(
                    method=method,
                    endpoint=endpoint,
                    status=status
                ).inc()
                http_request_duration_seconds.labels(
                    method=method,
                    endpoint=endpoint
                ).observe(duration)
        return wrapper
    return decorator

# Usage
@track_request('GET', '/api/users')
def get_users():
    # Handler logic
    return users

# Start metrics server
if __name__ == '__main__':
    start_http_server(8000)  # Metrics on :8000/metrics
```

## Distributed Tracing with OpenTelemetry

```python
# tracing.py
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

def setup_tracing(app, service_name="my-service"):
    # Set up tracer
    trace.set_tracer_provider(TracerProvider())
    tracer = trace.get_tracer(__name__)

    # Export to Tempo/Jaeger
    otlp_exporter = OTLPSpanExporter(
        endpoint="http://localhost:4317",
        insecure=True
    )
    span_processor = BatchSpanProcessor(otlp_exporter)
    trace.get_tracer_provider().add_span_processor(span_processor)

    # Auto-instrument FastAPI
    FastAPIInstrumentor.instrument_app(app)

    # Auto-instrument SQLAlchemy (engine is your SQLAlchemy engine, assumed
    # to be created elsewhere in the application)
    SQLAlchemyInstrumentor().instrument(engine=engine)

    return tracer

# Manual instrumentation
tracer = trace.get_tracer(__name__)

@tracer.start_as_current_span("process_order")
def process_order(order_id):
    span = trace.get_current_span()
    span.set_attribute("order.id", order_id)
    span.set_attribute("order.amount", order.amount)

    try:
        result = payment_service.charge(order)
        span.set_attribute("payment.status", "success")
        return result
    except Exception as e:
        span.set_status(Status(StatusCode.ERROR))
        span.record_exception(e)
        raise
```

## Error Tracking with Sentry

```python
# sentry_config.py
import os

import sentry_sdk
from sentry_sdk.integrations.fastapi import FastApiIntegration
from sentry_sdk.integrations.sqlalchemy import SqlalchemyIntegration

def setup_sentry(app):
    sentry_sdk.init(
        dsn=os.environ['SENTRY_DSN'],
        environment=os.environ.get('ENV', 'development'),
        traces_sample_rate=0.1,  # 10% of transactions traced
        profiles_sample_rate=0.1,
        integrations=[
            FastApiIntegration(),
            SqlalchemyIntegration(),
        ],
    )

# Usage - automatic for unhandled exceptions
# Manual error capture: attach context first so it appears on the event
from sentry_sdk import capture_exception

try:
    risky_operation()
except Exception as e:
    sentry_sdk.set_context("order", {
        "id": order.id,
        "amount": order.amount,
    })
    sentry_sdk.set_user({"id": user.id, "email": user.email})
    capture_exception(e)

# Breadcrumbs for debugging
sentry_sdk.add_breadcrumb(
    category='order',
    message='Order validation started',
    level='info',
)
```

## Complete Integration Example

```python
# app.py
from fastapi import FastAPI
from logging_config import setup_logging
from tracing import setup_tracing
from sentry_config import setup_sentry
from metrics import start_http_server
import structlog

app = FastAPI()

# Initialize observability
setup_logging()
setup_tracing(app, service_name="my-api")
setup_sentry(app)

# Expose Prometheus metrics on :8000
# (start_http_server is non-blocking; it serves from a daemon thread)
start_http_server(8000)

logger = structlog.get_logger()

@app.get("/api/users")
async def get_users():
    logger.info("fetching_users", endpoint="/api/users")
    # Handler logic
    return {"users": []}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8080)
```

## Dependencies

```bash
pip install structlog prometheus-client opentelemetry-api opentelemetry-sdk \
    opentelemetry-instrumentation-fastapi opentelemetry-instrumentation-sqlalchemy \
    opentelemetry-exporter-otlp sentry-sdk
```

````

### examples/nodejs-setup.md

````markdown
# Node.js/TypeScript Observability Setup

Complete implementation guide for observability in Node.js/TypeScript applications.

## Structured Logging with Pino

```typescript
// logger.ts
import pino from 'pino';

export const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
  timestamp: pino.stdTimeFunctions.isoTime,
  ...(process.env.NODE_ENV === 'production'
    ? {}
    : { transport: { target: 'pino-pretty' } }),
});

// Usage
import { logger } from './logger';

logger.info({
  event: 'user_login',
  userId: user.id,
  email: user.email,
  ipAddress: req.ip,
}, 'User logged in');

logger.error({
  event: 'payment_failed',
  userId: user.id,
  amount: 100.00,
  error: error.message,
  stack: error.stack,
}, 'Payment processing failed');
```

## Metrics with Prometheus

```typescript
// metrics.ts
import type { NextFunction, Request, Response } from 'express';
import promClient from 'prom-client';

// Enable default metrics (CPU, memory, etc.)
promClient.collectDefaultMetrics();

export const httpRequestsTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status'],
});

export const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route'],
  buckets: [0.1, 0.5, 1, 2, 5],
});

export const activeConnections = new promClient.Gauge({
  name: 'active_connections',
  help: 'Number of active connections',
});

// Middleware
export function metricsMiddleware(req: Request, res: Response, next: NextFunction) {
  const start = Date.now();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;

    httpRequestsTotal.labels(req.method, req.route?.path || req.path, String(res.statusCode)).inc();
    httpRequestDuration.labels(req.method, req.route?.path || req.path).observe(duration);
  });

  next();
}

// Metrics endpoint (app is your Express application, created elsewhere)
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', promClient.register.contentType);
  res.end(await promClient.register.metrics());
});
```

## Distributed Tracing with OpenTelemetry

```typescript
// tracing.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';

export function setupTracing(serviceName: string) {
  const sdk = new NodeSDK({
    serviceName,
    traceExporter: new OTLPTraceExporter({
      url: 'http://localhost:4317',
    }),
    instrumentations: [getNodeAutoInstrumentations()],
  });

  sdk.start();

  process.on('SIGTERM', () => {
    sdk.shutdown().then(() => process.exit(0));
  });
}

// Manual spans
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('my-service');

async function processOrder(orderId: string) {
  return await tracer.startActiveSpan('process_order', async (span) => {
    span.setAttribute('order.id', orderId);

    try {
      // order is assumed to be loaded from orderId by surrounding code
      const result = await paymentService.charge(order);
      span.setAttribute('payment.status', 'success');
      return result;
    } catch (error) {
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw error;
    } finally {
      span.end();
    }
  });
}
```

## Error Tracking with Sentry

```typescript
// sentry.ts
import type { Express } from 'express';
import * as Sentry from '@sentry/node';
import { ProfilingIntegration } from '@sentry/profiling-node';

export function setupSentry(app: Express) {
  Sentry.init({
    dsn: process.env.SENTRY_DSN,
    environment: process.env.NODE_ENV,
    tracesSampleRate: 0.1,
    profilesSampleRate: 0.1,
    integrations: [
      new Sentry.Integrations.Http({ tracing: true }),
      new Sentry.Integrations.Express({ app }),
      new ProfilingIntegration(),
    ],
  });

  // Request handler must be first
  app.use(Sentry.Handlers.requestHandler());
  app.use(Sentry.Handlers.tracingHandler());

  // Routes...

  // Error handler must be last
  app.use(Sentry.Handlers.errorHandler());
}

// Manual error capture
import * as Sentry from '@sentry/node';

try {
  await riskyOperation();
} catch (error) {
  Sentry.captureException(error, {
    tags: { section: 'payment' },
    extra: { orderId, amount },
    user: { id: user.id, email: user.email },
  });
}

// Breadcrumbs
Sentry.addBreadcrumb({
  category: 'order',
  message: 'Order validation started',
  level: 'info',
});
```

## Complete Integration Example

```typescript
// app.ts
import express from 'express';
import { setupTracing } from './tracing';
import { setupSentry } from './sentry';
import { logger } from './logger';
import { metricsMiddleware } from './metrics';

const app = express();

// Initialize observability
setupTracing('my-api');
setupSentry(app);

// Add middleware
app.use(metricsMiddleware);

app.get('/api/users', async (req, res) => {
  logger.info({ event: 'fetching_users', endpoint: '/api/users' }, 'Fetching users');
  // Handler logic
  res.json({ users: [] });
});

app.listen(8080, () => {
  logger.info({ event: 'server_started', port: 8080 }, 'Server started');
});
```

## Dependencies

```json
{
  "dependencies": {
    "pino": "^8.0.0",
    "pino-pretty": "^10.0.0",
    "prom-client": "^15.0.0",
    "@opentelemetry/sdk-node": "^0.45.0",
    "@opentelemetry/auto-instrumentations-node": "^0.40.0",
    "@opentelemetry/exporter-trace-otlp-grpc": "^0.45.0",
    "@sentry/node": "^7.0.0",
    "@sentry/profiling-node": "^1.0.0"
  }
}
```

````

### examples/go-setup.md

````markdown
# Go Observability Setup

Complete implementation guide for observability in Go applications.

## Structured Logging with zerolog

```go
// logger.go
package main

import (
    "github.com/rs/zerolog"
    "github.com/rs/zerolog/log"
    "os"
)

func SetupLogger() {
    zerolog.TimeFieldFormat = zerolog.TimeFormatUnix

    if os.Getenv("ENV") == "development" {
        log.Logger = log.Output(zerolog.ConsoleWriter{Out: os.Stderr})
    }
}

// Usage
log.Info().
    Str("event", "user_login").
    Str("user_id", user.ID).
    Str("email", user.Email).
    Str("ip_address", r.RemoteAddr).
    Msg("User logged in")

log.Error().
    Err(err).
    Str("event", "payment_failed").
    Float64("amount", 100.00).
    Msg("Payment processing failed")
```

## Metrics with Prometheus

```go
// metrics.go
package main

import (
    "net/http"
    "strconv"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    httpRequestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total HTTP requests",
        },
        []string{"method", "endpoint", "status"},
    )

    httpRequestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration",
            Buckets: prometheus.DefBuckets,
        },
        []string{"method", "endpoint"},
    )
)

func init() {
    prometheus.MustRegister(httpRequestsTotal)
    prometheus.MustRegister(httpRequestDuration)
}

// responseWriter wraps http.ResponseWriter to capture the status code
type responseWriter struct {
    http.ResponseWriter
    statusCode int
}

func (rw *responseWriter) WriteHeader(code int) {
    rw.statusCode = code
    rw.ResponseWriter.WriteHeader(code)
}

// Middleware
func metricsMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()

        // Wrap response writer to capture status
        wrapped := &responseWriter{ResponseWriter: w, statusCode: http.StatusOK}
        next.ServeHTTP(wrapped, r)

        duration := time.Since(start).Seconds()
        httpRequestsTotal.WithLabelValues(r.Method, r.URL.Path, strconv.Itoa(wrapped.statusCode)).Inc()
        httpRequestDuration.WithLabelValues(r.Method, r.URL.Path).Observe(duration)
    })
}

// Metrics endpoint
http.Handle("/metrics", promhttp.Handler())
```

## Distributed Tracing with OpenTelemetry

```go
// tracing.go
package main

import (
    "context"

    "github.com/rs/zerolog/log"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/resource"
    "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.21.0" // match this version to your SDK
)

func SetupTracing(serviceName string) func() {
    ctx := context.Background()

    exporter, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint("localhost:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        log.Fatal().Err(err).Msg("Failed to create exporter")
    }

    tp := trace.NewTracerProvider(
        trace.WithBatcher(exporter),
        trace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String(serviceName),
        )),
    )

    otel.SetTracerProvider(tp)

    return func() {
        _ = tp.Shutdown(ctx)
    }
}

// Usage
func processOrder(ctx context.Context, orderID string) error {
    tracer := otel.Tracer("my-service")
    ctx, span := tracer.Start(ctx, "process_order")
    defer span.End()

    span.SetAttributes(attribute.String("order.id", orderID))

    // Business logic (order and paymentService are assumed to exist in your app)
    err := paymentService.Charge(ctx, order)
    if err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
        return err
    }

    span.SetAttributes(attribute.String("payment.status", "success"))
    return nil
}
```

## Complete Integration Example

```go
// main.go
package main

import (
    "net/http"
    "github.com/rs/zerolog/log"
)

func main() {
    // Initialize observability
    SetupLogger()
    cleanup := SetupTracing("my-api")
    defer cleanup()

    // Add middleware
    mux := http.NewServeMux()
    mux.HandleFunc("/api/users", getUsersHandler)

    handler := metricsMiddleware(mux)

    log.Info().Msg("Starting server on :8080")
    http.ListenAndServe(":8080", handler)
}

func getUsersHandler(w http.ResponseWriter, r *http.Request) {
    log.Info().Str("event", "fetching_users").Msg("Fetching users")

    // Handler logic
    w.Header().Set("Content-Type", "application/json")
    w.Write([]byte(`{"users": []}`))
}
```

## Dependencies

```bash
go get github.com/rs/zerolog
go get github.com/prometheus/client_golang/prometheus
go get go.opentelemetry.io/otel
go get go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc
```

````

### templates/docker-compose.yml

```yaml
# Complete Observability Stack with Docker Compose
# Includes: Prometheus (metrics), Grafana (dashboards), Loki (logs), Tempo (traces)

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    depends_on:
      - prometheus

  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yml:/etc/loki/local-config.yaml
      - loki-data:/loki

  tempo:
    image: grafana/tempo:latest
    ports:
      - "3200:3200"   # Tempo HTTP
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
    volumes:
      - ./tempo-config.yml:/etc/tempo.yaml
      - tempo-data:/tmp/tempo
    command: ["-config.file=/etc/tempo.yaml"]

volumes:
  prometheus-data:
  grafana-data:
  loki-data:
  tempo-data:

```

### templates/prometheus.yml

```yaml
# Prometheus Configuration
# Scrapes metrics from your application and system exporters

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'my-app'
    static_configs:
      - targets: ['app:8000']  # Your app metrics endpoint

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

```

### templates/alertmanager.yml

```yaml
# AlertManager Configuration
# Handles alerts from Prometheus and routes to notification channels

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'slack'

receivers:
  - name: 'slack'
    slack_configs:
      - api_url: 'YOUR_SLACK_WEBHOOK_URL'
        channel: '#alerts'
        title: 'Alert: {{ .CommonLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

# Alert Rules (place in prometheus.yml or separate rules file)
groups:
  - name: app_alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        # Error ratio: 5xx responses as a fraction of all requests (matches the 5% target)
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          description: 'Error ratio is {{ $value | humanizePercentage }} (threshold: 5%)'

      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          description: 'P95 latency is {{ $value }}s (threshold: 1s)'

```
