llm-serving-patterns
LLM inference infrastructure, serving frameworks (vLLM, TGI, TensorRT-LLM), quantization techniques, batching strategies, and streaming response patterns. Use when designing LLM serving infrastructure, optimizing inference latency, or scaling LLM deployments.
Packaged view
This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.
Install command
npx @skill-hub/cli install benchflow-ai-skillsbench-llm-serving-patterns
Repository
Skill path: registry/terminal_bench_2.0/full_batch_reviewed/terminal_bench_2_0_llm-inference-batching-scheduler/environment/skills/llm-serving-patterns
Best for
Primary workflow: Run DevOps.
Technical facets: Full Stack, DevOps.
Target audience: everyone.
License: Unknown.
Original source
Catalog source: SkillHub Club.
Repository owner: benchflow-ai.
This is a mirrored public skill entry. Review the repository before installing it into production workflows.
What it helps with
- Install llm-serving-patterns into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
- Review https://github.com/benchflow-ai/SkillsBench before adding llm-serving-patterns to shared team environments
- Use llm-serving-patterns for development workflows
Works across
Favorites: 0.
Sub-skills: 0.
Aggregator: No.
Original source / Raw SKILL.md
---
name: llm-serving-patterns
description: LLM inference infrastructure, serving frameworks (vLLM, TGI, TensorRT-LLM), quantization techniques, batching strategies, and streaming response patterns. Use when designing LLM serving infrastructure, optimizing inference latency, or scaling LLM deployments.
allowed-tools: Read, Glob, Grep
---
# LLM Serving Patterns
## When to Use This Skill
Use this skill when:
- Designing LLM inference infrastructure
- Choosing between serving frameworks (vLLM, TGI, TensorRT-LLM)
- Implementing quantization for production deployment
- Optimizing batching and throughput
- Building streaming response systems
- Scaling LLM deployments cost-effectively
**Keywords:** LLM serving, inference, vLLM, TGI, TensorRT-LLM, quantization, INT8, INT4, FP16, batching, continuous batching, streaming, SSE, WebSocket, KV cache, PagedAttention, speculative decoding
## LLM Serving Architecture Overview
```text
┌─────────────────────────────────────────────────────────────────────┐
│ LLM Serving Stack │
├─────────────────────────────────────────────────────────────────────┤
│ Clients (API, Chat UI, Agents) │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Load Balancer / API Gateway │ │
│ │ • Rate limiting • Authentication • Request routing │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Inference Server │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │ │
│ │ │ Request │ │ Batching │ │ KV Cache │ │ │
│ │ │ Queue │──▶│ Engine │──▶│ Management │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────────────┘ │ │
│ │ │ │ │ │
│ │ ▼ ▼ │ │
│ │ ┌─────────────────────────────────────────────────────┐ │ │
│ │ │ Model Execution Engine │ │ │
│ │ │ • Tensor operations • Attention • Token sampling │ │ │
│ │ └─────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ GPU/TPU Cluster │ │
│ │ • Model sharding • Tensor parallelism • Pipeline parallel │ │
│ └─────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
```
## Serving Framework Comparison
| Framework | Strengths | Best For | Considerations |
| --------- | --------- | -------- | -------------- |
| **vLLM** | PagedAttention, high throughput, continuous batching | General LLM serving, high concurrency | Python-native, active community |
| **TGI (Text Generation Inference)** | Production-ready, Hugging Face integration | Enterprise deployment, HF models | Rust backend, Docker-first |
| **TensorRT-LLM** | NVIDIA optimization, lowest latency | NVIDIA GPUs, latency-critical | NVIDIA-only, complex setup |
| **Triton Inference Server** | Multi-model, multi-framework | Heterogeneous model serving | Enterprise complexity |
| **Ollama** | Simple local deployment | Development, edge deployment | Limited scaling features |
| **llama.cpp** | CPU inference, quantization | Resource-constrained, edge | C++ integration required |
### Framework Selection Decision Tree
```text
Need lowest latency on NVIDIA GPUs?
├── Yes → TensorRT-LLM
└── No
└── Need high throughput with many concurrent users?
├── Yes → vLLM (PagedAttention)
└── No
└── Need enterprise features + HF integration?
├── Yes → TGI
└── No
└── Simple local/edge deployment?
├── Yes → Ollama or llama.cpp
└── No → vLLM (general purpose)
```
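If the decision tree lands on vLLM, a minimal offline-inference sketch looks like the following; the model name and sampling values are placeholders, and the same engine also exposes an OpenAI-compatible HTTP server for production use.

```python
# Minimal vLLM sketch. Continuous batching and PagedAttention are handled by the
# engine; the model name and sampling values below are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=256)

prompts = [
    "Explain continuous batching in one sentence.",
    "What is PagedAttention?",
]

# generate() batches the prompts internally and returns one output per prompt.
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```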
## Quantization Techniques
### Precision Levels
| Precision | Bits | Memory Reduction | Quality Impact | Use Case |
| --------- | ---- | ---------------- | -------------- | -------- |
| FP32 | 32 | Baseline | None | Training, reference |
| FP16/BF16 | 16 | 2x | Minimal | Standard serving |
| INT8 | 8 | 4x | Low | Production serving |
| INT4 | 4 | 8x | Moderate | Resource-constrained |
| INT2 | 2 | 16x | Significant | Experimental |
### Quantization Methods
| Method | Description | Quality | Speed |
| ------ | ----------- | ------- | ----- |
| **PTQ (Post-Training Quantization)** | Quantize after training, no retraining | Good | Fast to apply |
| **QAT (Quantization-Aware Training)** | Simulate quantization during training | Better | Requires training |
| **GPTQ** | One-shot weight quantization | Very good | Moderate |
| **AWQ (Activation-aware Weight Quantization)** | Preserves salient weights | Excellent | Moderate |
| **GGUF/GGML** | llama.cpp format, CPU-optimized | Good | Very fast inference |
| **SmoothQuant** | Migrates difficulty to weights | Excellent | Moderate |
### Quantization Selection
```text
Quality vs. Efficiency Trade-off:
Quality ────────────────────────────────────────────▶ Efficiency
│ │
│ FP32 FP16 INT8+AWQ INT8+GPTQ INT4 INT2 │
│ ○───────○────────○──────────○──────────○──────○ │
│ │ │ │ │ │ │ │
│ Best Great Good Good Fair Poor │
│ │
```
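One concrete path to low-precision serving is loading a pre-quantized checkpoint rather than quantizing in-house. The sketch below assumes a vLLM deployment and an AWQ-quantized model from the Hugging Face Hub; the repo name is illustrative.

```python
# Sketch: serving a pre-quantized AWQ checkpoint with vLLM. The repo name is
# illustrative; GPTQ checkpoints work the same way with quantization="gptq".
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-13B-chat-AWQ",  # 4-bit AWQ weights, ~4x smaller than FP16
    quantization="awq",                     # tells the engine how the weights were quantized
    dtype="float16",                        # activations stay in half precision
)

out = llm.generate(["Summarize AWQ in one sentence."], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```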
## Batching Strategies
### Static Batching
```text
Request 1: [tokens: 100] ─┐
Request 2: [tokens: 50] ─┼──▶ [Batch: pad to 100] ──▶ Process ──▶ All complete
Request 3: [tokens: 80] ─┘
Problem: Short requests wait for long ones (head-of-line blocking)
```
### Continuous Batching (Preferred)
```text
Time ──────────────────────────────────────────────────────────▶
Req 1: [████████████████████████████████] ──▶ Complete
Req 2: [████████████] ──▶ Complete ──▶ Req 4 starts [████████████████]
Req 3: [████████████████████] ──▶ Complete ──▶ Req 5 starts [████████]
• New requests join batch as others complete
• No padding waste
• Optimal GPU utilization
```
### Batching Parameters
| Parameter | Description | Trade-off |
| --------- | ----------- | --------- |
| `max_batch_size` | Maximum concurrent requests | Memory vs. throughput |
| `max_waiting_tokens` | Tokens before forcing batch | Latency vs. throughput |
| `max_num_seqs` | Maximum sequences in batch | Memory vs. concurrency |
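The scheduling idea behind continuous batching is simple enough to sketch as a token-level loop. The code below is a toy illustration only, not any framework's actual scheduler; `Request` and `decode_step` are hypothetical stand-ins.

```python
# Toy illustration of continuous (token-level) batching.
# NOT a real framework scheduler; Request and decode_step are hypothetical.
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    max_tokens: int
    generated: list = field(default_factory=list)

def decode_step(batch):
    """Pretend to run one forward pass and emit one token per running sequence."""
    for req in batch:
        req.generated.append("<tok>")

def serve(waiting: deque, max_num_seqs: int = 8):
    running = []
    while waiting or running:
        # Admit new requests whenever a slot frees up (no head-of-line blocking).
        while waiting and len(running) < max_num_seqs:
            running.append(waiting.popleft())
        decode_step(running)
        # Completed sequences leave the batch immediately; others keep decoding.
        running = [r for r in running if len(r.generated) < r.max_tokens]

serve(deque([Request("hi", 4), Request("long prompt", 12), Request("hello", 2)]))
```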
## KV Cache Management
### The KV Cache Problem
```text
Attention(Q, K, V) = softmax(Q × K^T / √d_k) × V
For each new token generated:
• Attention must span ALL previous tokens, so their K and V are cached rather than recomputed
• The cached K and V tensors grow with sequence length
• Memory: O(batch_size × seq_len × num_layers × hidden_dim)
Example (70B model, 4K context):
• KV cache per request: ~8GB
• 10 concurrent requests: ~80GB GPU memory
```
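The figures above can be sanity-checked with a back-of-the-envelope calculation. The dimensions below are approximate 70B-class values (80 layers, hidden size 8192) without grouped-query attention, so the result is an upper bound in the same range as the ~8 GB estimate; grouped-query attention shrinks it by the ratio of attention heads to KV heads.

```python
# Back-of-the-envelope KV cache size per request (FP16, no GQA).
# Dimensions are approximate 70B-class values, used for illustration only.
num_layers = 80
hidden_dim = 8192
seq_len = 4096
bytes_per_value = 2  # FP16

# 2x for K and V, per layer, per token, per hidden dimension
kv_bytes_per_request = 2 * num_layers * seq_len * hidden_dim * bytes_per_value
print(f"~{kv_bytes_per_request / 1e9:.1f} GB per request")                  # ~10.7 GB
print(f"~{10 * kv_bytes_per_request / 1e9:.0f} GB for 10 concurrent requests")
```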
### PagedAttention (vLLM Innovation)
```text
Traditional KV Cache:
┌──────────────────────────────────────────┐
│ Request 1 KV Cache (contiguous, fixed) │ ← Wastes memory
├──────────────────────────────────────────┤
│ Request 2 KV Cache (contiguous, fixed) │
├──────────────────────────────────────────┤
│ FRAGMENTED/WASTED SPACE │
└──────────────────────────────────────────┘
PagedAttention:
┌────┬────┬────┬────┬────┬────┬────┬────┐
│ R1 │ R2 │ R1 │ R3 │ R2 │ R1 │ R3 │ R2 │ ← Pages allocated on demand
└────┴────┴────┴────┴────┴────┴────┴────┘
• Non-contiguous memory allocation
• Near-zero memory waste
• 2-4x higher throughput
```
### KV Cache Optimization Strategies
| Strategy | Description | Memory Savings |
| -------- | ----------- | -------------- |
| **Paged Attention** | Virtual memory for KV cache | ~50% reduction |
| **Prefix Caching** | Reuse KV cache for common prefixes | System prompt: 100% |
| **Quantized KV Cache** | INT8/FP8 for KV values | 50-75% reduction |
| **Sliding Window** | Limited attention context | Linear memory |
| **MQA/GQA** | Grouped query attention | Architecture-dependent |
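Several of these strategies are single engine flags in practice. The sketch below shows how they might be enabled in vLLM; flag names can vary by version, and the FP8 KV cache requires hardware support.

```python
# Sketch: opting into KV cache optimizations in vLLM (flag names may vary by version).
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_prefix_caching=True,    # reuse KV blocks for shared prompt prefixes (system prompts)
    kv_cache_dtype="fp8",          # quantize cached K/V values; needs FP8-capable hardware
    gpu_memory_utilization=0.90,   # fraction of GPU memory budgeted for weights + KV cache
)
```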
## Streaming Response Patterns
### Server-Sent Events (SSE)
```text
Client Server
│ │
│──── GET /v1/chat/completions ──────▶│
│ (stream: true) │
│ │
│◀──── HTTP 200 OK ───────────────────│
│ Content-Type: text/event-stream│
│ │
│◀──── data: {"token": "Hello"} ──────│
│◀──── data: {"token": " world"} ─────│
│◀──── data: {"token": "!"} ──────────│
│◀──── data: [DONE] ──────────────────│
│ │
```
**SSE Benefits:**
- HTTP/1.1 compatible
- Auto-reconnection support
- Simple to implement
- Wide client support
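A minimal SSE endpoint can be sketched with FastAPI's `StreamingResponse`. The `fake_generate` generator below is a placeholder for a real inference engine's token stream; OpenAI-style APIs expose this as a POST endpoint with a JSON body.

```python
# Minimal SSE streaming sketch (FastAPI). fake_generate() is a placeholder for a
# real inference engine's token stream.
import asyncio
import json
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def fake_generate(prompt: str):
    for token in ["Hello", " world", "!"]:
        await asyncio.sleep(0.05)  # simulate per-token decode latency
        yield token

@app.post("/v1/chat/completions")
async def stream_completion(body: dict):
    async def event_stream():
        async for token in fake_generate(body.get("prompt", "")):
            yield f"data: {json.dumps({'token': token})}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(event_stream(), media_type="text/event-stream")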
### WebSocket Streaming
```text
Client Server
│ │
│──── WebSocket Upgrade ─────────────▶│
│◀──── 101 Switching Protocols ───────│
│ │
│──── {"prompt": "Hello"} ───────────▶│
│ │
│◀──── {"token": "Hi"} ───────────────│
│◀──── {"token": " there"} ───────────│
│◀──── {"token": "!"} ────────────────│
│◀──── {"done": true} ────────────────│
│ │
```
**WebSocket Benefits:**
- Bidirectional communication
- Lower latency
- Better for chat applications
- Connection persistence
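The same token stream over a WebSocket, again sketched with FastAPI; `fake_generate` is the same kind of placeholder for the real engine.

```python
# Minimal WebSocket streaming sketch (FastAPI). fake_generate() stands in for a
# real inference engine's token stream.
import asyncio
from fastapi import FastAPI, WebSocket

app = FastAPI()

async def fake_generate(prompt: str):
    for token in ["Hi", " there", "!"]:
        await asyncio.sleep(0.05)
        yield token

@app.websocket("/generate")
async def generate(ws: WebSocket):
    await ws.accept()
    request = await ws.receive_json()            # e.g. {"prompt": "Hello"}
    async for token in fake_generate(request["prompt"]):
        await ws.send_json({"token": token})
    await ws.send_json({"done": True})
```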
### Streaming Implementation Considerations
| Aspect | SSE | WebSocket |
| ------ | --- | --------- |
| **Reconnection** | Built-in | Manual |
| **Scalability** | Per-request | Connection pool |
| **Load Balancing** | Standard HTTP | Sticky sessions |
| **Firewall/Proxy** | Usually works | May need config |
| **Best For** | One-way streaming | Interactive chat |
## Speculative Decoding
### Concept
```text
Standard Decoding:
Large Model: [T1] → [T2] → [T3] → [T4] → [T5]
10ms 10ms 10ms 10ms 10ms = 50ms total
Speculative Decoding:
Draft Model: [T1, T2, T3, T4, T5] (parallel, 5ms)
│
▼
Large Model: [Verify T1-T5 in one pass] (15ms)
Accept: T1, T2, T3 ✓ Reject: T4, T5 ✗
│
▼
[Generate T4, T5 correctly]
Total: ~25ms (2x speedup if 60% acceptance)
```
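The expected speedup follows directly from the acceptance rate. The snippet below reproduces the arithmetic of the diagram with its illustrative timings; it is a rough model, not a benchmark.

```python
# Back-of-the-envelope speculative decoding speedup (illustrative timings).
def speculative_speedup(draft_ms, verify_ms, target_ms, k, acceptance_rate):
    """Compare one speculative round (draft k tokens, verify in one target pass)
    against generating the same expected number of tokens sequentially."""
    expected_accepted = k * acceptance_rate   # simple approximation
    tokens_per_round = expected_accepted + 1  # the verification pass also yields one token
    speculative_cost = draft_ms + verify_ms
    sequential_cost = tokens_per_round * target_ms
    return sequential_cost / speculative_cost

# Values from the diagram: 5 ms draft, 15 ms verify, 10 ms/token target, 5 drafted tokens
print(f"{speculative_speedup(5, 15, 10, 5, 0.6):.1f}x")   # ~2.0x
```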
### Speculative Decoding Trade-offs
| Factor | Impact |
| ------ | ------ |
| **Draft model quality** | Higher match rate = more speedup |
| **Draft model size** | Larger = better quality, slower |
| **Speculation depth** | More tokens = higher risk/reward |
| **Verification cost** | Must be < sequential generation |
## Scaling Strategies
### Horizontal Scaling
```text
┌─────────────────────────────────────────────────────────┐
│ Load Balancer │
│ (Round-robin, Least-connections) │
└─────────────────────────────────────────────────────────┘
│ │ │
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ vLLM │ │ vLLM │ │ vLLM │
│ Node 1 │ │ Node 2 │ │ Node 3 │
│ (GPU×4) │ │ (GPU×4) │ │ (GPU×4) │
└─────────┘ └─────────┘ └─────────┘
```
### Model Parallelism
| Strategy | Description | Use Case |
| -------- | ----------- | -------- |
| **Tensor Parallelism** | Split layers across GPUs | Single large model |
| **Pipeline Parallelism** | Different layers on different GPUs | Very large models |
| **Data Parallelism** | Same model, different batches | High throughput |
```text
Tensor Parallelism (TP=4):
┌─────────────────────────────────────────┐
│ Layer N │
│ GPU0 │ GPU1 │ GPU2 │ GPU3 │
│ 25% │ 25% │ 25% │ 25% │
└─────────────────────────────────────────┘
Pipeline Parallelism (PP=4):
GPU0: Layers 0-7
GPU1: Layers 8-15
GPU2: Layers 16-23
GPU3: Layers 24-31
```
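In vLLM, for example, tensor parallelism within a node is a single engine argument; the sketch below assumes a 4-GPU node and uses a placeholder 70B-class model name.

```python
# Sketch: tensor parallelism across 4 GPUs on one node (vLLM).
# Pipeline parallelism across nodes is configured separately and is not shown here.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder 70B-class model
    tensor_parallel_size=4,                     # shard each layer's weights across 4 GPUs
)
```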
## Latency Optimization Checklist
### Pre-deployment
- [ ] Choose appropriate quantization (INT8 for production)
- [ ] Enable continuous batching
- [ ] Configure KV cache size appropriately
- [ ] Set optimal batch size for hardware
- [ ] Enable prefix caching for system prompts
### Runtime
- [ ] Monitor GPU memory utilization
- [ ] Track p50/p95/p99 latencies
- [ ] Measure time-to-first-token (TTFT)
- [ ] Monitor tokens-per-second (TPS)
- [ ] Set appropriate timeouts
### Infrastructure
- [ ] Use fastest available interconnect (NVLink, InfiniBand)
- [ ] Minimize network hops
- [ ] Place inference close to users (edge)
- [ ] Consider dedicated inference hardware
## Cost Optimization
### Cost Drivers
| Factor | Impact | Optimization |
| ------ | ------ | ------------ |
| **GPU hours** | Highest | Quantization, batching |
| **Memory** | High | PagedAttention, KV cache optimization |
| **Network** | Medium | Response compression, edge deployment |
| **Storage** | Low | Model deduplication |
### Cost Estimation Formula
```text
Monthly Cost =
(Requests/month) × (Avg tokens/request) × (GPU-seconds/token) × ($/GPU-hour)
─────────────────────────────────────────────────────────────────────────────
3600
Example:
• 10M requests/month
• 500 tokens average
• 0.001 GPU-seconds/token (optimized)
• $2/GPU-hour
Cost = (10M × 500 × 0.001 × 2) / 3600 = $2,778/month
```
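The same formula as a short calculation, using the example inputs above:

```python
# Cost formula from above, with the example inputs.
requests_per_month = 10_000_000
avg_tokens_per_request = 500
gpu_seconds_per_token = 0.001      # optimized serving stack
dollars_per_gpu_hour = 2.0

gpu_hours = requests_per_month * avg_tokens_per_request * gpu_seconds_per_token / 3600
monthly_cost = gpu_hours * dollars_per_gpu_hour
print(f"${monthly_cost:,.0f}/month")   # ~$2,778/month
```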
## Common Patterns
### Multi-model Routing
```text
┌─────────────────────────────────────────────────────────┐
│ Router │
│ • Classify request complexity │
│ • Route to appropriate model │
└─────────────────────────────────────────────────────────┘
│ │ │
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Small │ │ Medium │ │ Large │
│ Model │ │ Model │ │ Model │
│ (7B) │ │ (13B) │ │ (70B) │
│ Fast │ │ Balanced│ │ Quality │
└─────────┘ └─────────┘ └─────────┘
```
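A router can start as a simple heuristic on the request; the thresholds, tier names, and heuristic below are placeholders for whatever classifier and model fleet a deployment actually uses.

```python
# Toy complexity-based router; thresholds, tier names, and the heuristic itself
# are placeholders for a real classifier.
def route(prompt: str, needs_reasoning: bool = False) -> str:
    words = len(prompt.split())
    if needs_reasoning or words > 500:
        return "large-70b"    # quality tier
    if words > 100:
        return "medium-13b"   # balanced tier
    return "small-7b"         # fast tier

print(route("Summarize this paragraph."))                              # small-7b
print(route("Prove the following theorem ...", needs_reasoning=True))  # large-70b
```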
### Caching Strategies
| Cache Type | What to Cache | TTL |
| ---------- | ------------- | --- |
| **Prompt cache** | Common system prompts | Long |
| **KV cache** | Prefix tokens | Session |
| **Response cache** | Exact query matches | Varies |
| **Embedding cache** | Document embeddings | Long |
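A response cache for exact query matches is straightforward to sketch; the TTL and key scheme below are illustrative, and in practice the key must include sampling parameters so cached answers do not leak across configurations.

```python
# Toy exact-match response cache with TTL; key scheme and TTL are illustrative.
import hashlib
import time

class ResponseCache:
    def __init__(self, ttl_seconds: float = 300):
        self.ttl = ttl_seconds
        self._store = {}

    def _key(self, prompt: str, params: dict) -> str:
        # Hash the prompt together with sampling parameters.
        return hashlib.sha256(f"{prompt}|{sorted(params.items())}".encode()).hexdigest()

    def get(self, prompt: str, params: dict):
        entry = self._store.get(self._key(prompt, params))
        if entry and time.time() - entry[0] < self.ttl:
            return entry[1]
        return None

    def put(self, prompt: str, params: dict, response: str):
        self._store[self._key(prompt, params)] = (time.time(), response)
```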
## Related Skills
- `ml-system-design` - End-to-end ML pipeline design
- `rag-architecture` - Retrieval-augmented generation patterns
- `vector-databases` - Vector search for LLM context
- `ml-inference-optimization` - General inference optimization
- `estimation-techniques` - Capacity planning for LLM systems
## Version History
- v1.0.0 (2025-12-26): Initial release - LLM serving patterns for systems design interviews
---
## Last Updated
**Date:** 2025-12-26