
ml-inference-optimization

ML inference latency optimization, model compression, distillation, caching strategies, and edge deployment patterns. Use when optimizing inference performance, reducing model size, or deploying ML at the edge.

Packaged view

This page reorganizes the original catalog entry to put fit, installability, and workflow context first. The original raw source lives below.

Stars: 785
Hot score: 99
Updated: March 20, 2026
Overall rating: C (4.8)
Composite score: 4.8
Best-practice grade: B (81.2)

Install command

npx @skill-hub/cli install benchflow-ai-skillsbench-ml-inference-optimization

Repository

benchflow-ai/SkillsBench

Skill path: registry/terminal_bench_2.0/full_batch_reviewed/terminal_bench_2_0_train-fasttext/environment/skills/ml-inference-optimization



Best for

Primary workflow: Analyze Data & AI.

Technical facets: Full Stack, DevOps, Data / AI.

Target audience: everyone.

License: Unknown.

Original source

Catalog source: SkillHub Club.

Repository owner: benchflow-ai.

This is a mirrored public skill entry. Review the repository before installing it into production workflows.

What it helps with

  • Install ml-inference-optimization into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
  • Review https://github.com/benchflow-ai/SkillsBench before adding ml-inference-optimization to shared team environments
  • Use ml-inference-optimization for development workflows

Works across

Claude Code, Codex CLI, Gemini CLI, OpenCode

Favorites: 0.

Sub-skills: 0.

Aggregator: No.

Original source / Raw SKILL.md

---
name: ml-inference-optimization
description: ML inference latency optimization, model compression, distillation, caching strategies, and edge deployment patterns. Use when optimizing inference performance, reducing model size, or deploying ML at the edge.
allowed-tools: Read, Glob, Grep
---

# ML Inference Optimization

## When to Use This Skill

Use this skill when:

- Optimizing ML inference latency
- Reducing model size for deployment
- Implementing model compression techniques
- Designing inference caching strategies
- Deploying models at the edge
- Balancing accuracy vs. latency trade-offs

**Keywords:** inference optimization, latency, model compression, distillation, pruning, quantization, caching, edge ML, TensorRT, ONNX, model serving, batching, hardware acceleration

## Inference Optimization Overview

```text
┌─────────────────────────────────────────────────────────────────────┐
│                 Inference Optimization Stack                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                    Model Level                                │  │
│  │  Distillation │ Pruning │ Quantization │ Architecture Search │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                              │                                      │
│                              ▼                                      │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                   Compiler Level                              │  │
│  │  Graph optimization │ Operator fusion │ Memory planning       │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                              │                                      │
│                              ▼                                      │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                  Runtime Level                                │  │
│  │  Batching │ Caching │ Async execution │ Multi-threading      │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                              │                                      │
│                              ▼                                      │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                  Hardware Level                               │  │
│  │  GPU │ TPU │ NPU │ CPU SIMD │ Custom accelerators            │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

## Model Compression Techniques

### Technique Overview

| Technique | Size Reduction | Speed Improvement | Accuracy Impact |
| --------- | -------------- | ----------------- | --------------- |
| **Quantization** | 2-4x | 2-4x | Low (1-2%) |
| **Pruning** | 2-10x | 1-3x | Low-Medium |
| **Distillation** | 3-10x | 3-10x | Medium |
| **Low-rank factorization** | 2-5x | 1.5-3x | Low-Medium |
| **Weight sharing** | 10-100x | Variable | Medium-High |

### Knowledge Distillation

```text
┌─────────────────────────────────────────────────────────────────────┐
│                    Knowledge Distillation                           │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌──────────────┐                                                   │
│  │ Teacher Model│ (Large, accurate, slow)                          │
│  │   GPT-4      │                                                   │
│  └──────────────┘                                                   │
│         │                                                           │
│         ▼ Soft labels (probability distributions)                   │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                    Training Process                           │  │
│  │  Loss = α × CrossEntropy(student, hard_labels)               │  │
│  │       + (1-α) × KL_Div(student, teacher_soft_labels)         │  │
│  └──────────────────────────────────────────────────────────────┘  │
│         │                                                           │
│         ▼                                                           │
│  ┌──────────────┐                                                   │
│  │Student Model │ (Small, nearly as accurate, fast)                │
│  │  DistilBERT  │                                                   │
│  └──────────────┘                                                   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```
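
A minimal PyTorch sketch of the combined loss in the diagram above; the `alpha` and `temperature` defaults are illustrative and need per-task tuning.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      alpha=0.5, temperature=2.0):
    # Hard-label term: ordinary cross-entropy against ground truth.
    ce = F.cross_entropy(student_logits, hard_labels)
    # Soft-label term: KL divergence between temperature-softened
    # distributions. The T^2 factor keeps gradient magnitudes comparable
    # across temperatures.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * ce + (1 - alpha) * kd
```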

**Distillation Types:**

| Type | Description | Use Case |
| ---- | ----------- | -------- |
| **Response distillation** | Match teacher outputs | General compression |
| **Feature distillation** | Match intermediate layers | Better transfer |
| **Relation distillation** | Match sample relationships | Structured data |
| **Self-distillation** | Model teaches itself | Regularization |

### Pruning Strategies

```text
Unstructured Pruning (Weight-level):
Before: [0.1, 0.8, 0.2, 0.9, 0.05, 0.7]
After:  [0.0, 0.8, 0.0, 0.9, 0.0, 0.7]  (50% sparse)
• Flexible, high sparsity possible
• Needs sparse hardware/libraries

Structured Pruning (Channel/Layer-level):
Before: ┌───┬───┬───┬───┐
        │ C1│ C2│ C3│ C4│
        └───┴───┴───┴───┘
After:  ┌───┬───┬───┐
        │ C1│ C3│ C4│  (Removed C2 entirely)
        └───┴───┴───┘
• Works with standard hardware
• Lower compression ratio
```

**Pruning Decision Criteria:**

| Method | Description | Effectiveness |
| ------ | ----------- | ------------- |
| **Magnitude-based** | Remove smallest weights | Simple, effective |
| **Gradient-based** | Remove low-gradient weights | Better accuracy |
| **Second-order** | Use Hessian information | Best but expensive |
| **Lottery ticket** | Find winning subnetwork | Theoretical insight |
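
A minimal sketch of magnitude-based unstructured pruning using PyTorch's built-in `torch.nn.utils.prune` utilities; the toy model and the 50% sparsity target are illustrative.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

for module in model.modules():
    if isinstance(module, nn.Linear):
        # Zero the 50% of weights with the smallest L1 magnitude, per layer.
        prune.l1_unstructured(module, name="weight", amount=0.5)
        # Fold the binary mask into the weight tensor permanently.
        prune.remove(module, "weight")

# Note: the zeros reduce FLOPs only with sparse-aware kernels or hardware.
```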

### Quantization (Detailed)

```text
Precision Hierarchy:

FP32 (32 bits): ████████████████████████████████
FP16 (16 bits): ████████████████
BF16 (16 bits): ████████████████  (different mantissa/exponent)
INT8 (8 bits):  ████████
INT4 (4 bits):  ████
Binary (1 bit): █

Memory footprint and compute cost scale roughly in proportion to bit width
```

**Quantization Approaches:**

| Approach | When Applied | Quality | Effort |
| -------- | ------------ | ------- | ------ |
| **Dynamic quantization** | Runtime | Good | Low |
| **Static quantization** | Post-training with calibration | Better | Medium |
| **QAT** | During training | Best | High |
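
As an example of the lowest-effort row, dynamic quantization in PyTorch is a single call; the toy model is a placeholder, and this path mainly benefits Linear/LSTM layers on CPU.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10)).eval()

# Weights are stored as INT8; activations are quantized on the fly at
# runtime, so no calibration dataset is needed.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
```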

## Compiler-Level Optimization

### Graph Optimization

```text
Original Graph:
Input → Conv → BatchNorm → ReLU → Conv → BatchNorm → ReLU → Output

Optimized Graph (Operator Fusion):
Input → FusedConvBNReLU → FusedConvBNReLU → Output

Benefits:
• Fewer kernel launches
• Better memory locality
• Reduced memory bandwidth
```

### Common Optimizations

| Optimization | Description | Speedup |
| ------------ | ----------- | ------- |
| **Operator fusion** | Combine sequential ops | 1.2-2x |
| **Constant folding** | Pre-compute constants | 1.1-1.5x |
| **Dead code elimination** | Remove unused ops | Variable |
| **Layout optimization** | Optimize tensor memory layout | 1.1-1.3x |
| **Memory planning** | Optimize buffer allocation | 1.1-1.2x |

### Optimization Frameworks

| Framework | Vendor | Best For |
| --------- | ------ | -------- |
| **TensorRT** | NVIDIA | NVIDIA GPUs, lowest latency |
| **ONNX Runtime** | Microsoft | Cross-platform, broad support |
| **OpenVINO** | Intel | Intel CPUs/GPUs |
| **Core ML** | Apple | Apple devices |
| **TFLite** | Google | Mobile, embedded |
| **Apache TVM** | Open source | Custom hardware, research |
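
These optimizations are usually enabled through framework options rather than written by hand. A minimal ONNX Runtime sketch, assuming a `model.onnx` file is available:

```python
import onnxruntime as ort

opts = ort.SessionOptions()
# Apply all graph-level rewrites: constant folding, operator fusion,
# layout transformations, and redundant-node elimination.
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
# Optionally save the optimized graph to inspect which ops were fused.
opts.optimized_model_filepath = "model.optimized.onnx"

session = ort.InferenceSession(
    "model.onnx", sess_options=opts, providers=["CPUExecutionProvider"]
)
```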

## Runtime Optimization

### Batching Strategies

```text
No Batching:
Request 1: [Process] → Response 1      10ms
Request 2: [Process] → Response 2      10ms
Request 3: [Process] → Response 3      10ms
Total: 30ms, GPU underutilized

Dynamic Batching:
Requests 1-3: [Wait 5ms] → [Process batch] → Responses
Total: 15ms, 2x throughput

Trade-off: Latency vs. Throughput
• Larger batch: Higher throughput, higher latency
• Smaller batch: Lower latency, lower throughput
```

**Batching Parameters:**

| Parameter | Description | Trade-off |
| --------- | ----------- | --------- |
| `batch_size` | Maximum batch size | Throughput vs. latency |
| `max_wait_time` | Wait time for batch fill | Latency vs. efficiency |
| `min_batch_size` | Minimum before processing | Latency predictability |
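
A minimal asyncio sketch of dynamic batching using the parameters above; `predict_batch` is a stand-in for a real batched forward pass, and the defaults are illustrative.

```python
import asyncio

class DynamicBatcher:
    """Collects requests until the batch is full or the wait window closes."""

    def __init__(self, predict_batch, max_batch_size=8, max_wait_s=0.005):
        self.predict_batch = predict_batch  # stand-in for a batched forward pass
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.queue = asyncio.Queue()

    async def infer(self, x):
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((x, fut))
        return await fut                     # resolved when the batch runs

    async def run(self):
        while True:
            items = [await self.queue.get()]  # block until the first request
            deadline = asyncio.get_running_loop().time() + self.max_wait_s
            # Top up the batch until it is full or the wait budget is spent.
            while len(items) < self.max_batch_size:
                remaining = deadline - asyncio.get_running_loop().time()
                if remaining <= 0:
                    break
                try:
                    items.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            outputs = self.predict_batch([x for x, _ in items])  # one launch
            for (_, fut), y in zip(items, outputs):
                fut.set_result(y)
```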

### Caching Strategies

```text
┌─────────────────────────────────────────────────────────────────────┐
│                    Inference Caching Layers                         │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Layer 1: Input Cache                                               │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ Cache exact inputs → Return cached outputs                   │   │
│  │ Hit rate: Low (inputs rarely repeat exactly)                 │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  Layer 2: Embedding Cache                                           │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ Cache computed embeddings for repeated tokens/entities       │   │
│  │ Hit rate: Medium (common tokens repeat)                      │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  Layer 3: KV Cache (for transformers)                               │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ Cache key-value pairs for attention                          │   │
│  │ Hit rate: High (reuse across tokens in sequence)             │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│  Layer 4: Result Cache                                              │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │ Cache semantic equivalents (fuzzy matching)                  │   │
│  │ Hit rate: Variable (depends on query distribution)           │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

**Semantic Caching for LLMs:**

```text
Query: "What's the capital of France?"
       ↓
Hash + Embed query
       ↓
Search cache (similarity > threshold)
       ↓
├── Hit: Return cached response
└── Miss: Generate → Cache → Return
```
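
A minimal sketch of that flow, assuming an `embed` callable (e.g., a sentence encoder) that returns a NumPy vector; the 0.92 threshold is illustrative and should be tuned against the false-hit rate.

```python
import numpy as np

class SemanticCache:
    def __init__(self, embed, threshold=0.92):
        self.embed = embed          # stand-in for an embedding model
        self.threshold = threshold
        self.keys = []              # unit-normalized query embeddings
        self.values = []            # cached responses

    def get(self, query):
        if not self.keys:
            return None
        q = self.embed(query)
        q = q / np.linalg.norm(q)
        # Linear scan for clarity; production caches use an ANN index.
        sims = np.stack(self.keys) @ q   # cosine similarity
        best = int(np.argmax(sims))
        return self.values[best] if sims[best] >= self.threshold else None

    def put(self, query, response):
        q = self.embed(query)
        self.keys.append(q / np.linalg.norm(q))
        self.values.append(response)
```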

### Async and Parallel Execution

```text
Sequential:
┌─────┐ ┌─────┐ ┌─────┐
│Prep │→│Model│→│Post │  Total: 30ms
│10ms │ │15ms │ │5ms  │
└─────┘ └─────┘ └─────┘

Pipelined:
Request 1: │Prep│Model│Post│
Request 2:      │Prep│Model│Post│
Request 3:           │Prep│Model│Post│

Throughput: 3x higher
Latency per request: Same
```
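
A minimal sketch of that pipeline with one single-worker pool per stage; `preprocess`, `run_model`, and `postprocess` are stand-ins for real stages.

```python
from concurrent.futures import ThreadPoolExecutor

# One single-worker pool per stage, so different requests can occupy
# prep, model, and post at the same time.
prep_pool = ThreadPoolExecutor(max_workers=1)
model_pool = ThreadPoolExecutor(max_workers=1)
post_pool = ThreadPoolExecutor(max_workers=1)

def submit(request, preprocess, run_model, postprocess):
    # Each stage waits on the previous stage's future, so request N+1
    # can be in prep while request N is in the model stage.
    prepped = prep_pool.submit(preprocess, request)
    scored = model_pool.submit(lambda: run_model(prepped.result()))
    return post_pool.submit(lambda: postprocess(scored.result()))
```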

## Hardware Acceleration

### Hardware Comparison

| Hardware | Strengths | Limitations | Best For |
| -------- | --------- | ----------- | -------- |
| **GPU (NVIDIA)** | High parallelism, mature ecosystem | Power, cost | Training, large batch inference |
| **TPU (Google)** | Matrix ops, cloud integration | Vendor lock-in | Google Cloud workloads |
| **NPU (Apple/Qualcomm)** | Power efficient, on-device | Limited models | Mobile, edge |
| **CPU** | Flexible, available | Slower for ML | Low-batch, CPU-bound |
| **FPGA** | Customizable, low latency | Development complexity | Specialized workloads |

### GPU Optimization

| Optimization | Description | Impact |
| ------------ | ----------- | ------ |
| **Tensor Cores** | Use FP16/INT8 tensor operations | 2-8x speedup |
| **CUDA graphs** | Reduce kernel launch overhead | 1.5-2x for small models |
| **Multi-stream** | Parallel execution | Higher throughput |
| **Memory pooling** | Reduce allocation overhead | Lower latency variance |
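
As one example from the first row, PyTorch's autocast routes eligible ops through FP16 Tensor Core kernels on supporting NVIDIA GPUs; this sketch assumes a CUDA-resident model and input.

```python
import torch

@torch.inference_mode()  # disable autograd bookkeeping for inference
def infer_fp16(model, x):
    # Eligible ops (matmuls, convolutions) run in FP16 on Tensor Cores;
    # numerically sensitive ops are kept in FP32 automatically.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        return model(x)
```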

## Edge Deployment

### Edge Constraints

```text
┌─────────────────────────────────────────────────────────────────────┐
│                      Edge Deployment Constraints                    │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Resource Constraints:                                              │
│  ├── Memory: 1-4 GB (vs. 64+ GB cloud)                             │
│  ├── Compute: 1-10 TOPS (vs. 100+ TFLOPS cloud)                    │
│  ├── Power: 5-15W (vs. 300W+ cloud)                                │
│  └── Storage: 16-128 GB (vs. TB cloud)                             │
│                                                                     │
│  Operational Constraints:                                           │
│  ├── No network (offline operation)                                 │
│  ├── Variable ambient conditions                                    │
│  ├── Infrequent updates                                            │
│  └── Long deployment lifetime                                       │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

### Edge Optimization Strategies

| Strategy | Description | Use When |
| -------- | ----------- | -------- |
| **Model selection** | Use edge-native models (MobileNet, EfficientNet) | Accuracy acceptable |
| **Aggressive quantization** | INT8 or lower | Memory/power constrained |
| **On-device distillation** | Distill to tiny model | Extreme constraints |
| **Split inference** | Edge preprocessing, cloud inference | Network available |
| **Model caching** | Cache results locally | Repeated queries |

### Edge ML Frameworks

| Framework | Platform | Features |
| --------- | -------- | -------- |
| **TensorFlow Lite** | Android, iOS, embedded | Quantization, delegates |
| **Core ML** | iOS, macOS | Neural Engine optimization |
| **ONNX Runtime Mobile** | Cross-platform | Broad model support |
| **PyTorch Mobile** | Android, iOS | Familiar API |
| **TensorRT** | NVIDIA Jetson | Maximum performance |
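
A sketch of post-training INT8 quantization with TensorFlow Lite, combining the "aggressive quantization" strategy with the TFLite row above; the SavedModel path, input shape, and calibration count are placeholders.

```python
import numpy as np
import tensorflow as tf

def rep_dataset():
    # Calibration inputs: ~100 samples representative of production traffic.
    # The (1, 224, 224, 3) shape is a placeholder for an image model.
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = rep_dataset
# Restrict to the integer-only op set so the model runs on INT8 accelerators.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("model_int8.tflite", "wb") as f:
    f.write(converter.convert())
```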

## Latency Profiling

### Profiling Methodology

```text
┌─────────────────────────────────────────────────────────────────────┐
│                    Latency Breakdown Analysis                       │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  1. Data Loading:          ████████░░░░░░░░░░  15%                 │
│  2. Preprocessing:         ██████░░░░░░░░░░░░  10%                 │
│  3. Model Inference:       ████████████████░░  60%                 │
│  4. Postprocessing:        ████░░░░░░░░░░░░░░   8%                 │
│  5. Response Serialization:███░░░░░░░░░░░░░░░   7%                 │
│                                                                     │
│  Target: Model inference (60% = biggest optimization opportunity)  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

### Profiling Tools

| Tool | Use For |
| ---- | ------- |
| **PyTorch Profiler** | PyTorch model profiling |
| **TensorBoard** | TensorFlow visualization |
| **NVIDIA Nsight** | GPU profiling |
| **Chrome Tracing** | General timeline visualization |
| **perf** | CPU profiling |

### Key Metrics

| Metric | Description | Target |
| ------ | ----------- | ------ |
| **P50 latency** | Median latency | < SLA |
| **P99 latency** | Tail latency | < 2x P50 |
| **Throughput** | Requests/second | Meet demand |
| **GPU utilization** | Compute usage | > 80% |
| **Memory bandwidth** | Memory usage | < limit |
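
A minimal harness for measuring the P50/P99 rows; `infer` is a stand-in for the full model call, and the warmup/run counts are illustrative. For GPU models, synchronize the device inside the timed region.

```python
import time
import numpy as np

def profile_latency(infer, sample_input, warmup=10, runs=200):
    for _ in range(warmup):            # warm caches, JIT, allocator, etc.
        infer(sample_input)
    timings_ms = []
    for _ in range(runs):
        t0 = time.perf_counter()
        infer(sample_input)            # synchronize here for async devices
        timings_ms.append((time.perf_counter() - t0) * 1000)
    p50, p99 = np.percentile(timings_ms, [50, 99])
    print(f"P50: {p50:.2f} ms   P99: {p99:.2f} ms")
    return p50, p99
```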

## Optimization Workflow

### Systematic Approach

```text
┌─────────────────────────────────────────────────────────────────────┐
│                  Optimization Workflow                              │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  1. Baseline                                                        │
│     └── Measure current performance (latency, throughput, accuracy) │
│                                                                     │
│  2. Profile                                                         │
│     └── Identify bottlenecks (model, data, system)                  │
│                                                                     │
│  3. Optimize (in order of effort/impact):                           │
│     ├── Hardware: Use right accelerator                             │
│     ├── Compiler: Enable optimizations (TensorRT, ONNX)            │
│     ├── Runtime: Batching, caching, async                          │
│     ├── Model: Quantization, pruning                                │
│     └── Architecture: Distillation, model change                    │
│                                                                     │
│  4. Validate                                                        │
│     └── Verify accuracy maintained, latency improved                │
│                                                                     │
│  5. Deploy and Monitor                                              │
│     └── Track real-world performance                                │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

### Optimization Priority Matrix

```text
                    High Impact
                         │
    Compiler Opts    ────┼──── Quantization
    (easy win)           │     (best ROI)
                         │
Low Effort ──────────────┼──────────────── High Effort
                         │
    Batching         ────┼──── Distillation
    (quick win)          │     (major effort)
                         │
                    Low Impact
```

## Common Patterns

### Multi-Model Serving

```text
┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│  Request → ┌─────────┐                                              │
│            │ Router  │                                              │
│            └─────────┘                                              │
│               │   │   │                                             │
│      ┌────────┘   │   └────────┐                                    │
│      ▼            ▼            ▼                                    │
│  ┌───────┐   ┌───────┐   ┌───────┐                                 │
│  │ Tiny  │   │ Small │   │ Large │                                 │
│  │ <10ms │   │ <50ms │   │<500ms │                                 │
│  └───────┘   └───────┘   └───────┘                                 │
│                                                                     │
│  Routing strategies:                                                │
│  • Complexity-based: Simple→Tiny, Complex→Large                    │
│  • Confidence-based: Try Tiny, escalate if low confidence          │
│  • SLA-based: Route based on latency requirements                  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```
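
A minimal sketch of the confidence-based strategy; the three model callables (each returning a `(prediction, confidence)` pair) and the thresholds are hypothetical and should be tuned on held-out traffic.

```python
def route(request, tiny, small, large, tau_tiny=0.90, tau_small=0.80):
    pred, conf = tiny(request)    # <10 ms path handles easy inputs
    if conf >= tau_tiny:
        return pred
    pred, conf = small(request)   # escalate only when the tiny model is unsure
    if conf >= tau_small:
        return pred
    pred, _ = large(request)      # <500 ms fallback for the hardest inputs
    return pred
```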

### Speculative Execution

```text
Query: "Translate: Hello"
        │
        ├──▶ Small model (draft): "Bonjour" (5ms)
        │
        └──▶ Large model (verify): Check "Bonjour" (10ms parallel)
             │
             ├── Accept: Return immediately
             └── Reject: Generate with large model

Speedup: 2-3x when drafts are often accepted
```
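
A deliberately simplified sketch of the draft-then-verify pattern; all three callables are hypothetical stand-ins, and real speculative decoding drafts and verifies a few tokens at a time rather than whole responses.

```python
def speculative_generate(prompt, draft_generate, verify_accepts, full_generate):
    draft = draft_generate(prompt)      # fast small-model draft, possibly wrong
    if verify_accepts(prompt, draft):   # one parallel large-model check
        return draft                    # accepted: skip the slow decode entirely
    return full_generate(prompt)        # rejected: full large-model generation
```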

### Cascade Models

```text
Input → ┌────────┐
        │ Filter │ ← Cheap filter (reject obvious negatives)
        └────────┘
             │ (candidates only)
             ▼
        ┌────────┐
        │ Stage 1│ ← Fast model (coarse ranking)
        └────────┘
             │ (top-100)
             ▼
        ┌────────┐
        │ Stage 2│ ← Accurate model (fine ranking)
        └────────┘
             │ (top-10)
             ▼
         Output

Benefit: 10x cheaper, similar accuracy
```
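
A minimal sketch of the cascade; the filter and scorer callables are hypothetical, and the top-100/top-10 cutoffs mirror the diagram.

```python
def cascade_rank(items, cheap_filter, coarse_score, fine_score):
    # Stage 0: cheap filter drops obvious negatives before any model runs.
    candidates = [x for x in items if cheap_filter(x)]
    # Stage 1: fast model produces a coarse ranking; keep the top 100.
    top100 = sorted(candidates, key=coarse_score, reverse=True)[:100]
    # Stage 2: the expensive model re-ranks only those 100; keep the top 10.
    return sorted(top100, key=fine_score, reverse=True)[:10]
```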

## Optimization Checklist

### Pre-Deployment

- [ ] Profile baseline performance
- [ ] Identify primary bottleneck (model, data, system)
- [ ] Apply compiler optimizations (TensorRT, ONNX)
- [ ] Evaluate quantization (INT8 usually safe)
- [ ] Tune batch size for target throughput
- [ ] Test accuracy after optimization

### Deployment

- [ ] Configure appropriate hardware
- [ ] Enable caching where applicable
- [ ] Set up monitoring (latency, throughput, errors)
- [ ] Configure auto-scaling policies
- [ ] Implement graceful degradation

### Post-Deployment

- [ ] Monitor p99 latency
- [ ] Track accuracy metrics
- [ ] Analyze cache hit rates
- [ ] Review cost efficiency
- [ ] Plan iterative improvements

## Related Skills

- `llm-serving-patterns` - LLM-specific serving optimization
- `ml-system-design` - End-to-end ML pipeline design
- `quality-attributes-taxonomy` - Performance as quality attribute
- `estimation-techniques` - Capacity planning for ML systems

## Version History

- v1.0.0 (2025-12-26): Initial release - ML inference optimization patterns

---

## Last Updated

**Date:** 2025-12-26