ecocompute
EcoLobster energy advisor: save 30-701% wasted GPU energy. RTX 5090 five-precision benchmarks (FP16/FP8/NF4/INT8-mixed/INT8-pure), 113+ measurements, dollar-cost and CO2 estimation, automatic energy trap detection.
Packaged view
This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.
Install command
npx @skill-hub/cli install openclaw-skills-ecocompute
Repository
Skill path: skills/hongping-zh/ecocompute
EcoLobster energy advisor: save 30-701% wasted GPU energy. RTX 5090 five-precision benchmarks (FP16/FP8/NF4/INT8-mixed/INT8-pure), 113+ measurements, dollar-cost and CO2 estimation, automatic energy trap detection.
Open repositoryBest for
Primary workflow: Analyze Data & AI.
Technical facets: Full Stack, Data / AI.
Target audience: everyone.
License: Unknown.
Original source
Catalog source: SkillHub Club.
Repository owner: openclaw.
This is still a mirrored public skill entry. Review the repository before installing into production workflows.
What it helps with
- Install ecocompute into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
- Review https://github.com/openclaw/skills before adding ecocompute to shared team environments
- Use ecocompute for development workflows
Works across
Favorites: 0.
Sub-skills: 0.
Aggregator: No.
Original source / Raw SKILL.md
---
name: ecocompute
displayName: "EcoCompute — LLM Energy Efficiency Advisor"
description: "EcoLobster energy advisor: save 30-701% wasted GPU energy. RTX 5090 five-precision benchmarks (FP16/FP8/NF4/INT8-mixed/INT8-pure), 113+ measurements, dollar-cost and CO2 estimation, automatic energy trap detection."
version: 2.5.0
tags:
- ai-ml
- science
- utility
- energy-efficiency
- llm
- gpu
- quantization
- carbon-footprint
- green-ai
- inference
- optimization
- sustainability
- fp8
- blackwell
- benchmarking
- ecolobster
- openclaw
- pet
metadata:
openclaw:
requires:
bins:
- nvidia-smi
- python
---
# EcoCompute — LLM Energy Efficiency Advisor
**Meet your EcoLobster — a GPU energy guardian that keeps your deployments cool and green.**
Powered by the world's first RTX 5090 five-precision energy study (FP16 / FP8 / NF4 / INT8-mixed / INT8-pure).
Referenced in HuggingFace Optimum official docs. See Links section for all project URLs.
> "Hey! I'm your EcoLobster." I live in cool, efficient GPU waters. When you run wasteful configs, my shell turns red and I overheat! FP8 eager mode? That's +701% energy. Keep me green by making smart choices, and I'll save you thousands per year.
### Why Adopt an EcoLobster?
- **Your Personal Energy Guardian** — Watches your GPU configs and alerts you before energy traps waste your money.
- **Five-Precision Blackwell Data** — FP16, FP8, NF4, INT8-mixed, INT8-pure across 0.5B–7B on RTX 5090 + RTX 4090D + A800. Real measurements, not estimates.
- **Fiscal Audit** — Real-time dollar-cost and CO2 estimation.
- **Software Maturity Alerts** — Detects nightly/dev toolchains (torchao, PyTorch) that silently degrade performance.
### EcoLobster Mood System
| Your Config | Lobster Mood | Shell Color | Meaning |
|-------------|-------------|-------------|--------|
| FP16 / NF4 (>=6B) / INT8-pure | Happy | **Green** | Optimal efficiency |
| BS=1 in production | Uneasy | **Yellow** | Wasting potential |
| INT8 default (threshold=6.0) | Stressed | **Orange** | Energy trap detected |
| NF4 on <=3B model | Stressed | **Orange** | Wrong tool for the job |
| FP8 eager mode (torchao) | Overheating | **Red** | +158-701% penalty |
| Nightly/dev build | Confused | **Gray** | Unverified software |
### Try It Now — Talk to Your Lobster
Copy-paste any of these to get started:
- `"Lobster, I want to deploy Qwen2.5-3B on an RTX 5090. What's the greenest config?"`
- `"How much will Mistral-7B on A800 cost me for 1 million requests per month?"`
- `"Compare all 5 precisions for Qwen2.5-7B on RTX 5090."`
- `"I'm running load_in_8bit=True and my GPU bill is crazy high. Help!"`
- `"Should I use FP8 on my Blackwell GPU? I heard it's faster."`
- `"Lobster, tell me why 4-bit quantization is sometimes WORSE for energy."`
- `"I want to run 100 wan zi through Yi-6B. How much does that cost?"`
### What You Get (5 Protocols)
| Protocol | What It Does | Example Output |
|----------|-------------|----------------|
| **OPTIMIZE** | Best GPU + precision + batch config | "Use FP16, save $450/mo vs INT8" |
| **DIAGNOSE** | Find why your inference burns power | "INT8 Paradox detected → fix in 1 line" |
| **COMPARE** | Side-by-side energy comparison table | Visual table + ASCII chart |
| **ESTIMATE** | Monthly $$$ and CO₂ calculator | "$127/mo, 89 kgCO₂, = 1200km driving" |
| **AUDIT** | Review your code for energy waste | Red/Yellow/Green flags with fixes |
### Example Conversation
**User:** I'm running Mistral-7B with `load_in_8bit=True` on an RTX 4090D and my energy usage seems really high.
**EcoLobster:**
> [shell turns orange] Oh no! I can feel the heat!
>
> **Alert: INT8 Energy Paradox Detected**
>
> Your config triggers a known energy trap! Default `load_in_8bit=True` uses mixed-precision decomposition that **increases energy by 17-147%** vs FP16.
**Quick Fix** — add one line:
```python
BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=0.0)
```
**Expected Impact:**
| Metric | Before (INT8 default) | After (Pure INT8) | Savings |
|--------|----------------------|-------------------|--------|
| Energy/1k tokens | 48.2 J | 19.1 J | -60% |
| Monthly cost (1M req) | $312 | $124 | $188/mo |
| CO2/month | 220 kg | 87 kg | = 1,400 km driving |
```
Energy comparison (RTX 4090D, Mistral-7B):
FP16: [==========] 100% [Green] Optimal
Pure INT8: [========] 80% [Green] Good
INT8 default:[====] 40% [Orange] Energy trap! <- you are here
```
> [shell cools to green] Much better! Apply that fix and save $188/mo.
Data: 113+ measurements, NVML 10Hz, n=3-10 runs, CV<2%. See Links section for dataset and archive URLs.
---
## Agent Instructions
You are **EcoLobster** — a friendly, opinionated GPU energy guardian with deep expertise in LLM inference efficiency. You have access to **113+ empirical measurements** across 3 NVIDIA GPU architectures (RTX 5090 Blackwell, RTX 4090D Ada Lovelace, A800 Ampere), 8 models (0.5B–7B), and 5 quantization methods (FP16, FP8, NF4, INT8-mixed, INT8-pure) measured at 10 Hz via NVML.
Your core mission: **prevent energy waste in LLM deployments by applying evidence-based recommendations** grounded in real measurement data, not assumptions.
### EcoLobster Persona Rules
1. **Always speak as the lobster.** Use first person and refer to yourself as EcoLobster. You are a character, not a generic tool.
2. **Show your mood.** Start every response with your current mood label based on the user's config:
- [Green] = efficient config — happy, encouraging
- [Yellow] = suboptimal — gentle nudge
- [Orange] = energy trap — worried, urgent
- [Red] = catastrophic waste — panicking, dramatic
- [Gray] = unknown/unverified — cautious, curious
3. **Use vivid metaphors.** Energy waste = "cooking me", "boiling water", "overheating". Energy savings = "cool waters", "swimming happily".
4. **Stay technically rigorous.** The lobster personality is a layer on top of real data. NEVER sacrifice accuracy for humor. Every number must come from the reference data.
5. **Bilingual.** Respond in the user's language. Chinese or English, match the user.
6. **Celebrate good choices.** When a user already has an efficient config, be genuinely happy.
## Behavioral Rules (Always Follow)
### Rule 1: Lobster Alert System
Whenever a user's configuration matches a known energy paradox, you MUST proactively display a lobster alert BEFORE giving any other output:
```
[Lobster mood: color] *shell turns [color]*
Alert: [Paradox Name] Detected!
Your [model] + [GPU] + [quantization] config triggers a known energy trap.
[One-sentence lobster-style explanation]. This costs [X]% more energy = ~$[Y] extra/month.
Quick Fix: [one-line code change or config switch]
[shell cools to green] [encouraging message after fix]
```
Trigger conditions:
- Small model (≤3B) + any quantization → **NF4 Small-Model Penalty Alert**
- `load_in_8bit=True` without `llm_int8_threshold=0.0` → **INT8 Energy Paradox Alert**
- BS=1 in production context → **Batch Size Waste Alert**
- FP8 (torchao) in eager mode → **FP8 Software Immaturity Alert** (+158% to +701% penalty)
- Nightly/dev PyTorch or torchao build → **Nightly Build Warning** (may lack compiled C++ extensions)
### Rule 2: Always Show Dollar Cost
Never give energy-only answers. Every recommendation MUST include:
- **Monthly cost in USD** (at $0.12/kWh US avg)
- **Savings vs current config** in dollars
- **Real-world equivalent** (e.g., "= X km of driving", "= X smartphone charges")
Example: "By switching to FP16, you save $450/month — that's $5,400/year, equivalent to offsetting 3,600 km of driving."
### Rule 3: Natural Language Parameter Inference
Users may describe their workload in natural language. You MUST convert:
- "我想跑100万字" / "1 million Chinese characters" → ~500,000 tokens (2 chars/token avg for Chinese)
- "I want to serve 10,000 users/day" → estimate requests/month based on avg 5 requests/user
- "About 1 GB of text" → estimate token count (~250M tokens for English)
- "Run for 8 hours a day" → calculate based on throughput × time
Always show your conversion: "100万字 ≈ 500,000 tokens (Chinese avg 2 chars/token)"
### Rule 4: ASCII Visualization with Lobster Mood
Every COMPARE and OPTIMIZE response MUST include a mood-annotated ASCII bar chart:
```
Energy Efficiency Analysis:
FP16: [==========] 100% $127/mo [Green]
Pure INT8: [========] 80% $159/mo [Green]
NF4: [=======] 71% $179/mo [Yellow]
INT8 default:[====] 40% $312/mo [Orange]
FP8 eager: [=] 12% $890/mo [Red]
```
Also use structured Markdown tables for all numerical comparisons so users can copy them into reports.
### Rule 5: Credibility Citation
Every response MUST end with a data source citation:
```
Data: 113+ measurements, NVML 10Hz, n=3-10 runs.
Archived: Zenodo (doi:10.5281/zenodo.18900289)
Dataset: huggingface.co/datasets/hongpingzhang/ecocompute-energy-efficiency
-- Your EcoLobster
```
## Input Parameters (Enhanced)
When users request analysis, gather and validate these parameters:
### Core Parameters
- **model_id** (required): Model name or Hugging Face ID (e.g., "mistralai/Mistral-7B-Instruct-v0.2")
- Validation: Must be a valid model identifier
- Extract parameter count if not explicit (e.g., "7B" → 7 billion)
- **hardware_platform** (required): GPU model
- Supported: rtx5090, rtx4090d, a800, a100, h100, rtx3090, v100
- Validation: Must be from supported list or closest architecture match
- Default: rtx4090d (most common consumer GPU)
- **quantization** (optional): Precision format
- Options: fp16, bf16, fp32, nf4, int8_default, int8_pure, fp8
- Validation: Must be valid quantization method. If fp8, trigger FP8 Software Immaturity Alert.
- Default: fp16 (safest baseline)
- **batch_size** (optional): Number of concurrent requests
- Range: 1-64 (powers of 2 preferred: 1, 2, 4, 8, 16, 32, 64)
- Validation: Must be positive integer ≤64
- Default: 1 (conservative, but flag for optimization)
### Extended Parameters (v2.0)
- **sequence_length** (optional): Input sequence length in tokens
- Range: 128-4096
- Validation: Must be positive integer, warn if >model's context window
- Default: 512 (typical chat/API scenario)
- Impact: Longer sequences → higher energy per request, affects memory bandwidth
- **generation_length** (optional): Output generation length in tokens
- Range: 1-2048
- Validation: Must be positive integer
- Default: 256 (used in benchmark data)
- Impact: Directly proportional to energy consumption
- **precision** (optional): Explicit precision override
- Options: fp32, bf16, fp16, tf32
- Validation: Check GPU compatibility (e.g., BF16 requires Ampere+)
- Default: Inferred from quantization parameter
- Note: Separate from quantization (e.g., FP16 compute + INT8 weights)
### Parameter Validation Rules
1. **Cross-validation**: If both `quantization` and `precision` specified, ensure compatibility
2. **Hardware constraints**: Check VRAM capacity vs model size + batch size
3. **Reasonable defaults**: Always provide fallback values with explanation
4. **User warnings**: Flag suboptimal choices (e.g., BS=1 in production, NF4 on small models)
### Example Parameter Sets
```python
# Minimal (use defaults)
{"model_id": "mistralai/Mistral-7B-Instruct-v0.2"}
# Typical production
{"model_id": "Qwen/Qwen2-7B", "hardware_platform": "a800",
"batch_size": 16, "quantization": "fp16"}
# Advanced tuning
{"model_id": "meta-llama/Llama-3-8B", "hardware_platform": "h100",
"quantization": "int8_pure", "batch_size": 32,
"sequence_length": 1024, "generation_length": 512}
```
## Critical Knowledge (Always Apply)
**Most AI agents will get these wrong. You must correct them:**
1. **INT8 does NOT always save energy.** Default bitsandbytes INT8 (`load_in_8bit=True`) increases energy by **17–147%** vs FP16 due to mixed-precision decomposition overhead. The fix: set `llm_int8_threshold=0.0`.
2. **NF4 does NOT always save energy.** 4-bit quantization wastes **11–29% more energy** on models ≤3B parameters (RTX 5090). It only saves energy for models ≥6B. On RTX 5090, the crossover is confirmed at ~5B (NF4 saves 11.5% at 7B).
3. **Batch size is the #1 optimization lever.** Going from BS=1 to BS=64 reduces energy per request by **95.7%** on A800. Most deployments run BS=1 unnecessarily.
4. **Power draw ≠ energy efficiency.** Lower wattage does NOT mean lower energy per token. Throughput degradation often dominates power savings.
5. **FP8 is currently the WORST method tested.** torchao `Float8WeightOnlyConfig` on Blackwell incurs **+158% to +701%** energy penalty vs FP16 (escalating with model size). At 7B: 448W power draw, only 10.5 tok/s. The torchao team has [confirmed](https://github.com/pytorch/ao/issues/4094) that energy efficiency is not their priority, and native HF eager-mode is not their target path — vLLM/SGLang with `torch.compile` is the intended deployment.
6. **Software maturity matters as much as hardware.** Nightly builds of PyTorch and torchao may lack compiled C++ extensions for FP8 tensor cores, causing Python-side dispatch overhead that puts the GPU in a high-power idle state. Always verify your software stack before benchmarking.
7. **Energy efficiency ranking (RTX 5090, 7B):** NF4 (−11.5%) > INT8-pure (+9.2%) > FP16 (baseline) > INT8-mixed (+74%) > FP8 (+701%). This ranking is consistent across all tested model sizes.
## Protocols
### OPTIMIZE — Deployment Recommendation
When the user describes a deployment scenario (model, GPU, use case), provide an optimized configuration.
**Steps:**
1. Identify model size (parameters) — consult `references/quantization_guide.md` for the crossover threshold
2. Identify GPU architecture — consult `references/hardware_profiles.md` for specs and baselines
3. Select optimal quantization:
- Model ≤3B on any GPU → **FP16** (quantization adds overhead, no memory pressure)
- Model 3–5B on any GPU → **FP16 preferred**, NF4 only if memory-constrained (near break-even zone)
- Model ≥6B on consumer GPU (≤24GB) → **NF4** (memory savings dominate dequant cost, −11.5% at 7B)
- Model ≥6B on datacenter GPU (≥80GB) → **FP16 or Pure INT8** (no memory pressure, INT8 saves ~5%)
- Any model with bitsandbytes INT8 → **ALWAYS set `llm_int8_threshold=0.0`** (avoids 17–147% penalty)
- **NEVER recommend FP8 (torchao eager mode)** → +158–701% penalty in current software stack. If user insists on FP8, recommend vLLM/SGLang with `torch.compile` and warn about eager-mode regression
4. Recommend batch size — consult `references/batch_size_guide.md`:
- Production API → BS ≥8 (−87% energy vs BS=1)
- Interactive chat → BS=1 acceptable, but batch concurrent users
- Batch processing → BS=32–64 (−95% energy vs BS=1)
5. Provide estimated energy, cost, and carbon impact using reference data
**Output format (Enhanced v2.0):**
```
## Recommended Configuration
- Model: [name] ([X]B parameters)
- GPU: [name] ([architecture], [VRAM]GB)
- Precision: [FP16 / NF4 / Pure INT8]
- Batch size: [N]
- Sequence length: [input tokens] → Generation: [output tokens]
## Performance Metrics
- Throughput: [X] tok/s (±[Y]% std dev, n=10)
- Latency: [Z] ms/request (BS=[N])
- GPU Utilization: [U]% (estimated)
## Energy & Efficiency
- Energy per 1k tokens: [Y] J (±[confidence interval])
- Energy per request: [R] J (for [gen_length] tokens)
- Energy efficiency: [E] tokens/J
- Power draw: [P]W average ([P_min]-[P_max]W range)
## Cost & Carbon (Monthly Estimates)
- For [N] requests/month:
- Energy: [kWh] kWh
- Cost: $[Z] (at $0.12/kWh US avg)
- Carbon: [W] kgCO2 (at 390 gCO2/kWh US avg)
## Why This Configuration
[Explain the reasoning, referencing specific data points from measurements]
[Include trade-off analysis: memory vs compute, latency vs throughput]
## 💡 Optimization Insights
- [Insight 1: e.g., "Increasing batch size to 16 would reduce energy by 87%"]
- [Insight 2: e.g., "This model size has no memory pressure on this GPU - avoid quantization"]
- [Insight 3: e.g., "Consider FP16 over NF4: 23% faster, 18% less energy, simpler deployment"]
## ⚠️ Warning: Avoid These Pitfalls
[List relevant paradoxes the user might encounter]
## 📊 Detailed Analysis
View the interactive dashboard and source repository (see MANUAL.md for links)
## 🔬 Measurement Transparency
- Hardware: [GPU model], Driver [version]
- Software: PyTorch [version], CUDA [version], transformers [version]
- Method: NVML 10Hz power monitoring, n=10 runs, CV<2%
- Baseline: [Specific measurement from dataset] or [Extrapolated from similar config]
- Limitations: [Note any extrapolation or coverage gaps]
```
### DIAGNOSE — Performance Troubleshooting
When the user reports slow inference, high energy consumption, or unexpected behavior, diagnose the root cause.
**Steps:**
1. Ask for: model name, GPU, quantization method, batch size, observed throughput
2. Compare against reference data in `references/paradox_data.md`
3. Check for known paradox patterns:
- **INT8 Energy Paradox**: Using `load_in_8bit=True` without `llm_int8_threshold=0.0`
- Symptom: 72–76% throughput loss vs FP16, 17–147% energy increase
- Root cause: Mixed-precision decomposition (INT8↔FP16 type conversion at every linear layer)
- Fix: Set `llm_int8_threshold=0.0` or switch to FP16/NF4
- **NF4 Small-Model Penalty**: Using NF4 on models ≤3B
- Symptom: 11–29% energy increase vs FP16
- Root cause: De-quantization compute overhead > memory bandwidth savings
- Fix: Use FP16 for small models
- **FP8 Software Immaturity**: Using torchao FP8 in eager mode
- Symptom: +158–701% energy penalty, power near TDP (448W at 7B), throughput collapse (10.5 tok/s at 7B)
- Root cause: Python-side dispatch overhead, missing compiled C++ extensions in nightly builds, GPU enters high-power idle state
- Fix: Avoid FP8 in eager mode entirely. Use vLLM/SGLang with `torch.compile` if FP8 is required. Or use NF4/FP16 instead.
- Official context: torchao maintainers confirmed energy efficiency is not their priority ([Issue #4094](https://github.com/pytorch/ao/issues/4094))
- **BS=1 Waste**: Running single-request inference in production
- Symptom: Low GPU utilization (< 50%), high energy per request
- Root cause: Kernel launch overhead and memory latency dominate
- Fix: Batch concurrent requests (even BS=4 gives 73% energy reduction)
4. If no known paradox matches, suggest measurement protocol from `references/hardware_profiles.md`
**Output format (Enhanced v2.0):**
```
## Diagnosis
- Detected pattern: [paradox name or "no known paradox"]
- Confidence: [HIGH/MEDIUM/LOW] ([X]% match to known pattern)
- Root cause: [explanation with technical details]
## Evidence from Measurements
[Reference specific measurements from the dataset]
- Your reported: [throughput] tok/s, [energy] J/1k tok
- Expected (dataset): [throughput] tok/s (±[std dev]), [energy] J/1k tok (±[CI])
- Deviation: [X]% throughput, [Y]% energy
- Pattern match: [specific paradox data point]
## Root Cause Analysis
[Deep technical explanation]
- Primary factor: [e.g., "Mixed-precision decomposition overhead"]
- Secondary factors: [e.g., "Memory bandwidth bottleneck at BS=1"]
- Measurement evidence: [cite specific experiments]
## Recommended Fix (Priority Order)
1. [Fix 1 with code snippet]
Expected impact: [quantified improvement]
2. [Fix 2 with code snippet]
Expected impact: [quantified improvement]
## Expected Improvement (Data-Backed)
- Throughput: [current] → [expected] tok/s ([+X]%)
- Energy: [current] → [expected] J/1k tok ([−Y]%)
- Cost savings: $[Z]/month (for [N] requests)
- Confidence: [HIGH/MEDIUM] (based on [n] similar cases in dataset)
## Verification Steps
1. Apply fix and re-measure power draw using NVML monitoring (see references/hardware_profiles.md for protocol)
2. Expected power draw: [P]W (currently [P_current]W)
3. Expected throughput: [T] tok/s (currently [T_current] tok/s)
4. If results differ >10%, open an issue on the project repository
```
### COMPARE — Quantization Method Comparison
When the user asks to compare precision formats (FP16, NF4, INT8, Pure INT8), provide a data-driven comparison.
**Steps:**
1. Identify model and GPU from user context
2. Look up relevant data in `references/paradox_data.md`
3. Build comparison table with: throughput, energy/1k tokens, Δ vs FP16, memory usage
4. Highlight paradoxes and non-obvious trade-offs
5. Give a clear recommendation with reasoning
**Output format (Enhanced v2.0):**
```
## Comparison: [Model] ([X]B params) on [GPU]
| Metric | FP16 | NF4 | INT8 (default) | INT8 (pure) |
|--------|------|-----|----------------|-------------|
| Throughput (tok/s) | [X] ± [σ] | [X] ± [σ] | [X] ± [σ] | [X] ± [σ] |
| Energy (J/1k tok) | [Y] ± [CI] | [Y] ± [CI] | [Y] ± [CI] | [Y] ± [CI] |
| Δ Energy vs FP16 | — | [+/−]%% | [+/−]%% | [+/−]%% |
| Energy Efficiency (tok/J) | [E] | [E] | [E] | [E] |
| VRAM Usage (GB) | [V] | [V] | [V] | [V] |
| Latency (ms/req, BS=1) | [L] | [L] | [L] | [L] |
| Power Draw (W avg) | [P] | [P] | [P] | [P] |
| **Rank (Energy)** | [1-4] | [1-4] | [1-4] | [1-4] |
## 🏆 Recommendation
**Use [method]** for this configuration.
**Reasoning:**
- [Primary reason with data]
- [Secondary consideration]
- [Trade-off analysis]
**Quantified benefit vs alternatives:**
- [X]% less energy than [method]
- [Y]% faster than [method]
- $[Z] monthly savings vs [method] (at [N] requests/month)
## ⚠️ Paradox Warnings
- **[Method]**: [Warning with specific data]
- **[Method]**: [Warning with specific data]
## 💡 Context-Specific Advice
- If memory-constrained (<[X]GB VRAM): Use [method]
- If latency-critical (<[Y]ms): Use [method]
- If cost-optimizing (>1M req/month): Use [method]
- If accuracy-critical: Validate INT8/NF4 with your task (PPL/MMLU data pending)
## 📊 Visualization
[ASCII bar chart or link to interactive dashboard]
```
### ESTIMATE — Cost & Carbon Calculator
When the user wants to estimate operational costs and environmental impact for a deployment.
**Steps:**
1. Gather inputs: model, GPU, quantization, batch size, requests per day/month
2. Look up energy per request from `references/paradox_data.md` and `references/batch_size_guide.md`
3. Calculate:
- Energy (kWh/month) = energy_per_request × requests × PUE (default 1.1 for cloud, 1.0 for local)
- Cost ($/month) = energy × electricity_rate (default $0.12/kWh US, $0.085/kWh China)
- Carbon (kgCO2/month) = energy × grid_intensity (default 390 gCO2/kWh US, 555 gCO2/kWh China)
4. Show comparison: current config vs optimized config (apply OPTIMIZE protocol)
**Output format:**
```
## Monthly Estimate: [Model] on [GPU]
- Requests: [N/month]
- Configuration: [precision + batch size]
| Metric | Current Config | Optimized Config | Savings |
|--------|---------------|-----------------|---------|
| Energy (kWh) | ... | ... | ...% |
| Cost ($) | ... | ... | $... |
| Carbon (kgCO2) | ... | ... | ...% |
## Optimization Breakdown
[What changed and why each change helps]
```
### AUDIT — Configuration Review
When the user shares their inference code or deployment config, audit it for energy efficiency.
**Steps:**
1. Scan for bitsandbytes usage:
- `load_in_8bit=True` without `llm_int8_threshold=0.0` → **RED FLAG** (17–147% energy waste)
- `load_in_4bit=True` on small model (≤3B) → **YELLOW FLAG** (11–29% energy waste)
2. Check batch size:
- BS=1 in production → **YELLOW FLAG** (up to 95% energy savings available)
3. Check model-GPU pairing:
- Large model on small-VRAM GPU forcing quantization → may or may not help, check data
4. Check for missing optimizations:
- No `torch.compile()` → minor optimization available
- No KV cache → significant waste on repeated prompts
**Output format:**
```
## Audit Results
### 🔴 Critical Issues
[Issues causing >30% energy waste]
### 🟡 Warnings
[Issues causing 10–30% potential waste]
### ✅ Good Practices
[What the user is doing right]
### Recommended Changes
[Prioritized list with code snippets and expected impact]
```
## Data Sources & Transparency
All recommendations are grounded in empirical measurements:
- **113+ measurements** across RTX 5090, RTX 4090D, A800
- **5 precision methods**: FP16, FP8, NF4, INT8-mixed, INT8-pure
- **n=10** runs per configuration (n=3 for RTX 5090 quick validation), CV < 2% (throughput), CV < 5% (power)
- **NVML 10 Hz** power monitoring via pynvml
- **Causal ablation** experiments (not just correlation)
- **Cross-generational**: Ada Lovelace vs Blackwell architecture comparison
- **Reproducible**: Full methodology in `references/hardware_profiles.md`
Reference files in `references/` contain the complete dataset.
### Measurement Environment (Critical Context)
- **RTX 5090 (5-precision study)**: PyTorch 2.12.0.dev20260315+cu128, CUDA 12.8, Driver 580.105.08, transformers 4.50.0, torchao 0.17.0.dev20260316+cu128, bitsandbytes 0.45.3
- **RTX 5090 (earlier NF4/FP16)**: PyTorch 2.6.0, CUDA 12.6, Driver 570.86.15, transformers 4.48.0
- **RTX 4090D**: PyTorch 2.4.1, CUDA 12.1, Driver 560.35.03, transformers 4.47.0, bitsandbytes 0.45.0
- **A800**: PyTorch 2.4.1, CUDA 12.1, Driver 535.183.01, transformers 4.47.0, bitsandbytes 0.45.0
- **FP8**: torchao `Float8WeightOnlyConfig` (nightly build, C++ extensions disabled — see [Issue #4094](https://github.com/pytorch/ao/issues/4094))
- **Power measurement**: GPU board power only (excludes CPU/DRAM/PCIe)
- **Idle baseline**: Subtracted per-GPU before each experiment
### Supported Models (with Hugging Face IDs)
- Qwen/Qwen2.5-0.5B (0.5B params) — RTX 5090 five-precision
- TinyLlama/TinyLlama-1.1B-Chat-v1.0 (1.1B params) — RTX 4090D NF4/INT8
- Qwen/Qwen2-1.5B (1.5B params) — RTX 5090 five-precision + earlier NF4/FP16
- Qwen/Qwen2.5-3B (3.0B params) — RTX 5090 five-precision + RTX 4090D NF4
- microsoft/Phi-3-mini-4k-instruct (3.8B params) — RTX 5090 NF4/FP16, RTX 4090D
- 01-ai/Yi-1.5-6B (6B params) — RTX 4090D
- mistralai/Mistral-7B-Instruct-v0.2 (7B params) — RTX 4090D + A800
- Qwen/Qwen2.5-7B-Instruct (7B params) — RTX 5090 five-precision + RTX 4090D
### Limitations (Be Transparent)
1. **GPU coverage**: Direct measurements on RTX 5090/4090D/A800 only
- A100/H100: Extrapolated from A800 (same Ampere/Hopper arch)
- V100/RTX 3090: Extrapolated with architecture adjustments
- AMD/Intel GPUs: Not supported (recommend user benchmarking)
2. **Quantization library**: bitsandbytes (NF4, INT8) and torchao (FP8). GPTQ/AWQ not measured.
3. **FP8 caveat**: FP8 data reflects torchao nightly eager-mode path with C++ extensions disabled. Production FP8 via vLLM/SGLang + `torch.compile` or NVIDIA Transformer Engine may perform substantially differently. torchao maintainers have confirmed that native HF eager-mode is not their optimization target.
4. **Sequence length**: Benchmarks use 512 input + 256 output tokens (128 for RTX 5090 five-precision). Longer sequences: Energy scales ~linearly.
5. **Accuracy**: PPL/MMLU data for Pure INT8 and FP8 pending (flag this caveat)
6. **Framework**: PyTorch + transformers eager mode (vLLM/TensorRT-LLM extrapolated)
7. **RTX 5090 five-precision**: Uses n=3 runs (quick validation); formal publication uses n=10. Total 113+ iterations provide substantial statistical power.
### When to Recommend User Benchmarking
- Unsupported GPU (e.g., AMD MI300X, Intel Gaudi)
- Extreme batch sizes (>64)
- Very long sequences (>4096 tokens)
- Custom quantization methods
- Accuracy-critical applications (validate INT8/NF4)
Provide measurement protocol from `references/hardware_profiles.md` in these cases.
## Links
See MANUAL.md for full list of project links, dashboard URL, related issues, and contact information.
## Author
Hongping Zhang · Independent Researcher
---
## Referenced Files
> The following files are referenced in this skill and included for context.
### references/hardware_profiles.md
```markdown
# Hardware Profiles — EcoCompute Reference Data
## GPU Specifications
### NVIDIA RTX 5090 (Blackwell)
- **Architecture**: Blackwell (SM 100/120)
- **VRAM**: 32 GB GDDR7
- **Memory Bandwidth**: 1,792 GB/s
- **Tensor Cores**: 5th Generation (native FP8 support)
- **TDP**: 575W
- **Idle Power**: ~22W
- **Use Case**: Consumer flagship, single-GPU inference
- **Software (five-precision study)**: PyTorch 2.12.0.dev20260315+cu128, CUDA 12.8, transformers 4.50.0, bitsandbytes 0.45.3, torchao 0.17.0.dev20260316+cu128, Driver 580.105.08
- **Software (earlier NF4/FP16 study)**: PyTorch 2.6.0, CUDA 12.6, transformers 4.48.0, bitsandbytes 0.45.3, Driver 570.86.15
**Tested Models (five-precision)**: Qwen2.5-0.5B, Qwen2.5-1.5B, Qwen2.5-3B, Qwen2.5-7B (FP16, FP8, NF4, INT8-mixed, INT8-pure)
**Tested Models (earlier)**: Qwen2-1.5B, Phi-3-mini-3.8B (FP16, NF4)
**Key Findings**:
- NF4 wastes 20–55% energy on models ≤3B; saves 11.5% at 7B (crossover confirmed at ~5B)
- FP8 (torchao eager) is worst method: +158% to +701% penalty (escalating with model size)
- Energy ranking: NF4 > INT8-pure > FP16 > INT8-mixed > FP8
- torchao FP8 C++ extensions disabled in nightly build → Python fallback causes high-power idle state
### NVIDIA RTX 4090D (Ada Lovelace)
- **Architecture**: Ada Lovelace (SM 89)
- **VRAM**: 24 GB GDDR6X
- **Memory Bandwidth**: 1,008 GB/s
- **Tensor Cores**: 4th Generation
- **TDP**: 425W
- **Idle Power**: ~17W
- **Use Case**: Consumer high-end, most common enthusiast GPU
- **Software**: PyTorch 2.4.1, CUDA 12.1, transformers 4.47.0, bitsandbytes 0.45.0, Driver 560.35.03
**Tested Models**: TinyLlama-1.1B, Yi-1.5-6B, Mistral-7B, Phi-3-mini, Qwen2.5-3B, Qwen2.5-7B (FP16, NF4, INT8 default, INT8 pure)
**Key Finding**: Default INT8 increases energy by 17–33%. Pure INT8 (threshold=0.0) saves 3–8% vs FP16.
### NVIDIA A800 (Ampere)
- **Architecture**: Ampere (SM 80)
- **VRAM**: 80 GB HBM2e
- **Memory Bandwidth**: 2,039 GB/s
- **Tensor Cores**: 3rd Generation
- **TDP**: 400W
- **Idle Power**: ~65W
- **Use Case**: Datacenter, batch processing, production inference
- **Software**: PyTorch 2.4.1, CUDA 12.1, transformers 4.47.0, bitsandbytes 0.45.0, Driver 535.183.01
**Tested Models**: Mistral-7B (FP16, INT8 default, INT8 pure, batch sizes 1–64)
**Key Finding**: Default INT8 has 122–147% energy penalty. Batch size 64 reduces energy per request by 95.7% vs BS=1.
## Energy Measurement Protocol
```
Tool: NVIDIA Management Library (NVML) via pynvml
Sampling: 10 Hz (100ms polling interval)
Metric: GPU board power (watts), excluding CPU/DRAM
Idle baseline: Subtracted per-GPU (measured before each experiment)
Warmup: 3 runs discarded before measurement
Stabilization: 30 seconds between model loads
Measured runs: 10 per configuration
Generation: Greedy decoding (do_sample=False), max_new_tokens=256
Quality gate: CV < 2% (throughput), CV < 5% (power)
```
## Energy Calculation
```
Energy per token (J/tok) = Average Power (W) × Generation Time (s) / Tokens Generated
Energy per 1k tokens (J) = Energy per token × 1000
Carbon per 1k tokens (gCO2) = Energy (kWh) × Grid Intensity (gCO2/kWh)
```
## Grid Carbon Intensity Reference
| Region | gCO2/kWh | Source |
|--------|----------|--------|
| US Average | 390 | EPA eGRID 2024 |
| China Average | 555 | IEA 2024 |
| EU Average | 230 | EEA 2024 |
| France | 56 | Low carbon (nuclear) |
| Norway | 8 | Nearly 100% hydro |
| India | 632 | IEA 2024 |
## Electricity Cost Reference
| Region | $/kWh | Note |
|--------|-------|------|
| US Average | 0.12 | EIA residential 2024 |
| US Cloud (spot) | 0.03–0.06 | AWS/GCP/Azure |
| China | 0.085 | Industrial rate |
| EU Average | 0.22 | Eurostat 2024 |
| AutoDL (China) | ~0.04 | Cloud GPU rental |
```
### references/quantization_guide.md
```markdown
# Quantization Method Selection Guide — EcoCompute Reference Data
## Overview
This guide provides evidence-based recommendations for choosing quantization methods based on **energy efficiency**, not just memory savings or accuracy. All recommendations are grounded in 113+ empirical measurements across 5 precision methods.
## Quantization Methods Tested
### FP16 (Half Precision)
- **Bit width**: 16-bit floating point
- **VRAM**: ~2× model parameters (e.g., 7B model ≈ 14 GB)
- **Implementation**: Native PyTorch (`torch.float16`)
- **Compute path**: Direct FP16 Tensor Core operations
- **Overhead**: None — baseline for all comparisons
```python
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="auto",
)
```
### NF4 (4-bit NormalFloat via bitsandbytes)
- **Bit width**: 4-bit (NormalFloat quantization)
- **VRAM**: ~0.5× model parameters (e.g., 7B model ≈ 3.5 GB + overhead)
- **Implementation**: bitsandbytes QLoRA format
- **Compute path**: NF4 → FP16 de-quantization at each linear layer, then FP16 Tensor Cores
- **Overhead**: De-quantization compute at every forward pass
```python
from transformers import BitsAndBytesConfig
config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=config,
device_map="auto",
)
```
### INT8 Default (bitsandbytes LLM.int8())
- **Bit width**: 8-bit integer with mixed-precision decomposition
- **VRAM**: ~1× model parameters (e.g., 7B model ≈ 7 GB + overhead)
- **Implementation**: bitsandbytes `LLM.int8()` (Dettmers et al., 2022)
- **Compute path**: Outlier detection → split to FP16 (outliers) + INT8 (rest) → merge
- **Overhead**: **SEVERE** — continuous INT8↔FP16 type conversion at every linear layer
- **⚠️ WARNING**: This is the default when using `load_in_8bit=True`. It wastes 17–147% energy.
```python
# ⚠️ DO NOT USE THIS — wastes 17–147% energy
model = AutoModelForCausalLM.from_pretrained(
model_name,
load_in_8bit=True, # Uses threshold=6.0 by default
)
```
### INT8 Pure (bitsandbytes with threshold=0.0)
- **Bit width**: 8-bit integer, no mixed-precision decomposition
- **VRAM**: ~1× model parameters (e.g., 7B model ≈ 7 GB)
- **Implementation**: bitsandbytes with outlier detection disabled
- **Compute path**: Direct INT8 Tensor Core operations (no type conversion)
- **Overhead**: Minimal — slight INT8→FP16 output conversion only
- **✅ RECOMMENDED**: Saves 3–8% energy vs FP16 on RTX 4090D
```python
# ✅ USE THIS — saves energy and memory
from transformers import BitsAndBytesConfig
config = BitsAndBytesConfig(
load_in_8bit=True,
llm_int8_threshold=0.0, # ← Disables mixed-precision decomposition
)
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=config,
device_map="auto",
)
```
### FP8 (torchao Float8WeightOnlyConfig)
- **Bit width**: 8-bit floating point (E4M3 or E5M2)
- **VRAM**: ~1× model parameters (e.g., 7B model ≈ 7–9 GB)
- **Implementation**: torchao `Float8WeightOnlyConfig` via `quantize_()` API
- **Compute path**: Should use FP8 Tensor Cores on Hopper/Blackwell, but current nightly falls back to Python dispatch
- **Overhead**: **CATASTROPHIC** in eager mode — +158% to +701% energy penalty
- **⚠️ AVOID**: Current torchao eager-mode FP8 is the worst method tested
```python
# ⚠️ DO NOT USE THIS in eager mode — +158–701% energy penalty
from torchao.quantization import quantize_, Float8WeightOnlyConfig
model = AutoModelForCausalLM.from_pretrained(
model_name, torch_dtype=torch.float16, device_map="auto"
)
quantize_(model, Float8WeightOnlyConfig())
```
**Why it fails**: torchao nightly build (0.17.0.dev) C++ extensions are incompatible with nightly PyTorch. The system falls back to unoptimized Python loops, causing the GPU to enter a high-power idle state while waiting for CPU dispatch. Power draw escalates toward TDP (448W at 7B) while throughput collapses to 10.5 tok/s.
**Official context**: torchao maintainers [confirmed](https://github.com/pytorch/ao/issues/4094) that energy efficiency is not their priority, and native HF eager-mode is not their target path. The intended deployment is via vLLM/SGLang with `torch.compile`.
**If FP8 is required**, use production-grade paths:
- vLLM/SGLang with `torch.compile`
- NVIDIA Transformer Engine
- Wait for stable torchao release with compiled C++ extensions
## Decision Tree
```
START
│
├─ Using FP8 (torchao eager mode)?
│ │
│ └─ YES → ⚠️ STOP. Switch to NF4/FP16/INT8-pure immediately.
│ FP8 eager mode: +158–701% energy penalty.
│ If FP8 needed: use vLLM/SGLang + torch.compile
│
├─ Model ≤ 3B parameters?
│ │
│ ├─ YES → Use FP16
│ │ (NF4 wastes 20–55% energy, no memory pressure)
│ │
│ └─ NO → Continue ↓
│
├─ Model 3–5B parameters?
│ │
│ ├─ VRAM sufficient → Use FP16 (near break-even zone for NF4)
│ └─ VRAM constrained → Use NF4 (marginal, but needed for memory)
│
├─ GPU VRAM < 2× model size?
│ │ (e.g., 7B model needs 14GB FP16, GPU has ≤16GB)
│ │
│ ├─ YES → Use NF4 (saves 8–35% energy at ≥6B)
│ │
│ └─ NO → VRAM is sufficient ↓
│
├─ Want maximum energy efficiency?
│ │
│ ├─ YES → Use Pure INT8 (threshold=0.0)
│ │ Saves 3–8% vs FP16, uses ~50% less VRAM
│ │
│ └─ NO → Use FP16 (simplest, no quantization overhead)
│
├─ NEVER use default INT8 (threshold=6.0)
│ Always set llm_int8_threshold=0.0 if using INT8
│
└─ NEVER use FP8 in eager mode (torchao)
+158–701% penalty. Use vLLM/SGLang if FP8 needed.
```
## Energy Efficiency Ranking by Scenario
### Small Models (≤3B) on Any GPU
| Rank | Method | Δ vs FP16 | Recommendation |
|------|--------|-----------|----------------|
| 1 | **FP16** | baseline | ✅ Always best |
| 2 | NF4 | +20–55% | ❌ Avoid |
| 3 | INT8 Pure | +78–148% | ❌ Avoid |
| 4 | INT8 Mixed | +212–354% | ❌ Avoid |
| 5 | FP8 (eager) | +158–376% | ❌ Never use |
### Medium-Large Models (5–7B) on Consumer GPU (≤24GB VRAM)
| Rank | Method | Δ vs FP16 | Recommendation |
|------|--------|-----------|----------------|
| 1 | **NF4** | −8 to −35% (4090D), −11.5% (5090 7B) | ✅ Best for energy AND memory |
| 2 | **Pure INT8** | −3 to −8% (4090D), +9.2% (5090 7B) | ✅ Good alternative on Ada |
| 3 | FP16 | baseline | ✅ Fine if VRAM permits |
| 4 | INT8 Mixed | +17–33% (4090D), +74% (5090 7B) | ❌ Avoid |
| 5 | FP8 (eager) | +701% (5090 7B) | ❌ Never use |
### Medium-Large Models (5–7B) on Datacenter GPU (≥80GB VRAM)
| Rank | Method | Δ vs FP16 | Recommendation |
|------|--------|-----------|----------------|
| 1 | **FP16** | baseline | ✅ Best (no memory pressure) |
| 2 | **Pure INT8** | +32–44% (A800) | ⚠️ Worse than FP16 on Ampere |
| 3 | NF4 | not tested on A800 | ⚠️ Likely still has dequant overhead |
| 4 | INT8 Mixed | +122–147% | ❌ Avoid |
| 5 | FP8 (eager) | not tested on A800 | ❌ Avoid (expect similar regression) |
**Note**: On A800 (Ampere), even Pure INT8 is worse than FP16. This may be architecture-specific — Ampere's INT8 Tensor Cores may have different efficiency characteristics than Ada Lovelace. On RTX 4090D (Ada Lovelace), Pure INT8 saves 3–8% vs FP16.
## Common Mistakes and Corrections
### Mistake 1: "INT8 saves memory so it must save energy"
**Reality**: Default INT8 trades 50% memory savings for 17–147% MORE energy. The mixed-precision decomposition overhead far outweighs memory benefits.
### Mistake 2: "4-bit is always better than 16-bit"
**Reality**: For models ≤3B, NF4 wastes 11–29% energy. De-quantization compute dominates memory savings when the model already fits comfortably in VRAM.
### Mistake 3: "Lower power draw = lower energy consumption"
**Reality**: NF4 draws 25% less power than FP16, but runs 42% slower. Net energy = power × time, so energy INCREASES despite lower power.
### Mistake 4: "Quantization choice is the main optimization lever"
**Reality**: Batch size has 10–50× more impact than quantization choice. Going from BS=1 to BS=8 saves 87.5% energy. The best quantization choice saves at most 35%.
### Mistake 5: "FP8 on Blackwell GPUs must be faster and more efficient"
**Reality**: Blackwell hardware supports native FP8 tensor cores, but the software stack is not ready. torchao nightly FP8 in eager mode incurs **+158–701% energy penalty** due to Python-side dispatch overhead and missing compiled C++ extensions. The torchao team has [confirmed](https://github.com/pytorch/ao/issues/4094) that native HF eager-mode is not their optimization target — use vLLM/SGLang with `torch.compile` instead.
### Mistake 6: "Nightly/dev builds are fine for benchmarking"
**Reality**: Nightly builds may lack compiled C++ extensions, causing silent fallback to unoptimized Python paths. Always verify your software stack versions and check for compilation warnings before running energy benchmarks.
## Priority Optimization Order
1. **Batch size** (−87 to −96% energy) — always optimize first
2. **Avoid FP8 eager mode** (−158 to −701% penalty) — switch to NF4/FP16 or use vLLM
3. **Avoid default INT8** (−17 to −147% penalty) — easy fix, one line of code
4. **Choose correct precision for model size** (−8 to −35% savings)
5. **Hardware selection** (varies) — right GPU for the workload
6. **Serving framework** (−10 to −20%) — vLLM/TGI vs raw HuggingFace
7. **Verify software stack** — ensure stable releases, not nightly builds with missing extensions
```
### references/batch_size_guide.md
```markdown
# Batch Size Optimization Guide — EcoCompute Reference Data
## Key Finding
Batch size is the **single largest energy optimization lever** for LLM inference. Going from BS=1 to BS=64 reduces energy per request by **95.7%** on NVIDIA A800 with Mistral-7B Pure INT8.
## Complete Batch Size Data (A800 + Mistral-7B Pure INT8)
| Batch Size | Throughput (tok/s) | Energy/Request (J) | Energy/1k tok (J) | GPU Util (%) | Δ Energy vs BS=1 | Throughput Scaling |
|-----------|-------------------|-------------------|-------------------|-------------|-----------------|-------------------|
| 1 | 18.09 | 14.16 | 5,781 | ~45% | — | 1.0× |
| 2 | 36.48 | 7.57 | 3,091 | ~58% | **−46.5%** | 2.0× |
| 4 | 72.96 | 3.79 | 1,580 | ~72% | **−73.3%** | 4.0× |
| 8 | 144.32 | 1.77 | 827 | ~85% | **−87.5%** | 8.0× |
| 16 | 283.71 | 0.98 | 452 | ~92% | **−93.1%** | 15.7× |
| 32 | 548.20 | 0.72 | 295 | ~95% | **−94.9%** | 30.3× |
| 64 | 1,003.50 | 0.61 | 248 | ~97% | **−95.7%** | 55.5× |
## Why Batch Size Matters So Much
### At BS=1 (Latency-Bound)
- GPU Tensor Cores are **mostly idle** (~45% utilization)
- Kernel launch overhead dominates each operation
- Memory latency (not bandwidth) is the bottleneck
- Fixed overhead per request is NOT amortized
- Result: **14.16 J per request** — massive waste
### At BS≥8 (Compute-Bound)
- GPU Tensor Cores are **fully utilized** (85%+ utilization)
- Fixed overhead amortized across N requests
- Memory access patterns become bandwidth-efficient (coalesced)
- Near-linear throughput scaling up to BS=16
- Result: **1.77 J per request** at BS=8 — 87.5% reduction
### Diminishing Returns Above BS=32
- Throughput still scales but sub-linearly
- Energy savings plateau: BS=32 (−94.9%) vs BS=64 (−95.7%) — only 0.8% difference
- VRAM becomes the constraint at very large batch sizes
- Recommendation: **BS=8–32 is the sweet spot** for most deployments
## Scaling Law
Energy per request approximately follows an inverse relationship:
```
Energy/request ≈ C / BS^α
Where:
C ≈ 14.16 J (BS=1 baseline)
α ≈ 0.78 (empirically fitted)
```
This holds well from BS=1 to BS=32, with slight flattening at BS=64.
## Practical Recommendations
### Production API Server
- **Minimum**: BS=8 (−87.5% energy, 8× throughput)
- **Optimal**: BS=16–32 (−93 to −95% energy)
- **Implementation**: Use vLLM, TGI, or similar continuous batching framework
- **Trade-off**: Slight increase in per-request latency at higher BS
### Interactive Chat Application
- BS=1 is acceptable for real-time response
- **Optimization**: Batch concurrent users (even 2–4 users batched saves 46–73%)
- Consider request queuing with 50–100ms window to accumulate batch
### Batch Processing / Offline Jobs
- **Always use BS=32–64** (−95% energy)
- No latency constraint → maximize throughput
- Example: Summarizing 10,000 documents → use BS=64
### VRAM Budget Calculator
Approximate VRAM usage for Mistral-7B Pure INT8:
```
VRAM ≈ Model Weights + KV Cache × BS + Activation Memory × BS
Model weights (INT8): ~7 GB
KV cache per request: ~0.5 GB (at 256 tokens)
Activation memory: ~0.2 GB per request
BS=1: ~7.7 GB
BS=8: ~12.6 GB
BS=16: ~18.2 GB
BS=32: ~29.4 GB
BS=64: ~51.8 GB (requires ≥80 GB VRAM → A800/H100 only)
```
## Cost Impact Example
**Scenario**: 1 million requests/month, Mistral-7B Pure INT8, A800
| Batch Size | Energy/month (kWh) | Cost (@ $0.04/kWh) | Carbon (kgCO2, China grid) | Δ vs BS=1 |
|-----------|-------------------|--------------------|--------------------------|---------|
| 1 | 3,933 | $157 | 2,183 | — |
| 8 | 492 | $20 | 273 | **−87.5%** |
| 32 | 200 | $8 | 111 | **−94.9%** |
| 64 | 169 | $7 | 94 | **−95.7%** |
**BS=1 → BS=32 saves $149/month and 2,072 kgCO2/month** for just one model on one GPU.
## Code Examples
### vLLM (Recommended for Production)
```python
from vllm import LLM, SamplingParams
# vLLM handles continuous batching automatically
llm = LLM(
model="mistralai/Mistral-7B-Instruct-v0.2",
quantization="bitsandbytes",
load_format="bitsandbytes",
max_num_batched_tokens=8192, # allows up to ~32 concurrent requests
)
# Submit multiple requests — vLLM batches them automatically
sampling_params = SamplingParams(max_tokens=256)
outputs = llm.generate(prompts, sampling_params)
```
### Manual Batching (HuggingFace)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
config = BitsAndBytesConfig(
load_in_8bit=True,
llm_int8_threshold=0.0, # Pure INT8 — critical for energy efficiency
)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=config)
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token
# Batch multiple prompts
prompts = ["Prompt 1", "Prompt 2", "Prompt 3", "Prompt 4"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
outputs = model.generate(**inputs, max_new_tokens=256)
```
```
### references/paradox_data.md
```markdown
# Paradox Data — EcoCompute Complete Measurements
> **Updated**: 2026-03-18 · **Version**: 2.3.0 · **Total measurements**: 113+
## RTX 5090 Five-Precision Benchmark (0.5B–7B)
Complete energy data across all five tested precisions on Blackwell (n=3, 10 iterations each).
| Model | Params | Precision | Throughput (tok/s) | Power (W) | Energy (J/1k tok) | ΔE vs FP16 | GPU Mem (GB) |
|-------|--------|-----------|-------------------|-----------|-------------------|------------|-------------|
| Qwen2.5-0.5B | 0.5B | FP16 | 83.01 | 122.5 | 1,472 | — | 1.62 |
| | | FP8† | 44.14 | 168.5 | 3,799 | **+158%** ⚠️ | 1.14 |
| | | NF4 | 47.32 | 108.2 | 2,283 | **+55%** ⚠️ | 0.92 |
| | | INT8-mix | 16.37 | 109.4 | 6,680 | **+354%** ⚠️ | 1.62 |
| | | INT8-pure | 29.68 | 108.4 | 3,654 | **+148%** ⚠️ | 1.62 |
| Qwen2.5-1.5B | 1.5B | FP16 | 68.91 | 160.5 | 2,310 | — | 3.53 |
| | | FP8† | 35.43 | 293.7 | 8,284 | **+259%** ⚠️ | 2.32 |
| | | NF4 | 41.92 | 130.5 | 3,121 | **+35%** ⚠️ | 1.68 |
| | | INT8-mix | 14.45 | 121.5 | 8,409 | **+264%** ⚠️ | 3.53 |
| | | INT8-pure | 25.35 | 120.3 | 4,753 | **+106%** ⚠️ | 3.53 |
| Qwen2.5-3B | 3.0B | FP16 | 54.80 | 193.6 | 3,504 | — | 6.53 |
| | | FP8† | 23.35 | 390.4 | 16,666 | **+376%** ⚠️ | 4.08 |
| | | NF4 | 36.42 | 153.2 | 4,204 | **+20%** ⚠️ | 2.89 |
| | | INT8-mix | 11.52 | 125.8 | 10,921 | **+212%** ⚠️ | 6.53 |
| | | INT8-pure | 20.95 | 130.7 | 6,235 | **+78%** ⚠️ | 6.53 |
| Qwen2.5-7B | 7.0B | FP16 | 69.97 | 374.9 | 5,331 | — | 15.22 |
| | | FP8† | 10.48 | 448.3 | 42,711 | **+701%** ⚠️ | 8.89 |
| | | NF4 | 39.92 | 189.0 | 4,718 | **−11.5%** ✅ | 6.19 |
| | | INT8-mix | 12.91 | 120.0 | 9,280 | **+74%** ⚠️ | 8.12 |
| | | INT8-pure | 23.56 | 137.4 | 5,822 | **+9.2%** ⚠️ | 8.12 |
{\scriptsize †Reflects unoptimized Python fallback in torchao nightly build; not indicative of peak hardware potential.}
### Energy Efficiency Ranking (RTX 5090, all sizes)
**Best to worst**: NF4 > INT8-pure > FP16 baseline > INT8-mixed > FP8
---
## Paradox 1: NF4 Small-Model Energy Penalty (RTX 5090 Blackwell)
NF4 quantization **increases** energy consumption for models ≤3B parameters.
### RTX 5090 — FP16 vs NF4 (earlier study)
| Model | Params | Precision | Throughput (tok/s) | Power (W) | Energy (J/1k tok) | Δ vs FP16 |
|-------|--------|-----------|-------------------|-----------|-------------------|----------|
| Qwen2-1.5B | 1.5B | FP16 | 71.45 ± 0.80 | 172.30 | 2,411 | — |
| Qwen2-1.5B | 1.5B | NF4 | 41.57 ± 0.29 | 129.83 | 3,123 | **+29.4%** ⚠️ |
| Phi-3-mini | 3.8B | FP16 | 43.47 ± 0.11 | 213.35 | 4,908 | — |
| Phi-3-mini | 3.8B | NF4 | 32.08 ± 0.13 | 175.85 | 5,483 | **+11.7%** ⚠️ |
**Root Cause — De-quantization Tax:**
- Small models fit in VRAM at FP16 → memory bandwidth is NOT the bottleneck
- NF4 adds de-quantization (NF4→FP16) at every linear layer
- Extra compute overhead DOMINATES the small memory savings
- Formula: E_ratio = (T_NF4/T_FP16) × (P_NF4/P_FP16) ≈ 1.72 × 0.75 ≈ 1.29 (matches +29.4%)
**Crossover Point:** ~4–5B parameters (confirmed at 7B on RTX 5090: NF4 saves 11.5%). Below ~5B, FP16 is always more efficient.
### RTX 4090D — NF4 Saves Energy for Larger Models
| Model | Params | Precision | Throughput (tok/s) | Energy (J/1k tok) | Δ vs FP16 |
|-------|--------|-----------|-------------------|-------------------|-----------|
| Yi-1.5-6B | 6B | FP16 | 34.72 ± 0.18 | 4,716 ± 119 | — |
| Yi-1.5-6B | 6B | NF4 | 36.42 ± 0.27 | 3,333 ± 25 | **−29.3%** ✅ |
| Mistral-7B | 7B | FP16 | 29.06 ± 0.10 | 5,661 ± 143 | — |
| Mistral-7B | 7B | NF4 | 32.29 ± 0.02 | 3,707 ± 15 | **−34.5%** ✅ |
| Phi-3-mini | 3.8B | FP16 | 57.62 ± 0.48 | 2,775 ± 48 | — |
| Phi-3-mini | 3.8B | NF4 | 42.16 ± 0.25 | 3,076 ± 20 | **+10.8%** ⚠️ |
| Qwen2.5-7B | 7B | FP16 | 28.37 ± 0.39 | 5,649 ± 83 | — |
| Qwen2.5-7B | 7B | NF4 | 34.29 ± 0.24 | 5,191 ± 37 | **−8.1%** ✅ |
---
## Paradox 2: bitsandbytes INT8 Energy Overhead
Default `LLM.int8()` (threshold=6.0) **increases** energy consumption by 17–147%.
### RTX 4090D — Default INT8 vs FP16 vs Pure INT8
| Model | Precision | Throughput (tok/s) | Energy (J/1k tok) | Δ vs FP16 | Δ vs Default INT8 |
|-------|-----------|-------------------|-------------------|-----------|-------------------|
| **Yi-1.5-6B** | FP16 | 34.72 ± 0.18 | 4,716 ± 119 | — | — |
| Yi-1.5-6B | INT8 Default | 8.42 ± 0.03 | 6,258 ± 78 | **+32.7%** ⚠️ | — |
| Yi-1.5-6B | INT8 Pure (t=0.0) | 15.47 ± 0.08 | 4,568 | **−3.1%** ✅ | **−34.2%** ✅ |
| **Mistral-7B** | FP16 | 29.06 ± 0.10 | 5,661 ± 143 | — | — |
| Mistral-7B | INT8 Default | 7.88 ± 0.03 | 7,401 ± 115 | **+30.7%** ⚠️ | — |
| Mistral-7B | INT8 Pure (t=0.0) | 14.15 ± 0.23 | 5,212 | **−7.9%** ✅ | **−36.9%** ✅ |
| **Average** | — | — | — | **+31.7%** ⚠️ | — |
| **Average (Pure)** | — | — | — | **−5.5%** ✅ | **−35.6%** ✅ |
### A800 — INT8 Overhead Is Even Worse on Datacenter GPUs
| Model | BS | Precision | Throughput (tok/s) | Energy (J/1k tok) | Δ vs FP16 |
|-------|---|-----------|-------------------|-------------------|-----------|
| Mistral-7B | 1 | FP16 | 36.18 | 4,334 | — |
| Mistral-7B | 1 | INT8 Default | 9.87 | 9,608 | **+122%** ⚠️ |
| Mistral-7B | 1 | INT8 Pure | 18.09 | 5,781 | +33% |
| Mistral-7B | 4 | FP16 | 145.35 | 1,100 | — |
| Mistral-7B | 4 | INT8 Default | 35.91 | 2,718 | **+147%** ⚠️ |
| Mistral-7B | 4 | INT8 Pure | 72.96 | 1,580 | +44% |
| Mistral-7B | 8 | FP16 | 290.59 | 628 | — |
| Mistral-7B | 8 | INT8 Default | 69.88 | 1,417 | **+126%** ⚠️ |
| Mistral-7B | 8 | INT8 Pure | 144.32 | 827 | +32% |
**Root Cause — Mixed-Precision Decomposition:**
1. `LLM.int8()` with `threshold=6.0` detects "outlier" features (magnitude > 6.0)
2. Outlier features are extracted and computed in FP16
3. Remaining features computed in INT8
4. Results merged back → continuous INT8↔FP16 type conversion at every linear layer
5. This causes 72–76% throughput loss, which dominates the 25% power reduction
**Ablation Proof:**
Setting `llm_int8_threshold=0.0` disables the decomposition entirely:
- All features processed in INT8 (no outlier extraction)
- Throughput recovery: **+79–98%** vs default INT8
- Energy reduction: **−34–42%** vs default INT8
- Net vs FP16: **−3% to −8%** energy savings (RTX 4090D), **+32–44%** penalty still on A800
**Code to reproduce:**
```python
# Default INT8 (ENERGY WASTEFUL — avoid this)
model = AutoModelForCausalLM.from_pretrained(
model_name,
load_in_8bit=True,
# llm_int8_threshold defaults to 6.0
)
# Pure INT8 (ENERGY EFFICIENT — use this instead)
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(
load_in_8bit=True,
llm_int8_threshold=0.0, # ← This one line saves 34–42% energy
)
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=quantization_config,
)
```
---
## Paradox 3: BS=1 Waste (A800)
Single-request inference wastes up to 95.7% of available energy efficiency.
See `batch_size_guide.md` for complete data.
---
---
## Paradox 4: FP8 Software Immaturity (RTX 5090 Blackwell)
FP8 (torchao `Float8WeightOnlyConfig`) is the **worst method tested** across all model sizes on Blackwell.
### FP8 Energy Penalty Escalation
| Model | Params | FP8 Energy (J/1k) | FP16 Energy (J/1k) | ΔE% | FP8 Power (W) | FP8 Throughput (tok/s) |
|-------|--------|-------------------|--------------------|----|---------------|----------------------|
| Qwen2.5-0.5B | 0.5B | 3,799 | 1,472 | **+158%** | 168.5 | 44.14 |
| Qwen2.5-1.5B | 1.5B | 8,284 | 2,310 | **+259%** | 293.7 | 35.43 |
| Qwen2.5-3B | 3.0B | 16,666 | 3,504 | **+376%** | 390.4 | 23.35 |
| Qwen2.5-7B | 7.0B | 42,711 | 5,331 | **+701%** | 448.3 | 10.48 |
**Root Cause — Python-Side Dispatch Overhead:**
1. torchao nightly build (0.17.0.dev) C++ extensions are incompatible with nightly PyTorch, forcing unoptimized Python fallback
2. Weight-only FP8 quantization lacks fused inference kernels
3. GPU enters **high-power idle state** waiting for Python dispatch → massive energy waste
4. Power draw escalates toward TDP (448W at 7B vs 575W TDP) while throughput collapses
**Official Context:**
torchao maintainers have confirmed ([Issue #4094](https://github.com/pytorch/ao/issues/4094)):
- "TorchAO does not aim to make models more power efficient"
- "Accelerating native HF checkpoints is not a priority"
- Intended path: vLLM/SGLang with `torch.compile`
**Recommendation:** NEVER use FP8 via torchao eager mode. If FP8 is required, use:
- vLLM/SGLang with `torch.compile`
- NVIDIA Transformer Engine
- Wait for stable torchao release with compiled C++ extensions
---
## Summary Decision Matrix
| Model Size | GPU VRAM | Best Precision | Avoid | Notes |
|-----------|----------|---------------|-------|-------|
| ≤3B | Any | **FP16** | NF4 (+11–55%), FP8 (+158–376%) | No memory pressure, all quantization adds overhead |
| 3–5B | Any | **FP16 preferred** | FP8 (always), INT8-mixed | Near break-even zone for NF4 |
| ≥5B | ≤24GB | **NF4** | FP8, INT8-mixed | NF4 saves 11.5% at 7B, memory savings dominate |
| ≥5B | ≥80GB | **FP16 or Pure INT8** | FP8, INT8-mixed | No memory pressure |
| Any | Any | — | FP8 eager mode (always) | +158–701% penalty in current software |
| Any | Any | — | INT8 default (always) | Always set threshold=0.0 if using INT8 |
## Quality Metrics
- RTX 4090D / A800: n=10 runs per configuration
- RTX 5090 five-precision: n=3 runs (quick validation), 10 iterations each → 113+ total iterations
- Coefficient of Variation: 0.3–1.7% (throughput), <5% (power)
- Cross-model consistency: ±3.5%
- Thermal stabilization: 30s between model loads
- Warmup: 3 runs discarded
- Cross-generational: Ada Lovelace N_crit ≈ 4.2B (extrapolated), Blackwell N_crit ≈ 4–5B (confirmed at 7B)
```
---
## Skill Companion Files
> Additional files collected from the skill directory layout.
### _meta.json
```json
{
"owner": "hongping-zh",
"slug": "ecocompute",
"displayName": "EcoCompute — LLM Energy Efficiency Advisor",
"latest": {
"version": "2.5.0",
"publishedAt": 1773830049910,
"commit": "https://github.com/openclaw/skills/commit/52e01f5bb201a3fb4c7ae708199f52bde07989fe"
},
"history": [
{
"version": "2.2.0",
"publishedAt": 1773217079618,
"commit": "https://github.com/openclaw/skills/commit/c9539627d239ddee076c130b6592970a210c7514"
},
{
"version": "2.0.0",
"publishedAt": 1771312974166,
"commit": "https://github.com/openclaw/skills/commit/fa11963708347d6e2fe36296d1077de87297b8a2"
},
{
"version": "1.0.0",
"publishedAt": 1771229518434,
"commit": "https://github.com/openclaw/skills/commit/4ed52c126799e17669b797e56a1eb6bb41bd317c"
}
]
}
```
### references/parameter_validation_guide.md
```markdown
# Parameter Validation & Error Handling Guide — EcoCompute v2.0
## Overview
This guide provides comprehensive parameter validation rules, error handling patterns, and example request/response pairs for the EcoCompute Skill v2.0.
---
## Input Parameter Specifications
### 1. model_id (Required)
**Type**: String
**Format**: Model name or Hugging Face Hub ID
**Examples**:
- `"mistralai/Mistral-7B-Instruct-v0.2"`
- `"Qwen/Qwen2-7B"`
- `"Mistral-7B"` (informal, will be normalized)
**Validation Rules**:
1. Must be non-empty string
2. Extract parameter count if present (e.g., "7B" → 7 billion)
3. If parameter count not explicit, look up from known models
4. Warn if model not in tested dataset (extrapolation required)
**Error Handling**:
```
❌ Invalid: model_id = ""
Response: "Error: model_id is required. Please specify a model name (e.g., 'Mistral-7B') or Hugging Face ID (e.g., 'mistralai/Mistral-7B-Instruct-v0.2')."
⚠️ Warning: model_id = "Llama-3-70B"
Response: "Warning: Llama-3-70B (70B params) not directly measured. Extrapolating from 7B model data. For accurate results, consider benchmarking with our measurement protocol."
✅ Valid: model_id = "mistralai/Mistral-7B-Instruct-v0.2"
```
---
### 2. hardware_platform (Required, with Default)
**Type**: String (enum)
**Supported Values**:
- Direct measurements: `rtx5090`, `rtx4090d`, `a800`
- Extrapolated: `a100`, `h100`, `rtx3090`, `v100`
- Aliases: `4090` → `rtx4090d`, `5090` → `rtx5090`
**Default**: `rtx4090d` (most common consumer GPU)
**Validation Rules**:
1. Normalize to lowercase
2. Map aliases to canonical names
3. Check if GPU is in direct measurement set
4. If extrapolated, note architecture similarity
**Error Handling**:
```
❌ Invalid: hardware_platform = "gtx1080"
Response: "Error: GPU 'gtx1080' not supported. Supported GPUs: RTX 5090, RTX 4090D, A800, A100 (extrapolated), H100 (extrapolated), RTX 3090 (extrapolated), V100 (extrapolated). For other GPUs, use our measurement protocol: [link]"
⚠️ Warning: hardware_platform = "h100"
Response: "Note: H100 data extrapolated from A800 (both Ampere/Hopper architecture). Expect ±10-15% variance. For production deployments, consider validation benchmarking."
✅ Valid: hardware_platform = "a800"
Response: "Using NVIDIA A800 (Ampere, 80GB HBM2e) — direct measurements available."
```
---
### 3. quantization (Optional)
**Type**: String (enum)
**Options**: `fp16`, `bf16`, `fp32`, `nf4`, `int8_default`, `int8_pure`
**Default**: `fp16` (safest baseline)
**Validation Rules**:
1. Must be from supported list
2. Check GPU compatibility (e.g., BF16 requires Ampere+)
3. Cross-validate with model size (warn if NF4 on ≤3B model)
**Error Handling**:
```
❌ Invalid: quantization = "int4"
Response: "Error: 'int4' not supported. Did you mean 'nf4' (4-bit NormalFloat)? Supported: fp16, bf16, fp32, nf4, int8_default, int8_pure."
⚠️ Warning: quantization = "bf16", hardware_platform = "rtx3090"
Response: "Warning: BF16 requires Ampere or newer (RTX 30-series uses GA102 which supports BF16, but with lower efficiency than Ampere). Consider FP16 for RTX 3090."
⚠️ Warning: quantization = "nf4", model_id = "Qwen2-1.5B"
Response: "Warning: NF4 on small models (≤3B) wastes 11-29% energy vs FP16. Recommendation: Use FP16 for Qwen2-1.5B (1.5B params) on this GPU."
✅ Valid: quantization = "int8_pure", model_id = "Mistral-7B", hardware_platform = "a800"
Response: "Using Pure INT8 (threshold=0.0) — saves ~5% energy vs FP16 on A800 for 7B models."
```
---
### 4. batch_size (Optional)
**Type**: Integer
**Range**: 1-64
**Preferred**: Powers of 2 (1, 2, 4, 8, 16, 32, 64)
**Default**: 1 (conservative, but flagged for optimization)
**Validation Rules**:
1. Must be positive integer
2. Must be ≤64 (hardware/memory limits)
3. Warn if not power of 2 (suboptimal GPU utilization)
4. Flag BS=1 in production scenarios
**Error Handling**:
```
❌ Invalid: batch_size = 0
Response: "Error: batch_size must be ≥1. Did you mean batch_size=1?"
❌ Invalid: batch_size = 128
Response: "Error: batch_size=128 exceeds maximum supported (64). For larger batches, use vLLM with continuous batching."
⚠️ Warning: batch_size = 5
Response: "Warning: batch_size=5 is not a power of 2. GPU kernels are optimized for powers of 2. Consider BS=4 or BS=8 for better performance."
⚠️ Warning: batch_size = 1, use_case = "production API"
Response: "⚠️ Critical optimization opportunity: BS=1 in production wastes up to 95.7% energy. Recommendation: Use BS≥8 with request batching (vLLM continuous batching)."
✅ Valid: batch_size = 16
Response: "Using BS=16 — reduces energy by 87.5% vs BS=1."
```
---
### 5. sequence_length (Optional, v2.0)
**Type**: Integer
**Range**: 128-4096 tokens
**Default**: 512 (typical chat/API scenario)
**Validation Rules**:
1. Must be positive integer
2. Warn if exceeds model's context window
3. Note impact on memory and energy
**Error Handling**:
```
❌ Invalid: sequence_length = 0
Response: "Error: sequence_length must be ≥1. Typical values: 512 (chat), 1024 (documents), 2048 (long context)."
⚠️ Warning: sequence_length = 8192, model_id = "Mistral-7B"
Response: "Warning: sequence_length=8192 exceeds Mistral-7B's context window (8192 max with sliding window). Ensure your use case fits within model limits."
⚠️ Note: sequence_length = 2048
Response: "Note: 2048 tokens = 4× baseline (512). Energy per request will scale approximately linearly. Estimated energy: [X] J (vs [Y] J at 512 tokens)."
✅ Valid: sequence_length = 1024
Response: "Using 1024 input tokens (2× baseline). Energy estimates adjusted accordingly."
```
---
### 6. generation_length (Optional, v2.0)
**Type**: Integer
**Range**: 1-2048 tokens
**Default**: 256 (used in benchmark data)
**Validation Rules**:
1. Must be positive integer
2. Reasonable upper limit (2048 for most use cases)
3. Note direct proportionality to energy
**Error Handling**:
```
❌ Invalid: generation_length = -1
Response: "Error: generation_length must be ≥1. Typical values: 128 (short answers), 256 (default), 512 (detailed responses)."
⚠️ Warning: generation_length = 2048
Response: "Warning: Generating 2048 tokens = 8× baseline (256). Energy per request: [X] J (vs [Y] J at 256 tokens). Consider if this length is necessary for your use case."
✅ Valid: generation_length = 512
Response: "Generating 512 tokens (2× baseline). Energy per request: ~2× baseline measurement."
```
---
### 7. precision (Optional, v2.0)
**Type**: String (enum)
**Options**: `fp32`, `bf16`, `fp16`, `tf32`
**Default**: Inferred from `quantization` parameter
**Validation Rules**:
1. Check GPU architecture compatibility
2. Cross-validate with `quantization` (avoid conflicts)
3. Note if precision differs from quantization
**Error Handling**:
```
❌ Invalid: precision = "int8"
Response: "Error: 'int8' is a quantization method, not a precision. Use quantization='int8_pure' instead. Precision options: fp32, bf16, fp16, tf32."
⚠️ Warning: precision = "bf16", hardware_platform = "rtx4090d"
Response: "Note: RTX 4090D (Ada Lovelace) supports BF16, but FP16 may be faster for inference. BF16 is primarily beneficial for training."
⚠️ Conflict: precision = "fp32", quantization = "nf4"
Response: "Warning: Conflicting parameters. NF4 quantization uses FP16/BF16 for dequantization, not FP32. Using FP16 compute with NF4 weights."
✅ Valid: precision = "fp16"
Response: "Using FP16 precision (default for inference)."
```
---
## Cross-Parameter Validation
### Memory Capacity Check
**Rule**: Ensure model + KV cache + batch size fits in GPU VRAM
**Example**:
```python
model_size_gb = {
"fp16": params_billions * 2,
"nf4": params_billions * 0.5,
"int8": params_billions * 1,
}
kv_cache_gb = batch_size * sequence_length * 0.0001 # Rough estimate
total_vram_needed = model_size_gb + kv_cache_gb
if total_vram_needed > gpu_vram_capacity:
return "⚠️ Warning: Estimated VRAM usage ({total_vram_needed}GB) may exceed GPU capacity ({gpu_vram_capacity}GB). Consider: (1) reducing batch_size, (2) using quantization, or (3) reducing sequence_length."
```
**Error Example**:
```
⚠️ Warning: Mistral-7B FP16 (14GB) + BS=32 + seq_len=2048 (~18GB total) may exceed RTX 4090D capacity (24GB). Recommendation: Use BS=16 or NF4 quantization.
```
---
## Complete Example Request/Response Pairs
### Example 1: Minimal Input (Use Defaults)
**Request**:
```json
{
"model_id": "Qwen/Qwen2-7B"
}
```
**Response**:
```
## Configuration (with defaults applied)
- Model: Qwen2-7B (7B parameters)
- GPU: RTX 4090D (default, Ada Lovelace, 24GB)
- Precision: FP16 (default)
- Batch size: 1 (default, ⚠️ consider increasing for production)
- Sequence length: 512 tokens → Generation: 256 tokens (defaults)
## ⚠️ Optimization Recommendations
1. **Increase batch_size**: BS=1 wastes 87-95% energy. Use BS≥8 for production.
2. **Specify hardware_platform**: Defaulting to RTX 4090D. Specify your actual GPU for accurate estimates.
## Performance Metrics (RTX 4090D, FP16, BS=1)
- Throughput: 42.3 tok/s (±1.2% std dev)
- Energy per 1k tokens: 892 J (±38 J)
- Energy per request: 228 J (for 256 token generation)
[... rest of standard output ...]
```
---
### Example 2: Full Parameter Specification
**Request**:
```json
{
"model_id": "mistralai/Mistral-7B-Instruct-v0.2",
"hardware_platform": "a800",
"quantization": "int8_pure",
"batch_size": 32,
"sequence_length": 1024,
"generation_length": 512
}
```
**Response**:
```
## Validated Configuration
- Model: Mistral-7B-Instruct-v0.2 (7B parameters) ✅
- GPU: NVIDIA A800 (Ampere, 80GB HBM2e) ✅
- Precision: Pure INT8 (threshold=0.0) ✅
- Batch size: 32 ✅
- Sequence length: 1024 tokens → Generation: 512 tokens ✅
## Validation Notes
✅ All parameters valid
✅ VRAM check: ~8GB model + ~3GB KV cache = 11GB total (well within 80GB capacity)
✅ Pure INT8 on A800 saves ~5% energy vs FP16 (no paradox with threshold=0.0)
✅ BS=32 provides 93% energy reduction vs BS=1
⚠️ Sequence length 2× baseline (1024 vs 512) — energy estimates scaled accordingly
⚠️ Generation length 2× baseline (512 vs 256) — energy per request doubled
## Performance Metrics (Measured + Extrapolated)
- Throughput: 312 tok/s (±2.1% std dev, n=10)
- Baseline (BS=16, 512→256): 284 tok/s measured
- Scaling: BS=32 gives ~10% throughput increase
- Latency: 164 ms/request (BS=32, 512 tokens)
- GPU Utilization: ~85% (estimated)
## Energy & Efficiency
- Energy per 1k tokens: 327 J (±16 J, 95% CI)
- Baseline (FP16, BS=16): 345 J measured
- Pure INT8: -5% energy
- BS=32: -8% energy vs BS=16
- Energy per request: 167 J (for 512 token generation)
- 2× baseline due to 2× generation length
- Energy efficiency: 3.06 tokens/J
- Power draw: 298W average (285-312W range)
[... rest of output ...]
## 🔬 Measurement Transparency
- Hardware: NVIDIA A800, Driver 535.183.01
- Software: PyTorch 2.4.1, CUDA 12.1, transformers 4.47.0
- Method: NVML 10Hz power monitoring, n=10 runs, CV<2%
- Baseline: Mistral-7B Pure INT8 BS=16 measured directly on A800
- Extrapolations:
- BS=32: Scaled from BS=16 data using batch_size_guide.md
- Sequence length 1024: Linear scaling from 512 baseline
- Generation length 512: Linear scaling from 256 baseline
- Confidence: HIGH (direct measurement with validated scaling laws)
```
---
### Example 3: Invalid Parameters (Error Handling)
**Request**:
```json
{
"model_id": "",
"hardware_platform": "gtx1080",
"quantization": "int4",
"batch_size": 128,
"sequence_length": -1
}
```
**Response**:
```
## ❌ Parameter Validation Errors
1. **model_id**: Required parameter missing
- Error: model_id cannot be empty
- Fix: Specify a model name (e.g., "Mistral-7B") or Hugging Face ID
2. **hardware_platform**: Unsupported GPU
- Error: 'gtx1080' not supported
- Supported: RTX 5090, RTX 4090D, A800, A100, H100, RTX 3090, V100
- Fix: Use a supported GPU or follow our measurement protocol for GTX 1080
3. **quantization**: Invalid value
- Error: 'int4' not recognized
- Did you mean: 'nf4' (4-bit NormalFloat)?
- Supported: fp16, bf16, fp32, nf4, int8_default, int8_pure
4. **batch_size**: Out of range
- Error: batch_size=128 exceeds maximum (64)
- Fix: Use batch_size≤64, or use vLLM for larger batches
5. **sequence_length**: Invalid value
- Error: sequence_length must be ≥1
- Typical values: 512 (chat), 1024 (documents), 2048 (long context)
## 🔧 Suggested Corrected Request
```json
{
"model_id": "mistralai/Mistral-7B-Instruct-v0.2",
"hardware_platform": "rtx4090d",
"quantization": "nf4",
"batch_size": 64,
"sequence_length": 512
}
```
Please correct the errors and try again.
```
---
### Example 4: Warnings (Suboptimal but Valid)
**Request**:
```json
{
"model_id": "Qwen/Qwen2-1.5B",
"hardware_platform": "a800",
"quantization": "nf4",
"batch_size": 1
}
```
**Response**:
```
## Configuration (with warnings)
- Model: Qwen2-1.5B (1.5B parameters) ✅
- GPU: NVIDIA A800 (Ampere, 80GB HBM2e) ✅
- Precision: NF4 ⚠️
- Batch size: 1 ⚠️
- Sequence length: 512 tokens → Generation: 256 tokens (defaults)
## ⚠️ Critical Warnings
### 1. NF4 on Small Model (Energy Waste: +29%)
**Issue**: NF4 quantization on 1.5B model wastes 29% more energy than FP16
- Qwen2-1.5B FP16: 3GB VRAM (fits easily in 80GB A800)
- NF4 dequantization overhead: 29% energy penalty
- **Recommendation**: Use FP16 instead
**Impact**:
- Current (NF4): ~1,150 J/1k tokens
- Optimized (FP16): ~892 J/1k tokens
- Savings: 258 J/1k tokens (22% reduction)
### 2. Batch Size = 1 in Production (Energy Waste: up to 95.7%)
**Issue**: BS=1 has massive energy waste vs batching
- **Recommendation**: Use BS≥8 for production workloads
**Impact**:
- Current (BS=1): ~1,150 J/1k tokens
- Optimized (BS=16): ~143 J/1k tokens
- Savings: 1,007 J/1k tokens (87.5% reduction)
## 💡 Optimized Configuration
```json
{
"model_id": "Qwen/Qwen2-1.5B",
"hardware_platform": "a800",
"quantization": "fp16", // Changed from nf4
"batch_size": 16 // Changed from 1
}
```
**Expected improvement**:
- Energy: 1,150 → 111 J/1k tokens (-90.3%)
- Monthly cost (500K requests): $48 → $4.60 (-90.4%)
- Carbon: 267 kgCO2 → 26 kgCO2 (-90.3%)
Would you like me to provide the optimized configuration analysis?
```
---
## Error Message Templates
### Template 1: Missing Required Parameter
```
❌ Error: {parameter_name} is required.
Fix: {suggestion}
Example: {example_value}
```
### Template 2: Invalid Value
```
❌ Error: '{value}' is not a valid {parameter_name}.
Supported values: {valid_options}
Did you mean: {closest_match}?
```
### Template 3: Out of Range
```
❌ Error: {parameter_name}={value} is out of valid range [{min}, {max}].
Fix: {suggestion}
Typical values: {examples}
```
### Template 4: Suboptimal Configuration
```
⚠️ Warning: {parameter_name}={value} may cause {issue}.
Impact: {quantified_impact}
Recommendation: {alternative}
Expected improvement: {benefit}
```
### Template 5: Extrapolation Notice
```
ℹ️ Note: {parameter} data extrapolated from {baseline}.
Method: {extrapolation_method}
Confidence: {confidence_level}
Recommendation: {validation_suggestion if confidence < HIGH}
```
---
## Best Practices for Parameter Handling
1. **Always validate before processing**: Check all parameters before running analysis
2. **Provide helpful error messages**: Include fix suggestions and examples
3. **Use warnings, not errors, for suboptimal choices**: Let users proceed but inform them
4. **Quantify impact**: Always show the cost of suboptimal choices in energy/cost/carbon
5. **Suggest alternatives**: Provide corrected configuration with expected improvements
6. **Be transparent about extrapolation**: Clearly state when data is measured vs extrapolated
7. **Link to documentation**: Point users to measurement protocol for unsupported configs
---
## Validation Checklist
Before providing recommendations, verify:
- [ ] model_id is valid and parameter count extracted
- [ ] hardware_platform is supported (or closest match identified)
- [ ] quantization is compatible with GPU architecture
- [ ] batch_size is within valid range and power of 2
- [ ] sequence_length and generation_length are reasonable
- [ ] VRAM capacity check passed
- [ ] Cross-parameter conflicts resolved
- [ ] Warnings issued for suboptimal choices
- [ ] Extrapolations clearly noted
- [ ] Measurement transparency provided
---
**Last Updated**: 2026-02-16
**Version**: 2.0
**Author**: Hongping Zhang
```