hugging-face-evaluation-manager
Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content, importing scores from Artificial Analysis API, and running custom model evaluations with vLLM/lighteval. Works with the model-index metadata format.
Packaged view
This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.
Install command
npx @skill-hub/cli install nymbo-skills-hugging-face-evaluation-manager
Repository
Skill path: HUGGING FACE/hugging-face-evaluation-manager
Best for
Primary workflow: Write Technical Docs.
Technical facets: Full Stack, Backend, Data / AI, Tech Writer.
Target audience: everyone.
License: Unknown.
Original source
Catalog source: SkillHub Club.
Repository owner: Nymbo.
This is a mirrored public skill entry; review the repository before installing it into production workflows.
What it helps with
- Install hugging-face-evaluation-manager into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
- Review https://github.com/Nymbo/Skills before adding hugging-face-evaluation-manager to shared team environments
- Use hugging-face-evaluation-manager for development workflows
Works across
Favorites: 0.
Sub-skills: 0.
Aggregator: No.
Original source / Raw SKILL.md
---
name: hugging-face-evaluation-manager
description: Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content, importing scores from Artificial Analysis API, and running custom model evaluations with vLLM/lighteval. Works with the model-index metadata format.
---
# Overview
This skill provides tools to add structured evaluation results to Hugging Face model cards. It supports multiple methods for adding evaluation data:
- Extracting existing evaluation tables from README content
- Importing benchmark scores from Artificial Analysis
- Running custom model evaluations with vLLM or accelerate backends (lighteval/inspect-ai)
## Integration with HF Ecosystem
- **Model Cards**: Updates model-index metadata for leaderboard integration
- **Artificial Analysis**: Direct API integration for benchmark imports
- **Papers with Code**: Compatible with their model-index specification
- **Jobs**: Run evaluations directly on Hugging Face Jobs with `uv` integration
- **vLLM**: Efficient GPU inference for custom model evaluation
- **lighteval**: HuggingFace's evaluation library with vLLM/accelerate backends
- **inspect-ai**: UK AI Safety Institute's evaluation framework
# Version
1.3.0
# Dependencies
## Core Dependencies
- huggingface_hub>=0.26.0
- markdown-it-py>=3.0.0
- python-dotenv>=1.2.1
- pyyaml>=6.0.3
- requests>=2.32.5
- re (built-in)
## Inference Provider Evaluation
- inspect-ai>=0.3.0
- inspect-evals
- openai
## vLLM Custom Model Evaluation (GPU required)
- lighteval[accelerate,vllm]>=0.6.0
- vllm>=0.4.0
- torch>=2.0.0
- transformers>=4.40.0
- accelerate>=0.30.0
Note: vLLM dependencies are installed automatically via PEP 723 script headers when using `uv run`.
# IMPORTANT: Using This Skill
## ⚠️ CRITICAL: Check for Existing PRs Before Creating New Ones
**Before creating ANY pull request with `--create-pr`, you MUST check for existing open PRs:**
```bash
uv run scripts/evaluation_manager.py get-prs --repo-id "username/model-name"
```
**If open PRs exist:**
1. **DO NOT create a new PR** - this creates duplicate work for maintainers
2. **Warn the user** that open PRs already exist
3. **Show the user** the existing PR URLs so they can review them
4. Only proceed if the user explicitly confirms they want to create another PR
This prevents spamming model repositories with duplicate evaluation PRs.
---
**Use `--help` for the latest workflow guidance.** Works with plain Python or `uv run`:
```bash
uv run scripts/evaluation_manager.py --help
uv run scripts/evaluation_manager.py inspect-tables --help
uv run scripts/evaluation_manager.py extract-readme --help
```
Key workflow (matches CLI help):
1) `get-prs` → check for existing open PRs first
2) `inspect-tables` → find table numbers/columns
3) `extract-readme --table N` → prints YAML by default
4) add `--apply` (push) or `--create-pr` to write changes
# Core Capabilities
## 1. Inspect and Extract Evaluation Tables from README
- **Inspect Tables**: Use `inspect-tables` to see all tables in a README with structure, columns, and sample rows
- **Parse Markdown Tables**: Accurate parsing using markdown-it-py (ignores code blocks and examples)
- **Table Selection**: Use `--table N` to extract from a specific table (required when multiple tables exist)
- **Format Detection**: Recognize common formats (benchmarks as rows, columns, or comparison tables with multiple models)
- **Column Matching**: Automatically identify model columns/rows; prefer `--model-column-index` (index from inspect output). Use `--model-name-override` only with exact column header text.
- **YAML Generation**: Convert selected table to model-index YAML format
- **Task Typing**: `--task-type` sets the `task.type` field in model-index output (e.g., `text-generation`, `summarization`)
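The conversion from a table cell to a model-index metric can be sketched in a few lines. The helper name `row_to_metric` is illustrative, but the behavior (benchmark name becomes the metric name and type, percent signs and thousands separators are stripped before parsing the value) mirrors the extraction logic in `scripts/evaluation_manager.py`:

```python
def row_to_metric(benchmark: str, cell: str) -> dict:
    """Convert one table cell into a model-index metric entry."""
    # Strip "%" and "," so values like "72.5%" or "1,234" parse as floats
    value = float(cell.replace("%", "").replace(",", "").strip())
    return {
        "name": benchmark,
        "type": benchmark.lower().replace(" ", "_"),
        "value": value,
    }

print(row_to_metric("MMLU", "85.2"))
print(row_to_metric("Human Eval", "72.5%"))
```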
## 2. Import from Artificial Analysis
- **API Integration**: Fetch benchmark scores directly from Artificial Analysis
- **Automatic Formatting**: Convert API responses to model-index format
- **Metadata Preservation**: Maintain source attribution and URLs
- **PR Creation**: Automatically create pull requests with evaluation updates
## 3. Model-Index Management
- **YAML Generation**: Create properly formatted model-index entries
- **Merge Support**: Add evaluations to existing model cards without overwriting
- **Validation**: Ensure compliance with Papers with Code specification
- **Batch Operations**: Process multiple models efficiently
## 4. Run Evaluations on HF Jobs (Inference Providers)
- **Inspect-AI Integration**: Run standard evaluations using the `inspect-ai` library
- **UV Integration**: Seamlessly run Python scripts with ephemeral dependencies on HF infrastructure
- **Zero-Config**: No Dockerfiles or Space management required
- **Hardware Selection**: Configure CPU or GPU hardware for the evaluation job
- **Secure Execution**: Handles API tokens safely via secrets passed through the CLI
## 5. Run Custom Model Evaluations with vLLM (NEW)
⚠️ **Important:** This approach requires a machine with `uv` installed and sufficient GPU memory.
**Benefits:** No need for the `hf_jobs()` MCP tool; scripts can be run directly in a terminal.
**When to use:** The user is working directly on a local machine with an available GPU.
### Before running the script
- Check the script path
- Check that `uv` is installed
- Check that a GPU is available with `nvidia-smi`
### Running the script
```bash
uv run scripts/lighteval_vllm_uv.py --model <model-id> --tasks "leaderboard|mmlu|5"
```
### Features
- **vLLM Backend**: High-performance GPU inference (5-10x faster than standard HF methods)
- **lighteval Framework**: HuggingFace's evaluation library with Open LLM Leaderboard tasks
- **inspect-ai Framework**: UK AI Safety Institute's evaluation library
- **Standalone or Jobs**: Run locally or submit to HF Jobs infrastructure
# Usage Instructions
The skill includes Python scripts in `scripts/` to perform operations.
### Prerequisites
- Preferred: use `uv run` (PEP 723 header auto-installs deps)
- Or install manually: `pip install huggingface-hub markdown-it-py python-dotenv pyyaml requests`
- Set `HF_TOKEN` environment variable with Write-access token
- For Artificial Analysis: Set `AA_API_KEY` environment variable
- `.env` is loaded automatically if `python-dotenv` is installed
### Method 1: Extract from README (CLI workflow)
Recommended flow (matches `--help`):
```bash
# 1) Inspect tables to get table numbers and column hints
uv run scripts/evaluation_manager.py inspect-tables --repo-id "username/model"
# 2) Extract a specific table (prints YAML by default)
uv run scripts/evaluation_manager.py extract-readme \
--repo-id "username/model" \
--table 1 \
[--model-column-index <column index shown by inspect-tables>] \
[--model-name-override "<column header/model name>"] # use exact header text if you can't use the index
# 3) Apply changes (push or PR)
uv run scripts/evaluation_manager.py extract-readme \
--repo-id "username/model" \
--table 1 \
--apply # push directly
# or
uv run scripts/evaluation_manager.py extract-readme \
--repo-id "username/model" \
--table 1 \
--create-pr # open a PR
```
Validation checklist:
- YAML is printed by default; compare against the README table before applying.
- Prefer `--model-column-index`; if using `--model-name-override`, the column header text must be exact.
- For transposed tables (models as rows), ensure only one row is extracted.
### Method 2: Import from Artificial Analysis
Fetch benchmark scores from Artificial Analysis API and add them to a model card.
**Basic Usage:**
```bash
AA_API_KEY="your-api-key" python scripts/evaluation_manager.py import-aa \
--creator-slug "anthropic" \
--model-name "claude-sonnet-4" \
--repo-id "username/model-name"
```
**With Environment File:**
```bash
# Create .env file
echo "AA_API_KEY=your-api-key" >> .env
echo "HF_TOKEN=your-hf-token" >> .env
# Run import
python scripts/evaluation_manager.py import-aa \
--creator-slug "anthropic" \
--model-name "claude-sonnet-4" \
--repo-id "username/model-name"
```
**Create Pull Request:**
```bash
python scripts/evaluation_manager.py import-aa \
--creator-slug "anthropic" \
--model-name "claude-sonnet-4" \
--repo-id "username/model-name" \
--create-pr
```
### Method 3: Run Evaluation Job
Submit an evaluation job on Hugging Face infrastructure using the `hf jobs uv run` CLI.
**Direct CLI Usage:**
```bash
HF_TOKEN=$HF_TOKEN \
hf jobs uv run hf_model_evaluation/scripts/inspect_eval_uv.py \
--flavor cpu-basic \
--secret HF_TOKEN=$HF_TOKEN \
-- --model "meta-llama/Llama-2-7b-hf" \
--task "mmlu"
```
**GPU Example (A10G):**
```bash
HF_TOKEN=$HF_TOKEN \
hf jobs uv run hf_model_evaluation/scripts/inspect_eval_uv.py \
--flavor a10g-small \
--secret HF_TOKEN=$HF_TOKEN \
-- --model "meta-llama/Llama-2-7b-hf" \
--task "gsm8k"
```
**Python Helper (optional):**
```bash
python scripts/run_eval_job.py \
--model "meta-llama/Llama-2-7b-hf" \
--task "mmlu" \
--hardware "t4-small"
```
### Method 4: Run Custom Model Evaluation with vLLM
Evaluate custom HuggingFace models directly on GPU using vLLM or accelerate backends. These scripts are **separate from inference provider scripts** and run models locally on the job's hardware.
#### When to Use vLLM Evaluation (vs Inference Providers)
| Feature | vLLM Scripts | Inference Provider Scripts |
|---------|-------------|---------------------------|
| Model access | Any HF model | Models with API endpoints |
| Hardware | Your GPU (or HF Jobs GPU) | Provider's infrastructure |
| Cost | HF Jobs compute cost | API usage fees |
| Speed | vLLM optimized | Depends on provider |
| Offline | Yes (after download) | No |
#### Option A: lighteval with vLLM Backend
lighteval is HuggingFace's evaluation library, supporting Open LLM Leaderboard tasks.
**Standalone (local GPU):**
```bash
# Run MMLU 5-shot with vLLM
python scripts/lighteval_vllm_uv.py \
--model meta-llama/Llama-3.2-1B \
--tasks "leaderboard|mmlu|5"
# Run multiple tasks
python scripts/lighteval_vllm_uv.py \
--model meta-llama/Llama-3.2-1B \
--tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5"
# Use accelerate backend instead of vLLM
python scripts/lighteval_vllm_uv.py \
--model meta-llama/Llama-3.2-1B \
--tasks "leaderboard|mmlu|5" \
--backend accelerate
# Chat/instruction-tuned models
python scripts/lighteval_vllm_uv.py \
--model meta-llama/Llama-3.2-1B-Instruct \
--tasks "leaderboard|mmlu|5" \
--use-chat-template
```
**Via HF Jobs:**
```bash
hf jobs uv run scripts/lighteval_vllm_uv.py \
--flavor a10g-small \
--secrets HF_TOKEN=$HF_TOKEN \
-- --model meta-llama/Llama-3.2-1B \
--tasks "leaderboard|mmlu|5"
```
**lighteval Task Format:**
Tasks use the format `suite|task|num_fewshot`:
- `leaderboard|mmlu|5` - MMLU with 5-shot
- `leaderboard|gsm8k|5` - GSM8K with 5-shot
- `lighteval|hellaswag|0` - HellaSwag zero-shot
- `leaderboard|arc_challenge|25` - ARC-Challenge with 25-shot
**Finding Available Tasks:**
The complete list of available lighteval tasks can be found at:
https://github.com/huggingface/lighteval/blob/main/examples/tasks/all_tasks.txt
This file contains all supported tasks in the format `suite|task|num_fewshot|0` (the trailing `0` is a version flag and can be ignored). Common suites include:
- `leaderboard` - Open LLM Leaderboard tasks (MMLU, GSM8K, ARC, HellaSwag, etc.)
- `lighteval` - Additional lighteval tasks
- `bigbench` - BigBench tasks
- `original` - Original benchmark tasks
To use a task from the list, extract the `suite|task|num_fewshot` portion (without the trailing `0`) and pass it to the `--tasks` parameter. For example:
- From file: `leaderboard|mmlu|0|0` → Use: `leaderboard|mmlu|0` (or change the fewshot count to `5` for 5-shot)
- From file: `bigbench|abstract_narrative_understanding|0|0` → Use: `bigbench|abstract_narrative_understanding|0`
- From file: `lighteval|wmt14:hi-en|0|0` → Use: `lighteval|wmt14:hi-en|0`
Multiple tasks can be specified as comma-separated values: `--tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5"`
#### Option B: inspect-ai with vLLM Backend
inspect-ai is the UK AI Safety Institute's evaluation framework.
**Standalone (local GPU):**
```bash
# Run MMLU with vLLM
python scripts/inspect_vllm_uv.py \
--model meta-llama/Llama-3.2-1B \
--task mmlu
# Use HuggingFace Transformers backend
python scripts/inspect_vllm_uv.py \
--model meta-llama/Llama-3.2-1B \
--task mmlu \
--backend hf
# Multi-GPU with tensor parallelism
python scripts/inspect_vllm_uv.py \
--model meta-llama/Llama-3.2-70B \
--task mmlu \
--tensor-parallel-size 4
```
**Via HF Jobs:**
```bash
hf jobs uv run scripts/inspect_vllm_uv.py \
--flavor a10g-small \
--secrets HF_TOKEN=$HF_TOKEN \
-- --model meta-llama/Llama-3.2-1B \
--task mmlu
```
**Available inspect-ai Tasks:**
- `mmlu` - Massive Multitask Language Understanding
- `gsm8k` - Grade School Math
- `hellaswag` - Common sense reasoning
- `arc_challenge` - AI2 Reasoning Challenge
- `truthfulqa` - TruthfulQA benchmark
- `winogrande` - Winograd Schema Challenge
- `humaneval` - Code generation
#### Option C: Python Helper Script
The helper script auto-selects hardware and simplifies job submission:
```bash
# Auto-detect hardware based on model size
python scripts/run_vllm_eval_job.py \
--model meta-llama/Llama-3.2-1B \
--task "leaderboard|mmlu|5" \
--framework lighteval
# Explicit hardware selection
python scripts/run_vllm_eval_job.py \
--model meta-llama/Llama-3.2-70B \
--task mmlu \
--framework inspect \
--hardware a100-large \
--tensor-parallel-size 4
# Use HF Transformers backend
python scripts/run_vllm_eval_job.py \
--model microsoft/phi-2 \
--task mmlu \
--framework inspect \
--backend hf
```
**Hardware Recommendations:**
| Model Size | Recommended Hardware |
|------------|---------------------|
| < 3B params | `t4-small` |
| 3B - 13B | `a10g-small` |
| 13B - 34B | `a10g-large` |
| 34B+ | `a100-large` |
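The auto-selection the helper script performs can be approximated from the table above. The thresholds and flavor names mirror the table; the function itself is a sketch, not the helper's actual code:

```python
def pick_flavor(params_billions: float) -> str:
    """Map a model's parameter count (in billions) to a hardware flavor,
    following the recommendation table above."""
    if params_billions < 3:
        return "t4-small"
    if params_billions <= 13:
        return "a10g-small"
    if params_billions <= 34:
        return "a10g-large"
    return "a100-large"
```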
### Commands Reference
**Top-level help and version:**
```bash
uv run scripts/evaluation_manager.py --help
uv run scripts/evaluation_manager.py --version
```
**Inspect Tables (start here):**
```bash
uv run scripts/evaluation_manager.py inspect-tables --repo-id "username/model-name"
```
**Extract from README:**
```bash
uv run scripts/evaluation_manager.py extract-readme \
--repo-id "username/model-name" \
--table N \
[--model-column-index N] \
[--model-name-override "Exact Column Header or Model Name"] \
[--task-type "text-generation"] \
[--dataset-name "Custom Benchmarks"] \
[--apply | --create-pr]
```
**Import from Artificial Analysis:**
```bash
AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa \
--creator-slug "creator-name" \
--model-name "model-slug" \
--repo-id "username/model-name" \
[--create-pr]
```
**View / Validate:**
```bash
uv run scripts/evaluation_manager.py show --repo-id "username/model-name"
uv run scripts/evaluation_manager.py validate --repo-id "username/model-name"
```
**Check Open PRs (ALWAYS run before --create-pr):**
```bash
uv run scripts/evaluation_manager.py get-prs --repo-id "username/model-name"
```
Lists all open pull requests for the model repository. Shows PR number, title, author, date, and URL.
**Run Evaluation Job (Inference Providers):**
```bash
hf jobs uv run scripts/inspect_eval_uv.py \
--flavor "cpu-basic|t4-small|..." \
--secret HF_TOKEN=$HF_TOKEN \
-- --model "model-id" \
--task "task-name"
```
or use the Python helper:
```bash
python scripts/run_eval_job.py \
--model "model-id" \
--task "task-name" \
--hardware "cpu-basic|t4-small|..."
```
**Run vLLM Evaluation (Custom Models):**
```bash
# lighteval with vLLM
hf jobs uv run scripts/lighteval_vllm_uv.py \
--flavor "a10g-small" \
--secrets HF_TOKEN=$HF_TOKEN \
-- --model "model-id" \
--tasks "leaderboard|mmlu|5"
# inspect-ai with vLLM
hf jobs uv run scripts/inspect_vllm_uv.py \
--flavor "a10g-small" \
--secrets HF_TOKEN=$HF_TOKEN \
-- --model "model-id" \
--task "mmlu"
# Helper script (auto hardware selection)
python scripts/run_vllm_eval_job.py \
--model "model-id" \
--task "leaderboard|mmlu|5" \
--framework lighteval
```
### Model-Index Format
The generated model-index follows this structure:
```yaml
model-index:
- name: Model Name
results:
- task:
type: text-generation
dataset:
name: Benchmark Dataset
type: benchmark_type
metrics:
- name: MMLU
type: mmlu
value: 85.2
- name: HumanEval
type: humaneval
value: 72.5
source:
name: Source Name
url: https://source-url.com
```
WARNING: Do not use markdown formatting in the model name; use the exact name from the table. URLs belong only in the `source.url` field.
### Error Handling
- **Table Not Found**: Script will report if no evaluation tables are detected
- **Invalid Format**: Clear error messages for malformed tables
- **API Errors**: Retry logic for transient Artificial Analysis API failures
- **Token Issues**: Validation before attempting updates
- **Merge Conflicts**: Preserves existing model-index entries when adding new ones
- **Space Creation**: Handles naming conflicts and hardware request failures gracefully
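The retry behavior described for transient API failures is typically exponential backoff over a small, bounded number of attempts. The sketch below illustrates the pattern with a stand-in function; it is not the skill's actual API client:

```python
import time

def with_retries(fn, attempts: int = 3, base_delay: float = 0.01):
    """Call fn, retrying on ConnectionError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the error
            time.sleep(base_delay * (2 ** attempt))

calls = {"n": 0}
def flaky():
    """Stand-in for an API call that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

print(with_retries(flaky))  # succeeds on the third attempt
```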
### Best Practices
1. **Check for existing PRs first**: Run `get-prs` before creating any new PR to avoid duplicates
2. **Always start with `inspect-tables`**: See table structure and get the correct extraction command
3. **Use `--help` for guidance**: Run `inspect-tables --help` to see the complete workflow
4. **Preview first**: Default behavior prints YAML; review it before using `--apply` or `--create-pr`
5. **Verify extracted values**: Compare YAML output against the README table manually
6. **Use `--table N` for multi-table READMEs**: Required when multiple evaluation tables exist
7. **Use `--model-name-override` for comparison tables**: Copy the exact column header from `inspect-tables` output
8. **Create PRs for Others**: Use `--create-pr` when updating models you don't own
9. **One model per repo**: Only add the main model's results to model-index
10. **No markdown in YAML names**: The model name field in YAML should be plain text
### Model Name Matching
When extracting evaluation tables with multiple models (either as columns or rows), the script uses **exact normalized token matching**:
- Removes markdown formatting (bold `**`, links `[]()` )
- Normalizes names (lowercase, replace `-` and `_` with spaces)
- Compares token sets: `"OLMo-3-32B"` → `{"olmo", "3", "32b"}` matches `"**Olmo 3 32B**"` or `"[Olmo-3-32B](...)"`
- Only extracts if tokens match exactly (handles different word orders and separators)
- Fails if no exact match found (rather than guessing from similar names)
**For column-based tables** (benchmarks as rows, models as columns):
- Finds the column header matching the model name
- Extracts scores from that column only
**For transposed tables** (models as rows, benchmarks as columns):
- Finds the row in the first column matching the model name
- Extracts all benchmark scores from that row only
This ensures only the correct model's scores are extracted, never unrelated models or training checkpoints.
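The matching rule above condenses to a few lines. This mirrors the `normalize_model_name` helper from `scripts/evaluation_manager.py` (included further down on this page) rather than inventing new behavior:

```python
import re

def tokens(name: str) -> set[str]:
    """Normalize a model name to a token set for exact matching."""
    cleaned = re.sub(r'\[([^\]]+)\]\([^\)]+\)', r'\1', name)  # drop markdown links
    cleaned = re.sub(r'\*\*([^\*]+)\*\*', r'\1', cleaned)     # drop bold
    normalized = cleaned.strip().lower().replace("-", " ").replace("_", " ")
    return set(normalized.split())

print(tokens("OLMo-3-32B") == tokens("**Olmo 3 32B**"))                  # exact match
print(tokens("OLMo-3-32B") == tokens("[Olmo-3-32B](https://example.com)"))
print(tokens("OLMo-3-32B") == tokens("OLMo-3-32B-Think"))                # extra token, no match
```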
### Common Patterns
**Update Your Own Model:**
```bash
# Extract from README and push directly
uv run scripts/evaluation_manager.py extract-readme \
--repo-id "your-username/your-model" \
--task-type "text-generation"
```
**Update Someone Else's Model (Full Workflow):**
```bash
# Step 1: ALWAYS check for existing PRs first
uv run scripts/evaluation_manager.py get-prs \
--repo-id "other-username/their-model"
# Step 2: If NO open PRs exist, proceed with creating one
uv run scripts/evaluation_manager.py extract-readme \
--repo-id "other-username/their-model" \
--create-pr
# If open PRs DO exist:
# - Warn the user about existing PRs
# - Show them the PR URLs
# - Do NOT create a new PR unless user explicitly confirms
```
**Import Fresh Benchmarks:**
```bash
# Step 1: Check for existing PRs
uv run scripts/evaluation_manager.py get-prs \
--repo-id "anthropic/claude-sonnet-4"
# Step 2: If no PRs, import from Artificial Analysis
AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa \
--creator-slug "anthropic" \
--model-name "claude-sonnet-4" \
--repo-id "anthropic/claude-sonnet-4" \
--create-pr
```
### Troubleshooting
**Issue**: "No evaluation tables found in README"
- **Solution**: Check if README contains markdown tables with numeric scores
**Issue**: "Could not find model 'X' in transposed table"
- **Solution**: The script will display available models. Use `--model-name-override` with the exact name from the list
- **Example**: `--model-name-override "**Olmo 3-32B**"`
**Issue**: "AA_API_KEY not set"
- **Solution**: Set environment variable or add to .env file
**Issue**: "Token does not have write access"
- **Solution**: Ensure HF_TOKEN has write permissions for the repository
**Issue**: "Model not found in Artificial Analysis"
- **Solution**: Verify creator-slug and model-name match API values
**Issue**: "Payment required for hardware"
- **Solution**: Add a payment method to your Hugging Face account to use non-CPU hardware
**Issue**: "vLLM out of memory" or CUDA OOM
- **Solution**: Use a larger hardware flavor, reduce `--gpu-memory-utilization`, or use `--tensor-parallel-size` for multi-GPU
**Issue**: "Model architecture not supported by vLLM"
- **Solution**: Use `--backend hf` (inspect-ai) or `--backend accelerate` (lighteval) for HuggingFace Transformers
**Issue**: "Trust remote code required"
- **Solution**: Add `--trust-remote-code` flag for models with custom code (e.g., Phi-2, Qwen)
**Issue**: "Chat template not found"
- **Solution**: Only use `--use-chat-template` for instruction-tuned models that include a chat template
### Integration Examples
**Python Script Integration:**
```python
import subprocess

def update_model_evaluations(repo_id: str) -> None:
    """Update a model card with evaluations extracted from its README."""
    result = subprocess.run([
        "python", "scripts/evaluation_manager.py",
        "extract-readme",
        "--repo-id", repo_id,
        "--create-pr",
    ], capture_output=True, text=True)
    if result.returncode == 0:
        print(f"Successfully updated {repo_id}")
    else:
        print(f"Error: {result.stderr}")
```
---
## Referenced Files
> The following files are referenced in this skill and included for context.
### scripts/evaluation_manager.py
```python
# /// script
# requires-python = ">=3.13"
# dependencies = [
# "huggingface-hub>=1.1.4",
# "markdown-it-py>=3.0.0",
# "python-dotenv>=1.2.1",
# "pyyaml>=6.0.3",
# "requests>=2.32.5",
# ]
# ///
"""
Manage evaluation results in Hugging Face model cards.
This script provides two methods:
1. Extract evaluation tables from model README files
2. Import evaluation scores from Artificial Analysis API
Both methods update the model-index metadata in model cards.
"""
import argparse
import os
import re
from textwrap import dedent
from typing import Any, Dict, List, Optional, Tuple
def load_env() -> None:
"""Load .env if python-dotenv is available; keep help usable without it."""
try:
import dotenv # type: ignore
except ModuleNotFoundError:
return
dotenv.load_dotenv()
def require_markdown_it():
try:
from markdown_it import MarkdownIt # type: ignore
except ModuleNotFoundError as exc:
raise ModuleNotFoundError(
"markdown-it-py is required for table parsing. "
"Install with `uv add markdown-it-py` or `pip install markdown-it-py`."
) from exc
return MarkdownIt
def require_model_card():
try:
from huggingface_hub import ModelCard # type: ignore
except ModuleNotFoundError as exc:
raise ModuleNotFoundError(
"huggingface-hub is required for model card operations. "
"Install with `uv add huggingface_hub` or `pip install huggingface-hub`."
) from exc
return ModelCard
def require_requests():
try:
import requests # type: ignore
except ModuleNotFoundError as exc:
raise ModuleNotFoundError(
"requests is required for Artificial Analysis import. "
"Install with `uv add requests` or `pip install requests`."
) from exc
return requests
def require_yaml():
try:
import yaml # type: ignore
except ModuleNotFoundError as exc:
raise ModuleNotFoundError(
"PyYAML is required for YAML output. "
"Install with `uv add pyyaml` or `pip install pyyaml`."
) from exc
return yaml
# ============================================================================
# Method 1: Extract Evaluations from README
# ============================================================================
def extract_tables_from_markdown(markdown_content: str) -> List[str]:
"""Extract all markdown tables from content."""
# Pattern to match markdown tables
table_pattern = r"(\|[^\n]+\|(?:\r?\n\|[^\n]+\|)+)"
tables = re.findall(table_pattern, markdown_content)
return tables
def parse_markdown_table(table_str: str) -> Tuple[List[str], List[List[str]]]:
"""
Parse a markdown table string into headers and rows.
Returns:
Tuple of (headers, data_rows)
"""
lines = [line.strip() for line in table_str.strip().split("\n")]
# Remove separator line (the one with dashes)
lines = [line for line in lines if not re.match(r"^\|[\s\-:]+\|$", line)]
if len(lines) < 2:
return [], []
# Parse header
header = [cell.strip() for cell in lines[0].split("|")[1:-1]]
# Parse data rows
data_rows = []
for line in lines[1:]:
cells = [cell.strip() for cell in line.split("|")[1:-1]]
if cells:
data_rows.append(cells)
return header, data_rows
def is_evaluation_table(header: List[str], rows: List[List[str]]) -> bool:
"""Determine if a table contains evaluation results."""
if not header or not rows:
return False
# Check if first column looks like benchmark names
benchmark_keywords = [
"benchmark", "task", "dataset", "eval", "test", "metric",
"mmlu", "humaneval", "gsm", "hellaswag", "arc", "winogrande",
"truthfulqa", "boolq", "piqa", "siqa"
]
first_col = header[0].lower()
has_benchmark_header = any(keyword in first_col for keyword in benchmark_keywords)
# Check if there are numeric values in the table
has_numeric_values = False
for row in rows:
for cell in row:
try:
float(cell.replace("%", "").replace(",", ""))
has_numeric_values = True
break
except ValueError:
continue
if has_numeric_values:
break
return has_benchmark_header or has_numeric_values
def normalize_model_name(name: str) -> tuple[set[str], str]:
"""
Normalize a model name for matching.
Args:
name: Model name to normalize
Returns:
Tuple of (token_set, normalized_string)
"""
# Remove markdown formatting
cleaned = re.sub(r'\[([^\]]+)\]\([^\)]+\)', r'\1', name) # Remove markdown links
cleaned = re.sub(r'\*\*([^\*]+)\*\*', r'\1', cleaned) # Remove bold
cleaned = cleaned.strip()
# Normalize and tokenize
normalized = cleaned.lower().replace("-", " ").replace("_", " ")
tokens = set(normalized.split())
return tokens, normalized
def find_main_model_column(header: List[str], model_name: str) -> Optional[int]:
"""
Identify the column index that corresponds to the main model.
Only returns a column if there's an exact normalized match with the model name.
This prevents extracting scores from training checkpoints or similar models.
Args:
header: Table column headers
model_name: Model name from repo_id (e.g., "OLMo-3-32B-Think")
Returns:
Column index of the main model, or None if no exact match found
"""
if not header or not model_name:
return None
# Normalize model name and extract tokens
model_tokens, _ = normalize_model_name(model_name)
# Find exact matches only
for i, col_name in enumerate(header):
if not col_name:
continue
# Skip first column (benchmark names)
if i == 0:
continue
col_tokens, _ = normalize_model_name(col_name)
# Check for exact token match
if model_tokens == col_tokens:
return i
# No exact match found
return None
def find_main_model_row(
rows: List[List[str]], model_name: str
) -> tuple[Optional[int], List[str]]:
"""
Identify the row index that corresponds to the main model in a transposed table.
In transposed tables, each row represents a different model, with the first
column containing the model name.
Args:
rows: Table data rows
model_name: Model name from repo_id (e.g., "OLMo-3-32B")
Returns:
Tuple of (row_index, available_models)
- row_index: Index of the main model, or None if no exact match found
- available_models: List of all model names found in the table
"""
if not rows or not model_name:
return None, []
model_tokens, _ = normalize_model_name(model_name)
available_models = []
for i, row in enumerate(rows):
if not row or not row[0]:
continue
row_name = row[0].strip()
# Skip separator/header rows
if not row_name or row_name.startswith('---'):
continue
row_tokens, _ = normalize_model_name(row_name)
# Collect all non-empty model names
if row_tokens:
available_models.append(row_name)
# Check for exact token match
if model_tokens == row_tokens:
return i, available_models
return None, available_models
def is_transposed_table(header: List[str], rows: List[List[str]]) -> bool:
"""
Determine if a table is transposed (models as rows, benchmarks as columns).
A table is considered transposed if:
- The first column contains model-like names (not benchmark names)
- Most other columns contain numeric values
- Header row contains benchmark-like names
Args:
header: Table column headers
rows: Table data rows
Returns:
True if table appears to be transposed, False otherwise
"""
if not header or not rows or len(header) < 3:
return False
# Check if first column header suggests model names
first_col = header[0].lower()
model_indicators = ["model", "system", "llm", "name"]
has_model_header = any(indicator in first_col for indicator in model_indicators)
# Check if remaining headers look like benchmarks
benchmark_keywords = [
"mmlu", "humaneval", "gsm", "hellaswag", "arc", "winogrande",
"eval", "score", "benchmark", "test", "math", "code", "mbpp",
"truthfulqa", "boolq", "piqa", "siqa", "drop", "squad"
]
benchmark_header_count = 0
for col_name in header[1:]:
col_lower = col_name.lower()
if any(keyword in col_lower for keyword in benchmark_keywords):
benchmark_header_count += 1
has_benchmark_headers = benchmark_header_count >= 2
# Check if data rows have numeric values in most columns (except first)
numeric_count = 0
total_cells = 0
for row in rows[:5]: # Check first 5 rows
for cell in row[1:]: # Skip first column
total_cells += 1
try:
float(cell.replace("%", "").replace(",", "").strip())
numeric_count += 1
except (ValueError, AttributeError):
continue
has_numeric_data = total_cells > 0 and (numeric_count / total_cells) > 0.5
return (has_model_header or has_benchmark_headers) and has_numeric_data
def extract_metrics_from_table(
header: List[str],
rows: List[List[str]],
table_format: str = "auto",
model_name: Optional[str] = None,
model_column_index: Optional[int] = None
) -> List[Dict[str, Any]]:
"""
Extract metrics from parsed table data.
Args:
header: Table column headers
rows: Table data rows
table_format: "rows" (benchmarks as rows), "columns" (benchmarks as columns),
"transposed" (models as rows, benchmarks as columns), or "auto"
model_name: Optional model name to identify the correct column/row
model_column_index: Optional explicit column index to read metric values from
Returns:
List of metric dictionaries with name, type, and value
"""
metrics = []
if table_format == "auto":
# First check if it's a transposed table (models as rows)
if is_transposed_table(header, rows):
table_format = "transposed"
else:
# Check if first column header is empty/generic (indicates benchmarks in rows)
first_header = header[0].lower().strip() if header else ""
is_first_col_benchmarks = not first_header or first_header in ["", "benchmark", "task", "dataset", "metric", "eval"]
if is_first_col_benchmarks:
table_format = "rows"
else:
# Heuristic: if first row has mostly numeric values, benchmarks are columns
try:
numeric_count = sum(
1 for cell in rows[0] if cell and
re.match(r"^\d+\.?\d*%?$", cell.replace(",", "").strip())
)
table_format = "columns" if numeric_count > len(rows[0]) / 2 else "rows"
except (IndexError, ValueError):
table_format = "rows"
if table_format == "rows":
# Benchmarks are in rows, scores in columns
# Try to identify the main model column if model_name is provided
target_column = model_column_index
if target_column is None and model_name:
target_column = find_main_model_column(header, model_name)
for row in rows:
if not row:
continue
benchmark_name = row[0].strip()
if not benchmark_name:
continue
# If we identified a specific column, use it; otherwise use first numeric value
if target_column is not None and target_column < len(row):
try:
value_str = row[target_column].replace("%", "").replace(",", "").strip()
if value_str:
value = float(value_str)
metrics.append({
"name": benchmark_name,
"type": benchmark_name.lower().replace(" ", "_"),
"value": value
})
except (ValueError, IndexError):
pass
else:
# Extract numeric values from remaining columns (original behavior)
for i, cell in enumerate(row[1:], start=1):
try:
# Remove common suffixes and convert to float
value_str = cell.replace("%", "").replace(",", "").strip()
if not value_str:
continue
value = float(value_str)
# Determine metric name
metric_name = benchmark_name
if len(header) > i and header[i].lower() not in ["score", "value", "result"]:
metric_name = f"{benchmark_name} ({header[i]})"
metrics.append({
"name": metric_name,
"type": benchmark_name.lower().replace(" ", "_"),
"value": value
})
break # Only take first numeric value per row
except (ValueError, IndexError):
continue
elif table_format == "transposed":
# Models are in rows (first column), benchmarks are in columns (header)
# Find the row that matches the target model
if not model_name:
print("Warning: model_name required for transposed table format")
return metrics
target_row_idx, available_models = find_main_model_row(rows, model_name)
if target_row_idx is None:
print(f"\n⚠ Could not find model '{model_name}' in transposed table")
if available_models:
print("\nAvailable models in table:")
for i, model in enumerate(available_models, 1):
print(f" {i}. {model}")
print("\nPlease select the correct model name from the list above.")
print("You can specify it using the --model-name-override flag:")
print(f' --model-name-override "{available_models[0]}"')
return metrics
target_row = rows[target_row_idx]
# Extract metrics from each column (skip first column which is model name)
for i in range(1, len(header)):
benchmark_name = header[i].strip()
if not benchmark_name or i >= len(target_row):
continue
try:
value_str = target_row[i].replace("%", "").replace(",", "").strip()
if not value_str:
continue
value = float(value_str)
metrics.append({
"name": benchmark_name,
"type": benchmark_name.lower().replace(" ", "_").replace("-", "_"),
"value": value
})
except (ValueError, AttributeError):
continue
else: # table_format == "columns"
# Benchmarks are in columns
if not rows:
return metrics
# Use first data row for values
data_row = rows[0]
for i, benchmark_name in enumerate(header):
if not benchmark_name or i >= len(data_row):
continue
try:
value_str = data_row[i].replace("%", "").replace(",", "").strip()
if not value_str:
continue
value = float(value_str)
metrics.append({
"name": benchmark_name,
"type": benchmark_name.lower().replace(" ", "_"),
"value": value
})
except ValueError:
continue
return metrics
def extract_evaluations_from_readme(
repo_id: str,
task_type: str = "text-generation",
dataset_name: str = "Benchmarks",
dataset_type: str = "benchmark",
model_name_override: Optional[str] = None,
table_index: Optional[int] = None,
model_column_index: Optional[int] = None
) -> Optional[List[Dict[str, Any]]]:
"""
Extract evaluation results from a model's README.
Args:
repo_id: Hugging Face model repository ID
task_type: Task type for model-index (e.g., "text-generation")
dataset_name: Name for the benchmark dataset
dataset_type: Type identifier for the dataset
model_name_override: Override model name for matching (column header for comparison tables)
table_index: 1-indexed table number from inspect-tables output
model_column_index: Column index (from inspect-tables output) to read metric values from
Returns:
Model-index formatted results or None if no evaluations found
"""
try:
load_env()
ModelCard = require_model_card()
hf_token = os.getenv("HF_TOKEN")
card = ModelCard.load(repo_id, token=hf_token)
readme_content = card.content
if not readme_content:
print(f"No README content found for {repo_id}")
return None
# Extract model name from repo_id or use override
if model_name_override:
model_name = model_name_override
print(f"Using model name override: '{model_name}'")
else:
model_name = repo_id.split("/")[-1] if "/" in repo_id else repo_id
# Use markdown-it parser for accurate table extraction
all_tables = extract_tables_with_parser(readme_content)
if not all_tables:
print(f"No tables found in README for {repo_id}")
return None
# If table_index specified, use that specific table
if table_index is not None:
if table_index < 1 or table_index > len(all_tables):
print(f"Invalid table index {table_index}. Found {len(all_tables)} tables.")
print("Run inspect-tables to see available tables.")
return None
tables_to_process = [all_tables[table_index - 1]]
else:
# Filter to evaluation tables only
eval_tables = []
for table in all_tables:
header = table.get("headers", [])
rows = table.get("rows", [])
if is_evaluation_table(header, rows):
eval_tables.append(table)
if len(eval_tables) > 1:
print(f"\n⚠ Found {len(eval_tables)} evaluation tables.")
print("Run inspect-tables first, then use --table to select one:")
print(f' uv run scripts/evaluation_manager.py inspect-tables --repo-id "{repo_id}"')
return None
elif len(eval_tables) == 0:
print(f"No evaluation tables found in README for {repo_id}")
return None
tables_to_process = eval_tables
# Extract metrics from selected table(s)
all_metrics = []
for table in tables_to_process:
header = table.get("headers", [])
rows = table.get("rows", [])
metrics = extract_metrics_from_table(
header,
rows,
model_name=model_name,
model_column_index=model_column_index
)
all_metrics.extend(metrics)
if not all_metrics:
print(f"No metrics extracted from table")
return None
# Build model-index structure
results = [{
"task": {"type": task_type},
"dataset": {
"name": dataset_name,
"type": dataset_type
},
"metrics": all_metrics,
"source": {
"name": "Model README",
"url": f"https://huggingface.co/{repo_id}"
}
}]
return results
except Exception as e:
print(f"Error extracting evaluations from README: {e}")
return None
# ============================================================================
# Table Inspection (using markdown-it-py for accurate parsing)
# ============================================================================
def extract_tables_with_parser(markdown_content: str) -> List[Dict[str, Any]]:
"""
Extract tables from markdown using markdown-it-py parser.
Uses GFM (GitHub Flavored Markdown) which includes table support.
"""
MarkdownIt = require_markdown_it()
# Disable linkify to avoid optional dependency errors; not needed for table parsing.
md = MarkdownIt("gfm-like", {"linkify": False})
tokens = md.parse(markdown_content)
tables = []
i = 0
while i < len(tokens):
token = tokens[i]
if token.type == "table_open":
table_data = {"headers": [], "rows": []}
current_row = []
in_header = False
i += 1
while i < len(tokens) and tokens[i].type != "table_close":
t = tokens[i]
if t.type == "thead_open":
in_header = True
elif t.type == "thead_close":
in_header = False
elif t.type == "tr_open":
current_row = []
elif t.type == "tr_close":
if in_header:
table_data["headers"] = current_row
else:
table_data["rows"].append(current_row)
current_row = []
elif t.type == "inline":
current_row.append(t.content.strip())
i += 1
if table_data["headers"] or table_data["rows"]:
tables.append(table_data)
i += 1
return tables
def detect_table_format(table: Dict[str, Any], repo_id: str) -> Dict[str, Any]:
"""Analyze a table to detect its format and identify model columns."""
headers = table.get("headers", [])
rows = table.get("rows", [])
if not headers or not rows:
return {"format": "unknown", "columns": headers, "model_columns": [], "row_count": 0, "sample_rows": []}
first_header = headers[0].lower() if headers else ""
is_first_col_benchmarks = not first_header or first_header in ["", "benchmark", "task", "dataset", "metric", "eval"]
# Check for numeric columns
numeric_columns = []
for col_idx in range(1, len(headers)):
numeric_count = 0
for row in rows[:5]:
if col_idx < len(row):
try:
val = re.sub(r'\s*\([^)]*\)', '', row[col_idx])
float(val.replace("%", "").replace(",", "").strip())
numeric_count += 1
except (ValueError, AttributeError):
pass
if numeric_count > len(rows[:5]) / 2:
numeric_columns.append(col_idx)
# Determine format
if is_first_col_benchmarks and len(numeric_columns) > 1:
format_type = "comparison"
elif is_first_col_benchmarks and len(numeric_columns) == 1:
format_type = "simple"
elif len(numeric_columns) > len(headers) / 2:
format_type = "transposed"
else:
format_type = "unknown"
# Find model columns
model_columns = []
model_name = repo_id.split("/")[-1] if "/" in repo_id else repo_id
model_tokens, _ = normalize_model_name(model_name)
for idx, header in enumerate(headers):
if idx == 0 and is_first_col_benchmarks:
continue
if header:
header_tokens, _ = normalize_model_name(header)
is_match = model_tokens == header_tokens
is_partial = model_tokens.issubset(header_tokens) or header_tokens.issubset(model_tokens)
model_columns.append({
"index": idx,
"header": header,
"is_exact_match": is_match,
"is_partial_match": is_partial and not is_match
})
return {
"format": format_type,
"columns": headers,
"model_columns": model_columns,
"row_count": len(rows),
"sample_rows": [row[0] for row in rows[:5] if row]
}
def inspect_tables(repo_id: str) -> None:
"""Inspect and display all evaluation tables in a model's README."""
try:
load_env()
ModelCard = require_model_card()
hf_token = os.getenv("HF_TOKEN")
card = ModelCard.load(repo_id, token=hf_token)
readme_content = card.content
if not readme_content:
print(f"No README content found for {repo_id}")
return
tables = extract_tables_with_parser(readme_content)
if not tables:
print(f"No tables found in README for {repo_id}")
return
print(f"\n{'='*70}")
print(f"Tables found in README for: {repo_id}")
print(f"{'='*70}")
eval_table_count = 0
for table in tables:
analysis = detect_table_format(table, repo_id)
if analysis["format"] == "unknown" and not analysis.get("sample_rows"):
continue
eval_table_count += 1
print(f"\n## Table {eval_table_count}")
print(f" Format: {analysis['format']}")
print(f" Rows: {analysis['row_count']}")
print(f"\n Columns ({len(analysis['columns'])}):")
for col_info in analysis.get("model_columns", []):
idx = col_info["index"]
header = col_info["header"]
if col_info["is_exact_match"]:
print(f" [{idx}] {header} ✓ EXACT MATCH")
elif col_info["is_partial_match"]:
print(f" [{idx}] {header} ~ partial match")
else:
print(f" [{idx}] {header}")
if analysis.get("sample_rows"):
print(f"\n Sample rows (first column):")
for row_val in analysis["sample_rows"][:5]:
print(f" - {row_val}")
if eval_table_count == 0:
print("\nNo evaluation tables detected.")
else:
print("\nSuggested next step:")
print(f' uv run scripts/evaluation_manager.py extract-readme --repo-id "{repo_id}" --table <table-number> [--model-column-index <column-index>]')
print(f"\n{'='*70}\n")
except Exception as e:
print(f"Error inspecting tables: {e}")
# ============================================================================
# Pull Request Management
# ============================================================================
def get_open_prs(repo_id: str) -> List[Dict[str, Any]]:
"""
Fetch open pull requests for a Hugging Face model repository.
Args:
repo_id: Hugging Face model repository ID (e.g., "allenai/Olmo-3-32B-Think")
Returns:
List of open PR dictionaries with num, title, author, and createdAt
"""
requests = require_requests()
url = f"https://huggingface.co/api/models/{repo_id}/discussions"
try:
response = requests.get(url, timeout=30, allow_redirects=True)
response.raise_for_status()
data = response.json()
discussions = data.get("discussions", [])
open_prs = [
{
"num": d["num"],
"title": d["title"],
"author": d["author"]["name"],
"createdAt": d.get("createdAt", "unknown"),
}
for d in discussions
if d.get("status") == "open" and d.get("isPullRequest")
]
return open_prs
except requests.RequestException as e:
print(f"Error fetching PRs from Hugging Face: {e}")
return []
def list_open_prs(repo_id: str) -> None:
"""Display open pull requests for a model repository."""
prs = get_open_prs(repo_id)
print(f"\n{'='*70}")
print(f"Open Pull Requests for: {repo_id}")
print(f"{'='*70}")
if not prs:
print("\nNo open pull requests found.")
else:
print(f"\nFound {len(prs)} open PR(s):\n")
for pr in prs:
print(f" PR #{pr['num']} - {pr['title']}")
print(f" Author: {pr['author']}")
print(f" Created: {pr['createdAt']}")
print(f" URL: https://huggingface.co/{repo_id}/discussions/{pr['num']}")
print()
print(f"{'='*70}\n")
# ============================================================================
# Method 2: Import from Artificial Analysis
# ============================================================================
def get_aa_model_data(creator_slug: str, model_name: str) -> Optional[Dict[str, Any]]:
"""
Fetch model evaluation data from Artificial Analysis API.
Args:
creator_slug: Creator identifier (e.g., "anthropic", "openai")
model_name: Model slug/identifier
Returns:
Model data dictionary or None if not found
"""
load_env()
AA_API_KEY = os.getenv("AA_API_KEY")
if not AA_API_KEY:
raise ValueError("AA_API_KEY environment variable is not set")
url = "https://artificialanalysis.ai/api/v2/data/llms/models"
headers = {"x-api-key": AA_API_KEY}
requests = require_requests()
try:
response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status()
data = response.json().get("data", [])
for model in data:
creator = model.get("model_creator", {})
if creator.get("slug") == creator_slug and model.get("slug") == model_name:
return model
print(f"Model {creator_slug}/{model_name} not found in Artificial Analysis")
return None
except requests.RequestException as e:
print(f"Error fetching data from Artificial Analysis: {e}")
return None
def aa_data_to_model_index(
model_data: Dict[str, Any],
dataset_name: str = "Artificial Analysis Benchmarks",
dataset_type: str = "artificial_analysis",
task_type: str = "evaluation"
) -> List[Dict[str, Any]]:
"""
Convert Artificial Analysis model data to model-index format.
Args:
model_data: Raw model data from AA API
dataset_name: Dataset name for model-index
dataset_type: Dataset type identifier
task_type: Task type for model-index
Returns:
Model-index formatted results
"""
model_name = model_data.get("name", model_data.get("slug", "unknown-model"))
evaluations = model_data.get("evaluations", {})
if not evaluations:
print(f"No evaluations found for model {model_name}")
return []
metrics = []
for key, value in evaluations.items():
if value is not None:
metrics.append({
"name": key.replace("_", " ").title(),
"type": key,
"value": value
})
results = [{
"task": {"type": task_type},
"dataset": {
"name": dataset_name,
"type": dataset_type
},
"metrics": metrics,
"source": {
"name": "Artificial Analysis API",
"url": "https://artificialanalysis.ai"
}
}]
return results
def import_aa_evaluations(
creator_slug: str,
model_name: str,
repo_id: str
) -> Optional[List[Dict[str, Any]]]:
"""
Import evaluation results from Artificial Analysis for a model.
Args:
creator_slug: Creator identifier in AA
model_name: Model identifier in AA
repo_id: Hugging Face repository ID to update
Returns:
Model-index formatted results or None if import fails
"""
model_data = get_aa_model_data(creator_slug, model_name)
if not model_data:
return None
results = aa_data_to_model_index(model_data)
return results
# ============================================================================
# Model Card Update Functions
# ============================================================================
def update_model_card_with_evaluations(
repo_id: str,
results: List[Dict[str, Any]],
create_pr: bool = False,
commit_message: Optional[str] = None
) -> bool:
"""
Update a model card with evaluation results.
Args:
repo_id: Hugging Face repository ID
results: Model-index formatted results
create_pr: Whether to create a PR instead of direct push
commit_message: Custom commit message
Returns:
True if successful, False otherwise
"""
try:
load_env()
ModelCard = require_model_card()
hf_token = os.getenv("HF_TOKEN")
if not hf_token:
raise ValueError("HF_TOKEN environment variable is not set")
# Load existing card
card = ModelCard.load(repo_id, token=hf_token)
# Get model name
model_name = repo_id.split("/")[-1] if "/" in repo_id else repo_id
# Create or update model-index
model_index = [{
"name": model_name,
"results": results
}]
# Merge with existing model-index if present
if "model-index" in card.data:
existing = card.data["model-index"]
if isinstance(existing, list) and existing:
# Keep existing name if present
if "name" in existing[0]:
model_index[0]["name"] = existing[0]["name"]
# Merge results
existing_results = existing[0].get("results", [])
model_index[0]["results"].extend(existing_results)
card.data["model-index"] = model_index
# Prepare commit message
if not commit_message:
commit_message = f"Add evaluation results to {model_name}"
commit_description = (
"This commit adds structured evaluation results to the model card. "
"The results are formatted using the model-index specification and "
"will be displayed in the model card's evaluation widget."
)
# Push update
card.push_to_hub(
repo_id,
token=hf_token,
commit_message=commit_message,
commit_description=commit_description,
create_pr=create_pr
)
action = "Pull request created" if create_pr else "Model card updated"
print(f"✓ {action} successfully for {repo_id}")
return True
except Exception as e:
print(f"Error updating model card: {e}")
return False
def show_evaluations(repo_id: str) -> None:
"""Display current evaluations in a model card."""
try:
load_env()
ModelCard = require_model_card()
hf_token = os.getenv("HF_TOKEN")
card = ModelCard.load(repo_id, token=hf_token)
if "model-index" not in card.data:
print(f"No model-index found in {repo_id}")
return
model_index = card.data["model-index"]
print(f"\nEvaluations for {repo_id}:")
print("=" * 60)
for model_entry in model_index:
model_name = model_entry.get("name", "Unknown")
print(f"\nModel: {model_name}")
results = model_entry.get("results", [])
for i, result in enumerate(results, 1):
print(f"\n Result Set {i}:")
task = result.get("task", {})
print(f" Task: {task.get('type', 'unknown')}")
dataset = result.get("dataset", {})
print(f" Dataset: {dataset.get('name', 'unknown')}")
metrics = result.get("metrics", [])
print(f" Metrics ({len(metrics)}):")
for metric in metrics:
name = metric.get("name", "Unknown")
value = metric.get("value", "N/A")
print(f" - {name}: {value}")
source = result.get("source", {})
if source:
print(f" Source: {source.get('name', 'Unknown')}")
print("\n" + "=" * 60)
except Exception as e:
print(f"Error showing evaluations: {e}")
def validate_model_index(repo_id: str) -> bool:
"""Validate model-index format in a model card."""
try:
load_env()
ModelCard = require_model_card()
hf_token = os.getenv("HF_TOKEN")
card = ModelCard.load(repo_id, token=hf_token)
if "model-index" not in card.data:
print(f"✗ No model-index found in {repo_id}")
return False
model_index = card.data["model-index"]
if not isinstance(model_index, list):
print("✗ model-index must be a list")
return False
for i, entry in enumerate(model_index):
if "name" not in entry:
print(f"✗ Entry {i} missing 'name' field")
return False
if "results" not in entry:
print(f"✗ Entry {i} missing 'results' field")
return False
for j, result in enumerate(entry["results"]):
if "task" not in result:
print(f"✗ Result {j} in entry {i} missing 'task' field")
return False
if "dataset" not in result:
print(f"✗ Result {j} in entry {i} missing 'dataset' field")
return False
if "metrics" not in result:
print(f"✗ Result {j} in entry {i} missing 'metrics' field")
return False
print(f"✓ Model-index format is valid for {repo_id}")
return True
except Exception as e:
print(f"Error validating model-index: {e}")
return False
# ============================================================================
# CLI Interface
# ============================================================================
def main():
parser = argparse.ArgumentParser(
description=(
"Manage evaluation results in Hugging Face model cards.\n\n"
"Use standard Python or `uv run scripts/evaluation_manager.py ...` "
"to auto-resolve dependencies from the PEP 723 header."
),
formatter_class=argparse.RawTextHelpFormatter,
epilog=dedent(
"""\
Typical workflows:
- Inspect tables first:
uv run scripts/evaluation_manager.py inspect-tables --repo-id <model>
- Extract from README (prints YAML by default):
uv run scripts/evaluation_manager.py extract-readme --repo-id <model> --table N
- Apply changes:
uv run scripts/evaluation_manager.py extract-readme --repo-id <model> --table N --apply
- Import from Artificial Analysis:
AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa --creator-slug org --model-name slug --repo-id <model>
Tips:
- YAML is printed by default; use --apply or --create-pr to write changes.
- Set HF_TOKEN (and AA_API_KEY for import-aa); .env is loaded automatically if python-dotenv is installed.
- When multiple tables exist, run inspect-tables then select with --table N.
- To apply changes (push or PR), rerun extract-readme with --apply or --create-pr.
"""
),
)
parser.add_argument("--version", action="version", version="evaluation_manager 1.2.0")
subparsers = parser.add_subparsers(dest="command", help="Command to execute")
# Extract from README command
extract_parser = subparsers.add_parser(
"extract-readme",
help="Extract evaluation tables from model README",
formatter_class=argparse.RawTextHelpFormatter,
description="Parse README tables into model-index YAML. Default behavior prints YAML; use --apply/--create-pr to write changes.",
epilog=dedent(
"""\
Examples:
uv run scripts/evaluation_manager.py extract-readme --repo-id username/model
uv run scripts/evaluation_manager.py extract-readme --repo-id username/model --table 2 --model-column-index 3
uv run scripts/evaluation_manager.py extract-readme --repo-id username/model --table 2 --model-name-override \"**Model 7B**\" # exact header text
uv run scripts/evaluation_manager.py extract-readme --repo-id username/model --table 2 --create-pr
Apply changes:
- Default: prints YAML to stdout (no writes).
- Add --apply to push directly, or --create-pr to open a PR.
Model selection:
- Preferred: --model-column-index <header index shown by inspect-tables>
- If using --model-name-override, copy the column header text exactly.
"""
),
)
extract_parser.add_argument("--repo-id", type=str, required=True, help="HF repository ID")
extract_parser.add_argument("--table", type=int, help="Table number (1-indexed, from inspect-tables output)")
extract_parser.add_argument("--model-column-index", type=int, help="Preferred: column index from inspect-tables output (exact selection)")
extract_parser.add_argument("--model-name-override", type=str, help="Exact column header/model name for comparison/transpose tables (when index is not used)")
extract_parser.add_argument("--task-type", type=str, default="text-generation", help="Sets model-index task.type (e.g., text-generation, summarization)")
extract_parser.add_argument("--dataset-name", type=str, default="Benchmarks", help="Dataset name")
extract_parser.add_argument("--dataset-type", type=str, default="benchmark", help="Dataset type")
extract_parser.add_argument("--create-pr", action="store_true", help="Create PR instead of direct push")
extract_parser.add_argument("--apply", action="store_true", help="Apply changes (default is to print YAML only)")
extract_parser.add_argument("--dry-run", action="store_true", help="Preview YAML without updating (default)")
# Import from AA command
aa_parser = subparsers.add_parser(
"import-aa",
help="Import evaluation scores from Artificial Analysis",
formatter_class=argparse.RawTextHelpFormatter,
description="Fetch scores from Artificial Analysis API and write them into model-index.",
epilog=dedent(
"""\
Examples:
AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa --creator-slug anthropic --model-name claude-sonnet-4 --repo-id username/model
uv run scripts/evaluation_manager.py import-aa --creator-slug openai --model-name gpt-4o --repo-id username/model --create-pr
Requires: AA_API_KEY in env (or .env if python-dotenv installed).
"""
),
)
aa_parser.add_argument("--creator-slug", type=str, required=True, help="AA creator slug")
aa_parser.add_argument("--model-name", type=str, required=True, help="AA model name")
aa_parser.add_argument("--repo-id", type=str, required=True, help="HF repository ID")
aa_parser.add_argument("--create-pr", action="store_true", help="Create PR instead of direct push")
# Show evaluations command
show_parser = subparsers.add_parser(
"show",
help="Display current evaluations in model card",
formatter_class=argparse.RawTextHelpFormatter,
description="Print model-index content from the model card (requires HF_TOKEN for private repos).",
)
show_parser.add_argument("--repo-id", type=str, required=True, help="HF repository ID")
# Validate command
validate_parser = subparsers.add_parser(
"validate",
help="Validate model-index format",
formatter_class=argparse.RawTextHelpFormatter,
description="Schema sanity check for model-index section of the card.",
)
validate_parser.add_argument("--repo-id", type=str, required=True, help="HF repository ID")
# Inspect tables command
inspect_parser = subparsers.add_parser(
"inspect-tables",
help="Inspect tables in README → outputs suggested extract-readme command",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Workflow:
1. inspect-tables → see table structure, columns, and table numbers
2. extract-readme → run with --table N (from step 1); YAML prints by default
3. apply changes → rerun extract-readme with --apply or --create-pr
Reminder:
- Preferred: use --model-column-index <index>. If needed, use --model-name-override with the exact column header text.
"""
)
inspect_parser.add_argument("--repo-id", type=str, required=True, help="HF repository ID")
# Get PRs command
prs_parser = subparsers.add_parser(
"get-prs",
help="List open pull requests for a model repository",
formatter_class=argparse.RawTextHelpFormatter,
description="Check for existing open PRs before creating new ones to avoid duplicates.",
epilog=dedent(
"""\
Examples:
uv run scripts/evaluation_manager.py get-prs --repo-id "allenai/Olmo-3-32B-Think"
IMPORTANT: Always run this before using --create-pr to avoid duplicate PRs.
"""
),
)
prs_parser.add_argument("--repo-id", type=str, required=True, help="HF repository ID")
args = parser.parse_args()
if not args.command:
parser.print_help()
return
try:
# Execute command
if args.command == "extract-readme":
results = extract_evaluations_from_readme(
repo_id=args.repo_id,
task_type=args.task_type,
dataset_name=args.dataset_name,
dataset_type=args.dataset_type,
model_name_override=args.model_name_override,
table_index=args.table,
model_column_index=args.model_column_index
)
if not results:
print("No evaluations extracted")
return
apply_changes = args.apply or args.create_pr
# Default behavior: print YAML (dry-run)
yaml = require_yaml()
print("\nExtracted evaluations (YAML):")
print(
yaml.dump(
{"model-index": [{"name": args.repo_id.split('/')[-1], "results": results}]},
sort_keys=False
)
)
if apply_changes:
if args.model_name_override and args.model_column_index is not None:
print("Note: --model-column-index takes precedence over --model-name-override.")
update_model_card_with_evaluations(
repo_id=args.repo_id,
results=results,
create_pr=args.create_pr,
commit_message="Extract evaluation results from README"
)
elif args.command == "import-aa":
results = import_aa_evaluations(
creator_slug=args.creator_slug,
model_name=args.model_name,
repo_id=args.repo_id
)
if not results:
print("No evaluations imported")
return
update_model_card_with_evaluations(
repo_id=args.repo_id,
results=results,
create_pr=args.create_pr,
commit_message=f"Add Artificial Analysis evaluations for {args.model_name}"
)
elif args.command == "show":
show_evaluations(args.repo_id)
elif args.command == "validate":
validate_model_index(args.repo_id)
elif args.command == "inspect-tables":
inspect_tables(args.repo_id)
elif args.command == "get-prs":
list_open_prs(args.repo_id)
except ModuleNotFoundError as exc:
# Surface dependency hints cleanly when user only needs help output
print(exc)
except Exception as exc:
print(f"Error: {exc}")
if __name__ == "__main__":
main()
```
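Before running the script against a real repository, it can help to see the shape of the data it produces. The sketch below (not part of the script; the model name, repo URL, and benchmark values are illustrative) mirrors the "rows"-format path of `extract_metrics_from_table` and the model-index structure that `extract_evaluations_from_readme` returns:

```python
# Minimal sketch of how a "rows"-format benchmark table maps onto the
# model-index structure. Values and names here are made up for illustration.
header = ["Benchmark", "Score"]
rows = [["MMLU", "68.4"], ["GSM8K", "71.2%"]]

metrics = []
for row in rows:
    name = row[0].strip()
    # Same normalization the script applies: strip "%" and "," before float().
    value = float(row[1].replace("%", "").replace(",", "").strip())
    metrics.append({
        "name": name,
        "type": name.lower().replace(" ", "_"),
        "value": value,
    })

model_index = [{
    "name": "example-model",  # hypothetical repo name
    "results": [{
        "task": {"type": "text-generation"},
        "dataset": {"name": "Benchmarks", "type": "benchmark"},
        "metrics": metrics,
        "source": {
            "name": "Model README",
            "url": "https://huggingface.co/example/example-model",
        },
    }],
}]
print(model_index[0]["results"][0]["metrics"])
```

This is the structure that `update_model_card_with_evaluations` writes under the `model-index` key of the card's YAML front matter.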
### scripts/run_eval_job.py
```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
# "huggingface-hub>=0.26.0",
# "python-dotenv>=1.2.1",
# ]
# ///
"""
Submit evaluation jobs using the `hf jobs uv run` CLI.
This wrapper constructs the appropriate command to execute the local
`inspect_eval_uv.py` script on Hugging Face Jobs with the requested hardware.
"""
import argparse
import os
import subprocess
import sys
from pathlib import Path
from typing import Optional
from huggingface_hub import get_token
from dotenv import load_dotenv
load_dotenv()
SCRIPT_PATH = Path(__file__).with_name("inspect_eval_uv.py").resolve()
def create_eval_job(
model_id: str,
task: str,
hardware: str = "cpu-basic",
hf_token: Optional[str] = None,
limit: Optional[int] = None,
) -> None:
"""
Submit an evaluation job using the Hugging Face Jobs CLI.
"""
token = hf_token or os.getenv("HF_TOKEN") or get_token()
if not token:
raise ValueError("HF_TOKEN is required. Set it in environment or pass as argument.")
if not SCRIPT_PATH.exists():
raise FileNotFoundError(f"Script not found at {SCRIPT_PATH}")
print(f"Preparing evaluation job for {model_id} on task {task} (hardware: {hardware})")
cmd = [
"hf",
"jobs",
"uv",
"run",
str(SCRIPT_PATH),
"--flavor",
hardware,
"--secrets",
f"HF_TOKEN={token}",
"--",
"--model",
model_id,
"--task",
task,
]
if limit:
cmd.extend(["--limit", str(limit)])
print("Executing:", " ".join(cmd))
try:
subprocess.run(cmd, check=True)
except subprocess.CalledProcessError as exc:
print(f"hf jobs command failed: {exc}", file=sys.stderr)
raise
def main() -> None:
parser = argparse.ArgumentParser(description="Run inspect-ai evaluations on Hugging Face Jobs")
parser.add_argument("--model", required=True, help="Model ID (e.g. Qwen/Qwen3-0.6B)")
parser.add_argument("--task", required=True, help="Inspect task (e.g. mmlu, gsm8k)")
parser.add_argument("--hardware", default="cpu-basic", help="Hardware flavor (e.g. t4-small, a10g-small)")
parser.add_argument("--limit", type=int, default=None, help="Limit number of samples to evaluate")
args = parser.parse_args()
create_eval_job(
model_id=args.model,
task=args.task,
hardware=args.hardware,
limit=args.limit,
)
if __name__ == "__main__":
main()
```
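For reference, here is a sketch of the command list the wrapper above assembles before handing it to `subprocess.run`. The model ID, hardware flavor, and token are placeholder values; the key detail is the `--` separator, after which arguments are forwarded to the eval script rather than consumed by `hf jobs`:

```python
# Sketch of the `hf jobs uv run` invocation built by create_eval_job.
# All concrete values below are illustrative placeholders.
script_path = "scripts/inspect_eval_uv.py"
cmd = [
    "hf", "jobs", "uv", "run", script_path,
    "--flavor", "t4-small",              # requested hardware flavor
    "--secrets", "HF_TOKEN=<token>",     # injected as a job secret
    "--",                                # everything after this goes to the eval script
    "--model", "Qwen/Qwen3-0.6B",
    "--task", "gsm8k",
]
cmd.extend(["--limit", "100"])  # optional sample cap, appended when --limit is set
print(" ".join(cmd))
```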
### scripts/lighteval_vllm_uv.py
```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
# "lighteval[accelerate,vllm]>=0.6.0",
# "torch>=2.0.0",
# "transformers>=4.40.0",
# "accelerate>=0.30.0",
# "vllm>=0.4.0",
# ]
# ///
"""
Entry point script for running lighteval evaluations with vLLM backend via `hf jobs uv run`.
This script runs evaluations using vLLM for efficient GPU inference on custom HuggingFace models.
It is separate from the inference provider scripts and evaluates models directly on the job's hardware.
Usage (standalone):
python lighteval_vllm_uv.py --model "meta-llama/Llama-3.2-1B" --tasks "leaderboard|mmlu|5"
Usage (via HF Jobs):
hf jobs uv run lighteval_vllm_uv.py \\
--flavor a10g-small \\
    --secrets HF_TOKEN=$HF_TOKEN \\
-- --model "meta-llama/Llama-3.2-1B" --tasks "leaderboard|mmlu|5"
"""
from __future__ import annotations
import argparse
import os
import subprocess
import sys
from typing import Optional
def setup_environment() -> None:
"""Configure environment variables for HuggingFace authentication."""
hf_token = os.getenv("HF_TOKEN")
if hf_token:
os.environ.setdefault("HUGGING_FACE_HUB_TOKEN", hf_token)
os.environ.setdefault("HF_HUB_TOKEN", hf_token)
def run_lighteval_vllm(
model_id: str,
tasks: str,
output_dir: Optional[str] = None,
max_samples: Optional[int] = None,
batch_size: int = 1,
tensor_parallel_size: int = 1,
gpu_memory_utilization: float = 0.8,
dtype: str = "auto",
trust_remote_code: bool = False,
use_chat_template: bool = False,
system_prompt: Optional[str] = None,
) -> None:
"""
Run lighteval with vLLM backend for efficient GPU inference.
Args:
model_id: HuggingFace model ID (e.g., "meta-llama/Llama-3.2-1B")
tasks: Task specification (e.g., "leaderboard|mmlu|5" or "lighteval|hellaswag|0")
output_dir: Directory for evaluation results
max_samples: Limit number of samples per task
batch_size: Batch size for evaluation
tensor_parallel_size: Number of GPUs for tensor parallelism
gpu_memory_utilization: GPU memory fraction to use (0.0-1.0)
dtype: Data type for model weights (auto, float16, bfloat16)
trust_remote_code: Allow executing remote code from model repo
use_chat_template: Apply chat template for conversational models
system_prompt: System prompt for chat models
"""
setup_environment()
# Build lighteval vllm command
cmd = [
"lighteval",
"vllm",
model_id,
tasks,
"--batch-size", str(batch_size),
"--tensor-parallel-size", str(tensor_parallel_size),
"--gpu-memory-utilization", str(gpu_memory_utilization),
"--dtype", dtype,
]
if output_dir:
cmd.extend(["--output-dir", output_dir])
if max_samples:
cmd.extend(["--max-samples", str(max_samples)])
if trust_remote_code:
cmd.append("--trust-remote-code")
if use_chat_template:
cmd.append("--use-chat-template")
if system_prompt:
cmd.extend(["--system-prompt", system_prompt])
print(f"Running: {' '.join(cmd)}")
try:
subprocess.run(cmd, check=True)
print("Evaluation complete.")
except subprocess.CalledProcessError as exc:
print(f"Evaluation failed with exit code {exc.returncode}", file=sys.stderr)
sys.exit(exc.returncode)
def run_lighteval_accelerate(
model_id: str,
tasks: str,
output_dir: Optional[str] = None,
max_samples: Optional[int] = None,
batch_size: int = 1,
dtype: str = "bfloat16",
trust_remote_code: bool = False,
use_chat_template: bool = False,
system_prompt: Optional[str] = None,
) -> None:
"""
Run lighteval with accelerate backend for multi-GPU distributed inference.
Use this backend when vLLM is not available or for models not supported by vLLM.
Args:
model_id: HuggingFace model ID
tasks: Task specification
output_dir: Directory for evaluation results
max_samples: Limit number of samples per task
batch_size: Batch size for evaluation
dtype: Data type for model weights
trust_remote_code: Allow executing remote code
use_chat_template: Apply chat template
system_prompt: System prompt for chat models
"""
setup_environment()
# Build lighteval accelerate command
cmd = [
"lighteval",
"accelerate",
model_id,
tasks,
"--batch-size", str(batch_size),
"--dtype", dtype,
]
if output_dir:
cmd.extend(["--output-dir", output_dir])
if max_samples:
cmd.extend(["--max-samples", str(max_samples)])
if trust_remote_code:
cmd.append("--trust-remote-code")
if use_chat_template:
cmd.append("--use-chat-template")
if system_prompt:
cmd.extend(["--system-prompt", system_prompt])
print(f"Running: {' '.join(cmd)}")
try:
subprocess.run(cmd, check=True)
print("Evaluation complete.")
except subprocess.CalledProcessError as exc:
print(f"Evaluation failed with exit code {exc.returncode}", file=sys.stderr)
sys.exit(exc.returncode)
def main() -> None:
parser = argparse.ArgumentParser(
description="Run lighteval evaluations with vLLM or accelerate backend on custom HuggingFace models",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Run MMLU evaluation with vLLM
python lighteval_vllm_uv.py --model meta-llama/Llama-3.2-1B --tasks "leaderboard|mmlu|5"
# Run with accelerate backend instead of vLLM
python lighteval_vllm_uv.py --model meta-llama/Llama-3.2-1B --tasks "leaderboard|mmlu|5" --backend accelerate
# Run with chat template for instruction-tuned models
python lighteval_vllm_uv.py --model meta-llama/Llama-3.2-1B-Instruct --tasks "leaderboard|mmlu|5" --use-chat-template
# Run with limited samples for testing
python lighteval_vllm_uv.py --model meta-llama/Llama-3.2-1B --tasks "leaderboard|mmlu|5" --max-samples 10
Task format:
Tasks use the format: "suite|task|num_fewshot"
- leaderboard|mmlu|5 (MMLU with 5-shot)
- lighteval|hellaswag|0 (HellaSwag zero-shot)
- leaderboard|gsm8k|5 (GSM8K with 5-shot)
- Multiple tasks: "leaderboard|mmlu|5,leaderboard|gsm8k|5"
""",
)
parser.add_argument(
"--model",
required=True,
help="HuggingFace model ID (e.g., meta-llama/Llama-3.2-1B)",
)
parser.add_argument(
"--tasks",
required=True,
help="Task specification (e.g., 'leaderboard|mmlu|5')",
)
parser.add_argument(
"--backend",
choices=["vllm", "accelerate"],
default="vllm",
help="Inference backend to use (default: vllm)",
)
parser.add_argument(
"--output-dir",
default=None,
help="Directory for evaluation results",
)
parser.add_argument(
"--max-samples",
type=int,
default=None,
help="Limit number of samples per task (useful for testing)",
)
parser.add_argument(
"--batch-size",
type=int,
default=1,
help="Batch size for evaluation (default: 1)",
)
parser.add_argument(
"--tensor-parallel-size",
type=int,
default=1,
help="Number of GPUs for tensor parallelism (vLLM only, default: 1)",
)
parser.add_argument(
"--gpu-memory-utilization",
type=float,
default=0.8,
help="GPU memory fraction to use (vLLM only, default: 0.8)",
)
parser.add_argument(
"--dtype",
default="auto",
choices=["auto", "float16", "bfloat16", "float32"],
help="Data type for model weights (default: auto)",
)
parser.add_argument(
"--trust-remote-code",
action="store_true",
help="Allow executing remote code from model repository",
)
parser.add_argument(
"--use-chat-template",
action="store_true",
help="Apply chat template for instruction-tuned/chat models",
)
parser.add_argument(
"--system-prompt",
default=None,
help="System prompt for chat models",
)
args = parser.parse_args()
if args.backend == "vllm":
run_lighteval_vllm(
model_id=args.model,
tasks=args.tasks,
output_dir=args.output_dir,
max_samples=args.max_samples,
batch_size=args.batch_size,
tensor_parallel_size=args.tensor_parallel_size,
gpu_memory_utilization=args.gpu_memory_utilization,
dtype=args.dtype,
trust_remote_code=args.trust_remote_code,
use_chat_template=args.use_chat_template,
system_prompt=args.system_prompt,
)
else:
run_lighteval_accelerate(
model_id=args.model,
tasks=args.tasks,
output_dir=args.output_dir,
max_samples=args.max_samples,
batch_size=args.batch_size,
dtype=args.dtype if args.dtype != "auto" else "bfloat16",
trust_remote_code=args.trust_remote_code,
use_chat_template=args.use_chat_template,
system_prompt=args.system_prompt,
)
if __name__ == "__main__":
main()
```
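The epilog above describes lighteval's `suite|task|num_fewshot` task spec, with commas separating multiple tasks. A small illustrative parser (`parse_task_specs` is a hypothetical helper, not part of the script) shows how such a spec decomposes:

```python
def parse_task_specs(spec: str) -> list:
    # Split a comma-separated lighteval spec into its components:
    # each entry follows "suite|task|num_fewshot".
    out = []
    for item in spec.split(","):
        suite, task, fewshot = item.split("|")
        out.append({"suite": suite, "task": task, "num_fewshot": int(fewshot)})
    return out

specs = parse_task_specs("leaderboard|mmlu|5,leaderboard|gsm8k|5")
assert specs[0] == {"suite": "leaderboard", "task": "mmlu", "num_fewshot": 5}
assert specs[1]["task"] == "gsm8k"
```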
### scripts/inspect_vllm_uv.py
```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
# "inspect-ai>=0.3.0",
# "inspect-evals",
# "vllm>=0.4.0",
# "torch>=2.0.0",
# "transformers>=4.40.0",
# ]
# ///
"""
Entry point script for running inspect-ai evaluations with vLLM or HuggingFace Transformers backend.
This script runs evaluations on custom HuggingFace models using local GPU inference,
separate from inference provider scripts (which use external APIs).
Usage (standalone):
python inspect_vllm_uv.py --model "meta-llama/Llama-3.2-1B" --task "mmlu"
Usage (via HF Jobs):
hf jobs uv run inspect_vllm_uv.py \\
--flavor a10g-small \\
    --secrets HF_TOKEN=$HF_TOKEN \\
-- --model "meta-llama/Llama-3.2-1B" --task "mmlu"
Model backends:
- vllm: Fast inference with vLLM (recommended for large models)
- hf: HuggingFace Transformers backend (broader model compatibility)
"""
from __future__ import annotations
import argparse
import os
import subprocess
import sys
from typing import Optional
def setup_environment() -> None:
"""Configure environment variables for HuggingFace authentication."""
hf_token = os.getenv("HF_TOKEN")
if hf_token:
os.environ.setdefault("HUGGING_FACE_HUB_TOKEN", hf_token)
os.environ.setdefault("HF_HUB_TOKEN", hf_token)
def run_inspect_vllm(
model_id: str,
task: str,
limit: Optional[int] = None,
max_connections: int = 4,
temperature: float = 0.0,
tensor_parallel_size: int = 1,
gpu_memory_utilization: float = 0.8,
dtype: str = "auto",
trust_remote_code: bool = False,
log_level: str = "info",
) -> None:
"""
Run inspect-ai evaluation with vLLM backend.
Args:
model_id: HuggingFace model ID
task: inspect-ai task to execute (e.g., "mmlu", "gsm8k")
limit: Limit number of samples to evaluate
max_connections: Maximum concurrent connections
temperature: Sampling temperature
tensor_parallel_size: Number of GPUs for tensor parallelism
gpu_memory_utilization: GPU memory fraction
dtype: Data type (auto, float16, bfloat16)
trust_remote_code: Allow remote code execution
log_level: Logging level
"""
setup_environment()
model_spec = f"vllm/{model_id}"
cmd = [
"inspect",
"eval",
task,
"--model",
model_spec,
"--log-level",
log_level,
"--max-connections",
str(max_connections),
]
# vLLM supports temperature=0 unlike HF inference providers
cmd.extend(["--temperature", str(temperature)])
    # `inspect eval` has no top-level flags for these vLLM settings; pass them
    # as model args via the repeatable -M key=value option instead.
    if tensor_parallel_size != 1:
        cmd.extend(["-M", f"tensor_parallel_size={tensor_parallel_size}"])
    if gpu_memory_utilization != 0.8:
        cmd.extend(["-M", f"gpu_memory_utilization={gpu_memory_utilization}"])
    if dtype != "auto":
        cmd.extend(["-M", f"dtype={dtype}"])
    if trust_remote_code:
        cmd.extend(["-M", "trust_remote_code=true"])
if limit:
cmd.extend(["--limit", str(limit)])
print(f"Running: {' '.join(cmd)}")
try:
subprocess.run(cmd, check=True)
print("Evaluation complete.")
except subprocess.CalledProcessError as exc:
print(f"Evaluation failed with exit code {exc.returncode}", file=sys.stderr)
sys.exit(exc.returncode)
def run_inspect_hf(
model_id: str,
task: str,
limit: Optional[int] = None,
max_connections: int = 1,
temperature: float = 0.001,
device: str = "auto",
dtype: str = "auto",
trust_remote_code: bool = False,
log_level: str = "info",
) -> None:
"""
Run inspect-ai evaluation with HuggingFace Transformers backend.
Use this when vLLM doesn't support the model architecture.
Args:
model_id: HuggingFace model ID
task: inspect-ai task to execute
limit: Limit number of samples
max_connections: Maximum concurrent connections (keep low for memory)
temperature: Sampling temperature
device: Device to use (auto, cuda, cpu)
dtype: Data type
trust_remote_code: Allow remote code execution
log_level: Logging level
"""
setup_environment()
model_spec = f"hf/{model_id}"
cmd = [
"inspect",
"eval",
task,
"--model",
model_spec,
"--log-level",
log_level,
"--max-connections",
str(max_connections),
"--temperature",
str(temperature),
]
    # Device, dtype, and trust_remote_code are provider settings, passed as
    # model args via inspect-ai's -M key=value option.
    if device != "auto":
        cmd.extend(["-M", f"device={device}"])
    if dtype != "auto":
        cmd.extend(["-M", f"dtype={dtype}"])
    if trust_remote_code:
        cmd.extend(["-M", "trust_remote_code=true"])
if limit:
cmd.extend(["--limit", str(limit)])
print(f"Running: {' '.join(cmd)}")
try:
subprocess.run(cmd, check=True)
print("Evaluation complete.")
except subprocess.CalledProcessError as exc:
print(f"Evaluation failed with exit code {exc.returncode}", file=sys.stderr)
sys.exit(exc.returncode)
def main() -> None:
parser = argparse.ArgumentParser(
description="Run inspect-ai evaluations with vLLM or HuggingFace Transformers on custom models",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Run MMLU with vLLM backend
python inspect_vllm_uv.py --model meta-llama/Llama-3.2-1B --task mmlu
# Run with HuggingFace Transformers backend
python inspect_vllm_uv.py --model meta-llama/Llama-3.2-1B --task mmlu --backend hf
# Run with limited samples for testing
python inspect_vllm_uv.py --model meta-llama/Llama-3.2-1B --task mmlu --limit 10
# Run on multiple GPUs with tensor parallelism
python inspect_vllm_uv.py --model meta-llama/Llama-3.2-70B --task mmlu --tensor-parallel-size 4
Available tasks (from inspect-evals):
- mmlu: Massive Multitask Language Understanding
- gsm8k: Grade School Math
- hellaswag: Common sense reasoning
- arc_challenge: AI2 Reasoning Challenge
- truthfulqa: TruthfulQA benchmark
- winogrande: Winograd Schema Challenge
- humaneval: Code generation (HumanEval)
Via HF Jobs:
hf jobs uv run inspect_vllm_uv.py \\
--flavor a10g-small \\
    --secrets HF_TOKEN=$HF_TOKEN \\
-- --model meta-llama/Llama-3.2-1B --task mmlu
""",
)
parser.add_argument(
"--model",
required=True,
help="HuggingFace model ID (e.g., meta-llama/Llama-3.2-1B)",
)
parser.add_argument(
"--task",
required=True,
help="inspect-ai task to execute (e.g., mmlu, gsm8k)",
)
parser.add_argument(
"--backend",
choices=["vllm", "hf"],
default="vllm",
help="Model backend (default: vllm)",
)
parser.add_argument(
"--limit",
type=int,
default=None,
help="Limit number of samples to evaluate",
)
parser.add_argument(
"--max-connections",
type=int,
default=None,
help="Maximum concurrent connections (default: 4 for vllm, 1 for hf)",
)
parser.add_argument(
"--temperature",
type=float,
default=None,
help="Sampling temperature (default: 0.0 for vllm, 0.001 for hf)",
)
parser.add_argument(
"--tensor-parallel-size",
type=int,
default=1,
help="Number of GPUs for tensor parallelism (vLLM only, default: 1)",
)
parser.add_argument(
"--gpu-memory-utilization",
type=float,
default=0.8,
help="GPU memory fraction to use (vLLM only, default: 0.8)",
)
parser.add_argument(
"--dtype",
default="auto",
choices=["auto", "float16", "bfloat16", "float32"],
help="Data type for model weights (default: auto)",
)
parser.add_argument(
"--device",
default="auto",
help="Device for HF backend (auto, cuda, cpu)",
)
parser.add_argument(
"--trust-remote-code",
action="store_true",
help="Allow executing remote code from model repository",
)
parser.add_argument(
"--log-level",
default="info",
choices=["debug", "info", "warning", "error"],
help="Logging level (default: info)",
)
args = parser.parse_args()
if args.backend == "vllm":
run_inspect_vllm(
model_id=args.model,
task=args.task,
limit=args.limit,
max_connections=args.max_connections or 4,
temperature=args.temperature if args.temperature is not None else 0.0,
tensor_parallel_size=args.tensor_parallel_size,
gpu_memory_utilization=args.gpu_memory_utilization,
dtype=args.dtype,
trust_remote_code=args.trust_remote_code,
log_level=args.log_level,
)
else:
run_inspect_hf(
model_id=args.model,
task=args.task,
limit=args.limit,
max_connections=args.max_connections or 1,
temperature=args.temperature if args.temperature is not None else 0.001,
device=args.device,
dtype=args.dtype,
trust_remote_code=args.trust_remote_code,
log_level=args.log_level,
)
if __name__ == "__main__":
main()
```
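The `main()` function above applies different concurrency and temperature defaults depending on the backend: vLLM tolerates `temperature=0` and higher concurrency, while the Transformers backend gets conservative values. A sketch of that fallback logic (`backend_defaults` is a hypothetical helper restating it, not part of the script):

```python
def backend_defaults(backend: str, max_connections=None, temperature=None) -> tuple:
    # Mirror the per-backend fallbacks: `or` supplies the connection default,
    # while temperature uses an explicit None check so 0.0 is preserved.
    if backend == "vllm":
        return (max_connections or 4,
                temperature if temperature is not None else 0.0)
    return (max_connections or 1,
            temperature if temperature is not None else 0.001)

assert backend_defaults("vllm") == (4, 0.0)
assert backend_defaults("hf") == (1, 0.001)
assert backend_defaults("vllm", temperature=0.7) == (4, 0.7)
```

The `is not None` check on temperature is the important detail: a plain `or` would silently replace a requested `0.0` with the default.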
### scripts/run_vllm_eval_job.py
```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
# "huggingface-hub>=0.26.0",
# "python-dotenv>=1.2.1",
# ]
# ///
"""
Submit vLLM-based evaluation jobs using the `hf jobs uv run` CLI.
This wrapper constructs the appropriate command to execute vLLM evaluation scripts
(lighteval or inspect-ai) on Hugging Face Jobs with GPU hardware.
Unlike run_eval_job.py (which uses inference providers/APIs), this script runs
models directly on the job's GPU using vLLM or HuggingFace Transformers.
Usage:
python run_vllm_eval_job.py \\
--model meta-llama/Llama-3.2-1B \\
--task mmlu \\
--framework lighteval \\
--hardware a10g-small
"""
from __future__ import annotations
import argparse
import os
import subprocess
import sys
from pathlib import Path
from typing import Optional
from huggingface_hub import get_token
from dotenv import load_dotenv
load_dotenv()
# Script paths for different evaluation frameworks
SCRIPT_DIR = Path(__file__).parent.resolve()
LIGHTEVAL_SCRIPT = SCRIPT_DIR / "lighteval_vllm_uv.py"
INSPECT_SCRIPT = SCRIPT_DIR / "inspect_vllm_uv.py"
# Hardware flavor recommendations for different model sizes
HARDWARE_RECOMMENDATIONS = {
"small": "t4-small", # < 3B parameters
"medium": "a10g-small", # 3B - 13B parameters
"large": "a10g-large", # 13B - 34B parameters
"xlarge": "a100-large", # 34B+ parameters
}
def estimate_hardware(model_id: str) -> str:
    """
    Estimate appropriate hardware based on model ID naming conventions.
    Returns a hardware flavor recommendation.
    """
    model_lower = model_id.lower()
    # Check for explicit size indicators in model name
    if any(x in model_lower for x in ["70b", "72b", "65b"]):
        return HARDWARE_RECOMMENDATIONS["xlarge"]
    elif any(x in model_lower for x in ["34b", "33b", "32b", "30b"]):
        return HARDWARE_RECOMMENDATIONS["large"]
    elif any(x in model_lower for x in ["13b", "14b", "7b", "8b"]):
        return HARDWARE_RECOMMENDATIONS["medium"]
    elif any(x in model_lower for x in ["3b", "2b", "1b", "0.5b", "small", "mini"]):
        return HARDWARE_RECOMMENDATIONS["small"]
    # Default to medium hardware
    return HARDWARE_RECOMMENDATIONS["medium"]
def create_lighteval_job(
model_id: str,
tasks: str,
hardware: str,
hf_token: Optional[str] = None,
max_samples: Optional[int] = None,
backend: str = "vllm",
batch_size: int = 1,
tensor_parallel_size: int = 1,
trust_remote_code: bool = False,
use_chat_template: bool = False,
) -> None:
"""
Submit a lighteval evaluation job on HuggingFace Jobs.
"""
token = hf_token or os.getenv("HF_TOKEN") or get_token()
if not token:
raise ValueError("HF_TOKEN is required. Set it in environment or pass as argument.")
if not LIGHTEVAL_SCRIPT.exists():
raise FileNotFoundError(f"Script not found at {LIGHTEVAL_SCRIPT}")
print(f"Preparing lighteval job for {model_id}")
print(f" Tasks: {tasks}")
print(f" Backend: {backend}")
print(f" Hardware: {hardware}")
cmd = [
"hf", "jobs", "uv", "run",
str(LIGHTEVAL_SCRIPT),
"--flavor", hardware,
"--secrets", f"HF_TOKEN={token}",
"--",
"--model", model_id,
"--tasks", tasks,
"--backend", backend,
"--batch-size", str(batch_size),
"--tensor-parallel-size", str(tensor_parallel_size),
]
if max_samples:
cmd.extend(["--max-samples", str(max_samples)])
if trust_remote_code:
cmd.append("--trust-remote-code")
if use_chat_template:
cmd.append("--use-chat-template")
print(f"\nExecuting: {' '.join(cmd)}")
try:
subprocess.run(cmd, check=True)
except subprocess.CalledProcessError as exc:
        print(f"hf jobs command failed with exit code {exc.returncode}", file=sys.stderr)
raise
def create_inspect_job(
model_id: str,
task: str,
hardware: str,
hf_token: Optional[str] = None,
limit: Optional[int] = None,
backend: str = "vllm",
tensor_parallel_size: int = 1,
trust_remote_code: bool = False,
) -> None:
"""
Submit an inspect-ai evaluation job on HuggingFace Jobs.
"""
token = hf_token or os.getenv("HF_TOKEN") or get_token()
if not token:
raise ValueError("HF_TOKEN is required. Set it in environment or pass as argument.")
if not INSPECT_SCRIPT.exists():
raise FileNotFoundError(f"Script not found at {INSPECT_SCRIPT}")
print(f"Preparing inspect-ai job for {model_id}")
print(f" Task: {task}")
print(f" Backend: {backend}")
print(f" Hardware: {hardware}")
cmd = [
"hf", "jobs", "uv", "run",
str(INSPECT_SCRIPT),
"--flavor", hardware,
"--secrets", f"HF_TOKEN={token}",
"--",
"--model", model_id,
"--task", task,
"--backend", backend,
"--tensor-parallel-size", str(tensor_parallel_size),
]
if limit:
cmd.extend(["--limit", str(limit)])
if trust_remote_code:
cmd.append("--trust-remote-code")
print(f"\nExecuting: {' '.join(cmd)}")
try:
subprocess.run(cmd, check=True)
except subprocess.CalledProcessError as exc:
        print(f"hf jobs command failed with exit code {exc.returncode}", file=sys.stderr)
raise
def main() -> None:
parser = argparse.ArgumentParser(
description="Submit vLLM-based evaluation jobs to HuggingFace Jobs",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Run lighteval with vLLM on A10G GPU
python run_vllm_eval_job.py \\
--model meta-llama/Llama-3.2-1B \\
--task "leaderboard|mmlu|5" \\
--framework lighteval \\
--hardware a10g-small
# Run inspect-ai on larger model with multi-GPU
python run_vllm_eval_job.py \\
--model meta-llama/Llama-3.2-70B \\
--task mmlu \\
--framework inspect \\
--hardware a100-large \\
--tensor-parallel-size 4
# Auto-detect hardware based on model size
python run_vllm_eval_job.py \\
--model meta-llama/Llama-3.2-1B \\
--task mmlu \\
--framework inspect
# Run with HF Transformers backend (instead of vLLM)
python run_vllm_eval_job.py \\
--model microsoft/phi-2 \\
--task mmlu \\
--framework inspect \\
--backend hf
Hardware flavors:
- t4-small: T4 GPU, good for models < 3B
- a10g-small: A10G GPU, good for models 3B-13B
- a10g-large: A10G GPU, good for models 13B-34B
- a100-large: A100 GPU, good for models 34B+
Frameworks:
- lighteval: HuggingFace's lighteval library
- inspect: the UK AI Safety Institute's inspect-ai library
Task formats:
- lighteval: "suite|task|num_fewshot" (e.g., "leaderboard|mmlu|5")
- inspect: task name (e.g., "mmlu", "gsm8k")
""",
)
parser.add_argument(
"--model",
required=True,
help="HuggingFace model ID (e.g., meta-llama/Llama-3.2-1B)",
)
parser.add_argument(
"--task",
required=True,
help="Evaluation task (format depends on framework)",
)
parser.add_argument(
"--framework",
choices=["lighteval", "inspect"],
default="lighteval",
help="Evaluation framework to use (default: lighteval)",
)
parser.add_argument(
"--hardware",
default=None,
help="Hardware flavor (auto-detected if not specified)",
)
parser.add_argument(
"--backend",
choices=["vllm", "hf", "accelerate"],
default="vllm",
help="Model backend (default: vllm)",
)
parser.add_argument(
"--limit",
"--max-samples",
type=int,
default=None,
dest="limit",
help="Limit number of samples to evaluate",
)
parser.add_argument(
"--batch-size",
type=int,
default=1,
help="Batch size for evaluation (lighteval only)",
)
parser.add_argument(
"--tensor-parallel-size",
type=int,
default=1,
help="Number of GPUs for tensor parallelism",
)
parser.add_argument(
"--trust-remote-code",
action="store_true",
help="Allow executing remote code from model repository",
)
parser.add_argument(
"--use-chat-template",
action="store_true",
help="Apply chat template (lighteval only)",
)
args = parser.parse_args()
# Auto-detect hardware if not specified
hardware = args.hardware or estimate_hardware(args.model)
print(f"Using hardware: {hardware}")
# Map backend names between frameworks
backend = args.backend
if args.framework == "lighteval" and backend == "hf":
backend = "accelerate" # lighteval uses "accelerate" for HF backend
if args.framework == "lighteval":
create_lighteval_job(
model_id=args.model,
tasks=args.task,
hardware=hardware,
max_samples=args.limit,
backend=backend,
batch_size=args.batch_size,
tensor_parallel_size=args.tensor_parallel_size,
trust_remote_code=args.trust_remote_code,
use_chat_template=args.use_chat_template,
)
else:
create_inspect_job(
model_id=args.model,
task=args.task,
hardware=hardware,
limit=args.limit,
backend=backend if backend != "accelerate" else "hf",
tensor_parallel_size=args.tensor_parallel_size,
trust_remote_code=args.trust_remote_code,
)
if __name__ == "__main__":
main()
```
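The hardware auto-detection above keys off size tokens in the model name. A compact, self-contained restatement of the same size-to-flavor mapping (illustrative only; `pick_flavor` and `SIZE_TO_FLAVOR` are not part of the script):

```python
# Ordered from largest to smallest so bigger size tokens win first.
SIZE_TO_FLAVOR = [
    (("70b", "72b", "65b"), "a100-large"),
    (("34b", "33b", "32b", "30b"), "a10g-large"),
    (("13b", "14b", "7b", "8b"), "a10g-small"),
    (("3b", "2b", "1b", "0.5b", "small", "mini"), "t4-small"),
]

def pick_flavor(model_id: str) -> str:
    name = model_id.lower()
    for tokens, flavor in SIZE_TO_FLAVOR:
        if any(t in name for t in tokens):
            return flavor
    return "a10g-small"  # default to medium hardware when no size token matches

assert pick_flavor("meta-llama/Llama-3.2-1B") == "t4-small"
assert pick_flavor("meta-llama/Llama-3.2-70B") == "a100-large"
assert pick_flavor("mistralai/Mistral-Unknown") == "a10g-small"
```

Substring matching on names is a heuristic, not a guarantee; passing `--hardware` explicitly overrides it, as the script does.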
### scripts/inspect_eval_uv.py
```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
# "inspect-ai>=0.3.0",
# "inspect-evals",
# "openai",
# ]
# ///
"""
Entry point script for running inspect-ai evaluations via `hf jobs uv run`.
"""
from __future__ import annotations
import argparse
import os
import subprocess
import sys
from pathlib import Path
from typing import Optional
def _inspect_evals_tasks_root() -> Optional[Path]:
"""Return the installed inspect_evals package path if available."""
try:
import inspect_evals
return Path(inspect_evals.__file__).parent
except Exception:
return None
def _normalize_task(task: str) -> str:
"""Allow lighteval-style `suite|task|shots` strings by keeping the task name."""
if "|" in task:
parts = task.split("|")
if len(parts) >= 2 and parts[1]:
return parts[1]
return task
def main() -> None:
parser = argparse.ArgumentParser(description="Inspect-ai job runner")
parser.add_argument("--model", required=True, help="Model ID on Hugging Face Hub")
parser.add_argument("--task", required=True, help="inspect-ai task to execute")
parser.add_argument("--limit", type=int, default=None, help="Limit number of samples to evaluate")
parser.add_argument(
"--tasks-root",
default=None,
help="Optional path to inspect task files. Defaults to the installed inspect_evals package.",
)
parser.add_argument(
"--sandbox",
default="local",
help="Sandbox backend to use (default: local for HF jobs without Docker).",
)
args = parser.parse_args()
# Ensure downstream libraries can read the token passed as a secret
hf_token = os.getenv("HF_TOKEN")
if hf_token:
os.environ.setdefault("HUGGING_FACE_HUB_TOKEN", hf_token)
os.environ.setdefault("HF_HUB_TOKEN", hf_token)
task = _normalize_task(args.task)
tasks_root = Path(args.tasks_root) if args.tasks_root else _inspect_evals_tasks_root()
if tasks_root and not tasks_root.exists():
tasks_root = None
cmd = [
"inspect",
"eval",
task,
"--model",
f"hf-inference-providers/{args.model}",
"--log-level",
"info",
        # Keep concurrency low to avoid memory pressure and provider rate limits
"--max-connections",
"1",
# Set a small positive temperature (HF doesn't allow temperature=0)
"--temperature",
"0.001",
]
if args.sandbox:
cmd.extend(["--sandbox", args.sandbox])
if args.limit:
cmd.extend(["--limit", str(args.limit)])
try:
subprocess.run(cmd, check=True, cwd=tasks_root)
print("Evaluation complete.")
except subprocess.CalledProcessError as exc:
location = f" (cwd={tasks_root})" if tasks_root else ""
print(f"Evaluation failed with exit code {exc.returncode}{location}", file=sys.stderr)
raise
if __name__ == "__main__":
main()
```
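`_normalize_task` is what lets callers reuse lighteval-style specs with inspect-ai unchanged. Its behavior, restated as a standalone copy for illustration:

```python
def normalize_task(task: str) -> str:
    # Keep only the task name from a lighteval-style "suite|task|shots" spec;
    # plain inspect task names pass through unchanged.
    if "|" in task:
        parts = task.split("|")
        if len(parts) >= 2 and parts[1]:
            return parts[1]
    return task

assert normalize_task("leaderboard|mmlu|5") == "mmlu"
assert normalize_task("gsm8k") == "gsm8k"
assert normalize_task("|") == "|"  # malformed specs fall through untouched
```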