
hugging-face-evaluation-manager

Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content, importing scores from Artificial Analysis API, and running custom model evaluations with vLLM/lighteval. Works with the model-index metadata format.

Packaged view

This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.

Stars: 6.
Hot score: 82.
Updated: March 20, 2026.
Overall rating: C1.6.
Composite score: 1.6.
Best-practice grade: N/A.

Install command

npx @skill-hub/cli install nymbo-skills-hugging-face-evaluation-manager
Tags: huggingface, model-evaluation, machine-learning, metadata, automation

Repository

Nymbo/Skills

Skill path: HUGGING FACE/hugging-face-evaluation-manager


Open repository

Best for

Primary workflow: Write Technical Docs.

Technical facets: Full Stack, Backend, Data / AI, Tech Writer.

Target audience: everyone.

License: Unknown.

Original source

Catalog source: SkillHub Club.

Repository owner: Nymbo.

This is a mirrored public skill entry. Review the repository before installing it into production workflows.

What it helps with

  • Install hugging-face-evaluation-manager into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
  • Review https://github.com/Nymbo/Skills before adding hugging-face-evaluation-manager to shared team environments
  • Use hugging-face-evaluation-manager for development workflows

Works across

Claude Code, Codex CLI, Gemini CLI, OpenCode

Favorites: 0.

Sub-skills: 0.

Aggregator: No.

Original source / Raw SKILL.md

---
name: hugging-face-evaluation-manager
description: Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content, importing scores from Artificial Analysis API, and running custom model evaluations with vLLM/lighteval. Works with the model-index metadata format.
---

# Overview
This skill provides tools to add structured evaluation results to Hugging Face model cards. It supports multiple methods for adding evaluation data:
- Extracting existing evaluation tables from README content
- Importing benchmark scores from Artificial Analysis
- Running custom model evaluations with vLLM or accelerate backends (lighteval/inspect-ai)

## Integration with HF Ecosystem
- **Model Cards**: Updates model-index metadata for leaderboard integration
- **Artificial Analysis**: Direct API integration for benchmark imports
- **Papers with Code**: Compatible with their model-index specification
- **Jobs**: Run evaluations directly on Hugging Face Jobs with `uv` integration
- **vLLM**: Efficient GPU inference for custom model evaluation
- **lighteval**: HuggingFace's evaluation library with vLLM/accelerate backends
- **inspect-ai**: UK AI Safety Institute's evaluation framework

# Version
1.3.0

# Dependencies

## Core Dependencies
- huggingface_hub>=0.26.0
- markdown-it-py>=3.0.0
- python-dotenv>=1.2.1
- pyyaml>=6.0.3
- requests>=2.32.5
- re (built-in)

## Inference Provider Evaluation
- inspect-ai>=0.3.0
- inspect-evals
- openai

## vLLM Custom Model Evaluation (GPU required)
- lighteval[accelerate,vllm]>=0.6.0
- vllm>=0.4.0
- torch>=2.0.0
- transformers>=4.40.0
- accelerate>=0.30.0

Note: vLLM dependencies are installed automatically via PEP 723 script headers when using `uv run`.

# IMPORTANT: Using This Skill

## ⚠️ CRITICAL: Check for Existing PRs Before Creating New Ones

**Before creating ANY pull request with `--create-pr`, you MUST check for existing open PRs:**

```bash
uv run scripts/evaluation_manager.py get-prs --repo-id "username/model-name"
```

**If open PRs exist:**
1. **DO NOT create a new PR** - this creates duplicate work for maintainers
2. **Warn the user** that open PRs already exist
3. **Show the user** the existing PR URLs so they can review them
4. Only proceed if the user explicitly confirms they want to create another PR

This prevents spamming model repositories with duplicate evaluation PRs.
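
The gating logic above can be sketched in plain Python. This is a hypothetical helper, not part of the skill's CLI; the `Discussion` stand-in assumes fields (`title`, `status`, `is_pull_request`) similar to what `huggingface_hub.HfApi.get_repo_discussions` returns.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Discussion:
    # Minimal stand-in for huggingface_hub's Discussion objects (assumption).
    title: str
    status: str            # "open", "closed", "merged", ...
    is_pull_request: bool

def open_prs(discussions: List[Discussion]) -> List[Discussion]:
    """Return only the open pull requests; if non-empty, warn instead of creating a new PR."""
    return [d for d in discussions if d.is_pull_request and d.status == "open"]

# Example: one open PR already exists, so a new --create-pr should be skipped.
existing = [
    Discussion("Add eval results", "open", True),
    Discussion("Typo fix", "merged", True),
    Discussion("Question about license", "open", False),
]
blocking = open_prs(existing)
print(len(blocking))  # → 1
```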

---

**Use `--help` for the latest workflow guidance.** Works with plain Python or `uv run`:
```bash
uv run scripts/evaluation_manager.py --help
uv run scripts/evaluation_manager.py inspect-tables --help
uv run scripts/evaluation_manager.py extract-readme --help
```
Key workflow (matches CLI help):

1) `get-prs` → check for existing open PRs first
2) `inspect-tables` → find table numbers/columns  
3) `extract-readme --table N` → prints YAML by default  
4) add `--apply` (push) or `--create-pr` to write changes

# Core Capabilities

## 1. Inspect and Extract Evaluation Tables from README
- **Inspect Tables**: Use `inspect-tables` to see all tables in a README with structure, columns, and sample rows
- **Parse Markdown Tables**: Accurate parsing using markdown-it-py (ignores code blocks and examples)
- **Table Selection**: Use `--table N` to extract from a specific table (required when multiple tables exist)
- **Format Detection**: Recognize common formats (benchmarks as rows, columns, or comparison tables with multiple models)
- **Column Matching**: Automatically identify model columns/rows; prefer `--model-column-index` (index from inspect output). Use `--model-name-override` only with exact column header text.
- **YAML Generation**: Convert selected table to model-index YAML format
- **Task Typing**: `--task-type` sets the `task.type` field in model-index output (e.g., `text-generation`, `summarization`)

## 2. Import from Artificial Analysis
- **API Integration**: Fetch benchmark scores directly from Artificial Analysis
- **Automatic Formatting**: Convert API responses to model-index format
- **Metadata Preservation**: Maintain source attribution and URLs
- **PR Creation**: Automatically create pull requests with evaluation updates

## 3. Model-Index Management
- **YAML Generation**: Create properly formatted model-index entries
- **Merge Support**: Add evaluations to existing model cards without overwriting
- **Validation**: Ensure compliance with Papers with Code specification
- **Batch Operations**: Process multiple models efficiently

## 4. Run Evaluations on HF Jobs (Inference Providers)
- **Inspect-AI Integration**: Run standard evaluations using the `inspect-ai` library
- **UV Integration**: Seamlessly run Python scripts with ephemeral dependencies on HF infrastructure
- **Zero-Config**: No Dockerfiles or Space management required
- **Hardware Selection**: Configure CPU or GPU hardware for the evaluation job
- **Secure Execution**: Handles API tokens safely via secrets passed through the CLI

## 5. Run Custom Model Evaluations with vLLM (NEW)

⚠️ **Important:** This approach requires a device with `uv` installed and sufficient GPU memory.
**Benefits:** No `hf_jobs()` MCP tool required; scripts run directly in the terminal.
**When to use:** The user is working directly on a local device with an available GPU.

### Before running the script

- Check that the script path exists
- Check that `uv` is installed
- Check that a GPU is available with `nvidia-smi`

### Running the script

```bash
uv run scripts/train_sft_example.py
```
### Features

- **vLLM Backend**: High-performance GPU inference (5-10x faster than standard HF methods)
- **lighteval Framework**: HuggingFace's evaluation library with Open LLM Leaderboard tasks
- **inspect-ai Framework**: UK AI Safety Institute's evaluation library
- **Standalone or Jobs**: Run locally or submit to HF Jobs infrastructure

# Usage Instructions

The skill includes Python scripts in `scripts/` to perform operations.

### Prerequisites
- Preferred: use `uv run` (PEP 723 header auto-installs deps)
- Or install manually: `pip install huggingface-hub markdown-it-py python-dotenv pyyaml requests`
- Set `HF_TOKEN` environment variable with Write-access token
- For Artificial Analysis: Set `AA_API_KEY` environment variable
- `.env` is loaded automatically if `python-dotenv` is installed
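
A minimal preflight check for these environment variables might look like the following sketch (not part of the skill's scripts):

```python
import os

def preflight(require_aa: bool = False) -> list:
    """Return the names of required environment variables that are missing."""
    missing = []
    if not os.environ.get("HF_TOKEN"):
        missing.append("HF_TOKEN")
    if require_aa and not os.environ.get("AA_API_KEY"):
        missing.append("AA_API_KEY")
    return missing

# Illustration: HF_TOKEN set (placeholder value), AA_API_KEY absent.
os.environ["HF_TOKEN"] = "hf_example"
os.environ.pop("AA_API_KEY", None)
print(preflight(require_aa=True))  # → ['AA_API_KEY']
```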

### Method 1: Extract from README (CLI workflow)

Recommended flow (matches `--help`):
```bash
# 1) Inspect tables to get table numbers and column hints
uv run scripts/evaluation_manager.py inspect-tables --repo-id "username/model"

# 2) Extract a specific table (prints YAML by default)
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "username/model" \
  --table 1 \
  [--model-column-index <column index shown by inspect-tables>] \
  [--model-name-override "<column header/model name>"]  # use exact header text if you can't use the index

# 3) Apply changes (push or PR)
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "username/model" \
  --table 1 \
  --apply       # push directly
# or
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "username/model" \
  --table 1 \
  --create-pr   # open a PR
```

Validation checklist:
- YAML is printed by default; compare against the README table before applying.
- Prefer `--model-column-index`; if using `--model-name-override`, the column header text must be exact.
- For transposed tables (models as rows), ensure only one row is extracted.

### Method 2: Import from Artificial Analysis

Fetch benchmark scores from Artificial Analysis API and add them to a model card.

**Basic Usage:**
```bash
AA_API_KEY="your-api-key" python scripts/evaluation_manager.py import-aa \
  --creator-slug "anthropic" \
  --model-name "claude-sonnet-4" \
  --repo-id "username/model-name"
```

**With Environment File:**
```bash
# Create .env file
echo "AA_API_KEY=your-api-key" >> .env
echo "HF_TOKEN=your-hf-token" >> .env

# Run import
python scripts/evaluation_manager.py import-aa \
  --creator-slug "anthropic" \
  --model-name "claude-sonnet-4" \
  --repo-id "username/model-name"
```

**Create Pull Request:**
```bash
python scripts/evaluation_manager.py import-aa \
  --creator-slug "anthropic" \
  --model-name "claude-sonnet-4" \
  --repo-id "username/model-name" \
  --create-pr
```

### Method 3: Run Evaluation Job

Submit an evaluation job on Hugging Face infrastructure using the `hf jobs uv run` CLI.

**Direct CLI Usage:**
```bash
HF_TOKEN=$HF_TOKEN \
hf jobs uv run hf_model_evaluation/scripts/inspect_eval_uv.py \
  --flavor cpu-basic \
  --secret HF_TOKEN=$HF_TOKEN \
  -- --model "meta-llama/Llama-2-7b-hf" \
     --task "mmlu"
```

**GPU Example (A10G):**
```bash
HF_TOKEN=$HF_TOKEN \
hf jobs uv run hf_model_evaluation/scripts/inspect_eval_uv.py \
  --flavor a10g-small \
  --secret HF_TOKEN=$HF_TOKEN \
  -- --model "meta-llama/Llama-2-7b-hf" \
     --task "gsm8k"
```

**Python Helper (optional):**
```bash
python scripts/run_eval_job.py \
  --model "meta-llama/Llama-2-7b-hf" \
  --task "mmlu" \
  --hardware "t4-small"
```

### Method 4: Run Custom Model Evaluation with vLLM

Evaluate custom HuggingFace models directly on GPU using vLLM or accelerate backends. These scripts are **separate from inference provider scripts** and run models locally on the job's hardware.

#### When to Use vLLM Evaluation (vs Inference Providers)

| Feature | vLLM Scripts | Inference Provider Scripts |
|---------|-------------|---------------------------|
| Model access | Any HF model | Models with API endpoints |
| Hardware | Your GPU (or HF Jobs GPU) | Provider's infrastructure |
| Cost | HF Jobs compute cost | API usage fees |
| Speed | vLLM optimized | Depends on provider |
| Offline | Yes (after download) | No |

#### Option A: lighteval with vLLM Backend

lighteval is HuggingFace's evaluation library, supporting Open LLM Leaderboard tasks.

**Standalone (local GPU):**
```bash
# Run MMLU 5-shot with vLLM
python scripts/lighteval_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --tasks "leaderboard|mmlu|5"

# Run multiple tasks
python scripts/lighteval_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5"

# Use accelerate backend instead of vLLM
python scripts/lighteval_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --tasks "leaderboard|mmlu|5" \
  --backend accelerate

# Chat/instruction-tuned models
python scripts/lighteval_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B-Instruct \
  --tasks "leaderboard|mmlu|5" \
  --use-chat-template
```

**Via HF Jobs:**
```bash
hf jobs uv run scripts/lighteval_vllm_uv.py \
  --flavor a10g-small \
  --secrets HF_TOKEN=$HF_TOKEN \
  -- --model meta-llama/Llama-3.2-1B \
     --tasks "leaderboard|mmlu|5"
```

**lighteval Task Format:**
Tasks use the format `suite|task|num_fewshot`:
- `leaderboard|mmlu|5` - MMLU with 5-shot
- `leaderboard|gsm8k|5` - GSM8K with 5-shot
- `lighteval|hellaswag|0` - HellaSwag zero-shot
- `leaderboard|arc_challenge|25` - ARC-Challenge with 25-shot

**Finding Available Tasks:**
The complete list of available lighteval tasks can be found at:
https://github.com/huggingface/lighteval/blob/main/examples/tasks/all_tasks.txt

This file contains all supported tasks in the format `suite|task|num_fewshot|0` (the trailing `0` is a version flag and can be ignored). Common suites include:
- `leaderboard` - Open LLM Leaderboard tasks (MMLU, GSM8K, ARC, HellaSwag, etc.)
- `lighteval` - Additional lighteval tasks
- `bigbench` - BigBench tasks
- `original` - Original benchmark tasks

To use a task from the list, extract the `suite|task|num_fewshot` portion (without the trailing `0`) and pass it to the `--tasks` parameter. For example:
- From file: `leaderboard|mmlu|0` → Use: `leaderboard|mmlu|0` (or change to `5` for 5-shot)
- From file: `bigbench|abstract_narrative_understanding|0` → Use: `bigbench|abstract_narrative_understanding|0`
- From file: `lighteval|wmt14:hi-en|0` → Use: `lighteval|wmt14:hi-en|0`

Multiple tasks can be specified as comma-separated values: `--tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5"`
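
The `suite|task|num_fewshot` convention can be parsed mechanically. This hypothetical helper (not part of the skill) also tolerates the trailing version flag found in `all_tasks.txt`:

```python
def parse_task_spec(spec: str):
    """Split 'suite|task|num_fewshot[|version]' into its parts, dropping the version flag."""
    parts = spec.strip().split("|")
    if len(parts) not in (3, 4):
        raise ValueError(f"Expected suite|task|num_fewshot, got: {spec!r}")
    suite, task, fewshot = parts[0], parts[1], int(parts[2])
    return suite, task, fewshot

# Entries copied from all_tasks.txt keep their trailing version flag; it is ignored.
print(parse_task_spec("leaderboard|mmlu|5"))          # → ('leaderboard', 'mmlu', 5)
print(parse_task_spec("lighteval|wmt14:hi-en|0|0"))   # → ('lighteval', 'wmt14:hi-en', 0)
```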

#### Option B: inspect-ai with vLLM Backend

inspect-ai is the UK AI Safety Institute's evaluation framework.

**Standalone (local GPU):**
```bash
# Run MMLU with vLLM
python scripts/inspect_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --task mmlu

# Use HuggingFace Transformers backend
python scripts/inspect_vllm_uv.py \
  --model meta-llama/Llama-3.2-1B \
  --task mmlu \
  --backend hf

# Multi-GPU with tensor parallelism
python scripts/inspect_vllm_uv.py \
  --model meta-llama/Llama-3.2-70B \
  --task mmlu \
  --tensor-parallel-size 4
```

**Via HF Jobs:**
```bash
hf jobs uv run scripts/inspect_vllm_uv.py \
  --flavor a10g-small \
  --secrets HF_TOKEN=$HF_TOKEN \
  -- --model meta-llama/Llama-3.2-1B \
     --task mmlu
```

**Available inspect-ai Tasks:**
- `mmlu` - Massive Multitask Language Understanding
- `gsm8k` - Grade School Math
- `hellaswag` - Common sense reasoning
- `arc_challenge` - AI2 Reasoning Challenge
- `truthfulqa` - TruthfulQA benchmark
- `winogrande` - Winograd Schema Challenge
- `humaneval` - Code generation

#### Option C: Python Helper Script

The helper script auto-selects hardware and simplifies job submission:

```bash
# Auto-detect hardware based on model size
python scripts/run_vllm_eval_job.py \
  --model meta-llama/Llama-3.2-1B \
  --task "leaderboard|mmlu|5" \
  --framework lighteval

# Explicit hardware selection
python scripts/run_vllm_eval_job.py \
  --model meta-llama/Llama-3.2-70B \
  --task mmlu \
  --framework inspect \
  --hardware a100-large \
  --tensor-parallel-size 4

# Use HF Transformers backend
python scripts/run_vllm_eval_job.py \
  --model microsoft/phi-2 \
  --task mmlu \
  --framework inspect \
  --backend hf
```

**Hardware Recommendations:**
| Model Size | Recommended Hardware |
|------------|---------------------|
| < 3B params | `t4-small` |
| 3B - 13B | `a10g-small` |
| 13B - 34B | `a10g-large` |
| 34B+ | `a100-large` |
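
The auto-detection in `run_vllm_eval_job.py` presumably follows a mapping like the table above; the sketch below is an assumption for illustration, not the script's actual code:

```python
def recommend_hardware(params_billion: float) -> str:
    """Map model size (billions of parameters) to an HF Jobs flavor per the table above."""
    if params_billion < 3:
        return "t4-small"
    if params_billion <= 13:
        return "a10g-small"
    if params_billion <= 34:
        return "a10g-large"
    return "a100-large"

print(recommend_hardware(1.0))   # → t4-small
print(recommend_hardware(70.0))  # → a100-large
```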

### Commands Reference

**Top-level help and version:**
```bash
uv run scripts/evaluation_manager.py --help
uv run scripts/evaluation_manager.py --version
```

**Inspect Tables (start here):**
```bash
uv run scripts/evaluation_manager.py inspect-tables --repo-id "username/model-name"
```

**Extract from README:**
```bash
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "username/model-name" \
  --table N \
  [--model-column-index N] \
  [--model-name-override "Exact Column Header or Model Name"] \
  [--task-type "text-generation"] \
  [--dataset-name "Custom Benchmarks"] \
  [--apply | --create-pr]
```

**Import from Artificial Analysis:**
```bash
AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa \
  --creator-slug "creator-name" \
  --model-name "model-slug" \
  --repo-id "username/model-name" \
  [--create-pr]
```

**View / Validate:**
```bash
uv run scripts/evaluation_manager.py show --repo-id "username/model-name"
uv run scripts/evaluation_manager.py validate --repo-id "username/model-name"
```

**Check Open PRs (ALWAYS run before --create-pr):**
```bash
uv run scripts/evaluation_manager.py get-prs --repo-id "username/model-name"
```
Lists all open pull requests for the model repository. Shows PR number, title, author, date, and URL.

**Run Evaluation Job (Inference Providers):**
```bash
hf jobs uv run scripts/inspect_eval_uv.py \
  --flavor "cpu-basic|t4-small|..." \
  --secret HF_TOKEN=$HF_TOKEN \
  -- --model "model-id" \
     --task "task-name"
```

or use the Python helper:

```bash
python scripts/run_eval_job.py \
  --model "model-id" \
  --task "task-name" \
  --hardware "cpu-basic|t4-small|..."
```

**Run vLLM Evaluation (Custom Models):**
```bash
# lighteval with vLLM
hf jobs uv run scripts/lighteval_vllm_uv.py \
  --flavor "a10g-small" \
  --secrets HF_TOKEN=$HF_TOKEN \
  -- --model "model-id" \
     --tasks "leaderboard|mmlu|5"

# inspect-ai with vLLM
hf jobs uv run scripts/inspect_vllm_uv.py \
  --flavor "a10g-small" \
  --secrets HF_TOKEN=$HF_TOKEN \
  -- --model "model-id" \
     --task "mmlu"

# Helper script (auto hardware selection)
python scripts/run_vllm_eval_job.py \
  --model "model-id" \
  --task "leaderboard|mmlu|5" \
  --framework lighteval
```

### Model-Index Format

The generated model-index follows this structure:

```yaml
model-index:
  - name: Model Name
    results:
      - task:
          type: text-generation
        dataset:
          name: Benchmark Dataset
          type: benchmark_type
        metrics:
          - name: MMLU
            type: mmlu
            value: 85.2
          - name: HumanEval
            type: humaneval
            value: 72.5
        source:
          name: Source Name
          url: https://source-url.com
```

WARNING: Do not use markdown formatting in the model name; use the exact name from the table. Use URLs only in the `source.url` field.
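
Building that structure programmatically is straightforward. This illustrative function (not part of the skill; the metric `type` derivation is an assumption) assembles a model-index entry from a dict of scores:

```python
def build_model_index(model_name, dataset_name, dataset_type, scores,
                      source_name, source_url, task_type="text-generation"):
    """Assemble a model-index entry matching the structure shown above."""
    metrics = [
        {"name": name, "type": name.lower().replace(" ", "_"), "value": value}
        for name, value in scores.items()
    ]
    return {
        "model-index": [{
            "name": model_name,  # plain text only: no markdown, no URLs
            "results": [{
                "task": {"type": task_type},
                "dataset": {"name": dataset_name, "type": dataset_type},
                "metrics": metrics,
                "source": {"name": source_name, "url": source_url},
            }],
        }]
    }

entry = build_model_index(
    "Model Name", "Benchmark Dataset", "benchmark_type",
    {"MMLU": 85.2, "HumanEval": 72.5},
    "Source Name", "https://source-url.com",
)
print(entry["model-index"][0]["results"][0]["metrics"][0])
# → {'name': 'MMLU', 'type': 'mmlu', 'value': 85.2}
```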

### Error Handling
- **Table Not Found**: Script will report if no evaluation tables are detected
- **Invalid Format**: Clear error messages for malformed tables
- **API Errors**: Retry logic for transient Artificial Analysis API failures
- **Token Issues**: Validation before attempting updates
- **Merge Conflicts**: Preserves existing model-index entries when adding new ones
- **Space Creation**: Handles naming conflicts and hardware request failures gracefully
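
The retry behavior for transient API failures can be approximated with a small backoff wrapper (an illustrative sketch; the skill's actual retry logic may differ):

```python
import time

def with_retries(fn, attempts=3, base_delay=0.0):
    """Call fn(), retrying on exception with exponential backoff between attempts."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Simulate an API that fails twice, then succeeds on the third call.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

print(with_retries(flaky))  # → ok
```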

### Best Practices

1. **Check for existing PRs first**: Run `get-prs` before creating any new PR to avoid duplicates
2. **Always start with `inspect-tables`**: See table structure and get the correct extraction command
3. **Use `--help` for guidance**: Run `inspect-tables --help` to see the complete workflow
4. **Preview first**: Default behavior prints YAML; review it before using `--apply` or `--create-pr`
5. **Verify extracted values**: Compare YAML output against the README table manually
6. **Use `--table N` for multi-table READMEs**: Required when multiple evaluation tables exist
7. **Use `--model-name-override` for comparison tables**: Copy the exact column header from `inspect-tables` output
8. **Create PRs for Others**: Use `--create-pr` when updating models you don't own
9. **One model per repo**: Only add the main model's results to model-index
10. **No markdown in YAML names**: The model name field in YAML should be plain text

### Model Name Matching

When extracting evaluation tables with multiple models (either as columns or rows), the script uses **exact normalized token matching**:

- Removes markdown formatting (bold `**`, links `[]()`)
- Normalizes names (lowercase; replaces `-` and `_` with spaces)
- Compares token sets: `"OLMo-3-32B"` → `{"olmo", "3", "32b"}` matches `"**Olmo 3 32B**"` or `"[Olmo-3-32B](...)"`
- Only extracts if tokens match exactly (handles different word orders and separators)
- Fails if no exact match found (rather than guessing from similar names)

**For column-based tables** (benchmarks as rows, models as columns):
- Finds the column header matching the model name
- Extracts scores from that column only

**For transposed tables** (models as rows, benchmarks as columns):
- Finds the row in the first column matching the model name
- Extracts all benchmark scores from that row only

This ensures only the correct model's scores are extracted, never unrelated models or training checkpoints. 
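
The normalization described above can be reproduced in a few lines. This is a simplified sketch of the matching rules; the `normalize_model_name` function in `scripts/evaluation_manager.py` is the authoritative version:

```python
import re

def name_tokens(name: str) -> set:
    """Strip markdown links/bold, lowercase, and split on '-', '_', and spaces."""
    cleaned = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", name)   # [text](url) -> text
    cleaned = re.sub(r"\*\*([^*]+)\*\*", r"\1", cleaned)      # **bold** -> bold
    normalized = cleaned.strip().lower().replace("-", " ").replace("_", " ")
    return set(normalized.split())

# Exact token-set match: separators and word order do not matter.
print(name_tokens("OLMo-3-32B") == name_tokens("**Olmo 3 32B**"))          # → True
print(name_tokens("OLMo-3-32B") == name_tokens("[Olmo-3-32B](https://x)")) # → True
print(name_tokens("OLMo-3-32B") == name_tokens("OLMo-3-7B"))               # → False
```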

### Common Patterns

**Update Your Own Model:**
```bash
# Extract from README and push directly
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "your-username/your-model" \
  --task-type "text-generation"
```

**Update Someone Else's Model (Full Workflow):**
```bash
# Step 1: ALWAYS check for existing PRs first
uv run scripts/evaluation_manager.py get-prs \
  --repo-id "other-username/their-model"

# Step 2: If NO open PRs exist, proceed with creating one
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "other-username/their-model" \
  --create-pr

# If open PRs DO exist:
# - Warn the user about existing PRs
# - Show them the PR URLs
# - Do NOT create a new PR unless user explicitly confirms
```

**Import Fresh Benchmarks:**
```bash
# Step 1: Check for existing PRs
uv run scripts/evaluation_manager.py get-prs \
  --repo-id "anthropic/claude-sonnet-4"

# Step 2: If no PRs, import from Artificial Analysis
AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa \
  --creator-slug "anthropic" \
  --model-name "claude-sonnet-4" \
  --repo-id "anthropic/claude-sonnet-4" \
  --create-pr
```

### Troubleshooting

**Issue**: "No evaluation tables found in README"
- **Solution**: Check if README contains markdown tables with numeric scores

**Issue**: "Could not find model 'X' in transposed table"
- **Solution**: The script will display available models. Use `--model-name-override` with the exact name from the list
- **Example**: `--model-name-override "**Olmo 3-32B**"`

**Issue**: "AA_API_KEY not set"
- **Solution**: Set environment variable or add to .env file

**Issue**: "Token does not have write access"
- **Solution**: Ensure HF_TOKEN has write permissions for the repository

**Issue**: "Model not found in Artificial Analysis"
- **Solution**: Verify creator-slug and model-name match API values

**Issue**: "Payment required for hardware"
- **Solution**: Add a payment method to your Hugging Face account to use non-CPU hardware

**Issue**: "vLLM out of memory" or CUDA OOM
- **Solution**: Use a larger hardware flavor, reduce `--gpu-memory-utilization`, or use `--tensor-parallel-size` for multi-GPU

**Issue**: "Model architecture not supported by vLLM"
- **Solution**: Use `--backend hf` (inspect-ai) or `--backend accelerate` (lighteval) for HuggingFace Transformers

**Issue**: "Trust remote code required"
- **Solution**: Add `--trust-remote-code` flag for models with custom code (e.g., Phi-2, Qwen)

**Issue**: "Chat template not found"
- **Solution**: Only use `--use-chat-template` for instruction-tuned models that include a chat template

### Integration Examples

**Python Script Integration:**
```python
import subprocess

def update_model_evaluations(repo_id):
    """Open a PR that adds evaluations extracted from the model's README."""
    result = subprocess.run([
        "python", "scripts/evaluation_manager.py",
        "extract-readme",
        "--repo-id", repo_id,
        "--create-pr",
    ], capture_output=True, text=True)

    if result.returncode == 0:
        print(f"Successfully updated {repo_id}")
    else:
        print(f"Error: {result.stderr}")
```


---

## Referenced Files

> The following files are referenced in this skill and included for context.

### scripts/evaluation_manager.py

```python
# /// script
# requires-python = ">=3.13"
# dependencies = [
#     "huggingface-hub>=1.1.4",
#     "markdown-it-py>=3.0.0",
#     "python-dotenv>=1.2.1",
#     "pyyaml>=6.0.3",
#     "requests>=2.32.5",
# ]
# ///

"""
Manage evaluation results in Hugging Face model cards.

This script provides two methods:
1. Extract evaluation tables from model README files
2. Import evaluation scores from Artificial Analysis API

Both methods update the model-index metadata in model cards.
"""

import argparse
import os
import re
from textwrap import dedent
from typing import Any, Dict, List, Optional, Tuple


def load_env() -> None:
    """Load .env if python-dotenv is available; keep help usable without it."""
    try:
        import dotenv  # type: ignore
    except ModuleNotFoundError:
        return
    dotenv.load_dotenv()


def require_markdown_it():
    try:
        from markdown_it import MarkdownIt  # type: ignore
    except ModuleNotFoundError as exc:
        raise ModuleNotFoundError(
            "markdown-it-py is required for table parsing. "
            "Install with `uv add markdown-it-py` or `pip install markdown-it-py`."
        ) from exc
    return MarkdownIt


def require_model_card():
    try:
        from huggingface_hub import ModelCard  # type: ignore
    except ModuleNotFoundError as exc:
        raise ModuleNotFoundError(
            "huggingface-hub is required for model card operations. "
            "Install with `uv add huggingface_hub` or `pip install huggingface-hub`."
        ) from exc
    return ModelCard


def require_requests():
    try:
        import requests  # type: ignore
    except ModuleNotFoundError as exc:
        raise ModuleNotFoundError(
            "requests is required for Artificial Analysis import. "
            "Install with `uv add requests` or `pip install requests`."
        ) from exc
    return requests


def require_yaml():
    try:
        import yaml  # type: ignore
    except ModuleNotFoundError as exc:
        raise ModuleNotFoundError(
            "PyYAML is required for YAML output. "
            "Install with `uv add pyyaml` or `pip install pyyaml`."
        ) from exc
    return yaml


# ============================================================================
# Method 1: Extract Evaluations from README
# ============================================================================


def extract_tables_from_markdown(markdown_content: str) -> List[str]:
    """Extract all markdown tables from content."""
    # Pattern to match markdown tables
    table_pattern = r"(\|[^\n]+\|(?:\r?\n\|[^\n]+\|)+)"
    tables = re.findall(table_pattern, markdown_content)
    return tables


def parse_markdown_table(table_str: str) -> Tuple[List[str], List[List[str]]]:
    """
    Parse a markdown table string into headers and rows.

    Returns:
        Tuple of (headers, data_rows)
    """
    lines = [line.strip() for line in table_str.strip().split("\n")]

    # Remove separator line (the one with dashes)
    lines = [line for line in lines if not re.match(r"^\|[\s\-:]+\|$", line)]

    if len(lines) < 2:
        return [], []

    # Parse header
    header = [cell.strip() for cell in lines[0].split("|")[1:-1]]

    # Parse data rows
    data_rows = []
    for line in lines[1:]:
        cells = [cell.strip() for cell in line.split("|")[1:-1]]
        if cells:
            data_rows.append(cells)

    return header, data_rows


def is_evaluation_table(header: List[str], rows: List[List[str]]) -> bool:
    """Determine if a table contains evaluation results."""
    if not header or not rows:
        return False

    # Check if first column looks like benchmark names
    benchmark_keywords = [
        "benchmark", "task", "dataset", "eval", "test", "metric",
        "mmlu", "humaneval", "gsm", "hellaswag", "arc", "winogrande",
        "truthfulqa", "boolq", "piqa", "siqa"
    ]

    first_col = header[0].lower()
    has_benchmark_header = any(keyword in first_col for keyword in benchmark_keywords)

    # Check if there are numeric values in the table
    has_numeric_values = False
    for row in rows:
        for cell in row:
            try:
                float(cell.replace("%", "").replace(",", ""))
                has_numeric_values = True
                break
            except ValueError:
                continue
        if has_numeric_values:
            break

    return has_benchmark_header or has_numeric_values


def normalize_model_name(name: str) -> tuple[set[str], str]:
    """
    Normalize a model name for matching.

    Args:
        name: Model name to normalize

    Returns:
        Tuple of (token_set, normalized_string)
    """
    # Remove markdown formatting
    cleaned = re.sub(r'\[([^\]]+)\]\([^\)]+\)', r'\1', name)  # Remove markdown links
    cleaned = re.sub(r'\*\*([^\*]+)\*\*', r'\1', cleaned)  # Remove bold
    cleaned = cleaned.strip()

    # Normalize and tokenize
    normalized = cleaned.lower().replace("-", " ").replace("_", " ")
    tokens = set(normalized.split())

    return tokens, normalized


def find_main_model_column(header: List[str], model_name: str) -> Optional[int]:
    """
    Identify the column index that corresponds to the main model.

    Only returns a column if there's an exact normalized match with the model name.
    This prevents extracting scores from training checkpoints or similar models.

    Args:
        header: Table column headers
        model_name: Model name from repo_id (e.g., "OLMo-3-32B-Think")

    Returns:
        Column index of the main model, or None if no exact match found
    """
    if not header or not model_name:
        return None

    # Normalize model name and extract tokens
    model_tokens, _ = normalize_model_name(model_name)

    # Find exact matches only
    for i, col_name in enumerate(header):
        if not col_name:
            continue

        # Skip first column (benchmark names)
        if i == 0:
            continue

        col_tokens, _ = normalize_model_name(col_name)

        # Check for exact token match
        if model_tokens == col_tokens:
            return i

    # No exact match found
    return None


def find_main_model_row(
    rows: List[List[str]], model_name: str
) -> tuple[Optional[int], List[str]]:
    """
    Identify the row index that corresponds to the main model in a transposed table.

    In transposed tables, each row represents a different model, with the first
    column containing the model name.

    Args:
        rows: Table data rows
        model_name: Model name from repo_id (e.g., "OLMo-3-32B")

    Returns:
        Tuple of (row_index, available_models)
        - row_index: Index of the main model, or None if no exact match found
        - available_models: List of all model names found in the table
    """
    if not rows or not model_name:
        return None, []

    model_tokens, _ = normalize_model_name(model_name)
    available_models = []

    for i, row in enumerate(rows):
        if not row or not row[0]:
            continue

        row_name = row[0].strip()

        # Skip separator/header rows
        if not row_name or row_name.startswith('---'):
            continue

        row_tokens, _ = normalize_model_name(row_name)

        # Collect all non-empty model names
        if row_tokens:
            available_models.append(row_name)

        # Check for exact token match
        if model_tokens == row_tokens:
            return i, available_models

    return None, available_models


def is_transposed_table(header: List[str], rows: List[List[str]]) -> bool:
    """
    Determine if a table is transposed (models as rows, benchmarks as columns).

    A table is treated as transposed when both hold:
    - The first column header suggests model names (e.g., "Model", "System"),
      or the remaining headers look like benchmark names
    - Most data cells outside the first column are numeric

    Args:
        header: Table column headers
        rows: Table data rows

    Returns:
        True if table appears to be transposed, False otherwise
    """
    if not header or not rows or len(header) < 3:
        return False

    # Check if first column header suggests model names
    first_col = header[0].lower()
    model_indicators = ["model", "system", "llm", "name"]
    has_model_header = any(indicator in first_col for indicator in model_indicators)

    # Check if remaining headers look like benchmarks
    benchmark_keywords = [
        "mmlu", "humaneval", "gsm", "hellaswag", "arc", "winogrande",
        "eval", "score", "benchmark", "test", "math", "code", "mbpp",
        "truthfulqa", "boolq", "piqa", "siqa", "drop", "squad"
    ]

    benchmark_header_count = 0
    for col_name in header[1:]:
        col_lower = col_name.lower()
        if any(keyword in col_lower for keyword in benchmark_keywords):
            benchmark_header_count += 1

    has_benchmark_headers = benchmark_header_count >= 2

    # Check if data rows have numeric values in most columns (except first)
    numeric_count = 0
    total_cells = 0

    for row in rows[:5]:  # Check first 5 rows
        for cell in row[1:]:  # Skip first column
            total_cells += 1
            try:
                float(cell.replace("%", "").replace(",", "").strip())
                numeric_count += 1
            except (ValueError, AttributeError):
                continue

    has_numeric_data = total_cells > 0 and (numeric_count / total_cells) > 0.5

    return (has_model_header or has_benchmark_headers) and has_numeric_data
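
# Illustrative example (comment only):
#   header = ["Model", "MMLU", "GSM8K", "HumanEval"]
#   rows   = [["OLMo-3-32B", "81.2", "92.0", "66.5"],
#             ["Qwen-32B",   "83.1", "94.6", "70.1"]]
# The first header contains "model", two headers match benchmark keywords,
# and every sampled data cell is numeric, so is_transposed_table(header, rows)
# returns True. (Model names and scores are hypothetical.)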


def extract_metrics_from_table(
    header: List[str],
    rows: List[List[str]],
    table_format: str = "auto",
    model_name: Optional[str] = None,
    model_column_index: Optional[int] = None
) -> List[Dict[str, Any]]:
    """
    Extract metrics from parsed table data.

    Args:
        header: Table column headers
        rows: Table data rows
        table_format: "rows" (benchmarks as rows), "columns" (benchmarks as columns),
                     "transposed" (models as rows, benchmarks as columns), or "auto"
        model_name: Optional model name to identify the correct column/row
        model_column_index: Optional explicit column index for the model's
                            scores (overrides name-based matching)

    Returns:
        List of metric dictionaries with name, type, and value
    """
    metrics = []

    if table_format == "auto":
        # First check if it's a transposed table (models as rows)
        if is_transposed_table(header, rows):
            table_format = "transposed"
        else:
            # Check if first column header is empty/generic (indicates benchmarks in rows)
            first_header = header[0].lower().strip() if header else ""
            is_first_col_benchmarks = not first_header or first_header in ["", "benchmark", "task", "dataset", "metric", "eval"]

            if is_first_col_benchmarks:
                table_format = "rows"
            else:
                # Heuristic: if first row has mostly numeric values, benchmarks are columns
                try:
                    numeric_count = sum(
                        1 for cell in rows[0] if cell and
                        re.match(r"^\d+\.?\d*%?$", cell.replace(",", "").strip())
                    )
                    table_format = "columns" if numeric_count > len(rows[0]) / 2 else "rows"
                except (IndexError, ValueError):
                    table_format = "rows"

    if table_format == "rows":
        # Benchmarks are in rows, scores in columns
        # Try to identify the main model column if model_name is provided
        target_column = model_column_index
        if target_column is None and model_name:
            target_column = find_main_model_column(header, model_name)

        for row in rows:
            if not row:
                continue

            benchmark_name = row[0].strip()
            if not benchmark_name:
                continue

            # If we identified a specific column, use it; otherwise use first numeric value
            if target_column is not None and target_column < len(row):
                try:
                    value_str = row[target_column].replace("%", "").replace(",", "").strip()
                    if value_str:
                        value = float(value_str)
                        metrics.append({
                            "name": benchmark_name,
                            "type": benchmark_name.lower().replace(" ", "_"),
                            "value": value
                        })
                except (ValueError, IndexError):
                    pass
            else:
                # Extract numeric values from remaining columns (original behavior)
                for i, cell in enumerate(row[1:], start=1):
                    try:
                        # Remove common suffixes and convert to float
                        value_str = cell.replace("%", "").replace(",", "").strip()
                        if not value_str:
                            continue

                        value = float(value_str)

                        # Determine metric name
                        metric_name = benchmark_name
                        if len(header) > i and header[i].lower() not in ["score", "value", "result"]:
                            metric_name = f"{benchmark_name} ({header[i]})"

                        metrics.append({
                            "name": metric_name,
                            "type": benchmark_name.lower().replace(" ", "_"),
                            "value": value
                        })
                        break  # Only take first numeric value per row
                    except (ValueError, IndexError):
                        continue

    elif table_format == "transposed":
        # Models are in rows (first column), benchmarks are in columns (header)
        # Find the row that matches the target model
        if not model_name:
            print("Warning: model_name required for transposed table format")
            return metrics

        target_row_idx, available_models = find_main_model_row(rows, model_name)

        if target_row_idx is None:
            print(f"\n⚠ Could not find model '{model_name}' in transposed table")
            if available_models:
                print("\nAvailable models in table:")
                for i, model in enumerate(available_models, 1):
                    print(f"  {i}. {model}")
                print("\nPlease select the correct model name from the list above.")
                print("You can specify it using the --model-name-override flag:")
                print(f'  --model-name-override "{available_models[0]}"')
            return metrics

        target_row = rows[target_row_idx]

        # Extract metrics from each column (skip first column which is model name)
        for i in range(1, len(header)):
            benchmark_name = header[i].strip()
            if not benchmark_name or i >= len(target_row):
                continue

            try:
                value_str = target_row[i].replace("%", "").replace(",", "").strip()
                if not value_str:
                    continue

                value = float(value_str)

                metrics.append({
                    "name": benchmark_name,
                    "type": benchmark_name.lower().replace(" ", "_").replace("-", "_"),
                    "value": value
                })
            except (ValueError, AttributeError):
                continue

    else:  # table_format == "columns"
        # Benchmarks are in columns
        if not rows:
            return metrics

        # Use first data row for values
        data_row = rows[0]

        for i, benchmark_name in enumerate(header):
            if not benchmark_name or i >= len(data_row):
                continue

            try:
                value_str = data_row[i].replace("%", "").replace(",", "").strip()
                if not value_str:
                    continue

                value = float(value_str)

                metrics.append({
                    "name": benchmark_name,
                    "type": benchmark_name.lower().replace(" ", "_"),
                    "value": value
                })
            except ValueError:
                continue

    return metrics
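
# Illustrative example (comment only) of the "rows" format, where benchmarks
# are rows and the target model's scores sit in one column:
#   header = ["Benchmark", "OLMo-3-32B-Think", "Baseline"]
#   rows   = [["MMLU", "81.2%", "79.0%"], ["GSM8K", "92.0", "90.5"]]
# Assuming normalize_model_name() yields an exact token match on column 1,
# extract_metrics_from_table(header, rows, model_name="OLMo-3-32B-Think")
# would produce:
#   [{"name": "MMLU",  "type": "mmlu",  "value": 81.2},
#    {"name": "GSM8K", "type": "gsm8k", "value": 92.0}]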


def extract_evaluations_from_readme(
    repo_id: str,
    task_type: str = "text-generation",
    dataset_name: str = "Benchmarks",
    dataset_type: str = "benchmark",
    model_name_override: Optional[str] = None,
    table_index: Optional[int] = None,
    model_column_index: Optional[int] = None
) -> Optional[List[Dict[str, Any]]]:
    """
    Extract evaluation results from a model's README.

    Args:
        repo_id: Hugging Face model repository ID
        task_type: Task type for model-index (e.g., "text-generation")
        dataset_name: Name for the benchmark dataset
        dataset_type: Type identifier for the dataset
        model_name_override: Override model name for matching (column header for comparison tables)
        table_index: 1-indexed table number from inspect-tables output
        model_column_index: Explicit column index for the model's scores
                            (overrides name-based column matching)

    Returns:
        Model-index formatted results or None if no evaluations found
    """
    try:
        load_env()
        ModelCard = require_model_card()
        hf_token = os.getenv("HF_TOKEN")
        card = ModelCard.load(repo_id, token=hf_token)
        readme_content = card.content

        if not readme_content:
            print(f"No README content found for {repo_id}")
            return None

        # Extract model name from repo_id or use override
        if model_name_override:
            model_name = model_name_override
            print(f"Using model name override: '{model_name}'")
        else:
            model_name = repo_id.split("/")[-1] if "/" in repo_id else repo_id

        # Use markdown-it parser for accurate table extraction
        all_tables = extract_tables_with_parser(readme_content)

        if not all_tables:
            print(f"No tables found in README for {repo_id}")
            return None

        # If table_index specified, use that specific table
        if table_index is not None:
            if table_index < 1 or table_index > len(all_tables):
                print(f"Invalid table index {table_index}. Found {len(all_tables)} tables.")
                print("Run inspect-tables to see available tables.")
                return None
            tables_to_process = [all_tables[table_index - 1]]
        else:
            # Filter to evaluation tables only
            eval_tables = []
            for table in all_tables:
                header = table.get("headers", [])
                rows = table.get("rows", [])
                if is_evaluation_table(header, rows):
                    eval_tables.append(table)

            if len(eval_tables) > 1:
                print(f"\n⚠ Found {len(eval_tables)} evaluation tables.")
                print("Run inspect-tables first, then use --table to select one:")
                print(f'  uv run scripts/evaluation_manager.py inspect-tables --repo-id "{repo_id}"')
                return None
            elif len(eval_tables) == 0:
                print(f"No evaluation tables found in README for {repo_id}")
                return None

            tables_to_process = eval_tables

        # Extract metrics from selected table(s)
        all_metrics = []
        for table in tables_to_process:
            header = table.get("headers", [])
            rows = table.get("rows", [])
            metrics = extract_metrics_from_table(
                header,
                rows,
                model_name=model_name,
                model_column_index=model_column_index
            )
            all_metrics.extend(metrics)

        if not all_metrics:
            print(f"No metrics extracted from table")
            return None

        # Build model-index structure
        display_name = repo_id.split("/")[-1] if "/" in repo_id else repo_id

        results = [{
            "task": {"type": task_type},
            "dataset": {
                "name": dataset_name,
                "type": dataset_type
            },
            "metrics": all_metrics,
            "source": {
                "name": "Model README",
                "url": f"https://huggingface.co/{repo_id}"
            }
        }]

        return results

    except Exception as e:
        print(f"Error extracting evaluations from README: {e}")
        return None


# ============================================================================
# Table Inspection (using markdown-it-py for accurate parsing)
# ============================================================================


def extract_tables_with_parser(markdown_content: str) -> List[Dict[str, Any]]:
    """
    Extract tables from markdown using markdown-it-py parser.
    Uses GFM (GitHub Flavored Markdown) which includes table support.
    """
    MarkdownIt = require_markdown_it()
    # Disable linkify to avoid optional dependency errors; not needed for table parsing.
    md = MarkdownIt("gfm-like", {"linkify": False})
    tokens = md.parse(markdown_content)

    tables = []
    i = 0
    while i < len(tokens):
        token = tokens[i]

        if token.type == "table_open":
            table_data = {"headers": [], "rows": []}
            current_row = []
            in_header = False

            i += 1
            while i < len(tokens) and tokens[i].type != "table_close":
                t = tokens[i]
                if t.type == "thead_open":
                    in_header = True
                elif t.type == "thead_close":
                    in_header = False
                elif t.type == "tr_open":
                    current_row = []
                elif t.type == "tr_close":
                    if in_header:
                        table_data["headers"] = current_row
                    else:
                        table_data["rows"].append(current_row)
                    current_row = []
                elif t.type == "inline":
                    current_row.append(t.content.strip())
                i += 1

            if table_data["headers"] or table_data["rows"]:
                tables.append(table_data)

        i += 1

    return tables
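
# Illustrative example (comment only): markdown such as
#   | Benchmark | Score |
#   |-----------|-------|
#   | MMLU      | 81.2  |
# parses to:
#   [{"headers": ["Benchmark", "Score"], "rows": [["MMLU", "81.2"]]}]
# Inline markup inside cells (e.g. "**Model**") is preserved as written,
# since the raw content of each inline token is collected.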


def detect_table_format(table: Dict[str, Any], repo_id: str) -> Dict[str, Any]:
    """Analyze a table to detect its format and identify model columns."""
    headers = table.get("headers", [])
    rows = table.get("rows", [])

    if not headers or not rows:
        return {"format": "unknown", "columns": headers, "model_columns": [], "row_count": 0, "sample_rows": []}

    first_header = headers[0].lower() if headers else ""
    is_first_col_benchmarks = not first_header or first_header in ["", "benchmark", "task", "dataset", "metric", "eval"]

    # Check for numeric columns
    numeric_columns = []
    for col_idx in range(1, len(headers)):
        numeric_count = 0
        for row in rows[:5]:
            if col_idx < len(row):
                try:
                    val = re.sub(r'\s*\([^)]*\)', '', row[col_idx])
                    float(val.replace("%", "").replace(",", "").strip())
                    numeric_count += 1
                except (ValueError, AttributeError):
                    pass
        if numeric_count > len(rows[:5]) / 2:
            numeric_columns.append(col_idx)

    # Determine format
    if is_first_col_benchmarks and len(numeric_columns) > 1:
        format_type = "comparison"
    elif is_first_col_benchmarks and len(numeric_columns) == 1:
        format_type = "simple"
    elif len(numeric_columns) > len(headers) / 2:
        format_type = "transposed"
    else:
        format_type = "unknown"

    # Find model columns
    model_columns = []
    model_name = repo_id.split("/")[-1] if "/" in repo_id else repo_id
    model_tokens, _ = normalize_model_name(model_name)

    for idx, header in enumerate(headers):
        if idx == 0 and is_first_col_benchmarks:
            continue
        if header:
            header_tokens, _ = normalize_model_name(header)
            is_match = model_tokens == header_tokens
            is_partial = model_tokens.issubset(header_tokens) or header_tokens.issubset(model_tokens)
            model_columns.append({
                "index": idx,
                "header": header,
                "is_exact_match": is_match,
                "is_partial_match": is_partial and not is_match
            })

    return {
        "format": format_type,
        "columns": headers,
        "model_columns": model_columns,
        "row_count": len(rows),
        "sample_rows": [row[0] for row in rows[:5] if row]
    }


def inspect_tables(repo_id: str) -> None:
    """Inspect and display all evaluation tables in a model's README."""
    try:
        load_env()
        ModelCard = require_model_card()
        hf_token = os.getenv("HF_TOKEN")
        card = ModelCard.load(repo_id, token=hf_token)
        readme_content = card.content

        if not readme_content:
            print(f"No README content found for {repo_id}")
            return

        tables = extract_tables_with_parser(readme_content)

        if not tables:
            print(f"No tables found in README for {repo_id}")
            return

        print(f"\n{'='*70}")
        print(f"Tables found in README for: {repo_id}")
        print(f"{'='*70}")

        eval_table_count = 0
        for table in tables:
            analysis = detect_table_format(table, repo_id)

            if analysis["format"] == "unknown" and not analysis.get("sample_rows"):
                continue

            eval_table_count += 1
            print(f"\n## Table {eval_table_count}")
            print(f"   Format: {analysis['format']}")
            print(f"   Rows: {analysis['row_count']}")

            print(f"\n   Columns ({len(analysis['columns'])}):")
            for col_info in analysis.get("model_columns", []):
                idx = col_info["index"]
                header = col_info["header"]
                if col_info["is_exact_match"]:
                    print(f"      [{idx}] {header}  ✓ EXACT MATCH")
                elif col_info["is_partial_match"]:
                    print(f"      [{idx}] {header}  ~ partial match")
                else:
                    print(f"      [{idx}] {header}")

            if analysis.get("sample_rows"):
                print(f"\n   Sample rows (first column):")
                for row_val in analysis["sample_rows"][:5]:
                    print(f"      - {row_val}")

        if eval_table_count == 0:
            print("\nNo evaluation tables detected.")
        else:
            print("\nSuggested next step:")
            print(f'  uv run scripts/evaluation_manager.py extract-readme --repo-id "{repo_id}" --table <table-number> [--model-column-index <column-index>]')

        print(f"\n{'='*70}\n")

    except Exception as e:
        print(f"Error inspecting tables: {e}")


# ============================================================================
# Pull Request Management
# ============================================================================


def get_open_prs(repo_id: str) -> List[Dict[str, Any]]:
    """
    Fetch open pull requests for a Hugging Face model repository.

    Args:
        repo_id: Hugging Face model repository ID (e.g., "allenai/Olmo-3-32B-Think")

    Returns:
        List of open PR dictionaries with num, title, author, and createdAt
    """
    requests = require_requests()
    url = f"https://huggingface.co/api/models/{repo_id}/discussions"

    try:
        response = requests.get(url, timeout=30, allow_redirects=True)
        response.raise_for_status()

        data = response.json()
        discussions = data.get("discussions", [])

        open_prs = [
            {
                "num": d["num"],
                "title": d["title"],
                "author": d["author"]["name"],
                "createdAt": d.get("createdAt", "unknown"),
            }
            for d in discussions
            if d.get("status") == "open" and d.get("isPullRequest")
        ]

        return open_prs

    except requests.RequestException as e:
        print(f"Error fetching PRs from Hugging Face: {e}")
        return []
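
# Illustrative response shape (comment only), inferred from the fields read
# above; the live API may include additional fields:
#   {"discussions": [{"num": 5, "title": "Add evaluation results",
#                     "author": {"name": "someuser"}, "status": "open",
#                     "isPullRequest": true,
#                     "createdAt": "2024-01-01T00:00:00Z"}]}
# get_open_prs() keeps only entries with status == "open" and
# isPullRequest == true.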


def list_open_prs(repo_id: str) -> None:
    """Display open pull requests for a model repository."""
    prs = get_open_prs(repo_id)

    print(f"\n{'='*70}")
    print(f"Open Pull Requests for: {repo_id}")
    print(f"{'='*70}")

    if not prs:
        print("\nNo open pull requests found.")
    else:
        print(f"\nFound {len(prs)} open PR(s):\n")
        for pr in prs:
            print(f"  PR #{pr['num']} - {pr['title']}")
            print(f"     Author: {pr['author']}")
            print(f"     Created: {pr['createdAt']}")
            print(f"     URL: https://huggingface.co/{repo_id}/discussions/{pr['num']}")
            print()

    print(f"{'='*70}\n")


# ============================================================================
# Method 2: Import from Artificial Analysis
# ============================================================================


def get_aa_model_data(creator_slug: str, model_name: str) -> Optional[Dict[str, Any]]:
    """
    Fetch model evaluation data from Artificial Analysis API.

    Args:
        creator_slug: Creator identifier (e.g., "anthropic", "openai")
        model_name: Model slug/identifier

    Returns:
        Model data dictionary or None if not found
    """
    load_env()
    AA_API_KEY = os.getenv("AA_API_KEY")
    if not AA_API_KEY:
        raise ValueError("AA_API_KEY environment variable is not set")

    url = "https://artificialanalysis.ai/api/v2/data/llms/models"
    headers = {"x-api-key": AA_API_KEY}

    requests = require_requests()

    try:
        response = requests.get(url, headers=headers, timeout=30)
        response.raise_for_status()

        data = response.json().get("data", [])

        for model in data:
            creator = model.get("model_creator", {})
            if creator.get("slug") == creator_slug and model.get("slug") == model_name:
                return model

        print(f"Model {creator_slug}/{model_name} not found in Artificial Analysis")
        return None

    except requests.RequestException as e:
        print(f"Error fetching data from Artificial Analysis: {e}")
        return None


def aa_data_to_model_index(
    model_data: Dict[str, Any],
    dataset_name: str = "Artificial Analysis Benchmarks",
    dataset_type: str = "artificial_analysis",
    task_type: str = "evaluation"
) -> List[Dict[str, Any]]:
    """
    Convert Artificial Analysis model data to model-index format.

    Args:
        model_data: Raw model data from AA API
        dataset_name: Dataset name for model-index
        dataset_type: Dataset type identifier
        task_type: Task type for model-index

    Returns:
        Model-index formatted results
    """
    model_name = model_data.get("name", model_data.get("slug", "unknown-model"))
    evaluations = model_data.get("evaluations", {})

    if not evaluations:
        print(f"No evaluations found for model {model_name}")
        return []

    metrics = []
    for key, value in evaluations.items():
        if value is not None:
            metrics.append({
                "name": key.replace("_", " ").title(),
                "type": key,
                "value": value
            })

    results = [{
        "task": {"type": task_type},
        "dataset": {
            "name": dataset_name,
            "type": dataset_type
        },
        "metrics": metrics,
        "source": {
            "name": "Artificial Analysis API",
            "url": "https://artificialanalysis.ai"
        }
    }]

    return results
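
# Illustrative example (comment only): an AA payload fragment such as
#   {"name": "Example Model", "evaluations": {"mmlu": 0.82, "gsm8k": None}}
# yields one result set whose metrics list is
#   [{"name": "Mmlu", "type": "mmlu", "value": 0.82}]
# None-valued evaluations are skipped. (The payload keys shown here are
# hypothetical; consult the AA API response for the actual evaluation names.)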


def import_aa_evaluations(
    creator_slug: str,
    model_name: str,
    repo_id: str
) -> Optional[List[Dict[str, Any]]]:
    """
    Import evaluation results from Artificial Analysis for a model.

    Args:
        creator_slug: Creator identifier in AA
        model_name: Model identifier in AA
        repo_id: Hugging Face repository ID to update

    Returns:
        Model-index formatted results or None if import fails
    """
    model_data = get_aa_model_data(creator_slug, model_name)

    if not model_data:
        return None

    results = aa_data_to_model_index(model_data)
    return results


# ============================================================================
# Model Card Update Functions
# ============================================================================


def update_model_card_with_evaluations(
    repo_id: str,
    results: List[Dict[str, Any]],
    create_pr: bool = False,
    commit_message: Optional[str] = None
) -> bool:
    """
    Update a model card with evaluation results.

    Args:
        repo_id: Hugging Face repository ID
        results: Model-index formatted results
        create_pr: Whether to create a PR instead of direct push
        commit_message: Custom commit message

    Returns:
        True if successful, False otherwise
    """
    try:
        load_env()
        ModelCard = require_model_card()
        hf_token = os.getenv("HF_TOKEN")
        if not hf_token:
            raise ValueError("HF_TOKEN environment variable is not set")

        # Load existing card
        card = ModelCard.load(repo_id, token=hf_token)

        # Get model name
        model_name = repo_id.split("/")[-1] if "/" in repo_id else repo_id

        # Create or update model-index
        model_index = [{
            "name": model_name,
            "results": results
        }]

        # Merge with existing model-index if present
        if "model-index" in card.data:
            existing = card.data["model-index"]
            if isinstance(existing, list) and existing:
                # Keep existing name if present
                if "name" in existing[0]:
                    model_index[0]["name"] = existing[0]["name"]

                # Merge results: newly extracted results first, then existing
                existing_results = existing[0].get("results", [])
                model_index[0]["results"].extend(existing_results)

        card.data["model-index"] = model_index

        # Prepare commit message
        if not commit_message:
            commit_message = f"Add evaluation results to {model_name}"

        commit_description = (
            "This commit adds structured evaluation results to the model card. "
            "The results are formatted using the model-index specification and "
            "will be displayed in the model card's evaluation widget."
        )

        # Push update
        card.push_to_hub(
            repo_id,
            token=hf_token,
            commit_message=commit_message,
            commit_description=commit_description,
            create_pr=create_pr
        )

        action = "Pull request created" if create_pr else "Model card updated"
        print(f"✓ {action} successfully for {repo_id}")
        return True

    except Exception as e:
        print(f"Error updating model card: {e}")
        return False
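
# Resulting model card metadata (comment only; values are illustrative,
# matching the structure built above per the model-index format):
#   model-index:
#     - name: <model name>
#       results:
#         - task: {type: text-generation}
#           dataset: {name: Benchmarks, type: benchmark}
#           metrics:
#             - {name: MMLU, type: mmlu, value: 81.2}
#           source: {name: Model README, url: https://huggingface.co/<repo_id>}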


def show_evaluations(repo_id: str) -> None:
    """Display current evaluations in a model card."""
    try:
        load_env()
        ModelCard = require_model_card()
        hf_token = os.getenv("HF_TOKEN")
        card = ModelCard.load(repo_id, token=hf_token)

        if "model-index" not in card.data:
            print(f"No model-index found in {repo_id}")
            return

        model_index = card.data["model-index"]

        print(f"\nEvaluations for {repo_id}:")
        print("=" * 60)

        for model_entry in model_index:
            model_name = model_entry.get("name", "Unknown")
            print(f"\nModel: {model_name}")

            results = model_entry.get("results", [])
            for i, result in enumerate(results, 1):
                print(f"\n  Result Set {i}:")

                task = result.get("task", {})
                print(f"    Task: {task.get('type', 'unknown')}")

                dataset = result.get("dataset", {})
                print(f"    Dataset: {dataset.get('name', 'unknown')}")

                metrics = result.get("metrics", [])
                print(f"    Metrics ({len(metrics)}):")
                for metric in metrics:
                    name = metric.get("name", "Unknown")
                    value = metric.get("value", "N/A")
                    print(f"      - {name}: {value}")

                source = result.get("source", {})
                if source:
                    print(f"    Source: {source.get('name', 'Unknown')}")

        print("\n" + "=" * 60)

    except Exception as e:
        print(f"Error showing evaluations: {e}")


def validate_model_index(repo_id: str) -> bool:
    """Validate model-index format in a model card."""
    try:
        load_env()
        ModelCard = require_model_card()
        hf_token = os.getenv("HF_TOKEN")
        card = ModelCard.load(repo_id, token=hf_token)

        if "model-index" not in card.data:
            print(f"✗ No model-index found in {repo_id}")
            return False

        model_index = card.data["model-index"]

        if not isinstance(model_index, list):
            print("✗ model-index must be a list")
            return False

        for i, entry in enumerate(model_index):
            if "name" not in entry:
                print(f"✗ Entry {i} missing 'name' field")
                return False

            if "results" not in entry:
                print(f"✗ Entry {i} missing 'results' field")
                return False

            for j, result in enumerate(entry["results"]):
                if "task" not in result:
                    print(f"✗ Result {j} in entry {i} missing 'task' field")
                    return False

                if "dataset" not in result:
                    print(f"✗ Result {j} in entry {i} missing 'dataset' field")
                    return False

                if "metrics" not in result:
                    print(f"✗ Result {j} in entry {i} missing 'metrics' field")
                    return False

        print(f"✓ Model-index format is valid for {repo_id}")
        return True

    except Exception as e:
        print(f"Error validating model-index: {e}")
        return False


# ============================================================================
# CLI Interface
# ============================================================================


def main():
    parser = argparse.ArgumentParser(
        description=(
            "Manage evaluation results in Hugging Face model cards.\n\n"
            "Use standard Python or `uv run scripts/evaluation_manager.py ...` "
            "to auto-resolve dependencies from the PEP 723 header."
        ),
        formatter_class=argparse.RawTextHelpFormatter,
        epilog=dedent(
            """\
            Typical workflows:
              - Inspect tables first:
                  uv run scripts/evaluation_manager.py inspect-tables --repo-id <model>
              - Extract from README (prints YAML by default):
                  uv run scripts/evaluation_manager.py extract-readme --repo-id <model> --table N
              - Apply changes:
                  uv run scripts/evaluation_manager.py extract-readme --repo-id <model> --table N --apply
              - Import from Artificial Analysis:
                  AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa --creator-slug org --model-name slug --repo-id <model>

            Tips:
              - YAML is printed by default; rerun with --apply (direct push) or --create-pr (open a PR) to write changes.
              - Set HF_TOKEN (and AA_API_KEY for import-aa); .env is loaded automatically if python-dotenv is installed.
              - When multiple tables exist, run inspect-tables first, then select one with --table N.
            """
        ),
    )
    parser.add_argument("--version", action="version", version="evaluation_manager 1.2.0")

    subparsers = parser.add_subparsers(dest="command", help="Command to execute")

    # Extract from README command
    extract_parser = subparsers.add_parser(
        "extract-readme",
        help="Extract evaluation tables from model README",
        formatter_class=argparse.RawTextHelpFormatter,
        description="Parse README tables into model-index YAML. Default behavior prints YAML; use --apply/--create-pr to write changes.",
        epilog=dedent(
            """\
            Examples:
              uv run scripts/evaluation_manager.py extract-readme --repo-id username/model
              uv run scripts/evaluation_manager.py extract-readme --repo-id username/model --table 2 --model-column-index 3
              uv run scripts/evaluation_manager.py extract-readme --repo-id username/model --table 2 --model-name-override \"**Model 7B**\"  # exact header text
              uv run scripts/evaluation_manager.py extract-readme --repo-id username/model --table 2 --create-pr

            Apply changes:
              - Default: prints YAML to stdout (no writes).
              - Add --apply to push directly, or --create-pr to open a PR.
            Model selection:
              - Preferred: --model-column-index <header index shown by inspect-tables>
              - If using --model-name-override, copy the column header text exactly.
            """
        ),
    )
    extract_parser.add_argument("--repo-id", type=str, required=True, help="HF repository ID")
    extract_parser.add_argument("--table", type=int, help="Table number (1-indexed, from inspect-tables output)")
    extract_parser.add_argument("--model-column-index", type=int, help="Preferred: column index from inspect-tables output (exact selection)")
    extract_parser.add_argument("--model-name-override", type=str, help="Exact column header/model name for comparison/transpose tables (when index is not used)")
    extract_parser.add_argument("--task-type", type=str, default="text-generation", help="Sets model-index task.type (e.g., text-generation, summarization)")
    extract_parser.add_argument("--dataset-name", type=str, default="Benchmarks", help="Dataset name")
    extract_parser.add_argument("--dataset-type", type=str, default="benchmark", help="Dataset type")
    extract_parser.add_argument("--create-pr", action="store_true", help="Create PR instead of direct push")
    extract_parser.add_argument("--apply", action="store_true", help="Apply changes (default is to print YAML only)")
    extract_parser.add_argument("--dry-run", action="store_true", help="Explicitly preview YAML without writing (this is already the default behavior)")

    # Import from AA command
    aa_parser = subparsers.add_parser(
        "import-aa",
        help="Import evaluation scores from Artificial Analysis",
        formatter_class=argparse.RawTextHelpFormatter,
        description="Fetch scores from Artificial Analysis API and write them into model-index.",
        epilog=dedent(
            """\
            Examples:
              AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa --creator-slug anthropic --model-name claude-sonnet-4 --repo-id username/model
              uv run scripts/evaluation_manager.py import-aa --creator-slug openai --model-name gpt-4o --repo-id username/model --create-pr

            Requires: AA_API_KEY in env (or .env if python-dotenv installed).
            """
        ),
    )
    aa_parser.add_argument("--creator-slug", type=str, required=True, help="AA creator slug")
    aa_parser.add_argument("--model-name", type=str, required=True, help="AA model name")
    aa_parser.add_argument("--repo-id", type=str, required=True, help="HF repository ID")
    aa_parser.add_argument("--create-pr", action="store_true", help="Create PR instead of direct push")

    # Show evaluations command
    show_parser = subparsers.add_parser(
        "show",
        help="Display current evaluations in model card",
        formatter_class=argparse.RawTextHelpFormatter,
        description="Print model-index content from the model card (requires HF_TOKEN for private repos).",
    )
    show_parser.add_argument("--repo-id", type=str, required=True, help="HF repository ID")

    # Validate command
    validate_parser = subparsers.add_parser(
        "validate",
        help="Validate model-index format",
        formatter_class=argparse.RawTextHelpFormatter,
        description="Schema sanity check for model-index section of the card.",
    )
    validate_parser.add_argument("--repo-id", type=str, required=True, help="HF repository ID")

    # Inspect tables command
    inspect_parser = subparsers.add_parser(
        "inspect-tables",
        help="Inspect tables in README → outputs suggested extract-readme command",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Workflow:
  1. inspect-tables     → see table structure, columns, and table numbers
  2. extract-readme     → run with --table N (from step 1); YAML prints by default
  3. apply changes      → rerun extract-readme with --apply or --create-pr

Reminder:
  - Preferred: use --model-column-index <index>. If needed, use --model-name-override with the exact column header text.
"""
    )
    inspect_parser.add_argument("--repo-id", type=str, required=True, help="HF repository ID")

    # Get PRs command
    prs_parser = subparsers.add_parser(
        "get-prs",
        help="List open pull requests for a model repository",
        formatter_class=argparse.RawTextHelpFormatter,
        description="Check for existing open PRs before creating new ones to avoid duplicates.",
        epilog=dedent(
            """\
            Examples:
              uv run scripts/evaluation_manager.py get-prs --repo-id "allenai/Olmo-3-32B-Think"

            IMPORTANT: Always run this before using --create-pr to avoid duplicate PRs.
            """
        ),
    )
    prs_parser.add_argument("--repo-id", type=str, required=True, help="HF repository ID")

    args = parser.parse_args()

    if not args.command:
        parser.print_help()
        return

    try:
        # Execute command
        if args.command == "extract-readme":
            results = extract_evaluations_from_readme(
                repo_id=args.repo_id,
                task_type=args.task_type,
                dataset_name=args.dataset_name,
                dataset_type=args.dataset_type,
                model_name_override=args.model_name_override,
                table_index=args.table,
                model_column_index=args.model_column_index
            )

            if not results:
                print("No evaluations extracted")
                return

            apply_changes = args.apply or args.create_pr

            # Default behavior: print YAML (dry-run)
            yaml = require_yaml()
            print("\nExtracted evaluations (YAML):")
            print(
                yaml.dump(
                    {"model-index": [{"name": args.repo_id.split('/')[-1], "results": results}]},
                    sort_keys=False
                )
            )

            if apply_changes:
                if args.model_name_override and args.model_column_index is not None:
                    print("Note: --model-column-index takes precedence over --model-name-override.")
                update_model_card_with_evaluations(
                    repo_id=args.repo_id,
                    results=results,
                    create_pr=args.create_pr,
                    commit_message="Extract evaluation results from README"
                )

        elif args.command == "import-aa":
            results = import_aa_evaluations(
                creator_slug=args.creator_slug,
                model_name=args.model_name,
                repo_id=args.repo_id
            )

            if not results:
                print("No evaluations imported")
                return

            update_model_card_with_evaluations(
                repo_id=args.repo_id,
                results=results,
                create_pr=args.create_pr,
                commit_message=f"Add Artificial Analysis evaluations for {args.model_name}"
            )

        elif args.command == "show":
            show_evaluations(args.repo_id)

        elif args.command == "validate":
            validate_model_index(args.repo_id)

        elif args.command == "inspect-tables":
            inspect_tables(args.repo_id)

        elif args.command == "get-prs":
            list_open_prs(args.repo_id)
    except ModuleNotFoundError as exc:
        # Surface dependency hints cleanly when user only needs help output
        print(exc)
    except Exception as exc:
        print(f"Error: {exc}")
        raise SystemExit(1)


if __name__ == "__main__":
    main()

```
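
The `validate` command above only checks that every result in the model-index carries `task`, `dataset`, and `metrics` fields. For orientation, a minimal structure that passes those checks might look like the sketch below (the name, dataset, and metric values are illustrative placeholders, not output from any real evaluation):

```python
# Minimal model-index shape accepted by the validate checks above.
# All concrete values here are illustrative placeholders.
minimal_model_index = [
    {
        "name": "my-model",
        "results": [
            {
                "task": {"type": "text-generation"},
                "dataset": {"name": "Benchmarks", "type": "benchmark"},
                "metrics": [{"type": "accuracy", "value": 71.3}],
            }
        ],
    }
]

# Re-run the same per-result field checks the script performs.
for entry in minimal_model_index:
    for result in entry["results"]:
        assert all(key in result for key in ("task", "dataset", "metrics"))
```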

### scripts/run_eval_job.py

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "huggingface-hub>=0.26.0",
#     "python-dotenv>=1.2.1",
# ]
# ///

"""
Submit evaluation jobs using the `hf jobs uv run` CLI.

This wrapper constructs the appropriate command to execute the local
`inspect_eval_uv.py` script on Hugging Face Jobs with the requested hardware.
"""

import argparse
import os
import subprocess
import sys
from pathlib import Path
from typing import Optional

from huggingface_hub import get_token
from dotenv import load_dotenv

load_dotenv()


SCRIPT_PATH = Path(__file__).with_name("inspect_eval_uv.py").resolve()


def create_eval_job(
    model_id: str,
    task: str,
    hardware: str = "cpu-basic",
    hf_token: Optional[str] = None,
    limit: Optional[int] = None,
) -> None:
    """
    Submit an evaluation job using the Hugging Face Jobs CLI.
    """
    token = hf_token or os.getenv("HF_TOKEN") or get_token()
    if not token:
        raise ValueError("HF_TOKEN is required. Set it in environment or pass as argument.")

    if not SCRIPT_PATH.exists():
        raise FileNotFoundError(f"Script not found at {SCRIPT_PATH}")

    print(f"Preparing evaluation job for {model_id} on task {task} (hardware: {hardware})")

    cmd = [
        "hf",
        "jobs",
        "uv",
        "run",
        str(SCRIPT_PATH),
        "--flavor",
        hardware,
        "--secrets",
        f"HF_TOKEN={token}",
        "--",
        "--model",
        model_id,
        "--task",
        task,
    ]

    if limit:
        cmd.extend(["--limit", str(limit)])

    print("Executing:", " ".join(cmd))

    try:
        subprocess.run(cmd, check=True)
    except subprocess.CalledProcessError:
        print("hf jobs command failed", file=sys.stderr)
        raise


def main() -> None:
    parser = argparse.ArgumentParser(description="Run inspect-ai evaluations on Hugging Face Jobs")
    parser.add_argument("--model", required=True, help="Model ID (e.g. Qwen/Qwen3-0.6B)")
    parser.add_argument("--task", required=True, help="Inspect task (e.g. mmlu, gsm8k)")
    parser.add_argument("--hardware", default="cpu-basic", help="Hardware flavor (e.g. t4-small, a10g-small)")
    parser.add_argument("--limit", type=int, default=None, help="Limit number of samples to evaluate")

    args = parser.parse_args()

    create_eval_job(
        model_id=args.model,
        task=args.task,
        hardware=args.hardware,
        limit=args.limit,
    )


if __name__ == "__main__":
    main()

```

### scripts/lighteval_vllm_uv.py

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "lighteval[accelerate,vllm]>=0.6.0",
#     "torch>=2.0.0",
#     "transformers>=4.40.0",
#     "accelerate>=0.30.0",
#     "vllm>=0.4.0",
# ]
# ///

"""
Entry point script for running lighteval evaluations with vLLM backend via `hf jobs uv run`.

This script runs evaluations using vLLM for efficient GPU inference on custom HuggingFace models.
It is separate from inference provider scripts and evaluates models directly on the hardware.

Usage (standalone):
    python lighteval_vllm_uv.py --model "meta-llama/Llama-3.2-1B" --tasks "leaderboard|mmlu|5"

Usage (via HF Jobs):
    hf jobs uv run lighteval_vllm_uv.py \\
        --flavor a10g-small \\
        --secrets HF_TOKEN=$HF_TOKEN \\
        -- --model "meta-llama/Llama-3.2-1B" --tasks "leaderboard|mmlu|5"
"""

from __future__ import annotations

import argparse
import os
import subprocess
import sys
from typing import Optional


def setup_environment() -> None:
    """Configure environment variables for HuggingFace authentication."""
    hf_token = os.getenv("HF_TOKEN")
    if hf_token:
        os.environ.setdefault("HUGGING_FACE_HUB_TOKEN", hf_token)
        os.environ.setdefault("HF_HUB_TOKEN", hf_token)


def run_lighteval_vllm(
    model_id: str,
    tasks: str,
    output_dir: Optional[str] = None,
    max_samples: Optional[int] = None,
    batch_size: int = 1,
    tensor_parallel_size: int = 1,
    gpu_memory_utilization: float = 0.8,
    dtype: str = "auto",
    trust_remote_code: bool = False,
    use_chat_template: bool = False,
    system_prompt: Optional[str] = None,
) -> None:
    """
    Run lighteval with vLLM backend for efficient GPU inference.

    Args:
        model_id: HuggingFace model ID (e.g., "meta-llama/Llama-3.2-1B")
        tasks: Task specification (e.g., "leaderboard|mmlu|5" or "lighteval|hellaswag|0")
        output_dir: Directory for evaluation results
        max_samples: Limit number of samples per task
        batch_size: Batch size for evaluation
        tensor_parallel_size: Number of GPUs for tensor parallelism
        gpu_memory_utilization: GPU memory fraction to use (0.0-1.0)
        dtype: Data type for model weights (auto, float16, bfloat16)
        trust_remote_code: Allow executing remote code from model repo
        use_chat_template: Apply chat template for conversational models
        system_prompt: System prompt for chat models
    """
    setup_environment()

    # Build lighteval vllm command
    cmd = [
        "lighteval",
        "vllm",
        model_id,
        tasks,
        "--batch-size", str(batch_size),
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--dtype", dtype,
    ]

    if output_dir:
        cmd.extend(["--output-dir", output_dir])

    if max_samples:
        cmd.extend(["--max-samples", str(max_samples)])

    if trust_remote_code:
        cmd.append("--trust-remote-code")

    if use_chat_template:
        cmd.append("--use-chat-template")

    if system_prompt:
        cmd.extend(["--system-prompt", system_prompt])

    print(f"Running: {' '.join(cmd)}")

    try:
        subprocess.run(cmd, check=True)
        print("Evaluation complete.")
    except subprocess.CalledProcessError as exc:
        print(f"Evaluation failed with exit code {exc.returncode}", file=sys.stderr)
        sys.exit(exc.returncode)


def run_lighteval_accelerate(
    model_id: str,
    tasks: str,
    output_dir: Optional[str] = None,
    max_samples: Optional[int] = None,
    batch_size: int = 1,
    dtype: str = "bfloat16",
    trust_remote_code: bool = False,
    use_chat_template: bool = False,
    system_prompt: Optional[str] = None,
) -> None:
    """
    Run lighteval with accelerate backend for multi-GPU distributed inference.

    Use this backend when vLLM is not available or for models not supported by vLLM.

    Args:
        model_id: HuggingFace model ID
        tasks: Task specification
        output_dir: Directory for evaluation results
        max_samples: Limit number of samples per task
        batch_size: Batch size for evaluation
        dtype: Data type for model weights
        trust_remote_code: Allow executing remote code
        use_chat_template: Apply chat template
        system_prompt: System prompt for chat models
    """
    setup_environment()

    # Build lighteval accelerate command
    cmd = [
        "lighteval",
        "accelerate",
        model_id,
        tasks,
        "--batch-size", str(batch_size),
        "--dtype", dtype,
    ]

    if output_dir:
        cmd.extend(["--output-dir", output_dir])

    if max_samples:
        cmd.extend(["--max-samples", str(max_samples)])

    if trust_remote_code:
        cmd.append("--trust-remote-code")

    if use_chat_template:
        cmd.append("--use-chat-template")

    if system_prompt:
        cmd.extend(["--system-prompt", system_prompt])

    print(f"Running: {' '.join(cmd)}")

    try:
        subprocess.run(cmd, check=True)
        print("Evaluation complete.")
    except subprocess.CalledProcessError as exc:
        print(f"Evaluation failed with exit code {exc.returncode}", file=sys.stderr)
        sys.exit(exc.returncode)


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Run lighteval evaluations with vLLM or accelerate backend on custom HuggingFace models",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Run MMLU evaluation with vLLM
  python lighteval_vllm_uv.py --model meta-llama/Llama-3.2-1B --tasks "leaderboard|mmlu|5"

  # Run with accelerate backend instead of vLLM
  python lighteval_vllm_uv.py --model meta-llama/Llama-3.2-1B --tasks "leaderboard|mmlu|5" --backend accelerate

  # Run with chat template for instruction-tuned models
  python lighteval_vllm_uv.py --model meta-llama/Llama-3.2-1B-Instruct --tasks "leaderboard|mmlu|5" --use-chat-template

  # Run with limited samples for testing
  python lighteval_vllm_uv.py --model meta-llama/Llama-3.2-1B --tasks "leaderboard|mmlu|5" --max-samples 10

Task format:
  Tasks use the format: "suite|task|num_fewshot"
  - leaderboard|mmlu|5 (MMLU with 5-shot)
  - lighteval|hellaswag|0 (HellaSwag zero-shot)
  - leaderboard|gsm8k|5 (GSM8K with 5-shot)
  - Multiple tasks: "leaderboard|mmlu|5,leaderboard|gsm8k|5"
        """,
    )

    parser.add_argument(
        "--model",
        required=True,
        help="HuggingFace model ID (e.g., meta-llama/Llama-3.2-1B)",
    )
    parser.add_argument(
        "--tasks",
        required=True,
        help="Task specification (e.g., 'leaderboard|mmlu|5')",
    )
    parser.add_argument(
        "--backend",
        choices=["vllm", "accelerate"],
        default="vllm",
        help="Inference backend to use (default: vllm)",
    )
    parser.add_argument(
        "--output-dir",
        default=None,
        help="Directory for evaluation results",
    )
    parser.add_argument(
        "--max-samples",
        type=int,
        default=None,
        help="Limit number of samples per task (useful for testing)",
    )
    parser.add_argument(
        "--batch-size",
        type=int,
        default=1,
        help="Batch size for evaluation (default: 1)",
    )
    parser.add_argument(
        "--tensor-parallel-size",
        type=int,
        default=1,
        help="Number of GPUs for tensor parallelism (vLLM only, default: 1)",
    )
    parser.add_argument(
        "--gpu-memory-utilization",
        type=float,
        default=0.8,
        help="GPU memory fraction to use (vLLM only, default: 0.8)",
    )
    parser.add_argument(
        "--dtype",
        default="auto",
        choices=["auto", "float16", "bfloat16", "float32"],
        help="Data type for model weights (default: auto)",
    )
    parser.add_argument(
        "--trust-remote-code",
        action="store_true",
        help="Allow executing remote code from model repository",
    )
    parser.add_argument(
        "--use-chat-template",
        action="store_true",
        help="Apply chat template for instruction-tuned/chat models",
    )
    parser.add_argument(
        "--system-prompt",
        default=None,
        help="System prompt for chat models",
    )

    args = parser.parse_args()

    if args.backend == "vllm":
        run_lighteval_vllm(
            model_id=args.model,
            tasks=args.tasks,
            output_dir=args.output_dir,
            max_samples=args.max_samples,
            batch_size=args.batch_size,
            tensor_parallel_size=args.tensor_parallel_size,
            gpu_memory_utilization=args.gpu_memory_utilization,
            dtype=args.dtype,
            trust_remote_code=args.trust_remote_code,
            use_chat_template=args.use_chat_template,
            system_prompt=args.system_prompt,
        )
    else:
        run_lighteval_accelerate(
            model_id=args.model,
            tasks=args.tasks,
            output_dir=args.output_dir,
            max_samples=args.max_samples,
            batch_size=args.batch_size,
            dtype=args.dtype if args.dtype != "auto" else "bfloat16",
            trust_remote_code=args.trust_remote_code,
            use_chat_template=args.use_chat_template,
            system_prompt=args.system_prompt,
        )


if __name__ == "__main__":
    main()


```
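
The `suite|task|num_fewshot` format shown in the epilog can be sanity-checked before a job is submitted. Below is a small illustrative parser following the three-field form used in this skill's examples (it is not part of lighteval itself, which may accept additional fields):

```python
def parse_task_specs(spec: str) -> list[dict]:
    """Split a task string such as 'leaderboard|mmlu|5,lighteval|hellaswag|0'
    into its suite, task, and few-shot components."""
    parsed = []
    for item in spec.split(","):
        parts = item.split("|")
        if len(parts) != 3:
            raise ValueError(f"Expected 'suite|task|num_fewshot', got {item!r}")
        suite, task, fewshot = parts
        parsed.append({"suite": suite, "task": task, "num_fewshot": int(fewshot)})
    return parsed

specs = parse_task_specs("leaderboard|mmlu|5,lighteval|hellaswag|0")
```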

### scripts/inspect_vllm_uv.py

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "inspect-ai>=0.3.0",
#     "inspect-evals",
#     "vllm>=0.4.0",
#     "torch>=2.0.0",
#     "transformers>=4.40.0",
# ]
# ///

"""
Entry point script for running inspect-ai evaluations with vLLM or HuggingFace Transformers backend.

This script runs evaluations on custom HuggingFace models using local GPU inference,
separate from inference provider scripts (which use external APIs).

Usage (standalone):
    python inspect_vllm_uv.py --model "meta-llama/Llama-3.2-1B" --task "mmlu"

Usage (via HF Jobs):
    hf jobs uv run inspect_vllm_uv.py \\
        --flavor a10g-small \\
        --secrets HF_TOKEN=$HF_TOKEN \\
        -- --model "meta-llama/Llama-3.2-1B" --task "mmlu"

Model backends:
    - vllm: Fast inference with vLLM (recommended for large models)
    - hf: HuggingFace Transformers backend (broader model compatibility)
"""

from __future__ import annotations

import argparse
import os
import subprocess
import sys
from typing import Optional


def setup_environment() -> None:
    """Configure environment variables for HuggingFace authentication."""
    hf_token = os.getenv("HF_TOKEN")
    if hf_token:
        os.environ.setdefault("HUGGING_FACE_HUB_TOKEN", hf_token)
        os.environ.setdefault("HF_HUB_TOKEN", hf_token)


def run_inspect_vllm(
    model_id: str,
    task: str,
    limit: Optional[int] = None,
    max_connections: int = 4,
    temperature: float = 0.0,
    tensor_parallel_size: int = 1,
    gpu_memory_utilization: float = 0.8,
    dtype: str = "auto",
    trust_remote_code: bool = False,
    log_level: str = "info",
) -> None:
    """
    Run inspect-ai evaluation with vLLM backend.

    Args:
        model_id: HuggingFace model ID
        task: inspect-ai task to execute (e.g., "mmlu", "gsm8k")
        limit: Limit number of samples to evaluate
        max_connections: Maximum concurrent connections
        temperature: Sampling temperature
        tensor_parallel_size: Number of GPUs for tensor parallelism
        gpu_memory_utilization: GPU memory fraction
        dtype: Data type (auto, float16, bfloat16)
        trust_remote_code: Allow remote code execution
        log_level: Logging level
    """
    setup_environment()

    model_spec = f"vllm/{model_id}"
    cmd = [
        "inspect",
        "eval",
        task,
        "--model",
        model_spec,
        "--log-level",
        log_level,
        "--max-connections",
        str(max_connections),
    ]

    # vLLM supports temperature=0 unlike HF inference providers
    cmd.extend(["--temperature", str(temperature)])

    # Pass hardware overrides only when they differ from the defaults, so a
    # plain run lets vLLM choose sensible settings for small models.
    if tensor_parallel_size != 1:
        cmd.extend(["--tensor-parallel-size", str(tensor_parallel_size)])
    if gpu_memory_utilization != 0.8:
        cmd.extend(["--gpu-memory-utilization", str(gpu_memory_utilization)])
    if dtype != "auto":
        cmd.extend(["--dtype", dtype])
    if trust_remote_code:
        cmd.append("--trust-remote-code")

    if limit:
        cmd.extend(["--limit", str(limit)])

    print(f"Running: {' '.join(cmd)}")

    try:
        subprocess.run(cmd, check=True)
        print("Evaluation complete.")
    except subprocess.CalledProcessError as exc:
        print(f"Evaluation failed with exit code {exc.returncode}", file=sys.stderr)
        sys.exit(exc.returncode)


def run_inspect_hf(
    model_id: str,
    task: str,
    limit: Optional[int] = None,
    max_connections: int = 1,
    temperature: float = 0.001,
    device: str = "auto",
    dtype: str = "auto",
    trust_remote_code: bool = False,
    log_level: str = "info",
) -> None:
    """
    Run inspect-ai evaluation with HuggingFace Transformers backend.

    Use this when vLLM doesn't support the model architecture.

    Args:
        model_id: HuggingFace model ID
        task: inspect-ai task to execute
        limit: Limit number of samples
        max_connections: Maximum concurrent connections (keep low for memory)
        temperature: Sampling temperature
        device: Device to use (auto, cuda, cpu)
        dtype: Data type
        trust_remote_code: Allow remote code execution
        log_level: Logging level
    """
    setup_environment()

    model_spec = f"hf/{model_id}"

    cmd = [
        "inspect",
        "eval",
        task,
        "--model",
        model_spec,
        "--log-level",
        log_level,
        "--max-connections",
        str(max_connections),
        "--temperature",
        str(temperature),
    ]

    if device != "auto":
        cmd.extend(["--device", device])
    if dtype != "auto":
        cmd.extend(["--dtype", dtype])
    if trust_remote_code:
        cmd.append("--trust-remote-code")

    if limit:
        cmd.extend(["--limit", str(limit)])

    print(f"Running: {' '.join(cmd)}")

    try:
        subprocess.run(cmd, check=True)
        print("Evaluation complete.")
    except subprocess.CalledProcessError as exc:
        print(f"Evaluation failed with exit code {exc.returncode}", file=sys.stderr)
        sys.exit(exc.returncode)


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Run inspect-ai evaluations with vLLM or HuggingFace Transformers on custom models",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Run MMLU with vLLM backend
  python inspect_vllm_uv.py --model meta-llama/Llama-3.2-1B --task mmlu

  # Run with HuggingFace Transformers backend
  python inspect_vllm_uv.py --model meta-llama/Llama-3.2-1B --task mmlu --backend hf

  # Run with limited samples for testing
  python inspect_vllm_uv.py --model meta-llama/Llama-3.2-1B --task mmlu --limit 10

  # Run on multiple GPUs with tensor parallelism
  python inspect_vllm_uv.py --model meta-llama/Llama-3.2-70B --task mmlu --tensor-parallel-size 4

Available tasks (from inspect-evals):
  - mmlu: Massive Multitask Language Understanding
  - gsm8k: Grade School Math
  - hellaswag: Common sense reasoning
  - arc_challenge: AI2 Reasoning Challenge
  - truthfulqa: TruthfulQA benchmark
  - winogrande: Winograd Schema Challenge
  - humaneval: Code generation (HumanEval)

Via HF Jobs:
  hf jobs uv run inspect_vllm_uv.py \\
      --flavor a10g-small \\
      --secrets HF_TOKEN=$HF_TOKEN \\
      -- --model meta-llama/Llama-3.2-1B --task mmlu
        """,
    )

    parser.add_argument(
        "--model",
        required=True,
        help="HuggingFace model ID (e.g., meta-llama/Llama-3.2-1B)",
    )
    parser.add_argument(
        "--task",
        required=True,
        help="inspect-ai task to execute (e.g., mmlu, gsm8k)",
    )
    parser.add_argument(
        "--backend",
        choices=["vllm", "hf"],
        default="vllm",
        help="Model backend (default: vllm)",
    )
    parser.add_argument(
        "--limit",
        type=int,
        default=None,
        help="Limit number of samples to evaluate",
    )
    parser.add_argument(
        "--max-connections",
        type=int,
        default=None,
        help="Maximum concurrent connections (default: 4 for vllm, 1 for hf)",
    )
    parser.add_argument(
        "--temperature",
        type=float,
        default=None,
        help="Sampling temperature (default: 0.0 for vllm, 0.001 for hf)",
    )
    parser.add_argument(
        "--tensor-parallel-size",
        type=int,
        default=1,
        help="Number of GPUs for tensor parallelism (vLLM only, default: 1)",
    )
    parser.add_argument(
        "--gpu-memory-utilization",
        type=float,
        default=0.8,
        help="GPU memory fraction to use (vLLM only, default: 0.8)",
    )
    parser.add_argument(
        "--dtype",
        default="auto",
        choices=["auto", "float16", "bfloat16", "float32"],
        help="Data type for model weights (default: auto)",
    )
    parser.add_argument(
        "--device",
        default="auto",
        help="Device for HF backend (auto, cuda, cpu)",
    )
    parser.add_argument(
        "--trust-remote-code",
        action="store_true",
        help="Allow executing remote code from model repository",
    )
    parser.add_argument(
        "--log-level",
        default="info",
        choices=["debug", "info", "warning", "error"],
        help="Logging level (default: info)",
    )

    args = parser.parse_args()

    if args.backend == "vllm":
        run_inspect_vllm(
            model_id=args.model,
            task=args.task,
            limit=args.limit,
            max_connections=args.max_connections or 4,
            temperature=args.temperature if args.temperature is not None else 0.0,
            tensor_parallel_size=args.tensor_parallel_size,
            gpu_memory_utilization=args.gpu_memory_utilization,
            dtype=args.dtype,
            trust_remote_code=args.trust_remote_code,
            log_level=args.log_level,
        )
    else:
        run_inspect_hf(
            model_id=args.model,
            task=args.task,
            limit=args.limit,
            max_connections=args.max_connections or 1,
            temperature=args.temperature if args.temperature is not None else 0.001,
            device=args.device,
            dtype=args.dtype,
            trust_remote_code=args.trust_remote_code,
            log_level=args.log_level,
        )


if __name__ == "__main__":
    main()

```
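
The two backends above get different defaults in `main()`: vLLM handles concurrency well and accepts `temperature=0`, while the Transformers backend is kept at one connection with a near-zero temperature. An illustrative helper summarizing that logic in one place:

```python
def backend_defaults(backend: str, model_id: str) -> dict:
    """Mirror the per-backend defaults main() applies in the script above."""
    if backend == "vllm":
        # vLLM: higher concurrency, true greedy decoding with temperature=0.
        return {"model": f"vllm/{model_id}", "max_connections": 4, "temperature": 0.0}
    if backend == "hf":
        # Transformers: keep concurrency low for memory, near-zero temperature.
        return {"model": f"hf/{model_id}", "max_connections": 1, "temperature": 0.001}
    raise ValueError(f"Unknown backend: {backend}")

cfg = backend_defaults("vllm", "meta-llama/Llama-3.2-1B")
```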

### scripts/run_vllm_eval_job.py

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "huggingface-hub>=0.26.0",
#     "python-dotenv>=1.2.1",
# ]
# ///

"""
Submit vLLM-based evaluation jobs using the `hf jobs uv run` CLI.

This wrapper constructs the appropriate command to execute vLLM evaluation scripts
(lighteval or inspect-ai) on Hugging Face Jobs with GPU hardware.

Unlike run_eval_job.py (which uses inference providers/APIs), this script runs
models directly on the job's GPU using vLLM or HuggingFace Transformers.

Usage:
    python run_vllm_eval_job.py \\
        --model meta-llama/Llama-3.2-1B \\
        --task mmlu \\
        --framework lighteval \\
        --hardware a10g-small
"""

from __future__ import annotations

import argparse
import os
import subprocess
import sys
from pathlib import Path
from typing import Optional

from huggingface_hub import get_token
from dotenv import load_dotenv

load_dotenv()

# Script paths for different evaluation frameworks
SCRIPT_DIR = Path(__file__).parent.resolve()
LIGHTEVAL_SCRIPT = SCRIPT_DIR / "lighteval_vllm_uv.py"
INSPECT_SCRIPT = SCRIPT_DIR / "inspect_vllm_uv.py"

# Hardware flavor recommendations for different model sizes
HARDWARE_RECOMMENDATIONS = {
    "small": "t4-small",       # < 3B parameters
    "medium": "a10g-small",    # 3B - 13B parameters
    "large": "a10g-large",     # 13B - 34B parameters
    "xlarge": "a100-large",    # 34B+ parameters
}


def estimate_hardware(model_id: str) -> str:
    """
    Estimate appropriate hardware based on model ID naming conventions.
    
    Returns a hardware flavor recommendation.
    """
    model_lower = model_id.lower()
    
    # Check for explicit size indicators in model name
    if any(x in model_lower for x in ["70b", "72b", "65b"]):
        return "a100-large"
    elif any(x in model_lower for x in ["34b", "33b", "32b", "30b"]):
        return "a10g-large"
    elif any(x in model_lower for x in ["13b", "14b", "7b", "8b"]):
        return "a10g-small"
    elif any(x in model_lower for x in ["3b", "2b", "1b", "0.5b", "small", "mini"]):
        return "t4-small"
    
    # Default to medium hardware
    return "a10g-small"


def create_lighteval_job(
    model_id: str,
    tasks: str,
    hardware: str,
    hf_token: Optional[str] = None,
    max_samples: Optional[int] = None,
    backend: str = "vllm",
    batch_size: int = 1,
    tensor_parallel_size: int = 1,
    trust_remote_code: bool = False,
    use_chat_template: bool = False,
) -> None:
    """
    Submit a lighteval evaluation job on HuggingFace Jobs.
    """
    token = hf_token or os.getenv("HF_TOKEN") or get_token()
    if not token:
        raise ValueError("HF_TOKEN is required. Set it in environment or pass as argument.")

    if not LIGHTEVAL_SCRIPT.exists():
        raise FileNotFoundError(f"Script not found at {LIGHTEVAL_SCRIPT}")

    print(f"Preparing lighteval job for {model_id}")
    print(f"  Tasks: {tasks}")
    print(f"  Backend: {backend}")
    print(f"  Hardware: {hardware}")

    cmd = [
        "hf", "jobs", "uv", "run",
        str(LIGHTEVAL_SCRIPT),
        "--flavor", hardware,
        "--secrets", f"HF_TOKEN={token}",
        "--",
        "--model", model_id,
        "--tasks", tasks,
        "--backend", backend,
        "--batch-size", str(batch_size),
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]

    if max_samples:
        cmd.extend(["--max-samples", str(max_samples)])

    if trust_remote_code:
        cmd.append("--trust-remote-code")

    if use_chat_template:
        cmd.append("--use-chat-template")

    print(f"\nExecuting: {' '.join(cmd)}")

    try:
        subprocess.run(cmd, check=True)
    except subprocess.CalledProcessError as exc:
        print(f"hf jobs command failed with exit code {exc.returncode}", file=sys.stderr)
        raise


def create_inspect_job(
    model_id: str,
    task: str,
    hardware: str,
    hf_token: Optional[str] = None,
    limit: Optional[int] = None,
    backend: str = "vllm",
    tensor_parallel_size: int = 1,
    trust_remote_code: bool = False,
) -> None:
    """
    Submit an inspect-ai evaluation job on HuggingFace Jobs.
    """
    token = hf_token or os.getenv("HF_TOKEN") or get_token()
    if not token:
        raise ValueError("HF_TOKEN is required. Set it in environment or pass as argument.")

    if not INSPECT_SCRIPT.exists():
        raise FileNotFoundError(f"Script not found at {INSPECT_SCRIPT}")

    print(f"Preparing inspect-ai job for {model_id}")
    print(f"  Task: {task}")
    print(f"  Backend: {backend}")
    print(f"  Hardware: {hardware}")

    cmd = [
        "hf", "jobs", "uv", "run",
        str(INSPECT_SCRIPT),
        "--flavor", hardware,
        "--secrets", f"HF_TOKEN={token}",
        "--",
        "--model", model_id,
        "--task", task,
        "--backend", backend,
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]

    if limit:
        cmd.extend(["--limit", str(limit)])

    if trust_remote_code:
        cmd.append("--trust-remote-code")

    print(f"\nExecuting: {' '.join(cmd)}")

    try:
        subprocess.run(cmd, check=True)
    except subprocess.CalledProcessError as exc:
        print(f"hf jobs command failed with exit code {exc.returncode}", file=sys.stderr)
        raise


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Submit vLLM-based evaluation jobs to HuggingFace Jobs",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Run lighteval with vLLM on A10G GPU
  python run_vllm_eval_job.py \\
      --model meta-llama/Llama-3.2-1B \\
      --task "leaderboard|mmlu|5" \\
      --framework lighteval \\
      --hardware a10g-small

  # Run inspect-ai on larger model with multi-GPU
  python run_vllm_eval_job.py \\
      --model meta-llama/Llama-3.2-70B \\
      --task mmlu \\
      --framework inspect \\
      --hardware a100-large \\
      --tensor-parallel-size 4

  # Auto-detect hardware based on model size
  python run_vllm_eval_job.py \\
      --model meta-llama/Llama-3.2-1B \\
      --task mmlu \\
      --framework inspect

  # Run with HF Transformers backend (instead of vLLM)
  python run_vllm_eval_job.py \\
      --model microsoft/phi-2 \\
      --task mmlu \\
      --framework inspect \\
      --backend hf

Hardware flavors:
  - t4-small: T4 GPU, good for models < 3B
  - a10g-small: A10G GPU, good for models 3B-13B
  - a10g-large: A10G GPU, good for models 13B-34B
  - a100-large: A100 GPU, good for models 34B+

Frameworks:
  - lighteval: Hugging Face's lighteval library
  - inspect: the UK AI Safety Institute's inspect-ai library

Task formats:
  - lighteval: "suite|task|num_fewshot" (e.g., "leaderboard|mmlu|5")
  - inspect: task name (e.g., "mmlu", "gsm8k")
        """,
    )

    parser.add_argument(
        "--model",
        required=True,
        help="HuggingFace model ID (e.g., meta-llama/Llama-3.2-1B)",
    )
    parser.add_argument(
        "--task",
        required=True,
        help="Evaluation task (format depends on framework)",
    )
    parser.add_argument(
        "--framework",
        choices=["lighteval", "inspect"],
        default="lighteval",
        help="Evaluation framework to use (default: lighteval)",
    )
    parser.add_argument(
        "--hardware",
        default=None,
        help="Hardware flavor (auto-detected if not specified)",
    )
    parser.add_argument(
        "--backend",
        choices=["vllm", "hf", "accelerate"],
        default="vllm",
        help="Model backend (default: vllm)",
    )
    parser.add_argument(
        "--limit",
        "--max-samples",
        type=int,
        default=None,
        dest="limit",
        help="Limit number of samples to evaluate",
    )
    parser.add_argument(
        "--batch-size",
        type=int,
        default=1,
        help="Batch size for evaluation (lighteval only)",
    )
    parser.add_argument(
        "--tensor-parallel-size",
        type=int,
        default=1,
        help="Number of GPUs for tensor parallelism",
    )
    parser.add_argument(
        "--trust-remote-code",
        action="store_true",
        help="Allow executing remote code from model repository",
    )
    parser.add_argument(
        "--use-chat-template",
        action="store_true",
        help="Apply chat template (lighteval only)",
    )

    args = parser.parse_args()

    # Auto-detect hardware if not specified
    hardware = args.hardware or estimate_hardware(args.model)
    print(f"Using hardware: {hardware}")

    # Map backend names between frameworks
    backend = args.backend
    if args.framework == "lighteval" and backend == "hf":
        backend = "accelerate"  # lighteval uses "accelerate" for HF backend

    if args.framework == "lighteval":
        create_lighteval_job(
            model_id=args.model,
            tasks=args.task,
            hardware=hardware,
            max_samples=args.limit,
            backend=backend,
            batch_size=args.batch_size,
            tensor_parallel_size=args.tensor_parallel_size,
            trust_remote_code=args.trust_remote_code,
            use_chat_template=args.use_chat_template,
        )
    else:
        create_inspect_job(
            model_id=args.model,
            task=args.task,
            hardware=hardware,
            limit=args.limit,
            backend=backend if backend != "accelerate" else "hf",
            tensor_parallel_size=args.tensor_parallel_size,
            trust_remote_code=args.trust_remote_code,
        )


if __name__ == "__main__":
    main()


```
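The name-based hardware heuristic in `estimate_hardware` is easy to sanity-check in isolation. The sketch below is a standalone copy of that heuristic (the function body mirrors the script above; nothing new is assumed beyond the size strings it already matches):

```python
def estimate_hardware(model_id: str) -> str:
    """Guess a Hugging Face Jobs hardware flavor from size hints in the model ID."""
    model_lower = model_id.lower()
    if any(x in model_lower for x in ["70b", "72b", "65b"]):
        return "a100-large"
    elif any(x in model_lower for x in ["34b", "33b", "32b", "30b"]):
        return "a10g-large"
    elif any(x in model_lower for x in ["13b", "14b", "7b", "8b"]):
        return "a10g-small"
    elif any(x in model_lower for x in ["3b", "2b", "1b", "0.5b", "small", "mini"]):
        return "t4-small"
    # No size hint in the name: default to medium hardware
    return "a10g-small"

print(estimate_hardware("meta-llama/Llama-3.2-70B"))   # a100-large
print(estimate_hardware("meta-llama/Llama-3.2-1B"))    # t4-small
print(estimate_hardware("mistralai/Mistral-7B-v0.3"))  # a10g-small
```

Note the fallback: model IDs with no recognizable size token (or unconventional naming) silently land on `a10g-small`, so pass `--hardware` explicitly for anything unusual.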

### scripts/inspect_eval_uv.py

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "inspect-ai>=0.3.0",
#     "inspect-evals",
#     "openai",
# ]
# ///

"""
Entry point script for running inspect-ai evaluations via `hf jobs uv run`.
"""

from __future__ import annotations

import argparse
import os
import subprocess
import sys
from pathlib import Path
from typing import Optional


def _inspect_evals_tasks_root() -> Optional[Path]:
    """Return the installed inspect_evals package path if available."""
    try:
        import inspect_evals

        return Path(inspect_evals.__file__).parent
    except Exception:
        return None


def _normalize_task(task: str) -> str:
    """Allow lighteval-style `suite|task|shots` strings by keeping the task name."""
    if "|" in task:
        parts = task.split("|")
        if len(parts) >= 2 and parts[1]:
            return parts[1]
    return task


def main() -> None:
    parser = argparse.ArgumentParser(description="Inspect-ai job runner")
    parser.add_argument("--model", required=True, help="Model ID on Hugging Face Hub")
    parser.add_argument("--task", required=True, help="inspect-ai task to execute")
    parser.add_argument("--limit", type=int, default=None, help="Limit number of samples to evaluate")
    parser.add_argument(
        "--tasks-root",
        default=None,
        help="Optional path to inspect task files. Defaults to the installed inspect_evals package.",
    )
    parser.add_argument(
        "--sandbox",
        default="local",
        help="Sandbox backend to use (default: local for HF jobs without Docker).",
    )
    args = parser.parse_args()

    # Ensure downstream libraries can read the token passed as a secret
    hf_token = os.getenv("HF_TOKEN")
    if hf_token:
        os.environ.setdefault("HUGGING_FACE_HUB_TOKEN", hf_token)
        os.environ.setdefault("HF_HUB_TOKEN", hf_token)

    task = _normalize_task(args.task)
    tasks_root = Path(args.tasks_root) if args.tasks_root else _inspect_evals_tasks_root()
    if tasks_root and not tasks_root.exists():
        tasks_root = None

    cmd = [
        "inspect",
        "eval",
        task,
        "--model",
        f"hf-inference-providers/{args.model}",
        "--log-level",
        "info",
        # Limit concurrent API requests (--max-connections controls parallel
        # connections to the provider, not batch size) to avoid rate limits
        "--max-connections",
        "1",
        # Set a small positive temperature (HF doesn't allow temperature=0)
        "--temperature",
        "0.001",
    ]

    if args.sandbox:
        cmd.extend(["--sandbox", args.sandbox])

    if args.limit:
        cmd.extend(["--limit", str(args.limit)])

    try:
        subprocess.run(cmd, check=True, cwd=tasks_root)
        print("Evaluation complete.")
    except subprocess.CalledProcessError as exc:
        location = f" (cwd={tasks_root})" if tasks_root else ""
        print(f"Evaluation failed with exit code {exc.returncode}{location}", file=sys.stderr)
        raise


if __name__ == "__main__":
    main()


```