hf-jobs
This skill should be used when users want to run any workload on Hugging Face Jobs infrastructure. Covers UV scripts, Docker-based jobs, hardware selection, cost estimation, authentication with tokens, secrets management, timeout configuration, and result persistence. Designed for general-purpose compute workloads including data processing, inference, experiments, batch jobs, and any Python-based tasks. Should be invoked for tasks involving cloud compute, GPU workloads, or when users mention running jobs on Hugging Face infrastructure without local setup.
Packaged view
This page reorganizes the original catalog entry to put fit, installability, and workflow context first. The original raw source appears below.
Install command
npx @skill-hub/cli install nymbo-skills-hf-jobs
Repository
Skill path: HUGGING FACE/hf-jobs
Best for
Primary workflow: Analyze Data & AI.
Technical facets: Full Stack, DevOps, Data / AI.
Target audience: everyone.
License: Complete terms in LICENSE.txt.
Original source
Catalog source: SkillHub Club.
Repository owner: Nymbo.
This is a mirrored public skill entry. Review the repository before installing it into production workflows.
What it helps with
- Install hf-jobs into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
- Review https://github.com/Nymbo/Skills before adding hf-jobs to shared team environments
- Use hf-jobs for development workflows
Works across
Favorites: 0.
Sub-skills: 0.
Aggregator: No.
Original source / Raw SKILL.md
---
name: hf-jobs
description: This skill should be used when users want to run any workload on Hugging Face Jobs infrastructure. Covers UV scripts, Docker-based jobs, hardware selection, cost estimation, authentication with tokens, secrets management, timeout configuration, and result persistence. Designed for general-purpose compute workloads including data processing, inference, experiments, batch jobs, and any Python-based tasks. Should be invoked for tasks involving cloud compute, GPU workloads, or when users mention running jobs on Hugging Face infrastructure without local setup.
license: Complete terms in LICENSE.txt
---
# Running Workloads on Hugging Face Jobs
## Overview
Run any workload on fully managed Hugging Face infrastructure. No local setup required—jobs run on cloud CPUs, GPUs, or TPUs and can persist results to the Hugging Face Hub.
**Common use cases:**
- **Data Processing** - Transform, filter, or analyze large datasets
- **Batch Inference** - Run inference on thousands of samples
- **Experiments & Benchmarks** - Reproducible ML experiments
- **Model Training** - Fine-tune models (see `model-trainer` skill for TRL-specific training)
- **Synthetic Data Generation** - Generate datasets using LLMs
- **Development & Testing** - Test code without local GPU setup
- **Scheduled Jobs** - Automate recurring tasks
**For model training specifically:** See the `model-trainer` skill for TRL-based training workflows.
## When to Use This Skill
Use this skill when users want to:
- Run Python workloads on cloud infrastructure
- Execute jobs without local GPU/TPU setup
- Process data at scale
- Run batch inference or experiments
- Schedule recurring tasks
- Use GPUs/TPUs for any workload
- Persist results to the Hugging Face Hub
## Key Directives
When assisting with jobs:
1. **ALWAYS use `hf_jobs()` MCP tool** - Submit jobs using `hf_jobs("uv", {...})` or `hf_jobs("run", {...})`. The `script` parameter accepts Python code directly. Do NOT save to local files unless the user explicitly requests it. Pass the script content as a string to `hf_jobs()`.
2. **Always handle authentication** - Jobs that interact with the Hub require `HF_TOKEN` via secrets. See Token Usage section below.
3. **Provide job details after submission** - After submitting, provide job ID, monitoring URL, estimated time, and note that the user can request status checks later.
4. **Set appropriate timeouts** - Default 30min may be insufficient for long-running tasks.
## Prerequisites Checklist
Before starting any job, verify:
### ✅ **Account & Authentication**
- Hugging Face Account with [Pro](https://hf.co/pro), [Team](https://hf.co/enterprise), or [Enterprise](https://hf.co/enterprise) plan (Jobs require a paid plan)
- Authenticated login: Check with `hf_whoami()`
- **HF_TOKEN for Hub Access** ⚠️ CRITICAL - Required for any Hub operations (push models/datasets, download private repos, etc.)
- Token must have appropriate permissions (read for downloads, write for uploads)
### ✅ **Token Usage** (See Token Usage section for details)
**When tokens are required:**
- Pushing models/datasets to Hub
- Accessing private repositories
- Using Hub APIs in scripts
- Any authenticated Hub operations
**How to provide tokens:**
```python
{
"secrets": {"HF_TOKEN": "$HF_TOKEN"} # Recommended: automatic token
}
```
**⚠️ CRITICAL:** The `$HF_TOKEN` placeholder is automatically replaced with your logged-in token. Never hardcode tokens in scripts.
## Token Usage Guide
### Understanding Tokens
**What are HF Tokens?**
- Authentication credentials for Hugging Face Hub
- Required for authenticated operations (push, private repos, API access)
- Stored securely on your machine after `hf auth login`
**Token Types:**
- **Read Token** - Can download models/datasets, read private repos
- **Write Token** - Can push models/datasets, create repos, modify content
- **Organization Token** - Can act on behalf of an organization
### When Tokens Are Required
**Always Required:**
- Pushing models/datasets to Hub
- Accessing private repositories
- Creating new repositories
- Modifying existing repositories
- Using Hub APIs programmatically
**Not Required:**
- Downloading public models/datasets
- Running jobs that don't interact with Hub
- Reading public repository information
### How to Provide Tokens to Jobs
#### Method 1: Automatic Token (Recommended)
```python
hf_jobs("uv", {
"script": "your_script.py",
"secrets": {"HF_TOKEN": "$HF_TOKEN"} # ✅ Automatic replacement
})
```
**How it works:**
- `$HF_TOKEN` is a placeholder that gets replaced with your actual token
- Uses the token from your logged-in session (`hf auth login`)
- Most secure and convenient method
- Token is encrypted server-side when passed as a secret
**Benefits:**
- No token exposure in code
- Uses your current login session
- Automatically updated if you re-login
- Works seamlessly with MCP tools
#### Method 2: Explicit Token (Not Recommended)
```python
hf_jobs("uv", {
"script": "your_script.py",
"secrets": {"HF_TOKEN": "hf_abc123..."} # ⚠️ Hardcoded token
})
```
**When to use:**
- Only if automatic token doesn't work
- Testing with a specific token
- Organization tokens (use with caution)
**Security concerns:**
- Token visible in code/logs
- Must manually update if token rotates
- Risk of token exposure
#### Method 3: Environment Variable (Less Secure)
```python
hf_jobs("uv", {
"script": "your_script.py",
"env": {"HF_TOKEN": "hf_abc123..."} # ⚠️ Less secure than secrets
})
```
**Difference from secrets:**
- `env` variables are visible in job logs
- `secrets` are encrypted server-side
- Always prefer `secrets` for tokens
### Using Tokens in Scripts
**In your Python script, tokens are available as environment variables:**
```python
# /// script
# dependencies = ["huggingface-hub"]
# ///
import os
from huggingface_hub import HfApi
# Token is automatically available if passed via secrets
token = os.environ.get("HF_TOKEN")
# Use with Hub API
api = HfApi(token=token)
# Or let huggingface_hub auto-detect
api = HfApi() # Automatically uses HF_TOKEN env var
```
**Best practices:**
- Don't hardcode tokens in scripts
- Use `os.environ.get("HF_TOKEN")` to access
- Let `huggingface_hub` auto-detect when possible
- Verify token exists before Hub operations
### Token Verification
**Check if you're logged in:**
```python
from huggingface_hub import whoami
user_info = whoami()  # Returns account info; raises if not authenticated
```
**Verify token in job:**
```python
import os
assert "HF_TOKEN" in os.environ, "HF_TOKEN not found!"
token = os.environ["HF_TOKEN"]
print(f"Token starts with: {token[:7]}...") # Should start with "hf_"
```
### Common Token Issues
**Error: 401 Unauthorized**
- **Cause:** Token missing or invalid
- **Fix:** Add `secrets={"HF_TOKEN": "$HF_TOKEN"}` to job config
- **Verify:** Check `hf_whoami()` works locally
**Error: 403 Forbidden**
- **Cause:** Token lacks required permissions
- **Fix:** Ensure token has write permissions for push operations
- **Check:** Token type at https://huggingface.co/settings/tokens
**Error: Token not found in environment**
- **Cause:** `secrets` not passed or wrong key name
- **Fix:** Use `secrets={"HF_TOKEN": "$HF_TOKEN"}` (not `env`)
- **Verify:** Script checks `os.environ.get("HF_TOKEN")`
**Error: Repository access denied**
- **Cause:** Token doesn't have access to private repo
- **Fix:** Use token from account with access
- **Check:** Verify repo visibility and your permissions
### Token Security Best Practices
1. **Never commit tokens** - Use `$HF_TOKEN` placeholder or environment variables
2. **Use secrets, not env** - Secrets are encrypted server-side
3. **Rotate tokens regularly** - Generate new tokens periodically
4. **Use minimal permissions** - Create tokens with only needed permissions
5. **Don't share tokens** - Each user should use their own token
6. **Monitor token usage** - Check token activity in Hub settings
### Complete Token Example
```python
# Example: Push results to Hub
hf_jobs("uv", {
"script": """
# /// script
# dependencies = ["huggingface-hub", "datasets"]
# ///
import os
from huggingface_hub import HfApi
from datasets import Dataset
# Verify token is available
assert "HF_TOKEN" in os.environ, "HF_TOKEN required!"
# Use token for Hub operations
api = HfApi(token=os.environ["HF_TOKEN"])
# Create and push dataset
data = {"text": ["Hello", "World"]}
dataset = Dataset.from_dict(data)
dataset.push_to_hub("username/my-dataset", token=os.environ["HF_TOKEN"])
print("✅ Dataset pushed successfully!")
""",
"flavor": "cpu-basic",
"timeout": "30m",
"secrets": {"HF_TOKEN": "$HF_TOKEN"} # ✅ Token provided securely
})
```
## Quick Start: Two Approaches
### Approach 1: UV Scripts (Recommended)
UV scripts use PEP 723 inline dependencies for clean, self-contained workloads.
```python
hf_jobs("uv", {
"script": """
# /// script
# dependencies = ["transformers", "torch"]
# ///
from transformers import pipeline
import torch
# Your workload here
classifier = pipeline("sentiment-analysis")
result = classifier("I love Hugging Face!")
print(result)
""",
"flavor": "cpu-basic",
"timeout": "30m"
})
```
**Benefits:** Direct MCP tool usage, clean code, dependencies declared inline, no file saving required
**When to use:** Default choice for all workloads, custom logic, any scenario requiring `hf_jobs()`
#### Working with Scripts
⚠️ **Important:** There are *two* “script path” stories depending on how you run Jobs:
- **Using the `hf_jobs()` MCP tool (recommended in this repo)**: the `script` value must be **inline code** (a string) or a **URL**. A local filesystem path (like `"./scripts/foo.py"`) won’t exist inside the remote container.
- **Using the `hf jobs uv run` CLI**: local file paths **do work** (the CLI uploads your script).
**Common mistake with `hf_jobs()` MCP tool:**
```python
# ❌ Will fail (remote container can't see your local path)
hf_jobs("uv", {"script": "./scripts/foo.py"})
```
**Correct patterns with `hf_jobs()` MCP tool:**
```python
# ✅ Inline: read the local script file and pass its *contents*
from pathlib import Path
script = Path("hf-jobs/scripts/foo.py").read_text()
hf_jobs("uv", {"script": script})
# ✅ URL: host the script somewhere reachable
hf_jobs("uv", {"script": "https://huggingface.co/datasets/uv-scripts/.../raw/main/foo.py"})
```
**CLI equivalent (local paths supported):**
```bash
hf jobs uv run ./scripts/foo.py -- --your --args
```
### Approach 2: Docker-Based Jobs
Run jobs with custom Docker images and commands.
```python
hf_jobs("run", {
"image": "python:3.12",
"command": ["python", "-c", "print('Hello from HF Jobs!')"],
"flavor": "cpu-basic",
"timeout": "30m"
})
```
**Benefits:** Full Docker control, use pre-built images, run any command
**When to use:** Need specific Docker images, non-Python workloads, complex environments
**Example with GPU:**
```python
hf_jobs("run", {
"image": "pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel",
"command": ["python", "-c", "import torch; print(torch.cuda.get_device_name())"],
"flavor": "a10g-small",
"timeout": "1h"
})
```
### Finding More UV Scripts on Hub
The `uv-scripts` organization provides ready-to-use UV scripts stored as datasets on Hugging Face Hub:
```python
# Discover available UV script collections
dataset_search({"author": "uv-scripts", "sort": "downloads", "limit": 20})
# Explore a specific collection
hub_repo_details(["uv-scripts/classification"], repo_type="dataset", include_readme=True)
```
**Popular collections:** OCR, classification, synthetic-data, vLLM, dataset-creation
## Hardware Selection
| Workload Type | Recommended Hardware | Cost (approx./hr) | Use Case |
|---------------|---------------------|------------------|----------|
| Data processing, testing | `cpu-basic`, `cpu-upgrade` | ~$0.10-0.50 | Lightweight tasks |
| Small models, demos | `t4-small` | ~$0.75 | <1B models, quick tests |
| Medium models | `t4-medium`, `l4x1` | ~$1.50-2.50 | 1-7B models |
| Large models, production | `a10g-small`, `a10g-large` | ~$3.50-5.00 | 7-13B models |
| Very large models | `a100-large` | ~$8-12 | 13B+ models |
| Batch inference | `a10g-large`, `a100-large` | ~$5-10 | High-throughput |
| Data processing | `cpu-upgrade`, `l4x1` | ~$0.50-2.50 | Parallel workloads |
**CPU flavors:** cpu-basic, cpu-upgrade
**GPU flavors:** t4-small/medium, l4x1/x4, a10g-small/large/largex2/largex4, a100-large, h100/h100x8
**TPU flavors:** v5e-1x1, v5e-2x2, v5e-2x4
**Guidelines:**
- Start with smaller hardware for testing
- Scale up based on actual needs
- Use multi-GPU for parallel workloads
- See `references/hardware_guide.md` for detailed specifications
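The guidelines above can be expressed as a simple lookup. A minimal sketch, assuming GPU memory thresholds approximated from the table (the function name and cutoffs are illustrative, not part of the Jobs API):

```python
# Hypothetical helper: pick a Jobs flavor from a rough model-size estimate.
# Thresholds follow the table above and are approximations, not official limits.
def pick_flavor(model_params_b: float, needs_gpu: bool = True) -> str:
    if not needs_gpu:
        return "cpu-upgrade"   # data processing, parallel CPU workloads
    if model_params_b < 1:
        return "t4-small"      # <1B models, quick tests
    if model_params_b < 7:
        return "l4x1"          # 1-7B models
    if model_params_b < 13:
        return "a10g-large"    # 7-13B models
    return "a100-large"        # 13B+ models

print(pick_flavor(0.5))               # t4-small
print(pick_flavor(30))                # a100-large
print(pick_flavor(2, needs_gpu=False))  # cpu-upgrade
```

Start one tier smaller than the helper suggests when testing, then scale up only if the job hits memory limits.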
## Critical: Saving Results
**⚠️ EPHEMERAL ENVIRONMENT—MUST PERSIST RESULTS**
The Jobs environment is temporary. All files are deleted when the job ends. If results aren't persisted, **ALL WORK IS LOST**.
### Persistence Options
**1. Push to Hugging Face Hub (Recommended)**
```python
# Push models
model.push_to_hub("username/model-name", token=os.environ["HF_TOKEN"])
# Push datasets
dataset.push_to_hub("username/dataset-name", token=os.environ["HF_TOKEN"])
# Push artifacts
api.upload_file(
path_or_fileobj="results.json",
path_in_repo="results.json",
repo_id="username/results",
token=os.environ["HF_TOKEN"]
)
```
**2. Use External Storage**
```python
# Upload to S3, GCS, etc.
import boto3
s3 = boto3.client('s3')
s3.upload_file('results.json', 'my-bucket', 'results.json')
```
**3. Send Results via API**
```python
# POST results to your API
import requests
requests.post("https://your-api.com/results", json=results)
```
### Required Configuration for Hub Push
**In job submission:**
```python
{
"secrets": {"HF_TOKEN": "$HF_TOKEN"} # Enables authentication
}
```
**In script:**
```python
import os
from huggingface_hub import HfApi
# Token automatically available from secrets
api = HfApi(token=os.environ.get("HF_TOKEN"))
# Push your results
api.upload_file(...)
```
### Verification Checklist
Before submitting:
- [ ] Results persistence method chosen
- [ ] `secrets={"HF_TOKEN": "$HF_TOKEN"}` if using Hub
- [ ] Script handles missing token gracefully
- [ ] Test persistence path works
**See:** `references/hub_saving.md` for detailed Hub persistence guide
## Timeout Management
**⚠️ DEFAULT: 30 MINUTES**
### Setting Timeouts
```python
{
"timeout": "2h" # 2 hours (formats: "90m", "2h", "1.5h", or seconds as integer)
}
```
### Timeout Guidelines
| Scenario | Recommended | Notes |
|----------|-------------|-------|
| Quick test | 10-30 min | Verify setup |
| Data processing | 1-2 hours | Depends on data size |
| Batch inference | 2-4 hours | Large batches |
| Experiments | 4-8 hours | Multiple runs |
| Long-running | 8-24 hours | Production workloads |
**Always add 20-30% buffer** for setup, network delays, and cleanup.
**On timeout:** Job killed immediately, all unsaved progress lost
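The buffer rule can be applied mechanically. A sketch that parses the timeout formats noted above ("90m", "2h", "1.5h", or integer seconds) and adds a ~25% margin; the helper name is illustrative:

```python
import math

# Illustrative helper: add a ~25% safety buffer to an estimated runtime
# and return a timeout string in minutes.
def with_buffer(timeout, factor: float = 1.25) -> str:
    if isinstance(timeout, int):
        seconds = timeout
    elif timeout.endswith("h"):
        seconds = int(float(timeout[:-1]) * 3600)
    elif timeout.endswith("m"):
        seconds = int(float(timeout[:-1]) * 60)
    else:
        seconds = int(timeout)
    # Round the buffered value up to whole minutes for a clean timeout string
    minutes = math.ceil(seconds * factor / 60)
    return f"{minutes}m"

print(with_buffer("2h"))   # 150m
print(with_buffer("90m"))  # 113m
```

Pass the result as the `timeout` value in the job config so an estimated 2-hour run is not killed by setup or network delays.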
## Cost Estimation
**General guidelines:**
```
Total Cost = (Hours of runtime) × (Cost per hour)
```
**Example calculations:**
**Quick test:**
- Hardware: cpu-basic ($0.10/hour)
- Time: 15 minutes (0.25 hours)
- Cost: $0.03
**Data processing:**
- Hardware: l4x1 ($2.50/hour)
- Time: 2 hours
- Cost: $5.00
**Batch inference:**
- Hardware: a10g-large ($5/hour)
- Time: 4 hours
- Cost: $20.00
**Cost optimization tips:**
1. Start small - Test on cpu-basic or t4-small
2. Monitor runtime - Set appropriate timeouts
3. Use checkpoints - Resume if job fails
4. Optimize code - Reduce unnecessary compute
5. Choose right hardware - Don't over-provision
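The example calculations above follow directly from the formula. A minimal sketch using the approximate rates quoted in this guide (rough figures, not official pricing):

```python
# Illustrative cost estimator using the approximate hourly rates above.
# These rates are rough figures from this guide, not official pricing.
RATES_PER_HOUR = {
    "cpu-basic": 0.10,
    "l4x1": 2.50,
    "a10g-large": 5.00,
}

def estimate_cost(flavor: str, hours: float) -> float:
    return round(RATES_PER_HOUR[flavor] * hours, 2)

print(estimate_cost("cpu-basic", 0.25))  # quick test: 0.03
print(estimate_cost("l4x1", 2))          # data processing: 5.0
print(estimate_cost("a10g-large", 4))    # batch inference: 20.0
```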
## Monitoring and Tracking
### Check Job Status
```python
# List all jobs
hf_jobs("ps")
# Inspect specific job
hf_jobs("inspect", {"job_id": "your-job-id"})
# View logs
hf_jobs("logs", {"job_id": "your-job-id"})
# Cancel a job
hf_jobs("cancel", {"job_id": "your-job-id"})
```
**Remember:** Wait for user to request status checks. Avoid polling repeatedly.
### Job URLs
After submission, jobs have monitoring URLs:
```
https://huggingface.co/jobs/username/job-id
```
View logs, status, and details in the browser.
## Scheduled Jobs
Run jobs on a schedule using CRON expressions or predefined schedules.
```python
# Schedule a job that runs every hour
hf_jobs("scheduled uv", {
"script": "your_script.py",
"schedule": "@hourly",
"flavor": "cpu-basic"
})
# Use CRON syntax
hf_jobs("scheduled uv", {
"script": "your_script.py",
"schedule": "0 9 * * 1", # 9 AM every Monday
"flavor": "cpu-basic"
})
```
**Available schedules:**
- `@annually`, `@yearly` - Once per year
- `@monthly` - Once per month
- `@weekly` - Once per week
- `@daily` - Once per day
- `@hourly` - Once per hour
- CRON expression - Custom schedule (e.g., `"0 9 * * 1"`)
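Before submitting a scheduled job, it can help to sanity-check the schedule string locally. A minimal sketch (the scheduler's accepted syntax is the real arbiter; this check is deliberately loose and does not cover every CRON feature):

```python
import re

# Predefined shortcuts listed above, plus a cheap check that a custom
# expression has five CRON fields of plausible characters.
PREDEFINED = {"@annually", "@yearly", "@monthly", "@weekly", "@daily", "@hourly"}
CRON_FIELD = re.compile(r"^[\d*,/-]+$")

def looks_like_schedule(expr: str) -> bool:
    """Cheap pre-submission sanity check, not a full CRON parser."""
    if expr in PREDEFINED:
        return True
    fields = expr.split()
    return len(fields) == 5 and all(CRON_FIELD.match(f) for f in fields)

print(looks_like_schedule("0 9 * * 1"))    # True  (9 AM every Monday)
print(looks_like_schedule("@hourly"))      # True
print(looks_like_schedule("every monday")) # False
```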
**Manage scheduled jobs:**
```python
hf_jobs("scheduled ps") # List scheduled jobs
hf_jobs("scheduled suspend", {"job_id": "..."}) # Pause
hf_jobs("scheduled resume", {"job_id": "..."}) # Resume
hf_jobs("scheduled delete", {"job_id": "..."}) # Delete
```
## Common Workload Patterns
This repository ships ready-to-run UV scripts in `hf-jobs/scripts/`. Prefer using them instead of inventing new templates.
### Pattern 1: Dataset → Model Responses (vLLM) — `scripts/generate-responses.py`
**What it does:** loads a Hub dataset (chat `messages` or a `prompt` column), applies a model chat template, generates responses with vLLM, and **pushes** the output dataset + dataset card back to the Hub.
**Requires:** GPU + **write** token (it pushes a dataset).
```python
from pathlib import Path
script = Path("hf-jobs/scripts/generate-responses.py").read_text()
hf_jobs("uv", {
"script": script,
"script_args": [
"username/input-dataset",
"username/output-dataset",
"--messages-column", "messages",
"--model-id", "Qwen/Qwen3-30B-A3B-Instruct-2507",
"--temperature", "0.7",
"--top-p", "0.8",
"--max-tokens", "2048",
],
"flavor": "a10g-large",
"timeout": "4h",
"secrets": {"HF_TOKEN": "$HF_TOKEN"},
})
```
### Pattern 2: CoT Self-Instruct Synthetic Data — `scripts/cot-self-instruct.py`
**What it does:** generates synthetic prompts/answers via CoT Self-Instruct, optionally filters outputs (answer-consistency / RIP), then **pushes** the generated dataset + dataset card to the Hub.
**Requires:** GPU + **write** token (it pushes a dataset).
```python
from pathlib import Path
script = Path("hf-jobs/scripts/cot-self-instruct.py").read_text()
hf_jobs("uv", {
"script": script,
"script_args": [
"--seed-dataset", "davanstrien/s1k-reasoning",
"--output-dataset", "username/synthetic-math",
"--task-type", "reasoning",
"--num-samples", "5000",
"--filter-method", "answer-consistency",
],
"flavor": "l4x4",
"timeout": "8h",
"secrets": {"HF_TOKEN": "$HF_TOKEN"},
})
```
### Pattern 3: Streaming Dataset Stats (Polars + HF Hub) — `scripts/finepdfs-stats.py`
**What it does:** scans parquet directly from Hub (no 300GB download), computes temporal stats, and (optionally) uploads results to a Hub dataset repo.
**Requires:** CPU is often enough; token needed **only** if you pass `--output-repo` (upload).
```python
from pathlib import Path
script = Path("hf-jobs/scripts/finepdfs-stats.py").read_text()
hf_jobs("uv", {
"script": script,
"script_args": [
"--limit", "10000",
"--show-plan",
"--output-repo", "username/finepdfs-temporal-stats",
],
"flavor": "cpu-upgrade",
"timeout": "2h",
"env": {"HF_XET_HIGH_PERFORMANCE": "1"},
"secrets": {"HF_TOKEN": "$HF_TOKEN"},
})
```
## Common Failure Modes
### Out of Memory (OOM)
**Fix:**
1. Reduce batch size or data chunk size
2. Process data in smaller batches
3. Upgrade hardware: cpu → t4 → a10g → a100
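Fix 2 (processing in smaller batches) is usually the cheapest remedy. A sketch of the pattern, where `process_batch` stands in for your real workload:

```python
# Process a large collection in fixed-size chunks so peak memory stays
# bounded by the chunk size rather than the full dataset.
def iter_chunks(items, chunk_size):
    for i in range(0, len(items), chunk_size):
        yield items[i:i + chunk_size]

def process_batch(batch):
    return [x * 2 for x in batch]  # placeholder for real per-batch work

results = []
for batch in iter_chunks(list(range(10)), chunk_size=4):
    results.extend(process_batch(batch))
print(len(results))  # 10
```

If a chunk still does not fit, halve `chunk_size` before reaching for larger hardware.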
### Job Timeout
**Fix:**
1. Check logs for actual runtime
2. Increase timeout with buffer: `"timeout": "3h"`
3. Optimize code for faster execution
4. Process data in chunks
### Hub Push Failures
**Fix:**
1. Add to job: `secrets={"HF_TOKEN": "$HF_TOKEN"}`
2. Verify token in script: `assert "HF_TOKEN" in os.environ`
3. Check token permissions
4. Verify repo exists or can be created
### Missing Dependencies
**Fix:**
Add to PEP 723 header:
```python
# /// script
# dependencies = ["package1", "package2>=1.0.0"]
# ///
```
### Authentication Errors
**Fix:**
1. Check `hf_whoami()` works locally
2. Verify `secrets={"HF_TOKEN": "$HF_TOKEN"}` in job config
3. Re-login: `hf auth login`
4. Check token has required permissions
## Troubleshooting
**Common issues:**
- Job times out → Increase timeout, optimize code
- Results not saved → Check persistence method, verify HF_TOKEN
- Out of Memory → Reduce batch size, upgrade hardware
- Import errors → Add dependencies to PEP 723 header
- Authentication errors → Check token, verify secrets parameter
**See:** `references/troubleshooting.md` for complete troubleshooting guide
## Resources
### References (In This Skill)
- `references/token_usage.md` - Complete token usage guide
- `references/hardware_guide.md` - Hardware specs and selection
- `references/hub_saving.md` - Hub persistence guide
- `references/troubleshooting.md` - Common issues and solutions
### Scripts (In This Skill)
- `scripts/generate-responses.py` - vLLM batch generation: dataset → responses → push to Hub
- `scripts/cot-self-instruct.py` - CoT Self-Instruct synthetic data generation + filtering → push to Hub
- `scripts/finepdfs-stats.py` - Polars streaming stats over `finepdfs-edu` parquet on Hub (optional push)
### External Links
- [HF Jobs Documentation](https://huggingface.co/docs/huggingface_hub/guides/jobs)
- [UV Scripts Guide](https://docs.astral.sh/uv/guides/scripts/)
- [UV Scripts Organization](https://huggingface.co/uv-scripts)
- [HF Hub Authentication](https://huggingface.co/docs/huggingface_hub/quick-start#authentication)
## Key Takeaways
1. **Submit scripts inline** - The `script` parameter accepts Python code directly; no file saving required unless user requests
2. **Jobs are asynchronous** - Don't wait/poll; let user check when ready
3. **Always set timeout** - Default 30 min may be insufficient; set appropriate timeout
4. **Always persist results** - Environment is ephemeral; without persistence, all work is lost
5. **Use tokens securely** - Always use `secrets={"HF_TOKEN": "$HF_TOKEN"}` for Hub operations
6. **Choose appropriate hardware** - Start small, scale up based on needs
7. **Use UV scripts** - Default to `hf_jobs("uv", {...})` with inline scripts for Python workloads
8. **Handle authentication** - Verify tokens are available before Hub operations
9. **Monitor jobs** - Provide job URLs and status check commands
10. **Optimize costs** - Choose right hardware, set appropriate timeouts
---
## Referenced Files
> The following files are referenced in this skill and included for context.
### references/hardware_guide.md
```markdown
# Hardware Selection Guide
Choosing the right hardware (flavor) is critical for cost-effective workloads.
## Available Hardware
### CPU
- `cpu-basic` - Basic CPU, testing only
- `cpu-upgrade` - Enhanced CPU
**Use cases:** Data processing, testing scripts, lightweight workloads
**Not recommended for:** Model training, GPU-accelerated workloads
### GPU Options
| Flavor | GPU | Memory | Use Case | Cost/hour |
|--------|-----|--------|----------|-----------|
| `t4-small` | NVIDIA T4 | 16GB | <1B models, demos, batch inference | ~$0.50-1 |
| `t4-medium` | NVIDIA T4 | 16GB | 1-3B models, development | ~$1-2 |
| `l4x1` | NVIDIA L4 | 24GB | 3-7B models, efficient workloads | ~$2-3 |
| `l4x4` | 4x NVIDIA L4 | 96GB | Multi-GPU workloads | ~$8-12 |
| `a10g-small` | NVIDIA A10G | 24GB | 3-7B models, production | ~$3-4 |
| `a10g-large` | NVIDIA A10G | 24GB | 7-13B models | ~$4-6 |
| `a10g-largex2` | 2x NVIDIA A10G | 48GB | Multi-GPU, large models | ~$8-12 |
| `a10g-largex4` | 4x NVIDIA A10G | 96GB | Multi-GPU, very large models | ~$16-24 |
| `a100-large` | NVIDIA A100 | 40GB | 13B+ models, fast workloads | ~$8-12 |
## Selection Guidelines
### By Workload Type
**Data Processing**
- **Recommended:** `cpu-upgrade` or `l4x1`
- **Use case:** Transform, filter, analyze datasets
- **Batch size:** Depends on data size
- **Time:** Varies by dataset size
**Batch Inference**
- **Recommended:** `a10g-large` or `a100-large`
- **Use case:** Run inference on thousands of samples
- **Batch size:** 8-32 depending on model
- **Time:** Depends on number of samples
**Experiments & Benchmarks**
- **Recommended:** `a10g-small` or `a10g-large`
- **Use case:** Reproducible ML experiments
- **Batch size:** Varies
- **Time:** Depends on experiment complexity
**Model Training** (see `model-trainer` skill for details)
- **Recommended:** See model-trainer skill
- **Use case:** Fine-tuning models
- **Batch size:** Depends on model size
- **Time:** Hours to days
**Synthetic Data Generation**
- **Recommended:** `a10g-large` or `a100-large`
- **Use case:** Generate datasets using LLMs
- **Batch size:** Depends on generation method
- **Time:** Hours for large datasets
### By Budget
**Minimal Budget (<$5 total)**
- Use `cpu-basic` or `t4-small`
- Process small datasets
- Quick tests and demos
**Small Budget ($5-20)**
- Use `t4-medium` or `a10g-small`
- Process medium datasets
- Run experiments
**Medium Budget ($20-50)**
- Use `a10g-small` or `a10g-large`
- Process large datasets
- Production workloads
**Large Budget ($50-200)**
- Use `a10g-large` or `a100-large`
- Large-scale processing
- Multiple experiments
### By Model Size (for inference/processing)
**Tiny Models (<1B parameters)**
- **Recommended:** `t4-small`
- **Example:** Qwen2.5-0.5B, TinyLlama
- **Batch size:** 8-16
**Small Models (1-3B parameters)**
- **Recommended:** `t4-medium` or `a10g-small`
- **Example:** Qwen2.5-1.5B, Phi-2
- **Batch size:** 4-8
**Medium Models (3-7B parameters)**
- **Recommended:** `a10g-small` or `a10g-large`
- **Example:** Qwen2.5-7B, Mistral-7B
- **Batch size:** 2-4
**Large Models (7-13B parameters)**
- **Recommended:** `a10g-large` or `a100-large`
- **Example:** Llama-3-8B
- **Batch size:** 1-2
**Very Large Models (13B+ parameters)**
- **Recommended:** `a100-large`
- **Example:** Llama-3-13B, Llama-3-70B
- **Batch size:** 1
## Memory Considerations
### Estimating Memory Requirements
**For inference:**
```
Memory (GB) ≈ (Model params in billions) × 2-4
```
**For training:**
```
Memory (GB) ≈ (Model params in billions) × 20 (full) or × 4 (LoRA)
```
**Examples:**
- Qwen2.5-0.5B inference: ~1-2GB ✅ fits t4-small
- Qwen2.5-7B inference: ~14-28GB ✅ fits a10g-large
- Qwen2.5-7B training: ~140GB ❌ not feasible without LoRA
### Memory Optimization
If hitting memory limits:
1. **Reduce batch size**
```python
batch_size = 1
```
2. **Process in chunks**
```python
for chunk in chunks:
process(chunk)
```
3. **Use smaller models**
- Use quantized models
- Use LoRA adapters
4. **Upgrade hardware**
- cpu → t4 → a10g → a100
## Cost Estimation
### Formula
```
Total Cost = (Hours of runtime) × (Cost per hour)
```
### Example Calculations
**Data processing:**
- Hardware: cpu-upgrade ($0.50/hour)
- Time: 1 hour
- Cost: $0.50
**Batch inference:**
- Hardware: a10g-large ($5/hour)
- Time: 2 hours
- Cost: $10.00
**Experiments:**
- Hardware: a10g-small ($3.50/hour)
- Time: 4 hours
- Cost: $14.00
### Cost Optimization Tips
1. **Start small:** Test on cpu-basic or t4-small
2. **Monitor runtime:** Set appropriate timeouts
3. **Optimize code:** Reduce unnecessary compute
4. **Choose right hardware:** Don't over-provision
5. **Use checkpoints:** Resume if job fails
6. **Monitor costs:** Check running jobs regularly
## Multi-GPU Workloads
Multi-GPU flavors automatically distribute workloads:
**Multi-GPU flavors:**
- `l4x4` - 4x L4 GPUs
- `a10g-largex2` - 2x A10G GPUs
- `a10g-largex4` - 4x A10G GPUs
**When to use:**
- Large models (>13B parameters)
- Need faster processing (linear speedup)
- Large datasets (>100K samples)
- Parallel workloads
**Example:**
```python
hf_jobs("uv", {
"script": "process.py",
"flavor": "a10g-largex2", # 2 GPUs
"timeout": "4h",
"secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
```
## Choosing Between Options
### CPU vs GPU
**Choose CPU when:**
- No GPU acceleration needed
- Data processing only
- Budget constrained
- Simple workloads
**Choose GPU when:**
- Model inference/training
- GPU-accelerated libraries
- Need faster processing
- Large models
### a10g vs a100
**Choose a10g when:**
- Model <13B parameters
- Budget conscious
- Processing time not critical
**Choose a100 when:**
- Model 13B+ parameters
- Need fastest processing
- Memory requirements high
- Budget allows
### Single vs Multi-GPU
**Choose single GPU when:**
- Model <7B parameters
- Budget constrained
- Simpler debugging
**Choose multi-GPU when:**
- Model >13B parameters
- Need faster processing
- Large batch sizes required
- Cost-effective for large jobs
## Quick Reference
```python
# Workload type → Hardware selection
HARDWARE_MAP = {
"data_processing": "cpu-upgrade",
"batch_inference_small": "t4-small",
"batch_inference_medium": "a10g-large",
"batch_inference_large": "a100-large",
"experiments": "a10g-small",
"training": "see model-trainer skill"
}
```
```
### references/hub_saving.md
```markdown
# Saving Results to Hugging Face Hub
**⚠️ CRITICAL:** Job environments are ephemeral. ALL results are lost when a job completes unless persisted to the Hub or external storage.
## Why Persistence is Required
When running on Hugging Face Jobs:
- Environment is temporary
- All files deleted on job completion
- No local disk persistence
- Cannot access results after job ends
**Without persistence, all work is permanently lost.**
## Persistence Options
### Option 1: Push to Hugging Face Hub (Recommended)
**For models:**
```python
from transformers import AutoModel
model.push_to_hub("username/model-name", token=os.environ.get("HF_TOKEN"))
```
**For datasets:**
```python
from datasets import Dataset
dataset.push_to_hub("username/dataset-name", token=os.environ.get("HF_TOKEN"))
```
**For files/artifacts:**
```python
from huggingface_hub import HfApi
api = HfApi(token=os.environ.get("HF_TOKEN"))
api.upload_file(
path_or_fileobj="results.json",
path_in_repo="results.json",
repo_id="username/results",
repo_type="dataset"
)
```
### Option 2: External Storage
**S3:**
```python
import boto3
s3 = boto3.client('s3')
s3.upload_file('results.json', 'my-bucket', 'results.json')
```
**Google Cloud Storage:**
```python
from google.cloud import storage
client = storage.Client()
bucket = client.bucket('my-bucket')
blob = bucket.blob('results.json')
blob.upload_from_filename('results.json')
```
### Option 3: API Endpoint
```python
import requests
requests.post("https://your-api.com/results", json=results)
```
## Required Configuration for Hub Push
### Job Configuration
**Always include HF_TOKEN:**
```python
hf_jobs("uv", {
    "script": "your_script.py",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},  # ✅ Required for Hub operations
})
```
### Script Configuration
**Verify token exists:**
```python
import os
assert "HF_TOKEN" in os.environ, "HF_TOKEN required for Hub operations!"
```
**Use token for Hub operations:**
```python
from huggingface_hub import HfApi
# Auto-detects HF_TOKEN from environment
api = HfApi()
# Or explicitly pass token
api = HfApi(token=os.environ.get("HF_TOKEN"))
```
## Complete Examples
### Example 1: Push Dataset
```python
hf_jobs("uv", {
    "script": """
# /// script
# dependencies = ["datasets", "huggingface-hub"]
# ///
import os
from datasets import Dataset

# Verify token
assert "HF_TOKEN" in os.environ, "HF_TOKEN required!"

# Process data
data = {"text": ["Sample 1", "Sample 2"]}
dataset = Dataset.from_dict(data)

# Push to Hub
dataset.push_to_hub("username/my-dataset")
print("✅ Dataset pushed!")
""",
    "flavor": "cpu-basic",
    "timeout": "30m",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},
})
```
### Example 2: Push Model
```python
hf_jobs("uv", {
    "script": """
# /// script
# dependencies = ["transformers", "torch"]
# ///
import os
from transformers import AutoModel, AutoTokenizer

# Verify token
assert "HF_TOKEN" in os.environ, "HF_TOKEN required!"

# Load and process model
model = AutoModel.from_pretrained("base-model")
tokenizer = AutoTokenizer.from_pretrained("base-model")
# ... process model ...

# Push to Hub
model.push_to_hub("username/my-model")
tokenizer.push_to_hub("username/my-model")
print("✅ Model pushed!")
""",
    "flavor": "a10g-large",
    "timeout": "2h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},
})
```
### Example 3: Push Artifacts
```python
hf_jobs("uv", {
    "script": """
# /// script
# dependencies = ["huggingface-hub", "pandas"]
# ///
import os
import json
import pandas as pd
from huggingface_hub import HfApi

# Verify token
assert "HF_TOKEN" in os.environ, "HF_TOKEN required!"

# Generate results
results = {"accuracy": 0.95, "loss": 0.05}
df = pd.DataFrame([results])

# Save files
with open("results.json", "w") as f:
    json.dump(results, f)
df.to_csv("results.csv", index=False)

# Push to Hub (upload_file arguments are keyword-only)
api = HfApi()
api.upload_file(path_or_fileobj="results.json", path_in_repo="results.json",
                repo_id="username/results", repo_type="dataset")
api.upload_file(path_or_fileobj="results.csv", path_in_repo="results.csv",
                repo_id="username/results", repo_type="dataset")
print("✅ Results pushed!")
""",
    "flavor": "cpu-basic",
    "timeout": "30m",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},
})
```
## Authentication Methods
### Method 1: Automatic Token (Recommended)
```python
"secrets": {"HF_TOKEN": "$HF_TOKEN"}
```
Uses your logged-in Hugging Face token automatically.
### Method 2: Explicit Token
```python
"secrets": {"HF_TOKEN": "hf_abc123..."}
```
Provide token explicitly (not recommended for security).
### Method 3: Environment Variable
```python
"env": {"HF_TOKEN": "hf_abc123..."}
```
Pass as regular environment variable (less secure than secrets).
**Always prefer Method 1** for security and convenience.
## Verification Checklist
Before submitting any job that saves to Hub, verify:
- [ ] `secrets={"HF_TOKEN": "$HF_TOKEN"}` in job config
- [ ] Script checks for token: `assert "HF_TOKEN" in os.environ`
- [ ] Hub push code included in script
- [ ] Repository name doesn't conflict with existing repos
- [ ] You have write access to the target namespace
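The first items of this checklist can be automated before submission. A minimal sketch, assuming the `hf_jobs` payload shape used throughout this file; `preflight` and its checks are hypothetical helpers, not part of any SDK:

```python
def preflight(job_config: dict, script_text: str) -> list[str]:
    """Return a list of problems to fix before submitting a Hub-saving job."""
    problems = []
    # Checklist item 1: the job must forward the HF_TOKEN secret
    if job_config.get("secrets", {}).get("HF_TOKEN") != "$HF_TOKEN":
        problems.append('missing secrets={"HF_TOKEN": "$HF_TOKEN"}')
    # Checklist item 2: the script should verify/use the token
    if "HF_TOKEN" not in script_text:
        problems.append("script never references HF_TOKEN")
    # Checklist item 3: some persistence call must be present
    if "push_to_hub" not in script_text and "upload_file" not in script_text:
        problems.append("no Hub persistence call found in script")
    return problems

cfg = {"flavor": "cpu-basic", "secrets": {"HF_TOKEN": "$HF_TOKEN"}}
script = 'assert "HF_TOKEN" in os.environ\ndataset.push_to_hub("username/my-dataset")'
print(preflight(cfg, script))  # []
```

An empty list means the config passes these (purely textual) checks; write access and name conflicts still have to be verified against the Hub itself.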
## Repository Setup
### Automatic Creation
If repository doesn't exist, it's created automatically when first pushing (if token has write permissions).
### Manual Creation
Create repository before pushing:
```python
from huggingface_hub import HfApi
api = HfApi()
api.create_repo(
    repo_id="username/repo-name",
    repo_type="model",  # or "dataset"
    private=False,      # or True for a private repo
)
```
### Repository Naming
**Valid names:**
- `username/my-model`
- `username/model-name`
- `organization/model-name`
**Invalid names:**
- `model-name` (missing username)
- `username/model name` (spaces not allowed)
- `username/MODEL` (uppercase discouraged)
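A conservative validity check can catch these naming mistakes before a push. The regex below is an approximation for illustration; the Hub performs its own authoritative validation:

```python
import re

# Rough shape: "namespace/name", each part starting with an alphanumeric
# character, then letters, digits, '.', '-', or '_'. No spaces, slash required.
REPO_ID_RE = re.compile(r"^[A-Za-z0-9][\w.-]*/[A-Za-z0-9][\w.-]*$")

def looks_like_valid_repo_id(repo_id: str) -> bool:
    """Conservative check for 'username/repo-name' style IDs."""
    return bool(REPO_ID_RE.match(repo_id))

print(looks_like_valid_repo_id("username/my-model"))    # True
print(looks_like_valid_repo_id("model-name"))           # False (missing namespace)
print(looks_like_valid_repo_id("username/model name"))  # False (space not allowed)
```

Note that uppercase IDs such as `username/MODEL` pass this check: they are accepted by the Hub, just discouraged.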
## Troubleshooting
### Error: 401 Unauthorized
**Cause:** HF_TOKEN not provided or invalid
**Solutions:**
1. Verify `secrets={"HF_TOKEN": "$HF_TOKEN"}` in job config
2. Check you're logged in: `hf_whoami()`
3. Re-login: `hf auth login`
### Error: 403 Forbidden
**Cause:** No write access to repository
**Solutions:**
1. Check repository namespace matches your username
2. Verify you're a member of organization (if using org namespace)
3. Check token has write permissions
### Error: Repository not found
**Cause:** Repository doesn't exist and auto-creation failed
**Solutions:**
1. Manually create repository first
2. Check repository name format
3. Verify namespace exists
### Error: Push failed
**Cause:** Network issues or Hub unavailable
**Solutions:**
1. Check logs for specific error
2. Verify token is valid
3. Retry push operation
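For transient failures like the "Push failed" case above, the usual remedy is retrying with exponential backoff. A generic sketch (not part of `huggingface_hub`); `push_fn` is any zero-argument callable that performs the push:

```python
import time

def push_with_retries(push_fn, attempts: int = 3, base_delay: float = 2.0):
    """Call push_fn(), retrying on exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return push_fn()
        except Exception as exc:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            delay = base_delay * (2 ** attempt)
            print(f"Push failed ({exc}); retrying in {delay:.0f}s...")
            time.sleep(delay)

# Usage (hypothetical):
# push_with_retries(lambda: dataset.push_to_hub("username/my-dataset"))
```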
## Best Practices
1. **Always verify token exists** before Hub operations
2. **Use descriptive repo names** (e.g., `my-experiment-results` not `results`)
3. **Push incrementally** for large results (use checkpoints)
4. **Verify push success** in logs before job completes
5. **Use appropriate repo types** (model vs dataset)
6. **Add README** with result descriptions
7. **Tag repos** with relevant tags
## Monitoring Push Progress
Check logs for push progress:
```python
hf_jobs("logs", {"job_id": "your-job-id"})
```
**Look for:**
```
Pushing to username/repo-name...
Upload file results.json: 100%
✅ Push successful
```
## Key Takeaway
**Without `secrets={"HF_TOKEN": "$HF_TOKEN"}` and persistence code, all results are permanently lost.**
Always verify both are configured before submitting any job that produces results.
```
### scripts/generate-responses.py
```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "datasets",
#     "flashinfer-python",
#     "huggingface-hub[hf_transfer]",
#     "hf-xet>=1.1.7",
#     "torch",
#     "transformers",
#     "vllm>=0.8.5",
# ]
# ///
"""
Generate responses for prompts in a dataset using vLLM for efficient GPU inference.
This script loads a dataset from Hugging Face Hub containing chat-formatted messages,
applies the model's chat template, generates responses using vLLM, and saves the
results back to the Hub with a comprehensive dataset card.
Example usage:
    # Local execution with auto GPU detection
    uv run generate-responses.py \\
        username/input-dataset \\
        username/output-dataset \\
        --messages-column messages

    # With custom model and sampling parameters
    uv run generate-responses.py \\
        username/input-dataset \\
        username/output-dataset \\
        --model-id meta-llama/Llama-3.1-8B-Instruct \\
        --temperature 0.9 \\
        --top-p 0.95 \\
        --max-tokens 2048

    # HF Jobs execution (see script output for full command)
    hf jobs uv run --flavor a100x4 ...
"""
import argparse
import logging
import os
import sys
from datetime import datetime
from typing import Optional
from datasets import load_dataset
from huggingface_hub import DatasetCard, get_token, login
from torch import cuda
from tqdm.auto import tqdm
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
# Enable HF Transfer for faster downloads
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
logging.basicConfig(
level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)
def check_gpu_availability() -> int:
"""Check if CUDA is available and return the number of GPUs."""
if not cuda.is_available():
logger.error("CUDA is not available. This script requires a GPU.")
logger.error(
"Please run on a machine with NVIDIA GPU or use HF Jobs with GPU flavor."
)
sys.exit(1)
num_gpus = cuda.device_count()
for i in range(num_gpus):
gpu_name = cuda.get_device_name(i)
gpu_memory = cuda.get_device_properties(i).total_memory / 1024**3
logger.info(f"GPU {i}: {gpu_name} with {gpu_memory:.1f} GB memory")
return num_gpus
def create_dataset_card(
source_dataset: str,
model_id: str,
messages_column: str,
prompt_column: Optional[str],
sampling_params: SamplingParams,
tensor_parallel_size: int,
num_examples: int,
generation_time: str,
num_skipped: int = 0,
max_model_len_used: Optional[int] = None,
) -> str:
"""Create a comprehensive dataset card documenting the generation process."""
filtering_section = ""
if num_skipped > 0:
skip_percentage = (num_skipped / num_examples) * 100
processed = num_examples - num_skipped
filtering_section = f"""
### Filtering Statistics
- **Total Examples**: {num_examples:,}
- **Processed**: {processed:,} ({100 - skip_percentage:.1f}%)
- **Skipped (too long)**: {num_skipped:,} ({skip_percentage:.1f}%)
- **Max Model Length Used**: {max_model_len_used:,} tokens
Note: Prompts exceeding the maximum model length were skipped and have empty responses."""
return f"""---
tags:
- generated
- vllm
- uv-script
---
# Generated Responses Dataset
This dataset contains generated responses for prompts from [{source_dataset}](https://huggingface.co/datasets/{source_dataset}).
## Generation Details
- **Source Dataset**: [{source_dataset}](https://huggingface.co/datasets/{source_dataset})
- **Input Column**: `{prompt_column if prompt_column else messages_column}` ({"plain text prompts" if prompt_column else "chat messages"})
- **Model**: [{model_id}](https://huggingface.co/{model_id})
- **Number of Examples**: {num_examples:,}
- **Generation Date**: {generation_time}{filtering_section}
### Sampling Parameters
- **Temperature**: {sampling_params.temperature}
- **Top P**: {sampling_params.top_p}
- **Top K**: {sampling_params.top_k}
- **Min P**: {sampling_params.min_p}
- **Max Tokens**: {sampling_params.max_tokens}
- **Repetition Penalty**: {sampling_params.repetition_penalty}
### Hardware Configuration
- **Tensor Parallel Size**: {tensor_parallel_size}
- **GPU Configuration**: {tensor_parallel_size} GPU(s)
## Dataset Structure
The dataset contains all columns from the source dataset plus:
- `response`: The generated response from the model
## Generation Script
Generated using the vLLM inference script from [uv-scripts/vllm](https://huggingface.co/datasets/uv-scripts/vllm).
To reproduce this generation:
```bash
uv run https://huggingface.co/datasets/uv-scripts/vllm/raw/main/generate-responses.py \\
{source_dataset} \\
<output-dataset> \\
--model-id {model_id} \\
{"--prompt-column " + prompt_column if prompt_column else "--messages-column " + messages_column} \\
--temperature {sampling_params.temperature} \\
--top-p {sampling_params.top_p} \\
--top-k {sampling_params.top_k} \\
--max-tokens {sampling_params.max_tokens}{f" \\\\\\n --max-model-len {max_model_len_used}" if max_model_len_used else ""}
```
"""
def main(
src_dataset_hub_id: str,
output_dataset_hub_id: str,
model_id: str = "Qwen/Qwen3-30B-A3B-Instruct-2507",
messages_column: str = "messages",
prompt_column: Optional[str] = None,
output_column: str = "response",
temperature: float = 0.7,
top_p: float = 0.8,
top_k: int = 20,
min_p: float = 0.0,
max_tokens: int = 16384,
repetition_penalty: float = 1.0,
gpu_memory_utilization: float = 0.90,
max_model_len: Optional[int] = None,
tensor_parallel_size: Optional[int] = None,
skip_long_prompts: bool = True,
max_samples: Optional[int] = None,
hf_token: Optional[str] = None,
):
"""
Main generation pipeline.
Args:
src_dataset_hub_id: Input dataset on Hugging Face Hub
output_dataset_hub_id: Where to save results on Hugging Face Hub
model_id: Hugging Face model ID for generation
messages_column: Column name containing chat messages
prompt_column: Column name containing plain text prompts (alternative to messages_column)
output_column: Column name for generated responses
temperature: Sampling temperature
top_p: Top-p sampling parameter
top_k: Top-k sampling parameter
min_p: Minimum probability threshold
max_tokens: Maximum tokens to generate
repetition_penalty: Repetition penalty parameter
gpu_memory_utilization: GPU memory utilization factor
max_model_len: Maximum model context length (None uses model default)
tensor_parallel_size: Number of GPUs to use (auto-detect if None)
skip_long_prompts: Skip prompts exceeding max_model_len instead of failing
max_samples: Maximum number of samples to process (None for all)
hf_token: Hugging Face authentication token
"""
generation_start_time = datetime.now().isoformat()
# GPU check and configuration
num_gpus = check_gpu_availability()
if tensor_parallel_size is None:
tensor_parallel_size = num_gpus
logger.info(
f"Auto-detected {num_gpus} GPU(s), using tensor_parallel_size={tensor_parallel_size}"
)
else:
logger.info(f"Using specified tensor_parallel_size={tensor_parallel_size}")
if tensor_parallel_size > num_gpus:
logger.warning(
f"Requested {tensor_parallel_size} GPUs but only {num_gpus} available"
)
# Authentication - try multiple methods
HF_TOKEN = hf_token or os.environ.get("HF_TOKEN") or get_token()
if not HF_TOKEN:
logger.error("No HuggingFace token found. Please provide token via:")
logger.error(" 1. --hf-token argument")
logger.error(" 2. HF_TOKEN environment variable")
logger.error(" 3. Run 'huggingface-cli login' or use login() in Python")
sys.exit(1)
logger.info("HuggingFace token found, authenticating...")
login(token=HF_TOKEN)
# Initialize vLLM
logger.info(f"Loading model: {model_id}")
vllm_kwargs = {
"model": model_id,
"tensor_parallel_size": tensor_parallel_size,
"gpu_memory_utilization": gpu_memory_utilization,
}
if max_model_len is not None:
vllm_kwargs["max_model_len"] = max_model_len
logger.info(f"Using max_model_len={max_model_len}")
llm = LLM(**vllm_kwargs)
# Load tokenizer for chat template
logger.info("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Create sampling parameters
sampling_params = SamplingParams(
temperature=temperature,
top_p=top_p,
top_k=top_k,
min_p=min_p,
max_tokens=max_tokens,
repetition_penalty=repetition_penalty,
)
# Load dataset
logger.info(f"Loading dataset: {src_dataset_hub_id}")
dataset = load_dataset(src_dataset_hub_id, split="train")
# Apply max_samples if specified
if max_samples is not None and max_samples < len(dataset):
logger.info(f"Limiting dataset to {max_samples} samples")
dataset = dataset.select(range(max_samples))
total_examples = len(dataset)
logger.info(f"Dataset loaded with {total_examples:,} examples")
# Determine which column to use and validate
if prompt_column:
# Use prompt column mode
if prompt_column not in dataset.column_names:
logger.error(
f"Column '{prompt_column}' not found. Available columns: {dataset.column_names}"
)
sys.exit(1)
logger.info(f"Using prompt column mode with column: '{prompt_column}'")
use_messages = False
else:
# Use messages column mode
if messages_column not in dataset.column_names:
logger.error(
f"Column '{messages_column}' not found. Available columns: {dataset.column_names}"
)
sys.exit(1)
logger.info(f"Using messages column mode with column: '{messages_column}'")
use_messages = True
# Get effective max length for filtering
if max_model_len is not None:
effective_max_len = max_model_len
else:
# Get model's default max length
effective_max_len = llm.llm_engine.model_config.max_model_len
logger.info(f"Using effective max model length: {effective_max_len}")
# Process messages and apply chat template
logger.info("Preparing prompts...")
all_prompts = []
valid_prompts = []
valid_indices = []
skipped_info = []
for i, example in enumerate(tqdm(dataset, desc="Processing prompts")):
if use_messages:
# Messages mode: use existing chat messages
messages = example[messages_column]
# Apply chat template
prompt = tokenizer.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
else:
# Prompt mode: convert plain text to messages format
user_prompt = example[prompt_column]
messages = [{"role": "user", "content": user_prompt}]
# Apply chat template
prompt = tokenizer.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
all_prompts.append(prompt)
# Count tokens if filtering is enabled
if skip_long_prompts:
tokens = tokenizer.encode(prompt)
if len(tokens) <= effective_max_len:
valid_prompts.append(prompt)
valid_indices.append(i)
else:
skipped_info.append((i, len(tokens)))
else:
valid_prompts.append(prompt)
valid_indices.append(i)
# Log filtering results
if skip_long_prompts and skipped_info:
logger.warning(
f"Skipped {len(skipped_info)} prompts that exceed max_model_len ({effective_max_len} tokens)"
)
logger.info("Skipped prompt details (first 10):")
for idx, (prompt_idx, token_count) in enumerate(skipped_info[:10]):
logger.info(
f" - Example {prompt_idx}: {token_count} tokens (exceeds by {token_count - effective_max_len})"
)
if len(skipped_info) > 10:
logger.info(f" ... and {len(skipped_info) - 10} more")
skip_percentage = (len(skipped_info) / total_examples) * 100
if skip_percentage > 10:
logger.warning(f"WARNING: {skip_percentage:.1f}% of prompts were skipped!")
if not valid_prompts:
logger.error("No valid prompts to process after filtering!")
sys.exit(1)
# Generate responses - vLLM handles batching internally
logger.info(f"Starting generation for {len(valid_prompts):,} valid prompts...")
logger.info("vLLM will handle batching and scheduling automatically")
outputs = llm.generate(valid_prompts, sampling_params)
# Extract generated text and create full response list
logger.info("Extracting generated responses...")
responses = [""] * total_examples # Initialize with empty strings
for idx, output in enumerate(outputs):
original_idx = valid_indices[idx]
response = output.outputs[0].text.strip()
responses[original_idx] = response
# Add responses to dataset
logger.info("Adding responses to dataset...")
dataset = dataset.add_column(output_column, responses)
# Create dataset card
logger.info("Creating dataset card...")
card_content = create_dataset_card(
source_dataset=src_dataset_hub_id,
model_id=model_id,
messages_column=messages_column,
prompt_column=prompt_column,
sampling_params=sampling_params,
tensor_parallel_size=tensor_parallel_size,
num_examples=total_examples,
generation_time=generation_start_time,
num_skipped=len(skipped_info) if skip_long_prompts else 0,
max_model_len_used=effective_max_len if skip_long_prompts else None,
)
# Push dataset to hub
logger.info(f"Pushing dataset to: {output_dataset_hub_id}")
dataset.push_to_hub(output_dataset_hub_id, token=HF_TOKEN)
# Push dataset card
card = DatasetCard(card_content)
card.push_to_hub(output_dataset_hub_id, token=HF_TOKEN)
logger.info("✅ Generation complete!")
logger.info(
f"Dataset available at: https://huggingface.co/datasets/{output_dataset_hub_id}"
)
if __name__ == "__main__":
if len(sys.argv) > 1:
parser = argparse.ArgumentParser(
description="Generate responses for dataset prompts using vLLM",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Basic usage with default Qwen model
uv run generate-responses.py input-dataset output-dataset
# With custom model and parameters
uv run generate-responses.py input-dataset output-dataset \\
--model-id meta-llama/Llama-3.1-8B-Instruct \\
--temperature 0.9 \\
--max-tokens 2048
# Force specific GPU configuration
uv run generate-responses.py input-dataset output-dataset \\
--tensor-parallel-size 2 \\
--gpu-memory-utilization 0.95
# Using environment variable for token
HF_TOKEN=hf_xxx uv run generate-responses.py input-dataset output-dataset
""",
)
parser.add_argument(
"src_dataset_hub_id",
help="Input dataset on Hugging Face Hub (e.g., username/dataset-name)",
)
parser.add_argument(
"output_dataset_hub_id", help="Output dataset name on Hugging Face Hub"
)
parser.add_argument(
"--model-id",
type=str,
default="Qwen/Qwen3-30B-A3B-Instruct-2507",
help="Model to use for generation (default: Qwen3-30B-A3B-Instruct-2507)",
)
parser.add_argument(
"--messages-column",
type=str,
default="messages",
help="Column containing chat messages (default: messages)",
)
parser.add_argument(
"--prompt-column",
type=str,
help="Column containing plain text prompts (alternative to --messages-column)",
)
parser.add_argument(
"--output-column",
type=str,
default="response",
help="Column name for generated responses (default: response)",
)
parser.add_argument(
"--max-samples",
type=int,
help="Maximum number of samples to process (default: all)",
)
parser.add_argument(
"--temperature",
type=float,
default=0.7,
help="Sampling temperature (default: 0.7)",
)
parser.add_argument(
"--top-p",
type=float,
default=0.8,
help="Top-p sampling parameter (default: 0.8)",
)
parser.add_argument(
"--top-k",
type=int,
default=20,
help="Top-k sampling parameter (default: 20)",
)
parser.add_argument(
"--min-p",
type=float,
default=0.0,
help="Minimum probability threshold (default: 0.0)",
)
parser.add_argument(
"--max-tokens",
type=int,
default=16384,
help="Maximum tokens to generate (default: 16384)",
)
parser.add_argument(
"--repetition-penalty",
type=float,
default=1.0,
help="Repetition penalty (default: 1.0)",
)
parser.add_argument(
"--gpu-memory-utilization",
type=float,
default=0.90,
help="GPU memory utilization factor (default: 0.90)",
)
parser.add_argument(
"--max-model-len",
type=int,
help="Maximum model context length (default: model's default)",
)
parser.add_argument(
"--tensor-parallel-size",
type=int,
help="Number of GPUs to use (default: auto-detect)",
)
parser.add_argument(
"--hf-token",
type=str,
help="Hugging Face token (can also use HF_TOKEN env var)",
)
parser.add_argument(
"--skip-long-prompts",
action="store_true",
default=True,
help="Skip prompts that exceed max_model_len instead of failing (default: True)",
)
parser.add_argument(
"--no-skip-long-prompts",
dest="skip_long_prompts",
action="store_false",
help="Fail on prompts that exceed max_model_len",
)
args = parser.parse_args()
main(
src_dataset_hub_id=args.src_dataset_hub_id,
output_dataset_hub_id=args.output_dataset_hub_id,
model_id=args.model_id,
messages_column=args.messages_column,
prompt_column=args.prompt_column,
output_column=args.output_column,
temperature=args.temperature,
top_p=args.top_p,
top_k=args.top_k,
min_p=args.min_p,
max_tokens=args.max_tokens,
repetition_penalty=args.repetition_penalty,
gpu_memory_utilization=args.gpu_memory_utilization,
max_model_len=args.max_model_len,
tensor_parallel_size=args.tensor_parallel_size,
skip_long_prompts=args.skip_long_prompts,
max_samples=args.max_samples,
hf_token=args.hf_token,
)
else:
# Show HF Jobs example when run without arguments
print("""
vLLM Response Generation Script
==============================
This script requires arguments. For usage information:
uv run generate-responses.py --help
Example HF Jobs command with multi-GPU:
# If you're logged in with huggingface-cli, token will be auto-detected
hf jobs uv run \\
--flavor l4x4 \\
https://huggingface.co/datasets/uv-scripts/vllm/raw/main/generate-responses.py \\
username/input-dataset \\
username/output-dataset \\
--messages-column messages \\
--model-id Qwen/Qwen3-30B-A3B-Instruct-2507 \\
--temperature 0.7 \\
--max-tokens 16384
""")
```
### scripts/cot-self-instruct.py
```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "datasets",
#     "transformers",
#     "vllm>=0.6.5",
#     "huggingface-hub[hf_transfer]",
#     "torch",
#     "numpy",
#     "tqdm",
#     "scikit-learn",
# ]
# ///
"""
Generate high-quality synthetic data using Chain-of-Thought Self-Instruct methodology.
This script implements the CoT-Self-Instruct approach from the paper "CoT-Self-Instruct:
Building high-quality synthetic prompts for reasoning and non-reasoning tasks" (2025).
It supports two modes:
1. Reasoning tasks: Generates both questions and answers with Chain-of-Thought
2. Instruction tasks: Generates diverse prompts for general instruction following
Example usage:
    # Reasoning tasks with Answer-Consistency filtering
    uv run cot-self-instruct.py \\
        --seed-dataset davanstrien/s1k-reasoning \\
        --output-dataset username/synthetic-math \\
        --task-type reasoning \\
        --num-samples 5000 \\
        --filter-method answer-consistency

    # Instruction tasks with RIP filtering
    uv run cot-self-instruct.py \\
        --seed-dataset wildchat-filtered \\
        --output-dataset username/synthetic-prompts \\
        --task-type instruction \\
        --filter-method rip \\
        --reward-model Nexusflow/Athene-RM-8B

    # HF Jobs execution
    hf jobs uv run --flavor l4x4 \\
        --image vllm/vllm-openai \\
        -e HF_TOKEN=$(python3 -c "from huggingface_hub import get_token; print(get_token())") \\
        https://huggingface.co/datasets/uv-scripts/synthetic-data/raw/main/cot-self-instruct.py \\
        [args...]
"""
import argparse
import json
import logging
import os
import random
import re
import sys
from collections import Counter
from datetime import datetime
from typing import Dict, List, Optional, Tuple, Union
import numpy as np
import torch
from datasets import Dataset, load_dataset
from huggingface_hub import DatasetCard, login
from sklearn.cluster import KMeans
from tqdm.auto import tqdm
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
# Enable HF Transfer for faster downloads
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
logging.basicConfig(
level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)
# Prompt templates from the paper
REASONING_PROMPT_TEMPLATE = """You are a reasoning question generator assistant. Your goal is to create a novel, and challenging reasoning question. You are provided the following seed questions:
Seed Question 1: {seed1}
Seed Question 2: {seed2}
Your task is to:
1. Write a brand-new, self-contained reasoning question that meets the following requirements:
(a) The question draws inspiration from the seed question without copying it verbatim, remaining novel and of comparable difficulty.
(b) The question's final answer should be a single, unambiguous scalar value (e.g., an integer, reduced fraction, exact radical), or another answer type that can be verified in one step (e.g., 'yes/no,' a choice from A to D).
2. Then reason step by step, solve the new question and format your output as follows:
[New Question Begin]{{your_generated_question}}[New Question End]
[Final Answer to New Question Begin]\\boxed{{your_final_answer}}[Final Answer to New Question End]"""
INSTRUCTION_PROMPT_TEMPLATE = """You are a prompt generator assistant. Your goal is to create diverse and creative synthetic prompts.
Please follow the steps below to create synthetic prompts.
Step 1: Carefully read #Prompt 1# and #Prompt 2#. Identify and list all the common elements between these two prompts. If no common elements are found, list the main elements from each prompt.
Step 2: Develop a comprehensive plan based on the #Common Elements List# or #Main Elements List# from Step 1. This plan will guide the generation of new synthetic prompts that are similar to the original prompts.
Step 3: Execute the plan step by step and provide one #Synthetic Prompt#.
Please reply strictly in the following format:
- Step 1 #Common Elements List# or #Main Elements List#:
- Step 2 #Plan#:
- Step 3 #Synthetic Prompt#:
#Prompt 1#:
{prompt1}
#Prompt 2#:
{prompt2}"""
def check_gpu_availability() -> int:
"""Check if CUDA is available and return the number of GPUs."""
if not torch.cuda.is_available():
logger.error("CUDA is not available. This script requires a GPU.")
logger.error(
"Please run on a machine with NVIDIA GPU or use HF Jobs with GPU flavor."
)
sys.exit(1)
num_gpus = torch.cuda.device_count()
for i in range(num_gpus):
gpu_name = torch.cuda.get_device_name(i)
gpu_memory = torch.cuda.get_device_properties(i).total_memory / 1024**3
logger.info(f"GPU {i}: {gpu_name} with {gpu_memory:.1f} GB memory")
return num_gpus
def parse_thinking_output(text: str) -> str:
"""Remove thinking tokens from model output."""
# Remove <think>...</think> blocks
text = re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL)
return text.strip()
def extract_reasoning_output(text: str) -> Tuple[Optional[str], Optional[str]]:
"""Extract question and answer from reasoning task output."""
text = parse_thinking_output(text)
# Extract question
question_match = re.search(r'\[New Question Begin\](.*?)\[New Question End\]', text, re.DOTALL)
if not question_match:
return None, None
question = question_match.group(1).strip()
# Extract answer
answer_match = re.search(r'\[Final Answer to New Question Begin\]\\?boxed\{(.*?)\}\[Final Answer to New Question End\]', text, re.DOTALL)
if not answer_match:
# Try without \boxed
answer_match = re.search(r'\[Final Answer to New Question Begin\](.*?)\[Final Answer to New Question End\]', text, re.DOTALL)
if not answer_match:
return question, None
answer = answer_match.group(1).strip()
return question, answer
def extract_instruction_output(text: str) -> Optional[str]:
"""Extract synthetic prompt from instruction task output."""
text = parse_thinking_output(text)
# Look for the synthetic prompt after "Step 3 #Synthetic Prompt#:"
match = re.search(r'Step 3 #Synthetic Prompt#:\s*(.+)', text, re.DOTALL)
if match:
return match.group(1).strip()
return None
def categorize_prompts(prompts: List[str], num_categories: int = 8) -> Dict[int, List[int]]:
"""Categorize prompts using clustering for instruction tasks."""
from transformers import AutoModel
logger.info(f"Categorizing {len(prompts)} prompts into {num_categories} categories...")
# Use a small model for embeddings
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
# Get embeddings
embeddings = []
for prompt in tqdm(prompts, desc="Computing embeddings"):
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
outputs = model(**inputs)
embedding = outputs.last_hidden_state.mean(dim=1).numpy()
embeddings.append(embedding[0])
# Cluster
kmeans = KMeans(n_clusters=num_categories, random_state=42)
labels = kmeans.fit_predict(embeddings)
# Group by category
categories = {}
for idx, label in enumerate(labels):
if label not in categories:
categories[label] = []
categories[label].append(idx)
return categories
def generate_synthetic_data(
llm: LLM,
seed_data: List[Dict],
task_type: str,
num_samples: int,
categories: Optional[Dict[int, List[int]]] = None,
) -> List[Dict]:
"""Generate synthetic data using CoT-Self-Instruct."""
synthetic_data = []
# Set up progress bar
pbar = tqdm(total=num_samples, desc="Generating synthetic data")
while len(synthetic_data) < num_samples:
# Sample seed data
if task_type == "reasoning":
# Random sampling for reasoning tasks
seeds = random.sample(seed_data, min(2, len(seed_data)))
prompt = REASONING_PROMPT_TEMPLATE.format(
seed1=seeds[0].get("question", seeds[0].get("prompt", "")),
seed2=seeds[1].get("question", seeds[1].get("prompt", "")) if len(seeds) > 1 else seeds[0].get("question", seeds[0].get("prompt", ""))
)
else:
# Category-aware sampling for instruction tasks
if categories:
# Pick a random category
category = random.choice(list(categories.keys()))
category_indices = categories[category]
indices = random.sample(category_indices, min(2, len(category_indices)))
seeds = [seed_data[i] for i in indices]
else:
seeds = random.sample(seed_data, min(2, len(seed_data)))
prompt = INSTRUCTION_PROMPT_TEMPLATE.format(
prompt1=seeds[0].get("prompt", seeds[0].get("question", "")),
prompt2=seeds[1].get("prompt", seeds[1].get("question", "")) if len(seeds) > 1 else seeds[0].get("prompt", seeds[0].get("question", ""))
)
# Generate
sampling_params = SamplingParams(
temperature=0.7 if task_type == "reasoning" else 0.8,
top_p=0.95 if task_type == "reasoning" else 0.9,
max_tokens=2048,
)
outputs = llm.generate([prompt], sampling_params)
output_text = outputs[0].outputs[0].text
# Parse output
if task_type == "reasoning":
question, answer = extract_reasoning_output(output_text)
if question and answer:
synthetic_data.append({
"question": question,
"answer": answer,
"seed_indices": [seed_data.index(s) for s in seeds],
})
pbar.update(1)
else:
synthetic_prompt = extract_instruction_output(output_text)
if synthetic_prompt:
synthetic_data.append({
"prompt": synthetic_prompt,
"seed_indices": [seed_data.index(s) for s in seeds],
})
pbar.update(1)
pbar.close()
return synthetic_data
def answer_consistency_filter(
llm: LLM,
synthetic_data: List[Dict],
k_responses: int = 16,
threshold: float = 0.5,
) -> List[Dict]:
"""Filter reasoning tasks using Answer-Consistency."""
logger.info(f"Applying Answer-Consistency filter with K={k_responses}")
filtered_data = []
for item in tqdm(synthetic_data, desc="Answer-Consistency filtering"):
question = item["question"]
original_answer = item["answer"]
# Generate K responses
prompts = [question] * k_responses
sampling_params = SamplingParams(
temperature=0.6,
top_p=0.95,
max_tokens=1024,
)
outputs = llm.generate(prompts, sampling_params)
# Extract answers
answers = []
for output in outputs:
text = output.outputs[0].text
# Try to extract boxed answer
match = re.search(r'\\boxed\{(.*?)\}', text)
if match:
answers.append(match.group(1).strip())
if not answers:
continue
# Get majority answer
answer_counts = Counter(answers)
if answer_counts:
majority_answer, count = answer_counts.most_common(1)[0]
# Check if majority answer matches original and meets threshold
if (majority_answer == original_answer and
count / len(answers) >= threshold):
item["consistency_score"] = count / len(answers)
filtered_data.append(item)
logger.info(f"Answer-Consistency: kept {len(filtered_data)}/{len(synthetic_data)} examples")
return filtered_data
def rip_filter(
llm: LLM,
synthetic_data: List[Dict],
reward_model_id: str,
k_responses: int = 32,
threshold: float = 0.5,
) -> List[Dict]:
"""Filter using Rejecting Instruction Preferences (RIP)."""
logger.info(f"Applying RIP filter with K={k_responses} and reward model {reward_model_id}")
# Note: In a full implementation, you would load and use the actual reward model
# For this example, we'll use a placeholder scoring mechanism
logger.warning("RIP filtering requires a reward model implementation - using placeholder")
filtered_data = []
for item in tqdm(synthetic_data, desc="RIP filtering"):
prompt = item.get("prompt", item.get("question", ""))
# Generate K responses
prompts = [prompt] * k_responses
sampling_params = SamplingParams(
temperature=1.0,
top_p=1.0,
max_tokens=1024,
)
outputs = llm.generate(prompts, sampling_params)
# In real implementation: score each response with reward model
# For now, use length as a proxy (longer responses often score higher)
scores = [len(output.outputs[0].text) for output in outputs]
# Use minimum score as quality indicator
min_score = min(scores) if scores else 0
normalized_score = min_score / 1000  # Crude placeholder normalization (not clamped to [0, 1])
if normalized_score >= threshold:
item["rip_score"] = normalized_score
filtered_data.append(item)
logger.info(f"RIP filter: kept {len(filtered_data)}/{len(synthetic_data)} examples")
return filtered_data
def create_dataset_card(
task_type: str,
source_dataset: str,
generation_model: str,
filter_method: str,
num_generated: int,
num_filtered: int,
generation_time: str,
additional_info: Dict = None,
) -> str:
"""Create a comprehensive dataset card."""
filter_info = ""
if filter_method == "answer-consistency":
filter_info = """
### Answer-Consistency Filtering
This dataset was filtered using Answer-Consistency:
- Generated K responses for each synthetic question
- Kept only examples where majority answer matched the generated answer
- Ensures high-quality, correctly solved problems"""
elif filter_method == "rip":
filter_info = """
### RIP (Rejecting Instruction Preferences) Filtering
This dataset was filtered using RIP:
- Generated K responses for each synthetic prompt
- Scored responses using a reward model
- Kept only prompts with high minimum scores"""
return f"""---
tags:
- synthetic-data
- cot-self-instruct
- {task_type}
- uv-script
---
# CoT-Self-Instruct Synthetic Data
This dataset contains synthetic {task_type} data generated using the Chain-of-Thought Self-Instruct methodology.
## Generation Details
- **Source Dataset**: [{source_dataset}](https://huggingface.co/datasets/{source_dataset})
- **Generation Model**: [{generation_model}](https://huggingface.co/{generation_model})
- **Task Type**: {task_type}
- **Filter Method**: {filter_method}
- **Generated Examples**: {num_generated:,}
- **After Filtering**: {num_filtered:,} ({(num_filtered/num_generated)*100:.1f}% acceptance rate)
- **Generation Date**: {generation_time}
{filter_info}
## Methodology
Generated using CoT-Self-Instruct, which:
1. Uses Chain-of-Thought reasoning to analyze seed examples
2. Generates new synthetic examples of similar quality and complexity
3. Applies quality filtering to ensure high-quality outputs
Based on the paper: "CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks" (2025)
## Generation Script
Generated using the CoT-Self-Instruct script from [uv-scripts/synthetic-data](https://huggingface.co/datasets/uv-scripts/synthetic-data).
To reproduce:
```bash
uv run https://huggingface.co/datasets/uv-scripts/synthetic-data/raw/main/cot-self-instruct.py \\
--seed-dataset {source_dataset} \\
--output-dataset <your-dataset> \\
--task-type {task_type} \\
--generation-model {generation_model} \\
--filter-method {filter_method}
```
"""
def main():
parser = argparse.ArgumentParser(
description="Generate synthetic data using CoT-Self-Instruct",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=__doc__,
)
# Dataset arguments
parser.add_argument(
"--seed-dataset",
type=str,
required=True,
help="HuggingFace dataset ID containing seed examples",
)
parser.add_argument(
"--output-dataset",
type=str,
required=True,
help="HuggingFace dataset ID for output",
)
# Task configuration
parser.add_argument(
"--task-type",
type=str,
choices=["reasoning", "instruction", "auto"],
default="auto",
help="Type of task (reasoning generates Q&A, instruction generates prompts)",
)
parser.add_argument(
"--task-column",
type=str,
default=None,
help="Column name containing tasks (auto-detected if not specified)",
)
# Model configuration
parser.add_argument(
"--generation-model",
type=str,
default="Qwen/Qwen3-30B-A3B-Thinking-2507",
help="Model for synthetic data generation",
)
parser.add_argument(
"--filter-model",
type=str,
default=None,
help="Model for filtering (defaults to generation model)",
)
parser.add_argument(
"--reward-model",
type=str,
default="Nexusflow/Athene-RM-8B",
help="Reward model for RIP filtering",
)
# Generation parameters
parser.add_argument(
"--num-samples",
type=int,
default=5000,
help="Number of synthetic examples to generate",
)
parser.add_argument(
"--batch-size",
type=int,
default=1,
help="Batch size for generation",
)
# Filtering parameters
parser.add_argument(
"--filter-method",
type=str,
choices=["answer-consistency", "rip", "both", "none"],
default="answer-consistency",
help="Quality filtering method",
)
parser.add_argument(
"--k-responses",
type=int,
default=16,
help="Number of responses for filtering",
)
parser.add_argument(
"--quality-threshold",
type=float,
default=0.5,
help="Minimum quality threshold for filtering",
)
# GPU configuration
parser.add_argument(
"--tensor-parallel-size",
type=int,
default=None,
help="Number of GPUs for tensor parallelism (auto-detected if not set)",
)
parser.add_argument(
"--gpu-memory-utilization",
type=float,
default=0.9,
help="GPU memory utilization",
)
# Other arguments
parser.add_argument(
"--hf-token",
type=str,
default=None,
help="HuggingFace API token",
)
parser.add_argument(
"--seed",
type=int,
default=42,
help="Random seed",
)
args = parser.parse_args()
# Set random seeds
random.seed(args.seed)
np.random.seed(args.seed)
torch.manual_seed(args.seed)
# Check GPU
num_gpus = check_gpu_availability()
tensor_parallel_size = args.tensor_parallel_size or num_gpus
# Authentication
hf_token = args.hf_token or os.environ.get("HF_TOKEN")
if hf_token:
login(token=hf_token)
# Load seed dataset
logger.info(f"Loading seed dataset: {args.seed_dataset}")
seed_dataset = load_dataset(args.seed_dataset, split="train")
# Auto-detect task type and column if needed
if args.task_type == "auto":
columns = seed_dataset.column_names
if "question" in columns and "answer" in columns:
args.task_type = "reasoning"
logger.info("Auto-detected task type: reasoning")
else:
args.task_type = "instruction"
logger.info("Auto-detected task type: instruction")
if not args.task_column:
if args.task_type == "reasoning":
args.task_column = "question"
else:
# Try to find prompt column
for col in ["prompt", "instruction", "text", "input"]:
if col in seed_dataset.column_names:
args.task_column = col
break
logger.info(f"Using task column: {args.task_column}")
# Convert to list of dicts
seed_data = seed_dataset.to_list()
# Categorize prompts for instruction tasks
categories = None
if args.task_type == "instruction" and len(seed_data) > 100:
prompts = [item.get(args.task_column, "") for item in seed_data]
categories = categorize_prompts(prompts)
# Initialize generation model
logger.info(f"Loading generation model: {args.generation_model}")
generation_llm = LLM(
model=args.generation_model,
tensor_parallel_size=tensor_parallel_size,
gpu_memory_utilization=args.gpu_memory_utilization,
)
# Generate synthetic data
start_time = datetime.now()
synthetic_data = generate_synthetic_data(
generation_llm,
seed_data,
args.task_type,
args.num_samples,
categories,
)
# Apply filtering
filter_llm = generation_llm
if args.filter_model and args.filter_model != args.generation_model:
logger.info(f"Loading filter model: {args.filter_model}")
# Clean up generation model
del generation_llm
torch.cuda.empty_cache()
filter_llm = LLM(
model=args.filter_model,
tensor_parallel_size=tensor_parallel_size,
gpu_memory_utilization=args.gpu_memory_utilization,
)
filtered_data = synthetic_data
if args.filter_method != "none":
if args.filter_method == "answer-consistency" and args.task_type == "reasoning":
filtered_data = answer_consistency_filter(
filter_llm,
synthetic_data,
args.k_responses,
args.quality_threshold,
)
elif args.filter_method == "rip":
filtered_data = rip_filter(
filter_llm,
synthetic_data,
args.reward_model,
args.k_responses,
args.quality_threshold,
)
elif args.filter_method == "both":
if args.task_type == "reasoning":
filtered_data = answer_consistency_filter(
filter_llm,
synthetic_data,
args.k_responses,
args.quality_threshold,
)
filtered_data = rip_filter(
filter_llm,
filtered_data,
args.reward_model,
args.k_responses,
args.quality_threshold,
)
# Create HuggingFace dataset
logger.info(f"Creating dataset with {len(filtered_data)} examples")
dataset = Dataset.from_list(filtered_data)
# Create dataset card
generation_time = start_time.strftime("%Y-%m-%d %H:%M:%S UTC")
dataset_card = create_dataset_card(
args.task_type,
args.seed_dataset,
args.generation_model,
args.filter_method,
len(synthetic_data),
len(filtered_data),
generation_time,
)
# Push to hub
logger.info(f"Pushing dataset to: {args.output_dataset}")
# Wrap the generated card content in a DatasetCard object
card = DatasetCard(dataset_card)
dataset.push_to_hub(args.output_dataset)
# Push card separately
card.push_to_hub(args.output_dataset)
logger.info("Done! Dataset available at: https://huggingface.co/datasets/" + args.output_dataset)
# Print example HF Jobs command if running locally
if len(sys.argv) > 1:
print("\nTo run on HF Jobs:")
print(f"""hf jobs uv run --flavor l4x4 \\
--image vllm/vllm-openai \\
-e HF_TOKEN=$(python3 -c "from huggingface_hub import get_token; print(get_token())") \\
https://huggingface.co/datasets/uv-scripts/synthetic-data/raw/main/cot-self-instruct.py \\
--seed-dataset {args.seed_dataset} \\
--output-dataset {args.output_dataset} \\
--task-type {args.task_type} \\
--generation-model {args.generation_model} \\
--filter-method {args.filter_method} \\
--num-samples {args.num_samples}""")
if __name__ == "__main__":
main()
```
### scripts/finepdfs-stats.py
```python
# /// script
# requires-python = ">=3.12"
# dependencies = [
# "polars>=1.31.0",
# "huggingface-hub",
# "datasets",
# "ascii-graph",
# ]
# ///
"""
Analyze educational quality trends across CommonCrawl dumps using Polars streaming.
Answers: "Is the web getting more educational over time?"
Demonstrates Polars HF Hub integration - process 50M+ docs without downloading 300GB+.
Example usage:
# Analyze English PDFs (default)
uv run finepdfs-stats.py
# Analyze all 70+ languages
uv run finepdfs-stats.py --all-languages
# Quick test
uv run finepdfs-stats.py --limit 10000 --show-plan
# Save results to HF Hub
uv run finepdfs-stats.py --output-repo username/finepdfs-temporal-stats
# Run on HF Jobs
hf jobs uv run \\
-s HF_TOKEN \\
-e HF_XET_HIGH_PERFORMANCE=1 \\
https://huggingface.co/datasets/uv-scripts/dataset-stats/raw/main/finepdfs-stats.py \\
-- --output-repo username/stats
"""
import argparse
import logging
import os
import sys
import time
from pathlib import Path
import polars as pl
from ascii_graph import Pyasciigraph
from datasets import Dataset
from huggingface_hub import HfApi, create_repo, list_repo_tree, login
logging.basicConfig(
level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)
# Common language+script codes for finepdfs-edu
COMMON_LANGUAGES = {
"eng_Latn": "English (Latin script)",
"fra_Latn": "French (Latin script)",
"deu_Latn": "German (Latin script)",
"spa_Latn": "Spanish (Latin script)",
"por_Latn": "Portuguese (Latin script)",
"ita_Latn": "Italian (Latin script)",
"nld_Latn": "Dutch (Latin script)",
"pol_Latn": "Polish (Latin script)",
"rus_Cyrl": "Russian (Cyrillic script)",
"zho_Hans": "Chinese (Simplified)",
"zho_Hant": "Chinese (Traditional)",
"jpn_Jpan": "Japanese",
"kor_Hang": "Korean",
"ara_Arab": "Arabic",
"hin_Deva": "Hindi (Devanagari)",
}
def list_available_languages(dataset_id: str) -> list[str]:
"""List available language subsets in the dataset."""
try:
tree = list_repo_tree(dataset_id, path_in_repo="data", repo_type="dataset")
languages = [
item.path.replace("data/", "")
for item in tree
if item.path.startswith("data/")
and "/" not in item.path.replace("data/", "")
]
return sorted(languages)
except Exception as e:
logger.warning(f"Could not list languages: {e}")
return list(COMMON_LANGUAGES.keys())
def compute_temporal_stats(df: pl.LazyFrame, output_path: Path) -> pl.DataFrame:
"""Single scan: compute stats grouped by dump for temporal analysis."""
query = df.group_by("dump").agg(
pl.len().alias("doc_count"),
pl.col("token_count").sum().alias("total_tokens"),
pl.col("fw_edu_scores").list.mean().mean().alias("avg_edu_score"),
(pl.col("fw_edu_scores").list.mean() >= 3).sum().alias("high_edu_count"),
)
query.sink_parquet(output_path, engine="streaming")
return pl.read_parquet(output_path)
def compute_global_stats(temporal: pl.DataFrame) -> pl.DataFrame:
"""Compute global stats from temporal breakdown."""
total = temporal["doc_count"].sum()
return pl.DataFrame(
{
"total_docs": [total],
"total_tokens": [temporal["total_tokens"].sum()],
"avg_edu_score": [
(temporal["avg_edu_score"] * temporal["doc_count"]).sum() / total
],
"high_edu_rate": [temporal["high_edu_count"].sum() / total],
"num_dumps": [len(temporal)],
}
)
def format_temporal_stats(temporal: pl.DataFrame) -> pl.DataFrame:
"""Format temporal stats with high_edu_rate, sorted chronologically."""
return (
temporal.with_columns(
(pl.col("high_edu_count") / pl.col("doc_count")).alias("high_edu_rate")
)
.select(["dump", "doc_count", "avg_edu_score", "high_edu_rate"])
.sort(
"dump"
) # Chronological order (CC-MAIN-2017-xx comes before CC-MAIN-2024-xx)
)
def create_ascii_charts(temporal_stats: pl.DataFrame) -> str:
"""Create ASCII bar charts showing temporal trends."""
# Extract year from dump name (CC-MAIN-2024-42 -> 2024)
# Group by year and average the values for cleaner display
yearly = (
temporal_stats.with_columns(
pl.col("dump").str.extract(r"CC-MAIN-(\d{4})", 1).alias("year")
)
.group_by("year")
.agg(
pl.col("doc_count").sum(),
pl.col("avg_edu_score").mean(),
pl.col("high_edu_rate").mean(),
)
.sort("year")
)
lines = []
# High edu rate chart (more dramatic differences)
data_rate = [
(row["year"], row["high_edu_rate"] * 100)
for row in yearly.iter_rows(named=True)
]
graph = Pyasciigraph(line_length=60, float_format="{0:.1f}%")
lines.extend(graph.graph("High Educational Content (edu >= 3)", data_rate))
lines.append("")
# Avg edu score chart
data_score = [
(row["year"], row["avg_edu_score"]) for row in yearly.iter_rows(named=True)
]
graph2 = Pyasciigraph(line_length=60, float_format="{0:.2f}")
lines.extend(graph2.graph("Average Educational Score", data_score))
return "\n".join(lines)
def create_readme(
args,
global_stats: pl.DataFrame,
temporal_stats: pl.DataFrame,
scan_time: float,
ascii_charts: str,
) -> str:
"""Create README content for the stats dataset."""
stats = global_stats.to_dicts()[0]
total_docs = stats.get("total_docs", 0)
docs_per_sec = total_docs / scan_time if scan_time > 0 else 0
# Get first and last year averages for trend (more representative than single dumps)
yearly = (
temporal_stats.with_columns(
pl.col("dump").str.extract(r"CC-MAIN-(\d{4})", 1).alias("year")
)
.group_by("year")
.agg(
pl.col("doc_count").sum(),
pl.col("avg_edu_score").mean(),
pl.col("high_edu_rate").mean(),
)
.sort("year")
)
first_year = yearly.head(1).to_dicts()[0]
last_year = yearly.tail(1).to_dicts()[0]
scope = (
"all languages"
if args.all_languages
else COMMON_LANGUAGES.get(args.lang, args.lang)
)
return f"""---
tags:
- uv-script
- statistics
- polars
- finepdfs-edu
- temporal-analysis
license: odc-by
configs:
- config_name: global_stats
data_files: global_stats/train-*.parquet
- config_name: temporal_stats
data_files: temporal_stats/train-*.parquet
default_viewer_config: temporal_stats
---
# Is the Web Getting More Educational?
Temporal analysis of educational quality in **{scope}** across {stats.get("num_dumps", 0)} CommonCrawl dumps.
## Trend
```
{ascii_charts}
```
## Key Finding
| Year | Avg Edu Score | High Edu Rate |
|------|---------------|---------------|
| {first_year["year"]} | {first_year["avg_edu_score"]:.2f} | {first_year["high_edu_rate"] * 100:.1f}% |
| {last_year["year"]} | {last_year["avg_edu_score"]:.2f} | {last_year["high_edu_rate"] * 100:.1f}% |
## Performance
- **{total_docs:,} documents** processed in **{scan_time:.0f} seconds**
- **{docs_per_sec:,.0f} docs/sec** using Polars streaming
- Single scan, no full dataset download required
## Summary
| Metric | Value |
|--------|-------|
| Scope | {scope} |
| Total Documents | {total_docs:,} |
| Total Tokens | {stats.get("total_tokens", 0):,} |
| Avg Edu Score | {stats.get("avg_edu_score", 0):.3f} |
| High Edu Rate | {stats.get("high_edu_rate", 0) * 100:.1f}% |
| CommonCrawl Dumps | {stats.get("num_dumps", 0)} |
## Files
- `global_stats` - Overall summary
- `temporal_stats` - Per-dump breakdown (sorted chronologically)
## Reproduce
```bash
uv run https://huggingface.co/datasets/uv-scripts/dataset-stats/raw/main/finepdfs-stats.py \\
{"--all-languages" if args.all_languages else f"--lang {args.lang}"} --output-repo your-username/stats
```
## Source
- **Dataset**: [{args.source_dataset}](https://huggingface.co/datasets/{args.source_dataset})
- **Script**: [uv-scripts/dataset-stats](https://huggingface.co/datasets/uv-scripts/dataset-stats)
"""
def main():
parser = argparse.ArgumentParser(
description="Analyze educational quality trends across CommonCrawl dumps",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=__doc__,
)
parser.add_argument(
"--source-dataset",
type=str,
default="HuggingFaceFW/finepdfs-edu",
help="Source dataset (default: HuggingFaceFW/finepdfs-edu)",
)
parser.add_argument(
"--lang",
type=str,
default="eng_Latn",
help="Language+script code (default: eng_Latn)",
)
parser.add_argument(
"--all-languages",
action="store_true",
help="Analyze all languages (70+) instead of single language",
)
parser.add_argument(
"--show-plan",
action="store_true",
help="Show Polars query plan (demonstrates optimization)",
)
parser.add_argument(
"--list-languages",
action="store_true",
help="List available languages and exit",
)
parser.add_argument(
"--limit",
type=int,
help="Limit to first N rows (for testing)",
)
parser.add_argument(
"--output-repo",
type=str,
help="HuggingFace dataset repository to upload results",
)
parser.add_argument(
"--output-dir",
type=str,
default="./stats_output",
help="Local directory for output files",
)
parser.add_argument(
"--hf-token",
type=str,
help="HuggingFace API token (or set HF_TOKEN env var)",
)
parser.add_argument(
"--private",
action="store_true",
help="Make the output dataset private",
)
args = parser.parse_args()
# Check for high-performance mode
if os.environ.get("HF_XET_HIGH_PERFORMANCE"):
logger.info("High-performance mode enabled (HF_XET_HIGH_PERFORMANCE=1)")
# List languages mode
if args.list_languages:
print(f"Available language+script codes for {args.source_dataset}:\n")
print("Common languages:")
for code, name in COMMON_LANGUAGES.items():
print(f" {code:12} - {name}")
print("\nFetching full list from HF Hub...")
all_langs = list_available_languages(args.source_dataset)
print(f"\nAll available ({len(all_langs)} total):")
for lang in all_langs[:30]: # Show first 30
name = COMMON_LANGUAGES.get(lang, "")
print(f" {lang:12} {name}")
if len(all_langs) > 30:
print(f" ... and {len(all_langs) - 30} more")
sys.exit(0)
# Build the parquet path
if args.all_languages:
source_path = f"hf://datasets/{args.source_dataset}/data/*/train/*.parquet"
scope_desc = "all languages"
else:
source_path = (
f"hf://datasets/{args.source_dataset}/data/{args.lang}/train/*.parquet"
)
scope_desc = f"{args.lang} ({COMMON_LANGUAGES.get(args.lang, 'unknown')})"
logger.info(f"Scanning: {source_path}")
logger.info(f"Scope: {scope_desc}")
# Create lazy frame - this doesn't load any data yet!
logger.info("Creating lazy query plan...")
df = pl.scan_parquet(source_path)
# Apply limit if specified
if args.limit:
logger.info(f"Limiting to first {args.limit:,} rows")
df = df.head(args.limit)
# Show query plan if requested
if args.show_plan:
# Build a sample query to show the plan
sample_query = df.select(
pl.len(),
pl.col("token_count").sum(),
pl.col("language").n_unique(),
)
print("\nQuery Plan (showing Polars optimization):")
print("=" * 60)
print(sample_query.explain())
print("=" * 60)
print("\nNote: Polars uses projection pushdown - only reads columns needed!")
print("The 'text' column is never loaded, making this very fast.\n")
# Create output directory
output_dir = Path(args.output_dir)
output_dir.mkdir(parents=True, exist_ok=True)
# Single scan: compute temporal stats
logger.info("Computing temporal stats (single scan)...")
start = time.perf_counter()
temporal_path = output_dir / "temporal_stats.parquet"
temporal_raw = compute_temporal_stats(df, temporal_path)
scan_time = time.perf_counter() - start
logger.info(f"Scan complete in {scan_time:.2f}s - {len(temporal_raw)} dumps")
# Compute stats
global_stats = compute_global_stats(temporal_raw)
temporal_stats = format_temporal_stats(temporal_raw)
# Save
global_stats.write_parquet(output_dir / "global_stats.parquet")
temporal_stats.write_parquet(output_dir / "temporal_stats.parquet")
# Print results
total_docs = global_stats["total_docs"][0]
docs_per_sec = total_docs / scan_time if scan_time > 0 else 0
print("\n" + "=" * 70)
print("IS THE WEB GETTING MORE EDUCATIONAL?")
print("=" * 70)
print(f"\nScope: {scope_desc}")
print(f"Dataset: {args.source_dataset}")
print("\n" + "-" * 70)
print("GLOBAL STATS")
print("-" * 70)
print(global_stats)
print("\n" + "-" * 70)
print(f"TEMPORAL TREND ({len(temporal_stats)} CommonCrawl dumps)")
print("-" * 70)
# Show first 5 and last 5
if len(temporal_stats) > 10:
print("Earliest dumps:")
print(temporal_stats.head(5))
print("\n...")
print("\nLatest dumps:")
print(temporal_stats.tail(5))
else:
print(temporal_stats)
# Create ASCII charts
ascii_charts = create_ascii_charts(temporal_stats)
print("\n" + "-" * 70)
print("TREND VISUALIZATION")
print("-" * 70)
print(ascii_charts)
print("\n" + "-" * 70)
print("PERFORMANCE")
print("-" * 70)
print(f"Scan time: {scan_time:.2f}s")
print(f"Documents: {total_docs:,}")
print(f"Throughput: {docs_per_sec:,.0f} docs/sec")
logger.info(f"Results saved to: {output_dir}")
# Upload to HF Hub if requested
if args.output_repo:
hf_token = args.hf_token or os.environ.get("HF_TOKEN")
if hf_token:
login(token=hf_token)
api = HfApi(token=hf_token)
logger.info(f"Creating/updating dataset repository: {args.output_repo}")
create_repo(
args.output_repo,
repo_type="dataset",
private=args.private,
token=hf_token,
exist_ok=True,
)
# Upload each as a dataset config
configs = [
("global_stats", global_stats),
("temporal_stats", temporal_stats),
]
for config_name, stats_df in configs:
logger.info(f"Uploading {config_name}...")
ds = Dataset.from_polars(stats_df)
ds.push_to_hub(
args.output_repo,
config_name=config_name,
token=hf_token,
private=args.private,
)
time.sleep(1) # Avoid 409 conflicts
# Upload README
readme_content = create_readme(
args, global_stats, temporal_stats, scan_time, ascii_charts
)
api.upload_file(
path_or_fileobj=readme_content.encode(),
path_in_repo="README.md",
repo_id=args.output_repo,
repo_type="dataset",
token=hf_token,
)
dataset_url = f"https://huggingface.co/datasets/{args.output_repo}"
logger.info(f"Dataset uploaded: {dataset_url}")
print(f"\nResults uploaded to: {dataset_url}")
if __name__ == "__main__":
if len(sys.argv) == 1:
print("Is the Web Getting More Educational?")
print("=" * 40)
print("\nAnalyze educational quality trends across CommonCrawl dumps")
print("using Polars streaming - no download needed!\n")
print("Example commands:\n")
print("# Quick test:")
print("uv run finepdfs-stats.py --limit 10000\n")
print("# Analyze English PDFs:")
print("uv run finepdfs-stats.py\n")
print("# Analyze ALL 70+ languages:")
print("uv run finepdfs-stats.py --all-languages\n")
print("# Show query plan (see Polars optimization):")
print("uv run finepdfs-stats.py --show-plan --limit 1000\n")
print("# Save results to HF Hub:")
print("uv run finepdfs-stats.py --output-repo username/temporal-stats\n")
print("# Run on HF Jobs:")
print("hf jobs uv run \\")
print(" -s HF_TOKEN \\")
print(" -e HF_XET_HIGH_PERFORMANCE=1 \\")
print(
" https://huggingface.co/datasets/uv-scripts/dataset-stats/raw/main/finepdfs-stats.py \\"
)
print(" -- --output-repo username/stats")
sys.exit(0)
main()
```
### references/troubleshooting.md
```markdown
# Troubleshooting Guide
Common issues and solutions for Hugging Face Jobs.
## Authentication Issues
### Error: 401 Unauthorized
**Symptoms:**
```
401 Client Error: Unauthorized for url: https://huggingface.co/api/...
```
**Causes:**
- Token missing from job
- Token invalid or expired
- Token not passed correctly
**Solutions:**
1. Add `secrets={"HF_TOKEN": "$HF_TOKEN"}` to job config
2. Verify `hf_whoami()` works locally
3. Re-login: `hf auth login`
4. Check token hasn't expired
**Verification:**
```python
# In your script
import os
assert "HF_TOKEN" in os.environ, "HF_TOKEN missing!"
```
### Error: 403 Forbidden
**Symptoms:**
```
403 Client Error: Forbidden for url: https://huggingface.co/api/...
```
**Causes:**
- Token lacks required permissions
- No access to private repository
- Organization permissions insufficient
**Solutions:**
1. Ensure token has write permissions
2. Check token type at https://huggingface.co/settings/tokens
3. Verify access to target repository
4. Use organization token if needed
### Error: Token not found in environment
**Symptoms:**
```
KeyError: 'HF_TOKEN'
ValueError: HF_TOKEN not found
```
**Causes:**
- `secrets` not passed in job config
- Wrong key name (should be `HF_TOKEN`)
- Using `env` instead of `secrets`
**Solutions:**
1. Use `secrets={"HF_TOKEN": "$HF_TOKEN"}` (not `env`)
2. Verify key name is exactly `HF_TOKEN`
3. Check job config syntax
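A small guard at the top of the script turns the opaque `KeyError` into an actionable message. This is a generic sketch (the function name and error text are illustrative, not part of any Hugging Face API):

```python
import os

def require_token() -> str:
    # Fail fast with a clear message instead of a KeyError deep in the run.
    token = os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError(
            "HF_TOKEN missing - pass secrets={'HF_TOKEN': '$HF_TOKEN'} in the job config"
        )
    return token
```

Call `require_token()` before any Hub operation so a misconfigured job fails in seconds rather than after expensive compute.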
## Job Execution Issues
### Error: Job Timeout
**Symptoms:**
- Job stops unexpectedly
- Status shows "TIMEOUT"
- Partial results only
**Causes:**
- Default 30min timeout exceeded
- Job takes longer than expected
- No timeout specified
**Solutions:**
1. Check logs for actual runtime
2. Increase timeout with buffer: `"timeout": "3h"`
3. Optimize code for faster execution
4. Process data in chunks
5. Add 20-30% buffer to estimated time
**Example:**
```python
hf_jobs("uv", {
"script": "...",
"timeout": "2h" # Set appropriate timeout
})
```
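Solution 5 (add a 20-30% buffer) is simple arithmetic; a helper like the following (hypothetical, not part of the Jobs API) keeps the padding consistent:

```python
import math

def timeout_with_buffer(estimated_minutes: float, buffer: float = 0.25) -> str:
    # Pad the estimate, then round up to whole minutes for the job config.
    total = math.ceil(estimated_minutes * (1 + buffer))
    return f"{total}m"

# timeout_with_buffer(60) -> "75m"
```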
### Error: Out of Memory (OOM)
**Symptoms:**
```
RuntimeError: CUDA out of memory
MemoryError: Unable to allocate array
```
**Causes:**
- Batch size too large
- Model too large for hardware
- Insufficient GPU memory
**Solutions:**
1. Reduce batch size
2. Process data in smaller chunks
3. Upgrade hardware: cpu → t4 → a10g → a100
4. Use smaller models or quantization
5. Enable gradient checkpointing (for training)
**Example:**
```python
# Reduce batch size and process fixed-size chunks instead of the full dataset
batch_size = 8

def chunked(items, size):
    for i in range(0, len(items), size):
        yield items[i : i + size]

for chunk in chunked(data, batch_size):  # `data` and `process` stand in for your workload
    process(chunk)
```
### Error: Missing Dependencies
**Symptoms:**
```
ModuleNotFoundError: No module named 'package_name'
ImportError: cannot import name 'X'
```
**Causes:**
- Package not in dependencies
- Wrong package name
- Version mismatch
**Solutions:**
1. Add to PEP 723 header:
```python
# /// script
# dependencies = ["package-name>=1.0.0"]
# ///
```
2. Check package name spelling
3. Specify version if needed
4. Check package availability
### Error: Script Not Found
**Symptoms:**
```
FileNotFoundError: script.py not found
```
**Causes:**
- Local file path used (not supported)
- URL incorrect
- Script not accessible
**Solutions:**
1. Use inline script (recommended)
2. Use publicly accessible URL
3. Upload script to Hub first
4. Check URL is correct
**Correct approaches:**
```python
# ✅ Inline code
hf_jobs("uv", {"script": "# /// script\n# dependencies = [...]\n# ///\n\n<code>"})
# ✅ From URL
hf_jobs("uv", {"script": "https://huggingface.co/user/repo/resolve/main/script.py"})
```
## Hub Push Issues
### Error: Push Failed
**Symptoms:**
```
Error pushing to Hub
Upload failed
```
**Causes:**
- Network issues
- Token missing or invalid
- Repository access denied
- File too large
**Solutions:**
1. Check token: `assert "HF_TOKEN" in os.environ`
2. Verify repository exists or can be created
3. Check network connectivity in logs
4. Retry push operation
5. Split large files into chunks
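For transient network failures (solution 4), a retry wrapper with exponential backoff usually suffices. A generic sketch, not an official helper:

```python
import time

def with_retry(operation, attempts: int = 3, base_delay: float = 2.0):
    # Retry transient failures, doubling the delay between attempts.
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))

# Usage: with_retry(lambda: dataset.push_to_hub("username/dataset"))
```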
### Error: Repository Not Found
**Symptoms:**
```
404 Client Error: Not Found
Repository not found
```
**Causes:**
- Repository doesn't exist
- Wrong repository name
- No access to private repo
**Solutions:**
1. Create repository first:
```python
from huggingface_hub import HfApi
api = HfApi()
api.create_repo("username/repo-name", repo_type="dataset")
```
2. Check repository name format
3. Verify namespace exists
4. Check repository visibility
### Error: Results Not Saved
**Symptoms:**
- Job completes successfully
- No results visible on Hub
- Files not persisted
**Causes:**
- No persistence code in script
- Push code not executed
- Push failed silently
**Solutions:**
1. Add persistence code to script
2. Verify push executes successfully
3. Check logs for push errors
4. Add error handling around push
**Example:**
```python
try:
dataset.push_to_hub("username/dataset")
print("✅ Push successful")
except Exception as e:
print(f"❌ Push failed: {e}")
raise
```
## Hardware Issues
### Error: GPU Not Available
**Symptoms:**
```
CUDA not available
No GPU found
```
**Causes:**
- CPU flavor used instead of GPU
- GPU not requested
- CUDA not installed in image
**Solutions:**
1. Use GPU flavor: `"flavor": "a10g-large"`
2. Check image has CUDA support
3. Verify GPU availability in logs
### Error: Slow Performance
**Symptoms:**
- Job takes longer than expected
- Low GPU utilization
- CPU bottleneck
**Causes:**
- Wrong hardware selected
- Inefficient code
- Data loading bottleneck
**Solutions:**
1. Upgrade hardware
2. Optimize code
3. Use batch processing
4. Profile code to find bottlenecks
## General Issues
### Error: Job Status Unknown
**Symptoms:**
- Can't check job status
- Status API returns error
**Solutions:**
1. Use job URL: `https://huggingface.co/jobs/username/job-id`
2. Check logs: `hf_jobs("logs", {"job_id": "..."})`
3. Inspect job: `hf_jobs("inspect", {"job_id": "..."})`
### Error: Logs Not Available
**Symptoms:**
- No logs visible
- Logs delayed
**Causes:**
- Job just started (logs delayed 30-60s)
- Job failed before logging
- Logs not yet generated
**Solutions:**
1. Wait 30-60 seconds after job start
2. Check job status first
3. Use job URL for web interface
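Since logs can lag 30-60 seconds behind job start, a small polling loop saves repeated manual checks. This sketch takes any log-fetching callable (the `fetch` parameter is illustrative, not part of the Jobs API):

```python
import time

def wait_for_logs(fetch, retries=6, delay=10):
    """Poll for logs, tolerating the 30-60s startup delay."""
    for _ in range(retries):
        logs = fetch()
        if logs:
            return logs
        time.sleep(delay)
    return None  # still nothing after retries * delay seconds

# Usage: wait_for_logs(lambda: hf_jobs("logs", {"job_id": "your-job-id"}))
```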
### Error: Cost Unexpectedly High
**Symptoms:**
- Job costs more than expected
- Longer runtime than estimated
**Causes:**
- Job ran longer than timeout
- Wrong hardware selected
- Inefficient code
**Solutions:**
1. Monitor job runtime
2. Set appropriate timeout
3. Optimize code
4. Choose right hardware
5. Check cost estimates before running
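Worst-case cost is simple arithmetic: the job bills until it finishes or hits its timeout. A sketch (the hourly rate below is an assumed placeholder; check current Hugging Face pricing before relying on it):

```python
def worst_case_cost(hourly_rate_usd: float, timeout_hours: float) -> float:
    """Upper bound on spend: billing stops at completion or at the timeout."""
    return round(hourly_rate_usd * timeout_hours, 2)

# Example with an assumed $1.50/hr GPU rate and a 2h timeout:
# worst_case_cost(1.50, 2.0) -> 3.0
```

Setting the timeout to a realistic runtime (plus a safety margin) directly caps this upper bound.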
## Debugging Tips
### 1. Add Logging
```python
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
logger.info("Starting processing...")
logger.info(f"Processed {count} items")
```
### 2. Verify Environment
```python
import os
import sys
import torch  # assumes torch is a declared dependency

print(f"Python version: {sys.version}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"HF_TOKEN present: {'HF_TOKEN' in os.environ}")
```
### 3. Test Locally First
Run the script locally before submitting to catch errors early. For UV scripts with a PEP 723 header, `uv run` resolves the declared dependencies automatically:
```bash
uv run script.py
# or, with dependencies already installed:
python script.py
```
### 4. Check Job Logs
```python
# View logs
hf_jobs("logs", {"job_id": "your-job-id"})
# Or use job URL
# https://huggingface.co/jobs/username/job-id
```
### 5. Add Error Handling
```python
try:
# Your code
process_data()
except Exception as e:
print(f"Error: {e}")
import traceback
traceback.print_exc()
raise
```
## Quick Reference
### Common Error Codes
| Code | Meaning | Solution |
|------|---------|----------|
| 401 | Unauthorized | Add `secrets={"HF_TOKEN": "$HF_TOKEN"}` |
| 403 | Forbidden | Check token permissions |
| 404 | Not Found | Verify repository exists |
| 500 | Server Error | Retry or contact support |
### Checklist Before Submitting
- [ ] Token configured: `secrets={"HF_TOKEN": "$HF_TOKEN"}`
- [ ] Script checks for token: `assert "HF_TOKEN" in os.environ`
- [ ] Timeout set appropriately
- [ ] Hardware selected correctly
- [ ] Dependencies listed in PEP 723 header
- [ ] Persistence code included
- [ ] Error handling added
- [ ] Logging added for debugging
## Getting Help
If issues persist:
1. **Check logs** - Most errors include detailed messages
2. **Review documentation** - See main SKILL.md
3. **Check Hub status** - https://status.huggingface.co
4. **Community forums** - https://discuss.huggingface.co
5. **GitHub issues** - For bugs in huggingface_hub
## Key Takeaways
1. **Always include token** - `secrets={"HF_TOKEN": "$HF_TOKEN"}`
2. **Set appropriate timeout** - Default 30min may be insufficient
3. **Verify persistence** - Results won't persist without code
4. **Check logs** - Most issues visible in job logs
5. **Test locally** - Catch errors before submitting
6. **Add error handling** - Better debugging information
7. **Monitor costs** - Set timeouts to avoid unexpected charges
```
### references/token_usage.md
```markdown
# Token Usage Guide for Hugging Face Jobs
**⚠️ CRITICAL:** Proper token usage is essential for any job that interacts with the Hugging Face Hub.
## Overview
Hugging Face tokens are authentication credentials that allow your jobs to interact with the Hub. They're required for:
- Pushing models/datasets to Hub
- Accessing private repositories
- Creating new repositories
- Using Hub APIs programmatically
- Any authenticated Hub operations
## Token Types
### Read Token
- **Permissions:** Download models/datasets, read private repos
- **Use case:** Jobs that only need to download/read content
- **Creation:** https://huggingface.co/settings/tokens
### Write Token
- **Permissions:** Push models/datasets, create repos, modify content
- **Use case:** Jobs that need to upload results (most common)
- **Creation:** https://huggingface.co/settings/tokens
- **⚠️ Required for:** Pushing models, datasets, or any uploads
### Organization Token
- **Permissions:** Act on behalf of an organization
- **Use case:** Jobs running under organization namespace
- **Creation:** Organization settings → Tokens
## Providing Tokens to Jobs
### Method 1: Automatic Token (Recommended) ⭐
```python
hf_jobs("uv", {
"script": "your_script.py",
"secrets": {"HF_TOKEN": "$HF_TOKEN"} # ✅ Automatic replacement
})
```
**How it works:**
1. `$HF_TOKEN` is a placeholder that gets replaced with your actual token
2. Uses the token from your logged-in session (`hf auth login`)
3. Token is encrypted server-side when passed as a secret
4. Most secure and convenient method
**Benefits:**
- ✅ No token exposure in code
- ✅ Uses your current login session
- ✅ Automatically updated if you re-login
- ✅ Works seamlessly with MCP tools
- ✅ Token encrypted server-side
**Requirements:**
- Must be logged in: `hf auth login` or `hf_whoami()` works
- Token must have required permissions
### Method 2: Explicit Token (Not Recommended)
```python
hf_jobs("uv", {
"script": "your_script.py",
"secrets": {"HF_TOKEN": "hf_abc123..."} # ⚠️ Hardcoded token
})
```
**When to use:**
- Only if automatic token doesn't work
- Testing with a specific token
- Organization tokens (use with caution)
**Security concerns:**
- ❌ Token visible in code/logs
- ❌ Must manually update if token rotates
- ❌ Risk of token exposure
- ❌ Not recommended for production
### Method 3: Environment Variable (Less Secure)
```python
hf_jobs("uv", {
"script": "your_script.py",
"env": {"HF_TOKEN": "hf_abc123..."} # ⚠️ Less secure than secrets
})
```
**Difference from secrets:**
- `env` variables are visible in job logs
- `secrets` are encrypted server-side
- Always prefer `secrets` for tokens
**When to use:**
- Only for non-sensitive configuration
- Never use for tokens (use `secrets` instead)
## Using Tokens in Scripts
### Accessing Tokens
Tokens passed via `secrets` are available as environment variables in your script:
```python
import os
# Get token from environment
token = os.environ.get("HF_TOKEN")
# Verify token exists
if not token:
raise ValueError("HF_TOKEN not found in environment!")
```
### Using with Hugging Face Hub
**Option 1: Explicit token parameter**
```python
from huggingface_hub import HfApi
api = HfApi(token=os.environ.get("HF_TOKEN"))
api.upload_file(...)
```
**Option 2: Auto-detection (Recommended)**
```python
from huggingface_hub import HfApi
# Automatically uses HF_TOKEN env var
api = HfApi() # ✅ Simpler, uses token from environment
api.upload_file(...)
```
**Option 3: With transformers/datasets**
```python
from transformers import AutoModel
from datasets import load_dataset
# Auto-detects HF_TOKEN from environment
model = AutoModel.from_pretrained("username/model")
dataset = load_dataset("username/dataset")
# For push operations, token is auto-detected
model.push_to_hub("username/new-model")
dataset.push_to_hub("username/new-dataset")
```
### Complete Example
```python
# /// script
# dependencies = ["huggingface-hub", "datasets"]
# ///
import os
from huggingface_hub import HfApi
from datasets import Dataset
# Verify token is available
assert "HF_TOKEN" in os.environ, "HF_TOKEN required for Hub operations!"
# Use token for Hub operations
api = HfApi() # Auto-detects HF_TOKEN
# Create and push dataset
data = {"text": ["Hello", "World"]}
dataset = Dataset.from_dict(data)
# Push to Hub (token auto-detected)
dataset.push_to_hub("username/my-dataset")
print("✅ Dataset pushed successfully!")
```
## Token Verification
### Check Authentication Locally
```python
from huggingface_hub import whoami
try:
user_info = whoami()
print(f"✅ Logged in as: {user_info['name']}")
except Exception as e:
print(f"❌ Not authenticated: {e}")
```
### Verify Token in Job
```python
import os
# Check token exists
if "HF_TOKEN" not in os.environ:
raise ValueError("HF_TOKEN not found in environment!")
token = os.environ["HF_TOKEN"]
# Verify token format (should start with "hf_")
if not token.startswith("hf_"):
raise ValueError(f"Invalid token format: {token[:10]}...")
# Test token works
from huggingface_hub import whoami
try:
user_info = whoami(token=token)
print(f"✅ Token valid for user: {user_info['name']}")
except Exception as e:
raise ValueError(f"Token validation failed: {e}")
```
## Common Token Issues
### Error: 401 Unauthorized
**Symptoms:**
```
401 Client Error: Unauthorized for url: https://huggingface.co/api/...
```
**Causes:**
1. Token missing from job
2. Token invalid or expired
3. Token not passed correctly
**Solutions:**
1. Add `secrets={"HF_TOKEN": "$HF_TOKEN"}` to job config
2. Verify `hf_whoami()` works locally
3. Re-login: `hf auth login`
4. Check token hasn't expired
**Verification:**
```python
# In your script
import os
assert "HF_TOKEN" in os.environ, "HF_TOKEN missing!"
```
### Error: 403 Forbidden
**Symptoms:**
```
403 Client Error: Forbidden for url: https://huggingface.co/api/...
```
**Causes:**
1. Token lacks required permissions (read-only token used for write)
2. No access to private repository
3. Organization permissions insufficient
**Solutions:**
1. Ensure token has write permissions
2. Check token type at https://huggingface.co/settings/tokens
3. Verify access to target repository
4. Use organization token if needed
**Check token permissions:**
```python
from huggingface_hub import whoami
user_info = whoami()
print(f"User: {user_info['name']}")
print(f"Type: {user_info.get('type', 'user')}")
```
### Error: Token not found in environment
**Symptoms:**
```
KeyError: 'HF_TOKEN'
ValueError: HF_TOKEN not found
```
**Causes:**
1. `secrets` not passed in job config
2. Wrong key name (should be `HF_TOKEN`)
3. Using `env` instead of `secrets`
**Solutions:**
1. Use `secrets={"HF_TOKEN": "$HF_TOKEN"}` (not `env`)
2. Verify key name is exactly `HF_TOKEN`
3. Check job config syntax
**Correct configuration:**
```python
# ✅ Correct
hf_jobs("uv", {
"script": "...",
"secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
# ❌ Wrong - using env instead of secrets
hf_jobs("uv", {
"script": "...",
"env": {"HF_TOKEN": "$HF_TOKEN"} # Less secure
})
# ❌ Wrong - wrong key name
hf_jobs("uv", {
"script": "...",
"secrets": {"TOKEN": "$HF_TOKEN"} # Wrong key
})
```
### Error: Repository access denied
**Symptoms:**
```
403 Client Error: Forbidden
Repository not found or access denied
```
**Causes:**
1. Token doesn't have access to private repo
2. Repository doesn't exist and can't be created
3. Wrong namespace
**Solutions:**
1. Use token from account with access
2. Verify repo visibility (public vs private)
3. Check namespace matches token owner
4. Create repo first if needed
**Check repository access:**
```python
from huggingface_hub import HfApi
api = HfApi()
try:
repo_info = api.repo_info("username/repo-name")
print(f"✅ Access granted: {repo_info.id}")
except Exception as e:
print(f"❌ Access denied: {e}")
```
## Token Security Best Practices
### 1. Never Commit Tokens
**❌ Bad:**
```python
# Never do this!
token = "hf_abc123xyz..."
api = HfApi(token=token)
```
**✅ Good:**
```python
# Use environment variable
token = os.environ.get("HF_TOKEN")
api = HfApi(token=token)
```
### 2. Use Secrets, Not Environment Variables
**❌ Bad:**
```python
hf_jobs("uv", {
"script": "...",
"env": {"HF_TOKEN": "$HF_TOKEN"} # Visible in logs
})
```
**✅ Good:**
```python
hf_jobs("uv", {
"script": "...",
"secrets": {"HF_TOKEN": "$HF_TOKEN"} # Encrypted server-side
})
```
### 3. Use Automatic Token Replacement
**❌ Bad:**
```python
hf_jobs("uv", {
"script": "...",
"secrets": {"HF_TOKEN": "hf_abc123..."} # Hardcoded
})
```
**✅ Good:**
```python
hf_jobs("uv", {
"script": "...",
"secrets": {"HF_TOKEN": "$HF_TOKEN"} # Automatic
})
```
### 4. Rotate Tokens Regularly
- Generate new tokens periodically
- Revoke old tokens
- Update job configurations
- Monitor token usage
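When rotating, it helps to confirm the new token authenticates before revoking the old one. A sketch (the `token_is_valid` helper name is illustrative; `check` defaults to `huggingface_hub.whoami` and is injectable for testing):

```python
def token_is_valid(token: str, check=None) -> bool:
    """Return True if the token authenticates against the Hub."""
    if check is None:
        from huggingface_hub import whoami
        check = whoami
    try:
        check(token=token)
        return True
    except Exception:
        return False

# Rotate safely: verify the new token works before revoking the old one.
# assert token_is_valid(new_token), "New token invalid - keep the old one!"
```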
### 5. Use Minimal Permissions
- Create tokens with only needed permissions
- Use read tokens when write isn't needed
- Don't use admin tokens for regular jobs
### 6. Don't Share Tokens
- Each user should use their own token
- Don't commit tokens to repositories
- Don't share tokens in logs or messages
### 7. Monitor Token Usage
- Check token activity in Hub settings
- Review job logs for token issues
- Set up alerts for unauthorized access
## Token Workflow Examples
### Example 1: Push Model to Hub
```python
hf_jobs("uv", {
"script": """
# /// script
# dependencies = ["transformers"]
# ///
import os
from transformers import AutoModel, AutoTokenizer
# Verify token
assert "HF_TOKEN" in os.environ, "HF_TOKEN required!"
# Load and process model
model = AutoModel.from_pretrained("base-model")
# ... process model ...
# Push to Hub (token auto-detected)
model.push_to_hub("username/my-model")
print("✅ Model pushed!")
""",
"flavor": "a10g-large",
"timeout": "2h",
"secrets": {"HF_TOKEN": "$HF_TOKEN"} # ✅ Token provided
})
```
### Example 2: Access Private Dataset
```python
hf_jobs("uv", {
"script": """
# /// script
# dependencies = ["datasets"]
# ///
import os
from datasets import load_dataset
# Verify token
assert "HF_TOKEN" in os.environ, "HF_TOKEN required!"
# Load private dataset (token auto-detected)
dataset = load_dataset("private-org/private-dataset")
print(f"✅ Loaded {len(dataset)} examples")
""",
"flavor": "cpu-basic",
"timeout": "30m",
"secrets": {"HF_TOKEN": "$HF_TOKEN"} # ✅ Token provided
})
```
### Example 3: Create and Push Dataset
```python
hf_jobs("uv", {
"script": """
# /// script
# dependencies = ["datasets", "huggingface-hub"]
# ///
import os
from datasets import Dataset
from huggingface_hub import HfApi
# Verify token
assert "HF_TOKEN" in os.environ, "HF_TOKEN required!"
# Create dataset
data = {"text": ["Sample 1", "Sample 2"]}
dataset = Dataset.from_dict(data)
# Push to Hub
api = HfApi() # Auto-detects HF_TOKEN
dataset.push_to_hub("username/my-dataset")
print("✅ Dataset pushed!")
""",
"flavor": "cpu-basic",
"timeout": "30m",
"secrets": {"HF_TOKEN": "$HF_TOKEN"} # ✅ Token provided
})
```
## Quick Reference
### Token Checklist
Before submitting a job that uses Hub:
- [ ] Job includes `secrets={"HF_TOKEN": "$HF_TOKEN"}`
- [ ] Script checks for token: `assert "HF_TOKEN" in os.environ`
- [ ] Token has required permissions (read/write)
- [ ] User is logged in: `hf_whoami()` works
- [ ] Token not hardcoded in script
- [ ] Using `secrets` not `env` for token
### Common Patterns
**Pattern 1: Auto-detect token**
```python
from huggingface_hub import HfApi
api = HfApi() # Uses HF_TOKEN from environment
```
**Pattern 2: Explicit token**
```python
import os
from huggingface_hub import HfApi
api = HfApi(token=os.environ.get("HF_TOKEN"))
```
**Pattern 3: Verify token**
```python
import os
assert "HF_TOKEN" in os.environ, "HF_TOKEN required!"
```
## Key Takeaways
1. **Always use `secrets={"HF_TOKEN": "$HF_TOKEN"}`** for Hub operations
2. **Never hardcode tokens** in scripts or job configs
3. **Verify token exists** in script before Hub operations
4. **Use auto-detection** when possible (`HfApi()` without token parameter)
5. **Check permissions** - ensure token has required access
6. **Monitor token usage** - review activity regularly
7. **Rotate tokens** - generate new tokens periodically
```