funsloth-check
Validate datasets for Unsloth fine-tuning. Use when the user wants to check a dataset, analyze tokens, calculate Chinchilla optimality, or prepare data for training.
Packaged view
This page reorganizes the original catalog entry to put fit, installability, and workflow context first. The original raw source appears below.
Install command
npx @skill-hub/cli install benchflow-ai-skillsbench-funsloth-check
Repository
Skill path: registry/terminal_bench_2.0/full_batch_reviewed/terminal_bench_2_0_count-dataset-tokens/environment/skills/funsloth-check
Best for
Primary workflow: Analyze Data & AI.
Technical facets: Full Stack, Data / AI.
Target audience: everyone.
License: Unknown.
Original source
Catalog source: SkillHub Club.
Repository owner: benchflow-ai.
This is a mirrored public skill entry. Review the repository before installing it into production workflows.
What it helps with
- Install funsloth-check into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
- Review https://github.com/benchflow-ai/SkillsBench before adding funsloth-check to shared team environments
- Use funsloth-check for development workflows
Works across
Favorites: 0.
Sub-skills: 0.
Aggregator: No.
Original source / Raw SKILL.md
---
name: funsloth-check
description: Validate datasets for Unsloth fine-tuning. Use when the user wants to check a dataset, analyze tokens, calculate Chinchilla optimality, or prepare data for training.
---
# Dataset Validation for Unsloth Fine-tuning
Validate datasets before fine-tuning with Unsloth.
## Quick Start
For automated validation, use the script:
```bash
python scripts/validate_dataset.py --dataset "dataset-id" --model llama-3.1-8b --lora-rank 16
```
## Workflow
### 1. Get Dataset Source
Ask for either a Hugging Face dataset ID (e.g., `mlabonne/FineTome-100k`) or a local path (e.g., `./data.jsonl`).
### 2. Load and Detect Format
Auto-detect format from structure. See [DATA_FORMATS.md](DATA_FORMATS.md) for details.
| Format | Detection | Key Fields |
|--------|-----------|------------|
| Raw | `text` only | `text` |
| Alpaca | `instruction` + `output` | `instruction`, `output` |
| ShareGPT | `conversations` array | `from`, `value` |
| ChatML | `messages` array | `role`, `content` |
### 3. Validate Schema
Check required fields exist. Report issues with fix suggestions.
### 4. Show Samples
Display 2-3 examples for visual verification.
### 5. Token Analysis
Report statistics: total tokens, min/max/mean/median sequence length.
Flag concerns:
- Sequences > 4096 tokens
- Sequences < 10 tokens
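The statistics and flags above can be sketched with the standard library alone, assuming per-example token counts have already been computed (the helper name `summarize_token_counts` and the thresholds are illustrative):

```python
import statistics

def summarize_token_counts(counts, max_len=4096, min_len=10):
    """Summarize per-example token counts and flag outlier sequences."""
    return {
        "total": sum(counts),
        "min": min(counts),
        "max": max(counts),
        "mean": statistics.mean(counts),
        "median": statistics.median(counts),
        "too_long": sum(1 for c in counts if c > max_len),
        "too_short": sum(1 for c in counts if c < min_len),
    }
```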
### 6. Chinchilla Analysis
Ask for target model and LoRA rank, then calculate:
| Chinchilla Fraction | Interpretation |
|--------------------|----------------|
| < 0.5x | Dataset may be too small |
| 0.5x - 2.0x | Good range |
| > 2.0x | Large dataset, may take longer |
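As a worked example of the fraction, using the same heuristic as the bundled script (LoRA trains roughly 0.5% of base parameters at rank 16, scaling linearly with rank, and Chinchilla-optimal is ~20 tokens per trainable parameter; the function name is illustrative):

```python
def chinchilla_fraction(total_tokens, base_params, lora_rank=16):
    # LoRA trainable params: ~0.5% of base at rank 16, scaling linearly with rank
    trainable = base_params * (lora_rank / 16) * 0.005
    optimal = 20 * trainable  # Chinchilla heuristic: ~20 tokens per parameter
    return total_tokens / optimal
```

For an 8B model at rank 16, trainable params come to ~40M and optimal tokens to ~800M, so a 1B-token dataset lands at a fraction of 1.25, inside the good range.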
### 7. Recommendations
Based on analysis, suggest:
- `standardize_sharegpt()` for ShareGPT data
- Sequence length adjustments
- Learning rate for small datasets
### 8. Optional: HF Upload
Offer to upload local datasets to Hub.
### 9. Handoff
Pass context to `funsloth-train`:
```yaml
dataset_id: "mlabonne/FineTome-100k"
format_type: "sharegpt"
total_tokens: 15000000
target_model: "llama-3.1-8b"
use_lora: true
lora_rank: 16
chinchilla_fraction: 1.2
```
## Bundled Resources
- [scripts/validate_dataset.py](scripts/validate_dataset.py) - Automated validation script
- [DATA_FORMATS.md](DATA_FORMATS.md) - Dataset format reference
---
## Referenced Files
> The following files are referenced in this skill and included for context.
### DATA_FORMATS.md
```markdown
# Data Format Reference
Supported dataset formats for Unsloth fine-tuning.
## Format Overview
| Format | Use Case | Key Fields |
|--------|----------|------------|
| **Raw Corpus** | Continued pretraining | `text` |
| **Alpaca** | Single-turn instruction | `instruction`, `output` |
| **ShareGPT** | Multi-turn conversation | `conversations[{from, value}]` |
| **ChatML** | Native chat format | `messages[{role, content}]` |
## Raw Corpus Format
For continued pretraining on domain text.
```json
{"text": "The mitochondria is the powerhouse of the cell..."}
{"text": "In quantum mechanics, superposition refers to..."}
```
### Usage
```python
# Just load and use directly
dataset = load_dataset("json", data_files="corpus.jsonl", split="train")
def format_fn(example):
    return {"text": example["text"]}
```
## Alpaca Format
Most common for instruction tuning.
```json
{
  "instruction": "Summarize the following text.",
  "input": "Long article text here...",
  "output": "Brief summary of the article."
}
```
### Variations
```json
// Without input
{
  "instruction": "What is the capital of France?",
  "output": "The capital of France is Paris."
}

// With system prompt
{
  "instruction": "Translate to Spanish",
  "input": "Hello, how are you?",
  "output": "Hola, ¿cómo estás?",
  "system": "You are a helpful translator."
}
```
### Conversion to Chat
```python
def alpaca_to_chat(example):
    messages = []
    if example.get("system"):
        messages.append({"role": "system", "content": example["system"]})
    user_content = example["instruction"]
    if example.get("input"):
        user_content += f"\n\n{example['input']}"
    messages.append({"role": "user", "content": user_content})
    messages.append({"role": "assistant", "content": example["output"]})
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}
```
## ShareGPT Format
Multi-turn conversations, common from ChatGPT exports.
```json
{
  "conversations": [
    {"from": "human", "value": "What is Python?"},
    {"from": "gpt", "value": "Python is a programming language..."},
    {"from": "human", "value": "How do I install it?"},
    {"from": "gpt", "value": "You can download Python from..."}
  ]
}
```
### Role Mappings
```
"human" → user
"gpt" → assistant
"system" → system
```
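A minimal sketch of applying these mappings to a single turn (the helper name is illustrative):

```python
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}

def sharegpt_turn_to_message(turn):
    """Convert one ShareGPT turn into a ChatML-style message."""
    return {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
```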
### Important: Standardize First
```python
from unsloth.chat_templates import standardize_sharegpt
# This converts ShareGPT to ChatML internally
dataset = standardize_sharegpt(dataset)
```
### Variations
```json
// With system message
{
  "conversations": [
    {"from": "system", "value": "You are a helpful assistant."},
    {"from": "human", "value": "Hello!"},
    {"from": "gpt", "value": "Hi there! How can I help?"}
  ]
}

// Alternative field names (less common)
{
  "conversations": [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi!"}
  ]
}
```
## ChatML Format
Native chat format, used by many models.
```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the weather like?"},
    {"role": "assistant", "content": "I don't have access to weather data..."}
  ]
}
```
### Valid Roles
- `system` - System instructions (optional, usually first)
- `user` - Human input
- `assistant` - Model response
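A quick role check against this set can be sketched as (the helper name is illustrative):

```python
VALID_ROLES = {"system", "user", "assistant"}

def invalid_roles(messages):
    """Return any roles in a ChatML message list outside the valid set."""
    return [m["role"] for m in messages if m["role"] not in VALID_ROLES]
```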
### Usage
```python
def chatml_to_text(example):
    return {"text": tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False
    )}
```
## DPO Format
For preference optimization (chosen vs rejected).
```json
{
  "prompt": "Write a haiku about coding.",
  "chosen": "Fingers on keyboard\nLogic flows like mountain streams\nBugs become features",
  "rejected": "coding is fun\ni like to code\nthe end"
}
```
### Alternative Structure
```json
{
  "prompt": "Explain quantum computing.",
  "chosen_messages": [
    {"role": "user", "content": "Explain quantum computing."},
    {"role": "assistant", "content": "Quantum computing harnesses..."}
  ],
  "rejected_messages": [
    {"role": "user", "content": "Explain quantum computing."},
    {"role": "assistant", "content": "its computers but quantum"}
  ]
}
```
## GRPO/RL Format
For reinforcement learning with rewards.
```json
{
  "prompt": "What is 15 + 27?",
  "ground_truth": "42"
}
```
The reward function compares model output to ground_truth.
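The skill does not prescribe a particular reward function; a toy exact-match version might look like this (illustrative only):

```python
def exact_match_reward(completion: str, ground_truth: str) -> float:
    """Return 1.0 if the ground-truth answer appears in the completion, else 0.0."""
    return 1.0 if ground_truth.strip() in completion else 0.0
```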
## Format Detection
```python
def detect_format(sample):
    if "conversations" in sample:
        conv = sample["conversations"]
        if isinstance(conv, list) and len(conv) > 0:
            if "from" in conv[0]:
                return "sharegpt"
    if "messages" in sample:
        msgs = sample["messages"]
        if isinstance(msgs, list) and len(msgs) > 0:
            if "role" in msgs[0]:
                return "chatml"
    if "instruction" in sample and "output" in sample:
        return "alpaca"
    if "text" in sample:
        return "raw"
    return "unknown"
```
## Common Transformations
### ShareGPT → ChatML
```python
from unsloth.chat_templates import standardize_sharegpt
dataset = standardize_sharegpt(dataset)
```
### Alpaca → ChatML
```python
def alpaca_to_chatml(example):
    messages = []
    user = example["instruction"]
    if example.get("input"):
        user += f"\n\n{example['input']}"
    messages.append({"role": "user", "content": user})
    messages.append({"role": "assistant", "content": example["output"]})
    return {"messages": messages}

dataset = dataset.map(alpaca_to_chatml)
```
### Filter Empty Examples
```python
def is_valid(example):
    if "text" in example:
        return len(example["text"].strip()) > 10
    if "messages" in example:
        return len(example["messages"]) >= 2
    return True

dataset = dataset.filter(is_valid)
```
## Dataset Sources
### Hugging Face Hub
```python
from datasets import load_dataset
# Full dataset
dataset = load_dataset("mlabonne/FineTome-100k", split="train")
# Subset
dataset = load_dataset("mlabonne/FineTome-100k", split="train[:1000]")
```
### Local Files
```python
# JSONL
dataset = load_dataset("json", data_files="data.jsonl", split="train")
# JSON
dataset = load_dataset("json", data_files="data.json", split="train")
# CSV
dataset = load_dataset("csv", data_files="data.csv", split="train")
# Parquet
dataset = load_dataset("parquet", data_files="data.parquet", split="train")
```
### Multiple Files
```python
dataset = load_dataset("json", data_files={
    "train": ["train1.jsonl", "train2.jsonl"],
    "test": "test.jsonl"
})
```
## Quality Checks
### Verify Format
```python
import json

sample = dataset[0]
print(f"Keys: {sample.keys()}")
print(f"Format: {detect_format(sample)}")
print(f"Sample:\n{json.dumps(sample, indent=2)[:500]}")
```
### Check for Issues
```python
issues = []
for i, ex in enumerate(dataset):
    # Empty content
    if not get_text_content(ex):
        issues.append(f"Empty at {i}")
    # Mismatched format
    if detect_format(ex) != expected_format:
        issues.append(f"Format mismatch at {i}")
print(f"Found {len(issues)} issues")
```
### scripts/validate_dataset.py
```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
# "datasets>=2.18",
# "transformers>=4.45",
# "huggingface_hub>=0.20",
# ]
# ///
"""
Dataset Validation Script for Fine-tuning
Validates dataset format, calculates statistics, and checks Chinchilla optimality.
Usage:
python validate_dataset.py --dataset "mlabonne/FineTome-100k"
python validate_dataset.py --file "./data/train.jsonl"
"""
import argparse
import json
import statistics
from datasets import load_dataset
from transformers import AutoTokenizer
from huggingface_hub import HfApi
# ============================================
# FORMAT DETECTION
# ============================================
def detect_format(sample: dict) -> str:
    """Detect dataset format from a sample."""
    if "conversations" in sample:
        conv = sample["conversations"]
        if isinstance(conv, list) and len(conv) > 0:
            if "from" in conv[0] and "value" in conv[0]:
                return "sharegpt"
    if "messages" in sample:
        msgs = sample["messages"]
        if isinstance(msgs, list) and len(msgs) > 0:
            if "role" in msgs[0] and "content" in msgs[0]:
                return "chatml"
    if "instruction" in sample and "output" in sample:
        return "alpaca"
    if "text" in sample and len(sample) <= 2:  # text + maybe id
        return "raw"
    return "unknown"
def validate_schema(sample: dict, format_type: str) -> list[str]:
    """Validate that sample has required fields."""
    issues = []
    if format_type == "alpaca":
        if "instruction" not in sample:
            issues.append("Missing 'instruction' field")
        if "output" not in sample:
            issues.append("Missing 'output' field")
    elif format_type == "sharegpt":
        if "conversations" not in sample:
            issues.append("Missing 'conversations' field")
        else:
            conv = sample["conversations"]
            if not isinstance(conv, list):
                issues.append("'conversations' should be a list")
            elif len(conv) == 0:
                issues.append("'conversations' is empty")
            else:
                for i, turn in enumerate(conv):
                    if "from" not in turn:
                        issues.append(f"Turn {i} missing 'from' key")
                    if "value" not in turn:
                        issues.append(f"Turn {i} missing 'value' key")
    elif format_type == "chatml":
        if "messages" not in sample:
            issues.append("Missing 'messages' field")
        else:
            msgs = sample["messages"]
            if not isinstance(msgs, list):
                issues.append("'messages' should be a list")
            elif len(msgs) == 0:
                issues.append("'messages' is empty")
            else:
                for i, msg in enumerate(msgs):
                    if "role" not in msg:
                        issues.append(f"Message {i} missing 'role' key")
                    if "content" not in msg:
                        issues.append(f"Message {i} missing 'content' key")
    elif format_type == "raw":
        if "text" not in sample:
            issues.append("Missing 'text' field")
        elif not sample["text"]:
            issues.append("'text' field is empty")
    return issues
# ============================================
# TOKEN ANALYSIS
# ============================================
def get_text_content(sample: dict, format_type: str) -> str:
    """Extract all text from a sample."""
    if format_type == "alpaca":
        parts = [sample.get("instruction", "")]
        if sample.get("input"):
            parts.append(sample["input"])
        parts.append(sample.get("output", ""))
        return " ".join(parts)
    elif format_type == "sharegpt":
        return " ".join(
            turn.get("value", "")
            for turn in sample.get("conversations", [])
        )
    elif format_type == "chatml":
        return " ".join(
            msg.get("content", "")
            for msg in sample.get("messages", [])
        )
    else:  # raw
        return sample.get("text", "")

def count_tokens(text: str, tokenizer) -> int:
    """Count tokens in text."""
    return len(tokenizer.encode(text, add_special_tokens=False))
# ============================================
# CHINCHILLA ANALYSIS
# ============================================
MODEL_PARAMS = {
    "llama-3.1-8b": 8e9,
    "llama-3.1-70b": 70e9,
    "qwen-2.5-7b": 7e9,
    "qwen-2.5-14b": 14e9,
    "qwen-2.5-32b": 32e9,
    "gemma-2-9b": 9e9,
    "gemma-2-27b": 27e9,
    "phi-4-14b": 14e9,
    "mistral-7b": 7e9,
}
def calculate_chinchilla(
    total_tokens: int,
    model_key: str,
    use_lora: bool = True,
    lora_rank: int = 16,
) -> dict:
    """Calculate Chinchilla optimality metrics."""
    base_params = MODEL_PARAMS.get(model_key, 8e9)
    if use_lora:
        # Approximate trainable params for LoRA:
        # ~0.5% of base at rank 16, scales linearly with rank
        lora_fraction = (lora_rank / 16) * 0.005
        trainable_params = base_params * lora_fraction
    else:
        trainable_params = base_params
    optimal_tokens = 20 * trainable_params
    chinchilla_fraction = total_tokens / optimal_tokens if optimal_tokens > 0 else 0
    return {
        "base_params": base_params,
        "trainable_params": trainable_params,
        "optimal_tokens": optimal_tokens,
        "actual_tokens": total_tokens,
        "chinchilla_fraction": chinchilla_fraction,
    }
def interpret_chinchilla(fraction: float) -> str:
    """Interpret Chinchilla fraction."""
    if fraction < 0.5:
        return "Dataset may be too small for optimal training. Consider data augmentation or a smaller model."
    elif fraction <= 2.0:
        return "Good range for fine-tuning. Dataset size is reasonable for this model."
    else:
        return "Large dataset relative to trainable params. Training may take longer but could yield good results."
# ============================================
# MAIN
# ============================================
def main():
    parser = argparse.ArgumentParser(description="Validate dataset for fine-tuning")
    parser.add_argument("--dataset", help="Hugging Face dataset ID")
    parser.add_argument("--file", help="Local file path (jsonl/json)")
    parser.add_argument("--split", default="train", help="Dataset split")
    parser.add_argument("--model", default="llama-3.1-8b",
                        choices=list(MODEL_PARAMS.keys()),
                        help="Target model for Chinchilla analysis")
    parser.add_argument("--lora-rank", type=int, default=16,
                        help="LoRA rank (0 for full fine-tuning)")
    parser.add_argument("--tokenizer", default="unsloth/llama-3.1-8b-unsloth-bnb-4bit",
                        help="Tokenizer for token counting")
    parser.add_argument("--max-samples", type=int, default=10000,
                        help="Max samples for token analysis")
    args = parser.parse_args()
    if not args.dataset and not args.file:
        parser.error("Must specify --dataset or --file")
    print("=" * 60)
    print("Dataset Validation Report")
    print("=" * 60)

    # Load dataset
    print("\n1. Loading Dataset")
    print("-" * 40)
    if args.dataset:
        print(f"Source: Hugging Face - {args.dataset}")
        dataset = load_dataset(args.dataset, split=args.split)
    else:
        print(f"Source: Local file - {args.file}")
        # The "json" builder handles both .json and .jsonl files
        dataset = load_dataset("json", data_files=args.file, split="train")
    print(f"Total rows: {len(dataset):,}")
    print(f"Columns: {', '.join(dataset.column_names)}")
    # Detect format
    print("\n2. Format Detection")
    print("-" * 40)
    sample = dataset[0]
    format_type = detect_format(sample)
    print(f"Detected format: {format_type.upper()}")
    if format_type == "unknown":
        print("WARNING: Could not detect format. Check your data structure.")
        print(f"Sample keys: {list(sample.keys())}")
    # Validate schema
    print("\n3. Schema Validation")
    print("-" * 40)
    issues = validate_schema(sample, format_type)
    if issues:
        print("ISSUES FOUND:")
        for issue in issues:
            print(f"  - {issue}")
    else:
        print("Schema valid!")

    # Check multiple samples for consistency
    validation_count = min(100, len(dataset))
    inconsistent = 0
    for i in range(validation_count):
        sample_format = detect_format(dataset[i])
        if sample_format != format_type:
            inconsistent += 1
    if inconsistent > 0:
        print(f"WARNING: {inconsistent}/{validation_count} samples have inconsistent format")
    # Show samples
    print("\n4. Sample Data")
    print("-" * 40)
    for i in range(min(2, len(dataset))):
        print(f"\nSample {i+1}:")
        sample = dataset[i]
        sample_json = json.dumps(sample, indent=2, ensure_ascii=False)
        if len(sample_json) > 500:
            sample_json = sample_json[:500] + "\n... (truncated)"
        print(sample_json)
    # Token analysis
    print("\n5. Token Analysis")
    print("-" * 40)
    print(f"Loading tokenizer: {args.tokenizer}")
    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer)
    analysis_count = min(args.max_samples, len(dataset))
    print(f"Analyzing {analysis_count:,} samples...")
    token_counts = []
    for i in range(analysis_count):
        text = get_text_content(dataset[i], format_type)
        tokens = count_tokens(text, tokenizer)
        token_counts.append(tokens)

    # Extrapolate if we sampled
    if analysis_count < len(dataset):
        avg_tokens = statistics.mean(token_counts)
        total_tokens = int(avg_tokens * len(dataset))
        print(f"(Extrapolated from {analysis_count:,} samples)")
    else:
        total_tokens = sum(token_counts)

    print(f"\n{'Metric':<25} {'Value':>15}")
    print("-" * 42)
    print(f"{'Total tokens':<25} {total_tokens:>15,}")
    print(f"{'Min sequence':<25} {min(token_counts):>15,}")
    print(f"{'Max sequence':<25} {max(token_counts):>15,}")
    print(f"{'Mean sequence':<25} {statistics.mean(token_counts):>15,.1f}")
    print(f"{'Median sequence':<25} {statistics.median(token_counts):>15,.1f}")
    if len(token_counts) > 1:  # stdev requires at least two data points
        print(f"{'Std deviation':<25} {statistics.stdev(token_counts):>15,.1f}")

    # Flag concerns
    long_seqs = sum(1 for t in token_counts if t > 4096)
    short_seqs = sum(1 for t in token_counts if t < 10)
    if long_seqs > 0:
        pct = 100 * long_seqs / len(token_counts)
        print(f"\nWARNING: {long_seqs} sequences ({pct:.1f}%) exceed 4096 tokens")
    if short_seqs > 0:
        pct = 100 * short_seqs / len(token_counts)
        print(f"WARNING: {short_seqs} sequences ({pct:.1f}%) are very short (<10 tokens)")
    # Chinchilla analysis
    print("\n6. Chinchilla Optimality Analysis")
    print("-" * 40)
    use_lora = args.lora_rank > 0
    chinchilla = calculate_chinchilla(
        total_tokens=total_tokens,
        model_key=args.model,
        use_lora=use_lora,
        lora_rank=args.lora_rank,
    )
    print(f"Target model: {args.model}")
    print(f"Training mode: {'LoRA (rank ' + str(args.lora_rank) + ')' if use_lora else 'Full fine-tuning'}")
    print(f"\n{'Metric':<25} {'Value':>20}")
    print("-" * 47)
    print(f"{'Base parameters':<25} {chinchilla['base_params']:>20,.0f}")
    print(f"{'Trainable parameters':<25} {chinchilla['trainable_params']:>20,.0f}")
    print(f"{'Optimal tokens':<25} {chinchilla['optimal_tokens']:>20,.0f}")
    print(f"{'Your tokens':<25} {chinchilla['actual_tokens']:>20,}")
    print(f"{'Chinchilla fraction':<25} {chinchilla['chinchilla_fraction']:>20.2f}x")
    print(f"\nInterpretation: {interpret_chinchilla(chinchilla['chinchilla_fraction'])}")

    # Recommendations
    print("\n7. Recommendations")
    print("-" * 40)
    if format_type == "sharegpt":
        print("- Use standardize_sharegpt() before training to convert to ChatML")
    if statistics.mean(token_counts) > 2048:
        print("- Consider max_seq_length=4096 or higher")
    if len(dataset) < 100:
        print("- Small dataset: use low learning rate (1e-5) and few epochs (1-3)")
    if chinchilla['chinchilla_fraction'] < 0.5:
        print("- Consider data augmentation or using a smaller model")

    print("\n" + "=" * 60)
    print("Validation Complete")
    print("=" * 60)

if __name__ == "__main__":
    main()
```