funsloth-check
Validate datasets for Unsloth fine-tuning. Use when the user wants to check a dataset, analyze tokens, calculate Chinchilla optimality, or prepare data for training.
Packaged view
This page reorganizes the original catalog entry to put fit, installability, and workflow context first. The original raw source appears below.
Install command
npx @skill-hub/cli install benchflow-ai-skillsbench-funsloth-check
Repository
Skill path: registry/terminal_bench_2.0/full_batch_reviewed/terminal_bench_2_0_count-dataset-tokens/environment/skills/funsloth-check
Best for
Primary workflow: Analyze Data & AI.
Technical facets: Full Stack, Data / AI.
Target audience: everyone.
License: Unknown.
Original source
Catalog source: SkillHub Club.
Repository owner: benchflow-ai.
This is a mirrored public skill entry. Review the repository before installing it into production workflows.
What it helps with
- Install funsloth-check into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
- Review https://github.com/benchflow-ai/SkillsBench before adding funsloth-check to shared team environments
- Use funsloth-check for development workflows
Works across
Favorites: 0.
Sub-skills: 0.
Aggregator: No.
Original source / Raw SKILL.md
---
name: funsloth-check
description: Validate datasets for Unsloth fine-tuning. Use when the user wants to check a dataset, analyze tokens, calculate Chinchilla optimality, or prepare data for training.
---
# Dataset Validation for Unsloth Fine-tuning
Validate datasets before fine-tuning with Unsloth.
## Quick Start
For automated validation, use the script:
```bash
python scripts/validate_dataset.py --dataset "dataset-id" --model llama-3.1-8b --lora-rank 16
```
## Workflow
### 1. Get Dataset Source
Ask for either a Hugging Face dataset ID (e.g., `mlabonne/FineTome-100k`) or a local path (e.g., `./data.jsonl`).
### 2. Load and Detect Format
Auto-detect format from structure. See [DATA_FORMATS.md](DATA_FORMATS.md) for details.
| Format | Detection | Key Fields |
|--------|-----------|------------|
| Raw | `text` only | `text` |
| Alpaca | `instruction` + `output` | `instruction`, `output` |
| ShareGPT | `conversations` array | `from`, `value` |
| ChatML | `messages` array | `role`, `content` |
### 3. Validate Schema
Check required fields exist. Report issues with fix suggestions.
### 4. Show Samples
Display 2-3 examples for visual verification.
### 5. Token Analysis
Report statistics: total tokens, min/max/mean/median sequence length.
Flag concerns:
- Sequences > 4096 tokens
- Sequences < 10 tokens
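The statistics and flags above can be sketched with the standard library alone, assuming per-example token counts have already been computed (the helper name `summarize_token_counts` and the thresholds are illustrative):

```python
import statistics

def summarize_token_counts(counts, max_len=4096, min_len=10):
    """Summarize per-example token counts and flag outlier sequences."""
    return {
        "total": sum(counts),
        "min": min(counts),
        "max": max(counts),
        "mean": statistics.mean(counts),
        "median": statistics.median(counts),
        "too_long": sum(1 for c in counts if c > max_len),
        "too_short": sum(1 for c in counts if c < min_len),
    }
```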
### 6. Chinchilla Analysis
Ask for target model and LoRA rank, then calculate:
| Chinchilla Fraction | Interpretation |
|--------------------|----------------|
| < 0.5x | Dataset may be too small |
| 0.5x - 2.0x | Good range |
| > 2.0x | Large dataset, may take longer |
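As a worked example of the fraction, using the same heuristic as the bundled script (LoRA trains roughly 0.5% of base parameters at rank 16, scaling linearly with rank, and Chinchilla-optimal is ~20 tokens per trainable parameter; the function name is illustrative):

```python
def chinchilla_fraction(total_tokens, base_params, lora_rank=16):
    # LoRA trainable params: ~0.5% of base at rank 16, scaling linearly with rank
    trainable = base_params * (lora_rank / 16) * 0.005
    optimal = 20 * trainable  # Chinchilla heuristic: ~20 tokens per parameter
    return total_tokens / optimal
```

For an 8B model at rank 16, trainable params come to ~40M and optimal tokens to ~800M, so a 1B-token dataset lands at a fraction of 1.25, inside the good range.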
### 7. Recommendations
Based on analysis, suggest:
- `standardize_sharegpt()` for ShareGPT data
- Sequence length adjustments
- Learning rate for small datasets
### 8. Optional: HF Upload
Offer to upload local datasets to Hub.
### 9. Handoff
Pass context to `funsloth-train`:
```yaml
dataset_id: "mlabonne/FineTome-100k"
format_type: "sharegpt"
total_tokens: 15000000
target_model: "llama-3.1-8b"
use_lora: true
lora_rank: 16
chinchilla_fraction: 1.2
```
## Bundled Resources
- [scripts/validate_dataset.py](scripts/validate_dataset.py) - Automated validation script
- [DATA_FORMATS.md](DATA_FORMATS.md) - Dataset format reference
---
## Referenced Files
> The following files are referenced in this skill and included for context.
### DATA_FORMATS.md
```markdown
# Data Format Reference
Supported dataset formats for Unsloth fine-tuning.
## Format Overview
| Format | Use Case | Key Fields |
|--------|----------|------------|
| **Raw Corpus** | Continued pretraining | `text` |
| **Alpaca** | Single-turn instruction | `instruction`, `output` |
| **ShareGPT** | Multi-turn conversation | `conversations[{from, value}]` |
| **ChatML** | Native chat format | `messages[{role, content}]` |
## Raw Corpus Format
For continued pretraining on domain text.
```json
{"text": "The mitochondria is the powerhouse of the cell..."}
{"text": "In quantum mechanics, superposition refers to..."}
```
### Usage
```python
# Just load and use directly
dataset = load_dataset("json", data_files="corpus.jsonl", split="train")
def format_fn(example):
    return {"text": example["text"]}
```
## Alpaca Format
Most common for instruction tuning.
```json
{
  "instruction": "Summarize the following text.",
  "input": "Long article text here...",
  "output": "Brief summary of the article."
}
```
### Variations
```json
// Without input
{
  "instruction": "What is the capital of France?",
  "output": "The capital of France is Paris."
}

// With system prompt
{
  "instruction": "Translate to Spanish",
  "input": "Hello, how are you?",
  "output": "Hola, ¿cómo estás?",
  "system": "You are a helpful translator."
}
```
### Conversion to Chat
```python
def alpaca_to_chat(example):
    messages = []
    if example.get("system"):
        messages.append({"role": "system", "content": example["system"]})
    user_content = example["instruction"]
    if example.get("input"):
        user_content += f"\n\n{example['input']}"
    messages.append({"role": "user", "content": user_content})
    messages.append({"role": "assistant", "content": example["output"]})
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}
```
## ShareGPT Format
Multi-turn conversations, common from ChatGPT exports.
```json
{
  "conversations": [
    {"from": "human", "value": "What is Python?"},
    {"from": "gpt", "value": "Python is a programming language..."},
    {"from": "human", "value": "How do I install it?"},
    {"from": "gpt", "value": "You can download Python from..."}
  ]
}
```
### Role Mappings
```
"human" → user
"gpt" → assistant
"system" → system
```
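A minimal sketch of applying these mappings to a single turn (the helper name is illustrative):

```python
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}

def sharegpt_turn_to_message(turn):
    """Convert one ShareGPT turn into a ChatML-style message."""
    return {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
```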
### Important: Standardize First
```python
from unsloth.chat_templates import standardize_sharegpt
# This converts ShareGPT to ChatML internally
dataset = standardize_sharegpt(dataset)
```
### Variations
```json
// With system message
{
  "conversations": [
    {"from": "system", "value": "You are a helpful assistant."},
    {"from": "human", "value": "Hello!"},
    {"from": "gpt", "value": "Hi there! How can I help?"}
  ]
}

// Alternative field names (less common)
{
  "conversations": [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi!"}
  ]
}
```
## ChatML Format
Native chat format, used by many models.
```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the weather like?"},
    {"role": "assistant", "content": "I don't have access to weather data..."}
  ]
}
```
### Valid Roles
- `system` - System instructions (optional, usually first)
- `user` - Human input
- `assistant` - Model response
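A quick role check against this set can be sketched as (the helper name is illustrative):

```python
VALID_ROLES = {"system", "user", "assistant"}

def invalid_roles(messages):
    """Return any roles in a ChatML message list outside the valid set."""
    return [m["role"] for m in messages if m["role"] not in VALID_ROLES]
```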
### Usage
```python
def chatml_to_text(example):
    return {"text": tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False
    )}
```
## DPO Format
For preference optimization (chosen vs rejected).
```json
{
  "prompt": "Write a haiku about coding.",
  "chosen": "Fingers on keyboard\nLogic flows like mountain streams\nBugs become features",
  "rejected": "coding is fun\ni like to code\nthe end"
}
```
### Alternative Structure
```json
{
  "prompt": "Explain quantum computing.",
  "chosen_messages": [
    {"role": "user", "content": "Explain quantum computing."},
    {"role": "assistant", "content": "Quantum computing harnesses..."}
  ],
  "rejected_messages": [
    {"role": "user", "content": "Explain quantum computing."},
    {"role": "assistant", "content": "its computers but quantum"}
  ]
}
```
## GRPO/RL Format
For reinforcement learning with rewards.
```json
{
  "prompt": "What is 15 + 27?",
  "ground_truth": "42"
}
```
The reward function compares model output to ground_truth.
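The skill does not prescribe a particular reward function; a toy exact-match version might look like this (illustrative only):

```python
def exact_match_reward(completion: str, ground_truth: str) -> float:
    """Return 1.0 if the ground-truth answer appears in the completion, else 0.0."""
    return 1.0 if ground_truth.strip() in completion else 0.0
```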
## Format Detection
```python
def detect_format(sample):
    if "conversations" in sample:
        conv = sample["conversations"]
        if isinstance(conv, list) and len(conv) > 0:
            if "from" in conv[0]:
                return "sharegpt"
    if "messages" in sample:
        msgs = sample["messages"]
        if isinstance(msgs, list) and len(msgs) > 0:
            if "role" in msgs[0]:
                return "chatml"
    if "instruction" in sample and "output" in sample:
        return "alpaca"
    if "text" in sample:
        return "raw"
    return "unknown"
```
## Common Transformations
### ShareGPT → ChatML
```python
from unsloth.chat_templates import standardize_sharegpt
dataset = standardize_sharegpt(dataset)
```
### Alpaca → ChatML
```python
def alpaca_to_chatml(example):
    messages = []
    user = example["instruction"]
    if example.get("input"):
        user += f"\n\n{example['input']}"
    messages.append({"role": "user", "content": user})
    messages.append({"role": "assistant", "content": example["output"]})
    return {"messages": messages}

dataset = dataset.map(alpaca_to_chatml)
```
### Filter Empty Examples
```python
def is_valid(example):
    if "text" in example:
        return len(example["text"].strip()) > 10
    if "messages" in example:
        return len(example["messages"]) >= 2
    return True

dataset = dataset.filter(is_valid)
```
## Dataset Sources
### Hugging Face Hub
```python
from datasets import load_dataset
# Full dataset
dataset = load_dataset("mlabonne/FineTome-100k", split="train")
# Subset
dataset = load_dataset("mlabonne/FineTome-100k", split="train[:1000]")
```
### Local Files
```python
# JSONL
dataset = load_dataset("json", data_files="data.jsonl", split="train")
# JSON
dataset = load_dataset("json", data_files="data.json", split="train")
# CSV
dataset = load_dataset("csv", data_files="data.csv", split="train")
# Parquet
dataset = load_dataset("parquet", data_files="data.parquet", split="train")
```
### Multiple Files
```python
dataset = load_dataset("json", data_files={
    "train": ["train1.jsonl", "train2.jsonl"],
    "test": "test.jsonl"
})
```
## Quality Checks
### Verify Format
```python
import json

sample = dataset[0]
print(f"Keys: {sample.keys()}")
print(f"Format: {detect_format(sample)}")
print(f"Sample:\n{json.dumps(sample, indent=2)[:500]}")
```
### Check for Issues
```python
issues = []
for i, ex in enumerate(dataset):
    # Empty content
    if not get_text_content(ex):
        issues.append(f"Empty at {i}")
    # Mismatched format
    if detect_format(ex) != expected_format:
        issues.append(f"Format mismatch at {i}")
print(f"Found {len(issues)} issues")
```
### scripts/validate_dataset.py
```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
# "datasets>=2.18",
# "transformers>=4.45",
# "huggingface_hub>=0.20",
# ]
# ///
"""
Dataset Validation Script for Fine-tuning
Validates dataset format, calculates statistics, and checks Chinchilla optimality.
Usage:
python validate_dataset.py --dataset "mlabonne/FineTome-100k"
python validate_dataset.py --file "./data/train.jsonl"
"""
import argparse
import json
import statistics
from datasets import load_dataset
from transformers import AutoTokenizer
from huggingface_hub import HfApi
# ============================================
# FORMAT DETECTION
# ============================================
def detect_format(sample: dict) -> str:
    """Detect dataset format from a sample."""
    if "conversations" in sample:
        conv = sample["conversations"]
        if isinstance(conv, list) and len(conv) > 0:
            if "from" in conv[0] and "value" in conv[0]:
                return "sharegpt"
    if "messages" in sample:
        msgs = sample["messages"]
        if isinstance(msgs, list) and len(msgs) > 0:
            if "role" in msgs[0] and "content" in msgs[0]:
                return "chatml"
    if "instruction" in sample and "output" in sample:
        return "alpaca"
    if "text" in sample and len(sample) <= 2:  # text + maybe id
        return "raw"
    return "unknown"
def validate_schema(sample: dict, format_type: str) -> list[str]:
    """Validate that sample has required fields."""
    issues = []
    if format_type == "alpaca":
        if "instruction" not in sample:
            issues.append("Missing 'instruction' field")
        if "output" not in sample:
            issues.append("Missing 'output' field")
    elif format_type == "sharegpt":
        if "conversations" not in sample:
            issues.append("Missing 'conversations' field")
        else:
            conv = sample["conversations"]
            if not isinstance(conv, list):
                issues.append("'conversations' should be a list")
            elif len(conv) == 0:
                issues.append("'conversations' is empty")
            else:
                for i, turn in enumerate(conv):
                    if "from" not in turn:
                        issues.append(f"Turn {i} missing 'from' key")
                    if "value" not in turn:
                        issues.append(f"Turn {i} missing 'value' key")
    elif format_type == "chatml":
        if "messages" not in sample:
            issues.append("Missing 'messages' field")
        else:
            msgs = sample["messages"]
            if not isinstance(msgs, list):
                issues.append("'messages' should be a list")
            elif len(msgs) == 0:
                issues.append("'messages' is empty")
            else:
                for i, msg in enumerate(msgs):
                    if "role" not in msg:
                        issues.append(f"Message {i} missing 'role' key")
                    if "content" not in msg:
                        issues.append(f"Message {i} missing 'content' key")
    elif format_type == "raw":
        if "text" not in sample:
            issues.append("Missing 'text' field")
        elif not sample["text"]:
            issues.append("'text' field is empty")
    return issues
# ============================================
# TOKEN ANALYSIS
# ============================================
def get_text_content(sample: dict, format_type: str) -> str:
    """Extract all text from a sample."""
    if format_type == "alpaca":
        parts = [sample.get("instruction", "")]
        if sample.get("input"):
            parts.append(sample["input"])
        parts.append(sample.get("output", ""))
        return " ".join(parts)
    elif format_type == "sharegpt":
        return " ".join(
            turn.get("value", "")
            for turn in sample.get("conversations", [])
        )
    elif format_type == "chatml":
        return " ".join(
            msg.get("content", "")
            for msg in sample.get("messages", [])
        )
    else:  # raw
        return sample.get("text", "")

def count_tokens(text: str, tokenizer) -> int:
    """Count tokens in text."""
    return len(tokenizer.encode(text, add_special_tokens=False))
# ============================================
# CHINCHILLA ANALYSIS
# ============================================
MODEL_PARAMS = {
    "llama-3.1-8b": 8e9,
    "llama-3.1-70b": 70e9,
    "qwen-2.5-7b": 7e9,
    "qwen-2.5-14b": 14e9,
    "qwen-2.5-32b": 32e9,
    "gemma-2-9b": 9e9,
    "gemma-2-27b": 27e9,
    "phi-4-14b": 14e9,
    "mistral-7b": 7e9,
}
def calculate_chinchilla(
    total_tokens: int,
    model_key: str,
    use_lora: bool = True,
    lora_rank: int = 16,
) -> dict:
    """Calculate Chinchilla optimality metrics."""
    base_params = MODEL_PARAMS.get(model_key, 8e9)
    if use_lora:
        # Approximate trainable params for LoRA:
        # ~0.5% of base at rank 16, scales linearly with rank
        lora_fraction = (lora_rank / 16) * 0.005
        trainable_params = base_params * lora_fraction
    else:
        trainable_params = base_params
    optimal_tokens = 20 * trainable_params
    chinchilla_fraction = total_tokens / optimal_tokens if optimal_tokens > 0 else 0
    return {
        "base_params": base_params,
        "trainable_params": trainable_params,
        "optimal_tokens": optimal_tokens,
        "actual_tokens": total_tokens,
        "chinchilla_fraction": chinchilla_fraction,
    }
def interpret_chinchilla(fraction: float) -> str:
    """Interpret Chinchilla fraction."""
    if fraction < 0.5:
        return "Dataset may be too small for optimal training. Consider data augmentation or a smaller model."
    elif fraction <= 2.0:
        return "Good range for fine-tuning. Dataset size is reasonable for this model."
    else:
        return "Large dataset relative to trainable params. Training may take longer but could yield good results."
# ============================================
# MAIN
# ============================================
def main():
    parser = argparse.ArgumentParser(description="Validate dataset for fine-tuning")
    parser.add_argument("--dataset", help="Hugging Face dataset ID")
    parser.add_argument("--file", help="Local file path (jsonl/json)")
    parser.add_argument("--split", default="train", help="Dataset split")
    parser.add_argument("--model", default="llama-3.1-8b",
                        choices=list(MODEL_PARAMS.keys()),
                        help="Target model for Chinchilla analysis")
    parser.add_argument("--lora-rank", type=int, default=16,
                        help="LoRA rank (0 for full fine-tuning)")
    parser.add_argument("--tokenizer", default="unsloth/llama-3.1-8b-unsloth-bnb-4bit",
                        help="Tokenizer for token counting")
    parser.add_argument("--max-samples", type=int, default=10000,
                        help="Max samples for token analysis")
    args = parser.parse_args()
    if not args.dataset and not args.file:
        parser.error("Must specify --dataset or --file")
    print("=" * 60)
    print("Dataset Validation Report")
    print("=" * 60)

    # Load dataset
    print("\n1. Loading Dataset")
    print("-" * 40)
    if args.dataset:
        print(f"Source: Hugging Face - {args.dataset}")
        dataset = load_dataset(args.dataset, split=args.split)
    else:
        print(f"Source: Local file - {args.file}")
        # The "json" builder handles both .json and .jsonl files
        dataset = load_dataset("json", data_files=args.file, split="train")
    print(f"Total rows: {len(dataset):,}")
    print(f"Columns: {', '.join(dataset.column_names)}")
    # Detect format
    print("\n2. Format Detection")
    print("-" * 40)
    sample = dataset[0]
    format_type = detect_format(sample)
    print(f"Detected format: {format_type.upper()}")
    if format_type == "unknown":
        print("WARNING: Could not detect format. Check your data structure.")
        print(f"Sample keys: {list(sample.keys())}")
    # Validate schema
    print("\n3. Schema Validation")
    print("-" * 40)
    issues = validate_schema(sample, format_type)
    if issues:
        print("ISSUES FOUND:")
        for issue in issues:
            print(f"  - {issue}")
    else:
        print("Schema valid!")

    # Check multiple samples for consistency
    validation_count = min(100, len(dataset))
    inconsistent = 0
    for i in range(validation_count):
        sample_format = detect_format(dataset[i])
        if sample_format != format_type:
            inconsistent += 1
    if inconsistent > 0:
        print(f"WARNING: {inconsistent}/{validation_count} samples have inconsistent format")
    # Show samples
    print("\n4. Sample Data")
    print("-" * 40)
    for i in range(min(2, len(dataset))):
        print(f"\nSample {i+1}:")
        sample = dataset[i]
        sample_json = json.dumps(sample, indent=2, ensure_ascii=False)
        if len(sample_json) > 500:
            sample_json = sample_json[:500] + "\n... (truncated)"
        print(sample_json)
    # Token analysis
    print("\n5. Token Analysis")
    print("-" * 40)
    print(f"Loading tokenizer: {args.tokenizer}")
    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer)
    analysis_count = min(args.max_samples, len(dataset))
    print(f"Analyzing {analysis_count:,} samples...")
    token_counts = []
    for i in range(analysis_count):
        text = get_text_content(dataset[i], format_type)
        tokens = count_tokens(text, tokenizer)
        token_counts.append(tokens)

    # Extrapolate if we sampled
    if analysis_count < len(dataset):
        avg_tokens = statistics.mean(token_counts)
        total_tokens = int(avg_tokens * len(dataset))
        print(f"(Extrapolated from {analysis_count:,} samples)")
    else:
        total_tokens = sum(token_counts)

    print(f"\n{'Metric':<25} {'Value':>15}")
    print("-" * 42)
    print(f"{'Total tokens':<25} {total_tokens:>15,}")
    print(f"{'Min sequence':<25} {min(token_counts):>15,}")
    print(f"{'Max sequence':<25} {max(token_counts):>15,}")
    print(f"{'Mean sequence':<25} {statistics.mean(token_counts):>15,.1f}")
    print(f"{'Median sequence':<25} {statistics.median(token_counts):>15,.1f}")
    if len(token_counts) > 1:  # stdev requires at least two data points
        print(f"{'Std deviation':<25} {statistics.stdev(token_counts):>15,.1f}")

    # Flag concerns
    long_seqs = sum(1 for t in token_counts if t > 4096)
    short_seqs = sum(1 for t in token_counts if t < 10)
    if long_seqs > 0:
        pct = 100 * long_seqs / len(token_counts)
        print(f"\nWARNING: {long_seqs} sequences ({pct:.1f}%) exceed 4096 tokens")
    if short_seqs > 0:
        pct = 100 * short_seqs / len(token_counts)
        print(f"WARNING: {short_seqs} sequences ({pct:.1f}%) are very short (<10 tokens)")
    # Chinchilla analysis
    print("\n6. Chinchilla Optimality Analysis")
    print("-" * 40)
    use_lora = args.lora_rank > 0
    chinchilla = calculate_chinchilla(
        total_tokens=total_tokens,
        model_key=args.model,
        use_lora=use_lora,
        lora_rank=args.lora_rank,
    )
    print(f"Target model: {args.model}")
    print(f"Training mode: {'LoRA (rank ' + str(args.lora_rank) + ')' if use_lora else 'Full fine-tuning'}")
    print(f"\n{'Metric':<25} {'Value':>20}")
    print("-" * 47)
    print(f"{'Base parameters':<25} {chinchilla['base_params']:>20,.0f}")
    print(f"{'Trainable parameters':<25} {chinchilla['trainable_params']:>20,.0f}")
    print(f"{'Optimal tokens':<25} {chinchilla['optimal_tokens']:>20,.0f}")
    print(f"{'Your tokens':<25} {chinchilla['actual_tokens']:>20,}")
    print(f"{'Chinchilla fraction':<25} {chinchilla['chinchilla_fraction']:>20.2f}x")
    print(f"\nInterpretation: {interpret_chinchilla(chinchilla['chinchilla_fraction'])}")

    # Recommendations
    print("\n7. Recommendations")
    print("-" * 40)
    if format_type == "sharegpt":
        print("- Use standardize_sharegpt() before training to convert to ChatML")
    if statistics.mean(token_counts) > 2048:
        print("- Consider max_seq_length=4096 or higher")
    if len(dataset) < 100:
        print("- Small dataset: use low learning rate (1e-5) and few epochs (1-3)")
    if chinchilla['chinchilla_fraction'] < 0.5:
        print("- Consider data augmentation or using a smaller model")

    print("\n" + "=" * 60)
    print("Validation Complete")
    print("=" * 60)

if __name__ == "__main__":
    main()
```