SkillHub Club · Ship Full Stack · Full Stack

nanogpt-training

Train GPT-2 scale models (~124M parameters) efficiently on a single GPU. Covers GPT-124M architecture, tokenized dataset loading (e.g., HuggingFace Hub shards), modern optimizers (Muon, AdamW), mixed precision training, and training loop implementation.

Packaged view

This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.

Stars: 745
Hot score: 99
Updated: March 20, 2026
Overall rating: 5.0
Composite score: 5.0
Best-practice grade: C (62.8)

Install command

npx @skill-hub/cli install benchflow-ai-skillsbench-nanogpt-training

Repository

benchflow-ai/SkillsBench

Skill path: tasks-no-skills/mhc-layer-impl/environment/skills/nanogpt-training

Train GPT-2 scale models (~124M parameters) efficiently on a single GPU. Covers GPT-124M architecture, tokenized dataset loading (e.g., HuggingFace Hub shards), modern optimizers (Muon, AdamW), mixed precision training, and training loop implementation.

Open repository

Best for

Primary workflow: Ship Full Stack.

Technical facets: Full Stack.

Target audience: everyone.

License: Unknown.

Original source

Catalog source: SkillHub Club.

Repository owner: benchflow-ai.

This is still a mirrored public skill entry. Review the repository before installing into production workflows.

What it helps with

  • Install nanogpt-training into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
  • Review https://github.com/benchflow-ai/SkillsBench before adding nanogpt-training to shared team environments
  • Use nanogpt-training for development workflows

Works across

Claude Code · Codex CLI · Gemini CLI · OpenCode

Favorites: 0.

Sub-skills: 0.

Aggregator: No.

Original source / Raw SKILL.md

---
name: nanogpt-training
description: Train GPT-2 scale models (~124M parameters) efficiently on a single GPU. Covers GPT-124M architecture, tokenized dataset loading (e.g., HuggingFace Hub shards), modern optimizers (Muon, AdamW), mixed precision training, and training loop implementation.
---

# NanoGPT Training

## Overview

This skill covers training GPT-2 scale models (~124M parameters) efficiently on a single GPU. It provides:

- **GPT-124M Architecture**: Standard transformer with RoPE and modern optimizations
- **Tokenized Datasets**: Loading pre-tokenized shards from HuggingFace Hub or local files
- **Modern Optimizers**: Muon optimizer with Newton-Schulz orthogonalization
- **Mixed Precision**: bfloat16 training on A100 for 2x speedup

Training options:
- **Baseline GPT**: Standard residual connections
- **Experimental residual variants**: Optional alternative residual schemes for stability/efficiency

## Quick Reference

| Topic | Reference |
|-------|-----------|
| Model Architecture | [GPT Architecture](references/gpt-architecture.md) |
| Data Loading | [Tokenized Data](references/tokenized-data.md) |
| Optimizers | [Optimizers](references/optimizers.md) |
| Training Loop | [Training Loop](references/training-loop.md) |
| Hyperparameters | [Hyperparameters](references/hyperparameters.md) |

## Installation

```bash
pip install torch einops numpy huggingface_hub
```

## Minimal Example

```python
import modal

app = modal.App("gpt-training")

image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "torch", "einops", "numpy", "huggingface_hub"
)

@app.function(gpu="A100", image=image, timeout=3600)
def train():
    import torch
    from dataclasses import dataclass

    @dataclass
    class GPTConfig:
        block_size: int = 1024
        vocab_size: int = 50257
        n_layer: int = 12
        n_head: int = 12
        n_embd: int = 768
        dropout: float = 0.0
        bias: bool = False

    # Download data, build model, train
    # ... (see references for full implementation)
    final_loss = None  # placeholder; set by the real training loop

    return {"final_loss": final_loss}

@app.local_entrypoint()
def main():
    results = train.remote()
    print(results)
```

## Common Imports

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.cuda.amp import autocast, GradScaler
from dataclasses import dataclass
from einops import rearrange, repeat, reduce
import numpy as np
import math
```

## When to Use What

| Scenario | Approach |
|----------|----------|
| Standard GPT training | Use baseline model with standard residuals |
| Stability experiments | Try alternative residual variants or extra streams |
| Small experiments | Use T4/A10G GPU |
| Full training | Use A100 with bfloat16 |
| Custom data | Modify the dataset loader class |
| Different model size | Adjust GPTConfig parameters |
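For example, shrinking to a debug-scale model (hypothetical sizes, not from the skill) only requires changing `GPTConfig` fields; keep `n_embd` divisible by `n_head` so the head dimension stays an integer:

```python
from dataclasses import dataclass, replace

@dataclass
class GPTConfig:
    block_size: int = 1024
    vocab_size: int = 50257
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
    dropout: float = 0.0
    bias: bool = False

# Hypothetical small debug config: fewer, narrower blocks, shorter context
debug_cfg = replace(GPTConfig(), n_layer=6, n_head=6, n_embd=384, block_size=256)
print(debug_cfg.n_embd // debug_cfg.n_head)  # head_dim stays 64
```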

## Metrics to Monitor

| Metric | Typical Signal | Notes |
|--------|----------------|-------|
| Validation loss | Steady decrease | Absolute value depends on dataset/tokenizer |
| Grad norm | Moderate, stable range | Large spikes indicate instability |
| Training stability | Smooth curves | Frequent spikes suggest LR/batch issues |
| Throughput | Consistent tokens/sec | Use for comparing configs |

## External Resources

- nanoGPT: https://github.com/karpathy/nanoGPT
- build-nanogpt: https://github.com/karpathy/build-nanogpt
- modded-nanogpt: https://github.com/KellerJordan/modded-nanogpt
- FineWeb-Edu token shards: https://huggingface.co/datasets/karpathy/fineweb-edu-100B-gpt2-token-shards


---

## Referenced Files

> The following files are referenced in this skill and included for context.

### references/gpt-architecture.md

```markdown
# GPT Architecture

## GPT-124M Configuration

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024      # Context length
    vocab_size: int = 50257     # GPT-2 tokenizer
    n_layer: int = 12           # Transformer blocks
    n_head: int = 12            # Attention heads
    n_embd: int = 768           # Embedding dimension
    dropout: float = 0.0        # No dropout for pretraining
    bias: bool = False          # No bias in linear layers
```

This gives ~124M parameters, matching GPT-2 small.
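The ~124M figure can be checked by hand. Assuming weight tying and RoPE (so no positional-embedding table, as in this skill's model), a rough count in pure Python:

```python
def gpt_param_count(n_layer=12, n_embd=768, vocab_size=50257):
    embed = vocab_size * n_embd          # token embedding, tied with lm_head
    attn = 4 * n_embd * n_embd           # c_attn (3e^2) + c_proj (e^2)
    mlp = 8 * n_embd * n_embd            # c_fc (4e^2) + c_proj (4e^2)
    ln = 2 * (2 * n_embd)                # two LayerNorms (weight + bias) per block
    per_block = attn + mlp + ln
    return embed + n_layer * per_block + 2 * n_embd  # + final LayerNorm

print(gpt_param_count())  # 123570432, i.e. ~124M
```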

## Rotary Positional Embeddings (RoPE)

RoPE encodes position by rotating query and key vectors:

```python
class RotaryPositionalEmbedding(nn.Module):
    def __init__(self, dim, max_seq_len=2048, base=10000):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq)
        t = torch.arange(max_seq_len)
        freqs = torch.outer(t, inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        self.register_buffer("cos_cached", emb.cos())
        self.register_buffer("sin_cached", emb.sin())

    def forward(self, seq_len):
        return self.cos_cached[:seq_len], self.sin_cached[:seq_len]


def apply_rotary_emb(q, k, cos, sin):
    def rotate_half(x):
        x1, x2 = x[..., :x.shape[-1]//2], x[..., x.shape[-1]//2:]
        return torch.cat((-x2, x1), dim=-1)

    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
```
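As a quick property check: the transform above preserves per-position vector norms, since each coordinate pair is rotated by a pure 2D rotation. A NumPy sketch of the same math (so it runs without a GPU) demonstrates this:

```python
import numpy as np

def rotate_half(x):
    x1, x2 = x[..., :x.shape[-1]//2], x[..., x.shape[-1]//2:]
    return np.concatenate((-x2, x1), axis=-1)

dim, seq_len, base = 64, 16, 10000
inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
freqs = np.outer(np.arange(seq_len), inv_freq)
emb = np.concatenate((freqs, freqs), axis=-1)
cos, sin = np.cos(emb), np.sin(emb)

rng = np.random.default_rng(0)
q = rng.standard_normal((seq_len, dim))
q_rot = q * cos + rotate_half(q) * sin

# Rotation is norm-preserving at every position
ok = bool(np.allclose(np.linalg.norm(q_rot, axis=-1),
                      np.linalg.norm(q, axis=-1)))
print(ok)  # True
```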

## ReLU² Activation

ReLU² (squared ReLU) can work better than GELU:

```python
class ReluSquared(nn.Module):
    def forward(self, x):
        return F.relu(x).pow(2)

class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd, bias=False)
        self.act = ReluSquared()  # or nn.GELU()
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd, bias=False)

    def forward(self, x):
        return self.c_proj(self.act(self.c_fc(x)))
```

## Causal Self-Attention

```python
class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.head_dim = config.n_embd // config.n_head

        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=False)
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=False)
        self.rope = RotaryPositionalEmbedding(self.head_dim, config.block_size)

    def forward(self, x):
        B, T, C = x.size()
        qkv = self.c_attn(x)
        q, k, v = qkv.split(self.n_embd, dim=2)

        q = q.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_dim).transpose(1, 2)

        cos, sin = self.rope(T)
        q, k = apply_rotary_emb(q, k, cos, sin)

        # FlashAttention (PyTorch 2.0+)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)

        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)
```

## Complete Baseline Block

```python
class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x
```

## Full GPT Model

```python
class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.transformer = nn.ModuleDict(dict(
            wte=nn.Embedding(config.vocab_size, config.n_embd),
            drop=nn.Dropout(config.dropout),
            h=nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f=nn.LayerNorm(config.n_embd),
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.transformer.wte.weight = self.lm_head.weight  # Weight tying

    def forward(self, idx, targets=None):
        x = self.transformer.drop(self.transformer.wte(idx))
        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)
        logits = self.lm_head(x)

        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss
```
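Since the head is a softmax over `vocab_size` tokens, a randomly initialized model should start near the cross-entropy of a uniform prediction, which is a handy first sanity check on any training run:

```python
import math

vocab_size = 50257
# Uniform prediction: loss = -log(1/vocab_size) = log(vocab_size)
expected_initial_loss = math.log(vocab_size)
print(round(expected_initial_loss, 2))  # ≈ 10.82
```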

```

### references/tokenized-data.md

```markdown
# Tokenized Data Loading

## Download Only What You Need

Start with a small subset of shards from `karpathy/fineweb-edu-100B-gpt2-token-shards` (each shard is ~200MB) so you can validate the pipeline quickly. Scale up once training is stable and you know the throughput.

## Downloading in Modal

Since Modal functions run in isolated containers, download data inside the function:

```python
def download_tokenized_data():
    """Download FineWeb-Edu GPT-2 token shards."""
    import os
    from huggingface_hub import hf_hub_download

    data_dir = "/tmp/data/fineweb-edu"
    os.makedirs(data_dir, exist_ok=True)

    def get_file(fname):
        if not os.path.exists(os.path.join(data_dir, fname)):
            print(f"Downloading {fname}...")
            hf_hub_download(
                repo_id="karpathy/fineweb-edu-100B-gpt2-token-shards",
                filename=fname,
                repo_type="dataset",
                local_dir=data_dir,
            )

    # Use a few shards for training and hold out one for validation
    get_file("edu_fineweb_train_000001.bin")
    get_file("edu_fineweb_train_000002.bin")
    get_file("edu_fineweb_train_000003.bin")  # validation holdout

    print("Data download complete!")
    return data_dir
```

The dataset is public (no auth required) and ships only training shards; hold out one shard for validation.

## Dataset Files

| File | Purpose |
|------|---------|
| `edu_fineweb_train_000001.bin` | Training shard |
| `edu_fineweb_train_000003.bin` | Validation holdout shard |


Files typically store GPT-2 token IDs as `uint16` arrays.

## Memory-Mapped Data Loading

Use memory-mapped files for efficient data loading:

```python
import numpy as np
import torch

class TokenizedDataset:
    def __init__(self, data_dir, split="train", block_size=1024):
        self.block_size = block_size

        import os
        all_shards = sorted([
            os.path.join(data_dir, f)
            for f in os.listdir(data_dir)
            if f.startswith("edu_fineweb_train_") and f.endswith(".bin")
        ])

        if len(all_shards) < 2:
            raise ValueError("Need at least 2 shards to create train/val splits")

        if split == "val":
            self.shards = all_shards[-1:]
        else:
            self.shards = all_shards[:-1]

        self.data = [np.memmap(s, dtype=np.uint16, mode="r") for s in self.shards]
        self.lengths = [len(d) for d in self.data]
        self.total_length = sum(self.lengths)

    def _get_tokens(self, global_idx, length):
        """Get `length` tokens starting at global_idx (may span shard boundaries)."""
        pieces = []
        cumsum = 0
        for i, shard_len in enumerate(self.lengths):
            if length > 0 and global_idx < cumsum + shard_len:
                local_idx = global_idx - cumsum
                chunk = np.array(self.data[i][local_idx:local_idx + length])
                pieces.append(chunk)
                length -= len(chunk)
                global_idx += len(chunk)
            cumsum += shard_len
        if length > 0:
            raise IndexError("Index out of range")
        return np.concatenate(pieces)

    def get_batch(self, batch_size, device="cuda"):
        max_start = self.total_length - self.block_size - 1
        starts = torch.randint(0, max_start, (batch_size,))

        x = torch.zeros(batch_size, self.block_size, dtype=torch.long)
        y = torch.zeros(batch_size, self.block_size, dtype=torch.long)

        for i, start in enumerate(starts):
            tokens = self._get_tokens(start.item(), self.block_size + 1)
            x[i] = torch.from_numpy(tokens[:-1].astype(np.int32))
            y[i] = torch.from_numpy(tokens[1:].astype(np.int32))

        return x.to(device), y.to(device)
```

## Usage

```python
# Create dataset
data_dir = download_tokenized_data()
train_dataset = TokenizedDataset(data_dir, split="train", block_size=1024)
val_dataset = TokenizedDataset(data_dir, split="val", block_size=1024)

# Get batch
x, y = train_dataset.get_batch(batch_size=32, device="cuda")
```
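To sanity-check the memmap pattern without downloading anything, tiny synthetic shards (same filename scheme, fake token data) behave the same way:

```python
import os
import tempfile
import numpy as np

# Write two tiny fake shards mirroring the real filename scheme
tmp = tempfile.mkdtemp()
for i in (1, 2):
    tokens = np.arange(i * 1000, dtype=np.uint16)
    tokens.tofile(os.path.join(tmp, f"edu_fineweb_train_{i:06d}.bin"))

# memmap reads pages on demand instead of loading the file into RAM
shard = np.memmap(os.path.join(tmp, "edu_fineweb_train_000002.bin"),
                  dtype=np.uint16, mode="r")
print(len(shard), shard[:3].tolist())  # 2000 [0, 1, 2]
```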

## Memory Efficiency

Memory-mapped files:
- Don't load entire dataset into RAM
- Load data on-demand as needed
- Allow training on datasets larger than available RAM
- Each shard is mapped independently

```

### references/optimizers.md

```markdown
# Optimizers

## AdamW (Standard)

Standard choice for transformer training:

```python
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=6e-4,
    betas=(0.9, 0.95),
    weight_decay=0.1,
)
```

## Muon Optimizer

Muon uses Newton-Schulz iterations to orthogonalize momentum, providing better conditioning:

```python
def newton_schulz_iteration(G, num_iters=5):
    """Approximately orthogonalize a matrix via quintic Newton-Schulz iteration."""
    a, b, c = (3.4445, -4.7750, 2.0315)
    X = G / (G.norm() + 1e-7)  # normalize so the spectral norm is <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # iterate on the short side for efficiency
    for _ in range(num_iters):
        A = X @ X.T
        X = a * X + b * A @ X + c * A @ A @ X
    return X.T if transposed else X


class Muon(torch.optim.Optimizer):
    """Muon optimizer with orthogonalized momentum."""

    def __init__(self, params, lr=0.02, momentum=0.95, nesterov=True):
        defaults = dict(lr=lr, momentum=momentum, nesterov=nesterov)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            lr, momentum = group['lr'], group['momentum']
            for p in group['params']:
                if p.grad is None:
                    continue

                g = p.grad
                state = self.state[p]

                if len(state) == 0:
                    state['momentum_buffer'] = torch.zeros_like(g)

                buf = state['momentum_buffer']
                buf.mul_(momentum).add_(g)

                # Nesterov lookahead is applied before orthogonalization,
                # without mutating the momentum buffer in place
                update = g.add(buf, alpha=momentum) if group['nesterov'] else buf.clone()

                if p.ndim >= 2:
                    update = newton_schulz_iteration(update.reshape(p.shape[0], -1))
                    update = update.reshape(p.shape)

                p.add_(update, alpha=-lr)
```
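To see what the iteration actually does, a NumPy sketch of the same quintic iteration can be checked on a random matrix: the singular values are driven into a loose band around 1 (these coefficients trade exact orthogonality for speed), rather than to exactly 1:

```python
import numpy as np

def newton_schulz_iteration(G, num_iters=10):
    # NumPy sketch of the quintic Newton-Schulz iteration above
    a, b, c = (3.4445, -4.7750, 2.0315)
    X = G / (np.linalg.norm(G) + 1e-7)
    for _ in range(num_iters):
        A = X @ X.T
        X = a * X + b * A @ X + c * A @ A @ X
    return X

rng = np.random.default_rng(0)
X = newton_schulz_iteration(rng.standard_normal((64, 256)))
s = np.linalg.svd(X, compute_uv=False)
# Singular values land in a loose band around 1
print(bool(s.min() > 0.3 and s.max() < 1.7))  # True
```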

## Learning Rate Schedule

Use linear warmup followed by cosine decay:

```python
import math

def get_lr(step, warmup_steps, max_steps, max_lr, min_lr):
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        return min_lr
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (max_lr - min_lr)
```
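The schedule can be spot-checked at characteristic steps using the recommended settings from the hyperparameters reference (`get_lr` is repeated here so the snippet runs standalone):

```python
import math

def get_lr(step, warmup_steps, max_steps, max_lr, min_lr):
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        return min_lr
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (max_lr - min_lr)

print(get_lr(199, 200, 2000, 6e-4, 6e-5))   # end of warmup: max_lr
print(get_lr(1100, 200, 2000, 6e-4, 6e-5))  # midpoint of decay: ~3.3e-4
print(get_lr(2500, 200, 2000, 6e-4, 6e-5))  # past max_steps: min_lr
```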

## Gradient Clipping

Always clip gradients to prevent explosions:

```python
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```

## Optimizer Comparison

| Optimizer | LR | Momentum | Use Case |
|-----------|-----|----------|----------|
| AdamW | 6e-4 | β1=0.9, β2=0.95 | Standard training |
| Muon | 0.02 | 0.95 | Better conditioning |

## Parameter Groups (Advanced)

Different learning rates for different parameter types:

```python
# Separate embedding and other parameters
embed_params = [p for n, p in model.named_parameters() if 'wte' in n or 'lm_head' in n]
other_params = [p for n, p in model.named_parameters() if 'wte' not in n and 'lm_head' not in n]

optimizer = torch.optim.AdamW([
    {'params': embed_params, 'lr': 6e-4, 'weight_decay': 0.0},
    {'params': other_params, 'lr': 6e-4, 'weight_decay': 0.1},
])
```

```

### references/training-loop.md

```markdown
# Training Loop

## Mixed Precision Training

Use bfloat16 on A100 for 2x speedup:

```python
from torch.cuda.amp import autocast, GradScaler

# Loss scaling is only needed for float16; bfloat16 has enough dynamic
# range, so the scaler can stay disabled (its calls become no-ops).
scaler = GradScaler(enabled=False)

for step in range(max_steps):
    # Update learning rate
    lr = get_lr(step, warmup_steps, max_steps, max_lr, min_lr)
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

    # Forward pass with mixed precision
    with autocast(dtype=torch.bfloat16):
        _, loss = model(x, y)

    # Backward pass
    scaler.scale(loss).backward()

    # Gradient clipping
    scaler.unscale_(optimizer)
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

    # Optimizer step
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad(set_to_none=True)
```

## Complete Training Function

```python
def train_model(model, train_dataset, val_dataset, config, device="cuda"):
    model = model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4, weight_decay=0.1)
    scaler = GradScaler(enabled=False)  # no loss scaling needed for bfloat16

    # Training config
    max_steps = config.get("max_steps", 2000)
    warmup_steps = config.get("warmup_steps", 200)
    max_lr = config.get("max_lr", 6e-4)
    min_lr = config.get("min_lr", 6e-5)
    batch_size = config.get("batch_size", 32)
    eval_interval = config.get("eval_interval", 100)

    # Tracking
    train_losses = []
    val_losses = []
    grad_norms = []

    for step in range(max_steps):
        model.train()

        # Update LR
        lr = get_lr(step, warmup_steps, max_steps, max_lr, min_lr)
        for param_group in optimizer.param_groups:
            param_group['lr'] = lr

        # Get batch
        x, y = train_dataset.get_batch(batch_size, device)

        # Forward/backward
        with autocast(dtype=torch.bfloat16):
            _, loss = model(x, y)

        scaler.scale(loss).backward()
        scaler.unscale_(optimizer)
        grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)

        # Track metrics
        train_losses.append(loss.item())
        grad_norms.append(grad_norm.item())

        # Evaluation
        if step % eval_interval == 0 or step == max_steps - 1:
            model.eval()
            with torch.no_grad():
                x_val, y_val = val_dataset.get_batch(batch_size, device)
                with autocast(dtype=torch.bfloat16):
                    _, val_loss = model(x_val, y_val)
                val_losses.append(val_loss.item())
                print(f"Step {step}: train_loss={loss.item():.4f}, val_loss={val_loss.item():.4f}, grad_norm={grad_norm.item():.4f}")

    return {
        "final_val_loss": val_losses[-1],
        "train_losses": train_losses,
        "val_losses": val_losses,
        "grad_norms": grad_norms,
        "grad_norm_std": float(np.std(grad_norms)),
        "max_grad_norm": float(max(grad_norms)),
    }
```

## Evaluation

```python
@torch.no_grad()
def evaluate(model, dataset, batch_size=32, num_batches=10, device="cuda"):
    model.eval()
    losses = []
    for _ in range(num_batches):
        x, y = dataset.get_batch(batch_size, device)
        with autocast(dtype=torch.bfloat16):
            _, loss = model(x, y)
        losses.append(loss.item())
    return sum(losses) / len(losses)
```

## Progress Logging

```python
# Inside training loop
if step % 100 == 0:
    print(f"Step {step}/{max_steps}")
    print(f"  Train Loss: {loss.item():.4f}")
    print(f"  Val Loss: {val_loss.item():.4f}")
    print(f"  Grad Norm: {grad_norm.item():.4f}")
    print(f"  LR: {lr:.2e}")
```

```

### references/hyperparameters.md

```markdown
# Hyperparameters

## Recommended Settings for GPT-124M on A100

| Parameter | Recommended Value | Notes |
|-----------|-------------------|-------|
| `max_lr` | 6e-4 | Tune ±2x for stability |
| `min_lr` | 6e-5 | Keep 10x below max |
| `warmup_steps` | 200 | Increase for large batches |
| `max_steps` | 2000-4000 | Longer runs improve quality |
| `batch_size` | 32 | Reduce for memory-heavy variants |
| `weight_decay` | 0.1 | Standard GPT-2 pretraining |
| `block_size` | 1024 | Adjust for context length |
| `grad_clip` | 1.0 | Prevents rare spikes |

## Model Configuration

```python
@dataclass
class GPTConfig:
    block_size: int = 1024      # Context length
    vocab_size: int = 50257     # GPT-2 tokenizer vocab size
    n_layer: int = 12           # Number of transformer blocks
    n_head: int = 12            # Number of attention heads
    n_embd: int = 768           # Embedding dimension
    dropout: float = 0.0        # Dropout (0 for pretraining)
    bias: bool = False          # Use bias in linear layers
```

## Optional Variant Parameters

If you introduce alternative residual schemes or routing layers, document their extra knobs. Common examples include:

- `num_streams`: number of parallel residual streams
- `mixing_iters`: iterations for normalization or mixing
- `mixing_temperature`: softness of routing weights

Start with conservative defaults and validate stability early.

## Learning Rate Schedule

```
LR
 ^
 |    /\
 |   /  \
 |  /    \______
 | /
 +---------------> Steps
   warmup  decay
```

- **Warmup**: Linear increase from 0 to max_lr over warmup_steps
- **Decay**: Cosine decay from max_lr to min_lr

## Batch Size Guidelines

| GPU | VRAM | Suggested Batch (GPT-124M) |
|-----|------|----------------------------|
| T4 | 16GB | 8 |
| A10G | 24GB | 16 |
| A100-40GB | 40GB | 32 |
| A100-80GB | 80GB | 64 |

Reduce batch size for memory-heavy variants or longer context.

## Training Duration

| Steps | Tokens | Time (A100) | Quality |
|-------|--------|-------------|---------|
| 1000 | ~32M | ~8 min | Initial convergence |
| 2000 | ~64M | ~15 min | Good quality |
| 4000 | ~128M | ~30 min | Better quality |
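The token column follows directly from tokens per step = `batch_size` × `block_size` (32 × 1024 = 32,768):

```python
batch_size, block_size = 32, 1024
tokens_per_step = batch_size * block_size  # 32,768 tokens every step
for steps in (1000, 2000, 4000):
    print(f"{steps} steps ≈ {steps * tokens_per_step / 1e6:.1f}M tokens")
```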

## Training Health Signals

| Metric | Healthy Signal | Warning Signs |
|--------|----------------|---------------|
| Validation loss | Consistent downward trend | Plateaus early or increases |
| Grad norm | Stable with occasional bumps | Repeated spikes or NaNs |
| Grad norm variability | Low over short windows | Large oscillations between steps |

## Tuning Tips

1. **Loss Spikes**: Reduce learning rate or increase warmup
2. **Slow Convergence**: Increase learning rate (carefully)
3. **OOM Errors**: Reduce batch size
4. **Variant Instability**: Simplify variant or reduce mixing temperature

```
