
AILANG Post-Release Tasks

Run automated post-release workflow (eval baselines, dashboard, docs) for AILANG releases. Executes 46-benchmark suite (medium/hard/stretch) + full standard eval with validation and progress reporting. Use when user says "post-release tasks for vX.X.X" or "update dashboard". Fully autonomous with pre-flight checks.

Packaged view

This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.

Stars: 22
Hot score: 88
Updated: March 20, 2026
Overall rating: C (2.9)
Composite score: 2.9
Best-practice grade: C (55.6)

Install command

npx @skill-hub/cli install sunholo-data-ailang-post-release

Tags: automation, benchmarking, release-management, documentation, evaluation

Repository

sunholo-data/ailang

Skill path: .claude/skills/post-release


Open repository

Best for

Primary workflow: Write Technical Docs.

Technical facets: Full Stack, Data / AI, Tech Writer.

Target audience: everyone.

License: Unknown.

Original source

Catalog source: SkillHub Club.

Repository owner: sunholo-data.

This is still a mirrored public skill entry. Review the repository before installing into production workflows.

What it helps with

  • Install AILANG Post-Release Tasks into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
  • Review https://github.com/sunholo-data/ailang before adding AILANG Post-Release Tasks to shared team environments
  • Use AILANG Post-Release Tasks for development workflows

Works across

Claude Code, Codex CLI, Gemini CLI, OpenCode

Favorites: 0.

Sub-skills: 0.

Aggregator: No.

Original source / Raw SKILL.md

---
name: AILANG Post-Release Tasks
description: Run automated post-release workflow (eval baselines, dashboard, docs) for AILANG releases. Executes 46-benchmark suite (medium/hard/stretch) + full standard eval with validation and progress reporting. Use when user says "post-release tasks for vX.X.X" or "update dashboard". Fully autonomous with pre-flight checks.
---

# AILANG Post-Release Tasks

Run post-release tasks for an AILANG release: evaluation baselines, dashboard updates, and documentation.

## Quick Start

**Most common usage:**
```bash
# User says: "Run post-release tasks for v0.3.14"
# This skill will:
# 1. Run eval baseline (ALL 6 PRODUCTION MODELS, both languages) - ALWAYS USE --full FOR RELEASES
# 2. Update website dashboard (JSON with history preservation)
# 3. Update axiom scorecard KPI (if features affect axiom compliance)
# 4. Extract metrics and UPDATE CHANGELOG.md automatically
# 5. Move design docs from planned/ to implemented/
# 6. Run docs-sync to verify website accuracy (version constants, PLANNED banners, examples)
# 7. Commit all changes to git
```

**🚨 CRITICAL: For releases, ALWAYS use --full flag by default**
- Dev models (without --full) are only for quick testing/validation, NOT releases
- Users expect full benchmark results when they say "post-release" or "update dashboard"
- Never start with dev models and then try to add production models later

## When to Use This Skill

Invoke this skill when:
- User says "post-release tasks", "update dashboard", "run benchmarks"
- After successful release (once GitHub release is published)
- User asks about eval baselines or benchmark results
- User wants to update documentation after a release

## Available Scripts

### `scripts/run_eval_baseline.sh <version> [--full]`
Run evaluation baseline for a release version.

**🚨 CRITICAL: ALWAYS use --full for releases!**

**Usage:**
```bash
# ✅ CORRECT - For releases (ALL 6 production models)
.claude/skills/post-release/scripts/run_eval_baseline.sh 0.3.14 --full
.claude/skills/post-release/scripts/run_eval_baseline.sh v0.3.14 --full

# ❌ WRONG - Only use without --full for quick testing/validation
.claude/skills/post-release/scripts/run_eval_baseline.sh 0.3.14
```

**Output:**
```
Running eval baseline for 0.3.14...
Mode: FULL (all 6 production models)
Expected cost: ~$0.50-1.00
Expected time: ~15-20 minutes

[Running benchmarks...]

✓ Baseline complete
  Results: eval_results/baselines/0.3.14
  Files: 264 result files
```

**What it does:**
- **Step 1**: Runs `make eval-baseline` (standard 0-shot + repair evaluation)
  - Tests both AILANG and Python implementations
  - Uses all 6 production models (--full) or 3 dev models (default)
  - Tests all benchmarks in benchmarks/ directory
- **Step 2**: Runs agent eval on full benchmark suite
  - **Current suite** (v0.4.8+): 46 benchmarks (trimmed from 56)
    - Removed: trivial benchmarks (print tests), most easy benchmarks
    - Kept: fizzbuzz (1 easy for validation), all medium/hard/stretch benchmarks
    - **Stretch goals** (6 new): symbolic_diff, mini_interpreter, lambda_calc,
      graph_bfs, type_unify, red_black_tree
    - Expected: ~55-70% success rate with haiku, higher with sonnet/opus
  - Uses haiku+sonnet (--full) or haiku only (default)
  - Tests both AILANG and Python implementations
- Saves combined results to eval_results/baselines/X.X.X/
- Accepts version with or without 'v' prefix

### `scripts/update_dashboard.sh <version>`
Update website benchmark dashboard with new release data.

**Usage:**
```bash
.claude/skills/post-release/scripts/update_dashboard.sh 0.3.14
```

**Output:**
```
Updating dashboard for 0.3.14...

1/3 Generating dashboard JSON with history...
  ✓ Written to docs/static/benchmarks/latest.json (history preserved)

2/3 Validating JSON...
  ✓ Version: 0.3.14
  ✓ Success rate: 0.627

3/3 Clearing Docusaurus cache...
  ✓ Cache cleared

✓ Dashboard JSON updated for 0.3.14
  Data file: docs/static/benchmarks/latest.json
  Template: docs/docs/benchmarks/performance.md (static - not modified)

Next steps:
  1. Test locally: cd docs && npm start
  2. Visit: http://localhost:3000/ailang/docs/benchmarks/performance
  3. Verify radar charts and timeline show 0.3.14
  4. Commit: git add docs/static/benchmarks/latest.json
  5. Commit: git commit -m 'Update benchmark dashboard for 0.3.14'
  6. Push: git push
```

**What it does:**
- Updates dashboard JSON with history preservation (docs/static/benchmarks/latest.json)
- Leaves docs/docs/benchmarks/performance.md untouched (it is a static template whose React components read from the JSON)
- Validates JSON structure (version matches input exactly)
- Clears Docusaurus build cache
- Provides next steps for testing and committing
- Accepts version with or without 'v' prefix
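
To spot-check that history was preserved after a run, a quick jq query over the generated JSON works (a sketch; the `history` array with `version` and `timestamp` fields is the structure the bundled scripts themselves query):

```bash
# List all versions recorded in the dashboard history (sketch)
jq -r '.history[].version' docs/static/benchmarks/latest.json

# Confirm the newest entry matches the release you just published
jq -r '.history | sort_by(.timestamp) | last | .version' docs/static/benchmarks/latest.json
```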

### `scripts/extract_changelog_metrics.sh <version> [results_dir]`
Extract benchmark metrics from the eval results and dashboard JSON for the CHANGELOG.

**Usage:**
```bash
.claude/skills/post-release/scripts/extract_changelog_metrics.sh 0.3.15
# Or specify a custom results directory:
.claude/skills/post-release/scripts/extract_changelog_metrics.sh 0.3.15 eval_results/baselines/v0.3.15
```

**Output:**
```
Extracting metrics from docs/static/benchmarks/latest.json...

=== CHANGELOG.md Template ===

### Benchmark Results (M-EVAL)

**Overall Performance**: 59.1% success rate (399 total runs)

**By Language:**
- **AILANG**: 33.0% - New language, learning curve
- **Python**: 87.0% - Baseline for comparison
- **Gap**: 54.0 percentage points (expected for a new language)

**Comparison**: -15.2% AILANG regression from 0.3.14 (48.2% → 33.0%)

=== End Template ===

Use this template in CHANGELOG.md for 0.3.15
```

**What it does:**
- Parses dashboard JSON for metrics
- Calculates percentages and gap between AILANG/Python
- **Auto-compares with previous version from history**
- **Formats comparison text automatically** (+X% improvement or -X% regression)
- Generates ready-to-paste CHANGELOG template with no manual work needed
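
The comparison works by reading the second-newest entry out of the dashboard history. If you ever need to debug it, the equivalent manual query (taken from the script's own logic) is:

```bash
# How the script finds the previous version for comparison
jq -r '.history | sort_by(.timestamp) | reverse | .[1].version // "none"' \
  docs/static/benchmarks/latest.json
```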

### `scripts/cleanup_design_docs.sh <version> [--dry-run] [--force] [--check-only]`
Move design docs from planned/ to implemented/ after a release.

**Features:**
- Detects **duplicates** (docs already in implemented/)
- Detects **misplaced docs** (Target: field doesn't match folder version)
- Only moves docs with "Status: Implemented" in their frontmatter
- Docs with other statuses are flagged for review

**Usage:**
```bash
# Check-only: Report issues without making changes
.claude/skills/post-release/scripts/cleanup_design_docs.sh 0.5.9 --check-only

# Preview what would be moved/deleted/relocated
.claude/skills/post-release/scripts/cleanup_design_docs.sh 0.5.9 --dry-run

# Execute: Move implemented docs, delete duplicates, relocate misplaced
.claude/skills/post-release/scripts/cleanup_design_docs.sh 0.5.9

# Force move all docs regardless of status
.claude/skills/post-release/scripts/cleanup_design_docs.sh 0.5.9 --force
```

**Output:**
```
Design Doc Cleanup for v0.5.9
==================================

Checking 5 design doc(s) in design_docs/planned/v0_5_9/:

Phase 1: Detecting issues...

  [DUPLICATE] m-fix-if-else-let-block.md
              Already exists in design_docs/implemented/v0_5_9/
  [MISPLACED] m-codegen-value-types.md
              Target: v0.5.10 (folder: v0_5_9)
              Should be in: design_docs/planned/v0_5_10/

Issues found:
  - 1 duplicate(s) (can be deleted)
  - 1 misplaced doc(s) (wrong version folder)

Phase 2: Processing docs...

  [DELETED] m-fix-if-else-let-block.md (duplicate - already in implemented/)
  [RELOCATED] m-codegen-value-types.md → design_docs/planned/v0_5_10/
  [MOVED] m-dx11-cycles.md → design_docs/implemented/v0_5_9/
  [NEEDS REVIEW] m-unfinished-feature.md
                 Found: **Status**: Planned

Summary:
  ✓ Deleted 1 duplicate(s)
  ✓ Relocated 1 doc(s) to correct version folder
  ✓ Moved 1 doc(s) to design_docs/implemented/v0_5_9/
  ⚠ 1 doc(s) need review (not marked as Implemented)
```

**What it does:**
- **Phase 1 (Detection)**: Identifies duplicates and misplaced docs
- **Phase 2 (Processing)**:
  - Deletes duplicates (same file already exists in implemented/)
  - Relocates misplaced docs (Target: version doesn't match folder)
  - Moves docs with "Status: Implemented" to implemented/
  - Flags docs without Implemented status for review
- Creates target folders if needed
- Removes empty planned folder after cleanup
- Use `--check-only` to only report issues, `--dry-run` to preview actions, `--force` to move all
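
The status check keys off a `**Status**:` line in each doc's frontmatter (visible in the `[NEEDS REVIEW]` output above). A hedged way to preview which docs would qualify before running the script, assuming the status line sits at the start of a line as shown:

```bash
# Sketch: list planned docs for a version and their Status lines
# (folder naming v0_5_9 follows the convention shown above)
grep -H '^\*\*Status\*\*:' design_docs/planned/v0_5_9/*.md

# Docs that would qualify for the move to implemented/
grep -l '^\*\*Status\*\*: *Implemented' design_docs/planned/v0_5_9/*.md
```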

## Post-Release Workflow

### 1. Verify Release Exists

```bash
git tag -l vX.X.X
gh release view vX.X.X
```

If release doesn't exist, run `release-manager` skill first.

### 2. Run Eval Baseline

**🚨 CRITICAL: ALWAYS use --full for releases!**

**Correct workflow:**
```bash
# ✅ ALWAYS do this for releases - ONE command, let it complete
.claude/skills/post-release/scripts/run_eval_baseline.sh X.X.X --full
```

This runs all 6 production models with both AILANG and Python.

**Cost**: ~$0.50-1.00
**Time**: ~15-20 minutes

**❌ WRONG workflow (what happened with v0.3.22):**
```bash
# DON'T do this for releases!
.claude/skills/post-release/scripts/run_eval_baseline.sh X.X.X  # Missing --full!
# Then try to add production models later with --skip-existing
# Result: Confusion, multiple processes, incomplete baseline
```

**If baseline times out or is interrupted:**
```bash
# Resume with ALL 6 models (maintains --full semantics)
ailang eval-suite --full --langs python,ailang --parallel 5 \
  --output eval_results/baselines/X.X.X --skip-existing
```

The `--skip-existing` flag skips benchmarks that already have result files, allowing resumption of interrupted runs. But ONLY use this for recovery, not as a strategy to "add more models later".
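
Before resuming, it helps to count how many result files already exist so you know roughly where the run stopped (a sketch using the same `find` pattern the baseline script uses for its progress reporting):

```bash
# Count completed result files for an interrupted baseline (sketch)
# A complete --full run produces ~736 files (see the script's pre-flight check)
find eval_results/baselines/vX.X.X -name '*.json' | wc -l
```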

### 3. Update Website Dashboard

**Use the automation script:**
```bash
.claude/skills/post-release/scripts/update_dashboard.sh X.X.X
```

**IMPORTANT**: This script automatically:
- Updates JSON with history preservation (docs/static/benchmarks/latest.json)
- Leaves docs/docs/benchmarks/performance.md alone (static template; its React components read from the JSON)
- Does NOT overwrite historical data - merges the new version into existing history
- Validates JSON structure before writing
- Clears Docusaurus cache to prevent webpack errors

**Test locally (optional but recommended):**
```bash
cd docs && npm start
# Visit: http://localhost:3000/ailang/docs/benchmarks/performance
```

Verify:
- Timeline shows vX.X.X
- Success rate matches eval results
- No errors or warnings
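
The first two checks can also be done against the generated JSON directly, without starting the dev server (field names taken from update_dashboard.sh):

```bash
# Confirm version and success rate in the dashboard data (sketch)
jq -r '.version, .aggregates.finalSuccess' docs/static/benchmarks/latest.json
```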

**Commit dashboard updates:**
```bash
git add docs/static/benchmarks/latest.json
git commit -m "Update benchmark dashboard for vX.X.X"
git push
```

### 4. Update Axiom Scorecard

**Review and update the axiom scorecard:**
```bash
# View current scorecard
ailang axioms

# The scorecard is at docs/static/benchmarks/axiom_scorecard.json
# Update scores if features were added/improved that affect axiom compliance
```

**When to update scores:**
- +1 → +2 if a partial implementation becomes complete
- New feature aligns with an axiom → update evidence
- Gaps were fixed → remove from gaps array
- Add to history array to track KPI over time

**Update history entry:**
```json
{
  "version": "vX.X.X",
  "date": "YYYY-MM-DD",
  "score": 18,
  "maxScore": 24,
  "percentage": 75.0,
  "notes": "Added capability budgets (A9 +1)"
}
```

### 5. Extract Metrics for CHANGELOG

**Generate metrics template:**
```bash
.claude/skills/post-release/scripts/extract_changelog_metrics.sh X.X.X
```

This outputs a formatted template with:
- Overall success rate
- Standard eval metrics (0-shot, final with repair, repair effectiveness)
- Agent eval metrics by language
- **Automatic comparison with previous version** (no manual work!)

**Update CHANGELOG.md automatically:**
1. Run the script to generate the template
2. Insert the "Benchmark Results (M-EVAL)" section into CHANGELOG.md
3. Place it after the feature/fix sections and before the next version
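
If you want just the template body on disk for pasting, the delimiter lines in the script output make it easy to slice out (a sketch):

```bash
# Extract only the template between the delimiter lines (sketch)
.claude/skills/post-release/scripts/extract_changelog_metrics.sh X.X.X \
  | sed -n '/=== CHANGELOG.md Template ===/,/=== End Template ===/p' \
  | sed '1d;$d' > /tmp/changelog_section.md
```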

For CHANGELOG template format, see [`resources/version_notes.md`](resources/version_notes.md).

### 5a. Analyze Agent Evaluation Results

```bash
# Get KPIs (turns, tokens, cost by language)
.claude/skills/eval-analyzer/scripts/agent_kpis.sh eval_results/baselines/X.X.X
```

Target metrics: AILANG average turns within 1.5x of Python's, average tokens within 2.0x of Python's.

For detailed agent analysis guide, see [`resources/version_notes.md`](resources/version_notes.md).

### 6. Move Design Docs to Implemented

**Step 1: Check for issues (duplicates, misplaced docs):**
```bash
.claude/skills/post-release/scripts/cleanup_design_docs.sh X.X.X --check-only
```

**Step 2: Preview all changes:**
```bash
.claude/skills/post-release/scripts/cleanup_design_docs.sh X.X.X --dry-run
```

**Step 3: Check any flagged docs:**
- `[DUPLICATE]` - These will be deleted (already in implemented/)
- `[MISPLACED]` - These will be relocated to correct version folder
- `[NEEDS REVIEW]` - Update `**Status**:` to `Implemented` if done, or leave for next version

**Step 4: Execute the cleanup:**
```bash
.claude/skills/post-release/scripts/cleanup_design_docs.sh X.X.X
```

**Step 5: Commit the changes:**
```bash
git add design_docs/
git commit -m "docs: cleanup design docs for vX_Y_Z"
```

The script automatically:
- Deletes duplicates (same file already in implemented/)
- Relocates misplaced docs (Target: version doesn't match folder)
- Moves docs with "Status: Implemented" to implemented/
- Flags remaining docs for manual review

### 7. Update Public Documentation

- Update `prompts/` with latest AILANG syntax
- Update website docs (`docs/`) with latest features
- Remove outdated examples or references
- Add new examples to website
- Update `docs/guides/evaluation/` if significant benchmark improvements
- **Update `docs/LIMITATIONS.md`**:
  - Remove limitations that were fixed in this release
  - Add new known limitations discovered during development/testing
  - Update workarounds if they changed
  - Update version numbers in "Since" and "Fixed in" fields
  - **Test examples**: Verify that limitations listed still exist and workarounds still work
    ```bash
    # Test examples from LIMITATIONS.md
    # Example: Test Y-combinator still fails (should fail)
    echo 'let Y = \f. (\x. f(x(x)))(\x. f(x(x))) in Y' | ailang repl

    # Example: Test named recursion works (should succeed)
    ailang run examples/factorial.ail

    # Example: Test polymorphic operator limitation (should panic with floats)
    # Create test file and verify behavior matches documentation
    ```
  - Commit changes: `git add docs/LIMITATIONS.md && git commit -m "Update LIMITATIONS.md for vX.X.X"`

### 8. Run Documentation Sync Check

**Run docs-sync to verify website accuracy:**
```bash
# Check version constants are correct
.claude/skills/docs-sync/scripts/check_versions.sh

# Audit design docs vs website claims
.claude/skills/docs-sync/scripts/audit_design_docs.sh

# Generate full sync report
.claude/skills/docs-sync/scripts/generate_report.sh
```

**What docs-sync checks:**
- Version constants in `docs/src/constants/version.js` match git tag
- Teaching prompt references point to latest version
- Architecture pages have PLANNED banners for unimplemented features
- Design docs status (planned vs implemented) matches website claims
- Examples referenced in website actually work
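
A quick manual spot-check of the first item (a sketch; the exact layout of version.js may differ, so compare the two outputs by eye):

```bash
# Compare the latest git tag against the website's version constant (sketch)
git describe --tags --abbrev=0
grep -n 'version' docs/src/constants/version.js
```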

**If issues found:**
1. Update version.js if stale
2. Add PLANNED banners to theoretical feature pages
3. Move implemented features from roadmap to current sections
4. Fix broken example references

**Commit docs-sync fixes:**
```bash
git add docs/
git commit -m "docs: sync website with vX.X.X implementation"
```

See [docs-sync skill](../docs-sync/SKILL.md) for full documentation.

## Resources

### Post-Release Checklist
See [`resources/post_release_checklist.md`](resources/post_release_checklist.md) for complete step-by-step checklist.

## Prerequisites

- Release vX.X.X completed successfully
- Git tag vX.X.X exists
- GitHub release published with all binaries
- `ailang` binary installed (for eval baseline)
- Node.js/npm installed (for dashboard, optional)

## Common Issues

### Eval Baseline Times Out
**Solution**: Use `--skip-existing` flag to resume:
```bash
bin/ailang eval-suite --full --skip-existing --output eval_results/baselines/vX.X.X
```

### Dashboard Shows "null" for Aggregates
**Cause**: Wrong JSON file (performance matrix vs dashboard JSON)
**Solution**: Use `update_dashboard.sh` script, not manual file copying

### Webpack/Cache Errors in Docusaurus
**Cause**: Stale build cache
**Solution**: Run `cd docs && npm run clear && rm -rf .docusaurus build`

### Dashboard Shows Old Version
**Cause**: Didn't run update_dashboard.sh with correct version
**Solution**: Re-run `update_dashboard.sh X.X.X` with correct version

## Progressive Disclosure

This skill loads information progressively:

1. **Always loaded**: This SKILL.md file (YAML frontmatter + workflow overview)
2. **Execute as needed**: Scripts in `scripts/` directory (automation)
3. **Load on demand**: `resources/post_release_checklist.md` (detailed checklist)

Scripts execute without loading into context window, saving tokens while providing powerful automation.

## Version Notes

For historical improvements and lessons learned, see [`resources/version_notes.md`](resources/version_notes.md).

Key points:
- All scripts accept version with or without 'v' prefix
- Use `--validate` flag to check configuration before running
- Agent eval requires explicit `--benchmarks` list (defined in script)

## Notes

- This skill follows Anthropic's Agent Skills specification (Oct 2025)
- Scripts handle 100% of automation (eval baseline, dashboard, metrics extraction, design docs)
- Can be run hours or even days after release
- Dashboard JSON preserves history - never overwrites historical data
- Always use `--full` flag for release baselines (all production models)
- Design docs cleanup now handles duplicates, misplaced docs, and status-based moves
  - `--check-only` to report issues without changes
  - `--dry-run` to preview all actions
  - `--force` to move all regardless of status


---

## Referenced Files

> The following files are referenced in this skill and included for context.

### resources/version_notes.md

```markdown
# Post-Release Version Notes

Historical improvements and lessons learned from past releases.

## Recent Improvements (v0.3.15)

### Version Format Handling
All scripts now accept version with or without 'v' prefix:
- `0.3.15` works
- `v0.3.15` works
- Scripts normalize the version internally, so result paths are always consistent
- No more "directory not found" errors due to version format mismatch
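
The normalization idiom the bundled scripts use for this (see run_eval_baseline.sh below) is a one-liner:

```bash
# Strip any leading "v", then re-add it so paths are always consistent
VERSION_NORMALIZED="${VERSION#v}"
RESULTS_DIR="eval_results/baselines/v$VERSION_NORMALIZED"
```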

### Auto-Comparison in Metrics
The `extract_changelog_metrics.sh` script now:
- Automatically finds previous version from history
- Calculates AILANG performance difference
- Formats comparison text ("+X% improvement" or "-X% regression")
- Shows before/after percentages: "(48.2% → 33.0%)"
- **Zero manual jq queries needed!**

### Eliminated Manual Work
Before v0.3.15, you needed to:
- Run 15+ manual jq queries to extract metrics
- Manually compare with previous version
- Format comparison text yourself
- Handle version prefix inconsistencies

After v0.3.15:
- **3 scripts do everything automatically**
- No jq queries needed
- No manual calculations
- Version format doesn't matter

## Lessons Learned (v0.4.8)

### Always Validate Before Long-Running Operations

The pre-release checks now include:
1. **Golden file validation**: `make test-import-errors` to catch stale goldens
2. **Agent eval config validation**: `--validate` flag to verify benchmarks list

**Issue discovered**: Agent eval requires explicit `--benchmarks` list (safety feature), but script didn't have it defined. Now fixed with `AGENT_BENCHMARKS` at top of script.

### Agent Eval Requirements

- Agent mode REQUIRES explicit `--benchmarks` list (46 benchmarks as of v0.4.8)
- The list is defined in `AGENT_BENCHMARKS` variable at top of `run_eval_baseline.sh`
- Keep in sync with `benchmarks/` directory

### Golden Files Can Become Stale

When error behavior changes (e.g., module now exists → different error code), golden files need regeneration:
```bash
make regen-import-error-goldens
```

Pre-release checks now validate goldens match current behavior.

### Validate Script Before Running

Use `--validate` flag to check configuration without running full eval:
```bash
.claude/skills/post-release/scripts/run_eval_baseline.sh --validate
```

This checks:
- AGENT_BENCHMARKS is defined
- Benchmark count (46 expected)
- ailang command exists
- Benchmark files exist

## CHANGELOG Template Example

When updating CHANGELOG.md with benchmark results, use this format:

```markdown
### Benchmark Results (M-EVAL)

**Overall Performance**: 68.1% success rate (480 total runs)

**Standard Eval (0-shot + self-repair):**

| Metric | 0.4.4 | 0.4.5 | Change |
|--------|--------|--------|--------|
| **0-shot (first attempt)** | 55.6% | 64.0% (182/284) | **+8.4%** |
| **Final (with repair)** | 60.5% | 68.6% (195/284) | **+8.1%** |
| **Repair effectiveness** | +4.9pp | +4.6pp | -0.3pp |
| **Python (final)** | 73.1% | 76.4% (208/272) | +3.3% |

**Agent Eval (multi-turn iterative problem solving):**

| Language | 0.4.4 | 0.4.5 | Change |
|----------|--------|--------|--------|
| **AILANG** | 92.1% | 100.0% (38/38) | **+7.9%** |
| **Python** | 100.0% | 100.0% (38/38) | 0% |

**Key Findings:**
- Major improvement in 0-shot performance
- Perfect agent eval score achieved
```

## Agent Eval Analysis Guide

### Check agent efficiency metrics:
```bash
# Get KPIs (turns, tokens, cost by language)
.claude/skills/eval-analyzer/scripts/agent_kpis.sh eval_results/baselines/X.X.X
```

### Key metrics to track:
- **Avg Turns**: AILANG vs Python (target: ≤1.5x gap)
- **Avg Tokens**: AILANG vs Python (target: ≤2.0x gap)
- **Success Rate**: Both should be 100% (agent mode corrects mistakes)

### Example good result:
```
🐍 Python: 10.6 avg turns, 58k tokens, 100% success
🔷 AILANG: 12.0 avg turns, 72k tokens, 100% success
Gap: 1.13x turns, 1.24x tokens ✅ (within target!)
```

### Example needs work:
```
🐍 Python: 10.6 avg turns, 58k tokens, 100% success
🔷 AILANG: 18.0 avg turns, 178k tokens, 100% success
Gap: 1.7x turns, 3.0x tokens ⚠️ (needs optimization!)
```

### If AILANG significantly worse than Python:
1. Identify expensive benchmarks (from "Most Expensive" section)
2. View transcripts: `.claude/skills/eval-analyzer/scripts/agent_transcripts.sh eval_results/baselines/X.X.X <benchmark>`
3. File optimization tasks for next release

```

### ../docs-sync/SKILL.md

```markdown
---
name: docs-sync
description: Sync AILANG documentation website with codebase reality. Use after releases, when features are implemented, or when website accuracy is questioned. Checks design docs vs website, validates examples, updates feature status.
---

# Documentation Sync Skill

Keep the AILANG website in sync with actual implementation by checking design docs, validating examples, and tracking feature status.

## Quick Start

```bash
# After a release - full sync check
# User: "sync docs for v0.5.6" or "run docs-sync"

# Check specific area
# User: "verify landing pages" or "check working examples"
```

## When to Use This Skill

Invoke this skill when:
- After any release (post-release should trigger this)
- User questions website accuracy
- Features move from planned → implemented
- Examples may be broken or outdated
- Version constants need updating

## Workflow

### 1. Pre-flight Checks

Run all diagnostic scripts to understand current state:

```bash
# Check design docs status
.claude/skills/docs-sync/scripts/audit_design_docs.sh

# Check version constants
.claude/skills/docs-sync/scripts/check_versions.sh

# Validate working examples
.claude/skills/docs-sync/scripts/check_examples.sh
```

### 2. Review Feature Themes

Features are grouped into themes (see [resources/feature_themes.md](resources/feature_themes.md)):

| Theme | Status Page | Description |
|-------|-------------|-------------|
| Core Language | `/reference/language-syntax` | Types, ADTs, pattern matching |
| Effect System | `/reference/effects` | Capabilities, IO, FS, Net |
| Module System | `/reference/modules` | Imports, exports, aliasing |
| Go Codegen | `/guides/go-codegen` | Compilation to Go |
| AI Integration | `/guides/ai-integration` | Prompts, benchmarks, agents |
| Testing | `/guides/testing` | Inline tests, property-based |
| Developer Experience | `/guides/development` | REPL, debugging, CLI |
| **Roadmap: Execution Profiles** | `/roadmap/execution-profiles` | v0.6.0 planned |
| **Roadmap: Shared Semantic State** | `/roadmap/shared-semantic-state` | v0.6.0 planned |
| **Roadmap: Deterministic Tooling** | `/roadmap/deterministic-tooling` | v0.7.0 planned |

### 3. Fix Priority Order

1. **Version Constants** - Update `docs/src/constants/version.js`
2. **Landing Pages** - intro.mdx, vision.mdx, why-ailang.mdx
3. **Working Examples** - Ensure examples use raw-loader, all pass
4. **Status Banners** - Add "PLANNED" banners to future features
5. **Feature Docs** - Create missing docs for implemented features
6. **Roadmap Section** - Move theoretical pages to roadmap

### 4. Update Checklist

After fixing, verify:

```bash
# Rebuild and check
cd docs && npm run build

# Verify no broken links
npm run serve  # Manual check

# Commit changes
git add docs/
git commit -m "docs: sync website with v0.X.X implementation"
```

## Scripts

| Script | Purpose |
|--------|---------|
| `audit_design_docs.sh` | Compare planned vs implemented design docs |
| `derive_roadmap_versions.sh` | **Derive target versions from design doc folders** |
| `check_versions.sh` | Verify version constants match releases |
| `check_examples.sh` | Validate example files compile/run |
| `generate_report.sh` | Generate sync status report |

### CLI Example Verification (External)

The Makefile provides CLI example verification that complements this skill:

```bash
# Verify all CLI examples documented in examples/cli_examples.txt
make verify-cli-examples

# Full verification: code examples + CLI examples
make verify-examples && make verify-cli-examples
```

**CLI Examples File Format** (`examples/cli_examples.txt`):
```
# Comment explaining the example
$ ailang run --caps IO --entry main examples/runnable/hello.ail
Hello, AILANG!
```

This ensures CLI syntax in documentation matches actual behavior.

### Version Derivation Script

```bash
# List all planned features with derived target versions
.claude/skills/docs-sync/scripts/derive_roadmap_versions.sh

# Full lifecycle: planned + implemented features
.claude/skills/docs-sync/scripts/derive_roadmap_versions.sh --full

# Check website consistency (exits 1 if mismatches)
.claude/skills/docs-sync/scripts/derive_roadmap_versions.sh --check

# JSON output for automation
.claude/skills/docs-sync/scripts/derive_roadmap_versions.sh --json --full

# Full validation: all features + website check
.claude/skills/docs-sync/scripts/derive_roadmap_versions.sh --full --check
```

## Resources

| Resource | Content |
|----------|---------|
| `feature_themes.md` | Feature groupings and expected pages |
| `landing_page_checklist.md` | Requirements for main pages |

## Integration with Post-Release

The `post-release` skill should invoke docs-sync automatically:

```bash
# In post-release workflow after eval baselines
# Run docs-sync to update website
```

## Key Principles

1. **Theoretical is OK** - Future features can be documented, but must:
   - Link to design docs on GitHub
   - Have clear status banner (e.g., "PLANNED FOR v0.6.0")
   - Be in roadmap section, not current features

2. **Examples Must Work** - Every code example should:
   - Use raw-loader to import from `examples/`
   - Be tested with `ailang run` or `ailang test`
   - Never embed code directly in MDX
   - **CLI commands must be added to `examples/cli_examples.txt`** and verified with `./tools/verify_cli_examples.sh`
   - Default entrypoint is `main` - don't use `--entry main` unless showing non-main entry
   - Test all commands before documenting: `./bin/ailang run --caps IO examples/runnable/hello.ail`

   **Exception for Reference Documentation:**
   - Reference pages (e.g., `language-syntax.md`, `effects.md`) may use inline syntax snippets
   - These are small patterns showing language constructs, not complete runnable programs
   - Tutorial pages (`examples.mdx`, `getting-started.mdx`, `guides/`) must always import from files
   - Rule: If the code snippet is a complete runnable program, it MUST be imported

3. **Design Docs = Ultimate Source of Truth** - The folder structure tracks complete feature lifecycle:

   **Planned features:**
   - `design_docs/planned/v0_6_0/foo.md` → Feature targets v0.6.0
   - Website pages must match: `PLANNED FOR v0.6.0` banner
   - Website MUST link to design doc on GitHub as ultimate reference
   - OVERDUE = design doc folder version ≤ current release but not implemented

   **Implemented features:**
   - `design_docs/implemented/v0_5_6/foo.md` → Feature shipped in v0.5.6
   - Website reference pages can link to implemented design docs
   - Moving `planned/` → `implemented/` = feature is done

   **Validation:**
   - Run `./scripts/derive_roadmap_versions.sh --check` to validate website
   - Run `./scripts/derive_roadmap_versions.sh --full` to see complete lifecycle
   - Script checks for GitHub links in roadmap pages

4. **One Source of Truth** - Version comes from:
   - `git describe --tags` → actual version
   - `prompts/versions.json` → latest prompt
   - These feed into `docs/src/constants/version.js`

5. **Themes Over Changelog** - Group features by theme, not by version. Users care about "how do effects work?" not "what changed in v0.5.3?"

6. **Evolving Themes** - Themes should evolve as the language grows:
   - When a feature doesn't fit existing themes, consider a new theme
   - New themes warrant new landing pages
   - Update `resources/feature_themes.md` when adding themes
   - Current themes: Core Language, Type System, Effect System, Module System, Go Codegen, Arrays, Testing, AI Integration, Developer Experience
   - Roadmap themes: Execution Profiles, Deterministic Tooling, Shared Semantic State

## Theme Evolution Guidelines

When reviewing new features, ask:

1. **Does this fit an existing theme?** → Add to that theme's page
2. **Is this a major new capability?** → Consider a new theme
3. **Is this cross-cutting?** → May need multiple mentions but one authoritative page

### Signals for a New Theme

- 3+ related features with no natural home
- A new design doc folder (e.g., `v0_7_0/`) with a coherent focus
- User questions consistently asking "how do I do X?" where X isn't covered
- A planned feature graduating to implemented that's substantial

### Creating a New Theme

1. Add entry to `resources/feature_themes.md`
2. Create website page at appropriate location
3. Update sidebar in `docs/sidebars.js`
4. Cross-link from related themes
5. Update this skill's theme table above

```

### resources/post_release_checklist.md

```markdown
# Post-Release Checklist

## Prerequisites

- [ ] Release vX.X.X completed successfully
- [ ] Git tag vX.X.X exists
- [ ] GitHub release published with all binaries

## 1. Run Eval Baseline

Use the automation script:
```bash
.claude/skills/post-release/scripts/run_eval_baseline.sh X.X.X --full
```

**Manual alternative:**
```bash
make eval-baseline EVAL_VERSION=X.X.X FULL=true
```

**If baseline times out or is interrupted:**
```bash
bin/ailang eval-suite --full --langs python,ailang --parallel 5 \
  --output ./eval_results/baselines/vX.X.X --self-repair --skip-existing
```

- [ ] Eval baseline complete (eval_results/baselines/vX.X.X/)
- [ ] Results include both AILANG and Python
- [ ] All models run successfully

## 2. Update Website Dashboard

Use the automation script:
```bash
.claude/skills/post-release/scripts/update_dashboard.sh X.X.X
```

This will:
- Leave docs/docs/benchmarks/performance.md untouched (static template; its React components read from the JSON)
- Update docs/static/benchmarks/latest.json (preserves history)
- Validate JSON output
- Clear Docusaurus cache

- [ ] JSON updated (docs/static/benchmarks/latest.json)
- [ ] JSON validation passed
- [ ] Cache cleared

## 3. Test Dashboard Locally (Optional)

```bash
cd docs && npm start
# Visit: http://localhost:3000/ailang/docs/benchmarks/performance
```

- [ ] Timeline shows vX.X.X
- [ ] Success rate matches eval results
- [ ] No webpack/cache errors

## 4. Update Axiom Scorecard (if applicable)

Review axiom compliance if the release includes features that affect design axioms:

```bash
# View current scorecard
ailang axioms

# Edit scorecard JSON if needed
# File: docs/static/benchmarks/axiom_scorecard.json
```

Update if:
- A partial implementation became complete (+1 → +2)
- New feature aligns with an axiom (add evidence)
- Gaps were fixed (remove from gaps array)

Always add history entry:
```json
{
  "version": "vX.X.X",
  "date": "YYYY-MM-DD",
  "score": 18,
  "maxScore": 24,
  "percentage": 75.0,
  "notes": "Brief description of changes"
}
```

- [ ] Axiom scorecard reviewed
- [ ] Scores updated if applicable
- [ ] History entry added

## 5. Extract Metrics for CHANGELOG

Use the automation script:
```bash
.claude/skills/post-release/scripts/extract_changelog_metrics.sh X.X.X
```

This outputs a CHANGELOG.md template with:
- Overall success rate
- AILANG-only rate
- Python baseline rate
- Gap analysis

- [ ] Metrics extracted
- [ ] CHANGELOG.md updated with benchmark results

## 6. Commit Dashboard Updates

```bash
git add docs/static/benchmarks/latest.json
git commit -m "Update benchmark dashboard for vX.X.X"
git push
```

- [ ] Files staged
- [ ] Committed
- [ ] Pushed

## 7. Verify Sprint JSON Tracking

Check that sprint state JSON files are properly completed:

```bash
# List sprint JSON files in project
ls -la .ailang/state/sprints/

# Verify sprint completion for this release
cat .ailang/state/sprints/sprint_<MILESTONE>.json | jq '.'
```

**For each sprint JSON file related to this release:**

- [ ] Sprint status is `"completed"` (not `"in_progress"`)
- [ ] All milestones marked with `"passes": true`
- [ ] `completed` timestamp is filled out for each milestone
- [ ] `actual_loc` values are present (not 0)
- [ ] `velocity` section has final metrics calculated
- [ ] `completion_summary` section exists with:
  - [ ] Total milestones count
  - [ ] Key deliverable counts (tests, files, functions, etc.)
  - [ ] Phase 2 message ID if applicable
  - [ ] Any important metrics specific to the sprint
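
A quick way to eyeball most of these fields at once (a sketch; field names are taken from the checklist above and may not match every sprint schema):

```bash
# Summarize completion-relevant fields from a sprint JSON (sketch)
jq '{status, velocity, completion_summary,
     milestones: [.milestones[]? | {passes, completed, actual_loc}]}' \
   .ailang/state/sprints/sprint_<MILESTONE>.json
```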

**If sprint JSON is incomplete:**

1. Review sprint completion documents (e.g., `M-<MILESTONE>-SPRINT-COMPLETE.md`)
2. Update sprint JSON with correct values
3. Create tracked completion summary if missing (sprint JSONs are gitignored)

**Common issues:**
- Sprint left in `"in_progress"` status
- Milestones missing completion timestamps
- `actual_loc` not calculated
- `velocity.efficiency` not computed
- `completion_summary` section missing

## 8. Update Design Docs

- [ ] Move completed design docs to design_docs/implemented/vX_Y/
- [ ] Update design docs with what was actually implemented
- [ ] Create new design docs in design_docs/planned/ for deferred features

## 9. Update Public Documentation

- [ ] Ensure prompts/ reflects latest AILANG syntax
- [ ] Update website docs (docs/) with latest features
- [ ] Remove old references or outdated examples
- [ ] Add new examples to website if applicable
- [ ] Update docs/guides/evaluation/ with significant improvements
- [ ] **Update docs/LIMITATIONS.md**:
  - [ ] Remove limitations fixed in this release
  - [ ] Add new limitations discovered during development
  - [ ] Update workarounds if they changed
  - [ ] Update version numbers ("Since", "Fixed in" fields)
  - [ ] Test examples in LIMITATIONS.md still work/fail as documented
  - [ ] Commit: `git add docs/LIMITATIONS.md && git commit -m "Update LIMITATIONS.md for vX.X.X"`

## 10. Run Documentation Sync Check

Use the docs-sync skill to verify website accuracy:

```bash
# Check version constants match git tag
.claude/skills/docs-sync/scripts/check_versions.sh

# Audit design docs vs website claims
.claude/skills/docs-sync/scripts/audit_design_docs.sh

# Generate full sync report
.claude/skills/docs-sync/scripts/generate_report.sh
```

- [ ] Version constants match git tag (docs/src/constants/version.js)
- [ ] Teaching prompt references point to latest version
- [ ] Architecture pages have PLANNED banners for unimplemented features
- [ ] No unimplemented features claimed as current
- [ ] Examples referenced in website actually work

**If issues found:**
- [ ] Update version.js if stale
- [ ] Add PLANNED banners to theoretical feature pages
- [ ] Move implemented features from roadmap to current sections
- [ ] Fix broken example references
- [ ] Commit: `git add docs/ && git commit -m "docs: sync website with vX.X.X"`

## Final Verification

- [ ] Eval baseline complete for vX.X.X
- [ ] CHANGELOG.md has benchmark results (AILANG + combined)
- [ ] Website dashboard shows vX.X.X as latest
- [ ] Dashboard timeline includes vX.X.X data point
- [ ] Dashboard JSON preserves history (multiple versions)
- [ ] **Axiom scorecard reviewed and updated if applicable**
- [ ] **Axiom history entry added for this version**
- [ ] Design docs moved to implemented/
- [ ] Public docs updated
- [ ] **docs/LIMITATIONS.md updated and tested**
- [ ] **docs-sync report shows no critical issues**
- [ ] All changes committed and pushed

## Notes

- Can be run hours or days after release
- Eval baselines cost ~\$0.50-1.00 (full) or ~\$0.10-0.20 (dev)
- Dashboard requires Node.js/npm installed
- Design doc migration is manual review

```

### scripts/run_eval_baseline.sh

```bash
#!/usr/bin/env bash
# Run evaluation baseline for a release version

set -euo pipefail

# Agent benchmarks list (46 benchmarks as of v0.4.8)
# This MUST be kept in sync with benchmarks/ directory
AGENT_BENCHMARKS="adt_option,api_call_json,balanced_parens,binary_tree_sum,canonical_normalization,cli_args,config_file_parser,csv_to_json_converter,effect_composition,effect_pure_separation,effect_tracking_io_fs,error_handling,exhaustive_pattern_matching,explicit_state_threading,expression_evaluator,fizzbuzz,float_eq,fold_reduce,gcd_lcm,graph_bfs,higher_order_functions,immutable_data_structures,inline_tests,json_encode,json_parse,json_transform,lambda_calc,list_comprehension,log_file_analyzer,merge_sort,mini_interpreter,nested_records,no_runtime_crashes_option,numeric_modulo,pattern_matching_complex,pipeline,record_update,records_person,recursion_fibonacci,red_black_tree,run_length_encode,state_machine_traffic_light,symbolic_diff,tree_transformation_pipeline,type_safe_record_access,type_unify"

# Validation mode: check configuration without running full eval
if [[ "${1:-}" == "--validate" ]]; then
    echo "Validating agent eval configuration..."

    # Check AGENT_BENCHMARKS is defined and non-empty
    if [[ -z "${AGENT_BENCHMARKS:-}" ]]; then
        echo "ERROR: AGENT_BENCHMARKS not defined"
        exit 1
    fi

    # Count benchmarks
    BENCHMARK_COUNT=$(echo "$AGENT_BENCHMARKS" | tr ',' '\n' | wc -l | tr -d ' ')
    echo "  Benchmarks defined: $BENCHMARK_COUNT"

    # Verify ailang command exists
    if ! command -v ailang &> /dev/null; then
        echo "ERROR: ailang command not found"
        exit 1
    fi
    echo "  ailang command: found"

    # Check that benchmark files exist (spot check first benchmark)
    FIRST_BENCHMARK=$(echo "$AGENT_BENCHMARKS" | cut -d',' -f1)
    if [[ -f "benchmarks/${FIRST_BENCHMARK}.yml" ]]; then
        echo "  Benchmark files: found (checked $FIRST_BENCHMARK.yml)"
    else
        echo "WARNING: Benchmark file not found: benchmarks/${FIRST_BENCHMARK}.yml"
    fi

    echo "  ✓ Agent eval configuration valid"
    exit 0
fi

# Progress monitoring (runs in background)
monitor_progress() {
    local results_dir="$1"
    local expected_count="$2"
    local phase="$3"

    echo "[$phase] Monitoring progress..."
    while true; do
        sleep 120  # Report every 2 minutes

        if [[ -d "$results_dir" ]]; then
            current=$(find "$results_dir" -name "*.json" 2>/dev/null | wc -l | tr -d ' ')
            percent=$((current * 100 / expected_count))
            echo "[$phase] Progress: $current/$expected_count files ($percent%)"
        fi
    done
}

if [[ $# -eq 0 ]]; then
    echo "Usage: $0 <version> [--full]" >&2
    echo "Example: $0 0.3.14 --full" >&2
    echo "" >&2
    echo "Options:" >&2
    echo "  --full    Run with all production models (default: dev models only)" >&2
    exit 1
fi

VERSION="$1"
FULL_FLAG=""

# Normalize version: ensure directory always has "v" prefix
# Strip any existing "v" prefix first, then add it
VERSION_NORMALIZED="${VERSION#v}"
RESULTS_DIR="eval_results/baselines/v$VERSION_NORMALIZED"

if [[ $# -gt 1 ]] && [[ "$2" == "--full" ]]; then
    FULL_FLAG="FULL=true"
fi

echo "Running eval baseline for $VERSION..."
if [[ -n "$FULL_FLAG" ]]; then
    echo "Mode: FULL (all 6 production models)"
    echo "Expected cost: ~\$0.50-1.00"
    echo "Expected time: ~15-20 minutes"
else
    echo "Mode: DEV (3 cheap models only)"
    echo "Expected cost: ~\$0.10-0.20"
    echo "Expected time: ~5-10 minutes"
fi
echo

# Step 1: Run standard eval baseline (pass normalized version WITH v prefix)
echo "=== Step 1/2: Standard Eval (0-shot + repair) ==="
if [[ -n "$FULL_FLAG" ]]; then
    monitor_progress "$RESULTS_DIR" 480 "Standard" &
    MONITOR_PID=$!
    make eval-baseline EVAL_VERSION="v$VERSION_NORMALIZED" FULL=true
    kill $MONITOR_PID 2>/dev/null || true
else
    monitor_progress "$RESULTS_DIR" 246 "Standard" &
    MONITOR_PID=$!
    make eval-baseline EVAL_VERSION="v$VERSION_NORMALIZED"
    kill $MONITOR_PID 2>/dev/null || true
fi

# Step 2: Run agent eval on curated benchmarks
echo
echo "=== Step 2/2: Agent Eval (multi-turn) ==="
echo "Running agent eval on curated benchmarks..."

# AGENT_BENCHMARKS is defined at the top of the script (46 benchmarks)
# See: benchmark-manager skill for benchmark categories
echo "Benchmarks: all 46 benchmarks (see AGENT_BENCHMARKS at top of script)"
echo

# Pre-flight validation and configuration summary
echo "=== Pre-Flight Check ==="
echo "Version: $VERSION"
echo "Mode: ${FULL_FLAG:-DEV}"
echo "Benchmarks: 46 (1 easy, ~19 medium, ~17 hard, 6 stretch)"
echo "Agent models: claude-sonnet-4-5 (Claude executor), gemini-3-flash (Gemini executor)"
echo "Agent parallelism: 2"
echo "Agent timeout: default (no override)"
echo "Prompt version: latest (auto-selected)"
echo
echo "Expected results:"
if [[ -n "$FULL_FLAG" ]]; then
    echo "  Standard eval: ~552 files (46 benchmarks × 6 models × 2 langs)"
    echo "  Agent eval: ~184 files (46 benchmarks × 2 executors × 2 langs)"
    echo "  Total: ~736 files"
else
    echo "  Standard eval: ~276 files (46 benchmarks × 3 models × 2 langs)"
    echo "  Agent eval: ~92 files (46 benchmarks × 2 executors × 1 lang)"
    echo "  Total: ~368 files"
fi
echo
echo "Starting in 3 seconds... (Ctrl-C to abort)"
sleep 3
echo

# Agent eval uses both Claude and Gemini executors (multi-executor support v0.6.0+)
# --benchmarks is required for agent mode (safety feature)
# Models: claude-sonnet-4-5 (Claude executor), gemini-3-flash (Gemini executor)
AGENT_MODELS="claude-sonnet-4-5,gemini-3-flash"

if [[ -n "$FULL_FLAG" ]]; then
    # Full mode: run both models on both languages
    echo "Mode: FULL (Claude Sonnet + Gemini 3 Flash)"
    echo "Executors: claude, gemini"
    monitor_progress "$RESULTS_DIR" 184 "Agent" &
    MONITOR_PID=$!
    ailang eval-suite --agent \
        --models "$AGENT_MODELS" \
        --benchmarks "$AGENT_BENCHMARKS" \
        --langs ailang,python \
        --agent-parallel 2 \
        --output "$RESULTS_DIR"
    kill $MONITOR_PID 2>/dev/null || true
else
    # Dev mode: both executors but AILANG only (faster)
    echo "Mode: DEV (Claude Sonnet + Gemini 3 Flash, AILANG only)"
    echo "Executors: claude, gemini"
    monitor_progress "$RESULTS_DIR" 92 "Agent" &
    MONITOR_PID=$!
    ailang eval-suite --agent \
        --models "$AGENT_MODELS" \
        --benchmarks "$AGENT_BENCHMARKS" \
        --langs ailang \
        --agent-parallel 2 \
        --output "$RESULTS_DIR"
    kill $MONITOR_PID 2>/dev/null || true
fi

# Show combined results
echo
if [[ -d "$RESULTS_DIR" ]]; then
    STANDARD_COUNT=$(find "$RESULTS_DIR/standard" -name "*.json" 2>/dev/null | wc -l | tr -d ' ')
    AGENT_COUNT=$(find "$RESULTS_DIR/agent" -name "*.json" 2>/dev/null | wc -l | tr -d ' ')
    TOTAL_COUNT=$((STANDARD_COUNT + AGENT_COUNT))
    echo "✓ Baseline complete"
    echo "  Results: $RESULTS_DIR"
    echo "  Standard eval: $STANDARD_COUNT files"
    echo "  Agent eval: $AGENT_COUNT files"
    echo "  Total: $TOTAL_COUNT files"
else
    echo "✗ Results directory not found: $RESULTS_DIR" >&2
    exit 1
fi

# Validation
echo
echo "=== Validation ==="

# Check file counts (46 benchmarks × models × 2 langs)
if [[ -n "$FULL_FLAG" ]]; then
    EXPECTED_STANDARD=552   # 46 × 6 × 2
    EXPECTED_AGENT=184      # 46 × 2 × 2
    EXPECTED_TOTAL=736
else
    EXPECTED_STANDARD=276   # 46 × 3 × 2
    EXPECTED_AGENT=92       # 46 × 1 × 2
    EXPECTED_TOTAL=368
fi

# Allow tolerance for filtering/failures (5% standard, 15% agent, 10% total)
MIN_STANDARD=$((EXPECTED_STANDARD * 95 / 100))
MIN_AGENT=$((EXPECTED_AGENT * 85 / 100))  # Agent eval can have more variance
MIN_TOTAL=$((EXPECTED_TOTAL * 90 / 100))

if [[ $STANDARD_COUNT -lt $MIN_STANDARD ]]; then
    echo "⚠️  WARNING: Standard eval produced fewer files than expected"
    echo "   Expected: ~$EXPECTED_STANDARD, Got: $STANDARD_COUNT (minimum: $MIN_STANDARD)"
fi

if [[ $AGENT_COUNT -lt $MIN_AGENT ]]; then
    echo "⚠️  WARNING: Agent eval produced fewer files than expected"
    echo "   Expected: ~$EXPECTED_AGENT, Got: $AGENT_COUNT (minimum: $MIN_AGENT)"
fi

if [[ $TOTAL_COUNT -lt $MIN_TOTAL ]]; then
    echo "⚠️  WARNING: Total file count below expected minimum"
    echo "   Expected: ~$EXPECTED_TOTAL, Got: $TOTAL_COUNT (minimum: $MIN_TOTAL)"
    echo "   This may indicate interrupted runs or configuration issues"
else
    echo "✓ File counts within expected range"
fi

```

### scripts/update_dashboard.sh

```bash
#!/usr/bin/env bash
# Update website benchmark dashboard with new release data

set -euo pipefail

if [[ $# -eq 0 ]]; then
    echo "Usage: $0 <version>" >&2
    echo "Example: $0 0.3.14" >&2
    exit 1
fi

VERSION="$1"

# Normalize version: ensure directory always has "v" prefix
# Strip any existing "v" prefix first, then add it
VERSION_NORMALIZED="${VERSION#v}"
RESULTS_DIR="eval_results/baselines/v$VERSION_NORMALIZED"
MARKDOWN_FILE="docs/docs/benchmarks/performance.md"
JSON_FILE="docs/static/benchmarks/latest.json"

# Verify results directory exists
if [[ ! -d "$RESULTS_DIR" ]]; then
    echo "Error: Results directory not found: $RESULTS_DIR" >&2
    echo "Run eval baseline first with run_eval_baseline.sh" >&2
    exit 1
fi

echo "Updating dashboard for $VERSION..."
echo

# NOTE: We do NOT regenerate performance.md - it's a static template with React components
# The React components read data from latest.json, so we only update the JSON file

# Generate JSON with history preservation (writes to docs/static/benchmarks/latest.json automatically)
echo "1/3 Generating dashboard JSON with history..."
if ailang eval-report "$RESULTS_DIR" "$VERSION" --format=json > /dev/null; then
    echo "  ✓ Written to $JSON_FILE (history preserved)"
else
    echo "  ✗ Failed to generate JSON" >&2
    exit 1
fi
echo

# Validate JSON (version in JSON matches what we passed)
echo "2/3 Validating JSON..."
if VERSION_CHECK=$(jq -r '.version' "$JSON_FILE" 2>/dev/null) && [[ "$VERSION_CHECK" == "$VERSION" ]]; then
    SUCCESS_RATE=$(jq -r '.aggregates.finalSuccess' "$JSON_FILE" 2>/dev/null)
    echo "  ✓ Version: $VERSION_CHECK"
    echo "  ✓ Success rate: $SUCCESS_RATE"
else
    echo "  ✗ JSON validation failed (expected: $VERSION, got: $VERSION_CHECK)" >&2
    exit 1
fi
echo

# Clear Docusaurus cache
echo "3/3 Clearing Docusaurus cache..."
if (cd docs && npm run clear > /dev/null 2>&1); then
    echo "  ✓ Cache cleared"
else
    echo "  ⚠ Cache clear failed (npm may not be installed)"
fi
echo

# Summary
echo "✓ Dashboard JSON updated for $VERSION"
echo "  Data file: $JSON_FILE"
echo "  Template: $MARKDOWN_FILE (static - not modified)"
echo
echo "Next steps:"
echo "  1. Test locally: cd docs && npm start"
echo "  2. Visit: http://localhost:3000/ailang/docs/benchmarks/performance"
echo "  3. Verify radar charts and timeline show $VERSION"
echo "  4. Commit: git add $JSON_FILE"
echo "  5. Commit: git commit -m 'Update benchmark dashboard for $VERSION'"
echo "  6. Push: git push"
echo
echo "Note: performance.md is a static template with React components."
echo "      The components read data from latest.json - no need to regenerate it."

```

### scripts/extract_changelog_metrics.sh

```bash
#!/usr/bin/env bash
# Extract comprehensive benchmark metrics for CHANGELOG
# Extracts: 0-shot, final, repair effectiveness, and agent eval metrics

set -euo pipefail

VERSION="${1:-}"

if [[ -z "$VERSION" ]]; then
    echo "Usage: $0 <version> [results_dir]" >&2
    echo "Example: $0 0.4.1" >&2
    exit 1
fi

# Normalize version: ensure directory always has "v" prefix
VERSION_NORMALIZED="${VERSION#v}"
RESULTS_DIR="${2:-eval_results/baselines/v$VERSION_NORMALIZED}"
JSON_FILE="docs/static/benchmarks/latest.json"

if [[ ! -d "$RESULTS_DIR" ]]; then
    echo "Error: Results directory not found: $RESULTS_DIR" >&2
    exit 1
fi

if [[ ! -f "$RESULTS_DIR/summary.jsonl" ]]; then
    echo "Error: summary.jsonl not found. Run 'ailang eval-summary $RESULTS_DIR' first" >&2
    exit 1
fi

if [[ ! -f "$JSON_FILE" ]]; then
    echo "Error: Dashboard JSON not found: $JSON_FILE" >&2
    echo "Run update_dashboard.sh first" >&2
    exit 1
fi

SUMMARY="$RESULTS_DIR/summary.jsonl"

echo "Extracting metrics from $RESULTS_DIR..."
echo

# Extract standard eval metrics (AILANG only)
AILANG_TOTAL=$(jq -s 'map(select(.lang == "ailang")) | length' "$SUMMARY")
AILANG_0SHOT=$(jq -s 'map(select(.lang == "ailang" and .first_attempt_ok == true)) | length' "$SUMMARY")
AILANG_FINAL=$(jq -s 'map(select(.lang == "ailang" and .stdout_ok == true)) | length' "$SUMMARY")

# Calculate percentages
AILANG_0SHOT_PCT=$(echo "scale=1; $AILANG_0SHOT * 100 / $AILANG_TOTAL" | bc)
AILANG_FINAL_PCT=$(echo "scale=1; $AILANG_FINAL * 100 / $AILANG_TOTAL" | bc)
AILANG_REPAIR_GAIN=$(echo "scale=1; $AILANG_FINAL_PCT - $AILANG_0SHOT_PCT" | bc)

# Python metrics
PYTHON_TOTAL=$(jq -s 'map(select(.lang == "python")) | length' "$SUMMARY")
PYTHON_FINAL=$(jq -s 'map(select(.lang == "python" and .stdout_ok == true)) | length' "$SUMMARY")
PYTHON_FINAL_PCT=$(echo "scale=1; $PYTHON_FINAL * 100 / $PYTHON_TOTAL" | bc)

# Agent eval metrics
AILANG_AGENT_TOTAL=$(jq -s 'map(select(.lang == "ailang" and .eval_mode == "agent")) | length' "$SUMMARY")
AILANG_AGENT_SUCCESS=$(jq -s 'map(select(.lang == "ailang" and .eval_mode == "agent" and .stdout_ok == true)) | length' "$SUMMARY")
AILANG_AGENT_PCT=$(echo "scale=1; $AILANG_AGENT_SUCCESS * 100 / $AILANG_AGENT_TOTAL" | bc 2>/dev/null || echo "0.0")

PYTHON_AGENT_TOTAL=$(jq -s 'map(select(.lang == "python" and .eval_mode == "agent")) | length' "$SUMMARY")
PYTHON_AGENT_SUCCESS=$(jq -s 'map(select(.lang == "python" and .eval_mode == "agent" and .stdout_ok == true)) | length' "$SUMMARY")
PYTHON_AGENT_PCT=$(echo "scale=1; $PYTHON_AGENT_SUCCESS * 100 / $PYTHON_AGENT_TOTAL" | bc 2>/dev/null || echo "0.0")

# Overall success rate
OVERALL_RATE=$(jq -r '.aggregates.finalSuccess' "$JSON_FILE")
TOTAL_RUNS=$(jq -r '.totalRuns' "$JSON_FILE")
OVERALL_PCT=$(echo "$OVERALL_RATE * 100" | bc -l | xargs printf "%.1f")

# Find previous version in history
PREV_VERSION=$(jq -r '.history | sort_by(.timestamp) | reverse | .[1].version // "none"' "$JSON_FILE" 2>/dev/null || echo "none")

# Try to get previous version's summary.jsonl if it exists
# Normalize previous version path too
if [[ "$PREV_VERSION" != "none" ]] && [[ "$PREV_VERSION" != "null" ]]; then
    PREV_VERSION_NORMALIZED="${PREV_VERSION#v}"
    PREV_RESULTS_DIR="eval_results/baselines/v$PREV_VERSION_NORMALIZED"
else
    PREV_RESULTS_DIR="eval_results/baselines/$PREV_VERSION"
fi
PREV_SUMMARY="$PREV_RESULTS_DIR/summary.jsonl"

if [[ "$PREV_VERSION" != "none" ]] && [[ "$PREV_VERSION" != "null" ]] && [[ -f "$PREV_SUMMARY" ]]; then
    # Extract previous metrics
    PREV_AILANG_0SHOT=$(jq -s 'map(select(.lang == "ailang" and .first_attempt_ok == true)) | length' "$PREV_SUMMARY")
    PREV_AILANG_FINAL=$(jq -s 'map(select(.lang == "ailang" and .stdout_ok == true)) | length' "$PREV_SUMMARY")
    PREV_AILANG_TOTAL=$(jq -s 'map(select(.lang == "ailang")) | length' "$PREV_SUMMARY")

    PREV_AILANG_0SHOT_PCT=$(echo "scale=1; $PREV_AILANG_0SHOT * 100 / $PREV_AILANG_TOTAL" | bc)
    PREV_AILANG_FINAL_PCT=$(echo "scale=1; $PREV_AILANG_FINAL * 100 / $PREV_AILANG_TOTAL" | bc)
    PREV_AILANG_REPAIR_GAIN=$(echo "scale=1; $PREV_AILANG_FINAL_PCT - $PREV_AILANG_0SHOT_PCT" | bc)

    PREV_PYTHON_TOTAL=$(jq -s 'map(select(.lang == "python")) | length' "$PREV_SUMMARY")
    PREV_PYTHON_FINAL=$(jq -s 'map(select(.lang == "python" and .stdout_ok == true)) | length' "$PREV_SUMMARY")
    PREV_PYTHON_FINAL_PCT=$(echo "scale=1; $PREV_PYTHON_FINAL * 100 / $PREV_PYTHON_TOTAL" | bc)

    # Agent eval
    PREV_AILANG_AGENT_TOTAL=$(jq -s 'map(select(.lang == "ailang" and .eval_mode == "agent")) | length' "$PREV_SUMMARY")
    PREV_AILANG_AGENT_SUCCESS=$(jq -s 'map(select(.lang == "ailang" and .eval_mode == "agent" and .stdout_ok == true)) | length' "$PREV_SUMMARY")
    PREV_AILANG_AGENT_PCT=$(echo "scale=1; $PREV_AILANG_AGENT_SUCCESS * 100 / $PREV_AILANG_AGENT_TOTAL" | bc 2>/dev/null || echo "0.0")

    PREV_PYTHON_AGENT_TOTAL=$(jq -s 'map(select(.lang == "python" and .eval_mode == "agent")) | length' "$PREV_SUMMARY")
    PREV_PYTHON_AGENT_SUCCESS=$(jq -s 'map(select(.lang == "python" and .eval_mode == "agent" and .stdout_ok == true)) | length' "$PREV_SUMMARY")
    PREV_PYTHON_AGENT_PCT=$(echo "scale=1; $PREV_PYTHON_AGENT_SUCCESS * 100 / $PREV_PYTHON_AGENT_TOTAL" | bc 2>/dev/null || echo "0.0")

    # Calculate changes
    CHANGE_0SHOT=$(echo "scale=1; $AILANG_0SHOT_PCT - $PREV_AILANG_0SHOT_PCT" | bc)
    CHANGE_FINAL=$(echo "scale=1; $AILANG_FINAL_PCT - $PREV_AILANG_FINAL_PCT" | bc)
    CHANGE_REPAIR=$(echo "scale=1; $AILANG_REPAIR_GAIN - $PREV_AILANG_REPAIR_GAIN" | bc)
    CHANGE_PYTHON=$(echo "scale=1; $PYTHON_FINAL_PCT - $PREV_PYTHON_FINAL_PCT" | bc)
    CHANGE_AGENT_AILANG=$(echo "scale=1; $AILANG_AGENT_PCT - $PREV_AILANG_AGENT_PCT" | bc)
    CHANGE_AGENT_PYTHON=$(echo "scale=1; $PYTHON_AGENT_PCT - $PREV_PYTHON_AGENT_PCT" | bc)

    # Format change strings with +/- signs
    [[ $(echo "$CHANGE_0SHOT > 0" | bc) -eq 1 ]] && CHANGE_0SHOT="+$CHANGE_0SHOT"
    [[ $(echo "$CHANGE_FINAL > 0" | bc) -eq 1 ]] && CHANGE_FINAL="+$CHANGE_FINAL"
    [[ $(echo "$CHANGE_REPAIR > 0" | bc) -eq 1 ]] && CHANGE_REPAIR="+$CHANGE_REPAIR"
    [[ $(echo "$CHANGE_PYTHON > 0" | bc) -eq 1 ]] && CHANGE_PYTHON="+$CHANGE_PYTHON"
    [[ $(echo "$CHANGE_AGENT_AILANG > 0" | bc) -eq 1 ]] && CHANGE_AGENT_AILANG="+$CHANGE_AGENT_AILANG"
    [[ $(echo "$CHANGE_AGENT_PYTHON > 0" | bc) -eq 1 ]] && CHANGE_AGENT_PYTHON="+$CHANGE_AGENT_PYTHON"
else
    PREV_VERSION="none"
    PREV_AILANG_0SHOT_PCT="N/A"
    PREV_AILANG_FINAL_PCT="N/A"
    PREV_AILANG_REPAIR_GAIN="N/A"
    PREV_PYTHON_FINAL_PCT="N/A"
    PREV_AILANG_AGENT_PCT="N/A"
    PREV_PYTHON_AGENT_PCT="N/A"
    CHANGE_0SHOT="N/A"
    CHANGE_FINAL="N/A"
    CHANGE_REPAIR="N/A"
    CHANGE_PYTHON="N/A"
    CHANGE_AGENT_AILANG="N/A"
    CHANGE_AGENT_PYTHON="N/A"
fi

# Output CHANGELOG template
echo "=== CHANGELOG.md Template ==="
echo
cat << EOF
### Benchmark Results (M-EVAL)

**Overall Performance**: ${OVERALL_PCT}% success rate ($TOTAL_RUNS total runs)

**Standard Eval (0-shot + self-repair):**

| Metric | $PREV_VERSION | $VERSION | Change |
|--------|--------|--------|--------|
| **0-shot (first attempt)** | ${PREV_AILANG_0SHOT_PCT}% | ${AILANG_0SHOT_PCT}% ($AILANG_0SHOT/$AILANG_TOTAL) | **${CHANGE_0SHOT}%** |
| **Final (with repair)** | ${PREV_AILANG_FINAL_PCT}% | ${AILANG_FINAL_PCT}% ($AILANG_FINAL/$AILANG_TOTAL) | **${CHANGE_FINAL}%** |
| **Repair effectiveness** | +${PREV_AILANG_REPAIR_GAIN}pp | +${AILANG_REPAIR_GAIN}pp | **${CHANGE_REPAIR}pp** |
| **Python (final)** | ${PREV_PYTHON_FINAL_PCT}% | ${PYTHON_FINAL_PCT}% ($PYTHON_FINAL/$PYTHON_TOTAL) | ${CHANGE_PYTHON}% |

**Agent Eval (multi-turn iterative problem solving):**

| Language | $PREV_VERSION | $VERSION | Change |
|----------|--------|--------|--------|
| **AILANG** | ${PREV_AILANG_AGENT_PCT}% | ${AILANG_AGENT_PCT}% ($AILANG_AGENT_SUCCESS/$AILANG_AGENT_TOTAL) | **${CHANGE_AGENT_AILANG}%** |
| **Python** | ${PREV_PYTHON_AGENT_PCT}% | ${PYTHON_AGENT_PCT}% ($PYTHON_AGENT_SUCCESS/$PYTHON_AGENT_TOTAL) | **${CHANGE_AGENT_PYTHON}%** |

**Key Findings:**
[Add analysis based on the data above]

EOF

echo "=== End Template ==="
echo
echo "Template generated for $VERSION (compared to $PREV_VERSION)"

```
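
The `jq -s 'map(select(...)) | length'` idiom used throughout the script slurps the JSONL records into a single array, filters them, and counts the survivors. A minimal sketch of the pattern against a throwaway file (the path and both records below are invented for illustration; the field names match those the script reads):

```bash
# Build a tiny hypothetical summary.jsonl
cat > /tmp/summary.jsonl << 'EOF'
{"lang": "ailang", "eval_mode": "agent", "first_attempt_ok": true, "stdout_ok": true}
{"lang": "python", "eval_mode": "agent", "first_attempt_ok": false, "stdout_ok": false}
EOF

# -s (slurp) wraps the JSONL records in one array; select() filters; length counts
jq -s 'map(select(.lang == "ailang" and .stdout_ok == true)) | length' /tmp/summary.jsonl   # prints 1
jq -s 'map(select(.eval_mode == "agent")) | length' /tmp/summary.jsonl                      # prints 2
```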

### scripts/cleanup_design_docs.sh

```bash
#!/bin/bash
#
# Move design docs from planned/ to implemented/ after a release.
# Only moves docs with "Status: Implemented" in their frontmatter.
# Docs without this status are flagged for review.
#
# Features:
# - Detects duplicates (docs already in implemented/)
# - Detects misplaced docs (Target: field doesn't match folder)
# - Suggests removal of superseded docs
#
# Usage: cleanup_design_docs.sh <version> [--dry-run] [--force] [--check-only]
# Example: cleanup_design_docs.sh 0.5.7
#          cleanup_design_docs.sh 0.5.7 --dry-run     # Preview without changes
#          cleanup_design_docs.sh 0.5.7 --force       # Move all, ignore status
#          cleanup_design_docs.sh 0.5.7 --check-only  # Only report issues, don't move
#

set -e

VERSION="${1:-}"
DRY_RUN=false
FORCE=false
CHECK_ONLY=false

# Check for flags
for arg in "$@"; do
    case "$arg" in
        --dry-run) DRY_RUN=true ;;
        --force) FORCE=true ;;
        --check-only) CHECK_ONLY=true ;;
    esac
done

# Remove 'v' prefix if present
VERSION="${VERSION#v}"

if [ -z "$VERSION" ]; then
    echo "Usage: $0 <version> [--dry-run] [--force] [--check-only]"
    echo "Example: $0 0.5.7"
    echo "         $0 0.5.7 --dry-run     # Preview without changes"
    echo "         $0 0.5.7 --force       # Move all docs, ignore status"
    echo "         $0 0.5.7 --check-only  # Report issues only"
    exit 1
fi

# Convert version to folder format (0.5.7 -> v0_5_7)
FOLDER_VERSION="v${VERSION//./_}"

PLANNED_DIR="design_docs/planned"
IMPLEMENTED_DIR="design_docs/implemented"

echo "Design Doc Cleanup for v${VERSION}"
echo "=================================="
echo ""

if [ "$CHECK_ONLY" = true ]; then
    echo "Mode: CHECK ONLY (reporting issues, no changes)"
    echo ""
elif [ "$DRY_RUN" = true ]; then
    echo "Mode: DRY RUN (no changes will be made)"
    echo ""
fi

if [ "$FORCE" = true ] && [ "$CHECK_ONLY" = false ]; then
    echo "Mode: FORCE (moving all docs regardless of status)"
    echo ""
fi

# Check if planned folder for this version exists
if [ ! -d "${PLANNED_DIR}/${FOLDER_VERSION}" ]; then
    echo "No ${PLANNED_DIR}/${FOLDER_VERSION}/ folder found."
    echo "Nothing to move."
    exit 0
fi

# Count docs
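# (tr -d ' ' strips the leading whitespace padding that BSD/macOS wc emits)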
DOC_COUNT=$(find "${PLANNED_DIR}/${FOLDER_VERSION}" -name "*.md" -type f 2>/dev/null | wc -l | tr -d ' ')

if [ "$DOC_COUNT" -eq 0 ]; then
    echo "No design docs found in ${PLANNED_DIR}/${FOLDER_VERSION}/"
    echo "Nothing to move."
    exit 0
fi

echo "Checking ${DOC_COUNT} design doc(s) in ${PLANNED_DIR}/${FOLDER_VERSION}/:"
echo ""

# --- Phase 1: Issue Detection ---
echo "Phase 1: Detecting issues..."
echo ""

DUPLICATE_COUNT=0
MISPLACED_COUNT=0
DUPLICATE_DOCS=""
MISPLACED_DOCS=""

for doc in "${PLANNED_DIR}/${FOLDER_VERSION}"/*.md; do
    if [ -f "$doc" ]; then
        basename="$(basename "$doc")"

        # Check for duplicates (already exists in implemented/)
        if [ -f "${IMPLEMENTED_DIR}/${FOLDER_VERSION}/${basename}" ]; then
            echo "  [DUPLICATE] $basename"
            echo "              Already exists in ${IMPLEMENTED_DIR}/${FOLDER_VERSION}/"
            DUPLICATE_COUNT=$((DUPLICATE_COUNT + 1))
            DUPLICATE_DOCS="$DUPLICATE_DOCS$basename\n"
            continue
        fi

        # Check for misplaced docs (Target: field doesn't match folder version)
        # Handles formats: **Target**: v0.5.10, Target: v0.5.10, **Target:** 0.5.10
        target_version=$(head -20 "$doc" | grep -iE "Target" | grep -oE "v?[0-9]+\.[0-9]+\.[0-9]+" | sed 's/^v//' | head -1 || echo "")
        if [ -n "$target_version" ] && [ "$target_version" != "$VERSION" ]; then
            target_folder="v${target_version//./_}"
            echo "  [MISPLACED] $basename"
            echo "              Target: v${target_version} (folder: ${FOLDER_VERSION})"
            echo "              Should be in: ${PLANNED_DIR}/${target_folder}/"
            MISPLACED_COUNT=$((MISPLACED_COUNT + 1))
            MISPLACED_DOCS="$MISPLACED_DOCS$basename:$target_folder\n"
        fi
    fi
done

if [ "$DUPLICATE_COUNT" -gt 0 ] || [ "$MISPLACED_COUNT" -gt 0 ]; then
    echo ""
    echo "Issues found:"
    [ "$DUPLICATE_COUNT" -gt 0 ] && echo "  - $DUPLICATE_COUNT duplicate(s) (can be deleted)"
    [ "$MISPLACED_COUNT" -gt 0 ] && echo "  - $MISPLACED_COUNT misplaced doc(s) (wrong version folder)"
    echo ""
fi

# In check-only mode, stop here
if [ "$CHECK_ONLY" = true ]; then
    echo "Check complete. Run without --check-only to process docs."
    exit 0
fi

# --- Phase 2: Move implemented docs ---
echo "Phase 2: Processing docs..."
echo ""

MOVED_COUNT=0
NEEDS_REVIEW_COUNT=0
NEEDS_REVIEW_DOCS=""
SKIPPED_DUPLICATE=0
SKIPPED_MISPLACED=0

# Check each doc
for doc in "${PLANNED_DIR}/${FOLDER_VERSION}"/*.md; do
    if [ -f "$doc" ]; then
        basename="$(basename "$doc")"

        # Skip duplicates (already flagged in Phase 1)
        if [ -f "${IMPLEMENTED_DIR}/${FOLDER_VERSION}/${basename}" ]; then
            if [ "$DRY_RUN" = true ]; then
                echo "  [WOULD DELETE] $basename (duplicate)"
            else
                rm "$doc"
                echo "  [DELETED] $basename (duplicate - already in implemented/)"
            fi
            SKIPPED_DUPLICATE=$((SKIPPED_DUPLICATE + 1))
            continue
        fi

        # Check for misplaced docs
        # Handles formats: **Target**: v0.5.10, Target: v0.5.10, **Target:** 0.5.10
        target_version=$(head -20 "$doc" | grep -iE "Target" | grep -oE "v?[0-9]+\.[0-9]+\.[0-9]+" | sed 's/^v//' | head -1 || echo "")
        if [ -n "$target_version" ] && [ "$target_version" != "$VERSION" ]; then
            target_folder="v${target_version//./_}"
            if [ "$DRY_RUN" = true ]; then
                echo "  [WOULD RELOCATE] $basename → ${PLANNED_DIR}/${target_folder}/"
            else
                mkdir -p "${PLANNED_DIR}/${target_folder}"
                mv "$doc" "${PLANNED_DIR}/${target_folder}/"
                echo "  [RELOCATED] $basename → ${PLANNED_DIR}/${target_folder}/"
            fi
            SKIPPED_MISPLACED=$((SKIPPED_MISPLACED + 1))
            continue
        fi

        # Check for "Status: Implemented" in first 15 lines (handles various formats)
        is_implemented=false
        if head -15 "$doc" | grep -qiE "^\*?\*?Status\*?\*?:?\s*\*?\*?Implemented"; then
            is_implemented=true
        fi

        if [ "$FORCE" = true ] || [ "$is_implemented" = true ]; then
            if [ "$DRY_RUN" = true ]; then
                if [ "$is_implemented" = true ]; then
                    echo "  [WOULD MOVE] $basename (Status: Implemented)"
                else
                    echo "  [WOULD MOVE] $basename (forced)"
                fi
            else
                # Create implemented folder if needed
                mkdir -p "${IMPLEMENTED_DIR}/${FOLDER_VERSION}"

                # Move the doc
                mv "$doc" "${IMPLEMENTED_DIR}/${FOLDER_VERSION}/"
                if [ "$is_implemented" = true ]; then
                    echo "  [MOVED] $basename → ${IMPLEMENTED_DIR}/${FOLDER_VERSION}/"
                else
                    echo "  [MOVED] $basename → ${IMPLEMENTED_DIR}/${FOLDER_VERSION}/ (forced)"
                fi
            fi
            MOVED_COUNT=$((MOVED_COUNT + 1))
        else
            # Check what the actual status is
            status_line=$(head -15 "$doc" | grep -iE "^\*?\*?Status\*?\*?:" | head -1 || echo "")
            if [ -n "$status_line" ]; then
                echo "  [NEEDS REVIEW] $basename"
                echo "                 Found: $status_line"
            else
                echo "  [NEEDS REVIEW] $basename (no Status field found)"
            fi
            NEEDS_REVIEW_COUNT=$((NEEDS_REVIEW_COUNT + 1))
            NEEDS_REVIEW_DOCS="$NEEDS_REVIEW_DOCS$basename\n"
        fi
    fi
done

echo ""

# Check if planned folder is now empty and remove it
# (deletions and relocations can also empty it, so count all three operations)
if [ "$DRY_RUN" = false ] && [ $((MOVED_COUNT + SKIPPED_DUPLICATE + SKIPPED_MISPLACED)) -gt 0 ]; then
    remaining=$(find "${PLANNED_DIR}/${FOLDER_VERSION}" -type f 2>/dev/null | wc -l | tr -d ' ')
    if [ "$remaining" -eq 0 ]; then
        rmdir "${PLANNED_DIR}/${FOLDER_VERSION}" 2>/dev/null || true
        echo "Removed empty ${PLANNED_DIR}/${FOLDER_VERSION}/ folder"
        echo ""
    fi
fi

# Summary
echo "Summary:"
if [ "$DRY_RUN" = true ]; then
    [ "$SKIPPED_DUPLICATE" -gt 0 ] && echo "  Would delete: $SKIPPED_DUPLICATE duplicate(s)"
    [ "$SKIPPED_MISPLACED" -gt 0 ] && echo "  Would relocate: $SKIPPED_MISPLACED misplaced doc(s)"
    [ "$MOVED_COUNT" -gt 0 ] && echo "  Would move: $MOVED_COUNT doc(s) to implemented/"
    [ "$NEEDS_REVIEW_COUNT" -gt 0 ] && echo "  Needs review: $NEEDS_REVIEW_COUNT doc(s)"
    echo ""
    echo "Run without --dry-run to apply changes."
else
    if [ "$SKIPPED_DUPLICATE" -gt 0 ]; then
        echo "  ✓ Deleted $SKIPPED_DUPLICATE duplicate(s)"
    fi
    if [ "$SKIPPED_MISPLACED" -gt 0 ]; then
        echo "  ✓ Relocated $SKIPPED_MISPLACED doc(s) to correct version folder"
    fi
    if [ "$MOVED_COUNT" -gt 0 ]; then
        echo "  ✓ Moved $MOVED_COUNT doc(s) to ${IMPLEMENTED_DIR}/${FOLDER_VERSION}/"
    fi
    if [ "$NEEDS_REVIEW_COUNT" -gt 0 ]; then
        echo "  ⚠ $NEEDS_REVIEW_COUNT doc(s) need review (not marked as Implemented)"
        echo ""
        echo "To fix docs needing review:"
        echo "  1. Update the doc's Status field to 'Implemented'"
        echo "  2. Re-run this script"
        echo "  Or use --force to move regardless of status"
    fi
fi

TOTAL_CHANGES=$((SKIPPED_DUPLICATE + SKIPPED_MISPLACED + MOVED_COUNT))
if [ "$TOTAL_CHANGES" -gt 0 ] && [ "$DRY_RUN" = false ]; then
    echo ""
    echo "Next steps:"
    echo "  git add design_docs/"
    echo "  git commit -m 'docs: cleanup design docs for ${FOLDER_VERSION}'"
fi

```
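
A typical invocation sequence, mirroring the examples in the script's header (run from the repo root, since the `design_docs/` paths are relative):

```bash
# Preview what would happen for the v0.5.7 release, with no changes made
./scripts/cleanup_design_docs.sh 0.5.7 --dry-run

# Report duplicates and misplaced docs only, moving nothing
./scripts/cleanup_design_docs.sh 0.5.7 --check-only

# Apply the moves; only docs with "Status: Implemented" in their frontmatter move
./scripts/cleanup_design_docs.sh 0.5.7

# Escape hatch: move every doc regardless of its Status field
./scripts/cleanup_design_docs.sh 0.5.7 --force
```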
