
Eval Analyzer

Identify AILANG language gaps from agent struggles, analyze eval baselines, and generate actionable insights. PRIMARY PURPOSE is finding what stdlib/prompt improvements would help agents succeed. Use when analyzing eval results, checking benchmarks, or investigating failures.

Packaged view

This page reorganizes the original catalog entry to put fit, installability, and workflow context first. The original raw source lives below.

Stars: 22
Hot score: 88
Updated: March 20, 2026
Overall rating: C (2.6)
Composite score: 2.6
Best-practice grade: C (55.6)

Install command

npx @skill-hub/cli install sunholo-data-ailang-eval-analyzer
Tags: analysis, optimization, debugging, evaluation, ai-agents

Repository

sunholo-data/ailang

Skill path: .claude/skills/eval-analyzer

Identify AILANG language gaps from agent struggles, analyze eval baselines, and generate actionable insights. PRIMARY PURPOSE is finding what stdlib/prompt improvements would help agents succeed. Use when analyzing eval results, checking benchmarks, or investigating failures.

Open repository

Best for

Primary workflow: Analyze Data & AI.

Technical facets: Full Stack, Data / AI.

Target audience: everyone.

License: Unknown.

Original source

Catalog source: SkillHub Club.

Repository owner: sunholo-data.

This is still a mirrored public skill entry. Review the repository before installing into production workflows.

What it helps with

  • Install Eval Analyzer into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
  • Review https://github.com/sunholo-data/ailang before adding Eval Analyzer to shared team environments
  • Use Eval Analyzer for development workflows

Works across

Claude Code, Codex CLI, Gemini CLI, OpenCode

Favorites: 0.

Sub-skills: 0.

Aggregator: No.

Original source / Raw SKILL.md

---
name: Eval Analyzer
description: Identify AILANG language gaps from agent struggles, analyze eval baselines, and generate actionable insights. PRIMARY PURPOSE is finding what stdlib/prompt improvements would help agents succeed. Use when analyzing eval results, checking benchmarks, or investigating failures.
---

# Eval Analyzer

**Primary goal**: Identify AILANG language gaps from what agents struggle with → drives stdlib additions and prompt improvements.

Secondary: Analyze eval baseline results, compare model performance, track success rates.

## Quick Start

**Language gap analysis** (most valuable):
```bash
# Find what stdlib/prompt improvements would help agents
.claude/skills/eval-analyzer/scripts/find_language_gaps.sh eval_results/baselines/v0.6.2

# Output shows:
# - Functions agents searched for but couldn't find
# - Undefined variable errors (hallucinated functions)
# - Type confusion patterns
# - Benchmarks with stuck loops (high turn count)
# - Mapping of hallucinated names to actual builtins
```

**Standard analysis:**
```bash
# User says: "Analyze the v0.3.24 eval results"
# This skill will:
# 1. Run find_language_gaps.sh to identify AILANG improvements needed
# 2. Run eval-analyze to categorize failures
# 3. Run agent KPIs to analyze efficiency
# 4. Identify top failing benchmarks
# 5. Generate actionable recommendations
```

**For agent evaluation analysis** (NEW - optimization focus):
```bash
# Step 1: Get efficiency metrics (turns, tokens, cost)
.claude/skills/eval-analyzer/scripts/agent_kpis.sh eval_results/baselines/v0.3.24

# Step 2: Investigate expensive benchmarks
.claude/skills/eval-analyzer/scripts/agent_transcripts.sh eval_results/baselines/v0.3.24 simple_print

# Step 3: Compare Python vs AILANG
./tools/compare_agents.sh eval_results/baselines/v0.3.24
```

**See [`resources/agent_optimization_guide.md`](resources/agent_optimization_guide.md) for complete optimization strategies.**

## Language Gap Analysis (PRIMARY GOAL)

**The most valuable output of eval analysis is identifying AILANG language gaps** - what agents struggle with that reveals missing stdlib functions, undocumented features, or prompt gaps.

### Why This Matters

When an agent fails after many turns, the transcript reveals what it was *trying* to do:

```
[config_file_parser - 22 turns]
Turn 14: "Perfect! There's floatToInt in std/prelude" -- HALLUCINATED!
Turn 15: "It looks like floatToInt is a builtin, not from a module"
Turn 18: "Let me add a floatToInt using basic arithmetic" -- broken workaround
```

**Insight**: Agent knows what it needs (`floatToInt`) but can't find it → **stdlib gap**.

### Language Gap Workflow

```bash
# Step 1: Find stuck loops - agents searching for functions
cat eval_results/baselines/v0.6.2/agent/*_ailang_*.json | \
  jq -r 'select(.stdout_ok == false) | .stderr' | \
  grep -i "let me check\|what function\|is available\|undefined variable" | head -20

# Step 2: Check if builtin exists for hallucinated function
ailang builtins list | grep -i "float\|int"

# Step 3: Check if stdlib wrapper exists
grep -i "floatToInt" std/*.ail

# Step 4: If builtin exists but wrapper doesn't → Add stdlib wrapper
# Step 5: If wrapper exists but agent didn't know → Update prompt
```

### Gap Pattern Categories

| Agent Behavior | Gap Type | Fix |
|----------------|----------|-----|
| "Let me check what X is available" then fails | Missing stdlib wrapper | Add wrapper to std/ |
| Uses function that exists but wrong name | Undocumented | Document in prompt |
| Tries Python syntax in AILANG | Prompt gap | Add AILANG examples |
| 10+ turns on same type error | Type confusion | Add type examples to prompt |
| `undefined variable: floatToInt` | Missing wrapper | Add `floatToInt = _float_to_int` |

### Example Gap Report

After analysis, produce actionable output:

```markdown
## Missing Wrappers (builtin exists, wrapper doesn't)
| Function | Builtin | Add to | Impact |
|----------|---------|--------|--------|
| floatToInt | _float_to_int | std/math | 3 benchmarks |
| intToFloat | _int_to_float | std/math | 2 benchmarks |

## Undocumented (exists but agents don't know)
| Function | Module | Agents looked for |
|----------|--------|-------------------|
| substring | std/string | stringSlice, slice |
| contains | std/string | includes, has |

## Prompt Gaps (syntax confusion)
| Issue | Example | Add to prompt |
|-------|---------|---------------|
| Json vs string | `get(str, key)` fails | Json type examples |
| String not list | `match s { [a,b] => }` | String handling section |
```

**See [design_docs/planned/v0_6_5/m-eval-gap-analysis.md](../../../design_docs/planned/v0_6_5/m-eval-gap-analysis.md) for full analysis example.**

## When to Use This Skill

Invoke this skill when:
- User asks to "analyze eval results", "check benchmarks", "what's failing"
- After running an eval baseline
- When investigating why benchmark performance changed
- User wants to understand failure patterns or model performance
- Comparing two versions of AILANG
- **Identifying AILANG language gaps** from what agents struggle with

## Key Eval Commands

All commands work on baseline directories like `eval_results/baselines/v0.3.16/`.

### 1. Quick Overview - `eval-matrix`

Shows comprehensive statistics with model/language breakdowns.

```bash
ailang eval-matrix eval_results/baselines/v0.3.16 0.3.16 | head -60
```

**Shows**: Overall stats, per-model performance, per-language breakdown, top error codes.

### 2. Detailed Analysis - `eval-analyze`

Categorizes failures and can generate design docs for issues.

```bash
# Dry run (no design docs, just analysis)
ailang eval-analyze -results eval_results/baselines/v0.3.16 -dry-run

# Full analysis with design doc generation
ailang eval-analyze -results eval_results/baselines/v0.3.16
```

**⚠️ CRITICAL**: Must use `-results` flag, NOT positional argument!

**Output**: Categorized failures (compile_error, logic_error, runtime_error) with frequency, affected benchmarks, models, and sample errors.

### 3. Query-Friendly Summary - `eval-summary`

Generates JSONL for easy querying with jq.

```bash
ailang eval-summary eval_results/baselines/v0.3.16
```

**Output**: `eval_results/baselines/v0.3.16/summary.jsonl`

### 4. Compare Versions - `eval-compare`

Shows what changed between two versions.

```bash
ailang eval-compare eval_results/baselines/v0.3.15 eval_results/baselines/v0.3.16
```

### 5. Fair Comparison (RECOMMENDED) - `fair_comparison.py`

**Use this for accurate version comparisons!** The `eval-compare` command may include duplicates or different model sets. This script normalizes the data for an apples-to-apples comparison.

```bash
.claude/skills/eval-analyzer/scripts/fair_comparison.py
```

**What it does:**
- Deduplicates runs (keeps last run per benchmark+model)
- Filters to dev models only (gpt5-mini, claude-haiku-4-5, gemini-2-5-flash)
- AILANG only (ignores Python results)
- Shows net fixes vs regressions
- Per-model breakdown

**Output:**
```
v0.4.0: 56/123 = 45.5%
v0.4.2: 59/123 = 48.0%
Delta:  +3 (+2.4pp)

✅ Fixed:   11 benchmarks
❌ Broken:  8 benchmarks
NET:       +3 benchmarks
```

**When to use:** Before making decisions based on eval results (e.g., reverting changes, merging PRs).
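
As a rough illustration of the dedup step above, here is a jq sketch (assuming the `.id`/`.lang`/`.model` fields used in `resources/jq_queries.md`; `fair_comparison.py` itself reads the raw result files and may use different field names):

```bash
# Keep only the last run per benchmark+model pair
# (file order is assumed to reflect run order)
ailang eval-summary eval_results/baselines/v0.4.2
jq -s '
  map(select(.lang == "ailang")) |
  group_by(.id + "/" + .model) |
  map(last)
' eval_results/baselines/v0.4.2/summary.jsonl
```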

### 6. Validate Results - `validate_eval_results.py`

**Check for output corruption and race conditions** in eval results.

```bash
python3 tools/validate_eval_results.py eval_results/baselines/v0.4.2
```

**Checks:**
- Output corruption (fibonacci outputting "All results equal", etc.)
- Duplicate runs for same benchmark+model
- Code hash validation (if available)
- Success rate statistics

**When to use:** After running eval baselines, especially if results look suspicious.
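
A quick cross-check for the duplicate-run case can also be done directly on `summary.jsonl` (a sketch using the `.id`/`.lang`/`.model` fields from `resources/jq_queries.md`; the Python script remains the authoritative check):

```bash
ailang eval-summary eval_results/baselines/v0.4.2
# List benchmark+model pairs that appear more than once
jq -s '
  group_by(.id + "/" + .lang + "/" + .model) |
  map(select(length > 1) | {id: .[0].id, model: .[0].model, runs: length})
' eval_results/baselines/v0.4.2/summary.jsonl
```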

## Agent Analysis Scripts (NEW!)

**For agent-based evaluation results** (Python vs AILANG comparisons with Claude Code):

### 1. Agent KPIs - Minimize Tokens & Turns

Shows efficiency metrics for agent runs - **key for optimizing language and prompts**.

```bash
.claude/skills/eval-analyzer/scripts/agent_kpis.sh eval_results/WITH_ALL_FIXES
```

**Output:**
- Average turns, tokens, cost by language (Python vs AILANG)
- Most expensive benchmarks (by turns) - candidates for optimization
- Most efficient benchmarks - learn from these
- Success rates and performance comparison

**Goal**: Minimize agent turns and tokens → indicates clearer prompts and simpler language.

### 2. Agent Transcripts - View AILANG Conversations

View full agent conversation logs to understand what happened.

```bash
# View all transcripts
.claude/skills/eval-analyzer/scripts/agent_transcripts.sh eval_results/WITH_ALL_FIXES

# View only failures
.claude/skills/eval-analyzer/scripts/agent_transcripts.sh eval_results/WITH_ALL_FIXES --failed-only

# View specific benchmark
.claude/skills/eval-analyzer/scripts/agent_transcripts.sh eval_results/WITH_ALL_FIXES fizzbuzz
```

**Output:**
- Turn-by-turn conversation showing agent's thought process
- Metrics: turns, tokens, duration
- Success/failure status with error category
- First 100 lines of transcript (with hint to view full)

**Use for**: Understanding why AILANG solutions fail or take many turns.

### 3. Python vs AILANG Comparison

Use the existing `tools/compare_agents.sh` script for side-by-side comparison:

```bash
./tools/compare_agents.sh eval_results/WITH_ALL_FIXES
```

**Output:**
- Side-by-side metrics table
- Solution code comparison
- Transcripts for failed solutions (automatic)
- Winner indicators for each metric

## Standard Eval Workflow (Non-Agent)

### Step 1: Get High-Level Overview

```bash
# Show overall statistics
ailang eval-matrix eval_results/baselines/v0.3.16 0.3.16 | head -60
```

**Look for:**
- Overall success rate (target: >60%)
- AILANG vs Python gap (current: ~54%)
- Model performance variance
- Top error codes

### Step 2: Identify Problem Areas

```bash
# Categorize all failures
ailang eval-analyze -results eval_results/baselines/v0.3.16 -dry-run
```

**Key metrics:**
- compile_error frequency (parse/syntax issues)
- logic_error frequency (wrong output)
- runtime_error frequency (crashes)
- Which benchmarks fail most

### Step 3: Query with jq (Custom Analysis)

Use jq queries on summary.jsonl for custom analysis:

```bash
# Ensure summary exists
ailang eval-summary eval_results/baselines/v0.3.20

# AILANG-only success rate (all models)
jq -s 'map(select(.lang == "ailang")) |
  {total: length, success: (map(select(.stdout_ok == true)) | length),
   rate: ((map(select(.stdout_ok == true)) | length) * 100.0 / length)}' \
  eval_results/baselines/v0.3.20/summary.jsonl

# Dev models only (useful for prompt testing)
jq -s 'map(select(.lang == "ailang" and
  (.model == "gpt5-mini" or .model == "claude-haiku-4-5" or .model == "gemini-2-5-flash"))) |
  {total: length, success: (map(select(.stdout_ok == true)) | length),
   rate: ((map(select(.stdout_ok == true)) | length) * 100.0 / length)}' \
  eval_results/baselines/v0.3.20/summary.jsonl

# Check specific benchmark across all models
jq -s 'map(select(.benchmark == "explicit_state_threading" and .lang == "ailang")) |
  map({model, success: .stdout_ok, error: .error_category})' \
  eval_results/baselines/v0.3.20/summary.jsonl

# Compare two versions (dev models AILANG-only)
jq -s 'map(select(.lang == "ailang" and
  (.model == "gpt5-mini" or .model == "claude-haiku-4-5" or .model == "gemini-2-5-flash"))) |
  {total: length, success: (map(select(.stdout_ok == true)) | length),
   rate: ((map(select(.stdout_ok == true)) | length) * 100.0 / length)}' \
  eval_results/baselines/v0.3.20/summary.jsonl \
  eval_results/baselines/v0.3.21/summary.jsonl
```

**For more jq patterns**, see [`resources/jq_queries.md`](resources/jq_queries.md)

### Step 4: Deep Dive with Helper Scripts

Use the provided helper scripts for detailed code inspection:

```bash
# Failure analysis with error categorization
.claude/skills/eval-analyzer/scripts/analyze_failures.sh eval_results/baselines/v0.3.16

# Model performance comparison
.claude/skills/eval-analyzer/scripts/compare_models.sh eval_results/baselines/v0.3.16

# Examine specific benchmark failures
.claude/skills/eval-analyzer/scripts/examine_code.sh eval_results/baselines/v0.3.16 api_call_json
```

### Step 5: Compare with Previous Version

```bash
# Show regressions and improvements
ailang eval-compare eval_results/baselines/v0.3.15 eval_results/baselines/v0.3.16
```

### Step 6: Generate Insights

Based on the data, identify:

1. **Systemic Issues**: Categories with >50 failures
2. **Model Patterns**: Which models struggle with which features
3. **Benchmark Hotspots**: Benchmarks with 100% failure rate
4. **Cost Efficiency**: Which models give best success/cost ratio
5. **Trends**: Improvements or regressions vs previous version
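
For item 3 (benchmark hotspots), the "benchmarks where all models fail" query from `resources/jq_queries.md` can be reused directly:

```bash
# Benchmarks with zero AILANG successes across all models
jq -s '
  map(select(.lang == "ailang")) |
  group_by(.id) |
  map({
    benchmark: .[0].id,
    success_count: (map(select(.stdout_ok)) | length),
    total: length
  }) |
  map(select(.success_count == 0))
' eval_results/baselines/v0.3.16/summary.jsonl
```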

## Key Metrics to Track

1. **Overall Success Rate**: AILANG vs Python gap (target: reduce below 50%)
2. **Error Code Distribution**:
   - PAR_001 (parse errors) - indicates prompt/syntax issues
   - WRONG_LANG - models writing Python instead of AILANG
   - IMPERATIVE - models using imperative patterns
3. **Model Performance**: Which models work best with AILANG
4. **Benchmark-Level**: Which benchmarks consistently fail
5. **Cost Efficiency**: Success rate per dollar spent
6. **Repair Success**: Is self-repair helping? (currently low)
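
For metric 5, a rough successes-per-dollar breakdown can be derived from `summary.jsonl` (a sketch assuming the `cost_usd`, `stdout_ok`, and `model` fields used in `resources/jq_queries.md`):

```bash
# Successes per dollar by model, best ratio first
jq -s '
  map(select(.lang == "ailang")) |
  group_by(.model) |
  map({
    model: .[0].model,
    successes: (map(select(.stdout_ok)) | length),
    total_cost: (map(.cost_usd) | add),
    successes_per_dollar: ((map(select(.stdout_ok)) | length) / (map(.cost_usd) | add))
  }) |
  sort_by(-.successes_per_dollar)
' eval_results/baselines/v0.3.16/summary.jsonl
```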

## Common Issues

### Issue 1: "Total Runs: 6" instead of 408

**Symptom**: eval-analyze only finds 6 results

**Cause**: Used positional argument instead of `-results` flag

**Solution**:
```bash
# ❌ WRONG
ailang eval-analyze eval_results/baselines/v0.3.16

# ✅ CORRECT
ailang eval-analyze -results eval_results/baselines/v0.3.16
```

### Issue 2: Summary file not found

**Symptom**: jq queries fail with "file not found"

**Cause**: Need to run eval-summary first

**Solution**:
```bash
ailang eval-summary eval_results/baselines/v0.3.16
```

### Issue 3: Design docs not generated

**Symptom**: eval-analyze shows issues but doesn't create docs

**Cause**: Using `-dry-run` flag

**Solution**: Run without `-dry-run` to generate design docs
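
For example, re-run the same command without the flag:

```bash
ailang eval-analyze -results eval_results/baselines/v0.3.16
```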

## Helper Scripts

The skill includes helper scripts in `scripts/` directory:

### quick_summary.sh

Fast overview using eval-matrix.

```bash
.claude/skills/eval-analyzer/scripts/quick_summary.sh eval_results/baselines/v0.3.16
```

**Output**: Overall stats, model performance, language breakdown, top error codes.

### analyze_failures.sh

Detailed failure analysis with error categorization.

```bash
.claude/skills/eval-analyzer/scripts/analyze_failures.sh eval_results/baselines/v0.3.16 ailang
```

**Output**: Overall statistics, error categories, top failing benchmarks, model performance, error codes.

### compare_models.sh

Model-by-model performance comparison.

```bash
.claude/skills/eval-analyzer/scripts/compare_models.sh eval_results/baselines/v0.3.16
```

**Output**: Success rates, first-attempt vs final, cost analysis, token usage, best model per benchmark.

### examine_code.sh

Inspect generated code from specific benchmarks.

```bash
.claude/skills/eval-analyzer/scripts/examine_code.sh eval_results/baselines/v0.3.16 api_call_json
.claude/skills/eval-analyzer/scripts/examine_code.sh eval_results/baselines/v0.3.16 api_call_json gpt5
```

**Output**: Generated code, compiler errors, success status, error codes for each model run.

### examine_prompts.sh

View prompts used for specific benchmarks.

```bash
.claude/skills/eval-analyzer/scripts/examine_prompts.sh eval_results/baselines/v0.3.16 api_call_json
```

**Output**: System prompt, user prompt, success status for benchmark runs.

### verify_prompt_accuracy.sh

Check if prompt documentation matches actual implementation.

```bash
.claude/skills/eval-analyzer/scripts/verify_prompt_accuracy.sh v0.3.16
```

**Output**: Reports false limitations, undocumented features, and prompt-code mismatches.

**Use this**: After creating new prompt versions to catch documentation bugs!

## Resources

### Analysis Documents

- [`resources/failure_analysis_v0.3.16.md`](resources/failure_analysis_v0.3.16.md) - Comprehensive analysis of v0.3.16 eval results with root cause analysis

### Common jq Patterns

See [`resources/jq_queries.md`](resources/jq_queries.md) for more query examples and patterns.

## Progressive Disclosure

This skill loads information progressively:

1. **Always loaded**: This SKILL.md file (workflow + commands + scripts)
2. **Execute as needed**: `ailang eval-*` commands and helper scripts
3. **Load on demand**: `resources/jq_queries.md`, analysis documents

## Notes

- All eval commands work offline (no API calls for analysis)
- `eval-analyze` generates design docs using LLM (default: gpt5)
- Summary JSONL format is stable and queryable
- Use `-dry-run` to preview before generating design docs
- Baseline directories typically live at `eval_results/baselines/vX.X.X/`
- This skill complements `post-release` skill (which runs baselines)


---

## Referenced Files

> The following files are referenced in this skill and included for context.

### resources/agent_optimization_guide.md

```markdown
# Agent Evaluation Optimization Guide

**Goal**: Minimize conversation turns and tokens for AILANG agent evaluations to match or beat Python performance.

## Quick Start

After running agent evals (as part of post-release), analyze optimization opportunities:

```bash
# Step 1: Get agent KPIs (turns, tokens, cost by language)
.claude/skills/eval-analyzer/scripts/agent_kpis.sh eval_results/baselines/v0.3.24

# Step 2: View expensive benchmarks to understand why they're costly
.claude/skills/eval-analyzer/scripts/agent_transcripts.sh eval_results/baselines/v0.3.24 record_update

# Step 3: Compare Python vs AILANG side-by-side
./tools/compare_agents.sh eval_results/baselines/v0.3.24
```

## Understanding Agent KPIs

### Key Metrics

**From `agent_kpis.sh` output:**

```
🐍 Python (5 benchmarks, 100.0% success)
   Avg Turns:        10.6 (lower is better)
   Avg Input Tokens: 58417 (lower is better)
   Avg Output Tokens: 712
   Avg Cost:         $.0133
   Total Cost:       $.06651080

🔷 AILANG (5 benchmarks, 100.0% success)
   Avg Turns:        18.0 (lower is better)
   Avg Input Tokens: 178435 (lower is better)
   Avg Output Tokens: 1662
   Avg Cost:         $.0452
   Total Cost:       $.226134400000000006
```

### What the Numbers Mean

**Turns**:
- Number of back-and-forth exchanges between agent and task
- **Lower = simpler language, clearer prompts, fewer mistakes**
- Python 10.6 turns vs AILANG 18.0 turns = **1.7x gap**

**Input Tokens**:
- Cumulative prompt + context tokens across all turns
- Includes: teaching prompt, task description, previous attempts, errors
- **Lower = less context needed, more efficient communication**
- Python 58k vs AILANG 178k = **3.0x gap**

**Output Tokens**:
- Code generated by agent + explanations
- **Higher output with fewer turns = more confident, complete solutions**

**Gap Analysis**:
- AILANG taking 1.7x more turns → Language has rough edges OR prompt unclear
- AILANG using 3.0x more tokens → More context needed to write correct code

## Optimization Opportunities

### 1. Language Simplification

**Symptom**: High turn count, many compilation/runtime errors in early turns

**Indicators** (from `agent_transcripts.sh`):
```
[TURN 1] Error: "module declaration doesn't match canonical path"
[TURN 2] Fixed module, now: "print is not in scope"
[TURN 3] Added import, now: "type mismatch: expected float, got int"
[TURN 4] Added type annotations...
```

**Root Causes**:
- Module system too strict (path must match module name exactly)
- Import syntax not intuitive (agent forgets std/io)
- Type errors require explicit annotations (no coercion)
- Effect system confusing (when to use ! {IO} vs pure functions)

**Optimization Strategies**:
1. **Relax module rules**: Allow flexible module paths or infer from file location
2. **Auto-import prelude**: Make print, show available without imports
3. **Better type inference**: Reduce need for explicit annotations
4. **Clearer effect rules**: Document when effects required vs optional

**Measure Impact**:
```bash
# Before: AILANG 18.0 avg turns
# After fix: Re-run agent eval, expect turns to drop to ~12-14
.claude/skills/post-release/scripts/run_eval_baseline.sh v0.3.25
.claude/skills/eval-analyzer/scripts/agent_kpis.sh eval_results/baselines/v0.3.25
```

### 2. Prompt Optimization

**Symptom**: Agent makes same mistakes across multiple benchmarks

**Indicators** (from `agent_kpis.sh "Most Expensive"`):
```
✅ 🔷 simple_print              ailang       25 turns
✅ 🔷 record_update             ailang       19 turns
✅ 🔷 recursion_factorial       ailang       16 turns
```

**Root Causes**:
- Prompt doesn't emphasize critical rules (module path, imports, effects)
- Examples in prompt use outdated syntax
- Prompt too long → agent misses important details
- Prompt doesn't show common error patterns + fixes

**Optimization Strategies**:
1. **Add "Common Mistakes" section** to prompt with quick fixes
2. **Shorten examples** - keep only most relevant ones
3. **Use bold/emphasis** for critical rules (module path, imports)
4. **Add success patterns** - show working code for common tasks

**Test Prompt Changes**:
```bash
# Edit prompts/v0.3.24.md with improvements
# Run dev models only (cheap, fast) to validate
ailang eval-suite --agent \
  --benchmarks simple_print,record_update,recursion_factorial \
  --langs ailang \
  --agent-parallel 1 \
  --output eval_results/prompt_test_v0.3.24 \
  --prompt-version v0.3.24

# Compare with previous version
.claude/skills/eval-analyzer/scripts/agent_kpis.sh eval_results/prompt_test_v0.3.24
```

### 3. Error Message Improvements

**Symptom**: Agent takes many turns to fix same error type

**Indicators** (from transcripts):
```
[TURN 2] Stderr: "type error at line 5"
[TURN 3] Stderr: "type error at line 5"  # Same error!
[TURN 4] Stderr: "type error at line 5"  # Still same error!
```

**Root Causes**:
- Error messages too vague ("type error" - which types?)
- No hints about how to fix
- No mention of inferred vs declared types
- Stack traces confusing for AI agents

**Optimization Strategies**:
1. **More specific errors**: "Expected type float, but got int (add type annotation or cast)"
2. **Show context**: Include surrounding code in error
3. **Suggest fixes**: "Did you mean to use show(x) instead of x?"
4. **AI-friendly format**: Structured errors with clear fields

**Track Error Turn Counts**:
```bash
# Find benchmarks with repeated errors
.claude/skills/eval-analyzer/scripts/agent_transcripts.sh eval_results/baselines/v0.3.24 --failed-only
# Look for same error appearing in consecutive turns
```

### 4. Documentation Gaps

**Symptom**: Agent asks questions or uses outdated syntax

**Indicators** (from transcripts):
```
[TURN 1] Agent: "How do I read a file in AILANG?"
[TURN 3] Agent: "The file import failed, trying different syntax..."
[TURN 5] Agent: "Looking for alternative approach..."
```

**Root Causes**:
- Documentation missing for common operations (file I/O, JSON parsing)
- Examples don't cover real-world use cases
- API changes not reflected in teaching prompt
- No migration guide from similar languages (Python/Haskell)

**Optimization Strategies**:
1. **Expand examples** - cover file I/O, JSON, records, effects
2. **Add "Coming from Python" section** - translate common patterns
3. **Keep prompt in sync** - automated checks for outdated syntax
4. **Test prompt accuracy** - use verify_prompt_accuracy.sh

**Validate Documentation**:
```bash
# Check if prompt matches implementation
.claude/skills/eval-analyzer/scripts/verify_prompt_accuracy.sh v0.3.24
```

## Benchmark-Specific Optimization

### High-Cost Benchmarks (from "Most Expensive")

Use `agent_transcripts.sh` to investigate why specific benchmarks are costly:

```bash
# View full conversation for expensive benchmark
.claude/skills/eval-analyzer/scripts/agent_transcripts.sh eval_results/baselines/v0.3.24 simple_print
```

**Common Patterns**:

**Module path issues (simple_print: 25 turns)**:
- Fix: Auto-infer module path from file location
- Fix: Better error message explaining canonical path rule

**Record update syntax (record_update: 19 turns)**:
- Fix: Add record update to prompt examples
- Fix: Document { r | field: newValue } syntax clearly

**Recursion + types (recursion_factorial: 16 turns)**:
- Fix: Show how to annotate recursive functions
- Fix: Explain when return type annotation required

## Success Metrics

**Target Goals** (based on Python baseline):

| Metric | Current AILANG | Target | Python Baseline |
|--------|---------------|--------|-----------------|
| Avg Turns | 18.0 | **≤12.0** | 10.6 |
| Avg Input Tokens | 178k | **≤80k** | 58k |
| Success Rate | 100% | **100%** | 100% |
| Avg Cost | $0.045 | **≤$0.020** | $0.013 |

**How to Measure Progress**:

```bash
# Baseline v0.3.24
.claude/skills/eval-analyzer/scripts/agent_kpis.sh eval_results/baselines/v0.3.24

# After language improvements
.claude/skills/post-release/scripts/run_eval_baseline.sh v0.3.25
.claude/skills/eval-analyzer/scripts/agent_kpis.sh eval_results/baselines/v0.3.25

# Calculate improvement
# Turns: 18.0 → 14.5 = 19% reduction ✅
# Tokens: 178k → 120k = 33% reduction ✅
```

## Workflow Summary

**Step 1: Identify High-Cost Areas**
```bash
.claude/skills/eval-analyzer/scripts/agent_kpis.sh eval_results/baselines/v0.3.24
# Focus on "Most Expensive" section
```

**Step 2: Investigate Root Causes**
```bash
.claude/skills/eval-analyzer/scripts/agent_transcripts.sh eval_results/baselines/v0.3.24 simple_print
# Read transcripts to understand what's going wrong
```

**Step 3: Make Improvements**
- Fix language issues (module system, imports, error messages)
- Update teaching prompt (examples, common mistakes, clarity)
- Improve documentation (cover gaps, update outdated info)

**Step 4: Validate Changes**
```bash
# Quick test with dev models (cheap)
ailang eval-suite --agent \
  --benchmarks simple_print,record_update,recursion_factorial \
  --langs ailang \
  --output eval_results/test_v0.3.25

# Check improvement
.claude/skills/eval-analyzer/scripts/agent_kpis.sh eval_results/test_v0.3.25
```

**Step 5: Full Baseline (if promising)**
```bash
.claude/skills/post-release/scripts/run_eval_baseline.sh v0.3.25 --full
.claude/skills/eval-analyzer/scripts/agent_kpis.sh eval_results/baselines/v0.3.25
```

## Common Issues

### Agent succeeds but takes too many turns

**Not a language bug!** - It's a prompt/documentation issue.

**Solution**: Add clearer examples, emphasize common patterns, reduce prompt noise.

### AILANG vs Python gap too large (>2x turns)

**Indicates**: Language has unnecessary complexity or unclear rules.

**Solution**: Simplify syntax, relax strict rules, improve error messages.

### High token count but low turn count

**Indicates**: Agent needs lots of context to understand language.

**Solution**: Better documentation, clearer prompts, more intuitive syntax.

### Same benchmark fails repeatedly across versions

**Indicates**: Language limitation or missing feature.

**Solution**: Check if benchmark requires unsupported feature, consider language enhancement.

## Next Steps

1. **Analyze current baseline**: `agent_kpis.sh eval_results/baselines/v0.3.24`
2. **Identify top 3 expensive benchmarks**: Look at "Most Expensive" section
3. **Read transcripts**: Understand why they're expensive
4. **Make targeted fixes**: Language, prompt, or documentation
5. **Re-evaluate**: Run agent eval on fixed version
6. **Measure improvement**: Compare KPIs before/after

**Goal**: Match or beat Python's 10.6 avg turns and 58k avg tokens. This proves AILANG is as easy as (or easier than) Python for AI agents to work with.

```

### ../../../design_docs/planned/v0_6_5/m-eval-gap-analysis.md

```markdown
# M-EVAL-GAP: AILANG vs Python Parity Analysis

**Status**: In Progress - v0.6.3 Tested
**Version**: v0.6.2 → v0.6.5
**Date**: 2026-01-05
**Goal**: Achieve AILANG parity with Python in agent mode evals

---

## v0.6.3 Test Results (2026-01-05)

**Changes made:**
1. ✅ Added `floatToInt`/`intToFloat` wrappers to `std/math.ail`
2. ✅ Created prompt v0.6.3 with explicit stdlib documentation (std/math, std/string)

**Test Results:**

| Benchmark | v0.6.2 | v0.6.3 | Turns | Result |
|-----------|--------|--------|-------|--------|
| `json_parse` | ❌ 14 turns | ✅ **Pass** | 47 | Agent succeeded with new prompt |
| `config_file_parser` | ❌ 55 turns | ❌ 26 turns | 26 | **Improvement** but still fails |

**Key Findings:**

1. **Agent discoverability improved** - In v0.6.3, agent uses `floatToInt(port_f)` (knows the function exists from prompt), but **forgets to import `std/math`**. In v0.6.2, agent implemented its own broken recursive version.

2. **`json_parse` now passes** - 47 turns is high but successful. Agent correctly handles nested Option matching.

3. **Import forgetting is the new gap** - Agent sees function in prompt, uses it, but omits import statement.

**Next Actions (v0.6.4):**

| Priority | Action | Expected Impact |
|----------|--------|-----------------|
| P0 | Add "REMEMBER to import" warning in prompt | Fix config_file_parser |
| P0 | Add JSON helper functions (`getString`, `getNumber`) | Reduce json_parse turns from 47 to <20 |
| P1 | Auto-import `std/math` in prelude | Eliminate import forgetting |
| P2 | Fix Gemini executor bug | Enable real Gemini benchmarking |

---

## Executive Summary

Analysis of v0.6.2 eval baselines reveals:

| Mode | Model | AILANG | Python | Gap |
|------|-------|--------|--------|-----|
| Standard | Haiku | 56.5% | 58.7% | **2.2%** |
| Standard | Opus | 69.5% | 80.4% | 10.9% |
| Agent | Haiku | **80.7%** | 100% | 19.3% |
| Agent | Gemini Flash 3 | 69.2% | 100% | 30.8% |

**Key Finding**: Agent mode dramatically improves AILANG success (56.5% → 80.7% for Haiku), but 10-16 benchmarks still fail due to:
1. **Timeout issues** (1m limit too short for complex repairs)
2. **JSON API confusion** (agents struggle with `Json` type)
3. **Missing stdlib functions** (agents hallucinate functions that don't exist)

---

## Detailed Failure Analysis

### Agent Mode Failures by Model

#### Claude Haiku (10 failures, 80.7% success)

| Benchmark | Turns | Root Cause |
|-----------|-------|------------|
| `api_call_json` | 15 | Timeout - Net capability complexity |
| `cli_args` | 34 | Runtime error - file I/O confusion |
| `config_file_parser` | 22 | Timeout - missing stdlib functions |
| `csv_to_json_converter` | 28 | Timeout - type confusion with strings |
| `json_parse` | 14 | Timeout - `Json` type confusion |
| `json_transform` | 13 | Timeout - JSON API misunderstanding |
| `log_file_analyzer` | 17 | Timeout - string parsing complexity |
| `pipeline` | 17 | Timeout - effect composition |
| `symbolic_diff` | 13 | Timeout - ADT complexity |
| `type_unify` | 14 | Timeout - recursive ADT |

**Pattern**: Most failures are timeouts after 13-34 turns. Agent understands AILANG but gets stuck iterating on type errors.

#### Gemini Flash 3 (16 failures, 69.2% success)

| Benchmark | Turns | Root Cause |
|-----------|-------|------------|
| All 16 | 1 | **Immediate timeout** - agent infra issue |

**Pattern**: Gemini shows `turns: 1` for all failures - the agent eval infrastructure may not be properly handling Gemini CLI.

---

## Root Cause Categories

### Category 1: Timeout Configuration (40% of failures)

**Problem**: 1-minute timeout too short for complex benchmarks.

**Evidence**:
- `csv_to_json_converter`: 28 turns before timeout
- `cli_args`: 34 turns before timeout
- Agent is making progress but runs out of time

**Solution**:
- Increase timeout to 3-5 minutes for agent mode
- Add benchmark-specific timeout overrides for complex tasks

### Category 2: JSON API Teaching Gap (30% of failures)

**Problem**: Prompt doesn't clearly explain `Json` type vs `string`.

**Evidence** (from json_parse transcript):
```
[TURN 8] I'm passing a string to get, but get expects a Json
[TURN 9] Wait, I'm still confused about the types
[TURN 11] The type error is clear: get expects a Json but I'm passing a string
[TURN 14] Let me just try to look at what AILANG considers the JSON type
```

Agent spends 14 turns confused about whether JSON values are `string` or `Json` type.

**Solution**: Add explicit JSON examples to prompt:
```ailang
-- JSON values have type Json, not string
let parsed: Result[Json, string] = decode("{\"name\":\"Alice\"}")
match parsed {
  Ok(obj) => {  -- obj has type Json
    match get(obj, "name") {  -- get takes Json, returns Option[Json]
      Some(nameJson) => match asString(nameJson) {  -- convert Json to string
        Some(name) => print(name),
        None => ()
      },
      None => ()
    }
  },
  Err(e) => print("Parse error: " ++ e)
}
```

### Category 3: Missing Stdlib Functions (20% of failures)

**Problem**: Agents hallucinate stdlib functions that don't exist.

**Hallucinated functions**:
- `_string_slice` / `stringSlice` - substring extraction
- `contains` - string containment check
- `_cast_float` / `_itof` / `round` - numeric conversions
- `floatToInt` - float to int conversion

**Current stdlib gaps**:
| Function | Exists? | Workaround |
|----------|---------|------------|
| `stringSlice(s, start, end)` | No | Use `_string_to_list` + `_list_slice` |
| `contains(s, sub)` | No | Use `_string_indexOf(s, sub) >= 0` |
| `floatToInt(f)` | Partial | `_float_to_int` exists (v0.5.11) |
| `round(f)` | No | No workaround |

**Solution**: Add commonly-needed string functions to `std/string`:
```ailang
-- Proposed additions to std/string
export func slice(s: string, start: int, end: int) -> string
export func contains(s: string, sub: string) -> bool
export func startsWith(s: string, prefix: string) -> bool
export func endsWith(s: string, suffix: string) -> bool
```

### Category 4: Gemini Agent Infrastructure - CRITICAL BUG (30% of failures)

**Problem**: Gemini Flash 3 agent runs are NOT using Gemini CLI - they're using Claude Code!

**Root Cause Analysis**:

1. **Wrong Runner Used**: v0.6.2 evals used `RunAgentBenchmark()` which hardcodes Claude Code:
   ```go
   // agent_runner.go:225
   Executor: "claude", // Legacy runner always uses Claude Code
   ```

2. **Multi-Executor Not Invoked**: The correct `RunAgentBenchmarkWithExecutor()` in
   `agent_runner_multi.go` supports multiple executors but was never called.

3. **models.yml Shows Intent But Not Reality**:
   ```yaml
   gemini-3-flash:
     agent_cli: "gemini"  # Configured but NOT USED
     agent_model_name: "gemini-3-flash-preview"
   ```

   Comment in models.yml even says: `"(not yet implemented)"` for Gemini CLI

**Evidence**:
```json
{
  "model": "gemini-3-flash",
  "agent_turns": 1,
  "stderr": "timeout after 1m0s\n\n=== Claude Session Transcript ==="  // <-- CLAUDE, not Gemini!
}
```

Even successful Gemini runs show "Claude Session Transcript" with `turns: 1`.

**Impact**: All "Gemini Flash 3" agent results are actually Claude Code runs with the wrong
model configuration. The 69.2% success rate is meaningless for Gemini evaluation.

**Solution** (High Priority):

1. **Use `RunAgentBenchmarkWithExecutor`** in eval suite:
   - Update `cmd/ailang/eval.go` to use multi-executor runner
   - Pass model name to select correct executor from models.yml

2. **Verify Gemini CLI Works**:
   ```bash
   gemini -m gemini-3-flash-preview --output-format json "Hello"
   ```

3. **Add Integration Test**:
   ```go
   func TestGeminiExecutor(t *testing.T) {
       exec, _ := executor.GlobalFactory().GetExecutor("gemini")
       // Verify it calls gemini CLI, not claude
   }
   ```

---

## Recommended Actions

### Phase 1: Infrastructure Fixes (v0.6.3) - CRITICAL

**P0: Fix Gemini Agent Executor** (blocks all Gemini agent eval)

1. **Wire up multi-executor in eval suite**:
   - File: `cmd/ailang/eval.go`
   - Change: Use `RunAgentBenchmarkWithExecutor()` instead of `RunAgentBenchmark()`
   - Test: `DEBUG_AGENT=1 ailang eval-suite --models gemini-3-flash --agent-mode`

2. **Verify Gemini CLI executor works**:
   - File: `internal/executor/gemini/gemini.go`
   - Test: `go test ./internal/executor/gemini/...`

3. **Re-run Gemini agent baseline** after fix

**P1: ~~Increase Agent Timeout~~ - WILL NOT FIX MOST FAILURES**

**Analysis Update**: Examined transcripts show models are **stuck in loops**, not "almost done":

```
[config_file_parser - 22 turns]
Turn 9:  "floatToInt type error - let me check what casting function is available"
Turn 13: "Let me check the builtins..."
Turn 14+: Keeps trying workarounds that compile but fail logically
Final code: Broken convertToInt using `(x % 1.0)` - doesn't work

[csv_to_json_converter - 28 turns]
Turn 11: "I'm trying to pattern match on strings as if they're lists"
Turn 12+: Rewrites but can't fix fundamental misunderstanding
```

**Real Root Causes** (timeout won't help):
1. **Missing `floatToInt`** - model keeps searching for it, can't find, tries broken workarounds
2. **String ≠ character list** - model tries `match str { [a, b, c] => ... }` which doesn't work
3. **Missing string operations** - `slice`, `contains`, `charAt` don't exist

**Actual Fix**: Add stdlib functions (see Phase 3), not longer timeouts.

**Per-benchmark timeouts**: May be useful for genuinely complex tasks (e.g., `type_unify` which requires deep recursion), but current 60s is reasonable for most benchmarks. The issue is capability gaps, not time constraints.

### Phase 2: Prompt Improvements (v0.6.3)

**P0: Add JSON API Examples** (fixes 30% of failures)

Add to `prompts/v0.6.3.md`:

```ailang
-- IMPORTANT: JSON values have type `Json`, not `string`!
-- The `decode` function returns Result[Json, string]
-- Use accessor functions to extract typed values

import std/json (decode, get, asString, asNumber, asArray)

func example() -> () ! {IO} = {
  -- Parse JSON string into Json value
  let jsonStr = "{\"name\":\"Alice\",\"age\":30}";
  match decode(jsonStr) {
    Ok(obj) => {
      -- obj has type Json
      -- get(obj, key) returns Option[Json]
      match get(obj, "name") {
        Some(nameJson) => {
          -- nameJson has type Json, convert to string
          match asString(nameJson) {
            Some(name) => print("Name: " ++ name),
            None => print("name is not a string")
          }
        },
        None => print("no name field")
      }
    },
    Err(e) => print("Parse error: " ++ e)
  }
}
```

### Phase 3: Add Missing Wrappers + Document in Prompt (v0.6.3-v0.6.4)

**Discovery**: Most stdlib wrappers already exist! Agent just doesn't know about them.

**Already in stdlib** (agent transcript shows it couldn't find these):

| Function | Exists in stdlib? | Agent looked for |
|----------|-------------------|------------------|
| `substring(s, start, end)` | ✅ std/string | `stringSlice` |
| `contains(hay, needle)` | ✅ std/string | `contains` |
| `trim(s)` | ✅ std/string | `trim` |
| `split(s, delim)` | ✅ std/string | `split` |
| `floor(x)` | ✅ std/math (returns float) | - |
| `ceil(x)` | ✅ std/math (returns float) | - |
| `round(x)` | ✅ std/math (returns float) | - |

**ONLY missing wrappers** (cause of stuck loops):

| Function | Builtin exists? | Wrapper needed |
|----------|-----------------|----------------|
| `floatToInt(x) -> int` | ✅ `_float_to_int` | ❌ No wrapper in stdlib |
| `intToFloat(x) -> float` | ✅ `_int_to_float` | ❌ No wrapper in stdlib |

**Minimal fix** (just 2 lines in std/math.ail):
```ailang
export pure func floatToInt(x: float) -> int = _float_to_int(x)
export pure func intToFloat(x: int) -> float = _int_to_float(x)
```

**Prompt fix** (document what's available):
```
## String Functions (import std/string)
substring(s, start, end), contains(hay, needle), trim(s), split(s, delim)
toLower(s), toUpper(s), find(hay, needle), stringToInt(s), stringToFloat(s)

## Math Functions (import std/math)
floor(x), ceil(x), round(x)          -- return float
floatToInt(x), intToFloat(x)         -- type conversion (v0.6.3+)
abs_Float(x), abs_Int(x), sqrt(x), pow(x, y)
sin(x), cos(x), tan(x), log(x), exp(x)
```

**Agent was 1 function away from success** - just needed `floatToInt`!

### Phase 4: Prompt A/B Testing (v0.6.5)

1. **Create benchmark-specific prompts** for the 10 hardest cases
2. **Measure turns-to-success** as primary metric (not just pass/fail)
3. **Compare Haiku vs Opus** on same prompts to identify model-specific issues
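
A rough way to compute item 2 (turns-to-success) from the per-run agent JSON files (a sketch assuming the `agent_turns` and `stdout_ok` fields shown in the evidence above):

```bash
# Average turns across successful AILANG agent runs
cat eval_results/baselines/v0.6.2/agent/*_ailang_*.json | \
  jq -s 'map(select(.stdout_ok == true)) |
    {runs: length, avg_turns: (map(.agent_turns) | add / length)}'
```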

---

## Success Metrics

| Metric | Current (v0.6.2) | Target (v0.6.5) | Notes |
|--------|------------------|-----------------|-------|
| Haiku Agent Success | 80.7% | **90%+** | +5 benchmarks |
| Gemini Flash Agent Success | 69.2%* | **TBD** | *Invalid - needs re-run |
| Average turns (success) | ~10 | **<8** | Efficiency metric |
| "floatToInt not found" loops | ~3 | **0** | After stdlib fix |
| JSON-related failures | 6 | **0-2** | After prompt fix |

*Gemini results are invalid because agent eval used Claude Code CLI instead of Gemini CLI.

## Revised Priority Actions

1. **Add `floatToInt`/`intToFloat` to std/math.ail** (2 lines) - unblocks config_file_parser
2. **Document stdlib in prompt** - unblocks agents who can't find existing functions
3. **Fix Gemini executor** - enables real Gemini benchmarking
4. **Add JSON examples to prompt** - reduces type confusion

---

## Appendix: Benchmark-Specific Analysis

### Hardest Benchmarks (0% AILANG success across all models)

| Benchmark | Python | Issue | Fix |
|-----------|--------|-------|-----|
| `config_file_parser` | 100% | Missing `stringSlice` | Add to stdlib |
| `csv_to_json_converter` | 100% | String parsing, `contains` | Add to stdlib |
| `error_handling` | 100% | Type annotation confusion | Better examples |

### Benchmarks Where Opus Succeeds but Haiku Fails

| Benchmark | Opus | Haiku | Gap Reason |
|-----------|------|-------|------------|
| `canonical_normalization` | Pass | WRONG_LANG | Haiku writes Python-like |
| `fold_reduce` | Pass | WRONG_LANG | Same |
| `graph_bfs` | Pass | WRONG_LANG | Same |
| `json_parse` | Pass | Timeout | Haiku slower to learn |
| `merge_sort` | Pass | PAR_001 | Syntax errors |

**Insight**: Haiku's main gaps are:
1. Syntax confusion (writes Python-like code) → **Prompt examples**
2. Slower iteration (times out) → **Longer timeout**
3. Missing functions → **Stdlib additions**

```

### resources/jq_queries.md

```markdown
# Common jq Queries for Eval Analysis

This reference provides additional jq queries beyond those in SKILL.md.

## Setup

All queries assume you have the summary.jsonl file:

```bash
SUMMARY=eval_results/baselines/v0.3.16/summary.jsonl

# Generate if missing
ailang eval-summary eval_results/baselines/v0.3.16
```

## Advanced Queries

### Benchmark-Specific Analysis

**All models that succeeded on a specific benchmark:**
```bash
jq -s --arg bench "fizzbuzz" '
  map(select(.id == $bench and .lang == "ailang" and .stdout_ok)) |
  map(.model)
' $SUMMARY
```

**Failure reasons for a specific benchmark:**
```bash
jq -s --arg bench "fizzbuzz" '
  map(select(.id == $bench and .lang == "ailang" and .stdout_ok == false)) |
  group_by(.error_category) |
  map({
    error: .[0].error_category,
    models: map(.model),
    count: length
  })
' $SUMMARY
```

### Model-Specific Analysis

**All benchmarks a model failed:**
```bash
jq -s --arg model "gpt5" '
  map(select(.model == $model and .lang == "ailang" and .stdout_ok == false)) |
  map(.id) |
  unique
' $SUMMARY
```

**Model's error distribution:**
```bash
jq -s --arg model "gpt5" '
  map(select(.model == $model and .lang == "ailang" and .stdout_ok == false)) |
  group_by(.error_category) |
  map({
    category: .[0].error_category,
    count: length,
    benchmarks: map(.id) | unique
  })
' $SUMMARY
```

### Cost Analysis

**Cost per benchmark:**
```bash
jq -s '
  map(select(.lang == "ailang")) |
  group_by(.id) |
  map({
    benchmark: .[0].id,
    total_cost: (map(.cost_usd) | add * 100 | round / 100),
    avg_cost: (map(.cost_usd) | add / length * 1000 | round / 1000)
  }) |
  sort_by(-.total_cost)
' $SUMMARY
```

**Most expensive failures:**
```bash
jq -s '
  map(select(.lang == "ailang" and .stdout_ok == false)) |
  sort_by(-.cost_usd) |
  .[:10] |
  map({
    benchmark: .id,
    model: .model,
    cost: (.cost_usd * 100 | round / 100),
    tokens: .total_tokens,
    error: .error_category
  })
' $SUMMARY
```

### Time Analysis

**Slowest runs:**
```bash
jq -s '
  map(select(.lang == "ailang")) |
  sort_by(-.duration_ms) |
  .[:10] |
  map({
    benchmark: .id,
    model: .model,
    duration_ms: .duration_ms,
    success: .stdout_ok
  })
' $SUMMARY
```

**Average duration by model:**
```bash
jq -s '
  map(select(.lang == "ailang")) |
  group_by(.model) |
  map({
    model: .[0].model,
    avg_duration_ms: (map(.duration_ms) | add / length | round)
  }) |
  sort_by(-.avg_duration_ms)
' $SUMMARY
```

### Repair Analysis

**Repair success by error category:**
```bash
jq -s '
  map(select(.lang == "ailang" and .repair_used == true)) |
  group_by(.error_category) |
  map({
    category: .[0].error_category,
    total_repairs: length,
    successful: map(select(.repair_ok)) | length,
    success_rate: (map(select(.repair_ok)) | length) / length * 100 | round
  })
' $SUMMARY
```

**Which models benefit most from repair:**
```bash
jq -s '
  map(select(.lang == "ailang")) |
  group_by(.model) |
  map({
    model: .[0].model,
    first_attempt_rate: (map(select(.first_attempt_ok)) | length) / length * 100 | round,
    final_rate: (map(select(.stdout_ok)) | length) / length * 100 | round,
    improvement: ((map(select(.stdout_ok)) | length) - (map(select(.first_attempt_ok)) | length))
  }) |
  sort_by(-.improvement)
' $SUMMARY
```

### Correlation Analysis

**Benchmarks where all models fail:**
```bash
jq -s '
  map(select(.lang == "ailang")) |
  group_by(.id) |
  map({
    benchmark: .[0].id,
    success_count: map(select(.stdout_ok)) | length,
    total: length
  }) |
  map(select(.success_count == 0))
' $SUMMARY
```

**Benchmarks with high model variance:**
```bash
jq -s '
  map(select(.lang == "ailang")) |
  group_by(.id) |
  map({
    benchmark: .[0].id,
    success_by_model: (
      group_by(.model) |
      map({
        model: .[0].model,
        success: map(select(.stdout_ok)) | length > 0
      })
    ),
    variance: (
      group_by(.model) |
      map(map(select(.stdout_ok)) | length) |
      (max - min)
    )
  }) |
  sort_by(-.variance) |
  .[:10]
' $SUMMARY
```

### Token Efficiency

**Cost per successful token:**
```bash
jq -s '
  map(select(.lang == "ailang" and .stdout_ok)) |
  group_by(.model) |
  map({
    model: .[0].model,
    cost_per_1k_tokens: (
      (map(.cost_usd) | add) / (map(.total_tokens) | add) * 1000 * 100 | round / 100
    )
  })
' $SUMMARY
```

**Output token efficiency (when successful):**
```bash
jq -s '
  map(select(.lang == "ailang" and .stdout_ok)) |
  group_by(.model) |
  map({
    model: .[0].model,
    avg_output_tokens: (map(.output_tokens) | add / length | round),
    min_output: (map(.output_tokens) | min),
    max_output: (map(.output_tokens) | max)
  })
' $SUMMARY
```

## Exporting Data

### CSV Export for Spreadsheets

**All runs:**
```bash
jq -r '
  [.id, .lang, .model, .stdout_ok, .error_category, .total_tokens, .cost_usd, .duration_ms] |
  @csv
' $SUMMARY > results.csv
```

**Only AILANG failures:**
```bash
jq -r '
  select(.lang == "ailang" and .stdout_ok == false) |
  [.id, .model, .error_category, .err_code, .compile_ok, .runtime_ok, .total_tokens] |
  @csv
' $SUMMARY > failures.csv
```

### JSON for Further Processing

**Failure summary by category:**
```bash
jq -s '
  map(select(.lang == "ailang" and .stdout_ok == false)) |
  group_by(.error_category) |
  map({
    category: .[0].error_category,
    count: length,
    benchmarks: (map(.id) | unique),
    models_affected: (map(.model) | unique),
    total_cost: (map(.cost_usd) | add),
    avg_tokens: (map(.total_tokens) | add / length | round)
  })
' $SUMMARY > failure_summary.json
```

## Tips

1. **Pipe to `head`** for quick checks: `jq ... $SUMMARY | head -20`
2. **Use `less -S`** for wide output: `jq ... $SUMMARY | less -S`
3. **Save complex queries** as shell functions in your `~/.bashrc`
4. **Combine with grep** for filtering: `jq ... $SUMMARY | grep "gpt5"`
5. **Use `-c`** for compact output: `jq -c ...`
6. **Pretty print saved JSON**: `jq . failure_summary.json`

```

### resources/failure_analysis_v0.3.16.md

```markdown
# AILANG v0.3.16 Eval Failure Analysis

## Executive Summary

**Overall Success Rate**: 31.0% (63/204 runs)
**Failure Rate**: 69.0% (141/204 runs)

**Failure Breakdown:**
- **compile_error**: 62% (90 failures) - Models generating invalid AILANG syntax
- **logic_error**: 31% (46 failures) - Valid code but wrong output
- **runtime_error**: 6% (9 failures) - Crashes during execution

**Top Error Codes:**
- PAR_001 (68 occurrences) - Parse errors, syntax mistakes
- WRONG_LANG (10 occurrences) - Models generating Python/other languages instead of AILANG
- IMPERATIVE (7 occurrences) - Imperative patterns (e.g., `x = y` instead of `let x = y`)
- CAP_001 (3 occurrences) - Missing capability grants

## Root Cause Analysis

### Problem 1: Prompt-Benchmark Mismatch (CRITICAL BUG)

**The Bug**: v0.3.16 prompt contains a **FALSE limitation** that contradicts the actual implementation!

**What the prompt says**:
```markdown
⚠️ NO custom HTTP headers (OpenAI/Claude APIs blocked until v0.4.0)
```

**Reality**: `httpRequest()` with custom headers has been **working since v0.3.9**!
```ailang
import std/net (httpRequest)

let headers = [{name: "X-Test-Header", value: "value123"}]
let response = httpRequest("POST", url, headers, body)  -- ✅ WORKS!
```

**What happened**:
1. v0.3.9 added `httpRequest()` with headers but kept old "NO headers" limitation (contradictory)
2. v0.3.16 **removed httpRequest documentation** but **kept the false limitation**
3. Models read "NO custom HTTP headers" and gave up or hallucinated syntax
4. Result: All 6 models failed `api_call_json` benchmark (0% success rate)

**Impact**: Any benchmark requiring HTTP headers will fail at 0% because models think it's impossible

**Generated Code Patterns** (models trying to work around "impossible" requirement):
- **gemini-2-5-flash**: Used uppercase `LET` keyword (doesn't exist in AILANG)
- **claude-haiku-4-5**: Used `import HTTP` (doesn't exist) and Python-style dict syntax
- **claude-sonnet-4-5**: Invented shell-like syntax `http-post "url" {...} '...' response`
- **gpt5**: Used Python-style keyword arguments `http.post(url: "...", headers: {...}, json: {...})`

**Root Cause**: Documentation regression - feature exists but prompt says it doesn't!

**Fix**: Update v0.3.17 prompt to:
1. **REMOVE** false "NO custom HTTP headers" limitation
2. **ADD** httpRequest() documentation with examples
3. **ADD** to import checklist: `httpRequest` → `import std/net (httpRequest)`

### Problem 2: Syntax Confusion (PAR_001 errors)

**Models are generating invalid AILANG syntax patterns:**

1. **Python-style keyword arguments**:
   ```python
   # ❌ WRONG (GPT-5)
   http.post(url: "...", headers: {...}, json: {...})
   ```
   AILANG uses positional arguments only

2. **Uppercase keywords**:
   ```
   # ❌ WRONG (Gemini)
   LET url = "..."
   ```
   AILANG keywords are lowercase: `let`, `func`, `type`, etc.

3. **Module-qualified calls without imports**:
   ```
   # ❌ WRONG (Gemini)
   http.post(...)  # No import statement!
   ```
   AILANG requires explicit imports: `import std/net (httpPost)`

4. **Invented syntax**:
   ```
   # ❌ WRONG (Claude Sonnet)
   http-post "url" {...} '...' response
   ```
   No such syntax exists in AILANG

**Root Cause**: Models are trained on Python/JavaScript/Bash and default to familiar syntax patterns.

**Recommendation**: Enhance prompt with more negative examples (anti-patterns).

### Problem 3: Prompt Version Tracking

**How to find prompt version**:
1. Check `eval_results/baselines/{version}/baseline.json` for `"version": "0.3.16"`
2. Cross-reference with `prompts/versions.json` active version
3. For v0.3.16 baseline → used `prompts/v0.3.16.md` prompt
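
For example, step 1 can be scripted (a sketch assuming a top-level `version` field in `baseline.json`, as quoted above):

```bash
jq -r '.version' eval_results/baselines/v0.3.16/baseline.json
```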

**Enhancement idea**: Add `prompt_version` and `prompt_hash` fields to individual result JSON files for easier tracking (currently only in baseline.json)

## Detailed Failure Patterns

### API Call JSON Benchmark (0% success, 6/6 models failed)

**Expected AILANG code** (what models SHOULD have generated):
```ailang
module benchmark/solution

import std/net (httpRequest)
import std/json (encode, jo, kv, js, jnum)

export func main() -> () ! {Net, IO} {
  let headers = [
    {name: "X-Test-Header", value: "value123"},
    {name: "Content-Type", value: "application/json"}
  ];
  let body = encode(jo([
    kv("message", js("Hello from AILANG")),
    kv("count", jnum(42.0))
  ]));
  match httpRequest("POST", "https://httpbin.org/post", headers, body) {
    Ok(resp) => print(show(resp.status)),
    Err(e) => print("error")
  }
}
```

**Why this is correct:**
- Uses `httpRequest()` with custom headers (supported since v0.3.9)
- Pattern matches on Result type for error handling
- Accesses `resp.status` field from HttpResponse record
- Uses JSON encoding functions from std/json

**What models generated instead**:

1. **Gemini**: Uppercase `LET`, no imports, wrong module qualifier
2. **Claude Haiku**: Invented `import HTTP`, Python dict syntax
3. **Claude Sonnet**: Shell-like command syntax
4. **GPT-5**: Python keyword argument syntax

**Why all failed**: Benchmark requires HTTP headers, but prompt says they're not supported.

### Common Error Categories

#### 1. Compile Error (62% of failures)

**Subcategories:**
- Parse errors (PAR_001): Wrong syntax, invalid tokens
- Wrong language (WRONG_LANG): Python/JS generated instead of AILANG
- Imperative style (IMPERATIVE): Using `=` instead of `let`
- Missing imports: Using functions without importing them

**Pattern**: Models default to familiar syntax from training data

#### 2. Logic Error (31% of failures)

**Patterns:**
- Correct syntax, but wrong algorithm
- Off-by-one errors in recursion
- Incorrect pattern matching logic
- Wrong output format (e.g., printing list instead of just status code)

**Pattern**: Models understand the task but implement wrong logic

#### 3. Runtime Error (6% of failures)

**Patterns:**
- Stack overflow (infinite recursion)
- Missing capability grants (CAP_001)
- Type errors not caught by type checker
- Record field access on wrong type

**Pattern**: Rare - type system catches most errors at compile time

## Benchmarks with 0% Success Rate (All Models Failed)

1. **api_call_json** - Requires unsupported HTTP headers
2. (Need to examine 16 more to categorize)

## Model Performance Comparison

(Data from summary.jsonl - need to run analyze_failures.sh to populate)

## Recommendations

### Immediate Actions (v0.3.17)

1. **Fix prompt-benchmark mismatch**:
   - Remove `api_call_json` benchmark OR implement HTTP headers
   - Audit all benchmarks for unsupported features

2. **Add prompt version to results**:
   - Modify eval harness to store `prompt_version` field
   - Enables correlation analysis between prompt changes and success rates

3. **Enhance syntax documentation**:
   - Add "Common Mistakes" section with anti-patterns
   - Show ❌ WRONG / ✅ CORRECT examples for:
     - Function calls (positional vs keyword args)
     - Keywords (lowercase only)
     - Import syntax
     - Module qualification

### Medium-term Actions (v0.4.0)

1. **Implement HTTP headers support**:
   - Design: How should headers be passed? List of records? Map?
   - Security: Whitelist allowed headers
   - Testing: Update benchmarks to use new API

2. **Improve error messages**:
   - When parse fails, suggest common fixes
   - "Did you mean `let` instead of `LET`?"
   - "AILANG doesn't support keyword arguments - use positional arguments"

3. **Expand eval analysis**:
   - Create repair suggestion system
   - Identify which syntax errors can be auto-fixed
   - Test if auto-repair improves success rates

### Long-term Actions (v0.5.0+)

1. **Systematic prompt engineering**:
   - A/B test prompt variations
   - Measure: Which wording reduces parse errors most?
   - Track prompt effectiveness over time

2. **Model-specific prompts**:
   - GPT-5 makes different mistakes than Claude
   - Custom prompts per model family?
   - Trade-off: Maintenance burden vs. performance gain

3. **Synthetic training data**:
   - Generate AILANG code examples from successful runs
   - Fine-tune models on AILANG syntax
   - Reduce reliance on prompt engineering

## Next Steps

1. Run `.claude/skills/eval-analyzer/scripts/analyze_failures.sh` for detailed stats
2. Examine remaining 16 benchmarks with 0% success rate
3. Categorize all failure patterns systematically
4. Create design doc for HTTP headers support
5. Update eval harness to store prompt version

```
