
llm-evaluator

LLM-as-a-Judge evaluator via Langfuse. Scores traces on relevance, accuracy, hallucination, and helpfulness using GPT-5-nano as judge. Supports single trace scoring, batch backfill, and test mode. Integrates with Langfuse dashboard for observability. Triggers: evaluate trace, score quality, check accuracy, backfill scores, test evaluator, LLM judge.

Packaged view

This page reorganizes the original catalog entry to put fit, installability, and workflow context first; the original raw source appears below.

Stars: 3,071
Hot score: 99
Updated: March 19, 2026
Overall rating: C (4.0)
Composite score: 4.0
Best-practice grade: B (81.2)

Install command

npx @skill-hub/cli install openclaw-skills-llm-evaluator-pro

Repository

openclaw/skills

Skill path: skills/aiwithabidi/llm-evaluator-pro

Best for

Primary workflow: Ship Full Stack.

Technical facets: Full Stack, Testing.

Target audience: everyone.

License: MIT.

Original source

Catalog source: SkillHub Club.

Repository owner: openclaw.

This is a mirrored public skill entry; review the repository before installing it into production workflows.

What it helps with

  • Install llm-evaluator into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
  • Review https://github.com/openclaw/skills before adding llm-evaluator to shared team environments
  • Use llm-evaluator to score model outputs and traces in development workflows

Works across

Claude Code, Codex CLI, Gemini CLI, OpenCode

Favorites: 0.

Sub-skills: 0.

Aggregator: No.

Original source / Raw SKILL.md

---
name: llm-evaluator
version: 1.0.0
description: >
  LLM-as-a-Judge evaluator via Langfuse. Scores traces on relevance, accuracy,
  hallucination, and helpfulness using GPT-5-nano as judge. Supports single trace
  scoring, batch backfill, and test mode. Integrates with Langfuse dashboard for
  observability. Triggers: evaluate trace, score quality, check accuracy, backfill
  scores, test evaluator, LLM judge.
license: MIT
compatibility:
  openclaw: ">=0.10"
metadata:
  openclaw:
    requires:
      bins: ["python3"]
      env: ["OPENROUTER_API_KEY", "LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY"]
---

# LLM Evaluator ⚖️

LLM-as-a-Judge evaluation system powered by Langfuse. Uses GPT-5-nano to score AI outputs.

## When to Use

- Evaluating quality of search results or AI responses
- Scoring traces for relevance, accuracy, hallucination detection
- Batch scoring recent unscored traces
- Quality assurance on agent outputs

## Usage

```bash
# Test with sample cases
python3 {baseDir}/scripts/evaluator.py test

# Score a specific Langfuse trace
python3 {baseDir}/scripts/evaluator.py score <trace_id>

# Score with specific evaluator only
python3 {baseDir}/scripts/evaluator.py score <trace_id> --evaluators relevance

# Backfill scores on recent unscored traces
python3 {baseDir}/scripts/evaluator.py backfill --limit 20
```
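
Before running any of these commands, the credentials listed in the skill's `metadata.openclaw.requires.env` block must be present in the environment. A minimal pre-flight check, as a sketch (the variable names come from the metadata above):

```python
import os

# Required by the skill metadata; the evaluator script reads all three.
required = ["OPENROUTER_API_KEY", "LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY"]
missing = [k for k in required if not os.environ.get(k)]
if missing:
    raise SystemExit(f"Missing required environment variables: {', '.join(missing)}")
```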

## Evaluators

| Evaluator | Measures | Scale |
|-----------|----------|-------|
| relevance | Response relevance to query | 0–1 |
| accuracy | Factual correctness | 0–1 |
| hallucination | Made-up information detection | 0–1 |
| helpfulness | Overall usefulness | 0–1 |
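
Each evaluator writes its result back to Langfuse as a named score on the trace, so results can be read through the same public API the script writes to. A minimal retrieval sketch, assuming the same environment variables; `<trace_id>` is a placeholder to fill in:

```python
import os
import requests

host = os.environ.get("LANGFUSE_HOST", "http://langfuse-web:3000")
auth = (os.environ["LANGFUSE_PUBLIC_KEY"], os.environ["LANGFUSE_SECRET_KEY"])

# Fetch the scores attached to one trace (the same endpoint the evaluator posts to).
resp = requests.get(f"{host}/api/public/scores", params={"traceId": "<trace_id>"}, auth=auth)
resp.raise_for_status()
for score in resp.json().get("data", []):
    print(f"{score['name']}: {score['value']:.2f}")
```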

## Credits

Built by [M. Abidi](https://www.linkedin.com/in/mohammad-ali-abidi) | [agxntsix.ai](https://www.agxntsix.ai)
[YouTube](https://youtube.com/@aiwithabidi) | [GitHub](https://github.com/aiwithabidi)
Part of the **AgxntSix Skill Suite** for OpenClaw agents.

📅 **Need help setting up OpenClaw for your business?** [Book a free consultation](https://cal.com/agxntsix/abidi-openclaw)


---

## Skill Companion Files

> Additional files collected from the skill directory layout.

### _meta.json

```json
{
  "owner": "aiwithabidi",
  "slug": "llm-evaluator-pro",
  "displayName": "LLM Evaluator Pro",
  "latest": {
    "version": "1.0.0",
    "publishedAt": 1771140092289,
    "commit": "https://github.com/openclaw/skills/commit/7de73f61cab993837f596debec11a3f3c101145c"
  },
  "history": []
}

```

### scripts/evaluator.py

```python
#!/usr/bin/env python3
"""
AgxntSix Evaluator — LLM-as-a-Judge via Langfuse.

Uses GPT-5-nano via OpenRouter as a cheap, fast judge to evaluate traces.
Scores are saved directly to Langfuse for dashboard visibility.

Usage:
  evaluator.py test                              # Run a quick test
  evaluator.py score <trace_id>                  # Score a specific trace
  evaluator.py backfill [--limit 10]             # Score recent unscored traces
  evaluator.py experiment [--dataset NAME] [--name "run name"]  # Run experiment on dataset
"""
import argparse
import json
import os
import re
import time
from datetime import datetime

import requests
from openai import OpenAI

# Langfuse config: keys are read from the environment (see the skill metadata above).
os.environ.setdefault("LANGFUSE_HOST", "http://langfuse-web:3000")
LF_AUTH = (os.environ.get("LANGFUSE_PUBLIC_KEY", ""), os.environ.get("LANGFUSE_SECRET_KEY", ""))
LF_API = os.environ["LANGFUSE_HOST"] + "/api/public"

# OpenRouter key: prefer the environment, fall back to the workspace .env file
OR_KEY = os.environ.get("OPENROUTER_API_KEY", "")
if not OR_KEY:
    try:
        with open(os.path.expanduser("~/.openclaw/workspace/.env")) as f:
            for line in f:
                if line.startswith("OPENROUTER_API_KEY="):
                    OR_KEY = line.strip().split("=", 1)[1]
    except OSError:
        pass

JUDGE_MODEL = "openai/gpt-5-nano"


def get_judge():
    return OpenAI(api_key=OR_KEY, base_url="https://openrouter.ai/api/v1")


def llm_judge(prompt: str) -> dict:
    """Call the judge LLM and parse a score plus reasoning from its reply."""
    judge = get_judge()
    resp = judge.chat.completions.create(
        model=JUDGE_MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    text = resp.choices[0].message.content.strip()
    try:
        result = json.loads(text)
        return {"score": float(result.get("score", 0.5)), "reasoning": result.get("reasoning", text)}
    except (json.JSONDecodeError, TypeError, ValueError):
        # Fallback: extract an "x / y" fraction from free-form judge output
        match = re.search(r'(\d+\.?\d*)\s*/\s*(\d+)', text)
        if match:
            return {"score": float(match.group(1)) / float(match.group(2)), "reasoning": text}
        return {"score": 0.5, "reasoning": text}


def save_score(trace_id: str, name: str, value: float, comment: str = ""):
    """Save a score to Langfuse via the public API."""
    resp = requests.post(
        f"{LF_API}/scores",
        auth=LF_AUTH,
        json={"traceId": trace_id, "name": name, "value": value, "comment": comment},
    )
    resp.raise_for_status()  # surface API errors instead of failing silently


# === EVALUATORS ===

def eval_relevance(input_text: str, output_text: str) -> dict:
    return llm_judge(
        f'Rate the relevance of this AI response to the query (0.0-1.0). '
        f'Respond JSON only: {{"score": <float>, "reasoning": "<brief>"}}\n\n'
        f'Query: {input_text}\nResponse: {output_text}'
    )

def eval_accuracy(input_text: str, output_text: str, expected: str = "") -> dict:
    ctx = f"\nExpected: {expected}" if expected else ""
    return llm_judge(
        f'Rate the factual accuracy of this response (0.0-1.0).{ctx} '
        f'Respond JSON only: {{"score": <float>, "reasoning": "<brief>"}}\n\n'
        f'Query: {input_text}\nResponse: {output_text}'
    )

def eval_hallucination(input_text: str, output_text: str) -> dict:
    return llm_judge(
        f'Rate whether this response is free of hallucinations (0.0=severe hallucinations, 1.0=fully grounded). '
        f'Respond JSON only: {{"score": <float>, "reasoning": "<brief>"}}\n\n'
        f'Query: {input_text}\nResponse: {output_text}'
    )

def eval_helpfulness(input_text: str, output_text: str) -> dict:
    return llm_judge(
        f'Rate how helpful this response is to the user (0.0=not helpful, 1.0=extremely helpful). '
        f'Respond JSON only: {{"score": <float>, "reasoning": "<brief>"}}\n\n'
        f'Query: {input_text}\nResponse: {output_text}'
    )

EVALUATORS = {
    "relevance": eval_relevance,
    "accuracy": eval_accuracy,
    "hallucination": eval_hallucination,
    "helpfulness": eval_helpfulness,
}


# === COMMANDS ===

def cmd_test(args):
    """Quick test of the judge pipeline."""
    from langfuse import get_client
    lf = get_client()
    
    test_cases = [
        {"input": "What is the capital of France?", "output": "The capital of France is Paris.", "expected": "Paris"},
        {"input": "What is 2+2?", "output": "2+2 equals 4.", "expected": "4"},
        {"input": "Tell me about quantum computing", "output": "Quantum computing uses qubits that can exist in superposition, enabling parallel computation. It's useful for optimization and cryptography.", "expected": ""},
    ]
    
    print(f"Running LLM-as-a-Judge on {len(test_cases)} test cases...\n")
    
    for i, tc in enumerate(test_cases):
        # Create a trace for this test
        with lf.start_as_current_observation(
            as_type="span",
            name=f"judge-test-{i+1}",
            input=tc["input"],
        ) as span:
            span.update(output=tc["output"])
        
        lf.flush()
        time.sleep(0.5)
        
        # Get trace ID from recent traces
        resp = requests.get(f"{LF_API}/traces", params={"limit": 1, "orderBy": "timestamp.desc"}, auth=LF_AUTH)
        traces = resp.json().get("data", [])
        if not traces:
            print(f"  ❌ Could not find trace for test {i+1}")
            continue
        
        trace_id = traces[0]["id"]
        print(f"Test {i+1}: {tc['input'][:50]}")
        
        # Run all evaluators
        for name, evaluator in EVALUATORS.items():
            if name == "accuracy":
                result = evaluator(tc["input"], tc["output"], tc.get("expected", ""))
            else:
                result = evaluator(tc["input"], tc["output"])
            
            save_score(trace_id, name, result["score"], result["reasoning"])
            print(f"  {name}: {result['score']:.2f} — {result['reasoning'][:80]}")
        
        print()
    
    print("✅ All test scores saved to Langfuse")


def cmd_score(args):
    """Score a specific trace."""
    resp = requests.get(f"{LF_API}/traces/{args.trace_id}", auth=LF_AUTH)
    if resp.status_code != 200:
        print(f"❌ Trace not found: {args.trace_id}")
        return
    
    trace = resp.json()
    input_text = str(trace.get("input", ""))
    output_text = str(trace.get("output", ""))
    
    if not input_text or not output_text:
        print("⚠️ Trace has no input/output to evaluate")
        return
    
    print(f"Scoring trace: {args.trace_id}")
    print(f"  Input: {input_text[:100]}...")
    print(f"  Output: {output_text[:100]}...")
    
    evals = EVALUATORS if args.evaluators == "all" else {args.evaluators: EVALUATORS[args.evaluators]}
    
    for name, evaluator in evals.items():
        result = evaluator(input_text, output_text)
        save_score(args.trace_id, name, result["score"], result["reasoning"])
        print(f"  {name}: {result['score']:.2f} — {result['reasoning'][:80]}")
    
    print("✅ Scores saved")


def cmd_backfill(args):
    """Score recent unscored traces."""
    resp = requests.get(f"{LF_API}/traces", params={"limit": args.limit, "orderBy": "timestamp.desc"}, auth=LF_AUTH)
    traces = resp.json().get("data", [])
    
    scored = 0
    for trace in traces:
        input_text = str(trace.get("input", ""))
        output_text = str(trace.get("output", ""))
        
        if not input_text or not output_text or len(output_text) < 10:
            continue
        
        # Check if already scored
        score_resp = requests.get(f"{LF_API}/scores", params={"traceId": trace["id"], "limit": 1}, auth=LF_AUTH)
        if score_resp.json().get("data", []):
            continue
        
        print(f"Scoring: {trace.get('name', 'unnamed')} ({trace['id'][:12]}...)")
        
        for name in ["relevance", "accuracy"]:
            result = EVALUATORS[name](input_text, output_text)
            save_score(trace["id"], name, result["score"], result["reasoning"])
            print(f"  {name}: {result['score']:.2f}")
        
        scored += 1
    
    print(f"\n✅ Scored {scored} traces")


def cmd_experiment(args):
    """Run experiment using Langfuse datasets."""
    from langfuse import get_client, Evaluation
    lf = get_client()
    
    dataset_name = args.dataset or "search-quality"
    
    try:
        dataset = lf.get_dataset(dataset_name)
    except Exception:
        print(f"❌ Dataset '{dataset_name}' not found")
        return
    
    def task(*, item, **kwargs):
        judge = get_judge()
        resp = judge.chat.completions.create(
            model=JUDGE_MODEL,
            messages=[{"role": "user", "content": str(item.input)}],
            temperature=0,
        )
        return resp.choices[0].message.content
    
    def simple_accuracy(*, input, output, expected_output=None, **kwargs):
        if expected_output and str(expected_output).lower() in str(output).lower():
            return Evaluation(name="accuracy", value=1.0, comment="Expected output found in response")
        return Evaluation(name="accuracy", value=0.0, comment="Expected output not found")
    
    run_name = args.name or f"Experiment {datetime.now().strftime('%Y-%m-%d %H:%M')}"
    
    print(f"Running experiment '{run_name}' on dataset '{dataset_name}'...")
    result = lf.run_experiment(
        name=run_name,
        dataset=dataset,
        task=task,
        evaluators=[simple_accuracy],
    )
    print(result.format())
    lf.flush()


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="AgxntSix Evaluator (LLM-as-a-Judge)")
    sub = parser.add_subparsers(dest="cmd")
    
    sub.add_parser("test", help="Run test evaluation")
    
    p = sub.add_parser("score", help="Score a specific trace")
    p.add_argument("trace_id")
    p.add_argument("--evaluators", default="all", choices=["all", "relevance", "accuracy", "hallucination", "helpfulness"])
    
    p = sub.add_parser("backfill", help="Score recent unscored traces")
    p.add_argument("--limit", type=int, default=10)
    
    p = sub.add_parser("experiment", help="Run experiment on dataset")
    p.add_argument("--dataset", default="search-quality")
    p.add_argument("--name", help="Experiment run name")
    
    args = parser.parse_args()
    if not args.cmd:
        parser.print_help()
    else:
        globals()[f"cmd_{args.cmd}"](args)

```
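
The evaluators are plain functions registered in the `EVALUATORS` dict, so they can also be called directly from Python instead of through the CLI. A minimal sketch, assuming the script is importable as `evaluator` and the environment variables above are set:

```python
from evaluator import EVALUATORS, save_score

# Judge a single input/output pair without going through the CLI.
result = EVALUATORS["relevance"](
    "What is the capital of France?",
    "The capital of France is Paris.",
)
print(result["score"], result["reasoning"])

# Optionally persist the score to a Langfuse trace:
# save_score("<trace_id>", "relevance", result["score"], result["reasoning"])
```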
