phoenix-evals
Build and run evaluators for AI/LLM applications using Phoenix.
Packaged view
This page reorganizes the original catalog entry to lead with fit, installability, and workflow context. The original raw source appears below.
Install command
npx @skill-hub/cli install arize-ai-phoenix-phoenix-evals
Repository
Skill path: skills/phoenix-evals
Build and run evaluators for AI/LLM applications using Phoenix.
Best for
Primary workflow: Analyze Data & AI.
Technical facets: Full Stack, Data / AI.
Target audience: everyone.
License: Apache-2.0.
Original source
Catalog source: SkillHub Club.
Repository owner: Arize-ai.
This is a mirrored public skill entry; review the repository before installing it into production workflows.
What it helps with
- Install phoenix-evals into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
- Review https://github.com/Arize-ai/phoenix before adding phoenix-evals to shared team environments
- Use phoenix-evals for development workflows
Catalog details
Favorites: 0.
Sub-skills: 0.
Aggregator: No.
Original source / Raw SKILL.md
---
name: phoenix-evals
description: Build and run evaluators for AI/LLM applications using Phoenix.
license: Apache-2.0
metadata:
  author: [email protected]
  version: "1.0.0"
languages: Python, TypeScript
---

# Phoenix Evals

Build evaluators for AI/LLM applications. Code first, LLM for nuance, validate against humans.

## Quick Reference

| Task | Files |
| ---- | ----- |
| Setup | `setup-python`, `setup-typescript` |
| Build code evaluator | `evaluators-code-{python\|typescript}` |
| Build LLM evaluator | `evaluators-llm-{python\|typescript}`, `evaluators-custom-templates` |
| Run experiment | `experiments-running-{python\|typescript}` |
| Create dataset | `experiments-datasets-{python\|typescript}` |
| Validate evaluator | `validation`, `validation-calibration-{python\|typescript}` |
| Analyze errors | `error-analysis`, `axial-coding` |
| RAG evals | `evaluators-rag` |
| Production | `production-overview`, `production-guardrails` |

## Workflows

**Starting Fresh:** `observe-tracing-setup` → `error-analysis` → `axial-coding` → `evaluators-overview`

**Building Evaluator:** `fundamentals` → `evaluators-{code\|llm}-{python\|typescript}` → `validation-calibration-{python\|typescript}`

**RAG Systems:** `evaluators-rag` → `evaluators-code-*` (retrieval) → `evaluators-llm-*` (faithfulness)

**Production:** `production-overview` → `production-guardrails` → `production-continuous`

## Rule Categories

| Prefix | Description |
| ------ | ----------- |
| `fundamentals-*` | Types, scores, anti-patterns |
| `observe-*` | Tracing, sampling |
| `error-analysis-*` | Finding failures |
| `axial-coding-*` | Categorizing failures |
| `evaluators-*` | Code, LLM, RAG evaluators |
| `experiments-*` | Datasets, running experiments |
| `validation-*` | Calibrating judges |
| `production-*` | CI/CD, monitoring |

## Key Principles

| Principle | Action |
| --------- | ------ |
| Error analysis first | Can't automate what you haven't observed |
| Custom > generic | Build from your failures |
| Code first | Deterministic before LLM |
| Validate judges | >80% TPR/TNR |
| Binary > Likert | Pass/fail, not 1-5 |
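The "Code first", "Binary > Likert", and "Validate judges" principles above can be sketched in plain Python, with no Phoenix dependency. The citation check below is a hypothetical example of a deterministic evaluator, not part of the skill; real evaluators should be built from failures observed in your own traces.

```python
# Minimal sketch: a deterministic binary evaluator plus TPR/TNR
# validation against human labels. All names here are illustrative.

def cites_source(output: str) -> int:
    """Deterministic, binary evaluator: 1 = pass, 0 = fail (not a 1-5 scale)."""
    return 1 if "[source]" in output.lower() else 0

def tpr_tnr(predicted: list[int], human: list[int]) -> tuple[float, float]:
    """True-positive and true-negative rates of an evaluator vs. human labels."""
    tp = sum(1 for p, h in zip(predicted, human) if p == 1 and h == 1)
    tn = sum(1 for p, h in zip(predicted, human) if p == 0 and h == 0)
    pos = sum(human)
    neg = len(human) - pos
    return (tp / pos if pos else 0.0, tn / neg if neg else 0.0)

# Validate against a small human-labeled set; the skill's bar is >80% on both.
outputs = ["Answer A. [source]", "Answer B.", "Answer C. [Source]"]
human_labels = [1, 0, 1]
preds = [cites_source(o) for o in outputs]
tpr, tnr = tpr_tnr(preds, human_labels)
```

The same TPR/TNR check applies when calibrating an LLM judge: score a labeled sample with the judge, compare against the human labels, and iterate on the prompt until both rates clear the threshold.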