Back to skills
SkillHub ClubAnalyze Data & AIFull StackData / AITesting

testing-agent

Run goal-based evaluation tests for agents. Use when you need to verify an agent meets its goals, debug failing tests, or iterate on agent improvements based on test results.

Packaged view

This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.

Stars
9,603
Hot score
99
Updated
March 20, 2026
Overall rating
C4.5
Composite score
4.5
Best-practice grade
B77.6

Install command

npx @skill-hub/cli install adenhq-hive-testing-agent

Repository

adenhq/hive

Skill path: .claude/skills/testing-agent

Run goal-based evaluation tests for agents. Use when you need to verify an agent meets its goals, debug failing tests, or iterate on agent improvements based on test results.

Open repository

Best for

Primary workflow: Analyze Data & AI.

Technical facets: Full Stack, Data / AI, Testing.

Target audience: everyone.

License: Unknown.

Original source

Catalog source: SkillHub Club.

Repository owner: adenhq.

This is still a mirrored public skill entry. Review the repository before installing into production workflows.

What it helps with

  • Install testing-agent into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
  • Review https://github.com/adenhq/hive before adding testing-agent to shared team environments
  • Use testing-agent for development workflows

Works across

Claude CodeCodex CLIGemini CLIOpenCode

Favorites: 0.

Sub-skills: 0.

Aggregator: No.

Original source / Raw SKILL.md

---
name: testing-agent
description: Run goal-based evaluation tests for agents. Use when you need to verify an agent meets its goals, debug failing tests, or iterate on agent improvements based on test results.
---

# Testing Workflow

This skill provides tools for testing agents built with the building-agents skill.

## Workflow Overview

1. `mcp__agent-builder__list_tests` - Check what tests exist
2. `mcp__agent-builder__generate_constraint_tests` or `mcp__agent-builder__generate_success_tests` - Get test guidelines
3. **Write tests directly** using the Write tool with the guidelines provided
4. `mcp__agent-builder__run_tests` - Execute tests
5. `mcp__agent-builder__debug_test` - Debug failures

## How Test Generation Works

The `generate_*_tests` MCP tools return **guidelines and templates** - they do NOT generate test code via LLM.
You (Claude) write the tests directly using the Write tool based on the guidelines.

### Example Workflow

```python
# Step 1: Get test guidelines
result = mcp__agent-builder__generate_constraint_tests(
    goal_id="my-goal",
    goal_json='{"id": "...", "constraints": [...]}',
    agent_path="exports/my_agent"
)

# Step 2: The result contains:
# - output_file: where to write tests
# - file_header: imports and fixtures to use
# - test_template: format for test functions
# - constraints_formatted: the constraints to test
# - test_guidelines: rules for writing tests

# Step 3: Write tests directly using the Write tool
Write(
    file_path=result["output_file"],
    content=result["file_header"] + test_code_you_write
)

# Step 4: Run tests via MCP tool
mcp__agent-builder__run_tests(
    goal_id="my-goal",
    agent_path="exports/my_agent"
)

# Step 5: Debug failures via MCP tool
mcp__agent-builder__debug_test(
    goal_id="my-goal",
    test_name="test_constraint_foo",
    agent_path="exports/my_agent"
)
```

---

# Testing Agents with MCP Tools

Run goal-based evaluation tests for agents built with the building-agents skill.

**Key Principle: MCP tools provide guidelines, Claude writes tests directly**
- ✅ Get guidelines: `generate_constraint_tests`, `generate_success_tests` → returns templates and guidelines
- ✅ Write tests: Use the Write tool with the provided file_header and test_template
- ✅ Run tests: `run_tests` (runs pytest via subprocess)
- ✅ Debug failures: `debug_test` (re-runs single test with verbose output)
- ✅ List tests: `list_tests` (scans Python test files)
- ✅ Tests stored in `exports/{agent}/tests/test_*.py`

## Architecture: Python Test Files

```
exports/my_agent/
├── __init__.py
├── agent.py              ← Agent to test
├── nodes/__init__.py
├── config.py
├── __main__.py
└── tests/                ← Test files written by MCP tools
    ├── conftest.py       # Shared fixtures (auto-created)
    ├── test_constraints.py
    ├── test_success_criteria.py
    └── test_edge_cases.py
```

**Tests import the agent directly:**
```python
import pytest
from exports.my_agent import default_agent


@pytest.mark.asyncio
async def test_happy_path(mock_mode):
    result = await default_agent.run({"query": "test"}, mock_mode=mock_mode)
    assert result.success
    assert len(result.output) > 0
```

## Why This Approach

- MCP tools provide consistent test guidelines with proper imports, fixtures, and API key enforcement
- Claude writes tests directly, eliminating circular LLM dependencies in the MCP server
- `run_tests` parses pytest output into structured results for iteration
- `debug_test` provides formatted output with actionable debugging info
- File headers include conftest.py setup with proper fixtures

## Quick Start

1. **Check existing tests** - `list_tests(goal_id, agent_path)`
2. **Get test guidelines** - `generate_constraint_tests` or `generate_success_tests`
3. **Write tests** - Use the Write tool with the provided file_header and guidelines
4. **Run tests** - `run_tests(goal_id, agent_path)`
5. **Debug failures** - `debug_test(goal_id, test_name, agent_path)`
6. **Iterate** - Repeat steps 4-5 until all pass

## ⚠️ API Key Requirement for Real Testing

**CRITICAL: Real LLM testing requires an API key.** Mock mode only validates structure and does NOT test actual agent behavior.

### Prerequisites

Before running agent tests, you MUST set your API key:

```bash
export ANTHROPIC_API_KEY="your-key-here"
```

**Why API keys are required:**
- Tests need to execute the agent's LLM nodes to validate behavior
- Mock mode bypasses LLM calls, providing no confidence in real-world performance
- Success criteria (personalization, reasoning quality, constraint adherence) can only be tested with real LLM calls

### Mock Mode Limitations

Mock mode (`--mock` flag or `mock_mode=True`) is **ONLY for structure validation**:

✓ Validates graph structure (nodes, edges, connections)
✓ Tests that code doesn't crash on execution
✗ Does NOT test LLM message generation
✗ Does NOT test reasoning or decision-making quality
✗ Does NOT test constraint validation (length limits, format rules)
✗ Does NOT test real API integrations or tool use
✗ Does NOT test personalization or content quality

**Bottom line:** If you're testing whether an agent achieves its goal, you MUST use a real API key.

### Enforcing API Key in Tests

When generating tests, **ALWAYS include API key checks**:

```python
import os
import pytest
from aden_tools.credentials import CredentialManager

# At the top of every test file
pytestmark = pytest.mark.skipif(
    not CredentialManager().is_available("anthropic") and not os.environ.get("MOCK_MODE"),
    reason="API key required for real testing. Set ANTHROPIC_API_KEY or use MOCK_MODE=1 for structure validation only."
)


@pytest.fixture(scope="session", autouse=True)
def check_api_key():
    """Ensure API key is set for real testing."""
    creds = CredentialManager()
    if not creds.is_available("anthropic"):
        if os.environ.get("MOCK_MODE"):
            print("\n⚠️  Running in MOCK MODE - structure validation only")
            print("   This does NOT test LLM behavior or agent quality")
            print("   Set ANTHROPIC_API_KEY for real testing\n")
        else:
            pytest.fail(
                "\n❌ ANTHROPIC_API_KEY not set!\n\n"
                "Real testing requires an API key. Choose one:\n"
                "1. Set API key (RECOMMENDED):\n"
                "   export ANTHROPIC_API_KEY='your-key-here'\n"
                "2. Run structure validation only:\n"
                "   MOCK_MODE=1 pytest exports/{agent}/tests/\n\n"
                "Note: Mock mode does NOT validate agent behavior or quality."
            )
```

### User Communication

When the user asks to test an agent, **ALWAYS check for the API key first**:

```python
from aden_tools.credentials import CredentialManager

# Before running any tests
creds = CredentialManager()
if not creds.is_available("anthropic"):
    print("⚠️  No ANTHROPIC_API_KEY found!")
    print()
    print("Testing requires a real API key to validate agent behavior.")
    print()
    print("Options:")
    print("1. Set your API key (RECOMMENDED):")
    print("   export ANTHROPIC_API_KEY='your-key-here'")
    print()
    print("2. Run in mock mode (structure validation only):")
    print("   MOCK_MODE=1 pytest exports/{agent}/tests/")
    print()
    print("Mock mode does NOT test:")
    print("  - LLM message generation")
    print("  - Reasoning or decision quality")
    print("  - Constraint validation")
    print("  - Real API integrations")

    # Ask user what to do
    AskUserQuestion(...)
```

## The Three-Stage Flow

```
┌─────────────────────────────────────────────────────────────────────────┐
│                           GOAL STAGE                                     │
│  (building-agents skill)                                                 │
│                                                                          │
│  1. User defines goal with success_criteria and constraints             │
│  2. Goal written to agent.py immediately                                │
│  3. Generate CONSTRAINT TESTS → Write to tests/ → USER APPROVAL         │
│     Files created: exports/{agent}/tests/test_constraints.py            │
└─────────────────────────────────────────────────────────────────────────┘
                                   ↓
┌─────────────────────────────────────────────────────────────────────────┐
│                          AGENT STAGE                                     │
│  (building-agents skill)                                                 │
│                                                                          │
│  Build nodes + edges, written immediately to files                      │
│  Constraint tests can run during development:                           │
│    run_tests(goal_id, agent_path, test_types='["constraint"]')          │
└─────────────────────────────────────────────────────────────────────────┘
                                   ↓
┌─────────────────────────────────────────────────────────────────────────┐
│                           EVAL STAGE (this skill)                        │
│                                                                          │
│  1. Generate SUCCESS_CRITERIA TESTS → Write to tests/ → USER APPROVAL   │
│     Files created: exports/{agent}/tests/test_success_criteria.py       │
│  2. Run all tests: run_tests(goal_id, agent_path)                       │
│  3. On failure → debug_test(goal_id, test_name, agent_path)             │
│  4. Iterate: Edit agent code → Re-run run_tests (instant feedback)      │
└─────────────────────────────────────────────────────────────────────────┘
```

## Step-by-Step: Testing an Agent

### Step 1: Check Existing Tests

**ALWAYS check first** before generating new tests:

```python
mcp__agent-builder__list_tests(
    goal_id="your-goal-id",
    agent_path="exports/your_agent"
)
```

This shows what test files already exist. If tests exist:
- Review the list to see what's covered
- Ask user if they want to add more or run existing tests

### Step 2: Get Constraint Test Guidelines (Goal Stage)

After goal is defined, get test guidelines using the MCP tool:

```python
# First, read the goal from agent.py to get the goal JSON
goal_code = Read(file_path="exports/your_agent/agent.py")
# Extract the goal definition and convert to JSON

# Get constraint test guidelines via MCP tool
result = mcp__agent-builder__generate_constraint_tests(
    goal_id="your-goal-id",
    goal_json='{"id": "goal-id", "name": "...", "constraints": [...]}',
    agent_path="exports/your_agent"
)
```

**Response includes:**
- `output_file`: Where to write tests (e.g., `exports/your_agent/tests/test_constraints.py`)
- `file_header`: Imports, fixtures, and pytest setup to use at the top of the file
- `test_template`: Format for test functions
- `constraints_formatted`: The constraints to test
- `test_guidelines`: Rules and best practices for writing tests
- `instruction`: How to proceed

**Write tests directly** using the provided guidelines:

```python
# Write tests using the Write tool
Write(
    file_path=result["output_file"],
    content=result["file_header"] + "\n\n" + your_test_code
)
```

### Step 3: Get Success Criteria Test Guidelines (Eval Stage)

After agent is fully built, get success criteria test guidelines:

```python
# Get success criteria test guidelines via MCP tool
result = mcp__agent-builder__generate_success_tests(
    goal_id="your-goal-id",
    goal_json='{"id": "goal-id", "name": "...", "success_criteria": [...]}',
    node_names="analyze_request,search_web,format_results",
    tool_names="web_search,web_scrape",
    agent_path="exports/your_agent"
)
```

**Write tests directly** using the provided guidelines:

```python
# Write tests using the Write tool
Write(
    file_path=result["output_file"],
    content=result["file_header"] + "\n\n" + your_test_code
)
```

### Step 4: Test Fixtures (conftest.py)

The `file_header` returned by the MCP tools includes proper imports and fixtures.
You should also create a conftest.py file in the tests directory with shared fixtures:

```python
# Create conftest.py with the conftest template
Write(
    file_path="exports/your_agent/tests/conftest.py",
    content=conftest_content  # Use PYTEST_CONFTEST_TEMPLATE format
)
```

### Step 5: Run Tests

**Use the MCP tool to run tests** (not pytest directly):

```python
mcp__agent-builder__run_tests(
    goal_id="your-goal-id",
    agent_path="exports/your_agent"
)

**Response includes structured results:**
```json
{
  "goal_id": "your-goal-id",
  "overall_passed": false,
  "summary": {
    "total": 12,
    "passed": 10,
    "failed": 2,
    "skipped": 0,
    "errors": 0,
    "pass_rate": "83.3%"
  },
  "test_results": [
    {"file": "test_constraints.py", "test_name": "test_constraint_api_rate_limits", "status": "passed"},
    {"file": "test_success_criteria.py", "test_name": "test_success_find_relevant_results", "status": "failed"}
  ],
  "failures": [
    {"test_name": "test_success_find_relevant_results", "details": "AssertionError: Expected 3-5 results..."}
  ]
}
```

**Options for `run_tests`:**
```python
# Run only constraint tests
mcp__agent-builder__run_tests(
    goal_id="your-goal-id",
    agent_path="exports/your_agent",
    test_types='["constraint"]'
)

# Run with parallel workers
mcp__agent-builder__run_tests(
    goal_id="your-goal-id",
    agent_path="exports/your_agent",
    parallel=4
)

# Stop on first failure
mcp__agent-builder__run_tests(
    goal_id="your-goal-id",
    agent_path="exports/your_agent",
    fail_fast=True
)
```

### Step 6: Debug Failed Tests

**Use the MCP tool to debug** (not Bash/pytest directly):

```python
mcp__agent-builder__debug_test(
    goal_id="your-goal-id",
    test_name="test_success_find_relevant_results",
    agent_path="exports/your_agent"
)
```

**Response includes:**
- Full verbose output from the test
- Stack trace with exact line numbers
- Captured logs and prints
- Suggestions for fixing the issue

### Step 7: Categorize Errors

When a test fails, categorize the error to guide iteration:

```python
def categorize_test_failure(test_output, agent_code):
    """Categorize test failure to guide iteration."""

    # Read test output and agent code
    failure_info = {
        "test_name": "...",
        "error_message": "...",
        "stack_trace": "...",
    }

    # Pattern-based categorization
    if any(pattern in failure_info["error_message"].lower() for pattern in [
        "typeerror", "attributeerror", "keyerror", "valueerror",
        "null", "none", "undefined", "tool call failed"
    ]):
        category = "IMPLEMENTATION_ERROR"
        guidance = {
            "stage": "Agent",
            "action": "Fix the bug in agent code",
            "files_to_edit": ["agent.py", "nodes/__init__.py"],
            "restart_required": False,
            "description": "Code bug - fix and re-run tests"
        }

    elif any(pattern in failure_info["error_message"].lower() for pattern in [
        "assertion", "expected", "got", "should be", "success criteria"
    ]):
        category = "LOGIC_ERROR"
        guidance = {
            "stage": "Goal",
            "action": "Update goal definition",
            "files_to_edit": ["agent.py (goal section)"],
            "restart_required": True,
            "description": "Goal definition is wrong - update and rebuild"
        }

    elif any(pattern in failure_info["error_message"].lower() for pattern in [
        "timeout", "rate limit", "empty", "boundary", "edge case"
    ]):
        category = "EDGE_CASE"
        guidance = {
            "stage": "Eval",
            "action": "Add edge case test and fix handling",
            "files_to_edit": ["agent.py", "tests/test_edge_cases.py"],
            "restart_required": False,
            "description": "New scenario - add test and handle it"
        }

    else:
        category = "UNKNOWN"
        guidance = {
            "stage": "Unknown",
            "action": "Manual investigation required",
            "restart_required": False
        }

    return {
        "category": category,
        "guidance": guidance,
        "failure_info": failure_info
    }
```

**Show categorization to user:**

```python
AskUserQuestion(
    questions=[{
        "question": f"Test failed with {category}. How would you like to proceed?",
        "header": "Test Failure",
        "options": [
            {
                "label": "Fix code directly (Recommended)" if category == "IMPLEMENTATION_ERROR" else "Update goal",
                "description": guidance["description"]
            },
            {
                "label": "Show detailed error info",
                "description": "View full stack trace and logs"
            },
            {
                "label": "Skip for now",
                "description": "Continue with other tests"
            }
        ],
        "multiSelect": false
    }]
)
```

### Step 8: Iterate Based on Error Category

#### IMPLEMENTATION_ERROR → Fix Agent Code

```python
# 1. Show user the exact file and line that failed
print(f"Error in: exports/{agent_name}/nodes/__init__.py:42")
print(f"Issue: 'NoneType' object has no attribute 'get'")

# 2. Read the problematic code
code = Read(file_path=f"exports/{agent_name}/nodes/__init__.py")

# 3. User can fix directly, or you suggest a fix:
Edit(
    file_path=f"exports/{agent_name}/nodes/__init__.py",
    old_string="if results.get('videos'):",
    new_string="if results and results.get('videos'):"
)

# 4. Re-run tests immediately (instant feedback!)
mcp__agent-builder__run_tests(
    goal_id="your-goal-id",
    agent_path=f"exports/{agent_name}"
)
```

#### LOGIC_ERROR → Update Goal

```python
# 1. Show user the goal definition
goal_code = Read(file_path=f"exports/{agent_name}/agent.py")

# 2. Discuss what needs to change in success_criteria or constraints

# 3. Edit the goal
Edit(
    file_path=f"exports/{agent_name}/agent.py",
    old_string='target="3-5 videos"',
    new_string='target="1-5 videos"'  # More realistic
)

# 4. May need to regenerate agent nodes if goal changed significantly
# This requires going back to building-agents skill
```

#### EDGE_CASE → Add Test and Fix

```python
# 1. Create new edge case test with API key enforcement
edge_case_test = '''
@pytest.mark.asyncio
async def test_edge_case_empty_results(mock_mode):
    """Test: Agent handles no results gracefully"""
    result = await default_agent.run({{"query": "xyzabc123nonsense"}}, mock_mode=mock_mode)

    # Should succeed with empty results, not crash
    assert result.success or result.error is not None
    if result.success:
        assert result.output.get("message") == "No results found"
'''

# 2. Add to test file
Edit(
    file_path=f"exports/{agent_name}/tests/test_edge_cases.py",
    old_string="# Add edge case tests here",
    new_string=edge_case_test
)

# 3. Fix agent to handle edge case
# Edit agent code to handle empty results

# 4. Re-run tests
```

## Test File Templates (Reference Only)

**⚠️ Do NOT copy-paste these templates directly.** Use `generate_constraint_tests` and `generate_success_tests` MCP tools to create properly structured tests with correct imports and fixtures.

These templates show the structure of generated tests for reference only.

### Constraint Test Template

```python
"""Constraint tests for {agent_name}.

These tests validate that the agent respects its defined constraints.
Requires ANTHROPIC_API_KEY for real testing.
"""

import os
import pytest
from exports.{agent_name} import default_agent
from aden_tools.credentials import CredentialManager


# Enforce API key for real testing
pytestmark = pytest.mark.skipif(
    not CredentialManager().is_available("anthropic") and not os.environ.get("MOCK_MODE"),
    reason="API key required. Set ANTHROPIC_API_KEY or use MOCK_MODE=1."
)


@pytest.mark.asyncio
async def test_constraint_{constraint_id}():
    """Test: {constraint_description}"""
    # Test implementation based on constraint type
    mock_mode = bool(os.environ.get("MOCK_MODE"))
    result = await default_agent.run({{"test": "input"}}, mock_mode=mock_mode)

    # Assert constraint is respected
    assert True  # Replace with actual check
```

### Success Criteria Test Template

```python
"""Success criteria tests for {agent_name}.

These tests validate that the agent achieves its defined success criteria.
Requires ANTHROPIC_API_KEY for real testing - mock mode cannot validate success criteria.
"""

import os
import pytest
from exports.{agent_name} import default_agent
from aden_tools.credentials import CredentialManager


# Enforce API key for real testing
pytestmark = pytest.mark.skipif(
    not CredentialManager().is_available("anthropic") and not os.environ.get("MOCK_MODE"),
    reason="API key required. Set ANTHROPIC_API_KEY or use MOCK_MODE=1."
)


@pytest.mark.asyncio
async def test_success_{criteria_id}():
    """Test: {criteria_description}"""
    mock_mode = bool(os.environ.get("MOCK_MODE"))
    result = await default_agent.run({{"test": "input"}}, mock_mode=mock_mode)

    assert result.success, f"Agent failed: {{result.error}}"

    # Verify success criterion met
    # e.g., assert metric meets target
    assert True  # Replace with actual check
```

### Edge Case Test Template

```python
"""Edge case tests for {agent_name}.

These tests validate agent behavior in unusual or boundary conditions.
Requires ANTHROPIC_API_KEY for real testing.
"""

import os
import pytest
from exports.{agent_name} import default_agent
from aden_tools.credentials import CredentialManager


# Enforce API key for real testing
pytestmark = pytest.mark.skipif(
    not CredentialManager().is_available("anthropic") and not os.environ.get("MOCK_MODE"),
    reason="API key required. Set ANTHROPIC_API_KEY or use MOCK_MODE=1."
)


@pytest.mark.asyncio
async def test_edge_case_{scenario_name}():
    """Test: Agent handles {scenario_description}"""
    mock_mode = bool(os.environ.get("MOCK_MODE"))
    result = await default_agent.run({{"edge": "case_input"}}, mock_mode=mock_mode)

    # Verify graceful handling
    assert result.success or result.error is not None
```

## Interactive Build + Test Loop

During agent construction (Agent stage), you can run constraint tests incrementally:

```python
# After adding first node
print("Added search_node. Running relevant constraint tests...")
mcp__agent-builder__run_tests(
    goal_id="your-goal-id",
    agent_path=f"exports/{agent_name}",
    test_types='["constraint"]'
)

# After adding second node
print("Added filter_node. Running all constraint tests...")
mcp__agent-builder__run_tests(
    goal_id="your-goal-id",
    agent_path=f"exports/{agent_name}",
    test_types='["constraint"]'
)
```

This provides **immediate feedback** during development, catching issues early.

## Common Test Patterns

**Note:** All test patterns should include API key enforcement via conftest.py.

### ⚠️ CRITICAL: Framework Features You Must Know

#### OutputCleaner - Automatic I/O Cleaning (NEW!)

**The framework now automatically validates and cleans node outputs** using a fast LLM (Cerebras llama-3.3-70b) at edge traversal time. This prevents cascading failures from malformed output.

**What OutputCleaner does**:
- ✅ Validates output matches next node's input schema
- ✅ Detects JSON parsing trap (entire response in one key)
- ✅ Cleans malformed output automatically (~200-500ms, ~$0.001 per cleaning)
- ✅ Boosts success rates by 1.8-2.2x

**Impact on tests**: Tests should still use safe patterns because OutputCleaner may not catch all issues in test mode.

#### Safe Test Patterns (REQUIRED)

**❌ UNSAFE** (will cause test failures):
```python
# Direct key access - can crash!
approval_decision = result.output["approval_decision"]
assert approval_decision == "APPROVED"

# Nested access without checks
category = result.output["analysis"]["category"]

# Assuming parsed JSON structure
for issue in result.output["compliance_issues"]:
    ...
```

**✅ SAFE** (correct patterns):
```python
# 1. Safe dict access with .get()
output = result.output or {}
approval_decision = output.get("approval_decision", "UNKNOWN")
assert "APPROVED" in approval_decision or approval_decision == "APPROVED"

# 2. Type checking before operations
analysis = output.get("analysis", {})
if isinstance(analysis, dict):
    category = analysis.get("category", "unknown")

# 3. Parse JSON from strings (the JSON parsing trap!)
import json
recommendation = output.get("recommendation", "{}")
if isinstance(recommendation, str):
    try:
        parsed = json.loads(recommendation)
        if isinstance(parsed, dict):
            approval = parsed.get("approval_decision", "UNKNOWN")
    except json.JSONDecodeError:
        approval = "UNKNOWN"
elif isinstance(recommendation, dict):
    approval = recommendation.get("approval_decision", "UNKNOWN")

# 4. Safe iteration with type check
compliance_issues = output.get("compliance_issues", [])
if isinstance(compliance_issues, list):
    for issue in compliance_issues:
        ...
```

#### Helper Functions for Safe Access

**Add to conftest.py**:
```python
import json
import re

def _parse_json_from_output(result, key):
    """Parse JSON from agent output (framework may store full LLM response as string)."""
    response_text = result.output.get(key, "")
    # Remove markdown code blocks if present
    json_text = re.sub(r'```json\s*|\s*```', '', response_text).strip()

    try:
        return json.loads(json_text)
    except (json.JSONDecodeError, AttributeError, TypeError):
        return result.output.get(key)

def safe_get_nested(result, key_path, default=None):
    """Safely get nested value from result.output."""
    output = result.output or {}
    current = output

    for key in key_path:
        if isinstance(current, dict):
            current = current.get(key)
        elif isinstance(current, str):
            try:
                json_text = re.sub(r'```json\s*|\s*```', '', current).strip()
                parsed = json.loads(json_text)
                if isinstance(parsed, dict):
                    current = parsed.get(key)
                else:
                    return default
            except json.JSONDecodeError:
                return default
        else:
            return default

    return current if current is not None else default

# Make available in tests
pytest.parse_json_from_output = _parse_json_from_output
pytest.safe_get_nested = safe_get_nested
```

**Usage in tests**:
```python
# Use helper to parse JSON safely
parsed = pytest.parse_json_from_output(result, "recommendation")
if isinstance(parsed, dict):
    approval = parsed.get("approval_decision", "UNKNOWN")

# Safe nested access
risk_score = pytest.safe_get_nested(result, ["analysis", "risk_score"], default=0.0)
```

#### Test Count Guidance

**Generate 8-15 tests total, NOT 30+**

- ✅ 2-3 tests per success criterion
- ✅ 1 happy path test
- ✅ 1 boundary/edge case test
- ✅ 1 error handling test (optional)

**Why fewer tests?**:
- Each test requires real LLM call (~3 seconds, costs money)
- 30 tests = 90 seconds, $0.30+ in costs
- 12 tests = 36 seconds, $0.12 in costs
- Focus on quality over quantity

#### ExecutionResult Fields (Important!)

**`result.success=True` means NO exception, NOT goal achieved**

```python
# ❌ WRONG - assumes goal achieved
assert result.success

# ✅ RIGHT - check success AND output
assert result.success, f"Agent failed: {result.error}"
output = result.output or {}
approval = output.get("approval_decision")
assert approval == "APPROVED", f"Expected APPROVED, got {approval}"
```

**All ExecutionResult fields**:
- `success: bool` - Execution completed without exception (NOT goal achieved!)
- `output: dict` - Complete memory snapshot (may contain raw strings)
- `error: str | None` - Error message if failed
- `steps_executed: int` - Number of nodes executed
- `total_tokens: int` - Cumulative token usage
- `total_latency_ms: int` - Total execution time
- `path: list[str]` - Node IDs traversed
- `paused_at: str | None` - Node ID if HITL pause occurred
- `session_state: dict` - State for resuming

### Happy Path Test
```python
@pytest.mark.asyncio
async def test_happy_path(mock_mode):
    """Test normal successful execution"""
    result = await default_agent.run({{"query": "python tutorials"}}, mock_mode=mock_mode)
    assert result.success
    assert len(result.output) > 0
```

### Boundary Condition Test
```python
@pytest.mark.asyncio
async def test_boundary_minimum(mock_mode):
    """Test at minimum threshold"""
    result = await default_agent.run({{"query": "very specific niche topic"}}, mock_mode=mock_mode)
    assert result.success
    assert len(result.output.get("results", [])) >= 1
```

### Error Handling Test
```python
@pytest.mark.asyncio
async def test_error_handling(mock_mode):
    """Test graceful error handling"""
    result = await default_agent.run({{"query": ""}}, mock_mode=mock_mode)  # Invalid input
    assert not result.success or result.output.get("error") is not None
```

### Performance Test
```python
@pytest.mark.asyncio
async def test_performance_latency(mock_mode):
    """Test response time is acceptable"""
    import time
    start = time.time()
    result = await default_agent.run({{"query": "test"}}, mock_mode=mock_mode)
    duration = time.time() - start
    assert duration < 5.0, f"Took {{duration}}s, expected <5s"
```

## Integration with building-agents

### Handoff Points

| Scenario | From | To | Action |
|----------|------|-----|--------|
| Agent built, ready to test | building-agents | testing-agent | Generate success tests |
| LOGIC_ERROR found | testing-agent | building-agents | Update goal, rebuild |
| IMPLEMENTATION_ERROR found | testing-agent | Direct fix | Edit agent files, re-run tests |
| EDGE_CASE found | testing-agent | testing-agent | Add edge case test |
| All tests pass | testing-agent | Done | Agent validated ✅ |

### Iteration Speed Comparison

| Scenario | Old Approach | New Approach |
|----------|--------------|--------------|
| **Bug Fix** | Rebuild via MCP tools (14 min) | Edit Python file, pytest (2 min) |
| **Add Test** | Generate via MCP, export (5 min) | Write test file directly (1 min) |
| **Debug** | Read subprocess logs | pdb, breakpoints, prints |
| **Inspect** | Limited visibility | Full Python introspection |

## Anti-Patterns

### Testing Best Practices

| Don't | Do Instead |
|-------|------------|
| ❌ Write tests without getting guidelines first | ✅ Use `generate_*_tests` to get proper file_header and guidelines |
| ❌ Run pytest via Bash | ✅ Use `run_tests` MCP tool for structured results |
| ❌ Debug tests with Bash pytest -vvs | ✅ Use `debug_test` MCP tool for formatted output |
| ❌ Check for tests with Glob | ✅ Use `list_tests` MCP tool |
| ❌ Skip the file_header from guidelines | ✅ Always include the file_header for proper imports and fixtures |

### General Testing

| Don't | Do Instead |
|-------|------------|
| ❌ Treat all failures the same | ✅ Use debug_test to categorize and iterate appropriately |
| ❌ Rebuild entire agent for small bugs | ✅ Edit code directly, re-run tests |
| ❌ Run tests without API key | ✅ Always set ANTHROPIC_API_KEY first |
| ❌ Write tests without understanding the constraints/criteria | ✅ Read the formatted constraints/criteria from guidelines |

## Workflow Summary

```
1. Check existing tests: list_tests(goal_id, agent_path)
   → Scans exports/{agent}/tests/test_*.py
   ↓
2. Get test guidelines: generate_constraint_tests, generate_success_tests
   → Returns file_header, test_template, constraints/criteria, guidelines
   ↓
3. Write tests: Use Write tool with the provided guidelines
   → Write tests to exports/{agent}/tests/test_*.py
   ↓
4. Run tests: run_tests(goal_id, agent_path)
   → Executes: pytest exports/{agent}/tests/ -v
   ↓
5. Debug failures: debug_test(goal_id, test_name, agent_path)
   → Re-runs single test with verbose output
   ↓
6. Fix based on category:
   - IMPLEMENTATION_ERROR → Edit agent code directly
   - ASSERTION_FAILURE → Fix agent logic or update test
   - IMPORT_ERROR → Check package structure
   - API_ERROR → Check API keys and connectivity
   ↓
7. Re-run tests: run_tests(goal_id, agent_path)
   ↓
8. Repeat until all pass ✅
```

## MCP Tools Reference

```python
# Check existing tests (scans Python test files)
mcp__agent-builder__list_tests(
    goal_id="your-goal-id",
    agent_path="exports/your_agent"
)

# Get constraint test guidelines (returns templates and guidelines, NOT generated tests)
mcp__agent-builder__generate_constraint_tests(
    goal_id="your-goal-id",
    goal_json='{"id": "...", "constraints": [...]}',
    agent_path="exports/your_agent"
)
# Returns: output_file, file_header, test_template, constraints_formatted, test_guidelines

# Get success criteria test guidelines
mcp__agent-builder__generate_success_tests(
    goal_id="your-goal-id",
    goal_json='{"id": "...", "success_criteria": [...]}',
    node_names="node1,node2",
    tool_names="tool1,tool2",
    agent_path="exports/your_agent"
)
# Returns: output_file, file_header, test_template, success_criteria_formatted, test_guidelines

# Run tests via pytest subprocess
mcp__agent-builder__run_tests(
    goal_id="your-goal-id",
    agent_path="exports/your_agent"
)

# Debug a failed test (re-runs with verbose output)
mcp__agent-builder__debug_test(
    goal_id="your-goal-id",
    test_name="test_constraint_foo",
    agent_path="exports/your_agent"
)
```

## run_tests Options

```python
# Run only constraint tests
mcp__agent-builder__run_tests(
    goal_id="your-goal-id",
    agent_path="exports/your_agent",
    test_types='["constraint"]'
)

# Run only success criteria tests
mcp__agent-builder__run_tests(
    goal_id="your-goal-id",
    agent_path="exports/your_agent",
    test_types='["success"]'
)

# Run with pytest-xdist parallelism (requires pytest-xdist)
mcp__agent-builder__run_tests(
    goal_id="your-goal-id",
    agent_path="exports/your_agent",
    parallel=4
)

# Stop on first failure
mcp__agent-builder__run_tests(
    goal_id="your-goal-id",
    agent_path="exports/your_agent",
    fail_fast=True
)
```

## Direct pytest Commands

You can also run tests directly with pytest (the MCP tools use pytest internally):

```bash
# Run all tests
pytest exports/your_agent/tests/ -v

# Run specific test file
pytest exports/your_agent/tests/test_constraints.py -v

# Run specific test
pytest exports/your_agent/tests/test_constraints.py::test_constraint_foo -vvs

# Run in mock mode (structure validation only)
MOCK_MODE=1 pytest exports/your_agent/tests/ -v
```

---

**MCP tools generate tests, write them to Python files, and run them via pytest.**
testing-agent | SkillHub