context-engineering
Principles for designing context-efficient AI agents and tools. Use when designing LLM tools, agents, MCP servers, or multi-agent systems.
Packaged view
This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.
Install command
npx @skill-hub/cli install byunk-context-engineering
Repository
Skill path: minimal-claude-code/skills/context-engineering
Best for
Primary workflow: Analyze Data & AI.
Technical facets: Full Stack, Data / AI, Integration.
Target audience: everyone.
License: Unknown.
Original source
Catalog source: SkillHub Club.
Repository owner: Byunk.
This is a mirrored public skill entry. Review the repository before installing it into production workflows.
What it helps with
- Install context-engineering into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
- Review https://github.com/Byunk/minimal-claude-code before adding context-engineering to shared team environments
- Use context-engineering for development workflows
Works across
Favorites: 0.
Sub-skills: 0.
Aggregator: No.
Original source / Raw SKILL.md
---
name: context-engineering
description: Principles for designing context-efficient AI agents and tools. Use when designing LLM tools, agents, MCP servers, or multi-agent systems.
---

# Context Engineering

Principles for maximizing LLM effectiveness by treating context as a finite resource.

## Core Principle

Find the smallest possible set of high-signal tokens that maximize the likelihood of your desired outcome.

## The Context Budget

LLMs have an "attention budget" that depletes with each token. Context rot causes recall accuracy to decrease as token count grows. Every design decision should optimize for signal density.

## Quick Reference

| Challenge | Strategy | Reference |
| --------- | -------- | --------- |
| Too many tools | Curate minimal viable set | [Tool](references/tool.md) |
| Ambiguous tool selection | Self-contained, unambiguous tools | [Tool](references/tool.md) |
| Context pollution over time | Compaction and summarization | [Agent](references/agent.md) |
| Long-horizon tasks | External memory and note-taking | [Agent](references/agent.md) |
| Exceeding single context limits | Sub-agent architectures | [Multi-Agent](references/multi-agent.md) |
| MCP server bloat | Token-efficient responses | [MCP](references/mcp.md) |
| Measuring effectiveness | End-state evaluation | [Evaluation](references/evaluation.md) |

## Single vs Multi-Agent

Multi-agent adds ~15x token overhead. Use single agent unless:

| Factor | Single Agent | Multi-Agent |
| ------ | ------------ | ----------- |
| Parallelization | Sequential steps | Independent subtasks |
| Context size | Fits in window | Exceeds single context |
| Tool complexity | Focused toolset | Many specialized tools |
| Dependencies | Steps depend on each other | Work can be isolated |

Default to single agent. Add agents only when parallelization or context limits demand it.

## Decision Checklists

### Before Adding to Context

- Is this the minimum information needed?
- Can an agent discover this just-in-time instead?
- Does this justify its token cost?

### Tool Design

- Can a human definitively say which tool to use?
- Does each tool have a distinct, non-overlapping purpose?
- Are responses token-efficient with high signal?
- Do error messages guide toward solutions?

### Agent Design

- Does the system prompt strike the right altitude?
- Are there mechanisms for compaction when context grows?
- Is external memory used for long-horizon tracking?
- Are canonical examples provided instead of exhaustive rules?

### Multi-Agent

- Is the task parallelizable enough to justify coordination overhead?
- Do sub-agents return condensed summaries (not raw results)?
- Is there clear separation of concerns between agents?

## Key Techniques

### Just-in-Time Retrieval

Keep lightweight identifiers (paths, queries, links). Load data dynamically at runtime rather than pre-loading everything upfront.

### Progressive Disclosure

Let agents discover context through exploration. File sizes suggest complexity; naming hints at purpose. Each interaction yields context for the next decision.

### Compaction

Summarize conversations nearing limits. Preserve architectural decisions and critical details; discard redundant tool outputs and verbose messages.

### Structured Note-Taking

Persist notes to external memory (to-do lists, NOTES.md). Pull back into context when needed. Tracks progress without exhausting working context.

### Sub-Agent Distribution

Delegate focused tasks to specialized agents with clean context windows. Each sub-agent explores extensively but returns only condensed summaries (1000-2000 tokens).

## The Golden Rule

Do the simplest thing that works. Start minimal, add complexity only based on observed failure modes.
## References

- [Tool](references/tool.md) - Building self-contained, token-efficient tools
- [Agent](references/agent.md) - Single agent context management
- [Multi-Agent](references/multi-agent.md) - Coordinating multiple agents
- [MCP](references/mcp.md) - Model Context Protocol best practices
- [Evaluation](references/evaluation.md) - Measuring context engineering effectiveness

## Examples

Complete examples from Claude Code:

### Tool Descriptions

- [Bash](examples/tool-bash-example.md) - Boundaries, when NOT to use, good/bad examples
- [Edit](examples/tool-edit-example.md) - Prerequisites, error guidance, concise design
- [Grep](examples/tool-grep-example.md) - Exclusivity, parameter examples, output modes

### Agent Prompts

- [Explore](examples/agent-explore-example.md) - Role definition, constraints, strengths
- [Plan](examples/agent-plan-example.md) - Process steps, output format, boundaries
- [Summarization](examples/agent-summarization-example.md) - Compaction structure, what to preserve

---

## Referenced Files

> The following files are referenced in this skill and included for context.

### references/tool.md

# Effective Tool Design Principles

Principles for building tools that LLMs can use effectively.

## Self-Containment

Tools should be robust and clear in purpose, similar to well-designed functions in a codebase.

**Key requirements:**

- Each tool has a distinct, non-overlapping function
- A human could definitively say which tool to use for a given task
- Tools don't require external context to understand their purpose
- Each tool works independently without implicit dependencies

## Parameter Design

### Naming

- Use descriptive, unambiguous names (`user_id` not `user`, `created_after` not `date`)
- Play to the model's inherent strengths (natural language understanding)
- Avoid abbreviations that require domain knowledge

### Descriptions

- Include helpful examples in parameter descriptions
- Specify formats explicitly (`YYYY-MM-DD`, `[email protected]`)
- Document constraints (min/max values, allowed characters)
- Show what valid inputs look like

### Types

- Use JSON Schema for input validation
- Leverage enums for constrained choices
- Make required vs optional explicit

## Response Design

### Signal Density

Return only high-signal information. Every token returned should contribute to the agent's decision-making.

**Good:** `{"status": "success", "user_count": 42}`

**Bad:** `{"status": "success", "message": "The operation completed successfully", "timestamp": "...", "request_id": "...", "user_count": 42, "metadata": {...}}`

### Identifiers

Use semantically meaningful identifiers over opaque UUIDs when possible:

- `orders/2024/march/invoice-1234` > `a1b2c3d4-e5f6-...`
- Include human-readable context in IDs

### Pagination and Filtering

- Implement pagination for large result sets
- Support filtering to reduce irrelevant results
- Consider a `response_format` enum (concise vs detailed)
- Default to concise; let agents request more when needed

### Truncation

For text-heavy responses, truncate with clear indicators:

- Show beginning and end of long content
- Include total length and what was omitted
- Provide a way to fetch full content if needed

## Error Handling

Craft error messages that steer toward solutions:

**Good:**

```
Invalid date format. Expected YYYY-MM-DD, got "march 15". Example: 2024-03-15
```

**Bad:**

```
Error: Invalid input
```

**Principles:**

- Show the correct format
- Include an example of valid input
- Suggest alternative approaches when applicable
- Distinguish between user errors and system failures

## Anti-Patterns

### Tool Bloat

Wrapping every API endpoint as a separate tool overwhelms the agent. Curate a minimal viable set that covers common use cases.

### Ambiguous Boundaries

If two tools could plausibly handle the same request, the agent will struggle. Ensure clear, non-overlapping purposes.

### Verbose Defaults

Returning full objects when only a field is needed wastes context. Default to minimal responses.

### Hidden Dependencies

Tools that require calling other tools first without documenting this create confusion. Make dependencies explicit.

### Cryptic Outputs

Returning raw IDs or codes without context forces agents to make additional calls. Include enough context to act on results.

## Testing Tool Descriptions

The quality of tool descriptions directly impacts agent performance:

1. Have the agent describe when it would use each tool
2. Present ambiguous scenarios and check tool selection
3. Let agents suggest improvements to descriptions
4. Track which tools are over/under-used

## Summary

A well-designed tool:

- Has a clear, unique purpose
- Uses descriptive parameter names with examples
- Returns minimal, high-signal responses
- Provides actionable error messages
- Works independently without hidden dependencies

### references/agent.md

# Effective Agent Design Principles

Principles for designing agents with optimal context usage.

## System Prompt Altitude

Balance between rigid hardcoded logic and vague guidance.

### Too Rigid (Low Altitude)

```
If user says "hello", respond with "Hi there!"
If user asks about weather, call weather_api with location parameter
If user mentions "error", ask for the error message
```

Problems: Brittle, doesn't generalize, breaks on variations.

### Too Vague (High Altitude)

```
Be helpful and use tools when appropriate.
```

Problems: No guidance on tool selection, inconsistent behavior.

### Right Altitude

```
Use search tools to find information before answering factual questions.
When multiple sources conflict, prefer official documentation over forums.
If a query is ambiguous, ask one clarifying question before proceeding.
```

Specific enough to guide behavior, flexible enough to serve as heuristics.

### Structure

- Use XML tags or Markdown headers to delineate sections
- Separate background info from behavioral instructions
- Group tool guidance together
- Place output format expectations at the end

## Just-in-Time Retrieval

Maintain lightweight identifiers; load data dynamically at runtime.

### Identifiers Over Data

Keep in context:

- File paths, not file contents
- Query patterns, not query results
- API endpoints, not API responses
- Document IDs, not document text

### Leveraging Metadata

Use readily available signals before loading full content:

- File names hint at purpose
- Directory structure reveals organization
- Timestamps indicate recency
- File sizes suggest complexity

### Hybrid Strategy

Combine approaches:

1. Upfront retrieval for speed on known patterns
2. Autonomous exploration for discovery of unknowns

## Progressive Disclosure

Let agents incrementally discover context through exploration.

### Exploration Flow

1. Start with high-level overview (directory listing, index)
2. Drill into relevant areas based on task
3. Load detailed content only when needed
4. Each interaction yields context for the next decision

### Self-Managed Context

The agent decides what's relevant, keeping focus on the useful subset rather than loading everything upfront.
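The identifiers-over-data idea can be sketched as follows: keep only paths and sizes in context, and load (truncated) content once the agent decides a file matters. The helper names are illustrative, not a real framework API.

```python
import os
import tempfile

def list_identifiers(root: str) -> list[str]:
    """Return lightweight references (path + size), not file contents."""
    refs = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            # Cheap metadata signal: size hints at complexity.
            refs.append(f"{full} ({os.path.getsize(full)} bytes)")
    return refs

def load_just_in_time(path: str, max_chars: int = 2000) -> str:
    """Load content only when needed, truncating with a clear indicator."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    if len(text) > max_chars:
        return text[:max_chars] + f"\n[truncated: {len(text) - max_chars} chars omitted]"
    return text

# Usage: survey identifiers first, then drill into one file on demand.
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "auth.py"), "w", encoding="utf-8") as f:
        f.write("def check_token(token): ...\n")
    refs = list_identifiers(root)
    body = load_just_in_time(os.path.join(root, "auth.py"))
```

Only `refs` needs to stay in the agent's context; `body` is fetched, used, and can be discarded or compacted afterward.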
### Information Architecture

Structure information to support progressive discovery:

- Indices that summarize what's available
- Clear naming that hints at content
- Logical groupings that match common queries

## Compaction Strategies

When context grows large, compress while preserving critical information.

### What to Preserve

- Architectural decisions made
- Unresolved bugs and edge cases
- Key implementation details
- User preferences expressed
- Dependencies between components

### What to Discard

- Redundant tool outputs (keep summary, drop raw data)
- Verbose intermediate messages
- Exploratory dead ends
- Repeated information

### Compaction Approaches

**Tool result clearing:** Lightest touch. Remove raw tool outputs after extracting key findings.

**Conversation summarization:** Condense dialogue while preserving decisions and open questions.

**Checkpoint creation:** Save state to external file, start fresh context with checkpoint reference.

### Tuning

Start with maximum recall (keep everything), then iterate to improve precision based on what actually gets used.

## Structured Note-Taking

Persist information outside the context window for later retrieval.

### Use Cases

- To-do lists tracking remaining work
- NOTES.md capturing decisions and rationale
- Dependency graphs showing what blocks what
- Progress logs for long-running tasks

### Benefits

- Survives context window limits
- Can be pulled back selectively
- Creates audit trail
- Enables handoffs between sessions

### Implementation

1. Agent writes notes to persistent storage (file, database)
2. Notes include enough context to be useful later
3. Agent retrieves notes when resuming or when relevant
4. Notes are updated as work progresses

## Example Curation

Provide diverse, canonical examples rather than exhaustive rules.

### Examples Over Rules

**Less effective:**

```
Handle errors gracefully. Check for null values. Validate inputs before processing.
Use try-catch blocks. Log errors with context. Return meaningful error messages.
```

**More effective:**

```
Example: Handling a missing user
Input: get_user(id="nonexistent")
Response: "User 'nonexistent' not found. To list available users, try list_users().
To create a new user, try create_user(name='...')."
```

### Example Selection

- Cover the most common patterns
- Show edge cases that frequently cause errors
- Demonstrate the expected reasoning process
- Include both positive and negative examples

### Why Examples Work

For LLMs, examples function like "pictures worth 1000 words." They communicate:

- Expected format
- Appropriate level of detail
- Reasoning patterns
- Edge case handling

All in a form the model can pattern-match against.

## Summary

An effective single-agent system:

- Uses system prompts at the right altitude
- Keeps lightweight references, loads data just-in-time
- Lets the agent discover context progressively
- Compacts aggressively to preserve context budget
- Maintains external notes for long-horizon tracking
- Demonstrates behavior through canonical examples

### references/multi-agent.md

# Multi-Agent System Principles

Principles for designing systems where multiple agents coordinate to accomplish tasks.

## Orchestrator-Worker Pattern

The most common multi-agent architecture.

### Structure

```
Orchestrator (Lead Agent)
├── Worker A (focused task)
├── Worker B (focused task)
└── Worker C (focused task)
```

### Orchestrator Role

- Receives user request
- Breaks down into subtasks
- Assigns work to workers
- Synthesizes results
- Handles errors and retries

### Worker Role

- Receives focused task description
- Has clean context window
- Explores extensively within scope
- Returns condensed summary (1000-2000 tokens)
- Signals completion or blockers

### Context Isolation

Search context stays in worker agents. Orchestrator only sees summaries, keeping its context clean for coordination.

## Delegation Principles

### Clear Task Specifications

Each worker needs:

1. **Objective:** What to accomplish
2. **Output format:** How to structure results
3. **Tool guidance:** Which tools are available/preferred
4. **Boundaries:** What's in/out of scope

### Effort Scaling

Match worker effort to task complexity:

| Complexity | Approach |
|------------|----------|
| Simple | 1 agent, 3-10 tool calls |
| Medium | 2-3 agents, parallel exploration |
| Complex | 10+ agents, divided responsibilities |

Provide explicit guidelines so the orchestrator can gauge appropriate effort.

### Avoid Vague Delegation

**Bad:** "Look into the authentication system"

**Good:** "Find all files implementing JWT token validation. Return: file paths, key functions, and any security concerns identified."

## Coordination Challenges

### Duplicate Work

Multiple workers may explore overlapping areas. Mitigate by:

- Clear scope boundaries in task descriptions
- Deduplication in result synthesis
- Explicit "do not investigate" constraints

### Coverage Gaps

Workers may each assume another handled an area. Mitigate by:

- Exhaustive task breakdown before delegation
- Verification step checking coverage
- Overlapping slightly rather than leaving gaps

### Information Loss

"Telephone game" effects as information passes through layers. Mitigate by:

- Workers write to filesystem, not just return text
- Preserve source references
- Allow orchestrator to request clarification

### Coordination Overhead

Time spent coordinating vs doing actual work. Balance by:

- Not over-decomposing simple tasks
- Using parallel tool calls (90% time reduction possible)
- Letting workers handle sub-decomposition

## Context Handoffs

Managing context across agent boundaries and sessions.

### Pre-Truncation Saves

Before context window fills:

1. Summarize current state
2. Save to external storage
3. Continue with reference to saved state

### Fresh Agent Spawning

When limits approach:

1. Spawn new agent with clean context
2. Pass essential context (decisions, open questions)
3. Include pointer to detailed notes
4. Terminate full-context agent

### Filesystem as Memory

Workers output to files rather than just returning text:

- Preserves full fidelity
- Enables async consumption
- Survives agent termination
- Allows selective loading

### Lightweight References

Pass between agents:

- File paths to detailed content
- Query patterns for retrieval
- Summary with "see X for details"

## Architecture Patterns

### Fan-Out/Fan-In

```
Orchestrator ─┬─> Worker A ─┐
              ├─> Worker B ─┼─> Orchestrator (synthesis)
              └─> Worker C ─┘
```

Best for: Parallel search, independent analysis, batch processing.

### Pipeline

```
Worker A ─> Worker B ─> Worker C ─> Result
```

Best for: Sequential transformations, refinement chains.

### Hierarchical

```
Lead ─┬─> Manager A ─┬─> Worker A1
      │              └─> Worker A2
      └─> Manager B ─┬─> Worker B1
                     └─> Worker B2
```

Best for: Large-scale tasks, domain specialization.

## Summary

Effective multi-agent systems:

- Are used when parallelization or context overflow justifies overhead
- Employ orchestrator-worker pattern with clean context separation
- Provide clear, bounded task specifications to workers
- Return condensed summaries rather than raw results
- Use filesystem for persistent handoffs
- Scale coordination effort to task complexity

### references/mcp.md

# MCP Design Principles

Best practices for designing Model Context Protocol servers that agents can use effectively.

## Tool Definition

### Clear Descriptions

Tool descriptions are the primary way agents understand capabilities. Write them for LLM consumption:

```json
{
  "name": "search_documents",
  "description": "Search documents by content. Returns matching excerpts with context. Use for finding specific information within documents. For listing all documents, use list_documents instead."
}
```

Include:

- What the tool does
- When to use it
- When NOT to use it (if ambiguous with other tools)

### Input Schema

Use JSON Schema with clear constraints:

```json
{
  "type": "object",
  "properties": {
    "query": {
      "type": "string",
      "description": "Search terms. Supports AND/OR operators. Example: 'budget AND 2024'"
    },
    "limit": {
      "type": "integer",
      "default": 10,
      "minimum": 1,
      "maximum": 100,
      "description": "Maximum results to return"
    }
  },
  "required": ["query"]
}
```

### Output Schema

Define structured outputs when possible:

```json
{
  "outputSchema": {
    "type": "array",
    "items": {
      "type": "object",
      "properties": {
        "document_id": {"type": "string"},
        "excerpt": {"type": "string"},
        "relevance_score": {"type": "number"}
      }
    }
  }
}
```

Benefits:

- Agents understand response structure
- Enables structured content in responses
- Facilitates automated processing

### Unique Naming

- Use descriptive, action-oriented names
- Avoid generic names that could conflict

## Response Efficiency

### Minimal Payloads

Return only what agents need to make decisions:

**Good:**

```json
{
  "files": ["auth.py", "users.py"],
  "total": 2
}
```

**Bad:**

```json
{
  "status": "success",
  "timestamp": "2024-03-15T10:30:00Z",
  "request_id": "abc123",
  "files": [
    {"name": "auth.py", "size": 1234, "created": "...", "modified": "...", "permissions": "..."},
    {"name": "users.py", "size": 5678, "created": "...", "modified": "...", "permissions": "..."}
  ],
  "metadata": {"server": "...", "version": "..."}
}
```

### Pagination

For large result sets:

```json
{
  "results": [...],
  "next_cursor": "abc123",
  "has_more": true,
  "total_count": 1000
}
```

- Default to reasonable page sizes (10-50)
- Include cursor for continuation
- Show total when available
- Let agents request specific pages

### Filtering

Support server-side filtering to reduce transferred data:

```json
{
  "filters": {
    "created_after": "2024-01-01",
    "status": ["open", "in_progress"],
    "limit": 20
  }
}
```

### Content Types

Use appropriate content types:

- `text` for human-readable content
- `image` for visual data (base64 or URL)
- `resource` for references to additional data

## Error Handling

### Protocol vs Tool Errors

**Protocol errors (JSON-RPC):** System-level issues

- Invalid request format
- Unknown method
- Server initialization failure

**Tool errors:** Execution issues

- Set `isError: true` in response
- Include actionable message in content

### Actionable Messages

```json
{
  "isError": true,
  "content": [
    {
      "type": "text",
      "text": "Repository 'myrepo' not found. Available repositories: repo-a, repo-b, repo-c. Use list_repositories for full list."
    }
  ]
}
```

Include:

- What went wrong
- Why it happened (if known)
- What to do instead

### Rate Limiting

Handle gracefully:

```json
{
  "isError": true,
  "content": [
    {
      "type": "text",
      "text": "Rate limit exceeded. Retry after 60 seconds. Consider using batch_search for multiple queries."
    }
  ]
}
```

## Security

### Input Validation

Validate all tool inputs:

- Type checking
- Range validation
- Format verification
- Injection prevention

### Access Controls

- Implement authentication
- Check authorization per tool
- Scope access appropriately
- Fail closed on errors

### Rate Limiting

- Per-client limits
- Per-tool limits
- Burst allowances
- Clear feedback when limited

### Output Sanitization

- Remove sensitive data
- Redact credentials
- Filter internal details
- Audit logged outputs

## Resource Management

### Resource Links

Expose additional context through resources:

```json
{
  "type": "resource",
  "resource": {
    "uri": "file:///docs/api-reference.md",
    "mimeType": "text/markdown"
  }
}
```

Use when:

- Content is large
- Content may not be needed
- Multiple tools reference same content

### Embedded Resources

Embed when content is essential:

```json
{
  "type": "resource",
  "resource": {
    "uri": "inline:///current-config",
    "text": "...",
    "mimeType": "application/json"
  }
}
```

### Annotations

Provide hints about resource usage:

```json
{
  "annotations": {
    "audience": ["developer"],
    "priority": 0.8
  }
}
```

### Change Notifications

Support `notifications/resources/list_changed` when:

- Resources are dynamic
- Agents need to refresh state
- Background changes occur

## Integration Patterns

### Prefer Specific Tools

**Better:** `create_github_issue`, `search_github_issues`, `close_github_issue`

**Worse:** `github_api` with action parameter

Specific tools:

- Have clearer descriptions
- Enable better tool selection
- Simplify error handling

### Match User Intent

Design tools around what users want to accomplish, not just API coverage:

- `summarize_pr` vs raw `get_pr_diff` + `get_pr_comments`
- `find_relevant_code` vs generic `search_files`

### Testing

Test tool descriptions by:

1. Presenting scenarios to agents
2. Checking tool selection accuracy
3. Reviewing reasoning for selection
4. Iterating on descriptions

## Summary

Effective MCP servers:

- Define tools with clear, unambiguous descriptions
- Return minimal, high-signal responses
- Support pagination and filtering
- Provide actionable error messages
- Implement security at every layer
- Use resources for large/optional content
- Design tools around user intent

### references/evaluation.md

# Evaluation Principles

Principles for measuring and improving context engineering effectiveness.

## Start Small, Start Early

### Don't Wait

Begin evaluation with ~20 queries representing real usage patterns. Don't delay until you can build a comprehensive test suite.

### Early Leverage

Early in development, changes have dramatic impact. There's abundant low-hanging fruit. A small eval set catches major issues.

### Growth Path

```
Day 1:   5-10 representative queries
Week 1:  20-30 queries covering main use cases
Month 1: 50-100 queries with edge cases
Ongoing: Add queries from production failures
```

## End-State Evaluation

### Focus on Outcomes

Judge whether the correct final state was achieved, not whether a specific process was followed.

**Good:** "Did the agent find the correct answer?"
**Bad:** "Did the agent use tool X then tool Y then tool Z?"

### Multiple Valid Paths

Acknowledge that agents may take different routes to correct answers. All valid paths should pass evaluation.

### Checkpoint Decomposition

For complex workflows, break into discrete checkpoints:

1. Did the agent identify the right files?
2. Did the agent understand the problem?
3. Did the agent implement a correct solution?
4. Did the tests pass?

Evaluate each checkpoint independently.

## LLM-as-Judge

Use an LLM to evaluate free-form outputs against rubric criteria.

### Core Criteria

| Criterion | Description |
|-----------|-------------|
| Factual accuracy | Claims match the source material |
| Citation accuracy | Sources actually support the claims |
| Completeness | All relevant aspects covered |
| Source quality | Primary sources preferred over secondary |
| Tool efficiency | Right tools used, reasonable call count |

### Scoring

A single LLM call with 0.0-1.0 scores per criterion works well:

```
Evaluate this agent response against the following criteria.
Score each from 0.0 (completely fails) to 1.0 (fully meets).

Response: {agent_response}
Ground truth: {expected_answer}

Criteria:
- Factual accuracy: Does the response match known facts?
- Completeness: Are all key points covered?
- Conciseness: Is it free of unnecessary content?

Output JSON: {"factual_accuracy": 0.X, "completeness": 0.X, "conciseness": 0.X}
```

### Calibration

Run LLM-as-judge on known good/bad examples to calibrate thresholds.

## Human Evaluation

Automated evaluation catches common issues. Human evaluation catches subtle problems.

### What Humans Catch

- Edge cases automation misses
- Hallucinations on unusual queries
- Subtle source selection biases
- Quality issues hard to quantify
- User experience problems

### Sampling Strategy

- Review a random sample of production queries
- Focus extra attention on low-confidence or unusual queries
- Track patterns in human-identified issues
- Add problematic queries to automated suite

### Even With Automation

Human evaluation remains essential. Budget time for regular manual review even as automated coverage grows.

## Key Metrics

### Primary: Token Usage

Token usage explains approximately 80% of performance variance. Track:

- Total tokens per task
- Context utilization ratio
- Token growth over conversation
- Compaction effectiveness

### Secondary Metrics

| Metric | What It Reveals |
|--------|-----------------|
| Tool call count | Efficiency of tool selection |
| Time to completion | Overall speed |
| Retry rate | Error handling quality |
| Context window utilization | How much budget consumed |

### Model Choice Impact

Different models have different cost/performance profiles. Track metrics per model to inform selection.

## Iteration Approach

### Build Simulations

Create simulations with exact prompts and tools:

1. Capture real queries
2. Replay against agent
3. Compare behavior to expected
4. Identify divergences

### Watch Agents Work

Step through agent reasoning:

1. What did the agent observe?
2. What did it decide to do?
3. What was the result?
4. Where did reasoning fail?
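The LLM-as-judge scoring described earlier can be wired into an automated suite with very little code. This is a minimal sketch: `call_llm` stands in for whatever model client you actually use, and the 0.7 threshold is an assumption you would calibrate on known good/bad examples.

```python
import json

RUBRIC = """Evaluate this agent response against the following criteria.
Score each from 0.0 (completely fails) to 1.0 (fully meets).

Response: {response}
Ground truth: {truth}

Output JSON: {{"factual_accuracy": 0.0, "completeness": 0.0, "conciseness": 0.0}}"""

def judge(response: str, truth: str, call_llm) -> dict[str, float]:
    """One LLM call per example; expects a JSON object of scores back."""
    raw = call_llm(RUBRIC.format(response=response, truth=truth))
    return {k: float(v) for k, v in json.loads(raw).items()}

def passes(scores: dict[str, float], threshold: float = 0.7) -> bool:
    """Calibrate the threshold on known good/bad examples first."""
    return all(v >= threshold for v in scores.values())

# Usage with a stubbed model client for illustration:
stub = lambda prompt: '{"factual_accuracy": 0.9, "completeness": 0.8, "conciseness": 1.0}'
scores = judge("42 active users", "42", stub)
print(passes(scores))  # → True
```

Because the judge returns structured scores rather than prose, failures can be tracked per criterion over time, which is what makes the sampling and calibration steps above actionable.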
### Common Failure Modes | Failure | Symptom | Fix | |---------|---------|-----| | Continuing when done | Extra tool calls after answer found | Clearer completion criteria | | Verbose queries | Long tool inputs wasting context | Examples of concise queries | | Wrong tool selection | Similar tools confused | Better descriptions, namespacing | | Context exhaustion | Quality degrades late in task | Compaction, sub-agents | | Hallucination | Made-up information | Stricter source requirements | ### Mental Model Development Build an accurate model of how the agent behaves. This intuition enables: - Predicting where failures will occur - Designing effective interventions - Prioritizing improvements ## Evaluation Design ### Query Requirements Good evaluation queries are: | Requirement | Description | |-------------|-------------| | Independent | Not dependent on other queries | | Read-only | Don't modify state | | Complex | Require multiple steps | | Realistic | Based on real use cases | | Verifiable | Clear, checkable answer | | Stable | Answer won't change over time | ### Answer Verification For each query: 1. Solve it yourself 2. Verify the answer is correct 3. Confirm answer is unambiguous 4. 
Check answer remains stable ### Output Format Structure evaluations for automated running: ```xml <evaluation> <qa_pair> <question>How many active users were created in March 2024?</question> <answer>42</answer> </qa_pair> </evaluation> ``` ## Summary Effective evaluation: - Starts small and early, growing over time - Focuses on end-state correctness, not specific paths - Uses LLM-as-judge for scalable assessment - Maintains human review for edge cases - Tracks token usage as primary metric - Builds simulations to understand failures - Develops accurate mental models of agent behavior ``` ### examples/tool-bash-example.md ```markdown # Bash Tool Description Principles: self-containment, boundaries, when NOT to use, good/bad examples, parameter descriptions --- Executes a given bash command with optional timeout. Working directory persists between commands; shell state (everything else) does not. The shell environment is initialized from the user's profile (bash or zsh). IMPORTANT: This tool is for terminal operations like git, npm, docker, etc. DO NOT use it for file operations (reading, writing, editing, searching, finding files) - use the specialized tools for this instead. Before executing the command, please follow these steps: 1. Directory Verification: - If the command will create new directories or files, first use `ls` to verify the parent directory exists and is the correct location - For example, before running "mkdir foo/bar", first use `ls foo` to check that "foo" exists and is the intended parent directory 2. Command Execution: - Always quote file paths that contain spaces with double quotes (e.g., cd "path with spaces/file.txt") - Examples of proper quoting: - cd "/Users/name/My Documents" (correct) - cd /Users/name/My Documents (incorrect - will fail) - python "/path/with spaces/script.py" (correct) - python /path/with spaces/script.py (incorrect - will fail) - After ensuring proper quoting, execute the command. - Capture the output of the command. 
Usage notes:
- The command argument is required.
- You can specify an optional timeout in milliseconds (up to 120 minutes). If not specified, commands will time out after 30 minutes.
- It is very helpful if you write a clear, concise description of what this command does. For simple commands, keep it brief (5-10 words). For complex commands (piped commands, obscure flags, or anything hard to understand at a glance), add enough context to clarify what it does.
- If the output exceeds 30000 characters, output will be truncated before being returned to you.
- Avoid using Bash with the `find`, `grep`, `cat`, `head`, `tail`, `sed`, `awk`, or `echo` commands, unless explicitly instructed or when these commands are truly necessary for the task. Instead, always prefer using the dedicated tools for these commands:
  - File search: Use Glob (NOT find or ls)
  - Content search: Use Grep (NOT grep or rg)
  - Read files: Use Read (NOT cat/head/tail)
  - Edit files: Use Edit (NOT sed/awk)
  - Write files: Use Write (NOT echo >/cat <<EOF)
  - Communication: Output text directly (NOT echo/printf)
- When issuing multiple commands:
  - If the commands are independent and can run in parallel, make multiple Bash tool calls in a single message. For example, if you need to run "git status" and "git diff", send a single message with two Bash tool calls in parallel.
  - If the commands depend on each other and must run sequentially, use a single Bash call with '&&' to chain them together (e.g., `git add . && git commit -m "message" && git push`). For instance, if one operation must complete before another starts (like mkdir before cp, Write before Bash for git operations, or git add before git commit), run these operations sequentially instead.
  - Use ';' only when you need to run commands sequentially but don't care if earlier commands fail
  - DO NOT use newlines to separate commands (newlines are ok in quoted strings)
- Try to maintain your current working directory throughout the session by using absolute paths and avoiding usage of `cd`. You may use `cd` if the User explicitly requests it.

<good-example>
pytest /foo/bar/tests
</good-example>

<bad-example>
cd /foo/bar && pytest tests
</bad-example>
```

### examples/tool-edit-example.md

```markdown
# Edit Tool Description

Principles: prerequisites, error guidance with solutions, concise design

---

Performs exact string replacements in files.

Usage:
- You must use your Read tool at least once in the conversation before editing. This tool will error if you attempt an edit without reading the file.
- When editing text from Read tool output, ensure you preserve the exact indentation (tabs/spaces) as it appears AFTER the line number prefix. The line number prefix format is: spaces + line number + tab. Everything after that tab is the actual file content to match. Never include any part of the line number prefix in the old_string or new_string.
- ALWAYS prefer editing existing files in the codebase. NEVER write new files unless explicitly required.
- Only use emojis if the user explicitly requests it. Avoid adding emojis to files unless asked.
- The edit will FAIL if `old_string` is not unique in the file. Either provide a larger string with more surrounding context to make it unique or use `replace_all` to change every instance of `old_string`.
- Use `replace_all` for replacing and renaming strings across the file. This parameter is useful if you want to rename a variable, for instance.
```

### examples/tool-grep-example.md

```markdown
# Grep Tool Description

Principles: exclusivity directives, parameter examples inline, output modes, escalation path

---

A powerful search tool built on ripgrep

Usage:
- ALWAYS use Grep for search tasks.
- NEVER invoke `grep` or `rg` as a Bash command. The Grep tool has been optimized for correct permissions and access.
- Supports full regex syntax (e.g., "log.*Error", "function\\s+\\w+")
- Filter files with glob parameter (e.g., "*.js", "**/*.tsx") or type parameter (e.g., "js", "py", "rust")
- Output modes: "content" shows matching lines, "files_with_matches" shows only file paths (default), "count" shows match counts
- Use Task tool for open-ended searches requiring multiple rounds
- Pattern syntax: Uses ripgrep (not grep) - literal braces need escaping (use `interface\\{\\}` to find `interface{}` in Go code)
- Multiline matching: By default patterns match within single lines only. For cross-line patterns like `struct \\{[\\s\\S]*?field`, use `multiline: true`
```

### examples/agent-explore-example.md

```markdown
# Explore Agent System Prompt

Principles: role definition, explicit constraints, strengths listing, tool guidelines, performance notes

---

You are a file search specialist for Claude Code. You excel at thoroughly navigating and exploring codebases.

=== CRITICAL: READ-ONLY MODE - NO FILE MODIFICATIONS ===

This is a READ-ONLY exploration task. You are STRICTLY PROHIBITED from:
- Creating new files (no Write, touch, or file creation of any kind)
- Modifying existing files (no Edit operations)
- Deleting files (no rm or deletion)
- Moving or copying files (no mv or cp)
- Creating temporary files anywhere, including /tmp
- Using redirect operators (>, >>, |) or heredocs to write to files
- Running ANY commands that change system state

Your role is EXCLUSIVELY to search and analyze existing code. You do NOT have access to file editing tools - attempting to edit files will fail.
Your strengths:
- Rapidly finding files using glob patterns
- Searching code and text with powerful regex patterns
- Reading and analyzing file contents

Guidelines:
- Use Glob for broad file pattern matching
- Use Grep for searching file contents with regex
- Use Read when you know the specific file path you need to read
- Use Bash ONLY for read-only operations (ls, git status, git log, git diff, find, cat, head, tail)
- NEVER use Bash for: mkdir, touch, rm, cp, mv, git add, git commit, npm install, pip install, or any file creation/modification
- Adapt your search approach based on the thoroughness level specified by the caller
- Return file paths as absolute paths in your final response
- For clear communication, avoid using emojis
- Communicate your final report directly as a regular message - do NOT attempt to create files

NOTE: You are meant to be a fast agent that returns output as quickly as possible. In order to achieve this you must:
- Make efficient use of the tools that you have at your disposal: be smart about how you search for files and implementations
- Wherever possible you should try to spawn multiple parallel tool calls for grepping and reading files

Complete the user's search request efficiently and report your findings clearly.
```

### examples/agent-plan-example.md

```markdown
# Plan Agent System Prompt

Principles: process steps, required output format, constraint reinforcement, tool guidance

---

You are a software architect and planning specialist for Claude Code. Your role is to explore the codebase and design implementation plans.

=== CRITICAL: READ-ONLY MODE - NO FILE MODIFICATIONS ===

This is a READ-ONLY planning task.
You are STRICTLY PROHIBITED from:
- Creating new files (no Write, touch, or file creation of any kind)
- Modifying existing files (no Edit operations)
- Deleting files (no rm or deletion)
- Moving or copying files (no mv or cp)
- Creating temporary files anywhere, including /tmp
- Using redirect operators (>, >>, |) or heredocs to write to files
- Running ANY commands that change system state

Your role is EXCLUSIVELY to explore the codebase and design implementation plans. You do NOT have access to file editing tools - attempting to edit files will fail.

You will be provided with a set of requirements and optionally a perspective on how to approach the design process.

## Your Process

1. **Understand Requirements**: Focus on the requirements provided and apply your assigned perspective throughout the design process.
2. **Explore Thoroughly**:
   - Read any files provided to you in the initial prompt
   - Find existing patterns and conventions using Glob, Grep, and Read
   - Understand the current architecture
   - Identify similar features as reference
   - Trace through relevant code paths
   - Use Bash ONLY for read-only operations (ls, git status, git log, git diff, find, cat, head, tail)
   - NEVER use Bash for: mkdir, touch, rm, cp, mv, git add, git commit, npm install, pip install, or any file creation/modification
3. **Design Solution**:
   - Create implementation approach based on your assigned perspective
   - Consider trade-offs and architectural decisions
   - Follow existing patterns where appropriate
4. **Detail the Plan**:
   - Provide step-by-step implementation strategy
   - Identify dependencies and sequencing
   - Anticipate potential challenges

## Required Output

End your response with:

### Critical Files for Implementation

List 3-5 files most critical for implementing this plan:
- path/to/file1.ts - [Brief reason: e.g., "Core logic to modify"]
- path/to/file2.ts - [Brief reason: e.g., "Interfaces to implement"]
- path/to/file3.ts - [Brief reason: e.g., "Pattern to follow"]

REMEMBER: You can ONLY explore and plan. You CANNOT and MUST NOT write, edit, or modify any files. You do NOT have access to file editing tools.
```

### examples/agent-summarization-example.md

```markdown
# Summarization Agent System Prompt

Principles: compaction structure, structured output sections, analysis before output, example format

---

Your task is to create a detailed summary of the conversation so far, paying close attention to the user's explicit requests and your previous actions. This summary should be thorough in capturing technical details, code patterns, and architectural decisions that would be essential for continuing development work without losing context.

Before providing your final summary, wrap your analysis in <analysis> tags to organize your thoughts and ensure you've covered all necessary points.

In your analysis process:

1. Chronologically analyze each message and section of the conversation. For each section thoroughly identify:
   - The user's explicit requests and intents
   - Your approach to addressing the user's requests
   - Key decisions, technical concepts and code patterns
   - Specific details like:
     - file names
     - full code snippets
     - function signatures
     - file edits
   - Errors that you ran into and how you fixed them
   - Pay special attention to specific user feedback that you received, especially if the user told you to do something differently.
2. Double-check for technical accuracy and completeness, addressing each required element thoroughly.
Your summary should include the following sections:

1. Primary Request and Intent: Capture all of the user's explicit requests and intents in detail.
2. Key Technical Concepts: List all important technical concepts, technologies, and frameworks discussed.
3. Files and Code Sections: Enumerate specific files and code sections examined, modified, or created. Pay special attention to the most recent messages, include full code snippets where applicable, and include a summary of why each file read or edit is important.
4. Errors and fixes: List all errors that you ran into, and how you fixed them. Pay special attention to specific user feedback that you received, especially if the user told you to do something differently.
5. Problem Solving: Document problems solved and any ongoing troubleshooting efforts.
6. All user messages: List ALL user messages that are not tool results. These are critical for understanding the user's feedback and changing intent.
7. Pending Tasks: Outline any pending tasks that you have explicitly been asked to work on.
8. Current Work: Describe in detail precisely what was being worked on immediately before this summary request, paying special attention to the most recent messages from both user and assistant. Include file names and code snippets where applicable.
9. Optional Next Step: List the next step that you will take that is related to the most recent work you were doing. IMPORTANT: ensure that this step is DIRECTLY in line with the user's most recent explicit requests, and the task you were working on immediately before this summary request. If your last task was concluded, then only list next steps if they are explicitly in line with the user's request. Do not start on tangential requests or really old requests that were already completed without confirming with the user first. If there is a next step, include direct quotes from the most recent conversation showing exactly what task you were working on and where you left off.
This should be verbatim to ensure there's no drift in task interpretation.

Here's an example of how your output should be structured:

<example>
<analysis>
[Your thought process, ensuring all points are covered thoroughly and accurately]
</analysis>
<summary>
1. Primary Request and Intent:
   [Detailed description]

2. Key Technical Concepts:
   - [Concept 1]
   - [Concept 2]
   - [...]

3. Files and Code Sections:
   - [File Name 1]
     - [Summary of why this file is important]
     - [Summary of the changes made to this file, if any]
     - [Important Code Snippet]
   - [File Name 2]
     - [Important Code Snippet]
   - [...]

4. Errors and fixes:
   - [Detailed description of error 1]:
     - [How you fixed the error]
     - [User feedback on the error if any]
   - [...]

5. Problem Solving:
   [Description of solved problems and ongoing troubleshooting]

6. All user messages:
   - [Detailed non tool use user message]
   - [...]

7. Pending Tasks:
   - [Task 1]
   - [Task 2]
   - [...]

8. Current Work:
   [Precise description of current work]

9. Optional Next Step:
   [Optional next step to take]
</summary>
</example>

Please provide your summary based on the conversation so far, following this structure and ensuring precision and thoroughness in your response.
```
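The summarization prompt above implies a compaction harness that discards the agent's `<analysis>` scratch work and carries only the `<summary>` block forward into the fresh context. A minimal sketch of that splitting step, using only Python's standard library (the `extract_summary` helper is hypothetical, not part of the original skill):

```python
import re


def extract_summary(response: str) -> str:
    """Pull the <summary> block out of a summarization-agent reply.

    The <analysis> section is throwaway reasoning; only the summary is
    kept and injected into the compacted context window.
    """
    match = re.search(r"<summary>(.*?)</summary>", response, re.DOTALL)
    if match is None:
        # Fall back to the whole reply if the agent omitted the tags.
        return response.strip()
    return match.group(1).strip()


reply = (
    "<analysis>\nChecked each message for explicit requests.\n</analysis>\n"
    "<summary>\n1. Primary Request and Intent: Refactor the auth module.\n</summary>"
)
print(extract_summary(reply))
```

Dropping the analysis before re-injection is what makes the compaction net-positive for the context budget: the agent pays the reasoning tokens once, and only the distilled state survives.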