Back to skills
SkillHub ClubWrite Technical DocsFull StackData / AITech Writer

input-output-guardrails

Implementing safety filters, content moderation, and guardrails for AI system inputs and outputs

Packaged view

This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.

Stars
126
Hot score
95
Updated
March 20, 2026
Overall rating
C2.8
Composite score
2.8
Best-practice grade
B77.6

Install command

npx @skill-hub/cli install majiayu000-claude-skill-registry-input-output-guardrails

Repository

majiayu000/claude-skill-registry

Skill path: skills/other/input-output-guardrails

Implementing safety filters, content moderation, and guardrails for AI system inputs and outputs

Open repository

Best for

Primary workflow: Write Technical Docs.

Technical facets: Full Stack, Data / AI, Tech Writer.

Target audience: everyone.

License: Unknown.

Original source

Catalog source: SkillHub Club.

Repository owner: majiayu000.

This is still a mirrored public skill entry. Review the repository before installing into production workflows.

What it helps with

  • Install input-output-guardrails into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
  • Review https://github.com/majiayu000/claude-skill-registry before adding input-output-guardrails to shared team environments
  • Use input-output-guardrails for development workflows

Works across

Claude CodeCodex CLIGemini CLIOpenCode

Favorites: 0.

Sub-skills: 0.

Aggregator: No.

Original source / Raw SKILL.md

---
name: input-output-guardrails
version: "2.0.0"
description: Implementing safety filters, content moderation, and guardrails for AI system inputs and outputs
sasmp_version: "1.3.0"
bonded_agent: 05-defense-strategy-developer
bond_type: SECONDARY_BOND
# Schema Definitions
input_schema:
  type: object
  required: [guardrail_type]
  properties:
    guardrail_type:
      type: string
      enum: [input, output, both]
    strictness:
      type: string
      enum: [permissive, balanced, strict]
      default: balanced
output_schema:
  type: object
  properties:
    blocked_requests:
      type: integer
    filtered_outputs:
      type: integer
    false_positive_rate:
      type: number
# Framework Mappings
owasp_llm_2025: [LLM01, LLM02, LLM05, LLM07]
nist_ai_rmf: [Manage]
---

# Input/Output Guardrails

Implement **multi-layer safety systems** to filter malicious inputs and harmful outputs.

## Quick Reference

```yaml
Skill:       input-output-guardrails
Agent:       05-defense-strategy-developer
OWASP:       LLM01 (Injection), LLM02 (Disclosure), LLM05 (Output), LLM07 (Leakage)
NIST:        Manage
Use Case:    Production safety filtering
```

## Guardrail Architecture

```
User Input → [Input Guardrails] → [AI Model] → [Output Guardrails] → Response
                    ↓                               ↓
             [Blocked/Modified]              [Blocked/Modified]
                    ↓                               ↓
             [Fallback Response]            [Safe Alternative]
```

## Input Guardrails

### 1. Injection Detection

```yaml
Category: prompt_injection
Latency: <10ms
Block Rate: 95%+
```

```python
class InputGuardrails:
    INJECTION_PATTERNS = [
        r'ignore\s+(previous|prior|all)\s+(instructions?|guidelines?)',
        r'you\s+are\s+(now|an?)\s+(unrestricted|evil)',
        r'(developer|admin|debug)\s+mode',
        r'bypass\s+(safety|security|filter)',
        r'pretend\s+(you|to)\s+(are|be)',
        r'what\s+(is|are)\s+your\s+(instructions?|prompt)',
    ]

    def __init__(self, config):
        self.patterns = [re.compile(p, re.I) for p in self.INJECTION_PATTERNS]
        self.max_length = config.get('max_length', 4096)
        self.pii_detector = PIIDetector()

    def validate(self, user_input: str) -> tuple[bool, str]:
        # Length check
        if len(user_input) > self.max_length:
            return False, "Input too long"

        # Empty check
        if not user_input.strip():
            return False, "Empty input"

        # Injection detection
        for pattern in self.patterns:
            if pattern.search(user_input):
                return False, "Invalid request"

        # PII handling
        if self.pii_detector.contains_pii(user_input):
            return True, self.pii_detector.redact(user_input)

        return True, user_input
```

### 2. PII Detection & Redaction

```python
class PIIDetector:
    PATTERNS = {
        'ssn': r'\b\d{3}-\d{2}-\d{4}\b',
        'credit_card': r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b',
        'email': r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',
        'phone': r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b',
        'api_key': r'(sk|pk)[-_][a-zA-Z0-9]{20,}',
    }

    def contains_pii(self, text: str) -> bool:
        for pattern in self.PATTERNS.values():
            if re.search(pattern, text):
                return True
        return False

    def redact(self, text: str) -> str:
        for name, pattern in self.PATTERNS.items():
            text = re.sub(pattern, f'[REDACTED_{name.upper()}]', text)
        return text
```

### 3. Rate & Cost Limiting

```yaml
Limits:
  max_tokens_input: 4096
  max_requests_per_minute: 60
  max_concurrent: 5
  cost_limit_per_hour: $10

Actions:
  exceeded_tokens: truncate
  exceeded_rate: queue (5s backoff)
  exceeded_concurrent: reject
  exceeded_cost: block
```

## Output Guardrails

### 1. Content Safety Filtering

```python
class OutputGuardrails:
    def __init__(self, config):
        self.toxicity_threshold = config.get('toxicity', 0.3)
        self.toxicity_model = load_toxicity_classifier()
        self.blocklist = self._load_blocklist()

    def filter(self, response: str) -> tuple[str, dict]:
        metadata = {'filtered': False, 'reasons': []}

        # Toxicity check
        toxicity = self.toxicity_model.predict(response)
        if toxicity > self.toxicity_threshold:
            metadata['filtered'] = True
            metadata['reasons'].append('toxicity')
            return self._safe_response(), metadata

        # Blocklist check
        for term in self.blocklist:
            if term.lower() in response.lower():
                metadata['filtered'] = True
                metadata['reasons'].append('blocklist')
                return self._safe_response(), metadata

        # System prompt leak detection
        if self._detects_system_leak(response):
            metadata['filtered'] = True
            metadata['reasons'].append('system_leak')
            response = self._redact_system_content(response)

        return response, metadata

    def _detects_system_leak(self, response: str) -> bool:
        leak_indicators = [
            'you are a helpful',
            'your instructions are',
            'system prompt:',
        ]
        return any(ind in response.lower() for ind in leak_indicators)
```

### 2. Sensitive Data Redaction

```python
class OutputRedactor:
    SENSITIVE_PATTERNS = {
        'api_key': r'[a-zA-Z0-9_-]{20,}(?:key|token|secret)',
        'password': r'password["\']?\s*[:=]\s*["\']?[^\s"\']+',
        'connection_string': r'(mongodb|mysql|postgres)://[^\s]+',
        'ip_address': r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b',
    }

    def redact(self, response: str) -> str:
        for name, pattern in self.SENSITIVE_PATTERNS.items():
            response = re.sub(pattern, '[REDACTED]', response, flags=re.I)
        return response
```

### 3. Factuality & Citation

```yaml
Factuality Checks:
  major_claims:
    action: flag_for_verification
    threshold: confidence < 0.8

  citations:
    action: verify_source_exists
    block_if: source_not_found

  uncertainty:
    action: add_disclaimer
    phrases: ["I'm not certain", "might be", "could be"]
```

## Combined Configuration

```yaml
# guardrails_config.yaml
input:
  injection_detection: true
  pii_redaction: true
  max_length: 4096
  rate_limit: 60/min

output:
  toxicity_threshold: 0.3
  blocklist_enabled: true
  sensitive_redaction: true
  system_leak_detection: true

fallback:
  input_blocked: "I cannot process this request."
  output_blocked: "I cannot provide this information."

logging:
  log_blocked: true
  log_filtered: true
  include_reason: false  # Privacy
```

## Effectiveness Metrics

```
┌──────────────────┬─────────┬────────┬──────────┐
│ Metric           │ Target  │ Actual │ Status   │
├──────────────────┼─────────┼────────┼──────────┤
│ Injection Block  │ >95%    │ 97%    │ ✓ PASS   │
│ False Positive   │ <2%     │ 1.5%   │ ✓ PASS   │
│ Latency Impact   │ <50ms   │ 35ms   │ ✓ PASS   │
│ Toxicity Block   │ >90%    │ 92%    │ ✓ PASS   │
│ PII Redaction    │ >99%    │ 99.5%  │ ✓ PASS   │
└──────────────────┴─────────┴────────┴──────────┘
```

## Troubleshooting

```yaml
Issue: High false positive rate
Solution: Tune patterns, add allowlist, use context

Issue: Latency too high
Solution: Optimize regex, use compiled patterns, cache

Issue: Bypassed by encoding
Solution: Normalize unicode, decode before checking
```

## Integration Points

| Component | Purpose |
|-----------|---------|
| Agent 05 | Implements guardrails |
| /defend | Configuration recommendations |
| CI/CD | Automated testing |
| Monitoring | Alert on filter triggers |

---

**Protect AI systems with comprehensive input/output guardrails.**