input-output-guardrails
Implementing safety filters, content moderation, and guardrails for AI system inputs and outputs
Packaged view
This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.
Install command
npx @skill-hub/cli install majiayu000-claude-skill-registry-input-output-guardrails
Repository
Skill path: skills/other/input-output-guardrails
Best for
Primary workflow: Write Technical Docs.
Technical facets: Full Stack, Data / AI, Tech Writer.
Target audience: everyone.
License: Unknown.
Original source
Catalog source: SkillHub Club.
Repository owner: majiayu000.
This is still a mirrored public skill entry. Review the repository before installing into production workflows.
What it helps with
- Install input-output-guardrails into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
- Review https://github.com/majiayu000/claude-skill-registry before adding input-output-guardrails to shared team environments
- Use input-output-guardrails for development workflows
Original source / Raw SKILL.md
---
name: input-output-guardrails
version: "2.0.0"
description: Implementing safety filters, content moderation, and guardrails for AI system inputs and outputs
sasmp_version: "1.3.0"
bonded_agent: 05-defense-strategy-developer
bond_type: SECONDARY_BOND
# Schema Definitions
input_schema:
  type: object
  required: [guardrail_type]
  properties:
    guardrail_type:
      type: string
      enum: [input, output, both]
    strictness:
      type: string
      enum: [permissive, balanced, strict]
      default: balanced
output_schema:
  type: object
  properties:
    blocked_requests:
      type: integer
    filtered_outputs:
      type: integer
    false_positive_rate:
      type: number
# Framework Mappings
owasp_llm_2025: [LLM01, LLM02, LLM05, LLM07]
nist_ai_rmf: [Manage]
---
# Input/Output Guardrails
Implement **multi-layer safety systems** to filter malicious inputs and harmful outputs.
## Quick Reference
```yaml
Skill: input-output-guardrails
Agent: 05-defense-strategy-developer
OWASP: LLM01 (Injection), LLM02 (Disclosure), LLM05 (Output), LLM07 (Leakage)
NIST: Manage
Use Case: Production safety filtering
```
## Guardrail Architecture
```
User Input → [Input Guardrails] → [AI Model] → [Output Guardrails] → Response
                     ↓                                  ↓
             [Blocked/Modified]                [Blocked/Modified]
                     ↓                                  ↓
             [Fallback Response]               [Safe Alternative]
```
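
The flow in the diagram can be sketched as a single function that runs the two guardrail stages around the model call. This is a minimal illustration, not the skill's own implementation: `guarded_call`, the `Check` type, and the three `demo_*` stand-ins are hypothetical names, and the fallback strings are the ones listed under Combined Configuration.

```python
from typing import Callable, Tuple

# Fallback messages as listed in the Combined Configuration section.
INPUT_BLOCKED = "I cannot process this request."
OUTPUT_BLOCKED = "I cannot provide this information."

# A check returns (allowed, text); text may be a modified copy of its input.
Check = Callable[[str], Tuple[bool, str]]


def guarded_call(user_input: str, check_input: Check,
                 model: Callable[[str], str], check_output: Check) -> str:
    """Run input guardrails, then the model, then output guardrails."""
    allowed, cleaned = check_input(user_input)
    if not allowed:
        return INPUT_BLOCKED  # the model is never called
    raw = model(cleaned)
    allowed, safe = check_output(raw)
    if not allowed:
        return OUTPUT_BLOCKED
    return safe


# Hypothetical stand-ins so the sketch runs without a real model or filters.
def demo_input(text: str) -> Tuple[bool, str]:
    return "ignore" not in text.lower(), text.strip()


def demo_model(prompt: str) -> str:
    return f"echo: {prompt}"


def demo_output(text: str) -> Tuple[bool, str]:
    return "secret" not in text.lower(), text


if __name__ == "__main__":
    print(guarded_call(" hello ", demo_input, demo_model, demo_output))
```

Passing the checks in as callables keeps the pipeline independent of any particular detector, so the regex-based classes in the following sections could be swapped for a classifier without changing the control flow.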
## Input Guardrails
### 1. Injection Detection
```yaml
Category: prompt_injection
Latency: <10ms
Block Rate: 95%+
```
```python
import re


class InputGuardrails:
    # Case-insensitive patterns for common prompt-injection phrasings
    INJECTION_PATTERNS = [
        r'ignore\s+(previous|prior|all)\s+(instructions?|guidelines?)',
        r'you\s+are\s+(now|an?)\s+(unrestricted|evil)',
        r'(developer|admin|debug)\s+mode',
        r'bypass\s+(safety|security|filter)',
        r'pretend\s+(you|to)\s+(are|be)',
        r'what\s+(is|are)\s+your\s+(instructions?|prompt)',
    ]

    def __init__(self, config):
        # Compile once so validate() does not recompile on every request
        self.patterns = [re.compile(p, re.I) for p in self.INJECTION_PATTERNS]
        self.max_length = config.get('max_length', 4096)
        self.pii_detector = PIIDetector()  # defined in the next section

    def validate(self, user_input: str) -> tuple[bool, str]:
        """Return (allowed, text): a rejection reason if blocked, else the (possibly redacted) input."""
        # Length check
        if len(user_input) > self.max_length:
            return False, "Input too long"

        # Empty check
        if not user_input.strip():
            return False, "Empty input"

        # Injection detection: generic message, so the matched pattern is not revealed
        for pattern in self.patterns:
            if pattern.search(user_input):
                return False, "Invalid request"

        # PII handling: allow the request, but pass on a redacted copy
        if self.pii_detector.contains_pii(user_input):
            return True, self.pii_detector.redact(user_input)

        return True, user_input
```
### 2. PII Detection & Redaction
```python
import re


class PIIDetector:
    # Pattern name -> regex; the name is reused in the redaction placeholder
    PATTERNS = {
        'ssn': r'\b\d{3}-\d{2}-\d{4}\b',
        'credit_card': r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b',
        'email': r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',
        'phone': r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b',
        'api_key': r'(sk|pk)[-_][a-zA-Z0-9]{20,}',
    }

    def contains_pii(self, text: str) -> bool:
        """Return True if any pattern matches anywhere in the text."""
        for pattern in self.PATTERNS.values():
            if re.search(pattern, text):
                return True
        return False

    def redact(self, text: str) -> str:
        """Replace every match with a placeholder such as [REDACTED_EMAIL]."""
        for name, pattern in self.PATTERNS.items():
            text = re.sub(pattern, f'[REDACTED_{name.upper()}]', text)
        return text
```
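
The 16-digit `credit_card` pattern above also matches order numbers and other long digit runs. One common way to reduce such false positives is to redact only matches that pass the Luhn checksum. This is a technique added here for illustration, not part of the original skill; `luhn_valid` and `redact_cards` are hypothetical helper names.

```python
import re

# Same 16-digit shape as the credit_card pattern in PIIDetector.
CARD_RE = re.compile(r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b')


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdecimal()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit and double every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact_cards(text: str) -> str:
    """Redact only those 16-digit matches that are Luhn-valid."""
    def _sub(match: "re.Match[str]") -> str:
        candidate = match.group()
        return '[REDACTED_CREDIT_CARD]' if luhn_valid(candidate) else candidate
    return CARD_RE.sub(_sub, text)


if __name__ == "__main__":
    print(redact_cards("card 4111 1111 1111 1111, order 1234 5678 9012 3456"))
```

The trade-off is that a mistyped card number fails the checksum and is left unredacted, so the checksum suits deployments where false positives are the larger problem.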
### 3. Rate & Cost Limiting
```yaml
Limits:
  max_tokens_input: 4096
  max_requests_per_minute: 60
  max_concurrent: 5
  cost_limit_per_hour: $10

Actions:
  exceeded_tokens: truncate
  exceeded_rate: queue (5s backoff)
  exceeded_concurrent: reject
  exceeded_cost: block
```
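
The `max_requests_per_minute` limit could be enforced with a per-client sliding window. The sketch below is one possible approach under stated assumptions: `SlidingWindowLimiter`, `client_id`, and the injectable `clock` are hypothetical names, and the limiter only reports whether a request fits the window. Queueing with backoff, as listed under Actions, would be left to the caller.

```python
import time
from collections import deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most `max_requests` per `window_seconds` for each client."""

    def __init__(self, max_requests: int = 60, window_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so tests do not need to sleep
        self._hits: Dict[str, Deque[float]] = {}

    def allow(self, client_id: str) -> bool:
        """Record the request and return True if it fits in the window."""
        now = self.clock()
        hits = self._hits.setdefault(client_id, deque())
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_requests=2, window_seconds=60.0)
    print([limiter.allow("user-1") for _ in range(3)])
```

A sliding window avoids the burst at the boundary that a fixed per-minute counter allows, at the cost of storing one timestamp per recent request.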
## Output Guardrails
### 1. Content Safety Filtering
```python
class OutputGuardrails:
    def __init__(self, config):
        self.toxicity_threshold = config.get('toxicity', 0.3)
        # load_toxicity_classifier, _load_blocklist, _safe_response and
        # _redact_system_content are not shown here and must be supplied.
        self.toxicity_model = load_toxicity_classifier()
        self.blocklist = self._load_blocklist()

    def filter(self, response: str) -> tuple[str, dict]:
        """Return (text, metadata); metadata records whether and why filtering occurred."""
        metadata = {'filtered': False, 'reasons': []}
        lowered = response.lower()  # lowercase once for the blocklist scan

        # Toxicity check: replace the whole response
        toxicity = self.toxicity_model.predict(response)
        if toxicity > self.toxicity_threshold:
            metadata['filtered'] = True
            metadata['reasons'].append('toxicity')
            return self._safe_response(), metadata

        # Blocklist check: case-insensitive substring match
        for term in self.blocklist:
            if term.lower() in lowered:
                metadata['filtered'] = True
                metadata['reasons'].append('blocklist')
                return self._safe_response(), metadata

        # System prompt leak detection: redact rather than block
        if self._detects_system_leak(response):
            metadata['filtered'] = True
            metadata['reasons'].append('system_leak')
            response = self._redact_system_content(response)

        return response, metadata

    def _detects_system_leak(self, response: str) -> bool:
        # Phrases that typically appear when a system prompt is echoed back
        leak_indicators = [
            'you are a helpful',
            'your instructions are',
            'system prompt:',
        ]
        lowered = response.lower()
        return any(ind in lowered for ind in leak_indicators)
```
### 2. Sensitive Data Redaction
```python
import re


class OutputRedactor:
    # Patterns for secrets that should never reach the user
    SENSITIVE_PATTERNS = {
        'api_key': r'[a-zA-Z0-9_-]{20,}(?:key|token|secret)',
        'password': r'password["\']?\s*[:=]\s*["\']?[^\s"\']+',
        'connection_string': r'(mongodb|mysql|postgres)://[^\s]+',
        'ip_address': r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b',
    }

    def redact(self, response: str) -> str:
        """Replace every case-insensitive match with [REDACTED]."""
        for pattern in self.SENSITIVE_PATTERNS.values():
            response = re.sub(pattern, '[REDACTED]', response, flags=re.I)
        return response
```
### 3. Factuality & Citation
```yaml
Factuality Checks:
  major_claims:
    action: flag_for_verification
    threshold: confidence < 0.8

  citations:
    action: verify_source_exists
    block_if: source_not_found

  uncertainty:
    action: add_disclaimer
    phrases: ["I'm not certain", "might be", "could be"]
```
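
Two of the rules above, the confidence threshold and the uncertainty disclaimer, are simple enough to sketch directly. This is an illustration under assumptions: the function names and the disclaimer wording are hypothetical, and the source does not say how the confidence score is produced, so it is taken here as a float supplied by the caller.

```python
UNCERTAINTY_PHRASES = ("i'm not certain", "might be", "could be")
CONFIDENCE_THRESHOLD = 0.8

# Hypothetical wording; the source does not specify the disclaimer text.
DISCLAIMER = "Note: parts of this answer are uncertain and should be verified."


def needs_verification(confidence: float) -> bool:
    """Flag a claim when its confidence is below the threshold."""
    return confidence < CONFIDENCE_THRESHOLD


def add_disclaimer(response: str) -> str:
    """Append the disclaimer once if the response contains a hedging phrase."""
    # Treat a typographic apostrophe like a plain one before matching.
    lowered = response.lower().replace("\u2019", "'")
    hedged = any(phrase in lowered for phrase in UNCERTAINTY_PHRASES)
    if hedged and DISCLAIMER not in response:
        return f"{response}\n\n{DISCLAIMER}"
    return response


if __name__ == "__main__":
    print(add_disclaimer("The value might be 42."))
```

Citation verification is left out because it depends on an external lookup that the source does not describe.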
## Combined Configuration
```yaml
# guardrails_config.yaml
input:
  injection_detection: true
  pii_redaction: true
  max_length: 4096
  rate_limit: 60/min

output:
  toxicity_threshold: 0.3
  blocklist_enabled: true
  sensitive_redaction: true
  system_leak_detection: true

fallback:
  input_blocked: "I cannot process this request."
  output_blocked: "I cannot provide this information."

logging:
  log_blocked: true
  log_filtered: true
  include_reason: false  # Privacy
```
## Effectiveness Metrics
```
┌──────────────────┬─────────┬────────┬──────────┐
│ Metric           │ Target  │ Actual │ Status   │
├──────────────────┼─────────┼────────┼──────────┤
│ Injection Block  │ >95%    │ 97%    │ ✓ PASS   │
│ False Positive   │ <2%     │ 1.5%   │ ✓ PASS   │
│ Latency Impact   │ <50ms   │ 35ms   │ ✓ PASS   │
│ Toxicity Block   │ >90%    │ 92%    │ ✓ PASS   │
│ PII Redaction    │ >99%    │ 99.5%  │ ✓ PASS   │
└──────────────────┴─────────┴────────┴──────────┘
```
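
The source does not say how the rates in the table were measured. One plausible way to compute a block rate and a false positive rate is to run a labeled test set through the guardrails. In this sketch `GuardrailMetrics` and `evaluate` are hypothetical names, and the definitions assumed are blocked attacks over all attacks, and blocked benign requests over all benign requests.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class GuardrailMetrics:
    block_rate: float           # blocked attacks / all attacks
    false_positive_rate: float  # blocked benign requests / all benign requests


def evaluate(results: Iterable[Tuple[bool, bool]]) -> GuardrailMetrics:
    """Each item is (is_attack, was_blocked) from a labeled test set."""
    attacks = blocked_attacks = benign = blocked_benign = 0
    for is_attack, was_blocked in results:
        if is_attack:
            attacks += 1
            blocked_attacks += int(was_blocked)
        else:
            benign += 1
            blocked_benign += int(was_blocked)
    return GuardrailMetrics(
        block_rate=blocked_attacks / attacks if attacks else 0.0,
        false_positive_rate=blocked_benign / benign if benign else 0.0,
    )


if __name__ == "__main__":
    sample = [(True, True)] * 3 + [(True, False)] + [(False, False)] * 9 + [(False, True)]
    print(evaluate(sample))
```

Reporting both numbers together matters because tightening patterns to raise the block rate usually raises the false positive rate as well.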
## Troubleshooting
```yaml
- Issue: High false positive rate
  Solution: Tune patterns, add allowlist, use context

- Issue: Latency too high
  Solution: Optimize regex, use compiled patterns, cache

- Issue: Bypassed by encoding
  Solution: Normalize unicode, decode before checking
```
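
For the encoding bypass case, "normalize unicode, decode before checking" could look roughly like the following. This is a sketch that uses only the Python standard library; the function names are hypothetical, and Base64 is just one example of an encoding worth trying. The pattern checks would then run on every candidate string, not only on the raw input.

```python
import base64
import unicodedata
from typing import List, Optional

# Zero-width and BOM characters sometimes used to split trigger words.
_ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))


def normalize_text(text: str) -> str:
    """Fold compatibility forms (e.g. fullwidth letters) and tidy whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(_ZERO_WIDTH)  # delete zero-width characters
    return " ".join(text.split())       # collapse runs of whitespace


def try_base64(text: str) -> Optional[str]:
    """Return the decoded string if `text` is strict Base64 of UTF-8, else None."""
    try:
        return base64.b64decode(text, validate=True).decode("utf-8")
    except ValueError:  # covers binascii.Error and UnicodeDecodeError
        return None


def candidates(text: str) -> List[str]:
    """All forms of the input that the injection patterns should be run on."""
    forms = [normalize_text(text)]
    decoded = try_base64(text.strip())
    if decoded is not None:
        forms.append(normalize_text(decoded))
    return forms


if __name__ == "__main__":
    print(candidates("\uff49\uff47\uff4e\uff4f\uff52\uff45\u200b previous   instructions"))
```

Normalizing before matching also helps with the latency issue listed above, since the patterns themselves can stay simple instead of enumerating look-alike characters.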
## Integration Points
| Component | Purpose |
|-----------|---------|
| Agent 05 | Implements guardrails |
| /defend | Configuration recommendations |
| CI/CD | Automated testing |
| Monitoring | Alert on filter triggers |
---
**Protect AI systems with comprehensive input/output guardrails.**