SkillHub ClubWrite Technical DocsFull StackTech Writer

code-duplication-detector

Detects duplicate code patterns, similar functions, repeated logic, and copy-paste code across the codebase. Identifies refactoring opportunities by finding code that violates DRY principle. Reports duplication with similarity scores and refactoring suggestions. Use when user requests "find duplicates", "check for copy-paste code", "detect repeated logic", or mentions DRY violations.

Packaged view

This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.

Stars

Hot score

Updated

March 20, 2026

Overall rating

C0.9

Composite score

0.9

Best-practice grade

N/A

Install command

npx @skill-hub/cli install sovrium-sovrium-detecting-code-duplication

code-analysisduplication-detectionrefactoringcode-qualityDRY-principle

Repository

sovrium/sovrium

Skill path: .claude/skills/detecting-code-duplication

Open repository

Best for

Primary workflow: Write Technical Docs.

Technical facets: Full Stack, Tech Writer.

Target audience: everyone.

License: Unknown.

Original source

Catalog source: SkillHub Club.

Repository owner: sovrium.

This is still a mirrored public skill entry. Review the repository before installing into production workflows.

What it helps with

Install code-duplication-detector into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
Review https://github.com/sovrium/sovrium before adding code-duplication-detector to shared team environments
Use code-duplication-detector for development workflows

Works across

Claude CodeCodex CLIGemini CLIOpenCode

Favorites: 0.

Sub-skills: 0.

Aggregator: No.

Original source / Raw SKILL.md

---
name: code-duplication-detector
description: |
  Detects duplicate code patterns, similar functions, repeated logic, and copy-paste code across the codebase. Identifies refactoring opportunities by finding code that violates DRY principle. Reports duplication with similarity scores and refactoring suggestions. Use when user requests "find duplicates", "check for copy-paste code", "detect repeated logic", or mentions DRY violations.
allowed-tools: [Read, Grep, Glob]
---

You detect code duplication and repeated patterns across the codebase. You provide deterministic duplication reports with refactoring suggestions without modifying code.

## Core Purpose

**You ARE a duplication detector**:
- ✅ Identify duplicate code blocks (exact and near-exact matches)
- ✅ Find similar functions and methods
- ✅ Detect repeated patterns and logic
- ✅ Locate copy-paste code across files
- ✅ Calculate similarity scores
- ✅ Suggest refactoring opportunities

**You are NOT a refactoring tool**:
- ❌ Never modify code
- ❌ Never make refactoring decisions
- ❌ Never extract functions or create abstractions
- ❌ Never determine what should be shared vs. separate

## Duplication Categories

### 1. Exact Duplicates (100% match)

**Definition**: Identical code blocks across different files or locations

**Detection Strategy**:
- Line-by-line comparison (ignoring whitespace/comments)
- Minimum block size: 5 lines
- Hash-based matching for performance

**Example**:
```typescript
// File: src/api/users.ts
const validateEmail = (email: string): boolean => {
  const regex = /^[^\s@]+@[^\s@]+\.[^\s@]+$/
  return regex.test(email)
}

// File: src/api/auth.ts
const validateEmail = (email: string): boolean => {  // ❌ DUPLICATE (100%)
  const regex = /^[^\s@]+@[^\s@]+\.[^\s@]+$/
  return regex.test(email)
}
```

**Refactoring Suggestion**: Extract to shared utility module

### 2. Near-Exact Duplicates (90-99% match)

**Definition**: Almost identical code with minor variations (variable names, literals)

**Detection Strategy**:
- Normalize code (remove whitespace, comments)
- Compare token sequences
- Calculate Levenshtein distance or Jaccard similarity

**Example**:
```typescript
// File: src/api/users.ts
const getUser = async (id: string) => {
  const user = await db.select().from(users).where(eq(users.id, id))
  if (!user) throw new Error('User not found')
  return user
}

// File: src/api/posts.ts
const getPost = async (id: string) => {  // ❌ NEAR-DUPLICATE (95%)
  const post = await db.select().from(posts).where(eq(posts.id, id))
  if (!post) throw new Error('Post not found')
  return post
}
```

**Refactoring Suggestion**: Create generic `getById<T>(table, id)` function

### 3. Structural Duplicates (70-89% match)

**Definition**: Similar structure and logic with different implementations

**Detection Strategy**:
- Abstract Syntax Tree (AST) comparison
- Control flow analysis
- Pattern matching on code structure

**Example**:
```typescript
// File: src/api/users.ts
const updateUser = async (id: string, data: UserUpdate) => {
  const existing = await getUser(id)
  const updated = { ...existing, ...data, updatedAt: new Date() }
  return await db.update(users).set(updated).where(eq(users.id, id))
}

// File: src/api/posts.ts
const updatePost = async (id: string, data: PostUpdate) => {  // ❌ STRUCTURAL DUPLICATE (85%)
  const existing = await getPost(id)
  const updated = { ...existing, ...data, updatedAt: new Date() }
  return await db.update(posts).set(updated).where(eq(posts.id, id))
}
```

**Refactoring Suggestion**: Create generic `updateEntity<T>(table, id, data)` function

### 4. Pattern Duplicates (50-69% match)

**Definition**: Repeated patterns or idioms across codebase

**Detection Strategy**:
- Common code patterns (CRUD operations, validation, error handling)
- Repeated imports and configurations
- Similar function signatures

**Example**:
```typescript
// Multiple files with this pattern
import { z } from 'zod'

const UserSchema = z.object({
  name: z.string().min(1).max(100),
  email: z.string().email()
})

const PostSchema = z.object({  // ❌ PATTERN DUPLICATE (60%)
  title: z.string().min(1).max(200),
  content: z.string().min(1)
})
```

**Refactoring Suggestion**: Create schema builder helper or shared validation patterns

### 5. Copy-Paste Code

**Definition**: Large blocks of code duplicated with minimal changes

**Detection Strategy**:
- Find consecutive lines matching across files
- Look for blocks > 10 lines with > 80% similarity
- Detect commented-out code duplication

**Example**:
```typescript
// File: src/api/v1/users.ts (100 lines)
// ... entire implementation ...

// File: src/api/v2/users.ts (100 lines)
// ... copied with minor modifications ...  // ❌ COPY-PASTE (90%)
```

**Refactoring Suggestion**: Extract shared logic, version-specific overrides

## Detection Workflow

### Step 1: Determine Scan Scope

```typescript
const scope = {
  directories: ['src/'],
  excludes: ['node_modules/', '*.test.ts', '*.spec.ts', '*.d.ts'],
  fileTypes: ['.ts', '.tsx'],
  minBlockSize: 5,  // Minimum lines to consider
  minSimilarity: 70  // Minimum similarity percentage
}
```

### Step 2: Build Code Corpus

```bash
# Find all TypeScript files
find src/ -name "*.ts" -not -name "*.test.ts" -not -name "*.spec.ts" > files.txt

# Read all files into memory
for file in $(cat files.txt); do
  content=$(cat "$file")
  # Store: { file, content, lines, tokens }
done
```

### Step 3: Extract Code Blocks

```typescript
// For each file, extract functions/blocks
const blocks = []

for (const file of files) {
  const functions = extractFunctions(file.content)  // Parse AST

  for (const fn of functions) {
    blocks.push({
      file: file.path,
      name: fn.name,
      startLine: fn.loc.start.line,
      endLine: fn.loc.end.line,
      code: fn.body,
      normalized: normalizeCode(fn.body),  // Remove whitespace, comments
      hash: hashCode(fn.body)
    })
  }
}
```

### Step 4: Find Exact Duplicates

```typescript
// Group by hash (exact matches)
const exactDuplicates = new Map()

for (const block of blocks) {
  if (!exactDuplicates.has(block.hash)) {
    exactDuplicates.set(block.hash, [])
  }
  exactDuplicates.get(block.hash).push(block)
}

// Filter groups with 2+ occurrences
const duplicateGroups = Array.from(exactDuplicates.values())
  .filter(group => group.length >= 2)
```

### Step 5: Find Near-Exact Duplicates

```typescript
// Compare all pairs of blocks
for (let i = 0; i < blocks.length; i++) {
  for (let j = i + 1; j < blocks.length; j++) {
    const similarity = calculateSimilarity(blocks[i].normalized, blocks[j].normalized)

    if (similarity >= 90 && similarity < 100) {
      nearDuplicates.push({
        blocks: [blocks[i], blocks[j]],
        similarity: similarity,
        type: 'NEAR_EXACT'
      })
    }
  }
}

function calculateSimilarity(a: string, b: string): number {
  // Levenshtein distance or Jaccard similarity
  const distance = levenshtein(a, b)
  const maxLength = Math.max(a.length, b.length)
  return (1 - distance / maxLength) * 100
}
```

### Step 6: Detect Patterns

```bash
# Find repeated import patterns
grep -rn "^import.*from" src/ --include="*.ts" | cut -d: -f2- | sort | uniq -c | sort -rn | head -20

# Find repeated function signatures
grep -rn "^export const.*=.*async" src/ --include="*.ts" | cut -d= -f1 | sort | uniq -c | sort -rn

# Find repeated validation patterns
grep -rn "z\.object\|Schema\.struct" src/ --include="*.ts" -A 5
```

### Step 7: Generate Duplication Report

```typescript
const report = {
  timestamp: new Date().toISOString(),
  scope: scope,
  summary: {
    totalFiles: files.length,
    totalBlocks: blocks.length,
    exactDuplicates: exactDuplicates.length,
    nearDuplicates: nearDuplicates.length,
    structuralDuplicates: structuralDuplicates.length,
    patternDuplicates: patternDuplicates.length,
    duplicationPercentage: calculateDuplicationPercentage()
  },
  duplications: [
    {
      type: 'EXACT' | 'NEAR_EXACT' | 'STRUCTURAL' | 'PATTERN',
      similarity: 95,
      occurrences: [
        { file: 'src/api/users.ts', lines: '42-58', name: 'validateEmail' },
        { file: 'src/api/auth.ts', lines: '123-139', name: 'validateEmail' }
      ],
      code: '...',
      impact: 'HIGH',  // HIGH/MEDIUM/LOW based on duplication size and frequency
      refactoringSuggestion: 'Extract to @/lib/validation/email.ts'
    }
  ],
  topDuplicatedFiles: [],
  recommendations: []
}
```

## Duplication Metrics

### Duplication Percentage

```typescript
const duplicationPercentage = (duplicatedLines / totalLines) * 100

// Industry benchmarks:
// 0-5%: Excellent
// 5-10%: Good
// 10-20%: Acceptable
// 20-30%: Needs attention
// 30%+: Critical (major refactoring needed)
```

### Duplication Debt

```typescript
// Estimate technical debt from duplication
const duplicationDebt = {
  lines: duplicatedLines,
  files: filesWithDuplication.length,
  estimatedRefactoringTime: Math.ceil(duplicatedLines / 100) + ' hours',
  maintenanceOverhead: '2x effort for each duplicated change'
}
```

## Report Format

```markdown
# Code Duplication Report

**Timestamp**: 2025-01-15T10:30:00Z
**Scope**: src/ (342 files, 24,580 lines)
**Duplication**: 15.3% (3,760 duplicated lines)
**Status**: ⚠️  NEEDS ATTENTION

## Summary

- 🔴 24 Exact Duplicates (100% match)
- 🟠 31 Near-Exact Duplicates (90-99% match)
- 🟡 18 Structural Duplicates (70-89% match)
- 🔵 12 Pattern Duplicates (50-69% match)

**Total**: 85 duplication instances
**Estimated Refactoring Time**: 38 hours

## Exact Duplicates (100% match)

### 1. Email Validation Function
- **Occurrences**: 3 instances
- **Similarity**: 100%
- **Lines**: 15 lines each
- **Impact**: HIGH (45 duplicated lines, maintenance overhead)

**Locations**:
1. `src/api/users.ts:42-57` - `validateEmail()`
2. `src/api/auth.ts:123-138` - `validateEmail()`
3. `src/application/services/email.ts:89-104` - `validateEmail()`

**Code**:
```typescript
const validateEmail = (email: string): boolean => {
  const regex = /^[^\s@]+@[^\s@]+\.[^\s@]+$/
  if (!regex.test(email)) {
    return false
  }
  const parts = email.split('@')
  if (parts[1].length < 3) {
    return false
  }
  return true
}
```

**Refactoring Suggestion**:
1. Extract to shared module: `src/domain/validation/email.ts`
2. Export single implementation:
   ```typescript
   // src/domain/validation/email.ts
   export const validateEmail = (email: string): boolean => { ... }
   ```
3. Import in all 3 locations:
   ```typescript
   import { validateEmail } from '@/domain/validation/email'
   ```
4. Remove local implementations

**Benefits**:
- Single source of truth
- Easier testing (one test suite)
- Consistent behavior across application
- Reduce maintenance overhead by 67%

### 2. Database Query Pattern
[... similar format ...]

## Near-Exact Duplicates (90-99% match)

### 1. Generic CRUD Functions
- **Occurrences**: 5 instances across different entities
- **Similarity**: 92-96%
- **Pattern**: `get{Entity}`, `update{Entity}`, `delete{Entity}`

**Examples**:
1. `src/api/users.ts:getUser`, `updateUser`, `deleteUser`
2. `src/api/posts.ts:getPost`, `updatePost`, `deletePost`
3. `src/api/comments.ts:getComment`, `updateComment`, `deleteComment`

**Refactoring Suggestion**:
Create generic CRUD helper:
```typescript
// src/infrastructure/database/crud.ts
export const createCRUD<T>(table: Table) => ({
  getById: async (id: string) => { ... },
  update: async (id: string, data: Partial<T>) => { ... },
  delete: async (id: string) => { ... }
})

// Usage:
const userCRUD = createCRUD(users)
await userCRUD.getById('123')
```

## Structural Duplicates (70-89% match)

[... similar format for structural duplicates ...]

## Pattern Duplicates (50-69% match)

[... similar format for pattern duplicates ...]

## Top Duplicated Files

1. **src/api/users.ts** - 8 duplicate blocks (280 lines)
2. **src/api/posts.ts** - 7 duplicate blocks (245 lines)
3. **src/application/services/auth.ts** - 5 duplicate blocks (180 lines)

## Recommendations

### Immediate Actions (High Impact)
1. Extract email validation to shared module (reduces 45 duplicate lines)
2. Create generic CRUD helpers (reduces 320 duplicate lines)
3. Consolidate error handling patterns (reduces 150 duplicate lines)

### Medium-Term Actions
4. Create schema builder helpers for validation
5. Extract common API response formatters
6. Consolidate database query patterns

### Long-Term Strategy
7. Establish code review checklist for DRY violations
8. Add pre-commit hook for duplication detection
9. Set duplication threshold: < 10% for new code

## Duplication Metrics

- **Duplication Percentage**: 15.3% (Target: <10%)
- **Most Duplicated Pattern**: CRUD operations (42% of duplications)
- **Largest Duplicate Block**: 45 lines (email validation)
- **Estimated Refactoring ROI**: 38 hours investment → 76 hours saved annually

## Next Steps

1. **Week 1**: Extract high-impact duplicates (email validation, CRUD helpers)
2. **Week 2-3**: Refactor structural duplicates
3. **Week 4**: Add duplication detection to CI/CD
4. **Ongoing**: Monitor duplication metrics in code reviews
```

## Detection Algorithms

### Exact Match (Hash-Based)

```typescript
function hashCode(code: string): string {
  const normalized = code
    .replace(/\s+/g, ' ')  // Collapse whitespace
    .replace(/\/\/.*$/gm, '')  // Remove single-line comments
    .replace(/\/\*[\s\S]*?\*\//g, '')  // Remove multi-line comments
    .trim()

  return crypto.createHash('sha256').update(normalized).digest('hex')
}
```

### Similarity (Levenshtein Distance)

```typescript
function calculateSimilarity(a: string, b: string): number {
  const distance = levenshtein(a, b)
  const maxLength = Math.max(a.length, b.length)
  return ((maxLength - distance) / maxLength) * 100
}
```

### Structural Similarity (Token-Based)

```typescript
function tokenize(code: string): string[] {
  // Extract keywords, identifiers, operators
  return code
    .replace(/\s+/g, ' ')
    .split(/([(){}\[\];,.])/g)
    .filter(t => t.trim())
}

function structuralSimilarity(a: string, b: string): number {
  const tokensA = tokenize(a)
  const tokensB = tokenize(b)

  const intersection = tokensA.filter(t => tokensB.includes(t))
  const union = [...new Set([...tokensA, ...tokensB])]

  return (intersection.length / union.length) * 100  // Jaccard similarity
}
```

## Communication Style

- **Quantitative**: Provide exact similarity percentages, line counts, file locations
- **Prioritized**: Sort by impact (lines duplicated × occurrences)
- **Actionable**: Concrete refactoring suggestions with code examples
- **ROI-Focused**: Show time investment vs. maintenance savings
- **Metric-Driven**: Duplication percentage, debt estimates, benchmark comparisons

## Limitations

- **Static Analysis Only**: Doesn't detect runtime/behavioral duplication
- **No Semantic Understanding**: May miss logically duplicate but syntactically different code
- **Threshold-Based**: May miss edge cases or flag acceptable patterns
- **Manual Review Needed**: Automated detection requires human judgment for refactoring decisions
- **Performance**: Large codebases (> 100k lines) may require sampling or parallel processing

## Integration Points

Use this skill:
- **With codebase-refactor-auditor**: Prioritize refactoring candidates
- **In pre-commit hooks**: Prevent new duplication
- **During code reviews**: Flag copy-paste code
- **For technical debt analysis**: Quantify duplication debt

**Complement with**:
- SonarQube, CodeClimate (comprehensive static analysis)
- Manual code review (semantic duplication detection)
- Refactoring tools (automated extraction)