document-converter

Converts documents between Markdown, DOCX, and PDF formats with intelligent tool selection. Supports batch processing, automatic fallbacks, and offers both API-enhanced and offline modes. Handles text extraction from PDFs and DOCX files with quality validation.

Packaged view

This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.

Stars: 435.

Hot score: 99.

Updated: March 20, 2026.

Overall rating: A (8.3).

Composite score: 8.3.

Best-practice grade: B (75.6).

Install command

npx @skill-hub/cli install benbrastmckie-document-converter

Tags: document-conversion, markdown, pdf, docx, batch-processing.

Repository

benbrastmckie/.config

Skill path: .claude/skills/document-converter

Best for

Primary workflow: Ship Full Stack.

Technical facets: Full Stack, Backend.

Target audience: Developers, technical writers, and researchers who regularly work with multiple document formats and need automated conversion workflows.

License: Unknown.

Original source

Catalog source: SkillHub Club.

Repository owner: benbrastmckie.

This is still a mirrored public skill entry. Review the repository before installing into production workflows.

What it helps with

  • Install document-converter into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
  • Review https://github.com/benbrastmckie/.config before adding document-converter to shared team environments
  • Use document-converter for productivity workflows

Works across

Claude Code, Codex CLI, Gemini CLI, OpenCode.

Favorites: 0.

Sub-skills: 0.

Aggregator: No.

Original source / Raw SKILL.md

---
name: document-converter
description: Convert between Markdown, DOCX, and PDF formats bidirectionally. Handles text extraction from PDF/DOCX, markdown to document conversion. Use when converting document formats or extracting structured content from Word or PDF files.
allowed-tools: Bash, Read, Glob, Write
dependencies:
  - pandoc>=2.0
  - python3>=3.8
  - markitdown (optional, recommended)
  - pymupdf4llm (optional, recommended)
  - google-genai (optional, for enhanced PDF conversion)
  - pdf2docx (optional, for PDF to DOCX)
model: haiku-4.5
model-justification: Orchestrates external conversion tools with minimal AI reasoning required
fallback-model: sonnet-4.5
---

# Document Converter Skill

Convert documents bidirectionally between Markdown, DOCX, and PDF formats. This skill automatically detects optimal conversion tools, handles batch processing, and ensures quality output with appropriate fallback mechanisms.

## Core Capabilities

### Conversion Modes

The skill supports two conversion modes:

**Default Mode (API)**: When GEMINI_API_KEY is set, PDF-to-Markdown conversions use Google Gemini API for significantly improved quality (+20-30% fidelity). Other conversions use local tools.

**Offline Mode**: Use --no-api flag or set CONVERT_DOCS_OFFLINE=true to disable all API calls. All conversions use local tools only.
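
The mode rules above can be sketched as a small shell function. This is an illustration under assumptions, not code from the skill: `resolve_conversion_mode` is a hypothetical name, and only `GEMINI_API_KEY`, `CONVERT_DOCS_OFFLINE`, and `--no-api` come from the description above.

```bash
# Hypothetical helper (not part of the skill's scripts): prints "api" or
# "offline" according to the mode rules described above.
# Usage: resolve_conversion_mode [--no-api] [other args...]
resolve_conversion_mode() {
  local arg
  for arg in "$@"; do
    # An explicit --no-api flag always forces offline mode.
    if [ "$arg" = "--no-api" ]; then
      echo "offline"
      return 0
    fi
  done
  # CONVERT_DOCS_OFFLINE=true disables all API calls.
  if [ "${CONVERT_DOCS_OFFLINE:-false}" = "true" ]; then
    echo "offline"
    return 0
  fi
  # Without an API key there is nothing to call, so stay offline.
  if [ -z "${GEMINI_API_KEY:-}" ]; then
    echo "offline"
    return 0
  fi
  echo "api"
}
```

A caller could branch on the printed value, for example `mode=$(resolve_conversion_mode "$@")`, before choosing a PDF converter.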

### Conversion Directions

**TO Markdown** (text extraction from documents):
- DOCX → Markdown (MarkItDown or Pandoc)
- PDF → Markdown (Gemini API, PyMuPDF4LLM, or MarkItDown)

**FROM Markdown** (document generation):
- Markdown → DOCX (Pandoc)
- Markdown → PDF (Pandoc with Typst or XeLaTeX)

**Direct Conversion**:
- PDF → DOCX (pdf2docx - direct conversion preserves layout)

### Features

- Automatic tool detection and selection
- Cascading fallback mechanisms
- Batch processing support
- Image extraction and embedding
- Filename sanitization (spaces to underscores)
- Quality validation and reporting
- Concurrent conversion support

## Tool Priority Matrix

The skill uses intelligent tool selection based on format and quality metrics:

### PDF → Markdown (Mode-Dependent)

**Gemini Mode** (when GEMINI_API_KEY is set):
1. **Gemini API** (primary) - 95%+ fidelity
   - Vision-based understanding of layout
   - Semantic structure preservation
   - Code block language detection
   - 60 req/min, 1000 req/day free tier
2. **PyMuPDF4LLM** (fallback) - 70-75% fidelity
3. **MarkItDown** (fallback) - 65-70% fidelity

**Offline Mode** (--no-api or no API key):
1. **PyMuPDF4LLM** (primary) - 70-75% fidelity
   - Zero configuration required
   - Perfect Unicode preservation
   - Good for simple PDFs
2. **MarkItDown** (fallback) - 65-70% fidelity
   - Consistent quality across document types
   - Easy to configure

### PDF → DOCX
1. **pdf2docx** (only option) - 80-85% fidelity
   - Direct conversion (no intermediate format)
   - Preserves images and layout better than the two-step route (PDF → Markdown via Gemini, then Markdown → DOCX via Pandoc)
   - Fast processing

### DOCX → Markdown
1. **MarkItDown** (primary) - 75-80% fidelity
   - Perfect table preservation (pipe-style markdown)
   - Excellent Unicode/emoji support
   - Fast processing
2. **Pandoc** (fallback) - 68% fidelity
   - Reliable baseline conversion
   - Tables converted to grid format

### Markdown → DOCX
1. **Pandoc** (only option) - 95%+ quality preservation

### Markdown → PDF
1. **Pandoc + Typst** (primary) - Fast, modern PDF engine
2. **Pandoc + XeLaTeX** (fallback) - Traditional LaTeX engine
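
The primary/fallback ordering in this matrix amounts to trying converters in sequence until one succeeds. A minimal sketch of that cascade follows; `run_with_fallback` and the converter names passed to it are hypothetical and are not taken from the skill's scripts.

```bash
# Hypothetical sketch of a cascading fallback: try each converter in priority
# order and stop at the first one that succeeds.
# Usage: run_with_fallback INPUT OUTPUT CONVERTER...
# Each CONVERTER is a command or shell function invoked as:
#   CONVERTER INPUT OUTPUT
run_with_fallback() {
  local input="$1" output="$2" converter
  shift 2
  for converter in "$@"; do
    # Skip converters that are neither installed nor defined as functions.
    if ! command -v "$converter" >/dev/null 2>&1; then
      continue
    fi
    if "$converter" "$input" "$output"; then
      echo "[SUCCESS] Converted $input → $output ($converter)"
      return 0
    fi
  done
  echo "[FAILED] $input: no converter succeeded"
  return 1
}
```

For offline PDF → Markdown, a caller might pass two wrapper functions in matrix order, for instance `run_with_fallback in.pdf out.md pymupdf_wrapper markitdown_wrapper`, where both wrapper names are placeholders.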

## Usage Patterns

### Basic Conversion

```bash
# Source conversion core library
source /home/benjamin/.config/.claude/lib/convert/convert-core.sh

# Detect available tools
detect_tools

# Convert DOCX/PDF to Markdown
main_conversion /path/to/documents /path/to/output

# Convert Markdown to DOCX
main_conversion /path/to/markdown /path/to/output
```

### Batch Processing

The conversion core automatically processes all files in the input directory:
- Discovers all convertible files (.docx, .pdf, or .md)
- Detects conversion direction automatically
- Processes files concurrently (default: 4 parallel conversions)
- Generates conversion.log with statistics
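
One common way to get bounded parallelism in a shell script is `find` piped into `xargs -P`. The sketch below assumes that approach; the source does not say how the conversion core schedules its jobs, and `run_batch` is a hypothetical name. It also skips direction detection and simply hands every convertible file to the given command.

```bash
# Hypothetical sketch, not the skill's scheduler: pass every convertible file
# in DIR to COMMAND, running up to MAX_CONCURRENT_CONVERSIONS jobs at once.
# Usage: run_batch DIR COMMAND [ARGS...]
# COMMAND receives one file path as its last argument.
run_batch() {
  local dir="$1"
  shift
  find "$dir" -maxdepth 1 -type f \
    \( -name '*.docx' -o -name '*.pdf' -o -name '*.md' \) -print0 |
    xargs -0 -n 1 -P "${MAX_CONCURRENT_CONVERSIONS:-4}" "$@"
}
```

For a dry run, `run_batch ./documents echo` lists the files that would be processed.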

### Progress Streaming

The conversion script emits PROGRESS markers:
```
[PROGRESS] Converting: file1.docx → file1.md
[PROGRESS] Converting: file2.pdf → file2.md (2/10)
[SUCCESS] Converted file1.docx → file1.md
[FAILED] file3.pdf: Conversion timeout after 300s
```
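
Because the markers share a fixed prefix, a caller can tally results with a short filter. The helper below is a sketch: `summarize_progress` is a hypothetical name, and it assumes each marker starts at the beginning of a line, as in the sample above.

```bash
# Hypothetical helper: count [SUCCESS] and [FAILED] markers read from stdin
# and print a one-line summary. [PROGRESS] lines are ignored.
summarize_progress() {
  awk '
    /^\[SUCCESS\]/ { ok++ }
    /^\[FAILED\]/  { bad++ }
    END { printf "succeeded=%d failed=%d\n", ok, bad }
  '
}

# Usage with the sample markers shown above:
printf '%s\n' \
  '[PROGRESS] Converting: file1.docx → file1.md' \
  '[SUCCESS] Converted file1.docx → file1.md' \
  '[FAILED] file3.pdf: Conversion timeout after 300s' | summarize_progress
# prints: succeeded=1 failed=1
```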

## Conversion Workflow

### Phase 1: Tool Detection
- Check for MarkItDown availability (`command -v markitdown`)
- Check for Pandoc availability (`command -v pandoc`)
- Check for PyMuPDF4LLM availability (`python3 -c "import pymupdf4llm"`)
- Check for PDF engines (Typst, XeLaTeX)
- Set availability flags for tool selection
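
The checks above could be written roughly as follows. This is a sketch, not the skill's own `detect_tools`: `have_command` and `detect_tools_sketch` are hypothetical names, while the flag names are the ones listed for `detect_tools()` in reference.md.

```bash
# Illustrative only: mirrors the checks listed above.
# have_command is a hypothetical helper that tests whether a binary is on PATH.
have_command() {
  command -v "$1" >/dev/null 2>&1
}

detect_tools_sketch() {
  MARKITDOWN_AVAILABLE=false
  PANDOC_AVAILABLE=false
  PYMUPDF_AVAILABLE=false
  TYPST_AVAILABLE=false
  XELATEX_AVAILABLE=false

  if have_command markitdown; then MARKITDOWN_AVAILABLE=true; fi
  if have_command pandoc; then PANDOC_AVAILABLE=true; fi
  if have_command typst; then TYPST_AVAILABLE=true; fi
  if have_command xelatex; then XELATEX_AVAILABLE=true; fi

  # PyMuPDF4LLM is a Python module, so test the import rather than a binary.
  if have_command python3 && python3 -c "import pymupdf4llm" >/dev/null 2>&1; then
    PYMUPDF_AVAILABLE=true
  fi
}
```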

### Phase 2: File Discovery
- Scan input directory for convertible files
- Detect conversion direction (TO_MARKDOWN or FROM_MARKDOWN)
- Reject mixed input (a single run cannot combine TO_MARKDOWN and FROM_MARKDOWN files)
- Create output directory structure
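
Direction detection can be pictured as counting files by extension and refusing mixed input. The sketch below makes several assumptions: `detect_direction` is a hypothetical name, only the top level of the directory is scanned, and extensions are matched case-sensitively.

```bash
# Hypothetical sketch of direction detection for a flat input directory.
# Prints TO_MARKDOWN or FROM_MARKDOWN; returns 1 for empty or mixed input.
detect_direction() {
  local dir="$1" f to_md=0 from_md=0
  for f in "$dir"/*; do
    [ -f "$f" ] || continue
    case "$f" in
      (*.docx|*.pdf) to_md=$((to_md + 1)) ;;
      (*.md)         from_md=$((from_md + 1)) ;;
    esac
  done
  if [ "$to_md" -gt 0 ] && [ "$from_md" -gt 0 ]; then
    echo "Error: cannot mix conversion directions in one run" >&2
    return 1
  fi
  if [ "$to_md" -gt 0 ]; then
    echo "TO_MARKDOWN"
    return 0
  fi
  if [ "$from_md" -gt 0 ]; then
    echo "FROM_MARKDOWN"
    return 0
  fi
  echo "Error: no convertible files found" >&2
  return 1
}
```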

### Phase 3: Conversion Execution
- Process files using optimal tool based on priority matrix
- Apply timeout limits (60s DOCX, 300s PDF, 120s MD→PDF)
- Handle collisions (overwrite or skip existing files)
- Extract/embed images appropriately
- Retry with fallback tools on failure

### Phase 4: Validation
- Verify output file exists
- Check for broken image links
- Validate document structure (headings present)
- Report any quality issues
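
For Markdown output, these checks could look roughly like the function below. It is a sketch and not the skill's `validate_file_output`: the name `validate_markdown_sketch` is hypothetical, only ATX headings are recognized, and image links that carry a title are not handled.

```bash
# Hypothetical sketch of the Markdown checks listed above.
# Prints [WARNING] lines and returns 1 if any check fails.
validate_markdown_sketch() {
  local file="$1" dir missing issues=0

  # Check 1: the output file exists and is non-empty.
  if [ ! -s "$file" ]; then
    echo "[WARNING] $file: output missing or empty"
    return 1
  fi

  # Check 2: at least one ATX heading such as "# Title".
  if ! grep -Eq '^#{1,6} ' "$file"; then
    echo "[WARNING] $file: No headings found (possible conversion issue)"
    issues=$((issues + 1))
  fi

  # Check 3: every local image target in ![alt](path) exists relative to the
  # Markdown file. Remote URLs are skipped.
  dir=$(dirname "$file")
  missing=$(grep -Eo '!\[[^]]*\]\([^)]+\)' "$file" |
    sed -E 's/^!\[[^]]*\]\(([^)]+)\)$/\1/' |
    while IFS= read -r target; do
      case "$target" in
        (http://*|https://*) continue ;;
      esac
      [ -e "$dir/$target" ] || echo "$target"
    done)
  if [ -n "$missing" ]; then
    echo "[WARNING] $file: broken image links: $(echo "$missing" | tr '\n' ' ')"
    issues=$((issues + 1))
  fi

  [ "$issues" -eq 0 ]
}
```

Consistent with the recovery strategy described in this document, a caller would report these warnings without aborting the batch.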

### Phase 5: Reporting
- Generate conversion.log with statistics
- Report success/failure counts by format
- List timeout occurrences
- Summarize validation issues

## Quality Considerations

### Fidelity Expectations
- DOCX→Markdown: 75-80% with MarkItDown (best tables)
- PDF→Markdown: Varies by PDF complexity (scan quality critical)
- Markdown→DOCX: 95%+ with Pandoc (excellent preservation)
- Markdown→PDF: High quality with Typst/XeLaTeX engines

### Known Limitations
- **Scanned PDFs**: OCR quality depends on scan resolution
- **Complex layouts**: Multi-column or nested tables may degrade
- **Embedded fonts**: PDF fonts may affect text extraction accuracy
- **Images**: Large images may cause timeout issues

### Best Practices
- Use MarkItDown for DOCX when available (best table quality); for PDFs, follow the Tool Priority Matrix above
- Allow longer timeouts for large PDFs (300s default)
- Review conversion.log for failure patterns
- Test output files for critical conversions

## Error Handling

### Common Errors

**Tool not available**:
```
Error: No conversion tool available for DOCX→Markdown
Required: markitdown or pandoc
```
→ Install required tools

**Conversion timeout**:
```
[FAILED] large_document.pdf: Conversion timeout after 300s
```
→ Increase TIMEOUT_PDF_TO_MD or use a simpler PDF

**Validation failure**:
```
[WARNING] output.md: No headings found (possible conversion issue)
```
→ Check source document structure

### Recovery Strategies
- Failed conversions automatically retry with fallback tool
- Timeouts skip to next file (batch processing continues)
- Validation warnings don't block workflow (reported only)

## Configuration Options

Environment variables to tune conversion behavior:

```bash
# Timeout multiplier (unitless factor applied to every base timeout)
TIMEOUT_MULTIPLIER=1.5  # Increase all timeouts by 50%

# Disk usage limits
MAX_DISK_USAGE_GB=10  # Abort if output exceeds 10GB
MIN_FREE_SPACE_MB=500  # Require 500MB free space

# Concurrency
MAX_CONCURRENT_CONVERSIONS=4  # Parallel conversion limit
```
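
To make the multiplier concrete: with `TIMEOUT_MULTIPLIER=1.5`, the 300s PDF timeout becomes 450s. The helper below is a sketch of that arithmetic; `effective_timeout` is a hypothetical name, and rounding up to a whole second is an assumption rather than documented behavior.

```bash
# Hypothetical helper: scale a base timeout (seconds) by TIMEOUT_MULTIPLIER.
# Shell arithmetic is integer-only, so awk handles the fractional multiplier;
# the result is rounded up to a whole second (an assumption of this sketch).
effective_timeout() {
  local base="$1"
  awk -v base="$base" -v mult="${TIMEOUT_MULTIPLIER:-1}" 'BEGIN {
    t = base * mult
    r = int(t)
    if (r < t) r++
    print r
  }'
}

TIMEOUT_MULTIPLIER=1.5
effective_timeout 300   # prints 450
effective_timeout 60    # prints 90
```

The result could then be handed to a tool such as GNU `timeout`; whether the skill's scripts use that tool is not stated in the source.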

## Integration Examples

### From Claude Code Agents

When working within agent contexts, the skill automatically triggers when Claude detects conversion needs:

```markdown
User: "Extract text from these PDF reports"
→ Skill auto-invokes: document-converter
→ Converts PDFs to Markdown
→ Returns structured text
```

### From Slash Commands

The `/convert-docs` command delegates to this skill when available:

```bash
/convert-docs ./documents ./output
# → Checks skill availability
# → Delegates to document-converter skill
# → Falls back to script mode if skill unavailable
```

### From Other Skills

Skills can compose with document-converter:

```yaml
# research-specialist skill
dependencies:
  - document-converter  # Auto-loads for PDF analysis
```

## Script Locations

The skill relies on conversion scripts in the project:

- **Core orchestration**: `.claude/lib/convert/convert-core.sh`
- **DOCX functions**: `.claude/lib/convert/convert-docx.sh`
- **PDF functions**: `.claude/lib/convert/convert-pdf.sh`
- **Gemini wrapper**: `.claude/lib/convert/convert-gemini.sh`
- **Gemini Python**: `.claude/lib/convert/convert_gemini.py`
- **Markdown utilities**: `.claude/lib/convert/convert-markdown.sh`

Scripts are symlinked in the skill's `scripts/` directory for easy access.

## Testing

Test the skill with sample conversions:

```bash
# Test DOCX→Markdown
/convert-docs ./test/sample.docx ./output

# Test PDF→Markdown
/convert-docs ./test/report.pdf ./output

# Test Markdown→DOCX
/convert-docs ./test/document.md ./output

# Batch test
/convert-docs ./test/documents ./output
```

Verify:
- conversion.log generated with statistics
- Output files created with correct extensions
- Image directories created when needed
- Quality meets expectations (check tables, formatting)

## Troubleshooting

### Skill Not Triggering

If the skill doesn't auto-invoke when expected:
- Check description includes trigger keywords (convert, document, PDF, DOCX, Markdown)
- Test with explicit skill invocation: "Use document-converter skill"
- Verify skill is in `.claude/skills/` directory (project-level)

### Tool Installation

**MarkItDown** (recommended):
```bash
pip install markitdown
```

**PyMuPDF4LLM** (optional):
```bash
pip install pymupdf4llm
```

**pdf2docx** (optional, for PDF to DOCX):
```bash
pip install pdf2docx
```

**google-genai** (optional, for Gemini API):
```bash
pip install google-genai

# Set API key (free tier available at https://aistudio.google.com/)
export GEMINI_API_KEY="your-api-key"
```

**Pandoc** (required):
```bash
# Ubuntu/Debian
apt install pandoc

# macOS
brew install pandoc
```

**Typst** (optional, for MD→PDF):
```bash
# Ubuntu/Debian
wget https://github.com/typst/typst/releases/latest/download/typst-x86_64-unknown-linux-musl.tar.xz
tar -xf typst-*.tar.xz && sudo mv typst-*/typst /usr/local/bin/

# macOS
brew install typst
```

### Performance Issues

If conversions are slow:
- Reduce MAX_CONCURRENT_CONVERSIONS (lower parallelism)
- Increase timeout values for large files
- Check disk I/O (slow storage may bottleneck)
- Use PyMuPDF4LLM for simple PDFs (faster than MarkItDown)

## See Also

- [reference.md](./reference.md) - Detailed tool documentation and metrics
- [examples.md](./examples.md) - Usage examples and common patterns
- [Convert-Docs Command Guide](../../docs/guides/commands/convert-docs-command-guide.md)
- [MarkItDown Documentation](https://github.com/microsoft/markitdown)
- [Pandoc Manual](https://pandoc.org/MANUAL.html)


---

## Referenced Files

> The following files are referenced in this skill and included for context.

### reference.md

```markdown
# Document Converter - Technical Reference

Complete technical documentation for the document-converter skill, including tool specifications, quality metrics, API reference, and performance benchmarks.

## Tool Specifications

### MarkItDown

**Version**: Latest (pip installable)
**Repository**: https://github.com/microsoft/markitdown
**License**: MIT
**Platform**: Cross-platform (Python)

**Capabilities**:
- DOCX → Markdown conversion
- PDF → Markdown conversion
- Excel → Markdown table extraction
- PowerPoint → Markdown outline extraction
- Image OCR support (with pytesseract)

**Quality Metrics** (from testing):
- DOCX tables: 100% pipe-style markdown preservation
- Heading preservation: 95%+
- Unicode/emoji support: Perfect
- Image extraction: Reliable
- Processing speed: Fast (0.5-2s per file)

**Command Interface**:
```bash
markitdown input.docx > output.md
markitdown input.pdf > output.md
```

**Exit Codes**:
- 0: Success
- 1: File not found
- 2: Conversion error

**Installation**:
```bash
pip install markitdown

# With OCR support
pip install markitdown[ocr]
```

### Pandoc

**Version**: >= 2.0 (tested with 2.19+)
**Repository**: https://github.com/jgm/pandoc
**License**: GPL-2.0
**Platform**: Cross-platform (Haskell)

**Capabilities**:
- Universal document converter (40+ formats)
- DOCX ↔ Markdown
- PDF input is not supported (Pandoc can write PDF but cannot read it)
- Markdown → PDF (with LaTeX/Typst)
- HTML, EPUB, and more

**Quality Metrics**:
- DOCX→Markdown: 68% fidelity (tables verbose)
- Markdown→DOCX: 95%+ fidelity (excellent)
- Markdown→PDF: High quality (LaTeX rendering)
- Heading preservation: Excellent
- Table handling: Reliable but grid-style (verbose)

**Command Interface**:
```bash
# DOCX → Markdown
pandoc -f docx -t gfm input.docx -o output.md

# Markdown → DOCX
pandoc -f gfm -t docx input.md -o output.docx

# Markdown → PDF (with Typst)
pandoc -f gfm --pdf-engine=typst input.md -o output.pdf

# Markdown → PDF (with XeLaTeX)
pandoc -f gfm --pdf-engine=xelatex input.md -o output.pdf
```

**Format Options**:
- `gfm`: GitHub-Flavored Markdown (recommended)
- `markdown`: Pandoc's extended markdown
- `commonmark`: Strict CommonMark spec

**Exit Codes**:
- 0: Success
- 1: Conversion error
- 3: File not found
- 4: Encoding error

**Installation**:
```bash
# Ubuntu/Debian
apt install pandoc

# macOS
brew install pandoc

# From source
cabal install pandoc
```

### PyMuPDF4LLM

**Version**: Latest (pip installable)
**Repository**: https://github.com/pymupdf/PyMuPDF4LLM
**License**: AGPL-3.0
**Platform**: Cross-platform (Python)

**Capabilities**:
- PDF → Markdown conversion
- Optimized for LLM consumption
- Lightweight dependencies (PyMuPDF only)
- Fast processing
- Zero configuration

**Quality Metrics**:
- Text extraction: Excellent
- Unicode preservation: Perfect
- Table detection: Basic (not formatted)
- Image extraction: Supported
- Processing speed: Very fast (0.2-1s per file)

**Command Interface**:
```python
import pymupdf4llm

md_text = pymupdf4llm.to_markdown("input.pdf")
with open("output.md", "w", encoding="utf-8") as f:
    f.write(md_text)
```

**Bash Wrapper** (used in skill):
```bash
python3 -c "
import sys
import pymupdf4llm

md_text = pymupdf4llm.to_markdown(sys.argv[1])
with open(sys.argv[2], 'w', encoding='utf-8') as f:
    f.write(md_text)
" input.pdf output.md
```

**Installation**:
```bash
pip install pymupdf4llm
```

### Typst

**Version**: Latest (releases on GitHub)
**Repository**: https://github.com/typst/typst
**License**: Apache-2.0
**Platform**: Cross-platform (Rust binary)

**Capabilities**:
- Markdown → PDF compilation (via Pandoc)
- Fast rendering (Rust-based)
- Modern typesetting engine
- No LaTeX dependencies

**Quality Metrics**:
- Rendering speed: Very fast (<1s typical)
- Font support: Excellent
- Layout quality: High
- Unicode support: Perfect

**Command Interface**:
```bash
typst compile input.typ output.pdf

# Via Pandoc
pandoc --pdf-engine=typst input.md -o output.pdf
```

**Installation**:
```bash
# Ubuntu/Debian (manual)
wget https://github.com/typst/typst/releases/latest/download/typst-x86_64-unknown-linux-musl.tar.xz
tar -xf typst-*.tar.xz && sudo mv typst-*/typst /usr/local/bin/

# macOS
brew install typst

# Windows
scoop install typst
```

### XeLaTeX

**Version**: Part of TeX Live distribution
**License**: TeX license
**Platform**: Cross-platform (LaTeX)

**Capabilities**:
- Markdown → PDF compilation (via Pandoc)
- Traditional LaTeX typesetting
- Extensive font support
- Unicode support

**Quality Metrics**:
- Rendering quality: Excellent
- Rendering speed: Slower than Typst (5-10s typical)
- Font support: Extensive
- Package ecosystem: Comprehensive

**Command Interface**:
```bash
# Via Pandoc
pandoc --pdf-engine=xelatex input.md -o output.pdf
```

**Installation**:
```bash
# Ubuntu/Debian
apt install texlive-xetex

# macOS
brew install mactex

# Full TeX Live
wget http://mirror.ctan.org/systems/texlive/tlnet/install-tl-unx.tar.gz
tar -xzf install-tl-unx.tar.gz
cd install-tl-* && ./install-tl
```

## Quality Comparison Matrix

Based on comprehensive testing with sample documents:

| Source → Target | Tool 1 (Primary) | Fidelity | Tool 2 (Fallback) | Fidelity | Speed |
|-----------------|------------------|----------|-------------------|----------|-------|
| DOCX → Markdown | MarkItDown | 75-80% | Pandoc | 68% | Fast |
| PDF → Markdown | MarkItDown | 70-85% | PyMuPDF4LLM | 65-75% | Fast |
| Markdown → DOCX | Pandoc | 95%+ | - | - | Fast |
| Markdown → PDF | Pandoc+Typst | 98%+ | Pandoc+XeLaTeX | 98%+ | Typst faster |

**Fidelity Scoring Criteria**:
- Heading structure preserved (20%)
- List formatting preserved (15%)
- Table structure preserved (25%)
- Bold/italic preserved (10%)
- Links preserved (10%)
- Images extracted/embedded (10%)
- Unicode/special chars preserved (10%)
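
The weights above sum to 100%, so one natural reading is a weighted average of per-criterion scores. The sketch below assumes that reading: each criterion is scored from 0 to 100 and the scores are combined linearly. The source does not state the exact formula, and `fidelity_score` is a hypothetical name.

```bash
# Hypothetical sketch: combine per-criterion scores (each 0-100) using the
# weights listed above. Argument order:
#   headings lists tables emphasis links images unicode
fidelity_score() {
  awk -v h="$1" -v l="$2" -v t="$3" -v e="$4" -v k="$5" -v i="$6" -v u="$7" 'BEGIN {
    total = 20*h + 15*l + 25*t + 10*e + 10*k + 10*i + 10*u
    printf "%.1f\n", total / 100
  }'
}

# A document that is perfect except for half-preserved tables:
fidelity_score 100 100 50 100 100 100 100   # prints 87.5
```

Under this reading, tables carry the largest single weight, so table handling moves the score more than any other criterion.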

## API Reference

### convert-core.sh

Main orchestration module for document conversion.

#### Functions

##### `detect_tools()`

Detect available conversion tools and set global flags.

**Sets**:
- `MARKITDOWN_AVAILABLE` (true/false)
- `PANDOC_AVAILABLE` (true/false)
- `PYMUPDF_AVAILABLE` (true/false)
- `TYPST_AVAILABLE` (true/false)
- `XELATEX_AVAILABLE` (true/false)

**Example**:
```bash
source convert-core.sh
detect_tools

if [ "$MARKITDOWN_AVAILABLE" = true ]; then
  echo "MarkItDown is available"
fi
```

##### `discover_files(input_dir, output_dir)`

Discover convertible files in input directory and determine conversion direction.

**Parameters**:
- `input_dir`: Directory to scan for files
- `output_dir`: Destination directory

**Sets**:
- `CONVERSION_DIRECTION`: "TO_MARKDOWN" or "FROM_MARKDOWN"
- Populates arrays: `docx_files`, `pdf_files`, `md_files`

**Returns**: 0 on success, 1 on error (no files or mixed direction)

**Example**:
```bash
discover_files ~/Documents ~/Output
echo "Direction: $CONVERSION_DIRECTION"
echo "DOCX files: ${#docx_files[@]}"
```

##### `validate_file_output(output_file, format)`

Validate conversion output file meets quality standards.

**Parameters**:
- `output_file`: Path to output file
- `format`: Expected format ("markdown", "docx", "pdf")

**Returns**: 0 if valid, 1 if validation issues found

**Checks**:
- File exists and non-empty
- Markdown: Contains headings (basic structure check)
- Images referenced exist
- Encoding is valid UTF-8

**Example**:
```bash
if validate_file_output output.md markdown; then
  echo "Valid markdown output"
else
  echo "Validation failed"
  validation_failures=$((validation_failures + 1))
fi
```

##### `main_conversion(input_dir, output_dir)`

Main conversion entry point. Orchestrates full conversion workflow.

**Parameters**:
- `input_dir`: Directory containing files to convert
- `output_dir`: Destination directory (default: ./converted_output)

**Returns**: 0 on success, 1 on error

**Workflow**:
1. Detect tools
2. Discover files
3. Create output directory
4. Process conversions (batch)
5. Validate outputs
6. Generate conversion.log

**Example**:
```bash
source convert-core.sh
main_conversion ~/Documents ~/Output
```

### convert-docx.sh

DOCX-specific conversion functions.

#### Functions

##### `convert_docx_to_markdown(input_file, output_file)`

Convert DOCX file to Markdown using MarkItDown or Pandoc.

**Parameters**:
- `input_file`: Path to .docx file
- `output_file`: Path to .md output

**Returns**: 0 on success, 1 on failure

**Behavior**:
- Tries MarkItDown first (if available)
- Falls back to Pandoc on failure
- Applies timeout (60s default)
- Extracts images to `output_dir/media/`

**Example**:
```bash
source convert-docx.sh
convert_docx_to_markdown document.docx output.md
```

##### `convert_markdown_to_docx(input_file, output_file)`

Convert Markdown file to DOCX using Pandoc.

**Parameters**:
- `input_file`: Path to .md file
- `output_file`: Path to .docx output

**Returns**: 0 on success, 1 on failure

**Behavior**:
- Uses Pandoc (only option)
- Applies timeout (60s default)
- Preserves formatting (95%+ fidelity)

**Example**:
```bash
source convert-docx.sh
convert_markdown_to_docx README.md README.docx
```

### convert-pdf.sh

PDF-specific conversion functions.

#### Functions

##### `convert_pdf_to_markdown(input_file, output_file)`

Convert PDF file to Markdown using MarkItDown or PyMuPDF4LLM.

**Parameters**:
- `input_file`: Path to .pdf file
- `output_file`: Path to .md output

**Returns**: 0 on success, 1 on failure

**Behavior**:
- Tries MarkItDown first (if available)
- Falls back to PyMuPDF4LLM on failure
- Applies timeout (300s default - PDFs are slower)
- Extracts images to `output_dir/media/`

**Example**:
```bash
source convert-pdf.sh
convert_pdf_to_markdown report.pdf report.md
```

##### `convert_markdown_to_pdf(input_file, output_file)`

Convert Markdown file to PDF using Pandoc with Typst or XeLaTeX.

**Parameters**:
- `input_file`: Path to .md file
- `output_file`: Path to .pdf output

**Returns**: 0 on success, 1 on failure

**Behavior**:
- Tries Pandoc + Typst first (if available)
- Falls back to Pandoc + XeLaTeX
- Applies timeout (120s default)
- High-quality rendering

**Example**:
```bash
source convert-pdf.sh
convert_markdown_to_pdf document.md document.pdf
```

### convert-markdown.sh

Markdown utility functions.

#### Functions

##### `sanitize_filename(filename)`

Sanitize filename for safe filesystem usage.

**Parameters**:
- `filename`: Input filename

**Returns**: Sanitized filename (stdout)

**Behavior**:
- Converts spaces to underscores
- Lowercases filename
- Removes special characters
- Preserves extension

**Example**:
```bash
source convert-markdown.sh
clean_name=$(sanitize_filename "My Document (Draft).md")
echo "$clean_name"  # my_document_draft.md
```
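
For readers without access to the script, the documented behavior can be approximated as below. This is a hypothetical re-implementation for illustration, not the function from `convert-markdown.sh`; it keeps only ASCII letters, digits, underscores, and hyphens, which is an assumption about what "special characters" means here.

```bash
# Hypothetical re-implementation of the documented behaviour.
sanitize_filename_sketch() {
  local name="$1" stem ext
  # Split off the last extension, if the name has one.
  case "$name" in
    (?*.*) stem="${name%.*}"; ext=".${name##*.}" ;;
    (*)    stem="$name"; ext="" ;;
  esac
  # Spaces to underscores, lowercase, then drop anything outside [a-z0-9_-].
  stem=$(printf '%s' "$stem" | tr ' ' '_' | tr '[:upper:]' '[:lower:]' | tr -cd 'a-z0-9_-')
  ext=$(printf '%s' "$ext" | tr '[:upper:]' '[:lower:]' | tr -cd 'a-z0-9.')
  printf '%s%s\n' "$stem" "$ext"
}

sanitize_filename_sketch "My Document (Draft).md"   # prints my_document_draft.md
```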

##### `extract_markdown_images(markdown_file, media_dir)`

Extract image references from markdown file.

**Parameters**:
- `markdown_file`: Path to .md file
- `media_dir`: Directory for extracted images

**Returns**: 0 on success, 1 on error

**Behavior**:
- Parses markdown for image syntax `![alt](path)`
- Copies referenced images to media_dir
- Updates image paths in markdown

**Example**:
```bash
source convert-markdown.sh
extract_markdown_images document.md ./media/
```

## Performance Benchmarks

Measured on typical developer workstation (8-core CPU, SSD):

### Conversion Speed

| Source Format | Target Format | Tool | File Size | Time | Throughput |
|---------------|---------------|------|-----------|------|------------|
| DOCX | Markdown | MarkItDown | 100KB | 0.8s | 125KB/s |
| DOCX | Markdown | Pandoc | 100KB | 1.2s | 83KB/s |
| PDF (text) | Markdown | MarkItDown | 500KB | 2.3s | 217KB/s |
| PDF (text) | Markdown | PyMuPDF4LLM | 500KB | 0.9s | 555KB/s |
| PDF (scan) | Markdown | MarkItDown | 2MB | 15s | 133KB/s |
| Markdown | DOCX | Pandoc | 50KB | 0.7s | 71KB/s |
| Markdown | PDF | Typst | 50KB | 0.5s | 100KB/s |
| Markdown | PDF | XeLaTeX | 50KB | 6.2s | 8KB/s |

### Batch Processing

| Files | Format | Concurrency | Total Time | Per-File Avg |
|-------|--------|-------------|------------|--------------|
| 10 DOCX | → Markdown | 4 | 3.2s | 0.32s |
| 10 DOCX | → Markdown | 1 | 9.8s | 0.98s |
| 20 PDF | → Markdown | 4 | 12.5s | 0.63s |
| 20 PDF | → Markdown | 1 | 43s | 2.15s |
| 50 Markdown | → DOCX | 4 | 14s | 0.28s |
| 50 Markdown | → PDF | 4 | 32s | 0.64s |

**Concurrency Benefits**:
- 4 parallel conversions: ~3x speedup vs sequential
- Diminishing returns beyond 4 concurrent (I/O bound)
- Optimal: Match CPU core count (4-8 concurrent)

## Configuration Reference

### Environment Variables

#### Timeout Configuration

```bash
# Base timeout values (seconds)
TIMEOUT_DOCX_TO_MD=60      # DOCX → Markdown timeout
TIMEOUT_PDF_TO_MD=300      # PDF → Markdown timeout (longer for scans)
TIMEOUT_MD_TO_DOCX=60      # Markdown → DOCX timeout
TIMEOUT_MD_TO_PDF=120      # Markdown → PDF timeout

# Global multiplier
TIMEOUT_MULTIPLIER=1.5     # Increase all timeouts by 50%
```

#### Resource Limits

```bash
# Disk usage limits
MAX_DISK_USAGE_GB=10       # Abort if output exceeds 10GB
MIN_FREE_SPACE_MB=100      # Require 100MB free space

# Concurrency
MAX_CONCURRENT_CONVERSIONS=4  # Parallel conversion limit
```

#### Logging

```bash
# Log file location
LOG_FILE=/path/to/conversion.log

# Log level (not currently implemented)
LOG_LEVEL=INFO  # DEBUG, INFO, WARN, ERROR
```

### Configuration Files

Currently, the conversion system is configured through environment variables only. A future enhancement could add a configuration file:

**~/.claude/convert.conf**:
```ini
[timeouts]
docx_to_md = 60
pdf_to_md = 300
md_to_docx = 60
md_to_pdf = 120

[resources]
max_disk_gb = 10
min_free_mb = 100
max_concurrent = 4

[tools]
prefer_markitdown = true
prefer_typst = true
```

## Troubleshooting Guide

### Common Issues

#### Issue: "No conversion tool available"

**Cause**: Required tools not installed.

**Solution**:
```bash
# Install MarkItDown (recommended)
pip install markitdown

# Install Pandoc (required for Markdown → DOCX/PDF)
apt install pandoc  # Ubuntu/Debian
brew install pandoc  # macOS
```

#### Issue: "Conversion timeout"

**Cause**: File too large or complex for default timeout.

**Solution**:
```bash
# Increase timeout multiplier
export TIMEOUT_MULTIPLIER=2.0

# Or increase specific timeout
export TIMEOUT_PDF_TO_MD=600  # 10 minutes
```

#### Issue: "Tables look wrong in output"

**Cause**: Tool-specific table formatting differences.

**Solution**:
- MarkItDown: Pipe-style tables (best for GFM)
- Pandoc: Grid-style tables (verbose but reliable)
- Try alternate tool: `--use-agent` mode allows manual tool selection

#### Issue: "Images missing in output"

**Cause**: Image extraction failed or paths incorrect.

**Solution**:
- Check `media/` directory created
- Verify image paths in markdown (`![alt](media/image.png)`)
- Check conversion.log for image extraction errors

#### Issue: "Unicode characters corrupted"

**Cause**: Encoding issues in conversion pipeline.

**Solution**:
- MarkItDown: Perfect Unicode support (use as primary)
- PyMuPDF4LLM: Perfect Unicode support
- Pandoc: Check locale settings (`export LANG=en_US.UTF-8`)

#### Issue: "PDF conversion poor quality"

**Cause**: Scanned PDF or complex layout.

**Solution**:
- Scanned PDFs: Install OCR support (`pip install markitdown[ocr]`)
- Complex layouts: Try PyMuPDF4LLM fallback
- Increase timeout for large scans

#### Issue: "Batch conversion hangs"

**Cause**: Concurrent conversion deadlock or timeout.

**Solution**:
```bash
# Reduce concurrency
export MAX_CONCURRENT_CONVERSIONS=1

# Increase timeouts
export TIMEOUT_MULTIPLIER=3.0
```

### Debugging Tips

#### Enable Verbose Logging

```bash
# Run with bash tracing
bash -x /path/to/convert-core.sh

# Check conversion.log after run
grep FAILED conversion.log
grep WARNING conversion.log
```

#### Test Single File

Isolate issue by testing single file conversion:

```bash
# Test DOCX → Markdown
source convert-core.sh
detect_tools
convert_docx_to_markdown test.docx test.md

# Check exit code
echo $?  # 0 = success, 1 = failure
```

#### Check Tool Versions

```bash
# MarkItDown version
pip show markitdown

# Pandoc version
pandoc --version

# PyMuPDF4LLM version
pip show pymupdf4llm
```

## Migration Notes

### From convert-docs Command

If migrating from direct command usage to skill-based approach:

**Before** (command invocation):
```bash
/convert-docs ./documents ./output
```

**After** (skill-based, automatic):
```markdown
"Convert the PDF reports in ./documents to Markdown"
→ Claude automatically invokes document-converter skill
→ Same conversion quality and tools
→ More seamless integration in workflows
```

**Benefits**:
- Automatic skill discovery (no explicit command needed)
- Composition with other skills
- Unified conversion interface across workflows

### Backward Compatibility

The `/convert-docs` command is fully backward compatible:
- Checks for skill availability
- Delegates to skill if available
- Falls back to script mode if skill not present
- Zero breaking changes

## Future Enhancements

### Planned Features

1. **Configuration File Support**
   - Move from environment variables to ~/.claude/convert.conf
   - Per-project configuration overrides

2. **Quality Metrics API**
   - Programmatic quality scoring
   - Automated quality regression testing

3. **Format Extensions**
   - HTML → Markdown
   - Markdown → HTML
   - EPUB support

4. **Advanced Image Handling**
   - Image optimization (compress large images)
   - OCR for scanned document images
   - SVG preservation

5. **Incremental Conversion**
   - Skip already-converted files (checksum comparison)
   - Resume interrupted batch conversions

### Feature Requests

See [GitHub Issues](https://github.com/your-repo/issues) for feature requests and roadmap.

## Changelog

### Version 1.0.0 (2025-11-20)
- Initial skill release
- DOCX ↔ Markdown conversion
- PDF → Markdown conversion
- Markdown → PDF conversion
- Tool auto-detection and fallback
- Batch processing support
- Quality validation
- Concurrent conversion support

## License

Same license as parent project. See LICENSE file.

```

### examples.md

```markdown
# Document Converter - Usage Examples

Practical examples for using the document-converter skill in various scenarios, from simple conversions to complex workflows.

## Basic Conversions

### Single DOCX to Markdown

```bash
# Source conversion library
source /home/benjamin/.config/.claude/lib/convert/convert-core.sh

# Detect tools
detect_tools

# Convert single file
convert_docx_to_markdown ~/Documents/report.docx ~/Output/report.md

# Check result
cat ~/Output/report.md
```

**Expected Output**:
```
[PROGRESS] Converting: report.docx → report.md
[SUCCESS] Converted report.docx → report.md (MarkItDown)
```

### Single PDF to Markdown

```bash
source /home/benjamin/.config/.claude/lib/convert/convert-core.sh
detect_tools

convert_pdf_to_markdown ~/Documents/whitepaper.pdf ~/Output/whitepaper.md

# Verify output
ls -lh ~/Output/whitepaper.md
head -20 ~/Output/whitepaper.md
```

**Note**: PDF conversions may take longer (up to 300s for large scanned PDFs).

### Markdown to DOCX

```bash
source /home/benjamin/.config/.claude/lib/convert/convert-core.sh
detect_tools

convert_markdown_to_docx ~/notes/README.md ~/Output/README.docx

# Open in Word to verify
xdg-open ~/Output/README.docx  # Linux
open ~/Output/README.docx      # macOS
```

**Quality**: 95%+ fidelity with Pandoc (excellent formatting preservation).

### Markdown to PDF

```bash
source /home/benjamin/.config/.claude/lib/convert/convert-core.sh
detect_tools

convert_markdown_to_pdf ~/notes/document.md ~/Output/document.pdf

# Open PDF viewer
xdg-open ~/Output/document.pdf  # Linux
open ~/Output/document.pdf      # macOS
```

**Note**: Requires Typst or XeLaTeX as the PDF engine.

## Batch Conversions

### Convert All DOCX in Directory

```bash
source /home/benjamin/.config/.claude/lib/convert/convert-core.sh

# Main conversion handles batch automatically
main_conversion ~/Documents/Word ~/Output/Markdown

# Check results
cat ~/Output/Markdown/conversion.log
```

**Output Structure**:
```
~/Output/Markdown/
├── conversion.log
├── document1.md
├── document2.md
├── report.md
└── media/
    ├── image1.png
    └── image2.jpg
```

### Convert All PDFs in Directory

```bash
source /home/benjamin/.config/.claude/lib/convert/convert-core.sh

main_conversion ~/Downloads/PDFs ~/Output/Markdown

# Review log for failures
grep FAILED ~/Output/Markdown/conversion.log
```

**Typical Batch Stats** (10 files):
```
Conversion Statistics:
  DOCX → Markdown: 10 succeeded, 0 failed
  Processing time: 3.2 seconds (4 concurrent)
  Tool: MarkItDown (primary)
```

### Convert All Markdown to DOCX

```bash
source /home/benjamin/.config/.claude/lib/convert/convert-core.sh

main_conversion ~/notes ~/Output/DOCX

# Verify all files converted
ls ~/Output/DOCX/*.docx | wc -l
```

## Agent-Based Workflows

### Autonomous Conversion (Skill Auto-Invoke)

When working in Claude Code, the skill automatically triggers:

**User**: "Extract text from the PDF reports in ./research/"

**Claude** (internal):
```
→ Detects conversion need
→ Loads document-converter skill
→ Executes conversion
→ Returns structured markdown
```

**Result**: PDF files are converted to Markdown without an explicit command.

### Explicit Skill Invocation

```markdown
User: "Use the document-converter skill to convert all Word documents in ./contracts/ to markdown for analysis"

Claude: I'll use the document-converter skill to convert the Word documents.

[Invokes skill internally]

Result:
- Converted 15 DOCX files to Markdown
- Extracted to ./contracts/markdown/
- See conversion.log for details
```

### Command Integration

The `/convert-docs` command delegates to the skill when available:

```bash
# Standard command usage
/convert-docs ./documents ./output

# Behind the scenes:
# 1. Checks if document-converter skill exists
# 2. Delegates to skill if available
# 3. Falls back to script mode if not
```

## Advanced Scenarios

### Custom Timeout Configuration

For large or complex files:

```bash
# Increase PDF timeout to 10 minutes
export TIMEOUT_PDF_TO_MD=600

# Or use multiplier
export TIMEOUT_MULTIPLIER=2.0

source /home/benjamin/.config/.claude/lib/convert/convert-core.sh
main_conversion ~/large_pdfs ~/output
```
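
How the library combines the base timeout with `TIMEOUT_MULTIPLIER` is not shown here. As a minimal sketch, assuming the multiplier simply scales a base value in seconds, the arithmetic could be done with `awk` (bash has no floating-point math). The helper name `effective_timeout` is hypothetical.

```bash
# Hypothetical helper: scale a base timeout (seconds) by a possibly fractional multiplier.
effective_timeout() {
  local base="$1" multiplier="${2:-1}"
  awk -v b="$base" -v m="$multiplier" 'BEGIN {
    t = b * m
    r = int(t)
    if (r < t) r++   # round up so the limit is never shortened
    print r
  }'
}

effective_timeout 300 2.0   # prints 600
```

With these assumptions, a 300s PDF timeout and a multiplier of 2.0 give the same 600s limit as setting `TIMEOUT_PDF_TO_MD=600` directly.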

### Concurrent Conversion Tuning

Optimize for your system:

```bash
# Low-powered system (reduce concurrency)
export MAX_CONCURRENT_CONVERSIONS=2

# High-powered system (increase concurrency)
export MAX_CONCURRENT_CONVERSIONS=8

source /home/benjamin/.config/.claude/lib/convert/convert-core.sh
main_conversion ~/documents ~/output
```
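
Rather than hard-coding the value, the concurrency limit can be derived from the number of CPU cores. The sketch below is one way to do that; `detect_cpu_count` is a hypothetical helper, and it falls back to 1 when neither `nproc` nor `getconf` reports a usable number.

```bash
# Hypothetical helper: print the number of online CPU cores (at least 1).
detect_cpu_count() {
  local n=""
  if command -v nproc > /dev/null 2>&1; then
    n=$(nproc)
  elif command -v getconf > /dev/null 2>&1; then
    n=$(getconf _NPROCESSORS_ONLN 2>/dev/null)
  fi
  # Fall back to 1 if the value is empty or not a plain number.
  case "$n" in
    ''|*[!0-9]*) n=1 ;;
  esac
  [ "$n" -ge 1 ] || n=1
  echo "$n"
}

# Set the limit before sourcing convert-core.sh.
export MAX_CONCURRENT_CONVERSIONS="$(detect_cpu_count)"
```

On memory-constrained machines a lower value than the core count may still be the better choice.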

### Selective Tool Usage

Force a specific tool when needed:

```bash
# Use only Pandoc (skip MarkItDown)
source /home/benjamin/.config/.claude/lib/convert/convert-core.sh

# Manually set tool availability
MARKITDOWN_AVAILABLE=false
PANDOC_AVAILABLE=true

main_conversion ~/documents ~/output
```

**Use Case**: Testing tool-specific output differences.

### Disk Space Management

Prevent disk exhaustion:

```bash
# Limit output size to 5GB
export MAX_DISK_USAGE_GB=5

# Require 500MB free space
export MIN_FREE_SPACE_MB=500

source /home/benjamin/.config/.claude/lib/convert/convert-core.sh
main_conversion ~/large_dataset ~/output
```

**Behavior**: Conversion aborts if limits exceeded.
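
To see ahead of time whether a run would hit the free-space limit, a pre-flight check along these lines can help. This is a sketch, not the library's own check: `free_space_mb` and `has_min_free_space` are hypothetical names, and the check relies on the POSIX output format of `df -Pk`.

```bash
# Print the free space, in whole megabytes, on the filesystem holding a directory.
free_space_mb() {
  # With -P the second line holds the figures; column 4 is available 1K blocks.
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024) }'
}

# Succeed when at least <min_mb> megabytes are free under <dir>.
has_min_free_space() {
  local dir="$1" min_mb="$2" free
  free=$(free_space_mb "$dir") || return 1
  [ -n "$free" ] && [ "$free" -ge "$min_mb" ]
}
```

For example, `has_min_free_space ~/output "${MIN_FREE_SPACE_MB:-500}" || echo "Not enough free space"` can run before `main_conversion`.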

## Quality Validation Workflows

### Validate Conversion Output

```bash
source /home/benjamin/.config/.claude/lib/convert/convert-core.sh

# Convert file
convert_docx_to_markdown report.docx report.md

# Validate output
if validate_file_output report.md markdown; then
  echo "✓ Conversion quality verified"
else
  echo "⚠ Validation warnings (check conversion.log)"
fi
```

**Validation Checks**:
- File exists and non-empty
- Contains headings (structural check)
- Referenced images exist
- Valid UTF-8 encoding
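
The four checks can be approximated by hand when `validate_file_output` is not available. The function below is an independent sketch of the same ideas, not the library's implementation; the name `check_markdown_basics` and its messages are assumptions, and it only looks at inline `![alt](path)` image references.

```bash
# Hypothetical stand-in for the checks listed above.
check_markdown_basics() {
  local file="$1" dir img
  # 1. File exists and is non-empty.
  [ -s "$file" ] || { echo "missing or empty"; return 1; }
  # 2. At least one ATX heading ("# " through "###### ").
  grep -qE '^#{1,6} ' "$file" || { echo "no headings"; return 1; }
  # 3. Valid UTF-8 (skipped when iconv is not installed).
  if command -v iconv > /dev/null 2>&1; then
    iconv -f UTF-8 -t UTF-8 "$file" > /dev/null 2>&1 || { echo "invalid UTF-8"; return 1; }
  fi
  # 4. Locally referenced images exist relative to the Markdown file.
  dir=$(dirname "$file")
  while IFS= read -r img; do
    case "$img" in http://*|https://*) continue ;; esac
    [ -e "$dir/$img" ] || { echo "missing image: $img"; return 1; }
  done < <(grep -oE '!\[[^]]*\]\([^) ]+' "$file" | sed -E 's/^!\[[^]]*\]\(//')
  echo "ok"
}
```

The function prints the first failed check and returns non-zero, so it can be used in an `if` the same way as `validate_file_output` above.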

### Compare Tool Output

Test quality differences between tools:

```bash
source /home/benjamin/.config/.claude/lib/convert/convert-core.sh

# Convert with MarkItDown only (set both flags explicitly)
MARKITDOWN_AVAILABLE=true
PANDOC_AVAILABLE=false
convert_docx_to_markdown test.docx test_markitdown.md

# Convert with Pandoc
MARKITDOWN_AVAILABLE=false
PANDOC_AVAILABLE=true
convert_docx_to_markdown test.docx test_pandoc.md

# Compare outputs
diff -u test_markitdown.md test_pandoc.md
```

**Analysis**: Review table formatting, image handling, Unicode characters.

## Integration Examples

### Skill Composition

Document-converter can be used alongside other skills:

```markdown
User: "Analyze the financial reports in ./pdfs/ and create a summary"

Claude:
1. Uses document-converter skill to extract text from PDFs
2. Uses research-specialist skill to analyze financial data
3. Uses doc-generator skill to create summary document

Result: Comprehensive analysis with source extraction automated
```

### Workflow Automation

Integrate with other Claude Code features:

```bash
# Example: Convert, analyze, commit workflow
/convert-docs ./research/pdfs ./research/markdown
# → Converts PDFs to Markdown (skill-based)

# Analyze markdown files (research agent) by prompting Claude:
# "Analyze the converted markdown files in ./research/markdown"

# Commit results
/commit "Add research analysis from PDF reports"
```

## Troubleshooting Examples

### Handle Conversion Failures

```bash
source /home/benjamin/.config/.claude/lib/convert/convert-core.sh

main_conversion ~/documents ~/output

# Check for failures
if grep -q FAILED ~/output/conversion.log; then
  echo "Some conversions failed:"
  grep FAILED ~/output/conversion.log

  # Retry with increased timeout
  export TIMEOUT_MULTIPLIER=2.0
  # Manually retry failed files...
fi
```

### Debug Single File Issues

Isolate problem files:

```bash
source /home/benjamin/.config/.claude/lib/convert/convert-core.sh
detect_tools

# Enable bash tracing
set -x

# Convert problematic file
convert_pdf_to_markdown problem.pdf problem.md

# Check exit code
echo "Exit code: $?"

# Disable tracing
set +x
```

### Test Tool Availability

Verify tools before conversion:

```bash
source /home/benjamin/.config/.claude/lib/convert/convert-core.sh

detect_tools

echo "Tool Availability:"
echo "  MarkItDown: $MARKITDOWN_AVAILABLE"
echo "  Pandoc: $PANDOC_AVAILABLE"
echo "  PyMuPDF4LLM: $PYMUPDF_AVAILABLE"
echo "  Typst: $TYPST_AVAILABLE"
echo "  XeLaTeX: $XELATEX_AVAILABLE"

# Install missing tools if needed
if [ "$MARKITDOWN_AVAILABLE" = false ]; then
  echo "Installing MarkItDown..."
  pip install markitdown
fi
```

## Performance Benchmarking

### Measure Conversion Speed

```bash
source /home/benjamin/.config/.claude/lib/convert/convert-core.sh

# Benchmark single file
time convert_docx_to_markdown large.docx large.md

# Benchmark batch
time main_conversion ~/100_files ~/output

# Parse conversion.log for detailed timing
grep "Processing time" ~/output/conversion.log
```

### Compare Concurrency Settings

```bash
# Test sequential
export MAX_CONCURRENT_CONVERSIONS=1
time main_conversion ~/test_files ~/output_seq

# Test parallel (4 concurrent)
export MAX_CONCURRENT_CONVERSIONS=4
time main_conversion ~/test_files ~/output_par

# Calculate speedup
# Typical: 3-4x faster with 4 concurrent
```
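
The speedup is simply the sequential wall-clock time divided by the parallel one. A small helper makes the comparison explicit; `speedup` is a hypothetical name, and the two arguments are whatever elapsed times (in seconds) the `time` runs above reported.

```bash
# Hypothetical helper: print sequential/parallel as a one-decimal ratio.
speedup() {
  awk -v seq="$1" -v par="$2" 'BEGIN {
    if (par <= 0) { print "n/a"; exit }   # avoid dividing by zero
    printf "%.1f\n", seq / par
  }'
}

speedup 10 4   # prints 2.5
```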

## Edge Cases

### Handle Mixed File Types

```bash
# Directory with DOCX, PDF, and MD files
# Note: Cannot mix TO_MARKDOWN and FROM_MARKDOWN in single run

# Convert documents TO markdown first
source /home/benjamin/.config/.claude/lib/convert/convert-core.sh
main_conversion ~/mixed/docs ~/mixed/markdown

# Then convert FROM Markdown to DOCX/PDF
main_conversion ~/mixed/markdown ~/mixed/docx
```

**Error if Mixed**:
```
Error: Cannot mix TO_MARKDOWN and FROM_MARKDOWN conversions
Found: DOCX files AND MD files in same directory
```
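
To find out which direction a directory would trigger before running a batch, the file extensions can be counted up front. This is a sketch of that idea only; `detect_direction` is a hypothetical helper and may not match how `main_conversion` decides internally.

```bash
# Hypothetical helper: classify a directory by the document types it contains.
detect_direction() {
  local dir="$1" to_md=0 from_md=0 f
  for f in "$dir"/*; do
    [ -f "$f" ] || continue
    case "$f" in
      *.[dD][oO][cC][xX]|*.[pP][dD][fF]) to_md=$((to_md + 1)) ;;
      *.[mM][dD])                         from_md=$((from_md + 1)) ;;
    esac
  done
  if [ "$to_md" -gt 0 ] && [ "$from_md" -gt 0 ]; then
    echo "MIXED"
  elif [ "$to_md" -gt 0 ]; then
    echo "TO_MARKDOWN"
  elif [ "$from_md" -gt 0 ]; then
    echo "FROM_MARKDOWN"
  else
    echo "NONE"
  fi
}
```

A result of `MIXED` means the directory should be split into two runs, as shown above.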

### Handle Special Characters in Filenames

```bash
source /home/benjamin/.config/.claude/lib/convert/convert-core.sh

# Filename sanitization is automatic
convert_docx_to_markdown "My Document (Draft) [v2].docx" "output.md"

# Output filename sanitized: my_document_draft_v2.md
```

**Sanitization Rules**:
- Spaces → underscores
- Lowercase conversion
- Special characters removed
- Extension preserved
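
The rules above can be reproduced with a few `tr` calls. The function below is a sketch written from the listed rules, not the skill's own sanitizer; `sanitize_filename` is a hypothetical name, and treating everything outside `a-z`, `0-9`, `_`, and `-` as a "special character" is an assumption.

```bash
# Hypothetical helper; applies the four rules listed above.
sanitize_filename() {
  local name="$1" base ext=""
  base="$name"
  if [[ "$name" == *.* ]]; then
    ext=".${name##*.}"   # extension is kept unchanged
    base="${name%.*}"
  fi
  # Lowercase, turn spaces into underscores, then drop all other characters.
  base=$(printf '%s' "$base" \
    | tr '[:upper:]' '[:lower:]' \
    | tr ' ' '_' \
    | tr -cd 'a-z0-9_-')
  printf '%s%s\n' "$base" "$ext"
}

sanitize_filename "My Document (Draft) [v2].docx"   # prints my_document_draft_v2.docx
```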

### Handle Very Large Files

```bash
# Increase timeout significantly
export TIMEOUT_PDF_TO_MD=1800  # 30 minutes

# Reduce concurrency (memory management)
export MAX_CONCURRENT_CONVERSIONS=1

source /home/benjamin/.config/.claude/lib/convert/convert-core.sh
main_conversion ~/large_scanned_pdfs ~/output
```

**Note**: Scanned PDFs with OCR can take 10-20 minutes for 100+ page documents.

## Real-World Use Cases

### Documentation Migration

**Scenario**: Migrate Word-based documentation to Markdown for version control.

```bash
# Convert all DOCX to Markdown
/convert-docs ./docs/word ./docs/markdown

# Review conversion quality
cat ./docs/markdown/conversion.log

# Commit to git
cd ./docs/markdown
git add *.md media/
git commit -m "Migrate documentation from Word to Markdown"
```

### Research Paper Extraction

**Scenario**: Extract text from PDF research papers for analysis.

```markdown
User: "Extract text from the PDFs in ./research/papers/ for literature review"

Claude: [Invokes document-converter skill]

Result:
- Converted 45 PDF papers to Markdown
- Extracted to ./research/papers/markdown/
- Ready for analysis and summarization
```

### Report Generation

**Scenario**: Generate DOCX reports from Markdown templates.

```bash
# Write report in Markdown (easier editing)
vim report_template.md

# Convert to DOCX for distribution
/convert-docs ./templates ./reports

# Result: Professional DOCX report generated
ls ./reports/report_template.docx
```

### Batch Invoice Processing

**Scenario**: Extract text from PDF invoices for accounting.

```bash
# Convert all invoice PDFs to Markdown
/convert-docs ./invoices/pdf ./invoices/text

# Parse markdown for structured data
# (Next step: Use grep/awk to extract invoice numbers, amounts, etc.)
grep "Total:" ./invoices/text/*.md
```
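
The `grep` above only lists matching lines. As a possible next step, the amounts can be pulled out with `awk`. This sketch assumes each invoice contains a line such as `Total: $1,234.56`; real invoices vary, and the helper name `extract_totals` is hypothetical.

```bash
# Hypothetical helper: print "<file>\t<amount>" for every "Total:" line in a directory.
extract_totals() {
  local dir="$1" f
  for f in "$dir"/*.md; do
    [ -f "$f" ] || continue
    awk -v name="$(basename "$f")" '
      /Total:/ {
        sub(/.*Total:[[:space:]]*/, "")   # drop everything up to the label
        gsub(/[^0-9.]/, "")               # keep digits and the decimal point
        if ($0 != "") printf "%s\t%s\n", name, $0
      }' "$f"
  done
}
```

Running `extract_totals ./invoices/text > totals.tsv` would give a tab-separated file that can be opened in a spreadsheet.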

## Testing and Validation

### Test Suite Example

```bash
#!/bin/bash
# test_conversions.sh - Validate document-converter skill

source /home/benjamin/.config/.claude/lib/convert/convert-core.sh

# Detect available tools once before running the tests
detect_tools

# Test 1: DOCX → Markdown
echo "Test 1: DOCX → Markdown"
convert_docx_to_markdown test/sample.docx test/output_docx.md
[ -f test/output_docx.md ] && echo "✓ PASS" || echo "✗ FAIL"

# Test 2: PDF → Markdown (separate output file so Test 1's result cannot mask a failure)
echo "Test 2: PDF → Markdown"
convert_pdf_to_markdown test/sample.pdf test/output_pdf.md
[ -f test/output_pdf.md ] && echo "✓ PASS" || echo "✗ FAIL"

# Test 3: Markdown → DOCX
echo "Test 3: Markdown → DOCX"
convert_markdown_to_docx test/sample.md test/output.docx
[ -f test/output.docx ] && echo "✓ PASS" || echo "✗ FAIL"

# Test 4: Markdown → PDF
echo "Test 4: Markdown → PDF"
convert_markdown_to_pdf test/sample.md test/output.pdf
[ -f test/output.pdf ] && echo "✓ PASS" || echo "✗ FAIL"

# Test 5: Batch conversion
echo "Test 5: Batch conversion"
main_conversion test/batch test/batch_output
[ -f test/batch_output/conversion.log ] && echo "✓ PASS" || echo "✗ FAIL"
```

### Quality Regression Testing

```bash
# Baseline: Convert with current tools
/convert-docs test/baseline/docs test/baseline/output

# After tool update: Convert again
/convert-docs test/baseline/docs test/updated/output

# Compare outputs
diff -r test/baseline/output test/updated/output

# Manual review of differences
# Focus on: tables, images, unicode, headings
```

## Tips and Best Practices

### When to Use Script Mode vs Agent Mode

**Script Mode** (default):
- Standard batch conversions
- Fast processing required
- Tools already validated
- No quality reporting needed

**Agent Mode** (`--use-agent`):
- First-time conversions (tool detection)
- Quality-critical documents
- Troubleshooting conversion issues
- Detailed logging required

### Optimizing Conversion Quality

1. **Use MarkItDown for DOCX/PDF** (better table preservation)
2. **Install optional tools** (Typst for faster PDFs)
3. **Review conversion.log** (check warnings)
4. **Validate critical conversions** (use validate_file_output)
5. **Test sample files first** (before batch processing)

### Managing Large Batch Conversions

1. **Set appropriate timeouts** (longer for scanned PDFs)
2. **Tune concurrency** (match CPU cores)
3. **Monitor disk space** (set MAX_DISK_USAGE_GB)
4. **Review failures incrementally** (check conversion.log periodically)
5. **Resume interrupted conversions** (skill handles partial completion)

### Handling Edge Cases

1. **Special characters in filenames** → Automatic sanitization
2. **Mixed file types** → Separate conversion runs (TO vs FROM markdown)
3. **Very large files** → Increase timeouts, reduce concurrency
4. **Scanned PDFs** → Install OCR support (`pip install markitdown[ocr]`)
5. **Complex layouts** → Try alternate tool (PyMuPDF4LLM for PDFs)

## Additional Resources

- [SKILL.md](./SKILL.md) - Core skill documentation
- [reference.md](./reference.md) - Technical reference and API docs
- [Convert-Docs Command Guide](../../docs/guides/commands/convert-docs-command-guide.md)
- [MarkItDown Documentation](https://github.com/microsoft/markitdown)
- [Pandoc Manual](https://pandoc.org/MANUAL.html)
- [PyMuPDF4LLM Repository](https://github.com/pymupdf/PyMuPDF4LLM)

```

### ../../docs/guides/commands/convert-docs-command-guide.md

```markdown
# /convert-docs Command - Complete Guide

**Executable**: `.claude/commands/convert-docs.md`

**Quick Start**: Run `/convert-docs <input-directory>` to convert documents bidirectionally.

---

## Table of Contents

1. [Overview](#overview)
2. [Architecture](#architecture)
3. [Usage Examples](#usage-examples)
4. [Advanced Topics](#advanced-topics)
5. [Troubleshooting](#troubleshooting)

---

## Overview

### Purpose

The `/convert-docs` command converts documents bidirectionally between Markdown, Word (DOCX), and PDF formats. It automatically detects file types and applies appropriate conversion tools.

### Supported Conversions

- **To Markdown**: `.docx`, `.pdf` files
- **From Markdown**: `.md` files to DOCX, or to PDF via Pandoc

### When to Use

- Converting Word documents to Markdown for version control
- Converting research PDFs to editable Markdown
- Generating Word/PDF reports from Markdown documentation
- Batch processing multiple documents

### When NOT to Use

- Single file that can be manually converted
- Complex documents with special formatting that may not convert cleanly
- Protected/encrypted documents

---

## Architecture

### Design Principles

- **Dual Mode**: Script mode (fast) or Agent mode (comprehensive)
- **Automatic Detection**: Identifies conversion direction from file types
- **Fallback Mechanisms**: Multiple tools with automatic retry

### Patterns Used

- Conditional mode selection pattern
- Tool fallback chain
- Conversion coordinator pattern

### Integration Points

- **convert-core.sh**: Core conversion library
- **doc-converter agent**: Comprehensive 5-phase workflow
- **Pandoc**: Primary conversion tool
- **Mammoth**: DOCX-specific tool (fallback)
- **pdf-to-text/pdftotext**: PDF extraction tools

### Data Flow

```
Input Directory → Mode Detection → Tool Selection
                                        ↓
Output Directory ← Logging ← Conversion ← Validation
```

---

## Usage Examples

### Example 1: Basic Conversion (Script Mode)

```bash
/convert-docs ~/Documents/Reports
```

**Expected Output**:
```
PROGRESS: Detecting file types...
PROGRESS: Found 3 DOCX files, 2 PDF files
PROGRESS: Converting to Markdown...
PROGRESS: report1.docx -> report1.md (success)
PROGRESS: report2.docx -> report2.md (success)
PROGRESS: research.pdf -> research.md (success)
CONVERSION_COMPLETE: 5/5 files converted
Output: ./converted_output/
Log: ./converted_output/conversion.log
```

**Explanation**:
Script mode provides fast conversion with automatic tool selection and fallback handling.

### Example 2: Specify Output Directory

```bash
/convert-docs ~/Documents/Reports ~/Projects/docs
```

**Expected Output**:
```
PROGRESS: Output directory: ~/Projects/docs
PROGRESS: Converting 5 files...
CONVERSION_COMPLETE: 5/5 files converted
```

**Explanation**:
Converted files are placed in the specified output directory instead of `./converted_output/`.

### Example 3: Agent Mode (Comprehensive)

```bash
/convert-docs ~/Documents/Reports --use-agent
```

**Expected Output**:
```
PROGRESS: Initializing doc-converter agent...
PROGRESS: Phase 1: Tool verification...
PROGRESS: Phase 2: File analysis...
PROGRESS: Phase 3: Conversion execution...
PROGRESS: Phase 4: Quality validation...
PROGRESS: Phase 5: Summary reporting...
CONVERSION_COMPLETE: All files converted with quality report
```

**Explanation**:
Agent mode provides comprehensive workflow with quality checks, detailed logging, and validation phases.

### Example 4: Markdown to DOCX

```bash
/convert-docs ~/Projects/docs/markdown
```

**Expected Output**:
```
PROGRESS: Found 4 MD files
PROGRESS: Converting to DOCX...
PROGRESS: readme.md -> readme.docx (success)
CONVERSION_COMPLETE: 4/4 files converted
```

**Explanation**:
When input contains Markdown files, they are converted to DOCX format.

---

## Advanced Topics

### Performance Considerations

- Script mode: <0.5s overhead, instant conversion start
- Agent mode: ~2-3s initialization overhead
- Large PDFs may take longer due to text extraction
- Consider batch sizes for many files

### Execution Modes

#### Script Mode (Default)
- **Speed**: Fastest, minimal overhead
- **Use for**: Standard conversions, quick batch processing
- **Tools**: Same quality as agent mode
- **Logging**: Basic conversion log

#### Agent Mode
- **Speed**: 2-3s additional overhead
- **Use for**: Quality-critical conversions, audits
- **Trigger**: `--use-agent` flag or keywords
- **Keywords**: "detailed logging", "quality reporting", "verify tools"

### Tool Priority

**For DOCX conversion**:
1. Pandoc (primary)
2. Mammoth (fallback)

**For PDF conversion**:
1. pdftotext (primary)
2. pdf-to-text (fallback)

**For Markdown to DOCX**:
1. Pandoc (primary)

### Quality Considerations

- Complex tables may require manual adjustment
- Images are referenced but may need path updates
- Headers/footers converted to Markdown sections
- Footnotes converted to inline or endnotes

---

## Troubleshooting

### Common Issues

#### Issue 1: Pandoc Not Found

**Symptoms**:
- "pandoc: command not found" error
- Conversion fails immediately

**Cause**:
Pandoc not installed on system

**Solution**:
```bash
# Ubuntu/Debian
sudo apt install pandoc

# macOS
brew install pandoc

# Verify installation
pandoc --version
```

#### Issue 2: PDF Text Extraction Failed

**Symptoms**:
- Empty or garbled Markdown output
- "Unable to extract text" error

**Cause**:
PDF may be image-based (scanned) or protected

**Solution**:
```bash
# Check if PDF contains text
pdftotext input.pdf - | head -20

# For image-based PDFs, use OCR:
# Install tesseract and use ocrmypdf first
ocrmypdf input.pdf input_ocr.pdf
/convert-docs <directory-with-ocr-pdf>
```

#### Issue 3: Encoding Issues

**Symptoms**:
- Strange characters in output
- Conversion succeeds but content garbled

**Cause**:
Source document has non-UTF-8 encoding

**Solution**:
```bash
# Check file encoding
file -i input.docx

# Re-run in agent mode, which handles encoding detection
/convert-docs <input-dir> --use-agent
```

### Debug Mode

Enable verbose output with agent mode:
```bash
/convert-docs <input-dir> --use-agent

# Check detailed log
cat ./converted_output/conversion.log
```

### Getting Help

- Check [Command Reference](.claude/docs/reference/standards/command-reference.md) for quick syntax
- Review conversion tools documentation (Pandoc, Mammoth)
- See related commands: `/research`, `/document`

---

## See Also

- [Command Reference](.claude/docs/reference/standards/command-reference.md)
- [Pandoc Documentation](https://pandoc.org/MANUAL.html)
- [Library API - convert-core.sh](.claude/docs/reference/library-api/overview.md)

```
