
pdf

Comprehensive PDF manipulation, extraction, and generation with support for text extraction, form filling, merging, splitting, annotations, and creation. Use when working with .pdf files for: (1) Extracting text and tables, (2) Filling PDF forms, (3) Merging/splitting PDFs, (4) Creating PDFs programmatically, (5) Adding watermarks/annotations, (6) PDF metadata management

Packaged view

This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.

Stars: 221.

Hot score: 97.

Updated: March 20, 2026.

Overall rating: C (3.1).

Composite score: 3.1.

Best-practice grade: B (75.6).

Install command

npx @skill-hub/cli install aiskillstore-marketplace-pdf

Repository

aiskillstore/marketplace

Skill path: skills/autumnsgrove/pdf


Best for

Primary workflow: Ship Full Stack.

Technical facets: Full Stack.

Target audience: everyone.

License: Unknown.

Original source

Catalog source: SkillHub Club.

Repository owner: aiskillstore.

This is still a mirrored public skill entry. Review the repository before installing into production workflows.

What it helps with

  • Install pdf into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
  • Review https://github.com/aiskillstore/marketplace before adding pdf to shared team environments
  • Use pdf for development workflows

Works across

Claude Code, Codex CLI, Gemini CLI, OpenCode

Favorites: 0.

Sub-skills: 0.

Aggregator: No.

Original source / Raw SKILL.md

---
name: pdf
description: "Comprehensive PDF manipulation, extraction, and generation with support for text extraction, form filling, merging, splitting, annotations, and creation. Use when working with .pdf files for: (1) Extracting text and tables, (2) Filling PDF forms, (3) Merging/splitting PDFs, (4) Creating PDFs programmatically, (5) Adding watermarks/annotations, (6) PDF metadata management"
---

# PDF Manipulation Skill

A guide to working with PDF files in Python: extraction, manipulation, creation, and advanced operations. Core workflows are shown here first, and each section links to a detailed reference for when more depth is needed.

## Core Capabilities

Extract and manipulate PDF content:
- Extract text with layout preservation
- Extract tables and parse structured data
- Fill PDF forms programmatically
- Merge multiple PDFs into a single document
- Split PDFs by pages or ranges
- Create PDFs from scratch with text, images, and graphics
- Add watermarks and annotations
- Extract and modify metadata (author, title, keywords)
- Add password protection and encryption
- Perform OCR on scanned documents
- Convert images to PDF
- Compress and optimize PDF files
- Extract images from PDFs
- Rotate and reorder pages

## Quick Start

Install required libraries:

```bash
pip install pypdf pdfplumber reportlab PyMuPDF pdf2image pytesseract pillow
```

For detailed installation instructions including system dependencies, see:
- [Library Installation Guide](./references/library-installation.md)
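
Before running the workflows below, it can help to confirm that the Python packages and the system tools are actually reachable. The following is a minimal sketch using only the standard library. The helper names `missing_modules` and `missing_binaries` are hypothetical, and the import-name list assumes the usual mapping (PyMuPDF imports as `fitz`, pillow as `PIL`).

```python
import importlib.util
import shutil

# Import names, which differ from the pip names for some packages (assumed mapping).
PYTHON_MODULES = ["pypdf", "pdfplumber", "reportlab", "fitz", "pdf2image", "pytesseract", "PIL"]
# Command-line tools provided by poppler and tesseract.
SYSTEM_BINARIES = ["pdftoppm", "tesseract"]


def missing_modules(names):
    """Return the top-level module names that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def missing_binaries(names):
    """Return the executables that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    print("Missing Python modules:", missing_modules(PYTHON_MODULES))
    print("Missing system tools:", missing_binaries(SYSTEM_BINARIES))
```

An empty list for both means the imports used in this guide should resolve. It does not check versions.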

## Python Libraries Overview

- **pypdf**: Basic operations (merge, split, rotate, metadata)
- **pdfplumber**: Advanced text/table extraction with layout awareness
- **reportlab**: Create PDFs from scratch (reports, invoices, documents)
- **PyMuPDF (fitz)**: Advanced manipulation, annotations, compression
- **pdf2image**: Convert PDF pages to images (requires poppler)
- **pytesseract**: OCR for scanned documents (requires tesseract)

## Text Extraction Workflow

### Basic Extraction

```python
from pypdf import PdfReader

reader = PdfReader("document.pdf")
for page in reader.pages:
    text = page.extract_text()
    print(text)
```

### Layout-Aware Extraction

```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        words = page.extract_words()  # With positioning
        print(text)
```

### Extract from Specific Region

```python
with pdfplumber.open("document.pdf") as pdf:
    page = pdf.pages[0]
    bbox = (0, 0, 612, 100)  # (x0, top, x1, bottom), measured from the top-left corner
    header = page.crop(bbox).extract_text()
```
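
Because pdfplumber measures bounding boxes from the top-left corner of the page, header and footer regions can be derived from the page dimensions instead of hard-coded. A small sketch follows; `header_bbox` and `footer_bbox` are hypothetical helper names, and the band heights are illustrative values only.

```python
def header_bbox(page_width, band_height):
    """Bounding box covering the top `band_height` points of a page."""
    return (0, 0, page_width, band_height)


def footer_bbox(page_width, page_height, band_height):
    """Bounding box covering the bottom `band_height` points of a page."""
    return (0, page_height - band_height, page_width, page_height)


# US Letter is 612 x 792 points.
print(header_bbox(612, 100))      # (0, 0, 612, 100)
print(footer_bbox(612, 792, 50))  # (0, 742, 612, 792)
```

With a pdfplumber page, this could be used as `page.crop(footer_bbox(page.width, page.height, 50)).extract_text()`.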

For detailed text extraction methods including OCR fallback and encoding handling, see:
- [Text Extraction Reference](./references/text-extraction.md)

## Table Extraction Workflow

### Extract All Tables

```python
import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            print(table)
```
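
Each extracted table is a list of rows, and in practice empty cells tend to come back as `None`. The sketch below turns one such table into a list of dictionaries without needing pandas. It assumes the first row is the header, which is not always true for real documents, and `table_to_records` is a hypothetical helper name.

```python
def table_to_records(table):
    """Convert a list-of-rows table into a list of dicts keyed by header."""
    if not table or len(table) < 2:
        return []

    # Use the first row as the header; name blank header cells by position.
    header = [
        (cell or "").strip() or f"column_{i}"
        for i, cell in enumerate(table[0])
    ]

    records = []
    for row in table[1:]:
        # Replace None with "" and trim surrounding whitespace.
        cells = [(cell or "").strip() for cell in row]
        if not any(cells):
            continue  # skip rows that are entirely empty
        records.append(dict(zip(header, cells)))
    return records


sample = [
    ["Product", "Quantity", None],
    ["Widget A", " 10 ", "$50"],
    [None, None, None],
    ["Widget B", "5", None],
]
print(table_to_records(sample))
```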

### Advanced Table Detection

```python
table_settings = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "snap_tolerance": 3
}

tables = page.extract_tables(table_settings=table_settings)
```

For detailed table extraction strategies and data cleaning, see:
- [Table Extraction Reference](./references/table-extraction.md)

## PDF Form Operations

### Fill Form Fields

```python
import fitz

doc = fitz.open("form.pdf")
for page in doc:
    for widget in page.widgets():
        if widget.field_name == "name":
            widget.field_value = "John Doe"
            widget.update()
doc.save("filled.pdf")
doc.close()
```

### Extract Form Field Names

```python
doc = fitz.open("form.pdf")
for page in doc:
    for widget in page.widgets():
        print(f"{widget.field_name}: {widget.field_type_string}")
doc.close()
```

For form filling, flattening, and debugging, see:
- [PDF Operations Reference](./references/pdf-operations.md)

## Merging and Splitting

### Merge PDFs

```python
from pypdf import PdfWriter

# PdfMerger was removed in pypdf 5.0; PdfWriter.append() covers merging
writer = PdfWriter()
for pdf in ["file1.pdf", "file2.pdf", "file3.pdf"]:
    writer.append(pdf)
writer.write("merged.pdf")
writer.close()
```

### Merge with Page Ranges

```python
from pypdf import PdfWriter

writer = PdfWriter()
writer.append("doc1.pdf", pages=(0, 3))  # First 3 pages
writer.append("doc2.pdf")  # All pages
writer.write("compiled.pdf")
writer.close()
```

### Split into Individual Pages

```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("document.pdf")
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", 'wb') as f:
        writer.write(f)
```
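
Splitting by ranges rather than single pages mostly comes down to turning a range specification into page indices. The sketch below assumes a spec format like `"1-3,5"` with one-based, inclusive page numbers. That format and the helper names `parse_page_ranges` and `split_by_ranges` are illustrative choices, not part of pypdf.

```python
def parse_page_ranges(spec, total_pages):
    """Turn a spec like "1-3,5" into lists of zero-based page indices.

    Page numbers in `spec` are one-based and inclusive.
    """
    groups = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start_text, _, end_text = part.partition("-")
        start = int(start_text)
        end = int(end_text) if end_text else start
        if not 1 <= start <= end <= total_pages:
            raise ValueError(f"Invalid page range: {part!r}")
        groups.append(list(range(start - 1, end)))
    return groups


def split_by_ranges(pdf_path, spec, output_prefix="part"):
    """Write one PDF per range in `spec` (hypothetical helper)."""
    # Imported here so the parser above can be used without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(pdf_path)
    groups = parse_page_ranges(spec, len(reader.pages))
    for n, indices in enumerate(groups, start=1):
        writer = PdfWriter()
        for index in indices:
            writer.add_page(reader.pages[index])
        with open(f"{output_prefix}_{n}.pdf", "wb") as f:
            writer.write(f)


print(parse_page_ranges("1-3,5", total_pages=6))  # [[0, 1, 2], [4]]
```

Calling `split_by_ranges("document.pdf", "1-3,5")` would then write `part_1.pdf` and `part_2.pdf`.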

For merging with bookmarks and splitting by size, see:
- [PDF Operations Reference](./references/pdf-operations.md)

## Creating PDFs

### Simple Text PDF

```python
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter

c = canvas.Canvas("output.pdf", pagesize=letter)
c.setFont("Helvetica", 12)
c.drawString(50, 750, "Hello, World!")
c.save()
```

### Styled Report

```python
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("report.pdf")
styles = getSampleStyleSheet()
story = []

story.append(Paragraph("Report Title", styles['Title']))
story.append(Spacer(1, 12))
story.append(Paragraph("Content here", styles['BodyText']))

doc.build(story)
```

### PDF with Table

```python
from reportlab.platypus import Table, TableStyle
from reportlab.lib import colors

data = [
    ['Product', 'Quantity', 'Price'],
    ['Widget A', '10', '$50'],
    ['Widget B', '5', '$75']
]

table = Table(data)
table.setStyle(TableStyle([
    ('BACKGROUND', (0, 0), (-1, 0), colors.grey),
    ('GRID', (0, 0), (-1, -1), 1, colors.black)
]))
```

For complete PDF creation workflows including images, multi-column layouts, and custom fonts, see:
- [PDF Creation Reference](./references/pdf-creation.md)

For practical examples:
- [Invoice Generator](./examples/invoice-generator.md)
- [Report Automation](./examples/report-automation.md)

## Metadata and Security

### Extract Metadata

```python
from pypdf import PdfReader

reader = PdfReader("document.pdf")
metadata = reader.metadata  # None when the PDF has no document info
if metadata:
    print(f"Title: {metadata.get('/Title')}")
    print(f"Author: {metadata.get('/Author')}")
```

### Modify Metadata

```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("document.pdf")
writer = PdfWriter()
for page in reader.pages:
    writer.add_page(page)

writer.add_metadata({
    '/Title': 'New Title',
    '/Author': 'John Doe'
})

with open("updated.pdf", 'wb') as f:
    writer.write(f)
```

### Add Password Protection

```python
writer.encrypt(
    user_password="user123",
    owner_password="owner456",
    algorithm="AES-256"
)
```

For detailed security operations and comprehensive metadata management, see:
- [Metadata, Security, and OCR Reference](./references/metadata-security-ocr.md)

## OCR for Scanned Documents

### Basic OCR

```python
from pdf2image import convert_from_path
import pytesseract

images = convert_from_path("scanned.pdf")
for i, image in enumerate(images):
    text = pytesseract.image_to_string(image)
    print(f"Page {i+1}:\n{text}")
```

### Multi-Language OCR

```python
text = pytesseract.image_to_string(image, lang='eng+fra+deu')
```

For searchable PDF creation and OCR preprocessing, see:
- [Metadata, Security, and OCR Reference](./references/metadata-security-ocr.md)

## Watermarks and Annotations

### Add Text Watermark

```python
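import fitz

doc = fitz.open("document.pdf")
for page in doc:
    # Anchor point for the text; the line is rotated around this point
    start = fitz.Point(page.rect.width * 0.2, page.rect.height * 0.6)
    page.insert_text(
        start,
        "CONFIDENTIAL",
        fontsize=50,
        color=(0.7, 0.7, 0.7),
        fill_opacity=0.3,  # there is no plain `opacity` argument
        morph=(start, fitz.Matrix(45)),  # `rotate` only accepts multiples of 90
    )
doc.save("watermarked.pdf")
doc.close()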
```

### Add Annotations

```python
page.add_highlight_annot(rect)  # Highlight
page.add_text_annot(point, "Note")  # Text note
page.add_underline_annot(rect)  # Underline
```

For stamps and image watermarks, see:
- [Metadata, Security, and OCR Reference](./references/metadata-security-ocr.md)

## Page Operations

### Rotate Pages

```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("document.pdf")
writer = PdfWriter()

for page in reader.pages:
    page.rotate(90)
    writer.add_page(page)

with open("rotated.pdf", 'wb') as f:
    writer.write(f)
```

### Extract Images

```python
import fitz

doc = fitz.open("document.pdf")
for page_num in range(len(doc)):
    page = doc[page_num]
    for img_index, img in enumerate(page.get_images()):
        xref = img[0]
        base_image = doc.extract_image(xref)
        ext = base_image["ext"]  # actual image format, e.g. "png" or "jpeg"
        with open(f"image_{page_num}_{img_index}.{ext}", "wb") as f:
            f.write(base_image["image"])
doc.close()
```

### Convert Images to PDF

```python
from PIL import Image
from reportlab.pdfgen import canvas

c = canvas.Canvas("output.pdf")
for img_path in ["img1.jpg", "img2.jpg"]:
    img = Image.open(img_path)
    c.setPageSize(img.size)
    c.drawImage(img_path, 0, 0, width=img.width, height=img.height)
    c.showPage()
c.save()
```
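
The snippet above uses the pixel dimensions directly as the page size, and ReportLab canvas units are points (1/72 inch), so a high-resolution scan produces a very large page. If the image resolution is known, the page size can be scaled accordingly. The sketch below shows only the arithmetic; `page_size_in_points` is a hypothetical helper, and the DPI value has to come from the image or from the user.

```python
POINTS_PER_INCH = 72


def page_size_in_points(pixel_width, pixel_height, dpi=72):
    """Convert pixel dimensions to a (width, height) page size in points."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return (
        pixel_width * POINTS_PER_INCH / dpi,
        pixel_height * POINTS_PER_INCH / dpi,
    )


# A 2550 x 3300 pixel scan at 300 DPI maps to US Letter (612 x 792 points).
print(page_size_in_points(2550, 3300, dpi=300))
```

The result can be passed to `c.setPageSize(...)` and reused as the `width` and `height` arguments of `c.drawImage(...)`. Pillow may expose a resolution through `img.info.get("dpi")` for some formats, but that key is not guaranteed to be present.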

For detailed page operations, see:
- [PDF Operations Reference](./references/pdf-operations.md)

## Optimization

### Compress PDF

```python
import fitz

doc = fitz.open("large.pdf")
doc.save(
    "optimized.pdf",
    garbage=4,
    deflate=True,
    clean=True
)
doc.close()
```

## Best Practices

### Memory Management

Process large PDFs in chunks:

```python
from pypdf import PdfReader
import gc

reader = PdfReader("large.pdf")
for i, page in enumerate(reader.pages):
    text = page.extract_text()
    # Process text
    if i % 10 == 0:
        gc.collect()
```

### Error Handling

Always handle encryption and errors:

```python
from pypdf import PdfReader

password = "your-password"  # placeholder; supply the real password

try:
    reader = PdfReader("document.pdf")

    if reader.is_encrypted:
        reader.decrypt(password)

    for page in reader.pages:
        text = page.extract_text()
except Exception as e:
    print(f"Error: {e}")
```

### OCR Fallback

Detect and handle scanned documents:

```python
import fitz

doc = fitz.open("document.pdf")
text = doc[0].get_text()

if not text.strip():
    # Use OCR for scanned document
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path("document.pdf")
    text = pytesseract.image_to_string(images[0])
```
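
The check above looks only at the first page, while mixed documents can contain both digital and scanned pages. One way to decide per page is sketched below. The `min_chars` threshold is an assumption chosen for illustration, not a library default, and `pages_needing_ocr` is a hypothetical helper name.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return zero-based indices of pages whose extracted text looks empty.

    `min_chars` is an assumed threshold, not a library default.
    """
    return [
        index
        for index, text in enumerate(page_texts)
        if len((text or "").strip()) < min_chars
    ]


texts = ["Quarterly report with plenty of extractable text.", "", "  \n ", None]
print(pages_needing_ocr(texts))  # [1, 2, 3]
```

With PyMuPDF the input could be built as `[page.get_text() for page in doc]`. Only the flagged pages then need to be rendered and passed to `pytesseract.image_to_string`.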

For comprehensive best practices, common pitfalls, and troubleshooting, see:
- [Best Practices and Common Pitfalls](./references/best-practices.md)

## Common Pitfalls

**Scanned Documents**: Text extraction returns empty for scanned PDFs. Use OCR (pytesseract).

**Table Detection**: Tables not detected correctly. Adjust table_settings strategies.

**Encrypted PDFs**: Operations fail. Check and decrypt with password first.

**Form Fields**: Can't find field names. Use debug helper to list all fields.

**Memory Issues**: Large PDFs cause crashes. Process in chunks with garbage collection.

**Encoding Issues**: Special characters corrupted. Handle with UTF-8 encoding explicitly.

For detailed solutions and debugging strategies, see:
- [Best Practices and Common Pitfalls](./references/best-practices.md)

## Quick Reference

**Text Extraction**:
- Simple: `pypdf` - `page.extract_text()`
- Advanced: `pdfplumber` - `page.extract_text()` + `page.extract_words()`

**Table Extraction**:
- Always use: `pdfplumber` - `page.extract_tables()`

**PDF Creation**:
- Use: `reportlab` - `canvas.Canvas()` or `SimpleDocTemplate()`

**Advanced Operations**:
- Use: `PyMuPDF (fitz)` - forms, annotations, compression

**OCR**:
- Use: `pytesseract` + `pdf2image`

**Merging/Splitting**:
- Use: `pypdf` - `PdfWriter()` (`append()` to merge, `add_page()` to split)

## Helper Scripts

The skill includes helper scripts for common operations:

```bash
# See scripts directory for utilities
python scripts/pdf_helper.py --help
```

## Additional Resources

**Comprehensive References**:
- [Library Installation](./references/library-installation.md) - Setup and dependencies
- [Text Extraction](./references/text-extraction.md) - All extraction methods
- [Table Extraction](./references/table-extraction.md) - Table detection strategies
- [PDF Operations](./references/pdf-operations.md) - Forms, merge, split, pages
- [PDF Creation](./references/pdf-creation.md) - Creating PDFs from scratch
- [Metadata, Security, OCR](./references/metadata-security-ocr.md) - Advanced operations
- [Best Practices](./references/best-practices.md) - Pitfalls and solutions

**Practical Examples**:
- [Invoice Generator](./examples/invoice-generator.md) - Professional invoice templates
- [Report Automation](./examples/report-automation.md) - Automated report generation

## Implementation Guidelines

When working with PDFs:

1. **Choose the right library** for your task (see Quick Reference)
2. **Handle errors** with try-except blocks
3. **Check for encryption** before processing
4. **Use OCR fallback** for scanned documents
5. **Process large files in chunks** to manage memory
6. **Validate input files** before operations
7. **Close documents** to free resources: `doc.close()`

For production use, always implement proper error handling, validate inputs, and test with various PDF types and versions.
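
As one illustration of guideline 6, a cheap pre-check can reject files that are missing, empty, or clearly not PDFs before any library opens them. The sketch below is a heuristic under stated assumptions: it only looks for the `%PDF-` marker near the start of the file and does not prove the file is well formed. `looks_like_pdf` is a hypothetical helper name.

```python
import os
import tempfile


def looks_like_pdf(path):
    """Cheap pre-check: file exists, is non-empty, and starts with %PDF-."""
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return False
    with open(path, "rb") as f:
        # Some files carry a few junk bytes before the header, so scan a small window.
        return b"%PDF-" in f.read(1024)


# Demonstration with two throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.pdf")
    bad = os.path.join(tmp, "bad.pdf")
    with open(good, "wb") as f:
        f.write(b"%PDF-1.7\n")
    with open(bad, "wb") as f:
        f.write(b"not a pdf")
    print(looks_like_pdf(good), looks_like_pdf(bad))  # True False
```

A file that passes this check can still be corrupt or encrypted, so the error handling shown earlier remains necessary.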


---

## Referenced Files

> The following files are referenced in this skill and included for context.

### references/library-installation.md

```markdown
# PDF Library Installation Guide

## Quick Installation

Install all required libraries:

```bash
pip install pypdf pdfplumber reportlab PyMuPDF pdf2image pytesseract pillow
```

## Individual Libraries

### 1. pypdf (formerly PyPDF2)
**Purpose**: Basic PDF operations (merging, splitting, rotation)

```bash
pip install pypdf
```

```python
from pypdf import PdfReader, PdfWriter
```

**Use for**: Merging, splitting, rotating pages, extracting metadata

### 2. pdfplumber
**Purpose**: Advanced text and table extraction with layout awareness

```bash
pip install pdfplumber
```

```python
import pdfplumber
```

**Use for**: Extracting text with positioning, table detection, precise layout analysis

### 3. reportlab
**Purpose**: Creating PDFs from scratch

```bash
pip install reportlab
```

```python
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter, A4
from reportlab.lib.units import inch
from reportlab.platypus import SimpleDocTemplate, Table, TableStyle, Paragraph
from reportlab.lib.styles import getSampleStyleSheet
```

**Use for**: Generating reports, invoices, certificates, custom PDFs

### 4. PyMuPDF (fitz)
**Purpose**: Advanced PDF manipulation, rendering, and conversion

```bash
pip install PyMuPDF
```

```python
import fitz  # PyMuPDF
```

**Use for**: Advanced operations, image extraction, annotations, rendering, compression

### 5. pdf2image
**Purpose**: Converting PDF pages to images

```bash
pip install pdf2image
```

**Requires poppler:**
```bash
# macOS
brew install poppler

# Ubuntu/Debian
sudo apt-get install poppler-utils

# Windows
# Download from https://github.com/oschwartz10612/poppler-windows/releases/
```

```python
from pdf2image import convert_from_path
```

### 6. pytesseract (for OCR)
**Purpose**: OCR for scanned documents

```bash
pip install pytesseract
```

**Requires tesseract:**
```bash
# macOS
brew install tesseract

# Ubuntu/Debian
sudo apt-get install tesseract-ocr

# Windows
# Download from https://github.com/UB-Mannheim/tesseract/wiki
```

```python
import pytesseract
from PIL import Image
```

## System Dependencies

### macOS
```bash
brew install poppler tesseract
```

### Ubuntu/Debian
```bash
sudo apt-get install poppler-utils tesseract-ocr
```

### Windows
1. Install poppler from: https://github.com/oschwartz10612/poppler-windows/releases/
2. Install tesseract from: https://github.com/UB-Mannheim/tesseract/wiki
3. Add both to system PATH

## Troubleshooting

### Import Errors
If you encounter import errors, ensure you're using the correct package names:
- Use `pypdf`, not `PyPDF2`, for newer versions
- Use `import fitz` for PyMuPDF

### Missing System Dependencies
If pdf2image or pytesseract fail, verify system dependencies are installed:
```bash
# Test poppler
pdftoppm -v

# Test tesseract
tesseract --version
```

### Version Conflicts
For compatibility, use these version ranges:
```bash
pip install 'pypdf>=3.0.0' 'pdfplumber>=0.9.0' 'reportlab>=4.0.0' 'PyMuPDF>=1.23.0'
```

```

### references/text-extraction.md

```markdown
# Text Extraction Reference

## Basic Text Extraction (pypdf)

```python
from pypdf import PdfReader

def extract_text_basic(pdf_path):
    """Extract all text from a PDF using pypdf."""
    reader = PdfReader(pdf_path)
    text = ""

    for page_num, page in enumerate(reader.pages, start=1):
        page_text = page.extract_text()
        text += f"--- Page {page_num} ---\n{page_text}\n\n"

    return text

# Usage
text = extract_text_basic("document.pdf")
print(text)
```

## Advanced Text Extraction with Layout (pdfplumber)

```python
import pdfplumber

def extract_text_with_layout(pdf_path):
    """Extract text preserving layout information."""
    results = []

    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, start=1):
            # Extract text
            text = page.extract_text()

            # Extract words with positioning
            words = page.extract_words()

            results.append({
                'page': page_num,
                'text': text,
                'words': words,
                'width': page.width,
                'height': page.height
            })

    return results

# Usage
pages = extract_text_with_layout("document.pdf")
for page_data in pages:
    print(f"Page {page_data['page']}:")
    print(page_data['text'])
    print(f"Total words: {len(page_data['words'])}")
```

## Extract Text from Specific Regions

```python
import pdfplumber

def extract_text_from_region(pdf_path, page_num, bbox):
    """
    Extract text from a specific region.

    Args:
        pdf_path: Path to PDF
        page_num: Page number (0-indexed)
        bbox: Tuple (x0, y0, x1, y1) defining the region
    """
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_num]

        # Crop to specific region
        cropped = page.crop(bbox)
        text = cropped.extract_text()

        return text

# Usage - Extract header region
header_text = extract_text_from_region(
    "document.pdf",
    page_num=0,
    bbox=(0, 0, 612, 100)  # Top 100 points
)
```

## Handle Text Encoding Issues

```python
import pdfplumber

def extract_text_safe(pdf_path):
    """Extract text with proper encoding handling."""
    with pdfplumber.open(pdf_path) as pdf:
        all_text = []

        for page in pdf.pages:
            text = page.extract_text()

            if text:
                # Handle encoding issues
                text = text.encode('utf-8', errors='ignore').decode('utf-8')
                all_text.append(text)

        return "\n\n".join(all_text)
```

## Extract Text with OCR Fallback

```python
import fitz

def extract_text_with_ocr_fallback(pdf_path):
    """Try text extraction on the first page; fall back to OCR if it is empty."""
    doc = fitz.open(pdf_path)
    page = doc[0]
    text = page.get_text()
    doc.close()

    if not text.strip():
        print("No text found, using OCR...")
        from pdf2image import convert_from_path
        import pytesseract

        images = convert_from_path(pdf_path)
        text = pytesseract.image_to_string(images[0])

    return text
```

## Handle Page Rotation

```python
import fitz

def extract_text_handle_rotation(pdf_path):
    """Extract text accounting for page rotation."""
    doc = fitz.open(pdf_path)

    for page in doc:
        # Check rotation
        rotation = page.rotation

        if rotation != 0:
            # Rotate page to 0 degrees
            page.set_rotation(0)

        text = page.get_text()
        print(text)

    doc.close()
```

## Memory-Efficient Processing for Large PDFs

```python
from pypdf import PdfReader
import gc

def process_large_pdf_in_chunks(pdf_path, chunk_size=10):
    """Process large PDFs in chunks to manage memory."""
    reader = PdfReader(pdf_path)
    total_pages = len(reader.pages)

    for start in range(0, total_pages, chunk_size):
        end = min(start + chunk_size, total_pages)

        # Process chunk
        for page_num in range(start, end):
            page = reader.pages[page_num]
            text = page.extract_text()

            # Process text here
            yield page_num, text

        # Force garbage collection
        gc.collect()

# Usage
for page_num, text in process_large_pdf_in_chunks("large_file.pdf"):
    print(f"Processing page {page_num}")
```

## Preserve Document Structure

```python
import pdfplumber

def extract_structured_content(pdf_path):
    """Extract content while preserving structure."""
    with pdfplumber.open(pdf_path) as pdf:
        structured_data = []

        for page in pdf.pages:
            page_data = {
                'page_number': page.page_number,
                'text': page.extract_text(),
                'tables': page.extract_tables(),
                'images': len(page.images),
                'width': page.width,
                'height': page.height
            }

            structured_data.append(page_data)

        return structured_data
```

## Count Words in PDF

```python
import pdfplumber

def count_words_in_pdf(pdf_path):
    """Count total words in PDF."""
    total_words = 0

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text = page.extract_text()
            if text:
                words = text.split()
                total_words += len(words)

    return total_words

# Usage
word_count = count_words_in_pdf("document.pdf")
print(f"Total words: {word_count}")
```

## Error Handling Template

```python
from pypdf import PdfReader
import logging

def safe_pdf_operation(pdf_path):
    """Template for safe PDF operations with error handling."""
    try:
        reader = PdfReader(pdf_path)

        # Check if encrypted
        if reader.is_encrypted:
            logging.warning(f"PDF {pdf_path} is encrypted")
            return None

        # Perform operations
        result = []
        for page in reader.pages:
            try:
                text = page.extract_text()
                result.append(text)
            except Exception as e:
                logging.error(f"Error extracting page: {e}")
                result.append("")

        return result

    except FileNotFoundError:
        logging.error(f"File not found: {pdf_path}")
        return None
    except Exception as e:
        logging.error(f"Error processing PDF: {e}")
        return None
```

```

### references/table-extraction.md

```markdown
# Table Extraction Reference

## Basic Table Extraction (pdfplumber)

```python
import pdfplumber
import pandas as pd

def extract_tables(pdf_path):
    """Extract all tables from a PDF."""
    all_tables = []

    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, start=1):
            tables = page.extract_tables()

            for table_num, table in enumerate(tables, start=1):
                # Convert to DataFrame
                if table:
                    df = pd.DataFrame(table[1:], columns=table[0])
                    all_tables.append({
                        'page': page_num,
                        'table': table_num,
                        'data': df
                    })

    return all_tables

# Usage
tables = extract_tables("report.pdf")
for t in tables:
    print(f"Page {t['page']}, Table {t['table']}:")
    print(t['data'])
    print("\n")
```

## Advanced Table Extraction with Settings

```python
import pdfplumber

def extract_tables_advanced(pdf_path):
    """Extract tables with custom settings for better accuracy."""
    tables = []

    table_settings = {
        "vertical_strategy": "lines",
        "horizontal_strategy": "lines",
        "explicit_vertical_lines": [],
        "explicit_horizontal_lines": [],
        "snap_tolerance": 3,
        "join_tolerance": 3,
        "edge_min_length": 3,
        "min_words_vertical": 3,
        "min_words_horizontal": 1,
    }

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_tables = page.extract_tables(table_settings=table_settings)
            tables.extend(page_tables)

    return tables
```

## Find and Extract Specific Tables

```python
import pdfplumber
import pandas as pd

def find_table_with_keyword(pdf_path, keyword):
    """Find and extract tables containing a specific keyword."""
    matching_tables = []

    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, start=1):
            tables = page.extract_tables()

            for table in tables:
                # Check if keyword exists in table
                table_text = str(table).lower()
                if keyword.lower() in table_text:
                    df = pd.DataFrame(table[1:], columns=table[0])
                    matching_tables.append({
                        'page': page_num,
                        'data': df
                    })

    return matching_tables

# Usage
sales_tables = find_table_with_keyword("report.pdf", "revenue")
```

## Table Extraction with Multiple Strategies

```python
import pdfplumber

def extract_tables_robust(pdf_path):
    """Extract tables with multiple strategies."""
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[0]

        # Try different strategies
        strategies = [
            {"vertical_strategy": "lines", "horizontal_strategy": "lines"},
            {"vertical_strategy": "text", "horizontal_strategy": "text"},
            {"vertical_strategy": "lines", "horizontal_strategy": "text"}
        ]

        for strategy in strategies:
            tables = page.extract_tables(table_settings=strategy)
            if tables:
                return tables

        return []
```

## Export Tables to CSV

```python
import pdfplumber
import pandas as pd
import os

def extract_tables_to_csv(pdf_path, output_dir):
    """Extract tables and save each as CSV."""
    os.makedirs(output_dir, exist_ok=True)

    with pdfplumber.open(pdf_path) as pdf:
        table_count = 0

        for page_num, page in enumerate(pdf.pages, start=1):
            tables = page.extract_tables()

            for table_num, table in enumerate(tables, start=1):
                if table:
                    df = pd.DataFrame(table[1:], columns=table[0])
                    csv_path = os.path.join(
                        output_dir,
                        f"page{page_num}_table{table_num}.csv"
                    )
                    df.to_csv(csv_path, index=False)
                    table_count += 1

        print(f"Extracted {table_count} tables to {output_dir}")

# Usage
extract_tables_to_csv("report.pdf", "extracted_tables/")
```

## Table Extraction with Validation

```python
import pdfplumber
import pandas as pd

def extract_validated_tables(pdf_path, min_rows=2, min_cols=2):
    """Extract tables with validation criteria."""
    valid_tables = []

    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, start=1):
            tables = page.extract_tables()

            for table in tables:
                if not table:
                    continue

                # Validate table dimensions
                rows = len(table)
                cols = len(table[0]) if table else 0

                if rows >= min_rows and cols >= min_cols:
                    df = pd.DataFrame(table[1:], columns=table[0])
                    valid_tables.append({
                        'page': page_num,
                        'rows': rows,
                        'cols': cols,
                        'data': df
                    })

    return valid_tables

# Usage
tables = extract_validated_tables("report.pdf", min_rows=3, min_cols=3)
```

## Clean and Format Extracted Tables

```python
import pdfplumber
import pandas as pd

def extract_clean_tables(pdf_path):
    """Extract and clean tables for analysis."""
    clean_tables = []

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            tables = page.extract_tables()

            for table in tables:
                if not table:
                    continue

                # Convert to DataFrame
                df = pd.DataFrame(table[1:], columns=table[0])

                # Clean data
                # Remove empty columns
                df = df.dropna(axis=1, how='all')

                # Remove empty rows
                df = df.dropna(axis=0, how='all')

                # Strip whitespace
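                # Strip whitespace (column-wise map; DataFrame.applymap is deprecated in newer pandas)
                df = df.apply(lambda col: col.map(lambda x: x.strip() if isinstance(x, str) else x))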

                clean_tables.append(df)

    return clean_tables
```

```

### references/pdf-operations.md

```markdown
# PDF Operations Reference

## Forms

### Fill PDF Forms (PyMuPDF)

```python
import fitz

def fill_pdf_form(input_pdf, output_pdf, field_values):
    """
    Fill PDF form fields.

    Args:
        input_pdf: Path to input PDF with form fields
        output_pdf: Path to save filled PDF
        field_values: Dictionary of {field_name: value}
    """
    doc = fitz.open(input_pdf)

    for page_num in range(len(doc)):
        page = doc[page_num]

        for widget in page.widgets():
            if widget.field_name in field_values:
                widget.field_value = field_values[widget.field_name]
                widget.update()

    doc.save(output_pdf)
    doc.close()

# Usage
form_data = {
    "name": "John Doe",
    "email": "john.doe@example.com",  # placeholder address
    "date": "2025-10-25",
    "signature": "John Doe"
}
fill_pdf_form("form_template.pdf", "filled_form.pdf", form_data)
```

### Extract Form Field Names

```python
import fitz

def get_form_fields(pdf_path):
    """Extract all form field names and their current values."""
    doc = fitz.open(pdf_path)
    fields = []

    for page_num in range(len(doc)):
        page = doc[page_num]

        for widget in page.widgets():
            fields.append({
                'page': page_num + 1,
                'name': widget.field_name,
                'type': widget.field_type_string,
                'value': widget.field_value
            })

    doc.close()
    return fields

# Usage
fields = get_form_fields("form.pdf")
for field in fields:
    print(f"{field['name']}: {field['type']} = {field['value']}")
```
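
Because `fill_pdf_form` skips any name it does not find among the widgets, it can help to compare the keys you plan to write against the names the form actually exposes. A minimal sketch, assuming the list-of-dicts shape returned by `get_form_fields` above; `find_missing_fields` and the sample data are hypothetical, not part of PyMuPDF:

```python
def find_missing_fields(fields, field_values):
    """Return requested field names that do not exist in the form.

    fields: list of dicts with a 'name' key (the shape get_form_fields returns)
    field_values: dict of {field_name: value} intended for filling
    """
    available = {field['name'] for field in fields}
    return sorted(name for name in field_values if name not in available)


# Hand-written sample standing in for get_form_fields("form.pdf")
sample_fields = [
    {'page': 1, 'name': 'name', 'type': 'Text', 'value': ''},
    {'page': 1, 'name': 'email', 'type': 'Text', 'value': ''},
]
missing = find_missing_fields(sample_fields, {'name': 'John Doe', 'phone': '555-0100'})
print(missing)
```

An empty result means every requested name matched a widget.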

### Debug Form Fields

```python
import fitz

def debug_form_fields(pdf_path):
    """Debug helper to see all form fields."""
    doc = fitz.open(pdf_path)

    print("=== Form Fields ===")
    for page_num in range(len(doc)):
        page = doc[page_num]
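        # page.widgets() returns a generator, which is always truthy;
        # materialize it so the emptiness check below works
        widgets = list(page.widgets())

        if widgets: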
            print(f"\nPage {page_num + 1}:")
            for widget in widgets:
                print(f"  Name: {widget.field_name}")
                print(f"  Type: {widget.field_type_string}")
                print(f"  Value: {widget.field_value}")
                print(f"  Rect: {widget.rect}")
                print("  ---")

    doc.close()
```

### Flatten PDF Forms

```python
import fitz

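def flatten_pdf_form(input_pdf, output_pdf):
    """
    Lock form fields by marking them read-only.

    Note: the widgets stay in the file but can no longer be edited.
    Recent PyMuPDF versions also offer Document.bake() to convert
    widgets into permanent page content.
    """
    doc = fitz.open(input_pdf)

    for page_num in range(len(doc)):
        page = doc[page_num]

        # Get all widgets (form fields)
        for widget in page.widgets():
            # Set the read-only flag, then write the change back to the PDF
            widget.field_flags |= fitz.PDF_FIELD_IS_READ_ONLY
            widget.update()

    # Save and drop unused objects
    doc.save(output_pdf, garbage=4, deflate=True)
    doc.close()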

## Merging

### Basic Merge

```python
from pypdf import PdfWriter

def merge_pdfs(pdf_list, output_path):
    """Merge multiple PDFs into one."""
    # PdfMerger was removed in pypdf 5.0; PdfWriter.append covers the same use
    merger = PdfWriter()

    for pdf in pdf_list:
        merger.append(pdf)

    merger.write(output_path)
    merger.close()

# Usage
pdfs = ["file1.pdf", "file2.pdf", "file3.pdf"]
merge_pdfs(pdfs, "merged_output.pdf")
```

### Merge with Page Ranges

```python
from pypdf import PdfWriter

def merge_pdfs_with_ranges(pdf_configs, output_path):
    """
    Merge PDFs with specific page ranges.

    Args:
        pdf_configs: List of dicts with a 'path' key and an optional 'pages'
            key; 'pages' is a 0-indexed (start, stop) tuple, stop exclusive
        output_path: Output file path

    Example:
        configs = [
            {'path': 'doc1.pdf', 'pages': (0, 3)},  # First 3 pages
            {'path': 'doc2.pdf', 'pages': (5, 10)}, # Pages 6-10
        ]
    """
    # PdfMerger was removed in pypdf 5.0; PdfWriter.append covers the same use
    merger = PdfWriter()

    for config in pdf_configs:
        path = config['path']
        pages = config.get('pages')

        if pages:
            merger.append(path, pages=pages)
        else:
            merger.append(path)

    merger.write(output_path)
    merger.close()

# Usage
configs = [
    {'path': 'intro.pdf', 'pages': (0, 2)},
    {'path': 'content.pdf'},  # All pages
    {'path': 'appendix.pdf', 'pages': (10, 15)}
]
merge_pdfs_with_ranges(configs, "compiled.pdf")
```
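
The `pages` tuples above are 0-indexed with an exclusive stop, which is easy to get wrong when the page numbers come from a PDF viewer. A small sketch of the conversion; `to_page_range` is a hypothetical helper, not a pypdf function:

```python
def to_page_range(first_page, last_page):
    """Convert 1-indexed inclusive page numbers to a 0-indexed (start, stop) tuple."""
    if first_page < 1 or last_page < first_page:
        raise ValueError(f"invalid page range: {first_page}-{last_page}")
    # Start shifts down by one; stop is exclusive, so the last page number is kept
    return (first_page - 1, last_page)


print(to_page_range(6, 10))  # pages 6-10 as a viewer shows them
print(to_page_range(1, 3))   # first three pages
```

The result can be passed directly as the `'pages'` value in a config dict.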

### Merge with Bookmarks

```python
from pypdf import PdfWriter

def merge_with_bookmarks(pdf_list, output_path, bookmark_names=None):
    """Merge PDFs and add bookmarks for each document."""
    # PdfMerger was removed in pypdf 5.0; PdfWriter.append covers the same use
    merger = PdfWriter()

    if bookmark_names is None:
        bookmark_names = [f"Document {i+1}" for i in range(len(pdf_list))]

    for pdf, bookmark in zip(pdf_list, bookmark_names):
        merger.append(pdf, outline_item=bookmark)

    merger.write(output_path)
    merger.close()

# Usage
pdfs = ["chapter1.pdf", "chapter2.pdf", "chapter3.pdf"]
bookmarks = ["Introduction", "Methods", "Results"]
merge_with_bookmarks(pdfs, "thesis.pdf", bookmarks)
```

## Splitting

### Split into Individual Pages

```python
from pypdf import PdfReader, PdfWriter
import os

def split_pdf_pages(input_pdf, output_dir):
    """Split PDF into individual pages."""
    reader = PdfReader(input_pdf)

    os.makedirs(output_dir, exist_ok=True)

    for page_num, page in enumerate(reader.pages):
        writer = PdfWriter()
        writer.add_page(page)

        output_path = os.path.join(output_dir, f"page_{page_num + 1}.pdf")
        with open(output_path, 'wb') as output_file:
            writer.write(output_file)

    print(f"Split {len(reader.pages)} pages into {output_dir}")

# Usage
split_pdf_pages("document.pdf", "split_pages/")
```

### Split by Page Ranges

```python
from pypdf import PdfReader, PdfWriter

def split_pdf_ranges(input_pdf, ranges, output_paths):
    """
    Split PDF into multiple files by page ranges.

    Args:
        input_pdf: Input PDF path
        ranges: List of (start, end) tuples; 0-indexed, end exclusive
        output_paths: List of output file paths
    """
    reader = PdfReader(input_pdf)

    for (start, end), output_path in zip(ranges, output_paths):
        writer = PdfWriter()

        for page_num in range(start, end):
            writer.add_page(reader.pages[page_num])

        with open(output_path, 'wb') as output_file:
            writer.write(output_file)

# Usage
ranges = [(0, 5), (5, 10), (10, 15)]
outputs = ["part1.pdf", "part2.pdf", "part3.pdf"]
split_pdf_ranges("document.pdf", ranges, outputs)
```
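
The hard-coded ranges in the usage above raise an `IndexError` if the document has fewer than 15 pages. One way to avoid that is to derive the ranges from the page count. A minimal sketch; `chunk_ranges` is a hypothetical helper, and the page count would normally come from `len(PdfReader(path).pages)`:

```python
def chunk_ranges(total_pages, pages_per_file):
    """Build consecutive 0-indexed (start, end) ranges, end exclusive."""
    if pages_per_file < 1:
        raise ValueError("pages_per_file must be at least 1")
    return [
        (start, min(start + pages_per_file, total_pages))
        for start in range(0, total_pages, pages_per_file)
    ]


# A 12-page document split into files of at most 5 pages
ranges = chunk_ranges(12, 5)
outputs = [f"part{i + 1}.pdf" for i in range(len(ranges))]
print(ranges)
print(outputs)
```

The last range is clipped to the page count, so the final file may be shorter than the others.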

### Split by Size

```python
from pypdf import PdfReader, PdfWriter
import io
import os

def split_pdf_by_size(input_pdf, max_size_mb, output_dir):
    """
    Split PDF into chunks that stay under max_size_mb where possible.

    A single page larger than the limit is still written as its own chunk.
    """
    reader = PdfReader(input_pdf)

    os.makedirs(output_dir, exist_ok=True)

    def build_writer(pages):
        writer = PdfWriter()
        for page in pages:
            writer.add_page(page)
        return writer

    def size_mb(pages):
        # Serialize to memory instead of a shared temp file
        buffer = io.BytesIO()
        build_writer(pages).write(buffer)
        return buffer.tell() / (1024 * 1024)

    def write_chunk(pages, index):
        output_path = os.path.join(output_dir, f"part_{index}.pdf")
        with open(output_path, 'wb') as f:
            build_writer(pages).write(f)

    current_pages = []
    file_count = 1

    for page in reader.pages:
        # Close the current chunk before this page would push it over the limit
        if current_pages and size_mb(current_pages + [page]) > max_size_mb:
            write_chunk(current_pages, file_count)
            current_pages = []
            file_count += 1

        current_pages.append(page)

    # Write remaining pages
    if current_pages:
        write_chunk(current_pages, file_count)
```

## Page Operations

### Rotate Pages

```python
from pypdf import PdfReader, PdfWriter

def rotate_pages(input_pdf, output_pdf, rotation=90, pages=None):
    """
    Rotate specific pages in PDF.

    Args:
        rotation: Degrees to rotate (90, 180, 270)
        pages: List of page numbers (0-indexed), or None for all pages
    """
    reader = PdfReader(input_pdf)
    writer = PdfWriter()

    for page_num, page in enumerate(reader.pages):
        if pages is None or page_num in pages:
            page.rotate(rotation)
        writer.add_page(page)

    with open(output_pdf, 'wb') as output_file:
        writer.write(output_file)

# Usage
rotate_pages("document.pdf", "rotated.pdf", rotation=90, pages=[0, 2, 4])
```

### Extract Images

```python
import fitz
import os

def extract_images(pdf_path, output_dir):
    """Extract all images from PDF."""
    doc = fitz.open(pdf_path)
    os.makedirs(output_dir, exist_ok=True)

    image_count = 0

    for page_num in range(len(doc)):
        page = doc[page_num]
        images = page.get_images()

        for img_index, img in enumerate(images):
            xref = img[0]
            base_image = doc.extract_image(xref)

            image_bytes = base_image["image"]
            image_ext = base_image["ext"]

            image_filename = os.path.join(
                output_dir,
                f"page{page_num + 1}_img{img_index + 1}.{image_ext}"
            )

            with open(image_filename, "wb") as img_file:
                img_file.write(image_bytes)

            image_count += 1

    print(f"Extracted {image_count} images to {output_dir}")
    doc.close()

# Usage
extract_images("document.pdf", "extracted_images/")
```

### Convert Images to PDF

```python
from PIL import Image
from reportlab.pdfgen import canvas

def images_to_pdf(image_paths, output_pdf):
    """Convert multiple images to a single PDF."""
    c = canvas.Canvas(output_pdf)

    for img_path in image_paths:
        img = Image.open(img_path)
        width, height = img.size

        # Set page size to image size
        c.setPageSize((width, height))

        # Draw image
        c.drawImage(img_path, 0, 0, width=width, height=height)
        c.showPage()

    c.save()

# Usage
images = ["scan1.jpg", "scan2.jpg", "scan3.jpg"]
images_to_pdf(images, "scanned_document.pdf")
```

```

### references/pdf-creation.md

```markdown
# PDF Creation Reference

## Basic PDF Creation

```python
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter

def create_simple_pdf(output_path, content):
    """Create a simple PDF with text content."""
    c = canvas.Canvas(output_path, pagesize=letter)
    width, height = letter

    # Set font
    c.setFont("Helvetica", 12)

    # Add text
    y_position = height - 50
    for line in content:
        c.drawString(50, y_position, line)
        y_position -= 20

    c.save()

# Usage
content = [
    "This is the first line",
    "This is the second line",
    "This is the third line"
]
create_simple_pdf("output.pdf", content)
```
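
`create_simple_pdf` keeps decrementing `y_position`, so a long `content` list runs off the bottom of the page. A minimal sketch of the pagination arithmetic, using the same 50-point offset and 20-point line spacing; `paginate_lines` and its parameter names are hypothetical, and 792 is the letter page height in points:

```python
def paginate_lines(lines, page_height=792, top_offset=50, bottom_margin=50, line_height=20):
    """Group lines into pages of (y_position, text) pairs."""
    pages = [[]]
    y = page_height - top_offset
    for line in lines:
        # Start a new page once the next line would fall below the bottom margin
        if y < bottom_margin:
            pages.append([])
            y = page_height - top_offset
        pages[-1].append((y, line))
        y -= line_height
    return pages


pages = paginate_lines([f"Line {i}" for i in range(1, 41)])
print(len(pages), len(pages[0]), len(pages[1]))
```

To render the result, draw each `(y, text)` pair with `c.drawString(50, y, text)` and call `c.showPage()` after each page. The canvas resets its state on a new page, so call `c.setFont` again before drawing.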

## Create Styled Report

```python
from reportlab.lib.pagesizes import letter
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.lib.units import inch
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.enums import TA_CENTER, TA_JUSTIFY

def create_styled_report(output_path, title, sections):
    """
    Create a professionally styled PDF report.

    Args:
        output_path: Output file path
        title: Document title
        sections: List of dicts with 'heading' and 'content' keys
    """
    doc = SimpleDocTemplate(output_path, pagesize=letter)
    story = []
    styles = getSampleStyleSheet()

    # Custom styles
    title_style = ParagraphStyle(
        'CustomTitle',
        parent=styles['Heading1'],
        fontSize=24,
        textColor='darkblue',
        alignment=TA_CENTER,
        spaceAfter=30
    )

    # Add title
    story.append(Paragraph(title, title_style))
    story.append(Spacer(1, 0.5*inch))

    # Add sections
    for section in sections:
        # Section heading
        story.append(Paragraph(section['heading'], styles['Heading2']))
        story.append(Spacer(1, 0.2*inch))

        # Section content
        for paragraph in section['content']:
            p = Paragraph(paragraph, styles['BodyText'])
            story.append(p)
            story.append(Spacer(1, 0.1*inch))

        story.append(Spacer(1, 0.3*inch))

    doc.build(story)

# Usage
sections = [
    {
        'heading': 'Introduction',
        'content': [
            'This is the introduction paragraph.',
            'It contains important information about the topic.'
        ]
    },
    {
        'heading': 'Methods',
        'content': [
            'We used various methods to conduct this research.',
            'The methodology was carefully designed.'
        ]
    }
]
create_styled_report("report.pdf", "Research Report", sections)
```

## Create PDF with Tables

```python
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Table, TableStyle, Paragraph
from reportlab.lib import colors
from reportlab.lib.styles import getSampleStyleSheet

def create_pdf_with_table(output_path, title, table_data):
    """Create PDF with formatted table."""
    doc = SimpleDocTemplate(output_path, pagesize=letter)
    elements = []
    styles = getSampleStyleSheet()

    # Add title
    elements.append(Paragraph(title, styles['Title']))
    elements.append(Paragraph("<br/><br/>", styles['Normal']))

    # Create table
    table = Table(table_data)

    # Add style to table
    table.setStyle(TableStyle([
        ('BACKGROUND', (0, 0), (-1, 0), colors.grey),
        ('TEXTCOLOR', (0, 0), (-1, 0), colors.whitesmoke),
        ('ALIGN', (0, 0), (-1, -1), 'CENTER'),
        ('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'),
        ('FONTSIZE', (0, 0), (-1, 0), 14),
        ('BOTTOMPADDING', (0, 0), (-1, 0), 12),
        ('BACKGROUND', (0, 1), (-1, -1), colors.beige),
        ('GRID', (0, 0), (-1, -1), 1, colors.black)
    ]))

    elements.append(table)
    doc.build(elements)

# Usage
data = [
    ['Product', 'Quantity', 'Price'],
    ['Widget A', '10', '$50'],
    ['Widget B', '5', '$75'],
    ['Widget C', '20', '$30']
]
create_pdf_with_table("invoice.pdf", "Sales Invoice", data)
```

## Create PDF with Images

```python
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Image, Paragraph, Spacer
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.lib.units import inch

def create_pdf_with_images(output_path, title, image_paths, captions=None):
    """Create PDF with images and captions."""
    doc = SimpleDocTemplate(output_path, pagesize=letter)
    elements = []
    styles = getSampleStyleSheet()

    # Add title
    elements.append(Paragraph(title, styles['Title']))
    elements.append(Spacer(1, 0.5*inch))

    if captions is None:
        captions = [f"Image {i+1}" for i in range(len(image_paths))]

    for img_path, caption in zip(image_paths, captions):
        # Add image
        img = Image(img_path, width=4*inch, height=3*inch)
        elements.append(img)

        # Add caption
        elements.append(Spacer(1, 0.1*inch))
        elements.append(Paragraph(caption, styles['Italic']))
        elements.append(Spacer(1, 0.3*inch))

    doc.build(elements)

# Usage
images = ["chart1.png", "chart2.png", "chart3.png"]
captions = ["Sales Chart", "Revenue Chart", "Growth Chart"]
create_pdf_with_images("visual_report.pdf", "Analytics Report", images, captions)
```

## Create Multi-Column Layout

```python
from reportlab.lib.pagesizes import letter
from reportlab.platypus import BaseDocTemplate, Paragraph, Frame, PageTemplate
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.lib.units import inch

def create_multicolumn_pdf(output_path, content):
    """Create PDF with two columns."""
    # BaseDocTemplate is used because SimpleDocTemplate adds its own
    # single-frame page templates at build time
    doc = BaseDocTemplate(output_path, pagesize=letter)
    styles = getSampleStyleSheet()

    # Define frames for columns: equal margins, with a gutter between columns
    margin = 0.75*inch
    gutter = 0.5*inch
    frame_width = (letter[0] - 2*margin - gutter) / 2
    frame_height = letter[1] - 2*margin

    frame1 = Frame(margin, margin, frame_width, frame_height, id='col1')
    frame2 = Frame(margin + frame_width + gutter, margin, frame_width, frame_height, id='col2')

    # Create page template
    template = PageTemplate(id='TwoColumn', frames=[frame1, frame2])
    doc.addPageTemplates([template])

    # Build story
    story = [Paragraph(p, styles['Normal']) for p in content]
    doc.build(story)

# Usage
content = ["Paragraph " + str(i) for i in range(1, 21)]
create_multicolumn_pdf("columns.pdf", content)
```

## Create PDF with Custom Fonts

```python
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont

def create_pdf_custom_font(output_path, font_path, font_name, text):
    """Create PDF with custom font."""
    # Register custom font
    pdfmetrics.registerFont(TTFont(font_name, font_path))

    c = canvas.Canvas(output_path, pagesize=letter)

    # Use custom font
    c.setFont(font_name, 16)
    c.drawString(50, 750, text)

    c.save()

# Usage
create_pdf_custom_font(
    "custom_font.pdf",
    "/path/to/font.ttf",
    "CustomFont",
    "Text in custom font"
)
```

## Create Invoice Template

```python
from reportlab.lib.pagesizes import letter
from reportlab.lib import colors
from reportlab.platypus import SimpleDocTemplate, Table, TableStyle, Paragraph, Spacer
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.lib.units import inch

def create_invoice(output_path, invoice_data):
    """
    Create professional invoice.

    Args:
        invoice_data: Dict with 'company', 'client', 'date', 'items', 'total'
    """
    doc = SimpleDocTemplate(output_path, pagesize=letter)
    elements = []
    styles = getSampleStyleSheet()

    # Company header
    elements.append(Paragraph(invoice_data['company'], styles['Title']))
    elements.append(Spacer(1, 0.2*inch))

    # Invoice details
    elements.append(Paragraph(f"Invoice To: {invoice_data['client']}", styles['Normal']))
    elements.append(Paragraph(f"Date: {invoice_data['date']}", styles['Normal']))
    elements.append(Spacer(1, 0.3*inch))

    # Items table
    table_data = [['Description', 'Quantity', 'Price', 'Total']]
    table_data.extend(invoice_data['items'])

    table = Table(table_data)
    table.setStyle(TableStyle([
        ('BACKGROUND', (0, 0), (-1, 0), colors.grey),
        ('TEXTCOLOR', (0, 0), (-1, 0), colors.whitesmoke),
        ('ALIGN', (0, 0), (-1, -1), 'CENTER'),
        ('GRID', (0, 0), (-1, -1), 1, colors.black),
        ('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold')
    ]))

    elements.append(table)
    elements.append(Spacer(1, 0.3*inch))

    # Total
    elements.append(Paragraph(f"Total: ${invoice_data['total']}", styles['Heading2']))

    doc.build(elements)

# Usage
invoice = {
    'company': 'Acme Corporation',
    'client': 'John Smith',
    'date': '2025-10-25',
    'items': [
        ['Service A', '2', '$100', '$200'],
        ['Service B', '1', '$150', '$150']
    ],
    'total': 350
}
create_invoice("invoice.pdf", invoice)
```

```

### examples/invoice-generator.md

```markdown
# Invoice Generator Example

Complete example of generating professional invoices from data.

## Basic Invoice

```python
from reportlab.lib.pagesizes import letter
from reportlab.lib import colors
from reportlab.platypus import SimpleDocTemplate, Table, TableStyle, Paragraph, Spacer
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.lib.units import inch

def create_invoice(output_path, invoice_data):
    """
    Create professional invoice.

    Args:
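        invoice_data: Dict with 'company', 'address', 'invoice_number', 'date',
            'client_name', 'client_address', 'items', 'payment_terms'.
            Each item is a dict with 'description', 'quantity', 'price';
            subtotal, tax, and total are computed here.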
    """
    doc = SimpleDocTemplate(output_path, pagesize=letter)
    elements = []
    styles = getSampleStyleSheet()

    # Company header
    elements.append(Paragraph(invoice_data['company'], styles['Title']))
    elements.append(Paragraph(invoice_data['address'], styles['Normal']))
    elements.append(Spacer(1, 0.3*inch))

    # Invoice details
    elements.append(Paragraph(f"<b>Invoice #:</b> {invoice_data['invoice_number']}", styles['Normal']))
    elements.append(Paragraph(f"<b>Date:</b> {invoice_data['date']}", styles['Normal']))
    elements.append(Spacer(1, 0.2*inch))

    # Client information
    elements.append(Paragraph("<b>Bill To:</b>", styles['Heading3']))
    elements.append(Paragraph(invoice_data['client_name'], styles['Normal']))
    elements.append(Paragraph(invoice_data['client_address'], styles['Normal']))
    elements.append(Spacer(1, 0.3*inch))

    # Items table
    table_data = [['Description', 'Quantity', 'Unit Price', 'Total']]
    for item in invoice_data['items']:
        table_data.append([
            item['description'],
            str(item['quantity']),
            f"${item['price']:.2f}",
            f"${item['quantity'] * item['price']:.2f}"
        ])

    # Add subtotal, tax, total
    subtotal = sum(item['quantity'] * item['price'] for item in invoice_data['items'])
    tax = subtotal * 0.08  # 8% tax
    total = subtotal + tax

    table_data.append(['', '', 'Subtotal:', f"${subtotal:.2f}"])
    table_data.append(['', '', 'Tax (8%):', f"${tax:.2f}"])
    table_data.append(['', '', 'Total:', f"${total:.2f}"])

    table = Table(table_data)
    table.setStyle(TableStyle([
        ('BACKGROUND', (0, 0), (-1, 0), colors.grey),
        ('TEXTCOLOR', (0, 0), (-1, 0), colors.whitesmoke),
        ('ALIGN', (0, 0), (-1, -1), 'CENTER'),
        ('ALIGN', (0, 1), (0, -1), 'LEFT'),
        ('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'),
        ('FONTSIZE', (0, 0), (-1, 0), 12),
        ('BOTTOMPADDING', (0, 0), (-1, 0), 12),
        ('BACKGROUND', (0, -3), (-1, -1), colors.beige),
        ('GRID', (0, 0), (-1, -4), 1, colors.black),
        ('LINEABOVE', (2, -3), (-1, -3), 2, colors.black),
        ('LINEABOVE', (2, -1), (-1, -1), 2, colors.black),
        ('FONTNAME', (2, -1), (-1, -1), 'Helvetica-Bold')
    ]))

    elements.append(table)
    elements.append(Spacer(1, 0.5*inch))

    # Payment terms
    elements.append(Paragraph("<b>Payment Terms:</b>", styles['Heading3']))
    elements.append(Paragraph(invoice_data['payment_terms'], styles['Normal']))

    # Build PDF
    doc.build(elements)

# Usage example
invoice_data = {
    'company': 'Acme Corporation',
    'address': '123 Business St, City, State 12345',
    'invoice_number': 'INV-2025-001',
    'date': '2025-10-25',
    'client_name': 'John Smith',
    'client_address': '456 Client Ave, Town, State 67890',
    'items': [
        {'description': 'Web Design Services', 'quantity': 10, 'price': 150.00},
        {'description': 'Logo Design', 'quantity': 1, 'price': 500.00},
        {'description': 'Hosting (Annual)', 'quantity': 1, 'price': 200.00}
    ],
    'payment_terms': 'Net 30 days. Payment due within 30 days of invoice date.'
}

create_invoice("invoice_example.pdf", invoice_data)
print("Invoice created: invoice_example.pdf")
```

## Enhanced Invoice with Logo

```python
from reportlab.lib.pagesizes import letter
from reportlab.lib import colors
from reportlab.platypus import SimpleDocTemplate, Table, TableStyle, Paragraph, Spacer, Image
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.lib.units import inch

def create_invoice_with_logo(output_path, invoice_data, logo_path=None):
    """Create invoice with company logo."""
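    doc = SimpleDocTemplate(output_path, pagesize=letter,
                           topMargin=0.5*inch, bottomMargin=0.5*inch,
                           # 0.75" side margins leave room for the 7" wide tables below
                           leftMargin=0.75*inch, rightMargin=0.75*inch)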
    elements = []
    styles = getSampleStyleSheet()

    # Add logo if provided
    if logo_path:
        logo = Image(logo_path, width=2*inch, height=1*inch)
        elements.append(logo)
        elements.append(Spacer(1, 0.2*inch))

    # Company header
    elements.append(Paragraph(invoice_data['company'], styles['Title']))
    elements.append(Paragraph(invoice_data['address'], styles['Normal']))
    elements.append(Paragraph(f"Phone: {invoice_data['phone']} | Email: {invoice_data['email']}", styles['Normal']))
    elements.append(Spacer(1, 0.5*inch))

    # Two-column layout for invoice details and client info
    details_table = Table([
        ['Invoice Number:', invoice_data['invoice_number'], 'Bill To:', invoice_data['client_name']],
        ['Date:', invoice_data['date'], 'Address:', invoice_data['client_address']],
        ['Due Date:', invoice_data['due_date'], 'Contact:', invoice_data['client_contact']]
    ], colWidths=[1.5*inch, 2*inch, 1*inch, 2.5*inch])

    details_table.setStyle(TableStyle([
        ('ALIGN', (0, 0), (1, -1), 'LEFT'),
        ('ALIGN', (2, 0), (3, -1), 'LEFT'),
        ('FONTNAME', (0, 0), (0, -1), 'Helvetica-Bold'),
        ('FONTNAME', (2, 0), (2, -1), 'Helvetica-Bold'),
        ('VALIGN', (0, 0), (-1, -1), 'TOP')
    ]))

    elements.append(details_table)
    elements.append(Spacer(1, 0.5*inch))

    # Items table with enhanced styling
    table_data = [['Item', 'Description', 'Qty', 'Rate', 'Amount']]
    for i, item in enumerate(invoice_data['items'], 1):
        table_data.append([
            str(i),
            item['description'],
            str(item['quantity']),
            f"${item['price']:.2f}",
            f"${item['quantity'] * item['price']:.2f}"
        ])

    # Calculations
    subtotal = sum(item['quantity'] * item['price'] for item in invoice_data['items'])
    discount = subtotal * (invoice_data.get('discount_percent', 0) / 100)
    tax = (subtotal - discount) * (invoice_data.get('tax_rate', 0.08))
    total = subtotal - discount + tax

    # Add financial summary rows (the discount row is optional)
    summary_rows = [['', '', '', 'Subtotal:', f"${subtotal:.2f}"]]
    if discount > 0:
        summary_rows.append(['', '', '', f"Discount ({invoice_data['discount_percent']}%):", f"-${discount:.2f}"])
    summary_rows.append(['', '', '', f"Tax ({invoice_data.get('tax_rate', 0.08)*100:.0f}%):", f"${tax:.2f}"])
    summary_rows.append(['', '', '', 'TOTAL:', f"${total:.2f}"])
    table_data.extend(summary_rows)

    # Negative row indices for styling, derived from the actual summary length
    # (3 rows without a discount, 4 with one)
    first_summary = -len(summary_rows)
    last_item = first_summary - 1

    table = Table(table_data, colWidths=[0.5*inch, 3*inch, 0.75*inch, 1.5*inch, 1.25*inch])
    table.setStyle(TableStyle([
        # Header styling
        ('BACKGROUND', (0, 0), (-1, 0), colors.HexColor('#4472C4')),
        ('TEXTCOLOR', (0, 0), (-1, 0), colors.whitesmoke),
        ('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'),
        ('FONTSIZE', (0, 0), (-1, 0), 11),
        ('BOTTOMPADDING', (0, 0), (-1, 0), 12),
        ('TOPPADDING', (0, 0), (-1, 0), 12),

        # Data rows
        ('ALIGN', (0, 0), (0, -1), 'CENTER'),
        ('ALIGN', (2, 1), (2, -1), 'CENTER'),
        ('ALIGN', (3, 1), (-1, -1), 'RIGHT'),
        ('FONTSIZE', (0, 1), (-1, -1), 10),
        ('ROWBACKGROUNDS', (0, 1), (-1, last_item), [colors.white, colors.HexColor('#E7E6E6')]),

        # Summary rows styling
        ('BACKGROUND', (0, first_summary), (-1, -1), colors.HexColor('#D9E2F3')),
        ('FONTNAME', (3, -1), (-1, -1), 'Helvetica-Bold'),
        ('FONTSIZE', (3, -1), (-1, -1), 12),
        ('LINEABOVE', (3, first_summary), (-1, first_summary), 1, colors.black),
        ('LINEABOVE', (3, -1), (-1, -1), 2, colors.black),

        # Grid
        ('GRID', (0, 0), (-1, last_item), 0.5, colors.grey),
    ]))

    elements.append(table)
    elements.append(Spacer(1, 0.5*inch))

    # Footer sections
    footer_data = [
        ['<b>Payment Terms:</b>', '<b>Notes:</b>'],
        [invoice_data.get('payment_terms', 'Net 30'), invoice_data.get('notes', 'Thank you for your business!')]
    ]

    footer_table = Table(footer_data, colWidths=[3.5*inch, 3.5*inch])
    footer_table.setStyle(TableStyle([
        ('VALIGN', (0, 0), (-1, -1), 'TOP'),
        ('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold')
    ]))

    elements.append(footer_table)

    # Build PDF
    doc.build(elements)

# Usage
invoice_data = {
    'company': 'Professional Services Inc.',
    'address': '789 Corporate Blvd, Suite 100\nBusiness City, ST 12345',
    'phone': '(555) 123-4567',
    'email': 'billing@example.com',
    'invoice_number': 'INV-2025-042',
    'date': '2025-10-25',
    'due_date': '2025-11-24',
    'client_name': 'ABC Corporation',
    'client_address': '321 Client Street\nClient Town, ST 67890',
    'client_contact': 'accounts@example.com',
    'items': [
        {'description': 'Consulting Services - October 2025', 'quantity': 40, 'price': 175.00},
        {'description': 'Software License (Annual)', 'quantity': 5, 'price': 299.00},
        {'description': 'Training Session', 'quantity': 2, 'price': 500.00}
    ],
    'discount_percent': 10,
    'tax_rate': 0.08,
    'payment_terms': 'Payment due within 30 days\nAccepted methods: Check, ACH, Credit Card',
    'notes': 'Thank you for your continued business.\nPlease reference invoice number on payment.'
}

create_invoice_with_logo("professional_invoice.pdf", invoice_data)
print("Professional invoice created: professional_invoice.pdf")
```
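
The calculation above uses floats, which can drift by a cent on some inputs. A minimal sketch of the same order of operations (discount on the subtotal, then tax on the discounted amount) using `Decimal`; `invoice_totals` is a hypothetical helper, and half-up rounding to the cent is an assumption rather than a rule taken from the original:

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal('0.01')

def invoice_totals(items, discount_percent=0, tax_rate='0.08'):
    """Compute subtotal, discount, tax, and total as Decimals rounded to cents."""
    subtotal = sum(
        (Decimal(str(item['quantity'])) * Decimal(str(item['price'])) for item in items),
        Decimal('0'),
    )
    # Discount applies to the subtotal; tax applies to the discounted amount
    discount = (subtotal * Decimal(str(discount_percent)) / 100).quantize(CENT, rounding=ROUND_HALF_UP)
    tax = ((subtotal - discount) * Decimal(str(tax_rate))).quantize(CENT, rounding=ROUND_HALF_UP)
    total = subtotal - discount + tax
    return {
        'subtotal': subtotal.quantize(CENT),
        'discount': discount,
        'tax': tax,
        'total': total.quantize(CENT),
    }


# Same line items as the usage example above
items = [
    {'description': 'Consulting Services - October 2025', 'quantity': 40, 'price': 175.00},
    {'description': 'Software License (Annual)', 'quantity': 5, 'price': 299.00},
    {'description': 'Training Session', 'quantity': 2, 'price': 500.00},
]
totals = invoice_totals(items, discount_percent=10, tax_rate='0.08')
print(totals)
```

The returned values format the same way as the floats above, for example `f"${totals['total']:.2f}"`.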

## Batch Invoice Generation

```python
import ast
import pandas as pd
from pathlib import Path

def generate_invoices_from_csv(csv_path, output_dir, template_function):
    """Generate multiple invoices from CSV data."""
    # Read invoice data
    df = pd.read_csv(csv_path)

    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    for _, row in df.iterrows():
        invoice_data = {
            'company': row['company'],
            'address': row['company_address'],
            'invoice_number': row['invoice_number'],
            'date': row['date'],
            'client_name': row['client_name'],
            'client_address': row['client_address'],
            # Parse the items cell as a Python literal; unlike eval, this cannot run code
            'items': ast.literal_eval(row['items']),
            'payment_terms': row['payment_terms']
        }

        output_file = output_path / f"{row['invoice_number']}.pdf"
        template_function(str(output_file), invoice_data)
        print(f"Generated: {output_file}")

# Usage
# generate_invoices_from_csv('invoices.csv', 'generated_invoices/', create_invoice)
```
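
The `items` column holds a whole list in a single CSV cell, so it is worth validating before it reaches the template function. A minimal sketch, assuming the cell is stored as a JSON array of objects; the original does not specify the CSV layout, and `parse_items` and `REQUIRED_KEYS` are hypothetical names:

```python
import json

# Keys the invoice templates above read from each item
REQUIRED_KEYS = {'description', 'quantity', 'price'}

def parse_items(cell):
    """Parse a JSON-encoded items cell into a list of item dicts."""
    items = json.loads(cell)
    if not isinstance(items, list):
        raise ValueError("items must be a JSON array")
    for item in items:
        if not isinstance(item, dict):
            raise ValueError("each item must be a JSON object")
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            raise ValueError(f"item is missing keys: {sorted(missing)}")
    return items


cell = '[{"description": "Logo Design", "quantity": 1, "price": 500.0}]'
items = parse_items(cell)
print(items[0]['description'], items[0]['quantity'] * items[0]['price'])
```

Malformed JSON raises `json.JSONDecodeError`, which is a subclass of `ValueError`, so a single `except ValueError` around the call covers both failure modes.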

```

### examples/report-automation.md

```markdown
# Report Automation Example

Complete example of automated report generation with data visualization.

## Monthly Report Generator

```python
from reportlab.lib.pagesizes import letter, A4
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.lib.units import inch
from reportlab.platypus import (SimpleDocTemplate, Paragraph, Spacer, PageBreak,
                                Table, TableStyle, Image)
from reportlab.lib import colors
from reportlab.lib.enums import TA_CENTER, TA_RIGHT
import pandas as pd
from datetime import datetime

def create_monthly_report(output_path, report_data):
    """
    Generate comprehensive monthly report.

    Args:
        report_data: Dict with sections, metrics, charts
    """
    doc = SimpleDocTemplate(
        output_path,
        pagesize=letter,
        topMargin=0.75*inch,
        bottomMargin=0.75*inch,
        leftMargin=0.75*inch,
        rightMargin=0.75*inch
    )

    story = []
    styles = getSampleStyleSheet()

    # Custom styles
    title_style = ParagraphStyle(
        'ReportTitle',
        parent=styles['Title'],
        fontSize=28,
        textColor=colors.HexColor('#1F4788'),
        spaceAfter=30,
        alignment=TA_CENTER
    )

    subtitle_style = ParagraphStyle(
        'Subtitle',
        parent=styles['Normal'],
        fontSize=14,
        textColor=colors.HexColor('#666666'),
        alignment=TA_CENTER,
        spaceAfter=20
    )

    # Title page
    story.append(Spacer(1, 2*inch))
    story.append(Paragraph(report_data['title'], title_style))
    story.append(Paragraph(report_data['subtitle'], subtitle_style))
    story.append(Spacer(1, 0.5*inch))
    story.append(Paragraph(f"Report Period: {report_data['period']}", subtitle_style))
    story.append(Paragraph(f"Generated: {datetime.now().strftime('%B %d, %Y')}", subtitle_style))
    story.append(PageBreak())

    # Executive Summary
    story.append(Paragraph("Executive Summary", styles['Heading1']))
    story.append(Spacer(1, 0.2*inch))

    for paragraph in report_data['executive_summary']:
        story.append(Paragraph(paragraph, styles['BodyText']))
        story.append(Spacer(1, 0.1*inch))

    story.append(Spacer(1, 0.3*inch))

    # Key Metrics Table
    story.append(Paragraph("Key Performance Indicators", styles['Heading2']))
    story.append(Spacer(1, 0.2*inch))

    metrics_data = [['Metric', 'Current', 'Previous', 'Change']]
    for metric in report_data['metrics']:
        change = metric['current'] - metric['previous']
        change_pct = (change / metric['previous'] * 100) if metric['previous'] != 0 else 0
        change_str = f"{change:+.1f} ({change_pct:+.1f}%)"

        metrics_data.append([
            metric['name'],
            f"{metric['current']:.1f}",
            f"{metric['previous']:.1f}",
            change_str
        ])

    metrics_table = Table(metrics_data, colWidths=[3*inch, 1.25*inch, 1.25*inch, 1.5*inch])
    metrics_table.setStyle(TableStyle([
        ('BACKGROUND', (0, 0), (-1, 0), colors.HexColor('#4472C4')),
        ('TEXTCOLOR', (0, 0), (-1, 0), colors.whitesmoke),
        ('ALIGN', (0, 0), (-1, -1), 'CENTER'),
        ('ALIGN', (0, 1), (0, -1), 'LEFT'),
        ('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'),
        ('FONTSIZE', (0, 0), (-1, 0), 12),
        ('BOTTOMPADDING', (0, 0), (-1, 0), 12),
        ('GRID', (0, 0), (-1, -1), 1, colors.grey),
        ('ROWBACKGROUNDS', (0, 1), (-1, -1), [colors.white, colors.HexColor('#E7E6E6')])
    ]))

    story.append(metrics_table)
    story.append(Spacer(1, 0.4*inch))

    # Detailed Sections
    for section in report_data['sections']:
        story.append(Paragraph(section['title'], styles['Heading2']))
        story.append(Spacer(1, 0.2*inch))

        for paragraph in section['content']:
            story.append(Paragraph(paragraph, styles['BodyText']))
            story.append(Spacer(1, 0.1*inch))

        # Add chart if provided
        if 'chart' in section:
            story.append(Spacer(1, 0.2*inch))
            img = Image(section['chart'], width=5*inch, height=3*inch)
            story.append(img)
            story.append(Paragraph(f"<i>{section.get('chart_caption', '')}</i>",
                                 styles['Normal']))

        story.append(Spacer(1, 0.4*inch))

    # Build PDF
    doc.build(story)

# Usage
report_data = {
    'title': 'Q4 2025 Business Performance Report',
    'subtitle': 'Quarterly Analysis and Insights',
    'period': 'October - December 2025',
    'executive_summary': [
        'This quarter showed strong growth across all key metrics, with revenue '
        'increasing 15% quarter-over-quarter and customer acquisition exceeding targets by 25%.',
        'Operational efficiency improvements contributed to a 5% reduction in costs, '
        'while maintaining high customer satisfaction scores.',
        'Looking ahead, we anticipate continued growth driven by new product launches '
        'and expanded market presence.'
    ],
    'metrics': [
        {'name': 'Revenue ($M)', 'current': 45.2, 'previous': 39.3},
        {'name': 'New Customers', 'current': 1250, 'previous': 980},
        {'name': 'Customer Retention (%)', 'current': 94.5, 'previous': 92.1},
        {'name': 'Avg. Deal Size ($K)', 'current': 36.2, 'previous': 34.8}
    ],
    'sections': [
        {
            'title': 'Revenue Analysis',
            'content': [
                'Q4 revenue reached $45.2M, representing a 15% increase over Q3 and '
                'exceeding our quarterly target by 8%.',
                'Growth was driven primarily by enterprise sales, which increased 25% '
                'quarter-over-quarter. SMB segment showed steady 10% growth.',
                'Recurring revenue now accounts for 75% of total revenue, up from '
                '68% in the previous quarter, indicating strong business model health.'
            ]
        },
        {
            'title': 'Customer Acquisition',
            'content': [
                'We acquired 1,250 new customers this quarter, surpassing our target '
                'of 1,000 by 25%.',
                'Cost per acquisition decreased 12% due to improved marketing efficiency '
                'and increased word-of-mouth referrals.',
                'Customer onboarding time reduced from 14 days to 9 days through '
                'process improvements and automation.'
            ]
        },
        {
            'title': 'Operational Efficiency',
            'content': [
                'Operational costs decreased 5% while maintaining service quality, '
                'driven by automation initiatives and process optimization.',
                'Team productivity increased 18% measured by output per employee, '
                'attributed to new tools and training programs.',
                'Customer support response time improved by 30%, with average '
                'first-response time now under 2 hours.'
            ]
        }
    ]
}

create_monthly_report("q4_2025_report.pdf", report_data)
print("Report generated: q4_2025_report.pdf")
```

## Data-Driven Report with Pandas Integration

```python
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, Table, TableStyle
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.lib.pagesizes import letter
from reportlab.lib.units import inch
from reportlab.lib import colors
import pandas as pd

def create_data_report(output_path, df, title, analysis):
    """
    Create report from pandas DataFrame.

    Args:
        df: pandas DataFrame with data
        title: Report title
        analysis: Dict with analysis sections
    """
    doc = SimpleDocTemplate(output_path, pagesize=letter)
    story = []
    styles = getSampleStyleSheet()

    # Title
    story.append(Paragraph(title, styles['Title']))
    story.append(Spacer(1, 0.3*inch))

    # Summary statistics
    story.append(Paragraph("Summary Statistics", styles['Heading2']))
    story.append(Spacer(1, 0.2*inch))

    # Convert DataFrame describe() to table
    summary = df.describe()
    table_data = [[''] + list(summary.columns)]
    for idx in summary.index:
        row = [idx] + [f"{val:.2f}" for val in summary.loc[idx]]
        table_data.append(row)

    summary_table = Table(table_data)
    summary_table.setStyle(TableStyle([
        ('BACKGROUND', (0, 0), (-1, 0), colors.grey),
        ('TEXTCOLOR', (0, 0), (-1, 0), colors.whitesmoke),
        ('ALIGN', (0, 0), (-1, -1), 'CENTER'),
        ('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'),
        ('GRID', (0, 0), (-1, -1), 1, colors.black),
        ('ROWBACKGROUNDS', (0, 1), (-1, -1), [colors.white, colors.lightgrey])
    ]))

    story.append(summary_table)
    story.append(Spacer(1, 0.4*inch))

    # Detailed data table (first 20 rows)
    story.append(Paragraph("Detailed Data (First 20 Rows)", styles['Heading2']))
    story.append(Spacer(1, 0.2*inch))

    data_subset = df.head(20)
    table_data = [list(data_subset.columns)]
    for _, row in data_subset.iterrows():
        table_data.append([str(val) for val in row])

    data_table = Table(table_data)
    data_table.setStyle(TableStyle([
        ('BACKGROUND', (0, 0), (-1, 0), colors.grey),
        ('TEXTCOLOR', (0, 0), (-1, 0), colors.whitesmoke),
        ('ALIGN', (0, 0), (-1, -1), 'CENTER'),
        ('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'),
        ('FONTSIZE', (0, 0), (-1, -1), 8),
        ('GRID', (0, 0), (-1, -1), 0.5, colors.grey)
    ]))

    story.append(data_table)
    story.append(Spacer(1, 0.4*inch))

    # Analysis sections
    for section in analysis:
        story.append(Paragraph(section['title'], styles['Heading2']))
        story.append(Spacer(1, 0.2*inch))

        for paragraph in section['content']:
            story.append(Paragraph(paragraph, styles['BodyText']))
            story.append(Spacer(1, 0.1*inch))

        story.append(Spacer(1, 0.3*inch))

    doc.build(story)

# Usage
# Create sample data
data = {
    'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'],
    'Revenue': [120000, 135000, 142000, 156000, 168000, 175000],
    'Expenses': [85000, 88000, 92000, 95000, 98000, 101000],
    'Customers': [450, 485, 510, 548, 580, 612]
}
df = pd.DataFrame(data)

analysis = [
    {
        'title': 'Revenue Trend',
        'content': [
            'Revenue showed consistent growth over the 6-month period, increasing '
            'from $120K in January to $175K in June, representing a 46% growth.',
            'Average month-over-month growth rate was 7.8%, indicating strong business momentum.'
        ]
    },
    {
        'title': 'Profitability',
        'content': [
            'Profit margins improved from 29% in January to 42% in June as revenue '
            'growth outpaced expense growth.',
            'Expense control remained effective with only 19% increase over the period '
            'while revenue grew 46%.'
        ]
    }
]

create_data_report("data_analysis_report.pdf", df, "6-Month Business Analysis", analysis)
print("Data report generated: data_analysis_report.pdf")
```

## Automated Weekly Report

```python
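from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, Table, TableStyle
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.lib.pagesizes import letter
from reportlab.lib.units import inch
from reportlab.lib import colors

def create_weekly_report(week_number, data):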
    """Generate automated weekly report."""
    output_path = f"weekly_report_week{week_number}.pdf"

    doc = SimpleDocTemplate(output_path, pagesize=letter)
    story = []
    styles = getSampleStyleSheet()

    # Title
    story.append(Paragraph(f"Weekly Report - Week {week_number}", styles['Title']))
    story.append(Paragraph(data['date_range'], styles['Normal']))
    story.append(Spacer(1, 0.5*inch))

    # Highlights
    story.append(Paragraph("Week Highlights", styles['Heading2']))
    story.append(Spacer(1, 0.2*inch))

    highlights_data = [['Category', 'Metric', 'Value', 'vs Last Week']]
    for item in data['highlights']:
        highlights_data.append([
            item['category'],
            item['metric'],
            str(item['value']),
            f"{item['change']:+.1f}%"
        ])

    table = Table(highlights_data)
    table.setStyle(TableStyle([
        ('BACKGROUND', (0, 0), (-1, 0), colors.HexColor('#4472C4')),
        ('TEXTCOLOR', (0, 0), (-1, 0), colors.whitesmoke),
        ('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'),
        ('ALIGN', (0, 0), (-1, -1), 'CENTER'),
        ('GRID', (0, 0), (-1, -1), 1, colors.grey)
    ]))

    story.append(table)
    story.append(Spacer(1, 0.4*inch))

    # Action Items
    story.append(Paragraph("Action Items", styles['Heading2']))
    story.append(Spacer(1, 0.2*inch))

    for item in data['action_items']:
        story.append(Paragraph(f"• {item}", styles['BodyText']))
        story.append(Spacer(1, 0.1*inch))

    doc.build(story)
    return output_path

# Usage
week_data = {
    'date_range': 'October 21-27, 2025',
    'highlights': [
        {'category': 'Sales', 'metric': 'New Deals', 'value': 12, 'change': 20},
        {'category': 'Marketing', 'metric': 'Leads', 'value': 145, 'change': -5},
        {'category': 'Support', 'metric': 'Tickets Closed', 'value': 89, 'change': 15}
    ],
    'action_items': [
        'Follow up with 3 high-value prospects from this week',
        'Review and optimize underperforming marketing campaigns',
        'Schedule training session for new support tool'
    ]
}

report_path = create_weekly_report(43, week_data)
print(f"Weekly report generated: {report_path}")
```
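
The week number and `date_range` string above are typed by hand. A minimal sketch of deriving the label from a year and week number, assuming ISO 8601 weeks (Monday to Sunday), which may differ from the reporting week used in the sample data. The helper name `iso_week_range` is hypothetical:

```python
from datetime import date, timedelta

def iso_week_range(year, week):
    """Return (start, end, label) for an ISO week."""
    start = date.fromisocalendar(year, week, 1)  # Monday of that ISO week
    end = start + timedelta(days=6)              # Sunday of the same week
    if start.month == end.month:
        label = f"{start:%B} {start.day}-{end.day}, {end.year}"
    else:
        label = f"{start:%B} {start.day} - {end:%B} {end.day}, {end.year}"
    return start, end, label

start, end, label = iso_week_range(2025, 43)
print(label)
```

The label could then be passed as `data['date_range']` to `create_weekly_report`.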

```

### references/metadata-security-ocr.md

```markdown
# Metadata, Security, and OCR Reference

## Metadata Management

### Extract Metadata

```python
from pypdf import PdfReader

def extract_metadata(pdf_path):
    """Extract PDF metadata."""
    reader = PdfReader(pdf_path)
    # reader.metadata is None when the PDF has no document information dictionary
    metadata = reader.metadata or {}

    info = {
        'title': metadata.get('/Title', ''),
        'author': metadata.get('/Author', ''),
        'subject': metadata.get('/Subject', ''),
        'creator': metadata.get('/Creator', ''),
        'producer': metadata.get('/Producer', ''),
        'creation_date': metadata.get('/CreationDate', ''),
        'modification_date': metadata.get('/ModDate', ''),
        'pages': len(reader.pages)
    }

    return info

# Usage
metadata = extract_metadata("document.pdf")
for key, value in metadata.items():
    print(f"{key}: {value}")
```
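
The `/CreationDate` and `/ModDate` values come back as raw PDF date strings, typically of the form `D:YYYYMMDDHHmmSS` followed by an optional UTC offset such as `+01'00'` or `Z`. A minimal standard-library sketch for turning them into `datetime` objects. `parse_pdf_date` is a hypothetical helper, and it returns `None` for strings that do not follow this form:

```python
import re
from datetime import datetime, timedelta, timezone

_PDF_DATE = re.compile(
    r"^D:(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?P<tz>[Zz]|[+-]\d{2}'?\d{2}'?)?"
)

def parse_pdf_date(value):
    """Parse a PDF date string; return a datetime, or None if it cannot be parsed."""
    match = _PDF_DATE.match(value or "")
    if not match:
        return None
    parts = match.groupdict()

    # Translate the optional offset into a tzinfo object
    tzinfo = None
    tz = parts["tz"]
    if tz:
        if tz in ("Z", "z"):
            tzinfo = timezone.utc
        else:
            sign = 1 if tz[0] == "+" else -1
            digits = tz[1:].replace("'", "")
            offset = timedelta(hours=int(digits[:2]), minutes=int(digits[2:4]))
            tzinfo = timezone(sign * offset)

    try:
        # Missing fields default to the start of the period
        return datetime(
            int(parts["year"]),
            int(parts["month"] or 1),
            int(parts["day"] or 1),
            int(parts["hour"] or 0),
            int(parts["minute"] or 0),
            int(parts["second"] or 0),
            tzinfo=tzinfo,
        )
    except ValueError:
        return None  # e.g. month 13 or an out-of-range offset

print(parse_pdf_date("D:20250114093000+01'00'"))
```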

### Modify Metadata

```python
from pypdf import PdfReader, PdfWriter

def modify_metadata(input_pdf, output_pdf, metadata):
    """
    Modify PDF metadata.

    Args:
        metadata: Dict with keys like '/Title', '/Author', '/Subject', etc.
    """
    reader = PdfReader(input_pdf)
    writer = PdfWriter()

    # Copy all pages
    for page in reader.pages:
        writer.add_page(page)

    # Update metadata
    writer.add_metadata(metadata)

    with open(output_pdf, 'wb') as output_file:
        writer.write(output_file)

# Usage
new_metadata = {
    '/Title': 'Updated Title',
    '/Author': 'John Doe',
    '/Subject': 'Research Paper',
    '/Keywords': 'PDF, Python, Automation'
}
modify_metadata("document.pdf", "updated.pdf", new_metadata)
```

### Extract Comprehensive PDF Information

```python
import fitz

def get_pdf_info(pdf_path):
    """Get comprehensive PDF information."""
    doc = fitz.open(pdf_path)

    info = {
        'metadata': doc.metadata,
        'page_count': doc.page_count,
        'is_encrypted': doc.is_encrypted,
        'is_pdf': doc.is_pdf,
        'page_sizes': []
    }

    # Get page sizes
    for page_num in range(doc.page_count):
        page = doc[page_num]
        info['page_sizes'].append({
            'page': page_num + 1,
            'width': page.rect.width,
            'height': page.rect.height
        })

    doc.close()
    return info

# Usage
info = get_pdf_info("document.pdf")
print(f"Pages: {info['page_count']}")
print(f"Title: {info['metadata'].get('title', 'N/A')}")
print(f"Encrypted: {info['is_encrypted']}")
```

## Security and Encryption

### Add Password Protection

```python
from pypdf import PdfReader, PdfWriter

def encrypt_pdf(input_pdf, output_pdf, user_password, owner_password=None):
    """
    Add password protection to PDF.

    Args:
        user_password: Password to open the document
        owner_password: Password for full permissions (optional)
    """
    reader = PdfReader(input_pdf)
    writer = PdfWriter()

    # Copy all pages
    for page in reader.pages:
        writer.add_page(page)

    # Encrypt with password
    if owner_password is None:
        owner_password = user_password

    writer.encrypt(
        user_password=user_password,
        owner_password=owner_password,
        algorithm="AES-256"
    )

    with open(output_pdf, 'wb') as output_file:
        writer.write(output_file)

# Usage
encrypt_pdf("document.pdf", "encrypted.pdf", user_password="user123", owner_password="owner456")
```
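
The usage line above uses short literal passwords for illustration. A small sketch, using only the standard library's `secrets` module, of generating random passwords to pass to `encrypt_pdf`. The helper name `generate_pdf_passwords` is hypothetical, and storing the passwords safely is left to the caller:

```python
import secrets

def generate_pdf_passwords(n_bytes=16):
    """Return a (user_password, owner_password) pair of independent random strings."""
    return secrets.token_urlsafe(n_bytes), secrets.token_urlsafe(n_bytes)

user_pw, owner_pw = generate_pdf_passwords()
# encrypt_pdf("document.pdf", "encrypted.pdf", user_password=user_pw, owner_password=owner_pw)
```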

### Decrypt PDF

```python
from pypdf import PdfReader, PdfWriter

def decrypt_pdf(input_pdf, output_pdf, password):
    """Remove password protection from PDF."""
    reader = PdfReader(input_pdf)

    # Decrypt
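    if reader.is_encrypted:
        # decrypt() returns a falsy value (0) when the password is wrong
        if not reader.decrypt(password):
            raise ValueError("Incorrect password")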

    writer = PdfWriter()

    # Copy all pages
    for page in reader.pages:
        writer.add_page(page)

    with open(output_pdf, 'wb') as output_file:
        writer.write(output_file)

# Usage
decrypt_pdf("encrypted.pdf", "decrypted.pdf", password="user123")
```

### Handle Encrypted PDFs

```python
from pypdf import PdfReader

def handle_encrypted_pdf(pdf_path, password=None):
    """Safely handle encrypted PDFs."""
    reader = PdfReader(pdf_path)

    if reader.is_encrypted:
        if password:
            success = reader.decrypt(password)
            if success == 0:
                print("Incorrect password")
                return None
        else:
            print("PDF is encrypted, password required")
            return None

    # Now safe to process
    return reader
```

### Set Permissions

```python
from pypdf import PdfWriter, PdfReader

def set_pdf_permissions(input_pdf, output_pdf, password, allow_printing=True, allow_copying=False):
    """Set specific permissions on PDF."""
    reader = PdfReader(input_pdf)
    writer = PdfWriter()

    for page in reader.pages:
        writer.add_page(page)

    # Set permissions
    writer.encrypt(
        user_password=password,
        owner_password=password + "_owner",
        permissions_flag=(
            (0b100 if allow_printing else 0) |
            (0b10000 if allow_copying else 0)
        )
    )

    with open(output_pdf, 'wb') as f:
        writer.write(f)
```
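
The literals `0b100` and `0b10000` above correspond to permission bits 3 (print) and 5 (copy or extract content) of the PDF standard security handler. A sketch that names the commonly used bits with a standard-library `IntFlag`. The class and function names are hypothetical, and how a given pypdf version treats the remaining bits is not something this sketch verifies:

```python
from enum import IntFlag

class PdfPermission(IntFlag):
    """User access permission bits (bit n has value 2 ** (n - 1))."""
    PRINT = 1 << 2                # bit 3
    MODIFY = 1 << 3               # bit 4
    COPY = 1 << 4                 # bit 5
    ANNOTATE = 1 << 5             # bit 6
    FILL_FORMS = 1 << 8           # bit 9
    EXTRACT_ACCESSIBLE = 1 << 9   # bit 10
    ASSEMBLE = 1 << 10            # bit 11
    PRINT_HIGH_QUALITY = 1 << 11  # bit 12

def build_permissions(allow_printing=True, allow_copying=False):
    """Combine the requested permissions into a single flag value."""
    flags = PdfPermission(0)
    if allow_printing:
        flags |= PdfPermission.PRINT
    if allow_copying:
        flags |= PdfPermission.COPY
    return flags

print(bin(build_permissions(allow_printing=True, allow_copying=True)))
```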

## OCR for Scanned Documents

### Basic OCR

```python
from pdf2image import convert_from_path
import pytesseract
from PIL import Image

def ocr_pdf(pdf_path, output_txt_path=None):
    """
    Perform OCR on scanned PDF.

    Returns extracted text from all pages.
    """
    # Convert PDF to images
    images = convert_from_path(pdf_path)

    all_text = []

    for page_num, image in enumerate(images, start=1):
        # Perform OCR
        text = pytesseract.image_to_string(image)
        all_text.append(f"--- Page {page_num} ---\n{text}\n")

    full_text = "\n".join(all_text)

    # Optionally save to file
    if output_txt_path:
        with open(output_txt_path, 'w', encoding='utf-8') as f:
            f.write(full_text)

    return full_text

# Usage
text = ocr_pdf("scanned_document.pdf", "extracted_text.txt")
print(text)
```

### OCR with Language Support

```python
from pdf2image import convert_from_path
import pytesseract

def ocr_pdf_multilang(pdf_path, languages='eng'):
    """
    Perform OCR with multiple language support.

    Args:
        languages: Language codes separated by '+' (e.g., 'eng+fra+deu')
    """
    images = convert_from_path(pdf_path)
    all_text = []

    for image in images:
        text = pytesseract.image_to_string(image, lang=languages)
        all_text.append(text)

    return "\n\n".join(all_text)

# Usage
text = ocr_pdf_multilang("french_document.pdf", languages='fra')
```

### Create Searchable PDF

```python
import io

from pdf2image import convert_from_path
import pytesseract
from pypdf import PdfWriter

def create_searchable_pdf(input_pdf, output_pdf):
    """Convert scanned PDF to searchable PDF with OCR layer."""
    # Convert to images
    images = convert_from_path(input_pdf)

    writer = PdfWriter()

    for image in images:
        # Tesseract returns a single-page PDF (as bytes) containing the
        # page image with an invisible OCR text layer on top
        page_pdf = pytesseract.image_to_pdf_or_hocr(image, extension='pdf')

        # Append the page to the output document
        writer.append(io.BytesIO(page_pdf))

    with open(output_pdf, 'wb') as output_file:
        writer.write(output_file)

# Usage
create_searchable_pdf("scanned.pdf", "searchable.pdf")
```

### OCR with Preprocessing

```python
from pdf2image import convert_from_path
import pytesseract
from PIL import Image, ImageEnhance, ImageFilter

def ocr_pdf_enhanced(pdf_path):
    """Perform OCR with image preprocessing for better accuracy."""
    images = convert_from_path(pdf_path)
    all_text = []

    for image in images:
        # Preprocess image
        # Convert to grayscale
        image = image.convert('L')

        # Enhance contrast
        enhancer = ImageEnhance.Contrast(image)
        image = enhancer.enhance(2)

        # Sharpen
        image = image.filter(ImageFilter.SHARPEN)

        # Perform OCR
        text = pytesseract.image_to_string(image)
        all_text.append(text)

    return "\n\n".join(all_text)

# Usage
text = ocr_pdf_enhanced("low_quality_scan.pdf")
```

## Watermarks and Annotations

### Add Text Watermark

```python
import fitz

def add_text_watermark(input_pdf, output_pdf, watermark_text, opacity=0.3):
    """Add text watermark to all pages."""
    doc = fitz.open(input_pdf)

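    for page in doc:
        # Get page dimensions
        rect = page.rect
        center = fitz.Point((rect.x0 + rect.x1) / 2, (rect.y0 + rect.y1) / 2)

        # Text box spanning the page width, centered vertically
        box = fitz.Rect(rect.x0, center.y - 50, rect.x1, center.y + 50)

        # Add watermark. `rotate` only accepts multiples of 90, so the
        # 45-degree tilt is applied with `morph` around the page center.
        page.insert_textbox(
            box,
            watermark_text,
            fontsize=50,
            align=fitz.TEXT_ALIGN_CENTER,
            morph=(center, fitz.Matrix(45)),
            fill_opacity=opacity,
            color=(0.7, 0.7, 0.7)
        )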

    doc.save(output_pdf)
    doc.close()

# Usage
add_text_watermark("document.pdf", "watermarked.pdf", "CONFIDENTIAL")
```

### Add Annotations

```python
import fitz

def add_annotations(input_pdf, output_pdf, annotations):
    """
    Add various annotations to PDF.

    Args:
        annotations: List of dicts with 'page', 'type', 'rect', 'content'
    """
    doc = fitz.open(input_pdf)

    for annot in annotations:
        page = doc[annot['page']]
        rect = fitz.Rect(annot['rect'])

        if annot['type'] == 'highlight':
            highlight = page.add_highlight_annot(rect)
            highlight.update()

        elif annot['type'] == 'text':
            text_annot = page.add_text_annot(
                rect.top_left,
                annot['content']
            )
            text_annot.update()

        elif annot['type'] == 'underline':
            underline = page.add_underline_annot(rect)
            underline.update()

        elif annot['type'] == 'strikeout':
            strike = page.add_strikeout_annot(rect)
            strike.update()

    doc.save(output_pdf)
    doc.close()

# Usage
annotations = [
    {
        'page': 0,
        'type': 'highlight',
        'rect': (100, 100, 300, 120)
    },
    {
        'page': 0,
        'type': 'text',
        'rect': (400, 400, 450, 450),
        'content': 'Important note here'
    }
]
add_annotations("document.pdf", "annotated.pdf", annotations)
```

### Add Stamp

```python
import fitz

def add_stamp(input_pdf, output_pdf, stamp_text, position="top-right"):
    """Add a stamp (e.g., 'APPROVED', 'DRAFT') to all pages."""
    doc = fitz.open(input_pdf)

    for page in doc:
        rect = page.rect

        # Determine stamp position
        if position == "top-right":
            stamp_rect = fitz.Rect(rect.width - 150, 20, rect.width - 20, 60)
        elif position == "top-left":
            stamp_rect = fitz.Rect(20, 20, 150, 60)
        elif position == "bottom-right":
            stamp_rect = fitz.Rect(rect.width - 150, rect.height - 60, rect.width - 20, rect.height - 20)
        else:  # center
            stamp_rect = fitz.Rect(rect.width/2 - 75, rect.height/2 - 20, rect.width/2 + 75, rect.height/2 + 20)

        # Add stamp
        page.draw_rect(stamp_rect, color=(1, 0, 0), width=2)
        page.insert_textbox(
            stamp_rect,
            stamp_text,
            fontsize=20,
            align=fitz.TEXT_ALIGN_CENTER,
            color=(1, 0, 0)
        )

    doc.save(output_pdf)
    doc.close()

# Usage
add_stamp("document.pdf", "stamped.pdf", "APPROVED", position="top-right")
```

## Optimization

### Optimize PDF Size

```python
import fitz

def optimize_pdf(input_pdf, output_pdf, image_quality=50):
    """Compress and optimize PDF file size.

    Note: image_quality is accepted but not used by this implementation.
    """
    doc = fitz.open(input_pdf)

    # Compress with optimization
    doc.save(
        output_pdf,
        garbage=4,  # Maximum garbage collection
        deflate=True,  # Compress streams
        clean=True,  # Clean up content
        pretty=False  # No pretty-printing
    )

    doc.close()

    import os
    original_size = os.path.getsize(input_pdf) / (1024 * 1024)
    optimized_size = os.path.getsize(output_pdf) / (1024 * 1024)

    print(f"Original: {original_size:.2f} MB")
    print(f"Optimized: {optimized_size:.2f} MB")
    print(f"Reduction: {((original_size - optimized_size) / original_size * 100):.1f}%")

# Usage
optimize_pdf("large_file.pdf", "optimized.pdf")
```

```

### references/best-practices.md

```markdown
# PDF Best Practices and Common Pitfalls

## Best Practices

### 1. Memory Management for Large PDFs

Process large PDFs in chunks to avoid memory issues:

```python
from pypdf import PdfReader
import gc

def process_large_pdf_in_chunks(pdf_path, chunk_size=10):
    """Process large PDFs in chunks to manage memory."""
    reader = PdfReader(pdf_path)
    total_pages = len(reader.pages)

    for start in range(0, total_pages, chunk_size):
        end = min(start + chunk_size, total_pages)

        # Process chunk
        for page_num in range(start, end):
            page = reader.pages[page_num]
            text = page.extract_text()

            # Process text here
            yield page_num, text

        # Force garbage collection
        gc.collect()

# Usage
for page_num, text in process_large_pdf_in_chunks("large_file.pdf"):
    print(f"Processing page {page_num}")
```

### 2. Handle Text Encoding Issues

Always handle potential encoding problems:

```python
import pdfplumber

def extract_text_safe(pdf_path):
    """Extract text with proper encoding handling."""
    with pdfplumber.open(pdf_path) as pdf:
        all_text = []

        for page in pdf.pages:
            text = page.extract_text()

            if text:
                # Handle encoding issues
                text = text.encode('utf-8', errors='ignore').decode('utf-8')
                all_text.append(text)

        return "\n\n".join(all_text)
```

### 3. Preserve Document Structure

Extract content while maintaining document structure:

```python
import pdfplumber

def extract_structured_content(pdf_path):
    """Extract content while preserving structure."""
    with pdfplumber.open(pdf_path) as pdf:
        structured_data = []

        for page in pdf.pages:
            page_data = {
                'page_number': page.page_number,
                'text': page.extract_text(),
                'tables': page.extract_tables(),
                'images': len(page.images),
                'width': page.width,
                'height': page.height
            }

            structured_data.append(page_data)

        return structured_data
```

### 4. Error Handling Template

Always implement proper error handling:

```python
from pypdf import PdfReader
import logging

def safe_pdf_operation(pdf_path):
    """Template for safe PDF operations with error handling."""
    try:
        reader = PdfReader(pdf_path)

        # Check if encrypted
        if reader.is_encrypted:
            logging.warning(f"PDF {pdf_path} is encrypted")
            return None

        # Perform operations
        result = []
        for page in reader.pages:
            try:
                text = page.extract_text()
                result.append(text)
            except Exception as e:
                logging.error(f"Error extracting page: {e}")
                result.append("")

        return result

    except FileNotFoundError:
        logging.error(f"File not found: {pdf_path}")
        return None
    except Exception as e:
        logging.error(f"Error processing PDF: {e}")
        return None
```

### 5. Optimize PDF Size

Compress and optimize PDFs when file size matters:

```python
import fitz

def optimize_pdf(input_pdf, output_pdf, image_quality=50):
    """Compress and optimize PDF file size.

    Note: image_quality is accepted but not used by this implementation.
    """
    doc = fitz.open(input_pdf)

    # Compress with optimization
    doc.save(
        output_pdf,
        garbage=4,  # Maximum garbage collection
        deflate=True,  # Compress streams
        clean=True,  # Clean up content
        pretty=False  # No pretty-printing
    )

    doc.close()

    import os
    original_size = os.path.getsize(input_pdf) / (1024 * 1024)
    optimized_size = os.path.getsize(output_pdf) / (1024 * 1024)

    print(f"Original: {original_size:.2f} MB")
    print(f"Optimized: {optimized_size:.2f} MB")
    print(f"Reduction: {((original_size - optimized_size) / original_size * 100):.1f}%")

# Usage
optimize_pdf("large_file.pdf", "optimized.pdf")
```

## Common Pitfalls

### 1. Scanned Documents Without OCR

**Problem**: Text extraction returns empty strings for scanned PDFs.

**Solution**: Use OCR (pytesseract + pdf2image)

```python
import fitz

def extract_text_with_ocr_fallback(pdf_path):
    """Try text extraction, fall back to OCR if needed."""
    doc = fitz.open(pdf_path)
    page = doc[0]  # this example only checks the first page
    text = page.get_text()
    doc.close()

    if not text.strip():
        print("No text found, using OCR...")
        from pdf2image import convert_from_path
        import pytesseract

        images = convert_from_path(pdf_path)
        text = pytesseract.image_to_string(images[0])

    return text
```

### 2. Table Detection Accuracy

**Problem**: Tables not detected or extracted incorrectly.

**Solution**: Adjust table detection settings

```python
import pdfplumber

def extract_tables_robust(pdf_path):
    """Extract tables with multiple strategies."""
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[0]

        # Try different strategies
        strategies = [
            {"vertical_strategy": "lines", "horizontal_strategy": "lines"},
            {"vertical_strategy": "text", "horizontal_strategy": "text"},
            {"vertical_strategy": "lines", "horizontal_strategy": "text"}
        ]

        for strategy in strategies:
            tables = page.extract_tables(table_settings=strategy)
            if tables:
                return tables

        return []
```
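
Once a strategy returns a table, the rows are plain lists of cell strings, and empty cells come back as `None`. A minimal sketch of cleaning one extracted table into a list of dicts, assuming the first row is the header. `table_to_records` is a hypothetical helper and the sample table is made up for illustration:

```python
def table_to_records(table):
    """Convert a list-of-rows table into dicts keyed by the header row."""
    if not table:
        return []

    def clean(cell):
        # Empty cells are None; multi-line cells contain newlines
        return " ".join(str(cell).split()) if cell is not None else ""

    # Fall back to positional names for blank header cells
    header = [clean(cell) or f"column_{i}" for i, cell in enumerate(table[0])]

    records = []
    for row in table[1:]:
        cells = [clean(cell) for cell in row]
        if not any(cells):
            continue  # skip rows where every cell is empty
        records.append(dict(zip(header, cells)))
    return records

sample = [
    ["Name", "Amount\n(USD)"],
    ["Widget", " 12.50 "],
    [None, None],
    ["Gadget", None],
]
print(table_to_records(sample))
```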

### 3. Form Field Identification

**Problem**: Can't find form field names.

**Solution**: Inspect and list all fields first

```python
import fitz

def debug_form_fields(pdf_path):
    """Debug helper to see all form fields."""
    doc = fitz.open(pdf_path)

    print("=== Form Fields ===")
    for page_num in range(len(doc)):
        page = doc[page_num]
        # widgets() returns a generator, which is always truthy; build a list first
        widgets = list(page.widgets())

        if widgets:
            print(f"\nPage {page_num + 1}:")
            for widget in widgets:
                print(f"  Name: {widget.field_name}")
                print(f"  Type: {widget.field_type_string}")
                print(f"  Value: {widget.field_value}")
                print(f"  Rect: {widget.rect}")
                print("  ---")

    doc.close()
```

### 4. Encrypted PDFs

**Problem**: Operations fail on encrypted PDFs.

**Solution**: Check and handle encryption

```python
from pypdf import PdfReader

def handle_encrypted_pdf(pdf_path, password=None):
    """Safely handle encrypted PDFs."""
    reader = PdfReader(pdf_path)

    if reader.is_encrypted:
        if password:
            success = reader.decrypt(password)
            if success == 0:
                print("Incorrect password")
                return None
        else:
            print("PDF is encrypted, password required")
            return None

    # Now safe to process
    return reader
```

### 5. Page Rotation Issues

**Problem**: Extracted text appears rotated or out of order.

**Solution**: Check and handle page rotation

```python
import fitz

def extract_text_handle_rotation(pdf_path):
    """Extract text accounting for page rotation."""
    doc = fitz.open(pdf_path)

    for page in doc:
        # Check rotation
        rotation = page.rotation

        if rotation != 0:
            # Rotate page to 0 degrees
            page.set_rotation(0)

        text = page.get_text()
        print(text)

    doc.close()
```

### 6. Memory Issues with Large Files

**Problem**: Out of memory errors when processing large PDFs.

**Solution**: Process in chunks and manage resources

```python
from pypdf import PdfReader
import gc

def process_large_pdf_safe(pdf_path):
    """Process large PDF with memory management."""
    reader = PdfReader(pdf_path)

    for i, page in enumerate(reader.pages):
        # Process one page at a time
        text = page.extract_text()

        # Do something with text
        yield i, text

        # Free memory periodically
        if i % 10 == 0:
            gc.collect()

# Usage
for page_num, text in process_large_pdf_safe("huge.pdf"):
    print(f"Page {page_num}: {len(text)} characters")
```

### 7. Unicode and Special Characters

**Problem**: Special characters or non-ASCII text appears corrupted.

**Solution**: Handle encoding properly

```python
import pdfplumber

def extract_text_unicode_safe(pdf_path):
    """Extract text with proper Unicode handling."""
    with pdfplumber.open(pdf_path) as pdf:
        all_text = []

        for page in pdf.pages:
            text = page.extract_text()

            if text:
                # Normalize Unicode
                import unicodedata
                text = unicodedata.normalize('NFKC', text)

                # Handle encoding issues
                text = text.encode('utf-8', errors='replace').decode('utf-8')

                all_text.append(text)

        return "\n\n".join(all_text)
```
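
As a concrete illustration of the normalization step: PDF fonts often encode ligatures such as `ﬁ` as single code points, and NFKC folds these compatibility characters back into plain letters. A small standard-library sketch with made-up sample strings; the function name is hypothetical:

```python
import unicodedata

def normalize_extracted_text(text):
    """Fold compatibility characters (ligatures, full-width forms) with NFKC."""
    return unicodedata.normalize("NFKC", text)

for raw in ["ﬁnancial", "ofﬁce", "x²", "Ｆｕｌｌ"]:
    print(repr(raw), "->", repr(normalize_extracted_text(raw)))
```

Note that NFKC also flattens superscripts (`x²` becomes `x2`), so it may not suit documents where that distinction matters.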

### 8. Missing Dependencies

**Problem**: Import errors or missing system libraries.

**Solution**: Verify all dependencies are installed

```python
def check_dependencies():
    """Check if all required dependencies are available."""
    dependencies = {
        'pypdf': 'pip install pypdf',
        'pdfplumber': 'pip install pdfplumber',
        'reportlab': 'pip install reportlab',
        'fitz': 'pip install PyMuPDF',
        'pdf2image': 'pip install pdf2image (requires poppler)',
        'pytesseract': 'pip install pytesseract (requires tesseract)'
    }

    for module, install_cmd in dependencies.items():
        try:
            __import__(module)
            print(f"✓ {module} is installed")
        except ImportError:
            print(f"✗ {module} is missing. Install with: {install_cmd}")

check_dependencies()
```
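
The import check above only covers Python packages. `pdf2image` and `pytesseract` also call external programs (Poppler's `pdftoppm` and the `tesseract` binary), so a PATH lookup can catch the remaining cases. This is a sketch under that assumption; `check_system_tools` is a hypothetical helper name.

```python
import shutil

def check_system_tools(tools=("pdftoppm", "tesseract")):
    """Return {tool: path or None} for each executable looked up on PATH."""
    found = {tool: shutil.which(tool) for tool in tools}
    for tool, path in found.items():
        if path:
            print(f"[ok] {tool} found at {path}")
        else:
            print(f"[missing] {tool} not found on PATH")
    return found

check_system_tools()
```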

## Performance Tips

### 1. Use Appropriate Library for Task

- **Text extraction**: Use `pdfplumber` for layout-aware extraction, `pypdf` for simple extraction
- **Table extraction**: Prefer `pdfplumber`
- **PDF creation**: Use `reportlab`
- **Advanced manipulation**: Use `PyMuPDF` (imported as `fitz`)
- **OCR**: Use `pytesseract` + `pdf2image`

### 2. Batch Processing

Process multiple PDFs efficiently:

```python
from pathlib import Path
import concurrent.futures

def process_pdf_batch(pdf_directory, process_function):
    """Process multiple PDFs in parallel."""
    pdf_files = list(Path(pdf_directory).glob("*.pdf"))

    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
        results = executor.map(process_function, pdf_files)

    return list(results)
```
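
`executor.map` re-raises the first worker exception when its results are consumed, so one corrupt file can abort the whole batch. A possible variant, sketched here with hypothetical names, collects results and errors separately. Threads mainly help when the work is I/O-bound; for CPU-heavy parsing, `concurrent.futures.ProcessPoolExecutor` may be a better fit.

```python
import concurrent.futures

def process_batch_collecting_errors(paths, process_function, max_workers=4):
    """Run process_function on each path; return (results, errors) dicts."""
    results, errors = {}, {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_path = {executor.submit(process_function, p): p for p in paths}
        for future in concurrent.futures.as_completed(future_to_path):
            path = future_to_path[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # keep going when one file fails
                errors[path] = exc
    return results, errors

# Example with a stand-in function instead of real PDF parsing
def fake_process(path):
    if "bad" in path:
        raise ValueError(f"cannot parse {path}")
    return len(path)

ok, failed = process_batch_collecting_errors(["a.pdf", "bad.pdf"], fake_process)
print(ok)
print(sorted(failed))
```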

### 3. Cache Results

Cache expensive operations:

```python
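import functools
import hashlib

from pypdf import PdfReader

def _file_digest(pdf_path):
    """Hash the file contents so the cache key changes when the file changes."""
    with open(pdf_path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

@functools.lru_cache(maxsize=128)
def _extract_text_for_digest(pdf_path, file_hash):
    """Extract text; results are cached per (path, content hash) pair."""
    reader = PdfReader(pdf_path)
    return ''.join(page.extract_text() for page in reader.pages)

def extract_text_cached(pdf_path):
    """Cache text extraction results, re-extracting when the file content changes."""
    return _extract_text_for_digest(pdf_path, _file_digest(pdf_path))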
```

## Security Considerations

### 1. Validate Input Files

Always validate PDFs before processing:

```python
import fitz

def validate_pdf(pdf_path):
    """Validate that file is a legitimate PDF."""
    try:
        doc = fitz.open(pdf_path)
        is_valid = doc.is_pdf
        page_count = doc.page_count
        doc.close()

        if not is_valid:
            raise ValueError("File is not a valid PDF")

        if page_count == 0:
            raise ValueError("PDF has no pages")

        return True
    except Exception as e:
        print(f"Validation failed: {e}")
        return False
```
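
Before handing an untrusted file to a full parser, a cheap pre-check can reject empty or oversized files and files without a `%PDF-` marker. The sketch below is an assumption-laden illustration: `looks_like_pdf` and the 100 MB default are made up for this example, and the check complements rather than replaces `validate_pdf`. It searches the first 1024 bytes because some readers tolerate leading junk before the header; a stricter check would require the marker at offset 0.

```python
import os
import tempfile

def looks_like_pdf(path, max_size_mb=100):
    """Cheap pre-check: size limit plus a %PDF- marker near the start of the file."""
    size = os.path.getsize(path)
    if size == 0 or size > max_size_mb * 1024 * 1024:
        return False
    with open(path, "rb") as f:
        head = f.read(1024)
    return b"%PDF-" in head

# Demo on two small throwaway files
with tempfile.TemporaryDirectory() as demo_dir:
    good = os.path.join(demo_dir, "good.pdf")
    bad = os.path.join(demo_dir, "bad.pdf")
    with open(good, "wb") as f:
        f.write(b"%PDF-1.7\n% demo content")
    with open(bad, "wb") as f:
        f.write(b"<html>not a pdf</html>")
    good_result = looks_like_pdf(good)
    bad_result = looks_like_pdf(bad)

print(good_result, bad_result)
```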

### 2. Sanitize Form Input

When filling forms, sanitize user input:

```python
import re

def sanitize_form_data(data):
    """Sanitize form input to prevent injection."""
    sanitized = {}

    for key, value in data.items():
        if isinstance(value, str):
            # Remove potentially dangerous characters
            value = re.sub(r'[<>\"\'%;()&+]', '', value)
            # Limit length
            value = value[:500]

        sanitized[key] = value

    return sanitized
```

### 3. Handle Temporary Files Securely

Use secure temporary file handling:

```python
import tempfile
import os

def process_pdf_secure(pdf_path):
    """Process PDF with secure temporary file handling."""
    with tempfile.TemporaryDirectory() as temp_dir:
        temp_file = os.path.join(temp_dir, "temp.pdf")

        # Do processing
        # ...

        # Temporary files are automatically cleaned up
```
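
A slightly more concrete version of the same idea copies the input into the private temporary directory, does the work there, and only then moves the finished file to its destination. This is a sketch, not the skill's own implementation: `process_in_temp_dir` and the `transform` callable (which reads one path and writes another) are hypothetical names.

```python
import os
import shutil
import tempfile

def process_in_temp_dir(pdf_path, output_path, transform):
    """Copy the input to a private temp dir, transform it there, then publish the result."""
    with tempfile.TemporaryDirectory() as temp_dir:
        work_in = os.path.join(temp_dir, "input.pdf")
        work_out = os.path.join(temp_dir, "output.pdf")
        shutil.copyfile(pdf_path, work_in)

        # Hypothetical callable: reads work_in and writes work_out
        transform(work_in, work_out)

        shutil.move(work_out, output_path)
    # temp_dir and anything still inside it have been removed at this point

# Demo with shutil.copyfile standing in for a real PDF operation
with tempfile.TemporaryDirectory() as demo_dir:
    src = os.path.join(demo_dir, "in.pdf")
    dst = os.path.join(demo_dir, "out.pdf")
    with open(src, "wb") as f:
        f.write(b"%PDF-1.4 demo")
    process_in_temp_dir(src, dst, shutil.copyfile)
    published = os.path.exists(dst)

print(published)
```

Working on a copy means a failed transform leaves neither a half-written output file nor a modified input behind.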

```

### scripts/pdf_helper.py

```python
#!/usr/bin/env python3
"""
PDF Helper Script - Comprehensive PDF Manipulation Utilities

This script provides a collection of functions for common PDF operations including:
- Text and table extraction
- PDF merging and splitting
- Form filling
- PDF creation
- Watermarking and annotations
- Metadata management
- Encryption and security
- OCR processing

Dependencies:
    pip install pypdf pdfplumber reportlab PyMuPDF pdf2image pytesseract pillow

Author: Claude
Date: 2025-10-25
"""

import os
import logging
from typing import List, Dict, Tuple, Optional, Union
from pathlib import Path

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)


# ============================================================================
# TEXT EXTRACTION
# ============================================================================

def extract_text(pdf_path: str, method: str = 'pdfplumber') -> str:
    """
    Extract all text from a PDF file.

    Args:
        pdf_path: Path to the PDF file
        method: Extraction method ('pdfplumber' or 'pypdf')

    Returns:
        Extracted text as a string

    Example:
        >>> text = extract_text("document.pdf")
        >>> print(text[:100])
    """
    try:
        if method == 'pdfplumber':
            import pdfplumber

            with pdfplumber.open(pdf_path) as pdf:
                text_parts = []
                for page_num, page in enumerate(pdf.pages, start=1):
                    page_text = page.extract_text()
                    if page_text:
                        text_parts.append(f"--- Page {page_num} ---\n{page_text}")

                return "\n\n".join(text_parts)

        elif method == 'pypdf':
            from pypdf import PdfReader

            reader = PdfReader(pdf_path)
            text_parts = []

            for page_num, page in enumerate(reader.pages, start=1):
                page_text = page.extract_text()
                if page_text:
                    text_parts.append(f"--- Page {page_num} ---\n{page_text}")

            return "\n\n".join(text_parts)

        else:
            raise ValueError(f"Unknown method: {method}. Use 'pdfplumber' or 'pypdf'")

    except Exception as e:
        logger.error(f"Error extracting text from {pdf_path}: {e}")
        raise


def extract_text_by_page(pdf_path: str) -> List[Dict[str, Union[int, str]]]:
    """
    Extract text from PDF, organized by page.

    Args:
        pdf_path: Path to the PDF file

    Returns:
        List of dictionaries with page number and text

    Example:
        >>> pages = extract_text_by_page("document.pdf")
        >>> for page in pages:
        ...     print(f"Page {page['page']}: {len(page['text'])} characters")
    """
    import pdfplumber

    pages_data = []

    try:
        with pdfplumber.open(pdf_path) as pdf:
            for page_num, page in enumerate(pdf.pages, start=1):
                text = page.extract_text() or ""
                pages_data.append({
                    'page': page_num,
                    'text': text,
                    'char_count': len(text),
                    'word_count': len(text.split())
                })

        logger.info(f"Extracted text from {len(pages_data)} pages")
        return pages_data

    except Exception as e:
        logger.error(f"Error extracting text by page from {pdf_path}: {e}")
        raise


# ============================================================================
# TABLE EXTRACTION
# ============================================================================

def extract_tables(pdf_path: str, page_numbers: Optional[List[int]] = None) -> List[Dict]:
    """
    Extract tables from PDF.

    Args:
        pdf_path: Path to the PDF file
        page_numbers: Specific pages to extract from (1-indexed), or None for all

    Returns:
        List of dictionaries containing table data

    Example:
        >>> tables = extract_tables("report.pdf")
        >>> for t in tables:
        ...     print(f"Page {t['page']}: {len(t['data'])} rows")
    """
    import pdfplumber
    import pandas as pd

    all_tables = []

    try:
        with pdfplumber.open(pdf_path) as pdf:
            pages_to_process = pdf.pages

            if page_numbers:
                # Convert to 0-indexed
                pages_to_process = [pdf.pages[p - 1] for p in page_numbers if 0 < p <= len(pdf.pages)]

            for page in pages_to_process:
                tables = page.extract_tables()

                for table_num, table in enumerate(tables, start=1):
                    if table and len(table) > 0:
                        # Convert to DataFrame
                        df = pd.DataFrame(table[1:], columns=table[0])

                        all_tables.append({
                            'page': page.page_number,
                            'table_number': table_num,
                            'data': df,
                            'raw_data': table
                        })

        logger.info(f"Extracted {len(all_tables)} tables from {pdf_path}")
        return all_tables

    except Exception as e:
        logger.error(f"Error extracting tables from {pdf_path}: {e}")
        raise


def save_tables_to_csv(pdf_path: str, output_dir: str) -> List[str]:
    """
    Extract tables from PDF and save each as CSV.

    Args:
        pdf_path: Path to the PDF file
        output_dir: Directory to save CSV files

    Returns:
        List of created CSV file paths

    Example:
        >>> csv_files = save_tables_to_csv("report.pdf", "output/")
        >>> print(f"Created {len(csv_files)} CSV files")
    """
    tables = extract_tables(pdf_path)
    os.makedirs(output_dir, exist_ok=True)

    csv_files = []

    for t in tables:
        filename = f"page{t['page']}_table{t['table_number']}.csv"
        filepath = os.path.join(output_dir, filename)

        t['data'].to_csv(filepath, index=False)
        csv_files.append(filepath)

        logger.info(f"Saved table to {filepath}")

    return csv_files


# ============================================================================
# PDF MERGING
# ============================================================================

def merge_pdfs(pdf_list: List[str], output_path: str, add_bookmarks: bool = False) -> None:
    """
    Merge multiple PDF files into one.

    Args:
        pdf_list: List of PDF file paths to merge
        output_path: Path for the output merged PDF
        add_bookmarks: Whether to add bookmarks for each source file

    Example:
        >>> merge_pdfs(["file1.pdf", "file2.pdf"], "merged.pdf")
    """
    from pypdf import PdfWriter  # PdfMerger is deprecated in recent pypdf releases

    try:
        writer = PdfWriter()
        merged_count = 0

        for pdf_path in pdf_list:
            if not os.path.exists(pdf_path):
                logger.warning(f"File not found: {pdf_path}, skipping")
                continue

            if add_bookmarks:
                bookmark_name = Path(pdf_path).stem
                writer.append(pdf_path, outline_item=bookmark_name)
            else:
                writer.append(pdf_path)

            merged_count += 1
            logger.info(f"Added {pdf_path} to merge")

        writer.write(output_path)
        writer.close()

        logger.info(f"Successfully merged {merged_count} PDFs into {output_path}")

    except Exception as e:
        logger.error(f"Error merging PDFs: {e}")
        raise


def merge_pdfs_with_ranges(
    pdf_configs: List[Dict[str, Union[str, Tuple[int, int]]]],
    output_path: str
) -> None:
    """
    Merge PDFs with specific page ranges.

    Args:
        pdf_configs: List of dicts with 'path' and optional 'pages' (tuple)
        output_path: Path for the output merged PDF

    Example:
        >>> configs = [
        ...     {'path': 'doc1.pdf', 'pages': (0, 3)},
        ...     {'path': 'doc2.pdf'},  # All pages
        ... ]
        >>> merge_pdfs_with_ranges(configs, "output.pdf")
    """
    from pypdf import PdfWriter  # PdfMerger is deprecated in recent pypdf releases

    try:
        writer = PdfWriter()

        for config in pdf_configs:
            path = config['path']
            pages = config.get('pages')

            if not os.path.exists(path):
                logger.warning(f"File not found: {path}, skipping")
                continue

            if pages:
                # pages is a (start, stop) tuple of 0-based indices; stop is exclusive
                writer.append(path, pages=pages)
                logger.info(f"Added {path} (pages {pages[0]}-{pages[1]})")
            else:
                writer.append(path)
                logger.info(f"Added {path} (all pages)")

        writer.write(output_path)
        writer.close()

        logger.info(f"Successfully created merged PDF: {output_path}")

    except Exception as e:
        logger.error(f"Error merging PDFs with ranges: {e}")
        raise


# ============================================================================
# PDF SPLITTING
# ============================================================================

def split_pdf(input_pdf: str, output_dir: str, pages_per_file: int = 1) -> List[str]:
    """
    Split a PDF into multiple files.

    Args:
        input_pdf: Path to the input PDF
        output_dir: Directory to save split PDFs
        pages_per_file: Number of pages per output file

    Returns:
        List of created file paths

    Example:
        >>> files = split_pdf("document.pdf", "output/", pages_per_file=2)
        >>> print(f"Created {len(files)} files")
    """
    from pypdf import PdfReader, PdfWriter

    try:
        reader = PdfReader(input_pdf)
        total_pages = len(reader.pages)

        os.makedirs(output_dir, exist_ok=True)

        output_files = []
        file_count = 1

        for start_page in range(0, total_pages, pages_per_file):
            writer = PdfWriter()

            end_page = min(start_page + pages_per_file, total_pages)

            for page_num in range(start_page, end_page):
                writer.add_page(reader.pages[page_num])

            output_filename = f"split_{file_count}.pdf"
            output_path = os.path.join(output_dir, output_filename)

            with open(output_path, 'wb') as output_file:
                writer.write(output_file)

            output_files.append(output_path)
            logger.info(f"Created {output_filename} (pages {start_page + 1}-{end_page})")

            file_count += 1

        logger.info(f"Split {input_pdf} into {len(output_files)} files")
        return output_files

    except Exception as e:
        logger.error(f"Error splitting PDF {input_pdf}: {e}")
        raise


def extract_page_range(input_pdf: str, output_pdf: str, start_page: int, end_page: int) -> None:
    """
    Extract a specific range of pages from a PDF.

    Args:
        input_pdf: Path to the input PDF
        output_pdf: Path for the output PDF
        start_page: Starting page number (1-indexed)
        end_page: Ending page number (1-indexed, inclusive)

    Example:
        >>> extract_page_range("document.pdf", "excerpt.pdf", 5, 10)
    """
    from pypdf import PdfReader, PdfWriter

    try:
        reader = PdfReader(input_pdf)
        writer = PdfWriter()

        # Convert to 0-indexed
        start_idx = start_page - 1
        end_idx = end_page

        if start_idx < 0 or end_idx > len(reader.pages) or start_page > end_page:
            raise ValueError(f"Invalid page range {start_page}-{end_page} (document has pages 1-{len(reader.pages)})")

        for page_num in range(start_idx, end_idx):
            writer.add_page(reader.pages[page_num])

        with open(output_pdf, 'wb') as output_file:
            writer.write(output_file)

        logger.info(f"Extracted pages {start_page}-{end_page} to {output_pdf}")

    except Exception as e:
        logger.error(f"Error extracting page range: {e}")
        raise


# ============================================================================
# PDF CREATION
# ============================================================================

def create_pdf_from_text(output_path: str, content: List[str], title: str = "") -> None:
    """
    Create a simple PDF from text content.

    Args:
        output_path: Path for the output PDF
        content: List of text lines
        title: Optional title for the document

    Example:
        >>> content = ["Line 1", "Line 2", "Line 3"]
        >>> create_pdf_from_text("output.pdf", content, "My Document")
    """
    from reportlab.pdfgen import canvas
    from reportlab.lib.pagesizes import letter

    try:
        c = canvas.Canvas(output_path, pagesize=letter)
        width, height = letter

        c.setFont("Helvetica", 12)

        y_position = height - 50

        if title:
            c.setFont("Helvetica-Bold", 16)
            c.drawString(50, y_position, title)
            y_position -= 40
            c.setFont("Helvetica", 12)

        for line in content:
            if y_position < 50:  # Start new page
                c.showPage()
                c.setFont("Helvetica", 12)
                y_position = height - 50

            c.drawString(50, y_position, str(line))
            y_position -= 20

        c.save()
        logger.info(f"Created PDF: {output_path}")

    except Exception as e:
        logger.error(f"Error creating PDF from text: {e}")
        raise


def create_pdf_report(
    output_path: str,
    title: str,
    sections: List[Dict[str, Union[str, List[str]]]]
) -> None:
    """
    Create a formatted PDF report with sections.

    Args:
        output_path: Path for the output PDF
        title: Report title
        sections: List of dicts with 'heading' and 'content' keys

    Example:
        >>> sections = [
        ...     {'heading': 'Introduction', 'content': ['Paragraph 1', 'Paragraph 2']},
        ...     {'heading': 'Results', 'content': ['Result 1', 'Result 2']}
        ... ]
        >>> create_pdf_report("report.pdf", "Monthly Report", sections)
    """
    from reportlab.lib.pagesizes import letter
    from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
    from reportlab.lib.units import inch
    from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer
    from reportlab.lib.enums import TA_CENTER

    try:
        doc = SimpleDocTemplate(output_path, pagesize=letter)
        story = []
        styles = getSampleStyleSheet()

        # Custom title style
        title_style = ParagraphStyle(
            'CustomTitle',
            parent=styles['Heading1'],
            fontSize=24,
            textColor='darkblue',
            alignment=TA_CENTER,
            spaceAfter=30
        )

        # Add title
        story.append(Paragraph(title, title_style))
        story.append(Spacer(1, 0.5 * inch))

        # Add sections
        for section in sections:
            # Section heading
            story.append(Paragraph(section['heading'], styles['Heading2']))
            story.append(Spacer(1, 0.2 * inch))

            # Section content
            content_list = section.get('content', [])
            if isinstance(content_list, str):
                content_list = [content_list]

            for paragraph in content_list:
                p = Paragraph(str(paragraph), styles['BodyText'])
                story.append(p)
                story.append(Spacer(1, 0.1 * inch))

            story.append(Spacer(1, 0.3 * inch))

        doc.build(story)
        logger.info(f"Created PDF report: {output_path}")

    except Exception as e:
        logger.error(f"Error creating PDF report: {e}")
        raise


# ============================================================================
# WATERMARKS AND ANNOTATIONS
# ============================================================================

def add_watermark(
    input_pdf: str,
    output_pdf: str,
    watermark_text: str,
    opacity: float = 0.3
) -> None:
    """
    Add a text watermark to all pages of a PDF.

    Args:
        input_pdf: Path to the input PDF
        output_pdf: Path for the output PDF
        watermark_text: Text to use as watermark
        opacity: Watermark opacity (0.0 to 1.0)

    Example:
        >>> add_watermark("document.pdf", "watermarked.pdf", "CONFIDENTIAL")
    """
    import fitz

    try:
        doc = fitz.open(input_pdf)

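        for page in doc:
            rect = page.rect
            center = fitz.Point(rect.x0 + rect.width / 2, rect.y0 + rect.height / 2)

            # Text box starting near the vertical middle of the page
            text_rect = fitz.Rect(rect.x0, center.y - 35, rect.x1, rect.y1)

            # insert_textbox only accepts rotate in multiples of 90, so the
            # 45-degree tilt is applied with morph=(pivot, matrix) instead
            page.insert_textbox(
                text_rect,
                watermark_text,
                fontsize=50,
                align=fitz.TEXT_ALIGN_CENTER,
                morph=(center, fitz.Matrix(45)),
                fill_opacity=opacity,
                color=(0.7, 0.7, 0.7)
            )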

        doc.save(output_pdf)
        doc.close()

        logger.info(f"Added watermark to {output_pdf}")

    except Exception as e:
        logger.error(f"Error adding watermark: {e}")
        raise


def add_page_numbers(input_pdf: str, output_pdf: str, position: str = 'bottom-center') -> None:
    """
    Add page numbers to a PDF.

    Args:
        input_pdf: Path to the input PDF
        output_pdf: Path for the output PDF
        position: Position of page numbers ('bottom-center', 'bottom-right', etc.)

    Example:
        >>> add_page_numbers("document.pdf", "numbered.pdf")
    """
    import fitz

    try:
        doc = fitz.open(input_pdf)

        for page_num, page in enumerate(doc, start=1):
            rect = page.rect

            # Determine position
            if position == 'bottom-center':
                text_rect = fitz.Rect(rect.width / 2 - 20, rect.height - 30, rect.width / 2 + 20, rect.height - 10)
            elif position == 'bottom-right':
                text_rect = fitz.Rect(rect.width - 60, rect.height - 30, rect.width - 10, rect.height - 10)
            else:  # bottom-left
                text_rect = fitz.Rect(10, rect.height - 30, 60, rect.height - 10)

            page.insert_textbox(
                text_rect,
                str(page_num),
                fontsize=10,
                align=fitz.TEXT_ALIGN_CENTER,
                color=(0, 0, 0)
            )

        doc.save(output_pdf)
        doc.close()

        logger.info(f"Added page numbers to {output_pdf}")

    except Exception as e:
        logger.error(f"Error adding page numbers: {e}")
        raise


# ============================================================================
# METADATA MANAGEMENT
# ============================================================================

def get_pdf_metadata(pdf_path: str) -> Dict[str, Union[str, int, bool]]:
    """
    Extract metadata from a PDF file.

    Args:
        pdf_path: Path to the PDF file

    Returns:
        Dictionary containing metadata

    Example:
        >>> metadata = get_pdf_metadata("document.pdf")
        >>> print(metadata['title'])
    """
    from pypdf import PdfReader

    try:
        reader = PdfReader(pdf_path)
        metadata = reader.metadata

        info = {
            'title': metadata.get('/Title', '') if metadata else '',
            'author': metadata.get('/Author', '') if metadata else '',
            'subject': metadata.get('/Subject', '') if metadata else '',
            'creator': metadata.get('/Creator', '') if metadata else '',
            'producer': metadata.get('/Producer', '') if metadata else '',
            'creation_date': metadata.get('/CreationDate', '') if metadata else '',
            'modification_date': metadata.get('/ModDate', '') if metadata else '',
            'page_count': len(reader.pages),
            'is_encrypted': reader.is_encrypted
        }

        return info

    except Exception as e:
        logger.error(f"Error extracting metadata from {pdf_path}: {e}")
        raise


def update_pdf_metadata(input_pdf: str, output_pdf: str, metadata: Dict[str, str]) -> None:
    """
    Update PDF metadata.

    Args:
        input_pdf: Path to the input PDF
        output_pdf: Path for the output PDF
        metadata: Dictionary of metadata fields to update

    Example:
        >>> metadata = {'/Title': 'New Title', '/Author': 'John Doe'}
        >>> update_pdf_metadata("document.pdf", "updated.pdf", metadata)
    """
    from pypdf import PdfReader, PdfWriter

    try:
        reader = PdfReader(input_pdf)
        writer = PdfWriter()

        # Copy all pages
        for page in reader.pages:
            writer.add_page(page)

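        # Carry over the existing metadata, then apply the updates on top
        if reader.metadata is not None:
            writer.add_metadata(reader.metadata)
        writer.add_metadata(metadata)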

        with open(output_pdf, 'wb') as output_file:
            writer.write(output_file)

        logger.info(f"Updated metadata in {output_pdf}")

    except Exception as e:
        logger.error(f"Error updating metadata: {e}")
        raise


# ============================================================================
# SECURITY AND ENCRYPTION
# ============================================================================

def encrypt_pdf(
    input_pdf: str,
    output_pdf: str,
    user_password: str,
    owner_password: Optional[str] = None
) -> None:
    """
    Encrypt a PDF with password protection.

    Args:
        input_pdf: Path to the input PDF
        output_pdf: Path for the output encrypted PDF
        user_password: Password to open the document
        owner_password: Password for full permissions (optional)

    Example:
        >>> encrypt_pdf("document.pdf", "secure.pdf", "user123", "owner456")
    """
    from pypdf import PdfReader, PdfWriter

    try:
        reader = PdfReader(input_pdf)
        writer = PdfWriter()

        # Copy all pages
        for page in reader.pages:
            writer.add_page(page)

        # Encrypt (AES in pypdf needs an optional crypto backend such as 'cryptography')
        if owner_password is None:
            owner_password = user_password

        writer.encrypt(
            user_password=user_password,
            owner_password=owner_password,
            algorithm="AES-256"
        )

        with open(output_pdf, 'wb') as output_file:
            writer.write(output_file)

        logger.info(f"Encrypted PDF saved to {output_pdf}")

    except Exception as e:
        logger.error(f"Error encrypting PDF: {e}")
        raise


def decrypt_pdf(input_pdf: str, output_pdf: str, password: str) -> None:
    """
    Remove password protection from a PDF.

    Args:
        input_pdf: Path to the encrypted PDF
        output_pdf: Path for the output decrypted PDF
        password: Password to decrypt the PDF

    Example:
        >>> decrypt_pdf("secure.pdf", "unlocked.pdf", "user123")
    """
    from pypdf import PdfReader, PdfWriter

    try:
        reader = PdfReader(input_pdf)

        if reader.is_encrypted:
            success = reader.decrypt(password)
            if success == 0:
                raise ValueError("Incorrect password")

        writer = PdfWriter()

        # Copy all pages
        for page in reader.pages:
            writer.add_page(page)

        with open(output_pdf, 'wb') as output_file:
            writer.write(output_file)

        logger.info(f"Decrypted PDF saved to {output_pdf}")

    except Exception as e:
        logger.error(f"Error decrypting PDF: {e}")
        raise


# ============================================================================
# OCR PROCESSING
# ============================================================================

def ocr_pdf(pdf_path: str, output_txt_path: Optional[str] = None, language: str = 'eng') -> str:
    """
    Perform OCR on a scanned PDF.

    Args:
        pdf_path: Path to the PDF file
        output_txt_path: Optional path to save extracted text
        language: OCR language code (default: 'eng')

    Returns:
        Extracted text

    Example:
        >>> text = ocr_pdf("scanned.pdf", "output.txt")
    """
    try:
        from pdf2image import convert_from_path
        import pytesseract

        # Convert PDF to images
        images = convert_from_path(pdf_path)

        all_text = []

        for page_num, image in enumerate(images, start=1):
            logger.info(f"Processing page {page_num}/{len(images)}")

            # Perform OCR
            text = pytesseract.image_to_string(image, lang=language)
            all_text.append(f"--- Page {page_num} ---\n{text}")

        full_text = "\n\n".join(all_text)

        # Save to file if specified
        if output_txt_path:
            with open(output_txt_path, 'w', encoding='utf-8') as f:
                f.write(full_text)
            logger.info(f"Saved OCR text to {output_txt_path}")

        return full_text

    except ImportError:
        logger.error("Missing dependencies. Install: pip install pdf2image pytesseract")
        raise
    except Exception as e:
        logger.error(f"Error performing OCR on {pdf_path}: {e}")
        raise


# ============================================================================
# UTILITY FUNCTIONS
# ============================================================================

def get_pdf_info(pdf_path: str) -> Dict[str, Union[str, int, bool, List]]:
    """
    Get comprehensive information about a PDF.

    Args:
        pdf_path: Path to the PDF file

    Returns:
        Dictionary with PDF information

    Example:
        >>> info = get_pdf_info("document.pdf")
        >>> print(f"Pages: {info['page_count']}")
    """
    import fitz

    try:
        doc = fitz.open(pdf_path)

        info = {
            'file_path': pdf_path,
            'file_size_mb': os.path.getsize(pdf_path) / (1024 * 1024),
            'page_count': doc.page_count,
            'is_encrypted': doc.is_encrypted,
            'metadata': doc.metadata,
            'page_sizes': []
        }

        # Get page sizes
        for page_num in range(doc.page_count):
            page = doc[page_num]
            info['page_sizes'].append({
                'page': page_num + 1,
                'width': page.rect.width,
                'height': page.rect.height
            })

        doc.close()
        return info

    except Exception as e:
        logger.error(f"Error getting PDF info: {e}")
        raise


def compress_pdf(input_pdf: str, output_pdf: str) -> Dict[str, float]:
    """
    Compress a PDF file to reduce size.

    Args:
        input_pdf: Path to the input PDF
        output_pdf: Path for the compressed PDF

    Returns:
        Dictionary with original and compressed sizes

    Example:
        >>> result = compress_pdf("large.pdf", "small.pdf")
        >>> print(f"Reduced by {result['reduction_percent']:.1f}%")
    """
    import fitz

    try:
        doc = fitz.open(input_pdf)

        # Save with compression
        doc.save(
            output_pdf,
            garbage=4,  # Maximum garbage collection
            deflate=True,  # Compress streams
            clean=True  # Clean up content
        )

        doc.close()

        original_size = os.path.getsize(input_pdf) / (1024 * 1024)
        compressed_size = os.path.getsize(output_pdf) / (1024 * 1024)
        reduction = ((original_size - compressed_size) / original_size) * 100

        result = {
            'original_size_mb': original_size,
            'compressed_size_mb': compressed_size,
            'reduction_percent': reduction
        }

        logger.info(f"Compressed {input_pdf}: {original_size:.2f} MB -> {compressed_size:.2f} MB ({reduction:.1f}% reduction)")

        return result

    except Exception as e:
        logger.error(f"Error compressing PDF: {e}")
        raise


def extract_images_from_pdf(pdf_path: str, output_dir: str) -> List[str]:
    """
    Extract all images from a PDF.

    Args:
        pdf_path: Path to the PDF file
        output_dir: Directory to save extracted images

    Returns:
        List of saved image file paths

    Example:
        >>> images = extract_images_from_pdf("document.pdf", "images/")
        >>> print(f"Extracted {len(images)} images")
    """
    import fitz

    try:
        doc = fitz.open(pdf_path)
        os.makedirs(output_dir, exist_ok=True)

        image_files = []

        for page_num in range(len(doc)):
            page = doc[page_num]
            images = page.get_images()

            for img_index, img in enumerate(images):
                xref = img[0]
                base_image = doc.extract_image(xref)

                image_bytes = base_image["image"]
                image_ext = base_image["ext"]

                image_filename = os.path.join(
                    output_dir,
                    f"page{page_num + 1}_img{img_index + 1}.{image_ext}"
                )

                with open(image_filename, "wb") as img_file:
                    img_file.write(image_bytes)

                image_files.append(image_filename)

        doc.close()

        logger.info(f"Extracted {len(image_files)} images to {output_dir}")
        return image_files

    except Exception as e:
        logger.error(f"Error extracting images: {e}")
        raise


def rotate_pdf_pages(
    input_pdf: str,
    output_pdf: str,
    rotation: int = 90,
    pages: Optional[List[int]] = None
) -> None:
    """
    Rotate pages in a PDF.

    Args:
        input_pdf: Path to the input PDF
        output_pdf: Path for the output PDF
        rotation: Clockwise degrees to rotate; must be a multiple of 90 (e.g. 90, 180, 270)
        pages: List of page numbers to rotate (1-indexed), or None for all

    Example:
        >>> rotate_pdf_pages("document.pdf", "rotated.pdf", 90, [1, 3, 5])
    """
    from pypdf import PdfReader, PdfWriter

    try:
        reader = PdfReader(input_pdf)
        writer = PdfWriter()

        for page_num, page in enumerate(reader.pages):
            if pages is None or (page_num + 1) in pages:
                page.rotate(rotation)
            writer.add_page(page)

        with open(output_pdf, 'wb') as output_file:
            writer.write(output_file)

        logger.info(f"Rotated pages in {output_pdf}")

    except Exception as e:
        logger.error(f"Error rotating pages: {e}")
        raise


# ============================================================================
# MAIN (for testing)
# ============================================================================

if __name__ == "__main__":
    # Example usage
    print("PDF Helper Script")
    print("=" * 50)

    # Example 1: Extract text
    # text = extract_text("sample.pdf")
    # print(text[:500])

    # Example 2: Extract tables
    # tables = extract_tables("report.pdf")
    # for t in tables:
    #     print(f"Page {t['page']}: {t['data']}")

    # Example 3: Merge PDFs
    # merge_pdfs(["file1.pdf", "file2.pdf"], "merged.pdf")

    # Example 4: Get PDF info
    # info = get_pdf_info("document.pdf")
    # print(f"Pages: {info['page_count']}")

    print("\nImport this module to use the helper functions.")
    print("Example: from pdf_helper import extract_text, merge_pdfs")

```
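
The `rotate_pdf_pages` helper above takes 1-indexed page numbers and silently skips any number that does not match a page. A minimal sketch of validating the arguments before any file is opened might look like the following; `validate_rotation` and `resolve_pages` are hypothetical names used for illustration and are not part of `pdf_helper.py`.

```python
from typing import List, Optional


def validate_rotation(rotation: int) -> int:
    """Return the rotation normalized to 0, 90, 180 or 270 degrees."""
    if rotation % 90 != 0:
        raise ValueError(f"Rotation must be a multiple of 90, got {rotation}")
    return rotation % 360


def resolve_pages(page_count: int, pages: Optional[List[int]] = None) -> List[int]:
    """Convert 1-indexed page numbers to sorted, unique 0-indexed positions."""
    if pages is None:
        # None means "all pages", matching the helper's convention.
        return list(range(page_count))
    bad = [p for p in pages if not 1 <= p <= page_count]
    if bad:
        raise ValueError(f"Page numbers out of range 1..{page_count}: {bad}")
    return sorted({p - 1 for p in pages})
```

Failing early on an out-of-range page number makes a typo such as page `0` visible instead of producing an output file with nothing rotated.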



---

## Skill Companion Files

> Additional files collected from the skill directory layout.

### README.md

```markdown
# PDF Manipulation Skill

Comprehensive PDF manipulation, extraction, and generation with support for text extraction, form filling, merging, splitting, annotations, and creation.

## Overview

Work with PDF files in Python for extracting text and tables, filling PDF forms, merging/splitting PDFs, creating PDFs programmatically, adding watermarks/annotations, and managing PDF metadata. This skill covers everything from basic text extraction to advanced operations like OCR and PDF creation.

## Installation

Install the required Python libraries:

```bash
pip install pypdf pdfplumber reportlab PyMuPDF pdf2image pytesseract pillow
```

For detailed installation instructions including system dependencies, see [Library Installation Guide](./references/library-installation.md).

## What's Included

### SKILL.md
Comprehensive guide covering all PDF operations with progressive disclosure for efficiency. Includes quick start examples, detailed workflows for text/table extraction, form operations, merging/splitting, PDF creation, metadata management, and OCR.

### scripts/
- `pdf_helper.py` - Utility functions for common PDF operations

### examples/
- `invoice-generator.md` - Professional invoice template generation
- `report-automation.md` - Automated report generation workflows

### references/
- `library-installation.md` - Setup guide and dependencies
- `text-extraction.md` - All text extraction methods including OCR fallback
- `table-extraction.md` - Table detection strategies and data cleaning
- `pdf-operations.md` - Forms, merge, split, page operations
- `pdf-creation.md` - Creating PDFs from scratch with reportlab
- `metadata-security-ocr.md` - Advanced operations
- `best-practices.md` - Pitfalls and solutions

## Quick Start

### Extract Text from PDF

```python
from pypdf import PdfReader

reader = PdfReader("document.pdf")
for page in reader.pages:
    text = page.extract_text()
    print(text)
```

### Extract Tables

```python
import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            print(table)
```

### Create a Simple PDF

```python
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter

c = canvas.Canvas("output.pdf", pagesize=letter)
c.setFont("Helvetica", 12)
c.drawString(50, 750, "Hello, World!")
c.save()
```

### Fill PDF Forms

```python
import fitz

doc = fitz.open("form.pdf")
for page in doc:
    for widget in page.widgets():
        if widget.field_name == "name":
            widget.field_value = "John Doe"
            widget.update()
doc.save("filled.pdf")
doc.close()
```

## Key Features

- **Text Extraction**: Simple and layout-aware extraction with OCR fallback
- **Table Extraction**: Advanced table detection and parsing
- **Form Operations**: Fill, extract, and flatten PDF forms
- **Merging & Splitting**: Combine PDFs or extract specific pages
- **PDF Creation**: Generate PDFs with text, images, tables, and graphics
- **Watermarks & Annotations**: Add text/image watermarks and annotations
- **Metadata Management**: Extract and modify PDF metadata
- **Security**: Password protection and encryption
- **OCR Support**: Extract text from scanned documents
- **Optimization**: Compress and optimize PDF files

## Python Libraries Overview

- **pypdf**: Basic operations (merge, split, rotate, metadata)
- **pdfplumber**: Advanced text/table extraction with layout awareness
- **reportlab**: Create PDFs from scratch (reports, invoices, documents)
- **PyMuPDF (fitz)**: Advanced manipulation, annotations, compression
- **pdf2image**: Convert PDF pages to images (requires poppler)
- **pytesseract**: OCR for scanned documents (requires tesseract)

## Common Use Cases

### Extract Data from Invoices
Use pdfplumber to extract tables and structured data from PDF invoices for automated processing.
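
pdfplumber's `extract_tables()` returns each table as a list of rows, where a cell is a string or `None`. A rough sketch of turning one such table into records is shown below. It assumes the first row is the header, which will not hold for every invoice layout, and `table_to_records` is a hypothetical helper name.

```python
from typing import Dict, List, Optional

Table = List[List[Optional[str]]]


def table_to_records(table: Table) -> List[Dict[str, str]]:
    """Turn one extracted table into a list of dicts keyed by the header row."""

    def clean(cell: Optional[str]) -> str:
        # Empty cells come back as None; multi-line cells contain newlines.
        return " ".join(cell.split()) if cell else ""

    if not table:
        return []
    header = [clean(c) for c in table[0]]  # assumption: first row is the header
    records = []
    for row in table[1:]:
        cells = [clean(c) for c in row]
        if not any(cells):
            continue  # skip rows that are entirely empty
        records.append(dict(zip(header, cells)))
    return records


# Hand-written sample shaped like pdfplumber output (not real invoice data).
sample = [
    ["Item", "Qty", "Unit\nPrice"],
    ["Widget", "2", "9.99"],
    [None, None, None],
    ["Gadget", None, " 4.50 "],
]
records = table_to_records(sample)
```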

### Generate Reports Programmatically
Create professional PDF reports with reportlab including charts, tables, and formatted text.

### Fill PDF Forms in Bulk
Automate form filling using PyMuPDF for contracts, applications, or surveys.

### Merge Multiple PDFs
Combine multiple PDF files into a single document with pypdf.

### OCR Scanned Documents
Render scanned pages to images with pdf2image, then use pytesseract to extract their text or to create searchable PDFs.
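
One possible shape for an OCR fallback is sketched below: try normal extraction first and only render a page for OCR when it yields almost no text. The `needs_ocr` heuristic and its 20-character threshold are assumptions made for illustration, not values taken from this skill.

```python
from typing import List, Optional


def needs_ocr(text: Optional[str], min_chars: int = 20) -> bool:
    """Heuristic: treat a page as scanned if it yields almost no text."""
    return len((text or "").strip()) < min_chars


def extract_text_with_ocr_fallback(pdf_path: str, dpi: int = 200) -> List[str]:
    """Return one text string per page, using OCR only where needed."""
    # Imports are deferred so the heuristic above works without the OCR stack.
    from pypdf import PdfReader
    from pdf2image import convert_from_path
    import pytesseract

    reader = PdfReader(pdf_path)
    texts = []
    for index, page in enumerate(reader.pages):
        text = page.extract_text() or ""
        if needs_ocr(text):
            # Render only this page; pdf2image page numbers are 1-indexed.
            image = convert_from_path(
                pdf_path, dpi=dpi, first_page=index + 1, last_page=index + 1
            )[0]
            text = pytesseract.image_to_string(image)
        texts.append(text)
    return texts
```

Rendering a single page at a time keeps memory use bounded, at the cost of invoking poppler once per scanned page.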

## Best Practices

1. **Choose the right library** - Use pypdf for basic operations, pdfplumber for extraction, reportlab for creation
2. **Handle errors** - Wrap PDF operations in try-except blocks, and log or re-raise failures rather than silently ignoring them
3. **Check for encryption** - Decrypt PDFs before processing
4. **Use OCR fallback** - Detect scanned documents and apply OCR when needed
5. **Process in chunks** - Handle large PDFs in chunks to manage memory
6. **Validate inputs** - Check file existence and format before processing
7. **Close documents** - Always close PyMuPDF documents to free resources
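
As a rough illustration of practices 3 and 6, the sketch below checks that a path points at something that looks like a PDF and then handles encryption before any processing starts. `looks_like_pdf` and `open_reader` are hypothetical helper names, and scanning the first 1024 bytes for the `%PDF-` marker is a lenient assumption rather than a full format check.

```python
import os
from typing import Optional


def looks_like_pdf(path: str) -> bool:
    """Cheap pre-check: the path is a file that carries a %PDF- header."""
    if not os.path.isfile(path):
        return False
    with open(path, "rb") as handle:
        head = handle.read(1024)
    return b"%PDF-" in head


def open_reader(path: str, password: Optional[str] = None):
    """Open a PDF with pypdf, decrypting it first when a password is needed."""
    from pypdf import PdfReader  # deferred so the pre-check has no dependencies

    if not looks_like_pdf(path):
        raise ValueError(f"Not a readable PDF file: {path}")
    reader = PdfReader(path)
    if reader.is_encrypted:
        # decrypt() returns a falsy value when the password does not match.
        if password is None or not reader.decrypt(password):
            raise PermissionError(f"Could not decrypt {path}")
    return reader
```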

## Common Pitfalls

- **Scanned Documents**: Text extraction returns empty strings - render the pages with pdf2image and apply OCR (pytesseract)
- **Table Detection**: Tables not detected - adjust the pdfplumber `table_settings` strategies (`vertical_strategy`, `horizontal_strategy`)
- **Encrypted PDFs**: Operations fail - check for encryption and decrypt with the password first
- **Memory Issues**: Large PDFs cause crashes - process pages in chunks and release each chunk before loading the next
- **Encoding Issues**: Special characters corrupted - write extracted text with an explicit UTF-8 encoding

For detailed solutions, see [Best Practices and Common Pitfalls](./references/best-practices.md).
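
For the memory pitfall, one way to process pages in chunks is sketched below. `page_chunks` and `extract_text_in_chunks` are hypothetical names, the default chunk size of 50 is an arbitrary assumption, and reopening the file for each chunk is just one simple way to let earlier page objects be released.

```python
from typing import Iterator, Tuple


def page_chunks(total_pages: int, chunk_size: int = 50) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, stop) page ranges covering the document."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, total_pages, chunk_size):
        yield start, min(start + chunk_size, total_pages)


def extract_text_in_chunks(pdf_path: str, chunk_size: int = 50) -> Iterator[str]:
    """Yield page text one page at a time, reopening the file per chunk."""
    import pdfplumber  # deferred so page_chunks has no dependencies

    with pdfplumber.open(pdf_path) as pdf:
        total = len(pdf.pages)
    for start, stop in page_chunks(total, chunk_size):
        # A fresh handle per chunk lets the previous chunk's pages be freed.
        with pdfplumber.open(pdf_path) as pdf:
            for page in pdf.pages[start:stop]:
                yield page.extract_text() or ""
```

Because `extract_text_in_chunks` is a generator, callers can write each page's text to disk as it arrives instead of holding the whole document in memory.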

## Documentation

See `SKILL.md` for comprehensive documentation, detailed workflows, and advanced techniques.

## Requirements

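- Python 3.7+ (recent releases of some of these libraries require a newer Python; check each package's requirements)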
- pypdf
- pdfplumber
- reportlab
- PyMuPDF (fitz)
- pdf2image (optional, for image conversion)
- pytesseract (optional, for OCR)
- Pillow

System dependencies:
- poppler (for pdf2image)
- tesseract (for OCR)

```
