pdf

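A comprehensive PDF manipulation toolkit for extracting text and tables, creating new PDFs, merging/splitting documents, and handling forms. Use it when Claude needs to fill in a PDF form or programmatically process, generate, or analyze PDF documents at scale.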

Packaged view

This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.

Stars: 271.

Hot score: 98.

Updated: March 20, 2026.

Overall rating: C (3.2).

Composite score: 3.2.

Best-practice grade: F (35.9).

Install command

npx @skill-hub/cli install leastbit-claude-skills-zh-cn-pdf

Repository

LeastBit/Claude_skills_zh-CN

Skill path: skills/pdf

Best for

Primary workflow: Ship Full Stack.

Technical facets: Full Stack.

Target audience: everyone.

License: Proprietary. See LICENSE.txt for complete terms.

Original source

Catalog source: SkillHub Club.

Repository owner: LeastBit.

This is still a mirrored public skill entry. Review the repository before installing into production workflows.

What it helps with

  • Install pdf into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
  • Review https://github.com/LeastBit/Claude_skills_zh-CN before adding pdf to shared team environments
  • Use pdf for development workflows

Works across

Claude Code, Codex CLI, Gemini CLI, OpenCode

Favorites: 0.

Sub-skills: 0.

Aggregator: No.

Original source / Raw SKILL.md

---
name: pdf
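description: A comprehensive PDF manipulation toolkit for extracting text and tables, creating new PDFs, merging/splitting documents, and handling forms. Use it when Claude needs to fill in a PDF form or programmatically process, generate, or analyze PDF documents at scale.
license: Proprietary. See LICENSE.txt for complete terms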
---

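# PDF Processing Guide

## Overview

This guide covers essential PDF processing operations using Python libraries and command-line tools. For advanced features, JavaScript libraries, and detailed examples, see reference.md. To fill in a PDF form, read forms.md and follow its instructions.

## Quick Start

```python
from pypdf import PdfReader, PdfWriter

# Read the PDF
reader = PdfReader("document.pdf")
print(f"Pages: {len(reader.pages)}")

# Extract text
text = ""
for page in reader.pages:
    text += page.extract_text()
```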

## Python Libraries

### pypdf - Basic Operations

#### Merge PDFs
```python
from pypdf import PdfWriter, PdfReader

writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)

with open("merged.pdf", "wb") as output:
    writer.write(output)
```

#### Split a PDF
```python
from pypdf import PdfReader, PdfWriter

# Write each page to its own file
reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)
```

#### Extract Metadata
```python
from pypdf import PdfReader

reader = PdfReader("document.pdf")
meta = reader.metadata
print(f"Title: {meta.title}")
print(f"Author: {meta.author}")
print(f"Subject: {meta.subject}")
print(f"Creator: {meta.creator}")
```

#### Rotate Pages
```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

page = reader.pages[0]
page.rotate(90)  # Rotate 90 degrees clockwise
writer.add_page(page)

with open("rotated.pdf", "wb") as output:
    writer.write(output)
```

### pdfplumber - Text and Table Extraction

#### Extract Text with Layout
```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)
```

#### Extract Tables
```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        tables = page.extract_tables()
        for j, table in enumerate(tables):
            print(f"Table {j+1} on page {i+1}:")
            for row in table:
                print(row)
```

#### Advanced Table Extraction
```python
import pandas as pd
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    all_tables = []
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            if table:  # Skip empty tables
                # Treat the first row as the header
                df = pd.DataFrame(table[1:], columns=table[0])
                all_tables.append(df)

# Combine all tables
if all_tables:
    combined_df = pd.concat(all_tables, ignore_index=True)
    combined_df.to_excel("extracted_tables.xlsx", index=False)
```
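The DataFrame step above treats the first row of each table as its header. The same idea can be sketched without pandas, assuming each table arrives as a list of rows (the shape `extract_tables()` returns) and that empty cells may be `None`. The helper name `table_to_records` and the sample data are hypothetical, for illustration only.

```python
def table_to_records(table):
    """Turn a list of rows into a list of dicts keyed by the header row."""
    # A table needs a header row plus at least one data row.
    if not table or len(table) < 2:
        return []
    # Normalize header cells: None becomes an empty string, whitespace is trimmed.
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records


# Hypothetical sample table with one empty cell.
sample = [["Name", "Qty"], ["Widget", "3"], ["Gadget", None]]
records = table_to_records(sample)
print(records)
```

Normalizing `None` cells up front keeps later steps, such as writing CSV or concatenating tables, from having to special-case missing values.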

### reportlab - Create PDFs

#### Basic PDF Creation
```python
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

c = canvas.Canvas("hello.pdf", pagesize=letter)
width, height = letter

# Add text
c.drawString(100, height - 100, "Hello World!")
c.drawString(100, height - 120, "This is a PDF created with reportlab")

# Add a line
c.line(100, height - 140, 400, height - 140)

# Save
c.save()
```

#### Create a Multi-Page PDF
```python
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("report.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []

# Add content
title = Paragraph("Report Title", styles['Title'])
story.append(title)
story.append(Spacer(1, 12))

body = Paragraph("This is the body of the report. " * 20, styles['Normal'])
story.append(body)
story.append(PageBreak())

# Page 2
story.append(Paragraph("Page 2", styles['Heading1']))
story.append(Paragraph("Content for page 2", styles['Normal']))

# Build the PDF
doc.build(story)
```

## Command-Line Tools

### pdftotext (poppler-utils)
```bash
# Extract text
pdftotext input.pdf output.txt

# Extract text, preserving layout
pdftotext -layout input.pdf output.txt

# Extract specific pages
pdftotext -f 1 -l 5 input.pdf output.txt  # Pages 1-5
```

### qpdf
```bash
# Merge PDFs
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf

# Split pages
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
qpdf input.pdf --pages . 6-10 -- pages6-10.pdf

# Rotate pages
qpdf input.pdf output.pdf --rotate=+90:1  # Rotate page 1 by 90 degrees

# Remove a password
qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf
```
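When the merge and split commands above are driven from a script, it can help to build the argument lists in Python and hand them to `subprocess.run`. The sketch below only constructs and prints the commands and does not invoke qpdf. The helper names `qpdf_merge_args` and `qpdf_split_args` and the `pagesN-M.pdf` output naming are assumptions that mirror the examples above.

```python
import shlex


def qpdf_merge_args(inputs, output):
    """Build the argv for merging several PDFs with qpdf."""
    return ["qpdf", "--empty", "--pages", *inputs, "--", output]


def qpdf_split_args(source, size, total_pages):
    """Build one argv per chunk of `size` pages, using 1-based inclusive ranges."""
    commands = []
    for first in range(1, total_pages + 1, size):
        last = min(first + size - 1, total_pages)
        commands.append(
            ["qpdf", source, "--pages", ".", f"{first}-{last}", "--", f"pages{first}-{last}.pdf"]
        )
    return commands


merge_cmd = qpdf_merge_args(["file1.pdf", "file2.pdf"], "merged.pdf")
split_cmds = qpdf_split_args("input.pdf", 5, 12)
print(shlex.join(merge_cmd))
for cmd in split_cmds:
    print(shlex.join(cmd))
```

Passing a list to `subprocess.run(cmd, check=True)` rather than a shell string avoids quoting problems with file names that contain spaces.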

### pdftk (if available)
```bash
# Merge
pdftk file1.pdf file2.pdf cat output merged.pdf

# Split
pdftk input.pdf burst

# Rotate
pdftk input.pdf rotate 1east output rotated.pdf
```

## Common Tasks

### Extract Text from Scanned PDFs
```python
# Requires: pip install pytesseract pdf2image
# (plus the Tesseract and poppler system packages)
import pytesseract
from pdf2image import convert_from_path

# Convert the PDF to images
images = convert_from_path('scanned.pdf')

# Run OCR on each page
text = ""
for i, image in enumerate(images):
    text += f"Page {i+1}:\n"
    text += pytesseract.image_to_string(image)
    text += "\n\n"

print(text)
```

### Add a Watermark
```python
from pypdf import PdfReader, PdfWriter

# Create a watermark (or load an existing one)
watermark = PdfReader("watermark.pdf").pages[0]

# Apply it to every page
reader = PdfReader("document.pdf")
writer = PdfWriter()

for page in reader.pages:
    page.merge_page(watermark)
    writer.add_page(page)

with open("watermarked.pdf", "wb") as output:
    writer.write(output)
```

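### Extract Images
```bash
# Using pdfimages (poppler-utils)
pdfimages -j input.pdf output_prefix

# Images are written as output_prefix-000, output_prefix-001, and so on.
# With -j, JPEG-encoded images are saved as .jpg; other images are saved as PPM/PBM.
```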

### Password Protection
```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

for page in reader.pages:
    writer.add_page(page)

# Add a password
writer.encrypt("userpassword", "ownerpassword")

with open("encrypted.pdf", "wb") as output:
    writer.write(output)
```

## Quick Reference

| Task | Best Tool | Command/Code |
|------|-----------|--------------|
| Merge PDFs | pypdf | `writer.add_page(page)` |
| Split PDFs | pypdf | One page per file |
| Extract text | pdfplumber | `page.extract_text()` |
| Extract tables | pdfplumber | `page.extract_tables()` |
| Create PDFs | reportlab | Canvas or Platypus |
| Command-line merge | qpdf | `qpdf --empty --pages ...` |
| OCR scanned PDFs | pytesseract | Convert to images first |
| Fill PDF forms | pdf-lib or pypdf (see forms.md) | See forms.md |

## Next Steps

- For advanced pypdfium2 usage, see reference.md
- For JavaScript libraries (pdf-lib), see reference.md
- To fill in a PDF form, follow the instructions in forms.md
- For troubleshooting guidance, see reference.md


---

## Skill Companion Files

> Additional files collected from the skill directory layout.

### scripts/check_bounding_boxes.py

```python
from dataclasses import dataclass
import json
import sys


# Checks the `fields.json` file that Claude creates while analyzing a PDF
# for overlapping bounding boxes. See forms.md.


@dataclass
class RectAndField:
    rect: list[float]
    rect_type: str
    field: dict


# Returns a list of messages that are printed to stdout for Claude to read.
def get_bounding_box_messages(fields_json_stream) -> list[str]:
    messages = []
    fields = json.load(fields_json_stream)
    messages.append(f"读取了 {len(fields['form_fields'])} 个字段")

    def rects_intersect(r1, r2):
        disjoint_horizontal = r1[0] >= r2[2] or r1[2] <= r2[0]
        disjoint_vertical = r1[1] >= r2[3] or r1[3] <= r2[1]
        return not (disjoint_horizontal or disjoint_vertical)

    rects_and_fields = []
    for f in fields["form_fields"]:
        rects_and_fields.append(RectAndField(f["label_bounding_box"], "label", f))
        rects_and_fields.append(RectAndField(f["entry_bounding_box"], "entry", f))

    has_error = False
    for i, ri in enumerate(rects_and_fields):
        # This is O(N^2); it can be optimized if that becomes a problem.
        for j in range(i + 1, len(rects_and_fields)):
            rj = rects_and_fields[j]
            if ri.field["page_number"] == rj.field["page_number"] and rects_intersect(ri.rect, rj.rect):
                has_error = True
                if ri.field is rj.field:
                    messages.append(f"失败: `{ri.field['description']}` 的标签和输入边界框相交 ({ri.rect}, {rj.rect})")
                else:
                    messages.append(f"失败: `{ri.field['description']}` 的 {ri.rect_type} 边界框 ({ri.rect}) 与 `{rj.field['description']}` 的 {rj.rect_type} 边界框 ({rj.rect}) 相交")
                if len(messages) >= 20:
                    messages.append("中止进一步检查;请修复边界框后重试")
                    return messages
        if ri.rect_type == "entry":
            if "entry_text" in ri.field:
                font_size = ri.field["entry_text"].get("font_size", 14)
                entry_height = ri.rect[3] - ri.rect[1]
                if entry_height < font_size:
                    has_error = True
                    messages.append(f"失败: `{ri.field['description']}` 的输入边界框高度 ({entry_height}) 对于文本内容来说太小(字体大小: {font_size})。请增加框高度或减小字体大小。")
                    if len(messages) >= 20:
                        messages.append("中止进一步检查;请修复边界框后重试")
                        return messages

    if not has_error:
        messages.append("成功: 所有边界框均有效")
    return messages

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("用法: check_bounding_boxes.py [fields.json]")
        sys.exit(1)
    # The input file should be in the `fields.json` format described in forms.md.
    with open(sys.argv[1]) as f:
        messages = get_bounding_box_messages(f)
    for msg in messages:
        print(msg)

```
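The overlap test in the script above treats two boxes as disjoint when one lies entirely to the side of, above, or below the other, so boxes that merely share an edge pass the check. A minimal standalone sketch of that predicate, assuming boxes are `[left, top, right, bottom]` in image coordinates (the sample boxes are hypothetical):

```python
def rects_intersect(r1, r2):
    """Return True only when the two boxes share interior area."""
    # Boxes are disjoint horizontally if one ends before the other begins.
    disjoint_horizontal = r1[0] >= r2[2] or r1[2] <= r2[0]
    disjoint_vertical = r1[1] >= r2[3] or r1[3] <= r2[1]
    return not (disjoint_horizontal or disjoint_vertical)


label = [10, 10, 50, 30]
touching_entry = [50, 10, 150, 30]     # shares the x=50 edge with the label
overlapping_entry = [40, 10, 150, 30]  # starts inside the label

touching_result = rects_intersect(label, touching_entry)
overlapping_result = rects_intersect(label, overlapping_entry)
print(touching_result, overlapping_result)
```

Using `>=` and `<=` is what makes edge contact acceptable; switching to strict comparisons would flag adjacent label and entry boxes as errors.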

### scripts/check_bounding_boxes_test.py

```python
import unittest
import json
import io
from check_bounding_boxes import get_bounding_box_messages


# Currently this is not run automatically in CI; it is here for documentation and manual checking.
class TestGetBoundingBoxMessages(unittest.TestCase):

    def create_json_stream(self, data):
        """Helper that creates a JSON stream from data"""
        return io.StringIO(json.dumps(data))

    def test_no_intersections(self):
        """Test the case where no bounding boxes intersect"""
        data = {
            "form_fields": [
                {
                    "description": "姓名",
                    "page_number": 1,
                    "label_bounding_box": [10, 10, 50, 30],
                    "entry_bounding_box": [60, 10, 150, 30]
                },
                {
                    "description": "邮箱",
                    "page_number": 1,
                    "label_bounding_box": [10, 40, 50, 60],
                    "entry_bounding_box": [60, 40, 150, 60]
                }
            ]
        }

        stream = self.create_json_stream(data)
        messages = get_bounding_box_messages(stream)
        self.assertTrue(any("成功" in msg for msg in messages))
        self.assertFalse(any("失败" in msg for msg in messages))

    def test_label_entry_intersection_same_field(self):
        """Test that an intersection between the label and entry of the same field is reported"""
        data = {
            "form_fields": [
                {
                    "description": "姓名",
                    "page_number": 1,
                    "label_bounding_box": [10, 10, 60, 30],
                    "entry_bounding_box": [50, 10, 150, 30]  # Overlaps the label
                }
            ]
        }

        stream = self.create_json_stream(data)
        messages = get_bounding_box_messages(stream)
        self.assertTrue(any("失败" in msg and "相交" in msg for msg in messages))
        self.assertFalse(any("成功" in msg for msg in messages))

    def test_intersection_between_different_fields(self):
        """Test that intersections between bounding boxes of different fields are reported"""
        data = {
            "form_fields": [
                {
                    "description": "姓名",
                    "page_number": 1,
                    "label_bounding_box": [10, 10, 50, 30],
                    "entry_bounding_box": [60, 10, 150, 30]
                },
                {
                    "description": "邮箱",
                    "page_number": 1,
                    "label_bounding_box": [40, 20, 80, 40],  # Overlaps the first field's bounding boxes
                    "entry_bounding_box": [160, 10, 250, 30]
                }
            ]
        }

        stream = self.create_json_stream(data)
        messages = get_bounding_box_messages(stream)
        self.assertTrue(any("失败" in msg and "相交" in msg for msg in messages))
        self.assertFalse(any("成功" in msg for msg in messages))

    def test_different_pages_no_intersection(self):
        """Test that bounding boxes on different pages do not count as intersecting"""
        data = {
            "form_fields": [
                {
                    "description": "姓名",
                    "page_number": 1,
                    "label_bounding_box": [10, 10, 50, 30],
                    "entry_bounding_box": [60, 10, 150, 30]
                },
                {
                    "description": "邮箱",
                    "page_number": 2,
                    "label_bounding_box": [10, 10, 50, 30],  # Same coordinates, but on a different page
                    "entry_bounding_box": [60, 10, 150, 30]
                }
            ]
        }

        stream = self.create_json_stream(data)
        messages = get_bounding_box_messages(stream)
        self.assertTrue(any("成功" in msg for msg in messages))
        self.assertFalse(any("失败" in msg for msg in messages))

    def test_entry_height_too_small(self):
        """Test that the entry box height is checked against the font size"""
        data = {
            "form_fields": [
                {
                    "description": "姓名",
                    "page_number": 1,
                    "label_bounding_box": [10, 10, 50, 30],
                    "entry_bounding_box": [60, 10, 150, 20],  # Height is 10
                    "entry_text": {
                        "font_size": 14  # Font size is larger than the height
                    }
                }
            ]
        }

        stream = self.create_json_stream(data)
        messages = get_bounding_box_messages(stream)
        self.assertTrue(any("失败" in msg and "高度" in msg for msg in messages))
        self.assertFalse(any("成功" in msg for msg in messages))

    def test_entry_height_adequate(self):
        """Test that an adequate entry box height passes validation"""
        data = {
            "form_fields": [
                {
                    "description": "姓名",
                    "page_number": 1,
                    "label_bounding_box": [10, 10, 50, 30],
                    "entry_bounding_box": [60, 10, 150, 30],  # Height is 20
                    "entry_text": {
                        "font_size": 14  # Font size is smaller than the height
                    }
                }
            ]
        }

        stream = self.create_json_stream(data)
        messages = get_bounding_box_messages(stream)
        self.assertTrue(any("成功" in msg for msg in messages))
        self.assertFalse(any("失败" in msg for msg in messages))

    def test_default_font_size(self):
        """Test that the default font size is used when none is specified"""
        data = {
            "form_fields": [
                {
                    "description": "姓名",
                    "page_number": 1,
                    "label_bounding_box": [10, 10, 50, 30],
                    "entry_bounding_box": [60, 10, 150, 20],  # Height is 10
                    "entry_text": {}  # No font_size specified; the default of 14 should apply
                }
            ]
        }

        stream = self.create_json_stream(data)
        messages = get_bounding_box_messages(stream)
        self.assertTrue(any("失败" in msg and "高度" in msg for msg in messages))
        self.assertFalse(any("成功" in msg for msg in messages))

    def test_no_entry_text(self):
        """Test that the height check is skipped when entry_text is missing"""
        data = {
            "form_fields": [
                {
                    "description": "姓名",
                    "page_number": 1,
                    "label_bounding_box": [10, 10, 50, 30],
                    "entry_bounding_box": [60, 10, 150, 20]  # Small height, but no entry_text
                }
            ]
        }

        stream = self.create_json_stream(data)
        messages = get_bounding_box_messages(stream)
        self.assertTrue(any("成功" in msg for msg in messages))
        self.assertFalse(any("失败" in msg for msg in messages))

    def test_multiple_errors_limit(self):
        """Test that the number of error messages is capped to prevent excessive output"""
        fields = []
        # Create many overlapping fields
        for i in range(25):
            fields.append({
                "description": f"字段{i}",
                "page_number": 1,
                "label_bounding_box": [10, 10, 50, 30],  # All overlap
                "entry_bounding_box": [20, 15, 60, 35]   # All overlap
            })

        data = {"form_fields": fields}

        stream = self.create_json_stream(data)
        messages = get_bounding_box_messages(stream)
        # Should abort after roughly 20 messages
        self.assertTrue(any("中止" in msg for msg in messages))
        # Should have some failure messages, but not hundreds
        failure_count = sum(1 for msg in messages if "失败" in msg)
        self.assertGreater(failure_count, 0)
        self.assertLess(len(messages), 30)  # Should be capped

    def test_edge_touching_boxes(self):
        """Test that bounding boxes touching at an edge do not count as intersecting"""
        data = {
            "form_fields": [
                {
                    "description": "姓名",
                    "page_number": 1,
                    "label_bounding_box": [10, 10, 50, 30],
                    "entry_bounding_box": [50, 10, 150, 30]  # Touches at x=50
                }
            ]
        }

        stream = self.create_json_stream(data)
        messages = get_bounding_box_messages(stream)
        self.assertTrue(any("成功" in msg for msg in messages))
        self.assertFalse(any("失败" in msg for msg in messages))


if __name__ == '__main__':
    unittest.main()

```

### scripts/check_fillable_fields.py

```python
import sys
from pypdf import PdfReader


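# Script for Claude to run to determine whether a PDF has fillable form fields. See forms.md.


reader = PdfReader(sys.argv[1])
if reader.get_fields():
    print("This PDF has fillable form fields")
else:
    print("This PDF does not have fillable form fields; you will need to visually determine where to enter data")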

```

### scripts/convert_pdf_to_images.py

```python
import os
import sys

from pdf2image import convert_from_path


# Converts each page of a PDF to a PNG image.


def convert(pdf_path, output_dir, max_dim=1000):
    images = convert_from_path(pdf_path, dpi=200)

    for i, image in enumerate(images):
        # Scale the image if needed to keep width/height under `max_dim`
        width, height = image.size
        if width > max_dim or height > max_dim:
            scale_factor = min(max_dim / width, max_dim / height)
            new_width = int(width * scale_factor)
            new_height = int(height * scale_factor)
            image = image.resize((new_width, new_height))

        image_path = os.path.join(output_dir, f"page_{i+1}.png")
        image.save(image_path)
        print(f"Saved page {i+1} as {image_path} (size: {image.size})")

    print(f"Converted {len(images)} pages to PNG images")


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: convert_pdf_to_images.py [input pdf] [output directory]")
        sys.exit(1)
    pdf_path = sys.argv[1]
    output_directory = sys.argv[2]
    convert(pdf_path, output_directory)

```
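The resize step in the script above scales both dimensions by a single factor so the longer side lands at `max_dim` while the aspect ratio is preserved. A minimal sketch of just that arithmetic, without pdf2image or Pillow; `scaled_size` is a hypothetical helper name and the pixel sizes are made-up inputs.

```python
def scaled_size(width, height, max_dim=1000):
    """Return the (width, height) an image would be resized to, if at all."""
    if width <= max_dim and height <= max_dim:
        return width, height  # already small enough, leave unchanged
    # The smaller of the two ratios is the one that brings the longer side to max_dim.
    scale_factor = min(max_dim / width, max_dim / height)
    return int(width * scale_factor), int(height * scale_factor)


large = scaled_size(2000, 1500)
small = scaled_size(800, 600)
print(large, small)
```

Because `int()` truncates, the resized long side can come out a pixel under `max_dim` for some inputs, which stays within the limit.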

### scripts/create_validation_image.py

```python
import json
import sys

from PIL import Image, ImageDraw


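# Creates "validation" images with rectangles for the bounding box information that
# Claude creates when determining where to add text annotations in PDFs. See forms.md.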


def create_validation_image(page_number, fields_json_path, input_path, output_path):
    # The input file should be in the `fields.json` format described in forms.md.
    with open(fields_json_path, 'r') as f:
        data = json.load(f)

        img = Image.open(input_path)
        draw = ImageDraw.Draw(img)
        num_boxes = 0

        for field in data["form_fields"]:
            if field["page_number"] == page_number:
                entry_box = field['entry_bounding_box']
                label_box = field['label_bounding_box']
                # Draw a red rectangle over the entry bounding box and a blue rectangle over the label.
                draw.rectangle(entry_box, outline='red', width=2)
                draw.rectangle(label_box, outline='blue', width=2)
                num_boxes += 2

        img.save(output_path)
        print(f"Created validation image at {output_path} with {num_boxes} bounding boxes")


if __name__ == "__main__":
    if len(sys.argv) != 5:
        print("Usage: create_validation_image.py [page number] [fields.json file] [input image path] [output image path]")
        sys.exit(1)
    page_number = int(sys.argv[1])
    fields_json_path = sys.argv[2]
    input_image_path = sys.argv[3]
    output_image_path = sys.argv[4]
    create_validation_image(page_number, fields_json_path, input_image_path, output_image_path)

```

### scripts/extract_form_field_info.py

```python
import json
import sys

from pypdf import PdfReader


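# Extracts data for the fillable form fields in a PDF and outputs JSON that
# Claude uses to fill the fields. See forms.md.


# This matches the format used by PdfReader's `get_fields` and PdfWriter's `update_page_form_field_values` methods.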
def get_full_annotation_field_id(annotation):
    components = []
    while annotation:
        field_name = annotation.get('/T')
        if field_name:
            components.append(field_name)
        annotation = annotation.get('/Parent')
    return ".".join(reversed(components)) if components else None


def make_field_dict(field, field_id):
    field_dict = {"field_id": field_id}
    ft = field.get('/FT')
    if ft == "/Tx":
        field_dict["type"] = "text"
    elif ft == "/Btn":
        field_dict["type"] = "checkbox"  # Radio groups are handled separately
        states = field.get("/_States_", [])
        if len(states) == 2:
            # "/Off" seems to always be the unchecked value, as suggested by
            # https://opensource.adobe.com/dc-acrobat-sdk-docs/standards/pdfstandards/pdf/PDF32000_2008.pdf#page=448
            # It can be either first or second in the "/_States_" list.
            if "/Off" in states:
                field_dict["checked_value"] = states[0] if states[0] != "/Off" else states[1]
                field_dict["unchecked_value"] = "/Off"
            else:
                print(f"Unexpected state values for checkbox `{field_id}`. Its checked and unchecked values may not be correct; if you are trying to check it, visually verify the result.")
                field_dict["checked_value"] = states[0]
                field_dict["unchecked_value"] = states[1]
    elif ft == "/Ch":
        field_dict["type"] = "choice"
        states = field.get("/_States_", [])
        field_dict["choice_options"] = [{
            "value": state[0],
            "text": state[1],
        } for state in states]
    else:
        field_dict["type"] = f"unknown ({ft})"
    return field_dict


# Returns a list of fillable PDF fields:
# [
#   {
#     "field_id": "name",
#     "page": 1,
#     "type": ("text", "checkbox", "radio_group", or "choice")
#     // Per-type additional fields are described in forms.md
#   },
# ]
def get_field_info(reader: PdfReader):
    fields = reader.get_fields()

    field_info_by_id = {}
    possible_radio_names = set()

    for field_id, field in fields.items():
        # Skip this if it is a container field with children, except that it
        # might be the parent group of radio button options.
        if field.get("/Kids"):
            if field.get("/FT") == "/Btn":
                possible_radio_names.add(field_id)
            continue
        field_info_by_id[field_id] = make_field_dict(field, field_id)

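    # Bounding rects are stored in annotations in page objects.

    # Radio button options have a separate annotation for each choice;
    # all choices have the same field name.
    # See https://westhealth.github.io/exploring-fillable-forms-with-pdfrw.html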
    radio_fields_by_id = {}

    for page_index, page in enumerate(reader.pages):
        annotations = page.get('/Annots', [])
        for ann in annotations:
            field_id = get_full_annotation_field_id(ann)
            if field_id in field_info_by_id:
                field_info_by_id[field_id]["page"] = page_index + 1
                field_info_by_id[field_id]["rect"] = ann.get('/Rect')
            elif field_id in possible_radio_names:
                try:
                    # ann['/AP']['/N'] should have two items. One of them is '/Off',
                    # the other is the active value.
                    on_values = [v for v in ann["/AP"]["/N"] if v != "/Off"]
                except KeyError:
                    continue
                if len(on_values) == 1:
                    rect = ann.get("/Rect")
                    if field_id not in radio_fields_by_id:
                        radio_fields_by_id[field_id] = {
                            "field_id": field_id,
                            "type": "radio_group",
                            "page": page_index + 1,
                            "radio_options": [],
                        }
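                    # Note: at least on macOS 15.7, Preview.app does not show selected
                    # radio buttons correctly. (It does if you remove the leading slash from the value,
                    # but then they do not display correctly in Chrome/Firefox/Acrobat/etc.)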
                    radio_fields_by_id[field_id]["radio_options"].append({
                        "value": on_values[0],
                        "rect": rect,
                    })

    # Some PDFs have form field definitions without corresponding annotations,
    # so we cannot tell where they are. Ignore these fields for now.
    fields_with_location = []
    for field_info in field_info_by_id.values():
        if "page" in field_info:
            fields_with_location.append(field_info)
        else:
            print(f"Unable to determine location for field id: {field_info.get('field_id')}, ignoring")

    # Sort by page number, then Y position (flipped in the PDF coordinate system), then X.
    def sort_key(f):
        if "radio_options" in f:
            rect = f["radio_options"][0]["rect"] or [0, 0, 0, 0]
        else:
            rect = f.get("rect") or [0, 0, 0, 0]
        adjusted_position = [-rect[1], rect[0]]
        return [f.get("page"), adjusted_position]

    sorted_fields = fields_with_location + list(radio_fields_by_id.values())
    sorted_fields.sort(key=sort_key)

    return sorted_fields


def write_field_info(pdf_path: str, json_output_path: str):
    reader = PdfReader(pdf_path)
    field_info = get_field_info(reader)
    with open(json_output_path, "w") as f:
        json.dump(field_info, f, indent=2)
    print(f"已将 {len(field_info)} 个字段写入 {json_output_path}")


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: extract_form_field_info.py [input pdf] [output json]")
        sys.exit(1)
    write_field_info(sys.argv[1], sys.argv[2])

```
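Two small pieces of the script above are easy to check in isolation: building a dotted field ID by walking `/Parent` links, and ordering fields top-to-bottom within a page by negating the Y coordinate. The sketch below uses plain dicts as stand-ins for pypdf annotation objects; the field names and rectangles are hypothetical.

```python
def full_field_id(annotation):
    """Join the /T names from the root ancestor down to this annotation."""
    components = []
    while annotation:
        name = annotation.get("/T")
        if name:
            components.append(name)
        annotation = annotation.get("/Parent")
    return ".".join(reversed(components)) if components else None


def sort_key(field):
    """Order by page, then top-to-bottom (PDF Y grows upward), then left-to-right."""
    rect = field.get("rect") or [0, 0, 0, 0]
    return [field.get("page"), [-rect[1], rect[0]]]


# Hypothetical three-level field hierarchy.
form = {"/T": "form1"}
section = {"/T": "address", "/Parent": form}
widget = {"/T": "city", "/Parent": section}
field_id = full_field_id(widget)
print(field_id)

fields = [
    {"field_id": "b", "page": 1, "rect": [50, 100, 150, 120]},
    {"field_id": "a", "page": 1, "rect": [50, 700, 150, 720]},
    {"field_id": "c", "page": 2, "rect": [50, 700, 150, 720]},
]
ordered = [f["field_id"] for f in sorted(fields, key=sort_key)]
print(ordered)
```

Field `a` sorts before `b` even though it appears later in the list, because its larger Y value places it nearer the top of the page.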

### scripts/fill_fillable_fields.py

```python
import json
import sys

from pypdf import PdfReader, PdfWriter

from extract_form_field_info import get_field_info


# Fills fillable form fields in a PDF. See forms.md.


def fill_pdf_fields(input_pdf_path: str, fields_json_path: str, output_pdf_path: str):
    with open(fields_json_path) as f:
        fields = json.load(f)
    # Group by page number.
    fields_by_page = {}
    for field in fields:
        if "value" in field:
            field_id = field["field_id"]
            page = field["page"]
            if page not in fields_by_page:
                fields_by_page[page] = {}
            fields_by_page[page][field_id] = field["value"]

    reader = PdfReader(input_pdf_path)

    has_error = False
    field_info = get_field_info(reader)
    fields_by_ids = {f["field_id"]: f for f in field_info}
    for field in fields:
        existing_field = fields_by_ids.get(field["field_id"])
        if not existing_field:
            has_error = True
            print(f"ERROR: `{field['field_id']}` is not a valid field ID")
        elif field["page"] != existing_field["page"]:
            has_error = True
            print(f"ERROR: Incorrect page number for `{field['field_id']}` (got {field['page']}, expected {existing_field['page']})")
        else:
            if "value" in field:
                err = validation_error_for_field_value(existing_field, field["value"])
                if err:
                    print(err)
                    has_error = True
    if has_error:
        sys.exit(1)

    writer = PdfWriter(clone_from=reader)
    for page, field_values in fields_by_page.items():
        writer.update_page_form_field_values(writer.pages[page - 1], field_values, auto_regenerate=False)

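    # This seems to be necessary for many PDF viewers to format the form values correctly.
    # It may cause the viewer to show a "save changes" dialog even if the user makes no changes.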
    writer.set_need_appearances_writer(True)

    with open(output_pdf_path, "wb") as f:
        writer.write(f)


def validation_error_for_field_value(field_info, field_value):
    field_type = field_info["type"]
    field_id = field_info["field_id"]
    if field_type == "checkbox":
        checked_val = field_info["checked_value"]
        unchecked_val = field_info["unchecked_value"]
        if field_value != checked_val and field_value != unchecked_val:
            return f'ERROR: Invalid value "{field_value}" for checkbox field "{field_id}". The checked value is "{checked_val}" and the unchecked value is "{unchecked_val}"'
    elif field_type == "radio_group":
        option_values = [opt["value"] for opt in field_info["radio_options"]]
        if field_value not in option_values:
            return f'ERROR: Invalid value "{field_value}" for radio group field "{field_id}". Valid values are: {option_values}'
    elif field_type == "choice":
        choice_values = [opt["value"] for opt in field_info["choice_options"]]
        if field_value not in choice_values:
            return f'ERROR: Invalid value "{field_value}" for choice field "{field_id}". Valid values are: {choice_values}'
    return None


# pypdf (at least version 5.7.0) has a bug when setting the value of a selection list field.
# In _writer.py, around line 966:
#
# if field.get(FA.FT, "/Tx") == "/Ch" and field_flags & FA.FfBits.Combo == 0:
#     txt = "\n".join(annotation.get_inherited(FA.Opt, []))
#
# The problem is that for selection lists, `get_inherited` returns a list of two-element lists like:
# [["value1", "Text 1"], ["value2", "Text 2"], ...]
# This causes `join` to raise a TypeError because it expects an iterable of strings.
# The ugly workaround is to patch `get_inherited` to return a list of the value strings.
# We call the original method and adjust the return value only when the argument to
# `get_inherited` is `FA.Opt` and the return value is a list of two-element lists.
def monkeypatch_pypdf_method():
    from pypdf.generic import DictionaryObject
    from pypdf.constants import FieldDictionaryAttributes

    original_get_inherited = DictionaryObject.get_inherited

    def patched_get_inherited(self, key: str, default = None):
        result = original_get_inherited(self, key, default)
        if key == FieldDictionaryAttributes.Opt:
            if isinstance(result, list) and all(isinstance(v, list) and len(v) == 2 for v in result):
                result = [r[0] for r in result]
        return result

    DictionaryObject.get_inherited = patched_get_inherited


if __name__ == "__main__":
    if len(sys.argv) != 4:
        print("Usage: fill_fillable_fields.py [input pdf] [field_values.json] [output pdf]")
        sys.exit(1)
    monkeypatch_pypdf_method()
    input_pdf = sys.argv[1]
    fields_json = sys.argv[2]
    output_pdf = sys.argv[3]
    fill_pdf_fields(input_pdf, fields_json, output_pdf)

```
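The script above reads a JSON list of field entries and keeps only those carrying a `value`, grouped by page, before handing each group to `update_page_form_field_values`. A minimal sketch of that grouping step follows; the field IDs and the `/On` checkbox value are hypothetical, and the authoritative file format is the one described in forms.md.

```python
import json

# Hypothetical field_values.json content: the last entry has no "value" and is skipped.
field_values_json = """
[
  {"field_id": "last_name", "page": 1, "value": "Simpson"},
  {"field_id": "over_18", "page": 1, "value": "/On"},
  {"field_id": "notes", "page": 2}
]
"""


def group_values_by_page(fields):
    """Map page number -> {field_id: value} for entries that specify a value."""
    by_page = {}
    for field in fields:
        if "value" in field:
            by_page.setdefault(field["page"], {})[field["field_id"]] = field["value"]
    return by_page


grouped = group_values_by_page(json.loads(field_values_json))
print(grouped)
```

Pages with no values to set never appear in the result, so the writer only touches pages that actually change.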

### scripts/fill_pdf_form_with_annotations.py

```python
import json
import sys

from pypdf import PdfReader, PdfWriter
from pypdf.annotations import FreeText


# Fills a PDF by adding text annotations defined in `fields.json`. See forms.md.


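def transform_coordinates(bbox, image_width, image_height, pdf_width, pdf_height):
    """Convert a bounding box from image coordinates to PDF coordinates"""
    # Image coordinates: origin at the top-left, y increases downward
    # PDF coordinates: origin at the bottom-left, y increases upward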
    x_scale = pdf_width / image_width
    y_scale = pdf_height / image_height

    left = bbox[0] * x_scale
    right = bbox[2] * x_scale

    # Flip the Y coordinates for PDF
    top = pdf_height - (bbox[1] * y_scale)
    bottom = pdf_height - (bbox[3] * y_scale)

    return left, bottom, right, top


def fill_pdf_form(input_pdf_path, fields_json_path, output_pdf_path):
    """Fill in the PDF form with data from fields.json"""

    # The `fields.json` format is described in forms.md.
    with open(fields_json_path, "r") as f:
        fields_data = json.load(f)

    # Open the PDF
    reader = PdfReader(input_pdf_path)
    writer = PdfWriter()

    # Copy all pages to the writer
    writer.append(reader)

    # Get the PDF dimensions of each page
    pdf_dimensions = {}
    for i, page in enumerate(reader.pages):
        mediabox = page.mediabox
        pdf_dimensions[i + 1] = [mediabox.width, mediabox.height]

    # Process each form field
    annotations = []
    for field in fields_data["form_fields"]:
        page_num = field["page_number"]

        # Get the page dimensions and transform the coordinates.
        page_info = next(p for p in fields_data["pages"] if p["page_number"] == page_num)
        image_width = page_info["image_width"]
        image_height = page_info["image_height"]
        pdf_width, pdf_height = pdf_dimensions[page_num]

        transformed_entry_box = transform_coordinates(
            field["entry_bounding_box"],
            image_width, image_height,
            pdf_width, pdf_height
        )

        # Skip empty fields
        if "entry_text" not in field or "text" not in field["entry_text"]:
            continue
        entry_text = field["entry_text"]
        text = entry_text["text"]
        if not text:
            continue

        font_name = entry_text.get("font", "Arial")
        font_size = str(entry_text.get("font_size", 14)) + "pt"
        font_color = entry_text.get("font_color", "000000")

        # Font size/color do not seem to work reliably across viewers:
        # https://github.com/py-pdf/pypdf/issues/2084
        annotation = FreeText(
            text=text,
            rect=transformed_entry_box,
            font=font_name,
            font_size=font_size,
            font_color=font_color,
            border_color=None,
            background_color=None,
        )
        annotations.append(annotation)
        # page_number is 0-based in pypdf
        writer.add_annotation(page_number=page_num - 1, annotation=annotation)

    # Save the filled PDF
    with open(output_pdf_path, "wb") as output:
        writer.write(output)

    print(f"Successfully filled the PDF form and saved it to {output_pdf_path}")
    print(f"Added {len(annotations)} text annotations")


if __name__ == "__main__":
    if len(sys.argv) != 4:
        print("Usage: fill_pdf_form_with_annotations.py [input pdf] [fields.json] [output pdf]")
        sys.exit(1)
    input_pdf = sys.argv[1]
    fields_json = sys.argv[2]
    output_pdf = sys.argv[3]

    fill_pdf_form(input_pdf, fields_json, output_pdf)

```
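The coordinate transform in the script above does two things: it scales pixel positions to PDF points, and it flips the Y axis because images put the origin at the top-left while PDF puts it at the bottom-left. A worked example with made-up dimensions (a 1000 x 1000 pixel render of a 500 x 500 point page) shows the effect:

```python
def transform_coordinates(bbox, image_width, image_height, pdf_width, pdf_height):
    """Convert [left, top, right, bottom] in image pixels to (left, bottom, right, top) in PDF points."""
    x_scale = pdf_width / image_width
    y_scale = pdf_height / image_height

    left = bbox[0] * x_scale
    right = bbox[2] * x_scale

    # Flip Y: a small image y (near the top) becomes a large PDF y.
    top = pdf_height - (bbox[1] * y_scale)
    bottom = pdf_height - (bbox[3] * y_scale)

    return left, bottom, right, top


# Hypothetical entry box 200 px from the top of the image and 40 px tall.
box = transform_coordinates([100, 200, 300, 240], 1000, 1000, 500, 500)
print(box)
```

The image-space top edge (y=200) maps to the larger PDF value and the bottom edge (y=240) to the smaller one, so the returned rectangle is ordered left, bottom, right, top, as a PDF `/Rect` expects.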
