
news-extractor

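News site content extraction. Supports WeChat Official Accounts, Toutiao, NetEase News, Sohu News, and Tencent News. Activates when the user needs to extract news content, fetch Official Account articles, crawl news, or get news as JSON/Markdown.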

Packaged view

This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.

Updated

March 20, 2026

Overall rating

C (4.1)

Composite score

4.1

Best-practice grade

B (70.4)

Install command

npx @skill-hub/cli install nanmicoder-claude-code-skills-news-extractor

Repository

NanmiCoder/claude-code-skills

Skill path: plugins/news-extractor/skills/news-extractor

Best for

Primary workflow: Ship Full Stack.

Technical facets: Full Stack.

Target audience: everyone.

License: Unknown.

Original source

Catalog source: SkillHub Club.

Repository owner: NanmiCoder.

This is still a mirrored public skill entry. Review the repository before installing into production workflows.

What it helps with

Install news-extractor into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
Review https://github.com/NanmiCoder/claude-code-skills before adding news-extractor to shared team environments
Use news-extractor for development workflows

Works across

Claude Code, Codex CLI, Gemini CLI, OpenCode

Favorites: 0.

Sub-skills: 0.

Aggregator: No.

Original source / Raw SKILL.md

---
name: news-extractor
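description: News site content extraction. Supports WeChat Official Accounts, Toutiao, NetEase News, Sohu News, and Tencent News. Activates when the user needs to extract news content, fetch Official Account articles, crawl news, or get news as JSON/Markdown.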
---

# News Extractor Skill

Extracts article content from mainstream news platforms and outputs it as JSON and Markdown.

## Supported Platforms

| Platform | ID | URL example |
|------|-----|----------|
| WeChat Official Accounts (微信公众号) | wechat | `https://mp.weixin.qq.com/s/xxxxx` |
| Toutiao (今日头条) | toutiao | `https://www.toutiao.com/article/123456/` |
| NetEase News (网易新闻) | netease | `https://www.163.com/news/article/ABC123.html` |
| Sohu News (搜狐新闻) | sohu | `https://www.sohu.com/a/123456_789` |
| Tencent News (腾讯新闻) | tencent | `https://news.qq.com/rain/a/20251016A07W8J00` |

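## Installing Dependencies

Install the dependencies before first use. Pick whichever of the following methods suits your environment:

### Option 1: uv (recommended)

```bash
cd .claude/skills/news-extractor

# Install dependencies and create the virtual environment
uv sync

# uv run picks up the virtual environment automatically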
uv run scripts/extract_news.py --list-platforms
```

### Option 2: pip

```bash
cd .claude/skills/news-extractor

# Create a virtual environment (optional but recommended)
python -m venv .venv
source .venv/bin/activate  # Linux/macOS
# .venv\Scripts\activate   # Windows

# Install dependencies
pip install -r requirements.txt

# Run the script
python scripts/extract_news.py --list-platforms
```

### Dependencies

| Package | Purpose |
|------|------|
| pydantic | Data model validation |
| requests | HTTP requests |
| curl_cffi | Browser-impersonating fetches |
| tenacity | Retry logic |
| parsel | HTML/XPath parsing |
| demjson3 | Parsing non-standard JSON |

## Usage

### Basic Usage

```bash
# Extract an article, auto-detect the platform, write JSON + Markdown
uv run .claude/skills/news-extractor/scripts/extract_news.py "URL"

# Choose the output directory
uv run .claude/skills/news-extractor/scripts/extract_news.py "URL" --output ./output

# JSON only
uv run .claude/skills/news-extractor/scripts/extract_news.py "URL" --format json

# Markdown only
uv run .claude/skills/news-extractor/scripts/extract_news.py "URL" --format markdown

# List the supported platforms
uv run .claude/skills/news-extractor/scripts/extract_news.py --list-platforms
```

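### Output Files

By default the script writes both formats to the output directory (default `./output`):
- `{news_id}.json` - structured JSON data
- `{news_id}.md` - the article in Markdown

When building the file name, the script replaces every character in `news_id` other than letters, digits, `-`, and `_` with `_`, and falls back to `untitled` when the ID is empty.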

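## Workflow

1. **Receive URL** - the user supplies a news link
2. **Detect platform** - the platform type is identified automatically
3. **Extract content** - the matching crawler fetches and parses the page
4. **Convert formats** - JSON and Markdown are generated
5. **Write files** - the output is saved to the target directory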
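
The five steps above can be sketched end to end as follows. This is a minimal illustration rather than the skill's implementation: `PATTERNS`, `detect`, `fake_fetch`, and `run_pipeline` are hypothetical names, `fake_fetch` stands in for a real crawler so nothing touches the network, and only the JSON half of the output is shown.

```python
import json
import re
from pathlib import Path
from typing import Callable, Dict, Optional

# Hypothetical, deliberately small pattern table (one platform only).
PATTERNS: Dict[str, str] = {
    "toutiao": r"https?://www\.toutiao\.com/article/",
}


def detect(url: str) -> Optional[str]:
    """Step 2: return the first platform ID whose pattern matches the URL."""
    for platform_id, pattern in PATTERNS.items():
        if re.match(pattern, url):
            return platform_id
    return None


def fake_fetch(url: str) -> dict:
    """Step 3 stand-in: a real crawler would download and parse the page."""
    return {
        "title": "Example title",
        "news_url": url,
        "news_id": "123456",
        "texts": ["First paragraph."],
    }


def run_pipeline(
    url: str,
    output_dir: str,
    fetch: Callable[[str], dict] = fake_fetch,
) -> Path:
    """Steps 1 to 5, restricted to the JSON output."""
    if detect(url) is None:
        raise ValueError(f"unsupported URL: {url}")
    article = fetch(url)
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    json_path = out / f"{article['news_id']}.json"
    json_path.write_text(
        json.dumps(article, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return json_path
```

Passing the fetcher in as a parameter keeps the stages separable, so a stub can replace the network call when trying the flow out.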

## Output Format

### JSON Structure

```json
{
  "title": "Article title",
  "news_url": "Original link",
  "news_id": "Article ID",
  "meta_info": {
    "author_name": "Author/source",
    "author_url": "",
    "publish_time": "2024-01-01 12:00"
  },
  "contents": [
    {"type": "text", "content": "Paragraph text", "desc": ""},
    {"type": "image", "content": "https://...", "desc": ""},
    {"type": "video", "content": "https://...", "desc": ""}
  ],
  "texts": ["Paragraph 1", "Paragraph 2"],
  "images": ["Image URL 1", "Image URL 2"],
  "videos": []
}
```
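
A consumer of the JSON file might read it along these lines. This is a sketch under assumptions: `ArticleSummary`, `summarize`, and `blocks_of_type` are hypothetical helpers, and the sample treats `contents` as the ordered block list with `texts`, `images`, and `videos` as flat per-type lists, which is how the structure above reads but is not spelled out by the source.

```python
import json
from dataclasses import dataclass
from typing import List

# Sample document following the structure shown above (values are made up).
SAMPLE = """
{
  "title": "Example title",
  "news_url": "https://www.sohu.com/a/123456_789",
  "news_id": "123456_789",
  "meta_info": {
    "author_name": "Example author",
    "author_url": "",
    "publish_time": "2024-01-01 12:00"
  },
  "contents": [
    {"type": "text", "content": "First paragraph", "desc": ""},
    {"type": "image", "content": "https://example.com/a.png", "desc": ""},
    {"type": "text", "content": "Second paragraph", "desc": ""}
  ],
  "texts": ["First paragraph", "Second paragraph"],
  "images": ["https://example.com/a.png"],
  "videos": []
}
"""


@dataclass
class ArticleSummary:
    title: str
    author: str
    publish_time: str
    paragraphs: int
    images: int
    videos: int


def summarize(raw_json: str) -> ArticleSummary:
    """Collect the headline fields and per-type counts from one JSON document."""
    data = json.loads(raw_json)
    meta = data.get("meta_info", {})
    return ArticleSummary(
        title=data.get("title", ""),
        author=meta.get("author_name", ""),
        publish_time=meta.get("publish_time", ""),
        paragraphs=len(data.get("texts", [])),
        images=len(data.get("images", [])),
        videos=len(data.get("videos", [])),
    )


def blocks_of_type(raw_json: str, kind: str) -> List[str]:
    """Return the `content` of every block of one type, in document order."""
    data = json.loads(raw_json)
    return [b["content"] for b in data.get("contents", []) if b.get("type") == kind]
```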

### Markdown Structure

```markdown
# Article title

## Article info
**Author**: xxx
**Published**: 2024-01-01 12:00
**Original link**: [link](URL)

---

## Body

Paragraph content...

![image](URL)

---

## Media
### Images (N)
1. URL1
2. URL2
```
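
To make the mapping from the JSON fields to this layout concrete, here is a small rendering sketch. It is not the skill's `formatter.to_markdown`, whose source is not included here; `render_markdown` is a hypothetical function, and its English section labels are assumptions chosen for illustration.

```python
from typing import Dict, List


def render_markdown(article: Dict) -> str:
    """Render an article dict (JSON structure above) into the layout above."""
    meta = article.get("meta_info", {})
    lines: List[str] = [f"# {article.get('title', '')}", "", "## Article info"]
    if meta.get("author_name"):
        lines.append(f"**Author**: {meta['author_name']}")
    if meta.get("publish_time"):
        lines.append(f"**Published**: {meta['publish_time']}")
    lines.append(f"**Original link**: [link]({article.get('news_url', '')})")
    lines += ["", "---", "", "## Body", ""]

    # Body blocks are emitted in the order they appear in `contents`.
    for block in article.get("contents", []):
        if block.get("type") == "text":
            lines += [block["content"], ""]
        elif block.get("type") == "image":
            lines += [f"![image]({block['content']})", ""]

    # The media section is a numbered list of every image URL.
    images = article.get("images", [])
    if images:
        lines += ["---", "", "## Media", f"### Images ({len(images)})"]
        lines += [f"{i}. {url}" for i, url in enumerate(images, start=1)]
    return "\n".join(lines).rstrip() + "\n"
```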

## Examples

### Extracting a WeChat Official Account Article

```bash
uv run .claude/skills/news-extractor/scripts/extract_news.py \
  "https://mp.weixin.qq.com/s/ebMzDPu2zMT_mRgYgtL6eQ"
```

Output:
```
[INFO] Platform detected: wechat (微信公众号)
[INFO] Extracting content...
[INFO] Title: Article title
[INFO] Author: Official Account name
[INFO] Text paragraphs: 15
[INFO] Images: 3
[SUCCESS] Saved: ./output/ebMzDPu2zMT_mRgYgtL6eQ.json
[SUCCESS] Saved: ./output/ebMzDPu2zMT_mRgYgtL6eQ.md
```

### Extracting a Toutiao Article

```bash
uv run .claude/skills/news-extractor/scripts/extract_news.py \
  "https://www.toutiao.com/article/7434425099895210546/"
```

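## Error Handling

| Error | Description | Resolution |
|----------|------|----------|
| `无法识别该平台` ("platform not recognized") | The URL matches none of the supported platforms | Check that the URL is correct |
| `平台不支持` ("platform not supported") | The site is not a supported one | This Skill supports only the news sites listed above |
| `提取失败` ("extraction failed") | Network error or a change in page structure | Retry, or check that the URL is valid |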
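
The "retry" advice for extraction failures can be sketched with a small backoff helper. The skill lists `tenacity` among its dependencies for retries; the version below uses only the standard library so it stays self-contained, and `retry` with its parameters is a hypothetical helper. It retries on `ValueError` because that is the exception the bundled script reports as an extraction failure.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry(
    action: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `action` until it succeeds, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[ValueError] = None
    for attempt in range(attempts):
        try:
            return action()
        except ValueError as error:
            last_error = error
            # No point in waiting after the final attempt.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    assert last_error is not None
    raise last_error
```

A caller would wrap the extraction call, for example `retry(lambda: crawler.run(persist=False))`, and keep `attempts` small so that a page whose structure has changed fails quickly instead of being requested repeatedly.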

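## Notes

- For educational and research use only
- Do not crawl at scale
- Respect the target site's robots.txt and terms of service
- WeChat Official Accounts may require a valid Cookie (the current default configuration usually works)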
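
One way to honor the robots.txt note is to check a URL against the site's rules before fetching it. The sketch below uses the standard library's `urllib.robotparser` on robots.txt text supplied by the caller, so it runs offline; `is_allowed` and the sample rules are hypothetical, and nothing here suggests that the skill itself performs this check.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules for illustration only.
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""
```

In practice the robots.txt text would first be downloaded from the target site, and a pause between requests helps keep the crawl small.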

## References

- [Platform URL patterns](references/platform-patterns.md)


---

## Referenced Files

> The following files are referenced in this skill and included for context.

### references/platform-patterns.md

```markdown
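# URL Patterns for Chinese News Platforms

This document describes each platform's URL format and special considerations.

## Platform URL Patterns

### WeChat Official Accounts (wechat)

**URL format**:
```
https://mp.weixin.qq.com/s/{article_id}
https://mp.weixin.qq.com/s?__biz=xxx&mid=xxx&idx=xxx&sn=xxx
```

**Regex pattern**:
```regex
https?://mp\.weixin\.qq\.com/s/
```

**Characteristics**:
- Supports traditional pages and SSR-rendered pages (Xiaohongshu style)
- Uses `curl_cffi` to impersonate Chrome
- May require a Cookie (the current default configuration usually works)

**Examples**:
- `https://mp.weixin.qq.com/s/ebMzDPu2zMT_mRgYgtL6eQ`
- `https://mp.weixin.qq.com/s/RUHJpS9w3RhuhEm94z-1Kw` (SSR-rendered)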

---

### Toutiao (toutiao)

**URL format**:
```
https://www.toutiao.com/article/{article_id}/
https://www.toutiao.com/article/{article_id}/?log_from=xxx
```

**Regex pattern**:
```regex
https?://www\.toutiao\.com/article/
```

**Characteristics**:
- Article IDs are usually long numeric strings
- The URL may or may not end with a slash
- May include a `log_from` tracking parameter

**Examples**:
- `https://www.toutiao.com/article/7434425099895210546/`
- `https://www.toutiao.com/article/7404384826024935990/?log_from=xxx`

---

### NetEase News (netease)

**URL format**:
```
https://www.163.com/news/article/{article_id}.html
https://www.163.com/dy/article/{article_id}.html
```

**Regex pattern**:
```regex
https?://www\.163\.com/(news|dy)/article/
```

**Characteristics**:
- Supports both the `news` and `dy` (subscription account) paths
- Article IDs are usually uppercase alphanumeric strings
- Content lives in `div.post_body`

**Examples**:
- `https://www.163.com/news/article/KC12OUHK000189FH.html`
- `https://www.163.com/dy/article/JK7AUN4S0514R9OJ.html`

---

### Sohu News (sohu)

**URL format**:
```
https://www.sohu.com/a/{article_id}_{source_id}
```

**Regex pattern**:
```regex
https?://www\.sohu\.com/a/
```

**Characteristics**:
- The URL contains an article ID and a source ID, separated by an underscore
- Image URLs may be encrypted; the real URLs have to be extracted from JavaScript
- Content lives in `article#mp-editor`

**Examples**:
- `https://www.sohu.com/a/945014338_160447`

---

### Tencent News (tencent)

**URL format**:
```
https://news.qq.com/rain/a/{article_id}
```

**Regex pattern**:
```regex
https?://news\.qq\.com/rain/a/
```

**Characteristics**:
- Article IDs combine a date with letters and digits
- Metadata is stored in the `window.DATA` JavaScript variable
- Content lives in `div.rich_media_content`

**Examples**:
- `https://news.qq.com/rain/a/20251016A07W8J00`

---

## Platform Detection Logic

The detector tries the regular expressions in order:

```python
PLATFORM_PATTERNS = {
    "toutiao": r"https?://www\.toutiao\.com/article/",
    "wechat": r"https?://mp\.weixin\.qq\.com/s/",
    "netease": r"https?://www\.163\.com/(news|dy)/article/",
    "sohu": r"https?://www\.sohu\.com/a/",
    "tencent": r"https?://news\.qq\.com/rain/a/",
}
```

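## FAQ

### Q: Why did extraction fail?

Possible causes:
1. **Expired Cookie** - try updating the Cookie configuration
2. **Page structure change** - the platform may have updated its page template
3. **Anti-crawling measures** - requests were frequent enough to trigger rate limiting
4. **Network problems** - check your network connection

### Q: How do I get a new Cookie?

1. Open the target platform in a browser
2. Open the developer tools (F12)
3. Visit an article
4. Find the main request in the Network panel
5. Copy the Cookie header

### Q: Images don't display?

Some platforms (such as WeChat) protect their images against hotlinking:
- WeChat images require a specific Referer
- Sohu images may be encrypted Base64
- When needed, use `embed_images=True` to embed the images in the Markdown

### Q: Is batch extraction supported?

The current version supports only single-URL extraction. For batch extraction you can:
1. Use a shell loop
2. Or call `news_extractor_core.services.extractor.ExtractorService` directly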

```
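
The ordered matching described in `references/platform-patterns.md` can be sketched as below, using the `PLATFORM_PATTERNS` table exactly as listed there. `match_platform` is a hypothetical name; the real `detector.detect_platform` is not included here and may behave differently. One point to verify against the real detector: with the patterns as listed, the query-string WeChat form (`https://mp.weixin.qq.com/s?__biz=...`) is not matched, because the `wechat` pattern requires `/s/`.

```python
import re
from typing import Optional

# Copied from the pattern table in references/platform-patterns.md.
PLATFORM_PATTERNS = {
    "toutiao": r"https?://www\.toutiao\.com/article/",
    "wechat": r"https?://mp\.weixin\.qq\.com/s/",
    "netease": r"https?://www\.163\.com/(news|dy)/article/",
    "sohu": r"https?://www\.sohu\.com/a/",
    "tencent": r"https?://news\.qq\.com/rain/a/",
}


def match_platform(url: str) -> Optional[str]:
    """Return the first platform ID whose pattern matches the start of `url`."""
    # Dicts preserve insertion order, so patterns are tried top to bottom.
    for platform_id, pattern in PLATFORM_PATTERNS.items():
        if re.match(pattern, url):
            return platform_id
    return None
```

Because `re.match` anchors at the start of the string, a supported URL embedded inside another URL's query string is not picked up by mistake.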

### scripts/extract_news.py

```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Chinese news extraction script

Usage:
    uv run extract_news.py "URL" [--output DIR] [--format json|markdown|both]

Supported platforms:
    - WeChat Official Accounts (wechat)
    - Toutiao (toutiao)
    - NetEase News (netease)
    - Sohu News (sohu)
    - Tencent News (tencent)
"""

import argparse
import json
import sys
from pathlib import Path
from typing import Optional

# Put the script directory on sys.path so the sibling modules can be imported
SCRIPT_DIR = Path(__file__).resolve().parent
sys.path.insert(0, str(SCRIPT_DIR))

from models import NewsItem
from detector import detect_platform, get_platform_name, PLATFORM_NAMES
from formatter import to_markdown
from crawlers.wechat import WeChatNewsCrawler
from crawlers.toutiao import ToutiaoNewsCrawler
from crawlers.netease import NeteaseNewsCrawler
from crawlers.sohu import SohuNewsCrawler
from crawlers.tencent import TencentNewsCrawler


# Map platform IDs to crawler classes
CRAWLERS = {
    "wechat": WeChatNewsCrawler,
    "toutiao": ToutiaoNewsCrawler,
    "netease": NeteaseNewsCrawler,
    "sohu": SohuNewsCrawler,
    "tencent": TencentNewsCrawler,
}


def log_info(msg: str) -> None:
    """Print an informational message."""
    print(f"[INFO] {msg}")


def log_success(msg: str) -> None:
    """Print a success message."""
    print(f"[SUCCESS] {msg}")


def log_error(msg: str) -> None:
    """Print an error message to stderr."""
    print(f"[ERROR] {msg}", file=sys.stderr)


def extract_news(
    url: str,
    output_dir: str = "./output",
    output_format: str = "both",
    platform: Optional[str] = None,
) -> int:
    """
    Extract the content of a news article.

    Args:
        url: News URL
        output_dir: Output directory
        output_format: Output format (json, markdown, both)
        platform: Platform ID (optional; auto-detected by default)

    Returns:
        0 on success, 1 on failure
    """
    # 1. Detect the platform
    detected_platform = platform or detect_platform(url)

    if not detected_platform:
        log_error("无法识别该平台，请检查 URL 是否正确")
        log_info("支持的中国新闻平台:")
        for pid, pname in PLATFORM_NAMES.items():
            log_info(f"  - {pname} ({pid})")
        return 1

    # Make sure a crawler exists for the detected platform
    if detected_platform not in CRAWLERS:
        log_error(f"平台 '{detected_platform}' 不支持")
        log_info("支持的中国新闻平台:")
        for pid, pname in PLATFORM_NAMES.items():
            log_info(f"  - {pname} ({pid})")
        return 1

    platform_name = get_platform_name(detected_platform)
    log_info(f"Platform detected: {detected_platform} ({platform_name})")

    # 2. Extract the content
    log_info("Extracting content...")
    try:
        crawler_class = CRAWLERS[detected_platform]
        crawler = crawler_class(url, save_path=output_dir)
        news_item = crawler.run(persist=False)  # skip the crawler's built-in saving
    except ValueError as e:
        log_error(f"提取失败: {e}")
        return 1
    except Exception as e:
        log_error(f"未知错误: {e}")
        return 1

    # 3. Print a summary of the extracted article
    log_info(f"Title: {news_item.title}")
    if news_item.meta_info.author_name:
        log_info(f"Author: {news_item.meta_info.author_name}")
    if news_item.meta_info.publish_time:
        log_info(f"Publish time: {news_item.meta_info.publish_time}")
    log_info(f"Text paragraphs: {len(news_item.texts)}")
    log_info(f"Images: {len(news_item.images)}")
    if news_item.videos:
        log_info(f"Videos: {len(news_item.videos)}")

    # 4. Create the output directory
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    # 5. Write the output files
    news_id = news_item.news_id or "untitled"
    # Replace characters that are unsafe in file names
    safe_id = "".join(c if c.isalnum() or c in "-_" else "_" for c in news_id)

    # JSON output
    if output_format in ("json", "both"):
        json_file = output_path / f"{safe_id}.json"
        with open(json_file, "w", encoding="utf-8") as f:
            json.dump(news_item.to_dict(), f, ensure_ascii=False, indent=2)
        log_success(f"Saved: {json_file}")

    # Markdown output
    if output_format in ("markdown", "both"):
        md_file = output_path / f"{safe_id}.md"
        markdown_content = to_markdown(news_item, platform=detected_platform)
        with open(md_file, "w", encoding="utf-8") as f:
            f.write(markdown_content)
        log_success(f"Saved: {md_file}")

    return 0


def list_platforms() -> None:
    """Print the supported platforms."""
    print("支持的中国新闻平台:")
    print()
    for pid, pname in PLATFORM_NAMES.items():
        print(f"  {pname} ({pid})")
    print()
    print("URL 格式示例:")
    print("  - 微信公众号: https://mp.weixin.qq.com/s/xxxxx")
    print("  - 今日头条:   https://www.toutiao.com/article/123456/")
    print("  - 网易新闻:   https://www.163.com/news/article/ABC123.html")
    print("  - 搜狐新闻:   https://www.sohu.com/a/123456_789")
    print("  - 腾讯新闻:   https://news.qq.com/rain/a/20251016A07W8J00")


def main() -> int:
    """Main entry point; returns the process exit code."""
    parser = argparse.ArgumentParser(
        description="中国新闻内容提取工具",
        epilog="支持平台: 微信公众号、今日头条、网易新闻、搜狐新闻、腾讯新闻",
    )
    parser.add_argument(
        "url",
        nargs="?",
        help="新闻 URL",
    )
    parser.add_argument(
        "--output", "-o",
        default="./output",
        help="输出目录 (默认: ./output)",
    )
    parser.add_argument(
        "--format", "-f",
        choices=["json", "markdown", "both"],
        default="both",
        help="输出格式 (默认: both)",
    )
    parser.add_argument(
        "--platform", "-p",
        choices=list(PLATFORM_NAMES.keys()),
        help="指定平台 (可选，默认自动检测)",
    )
    parser.add_argument(
        "--list-platforms",
        action="store_true",
        help="列出支持的平台",
    )

    args = parser.parse_args()

    # List platforms
    if args.list_platforms:
        list_platforms()
        return 0

    # Make sure a URL was provided
    if not args.url:
        parser.print_help()
        return 1

    # Run the extraction
    return extract_news(
        url=args.url,
        output_dir=args.output,
        output_format=args.format,
        platform=args.platform,
    )


if __name__ == "__main__":
    sys.exit(main())

```
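
Since the script handles one URL per run and exits with 0 or 1, batch extraction can be approximated by invoking it once per URL, the Python counterpart of the shell loop suggested in the FAQ. The sketch below is an assumption-laden illustration: `build_command` and `extract_many` are hypothetical helpers, the default script path assumes the current directory is the skill root, and the pause between runs is an arbitrary value chosen to keep the request rate low.

```python
import subprocess
import sys
import time
from typing import Callable, Dict, Iterable, List


def build_command(
    url: str,
    output_dir: str = "./output",
    script: str = "scripts/extract_news.py",
) -> List[str]:
    """Build the argv for one run, using the script's `--output` flag."""
    return [sys.executable, script, url, "--output", output_dir]


def extract_many(
    urls: Iterable[str],
    output_dir: str = "./output",
    script: str = "scripts/extract_news.py",
    pause_seconds: float = 2.0,
    runner: Callable = subprocess.run,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, int]:
    """Run the script once per URL and map each URL to its exit code."""
    results: Dict[str, int] = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(pause_seconds)  # wait between runs, not before the first
        completed = runner(build_command(url, output_dir, script))
        results[url] = completed.returncode
    return results
```

URLs that map to a non-zero exit code can be collected afterwards and retried or inspected by hand.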