wechat-article-extractor
Extract metadata and content from WeChat Official Account articles. Use when user needs to parse WeChat article URLs (mp.weixin.qq.com), extract article info (title, author, content, publish time, cover image), or convert WeChat articles to structured data. Supports various article types including posts, videos, images, voice messages, and reposts.
Packaged view
This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.
Install command
npx @skill-hub/cli install openclaw-skills-wechat-article-extractor-skill
Repository
Skill path: skills/freestylefly/wechat-article-extractor-skill
Best for
Primary workflow: Write Technical Docs.
Technical facets: Full Stack, Data / AI, Tech Writer.
Target audience: everyone.
License: Unknown.
Original source
Catalog source: SkillHub Club.
Repository owner: openclaw.
This is still a mirrored public skill entry. Review the repository before installing into production workflows.
What it helps with
- Install wechat-article-extractor into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
- Review https://github.com/openclaw/skills before adding wechat-article-extractor to shared team environments
- Use wechat-article-extractor for development workflows
Original source / Raw SKILL.md
---
name: wechat-article-extractor
description: Extract metadata and content from WeChat Official Account articles. Use when user needs to parse WeChat article URLs (mp.weixin.qq.com), extract article info (title, author, content, publish time, cover image), or convert WeChat articles to structured data. Supports various article types including posts, videos, images, voice messages, and reposts.
---
# WeChat Article Extractor
Extract metadata and content from WeChat Official Account (微信公众号) articles.
## Capabilities
- Parse WeChat article URLs (`mp.weixin.qq.com`)
- Extract article metadata: title, author, description, publish time
- Extract account info: name, avatar, alias, description
- Get article content (HTML)
- Get cover image URL
- Support multiple article types: post, video, image, voice, text, repost
- Handle various error cases: deleted content, expired links, access limits
## Usage
### Basic Extraction from URL
```javascript
const { extract } = require('./scripts/extract.js');

// Call from inside an async function; CommonJS modules have no top-level await.
const result = await extract('https://mp.weixin.qq.com/s?__biz=...');
// On success: { done: true, code: 0, data: {...} }
```
### Extraction from HTML
```javascript
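// sourceUrl is the article URL; passing it lets extract() fill in link parameters.
const sourceUrl = 'https://mp.weixin.qq.com/s?__biz=...';
const html = await fetch(sourceUrl).then(r => r.text());
const result = await extract(html, { url: sourceUrl });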
```
### Options
```javascript
const result = await extract(url, {
  shouldReturnContent: true,      // Return HTML content (default: true)
  shouldReturnRawMeta: false,     // Return raw metadata (default: false)
  shouldFollowTransferLink: true, // Follow migrated account links (default: true)
  shouldExtractMpLinks: false,    // Extract embedded mp.weixin links (default: false)
  shouldExtractTags: false,       // Extract article tags (default: false)
  shouldExtractRepostMeta: false  // Extract repost source info (default: false)
});
```
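When the optional flags are enabled, `scripts/extract.js` adds extra keys to `data`: `mp_links` and `mp_links_count`, `tags`, and `repost_meta`. The sketch below shows one way to read them defensively. `summarizeExtras` and the sample object are hypothetical and hand-written for illustration; they are not part of the skill and not real extractor output.

```javascript
// Hypothetical helper (not part of the skill): reads the optional fields
// that extract() adds when the corresponding should* flags are true.
function summarizeExtras(data) {
  const links = data.mp_links || []; // present only with shouldExtractMpLinks
  const tags = data.tags || [];      // present only with shouldExtractTags
  return {
    linkCount: links.length,
    linkTitles: links.map(link => link.title.trim()),
    tagNames: tags.map(tag => tag.name),
    // present only with shouldExtractRepostMeta, and only when a source account is found
    repostedFrom: data.repost_meta ? data.repost_meta.account_name : null
  };
}

// Hand-written sample shaped like `result.data`; the values are made up.
const sampleData = {
  msg_title: 'Example title',
  mp_links: [{ title: ' Related post ', href: 'https://mp.weixin.qq.com/s?__biz=...' }],
  mp_links_count: 1,
  tags: [{ id: null, url: undefined, name: 'nodejs', count: 12 }]
};

console.log(summarizeExtras(sampleData));
```

Defaulting the missing arrays to `[]` keeps the helper usable whichever flags the caller set.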
## Response Format
### Success Response
```javascript
{
  done: true,
  code: 0,
  data: {
    // Account info
    account_name: "Account name",
    account_alias: "WeChat ID",
    account_avatar: "Avatar URL",
    account_description: "Account description",
    account_id: "Original ID",
    account_biz: "biz parameter",
    account_biz_number: 1234567890,
    account_qr_code: "QR code URL",
    // Article info
    msg_title: "Article title",
    msg_desc: "Article summary",
    msg_content: "HTML content",
    msg_cover: "Cover image URL",
    msg_author: "Author",
    msg_type: "post", // post|video|image|voice|text|repost
    msg_has_copyright: true,
    msg_publish_time: Date, // JavaScript Date object
    msg_publish_time_str: "2024/01/15 10:30:00",
    // Link params
    msg_link: "Article link",
    msg_source_url: "Read-original link",
    msg_sn: "sn parameter",
    msg_mid: 1234567890,
    msg_idx: 1
  }
}
```
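`msg_publish_time` is a `Date` object and several fields can be `null`, so the result usually needs a small normalization step before it is stored as JSON. A minimal sketch, assuming a flat record is wanted: `toRecord` and its output keys are illustrative names, not part of the skill, and the sample input is made up.

```javascript
// Hypothetical helper: flattens `result.data` into a JSON-friendly record.
function toRecord(data) {
  const time = data.msg_publish_time;
  const isValidDate = time instanceof Date && !Number.isNaN(time.getTime());
  return {
    title: data.msg_title,
    // Fall back to the account name when the article has no author
    author: data.msg_author || data.account_name || null,
    account: data.account_name || null,
    type: data.msg_type,
    publishedAt: isValidDate ? time.toISOString() : null,
    link: data.msg_link || null
  };
}

// Made-up sample shaped like `result.data`.
const record = toRecord({
  account_name: 'Example Account',
  msg_title: 'Example title',
  msg_type: 'post',
  msg_publish_time: new Date(1705285800 * 1000),
  msg_link: null
});

console.log(JSON.stringify(record));
```

Using `toISOString()` stores the timestamp in UTC, whereas `msg_publish_time_str` is formatted in the local time zone of the machine that ran the extraction.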
### Error Response
```javascript
{
  done: false,
  code: 1001,
  msg: "无法获取文章信息" // "Unable to get article info"
}
```
## Error Codes
| Code | Message | Description |
|------|---------|-------------|
| 1000 | 文章获取失败 | General failure |
| 1001 | 无法获取文章信息 | Missing title or publish time |
| 1002 | 请求失败 | HTTP request failed |
| 1003 | 响应为空 | Empty response |
| 1004 | 访问过于频繁 | Rate limited |
| 1005 | 脚本解析失败 | Script parsing error |
| 1006 | 公众号已迁移 | Account migrated |
| 2001 | 请提供文章内容或链接 | Missing input |
| 2002 | 链接已过期 | Link expired |
| 2003 | 内容涉嫌侵权,无法查看 | Content removed (copyright) |
| 2004 | 无法获取迁移后的链接 | Migration link failed |
| 2005 | 内容已被发布者删除 | Content deleted by author |
| 2006 | 内容因违规无法查看 | Content blocked |
| 2007 | 内容发送失败无法查看 | Failed to send |
| 2008 | 系统出错 | System error |
| 2009 | 不支持的链接 | Unsupported URL |
| 2010 | 内容获取失败 | Content fetch failed |
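| 2011 | 由用户投诉并经平台审核,涉嫌过度营销、骚扰用户 | Marketing/spam content |
| 2012 | 此帐号已被屏蔽,内容无法查看 | Account blocked |
| 2013 | 此帐号已自主注销,内容无法查看 | Account closed by its owner |
| 2014 | 此内容被投诉且经审核确认存在不实信息 | Content reported and found to contain false information |
| 2015 | 此帐号处于帐号迁移流程中 | Account migrating |
| 2016 | 由用户投诉并经平台审核,涉嫌冒名侵权 | Impersonation |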
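
The codes fall into two rough groups: failures that may succeed on a later attempt (network errors, rate limiting) and permanent states (deleted, blocked, or migrated content). The sketch below retries only the first group. Which codes count as transient is an assumption made here, not something the skill defines, and `extractWithRetry` is a hypothetical wrapper that takes the extractor as a parameter so the example runs without network access.

```javascript
// Assumption: these codes describe transient conditions worth retrying.
const RETRYABLE_CODES = new Set([1002, 1003, 1004, 2008, 2010]);

function isRetryable(result) {
  return !result.done && RETRYABLE_CODES.has(result.code);
}

// Hypothetical wrapper: calls extractFn(url) up to `attempts` times,
// doubling the delay after each retryable failure.
async function extractWithRetry(extractFn, url, { attempts = 3, delayMs = 1000 } = {}) {
  let result;
  for (let i = 0; i < attempts; i++) {
    result = await extractFn(url);
    if (!isRetryable(result)) return result;
    if (i < attempts - 1) {
      await new Promise(resolve => setTimeout(resolve, delayMs * 2 ** i));
    }
  }
  return result;
}

// Stub standing in for extract(): rate limited twice, then succeeds.
let calls = 0;
const fakeExtract = async () =>
  ++calls < 3
    ? { done: false, code: 1004, msg: '访问过于频繁' }
    : { done: true, code: 0, data: {} };

const retryDemo = extractWithRetry(fakeExtract, 'https://mp.weixin.qq.com/s?__biz=...', { delayMs: 10 })
  .then(result => {
    console.log('done:', result.done, 'calls:', calls);
    return result;
  });
```

In real use, the skill's `extract` would be passed as `extractFn`, with a longer `delayMs` to stay clear of the rate limit.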
## Dependencies
Required npm packages:
- `cheerio` - HTML parsing
- `dayjs` - Date formatting
- `request-promise` - HTTP requests
- `qs` - Query string parsing
- `lodash.unescape` - HTML entities
## Notes
- Handles several WeChat page layouts, sends browser-like request headers, and detects the rate-limit page (code 1004)
- Automatically detects article type from page content
- Supports extracting from Sogou WeChat search results (`weixin.sogou.com`)
- Some fields may be null depending on article type and page structure
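
Because `extract` rejects URLs on any other host with code 2009, a caller may want to filter URLs before calling it. The check below approximates the host patterns used in `scripts/extract.js`; it anchors them at the start of the string, which is slightly stricter than the script. `isSupportedUrl` is an illustrative name, not part of the skill.

```javascript
// Host patterns modeled on the checks in scripts/extract.js (anchored here).
const SUPPORTED_HOSTS = [
  /^https?:\/\/mp\.weixin\.qq\.com/,
  /^https?:\/\/weixin\.sogou\.com/
];

// Hypothetical pre-check: true when extract() would accept the URL's host.
function isSupportedUrl(url) {
  return typeof url === 'string' && SUPPORTED_HOSTS.some(pattern => pattern.test(url));
}

const candidates = [
  'https://mp.weixin.qq.com/s?__biz=...',
  'https://example.com/article'
];
console.log(candidates.filter(isSupportedUrl));
```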
---
## Referenced Files
> The following files are referenced in this skill and included for context.
### scripts/extract.js
```javascript
const qs = require('qs');
const dayjs = require('dayjs');
const request = require('request-promise');
const cheerio = require('cheerio');
const unescape = require('lodash.unescape');
const errors = require('./errors');

const defaultConfig = {
  shouldReturnRawMeta: false,
  shouldReturnContent: true,
  shouldFollowTransferLink: true,
  shouldExtractMpLinks: false,
  shouldExtractTags: false,
  shouldExtractRepostMeta: false
};

function getError(code) {
  return { done: false, code, msg: errors[code] };
}

// Decode HTML-escaped ampersands, then re-serialize the query string
function normalizeUrl(url = '') {
  const parts = url.replace(/&amp;/g, '&').split('?');
  const querys = qs.stringify(qs.parse(parts[1]));
  return querys ? `${parts[0]}?${querys}` : parts[0];
}

function getParameterByName(name, url) {
  name = name.replace(/[\[\]]/g, '\\$&');
  const regex = new RegExp('[?&]' + name + '(=([^&#]*)|&|#|$)');
  const results = regex.exec(url);
  if (!results) return null;
  if (!results[2]) return '';
  return decodeURIComponent(results[2].replace(/\+/g, ' '));
}

function parseUrlParams(url) {
  if (!url) return {};
  const rs = require('querystring').parse(url.replace(/&amp;/g, '&').split('?')[1]);
  return {
    mid: rs.mid * 1,
    idx: rs.idx * 1,
    sn: rs.sn,
    biz: rs.__biz
  };
}

async function extract(input, options = {}) {
  const config = Object.assign({}, defaultConfig, options);
  const {
    shouldReturnRawMeta,
    shouldReturnContent,
    shouldFollowTransferLink,
    shouldExtractMpLinks,
    shouldExtractTags,
    shouldExtractRepostMeta
  } = config;

  if (!input) return getError(2001);

  let paramType = 'HTML';
  let url = options.url ? normalizeUrl(options.url) : null;
  let rawUrl = null;
  let html = input;
  let type = 'post';
  let hasCopyright = false;

  // Handle URL input
  if (/^http/.test(input)) {
    const normalized = normalizeUrl(input);
    if (!/https?:\/\/mp\.weixin\.qq\.com/.test(normalized) &&
        !/https?:\/\/weixin\.sogou\.com/.test(normalized)) {
      return getError(2009);
    }
    paramType = 'URL';
    rawUrl = normalized;
    if (!url) url = normalized;
    const host = /weixin\.sogou\.com/.test(normalized) ? 'weixin.sogou.com' : 'mp.weixin.qq.com';
    try {
      html = await request({
        uri: normalized,
        method: 'GET',
        headers: {
          'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36',
          'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
          'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
          'Host': host
        }
      });
    } catch (e) {
      return getError(1002);
    }
  } else {
    html = input.replace(/\\n/g, '');
  }
  if (!html) return getError(1003);

  // Check for error pages
  if (html.includes('访问过于频繁') && !html.includes('js_content')) {
    return paramType === 'URL' ? getError(1004) : getError(2010);
  }
  if (html.includes('链接已过期') && !html.includes('js_content')) return getError(2002);
  if (html.includes('被投诉且经审核涉嫌侵权,无法查看')) return getError(2003);
  if (html.includes('该公众号已迁移')) {
    const match = html.match(/var\stransferTargetLink\s=\s'(.*?)';/);
    if (match && match[1]) {
      if (shouldFollowTransferLink) {
        // Carry the caller's options over to the migrated article, minus the old URL
        return await extract(match[1], { ...options, url: undefined });
      }
      return { ...getError(1006), url: match[1] };
    }
    return getError(2004);
  }
  if (html.includes('该内容已被发布者删除')) return getError(2005);
  if (html.includes('此内容因违规无法查看')) return getError(2006);
  if (html.includes('此内容发送失败无法查看')) return getError(2007);
  if (html.includes('由用户投诉并经平台审核,涉嫌过度营销')) return getError(2011);
  if (html.includes('此帐号已被屏蔽') && !html.includes('id="js_content"')) return getError(2012);
  if (html.includes('此帐号已自主注销') && !html.includes('id="js_content"')) return getError(2013);
  if (!html.includes('id="js_content"') && html.includes('此帐号处于帐号迁移流程中')) return getError(2015);
  if (html.includes('page_rumor') && !html.includes('id="js_content"')) return getError(2014);
  if (html.includes('投诉类型') && html.includes('冒名侵权')) return getError(2016);
  if (!html.includes('id="js_content"') && !html.includes('id=\\"js_content\\"')) {
    if (html.includes('cover_url')) {
      type = 'image';
    } else {
      return getError(1000);
    }
  }

  // Prepare HTML
  html = html.replace('>微信号', ' id="append-account-alias">微信号')
    .replace('>功能介绍', ' id="append-account-desc">功能介绍')
    .replace(/\n\s+<script/g, '\n\n<script');
  const $ = cheerio.load(html, { decodeEntities: false });

  // Detect type and copyright
  if ($('#copyright_logo')?.text().includes('原创')) hasCopyright = true;
  if (/video/.test($('body').attr('class'))) type = 'video';
  if ($('#js_content > #img_list').length) type = 'image';
  if ($('#js_share_content').length) type = 'repost';
  if ($('.page_share_audio').length || $('#voice_parent').length) type = 'voice';
  if (/share_media_text/.test(html)) type = 'text';

  // Check for expired link or system error
  if ($('.weui-msg .weui-msg__title').text().trim() === '链接已过期') return getError(2002);
  if ($('.global_error_msg.warn').text().trim().includes('系统出错')) return getError(2008);

  // Extract basic info
  const basic = {
    accountName: $('.profile_nickname').text() || null,
    accountBiz: null,
    accountBizNumber: null,
    accountId: null,
    accountAvatar: null
  };
  const accountAliasPrev = $('#append-account-alias');
  let accountAlias = accountAliasPrev.siblings('span').text() || null;
  const accountDescPrev = $('#append-account-desc');
  let accountDesc = accountDescPrev.siblings('span').text() || null;
  if (!accountDesc) {
    const $accountDesc = $('.profile_meta_value');
    if ($accountDesc[1]) {
      try {
        const text = $accountDesc[1].children[0].data;
        if (text?.length > 10) accountDesc = text;
      } catch (e) {}
    }
  }

  const post = {
    msg_has_copyright: hasCopyright,
    msg_content: shouldReturnContent ? $('#js_content').html() : null
  };

  // Extract author: prefer the meta tag, fall back to the byline element.
  // attr() returns undefined rather than throwing, so the fallback is a plain else branch.
  const author = $("meta[name='author']").attr('content');
  if (author) {
    post.msg_author = author;
  } else {
    const $author = $('#js_author_name');
    if ($author.length) {
      const info = $author.text().trim();
      if (info) post.msg_author = info;
    }
  }

  // Extract from scripts.
  // Note: the new Function() calls below execute code taken from the fetched page.
  const scripts = html.match(/<script[\s\S]*?>([\s\S]*?)<\/script>/gi) || [];
  const extra = { biz: null, sn: null, mid: null, idx: null, msg_title: null, user_name: null, nick_name: null, hd_head_img: null };
  let picturePageInfoList = null;

  for (const script of scripts) {
    // Picture type
    if (script.includes('picture_page_info_list') && script.includes('https://mmbiz.qpic.cn')) {
      try {
        const lines = script.split('\n');
        const code = lines.slice(1, lines.length - 2).join('\n').trim()
          .replace(/^\(function\(\) {/, '').replace(/}\)\(\);$/, '');
        const fn = new Function(`var x = {}; ${code.replace(/window\./g, 'x.').replace('//g', '/\\n/g')}\nreturn x;`);
        const result = fn();
        if (result.picture_page_info_list) picturePageInfoList = result.picture_page_info_list;
      } catch (e) {}
    }

    // Voice type
    if (type === 'voice' && script.includes('voiceid')) {
      const lines = script.split(/\n|\r/).filter(one => one.includes('voiceid'))
        .sort((a, b) => a.length > b.length ? -1 : 1);
      if (lines.length) {
        const val = lines[0].replace(/['"|:,voiceid\s]/g, '');
        if (val) post.msg_source_url = `https://res.wx.qq.com/voice/getvoice?mediaid=${val.trim()}`;
      }
    }

    // Extract extra fields
    for (const field of Object.keys(extra)) {
      const reg = new RegExp(`var\\s+${field}\\s*=`);
      if (reg.test(script) && !extra[field]) {
        try {
          const line = script.split('\n').filter(one => reg.test(one))[0];
          const fn = new Function(`${line}\nreturn ${field};`);
          extra[field] = fn();
        } catch (e) {}
      }
      if (!extra[field]) {
        const reg2 = new RegExp(`window\\.${field}\\s*=`);
        if (reg2.test(script)) {
          try {
            const line = script.split('\n').filter(one => reg2.test(one))[0];
            const fn = new Function(`window = {}; ${line}\nreturn window.${field};`);
            extra[field] = fn();
          } catch (e) {}
        }
      }
    }

    // Data from d object (image/voice; the video branches below are not reached under this condition)
    if ((type === 'image' || type === 'voice') && script.includes('d.title =')) {
      try {
        const lines = script.split('\n').filter(line => !!line.trim());
        const codeLines = lines.filter((line, index) =>
          /d\./.test(line) || (lines[index - 1] && lines[index - 1].includes('d.') && !line.includes('}'))
        );
        let code = `var d = {}; function getXmlValue(path) { return false; }\n` +
          codeLines.join('\n').replace('var d = _g.cgiData;', 'var d = {}') +
          '\nreturn d;';
        code = `var _g = {}; ${code}`;
        const fn = new Function(code);
        const data = fn();
        basic.accountName = data.nick_name;
        basic.accountAvatar = data.hd_head_img;
        basic.accountId = data.user_name;
        if (!basic.accountBiz && data.biz) {
          basic.accountBiz = data.biz;
          basic.accountBizNumber = Buffer.from(data.biz, 'base64').toString() * 1;
        }
        post.msg_title = data.title;
        post.msg_desc = null;
        post.msg_cover = null;
        post.msg_link = data.msg_link || null;
        post.msg_sn = data.sn || null;
        post.msg_idx = data.idx ? data.idx * 1 : null;
        post.msg_mid = data.mid ? data.mid * 1 : null;
        if (type === 'video') {
          const vidMatch = html.match(/vid\s*:\s*'(.*?)'/);
          if (vidMatch) data.vid = vidMatch[1];
          post.msg_cover = $("meta[property='og:image']").attr('content');
        }
        if (type === 'video' || type === 'voice') {
          post.msg_content = $("meta[name='description']").attr('content');
        }
        if (data.create_time) {
          post.msg_publish_time = new Date(data.create_time * 1000);
          post.msg_publish_time_str = dayjs(post.msg_publish_time).format('YYYY/MM/DD HH:mm:ss');
        }
        if (shouldReturnRawMeta) post.raw_data = data;
      } catch (e) {
        return getError(1005);
      }
    }

    // Post/repost type
    if ((type === 'post' || type === 'repost') && script.includes('var msg_link =')) {
      try {
        const lines = script.split('\n');
        let code = lines.slice(1, lines.length - 1)
          .filter(line => !line.includes('var title'))
          .map(line => {
            if (/var\s+msg_desc/.test(line)) {
              line = line.replace(/`/g, "'").replace(/"/g, '`');
            }
            return line;
          }).join('\n');
        // Minimal browser globals so the page script can run outside a browser
        code = `var window = { location: { protocol: 'https' } };
          var document = {
            addEventListener: function() {},
            getElementById: function() {
              return { classList: { remove: function() {}, add: function() {} } };
            }
          };
          var location = { protocol: "https" };\n${code}`;
        const vars = code.match(/var\s(.*?)\s=/g)?.map(key => key.split(' ')[1]).filter(k => k !== 'window') || [];
        let rs = ';\nvar rs = {';
        vars.forEach(key => {
          rs += `"${key}": typeof ${key} !== 'undefined' ? ${key} : null,`;
        });
        rs += '}\nreturn rs;';
        // Shim for String.prototype.html(): decodes HTML entities, or encodes them when encode is true
        const stringProto = `String.prototype.html = function(encode) {
          var replace = ["&#39;", "'", "&quot;", '"', "&nbsp;", " ", "&gt;", ">", "&lt;", "<", "&yen;", "¥", "&amp;", "&"];
          var replaceReverse = ["&", "&amp;", "¥", "&yen;", "<", "&lt;", ">", "&gt;", " ", "&nbsp;", '"', "&quot;", "'", "&#39;"];
          var target = encode ? replaceReverse : replace;
          for (var i = 0, str = this; i < target.length; i += 2) {
            str = str.replace(new RegExp(target[i], 'g'), target[i + 1]);
          }
          return str;
        };`;
        const fn = new Function(stringProto + code + rs);
        const data = fn();

        // Fallback for biz
        if (!basic.accountBiz) {
          const reg = new RegExp(`var\\s+biz\\s*=`);
          const matched = html.split('\n').find(line => reg.test(line) && line.length > 10);
          if (matched) {
            try {
              const bizFn = new Function(`${matched}; return biz;`);
              const rs = bizFn();
              if (rs) {
                basic.accountBiz = rs;
                basic.accountBizNumber = Buffer.from(rs, 'base64').toString() * 1;
              }
            } catch (e) {}
          }
        }

        ['msg_title', 'msg_desc', 'msg_link', 'msg_source_url'].forEach(key => {
          post[key] = data[key] || null;
        });
        post.msg_cover = data.msg_cdn_url;
        post.msg_article_type = data._ori_article_type || null;
        // Only set the time when ct exists, so the later fallbacks can run instead of storing an Invalid Date
        if (data.ct) {
          post.msg_publish_time = new Date(data.ct * 1000);
          post.msg_publish_time_str = dayjs(post.msg_publish_time).format('YYYY/MM/DD HH:mm:ss');
        }
        if (shouldReturnRawMeta) post.raw_data = data;
        basic.accountId = data.user_name;
        basic.accountAvatar = data.ori_head_img_url;
        if (!basic.accountName && data.nickname) basic.accountName = data.nickname;
      } catch (e) {
        return getError(1005);
      }
    }
  }

  // Set extracted extra fields
  if (extra.biz) {
    basic.accountBiz = extra.biz;
    basic.accountBizNumber = Buffer.from(extra.biz, 'base64').toString() * 1;
  }
  post.msg_sn = extra.sn || post.msg_sn || null;
  post.msg_idx = extra.idx ? extra.idx * 1 : post.msg_idx || null;
  post.msg_mid = extra.mid ? extra.mid * 1 : post.msg_mid || null;

  // Fallback for missing fields
  if (!post.msg_publish_time) {
    const date = $('#post-date').text() || $('#publish_time').text();
    if (date) post.msg_publish_time = new Date(date);
  }
  if (!post.msg_publish_time && html.includes('.ct')) {
    const line = html.split('\n').find(one => one.includes('.ct'));
    const matched = /'(\d+)'/g.exec(line);
    if (matched && matched[1]?.length >= 10) {
      post.msg_publish_time = new Date(matched[1] * 1000);
    }
  }
  if (!post.msg_title) {
    const title = $('.rich_media_title').text();
    if (title) post.msg_title = title.trim();
  }

  // Set account info from extra
  if (!basic.accountId && extra.user_name) basic.accountId = extra.user_name;
  if (!basic.accountName && extra.nick_name) basic.accountName = extra.nick_name;
  if (!basic.accountAvatar && extra.hd_head_img) basic.accountAvatar = extra.hd_head_img;
  if (!basic.accountName && $('.wx_follow_nickname')) {
    const name = $('.wx_follow_nickname').text();
    if (name) basic.accountName = name.trim();
  }

  // Build result
  const data = {
    account_name: basic.accountName,
    account_alias: accountAlias,
    account_avatar: basic.accountAvatar?.length > 10 ? basic.accountAvatar : null,
    account_description: accountDesc,
    account_id: basic.accountId,
    account_biz: basic.accountBiz,
    account_biz_number: basic.accountBizNumber,
    account_qr_code: `https://open.weixin.qq.com/qr/code?username=${basic.accountId || accountAlias}`,
    ...post,
    msg_type: type
  };

  // Clean empty strings to null
  for (const key in data) {
    if (data[key] === '') data[key] = null;
  }

  // Handle text type (no title)
  if (!data.msg_title && type === 'post') {
    data.msg_type = 'text';
    const title = $("meta[property='og:title']").attr('content');
    const desc = $("meta[property='og:description']").attr('content');
    if (title) {
      data.msg_title = title;
      const rawContent = $('#js_panel_like_title').html();
      data.msg_content = rawContent ? rawContent.trim().replace(/\n/g, '<br/>') : title;
    }
    if (!title && desc) data.msg_title = desc;
  }

  // Fallback for time
  if (!data.msg_publish_time) {
    const matched = html.match(/d\.ct\s*=\s*"(\d+)"/);
    if (matched && matched[1]) {
      data.msg_publish_time = new Date(matched[1] * 1000);
      data.msg_publish_time_str = dayjs(data.msg_publish_time).format('YYYY/MM/DD HH:mm:ss');
    }
  }

  // Fallback for link params
  if (!data.msg_mid || !data.msg_link) {
    let linkUrl = options?.url || rawUrl || $("meta[property='og:url']").attr('content');
    if (linkUrl && /^http/.test(linkUrl) && /mid/.test(linkUrl) && /__biz/.test(linkUrl)) {
      linkUrl = linkUrl.replace(/&amp;/g, '&');
      if (!data.msg_link) data.msg_link = linkUrl;
      if (!data.msg_mid) data.msg_mid = getParameterByName('mid', linkUrl);
      if (!data.msg_idx) data.msg_idx = getParameterByName('idx', linkUrl);
      if (!data.msg_sn) data.msg_sn = getParameterByName('sn', linkUrl);
    }
  }

  // Unescape title
  if (data.msg_title) data.msg_title = unescape(data.msg_title);

  // Handle video content
  if (data.msg_type === 'video') {
    if (!data.msg_content) {
      data.msg_content = data.msg_title;
    } else {
      data.msg_content = data.msg_content
        .replace(/\\x26/g, '&')
        .replace(/\\x0a/g, '<br/>');
    }
  }

  // Final fallback for title
  if (!data.msg_title) {
    const title = $("meta[property='og:title']").attr('content');
    if (title) data.msg_title = title;
  }

  // Fallback for description
  if (!data.msg_desc) {
    data.msg_desc = $("meta[property='og:description']").attr('content') ||
      $("meta[name='description']").attr('content');
  }
  if (!data.msg_desc && data.msg_content) {
    const text = data.msg_content.replace(/<[^>]+>/g, '').replace(/\s+/g, ' ').trim();
    if (text.length > 0) {
      data.msg_desc = text.substring(0, 140) + (text.length > 140 ? '...' : '');
    }
  }

  // Handle script in content
  if (data.msg_content?.includes('<script') && data.msg_content.includes('script>') && data.msg_content.includes('nonce=')) {
    const desc = $("meta[property='og:description']").attr('content');
    if (desc) data.msg_content = desc;
  }

  // Validate result
  if (!data.msg_title || !data.msg_publish_time) return getError(1001);

  // Text type fallback
  if (type === 'text' && !data.msg_content && data.msg_title) {
    data.msg_content = data.msg_title;
  }

  // Image type with picture_page_info_list
  if (picturePageInfoList) {
    data.msg_type = 'image';
    data.msg_content = `${data.msg_title}<br>`;
    for (const one of picturePageInfoList) {
      data.msg_content += `<img src="${one.cdn_url}" style="max-width:100%"/><br><br>`;
    }
  }

  // Extract mp links
  if (shouldExtractMpLinks) {
    const mpLinks = [];
    $('a').each((i, ele) => {
      const href = $(ele).attr('href');
      if (href?.includes('mp.weixin.qq.com')) {
        mpLinks.push({ title: $(ele).text(), href });
      }
    });
    data.mp_links_count = mpLinks.length;
    data.mp_links = mpLinks;
  }

  // Extract tags
  if (shouldExtractTags) {
    const tags = [];
    $('.article-tag__item-wrp').each((i, ele) => {
      const $this = $(ele);
      try {
        const tagUrl = $this.attr('data-url');
        const name = $this.find('.article-tag__item').text();
        let count = $this.find('.article-tag__item-num').text();
        if (name) {
          if (!count && tags.length === 0) {
            const $count = $('.article-tag-card__right');
            if ($count.length) count = $count.text().replace('个', '');
          }
          tags.push({
            id: getParameterByName('album_id', tagUrl) || getParameterByName('tag_id', tagUrl) || null,
            url: tagUrl,
            name: name.replace(/^#/, ''),
            count: count?.replace(/\D/g, '') * 1 || 0
          });
        }
      } catch (e) {}
    });
    data.tags = tags;
  }

  // Extract repost meta
  if (shouldExtractRepostMeta && html.includes('copyright_info') && html.includes('original_primary_nickname')) {
    const name = $('.original_primary_nickname').text();
    if (name) data.repost_meta = { account_name: name };
  }

  // Clean up link: decode HTML-escaped ampersands
  if (data.msg_link?.includes('&amp;')) {
    data.msg_link = data.msg_link.replace(/&amp;/g, '&');
  }

  return { code: 0, done: true, data };
}

module.exports = { extract };
```
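Two derivations in the script are easy to miss: `account_biz_number` is the `__biz` value decoded from base64 and coerced to a number, and `msg_mid`, `msg_idx`, and `msg_sn` can be read from the article URL's query string. The standalone sketch below reproduces both with Node's built-in `URL` class rather than the script's `qs` and regex helpers. `parseArticleUrl` is a hypothetical name and the sample URL is made up.

```javascript
// Hypothetical helper: pulls the link parameters out of an article URL.
function parseArticleUrl(rawUrl) {
  // Links copied out of HTML often carry &amp; instead of &
  const url = new URL(rawUrl.replace(/&amp;/g, '&'));
  const params = url.searchParams;
  const biz = params.get('__biz');
  const toNumber = value => (value === null ? null : Number(value));
  return {
    biz,
    // __biz is the numeric account ID, base64-encoded
    bizNumber: biz ? Number(Buffer.from(biz, 'base64').toString()) : null,
    mid: toNumber(params.get('mid')),
    idx: toNumber(params.get('idx')),
    sn: params.get('sn')
  };
}

// Made-up URL; MTIzNDU2Nzg5MA== is base64 for "1234567890".
const parsed = parseArticleUrl(
  'https://mp.weixin.qq.com/s?__biz=MTIzNDU2Nzg5MA==&amp;mid=2247483650&amp;idx=1&amp;sn=abc123'
);
console.log(parsed);
```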
---
## Skill Companion Files
> Additional files collected from the skill directory layout.
### README.md
```markdown
# WeChat Article Extractor
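A Claude Code Skill that extracts metadata and content from WeChat Official Account articles. It supports several article types, including posts, videos, image galleries, voice messages, and reposts. When a user provides a WeChat Official Account article link (mp.weixin.qq.com), Claude automatically triggers this Skill to extract the article information.
## Features
- Parse WeChat Official Account article URLs (`mp.weixin.qq.com`)
- Extract article metadata: title, author, summary, publish time
- Get account info: name, avatar, WeChat ID, description
- Extract article content (HTML)
- Get the cover image URL
- Support multiple article types: post, video, image gallery, voice, plain text, repost
- Handle error cases: deleted content, expired links, access limits, account migration, and more
- Parse Sogou WeChat search results (`weixin.sogou.com`)
- Optionally extract article tags and embedded links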
## Installation
This is a Claude Code Skill and can be installed in the following ways:
### Install via Claude Code (recommended)
Run in Claude Code:
```
/skill install wechat-article-extractor
```
Or install from a specific directory:
```
/skill install /path/to/wechat-article-extractor-skill
```
### Manual clone
```bash
git clone https://github.com/yourusername/wechat-article-extractor-skill.git
cd wechat-article-extractor-skill
npm install
```
Then load the directory as a Skill in Claude Code.
## Usage
### Basic usage: extract from a URL
```javascript
const { extract } = require('./scripts/extract.js');

async function main() {
  const url = 'https://mp.weixin.qq.com/s?__biz=...&mid=...&idx=...&sn=...';
  const result = await extract(url);
  if (result.done) {
    console.log('Title:', result.data.msg_title);
    console.log('Account:', result.data.account_name);
    console.log('Published:', result.data.msg_publish_time_str);
  } else {
    console.error('Extraction failed:', result.msg);
  }
}

main();
```
### Extract from HTML content
If you already have the page HTML, you can pass it in directly:
```javascript
const { extract } = require('./scripts/extract.js');

async function main() {
  const sourceUrl = 'https://mp.weixin.qq.com/s?__biz=...';
  const html = await fetch(sourceUrl).then(r => r.text());
  const result = await extract(html, { url: sourceUrl });
  console.log(result);
}

main();
```
### Advanced options
```javascript
const result = await extract(url, {
  shouldReturnContent: true,      // Return HTML content (default: true)
  shouldReturnRawMeta: false,     // Return raw metadata (default: false)
  shouldFollowTransferLink: true, // Follow the link of a migrated account automatically (default: true)
  shouldExtractMpLinks: false,    // Extract embedded WeChat Official Account links (default: false)
  shouldExtractTags: false,       // Extract article tags (default: false)
  shouldExtractRepostMeta: false  // Extract repost source info (default: false)
});
```
## Response Format
### Success response
```javascript
{
  done: true,
  code: 0,
  data: {
    // Account info
    account_name: "Account name",
    account_alias: "WeChat ID",
    account_avatar: "Avatar URL",
    account_description: "Account description",
    account_id: "Original ID",
    account_biz: "biz parameter",
    account_biz_number: 1234567890,
    account_qr_code: "QR code URL",
    // Article info
    msg_title: "Article title",
    msg_desc: "Article summary",
    msg_content: "HTML content",
    msg_cover: "Cover image URL",
    msg_author: "Author",
    msg_type: "post", // post|video|image|voice|text|repost
    msg_has_copyright: true,
    msg_publish_time: Date, // JavaScript Date object
    msg_publish_time_str: "2024/01/15 10:30:00",
    // Link params
    msg_link: "Article link",
    msg_source_url: "Read-original link",
    msg_sn: "sn parameter",
    msg_mid: 1234567890,
    msg_idx: 1
  }
}
```
### Error response
```javascript
{
  done: false,
  code: 1001,
  msg: "无法获取文章信息" // "Unable to get article info"
}
```
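## Error Codes
| Code | Message | Description |
|------|---------|-------------|
| 1000 | 文章获取失败 | General failure |
| 1001 | 无法获取文章信息 | Missing title or publish time |
| 1002 | 请求失败 | HTTP request failed |
| 1003 | 响应为空 | Empty response |
| 1004 | 访问过于频繁 | Rate limited |
| 1005 | 脚本解析失败 | Page script parsing error |
| 1006 | 公众号已迁移 | Account migrated; the response includes the new link |
| 2001 | 请提供文章内容或链接 | Missing input |
| 2002 | 链接已过期 | Link expired |
| 2003 | 内容涉嫌侵权,无法查看 | Content removed for infringement |
| 2004 | 无法获取迁移后的链接 | Could not get the migrated link |
| 2005 | 内容已被发布者删除 | Content deleted by the author |
| 2006 | 内容因违规无法查看 | Content blocked by the platform |
| 2007 | 内容发送失败无法查看 | Failed to send |
| 2008 | 系统出错 | System error |
| 2009 | 不支持的链接 | Unsupported URL format |
| 2010 | 内容获取失败 | Content fetch failed |
| 2011 | 由用户投诉并经平台审核,涉嫌过度营销、骚扰用户 | Marketing/spam content |
| 2012 | 此帐号已被屏蔽,内容无法查看 | Account banned |
| 2013 | 此帐号已自主注销,内容无法查看 | Account closed by its owner |
| 2014 | 此内容被投诉且经审核确认存在不实信息 | Content reported and found to contain false information |
| 2015 | 此帐号处于帐号迁移流程中 | Account is migrating |
| 2016 | 由用户投诉并经平台审核,涉嫌冒名侵权 | Impersonation |
## Supported Article Types
| Type | Description | msg_type |
|------|-------------|----------|
| Post | Standard article with text and images | `post` |
| Video | Video content | `video` |
| Image gallery | Multiple images | `image` |
| Voice | Audio content | `voice` |
| Plain text | Untitled text content | `text` |
| Repost | Repost of another account's article | `repost` |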
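## Project Structure
```
wechat-article-extractor-skill/
├── scripts/
│   ├── extract.js   # Core extraction logic
│   └── errors.js    # Error code definitions
├── SKILL.md         # Skill definition (Claude Skill format, with trigger conditions and description)
├── package.json     # Project configuration
└── README.md        # This file
```
## Dependencies
- `cheerio` - Server-side HTML parsing
- `dayjs` - Date formatting
- `request-promise` - HTTP requests
- `qs` - Query string parsing
- `lodash.unescape` - HTML entity decoding
## Notes
1. **Rate limiting**: Frequent requests may get your IP temporarily blocked; add a suitable delay between requests
2. **Page structure**: WeChat page structure may change; if you run into problems, check that you are on the latest version
3. **Cookie**: Some articles may require login to access the full content
4. **Anti-scraping**: Comply with WeChat's terms of use and use this tool responsibly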
## Example Applications
### Batch extraction
```javascript
const { extract } = require('./scripts/extract.js');

async function batchExtract(urls) {
  const results = [];
  for (const url of urls) {
    try {
      const result = await extract(url);
      if (result.done) {
        results.push({
          title: result.data.msg_title,
          author: result.data.account_name,
          publishTime: result.data.msg_publish_time_str
        });
      }
      // Add a delay to avoid being rate limited
      await new Promise(r => setTimeout(r, 1000));
    } catch (err) {
      console.error(`Extraction failed: ${url}`, err.message);
    }
  }
  return results;
}

const urls = [
  'https://mp.weixin.qq.com/s?__biz=...',
  'https://mp.weixin.qq.com/s?__biz=...'
];
batchExtract(urls).then(console.log);
```
### Save as Markdown
```javascript
const { extract } = require('./scripts/extract.js');
const fs = require('fs');

async function saveAsMarkdown(url, filename) {
  const result = await extract(url);
  if (!result.done) {
    console.error('Extraction failed:', result.msg);
    return;
  }
  const { data } = result;
  // msg_content is HTML and is embedded as-is
  const markdown = `
# ${data.msg_title}

> Author: ${data.msg_author || data.account_name}
> Account: ${data.account_name}
> Published: ${data.msg_publish_time_str}

${data.msg_content}

---
Source: [${data.msg_link}](${data.msg_link})
`;
  fs.writeFileSync(filename, markdown);
  console.log(`Saved: ${filename}`);
}

saveAsMarkdown('https://mp.weixin.qq.com/s?__biz=...', 'article.md');
```
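## License
[MIT](LICENSE)
## Contributing
Issues and Pull Requests are welcome!
## Changelog
### v1.0.0
- Initial release
- Basic article information extraction
- Support for multiple article types
- Comprehensive error handling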
```
### _meta.json
```json
{
  "owner": "freestylefly",
  "slug": "wechat-article-extractor-skill",
  "displayName": "微信公众号文章解析",
  "latest": {
    "version": "1.0.1",
    "publishedAt": 1771492990230,
    "commit": "https://github.com/openclaw/skills/commit/0e33e3223cb50b905ce5c70d7433ec62c06c40a9"
  },
  "history": [
    {
      "version": "1.0.0",
      "publishedAt": 1771492575445,
      "commit": "https://github.com/openclaw/skills/commit/813ac22355bd6d627fc1979cc869448cc348dca5"
    }
  ]
}
```
### scripts/errors.js
```javascript
module.exports = {
  1000: '文章获取失败',
  1001: '无法获取文章信息',
  1002: '请求失败',
  1003: '响应为空',
  1004: '访问过于频繁',
  1005: '脚本解析失败',
  1006: '公众号已迁移',
  2001: '请提供文章内容或链接',
  2002: '链接已过期',
  2003: '内容涉嫌侵权,无法查看',
  2004: '无法获取迁移后的链接',
  2005: '内容已被发布者删除',
  2006: '内容因违规无法查看',
  2007: '内容发送失败无法查看',
  2008: '系统出错',
  2009: '不支持的链接',
  2010: '内容获取失败',
  2011: '由用户投诉并经平台审核,涉嫌过度营销、骚扰用户',
  2012: '此帐号已被屏蔽,内容无法查看',
  2013: '此帐号已自主注销,内容无法查看',
  2014: '此内容被投诉且经审核确认存在不实信息',
  2015: '此帐号处于帐号迁移流程中',
  2016: '由用户投诉并经平台审核,涉嫌冒名侵权'
};
```