Back to skills
SkillHub ClubWrite Technical DocsFull StackTech Writer

pdf-text-extractor

Download PDFs (when available) and extract plain text to support full-text evidence, writing `papers/fulltext_index.jsonl` and `papers/fulltext/*.txt`. **Trigger**: PDF download, fulltext, extract text, papers/pdfs, 全文抽取, 下载PDF. **Use when**: `queries.md` 设置 `evidence_mode: fulltext`(或你明确需要全文证据)并希望为 paper notes/claims 提供更强 evidence。 **Skip if**: `evidence_mode: abstract`(默认);或你不希望进行下载/抽取(成本/权限/时间)。 **Network**: fulltext 下载通常需要网络(除非你手工提供 PDF 缓存在 `papers/pdfs/`)。 **Guardrail**: 缓存下载到 `papers/pdfs/`;默认不覆盖已有抽取文本(除非显式要求重抽)。

Packaged view

This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.

Stars
334
Hot score
99
Updated
March 20, 2026
Overall rating
C4.3
Composite score
4.3
Best-practice grade
S96.0

Install command

npx @skill-hub/cli install willoscar-research-units-pipeline-skills-pdf-text-extractor

Repository

WILLOSCAR/research-units-pipeline-skills

Skill path: .codex/skills/pdf-text-extractor

Download PDFs (when available) and extract plain text to support full-text evidence, writing `papers/fulltext_index.jsonl` and `papers/fulltext/*.txt`. **Trigger**: PDF download, fulltext, extract text, papers/pdfs, 全文抽取, 下载PDF. **Use when**: `queries.md` 设置 `evidence_mode: fulltext`(或你明确需要全文证据)并希望为 paper notes/claims 提供更强 evidence。 **Skip if**: `evidence_mode: abstract`(默认);或你不希望进行下载/抽取(成本/权限/时间)。 **Network**: fulltext 下载通常需要网络(除非你手工提供 PDF 缓存在 `papers/pdfs/`)。 **Guardrail**: 缓存下载到 `papers/pdfs/`;默认不覆盖已有抽取文本(除非显式要求重抽)。

Open repository

Best for

Primary workflow: Write Technical Docs.

Technical facets: Full Stack, Tech Writer.

Target audience: everyone.

License: Unknown.

Original source

Catalog source: SkillHub Club.

Repository owner: WILLOSCAR.

This is still a mirrored public skill entry. Review the repository before installing into production workflows.

What it helps with

  • Install pdf-text-extractor into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
  • Review https://github.com/WILLOSCAR/research-units-pipeline-skills before adding pdf-text-extractor to shared team environments
  • Use pdf-text-extractor for development workflows

Works across

Claude CodeCodex CLIGemini CLIOpenCode

Favorites: 0.

Sub-skills: 0.

Aggregator: No.

Original source / Raw SKILL.md

---
name: pdf-text-extractor
description: |
  Download PDFs (when available) and extract plain text to support full-text evidence, writing `papers/fulltext_index.jsonl` and `papers/fulltext/*.txt`.
  **Trigger**: PDF download, fulltext, extract text, papers/pdfs, 全文抽取, 下载PDF.
  **Use when**: `queries.md` 设置 `evidence_mode: fulltext`(或你明确需要全文证据)并希望为 paper notes/claims 提供更强 evidence。
  **Skip if**: `evidence_mode: abstract`(默认);或你不希望进行下载/抽取(成本/权限/时间)。
  **Network**: fulltext 下载通常需要网络(除非你手工提供 PDF 缓存在 `papers/pdfs/`)。
  **Guardrail**: 缓存下载到 `papers/pdfs/`;默认不覆盖已有抽取文本(除非显式要求重抽)。
---

# PDF Text Extractor

Optionally collect **full-text snippets** to deepen evidence beyond abstracts.

This skill is intentionally conservative: in many survey runs, **abstract/snippet mode is enough** and avoids heavy downloads.

## Inputs

- `papers/core_set.csv` (expects `paper_id`, `title`, and ideally `pdf_url`/`arxiv_id`/`url`)
- Optional: `outline/mapping.tsv` (to prioritize mapped papers)

## Outputs

- `papers/fulltext_index.jsonl` (one record per attempted paper)
- Side artifacts:
  - `papers/pdfs/<paper_id>.pdf` (cached downloads)
  - `papers/fulltext/<paper_id>.txt` (extracted text)

## Decision: evidence mode

- `queries.md` can set `evidence_mode: "abstract" | "fulltext"`.
  - `abstract` (default template): **do not download**; write an index that clearly records skipping.
  - `fulltext`: download PDFs (when possible) and extract text to `papers/fulltext/`.

## Local PDFs Mode

When you cannot/should not download PDFs (restricted network, rate limits, no permission), provide PDFs manually and run in “local PDFs only” mode.

- PDF naming convention: `papers/pdfs/<paper_id>.pdf` where `<paper_id>` matches `papers/core_set.csv`.
- Set `- evidence_mode: "fulltext"` in `queries.md`.
- Run: `python .codex/skills/pdf-text-extractor/scripts/run.py --workspace <ws> --local-pdfs-only`

If PDFs are missing, the script writes a to-do list:

- `output/MISSING_PDFS.md` (human-readable summary)
- `papers/missing_pdfs.csv` (machine-readable list)

## Workflow (heuristic)

1. Read `papers/core_set.csv`.
2. If `outline/mapping.tsv` exists, prioritize mapped papers first.
3. For each selected paper (fulltext mode):
   - resolve `pdf_url` (use `pdf_url`, else derive from `arxiv_id`/`url` when possible)
   - download to `papers/pdfs/<paper_id>.pdf` if missing
   - extract a reasonable prefix of text to `papers/fulltext/<paper_id>.txt`
   - append/update a JSONL record in `papers/fulltext_index.jsonl` with status + stats
4. Never overwrite existing extracted text unless explicitly requested (delete the `.txt` to re-extract).

## Quality checklist

- [ ] `papers/fulltext_index.jsonl` exists and is non-empty.
- [ ] If `evidence_mode: "fulltext"`: at least a small but non-trivial subset has extracted text (strict mode blocks if extraction coverage is near-zero).
- [ ] If `evidence_mode: "abstract"`: the index records clearly reflect skip status (no downloads attempted).

## Script

### Quick Start

- `python .codex/skills/pdf-text-extractor/scripts/run.py --help`
- `python .codex/skills/pdf-text-extractor/scripts/run.py --workspace <workspace_dir>`

### All Options

- `--max-papers <n>`: cap number of papers processed (can be overridden by `queries.md`)
- `--max-pages <n>`: extract at most N pages per PDF
- `--min-chars <n>`: minimum extracted chars to count as OK
- `--sleep <sec>`: delay between downloads
- `--local-pdfs-only`: do not download; only use `papers/pdfs/<paper_id>.pdf` if present
- `queries.md` supports: `evidence_mode`, `fulltext_max_papers`, `fulltext_max_pages`, `fulltext_min_chars`

### Examples

- Abstract mode (no downloads):
  - Set `- evidence_mode: "abstract"` in `queries.md`, then run the script (it will emit `papers/fulltext_index.jsonl` with skip statuses)
- Fulltext mode with local PDFs only:
  - Set `- evidence_mode: "fulltext"` in `queries.md`, put PDFs under `papers/pdfs/`, then run: `python .codex/skills/pdf-text-extractor/scripts/run.py --workspace <ws> --local-pdfs-only`
- Fulltext mode with smaller budget:
  - `python .codex/skills/pdf-text-extractor/scripts/run.py --workspace <ws> --max-papers 20 --max-pages 4 --min-chars 1200`

### Notes

- Downloads are cached under `papers/pdfs/`; extracted text is cached under `papers/fulltext/`.
- The script does not overwrite existing extracted text unless you delete the `.txt` file.

## Troubleshooting

### Issue: no PDFs are available to download

**Fix**:
- Use `evidence_mode: abstract` (default) or provide local PDFs under `papers/pdfs/` and rerun with `--local-pdfs-only`.

### Issue: extracted text is empty/garbled

**Fix**:
- Try a different extraction backend if supported; otherwise mark the paper as `abstract` evidence level and avoid strong fulltext claims.
pdf-text-extractor | SkillHub