
math-extractor

Extracts strictly mathematical terms (Definitions, Theorems, Lemmas, Propositions, Proofs) from documents (PDF, MD, TEX, TXT), handling PDF conversion and AI-based cleaning. Use when the user wants to extract math content from a file.

Packaged view

This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.

Stars
0
Hot score
74
Updated
March 20, 2026
Overall rating
C (0.0)
Composite score
0.0
Best-practice grade
A (92.0)

Install command

npx @skill-hub/cli install develata-deve-skills-math-extractor

Repository

Develata/Deve-Skills

Skill path: My-Skills/math-extractor

Open repository

Best for

Primary workflow: Write Technical Docs.

Technical facets: Full Stack, Data / AI, Tech Writer.

Target audience: everyone.

License: Unknown.

Original source

Catalog source: SkillHub Club.

Repository owner: Develata.

This is a mirrored public skill entry. Review the repository before installing it into production workflows.

What it helps with

  • Install math-extractor into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
  • Review https://github.com/Develata/Deve-Skills before adding math-extractor to shared team environments
  • Use math-extractor for development workflows

Works across

Claude Code · Codex CLI · Gemini CLI · OpenCode

Favorites: 0.

Sub-skills: 0.

Aggregator: No.

Original source / Raw SKILL.md

---
name: math-extractor
description: Extracts strictly mathematical terms (Definitions, Theorems, Lemmas, Propositions, Proofs) from documents (PDF, MD, TEX, TXT), handling PDF conversion and AI-based cleaning. Use when the user wants to extract math content from a file.
---

# Math Extractor

This skill extracts mathematical definitions, theorems, lemmas, propositions, and proofs from documents.

## Input Schema

```xml
<input_schema>
  <file_path>Path to the source file (pdf/md/tex/txt)</file_path>
</input_schema>
```

## Logic & Workflow

The Agent must follow this Chain of Thought (CoT):

1.  **Env Check**: First, verify that `scripts/processor.py` can access the necessary API keys (MinerU & LLM) from the environment. If missing, return a configuration error.
2.  **Validation**: Check the file extension. If it is not .pdf/.md/.tex/.txt, return "不支持当前文件格式" ("unsupported file format").
3.  **Conversion**:
    *   If PDF: Call `convert_pdf`. The script internally uses the pre-configured MinerU key.
    *   If conversion fails (or the key is missing), return "未设定好pdf转化为md的工具" ("the PDF-to-Markdown converter is not configured").
4.  **Preprocessing**:
    *   Call `clean_and_chunk` (implemented in `clean_content`).
    *   Aggressively remove images, TOCs, and References to save tokens.
5.  **Extraction (Batch AI)**:
    *   Call `batch_extract_math` (implemented in `batch_extract`).
    *   The script uses the pre-configured LLM credentials to process chunks in parallel.
6.  **Merge & Output**:
    *   Save the merged result to `<input-stem>_extracted.md` and return the path.
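The preprocessing in step 4 amounts to a few regex passes. A minimal standalone sketch (the `strip_noise` function and the sample text are illustrative, not the script's API; the patterns mirror those in `scripts/processor.py`):

```python
import re

def strip_noise(text: str) -> str:
    """Remove trailing References, markdown images, and dotted TOC lines."""
    # Cut everything from a standalone "References"/"Bibliography" heading to the end
    text = re.sub(r'(?ims)^\s*(References|Bibliography)\s*$.*', '', text)
    # Drop markdown images: ![alt](url)
    text = re.sub(r'!\[.*?\]\(.*?\)', '', text)
    # Drop TOC lines such as "Introduction ........ 3"
    text = re.sub(r'(?m)^.*\.{4,}\s*\d+\s*$', '', text)
    return text.strip()

sample = "Intro ........ 3\n![fig](a.png)\nTheorem 1. Let x > 0.\nReferences\n[1] Foo"
print(strip_noise(sample))  # only the theorem survives
```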

## Usage

To use this skill, execute the Python script with the source file path and an output directory.

**Required Environment Variables:**
*   `EXTRACTION_API_KEY`: API Key for LLM (e.g., OpenAI, DeepSeek).
*   `EXTRACTION_BASE_URL`: Base URL for LLM API (default: `https://api.openai.com/v1`).

**Optional Environment Variables:**
*   `MINERU_API_KEY`: Required only for PDF conversion.
*   `MINERU_BASE_URL`: Base URL for MinerU API (default: `https://api.mineru.com/v1`).
*   `LLM_MODEL`: Model name to use (default: `gpt-4o`).

```bash
python scripts/processor.py <file_path> <output_directory>
```
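The startup environment check described above can be sketched as a small fail-fast helper. This is illustrative (the script itself reads these variables into a `CONFIG` dict at import time); only the variable names and defaults come from the documentation:

```python
import os

REQUIRED = ["EXTRACTION_API_KEY"]
OPTIONAL_DEFAULTS = {
    "EXTRACTION_BASE_URL": "https://api.openai.com/v1",
    "LLM_MODEL": "gpt-4o",
}

def load_config(env=os.environ):
    """Fail fast on missing required keys; fill in documented defaults."""
    missing = [k for k in REQUIRED if not env.get(k)]
    if missing:
        raise ValueError(f"Missing required environment variables: {missing}")
    cfg = {k: env[k] for k in REQUIRED}
    for key, default in OPTIONAL_DEFAULTS.items():
        cfg[key] = env.get(key, default)
    return cfg
```

Failing before any file is touched keeps the error close to its cause instead of surfacing mid-pipeline.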

## Features

*   **Robust PDF Conversion**: Uses MinerU for high-quality PDF to Markdown conversion.
*   **Smart Chunking**: Splits text by paragraphs to avoid breaking math formulas.
*   **Cost Optimization**: Heuristically filters out non-math chunks to save tokens.
*   **Math Protection**: Whitelists safe HTML tags to prevent accidental deletion of math inequalities (e.g., `a < b`).
*   **Encoding Fallback**: Automatically tries UTF-8, GBK, and Latin-1 encodings.
*   **Retry Logic**: Built-in retries for API calls to handle network instability.
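The cost-optimization feature above is a keyword gate applied before any LLM call. A minimal standalone version (the keyword set is abbreviated from the script's `MATH_KEYWORDS`; `looks_mathy` is an illustrative name):

```python
# Abbreviated keyword set; the script also includes Chinese terms and LaTeX markers
KEYWORDS = {"theorem", "definition", "lemma", "proof", "proposition", "let", "="}

def looks_mathy(chunk: str) -> bool:
    """Cheap pre-filter: only chunks containing math keywords go to the LLM."""
    lowered = chunk.lower()
    return any(k in lowered for k in KEYWORDS)
```

False positives only cost an extra API call; false negatives silently drop content, which is why the real keyword list errs on the broad side.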


---

## Referenced Files

> The following files are referenced in this skill and included for context.

### scripts/processor.py

```python
import os
import re
import json
import concurrent.futures
import requests
import logging
import argparse
from pathlib import Path

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

# Global Configuration
CONFIG = {
    'MINERU_API_KEY': os.getenv('MINERU_API_KEY', ''),
    'EXTRACTION_API_KEY': os.getenv('EXTRACTION_API_KEY', ''),
    'EXTRACTION_BASE_URL': os.getenv('EXTRACTION_BASE_URL', 'https://api.openai.com/v1'),
    'MINERU_BASE_URL': os.getenv('MINERU_BASE_URL', 'https://api.mineru.com/v1'), # Placeholder URL
    'LLM_MODEL': os.getenv('LLM_MODEL', 'gpt-4o')
}

class MathProcessor:
    def __init__(self):
        self._validate_config()

    def _validate_config(self):
        # The extraction API key is mandatory
        if not CONFIG['EXTRACTION_API_KEY']:
            msg = "Configuration Error: 'EXTRACTION_API_KEY' environment variable is missing."
            logger.error(msg)
            raise ValueError(msg)

        # Warning: without the PDF key, only text formats can be processed
        if not CONFIG['MINERU_API_KEY']:
            logger.warning("'MINERU_API_KEY' is missing. PDF conversion will fail.")

    def clean_content(self, text):
        """
        Regex cleaning for images/figures/HTML.
        Must remove "References"/"Bibliography" sections.
        """
        # Remove References/Bibliography section (from the header to the end)
        # Matches "References" or "Bibliography" on a line by itself (or with minimal whitespace)
        text = re.sub(r'(?im)^\s*(References|Bibliography)\s*$.*', '', text, flags=re.DOTALL)
        
        # Remove images/figures (markdown style ![...](...))
        text = re.sub(r'!\[.*?\]\(.*?\)', '', text)
        
        # Remove HTML tags - Use whitelist to protect math inequalities
        # Only remove specific, unsafe tags
        tags_to_remove = r'(script|style|div|span|p|br|iframe|video|img)'
        text = re.sub(r'<' + tags_to_remove + r'[^>]*>', '', text, flags=re.IGNORECASE)
        text = re.sub(r'</' + tags_to_remove + r'>', '', text, flags=re.IGNORECASE)
        
        # Remove TOC (heuristics: lines with multiple dots ...... and numbers at end)
        text = re.sub(r'(?m)^.*\.{4,}\s*\d+\s*$', '', text)
        
        return text.strip()

    def convert_pdf_to_md(self, file_path):
        """
        Uses CONFIG['MINERU_API_KEY'] to convert PDF to Markdown.
        """
        if not CONFIG['MINERU_API_KEY']:
            raise ValueError("未设定好pdf转化为md的工具 (Missing MINERU_API_KEY)")

        url = f"{CONFIG['MINERU_BASE_URL']}/pdf_to_markdown" # Hypothetical endpoint
        headers = {'Authorization': f"Bearer {CONFIG['MINERU_API_KEY']}"}
        
        try:
            logger.info(f"Converting PDF: {file_path}")
            with open(file_path, 'rb') as f:
                files = {'file': f}
                # Perform the conversion via the MinerU API
                response = requests.post(url, headers=headers, files=files, timeout=120)  # 2 min timeout for PDF
                response.raise_for_status()
                # Assumes MinerU responds with {'markdown': '...'}; adjust to the actual API
                return response.json().get('markdown', '')
        except requests.exceptions.RequestException as e:
            # Raise instead of returning the message, so the caller does not
            # feed the error text into the pipeline as document content
            error_msg = f"PDF conversion error: {e}. Please check MINERU_BASE_URL and MINERU_API_KEY."
            logger.error(error_msg)
            raise RuntimeError(error_msg)
        except Exception as e:
            logger.error(f"PDF conversion failed: {str(e)}")
            raise RuntimeError(f"PDF conversion failed: {str(e)}")

    def batch_extract(self, chunks):
        """
        Uses CONFIG['EXTRACTION_API_KEY'] and CONFIG['EXTRACTION_BASE_URL'].
        Implements concurrent.futures.ThreadPoolExecutor for speed.
        """
        if not CONFIG['EXTRACTION_API_KEY']:
            raise ValueError("Missing EXTRACTION_API_KEY")

        # Heuristic filtering to save tokens
        MATH_KEYWORDS = {
            "theorem", "definition", "lemma", "proof", "proposition", 
            "定理", "定义", "命题", "let", "assume", "suppose", "=", "\\", 
            "corollary", "推论", "example", "例"
        }

        results = [""] * len(chunks)
        chunks_to_process = []
        
        for i, chunk in enumerate(chunks):
            # Check if chunk contains any math keywords
            if any(k in chunk.lower() for k in MATH_KEYWORDS):
                chunks_to_process.append((i, chunk))
            else:
                # Skip non-math chunks
                results[i] = "" 

        if not chunks_to_process:
            logger.info("No math keywords found in chunks. Skipping extraction.")
            return ""

        logger.info(f"Processing {len(chunks_to_process)}/{len(chunks)} chunks with math content...")

        with concurrent.futures.ThreadPoolExecutor() as executor:
            future_to_index = {
                executor.submit(self._extract_chunk, chunk): i 
                for i, chunk in chunks_to_process
            }
            for future in concurrent.futures.as_completed(future_to_index):
                index = future_to_index[future]
                try:
                    results[index] = future.result()
                except Exception as e:
                    logger.error(f"Chunk {index} extraction failed: {e}")
                    results[index] = ""  # drop the failed chunk; the remaining chunks still merge
        
        return "\n\n".join(filter(None, results))

    def _extract_chunk(self, chunk, retries=3):
        headers = {
            "Authorization": f"Bearer {CONFIG['EXTRACTION_API_KEY']}",
            "Content-Type": "application/json"
        }
        data = {
            "model": CONFIG['LLM_MODEL'], # Configurable model
            "messages": [
                {"role": "system", "content": "You are a math extraction tool. Extract strictly mathematical terms (Definitions, Theorems, Lemmas, Propositions, Proofs) from the text. Keep only the math content. Do NOT change LaTeX/Code formatting. Do NOT output markdown code blocks (like ```latex). Output plain text only."},
                {"role": "user", "content": chunk}
            ]
        }
        
        for attempt in range(retries):
            try:
                response = requests.post(
                    f"{CONFIG['EXTRACTION_BASE_URL']}/chat/completions", 
                    headers=headers, 
                    json=data,
                    timeout=60 # Add timeout
                )
                response.raise_for_status()
                result = response.json()
                content = result['choices'][0]['message']['content']
                
                # Post-processing to remove potential markdown code blocks
                # Remove ```latex or ```markdown or just ``` 
                # Stronger regex to remove all code block markers
                content = re.sub(r'```[a-zA-Z]*', '', content).replace('```', '')
                
                return content.strip()
            except Exception as e:
                if attempt == retries - 1:
                    logger.error(f"Failed to extract chunk after {retries} attempts: {e}")
                    raise
                logger.warning(f"Attempt {attempt + 1} failed, retrying... Error: {e}")
                import time
                time.sleep(2) # Simple backoff

    def chunk_text(self, text, max_size=2000):
        """
        Smart chunking respecting paragraph boundaries.
        """
        # Split by 2 or more newlines to get paragraphs
        paragraphs = re.split(r'\n{2,}', text)
        chunks = []
        current_chunk = []
        current_size = 0
        
        for para in paragraphs:
            para_len = len(para)
            # If adding this paragraph exceeds max_size and we have content, yield current chunk
            if current_size + para_len > max_size and current_chunk:
                chunks.append('\n\n'.join(current_chunk))
                current_chunk = []
                current_size = 0
            
            # If a single paragraph is larger than max_size, we have to split it hard
            # or accept it being slightly larger. Here we accept it to avoid breaking formulas.
            # But if it's WAY too large (e.g. > 2*max_size), we might want to split by single newline.
            
            current_chunk.append(para)
            current_size += para_len + 2 # +2 for the newline separator
            
        if current_chunk:
            chunks.append('\n\n'.join(current_chunk))
            
        return chunks if chunks else [""]

    def process_pipeline(self, file_path, output_dir):
        """
        The main entry point.
        """
        file_path = Path(file_path)
        if not file_path.exists():
            msg = f"Error: File {file_path} not found."
            logger.error(msg)
            return msg

        # Validation
        ext = file_path.suffix.lower()
        if ext not in ['.pdf', '.md', '.tex', '.txt']:
            return "不支持当前文件格式"

        logger.info(f"Processing file: {file_path}")

        # Conversion
        content = ""
        if ext == '.pdf':
            try:
                content = self.convert_pdf_to_md(file_path)
            except Exception as e:
                return f"未设定好pdf转化为md的工具: {str(e)}"
        else:
            try:
                with open(file_path, 'r', encoding='utf-8') as f:
                    content = f.read()
            except UnicodeDecodeError:
                # Fallback chain: UTF-8 -> GBK -> Latin-1
                try:
                    logger.warning("UTF-8 decode failed, trying GBK...")
                    with open(file_path, 'r', encoding='gbk') as f:
                        content = f.read()
                except UnicodeDecodeError:
                    logger.warning("GBK decode failed, trying Latin-1...")
                    with open(file_path, 'r', encoding='latin-1') as f:
                        content = f.read()

        # Preprocessing
        logger.info("Cleaning content...")
        cleaned = self.clean_content(content)
        
        # Chunking (Smart chunking)
        logger.info("Chunking content...")
        chunks = self.chunk_text(cleaned, max_size=2000)
        
        # Extraction
        try:
            logger.info("Extracting math content...")
            extracted = self.batch_extract(chunks)
        except Exception as e:
            logger.error(f"Extraction failed: {str(e)}")
            return f"Extraction failed: {str(e)}"

        # Merge & Output
        output_dir = Path(output_dir)
        output_dir.mkdir(parents=True, exist_ok=True)
        out_path = output_dir / f"{file_path.stem}_extracted.md"
        
        logger.info(f"Saving to {out_path}...")
        with open(out_path, 'w', encoding='utf-8') as f:
            f.write(extracted)
            
        return str(out_path)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Extract math content from documents.")
    parser.add_argument("file_path", help="Path to the source file (pdf/md/tex/txt)")
    parser.add_argument("output_dir", help="Directory to save the extracted markdown")
    
    args = parser.parse_args()
    
    processor = MathProcessor()
    result = processor.process_pipeline(args.file_path, args.output_dir)
    print(result)

```
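The retry logic in `_extract_chunk` is a fixed-delay loop that re-raises on the final attempt. The same pattern in isolation (the `with_retries` helper is illustrative and not part of the script; the `call` parameter stands in for the HTTP request):

```python
import time

def with_retries(call, retries=3, delay=2.0, sleep=time.sleep):
    """Invoke `call`, retrying on any exception; re-raise after the last attempt."""
    for attempt in range(retries):
        try:
            return call()
        except Exception:
            if attempt == retries - 1:
                raise
            sleep(delay)  # fixed backoff between attempts
```

Injecting `sleep` keeps the helper testable without real delays; a production variant might use exponential backoff instead of a fixed two-second pause.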
