
scrapling-mcp

Advanced web scraping with Scrapling — MCP-native guidance for extraction, crawling, and anti-bot handling. Use via mcporter (MCP) to call the `scrapling` MCP server for execution; this skill provides strategy, recipes, and best practices.

Packaged view

This page reorganizes the original catalog entry to put fit, installability, and workflow context first. The original raw source follows below.

Stars
3,067
Hot score
99
Updated
March 20, 2026
Overall rating
C (4.6)
Composite score
4.6
Best-practice grade
C (62.8)

Install command

npx @skill-hub/cli install openclaw-skills-scrapling-mcp

Repository

openclaw/skills

Skill path: skills/devbd1/scrapling-mcp


Open repository

Best for

Primary workflow: Ship Full Stack.

Technical facets: Full Stack, Backend, Integration.

Target audience: everyone.

License: Unknown.

Original source

Catalog source: SkillHub Club.

Repository owner: openclaw.

This is a mirrored public skill entry. Review the repository before installing it into production workflows.

What it helps with

  • Install scrapling-mcp into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
  • Review https://github.com/openclaw/skills before adding scrapling-mcp to shared team environments
  • Use scrapling-mcp for development workflows

Works across

Claude Code, Codex CLI, Gemini CLI, OpenCode

Favorites: 0.

Sub-skills: 0.

Aggregator: No.

Original source / Raw SKILL.md

---
name: scrapling-mcp
description: Advanced web scraping with Scrapling — MCP-native guidance for extraction, crawling, and anti-bot handling. Use via mcporter (MCP) to call the `scrapling` MCP server for execution; this skill provides strategy, recipes, and best practices.
---

# Scrapling MCP — Web Scraping Guidance

Source repo: https://github.com/DevBD1/openclaw-skill-scrapling-mcp

> **Guidance Layer + MCP Integration**  
> Use this skill for **strategy and patterns**. For execution, call Scrapling's MCP server via `mcporter`.

## Quick Start (MCP)

### 1. Install Scrapling with MCP support
```bash
pip install scrapling[mcp]
# Or for full features:
pip install scrapling[mcp,playwright]
python -m playwright install chromium
```

### 2. Add to OpenClaw MCP config
```json
{
  "mcpServers": {
    "scrapling": {
      "command": "python",
      "args": ["-m", "scrapling.mcp"]
    }
  }
}
```

### 3. Call via mcporter
```
mcporter call scrapling fetch_page --url "https://example.com"
```

## Execution vs Guidance

| Task | Tool | Example |
|------|------|---------|
| Fetch a page | **mcporter** | `mcporter call scrapling fetch_page --url URL` |
| Extract with CSS | **mcporter** | `mcporter call scrapling css_select --selector ".title::text"` |
| Which fetcher to use? | **This skill** | See "Fetcher Selection Guide" below |
| Anti-bot strategy? | **This skill** | See "Anti-Bot Escalation Ladder" |
| Complex crawl patterns? | **This skill** | See "Spider Recipes" |

## Fetcher Selection Guide

```
┌─────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│   Fetcher       │────▶│ DynamicFetcher   │────▶│ StealthyFetcher  │
│   (HTTP)        │     │ (Browser/JS)     │     │ (Anti-bot)       │
└─────────────────┘     └──────────────────┘     └──────────────────┘
     Fastest              JS-rendered               Cloudflare, 
     Static pages         SPAs, React/Vue          Turnstile, etc.
```

### Decision Tree
1. **Static HTML?** → `Fetcher` (10-100x faster)
2. **Need JS execution?** → `DynamicFetcher`
3. **Getting blocked?** → `StealthyFetcher`
4. **Complex session?** → Use Session variants
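The decision tree above can be encoded as a small helper. This is an illustrative sketch, not part of Scrapling; only the fetcher and session class names come from the guide:

```python
def choose_fetcher(needs_js: bool = False, blocked: bool = False,
                   stateful: bool = False) -> str:
    """Return the Scrapling fetcher name suggested by the decision tree."""
    if blocked:
        base = "StealthyFetcher"      # anti-bot tier
    elif needs_js:
        base = "DynamicFetcher"       # JS-rendered pages
    else:
        base = "Fetcher"              # static HTML, fastest
    # Complex sessions map onto the Session variant of the same tier.
    if stateful:
        return {"Fetcher": "FetcherSession",
                "DynamicFetcher": "DynamicSession",
                "StealthyFetcher": "StealthySession"}[base]
    return base

print(choose_fetcher())                             # Fetcher
print(choose_fetcher(needs_js=True))                # DynamicFetcher
print(choose_fetcher(blocked=True, stateful=True))  # StealthySession
```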

### MCP Fetch Modes
- `fetch_page` — HTTP fetcher
- `fetch_dynamic` — Browser-based with Playwright
- `fetch_stealthy` — Anti-bot bypass mode

## Anti-Bot Escalation Ladder

### Level 1: Polite HTTP
```python
# MCP call: fetch_page with options
{
  "url": "https://example.com",
  "headers": {"User-Agent": "..."},
  "delay": 2.0
}
```

### Level 2: Session Persistence
```python
# Use sessions for cookie/state across requests
FetcherSession(impersonate="chrome")  # TLS fingerprint spoofing
```

### Level 3: Stealth Mode
```python
# MCP: fetch_stealthy
StealthyFetcher.fetch(
    url,
    headless=True,
    solve_cloudflare=True,  # Auto-solve Turnstile
    network_idle=True
)
```

### Level 4: Proxy Rotation
See `references/proxy-rotation.md`

## Adaptive Scraping (Anti-Fragile)

Scrapling can **survive website redesigns** using adaptive selectors:

```python
# First run — save fingerprints
products = page.css('.product', auto_save=True)

# Later runs — auto-relocate if DOM changed
products = page.css('.product', adaptive=True)
```

**MCP usage:**
```
mcporter call scrapling css_select \
  --selector ".product" \
  --adaptive true \
  --auto-save true
```

## Spider Framework (Large Crawls)

When to use Spiders vs direct fetching:
- ✅ **Spider**: 10+ pages, concurrency needed, resume capability, proxy rotation
- ✅ **Direct**: 1-5 pages, quick extraction, simple flow

### Basic Spider Pattern
```python
from scrapling.spiders import Spider, Response

class ProductSpider(Spider):
    name = "products"
    start_urls = ["https://example.com/products"]
    concurrent_requests = 10
    download_delay = 1.0
    
    async def parse(self, response: Response):
        for product in response.css('.product'):
            yield {
                "name": product.css('h2::text').get(),
                "price": product.css('.price::text').get(),
                "url": response.url
            }
        
        # Follow pagination
        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page)

# Run with resume capability
result = ProductSpider(crawldir="./crawl_data").start()
result.items.to_jsonl("products.jsonl")
```

### Advanced: Multi-Session Spider
```python
from scrapling.spiders import Spider, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession

class MultiSessionSpider(Spider):
    name = "multi"
    start_urls = ["https://example.com/"]
    
    def configure_sessions(self, manager):
        manager.add("fast", FetcherSession(impersonate="chrome"))
        manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)
    
    async def parse(self, response: Response):
        for link in response.css('a::attr(href)').getall():
            if "/protected/" in link:
                yield Request(link, sid="stealth")
            else:
                yield Request(link, sid="fast")
```

### Spider Features
- **Pause/Resume**: `crawldir` parameter saves checkpoints
- **Streaming**: `async for item in spider.stream()` for real-time processing
- **Auto-retry**: Configurable retry on blocked requests
- **Export**: Built-in `to_json()`, `to_jsonl()`
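The streaming feature can be sketched with a stub consumer. `StubSpider` here is a stand-in that mimics the described `async for item in spider.stream()` shape, not the real Spider API:

```python
import asyncio

class StubSpider:
    """Stand-in yielding items the way `spider.stream()` is described to."""
    async def stream(self):
        for i in range(3):
            await asyncio.sleep(0)  # simulate network latency
            yield {"id": i, "title": f"item-{i}"}

async def consume():
    collected = []
    async for item in StubSpider().stream():  # process items as they arrive
        collected.append(item["id"])          # e.g. send to a DB or queue
    return collected

print(asyncio.run(consume()))  # [0, 1, 2]
```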

## CLI & Interactive Shell

### Terminal Extraction (No Code)
```bash
# Extract to markdown
scrapling extract get 'https://example.com' content.md

# Extract specific element
scrapling extract get 'https://example.com' content.txt \
  --css-selector '.article' \
  --impersonate 'chrome'

# Stealth mode
scrapling extract stealthy-fetch 'https://protected.com' content.md \
  --no-headless \
  --solve-cloudflare
```

### Interactive Shell
```bash
scrapling shell

# Inside shell:
>>> page = Fetcher.get('https://example.com')
>>> page.css('h1::text').get()
>>> page.find_all('div', class_='item')
```

## Parser API (Beyond CSS/XPath)

### BeautifulSoup-Style Methods
```python
import re

# Find by attributes
page.find_all('div', {'class': 'product', 'data-id': True})
page.find_all('div', class_='product', id=re.compile(r'item-\d+'))

# Text search
page.find_by_text('Add to Cart', tag='button')
page.find_by_regex(r'\$\d+\.\d{2}')

# Navigation
first = page.css('.product')[0]
parent = first.parent
siblings = first.next_siblings
children = first.children

# Similarity
similar = first.find_similar()  # Find visually/structurally similar elements
below = first.below_elements()  # Elements below in DOM
```

### Auto-Generated Selectors
```python
# Get robust selector for any element
element = page.css('.product')[0]
selector = element.auto_css_selector()  # Returns stable CSS path
xpath = element.auto_xpath()
```

## Proxy Rotation

```python
from scrapling.spiders import ProxyRotator

# Cyclic rotation
rotator = ProxyRotator([
    "http://proxy1:8080",
    "http://proxy2:8080",
    "http://user:pass@proxy3:8080"
], strategy="cyclic")

# Use with any session
with FetcherSession(proxy=rotator.next()) as session:
    page = session.get('https://example.com')
```

## Common Recipes

### Pagination Patterns
```python
# Page numbers
for page_num in range(1, 11):
    url = f"https://example.com/products?page={page_num}"
    ...

# Next button (inside a spider's parse(); each follow re-enters parse
# with the next response, so use `if`, not a loop over the same response)
next_page = response.css('.next a::attr(href)').get()
if next_page:
    yield response.follow(next_page)

# Infinite scroll (DynamicFetcher)
with DynamicSession() as session:
    page = session.fetch(url)
    page.scroll_to_bottom()
    items = page.css('.item').getall()
```

### Login Sessions
```python
with StealthySession(headless=False) as session:
    # Login
    login_page = session.fetch('https://example.com/login')
    login_page.fill('input[name="username"]', 'user')
    login_page.fill('input[name="password"]', 'pass')
    login_page.click('button[type="submit"]')
    
    # Now session has cookies
    protected_page = session.fetch('https://example.com/dashboard')
```

### Next.js Data Extraction
```python
# Extract JSON from __NEXT_DATA__
import json
import re

next_data = json.loads(
    re.search(
        r'__NEXT_DATA__" type="application/json">(.*?)</script>',
        page.html_content,
        re.S
    ).group(1)
)
props = next_data['props']['pageProps']
```

## Output Formats

```python
# JSON (pretty)
result.items.to_json('output.json')

# JSONL (streaming, one per line)
result.items.to_jsonl('output.jsonl')

# Python objects
for item in result.items:
    print(item['title'])
```

## Performance Tips

1. **Use HTTP fetcher when possible** — 10-100x faster than browser
2. **Impersonate browsers** — `impersonate='chrome'` for TLS fingerprinting
3. **HTTP/3 support** — `FetcherSession(http3=True)`
4. **Limit resources** — `disable_resources=True` in Dynamic/Stealthy
5. **Connection pooling** — Reuse sessions across requests

## Guardrails (Always)

- Only scrape content you're authorized to access
- Respect robots.txt and ToS
- Add delays (`download_delay`) for large crawls
- Don't bypass paywalls or authentication without permission
- Never scrape personal/sensitive data

## References

- `references/mcp-setup.md` — Detailed MCP configuration
- `references/anti-bot.md` — Anti-bot handling strategies
- `references/proxy-rotation.md` — Proxy setup and rotation
- `references/spider-recipes.md` — Advanced crawling patterns
- `references/api-reference.md` — Quick API reference
- `references/links.md` — Official docs links

## Scripts

- `scripts/scrapling_scrape.py` — Quick one-off extraction
- `scripts/scrapling_smoke_test.py` — Test connectivity and anti-bot indicators


---

## Referenced Files

> The following files are referenced in this skill and included for context.

### references/proxy-rotation.md

```markdown
# Proxy Rotation

## When to Use

- Rate limiting on target sites
- Geographic restrictions
- Large-scale crawls (100+ pages)
- Anti-bot evasion (with permission)

## ProxyRotator Setup

```python
from scrapling.spiders import ProxyRotator

# Cyclic rotation (round-robin)
rotator = ProxyRotator([
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
], strategy="cyclic")

# Get next proxy
proxy = rotator.next()
```

## With Sessions

```python
from scrapling.fetchers import FetcherSession

with FetcherSession(proxy=rotator.next()) as session:
    page = session.get('https://example.com')
```

## With Spiders

```python
from scrapling.spiders import Spider
from scrapling.fetchers import FetcherSession

class RotatingSpider(Spider):
    name = "rotating"
    
    def __init__(self):
        self.rotator = ProxyRotator([
            "http://proxy1:8080",
            "http://proxy2:8080",
        ])
    
    def configure_sessions(self, manager):
        manager.add("rotating", FetcherSession())
    
    async def parse(self, response):
        # Access current session and rotate proxy
        self.sessions["rotating"].proxy = self.rotator.next()
        # ... continue parsing
```

## Custom Rotation Strategy

```python
import random

class RandomRotator:
    def __init__(self, proxies):
        self.proxies = proxies
    
    def next(self):
        return random.choice(self.proxies)

rotator = RandomRotator(["http://p1:8080", "http://p2:8080"])
```

## Provider Recommendations

See Sponsors in Scrapling README for proxy providers:
- Scrapeless
- ThorData
- Evomi
- Decodo
- ProxyEmpire
- SwiftProxy
- RapidProxy

## Best Practices

1. **Test proxies first** — Check latency and success rate
2. **Rotate on failure** — Switch proxy after 3 consecutive failures
3. **Respect rate limits** — Don't use proxies to bypass reasonable limits
4. **Session affinity** — Keep same proxy for session duration when possible
5. **Monitor usage** — Track success rates per proxy
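Tip 2 (rotate on failure) can be sketched as a thin wrapper with the same `next()`-style interface; `FailoverRotator` is illustrative, not part of Scrapling:

```python
class FailoverRotator:
    """Round-robin rotator that advances after N consecutive failures."""

    def __init__(self, proxies, max_failures=3):
        self.proxies = list(proxies)
        self.max_failures = max_failures
        self.index = 0
        self.failures = 0

    def current(self):
        return self.proxies[self.index]

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            # Switch proxy and reset the failure counter.
            self.index = (self.index + 1) % len(self.proxies)
            self.failures = 0

    def record_success(self):
        self.failures = 0

rotator = FailoverRotator(["http://p1:8080", "http://p2:8080"])
for _ in range(3):        # three consecutive failures trigger rotation
    rotator.record_failure()
print(rotator.current())  # http://p2:8080
```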

```

### references/mcp-setup.md

```markdown
# MCP Server Setup

## Installation

```bash
# Base MCP support
pip install scrapling[mcp]

# With browser automation
pip install scrapling[mcp,playwright]
python -m playwright install chromium
```

## OpenClaw Configuration

Add to your OpenClaw MCP config:

```json
{
  "mcpServers": {
    "scrapling": {
      "command": "python",
      "args": ["-m", "scrapling.mcp"]
    }
  }
}
```

Or with environment variables:

```json
{
  "mcpServers": {
    "scrapling": {
      "command": "python",
      "args": ["-m", "scrapling.mcp"],
      "env": {
        "PYTHONPATH": "/path/to/venv/lib/python3.x/site-packages"
      }
    }
  }
}
```

## Available Tools

### fetch_page
Fast HTTP fetch for static pages.

```json
{
  "url": "https://example.com",
  "headers": {"User-Agent": "..."},
  "timeout": 30,
  "impersonate": "chrome"
}
```

### fetch_dynamic
Browser-based fetch for JS-rendered content.

```json
{
  "url": "https://spa.example.com",
  "headless": true,
  "network_idle": true,
  "wait_for": "selector",
  "timeout": 30000
}
```

### fetch_stealthy
Anti-bot fetch with Cloudflare bypass.

```json
{
  "url": "https://protected.example.com",
  "headless": true,
  "solve_cloudflare": true,
  "google_search": false
}
```

### css_select
Extract data using CSS selectors.

```json
{
  "html": "<html>...</html>",
  "selector": ".product .title::text",
  "first_only": false,
  "adaptive": false,
  "auto_save": false
}
```

### xpath_select
Extract data using XPath.

```json
{
  "html": "<html>...</html>",
  "xpath": "//div[@class='product']/h2/text()",
  "first_only": false
}
```

### start_spider
Run a spider crawl.

```json
{
  "name": "my_spider",
  "start_urls": ["https://example.com/products"],
  "concurrent_requests": 10,
  "download_delay": 1.0,
  "crawldir": "./crawl_data"
}
```

## Usage via mcporter

```bash
# Fetch a page
mcporter call scrapling fetch_page --url "https://example.com"

# Extract with CSS
mcporter call scrapling css_select \
  --html "$(cat page.html)" \
  --selector ".title::text"

# Stealth fetch
mcporter call scrapling fetch_stealthy \
  --url "https://protected.com" \
  --solve-cloudflare true
```

## Benefits of MCP Mode

1. **Reduced Token Usage** — Scrapling extracts data BEFORE passing to AI
2. **Faster Operations** — Direct function calls vs text generation
3. **Structured Output** — JSON responses, not parsed text
4. **Error Handling** — Proper exceptions and status codes

## Demo Video

See Scrapling MCP in action: https://www.youtube.com/watch?v=qyFk3ZNwOxE

```

### references/anti-bot.md

```markdown
# Anti-bot handling (permissioned)

This is about *reliability for legitimate automation* (your own sites, explicit permission, approved internal tooling).

## Escalation ladder
1) **Fetcher/FetcherSession** (fast HTTP)
   - Add delays / retry logic
   - Use session cookies where applicable

2) **DynamicSession** (JS-rendered)
   - Use `network_idle=True` or appropriate waits
   - Consider loading fewer resources if supported

3) **StealthySession** (protected pages)
   - Use when you see:
     - 403/429 patterns
     - Cloudflare/Turnstile interstitial pages
     - "verify you are human" flows

## Practical tips
- Lower concurrency, add jittered sleeps.
- Persist sessions/cookies across requests.
- Rotate proxies **only if** you have the right to access and you’re being rate-limited, not to evade restrictions.
- Always capture the HTML/screenshot of the block page for debugging.
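A fixed `sleep(2)` between requests is an easily detected pattern; jittered sleeps spread the delay around a base value. A minimal sketch (`polite_delay` is illustrative, not a Scrapling API):

```python
import random
import time

def polite_delay(base: float = 2.0, jitter: float = 0.5) -> float:
    """Sleep for base +/- jitter seconds; return the delay actually used."""
    delay = max(base + random.uniform(-jitter, jitter), 0.0)
    time.sleep(delay)
    return delay

# Between requests:
# for url in urls:
#     page = session.get(url)
#     polite_delay(base=2.0, jitter=0.5)
```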

## Minimal StealthySession example
```python
from scrapling.fetchers import StealthySession

with StealthySession(headless=True, solve_cloudflare=True) as session:
    page = session.fetch("https://example.com", google_search=False)
    title = page.css("title::text").get()
    print(title)
```

## What not to do
- Do not bypass paywalls or private/login-only content without authorization.
- Do not attempt to scrape sensitive personal data.

```

### references/spider-recipes.md

```markdown
# Spider Recipes

## Recipe 1: E-commerce Product Crawler

```python
from datetime import datetime

from scrapling.spiders import Spider, Response

class ProductSpider(Spider):
    name = "products"
    start_urls = ["https://shop.example.com/products"]
    concurrent_requests = 5
    download_delay = 2.0
    
    custom_settings = {
        "RETRY_TIMES": 3,
        "RETRY_DELAY": 5,
    }
    
    async def parse(self, response: Response):
        # Extract products
        for product in response.css('.product-card'):
            yield {
                "name": product.css('.product-name::text').get(),
                "price": product.css('.price::text').get(),
                "rating": product.css('.rating::attr(data-score)').get(),
                "url": product.css('a::attr(href)').get(),
                "extracted_at": datetime.now().isoformat(),
            }
        
        # Follow pagination
        next_page = response.css('.pagination .next::attr(href)').get()
        if next_page:
            yield response.follow(next_page)

# Run with resume
result = ProductSpider(crawldir="./crawl_products").start()
result.items.to_jsonl("products.jsonl")
```

## Recipe 2: Sitemap Crawler

```python
import xml.etree.ElementTree as ET

from scrapling.spiders import Spider, Request
from scrapling.fetchers import Fetcher

class SitemapSpider(Spider):
    name = "sitemap"
    
    def start_requests(self):
        # Fetch sitemap
        sitemap = Fetcher.get('https://example.com/sitemap.xml')
        root = ET.fromstring(sitemap.body)
        
        for url in root.findall('.//{http://www.sitemaps.org/schemas/sitemap/0.9}loc'):
            yield Request(url.text, callback=self.parse_page)
    
    async def parse_page(self, response):
        yield {
            "url": response.url,
            "title": response.css('h1::text').get(),
            "word_count": len(response.css('p::text').getall()),
        }
```

## Recipe 3: API-First Spider

```python
from scrapling.spiders import Spider, Request

class APISpider(Spider):
    """Scrape via API endpoints instead of HTML."""
    
    name = "api_products"
    api_base = "https://api.example.com/v1"
    
    def start_requests(self):
        for page in range(1, 100):
            yield Request(
                f"{self.api_base}/products?page={page}",
                headers={"Authorization": "Bearer TOKEN"},
            )
    
    async def parse(self, response):
        data = response.json()
        for product in data['products']:
            yield product
        
        # Stop when no more results (bare `return` ends an async generator;
        # raising StopIteration inside one is a RuntimeError)
        if not data['products']:
            return
```

## Recipe 4: Multi-Domain Crawl with Different Handlers

```python
from urllib.parse import urlparse

from scrapling.spiders import Spider

class MultiDomainSpider(Spider):
    name = "multi"
    
    domain_handlers = {
        "shop.example.com": "parse_shop",
        "blog.example.com": "parse_blog",
    }
    
    async def parse(self, response):
        domain = urlparse(response.url).netloc
        handler = getattr(self, self.domain_handlers.get(domain, "parse_default"))
        async for item in handler(response):
            yield item
    
    async def parse_shop(self, response):
        for product in response.css('.product'):
            yield {"type": "product", "data": product.css('h2::text').get()}
    
    async def parse_blog(self, response):
        for article in response.css('article'):
            yield {"type": "article", "data": article.css('h1::text').get()}
```

## Recipe 5: Streaming with Real-time Processing

```python
class StreamingSpider(Spider):
    name = "streaming"
    start_urls = ["https://example.com/items"]
    
    async def parse(self, response):
        for item in response.css('.item'):
            yield {
                "id": item.css('::attr(data-id)').get(),
                "content": item.css('.content::text').get(),
            }

# Consume as stream
spider = StreamingSpider()
async for item in spider.stream():
    print(f"Got item: {item['id']}")
    # Send to database, queue, etc.
```

## Recipe 6: Deep Crawl with Depth Limit

```python
class DeepCrawlSpider(Spider):
    name = "deep_crawl"
    start_urls = ["https://example.com"]
    max_depth = 3
    
    async def parse(self, response):
        depth = response.meta.get('depth', 0)
        
        yield {
            "url": response.url,
            "depth": depth,
            "title": response.css('title::text').get(),
        }
        
        if depth < self.max_depth:
            for link in response.css('a::attr(href)').getall():
                yield response.follow(link, meta={'depth': depth + 1})
```

## Recipe 7: Form Submission Spider

```python
from scrapling.spiders import Spider, FormRequest

class FormSpider(Spider):
    name = "form_search"
    
    def start_requests(self):
        for keyword in ['python', 'scraping', 'data']:
            yield FormRequest(
                url="https://example.com/search",
                formdata={"q": keyword, "category": "all"},
                callback=self.parse_results,
                meta={"keyword": keyword}
            )
    
    async def parse_results(self, response):
        keyword = response.meta['keyword']
        for result in response.css('.search-result'):
            yield {
                "keyword": keyword,
                "title": result.css('h3::text').get(),
                "snippet": result.css('.snippet::text').get(),
            }
```

## Recipe 8: Image/Media Downloader

```python
import os
from urllib.parse import urlparse

from scrapling.spiders import Spider

class MediaSpider(Spider):
    name = "media"
    download_dir = "./downloads"
    
    async def parse(self, response):
        for img in response.css('img::attr(src)').getall():
            ext = os.path.splitext(urlparse(img).path)[1] or '.jpg'
            filename = f"{hash(img)}{ext}"
            yield response.follow(img, callback=self.save_image, meta={'filename': filename})
    
    async def save_image(self, response):
        path = os.path.join(self.download_dir, response.meta['filename'])
        with open(path, 'wb') as f:
            f.write(response.body)
        yield {"downloaded": path}
```

## Pause/Resume Best Practices

```python
# Start with checkpoint directory
spider = MySpider(crawldir="./crawl_checkpoint")
result = spider.start()

# If interrupted, restart with same crawldir
# Spider automatically resumes from last checkpoint

# To reset and start fresh:
import shutil
shutil.rmtree("./crawl_checkpoint")
spider.start()
```

## Error Handling Patterns

```python
class RobustSpider(Spider):
    name = "robust"
    
    async def parse(self, response):
        try:
            title = response.css('h1::text').get()
            if not title:
                self.logger.warning(f"No title found: {response.url}")
                return
            
            yield {"title": title, "url": response.url}
            
        except Exception as e:
            self.logger.error(f"Error parsing {response.url}: {e}")
            # Spider continues with other requests
```

```

### references/links.md

```markdown
# Scrapling references (quick links)

- Repo: https://github.com/D4Vinci/Scrapling
- Docs: https://scrapling.readthedocs.io

Useful docs sections:
- Fetchers overview: https://scrapling.readthedocs.io/en/latest/fetching/choosing/
- Selection methods (CSS/XPath/text/regex): https://scrapling.readthedocs.io/en/latest/parsing/selection/
- Spiders architecture: https://scrapling.readthedocs.io/en/latest/spiders/architecture.html
- Proxy rotation / blocking: https://scrapling.readthedocs.io/en/latest/spiders/proxy-blocking.html
- CLI overview: https://scrapling.readthedocs.io/en/latest/cli/overview/
- MCP server (AI integration): https://scrapling.readthedocs.io/en/latest/ai/mcp-server/

```

### scripts/scrapling_scrape.py

```python
#!/usr/bin/env python3
"""Quick one-off scraping helper using Scrapling.

Examples:
  python3 scrapling_scrape.py --url "https://quotes.toscrape.com" --css ".quote .text::text"
  python3 scrapling_scrape.py --url "https://example.com" --xpath "//h1/text()" --mode dynamic --headless
  python3 scrapling_scrape.py --url "https://example.com" --css "title::text" --mode stealthy --headless --solve-cloudflare

Notes:
- For anything beyond small, explicit extraction, prefer Scrapling Spiders.
- Some flags (adaptive/auto-save) depend on the installed Scrapling version.
"""

from __future__ import annotations

import argparse
import json
import sys
from typing import Any


def _die(msg: str, code: int = 2) -> None:
    print(msg, file=sys.stderr)
    raise SystemExit(code)


def _select(page: Any, *, css: str | None, xpath: str | None, adaptive: bool, auto_save: bool) -> Any:
    if css:
        # Try to pass adaptive/auto_save if supported; fall back if not.
        try:
            return page.css(css, adaptive=adaptive or None, auto_save=auto_save or None)
        except TypeError:
            return page.css(css)
    else:
        try:
            return page.xpath(xpath, adaptive=adaptive or None, auto_save=auto_save or None)  # type: ignore[arg-type]
        except TypeError:
            return page.xpath(xpath)  # type: ignore[arg-type]


def main() -> None:
    p = argparse.ArgumentParser()
    p.add_argument("--url", required=True)
    p.add_argument("--mode", choices=["fetcher", "dynamic", "stealthy"], default="fetcher")
    p.add_argument("--css", help="CSS selector (supports ::text and ::attr())")
    p.add_argument("--xpath", help="XPath selector")
    p.add_argument("--first", action="store_true", help="Return only the first match")
    p.add_argument("--headless", action="store_true", help="Headless browser (dynamic/stealthy)")
    p.add_argument("--solve-cloudflare", action="store_true", help="Attempt to solve Cloudflare (stealthy session)")
    p.add_argument("--network-idle", action="store_true", help="Wait for network idle (dynamic session)")
    p.add_argument("--adaptive", action="store_true", help="Use adaptive selectors (if supported)")
    p.add_argument("--auto-save", action="store_true", help="Auto-save selector fingerprints (if supported)")
    p.add_argument("--pretty", action="store_true", help="Pretty-print JSON")
    args = p.parse_args()

    if not args.css and not args.xpath:
        _die("Provide --css or --xpath")

    url = args.url

    try:
        # Sessions are more reliable than one-shot fetchers for anything non-trivial.
        from scrapling.fetchers import FetcherSession, DynamicSession, StealthySession
    except Exception:
        _die(
            "Scrapling is not installed in this Python environment. Try:\n"
            "  python3 -m pip install scrapling\n"
            "If you need browser-based fetching, you may also need:\n"
            "  python3 -m playwright install chromium"
        )

    if args.mode == "fetcher":
        with FetcherSession(impersonate="chrome") as session:
            page = session.get(url, stealthy_headers=True)

    elif args.mode == "dynamic":
        with DynamicSession(headless=args.headless, network_idle=args.network_idle) as session:
            page = session.fetch(url)

    else:
        # Use only when authorized.
        with StealthySession(headless=args.headless, solve_cloudflare=args.solve_cloudflare) as session:
            page = session.fetch(url, google_search=False)

    out = _select(page, css=args.css, xpath=args.xpath, adaptive=args.adaptive, auto_save=args.auto_save)

    result: Any
    if args.first:
        result = out.get()
    else:
        result = out.getall()

    payload = {
        "url": url,
        "mode": args.mode,
        "options": {
            "headless": bool(args.headless),
            "solve_cloudflare": bool(args.solve_cloudflare),
            "network_idle": bool(args.network_idle),
            "adaptive": bool(args.adaptive),
            "auto_save": bool(args.auto_save),
        },
        "selector": {"css": args.css, "xpath": args.xpath},
        "result": result,
    }

    if args.pretty:
        print(json.dumps(payload, ensure_ascii=False, indent=2))
    else:
        print(json.dumps(payload, ensure_ascii=False, separators=(",", ":")))


if __name__ == "__main__":
    main()

```

### scripts/scrapling_smoke_test.py

```python
#!/usr/bin/env python3
"""Scrapling smoke test / mini-extractor.

Usage:
  python scrapling_smoke_test.py URL [URL ...] --fetcher fetcher|stealthy|dynamic [--extract next_data]

Notes:
- Dynamic fetcher requires Playwright browsers: `playwright install chromium`.
"""

from __future__ import annotations

import argparse
import json
import re
import sys
from typing import Any, Iterable


def _pick_fetcher(kind: str):
    kind = kind.lower().strip()
    if kind == "fetcher":
        from scrapling.fetchers import Fetcher

        return ("Fetcher", lambda url, **kw: Fetcher.get(url, **kw))
    if kind == "stealthy":
        from scrapling.fetchers import StealthyFetcher

        return (
            "StealthyFetcher",
            lambda url, **kw: StealthyFetcher.fetch(url, **kw),
        )
    if kind == "dynamic":
        from scrapling.fetchers import DynamicFetcher

        return (
            "DynamicFetcher",
            lambda url, **kw: DynamicFetcher.fetch(url, **kw),
        )
    raise SystemExit(f"Unknown --fetcher kind: {kind}")


def _extract_next_data(html: str) -> dict[str, Any] | None:
    m = re.search(
        r'__NEXT_DATA__" type="application/json">(.*?)</script>', html, re.S
    )
    if not m:
        return None
    try:
        return json.loads(m.group(1))
    except json.JSONDecodeError:
        return None


def _find_strings(obj: Any, needles: Iterable[str]) -> list[tuple[str, str]]:
    needles = list(needles)
    out: list[tuple[str, str]] = []

    def walk(x: Any, path: str):
        if isinstance(x, dict):
            for k, v in x.items():
                walk(v, f"{path}/{k}")
        elif isinstance(x, list):
            for i, v in enumerate(x):
                walk(v, f"{path}[{i}]")
        else:
            if isinstance(x, str) and any(n in x for n in needles):
                out.append((path, x))

    walk(obj, "")
    return out


def main() -> int:
    ap = argparse.ArgumentParser()
    ap.add_argument("urls", nargs="+", help="One or more URLs")
    ap.add_argument(
        "--fetcher",
        default="fetcher",
        choices=["fetcher", "stealthy", "dynamic"],
        help="Which Scrapling fetcher to use",
    )
    ap.add_argument(
        "--extract",
        default="none",
        choices=["none", "next_data"],
        help="Optional specialized extraction",
    )
    ap.add_argument("--timeout", type=int, default=60, help="Timeout seconds")
    ap.add_argument(
        "--headless",
        action=argparse.BooleanOptionalAction,
        default=True,
        help="(Dynamic/Stealthy) headless browser; pass --no-headless to disable",
    )

    args = ap.parse_args()

    fetcher_name, fetch = _pick_fetcher(args.fetcher)

    for url in args.urls:
        print(f"\n=== {url}")
        print(f"fetcher: {fetcher_name}")

        kw = {}
        if args.fetcher in ("stealthy", "dynamic"):
            kw.update(
                {
                    "headless": bool(args.headless),
                    "network_idle": True,
                }
            )
        if args.fetcher == "fetcher":
            kw.update({"timeout": args.timeout})
        else:
            # milliseconds for Playwright-based fetchers
            kw.update({"timeout": args.timeout * 1000})

        try:
            resp = fetch(url, **kw)
        except Exception as e:
            print(f"ERROR: {type(e).__name__}: {e}")
            continue

        status = getattr(resp, "status", None)
        html = getattr(resp, "html_content", None) or ""

        print("status:", status)
        print("html_len:", len(html))

        # quick indicators
        low = html.lower()
        for needle in [
            "something went wrong",
            "enable javascript",
            "turnstile",
            "captcha",
            "access denied",
        ]:
            if needle in low:
                print("indicator:", needle)

        # meta/title
        try:
            title = resp.css("title::text").get()
        except Exception:
            title = None
        if title:
            print("title:", title[:240])

        if args.extract == "next_data":
            nd = _extract_next_data(html)
            if not nd:
                print("next_data: not found")
            else:
                # Example needles (Turkish terms from the original debugging
                # session) — replace with strings relevant to your target site.
                hits = _find_strings(nd, ["Yapay", "Üretken", "Atölye", "Atolye"])
                print("next_data: found, hits:", len(hits))
                for p, s in hits[:25]:
                    s1 = s.replace("\n", " ").strip()
                    print(" ", p, "=>", s1[:240])

    return 0


if __name__ == "__main__":
    raise SystemExit(main())

```



---

## Skill Companion Files

> Additional files collected from the skill directory layout.

### _meta.json

```json
{
  "owner": "devbd1",
  "slug": "scrapling-mcp",
  "displayName": "Scrapling MCP",
  "latest": {
    "version": "0.1.2",
    "publishedAt": 1772793813894,
    "commit": "https://github.com/openclaw/skills/commit/1994d74e8272d3e307bb78c7c3b622b39887283a"
  },
  "history": []
}

```

### references/api_reference.md

```markdown
# Scrapling quick reference (practical)

## Fetchers

- `scrapling.fetchers.Fetcher`
  - Best first try (fast HTTP fetch)
  - Common use: `Fetcher.get(url, timeout=...)`

- `scrapling.fetchers.StealthyFetcher`
  - Browser-backed stealth mode for anti-bot friction
  - Common use: `StealthyFetcher.fetch(url, headless=True, network_idle=True, timeout=...)`

- `scrapling.fetchers.DynamicFetcher`
  - Playwright-backed browser automation for JS-heavy sites
  - Common use: `DynamicFetcher.fetch(url, headless=True, network_idle=True, timeout=...)`
  - Requires: `pip install playwright` + `playwright install chromium`

## Response object (common fields)

Scrapling responses are not `requests.Response`.
Common useful attributes/methods:

- `response.status` (int)
- `response.url` (str)
- `response.html_content` (str) — decoded HTML (preferred)
- `response.body` (bytes) — raw bytes
- `response.css(selector)` / `response.xpath(expr)` — selection API
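
Since these responses are not `requests.Response` objects, a small helper can normalize them into a plain dict. This is a sketch; the field names are assumed from the list above, and `getattr` defaults keep it tolerant of partially-populated responses:

```python
def summarize(resp) -> dict:
    # Access Scrapling response fields defensively with getattr,
    # so the helper also tolerates partially-populated responses.
    html = getattr(resp, "html_content", "") or ""
    return {
        "status": getattr(resp, "status", None),
        "url": getattr(resp, "url", None),
        "html_len": len(html),
    }
```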

## Next.js extraction

Many Next.js sites embed JSON at:

- `<script id="__NEXT_DATA__" type="application/json"> ... </script>`

Parse that JSON and read `props.pageProps...`.

## Practical escalation ladder

1) `Fetcher`
2) `StealthyFetcher`
3) `DynamicFetcher`
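
The ladder can be expressed as a small driver that tries each fetcher in order and escalates on errors or empty HTML. This is a sketch where the fetch callables are injected, so the escalation logic stays independent of Scrapling itself:

```python
def escalate(url, attempts):
    """attempts: ordered list of (name, fetch_callable) pairs.

    Returns (name, response) for the first fetch that yields status 200
    and non-empty HTML, or (None, None) if every rung fails.
    """
    for name, fetch in attempts:
        try:
            resp = fetch(url)
        except Exception:
            continue  # escalate to the next, heavier fetcher
        html = getattr(resp, "html_content", "") or ""
        if getattr(resp, "status", None) == 200 and html:
            return name, resp
    return None, None
```

With Scrapling installed this could be wired as, for example, `attempts=[("Fetcher", lambda u: Fetcher.get(u, timeout=60)), ("StealthyFetcher", lambda u: StealthyFetcher.fetch(u, headless=True, network_idle=True, timeout=60_000)), ...]`, keeping in mind the seconds-vs-milliseconds timeout difference noted above.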

```

### references/recipes.md

```markdown
# Scrapling recipes

## Extract text list (CSS)
```python
quotes = page.css('.quote .text::text').getall()
```

## Extract links (href)
```python
links = page.css('a::attr(href)').getall()
```

## Extract first match
```python
h1 = page.css('h1::text').get()
```

## JSON/JSONL export (Spider result)
Scrapling spiders commonly return a result object with items you can serialize.

Typical pattern:
```python
result = MySpider().start()
result.items.to_jsonl('out.jsonl')
```
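
If the spider result doesn't expose `to_jsonl`, a plain fallback writer works on any iterable of dicts (a hedged sketch, not Scrapling API):

```python
import json


def write_jsonl(items, path):
    # One JSON object per line; keep non-ASCII characters readable.
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")
```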

## Adaptive selection
```python
els = page.css('.product', auto_save=True)
# later
els = page.css('.product', adaptive=True)
```

```
