
crawl4ai

This skill should be used when users need to scrape websites, extract structured data, handle JavaScript-heavy pages, crawl multiple URLs, or build automated web data pipelines. Includes optimized extraction patterns with schema generation for efficient, LLM-free extraction.

Packaged view

This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.

Stars: 221
Hot score: 97
Updated: March 20, 2026
Overall rating: C (3.1)
Composite score: 3.1
Best-practice grade: C (64.8)

Install command

npx @skill-hub/cli install aiskillstore-marketplace-crawl4ai

Repository

aiskillstore/marketplace

Skill path: skills/smallnest/crawl4ai


Open repository

Best for

Primary workflow: Analyze Data & AI.

Technical facets: Full Stack, Data / AI.

Target audience: everyone.

License: Unknown.

Original source

Catalog source: SkillHub Club.

Repository owner: aiskillstore.

This is a mirrored public skill entry. Review the repository before installing it into production workflows.

What it helps with

  • Install crawl4ai into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
  • Review https://github.com/aiskillstore/marketplace before adding crawl4ai to shared team environments
  • Use crawl4ai in web-scraping and data-pipeline development workflows

Works across

Claude Code, Codex CLI, Gemini CLI, OpenCode

Favorites: 0.

Sub-skills: 0.

Aggregator: No.

Original source / Raw SKILL.md

---
name: crawl4ai
description: This skill should be used when users need to scrape websites, extract structured data, handle JavaScript-heavy pages, crawl multiple URLs, or build automated web data pipelines. Includes optimized extraction patterns with schema generation for efficient, LLM-free extraction.
---

# Crawl4AI

## Overview

Crawl4AI provides comprehensive web crawling and data extraction capabilities. This skill supports both **CLI** (recommended for quick tasks) and **Python SDK** (for programmatic control).

**Choose your interface:**
- **CLI** (`crwl`) - Quick, scriptable commands: [CLI Guide](references/cli-guide.md)
- **Python SDK** - Full programmatic control: [SDK Guide](references/sdk-guide.md)

---

## Quick Start

### Installation

```bash
pip install crawl4ai
crawl4ai-setup

# Verify installation
crawl4ai-doctor
```

### CLI (Recommended)

```bash
# Basic crawling - returns markdown
crwl https://example.com

# Explicitly request markdown output
crwl https://example.com -o markdown

# JSON output with cache bypass
crwl https://example.com -o json -v --bypass-cache

# See more examples
crwl --example
```

### Python SDK

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown[:500])

asyncio.run(main())
```

For SDK configuration details: [SDK Guide - Configuration](references/sdk-guide.md#configuration) (lines 61-150)

---

## Core Concepts

### Configuration Layers

Both CLI and SDK use the same underlying configuration:

| Concept | CLI | SDK |
|---------|-----|-----|
| Browser settings | `-B browser.yml` or `-b "param=value"` | `BrowserConfig(...)` |
| Crawl settings | `-C crawler.yml` or `-c "param=value"` | `CrawlerRunConfig(...)` |
| Extraction | `-e extract.yml -s schema.json` | `extraction_strategy=...` |
| Content filter | `-f filter.yml` | `markdown_generator=...` |

### Key Parameters

**Browser Configuration:**
- `headless`: Run with/without GUI
- `viewport_width/height`: Browser dimensions
- `user_agent`: Custom user agent
- `proxy_config`: Proxy settings

**Crawler Configuration:**
- `page_timeout`: Max page load time (ms)
- `wait_for`: CSS selector or JS condition to wait for
- `cache_mode`: bypass, enabled, disabled
- `js_code`: JavaScript to execute
- `css_selector`: Focus on specific element

For complete parameters: [CLI Config](references/cli-guide.md#configuration) | [SDK Config](references/sdk-guide.md#configuration)

### Output Content

Every crawl returns:
- **markdown** - Clean, formatted markdown
- **html** - Raw HTML
- **links** - Internal and external links discovered
- **media** - Images, videos, audio found
- **extracted_content** - Structured data (if extraction configured)

---

## Markdown Generation (Primary Use Case)

Crawl4AI excels at generating clean, well-formatted markdown:

### CLI

```bash
# Basic markdown
crwl https://docs.example.com -o markdown

# Filtered markdown (removes noise)
crwl https://docs.example.com -o markdown-fit

# With content filter
crwl https://docs.example.com -f filter_bm25.yml -o markdown-fit
```

**Filter configuration:**
```yaml
# filter_bm25.yml (relevance-based)
type: "bm25"
query: "machine learning tutorials"
threshold: 1.0
```

### Python SDK

```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

bm25_filter = BM25ContentFilter(user_query="machine learning", bm25_threshold=1.0)
md_generator = DefaultMarkdownGenerator(content_filter=bm25_filter)

config = CrawlerRunConfig(markdown_generator=md_generator)
result = await crawler.arun(url, config=config)

print(result.markdown.fit_markdown)  # Filtered
print(result.markdown.raw_markdown)  # Original
```

For content filters: [Content Processing](references/complete-sdk-reference.md#content-processing) (lines 2481-3101)

---

## Data Extraction

### 1. Schema-Based CSS Extraction (Most Efficient)

**No LLM required** - fast, deterministic, cost-free.

**CLI:**
```bash
# Generate schema once (uses LLM)
python scripts/extraction_pipeline.py --generate-schema https://shop.com "extract products"

# Use schema for extraction (no LLM)
crwl https://shop.com -e extract_css.yml -s product_schema.json -o json
```

**Schema format:**
```json
{
  "name": "products",
  "baseSelector": ".product-card",
  "fields": [
    {"name": "title", "selector": "h2", "type": "text"},
    {"name": "price", "selector": ".price", "type": "text"},
    {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
  ]
}
```
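Before wiring a schema into a crawl, it can help to sanity-check its shape. The following stdlib-only sketch (no crawl4ai import; the validation rules are an assumption inferred from the format above, not the library's own validator) checks the keys the schema format uses:

```python
import json

def validate_schema(raw: str) -> list[str]:
    """Return a list of problems found in a crawl4ai-style CSS schema."""
    schema = json.loads(raw)
    problems = []
    # Top-level keys used by the schema format shown above
    for key in ("name", "baseSelector", "fields"):
        if key not in schema:
            problems.append(f"missing top-level key: {key}")
    for i, field in enumerate(schema.get("fields", [])):
        if "name" not in field or "selector" not in field:
            problems.append(f"field {i}: needs 'name' and 'selector'")
        # Attribute extraction needs to know which attribute to read
        if field.get("type") == "attribute" and "attribute" not in field:
            problems.append(f"field {i}: type 'attribute' needs an 'attribute' key")
    return problems

schema_json = """
{
  "name": "products",
  "baseSelector": ".product-card",
  "fields": [
    {"name": "title", "selector": "h2", "type": "text"},
    {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
  ]
}
"""
print(validate_schema(schema_json))  # []
```

A check like this catches the common mistake of an `attribute`-typed field with no `attribute` key before you spend a crawl discovering it.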

### 2. LLM-Based Extraction

For complex or irregular content:

**CLI:**
```yaml
# extract_llm.yml
type: "llm"
provider: "openai/gpt-4o-mini"
instruction: "Extract product names and prices"
api_token: "your-token"
```

```bash
crwl https://shop.com -e extract_llm.yml -o json
```

For extraction details: [Extraction Strategies](references/complete-sdk-reference.md#extraction-strategies) (lines 4522-5429)

---

## Advanced Patterns

### Dynamic Content (JavaScript-Heavy Sites)

**CLI:**
```bash
crwl https://example.com -c "wait_for=css:.ajax-content,scan_full_page=true,page_timeout=60000"
```

**Crawler config:**
```yaml
# crawler.yml
wait_for: "css:.ajax-content"
scan_full_page: true
page_timeout: 60000
delay_before_return_html: 2.0
```

### Multi-URL Processing

**CLI (sequential):**
```bash
for url in url1 url2 url3; do crwl "$url" -o markdown; done
```

**Python SDK (concurrent):**
```python
urls = ["https://site1.com", "https://site2.com", "https://site3.com"]
results = await crawler.arun_many(urls, config=config)
```

For batch processing: [arun_many() Reference](references/complete-sdk-reference.md#arunmany-reference) (lines 1057-1224)

### Session & Authentication

**CLI:**
```yaml
# login_crawler.yml
session_id: "user_session"
js_code: |
  document.querySelector('#username').value = 'user';
  document.querySelector('#password').value = 'pass';
  document.querySelector('#submit').click();
wait_for: "css:.dashboard"
```

```bash
# Login
crwl https://site.com/login -C login_crawler.yml

# Access protected content (session reused)
crwl https://site.com/protected -c "session_id=user_session"
```

For session management: [Advanced Features](references/complete-sdk-reference.md#advanced-features) (lines 5429-5940)

### Anti-Detection & Proxies

**CLI:**
```yaml
# browser.yml
headless: true
proxy_config:
  server: "http://proxy:8080"
  username: "user"
  password: "pass"
user_agent_mode: "random"
```

```bash
crwl https://example.com -B browser.yml
```

---

## Common Use Cases

### Google Search Scraping

```bash
# Search Google and get results as JSON
python scripts/google_search.py "your search query" 20

# Example
python scripts/google_search.py "Go language outlook for 2026" 20
```

The script extracts:
- Search result titles
- URLs (cleaned, removes Google redirects)
- Descriptions/snippets
- Site names

Output is saved to `google_search_results.json` and printed to stdout.
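The JSON output can be post-processed with the standard library. A minimal sketch — note the record key names (`title`, `url`, `description`, `site_name`) are assumptions based on the field list above, not the script's documented schema, so check its actual output first:

```python
import json

# Hypothetical records mirroring what google_search_results.json might contain;
# the key names are assumptions, not the script's documented schema.
results = json.loads("""
[
  {"title": "Go 1.24 Release Notes", "url": "https://go.dev/doc/go1.24",
   "description": "What's new in Go 1.24", "site_name": "go.dev"},
  {"title": "Go outlook", "url": "https://example.com/go",
   "description": "Predictions", "site_name": "example.com"}
]
""")

# Keep only results from a trusted domain
official = [r for r in results if r["site_name"] == "go.dev"]
for r in official:
    print(f"{r['title']} -> {r['url']}")
```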

### Documentation to Markdown

```bash
crwl https://docs.example.com -o markdown > docs.md
```

### E-commerce Product Monitoring

```bash
# Generate schema once
python scripts/extraction_pipeline.py --generate-schema https://shop.com "extract products"

# Monitor (no LLM costs)
crwl https://shop.com -e extract_css.yml -s schema.json -o json
```

### News Aggregation

```bash
# Multiple sources with filtering
for url in news1.com news2.com news3.com; do
  crwl "https://$url" -f filter_bm25.yml -o markdown-fit
done
```

### Interactive Q&A

```bash
# First view content
crwl https://example.com -o markdown

# Then ask questions
crwl https://example.com -q "What are the main conclusions?"
crwl https://example.com -q "Summarize the key points"
```

---

## Resources

### Provided Scripts

- **scripts/google_search.py** - Google search scraper with JSON output
- **scripts/extraction_pipeline.py** - Schema generation and extraction
- **scripts/basic_crawler.py** - Simple markdown extraction
- **scripts/batch_crawler.py** - Multi-URL processing

### Reference Documentation

| Document | Purpose |
|----------|---------|
| [CLI Guide](references/cli-guide.md) | Command-line interface reference |
| [SDK Guide](references/sdk-guide.md) | Python SDK quick reference |
| [Complete SDK Reference](references/complete-sdk-reference.md) | Full API documentation (5900+ lines) |

---

## Best Practices

1. **Start with CLI** for quick tasks, SDK for automation
2. **Use schema-based extraction** - 10-100x more efficient than LLM
3. **Enable caching during development** - use `--bypass-cache` only when you need fresh content
4. **Set appropriate timeouts** - 30s normal, 60s+ for JS-heavy sites
5. **Use content filters** for cleaner, focused markdown
6. **Respect rate limits** - Add delays between requests
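Point 6 can be as simple as sleeping between requests. A stdlib-only sketch of a polite sequential loop — the `fetch` function here is a stand-in for a real `crawler.arun()` call or `crwl` invocation:

```python
import asyncio

async def fetch(url: str) -> str:
    # Stand-in for a real crawl; swap in crawler.arun(url) or a crwl subprocess.
    await asyncio.sleep(0)  # simulate I/O
    return f"markdown for {url}"

async def polite_crawl(urls: list[str], delay: float = 1.0) -> list[str]:
    """Crawl URLs sequentially with a fixed delay between requests."""
    pages = []
    for i, url in enumerate(urls):
        if i:  # no delay before the first request
            await asyncio.sleep(delay)
        pages.append(await fetch(url))
    return pages

pages = asyncio.run(polite_crawl(["https://a.example", "https://b.example"], delay=0.01))
print(len(pages))  # 2
```

For larger batches, `arun_many()` with its built-in dispatcher is the better tool; this loop is only for the handful-of-URLs case where a fixed delay is enough.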

---

## Troubleshooting

### JavaScript Not Loading

```bash
crwl https://example.com -c "wait_for=css:.dynamic-content,page_timeout=60000"
```

### Bot Detection Issues

```bash
crwl https://example.com -B browser.yml
```

```yaml
# browser.yml
headless: false
viewport_width: 1920
viewport_height: 1080
user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
```

### Content Not Extracted

```bash
# Debug: see full output
crwl https://example.com -o all -v

# Try different wait strategy
crwl https://example.com -c "wait_for=js:document.querySelector('.content')!==null"
```

### Session Issues

```bash
# Verify session
crwl https://site.com -c "session_id=test" -o all | grep -i session
```

---

For comprehensive API documentation, see [Complete SDK Reference](references/complete-sdk-reference.md).


---

## Referenced Files

> The following files are referenced in this skill and included for context.

### references/cli-guide.md

```markdown
# Crawl4AI CLI Guide
<!-- Reference: Tier 2 - Command-line interface for Crawl4AI -->

## Table of Contents
<!-- Lines 1-20 -->

- [Installation](#installation)
- [Basic Usage](#basic-usage)
- [Configuration](#configuration)
  - [Browser Configuration](#browser-configuration)
  - [Crawler Configuration](#crawler-configuration)
  - [Extraction Configuration](#extraction-configuration)
  - [Content Filtering](#content-filtering)
- [Advanced Features](#advanced-features)
  - [LLM Q&A](#llm-qa)
  - [Structured Data Extraction](#structured-data-extraction)
  - [Content Filtering](#content-filtering-1)
- [Output Formats](#output-formats)
- [Examples](#examples)
- [Best Practices & Tips](#best-practices--tips)

---

## Installation
<!-- Lines 21-25 -->

The Crawl4AI CLI (`crwl`) is installed automatically with the library:

```bash
pip install crawl4ai
crawl4ai-setup
```

---

## Basic Usage
<!-- Lines 26-50 -->

The `crwl` command provides a simple interface to the Crawl4AI library:

```bash
# Basic crawling - returns markdown
crwl https://example.com

# Specify output format
crwl https://example.com -o markdown

# Verbose JSON output with cache bypass
crwl https://example.com -o json -v --bypass-cache

# See usage examples
crwl --example
```

**Quick Example - Advanced Usage:**

```bash
# Extract structured data using CSS schema
crwl "https://www.infoq.com/ai-ml-data-eng/" \
    -e docs/examples/cli/extract_css.yml \
    -s docs/examples/cli/css_schema.json \
    -o json
```

---

## Configuration
<!-- Lines 51-160 -->

### Browser Configuration
<!-- Lines 51-75 -->

Browser settings via YAML file or command line:

```yaml
# browser.yml
headless: true
viewport_width: 1280
user_agent_mode: "random"
verbose: true
ignore_https_errors: true
```

```bash
# Using config file
crwl https://example.com -B browser.yml

# Using direct parameters
crwl https://example.com -b "headless=true,viewport_width=1280,user_agent_mode=random"
```

**Key Parameters:**
| Parameter | Description |
|-----------|-------------|
| `headless` | Run without GUI (true/false) |
| `viewport_width` | Browser width in pixels |
| `viewport_height` | Browser height in pixels |
| `user_agent_mode` | "random" or specific UA string |

For all browser parameters: [BrowserConfig Reference](complete-sdk-reference.md#1-browserconfig--controlling-the-browser) (lines 1977-2020)

### Crawler Configuration
<!-- Lines 76-110 -->

Control crawling behavior:

```yaml
# crawler.yml
cache_mode: "bypass"
wait_until: "networkidle"
page_timeout: 30000
delay_before_return_html: 0.5
word_count_threshold: 100
scan_full_page: true
scroll_delay: 0.3
process_iframes: false
remove_overlay_elements: true
magic: true
verbose: true
```

```bash
# Using config file
crwl https://example.com -C crawler.yml

# Using direct parameters
crwl https://example.com -c "css_selector=#main,delay_before_return_html=2,scan_full_page=true"
```

**Key Parameters:**
| Parameter | Description |
|-----------|-------------|
| `cache_mode` | bypass, enabled, disabled |
| `wait_until` | networkidle, domcontentloaded |
| `page_timeout` | Max page load time (ms) |
| `css_selector` | Focus on specific element |
| `scan_full_page` | Enable infinite scroll handling |

For all crawler parameters: [CrawlerRunConfig Reference](complete-sdk-reference.md#2-crawlerrunconfig--controlling-each-crawl) (lines 2020-2330)

### Extraction Configuration
<!-- Lines 111-160 -->

Two extraction types supported:

**1. CSS/XPath-based extraction:**

```yaml
# extract_css.yml
type: "json-css"
params:
  verbose: true
```

**css_schema.json:**

```json
{
  "name": "ArticleExtractor",
  "baseSelector": ".article",
  "fields": [
    {
      "name": "title",
      "selector": "h1.title",
      "type": "text"
    },
    {
      "name": "link",
      "selector": "a.read-more",
      "type": "attribute",
      "attribute": "href"
    }
  ]
}
```

**2. LLM-based extraction:**

```yaml
# extract_llm.yml
type: "llm"
provider: "openai/gpt-4"
instruction: "Extract all articles with their titles and links"
api_token: "your-token"
params:
  temperature: 0.3
  max_tokens: 1000
```

For extraction strategies: [Extraction Strategies](complete-sdk-reference.md#extraction-strategies) (lines 4522-5429)

---

## Advanced Features
<!-- Lines 161-230 -->

### LLM Q&A
<!-- Lines 161-190 -->

Ask questions about crawled content:

```bash
# Simple question
crwl https://example.com -q "What is the main topic discussed?"

# View content then ask questions
crwl https://example.com -o markdown  # See content first
crwl https://example.com -q "Summarize the key points"
crwl https://example.com -q "What are the conclusions?"

# Combined with advanced crawling
crwl https://example.com \
    -B browser.yml \
    -c "css_selector=article,scan_full_page=true" \
    -q "What are the pros and cons mentioned?"
```

**First-time setup:**
- Prompts for LLM provider and API token
- Saves configuration in `~/.crawl4ai/global.yml`
- Supports: openai/gpt-4, anthropic/claude-3-sonnet, ollama (no token needed)
- See [LiteLLM Providers](https://docs.litellm.ai/docs/providers) for full list

### Structured Data Extraction
<!-- Lines 191-210 -->

```bash
# CSS-based extraction
crwl https://example.com \
    -e extract_css.yml \
    -s css_schema.json \
    -o json

# LLM-based extraction
crwl https://example.com \
    -e extract_llm.yml \
    -s llm_schema.json \
    -o json
```

### Content Filtering
<!-- Lines 211-230 -->

Filter content for relevance:

```yaml
# filter_bm25.yml (relevance-based)
type: "bm25"
query: "target content"
threshold: 1.0

# filter_pruning.yml (quality-based)
type: "pruning"
query: "focus topic"
threshold: 0.48
```

```bash
crwl https://example.com -f filter_bm25.yml -o markdown-fit
```

For content filtering: [Content Processing](complete-sdk-reference.md#content-processing) (lines 2481-3101)

---

## Output Formats
<!-- Lines 231-240 -->

| Format | Flag | Description |
|--------|------|-------------|
| `all` | `-o all` | Full crawl result including metadata |
| `json` | `-o json` | Extracted structured data |
| `markdown` | `-o markdown` or `-o md` | Raw markdown output |
| `markdown-fit` | `-o markdown-fit` or `-o md-fit` | Filtered markdown |

---

## Complete Examples
<!-- Lines 241-280 -->

**1. Basic Extraction:**
```bash
crwl https://example.com \
    -B browser.yml \
    -C crawler.yml \
    -o json
```

**2. Structured Data Extraction:**
```bash
crwl https://example.com \
    -e extract_css.yml \
    -s css_schema.json \
    -o json \
    -v
```

**3. LLM Extraction with Filtering:**
```bash
crwl https://example.com \
    -B browser.yml \
    -e extract_llm.yml \
    -s llm_schema.json \
    -f filter_bm25.yml \
    -o json
```

**4. Interactive Q&A:**
```bash
# First crawl and view
crwl https://example.com -o markdown

# Then ask questions
crwl https://example.com -q "What are the main points?"
crwl https://example.com -q "Summarize the conclusions"
```

---

## Best Practices & Tips
<!-- Lines 281-310 -->

1. **Configuration Management:**
   - Keep common configurations in YAML files
   - Use CLI parameters for quick overrides
   - Store sensitive data (API tokens) in `~/.crawl4ai/global.yml`

2. **Performance Optimization:**
   - Use `--bypass-cache` for fresh content
   - Enable `scan_full_page` for infinite scroll pages
   - Adjust `delay_before_return_html` for dynamic content

3. **Content Extraction:**
   - Use CSS extraction for structured content (faster, no API costs)
   - Use LLM extraction for unstructured content
   - Combine with filters for focused results

4. **Q&A Workflow:**
   - View content first with `-o markdown`
   - Ask specific questions
   - Use broader context with appropriate selectors

---

## Recap

The Crawl4AI CLI provides:
- Flexible configuration via files and parameters
- Multiple extraction strategies (CSS, XPath, LLM)
- Content filtering and optimization
- Interactive Q&A capabilities
- Various output formats

---

## See Also

- [Python SDK Guide](sdk-guide.md) - Programmatic Python interface
- [Complete SDK Reference](complete-sdk-reference.md) - Full API documentation

```

### references/sdk-guide.md

```markdown
# Crawl4AI Python SDK Guide
<!-- Reference: Tier 2 - Python SDK interface for Crawl4AI -->

## Quick Start
<!-- Lines 1-60 -->

### Installation

```bash
pip install crawl4ai
crawl4ai-setup
```

### Basic First Crawl

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown[:500])

asyncio.run(main())
```

### With Configuration

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

browser_config = BrowserConfig(
    headless=True,
    viewport_width=1920,
    viewport_height=1080
)

crawler_config = CrawlerRunConfig(
    page_timeout=30000,
    screenshot=True,
    remove_overlay_elements=True
)

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(
        url="https://example.com",
        config=crawler_config
    )
    print(f"Success: {result.success}")
    print(f"Markdown length: {len(result.markdown)}")
```

For complete API reference: [AsyncWebCrawler](complete-sdk-reference.md#asyncwebcrawler) (lines 517-778)

---

## Configuration
<!-- Lines 61-150 -->

### BrowserConfig

Controls the browser instance (global settings):

```python
from crawl4ai import BrowserConfig

browser_config = BrowserConfig(
    browser_type="chromium",    # chromium, firefox, webkit
    headless=True,              # Run without GUI
    viewport_width=1280,
    viewport_height=720,
    user_agent="custom-agent",  # Custom user agent
    proxy_config={              # Proxy settings
        "server": "http://proxy:8080",
        "username": "user",
        "password": "pass"
    }
)
```

**Key Parameters:**
| Parameter | Description |
|-----------|-------------|
| `headless` | Run with/without GUI |
| `viewport_width/height` | Browser dimensions |
| `user_agent` | Custom user agent string |
| `cookies` | Pre-set cookies |
| `headers` | Custom HTTP headers |
| `proxy_config` | Proxy server settings |

For all parameters: [BrowserConfig Reference](complete-sdk-reference.md#1-browserconfig--controlling-the-browser) (lines 1977-2020)

### CrawlerRunConfig

Controls each crawl operation (per-crawl settings):

```python
from crawl4ai import CrawlerRunConfig, CacheMode

config = CrawlerRunConfig(
    # Timing
    page_timeout=30000,         # Max page load time (ms)
    wait_for="css:.content",    # Wait for element
    delay_before_return_html=0.5,

    # Content selection
    css_selector=".main-content",
    excluded_tags=["nav", "footer"],

    # Caching
    cache_mode=CacheMode.BYPASS,

    # JavaScript
    js_code="window.scrollTo(0, document.body.scrollHeight);",

    # Output
    screenshot=True,
    pdf=True
)
```

**Key Parameters:**
| Parameter | Description |
|-----------|-------------|
| `page_timeout` | Max page load/JS time (ms) |
| `wait_for` | CSS selector or JS condition |
| `cache_mode` | ENABLED, BYPASS, DISABLED |
| `js_code` | JavaScript to execute |
| `session_id` | Persist session across crawls |
| `screenshot` | Capture screenshot |

For all parameters: [CrawlerRunConfig Reference](complete-sdk-reference.md#2-crawlerrunconfig--controlling-each-crawl) (lines 2020-2330)

---

## CrawlResult
<!-- Lines 151-200 -->

Every `arun()` call returns a `CrawlResult`:

```python
result = await crawler.arun(url, config=config)

# Status
result.success          # bool - crawl succeeded
result.status_code      # HTTP status code
result.error_message    # Error details if failed

# Content
result.html             # Raw HTML
result.cleaned_html     # Sanitized HTML
result.markdown         # MarkdownGenerationResult object
result.markdown.raw_markdown    # Full markdown
result.markdown.fit_markdown    # Filtered markdown (if filter used)

# Media & Links
result.media["images"]  # List of images
result.media["videos"]  # List of videos
result.links["internal"] # Internal links
result.links["external"] # External links

# Extras
result.screenshot       # Base64 screenshot (if requested)
result.pdf              # PDF bytes (if requested)
result.metadata         # Page metadata (title, description)
```

For complete fields: [CrawlResult Reference](complete-sdk-reference.md#crawlresult-reference) (lines 1224-1612)

---

## Content Processing
<!-- Lines 201-280 -->

### Markdown Generation

```python
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

md_generator = DefaultMarkdownGenerator(
    options={
        "ignore_links": False,
        "ignore_images": False,
        "body_width": 80
    }
)

config = CrawlerRunConfig(markdown_generator=md_generator)
```

### Content Filtering

Filter content for relevance before markdown generation:

```python
from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

# Option 1: Pruning (removes low-quality content)
pruning_filter = PruningContentFilter(
    threshold=0.4,
    threshold_type="fixed"
)

# Option 2: BM25 (relevance-based)
bm25_filter = BM25ContentFilter(
    user_query="machine learning tutorials",
    bm25_threshold=1.0
)

md_generator = DefaultMarkdownGenerator(content_filter=bm25_filter)
config = CrawlerRunConfig(markdown_generator=md_generator)

result = await crawler.arun(url, config=config)
print(result.markdown.fit_markdown)  # Filtered content
print(result.markdown.raw_markdown)  # Original content
```

For filters and generators: [Content Processing](complete-sdk-reference.md#content-processing) (lines 2481-3101)

---

## Data Extraction
<!-- Lines 281-360 -->

### CSS-Based Extraction (No LLM)

Fast, deterministic extraction using CSS selectors:

```python
import json

from crawl4ai import CrawlerRunConfig, JsonCssExtractionStrategy

schema = {
    "name": "articles",
    "baseSelector": "article.post",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "date", "selector": ".date", "type": "text"},
        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
    ]
}

extraction_strategy = JsonCssExtractionStrategy(schema=schema)
config = CrawlerRunConfig(extraction_strategy=extraction_strategy)

result = await crawler.arun(url, config=config)
data = json.loads(result.extracted_content)
```

### LLM-Based Extraction

For complex or irregular content:

```python
from crawl4ai import LLMExtractionStrategy, LLMConfig
from pydantic import BaseModel, Field

class Product(BaseModel):
    name: str = Field(description="Product name")
    price: str = Field(description="Product price")

extraction_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(
        provider="openai/gpt-4o-mini",
        api_token="your-token"
    ),
    schema=Product.model_json_schema(),
    extraction_type="schema",
    instruction="Extract product information"
)

config = CrawlerRunConfig(extraction_strategy=extraction_strategy)
```

For extraction strategies: [Extraction Strategies](complete-sdk-reference.md#extraction-strategies) (lines 4522-5429)

---

## Multi-URL Crawling
<!-- Lines 361-420 -->

### Concurrent Processing with arun_many()

```python
urls = ["https://site1.com", "https://site2.com", "https://site3.com"]

config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    stream=True  # Enable streaming
)

async with AsyncWebCrawler() as crawler:
    # Streaming mode - process as they complete
    async for result in await crawler.arun_many(urls, config=config):
        if result.success:
            print(f"Completed: {result.url}")

    # Batch mode - wait for all
    config = config.clone(stream=False)
    results = await crawler.arun_many(urls, config=config)
```

### URL-Specific Configurations

```python
from crawl4ai import CrawlerRunConfig, MatchMode

# Different configs for different URL patterns
pdf_config = CrawlerRunConfig(
    url_matcher="*.pdf",
    # PDF-specific settings
)

blog_config = CrawlerRunConfig(
    url_matcher=["*/blog/*", "*/article/*"],
    match_mode=MatchMode.OR
)

default_config = CrawlerRunConfig()  # Fallback

results = await crawler.arun_many(
    urls=urls,
    config=[pdf_config, blog_config, default_config]
)
```

For dispatchers and advanced: [arun_many() Reference](complete-sdk-reference.md#arunmany-reference) (lines 1057-1224)

---

## Session Management
<!-- Lines 421-480 -->

### Persistent Sessions

```python
# First crawl - establish session
login_config = CrawlerRunConfig(
    session_id="user_session",
    js_code="""
    document.querySelector('#username').value = 'myuser';
    document.querySelector('#password').value = 'mypass';
    document.querySelector('#submit').click();
    """,
    wait_for="css:.dashboard"
)

await crawler.arun("https://site.com/login", config=login_config)

# Subsequent crawls - reuse session
config = CrawlerRunConfig(session_id="user_session")
await crawler.arun("https://site.com/protected", config=config)

# Clean up
await crawler.crawler_strategy.kill_session("user_session")
```

### Dynamic Content Handling

```python
config = CrawlerRunConfig(
    wait_for="css:.ajax-content",
    js_code="""
    window.scrollTo(0, document.body.scrollHeight);
    document.querySelector('.load-more')?.click();
    """,
    page_timeout=60000
)
```

For session patterns: [Advanced Features - Session Management](complete-sdk-reference.md#advanced-features) (lines 5429-5940)

---

## Best Practices

1. **Use context managers** - `async with AsyncWebCrawler()` ensures cleanup
2. **Enable caching during development** - `cache_mode=CacheMode.ENABLED`
3. **Set appropriate timeouts** - 30s normal, 60s+ for JS-heavy sites
4. **Prefer CSS extraction** over LLM - 10-100x more efficient
5. **Use clone() for config variants** - `config.clone(screenshot=True)`
6. **Respect rate limits** - Use delays between requests

---

## See Also

- [CLI Guide](cli-guide.md) - Command-line interface alternative
- [Complete SDK Reference](complete-sdk-reference.md) - Full API documentation

```

### references/complete-sdk-reference.md

```markdown
# Crawl4AI Complete SDK Documentation

**Generated:** 2025-10-19 12:56
**Format:** Ultra-Dense Reference (Optimized for AI Assistants)
**Crawl4AI Version:** 0.7.4

---

## Navigation

- [Installation & Setup](#installation--setup) (lines 22-126)
- [Quick Start](#quick-start) (lines 126-517)
- [Core API](#core-api) (lines 517-1056)
- [Configuration](#configuration) (lines 1612-2330)
- [Crawling Patterns](#crawling-patterns) (lines 2330-3568)
- [Content Processing](#content-processing) (lines 2479-3101)
- [Extraction Strategies](#extraction-strategies) (lines 4528-5436)
- [Advanced Features](#advanced-features) (lines 5436-5932)

---

## Installation & Setup
<!-- Section: lines 22-126 -->


## 1. Basic Installation

```bash
pip install crawl4ai
```

## 2. Initial Setup & Diagnostics

### 2.1 Run the Setup Command

```bash
crawl4ai-setup
```

- Performs OS-level checks (e.g., missing libs on Linux)
- Confirms your environment is ready to crawl

### 2.2 Diagnostics

```bash
crawl4ai-doctor
```

- Check Python version compatibility
- Verify Playwright installation
- Inspect environment variables or library conflicts

If any issues arise, follow its suggestions (e.g., installing additional system packages) and re-run `crawl4ai-setup`.

## 3. Verifying Installation: A Simple Crawl (Skip this step if you already ran `crawl4ai-doctor`)

Below is a minimal Python script demonstrating a **basic** crawl. It uses our new **`BrowserConfig`** and **`CrawlerRunConfig`** for clarity, though no custom settings are passed in this example:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.example.com",
        )
        print(result.markdown[:300])  # Show the first 300 characters of extracted text

if __name__ == "__main__":
    asyncio.run(main())
```

- A headless browser session loads `example.com`
- The script prints the first ~300 characters of the generated markdown

If errors occur, rerun `crawl4ai-doctor` or manually ensure Playwright is installed correctly.

## 4. Advanced Installation (Optional)

### 4.1 Torch, Transformers, or All

- **Text Clustering (Torch)**

  ```bash
  pip install crawl4ai[torch]
  crawl4ai-setup
  ```

- **Transformers**

  ```bash
  pip install crawl4ai[transformer]
  crawl4ai-setup
  ```

- **All Features**

  ```bash
  pip install crawl4ai[all]
  crawl4ai-setup
  ```

If you installed the Torch or Transformer extras, you can pre-download the models:

```bash
crawl4ai-download-models
```

## 5. Docker (Experimental)

```bash
docker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic
```

You can then make POST requests to `http://localhost:11235/crawl` to perform crawls. **Production usage** is discouraged until our new Docker approach is ready (planned in Jan or Feb 2025).

## 6. Local Server Mode (Legacy)

## Summary

1. **Install** with `pip install crawl4ai` and run `crawl4ai-setup`.
2. **Diagnose** with `crawl4ai-doctor` if you see errors.
3. **Verify** by crawling `example.com` with minimal `BrowserConfig` + `CrawlerRunConfig`.

## Quick Start

## Getting Started with Crawl4AI

1. Run your **first crawl** using minimal configuration.
2. Experiment with a simple **CSS-based extraction** strategy.
3. Crawl a **dynamic** page that loads content via JavaScript.

## 1. Introduction

- An asynchronous crawler, **`AsyncWebCrawler`**.
- Configurable browser and run settings via **`BrowserConfig`** and **`CrawlerRunConfig`**.
- Automatic HTML-to-Markdown conversion via **`DefaultMarkdownGenerator`** (supports optional filters).
- Multiple extraction strategies (LLM-based or “traditional” CSS/XPath-based).

## 2. Your First Crawl

Here’s a minimal Python script that creates an **`AsyncWebCrawler`**, fetches a webpage, and prints the first 300 characters of its Markdown output:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown[:300])  # Print first 300 chars

if __name__ == "__main__":
    asyncio.run(main())
```

- **`AsyncWebCrawler`** launches a headless browser (Chromium by default).
- It fetches `https://example.com`.
- Crawl4AI automatically converts the HTML into Markdown.

## 3. Basic Configuration (Light Introduction)

1. **`BrowserConfig`**: Controls browser behavior (headless or full UI, user agent, JavaScript toggles, etc.).
2. **`CrawlerRunConfig`**: Controls how each crawl runs (caching, extraction, timeouts, hooking, etc.).

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    browser_conf = BrowserConfig(headless=True)  # or False to see the browser
    run_conf = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler(config=browser_conf) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_conf
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```

> IMPORTANT: By default, `cache_mode` is set to `CacheMode.BYPASS`, so each crawl fetches fresh content. Set it to `CacheMode.ENABLED` to enable caching.

## 4. Generating Markdown Output

- **`result.markdown`**:
  The Markdown output generated from the page (its `raw_markdown` attribute holds the unfiltered text).
- **`result.markdown.fit_markdown`**:
  The same content after applying any configured **content filter** (e.g., `PruningContentFilter`).

### Example: Using a Filter with `DefaultMarkdownGenerator`

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

md_generator = DefaultMarkdownGenerator(
    content_filter=PruningContentFilter(threshold=0.4, threshold_type="fixed")
)

config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    markdown_generator=md_generator
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://news.ycombinator.com", config=config)
    print("Raw Markdown length:", len(result.markdown.raw_markdown))
    print("Fit Markdown length:", len(result.markdown.fit_markdown))
```

**Note**: If you do **not** specify a content filter or markdown generator, you’ll typically see only the raw Markdown. `PruningContentFilter` adds roughly 50 ms of processing time. We’ll dive deeper into these strategies in a dedicated **Markdown Generation** tutorial.

## 5. Simple Data Extraction (CSS-based)

```python
from crawl4ai import JsonCssExtractionStrategy
from crawl4ai import LLMConfig

# Generate a schema (one-time cost)
html = "<div class='product'><h2>Gaming Laptop</h2><span class='price'>$999.99</span></div>"

# Using OpenAI (requires API token)
schema = JsonCssExtractionStrategy.generate_schema(
    html,
    llm_config = LLMConfig(provider="openai/gpt-4o",api_token="your-openai-token")  # Required for OpenAI
)

# Or using Ollama (open source, no token needed)
schema = JsonCssExtractionStrategy.generate_schema(
    html,
    llm_config = LLMConfig(provider="ollama/llama3.3", api_token=None)  # Not needed for Ollama
)

# Use the schema for fast, repeated extractions
strategy = JsonCssExtractionStrategy(schema)
```

```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy

async def main():
    schema = {
        "name": "Example Items",
        "baseSelector": "div.item",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }

    raw_html = "<div class='item'><h2>Item 1</h2><a href='https://example.com/item1'>Link 1</a></div>"

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="raw://" + raw_html,
            config=CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS,
                extraction_strategy=JsonCssExtractionStrategy(schema)
            )
        )
        # The JSON output is stored in 'extracted_content'
        data = json.loads(result.extracted_content)
        print(data)

if __name__ == "__main__":
    asyncio.run(main())
```

- Great for repetitive page structures (e.g., item listings, articles).
- No AI usage or costs.
- The crawler returns a JSON string you can parse or store.
> Tips: You can pass raw HTML to the crawler instead of a URL. To do so, prefix the HTML with `raw://`.
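To make the mechanics concrete, here is a toy stand-in for schema-driven extraction built only on the stdlib `html.parser`. It is **not** crawl4ai's implementation — it handles only the bare `tag.class` selectors used above — but it shows why schema extraction is fast and LLM-free once a schema exists:

```python
from html.parser import HTMLParser

class MiniExtractor(HTMLParser):
    """Toy stand-in for schema-driven extraction: collects text/attribute
    fields from elements matching simple 'tag' or 'tag.class' selectors."""
    def __init__(self, schema):
        super().__init__()
        self.schema = schema
        self.items = []
        self.in_base = False
        self.capture_field = None
        self.current = None

    @staticmethod
    def _matches(selector, tag, attrs):
        sel_tag, _, sel_class = selector.partition(".")
        classes = dict(attrs).get("class", "").split()
        return tag == sel_tag and (not sel_class or sel_class in classes)

    def handle_starttag(self, tag, attrs):
        if not self.in_base and self._matches(self.schema["baseSelector"], tag, attrs):
            self.in_base = True
            self.current = {}
            return
        if self.in_base and self.current is not None:
            for field in self.schema["fields"]:
                if self._matches(field["selector"], tag, attrs):
                    if field["type"] == "attribute":
                        self.current[field["name"]] = dict(attrs).get(field["attribute"])
                    else:
                        self.capture_field = field["name"]

    def handle_data(self, data):
        if self.capture_field and self.current is not None:
            self.current[self.capture_field] = data.strip()
            self.capture_field = None

    def handle_endtag(self, tag):
        base_tag = self.schema["baseSelector"].partition(".")[0]
        if self.in_base and tag == base_tag and self.current:
            self.items.append(self.current)
            self.current = None
            self.in_base = False

schema = {
    "baseSelector": "div.item",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"},
    ],
}
parser = MiniExtractor(schema)
parser.feed("<div class='item'><h2>Item 1</h2><a href='https://example.com/item1'>Link 1</a></div>")
print(parser.items)  # [{'title': 'Item 1', 'link': 'https://example.com/item1'}]
```

Once the schema exists, extraction is a pure parsing pass — no model call per page, which is why generating the schema is a one-time cost.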

## 6. Simple Data Extraction (LLM-based)

- **Open-Source Models** (e.g., `ollama/llama3.3`; no token needed)
- **OpenAI Models** (e.g., `openai/gpt-4`, requires `api_token`)
- Or any provider supported by the underlying library

```python
import os
import json
import asyncio
from typing import Dict
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai import LLMExtractionStrategy

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(
        ..., description="Fee for output token for the OpenAI model."
    )

async def extract_structured_data_using_llm(
    provider: str, api_token: str = None, extra_headers: Dict[str, str] = None
):
    print(f"\n--- Extracting Structured Data with {provider} ---")

    if api_token is None and provider != "ollama":
        print(f"API token is required for {provider}. Skipping this example.")
        return

    browser_config = BrowserConfig(headless=True)

    extra_args = {"temperature": 0, "top_p": 0.9, "max_tokens": 2000}
    if extra_headers:
        extra_args["extra_headers"] = extra_headers

    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        word_count_threshold=1,
        page_timeout=80000,
        extraction_strategy=LLMExtractionStrategy(
            llm_config = LLMConfig(provider=provider,api_token=api_token),
            schema=OpenAIModelFee.model_json_schema(),
            extraction_type="schema",
            instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
            Do not miss any models in the entire content.""",
            extra_args=extra_args,
        ),
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://openai.com/api/pricing/", config=crawler_config
        )
        print(result.extracted_content)

if __name__ == "__main__":

    asyncio.run(
        extract_structured_data_using_llm(
            provider="openai/gpt-4o", api_token=os.getenv("OPENAI_API_KEY")
        )
    )
```

- We define a Pydantic schema (`OpenAIModelFee`) describing the fields we want.

## 7. Adaptive Crawling (New!)

```python
import asyncio
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler

async def adaptive_example():
    async with AsyncWebCrawler() as crawler:
        adaptive = AdaptiveCrawler(crawler)

        # Start adaptive crawling
        result = await adaptive.digest(
            start_url="https://docs.python.org/3/",
            query="async context managers"
        )

        # View results
        adaptive.print_stats()
        print(f"Crawled {len(result.crawled_urls)} pages")
        print(f"Achieved {adaptive.confidence:.0%} confidence")

if __name__ == "__main__":
    asyncio.run(adaptive_example())
```

- **Automatic stopping**: Stops when sufficient information is gathered
- **Intelligent link selection**: Follows only relevant links
- **Confidence scoring**: Know how complete your information is
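The stopping behavior can be pictured as a confidence loop. The sketch below is illustrative only — the scoring function and update rule are invented for the example, not crawl4ai's actual algorithm:

```python
def adaptive_crawl(frontier, score, threshold=0.8, max_pages=20):
    """Illustrative confidence loop: follow the most promising link first and
    stop once accumulated confidence passes the threshold. Invented for this
    sketch -- not crawl4ai's actual algorithm."""
    confidence, crawled = 0.0, []
    while frontier and confidence < threshold and len(crawled) < max_pages:
        url = max(frontier, key=score)       # intelligent link selection
        frontier.remove(url)
        crawled.append(url)
        # each relevant page closes part of the remaining information gap
        confidence += (1 - confidence) * score(url)
    return crawled, confidence

# Hypothetical relevance scores for a handful of documentation URLs
scores = {"/async": 0.6, "/contextlib": 0.5, "/glossary": 0.1, "/faq": 0.05}
crawled, confidence = adaptive_crawl(set(scores), scores.get)
print(crawled, round(confidence, 2))
```

Note how the low-scoring pages (`/glossary`, `/faq`) are never fetched: the loop hits the confidence threshold first, which is the whole point of adaptive crawling.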

## 8. Multi-URL Concurrency (Preview)

If you need to crawl multiple URLs in **parallel**, you can use `arun_many()`. By default, Crawl4AI employs a **MemoryAdaptiveDispatcher**, automatically adjusting concurrency based on system resources. Here’s a quick glimpse:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def quick_parallel_example():
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3"
    ]

    run_conf = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        stream=True  # Enable streaming mode
    )

    async with AsyncWebCrawler() as crawler:
        # Stream results as they complete
        async for result in await crawler.arun_many(urls, config=run_conf):
            if result.success:
                print(f"[OK] {result.url}, length: {len(result.markdown.raw_markdown)}")
            else:
                print(f"[ERROR] {result.url} => {result.error_message}")

        # Or get all results at once (default behavior)
        run_conf = run_conf.clone(stream=False)
        results = await crawler.arun_many(urls, config=run_conf)
        for res in results:
            if res.success:
                print(f"[OK] {res.url}, length: {len(res.markdown.raw_markdown)}")
            else:
                print(f"[ERROR] {res.url} => {res.error_message}")

if __name__ == "__main__":
    asyncio.run(quick_parallel_example())
```

1. **Streaming mode** (`stream=True`): Process results as they become available using `async for`
2. **Batch mode** (`stream=False`): Wait for all results to complete
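Independent of crawl4ai, the two modes reduce to a familiar asyncio pattern — an async generator for streaming versus `gather()` for batching. A stdlib-only sketch:

```python
import asyncio
import random

async def fetch(url):
    await asyncio.sleep(random.uniform(0, 0.01))   # simulate network latency
    return f"content of {url}"

async def crawl_stream(urls):
    """Streaming: yield each result as soon as its task finishes."""
    tasks = [asyncio.create_task(fetch(u)) for u in urls]
    for finished in asyncio.as_completed(tasks):
        yield await finished

async def crawl_batch(urls):
    """Batch: wait for all results, preserving input order."""
    return await asyncio.gather(*(fetch(u) for u in urls))

async def main():
    urls = ["https://example.com/page1", "https://example.com/page2"]
    async for result in crawl_stream(urls):    # arrives in completion order
        print("streamed:", result)
    print("batched:", await crawl_batch(urls)) # arrives in input order

asyncio.run(main())
```

Streaming results arrive in completion order, so don't rely on their position; batch mode preserves the order of the input URL list.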

## 9. Dynamic Content Example

Some sites require multiple “tab clicks” or dynamic JavaScript updates before all content is in the DOM. Below is an example that **clicks** through a set of tabs with custom JavaScript, waits for each panel to render, and then extracts structured data, using **`BrowserConfig`** and **`CrawlerRunConfig`**:

```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy

async def extract_structured_data_using_css_extractor():
    print("\n--- Using JsonCssExtractionStrategy for Fast Structured Output ---")
    schema = {
        "name": "KidoCode Courses",
        "baseSelector": "section.charge-methodology .w-tab-content > div",
        "fields": [
            {
                "name": "section_title",
                "selector": "h3.heading-50",
                "type": "text",
            },
            {
                "name": "section_description",
                "selector": ".charge-content",
                "type": "text",
            },
            {
                "name": "course_name",
                "selector": ".text-block-93",
                "type": "text",
            },
            {
                "name": "course_description",
                "selector": ".course-content-text",
                "type": "text",
            },
            {
                "name": "course_icon",
                "selector": ".image-92",
                "type": "attribute",
                "attribute": "src",
            },
        ],
    }

    browser_config = BrowserConfig(headless=True, java_script_enabled=True)

    js_click_tabs = """
    (async () => {
        const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");
        for(let tab of tabs) {
            tab.scrollIntoView();
            tab.click();
            await new Promise(r => setTimeout(r, 500));
        }
    })();
    """

    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(schema),
        js_code=[js_click_tabs],
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://www.kidocode.com/degrees/technology", config=crawler_config
        )

        courses = json.loads(result.extracted_content)
        print(f"Successfully extracted {len(courses)} course entries")
        print(json.dumps(courses[0], indent=2))

async def main():
    await extract_structured_data_using_css_extractor()

if __name__ == "__main__":
    asyncio.run(main())
```

- **`BrowserConfig(headless=True, java_script_enabled=True)`**: JavaScript must stay enabled for the tab clicks to run.
- **`js_code`**: A snippet that scrolls to each tab, clicks it, and pauses briefly so the panel content can render.
- **`CrawlerRunConfig(...)`**: Bundles the JS snippet with a `JsonCssExtractionStrategy` so extraction runs against the fully revealed page.
- For multi-page flows (e.g., clicking a “Next Page” button repeatedly), pass a `session_id` to reuse the same page, set `js_only=True` on follow-up calls to continue the existing session without re-navigating, and call `kill_session()` when done to clean up the page and browser session.

## 10. Next Steps

In this quick start, you:

1. Performed a basic crawl and printed Markdown.
2. Used **content filters** with a markdown generator.
3. Extracted JSON via **CSS** or **LLM** strategies.
4. Handled **dynamic** pages with JavaScript triggers.

## Core API

## AsyncWebCrawler

The **`AsyncWebCrawler`** is the core class for asynchronous web crawling in Crawl4AI. You typically create it **once**, optionally customize it with a **`BrowserConfig`** (e.g., headless, user agent), then **run** multiple **`arun()`** calls with different **`CrawlerRunConfig`** objects.
1. **Create** a `BrowserConfig` for global browser settings. 
2. **Instantiate** `AsyncWebCrawler(config=browser_config)`. 
3. **Use** the crawler in an async context manager (`async with`) or manage start/close manually. 
4. **Call** `arun(url, config=crawler_run_config)` for each page you want.
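The two lifecycle styles mirror a standard async context manager. Below is a hypothetical `MiniCrawler` (not crawl4ai code) showing how `async with` is equivalent to explicit `start()`/`close()`:

```python
import asyncio

class MiniCrawler:
    """Hypothetical crawler illustrating the lifecycle only -- not crawl4ai
    code. 'async with' is sugar for explicit start()/close()."""
    def __init__(self, config=None):
        self.config = config
        self.started = False

    async def start(self):
        self.started = True            # e.g. launch the headless browser
        return self

    async def close(self):
        self.started = False           # e.g. shut the browser down

    async def __aenter__(self):
        return await self.start()

    async def __aexit__(self, *exc):
        await self.close()

    async def arun(self, url):
        assert self.started, "crawler not started"
        return f"crawled {url}"

async def main():
    # Context manager: start/close handled automatically
    async with MiniCrawler() as crawler:
        print(await crawler.arun("https://example.com"))

    # Manual lifecycle for long-running applications
    crawler = MiniCrawler()
    await crawler.start()
    print(await crawler.arun("https://another.com"))
    await crawler.close()

asyncio.run(main())
```

Whichever style you use, the point is that browser launch/teardown happens once per crawler, not once per `arun()` call.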

## 1. Constructor Overview

```python
class AsyncWebCrawler:
    def __init__(
        self,
        crawler_strategy: Optional[AsyncCrawlerStrategy] = None,
        config: Optional[BrowserConfig] = None,
        always_bypass_cache: bool = False,           # deprecated
        always_by_pass_cache: Optional[bool] = None, # also deprecated
        base_directory: str = ...,
        thread_safe: bool = False,
        **kwargs,
    ):
        """
        Create an AsyncWebCrawler instance.

        Args:
            crawler_strategy:
                (Advanced) Provide a custom crawler strategy if needed.
            config:
                A BrowserConfig object specifying how the browser is set up.
            always_bypass_cache:
                (Deprecated) Use CrawlerRunConfig.cache_mode instead.
            base_directory:
                Folder for storing caches/logs (if relevant).
            thread_safe:
                If True, attempts some concurrency safeguards. Usually False.
            **kwargs:
                Additional legacy or debugging parameters.
        """
```

### Typical Initialization

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig
browser_cfg = BrowserConfig(
    browser_type="chromium",
    headless=True,
    verbose=True
)
crawler = AsyncWebCrawler(config=browser_cfg)
```

**Notes**:

- **Legacy** parameters like `always_bypass_cache` remain for backward compatibility, but prefer to set **caching** in `CrawlerRunConfig`.

---

## 2. Lifecycle: Start/Close or Context Manager

### 2.1 Context Manager (Recommended)

```python
async with AsyncWebCrawler(config=browser_cfg) as crawler:
    result = await crawler.arun("https://example.com")

# The crawler automatically starts/closes its resources
```

When the `async with` block ends, the crawler cleans up (closes the browser, etc.).

### 2.2 Manual Start & Close

```python
crawler = AsyncWebCrawler(config=browser_cfg)
await crawler.start()
result1 = await crawler.arun("https://example.com")
result2 = await crawler.arun("https://another.com")
await crawler.close()
```

Use this style if you have a **long-running** application or need full control of the crawler’s lifecycle.

---

## 3. Primary Method: `arun()`

```python
async def arun(
    url: str,
    config: Optional[CrawlerRunConfig] = None,
    # Legacy parameters for backward compatibility...
):
    ...
```

### 3.1 New Approach

You pass a `CrawlerRunConfig` object that sets up everything about a crawl—content filtering, caching, session reuse, JS code, screenshots, etc.

```python
import asyncio
from crawl4ai import CrawlerRunConfig, CacheMode
run_cfg = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    css_selector="main.article",
    word_count_threshold=10,
    screenshot=True
)

async with AsyncWebCrawler(config=browser_cfg) as crawler:
    result = await crawler.arun("https://example.com/news", config=run_cfg)
```

### 3.2 Legacy Parameters Still Accepted

For **backward** compatibility, `arun()` can still accept direct arguments like `css_selector=...`, `word_count_threshold=...`, etc., but we strongly advise migrating them into a **`CrawlerRunConfig`**.

---

## 4. Batch Processing: `arun_many()`

```python
async def arun_many(
    urls: List[str],
    config: Optional[CrawlerRunConfig] = None,
    # Legacy parameters maintained for backwards compatibility...
):
    ...
```

### 4.1 Resource-Aware Crawling

The `arun_many()` method now uses an intelligent dispatcher that:

- Monitors system memory usage
- Implements adaptive rate limiting
- Provides detailed progress monitoring
- Manages concurrent crawls efficiently

### 4.2 Example Usage

Check page [Multi-url Crawling](../advanced/multi-url-crawling.md) for a detailed example of how to use `arun_many()`.

## 4.3 Key Features

1. **Rate Limiting**

- Automatic delay between requests
- Exponential backoff on rate limit detection
- Domain-specific rate limiting
- Configurable retry strategy
2. **Resource Monitoring**
- Memory usage tracking
- Adaptive concurrency based on system load
- Automatic pausing when resources are constrained
3. **Progress Monitoring**
- Detailed or aggregated progress display
- Real-time status updates
- Memory usage statistics
4. **Error Handling**
- Graceful handling of rate limits
- Automatic retries with backoff
- Detailed error reporting
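The retry behavior described above follows the standard exponential-backoff pattern. A generic sketch — the parameters here are illustrative, not the dispatcher's actual defaults:

```python
import random

def backoff_delays(base=1.0, factor=2.0, max_delay=30.0, retries=5, jitter=0.0):
    """Generic exponential-backoff schedule of the kind used after a
    rate-limit response (HTTP 429). Parameters are illustrative, not
    crawl4ai's actual defaults."""
    delays, delay = [], base
    for _ in range(retries):
        # optional random jitter spreads retries out across clients
        delays.append(min(delay + random.uniform(0, jitter), max_delay))
        delay *= factor
    return delays

print(backoff_delays())  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

Capping at `max_delay` keeps a long retry chain from stalling the whole batch; jitter matters mainly when many workers hit the same domain at once.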

## 5. `CrawlResult` Output

Each `arun()` returns a **`CrawlResult`** containing:

- `url`: Final URL (if redirected).
- `html`: Original HTML.
- `cleaned_html`: Sanitized HTML.
- `markdown_v2`: Deprecated; use `markdown` instead.
- `extracted_content`: If an extraction strategy was used (JSON for CSS/LLM strategies).
- `screenshot`, `pdf`: If screenshots/PDF requested.
- `media`, `links`: Information about discovered images/links.
- `success`, `error_message`: Status info.

## 6. Quick Example
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy
import json

async def main():
    # 1. Browser config
    browser_cfg = BrowserConfig(
        browser_type="firefox",
        headless=False,
        verbose=True
    )

    # 2. Run config
    schema = {
        "name": "Articles",
        "baseSelector": "article.post",
        "fields": [
            {
                "name": "title",
                "selector": "h2",
                "type": "text"
            },
            {
                "name": "url",
                "selector": "a",
                "type": "attribute",
                "attribute": "href"
            }
        ]
    }

    run_cfg = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(schema),
        word_count_threshold=15,
        remove_overlay_elements=True,
        wait_for="css:.post"  # Wait for posts to appear
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url="https://example.com/blog",
            config=run_cfg
        )

        if result.success:
            print("Cleaned HTML length:", len(result.cleaned_html))
            if result.extracted_content:
                articles = json.loads(result.extracted_content)
                print("Extracted articles:", articles[:2])
        else:
            print("Error:", result.error_message)

asyncio.run(main())
```

- We define a **`BrowserConfig`** with Firefox, `headless=False` (visible browser), and `verbose=True`. 
- We define a **`CrawlerRunConfig`** that **bypasses cache**, uses a **CSS** extraction schema, has a `word_count_threshold=15`, etc. 
- We pass them to `AsyncWebCrawler(config=...)` and `arun(url=..., config=...)`.

## 7. Best Practices & Migration Notes

1. **Use** `BrowserConfig` for **global** settings about the browser’s environment. 
2. **Use** `CrawlerRunConfig` for **per-crawl** logic (caching, content filtering, extraction strategies, wait conditions). 
3. **Avoid** legacy parameters like `css_selector` or `word_count_threshold` directly in `arun()`. Instead:

   ```python
   run_cfg = CrawlerRunConfig(css_selector=".main-content", word_count_threshold=20)
   result = await crawler.arun(url="...", config=run_cfg)
   ```

## 8. Summary

- **Constructor** accepts **`BrowserConfig`** (or defaults). 
- **`arun(url, config=CrawlerRunConfig)`** is the main method for single-page crawls. 
- **`arun_many(urls, config=CrawlerRunConfig)`** handles concurrency across multiple URLs. 
- For advanced lifecycle control, use `start()` and `close()` explicitly. 
- If you used `AsyncWebCrawler(browser_type="chromium", css_selector="...")`, move browser settings to `BrowserConfig(...)` and content/crawl logic to `CrawlerRunConfig(...)`.

## `arun()` Parameter Guide (New Approach)

In Crawl4AI’s **latest** configuration model, nearly all parameters that once went directly to `arun()` are now part of **`CrawlerRunConfig`**. When calling `arun()`, you provide:

```python
await crawler.arun(
    url="https://example.com",
    config=my_run_config
)
```

Below is an organized look at the parameters that can go inside `CrawlerRunConfig`, divided by their functional areas. For **Browser** settings (e.g., `headless`, `browser_type`), see [BrowserConfig](./parameters.md).

## 1. Core Usage

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    run_config = CrawlerRunConfig(
        verbose=True,            # Detailed logging
        cache_mode=CacheMode.ENABLED,  # Use normal read/write cache
        check_robots_txt=True,   # Respect robots.txt rules
        # ... other parameters
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_config
        )

        # Check if blocked by robots.txt
        if not result.success and result.status_code == 403:
            print(f"Error: {result.error_message}")
```

- `verbose=True` logs each crawl step. 
- `cache_mode` decides how to read/write the local crawl cache.

## 2. Cache Control

**`cache_mode`** (default: `CacheMode.BYPASS` in recent releases; older releases defaulted to `ENABLED`)
Use a built-in enum from `CacheMode`:

- `ENABLED`: Normal caching—reads if available, writes if missing.
- `DISABLED`: No caching—always refetch pages.
- `READ_ONLY`: Reads from cache only; no new writes.
- `WRITE_ONLY`: Writes to cache but doesn’t read existing data.
- `BYPASS`: Skips reading cache for this crawl (though it might still write if set up that way).

```python
run_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS
)
```

Deprecated legacy boolean flags map onto these modes:

- `bypass_cache=True` acts like `CacheMode.BYPASS`.
- `disable_cache=True` acts like `CacheMode.DISABLED`.
- `no_cache_read=True` acts like `CacheMode.WRITE_ONLY`.
- `no_cache_write=True` acts like `CacheMode.READ_ONLY`.
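As a sketch, the legacy flags can be thought of as translating to the enum roughly like this — the precedence order shown is an assumption for illustration, not crawl4ai's exact resolution logic:

```python
from enum import Enum

class CacheMode(Enum):
    ENABLED = "enabled"
    DISABLED = "disabled"
    READ_ONLY = "read_only"
    WRITE_ONLY = "write_only"
    BYPASS = "bypass"

def legacy_to_cache_mode(bypass_cache=False, disable_cache=False,
                         no_cache_read=False, no_cache_write=False):
    """Sketch of the legacy-flag translation. The precedence order below is
    an assumption made for this example, not crawl4ai's exact logic."""
    if disable_cache:
        return CacheMode.DISABLED
    if bypass_cache:
        return CacheMode.BYPASS
    if no_cache_read:
        return CacheMode.WRITE_ONLY
    if no_cache_write:
        return CacheMode.READ_ONLY
    return CacheMode.ENABLED

print(legacy_to_cache_mode(bypass_cache=True))   # CacheMode.BYPASS
print(legacy_to_cache_mode())                    # CacheMode.ENABLED
```

In new code, skip the flags entirely and set `cache_mode` on `CrawlerRunConfig` directly.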

## 3. Content Processing & Selection

### 3.1 Text Processing

```python
run_config = CrawlerRunConfig(
    word_count_threshold=10,   # Ignore text blocks <10 words
    only_text=False,           # If True, tries to remove non-text elements
    keep_data_attributes=False # Keep or discard data-* attributes
)
```

### 3.2 Content Selection

```python
run_config = CrawlerRunConfig(
    css_selector=".main-content",  # Focus on .main-content region only
    excluded_tags=["form", "nav"], # Remove entire tag blocks
    remove_forms=True,             # Specifically strip <form> elements
    remove_overlay_elements=True,  # Attempt to remove modals/popups
)
```

### 3.3 Link Handling

```python
run_config = CrawlerRunConfig(
    exclude_external_links=True,         # Remove external links from final content
    exclude_social_media_links=True,     # Remove links to known social sites
    exclude_domains=["ads.example.com"], # Exclude links to these domains
    exclude_social_media_domains=["facebook.com","twitter.com"], # Extend the default list
)
```

### 3.4 Media Filtering

```python
run_config = CrawlerRunConfig(
    exclude_external_images=True  # Strip images from other domains
)
```

## 4. Page Navigation & Timing

### 4.1 Basic Browser Flow

```python
run_config = CrawlerRunConfig(
    wait_for="css:.dynamic-content", # Wait for .dynamic-content
    delay_before_return_html=2.0,    # Wait 2s before capturing final HTML
    page_timeout=60000,             # Navigation & script timeout (ms)
)
```

- `wait_for`:
  - `"css:selector"` or
  - `"js:() => boolean"`
  e.g. `js:() => document.querySelectorAll('.item').length > 10`.
- `mean_delay` & `max_range`: define random delays for `arun_many()` calls. 
- `semaphore_count`: concurrency limit when crawling multiple URLs.
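Conceptually, `wait_for` polls a condition until it holds or the timeout expires. The stdlib sketch below shows the shape of that loop (crawl4ai actually evaluates its `css:`/`js:` conditions inside the browser page):

```python
import asyncio

async def wait_for(predicate, timeout=5.0, interval=0.05):
    """Poll a zero-argument predicate until it returns True or the timeout
    elapses. Illustrates the idea behind wait_for; crawl4ai evaluates its
    css:/js: conditions inside the browser instead."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while loop.time() < deadline:
        if predicate():
            return True
        await asyncio.sleep(interval)
    raise TimeoutError("condition not met before timeout")

async def main():
    items = []

    async def load_items():
        await asyncio.sleep(0.1)      # simulates slow dynamic content
        items.extend(range(11))

    asyncio.create_task(load_items())
    # analogous to: wait_for="js:() => document.querySelectorAll('.item').length > 10"
    await wait_for(lambda: len(items) > 10, timeout=2.0)
    print(f"{len(items)} items loaded")

asyncio.run(main())
```

The timeout is the important part: a `wait_for` condition that never becomes true should fail the crawl rather than hang it, which is what `page_timeout` guards against.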

### 4.2 JavaScript Execution

```python
run_config = CrawlerRunConfig(
    js_code=[
        "window.scrollTo(0, document.body.scrollHeight);",
        "document.querySelector('.load-more')?.click();"
    ],
    js_only=False
)
```

- `js_code` can be a single string or a list of strings. 
- `js_only=True` means “I’m continuing in the same session with new JS steps, no new full navigation.”

### 4.3 Anti-Bot

```python
run_config = CrawlerRunConfig(
    magic=True,
    simulate_user=True,
    override_navigator=True
)
```

- `magic=True` tries multiple stealth features. 
- `simulate_user=True` mimics mouse movements or random delays. 
- `override_navigator=True` fakes some navigator properties (like user agent checks).

## 5. Session Management

**`session_id`**:

```python
run_config = CrawlerRunConfig(
    session_id="my_session123"
)
```

If re-used in subsequent `arun()` calls, the same tab/page context is continued (helpful for multi-step tasks or stateful browsing).
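The reuse semantics can be modeled as a pool keyed by `session_id` — a toy illustration of the behavior, not crawl4ai's internals:

```python
class SessionPool:
    """Toy model of session_id reuse: the same session_id returns the same
    page context across calls. Illustrative only, not crawl4ai's internals."""
    def __init__(self):
        self._pages = {}

    def get_page(self, session_id=None):
        if session_id is None:
            return {"history": []}                 # fresh throwaway page
        return self._pages.setdefault(session_id, {"history": []})

    def kill_session(self, session_id):
        self._pages.pop(session_id, None)          # free the tab/page

pool = SessionPool()
first = pool.get_page("my_session123")
first["history"].append("https://example.com/step1")
again = pool.get_page("my_session123")             # same page context
print(again["history"])  # ['https://example.com/step1']
```

Because the page context lingers until explicitly killed, remember to call `kill_session()` when a multi-step flow is done.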

## 6. Screenshot, PDF & Media Options

```python
run_config = CrawlerRunConfig(
    screenshot=True,             # Grab a screenshot as base64
    screenshot_wait_for=1.0,     # Wait 1s before capturing
    pdf=True,                    # Also produce a PDF
    image_description_min_word_threshold=5,  # If analyzing alt text
    image_score_threshold=3,                # Filter out low-score images
)
```

- `result.screenshot` → Base64 screenshot string.
- `result.pdf` → Byte array with PDF data.

## 7. Extraction Strategy

**For advanced data extraction** (CSS/LLM-based), set `extraction_strategy`:

```python
run_config = CrawlerRunConfig(
    extraction_strategy=my_css_or_llm_strategy
)
```

The extracted data will appear in `result.extracted_content`.

## 8. Comprehensive Example

Below is a snippet combining many parameters:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy

async def main():
    # Example schema
    schema = {
        "name": "Articles",
        "baseSelector": "article.post",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "link",  "selector": "a",  "type": "attribute", "attribute": "href"}
        ]
    }

    run_config = CrawlerRunConfig(
        # Core
        verbose=True,
        cache_mode=CacheMode.ENABLED,
        check_robots_txt=True,   # Respect robots.txt rules

        # Content
        word_count_threshold=10,
        css_selector="main.content",
        excluded_tags=["nav", "footer"],
        exclude_external_links=True,

        # Page & JS
        js_code="document.querySelector('.show-more')?.click();",
        wait_for="css:.loaded-block",
        page_timeout=30000,

        # Extraction
        extraction_strategy=JsonCssExtractionStrategy(schema),

        # Session
        session_id="persistent_session",

        # Media
        screenshot=True,
        pdf=True,

        # Anti-bot
        simulate_user=True,
        magic=True,
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/posts", config=run_config)
        if result.success:
            print("HTML length:", len(result.cleaned_html))
            print("Extraction JSON:", result.extracted_content)
            if result.screenshot:
                print("Screenshot length:", len(result.screenshot))
            if result.pdf:
                print("PDF bytes length:", len(result.pdf))
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```

1. **Crawling** the main content region, ignoring external links. 
2. Running **JavaScript** to click “.show-more”. 
3. **Waiting** for “.loaded-block” to appear. 
4. Generating a **screenshot** & **PDF** of the final page. 

## 9. Best Practices

1. **Use `BrowserConfig` for global browser** settings (headless, user agent). 
2. **Use `CrawlerRunConfig`** to handle the **specific** crawl needs: content filtering, caching, JS, screenshot, extraction, etc. 
3. **Limit** large concurrency (`semaphore_count`) if the site or your system can’t handle it. 
4. For dynamic pages, set `js_code` or `scan_full_page` so you load all content.

## 10. Conclusion

All parameters that used to be direct arguments to `arun()` now belong in **`CrawlerRunConfig`**, which makes code **clearer** and **more maintainable**.

## `arun_many(...)` Reference

> **Note**: This function is very similar to [`arun()`](./arun.md) but focused on **concurrent** or **batch** crawling. If you’re unfamiliar with `arun()` usage, please read that doc first, then review this for differences.

## Function Signature

```python
async def arun_many(
    urls: Union[List[str], List[Any]],
    config: Optional[Union[CrawlerRunConfig, List[CrawlerRunConfig]]] = None,
    dispatcher: Optional[BaseDispatcher] = None,
    ...
) -> Union[List[CrawlResult], AsyncGenerator[CrawlResult, None]]:
    """
    Crawl multiple URLs concurrently or in batches.

    :param urls: A list of URLs (or tasks) to crawl.
    :param config: (Optional) Either:
        - A single `CrawlerRunConfig` applying to all URLs
        - A list of `CrawlerRunConfig` objects with url_matcher patterns
    :param dispatcher: (Optional) A concurrency controller (e.g. MemoryAdaptiveDispatcher).
    ...
    :return: Either a list of `CrawlResult` objects, or an async generator if streaming is enabled.
    """
```

## Differences from `arun()`

1. **Multiple URLs**:

- Instead of crawling a single URL, you pass a list of them (strings or tasks). 
- The function returns either a **list** of `CrawlResult` or an **async generator** if streaming is enabled.
2. **Concurrency & Dispatchers**:
- **`dispatcher`** param allows advanced concurrency control. 
- If omitted, a default dispatcher (like `MemoryAdaptiveDispatcher`) is used internally. 
3. **Streaming Support**:
- Enable streaming by setting `stream=True` in your `CrawlerRunConfig`.
- When streaming, use `async for` to process results as they become available.
4. **Parallel Execution**:
- `arun_many()` can run multiple requests concurrently under the hood. 
- Each `CrawlResult` might also include a **`dispatch_result`** with concurrency details (like memory usage, start/end times).

### Basic Example (Batch Mode)

```python
# Minimal usage: The default dispatcher will be used
results = await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com"],
    config=CrawlerRunConfig(stream=False)  # Default behavior
)

for res in results:
    if res.success:
        print(res.url, "crawled OK!")
    else:
        print("Failed:", res.url, "-", res.error_message)
```

### Streaming Example

```python
config = CrawlerRunConfig(
    stream=True,  # Enable streaming mode
    cache_mode=CacheMode.BYPASS
)

# Process results as they complete
async for result in await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com", "https://site3.com"],
    config=config
):
    if result.success:
        print(f"Just completed: {result.url}")
        # Process each result immediately
        process_result(result)
```

### With a Custom Dispatcher

```python
dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=70.0,
    max_session_permit=10
)
results = await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com", "https://site3.com"],
    config=my_run_config,
    dispatcher=dispatcher
)
```

### URL-Specific Configurations

Instead of using one config for all URLs, provide a list of configs with `url_matcher` patterns:

```python
from crawl4ai import CrawlerRunConfig, MatchMode
from crawl4ai.processors.pdf import PDFContentScrapingStrategy
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

# PDF files - specialized extraction
pdf_config = CrawlerRunConfig(
    url_matcher="*.pdf",
    scraping_strategy=PDFContentScrapingStrategy()
)

# Blog/article pages - content filtering
blog_config = CrawlerRunConfig(
    url_matcher=["*/blog/*", "*/article/*", "*python.org*"],
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.48)
    )
)

# Dynamic pages - JavaScript execution
github_config = CrawlerRunConfig(
    url_matcher=lambda url: 'github.com' in url,
    js_code="window.scrollTo(0, 500);"
)

# API endpoints - JSON extraction
api_config = CrawlerRunConfig(
    url_matcher=lambda url: 'api' in url or url.endswith('.json'),
    # Custom settings for JSON extraction
)

# Default fallback config
default_config = CrawlerRunConfig()  # No url_matcher — acts as the fallback for unmatched URLs

# Pass the list of configs - first match wins!
results = await crawler.arun_many(
    urls=[
        "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",  # → pdf_config
        "https://blog.python.org/",  # → blog_config
        "https://github.com/microsoft/playwright",  # → github_config
        "https://httpbin.org/json",  # → api_config
        "https://example.com/"  # → default_config
    ],
    config=[pdf_config, blog_config, github_config, api_config, default_config]
)
```

- **String patterns**: `"*.pdf"`, `"*/blog/*"`, `"*python.org*"`
- **Function matchers**: `lambda url: 'api' in url`
- **Mixed patterns**: Combine strings and functions with `MatchMode.OR` or `MatchMode.AND`
- **First match wins**: Configs are evaluated in order
- `dispatch_result` in each `CrawlResult` (if using concurrency) can hold memory and timing info. 
- **Important**: Always include a default config (without `url_matcher`) as the last item if you want to handle all URLs. Otherwise, unmatched URLs will fail.
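To make the first-match-wins resolution concrete, here is a minimal, self-contained sketch. It uses `fnmatch`-style glob matching as a stand-in for crawl4ai's internal pattern matching; the `pick_config` helper and the `(matcher, name)` pairs are illustrative, not library API:

```python
from fnmatch import fnmatch

def pick_config(url, configs):
    """Illustrative first-match-wins resolution: return the name of the
    first config whose matcher accepts the URL. A matcher may be a glob
    string, a callable, or None (fallback — place it last)."""
    for matcher, name in configs:
        if matcher is None:  # no url_matcher → fallback, matches anything
            return name
        if callable(matcher) and matcher(url):
            return name
        if isinstance(matcher, str) and fnmatch(url, matcher):
            return name
    return None  # no config matched → crawl4ai would fail this URL

configs = [
    ("*.pdf", "pdf_config"),
    (lambda u: "github.com" in u, "github_config"),
    (None, "default_config"),
]

print(pick_config("https://example.com/report.pdf", configs))           # pdf_config
print(pick_config("https://github.com/microsoft/playwright", configs))  # github_config
print(pick_config("https://example.com/", configs))                     # default_config
```

Note how removing the `(None, "default_config")` entry would make the last call return `None` — the analogue of an unmatched URL failing.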

### Return Value

Either a **list** of [`CrawlResult`](./crawl-result.md) objects, or an **async generator** if streaming is enabled. You can iterate to check `result.success` or read each item’s `extracted_content`, `markdown`, or `dispatch_result`.

## Dispatcher Reference

- **`MemoryAdaptiveDispatcher`**: Dynamically manages concurrency based on system memory usage. 
- **`SemaphoreDispatcher`**: Fixed concurrency limit, simpler but less adaptive. 
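The behavior of a fixed-limit dispatcher can be sketched with a plain `asyncio.Semaphore`. This is a hypothetical illustration of the concurrency cap (the `fake_fetch` coroutine stands in for a real crawl; nothing here is crawl4ai API):

```python
import asyncio

async def crawl_all(urls, limit=3):
    """Sketch of what a fixed-limit dispatcher does: at most `limit`
    fetches run at once, and we record the peak concurrency observed."""
    sem = asyncio.Semaphore(limit)
    active = 0
    peak = 0

    async def fake_fetch(url):
        nonlocal active, peak
        async with sem:           # blocks when `limit` fetches are in flight
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # pretend to crawl
            active -= 1
            return f"done: {url}"

    results = await asyncio.gather(*(fake_fetch(u) for u in urls))
    return results, peak

results, peak = asyncio.run(
    crawl_all([f"https://site{i}.com" for i in range(10)], limit=3)
)
print(len(results), peak)  # all 10 finish; peak concurrency never exceeds 3
```

A `MemoryAdaptiveDispatcher` replaces the fixed `limit` with a dynamic one derived from system memory pressure.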

## Common Pitfalls

1. **Error Handling**: Each `CrawlResult` might fail for different reasons — always check `result.success` or `error_message` before proceeding.

## Conclusion

Use `arun_many()` when you want to **crawl multiple URLs** simultaneously or in controlled parallel tasks. If you need advanced concurrency features (like memory-based adaptive throttling or complex rate-limiting), provide a **dispatcher**. Each result is a standard `CrawlResult`, possibly augmented with concurrency stats (`dispatch_result`) for deeper inspection. For more details on concurrency logic and dispatchers, see the [Advanced Multi-URL Crawling](../advanced/multi-url-crawling.md) docs.

## `CrawlResult` Reference

The **`CrawlResult`** class encapsulates everything returned after a single crawl operation. It provides the **raw or processed content**, details on links and media, plus optional metadata (like screenshots, PDFs, or extracted JSON).
**Location**: `crawl4ai/crawler/models.py` (for reference)

```python
class CrawlResult(BaseModel):
    url: str
    html: str
    success: bool
    cleaned_html: Optional[str] = None
    fit_html: Optional[str] = None  # Preprocessed HTML optimized for extraction
    media: Dict[str, List[Dict]] = {}
    links: Dict[str, List[Dict]] = {}
    downloaded_files: Optional[List[str]] = None
    screenshot: Optional[str] = None
    pdf: Optional[bytes] = None
    mhtml: Optional[str] = None
    markdown: Optional[Union[str, MarkdownGenerationResult]] = None
    extracted_content: Optional[str] = None
    metadata: Optional[dict] = None
    error_message: Optional[str] = None
    session_id: Optional[str] = None
    response_headers: Optional[dict] = None
    status_code: Optional[int] = None
    ssl_certificate: Optional[SSLCertificate] = None
    dispatch_result: Optional[DispatchResult] = None
    ...
```

## 1. Basic Crawl Info

### 1.1 **`url`** *(str)*

```python
print(result.url)  # e.g., "https://example.com/"
```

### 1.2 **`success`** *(bool)*

**What**: `True` if the crawl pipeline ended without major errors; `False` otherwise.

```python
if not result.success:
    print(f"Crawl failed: {result.error_message}")
```

### 1.3 **`status_code`** *(Optional[int])*

```python
if result.status_code == 404:
    print("Page not found!")
```

### 1.4 **`error_message`** *(Optional[str])*

**What**: If `success=False`, a textual description of the failure.

```python
if not result.success:
    print("Error:", result.error_message)
```

### 1.5 **`session_id`** *(Optional[str])*

```python
# If you used session_id="login_session" in CrawlerRunConfig, see it here:
print("Session:", result.session_id)
```

### 1.6 **`response_headers`** *(Optional[dict])*

```python
if result.response_headers:
    print("Server:", result.response_headers.get("Server", "Unknown"))
```

### 1.7 **`ssl_certificate`** *(Optional[SSLCertificate])*

**What**: If `fetch_ssl_certificate=True` in your `CrawlerRunConfig`, **`result.ssl_certificate`** contains an [**`SSLCertificate`**](../advanced/ssl-certificate.md) object describing the site's certificate. You can export the cert in multiple formats (PEM/DER/JSON) or access its properties like `issuer`, `subject`, `valid_from`, and `valid_until`.

```python
if result.ssl_certificate:
    print("Issuer:", result.ssl_certificate.issuer)
```

## 2. Raw / Cleaned Content

### 2.1 **`html`** *(str)*

```python
# Possibly large
print(len(result.html))
```

### 2.2 **`cleaned_html`** *(Optional[str])*

**What**: A sanitized HTML version—scripts, styles, or excluded tags are removed based on your `CrawlerRunConfig`.

```python
print((result.cleaned_html or "")[:500])  # Show a snippet
```

## 3. Markdown Fields

### 3.1 The Markdown Generation Approach

Crawl4AI can produce several markdown variants:

- **Raw** markdown
- **Links as citations** (with a references section)
- **Fit** markdown if a **content filter** is used (like Pruning or BM25)

**`MarkdownGenerationResult`** includes:
- **`raw_markdown`** *(str)*: The full HTML→Markdown conversion.
- **`markdown_with_citations`** *(str)*: Same markdown, but with link references as academic-style citations.
- **`references_markdown`** *(str)*: The reference list or footnotes at the end.
- **`fit_markdown`** *(Optional[str])*: If content filtering (Pruning/BM25) was applied, the filtered "fit" text.
- **`fit_html`** *(Optional[str])*: The HTML that led to `fit_markdown`.

```python
if result.markdown:
    md_res = result.markdown
    print("Raw MD:", md_res.raw_markdown[:300])
    print("Citations MD:", md_res.markdown_with_citations[:300])
    print("References:", md_res.references_markdown)
    if md_res.fit_markdown:
        print("Pruned text:", md_res.fit_markdown[:300])
```

### 3.2 **`markdown`** *(Optional[Union[str, MarkdownGenerationResult]])*

**What**: Holds the `MarkdownGenerationResult`.

```python
print(result.markdown.raw_markdown[:200])
print(result.markdown.fit_markdown)
print(result.markdown.fit_html)
```

**Important**: "Fit" content (`fit_markdown`/`fit_html`) is present in `result.markdown` only if you used a **filter** (like **PruningContentFilter** or **BM25ContentFilter**) within a `MarkdownGenerationStrategy`.

## 4. Media & Links

### 4.1 **`media`** *(Dict[str, List[Dict]])*

**What**: Contains info about discovered images, videos, or audio. Typically keys: `"images"`, `"videos"`, `"audios"`.

- `src` *(str)*: Media URL
- `alt` or `title` *(str)*: Descriptive text
- `score` *(float)*: Relevance score if the crawler's heuristic found it "important"
- `desc` or `description` *(Optional[str])*: Additional context extracted from surrounding text

```python
images = result.media.get("images", [])
for img in images:
    if img.get("score", 0) > 5:
        print("High-value image:", img["src"])
```

### 4.2 **`links`** *(Dict[str, List[Dict]])*

**What**: Holds internal and external link data. Usually two keys: `"internal"` and `"external"`.

- `href` *(str)*: The link target
- `text` *(str)*: Link text
- `title` *(str)*: Title attribute
- `context` *(str)*: Surrounding text snippet
- `domain` *(str)*: If external, the domain

```python
for link in result.links["internal"]:
    print(f"Internal link to {link['href']} with text {link['text']}")
```

## 5. Additional Fields

### 5.1 **`extracted_content`** *(Optional[str])*

**What**: If you used **`extraction_strategy`** (CSS, LLM, etc.), the structured output (JSON).

```python
import json

if result.extracted_content:
    data = json.loads(result.extracted_content)
    print(data)
```

### 5.2 **`downloaded_files`** *(Optional[List[str]])*

**What**: If `accept_downloads=True` in your `BrowserConfig` + `downloads_path`, lists local file paths for downloaded items.

```python
if result.downloaded_files:
    for file_path in result.downloaded_files:
        print("Downloaded:", file_path)
```

### 5.3 **`screenshot`** *(Optional[str])*

**What**: Base64-encoded screenshot if `screenshot=True` in `CrawlerRunConfig`.

```python
import base64
if result.screenshot:
    with open("page.png", "wb") as f:
        f.write(base64.b64decode(result.screenshot))
```

### 5.4 **`pdf`** *(Optional[bytes])*

**What**: Raw PDF bytes if `pdf=True` in `CrawlerRunConfig`.

```python
if result.pdf:
    with open("page.pdf", "wb") as f:
        f.write(result.pdf)
```

### 5.5 **`mhtml`** *(Optional[str])*

**What**: MHTML snapshot of the page if `capture_mhtml=True` in `CrawlerRunConfig`. MHTML (MIME HTML) format preserves the entire web page with all its resources (CSS, images, scripts, etc.) in a single file.

```python
if result.mhtml:
    with open("page.mhtml", "w", encoding="utf-8") as f:
        f.write(result.mhtml)
```

### 5.6 **`metadata`** *(Optional[dict])*

```python
if result.metadata:
    print("Title:", result.metadata.get("title"))
    print("Author:", result.metadata.get("author"))
```

## 6. `dispatch_result` (optional)

A `DispatchResult` object providing additional concurrency and resource usage information when crawling URLs in parallel (e.g., via `arun_many()` with custom dispatchers). It contains:

- **`task_id`**: A unique identifier for the parallel task.
- **`memory_usage`** (float): The memory (in MB) used at the time of completion.
- **`peak_memory`** (float): The peak memory usage (in MB) recorded during the task's execution.
- **`start_time`** / **`end_time`** (datetime): Time range for this crawling task.
- **`error_message`** (str): Any dispatcher- or concurrency-related error encountered.

```python
# Example usage:
for result in results:
    if result.success and result.dispatch_result:
        dr = result.dispatch_result
        print(f"URL: {result.url}, Task ID: {dr.task_id}")
        print(f"Memory: {dr.memory_usage:.1f} MB (Peak: {dr.peak_memory:.1f} MB)")
        print(f"Duration: {dr.end_time - dr.start_time}")
```

> **Note**: This field is typically populated when using `arun_many(...)` alongside a **dispatcher** (e.g., `MemoryAdaptiveDispatcher` or `SemaphoreDispatcher`). If no concurrency or dispatcher is used, `dispatch_result` may remain `None`.

## 7. Network Requests & Console Messages

When you enable network and console message capturing in `CrawlerRunConfig` using `capture_network_requests=True` and `capture_console_messages=True`, the `CrawlResult` will include these fields:

### 7.1 **`network_requests`** *(Optional[List[Dict[str, Any]]])*

- Each item has an `event_type` field that can be `"request"`, `"response"`, or `"request_failed"`.
- Request events include `url`, `method`, `headers`, `post_data`, `resource_type`, and `is_navigation_request`.
- Response events include `url`, `status`, `status_text`, `headers`, and `request_timing`.
- Failed request events include `url`, `method`, `resource_type`, and `failure_text`.
- All events include a `timestamp` field.

```python
if result.network_requests:
    # Count different types of events
    requests = [r for r in result.network_requests if r.get("event_type") == "request"]
    responses = [r for r in result.network_requests if r.get("event_type") == "response"]
    failures = [r for r in result.network_requests if r.get("event_type") == "request_failed"]

    print(f"Captured {len(requests)} requests, {len(responses)} responses, and {len(failures)} failures")

    # Analyze API calls
    api_calls = [r for r in requests if "api" in r.get("url", "")]

    # Identify failed resources
    for failure in failures:
        print(f"Failed to load: {failure.get('url')} - {failure.get('failure_text')}")
```

### 7.2 **`console_messages`** *(Optional[List[Dict[str, Any]]])*

- Each item has a `type` field indicating the message type (e.g., `"log"`, `"error"`, `"warning"`, etc.).
- The `text` field contains the actual message text.
- Some messages include `location` information (URL, line, column).
- All messages include a `timestamp` field.

```python
if result.console_messages:
    # Count messages by type
    message_types = {}
    for msg in result.console_messages:
        msg_type = msg.get("type", "unknown")
        message_types[msg_type] = message_types.get(msg_type, 0) + 1

    print(f"Message type counts: {message_types}")

    # Display errors (which are usually most important)
    for msg in result.console_messages:
        if msg.get("type") == "error":
            print(f"Error: {msg.get('text')}")
```

## 8. Example: Accessing Everything

```python
async def handle_result(result: CrawlResult):
    if not result.success:
        print("Crawl error:", result.error_message)
        return

    # Basic info
    print("Crawled URL:", result.url)
    print("Status code:", result.status_code)

    # HTML
    print("Original HTML size:", len(result.html))
    print("Cleaned HTML size:", len(result.cleaned_html or ""))

    # Markdown output
    if result.markdown:
        print("Raw Markdown:", result.markdown.raw_markdown[:300])
        print("Citations Markdown:", result.markdown.markdown_with_citations[:300])
        if result.markdown.fit_markdown:
            print("Fit Markdown:", result.markdown.fit_markdown[:200])

    # Media & Links
    if "images" in result.media:
        print("Image count:", len(result.media["images"]))
    if "internal" in result.links:
        print("Internal link count:", len(result.links["internal"]))

    # Extraction strategy result
    if result.extracted_content:
        print("Structured data:", result.extracted_content)

    # Screenshot/PDF/MHTML
    if result.screenshot:
        print("Screenshot length:", len(result.screenshot))
    if result.pdf:
        print("PDF bytes length:", len(result.pdf))
    if result.mhtml:
        print("MHTML length:", len(result.mhtml))

    # Network and console capturing
    if result.network_requests:
        print(f"Network requests captured: {len(result.network_requests)}")
        # Analyze request types
        req_types = {}
        for req in result.network_requests:
            if "resource_type" in req:
                req_types[req["resource_type"]] = req_types.get(req["resource_type"], 0) + 1
        print(f"Resource types: {req_types}")

    if result.console_messages:
        print(f"Console messages captured: {len(result.console_messages)}")
        # Count by message type
        msg_types = {}
        for msg in result.console_messages:
            msg_types[msg.get("type", "unknown")] = msg_types.get(msg.get("type", "unknown"), 0) + 1
        print(f"Message types: {msg_types}")
```

## 9. Key Points & Future

1. **Deprecated legacy properties of CrawlResult**

- `markdown_v2` - Deprecated in v0.5. Just use `markdown`; it now holds the `MarkdownGenerationResult`.
- `fit_markdown` and `fit_html` - Deprecated in v0.5. Access them via the `MarkdownGenerationResult` in `result.markdown`, e.g. `result.markdown.fit_markdown` and `result.markdown.fit_html`.
2. **Fit Content**
- **`fit_markdown`** and **`fit_html`** appear in MarkdownGenerationResult, only if you used a content filter (like **PruningContentFilter** or **BM25ContentFilter**) inside your **MarkdownGenerationStrategy** or set them directly.
- If no filter is used, they remain `None`.
3. **References & Citations**
- If you enable link citations in your `DefaultMarkdownGenerator` (`options={"citations": True}`), you’ll see `markdown_with_citations` plus a **`references_markdown`** block. This helps large language models or academic-like referencing.
4. **Links & Media**
- `links["internal"]` and `links["external"]` group discovered anchors by domain.
- `media["images"]` / `["videos"]` / `["audios"]` store extracted media elements with optional scoring or context.
5. **Error Cases**
- If `success=False`, check `error_message` (e.g., timeouts, invalid URLs).
- `status_code` might be `None` if we failed before an HTTP response.

Use **`CrawlResult`** to access every final output and feed it into your data pipelines, AI models, or archives. With a properly configured **`BrowserConfig`** and **`CrawlerRunConfig`**, the crawler delivers robust, structured results through **`CrawlResult`**.

## Configuration

## Browser, Crawler & LLM Configuration (Quick Overview)

Crawl4AI's flexibility stems from two key classes:

1. **`BrowserConfig`** – Dictates **how** the browser is launched and behaves (e.g., headless or visible, proxy, user agent).
2. **`CrawlerRunConfig`** – Dictates **how** each **crawl** operates (e.g., caching, extraction, timeouts, JavaScript code to run, etc.).
3. **`LLMConfig`** - Dictates **how** LLM providers are configured. (model, api token, base url, temperature etc.)

In most examples, you create **one** `BrowserConfig` for the entire crawler session, then pass a **fresh** or re-used `CrawlerRunConfig` whenever you call `arun()`. This tutorial shows the most commonly used parameters. If you need advanced or rarely used fields, see the [Configuration Parameters](../api/parameters.md).

## 1. BrowserConfig Essentials

```python
class BrowserConfig:
    def __init__(
        browser_type="chromium",
        headless=True,
        proxy_config=None,
        viewport_width=1080,
        viewport_height=600,
        verbose=True,
        use_persistent_context=False,
        user_data_dir=None,
        cookies=None,
        headers=None,
        user_agent=None,
        text_mode=False,
        light_mode=False,
        extra_args=None,
        enable_stealth=False,
        # ... other advanced parameters omitted here
    ):
        ...
```

### Key Fields to Note

1. **`browser_type`**
- Options: `"chromium"`, `"firefox"`, or `"webkit"`.
- Defaults to `"chromium"`.
- If you need a different engine, specify it here.
2. **`headless`**

- `True`: Runs the browser in headless mode (invisible browser).
- `False`: Runs the browser in visible mode, which helps with debugging.
3. **`proxy_config`**

- A dictionary with fields like:

```json
{
    "server": "http://proxy.example.com:8080",
    "username": "...",
    "password": "..."
}
```

- Leave as `None` if a proxy is not required.
4. **`viewport_width` & `viewport_height`**:

- The initial window size.
- Some sites behave differently with smaller or bigger viewports.
5. **`verbose`**:

- If `True`, prints extra logs.
- Handy for debugging.
6. **`use_persistent_context`**:

- If `True`, uses a **persistent** browser profile, storing cookies/local storage across runs.
- Typically also set `user_data_dir` to point to a folder.
7. **`cookies`** & **`headers`**:

- E.g. `cookies=[{"name": "session", "value": "abc123", "domain": "example.com"}]`.
8. **`user_agent`**:

- Custom User-Agent string. If `None`, a default is used.
- You can also set `user_agent_mode="random"` for randomization (if you want to fight bot detection).
9. **`text_mode`** & **`light_mode`**:

- `text_mode=True` disables images, possibly speeding up text-only crawls.
- `light_mode=True` turns off certain background features for performance.
10. **`extra_args`**:

- Additional flags for the underlying browser.
- E.g. `["--disable-extensions"]`.
11. **`enable_stealth`**:

- If `True`, enables stealth mode using playwright-stealth.
- Modifies browser fingerprints to avoid basic bot detection.
- Default is `False`. Recommended for sites with bot protection.

### Helper Methods

Both configuration classes provide a `clone()` method to create modified copies:

```python
# Create a base browser config
base_browser = BrowserConfig(
    browser_type="chromium",
    headless=True,
    text_mode=True
)

# Create a visible browser config for debugging
debug_browser = base_browser.clone(
    headless=False,
    verbose=True
)
```

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_conf = BrowserConfig(
    browser_type="firefox",
    headless=False,
    text_mode=True
)

async with AsyncWebCrawler(config=browser_conf) as crawler:
    result = await crawler.arun("https://example.com")
    print(result.markdown[:300])
```

## 2. CrawlerRunConfig Essentials

```python
class CrawlerRunConfig:
    def __init__(
        word_count_threshold=200,
        extraction_strategy=None,
        markdown_generator=None,
        cache_mode=None,
        js_code=None,
        wait_for=None,
        screenshot=False,
        pdf=False,
        capture_mhtml=False,
        # Location and Identity Parameters
        locale=None,            # e.g. "en-US", "fr-FR"
        timezone_id=None,       # e.g. "America/New_York"
        geolocation=None,       # GeolocationConfig object
        # Resource Management
        enable_rate_limiting=False,
        rate_limit_config=None,
        memory_threshold_percent=70.0,
        check_interval=1.0,
        max_session_permit=20,
        display_mode=None,
        verbose=True,
        stream=False,  # Enable streaming for arun_many()
        # ... other advanced parameters omitted
    ):
        ...
```

### Key Fields to Note

1. **`word_count_threshold`**:

- The minimum word count before a block is considered.
- If your site has lots of short paragraphs or items, you can lower it.
2. **`extraction_strategy`**:

- Where you plug in JSON-based extraction (CSS, LLM, etc.).
- If `None`, no structured extraction is done (only raw/cleaned HTML + markdown).
3. **`markdown_generator`**:

- E.g., `DefaultMarkdownGenerator(...)`, controlling how HTML→Markdown conversion is done.
- If `None`, a default approach is used.
4. **`cache_mode`**:

- Controls caching behavior (`ENABLED`, `BYPASS`, `DISABLED`, etc.).
- If `None`, defaults to some level of caching or you can specify `CacheMode.ENABLED`.
5. **`js_code`**:

- A string or list of JS strings to execute.
- Great for "Load More" buttons or user interactions.
6. **`wait_for`**:

- A CSS or JS expression to wait for before extracting content.
- Common usage: `wait_for="css:.main-loaded"` or `wait_for="js:() => window.loaded === true"`.
7. **`screenshot`**, **`pdf`**, & **`capture_mhtml`**:

- If `True`, captures a screenshot, PDF, or MHTML snapshot after the page is fully loaded.
- The results go to `result.screenshot` (base64), `result.pdf` (bytes), or `result.mhtml` (string).
8. **Location Parameters**:

- **`locale`**: Browser's locale (e.g., `"en-US"`, `"fr-FR"`) for language preferences
- **`timezone_id`**: Browser's timezone (e.g., `"America/New_York"`, `"Europe/Paris"`)
- **`geolocation`**: GPS coordinates via `GeolocationConfig(latitude=48.8566, longitude=2.3522)`
9. **`verbose`**:

- Logs additional runtime details.
- Overlaps with the browser's verbosity if also set to `True` in `BrowserConfig`.
10. **`enable_rate_limiting`**:

- If `True`, enables rate limiting for batch processing.
- Requires `rate_limit_config` to be set.
11. **`memory_threshold_percent`**:

- The memory threshold (as a percentage) to monitor.
- If exceeded, the crawler will pause or slow down.
12. **`check_interval`**:

- The interval (in seconds) to check system resources.
- Affects how often memory and CPU usage are monitored.
13. **`max_session_permit`**:

- The maximum number of concurrent crawl sessions.
- Helps prevent overwhelming the system.
14. **`url_matcher`** & **`match_mode`**:

- Enable URL-specific configurations when used with `arun_many()`.
- Set `url_matcher` to patterns (glob, function, or list) to match specific URLs.
- Use `match_mode` (OR/AND) to control how multiple patterns combine.
15. **`display_mode`**:

- The display mode for progress information (`DETAILED`, `BRIEF`, etc.).
- Affects how much information is printed during the crawl.
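The `wait_for` string convention above ("css:" for a selector, "js:" for a predicate) can be illustrated with a tiny parser. This is a hypothetical sketch of the documented string format, not crawl4ai's internal implementation:

```python
def classify_wait_for(expr):
    """Illustrative parser for the wait_for convention: 'css:' prefixes a
    selector to wait for, 'js:' prefixes a JS predicate. Bare strings are
    commonly treated as CSS selectors."""
    if expr.startswith("css:"):
        return ("selector", expr[len("css:"):].strip())
    if expr.startswith("js:"):
        return ("predicate", expr[len("js:"):].strip())
    return ("selector", expr.strip())

print(classify_wait_for("css:.main-loaded"))
print(classify_wait_for("js:() => window.loaded === true"))
```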

### Helper Methods

The `clone()` method is particularly useful for creating variations of your crawler configuration:

```python
# Create a base configuration
base_config = CrawlerRunConfig(
    cache_mode=CacheMode.ENABLED,
    word_count_threshold=200,
    wait_until="networkidle"
)

# Create variations for different use cases
stream_config = base_config.clone(
    stream=True,  # Enable streaming mode
    cache_mode=CacheMode.BYPASS
)

debug_config = base_config.clone(
    page_timeout=120000,  # Longer timeout for debugging
    verbose=True
)
```

The `clone()` method:

- Creates a new instance with all the same settings
- Updates only the specified parameters
- Leaves the original configuration unchanged
- Perfect for creating variations without repeating all parameters
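The copy-with-overrides semantics described above can be sketched with a toy dataclass. `MiniConfig` is a hypothetical stand-in for `CrawlerRunConfig`, built only to show the `clone()` contract:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class MiniConfig:
    """Toy stand-in for CrawlerRunConfig, illustrating clone() semantics."""
    cache_mode: str = "ENABLED"
    verbose: bool = False
    page_timeout: int = 60000

    def clone(self, **overrides):
        # Same idea as CrawlerRunConfig.clone(): copy everything,
        # override only what's passed, leave the original untouched.
        return replace(self, **overrides)

base = MiniConfig()
debug = base.clone(verbose=True, page_timeout=120000)

print(base.verbose, base.page_timeout)    # original unchanged: False 60000
print(debug.verbose, debug.page_timeout)  # variation: True 120000
```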

## 3. LLMConfig Essentials

### Key fields to note

1. **`provider`**:

- Which LLM provider and model to use, as a `"provider/model"` string.
- Possible values include `"openai/gpt-4o-mini"` *(default)*, `"openai/gpt-4o"`, `"openai/o1-mini"`, `"openai/o1-preview"`, `"openai/o3-mini"`, `"openai/o3-mini-high"`, `"anthropic/claude-3-haiku-20240307"`, `"anthropic/claude-3-opus-20240229"`, `"anthropic/claude-3-sonnet-20240229"`, `"anthropic/claude-3-5-sonnet-20240620"`, `"gemini/gemini-pro"`, `"gemini/gemini-1.5-pro"`, `"gemini/gemini-2.0-flash"`, `"gemini/gemini-2.0-flash-exp"`, `"gemini/gemini-2.0-flash-lite-preview-02-05"`, `"groq/llama3-70b-8192"`, `"groq/llama3-8b-8192"`, `"ollama/llama3"`, and `"deepseek/deepseek-chat"`.
2. **`api_token`**:

- Optional. If not provided explicitly, it is read from an environment variable based on the provider. For example, if a Gemini model is passed as the provider, `"GEMINI_API_KEY"` is read from the environment.
- Pass the provider's API token directly, e.g. `api_token = "sk-..."` (placeholder shown; never commit real keys).
- Or reference an environment variable with the `"env:"` prefix, e.g. `api_token = "env:GROQ_API_KEY"`.
3. **`base_url`**:

- Custom endpoint URL, if your provider exposes one.

```python
llm_config = LLMConfig(provider="openai/gpt-4o-mini", api_token=os.getenv("OPENAI_API_KEY"))
```
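The `"env:"` prefix convention can be illustrated with a short helper. This is a hypothetical sketch of the documented behavior (the `resolve_api_token` function is not crawl4ai API, and the key value is a dummy set only for the demo):

```python
import os

def resolve_api_token(token):
    """Illustrative resolution of the 'env:' prefix convention: a token of
    the form 'env:VAR_NAME' is read from the environment; anything else
    is used verbatim."""
    if token and token.startswith("env:"):
        var = token[len("env:"):].strip()
        return os.environ.get(var)
    return token

os.environ["GROQ_API_KEY"] = "dummy-key-for-demo"
print(resolve_api_token("env:GROQ_API_KEY"))  # dummy-key-for-demo
print(resolve_api_token("literal-token"))     # literal-token
```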

## 4. Putting It All Together

In a typical scenario, you define **one** `BrowserConfig` for your crawler session, then create **one or more** `CrawlerRunConfig` & `LLMConfig` depending on each call's needs:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig, LLMContentFilter, DefaultMarkdownGenerator
from crawl4ai import JsonCssExtractionStrategy

async def main():
    # 1) Browser config: headless, bigger viewport, no proxy
    browser_conf = BrowserConfig(
        headless=True,
        viewport_width=1280,
        viewport_height=720
    )

    # 2) Example extraction strategy
    schema = {
        "name": "Articles",
        "baseSelector": "div.article",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }
    extraction = JsonCssExtractionStrategy(schema)

    # 3) Example LLM content filtering

    gemini_config = LLMConfig(
        provider="gemini/gemini-1.5-pro",
        api_token = "env:GEMINI_API_TOKEN"
    )

    # Initialize LLM filter with specific instruction
    filter = LLMContentFilter(
        llm_config=gemini_config,  # or your preferred provider
        instruction="""
        Focus on extracting the core educational content.
        Include:
        - Key concepts and explanations
        - Important code examples
        - Essential technical details
        Exclude:
        - Navigation elements
        - Sidebars
        - Footer content
        Format the output as clean markdown with proper code blocks and headers.
        """,
        chunk_token_threshold=500,  # Adjust based on your needs
        verbose=True
    )

    md_generator = DefaultMarkdownGenerator(
        content_filter=filter,
        options={"ignore_links": True}
    )

    # 4) Crawler run config: skip cache, use extraction
    run_conf = CrawlerRunConfig(
        markdown_generator=md_generator,
        extraction_strategy=extraction,
        cache_mode=CacheMode.BYPASS,
    )

    async with AsyncWebCrawler(config=browser_conf) as crawler:
        # 4) Execute the crawl
        result = await crawler.arun(url="https://example.com/news", config=run_conf)

        if result.success:
            print("Extracted content:", result.extracted_content)
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```

## 5. Next Steps

- [BrowserConfig, CrawlerRunConfig & LLMConfig Reference](../api/parameters.md)
- **Custom Hooks & Auth** (Inject JavaScript or handle login forms).
- **Session Management** (Re-use pages, preserve state across multiple calls).
- **Advanced Caching** (Fine-tune read/write cache modes).


## 1. **BrowserConfig** – Controlling the Browser

`BrowserConfig` focuses on **how** the browser is launched and behaves. This includes headless mode, proxies, user agents, and other environment tweaks.

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_cfg = BrowserConfig(
    browser_type="chromium",
    headless=True,
    viewport_width=1280,
    viewport_height=720,
    proxy="http://user:pass@proxy:8080",
    user_agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/116.0.0.0 Safari/537.36",
)
```

## 1.1 Parameter Highlights

| **Parameter**         | **Type / Default**                     | **What It Does**                                                                                                                     |
|-----------------------|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------|
| **`browser_type`**    | `"chromium"`, `"firefox"`, `"webkit"`<br/>*(default: `"chromium"`)* | Which browser engine to use. `"chromium"` is typical for many sites, `"firefox"` or `"webkit"` for specialized tests.                 |
| **`headless`**        | `bool` (default: `True`)               | Headless means no visible UI. `False` is handy for debugging.                                                                         |
| **`viewport_width`**  | `int` (default: `1080`)                | Initial page width (in px). Useful for testing responsive layouts.                                                                    |
| **`viewport_height`** | `int` (default: `600`)                 | Initial page height (in px).                                                                                                          |
| **`proxy`**           | `str` (deprecated)                      | Deprecated. Use `proxy_config` instead. If set, it will be auto-converted internally. |
| **`proxy_config`**    | `dict` (default: `None`)               | For advanced or multi-proxy needs, specify details like `{"server": "...", "username": "...", ...}`.                                  |
| **`use_persistent_context`** | `bool` (default: `False`)       | If `True`, uses a **persistent** browser context (keep cookies, sessions across runs). Also sets `use_managed_browser=True`.          |
| **`user_data_dir`**   | `str or None` (default: `None`)        | Directory to store user data (profiles, cookies). Must be set if you want permanent sessions.                                         |
| **`ignore_https_errors`** | `bool` (default: `True`)           | If `True`, continues despite invalid certificates (common in dev/staging).                                                            |
| **`java_script_enabled`** | `bool` (default: `True`)           | Disable if you want no JS overhead, or if only static content is needed.                                                              |
| **`cookies`**         | `list` (default: `[]`)                 | Pre-set cookies, each a dict like `{"name": "session", "value": "...", "url": "..."}`.                                                |
| **`headers`**         | `dict` (default: `{}`)                 | Extra HTTP headers for every request, e.g. `{"Accept-Language": "en-US"}`.                                                            |
| **`user_agent`**      | `str` (default: Chrome-based UA)       | Your custom or random user agent. `user_agent_mode="random"` can shuffle it.                                                          |
| **`light_mode`**      | `bool` (default: `False`)              | Disables some background features for performance gains.                                                                              |
| **`text_mode`**       | `bool` (default: `False`)              | If `True`, tries to disable images/other heavy content for speed.                                                                     |
| **`use_managed_browser`** | `bool` (default: `False`)          | For advanced “managed” interactions (debugging, CDP usage). Typically set automatically if persistent context is on.                  |
| **`extra_args`**      | `list` (default: `[]`)                 | Additional flags for the underlying browser process, e.g. `["--disable-extensions"]`.                                                |

- Set `headless=False` to visually **debug** how pages load or how interactions proceed.
- If you need **authentication** storage or repeated sessions, consider `use_persistent_context=True` and specify `user_data_dir`.
- For large pages, you might need a bigger `viewport_width` and `viewport_height` to handle dynamic content.
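
To make the persistent-session tip concrete, here is a minimal sketch (the profile directory is a hypothetical example path; any writable directory works):

```python
from crawl4ai import BrowserConfig

# Persistent profile: cookies and sessions survive across crawler runs.
persistent_cfg = BrowserConfig(
    headless=True,
    use_persistent_context=True,        # also sets use_managed_browser=True
    user_data_dir="./browser_profile",  # hypothetical path for stored profile data
    viewport_width=1280,
    viewport_height=720,
)
```

Pass this to `AsyncWebCrawler(config=persistent_cfg)`; later runs that reuse the same `user_data_dir` start with the saved cookies.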

## 2. **CrawlerRunConfig** – Controlling Each Crawl

While `BrowserConfig` sets up the **environment**, `CrawlerRunConfig` details **how** each **crawl operation** should behave: caching, content filtering, link or domain blocking, timeouts, JavaScript code, etc.

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

run_cfg = CrawlerRunConfig(
    wait_for="css:.main-content",
    word_count_threshold=15,
    excluded_tags=["nav", "footer"],
    exclude_external_links=True,
    stream=True,  # Enable streaming for arun_many()
)
```

## 2.1 Parameter Highlights

### A) **Content Processing**

| **Parameter**                | **Type / Default**                   | **What It Does**                                                                                |
|------------------------------|--------------------------------------|-------------------------------------------------------------------------------------------------|
| **`word_count_threshold`**   | `int` (default: ~200)                | Skips text blocks below X words. Helps ignore trivial sections.                                 |
| **`extraction_strategy`**    | `ExtractionStrategy` (default: None) | If set, extracts structured data (CSS-based, LLM-based, etc.).                                  |
| **`markdown_generator`**     | `MarkdownGenerationStrategy` (None)  | If you want specialized markdown output (citations, filtering, chunking, etc.). Can be customized with options such as `content_source` parameter to select the HTML input source ('cleaned_html', 'raw_html', or 'fit_html').                 |
| **`css_selector`**           | `str` (None)                         | Retains only the part of the page matching this selector. Affects the entire extraction process. |
| **`target_elements`**        | `List[str]` (None)                   | List of CSS selectors for elements to focus on for markdown generation and data extraction, while still processing the entire page for links, media, etc. Provides more flexibility than `css_selector`. |
| **`excluded_tags`**          | `list` (None)                        | Removes entire tags (e.g. `["script", "style"]`).                                               |
| **`excluded_selector`**      | `str` (None)                         | Like `css_selector` but to exclude. E.g. `"#ads, .tracker"`.                                    |
| **`only_text`**              | `bool` (False)                       | If `True`, tries to extract text-only content.                                                  |
| **`prettiify`**              | `bool` (False)                       | If `True`, beautifies final HTML (slower, purely cosmetic).                                      |
| **`keep_data_attributes`**   | `bool` (False)                       | If `True`, preserve `data-*` attributes in cleaned HTML.                                         |
| **`remove_forms`**           | `bool` (False)                       | If `True`, remove all `<form>` elements.                                                        |
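
To make the `css_selector` / `target_elements` distinction concrete, here is a hedged sketch (all selectors are illustrative):

```python
from crawl4ai import CrawlerRunConfig

# css_selector trims the page early: only this subtree feeds the whole pipeline
narrow_cfg = CrawlerRunConfig(css_selector="main.article")

# target_elements narrows markdown generation and extraction only;
# links and media are still collected from the full page
focused_cfg = CrawlerRunConfig(
    target_elements=["article.post", "div.comments"],
    excluded_selector="#ads, .tracker",
)
```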

### B) **Caching & Session**

| **Parameter**           | **Type / Default**     | **What It Does**                                                                                                              |
|-------------------------|------------------------|------------------------------------------------------------------------------------------------------------------------------|
| **`cache_mode`**        | `CacheMode or None`    | Controls how caching is handled (`ENABLED`, `BYPASS`, `DISABLED`, etc.). If `None`, typically defaults to `ENABLED`.          |
| **`session_id`**        | `str or None`          | Assign a unique ID to reuse a single browser session across multiple `arun()` calls.                                          |
| **`bypass_cache`**      | `bool` (False)         | If `True`, acts like `CacheMode.BYPASS`.                                                                                     |
| **`disable_cache`**     | `bool` (False)         | If `True`, acts like `CacheMode.DISABLED`.                                                                                   |
| **`no_cache_read`**     | `bool` (False)         | If `True`, acts like `CacheMode.WRITE_ONLY` (writes cache but never reads).                                                  |
| **`no_cache_write`**    | `bool` (False)         | If `True`, acts like `CacheMode.READ_ONLY` (reads cache but never writes).                                                   |
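
The boolean shortcuts map onto `CacheMode` values; a quick sketch:

```python
from crawl4ai import CrawlerRunConfig, CacheMode

# Always fetch fresh but still record results (two equivalent spellings)
write_only_cfg = CrawlerRunConfig(cache_mode=CacheMode.WRITE_ONLY)
also_write_only = CrawlerRunConfig(no_cache_read=True)

# Skip the cache entirely for this run
bypass_cfg = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
```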

### C) **Page Navigation & Timing**

| **Parameter**              | **Type / Default**      | **What It Does**                                                                                                    |
|----------------------------|-------------------------|----------------------------------------------------------------------------------------------------------------------|
| **`wait_until`**           | `str` (domcontentloaded)| Condition for navigation to “complete”. Often `"networkidle"` or `"domcontentloaded"`.                               |
| **`page_timeout`**         | `int` (60000 ms)        | Timeout for page navigation or JS steps. Increase for slow sites.                                                    |
| **`wait_for`**             | `str or None`           | Wait for a CSS (`"css:selector"`) or JS (`"js:() => bool"`) condition before content extraction.                     |
| **`wait_for_images`**      | `bool` (False)          | Wait for images to load before finishing. Slows down if you only want text.                                          |
| **`delay_before_return_html`** | `float` (0.1)       | Additional pause (seconds) before final HTML is captured. Good for last-second updates.                               |
| **`check_robots_txt`**     | `bool` (False)          | Whether to check and respect robots.txt rules before crawling. If True, caches robots.txt for efficiency.            |
| **`mean_delay`** and **`max_range`** | `float` (0.1, 0.3) | If you call `arun_many()`, these define random delay intervals between crawls, helping avoid detection or rate limits. |
| **`semaphore_count`**      | `int` (5)               | Max concurrency for `arun_many()`. Increase if you have resources for parallel crawls.                                |

### D) **Page Interaction**

| **Parameter**              | **Type / Default**            | **What It Does**                                                                                                                       |
|----------------------------|--------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------|
| **`js_code`**              | `str or list[str]` (None)      | JavaScript to run after load. E.g. `"document.querySelector('button')?.click();"`.                                                     |
| **`js_only`**              | `bool` (False)                 | If `True`, indicates we’re reusing an existing session and only applying JS. No full reload.                                           |
| **`ignore_body_visibility`** | `bool` (True)                | Skip checking if `<body>` is visible. Usually best to keep `True`.                                                                     |
| **`scan_full_page`**       | `bool` (False)                 | If `True`, auto-scroll the page to load dynamic content (infinite scroll).                                                              |
| **`scroll_delay`**         | `float` (0.2)                  | Delay between scroll steps if `scan_full_page=True`.                                                                                   |
| **`process_iframes`**      | `bool` (False)                 | Inlines iframe content for single-page extraction.                                                                                     |
| **`remove_overlay_elements`** | `bool` (False)              | Removes potential modals/popups blocking the main content.                                                                              |
| **`simulate_user`**        | `bool` (False)                 | Simulate user interactions (mouse movements) to avoid bot detection.                                                                    |
| **`override_navigator`**   | `bool` (False)                 | Override `navigator` properties in JS for stealth.                                                                                      |
| **`magic`**                | `bool` (False)                 | Automatic handling of popups/consent banners. Experimental.                                                                             |
| **`adjust_viewport_to_content`** | `bool` (False)           | Resizes viewport to match page content height.                                                                                          |

If your page is a single-page app with repeated JS updates, set `js_only=True` on subsequent calls, plus a `session_id` to reuse the same tab.
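
That SPA pattern amounts to two run configs sharing one `session_id` (the selector and JS snippet are illustrative):

```python
from crawl4ai import CrawlerRunConfig

# First call: full page load, establish a named session
first_cfg = CrawlerRunConfig(
    session_id="spa_session",
    wait_for="css:.content-loaded",  # hypothetical readiness selector
)

# Follow-up calls: reuse the same tab, run JS only, no reload
next_cfg = CrawlerRunConfig(
    session_id="spa_session",
    js_only=True,
    js_code="document.querySelector('.load-more')?.click();",
)
```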

### E) **Media Handling**

| **Parameter**                              | **Type / Default**  | **What It Does**                                                                                         |
|--------------------------------------------|---------------------|-----------------------------------------------------------------------------------------------------------|
| **`screenshot`**                           | `bool` (False)      | Capture a screenshot (base64) in `result.screenshot`.                                                     |
| **`screenshot_wait_for`**                  | `float or None`     | Extra wait time before the screenshot.                                                                    |
| **`screenshot_height_threshold`**          | `int` (~20000)      | If the page is taller than this, alternate screenshot strategies are used.                                |
| **`pdf`**                                  | `bool` (False)      | If `True`, returns a PDF in `result.pdf`.                                                                 |
| **`capture_mhtml`**                        | `bool` (False)      | If `True`, captures an MHTML snapshot of the page in `result.mhtml`. MHTML includes all page resources (CSS, images, etc.) in a single file. |
| **`image_description_min_word_threshold`** | `int` (~50)         | Minimum words for an image’s alt text or description to be considered valid.                              |
| **`image_score_threshold`**                | `int` (~3)          | Filter out low-scoring images. The crawler scores images by relevance (size, context, etc.).              |
| **`exclude_external_images`**              | `bool` (False)      | Exclude images from other domains.                                                                        |
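
A sketch of capturing media artifacts (the decoding step assumes `result.screenshot` holds base64, as noted above):

```python
import base64

from crawl4ai import CrawlerRunConfig

media_cfg = CrawlerRunConfig(
    screenshot=True,      # base64 image in result.screenshot
    pdf=True,             # PDF bytes in result.pdf
    capture_mhtml=True,   # single-file snapshot in result.mhtml
)

# After a successful crawl:
# with open("page.png", "wb") as f:
#     f.write(base64.b64decode(result.screenshot))
```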

### F) **Link/Domain Handling**

| **Parameter**                | **Type / Default**      | **What It Does**                                                                                                             |
|------------------------------|-------------------------|-----------------------------------------------------------------------------------------------------------------------------|
| **`exclude_social_media_domains`** | `list` (e.g. Facebook/Twitter) | A default list can be extended. Any link to these domains is removed from final output.                                      |
| **`exclude_external_links`** | `bool` (False)          | Removes all links pointing outside the current domain.                                                                      |
| **`exclude_social_media_links`** | `bool` (False)      | Strips links specifically to social sites (like Facebook or Twitter).                                                      |
| **`exclude_domains`**        | `list` ([])             | Provide a custom list of domains to exclude (like `["ads.com", "trackers.io"]`).                                            |
| **`preserve_https_for_internal_links`** | `bool` (False) | If `True`, preserves HTTPS scheme for internal links even when the server redirects to HTTP. Useful for security-conscious crawling. |

### G) **Debug & Logging**

| **Parameter**  | **Type / Default** | **What It Does**                                                         |
|----------------|--------------------|---------------------------------------------------------------------------|
| **`verbose`**  | `bool` (True)     | Prints logs detailing each step of crawling, interactions, or errors.    |
| **`log_console`** | `bool` (False) | Logs the page’s JavaScript console output if you want deeper JS debugging.|

### H) **Virtual Scroll Configuration**

| **Parameter**                | **Type / Default**           | **What It Does**                                                                                                                    |
|------------------------------|------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|
| **`virtual_scroll_config`**  | `VirtualScrollConfig or dict` (None) | Configuration for handling virtualized scrolling on sites like Twitter/Instagram where content is replaced rather than appended. |

When sites use virtual scrolling (content is replaced as you scroll), use `VirtualScrollConfig`:

```python
from crawl4ai import VirtualScrollConfig

virtual_config = VirtualScrollConfig(
    container_selector="#timeline",    # CSS selector for scrollable container
    scroll_count=30,                   # Number of times to scroll
    scroll_by="container_height",      # How much to scroll: "container_height", "page_height", or pixels (e.g. 500)
    wait_after_scroll=0.5             # Seconds to wait after each scroll for content to load
)

config = CrawlerRunConfig(
    virtual_scroll_config=virtual_config
)
```

**VirtualScrollConfig Parameters:**
| **Parameter**          | **Type / Default**        | **What It Does**                                                                          |
|------------------------|---------------------------|-------------------------------------------------------------------------------------------|
| **`container_selector`** | `str` (required)        | CSS selector for the scrollable container (e.g., `"#feed"`, `".timeline"`)              |
| **`scroll_count`**     | `int` (10)               | Maximum number of scrolls to perform                                                      |
| **`scroll_by`**        | `str or int` ("container_height") | Scroll amount: `"container_height"`, `"page_height"`, or pixels (e.g., `500`)   |
| **`wait_after_scroll`** | `float` (0.5)           | Time in seconds to wait after each scroll for new content to load                        |

- Use `virtual_scroll_config` when content is **replaced** during scroll (Twitter, Instagram)
- Use `scan_full_page` when content is **appended** during scroll (traditional infinite scroll)

### I) **URL Matching Configuration**

| **Parameter**          | **Type / Default**           | **What It Does**                                                                                                                    |
|------------------------|------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|
| **`url_matcher`**      | `UrlMatcher` (None)          | Pattern(s) to match URLs against. Can be: string (glob), function, or list of mixed types. **None means match ALL URLs**         |
| **`match_mode`**       | `MatchMode` (MatchMode.OR)   | How to combine multiple matchers in a list: `MatchMode.OR` (any match) or `MatchMode.AND` (all must match)                       |

The `url_matcher` parameter enables URL-specific configurations when used with `arun_many()`:

```python
from crawl4ai import CrawlerRunConfig, MatchMode
from crawl4ai.processors.pdf import PDFContentScrapingStrategy
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Simple string pattern (glob-style)
pdf_config = CrawlerRunConfig(
    url_matcher="*.pdf",
    scraping_strategy=PDFContentScrapingStrategy()
)

# Multiple patterns with OR logic (default)
blog_config = CrawlerRunConfig(
    url_matcher=["*/blog/*", "*/article/*", "*/news/*"],
    match_mode=MatchMode.OR  # Any pattern matches
)

# Function matcher
api_config = CrawlerRunConfig(
    url_matcher=lambda url: 'api' in url or url.endswith('.json'),
    # Other settings like extraction_strategy
)

# Mixed: String + Function with AND logic
complex_config = CrawlerRunConfig(
    url_matcher=[
        lambda url: url.startswith('https://'),  # Must be HTTPS
        "*.org/*",                               # Must be .org domain
        lambda url: 'docs' in url                # Must contain 'docs'
    ],
    match_mode=MatchMode.AND  # ALL conditions must match
)

# Combined patterns and functions with AND logic
secure_docs = CrawlerRunConfig(
    url_matcher=["https://*", lambda url: '.doc' in url],
    match_mode=MatchMode.AND  # Must be HTTPS AND contain .doc
)

# Default config - matches ALL URLs
default_config = CrawlerRunConfig()  # No url_matcher = matches everything
```

**UrlMatcher Types:**

- **None (default)**: When `url_matcher` is None or not set, the config matches ALL URLs
- **String patterns**: Glob-style patterns like `"*.pdf"`, `"*/api/*"`, `"https://*.example.com/*"`
- **Functions**: `lambda url: bool` - Custom logic for complex matching
- **Lists**: Mix strings and functions, combined with `MatchMode.OR` or `MatchMode.AND`

**Important Behavior:**

- When passing a list of configs to `arun_many()`, URLs are matched against each config's `url_matcher` in order. First match wins!
- If no config matches a URL and there's no default config (one without `url_matcher`), the URL will fail with "No matching configuration found"
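
The glob semantics above can be illustrated with Python's stdlib `fnmatch`. This mimics the documented OR/AND behavior for intuition only; crawl4ai's internal matcher may differ in edge cases:

```python
from fnmatch import fnmatch

def matches(url, matchers, require_all=False):
    """Evaluate glob strings and/or predicate functions against a URL."""
    results = [
        m(url) if callable(m) else fnmatch(url, m)
        for m in matchers
    ]
    return all(results) if require_all else any(results)

# OR mode (default): any pattern may match
blog_matchers = ["*/blog/*", "*/article/*"]
print(matches("https://site.com/blog/post-1", blog_matchers))  # True

# AND mode: every condition must hold
strict = [lambda u: u.startswith("https://"), "*.org/*"]
print(matches("https://example.org/docs", strict, require_all=True))  # True
```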

## 2.2 Cloning Configurations

Both `BrowserConfig` and `CrawlerRunConfig` provide a `clone()` method to create modified copies:

```python
# Create a base configuration
base_config = CrawlerRunConfig(
    cache_mode=CacheMode.ENABLED,
    word_count_threshold=200
)

# Create variations using clone()
stream_config = base_config.clone(stream=True)
no_cache_config = base_config.clone(
    cache_mode=CacheMode.BYPASS,
    stream=True
)
```

The `clone()` method is particularly useful when you need slightly different configurations for different use cases, without modifying the original config.

## 2.3 Example Usage

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    # Configure the browser
    browser_cfg = BrowserConfig(
        headless=False,
        viewport_width=1280,
        viewport_height=720,
        proxy="http://user:pass@myproxy:8080",
        text_mode=True
    )

    # Configure the run
    run_cfg = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        session_id="my_session",
        css_selector="main.article",
        excluded_tags=["script", "style"],
        exclude_external_links=True,
        wait_for="css:.article-loaded",
        screenshot=True,
        stream=True
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url="https://example.com/news",
            config=run_cfg
        )
        if result.success:
            print("Final cleaned_html length:", len(result.cleaned_html))
            if result.screenshot:
                print("Screenshot captured (base64, length):", len(result.screenshot))
        else:
            print("Crawl failed:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```

## 2.4 Compliance & Ethics

| **Parameter**          | **Type / Default**      | **What It Does**                                                                                                    |
|-----------------------|-------------------------|----------------------------------------------------------------------------------------------------------------------|
| **`check_robots_txt`**| `bool` (False)          | When True, checks and respects robots.txt rules before crawling. Uses efficient caching with SQLite backend.          |
| **`user_agent`**      | `str` (None)            | User agent string to identify your crawler. Used for robots.txt checking when enabled.                                |

```python
run_config = CrawlerRunConfig(
    check_robots_txt=True,  # Enable robots.txt compliance
    user_agent="MyBot/1.0"  # Identify your crawler
)
```

## 3. **LLMConfig** - Setting up LLM providers

`LLMConfig` centralizes LLM provider settings (provider name, API token, optional custom endpoint) so one configuration can be reused across:

1. `LLMExtractionStrategy`
2. `LLMContentFilter`
3. `JsonCssExtractionStrategy.generate_schema`
4. `JsonXPathExtractionStrategy.generate_schema`

## 3.1 Parameters

| **Parameter**   | **Type / Default**                            | **What It Does**                                                                                                                                                                                                                                                              |
|-----------------|-----------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **`provider`**  | `str`<br/>*(default: `"openai/gpt-4o-mini"`)* | Which LLM provider/model to use, e.g. `"ollama/llama3"`, `"groq/llama3-70b-8192"`, `"openai/gpt-4o"`, `"anthropic/claude-3-5-sonnet-20240620"`, `"gemini/gemini-1.5-pro"`, `"gemini/gemini-2.0-flash"`, `"deepseek/deepseek-chat"`.                                             |
| **`api_token`** | `str` (optional)                              | API token for the provider. If omitted, it is read from the provider's environment variable (e.g. `GEMINI_API_KEY` for Gemini models). Pass a token directly (`api_token="<your-key>"`) or reference an environment variable with the `env:` prefix (`api_token="env:GROQ_API_KEY"`). |
| **`base_url`**  | `str` (optional)                              | Custom API endpoint, if your provider uses one.                                                                                                                                                                                                                                  |

## 3.2 Example Usage

```python
import os

from crawl4ai import LLMConfig

llm_config = LLMConfig(provider="openai/gpt-4o-mini", api_token=os.getenv("OPENAI_API_KEY"))
```

## 4. Putting It All Together

- **Use** `BrowserConfig` for **global** browser settings: engine, headless, proxy, user agent.
- **Use** `CrawlerRunConfig` for each crawl’s **context**: how to filter content, handle caching, wait for dynamic elements, or run JS.
- **Pass** both configs to `AsyncWebCrawler` (the `BrowserConfig`) and then to `arun()` (the `CrawlerRunConfig`).
- **Use** `LLMConfig` for LLM provider configurations that can be used across all extraction, filtering, schema generation tasks. Can be used in - `LLMExtractionStrategy`, `LLMContentFilter`, `JsonCssExtractionStrategy.generate_schema` & `JsonXPathExtractionStrategy.generate_schema`

```python
# Create a modified copy with the clone() method
stream_cfg = run_cfg.clone(
    stream=True,
    cache_mode=CacheMode.BYPASS
)
```

## Crawling Patterns

## Simple Crawling

## Basic Usage

Set up a simple crawl using `BrowserConfig` and `CrawlerRunConfig`:

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

async def main():
    browser_config = BrowserConfig()  # Default browser configuration
    run_config = CrawlerRunConfig()   # Default crawl run configuration

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_config
        )
        print(result.markdown)  # Print clean markdown content

if __name__ == "__main__":
    asyncio.run(main())
```

## Understanding the Response

The `arun()` method returns a `CrawlResult` object with several useful properties. Here's a quick overview (see [CrawlResult](../api/crawl-result.md) for complete details):

```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import PruningContentFilter

config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.6),
        options={"ignore_links": True}
    )
)

result = await crawler.arun(
    url="https://example.com",
    config=config
)

# Different content formats
print(result.html)         # Raw HTML
print(result.cleaned_html) # Cleaned HTML
print(result.markdown.raw_markdown) # Raw markdown from cleaned html
print(result.markdown.fit_markdown) # Most relevant content in markdown

# Check success status
print(result.success)      # True if crawl succeeded
print(result.status_code)  # HTTP status code (e.g., 200, 404)

# Access extracted media and links
print(result.media)        # Dictionary of found media (images, videos, audio)
print(result.links)        # Dictionary of internal and external links
```

## Adding Basic Options

Customize your crawl using `CrawlerRunConfig`:

```python
run_config = CrawlerRunConfig(
    word_count_threshold=10,        # Minimum words per content block
    exclude_external_links=True,    # Remove external links
    remove_overlay_elements=True,   # Remove popups/modals
    process_iframes=True           # Process iframe content
)

result = await crawler.arun(
    url="https://example.com",
    config=run_config
)
```

## Handling Errors

```python
run_config = CrawlerRunConfig()
result = await crawler.arun(url="https://example.com", config=run_config)

if not result.success:
    print(f"Crawl failed: {result.error_message}")
    print(f"Status code: {result.status_code}")
```
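
Beyond a one-off check, transient failures are often worth retrying. A generic, hedged sketch — the `fetch` callable stands in for something like `lambda u: crawler.arun(u, config=run_config)`; names are illustrative:

```python
import asyncio

async def crawl_with_retries(fetch, url, attempts=3, base_delay=1.0):
    """Call an async fetch(url) until result.success, with exponential backoff."""
    result = None
    for attempt in range(attempts):
        result = await fetch(url)
        if result.success:
            return result
        if attempt < attempts - 1:
            await asyncio.sleep(base_delay * (2 ** attempt))
    return result  # last failed result; inspect .error_message / .status_code
```

With `base_delay=1.0` the waits between attempts are 1s, 2s, 4s, and so on.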

## Logging and Debugging

Enable verbose logging in `BrowserConfig`:

```python
browser_config = BrowserConfig(verbose=True)

async with AsyncWebCrawler(config=browser_config) as crawler:
    run_config = CrawlerRunConfig()
    result = await crawler.arun(url="https://example.com", config=run_config)
```

## Complete Example

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    browser_config = BrowserConfig(verbose=True)
    run_config = CrawlerRunConfig(
        # Content filtering
        word_count_threshold=10,
        excluded_tags=['form', 'header'],
        exclude_external_links=True,

        # Content processing
        process_iframes=True,
        remove_overlay_elements=True,

        # Cache control
        cache_mode=CacheMode.ENABLED  # Use cache if available
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_config
        )

        if result.success:
            # Print clean content
            print("Content:", result.markdown[:500])  # First 500 chars

            # Process images
            for image in result.media["images"]:
                print(f"Found image: {image['src']}")

            # Process links
            for link in result.links["internal"]:
                print(f"Internal link: {link['href']}")

        else:
            print(f"Crawl failed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```

## Content Processing

## Markdown Generation Basics

This section covers:

1. How to configure the **Default Markdown Generator**
2. The difference between raw markdown (`result.markdown`) and filtered markdown (`fit_markdown`)

Prerequisite: you know how to configure `CrawlerRunConfig`.

## 1. Quick Example

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    config = CrawlerRunConfig(
        markdown_generator=DefaultMarkdownGenerator()
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)

        if result.success:
            

... (truncated)
```

### scripts/extraction_pipeline.py

```python
#!/usr/bin/env python3
"""
Crawl4AI extraction pipeline - Three approaches:
1. Generate schema with LLM (one-time) then use CSS extraction (most efficient)
2. Manual CSS/JSON schema extraction
3. Direct LLM extraction (for complex/irregular content)

Usage examples:
  Generate schema: python extraction_pipeline.py --generate-schema <url> "<instruction>"
  Use generated schema: python extraction_pipeline.py --use-schema <url> schema.json
  Manual CSS: python extraction_pipeline.py --css <url> "<css_selector>"
  Direct LLM: python extraction_pipeline.py --llm <url> "<instruction>"
"""

import asyncio
import sys
import json
from pathlib import Path

# Version check
MIN_CRAWL4AI_VERSION = "0.7.4"
try:
    from crawl4ai.__version__ import __version__
    from packaging import version
    if version.parse(__version__) < version.parse(MIN_CRAWL4AI_VERSION):
        print(f"⚠️  Warning: Crawl4AI {MIN_CRAWL4AI_VERSION}+ recommended (you have {__version__})")
except ImportError:
    print(f"ℹ️  Crawl4AI {MIN_CRAWL4AI_VERSION}+ required")

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.extraction_strategy import (
    LLMExtractionStrategy,
    JsonCssExtractionStrategy,
)

# =============================================================================
# APPROACH 1: Generate Schema (Most Efficient for Repetitive Patterns)
# =============================================================================

async def generate_schema(url: str, instruction: str, output_file: str = "generated_schema.json"):
    """
    Step 1: Generate a reusable schema using LLM (one-time cost)
    Best for: E-commerce sites, blogs, news sites with repetitive patterns
    """
    print("🔍 Generating extraction schema using LLM...")

    browser_config = BrowserConfig(headless=True)

    # Use LLM to analyze the page structure and generate schema
    extraction_strategy = LLMExtractionStrategy(
        provider="openai/gpt-4o-mini",  # Can use any LLM provider
        instruction=f"""
        Analyze this webpage and generate a CSS/JSON extraction schema.
        Task: {instruction}

        Return a JSON schema with CSS selectors that can extract the required data.
        Format:
        {{
            "name": "items",
            "baseSelector": "main_container_selector",
            "fields": [
                {{"name": "field1", "selector": "css_selector", "type": "text"}},
                {{"name": "field2", "selector": "css_selector", "type": "attribute", "attribute": "href"}},
                // more fields...
            ]
        }}

        Make selectors as specific as possible to avoid false matches.
        """
    )

    crawler_config = CrawlerRunConfig(
        extraction_strategy=extraction_strategy,
        wait_for="css:body",
        remove_overlay_elements=True
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url=url, config=crawler_config)

        if result.success and result.extracted_content:
            try:
                # Parse and save the generated schema
                schema = json.loads(result.extracted_content)

                # Validate and enhance schema
                if "name" not in schema:
                    schema["name"] = "items"
                if "fields" not in schema:
                    print("⚠️ Generated schema missing fields, using fallback")
                    schema = {
                        "name": "items",
                        "baseSelector": "div.item, article, .product",
                        "fields": [
                            {"name": "title", "selector": "h1, h2, h3", "type": "text"},
                            {"name": "description", "selector": "p", "type": "text"},
                            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
                        ]
                    }

                # Save schema
                with open(output_file, "w") as f:
                    json.dump(schema, f, indent=2)

                print(f"✅ Schema generated and saved to: {output_file}")
                print(f"📋 Schema structure:")
                print(json.dumps(schema, indent=2))

                return schema

            except json.JSONDecodeError as e:
                print(f"❌ Failed to parse generated schema: {e}")
                print("Raw output:", result.extracted_content[:500])
                return None
        else:
            print(f"❌ Failed to generate schema: {result.error_message if result else 'Unknown error'}")
            return None

async def use_generated_schema(url: str, schema_file: str):
    """
    Step 2: Use the generated schema for fast, repeated extractions
    No LLM calls needed - pure CSS extraction
    """
    print(f"📂 Loading schema from: {schema_file}")

    try:
        with open(schema_file, "r") as f:
            schema = json.load(f)
    except FileNotFoundError:
        print(f"❌ Schema file not found: {schema_file}")
        print("💡 Generate a schema first using: python extraction_pipeline.py --generate-schema <url> \"<instruction>\"")
        return None

    print("🚀 Extracting data using generated schema (no LLM calls)...")

    extraction_strategy = JsonCssExtractionStrategy(
        schema=schema,
        verbose=True
    )

    crawler_config = CrawlerRunConfig(
        extraction_strategy=extraction_strategy,
        wait_for="css:body"
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=crawler_config)

        if result.success and result.extracted_content:
            data = json.loads(result.extracted_content)
            items = data.get(schema.get("name", "items"), [])

            print(f"✅ Extracted {len(items)} items using schema")

            # Save results
            with open("extracted_data.json", "w") as f:
                json.dump(data, f, indent=2)
            print("💾 Saved to extracted_data.json")

            # Show sample
            if items:
                print("\n📋 Sample (first item):")
                print(json.dumps(items[0], indent=2))

            return data
        else:
            print(f"❌ Extraction failed: {result.error_message if result else 'Unknown error'}")
            return None

# =============================================================================
# APPROACH 2: Manual Schema Definition
# =============================================================================

async def extract_with_manual_schema(url: str, schema: dict = None):
    """
    Use a manually defined CSS/JSON schema
    Best for: When you know the exact structure of the website
    """

    if not schema:
        # Example schema for general content extraction
        schema = {
            "name": "content",
            "baseSelector": "body",  # JsonCssExtractionStrategy expects 'baseSelector'
            "fields": [
                {"name": "title", "selector": "h1", "type": "text"},
                {"name": "paragraphs", "selector": "p", "type": "text", "all": True},
                {"name": "links", "selector": "a", "type": "attribute", "attribute": "href", "all": True}
            ]
        }

    print("📐 Using manual CSS/JSON schema for extraction...")

    extraction_strategy = JsonCssExtractionStrategy(
        schema=schema,
        verbose=True
    )

    crawler_config = CrawlerRunConfig(
        extraction_strategy=extraction_strategy
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=crawler_config)

        if result.success and result.extracted_content:
            data = json.loads(result.extracted_content)
            # Handle both list and dict formats
            if isinstance(data, list):
                items = data
            else:
                items = data.get(schema["name"], [])

            print(f"✅ Extracted {len(items)} items using manual schema")

            with open("manual_extracted.json", "w") as f:
                json.dump(data, f, indent=2)
            print("💾 Saved to manual_extracted.json")

            return data
        else:
            print(f"❌ Extraction failed")
            return None

# =============================================================================
# APPROACH 3: Direct LLM Extraction
# =============================================================================

async def extract_with_llm(url: str, instruction: str):
    """
    Direct LLM extraction - uses LLM for every request
    Best for: Complex, irregular content or one-time extractions
    Note: Most expensive approach, use sparingly
    """
    print("🤖 Using direct LLM extraction...")

    browser_config = BrowserConfig(headless=True)

    extraction_strategy = LLMExtractionStrategy(
        provider="openai/gpt-4o-mini",  # Can change to ollama/llama3, anthropic/claude, etc.
        instruction=instruction,
        schema={
            "type": "object",
            "properties": {
                "items": {
                    "type": "array",
                    "items": {"type": "object"}
                },
                "summary": {"type": "string"}
            }
        }
    )

    crawler_config = CrawlerRunConfig(
        extraction_strategy=extraction_strategy,
        wait_for="css:body",
        remove_overlay_elements=True
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url=url, config=crawler_config)

        if result.success and result.extracted_content:
            try:
                data = json.loads(result.extracted_content)
                # LLM output may be a JSON list of blocks or a dict
                if isinstance(data, list):
                    items, summary = data, "N/A"
                else:
                    items = data.get('items', [])
                    summary = data.get('summary', 'N/A')

                print(f"✅ LLM extracted {len(items)} items")
                print(f"📝 Summary: {summary}")

                with open("llm_extracted.json", "w") as f:
                    json.dump(data, f, indent=2)
                print("💾 Saved to llm_extracted.json")

                if items:
                    print("\n📋 Sample (first item):")
                    print(json.dumps(items[0], indent=2))

                return data
            except json.JSONDecodeError:
                print("⚠️ Could not parse LLM output as JSON")
                print(result.extracted_content[:500])
                return None
        else:
            print(f"❌ LLM extraction failed")
            return None

# =============================================================================
# Main CLI Interface
# =============================================================================

async def main():
    if len(sys.argv) < 3:
        print("""
Crawl4AI Extraction Pipeline - Three Approaches

1️⃣  GENERATE & USE SCHEMA (Most Efficient for Repetitive Patterns):
    Step 1: Generate schema (one-time LLM cost)
    python extraction_pipeline.py --generate-schema <url> "<what to extract>"

    Step 2: Use schema for fast extraction (no LLM)
    python extraction_pipeline.py --use-schema <url> generated_schema.json

2️⃣  MANUAL SCHEMA (When You Know the Structure):
    python extraction_pipeline.py --manual <url>
    (Edit the schema in the script for your needs)

3️⃣  DIRECT LLM (For Complex/Irregular Content):
    python extraction_pipeline.py --llm <url> "<extraction instruction>"

Examples:
    # E-commerce products
    python extraction_pipeline.py --generate-schema https://shop.com "Extract all products with name, price, image"
    python extraction_pipeline.py --use-schema https://shop.com generated_schema.json

    # News articles
    python extraction_pipeline.py --generate-schema https://news.com "Extract headlines, dates, and summaries"

    # Complex content
    python extraction_pipeline.py --llm https://complex-site.com "Extract financial data and quarterly reports"
""")
        sys.exit(1)

    mode = sys.argv[1]
    url = sys.argv[2]

    if mode == "--generate-schema":
        if len(sys.argv) < 4:
            print("Error: Missing extraction instruction")
            print("Usage: python extraction_pipeline.py --generate-schema <url> \"<instruction>\"")
            sys.exit(1)
        instruction = sys.argv[3]
        output_file = sys.argv[4] if len(sys.argv) > 4 else "generated_schema.json"
        await generate_schema(url, instruction, output_file)

    elif mode == "--use-schema":
        if len(sys.argv) < 4:
            print("Error: Missing schema file")
            print("Usage: python extraction_pipeline.py --use-schema <url> <schema.json>")
            sys.exit(1)
        schema_file = sys.argv[3]
        await use_generated_schema(url, schema_file)

    elif mode == "--manual":
        await extract_with_manual_schema(url)

    elif mode == "--llm":
        if len(sys.argv) < 4:
            print("Error: Missing extraction instruction")
            print("Usage: python extraction_pipeline.py --llm <url> \"<instruction>\"")
            sys.exit(1)
        instruction = sys.argv[3]
        await extract_with_llm(url, instruction)

    else:
        print(f"Unknown mode: {mode}")
        print("Use --generate-schema, --use-schema, --manual, or --llm")
        sys.exit(1)

if __name__ == "__main__":
    asyncio.run(main())

```
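The validate-and-fallback step inside `generate_schema` can be factored into a small pure function that is easy to unit-test. A stdlib-only sketch; the name `normalize_schema` is illustrative and not part of crawl4ai:

```python
def normalize_schema(schema, default_name="items"):
    """Ensure a generated schema has the keys JsonCssExtractionStrategy expects.

    Fills in a default name, promotes a legacy 'selector' key to
    'baseSelector', and raises if no fields are present.
    """
    if not isinstance(schema, dict):
        raise ValueError("schema must be a dict")
    schema.setdefault("name", default_name)
    if "baseSelector" not in schema and "selector" in schema:
        # LLMs sometimes emit 'selector' for the container element
        schema["baseSelector"] = schema.pop("selector")
    if not schema.get("fields"):
        raise ValueError("schema has no fields; regenerate or use a fallback")
    return schema
```

Running this before `json.dump` keeps obviously unusable schemas out of the saved file, so the fast `--use-schema` path fails early instead of silently extracting nothing.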

### scripts/google_search.py

```python
#!/usr/bin/env python3
"""
Google Search Scraper using Crawl4AI
Usage: python google_search.py "<search query>" [max_results]

Example: python google_search.py "2026年Go语言展望" 20
"""

import asyncio
import sys
import json
import urllib.parse
from typing import List, Dict

MIN_CRAWL4AI_VERSION = "0.7.4"
try:
    from crawl4ai.__version__ import __version__
    from packaging import version
    if version.parse(__version__) < version.parse(MIN_CRAWL4AI_VERSION):
        print(f"⚠️  Warning: Crawl4AI {MIN_CRAWL4AI_VERSION}+ recommended (you have {__version__})")
except ImportError:
    print(f"ℹ️  Crawl4AI {MIN_CRAWL4AI_VERSION}+ required")

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy, LLMExtractionStrategy


async def search_google_css(query: str, max_results: int = 20) -> List[Dict]:
    """
    Extract Google search results using a CSS selector strategy (fastest, no LLM needed)
    """

    # Build the search URL
    encoded_query = urllib.parse.quote(query)
    search_url = f"https://www.google.com/search?q={encoded_query}&num={max_results}"

    print(f"🔍 Searching: {query}")
    print(f"📊 Max results: {max_results}")
    print(f"🌐 URL: {search_url}")

    # Define a CSS schema for the Google search results
    # Google's HTML structure changes over time; these are commonly used selectors
    schema = {
        "name": "search_results",
        "baseSelector": "div.g, div[data-hveid], div.tF2Cxc, div.yuRUbf",
        "fields": [
            {
                "name": "title",
                "selector": "h3, h3.LC20lb, div[role='heading']",
                "type": "text"
            },
            {
                "name": "link",
                "selector": "a",
                "type": "attribute",
                "attribute": "href"
            },
            {
                "name": "description",
                "selector": "div.VwiC3b, div.s, div.ITZIwc, span.aCOpRe",
                "type": "text"
            },
            {
                "name": "site_name",
                "selector": "div.NJo7tc, span.VuuXrf, cite",
                "type": "text"
            }
        ]
    }

    browser_config = BrowserConfig(
        headless=True,
        viewport_width=1920,
        viewport_height=1080,
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    )

    crawler_config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(schema=schema, verbose=True),
        cache_mode=CacheMode.BYPASS,
        wait_for="css:div.g, div.search, body",
        page_timeout=30000,
        js_code=[
            # Wait for the page to finish loading
            "const waitFor = (ms) => new Promise(resolve => setTimeout(resolve, ms));",
            "await waitFor(2000);"
        ]
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url=search_url, config=crawler_config)

        if result.success:
            print("✅ Successfully fetched search results")

            if result.extracted_content:
                try:
                    data = json.loads(result.extracted_content)
                    # Handle both list and dict formats
                    if isinstance(data, list):
                        results = data
                    else:
                        results = data.get("search_results", data.get("results", []))

                    # Filter out empty and invalid results
                    seen = set()
                    valid_results = []
                    for r in results:
                        if r.get("title") and r.get("link"):
                            # Clean the URL (Google sometimes prefixes it with /url?q=)
                            link = r["link"]
                            if link.startswith("/url?q="):
                                from urllib.parse import urlparse, parse_qs
                                parsed = urlparse(link)
                                link = parse_qs(parsed.query).get("q", [link])[0]
                            r["link"] = link

                            # Deduplicate using the URL as a unique key
                            if link not in seen:
                                seen.add(link)
                                valid_results.append(r)

                    print(f"📋 Extracted {len(valid_results)} valid results")
                    return valid_results[:max_results]

                except json.JSONDecodeError as e:
                    print(f"❌ Failed to parse extracted content: {e}")
                    print("Raw output:", result.extracted_content[:500] if result.extracted_content else "None")
                    return []
            else:
                print("⚠️ No extracted content, trying alternative method...")
                return await search_google_llm_fallback(query, max_results)
        else:
            print(f"❌ Failed: {result.error_message}")
            print("Trying fallback method...")
            return await search_google_llm_fallback(query, max_results)


async def search_google_llm_fallback(query: str, max_results: int = 20) -> List[Dict]:
    """
    Fallback: use an LLM to extract the search results
    Note: this requires a configured LLM API key
    """

    print("🤖 Using LLM fallback extraction...")

    encoded_query = urllib.parse.quote(query)
    search_url = f"https://www.google.com/search?q={encoded_query}&num={max_results}"

    # Try a simple LLM extraction
    extraction_strategy = LLMExtractionStrategy(
        provider="openai/gpt-4o-mini",
        instruction=f"""
        Extract the top {max_results} search results from this Google search page for "{query}".

        For each search result, extract:
        1. Title - the blue link text
        2. Link - the URL (clean the URL, remove /url?q= prefix if present)
        3. Description - the gray text snippet below the title
        4. Site name - the green text showing the website name

        Return as JSON with a "results" array containing objects with these fields.
        Skip any ads or sponsored content.
        """
    )

    crawler_config = CrawlerRunConfig(
        extraction_strategy=extraction_strategy,
        cache_mode=CacheMode.BYPASS,
        page_timeout=30000
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=search_url, config=crawler_config)

        if result.success and result.extracted_content:
            try:
                data = json.loads(result.extracted_content)
                return data.get("results", [])
            except json.JSONDecodeError:
                print("⚠️ LLM output could not be parsed as JSON")
                return []
        else:
            print(f"❌ Fallback also failed: {result.error_message}")
            return []


async def search_google_with_html_parsing(query: str, max_results: int = 20) -> List[Dict]:
    """
    Parse the HTML directly as a last-resort fallback
    """

    print("🔧 Using direct HTML parsing...")

    encoded_query = urllib.parse.quote(query)
    search_url = f"https://www.google.com/search?q={encoded_query}&num={max_results}"

    browser_config = BrowserConfig(
        headless=True,
        viewport_width=1920,
        viewport_height=1080,
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    )

    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        wait_for="css:body",
        page_timeout=30000,
        js_code=[
            "const waitFor = (ms) => new Promise(resolve => setTimeout(resolve, ms));",
            "await waitFor(3000);"
        ]
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url=search_url, config=crawler_config)

        if result.success and result.html:
            from bs4 import BeautifulSoup

            soup = BeautifulSoup(result.html, 'html.parser')
            results = []

            # Google search results usually live in div.g
            for div in soup.select('div.g, div.tF2Cxc'):
                try:
                    # Extract the title
                    title_elem = div.select_one('h3')
                    title = title_elem.get_text() if title_elem else ""

                    # Extract the link
                    link_elem = div.select_one('a')
                    link = link_elem.get('href', '') if link_elem else ""

                    # Clean Google redirect links
                    if link.startswith('/url?q='):
                        from urllib.parse import urlparse, parse_qs, unquote
                        parsed = urlparse(link)
                        link = unquote(parse_qs(parsed.query).get('q', [link])[0])

                    # Extract the description
                    desc_elem = div.select_one('div.VwiC3b, div.s, span.aCOpRe')
                    description = desc_elem.get_text() if desc_elem else ""

                    # Extract the site name
                    site_elem = div.select_one('div.NJo7tc, span.VuuXrf, cite')
                    site_name = site_elem.get_text() if site_elem else ""

                    if title and link and not link.startswith('#'):
                        results.append({
                            "title": title.strip(),
                            "link": link.strip(),
                            "description": description.strip(),
                            "site_name": site_name.strip()
                        })

                        if len(results) >= max_results:
                            break

                except Exception:
                    continue

            print(f"📋 Parsed {len(results)} results from HTML")
            return results
        else:
            print(f"❌ HTML parsing failed")
            return []


async def main():
    if len(sys.argv) < 2:
        print("Usage: python google_search.py \"<search query>\" [max_results]")
        print("Example: python google_search.py \"2026年Go语言展望\" 20")
        sys.exit(1)

    query = sys.argv[1]
    max_results = int(sys.argv[2]) if len(sys.argv) > 2 else 20

    # Method 1: CSS extraction
    results = await search_google_css(query, max_results)

    # If CSS extraction fails, try HTML parsing
    if not results:
        results = await search_google_with_html_parsing(query, max_results)

    # Output the results
    if results:
        output = {
            "query": query,
            "total_results": len(results),
            "results": results
        }

        print("\n" + "="*60)
        print(f"✅ Successfully extracted {len(results)} search results")
        print("="*60)

        # Save to a file
        output_file = "google_search_results.json"
        with open(output_file, "w", encoding="utf-8") as f:
            json.dump(output, f, ensure_ascii=False, indent=2)

        print(f"\n💾 Results saved to: {output_file}")
        print("\n📋 Preview (first 3 results):")
        print(json.dumps(results[:3], ensure_ascii=False, indent=2))

        # Print the full JSON to stdout
        print("\n" + "="*60)
        print("FULL JSON OUTPUT:")
        print("="*60)
        print(json.dumps(output, ensure_ascii=False, indent=2))
    else:
        print("❌ No results extracted. Please check:")
        print("  1. Your internet connection")
        print("  2. Whether Google is blocking the request (try with headless=False)")
        print("  3. The CSS selectors (Google might have changed their HTML)")


if __name__ == "__main__":
    asyncio.run(main())

```
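The `/url?q=` cleanup appears twice in the script above; isolating it makes it easy to unit-test against Google's redirect format. A stdlib-only sketch, where `clean_google_link` is an illustrative name rather than a function in the script:

```python
from urllib.parse import urlparse, parse_qs, unquote

def clean_google_link(link: str) -> str:
    """Strip Google's /url?q= redirect wrapper, returning the target URL.

    Links that are not redirect-wrapped are returned unchanged.
    """
    if link.startswith("/url?q="):
        query = parse_qs(urlparse(link).query)
        # parse_qs yields lists; fall back to the original link if 'q' is absent
        return unquote(query.get("q", [link])[0])
    return link
```

Both the CSS path and the BeautifulSoup fallback could call this one helper instead of duplicating the parsing logic inline.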



---

## Skill Companion Files

> Additional files collected from the skill directory layout.

### README.md

```markdown
# Crawl4AI Skill

> A powerful web crawling and data extraction skill with JavaScript rendering, structured data extraction, and batch multi-URL processing.

[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)

Base implementation built on the [crawl4ai-skill](https://github.com/brettdavies/crawl4ai-skill) codebase.

## Features

- **Smart crawling** - Handles JavaScript-rendered pages automatically
- **Structured extraction** - Supports both CSS-selector and LLM extraction modes
- **Markdown generation** - Converts page content to formatted Markdown automatically
- **Batch processing** - Processes multiple URLs efficiently
- **Session management** - Supports login authentication and state persistence
- **Anti-bot countermeasures** - Built-in anti-detection and proxy support
- **Google search** - Dedicated search-result extraction script

## Installation

```bash
# Install crawl4ai
pip install crawl4ai

# Install the Playwright browsers
crawl4ai-setup

# Verify the installation
crawl4ai-doctor
```

## Quick Start

### CLI mode (recommended)

```bash
# Basic crawl, Markdown output
crwl https://example.com

# JSON output
crwl https://example.com -o json

# Bypass the cache, verbose output
crwl https://example.com -o json -v --bypass-cache
```

### Python SDK

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown[:500])

asyncio.run(main())
```

## Usage Examples

### Google Search Scraping

```bash
# Search and extract the top 20 results
python scripts/google_search.py "search query" 20

# Example
python scripts/google_search.py "2026年Go语言展望" 20
```

**Output format:**
```json
{
  "query": "search query",
  "total_results": 20,
  "results": [
    {
      "title": "Result title",
      "link": "https://example.com",
      "description": "Result description",
      "site_name": "Site name"
    }
  ]
}
```

### Data Extraction

#### 1. CSS selector extraction (fastest, no LLM needed)

```bash
# Generate an extraction schema
python scripts/extraction_pipeline.py --generate-schema https://shop.com "Extract all product information"

# Extract using the schema
crwl https://shop.com -e extract_css.yml -s schema.json -o json
```

**Schema format:**
```json
{
  "name": "products",
  "baseSelector": ".product-card",
  "fields": [
    {"name": "title", "selector": "h2", "type": "text"},
    {"name": "price", "selector": ".price", "type": "text"},
    {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
  ]
}
```

#### 2. LLM-based extraction

```yaml
# extract_llm.yml
type: "llm"
provider: "openai/gpt-4o-mini"
instruction: "Extract product names and prices"
api_token: "your-api-token"
```

```bash
crwl https://shop.com -e extract_llm.yml -o json
```

### Markdown Generation and Filtering

```bash
# Basic Markdown
crwl https://docs.example.com -o markdown > docs.md

# Filtered Markdown (noise removed)
crwl https://docs.example.com -o markdown-fit

# Content filtering with BM25
crwl https://docs.example.com -f filter_bm25.yml -o markdown-fit
```

**Filter configuration:**
```yaml
# filter_bm25.yml
type: "bm25"
query: "machine learning tutorial"
threshold: 1.0
```

### Handling Dynamic Content

```bash
# Wait for a specific element to load
crwl https://example.com -c "wait_for=css:.ajax-content,page_timeout=60000"

# Scan the full page
crwl https://example.com -c "scan_full_page=true,delay_before_return_html=2.0"
```

### Batch Processing

```python
# Concurrent processing with the Python SDK
urls = [
    "https://site1.com",
    "https://site2.com",
    "https://site3.com"
]
results = await crawler.arun_many(urls, config=config)
```
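`arun_many` manages concurrency internally, but when explicit control is needed (e.g. a per-site limit to respect rate limits) a semaphore does the job. A stdlib-only sketch, where `crawl_one` is a hypothetical stand-in for a function wrapping `crawler.arun`:

```python
import asyncio

async def crawl_all(urls, crawl_one, max_concurrency=3):
    """Crawl URLs with at most `max_concurrency` requests in flight at once.

    Results are returned in the same order as `urls`.
    """
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with sem:  # blocks while max_concurrency crawls are active
            return await crawl_one(url)

    return await asyncio.gather(*(bounded(u) for u in urls))
```

This keeps the burst size bounded regardless of how many URLs are queued, which is gentler on target sites than launching everything at once.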

### Login Authentication

```yaml
# login_crawler.yml
session_id: "user_session"
js_code: |
  document.querySelector('#username').value = 'user';
  document.querySelector('#password').value = 'pass';
  document.querySelector('#submit').click();
wait_for: "css:.dashboard"
```

```bash
# Log in first
crwl https://site.com/login -C login_crawler.yml

# Access protected content
crwl https://site.com/protected -c "session_id=user_session"
```

## Directory Structure

```
crawl4ai/
├── README.md                   # This file
├── SKILL.md                    # Detailed skill documentation
├── scripts/                    # Utility scripts
│   ├── google_search.py       # Google search scraper
│   ├── extraction_pipeline.py # Data extraction pipeline
│   ├── basic_crawler.py       # Basic crawler
│   └── batch_crawler.py       # Batch crawler
├── references/                 # Reference docs
│   ├── cli-guide.md           # Complete CLI guide
│   ├── sdk-guide.md           # SDK quick reference
│   └── complete-sdk-reference.md # Complete API reference
└── tests/                      # Test files
    ├── README.md
    ├── run_all_tests.py
    ├── test_basic_crawling.py
    ├── test_data_extraction.py
    ├── test_markdown_generation.py
    └── test_advanced_patterns.py
```

## Provided Scripts

| Script | Purpose |
|------|------|
| `google_search.py` | Scrapes Google search results, JSON output |
| `extraction_pipeline.py` | Three extraction strategies: CSS / LLM / manual |
| `basic_crawler.py` | Basic page crawling with screenshot support |
| `batch_crawler.py` | Batch URL processing |

## Configuration

### BrowserConfig (browser settings)

| Parameter | Description | Default |
|------|------|--------|
| `headless` | Headless mode | `true` |
| `viewport_width` | Viewport width | `1920` |
| `viewport_height` | Viewport height | `1080` |
| `user_agent` | User agent | random |
| `proxy_config` | Proxy configuration | `null` |

### CrawlerRunConfig (crawler settings)

| Parameter | Description | Default |
|------|------|--------|
| `page_timeout` | Page timeout (ms) | `30000` |
| `wait_for` | Wait condition | `null` |
| `cache_mode` | Cache mode | `enabled` |
| `js_code` | JS to execute | `null` |
| `css_selector` | CSS selector | `null` |

## Best Practices

1. **Prefer the CLI** - Use the CLI for quick tasks and the SDK for automation
2. **Use schema extraction** - 10-100x faster than LLM extraction, at zero cost
3. **Enable caching during development** - Use `--bypass-cache` only when needed
4. **Set timeouts sensibly** - 30s for ordinary sites, 60s+ for JS-heavy sites
5. **Use content filtering** - For cleaner Markdown output
6. **Respect rate limits** - Add delays between requests

## Troubleshooting

### JavaScript content not loading

```bash
crwl https://example.com -c "wait_for=css:.dynamic-content,page_timeout=60000"
```

### Blocked by anti-bot detection

```yaml
# browser.yml
headless: false
user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
user_agent_mode: "random"
```

### Extracted content is empty

```bash
# Debug mode: inspect the full output
crwl https://example.com -o all -v

# Try a different wait strategy
crwl https://example.com -c "wait_for=js:document.querySelector('.content')!==null"
```

## Documentation

- [Complete CLI guide](references/cli-guide.md) - The command-line interface in depth
- [SDK quick reference](references/sdk-guide.md) - Python SDK cheat sheet
- [Complete API reference](references/complete-sdk-reference.md) - 5900+ lines of full reference

## License

MIT License

## Related Links

- [Crawl4AI official repository](https://github.com/unclecode/crawl4ai)
- [Playwright documentation](https://playwright.dev/python/)
- [BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

```

### scripts/basic_crawler.py

```python
#!/usr/bin/env python3
"""
Basic Crawl4AI crawler template
Usage: python basic_crawler.py <url>
"""

import asyncio
import sys

# Version check
MIN_CRAWL4AI_VERSION = "0.7.4"
try:
    from crawl4ai.__version__ import __version__
    from packaging import version
    if version.parse(__version__) < version.parse(MIN_CRAWL4AI_VERSION):
        print(f"⚠️  Warning: Crawl4AI {MIN_CRAWL4AI_VERSION}+ recommended (you have {__version__})")
except ImportError:
    print(f"ℹ️  Crawl4AI {MIN_CRAWL4AI_VERSION}+ required")

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def crawl_basic(url: str):
    """Basic crawling with markdown output"""

    # Configure browser
    browser_config = BrowserConfig(
        headless=True,
        viewport_width=1920,
        viewport_height=1080
    )

    # Configure crawler
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        remove_overlay_elements=True,
        wait_for_images=True,
        screenshot=True
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url=url,
            config=crawler_config
        )

        if result.success:
            print(f"✅ Crawled: {result.url}")
            print(f"   Title: {result.metadata.get('title', 'N/A')}")
            print(f"   Links found: {len(result.links.get('internal', []))} internal, {len(result.links.get('external', []))} external")
            print(f"   Media found: {len(result.media.get('images', []))} images, {len(result.media.get('videos', []))} videos")
            print(f"   Content length: {len(result.markdown)} chars")

            # Save markdown
            with open("output.md", "w", encoding="utf-8") as f:
                f.write(result.markdown)
            print("📄 Saved to output.md")

            # Save screenshot if available
            if result.screenshot:
                # Check if screenshot is base64 string or bytes
                if isinstance(result.screenshot, str):
                    import base64
                    screenshot_data = base64.b64decode(result.screenshot)
                else:
                    screenshot_data = result.screenshot
                with open("screenshot.png", "wb") as f:
                    f.write(screenshot_data)
                print("📸 Saved screenshot.png")
        else:
            print(f"❌ Failed: {result.error_message}")

        return result

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python basic_crawler.py <url>")
        sys.exit(1)

    url = sys.argv[1]
    asyncio.run(crawl_basic(url))

```

### scripts/batch_crawler.py

```python
#!/usr/bin/env python3
"""
Crawl4AI batch/multi-URL crawler with concurrent processing
Usage: python batch_crawler.py urls.txt [--max-concurrent 5]
"""

import asyncio
import sys
import json
from pathlib import Path
from typing import List, Dict, Any

# Version check
MIN_CRAWL4AI_VERSION = "0.7.4"
try:
    from crawl4ai.__version__ import __version__
    from packaging import version
    if version.parse(__version__) < version.parse(MIN_CRAWL4AI_VERSION):
        print(f"⚠️  Warning: Crawl4AI {MIN_CRAWL4AI_VERSION}+ recommended (you have {__version__})")
except ImportError:
    print(f"ℹ️  Crawl4AI {MIN_CRAWL4AI_VERSION}+ required")

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def crawl_batch(urls: List[str], max_concurrent: int = 5):
    """
    Crawl multiple URLs efficiently with concurrent processing
    """
    print(f"🚀 Starting batch crawl of {len(urls)} URLs (max {max_concurrent} concurrent)")

    # Configure browser for efficiency
    browser_config = BrowserConfig(
        headless=True,
        viewport_width=1280,
        viewport_height=800,
        verbose=False
    )

    # Configure crawler
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        remove_overlay_elements=True,
        wait_for="css:body",
        page_timeout=30000,  # 30 seconds timeout per page
        screenshot=False  # Disable screenshots for batch processing
    )

    results = []
    failed = []

    async with AsyncWebCrawler(config=browser_config) as crawler:
        # Use arun_many for efficient batch processing
        batch_results = await crawler.arun_many(
            urls=urls,
            config=crawler_config,
            max_concurrent=max_concurrent
        )

        for result in batch_results:
            if result.success:
                results.append({
                    "url": result.url,
                    "title": result.metadata.get("title", ""),
                    "description": result.metadata.get("description", ""),
                    "content_length": len(result.markdown),
                    "links_count": len(result.links.get("internal", [])) + len(result.links.get("external", [])),
                    "images_count": len(result.media.get("images", [])),
                })
                print(f"✅ {result.url}")
            else:
                failed.append({
                    "url": result.url,
                    "error": result.error_message
                })
                print(f"❌ {result.url}: {result.error_message}")

    # Save results
    output = {
        "success_count": len(results),
        "failed_count": len(failed),
        "results": results,
        "failed": failed
    }

    with open("batch_results.json", "w") as f:
        json.dump(output, f, indent=2)

    # Save individual markdown files
    markdown_dir = Path("batch_markdown")
    markdown_dir.mkdir(exist_ok=True)

    for i, result in enumerate(batch_results):
        if result.success:
            # Create safe filename from URL
            safe_name = result.url.replace("https://", "").replace("http://", "")
            safe_name = "".join(c if c.isalnum() or c in "-_" else "_" for c in safe_name)[:100]

            file_path = markdown_dir / f"{i:03d}_{safe_name}.md"
            with open(file_path, "w", encoding="utf-8") as f:
                f.write(f"# {result.metadata.get('title', result.url)}\n\n")
                f.write(f"URL: {result.url}\n\n")
                f.write(result.markdown)

    print(f"\n📊 Batch Crawl Complete:")
    print(f"   ✅ Success: {len(results)}")
    print(f"   ❌ Failed: {len(failed)}")
    print(f"   💾 Results saved to: batch_results.json")
    print(f"   📁 Markdown files saved to: {markdown_dir}/")

    return output

async def crawl_with_extraction(urls: List[str], schema_file: str = None):
    """
    Batch crawl with structured data extraction
    """
    from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

    schema = None
    if schema_file and Path(schema_file).exists():
        with open(schema_file) as f:
            schema = json.load(f)
        print(f"📋 Using extraction schema from: {schema_file}")
    else:
        # Default schema for general content
        schema = {
            "name": "content",
            "selector": "body",
            "fields": [
                {"name": "headings", "selector": "h1, h2, h3", "type": "text", "all": True},
                {"name": "paragraphs", "selector": "p", "type": "text", "all": True},
                {"name": "links", "selector": "a[href]", "type": "attribute", "attribute": "href", "all": True}
            ]
        }

    extraction_strategy = JsonCssExtractionStrategy(schema=schema)

    crawler_config = CrawlerRunConfig(
        extraction_strategy=extraction_strategy,
        cache_mode=CacheMode.BYPASS
    )

    extracted_data = []

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls=urls,
            config=crawler_config,
            max_concurrent=5
        )

        for result in results:
            if result.success and result.extracted_content:
                try:
                    data = json.loads(result.extracted_content)
                    extracted_data.append({
                        "url": result.url,
                        "data": data
                    })
                    print(f"✅ Extracted from: {result.url}")
                except json.JSONDecodeError:
                    print(f"⚠️ Failed to parse JSON from: {result.url}")

    # Save extracted data
    with open("batch_extracted.json", "w") as f:
        json.dump(extracted_data, f, indent=2)

    print(f"\n💾 Extracted data saved to: batch_extracted.json")
    return extracted_data

def load_urls(source: str) -> List[str]:
    """Load URLs from file or string"""
    if Path(source).exists():
        with open(source) as f:
            urls = [line.strip() for line in f if line.strip() and not line.startswith("#")]
    else:
        # Treat as comma-separated URLs
        urls = [url.strip() for url in source.split(",") if url.strip()]

    return urls

async def main():
    if len(sys.argv) < 2:
        print("""
Crawl4AI Batch Crawler

Usage:
    # Crawl URLs from file
    python batch_crawler.py urls.txt [--max-concurrent 5]

    # Crawl with extraction
    python batch_crawler.py urls.txt --extract [schema.json]

    # Crawl comma-separated URLs
    python batch_crawler.py "https://example.com,https://example.org"

Options:
    --max-concurrent N    Max concurrent crawls (default: 5)
    --extract [schema]    Extract structured data using schema

Example urls.txt:
    https://example.com
    https://example.org
    # Comments are ignored
    https://another-site.com
""")
        sys.exit(1)

    source = sys.argv[1]
    urls = load_urls(source)

    if not urls:
        print("❌ No URLs found")
        sys.exit(1)

    print(f"📋 Loaded {len(urls)} URLs")

    # Parse options
    max_concurrent = 5
    extract_mode = False
    schema_file = None

    for i, arg in enumerate(sys.argv[2:], 2):
        if arg == "--max-concurrent" and i + 1 < len(sys.argv):
            max_concurrent = int(sys.argv[i + 1])
        elif arg == "--extract":
            extract_mode = True
            if i + 1 < len(sys.argv) and not sys.argv[i + 1].startswith("--"):
                schema_file = sys.argv[i + 1]

    if extract_mode:
        await crawl_with_extraction(urls, schema_file)
    else:
        await crawl_batch(urls, max_concurrent)

if __name__ == "__main__":
    asyncio.run(main())

```