crawl4ai
This skill should be used when users need to scrape websites, extract structured data, handle JavaScript-heavy pages, crawl multiple URLs, or build automated web data pipelines. Includes optimized extraction patterns with schema generation for efficient, LLM-free extraction.
Packaged view
This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.
Install command
npx @skill-hub/cli install aiskillstore-marketplace-crawl4ai
Repository
Skill path: skills/smallnest/crawl4ai
Open repository
Best for
Primary workflow: Analyze Data & AI.
Technical facets: Full Stack, Data / AI.
Target audience: everyone.
License: Unknown.
Original source
Catalog source: SkillHub Club.
Repository owner: aiskillstore.
This is still a mirrored public skill entry. Review the repository before installing into production workflows.
What it helps with
- Install crawl4ai into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
- Review https://github.com/aiskillstore/marketplace before adding crawl4ai to shared team environments
- Use crawl4ai for development workflows
Works across
Favorites: 0.
Sub-skills: 0.
Aggregator: No.
Original source / Raw SKILL.md
---
name: crawl4ai
description: This skill should be used when users need to scrape websites, extract structured data, handle JavaScript-heavy pages, crawl multiple URLs, or build automated web data pipelines. Includes optimized extraction patterns with schema generation for efficient, LLM-free extraction.
---
# Crawl4AI
## Overview
Crawl4AI provides comprehensive web crawling and data extraction capabilities. This skill supports both **CLI** (recommended for quick tasks) and **Python SDK** (for programmatic control).
**Choose your interface:**
- **CLI** (`crwl`) - Quick, scriptable commands: [CLI Guide](references/cli-guide.md)
- **Python SDK** - Full programmatic control: [SDK Guide](references/sdk-guide.md)
---
## Quick Start
### Installation
```bash
pip install crawl4ai
crawl4ai-setup
# Verify installation
crawl4ai-doctor
```
### CLI (Recommended)
```bash
# Basic crawling - returns markdown
crwl https://example.com
# Get markdown output
crwl https://example.com -o markdown
# JSON output with cache bypass
crwl https://example.com -o json -v --bypass-cache
# See more examples
crwl --example
```
### Python SDK
```python
import asyncio
from crawl4ai import AsyncWebCrawler
async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown[:500])

asyncio.run(main())
```
For SDK configuration details: [SDK Guide - Configuration](references/sdk-guide.md#configuration) (lines 61-150)
---
## Core Concepts
### Configuration Layers
Both CLI and SDK use the same underlying configuration:
| Concept | CLI | SDK |
|---------|-----|-----|
| Browser settings | `-B browser.yml` or `-b "param=value"` | `BrowserConfig(...)` |
| Crawl settings | `-C crawler.yml` or `-c "param=value"` | `CrawlerRunConfig(...)` |
| Extraction | `-e extract.yml -s schema.json` | `extraction_strategy=...` |
| Content filter | `-f filter.yml` | `markdown_generator=...` |
### Key Parameters
**Browser Configuration:**
- `headless`: Run with/without GUI
- `viewport_width/height`: Browser dimensions
- `user_agent`: Custom user agent
- `proxy_config`: Proxy settings
**Crawler Configuration:**
- `page_timeout`: Max page load time (ms)
- `wait_for`: CSS selector or JS condition to wait for
- `cache_mode`: bypass, enabled, disabled
- `js_code`: JavaScript to execute
- `css_selector`: Focus on specific element
For complete parameters: [CLI Config](references/cli-guide.md#configuration) | [SDK Config](references/sdk-guide.md#configuration)
### Output Content
Every crawl returns:
- **markdown** - Clean, formatted markdown
- **html** - Raw HTML
- **links** - Internal and external links discovered
- **media** - Images, videos, audio found
- **extracted_content** - Structured data (if extraction configured)
---
## Markdown Generation (Primary Use Case)
Crawl4AI excels at generating clean, well-formatted markdown:
### CLI
```bash
# Basic markdown
crwl https://docs.example.com -o markdown
# Filtered markdown (removes noise)
crwl https://docs.example.com -o markdown-fit
# With content filter
crwl https://docs.example.com -f filter_bm25.yml -o markdown-fit
```
**Filter configuration:**
```yaml
# filter_bm25.yml (relevance-based)
type: "bm25"
query: "machine learning tutorials"
threshold: 1.0
```
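To build intuition for what a `bm25` filter does, here is a rough, stdlib-only sketch of BM25 relevance scoring: chunks scoring below the threshold get dropped. This is an illustration of the idea only, not Crawl4AI's implementation, and the `k1`/`b` defaults are illustrative.

```python
# Simplified BM25 scoring: rank text chunks by relevance to a query.
import math
from collections import Counter

def bm25_scores(chunks, query, k1=1.5, b=0.75):
    docs = [c.lower().split() for c in chunks]
    avg_len = sum(len(d) for d in docs) / len(docs)
    n = len(docs)
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for d in docs if term in d)  # document frequency
            if df == 0:
                continue
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            )
        scores.append(score)
    return scores

chunks = [
    "machine learning tutorials for beginners",
    "site navigation footer copyright links",
]
scores = bm25_scores(chunks, "machine learning tutorials")
print(scores)  # the first chunk scores higher; off-topic boilerplate scores 0
```

A relevance-based filter keeps only the chunks whose score clears the configured threshold, which is why `markdown-fit` output is cleaner than raw markdown.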
### Python SDK
```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

bm25_filter = BM25ContentFilter(user_query="machine learning", bm25_threshold=1.0)
md_generator = DefaultMarkdownGenerator(content_filter=bm25_filter)
config = CrawlerRunConfig(markdown_generator=md_generator)

# Inside an async AsyncWebCrawler context:
result = await crawler.arun(url, config=config)
print(result.markdown.fit_markdown)  # Filtered
print(result.markdown.raw_markdown)  # Original
```
For content filters: [Content Processing](references/complete-sdk-reference.md#content-processing) (lines 2481-3101)
---
## Data Extraction
### 1. Schema-Based CSS Extraction (Most Efficient)
**No LLM required** - fast, deterministic, cost-free.
**CLI:**
```bash
# Generate schema once (uses LLM)
python scripts/extraction_pipeline.py --generate-schema https://shop.com "extract products"
# Use schema for extraction (no LLM)
crwl https://shop.com -e extract_css.yml -s product_schema.json -o json
```
**Schema format:**
```json
{
  "name": "products",
  "baseSelector": ".product-card",
  "fields": [
    {"name": "title", "selector": "h2", "type": "text"},
    {"name": "price", "selector": ".price", "type": "text"},
    {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
  ]
}
```
```
### 2. LLM-Based Extraction
For complex or irregular content:
**CLI:**
```yaml
# extract_llm.yml
type: "llm"
provider: "openai/gpt-4o-mini"
instruction: "Extract product names and prices"
api_token: "your-token"
```
```bash
crwl https://shop.com -e extract_llm.yml -o json
```
For extraction details: [Extraction Strategies](references/complete-sdk-reference.md#extraction-strategies) (lines 4522-5429)
---
## Advanced Patterns
### Dynamic Content (JavaScript-Heavy Sites)
**CLI:**
```bash
crwl https://example.com -c "wait_for=css:.ajax-content,scan_full_page=true,page_timeout=60000"
```
**Crawler config:**
```yaml
# crawler.yml
wait_for: "css:.ajax-content"
scan_full_page: true
page_timeout: 60000
delay_before_return_html: 2.0
```
### Multi-URL Processing
**CLI (sequential):**
```bash
for url in url1 url2 url3; do crwl "$url" -o markdown; done
```
**Python SDK (concurrent):**
```python
urls = ["https://site1.com", "https://site2.com", "https://site3.com"]
results = await crawler.arun_many(urls, config=config)
```
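The concurrency pattern behind `arun_many()` can be sketched with plain `asyncio` and a stub fetcher; no network or crawl4ai needed. The `fetch` coroutine here is a stand-in for a real crawl call.

```python
# Concurrent multi-URL processing pattern, illustrated with a stub fetcher.
import asyncio

async def fetch(url: str) -> str:
    await asyncio.sleep(0)  # stand-in for real network I/O
    return f"markdown for {url}"

async def crawl_all(urls):
    # gather() runs all coroutines concurrently and preserves input order
    return await asyncio.gather(*(fetch(u) for u in urls))

results = asyncio.run(crawl_all(["https://site1.com", "https://site2.com"]))
print(results)
```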
For batch processing: [arun_many() Reference](references/complete-sdk-reference.md#arunmany-reference) (lines 1057-1224)
### Session & Authentication
**CLI:**
```yaml
# login_crawler.yml
session_id: "user_session"
js_code: |
  document.querySelector('#username').value = 'user';
  document.querySelector('#password').value = 'pass';
  document.querySelector('#submit').click();
wait_for: "css:.dashboard"
```
```bash
# Login
crwl https://site.com/login -C login_crawler.yml
# Access protected content (session reused)
crwl https://site.com/protected -c "session_id=user_session"
```
For session management: [Advanced Features](references/complete-sdk-reference.md#advanced-features) (lines 5429-5940)
### Anti-Detection & Proxies
**CLI:**
```yaml
# browser.yml
headless: true
proxy_config:
  server: "http://proxy:8080"
  username: "user"
  password: "pass"
user_agent_mode: "random"
```
```bash
crwl https://example.com -B browser.yml
```
---
## Common Use Cases
### Google Search Scraping
```bash
# Search Google and get results as JSON
python scripts/google_search.py "your search query" 20
# Example
python scripts/google_search.py "Go language outlook for 2026" 20
```
The script extracts:
- Search result titles
- URLs (cleaned, removes Google redirects)
- Descriptions/snippets
- Site names
Output is saved to `google_search_results.json` and printed to stdout.
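A sketch of consuming the saved results file. The field names used below (`title`, `url`, `description`, `site_name`) are an assumption based on the list above; check the actual JSON that `scripts/google_search.py` produces.

```python
# Filtering saved search results by title keyword (field names assumed).
import json

sample = [
    {"title": "Go in 2026", "url": "https://example.com/go",
     "description": "Outlook piece", "site_name": "example.com"},
    {"title": "Other topic", "url": "https://example.org/x",
     "description": "Unrelated", "site_name": "example.org"},
]

# In practice: results = json.load(open("google_search_results.json"))
results = sample
go_hits = [r["url"] for r in results if "Go" in r["title"]]
print(go_hits)
```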
### Documentation to Markdown
```bash
crwl https://docs.example.com -o markdown > docs.md
```
### E-commerce Product Monitoring
```bash
# Generate schema once
python scripts/extraction_pipeline.py --generate-schema https://shop.com "extract products"
# Monitor (no LLM costs)
crwl https://shop.com -e extract_css.yml -s schema.json -o json
```
### News Aggregation
```bash
# Multiple sources with filtering
for url in news1.com news2.com news3.com; do
crwl "https://$url" -f filter_bm25.yml -o markdown-fit
done
```
### Interactive Q&A
```bash
# First view content
crwl https://example.com -o markdown
# Then ask questions
crwl https://example.com -q "What are the main conclusions?"
crwl https://example.com -q "Summarize the key points"
```
---
## Resources
### Provided Scripts
- **scripts/google_search.py** - Google search scraper with JSON output
- **scripts/extraction_pipeline.py** - Schema generation and extraction
- **scripts/basic_crawler.py** - Simple markdown extraction
- **scripts/batch_crawler.py** - Multi-URL processing
### Reference Documentation
| Document | Purpose |
|----------|---------|
| [CLI Guide](references/cli-guide.md) | Command-line interface reference |
| [SDK Guide](references/sdk-guide.md) | Python SDK quick reference |
| [Complete SDK Reference](references/complete-sdk-reference.md) | Full API documentation (5900+ lines) |
---
## Best Practices
1. **Start with CLI** for quick tasks, SDK for automation
2. **Use schema-based extraction** - 10-100x more efficient than LLM
3. **Enable caching during development** - `--bypass-cache` only when needed
4. **Set appropriate timeouts** - 30s normal, 60s+ for JS-heavy sites
5. **Use content filters** for cleaner, focused markdown
6. **Respect rate limits** - Add delays between requests
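Point 6 can be as simple as a fixed delay between sequential requests. A minimal illustrative helper, where `fake_fetch` stands in for a real crawl call such as `crawler.arun(url)`:

```python
# Politeness delay between sequential crawls (tune the delay per target site).
import asyncio

async def polite_crawl(urls, fetch, delay=1.0):
    results = []
    for i, url in enumerate(urls):
        if i:  # no delay before the first request
            await asyncio.sleep(delay)
        results.append(await fetch(url))
    return results

async def fake_fetch(url):  # stand-in for a real crawler call
    return url.upper()

out = asyncio.run(polite_crawl(["a", "b"], fake_fetch, delay=0.01))
print(out)
```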
---
## Troubleshooting
### JavaScript Not Loading
```bash
crwl https://example.com -c "wait_for=css:.dynamic-content,page_timeout=60000"
```
### Bot Detection Issues
```bash
crwl https://example.com -B browser.yml
```
```yaml
# browser.yml
headless: false
viewport_width: 1920
viewport_height: 1080
user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
```
### Content Not Extracted
```bash
# Debug: see full output
crwl https://example.com -o all -v
# Try different wait strategy
crwl https://example.com -c "wait_for=js:document.querySelector('.content')!==null"
```
### Session Issues
```bash
# Verify session
crwl https://site.com -c "session_id=test" -o all | grep -i session
```
---
For comprehensive API documentation, see [Complete SDK Reference](references/complete-sdk-reference.md).
---
## Referenced Files
> The following files are referenced in this skill and included for context.
### references/cli-guide.md
```markdown
# Crawl4AI CLI Guide
<!-- Reference: Tier 2 - Command-line interface for Crawl4AI -->
## Table of Contents
<!-- Lines 1-20 -->
- [Installation](#installation)
- [Basic Usage](#basic-usage)
- [Configuration](#configuration)
- [Browser Configuration](#browser-configuration)
- [Crawler Configuration](#crawler-configuration)
- [Extraction Configuration](#extraction-configuration)
- [Content Filtering](#content-filtering)
- [Advanced Features](#advanced-features)
- [LLM Q&A](#llm-qa)
- [Structured Data Extraction](#structured-data-extraction)
- [Content Filtering](#content-filtering-1)
- [Output Formats](#output-formats)
- [Examples](#examples)
- [Best Practices & Tips](#best-practices--tips)
---
## Installation
<!-- Lines 21-25 -->
The Crawl4AI CLI (`crwl`) is installed automatically with the library:
```bash
pip install crawl4ai
crawl4ai-setup
```
---
## Basic Usage
<!-- Lines 26-50 -->
The `crwl` command provides a simple interface to the Crawl4AI library:
```bash
# Basic crawling - returns markdown
crwl https://example.com
# Specify output format
crwl https://example.com -o markdown
# Verbose JSON output with cache bypass
crwl https://example.com -o json -v --bypass-cache
# See usage examples
crwl --example
```
**Quick Example - Advanced Usage:**
```bash
# Extract structured data using CSS schema
crwl "https://www.infoq.com/ai-ml-data-eng/" \
-e docs/examples/cli/extract_css.yml \
-s docs/examples/cli/css_schema.json \
-o json
```
---
## Configuration
<!-- Lines 51-160 -->
### Browser Configuration
<!-- Lines 51-75 -->
Browser settings via YAML file or command line:
```yaml
# browser.yml
headless: true
viewport_width: 1280
user_agent_mode: "random"
verbose: true
ignore_https_errors: true
```
```bash
# Using config file
crwl https://example.com -B browser.yml
# Using direct parameters
crwl https://example.com -b "headless=true,viewport_width=1280,user_agent_mode=random"
```
**Key Parameters:**
| Parameter | Description |
|-----------|-------------|
| `headless` | Run without GUI (true/false) |
| `viewport_width` | Browser width in pixels |
| `viewport_height` | Browser height in pixels |
| `user_agent_mode` | "random" or specific UA string |
For all browser parameters: [BrowserConfig Reference](complete-sdk-reference.md#1-browserconfig--controlling-the-browser) (lines 1977-2020)
### Crawler Configuration
<!-- Lines 76-110 -->
Control crawling behavior:
```yaml
# crawler.yml
cache_mode: "bypass"
wait_until: "networkidle"
page_timeout: 30000
delay_before_return_html: 0.5
word_count_threshold: 100
scan_full_page: true
scroll_delay: 0.3
process_iframes: false
remove_overlay_elements: true
magic: true
verbose: true
```
```bash
# Using config file
crwl https://example.com -C crawler.yml
# Using direct parameters
crwl https://example.com -c "css_selector=#main,delay_before_return_html=2,scan_full_page=true"
```
**Key Parameters:**
| Parameter | Description |
|-----------|-------------|
| `cache_mode` | bypass, enabled, disabled |
| `wait_until` | networkidle, domcontentloaded |
| `page_timeout` | Max page load time (ms) |
| `css_selector` | Focus on specific element |
| `scan_full_page` | Enable infinite scroll handling |
For all crawler parameters: [CrawlerRunConfig Reference](complete-sdk-reference.md#2-crawlerrunconfig--controlling-each-crawl) (lines 2020-2330)
### Extraction Configuration
<!-- Lines 111-160 -->
Two extraction types supported:
**1. CSS/XPath-based extraction:**
```yaml
# extract_css.yml
type: "json-css"
params:
verbose: true
```
```json
// css_schema.json
{
  "name": "ArticleExtractor",
  "baseSelector": ".article",
  "fields": [
    {
      "name": "title",
      "selector": "h1.title",
      "type": "text"
    },
    {
      "name": "link",
      "selector": "a.read-more",
      "type": "attribute",
      "attribute": "href"
    }
  ]
}
```
**2. LLM-based extraction:**
```yaml
# extract_llm.yml
type: "llm"
provider: "openai/gpt-4"
instruction: "Extract all articles with their titles and links"
api_token: "your-token"
params:
temperature: 0.3
max_tokens: 1000
```
For extraction strategies: [Extraction Strategies](complete-sdk-reference.md#extraction-strategies) (lines 4522-5429)
---
## Advanced Features
<!-- Lines 161-230 -->
### LLM Q&A
<!-- Lines 161-190 -->
Ask questions about crawled content:
```bash
# Simple question
crwl https://example.com -q "What is the main topic discussed?"
# View content then ask questions
crwl https://example.com -o markdown # See content first
crwl https://example.com -q "Summarize the key points"
crwl https://example.com -q "What are the conclusions?"
# Combined with advanced crawling
crwl https://example.com \
-B browser.yml \
-c "css_selector=article,scan_full_page=true" \
-q "What are the pros and cons mentioned?"
```
**First-time setup:**
- Prompts for LLM provider and API token
- Saves configuration in `~/.crawl4ai/global.yml`
- Supports: openai/gpt-4, anthropic/claude-3-sonnet, ollama (no token needed)
- See [LiteLLM Providers](https://docs.litellm.ai/docs/providers) for full list
### Structured Data Extraction
<!-- Lines 191-210 -->
```bash
# CSS-based extraction
crwl https://example.com \
-e extract_css.yml \
-s css_schema.json \
-o json
# LLM-based extraction
crwl https://example.com \
-e extract_llm.yml \
-s llm_schema.json \
-o json
```
### Content Filtering
<!-- Lines 211-230 -->
Filter content for relevance:
```yaml
# filter_bm25.yml (relevance-based)
type: "bm25"
query: "target content"
threshold: 1.0

# filter_pruning.yml (quality-based) -- a separate file
type: "pruning"
query: "focus topic"
threshold: 0.48
```
```bash
crwl https://example.com -f filter_bm25.yml -o markdown-fit
```
For content filtering: [Content Processing](complete-sdk-reference.md#content-processing) (lines 2481-3101)
---
## Output Formats
<!-- Lines 231-240 -->
| Format | Flag | Description |
|--------|------|-------------|
| `all` | `-o all` | Full crawl result including metadata |
| `json` | `-o json` | Extracted structured data |
| `markdown` | `-o markdown` or `-o md` | Raw markdown output |
| `markdown-fit` | `-o markdown-fit` or `-o md-fit` | Filtered markdown |
---
## Complete Examples
<!-- Lines 241-280 -->
**1. Basic Extraction:**
```bash
crwl https://example.com \
-B browser.yml \
-C crawler.yml \
-o json
```
**2. Structured Data Extraction:**
```bash
crwl https://example.com \
-e extract_css.yml \
-s css_schema.json \
-o json \
-v
```
**3. LLM Extraction with Filtering:**
```bash
crwl https://example.com \
-B browser.yml \
-e extract_llm.yml \
-s llm_schema.json \
-f filter_bm25.yml \
-o json
```
**4. Interactive Q&A:**
```bash
# First crawl and view
crwl https://example.com -o markdown
# Then ask questions
crwl https://example.com -q "What are the main points?"
crwl https://example.com -q "Summarize the conclusions"
```
---
## Best Practices & Tips
<!-- Lines 281-310 -->
1. **Configuration Management:**
- Keep common configurations in YAML files
- Use CLI parameters for quick overrides
- Store sensitive data (API tokens) in `~/.crawl4ai/global.yml`
2. **Performance Optimization:**
- Use `--bypass-cache` for fresh content
- Enable `scan_full_page` for infinite scroll pages
- Adjust `delay_before_return_html` for dynamic content
3. **Content Extraction:**
- Use CSS extraction for structured content (faster, no API costs)
- Use LLM extraction for unstructured content
- Combine with filters for focused results
4. **Q&A Workflow:**
- View content first with `-o markdown`
- Ask specific questions
- Use broader context with appropriate selectors
---
## Recap
The Crawl4AI CLI provides:
- Flexible configuration via files and parameters
- Multiple extraction strategies (CSS, XPath, LLM)
- Content filtering and optimization
- Interactive Q&A capabilities
- Various output formats
---
## See Also
- [Python SDK Guide](sdk-guide.md) - Programmatic Python interface
- [Complete SDK Reference](complete-sdk-reference.md) - Full API documentation
```
### references/sdk-guide.md
```markdown
# Crawl4AI Python SDK Guide
<!-- Reference: Tier 2 - Python SDK interface for Crawl4AI -->
## Quick Start
<!-- Lines 1-60 -->
### Installation
```bash
pip install crawl4ai
crawl4ai-setup
```
### Basic First Crawl
```python
import asyncio
from crawl4ai import AsyncWebCrawler
async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown[:500])

asyncio.run(main())
```
### With Configuration
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
browser_config = BrowserConfig(
    headless=True,
    viewport_width=1920,
    viewport_height=1080
)
crawler_config = CrawlerRunConfig(
    page_timeout=30000,
    screenshot=True,
    remove_overlay_elements=True
)

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(
        url="https://example.com",
        config=crawler_config
    )
    print(f"Success: {result.success}")
    print(f"Markdown length: {len(result.markdown)}")
```
For complete API reference: [AsyncWebCrawler](complete-sdk-reference.md#asyncwebcrawler) (lines 517-778)
---
## Configuration
<!-- Lines 61-150 -->
### BrowserConfig
Controls the browser instance (global settings):
```python
from crawl4ai import BrowserConfig
browser_config = BrowserConfig(
    browser_type="chromium",   # chromium, firefox, webkit
    headless=True,             # Run without GUI
    viewport_width=1280,
    viewport_height=720,
    user_agent="custom-agent", # Custom user agent
    proxy_config={             # Proxy settings
        "server": "http://proxy:8080",
        "username": "user",
        "password": "pass"
    }
)
```
**Key Parameters:**
| Parameter | Description |
|-----------|-------------|
| `headless` | Run with/without GUI |
| `viewport_width/height` | Browser dimensions |
| `user_agent` | Custom user agent string |
| `cookies` | Pre-set cookies |
| `headers` | Custom HTTP headers |
| `proxy_config` | Proxy server settings |
For all parameters: [BrowserConfig Reference](complete-sdk-reference.md#1-browserconfig--controlling-the-browser) (lines 1977-2020)
### CrawlerRunConfig
Controls each crawl operation (per-crawl settings):
```python
from crawl4ai import CrawlerRunConfig, CacheMode

config = CrawlerRunConfig(
    # Timing
    page_timeout=30000,       # Max page load time (ms)
    wait_for="css:.content",  # Wait for element
    delay_before_return_html=0.5,
    # Content selection
    css_selector=".main-content",
    excluded_tags=["nav", "footer"],
    # Caching
    cache_mode=CacheMode.BYPASS,
    # JavaScript
    js_code="window.scrollTo(0, document.body.scrollHeight);",
    # Output
    screenshot=True,
    pdf=True
)
```
**Key Parameters:**
| Parameter | Description |
|-----------|-------------|
| `page_timeout` | Max page load/JS time (ms) |
| `wait_for` | CSS selector or JS condition |
| `cache_mode` | ENABLED, BYPASS, DISABLED |
| `js_code` | JavaScript to execute |
| `session_id` | Persist session across crawls |
| `screenshot` | Capture screenshot |
For all parameters: [CrawlerRunConfig Reference](complete-sdk-reference.md#2-crawlerrunconfig--controlling-each-crawl) (lines 2020-2330)
---
## CrawlResult
<!-- Lines 151-200 -->
Every `arun()` call returns a `CrawlResult`:
```python
result = await crawler.arun(url, config=config)
# Status
result.success # bool - crawl succeeded
result.status_code # HTTP status code
result.error_message # Error details if failed
# Content
result.html # Raw HTML
result.cleaned_html # Sanitized HTML
result.markdown # MarkdownGenerationResult object
result.markdown.raw_markdown # Full markdown
result.markdown.fit_markdown # Filtered markdown (if filter used)
# Media & Links
result.media["images"] # List of images
result.media["videos"] # List of videos
result.links["internal"] # Internal links
result.links["external"] # External links
# Extras
result.screenshot # Base64 screenshot (if requested)
result.pdf # PDF bytes (if requested)
result.metadata # Page metadata (title, description)
```
For complete fields: [CrawlResult Reference](complete-sdk-reference.md#crawlresult-reference) (lines 1224-1612)
---
## Content Processing
<!-- Lines 201-280 -->
### Markdown Generation
```python
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

md_generator = DefaultMarkdownGenerator(
    options={
        "ignore_links": False,
        "ignore_images": False,
        "body_width": 80
    }
)
config = CrawlerRunConfig(markdown_generator=md_generator)
```
### Content Filtering
Filter content for relevance before markdown generation:
```python
from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

# Option 1: Pruning (removes low-quality content)
pruning_filter = PruningContentFilter(
    threshold=0.4,
    threshold_type="fixed"
)

# Option 2: BM25 (relevance-based)
bm25_filter = BM25ContentFilter(
    user_query="machine learning tutorials",
    bm25_threshold=1.0
)

md_generator = DefaultMarkdownGenerator(content_filter=bm25_filter)
config = CrawlerRunConfig(markdown_generator=md_generator)

result = await crawler.arun(url, config=config)
print(result.markdown.fit_markdown)  # Filtered content
print(result.markdown.raw_markdown)  # Original content
```
For filters and generators: [Content Processing](complete-sdk-reference.md#content-processing) (lines 2481-3101)
---
## Data Extraction
<!-- Lines 281-360 -->
### CSS-Based Extraction (No LLM)
Fast, deterministic extraction using CSS selectors:
```python
import json

from crawl4ai import CrawlerRunConfig, JsonCssExtractionStrategy

schema = {
    "name": "articles",
    "baseSelector": "article.post",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "date", "selector": ".date", "type": "text"},
        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
    ]
}

extraction_strategy = JsonCssExtractionStrategy(schema=schema)
config = CrawlerRunConfig(extraction_strategy=extraction_strategy)

result = await crawler.arun(url, config=config)
data = json.loads(result.extracted_content)
```
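To build intuition for what a schema describes, here is a toy, stdlib-only approximation of schema-driven extraction. It supports only simplified selectors (`tag`, `.class`, `tag.class`) over well-formed XHTML, and is an illustration of the idea rather than Crawl4AI's implementation.

```python
# Toy, LLM-free schema extraction: selectors in, structured dicts out.
import xml.etree.ElementTree as ET

def _matches(el, selector):
    """Match a simplified selector: 'tag', '.class', or 'tag.class'."""
    tag, _, cls = selector.partition(".")
    if tag and el.tag != tag:
        return False
    if cls and cls not in el.get("class", "").split():
        return False
    return True

def extract(html, schema):
    root = ET.fromstring(html)  # sample must be well-formed XML/XHTML
    items = []
    for base in root.iter():
        if not _matches(base, schema["baseSelector"]):
            continue
        item = {}
        for field in schema["fields"]:
            node = next((e for e in base.iter() if _matches(e, field["selector"])), None)
            if node is None:
                continue
            if field["type"] == "text":
                item[field["name"]] = (node.text or "").strip()
            elif field["type"] == "attribute":
                item[field["name"]] = node.get(field["attribute"])
        items.append(item)
    return items

sample = """<div>
  <div class="product-card"><h2>Widget</h2><span class="price">$9</span><a href="/w">More</a></div>
  <div class="product-card"><h2>Gadget</h2><span class="price">$19</span><a href="/g">More</a></div>
</div>"""

schema = {
    "name": "products",
    "baseSelector": ".product-card",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"},
        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"},
    ],
}

print(extract(sample, schema))
```

Because extraction is just selector matching, it is deterministic and costs nothing per page, which is why generating a schema once and reusing it beats per-page LLM extraction.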
### LLM-Based Extraction
For complex or irregular content:
```python
from crawl4ai import CrawlerRunConfig, LLMExtractionStrategy, LLMConfig
from pydantic import BaseModel, Field

class Product(BaseModel):
    name: str = Field(description="Product name")
    price: str = Field(description="Product price")

extraction_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(
        provider="openai/gpt-4o-mini",
        api_token="your-token"
    ),
    schema=Product.model_json_schema(),
    extraction_type="schema",
    instruction="Extract product information"
)
config = CrawlerRunConfig(extraction_strategy=extraction_strategy)
```
For extraction strategies: [Extraction Strategies](complete-sdk-reference.md#extraction-strategies) (lines 4522-5429)
---
## Multi-URL Crawling
<!-- Lines 361-420 -->
### Concurrent Processing with arun_many()
```python
urls = ["https://site1.com", "https://site2.com", "https://site3.com"]
config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    stream=True  # Enable streaming
)

async with AsyncWebCrawler() as crawler:
    # Streaming mode - process as they complete
    async for result in await crawler.arun_many(urls, config=config):
        if result.success:
            print(f"Completed: {result.url}")

    # Batch mode - wait for all
    config = config.clone(stream=False)
    results = await crawler.arun_many(urls, config=config)
```
### URL-Specific Configurations
```python
from crawl4ai import CrawlerRunConfig, MatchMode
# Different configs for different URL patterns
pdf_config = CrawlerRunConfig(
    url_matcher="*.pdf",
    # PDF-specific settings
)
blog_config = CrawlerRunConfig(
    url_matcher=["*/blog/*", "*/article/*"],
    match_mode=MatchMode.OR
)
default_config = CrawlerRunConfig()  # Fallback

results = await crawler.arun_many(
    urls=urls,
    config=[pdf_config, blog_config, default_config]
)
```
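Glob-style `url_matcher` patterns like `*.pdf` can be sketched with the stdlib `fnmatch` module. This mirrors the idea of pattern matching with an OR/AND mode; it is an assumption-level sketch, not Crawl4AI's exact matcher.

```python
# Glob-style URL matching with an OR/AND combination mode (illustrative).
from fnmatch import fnmatch

def matches(url, patterns, mode="OR"):
    hits = [fnmatch(url, p) for p in patterns]
    return any(hits) if mode == "OR" else all(hits)

print(matches("https://x.com/report.pdf", ["*.pdf"]))
print(matches("https://x.com/blog/post", ["*/blog/*", "*/article/*"]))
print(matches("https://x.com/about", ["*/blog/*", "*/article/*"]))
```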
For dispatchers and advanced: [arun_many() Reference](complete-sdk-reference.md#arunmany-reference) (lines 1057-1224)
---
## Session Management
<!-- Lines 421-480 -->
### Persistent Sessions
```python
# First crawl - establish session
login_config = CrawlerRunConfig(
    session_id="user_session",
    js_code="""
        document.querySelector('#username').value = 'myuser';
        document.querySelector('#password').value = 'mypass';
        document.querySelector('#submit').click();
    """,
    wait_for="css:.dashboard"
)
await crawler.arun("https://site.com/login", config=login_config)

# Subsequent crawls - reuse session
config = CrawlerRunConfig(session_id="user_session")
await crawler.arun("https://site.com/protected", config=config)

# Clean up
await crawler.crawler_strategy.kill_session("user_session")
```
### Dynamic Content Handling
```python
config = CrawlerRunConfig(
    wait_for="css:.ajax-content",
    js_code="""
        window.scrollTo(0, document.body.scrollHeight);
        document.querySelector('.load-more')?.click();
    """,
    page_timeout=60000
)
```
For session patterns: [Advanced Features - Session Management](complete-sdk-reference.md#advanced-features) (lines 5429-5940)
---
## Best Practices
1. **Use context managers** - `async with AsyncWebCrawler()` ensures cleanup
2. **Enable caching during development** - `cache_mode=CacheMode.ENABLED`
3. **Set appropriate timeouts** - 30s normal, 60s+ for JS-heavy sites
4. **Prefer CSS extraction** over LLM - 10-100x more efficient
5. **Use clone() for config variants** - `config.clone(screenshot=True)`
6. **Respect rate limits** - Use delays between requests
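The `clone()` pattern (point 5) resembles `dataclasses.replace`: copy a base config while overriding a few fields. An illustrative stdlib analogue, not the actual `CrawlerRunConfig` class:

```python
# Stdlib analogue of config.clone(...): copy a base config with overrides.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class RunConfig:
    page_timeout: int = 30000
    screenshot: bool = False

base = RunConfig()
with_shot = replace(base, screenshot=True)  # like base.clone(screenshot=True)
print(with_shot)
```

Keeping configs immutable and deriving variants this way avoids one crawl's tweaks leaking into another's settings.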
---
## See Also
- [CLI Guide](cli-guide.md) - Command-line interface alternative
- [Complete SDK Reference](complete-sdk-reference.md) - Full API documentation
```
### references/complete-sdk-reference.md
```markdown
# Crawl4AI Complete SDK Documentation
**Generated:** 2025-10-19 12:56
**Format:** Ultra-Dense Reference (Optimized for AI Assistants)
**Crawl4AI Version:** 0.7.4
---
## Navigation
- [Installation & Setup](#installation--setup) (lines 22-126)
- [Quick Start](#quick-start) (lines 126-517)
- [Core API](#core-api) (lines 517-1056)
- [Configuration](#configuration) (lines 1612-2330)
- [Crawling Patterns](#crawling-patterns) (lines 2330-3568)
- [Content Processing](#content-processing) (lines 2479-3101)
- [Extraction Strategies](#extraction-strategies) (lines 4528-5436)
- [Advanced Features](#advanced-features) (lines 5436-5932)
---
## Installation & Setup
<!-- Section: lines 22-126 -->
## 1. Basic Installation
```bash
pip install crawl4ai
```
## 2. Initial Setup & Diagnostics
### 2.1 Run the Setup Command
```bash
crawl4ai-setup
```
- Performs OS-level checks (e.g., missing libs on Linux)
- Confirms your environment is ready to crawl
### 2.2 Diagnostics
```bash
crawl4ai-doctor
```
- Check Python version compatibility
- Verify Playwright installation
- Inspect environment variables or library conflicts
If any issues arise, follow its suggestions (e.g., installing additional system packages) and re-run `crawl4ai-setup`.
## 3. Verifying Installation: A Simple Crawl (Skip this step if you have already run `crawl4ai-doctor`)
Below is a minimal Python script demonstrating a **basic** crawl. It uses our new **`BrowserConfig`** and **`CrawlerRunConfig`** for clarity, though no custom settings are passed in this example:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.example.com",
        )
        print(result.markdown[:300])  # Show the first 300 characters of extracted text

if __name__ == "__main__":
    asyncio.run(main())
```
- A headless browser session loads `example.com`
- The script prints the first ~300 characters of the generated markdown.
If errors occur, rerun `crawl4ai-doctor` or manually ensure Playwright is installed correctly.
## 4. Advanced Installation (Optional)
### 4.1 Torch, Transformers, or All
- **Text Clustering (Torch)**
```bash
pip install crawl4ai[torch]
crawl4ai-setup
```
- **Transformers**
```bash
pip install crawl4ai[transformer]
crawl4ai-setup
```
- **All Features**
```bash
pip install crawl4ai[all]
crawl4ai-setup
```
```bash
crawl4ai-download-models
```
## 5. Docker (Experimental)
```bash
docker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic
```
You can then make POST requests to `http://localhost:11235/crawl` to perform crawls. **Production usage** is discouraged until our new Docker approach is ready (planned in Jan or Feb 2025).
## 6. Local Server Mode (Legacy)
## Summary
1. **Install** with `pip install crawl4ai` and run `crawl4ai-setup`.
2. **Diagnose** with `crawl4ai-doctor` if you see errors.
3. **Verify** by crawling `example.com` with minimal `BrowserConfig` + `CrawlerRunConfig`.
## Quick Start
<!-- Section: lines 126-517 -->
## Getting Started with Crawl4AI
1. Run your **first crawl** using minimal configuration.
2. Experiment with a simple **CSS-based extraction** strategy.
3. Crawl a **dynamic** page that loads content via JavaScript.
## 1. Introduction
- An asynchronous crawler, **`AsyncWebCrawler`**.
- Configurable browser and run settings via **`BrowserConfig`** and **`CrawlerRunConfig`**.
- Automatic HTML-to-Markdown conversion via **`DefaultMarkdownGenerator`** (supports optional filters).
- Multiple extraction strategies (LLM-based or “traditional” CSS/XPath-based).
## 2. Your First Crawl
Here’s a minimal Python script that creates an **`AsyncWebCrawler`**, fetches a webpage, and prints the first 300 characters of its Markdown output:
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown[:300])  # Print first 300 chars

if __name__ == "__main__":
    asyncio.run(main())
```
- **`AsyncWebCrawler`** launches a headless browser (Chromium by default).
- It fetches `https://example.com`.
- Crawl4AI automatically converts the HTML into Markdown.
## 3. Basic Configuration (Light Introduction)
Crawl4AI's configuration centers on two classes:
1. **`BrowserConfig`**: Controls browser behavior (headless or full UI, user agent, JavaScript toggles, etc.).
2. **`CrawlerRunConfig`**: Controls how each crawl runs (caching, extraction, timeouts, hooking, etc.).
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    browser_conf = BrowserConfig(headless=True)  # or False to see the browser
    run_conf = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS
    )
    async with AsyncWebCrawler(config=browser_conf) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_conf
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```
> IMPORTANT: By default cache mode is set to `CacheMode.BYPASS` to have fresh content. Set `CacheMode.ENABLED` to enable caching.
## 4. Generating Markdown Output
- **`result.markdown`**:
  The Markdown generated from the crawled HTML (its `raw_markdown` attribute holds the unfiltered text).
- **`result.markdown.fit_markdown`**:
  The same content after applying any configured **content filter** (e.g., `PruningContentFilter`).
### Example: Using a Filter with `DefaultMarkdownGenerator`
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

md_generator = DefaultMarkdownGenerator(
    content_filter=PruningContentFilter(threshold=0.4, threshold_type="fixed")
)
config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    markdown_generator=md_generator
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://news.ycombinator.com", config=config)
    print("Raw Markdown length:", len(result.markdown.raw_markdown))
    print("Fit Markdown length:", len(result.markdown.fit_markdown))
```
**Note**: If you do **not** specify a content filter or markdown generator, you’ll typically see only the raw Markdown. `PruningContentFilter` may add around 50 ms of processing time. We’ll dive deeper into these strategies in a dedicated **Markdown Generation** tutorial.
## 5. Simple Data Extraction (CSS-based)
Generate an extraction schema once (optionally with an LLM), then reuse it for fast, LLM-free extraction:
```python
from crawl4ai import JsonCssExtractionStrategy
from crawl4ai import LLMConfig
# Generate a schema (one-time cost)
html = "<div class='product'><h2>Gaming Laptop</h2><span class='price'>$999.99</span></div>"
# Using OpenAI (requires API token)
schema = JsonCssExtractionStrategy.generate_schema(
    html,
    llm_config=LLMConfig(provider="openai/gpt-4o", api_token="your-openai-token")  # Required for OpenAI
)

# Or using Ollama (open source, no token needed)
schema = JsonCssExtractionStrategy.generate_schema(
    html,
    llm_config=LLMConfig(provider="ollama/llama3.3", api_token=None)  # Not needed for Ollama
)
# Use the schema for fast, repeated extractions
strategy = JsonCssExtractionStrategy(schema)
```
```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy

async def main():
    schema = {
        "name": "Example Items",
        "baseSelector": "div.item",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }
    raw_html = "<div class='item'><h2>Item 1</h2><a href='https://example.com/item1'>Link 1</a></div>"
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="raw://" + raw_html,
            config=CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS,
                extraction_strategy=JsonCssExtractionStrategy(schema)
            )
        )
        # The JSON output is stored in 'extracted_content'
        data = json.loads(result.extracted_content)
        print(data)

if __name__ == "__main__":
    asyncio.run(main())
```
- Great for repetitive page structures (e.g., item listings, articles).
- No AI usage or costs.
- The crawler returns a JSON string you can parse or store.
> Tips: You can pass raw HTML to the crawler instead of a URL. To do so, prefix the HTML with `raw://`.
## 6. Simple Data Extraction (LLM-based)
`LLMExtractionStrategy` works with any supported provider:
- **Open-Source Models** (e.g., `ollama/llama3.3`, no API token needed)
- **OpenAI Models** (e.g., `openai/gpt-4`, requires `api_token`)
- Or any provider supported by the underlying library
```python
import os
import json
import asyncio
from typing import Dict
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai import LLMExtractionStrategy

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(
        ..., description="Fee for output token for the OpenAI model."
    )

async def extract_structured_data_using_llm(
    provider: str, api_token: str = None, extra_headers: Dict[str, str] = None
):
    print(f"\n--- Extracting Structured Data with {provider} ---")
    if api_token is None and provider != "ollama":
        print(f"API token is required for {provider}. Skipping this example.")
        return
    browser_config = BrowserConfig(headless=True)
    extra_args = {"temperature": 0, "top_p": 0.9, "max_tokens": 2000}
    if extra_headers:
        extra_args["extra_headers"] = extra_headers
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        word_count_threshold=1,
        page_timeout=80000,
        extraction_strategy=LLMExtractionStrategy(
            llm_config=LLMConfig(provider=provider, api_token=api_token),
            schema=OpenAIModelFee.model_json_schema(),
            extraction_type="schema",
            instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
Do not miss any models in the entire content.""",
            extra_args=extra_args,
        ),
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://openai.com/api/pricing/", config=crawler_config
        )
        print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(
        extract_structured_data_using_llm(
            provider="openai/gpt-4o", api_token=os.getenv("OPENAI_API_KEY")
        )
    )
```
- We define a Pydantic schema (`OpenAIModelFee`) describing the fields we want.
## 7. Adaptive Crawling (New!)
```python
import asyncio
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler

async def adaptive_example():
    async with AsyncWebCrawler() as crawler:
        adaptive = AdaptiveCrawler(crawler)
        # Start adaptive crawling
        result = await adaptive.digest(
            start_url="https://docs.python.org/3/",
            query="async context managers"
        )
        # View results
        adaptive.print_stats()
        print(f"Crawled {len(result.crawled_urls)} pages")
        print(f"Achieved {adaptive.confidence:.0%} confidence")

if __name__ == "__main__":
    asyncio.run(adaptive_example())
```
- **Automatic stopping**: Stops when sufficient information is gathered
- **Intelligent link selection**: Follows only relevant links
- **Confidence scoring**: Know how complete your information is
## 8. Multi-URL Concurrency (Preview)
If you need to crawl multiple URLs in **parallel**, you can use `arun_many()`. By default, Crawl4AI employs a **MemoryAdaptiveDispatcher**, automatically adjusting concurrency based on system resources. Here’s a quick glimpse:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def quick_parallel_example():
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3"
    ]
    run_conf = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        stream=True  # Enable streaming mode
    )
    async with AsyncWebCrawler() as crawler:
        # Stream results as they complete
        async for result in await crawler.arun_many(urls, config=run_conf):
            if result.success:
                print(f"[OK] {result.url}, length: {len(result.markdown.raw_markdown)}")
            else:
                print(f"[ERROR] {result.url} => {result.error_message}")

        # Or get all results at once (default behavior)
        run_conf = run_conf.clone(stream=False)
        results = await crawler.arun_many(urls, config=run_conf)
        for res in results:
            if res.success:
                print(f"[OK] {res.url}, length: {len(res.markdown.raw_markdown)}")
            else:
                print(f"[ERROR] {res.url} => {res.error_message}")

if __name__ == "__main__":
    asyncio.run(quick_parallel_example())
```
1. **Streaming mode** (`stream=True`): Process results as they become available using `async for`
2. **Batch mode** (`stream=False`): Wait for all results to complete
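The streaming/batch distinction mirrors standard asyncio patterns. Here is an illustrative pure-asyncio sketch of the two modes (the `fetch` coroutine is a stand-in for a real crawl; this is not Crawl4AI's implementation):

```python
import asyncio

async def fetch(url: str) -> str:
    # Stand-in for a real crawl; completes immediately here.
    await asyncio.sleep(0)
    return f"content of {url}"

async def main():
    urls = ["https://a.example", "https://b.example"]

    # "Streaming": handle each result as soon as its task finishes.
    streamed = []
    for task in asyncio.as_completed([fetch(u) for u in urls]):
        streamed.append(await task)

    # "Batch": wait for every task, then handle the full list at once.
    batch = await asyncio.gather(*(fetch(u) for u in urls))
    return streamed, batch

streamed, batch = asyncio.run(main())
print(len(streamed), len(batch))  # → 2 2
```

Streaming keeps memory flat and lets you act on fast pages immediately; batch mode is simpler when you need the complete result set anyway.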
## 9. Dynamic Content Example
Some sites require clicking through tabs or other JavaScript-driven interactions before their content appears. Below is an example that clicks each tab on a course page, waits for its panel to render, and then extracts structured data, using **`BrowserConfig`** and **`CrawlerRunConfig`**:
```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy

async def extract_structured_data_using_css_extractor():
    print("\n--- Using JsonCssExtractionStrategy for Fast Structured Output ---")
    schema = {
        "name": "KidoCode Courses",
        "baseSelector": "section.charge-methodology .w-tab-content > div",
        "fields": [
            {
                "name": "section_title",
                "selector": "h3.heading-50",
                "type": "text",
            },
            {
                "name": "section_description",
                "selector": ".charge-content",
                "type": "text",
            },
            {
                "name": "course_name",
                "selector": ".text-block-93",
                "type": "text",
            },
            {
                "name": "course_description",
                "selector": ".course-content-text",
                "type": "text",
            },
            {
                "name": "course_icon",
                "selector": ".image-92",
                "type": "attribute",
                "attribute": "src",
            },
        ],
    }

    browser_config = BrowserConfig(headless=True, java_script_enabled=True)

    js_click_tabs = """
    (async () => {
        const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");
        for(let tab of tabs) {
            tab.scrollIntoView();
            tab.click();
            await new Promise(r => setTimeout(r, 500));
        }
    })();
    """

    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(schema),
        js_code=[js_click_tabs],
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://www.kidocode.com/degrees/technology", config=crawler_config
        )
        courses = json.loads(result.extracted_content)
        print(f"Successfully extracted {len(courses)} course entries")
        print(json.dumps(courses[0], indent=2))

async def main():
    await extract_structured_data_using_css_extractor()

if __name__ == "__main__":
    asyncio.run(main())
```
- **`BrowserConfig(headless=True, java_script_enabled=True)`**: runs the browser headless with JavaScript enabled.
- **`CrawlerRunConfig(...)`**: specifies the CSS extraction strategy and bypasses the cache so fresh content is fetched.
- **`js_code`** clicks through every tab, pausing 500 ms after each click so the newly revealed panel can render.
- The extracted JSON lands in `result.extracted_content`, ready to parse with `json.loads()`.
## 10. Next Steps
So far, you have:
1. Performed a basic crawl and printed Markdown.
2. Used **content filters** with a markdown generator.
3. Extracted JSON via **CSS** or **LLM** strategies.
4. Handled **dynamic** pages with JavaScript triggers.
## Core API
<!-- Section: lines 517-1056 -->
## AsyncWebCrawler
The **`AsyncWebCrawler`** is the core class for asynchronous web crawling in Crawl4AI. You typically create it **once**, optionally customize it with a **`BrowserConfig`** (e.g., headless, user agent), then **run** multiple **`arun()`** calls with different **`CrawlerRunConfig`** objects.
1. **Create** a `BrowserConfig` for global browser settings.
2. **Instantiate** `AsyncWebCrawler(config=browser_config)`.
3. **Use** the crawler in an async context manager (`async with`) or manage start/close manually.
4. **Call** `arun(url, config=crawler_run_config)` for each page you want.
## 1. Constructor Overview
```python
class AsyncWebCrawler:
    def __init__(
        self,
        crawler_strategy: Optional[AsyncCrawlerStrategy] = None,
        config: Optional[BrowserConfig] = None,
        always_bypass_cache: bool = False,            # deprecated
        always_by_pass_cache: Optional[bool] = None,  # also deprecated
        base_directory: str = ...,
        thread_safe: bool = False,
        **kwargs,
    ):
        """
        Create an AsyncWebCrawler instance.

        Args:
            crawler_strategy:
                (Advanced) Provide a custom crawler strategy if needed.
            config:
                A BrowserConfig object specifying how the browser is set up.
            always_bypass_cache:
                (Deprecated) Use CrawlerRunConfig.cache_mode instead.
            base_directory:
                Folder for storing caches/logs (if relevant).
            thread_safe:
                If True, attempts some concurrency safeguards. Usually False.
            **kwargs:
                Additional legacy or debugging parameters.
        """
```

### Typical Initialization
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig
browser_cfg = BrowserConfig(
    browser_type="chromium",
    headless=True,
    verbose=True
)

crawler = AsyncWebCrawler(config=browser_cfg)
```
**Notes**:
- **Legacy** parameters like `always_bypass_cache` remain for backward compatibility, but prefer to set **caching** in `CrawlerRunConfig`.
---
## 2. Lifecycle: Start/Close or Context Manager
### 2.1 Context Manager (Recommended)
```python
async with AsyncWebCrawler(config=browser_cfg) as crawler:
    result = await crawler.arun("https://example.com")
    # The crawler automatically starts/closes resources
```
When the `async with` block ends, the crawler cleans up (closes the browser, etc.).
### 2.2 Manual Start & Close
```python
crawler = AsyncWebCrawler(config=browser_cfg)
await crawler.start()

result1 = await crawler.arun("https://example.com")
result2 = await crawler.arun("https://another.com")

await crawler.close()
```
Use this style if you have a **long-running** application or need full control of the crawler’s lifecycle.
---
## 3. Primary Method: `arun()`
```python
async def arun(
    url: str,
    config: Optional[CrawlerRunConfig] = None,
    # Legacy parameters for backward compatibility...
):
    ...
```
### 3.1 New Approach
You pass a `CrawlerRunConfig` object that sets up everything about a crawl—content filtering, caching, session reuse, JS code, screenshots, etc.
```python
import asyncio
from crawl4ai import CrawlerRunConfig, CacheMode

run_cfg = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    css_selector="main.article",
    word_count_threshold=10,
    screenshot=True
)

async with AsyncWebCrawler(config=browser_cfg) as crawler:
    result = await crawler.arun("https://example.com/news", config=run_cfg)
```
### 3.2 Legacy Parameters Still Accepted
For **backward** compatibility, `arun()` can still accept direct arguments like `css_selector=...`, `word_count_threshold=...`, etc., but we strongly advise migrating them into a **`CrawlerRunConfig`**.
---
## 4. Batch Processing: `arun_many()`
```python
async def arun_many(
    urls: List[str],
    config: Optional[CrawlerRunConfig] = None,
    # Legacy parameters maintained for backwards compatibility...
):
    ...
```
### 4.1 Resource-Aware Crawling
The `arun_many()` method now uses an intelligent dispatcher that:
- Monitors system memory usage
- Implements adaptive rate limiting
- Provides detailed progress monitoring
- Manages concurrent crawls efficiently
### 4.2 Example Usage
Check page [Multi-url Crawling](../advanced/multi-url-crawling.md) for a detailed example of how to use `arun_many()`.
### 4.3 Key Features

1. **Rate Limiting**
   - Automatic delay between requests
   - Exponential backoff on rate limit detection
   - Domain-specific rate limiting
   - Configurable retry strategy
2. **Resource Monitoring**
   - Memory usage tracking
   - Adaptive concurrency based on system load
   - Automatic pausing when resources are constrained
3. **Progress Monitoring**
   - Detailed or aggregated progress display
   - Real-time status updates
   - Memory usage statistics
4. **Error Handling**
   - Graceful handling of rate limits
   - Automatic retries with backoff
   - Detailed error reporting
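Exponential backoff of the kind described above can be sketched in plain Python (an illustrative helper, not Crawl4AI's dispatcher code):

```python
import random

def backoff_delays(base: float = 1.0, factor: float = 2.0, retries: int = 4,
                   jitter: float = 0.0) -> list[float]:
    """Return the wait times (seconds) before each retry: base * factor**attempt."""
    delays = []
    for attempt in range(retries):
        delay = base * (factor ** attempt)
        delay += random.uniform(0, jitter)  # optional jitter avoids synchronized retries
        delays.append(delay)
    return delays

print(backoff_delays())  # → [1.0, 2.0, 4.0, 8.0]
```

Each retry waits roughly twice as long as the previous one, which is why a crawler under rate limiting backs off quickly without hammering the server.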
## 5. `CrawlResult` Output
Each `arun()` returns a **`CrawlResult`** containing:
- `url`: Final URL (if redirected).
- `html`: Original HTML.
- `cleaned_html`: Sanitized HTML.
- `markdown_v2`: Deprecated; use `markdown` instead.
- `extracted_content`: If an extraction strategy was used (JSON for CSS/LLM strategies).
- `screenshot`, `pdf`: If screenshots/PDF requested.
- `media`, `links`: Information about discovered images/links.
- `success`, `error_message`: Status info.
## 6. Quick Example
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy
import json
async def main():
    # 1. Browser config
    browser_cfg = BrowserConfig(
        browser_type="firefox",
        headless=False,
        verbose=True
    )

    # 2. Run config
    schema = {
        "name": "Articles",
        "baseSelector": "article.post",
        "fields": [
            {
                "name": "title",
                "selector": "h2",
                "type": "text"
            },
            {
                "name": "url",
                "selector": "a",
                "type": "attribute",
                "attribute": "href"
            }
        ]
    }
    run_cfg = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(schema),
        word_count_threshold=15,
        remove_overlay_elements=True,
        wait_for="css:.post"  # Wait for posts to appear
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url="https://example.com/blog",
            config=run_cfg
        )
        if result.success:
            print("Cleaned HTML length:", len(result.cleaned_html))
            if result.extracted_content:
                articles = json.loads(result.extracted_content)
                print("Extracted articles:", articles[:2])
        else:
            print("Error:", result.error_message)

asyncio.run(main())
```
- We define a **`BrowserConfig`** with Firefox, headless disabled (so you can watch the browser), and `verbose=True`.
- We define a **`CrawlerRunConfig`** that **bypasses cache**, uses a **CSS** extraction schema, has a `word_count_threshold=15`, etc.
- We pass them to `AsyncWebCrawler(config=...)` and `arun(url=..., config=...)`.
## 7. Best Practices & Migration Notes
1. **Use** `BrowserConfig` for **global** settings about the browser’s environment.
2. **Use** `CrawlerRunConfig` for **per-crawl** logic (caching, content filtering, extraction strategies, wait conditions).
3. **Avoid** legacy parameters like `css_selector` or `word_count_threshold` directly in `arun()`. Instead:
```python
run_cfg = CrawlerRunConfig(css_selector=".main-content", word_count_threshold=20)
result = await crawler.arun(url="...", config=run_cfg)
```
## 8. Summary
- **Constructor** accepts **`BrowserConfig`** (or defaults).
- **`arun(url, config=CrawlerRunConfig)`** is the main method for single-page crawls.
- **`arun_many(urls, config=CrawlerRunConfig)`** handles concurrency across multiple URLs.
- For advanced lifecycle control, use `start()` and `close()` explicitly.
- If you used `AsyncWebCrawler(browser_type="chromium", css_selector="...")`, move browser settings to `BrowserConfig(...)` and content/crawl logic to `CrawlerRunConfig(...)`.
## `arun()` Parameter Guide (New Approach)
In Crawl4AI’s **latest** configuration model, nearly all parameters that once went directly to `arun()` are now part of **`CrawlerRunConfig`**. When calling `arun()`, you provide:
```python
await crawler.arun(
    url="https://example.com",
    config=my_run_config
)
```
Below is an organized look at the parameters that can go inside `CrawlerRunConfig`, divided by their functional areas. For **Browser** settings (e.g., `headless`, `browser_type`), see [BrowserConfig](./parameters.md).
## 1. Core Usage
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    run_config = CrawlerRunConfig(
        verbose=True,                  # Detailed logging
        cache_mode=CacheMode.ENABLED,  # Use normal read/write cache
        check_robots_txt=True,         # Respect robots.txt rules
        # ... other parameters
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_config
        )
        # Check if blocked by robots.txt
        if not result.success and result.status_code == 403:
            print(f"Error: {result.error_message}")
```
- `verbose=True` logs each crawl step.
- `cache_mode` decides how to read/write the local crawl cache.
## 2. Cache Control
**`cache_mode`** (default: `CacheMode.ENABLED`)
Use a built-in enum from `CacheMode`:
- `ENABLED`: Normal caching—reads if available, writes if missing.
- `DISABLED`: No caching—always refetch pages.
- `READ_ONLY`: Reads from cache only; no new writes.
- `WRITE_ONLY`: Writes to cache but doesn’t read existing data.
- `BYPASS`: Skips reading cache for this crawl (though it might still write if set up that way).
```python
run_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS
)
```
For backward compatibility, boolean shortcuts map onto these modes:

- `bypass_cache=True` acts like `CacheMode.BYPASS`.
- `disable_cache=True` acts like `CacheMode.DISABLED`.
- `no_cache_read=True` acts like `CacheMode.WRITE_ONLY`.
- `no_cache_write=True` acts like `CacheMode.READ_ONLY`.
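The read/write behavior of each mode can be summarized as a plain-Python decision table (an illustrative sketch, not the library's implementation):

```python
from enum import Enum

class CacheMode(Enum):
    ENABLED = "enabled"
    DISABLED = "disabled"
    READ_ONLY = "read_only"
    WRITE_ONLY = "write_only"
    BYPASS = "bypass"

# (reads_cache, writes_cache) for each mode
CACHE_BEHAVIOR = {
    CacheMode.ENABLED: (True, True),
    CacheMode.DISABLED: (False, False),
    CacheMode.READ_ONLY: (True, False),
    CacheMode.WRITE_ONLY: (False, True),
    CacheMode.BYPASS: (False, True),  # skips reads for this crawl; may still write
}

reads, writes = CACHE_BEHAVIOR[CacheMode.READ_ONLY]
print(reads, writes)  # → True False
```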
## 3. Content Processing & Selection
### 3.1 Text Processing
```python
run_config = CrawlerRunConfig(
    word_count_threshold=10,    # Ignore text blocks <10 words
    only_text=False,            # If True, tries to remove non-text elements
    keep_data_attributes=False  # Keep or discard data-* attributes
)
```
### 3.2 Content Selection
```python
run_config = CrawlerRunConfig(
    css_selector=".main-content",   # Focus on .main-content region only
    excluded_tags=["form", "nav"],  # Remove entire tag blocks
    remove_forms=True,              # Specifically strip <form> elements
    remove_overlay_elements=True,   # Attempt to remove modals/popups
)
```
### 3.3 Link Handling
```python
run_config = CrawlerRunConfig(
    exclude_external_links=True,          # Remove external links from final content
    exclude_social_media_links=True,      # Remove links to known social sites
    exclude_domains=["ads.example.com"],  # Exclude links to these domains
    exclude_social_media_domains=["facebook.com", "twitter.com"],  # Extend the default list
)
```
### 3.4 Media Filtering
```python
run_config = CrawlerRunConfig(
    exclude_external_images=True  # Strip images from other domains
)
```
## 4. Page Navigation & Timing
### 4.1 Basic Browser Flow
```python
run_config = CrawlerRunConfig(
    wait_for="css:.dynamic-content",  # Wait for .dynamic-content
    delay_before_return_html=2.0,     # Wait 2s before capturing final HTML
    page_timeout=60000,               # Navigation & script timeout (ms)
)
```
- `wait_for` accepts either:
  - `"css:selector"`, or
  - `"js:() => boolean"`, e.g. `js:() => document.querySelectorAll('.item').length > 10`.
- `mean_delay` & `max_range`: define random delays for `arun_many()` calls.
- `semaphore_count`: concurrency limit when crawling multiple URLs.
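Conceptually, a `js:` condition is re-evaluated until it returns true or the timeout elapses. A plain-Python sketch of that polling semantics (illustrative only; the real check runs inside the browser page):

```python
import time

def wait_for_condition(predicate, timeout_s: float = 5.0, poll_s: float = 0.05) -> bool:
    """Poll `predicate` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(poll_s)
    return False

# Simulate content that "loads" after a few polls,
# analogous to js:() => document.querySelectorAll('.item').length > 10
state = {"items": 0}
def loaded():
    state["items"] += 3
    return state["items"] > 10

print(wait_for_condition(loaded))  # → True
```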
### 4.2 JavaScript Execution
```python
run_config = CrawlerRunConfig(
    js_code=[
        "window.scrollTo(0, document.body.scrollHeight);",
        "document.querySelector('.load-more')?.click();"
    ],
    js_only=False
)
```
- `js_code` can be a single string or a list of strings.
- `js_only=True` means “I’m continuing in the same session with new JS steps, no new full navigation.”
### 4.3 Anti-Bot
```python
run_config = CrawlerRunConfig(
    magic=True,
    simulate_user=True,
    override_navigator=True
)
```
- `magic=True` tries multiple stealth features.
- `simulate_user=True` mimics mouse movements or random delays.
- `override_navigator=True` fakes some navigator properties (like user agent checks).
## 5. Session Management
**`session_id`**:
```python
run_config = CrawlerRunConfig(
    session_id="my_session123"
)
```
If re-used in subsequent `arun()` calls, the same tab/page context is continued (helpful for multi-step tasks or stateful browsing).
## 6. Screenshot, PDF & Media Options
```python
run_config = CrawlerRunConfig(
    screenshot=True,          # Grab a screenshot as base64
    screenshot_wait_for=1.0,  # Wait 1s before capturing
    pdf=True,                 # Also produce a PDF
    image_description_min_word_threshold=5,  # If analyzing alt text
    image_score_threshold=3,  # Filter out low-score images
)
```
- `result.screenshot` → Base64 screenshot string.
- `result.pdf` → Byte array with PDF data.
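Persisting those outputs takes one line each. A minimal sketch using stand-in values (a real run would take them from the `CrawlResult` returned by `arun()`):

```python
import base64
from pathlib import Path
from tempfile import mkdtemp

# Stand-ins for result.screenshot (a base64 string) and result.pdf (raw bytes).
screenshot_b64 = base64.b64encode(b"\x89PNG fake image bytes").decode("ascii")
pdf_bytes = b"%PDF-1.4 fake pdf bytes"

out_dir = Path(mkdtemp())
(out_dir / "page.png").write_bytes(base64.b64decode(screenshot_b64))  # decode base64 → bytes
(out_dir / "page.pdf").write_bytes(pdf_bytes)                         # PDF is already bytes

print(sorted(p.name for p in out_dir.iterdir()))  # → ['page.pdf', 'page.png']
```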
## 7. Extraction Strategy
**For advanced data extraction** (CSS/LLM-based), set `extraction_strategy`:
```python
run_config = CrawlerRunConfig(
    extraction_strategy=my_css_or_llm_strategy
)
```
The extracted data will appear in `result.extracted_content`.
## 8. Comprehensive Example
Below is a snippet combining many parameters:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy

async def main():
    # Example schema
    schema = {
        "name": "Articles",
        "baseSelector": "article.post",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }

    run_config = CrawlerRunConfig(
        # Core
        verbose=True,
        cache_mode=CacheMode.ENABLED,
        check_robots_txt=True,  # Respect robots.txt rules

        # Content
        word_count_threshold=10,
        css_selector="main.content",
        excluded_tags=["nav", "footer"],
        exclude_external_links=True,

        # Page & JS
        js_code="document.querySelector('.show-more')?.click();",
        wait_for="css:.loaded-block",
        page_timeout=30000,

        # Extraction
        extraction_strategy=JsonCssExtractionStrategy(schema),

        # Session
        session_id="persistent_session",

        # Media
        screenshot=True,
        pdf=True,

        # Anti-bot
        simulate_user=True,
        magic=True,
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/posts", config=run_config)
        if result.success:
            print("HTML length:", len(result.cleaned_html))
            print("Extraction JSON:", result.extracted_content)
            if result.screenshot:
                print("Screenshot length:", len(result.screenshot))
            if result.pdf:
                print("PDF bytes length:", len(result.pdf))
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
This example demonstrates:
1. **Crawling** the main content region, ignoring external links.
2. Running **JavaScript** to click “.show-more”.
3. **Waiting** for “.loaded-block” to appear.
4. Generating a **screenshot** & **PDF** of the final page.
## 9. Best Practices
1. **Use `BrowserConfig` for global browser** settings (headless, user agent).
2. **Use `CrawlerRunConfig`** to handle the **specific** crawl needs: content filtering, caching, JS, screenshot, extraction, etc.
4. **Limit** large concurrency (`semaphore_count`) if the site or your system can’t handle it.
5. For dynamic pages, set `js_code` or `scan_full_page` so you load all content.
## 10. Conclusion
All parameters that used to be direct arguments to `arun()` now belong in **`CrawlerRunConfig`**. This approach:
- Makes code **clearer** and **more maintainable**.
## `arun_many(...)` Reference
<!-- Section: lines 1057-1224 -->
> **Note**: This function is very similar to [`arun()`](./arun.md) but focused on **concurrent** or **batch** crawling. If you’re unfamiliar with `arun()` usage, please read that doc first, then review this for differences.
## Function Signature
```python
async def arun_many(
    urls: Union[List[str], List[Any]],
    config: Optional[Union[CrawlerRunConfig, List[CrawlerRunConfig]]] = None,
    dispatcher: Optional[BaseDispatcher] = None,
    ...
) -> Union[List[CrawlResult], AsyncGenerator[CrawlResult, None]]:
    """
    Crawl multiple URLs concurrently or in batches.

    :param urls: A list of URLs (or tasks) to crawl.
    :param config: (Optional) Either:
        - A single `CrawlerRunConfig` applying to all URLs
        - A list of `CrawlerRunConfig` objects with url_matcher patterns
    :param dispatcher: (Optional) A concurrency controller (e.g. MemoryAdaptiveDispatcher).
    ...
    :return: Either a list of `CrawlResult` objects, or an async generator if streaming is enabled.
    """
```
## Differences from `arun()`
1. **Multiple URLs**:
   - Instead of crawling a single URL, you pass a list of them (strings or tasks).
   - The function returns either a **list** of `CrawlResult` or an **async generator** if streaming is enabled.
2. **Concurrency & Dispatchers**:
   - **`dispatcher`** param allows advanced concurrency control.
   - If omitted, a default dispatcher (like `MemoryAdaptiveDispatcher`) is used internally.
3. **Streaming Support**:
   - Enable streaming by setting `stream=True` in your `CrawlerRunConfig`.
   - When streaming, use `async for` to process results as they become available.
4. **Parallel Execution**:
   - `arun_many()` can run multiple requests concurrently under the hood.
   - Each `CrawlResult` might also include a **`dispatch_result`** with concurrency details (like memory usage, start/end times).
### Basic Example (Batch Mode)
```python
# Minimal usage: The default dispatcher will be used
results = await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com"],
    config=CrawlerRunConfig(stream=False)  # Default behavior
)
for res in results:
    if res.success:
        print(res.url, "crawled OK!")
    else:
        print("Failed:", res.url, "-", res.error_message)
```
### Streaming Example
```python
config = CrawlerRunConfig(
    stream=True,  # Enable streaming mode
    cache_mode=CacheMode.BYPASS
)

# Process results as they complete
async for result in await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com", "https://site3.com"],
    config=config
):
    if result.success:
        print(f"Just completed: {result.url}")
        # Process each result immediately
        process_result(result)
```
### With a Custom Dispatcher
```python
dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=70.0,
    max_session_permit=10
)
results = await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com", "https://site3.com"],
    config=my_run_config,
    dispatcher=dispatcher
)
```
### URL-Specific Configurations
Instead of using one config for all URLs, provide a list of configs with `url_matcher` patterns:
```python
from crawl4ai import CrawlerRunConfig, MatchMode
from crawl4ai.processors.pdf import PDFContentScrapingStrategy
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
# PDF files - specialized extraction
pdf_config = CrawlerRunConfig(
    url_matcher="*.pdf",
    scraping_strategy=PDFContentScrapingStrategy()
)

# Blog/article pages - content filtering
blog_config = CrawlerRunConfig(
    url_matcher=["*/blog/*", "*/article/*", "*python.org*"],
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.48)
    )
)

# Dynamic pages - JavaScript execution
github_config = CrawlerRunConfig(
    url_matcher=lambda url: 'github.com' in url,
    js_code="window.scrollTo(0, 500);"
)

# API endpoints - JSON extraction
api_config = CrawlerRunConfig(
    url_matcher=lambda url: 'api' in url or url.endswith('.json'),
    # Custom settings for JSON extraction
)

# Default fallback config
default_config = CrawlerRunConfig()  # No url_matcher means it never matches except as fallback

# Pass the list of configs - first match wins!
results = await crawler.arun_many(
    urls=[
        "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",  # → pdf_config
        "https://blog.python.org/",                 # → blog_config
        "https://github.com/microsoft/playwright",  # → github_config
        "https://httpbin.org/json",                 # → api_config
        "https://example.com/"                      # → default_config
    ],
    config=[pdf_config, blog_config, github_config, api_config, default_config]
)
```
- **String patterns**: `"*.pdf"`, `"*/blog/*"`, `"*python.org*"`
- **Function matchers**: `lambda url: 'api' in url`
- **Mixed patterns**: Combine strings and functions with `MatchMode.OR` or `MatchMode.AND`
- **First match wins**: Configs are evaluated in order
- `dispatch_result` in each `CrawlResult` (if using concurrency) can hold memory and timing info.
- **Important**: Always include a default config (without `url_matcher`) as the last item if you want to handle all URLs. Otherwise, unmatched URLs will fail.
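The first-match resolution described above can be sketched in plain Python using `fnmatch`-style glob patterns (an illustrative helper, not Crawl4AI's internal matcher):

```python
from fnmatch import fnmatch

def pick_config(url: str, configs: list[dict]) -> dict:
    """Return the first config whose matcher accepts `url`.
    A config with no matcher acts as a catch-all when reached (so put it last)."""
    for cfg in configs:
        matcher = cfg.get("url_matcher")
        if matcher is None:
            return cfg  # fallback: matches everything once evaluation reaches it
        if callable(matcher) and matcher(url):
            return cfg
        if isinstance(matcher, str) and fnmatch(url, matcher):
            return cfg
    raise LookupError(f"no config matched {url}")

configs = [
    {"name": "pdf", "url_matcher": "*.pdf"},
    {"name": "api", "url_matcher": lambda u: "api" in u or u.endswith(".json")},
    {"name": "default", "url_matcher": None},
]
print(pick_config("https://example.com/report.pdf", configs)["name"])  # → pdf
print(pick_config("https://example.com/", configs)["name"])            # → default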
### Return Value
Either a **list** of [`CrawlResult`](./crawl-result.md) objects, or an **async generator** if streaming is enabled. You can iterate to check `result.success` or read each item’s `extracted_content`, `markdown`, or `dispatch_result`.
## Dispatcher Reference
- **`MemoryAdaptiveDispatcher`**: Dynamically manages concurrency based on system memory usage.
- **`SemaphoreDispatcher`**: Fixed concurrency limit, simpler but less adaptive.
## Common Pitfalls
3. **Error Handling**: Each `CrawlResult` might fail for different reasons—always check `result.success` or the `error_message` before proceeding.
## Conclusion
Use `arun_many()` when you want to **crawl multiple URLs** simultaneously or in controlled parallel tasks. If you need advanced concurrency features (like memory-based adaptive throttling or complex rate-limiting), provide a **dispatcher**. Each result is a standard `CrawlResult`, possibly augmented with concurrency stats (`dispatch_result`) for deeper inspection. For more details on concurrency logic and dispatchers, see the [Advanced Multi-URL Crawling](../advanced/multi-url-crawling.md) docs.
## `CrawlResult` Reference
The **`CrawlResult`** class encapsulates everything returned after a single crawl operation. It provides the **raw or processed content**, details on links and media, plus optional metadata (like screenshots, PDFs, or extracted JSON).
**Location**: `crawl4ai/crawler/models.py` (for reference)
```python
class CrawlResult(BaseModel):
url: str
html: str
success: bool
cleaned_html: Optional[str] = None
fit_html: Optional[str] = None # Preprocessed HTML optimized for extraction
media: Dict[str, List[Dict]] = {}
links: Dict[str, List[Dict]] = {}
downloaded_files: Optional[List[str]] = None
screenshot: Optional[str] = None
pdf: Optional[bytes] = None
mhtml: Optional[str] = None
markdown: Optional[Union[str, MarkdownGenerationResult]] = None
extracted_content: Optional[str] = None
metadata: Optional[dict] = None
error_message: Optional[str] = None
session_id: Optional[str] = None
response_headers: Optional[dict] = None
status_code: Optional[int] = None
ssl_certificate: Optional[SSLCertificate] = None
dispatch_result: Optional[DispatchResult] = None
...
```
## 1. Basic Crawl Info
### 1.1 **`url`** *(str)*
```python
print(result.url) # e.g., "https://example.com/"
```
### 1.2 **`success`** *(bool)*
**What**: `True` if the crawl pipeline ended without major errors; `False` otherwise.
```python
if not result.success:
print(f"Crawl failed: {result.error_message}")
```
### 1.3 **`status_code`** *(Optional[int])*
```python
if result.status_code == 404:
print("Page not found!")
```
### 1.4 **`error_message`** *(Optional[str])*
**What**: If `success=False`, a textual description of the failure.
```python
if not result.success:
print("Error:", result.error_message)
```
### 1.5 **`session_id`** *(Optional[str])*
```python
# If you used session_id="login_session" in CrawlerRunConfig, see it here:
print("Session:", result.session_id)
```
### 1.6 **`response_headers`** *(Optional[dict])*
```python
if result.response_headers:
print("Server:", result.response_headers.get("Server", "Unknown"))
```
### 1.7 **`ssl_certificate`** *(Optional[SSLCertificate])*
**What**: If `fetch_ssl_certificate=True` in your CrawlerRunConfig, **`result.ssl_certificate`** contains a [**`SSLCertificate`**](../advanced/ssl-certificate.md) object describing the site's certificate. You can export the cert in multiple formats (PEM/DER/JSON) or access its properties like `issuer`,
`subject`, `valid_from`, `valid_until`, etc.
```python
if result.ssl_certificate:
print("Issuer:", result.ssl_certificate.issuer)
```
## 2. Raw / Cleaned Content
### 2.1 **`html`** *(str)*
```python
# Possibly large
print(len(result.html))
```
### 2.2 **`cleaned_html`** *(Optional[str])*
**What**: A sanitized HTML version—scripts, styles, or excluded tags are removed based on your `CrawlerRunConfig`.
```python
print(result.cleaned_html[:500]) # Show a snippet
```
## 3. Markdown Fields
### 3.1 The Markdown Generation Approach
- **Raw** markdown
- **Links as citations** (with a references section)
- **Fit** markdown if a **content filter** is used (like Pruning or BM25)
**`MarkdownGenerationResult`** includes:
- **`raw_markdown`** *(str)*: The full HTML→Markdown conversion.
- **`markdown_with_citations`** *(str)*: Same markdown, but with link references as academic-style citations.
- **`references_markdown`** *(str)*: The reference list or footnotes at the end.
- **`fit_markdown`** *(Optional[str])*: If content filtering (Pruning/BM25) was applied, the filtered "fit" text.
- **`fit_html`** *(Optional[str])*: The HTML that led to `fit_markdown`.
```python
if result.markdown:
md_res = result.markdown
print("Raw MD:", md_res.raw_markdown[:300])
print("Citations MD:", md_res.markdown_with_citations[:300])
print("References:", md_res.references_markdown)
if md_res.fit_markdown:
print("Pruned text:", md_res.fit_markdown[:300])
```
### 3.2 **`markdown`** *(Optional[Union[str, MarkdownGenerationResult]])*
**What**: Holds the markdown output; in current versions this is a `MarkdownGenerationResult` (older versions returned a plain string).
```python
print(result.markdown.raw_markdown[:200])
print(result.markdown.fit_markdown)
print(result.markdown.fit_html)
```
**Important**: "Fit" content (`fit_markdown`/`fit_html`) exists in `result.markdown` only if you used a **filter** (like **PruningContentFilter** or **BM25ContentFilter**) within a `MarkdownGenerationStrategy`.
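Because `markdown` is typed as `Union[str, MarkdownGenerationResult]`, code that must run across library versions can normalize the field defensively. The helper below is illustrative, not part of crawl4ai:

```python
def raw_markdown_of(markdown):
    """Return the raw markdown text whether `markdown` is a plain
    string (older versions) or a MarkdownGenerationResult-like object."""
    if markdown is None:
        return ""
    if isinstance(markdown, str):
        return markdown
    return markdown.raw_markdown

class FakeResult:  # stand-in for MarkdownGenerationResult
    raw_markdown = "# Title"

print(raw_markdown_of("plain text"))  # plain text
print(raw_markdown_of(FakeResult()))  # # Title
print(raw_markdown_of(None))          # prints an empty line
```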
## 4. Media & Links
### 4.1 **`media`** *(Dict[str, List[Dict]])*
**What**: Contains info about discovered images, videos, or audio. Typically keys: `"images"`, `"videos"`, `"audios"`.
- `src` *(str)*: Media URL
- `alt` or `title` *(str)*: Descriptive text
- `score` *(float)*: Relevance score if the crawler's heuristic found it "important"
- `desc` or `description` *(Optional[str])*: Additional context extracted from surrounding text
```python
images = result.media.get("images", [])
for img in images:
if img.get("score", 0) > 5:
print("High-value image:", img["src"])
```
### 4.2 **`links`** *(Dict[str, List[Dict]])*
**What**: Holds internal and external link data. Usually two keys: `"internal"` and `"external"`.
- `href` *(str)*: The link target
- `text` *(str)*: Link text
- `title` *(str)*: Title attribute
- `context` *(str)*: Surrounding text snippet
- `domain` *(str)*: If external, the domain
```python
for link in result.links["internal"]:
print(f"Internal link to {link['href']} with text {link['text']}")
```
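For example, tallying external link targets by the `domain` field described above is a short job for `collections.Counter` (sample data shown for illustration):

```python
from collections import Counter

def external_domain_counts(links):
    """Count external links per domain from a CrawlResult-style links dict."""
    return Counter(
        link.get("domain", "unknown")
        for link in links.get("external", [])
    )

links = {
    "internal": [{"href": "/about", "text": "About"}],
    "external": [
        {"href": "https://a.com/x", "domain": "a.com"},
        {"href": "https://a.com/y", "domain": "a.com"},
        {"href": "https://b.org/z", "domain": "b.org"},
    ],
}
print(external_domain_counts(links))  # Counter({'a.com': 2, 'b.org': 1})
```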
## 5. Additional Fields
### 5.1 **`extracted_content`** *(Optional[str])*
**What**: If you used **`extraction_strategy`** (CSS, LLM, etc.), the structured output (JSON).
```python
if result.extracted_content:
data = json.loads(result.extracted_content)
print(data)
```
### 5.2 **`downloaded_files`** *(Optional[List[str]])*
**What**: If `accept_downloads=True` in your `BrowserConfig` + `downloads_path`, lists local file paths for downloaded items.
```python
if result.downloaded_files:
for file_path in result.downloaded_files:
print("Downloaded:", file_path)
```
### 5.3 **`screenshot`** *(Optional[str])*
**What**: Base64-encoded screenshot if `screenshot=True` in `CrawlerRunConfig`.
```python
import base64
if result.screenshot:
with open("page.png", "wb") as f:
f.write(base64.b64decode(result.screenshot))
```
### 5.4 **`pdf`** *(Optional[bytes])*
**What**: Raw PDF bytes if `pdf=True` in `CrawlerRunConfig`.
```python
if result.pdf:
with open("page.pdf", "wb") as f:
f.write(result.pdf)
```
### 5.5 **`mhtml`** *(Optional[str])*
**What**: MHTML snapshot of the page if `capture_mhtml=True` in `CrawlerRunConfig`. MHTML (MIME HTML) format preserves the entire web page with all its resources (CSS, images, scripts, etc.) in a single file.
```python
if result.mhtml:
with open("page.mhtml", "w", encoding="utf-8") as f:
f.write(result.mhtml)
```
### 5.6 **`metadata`** *(Optional[dict])*
```python
if result.metadata:
print("Title:", result.metadata.get("title"))
print("Author:", result.metadata.get("author"))
```
## 6. `dispatch_result` (optional)
A `DispatchResult` object providing additional concurrency and resource usage information when crawling URLs in parallel (e.g., via `arun_many()` with custom dispatchers). It contains:
- **`task_id`**: A unique identifier for the parallel task.
- **`memory_usage`** (float): The memory (in MB) used at the time of completion.
- **`peak_memory`** (float): The peak memory usage (in MB) recorded during the task's execution.
- **`start_time`** / **`end_time`** (datetime): Time range for this crawling task.
- **`error_message`** (str): Any dispatcher- or concurrency-related error encountered.
```python
# Example usage:
for result in results:
if result.success and result.dispatch_result:
dr = result.dispatch_result
print(f"URL: {result.url}, Task ID: {dr.task_id}")
print(f"Memory: {dr.memory_usage:.1f} MB (Peak: {dr.peak_memory:.1f} MB)")
print(f"Duration: {dr.end_time - dr.start_time}")
```
> **Note**: This field is typically populated when using `arun_many(...)` alongside a **dispatcher** (e.g., `MemoryAdaptiveDispatcher` or `SemaphoreDispatcher`). If no concurrency or dispatcher is used, `dispatch_result` may remain `None`.
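A small helper can summarize dispatcher stats across a batch. This sketch works over any duck-typed results carrying the fields listed above (`peak_memory`, `start_time`, `end_time`); it is not a crawl4ai API:

```python
def dispatch_summary(results):
    """Aggregate peak memory and total duration over results that
    carry a dispatch_result; results without one are skipped."""
    peaks, durations = [], []
    for r in results:
        dr = getattr(r, "dispatch_result", None)
        if getattr(r, "success", False) and dr:
            peaks.append(dr.peak_memory)
            durations.append((dr.end_time - dr.start_time).total_seconds())
    return {
        "max_peak_memory_mb": max(peaks, default=0.0),
        "total_seconds": sum(durations),
    }
```

Feeding it the `results` list from `arun_many()` gives a quick health check of a parallel run without inspecting each task individually.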
## 7. Network Requests & Console Messages
When you enable network and console message capturing in `CrawlerRunConfig` using `capture_network_requests=True` and `capture_console_messages=True`, the `CrawlResult` will include these fields:
### 7.1 **`network_requests`** *(Optional[List[Dict[str, Any]]])*
- Each item has an `event_type` field that can be `"request"`, `"response"`, or `"request_failed"`.
- Request events include `url`, `method`, `headers`, `post_data`, `resource_type`, and `is_navigation_request`.
- Response events include `url`, `status`, `status_text`, `headers`, and `request_timing`.
- Failed request events include `url`, `method`, `resource_type`, and `failure_text`.
- All events include a `timestamp` field.
```python
if result.network_requests:
# Count different types of events
requests = [r for r in result.network_requests if r.get("event_type") == "request"]
responses = [r for r in result.network_requests if r.get("event_type") == "response"]
failures = [r for r in result.network_requests if r.get("event_type") == "request_failed"]
print(f"Captured {len(requests)} requests, {len(responses)} responses, and {len(failures)} failures")
# Analyze API calls
api_calls = [r for r in requests if "api" in r.get("url", "")]
# Identify failed resources
for failure in failures:
print(f"Failed to load: {failure.get('url')} - {failure.get('failure_text')}")
```
### 7.2 **`console_messages`** *(Optional[List[Dict[str, Any]]])*
- Each item has a `type` field indicating the message type (e.g., `"log"`, `"error"`, `"warning"`, etc.).
- The `text` field contains the actual message text.
- Some messages include `location` information (URL, line, column).
- All messages include a `timestamp` field.
```python
if result.console_messages:
# Count messages by type
message_types = {}
for msg in result.console_messages:
msg_type = msg.get("type", "unknown")
message_types[msg_type] = message_types.get(msg_type, 0) + 1
print(f"Message type counts: {message_types}")
# Display errors (which are usually most important)
for msg in result.console_messages:
if msg.get("type") == "error":
print(f"Error: {msg.get('text')}")
```
## 8. Example: Accessing Everything
```python
async def handle_result(result: CrawlResult):
if not result.success:
print("Crawl error:", result.error_message)
return
# Basic info
print("Crawled URL:", result.url)
print("Status code:", result.status_code)
# HTML
print("Original HTML size:", len(result.html))
print("Cleaned HTML size:", len(result.cleaned_html or ""))
# Markdown output
if result.markdown:
print("Raw Markdown:", result.markdown.raw_markdown[:300])
print("Citations Markdown:", result.markdown.markdown_with_citations[:300])
if result.markdown.fit_markdown:
print("Fit Markdown:", result.markdown.fit_markdown[:200])
# Media & Links
if "images" in result.media:
print("Image count:", len(result.media["images"]))
if "internal" in result.links:
print("Internal link count:", len(result.links["internal"]))
# Extraction strategy result
if result.extracted_content:
print("Structured data:", result.extracted_content)
# Screenshot/PDF/MHTML
if result.screenshot:
print("Screenshot length:", len(result.screenshot))
if result.pdf:
print("PDF bytes length:", len(result.pdf))
if result.mhtml:
print("MHTML length:", len(result.mhtml))
# Network and console capturing
if result.network_requests:
print(f"Network requests captured: {len(result.network_requests)}")
# Analyze request types
req_types = {}
for req in result.network_requests:
if "resource_type" in req:
req_types[req["resource_type"]] = req_types.get(req["resource_type"], 0) + 1
print(f"Resource types: {req_types}")
if result.console_messages:
print(f"Console messages captured: {len(result.console_messages)}")
# Count by message type
msg_types = {}
for msg in result.console_messages:
msg_types[msg.get("type", "unknown")] = msg_types.get(msg.get("type", "unknown"), 0) + 1
print(f"Message types: {msg_types}")
```
## 9. Key Points & Future
1. **Deprecated legacy properties of CrawlResult**
- `markdown_v2` - Deprecated in v0.5. Use `markdown` instead; it now holds the `MarkdownGenerationResult`.
- `fit_markdown` and `fit_html` - Deprecated in v0.5. Access them via the `MarkdownGenerationResult` in `result.markdown`, e.g. `result.markdown.fit_markdown` and `result.markdown.fit_html`.
2. **Fit Content**
- **`fit_markdown`** and **`fit_html`** appear in MarkdownGenerationResult, only if you used a content filter (like **PruningContentFilter** or **BM25ContentFilter**) inside your **MarkdownGenerationStrategy** or set them directly.
- If no filter is used, they remain `None`.
3. **References & Citations**
- If you enable link citations in your `DefaultMarkdownGenerator` (`options={"citations": True}`), you’ll see `markdown_with_citations` plus a **`references_markdown`** block. This helps large language models or academic-like referencing.
4. **Links & Media**
- `links["internal"]` and `links["external"]` group discovered anchors by domain.
- `media["images"]` / `["videos"]` / `["audios"]` store extracted media elements with optional scoring or context.
5. **Error Cases**
- If `success=False`, check `error_message` (e.g., timeouts, invalid URLs).
- `status_code` might be `None` if we failed before an HTTP response.
Use **`CrawlResult`** to collect every final output (HTML, markdown, media, links, extracted JSON) and feed it into your data pipelines, AI models, or archives. With a properly configured **BrowserConfig** and **CrawlerRunConfig**, the crawler delivers robust, structured results in **`CrawlResult`**.
## Configuration
## Browser, Crawler & LLM Configuration (Quick Overview)
Crawl4AI's flexibility stems from three key classes:
1. **`BrowserConfig`** – Dictates **how** the browser is launched and behaves (e.g., headless or visible, proxy, user agent).
2. **`CrawlerRunConfig`** – Dictates **how** each **crawl** operates (e.g., caching, extraction, timeouts, JavaScript code to run, etc.).
3. **`LLMConfig`** – Dictates **how** LLM providers are configured (model, API token, base URL, temperature, etc.).
In most examples, you create **one** `BrowserConfig` for the entire crawler session, then pass a **fresh** or re-used `CrawlerRunConfig` whenever you call `arun()`. This tutorial shows the most commonly used parameters. If you need advanced or rarely used fields, see the [Configuration Parameters](../api/parameters.md).
## 1. BrowserConfig Essentials
```python
class BrowserConfig:
def __init__(
browser_type="chromium",
headless=True,
proxy_config=None,
viewport_width=1080,
viewport_height=600,
verbose=True,
use_persistent_context=False,
user_data_dir=None,
cookies=None,
headers=None,
user_agent=None,
text_mode=False,
light_mode=False,
extra_args=None,
enable_stealth=False,
# ... other advanced parameters omitted here
):
...
```
### Key Fields to Note
1. **`browser_type`**
- Options: `"chromium"`, `"firefox"`, or `"webkit"`.
- Defaults to `"chromium"`.
- If you need a different engine, specify it here.
2. **`headless`**
- `True`: Runs the browser in headless mode (invisible browser).
- `False`: Runs the browser in visible mode, which helps with debugging.
3. **`proxy_config`**
- A dictionary with fields like:
```json
{
"server": "http://proxy.example.com:8080",
"username": "...",
"password": "..."
}
```
- Leave as `None` if a proxy is not required.
4. **`viewport_width`** & **`viewport_height`**:
   - The initial window size.
   - Some sites behave differently with smaller or bigger viewports.
5. **`verbose`**:
   - If `True`, prints extra logs.
   - Handy for debugging.
6. **`use_persistent_context`**:
   - If `True`, uses a **persistent** browser profile, storing cookies/local storage across runs.
   - Typically also set `user_data_dir` to point to a folder.
7. **`cookies`** & **`headers`**:
   - E.g. `cookies=[{"name": "session", "value": "abc123", "domain": "example.com"}]`.
8. **`user_agent`**:
   - Custom User-Agent string. If `None`, a default is used.
   - You can also set `user_agent_mode="random"` for randomization (if you want to fight bot detection).
9. **`text_mode`** & **`light_mode`**:
   - `text_mode=True` disables images, possibly speeding up text-only crawls.
   - `light_mode=True` turns off certain background features for performance.
10. **`extra_args`**:
    - Additional flags for the underlying browser.
    - E.g. `["--disable-extensions"]`.
11. **`enable_stealth`**:
    - If `True`, enables stealth mode using playwright-stealth.
    - Modifies browser fingerprints to avoid basic bot detection.
    - Default is `False`. Recommended for sites with bot protection.
### Helper Methods
Both configuration classes provide a `clone()` method to create modified copies:
```python
# Create a base browser config
base_browser = BrowserConfig(
browser_type="chromium",
headless=True,
text_mode=True
)
# Create a visible browser config for debugging
debug_browser = base_browser.clone(
headless=False,
verbose=True
)
```
A minimal end-to-end use of `BrowserConfig` with the crawler:
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig
browser_conf = BrowserConfig(
browser_type="firefox",
headless=False,
text_mode=True
)
async with AsyncWebCrawler(config=browser_conf) as crawler:
result = await crawler.arun("https://example.com")
print(result.markdown[:300])
```
## 2. CrawlerRunConfig Essentials
```python
class CrawlerRunConfig:
def __init__(
word_count_threshold=200,
extraction_strategy=None,
markdown_generator=None,
cache_mode=None,
js_code=None,
wait_for=None,
screenshot=False,
pdf=False,
capture_mhtml=False,
# Location and Identity Parameters
locale=None, # e.g. "en-US", "fr-FR"
timezone_id=None, # e.g. "America/New_York"
geolocation=None, # GeolocationConfig object
# Resource Management
enable_rate_limiting=False,
rate_limit_config=None,
memory_threshold_percent=70.0,
check_interval=1.0,
max_session_permit=20,
display_mode=None,
verbose=True,
stream=False, # Enable streaming for arun_many()
# ... other advanced parameters omitted
):
...
```
### Key Fields to Note
1. **`word_count_threshold`**:
- The minimum word count before a block is considered.
- If your site has lots of short paragraphs or items, you can lower it.
2. **`extraction_strategy`**:
- Where you plug in JSON-based extraction (CSS, LLM, etc.).
- If `None`, no structured extraction is done (only raw/cleaned HTML + markdown).
3. **`markdown_generator`**:
- E.g., `DefaultMarkdownGenerator(...)`, controlling how HTML→Markdown conversion is done.
- If `None`, a default approach is used.
4. **`cache_mode`**:
- Controls caching behavior (`ENABLED`, `BYPASS`, `DISABLED`, etc.).
- If `None`, a default caching level applies; pass `CacheMode.ENABLED` (or another mode) explicitly to be unambiguous.
5. **`js_code`**:
- A string or list of JS strings to execute.
- Great for "Load More" buttons or user interactions.
6. **`wait_for`**:
- A CSS or JS expression to wait for before extracting content.
- Common usage: `wait_for="css:.main-loaded"` or `wait_for="js:() => window.loaded === true"`.
7. **`screenshot`**, **`pdf`**, & **`capture_mhtml`**:
- If `True`, captures a screenshot, PDF, or MHTML snapshot after the page is fully loaded.
- The results go to `result.screenshot` (base64), `result.pdf` (bytes), or `result.mhtml` (string).
8. **Location Parameters**:
- **`locale`**: Browser's locale (e.g., `"en-US"`, `"fr-FR"`) for language preferences
- **`timezone_id`**: Browser's timezone (e.g., `"America/New_York"`, `"Europe/Paris"`)
- **`geolocation`**: GPS coordinates via `GeolocationConfig(latitude=48.8566, longitude=2.3522)`
9. **`verbose`**:
- Logs additional runtime details.
- Overlaps with the browser's verbosity if also set to `True` in `BrowserConfig`.
10. **`enable_rate_limiting`**:
- If `True`, enables rate limiting for batch processing.
- Requires `rate_limit_config` to be set.
11. **`memory_threshold_percent`**:
- The memory threshold (as a percentage) to monitor.
- If exceeded, the crawler will pause or slow down.
12. **`check_interval`**:
- The interval (in seconds) to check system resources.
- Affects how often memory and CPU usage are monitored.
13. **`max_session_permit`**:
- The maximum number of concurrent crawl sessions.
- Helps prevent overwhelming the system.
14. **`url_matcher`** & **`match_mode`**:
- Enable URL-specific configurations when used with `arun_many()`.
- Set `url_matcher` to patterns (glob, function, or list) to match specific URLs.
- Use `match_mode` (OR/AND) to control how multiple patterns combine.
15. **`display_mode`**:
- The display mode for progress information (`DETAILED`, `BRIEF`, etc.).
- Affects how much information is printed during the crawl.
### Helper Methods
The `clone()` method is particularly useful for creating variations of your crawler configuration:
```python
# Create a base configuration
base_config = CrawlerRunConfig(
cache_mode=CacheMode.ENABLED,
word_count_threshold=200,
wait_until="networkidle"
)
# Create variations for different use cases
stream_config = base_config.clone(
stream=True, # Enable streaming mode
cache_mode=CacheMode.BYPASS
)
debug_config = base_config.clone(
page_timeout=120000, # Longer timeout for debugging
verbose=True
)
```
The `clone()` method:
- Creates a new instance with all the same settings
- Updates only the specified parameters
- Leaves the original configuration unchanged
- Perfect for creating variations without repeating all parameters
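The behavior described above is copy-with-overrides semantics, which you can picture with `dataclasses.replace`. The class below is a stand-in to illustrate the idea, not the real `CrawlerRunConfig`:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class MiniConfig:  # stand-in to illustrate clone() semantics
    cache_mode: str = "ENABLED"
    word_count_threshold: int = 200
    stream: bool = False

base = MiniConfig()
# Analogous to base.clone(stream=True, cache_mode=CacheMode.BYPASS)
streaming = replace(base, stream=True, cache_mode="BYPASS")

print(streaming.stream)                # True
print(streaming.word_count_threshold)  # 200 (inherited from base)
print(base.cache_mode)                 # ENABLED (original unchanged)
```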
## 3. LLMConfig Essentials
### Key fields to note
1. **`provider`**:
- Which LLM provider to use.
- Possible values include: `"ollama/llama3"`, `"groq/llama3-70b-8192"`, `"groq/llama3-8b-8192"`, `"openai/gpt-4o-mini"`, `"openai/gpt-4o"`, `"openai/o1-mini"`, `"openai/o1-preview"`, `"openai/o3-mini"`, `"openai/o3-mini-high"`, `"anthropic/claude-3-haiku-20240307"`, `"anthropic/claude-3-opus-20240229"`, `"anthropic/claude-3-sonnet-20240229"`, `"anthropic/claude-3-5-sonnet-20240620"`, `"gemini/gemini-pro"`, `"gemini/gemini-1.5-pro"`, `"gemini/gemini-2.0-flash"`, `"gemini/gemini-2.0-flash-exp"`, `"gemini/gemini-2.0-flash-lite-preview-02-05"`, `"deepseek/deepseek-chat"` *(default: `"openai/gpt-4o-mini"`)*
2. **`api_token`**:
   - Optional. If not provided explicitly, it is read from an environment variable chosen by provider; e.g. when a Gemini model is the provider, `GEMINI_API_KEY` is read from the environment.
   - Pass the provider's API token directly, e.g. `api_token = "<your-provider-api-key>"`.
   - Or reference an environment variable with the `env:` prefix, e.g. `api_token = "env:GROQ_API_KEY"`.
3. **`base_url`**:
- If your provider has a custom endpoint
```python
llm_config = LLMConfig(provider="openai/gpt-4o-mini", api_token=os.getenv("OPENAI_API_KEY"))
```
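The `env:` convention for `api_token` can be pictured as a tiny lookup step. This is an illustrative sketch of the convention, not the library's actual resolver:

```python
import os

def resolve_token(api_token):
    """Resolve an api_token that may use the "env:VAR_NAME" convention."""
    if api_token and api_token.startswith("env:"):
        return os.environ.get(api_token[len("env:"):].strip())
    return api_token

os.environ["GROQ_API_KEY"] = "demo-token"
print(resolve_token("env:GROQ_API_KEY"))  # demo-token
print(resolve_token("sk-literal"))        # sk-literal
```

Keeping tokens out of source by using the `env:` form (or reading the variable yourself with `os.getenv`) is the safer default for shared codebases.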
## 4. Putting It All Together
In a typical scenario, you define **one** `BrowserConfig` for your crawler session, then create **one or more** `CrawlerRunConfig` & `LLMConfig` depending on each call's needs:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig, LLMContentFilter, DefaultMarkdownGenerator
from crawl4ai import JsonCssExtractionStrategy
async def main():
# 1) Browser config: headless, bigger viewport, no proxy
browser_conf = BrowserConfig(
headless=True,
viewport_width=1280,
viewport_height=720
)
# 2) Example extraction strategy
schema = {
"name": "Articles",
"baseSelector": "div.article",
"fields": [
{"name": "title", "selector": "h2", "type": "text"},
{"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
]
}
extraction = JsonCssExtractionStrategy(schema)
# 3) Example LLM content filtering
gemini_config = LLMConfig(
provider="gemini/gemini-1.5-pro",
api_token = "env:GEMINI_API_TOKEN"
)
# Initialize LLM filter with specific instruction
filter = LLMContentFilter(
llm_config=gemini_config, # or your preferred provider
instruction="""
Focus on extracting the core educational content.
Include:
- Key concepts and explanations
- Important code examples
- Essential technical details
Exclude:
- Navigation elements
- Sidebars
- Footer content
Format the output as clean markdown with proper code blocks and headers.
""",
chunk_token_threshold=500, # Adjust based on your needs
verbose=True
)
md_generator = DefaultMarkdownGenerator(
content_filter=filter,
options={"ignore_links": True}
)
# 4) Crawler run config: skip cache, use extraction
run_conf = CrawlerRunConfig(
markdown_generator=md_generator,
extraction_strategy=extraction,
cache_mode=CacheMode.BYPASS,
)
async with AsyncWebCrawler(config=browser_conf) as crawler:
# 4) Execute the crawl
result = await crawler.arun(url="https://example.com/news", config=run_conf)
if result.success:
print("Extracted content:", result.extracted_content)
else:
print("Error:", result.error_message)
if __name__ == "__main__":
asyncio.run(main())
```
## 5. Next Steps
- [BrowserConfig, CrawlerRunConfig & LLMConfig Reference](../api/parameters.md)
- **Custom Hooks & Auth** (Inject JavaScript or handle login forms).
- **Session Management** (Re-use pages, preserve state across multiple calls).
- **Advanced Caching** (Fine-tune read/write cache modes).
## 1. **BrowserConfig** – Controlling the Browser
`BrowserConfig` focuses on **how** the browser is launched and behaves. This includes headless mode, proxies, user agents, and other environment tweaks.
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig
browser_cfg = BrowserConfig(
browser_type="chromium",
headless=True,
viewport_width=1280,
viewport_height=720,
    proxy_config={"server": "http://proxy:8080", "username": "user", "password": "pass"},
user_agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/116.0.0.0 Safari/537.36",
)
```
## 1.1 Parameter Highlights
| **Parameter** | **Type / Default** | **What It Does** |
|-----------------------|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------|
| **`browser_type`** | `"chromium"`, `"firefox"`, `"webkit"`<br/>*(default: `"chromium"`)* | Which browser engine to use. `"chromium"` is typical for many sites, `"firefox"` or `"webkit"` for specialized tests. |
| **`headless`** | `bool` (default: `True`) | Headless means no visible UI. `False` is handy for debugging. |
| **`viewport_width`** | `int` (default: `1080`) | Initial page width (in px). Useful for testing responsive layouts. |
| **`viewport_height`** | `int` (default: `600`) | Initial page height (in px). |
| **`proxy`** | `str` (deprecated) | Deprecated. Use `proxy_config` instead. If set, it will be auto-converted internally. |
| **`proxy_config`** | `dict` (default: `None`) | For advanced or multi-proxy needs, specify details like `{"server": "...", "username": "...", ...}`. |
| **`use_persistent_context`** | `bool` (default: `False`) | If `True`, uses a **persistent** browser context (keep cookies, sessions across runs). Also sets `use_managed_browser=True`. |
| **`user_data_dir`** | `str or None` (default: `None`) | Directory to store user data (profiles, cookies). Must be set if you want permanent sessions. |
| **`ignore_https_errors`** | `bool` (default: `True`) | If `True`, continues despite invalid certificates (common in dev/staging). |
| **`java_script_enabled`** | `bool` (default: `True`) | Disable if you want no JS overhead, or if only static content is needed. |
| **`cookies`** | `list` (default: `[]`) | Pre-set cookies, each a dict like `{"name": "session", "value": "...", "url": "..."}`. |
| **`headers`** | `dict` (default: `{}`) | Extra HTTP headers for every request, e.g. `{"Accept-Language": "en-US"}`. |
| **`user_agent`** | `str` (default: Chrome-based UA) | Your custom or random user agent. `user_agent_mode="random"` can shuffle it. |
| **`light_mode`** | `bool` (default: `False`) | Disables some background features for performance gains. |
| **`text_mode`** | `bool` (default: `False`) | If `True`, tries to disable images/other heavy content for speed. |
| **`use_managed_browser`** | `bool` (default: `False`) | For advanced “managed” interactions (debugging, CDP usage). Typically set automatically if persistent context is on. |
| **`extra_args`** | `list` (default: `[]`) | Additional flags for the underlying browser process, e.g. `["--disable-extensions"]`. |
- Set `headless=False` to visually **debug** how pages load or how interactions proceed.
- If you need **authentication** storage or repeated sessions, consider `use_persistent_context=True` and specify `user_data_dir`.
- For large pages, you might need a bigger `viewport_width` and `viewport_height` to handle dynamic content.
## 2. **CrawlerRunConfig** – Controlling Each Crawl
While `BrowserConfig` sets up the **environment**, `CrawlerRunConfig` details **how** each **crawl operation** should behave: caching, content filtering, link or domain blocking, timeouts, JavaScript code, etc.
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
run_cfg = CrawlerRunConfig(
wait_for="css:.main-content",
word_count_threshold=15,
excluded_tags=["nav", "footer"],
exclude_external_links=True,
stream=True, # Enable streaming for arun_many()
)
```
## 2.1 Parameter Highlights
### A) **Content Processing**
| **Parameter** | **Type / Default** | **What It Does** |
|------------------------------|--------------------------------------|-------------------------------------------------------------------------------------------------|
| **`word_count_threshold`** | `int` (default: ~200) | Skips text blocks below X words. Helps ignore trivial sections. |
| **`extraction_strategy`** | `ExtractionStrategy` (default: None) | If set, extracts structured data (CSS-based, LLM-based, etc.). |
| **`markdown_generator`** | `MarkdownGenerationStrategy` (None) | If you want specialized markdown output (citations, filtering, chunking, etc.). Can be customized with options such as `content_source` parameter to select the HTML input source ('cleaned_html', 'raw_html', or 'fit_html'). |
| **`css_selector`** | `str` (None) | Retains only the part of the page matching this selector. Affects the entire extraction process. |
| **`target_elements`** | `List[str]` (None) | List of CSS selectors for elements to focus on for markdown generation and data extraction, while still processing the entire page for links, media, etc. Provides more flexibility than `css_selector`. |
| **`excluded_tags`** | `list` (None) | Removes entire tags (e.g. `["script", "style"]`). |
| **`excluded_selector`** | `str` (None) | Like `css_selector` but to exclude. E.g. `"#ads, .tracker"`. |
| **`only_text`** | `bool` (False) | If `True`, tries to extract text-only content. |
| **`prettiify`** | `bool` (False) | If `True`, beautifies final HTML (slower, purely cosmetic). |
| **`keep_data_attributes`** | `bool` (False) | If `True`, preserve `data-*` attributes in cleaned HTML. |
| **`remove_forms`** | `bool` (False) | If `True`, remove all `<form>` elements. |
### B) **Caching & Session**
| **Parameter** | **Type / Default** | **What It Does** |
|-------------------------|------------------------|------------------------------------------------------------------------------------------------------------------------------|
| **`cache_mode`** | `CacheMode or None` | Controls how caching is handled (`ENABLED`, `BYPASS`, `DISABLED`, etc.). If `None`, typically defaults to `ENABLED`. |
| **`session_id`** | `str or None` | Assign a unique ID to reuse a single browser session across multiple `arun()` calls. |
| **`bypass_cache`** | `bool` (False) | If `True`, acts like `CacheMode.BYPASS`. |
| **`disable_cache`** | `bool` (False) | If `True`, acts like `CacheMode.DISABLED`. |
| **`no_cache_read`** | `bool` (False) | If `True`, acts like `CacheMode.WRITE_ONLY` (writes cache but never reads). |
| **`no_cache_write`** | `bool` (False) | If `True`, acts like `CacheMode.READ_ONLY` (reads cache but never writes). |
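For example, to reuse one browser tab across several `arun()` calls while always fetching fresh content (the session ID is an arbitrary label of your choosing):

```python
from crawl4ai import CrawlerRunConfig, CacheMode

config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,  # skip the cache for this crawl
    session_id="login_flow",      # any stable string; reuses the same tab across calls
)
```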
### C) **Page Navigation & Timing**
| **Parameter** | **Type / Default** | **What It Does** |
|----------------------------|-------------------------|----------------------------------------------------------------------------------------------------------------------|
| **`wait_until`** | `str` (domcontentloaded)| Condition for navigation to “complete”. Often `"networkidle"` or `"domcontentloaded"`. |
| **`page_timeout`** | `int` (60000 ms) | Timeout for page navigation or JS steps. Increase for slow sites. |
| **`wait_for`** | `str or None` | Wait for a CSS (`"css:selector"`) or JS (`"js:() => bool"`) condition before content extraction. |
| **`wait_for_images`** | `bool` (False) | Wait for images to load before finishing. Slows down if you only want text. |
| **`delay_before_return_html`** | `float` (0.1) | Additional pause (seconds) before final HTML is captured. Good for last-second updates. |
| **`check_robots_txt`** | `bool` (False) | Whether to check and respect robots.txt rules before crawling. If True, caches robots.txt for efficiency. |
| **`mean_delay`** and **`max_range`** | `float` (0.1, 0.3) | If you call `arun_many()`, these define random delay intervals between crawls, helping avoid detection or rate limits. |
| **`semaphore_count`** | `int` (5) | Max concurrency for `arun_many()`. Increase if you have resources for parallel crawls. |
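A typical setup for a slow, JavaScript-heavy page might combine these (the selector is illustrative):

```python
from crawl4ai import CrawlerRunConfig

config = CrawlerRunConfig(
    wait_until="networkidle",        # consider navigation done once the network settles
    page_timeout=90_000,             # 90 s, in milliseconds
    wait_for="css:.article-loaded",  # block until this element exists (illustrative)
    delay_before_return_html=0.5,    # final half-second settle before capturing HTML
)
```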
### D) **Page Interaction**
| **Parameter** | **Type / Default** | **What It Does** |
|----------------------------|--------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------|
| **`js_code`** | `str or list[str]` (None) | JavaScript to run after load. E.g. `"document.querySelector('button')?.click();"`. |
| **`js_only`** | `bool` (False) | If `True`, indicates we’re reusing an existing session and only applying JS. No full reload. |
| **`ignore_body_visibility`** | `bool` (True) | Skip checking if `<body>` is visible. Usually best to keep `True`. |
| **`scan_full_page`** | `bool` (False) | If `True`, auto-scroll the page to load dynamic content (infinite scroll). |
| **`scroll_delay`** | `float` (0.2) | Delay between scroll steps if `scan_full_page=True`. |
| **`process_iframes`** | `bool` (False) | Inlines iframe content for single-page extraction. |
| **`remove_overlay_elements`** | `bool` (False) | Removes potential modals/popups blocking the main content. |
| **`simulate_user`** | `bool` (False) | Simulate user interactions (mouse movements) to avoid bot detection. |
| **`override_navigator`** | `bool` (False) | Override `navigator` properties in JS for stealth. |
| **`magic`** | `bool` (False) | Automatic handling of popups/consent banners. Experimental. |
| **`adjust_viewport_to_content`** | `bool` (False) | Resizes viewport to match page content height. |
If your page is a single-page app with repeated JS updates, set `js_only=True` in subsequent calls, plus a `session_id` for reusing the same tab.
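A sketch of that pattern, deriving the follow-up config with `clone()` (the button selector is illustrative):

```python
from crawl4ai import CrawlerRunConfig, CacheMode

first = CrawlerRunConfig(session_id="spa", cache_mode=CacheMode.BYPASS)
# Subsequent arun() calls: reuse the same tab and only run JS, no full page reload
more = first.clone(
    js_only=True,
    js_code="document.querySelector('button.load-more')?.click();",
)
```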
### E) **Media Handling**
| **Parameter** | **Type / Default** | **What It Does** |
|--------------------------------------------|---------------------|-----------------------------------------------------------------------------------------------------------|
| **`screenshot`** | `bool` (False) | Capture a screenshot (base64) in `result.screenshot`. |
| **`screenshot_wait_for`** | `float or None` | Extra wait time before the screenshot. |
| **`screenshot_height_threshold`** | `int` (~20000) | If the page is taller than this, alternate screenshot strategies are used. |
| **`pdf`** | `bool` (False) | If `True`, returns a PDF in `result.pdf`. |
| **`capture_mhtml`** | `bool` (False) | If `True`, captures an MHTML snapshot of the page in `result.mhtml`. MHTML includes all page resources (CSS, images, etc.) in a single file. |
| **`image_description_min_word_threshold`** | `int` (~50) | Minimum words for an image’s alt text or description to be considered valid. |
| **`image_score_threshold`** | `int` (~3) | Filter out low-scoring images. The crawler scores images by relevance (size, context, etc.). |
| **`exclude_external_images`** | `bool` (False) | Exclude images from other domains. |
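To capture page artifacts alongside the crawl, for instance:

```python
from crawl4ai import CrawlerRunConfig

config = CrawlerRunConfig(
    screenshot=True,              # base64 screenshot in result.screenshot
    pdf=True,                     # PDF bytes in result.pdf
    exclude_external_images=True, # ignore images hosted on other domains
)
```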
### F) **Link/Domain Handling**
| **Parameter** | **Type / Default** | **What It Does** |
|------------------------------|-------------------------|-----------------------------------------------------------------------------------------------------------------------------|
| **`exclude_social_media_domains`** | `list` (default social domains) | Built-in list of social media domains (e.g. Facebook, Twitter) that can be extended. Any link to these domains is removed from the final output. |
| **`exclude_external_links`** | `bool` (False) | Removes all links pointing outside the current domain. |
| **`exclude_social_media_links`** | `bool` (False) | Strips links specifically to social sites (like Facebook or Twitter). |
| **`exclude_domains`** | `list` ([]) | Provide a custom list of domains to exclude (like `["ads.com", "trackers.io"]`). |
| **`preserve_https_for_internal_links`** | `bool` (False) | If `True`, preserves HTTPS scheme for internal links even when the server redirects to HTTP. Useful for security-conscious crawling. |
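For example, to keep the link list focused on the crawled site:

```python
from crawl4ai import CrawlerRunConfig

config = CrawlerRunConfig(
    exclude_external_links=True,      # drop links to other domains
    exclude_social_media_links=True,  # also strip social links explicitly
    exclude_domains=["ads.example", "tracker.example"],  # custom blocklist (illustrative)
)
```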
### G) **Debug & Logging**
| **Parameter** | **Type / Default** | **What It Does** |
|----------------|--------------------|---------------------------------------------------------------------------|
| **`verbose`** | `bool` (True) | Prints logs detailing each step of crawling, interactions, or errors. |
| **`log_console`** | `bool` (False) | Logs the page’s JavaScript console output if you want deeper JS debugging.|
### H) **Virtual Scroll Configuration**
| **Parameter** | **Type / Default** | **What It Does** |
|------------------------------|------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|
| **`virtual_scroll_config`** | `VirtualScrollConfig or dict` (None) | Configuration for handling virtualized scrolling on sites like Twitter/Instagram where content is replaced rather than appended. |
When sites use virtual scrolling (content replaced as you scroll), use `VirtualScrollConfig`:
```python
from crawl4ai import VirtualScrollConfig
virtual_config = VirtualScrollConfig(
container_selector="#timeline", # CSS selector for scrollable container
scroll_count=30, # Number of times to scroll
scroll_by="container_height", # How much to scroll: "container_height", "page_height", or pixels (e.g. 500)
wait_after_scroll=0.5 # Seconds to wait after each scroll for content to load
)
config = CrawlerRunConfig(
virtual_scroll_config=virtual_config
)
```
**VirtualScrollConfig Parameters:**
| **Parameter** | **Type / Default** | **What It Does** |
|------------------------|---------------------------|-------------------------------------------------------------------------------------------|
| **`container_selector`** | `str` (required) | CSS selector for the scrollable container (e.g., `"#feed"`, `".timeline"`) |
| **`scroll_count`** | `int` (10) | Maximum number of scrolls to perform |
| **`scroll_by`** | `str or int` ("container_height") | Scroll amount: `"container_height"`, `"page_height"`, or pixels (e.g., `500`) |
| **`wait_after_scroll`** | `float` (0.5) | Time in seconds to wait after each scroll for new content to load |
- Use `virtual_scroll_config` when content is **replaced** during scroll (Twitter, Instagram)
- Use `scan_full_page` when content is **appended** during scroll (traditional infinite scroll)
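For the traditional appended-content case, the configuration is much simpler:

```python
from crawl4ai import CrawlerRunConfig

# Content is appended as you scroll (classic infinite scroll)
config = CrawlerRunConfig(
    scan_full_page=True,  # auto-scroll to the bottom of the page
    scroll_delay=0.3,     # pause between scroll steps
)
```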
### I) **URL Matching Configuration**
| **Parameter** | **Type / Default** | **What It Does** |
|------------------------|------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|
| **`url_matcher`** | `UrlMatcher` (None) | Pattern(s) to match URLs against. Can be: string (glob), function, or list of mixed types. **None means match ALL URLs** |
| **`match_mode`** | `MatchMode` (MatchMode.OR) | How to combine multiple matchers in a list: `MatchMode.OR` (any match) or `MatchMode.AND` (all must match) |
The `url_matcher` parameter enables URL-specific configurations when used with `arun_many()`:
```python
from crawl4ai import CrawlerRunConfig, MatchMode
from crawl4ai.processors.pdf import PDFContentScrapingStrategy
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
# Simple string pattern (glob-style)
pdf_config = CrawlerRunConfig(
url_matcher="*.pdf",
scraping_strategy=PDFContentScrapingStrategy()
)
# Multiple patterns with OR logic (default)
blog_config = CrawlerRunConfig(
url_matcher=["*/blog/*", "*/article/*", "*/news/*"],
match_mode=MatchMode.OR # Any pattern matches
)
# Function matcher
api_config = CrawlerRunConfig(
url_matcher=lambda url: 'api' in url or url.endswith('.json'),
# Other settings like extraction_strategy
)
# Mixed: String + Function with AND logic
complex_config = CrawlerRunConfig(
url_matcher=[
lambda url: url.startswith('https://'), # Must be HTTPS
"*.org/*", # Must be .org domain
lambda url: 'docs' in url # Must contain 'docs'
],
match_mode=MatchMode.AND # ALL conditions must match
)
# Combined patterns and functions with AND logic
secure_docs = CrawlerRunConfig(
url_matcher=["https://*", lambda url: '.doc' in url],
match_mode=MatchMode.AND # Must be HTTPS AND contain .doc
)
# Default config - matches ALL URLs
default_config = CrawlerRunConfig() # No url_matcher = matches everything
```
**UrlMatcher Types:**
- **None (default)**: When `url_matcher` is None or not set, the config matches ALL URLs
- **String patterns**: Glob-style patterns like `"*.pdf"`, `"*/api/*"`, `"https://*.example.com/*"`
- **Functions**: `lambda url: bool` - Custom logic for complex matching
- **Lists**: Mix strings and functions, combined with `MatchMode.OR` or `MatchMode.AND`
**Important Behavior:**
- When passing a list of configs to `arun_many()`, URLs are matched against each config's `url_matcher` in order. First match wins!
- If no config matches a URL and there's no default config (one without `url_matcher`), the URL will fail with "No matching configuration found"
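The matching semantics described above can be illustrated with plain-Python glob matching. This is not crawl4ai's implementation, just a self-contained sketch of how string patterns, predicate functions, and `MatchMode` combine:

```python
from fnmatch import fnmatch

def matches(url, matchers, mode="OR"):
    """Illustrative sketch: strings act as glob patterns, callables as predicates."""
    results = [m(url) if callable(m) else fnmatch(url, m) for m in matchers]
    return all(results) if mode == "AND" else any(results)

# OR (default): any pattern may match
assert matches("https://example.com/blog/post", ["*/blog/*", "*/news/*"])
# AND: every condition must hold
assert matches(
    "https://docs.example.org/guide",
    [lambda u: u.startswith("https://"), "*.org/*", lambda u: "docs" in u],
    mode="AND",
)
# No pattern matches at all
assert not matches("http://plain.net/", ["*.pdf", "*/api/*"])
```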
Both `BrowserConfig` and `CrawlerRunConfig` provide a `clone()` method to create modified copies:
```python
# Create a base configuration
base_config = CrawlerRunConfig(
cache_mode=CacheMode.ENABLED,
word_count_threshold=200
)
# Create variations using clone()
stream_config = base_config.clone(stream=True)
no_cache_config = base_config.clone(
cache_mode=CacheMode.BYPASS,
stream=True
)
```
The `clone()` method is particularly useful when you need slightly different configurations for different use cases, without modifying the original config.
## 2.3 Example Usage
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
async def main():
# Configure the browser
browser_cfg = BrowserConfig(
headless=False,
viewport_width=1280,
viewport_height=720,
proxy="http://user:pass@myproxy:8080",
text_mode=True
)
# Configure the run
run_cfg = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
session_id="my_session",
css_selector="main.article",
excluded_tags=["script", "style"],
exclude_external_links=True,
wait_for="css:.article-loaded",
screenshot=True,
stream=True
)
async with AsyncWebCrawler(config=browser_cfg) as crawler:
result = await crawler.arun(
url="https://example.com/news",
config=run_cfg
)
if result.success:
print("Final cleaned_html length:", len(result.cleaned_html))
if result.screenshot:
print("Screenshot captured (base64, length):", len(result.screenshot))
else:
print("Crawl failed:", result.error_message)
if __name__ == "__main__":
asyncio.run(main())
```
## 2.4 Compliance & Ethics
| **Parameter** | **Type / Default** | **What It Does** |
|-----------------------|-------------------------|----------------------------------------------------------------------------------------------------------------------|
| **`check_robots_txt`**| `bool` (False) | When True, checks and respects robots.txt rules before crawling. Uses efficient caching with SQLite backend. |
| **`user_agent`** | `str` (None) | User agent string to identify your crawler. Used for robots.txt checking when enabled. |
```python
run_config = CrawlerRunConfig(
check_robots_txt=True, # Enable robots.txt compliance
user_agent="MyBot/1.0" # Identify your crawler
)
```
## 3. **LLMConfig** - Setting up LLM providers
1. LLMExtractionStrategy
2. LLMContentFilter
3. JsonCssExtractionStrategy.generate_schema
4. JsonXPathExtractionStrategy.generate_schema
## 3.1 Parameters
| **Parameter** | **Type / Default** | **What It Does** |
|-----------------------|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------|
| **`provider`** | `str` (default: `"openai/gpt-4o-mini"`) | Which LLM provider/model to use, e.g. `"ollama/llama3"`, `"groq/llama3-70b-8192"`, `"openai/gpt-4o"`, `"openai/o3-mini"`, `"anthropic/claude-3-5-sonnet-20240620"`, `"gemini/gemini-2.0-flash"`, `"deepseek/deepseek-chat"`. |
| **`api_token`** | `str` (optional) | API token for the provider. If omitted, it is read from the environment variable matching the provider (e.g. `GEMINI_API_KEY` for Gemini models). You can also pass it explicitly (`api_token=os.getenv("GROQ_API_KEY")`) or reference an environment variable with the `"env:"` prefix (`api_token="env:GROQ_API_KEY"`). |
| **`base_url`** | `str` (optional) | Custom API endpoint, if your provider uses one. |
## 3.2 Example Usage
```python
import os
from crawl4ai import LLMConfig

llm_config = LLMConfig(provider="openai/gpt-4o-mini", api_token=os.getenv("OPENAI_API_KEY"))
```
## 4. Putting It All Together
- **Use** `BrowserConfig` for **global** browser settings: engine, headless, proxy, user agent.
- **Use** `CrawlerRunConfig` for each crawl’s **context**: how to filter content, handle caching, wait for dynamic elements, or run JS.
- **Pass** both configs to `AsyncWebCrawler` (the `BrowserConfig`) and then to `arun()` (the `CrawlerRunConfig`).
- **Use** `LLMConfig` for LLM provider configurations that can be used across all extraction, filtering, schema generation tasks. Can be used in - `LLMExtractionStrategy`, `LLMContentFilter`, `JsonCssExtractionStrategy.generate_schema` & `JsonXPathExtractionStrategy.generate_schema`
```python
# Create a modified copy with the clone() method
stream_cfg = run_cfg.clone(
stream=True,
cache_mode=CacheMode.BYPASS
)
```
## Crawling Patterns
## Simple Crawling
## Basic Usage
Set up a simple crawl using `BrowserConfig` and `CrawlerRunConfig`:
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
async def main():
browser_config = BrowserConfig() # Default browser configuration
run_config = CrawlerRunConfig() # Default crawl run configuration
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url="https://example.com",
config=run_config
)
print(result.markdown) # Print clean markdown content
if __name__ == "__main__":
asyncio.run(main())
```
## Understanding the Response
The `arun()` method returns a `CrawlResult` object with several useful properties. Here's a quick overview (see [CrawlResult](../api/crawl-result.md) for complete details):
```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import PruningContentFilter

config = CrawlerRunConfig(
markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter(threshold=0.6),
options={"ignore_links": True}
)
)
result = await crawler.arun(
url="https://example.com",
config=config
)
# Different content formats
print(result.html) # Raw HTML
print(result.cleaned_html) # Cleaned HTML
print(result.markdown.raw_markdown) # Raw markdown from cleaned html
print(result.markdown.fit_markdown) # Most relevant content in markdown
# Check success status
print(result.success) # True if crawl succeeded
print(result.status_code) # HTTP status code (e.g., 200, 404)
# Access extracted media and links
print(result.media) # Dictionary of found media (images, videos, audio)
print(result.links) # Dictionary of internal and external links
```
## Adding Basic Options
Customize your crawl using `CrawlerRunConfig`:
```python
run_config = CrawlerRunConfig(
word_count_threshold=10, # Minimum words per content block
exclude_external_links=True, # Remove external links
remove_overlay_elements=True, # Remove popups/modals
process_iframes=True # Process iframe content
)
result = await crawler.arun(
url="https://example.com",
config=run_config
)
```
## Handling Errors
```python
run_config = CrawlerRunConfig()
result = await crawler.arun(url="https://example.com", config=run_config)
if not result.success:
print(f"Crawl failed: {result.error_message}")
print(f"Status code: {result.status_code}")
```
## Logging and Debugging
Enable verbose logging in `BrowserConfig`:
```python
browser_config = BrowserConfig(verbose=True)
async with AsyncWebCrawler(config=browser_config) as crawler:
run_config = CrawlerRunConfig()
result = await crawler.arun(url="https://example.com", config=run_config)
```
## Complete Example
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig, CacheMode
async def main():
browser_config = BrowserConfig(verbose=True)
run_config = CrawlerRunConfig(
# Content filtering
word_count_threshold=10,
excluded_tags=['form', 'header'],
exclude_external_links=True,
# Content processing
process_iframes=True,
remove_overlay_elements=True,
# Cache control
cache_mode=CacheMode.ENABLED # Use cache if available
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url="https://example.com",
config=run_config
)
if result.success:
# Print clean content
print("Content:", result.markdown[:500]) # First 500 chars
# Process images
for image in result.media["images"]:
print(f"Found image: {image['src']}")
# Process links
for link in result.links["internal"]:
print(f"Internal link: {link['href']}")
else:
print(f"Crawl failed: {result.error_message}")
if __name__ == "__main__":
asyncio.run(main())
```
## Content Processing
## Markdown Generation Basics
1. How to configure the **Default Markdown Generator**
2. The difference between raw markdown (`result.markdown`) and filtered markdown (`fit_markdown`)
> Prerequisite: you know how to configure `CrawlerRunConfig`.
## 1. Quick Example
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
async def main():
config = CrawlerRunConfig(
markdown_generator=DefaultMarkdownGenerator()
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://example.com", config=config)
if result.success:
... (truncated)
```
### scripts/extraction_pipeline.py
```python
#!/usr/bin/env python3
"""
Crawl4AI extraction pipeline - Three approaches:
1. Generate schema with LLM (one-time) then use CSS extraction (most efficient)
2. Manual CSS/JSON schema extraction
3. Direct LLM extraction (for complex/irregular content)
Usage examples:
Generate schema: python extraction_pipeline.py --generate-schema <url> "<instruction>"
Use generated schema: python extraction_pipeline.py --use-schema <url> schema.json
Manual CSS: python extraction_pipeline.py --css <url> "<css_selector>"
Direct LLM: python extraction_pipeline.py --llm <url> "<instruction>"
"""
import asyncio
import sys
import json
from pathlib import Path
# Version check
MIN_CRAWL4AI_VERSION = "0.7.4"
try:
from crawl4ai.__version__ import __version__
from packaging import version
if version.parse(__version__) < version.parse(MIN_CRAWL4AI_VERSION):
print(f"⚠️ Warning: Crawl4AI {MIN_CRAWL4AI_VERSION}+ recommended (you have {__version__})")
except ImportError:
print(f"ℹ️ Crawl4AI {MIN_CRAWL4AI_VERSION}+ required")
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.extraction_strategy import (
LLMExtractionStrategy,
JsonCssExtractionStrategy,
CosineStrategy
)
# =============================================================================
# APPROACH 1: Generate Schema (Most Efficient for Repetitive Patterns)
# =============================================================================
async def generate_schema(url: str, instruction: str, output_file: str = "generated_schema.json"):
"""
Step 1: Generate a reusable schema using LLM (one-time cost)
Best for: E-commerce sites, blogs, news sites with repetitive patterns
"""
print("🔍 Generating extraction schema using LLM...")
browser_config = BrowserConfig(headless=True)
# Use LLM to analyze the page structure and generate schema
extraction_strategy = LLMExtractionStrategy(
provider="openai/gpt-4o-mini", # Can use any LLM provider
instruction=f"""
Analyze this webpage and generate a CSS/JSON extraction schema.
Task: {instruction}
Return a JSON schema with CSS selectors that can extract the required data.
Format:
{{
"name": "items",
"selector": "main_container_selector",
"fields": [
{{"name": "field1", "selector": "css_selector", "type": "text"}},
{{"name": "field2", "selector": "css_selector", "type": "link"}},
// more fields...
]
}}
Make selectors as specific as possible to avoid false matches.
"""
)
crawler_config = CrawlerRunConfig(
extraction_strategy=extraction_strategy,
wait_for="css:body",
remove_overlay_elements=True
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url=url, config=crawler_config)
if result.success and result.extracted_content:
try:
# Parse and save the generated schema
schema = json.loads(result.extracted_content)
# Validate and enhance schema
if "name" not in schema:
schema["name"] = "items"
if "fields" not in schema:
print("⚠️ Generated schema missing fields, using fallback")
schema = {
"name": "items",
"baseSelector": "div.item, article, .product",
"fields": [
{"name": "title", "selector": "h1, h2, h3", "type": "text"},
{"name": "description", "selector": "p", "type": "text"},
{"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
]
}
# Save schema
with open(output_file, "w") as f:
json.dump(schema, f, indent=2)
print(f"✅ Schema generated and saved to: {output_file}")
print(f"📋 Schema structure:")
print(json.dumps(schema, indent=2))
return schema
except json.JSONDecodeError as e:
print(f"❌ Failed to parse generated schema: {e}")
print("Raw output:", result.extracted_content[:500])
return None
else:
print(f"❌ Failed to generate schema: {result.error_message if result else 'Unknown error'}")
return None
async def use_generated_schema(url: str, schema_file: str):
"""
Step 2: Use the generated schema for fast, repeated extractions
No LLM calls needed - pure CSS extraction
"""
print(f"📂 Loading schema from: {schema_file}")
try:
with open(schema_file, "r") as f:
schema = json.load(f)
except FileNotFoundError:
print(f"❌ Schema file not found: {schema_file}")
print("💡 Generate a schema first using: python extraction_pipeline.py --generate-schema <url> \"<instruction>\"")
return None
print("🚀 Extracting data using generated schema (no LLM calls)...")
extraction_strategy = JsonCssExtractionStrategy(
schema=schema,
verbose=True
)
crawler_config = CrawlerRunConfig(
extraction_strategy=extraction_strategy,
wait_for="css:body"
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url=url, config=crawler_config)
if result.success and result.extracted_content:
data = json.loads(result.extracted_content)
items = data.get(schema.get("name", "items"), [])
print(f"✅ Extracted {len(items)} items using schema")
# Save results
with open("extracted_data.json", "w") as f:
json.dump(data, f, indent=2)
print("💾 Saved to extracted_data.json")
# Show sample
if items:
print("\n📋 Sample (first item):")
print(json.dumps(items[0], indent=2))
return data
else:
print(f"❌ Extraction failed: {result.error_message if result else 'Unknown error'}")
return None
# =============================================================================
# APPROACH 2: Manual Schema Definition
# =============================================================================
async def extract_with_manual_schema(url: str, schema: dict = None):
"""
Use a manually defined CSS/JSON schema
Best for: When you know the exact structure of the website
"""
if not schema:
# Example schema for general content extraction
schema = {
"name": "content",
"baseSelector": "body",  # top-level container (note: the key is 'baseSelector', not 'selector')
"fields": [
{"name": "title", "selector": "h1", "type": "text"},
{"name": "paragraphs", "selector": "p", "type": "text", "all": True},
{"name": "links", "selector": "a", "type": "attribute", "attribute": "href", "all": True}
]
}
print("📐 Using manual CSS/JSON schema for extraction...")
extraction_strategy = JsonCssExtractionStrategy(
schema=schema,
verbose=True
)
crawler_config = CrawlerRunConfig(
extraction_strategy=extraction_strategy
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url=url, config=crawler_config)
if result.success and result.extracted_content:
data = json.loads(result.extracted_content)
# Handle both list and dict formats
if isinstance(data, list):
items = data
else:
items = data.get(schema["name"], [])
print(f"✅ Extracted {len(items)} items using manual schema")
with open("manual_extracted.json", "w") as f:
json.dump(data, f, indent=2)
print("💾 Saved to manual_extracted.json")
return data
else:
print(f"❌ Extraction failed")
return None
# =============================================================================
# APPROACH 3: Direct LLM Extraction
# =============================================================================
async def extract_with_llm(url: str, instruction: str):
"""
Direct LLM extraction - uses LLM for every request
Best for: Complex, irregular content or one-time extractions
Note: Most expensive approach, use sparingly
"""
print("🤖 Using direct LLM extraction...")
browser_config = BrowserConfig(headless=True)
extraction_strategy = LLMExtractionStrategy(
provider="openai/gpt-4o-mini", # Can change to ollama/llama3, anthropic/claude, etc.
instruction=instruction,
schema={
"type": "object",
"properties": {
"items": {
"type": "array",
"items": {"type": "object"}
},
"summary": {"type": "string"}
}
}
)
crawler_config = CrawlerRunConfig(
extraction_strategy=extraction_strategy,
wait_for="css:body",
remove_overlay_elements=True
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url=url, config=crawler_config)
if result.success and result.extracted_content:
try:
data = json.loads(result.extracted_content)
items = data.get('items', [])
print(f"✅ LLM extracted {len(items)} items")
print(f"📝 Summary: {data.get('summary', 'N/A')}")
with open("llm_extracted.json", "w") as f:
json.dump(data, f, indent=2)
print("💾 Saved to llm_extracted.json")
if items:
print("\n📋 Sample (first item):")
print(json.dumps(items[0], indent=2))
return data
except json.JSONDecodeError:
print("⚠️ Could not parse LLM output as JSON")
print(result.extracted_content[:500])
return None
else:
print(f"❌ LLM extraction failed")
return None
# =============================================================================
# Main CLI Interface
# =============================================================================
async def main():
if len(sys.argv) < 3:
print("""
Crawl4AI Extraction Pipeline - Three Approaches
1️⃣ GENERATE & USE SCHEMA (Most Efficient for Repetitive Patterns):
Step 1: Generate schema (one-time LLM cost)
python extraction_pipeline.py --generate-schema <url> "<what to extract>"
Step 2: Use schema for fast extraction (no LLM)
python extraction_pipeline.py --use-schema <url> generated_schema.json
2️⃣ MANUAL SCHEMA (When You Know the Structure):
python extraction_pipeline.py --manual <url>
(Edit the schema in the script for your needs)
3️⃣ DIRECT LLM (For Complex/Irregular Content):
python extraction_pipeline.py --llm <url> "<extraction instruction>"
Examples:
# E-commerce products
python extraction_pipeline.py --generate-schema https://shop.com "Extract all products with name, price, image"
python extraction_pipeline.py --use-schema https://shop.com generated_schema.json
# News articles
python extraction_pipeline.py --generate-schema https://news.com "Extract headlines, dates, and summaries"
# Complex content
python extraction_pipeline.py --llm https://complex-site.com "Extract financial data and quarterly reports"
""")
sys.exit(1)
mode = sys.argv[1]
url = sys.argv[2]
if mode == "--generate-schema":
if len(sys.argv) < 4:
print("Error: Missing extraction instruction")
print("Usage: python extraction_pipeline.py --generate-schema <url> \"<instruction>\"")
sys.exit(1)
instruction = sys.argv[3]
output_file = sys.argv[4] if len(sys.argv) > 4 else "generated_schema.json"
await generate_schema(url, instruction, output_file)
elif mode == "--use-schema":
if len(sys.argv) < 4:
print("Error: Missing schema file")
print("Usage: python extraction_pipeline.py --use-schema <url> <schema.json>")
sys.exit(1)
schema_file = sys.argv[3]
await use_generated_schema(url, schema_file)
elif mode == "--manual":
await extract_with_manual_schema(url)
elif mode == "--llm":
if len(sys.argv) < 4:
print("Error: Missing extraction instruction")
print("Usage: python extraction_pipeline.py --llm <url> \"<instruction>\"")
sys.exit(1)
instruction = sys.argv[3]
await extract_with_llm(url, instruction)
else:
print(f"Unknown mode: {mode}")
print("Use --generate-schema, --use-schema, --manual, or --llm")
sys.exit(1)
if __name__ == "__main__":
asyncio.run(main())
```
### scripts/google_search.py
```python
#!/usr/bin/env python3
"""
Google Search Scraper using Crawl4AI
Usage: python google_search.py "<search query>" [max_results]
Example: python google_search.py "2026年Go语言展望" 20
"""
import asyncio
import sys
import json
import urllib.parse
from typing import List, Dict
try:
from crawl4ai.__version__ import __version__
from packaging import version
MIN_CRAWL4AI_VERSION = "0.7.4"
if version.parse(__version__) < version.parse(MIN_CRAWL4AI_VERSION):
print(f"⚠️ Warning: Crawl4AI {MIN_CRAWL4AI_VERSION}+ recommended (you have {__version__})")
except ImportError:
print(f"ℹ️ Crawl4AI {MIN_CRAWL4AI_VERSION}+ required")
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy, LLMExtractionStrategy
async def search_google_css(query: str, max_results: int = 20) -> List[Dict]:
"""
Extract Google search results using the CSS selector strategy (fastest, no LLM required)
"""
# Build the search URL
encoded_query = urllib.parse.quote(query)
search_url = f"https://www.google.com/search?q={encoded_query}&num={max_results}"
print(f"🔍 Searching: {query}")
print(f"📊 Max results: {max_results}")
print(f"🌐 URL: {search_url}")
# Define the CSS schema for Google search results
# Google's HTML structure changes over time; these are commonly seen selectors
schema = {
"name": "search_results",
"baseSelector": "div.g, div[data-hveid], div.tF2Cxc, div.yuRUbf",
"fields": [
{
"name": "title",
"selector": "h3, h3.LC20lb, div[role='heading']",
"type": "text"
},
{
"name": "link",
"selector": "a",
"type": "attribute",
"attribute": "href"
},
{
"name": "description",
"selector": "div.VwiC3b, div.s, div.ITZIwc, span.aCOpRe",
"type": "text"
},
{
"name": "site_name",
"selector": "div.NJo7tc, span.VuuXrf, cite",
"type": "text"
}
]
}
browser_config = BrowserConfig(
headless=True,
viewport_width=1920,
viewport_height=1080,
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)
crawler_config = CrawlerRunConfig(
extraction_strategy=JsonCssExtractionStrategy(schema=schema, verbose=True),
cache_mode=CacheMode.BYPASS,
wait_for="css:div.g, div.search, body",
page_timeout=30000,
js_code=[
            # Give the page time to finish loading
"const waitFor = (ms) => new Promise(resolve => setTimeout(resolve, ms));",
"await waitFor(2000);"
]
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url=search_url, config=crawler_config)
if result.success:
print("✅ Successfully fetched search results")
if result.extracted_content:
try:
                    data = json.loads(result.extracted_content)
                    # Handle both list and dict result shapes
                    if isinstance(data, list):
                        results = data
                    else:
                        results = data.get("search_results", data.get("results", []))
                    # Filter out empty and invalid entries, deduplicating by URL
                    seen = set()
                    valid_results = []
                    for r in results:
                        if r.get("title") and r.get("link"):
                            # Clean the URL (Google sometimes prefixes it with /url?q=)
                            link = r["link"]
                            if link.startswith("/url?q="):
                                parsed = urllib.parse.urlparse(link)
                                link = urllib.parse.parse_qs(parsed.query).get("q", [link])[0]
                            r["link"] = link
                            if link not in seen:
                                seen.add(link)
                                valid_results.append(r)
                    print(f"📋 Extracted {len(valid_results)} valid results")
                    return valid_results[:max_results]
except json.JSONDecodeError as e:
print(f"❌ Failed to parse extracted content: {e}")
print("Raw output:", result.extracted_content[:500] if result.extracted_content else "None")
return []
else:
print("⚠️ No extracted content, trying alternative method...")
return await search_google_llm_fallback(query, max_results)
else:
print(f"❌ Failed: {result.error_message}")
print("Trying fallback method...")
return await search_google_llm_fallback(query, max_results)
async def search_google_llm_fallback(query: str, max_results: int = 20) -> List[Dict]:
"""
使用 LLM 作为备选方案提取搜索结果
注意:这需要配置 LLM API 密钥
"""
print("🤖 Using LLM fallback extraction...")
encoded_query = urllib.parse.quote(query)
search_url = f"https://www.google.com/search?q={encoded_query}&num={max_results}"
# 尝试使用简单的 LLM 提取
extraction_strategy = LLMExtractionStrategy(
provider="openai/gpt-4o-mini",
instruction=f"""
Extract the top {max_results} search results from this Google search page for "{query}".
For each search result, extract:
1. Title - the blue link text
2. Link - the URL (clean the URL, remove /url?q= prefix if present)
3. Description - the gray text snippet below the title
4. Site name - the green text showing the website name
Return as JSON with a "results" array containing objects with these fields.
Skip any ads or sponsored content.
"""
)
crawler_config = CrawlerRunConfig(
extraction_strategy=extraction_strategy,
cache_mode=CacheMode.BYPASS,
page_timeout=30000
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url=search_url, config=crawler_config)
if result.success and result.extracted_content:
try:
data = json.loads(result.extracted_content)
return data.get("results", [])
except json.JSONDecodeError:
print("⚠️ LLM output could not be parsed as JSON")
return []
else:
print(f"❌ Fallback also failed: {result.error_message}")
return []
async def search_google_with_html_parsing(query: str, max_results: int = 20) -> List[Dict]:
"""
直接解析 HTML 作为最后的备选方案
"""
print("🔧 Using direct HTML parsing...")
encoded_query = urllib.parse.quote(query)
search_url = f"https://www.google.com/search?q={encoded_query}&num={max_results}"
browser_config = BrowserConfig(
headless=True,
viewport_width=1920,
viewport_height=1080,
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)
crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
wait_for="css:body",
page_timeout=30000,
js_code=[
"const waitFor = (ms) => new Promise(resolve => setTimeout(resolve, ms));",
"await waitFor(3000);"
]
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url=search_url, config=crawler_config)
if result.success and result.html:
            from bs4 import BeautifulSoup
            soup = BeautifulSoup(result.html, 'html.parser')
            results = []
            # Google search results usually live in div.g containers
            for div in soup.select('div.g, div.tF2Cxc'):
                try:
                    # Title
                    title_elem = div.select_one('h3')
                    title = title_elem.get_text() if title_elem else ""
                    # Link
                    link_elem = div.select_one('a')
                    link = link_elem.get('href', '') if link_elem else ""
                    # Clean Google redirect links
                    if link.startswith('/url?q='):
                        parsed = urllib.parse.urlparse(link)
                        link = urllib.parse.unquote(
                            urllib.parse.parse_qs(parsed.query).get('q', [link])[0]
                        )
                    # Description snippet
                    desc_elem = div.select_one('div.VwiC3b, div.s, span.aCOpRe')
                    description = desc_elem.get_text() if desc_elem else ""
                    # Site name
                    site_elem = div.select_one('div.NJo7tc, span.VuuXrf, cite')
                    site_name = site_elem.get_text() if site_elem else ""
if title and link and not link.startswith('#'):
results.append({
"title": title.strip(),
"link": link.strip(),
"description": description.strip(),
"site_name": site_name.strip()
})
if len(results) >= max_results:
break
                except Exception:
                    continue
print(f"📋 Parsed {len(results)} results from HTML")
return results
else:
print(f"❌ HTML parsing failed")
return []
async def main():
if len(sys.argv) < 2:
print("Usage: python google_search.py \"<search query>\" [max_results]")
print("Example: python google_search.py \"2026年Go语言展望\" 20")
sys.exit(1)
    query = sys.argv[1]
    max_results = int(sys.argv[2]) if len(sys.argv) > 2 else 20
    # Method 1: CSS extraction
    results = await search_google_css(query, max_results)
    # Fall back to direct HTML parsing if CSS extraction fails
    if not results:
        results = await search_google_with_html_parsing(query, max_results)
    # Emit results
if results:
output = {
"query": query,
"total_results": len(results),
"results": results
}
print("\n" + "="*60)
print(f"✅ Successfully extracted {len(results)} search results")
print("="*60)
        # Save results to a file
output_file = "google_search_results.json"
with open(output_file, "w", encoding="utf-8") as f:
json.dump(output, f, ensure_ascii=False, indent=2)
print(f"\n💾 Results saved to: {output_file}")
print("\n📋 Preview (first 3 results):")
print(json.dumps(results[:3], ensure_ascii=False, indent=2))
        # Print the full JSON to stdout
print("\n" + "="*60)
print("FULL JSON OUTPUT:")
print("="*60)
print(json.dumps(output, ensure_ascii=False, indent=2))
else:
print("❌ No results extracted. Please check:")
print(" 1. Your internet connection")
print(" 2. Whether Google is blocking the request (try with headless=False)")
print(" 3. The CSS selectors (Google might have changed their HTML)")
if __name__ == "__main__":
asyncio.run(main())
```
---
## Skill Companion Files
> Additional files collected from the skill directory layout.
### README.md
```markdown
# Crawl4AI Skill
> A powerful web crawling and data extraction skill with JavaScript rendering, structured data extraction, and multi-URL batch processing.
Base implementation derived from [crawl4ai-skill](https://github.com/brettdavies/crawl4ai-skill).
## Features
- **Smart crawling** - handles JavaScript-rendered pages automatically
- **Structured extraction** - CSS-selector and LLM extraction modes
- **Markdown generation** - converts web pages into clean, formatted Markdown
- **Batch processing** - handles many URLs efficiently
- **Session management** - supports login authentication and state persistence
- **Anti-bot countermeasures** - built-in stealth options and proxy support
- **Google search** - dedicated search-result extraction script
## Installation
```bash
# Install crawl4ai
pip install crawl4ai
# Install the Playwright browsers
crawl4ai-setup
# Verify the installation
crawl4ai-doctor
```
## Quick Start
### CLI mode (recommended)
```bash
# Basic crawl, Markdown output
crwl https://example.com
# JSON output
crwl https://example.com -o json
# Bypass cache, verbose output
crwl https://example.com -o json -v --bypass-cache
```
### Python SDK
```python
import asyncio
from crawl4ai import AsyncWebCrawler
async def main():
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://example.com")
print(result.markdown[:500])
asyncio.run(main())
```
## Usage Examples
### Google search scraping
```bash
# Search and extract the top 20 results
python scripts/google_search.py "search keywords" 20
# Example
python scripts/google_search.py "2026年Go语言展望" 20
```
**Output format:**
```json
{
  "query": "search keywords",
  "total_results": 20,
  "results": [
    {
      "title": "Result title",
      "link": "https://example.com",
      "description": "Result snippet",
      "site_name": "Site name"
    }
  ]
}
```
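The saved `google_search_results.json` can be post-processed with the standard library alone; a minimal sketch (the sample payload below is illustrative data in the shape `google_search.py` writes out):

```python
import json

# Sample payload shaped like google_search_results.json (illustrative data)
output = {
    "query": "example query",
    "total_results": 2,
    "results": [
        {"title": "A", "link": "https://example.com/a", "description": "", "site_name": "example.com"},
        {"title": "B", "link": "https://example.com/b", "description": "", "site_name": "example.com"},
    ],
}

# Round-trip through JSON as a file-based pipeline would, then pull just the links
data = json.loads(json.dumps(output, ensure_ascii=False))
links = [r["link"] for r in data["results"]]
print(links)  # → ['https://example.com/a', 'https://example.com/b']
```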
### Data extraction
#### 1. CSS selector extraction (fastest, no LLM)
```bash
# Generate an extraction schema
python scripts/extraction_pipeline.py --generate-schema https://shop.com "Extract all product information"
# Extract using the schema
crwl https://shop.com -e extract_css.yml -s schema.json -o json
```
**Schema format:**
```json
{
"name": "products",
"baseSelector": ".product-card",
"fields": [
{"name": "title", "selector": "h2", "type": "text"},
{"name": "price", "selector": ".price", "type": "text"},
{"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
]
}
```
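Schemas like the one above are easy to get subtly wrong (for example, writing `selector` instead of `baseSelector` at the top level). A small stdlib-only sanity check can catch that before a crawl; the `check_schema` helper here is illustrative, not part of crawl4ai:

```python
def check_schema(schema: dict) -> list:
    """Return a list of problems found in an extraction schema dict."""
    problems = []
    # Top-level keys the CSS extraction schema format uses
    for key in ("name", "baseSelector", "fields"):
        if key not in schema:
            problems.append(f"missing top-level key: {key}")
    for i, field in enumerate(schema.get("fields", [])):
        if "name" not in field or "selector" not in field:
            problems.append(f"field {i} needs 'name' and 'selector'")
        # attribute-type fields also need the attribute name to read
        if field.get("type") == "attribute" and "attribute" not in field:
            problems.append(f"field {i} is type 'attribute' but has no 'attribute' key")
    return problems

schema = {
    "name": "products",
    "baseSelector": ".product-card",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"},
    ],
}
print(check_schema(schema))  # → []
```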
#### 2. LLM extraction
```yaml
# extract_llm.yml
type: "llm"
provider: "openai/gpt-4o-mini"
instruction: "Extract product names and prices"
api_token: "your-api-token"
```
```bash
crwl https://shop.com -e extract_llm.yml -o json
```
### Markdown generation and filtering
```bash
# Basic Markdown
crwl https://docs.example.com -o markdown > docs.md
# Filtered Markdown (noise removed)
crwl https://docs.example.com -o markdown-fit
# BM25 content filtering
crwl https://docs.example.com -f filter_bm25.yml -o markdown-fit
```
**Filter configuration:**
```yaml
# filter_bm25.yml
type: "bm25"
query: "machine learning tutorial"
threshold: 1.0
```
### Dynamic content handling
```bash
# Wait for a specific element to load
crwl https://example.com -c "wait_for=css:.ajax-content,page_timeout=60000"
# Scan the whole page
crwl https://example.com -c "scan_full_page=true,delay_before_return_html=2.0"
```
### Batch processing
```python
# Concurrent processing with the Python SDK
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    urls = [
        "https://site1.com",
        "https://site2.com",
        "https://site3.com"
    ]
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls, config=CrawlerRunConfig())
        print(f"Crawled {len(results)} pages")

asyncio.run(main())
```
### Login authentication
```yaml
# login_crawler.yml
session_id: "user_session"
js_code: |
  document.querySelector('#username').value = 'user';
  document.querySelector('#password').value = 'pass';
  document.querySelector('#submit').click();
wait_for: "css:.dashboard"
```
```bash
# Log in first
crwl https://site.com/login -C login_crawler.yml
# Then access protected content
crwl https://site.com/protected -c "session_id=user_session"
```
## Directory Structure
```
crawl4ai/
├── README.md                     # This file
├── SKILL.md                      # Detailed skill documentation
├── scripts/                      # Utility scripts
│   ├── google_search.py          # Google search scraper
│   ├── extraction_pipeline.py    # Data extraction pipeline
│   ├── basic_crawler.py          # Basic crawler
│   └── batch_crawler.py          # Batch crawler
├── references/                   # Reference docs
│   ├── cli-guide.md              # Full CLI guide
│   ├── sdk-guide.md              # SDK quick reference
│   └── complete-sdk-reference.md # Complete API reference
└── tests/                        # Test files
    ├── README.md
    ├── run_all_tests.py
    ├── test_basic_crawling.py
    ├── test_data_extraction.py
    ├── test_markdown_generation.py
    └── test_advanced_patterns.py
```
## Provided Scripts
| Script | Purpose |
|--------|---------|
| `google_search.py` | Scrape Google search results, JSON output |
| `extraction_pipeline.py` | Three extraction strategies: CSS / LLM / manual |
| `basic_crawler.py` | Basic page crawling with screenshot support |
| `batch_crawler.py` | Batch URL processing |
## Configuration
### BrowserConfig (browser settings)
| Parameter | Description | Default |
|-----------|-------------|---------|
| `headless` | Headless mode | `true` |
| `viewport_width` | Viewport width | `1920` |
| `viewport_height` | Viewport height | `1080` |
| `user_agent` | User agent | random |
| `proxy_config` | Proxy settings | `null` |
### CrawlerRunConfig (crawler settings)
| Parameter | Description | Default |
|-----------|-------------|---------|
| `page_timeout` | Page timeout (ms) | `30000` |
| `wait_for` | Wait condition | `null` |
| `cache_mode` | Cache mode | `enabled` |
| `js_code` | JavaScript to execute | `null` |
| `css_selector` | CSS selector | `null` |
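As a hedged sketch, these parameters can also live in a YAML config file of the kind shown elsewhere in this README; key names are assumed to mirror the table exactly, so verify against the CLI guide before relying on them:

```yaml
# crawler.yml - CrawlerRunConfig values taken from the table (assumed key names)
page_timeout: 60000
wait_for: "css:.content"
js_code: |
  window.scrollTo(0, document.body.scrollHeight);
```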
## Best Practices
1. **Prefer the CLI** - use the CLI for quick tasks, the SDK for automation
2. **Use schema extraction** - 10-100x faster than LLM extraction, at zero cost
3. **Keep caching on during development** - use `--bypass-cache` only when needed
4. **Set sensible timeouts** - 30s for normal sites, 60s+ for JS-heavy sites
5. **Use content filtering** - for cleaner Markdown output
6. **Respect rate limits** - add delays between requests
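Practice 6 (rate limiting) can be handled with plain asyncio when driving the SDK; a minimal sketch in which `fetch` is a stand-in for a real `crawler.arun(url)` call:

```python
import asyncio

async def fetch(url: str) -> str:
    # Placeholder for a real crawler.arun(url) call
    return f"crawled {url}"

async def polite_crawl(urls, delay: float = 0.1):
    """Crawl URLs sequentially with a fixed pause between requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            await asyncio.sleep(delay)  # pause between consecutive requests
        results.append(await fetch(url))
    return results

results = asyncio.run(polite_crawl(["https://example.com", "https://example.org"], delay=0.05))
print(results)
```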
## Troubleshooting
### JavaScript content not loading
```bash
crwl https://example.com -c "wait_for=css:.dynamic-content,page_timeout=60000"
```
### Blocked by anti-bot detection
```yaml
# browser.yml
headless: false
user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
user_agent_mode: "random"
```
### Empty extraction results
```bash
# Debug mode: inspect the full output
crwl https://example.com -o all -v
# Try a different wait strategy
crwl https://example.com -c "wait_for=js:document.querySelector('.content')!==null"
```
## Documentation
- [Full CLI guide](references/cli-guide.md) - command-line interface in depth
- [SDK quick reference](references/sdk-guide.md) - Python SDK cheat sheet
- [Complete API reference](references/complete-sdk-reference.md) - 5900+ lines of full reference
## License
MIT License
## Links
- [Crawl4AI repository](https://github.com/unclecode/crawl4ai)
- [Playwright documentation](https://playwright.dev/python/)
- [BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
```
### scripts/basic_crawler.py
```python
#!/usr/bin/env python3
"""
Basic Crawl4AI crawler template
Usage: python basic_crawler.py <url>
"""
import asyncio
import sys
# Version check
MIN_CRAWL4AI_VERSION = "0.7.4"
try:
from crawl4ai.__version__ import __version__
from packaging import version
if version.parse(__version__) < version.parse(MIN_CRAWL4AI_VERSION):
print(f"⚠️ Warning: Crawl4AI {MIN_CRAWL4AI_VERSION}+ recommended (you have {__version__})")
except ImportError:
print(f"ℹ️ Crawl4AI {MIN_CRAWL4AI_VERSION}+ required")
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
async def crawl_basic(url: str):
"""Basic crawling with markdown output"""
# Configure browser
browser_config = BrowserConfig(
headless=True,
viewport_width=1920,
viewport_height=1080
)
# Configure crawler
crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
remove_overlay_elements=True,
wait_for_images=True,
screenshot=True
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url=url,
config=crawler_config
)
if result.success:
print(f"✅ Crawled: {result.url}")
print(f" Title: {result.metadata.get('title', 'N/A')}")
print(f" Links found: {len(result.links.get('internal', []))} internal, {len(result.links.get('external', []))} external")
print(f" Media found: {len(result.media.get('images', []))} images, {len(result.media.get('videos', []))} videos")
print(f" Content length: {len(result.markdown)} chars")
# Save markdown
with open("output.md", "w") as f:
f.write(result.markdown)
print("📄 Saved to output.md")
# Save screenshot if available
if result.screenshot:
# Check if screenshot is base64 string or bytes
if isinstance(result.screenshot, str):
import base64
screenshot_data = base64.b64decode(result.screenshot)
else:
screenshot_data = result.screenshot
with open("screenshot.png", "wb") as f:
f.write(screenshot_data)
print("📸 Saved screenshot.png")
else:
print(f"❌ Failed: {result.error_message}")
return result
if __name__ == "__main__":
if len(sys.argv) < 2:
print("Usage: python basic_crawler.py <url>")
sys.exit(1)
url = sys.argv[1]
asyncio.run(crawl_basic(url))
```
### scripts/batch_crawler.py
```python
#!/usr/bin/env python3
"""
Crawl4AI batch/multi-URL crawler with concurrent processing
Usage: python batch_crawler.py urls.txt [--max-concurrent 5]
"""
import asyncio
import sys
import json
from pathlib import Path
from typing import List, Dict, Any
# Version check
MIN_CRAWL4AI_VERSION = "0.7.4"
try:
from crawl4ai.__version__ import __version__
from packaging import version
if version.parse(__version__) < version.parse(MIN_CRAWL4AI_VERSION):
print(f"⚠️ Warning: Crawl4AI {MIN_CRAWL4AI_VERSION}+ recommended (you have {__version__})")
except ImportError:
print(f"ℹ️ Crawl4AI {MIN_CRAWL4AI_VERSION}+ required")
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
async def crawl_batch(urls: List[str], max_concurrent: int = 5):
"""
Crawl multiple URLs efficiently with concurrent processing
"""
print(f"🚀 Starting batch crawl of {len(urls)} URLs (max {max_concurrent} concurrent)")
# Configure browser for efficiency
browser_config = BrowserConfig(
headless=True,
viewport_width=1280,
viewport_height=800,
verbose=False
)
# Configure crawler
crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
remove_overlay_elements=True,
wait_for="css:body",
page_timeout=30000, # 30 seconds timeout per page
screenshot=False # Disable screenshots for batch processing
)
results = []
failed = []
async with AsyncWebCrawler(config=browser_config) as crawler:
# Use arun_many for efficient batch processing
batch_results = await crawler.arun_many(
urls=urls,
config=crawler_config,
max_concurrent=max_concurrent
)
for result in batch_results:
if result.success:
results.append({
"url": result.url,
"title": result.metadata.get("title", ""),
"description": result.metadata.get("description", ""),
"content_length": len(result.markdown),
"links_count": len(result.links.get("internal", [])) + len(result.links.get("external", [])),
"images_count": len(result.media.get("images", [])),
})
print(f"✅ {result.url}")
else:
failed.append({
"url": result.url,
"error": result.error_message
})
print(f"❌ {result.url}: {result.error_message}")
# Save results
output = {
"success_count": len(results),
"failed_count": len(failed),
"results": results,
"failed": failed
}
with open("batch_results.json", "w") as f:
json.dump(output, f, indent=2)
# Save individual markdown files
markdown_dir = Path("batch_markdown")
markdown_dir.mkdir(exist_ok=True)
for i, result in enumerate(batch_results):
if result.success:
# Create safe filename from URL
safe_name = result.url.replace("https://", "").replace("http://", "")
safe_name = "".join(c if c.isalnum() or c in "-_" else "_" for c in safe_name)[:100]
file_path = markdown_dir / f"{i:03d}_{safe_name}.md"
with open(file_path, "w") as f:
f.write(f"# {result.metadata.get('title', result.url)}\n\n")
f.write(f"URL: {result.url}\n\n")
f.write(result.markdown)
print(f"\n📊 Batch Crawl Complete:")
print(f" ✅ Success: {len(results)}")
print(f" ❌ Failed: {len(failed)}")
print(f" 💾 Results saved to: batch_results.json")
print(f" 📁 Markdown files saved to: {markdown_dir}/")
return output
async def crawl_with_extraction(urls: List[str], schema_file: str = None):
"""
Batch crawl with structured data extraction
"""
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
schema = None
if schema_file and Path(schema_file).exists():
with open(schema_file) as f:
schema = json.load(f)
print(f"📋 Using extraction schema from: {schema_file}")
else:
        # Default schema for general page content
        # (top-level key must be baseSelector, matching the schemas used elsewhere)
        schema = {
            "name": "content",
            "baseSelector": "body",
            "fields": [
                {"name": "headings", "selector": "h1, h2, h3", "type": "text", "all": True},
                {"name": "paragraphs", "selector": "p", "type": "text", "all": True},
                {"name": "links", "selector": "a[href]", "type": "attribute", "attribute": "href", "all": True}
            ]
        }
extraction_strategy = JsonCssExtractionStrategy(schema=schema)
crawler_config = CrawlerRunConfig(
extraction_strategy=extraction_strategy,
cache_mode=CacheMode.BYPASS
)
extracted_data = []
async with AsyncWebCrawler() as crawler:
results = await crawler.arun_many(
urls=urls,
config=crawler_config,
max_concurrent=5
)
for result in results:
if result.success and result.extracted_content:
try:
data = json.loads(result.extracted_content)
extracted_data.append({
"url": result.url,
"data": data
})
print(f"✅ Extracted from: {result.url}")
except json.JSONDecodeError:
print(f"⚠️ Failed to parse JSON from: {result.url}")
# Save extracted data
with open("batch_extracted.json", "w") as f:
json.dump(extracted_data, f, indent=2)
print(f"\n💾 Extracted data saved to: batch_extracted.json")
return extracted_data
def load_urls(source: str) -> List[str]:
"""Load URLs from file or string"""
if Path(source).exists():
with open(source) as f:
urls = [line.strip() for line in f if line.strip() and not line.startswith("#")]
else:
# Treat as comma-separated URLs
urls = [url.strip() for url in source.split(",") if url.strip()]
return urls
async def main():
if len(sys.argv) < 2:
print("""
Crawl4AI Batch Crawler
Usage:
# Crawl URLs from file
python batch_crawler.py urls.txt [--max-concurrent 5]
# Crawl with extraction
python batch_crawler.py urls.txt --extract [schema.json]
# Crawl comma-separated URLs
python batch_crawler.py "https://example.com,https://example.org"
Options:
--max-concurrent N Max concurrent crawls (default: 5)
--extract [schema] Extract structured data using schema
Example urls.txt:
https://example.com
https://example.org
# Comments are ignored
https://another-site.com
""")
sys.exit(1)
source = sys.argv[1]
urls = load_urls(source)
if not urls:
print("❌ No URLs found")
sys.exit(1)
print(f"📋 Loaded {len(urls)} URLs")
# Parse options
max_concurrent = 5
extract_mode = False
schema_file = None
for i, arg in enumerate(sys.argv[2:], 2):
if arg == "--max-concurrent" and i + 1 < len(sys.argv):
max_concurrent = int(sys.argv[i + 1])
elif arg == "--extract":
extract_mode = True
if i + 1 < len(sys.argv) and not sys.argv[i + 1].startswith("--"):
schema_file = sys.argv[i + 1]
if extract_mode:
await crawl_with_extraction(urls, schema_file)
else:
await crawl_batch(urls, max_concurrent)
if __name__ == "__main__":
asyncio.run(main())
```